rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2025-12-28 00:23:00 +00:00

Author	SHA1	Message	Date
Joonas Koivunen	b52e31c1a4	fix: allow layer flushes more often (#7927 ) As seen with the pgvector 0.7.0 index builds, we can receive large batches of images, leading to very large L0 layers in the range of 1GB. These large layers are produced because we are only able to roll the layer after we have witnessed two different Lsns in a single `DataDirModification::commit`. As the single Lsn batches of images can span over multiple `DataDirModification` lifespans, we will rarely get to write two different Lsns in a single `put_batch` currently. The solution is to remember the TimelineWriterState instead of eagerly forgetting it until we really open the next layer or someone else flushes (while holding the write_guard). Additional changes are test fixes to avoid "initdb image layer optimization" or ignoring initdb layers for assertion. Cc: #7197 because small `checkpoint_distance` will now trigger the "initdb image layer optimization"	2024-06-10 13:50:17 +00:00
Arseny Sher	b2d34a82b9	Make python Safekeeper datadir Path instead of str.	2024-05-25 06:06:32 +03:00
Arseny Sher	3797566c36	safekeeper: test pull_timeline with WAL gc. Do pull_timeline while WAL is being removed. To this end - extract pausable_failpoint to utils, sprinkle pull_timeline with it - add 'checkpoint' sk http endpoint to force WAL removal. After fixing checking for pull file status code test fails so far which is expected.	2024-05-25 06:06:32 +03:00
Joonas Koivunen	49d7f9b5a4	test_import_from_pageserver_small: try to make less flaky (#7843 ) With #7828 and proper fullbackup testing the test became flaky ([evidence]). - produce better assertion messages in `assert_pageserver_backups_equal` - use read only endpoint to confirm the row count [evidence]: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7839/9192447962/index.html#suites/89cfa994d71769e01e3fc4f475a1f3fa/49009214d0f8b8ce	2024-05-23 14:44:08 +03:00
Alex Chi Z	ddd8ebd253	chore(pageserver): use kebab case for aux file flag (#7840 ) part of https://github.com/neondatabase/neon/issues/7462 --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-22 17:06:00 +00:00
Joonas Koivunen	df9ab1b5e3	refactor(test): duplication with fullbackup, tar content hashing (#7828 ) "taking a fullbackup" is an ugly multi-liner copypasted in multiple places, most recently with timeline ancestor detach tests. move it under `PgBin` which is not a great place, but better than yet another utility function. Additionally: - cleanup `psql_env` repetition (PgBin already configures that) - move the backup tar comparison as a yet another free utility function - use backup tar comparison in `test_import.py` where a size check was done previously - cleanup extra timeline creation from test Cc: #7715	2024-05-22 15:43:21 +03:00
Joonas Koivunen	d9dcbffac3	python: allow using allowed_errors.py (#7719 ) See #7718. Fix it by renaming all `types.py` to `common_types.py`. Additionally, add an advert for using `allowed_errors.py` to test any added regex.	2024-05-13 15:16:23 +03:00
Alex Chi Z	017c34b773	feat(pageserver): generate basebackup from aux file v2 storage (#7517 ) This pull request adds the new basebackup read path + aux file write path. In the regression test, all logical replication tests are run with matrix aux_file_v2=false/true. Also fixed the vectored get code path to correctly return missing key error when being called from the unified sequential get code path. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-07 16:30:18 +00:00
John Spray	b80704cd34	tests: log hygiene checks for storage controller (#6710 ) ## Problem As with the pageserver, we should fail tests that emit unexpected log errors/warnings. ## Summary of changes - Refactor existing log checks to be reusable - Run log checks for attachment_service - Add allow lists as needed.	2024-03-19 10:30:33 +00:00
Christian Schwarz	e9e77ee744	tests: add optional cursor to `log_contains` + fix truthiness issues in callers (#6960 ) Extracted from https://github.com/neondatabase/neon/pull/6953 Part of https://github.com/neondatabase/neon/issues/5899 Core Change ----------- In #6953, we need the ability to scan the log _after_ a specific line and ignore anything before that line. This PR changes `log_contains` to returns a tuple of `(matching line, cursor)`. Hand that cursor to a subsequent `log_contains` call to search the log for the next occurrence of the pattern. Other Changes ------------- - Inspect all the callsites of `log_contains` to handle the new tuple return type. - Above inspection unveiled many callers aren't using `assert log_contains(...) is not None` but some weaker version of the code that breaks if `log_contains` ever returns a not-None but falsy value. Fix that. - Above changes unveiled that `test_remote_storage_upload_queue_retries` was using `wait_until` incorrectly; after fixing the usage, I had to raise the `wait_until` timeout. So, maybe this will fix its flakiness.	2024-03-01 10:45:39 +01:00
Christian Schwarz	996abc9563	pagebench-based GetPage@LSN performance test (#6214 )	2024-01-24 12:51:53 +01:00
Joonas Koivunen	b451e75dc6	test: include cmdline in captured output (#5977 ) aiming for faster to understand a bunch of `.stdout` and `.stderr` files, see example echo_1.stdout differences: ``` +# echo foobar abbacd + foobar abbacd ``` it can be disabled and is disabled in this PR for some tests; use `pg_bin.run_capture(..., with_command_header=False)` for that. as a bonus this cleans up the echoed newlines from s3_scrubber output which are also saved to file but echoed to test log. Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2023-11-30 17:31:03 +00:00
Joonas Koivunen	eba3bfc57e	test: python needs thread safety as well (#5992 ) we have test cases which launch processes from threads, and they capture output assuming this counter is thread-safe. at least according to my understanding this operation in python requires a lock to be thread-safe.	2023-11-30 15:48:40 +00:00
Christian Schwarz	a0e61145c8	fix: cleanup of layers from the future can race with their re-creation (#5890 ) fixes https://github.com/neondatabase/neon/issues/5878 obsoletes https://github.com/neondatabase/neon/issues/5879 Before this PR, it could happen that `load_layer_map` schedules removal of the future image layer. Then a later compaction run could re-create the same image layer, scheduling a PUT. Due to lack of an upload queue barrier, the PUT and DELETE could be re-ordered. The result was IndexPart referencing a non-existent object. ## Summary of changes * Add support to `pagectl` / Python tests to decode `IndexPart` * Rust * new `pagectl` Subcommand * `IndexPart::{from,to}_s3_bytes()` methods to internalize knowledge about encoding of `IndexPart` * Python * new `NeonCli` subclass * Add regression test * Rust * Ability to force repartitioning; required to ensure image layer creation at last_record_lsn * Python * The regression test. * Fix the issue * Insert an `UploadOp::Barrier` after scheduling the deletions.	2023-11-23 13:33:41 +00:00
John Spray	306c4f9967	s3_scrubber: prepare for scrubbing buckets with generation-aware content (#5700 ) ## Problem The scrubber didn't know how to find the latest index_part when generations were in use. ## Summary of changes - Teach the scrubber to do the same dance that pageserver does when finding the latest index_part.json - Teach the scrubber how to understand layer files with generation suffixes. - General improvement to testability: scan_metadata has a machine readable output that the testing `S3Scrubber` wrapper can read. - Existing test coverage of scrubber was false-passing because it just didn't see any data due to prefixing of data in the bucket. Fix that. This is incremental improvement: the more confidence we can have in the scrubber, the more we can use it in integration tests to validate the state of remote storage. --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2023-11-03 17:36:02 +00:00
Alexander Bayandin	7a2cafb34d	Use zstd to compress large allure artifacts (#5458 ) ## Problem - Because we compress artifacts file by file, we don't need to put them into `tar` containers (ie instead of `tar.gz` we can use just `gz`). - Pythons gz single-threaded and pretty slow. A benchmark has shown ~20 times speedup (19.876176291 vs 0.8748335830000009) on my laptop (for a pageserver.log size is 1.3M) ## Summary of changes - Replace tarfile with zstandart - Update allure to 2.24.0	2023-10-04 16:20:16 +01:00
Joonas Koivunen	db8ff9d64b	testing: record walredo failures to test reports (#5451 ) We have rare walredo failures with pg16. Let's introduce recording of failing walredo input in `#[cfg(feature = "testing")]`. There is additional logging (the value reconstruction path logging usually shown with not found keys), keeping it for `#[cfg(features = "testing")]`. Cc: #5404.	2023-10-04 11:24:30 +03:00
John Spray	743933176e	scrubber: add `scan-metadata` and hook into integration tests (#5176 ) ## Problem - Scrubber's `tidy` command requires presence of a control plane - Scrubber has no tests at all ## Summary of changes - Add re-usable async streams for reading metadata from a bucket - Add a `scan-metadata` command that reads from those streams and calls existing `checks.rs` code to validate metadata, then returns a summary struct for the bucket. Command returns nonzero status if errors are found. - Add an `enable_scrub_on_exit()` function to NeonEnvBuilder so that tests using remote storage can request to have the scrubber run after they finish - Enable remote storarge and scrub_on_exit in test_pageserver_restart and test_pageserver_chaos This is a "toe in the water" of the overall space of validating the scrubber. Later, we should: - Enable scrubbing at end of tests using remote storage by default - Make the success condition stricter than "no errors": tests should declare what tenants+timelines they expect to see in the bucket (or sniff these from the functions tests use to create them) and we should require that the scrubber reports on these particular tenants/timelines. The `tidy` command is untouched in this PR, but it should be refactored later to use similar async streaming interface instead of the current batch-reading approach (the streams are faster with large buckets), and to also be covered by some tests. --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech> Co-authored-by: Alexander Bayandin <alexander@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech> Co-authored-by: Conrad Ludgate <conrad@neon.tech>	2023-09-06 11:55:24 +01:00
Dmitry Rodionov	d39fd66773	tests: remove redundant wait_while (#4952 ) Remove redundant `wait_while` in tests. It had only one usage. Use `wait_tenant_status404`. Related: https://github.com/neondatabase/neon/pull/4855#discussion_r1289610641	2023-08-11 10:18:13 +03:00
Dmitry Rodionov	c58b22bacb	Delete tenant's data from s3 (#4855 ) ## Summary of changes For context see https://github.com/neondatabase/neon/blob/main/docs/rfcs/022-pageserver-delete-from-s3.md Create Flow to delete tenant's data from pageserver. The approach heavily mimics previously implemented timeline deletion implemented mostly in https://github.com/neondatabase/neon/pull/4384 and followed up in https://github.com/neondatabase/neon/pull/4552 For remaining deletion related issues consult with deletion project here: https://github.com/orgs/neondatabase/projects/33 resolves #4250 resolves https://github.com/neondatabase/neon/issues/3889 --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-08-10 18:53:16 +03:00
Joonas Koivunen	762a8a7bb5	python: more linting (#4734 ) Ruff has "B" class of lints, including B018 which will nag on useless expressions, related to #4719. Enable such lints and fix the existing issues. Most notably: - https://beta.ruff.rs/docs/rules/mutable-argument-default/ - https://beta.ruff.rs/docs/rules/assert-false/ --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2023-07-18 12:56:40 +03:00
Alexander Bayandin	0c82ff3d98	test_runner: add Timeline Inspector to Grafana links (#4021 )	2023-04-14 11:46:47 +01:00
Dmitry Rodionov	bfeb428d1b	tests: make neon_fixtures a bit thinner by splitting out some pageserver related helpers (#3977 ) neon_fixture is quite big and messy, lets clean it up a bit.	2023-04-07 13:47:28 +03:00
Alexander Bayandin	887cee64e2	test_runner: add links to grafana for remote tests (#3961 ) Add Grafana links to allure reports to make it easier to debug perf test failures	2023-04-06 13:52:41 +01:00
Alexander Bayandin	105b8bb9d3	test_runner: automatically rerun flaky tests (#3880 ) This PR adds a plugin that automatically reruns (up to 3 times) flaky tests. Internally, it uses data from `TEST_RESULT_CONNSTR` database and `pytest-rerunfailures` plugin. As the first approximation we consider the test flaky if it has failed on the main branch in the last 10 days. Flaky tests are fetched by `scripts/flaky_tests.py` script (it's possible to use it in a standalone mode to learn which tests are flaky), stored to a JSON file, and then the file is passed to the pytest plugin.	2023-04-04 12:21:54 +01:00
Alexander Bayandin	3d869cbcde	Replace flake8 and isort with ruff (#3810 ) - Introduce ruff (https://beta.ruff.rs/) to replace flake8 and isort - Update mypy and black	2023-03-14 13:25:44 +00:00
Heikki Linnakangas	e9583db73b	Remove code and test to generate flamegraph on GetPage requests. (#3257 ) It was nice to have and useful at the time, but unfortunately the method used to gather the profiling data doesn't play nicely with 'async'. PR #3228 will turn 'get_page_at_lsn' function async, which will break the profiling support. Let's remove it, and re-introduce some kind of profiling later, using some different method, if we feel like we need it again.	2023-01-03 20:11:32 +02:00
Arseny Sher	32662ff1c4	Replace etcd with storage_broker. This is the replacement itself, the binary landed earlier. See docs/storage_broker.md. ref https://github.com/neondatabase/neon/pull/2466 https://github.com/neondatabase/neon/issues/2394	2022-12-12 13:30:16 +03:00
Vadim Kharitonov	f720dd735e	Stricter mypy linters for `test_runner/fixtures/*`	2022-11-10 12:47:27 +01:00
Alexander Bayandin	c4f9f1dc6d	Add data format forward compatibility tests (#2766 ) Add `test_forward_compatibility`, which checks if it's going to be possible to roll back a release to the previous version. The test uses artifacts (Neon & Postgres binaries) from the previous release to start Neon on the repo created by the current version. It performs exactly the same checks as `test_backward_compatibility` does. Single `ALLOW_BREAKING_CHANGES` env var got replaced by `ALLOW_BACKWARD_COMPATIBILITY_BREAKAGE` & `ALLOW_FORWARD_COMPATIBILITY_BREAKAGE` and can be set by `backward compatibility breakage` and `forward compatibility breakage` labels respectively.	2022-11-10 09:06:34 +00:00
Alexander Bayandin	c1a76eb0e5	test_runner: replace global variables with fixtures (#2754 ) This PR replaces the following global variables in the test framework with fixtures to make tests more configurable. I mainly need this for the forward compatibility tests (draft in https://github.com/neondatabase/neon/pull/2766). ``` base_dir neon_binpath pg_distrib_dir top_output_dir default_pg_version (this one got replaced with a fixture named pg_version) ``` Also, this PR adds more `Path` type where the code implies it.	2022-11-07 18:39:51 +00:00
Kirill Bulatov	d42700280f	Remove daemonize from storage components (#2677 ) Move daemonization logic into `control_plane`. Storage binaries now only crate a lockfile to avoid concurrent services running in the same directory.	2022-11-02 02:26:37 +02:00
Kirill Bulatov	923f642549	Collect cargo build timings	2022-09-09 22:32:00 +03:00
Heikki Linnakangas	47bd307cb8	Add python types to represent LSNs, tenant IDs and timeline IDs. (#2351 ) For better ergonomics. I always found it weird that we used UUID to actually mean a tenant or timeline ID. It worked because it happened to have the same length, 16 bytes, but it was hacky.	2022-09-02 10:16:47 +03:00
Alexander Bayandin	d7c9cfe7bb	Create Allure report for perf tests (#2326 )	2022-08-31 16:15:26 +01:00
Alexander Bayandin	39a3bcac36	test_runner: fix flake8 warnings	2022-08-22 14:57:09 +01:00
Alexander Bayandin	4c2bb43775	Reformat all python files by black & isort	2022-08-22 14:57:09 +01:00
Alexander Bayandin	4cddb0f1a4	Set up a workflow to run pgbench against captest (#2077 )	2022-08-15 18:54:31 +01:00
Alexander Bayandin	da5f8486ce	test_runner/pg_clients: collect docker logs (#2259 )	2022-08-12 17:03:09 +01:00
Dmitry Rodionov	092a9b74d3	use only s3 in boto3-stubs and update mypy Newer version of mypy fixes buggy error when trying to update only boto3 stubs. However it brings new checks and starts to yell when we index into cusror.fetchone without checking for None first. So this introduces a wrapper to simplify quering for scalar values. I tried to use cursor_factory connection argument but without success. There can be a better way to do that, but this looks the simplest	2022-08-01 18:28:49 +03:00
Thang Pham	6a664629fa	Add timeline physical size tracking (#2126 ) Ref #1902. - Track the layered timeline's `physical_size` using `pageserver_current_physical_size` metric when updating the layer map. - Report the local timeline's `physical_size` in timeline GET APIs. - Add `include-non-incremental-physical-size` URL flag to also report the local timeline's `physical_size_non_incremental` (similar to `logical_size_non_incremental`) - Add a `UIntGaugeVec` and `UIntGauge` to represent `u64` prometheus metrics Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>	2022-07-27 12:36:46 -04:00
Arseny Sher	eeff56aeb7	Make get_dir_size robust to concurrent deletions. ref #2055	2022-07-18 15:13:10 +03:00
Dmitry Rodionov	65704708fa	remove unused imports, make more use of pathlib.Path	2022-07-01 18:56:51 +03:00
Anastasia Lubennikova	d11c9f9fcb	Use random ports for the proxy and local pg in tests Fixes #1931 Author: Dmitry Ivanov	2022-06-15 20:21:58 +03:00
Kirill Bulatov	a884f4cf6b	Add etcd to neon_local	2022-05-17 01:17:44 +03:00
Anastasia Lubennikova	aa7c601eca	Fix pitr_interval check in GC: Use timestamp->LSN mapping instead of file modification time. Fix 'latest_gc_cutoff_lsn' - set it to the minimum of pitr_cutoff and gc_cutoff. Add new test: test_pitr_gc	2022-05-13 18:32:14 +03:00
Arthur Petukhovsky	29539b0561	Set wal_keep_size to zero (#1507 ) wal_keep_size is already set to 0 in our cloud setup, but we don't use this value in tests. This commit fixes wal_keep_size in control_plane and adds tests for WAL recycling and lagging safekeepers.	2022-04-27 19:09:28 +03:00
Arseny Sher	ec3bc74165	Add safekeeper information exchange through etcd. Safekeers now publish to and pull from etcd per-timeline data. Immediate goal is WAL truncation, for which every safekeeper must know remote_consistent_lsn; the next would be callmemaybe replacement. Adds corresponding '--broker' argument to safekeeper and ability to run etcd in tests. Adds test checking remote_consistent_lsn is indeed communicated.	2022-03-29 18:16:49 +04:00
Heikki Linnakangas	07342f7519	Major storage format rewrite. This is a backwards-incompatible change. The new pageserver cannot read repositories created with an old pageserver binary, or vice versa. Simplify Repository to a value-store ------------------------------------ Move the responsibility of tracking relation metadata, like which relations exist and what are their sizes, from Repository to a new module, pgdatadir_mapping.rs. The interface to Repository is now a simple key-value PUT/GET operations. It's still not any old key-value store though. A Repository is still responsible from handling branching, and every GET operation comes with an LSN. Mapping from Postgres data directory to keys/values --------------------------------------------------- All the data is now stored in the key-value store. The 'pgdatadir_mapping.rs' module handles mapping from PostgreSQL objects like relation pages and SLRUs, to key-value pairs. The key to the Repository key-value store is a Key struct, which consists of a few integer fields. It's wide enough to store a full RelFileNode, fork and block number, and to distinguish those from metadata keys. 'pgdatadir_mapping.rs' is also responsible for maintaining a "partitioning" of the keyspace. Partitioning means splitting the keyspace so that each partition holds a roughly equal number of keys. The partitioning is used when new image layer files are created, so that each image layer file is roughly the same size. The partitioning is also responsible for reclaiming space used by deleted keys. The Repository implementation doesn't have any explicit support for deleting keys. Instead, the deleted keys are simply omitted from the partitioning, and when a new image layer is created, the omitted keys are not copied over to the new image layer. We might want to implement tombstone keys in the future, to reclaim space faster, but this will work for now. Changes to low-level layer file code ------------------------------------ The concept of a "segment" is gone. Each layer file can now store an arbitrary range of Keys. Checkpointing, compaction ------------------------- The background tasks are somewhat different now. Whenever checkpoint_distance is reached, the WAL receiver thread "freezes" the current in-memory layer, and creates a new one. This is a quick operation and doesn't perform any I/O yet. It then launches a background "layer flushing thread" to write the frozen layer to disk, as a new L0 delta layer. This mechanism takes care of durability. It replaces the checkpointing thread. Compaction is a new background operation that takes a bunch of L0 delta layers, and reshuffles the data in them. It runs in a separate compaction thread. Deployment ---------- This also contains changes to the ansible scripts that enable having multiple different pageservers running at the same time in the staging environment. We will use that to keep an old version of the pageserver running, for clusters created with the old version, at the same time with a new pageserver with the new binary. Author: Heikki Linnakangas Author: Konstantin Knizhnik <knizhnik@zenith.tech> Author: Andrey Taranik <andrey@zenith.tech> Reviewed-by: Matthias Van De Meent <matthias@zenith.tech> Reviewed-by: Bojan Serafimov <bojan@zenith.tech> Reviewed-by: Konstantin Knizhnik <knizhnik@zenith.tech> Reviewed-by: Anton Shyrabokau <antons@zenith.tech> Reviewed-by: Dhammika Pathirana <dham@zenith.tech> Reviewed-by: Kirill Bulatov <kirill@zenith.tech> Reviewed-by: Anastasia Lubennikova <anastasia@zenith.tech> Reviewed-by: Alexey Kondratov <alexey@zenith.tech>	2022-03-28 05:41:15 -05:00
Dmitry Rodionov	44111e3ba3	Prohibit branch creation at lsn that was already garbage collected. This introduces new timeline field latest_gc_cutoff. It is updated before each gc iteration. New check is added to branch_timelines to prevent branch creation with start point less than latest_gc_cutoff. Also this adds a check to get_page_at_lsn which asserts that lsn at which the page is requested was not garbage collected. This check currently is triggered for readonly nodes which are pinned to specific lsn and because they are not tracked in pageserver garbage collection can remove data that still might be referenced. This is a bug and will be fixed separately.	2021-11-15 20:03:16 +03:00

1 2

60 Commits