rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-27 18:10:37 +00:00

Author	SHA1	Message	Date
Christian Schwarz	1a81eeaceb	pagebench: fixup after is_rel_block_key changes in #6266 PR #6266 broke the getpage_latest_lsn benchmark. Before this patch, we'd fail with ``` not implemented: split up range ``` because `r.start = rel size key` and `r.end = rel size key + 1`.	2024-01-08 18:58:32 +00:00
Christian Schwarz	52e1d5605f	work_queue abstraction: typing fixes	2024-01-08 18:29:29 +00:00
Christian Schwarz	587493dbe1	many_tenants: logging	2024-01-08 17:25:19 +00:00
Christian Schwarz	d9586e27fd	Merge remote-tracking branch 'origin/main' into problame/benchmarking/pr/python-perftest	2024-01-08 14:24:32 +00:00
Conrad Ludgate	8a646cb750	proxy: add request context for observability and blocking (#6160 ) ## Summary of changes ### RequestMonitoring We want to add an event stream with information on each request for easier analysis than what we can do with diagnostic logs alone (https://github.com/neondatabase/cloud/issues/8807). This RequestMonitoring will keep a record of the final state of a request. On drop it will be pushed into a queue to be uploaded. Because this context is a bag of data, I don't want this information to impact logic of request handling. I personally think that weakly typed data (such as all these options) makes for spaghetti code. I will however allow for this data to impact rate-limiting and blocking of requests, as this does not _really_ change how a request is handled. ### Parquet Each `RequestMonitoring` is flushed into a channel where it is converted into `RequestData`, which is accumulated into parquet files. Each file will have a certain number of rows per row group, and several row groups will eventually fill up the file, which we then upload to S3. We will also upload smaller files if they take too long to construct.	2024-01-08 11:42:43 +00:00
Arpad Müller	a4ac8e26e8	Update Rust to 1.75.0 (#6285 ) [Release notes](https://github.com/rust-lang/rust/releases/tag/1.75.0).	2024-01-08 11:46:16 +01:00
John Spray	b3a681d121	s3_scrubber: updates for sharding (#6281 ) This is a lightweight change to keep the scrubber providing sensible output when using sharding. - The timeline count was wrong when using sharding - When checking for tenant existence, we didn't re-use results between different shards in the same tenant Closes: https://github.com/neondatabase/neon/issues/5929	2024-01-08 09:19:10 +00:00
John Spray	b5ed6f22ae	pageserver: clean up a TODO comment (#6282 ) These functions don't need updating for sharding: it's fine for them to remain shard-naive, as they're only used in the context of dumping a layer file. The sharding metadata doesn't live in the layer file, it lives in the index.	2024-01-08 09:19:00 +00:00
John Spray	d1c0232e21	pageserver: use `pub(crate)` in metrics.rs, and clean up unused items (#6275 ) ## Problem Noticed while making other changes that there were `pub` items that were unused. ## Summary of changes - Make everything `pub(crate)` in metrics.rs, apart from items used from `bin/` - Fix the timelines eviction metric: it was never being incremented - Remove an unused ephemeral_bytes counter.	2024-01-08 03:53:15 +00:00
Arseny Sher	a41c4122e3	Don't suspend compute if there is active LR subscriber. https://github.com/neondatabase/neon/issues/6258	2024-01-06 01:24:44 +04:00
Christian Schwarz	fbcb1268bf	extract work queue and use it to drive broken attach in parallel	2024-01-05 19:10:41 +00:00
Christian Schwarz	392e014a7f	reusable abstraction for many tenants fixture	2024-01-05 18:08:59 +00:00
Christian Schwarz	dd69927953	do the on-demand downloads in Python, it's faster; plus some cleanups and renamings	2024-01-05 17:37:59 +00:00
Alexander Bayandin	7de829e475	test_runner: replace black with ruff format (#6268 ) ## Problem `black` is slow sometimes, we can replace it with `ruff format` (a new feature in 0.1.2 [0]), which produces pretty similar to black style [1]. On my local machine (MacBook M1 Pro 16GB): ``` # `black` on main $ hyperfine "BLACK_CACHE_DIR=/dev/null poetry run black ." Benchmark 1: BLACK_CACHE_DIR=/dev/null poetry run black . Time (mean ± σ): 3.131 s ± 0.090 s [User: 5.194 s, System: 0.859 s] Range (min … max): 3.047 s … 3.354 s 10 runs ``` ``` # `ruff format` on the current PR $ hyperfine "RUFF_NO_CACHE=true poetry run ruff format" Benchmark 1: RUFF_NO_CACHE=true poetry run ruff format Time (mean ± σ): 300.7 ms ± 50.2 ms [User: 259.5 ms, System: 76.1 ms] Range (min … max): 267.5 ms … 420.2 ms 10 runs ``` ## Summary of changes - Replace `black` with `ruff format` everywhere - [0] https://docs.astral.sh/ruff/formatter/ - [1] https://docs.astral.sh/ruff/formatter/#black-compatibility	2024-01-05 15:35:07 +00:00
Christian Schwarz	838a6d304d	test_snapshot_dir fixture with marker file for finished snapshot	2024-01-05 13:41:12 +00:00
Christian Schwarz	72da46dd5a	improve overlayfs cleanup code	2024-01-05 13:17:28 +00:00
John Spray	3c560d27a8	pageserver: implement secondary-mode downloads (#6123 ) Follows on from #6050 , in which we upload heatmaps. Secondary locations will now poll those heatmaps and download layers mentioned in the heatmap. TODO: - [X] ~Unify/reconcile stats for behind-schedule execution with warn_when_period_overrun (https://github.com/neondatabase/neon/pull/6050#discussion_r1426560695)~ - [x] Give downloads their own concurrency config independent of uploads Deferred optimizations: - https://github.com/neondatabase/neon/issues/6199 - https://github.com/neondatabase/neon/issues/6200 Eviction will be the next PR: - #5342	2024-01-05 12:29:20 +00:00
Christian Schwarz	a748d67915	add support to use overlayfs in from_root_dir	2024-01-05 12:15:32 +00:00
Christian Schwarz	857eabc812	include downloaded layers in snapshot	2024-01-05 10:57:14 +00:00
Christian Schwarz	d260426a14	is_rel_block_key: exclude the relsize key (#6266 ) Before this PR, `is_rel_block_key` returns true for the blknum `0xffffffff`, which is a blknum that's actually never written by Postgres, but used by Neon Pageserver to store the relsize. Quoting @MMeent: > PostgreSQL can't extend the relation beyond size of 0xFFFFFFFF blocks, > so block number 0xFFFFFFFE is the last valid block number. This PR changes the definition of the function to exclude blknum 0xffffffff. My motivation for doing this change is to fix the `pagebench` getpage benchmark, which uses `is_rel_block_key` to filter the keyspace for valid pages to request from page_service. fixes https://github.com/neondatabase/neon/issues/6210 I checked other users of the function. The first one is `key_is_shard0`, which already had added an exemption for 0xffffffff. So, there's no functional change with this PR. The second one is `DatadirModification::flush`[^1]. With this PR, `.flush()` will skip the relsize key, whereas it didn't before. This means we will pile up all the relsize key-value pairs `(Key,u32)` in `DatadirModification::pending_updates` until `.commit()` is called. The only place I can think of where that would be a problem is if we import from a full basebackup, and don't `.commit()` regularly, like we currently don't do in `import_basebackup_from_tar`. It exposes us to input-controlled allocations. However, that was already the case for the other keys that are skipped, so, one can argue that this change is not making the situation much worse. [^1]: That type's `flush()` and `commit()` methods are terribly named, but, that's for another time	2024-01-05 11:48:06 +01:00
Arthur Petukhovsky	f3b5db1443	Add API for safekeeper timeline copy (#6091 ) Implement API for cloning a single timeline inside a safekeeper. Also add API for calculating a sha256 hash of WAL, which is used in tests. `/copy` API works by copying objects inside S3 for all but the last segments, and the last segments are copied on-disk. A special temporary directory is created for a timeline, because copy can take a lot of time, especially for large timelines. After all file segments have been prepared, this directory is mounted to the main tree and the timeline is loaded into memory. Some caveats: - large timelines can take a lot of time to copy, because we need to copy many S3 segments - the caller should wait indefinitely for the HTTP call to finish and not close the HTTP connection, because closing it will stop the process, which is not continued in the background - `until_lsn` must be a valid LSN, otherwise bad things can happen - API will return 200 if the specified `timeline_id` already exists, even if it's not a copy - each safekeeper will try to copy S3 segments, so it's better not to call this API in parallel on different safekeepers	2024-01-04 17:40:38 +00:00
John Spray	18e9208158	pageserver: improved error handling for shard routing error, timeline not found (#6262 ) ## Problem - When a client requests a key that isn't found in any shard on the node (edge case that only happens if a compute's config is out of date), we should prompt them to reconnect (as this includes a backoff), since they will not be able to complete the request until they eventually get a correct pageserver connection string. - QueryError::Other is used excessively: this contains a type-ambiguous anyhow::Error and is logged very verbosely (including backtrace). ## Summary of changes - Introduce PageStreamError to replace use of anyhow::Error in request handlers for getpage, etc. - Introduce Reconnect and NotFound variants to QueryError - Map the "shard routing error" case to PageStreamError::Reconnect -> QueryError::Reconnect - Update type conversions for LSN timeouts and tenant/timeline not found errors to use PageStreamError::NotFound->QueryError::NotFound	2024-01-04 10:40:03 +00:00
Sasha Krassovsky	7662df6ca0	Fix minimum backoff to 1ms	2024-01-03 21:09:19 -08:00
Alexander Bayandin	be21ab135d	Revert "test_pageserver: fix unexpected message: CopyFail during COPY by turning off safekeepers" This reverts commit `e91073df75`.	2024-01-03 19:13:02 +00:00
Alexander Bayandin	c54b262bbe	test_pageserver: start all components from snapshot	2024-01-03 19:12:38 +00:00
Alexander Bayandin	e91073df75	test_pageserver: fix unexpected message: CopyFail during COPY by turning off safekeepers	2024-01-03 17:06:21 +00:00
Alexander Bayandin	7322ccf3f7	test_pageserver: move attachment to different section	2024-01-03 17:04:47 +00:00
Alexander Bayandin	004aff5314	test_pageserver: report duration	2024-01-03 16:51:45 +00:00
John Spray	c119af8ddd	pageserver: run at least 2 background task threads Otherwise an assertion in CONCURRENT_BACKGROUND_TASKS will trip if you try to run the pageserver on a single core.	2024-01-03 14:22:40 +00:00
John Spray	a2e083ebe0	pageserver: make walredo shard-aware This does not have a functional impact, but enables all the logging in this code to include the shard_id label.	2024-01-03 14:22:40 +00:00
John Spray	73a944205b	pageserver: log details on shard routing error	2024-01-03 14:22:40 +00:00
John Spray	34ebfbdd6f	pageserver: fix handling getpage with multiple shards on one node Previously, we would wait for the LSN to be visible on whichever timeline we happened to load at the start of the connection, then proceed to look up the correct timeline for the key and do the read. If the timeline holding the key was behind the timeline we used for the LSN wait, then we might serve an apparently-successful read result that actually contains data from behind the requested lsn.	2024-01-03 14:22:40 +00:00
John Spray	ef7c9c2ccc	pageserver: fix active tenant lookup hitting secondaries with sharding If there is some secondary shard for a tenant on the same node as an attached shard, the secondary shard could trip up this code and cause page_service to incorrectly get an error instead of finding the attached shard.	2024-01-03 14:22:40 +00:00
John Spray	6c79e12630	pageserver: drop unwanted keys during compaction after split	2024-01-03 14:22:40 +00:00
John Spray	753d97bd77	pageserver: don't delete ancestor shard layers	2024-01-03 14:22:40 +00:00
John Spray	edc962f1d7	test_runner: test_issue_5878 log allow list (#6259 ) ## Problem https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6254/7388706419/index.html#suites/5a4b8734277a9878cb429b80c314f470/e54c4f6f6ed22672 ## Summary of changes Permit the log message: because the test helper's detach function increments the generation number, a detach/attach cycle can cause the error if the test runner node is slow enough for the opportunistic deletion queue flush on detach not to complete by the time we call attach.	2024-01-03 14:22:17 +00:00
Arseny Sher	65b4e6e7d6	Remove empty safekeeper init since truncateLsn. It has caveats such as creating half empty segment which can't be offloaded. Instead we'll pursue approach of pull_timeline, seeding new state from some peer.	2024-01-03 18:20:19 +04:00
Alexander Bayandin	549f607a13	Merge remote-tracking branch 'origin/main' into problame/benchmarking/pr/python-perftest	2024-01-03 13:30:49 +00:00
Alexander Bayandin	17b256679b	vm-image-spec: build pgbouncer from Neon's fork (#6249 ) ## Problem We need to add one more patch to pgbouncer (for https://github.com/neondatabase/neon/issues/5801). I've decided to cherry-pick all required patches to a pgbouncer fork (`neondatabase/pgbouncer`) and use it instead. See https://github.com/neondatabase/pgbouncer/releases/tag/pgbouncer_1_21_0-neon-1 ## Summary of changes - Revert the previous patch (for deallocate/discard all) — the fork already contains it. - Remove `libssl-dev` dependency — we build pgbouncer without `openssl` support. - Clone git tag and build pgbouncer from source code.	2024-01-03 13:02:04 +00:00
John Spray	673a865055	tests: tolerate 304 when evicting layers (#6261 ) In tests that evict layers, explicit eviction can race with automatic eviction of the same layer and result in a 304	2024-01-03 11:50:58 +00:00
Cuong Nguyen	fb518aea0d	Add batch ingestion mechanism to avoid high contention (#5886 ) ## Problem For context, this problem was observed in a research project where we try to make neon run in multiple regions and I was asked by @hlinnaka to make this PR. In our project, we use the pageserver in a non-conventional way such that we would send a larger number of requests to the pageserver than normal (imagine postgres without the buffer pool). I measured the time from the moment a WAL record left the safekeeper to when it reached the pageserver ([code](`e593db1f5a/pageserver/src/tenant/timeline/walreceiver/walreceiver_connection.rs (L282-L287)`)) and observed that when the number of get_page_at_lsn requests was high, the wal receiving time increased significantly (see the left side of the graphs below). Upon further investigation, I found that the delay was caused by this line `d2ca410919/pageserver/src/tenant/timeline.rs (L2348)` The `get_layer_for_write` method is called for every value during WAL ingestion and it tries to acquire layers write lock every time, thus this results in high contention when read lock is acquired more frequently. ![Untitled](https://github.com/neondatabase/neon/assets/6244849/85460f4d-ead1-4532-bc64-736d0bfd7f16) ![Untitled2](https://github.com/neondatabase/neon/assets/6244849/84199ab7-5f0e-413b-a42b-f728f2225218) ## Summary of changes It is unnecessary to call `get_layer_for_write` repeatedly for all values in a WAL message since they would end up in the same memory layer anyway, so I created the batched versions of `InMemoryLayer::put_value`, `InMemoryLayer ::put_tombstone`, `Timeline::put_value`, and `Timeline::put_tombstone`, that acquire the locks once for a batch of values. Additionally, `DatadirModification` is changed to store multiple versions of uncommitted values, and `WalIngest::ingest_record()` can now ingest records without immediately committing them. 
With these new APIs, the ingestion loop can commit once every `ingest_batch_size` records. The `ingest_batch_size` variable is exposed as a config option. If it is set to 1, we get the same behavior as before this change. I found that setting this value to 100 seems to work best, and you can see its effect on the right side of the above graphs. --------- Co-authored-by: John Spray <john@neon.tech>	2024-01-03 10:41:58 +00:00
John Spray	42f41afcbd	tests: update pytest and boto3 dependencies (#6253 ) ## Problem The version of pytest we were using emits a number of DeprecationWarnings on latest python: these are fixed in latest release. boto3 and python-dateutil also have deprecation warnings, but unfortunately these aren't fixed upstream yet. ## Summary of changes - Update pytest - Update boto3 (this doesn't fix deprecation warnings, but by the time I figured that out I had already done the update, and it's good hygiene anyway)	2024-01-03 10:36:53 +00:00
Arseny Sher	f71110383c	Remove second check for max_slot_wal_keep_size download size. Already checked in GetLogRepRestartLSN, a rebase artifact.	2024-01-03 13:13:32 +04:00
Arseny Sher	ae3eaf9995	Add [WP] prefix to all walproposer logging. - rename walpop_log to wp_log - create also wpg_log which is used in postgres-specific code - in passing format messages to start with lower case	2024-01-03 11:10:27 +04:00
Christian Schwarz	aa9f1d4b69	pagebench get-page: default to latest=true, make configurable via flag (#6252 ) fixes https://github.com/neondatabase/neon/issues/6209	2024-01-02 16:57:29 +00:00
Joonas Koivunen	946c6a0006	scrubber: use adaptive config with retries, check subset of tenants (#6219 ) The tool still needs a lot of work. These are the easiest fix and feature: - use a similar adaptive config with s3 as remote_storage, use retries - process only particular tenants Tenants need to be from the correct region, they are not deduplicated, but the feature is useful for re-checking a small number of tenants after a large run.	2024-01-02 15:22:16 +00:00
Sasha Krassovsky	ce13281d54	MIN not MAX	2024-01-02 06:28:49 -08:00
Sasha Krassovsky	4e1d16f311	Switch to exponential rate-limiting	2024-01-02 06:28:49 -08:00
Sasha Krassovsky	091a0cda9d	Switch to rate-limiting strategy	2024-01-02 06:28:49 -08:00
Sasha Krassovsky	ea9fad419e	Add exponential backoff to page_server->send	2024-01-02 06:28:49 -08:00
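
The batching idea described in commit fb518aea0d (#5886) can be sketched roughly as follows. This is a minimal illustration, not the pageserver's code: `Layer`, `Timeline`, `put_batch`, `stats`, and `ingest` are hypothetical stand-ins, and only the `ingest_batch_size` name comes from the commit message. It shows the write lock being taken once per batch instead of once per value.

```rust
use std::sync::RwLock;

/// Hypothetical stand-in for an in-memory layer: a list of (key, value) pairs.
#[derive(Default)]
pub struct Layer {
    pub values: Vec<(u64, Vec<u8>)>,
    /// Counts write-lock acquisitions, purely for illustration.
    pub lock_acquisitions: usize,
}

/// Hypothetical stand-in for a timeline that owns one writable layer.
pub struct Timeline {
    layer: RwLock<Layer>,
}

impl Timeline {
    pub fn new() -> Self {
        Timeline { layer: RwLock::new(Layer::default()) }
    }

    /// Unbatched path: the write lock is taken once per value.
    pub fn put_value(&self, key: u64, value: Vec<u8>) {
        let mut guard = self.layer.write().unwrap();
        guard.lock_acquisitions += 1;
        guard.values.push((key, value));
    }

    /// Batched path: the write lock is taken once for the whole batch.
    pub fn put_batch(&self, batch: Vec<(u64, Vec<u8>)>) {
        if batch.is_empty() {
            return;
        }
        let mut guard = self.layer.write().unwrap();
        guard.lock_acquisitions += 1;
        guard.values.extend(batch);
    }

    /// Returns (number of stored values, number of write-lock acquisitions).
    pub fn stats(&self) -> (usize, usize) {
        let guard = self.layer.read().unwrap();
        (guard.values.len(), guard.lock_acquisitions)
    }
}

/// Ingests records, committing once every `ingest_batch_size` records.
pub fn ingest(timeline: &Timeline, records: Vec<(u64, Vec<u8>)>, ingest_batch_size: usize) {
    assert!(ingest_batch_size > 0, "batch size must be at least 1");
    let mut pending = Vec::with_capacity(ingest_batch_size);
    for record in records {
        pending.push(record);
        if pending.len() >= ingest_batch_size {
            // Hand the full batch over and start a fresh, empty one.
            timeline.put_batch(std::mem::take(&mut pending));
        }
    }
    // Commit whatever is left over (a no-op if nothing is pending).
    timeline.put_batch(pending);
}
```

With a batch size of 1 this degenerates to one lock acquisition per record, which matches the commit's remark that a value of 1 gives the pre-change behavior.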

4343 Commits
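
The change described in commit d260426a14 (#6266) can be illustrated with a small sketch. Only the function name `is_rel_block_key` and the reserved block number `0xffffffff` come from the commit message. `RelKey`, `RELSIZE_BLKNUM`, and `block_keys` are hypothetical, and the real key type and predicate presumably look at more fields than shown here.

```rust
/// Block number that Postgres never writes; per the commit message, the
/// pageserver uses it to store the relation size.
pub const RELSIZE_BLKNUM: u32 = 0xffff_ffff;

/// Hypothetical, simplified key: only the fields needed for this illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RelKey {
    pub relnode: u32,
    pub blknum: u32,
}

/// Returns true only for keys that address an actual relation block,
/// i.e. it excludes the reserved relsize block number.
pub fn is_rel_block_key(key: &RelKey) -> bool {
    key.blknum != RELSIZE_BLKNUM
}

/// Filters a keyspace down to the keys a getpage benchmark could request.
pub fn block_keys(keys: &[RelKey]) -> Vec<RelKey> {
    keys.iter().copied().filter(is_rel_block_key).collect()
}
```

Under these assumptions, block number `0xfffffffe` still counts as a valid block, while `0xffffffff` is filtered out, which is the distinction the benchmark fix relies on.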