rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-19 06:00:38 +00:00

Author	SHA1	Message	Date
Christian Schwarz	79a4ee938b	Revert "revert recent VirtualFile asyncification changes (#5291 )" This reverts commit `ab1f37e908`. fixes #5479	2024-01-09 18:35:27 +00:00
Christian Schwarz	ebcdd758eb	test results `34e69cfc93` 18:04:59 - 18:31:15 => 05:01 + 21:15 = 26:16 duration That's ca 5min slower than what we saw without tokio-epoll-uring (scratches head)	2024-01-09 18:35:27 +00:00
Christian Schwarz	c61d27c18d	improve instructions	2024-01-09 18:35:27 +00:00
Christian Schwarz	3bb58f78f0	usage instructions for generator script	2024-01-09 18:35:27 +00:00
Christian Schwarz	c12af4ea58	many_tenants script now works	2024-01-09 18:35:27 +00:00
Christian Schwarz	2c4d2e9d7e	update many tenants script to use the new method for duplicating tenants (copy-paste from benchmarking WIP PR)	2024-01-09 18:35:27 +00:00
Christian Schwarz	2043221fca	Squashed commit of the following: commit `de90ba56d4` Author: Christian Schwarz <christian@neon.tech> Date: Mon Nov 27 14:47:26 2023 +0000 expose generation number in API commit `ae2c7589f9` Author: Christian Schwarz <christian@neon.tech> Date: Mon Nov 27 14:53:13 2023 +0000 pagectl: add subcommand to rewrite layer file history	2024-01-09 18:35:26 +00:00
Christian Schwarz	bf40845db4	measured BACKGROUND_RUNTIME performance using `wrk` Launch wrk from command line 3-4 seconds after the load starts. => blocking of executor threads is clearly visible, my branch performs _much_ better. baseline: commit `15b8618d25` (HEAD -> problame/loadtest-baseline, origin/problame/loadtest-baseline, main) neon-main (compaction semaphore disabled!) admin@ip-172-31-13-23:[~/neon]: wrk --latency http://localhost:2342 Running 10s test @ http://localhost:2342 2 threads and 10 connections Thread Stats Avg Stdev Max +/- Stdev Latency 71.42ms 15.97ms 125.18ms 70.82% Req/Sec 41.44 28.85 101.00 57.35% Latency Distribution 50% 72.53ms 75% 82.07ms 90% 91.44ms 99% 116.56ms 291 requests in 10.01s, 22.73KB read Socket errors: connect 0, read 0, write 0, timeout 10 Requests/sec: 29.07 Transfer/sec: 2.27KB this branch (compaction semaphore also disabled!): admin@ip-172-31-13-23:[~/neon]: wrk --latency http://localhost:2342 Running 10s test @ http://localhost:2342 2 threads and 10 connections Thread Stats Avg Stdev Max +/- Stdev Latency 45.74ms 64.13ms 293.44ms 83.27% Req/Sec 442.81 258.18 1.32k 69.79% Latency Distribution 50% 2.92ms 75% 75.52ms 90% 148.03ms 99% 248.50ms 8641 requests in 10.01s, 675.08KB read Requests/sec: 862.81 Transfer/sec: 67.41KB	2024-01-09 18:31:20 +00:00
Christian Schwarz	44f885f444	HACK: BACKGROUND_RUNTIME webserver to measure response time using `wrk`	2024-01-09 18:31:20 +00:00
Christian Schwarz	eb679d4b27	REPRO the problem: , uses 430GB of space; 4 seconds load time; constant 20kIOPS after ~20s	2024-01-09 18:31:20 +00:00
Christian Schwarz	66c52a629a	RFC: vectored `Timeline::get` (#6250 )	2024-01-08 15:00:01 +00:00
Conrad Ludgate	8a646cb750	proxy: add request context for observability and blocking (#6160 ) ## Summary of changes ### RequestMonitoring We want to add an event stream with information on each request for easier analysis than what we can do with diagnostic logs alone (https://github.com/neondatabase/cloud/issues/8807). This RequestMonitoring will keep a record of the final state of a request. On drop it will be pushed into a queue to be uploaded. Because this context is a bag of data, I don't want this information to impact logic of request handling. I personally think that weakly typed data (such as all these options) makes for spaghetti code. I will however allow for this data to impact rate-limiting and blocking of requests, as this does not _really_ change how a request is handled. ### Parquet Each `RequestMonitoring` is flushed into a channel where it is converted into `RequestData`, which is accumulated into parquet files. Each file will have a certain number of rows per row group, and several row groups will eventually fill up the file, which we then upload to S3. We will also upload smaller files if they take too long to construct.	2024-01-08 11:42:43 +00:00
Arpad Müller	a4ac8e26e8	Update Rust to 1.75.0 (#6285 ) [Release notes](https://github.com/rust-lang/rust/releases/tag/1.75.0).	2024-01-08 11:46:16 +01:00
John Spray	b3a681d121	s3_scrubber: updates for sharding (#6281 ) This is a lightweight change to keep the scrubber providing sensible output when using sharding. - The timeline count was wrong when using sharding - When checking for tenant existence, we didn't re-use results between different shards in the same tenant Closes: https://github.com/neondatabase/neon/issues/5929	2024-01-08 09:19:10 +00:00
John Spray	b5ed6f22ae	pageserver: clean up a TODO comment (#6282 ) These functions don't need updating for sharding: it's fine for them to remain shard-naive, as they're only used in the context of dumping a layer file. The sharding metadata doesn't live in the layer file, it lives in the index.	2024-01-08 09:19:00 +00:00
John Spray	d1c0232e21	pageserver: use `pub(crate)` in metrics.rs, and clean up unused items (#6275 ) ## Problem Noticed while making other changes that there were `pub` items that were unused. ## Summary of changes - Make everything `pub(crate)` in metrics.rs, apart from items used from `bin/` - Fix the timelines eviction metric: it was never being incremented - Remove an unused ephemeral_bytes counter.	2024-01-08 03:53:15 +00:00
Arseny Sher	a41c4122e3	Don't suspend compute if there is active LR subscriber. https://github.com/neondatabase/neon/issues/6258	2024-01-06 01:24:44 +04:00
Alexander Bayandin	7de829e475	test_runner: replace black with ruff format (#6268) ## Problem `black` is sometimes slow; we can replace it with `ruff format` (a new feature in 0.1.2 [0]), which produces a style pretty similar to black's [1]. On my local machine (MacBook M1 Pro 16GB): ``` # `black` on main $ hyperfine "BLACK_CACHE_DIR=/dev/null poetry run black ." Benchmark 1: BLACK_CACHE_DIR=/dev/null poetry run black . Time (mean ± σ): 3.131 s ± 0.090 s [User: 5.194 s, System: 0.859 s] Range (min … max): 3.047 s … 3.354 s 10 runs ``` ``` # `ruff format` on the current PR $ hyperfine "RUFF_NO_CACHE=true poetry run ruff format" Benchmark 1: RUFF_NO_CACHE=true poetry run ruff format Time (mean ± σ): 300.7 ms ± 50.2 ms [User: 259.5 ms, System: 76.1 ms] Range (min … max): 267.5 ms … 420.2 ms 10 runs ``` ## Summary of changes - Replace `black` with `ruff format` everywhere - [0] https://docs.astral.sh/ruff/formatter/ - [1] https://docs.astral.sh/ruff/formatter/#black-compatibility	2024-01-05 15:35:07 +00:00
John Spray	3c560d27a8	pageserver: implement secondary-mode downloads (#6123 ) Follows on from #6050 , in which we upload heatmaps. Secondary locations will now poll those heatmaps and download layers mentioned in the heatmap. TODO: - [X] ~Unify/reconcile stats for behind-schedule execution with warn_when_period_overrun (https://github.com/neondatabase/neon/pull/6050#discussion_r1426560695)~ - [x] Give downloads their own concurrency config independent of uploads Deferred optimizations: - https://github.com/neondatabase/neon/issues/6199 - https://github.com/neondatabase/neon/issues/6200 Eviction will be the next PR: - #5342	2024-01-05 12:29:20 +00:00
Christian Schwarz	d260426a14	is_rel_block_key: exclude the relsize key (#6266 ) Before this PR, `is_rel_block_key` returns true for the blknum `0xffffffff`, which is a blknum that's actually never written by Postgres, but used by Neon Pageserver to store the relsize. Quoting @MMeent: > PostgreSQL can't extend the relation beyond size of 0xFFFFFFFF blocks, > so block number 0xFFFFFFFE is the last valid block number. This PR changes the definition of the function to exclude blknum 0xffffffff. My motivation for doing this change is to fix the `pagebench` getpage benchmark, which uses `is_rel_block_key` to filter the keyspace for valid pages to request from page_service. fixes https://github.com/neondatabase/neon/issues/6210 I checked other users of the function. The first one is `key_is_shard0`, which already had added an exemption for 0xffffffff. So, there's no functional change with this PR. The second one is `DatadirModification::flush`[^1]. With this PR, `.flush()` will skip the relsize key, whereas it didn't before. This means we will pile up all the relsize key-value pairs `(Key,u32)` in `DatadirModification::pending_updates` until `.commit()` is called. The only place I can think of where that would be a problem is if we import from a full basebackup, and don't `.commit()` regularly, like we currently don't do in `import_basebackup_from_tar`. It exposes us to input-controlled allocations. However, that was already the case for the other keys that are skipped, so, one can argue that this change is not making the situation much worse. [^1]: That type's `flush()` and `commit()` methods are terribly named, but, that's for another time	2024-01-05 11:48:06 +01:00
Arthur Petukhovsky	f3b5db1443	Add API for safekeeper timeline copy (#6091) Implement API for cloning a single timeline inside a safekeeper. Also add API for calculating a sha256 hash of WAL, which is used in tests. `/copy` API works by copying objects inside S3 for all but the last segments, and the last segments are copied on-disk. A special temporary directory is created for a timeline, because copy can take a lot of time, especially for large timelines. After all file segments have been prepared, this directory is mounted to the main tree and the timeline is loaded into memory. Some caveats: - large timelines can take a lot of time to copy, because we need to copy many S3 segments - the caller should wait indefinitely for the HTTP call to finish and not close the HTTP connection, because closing it stops the process, which is not continued in the background - `until_lsn` must be a valid LSN, otherwise bad things can happen - API will return 200 if the specified `timeline_id` already exists, even if it's not a copy - each safekeeper will try to copy S3 segments, so it's better not to call this API in parallel on different safekeepers	2024-01-04 17:40:38 +00:00
John Spray	18e9208158	pageserver: improved error handling for shard routing error, timeline not found (#6262 ) ## Problem - When a client requests a key that isn't found in any shard on the node (edge case that only happens if a compute's config is out of date), we should prompt them to reconnect (as this includes a backoff), since they will not be able to complete the request until they eventually get a correct pageserver connection string. - QueryError::Other is used excessively: this contains a type-ambiguous anyhow::Error and is logged very verbosely (including backtrace). ## Summary of changes - Introduce PageStreamError to replace use of anyhow::Error in request handlers for getpage, etc. - Introduce Reconnect and NotFound variants to QueryError - Map the "shard routing error" case to PageStreamError::Reconnect -> QueryError::Reconnect - Update type conversions for LSN timeouts and tenant/timeline not found errors to use PageStreamError::NotFound->QueryError::NotFound	2024-01-04 10:40:03 +00:00
Sasha Krassovsky	7662df6ca0	Fix minimum backoff to 1ms	2024-01-03 21:09:19 -08:00
John Spray	c119af8ddd	pageserver: run at least 2 background task threads Otherwise an assertion in CONCURRENT_BACKGROUND_TASKS will trip if you try to run the pageserver on a single core.	2024-01-03 14:22:40 +00:00
John Spray	a2e083ebe0	pageserver: make walredo shard-aware This does not have a functional impact, but enables all the logging in this code to include the shard_id label.	2024-01-03 14:22:40 +00:00
John Spray	73a944205b	pageserver: log details on shard routing error	2024-01-03 14:22:40 +00:00
John Spray	34ebfbdd6f	pageserver: fix handling getpage with multiple shards on one node Previously, we would wait for the LSN to be visible on whichever timeline we happened to load at the start of the connection, then proceed to look up the correct timeline for the key and do the read. If the timeline holding the key was behind the timeline we used for the LSN wait, then we might serve an apparently-successful read result that actually contains data from behind the requested lsn.	2024-01-03 14:22:40 +00:00
John Spray	ef7c9c2ccc	pageserver: fix active tenant lookup hitting secondaries with sharding If there is some secondary shard for a tenant on the same node as an attached shard, the secondary shard could trip up this code and cause page_service to incorrectly get an error instead of finding the attached shard.	2024-01-03 14:22:40 +00:00
John Spray	6c79e12630	pageserver: drop unwanted keys during compaction after split	2024-01-03 14:22:40 +00:00
John Spray	753d97bd77	pageserver: don't delete ancestor shard layers	2024-01-03 14:22:40 +00:00
John Spray	edc962f1d7	test_runner: test_issue_5878 log allow list (#6259 ) ## Problem https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6254/7388706419/index.html#suites/5a4b8734277a9878cb429b80c314f470/e54c4f6f6ed22672 ## Summary of changes Permit the log message: because the test helper's detach function increments the generation number, a detach/attach cycle can cause the error if the test runner node is slow enough for the opportunistic deletion queue flush on detach not to complete by the time we call attach.	2024-01-03 14:22:17 +00:00
Arseny Sher	65b4e6e7d6	Remove empty safekeeper init since truncateLsn. It has caveats such as creating half empty segment which can't be offloaded. Instead we'll pursue approach of pull_timeline, seeding new state from some peer.	2024-01-03 18:20:19 +04:00
Alexander Bayandin	17b256679b	vm-image-spec: build pgbouncer from Neon's fork (#6249 ) ## Problem We need to add one more patch to pgbouncer (for https://github.com/neondatabase/neon/issues/5801). I've decided to cherry-pick all required patches to a pgbouncer fork (`neondatabase/pgbouncer`) and use it instead. See https://github.com/neondatabase/pgbouncer/releases/tag/pgbouncer_1_21_0-neon-1 ## Summary of changes - Revert the previous patch (for deallocate/discard all) — the fork already contains it. - Remove `libssl-dev` dependency — we build pgbouncer without `openssl` support. - Clone git tag and build pgbouncer from source code.	2024-01-03 13:02:04 +00:00
John Spray	673a865055	tests: tolerate 304 when evicting layers (#6261 ) In tests that evict layers, explicit eviction can race with automatic eviction of the same layer and result in a 304	2024-01-03 11:50:58 +00:00
Cuong Nguyen	fb518aea0d	Add batch ingestion mechanism to avoid high contention (#5886) ## Problem For context, this problem was observed in a research project where we try to make neon run in multiple regions and I was asked by @hlinnaka to make this PR. In our project, we use the pageserver in a non-conventional way such that we would send a larger number of requests to the pageserver than normal (imagine postgres without the buffer pool). I measured the time from the moment a WAL record left the safekeeper to when it reached the pageserver ([code](`e593db1f5a/pageserver/src/tenant/timeline/walreceiver/walreceiver_connection.rs (L282-L287)`)) and observed that when the number of get_page_at_lsn requests was high, the wal receiving time increased significantly (see the left side of the graphs below). Upon further investigation, I found that the delay was caused by this line `d2ca410919/pageserver/src/tenant/timeline.rs (L2348)` The `get_layer_for_write` method is called for every value during WAL ingestion and it tries to acquire the layers write lock every time, which results in high contention when the read lock is acquired more frequently. ![Untitled](https://github.com/neondatabase/neon/assets/6244849/85460f4d-ead1-4532-bc64-736d0bfd7f16) ![Untitled2](https://github.com/neondatabase/neon/assets/6244849/84199ab7-5f0e-413b-a42b-f728f2225218) ## Summary of changes It is unnecessary to call `get_layer_for_write` repeatedly for all values in a WAL message since they would end up in the same memory layer anyway, so I created batched versions of `InMemoryLayer::put_value`, `InMemoryLayer::put_tombstone`, `Timeline::put_value`, and `Timeline::put_tombstone` that acquire the locks once for a batch of values. Additionally, `DatadirModification` is changed to store multiple versions of uncommitted values, and `WalIngest::ingest_record()` can now ingest records without immediately committing them. With these new APIs, the new ingestion loop can be changed to commit for every `ingest_batch_size` records. The `ingest_batch_size` variable is exposed as a config. If it is set to 1 then we get the same behavior as before this change. I found that setting this value to 100 seems to work the best, and you can see its effect on the right side of the above graphs. --------- Co-authored-by: John Spray <john@neon.tech>	2024-01-03 10:41:58 +00:00
John Spray	42f41afcbd	tests: update pytest and boto3 dependencies (#6253 ) ## Problem The version of pytest we were using emits a number of DeprecationWarnings on latest python: these are fixed in latest release. boto3 and python-dateutil also have deprecation warnings, but unfortunately these aren't fixed upstream yet. ## Summary of changes - Update pytest - Update boto3 (this doesn't fix deprecation warnings, but by the time I figured that out I had already done the update, and it's good hygiene anyway)	2024-01-03 10:36:53 +00:00
Arseny Sher	f71110383c	Remove second check for max_slot_wal_keep_size download size. Already checked in GetLogRepRestartLSN, a rebase artifact.	2024-01-03 13:13:32 +04:00
Arseny Sher	ae3eaf9995	Add [WP] prefix to all walproposer logging. - rename walpop_log to wp_log - create also wpg_log which is used in postgres-specific code - in passing format messages to start with lower case	2024-01-03 11:10:27 +04:00
Christian Schwarz	aa9f1d4b69	pagebench get-page: default to latest=true, make configurable via flag (#6252 ) fixes https://github.com/neondatabase/neon/issues/6209	2024-01-02 16:57:29 +00:00
Joonas Koivunen	946c6a0006	scrubber: use adaptive config with retries, check subset of tenants (#6219 ) The tool still needs a lot of work. These are the easiest fix and feature: - use similar adaptive config with s3 as remote_storage, use retries - process only particular tenants Tenants need to be from the correct region, they are not deduplicated, but the feature is useful for re-checking small amount of tenants after a large run.	2024-01-02 15:22:16 +00:00
Sasha Krassovsky	ce13281d54	MIN not MAX	2024-01-02 06:28:49 -08:00
Sasha Krassovsky	4e1d16f311	Switch to exponential rate-limiting	2024-01-02 06:28:49 -08:00
Sasha Krassovsky	091a0cda9d	Switch to rate-limiting strategy	2024-01-02 06:28:49 -08:00
Sasha Krassovsky	ea9fad419e	Add exponential backoff to page_server->send	2024-01-02 06:28:49 -08:00
Arseny Sher	e92c9f42c0	Don't split WAL record across two XLogData's when sending from safekeepers. As protocol demands. Not following this makes standby complain about corrupted WAL in various ways. https://neondb.slack.com/archives/C05L7D1JAUS/p1703774799114719 closes https://github.com/neondatabase/cloud/issues/9057	2024-01-02 10:50:20 +04:00
Arseny Sher	aaaa39d9f5	Add large insertion and slow WAL sending to test_hot_standby. To exercise MAX_SEND_SIZE sending from safekeeper; we've had a bug with WAL records torn across several XLogData messages. Add failpoint to safekeeper to slow down sending. Also check for corrupted WAL complaints in standby log. Make the test a bit simpler in passing, e.g. we don't need explicit commits as autocommit is enabled by default. https://neondb.slack.com/archives/C05L7D1JAUS/p1703774799114719 https://github.com/neondatabase/cloud/issues/9057	2024-01-02 10:50:20 +04:00
Arseny Sher	e79a19339c	Add failpoint support to safekeeper. Just a copy paste from pageserver.	2024-01-02 10:50:20 +04:00
Arseny Sher	dbd36e40dc	Move failpoint support code to utils. To enable them in safekeeper as well.	2024-01-02 10:50:20 +04:00
Arseny Sher	90ef48aab8	Fix safekeeper START_REPLICATION (term=n). It was giving WAL only up to commit_lsn instead of flush_lsn, so recovery of uncommitted WAL since `cdb08f03` hung. Add test for this.	2024-01-01 20:44:05 +04:00
Arseny Sher	9a43c04a19	compute_ctl: kill postgres and sync-safekeepers on exit. Otherwise they are left orphaned when compute_ctl is terminated with a signal. It was invisible most of the time because normally neon_local or k8s kills postgres directly and then compute_ctl finishes gracefully. However, in some tests compute_ctl gets stuck waiting for sync-safekeepers which intentionally never ends because safekeepers are offline, and we want to stop compute_ctl without leaving orphans behind. This is quite a rough approach which doesn't wait for children termination. A better way would be to convert compute_ctl to async which would make waiting easy.	2024-01-01 20:44:05 +04:00
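
The `is_rel_block_key` change described in commit d260426a14 above can be illustrated with a minimal sketch. This is not the pageserver's code: the `Key` type below is a hypothetical two-field simplification (the real key has more fields), `is_relation` is an assumed stand-in for whatever marks a relation key, and the `_before`/`_after` function names exist only for this illustration. The two constants come from the commit message itself.

```python
from typing import Iterable, List, NamedTuple

# Block number the pageserver reserves for the relation size (per the commit message).
RELSIZE_BLKNUM = 0xFFFFFFFF
# Last block number PostgreSQL can actually write (per the quoted comment).
LAST_VALID_BLKNUM = 0xFFFFFFFE


class Key(NamedTuple):
    """Hypothetical simplified key; the real pageserver key has more fields."""

    is_relation: bool  # assumption: stands in for the real relation-key check
    blknum: int


def is_rel_block_key_before(key: Key) -> bool:
    # Behaviour described as "before this PR": the relsize slot also matches.
    return key.is_relation


def is_rel_block_key_after(key: Key) -> bool:
    # Behaviour described as "with this PR": the relsize slot is excluded.
    return key.is_relation and key.blknum != RELSIZE_BLKNUM


def valid_pages(keys: Iterable[Key]) -> List[Key]:
    # Keep only keys that name a real page, the way the commit says
    # pagebench filters the keyspace before issuing getpage requests.
    return [k for k in keys if is_rel_block_key_after(k)]
```

With the old predicate, a benchmark filtering on it would also try to fetch the relsize entry as if it were a page; with the new one, only block numbers up to `LAST_VALID_BLKNUM` pass the filter.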

4332 Commits