rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-06 13:40:37 +00:00

Author	SHA1	Message	Date
Christian Schwarz	9d0405b798	Merge remote-tracking branch 'origin/main' into problame/batching-sidecar-task	2024-11-30 00:03:30 +01:00
Christian Schwarz	1cd333ce38	read_messages => batcher	2024-11-29 20:40:51 +01:00
Christian Schwarz	9b65b268ed	stop Box'ing stuff & clean up the passing-through of errors (remove enum Batch)	2024-11-29 16:14:03 +01:00
John Spray	d5624cc505	pageserver: download small objects using a smaller timeout (#9938 ) ## Problem It appears that the Azure storage API tends to hang TCP connections more than S3 does. Currently we use a 2 minute timeout for all downloads. This is large because sometimes the objects we download are large. However, waiting 2 minutes when doing something like downloading a manifest on tenant attach is problematic, because when someone is doing a "create tenant, create timeline" workflow, that 2 minutes is long enough for them reasonably to give up creating that timeline. Rather than propagate oversized timeouts further up the stack, we should use a different timeout for objects that we expect to be small. Closes: https://github.com/neondatabase/neon/issues/9836 ## Summary of changes - Add a `small_timeout` configuration attribute to remote storage, defaulting to 30 seconds (still a very generous period to do something like download an index) - Add a DownloadKind parameter to DownloadOpts, so that callers can indicate whether they expect the object to be small or large. - In the azure client, use small timeout for HEAD requests, and for GET requests if DownloadKind::Small is used. - Use DownloadKind::Small for manifests, indices, and heatmap downloads. This PR intentionally does not make the equivalent change to the S3 client, to reduce blast radius in case this has unexpected consequences (we could accomplish the same thing by editing lots of configs, but just skipping the code is simpler for right now)	2024-11-29 15:11:44 +00:00
Christian Schwarz	199a4bd6c1	Merge remote-tracking branch 'origin/main' into problame/batching-sidecar-task	2024-11-29 14:26:37 +01:00
Christian Schwarz	bab3dd009d	Merge remote-tracking branch 'origin/main' into problame/batching-sidecar-task	2024-11-29 14:19:06 +01:00
Christian Schwarz	dfcbb139fb	the `None` configuration in the benchmark would use the default instead of the serial configuration; fix that	2024-11-29 13:35:24 +01:00
Christian Schwarz	9a5611a5ef	merge reader&batcher stages, update docs	2024-11-29 11:39:16 +01:00
Christian Schwarz	a2a3613185	reintroduce task-based execution	2024-11-28 20:50:06 +01:00
Christian Schwarz	6bd39f95f5	rn benchmark on hetzner runner -------------------------------------------------------------------------------------------------------------------- Benchmark results --------------------------------------------------------------------------------------------------------------------- test_throughput[release-pg16-50-None-30-1-128-not batchable None].tablesize_mib: 50 MiB test_throughput[release-pg16-50-None-30-1-128-not batchable None].pipelining_enabled: 0 test_throughput[release-pg16-50-None-30-1-128-not batchable None].effective_io_concurrency: 1 test_throughput[release-pg16-50-None-30-1-128-not batchable None].readhead_buffer_size: 128 test_throughput[release-pg16-50-None-30-1-128-not batchable None].counters.time: 0.8905 test_throughput[release-pg16-50-None-30-1-128-not batchable None].counters.pageserver_getpage_count: 6,403.0000 test_throughput[release-pg16-50-None-30-1-128-not batchable None].counters.pageserver_vectored_get_count: 6,403.0000 test_throughput[release-pg16-50-None-30-1-128-not batchable None].counters.compute_getpage_count: 6,403.0000 test_throughput[release-pg16-50-None-30-1-128-not batchable None].counters.pageserver_cpu_seconds_total: 0.8633 test_throughput[release-pg16-50-None-30-1-128-not batchable None].perfmetric.batching_factor: 1.0000 test_throughput[release-pg16-50-pipelining_config1-30-1-128-not batchable {'max_batch_size': 1}].tablesize_mib: 50 MiB test_throughput[release-pg16-50-pipelining_config1-30-1-128-not batchable {'max_batch_size': 1}].pipelining_enabled: 1 test_throughput[release-pg16-50-pipelining_config1-30-1-128-not batchable {'max_batch_size': 1}].effective_io_concurrency: 1 test_throughput[release-pg16-50-pipelining_config1-30-1-128-not batchable {'max_batch_size': 1}].readhead_buffer_size: 128 test_throughput[release-pg16-50-pipelining_config1-30-1-128-not batchable {'max_batch_size': 1}].pipelining_config.max_batch_size: 1 test_throughput[release-pg16-50-pipelining_config1-30-1-128-not batchable {'max_batch_size': 1}].counters.time: 0.9195 test_throughput[release-pg16-50-pipelining_config1-30-1-128-not batchable {'max_batch_size': 1}].counters.pageserver_getpage_count: 6,403.0000 test_throughput[release-pg16-50-pipelining_config1-30-1-128-not batchable {'max_batch_size': 1}].counters.pageserver_vectored_get_count: 6,403.0000 test_throughput[release-pg16-50-pipelining_config1-30-1-128-not batchable {'max_batch_size': 1}].counters.compute_getpage_count: 6,403.0000 test_throughput[release-pg16-50-pipelining_config1-30-1-128-not batchable {'max_batch_size': 1}].counters.pageserver_cpu_seconds_total: 0.8925 test_throughput[release-pg16-50-pipelining_config1-30-1-128-not batchable {'max_batch_size': 1}].perfmetric.batching_factor: 1.0000 test_throughput[release-pg16-50-pipelining_config2-30-1-128-not batchable {'max_batch_size': 32}].tablesize_mib: 50 MiB test_throughput[release-pg16-50-pipelining_config2-30-1-128-not batchable {'max_batch_size': 32}].pipelining_enabled: 1 test_throughput[release-pg16-50-pipelining_config2-30-1-128-not batchable {'max_batch_size': 32}].effective_io_concurrency: 1 test_throughput[release-pg16-50-pipelining_config2-30-1-128-not batchable {'max_batch_size': 32}].readhead_buffer_size: 128 test_throughput[release-pg16-50-pipelining_config2-30-1-128-not batchable {'max_batch_size': 32}].pipelining_config.max_batch_size: 32 test_throughput[release-pg16-50-pipelining_config2-30-1-128-not batchable {'max_batch_size': 32}].counters.time: 0.8724 test_throughput[release-pg16-50-pipelining_config2-30-1-128-not batchable {'max_batch_size': 32}].counters.pageserver_getpage_count: 6,403.0000 test_throughput[release-pg16-50-pipelining_config2-30-1-128-not batchable {'max_batch_size': 32}].counters.pageserver_vectored_get_count: 6,403.0000 test_throughput[release-pg16-50-pipelining_config2-30-1-128-not batchable {'max_batch_size': 32}].counters.compute_getpage_count: 6,403.0000 test_throughput[release-pg16-50-pipelining_config2-30-1-128-not batchable {'max_batch_size': 32}].counters.pageserver_cpu_seconds_total: 0.8406 test_throughput[release-pg16-50-pipelining_config2-30-1-128-not batchable {'max_batch_size': 32}].perfmetric.batching_factor: 1.0000 test_throughput[release-pg16-50-None-30-100-128-batchable None].tablesize_mib: 50 MiB test_throughput[release-pg16-50-None-30-100-128-batchable None].pipelining_enabled: 0 test_throughput[release-pg16-50-None-30-100-128-batchable None].effective_io_concurrency: 100 test_throughput[release-pg16-50-None-30-100-128-batchable None].readhead_buffer_size: 128 test_throughput[release-pg16-50-None-30-100-128-batchable None].counters.time: 0.2576 test_throughput[release-pg16-50-None-30-100-128-batchable None].counters.pageserver_getpage_count: 6,401.5259 test_throughput[release-pg16-50-None-30-100-128-batchable None].counters.pageserver_vectored_get_count: 307.8534 test_throughput[release-pg16-50-None-30-100-128-batchable None].counters.compute_getpage_count: 6,401.5259 test_throughput[release-pg16-50-None-30-100-128-batchable None].counters.pageserver_cpu_seconds_total: 0.3043 test_throughput[release-pg16-50-None-30-100-128-batchable None].perfmetric.batching_factor: 20.7941 test_throughput[release-pg16-50-pipelining_config4-30-100-128-batchable {'max_batch_size': 1}].tablesize_mib: 50 MiB test_throughput[release-pg16-50-pipelining_config4-30-100-128-batchable {'max_batch_size': 1}].pipelining_enabled: 1 test_throughput[release-pg16-50-pipelining_config4-30-100-128-batchable {'max_batch_size': 1}].effective_io_concurrency: 100 test_throughput[release-pg16-50-pipelining_config4-30-100-128-batchable {'max_batch_size': 1}].readhead_buffer_size: 128 test_throughput[release-pg16-50-pipelining_config4-30-100-128-batchable {'max_batch_size': 1}].pipelining_config.max_batch_size: 1 test_throughput[release-pg16-50-pipelining_config4-30-100-128-batchable {'max_batch_size': 1}].counters.time: 0.6187 test_throughput[release-pg16-50-pipelining_config4-30-100-128-batchable {'max_batch_size': 1}].counters.pageserver_getpage_count: 6,403.0000 test_throughput[release-pg16-50-pipelining_config4-30-100-128-batchable {'max_batch_size': 1}].counters.pageserver_vectored_get_count: 6,403.0000 test_throughput[release-pg16-50-pipelining_config4-30-100-128-batchable {'max_batch_size': 1}].counters.compute_getpage_count: 6,403.0000 test_throughput[release-pg16-50-pipelining_config4-30-100-128-batchable {'max_batch_size': 1}].counters.pageserver_cpu_seconds_total: 0.7473 test_throughput[release-pg16-50-pipelining_config4-30-100-128-batchable {'max_batch_size': 1}].perfmetric.batching_factor: 1.0000 test_throughput[release-pg16-50-pipelining_config5-30-100-128-batchable {'max_batch_size': 2}].tablesize_mib: 50 MiB test_throughput[release-pg16-50-pipelining_config5-30-100-128-batchable {'max_batch_size': 2}].pipelining_enabled: 1 test_throughput[release-pg16-50-pipelining_config5-30-100-128-batchable {'max_batch_size': 2}].effective_io_concurrency: 100 test_throughput[release-pg16-50-pipelining_config5-30-100-128-batchable {'max_batch_size': 2}].readhead_buffer_size: 128 test_throughput[release-pg16-50-pipelining_config5-30-100-128-batchable {'max_batch_size': 2}].pipelining_config.max_batch_size: 2 test_throughput[release-pg16-50-pipelining_config5-30-100-128-batchable {'max_batch_size': 2}].counters.time: 0.4419 test_throughput[release-pg16-50-pipelining_config5-30-100-128-batchable {'max_batch_size': 2}].counters.pageserver_getpage_count: 6,402.6418 test_throughput[release-pg16-50-pipelining_config5-30-100-128-batchable {'max_batch_size': 2}].counters.pageserver_vectored_get_count: 3,207.7015 test_throughput[release-pg16-50-pipelining_config5-30-100-128-batchable {'max_batch_size': 2}].counters.compute_getpage_count: 6,402.6418 test_throughput[release-pg16-50-pipelining_config5-30-100-128-batchable {'max_batch_size': 2}].counters.pageserver_cpu_seconds_total: 0.5391 test_throughput[release-pg16-50-pipelining_config5-30-100-128-batchable {'max_batch_size': 2}].perfmetric.batching_factor: 1.9960 test_throughput[release-pg16-50-pipelining_config6-30-100-128-batchable {'max_batch_size': 4}].tablesize_mib: 50 MiB test_throughput[release-pg16-50-pipelining_config6-30-100-128-batchable {'max_batch_size': 4}].pipelining_enabled: 1 test_throughput[release-pg16-50-pipelining_config6-30-100-128-batchable {'max_batch_size': 4}].effective_io_concurrency: 100 test_throughput[release-pg16-50-pipelining_config6-30-100-128-batchable {'max_batch_size': 4}].readhead_buffer_size: 128 test_throughput[release-pg16-50-pipelining_config6-30-100-128-batchable {'max_batch_size': 4}].pipelining_config.max_batch_size: 4 test_throughput[release-pg16-50-pipelining_config6-30-100-128-batchable {'max_batch_size': 4}].counters.time: 0.3569 test_throughput[release-pg16-50-pipelining_config6-30-100-128-batchable {'max_batch_size': 4}].counters.pageserver_getpage_count: 6,402.1071 test_throughput[release-pg16-50-pipelining_config6-30-100-128-batchable {'max_batch_size': 4}].counters.pageserver_vectored_get_count: 1,660.0952 test_throughput[release-pg16-50-pipelining_config6-30-100-128-batchable {'max_batch_size': 4}].counters.compute_getpage_count: 6,402.1071 test_throughput[release-pg16-50-pipelining_config6-30-100-128-batchable {'max_batch_size': 4}].counters.pageserver_cpu_seconds_total: 0.4244 test_throughput[release-pg16-50-pipelining_config6-30-100-128-batchable {'max_batch_size': 4}].perfmetric.batching_factor: 3.8565 test_throughput[release-pg16-50-pipelining_config7-30-100-128-batchable {'max_batch_size': 8}].tablesize_mib: 50 MiB test_throughput[release-pg16-50-pipelining_config7-30-100-128-batchable {'max_batch_size': 8}].pipelining_enabled: 1 test_throughput[release-pg16-50-pipelining_config7-30-100-128-batchable {'max_batch_size': 8}].effective_io_concurrency: 100 test_throughput[release-pg16-50-pipelining_config7-30-100-128-batchable {'max_batch_size': 8}].readhead_buffer_size: 128 test_throughput[release-pg16-50-pipelining_config7-30-100-128-batchable {'max_batch_size': 8}].pipelining_config.max_batch_size: 8 test_throughput[release-pg16-50-pipelining_config7-30-100-128-batchable {'max_batch_size': 8}].counters.time: 0.2977 test_throughput[release-pg16-50-pipelining_config7-30-100-128-batchable {'max_batch_size': 8}].counters.pageserver_getpage_count: 6,401.7700 test_throughput[release-pg16-50-pipelining_config7-30-100-128-batchable {'max_batch_size': 8}].counters.pageserver_vectored_get_count: 886.6900 test_throughput[release-pg16-50-pipelining_config7-30-100-128-batchable {'max_batch_size': 8}].counters.compute_getpage_count: 6,401.7700 test_throughput[release-pg16-50-pipelining_config7-30-100-128-batchable {'max_batch_size': 8}].counters.pageserver_cpu_seconds_total: 0.3511 test_throughput[release-pg16-50-pipelining_config7-30-100-128-batchable {'max_batch_size': 8}].perfmetric.batching_factor: 7.2199 test_throughput[release-pg16-50-pipelining_config8-30-100-128-batchable {'max_batch_size': 16}].tablesize_mib: 50 MiB test_throughput[release-pg16-50-pipelining_config8-30-100-128-batchable {'max_batch_size': 16}].pipelining_enabled: 1 test_throughput[release-pg16-50-pipelining_config8-30-100-128-batchable {'max_batch_size': 16}].effective_io_concurrency: 100 test_throughput[release-pg16-50-pipelining_config8-30-100-128-batchable {'max_batch_size': 16}].readhead_buffer_size: 128 test_throughput[release-pg16-50-pipelining_config8-30-100-128-batchable {'max_batch_size': 16}].pipelining_config.max_batch_size: 16 test_throughput[release-pg16-50-pipelining_config8-30-100-128-batchable {'max_batch_size': 16}].counters.time: 0.2697 test_throughput[release-pg16-50-pipelining_config8-30-100-128-batchable {'max_batch_size': 16}].counters.pageserver_getpage_count: 6,401.5946 test_throughput[release-pg16-50-pipelining_config8-30-100-128-batchable {'max_batch_size': 16}].counters.pageserver_vectored_get_count: 500.5766 test_throughput[release-pg16-50-pipelining_config8-30-100-128-batchable {'max_batch_size': 16}].counters.compute_getpage_count: 6,401.5946 test_throughput[release-pg16-50-pipelining_config8-30-100-128-batchable {'max_batch_size': 16}].counters.pageserver_cpu_seconds_total: 0.3195 test_throughput[release-pg16-50-pipelining_config8-30-100-128-batchable {'max_batch_size': 16}].perfmetric.batching_factor: 12.7884 test_throughput[release-pg16-50-pipelining_config9-30-100-128-batchable {'max_batch_size': 32}].tablesize_mib: 50 MiB test_throughput[release-pg16-50-pipelining_config9-30-100-128-batchable {'max_batch_size': 32}].pipelining_enabled: 1 test_throughput[release-pg16-50-pipelining_config9-30-100-128-batchable {'max_batch_size': 32}].effective_io_concurrency: 100 test_throughput[release-pg16-50-pipelining_config9-30-100-128-batchable {'max_batch_size': 32}].readhead_buffer_size: 128 test_throughput[release-pg16-50-pipelining_config9-30-100-128-batchable {'max_batch_size': 32}].pipelining_config.max_batch_size: 32 test_throughput[release-pg16-50-pipelining_config9-30-100-128-batchable {'max_batch_size': 32}].counters.time: 0.2548 test_throughput[release-pg16-50-pipelining_config9-30-100-128-batchable {'max_batch_size': 32}].counters.pageserver_getpage_count: 6,401.5128 test_throughput[release-pg16-50-pipelining_config9-30-100-128-batchable {'max_batch_size': 32}].counters.pageserver_vectored_get_count: 307.7692 test_throughput[release-pg16-50-pipelining_config9-30-100-128-batchable {'max_batch_size': 32}].counters.compute_getpage_count: 6,401.5128 test_throughput[release-pg16-50-pipelining_config9-30-100-128-batchable {'max_batch_size': 32}].counters.pageserver_cpu_seconds_total: 0.3015 test_throughput[release-pg16-50-pipelining_config9-30-100-128-batchable {'max_batch_size': 32}].perfmetric.batching_factor: 20.7997 test_latency[release-pg16-None-None].latency_mean: 0.127 ms test_latency[release-pg16-None-None].latency_percentiles.p95: 0.166 ms test_latency[release-pg16-None-None].latency_percentiles.p99: 0.187 ms test_latency[release-pg16-None-None].latency_percentiles.p99.9: 0.292 ms test_latency[release-pg16-None-None].latency_percentiles.p99.99: 0.624 ms test_latency[release-pg16-pipelining_config1-{'max_batch_size': 1}].latency_mean: 0.139 ms test_latency[release-pg16-pipelining_config1-{'max_batch_size': 1}].latency_percentiles.p95: 0.175 ms test_latency[release-pg16-pipelining_config1-{'max_batch_size': 1}].latency_percentiles.p99: 0.200 ms test_latency[release-pg16-pipelining_config1-{'max_batch_size': 1}].latency_percentiles.p99.9: 0.444 ms test_latency[release-pg16-pipelining_config1-{'max_batch_size': 1}].latency_percentiles.p99.99: 0.658 ms test_latency[release-pg16-pipelining_config2-{'max_batch_size': 32}].latency_mean: 0.119 ms test_latency[release-pg16-pipelining_config2-{'max_batch_size': 32}].latency_percentiles.p95: 0.155 ms test_latency[release-pg16-pipelining_config2-{'max_batch_size': 32}].latency_percentiles.p99: 0.172 ms test_latency[release-pg16-pipelining_config2-{'max_batch_size': 32}].latency_percentiles.p99.9: 0.267 ms test_latency[release-pg16-pipelining_config2-{'max_batch_size': 32}].latency_percentiles.p99.99: 0.587 ms	2024-11-28 20:24:01 +01:00
Christian Schwarz	07358dea89	converge on approach that pushes read Result through pipeline	2024-11-28 20:06:15 +01:00
Vlad Lazar	eb520a14ce	pageserver: return correct LSN for interpreted proto keep alive responses (#9928 ) ## Problem For the interpreted proto the pageserver is not returning the correct LSN in replies to keep alive requests. This is because the interpreted protocol arm was not updating `last_rec_lsn`. ## Summary of changes * Return correct LSN in keep-alive responses * Fix shard field in wal sender traces	2024-11-28 17:38:47 +00:00
Alex Chi Z.	9e3cb75bc7	fix(pageserver): flush deletion queue in `reload` shutdown mode (#9884 ) ## Problem close https://github.com/neondatabase/neon/issues/9859 ## Summary of changes Ensure that the deletion queue gets fully flushed (i.e., the deletion lists get applied) during a graceful shutdown. It is still possible that an incomplete shutdown would leave deletion list behind and cause race upon the next startup, but we assume this will unlikely happen, and even if it happened, the pageserver should already be at a tainted state and the tenant should be moved to a new tenant with a new generation number. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-11-27 18:30:54 +00:00
Erik Grinaker	cc37fa0f33	pageserver: add metrics for unknown `ClearVmBits` pages (#9911 ) ## Problem When ingesting implicit `ClearVmBits` operations, we silently drop the writes if the relation or page is unknown. There are implicit assumptions around VM pages wrt. explicit/implicit updates, sharding, and relation sizes, which can possibly drop writes incorrectly. Adding a few metrics will allow us to investigate further and tighten up the logic. Touches #9855. ## Summary of changes Add a `pageserver_wal_ingest_clear_vm_bits_unknown` metric to record dropped `ClearVmBits` writes. Also add comments clarifying the behavior of relation sizes on non-zero shards.	2024-11-27 17:16:41 +00:00
Erik Grinaker	e4f437a354	pageserver: add relsize cache metrics (#9890 ) ## Problem We don't have any observability for the relation size cache. We have seen cache misses cause significant performance impact with high relation counts. Touches #9855. ## Summary of changes Adds the following metrics: * `pageserver_relsize_cache_entries` * `pageserver_relsize_cache_hits` * `pageserver_relsize_cache_misses` * `pageserver_relsize_cache_misses_old`	2024-11-27 13:54:14 +00:00
Vlad Lazar	8fdf786217	pageserver: add tenant config override for wal receiver proto (#9888 ) ## Problem Can't change protocol at tenant granularity. ## Summary of changes Add tenant config level override for wal receiver protocol. ## Links Related: https://github.com/neondatabase/neon/issues/9336 Epic: https://github.com/neondatabase/neon/issues/9329	2024-11-27 13:46:23 +00:00
Vlad Lazar	9e0148de11	safekeeper: use protobuf for sending compressed records to pageserver (#9821 ) ## Problem https://github.com/neondatabase/neon/pull/9746 lifted decoding and interpretation of WAL to the safekeeper. This reduced the ingested amount on the pageservers by around 10x for a tenant with 8 shards, but doubled the ingested amount for single sharded tenants. Also, https://github.com/neondatabase/neon/pull/9746 uses bincode which doesn't support schema evolution. Technically the schema can be evolved, but it's very cumbersome. ## Summary of changes This patch set addresses both problems by adding protobuf support for the interpreted wal records and adding compression support. Compressed protobuf reduced the ingested amount by 100x on the 32 shards `test_sharded_ingest` case (compared to non-interpreted proto). For the 1 shard case the reduction is 5x. Sister change to `rust-postgres` is [here](https://github.com/neondatabase/rust-postgres/pull/33). ## Links Related: https://github.com/neondatabase/neon/issues/9336 Epic: https://github.com/neondatabase/neon/issues/9329	2024-11-27 12:12:21 +00:00
Christian Schwarz	82e1fa3f83	WIP	2024-11-27 12:31:56 +01:00
Christian Schwarz	7fb3d95596	review & identified a cast that isn't handled, document that	2024-11-27 10:33:53 +01:00
Christian Schwarz	e0123c8a80	explain the pipeline cancellation story	2024-11-27 10:13:51 +01:00
Christian Schwarz	18ffaba975	fix pipeline cancellation	2024-11-26 20:44:36 +01:00
Christian Schwarz	a23abb2cc0	adopt spsc_fold	2024-11-26 13:30:40 +01:00
Peter Bendel	13feda0669	track how much time the flush loop is stalled waiting for uploads (#9885 ) ## Problem We don't know how much time PS is losing during ingest when waiting for remote storage uploads in the flush frozen layer loop. Also we don't know how many remote storage requests get an permit without waiting (not throttled by remote_storage concurrency_limit). ## Summary of changes - Add a metric that accumulates the time waited per shard/PS - in [remote storage semaphore wait seconds](https://neonprod.grafana.net/d/febd9732-9bcf-4992-a821-49b1f6b02724/remote-storage?orgId=1&var-datasource=HUNg6jvVk&var-instance=pageserver-26.us-east-2.aws.neon.build&var-instance=pageserver-27.us-east-2.aws.neon.build&var-instance=pageserver-28.us-east-2.aws.neon.build&var-instance=pageserver-29.us-east-2.aws.neon.build&var-instance=pageserver-30.us-east-2.aws.neon.build&var-instance=pageserver-31.us-east-2.aws.neon.build&var-instance=pageserver-36.us-east-2.aws.neon.build&var-instance=pageserver-37.us-east-2.aws.neon.build&var-instance=pageserver-38.us-east-2.aws.neon.build&var-instance=pageserver-39.us-east-2.aws.neon.build&var-instance=pageserver-40.us-east-2.aws.neon.build&var-instance=pageserver-41.us-east-2.aws.neon.build&var-request_type=put_object&from=1731961336340&to=1731964762933&viewPanel=3) add a first bucket with 100 microseconds to count requests that do not need to wait on semaphore Update: created a new version that uses a Gauge (one increasing value per PS/shard) instead of histogram as suggested by review	2024-11-26 11:46:58 +00:00
Vlad Lazar	7a2f0ed8d4	safekeeper: lift decoding and interpretation of WAL to the safekeeper (#9746 ) ## Problem For any given tenant shard, pageservers receive all of the tenant's WAL from the safekeeper. This soft-blocks us from using larger shard counts due to bandwidth concerns and CPU overhead of filtering out the records. ## Summary of changes This PR lifts the decoding and interpretation of WAL from the pageserver into the safekeeper. A customised PG replication protocol is used where instead of sending raw WAL, the safekeeper sends filtered, interpreted records. The receiver drives the protocol selection, so, on the pageserver side, usage of the new protocol is gated by a new pageserver config: `wal_receiver_protocol`. More granularly the changes are: 1. Optionally inject the protocol and shard identity into the arguments used for starting replication 2. On the safekeeper side, implement a new wal sending primitive which decodes and interprets records before sending them over 3. On the pageserver side, implement the ingestion of this new replication message type. It's very similar to what we already have for raw wal (minus decoding and interpreting). ## Notes * This PR currently uses my [branch of rust-postgres](https://github.com/neondatabase/rust-postgres/tree/vlad/interpreted-wal-record-replication-support) which includes the deserialization logic for the new replication message type. PR for that is open [here](https://github.com/neondatabase/rust-postgres/pull/32). * This PR contains changes for both pageservers and safekeepers. It's safe to merge because the new protocol is disabled by default on the pageserver side. We can gradually start enabling it in subsequent releases. * CI tests are running on https://github.com/neondatabase/neon/pull/9747 ## Links Related: https://github.com/neondatabase/neon/issues/9336 Epic: https://github.com/neondatabase/neon/issues/9329	2024-11-25 17:29:28 +00:00
Arpad Müller	77630e5408	Address beta clippy lint needless_lifetimes (#9877 ) The 1.82.0 version of Rust will be stable soon, let's get the clippy lint fixes in before the compiler version upgrade.	2024-11-25 14:59:12 +00:00
Christian Schwarz	99b664c9ed	expand fix to tasks mode; add some comments	2024-11-25 11:51:58 +01:00
Christian Schwarz	b9477aa945	fix: batcher wouldn't shut down after executor exits	2024-11-25 11:28:30 +01:00
Christian Schwarz	0bb037240d	logging to debug test_pageserver_restarts_under_worload	2024-11-25 10:36:48 +01:00
Christian Schwarz	450be26bbb	fast imports: initial Importer and Storage changes (#9218 ) Co-authored-by: Heikki Linnakangas <heikki@neon.tech> Co-authored-by: Stas Kelvic <stas@neon.tech> # Context This PR contains PoC-level changes for a product feature that allows onboarding large databases into Neon without going through the regular data path. # Changes This internal RFC provides all the context * https://github.com/neondatabase/cloud/pull/19799 In the language of the RFC, this PR covers * the Importer code (`fast_import`) * all the Pageserver changes (mgmt API changes, flow implementation, etc) * a basic test for the Pageserver changes # Reviewing As acknowledged in the RFC, the code added in this PR is not ready for general availability. Also, the architecture is not to be discussed in this PR, but in the RFC and associated Slack channel instead. Reviewers of this PR should take that into consideration. The quality bar to apply during review depends on what area of the code is being reviewed: * Importer code (`fast_import`): practically anything goes * Core flow (`flow.rs`): * Malicious input data must be expected and the existing threat models apply. * The code must not be safe to execute on dedicated Pageserver instances: * This means in particular that tenants on other Pageserver instances must not be affected negatively wrt data confidentiality, integrity or availability. * Other code: the usual quality bar * Pay special attention to correct use of gate guards, timeline cancellation in all places during shutdown & migration, etc. * Consider the broader system impact; if you find potentially problematic interactions with Storage features that were not covered in the RFC, bring that up during the review. I recommend submitting three separate reviews, for the three high-level areas with different quality bars. # References (Internal-only) * refs https://github.com/neondatabase/cloud/issues/17507 * refs https://github.com/neondatabase/company_projects/issues/293 * refs https://github.com/neondatabase/company_projects/issues/309 * refs https://github.com/neondatabase/cloud/issues/20646 --------- Co-authored-by: Stas Kelvich <stas.kelvich@gmail.com> Co-authored-by: Heikki Linnakangas <heikki@neon.tech> Co-authored-by: John Spray <john@neon.tech>	2024-11-22 22:47:06 +00:00
Alex Chi Z.	c1937d073f	fix(pageserver): ensure upload happens after delete (#9844 ) ## Problem Follow up of https://github.com/neondatabase/neon/pull/9682, that patch didn't fully address the problem: what if shutdown fails due to whatever reason and then we reattach the tenant? Then we will still remove the future layer. The underlying problem is that the fix for #5878 gets voided because of the generation optimizations. Of course, we also need to ensure that delete happens after uploads, but note that we only schedule deletes when there are no ongoing upload tasks, so that's fine. ## Summary of changes * Add a test case to reproduce the behavior (by changing the original test case to attach the same generation). * If layer upload happens after the deletion, drain the deletion queue before uploading. * If blocked_deletion is enabled, directly remove it from the blocked_deletion queue. * Local fs backend fix to avoid race between deletion and preload. * test_emergency_mode does not need to wait for uploads (and it's generally not possible to wait for uploads). * ~~Optimize deletion executor to skip validation if there are no files to delete.~~ this doesn't work --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-11-22 18:30:53 +00:00
Erik Grinaker	e939d36dd4	safekeeper,pageserver: fix CPU profiling allowlists (#9856 ) ## Problem The HTTP router allowlists matched both on the path and the query string. This meant that only `/profile/cpu` would be allowed without auth, while `/profile/cpu?format=svg` would require auth. Follows #9764. ## Summary of changes * Match allowlists on URI path, rather than the entire URI. * Fix the allowlist for Safekeeper to use `/profile/cpu` rather than the old `/pprof/profile`. * Just use a constant slice for the allowlist; it's only a handful of items, and these handlers are not on hot paths.	2024-11-22 17:50:33 +00:00
Christian Schwarz	d6e5a46015	eliminate the word `batch` and stale doc comments	2024-11-22 12:46:52 +01:00
Christian Schwarz	a28c54dac1	cosmetics	2024-11-22 12:44:31 +01:00
Christian Schwarz	ef502f8311	remove async-timer heritage	2024-11-22 12:43:55 +01:00
Christian Schwarz	c1e8347160	make configurable whether pipelining should use concurrent futures or tasks	2024-11-22 11:27:23 +01:00
John Spray	d9de65ee8f	pageserver: permit reads behind GC cutoff during LSN grace period (#9833 ) ## Problem In https://github.com/neondatabase/neon/issues/9754 and the flakiness of `test_readonly_node_gc`, we saw that although our logic for controlling GC was sound, the validation of getpage requests was not, because it could not consider LSN leases when requests arrived shortly after restart. Closes https://github.com/neondatabase/neon/issues/9754 ## Summary of changes This is the "Option 3" discussed verbally -- rather than holding back gc cutoff, we waive the usual validation of request LSN if we are still waiting for leases to be sent after startup - When validating LSN in `wait_or_get_last_lsn`, skip the validation relative to GC cutoff if the timeline is still in its LSN lease grace period - Re-enable test_readonly_node_gc	2024-11-22 09:24:23 +00:00
Christian Schwarz	093674b2fb	impmlement the serial mode	2024-11-22 09:53:08 +01:00
Christian Schwarz	0fa8ae3c0a	WIP refactor to allow truly serial mode	2024-11-22 09:47:49 +01:00
Christian Schwarz	c1040bc25d	task-based mode	2024-11-22 09:36:45 +01:00
Christian Schwarz	a3d1cf636b	config changes to express pipelining config (not respected yet)	2024-11-22 08:36:17 +01:00
Christian Schwarz	88fd8aed52	watch-based approach	2024-11-21 23:03:21 +01:00
Christian Schwarz	db9093f938	revert back to 'span fixes' commit	2024-11-21 22:07:05 +01:00
Christian Schwarz	240e48df59	improvements	2024-11-21 21:57:53 +01:00
Christian Schwarz	7680aa12a8	draft	2024-11-21 21:34:58 +01:00
Christian Schwarz	56de07154e	fruitless debugging	2024-11-21 20:46:56 +01:00
Christian Schwarz	73046fdf5b	span fixes	2024-11-21 20:21:55 +01:00
Erik Grinaker	190e8cebac	safekeeper,pageserver: add CPU profiling (#9764 ) ## Problem We don't have a convenient way to gather CPU profiles from a running binary, e.g. during production incidents or end-to-end benchmarks, nor during microbenchmarks (particularly on macOS). We would also like to have continuous profiling in production, likely using [Grafana Cloud Profiles](https://grafana.com/products/cloud/profiles-for-continuous-profiling/). We may choose to use either eBPF profiles or pprof profiles for this (pending testing and discussion with SREs), but pprof profiles appear useful regardless for the reasons listed above. See https://github.com/neondatabase/cloud/issues/14888. This PR is intended as a proof of concept, to try it out in staging and drive further discussions about profiling more broadly. Touches #9534. Touches https://github.com/neondatabase/cloud/issues/14888. ## Summary of changes Adds a HTTP route `/profile/cpu` that takes a CPU profile and returns it. Defaults to a 5-second pprof Protobuf profile for use with e.g. `pprof` or Grafana Alloy, but can also emit an SVG flamegraph. Query parameters: * `format`: output format (`pprof` or `svg`) * `frequency`: sampling frequency in microseconds (default 100) * `seconds`: number of seconds to profile (default 5) Also integrates pprof profiles into Criterion benchmarks, such that flamegraph reports can be taken with `cargo bench ... --profile-duration <seconds>`. Output under `target/criterion//profile/flamegraph.svg`. Example profiles: pprof profile (use [`pprof`](https://github.com/google/pprof)): [profile.pb.gz](https://github.com/user-attachments/files/17756788/profile.pb.gz) * Web interface: `pprof -http :6060 profile.pb.gz` * Interactive flamegraph: [profile.svg.gz](https://github.com/user-attachments/files/17756782/profile.svg.gz)	2024-11-21 18:59:46 +00:00
Christian Schwarz	408bc8fc71	cleanups	2024-11-21 19:42:43 +01:00
Christian Schwarz	345f8b6c3b	fix ready_for_next_batch order	2024-11-21 19:11:57 +01:00
Christian Schwarz	aa1032aeff	no need for cancel & ctx in pagestream_do_batch	2024-11-21 18:40:22 +01:00

1 2 3 4 5 ...

2628 Commits