rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-24 16:40:38 +00:00

Author	SHA1	Message	Date
John Spray	6633332e67	storage controller: tenant scheduling policy (#7262 ) ## Problem In the event of bugs with scheduling or reconciliation, we need to be able to switch this off at a per-tenant granularity. This is intended to mitigate risk of issues with https://github.com/neondatabase/neon/pull/7181, which makes scheduling more involved. Closes: #7103 ## Summary of changes - Introduce a scheduling policy per tenant, with API to set it - Refactor persistent.rs helpers for updating tenants to be more general - Add tests	2024-03-28 14:19:25 +00:00
John Spray	b3b7ce457c	pageserver: remove bare mgr::get_tenant, mgr::list_tenants (#7237 ) ## Problem This is a refactor. This PR was a precursor to a much smaller change `e5bd602dc1`, where as I was writing it I found that we were not far from getting rid of the last non-deprecated code paths that use `mgr::` scoped functions to get at the TenantManager state. We're almost done cleaning this up as per https://github.com/neondatabase/neon/issues/5796. The only significant remaining mgr:: item is `get_active_tenant_with_timeout`, which is page_service's path for fetching tenants. ## Summary of changes - Remove the bool argument to get_attached_tenant_shard: this was almost always false from API use cases, and in cases when it was true, it was readily replacable with an explicit check of the returned tenant's status. - Rather than letting the timeline eviction task query any tenant it likes via `mgr::`, pass an `Arc<Tenant>` into the task. This is still an ugly circular reference, but should eventually go away: either when we switch to exclusively using disk usage eviction, or when we change metadata storage to avoid the need to imitate layer accesses. - Convert all the mgr::get_tenant call sites to use TenantManager::get_attached_tenant_shard - Move list_tenants into TenantManager.	2024-03-26 18:29:08 +00:00
John Spray	6814bb4b59	tests: add a log allow list to stabilize benchmarks (#7251 ) ## Problem https://github.com/neondatabase/neon/pull/7227 destabilized various tests in the performance suite, with log errors during shutdown. It's because we switched shutdown order to stop the storage controller before the pageservers. ## Summary of changes - Tolerate "connection failed" errors from pageservers trying to validation their deletion queue.	2024-03-26 17:44:18 +00:00
John Spray	b3bb1d1cad	storage controller: make direct tenant creation more robust (#7247 ) ## Problem - Creations were not idempotent (unique key violation) - Creations waited for reconciliation, which control plane blocks while an operation is in flight ## Summary of changes - Handle unique key constraint violation as an OK situation: if we're creating the same tenant ID and shard count, it's reasonable to assume this is a duplicate creation. - Make the wait for reconcile during creation tolerate failures: this is similar to location_conf, where the cloud control plane blocks our notification calls until it is done with calling into our API (in future this constraint is expected to relax as the cloud control plane learns to run multiple operations concurrently for a tenant)	2024-03-26 16:57:35 +00:00
John Spray	47d2b3a483	pageserver: limit total ephemeral layer bytes (#7218 ) ## Problem Follows: https://github.com/neondatabase/neon/pull/7182 - Sufficient concurrent writes could OOM a pageserver from the size of indices on all the InMemoryLayer instances. - Enforcement of checkpoint_period only happened if there were some writes. Closes: https://github.com/neondatabase/neon/issues/6916 ## Summary of changes - Add `ephemeral_bytes_per_memory_kb` config property. This controls the ratio of ephemeral layer capacity to memory capacity. The weird unit is to enable making the ratio less than 1:1 (set this property to 1024 to use 1MB of ephemeral layers for every 1MB of RAM, set it smaller to get a fraction). - Implement background layer rolling checks in Timeline::compaction_iteration -- this ensures we apply layer rolling policy in the absence of writes. - During background checks, if the total ephemeral layer size has exceeded the limit, then roll layers whose size is greater than the mean size of all ephemeral layers. - Remove the tick() path from walreceiver: it isn't needed any more now that we do equivalent checks from compaction_iteration. - Add tests for the above. --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2024-03-26 15:45:32 +00:00
John Spray	8dfe3a070c	pageserver: return 429 on timeline creation in progress (#7225 ) ## Problem Currently, we return 409 (Conflict) in two cases: - Temporary: Timeline creation cannot proceed because another timeline with the same ID is being created - Permanent: Timeline creation cannot proceed because another timeline exists with different parameters but the same ID. Callers which time out a request and retry should be able to distinguish these cases. Closes: #7208 ## Summary of changes - Expose `AlreadyCreating` errors as 429 instead of 409	2024-03-26 15:20:05 +00:00
Alexander Bayandin	3426619a79	test_runner/performance: skip test_bulk_insert (#7238 ) ## Problem `test_bulk_insert` becomes too slow, and it fails constantly: https://github.com/neondatabase/neon/issues/7124 ## Summary of changes - Skip `test_bulk_insert` until it's fixed	2024-03-26 15:10:15 +00:00
Christian Schwarz	ad072de420	Revert "pageserver: use a single tokio runtime (#6555 )" (#7246 )	2024-03-26 15:24:18 +01:00
John Spray	5dee58f492	tests: wait for uploads in test_secondary_downloads (#7220 ) ## Problem - https://github.com/neondatabase/neon/issues/6966 This test occasionally failed with some layers unexpectedly not present on the secondary pageserver. The issue in that failure is the attached pageserver uploading heatmaps that refer to not-yet-uploaded layers. ## Summary of changes After uploading heatmap, drain upload queue on attached pageserver, to guarantee that all the layers referenced in the haetmap are uploaded.	2024-03-26 10:59:16 +00:00
John Spray	6313f1fa7a	tests: tolerate transient unavailability in test_sharding_split_failures (#7223 ) ## Problem While most forms of split rollback don't interrupt clients, there are a couple of cases that do -- this interruption is brief, driven by the time it takes the controller to kick off Reconcilers during the async abort of the split, so it's operationally fine, but can trip up a test. - #7148 ## Summary of changes - Relax test check to require that the tenant is eventually available after split failure, rather than immediately. In the vast majority of cases this will pass on the first iteration.	2024-03-26 09:56:47 +00:00
George Ma	d837ce0686	chore: remove repetitive words (#7206 ) Signed-off-by: availhang <mayangang@outlook.com>	2024-03-25 11:43:02 -04:00
John Spray	2713142308	tests: stabilize compat tests (#7227 ) This test had two flaky failure modes: - pageserver log error for timeline not found: this resulted from changes for DR when timeline destroy/create was added, but endpoint was left running during that operation. - storage controller log error because the test was running for long enough that a background reconcile happened at almost the exact moment of test teardown, and our test fixtures tear down the pageservers before the controller. Closes: #7224	2024-03-25 14:35:24 +00:00
John Spray	adb0526262	pageserver: track total ephemeral layer bytes (#7182 ) ## Problem Large quantities of ephemeral layer data can lead to excessive memory consumption (https://github.com/neondatabase/neon/issues/6939). We currently don't have a way to know how much ephemeral layer data is present on a pageserver. Before we can add new behaviors to proactively roll layers in response to too much ephemeral data, we must calculate that total. Related: https://github.com/neondatabase/neon/issues/6916 ## Summary of changes - Create GlobalResources and GlobalResourceUnits types, where timelines carry a GlobalResourceUnits in their TimelineWriterState. - Periodically update the size in GlobalResourceUnits: - During tick() - During layer roll - During put() if the latest value has drifted more than 10MB since our last update - Expose the value of the global ephemeral layer bytes counter as a prometheus metric. - Extend the lifetime of TimelineWriterState: - Instead of dropping it in TimelineWriter::drop, let it remain. - Drop TimelineWriterState in roll_layer: this drops our guard on the global byte count to reflect the fact that we're freezing the layer. - Ensure the validity of the later in the writer state by clearing the state in the same place we freeze layers, and asserting on the write-ability of the layer in `writer()` - Add a 'context' parameter to `get_open_layer_action` so that it can skip the prev_lsn==lsn check when called in tick() -- this is needed because now tick is called with a populated state, where prev_lsn==Some(lsn) is true for an idle timeline. - Extend layer rolling test to use this metric	2024-03-25 11:52:50 +00:00
John Spray	0099dfa56b	storage controller: tighten up secrets handling (#7105 ) - Remove code for using AWS secrets manager, as we're deploying with k8s->env vars instead - Load each secret independently, so that one can mix CLI args with environment variables, rather than requiring that all secrets are loaded with the same mechanism. - Add a 'strict mode', enabled by default, which will refuse to start if secrets are not loaded. This avoids the risk of accidentially disabling auth by omitting the public key, for example	2024-03-25 11:52:33 +00:00
Vlad Lazar	3a4ebfb95d	test: fix `test_pageserver_recovery` flakyness (#7207 ) ## Problem We recently introduced log file validation for the storage controller. The heartbeater will WARN when it fails for a node, hence the test fails. Closes https://github.com/neondatabase/neon/issues/7159 ## Summary of changes * Warn only once for each set of heartbeat retries * Allow list heartbeat warns	2024-03-25 09:38:12 +00:00
Christian Schwarz	3220f830b7	pageserver: use a single tokio runtime (#6555 ) Before this PR, each core had 3 executor threads from 3 different runtimes. With this PR, we just have one runtime, with one thread per core. Switching to a single tokio runtime should reduce that effective over-commit of CPU and in theory help with tail latencies -- iff all tokio tasks are well-behaved and yield to the runtime regularly. Are All Tasks Well-Behaved? Are We Ready? ----------------------------------------- Sadly there doesn't seem to be good out-of-the box tokio tooling to answer this question. We believe all tasks are well behaved in today's code base, as of the switch to `virtual_file_io_engine = "tokio-epoll-uring"` in production (https://github.com/neondatabase/aws/pull/1121). The only remaining executor-thread-blocking code is walredo and some filesystem namespace operations. Filesystem namespace operations work is being tracked in #6663 and not considered likely to actually block at this time. Regarding walredo, it currently does a blocking `poll` for read/write to the pipe file descriptors we use for IPC with the walredo process. There is an ongoing experiment to make walredo async (#6628), but it needs more time because there are surprisingly tricky trade-offs that are articulated in that PR's description (which itself is still WIP). What's relevant for this PR is that 1. walredo is always CPU-bound 2. production tail latencies for walredo request-response (`pageserver_wal_redo_seconds_bucket`) are - p90: with few exceptions, low hundreds of micro-seconds - p95: except on very packed pageservers, below 1ms - p99: all below 50ms, vast majority below 1ms - p99.9: almost all around 50ms, rarely at >= 70ms - [Dashboard Link](https://neonprod.grafana.net/d/edgggcrmki3uof/2024-03-walredo-latency?orgId=1&var-ds=ZNX49CDVz&var-pXX_by_instance=0.9&var-pXX_by_instance=0.99&var-pXX_by_instance=0.95&var-adhoc=instance%7C%21%3D%7Cpageserver-30.us-west-2.aws.neon.tech&var-per_instance_pXX_max_seconds=0.0005&from=1711049688777&to=1711136088777) The ones below 1ms are below our current threshold for when we start thinking about yielding to the executor. The tens of milliseconds stalls aren't great, but, not least because of the implicit overcommit of CPU by the three runtimes, we can't be sure whether these tens of milliseconds are inherently necessary to do the walredo work or whether we could be faster if there was less contention for CPU. On the first item (walredo being always CPU-bound work): it means that walredo processes will always compete with the executor threads. We could yield, using async walredo, but then we hit the trade-offs explained in that PR. tl;dr: the risk of stalling executor threads through blocking walredo seems low, and switching to one runtime cleans up one potential source for higher-than-necessary stall times (explained in the previous paragraphs). Code Changes ------------ - Remove the 3 different runtime definitions. - Add a new definition called `THE_RUNTIME`. - Use it in all places that previously used one of the 3 removed runtimes. - Remove the argument from `task_mgr`. - Fix failpoint usage where `pausable_failpoint!` should have been used. We encountered some actual failures because of this, e.g., hung `get_metric()` calls during test teardown that would client-timeout after 300s. As indicated by the comment above `THE_RUNTIME`, we could take this clean-up further. But before we create so much churn, let's first validate that there's no perf regression. Performance ----------- We will test this in staging using the various nightly benchmark runs. However, the worst-case impact of this change is likely compaction (=>image layer creation) competing with compute requests. Image layer creation work can't be easily generated & repeated quickly by pagebench. So, we'll simply watch getpage & basebackup tail latencies in staging. Additionally, I have done manual benchmarking using pagebench. Report: https://neondatabase.notion.site/2024-03-23-oneruntime-change-benchmarking-22a399c411e24399a73311115fb703ec?pvs=4 Tail latencies and throughput are marginally better (no regression = good). Except in a workload with 128 clients against one tenant. There, the p99.9 and p99.99 getpage latency is about 2x worse (at slightly lower throughput). A dip in throughput every 20s (compaction_period_ is clearly visible, and probably responsible for that worse tail latency. This has potential to improve with async walredo, and is an edge case workload anyway. Future Work ----------- 1. Once this change has shown satisfying results in production, change the codebase to use the ambient runtime instead of explicitly referencing `THE_RUNTIME`. 2. Have a mode where we run with a single-threaded runtime, so we uncover executor stalls more quickly. 3. Switch or write our own failpoints library that is async-native: https://github.com/neondatabase/neon/issues/7216	2024-03-23 19:25:11 +01:00
Alex Chi Z	643683f41a	fixup(#7204 / postgres): revert `IsPrimaryAlive` checks (#7209 ) Fix #7204. https://github.com/neondatabase/postgres/pull/400 https://github.com/neondatabase/postgres/pull/401 https://github.com/neondatabase/postgres/pull/402 These commits never go into prod. Detailed investigation will be posted in another issue. Reverting the commits so that things can keep running in prod. This pull request adds the test to start two replicas. It fails on the current main https://github.com/neondatabase/neon/pull/7210 but passes in this pull request. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-03-23 01:01:51 +00:00
John Spray	1787cf19e3	pageserver: write consumption metrics to S3 (#7200 ) ## Problem The service that receives consumption metrics has lower availability than S3. Writing metrics to S3 improves their availability. Closes: https://github.com/neondatabase/cloud/issues/9824 ## Summary of changes - The same data as consumption metrics POST bodies is also compressed and written to an S3 object with a timestamp-formatted path. - Set `metric_collection_bucket` (same format as `remote_storage` config) to configure the location to write to	2024-03-22 14:52:14 +00:00
John Spray	62b318c928	Fix ephemeral file warning on secondaries (#7201 ) A test was added which exercises secondary locations more, and there was a location in the secondary downloader that warned on ephemeral files. This was intended to be fixed in this faulty commit: `8cea866adf`	2024-03-22 10:10:28 +00:00
John Spray	06cb582d91	pageserver: extend /re-attach response to include tenant mode (#6941 ) This change improves the resilience of the system to unclean restarts. Previously, re-attach responses only included attached tenants - If the pageserver had local state for a secondary location, it would remain, but with no guarantee that it was still _meant_ to be there. After this change, the pageserver will only retain secondary locations if the /re-attach response indicates that they should still be there. - If the pageserver had local state for an attached location that was omitted from a re-attach response, it would be entirely detached. This is wasteful in a typical HA setup, where an offline node's tenants might have been re-attached elsewhere before it restarts, but the offline node's location should revert to a secondary location rather than being wiped. Including secondary tenants in the re-attach response enables the pageserver to avoid throwing away local state unnecessarily. In this PR: - The re-attach items are extended with a 'mode' field. - Storage controller populates 'mode' - Pageserver interprets it (default is attached if missing) to construct either a SecondaryTenant or a Tenant. - A new test exercises both cases.	2024-03-21 13:39:23 +00:00
John Spray	59cdee749e	storage controller: fixes to secondary location handling (#7169 ) Stacks on: - https://github.com/neondatabase/neon/pull/7165 Fixes while working on background optimization of scheduling after a split: - When a tenant has secondary locations, we weren't detaching the parent shards' secondary locations when doing a split - When a reconciler detaches a location, it was feeding back a locationconf with `Detached` mode in its `observed` object, whereas it should omit that location. This could cause the background reconcile task to keep kicking off no-op reconcilers forever (harmless but annoying). - During shard split, we were scheduling secondary locations for the child shards, but no reconcile was run for these until the next time the background reconcile task ran. Creating these ASAP is useful, because they'll be used shortly after a shard split as the destination locations for migrating the new shards to different nodes.	2024-03-21 12:06:57 +00:00
Vlad Lazar	c75b584430	storage_controller: add metrics (#7178 ) ## Problem Storage controller had basically no metrics. ## Summary of changes 1. Migrate the existing metrics to use Conrad's [`measured`](https://docs.rs/measured/0.0.14/measured/) crate. 2. Add metrics for incoming http requests 3. Add metrics for outgoing http requests to the pageserver 4. Add metrics for outgoing pass through requests to the pageserver 5. Add metrics for database queries Note that the metrics response for the attachment service does not use chunked encoding like the rest of the metrics endpoints. Conrad has kindly extended the crate such that it can now be done. Let's leave it for a follow-up since the payload shouldn't be that big at this point. Fixes https://github.com/neondatabase/neon/issues/6875	2024-03-21 12:00:20 +00:00
Arpad Müller	34fa34d15c	Dump layer map json in test_gc_feedback.py (#7179 ) The layer map json is an interesting file for that test, so dump it to make debugging easier.	2024-03-20 18:39:46 +00:00
John Spray	2726b1934e	pageserver: extra debug for test_secondary_downloads failures (#7183 ) - Enable debug logs for this test - Add some debug logging detail in downloader.rs - Add an info-level message in scheduler.rs that makes it obvious if a command is waiting for an existing task rather than spawning a new one.	2024-03-20 18:07:45 +00:00
Vlad Lazar	4ba3f3518e	test: fix on demand activation test flakyness (#7180 ) Warm-up (and the "tenant startup complete" metric update) happens in a background tokio task. The tenant map is eagerly updated (can happen before the task finishes). The test assumed that if the tenant map was updated, then the metric should reflect that. That's not the case, so we tweak the test to wait for the metric. Fixes https://github.com/neondatabase/neon/issues/7158	2024-03-20 10:24:59 +00:00
John Spray	a5d5c2a6a0	storage controller: tech debt (#7165 ) This is a mixed bag of changes split out for separate review while working on other things, and batched together to reduce load on CI runners. Each commits stands alone for review purposes: - do_tenant_shard_split was a long function and had a synchronous validation phase at the start that could readily be pulled out into a separate function. This also avoids the special casing of ApiError::BadRequest when deciding whether an abort is needed on errors - Add a 'describe' API (GET on tenant ID) that will enable storcon-cli to see what's going on with a tenant - the 'locate' API wasn't really meant for use in the field. It's for tests: demote it to the /debug/ prefix - The `Single` placement policy was a redundant duplicate of Double(0), and Double was a bad name. Rename it Attached. (https://github.com/neondatabase/neon/issues/7107) - Some neon_local commands were added for debug/demos, which are now replaced by commands in storcon-cli (#7114 ). Even though that's not merged yet, we don't need the neon_local ones any more. Closes https://github.com/neondatabase/neon/issues/7107 ## Backward compat of Single/Double -> `Attached(n)` change A database migration is used to convert any existing values.	2024-03-19 16:08:20 +00:00
John Spray	b80704cd34	tests: log hygiene checks for storage controller (#6710 ) ## Problem As with the pageserver, we should fail tests that emit unexpected log errors/warnings. ## Summary of changes - Refactor existing log checks to be reusable - Run log checks for attachment_service - Add allow lists as needed.	2024-03-19 10:30:33 +00:00
Arthur Petukhovsky	ad5efb49ee	Support backpressure for sharding (#7100 ) Add shard_number to PageserverFeedback and parse it on the compute side. When compute receives a new ps_feedback, it calculates min LSNs among feedbacks from all shards, and uses those LSNs for backpressure. Add `test_sharding_backpressure` to verify that backpressure slows down compute to wait for the slowest shard.	2024-03-18 21:54:44 +00:00
John Spray	9752ad8489	pageserver, controller: improve secondary download APIs for large shards (#7131 ) ## Problem The existing secondary download API relied on the caller to wait as long as it took to complete -- for large shards that could be a long time, so typical clients that might have a baked-in ~30s timeout would have a problem. ## Summary of changes - Take a `wait_ms` query parameter to instruct the pageserver how long to wait: if the download isn't complete in this duration, then 201 is returned instead of 200. - For both 200 and 201 responses, include response body describing download progress, in terms of layers and bytes. This is sufficient for the caller to track how much data is being transferred and log/present that status. - In storage controller live migrations, use this API to apply a much longer outer timeout, with smaller individual per-request timeouts, and log the progress of the downloads. - Add a test that injects layer download delays to exercise the new behavior	2024-03-15 19:45:58 +00:00
John Spray	1aa159acca	pageserver: cancellation for remote ops in tenant deletion on shutdown (#6105 ) ## Problem Tenant deletion had a couple of TODOs where we weren't using proper cancellation tokens that would have aborted the deletions during process shutdown. ## Summary of changes - Refactor enough that deletion/shutdown code has access to the TenantManager's cancellation toke - Use that cancellation token in tenant deletion instead of dummy tokens.	2024-03-15 18:03:49 +00:00
Christian Schwarz	60f30000ef	tokio-epoll-uring: fallback to std-fs if not available & not explicitly requested (#7120 ) fixes https://github.com/neondatabase/neon/issues/7116 Changes: - refactor PageServerConfigBuilder: support not-set values - implement runtime feature test - use runtime feature test to determine `virtual_file_io_engine` if not explicitly configured in the config - log the effective engine at startup - drive-by: improve assertion messages in `test_pageserver_init_node_id` This needed a tiny bit of tokio-epoll-uring work, hence bumping it. Changelog: ``` git log --no-decorate --oneline --reverse 868d2c42b5d54ca82fead6e8f2f233b69a540d3e..342ddd197a060a8354e8f11f4d12994419fff939 c7a74c6 Bump mio from 0.8.8 to 0.8.11 4df3466 Bump mio from 0.8.8 to 0.8.11 (#47) 342ddd1 lifecycle: expose `LaunchResult` enum (#49) ```	2024-03-15 17:46:04 +00:00
John Spray	bc1efa827f	pageserver: exclude gc_horizon from synthetic size calculation (#6407 ) ## Problem See: - https://github.com/neondatabase/neon/issues/6374 ## Summary of changes Whereas previously we calculated synthetic size from the gc_horizon or the pitr_interval (whichever is the lower LSN), now we ignore gc_horizon and exclusively start from the `pitr_interval`. This is a more generous calculation for billing, where we do not charge users for data retained due to gc_horizon.	2024-03-15 16:07:36 +00:00
John Spray	22c26d610b	pageserver: remove un-needed "uninit mark" (#5717 ) Switched the order; doing https://github.com/neondatabase/neon/pull/6139 first then can remove uninit marker after. ## Problem Previously, existence of a timeline directory was treated as evidence of the timeline's logical existence. That is no longer the case since we treat remote storage as the source of truth on each startup: we can therefore do without this mark file. The mark file had also been used as a pseudo-lock to guard against concurrent creations of the same TimelineId -- now that persistence is no longer required, this is a bit unwieldy. In #6139 the `Tenant::timelines_creating` was added to protect against concurrent creations on the same TimelineId, making the uninit mark file entirely redundant. ## Summary of changes - Code that writes & reads mark file is removed - Some nearby `pub` definitions are amended to `pub(crate)` - `test_duplicate_creation` is added to demonstrate that mutual exclusion of creations still works.	2024-03-15 17:23:05 +02:00
John Spray	6443dbef90	tests: extend log allow list for test_sharding_split_failures (#7134 ) Failure types that panic the storage controller can cause unlucky pageservers to emit log warnings that they can't reach the generation validation API: https://neon-github-public-dev.s3.amazonaws.com/reports/main/8284495687/index.html Tolerate this log message: it's an expected behavior.	2024-03-15 13:18:12 +00:00
Conrad Ludgate	49bc734e02	proxy: add websocket regression tests (#7121 ) ## Problem We have no regression tests for websocket flow ## Summary of changes Add a hacky implementation of the postgres protocol over websockets just to verify the protocol behaviour does not regress over time.	2024-03-15 10:21:48 +01:00
Vlad Lazar	3d8830ac35	test_runner: re-enable large slru benchmark (#7125 ) Previously disabled due to https://github.com/neondatabase/neon/issues/7006.	2024-03-14 16:47:32 +00:00
Vlad Lazar	38767ace68	storage_controller: periodic pageserver heartbeats (#7092 ) ## Problem If a pageserver was offline when the storage controller started, there was no mechanism to update the storage controller state when the pageserver becomes active. ## Summary of changes * Add a heartbeater module. The heartbeater must be driven by an external loop. * Integrate the heartbeater into the service. - Extend the types used by the service and scheduler to keep track of a nodes' utilisation score. - Add a background loop to drive the heartbeater and update the state based on the deltas it generated - Do an initial round of heartbeats at start-up	2024-03-14 15:21:36 +00:00
Christian Schwarz	8075f0965a	fix(test suite) `virtual_file_io_engine` and `get_vectored_impl` patametrization doesn't work (#7113 ) # Problem While investigating #7124, I noticed that the benchmark was always using the `DEFAULT_*` `virtual_file_io_engine` , i.e., `tokio-epoll-uring` as of https://github.com/neondatabase/neon/pull/7077. The fundamental problem is that the `control_plane` code has its own view of `PageServerConfig`, which, I believe, will always be a subset of the real pageserver's `pageserver/src/config.rs`. For the `virtual_file_io_engine` and `get_vectored_impl` parametrization of the test suite, we were constructing a dict on the Python side that contained these parameters, then handed it to `control_plane::PageServerConfig`'s derived `serde::Deserialize`. The default in serde is to ignore unknown fields, so, the Deserialize impl silently ignored the fields. In consequence, the fields weren't propagated to the `pageserver --init` call, and the tests ended up using the `pageserver/src/config.rs::DEFAULT_` values for the respective options all the time. Tests that explicitly used overrides in `env.pageserver.start()` and similar were not affected by this. But, it means that all the test suite runs where with parametrization didn't properly exercise the code path. # Changes - use `serde(deny_unknown_fields)` to expose the problem - With this change, the Python tests that override `virtual_file_io_engine` and `get_vectored_impl` fail on `pageserver --init`, exposing the problem. - use destructuring to uncover the issue in the future - fix the issue by adding the missing fields to the `control_plane` crate's `PageServerConf` - A better solution would be for control plane to re-use a struct provided by the pageserver crate, so that everything is in one place in `pageserver/src/config.rs`, but, our config parsing code is (almost) beyond repair anyways. - fix the `pageserver_virtual_file_io_engine` to be responsive to the env var - => required to make parametrization work in benchmarks # Testing Before merging this PR, I re-ran the regression tests & CI with the full matrix of `virtual_file_io_engine` and `tokio-epoll-uring`, see `9c7ea364e0`	2024-03-14 11:18:55 +00:00
John Spray	44f42627dd	pageserver/controller: error handling for shard splitting (#7074 ) ## Problem Shard splits worked, but weren't safe against failures (e.g. node crash during split) yet. Related: #6676 ## Summary of changes - Introduce async rwlocks at the scope of Tenant and Node: - exclusive tenant lock is used to protect splits - exclusive node lock is used to protect new reconciliation process that happens when setting node active - exclusive locks used in both cases when doing persistent updates (e.g. node scheduling conf) where the update to DB & in-memory state needs to be atomic. - Add failpoints to shard splitting in control plane and pageserver code. - Implement error handling in control plane for shard splits: this detaches child chards and ensures parent shards are re-attached. - Crash-safety for storage controller restarts requires little effort: we already reconcile with nodes over a storage controller restart, so as long as we reset any incomplete splits in the DB on restart (added in this PR), things are implicitly cleaned up. - Implement reconciliation with offline nodes before they transition to active: - (in this context reconciliation means something like startup_reconcile, not literally the Reconciler) - This covers cases where split abort cannot reach a node to clean it up: the cleanup will eventually happen when the node is marked active, as part of reconciliation. - This also covers the case where a node was unavailable when the storage controller started, but becomes available later: previously this allowed it to skip the startup reconcile. - Storage controller now terminates on panics. We only use panics for true "should never happen" assertions, and these cases can leave us in an un-usable state if we keep running (e.g. panicking in a shard split). In the unlikely event that we get into a crashloop as a result, we'll rely on kubernetes to back us off. - Add `test_sharding_split_failures` which exercises a variety of failure cases during shard split.	2024-03-14 09:11:57 +00:00
Conrad Ludgate	3bd6551b36	proxy http cancellation safety (#7117 ) ## Problem hyper auto-cancels the request futures on connection close. `sql_over_http::handle` is not 'drop cancel safe', so we need to do some other work to make sure connections are queries in the right way. ## Summary of changes 1. tokio::spawn the request handler to resolve the initial cancel-safety issue 2. share a cancellation token, and cancel it when the request `Service` is dropped. 3. Add a new log span to be able to track the HTTP connection lifecycle.	2024-03-14 08:20:56 +00:00
John Spray	1b41db8bdd	pageserver: enable setting stripe size inline with split request. (#7093 ) ## Summary - Currently we can set stripe size at tenant creation, but it doesn't mean anything until we have multiple shards - When onboarding an existing tenant, it will always get a default shard stripe size, so we would like to be able to pick the actual stripe size at the point we split. ## Why do this inline with a split? The alternative to this change would be to have a separate endpoint on the storage controller for setting the stripe size on a tenant, and only permit writes to that endpoint when the tenant has only a single shard. That would work, but be a little bit more work for a client, and not appreciably simpler (instead of having a special argument to the split functions, we'd have a special separate endpoint, and a requirement that the controller must sync its config down to the pageserver before calling the split API). Either approach would work, but this one feels a bit more robust end-to-end: the split API is the _very last moment_ that the stripe size is mutable, so if we aim to set it before splitting, it makes sense to do it as part of the same operation.	2024-03-12 20:41:08 +00:00
John Spray	7ae8364b0b	storage controller: register nodes in re-attach request (#7040 ) ## Problem Currently we manually register nodes with the storage controller, and use a script during deploy to register with the cloud control plane. Rather than extend that script further, nodes should just register on startup. ## Summary of changes - Extend the re-attach request to include an optional NodeRegisterRequest - If the `register` field is set, handle it like a normal node registration before executing the normal re-attach work. - Update tests/neon_local that used to rely on doing an explicit register step that could be enabled/disabled. --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-03-12 14:47:12 +00:00
Conrad Ludgate	09699d4bd8	proxy: cancel http queries on timeout (#7031 ) ## Problem On HTTP query timeout, we should try and cancel the current in-flight SQL query. ## Summary of changes Trigger a cancellation command in postgres once the timeout is reach	2024-03-12 11:52:00 +00:00
John Spray	89cf714890	tests/neon_local: rename "attachment service" -> "storage controller" (#7087 ) Not a user-facing change, but can break any existing `.neon` directories created by neon_local, as the name of the database used by the storage controller changes. This PR changes all the locations apart from the path of `control_plane/attachment_service` (waiting for an opportune moment to do that one, because it's the most conflict-ish wrt ongoing PRs like #6676 )	2024-03-12 11:36:27 +00:00
Heikki Linnakangas	621ea2ec44	tests: try to make restored-datadir comparison tests not flaky v2 This test occasionally fails with a difference in "pg_xact/0000" file between the local and restored datadirs. My hypothesis is that something changed in the database between the last explicit checkpoint and the shutdown. I suspect autovacuum, it could certainly create transactions. To fix, be more precise about the point in time that we compare. Shut down the endpoint first, then read the last LSN (i.e. the shutdown checkpoint's LSN), from the local disk with pg_controldata. And use exactly that LSN in the basebackup. Closes #559	2024-03-11 23:29:32 +04:00
Heikki Linnakangas	74d09b78c7	Keep walproposer alive until shutdown checkpoint is safe on safekepeers The walproposer pretends to be a walsender in many ways. It has a WalSnd slot, it claims to be a walsender by calling MarkPostmasterChildWalSender() etc. But one different to real walsenders was that the postmaster still treated it as a bgworker rather than a walsender. The difference is that at shutdown, walsenders are not killed until the very end, after the checkpointer process has written the shutdown checkpoint and exited. As a result, the walproposer always got killed before the shutdown checkpoint was written, so the shutdown checkpoint never made it to safekeepers. That's fine in principle, we don't require a clean shutdown after all. But it also feels a bit silly not to stream the shutdown checkpoint. It could be useful for initializing hot standby mode in a read replica, for example. Change postmaster to treat background workers that have called MarkPostmasterChildWalSender() as walsenders. That unfortunately requires another small change in postgres core. After doing that, walproposers stay alive longer. However, it also means that the checkpointer will wait for the walproposer to switch to WALSNDSTATE_STOPPING state, when the checkpointer sends the PROCSIG_WALSND_INIT_STOPPING signal. We don't have the machinery in walproposer to receive and handle that signal reliably. Instead, we mark walproposer as being in WALSNDSTATE_STOPPING always. In commit `568f91420a`, I assumed that shutdown will wait for all the remaining WAL to be streamed to safekeepers, but before this commit that was not true, and the test became flaky. This should make it stable again. Some tests wrongly assumed that no WAL could have been written between pg_current_wal_flush_lsn and quick pg stop after it. Fix them by introducing flush_ep_to_pageserver which first stops the endpoint and then waits till all committed WAL reaches the pageserver. In passing extract safekeeper http client to its own module.	2024-03-11 23:29:32 +04:00
Christian Schwarz	2b0f3549f7	default to tokio-epoll-uring in CI tests & on Linux (#7077 ) All of production is using it now as of https://github.com/neondatabase/aws/pull/1121 The change in `flaky_tests.py` resets the flakiness detection logic. The alternative would have been to repeat the choice of io engine in each test name, which would junk up the various test reports too much. --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2024-03-11 14:35:59 +00:00
Joonas Koivunen	b09d686335	fix: on-demand downloads can outlive timeline shutdown (#7051 ) ## Problem Before this PR, it was possible that on-demand downloads were started after `Timeline::shutdown()`. For example, we have observed a walreceiver-connection-handler-initiated on-demand download that was started after `Timeline::shutdown()`s final `task_mgr::shutdown_tasks()` call. The underlying issue is that `task_mgr::shutdown_tasks()` isn't sticky, i.e., new tasks can be spawned during or after `task_mgr::shutdown_tasks()`. Cc: https://github.com/neondatabase/neon/issues/4175 in lieu of a more specific issue for task_mgr. We already decided we want to get rid of it anyways. Original investigation: https://neondb.slack.com/archives/C033RQ5SPDH/p1709824952465949 ## Changes - enter gate while downloading - use timeline cancellation token for cancelling download thereby, fixes #7054 Entering the gate might also remove recent "kept the gate from closing" in staging.	2024-03-09 13:09:08 +00:00
Sasha Krassovsky	4834d22d2d	Revoke REPLICATION (#7052 ) ## Problem Currently users can cause problems with replication ## Summary of changes Don't let them replicate	2024-03-08 22:24:30 +00:00
Anastasia Lubennikova	86e8c43ddf	Add downgrade scripts for neon extension. (#7065 ) ## Problem When we start compute with newer version of extension (i.e. 1.2) and then rollback the release, downgrading the compute version, next compute start will try to update extension to the latest version available in neon.control (i.e. 1.1). Thus we need to provide downgrade scripts like neon--1.2--1.1.sql These scripts must revert the changes made by the upgrade scripts in the reverse order. This is necessary to ensure that the next upgrade will work correctly. In general, we need to write upgrade and downgrade scripts to be more robust and add IF EXISTS / CREATE OR REPLACE clauses to all statements (where applicable). ## Summary of changes Adds downgrade scripts. Adds test cases for extension downgrade/upgrade. fixes #7066 This is a follow-up for https://app.incident.io/neondb/incidents/167?tab=follow-ups Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Alex Chi Z <iskyzh@gmail.com> Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>	2024-03-08 20:42:35 +00:00

1 2 3 4 5 ...

1177 Commits