rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2025-12-22 21:59:59 +00:00

Author	SHA1	Message	Date
John Spray	66fc465484	Clean up 'attachment service' names to storage controller (#7326 ) The binary etc were renamed some time ago, but the path in the source tree remained "attachment_service" to avoid disruption to ongoing PRs. There aren't any big PRs out right now, so it's a good time to cut over. - Rename `attachment_service` to `storage_controller` - Move it to the top level for symmetry with `storage_broker` & to avoid mixing the non-prod neon_local stuff (`control_plane/`) with the storage controller which is a production component.	2024-04-05 16:18:00 +01:00
Conrad Ludgate	55da8eff4f	proxy: report metrics based on cold start info (#7324 ) ## Problem Would be nice to have a bit more info on cold start metrics. ## Summary of changes * Change connect compute latency to include `cold_start_info`. * Update `ColdStartInfo` to include HttpPoolHit and WarmCached. * Several changes to make more use of interned strings	2024-04-05 16:14:50 +01:00
Arpad Müller	0fa517eb80	Update test-context dependency to 0.3 (#7303 ) Updates the `test-context` dev-dependency of the `remote_storage` crate to 0.3. This removes a lot of `async_trait` instances. Related earlier work: #6305, #6464	2024-04-05 15:53:29 +02:00
Arthur Petukhovsky	8ceb4f0a69	Fix partial zero segment upload (#7318 ) Found these logs on staging safekeepers: ``` INFO Partial backup{ttid=X/Y}: failed to upload 000000010000000000000000_173_0000000000000000_0000000000000000_sk56.partial: Failed to open file "/storage/safekeeper/data/X/Y/000000010000000000000000.partial" for wal backup: No such file or directory (os error 2) INFO Partial backup{ttid=X/Y}:upload{name=000000010000000000000000_173_0000000000000000_0000000000000000_sk56.partial}: starting upload PartialRemoteSegment { status: InProgress, name: "000000010000000000000000_173_0000000000000000_0000000000000000_sk56.partial", commit_lsn: 0/0, flush_lsn: 0/0, term: 173 } ``` This is because partial backup tries to upload zero segment when there is no data in timeline. This PR fixes this bug introduced in #6530.	2024-04-05 11:48:08 +01:00
John Spray	6019ccef06	tests: extend log allow list in test_storcon_cli (#7321 ) This test was occasionally flaky: it already allowed the log for the scheduler complaining about Stop state, but not the log for maybe_reconcile complaining.	2024-04-05 11:44:15 +01:00
John Spray	0c6367a732	storage controller: fix repeated location_conf returning no shards (#7314 ) ## Problem When a location_conf request was repeated with no changes, we failed to build the list of shards in the result. ## Summary of changes Remove conditional that only generated a list of updates if something had really changed. This does some redundant database updates, but it is preferable to having a whole separate code path for no-op changes. --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2024-04-04 17:34:05 +00:00
John Spray	e17bc6afb4	pageserver: update mgmt_api to use TenantShardId (#7313 ) ## Problem The API client was written around the same time as some of the server APIs changed from TenantId to TenantShardId Closes: https://github.com/neondatabase/neon/issues/6154 ## Summary of changes - Refactor mgmt_api timeline_info and keyspace methods to use TenantShardId to match the server This doesn't make pagebench sharding aware, but it paves the way to do so later.	2024-04-04 18:23:45 +01:00
John Spray	ac7fc6110b	pageserver: handle WAL gaps on sharded tenants (#6788 ) ## Problem In the test for https://github.com/neondatabase/neon/pull/6776, a test cases uses tiny layer sizes and tiny stripe sizes. This hits a scenario where a shard's checkpoint interval spans a region where none of the content in the WAL is ingested by this shard. Since there is no layer to flush, we do not advance disk_consistent_lsn, and this causes the test to fail while waiting for LSN to advance. ## Summary of changes - Pass an LSN through `layer_flush_start_tx`. This is the LSN to which we have frozen at the time we ask the flush to flush layers frozen up to this point. - In the layer flush task, if the layers we flush do not reach `frozen_to_lsn`, then advance disk_consistent_lsn up to this point. - In `maybe_freeze_ephemeral_layer`, handle the case where last_record_lsn has advanced without writing a layer file: this ensures that disk_consistent_lsn and remote_consistent_lsn advance anyway. The net effect is that the disk_consistent_lsn is allowed to advance past regions in the WAL where a shard ingests no data, and that we uphold our guarantee that remote_consistent_lsn always eventually reaches the tip of the WAL. The case of no layer at all is hard to test at present due to >0 shards being polluted with SLRU writes, but I have tested it locally with a branch that disables SLRU writes on shards >0. We can tighten up the testing on this in future as/when we refine shard filtering (currently shards >0 need the SLRU because they use it to figure out cutoff in GC using timestamp-to-lsn).	2024-04-04 16:54:38 +00:00
John Spray	862a6b7018	pageserver: timeout on deletion queue flush in timeline deletion (#7315 ) Some time ago, we had an issue where a deletion queue hang was also causing timeline deletions to hang. This was unnecessary because the timeline deletion doesn't _need_ to flush the deletion queue, it just does it as a pleasantry to make the behavior easier to understand and test. In this PR, we wrap the flush calls in a 10 second timeout (typically the flush takes milliseconds) so that in the event of issues with the deletion queue, timeline deletions are slower but not entirely blocked. Closes: https://github.com/neondatabase/neon/issues/6440	2024-04-04 17:51:44 +01:00
Christian Schwarz	4810c22607	fix(walredo spawn): coalescing stalls other executors std::sync::RwLock (#7310 ) part of #6628 Before this PR, we used a std::sync::RwLock to coalesce multiple callers on one walredo spawning. One thread would win the write lock and others would queue up either at the read() or write() lock call. In a scenario where a compute initiates multiple getpage requests from different Postgres backends (= different page_service conns), and we don't have a walredo process around, this means all these page_service handler tasks will enter the spawning code path, one of them will do the spawning, and the others will stall their respective executor thread because they do a blocking read()/write() lock call. I don't know exactly how bad the impact is in reality because posix_spawn uses CLONE_VFORK under the hood, which means that the entire parent process stalls anyway until the child does `exec`, which in turn resumes the parent. But, anyway, we won't know until we fix this issue. And, there's definitely a future way out of stalling the pageserver on posix_spawn, namely, forking template walredo processes that fork again when they need to be per-tenant. This idea is tracked in https://github.com/neondatabase/neon/issues/7320. Changes ------- This PR fixes that scenario by switching to use `heavier_once_cell` for coalescing. There is a comment on the struct field that explains it in a bit more nuance. ### Alternative Design An alternative would be to use tokio::sync::RwLock. I did this in the first commit in this PR branch, before switching to `heavier_once_cell`. Performance ----------- I re-ran the `bench_walredo` and updated the results, showing that the changes are neglible. For the record, the earlier commit in this PR branch that uses `tokio::sync::RwLock` also has updated benchmark numbers, and the results / kinds of tiny regression were equivalent to `heavier_once_cell`. Note that the above doesn't measure performance on the cold path, i.e., when we need to launch the process and coalesce. We don't have a benchmark for that, and I don't expect any significant changes. We have metrics and we log spawn latency, so, we can monitor it in staging & prod. Risks ----- As "usual", replacing a std::sync primitive with something that yields to the executor risks exposing concurrency that was previously implicitly limited to the number of executor threads. This would be the first one for walredo. The risk is that we get descheduled while the reconstruct data is already there. That could pile up reconstruct data. In practice, I think the risk is low because once we get scheduled again, we'll likely have a walredo process ready, and there is no further await point until walredo is complete and the reconstruct data has been dropped. This will change with async walredo PR #6548, and I'm well aware of it in that PR.	2024-04-04 17:54:14 +02:00
Vlad Lazar	9d754e984f	storage_controller: setup sentry reporting (#7311 ) ## Problem No alerting for storage controller is in place. ## Summary of changes Set up sentry for the storage controller.	2024-04-04 13:41:04 +01:00
John Spray	375e15815c	storage controller: grant 'admin' access to all APIs (#7307 ) ## Problem Currently, using `storcon-cli` requires user to select a token with either `pageserverapi` or `admin` scope depending on which endpoint they're using. ## Summary of changes - In check_permissions, permit access with the admin scope even if the required scope is missing. The effect is that an endpoint that required `pageserverapi` now accepts either `pageserverapi` or `admin`, and for the CLI one can simply use an `admin` scope token for everything.	2024-04-04 11:22:08 +00:00
Anna Khanova	7ce613354e	Fix length (#7308 ) ## Problem Bug ## Summary of changes Use `compressed_data.len()` instead of `data.len()`.	2024-04-04 10:29:10 +00:00
Konstantin Knizhnik	ae15acdee7	Fix bug in prefetch cleanup (#7277 ) ## Problem Running test_pageserver_restarts_under_workload in POR #7275 I get the following assertion failure in prefetch: ``` #5 0x00005587220d4bf0 in ExceptionalCondition ( conditionName=0x7fbf24d003c8 "(ring_index) < MyPState->ring_unused && (ring_index) >= MyPState->ring_last", fileName=0x7fbf24d00240 "/home/knizhnik/neon.main//pgxn/neon/pagestore_smgr.c", lineNumber=644) at /home/knizhnik/neon.main//vendor/postgres-v16/src/backend/utils/error/assert.c:66 #6 0x00007fbf24cebc9b in prefetch_set_unused (ring_index=1509) at /home/knizhnik/neon.main//pgxn/neon/pagestore_smgr.c:644 #7 0x00007fbf24cec613 in prefetch_register_buffer (tag=..., force_latest=0x0, force_lsn=0x0) at /home/knizhnik/neon.main//pgxn/neon/pagestore_smgr.c:891 #8 0x00007fbf24cef21e in neon_prefetch (reln=0x5587233b7388, forknum=MAIN_FORKNUM, blocknum=14110) at /home/knizhnik/neon.main//pgxn/neon/pagestore_smgr.c:2055 (gdb) p ring_index $1 = 1509 (gdb) p MyPState->ring_unused $2 = 1636 (gdb) p MyPState->ring_last $3 = 1636 ``` ## Summary of changes Check status of `prefetch_wait_for` ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-04-04 13:28:22 +03:00
Vlad Lazar	c5f64fe54f	tests: reinstate some syntethic size tests (#7294 ) ## Problem `test_empty_tenant_size` was marked `xfail` and a few other tests were skipped. ## Summary of changes Stabilise `test_empty_tenant_size`. This test attempted to disable checkpointing for the postgres instance and expected that the synthetic size remains stable for an empty tenant. When debugging I noticed that postgres was issuing a checkpoint after the transaction in the test (perhaps something changed since the test was introduced). Hence, I relaxed the size check to allow for the checkpoint key written on the pageserver. Also removed the checks for synthetic size inputs since the expected values differ between postgres versions. Closes https://github.com/neondatabase/neon/issues/7138	2024-04-04 09:45:14 +00:00
Conrad Ludgate	40852b955d	update ordered-multimap (#7306 ) ## Problem ordered-multimap was yanked ## Summary of changes `cargo update -p ordered-multimap`	2024-04-04 08:55:43 +00:00
Christian Schwarz	b30b15e7cb	refactor(Timeline::shutdown): rely more on Timeline::cancel; use it from deletion code path (#7233 ) This PR is a fallout from work on #7062. # Changes - Unify the freeze-and-flush and hard shutdown code paths into a single method `Timeline::shutdown` that takes the shutdown mode as an argument. - Replace `freeze_and_flush` bool arg in callers with that mode argument, makes them more expressive. - Switch timeline deletion to use `Timeline::shutdown` instead of its own slightly-out-of-sync copy. - Remove usage of `task_mgr::shutdown_watcher` / `task_mgr::shutdown_token` where possible # Future Work Do we really need the freeze_and_flush? If we could get rid of it, then there'd be no need for a specific shutdown order. Also, if you undo this patch's changes to the `eviction_task.rs` and enable RUST_LOG=debug, it's easy to see that we do leave some task hanging that logs under span `Connection{...}` at debug level. I think it's a pre-existing issue; it's probably a broker client task.	2024-04-03 17:49:54 +02:00
Vlad Lazar	36b875388f	pageserver: replace the locked tenant config with arcsawps (#7292 ) ## Problem For reasons unrelated to this PR, I would like to make use of the tenant conf in the `InMemoryLayer`. Previously, this was not possible without copying and manually updating the copy to keep it in sync with updates. ## Summary of Changes: Replace the `Arc<RwLock<AttachedTenantConf>>` with `Arc<ArcSwap<AttachedTenantConf>>` (how many `Arc(s)` can one fit in a type?). The most interesting part of this change is the updating of the tenant config (`set_new_tenant_config` and `set_new_location_config`). In theory, these two may race, although the storage controller should prevent this via the tenant exclusive op lock. Particular care has been taken to not "lose" a location config update by using the read-copy-update approach when updating only the config.	2024-04-03 16:46:25 +01:00
Arthur Petukhovsky	3f77f26aa2	Upload partial segments (#6530 ) Add support for backing up partial segments to remote storage. Disabled by default, can be enabled with `--partial-backup-enabled`. Safekeeper timeline has a background task which is subscribed to `commit_lsn` and `flush_lsn` updates. After the partial segment was updated (`flush_lsn` was changed), the segment will be uploaded to S3 in about 15 minutes. The filename format for partial segments is `Segment_Term_Flush_Commit_skNN.partial`, where: - `Segment` – the segment name, like `000000010000000000000001` - `Term` – current term - `Flush` – flush_lsn in hex format `{:016X}`, e.g. `00000000346BC568` - `Commit` – commit_lsn in the same hex format - `NN` – safekeeper_id, like `1` The full object name example: `000000010000000000000002_2_0000000002534868_0000000002534410_sk1.partial` Each safekeeper will keep info about remote partial segments in its control file. Code updates state in the control file before doing any S3 operations. This way control file stores information about all potentially existing remote partial segments and can clean them up after uploading a newer version. Closes #6336	2024-04-03 15:20:51 +00:00
John Spray	8b10407be4	pageserver: on-demand activation of tenant on GET tenant status (#7250 ) ## Problem (Follows https://github.com/neondatabase/neon/pull/7237) Some API users will query a tenant to wait for it to activate. Currently, we return the current status of the tenant, whatever that may be. Under heavy load, a pageserver starting up might take a long time to activate such a tenant. ## Summary of changes - In `tenant_status` handler, call wait_to_become_active on the tenant. If the tenant is currently waiting for activation, this causes it to skip the queue, similiar to other API handlers that require an active tenant, like timeline creation. This avoids external services waiting a long time for activation when polling GET /v1/tenant/<id>.	2024-04-03 16:53:43 +03:00
Arpad Müller	944313ffe1	Schedule image layer uploads in tiered compaction (#7282 ) Tiered compaction hasn't scheduled the upload of image layers. In the `test_gc_feedback.py` test this has caused warnings like with tiered compaction: ``` INFO request[...] Deleting layer [...] not found in latest_files list, never uploaded? ``` Which caused errors like: ``` ERROR layer_delete[...] was unlinked but was not dangling ``` Fixes #7244	2024-04-03 13:42:45 +02:00
Joonas Koivunen	d443d07518	wal_ingest: global counter for bytes received (#7240 ) Fixes #7102 by adding a metric for global total received WAL bytes: `pageserver_wal_ingest_bytes_received`.	2024-04-03 13:30:14 +03:00
Christian Schwarz	3de416a016	refactor(walreceiver): eliminate task_mgr usage (#7260 ) We want to move the code base away from task_mgr. This PR refactors the walreceiver code such that it doesn't use `task_mgr` anymore. # Background As a reminder, there are three tasks in a Timeline that's ingesting WAL. `WalReceiverManager`, `WalReceiverConnectionHandler`, and `WalReceiverConnectionPoller`. See the documentation in `task_mgr.rs` for how they interact. Before this PR, cancellation was requested through task_mgr::shutdown_token() and `TaskHandle::shutdown`. Wait-for-task-finish was implemented using a mixture of `task_mgr::shutdown_tasks` and `TaskHandle::shutdown`. This drawing might help: <img width="300" alt="image" src="https://github.com/neondatabase/neon/assets/956573/b6be7ad6-ecb3-41d0-b410-ec85cb8d6d20"> # Changes For cancellation, the entire WalReceiver task tree now has a `child_token()` of `Timeline::cancel`. The `TaskHandle` no longer is a cancellation root. This means that `Timeline::cancel.cancel()` is propagated. For wait-for-task-finish, all three tasks in the task tree hold the `Timeline::gate` open until they exit. The downside of using the `Timeline::gate` is that we can no longer wait for just the walreceiver to shut down, which is particularly relevant for `Timeline::flush_and_shutdown`. Effectively, it means that we might ingest more WAL while the `freeze_and_flush()` call is ongoing. Also, drive-by-fix the assertiosn around task kinds in `wait_lsn`. The check for `WalReceiverConnectionHandler` was ineffective because that never was a task_mgr task, but a TaskHandle task. Refine the assertion to check whether we would wait, and only fail in that case. # Alternatives I contemplated (ab-)using the `Gate` by having a separate `Gate` for `struct WalReceiver`. All the child tasks would use _that_ gate instead of `Timeline::gate`. And `struct WalReceiver` itself would hold an `Option<GateGuard>` of the `Timeline::gate`. Then we could have a `WalReceiver::stop` function that closes the WalReceiver's gate, then drops the `WalReceiver::Option<GateGuard>`. However, such design would mean sharing the WalReceiver's `Gate` in an `Arc`, which seems awkward. A proper abstraction would be to make gates hierarchical, analogous to CancellationToken. In the end, @jcsp and I talked it over and we determined that it's not worth the effort at this time. # Refs part of #7062	2024-04-03 12:28:04 +02:00
John Spray	bc05d7eb9c	pageserver: even more debug for test_secondary_downloads (#7295 ) The latest failures of test_secondary_downloads are spooky: layers are missing on disk according to the test, but present according to the pageserver logs: - Make the pageserver assert that layers are really present on disk and log the full path (debug mode only) - Make the test dump a full listing on failure of the assert that failed the last two times Related: #6966	2024-04-03 11:23:44 +01:00
Conrad Ludgate	d8da51e78a	remove http timeout (#7291 ) ## Problem https://github.com/neondatabase/cloud/issues/11051 additionally, I felt like the http logic was a bit complex. ## Summary of changes 1. Removes timeout for HTTP requests. 2. Split out header parsing to a `HttpHeaders` type. 3. Moved db client handling to `QueryData::process` and `BatchQueryData::process` to simplify the logic of `handle_inner` a bit.	2024-04-03 11:23:26 +01:00
John Spray	6e3834d506	controller: add `storcon-cli` (#7114 ) ## Problem During incidents, we may need to quickly access the storage controller's API without trying API client code or crafting `curl` CLIs on the fly. A basic CLI client is needed for this. ## Summary of changes - Update storage controller node listing API to only use public types in controller_api.rs - Add a storage controller API for listing tenants - Add a basic test that the CLI can list and modify nodes and tenants.	2024-04-03 10:07:56 +00:00
Anna Khanova	582cec53c5	proxy: upload consumption events to S3 (#7213 ) ## Problem If vector is unavailable, we are missing consumption events. https://github.com/neondatabase/cloud/issues/9826 ## Summary of changes Added integration with the consumption bucket.	2024-04-02 21:46:23 +02:00
Vlad Lazar	9957c6a9a0	pageserver: drop the layer map lock after planning reads (#7215 ) ## Problem The vectored read path holds the layer map lock while visiting a timeline. ## Summary of changes * Rework the fringe order to hold `Layer` on `Arc<InMemoryLayer>` handles instead of descriptions that are resolved by the layer map at the time of read. Note that previously `get_values_reconstruct_data` was implemented for the layer description which already knew the lsn range for the read. Now it is implemented on the new `ReadableLayer` handle and needs to get the lsn range as an argument. * Drop the layer map lock after updating the fringe. Related https://github.com/neondatabase/neon/issues/6833	2024-04-02 17:16:15 +01:00
John Spray	a5777bab09	tests: clean up compat test workarounds (#7097 ) - Cleanup from https://github.com/neondatabase/neon/pull/7040#discussion_r1521120263 -- in that PR, we needed to let compat tests manually register a node, because it would run an old binary that doesn't self-register. - Cleanup vectored get config workaround - Cleanup a log allow list for which the underlying log noise has been fixed.	2024-04-02 16:46:24 +01:00
Alexander Bayandin	90a8ff55fa	CI(benchmarking): Add Sharded Tenant for pgbench (#7186 ) ## Problem During Nightly Benchmarks, we want to collect pgbench results for sharded tenants as well. ## Summary of changes - Add pre-created sharded project for pgbench	2024-04-02 14:39:24 +01:00
macdoos	3b95e8072a	test_runner: replace all `.format()` with f-strings (#7194 )	2024-04-02 14:32:14 +01:00
Conrad Ludgate	8ee54ffd30	update tokio 1.37 (#7276 ) ## Problem ## Summary of changes `cargo update -p tokio`. The only risky change I could see is the `tokio::io::split` moving from a spin-lock to a mutex but I think that's ok.	2024-04-02 10:12:54 +01:00
Alex Chi Z	3ab9f56f5f	fixup(#7278/compute_ctl): remote extension download permission (#7280 ) Fix #7278 ## Summary of changes * Explicitly create the extension download directory and assign correct permissoins. * Fix the problem that the extension download failure will cause all future downloads to fail. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-03-29 17:59:30 +00:00
Alex Chi Z	7ddc7b4990	neonvm: add LFC approximate working set size to metrics (#7252 ) ref https://github.com/neondatabase/autoscaling/pull/878 ref https://github.com/neondatabase/autoscaling/issues/872 Add `approximate_working_set_size` to sql exporter so that autoscaling can use it in the future. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Peter Bendel <peterbendel@neon.tech>	2024-03-29 12:11:17 -04:00
John Spray	63213fc814	storage controller: scheduling optimization for sharded tenants (#7181 ) ## Problem - When we scheduled locations, we were doing it without any context about other shards in the same tenant - After a shard split, there wasn't an automatic mechanism to migrate the attachments away from the split location - After a shard split and the migration away from the split location, there wasn't an automatic mechanism to pick new secondary locations so that the end state has no concentration of locations on the nodes where the split happened. Partially completes: https://github.com/neondatabase/neon/issues/7139 ## Summary of changes - Scheduler now takes a `ScheduleContext` object that can be populated with information about other shards - During tenant creation and shard split, we incrementally build up the ScheduleContext, updating it for each shard as we proceed. - When scheduling new locations, the ScheduleContext is used to apply a soft anti-affinity to nodes where a tenant already has shards. - The background reconciler task now has an extra phase `optimize_all`, which runs only if the primary `reconcile_all` phase didn't generate any work. The separation is that `reconcile_all` is needed for availability, but optimize_all is purely "nice to have" work to balance work across the nodes better. - optimize_all calls into two new TenantState methods called optimize_attachment and optimize_secondary, which seek out opportunities to improve placment: - optimize_attachment: if the node where we're currently attached has an excess of attached shard locations for this tenant compared with the node where we have a secondary location, then cut over to the secondary location. - optimize_secondary: if the node holding our secondary location has an excessive number of locations for this tenant compared with some other node where we don't currently have a location, then create a new secondary location on that other node. - a new debug API endpoint is provided to run background tasks on-demand. This returns a number of reconciliations in progress, so callers can keep calling until they get a `0` to advance the system to its final state without waiting for many iterations of the background task. Optimization is run at an implicitly low priority by: - Omitting the phase entirely if reconcile_all has work to do - Skipping optimization of any tenant that has reconciles in flight - Limiting the total number of optimizations that will be run from one call to optimize_all to a constant (currently 2). The idea of that low priority execution is to minimize the operational risk that optimization work overloads any part of the system. It happens to also make the system easier to observe and debug, as we avoid running large numbers of concurrent changes. Eventually we may relax these limitations: there is no correctness problem with optimizing lots of tenants concurrently, and optimizing multiple shards in one tenant just requires housekeeping changes to update ShardContext with the result of one optimization before proceeding to the next shard.	2024-03-28 18:48:52 +00:00
Vlad Lazar	090123a429	pageserver: check for new image layers based on ingested WAL (#7230 ) ## Problem Part of the legacy (but current) compaction algorithm is to find a stack of overlapping delta layers which will be turned into an image layer. This operation is exponential in terms of the number of matching layers and we do it roughly every 20 seconds. ## Summary of changes Only check if a new image layer is required if we've ingested a certain amount of WAL since the last check. The amount of wal is expressed in terms of multiples of checkpoint distance, with the intuition being that that there's little point doing the check if we only have two new L1 layers (not enough to create a new image).	2024-03-28 17:44:55 +00:00
John Spray	39d1818ae9	storage controller: be more tolerant of control plane blocking notifications (#7268 ) ## Problem - Control plane can deadlock if it calls into a function that requires reconciliation to complete, while refusing compute notification hooks API calls. ## Summary of changes - Fail faster in the notify path in 438 errors: these were originally expected to be transient, but in practice it's more common that a 438 results from an operation blocking on the currently API call, rather than something happening in the background. - In ensure_attached, relax the condition for spawning a reconciler: instead of just the general maybe_reconcile path, do a pre-check that skips trying to reconcile if the shard appears to be attached. This avoids doing work in cases where the tenant is attached, but is dirty from a reconciliation point of view, e.g. due to a failed compute notification.	2024-03-28 17:38:08 +00:00
Alex Chi Z	90be79fcf5	spec: allow neon extension auto-upgrade + softfail upgrade (#7231 ) reverts https://github.com/neondatabase/neon/pull/7128, unblocks https://github.com/neondatabase/cloud/issues/10742 --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-03-28 17:22:35 +00:00
Alexander Bayandin	c52b80b930	CI(deploy): Do not deploy storage controller to preprod for proxy releases (#7269 ) ## Problem Proxy release to a preprod automatically triggers a deployment of storage controller (`deployStorageController=true` by default) ## Summary of changes - Set `deployStorageController=false` for proxy releases to preprod - Set explicitly `deployStorageController=true` for storage releases to preprod and prod	2024-03-28 16:51:45 +00:00
Anastasia Lubennikova	722f271f6e	Specify caller in 'unexpected response from page server' error (#7272 ) Tiny improvement for log messages to investigate https://github.com/neondatabase/cloud/issues/11559	2024-03-28 15:28:58 +00:00
Alex Chi Z	be1d8fc4f7	fix: drop replication slot causes postgres stuck on exit (#7192 ) Fix https://github.com/neondatabase/neon/issues/6969 Ref https://github.com/neondatabase/postgres/pull/395 https://github.com/neondatabase/postgres/pull/396 Postgres will stuck on exit if the replication slot is not dropped before shutting down. This is caused by Neon's custom WAL record to record replication slots. The pull requests in the postgres repo fixes the problem, and this pull request bumps the postgres commit. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-03-28 15:24:36 +00:00
Vlad Lazar	25c4b676e0	pageserver: fix oversized key on vectored read (#7259 ) ## Problem During this week's deployment we observed panics due to the blobs for certain keys not fitting in the vectored read buffers. The likely cause of this is a bloated AUX_FILE_KEY caused by logical replication. ## Summary of changes This pr fixes the issue by allocating a buffer big enough to fit the widest read. It also has the benefit of saving space if all keys in the read have blobs smaller than the max vectored read size. If the soft limit for the max size of a vectored read is violated, we print a warning which includes the offending key and lsn. A randomised (but deterministic) end to end test is also added for vectored reads on the delta layer.	2024-03-28 14:27:15 +00:00
John Spray	6633332e67	storage controller: tenant scheduling policy (#7262 ) ## Problem In the event of bugs with scheduling or reconciliation, we need to be able to switch this off at a per-tenant granularity. This is intended to mitigate risk of issues with https://github.com/neondatabase/neon/pull/7181, which makes scheduling more involved. Closes: #7103 ## Summary of changes - Introduce a scheduling policy per tenant, with API to set it - Refactor persistent.rs helpers for updating tenants to be more general - Add tests	2024-03-28 14:19:25 +00:00
Arpad Müller	5928f6709c	Support compaction_threshold=1 for tiered compaction (#7257 ) Many tests like `test_live_migration` or `test_timeline_deletion_with_files_stuck_in_upload_queue` set `compaction_threshold` to 1, to create a lot of changes/updates. The compaction threshold was passed as `fanout` parameter to the tiered_compaction function, which didn't support values of 1 however. Now we change the assert to support it, while still retaining the exponential nature of the increase in range in terms of lsn that a layer is responsible for. A large chunk of the failures in #6964 was due to hitting this issue that we now resolved. Part of #6768.	2024-03-28 13:48:47 +01:00
Konstantin Knizhnik	63b2060aef	Drop connections with all shards invoplved in prefetch in case of error (#7249 ) ## Problem See https://github.com/neondatabase/cloud/issues/11559 If we have multiple shards, we need to reset connections to all shards involved in prefetch (having active prefetch requests) if connection with any of them is lost. ## Summary of changes In `prefetch_on_ps_disconnect` drop connection to all shards with active page requests. ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-03-28 08:16:05 +02:00
Sasha Krassovsky	24c5a5ac16	Revert "Revoke REPLICATION" (#7261 ) Reverts neondatabase/neon#7052	2024-03-27 18:07:51 +00:00
Alexander Bayandin	7f9cc1bd5e	CI(trigger-e2e-tests): set e2e-platforms (#7229 ) ## Problem We don't want to run an excessive e2e test suite on neonvm if there are no relevant changes. ## Summary of changes - Check PR diff and if there are no relevant compute changes (in `vendor/`, `pgxn/`, `libs/vm_monitor` or `Dockerfile.compute-node` - Switch job from `small` to `ubuntu-latest` runner to make it possible to use GitHub CLI	2024-03-27 13:10:37 +00:00
Christian Schwarz	cdf12ed008	fix(walreceiver): Timeline::shutdown can leave a dangling handle_walreceiver_connection tokio task (#7235 ) # Problem As pointed out through doc-comments in this PR, `drop_old_connection` is not cancellation-safe. This means we can leave a `handle_walreceiver_connection` tokio task dangling during Timeline shutdown. More details described in the corresponding issue #7062. # Solution Don't cancel-by-drop the `connection_manager_loop_step` from the `tokio::select!()` in the task_mgr task. Instead, transform the code to use a `CancellationToken` --- specifically, `task_mgr::shutdown_token()` --- and make code responsive to it. The `drop_old_connection()` is still not cancellation-safe and also doesn't get a cancellation token, because there's no point inside the function where we could return early if cancellation were requested using a token. We rely on the `handle_walreceiver_connection` to be sensitive to the `TaskHandle`s cancellation token (argument name: `cancellation`). Currently it checks for `cancellation` on each WAL message. It is probably also sensitive to `Timeline::cancel` because ultimately all that `handle_walreceiver_connection` does is interact with the `Timeline`. In summary, the above means that the following code (which is found in `Timeline::shutdown`) now might take longer, but actually ensures that all `handle_walreceiver_connection` tasks are finished: ```rust task_mgr::shutdown_tasks( Some(TaskKind::WalReceiverManager), Some(self.tenant_shard_id), Some(self.timeline_id) ) ``` # Refs refs #7062	2024-03-27 12:04:31 +01:00
Conrad Ludgate	12512f3173	add authentication rate limiting (#6865 ) ## Problem https://github.com/neondatabase/cloud/issues/9642 ## Summary of changes 1. Make `EndpointRateLimiter` generic, renamed as `BucketRateLimiter` 2. Add support for claiming multiple tokens at once 3. Add `AuthRateLimiter` alias. 4. Check `(Endpoint, IP)` pair during authentication, weighted by how many hashes proxy would be doing. TODO: handle ipv6 subnets. will do this in a separate PR.	2024-03-26 19:31:19 +00:00
John Spray	b3b7ce457c	pageserver: remove bare mgr::get_tenant, mgr::list_tenants (#7237 ) ## Problem This is a refactor. This PR was a precursor to a much smaller change `e5bd602dc1`, where as I was writing it I found that we were not far from getting rid of the last non-deprecated code paths that use `mgr::` scoped functions to get at the TenantManager state. We're almost done cleaning this up as per https://github.com/neondatabase/neon/issues/5796. The only significant remaining mgr:: item is `get_active_tenant_with_timeout`, which is page_service's path for fetching tenants. ## Summary of changes - Remove the bool argument to get_attached_tenant_shard: this was almost always false from API use cases, and in cases when it was true, it was readily replacable with an explicit check of the returned tenant's status. - Rather than letting the timeline eviction task query any tenant it likes via `mgr::`, pass an `Arc<Tenant>` into the task. This is still an ugly circular reference, but should eventually go away: either when we switch to exclusively using disk usage eviction, or when we change metadata storage to avoid the need to imitate layer accesses. - Convert all the mgr::get_tenant call sites to use TenantManager::get_attached_tenant_shard - Move list_tenants into TenantManager.	2024-03-26 18:29:08 +00:00

1 2 3 4 5 ...

4974 Commits