rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-05 20:42:54 +00:00

Author	SHA1	Message	Date
Christian Schwarz	915fba146d	pagebench: getpage: optional keyspace cache file (#6324 ) Proved useful when benchmarking 20k tenant setup when validating https://github.com/neondatabase/neon/issues/5479	2024-01-11 17:42:11 +00:00
Vlad Lazar	da7a7c867e	pageserver: do not bump priority of background task for timeline status requests (#6301 ) ## Problem Previously, `GET /v1/tenant/:tenant_id/timeline` and `GET /v1/tenant/:tenant_id/timeline/:timeline_id` would bump the priority of the background task which computes the initial logical size by cancelling the wait on the synchronisation semaphore. However, the request would still return an approximate logical size. It's undesirable to force background work for a status request. ## Summary of changes This PR updates the priority used by the timeline status requests such that they don't do priority boosting by default anymore. An optional query parameter, `force-await-initial-logical-size`, is added for both mentioned endpoints. When set to true, it will skip the concurrency limiting semaphore and wait for the background task to complete before returning the exact logical size. In order to exercise this behaviour in a test I had to add an extra failpoint. If you think it's too intrusive, it can be removed. Also fixed a small bug where the cancellation of a download is reported as an opaque download failure upstream. This caused `test_location_conf_churn` to fail at teardown due to a WARN log line. Closes https://github.com/neondatabase/neon/issues/6168	2024-01-11 15:55:32 +00:00
Christian Schwarz	3ee981889f	compaction: avoid no-op timeline dir fsync (#6311 ) Random find while looking at an idle 20k tenant pageserver where each tenant has 9 tiny L0 layers and compaction produces no new L1s / image layers. The aggregate CPU cost of running this every 20s for 20k tenants is actually substantial, due to the use of `spawn_blocking`.	2024-01-11 10:32:39 +00:00
Christian Schwarz	fc66ba43c4	Revert "revert recent VirtualFile asyncification changes (#5291 )" (#6309 ) This reverts commit `ab1f37e908`. Thereby fixes #5479 Updated Analysis ================ The problem with the original patch was that it, for the first time, exposed the `VirtualFile` code to tokio task concurrency instead of just thread-based concurrency. That caused the VirtualFile file descriptor cache to start thrashing, effectively grinding the system to a halt. Details ------- At the time of the original patch, we had a _lot_ of runnable tasks in the pageserver. The symptom that prompted the revert (now being reverted in this PR) is that our production systems fell into a valley of zero goodput, high CPU, and zero disk IOPS shortly after PS restart. We lay out the root cause for that behavior in this subsection. At the time, there was no concurrency limit on the number of concurrent initial logical size calculations. Initial size calculation was initiated for all timelines within the first 10 minutes as part of consumption metrics collection. On a PS with 20k timelines, we'd thus have 20k runnable tasks. Before the original patch, the `VirtualFile` code never returned `Poll::Pending`. That meant that once we entered it, the calling tokio task would not yield to the tokio executor until we were done performing the VirtualFile operation, i.e., doing a blocking IO system call. The original patch switched the VirtualFile file descriptor cache's synchronization primitives to those from `tokio::sync`. It did not change that we were doing synchronous IO system calls. And the cache had more slots than we have tokio executor threads. So, these primitives never actually needed to return `Poll::Pending`. But, the tokio scheduler makes tokio sync primitives return `Pending` artificially, as a mechanism for the scheduler to get back into control more often ([example](https://docs.rs/tokio/1.35.1/src/tokio/sync/batch_semaphore.rs.html#570)). 
So, the new reality was that VirtualFile calls could now yield to the tokio executor. Tokio would pick one of the other 19999 runnable tasks to run. These tasks were also using VirtualFile. So, we now had a lot more concurrency in that area of the code. The problem with more concurrency was that caches started thrashing, most notably the VirtualFile file descriptor cache: each time a task would be rescheduled, it would want to do its next VirtualFile operation. For that, it would first need to evict another (task's) VirtualFile fd from the cache to make room for its own fd. It would then do one VirtualFile operation before hitting an await point and yielding to the executor again. The executor would run the other 19999 tasks for fairness before circling back to the first task, which would find its fd evicted. The other cache that would theoretically be impacted in a similar way is the pageserver's `PageCache`. However, for initial logical size calculation, it seems much less relevant in experiments, likely because of the random access nature of initial logical size calculation. 
Fixes ===== We fixed the above problems by - raising VirtualFile cache sizes - https://github.com/neondatabase/cloud/issues/8351 - changing code to ensure forward-progress once cache slots have been acquired - https://github.com/neondatabase/neon/pull/5480 - https://github.com/neondatabase/neon/pull/5482 - tbd: https://github.com/neondatabase/neon/issues/6065 - reducing the amount of runnable tokio tasks - https://github.com/neondatabase/neon/pull/5578 - https://github.com/neondatabase/neon/pull/6000 - fixing bugs that caused unnecessary concurrency induced by connection handlers - https://github.com/neondatabase/neon/issues/5993 I manually verified that this PR doesn't negatively affect startup performance as follows: create a pageserver in production configuration, with 20k tenants/timelines, 9 tiny L0 layer files each; Start it, and observe ``` INFO Startup complete (368.009s since start) elapsed_ms=368009 ``` I further verified in that same setup that, when using `pagebench`'s getpage benchmark at as-fast-as-possible request rate against 5k of the 20k tenants, the achieved throughput is identical. The VirtualFile cache isn't thrashing in that case. Future Work =========== We are still exposed to the cache thrashing risk from outside factors, e.g., request concurrency is unbounded, and initial size calculation skips the concurrency limiter when we establish a walreceiver connection. Once we start thrashing, we will degrade non-gracefully, i.e., encounter a valley as was seen with the original patch. However, we have sufficient means to deal with that unlikely situation: 1. we have dashboards & metrics to monitor & alert on cache thrashing 2. we can react by scaling the bottleneck resources (cache size) or by manually shedding load through tenant relocation Potential systematic solutions are future work: * global concurrency limiting * per-tenant rate limiting => #5899 * pageserver-initiated load shedding Related Issues ============== This PR unblocks the introduction of tokio-epoll-uring for asynchronous disk IO ([Epic](#4744)).	2024-01-11 11:29:14 +01:00
Christian Schwarz	4e1b0b84eb	pagebench: fixup after is_rel_block_key changes in #6266 (#6303 ) PR #6266 broke the getpage_latest_lsn benchmark. Before this patch, we'd fail with ``` not implemented: split up range ``` because `r.start = rel size key` and `r.end = rel size key + 1`. The filtering of the key ranges in that loop is a bit ugly, but, I measured: * setup with 180k layer files (20k tenants * 9 layers). * total physical size is 463GiB * 5k tenants, the range filtering takes `0.6 seconds` on an i3en.3xlarge. That's a tiny fraction of the overall time it takes for pagebench to get ready to send requests. So, this is good enough for now / there are other bottlenecks that are bigger.	2024-01-09 19:00:37 +01:00
John Spray	f94abbab95	pageserver: clean up a redundant tenant_id attribute (#6280 ) This was a small TODO(sharding) thing in TenantHarness.	2024-01-09 12:10:15 +00:00
John Spray	4b9b4c2c36	pageserver: cleanup redundant create/attach code, fix detach while attaching (#6277 ) ## Problem The code for tenant create and tenant attach was just a special case of what upsert_location does. ## Summary of changes - Use `upsert_location` for create and attach APIs - Clean up error handling in upsert_location so that it can generate appropriate HTTP response codes - Update tests that asserted the old non-idempotent behavior of attach - Rework the `test_ignore_while_attaching` test, and fix tenant shutdown during activation, which this test was supposed to cover, but it was actually just waiting for activation to complete.	2024-01-09 10:37:54 +00:00
Arpad Müller	8186f6b6f9	Drop async_trait usage from three internal traits (#6305 ) This uses the [newly stable](https://blog.rust-lang.org/2023/12/21/async-fn-rpit-in-traits.html) async trait feature for three internal traits. One requires `Send` bounds to be present so uses `impl Future<...> + Send` instead. Advantages: * less macro usage * no extra boxing Disadvantages: * impl syntax needed for `Send` bounds is a bit more verbose (but only required in one place)	2024-01-09 11:20:08 +01:00
Arpad Müller	d5e3434371	Also allow unnecessary_fallible_conversions lint (#6294 ) This fixes the clippy lint firing on macOS on the conversion which is needed for portability. For some reason, the logic in https://github.com/rust-lang/rust-clippy/pull/11669 to avoid an overlap is not working.	2024-01-09 04:22:36 +00:00
John Spray	b5ed6f22ae	pageserver: clean up a TODO comment (#6282 ) These functions don't need updating for sharding: it's fine for them to remain shard-naive, as they're only used in the context of dumping a layer file. The sharding metadata doesn't live in the layer file, it lives in the index.	2024-01-08 09:19:00 +00:00
John Spray	d1c0232e21	pageserver: use `pub(crate)` in metrics.rs, and clean up unused items (#6275 ) ## Problem Noticed while making other changes that there were `pub` items that were unused. ## Summary of changes - Make everything `pub(crate)` in metrics.rs, apart from items used from `bin/` - Fix the timelines eviction metric: it was never being incremented - Remove an unused ephemeral_bytes counter.	2024-01-08 03:53:15 +00:00
John Spray	3c560d27a8	pageserver: implement secondary-mode downloads (#6123 ) Follows on from #6050 , in which we upload heatmaps. Secondary locations will now poll those heatmaps and download layers mentioned in the heatmap. TODO: - [X] ~Unify/reconcile stats for behind-schedule execution with warn_when_period_overrun (https://github.com/neondatabase/neon/pull/6050#discussion_r1426560695)~ - [x] Give downloads their own concurrency config independent of uploads Deferred optimizations: - https://github.com/neondatabase/neon/issues/6199 - https://github.com/neondatabase/neon/issues/6200 Eviction will be the next PR: - #5342	2024-01-05 12:29:20 +00:00
John Spray	18e9208158	pageserver: improved error handling for shard routing error, timeline not found (#6262 ) ## Problem - When a client requests a key that isn't found in any shard on the node (edge case that only happens if a compute's config is out of date), we should prompt them to reconnect (as this includes a backoff), since they will not be able to complete the request until they eventually get a correct pageserver connection string. - QueryError::Other is used excessively: this contains a type-ambiguous anyhow::Error and is logged very verbosely (including backtrace). ## Summary of changes - Introduce PageStreamError to replace use of anyhow::Error in request handlers for getpage, etc. - Introduce Reconnect and NotFound variants to QueryError - Map the "shard routing error" case to PageStreamError::Reconnect -> QueryError::Reconnect - Update type conversions for LSN timeouts and tenant/timeline not found errors to use PageStreamError::NotFound->QueryError::NotFound	2024-01-04 10:40:03 +00:00
John Spray	c119af8ddd	pageserver: run at least 2 background task threads Otherwise an assertion in CONCURRENT_BACKGROUND_TASKS will trip if you try to run the pageserver on a single core.	2024-01-03 14:22:40 +00:00
John Spray	a2e083ebe0	pageserver: make walredo shard-aware This does not have a functional impact, but enables all the logging in this code to include the shard_id label.	2024-01-03 14:22:40 +00:00
John Spray	73a944205b	pageserver: log details on shard routing error	2024-01-03 14:22:40 +00:00
John Spray	34ebfbdd6f	pageserver: fix handling getpage with multiple shards on one node Previously, we would wait for the LSN to be visible on whichever timeline we happened to load at the start of the connection, then proceed to look up the correct timeline for the key and do the read. If the timeline holding the key was behind the timeline we used for the LSN wait, then we might serve an apparently-successful read result that actually contains data from behind the requested lsn.	2024-01-03 14:22:40 +00:00
John Spray	ef7c9c2ccc	pageserver: fix active tenant lookup hitting secondaries with sharding If there is some secondary shard for a tenant on the same node as an attached shard, the secondary shard could trip up this code and cause page_service to incorrectly get an error instead of finding the attached shard.	2024-01-03 14:22:40 +00:00
John Spray	6c79e12630	pageserver: drop unwanted keys during compaction after split	2024-01-03 14:22:40 +00:00
John Spray	753d97bd77	pageserver: don't delete ancestor shard layers	2024-01-03 14:22:40 +00:00
Cuong Nguyen	fb518aea0d	Add batch ingestion mechanism to avoid high contention (#5886 ) ## Problem For context, this problem was observed in a research project where we try to make neon run in multiple regions and I was asked by @hlinnaka to make this PR. In our project, we use the pageserver in a non-conventional way such that we would send a larger number of requests to the pageserver than normal (imagine postgres without the buffer pool). I measured the time from the moment a WAL record left the safekeeper to when it reached the pageserver ([code](`e593db1f5a/pageserver/src/tenant/timeline/walreceiver/walreceiver_connection.rs (L282-L287)`)) and observed that when the number of get_page_at_lsn requests was high, the wal receiving time increased significantly (see the left side of the graphs below). Upon further investigation, I found that the delay was caused by this line `d2ca410919/pageserver/src/tenant/timeline.rs (L2348)` The `get_layer_for_write` method is called for every value during WAL ingestion and it tries to acquire the layers write lock every time, which results in high contention when the read lock is acquired more frequently. ![Untitled](https://github.com/neondatabase/neon/assets/6244849/85460f4d-ead1-4532-bc64-736d0bfd7f16) ![Untitled2](https://github.com/neondatabase/neon/assets/6244849/84199ab7-5f0e-413b-a42b-f728f2225218) ## Summary of changes It is unnecessary to call `get_layer_for_write` repeatedly for all values in a WAL message since they would end up in the same memory layer anyway, so I created the batched versions of `InMemoryLayer::put_value`, `InMemoryLayer::put_tombstone`, `Timeline::put_value`, and `Timeline::put_tombstone`, that acquire the locks once for a batch of values. Additionally, `DatadirModification` is changed to store multiple versions of uncommitted values, and `WalIngest::ingest_record()` can now ingest records without immediately committing them. 
With these new APIs, the new ingestion loop can be changed to commit for every `ingest_batch_size` records. The `ingest_batch_size` variable is exposed as a config. If it is set to 1 then we get the same behavior as before this change. I found that setting this value to 100 seems to work the best, and you can see its effect on the right side of the above graphs. --------- Co-authored-by: John Spray <john@neon.tech>	2024-01-03 10:41:58 +00:00
Christian Schwarz	aa9f1d4b69	pagebench get-page: default to latest=true, make configurable via flag (#6252 ) fixes https://github.com/neondatabase/neon/issues/6209	2024-01-02 16:57:29 +00:00
Arseny Sher	dbd36e40dc	Move failpoint support code to utils. To enable them in safekeeper as well.	2024-01-02 10:50:20 +04:00
John Spray	e68ae2888a	pageserver: expedite tenant activation on delete (#6190 ) ## Problem During startup, a tenant delete request might have to retry for many minutes waiting for a tenant to enter Active state. ## Summary of changes - Refactor delete_tenant into TenantManager: this is not a functional change, but will avoid merge conflicts with https://github.com/neondatabase/neon/pull/6105 later - Add 412 responses to the swagger definition of this endpoint. - Use Tenant::wait_to_become_active in `TenantManager::delete_tenant` --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2023-12-22 10:22:22 +00:00
Arpad Müller	7d6fc3c826	Use pre-generated initdb.tar.zst in test_ingest_real_wal (#6206 ) This implements the TODO mentioned in the test added by #5892.	2023-12-21 14:23:09 +00:00
Christian Schwarz	5385791ca6	add pageserver component-level benchmark (`pagebench`) (#6174 ) This PR adds a component-level benchmarking utility for pageserver. Its name is `pagebench`. The problem solved by `pagebench` is that we want to put Pageserver under high load. This isn't easily achieved with `pgbench` because it needs to go through a compute, which has significant performance overhead compared to accessing Pageserver directly. Further, compute has its own performance optimizations (most importantly: caches). Instead of designing a compute-facing workload that defeats those internal optimizations, `pagebench` simply bypasses them by accessing pageserver directly. Supported benchmarks: * getpage@latest_lsn * basebackup * triggering logical size calculation This code has no automated users yet. A performance regression test for getpage@latest_lsn will be added in a later PR. part of https://github.com/neondatabase/neon/issues/5771	2023-12-21 13:07:23 +01:00
Arpad Müller	48890d206e	Simplify inject_index_part test function (#6207 ) Instead of manually constructing the directory's path, we can just use the `parent()` function. This is a drive-by improvement from #6206	2023-12-21 12:52:38 +01:00
Joonas Koivunen	48f156b8a2	feat: relative last activity based eviction (#6136 ) Adds a new disk usage based eviction option, EvictionOrder, which selects whether to use the current `AbsoluteAccessed` or this new proposed but not yet tested `RelativeAccessed`. Additionally a fudge factor was noticed while implementing this, which might help spare smaller tenants at the expense of targeting larger tenants. Cc: #5304 Co-authored-by: Arpad Müller <arpad@neon.tech>	2023-12-20 18:44:19 +00:00
John Spray	f260f1565e	pageserver: fixes + test updates for sharding (#6186 ) This is a precursor to: - https://github.com/neondatabase/neon/pull/6185 While that PR contains big changes to neon_local and attachment_service, this PR contains a few unrelated standalone changes generated while working on that branch: - Fix restarting a pageserver when it contains multiple shards for the same tenant - When using location_config api to attach a tenant, create its timelines dir - Update test paths where generations were previously optional to make them always-on: this avoids tests having to spuriously assert that attachment_service is not None in order to make the linter happy. - Add a TenantShardId python implementation for subsequent use in test helpers that will be made shard-aware - Teach scrubber to read across shards when checking for layer existence: this is a refactor to track the list of existent layers at tenant-level rather than locally to each timeline. This is a precursor to testing shard splitting.	2023-12-20 12:26:20 +00:00
Joonas Koivunen	c29df80634	fix(layer): move backoff to spawned task (#5746 ) Move the backoff to spawned task as it can still be useful; make the sleep cancellable.	2023-12-20 10:26:06 +02:00
Christian Schwarz	82809d2ec2	fix metric `pageserver_initial_logical_size_start_calculation` (#6191 ) It wasn't being incremented. Fixup of commit `1c88824ed0` Author: Christian Schwarz <christian@neon.tech> Date: Fri Dec 1 12:52:59 2023 +0100 initial logical size calculation: add a bunch of metrics (#5995)	2023-12-19 17:44:49 +01:00
Christian Schwarz	e6bf6952b8	higher resolution histograms for getpage@lsn (#6177 ) part of https://github.com/neondatabase/cloud/issues/7811	2023-12-19 14:46:17 +01:00
John Spray	d89af4cf8e	pageserver: downgrade 'connection reset' WAL errors (#6181 ) This squashes a particularly noisy warn-level log that occurs when safekeepers are restarted. Unfortunately the error type from `tonic` doesn't provide a neat way of matching this, so we use a string comparison	2023-12-19 10:38:00 +00:00
Christian Schwarz	6ffbbb2e02	include timeline ids in tenant details response (#6166 ) Part of getpage@lsn benchmark epic: https://github.com/neondatabase/neon/issues/5771 This allows getting the list of tenants and timelines without triggering initial logical size calculation by requesting the timeline details API response, which would skew our results.	2023-12-19 10:32:51 +00:00
Arpad Müller	a89d6dc76e	Always send a json response for timeline_get_lsn_by_timestamp (#6178 ) As part of the transition laid out in [this](https://github.com/neondatabase/cloud/pull/7553#discussion_r1370473911) comment, don't read the `version` query parameter in `timeline_get_lsn_by_timestamp`, but always return the structured json response. Follow-up of https://github.com/neondatabase/neon/pull/5608	2023-12-19 11:29:16 +01:00
Christian Schwarz	1f9a7d1cd0	add a Rust client for Pageserver page_service (#6128 ) Part of getpage@lsn benchmark epic: https://github.com/neondatabase/neon/issues/5771 Stacked atop https://github.com/neondatabase/neon/pull/6145	2023-12-18 18:17:19 +00:00
John Khvatov	33cb9a68f7	pageserver: Reduce tracing overhead in timeline::get (#6115 ) ## Problem Compaction process (specifically the image layer reconstructions part) is lagging behind wal ingest (at speed ~10-15MB/s) for medium-sized tenants (30-50GB). CPU profile shows that significant amount of time (see flamegraph) is being spent in `tracing::span::Span::new`. mainline (commit: `0ba4cae491`): ![reconstruct-mainline-0ba4cae491c2](https://github.com/neondatabase/neon/assets/289788/ebfd262e-5c97-4858-80c7-664a1dbcc59d) ## Summary of changes By lowering the tracing level in get_value_reconstruct_data and get_or_maybe_download from info to debug, we can reduce the overhead of span creation in prod environments. On my system, this sped up the image reconstruction process by 60% (from 14500 to 23160 page reconstruction per sec) pr: ![reconstruct-opt-2](https://github.com/neondatabase/neon/assets/289788/563a159b-8f2f-4300-b0a1-6cd66e7df769) `create_image_layers()` (it's 1 CPU bound here) mainline vs pr: ![image](https://github.com/neondatabase/neon/assets/289788/a981e3cb-6df9-4882-8a94-95e99c35aa83)	2023-12-18 13:33:23 +00:00
John Spray	dbdb1d21f2	pageserver: on-demand activation cleanups (#6157 ) ## Problem #6112 added some logs and metrics: clean these up a bit: - Avoid counting startup completions for tenants launched after startup - exclude no-op cases from timing histograms - remove a rogue log message	2023-12-18 10:29:19 +00:00
Christian Schwarz	47873470db	pageserver: add method to dump keyspace in mgmt api client (#6145 ) Part of getpage@lsn benchmark epic: https://github.com/neondatabase/neon/issues/5771	2023-12-16 10:52:48 +00:00
John Spray	d066dad84b	pageserver: prioritize activation of tenants with client requests (#6112 ) ## Problem During startup, a client request might have to wait a long time while the system is busy initializing all the attached tenants, even though most of the attached tenants probably don't have any client requests to service, and could wait a bit. ## Summary of changes - Add a semaphore to limit how many Tenant::spawn()s may concurrently do I/O to attach their tenant (i.e. read indices from remote storage, scan local layer files, etc). - Add Tenant::activate_now, a hook for kicking a tenant in its spawn() method to skip waiting for the warmup semaphore - For tenants that attached via warmup semaphore units, wait for logical size calculation to complete before dropping the warmup units - Set Tenant::activate_now in `get_active_tenant_with_timeout` (the page service's path for getting a reference to a tenant). - Wait for tenant activation in HTTP handlers for timeline creation and deletion: like page service requests, these require an active tenant and should prioritize activation if called.	2023-12-15 20:37:47 +00:00
John Spray	56f7d55ba7	pageserver: basic cancel/timeout for remote storage operations (#6097 ) ## Problem Various places in remote storage were not subject to a timeout (thereby stuck TCP connections could hold things up), and did not respect a cancellation token (so things like timeline deletion or tenant detach would have to wait arbitrarily long). ## Summary of changes - Add download_cancellable and upload_cancellable helpers, and use them in all the places we wait for remote storage operations (with the exception of initdb downloads, where it would not have been safe). - Add a cancellation token arg to `download_retry`. - Use cancellation token args in various places that were missing one per #5066 Closes: #5066 Why is this only "basic" handling? - Doesn't express difference between shutdown and errors in return types, to avoid refactoring all the places that use an anyhow::Error (these should all eventually return a more structured error type) - Implements timeouts on top of remote storage, rather than within it: this means that operations hitting their timeout will lose their semaphore permit and thereby go to the back of the queue for their retry. - Doing a nicer job is tracked in https://github.com/neondatabase/neon/issues/6096	2023-12-15 17:43:02 +00:00
Christian Schwarz	1a9854bfb7	add a Rust client for Pageserver management API (#6127 ) Part of getpage@lsn benchmark epic: https://github.com/neondatabase/neon/issues/5771 This PR moves the control plane's spread-all-over-the-place client for the pageserver management API into a separate module within the pageserver crate. I need that client to be async in my benchmarking work, so, this PR switches to the async version of `reqwest`. That is also the right direction generally IMO. The switch to async in turn mandated converting most of the `control_plane/` code to async. Note that some of the client methods should be taking `TenantShardId` instead of `TenantId`, but, none of the callers seem to be sharding-aware. Leaving that for another time: https://github.com/neondatabase/neon/issues/6154	2023-12-15 18:33:45 +01:00
Arpad Müller	215cdd18c4	Make initdb upload retries cancellable and seek to beginning (#6147 ) * initdb uploads had no cancellation token, which means that when we were stuck in upload retries, we wouldn't be able to delete the timeline. in general, the combination of retrying forever and not having cancellation tokens is quite dangerous. * initdb uploads wouldn't rewind the file. this wasn't discovered in the purposefully unreliable test-s3 in pytest because those fail on the first byte always, not somewhere during the connection. we'd be getting errors from the AWS sdk that the file was at an unexpected end. slack thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1702632247784079	2023-12-15 12:11:25 +00:00
Joonas Koivunen	0fd80484a9	fix: Timeline deletion during busy startup (#6133 ) Compaction was holding back timeline deletion because the compaction lock had been acquired, but the semaphore was waited on. Timeline deletion was waiting on the same lock for 1500s. This replaces the `pageserver::tenant::tasks::concurrent_background_tasks_rate_limit` (which looks correct) with a simpler `..._permit` which is just an infallible acquire, which is easier to spot "aah this needs to be raced with cancellation tokens". Ref: https://neondb.slack.com/archives/C03F5SM1N02/p1702496912904719 Ref: https://neondb.slack.com/archives/C03F5SM1N02/p1702578093497779	2023-12-15 11:59:24 +00:00
Joonas Koivunen	07508fb110	fix: better Json parsing errors (#6135 ) Before, the only errors returned by any json parsing from the http api were per field errors. Now they are done using `serde_path_to_error`, which at least helped greatly with the `disk_usage_eviction_run` used for testing. I don't think this can conflict with anything added in #5310.	2023-12-15 12:18:22 +02:00
John Spray	f1cd1a2122	pageserver: improved handling of concurrent timeline creations on the same ID (#6139 ) ## Problem Historically, the pageserver used an "uninit mark" file on disk for two purposes: - Track which timeline dirs are incomplete for handling on restart - Avoid trying to create the same timeline twice at the same time. The original purpose of handling restarts is now defunct, as we use remote storage as the source of truth and clean up any trash timeline dirs on startup. Using the file to mutually exclude creation operations is error prone compared with just doing it in memory, and the existing checks happened some way into the creation operation, and could expose errors as 500s (anyhow::Errors) rather than something clean. ## Summary of changes - Creations are now mutually excluded in memory (using `Tenant::timelines_creating`), rather than relying on a file on disk for coordination. - Acquiring unique access to the timeline ID now happens earlier in the request. - Creating the same timeline which already exists is now a 201: this simplifies retry handling for clients. - 409 is still returned if a timeline with the same ID is still being created: if this happens it is probably because the client timed out an earlier request and has retried. - Colliding timeline creation requests should no longer return 500 errors This paves the way to entirely removing uninit markers in a subsequent change. --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-12-15 08:51:23 +00:00
Joonas Koivunen	f010479107	feat(layer): pageserver_layer_redownloaded_after histogram (#6132 ) this is aimed at replacing the current mtime only based thrashing alerting later. Cc: #5331	2023-12-14 21:32:54 +02:00
Conrad Ludgate	cc633585dc	gauge guards (#6138 ) ## Problem The websockets gauge for active db connections seems to be growing more than the gauge for client connections over websockets, which does not make sense. ## Summary of changes refactor how our counter-pair gauges are represented. not sure if this will improve the problem, but it should be harder to mess-up the counters. The API is much nicer though now and doesn't require scopeguard::defer hacks	2023-12-14 17:21:39 +00:00
John Spray	c4e0ef507f	pageserver: heatmap uploads (#6050 ) Dependency (commits inline): https://github.com/neondatabase/neon/pull/5842 ## Problem Secondary mode tenants need a manifest of what to download. Ultimately this will be some kind of heat-scored set of layers, but as a robust first step we will simply use the set of resident layers: secondary tenant locations will aim to match the on-disk content of the attached location. ## Summary of changes - Add heatmap types representing the remote structure - Add hooks to Tenant/Timeline for generating these heatmaps - Create a new `HeatmapUploader` type that is external to `Tenant`, and responsible for walking the list of attached tenants and scheduling heatmap uploads. Notes to reviewers: - Putting the logic for uploads (and later, secondary mode downloads) outside of `Tenant` is an opinionated choice, motivated by: - Enable future smarter scheduling of operations, e.g. uploading the stalest tenant first, rather than having all tenants compete for a fair semaphore on a first-come-first-served basis. Similarly for downloads, we may wish to schedule the tenants with the hottest un-downloaded layers first. - Enable accessing upload-related state without synchronization (it belongs to HeatmapUploader, rather than being some Mutex<>'d part of Tenant) - Avoid further expanding the scope of Tenant/Timeline types, which are already among the largest in the codebase - You might reasonably wonder how much of the uploader code could be a generic job manager thing. Probably some of it: but let's defer pulling that out until we have at least two users (perhaps secondary downloads will be the second one) to highlight which bits are really generic. Compromises: - Later, instead of using digests of heatmaps to decide whether anything changed, I would prefer to avoid walking the layers in tenants that don't have changes: tracking that will be a bit invasive, as it needs input from both remote_timeline_client and Layer.	2023-12-14 13:09:24 +00:00
Joonas Koivunen	a919b863d1	refactor: remove eviction batching (#6060 ) We no longer have `layer_removal_cs` since #5108, we no longer need batching.	2023-12-13 18:05:33 +02:00
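
The file descriptor cache thrashing analysed in commit `fc66ba43c4` above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the pageserver's code: `FdCache`, `run_to_completion`, and `yield_after_each_op` are hypothetical names, the cache is modelled as a plain LRU (the real `VirtualFile` cache may use a different replacement policy), and the scheduler is modelled as strict round-robin, which is a simplification of tokio's behaviour.

```rust
use std::collections::VecDeque;

/// Toy LRU cache of file descriptors (hypothetical; not the VirtualFile cache).
/// Each entry is the id of the task that owns the cached fd.
struct FdCache {
    slots: usize,
    lru: VecDeque<usize>, // front = least recently used
    hits: u64,
    misses: u64,
}

impl FdCache {
    fn new(slots: usize) -> Self {
        FdCache { slots, lru: VecDeque::new(), hits: 0, misses: 0 }
    }

    /// One simulated file operation by `task`: reuse its fd if it is still
    /// cached, otherwise evict the least recently used fd and "reopen".
    fn access(&mut self, task: usize) {
        if let Some(pos) = self.lru.iter().position(|&t| t == task) {
            self.lru.remove(pos);
            self.hits += 1;
        } else {
            if self.lru.len() == self.slots {
                self.lru.pop_front();
            }
            self.misses += 1;
        }
        self.lru.push_back(task);
    }
}

/// Models the behaviour before the original patch: a task never yields inside
/// a file operation, so it performs all of its operations back to back.
/// Returns (hits, misses).
fn run_to_completion(tasks: usize, ops: usize, slots: usize) -> (u64, u64) {
    let mut cache = FdCache::new(slots);
    for task in 0..tasks {
        for _ in 0..ops {
            cache.access(task);
        }
    }
    (cache.hits, cache.misses)
}

/// Models the behaviour after the original patch: a task yields after every
/// operation and a fair scheduler runs all other tasks before returning to it.
/// Returns (hits, misses).
fn yield_after_each_op(tasks: usize, ops: usize, slots: usize) -> (u64, u64) {
    let mut cache = FdCache::new(slots);
    for _ in 0..ops {
        for task in 0..tasks {
            cache.access(task);
        }
    }
    (cache.hits, cache.misses)
}
```

In this model, calling `run_to_completion(100, 10, 8)` misses only once per task, whereas `yield_after_each_op(100, 10, 8)` misses on every access because each task's fd has been evicted by the time the task runs again. Raising `slots` to at least the number of runnable tasks, or lowering the number of runnable tasks, restores the hit rate, which corresponds to the first and third fixes listed in that commit message.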

1770 Commits