rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-03 20:20:38 +00:00

Author	SHA1	Message	Date
Joonas Koivunen	4d426f6fbe	feat: support lazy, queued tenant attaches (#6907 ) Add off-by-default support for lazy queued tenant activation on attach. This should be useful on bulk migrations as some tenants will be activated faster due to operations or endpoint startup. Eventually all tenants will get activated by reusing the same mechanism we have at startup (`PageserverConf::concurrent_tenant_warmup`). The difference to lazy attached tenants to startup ones is that we leave their initial logical size calculation be triggered by WalReceiver or consumption metrics. Fixes: #6315 Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2024-02-29 13:26:29 +02:00
John Spray	e5384ebefc	pageserver: accelerate tenant activation on HTTP API timeline read requests (#6944 ) ## Problem Callers of the timeline creation API may issue timeline GETs ahead of creation to e.g. check if their intended timeline already exists, or to learn the LSN of a parent timeline. Although the timeline creation API already triggers activation of a timeline if it's currently waiting to activate, the GET endpoint doesn't, so such callers will encounter 503 responses for several minutes after a pageserver restarts, while tenants are lazily warming up. The original scope of which APIs will activate a timeline was quite small, but really it makes sense to do it for any API that needs a particular timeline to be active. ## Summary of changes - In the timeline detail GET handler, use wait_to_become_active, which triggers immediate activation of a tenant if it was currently waiting for the warmup semaphore, then waits up to 5 seconds for the activation to complete. If it doesn't complete promptly, we return a 503 as before. - Modify active_timeline_for_active_tenant to also use wait_to_become_active, which indirectly makes several other timeline-scope request handlers fast-activate a tenant when called. This is important because a timeline creation flow could also use e.g. get_lsn_for_timestamp as a precursor to creating a timeline. - There is some risk to this change: an excessive number of timeline GET requests could cause too many tenant activations to happen at the same time, leading to excessive queue depth to the S3 client. However, this was already the case for e.g. many concurrent timeline creations.	2024-02-28 14:53:35 +00:00
Vlad Lazar	2b11466b59	pageserver: optimise disk io for vectored get (#6780 ) ## Problem The vectored read path proposed in https://github.com/neondatabase/neon/pull/6576 seems to be functionally correct, but in my testing (see below) it is about 10-20% slower than the naive sequential vectored implementation. ## Summary of changes There's three parts to this PR: 1. Supporting vectored blob reads. This is actually trickier than it sounds because on disk blobs are prefixed with a variable length size header. Since the blobs are not necessarily fixed size, we need to juggle the offsets such that the callers can retrieve the blobs from the resulting buffer. 2. Merge disk read requests issued by the vectored read path up to a maximum size. Again, the merging is complicated by the fact that blobs are not fixed size. We keep track of the begin and end offset of each blob and pass them into the vectored blob reader. In turn, the reader will return a buffer and the offsets at which the blobs begin and end. 3. A benchmark for basebackup requests against tenant with large SLRU block counts is added. This required a small change to pagebench and a new config variable for the pageserver which toggles the vectored get validation. We can probably optimise things further by adding a little bit of concurrency for our IO. In principle, it's as simple as spawning a task which deals with issuing IO and doing the serialisation and handling on the parent task which receives input via a channel.	2024-02-28 12:06:00 +00:00
Christian Schwarz	b6bd75964f	Revert "pageserver: roll open layer in timeline writer (#6661 )" + PR #6842 (#6938 ) This reverts commits `587cb705b8` (PR #6661) and `fcbe9fb184` (PR #6842). Conflicts: pageserver/src/tenant.rs pageserver/src/tenant/timeline.rs The conflicts were with * pageserver: adjust checkpoint distance for sharded tenants (#6852) * pageserver: add vectored get implementation (#6576) Also we had to keep the `allowed_errors` to make `test_forward_compatibility` happy, see the PR thread on GitHub for details.	2024-02-28 11:38:23 +00:00
Joonas Koivunen	1b1320a263	fix: allow evicting wanted deleted layers (#6931 ) Not allowing evicting wanted deleted layers is something I've forgotten to implement on #5645. This PR makes it possible to evict such layers, which should reduce the amount of hanging evictions. Fixes: #6928 Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-02-28 00:02:44 +02:00
Konstantin Knizhnik	e1b4d96b5b	Limit number of AUX files deltas to reduce reconstruct time (#6874 ) ## Problem After commit [`840abe3954`] (store AUX files as deltas) we avoid quadratic growth of storage size when storing LR snapshots but get quadratic slowdown of reconstruct time. As a result storing 70k snapshots at my local Neon instance took more than 3 hours and starting node (creation of basecbackup): ~10 minutes. In prod 70k AUX files cause increase of startup time to 40 minutes: https://neondb.slack.com/archives/C03F5SM1N02/p1708513010480179 ## Summary of changes Enforce storing full AUX directory (some analog of FPI) each 1024 files. Time of creation 70k snapshots is reduced to 6 minutes and startup time - to 1.5 minutes (100 seconds). ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-02-27 21:18:46 +02:00
John Spray	a8ec18c0f4	refactor: move storage controller API structs into pageserver_api (#6927 ) ## Problem This is a precursor to adding a convenience CLI for the storage controller. ## Summary of changes - move controller api structs into pageserver_api::controller_api to make them visible to other crates - rename pageserver_api::control_api to pageserver_api::upcall_api to match the /upcall/v1/ naming in the storage controller. Why here rather than a totally separate crate? It's convenient to have all the pageserver-related stuff in one place, and if we ever wanted to move it to a different crate it's super easy to do that later.	2024-02-27 17:24:01 +00:00
Arpad Müller	045bc6af8b	Add new compaction abstraction, simulator, and implementation. (#6830 ) Rebased version of #5234, part of #6768 This consists of three parts: 1. A refactoring and new contract for implementing and testing compaction. The logic is now in a separate crate, with no dependency on the 'pageserver' crate. It defines an interface that the real pageserver must implement, in order to call the compaction algorithm. The interface models things like delta and image layers, but just the parts that the compaction algorithm needs to make decisions. That makes it easier unit test the algorithm and experiment with different implementations. I did not convert the current code to the new abstraction, however. When compaction algorithm is set to "Legacy", we just use the old code. It might be worthwhile to convert the old code to the new abstraction, so that we can compare the behavior of the new algorithm against the old one, using the same simulated cases. If we do that, have to be careful that the converted code really is equivalent to the old. This inclues only trivial changes to the main pageserver code. All the new code is behind a tenant config option. So this should be pretty safe to merge, even if the new implementation is buggy, as long as we don't enable it. 2. A new compaction algorithm, implemented using the new abstraction. The new algorithm is tiered compaction. It is inspired by the PoC at PR #4539, although I did not use that code directly, as I needed the new implementation to fit the new abstraction. The algorithm here is less advanced, I did not implement partial image layers, for example. I wanted to keep it simple on purpose, so that as we add bells and whistles, we can see the effects using the included simulator. One difference to #4539 and your typical LSM tree implementations is how we keep track of the LSM tree levels. This PR doesn't have a permanent concept of a level, tier or sorted run at all. There are just delta and image layers. However, when compaction starts, we look at the layers that exist, and arrange them into levels, depending on their shapes. That is ephemeral: when the compaction finishes, we forget that information. This allows the new algorithm to work without any extra bookkeeping. That makes it easier to transition from the old algorithm to new, and back again. There is just a new tenant config option to choose the compaction algorithm. The default is "Legacy", meaning the current algorithm in 'main'. If you set it to "Tiered", the new algorithm is used. 3. A simulator, which implements the new abstraction. The simulator can be used to analyze write and storage amplification, without running a test with the full pageserver. It can also draw an SVG animation of the simulation, to visualize how layers are created and deleted. To run the simulator: cargo run --bin compaction-simulator run-suite --------- Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2024-02-27 17:15:46 +01:00
Joonas Koivunen	a691786ce2	fix: logical size calculation gating (#6915 ) Noticed that we are failing to handle `Result::Err` when entering a gate for logical size calculation. Audited rest of the gate enters, which seem fine, unified two instances. Noticed that the gate guard allows to remove a failpoint, then noticed that adjacent failpoint was blocking the executor thread instead of using `pausable_failpoint!`, fix both. eviction_task.rs now maintains a gate guard as well. Cc: #4733	2024-02-27 14:27:13 +00:00
Christian Schwarz	62d77e263f	test_remote_timeline_client_calls_started_metric: fix flakiness (#6911 ) fixes https://github.com/neondatabase/neon/issues/6889 # Problem The failure in the last 3 flaky runs on `main` is ``` test_runner/regress/test_remote_storage.py:460: in test_remote_timeline_client_calls_started_metric churn("a", "b") test_runner/regress/test_remote_storage.py:457: in churn assert gc_result["layers_removed"] > 0 E assert 0 > 0 ``` That's this code `cd449d66ea/test_runner/regress/test_remote_storage.py (L448-L460)` So, the test expects GC to remove some layers but the GC doesn't. # Fix My impression is that the VACUUM isn't re-using pages aggressively enough, but I can't really prove that. Tried to analyze the layer map dump but it's too complex. So, this PR: - Creates more churn by doing the overwrite twice. - Forces image layer creation. It also drive-by removes the redundant call to timeline_compact, because, timeline_checkpoint already does that internally.	2024-02-27 10:55:10 +01:00
Vlad Lazar	5accf6e24a	attachment_service: JWT auth enforcement (#6897 ) ## Problem Attachment service does not do auth based on JWT scopes. ## Summary of changes Do JWT based permission checking for requests coming into the attachment service. Requests into the attachment service must use different tokens based on the endpoint: * `/control` and `/debug` require `admin` scope * `/upcall` requires `generations_api` scope * `/v1/...` requires `pageserverapi` scope Requests into the pageserver from the attachment service must use `pageserverapi` scope.	2024-02-26 18:17:06 +00:00
John Spray	256058f2ab	pageserver: only write out legacy tenant config if no generation (#6891 ) ## Problem Previously we always wrote out both legacy and modern tenant config files. The legacy write enabled rollbacks, but we are long past the point where that is needed. We still need the legacy format for situations where someone is running tenants without generations (that will be yanked as well eventually), but we can avoid writing it out at all if we do have a generation number set. We implicitly also avoid writing the legacy config if our mode is Secondary (secondary mode is newer than generations). ## Summary of changes - Make writing legacy tenant config conditional on there being no generation number set.	2024-02-26 10:24:58 +00:00
Christian Schwarz	ceedc3ef73	Timeline::repartition: enforce no concurrent callers & lsn to not move backwards (#6862 ) This PR enforces aspects of `Timeline::repartition` that were already true at runtime: - it's not called concurrently, so, bail out if it is anyway (see comment why it's not called concurrently) - the `lsn` should never be moving backwards over the lifetime of a Timeline object, because last_record_lsn() can only move forwards over the lifetime of a Timeline object The switch to tokio::sync::Mutex blows up the size of the `partitioning` field from 40 bytes to 72 bytes on Linux x86_64. That would be concerning if it was a hot field, but, `partitioning` is only accessed every 20s by one task, so, there won't be excessive cache pain on it. (It still sucks that it's now >1 cache line, but I need the Send-able MutexGuard in the next PR) part of https://github.com/neondatabase/neon/issues/6861	2024-02-26 11:22:15 +01:00
Christian Schwarz	5273c94c59	pageserver: remove two obsolete/unused per-timeline metrics (#6893 ) over-compensating the addition of a new per-timeline metric in https://github.com/neondatabase/neon/pull/6834 part of https://github.com/neondatabase/neon/issues/6737	2024-02-26 09:19:24 +00:00
Christian Schwarz	dedf66ba5b	remove `gc_feedback` mechanism (#6863 ) It's been dead-code-at-runtime for 9 months, let's remove it. We can always re-introduce it at a later point. Came across this while working on #6861, which will touch `time_for_new_image_layer`. This is an opporunity to make that function simpler.	2024-02-26 10:05:24 +01:00
John Spray	8283779ee8	pageserver: remove legacy attach/detach APIs from swagger (#6883 ) ## Problem Since the location config API was added, the attach and detach endpoints are deprecated. Hiding them from consumers of the swagger definition is a precursor to removing them Neon's cloud no longer uses this api since https://github.com/neondatabase/cloud/pull/10538 Fully removing the APIs will implicitly make use of generation numbers mandatory, and should happen alongside https://github.com/neondatabase/neon/issues/5388, which will happen once we're happy that the storage controller is ready for prime time. ## Summary of changes - Remove /attach and /detach from pageserver's swagger file	2024-02-25 14:53:17 +00:00
Joonas Koivunen	b8f9e3a9eb	fix(flaky): typo Stopping/Stopped (#6894 ) introduced in `8dee9908f8`, should help with the #6681 common problem which is just a mismatched allowed error.	2024-02-24 21:32:41 +00:00
Christian Schwarz	ec3efc56a8	Revert "Revert "refactor(VirtualFile::crashsafe_overwrite): avoid Handle::block_on in callers"" (#6775 ) Reverts neondatabase/neon#6765 , bringing back #6731 We concluded that #6731 never was the root cause for the instability in staging. More details: https://neondb.slack.com/archives/C033RQ5SPDH/p1708011674755319 However, the massive amount of concurrent `spawn_blocking` calls from the `save_metadata` calls during startups might cause a performance regression. So, we'll merge this PR here after we've stopped writing the metadata #6769).	2024-02-23 17:16:43 +01:00
Anastasia Lubennikova	a12e4261a3	Add neon.primary_is_running GUC. (#6705 ) We set it for neon replica, if primary is running. Postgres uses this GUC at the start, to determine if replica should wait for RUNNING_XACTS from primary or not. Corresponding cloud PR is https://github.com/neondatabase/cloud/pull/10183 * Add test hot-standby replica startup. * Extract oldest_running_xid from XlRunningXits WAL records. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Konstantin Knizhnik <knizhnik@garret.ru> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2024-02-23 13:56:41 +00:00
Christian Schwarz	cd449d66ea	stop writing `metadata` file (#6769 ) Building atop #6777, this PR removes the code that writes the `metadata` file and adds a piece of migration code that removes any remaining `metadata` files. We'll remove the migration code after this PR has been deployed. part of https://github.com/neondatabase/neon/issues/6663 More cleanups punted into follow-up issue, as they touch a lot of code: https://github.com/neondatabase/neon/issues/6890	2024-02-23 14:33:47 +01:00
Joonas Koivunen	bc7a82caf2	feat: bare-bones /v1/utilization (#6831 ) PR adds a simple at most 1Hz refreshed informational API for querying pageserver utilization. In this first phase, no actual background calculation is performed. Instead, the worst possible score is always returned. The returned bytes information is however correct. Cc: #6835 Cc: #5331	2024-02-22 13:58:59 +02:00
John Spray	c1095f4c52	pageserver: don't warn on tempfiles in secondary location (#6837 ) ## Problem When a secondary mode location starts up, it scans local layer files. Currently it warns on any layers whose names don't parse as a LayerFileName, generating warning spam from perfectly normal tempfiles. ## Summary of changes - Refactor local vars to build a Utf8PathBuf for the layer file candidate - Use the crate::is_temporary check to identify + clean up temp files. --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-02-22 09:32:27 +00:00
Arpad Müller	4de2f0f3e0	Implement a sharded time travel recovery endpoint (#6821 ) The sharding service didn't have support for S3 disaster recovery. This PR adds a new endpoint to the attachment service, which is slightly different from the endpoint on the pageserver, in that it takes the shard count history of the tenant as json parameters: we need to do time travel recovery for both the shard count at the target time and the shard count at the current moment in time, as well as the past shard counts that either still reference. Fixes #6604, part of https://github.com/neondatabase/cloud/issues/8233 --------- Co-authored-by: John Spray <john@neon.tech>	2024-02-21 16:35:37 +01:00
Joonas Koivunen	41464325c7	fix: remaining missed cancellations and timeouts (#6843 ) As noticed in #6836 some occurances of error conversions were missed in #6697: - `std::io::Error` popped up by `tokio::io::copy_buf` containing `DownloadError` was turned into `DownloadError::Other` - similarly for secondary downloader errors These changes come at the loss of pathname context. Cc: #6096	2024-02-21 15:20:59 +00:00
Joonas Koivunen	7257ffbf75	feat: imitiation_only eviction_task policy (#6598 ) mostly reusing the existing and perhaps controversially sharing the histogram. in practice we don't configure this per-tenant. Cc: #5331	2024-02-21 16:57:30 +02:00
John Spray	84f027357d	pageserver: adjust checkpoint distance for sharded tenants (#6852 ) ## Problem Where the stripe size is the same order of magnitude as the checkpoint distance (such as with default settings), tenant shards can easily pass through `checkpoint_distance` bytes of LSN without actually ingesting anything. This results in emitting many tiny L0 delta layers. ## Summary of changes - Multiply checkpoint distance by shard count before comparing with LSN distance. This is a heuristic and does not guarantee that we won't emit small layers, but it fixes the issue for typical cases where the writes in a (checkpoint_distance * shard_count) range of LSN bytes are somewhat distributed across shards. - Add a test that checks the size of layers after ingesting to a sharded tenant; this fails before the fix. --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2024-02-21 14:12:35 +00:00
Vlad Lazar	5d6083bfc6	pageserver: add vectored get implementation (#6576 ) This PR introduces a new vectored implementation of the read path. The search is basically a DFS if you squint at it long enough. LayerFringe tracks the next layers to visit and acts as our stack. Vertices are tuples of (layer, keyspace, lsn range). Continuously pop the top of the stack (most recent layer) and do all the reads for one layer at once. The search maintains a fringe (`LayerFringe`) which tracks all the layers that intersect the current keyspace being searched. Continuously pop the top of the fringe (layer with highest LSN) and get all the data required from the layer in one go. Said search is done on one timeline at a time. If data is still required for some keys, then search the ancestor timeline. Apart from the high level layer traversal, vectored variants have been introduced for grabbing data from each layer type. They still suffer from read amplification issues and that will be addressed in a different PR. You might notice that in some places we duplicate the code for the existing read path. All of that code will be removed when we switch the non-vectored read path to proxy into the vectored read path. In the meantime, we'll have to contend with the extra cruft for the sake of testing and gentle releasing.	2024-02-21 09:49:46 +00:00
Christian Schwarz	e49602ecf5	feat(metrics): per-timeline metric for on-demand downloads, remove calls_started histogram (#6834 ) refs #6737 # Problem Before this PR, on-demand downloads weren't measured per tenant_id. This makes root-cause analysis of latency spikes harder, requiring us to resort to log scraping for ``` {neon_service="pageserver"} \|= `downloading on-demand` \|= `$tenant_id` ``` which can be expensive when zooming out in Grafana. Context: https://neondb.slack.com/archives/C033RQ5SPDH/p1707809037868189 # Solution / Changes - Remove the calls_started histogram - I did the dilegence, there are only 2 dashboards using this histogram, and in fact only one uses it as a histogram, the other just as a a counter. - [Link 1](`8115b54d9f/neonprod/dashboards/hkXNF7oVz/dashboard-Z31XmM24k.yaml (L1454)`): `Pageserver Thrashing` dashboard, linked from playbook, will fix. - [Link 2](`8115b54d9f/neonprod/dashboards/CEllzAO4z/dashboard-sJqfNFL4k.yaml (L599)`): one of my personal dashboards, unused for a long time, already broken in other ways, no need to fix. - replace `pageserver_remote_timeline_client_calls_unfinished` gauge with a counter pair - Required `Clone`-able `IntCounterPair`, made the necessary changes in the `libs/metrics` crate - fix tests to deal with the fallout A subsequent PR will remove a timeline-scoped metric to compensate. Note that we don't need additional global counters for the per-timeline counters affected by this PR; we can use the `remote_storage` histogram for those, which, conveniently, also include the secondary-mode downloads, which aren't covered by the remote timeline client metrics (should they?).	2024-02-20 17:52:23 +01:00
John Spray	d152d4f16f	pageserver: fix treating all download errors as 'Other' (#6836 ) ## Problem `download_retry` correctly uses a fatal check to avoid retrying forever on cancellations and NotFound cases. However, `download_layer_file` was casting all download errors to "Other" in order to attach an anyhow::Context. Noticed this issue in the context of secondary downloads, where requests to download layers that might not exist are issued intentionally, and this resulted in lots of error spam from retries that shouldn't have happened. ## Summary of changes - Remove the `.context()` so that the original DownloadError is visible to backoff::retry	2024-02-20 13:40:46 +00:00
Christian Schwarz	a48b23d777	fix(startup + remote_timeline_client): no-op deletion ops scheduled during startup (#6825 ) Before this PR, if remote storage is configured, `load_layer_map`'s call to `RemoteTimelineClient::schedule_layer_file_deletion` would schedule an empty UploadOp::Delete for each timeline. It's jsut CPU overhead, no actual interaction with deletion queue on-disk state or S3, as far as I can tell. However, it shows up in the "RemoteTimelineClient calls started metrics", which I'm refining in an orthogonal PR.	2024-02-20 14:06:25 +01:00
Arpad Müller	e0c12faabd	Allow initdb preservation for broken tenants (#6790 ) Often times the tenants we want to (WAL) DR are the ones which the pageserver marks as broken. Therefore, we should allow initdb preservation also for broken tenants. Fixes #6781.	2024-02-19 17:27:02 +01:00
John Spray	2f8a2681b8	pageserver: ensure we never try to save empty delta layer (#6805 ) ## Problem Sharded tenants could panic during compaction when they try to generate an L1 delta layer for a region that contains no keys on a particular shard. This is a variant of https://github.com/neondatabase/neon/issues/6755, where we attempt to save a delta layer with no keys. It is harder to reproduce than the case of image layers fixed in https://github.com/neondatabase/neon/pull/6776. It will become even less likely once https://github.com/neondatabase/neon/pull/6778 tweaks keyspace generation, but even then, we should not rely on keyspace partitioning to guarantee at least one stored key in each partition. ## Summary of changes - Move construction of `writer` in `compact_level0_phase1`, so that we never leave a writer constructed but without any keys.	2024-02-19 15:07:07 +00:00
John Spray	349b375010	pageserver: remove heatmap file during tenant delete (#6806 ) ## Problem Secondary mode locations keep a local copy of the heatmap, which needs cleaning up during deletion. Closes: https://github.com/neondatabase/neon/issues/6802 ## Summary of changes - Extend test_live_migration to reproduce the issue - Remove heatmap-v1.json during tenant deletion	2024-02-19 14:01:36 +00:00
Vlad Lazar	587cb705b8	pageserver: roll open layer in timeline writer (#6661 ) ## Problem One WAL record can actually produce an arbitrary amount of key value pairs. This is problematic since it might cause our frozen layers to bloat past the max allowed size of S3 single shot uploads. [#6639](https://github.com/neondatabase/neon/pull/6639) introduced a "should roll" check after every batch of `ingest_batch_size` (100 WAL records by default). This helps, but the original problem still exists. ## Summary of changes This patch moves the responsibility of rolling the currently open layer to the `TimelineWriter`. Previously, this was done ad-hoc via calls to `check_checkpoint_distance`. The advantages of this approach are: * ability to split one batch over multiple open layers * less layer map locking * remove ad-hoc check_checkpoint_distance calls More specifically, we track the current size of the open layer in the writer. On each `put` check whether the current layer should be closed and a new one opened. Keeping track of the currently open layer results in less contention on the layer map lock. It only needs to be acquired on the first write and on writes that require a roll afterwards. Rolling the open layer can be triggered by: 1. The distance from the last LSN we rolled at. This bounds the amount of WAL that the safekeepers need to store. 2. The size of the currently open layer. 3. The time since the last roll. It helps safekeepers to regard pageserver as caught up and suspend activity. Closes #6624	2024-02-19 12:34:27 +00:00
John Spray	5667372c61	pageserver: during shard split, wait for child to activate (#6789 ) ## Problem test_sharding_split_unsharded was flaky with log errors from tenants not being active. This was happening when the split function enters wait_lsn() while the child shard might still be activating. It's flaky rather than an outright failure because activation is usually very fast. This is also a real bug fix, because in realistic scenarios we could proceed to detach the parent shard before the children are ready, leading to an availability gap for clients. ## Summary of changes - Do a short wait_to_become_active on the child shards before proceeding to wait for their LSNs to advance --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2024-02-18 15:55:19 +00:00
John Spray	24014d8383	pageserver: fix sharding emitting empty image layers during compaction (#6776 ) ## Problem Sharded tenants would sometimes try to write empty image layers during compaction: this was more noticeable on larger databases. - https://github.com/neondatabase/neon/issues/6755 Note to reviewers: the last commit is a refactor that de-intents a whole block, I recommend reviewing the earlier commits one by one to see the real changes ## Summary of changes - Fix a case where when we drop a key during compaction, we might fail to write out keys (this was broken when vectored get was added) - If an image layer is empty, then do not try and write it out, but leave `start` where it is so that if the subsequent key range meets criteria for writing an image layer, we will extend its key range to cover the empty area. - Add a compaction test that configures small layers and compaction thresholds, and asserts that we really successfully did image layer generation. This fails before the fix.	2024-02-18 08:51:12 +00:00
Christian Schwarz	ca07fa5f8b	per-TenantShard read throttling (#6706 )	2024-02-16 21:26:59 +01:00
John Spray	5d039c6e9b	libs: add 'generations_api' auth scope (#6783 ) ## Problem Even if you're not enforcing auth, the JwtAuth middleware barfs on scopes it doesn't know about. Add `generations_api` scope, which was invented in the cloud control plane for the pageserver's /re-attach and /validate upcalls: this will be enforced in storage controller's implementation of these in a later PR. Unfortunately the scope's naming doesn't match the other scope's naming styles, so needs a manual serde decorator to give it an underscore. ## Summary of changes - Add `Scope::GenerationsApi` variant - Update pageserver + safekeeper auth code to print appropriate message if they see it.	2024-02-16 15:53:09 +00:00
Calin Anca	36e1100949	bench_walredo: use tokio multi-threaded runtime (#6743 ) fixes https://github.com/neondatabase/neon/issues/6648 Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-02-16 16:31:54 +01:00
Christian Schwarz	45e929c069	stop reading local `metadata` file (#6777 )	2024-02-16 09:35:11 +00:00
John Spray	6b980f38da	libs: refactor ShardCount.0 to private (#6690 ) ## Problem The ShardCount type has a magic '0' value that represents a legacy single-sharded tenant, whose TenantShardId is formatted without a `-0001` suffix (i.e. formatted as a traditional TenantId). This was error-prone in code locations that wanted the actual number of shards: they had to handle the 0 case specially. ## Summary of changes - Make the internal value of ShardCount private, and expose `count()` and `literal()` getters so that callers have to explicitly say whether they want the literal value (e.g. for storing in a TenantShardId), or the actual number of shards in the tenant. --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2024-02-15 21:59:39 +00:00
Joonas Koivunen	046d9c69e6	fix: require wider jwt for changing the io engine (#6770 ) io-engine should not be changeable with any JWT token, for example the tenant_id scoped token which computes have.	2024-02-15 16:58:26 +00:00
Arpad Müller	cd3e4ac18d	Rename TEST_IMG function to test_img (#6762 ) Latter follows the canonical way to naming functions in Rust.	2024-02-15 15:14:51 +00:00
Joonas Koivunen	936f2ee2a5	fix: accidential wide span in tests (#6772 ) introduced in a PR without other #[tracing::instrument] changes.	2024-02-15 13:48:44 +00:00
John Spray	5fa747e493	pageserver: shard splitting refinements (parent deletion, hard linking) (#6725 ) ## Problem - We weren't deleting parent shard contents once the split was done - Re-downloading layers into child shards is wasteful ## Summary of changes - Hard-link layers into child chart local storage during split - Delete parent shards content at the end --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2024-02-15 10:21:53 +02:00
Joonas Koivunen	80854b98ff	move timeouts and cancellation handling to remote_storage (#6697 ) Cancellation and timeouts are handled at remote_storage callsites, if they are. However they should always be handled, because we've had transient problems with remote storage connections. - Add cancellation token to the `trait RemoteStorage` methods - For `download`, `list` methods there is `DownloadError::{Cancelled,Timeout}` - For the rest now using `anyhow::Error`, it will have root cause `remote_storage::TimeoutOrCancel::{Cancel,Timeout}` - Both types have `::is_permanent` equivalent which should be passed to `backoff::retry` - New generic RemoteStorageConfig option `timeout`, defaults to 120s - Start counting timeouts only after acquiring concurrency limiter permit - Cancellable permit acquiring - Download stream timeout or cancellation is communicated via an `std::io::Error` - Exit backoff::retry by marking cancellation errors permanent Fixes: #6096 Closes: #4781 Co-authored-by: arpad-m <arpad-m@users.noreply.github.com>	2024-02-14 23:24:07 +00:00
Christian Schwarz	024372a3db	Revert "refactor(VirtualFile::crashsafe_overwrite): avoid Handle::block_on in callers" (#6765 ) Reverts neondatabase/neon#6731 On high tenant count Pageservers in staging, memory and CPU usage shoots to 100% with this change. (NB: staging currently has tokio-epoll-uring enabled) Will analyze tomorrow. https://neondb.slack.com/archives/C03H1K0PGKH/p1707933875639379?thread_ts=1707929541.125329&cid=C03H1K0PGKH	2024-02-14 19:17:12 +00:00
Arpad Müller	a2d0d44b42	Remove unused allow's (#6760 ) These allow's became redundant some time ago so remove them, or address them if addressing is very simple.	2024-02-14 18:16:05 +00:00
Christian Schwarz	7d3cdc05d4	fix(pageserver): pagebench doesn't work with released artifacts (#6757 ) The canonical release artifact of neon.git is the Docker image with all the binaries in them: ``` docker pull neondatabase/neon:release-4854 docker create --name extract neondatabase/neon:release-4854 docker cp extract:/usr/local/bin/pageserver ./pageserver.release-4854 chmod +x pageserver.release-4854 cp -a pageserver.release-4854 ./target/release/pageserver ``` Before this PR, these artifacts didn't expose the `keyspace` API, thereby preventing `pagebench get-page-latest-lsn` from working. Having working pagebench is useful, e.g., for experiments in staging. So, expose the API, but don't document it, as it's not part of the interface with control plane.	2024-02-14 17:01:15 +00:00
John Spray	840abe3954	pageserver: store aux files as deltas (#6742 ) ## Problem Aux files were stored with an O(N^2) cost, since on each modification the entire map is re-written as a page image. This addresses one axis of the inefficiency in logical replication's use of storage (https://github.com/neondatabase/neon/issues/6626). It will still be writing a large amount of duplicative data if writing the same slot's state every 15 seconds, but the impact will be O(N) instead of O(N^2). ## Summary of changes - Introduce `NeonWalRecord::AuxFile` - In `DatadirModification`, if the AUX_FILES_KEY has already been set, then write a delta instead of an image	2024-02-14 15:01:16 +00:00

1 2 3 4 5 ...

1912 Commits