Implements the TODO from #8466 about retries: now the user of the stream
returned by `list_streaming` is able to obtain the next item in the
stream as often as they want, and retry it if it is an error.
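For illustration, a minimal sketch (not the actual call sites) of how a caller could drain such a stream while retrying failed items; it assumes the stream re-attempts the same listing page on the next `.next()` call after yielding an error, and `futures`/`tokio` as dependencies:

```rust
use std::time::Duration;
use futures::{Stream, StreamExt};

/// Hypothetical helper: drain a `list_streaming`-style stream, retrying a
/// failed item up to `max_retries` times. The stream is assumed to re-attempt
/// the same listing page on the next `.next()` call after yielding an error.
async fn drain_with_retries<S, T, E>(mut stream: S, max_retries: u32) -> Result<Vec<T>, E>
where
    S: Stream<Item = Result<T, E>> + Unpin,
{
    let mut out = Vec::new();
    let mut attempt = 0u32;
    while let Some(item) = stream.next().await {
        match item {
            Ok(page) => {
                attempt = 0;
                out.push(page);
            }
            Err(_) if attempt < max_retries => {
                attempt += 1;
                // Back off briefly, then ask the stream for the same page again.
                tokio::time::sleep(Duration::from_millis(100 * attempt as u64)).await;
            }
            Err(e) => return Err(e),
        }
    }
    Ok(out)
}
```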
Also extends the test for paginated listing to include a dedicated
test for `list_streaming`.
Follow-up of #8466, fixes #8457.
part of #7547
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Problem
LayerAccessStats contains a lot of detail that we don't use: short
histories of most recent accesses, specifics on what kind of task
accessed a layer, etc. This is all stored inside a Mutex, which is
locked every time something accesses a layer.
## Summary of changes
- Store timestamps at a very low resolution (to the nearest second),
sufficient for use on the timescales of eviction.
- Pack access time and last residence change time into a single u64
- Use the high bits of the u64 for other flags, including the new layer
visibility concept.
- Simplify the external-facing model for access stats to just include
what we now track.
Note that the `HistoryBufferWithDropCounter` is removed here because it
is no longer used. I do not dislike this type, we just happen not to use
it for anything else at present.
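A minimal sketch of the packing idea, with an illustrative layout and hypothetical names (the actual field widths and flag assignments differ); it also shows how a single atomic word removes the need for the old Mutex on the access path:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Illustrative layout only: low 32 bits = last access time (whole seconds),
// top bits = flags such as visibility. Field widths, flag positions and names
// are hypothetical; the residence-change field is elided for brevity.
const VISIBLE_BIT: u64 = 1 << 63;
const ACCESS_TIME_MASK: u64 = 0xffff_ffff;

struct PackedAccessStats(AtomicU64);

impl PackedAccessStats {
    fn record_access(&self, now_secs: u32) {
        // Replace the low 32 bits and set the visibility flag, keeping the
        // rest of the word untouched -- no Mutex needed on the access path.
        let mut cur = self.0.load(Ordering::Relaxed);
        loop {
            let new = (cur & !ACCESS_TIME_MASK) | u64::from(now_secs) | VISIBLE_BIT;
            match self
                .0
                .compare_exchange_weak(cur, new, Ordering::Relaxed, Ordering::Relaxed)
            {
                Ok(_) => return,
                Err(actual) => cur = actual,
            }
        }
    }

    fn is_visible(&self) -> bool {
        self.0.load(Ordering::Relaxed) & VISIBLE_BIT != 0
    }
}
```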
Co-authored-by: Christian Schwarz <christian@neon.tech>
## Problem
While investigating the flakiness of test_subscriber_restart, I found
that this test does not pass at all for PG 14/15 on macOS (while it
works for PG 16).
## Summary of changes
Rewrite the async connect state machine the same way vanilla Postgres
does: call `WaitLatchOrSocket` with `WL_SOCKET_WRITEABLE` before
calling `PQconnectPoll`.
Note that this most likely will not fix the flakiness of
test_subscriber_restart.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
This adds the ability to list many prefixes in a streaming fashion to
both the `RemoteStorage` trait and `GenericRemoteStorage`.
* The `list` function of the `RemoteStorage` trait is implemented by
default in terms of `list_streaming`.
* For the production users (S3, Azure), `list_streaming` is implemented
and the default `list` implementation is used.
* For `LocalFs`, we keep the `list` implementation and make
`list_streaming` call it.
The `list_streaming` function is implemented for both S3 and Azure.
A TODO for later is retries, which the scrubber currently has while the
`list_streaming` implementations lack them.
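A hedged sketch of how the default `list` can be expressed in terms of `list_streaming` (signatures are heavily simplified; the real trait also takes a prefix, listing mode, key limit, and cancellation token, and `anyhow`/`futures` are assumed dependencies):

```rust
use futures::{Stream, StreamExt};

// Heavily simplified, hypothetical shapes: the real trait methods take more
// parameters and `Listing` carries more than plain keys.
struct Listing {
    keys: Vec<String>,
}

trait RemoteStorageSketch {
    // Each backend implements the streaming variant...
    fn list_streaming(&self) -> impl Stream<Item = Result<Listing, anyhow::Error>> + Send;

    // ...and `list` is provided by default as "collect the whole stream".
    async fn list(&self) -> Result<Listing, anyhow::Error> {
        let mut stream = std::pin::pin!(self.list_streaming());
        let mut combined = Listing { keys: Vec::new() };
        while let Some(chunk) = stream.next().await {
            combined.keys.extend(chunk?.keys);
        }
        Ok(combined)
    }
}
```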
part of #8457 and #7547
part of https://github.com/neondatabase/neon/issues/8002
The main thing in this pull request is the new `generate_key_retention`
function. It decides which deltas to retain and which images to
generate for a given key, based on its history + retain_lsn + horizon.
On top of that, we generate a flat, single level of delta layers over
all deltas included in the compaction. In the future, we can decide
whether to split them over the LSN axis as described in the RFC.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
## Problem
`test_change_pageserver` stops pageservers in a way that can overlap
with the controller's heartbeats: the controller can get a heartbeat
success and then immediately find the node unavailable. This particular
situation triggers a log message that isn't in our current allow-list
of messages for offline nodes.
Example:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8339/10048487700/index.html#testresult/19678f27810231df/retries
## Summary of changes
- Add the message to the allow list
## Problem
Postgres uses the `access()` function in `GetNewRelFileNumber` to check
that the assigned relfilenumber is not used by any other relation. This
check does not work in Neon, because not all files are present in local
storage.
## Summary of changes
Use `smgrexists()` instead, which checks with the pageserver whether
such a relfilenode is in use.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
As described in https://github.com/neondatabase/neon/issues/8398, layer
visibility is a new hint that will help us manage disk space more
efficiently.
## Summary of changes
- Introduce LayerVisibilityHint and store it as part of access stats
- Automatically mark a layer visible if it is accessed, or when it is
created.
The impact on the access stats size will be reversed in
https://github.com/neondatabase/neon/pull/8431
This is functionally a no-op change: subsequent PRs will add the logic
that sets layers to Covered, and which uses the layer visibility as an
input to eviction and heatmap generation.
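Roughly, the concept looks like this (a simplified sketch, not the actual storage layout inside the access stats):

```rust
/// Sketch of the new hint: whether reads can still reach this layer, or
/// whether it is fully shadowed ("covered") by newer image layers.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum LayerVisibilityHint {
    /// Reads may need this layer.
    Visible,
    /// Fully covered by newer image layers; a candidate for eviction
    /// (nothing sets this yet in this PR, see the note above).
    Covered,
}

struct AccessStatsSketch {
    visibility: LayerVisibilityHint,
}

impl AccessStatsSketch {
    // Accessing a layer (or creating it) marks it visible again.
    fn record_access(&mut self) {
        self.visibility = LayerVisibilityHint::Visible;
    }
}
```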
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Problem
This test sometimes found that ancestors were getting cleaned up before
it had done any compaction.
Compaction was happening implicitly via Workload.
Example:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8298/10032173390/index.html#testresult/fb04786402f80822/retries
## Summary of changes
- Set upload=False when writing data after shard split, to avoid doing a
checkpoint
- Add a checkpoint_period & explicit wait for uploads so that we ensure
data lands in S3 without doing a checkpoint
In general, replace:
* 'lfc_approximate_working_set_size' with
* 'lfc_approximate_working_set_size_windows'
For the "main" metrics that are actually scraped and used internally,
the old one is just marked as deprecated.
For the "autoscaling" metrics, we're not currently using the old one, so
we can get away with just replacing it.
Also, for the user-visible metrics we'll only store & expose a few
different time windows, to avoid making the UI overly busy or bloating
our internal metrics storage.
But for the autoscaling-related scraper, we aren't storing the metrics,
and it's useful to be able to programmatically operate on the trendline
of how WSS increases (or doesn't!) with window size. So there, we can
just output datapoints for each minute.
Part of neondatabase/autoscaling#872
See also https://www.notion.so/neondatabase/cca38138fadd45eaa753d81b859490c6
Backup tools such as `tar` and `restic` recognize this `CACHEDIR.TAG` marker.
More info: https://bford.info/cachedir/
NB: cargo _should_ create the tag file in the `target/` directory
but doesn't if the directory already exists, which happens frequently
if rust-analyzer is launched by your IDE before you can type
`cargo build`. Hence, create the file manually here.
=> https://github.com/rust-lang/cargo/issues/14281
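For reference, a minimal sketch of creating the tag file (where exactly this runs is project-specific; the signature line is fixed by the cachedir spec linked above):

```rust
use std::{fs, io::Write, path::Path};

/// Create `CACHEDIR.TAG` in the build directory if it is not there yet, so
/// that backup tools skip it. The first line must be exactly this signature,
/// per the cachedir spec.
fn write_cachedir_tag(target_dir: &Path) -> std::io::Result<()> {
    let tag = target_dir.join("CACHEDIR.TAG");
    if tag.exists() {
        return Ok(());
    }
    fs::create_dir_all(target_dir)?;
    let mut f = fs::File::create(tag)?;
    f.write_all(b"Signature: 8a477f597d28d172789f06886806bc55\n")?;
    Ok(())
}
```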
## Motivation & Context
We want to move away from `task_mgr` towards explicit tracking of child
tasks.
This PR is extracted from https://github.com/neondatabase/neon/pull/8339
where I refactor `PageRequestHandler` to not depend on task_mgr anymore.
## Changes
This PR refactors all global tasks but `PageRequestHandler` to use some
combination of `JoinHandle`/`JoinSet` + `CancellationToken`.
The `task_mgr::spawn(.., shutdown_process_on_error)` functionality is
preserved through the new `exit_on_panic_or_error` wrapper.
Some global tasks were not using it before, but as of this PR, they are.
The rationale is that all global tasks are relevant for correct
operation of the overall Neon system in one way or another.
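As a rough illustration of the pattern (the real `exit_on_panic_or_error` signature and call sites differ; the task name here is made up and `tokio`, `tokio-util`, and `anyhow` are assumed dependencies):

```rust
use tokio::task::JoinHandle;
use tokio_util::sync::CancellationToken;

// Hypothetical sketch: spawn a global task with an explicit JoinHandle and
// CancellationToken instead of registering it with task_mgr.
fn spawn_global_task(cancel: CancellationToken) -> JoinHandle<()> {
    tokio::spawn(exit_on_panic_or_error("my-global-task", async move {
        loop {
            tokio::select! {
                _ = cancel.cancelled() => break,
                _ = tokio::time::sleep(std::time::Duration::from_secs(10)) => {
                    // periodic background work would go here
                }
            }
        }
        Ok::<_, anyhow::Error>(())
    }))
}

// Sketch of the wrapper: if the task returns an error, take down the whole
// process, mirroring task_mgr's shutdown_process_on_error. (The real wrapper
// also handles panics; that part is elided here.)
async fn exit_on_panic_or_error<F, T, E>(name: &str, fut: F) -> T
where
    F: std::future::Future<Output = Result<T, E>>,
    E: std::fmt::Display,
{
    match fut.await {
        Ok(v) => v,
        Err(e) => {
            eprintln!("global task {name} failed: {e}, exiting");
            std::process::exit(1);
        }
    }
}
```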
## Future Work
After #8339, we can make `task_mgr::spawn` require a `TenantId` instead
of an `Option<TenantId>` which concludes this step of cleanup work and
will help discourage future usage of task_mgr for global tasks.
Part of #8128.
## Problem
The scrubber uses the `scan_metadata` command to flag metadata
inconsistencies. To trust it at scale, we need to make sure the errors
it emits reflect real problems. One check performed in the scrubber is
to see whether the layers listed in the latest `index_part.json` are
present in the object listing. Currently, the scrubber does not
robustly handle the case where objects are uploaded/deleted during the
scan.
## Summary of changes
**Condition for success:** an object in the index passes the check if
it is (1) in the object listing we acquire from S3, or (2) found via a
HeadObject request (a newly uploaded object).
- Add in the `HeadObject` requests for the layers missing from the
object listing.
- Keep the order of first getting the object listing and then
downloading the layers.
- Update the check to only consider shards with the highest shard count.
- Skip analyzing a timeline if `deleted_at` tombstone is marked in
`index_part.json`.
- Add a new test to verify that the scrubber actually detects the
metadata inconsistency.
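A hedged sketch of the success condition above (types and names are illustrative, not the scrubber's actual API; `head_object_hits` stands in for the keys that a follow-up HeadObject request confirmed to exist):

```rust
use std::collections::HashSet;

/// Illustrative only: a layer referenced by index_part.json is reported as
/// missing only if it is absent from the object listing taken at the start of
/// the scan AND a follow-up HeadObject request did not find it either.
fn truly_missing_layers(
    index_layers: &[String],
    listing: &HashSet<String>,
    head_object_hits: &HashSet<String>,
) -> Vec<String> {
    index_layers
        .iter()
        .filter(|key| !listing.contains(*key) && !head_object_hits.contains(*key))
        .cloned()
        .collect()
}
```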
_Misc_
- A timeline with no ancestor should always have some layers.
- Removed experimental histograms
_Caveat_
- Ancestor layers are not cleaned up until #8308 is implemented. If
ancestor layers reference non-existing layers in the index, the scrubber
will emit false positives.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Starts using the `remote_storage` crate in the S3 scrubber for the
`PurgeGarbage` subcommand.
The `remote_storage` crate is generic over various backends and thus
using it gives us the ability to run the scrubber against all supported
backends.
Start with the `PurgeGarbage` subcommand as it doesn't use
`stream_tenants`.
Part of #7547.
## Problem
This test predates the storage controller. It stops pageservers and
reconfigures computes, but that races with the storage controller's node
failure detection, which can result in restarting nodes not getting the
attachments they expect, and the test failing.
## Summary of changes
- Configure the storage controller to use a compute notify hook that
does nothing, so that it cannot interfere with the test's configuration
of computes.
- Instead of using the attach hook, just notify the storage controller
that nodes are offline, and reconcile tenants so that they will
automatically be attached to the other node.
## Problem
Deployed pageserver configurations are all like this:
```
disk_usage_based_eviction:
max_usage_pct: 85
min_avail_bytes: 0
period: "10s"
eviction_order:
type: "RelativeAccessed"
args:
highest_layer_count_loses_first: true
```
But we're maintaining this optional absolute order eviction, with test
cases etc.
## Summary of changes
- Remove absolute order eviction. Make the default eviction policy the
same as how we really deploy pageservers.
## Problem
We have two RFCs numbered 34.
## Summary of changes
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
On Azure we need to use username/password authentication in the proxy
for the regional Redis client.
## Summary of changes
This adds `redis_auth_type` to the config, with a default value of
"irsa". If it is not specified, `regional_redis_client` is configured
with IRSA Redis (as it is done now).
If "plain" is specified, the regional client is configured with
`redis_notifications`, consuming username:password auth from the URI.
We plan to do that for the Azure cloud.
Configuring `regional_redis_client` is now required; there is no way to
opt out of configuring it.
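A sketch of the shape of such a knob; beyond `redis_auth_type` and its "irsa"/"plain" values, the names here are assumptions rather than the proxy's actual types:

```rust
use std::str::FromStr;

// Hypothetical sketch of the new knob: "irsa" (default) keeps the current
// IRSA-authenticated regional client; "plain" builds the regional client
// from a username:password URI instead.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
enum RedisAuthType {
    #[default]
    Irsa,
    Plain,
}

impl FromStr for RedisAuthType {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "irsa" => Ok(Self::Irsa),
            "plain" => Ok(Self::Plain),
            other => Err(format!("unknown redis_auth_type: {other}")),
        }
    }
}
```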
https://github.com/neondatabase/cloud/issues/14462
PR #8299 has switched the storage scrubber to use
`DefaultCredentialsChain`. Now we do this for `remote_storage`, as it
allows us to use `remote_storage` from inside kubernetes. Most of the
diff is due to `GenericRemoteStorage::from_config` becoming `async fn`.
This adds an archival_config endpoint to the pageserver. Currently it
has no effect, and always "works", but later the intent is that it will
make a timeline archived/unarchived.
- [x] add yml spec
- [x] add endpoint handler
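For orientation, a hedged sketch of what the request body for the endpoint above could look like (field and variant names are assumptions for illustration, not the final API; `serde` is assumed):

```rust
use serde::{Deserialize, Serialize};

// Hypothetical request body: the client asks for a timeline to be archived or
// unarchived. For now the handler would only validate and report success.
#[derive(Serialize, Deserialize, Debug)]
enum TimelineArchivalStateSketch {
    Archived,
    Unarchived,
}

#[derive(Serialize, Deserialize, Debug)]
struct ArchivalConfigRequestSketch {
    state: TimelineArchivalStateSketch,
}
```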
Part of https://github.com/neondatabase/neon/issues/8088
## Problem
There are some swagger errors in `pageserver/src/http/openapi_spec.yml`
```
Error 431 15000 Object includes not allowed fields
Error 569 3100401 should always have a 'required'
Error 569 15000 Object includes not allowed fields
Error 1111 10037 properties members must be schemas
```
## Summary of changes
Fixed the above errors.
## Problem
After a shard split, the pageserver leaves the ancestor shard's content
in place. It may be referenced by child shards, but eventually child
shards will de-reference most ancestor layers as they write their own
data and do GC. We would like to eventually clean up those ancestor
layers to reclaim space.
## Summary of changes
- Extend the physical GC command with `--mode=full`, which includes
cleaning up unreferenced ancestor shard layers
- Add test `test_scrubber_physical_gc_ancestors`
- Remove colored log output: in testing this is irritating ANSI code
spam in logs, and in interactive use doesn't add much.
- Refactor storage controller API client code out of storcon_client into
a `storage_controller/client` crate
- During physical GC of ancestors, call into the storage controller to
check that the latest shards seen in S3 reflect the latest state of the
tenant, and there is no shard split in progress.
We're removing the usage of this long-meaningless config field in
https://github.com/neondatabase/aws/pull/1599
Once that PR has been deployed to staging and prod, we can merge this
PR.
## Problem
My prior PR https://github.com/neondatabase/neon/pull/8422
left files behind in the GitHub Actions runner work directory with root
permissions.
As an example see here
https://github.com/neondatabase/neon/actions/runs/10001857641/job/27646237324#step:3:37
To work around this, we install vanilla Postgres as a non-root user
using deb packages in the /home/nonroot user directory.
## Summary of changes
- Since we cannot use root, we install the deb packages directly and
create symbolic links for psql, pgbench and the libraries in the
expected places
- Continue AWS jobs even if Azure jobs fail (because this region is
currently unreliable)
Successor of #8288: just enable zstd in tests. Also adds a test that
creates easily compressible data.
Part of #5431
---------
Co-authored-by: John Spray <john@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
The error means that the manager exited earlier than the
`ResidenceGuard`, which is not unexpected with the current deletion
implementation. This commit changes the log level to reduce noise.
Use the k-merge iterator in the compaction process to reduce memory
footprint.
part of https://github.com/neondatabase/neon/issues/8002
## Summary of changes
* refactor the bottom-most compaction code to use k-merge iterator
* add `Send` bounds on some structs, as they are used across await points
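A simplified, synchronous sketch of the k-merge idea (the real iterators stream key/LSN/value entries from delta layers and are async, hence the `Send` bounds mentioned above; names here are illustrative):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge N ascending iterators of (key, lsn) pairs into one ascending stream.
fn k_merge<I>(mut sources: Vec<I>) -> impl Iterator<Item = (u64, u64)>
where
    I: Iterator<Item = (u64, u64)>,
{
    // Min-heap over the current head of each source.
    let mut heap = BinaryHeap::new();
    for (idx, it) in sources.iter_mut().enumerate() {
        if let Some(head) = it.next() {
            heap.push(Reverse((head, idx)));
        }
    }
    std::iter::from_fn(move || {
        let Reverse((item, idx)) = heap.pop()?;
        // Refill from the source that just yielded the smallest entry.
        if let Some(next) = sources[idx].next() {
            heap.push(Reverse((next, idx)));
        }
        Some(item)
    })
}
```

Each source contributes at most one buffered entry at a time, which is where the reduced memory footprint comes from compared to collecting everything up front.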
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
We have an issue where some partially uploaded segments can actually be
missing from remote storage. I found this issue while looking at the
logs in staging, and it can be triggered by failed uploads:
1. Code tries to upload `SEG_TERM_LSN_LSN_sk5.partial`, but receives
error from S3
2. The failed attempt is saved to `segments` vec
3. After some time, the code tries to upload
`SEG_TERM_LSN_LSN_sk5.partial` again
4. This time the upload is successful and code calls `gc()` to delete
previous uploads
5. Since the new object and the old object share the same name, the
newly uploaded data gets deleted from remote storage
This commit fixes the issue by patching `gc()` not to delete objects
with the same name as the currently uploaded one.
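A rough sketch of the fix (names are illustrative): when collecting previously uploaded objects for deletion, skip any whose remote path matches the object that was just (re-)uploaded.

```rust
/// Illustrative sketch: previously uploaded partial-segment objects are
/// candidates for deletion, except any that share a name with the object we
/// just uploaded -- deleting "the old one" would delete the new data too.
fn objects_to_delete<'a>(previous_uploads: &'a [String], just_uploaded: &str) -> Vec<&'a String> {
    previous_uploads
        .iter()
        .filter(|path| path.as_str() != just_uploaded)
        .collect()
}
```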
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
Ahead of enabling eviction in the field, where it will become the
normal/default mode, let's enable it by default throughout our tests in
case any issues become visible there.
## Summary of changes
- Make default `extra_opts` for safekeepers enable offload & deletion
- Set low timeouts in `extra_opts` so that tests running for tens of
seconds have a chance to hit some of these background operations.
## Problem
These tests time out ~1 in 50 runs when in debug mode.
There is no indication of a real issue: they're just wrappers that have
large numbers of individual tests contained within one pytest case.
## Summary of changes
- Bump pg_regress timeout from 600 to 900s
- Bump test_isolation timeout from 300s (default) to 600s
In future it would be nice to break out these tests to run individual
cases (or batches thereof) as separate tests, rather than this monolith.
## Problem
This test would occasionally fail its metric check. This could happen in
the rare case that the nodes had all been restarted before their most
recent eviction.
The metric check was added in
https://github.com/neondatabase/neon/pull/8348
## Summary of changes
- Check metrics before each restart, accumulate into a bool that we
assert on at the end of the test
When `NeonEnv.from_repo_dir` was introduced, the storage controller
stored its state exclusively in `attachments.json`.
Since then, it has moved to using Postgres, which stores its state in
`storage_controller_db`.
But `NeonEnv.from_repo_dir` wasn't adjusted accordingly.
This PR rectifies the situation.
Context for this is failures in
`test_pageserver_characterize_throughput_with_n_tenants`
CF:
https://neondb.slack.com/archives/C033RQ5SPDH/p1721035799502239?thread_ts=1720901332.293769&cid=C033RQ5SPDH
Notably, `from_repo_dir` is also used by the backwards- and
forwards-compatibility tests.
Thus, the changes in this PR affect those tests as well.
However, it turns out that the compatibility snapshot already contains
the `storage_controller_db`.
Thus, it should just work and in fact we can remove hacks like
`fixup_storage_controller`.
Follow-ups created as part of this work:
* https://github.com/neondatabase/neon/issues/8399
* https://github.com/neondatabase/neon/issues/8400
## Problem
There is something wrong in the comments of
`control_plane/src/broker.rs` and `control_plane/src/pageserver.rs`.
## Summary of changes
Fixed the comments about the component names and their data paths in
`control_plane/src/broker.rs` and `control_plane/src/pageserver.rs`.
## Problem
We lack insight into:
- How much of a tenant's physical size is image vs. delta layers
- Average sizes of image vs. delta layers
- Total layer counts per timeline, indicating size of index_part object
As well as general observability love, this is motivated by
https://github.com/neondatabase/neon/issues/6738, where we need to
define some sensible thresholds for storage amplification, and using
total physical size may not work well (if someone does a lot of DROPs
then it's legitimate for the physical-synthetic ratio to be huge), but
the ratio between image layer size and delta layer size may be a better
indicator of whether we're generating unreasonable quantities of image
layers.
## Summary of changes
- Add pageserver_layer_bytes and pageserver_layer_count metrics,
labelled by timeline and `kind` (delta or image)
- Add & subtract these with LayerInner's lifetime.
I'm intentionally avoiding using a generic metric RAII guard object, to
avoid bloating LayerInner: it already has all the information it needs
to update metric on new+drop.
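A rough sketch of the approach without an RAII guard (the gauges below are stand-ins for the real prometheus-style metrics, and the struct is heavily simplified):

```rust
use std::sync::atomic::{AtomicI64, Ordering};

// Stand-ins for the real per-timeline gauges, which are labelled by timeline
// and `kind` (delta vs image); only the delta pair is shown here.
static DELTA_LAYER_BYTES: AtomicI64 = AtomicI64::new(0);
static DELTA_LAYER_COUNT: AtomicI64 = AtomicI64::new(0);

// Heavily simplified stand-in for LayerInner.
struct LayerInnerSketch {
    size_bytes: i64,
}

impl LayerInnerSketch {
    fn new(size_bytes: i64) -> Self {
        // On creation, add the layer's size to the gauges...
        DELTA_LAYER_BYTES.fetch_add(size_bytes, Ordering::Relaxed);
        DELTA_LAYER_COUNT.fetch_add(1, Ordering::Relaxed);
        Self { size_bytes }
    }
}

impl Drop for LayerInnerSketch {
    fn drop(&mut self) {
        // ...and subtract it on drop, so the gauges track the live layer set
        // without a separate RAII guard object.
        DELTA_LAYER_BYTES.fetch_sub(self.size_bytes, Ordering::Relaxed);
        DELTA_LAYER_COUNT.fetch_sub(1, Ordering::Relaxed);
    }
}
```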
This test reproduces the case of a writer creating a deep stack of L0
layers. It uses realistic layer sizes and writes several gigabytes of
data, therefore runs as a performance test although it is validating
memory footprint rather than performance per se.
It acts as a regression test for two recent fixes:
- https://github.com/neondatabase/neon/pull/8401
- https://github.com/neondatabase/neon/pull/8391
In future it will demonstrate the larger improvement of using a k-merge
iterator for L0 compaction (#8184)
This test can be extended to enforce limits on the memory consumption of
other housekeeping steps, by restarting the pageserver and then running
other things to do the same "how much did RSS increase" measurement.