rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2025-12-22 21:59:59 +00:00

Author	SHA1	Message	Date
Heikki Linnakangas	9f653893b9	Update a few dependencies, removing some indirect dependencies cargo update ciborium iana-time-zone lazy_static schannel uuid cargo update hyper@0.14 cargo update --precise 2.9.7 ureq It might be worthwhile to just update all our dependencies at some point, but this is aimed at pruning the dependency tree, to make the build a little faster. That's also why I didn't update ureq to the latest version: that would've added a dependency on yet another version of rustls.	2024-09-23 00:37:41 +03:00
Heikki Linnakangas	913af44219	Update "memoffset" crate To eliminate one version of it from our dependency tree.	2024-09-23 00:37:41 +03:00
Heikki Linnakangas	ecd615ab6d	Update "hostname" crate We were already building v0.4.0 as an indirect dependency, so this avoids having to build two different versions of it.	2024-09-23 00:37:41 +03:00
Heikki Linnakangas	d211f00f05	Remove unnecessary dependencies (#9000 ) Found by "cargo machete"	2024-09-17 17:55:45 +03:00
Heikki Linnakangas	982b376ea2	Update parquet crate to a released version (#8961 ) PR #7782 set the dependency in Cargo.toml to 'master', and locked the version to commit that contained a specific fix, because we needed the fix before it was included in a versioned release. The fix was later included in parquet crate version 52.0.0, so we can now switch back to using a released version. The latest release is 53.0.0, switch straight to that. --------- Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>	2024-09-10 00:04:00 +03:00
Heikki Linnakangas	2d885ac07a	Update strum (#8962) I wanted to use some features from the newer version. The PR that needed the new version is not ready yet (and might never be), but it seems nice to stay up to date in any case.	2024-09-08 21:47:57 +03:00
Heikki Linnakangas	89c5e80b3f	Update toml and toml_edit crates (#8963 ) Eliminates a few duplicate versions from the dependency tree.	2024-09-08 21:47:23 +03:00
Heikki Linnakangas	93ec7503e0	Lock the correct revision of rust-postgres crates (#8960 ) We modified the crate in an incompatible way and upgraded to the new version in PR #8076. However, it was reverted in #8654. The revert reverted the Cargo.lock reference to it, but since Cargo.toml still points to the (tip of the) 'neon' branch, every time you make any other unrelated changes to Cargo.toml, it also tries to update the rust-postgres crates to the tip of the 'neon' branch again, which doesn't work. To fix, lock the crates to the exact commit SHA that works.	2024-09-07 14:11:36 +01:00
Arpad Müller	a1323231bc	Update Rust to 1.81.0 (#8939 ) We keep the practice of keeping the compiler up to date, pointing to the latest release. This is done by many other projects in the Rust ecosystem as well. [Release notes](https://github.com/rust-lang/rust/blob/master/RELEASES.md#version-1810-2024-09-05). Prior update was in #8667 and #8518	2024-09-06 12:40:19 +02:00
Christian Schwarz	cf11c8ab6a	update svg_fmt to 0.4.3 (#8930 ) Audited ``` diff -r -u ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/svg_fmt-0.4.{2,3} ``` fixes https://github.com/neondatabase/neon/issues/7763	2024-09-06 10:52:29 +02:00
Christian Schwarz	850421ec06	refactor(pageserver): rely on serde derive for toml deserialization (#7656) This PR simplifies the pageserver configuration parsing as follows: * introduce the `pageserver_api::config::ConfigToml` type * implement `Default` for `ConfigToml` * use serde derive to do the brain-dead leg-work of processing the toml document * use `serde(default)` to fill in default values * in `pageserver` crate: * use `toml_edit` to deserialize the pageserver.toml string into a `ConfigToml` * `PageServerConfig::parse_and_validate` then * consumes the `ConfigToml` * destructures it exhaustively into its constituent fields * constructs the `PageServerConfig` The rules are: * in `ConfigToml`, use `deny_unknown_fields` everywhere * static default values go in `pageserver_api` * if there cannot be a static default value (e.g. which default IO engine to use, because it depends on the runtime), make the field in `ConfigToml` an `Option` * if runtime-augmentation of a value is needed, do that in `parse_and_validate` * a good example is `virtual_file_io_engine` or `l0_flush`, both of which need to execute code to determine the effective value in `PageServerConf` The benefits: * massive amount of brain-dead repetitive code can be deleted * "unused variable" compile-time errors when removing a config value, due to the exhaustive destructuring in `parse_and_validate` * compile-time errors guide you when adding a new config field Drawbacks: * serde derive is sometimes a bit too magical * `deny_unknown_fields` is easy to miss Future Work / Benefits: * make `neon_local` use `pageserver_api` to construct `ConfigToml` and write it to `pageserver.toml` * This provides more type safety / compile-time errors than the current approach. 
### Refs Fixes #3682 ### Future Work * `remote_storage` deser doesn't reject unknown fields https://github.com/neondatabase/neon/issues/8915 * clean up `libs/pageserver_api/src/config.rs` further * break up into multiple files, at least for tenant config * move `models` as appropriate / refine distinction between config and API models / be explicit about when it's the same * use `pub(crate)` visibility on `mod defaults` to detect stale values	2024-09-05 14:59:49 +02:00
Arpad Müller	8eaa8ad358	Remove async_trait usages from safekeeper and neon_local (#8864 ) Removes additional async_trait usages from safekeeper and neon_local. Also removes now redundant dependencies of the `async_trait` crate. cc earlier work: #6305, #6464, #7303, #7342, #7212, #8296	2024-08-29 18:24:25 +02:00
Conrad Ludgate	a644f01b6a	proxy+pageserver: shared leaky bucket impl (#8539 ) In proxy I switched to a leaky-bucket impl using the GCRA algorithm. I figured I could share the code with pageserver and remove the leaky_bucket crate dependency with some very basic tokio timers and queues for fairness. The underlying algorithm should be fairly clear how it works from the comments I have left in the code. --- In benchmarking pageserver, @problame found that the new implementation fixes a getpage throughput discontinuity in pageserver under the `pagebench get-page-latest-lsn` benchmark with the clickbench dataset (`test_perf_olap.py`). The discontinuity is that for any of `--num-clients={2,3,4}`, getpage throughput remains 10k. With `--num-clients=5` and greater, getpage throughput then jumps to the configured 20k rate limit. With the changes in this PR, the discontinuity is gone, and we scale throughput linearly to `--num-clients` until the configured rate limit. More context in https://github.com/neondatabase/cloud/issues/16886#issuecomment-2315257641. closes https://github.com/neondatabase/cloud/issues/16886 --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-08-29 11:26:52 +00:00
Christian Schwarz	9627747d35	bypass `PageCache` for `InMemoryLayer` + avoid `Value::deser` on L0 flush (#8537) Part of [Epic: Bypass PageCache for user data blocks](https://github.com/neondatabase/neon/issues/7386). # Problem `InMemoryLayer` still uses the `PageCache` for all data stored in the `VirtualFile` that underlies the `EphemeralFile`. # Background Before this PR, `EphemeralFile` is a fancy and (code-bloated) buffered writer around a `VirtualFile` that supports `blob_io`. The `InMemoryLayerInner::index` stores offsets into the `EphemeralFile`. At those offsets, we find a varint length followed by the serialized `Value`. Vectored reads (`get_values_reconstruct_data`) are not in fact vectored - each `Value` that needs to be read is read sequentially. The `will_init` bit of information which we use to early-exit the `get_values_reconstruct_data` for a given key is stored in the serialized `Value`, meaning we have to read & deserialize the `Value` from the `EphemeralFile`. The L0 flushing also needs to re-determine the `will_init` bit of information, by deserializing each value during L0 flush. # Changes 1. Store the value length and `will_init` information in the `InMemoryLayer::index`. The `EphemeralFile` thus only needs to store the values. 2. For `get_values_reconstruct_data`: - Use the in-memory `index` to figure out which values need to be read. Having the `will_init` stored in the index enables us to do that. - View the EphemeralFile as a byte array of "DIO chunks", each 512 bytes in size (adjustable constant). A "DIO chunk" is the minimal unit that we can read under direct IO. - Figure out which chunks need to be read to retrieve the serialized bytes for the values we need to read. - Coalesce chunk reads such that each DIO chunk is only read once to serve all value reads that need data from that chunk. - Merge adjacent chunk reads into larger `EphemeralFile::read_exact_at_eof_ok` of up to 128k (adjustable constant). 3. 
The new `EphemeralFile::read_exact_at_eof_ok` fills the IO buffer from the underlying VirtualFile and/or its in-memory buffer. 4. The L0 flush code is changed to use the `index` directly, `blob_io` 5. We can remove the `ephemeral_file::page_caching` construct now. The `get_values_reconstruct_data` changes seem like a bit overkill but they are necessary so we issue the equivalent amount of read system calls compared to before this PR where it was highly likely that even if the first PageCache access was a miss, remaining reads within the same `get_values_reconstruct_data` call from the same `EphemeralFile` page were a hit. The "DIO chunk" stuff is truly unnecessary for page cache bypass, but, since we're working on [direct IO](https://github.com/neondatabase/neon/issues/8130) and https://github.com/neondatabase/neon/issues/8719 specifically, we need to do _something_ like this anyways in the near future. # Alternative Design The original plan was to use the `vectored_blob_io` code, but it relies on the invariant of Delta&Image layers that `index order == values order`. Further, `vectored_blob_io` code's strategy for merging IOs is limited to adjacent reads. However, with direct IO, there is another level of merging that should be done, specifically, if multiple reads map to the same "DIO chunk" (=alignment-requirement-sized and -aligned region of the file), then it's "free" to read the chunk into an IO buffer and serve the two reads from that buffer. => https://github.com/neondatabase/neon/issues/8719 # Testing / Performance Correctness of the IO merging code is ensured by unit tests. Additionally, minimal tests are added for the `EphemeralFile` implementation and the bit-packed `InMemoryLayerIndexValue`. Performance testing results are presented below. All perf testing done on my M2 MacBook Pro, running a Linux VM. It's a release build without `--features testing`. 
We see definitive improvement in ingest performance microbenchmark and an ad-hoc microbenchmark for getpage against InMemoryLayer. ``` baseline: commit `7c74112b2a` origin/main HEAD: `ef1c55c52e` ``` <details> ``` cargo bench --bench bench_ingest -- 'ingest 128MB/100b seq, no delta' baseline ingest-small-values/ingest 128MB/100b seq, no delta time: [483.50 ms 498.73 ms 522.53 ms] thrpt: [244.96 MiB/s 256.65 MiB/s 264.73 MiB/s] HEAD ingest-small-values/ingest 128MB/100b seq, no delta time: [479.22 ms 482.92 ms 487.35 ms] thrpt: [262.64 MiB/s 265.06 MiB/s 267.10 MiB/s] ``` </details> We don't have a micro-benchmark for InMemoryLayer and it's quite cumbersome to add one. So, I did manual testing in `neon_local`. <details> ``` ./target/release/neon_local stop rm -rf .neon ./target/release/neon_local init ./target/release/neon_local start ./target/release/neon_local tenant create --set-default ./target/release/neon_local endpoint create foo ./target/release/neon_local endpoint start foo psql 'postgresql://cloud_admin@127.0.0.1:55432/postgres' psql (13.16 (Debian 13.16-0+deb11u1), server 15.7) CREATE TABLE wal_test ( id SERIAL PRIMARY KEY, data TEXT ); DO $$ DECLARE i INTEGER := 1; BEGIN WHILE i <= 500000 LOOP INSERT INTO wal_test (data) VALUES ('data'); i := i + 1; END LOOP; END $$; -- => result is one L0 from initdb and one 137M-sized ephemeral-2 DO $$ DECLARE i INTEGER := 1; random_id INTEGER; random_record wal_test%ROWTYPE; start_time TIMESTAMP := clock_timestamp(); selects_completed INTEGER := 0; min_id INTEGER := 1; -- Minimum ID value max_id INTEGER := 100000; -- Maximum ID value, based on your insert range iters INTEGER := 100000000; -- Number of iterations to run BEGIN WHILE i <= iters LOOP -- Generate a random ID within the known range random_id := min_id + floor(random() * (max_id - min_id + 1))::int; -- Select the row with the generated random ID SELECT * INTO random_record FROM wal_test WHERE id = random_id; -- Increment the select counter selects_completed 
:= selects_completed + 1; -- Check if a second has passed IF EXTRACT(EPOCH FROM clock_timestamp() - start_time) >= 1 THEN -- Print the number of selects completed in the last second RAISE NOTICE 'Selects completed in last second: %', selects_completed; -- Reset counters for the next second selects_completed := 0; start_time := clock_timestamp(); END IF; -- Increment the loop counter i := i + 1; END LOOP; END $$; ./target/release/neon_local stop baseline: commit `7c74112b2a` origin/main NOTICE: Selects completed in last second: 1864 NOTICE: Selects completed in last second: 1850 NOTICE: Selects completed in last second: 1851 NOTICE: Selects completed in last second: 1918 NOTICE: Selects completed in last second: 1911 NOTICE: Selects completed in last second: 1879 NOTICE: Selects completed in last second: 1858 NOTICE: Selects completed in last second: 1827 NOTICE: Selects completed in last second: 1933 ours NOTICE: Selects completed in last second: 1915 NOTICE: Selects completed in last second: 1928 NOTICE: Selects completed in last second: 1913 NOTICE: Selects completed in last second: 1932 NOTICE: Selects completed in last second: 1846 NOTICE: Selects completed in last second: 1955 NOTICE: Selects completed in last second: 1991 NOTICE: Selects completed in last second: 1973 ``` NB: the ephemeral file sizes differ by ca 1MiB, ours being 1MiB smaller. </details> # Rollout This PR changes the code in-place and is not gated by a feature flag.	2024-08-28 18:31:41 +00:00
Conrad Ludgate	612b643315	update diesel (#8816 ) https://rustsec.org/advisories/RUSTSEC-2024-0365	2024-08-23 15:28:22 +00:00
Arpad Müller	e80ab8fd6a	Update serde_json to 1.0.125 (#8813 ) Updates `serde_json` to `1.0.125`, rolling out speedups added by a serde_json contributor. Release [link](https://github.com/serde-rs/json/releases/tag/1.0.125). Blog post [link](https://purplesyringa.moe/blog/i-sped-up-serde-json-strings-by-20-percent/).	2024-08-23 12:14:14 +01:00
Conrad Ludgate	428b105dde	remove workspace hack from libs (#8780 ) This removes workspace hack from all libs, not from any binaries. This does not change the behaviour of the hack. Running ``` cargo clean cargo build --release --bin proxy ``` Before this change took 5m16s. After this change took 3m3s. This is because this allows the build to be parallelisable much more.	2024-08-21 14:45:32 +01:00
Conrad Ludgate	a7028d92b7	proxy: start of jwk cache (#8690 ) basic JWT implementation that caches JWKs and verifies signatures. this code is currently not reachable from proxy, I just wanted to get something merged in.	2024-08-14 13:35:29 +01:00
Conrad Ludgate	7e08fbd1b9	Revert "proxy: update tokio-postgres to allow arbitrary config params (#8076 )" (#8654 ) This reverts #8076 - which was already reverted from the release branch since forever (it would have been a breaking change to release for all users who currently set TimeZone options). It's causing conflicts now so we should revert it here as well.	2024-08-09 09:09:29 +01:00
Conrad Ludgate	ad0988f278	proxy: random changes (#8602) ## Problem 1. Hard to correlate startup parameters with the endpoint that provided them. 2. Some configurations are not needed in the `ProxyConfig` struct. ## Summary of changes Because of some borrow checker fun, I needed to switch to an interior-mutability implementation of our `RequestMonitoring` context system. Using https://docs.rs/try-lock/latest/try_lock/ as a cheap lock for such a use-case (needed to be thread safe). Removed the lock of each startup message, instead just logging only the startup params in a successful handshake. Also removed some values from `ProxyConfig` and kept them as arguments. (needed for local-proxy config)	2024-08-07 14:37:03 +01:00
John Spray	2334fed762	storage_controller: start adding chaos hooks (#7946 ) Chaos injection bridges the gap between automated testing (where we do lots of different things with small, short-lived tenants), and staging (where we do many fewer things, but with larger, long-lived tenants). This PR adds a first type of chaos which isn't really very chaotic: it's live migration of tenants between healthy pageservers. This nevertheless provides continuous checks that things like clean, prompt shutdown of tenants works for realistically deployed pageservers with realistically large tenants.	2024-08-02 09:37:44 +01:00
Alex Chi Z.	970f2923b2	storage-scrubber: log version on start (#8571 ) Helps us better identify which version of storage scrubber is running. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-08-01 13:52:34 +00:00
John Spray	1678dea20f	pageserver: add layer visibility calculation (#8511 ) ## Problem We recently added a "visibility" state to layers, but nothing initializes it. Part of: - #8398 ## Summary of changes - Add a dependency on `range-set-blaze`, which is used as a fast incrementally updated alternative to KeySpace. We could also use this to replace the internals of KeySpaceRandomAccum if we wanted to. Writing a type that does this kind of "BtreeMap & merge overlapping entries" thing isn't super complicated, but no reason to write this ourselves when there's a third party impl available. - Add a function to layermap to calculate visibilities for each layer - Add a function to Timeline to call into layermap and then apply these visibilities to the Layer objects. - Invoke the calculation during startup, after image layer creations, and when removing branches. Branch removal and image layer creation are the two ways that a layer can go from Visible to Covered. - Add unit test & benchmark for the visibility calculation - Expose `pageserver_visible_physical_size` metric, which should always be <= `pageserver_remote_physical_size`. - This metric will feed into the /v1/utilization endpoint later: the visible size indicates how much space we would like to use on this pageserver for this tenant. - When `pageserver_visible_physical_size` is greater than `pageserver_resident_physical_size`, this is a sign that the tenant has long-idle branches, which result in layers that are visible in principle, but not used in practice. This does not keep visibility hints up to date in all cases: particularly, when creating a child timeline, any previously covered layers will not get marked Visible until they are accessed. Updates after image layer creation could be implemented as more of a special case, but this would require more new code: the existing depth calculation code doesn't maintain+yield the list of deltas that would be covered by an image layer. 
## Performance This operation is done rarely (at startup and at timeline deletion), so needs to be efficient but not ultra-fast. There is a new `visibility` bench that measures runtime for a synthetic 100k layers case (`sequential`) and a real layer map (`real_map`) with ~26k layers. The benchmark shows runtimes of single digit milliseconds (on a ryzen 7950). This confirms that the runtime shouldn't be a problem at startup (as we already incur S3-level latencies there), but that it's slow enough that we definitely shouldn't call it more often than necessary, and it may be worthwhile to optimize further later (things like: when removing a branch, only bother scanning layers below the branchpoint) ``` visibility/sequential time: [4.5087 ms 4.5894 ms 4.6775 ms] change: [+2.0826% +3.9097% +5.8995%] (p = 0.00 < 0.05) Performance has regressed. Found 24 outliers among 100 measurements (24.00%) 2 (2.00%) high mild 22 (22.00%) high severe min: 0/1696070, max: 93/1C0887F0 visibility/real_map time: [7.0796 ms 7.0832 ms 7.0871 ms] change: [+0.3900% +0.4505% +0.5164%] (p = 0.00 < 0.05) Change within noise threshold. Found 4 outliers among 100 measurements (4.00%) 3 (3.00%) high mild 1 (1.00%) high severe min: 0/1696070, max: 93/1C0887F0 visibility/real_map_many_branches time: [4.5285 ms 4.5355 ms 4.5434 ms] change: [-1.0012% -0.8004% -0.5969%] (p = 0.00 < 0.05) Change within noise threshold. ```	2024-08-01 09:25:35 +00:00
Arpad Müller	163f2eaf79	Reduce linux-raw-sys duplication (#8577 ) Before, we had four versions of linux-raw-sys in our dependency graph: ``` linux-raw-sys@0.1.4 linux-raw-sys@0.3.8 linux-raw-sys@0.4.13 linux-raw-sys@0.6.4 ``` now it's only two: ``` linux-raw-sys@0.4.13 linux-raw-sys@0.6.4 ``` The changes in this PR are minimal. In order to get to its state one only has to update procfs in Cargo.toml to 0.16 and do `cargo update -p tempfile -p is-terminal -p prometheus`.	2024-08-01 08:22:21 +00:00
Yuchen Liang	e374d6778e	feat(storcon): store scrubber metadata scan result (#8480) Part of #8128, followed by #8502. ## Problem Currently we lack a mechanism to alert on unhealthy `scan_metadata` status if we start running this scrubber command as part of a cronjob. With the storage controller client introduced to storage scrubber in #8196, it is viable to set up alerts by storing health status in the storage controller database. We intentionally do not store the full output to the database as the json blobs could make the table really huge. Instead, we store only a health status and a timestamp recording the last time the metadata health status was posted for a tenant shard. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-07-30 14:32:00 +01:00
John Spray	f5db655447	pageserver: simplify LayerAccessStats (#8431 ) ## Problem LayerAccessStats contains a lot of detail that we don't use: short histories of most recent accesses, specifics on what kind of task accessed a layer, etc. This is all stored inside a Mutex, which is locked every time something accesses a layer. ## Summary of changes - Store timestamps at a very low resolution (to the nearest second), sufficient for use on the timescales of eviction. - Pack access time and last residence change time into a single u64 - Use the high bits of the u64 for other flags, including the new layer visibility concept. - Simplify the external-facing model for access stats to just include what we now track. Note that the `HistoryBufferWithDropCounter` is removed here because it is no longer used. I do not dislike this type, we just happen not to use it for anything else at present. Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-07-24 08:17:28 +01:00
Arpad Müller	2c0d311a54	remote_storage: add list_streaming API call (#8466 ) This adds the ability to list many prefixes in a streaming fashion to both the `RemoteStorage` trait as well as `GenericRemoteStorage`. * The `list` function of the `RemoteStorage` trait is implemented by default in terms of `list_streaming`. * For the production users (S3, Azure), `list_streaming` is implemented and the default `list` implementation is used. * For `LocalFs`, we keep the `list` implementation and make `list_streaming` call it. The `list_streaming` function is implemented for both S3 and Azure. A TODO for later is retries, which the scrubber currently has while the `list_streaming` implementations lack them. part of #8457 and #7547	2024-07-24 02:09:01 +02:00
Yuchen Liang	595c450036	fix(scrubber): more robust metadata consistency checks (#8344) Part of #8128. ## Problem Scrubber uses `scan_metadata` command to flag metadata inconsistencies. To trust it at scale, we need to make sure the errors we emit are a reflection of real scenarios. One check performed in the scrubber is to see whether layers listed in the latest `index_part.json` are present in the object listing. Currently, the scrubber does not robustly handle the case where objects are uploaded/deleted during the scan. ## Summary of changes Condition for success: An object in the index is (1) in the object listing we acquire from S3 or (2) found in a HeadObject request (new object). - Add in the `HeadObject` requests for the layers missing from the object listing. - Keep the order of first getting the object listing and then downloading the layers. - Update check to only consider shards with highest shard count. - Skip analyzing a timeline if `deleted_at` tombstone is marked in `index_part.json`. - Add new test to see if scrubber actually detects the metadata inconsistency. _Misc_ - A timeline with no ancestor should always have some layers. - Removed experimental histograms _Caveat_ - Ancestor layer is not cleaned until #8308 is implemented. If ancestor layers reference non-existing layers in the index, the scrubber will emit false positives. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-07-22 14:53:33 +01:00
John Spray	44781518d0	storage scrubber: GC ancestor shard layers (#8196 ) ## Problem After a shard split, the pageserver leaves the ancestor shard's content in place. It may be referenced by child shards, but eventually child shards will de-reference most ancestor layers as they write their own data and do GC. We would like to eventually clean up those ancestor layers to reclaim space. ## Summary of changes - Extend the physical GC command with `--mode=full`, which includes cleaning up unreferenced ancestor shard layers - Add test `test_scrubber_physical_gc_ancestors` - Remove colored log output: in testing this is irritating ANSI code spam in logs, and in interactive use doesn't add much. - Refactor storage controller API client code out of storcon_client into a `storage_controller/client` crate - During physical GC of ancestors, call into the storage controller to check that the latest shards seen in S3 reflect the latest state of the tenant, and there is no shard split in progress.	2024-07-19 19:07:59 +03:00
Christian Schwarz	a2d170b6d0	NeonEnv.from_repo_dir: use storage_controller_db instead of `attachments.json` (#8382) When `NeonEnv.from_repo_dir` was introduced, storage controller stored its state exclusively in `attachments.json`. Since then, it has moved to using Postgres, which stores its state in `storage_controller_db`. But `NeonEnv.from_repo_dir` wasn't adjusted to do this. This PR rectifies the situation. Context for this is failures in `test_pageserver_characterize_throughput_with_n_tenants` CF: https://neondb.slack.com/archives/C033RQ5SPDH/p1721035799502239?thread_ts=1720901332.293769&cid=C033RQ5SPDH Notably, `from_repo_dir` is also used by the backwards- and forwards-compatibility tests. Thus, the changes in this PR affect those tests as well. However, it turns out that the compatibility snapshot already contains the `storage_controller_db`. Thus, it should just work and in fact we can remove hacks like `fixup_storage_controller`. Follow-ups created as part of this work: * https://github.com/neondatabase/neon/issues/8399 * https://github.com/neondatabase/neon/issues/8400	2024-07-18 10:56:07 +02:00
Luca Bruno	8da3b547f8	proxy/http: switch to typed_json (#8377 ) ## Summary of changes This switches JSON rendering logic to `typed_json` in order to reduce the number of allocations in the HTTP responder path. Followup from https://github.com/neondatabase/neon/pull/8319#issuecomment-2216991760. --------- Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>	2024-07-15 12:38:52 +01:00
Japin Li	86d6ef305a	Remove fs2 dependency (#8350 ) The fs2 dependency is not needed anymore after commit `d42700280`.	2024-07-12 12:56:06 +03:00
John Spray	0159ae9536	safekeeper: eviction metrics (#8348 ) ## Problem Follow up to https://github.com/neondatabase/neon/pull/8335, to improve observability of how many evict/restores we are doing. ## Summary of changes - Add `safekeeper_eviction_events_started_total` and `safekeeper_eviction_events_completed_total`, with a "kind" label of evict or restore. This gives us rates, and also ability to calculate how many are in progress. - Generalize SafekeeperMetrics test type to use the same helpers as pageserver, and enable querying any metric. - Read the new metrics at the end of the eviction test.	2024-07-11 17:05:35 +01:00
Christian Schwarz	e26ef640c1	pageserver: remove `trace_read_requests` (#8338) `trace_read_requests` is a per `Tenant`-object option. But the `handle_pagerequests` loop doesn't know which `Tenant` object (i.e., which shard) the request is for. The remaining use of the `Tenant` object is to check `tenant.cancel`. That check is incorrect [if the pageserver hosts multiple shards](https://github.com/neondatabase/neon/issues/7427#issuecomment-2220577518). I'll fix that in a future PR where I completely eliminate the holding of `Tenant/Timeline` objects across requests. See [my code RFC](https://github.com/neondatabase/neon/pull/8286) for the high level idea. Note that we can always bring back the tracing functionality if we need it. But since it's actually about logging the `page_service` wire bytes, it should be a `page_service`-level config option, not per-Tenant. And for enabling tracing on a single connection, we can implement a `set pageserver_trace_connection;` option.	2024-07-11 15:17:07 +02:00
Stas Kelvich	6bbd34a216	Enable core dumps for postgres (#8272) Set core rlimit to unlimited in compute_ctl, so that all child processes inherit it. We could also set rlimit in the relevant startup script, but that way we would depend on external setup and might inadvertently disable it again (core dumping worked in pods, but not in VMs with inittab-based startup).	2024-07-11 10:20:14 +03:00
Christian Schwarz	3f7aebb01c	refactor: postgres_backend: replace abstract shutdown_watcher with CancellationToken (#8295 ) Preliminary refactoring while working on https://github.com/neondatabase/neon/issues/7427 and specifically https://github.com/neondatabase/neon/pull/8286	2024-07-09 21:11:11 +03:00
Conrad Ludgate	4a5b55c834	chore: fix nightly build (#8142 ) ## Problem `cargo +nightly check` fails ## Summary of changes Updates `measured`, `time`, and `crc32c`. * `measured`: updated to fix https://github.com/rust-lang/rust/issues/125763. * `time`: updated to fix https://github.com/rust-lang/rust/issues/125319 * `crc32c`: updated to remove some nightly feature detection with a removed nightly feature	2024-07-09 18:25:49 +01:00
John Spray	1121a1cbac	pageserver: switch to jemalloc (#8307) ## Problem - Resident memory on long running pageserver processes tends to climb: memory fragmentation is suspected. - Total resident memory may be a limiting factor for running on smaller nodes. ## Summary of changes - As a low-energy experiment, switch the pageserver to use jemalloc (not a net-new dependency, proxy already uses it) - Decide at end of week whether to revert before next release.	2024-07-08 14:10:42 +01:00
Christian Schwarz	7dcdbaa25e	remote_storage config: move handling of empty inline table `{}` to callers (#8193) Before this PR, `RemoteStorageConfig::from_toml` would support deserializing an empty `{}` TOML inline table to a `None`, otherwise try `Some()`. We can instead let * in proxy: let clap derive handle the Option * in PS & SK: assume that if the field is specified, it must be a valid RemoteStorageConfig (This PR started with a much simpler goal of factoring out the `deserialize_item` function because I need that in another PR).	2024-07-02 12:53:08 +02:00
Vlad Lazar	7026dde9eb	storcon: update db related dependencies (#8155) ## Problem Storage controller runs into memory corruption issue on the drain/fill code paths. ## Summary of changes Update db related dependencies in the unlikely case that the issue was fixed in diesel.	2024-06-25 15:06:18 +01:00
Conrad Ludgate	78d9059fc7	proxy: update tokio-postgres to allow arbitrary config params (#8076 ) ## Problem Fixes https://github.com/neondatabase/neon/issues/1287 ## Summary of changes tokio-postgres now supports arbitrary server params through the `param(key, value)` method. Some keys are special so we explicitly filter them out.	2024-06-24 10:20:27 +00:00
Arpad Müller	75747cdbff	Use serde for RemoteStorageConfig parsing (#8126 ) Adds a `Deserialize` impl to `RemoteStorageConfig`. We thus achieve the same as #7743 but with less repetitive code, by deriving `Deserialize` impls on `S3Config`, `AzureConfig`, and `RemoteStorageConfig`. The disadvantage is less useful error messages. The git history of this PR contains a state where we go via an intermediate representation, leveraging the `serde_json` crate, without it ever being actual json though. Also, the PR adds deserialization tests. Alternative to #7743 .	2024-06-22 17:57:09 +00:00
Vlad Lazar	5778d714f0	storcon: add drain and fill background operations for graceful cluster restarts (#8014 ) ## Problem Pageserver restarts cause read availability downtime for tenants. See `Motivation` section in the [RFC](https://github.com/neondatabase/neon/pull/7704). ## Summary of changes * Introduce a new `NodeSchedulingPolicy`: `PauseForRestart` * Implement the first take of drain and fill algorithms * Add a node status endpoint which can be polled to figure out when an operation is done The implementation follows the RFC, so it might be useful to peek at it as you're reviewing. Since the PR is rather chunky, I've made sure all commits build (with warnings), so you can review by commit if you prefer that. RFC: https://github.com/neondatabase/neon/pull/7704 Related https://github.com/neondatabase/neon/issues/7387	2024-06-19 11:55:30 +01:00
Arseny Sher	d8b2a49c55	safekeeper: streaming pull_timeline - Add /snapshot http endpoint streaming tar archive timeline contents up to flush_lsn. - Add check that term doesn't change, corresponding test passes now. - Also prepares infra to hold off WAL removal during the basebackup. - Sprinkle fsyncs to persist the pull_timeline result. ref https://github.com/neondatabase/neon/issues/6340	2024-06-18 15:45:39 +03:00
Arpad Müller	27518676d7	Rename S3 scrubber to storage scrubber (#8013 ) The S3 scrubber contains "S3" in its name, but we want to make it generic in terms of which storage is used (#7547). Therefore, rename it to "storage scrubber", following the naming scheme of already existing components "storage broker" and "storage controller". Part of #7547	2024-06-11 22:45:22 +00:00
Vlad Lazar	7121db3669	storcon_cli: add 'drain' command (#8007 ) ## Problem We need the ability to prepare a subset of storage controller managed pageservers for decommissioning. The storage controller cannot currently express this in terms of scheduling constraints (it's a pretty special case, so I'm not sure it even should). ## Summary of Changes A new `drain` command is added to `storcon_cli`. It takes a set of nodes to drain and migrates primary attachments outside of said set. Simple round robin assignment is used under the assumption that nodes outside of the draining set are evenly balanced. Note that secondary locations are not migrated. This is fine for staging, but the migration API will have to be extended for prod in order to allow migration of secondaries as well. I've tested this out against a neon local cluster. The immediate use for this command will be to migrate staging to ARM (AArch64) pageservers. Related https://github.com/neondatabase/cloud/issues/14029	2024-06-11 16:39:38 +00:00
John Spray	69026a9a36	storcon_cli: add 'drop' and eviction interval utilities (#7938 ) The storage controller has 'drop' APIs for tenants and nodes, for use in situations where something weird has happened: - node-drop is useful until we implement proper node decom, or if we have a partially provisioned node that somehow gets registered with the storage controller but is then dead. - tenant-drop is useful if we accidentally add a tenant that shouldn't be there at all, or if we want to make the controller forget about a tenant without deleting its data. For example, if one uses the tenant-warmup command with a bad tenant ID and needs to clean that up. The drop commands require an `--unsafe` parameter, to reduce the chance that someone incorrectly assumes these are the normal/clean ways to delete things. This PR also adds a convenience command for setting the time based eviction parameters on a tenant. This is useful when onboarding an existing tenant that has high resident size due to storage amplification in compaction: setting a lower time based eviction threshold brings down the resident size ahead of doing a shard split.	2024-06-03 18:13:01 +00:00
John Spray	69d18d6429	s3_scrubber: add `pageserver-physical-gc` (#7925 ) ## Problem Currently, we leave `index_part.json` objects from old generations behind each time a pageserver restarts or a tenant is migrated. This doesn't break anything, but it's annoying when a tenant has been around for a long time and starts to accumulate 10s-100s of these. Partially implements: #7043 ## Summary of changes - Add a new `pageserver-physical-gc` command to `s3_scrubber` The name is a bit of a mouthful, but I think it makes sense: - GC is the accurate term for what we are doing here: removing data that takes up storage but can never be accessed. - "physical" is a necessary distinction from the "normal" GC that we do online in the pageserver, which operates at a higher level in terms of LSNs+layers, whereas this type of GC is purely about S3 objects. - "pageserver" makes clear that this command deals exclusively with pageserver data, not safekeeper.	2024-06-03 17:16:23 +01:00
Joonas Koivunen	ef83f31e77	pagectl: key command for dumping what we know about the key (#7890 ) What we know about the key via added `pagectl key $key` command: - debug formatting - shard placement when `--shard-count` is specified - different boolean queries in `key.rs` - aux files v2 Example: ``` $ cargo run -qp pagectl -- key 000000063F00004005000060270000100E2C parsed from hex: 000000063F00004005000060270000100E2C: Key { field1: 0, field2: 1599, field3: 16389, field4: 24615, field5: 0, field6: 1052204 } rel_block: true rel_vm_block: false rel_fsm_block: false slru_block: false inherited: true rel_size: false slru_segment_size: false recognized kind: None ```	2024-05-31 18:19:41 +00:00
Arpad Müller	c18b1c0646	Update tokio-epoll-uring for linux-raw-sys (#7918 ) Updates the `tokio-epoll-uring` dependency. There is [only one change](`342ddd197a...08ccfa94ff`), the adoption of linux-raw-sys for `statx` instead of using libc. Part of #7889.	2024-05-30 17:45:48 +02:00
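
The `pagectl key` example in commit ef83f31e77 above shows a 36-digit hex string being decoded into six numeric fields. The following is a minimal sketch of that decoding, not the pagectl implementation: the `Key` struct here is a hypothetical stand-in, and the field widths (1, 4, 4, 4, 1, and 4 bytes, big-endian) are an assumption inferred only from the example output in that commit message.

```rust
/// Hypothetical stand-in for the key printed by `pagectl key`.
/// Field widths are inferred from the example output in the commit
/// message, not taken from the neon source.
#[derive(Debug, PartialEq, Eq)]
pub struct Key {
    pub field1: u8,
    pub field2: u32,
    pub field3: u32,
    pub field4: u32,
    pub field5: u8,
    pub field6: u32,
}

/// Decode a 36-digit hex string (18 bytes) into a `Key`,
/// reading the multi-byte fields as big-endian integers.
pub fn parse_key_hex(s: &str) -> Result<Key, String> {
    if s.len() != 36 || !s.bytes().all(|c| c.is_ascii_hexdigit()) {
        return Err(format!("expected 36 hex digits, got {:?}", s));
    }

    // Convert each pair of hex digits into one byte.
    let mut b = [0u8; 18];
    for (i, byte) in b.iter_mut().enumerate() {
        *byte = u8::from_str_radix(&s[2 * i..2 * i + 2], 16)
            .map_err(|e| format!("invalid hex at offset {}: {}", 2 * i, e))?;
    }

    // Read a big-endian u32 starting at byte offset `o`.
    let be32 = |o: usize| u32::from_be_bytes([b[o], b[o + 1], b[o + 2], b[o + 3]]);

    Ok(Key {
        field1: b[0],
        field2: be32(1),
        field3: be32(5),
        field4: be32(9),
        field5: b[13],
        field6: be32(14),
    })
}
```

Under these assumptions, calling `parse_key_hex("000000063F00004005000060270000100E2C")` yields the same six field values that the commit message lists (0, 1599, 16389, 24615, 0, 1052204).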


740 Commits