rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-04 04:30:38 +00:00

Author	SHA1	Message	Date
Stefan Radig	fcab61bdcd	Prototype implementation for private access poc (#8976 ) ## Problem For the Private Access POC we want users to be able to disable access from the public proxy. To limit the number of changes this can be done by configuring an IP allowlist [ "255.255.255.255" ]. For the Private Access proxy a new commandline flag allows to disable IP allowlist completely. See https://www.notion.so/neondatabase/Neon-Private-Access-POC-Proposal-8f707754e1ab4190ad5709da7832f020?d=887495c15e884aa4973f973a8a0a582a#7ac6ec249b524a74adbeddc4b84b8f5f for details about the POC., ## Summary of changes - Adding the commandline flag is_private_access_proxy=true will disable IP allowlist	2024-09-12 15:55:12 +01:00
Tristan Partin	9e3ead3689	Collect the last of on-demand WAL download in CreateReplicationSlot reverts Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-09-12 11:31:38 +01:00
Heikki Linnakangas	8dc069037b	Remove NeonEnvBuilder.start() function It feels wrong to me to start() from the builder object. Surely the thing you start is the environment itself, not its configuration.	2024-09-12 01:28:56 +03:00
Heikki Linnakangas	0a363c3dce	Add --timeline-id option to "neon_local timeline branch" command Makes it consistent with the "timeline create" and "timeline import" commands, which allowed you to pass the timeline id as argument. This also makes it unnecessary to parse the timeline ID from the output in the python function that calls it.	2024-09-12 01:28:56 +03:00
Heikki Linnakangas	aeca15008c	Remove obsolete and misleading comment The tenant ID was not actually generated here but in NeonEnvBuilder. And the "neon_local init" command hasn't been able to generate the initial tenant since `8712e1899e` anyway.	2024-09-12 01:28:56 +03:00
Heikki Linnakangas	43846b72fa	Remove unused "neon_local init --pg-version" arg It has been unused since commit `8712e1899e`, when it stopped creating the initial timeline.	2024-09-12 01:28:56 +03:00
John Spray	cb060548fb	libs: tweak PageserverUtilization::is_overloaded (#8946 ) ## Problem Having run in production for a while, we see that nodes are generally safely oversubscribed by about a factor of 2. ## Summary of changes Tweak the is_overloaded method to check for utililzation over 200% rather than over 100%	2024-09-11 18:45:34 +01:00
Folke Behrens	bae793ffcd	proxy: Handle all let underscore instances (#8898 ) * Most can be simply replaced * One instance renamed to _rtchk (return-type check)	2024-09-10 15:36:08 +02:00
John Spray	26b5fcdc50	reinstate write-path key check (#8973 ) ## Problem In https://github.com/neondatabase/neon/pull/8621, validation of keys during ingest was removed because the places where we actually store keys are now past the point where we have already converted them to CompactKey (i128) representation. ## Summary of changes Reinstate validation at an earlier stage in ingest. This doesn't cover literally every place we write a key, but it covers most cases where we're trusting postgres to give us a valid key (i.e. one that doesn't try and use a custom spacenode).	2024-09-10 12:54:25 +01:00
Arpad Müller	97582178cb	Remove async_trait from the Handler trait (#8958 ) Newest attempt to remove `async_trait` from the Handler trait. Earlier attempts were in #7301 and #8296 .	2024-09-10 02:40:00 +02:00
Matthias van de Meent	842be0ba74	Specialize WalIngest on PostgreSQL version (#8904 ) The current code assumes that most of this functionality is version-independent, which is only true up to v16 - PostgreSQL 17 has a new field in CheckPoint that we need to keep track of. This basically removes the file-level dependency on v14, and replaces it with switches that load the correct version dependencies where required.	2024-09-09 23:01:52 +01:00
Heikki Linnakangas	982b376ea2	Update parquet crate to a released version (#8961 ) PR #7782 set the dependency in Cargo.toml to 'master', and locked the version to commit that contained a specific fix, because we needed the fix before it was included in a versioned release. The fix was later included in parquet crate version 52.0.0, so we can now switch back to using a released version. The latest release is 53.0.0, switch straight to that. --------- Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>	2024-09-10 00:04:00 +03:00
Alex Chi Z.	e158df4e86	feat(pageserver): split delta writer automatically determines key range (#8850 ) close https://github.com/neondatabase/neon/issues/8838 ## Summary of changes This patch modifies the split delta layer writer to avoid taking start_key and end_key when creating/finishing the layer writer. The start_key for the delta layers will be the first key provided to the layer writer, and the end_key would be the `last_key.next()`. This simplifies the delta layer writer API. On that, the layer key hack is removed. Image layers now use the full key range, and delta layers use the first/last key provided by the user. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-09-09 22:03:27 +01:00
Heikki Linnakangas	723c0971e8	Don't create 'empty' branch in neon_simple_env (#8965 ) Now that we've given up hope on sharing the neon_simple_env between tests, there's no reason to not use the 'main' branch directly.	2024-09-09 12:38:34 +03:00
Heikki Linnakangas	c8f67eed8f	Remove TEST_SHARED_FIXTURES (#8965 ) I wish it worked, but it's been broken for a long time, so let's admit defeat and remove it. The idea of sharing the same pageserver and safekeeper environment between tests is still sound, and it could save a lot of time in our CI. We should perhaps put some time into doing that, but we're better off starting from scratch than trying to make TEST_SHARED_FIXTURES work in its current form.	2024-09-09 12:38:34 +03:00
Heikki Linnakangas	2d885ac07a	Update strum (#8962 ) I wanted to use some features from the newer version. The PR that needed the new version is not ready yet (and might never be), but seems nice to stay up in any case.	2024-09-08 21:47:57 +03:00
Heikki Linnakangas	89c5e80b3f	Update toml and toml_edit crates (#8963 ) Eliminates a few duplicate versions from the dependency tree.	2024-09-08 21:47:23 +03:00
Heikki Linnakangas	93ec7503e0	Lock the correct revision of rust-postgres crates (#8960 ) We modified the crate in an incompatible way and upgraded to the new version in PR #8076. However, it was reverted in #8654. The revert reverted the Cargo.lock reference to it, but since Cargo.toml still points to the (tip of the) 'neon' branch, every time you make any other unrelated changes to Cargo.toml, it also tries to update the rust-postgres crates to the tip of the 'neon' branch again, which doesn't work. To fix, lock the crates to the exact commit SHA that works.	2024-09-07 14:11:36 +01:00
Alexander Bayandin	7d7d1f354b	Fix rust warnings on macOS (#8955 ) ## Problem ``` error: unused import: `anyhow::Context` --> libs/utils/src/crashsafe.rs:8:5 \| 8 \| use anyhow::Context; \| ^^^^^^^^^^^^^^^ \| = note: `-D unused-imports` implied by `-D warnings` = help: to override `-D warnings` add `#[allow(unused_imports)]` error: unused variable: `fd` --> libs/utils/src/crashsafe.rs:209:15 \| 209 \| pub fn syncfs(fd: impl AsRawFd) -> anyhow::Result<()> { \| ^^ help: if this is intentional, prefix it with an underscore: `_fd` \| = note: `-D unused-variables` implied by `-D warnings` = help: to override `-D warnings` add `#[allow(unused_variables)]` ``` ## Summary of changes - Fix rust warnings on macOS	2024-09-07 08:17:25 +01:00
Cihan Demirci	16c200d6d9	push images to prod ACR (#8940 ) Used `vars` for new storing non-sensitive information, changed dev secrets to vars as well but didn't cleanup any secrets. https://github.com/neondatabase/cloud/issues/16925 --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2024-09-07 00:20:36 +01:00
Joonas Koivunen	3dbd34aa78	feat(storcon): forward gc blocking and unblocking (#8956 ) Currently using gc blocking and unblocking with storage controller managed pageservers is painful. Implement the API on storage controller. Fixes: #8893	2024-09-06 22:42:55 +01:00
Arpad Müller	fa3fc73c1b	Address 1.82 clippy lints (#8944 ) Addresses the clippy lints of the beta 1.82 toolchain. The `too_long_first_doc_paragraph` lint complained a lot and was addressed separately: #8941	2024-09-06 21:05:18 +02:00
Alex Chi Z.	ac5815b594	feat(storage-controller): add node shards api (#8896 ) For control-plane managed tenants, we have the page in the admin console that lists all tenants on a specific pageserver. But for storage-controller managed ones, we don't have that functionality for now. ## Summary of changes Adds an API that lists all shards on a given node (intention + observed) --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-09-06 14:14:21 -04:00
Alexander Bayandin	30583cb626	CI(label-for-external-users): add retry logic for unexpected errors (#8938 ) ## Problem One of the PRs opened by a `neondatabase` org member got labelled as `external` because the `gh api` call failed in the wrong way: ``` Get "https://api.github.com/orgs/neondatabase/members/<username>": dial tcp 140.82.114.5:443: i/o timeout is-member=false ``` ## Summary of changes - Check that the error message is expected before labelling PRs - Retry `gh api` call for 10 times in case of unexpected error messages - Add `workflow_dispatch` trigger	2024-09-06 17:42:35 +01:00
Arseny Sher	c1a51416db	safekeeper: fsync filesystem on start. We can't really rely on files contents after boot without fsync'ing them.	2024-09-06 19:14:25 +03:00
Arseny Sher	8eab7009c1	safekeeper: do pid file lock before id init	2024-09-06 19:14:25 +03:00
Arseny Sher	11cf16e3f3	safekeeper: add term_bump endpoint. When walproposer observes now higher term it restarts instead of crashing whole compute with PANIC; this avoids compute crash after term_bump call. After successfull election we're still checking last_log_term of the highest given vote to ensure basebackup is good, and PANIC otherwise. It will be used for migration per 035-safekeeper-dynamic-membership-change.md and https://github.com/neondatabase/docs/pull/21 ref https://github.com/neondatabase/neon/issues/8700	2024-09-06 19:13:50 +03:00
Folke Behrens	af6f63617e	proxy: clean up code and lints for 1.81 and 1.82 (#8945 )	2024-09-06 17:13:30 +02:00
Arseny Sher	e287f36a05	safekeeper: fix endpoint restart immediately after xlog switch. Check that truncation point is not from the future by comparing it with write_record_lsn, not write_lsn, and explain that xlog switch changes their normal order. ref https://github.com/neondatabase/neon/issues/8911	2024-09-06 18:09:21 +03:00
Arpad Müller	cbcd4058ed	Fix 1.82 clippy lint too_long_first_doc_paragraph (#8941 ) Addresses the 1.82 beta clippy lint `too_long_first_doc_paragraph` by adding newlines to the first sentence if it is short enough, and making a short first sentence if there is the need.	2024-09-06 14:33:52 +02:00
Vlad Lazar	e86fef05dd	storcon: track preferred AZ for each tenant shard (#8937 ) ## Problem We want to do AZ aware scheduling, but don't have enough metadata. ## Summary of changes Introduce a `preferred_az_id` concept for each managed tenant shard. In a future PR, the scheduler will use this as a soft preference. The idea is to try and keep the shard attachments within the same AZ. Under the assumption that the compute was placed in the correct AZ, this reduces the chances of cross AZ trafic from between compute and PS. In terms of code changes we: 1. Add a new nullable `preferred_az_id` column to the `tenant_shards` table. Also include an in-memory counterpart. 2. Populate the preferred az on tenant creation and shard splits. 3. Add an endpoint which allows to bulk-set preferred AZs. (3) gives us the migration path. I'll write a script which queries the cplane db in the region and sets the preferred az of all shards with an active compute to the AZ of said compute. For shards without an active compute, I'll use the AZ of the currently attached pageserver since this is what cplane uses now to schedule computes.	2024-09-06 13:11:17 +01:00
Arpad Müller	a1323231bc	Update Rust to 1.81.0 (#8939 ) We keep the practice of keeping the compiler up to date, pointing to the latest release. This is done by many other projects in the Rust ecosystem as well. [Release notes](https://github.com/rust-lang/rust/blob/master/RELEASES.md#version-1810-2024-09-05). Prior update was in #8667 and #8518	2024-09-06 12:40:19 +02:00
Christian Schwarz	06e840b884	compact_level0_phase1: ignore access mode config, always do streaming-kmerge without validation (#8934 ) refs https://github.com/neondatabase/neon/issues/8184 PR https://github.com/neondatabase/infra/pull/1905 enabled streaming-kmerge without validation everywhere. It rolls into prod sooner or in the same release as this PR.	2024-09-06 10:58:48 +02:00
Christian Schwarz	cf11c8ab6a	update svg_fmt to 0.4.3 (#8930 ) Audited ``` diff -r -u ~/.cargo/registry/src/index.crates.io-6f17d22bba15001f/svg_fmt-0.4.{2,3} ``` fixes https://github.com/neondatabase/neon/issues/7763	2024-09-06 10:52:29 +02:00
Vlad Lazar	04f99a87bf	storcon: make pageserver AZ id mandatory (#8856 ) ## Problem https://github.com/neondatabase/neon/pull/8852 introduced a new nullable column for the `nodes` table: `availability_zone_id` ## Summary of changes * Make neon local and the test suite always provide an az id * Make the az id field in the ps registration request mandatory * Migrate the column to non-nullable and adjust in memory state accordingly * Remove the code that was used to populate the az id for pre-existing nodes	2024-09-05 19:14:21 +01:00
Stefan Radig	fd12dd942f	Add installation instructions for m4 on mac (#8929 ) ## Problem Building on MacOS failed due to missing m4. Although a window was popping up claiming to install m4, this was not helping. ## Summary of changes Add instructions to install m4 using brew and link it (thanks to Folke for helping).	2024-09-05 17:48:51 +02:00
vladov	ebddda5b7f	Fix precedence issue causing yielding loop to never yield. (#8922 ) There is a bug in `yielding_loop` that causes it to never yield. ## Summary of changes Fixed the bug. `i + 1 % interval == 0` will always evaluate to `i + 1 == 0` which is false ([Playground](https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=68e6ca393a02113cb7720115c2842e75)). This function is called in 2 places [here](`99fa1c3600/pageserver/src/tenant/secondary/scheduler.rs (L389)`) and [here](`99fa1c3600/pageserver/src/tenant/secondary/heatmap_uploader.rs (L152)`) with `interval == 1000` in both cases. This may change the performance of the system since now we are yielding to tokio. Also, this may expose undefined behavior since it is now possible for tasks to be moved between threads/whatever tokio does to tasks. However, this was the intention of the author of the code.	2024-09-05 11:06:57 -04:00
Joonas Koivunen	efe03d5a1c	build: sync between benchies (#8919 ) Sometimes, the benchmarks fail to start up pageserver in 10s without any obvious reason. Benchmarks run sequentially on otherwise idle runners. Try running `sync(2)` after each bench to force a cleaner slate. Implement this via: - SYNC_AFTER_EACH_TEST environment variable enabled autouse fixture - autouse fixture seems to be outermost fixture, so it works as expected - set SYNC_AFTER_EACH_TEST=true for benchmarks in build_and_test workflow Evidence: https://neon-github-public-dev.s3.amazonaws.com/reports/main/10678984691/index.html#suites/5008d72a1ba3c0d618a030a938fc035c/1210266507534c0f/ --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2024-09-05 14:29:48 +01:00
Christian Schwarz	850421ec06	refactor(pageserver): rely on serde derive for toml deserialization (#7656 ) This PR simplifies the pageserver configuration parsing as follows: * introduce the `pageserver_api::config::ConfigToml` type * implement `Default` for `ConfigToml` * use serde derive to do the brain-dead leg-work of processing the toml document * use `serde(default)` to fill in default values * in `pageserver` crate: * use `toml_edit` to deserialize the pageserver.toml string into a `ConfigToml` * `PageServerConfig::parse_and_validate` then * consumes the `ConfigToml` * destructures it exhaustively into its constituent fields * constructs the `PageServerConfig` The rules are: * in `ConfigToml`, use `deny_unknown_fields` everywhere * static default values go in `pageserver_api` * if there cannot be a static default value (e.g. which default IO engine to use, because it depends on the runtime), make the field in `ConfigToml` an `Option` * if runtime-augmentation of a value is needed, do that in `parse_and_validate` * a good example is `virtual_file_io_engine` or `l0_flush`, both of which need to execute code to determine the effective value in `PageServerConf` The benefits: * massive amount of brain-dead repetitive code can be deleted * "unused variable" compile-time errors when removing a config value, due to the exhaustive destructuring in `parse_and_validate` * compile-time errors guide you when adding a new config field Drawbacks: * serde derive is sometimes a bit too magical * `deny_unknown_fields` is easy to miss Future Work / Benefits: * make `neon_local` use `pageserver_api` to construct `ConfigToml` and write it to `pageserver.toml` * This provides more type safety / coompile-time errors than the current approach. ### Refs Fixes #3682 ### Future Work * `remote_storage` deser doesn't reject unknown fields https://github.com/neondatabase/neon/issues/8915 * clean up `libs/pageserver_api/src/config.rs` further * break up into multiple files, at least for tenant config * move `models` as appropriate / refine distinction between config and API models / be explicit about when it's the same * use `pub(crate)` visibility on `mod defaults` to detect stale values	2024-09-05 14:59:49 +02:00
Folke Behrens	6dfbf49128	proxy: don't let one timeout eat entire retry budget (#8924 ) This reduces the per-request timeout to 10sec while keeping the total retry duration at 1min. Relates: neondatabase/cloud#15944	2024-09-05 13:34:27 +02:00
Vlad Lazar	708322ce3c	storcon: handle fills including high tput tenants more gracefully (#8865 ) ## Problem A tenant may ingest a lot of data between being drained for node restart and being moved back in the fill phase. This is expensive and causes the fill to stall. ## Summary of changes We make a tactical change to reduce secondary warm-up time for migrations in fills.	2024-09-05 09:56:26 +01:00
Alex Chi Z.	99fa1c3600	fix(pageserver): more information on aux v1 warnings (#8906 ) Part of https://github.com/neondatabase/neon/issues/8623 ## Summary of changes It seems that we have tenants with aux policy set to v1 but don't have any aux files in the storage. It is still safe to force migrate them without notifying the customers. This patch adds more details to the warning to identify the cases where we have to reach out to the users before retiring aux v1. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-09-04 21:45:04 +01:00
Heikki Linnakangas	0205ce1849	Update submodule reference for vendor/postgres-v14 (#8913 ) There was a confusion on the REL_14_STABLE_neon branch. PR https://github.com/neondatabase/postgres/pull/471 was merged ot the branch, but the corresponding PRs on the other REL_15_STABLE_neon and REL_16_STABLE_neon branches were not merged. Also, the submodule reference in the neon repository was never updated, so even though the REL_14_STABLE_neon branch contained the commit, it was never used. That PR https://github.com/neondatabase/postgres/pull/471 was a few bricks shy of a load (no tests, some differences between the different branches), so to get us to a good state, revert that change from the REL_14_STABLE_neon branch. This PR in the neon repository updates the submodule reference past two commites on the REL_14_STABLE_neon branch: first the commit from PR https://github.com/neondatabase/postgres/pull/471, and immediately after that the revert of the same commit. This brings us back to square one, but now the submodule reference matches the tip of the REL_14_STABLE_neon branch again.	2024-09-04 15:41:51 +01:00
John Spray	1a9b54f1d9	storage controller: read from database in validate API (#8784 ) ## Problem The initial implementation of the validate API treats the in-memory generations as authoritative. - This is true when only one storage controller is running, but if a rogue controller was running that hadn't been shut down properly, and some pageserver requests were routed to that bad controller, it could incorrectly return valid=true for stale generations. - The generation in the main in-memory map gets out of date while a live migration is in flight, and if the origin location for the migration tries to do some deletions even though it is in AttachedStale (for example because it had already started compaction), these might be wrongly validated + executed. ## Summary of changes - Continue to do the in-memory check: if this returns valid=false it is sufficient to reject requests. - When valid=true, do an additional read from the database to confirm the generation is fresh. - Revise behavior for validation on missing shards: this used to always return valid=true as a convenience for deletions and shard splits, so that pageservers weren't prevented from completing any enqueued deletions for these shards after they're gone. However, this becomes unsafe when we consider split brain scenarios. We could reinstate this in future if we wanted to store some tombstones for deleted shards. - Update test_scrubber_physical_gc to cope with the behavioral change: they must now explicitly flush the deletion queue before splits, to avoid tripping up on deletions that are enqueued at the time of the split (these tests assert "scrubber deletes nothing", which check fails if the split leaves behind some remote objects that are legitimately GC'able) - Add `test_storage_controller_validate_during_migration`, which uses failpoints to create a situation where incorrect generation validation during a live migration could result in a corruption The rate of validate calls for tenants is pretty low: it happens as a consequence deletions from GC and compaction, which are both concurrency-limited on the pageserver side.	2024-09-04 15:00:40 +01:00
dependabot[bot]	3f43823a9b	build(deps): bump cryptography from 42.0.4 to 43.0.1 (#8908 )	2024-09-04 13:41:10 +01:00
Heikki Linnakangas	a046717a24	Fix submodule refs to point to the correct REL_X_STABLE_neon branches (#8910 ) Commit `cfa45ff5ee` (PR #8860) updated the vendor/postgres submodules, but didn't use the same commit SHAs that were pushed as the corresponding REL__STABLE_neon branches in the postgres repository. The contents were the same, but the REL__STABLE_neon branches pointed to squashed versions of the commits, whereas the SHAs used in the submodules referred to the pre-squash revisions. Note: The vendor/postgres-v14 submodule still doesn't match with the tip of REL_14_STABLE_neon branch, because there has been one more commit on that branch since then. That's another confusion which we should fix, but let's do that separately. This commit doesn't change the code that gets built in any way, only changes the submodule references to point to the correct SHAs in the REL_*_STABLE_neon branch histories, rather than some detached commits.	2024-09-04 12:41:51 +01:00
Joonas Koivunen	7a1397cf37	storcon: boilerplate to upsert safekeeper records on deploy (#8879 ) We currently do not record safekeepers in the storage controller database. We want to migrate timelines across safekeepers eventually, so start recording the safekeepers on deploy. Cc: #8698	2024-09-04 10:10:05 +00:00
Vlad Lazar	75310fe441	storcon: make hb interval an argument and speed up tests (#8880 ) ## Problem Each test might wait for up to 5s in order to HB the pageserver. ## Summary of changes Make the heartbeat interval configurable and use a really tight one for neon local => startup quicker	2024-09-04 10:09:41 +01:00
Alex Chi Z.	ecfa3d9de9	fix(storage-scrubber): wrong trial condition (#8905 ) ref https://github.com/neondatabase/neon/issues/8872 ## Summary of changes We saw stuck storage scrubber in staging caused by infinite retries. I believe here we should use `min` instead of `max` to avoid getting minutes or hours of retry backoff. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-09-03 21:39:56 +00:00
Alex Chi Z.	3d9001d83f	fix(pageserver): is_archived should be optional (#8902 ) Set the field to optional, otherwise there will be decode errors when newer version of the storage controller receives the JSON from older version of the pageservers. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-09-03 14:05:06 -04:00

1 2 3 4 5 ...

6044 Commits