rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-24 16:40:38 +00:00

Author	SHA1	Message	Date
Conrad Ludgate	897cffb9d8	auth_broker: fix local_proxy conn count (#9593 ) our current metrics for http pool opened connections is always negative :D oops	2024-10-31 14:57:55 +00:00
John Spray	552088ac16	pageserver: fix spurious error logs in timeline lifecycle (#9589 ) ## Problem The final part of https://github.com/neondatabase/neon/issues/9543 will be a chaos test that creates/deletes/archives/offloads timelines while restarting pageservers and migrating tenants. Developing that test showed up a few places where we log errors during normal shutdown. ## Summary of changes - UninitializedTimeline's drop should log at info severity: this is a normal code path when some part of timeline creation encounters a cancellation `?` path. - When offloading and finding a `RemoteTimelineClient` in a non-initialized state, this is not an error and should not be logged as such. - The `offload_timeline` function returned an anyhow error, so callers couldn't gracefully pick out cancellation errors from real errors: update this to have a structured error type and use it throughout.	2024-10-31 14:44:59 +00:00
Peter Bendel	51fda118f6	increase lifetime of AWS session token to 12 hours (#9590 ) ## Problem clickbench regression causes clickbench to run >9 hours and the AWS session token is expired before the run completes ## Summary of changes extend lifetime of session token for this job to 12 hours	2024-10-31 13:34:50 +00:00
Anastasia Lubennikova	e96398a552	Add support of extensions for v17 (part 4) (#9568 ) - pg_jsonschema 0.3.3 - pg_graphql 1.5.9 - rum 65e0a752 - pg_tiktoken a5bc447e update support of extensions for v14-v16: - pg_jsonschema 0.3.1 -> 0.3.3 - pg_graphql 1.5.7 -> 1.5.9 - rum 6ab37053 -> 65e0a752 - pg_tiktoken e64e55aa -> a5bc447e	2024-10-31 15:05:24 +02:00
Erik Grinaker	f9d8256d55	pageserver: don't return option from `DeletionQueue::new` (#9588 ) `DeletionQueue::new()` always returns deletion workers, so the returned `Option` is redundant.	2024-10-31 10:51:58 +00:00
Vlad Lazar	411c3aa0d6	pageserver: lift decoding and interpreting of wal into wal_decoder (#9524 ) ## Problem Decoding and ingestion are still coupled in `pageserver::WalIngest`. ## Summary of changes A new type is added to `wal_decoder::models`, InterpretedWalRecord. This type contains everything that the pageserver requires in order to ingest a WAL record. The highlights are the `metadata_record` which is an optional special record type to be handled and `blocks` which stores key, value pairs to be persisted to storage. This type is produced by `wal_decoder::models::InterpretedWalRecord::from_bytes` from a raw PG wal record. The rest of this commit separates decoding and interpretation of the PG WAL record from its application in `WalIngest::ingest_record`. Related: https://github.com/neondatabase/neon/issues/9335 Epic: https://github.com/neondatabase/neon/issues/9329	2024-10-31 10:47:43 +00:00
Arpad Müller	65b69392ea	Disallow offloaded children during timeline deletion (#9582 ) If we delete a timeline that has children, those children will have their data corrupted. Therefore, extend the already existing safety check to offloaded timelines as well. Part of #8088	2024-10-30 19:37:09 +01:00
Alex Chi Z.	8d70f88b37	refactor(pageserver): use JSON field encoding for consumption metrics cache (#9470 ) In https://github.com/neondatabase/neon/issues/9032, I would like to eventually add a `generation` field to the consumption metrics cache. The current encoding is not backward compatible and it is hard to add another field into the cache. Therefore, this patch refactors the format to store "field -> value", and it's easier to maintain backward/forward compatibility with this new format. ## Summary of changes * Add `NewRawMetric` as the new format. * Add upgrade path. When opening the disk cache, the codepath first inspects the `version` field, and decide how to decode. * Refactor metrics generation code and tests. * Add tests on upgrade / compatibility with the old format. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-10-30 18:13:11 +00:00
Arpad Müller	bcfe013094	Don't keep around the timeline's remote_client (#9583 ) Constructing a remote client is no big deal. Yes, it means an extra download from S3 but it's not that expensive. This simplifies code paths and scenarios to test. This unifies timelines that have been recently offloaded with timelines that have been offloaded in an earlier invocation of the process. Part of #8088	2024-10-30 18:44:29 +01:00
Arpad Müller	d0a02f3649	Disallow archived timelines to be detached or reparented (#9578 ) Disallow a request for timeline ancestor detach if either the to be detached timeline, or any of the to be reparented timelines are offloaded or archived. In theory we could support timelines that are archived but not offloaded, but archived timelines are at the risk of being offloaded, so we treat them like offloaded timelines. As for offloaded timelines, any code to "support" them would amount to unoffloading them, at which point we can just demand to have the timelines be unarchived. Part of #8088	2024-10-30 17:04:57 +01:00
Tristan Partin	8af9412eb2	Collect compute backpressure throttling time This will tell us how much time the compute has spent throttled if pageserver/safekeeper cannot keep up with WAL generation. Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-10-30 09:58:29 -05:00
Erik Grinaker	96e35e11a6	postgres_ffi: add WAL generator for tests/benchmarks (#9503 ) ## Problem We don't have a convenient way to generate WAL records for benchmarks and tests. ## Summary of changes Adds a WAL generator, exposed as an iterator. It currently only generates logical messages (noops), but will be extended to write actual table rows later. Some existing code for WAL generation has been replaced with this generator, to reduce duplication.	2024-10-30 14:46:39 +03:00
Alexey Kondratov	745061ddf8	chore(compute): Bump pg_mooncake to the latest version (#9576 ) ## Problem There were some critical breaking changes made in the upstream since Oct 29th morning. ## Summary of changes Point it to the topmost commit in the `neon` branch at the time of writing this https://github.com/Mooncake-Labs/pg_mooncake/commits/neon/ `c495cd17d6`	2024-10-30 11:07:02 +01:00
Tristan Partin	0c828c57e2	Remove non-gzipped basebackup code path In July of 2023, Bojan and Chi authored `92aee7e07f`. Our in production pageservers are most definitely at a version where they all support gzipped basebackups.	2024-10-29 23:03:45 -05:00
John Spray	8e2e9f0fed	pageserver: generation-aware storage for TenantManifest (#9555 ) ## Problem When tenant manifest objects are written without a generation suffix, concurrently attached pageservers may stamp on each others writes of the manifest and cause undefined behavior. Closes: #9543 ## Summary of changes - Use download_generation_object helper when reading manifests, to search for the most recent generation - Use Tenant::generation as the generation suffix when writing manifests.	2024-10-29 23:24:04 +01:00
Tristan Partin	b77b9bdc9f	Add tests for sql-exporter metrics Should help us keep non-working metrics from hitting staging or production. Co-authored-by: Heikki Linnakangas <heikki@neon.tech> Fixes: https://github.com/neondatabase/neon/issues/8569 Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-10-29 15:13:06 -05:00
Alex Chi Z.	81f9aba005	fix(pagectl): layer parsing and image layer dump (#9571 ) This patch contains various improvements for the pagectl tool. ## Summary of changes * Rewrite layer name parsing: LayerName now supports all variants we use now. * Drop pagectl's own layer parsing function, use LayerName in the pageserver crate. * Support image layer dumping in the layer dump command using ImageLayer::dump, drop the original implementation. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-10-29 15:16:23 -04:00
Alex Chi Z.	88ff8a7803	feat(pageserver): support partial gc-compaction for lowest retain lsn (#9134 ) part of https://github.com/neondatabase/neon/issues/8921, https://github.com/neondatabase/neon/issues/9114 ## Summary of changes We start the partial compaction implementation with the image layer partial generation. The partial compaction API now takes a key range. We will only generate images for that key range for now, and remove layers fully included in the key range after compaction. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-10-29 18:25:32 +00:00
Konstantin Knizhnik	0c075fab3a	Add --replica parameter to basebackup (#9553 ) ## Problem See https://github.com/neondatabase/neon/pull/9458 This PR separates PS related changes in #9458 from compute_ctl changes to enforce that PS is deployed before compute. ## Summary of changes This PR adds handling of the `--replica` parameter of basebackup to the pageserver. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-10-29 18:40:10 +02:00
Anastasia Lubennikova	80e1630042	Use pg_mooncake from our fork. (#9565 ) Switch to main repo once https://github.com/Mooncake-Labs/pg_mooncake/pull/3 is merged	2024-10-29 15:57:52 +00:00
Jakub Kołodziejczak	57499640c5	proxy: more granular http status codes for sql-over-http errors (#9549 ) closes #9532	2024-10-29 15:44:45 +00:00
Anastasia Lubennikova	793ad50b7d	fix allow_unstable_extensions GUC - make it USERSET (#9563 ) fix message wording	2024-10-29 14:25:23 +00:00
John Spray	7a1331eee5	pageserver: make concurrent offloaded timeline operations safe wrt manifest uploads (#9557 ) ## Problem Uploads of the tenant manifest could race between different tasks, resulting in unexpected results in remote storage. Closes: https://github.com/neondatabase/neon/issues/9556 ## Summary of changes - Create a central function for uploads that takes a tokio::sync::Mutex - Store the latest upload in that Mutex, so that when there is lots of concurrency (e.g. archive 20 timelines at once) we can coalesce their manifest writes somewhat.	2024-10-29 13:54:48 +00:00
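
The coalescing idea in the commit above can be sketched roughly as follows. This is a minimal illustration, not the pageserver code: it swaps in `std::sync::Mutex` for the `tokio::sync::Mutex` named in the commit, and `Manifest`, `ManifestUploader`, and `upload_if_changed` are hypothetical names. The lock serializes uploads, and the remembered last upload lets later callers skip a write that would not change anything.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the tenant manifest contents.
#[derive(Clone, Debug, PartialEq)]
struct Manifest {
    offloaded_timelines: Vec<String>,
}

/// Serializes manifest uploads and remembers the last manifest written.
/// (Assumption: a blocking std Mutex instead of tokio's async Mutex.)
struct ManifestUploader {
    last_uploaded: Mutex<Option<Manifest>>,
}

impl ManifestUploader {
    fn new() -> Self {
        ManifestUploader {
            last_uploaded: Mutex::new(None),
        }
    }

    /// Runs `upload` only if `current` differs from the last manifest written.
    /// Returns true if an upload happened, false if it was coalesced away.
    fn upload_if_changed(&self, current: &Manifest, upload: impl FnOnce(&Manifest)) -> bool {
        // Holding the lock across the upload prevents two tasks from racing.
        let mut guard = self.last_uploaded.lock().unwrap();
        if (*guard).as_ref() == Some(current) {
            return false;
        }
        upload(current);
        *guard = Some(current.clone());
        true
    }
}
```

If many tasks call `upload_if_changed` with the same up-to-date manifest, only the first one that takes the lock performs the write.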
John Spray	4ef74215e1	pageserver: refactor generation-aware loading code into generic (#9545 ) ## Problem Indices used to be the only kind of object where we had to search across generations to find the most recent one. As of https://github.com/neondatabase/neon/issues/9543, manifests will need the same treatment. ## Summary of changes - Refactor download_index_part to a generic download_generation_object function, which will be usable for downloading manifest objects as well.	2024-10-29 13:00:03 +00:00
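
As a rough illustration of "search across generations to find the most recent one", the sketch below picks the newest object among a set of listed keys. It is not the `download_generation_object` implementation: the key layout (`<base>-<8 hex digits>`), the function names, and the rule of ignoring generations newer than the caller's own are all assumptions made for the example.

```rust
/// Parse a key of the assumed form `<base>-<8 hex digit generation>`.
/// Returns None for keys without a well-formed generation suffix.
fn parse_generation(key: &str, base: &str) -> Option<u32> {
    let suffix = key.strip_prefix(base)?.strip_prefix('-')?;
    if suffix.len() != 8 || !suffix.chars().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    u32::from_str_radix(suffix, 16).ok()
}

/// Among `keys`, return the one with the highest generation that does not
/// exceed `my_generation` (assumption: newer generations are not ours to read).
fn latest_generation_key<'a>(keys: &[&'a str], base: &str, my_generation: u32) -> Option<&'a str> {
    keys.iter()
        .copied()
        .filter_map(|k| parse_generation(k, base).map(|g| (g, k)))
        .filter(|(g, _)| *g <= my_generation)
        .max_by_key(|(g, _)| *g)
        .map(|(_, k)| k)
}
```

Because the helper only depends on the base name and the suffix, the same search works for index objects and manifest objects alike, which is the point of making the loader generic.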
Conrad Ludgate	d4cbc8cfeb	[auth_broker]: regress test (#9541 ) python based regression test setup for auth_broker. This uses a http mock for cplane as well as the JWKs url. complications: 1. We cannot just use local_proxy binary, as that requires the pg_session_jwt extension which we don't have available in the current test suite 2. We cannot use just any old http mock for local_proxy, as auth_broker requires http2 to local_proxy as such, I used the h2 library to implement an echo server - copied from the examples in the h2 docs.	2024-10-29 11:39:09 +00:00
Conrad Ludgate	47c35f67c3	[proxy]: fix JWT handling for AWS cognito. (#9536 ) In the base64 payload of an aws cognito jwt, I saw the following: ``` "iss":"https:\/\/cognito-idp.us-west-2.amazonaws.com\/us-west-2_redacted" ``` issuers are supposed to be URLs, and URLs are always valid un-escaped JSON. However, `\/` is a valid escape character so what AWS is doing is technically correct... sigh... This PR refactors the test suite and adds a new regression test for cognito.	2024-10-29 11:01:09 +00:00
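
To make the `\/` point concrete: JSON permits `\/` as an escape for `/`, so the raw bytes of the `iss` claim differ from the decoded issuer URL. The sketch below is a deliberately small, std-only decoder for the two-character JSON escapes; it is not the proxy's code, `unescape_json_str` is a hypothetical name, and `\uXXXX` escapes are intentionally rejected rather than handled.

```rust
/// Decode the two-character escapes in the body of a JSON string
/// (the text between the quotes). Returns None for `\uXXXX`, for an
/// unknown escape, or for a trailing backslash.
fn unescape_json_str(raw: &str) -> Option<String> {
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        let decoded = match chars.next()? {
            '"' => '"',
            '\\' => '\\',
            '/' => '/', // `\/` is a valid escape for a plain slash
            'b' => '\u{0008}',
            'f' => '\u{000C}',
            'n' => '\n',
            'r' => '\r',
            't' => '\t',
            _ => return None,
        };
        out.push(decoded);
    }
    Some(out)
}
```

Comparing the escaped text directly against an expected issuer URL fails, while comparing the decoded value succeeds, which is why the issuer has to be compared after JSON decoding.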
Peter Bendel	45b558f480	temporarily increase timeout for clickbench benchmark until regression is resolved (#9554 ) ## Problem The clickbench job in the benchmarking workflow has a performance regression that causes it to hit the maximum job run timeout. Suspected root cause: Project has been migrated from single pageserver to storage controller managed project on Oct 14th. The regression has shown since then. ## Summary of changes Increase timeout of pytest to 12 hours. Increase job timeout to 12 hours	2024-10-29 10:53:28 +00:00
Arpad Müller	a73402e646	Offloaded timeline deletion (#9519 ) As pointed out in https://github.com/neondatabase/neon/pull/9489#discussion_r1814699683 , we currently didn't support deletion for offloaded timelines after the timeline has been loaded from the manifest instead of having been offloaded. This was because the upload queue hasn't been initialized yet. This PR thus initializes the timeline and shuts it down immediately. Part of #8088	2024-10-29 10:41:53 +00:00
Vlad Lazar	07b974480c	pageserver: move things around to prepare for decoding logic (#9504 ) ## Problem We wish to have high level WAL decoding logic in `wal_decoder::decoder` module. ## Summary of Changes For this we need the `Value` and `NeonWalRecord` types accessible there, so: 1. Move `Value` and `NeonWalRecord` to `pageserver::value` and `pageserver::record` respectively. 2. Get rid of `pageserver::repository` (follow up from (1)) 3. Move PG specific WAL record types to `postgres_ffi::walrecord`. In theory they could live in `wal_decoder`, but it would create a circular dependency between `wal_decoder` and `postgres_ffi`. Long term it makes sense for those types to be PG version specific, so that will work out nicely. 4. Move higher level WAL record types (to be ingested by pageserver) into `wal_decoder::models` Related: https://github.com/neondatabase/neon/issues/9335 Epic: https://github.com/neondatabase/neon/issues/9329	2024-10-29 10:00:34 +00:00
Arpad Müller	62f5d484d9	Assert the tenant to be active in `unoffload_timeline` (#9539 ) Currently, all callers of `unoffload_timeline` ensure that the tenant the unoffload operation is called on is active. We rely on it being active as we activate the timeline below and don't want to race with the activation code of the tenant (in the worst case, activating a timeline twice). Therefore, add this assertion. Part of #8088	2024-10-29 00:36:05 +00:00
Tristan Partin	4df3987054	Get role name when not a C string We will only have a C string if the specified role is a string. Otherwise, we need to resolve references to public, current_role, current_user, and session_user. Fixes: https://github.com/neondatabase/cloud/issues/19323 Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-10-28 18:21:45 -05:00
Konstantin Knizhnik	0624565617	Create the notion of unstable extensions As a DBaaS provider, Neon needs to provide a stable platform for customers to build applications upon. At the same time however, we also need to enable customers to use the latest and greatest technology, so they can prototype their work, and we can solicit feedback. If all extensions are treated the same in terms of stability, it is hard to meet that goal. There are now two new GUCs created by the Neon extension: neon.allow_unstable_extensions: This is a session GUC which allows a session to install and load unstable extensions. neon.unstable_extensions: This is a comma-separated list of extension names. We can check if a CREATE EXTENSION statement is attempting to install an unstable extension, and if so, deny the request if neon.allow_unstable_extensions is not set to true. Signed-off-by: Tristan Partin <tristan@neon.tech> Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-10-28 17:47:15 -05:00
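
The decision rule described in the commit above can be sketched as a pure function. This is only an illustration of the logic, not the Neon extension's code (which hooks into Postgres): the function name is hypothetical, and exact, case-sensitive name matching with whitespace trimmed around list entries is an assumption.

```rust
/// Hypothetical helper: decide whether `CREATE EXTENSION <name>` is allowed.
///
/// `unstable_extensions` mirrors the comma-separated `neon.unstable_extensions`
/// GUC, and `allow_unstable` mirrors `neon.allow_unstable_extensions`.
fn create_extension_allowed(name: &str, unstable_extensions: &str, allow_unstable: bool) -> bool {
    let is_unstable = unstable_extensions
        .split(',')
        .map(str::trim)
        .filter(|entry| !entry.is_empty())
        .any(|entry| entry == name);
    // Stable extensions are always allowed; unstable ones need the opt-in.
    !is_unstable || allow_unstable
}
```

With a list of `"ext_a, ext_b"`, a request for `ext_a` is denied unless the session has opted in, while an extension not on the list is unaffected by either setting.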
George MacKerron	7d5f6b6a52	Build `pgrag` extensions x3 (#8486 ) Build the pgrag extensions (rag, rag_bge_small_en_v15, and rag_jina_reranker_v1_tiny_en) as part of the compute node Dockerfile. --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2024-10-28 20:06:36 +00:00
Alex Chi Z.	f7c61e856f	fix(pageserver): bump tokio-epoll-uring (#9546 ) Includes https://github.com/neondatabase/tokio-epoll-uring/pull/58 that fixes the clippy error. ## Summary of changes Update the version of tokio-epoll-uring Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-10-28 20:03:02 +00:00
Alex Chi Z.	57c21aff9f	refactor(pageserver): remove aux v1 configs (#9494 ) ## Problem Part of https://github.com/neondatabase/neon/issues/8623 ## Summary of changes Removed all aux-v1 config processing code. Note that we persisted it into the index part file, so we cannot really remove the field from index part. I also kept the config item within the tenant config, but we will not read it any more. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-10-28 19:51:14 +00:00
Erik Grinaker	248558dee8	safekeeper: refactor `WalAcceptor` to be event-driven (#9462 ) ## Problem The `WalAcceptor` main loop currently uses two nested loops to consume inbound messages. This makes it hard to slot in periodic events like metrics collection. It also duplicates the event processing code, and assumes all messages in steady state are AppendRequests (other messages types may be dropped if following an AppendRequest). ## Summary of changes Refactor the `WalAcceptor` loop to be event driven.	2024-10-28 17:18:37 +00:00
Sergey Melnikov	3bad52543f	We don't have legacy proxies anymore (#9544 ) We don't have legacy scram proxies anymore: cc: https://github.com/neondatabase/cloud/issues/9745	2024-10-28 16:42:35 +00:00
Tristan Partin	3d64a7ddcd	Add pg_mooncake to compute-node.Dockerfile Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-10-28 11:23:30 -05:00
Conrad Ludgate	25f1e5cfeb	[proxy] demote warnings and remove dead-argument (#9512 ) fixes https://github.com/neondatabase/cloud/issues/19000	2024-10-28 15:02:20 +00:00
Rahul Patil	8dd555d396	ci(proxy): Update GH action flag on proxy deployment (#9535 ) ## Problem In a recent proxy deployment issue, we deployed another proxy version (proxy-scram), which was not needed when deploying a specific proxy type. We have a [PR](https://github.com/neondatabase/infra/pull/2142) to update the infra branch, and need to update the CI in this repo that triggers proxy deployment. ## Summary of changes - Update proxy deployment flag	2024-10-28 13:17:09 +01:00
Arthur Petukhovsky	01b6843e12	Route pgbouncer logs to virtio-serial (#9488 ) virtio-serial is much more performant than /dev/console emulation, therefore, is much more suitable for the verbose logs inside vm. This commit changes routing for pgbouncer logs, since we've recently noticed it can emit large volumes of logs. Manually tested on staging by pinning a compute image to my test project. Should help with https://github.com/neondatabase/cloud/issues/19072	2024-10-28 12:09:47 +00:00
John Spray	93987b5a4a	tests: add test_storage_controller_onboard_detached (#9431 ) ## Problem We haven't historically taken this API route where we would onboard a tenant to the controller in detached state. It worked, but we didn't have test coverage. ## Summary of changes - Add a test that onboards a tenant to the storage controller in Detached mode, and checks that deleting it without attaching it works as expected.	2024-10-28 11:11:12 +00:00
John Spray	33baca07b6	storcon: add an API to cancel ongoing reconciler (#9520 ) ## Problem If something goes wrong with a live migration, we currently only have awkward ways to interrupt that: - Restart the storage controller - Ask it to do some other modification/migration on the shard, which we don't really want. ## Summary of changes - Add a new `/cancel` control API, and storcon_cli wrapper for it, which fires the Reconciler's cancellation token. This is just for on-call use and we do not expect it to be used by any other services.	2024-10-28 09:26:01 +00:00
John Spray	923974d4da	safekeeper: don't un-evict timelines during snapshot API handler (#9428 ) ## Problem When we use pull_timeline API on an evicted timeline, it gets downloaded to serve the snapshot API request. That means that to evacuate all the timelines from a node, the node needs enough disk space to download partial segments from all timelines, which may not be physically the case. Closes: #8833 ## Summary of changes - Add a "try" variant of acquiring a residence guard, that returns None if the timeline is offloaded - During snapshot API handler, take a different code path if the timeline isn't resident, where we just read the checkpoint and don't try to read any segments.	2024-10-28 08:47:12 +00:00
Arpad Müller	e7277885b3	Don't consider archived timelines for synthetic size calculation (#9497 ) Archived timelines should not count towards synthetic size. Closes #9384. Part of #8088.	2024-10-26 13:27:57 +00:00
dependabot[bot]	80262e724f	build(deps): bump werkzeug from 3.0.3 to 3.0.6 (#9527 )	2024-10-26 08:24:15 +01:00
Yuchen Liang	85b954f449	pageserver: add tokio-epoll-uring slots waiters queue depth metrics (#9482 ) In complement to https://github.com/neondatabase/tokio-epoll-uring/pull/56. ## Problem We want to make tokio-epoll-uring slots waiters queue depth observable via Prometheus. ## Summary of changes - Add `pageserver_tokio_epoll_uring_slots_submission_queue_depth` metrics as a `Histogram`. - Each thread-local tokio-epoll-uring system is given a `LocalHistogram` to observe the metrics. - Keep a list of `Arc<ThreadLocalMetrics>` used on-demand to flush data to the shared histogram. - Extend `Collector::collect` to report `pageserver_tokio_epoll_uring_slots_submission_queue_depth`. Signed-off-by: Yuchen Liang <yuchen@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-10-25 21:30:57 +01:00
Arpad Müller	76328ada05	Fix unoffload_timeline races with creation (#9525 ) This PR does two things: 1. Obtain a `TimelineCreateGuard` object in `unoffload_timeline`. This prevents two unoffload tasks from racing with each other. While they already obtain locks for `timelines` and `offloaded_timelines`, they aren't sufficient, as we have already constructed an entire timeline at that point. We shouldn't ever have two `Timeline` objects in the same process at the same time. 2. don't allow timeline creations for timelines that have been offloaded. Obviously they already exist, so we should not allow creation. the previous logic only looked at the timelines list. Part of #8088	2024-10-25 20:06:27 +00:00
Erik Grinaker	b54b632c6a	safekeeper: don't pass conf into storage constructors (#9523 ) ## Problem The storage components take an entire `SafekeeperConf` during construction, but only actually use the `no_sync` field. This makes it hard to understand the storage inputs (which fields do they actually care about?), and is also inconvenient for tests and benchmarks that need to set up a lot of unnecessary boilerplate. ## Summary of changes * Don't take the entire config, but pass in the `no_sync` field explicitly. * Take the timeline dir instead of `ttid` as an input, since it's the only thing it cares about. * Fix a couple of tests to not leak tempdirs. * Various minor tweaks.	2024-10-25 18:19:52 +01:00
Erik Grinaker	9909551f47	safekeeper: fix version in `TimelinePersistentState::empty()` (#9521 ) ## Problem The Postgres version in `TimelinePersistentState::empty()` is incorrect: the major version should be multiplied by 10000. ## Summary of changes Multiply the version by 10000.	2024-10-25 16:22:35 +01:00
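
For context on the "multiplied by 10000" fix: Postgres 10 and later encode their numeric version (as in `server_version_num`) as major * 10000 + minor. The helpers below are a small illustration of that arithmetic; the function names are hypothetical and are not taken from the safekeeper code.

```rust
/// Encode a Postgres (10+) version as a single number: major * 10000 + minor.
fn pg_version_num(major: u32, minor: u32) -> u32 {
    major * 10_000 + minor
}

/// Split a numeric Postgres (10+) version back into (major, minor).
fn split_pg_version_num(version_num: u32) -> (u32, u32) {
    (version_num / 10_000, version_num % 10_000)
}
```

Storing a bare major version such as `17` where the numeric form is expected would therefore decode as major 0, minor 17.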

6442 Commits