## Problem
Creation of a timeline during a reconciliation can lead to
unavailability if the user attempts to start a compute before the
storage controller has notified cplane of the cut-over.
## Summary of changes
Create timelines on all currently attached locations. For the latest
location, we still look at the database (as was done previously). With
this change we also look into the observed state to find *other*
attached locations, as sketched below.
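A minimal sketch of the idea, with hypothetical types rather than the actual storage controller code: the latest location still comes from the database, while other attached locations are gathered from the observed state.
```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-ins for the storage controller's state.
struct ObservedLocation {
    node_id: u64,
    attached: bool,
}

struct ShardState {
    /// The latest attached location, as recorded in the database.
    latest_db_location: u64,
    /// What we last observed on each pageserver.
    observed: Vec<ObservedLocation>,
}

/// Collect every location a timeline creation should be sent to.
fn timeline_creation_targets(shard: &ShardState) -> HashSet<u64> {
    let mut targets = HashSet::new();
    // As before: the latest location comes from the database.
    targets.insert(shard.latest_db_location);
    // New: also include any *other* locations that the observed state
    // reports as attached.
    for loc in &shard.observed {
        if loc.attached {
            targets.insert(loc.node_id);
        }
    }
    targets
}
```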
Related https://github.com/neondatabase/neon/issues/9144
## Problem
Secondary tenant heatmaps were always downloaded, even when they hadn't
changed. This can be avoided by using a conditional GET request passing
the `ETag` of the previous heatmap.
## Summary of changes
The `ETag` was already plumbed down into the heatmap downloader, and
just needed further plumbing into the remote storage backends.
* Add a `DownloadOpts` struct and pass it to
`RemoteStorage::download()`.
* Add an optional `DownloadOpts::etag` field, which uses a conditional
GET and returns `DownloadError::Unmodified` on match (see the sketch below).
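A rough sketch of the shape of the change; everything beyond the names mentioned above (`DownloadOpts`, `etag`, `DownloadError::Unmodified`) is an assumption rather than the real `remote_storage` API:
```rust
/// Sketch only, not the actual `remote_storage` API.
#[derive(Default)]
pub struct DownloadOpts {
    /// ETag of a previously downloaded object. When set, the backend issues a
    /// conditional GET (If-None-Match) and reports `DownloadError::Unmodified`
    /// if the object has not changed since that ETag.
    pub etag: Option<String>,
}

#[derive(Debug)]
pub enum DownloadError {
    /// The object still matches the provided ETag; the cached copy is valid.
    Unmodified,
    /// Placeholder for the real error variants.
    Other(String),
}

/// Hypothetical caller: skip re-processing the heatmap when it is unchanged.
fn handle_heatmap_download(result: Result<Vec<u8>, DownloadError>) {
    match result {
        Ok(bytes) => println!("new heatmap, {} bytes", bytes.len()),
        Err(DownloadError::Unmodified) => println!("heatmap unchanged, keep cached copy"),
        Err(other) => eprintln!("download failed: {other:?}"),
    }
}
```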
## Problem
The S3 tests couldn't use SSO authentication when run locally against S3.
## Summary of changes
Enable the `sso` feature of `aws-config`. Also run `cargo hakari
generate` which made some updates to `workspace_hack`.
In passing, rename it to NeonLocalCli, to reflect that the binary
is called 'neon_local'.
Add a wrapper for the 'timeline_import' command, eliminating the last
raw call to the raw_cli() function from tests, except for a few in
test_neon_cli.py which are about testing 'neon_local' itself. All the
other calls are now made through the strongly-typed wrapper
functions.
Add wrappers for a few commands that didn't have them before. Move the
logic to generate tenant and timeline IDs from NeonCli to the callers,
so that NeonCli is more purely just a type-safe wrapper around
'neon_local'.
* I had to install `m4` in order to be able to run locally
* The docs/docker.md was missing a pointer to where the compute node
code is
(Was originally on #8888 but I am pulling this out)
## Problem
`Oversized vectored read [...]` logs are spewing in prod because we have
a few keys that
are unexpectedly large:
* reldir/relblock - these are unbounded, so it's known technical debt
* slru block - they can be a bit bigger than 128KiB due to storage
format overhead
## Summary of changes
* Bump threshold to 130KiB
* Don't warn on oversized reldir and dbdir keys (sketched below)
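A minimal sketch of the resulting warning condition; the key classification and function names are illustrative, not the actual pageserver code:
```rust
// Hypothetical key classification; the real code inspects the pageserver's Key type.
enum KeyKind {
    RelDir,
    DbDir,
    RelBlock,
    SlruBlock,
    Other,
}

const MAX_VECTORED_READ_BYTES: usize = 130 * 1024; // bumped from 128KiB to 130KiB

/// Warn only when the read is oversized *and* the key is not one of the
/// kinds we already know can legitimately exceed the threshold.
fn should_warn_oversized_read(kind: &KeyKind, read_size: usize) -> bool {
    if read_size <= MAX_VECTORED_READ_BYTES {
        return false;
    }
    !matches!(kind, KeyKind::RelDir | KeyKind::DbDir)
}
```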
Closes https://github.com/neondatabase/neon/issues/8967
Follow-up of #9234 to give hyper 1.0 the version-free name, and the
legacy version of hyper the one with the version number inside. As we
move away from hyper 0.14, we can remove the `hyper0` name piece by
piece.
Part of #9255
Because:
- it's nice to be up-to-date,
- we already had axum 0.7 in our dependency tree, so this avoids having
to compile two versions, and
- it removes one of the remaining dependencies on hyper version 0.
Also bumps the 'tokio-tungstenite' dependency, to avoid having two
versions in the dependency tree.
The apt install stage before this commit:
0 upgraded, 391 newly installed, 0 to remove and 9 not upgraded.
Need to get 261 MB of archives.
after:
0 upgraded, 367 newly installed, 0 to remove and 9 not upgraded.
Need to get 220 MB of archives.
As seen in https://github.com/neondatabase/cloud/issues/17335, during
releases we can have ingest lags that are above the limits for warnings.
However, such lags are part of normal pageserver startup.
Therefore, calculate a cooldown timestamp until which we accept lags up
to a certain size. The heuristic is chosen so that the cooldown grows
the later the tenant is fully loaded, and we also add 60 seconds as a
grace period after that point.
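A minimal sketch of such a cooldown heuristic; the exact scaling of the load duration is an assumption, only the 60 second grace period comes from the description above:
```rust
use std::time::{Duration, Instant};

/// Compute the timestamp until which large ingest lags are tolerated after
/// startup. Illustrative only: scaling the cooldown by the load duration is a guess.
fn ingest_lag_cooldown_until(process_start: Instant, tenant_fully_loaded: Instant) -> Instant {
    // The longer it took to fully load the tenant, the longer we tolerate lag.
    let load_duration = tenant_fully_loaded.duration_since(process_start);
    // 60 seconds of grace on top of the load-proportional term.
    tenant_fully_loaded + load_duration + Duration::from_secs(60)
}

/// Suppress the lag warning while we are still inside the cooldown window.
fn lag_warning_suppressed(now: Instant, cooldown_until: Instant) -> bool {
    now < cooldown_until
}
```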
Fixes #9231.
Upgrade hyper to 1.4.0 and use hyper 1.4 instead of 0.14 in the storage
broker, together with tonic 0.12. The two upgrades go hand in hand.
Thanks to the broker being independent from other components, we can
upgrade its hyper version without touching the other components, which
makes things easier.
## Problem
```
Warning: The file chosen for install of requests 2.32.0 (requests-2.32.0-py3-none-any.whl) is yanked. Reason for being yanked: Yanked due to conflicts with CVE-2024-35195 mitigation
```
## Summary of changes
- Update `requests` to fix the warning
- Update `psycopg2-binary`
Some parentheses in conditional expressions in walproposer are
redundant, while others are needed for clarity.
## Summary of changes
Change some parentheses to clarify conditions in walproposer.
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
## Problem
We are seeing frequent pageserver startup timeouts while it calls
syncfs(). There is an existing fixture that syncs _after_ tests, but not
before the first one. We hypothesize that some failures are happening on
the first test in a job.
## Summary of changes
- extend the existing sync_after_each_test to be a sync between all
tests, including sync'ing before running the first test. That should
remove any ambiguity about whether the sync is happening on the correct
node.
This is an alternative to https://github.com/neondatabase/neon/pull/8957
-- I didn't realize until I saw Alexander's comment on that PR that we
have an existing hook that syncs filesystems and can be extended.
close https://github.com/neondatabase/neon/issues/9160
For whatever reason, pg17's WAL pattern seems different from others,
which triggers some flaky behavior within the compaction smoke test.
## Summary of changes
* Run L0 compaction before proceeding with the read benchmark, so that
we can ensure the number of L0 layers is 0 and test the compaction
behavior only with L1 layers.
We have a threshold for triggering L0 compaction. In some cases, the
test did not produce enough L0 layers to trigger an L0 compaction,
leaving the layer map with 3+ L0 layers above the L1 layers. This
increases the average read depth for the timeline.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We don't have an alert for long running reconciles. Stuck reconciles are
problematic
as we've seen in a recent incident.
## Summary of changes
Add a new metric `storage_controller_reconcile_long_running_total` with
labels: `{tenant_id, shard_number, seq}`.
The metric is removed after the long running reconcile finishes. These
events should be rare, so we won't break
the bank on cardinality.
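A minimal sketch of that metric lifecycle using the `prometheus` crate directly; the actual change presumably goes through the repo's metrics wrappers, and everything except the metric name and labels is an assumption:
```rust
use prometheus::{IntCounterVec, Opts, Registry};

fn register_long_running_metric(registry: &Registry) -> prometheus::Result<IntCounterVec> {
    let metric = IntCounterVec::new(
        Opts::new(
            "storage_controller_reconcile_long_running_total",
            "Reconciles that have been running for longer than expected",
        ),
        &["tenant_id", "shard_number", "seq"],
    )?;
    registry.register(Box::new(metric.clone()))?;
    Ok(metric)
}

fn example(metric: &IntCounterVec, tenant_id: &str, shard_number: &str, seq: &str) {
    // When a reconcile crosses the "long running" threshold, emit the metric...
    metric.with_label_values(&[tenant_id, shard_number, seq]).inc();
    // ...and remove the labelled series once the reconcile finishes, so the
    // rare long-running events don't accumulate cardinality.
    let _ = metric.remove_label_values(&[tenant_id, shard_number, seq]);
}
```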
Related https://github.com/neondatabase/neon/issues/9150
## Problem
If a timeline was deleted right before waiting for LSNs to catch up
before the cut-over,
then we would wait forever.
## Summary of changes
Fix the issue and add a test for timeline deletions mid migration.
Related https://github.com/neondatabase/neon/issues/9144
## Problem
A recent change to avoid the "became visible" log messages from certain
tasks missed a task: the logical size calculation that happens as a
child of synthetic size calculation.
Related: https://github.com/neondatabase/neon/issues/9058
## Summary of changes
- Add OnDemandLogicalSize to the list of permitted tasks for reads
making a covered layer visible
- Tweak the log message to use layer name instead of key: this is more
terse, and easier to use when debugging, as one can search for it
elsewhere to see when the layer was written/downloaded etc.
```shell
$ cargo run -p proxy --bin proxy -- --auth-backend=web --webauth-confirmation-timeout=5s
```
```
$ psql -h localhost -p 4432
NOTICE: Welcome to Neon!
Authenticate by visiting within 5s:
http://localhost:3000/psql_session/e946900c8a9bc6e9
psql: error: connection to server at "localhost" (::1), port 4432 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
connection to server at "localhost" (127.0.0.1), port 4432 failed: ERROR: Disconnected due to inactivity after 5s.
```
In PG17, there is this newfangled custom wait events system. This commit
adds that feature to Neon, so that users can see what their backends may
be waiting for when a PostgreSQL backend is playing the waiting game in
Neon code.
check-rust-style fails because the tonic version is too old. This does
not seem to be an easy fix, so ignore it in the deny list.
Signed-off-by: Alex Chi Z <chi@neon.tech>
MaxBackends doesn't include auxiliary processes. Whenever an aux process
performed IO operations that updated the counters, it would scribble over
shared memory beyond the end of the array. The relsize cache hash table
comes after the array, so the symptom was an error about hash table
corruption in the relsize cache hash.
When an endpoint is stopped in immediate mode and started again, there
is a chance of an old connection delivering some WAL to safekeepers
after the second start has checked the need for sync-safekeepers and
thus grabbed the basebackup LSN. This makes the basebackup unusable, so
the compute panics. Avoid flakiness by waiting for walreceivers on
safekeepers to be gone in such cases. A better way would be to bump the
term on safekeepers if sync-safekeepers is skipped, but that needs more
infrastructure.
ref https://github.com/neondatabase/neon/issues/9079
Found while searching for other issues in shared memory.
The bug should be benign, in that it over-allocates memory for this
struct, but doesn't allow for out-of-bounds writes.
Following #7656, `TenantConfOpt::TryFrom<toml_edit::Item>` appears to be
dead code. This patch removes `TenantConfOpt::TryFrom<toml_edit::Item>`.
The code does appear to be dead, since the TOML config is deserialized
into `TenantConfig` (via `LocationConfig`) and then converted into
`TenantConfOpt`.
This was verified by adding a panic to `try_from()` and running the
pageserver unit tests as well as a local end-to-end cluster (including
creating a new tenant and restarting the pageserver). This did not fail,
so this is not used on the common happy path at least. No explicit
`try_from` or `try_into` calls were found either.
Resolves #8918.
The aux v2 migration is nearing the end, and I rewrote the RFC based on
what I proposed (several months before...) and what I actually implemented.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
These are the perf counters added in commit 263dfba6ee.
Note: This relies on 'neon' extension version 1.5. The default was
bumped to 1.5 in commit d696c41807.
---------
Co-authored-by: Matthias van de Meent <matthias@neon.tech>
On bookworm, 'cmake' is new enough that we can just use it. On bullseye,
we can get a new-enough package from backports. By including 'cmake' in
the build-deps stage, we don't need to install it separately in all the
later build stages that need it.
See https://github.com/neondatabase/neon/pull/2699, where we switched to
downloading and building a specific version.
Microsoft exposes JWKs without the alg header. It's only included on the
tokens. Not a problem.
Also noticed that wrt the `typ` header:
> It will typically not be used by applications when it is already known
that the object is a JWT. This parameter is ignored by JWT
implementations; any processing of this parameter is performed by the
JWT application.
Since we know we are expecting JWTs only, I've followed the guidance and
removed the validation.
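A minimal sketch of what the relaxed key selection could look like; all types here are hypothetical stand-ins, not the proxy's actual JWK handling:
```rust
/// Hypothetical, pared-down JWK/JWT header views.
struct Jwk {
    kid: String,
    /// Microsoft's JWKS entries omit this; it is only present on the tokens.
    alg: Option<String>,
}

struct JwtHeader {
    kid: String,
    alg: String,
    /// Advisory only; per the quoted guidance we no longer validate it,
    /// since we already know we expect a JWT.
    typ: Option<String>,
}

/// Select the key for a token: match on `kid`, and only compare `alg` if the
/// JWK actually specifies one.
fn select_key<'a>(keys: &'a [Jwk], header: &JwtHeader) -> Option<&'a Jwk> {
    keys.iter().find(|key| {
        key.kid == header.kid
            && key.alg.as_deref().map_or(true, |alg| alg == header.alg)
    })
}
```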
## Problem
We need the
[pg_session_jwt](https://github.com/neondatabase/pg_session_jwt/)
extension in the compute image. This PR adds it.
## Summary of changes
I added the `pg_session_jwt` extension in a very similar way to how the
`pg_graphql` and `pg_tiktoken` extensions were added (since they're all
written with pgrx). Then I tested this.
```
$ cd docker-compose/
$ PG_VERSION=16 TAG=10667533475 docker-compose up --build -d
$ psql postgresql://cloud_admin:cloud_admin@localhost:55433/postgres
cloud_admin@postgres=# create extension pg_session_jwt;
CREATE EXTENSION
Time: 43.048 ms
cloud_admin@postgres=# \df auth.*;
List of functions
┌────────┬──────────────────┬──────────────────┬─────────────────────┬──────┐
│ Schema │ Name │ Result data type │ Argument data types │ Type │
├────────┼──────────────────┼──────────────────┼─────────────────────┼──────┤
│ auth │ get │ jsonb │ s text │ func │
│ auth │ init │ void │ kid bigint, s jsonb │ func │
│ auth │ jwt_session_init │ void │ s text │ func │
│ auth │ user_id │ text │ │ func │
└────────┴──────────────────┴──────────────────┴─────────────────────┴──────┘
(4 rows)
cloud_admin@postgres=# select auth.init(cast('1' as bigint), to_jsonb(TEXT '{ "kty": "EC", "kid": "571683be-33cf-4e67-bccc-8905c0ebb862", "crv": "P-521", "alg": "ES512", "x": "AM_GsnQvKML2yXdn_OsN8PdgO1Sf9XMXih5vQMKLmJkp-Iz_FFWJUt6uyR_qp4brr8Ji2kjGJgN4cQJpg2kskH7V", "y": "AZg-salw24lCmsBP-BCBa5jT6INkTwLtCOC7o0BIxDVvmIEH1-PQAJVYVJPTFvPMi_PLa0QlOm-ufJYkynwa2Mau" }'));
ERROR: called `Result::unwrap()` on an `Err` value: Error("invalid type: string \"{ \\\"kty\\\": \\\"EC\\\", \\\"kid\\\": \\\"571683be-33cf-4e67-bccc-8905c0ebb862\\\", \\\"crv\\\": \\\"P-521\\\", \\\"alg\\\": \\\"ES512\\\", \\\"x\\\": \\\"AM_GsnQvKML2yXdn_OsN8PdgO1Sf9XMXih5vQMKLmJkp-Iz_FFWJUt6uyR_qp4brr8Ji2kjGJgN4cQJpg2kskH7V\\\", \\\"y\\\": \\\"AZg-salw24lCmsBP-BCBa5jT6INkTwLtCOC7o0BIxDVvmIEH1-PQAJVYVJPTFvPMi_PLa0QlOm-ufJYkynwa2Mau\\\" }\", expected struct JwkEcKey", line: 0, column: 0)
Time: 6.991 ms
```
## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Move the download location to a proper URL
## Problem
`test_multi_attach` is sometimes failing with `invalid compute status
for configuration request: Configuration`. This is likely a result of
the test attempting to reconfigure the compute at the same time as the
storage controller is doing so.
This test was originally written before the storage controller existed,
and is not expecting anything else to be reconfiguring computes at the
same time.
## Summary of changes
- Configure the tenant into scheduling policy `Stop` in the storage
controller at the start of the test, so that it won't try to do anything
to the tenant while the test is running.
* tracing-utils now returns a `Layer` impl. This removes the need for
crates to import OTel crates.
* Drop the /v1/traces URI check. Verified that the code does the right
thing.
* Leave a TODO to hook in an error handler for OTel to log errors to
when it assumes the regular pipeline cannot be used/is broken.
## Problem
The live migration code waits forever for the compute notification hook,
on the basis that until it succeeds, the compute is probably using the
old location and we shouldn't detach it.
However, if a pageserver stops or restarts in the background, then this
original location might no longer be available, so there is no point
waiting. Waiting is also actively harmful, because it prevents other
reconciliations happening for the tenant shard, such as during an
upgrade where a stuck "drain" migration might prevent the later "fill"
migration from moving the shard back to its original location.
## Summary of changes
- Refactor the notification wait loop into a function
- Add checks during the loop: the origin node's cancellation token, and
an explicit HTTP request to the origin node to confirm the shard is
still attached there (see the sketch below).
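A rough sketch of the resulting loop shape; all names are illustrative stand-ins, not the actual reconciler code:
```rust
use std::time::Duration;
use tokio_util::sync::CancellationToken;

#[derive(Debug)]
enum MigrationError {
    OriginCancelled,
    OriginDetached,
}

/// Illustrative stand-ins for the real checks.
async fn notify_cplane_once() -> bool {
    false // pretend the notification hook is still failing
}

async fn origin_still_has_shard_attached() -> bool {
    true // would be an HTTP request to the origin pageserver
}

/// Wait for the compute notification to succeed, but stop waiting if the
/// origin node shuts down or no longer has the shard attached, so that
/// later reconciliations for this shard are not blocked.
async fn wait_for_compute_notification(
    origin_cancel: &CancellationToken,
) -> Result<(), MigrationError> {
    loop {
        if notify_cplane_once().await {
            return Ok(());
        }
        // New: give up if the origin node is shutting down...
        if origin_cancel.is_cancelled() {
            return Err(MigrationError::OriginCancelled);
        }
        // ...or if the shard is no longer attached there, in which case
        // waiting protects nothing.
        if !origin_still_has_shard_attached().await {
            return Err(MigrationError::OriginDetached);
        }
        tokio::time::sleep(Duration::from_secs(5)).await;
    }
}
```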
Closes: https://github.com/neondatabase/neon/issues/8901