Microsoft exposes JWKs without the `alg` header; it is only included on the
tokens themselves. Not a problem.
Also noticed the following with regard to the `typ` header:
> It will typically not be used by applications when it is already known
that the object is a JWT. This parameter is ignored by JWT
implementations; any processing of this parameter is performed by the
JWT application.
Since we know we are expecting JWTs only, I've followed the guidance and
removed the validation.
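As a sketch of what this policy means in practice (using the `jsonwebtoken` crate purely as an illustrative stand-in, not necessarily what our validator uses):
```rust
use jsonwebtoken::{decode, decode_header, DecodingKey, Validation};

// Take the algorithm from the token's own header, since the JWKS entry
// may omit `alg` (as Microsoft's does); `typ` is not inspected at all.
fn validate(token: &str, key: &DecodingKey) -> jsonwebtoken::errors::Result<serde_json::Value> {
    let header = decode_header(token)?;
    let validation = Validation::new(header.alg);
    Ok(decode::<serde_json::Value>(token, key, &validation)?.claims)
}
```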
## Problem
We need the
[pg_session_jwt](https://github.com/neondatabase/pg_session_jwt/)
extension in the compute image. This PR adds it.
## Summary of changes
I added the `pg_session_jwt` extension in much the same way as the
`pg_graphql` and `pg_tiktoken` extensions were added (they are all
written with pgrx). Then I tested it:
```
$ cd docker-compose/
$ PG_VERSION=16 TAG=10667533475 docker-compose up --build -d
$ psql postgresql://cloud_admin:cloud_admin@localhost:55433/postgres
cloud_admin@postgres=# create extension pg_session_jwt;
CREATE EXTENSION
Time: 43.048 ms
cloud_admin@postgres=# \df auth.*;
List of functions
┌────────┬──────────────────┬──────────────────┬─────────────────────┬──────┐
│ Schema │ Name │ Result data type │ Argument data types │ Type │
├────────┼──────────────────┼──────────────────┼─────────────────────┼──────┤
│ auth │ get │ jsonb │ s text │ func │
│ auth │ init │ void │ kid bigint, s jsonb │ func │
│ auth │ jwt_session_init │ void │ s text │ func │
│ auth │ user_id │ text │ │ func │
└────────┴──────────────────┴──────────────────┴─────────────────────┴──────┘
(4 rows)
cloud_admin@postgres=# select auth.init(cast('1' as bigint), to_jsonb(TEXT '{ "kty": "EC", "kid": "571683be-33cf-4e67-bccc-8905c0ebb862", "crv": "P-521", "alg": "ES512", "x": "AM_GsnQvKML2yXdn_OsN8PdgO1Sf9XMXih5vQMKLmJkp-Iz_FFWJUt6uyR_qp4brr8Ji2kjGJgN4cQJpg2kskH7V", "y": "AZg-salw24lCmsBP-BCBa5jT6INkTwLtCOC7o0BIxDVvmIEH1-PQAJVYVJPTFvPMi_PLa0QlOm-ufJYkynwa2Mau" }'));
ERROR: called `Result::unwrap()` on an `Err` value: Error("invalid type: string \"{ \\\"kty\\\": \\\"EC\\\", \\\"kid\\\": \\\"571683be-33cf-4e67-bccc-8905c0ebb862\\\", \\\"crv\\\": \\\"P-521\\\", \\\"alg\\\": \\\"ES512\\\", \\\"x\\\": \\\"AM_GsnQvKML2yXdn_OsN8PdgO1Sf9XMXih5vQMKLmJkp-Iz_FFWJUt6uyR_qp4brr8Ji2kjGJgN4cQJpg2kskH7V\\\", \\\"y\\\": \\\"AZg-salw24lCmsBP-BCBa5jT6INkTwLtCOC7o0BIxDVvmIEH1-PQAJVYVJPTFvPMi_PLa0QlOm-ufJYkynwa2Mau\\\" }\", expected struct JwkEcKey", line: 0, column: 0)
Time: 6.991 ms
```
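The `auth.init` error above comes from the arguments rather than the extension itself: `to_jsonb(TEXT '...')` produces a JSON *string*, not a JSON object, which is why deserialization fails with `invalid type: string ... expected struct JwkEcKey`. Casting the literal directly to `jsonb` instead, e.g. `select auth.init(1, '{"kty": "EC", ...}'::jsonb);`, should hand the extension the object it expects.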
## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? If so, did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Move the download location to a proper URL
## Problem
`test_multi_attach` is sometimes failing with `invalid compute status
for configuration request: Configuration`. This is likely a result of
the test attempting to reconfigure the compute at the same time as the
storage controller is doing so.
This test was originally written before the storage controller existed,
so it does not expect anything else to be reconfiguring computes at the
same time.
## Summary of changes
- Configure the tenant into scheduling policy `Stop` in the storage
controller at the start of the test, so that it won't try to do anything
to the tenant while the test is running.
* tracing-utils now returns a `Layer` impl, which removes the need for
downstream crates to import OTel crates directly (see the sketch after
this list).
* Drop the /v1/traces URI check. Verified that the code does the right
thing.
* Leave a TODO to hook in an error handler for OTel to log errors to
when it assumes the regular pipeline cannot be used/is broken.
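For illustration, consumer code might now look like this; `tracing_utils::otel_layer()` is a hypothetical name for the new constructor, and the point is that only `tracing-subscriber` types appear on the consumer side:
```rust
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};

fn init_tracing() {
    tracing_subscriber::registry()
        // Hypothetical constructor returning `impl Layer<...>`; previously
        // callers had to import OTel crates and build the pipeline themselves.
        .with(tracing_utils::otel_layer())
        .with(tracing_subscriber::fmt::layer())
        .init();
}
```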
## Problem
The live migration code waits forever for the compute notification hook,
on the basis that until it succeeds, the compute is probably using the
old location and we shouldn't detach it.
However, if a pageserver stops or restarts in the background, then this
original location might no longer be available, so there is no point
waiting. Waiting is also actively harmful, because it prevents other
reconciliations happening for the tenant shard, such as during an
upgrade where a stuck "drain" migration might prevent the later "fill"
migration from moving the shard back to its original location.
## Summary of changes
- Refactor the notification wait loop into a function
- Add checks during the loop: one for the origin node's cancellation
token, and an explicit HTTP request to the origin node to confirm the
shard is still attached there (see the sketch below).
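A minimal sketch of the loop's new shape, with all names illustrative rather than the actual storage controller API:
```rust
use std::time::Duration;
use tokio_util::sync::CancellationToken;

const PROBE_INTERVAL: Duration = Duration::from_secs(10);

/// Wait for the compute notification, but stop waiting if the origin
/// node shuts down or reports that the shard is no longer attached there.
async fn wait_for_compute_notify(
    mut notified: tokio::sync::watch::Receiver<bool>,
    origin_cancel: CancellationToken,
    origin_still_attached: impl Fn() -> bool, // stand-in for the HTTP probe
) -> Result<(), &'static str> {
    loop {
        tokio::select! {
            _ = notified.changed() => return Ok(()),
            _ = origin_cancel.cancelled() => return Err("origin node shut down"),
            _ = tokio::time::sleep(PROBE_INTERVAL) => {
                if !origin_still_attached() {
                    return Err("shard no longer attached to origin node");
                }
            }
        }
    }
}
```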
Closes: https://github.com/neondatabase/neon/issues/8901
`neon_cli.create_tenant()` creates a new tenant *and* a timeline named
"main" on it. In most tests, there's no need to create another timeline
on the same tenant.
A few more tests still do, but in those remaining cases I wasn't 100%
sure whether the presence of extra root timelines affects what the
tests exercise, so I left them alone.
## Problem
This test waits for a request to finish, and then expects deletion to
complete almost immediately. The request does complete, but it returns
a 202: the timeline is still being deleted in the background, so we
need to be more patient.
## Summary of changes
- Adjust iterations from 2 to 10 when waiting for deletion
## Problem
Neon components built locally and by the GitHub workflow have slightly
different version prefixes (`git:` vs `git-env:`).
Because of this, tests cannot run correctly against local builds.
## Summary of changes
The regular expressions were changed to work with both
prefixes.
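For illustration, a pattern along these lines (hypothetical, not the exact expression in the test suite) accepts either prefix:
```rust
use regex::Regex; // regex = "1"

fn main() {
    // Match both "git:<sha>" (local builds) and "git-env:<sha>" (CI builds).
    let re = Regex::new(r"\bgit(-env)?:([0-9a-f]+)\b").unwrap();
    assert!(re.is_match("version git:1a2b3c4d"));
    assert!(re.is_match("version git-env:1a2b3c4d"));
}
```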
## Problem
Legacy functions were called as `mgr::` and relied on the static
`TENANTS`; see #5796.
## Summary of changes
- Move the last stray function (immediate_gc) into TenantManager
Closes: https://github.com/neondatabase/neon/issues/5796
Commit 263dfba6ee introduced neon extension version 1.5, which included
some new functions and views for metrics. It didn't bump the default
neon extension version yet, so that we could still safely roll back to
the old binary if necessary. This bumps the default version.
## Problem
`pgbench-pgvector` job from Nightly Benchmarks fails with the error:
```
/__w/_temp/f45bc2eb-4c4c-4f0a-8030-99079303fa65.sh: line 17: LD_LIBRARY_PATH: unbound variable
```
## Summary of changes
- Fix `LD_LIBRARY_PATH: unbound variable` error in benchmarks
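(The error comes from bash's `set -u`/`nounset` treating the unset variable as fatal; the usual fix is to expand it with a default, e.g. `"${LD_LIBRARY_PATH:-}"`, when extending the path.)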
## Problem
Nightly Benchmarks have been broken for some time for various reasons;
this PR fixes them.
## Summary of changes
- Pull `build-tools` image from dockerhub for `benchmarking` workflow
- Use `aws-actions/configure-aws-credentials` to upload/download
artifacts from S3
- Fix Postgres 16 installation (for pgbench)
Part of https://github.com/neondatabase/cloud/issues/13127. Resolves
#9153
What changed in this PR:
1. Adds `ComputeSpec.disk_quota_bytes: Option<u64>`
2. Adds new arg to compute_ctl: `--set-disk-quota-for-fs <mountpoint>`
3. Implements running `/neonvm/bin/set-disk-quota` with the right value
if both cmdline arg AND field in the spec are specified
4. Patches `/etc/sudoers.d` to allow `compute_ctl` to set quota with
sudo
This PR is very similar to the swap support added earlier, you can take
a look at it as prior art: #7434
In theory, this could live outside of compute_ctl once we have a
separate neonvm daemon, but we are not there yet. The current
implementation is the simplest one possible, to unblock computes with
larger disks.
All code related to the usage of `/neonvm/bin/set-disk-quota` is located
in `disk_quota.rs`. We need to call this script as
`/neonvm/bin/set-disk-quota {size_kb} {mountpoint}`. Quotas are set at
the filesystem level, so we need to provide the path to the directory
the filesystem is mounted on. A sketch of that call follows.
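A minimal sketch of what that invocation looks like (simplified; names and error handling here are illustrative, not the exact contents of `disk_quota.rs`):
```rust
use std::process::Command;

/// Set a filesystem-level quota by shelling out to the neonvm helper.
/// Invoked via sudo, which is why the sudoers.d patch above is needed.
fn set_disk_quota(disk_quota_bytes: u64, mountpoint: &str) -> anyhow::Result<()> {
    let size_kb = disk_quota_bytes / 1024;
    let status = Command::new("/usr/bin/sudo")
        .arg("/neonvm/bin/set-disk-quota")
        .arg(size_kb.to_string())
        .arg(mountpoint)
        .status()?;
    anyhow::ensure!(status.success(), "set-disk-quota failed with {status}");
    Ok(())
}
```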
I tested this change locally with
https://github.com/neondatabase/cloud/pull/17270. It should be safe to
merge, because the feature is gated on both the cmdline arg and the
spec field: unless the control plane sets values in both places,
compute_ctl is unaffected by this change.
close https://github.com/neondatabase/neon/issues/8140
The original issue is rather vague on what we should do. After
discussion with @problame, we decided to narrow down the problems we
want to solve in that issue:
* read path -- do not panic for now.
* write path -- panic only on write errors (i.e., device error, fsync
error), but not on no-space for now.
The guideline is that if pageserver behavior could violate persistence
guarantees (i.e., report an operation as successful without actually
persisting anything), we should panic. Fsync is the place where both of
us agree we should panic: if fsync fails, the kernel marks the dirty
pages as clean, and the next fsync will not necessarily report the
failure again. This would let the storage client assume the operation
succeeded.
## Summary of changes
Make fsync panic on fatal errors.
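A minimal sketch of the rule, assuming a hypothetical helper wrapping fsync calls (not the actual pageserver API):
```rust
use std::fs::File;

/// After a failed fsync, Linux may mark the dirty pages clean, so a retry
/// can "succeed" without the data ever reaching disk. Returning the error
/// upward therefore risks acknowledging unpersisted writes; crash instead.
fn fsync_or_fatal(file: &File, path: &str) {
    if let Err(e) = file.sync_all() {
        panic!("fsync of {path} failed, cannot guarantee durability: {e}");
    }
}
```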
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
To make it faster. On my laptop, it takes about 30 s before this
commit; in the arm64 debug variant in CI, it takes about 120 s. Reduce
it by a factor of 4.
Part of #8130
## Problem
After deploying https://github.com/neondatabase/infra/pull/1927, we
shipped `io_buffer_alignment=512` to all prod region. The
`AdjacentVectoredReadBuilder` code path is no longer taken and we are
running pageserver unit tests 6 times in the CI. Removing it would
reduce the test duration by 30-60s.
## Summary of changes
- Remove `AdjacentVectoredReadBuilder` code.
- Bump the minimum `io_buffer_alignment` requirement to at least 512
bytes.
- Use default `io_buffer_alignment` for Rust unit tests.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Part of #7497, closes #8817.
## Problem
See #8817.
## Summary of changes
**compute_ctl**
- Renew the LSN lease as soon as `/configure` updates
`pageserver_connstr`, using the `state_changed` Condvar for
synchronization (see the sketch below).
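A rough sketch of that renewal loop, with all names illustrative rather than the actual compute_ctl types:
```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

const RENEW_PERIOD: Duration = Duration::from_secs(60); // illustrative

struct ComputeState {
    pageserver_connstr: String,
}

fn renew_lsn_lease(_connstr: &str) { /* libpq renew_lsn_lease request */ }

fn lease_renewal_loop(state: Arc<(Mutex<ComputeState>, Condvar)>) {
    let (lock, state_changed) = (&state.0, &state.1);
    loop {
        let connstr = lock.lock().unwrap().pageserver_connstr.clone();
        renew_lsn_lease(&connstr);
        // Sleep for the renewal period, but wake up immediately if
        // /configure swaps in a new pageserver_connstr.
        let guard = lock.lock().unwrap();
        let _unused = state_changed
            .wait_timeout_while(guard, RENEW_PERIOD, |s| s.pageserver_connstr == connstr)
            .unwrap();
    }
}
```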
**pageserver**
As mentioned in
https://github.com/neondatabase/neon/issues/8817#issuecomment-2315768076,
we still want some permanent error reported if a lease cannot be
granted. By considering attachment mode and the added
`lsn_lease_deadline` when processing lease requests, we can also bound
the case of bad requests to a very short period after migration/restart.
- Refactor https://github.com/neondatabase/neon/pull/9024 and move
`lsn_lease_deadline` to `AttachedTenantConf` so timeline can easily
access it.
- Have separate HTTP `init_lsn_lease` and libpq `renew_lsn_lease` APIs.
- Always do LSN verification for the initial HTTP lease request.
- LSN verification for the renewal is **still done** when the tenant is
not in `AttachedSingle` and we have passed the `lsn_lease_deadline`,
which gives the compute plenty of time to renew the lease.
**neon_local**
- Add and call the `timeline_init_lsn_lease` mgmt_api at static
endpoint start. The initial LSN lease HTTP request is sent when we run
`cargo neon endpoint start <static endpoint>`.
## Testing
- Extend `test_readonly_node_gc` to do pageserver restarts and
migration.
## Future Work
- The control plane should make the initial lease request through HTTP
when creating a static endpoint. This is currently only done in
`neon_local`.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Verbosity in this case is good when reading the code. Short options are
better when operating in an interactive shell.
Signed-off-by: Tristan Partin <tristan@neon.tech>
I hope this lets us capture backtraces in CI. At least it makes it
work on my laptop, which is valuable even if we need to do more for
CI.
See issue #2800.
## Problem
We get some unexpected errors, but don't know who they're happening for.
## Summary of changes
Add tenant id and peer address to PG connection error logs.
Related https://github.com/neondatabase/cloud/issues/17336
We separated client errors from basebackup error log lines in
https://github.com/neondatabase/neon/pull/7523, but we didn't do
anything for the metrics. This patch fixes that.
ref https://github.com/neondatabase/neon/issues/8970
## Summary of changes
We use the same criteria as in `log_query_error` producing an info line
(instead of error) for the metrics. We added a `client_error` category
for the basebackup query time metrics.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>