rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-07 05:22:56 +00:00

Author	SHA1	Message	Date
a-masterov	5dc68e4e6a	test_compatibility: fix the regexes detecting the version (#9205 ) ## Problem The Neon components, built locally and by the GitHub workflow have slightly different version prefixes (git: vs git-env:) This does not allow running tests against local builds correctly. ## Summary of changes The regular expressions were changed to work with both prefixes.	2024-09-30 16:37:14 +02:00
John Spray	7cfd116856	pageserver: refactor immediate_gc into TenantManager (#9183 ) ## Problem Legacy functions that were called as `mgr::` and relied on the static TENANTS, see #5796 ## Summary of changes - Move the last stray function (immediate_gc) into TenantManager Closes: https://github.com/neondatabase/neon/issues/5796	2024-09-30 09:27:28 +01:00
Heikki Linnakangas	d696c41807	Bump default neon extension version to 1.5 (#9188 ) Commit `263dfba6ee` introduced neon extension version 1.5, which included some new functions and views for metrics. It didn't bump the default neon extension number yet, so that we could still safely roll back to the old binary if necessary. This bumps the default version.	2024-09-30 09:20:52 +03:00
Alexander Bayandin	3c72192065	CI(benchmarking): fix setting LD_LIBRARY_PATH (#9191 ) ## Problem `pgbench-pgvector` job from Nightly Benchmarks fails with the error: ``` /__w/_temp/f45bc2eb-4c4c-4f0a-8030-99079303fa65.sh: line 17: LD_LIBRARY_PATH: unbound variable ``` ## Summary of changes - Fix `LD_LIBRARY_PATH: unbound variable` error in benchmarks	2024-09-29 22:27:53 +00:00
Alexander Bayandin	d2d9921761	CI(benchmarking): fix Nightly Benchmarks (#9178 ) ## Problem Nightly Benchmarks have been broken for some time due to various reasons, this PR fixes it ## Summary of changes - Pull `build-tools` image from dockerhub for `benchmarking` workflow - Use `aws-actions/configure-aws-credentials` to upload/download artifacts from S3 - Fix Postgres 16 installation (for pgbench)	2024-09-28 02:44:22 +01:00
Arthur Petukhovsky	ba498a630a	Set disk quotas on bind in compute_ctl (#8936 ) Part of https://github.com/neondatabase/cloud/issues/13127. Resolves #9153 What changed in this PR: 1. Adds `ComputeSpec.disk_quota_bytes: Option<u64>` 2. Adds new arg to compute_ctl: `--set-disk-quota-for-fs <mountpoint>` 3. Implements running `/neonvm/bin/set-disk-quota` with the right value if both cmdline arg AND field in the spec are specified 4. Patches `/etc/sudoers.d` to allow `compute_ctl` to set quota with sudo This PR is very similar to the swap support added earlier, you can take a look at it as prior art: #7434 In theory, it can be implemented outside of compute_ctl when we will have a separate neonvm daemon, but we are not there yet. Current implementation is the simplest possible to unblock computes with larger disks. All code related to usage of `/neonvm/bin/set-disk-quota` is located in `disk_quota.rs`. We need to call this script with the following arguments: `/neonvm/bin/set-disk-quota {size_kb} {mountpoint}`. Quotas are set on the filesystem level, so we need to provide path to the directory that filesystem was mounted to. I tested this change locally with https://github.com/neondatabase/cloud/pull/17270. It should be safe to merge, because this feature is gated by both cmdline arg and field in the spec. If control-plane doesn't set values in both places, compute_ctl won't be affected by this change.	2024-09-27 20:52:22 +01:00
Heikki Linnakangas	e989a5e4a2	neon_local: Use clap derive macros to parse the CLI args (#9103 ) This is easier to work with.	2024-09-27 22:08:46 +03:00
Alex Chi Z.	cde1654d7b	fix(pageserver): abort process if fsync fails (#9108 ) close https://github.com/neondatabase/neon/issues/8140 The original issue is rather vague on what we should do. After discussion w/ @problame we decided to narrow down the problems we want to solve in that issue. * read path -- do not panic for now. * write path -- panic only on write errors (i.e., device error, fsync error), but not on no-space for now. The guideline is that if the pageserver behavior could lead to violation of persistent constraints (i.e., return an operation as successful but not actually persisting things), we should panic. Fsync is the place where both of us agree that we should panic, because if fsync fails, the kernel will mark dirty pages as clean, and the next fsync will not necessarily return false. This would make the storage client assume the operation is successful. ## Summary of changes Make fsync panic on fatal errors. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-09-27 19:58:50 +01:00
Heikki Linnakangas	cf6a776fcf	tests: Reduce the # of iterations in safekeeper::test_random_schedules (#9182 ) To make it faster. On my laptop, it takes about 30 before this commit. In the arm64 debug variant in CI, it takes about 120 s. Reduce it by factor of 4.	2024-09-27 16:25:35 +00:00
Matthias van de Meent	5c5871111a	WalProposer: Read WAL directly from WAL buffers in PG17 (#9171 ) This reduces the overhead of the WalProposer when it is not being throttled by SK WAL acceptance rate	2024-09-27 17:47:05 +02:00
Yuchen Liang	d56c4e7a38	pageserver: remove AdjacentVectoredReadBuilder and bump minmimum io_buffer_alignment to 512 (#9175 ) Part of #8130 ## Problem After deploying https://github.com/neondatabase/infra/pull/1927, we shipped `io_buffer_alignment=512` to all prod region. The `AdjacentVectoredReadBuilder` code path is no longer taken and we are running pageserver unit tests 6 times in the CI. Removing it would reduce the test duration by 30-60s. ## Summary of changes - Remove `AdjacentVectoredReadBuilder` code. - Bump the minimum `io_buffer_alignment` requirement to at least 512 bytes. - Use default `io_buffer_alignment` for Rust unit tests. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-09-27 16:41:42 +01:00
Conrad Ludgate	43b2445d0b	proxy: add jwks endpoint to control plane and mock providers (#9165 )	2024-09-27 16:08:43 +01:00
Yuchen Liang	42ef08db47	fix(pageserver): LSN lease edge cases around restarts/migrations (#9055 ) Part of #7497, closes #8817. ## Problem See #8817. ## Summary of changes compute_ctl - Renew lsn lease as soon as `/configure` updates pageserver_connstr, use `state_changed` Condvar for synchronization. pageserver As mentioned in https://github.com/neondatabase/neon/issues/8817#issuecomment-2315768076, we still want some permanent error reported if a lease cannot be granted. By considering attachment mode and the added `lsn_lease_deadline` when processing lease requests, we can also bound the case of bad requests to a very short period after migration/restart. - Refactor https://github.com/neondatabase/neon/pull/9024 and move `lsn_lease_deadline` to `AttachedTenantConf` so timeline can easily access it. - Have separate HTTP `init_lsn_lease` and libpq `renew_lsn_lease` API. - Always do LSN verification for the initial HTTP lease request. - LSN verification for the renewal is still done when tenants are not in `AttachedSingle` and we have pass the `lsn_lease_deadline`, which give plenty of time for compute to renew the lease. neon_local - add and call `timeline_init_lsn_lease` mgmt_api at static endpoint start. The initial lsn lease http request is sent when we run `cargo neon endpoint start <static endpoint>`. ## Testing - Extend `test_readonly_node_gc` to do pageserver restarts and migration. ## Future Work - The control plane should make the initial lease request through HTTP when creating a static endpoint. This is currently only done in `neon_local`. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-09-27 09:56:52 -04:00
Tristan Partin	fc962c9605	Use long options when calling initdb Verbosity in this case is good when reading the code. Short options are better when operating in an interactive shell. Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-09-27 08:22:16 -05:00
Heikki Linnakangas	357fa070a3	Add gdb to build-tools (#9125 ) So that compute_ctl can use it to print backtrace on core dumps See issue #2800.	2024-09-27 15:36:24 +03:00
Heikki Linnakangas	02cdd37b56	Dump backtrace if a core dump is called just "core" (#9125 ) I hope this lets us capture backtraces in CI. At least it makes it work on my laptop, which is valuable even if we need to do more for CI. See issue #2800.	2024-09-27 15:36:24 +03:00
Vlad Lazar	fa354a65ab	libs: improve logging on PG connection errors (#9130 ) ## Problem We get some unexpected errors, but don't know who they're happening for. ## Summary of change Add tenant id and peer address to PG connection error logs. Related https://github.com/neondatabase/cloud/issues/17336	2024-09-27 12:36:43 +01:00
Arseny Sher	40f7930a7d	safekeeper: skip syncfs on start if --no-sync is specified. (#9166 ) https://neondb.slack.com/archives/C059ZC138NR/p1727350911890989?thread_ts=1727350211.370869&cid=C059ZC138NR	2024-09-27 09:59:38 +03:00
Conrad Ludgate	ec07a1ecc9	proxy: make local-proxy config by signal with PID, refine JWKS apis with role caching (#9164 )	2024-09-26 19:01:48 +01:00
Arseny Sher	c4cdfe66ac	Fix flakiness of test_timeline_copy. Timeline might be not initialized when timeline_start_lsn is queried. Spotted by CI.	2024-09-26 19:01:45 +03:00
Alex Chi Z.	42e19e952f	fix(pageserver): categorize client error in basebackup metrics (#9110 ) We separated client error from basebackup error log lines in https://github.com/neondatabase/neon/pull/7523, but we didn't do anything for the metrics. In this patch, we fixed it. ref https://github.com/neondatabase/neon/issues/8970 ## Summary of changes We use the same criteria as in `log_query_error` producing an info line (instead of error) for the metrics. We added a `client_error` category for the basebackup query time metrics. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-09-26 11:38:19 -04:00
John Spray	3d255d601b	pageserver: rename control plane client & chunk validation requests (#8997 ) ## Problem - In https://github.com/neondatabase/neon/pull/8784, the validate controller API is modified to check generations directly in the database. It batches tenants into separate queries to avoid generating a huge statement, but - While updating this, I realized that "control_plane_client" is a kind of confusing name for the client code now that it primarily talks to the storage controller (the case of talking to the control plane will go away in a few months). ## Summary of changes - Big rename to "ControllerUpcallClient" -- this reflects the storage controller's api naming, where the paths used by the pageserver are in `/upcall/` - When sending validate requests, break them up into chunks so that we avoid possible edge cases of generating any HTTP requests that require database I/O across many thousands of tenants. This PR mixes a functional change with a refactor, but the commits are cleanly separated -- only the last commit is a functional change. --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-09-26 16:06:34 +01:00
Arthur Petukhovsky	80e974d05b	fix(compute_ctl): race condition in configurator (#9162 ) There was a tricky race condition in compute_ctl, that sometimes makes configurator skip updates. It makes a deadlock because: - control-plane cannot configure compute, because it's in ConfigurationPending state - compute_ctl doesn't do any reconfiguration because `configurator_main_loop` missed notification for it Full sequence that reproduces the issue: 1. `start_compute` finishes works and changes status `self.set_status(ComputeStatus::Running);` 2. configurator received update about `Running` state and dropped the mutex lock in the iteration 3. `/configure` request was triggered at the same time as step 1, and got the mutex lock 4. same `/configure` request set the spec and updated the state to `ConfigurationPending`, also sent a notification 5. next iteration in configurator got the mutex lock, but missed the notification There are more details in this slack thread: https://neondb.slack.com/archives/C03438W3FLZ/p1727281028478689?thread_ts=1727261220.483799&cid=C03438W3FLZ --------- Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>	2024-09-26 15:42:17 +01:00
Alexander Bayandin	7fdf1ab5b6	CI: run compatibility tests on Postgres 17 (#9145 ) ## Problem The latest storage release has generated artifacts for Postgres 17, so we can enable compatibility tests this version ## Summary of changes - Unskip `test_backward_compatibility` / `test_forward_compatibility` on Postgres 17	2024-09-26 15:17:01 +01:00
Arpad Müller	7bae78186b	Forbid creation of child timelines of archived timeline (#9122 ) We don't want to allow any new child timelines of archived timelines. If you want any new child timelines, you should first un-archive the timeline. Part of #8088	2024-09-26 02:05:25 +02:00
Heikki Linnakangas	7e560dd00e	chore: Silence clippy warning with nightly (#9157 ) The warning: warning: first doc comment paragraph is too long --> pageserver/src/tenant/checks.rs:7:1 \| 7 \| / /// Checks whether a layer map is valid (i.e., is a valid result of the current compaction algorithm if no... 8 \| \| /// The function checks if we can split the LSN range of a delta layer only at the LSNs of the delta layer... 9 \| \| /// 10 \| \| /// ```plain \| \|_ \| = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#too_long_first_doc_paragraph = note: `#[warn(clippy::too_long_first_doc_paragraph)]` on by default help: add an empty line \| 7 ~ /// Checks whether a layer map is valid (i.e., is a valid result of the current compaction algorithm if nothing goes wrong). 8 + /// \| Fix by applying the suggestion.	2024-09-25 21:29:16 +00:00
Tristan Partin	684e924211	Fix compute_logical_snapshot_files for v14 The function, pg_ls_logicalsnapdir(), was added in version 15. Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-09-25 16:25:17 -05:00
Tristan Partin	8ace9ea25f	Format long single DATA line in pgxn/Makefile This should be a little more readable. Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-09-25 16:25:17 -05:00
Alex Chi Z.	6a4f49b08b	fix(pageserver): passthrough partition cancel error (#9154 ) close https://github.com/neondatabase/neon/issues/9142 ## Summary of changes passthrough CollectKeyspaceError::Cancelled to CompactionError::ShuttingDown Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-09-25 21:35:33 +01:00
Alexander Bayandin	c6e89445e2	CI(promote-images): fix prod ECR auth (#9146 ) A cherry-pick from the previous release (#9131) ## Problem Login to prod ECR doesn't work anymore: ``` Retrieving registries data through * SDK... * ECR detected with eu-central-1 region Error: The security token included in the request is invalid. ``` ## Summary of changes - Fix login to prod ECR by using `aws-actions/configure-aws-credentials`	2024-09-25 18:22:39 +01:00
Vlad Lazar	04f32b9526	tests: remove patching up of az id column (#8968 ) This was required since the compat tests used a snapshot generated from a version of neon local which didn't contain the availability_zone_id column.	2024-09-25 17:22:32 +01:00
Heikki Linnakangas	6f2333f52b	CI: Leave out unnecessary build files from binary artifact (#9135 ) The pg_install/build directory contains .o files and such intermediate results from the build, which are not needed in the final tarball. Except for src/test/regress/regress.so and a few other .so files in that directory; keep those. This reduces the size of the neon-Linux-X64-release-artifact.tar.zst artifact from about 1.5 GB to 700 MB. (I attempted this a long time ago already, by moving the build/ directory out of pg_install altogether, see PR #2127. But I never got around to finish that work.) Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2024-09-25 19:07:20 +03:00
Yuchen Liang	d447f49bc3	fix(pageserver): handle lsn lease requests for unnormalized lsns (#9137 ) Fixes https://github.com/neondatabase/neon/issues/9098. ## Problem See https://github.com/neondatabase/neon/issues/9098#issuecomment-2372484969. ### Related A similar problem happened with branch creation, which was discussed [here](https://github.com/neondatabase/neon/pull/2143#issuecomment-1199969052) and fixed by https://github.com/neondatabase/neon/pull/2529. ## Summary of changes - Normalize the lsn on pageserver side upon lsn lease request, stores the normalized LSN. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-09-25 14:57:38 +00:00
Vlad Lazar	c5972389aa	storcon: include timeline ID in LSN waiting logs (#9141 ) ## Problem Hard to tell which timeline is holding the migration. ## Summary of Changes Add timeline id to log.	2024-09-25 15:54:41 +01:00
Matthias van de Meent	c4f5736d5a	Build images for PG17 using Debian 12 "Bookworm" (#9132 ) This increases the support window of the OS used for PG17 by 2 years compared to the previous usage of Debian 11 "Bullseye".	2024-09-25 17:50:05 +03:00
Alexey Kondratov	518f598e2d	docs(rfc): Independent compute release flow (#8881 ) Related to https://github.com/neondatabase/cloud/issues/11698	2024-09-25 16:24:09 +02:00
John Spray	4b711caf5e	storage controller: make proxying of GETs to pageservers more robust (#9065 ) ## Problem These commits are split off from https://github.com/neondatabase/neon/pull/8971/commits where I was fixing this to make a better scale test pass -- Vlad also independently recognized these issues with cloudbench in https://github.com/neondatabase/neon/issues/9062. 1. The storage controller proxies GET requests to pageservers based on their intent, not the ground truth of where they're really attached. 2. Proxied requests can race with scheduling to tenants, resulting in 404 responses if the request hits the wrong pageserver. Closes: https://github.com/neondatabase/neon/issues/9062 ## Summary of changes 1. If a shard has a running reconciler, then use the database generation_pageserver to decide who to proxy the request to 2. If such a request gets a 404 response and its scheduled node has changed since the request was dispatched.	2024-09-25 13:56:39 +00:00
Vlad Lazar	2cf47b1477	storcon: do az aware scheduling (#9083 ) ## Problem Storage controller didn't previously consider AZ locality between compute and pageservers when scheduling nodes. Control plane has this feature, and, since we are migrating tenants away from it, we need feature parity to avoid perf degradations. ## Summary of changes The change itself is fairly simple: 1. Thread az info into the scheduler 2. Add an extra member to the scheduling scores Step (2) deserves some more discussion. Let's break it down by the shard type being scheduled: Attached Shards We wish for attached shards of a tenant to end up in the preferred AZ of the tenant since that is where the compute is like to be. The AZ member for `NodeAttachmentSchedulingScore` has been placed below the affinity score (so it's got the second biggest weight for picking the node). The rationale for going below the affinity score is to avoid having all shards of a single tenant placed on the same node in 2 node regions, since that would mean that one tenant can drive the general workload of an entire pageserver. I'm not 100% sure this is the right decision, so open to discussing hoisting the AZ up to first place. Secondary Shards We wish for secondary shards of a tenant to be scheduled in a different AZ from the preferred one for HA purposes. The AZ member for `NodeSecondarySchedulingScore` has been placed first, so nodes in different AZs from the preferred one will always be considered first. On small clusters, this can mean that all the secondaries of a tenant are scheduled to the same pageserver, but secondaries don't use up as many resources as the attached location, so IMO the argument made for attached shards doesn't hold. Related: https://github.com/neondatabase/neon/issues/8848	2024-09-25 14:31:04 +01:00
Folke Behrens	7dcfcccf7c	Re-export git-version from utils and remove as direct dep (#9138 )	2024-09-25 14:38:35 +02:00
Vlad Lazar	a26cc29d92	storcon: add tags to scheduler logs (#9127 ) We log something at info level each time we schedule a shard to a non-secondary location. Might as well have context for it.	2024-09-25 10:16:06 +01:00
Alex Chi Z.	5f2f31e879	fix(test): storage scrubber should only log to stdout with info (#9067 ) As @koivunej mentioned in the storage channel, for regress test, we don't need to create a log file for the scrubber, and we should reduce noisy logs. ## Summary of changes * Disable log file creation for storage scrubber * Only log at info level --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-09-24 22:33:03 +00:00
Damian972	938b163b42	chore(docker-compose): fix typo in readme (#9133 ) Typo in the readme inside docker-compose folder ## Summary of changes - Update the readme	2024-09-24 18:05:23 -04:00
Heikki Linnakangas	5cbf5b45ae	Remove TenantState::Loading (#9118 ) The last real use was removed in commit `de90bf4663`. It was still used in a few unit tests, but they can use Attaching too.	2024-09-24 20:58:54 +00:00
Heikki Linnakangas	af5c54ed14	test: Make test_lfc_resize more robust (#9117 ) 1. Increase statement_timeout. It defaults to 120 s, which is not quite enough on slow or busy systems with debug build. On my laptop, the index creation takes about 100 s. On buildfarm, we've seen failures, e.g: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9084/10997888708/index.html#suites/821f97908a487f1d7d3a2a4dd1571e99/db1834bddfe8c5b9/ 2. Keep twiddling the LFC size through the whole test. Before, we would do it for the first 10 seconds, but that only covers a small part of the pgbench initialization phase. Change the loop so that the pgbench run time determines how long the test runs, and we keep changing the LFC for the whole time. In the passing, also fix bogus test description, copy-pasted from a completely unrelated test.	2024-09-24 23:38:16 +03:00
Alexander Bayandin	523cf71721	Fix compiler warnings on macOS (#9128 ) ## Problem Compilation of neon extension on macOS produces a warning ``` pgxn/neon/neon_perf_counters.c:50:1: error: non-void function does not return a value [-Werror,-Wreturn-type] ``` ## Summary of changes - Change the return type of `NeonPerfCountersShmemInit` to void	2024-09-24 18:11:31 +00:00
Arpad Müller	c47f355ec1	Catch Cancelled and don't print a warning for it (#9121 ) In the `imitate_synthetic_size_calculation_worker` function, we might obtain the `Cancelled` error variant instead of hitting the cancellation token based path. Therefore, catch `Cancelled` and handle it analogously to the cancellation case. Fixes #8886.	2024-09-24 17:28:56 +00:00
Yuchen Liang	4f67b0225b	pageserver: handle decompression outside vectored `read_blobs` (#8942 ) Part of #8130. ## Problem Currently, decompression is performed within the `read_blobs` implementation and the decompressed blob will be appended to the end of the `BytesMut` buffer. We will lose this flexibility of extending the buffer when we switch to using our own dio-aligned buffer (WIP in https://github.com/neondatabase/neon/pull/8730). To facilitate the adoption of aligned buffer, we need to refactor the code to perform decompression outside `read_blobs`. ## Summary of changes - `VectoredBlobReader::read_blobs` will return `VectoredBlob` without performing decompression and appending decompressed blob. It becomes the caller's responsibility to decompress the buffer. - Added a new `BufView` type that functions as `Cow<Bytes, &[u8]>`. - Perform decompression within `VectoredBlob::read` so that people don't have to explicitly thinking about compression when using the reader interface. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-09-24 16:41:38 +00:00
Heikki Linnakangas	2f7cecaf6a	test: Poll pageserver availability more aggressively at test startup Even with the 100 ms interval, on my laptop the pageserver always becomes available on second attempt, so this saves about 900 ms at every test startup.	2024-09-24 17:16:43 +03:00
Heikki Linnakangas	589594c2e1	test: Skip fsync when initdb'ing the storage controller db After initdb, we configure it with "fsync=off" anyway.	2024-09-24 17:16:43 +03:00
Heikki Linnakangas	70fe007519	test: Make test_hot_standby_feedback more forgiving of slow initialization (#9113 ) Don't start waiting for the index to appear in the secondary until it has been created in the primary. Before, if the "pgbench -i" step took more than 60 s, we would give up. There was a flaky test failure along those lines at: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9105/10997477941/index.html#suites/950eff205b552e248417890b8b8f189e/73cf4b5648fa6f74/ Hopefully, this avoids such failures in the future.	2024-09-24 16:41:59 +03:00

1 2 3 4 5 ...

6173 Commits