rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-07 05:22:56 +00:00

Author	SHA1	Message	Date
Alexander Bayandin	4a53cd0fc3	Dockerfiles: remove cachepot (#8666 ) ## Problem We install and try to use `cachepot`. But it is not configured correctly and doesn't work (after https://github.com/neondatabase/neon/pull/2290) ## Summary of changes - Remove `cachepot`	2024-08-09 15:48:16 +01:00
Vlad Lazar	f5cef7bf7f	storcon: skip draining shard if its secondary is lagging too much (#8644 ) ## Problem Migrations of tenant shards with cold secondaries are holding up drains during production deployments. ## Summary of changes If a secondary location is lagging by more than 256MiB (configurable, but that's the default), then skip cutting it over to the secondary as part of the node drain.	2024-08-09 15:45:07 +01:00
John Spray	e6770d79fd	pageserver: don't treat NotInitialized::Stopped as unexpected (#8675 ) ## Problem This type of error can happen during shutdown & was triggering a circuit breaker alert. ## Summary of changes - Map NotInitialized::Stopped to CompactionError::ShuttingDown, so that we may handle it cleanly	2024-08-09 14:01:56 +01:00
Alexander Bayandin	201f56baf7	CI(pin-build-tools-image): fix permissions for Azure login (#8671 ) ## Problem Azure login fails in `pin-build-tools-image` workflow because the job doesn't have the required permissions. ``` Error: Please make sure to give write permissions to id-token in the workflow. Error: Login failed with Error: Error message: Unable to get ACTIONS_ID_TOKEN_REQUEST_URL env variable. Double check if the 'auth-type' is correct. Refer to https://github.com/Azure/login#readme for more information. ``` ## Summary of changes - Add `id-token: write` permission to `pin-build-tools-image` - Add an input to force image tagging - Unify pushing to Docker Hub with other registries - Split the job into two to have less if's	2024-08-09 12:05:43 +01:00
Alex Chi Z.	a155914c1c	fix(neon): disable create tablespace stmt (#8657 ) part of https://github.com/neondatabase/neon/issues/8653 Disable create tablespace stmt. It turns out it requires much less effort to do the regress test mode flag than patching the test cases, and given that we might need to support tablespaces in the future, I decided to add a new flag `regress_test_mode` to change the behavior of create tablespace. Tested manually that without setting regress_test_mode, create tablespace will be rejected. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2024-08-09 09:18:55 +01:00
Conrad Ludgate	7e08fbd1b9	Revert "proxy: update tokio-postgres to allow arbitrary config params (#8076 )" (#8654 ) This reverts #8076 - which was already reverted from the release branch since forever (it would have been a breaking change to release for all users who currently set TimeZone options). It's causing conflicts now so we should revert it here as well.	2024-08-09 09:09:29 +01:00
Peter Bendel	2ca5ff26d7	Run a subset of benchmarking job steps on GitHub action runners in Azure - closer to the system under test (#8651 ) ## Problem Latency from one cloud provider to another one is higher than within the same cloud provider. Some of our benchmarks are latency sensitive - we run a pgbench or psql in the github action runner and the system under test is running in Neon (database project). For realistic perf tps and latency results we need to compare apples to apples and run the database client in the same "latency distance" for all tests. ## Summary of changes Move job steps that test Neon databases deployed on Azure into Azure action runners. - bench strategy variant using azure database - pgvector strategy variant using azure database - pgbench-compare strategy variants using azure database ## Test run https://github.com/neondatabase/neon/actions/runs/10314848502	2024-08-09 08:36:29 +01:00
Alexander Bayandin	8acce00953	Dockerfiles: fix LegacyKeyValueFormat & JSONArgsRecommended (#8664 ) ## Problem CI complains in all PRs: ``` "ENV key=value" should be used instead of legacy "ENV key value" format ``` https://docs.docker.com/reference/build-checks/legacy-key-value-format/ See - https://github.com/neondatabase/neon/pull/8644/files ("Unchanged files with check annotations" section) - https://github.com/neondatabase/neon/actions/runs/10304090562?pr=8644 ("Annotations" section) ## Summary of changes - Use `ENV key=value` instead of `ENV key value` in all Dockerfiles	2024-08-09 07:54:54 +01:00
Alexander Bayandin	d28a6f2576	CI(build-tools): update Rust, Python, Mold (#8667 ) ## Problem - Rust 1.80.1 has been released: https://blog.rust-lang.org/2024/08/08/Rust-1.80.1.html - Python 3.9.19 has been released: https://www.python.org/downloads/release/python-3919/ - Mold 2.33.0 has been released: https://github.com/rui314/mold/releases/tag/v2.33.0 - Unpinned `cargo-deny` in `build-tools` got updated to the latest version and doesn't work anymore with the current config file ## Summary of changes - Bump Rust to 1.80.1 - Bump Python to 3.9.19 - Bump Mold to 2.33.0 - Pin `cargo-deny`, `cargo-hack`, `cargo-hakari`, `cargo-nextest`, `rustfilt` versions - Update `deny.toml` to the latest format, see https://github.com/EmbarkStudios/cargo-deny/pull/611	2024-08-09 06:17:16 +00:00
John Spray	4431688dc6	tests: don't require kafka client for regular tests (#8662 ) ## Problem We're adding more third party dependencies to support more diverse + realistic test cases in `test_runner/logical_repl`. I ❤️ these tests, they are a good thing. The slight glitch is that python packaging is hard, and some third party python packages have issues. For example the current kafka dependency doesn't work on latest python. We can mitigate that by only importing these more specialized dependencies in the tests that use them. ## Summary of changes - Move the `kafka` import into a test body, so that folks running the regular `test_runner/regress` tests don't have to have a working kafka client package.	2024-08-08 19:24:21 +01:00
John Spray	953b7d4f7e	pageserver: remove paranoia double-calculation of retain_lsns (#8617 ) ## Problem This code was to mitigate risk in https://github.com/neondatabase/neon/pull/8427 As expected, we did not hit this code path - the new continuous updates of gc_info are working fine, we can remove this code now. ## Summary of changes - Remove block that double-checks retain_lsns	2024-08-08 12:57:48 +01:00
Joonas Koivunen	8561b2c628	fix: stop leaking BackgroundPurges (#8650 ) avoid "leaking" the completions of BackgroundPurges by: 1. switching it to TaskTracker for provided close+wait 2. stop using tokio::fs::remove_dir_all which will consume two units of memory instead of one blocking task Additionally, use more graceful shutdown in tests which do actually some background cleanup.	2024-08-08 12:02:53 +01:00
Joonas Koivunen	21638ee96c	fix(test): do not fail test for filesystem race (#8643 ) evidence: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8632/10287641784/index.html#suites/0e58fb04d9998963e98e45fe1880af7d/c7a46335515142b/	2024-08-08 10:34:47 +01:00
Konstantin Knizhnik	cbe8c77997	Use synchronous commit for logical replication worker (#8645 ) ## Problem See https://neondb.slack.com/archives/C03QLRH7PPD/p1723038557449239?thread_ts=1722868375.476789&cid=C03QLRH7PPD Logical replication subscriptions by default use `synchronous_commit=off`, which causes problems with the safekeeper ## Summary of changes Set `synchronous_commit=on` for logical replication subscription in test_subscriber_restart.py --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2024-08-08 10:23:57 +03:00
John Spray	cf3eac785b	pageserver: make bench_ingest build (but panic) on macOS (#8641 ) ## Problem Some developers build on MacOS, which doesn't have io_uring. ## Summary of changes - Add `io_engine_for_bench`, which on linux will give io_uring or panic if it's unavailable, and on MacOS will always panic. We do not want to run such benchmarks with StdFs: the results aren't interesting, and will actively waste the time of any developers who start investigating performance before they realize they're using a known-slow I/O backend. Why not just conditionally compile this benchmark on linux only? Because even on linux, I still want it to refuse to run if it can't get io_uring.	2024-08-07 21:17:08 +01:00
Yuchen Liang	542385e364	feat(pageserver): add direct io pageserver config (#8622 ) Part of #8130, [RFC: Direct IO For Pageserver](https://github.com/neondatabase/neon/blob/problame/direct-io-rfc/docs/rfcs/034-direct-io-for-pageserver.md) ## Description Add pageserver config for evaluating/enabling direct I/O. - Disabled: current default, uses buffered io as is. - Evaluate: still uses buffered io, but could do alignment checking and perf simulation (pad latency by direct io RW to a fake file). - Enabled: uses direct io, behavior on alignment error is configurable. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-08-07 21:04:19 +01:00
Joonas Koivunen	05dd1ae9e0	fix: drain completed page_service connections (#8632 ) We've noticed increased memory usage with the latest release. Drain the joinset of `page_service` connection handlers to avoid leaking them until shutdown. An alternative would be to use a TaskTracker. TaskTracker was not discussed in original PR #8339 review, so not hot fixing it in here either.	2024-08-07 17:14:45 +00:00
Cihan Demirci	8468d51a14	cicd: push build-tools image to ACR as well (#8638 ) https://github.com/neondatabase/cloud/issues/15899	2024-08-07 17:53:47 +01:00
Joonas Koivunen	a81fab4826	refactor(timeline_detach_ancestor): replace ordered reparented with a hashset (#8629 ) Earlier I was thinking we'd need a (ancestor_lsn, timeline_id) ordered list of reparented. Turns out we did not need it at all. Replace it with an unordered hashset. Additionally refactor the reparented direct children query out, it will later be used from more places. Split off from #8430. Cc: #6994	2024-08-07 18:19:00 +02:00
Alex Chi Z.	b3eea45277	fix(pageserver): dump the key when it's invalid (#8633 ) We see an assertion error in staging. Dump the key to guess where it was from, and then we can fix it. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-08-07 16:37:46 +01:00
Joonas Koivunen	fc78774f39	fix: EphemeralFiles can outlive their Timeline via `enum LayerManager` (#8229 ) Ephemeral files cleanup on drop but did not delay shutdown, leading to problems with restarting the tenant. The solution is as proposed: - make ephemeral files carry the gate guard to delay `Timeline::gate` closing - flush in-memory layers and strong references to those on `Timeline::shutdown` The above are realized by making LayerManager an `enum` with `Open` and `Closed` variants, and fail requests to modify `LayerMap`. Additionally: - fix too eager anyhow conversions in compaction - unify how we freeze layers and handle errors - optimize likely_resident_layers to read LayerFileManager hashmap values instead of bouncing through LayerMap Fixes: #7830	2024-08-07 17:50:09 +03:00
Conrad Ludgate	ad0988f278	proxy: random changes (#8602 ) ## Problem 1. Hard to correlate startup parameters with the endpoint that provided them. 2. Some configurations are not needed in the `ProxyConfig` struct. ## Summary of changes Because of some borrow checker fun, I needed to switch to an interior-mutability implementation of our `RequestMonitoring` context system. Using https://docs.rs/try-lock/latest/try_lock/ as a cheap lock for such a use-case (needed to be thread safe). Removed the lock of each startup message, instead just logging only the startup params in a successful handshake. Also removed some values from `ProxyConfig` and kept them as arguments. (needed for local-proxy config)	2024-08-07 14:37:03 +01:00
Arpad Müller	4d7c0dac93	Add missing colon to ArchivalConfigRequest specification (#8627 ) Add a missing colon to the API specification of `ArchivalConfigRequest`. The `state` field is required. Pointed out by Gleb.	2024-08-07 14:53:52 +02:00
Arpad Müller	00c981576a	Lower level for timeline cancellations during gc (#8626 ) Timeline cancellation running in parallel with gc yields error log lines like: ``` Gc failed 1 times, retrying in 2s: TimelineCancelled ``` They are completely harmless though and normal to occur. Therefore, only print those messages at an info level. Still print them at all so that we know what is going on if we focus on a single timeline.	2024-08-07 09:29:52 +02:00
Arpad Müller	c3f2240fbd	storage broker: only print one line for version and build tag in init (#8624 ) This makes it more consistent with pageserver and safekeeper. Also, it is easier to collect the two values into one data point.	2024-08-07 09:14:26 +02:00
Yuchen Liang	ed5724d79d	scrubber: clean up `scan_metadata` before prod (#8565 ) Part of #8128. ## Problem Currently, scrubber `scan_metadata` command will return with an error code if the metadata on remote storage is corrupted with fatal errors. To safely deploy this command in a cronjob, we want to differentiate between failures while running scrubber command and the erroneous metadata. At the same time, we also want our regression tests to catch corrupted metadata using the scrubber command. ## Summary of changes - Return with error code only when the scrubber command fails - Use explicit checks on errors and warnings to determine metadata health in regression tests. Resolve conflict with `tenant-snapshot` command (after shard split): [`test_scrubber_tenant_snapshot`](https://github.com/neondatabase/neon/blob/yuchen/scrubber-scan-cleanup-before-prod/test_runner/regress/test_storage_scrubber.py#L23) failed before applying `422a8443dd` - When taking a snapshot, the old `index_part.json` in the unsharded tenant directory is not kept. - The current `list_timeline_blobs` implementation considers a missing `index_part.json` a parse error. - During the scan, we are only analyzing shards with highest shard count, so we will not get a parse error, but we do need to add the layers to tenant object listing, otherwise we will get an "index is referencing a layer that is not in remote storage" error. - Action: Add s3_layers from `list_timeline_blobs` regardless of parsing error Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-08-06 18:55:42 +01:00
John Spray	ca5390a89d	pageserver: add `bench_ingest` (#7409 ) ## Problem We lack a rust bench for the inmemory layer and delta layer write paths: it is useful to benchmark these components independent of postgres & WAL decoding. Related: https://github.com/neondatabase/neon/issues/8452 ## Summary of changes - Refactor DeltaLayerWriter to avoid carrying a Timeline, so that it can be cleanly tested + benched without a Tenant/Timeline test harness. It only needed the Timeline for building `Layer`, so this can be done in a separate step. - Add `bench_ingest`, which exercises a variety of workload "shapes" (big values, small values, sequential keys, random keys) - Include a small uncontroversial optimization: in `freeze`, only exhaustively walk values to assert ordering relative to end_lsn in debug mode. These benches are limited by drive performance on a lot of machines, but still useful as a local tool for iterating on CPU/memory improvements around this code path. Anecdotal measurements on Hetzner AX102 (Ryzen 7950xd): ``` ingest-small-values/ingest 128MB/100b seq time: [1.1160 s 1.1230 s 1.1289 s] thrpt: [113.38 MiB/s 113.98 MiB/s 114.70 MiB/s] Found 1 outliers among 10 measurements (10.00%) 1 (10.00%) low mild Benchmarking ingest-small-values/ingest 128MB/100b rand: Warming up for 3.0000 s Warning: Unable to complete 10 samples in 10.0s. You may wish to increase target time to 18.9s. ingest-small-values/ingest 128MB/100b rand time: [1.9001 s 1.9056 s 1.9110 s] thrpt: [66.982 MiB/s 67.171 MiB/s 67.365 MiB/s] Benchmarking ingest-small-values/ingest 128MB/100b rand-1024keys: Warming up for 3.0000 s Warning: Unable to complete 10 samples in 10.0s. You may wish to increase target time to 11.0s. 
ingest-small-values/ingest 128MB/100b rand-1024keys time: [1.0715 s 1.0828 s 1.0937 s] thrpt: [117.04 MiB/s 118.21 MiB/s 119.46 MiB/s] ingest-small-values/ingest 128MB/100b seq, no delta time: [425.49 ms 429.07 ms 432.04 ms] thrpt: [296.27 MiB/s 298.32 MiB/s 300.83 MiB/s] Found 1 outliers among 10 measurements (10.00%) 1 (10.00%) low mild ingest-big-values/ingest 128MB/8k seq time: [373.03 ms 375.84 ms 379.17 ms] thrpt: [337.58 MiB/s 340.57 MiB/s 343.13 MiB/s] Found 1 outliers among 10 measurements (10.00%) 1 (10.00%) high mild ingest-big-values/ingest 128MB/8k seq, no delta time: [81.534 ms 82.811 ms 83.364 ms] thrpt: [1.4994 GiB/s 1.5095 GiB/s 1.5331 GiB/s] Found 1 outliers among 10 measurements (10.00%) ```	2024-08-06 16:39:40 +00:00
John Spray	3727c6fbbe	pageserver: use layer visibility when composing heatmap (#8616 ) ## Problem Sometimes, a layer is Covered but hasn't yet been evicted from local disk (e.g. shortly after image layer generation). It is not a good use of resources to download these to a secondary location, as there's a good chance they will never be read. This follows the previous change that added layer visibility: - #8511 Part of epic: - https://github.com/neondatabase/neon/issues/8398 ## Summary of changes - When generating heatmaps, only include Visible layers - Update test_secondary_downloads to filter to visible layers when listing layers from an attached location	2024-08-06 17:15:40 +01:00
John Spray	42229aacf6	pageserver: fixes for layer visibility metric (#8603 ) ## Problem In staging, we could see that occasionally tenants were wrapping their pageserver_visible_physical_size metric past zero to 2^64. This is harmless right now, but will matter more later when we start using visible size in things like the /utilization endpoint. ## Summary of changes - Add debug asserts that detect this case. `test_gc_of_remote_layers` works as a reproducer for this issue once the asserts are added. - Tighten up the interface around access_stats so that only Layer can mutate it. - In Layer, wrap calls to `record_access` in code that will update the visible size statistic if the access implicitly marks the layer visible (this was what caused the bug) - In LayerManager::rewrite_layers, use the proper set_visibility layer function instead of directly using access_stats (this is an additional path where metrics could go bad.) - Removed unused instances of LayerAccessStats in DeltaLayer and ImageLayer which I noticed while reviewing the code paths that call record_access.	2024-08-06 14:47:01 +01:00
John Spray	b7beaa0fd7	tests: improve stability of `test_storage_controller_many_tenants` (#8607 ) ## Problem The controller scale test does random migrations. These mutate secondary locations, and therefore can cause secondary optimizations to happen in the background, violating the test's expectation that consistency_check will work as there are no reconciliations running. Example: https://neon-github-public-dev.s3.amazonaws.com/reports/main/10247161379/index.html#suites/07874de07c4a1c9effe0d92da7755ebf/6316beacd3fb3060/ ## Summary of changes - Only migrate to existing secondary locations, not randomly picked nodes, so that we can do a fast reconcile_until_idle (otherwise reconcile_until_idle takes a long time to create new secondary locations). - Do a reconcile_until_idle before consistency_check.	2024-08-06 12:58:33 +01:00
a-masterov	16c91ff5d3	enable rum test (#8380 ) ## Problem We need to test the rum extension automatically as a part of the GitHub workflow ## Summary of changes rum test is enabled	2024-08-06 13:56:42 +02:00
a-masterov	078f941dc8	Add a test using Debezium as a client for the logical replication (#8568 ) ## Problem We need to test the logical replication with some external consumers. ## Summary of changes A test of the logical replication with Debezium as a consumer was added. --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2024-08-06 13:08:55 +02:00
Arseny Sher	68bcbf8227	Add package-mode=false to poetry. We don't use it for packaging, and 'poetry install' will soon error otherwise. Also remove name and version fields as these are not required for non-packaging mode.	2024-08-06 13:53:23 +03:00
Arpad Müller	a31c95cb40	storage_scrubber: migrate scan_safekeeper_metadata to remote_storage (#8595 ) Migrates the safekeeper-specific parts of `ScanMetadata` to GenericRemoteStorage, making it Azure-ready. Part of https://github.com/neondatabase/neon/issues/7547	2024-08-06 10:51:39 +00:00
Joonas Koivunen	dc7eb5ae5a	chore: bump index part version (#8611 ) #8600 missed the hunk changing index_part.json informative version. Include it in this PR, in addition add more non-warning index_part.json versions to scrubber.	2024-08-06 11:45:41 +01:00
Vlad Lazar	44fedfd6c3	pageserver: remove legacy read path (#8601 ) ## Problem We have been maintaining two read paths (legacy and vectored) for a while now. The legacy read-path was only used for cross validation in some tests. ## Summary of changes * Tweak all tests that were using the legacy read path to use the vectored read path instead * Remove the read path dispatching based on the pageserver configs * Remove the legacy read path code We will be able to remove the single blob io code in `pageserver/src/tenant/blob_io.rs` when https://github.com/neondatabase/neon/issues/7386 is complete. Closes https://github.com/neondatabase/neon/issues/8005	2024-08-06 10:14:01 +01:00
Joonas Koivunen	138f008bab	feat: persistent gc blocking (#8600 ) Currently, we do not have facilities to persistently block GC on a tenant for whatever reason. We could do a tenant configuration update, but that is risky for generation numbers and would also be transient. Introduce a `gc_block` facility in the tenant, which manages per timeline blocking reasons. Additionally, add HTTP endpoints for enabling/disabling manual gc blocking for a specific timeline. For debugging, individual tenant status now includes a similar string representation logged when GC is skipped. Cc: #6994	2024-08-06 10:09:56 +01:00
Joonas Koivunen	6a6f30e378	fix: make Timeline::set_disk_consistent_lsn use fetch_max (#8311 ) now it is safe to use from multiple callers, as we have two callers.	2024-08-06 08:52:01 +01:00
Alex Chi Z.	8f3bc5ae35	feat(pageserver): support dry-run for gc-compaction, add statistics (#8557 ) Add dry-run mode that does not produce any image layer + delta layer. I will use this code to do some experiments and see how much space we can reclaim for tenants on staging. Part of https://github.com/neondatabase/neon/issues/8002 * Add dry-run mode that runs the full compaction process without updating the layer map. (We never call finish on the writers and the files will be removed before exiting the function). * Add compaction statistics and print them at the end of compaction. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-08-06 02:07:48 +00:00
Alexander Bayandin	e6e578821b	CI(benchmarking): set pub/sub projects for LR tests (#8483 ) ## Problem > Currently, long-running LR tests recreate endpoints every night. We'd like to have a long-running buildup of history to exercise the pageserver in this case (instead of "unit-testing" the same behavior every night). Closes #8317 ## Summary of changes - Update Postgres version for replication tests - Set `BENCHMARK_PROJECT_ID_PUB`/`BENCHMARK_PROJECT_ID_SUB` env vars to projects that were created for this purpose --------- Co-authored-by: Sasha Krassovsky <krassovskysasha@gmail.com>	2024-08-05 22:06:47 +00:00
Joonas Koivunen	c32807ac19	fix: allow awaiting logical size for root timelines (#8604 ) Currently if `GET /v1/tenant/x/timeline/y?force-await-initial-logical-size=true` is requested for a root timeline created within the current pageserver session, the request handler panics hitting the debug assertion. These timelines will always have an accurate (at initdb import) calculated logical size. Fix is to never attempt prioritizing timeline size calculation if we already have an exact value. Split off from #8528.	2024-08-05 21:21:33 +01:00
Alexander Bayandin	50daff9655	CI(trigger-e2e-tests): fix deadlock with Build and Test workflow (#8606 ) ## Problem In some cases, a deadlock between `build-and-test` and `trigger-e2e-tests` workflows can happen: ``` Build and Test Canceling since a deadlock for concurrency group 'Build and Test-8600/merge-anysha' was detected between 'top level workflow' and 'trigger-e2e-tests' ``` I don't understand the reason completely, probably `${{ github.workflow }}` got evaluated to the same value and somehow caused the issue. We don't need to limit concurrency for `trigger-e2e-tests` workflow. See https://neondb.slack.com/archives/C059ZC138NR/p1722869486708179?thread_ts=1722869027.960029&cid=C059ZC138NR	2024-08-05 19:47:59 +01:00
Alexander Bayandin	bd845c7587	CI(trigger-e2e-tests): wait for promote-images job from the last commit (#8592 ) ## Problem We don't trigger e2e tests for draft PRs, but we do trigger them once a PR is in the "Ready for review" state. Sometimes, a PR can be marked as "Ready for review" before we finish image building. In such cases, triggering e2e tests fails. ## Summary of changes - Make `trigger-e2e-tests` job poll status of `promote-images` job from the build-and-test workflow for the last commit. And trigger only if the status is `success` - Remove explicit image checking from the workflow - Add `concurrency` for `trigger-e2e-tests` workflow to make it possible to cancel jobs in progress (if PR moves from "Draft" to "Ready for review" several times in a row)	2024-08-05 12:25:23 +01:00
Konstantin Knizhnik	f63c8e5a8c	Update Postgres versions to use smgrexists() instead of access() to check if Oid is used (#8597 ) ## Problem PR #7992 was merged without the corresponding changes in the Postgres submodules, which is why test_oid_overflow.py is failing now. ## Summary of changes Bump Postgres versions Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-08-05 14:24:54 +03:00
Alex Chi Z.	200fa56b04	feat(pageserver): support split delta layers (#8599 ) part of https://github.com/neondatabase/neon/issues/8002 Similar to https://github.com/neondatabase/neon/pull/8574, we add auto-split support for delta layers. Tests are reused from image layer split writers. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-08-05 10:30:49 +00:00
dotdister	0f3dac265b	safekeeper: remove unused partial_backup_enabled option (#8547 ) ## Problem There is an unused safekeeper option `partial_backup_enabled`. `partial_backup_enabled` was implemented in #6530, but the option has been always enabled since #8022. If you intended to keep this option for a specific reason, I will close this PR. ## Summary of changes I removed the unused safekeeper option `partial_backup_enabled`.	2024-08-05 09:23:59 +02:00
Alex Chi Z.	1dc496a2c9	feat(pageserver): support auto split layers based on size (#8574 ) part of https://github.com/neondatabase/neon/issues/8002 ## Summary of changes Add a `SplitImageWriter` that automatically splits image layers based on estimated target image layer size. This does not consider compression and we might need a better metric. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2024-08-05 06:55:36 +01:00
Alex Chi Z.	6814bdd30b	fix(pageserver): deadlock in gc-compaction (#8590 ) We need both compaction and gc lock for gc-compaction. The lock order should be the same everywhere, otherwise there could be a deadlock where A waits for B and B waits for A. We also had a double-lock issue. The compaction lock gets acquired in the outer `compact` function. Note that the unit tests directly call `compact_with_gc`, and therefore do not trigger the issue. ## Summary of changes Ensure all places acquire compact lock and then gc lock. Remove an extra compact lock acquisition. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-08-03 00:52:04 +01:00
John Spray	0a667bc8ef	tests: add test_historic_storage_formats (#8423 ) ## Problem Currently, our backward compatibility tests only look one release back. That means, for example, that when we switch on image layer compression by default, we'll test reading of uncompressed layers for one release, and then stop doing it. When we make an index_part.json format change, we'll test against the old format for a week, then stop (unless we write separate unit tests for each old format). The reality in the field is that data in old formats will continue to exist for weeks/months/years. When we make major format changes, we should retain examples of the old format data, and continuously verify that the latest code can still read them. This test uses contents from a new path in the public S3 bucket, `compatibility-data-snapshots/`. It is populated by hand. The first important artifact is one from before we switch on compression, so that we will keep testing reads of uncompressed data. We will generate more artifacts ahead of other key changes, like when we update remote storage format for archival timelines. Closes: https://github.com/neondatabase/cloud/issues/15576	2024-08-02 18:28:23 +01:00
Arthur Petukhovsky	f3acfb2d80	Improve safekeepers eviction rate limiting (#8456 ) This commit tries to fix regular load spikes on staging, caused by too many eviction and partial upload operations running at the same time. Usually it was happening after restart; for partial backup the load was delayed. - Add a semaphore for evictions (2 permits by default) - Rename `resident_since` to `evict_not_before` and smooth out the curve by using random duration - Use random duration in partial uploads as well related to https://github.com/neondatabase/neon/issues/6338 some discussion in https://neondb.slack.com/archives/C033RQ5SPDH/p1720601531744029	2024-08-02 15:26:46 +01:00
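
Commit 6a6f30e378 in the list above switches `Timeline::set_disk_consistent_lsn` to `fetch_max` so that it is safe with two callers. The sketch below illustrates the general idea with the standard library only. The `DiskConsistentLsn` type and its `advance` method are hypothetical stand-ins, not the pageserver's actual code, and the LSN is assumed to fit in a plain `u64`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for a timeline's disk-consistent LSN (assumption:
/// not the pageserver's real type).
struct DiskConsistentLsn(AtomicU64);

impl DiskConsistentLsn {
    fn new(initial: u64) -> Self {
        Self(AtomicU64::new(initial))
    }

    /// Atomically store max(current, candidate).
    /// Returns true if this call moved the value forward.
    fn advance(&self, candidate: u64) -> bool {
        // fetch_max returns the value that was stored before the update.
        let previous = self.0.fetch_max(candidate, Ordering::AcqRel);
        previous < candidate
    }

    fn load(&self) -> u64 {
        self.0.load(Ordering::Acquire)
    }
}

fn main() {
    let lsn = Arc::new(DiskConsistentLsn::new(100));

    // Several callers race to publish their own view of the LSN.
    let handles: Vec<_> = vec![250u64, 180, 400, 90]
        .into_iter()
        .map(|candidate| {
            let lsn = Arc::clone(&lsn);
            thread::spawn(move || lsn.advance(candidate))
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    // Whatever the interleaving, the largest candidate wins.
    println!("disk_consistent_lsn = {}", lsn.load());
}
```

With a separate load followed by a store, a slower caller could overwrite a newer value with an older one; a single `fetch_max` rules that out because the comparison and the write happen as one atomic step.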

5821 Commits
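
Commit 42229aacf6 in the list above describes `pageserver_visible_physical_size` wrapping past zero to 2^64. As a rough illustration of how an unsigned counter ends up there, the sketch below compares wrapping and checked subtraction on a `u64`. The function names are hypothetical, and it is only an assumption that the metric is backed by an unsigned 64-bit value that is decremented without an underflow check.

```rust
/// Decrement that silently wraps on underflow (hypothetical helper,
/// modelling an unchecked unsigned gauge).
fn sub_wrapping(current: u64, bytes: u64) -> u64 {
    current.wrapping_sub(bytes)
}

/// Decrement that reports underflow instead of wrapping (hypothetical helper).
fn sub_checked(current: u64, bytes: u64) -> Option<u64> {
    current.checked_sub(bytes)
}

fn main() {
    // Subtracting a layer's size twice, or before it was ever added,
    // takes the gauge below zero.
    let visible: u64 = 4096;
    let layer_size: u64 = 8192;

    let wrapped = sub_wrapping(visible, layer_size);
    println!("wrapping result: {wrapped}");

    match sub_checked(visible, layer_size) {
        Some(value) => println!("checked result: {value}"),
        None => println!("checked result: underflow detected"),
    }
}
```

A checked decrement, or a debug assertion of the kind the commit adds, turns the silent wrap into something a test can catch.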