rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-07 22:20:36 +00:00

Author	SHA1	Message	Date
Christian Schwarz	7dcdbaa25e	remote_storage config: move handling of empty inline table `{}` to callers (#8193 ) Before this PR, `RemoteStorageConfig::from_toml` would support deserializing an empty `{}` TOML inline table to a `None`, otherwise try `Some()`. We can instead let * in proxy: let clap derive handle the Option * in PS & SK: assume that if the field is specified, it must be a valid RemoteStorageConfig (This PR started with a much simpler goal of factoring out the `deserialize_item` function because I need that in another PR).	2024-07-02 12:53:08 +02:00
Konstantin Knizhnik	0497b99f3a	Check status of connection after PQconnectStartParams (#8210 ) ## Problem See https://github.com/neondatabase/cloud/issues/14289 ## Summary of changes Check connection status after calling PQconnectStartParams Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-07-02 06:56:10 +03:00
Vlad Lazar	9882ac8e06	docs: Graceful storage controller cluster restarts RFC (#7704 ) RFC for "Graceful Restarts of Storage Controller Managed Clusters". Related https://github.com/neondatabase/neon/issues/7387	2024-07-01 18:44:28 +01:00
Heikki Linnakangas	0789160ffa	tests: Make neon_xlogflush() flush all WAL, if you omit the LSN arg (#8215 ) This makes it much more convenient to use in the common case that you want to flush all the WAL. (Passing pg_current_wal_insert_lsn() as the argument doesn't work for the same reasons as explained in the comments: we need to back off to the beginning of a page if the previous record ended at page boundary.) I plan to use this to fix the issue that Arseny Sher called out at https://github.com/neondatabase/neon/pull/7288#discussion_r1660063852	2024-07-01 10:55:18 -05:00
Alexander Bayandin	9c32604aa9	CI(gather-rust-build-stats): fix build with libpq (#8219 ) ## Problem I've missed setting `PQ_LIB_DIR` in https://github.com/neondatabase/neon/pull/8206 in `gather-rust-build-stats` job and it fails now: ``` = note: /usr/bin/ld: cannot find -lpq collect2: error: ld returned 1 exit status error: could not compile `storage_controller` (bin "storage_controller") due to 1 previous error ``` https://github.com/neondatabase/neon/actions/runs/9743960062/job/26888597735 ## Summary of changes - Set `PQ_LIB_DIR` for `gather-rust-build-stats` job	2024-07-01 16:42:23 +01:00
Alex Chi Z	b02aafdfda	fix(pageserver): include aux file in basebackup only once (#8207 ) Extracted from https://github.com/neondatabase/neon/pull/6560, currently we include multiple copies of aux files in the basebackup. ## Summary of changes Fix the loop. Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-07-01 14:36:49 +00:00
Alexander Bayandin	e823b92947	CI(build-tools): Remove libpq from build image (#8206 ) ## Problem We use `build-tools` image as a base image to build other images, and it has a pretty old `libpq-dev` installed (v13; it wasn't that old until I removed system Postgres 14 from `build-tools` image in https://github.com/neondatabase/neon/pull/6540) ## Summary of changes - Remove `libpq-dev` from `build-tools` image - Set `LD_LIBRARY_PATH` for tests (for different Postgres binaries that we use, like psql and pgbench) - Set `PQ_LIB_DIR` to build Storage Controller - Set `LD_LIBRARY_PATH`/`DYLD_LIBRARY_PATH` in the Storage Controller where it calls Postgres binaries	2024-07-01 13:11:55 +01:00
John Spray	aea5cfe21e	pageserver: add metric `pageserver_secondary_resident_physical_size` (#8204 ) ## Problem We lack visibility of how much local disk space is used by secondary tenant locations Close: https://github.com/neondatabase/neon/issues/8181 ## Summary of changes - Add `pageserver_secondary_resident_physical_size`, tagged by tenant - Register & de-register label sets from SecondaryTenant - Add+use wrappers in SecondaryDetail that update metrics when adding+removing layers/timelines	2024-07-01 12:48:20 +01:00
Heikki Linnakangas	9ce193082a	Restore running xacts from CLOG on replica startup (#7288 ) We have one pretty serious MVCC visibility bug with hot standby replicas. We incorrectly treat any transactions that are in progress in the primary, when the standby is started, as aborted. That can break MVCC for queries running concurrently in the standby. It can also lead to hint bits being set incorrectly, and that damage can last until the replica is restarted. The fundamental bug was that we treated any replica start as starting from a shut down server. The fix for that is straightforward: we need to set 'wasShutdown = false' in InitWalRecovery() (see changes in the postgres repo). However, that introduces a new problem: with wasShutdown = false, the standby will not open up for queries until it receives a running-xacts WAL record from the primary. That's correct, and that's how Postgres hot standby always works. But it's a problem for Neon, because: * It changes the historical behavior for existing users. Currently, the standby immediately opens up for queries, so if they now need to wait, we can break existing use cases that were working fine (assuming you don't hit the MVCC issues). * The problem is much worse for Neon than it is for standalone PostgreSQL, because in Neon, we can start a replica from an arbitrary LSN. In standalone PostgreSQL, the replica always starts WAL replay from a checkpoint record, and the primary arranges things so that there is always a running-xacts record soon after each checkpoint record. You can still hit this issue with PostgreSQL if you have a transaction with lots of subtransactions running in the primary, but it's pretty rare in practice. To mitigate that, we introduce another way to collect the running-xacts information at startup, without waiting for the running-xacts WAL record: We can scan the CLOG for XIDs that haven't been marked as committed or aborted. 
It has limitations with subtransactions too, but should mitigate the problem for most users. See https://github.com/neondatabase/neon/issues/7236. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-07-01 12:58:12 +03:00
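The CLOG scan described in the commit above can be illustrated with a minimal sketch. It assumes PostgreSQL's on-disk CLOG encoding (two status bits per XID, four XIDs per byte) and a flat in-memory buffer; the function names, the `base_xid` parameter, and the flat buffer are hypothetical and are not taken from the Neon or Postgres sources. Special XIDs, page boundaries, and subtransaction handling are ignored.

```rust
// Two-bit CLOG status values used by PostgreSQL
// (0x00 = in progress, 0x03 = sub-committed are the remaining two).
const COMMITTED: u8 = 0x01;
const ABORTED: u8 = 0x02;

/// Status of `xid` in a flat CLOG buffer whose first byte covers
/// XIDs `base_xid..base_xid + 4` (hypothetical layout for illustration).
fn xid_status(clog: &[u8], base_xid: u32, xid: u32) -> u8 {
    let idx = xid.wrapping_sub(base_xid) as usize;
    let byte = clog[idx / 4];
    // Each XID occupies two bits; the lowest XID sits in the lowest bits.
    (byte >> ((idx % 4) * 2)) & 0x03
}

/// Collect every XID in `oldest..next` that is neither committed nor aborted.
fn running_xids(clog: &[u8], base_xid: u32, oldest: u32, next: u32) -> Vec<u32> {
    let mut running = Vec::new();
    let mut xid = oldest;
    while xid != next {
        let status = xid_status(clog, base_xid, xid);
        if status != COMMITTED && status != ABORTED {
            running.push(xid);
        }
        xid = xid.wrapping_add(1);
    }
    running
}
```

In this sketch, sub-committed XIDs are reported as running because they are not yet marked committed or aborted, which matches the wording of the commit message but not necessarily the exact behavior of the real implementation.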
Heikki Linnakangas	75c84c846a	tests: Make neon_xlogflush() flush all WAL, if you omit the LSN arg This makes it much more convenient to use in the common case that you want to flush all the WAL. (Passing pg_current_wal_insert_lsn() as the argument doesn't work for the same reasons as explained in the comments: we need to back off to the beginning of a page if the previous record ended at page boundary.) I plan to use this to fix the issue that Arseny Sher called out at https://github.com/neondatabase/neon/pull/7288#discussion_r1660063852	2024-07-01 12:58:08 +03:00
Heikki Linnakangas	57535c039c	tests: remove a leftover 'running' flag (#8216 ) The 'running' boolean was replaced with a semaphore in commit `f0e2bb79b2`, but this initialization was missed. Remove it so that if a test tries to access it, you get an error rather than always claiming that the endpoint is not running. Spotted by Arseny at https://github.com/neondatabase/neon/pull/7288#discussion_r1660068657	2024-07-01 11:23:31 +03:00
Heikki Linnakangas	30027d94a2	Fix tracking of the nextMulti in the pageserver's copy of CheckPoint (#6528 ) Whenever we see an XLOG_MULTIXACT_CREATE_ID WAL record, we need to update the nextMulti and NextMultiOffset fields in the pageserver's copy of the CheckPoint struct, to cover the new multi-XID. In PostgreSQL, this is done by updating an in-memory struct during WAL replay, but because in Neon you can start a compute node at any LSN, we need to have an up-to-date value pre-calculated in the pageserver at all times. We do the same for nextXid. However, we had a bug in WAL ingestion code that does that: the multi-XIDs will wrap around at 2^32, just like XIDs, so we need to do the comparisons in a wraparound-aware fashion. Fix that, and add tests. Fixes issue #6520 Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-07-01 01:49:49 +03:00
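A wraparound-aware comparison of 32-bit multi-XIDs, as the commit above calls for, might look like the following sketch. It uses the same signed-difference idea as PostgreSQL's `MultiXactIdPrecedes`; the function names are hypothetical, and reserved values such as the invalid multi-XID 0 are not handled.

```rust
/// Returns true if `a` logically precedes `b` on the circular 32-bit
/// multi-XID space.
fn multixact_precedes(a: u32, b: u32) -> bool {
    // Reinterpreting the wrapped difference as signed yields a negative
    // number exactly when `a` is behind `b` by less than 2^31.
    (a.wrapping_sub(b) as i32) < 0
}

/// Advance `next_multi` so that it covers the multi-XID `seen`, but only
/// if that moves it forward in the circular ordering.
fn advance_next_multi(next_multi: u32, seen: u32) -> u32 {
    let candidate = seen.wrapping_add(1);
    if multixact_precedes(next_multi, candidate) {
        candidate
    } else {
        next_multi
    }
}
```

A plain `<` comparison would refuse to advance from a value just below 2^32 to a small value just after the wrap, which is the kind of bug the commit describes.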
Alex Chi Z	bc704917a3	fix(pageserver): ensure tenant harness has different names (#8205 ) rename the tenant test harness name Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-06-28 15:13:25 -04:00
John Spray	b8bbaafc03	storage controller: fix heatmaps getting disabled during shard split (#8197 ) ## Problem At the start of do_tenant_shard_split, we drop any secondary location for the parent shards. The reconciler uses presence of secondary locations as a condition for enabling heatmaps. On the pageserver, child shards inherit their configuration from parents, but the storage controller assumes the child's ObservedState is the same as the parent's config from the prepare phase. The result is that some child shards end up with inaccurate ObservedState, and until something next migrates or restarts, those tenant shards aren't uploading heatmaps, so their secondary locations are downloading everything that was resident at the moment of the split (including ancestor layers which are often cleaned up shortly after the split). Closes: https://github.com/neondatabase/neon/issues/8189 ## Summary of changes - Use PlacementPolicy to control enablement of heatmap upload, rather than the literal presence of secondaries in IntentState: this way we avoid switching them off during shard split - test: during tenant split test, assert that the child shards have heatmap uploads enabled.	2024-06-28 18:27:13 +01:00
Arthur Petukhovsky	e1a06b40b7	Add rate limiter for partial uploads (#8203 ) Too many concurrent partial uploads can hurt disk performance, so this commit adds a limiter. Context: https://neondb.slack.com/archives/C04KGFVUWUQ/p1719489018814669?thread_ts=1719440183.134739&cid=C04KGFVUWUQ	2024-06-28 18:16:21 +01:00
John Spray	babbe125da	pageserver: drop out of secondary download if iteration time has passed (#8198 ) ## Problem Very long running downloads can be wasteful, because the heatmap they're using is outdated after a few minutes. Closes: https://github.com/neondatabase/neon/issues/8182 ## Summary of changes - Impose a deadline on timeline downloads, using the same period as we use for scheduling, and returning an UpdateError::Restart when it is reached. This restart will involve waiting for a scheduling interval, but that's a good thing: it helps let other tenants proceed. - Refactor download_timeline so that the part where we update the state for local layers is done even if we fall out of the layer download loop with an error: this is important, especially for big tenants, because only layers in the SecondaryDetail state will be considered for eviction.	2024-06-28 17:05:09 +00:00
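The deadline behavior described in the commit above can be sketched roughly as follows. Only the name `UpdateError::Restart` comes from the commit message; `download_with_deadline`, its parameters, and the synchronous closure are assumptions made for illustration, and the real code is asynchronous.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum UpdateError {
    Restart,
}

/// Download layers until the list is exhausted or `period` has elapsed.
/// Returns the layers that were fetched together with the outcome, so the
/// caller can record local state even when the loop ends early.
fn download_with_deadline<F>(
    layers: &[&str],
    period: Duration,
    mut download: F,
) -> (Vec<String>, Result<(), UpdateError>)
where
    F: FnMut(&str),
{
    let deadline = Instant::now() + period;
    let mut downloaded = Vec::new();
    let mut result = Ok(());
    for &layer in layers {
        if Instant::now() >= deadline {
            // The heatmap is probably stale by now: ask for a restart.
            result = Err(UpdateError::Restart);
            break;
        }
        download(layer);
        downloaded.push(layer.to_string());
    }
    // The downloaded layers are returned on both the Ok and the Err path.
    (downloaded, result)
}
```

Returning the partial list alongside the error mirrors the second bullet of the commit: layers that did land on disk stay visible to eviction even if the iteration was cut short.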
Heikki Linnakangas	ca2f7d06b2	Cherry-pick upstream fix for TruncateMultiXact assertion (#8195 ) We hit that bug in a new test being added in PR #6528. We'd get the fix from upstream with the next minor release anyway, but cherry-pick it now to unblock PR #6528. Upstream commit b1ffe3ff0b. See https://github.com/neondatabase/neon/pull/6528#issuecomment-2167367910	2024-06-28 16:47:05 +03:00
Arthur Petukhovsky	c22c6a6c9e	Add buckets to safekeeper ops metrics (#8194 ) In #8188 I forgot to specify buckets for new operations metrics. This commit fixes that.	2024-06-28 11:09:11 +01:00
Christian Schwarz	deec3bc578	virtual_file: take a `Slice` in the read APIs, eliminate `read_exact_at_n`, fix UB for engine `std-fs` (#8186 ) part of https://github.com/neondatabase/neon/issues/7418 I reviewed what the VirtualFile API's `read` methods look like and came to the conclusion that we've been using `IoBufMut` / `BoundedBufMut` / `Slice` wrong. This patch rectifies the situation. # Change 1: take `tokio_epoll_uring::Slice` in the read APIs Before, we took an `IoBufMut`, which is too low of a primitive and while it _seems_ convenient to be able to pass in a `Vec<u8>` without any fuss, it's actually very unclear at the callsite that we're going to fill up that `Vec` up to its `capacity()`, because that's what `IoBuf::bytes_total()` returns and that's what `VirtualFile::read_exact_at` fills. By passing a `Slice` instead, a caller that "just wants to read into a `Vec`" is forced to be explicit about it, adding either `slice_full()` or `slice(x..y)`, and these methods panic if the read is outside of the bounds of the `Vec::capacity()`. Last, passing slices is more similar to what the `std::io` APIs look like. # Change 2: fix UB in `virtual_file_io_engine=std-fs` While reviewing call sites, I noticed that the `io_engine::IoEngine::read_at` method for `StdFs` mode has been constructing an `&mut[u8]` from raw parts that were uninitialized. We then used `std::fs::File::read_exact` to initialize that memory, but, IIUC we must not even be constructing an `&mut[u8]` where some of the memory isn't initialized. So, stop doing that and add a helper ext trait on `Slice` to do the zero-initialization. # Change 3: eliminate `read_exact_at_n` The `read_exact_at_n` doesn't make sense because the caller can just 1. `slice = buf.slice()` the exact memory it wants to fill 2. `slice = read_exact_at(slice)` 3. `buf = slice.into_inner()` Again, the `std::io` APIs specify the length of the read via the Rust slice length. 
We should do the same for the owned buffers IO APIs, i.e., via `Slice::bytes_total()`. # Change 4: simplify filling of `PageWriteGuard` The `PageWriteGuardBuf::init_up_to` was never necessary. Remove it. See changes to doc comment for more details. --- Reviewers should probably look at the added test case first, it illustrates my case a bit.	2024-06-28 11:20:37 +02:00
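The ideas behind changes 1 and 2 of the commit above can be illustrated with the standard library alone. This is a sketch under assumptions, not the `tokio_epoll_uring` or `VirtualFile` API: `read_exact_into` is a hypothetical helper that takes an owned buffer plus an explicit range, panics if the range exceeds the buffer's capacity, and zero-initializes the bytes before a `&mut [u8]` is ever formed over them.

```rust
use std::io::{self, Read};
use std::ops::Range;

/// Fill `buf[range]` from `reader` and hand the buffer back to the caller.
/// Panics if `range` does not fit inside the buffer's capacity.
fn read_exact_into<R: Read>(
    reader: &mut R,
    mut buf: Vec<u8>,
    range: Range<usize>,
) -> io::Result<Vec<u8>> {
    assert!(
        range.start <= range.end && range.end <= buf.capacity(),
        "read range is outside of the buffer's capacity"
    );
    // Zero-initialize up to the end of the range, so the mutable slice
    // below never covers uninitialized memory. This stays within the
    // existing capacity and therefore does not reallocate.
    if buf.len() < range.end {
        buf.resize(range.end, 0);
    }
    reader.read_exact(&mut buf[range])?;
    Ok(buf)
}
```

As with `std::io`, the length of the read is determined by the length of the slice the caller asks for, so no separate "read n bytes" variant is needed.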
John Spray	063553a51b	pageserver: remove tenant create API (#8135 ) ## Problem For some time, we have created tenants with calls to location_conf. The legacy "POST /v1/tenant" path was only used in some tests. ## Summary of changes - Remove the API - Relocate TenantCreateRequest to the controller API file (this used to be used in both pageserver and controller APIs) - Rewrite tenant_create test helper to use location_config API, as control plane and storage controller do - Update docker-compose test script to create tenants with location_config API (this small commit is also present in https://github.com/neondatabase/neon/pull/7947)	2024-06-28 09:14:19 +01:00
Tristan Partin	5700233a47	Add application_name to compute activity monitor connection string This was missed in my previous attempt to mark every connection string with an application name. See `0c3e3a8667`.	2024-06-27 12:38:15 -07:00
Arthur Petukhovsky	1d66ca79a9	Improve slow operations observability in safekeepers (#8188 ) After https://github.com/neondatabase/neon/pull/8022 was deployed to staging, I noticed many cases of timeouts. After inspecting the logs, I realized that some operations take ~20 seconds and run while holding the shared state lock. Usually it happens right after redeploy, because compute reconnections put high load on disks. This commit tries to improve observability around slow operations. Non-observability changes: - `TimelineState::finish_change` now skips update if nothing has changed - `wal_residence_guard()` timeout is set to 30s	2024-06-27 18:39:43 +01:00
Alex Chi Z	23827c6b0d	feat(pageserver): add delta layer iterator (#8064 ) part of https://github.com/neondatabase/neon/issues/8002 ## Summary of changes Add delta layer iterator and tests. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-06-27 16:03:48 +00:00
Christian Schwarz	66b0bf41a1	fix: shutdown does not kill walredo processes (#8150 ) While investigating Pageserver logs from the cases where systemd hangs during shutdown (https://github.com/neondatabase/cloud/issues/11387), I noticed that even if Pageserver shuts down cleanly[^1], there are lingering walredo processes. [^1]: Meaning, pageserver finishes its shutdown procedure and calls `exit(0)` on its own terms, instead of hitting the systemd unit's `TimeoutSec=` limit and getting SIGKILLed. While systemd should never lock up like it does, maybe we can avoid hitting that bug by cleaning up properly. Changes ------- This PR adds a shutdown method to `WalRedoManager` and hooks it up to tenant shutdown. We keep track of intent to shutdown through the new `enum ProcessOnceCell` stored inside the pre-existing `redo_process` field. A gate is added to keep track of running processes, using the new type `struct Process`. Future Work ----------- Requests that don't need the redo process will not observe the shutdown (see doc comment). Doing so would be nice for completeness sake, but doesn't provide much benefit because `Tenant` and `Timeline` already shut down all walredo users. Testing ------- I did manual testing to confirm that the problem exists before this PR and that it's gone after. Setup: * `neon_local` with a single tenant, create some data using `pgbench` * ensure walredo process is running, not pid * watch `strace -e kill,wait4 -f -p "$(pgrep pageserver)"` * `neon_local pageserver stop` With this PR, we always observe ``` $ strace -e kill,wait4 -f -p "$(pgrep pageserver)" ... [pid 591120] --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=591215, si_uid=1000} --- [pid 591134] kill(591174, SIGKILL) = 0 [pid 591134] wait4(591174, <unfinished ...> [pid 591142] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=591174, si_uid=1000, si_status=SIGKILL, si_utime=0, si_stime=0} --- [pid 591134] <... 
wait4 resumed>[{WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL}], 0, NULL) = 591174 ... +++ exited with 0 +++ ``` Before this PR, we'd usually observe just ``` ... [pid 596239] --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=596455, si_uid=1000} --- ... +++ exited with 0 +++ ``` Refs ---- refs https://github.com/neondatabase/cloud/issues/11387	2024-06-27 15:58:28 +02:00
Vlad Lazar	89cf8df93b	stocon: bump number of concurrent reconciles per operation (#8179 ) ## Problem Background node operations take a long time for loaded nodes. ## Summary of changes Increase number of concurrent reconciles an operation is allowed to spawn. This should make drain and fill operations faster and the new value is still well below the total limit of concurrent reconciles.	2024-06-27 13:16:41 +00:00
Alexander Bayandin	54a06de4b5	CI: Use `runner.arch` in cache keys along with `runner.os` (#8175 ) ## Problem The cache keys that we use on CI are the same for X64 and ARM64 (`runner.arch`) ## Summary of changes - Include `runner.arch` along with `runner.os` into cache keys	2024-06-27 13:56:03 +01:00
Arseny Sher	6f20a18e8e	Allow to change compute safekeeper list without restart. - Add --safekeepers option to neon_local reconfigure - Add it to python Endpoint reconfigure - Implement config reload in walproposer by restarting the whole bgw when safekeeper list changes. ref https://github.com/neondatabase/neon/issues/6341	2024-06-27 15:08:35 +03:00
Vlad Lazar	d557002675	strocon: don't overcommit when making node fill plan (#8171 ) ## Problem The fill requirement was not taken into account when looking through the shards of a given node to fill from. ## Summary of Changes Ensure that we do not fill a node past the recommendation from `Scheduler::compute_fill_requirement`.	2024-06-27 11:56:57 +01:00
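The fix described in the commit above amounts to checking the fill requirement inside the per-shard loop as well as the per-node loop. A minimal sketch, with a hypothetical `make_fill_plan` function and plain integers standing in for node and shard identifiers (the real planner and `Scheduler::compute_fill_requirement` are not reproduced here):

```rust
/// Pick shards to move onto the node being filled, never exceeding
/// `fill_requirement` shards in total.
fn make_fill_plan(shards_by_node: &[(u32, Vec<u32>)], fill_requirement: usize) -> Vec<u32> {
    let mut plan = Vec::new();
    for (_node, shards) in shards_by_node {
        for &shard in shards {
            // Check the budget for every shard, not just once per node.
            if plan.len() >= fill_requirement {
                return plan;
            }
            plan.push(shard);
        }
    }
    plan
}
```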
Cihan Demirci	32b75e7c73	CI: additional trigger on merge to main (#8176 ) Before we consolidate workflows we want to be triggered by merges to main. https://github.com/neondatabase/cloud/issues/14862	2024-06-26 22:36:41 +00:00
Heikki Linnakangas	d2753719e3	test: Add helper function for importing a Postgres cluster (#8025 ) Also, modify the "neon_local timeline import" command so that it doesn't create the endpoint any more. I don't see any reason to bundle that in the same command, the "timeline create" and "timeline branch" commands don't do that either. I plan to add more tests similar to 'test_import_at_2bil', this will help to reduce the copy-pasting.	2024-06-26 21:54:29 +00:00
Alex Chi Z	04b2ac3fed	test: use aux file v2 policy in benchmarks (#8174 ) Use aux file v2 in benchmarks. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-06-26 20:33:15 +00:00
John Spray	c39d5b03e8	pageserver: remove legacy tenant config code, clean up redundant generation none/broken usages (#7947 ) ## Problem In https://github.com/neondatabase/neon/pull/5299, the new config-v1 tenant config file was added to hold the LocationConf type. We left the old config file in place for forward compat, and because running without generations (therefore without LocationConf) was still useful before the storage controller was ready for prime-time. Closes: https://github.com/neondatabase/neon/issues/5388 ## Summary of changes - Remove code for reading and writing the legacy config file - Remove Generation::Broken: it was unused. - Treat missing config file on disk as an error loading a tenant, rather than defaulting it. We can now remove LocationConf::default, and thereby guarantee that we never construct a tenant with a None generation. - Update some comments + add some assertions to clarify that Generation::None is only used in layer metadata, not in the state of a running tenant. - Update docker compose test to create tenants with a generation	2024-06-26 19:53:59 +00:00
Arthur Petukhovsky	76fc3d4aa1	Evict WAL files from disk (#8022 ) Fixes https://github.com/neondatabase/neon/issues/6337 Add safekeeper support to switch between `Present` and `Offloaded(flush_lsn)` states. The offloading is disabled by default, but can be controlled using new cmdline arguments: ``` --enable-offload Enable automatic switching to offloaded state --delete-offloaded-wal Delete local WAL files after offloading. When disabled, they will be left on disk --control-file-save-interval <CONTROL_FILE_SAVE_INTERVAL> Pending updates to control file will be automatically saved after this interval [default: 300s] ``` Manager watches state updates and detects when there is no activity on the timeline and actual partial backup upload in remote storage. When all conditions are met, the state can be switched to offloaded. In `timeline.rs` there is `StateSK` enum to support switching between states. When offloaded, code can access only control file structure and cannot use `SafeKeeper` to accept new WAL. `FullAccessTimeline` is now renamed to `WalResidentTimeline`. This struct contains guard to notify manager about active tasks requiring on-disk WAL access. All guards are issued by the manager, all requests are sent via channel using `ManagerCtl`. When manager receives request to issue a guard, it unevicts timeline if it's currently evicted. Fixed a bug in partial WAL backup, it used `term` instead of `last_log_term` previously. After this commit is merged, next step is to roll this change out, as in issue #6338.	2024-06-26 18:58:56 +01:00
Vlad Lazar	dd3adc3693	docker: downgrade openssl to 1.1.1w (#8168 ) ## Problem We have seen numerous segfault and memory corruption issues for clients using libpq and openssl 3.2.2. I don't know if this is a bug in openssl or libpq. Downgrading to 1.1.1w fixes the issues for the storage controller and pgbench. ## Summary of Changes: Use openssl 1.1.1w instead of 3.2.2	2024-06-26 17:27:23 +00:00
Heikki Linnakangas	5b871802fd	Add counters for commands processed through the libpq page service API (#8089 ) I was looking for metrics on how many computes are still using protocol version 1 and 2. This provides counters for that as "pagestream" and "pagestream_v2" commands, but also all the other commands. The new metrics are global for the whole pageserver instance rather than per-tenant, so the additional metrics bloat should be fairly small.	2024-06-26 19:53:03 +03:00
Heikki Linnakangas	24ce73ffaf	Silence compiler warning (#8153 ) I saw this compiler warning on my laptop: pgxn/neon_walredo/walredoproc.c:178:10: warning: using the result of an assignment as a condition without parentheses [-Wparentheses] if (err = close_range_syscall(3, ~0U, 0)) { ~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ pgxn/neon_walredo/walredoproc.c:178:10: note: place parentheses around the assignment to silence this warning if (err = close_range_syscall(3, ~0U, 0)) { ^ ( ) pgxn/neon_walredo/walredoproc.c:178:10: note: use '==' to turn this assignment into an equality comparison if (err = close_range_syscall(3, ~0U, 0)) { ^ == 1 warning generated. I'm not sure what compiler version or options cause that, but it's a good warning. Write the call a little differently, to avoid the warning and to make it a little more clear anyway. (The 'err' variable wasn't used for anything, so I'm surprised we were not seeing a compiler warning on the unused value, too.)	2024-06-26 19:19:27 +03:00
Arthur Petukhovsky	3118c24521	Panic on unexpected error in simtests (#8169 )	2024-06-26 16:46:14 +01:00
Alexander Bayandin	5af9660b9e	CI(build-tools): don't install Postgres 14 (#6540 ) ## Problem We install Postgres 14 in `build-tools` image, but we don't need it. We use Postgres binaries, which we build ourselves. ## Summary of changes - Remove Postgresql 14 installation from `build-tools` image	2024-06-26 16:37:04 +01:00
Conrad Ludgate	d7e349d33c	proxy: report blame for passthrough disconnect io errors (#8170 ) ## Problem Hard to debug the disconnection reason currently. ## Summary of changes Keep track of error-direction, and therefore error source (client vs compute) during passthrough.	2024-06-26 15:11:26 +00:00
Arthur Petukhovsky	47e5bf3bbb	Improve term reject message in walproposer (#8164 ) Co-authored-by: Tristan Partin <tristan@neon.tech>	2024-06-26 15:26:52 +01:00
Alex Chi Z	5d2f9ffa89	test(bottom-most-compaction): wal apply order (#8163 ) A follow-up on https://github.com/neondatabase/neon/pull/8103/. Previously, main branch fails with: ``` assertion `left == right` failed left: b"value 3@0x10@0x30@0x28@0x40" right: b"value 3@0x10@0x28@0x30@0x40" ``` This gets fixed after #8103 gets merged. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-06-26 09:34:41 -04:00
Heikki Linnakangas	fdadd6a152	Remove primary_is_running (#8162 ) This was a half-finished mechanism to allow a replica to enter hot standby mode sooner, without waiting for a running-xacts record. It had issues, and we are working on a better mechanism to replace it. The control plane might still set the flag in the spec file, but compute_ctl will simply ignore it.	2024-06-26 15:13:03 +03:00
Peter Bendel	9b623d3a2c	add commit hash to S3 object identifier for artifacts on S3 (#8161 ) In future we may want to run periodic tests on dedicated cloud instances that are not GitHub action runners. To allow these to download artifact binaries for a specific commit hash we want to make the search by commit hash possible and prefix the S3 objects with `artifacts/${GITHUB_SHA}/${GITHUB_RUN_ID}/${GITHUB_RUN_ATTEMPT}` --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2024-06-26 05:46:52 +00:00
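The object-key layout named in the commit above can be sketched as a pair of small helpers. The layout `artifacts/${GITHUB_SHA}/${GITHUB_RUN_ID}/${GITHUB_RUN_ATTEMPT}` comes from the commit message; the function names and the idea of a separate per-commit search prefix are assumptions for illustration.

```rust
/// Full prefix under which one workflow run attempt stores its artifacts.
fn artifact_prefix(sha: &str, run_id: &str, run_attempt: &str) -> String {
    format!("artifacts/{}/{}/{}", sha, run_id, run_attempt)
}

/// Prefix to list when only the commit hash is known.
fn commit_search_prefix(sha: &str) -> String {
    format!("artifacts/{}/", sha)
}
```

Because the commit hash is the first path component after `artifacts/`, a consumer outside GitHub Actions could list by the shorter prefix and then choose among the run IDs and attempts it finds.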
Alex Chi Z	9b98823d61	bottom-most-compaction: use in test_gc_feedback + fix bugs (#8103 ) Adds manual compaction trigger; add gc compaction to test_gc_feedback Part of https://github.com/neondatabase/neon/issues/8002 ``` test_gc_feedback[debug-pg15].logical_size: 50 Mb test_gc_feedback[debug-pg15].physical_size: 2269 Mb test_gc_feedback[debug-pg15].physical/logical ratio: 44.5302 test_gc_feedback[debug-pg15].max_total_num_of_deltas: 7 test_gc_feedback[debug-pg15].max_num_of_deltas_above_image: 2 test_gc_feedback[debug-pg15].logical_size_after_bottom_most_compaction: 50 Mb test_gc_feedback[debug-pg15].physical_size_after_bottom_most_compaction: 287 Mb test_gc_feedback[debug-pg15].physical/logical ratio after bottom_most_compaction: 5.6312 test_gc_feedback[debug-pg15].max_total_num_of_deltas_after_bottom_most_compaction: 4 test_gc_feedback[debug-pg15].max_num_of_deltas_above_image_after_bottom_most_compaction: 1 ``` ## Summary of changes * Add the manual compaction trigger * Use in test_gc_feedback * Add a guard to avoid running it with retain_lsns * Fix: Do `schedule_compaction_update` after compaction * Fix: Supply deltas in the correct order to reconstruct value --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-06-25 23:00:14 +00:00
Alex Chi Z	76864e6a2a	feat(pageserver): add image layer iterator (#8006 ) part of https://github.com/neondatabase/neon/issues/8002 ## Summary of changes This pull request adds the image layer iterator. It buffers a fixed number of key-value pairs in memory and gives developers an iterator abstraction over the image layer. Once the buffer is exhausted, it will issue 1 I/O to fetch the next batch. Due to the Rust lifetime mysteries, the `get_stream_from` API has been refactored to `into_stream` and consumes `self`. Delta layer iterator implementation will be similar, therefore I'll add it after this pull request gets merged. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-06-25 20:49:29 +00:00
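The buffering scheme described in the commit above (hold a fixed number of key-value pairs, issue one I/O when the buffer runs dry) can be sketched with an in-memory stand-in for the layer file. `FakeLayer`, `BatchedIter`, and the `u64` keys are hypothetical; the real iterator works on on-disk image layers with asynchronous I/O.

```rust
use std::collections::VecDeque;

/// In-memory stand-in for an image layer; counts simulated I/Os.
struct FakeLayer {
    entries: Vec<(u64, Vec<u8>)>,
    io_count: usize,
}

impl FakeLayer {
    /// One simulated I/O returning at most `max` entries from `start`.
    fn read_batch(&mut self, start: usize, max: usize) -> Vec<(u64, Vec<u8>)> {
        self.io_count += 1;
        let end = (start + max).min(self.entries.len());
        self.entries[start..end].to_vec()
    }
}

/// Iterator that owns the layer and refills its buffer one batch at a time.
struct BatchedIter {
    layer: FakeLayer,
    next_index: usize,
    batch_size: usize,
    buffer: VecDeque<(u64, Vec<u8>)>,
}

impl BatchedIter {
    /// Consumes the layer, which sidesteps borrowing it for the iterator's lifetime.
    fn new(layer: FakeLayer, batch_size: usize) -> Self {
        assert!(batch_size > 0, "batch size must be positive");
        BatchedIter { layer, next_index: 0, batch_size, buffer: VecDeque::new() }
    }
}

impl Iterator for BatchedIter {
    type Item = (u64, Vec<u8>);

    fn next(&mut self) -> Option<Self::Item> {
        // Refill only when the buffer is empty and entries remain.
        if self.buffer.is_empty() && self.next_index < self.layer.entries.len() {
            let batch = self.layer.read_batch(self.next_index, self.batch_size);
            self.next_index += batch.len();
            self.buffer.extend(batch);
        }
        self.buffer.pop_front()
    }
}
```

With five entries and a batch size of two, this sketch performs three simulated reads in total, one per refill.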
Conrad Ludgate	6c5d3b5263	proxy fix wake compute console retry (#8141 ) ## Problem 1. Proxy is retrying errors from cplane that shouldn't be retried 2. ~~Proxy is not using the retry_after_ms value~~ ## Summary of changes 1. Correct the could_retry impl for ConsoleError. 2. ~~Update could_retry interface to support returning a fixed wait duration.~~	2024-06-25 18:07:54 +00:00
Christian Schwarz	cd9a550d97	clippy-deny the `todo!()` macro (#4340 ) `todo!()` shouldn't slip into prod code	2024-06-25 18:03:27 +00:00
John Spray	07f21dd6b6	pageserver: remove attach/detach apis (#8134 ) ## Problem These APIs have been deprecated for some time, but were still used from test code. Closes: https://github.com/neondatabase/neon/issues/4282 ## Summary of changes - It is still convenient to do a "tenant_attach" from a test without having to write out a location_conf body, so those test methods have been retained with implementations that call through to their location_conf equivalent.	2024-06-25 17:38:06 +01:00
Heikki Linnakangas	64a4461191	Fix submodule references to match the REL_*_STABLE_neon branches (#8159 ) No code changes, just point to the correct commit SHAs.	2024-06-25 19:05:13 +03:00
Yuchen Liang	961fc0ba8f	feat(pageserver): add metrics for number of valid leases after each refresh (#8147 ) Part of #7497, closes #8120. ## Summary of changes This PR adds a metric to track the number of valid leases after `GCInfo` gets refreshed each time. Besides this metric, we should also track disk space and synthetic size (after #8071 is closed) to make sure leases are used properly. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-06-25 15:43:12 +00:00

5536 Commits