Comparing 98882548d8..5340423416 - neon

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-24 06:20:41 +00:00

Author	SHA1	Message	Date
github-actions[bot]	5340423416	Storage release 2025-07-25 06:12 UTC	2025-07-25 06:12:11 +00:00
github-actions[bot]	c1de242c1c	Storage release 2025-07-18 06:11 UTC	2025-07-18 06:11:53 +00:00
Vlad Lazar	88cf0c04d5	Storage release 2025-07-11 06:11 UTC	2025-07-11 15:04:44 +01:00
Vlad Lazar	c19e1e76ac	pageserver: log only on final shard resolution failure (#12565 ) This log is too noisy. Instead of warning on every retry, let's log only on the final failure.	2025-07-11 15:04:25 +01:00
github-actions[bot]	a72eaf701e	Storage release 2025-07-04 06:11 UTC	2025-07-04 06:11:44 +00:00
Vlad Lazar	8693b85986	Storage release 2025-07-01 07:57 UTC	2025-07-01 09:57:07 +02:00
Arpad Müller	29cf6a36c2	detach_ancestor: delete the right layer when hardlink fails (#12397 ) If a hardlink operation inside `detach_ancestor` fails due to the layer already existing, we delete the layer to make sure the source is one we know about, and then retry. But we deleted the wrong file, namely, the one we wanted to use as the source of the hardlink. As a result, the follow up hard link operation failed. Our PR corrects this mistake.	2025-07-01 09:57:07 +02:00
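The corrected retry can be sketched as follows. This is a minimal illustration using only `std::fs`, not the pageserver code; `hardlink_with_retry` and its parameters are hypothetical names.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hard-link `src` to `dst`. If `dst` already exists, remove `dst`
/// (never `src`) and retry once.
fn hardlink_with_retry(src: &Path, dst: &Path) -> io::Result<()> {
    match fs::hard_link(src, dst) {
        Ok(()) => Ok(()),
        Err(e) if e.kind() == io::ErrorKind::AlreadyExists => {
            // Remove the stale destination. Removing `src` here was the bug:
            // the retry would then fail because its source no longer exists.
            fs::remove_file(dst)?;
            fs::hard_link(src, dst)
        }
        Err(e) => Err(e),
    }
}
```

After a successful call, `dst` refers to the same file as `src`, and `src` is left untouched.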
Vlad Lazar	4beca2e639	Storage release 2025-06-27 06:11 UTC	2025-06-27 17:37:32 +02:00
Arpad Müller	103c866a90	Fix hang deleting offloaded timelines (#12366 ) We don't have cancellation support for timeline deletions. In other words, timeline deletion might still go on in an older generation while we are attaching it in a newer generation already, because the cancellation simply hasn't reached the deletion code. This has caused us to hit a situation with offloaded timelines in which the timeline was in an unrecoverable state: always returning an accepted response, but never a 404 like it should be. The detailed description can be found in [here](https://github.com/neondatabase/cloud/issues/30406#issuecomment-3008667859) (private repo link). TLDR: 1. we ask to delete timeline on old pageserver/generation, starts process in background 2. the storcon migrates the tenant to a different pageserver. - during attach, the pageserver still finds an index part, so it adds it to `offloaded_timelines` 4. the timeline deletion finishes, removing the index part in S3 5. there is a retry of the timeline deletion endpoint, sent to the new pageserver location. it is bound to fail however: - as the index part is gone, we print `Timeline already deleted in remote storage`. - the problem is that we then return an accepted response code, and not a 404. - this confuses the code calling us. it thinks the timeline is not deleted, so keeps retrying. - this state never gets recovered from until a reset/detach, because of the `offloaded_timelines` entry staying there. This is where this PR fixes things: if no index part can be found, we can safely assume that the timeline is gone in S3 (it's the last thing to be deleted), so we can remove it from `offloaded_timelines` and trigger a reupload of the manifest. Subsequent retries will pick that up. Why not improve the cancellation support? It is a more disruptive code change, that might have its own risks. So we don't do it for now. Fixes https://github.com/neondatabase/cloud/issues/30406	2025-06-27 17:36:01 +02:00
github-actions[bot]	38c73bea87	Storage release 2025-06-20 12:32 UTC	2025-06-20 12:32:41 +00:00
github-actions[bot]	345662cbc2	Storage release 2025-06-20 06:11 UTC	2025-06-20 06:11:33 +00:00
github-actions[bot]	e23726f31c	Storage release 2025-06-13 07:48 UTC	2025-06-13 07:48:37 +00:00
Alex Chi Z	e52313d216	Storage release 2025-06-06 10:55 UTC	2025-06-06 22:23:10 +08:00
Conrad Ludgate	6585f71137	[proxy] separate compute connect from compute authentication (#12145 ) ## Problem PGLB/Neonkeeper needs to separate the concerns of connecting to compute, and authenticating to compute. Additionally, the code within `connect_to_compute` is rather messy, spending effort on recovering the authentication info after wake_compute. ## Summary of changes Split `ConnCfg` into `ConnectInfo` and `AuthInfo`. `wake_compute` only returns `ConnectInfo` and `AuthInfo` is determined separately from the `handshake`/`authenticate` process. Additionally, `ConnectInfo::connect_raw` is in charge of establishing the TLS connection, and the `postgres_client::Config::connect_raw` is configured to use `NoTls` which will force it to skip the TLS negotiation. This should just work.	2025-06-06 22:23:01 +08:00
Alexander Sarantcev	765b76f4cd	storcon: Introduce deletion tombstones to support flaky node scenario (#12096 ) ## Problem Removed nodes can re-add themselves on restart if not properly tombstoned. We need a mechanism (e.g. soft-delete flag) to prevent this, especially in cases where the node is unreachable. More details there: #12036 ## Summary of changes - Introduced `NodeLifecycle` enum to represent node lifecycle states. - Added a string representation of `NodeLifecycle` to the `nodes` table. - Implemented node removal using a tombstone mechanism. - Introduced `/debug/v1/tombstone*` handlers to manage the tombstone state.	2025-06-06 22:23:01 +08:00
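A lifecycle enum with a string representation, as described above, might look like the sketch below. The variant names, the string values, and the `may_register` helper are assumptions for illustration; the commit message does not list them.

```rust
use std::fmt;
use std::str::FromStr;

/// Hypothetical lifecycle states; `Deleted` acts as the tombstone.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodeLifecycle {
    Active,
    Deleted,
}

impl fmt::Display for NodeLifecycle {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // String form suitable for storing in a text column.
        f.write_str(match self {
            NodeLifecycle::Active => "active",
            NodeLifecycle::Deleted => "deleted",
        })
    }
}

impl FromStr for NodeLifecycle {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "active" => Ok(NodeLifecycle::Active),
            "deleted" => Ok(NodeLifecycle::Deleted),
            other => Err(format!("unknown node lifecycle: {other}")),
        }
    }
}

/// A restarting node may (re-)register only if no tombstone exists for it.
fn may_register(existing: Option<NodeLifecycle>) -> bool {
    !matches!(existing, Some(NodeLifecycle::Deleted))
}
```

With this shape, a removed node that restarts and tries to re-add itself is rejected because its row still carries the tombstone.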
Erik Grinaker	72b09473c1	pageserver: move `spawn_grpc` to `GrpcPageServiceHandler::spawn` (#12147 ) Mechanical move, no logic changes.	2025-06-06 22:23:01 +08:00
Alex Chi Z.	72a6d668b5	feat(build): add aws cli into the docker image (#12161 ) ## Problem Makes it easier to debug AWS permission issues (i.e., storage scrubber) ## Summary of changes Install awscliv2 into the docker image. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-06-06 22:23:01 +08:00
Alex Chi Z.	b5d7296e04	test(pageserver): ensure offload cleans up metrics (#12127 ) Add a test to ensure timeline metrics are fully cleaned up after offloading. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-06-06 22:23:01 +08:00
Arpad Müller	854de6d221	neon_local timeline import: create timelines on safekeepers (#12138 ) neon_local's timeline import subcommand creates timelines manually, but doesn't create them on the safekeepers. If a test then tries to open an endpoint to read from the timeline, it will error in the new world with `--timelines-onto-safekeepers`. Therefore, if that flag is enabled, create the timelines on the safekeepers. Note that this import functionality is different from the fast import feature (https://github.com/neondatabase/neon/issues/10188, #11801). Part of #11670 As well as part of #11712	2025-06-06 22:23:01 +08:00
Alexey Kondratov	8dee1a7d0f	feat(compute_ctl): Implement graceful compute monitor exit (#11911 ) ## Problem After introducing a naive downtime calculation for the Postgres process inside compute in https://github.com/neondatabase/neon/pull/11346, I noticed that some amount of computes regularly report short downtime. After checking some particular cases, it looks like all of them report downtime close to the end of the life of the compute, i.e., when the control plane calls a `/terminate` and we are waiting for Postgres to exit. Compute monitor also produces a lot of error logs because Postgres stops accepting connections, but it's expected during the termination process. ## Summary of changes Regularly check the compute status inside the main compute monitor loop and exit gracefully when the compute is in some terminal or soon-to-be-terminal state. --------- Co-authored-by: Tristan Partin <tristan@neon.tech>	2025-06-06 22:23:01 +08:00
Dmitrii Kovalkov	b232c18441	pageserver, tests: prepare test_basebackup_cache for --timelines-onto-safekeepers (#12143 ) ## Problem - `test_basebackup_cache` fails in https://github.com/neondatabase/neon/pull/11712 because after the timelines on safekeepers are managed by storage controller, they do contain proper start_lsn and the compute_ctl tool sends the first basebackup request with this LSN. - `Failed to prepare basebackup` log messages during timeline initialization, because the timeline is not yet in the global timeline map. - Relates to https://github.com/neondatabase/cloud/issues/29353 ## Summary of changes - Account for `timeline_onto_safekeepers` storcon's option in the test. - Do not trigger basebackup prepare during the timeline initialization.	2025-06-06 22:23:01 +08:00
a-masterov	f25da470c7	Configure the dynamic loader for the extension-tests image (#12142 ) ## Problem The same problem, fixed in https://github.com/neondatabase/neon/issues/11857, but for the image `neon-extesions-test` ## Summary of changes The config file was added to use our library	2025-06-06 22:23:01 +08:00
Erik Grinaker	7f1f5c8487	pagebench: add batch support (#12133 ) ## Problem The new gRPC page service protocol supports client-side batches. The current libpq protocol only does best-effort server-side batching. To compare these approaches, Pagebench should support submitting contiguous page batches, similar to how Postgres will submit them (e.g. with prefetches or vectored reads). ## Summary of changes Add a `--batch-size` parameter specifying the size of contiguous page batches. One batch counts as 1 RPS and 1 queue depth. For the libpq protocol, a batch is submitted as individual requests and we rely on the server to batch them for us. This will give a realistic comparison of how these would be processed in the wild (e.g. when Postgres sends 100 prefetch requests). This patch also adds some basic validation of responses.	2025-06-06 22:23:01 +08:00
Vlad Lazar	c01591ce61	pageserver: remove handling of vanilla protocol (#12126 ) ## Problem We support two ingest protocols on the pageserver: vanilla and interpreted. Interpreted has been the only protocol in use for a long time. ## Summary of changes * Remove the ingest handling of the vanilla protocol * Remove tenant and pageserver configuration for it * Update all tests that tweaked the ingest protocol ## Compatibility Backward compatibility: * The new pageserver version can read the existing pageserver configuration and it will ignore the unknown field. * When the tenant config is read from the storcon db or from the pageserver disk, the extra field will be ignored. Forward compatibility: * Both the pageserver config and the tenant config map missing fields to their default value. I'm not aware of any tenant level override that was made for this knob.	2025-06-06 22:23:01 +08:00
Konstantin Knizhnik	bf1e33b062	Replica promote (#12090 ) ## Problem This PR is part of larger computes support activity: https://www.notion.so/neondatabase/Larger-computes-114f189e00478080ba01e8651ab7da90 Epic: https://github.com/neondatabase/cloud/issues/19010 In case of planned node restart, we are going to 1. create new read-only replica 2. capture LFC state at primary 3. use this state to prewarm replica 4. stop old primary 5. promote replica to primary Steps 1-3 are currently implemented and supported on the compute side. This PR provides a compute-level implementation of replica promotion. ## Summary of changes Right now replica promotion is done in three steps: 1. Set safekeepers list (now it is empty for replica) 2. Call `pg_promote()` to promote replica 3. Update endpoint setting so that it is no longer treated as a replica. Maybe all three steps should be done by some function in compute_ctl. But right now this logic is only implemented in a test. Postgres submodules PRs: https://github.com/neondatabase/postgres/pull/648 https://github.com/neondatabase/postgres/pull/649 https://github.com/neondatabase/postgres/pull/650 https://github.com/neondatabase/postgres/pull/651 --------- Co-authored-by: Matthias van de Meent <matthias@neon.tech> Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-06-06 22:23:01 +08:00
Konstantin Knizhnik	62bbd2f723	Add query execution time histogram (#10050 ) ## Problem It will be useful to understand what kind of queries our clients are executing. One of the most important characteristics of a query is its execution time: at the very least, it allows us to distinguish OLAP and OLTP queries. Monitoring query execution time can also help detect performance problems (assuming that the workload is more or less stable). ## Summary of changes Add query execution time histogram. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-06-06 22:23:01 +08:00
Folke Behrens	3ca47bc37b	proxy: Move PGLB-related modules into pglb root module. (#12144 ) Split the modules responsible for passing data and connecting to compute from auth and waking for PGLB. This PR just moves files. The waking is going to get removed from pglb after this.	2025-06-06 22:23:01 +08:00
Alex Chi Z.	852c210d69	feat(pageserver): report tenant properties to posthog (#12113 ) ## Problem Part of https://github.com/neondatabase/neon/issues/11813 In PostHog UI, we need to create the properties before using them as a filter. We report all variants automatically when we start the pageserver. In the future, we can report all real tenants instead of fake tenants (we do that now to save money + we don't need real tenants in the UI). ## Summary of changes * Collect `region`, `availability_zone`, `pageserver_id` properties and use them in the feature evaluation. * Report 10 fake tenants on each pageserver startup. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-06-06 22:23:01 +08:00
Conrad Ludgate	3c6a1f6a81	update proxy protocol parsing to not a rw wrapper (#12035 ) ## Problem I believe in all environments we now specify either required/rejected for proxy-protocol V2 as required. We no longer rely on the supported flow. This means we no longer need to keep around read bytes in case they're not in a header. While I designed ChainRW to be fast (the hot path with an empty buffer is very easy to branch predict), it's still unnecessary. ## Summary of changes * Remove the ChainRW wrapper * Refactor how we read the proxy-protocol header using read_exact. Slightly worse perf but it's hardly significant. * Don't try and parse the header if it's rejected.	2025-06-06 22:23:01 +08:00
Konstantin Knizhnik	d49d781689	Update online_advisor (#12045 ) ## Problem Investigate crash of online_advisor in image check ## Summary of changes --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-06-06 22:23:01 +08:00
Arpad Müller	1e6278f1f3	pgxn: support generations in safekeepers_cmp (#12129 ) `safekeepers_cmp` was added by #8840 to make changes of the safekeeper set order independent: a `sk1,sk2,sk3` specifier changed to `sk3,sk1,sk2` should not cause a walproposer restart. However, this check didn't support generations, in the sense that it would see the `g#123:` as part of the first safekeeper in the list, and if the first safekeeper changes, it would also restart the walproposer. Therefore, parse the generation properly and make it not be a part of the generation. This PR doesn't add a specific test, but I have confirmed locally that `test_safekeepers_reconfigure_reorder` is fixed with the changes of PR #11712 applied thanks to this PR. Part of https://github.com/neondatabase/neon/issues/11670	2025-06-06 22:23:01 +08:00
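The order-independent comparison with a generation prefix can be sketched as below. This is an illustration in Rust, not the actual `safekeepers_cmp` from pgxn; the function names are hypothetical, and treating a changed generation as a difference is an assumption of this sketch.

```rust
/// Split an optional `g#<generation>:` prefix from the safekeeper list and
/// return the generation plus the sorted safekeeper entries.
fn parse_safekeepers(spec: &str) -> (Option<u32>, Vec<&str>) {
    let (generation, list) = match spec
        .strip_prefix("g#")
        .and_then(|rest| rest.split_once(':'))
    {
        // The generation is parsed separately so that it is never glued
        // onto the first safekeeper entry.
        Some((generation_str, list)) => (generation_str.parse::<u32>().ok(), list),
        None => (None, spec),
    };
    let mut sks: Vec<&str> = list
        .split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .collect();
    // Sorting makes the comparison independent of the listed order.
    sks.sort_unstable();
    (generation, sks)
}

/// True if both specifiers name the same generation and safekeeper set.
fn safekeepers_equal(a: &str, b: &str) -> bool {
    parse_safekeepers(a) == parse_safekeepers(b)
}
```

Under this sketch, `g#123:sk1,sk2,sk3` and `g#123:sk3,sk1,sk2` compare equal, so a pure reorder would not trigger a walproposer restart.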
Conrad Ludgate	cca40c62b7	compute-ctl: add spec for enable_tls, separate from compute-ctl config (#12109 ) ## Problem In between adding the TLS config for compute-ctl, and adding the TLS config in controlplane, we switched from using a provision flag to a bind flag. This happened to work in all of my testing in preview regions as they have no VM pool, so each bind was also a provision. However, in staging I found that the TLS config is still only processed during provision, even though it's only sent on bind. ## Summary of changes * Add a new feature flag value, `tls_experimental`, which tells postgres/pgbouncer/local_proxy to use the TLS certificates on bind. * compute_ctl on provision will be told where the certificates are, instead of being told on bind.	2025-06-06 22:23:00 +08:00
Suhas Thalanki	aa84913318	compute: Add manifest.yml for default Postgres configuration settings (#11820 ) Adds a `manifest.yml` file that contains the default settings for compute. Currently, it comes from cplane code [here](`0cda3d4b01/goapp/controlplane/internal/pkg/compute/computespec/pg_settings.go (L110)`). Related RFC: https://github.com/neondatabase/neon/blob/main/docs/rfcs/038-independent-compute-release.md Related Issue: https://github.com/neondatabase/cloud/issues/11698	2025-06-06 22:23:00 +08:00
Tristan Partin	864910a8c5	Use Url::join() when creating the final remote extension URL (#12121 ) Url::to_string() adds a trailing slash on the base URL, so when we did the format!(), we were adding a double forward slash. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-06-06 22:23:00 +08:00
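The double-slash problem can be reproduced with plain strings. The sketch below deliberately avoids the `url` crate and is not the actual fix (which uses `Url::join()`); `naive_join` and `join_path` are hypothetical helpers that only illustrate why formatting a base that already ends in `/` goes wrong.

```rust
/// Mirrors the buggy pattern: the base already ends with '/', and the
/// format string adds another one.
fn naive_join(base: &str, path: &str) -> String {
    format!("{base}/{path}")
}

/// Joins with exactly one separator regardless of trailing or leading slashes.
fn join_path(base: &str, path: &str) -> String {
    format!(
        "{}/{}",
        base.trim_end_matches('/'),
        path.trim_start_matches('/')
    )
}
```

For a base of `http://host/` and a path of `ext.tar`, `naive_join` yields `http://host//ext.tar`, while `join_path` yields `http://host/ext.tar`.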
Erik Grinaker	c51514d427	pageserver: support `get_vectored_concurrent_io` with gRPC (#12131 ) ## Problem The gRPC page service doesn't respect `get_vectored_concurrent_io` and always uses sequential IO. ## Summary of changes Spawn a sidecar task for concurrent IO when enabled. Cancellation will be addressed separately.	2025-06-06 22:23:00 +08:00
a-masterov	18c40ceae9	Fix codestyle for compute.sh for docker-compose (#12128 ) ## Problem The script `compute.sh` had a non-consistent coding style and didn't follow best practices for modern bash scripts ## Summary of changes The coding style was fixed to follow best practices.	2025-06-06 22:23:00 +08:00
Vlad Lazar	48dd8e2008	pageserver: make import job max byte range size configurable (#12117 ) ## Problem We want to repro an OOM situation, but large partial reads are required. ## Summary of Changes Make the max partial read size configurable for import jobs.	2025-06-06 22:23:00 +08:00
github-actions[bot]	00ba9ca40f	Storage release 2025-05-30 17:04 UTC	2025-05-30 17:04:39 +00:00
github-actions[bot]	26923935d3	Storage release 2025-05-23 06:10 UTC	2025-05-23 06:11:00 +00:00
github-actions[bot]	8f98b823c7	Storage release 2025-05-16 06:11 UTC	2025-05-16 06:11:05 +00:00
Erik Grinaker	1821b6ea3f	Storage release 2025-05-09 12:25 UTC	2025-05-09 14:25:07 +02:00
Erik Grinaker	be3686e2af	Storage release 2025-05-06 15:12 UTC	2025-05-06 17:12:17 +02:00
Erik Grinaker	1c3cb18c60	storcon: fix split aborts removing other tenants (#11837 ) ## Problem When aborting a split, the code accidentally removes all other tenant shards from the in-memory map that have the same shard count as the aborted split, causing "tenant not found" errors. It will recover on a storcon restart, when it loads the persisted state. This issue has been present for at least a year. Resolves https://github.com/neondatabase/cloud/issues/28589. ## Summary of changes Only remove shards belonging to the relevant tenant when aborting a split. Also adds a regression test.	2025-05-06 17:12:17 +02:00
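The difference between the buggy and the fixed filter can be sketched with a plain map. The types and `abort_split` below are hypothetical simplifications, not the storage controller's data structures, and the sketch only covers removing the child shards.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for a tenant shard identifier.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct TenantShardId {
    tenant_id: u32,
    shard_number: u8,
    shard_count: u8,
}

/// Remove the child shards of an aborted split from the in-memory map.
fn abort_split(
    shards: &mut BTreeMap<TenantShardId, String>,
    tenant_id: u32,
    child_count: u8,
) {
    // Buggy variant: `retain(|id, _| id.shard_count != child_count)` also
    // drops shards of *other* tenants that happen to share the shard count.
    // Fixed variant: match on the tenant as well.
    shards.retain(|id, _| !(id.tenant_id == tenant_id && id.shard_count == child_count));
}
```

With two tenants that both have two shards, aborting a split for the first tenant leaves the second tenant's shards in the map.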
github-actions[bot]	946a0667eb	Storage release 2025-05-02 06:10 UTC	2025-05-02 06:10:54 +00:00
Vlad Lazar	e5d15450ec	Storage release 2025-04-28	2025-04-28 22:43:18 +02:00
Alex Chi Z.	d294078d1f	fix(pageserver): consider tombstones in replorigin (#11752 ) ## Problem We didn't consider tombstones in replorigin read path in the past. This was fine because tombstones are stored as LSN::Invalid before we universally define what the tombstone is for sparse keyspaces. Now we remove non-inherited keys during detach ancestor and write the universal tombstone "empty image". So we need to consider it across all the read paths. related: https://github.com/neondatabase/neon/pull/11299 ## Summary of changes Empty value gets ignored for replorigin scans. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-04-28 21:20:11 +01:00
github-actions[bot]	eb70d7a55c	Storage release 2025-04-25	2025-04-25 11:04:06 +00:00
github-actions[bot]	20723ea039	Storage release 2025-04-21	2025-04-21 19:39:21 +00:00
Vlad Lazar	db95540975	pageserver: handle empty get vectored queries (#11652 ) ## Problem If all batched requests are excluded from the query by `Timeine::get_rel_page_at_lsn_batched` (e.g. because they are past the end of the relation), the read path would panic since it doesn't expect empty queries. This is a change in behaviour that was introduced with the scattered query implementation. ## Summary of Changes Handle empty queries explicitly.	2025-04-21 15:38:44 -04:00
JC Grünhage	90033fe693	fix(ci): set token for fast-forward failure comments and allow merging with state unstable (#11647 ) ## Problem https://github.com/neondatabase/neon/actions/runs/14538136318/job/40790985693?pr=11645 failed, even though the relevant parts of the CI had passed and auto-merge determined the PR is ready to merge. After that, commenting failed. ## Summary of changes - set GH_TOKEN for commenting after fast-forward failure - allow merging with mergeable_state unstable	2025-04-21 15:38:44 -04:00
JC Grünhage	cb9d439cc1	fix(ci): make regex to find rc branches less strict (#11646 ) ## Problem https://github.com/neondatabase/neon/actions/runs/14537161022/job/40787763965 failed to find the correct RC PR run, preventing artifact re-use. This broke in https://github.com/neondatabase/neon/pull/11547. There's a hotfix release containing this in https://github.com/neondatabase/neon/pull/11645. ## Summary of changes Make the regex for finding the RC PR run less strict, it was needlessly precise.	2025-04-21 15:38:44 -04:00
Jan Christian Grünhage	7a2f0c4d53	Storage release 2025-04-18	2025-04-18 18:51:56 +02:00
Jan Christian Grünhage	60c907cbdb	fix(ci): allow merging with mergeable_state unstable	2025-04-18 18:51:27 +02:00
Jan Christian Grünhage	2eb43a5d81	fix(ci): set GH_TOKEN for commenting after fast-forward failure	2025-04-18 18:51:21 +02:00
Jan Christian Grünhage	b2f45fe37f	fix(ci): make regex to find rc branches less strict	2025-04-18 17:15:01 +02:00
github-actions[bot]	3904a0fe4f	Storage release 2025-04-18	2025-04-18 06:02:29 +00:00
github-actions[bot]	cd07b120b5	Storage release 2025-04-11	2025-04-11 16:20:12 +00:00
github-actions[bot]	7c06a33df0	Storage release 2025-04-04	2025-04-04 15:40:58 +00:00
github-actions[bot]	85992000e3	Storage release 2025-03-28	2025-03-28 06:02:32 +00:00
github-actions[bot]	c33cf739e3	Storage release 2025-03-16	2025-03-16 17:05:15 +00:00
JC Grünhage	1812fc7cf1	Merge pull request #11251 from neondatabase/rc/release/2025-03-14--storcon-hotfix Storage controller hotfix 2025-03-14	2025-03-14 16:43:22 +01:00
Christian Schwarz	929d963a35	Merge branch 'hotfix/release/2025-03-14-storcon-optimizations' into rc/release/2025-03-14--storcon-hotfix	2025-03-14 15:06:25 +01:00
Christian Schwarz	e405420458	CHERRY PICK: fix(storcon): optimization validation makes decisions based on wrong SecondaryProgress (#11229 ) (cherry picked from commit `04370b48b3`) Conflicts: storage_controller/src/service.rs Because `release` head doesn't yet have `storcon: timetime table, creation and deletion (#11058)`	2025-03-14 14:47:29 +01:00
Alex Chi Z.	1b5258ef6a	Merge pull request #11161 from neondatabase/arpad/release-db-sk-loading release branch: re-apply "storcon db: load safekeepers from DB again"	2025-03-10 21:50:58 -04:00
Arpad Müller	9fbb33e9d9	storcon db: load safekeepers from DB again (#11087 ) Earlier PR #11041 soft-disabled the loading code for safekeepers from the storcon db. This PR makes us load the safekeepers from the database again, now that we have [JWTs available on staging](https://github.com/neondatabase/neon/pull/11087) and soon on prod. This reverts commit `23fb8053c5`. Part of https://github.com/neondatabase/cloud/issues/24727	2025-03-11 01:12:21 +01:00
Alex Chi Z.	93983ac5fc	Merge pull request #11128 from neondatabase/rc/release/2025-03-07 Storage release 2025-03-07	2025-03-07 14:02:55 -05:00
Alex Chi Z	72dd540c87	Merge branch 'release' of https://github.com/neondatabase/neon into rc/release/2025-03-07	2025-03-07 13:04:10 -05:00
Alex Chi Z.	612d0aea4f	feat(pageserver): add force patch index_part API (#11119 ) ## Problem As part of the disaster recovery tool. Partly for https://github.com/neondatabase/neon/issues/9114. ## Summary of changes * Add a new pageserver API to force patch the fields in index_part and modify the timeline internal structures. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-03-07 12:54:44 -05:00
Vlad Lazar	aa3a75a0a7	pageserver: enable previous heatmaps by default (#11132 ) We add the off by default configs in https://github.com/neondatabase/neon/pull/11088 because the unarchival heatmap was causing oversized secondary locations. That was fixed in https://github.com/neondatabase/neon/pull/11098, so let's turn them on by default.	2025-03-07 11:14:38 -05:00
Arpad Müller	6389c9184c	update ring to 0.17.13 (#11131 ) Update ring from 0.17.6 to 0.17.13. Addresses the advisory: https://rustsec.org/advisories/RUSTSEC-2025-0009	2025-03-07 11:14:26 -05:00
Vlad Lazar	e177927476	safekeeper: don't skip empty records for shard zero (#11137 ) ## Problem Shard zero needs to track the start LSN of the latest record in addition to the LSN of the next record to ingest. The former is included in basebackup persisted by the compute in WAL. Previously, empty records were skipped for all shards. This caused the prev LSN tracking on the PS to fall behind and led to logical replication issues. ## Summary of changes Shard zero now receives empty interpreted records for LSN tracking purposes. A test is included too.	2025-03-07 11:04:30 -05:00
github-actions[bot]	e9bbafebbd	Storage release 2025-03-07	2025-03-07 06:02:10 +00:00
Vlad Lazar	7430fb9836	Merge pull request #11090 from neondatabase/vlad/release-gate-previous-heatmap Storage release 2025-03-05	2025-03-05 14:37:24 +00:00
Vlad Lazar	c45d169527	pageserver: gate previous heatmap behind config flag (#11088 ) ## Problem On unarchival, we update the previous heatmap with all visible layers. When the primary generates a new heatmap it includes all those layers, so the secondary will download them. Since they're not actually resident on the primary (we didn't call the warm up API), they'll never be evicted, so they remain in the heatmap. This leads to oversized secondary locations like we saw in pre-prod. ## Summary of changes Gate the loading of the previous heatmaps and the heatmap generation on unarchival behind configuration flags. They are disabled by default, but enabled in tests.	2025-03-05 13:29:46 +01:00
Vlad Lazar	a1e67cfe86	Merge pull request #11051 from neondatabase/vlad/release-8005-and-cherry-picks Storage release 2025-02-28	2025-02-28 19:40:33 +00:00
Erik Grinaker	0263c92c47	pageserver: fix race that can wedge background tasks (#11047 ) ## Problem `wait_for_active_tenant()`, used when starting background tasks, has a race condition that can cause it to wait forever (until cancelled). It first checks the current tenant state, and then subscribes for state updates, but if the state changes between these then it won't be notified about it. We've seen this wedge compaction tasks, which can cause unbounded layer file buildup and read amplification. ## Summary of changes Use `watch::Receiver::wait_for()` to check both the current and new tenant states.	2025-02-28 18:11:10 +01:00
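The race-free pattern (check the current state and wait for changes as one operation) can be illustrated with the standard library. The sketch below is an analog built on `Mutex` and `Condvar::wait_while`, not the tokio `watch` channel used in the pageserver; `StateCell`, `TenantState`, and `demo` are hypothetical names.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TenantState {
    Attaching,
    Active,
}

struct StateCell {
    state: Mutex<TenantState>,
    changed: Condvar,
}

impl StateCell {
    fn set(&self, new_state: TenantState) {
        *self.state.lock().unwrap() = new_state;
        self.changed.notify_all();
    }

    /// Checks the current state first and only then waits for updates, all
    /// under the same lock, so an update cannot slip in between the check
    /// and the wait.
    fn wait_for_active(&self) -> TenantState {
        let guard = self.state.lock().unwrap();
        let guard = self
            .changed
            .wait_while(guard, |s| *s != TenantState::Active)
            .unwrap();
        *guard
    }
}

/// Returns once the tenant is active, whichever thread runs first.
fn demo() -> TenantState {
    let cell = Arc::new(StateCell {
        state: Mutex::new(TenantState::Attaching),
        changed: Condvar::new(),
    });
    let setter = {
        let cell = Arc::clone(&cell);
        thread::spawn(move || cell.set(TenantState::Active))
    };
    let state = cell.wait_for_active();
    setter.join().unwrap();
    state
}
```

If the setter runs before the waiter, the predicate already holds and the waiter returns immediately; if it runs after, the waiter is woken by the notification.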
Vlad Lazar	5fc599d653	storcon: soft disable SK heartbeats (#11041 ) ## Problem JWT tokens aren't in place, so all SK heartbeats fail. This is equivalent to a wait before applying the PS heartbeats and makes things more flaky. ## Summary of Changes Add a flag that skips loading SKs from the db on start-up and at runtime.	2025-02-28 18:11:10 +01:00
JC Grünhage	66d2592d04	Merge pull request #11032 from neondatabase/rc/release/2025-02-28 Storage release 2025-02-28	2025-02-28 11:12:07 +01:00
Jan Christian Grünhage	517ae7a60e	Merge remote-tracking branch 'origin/release' into rc/release/2025-02-28	2025-02-28 10:10:19 +01:00
github-actions[bot]	b2a769cc86	Storage release 2025-02-28	2025-02-28 06:02:09 +00:00
Erik Grinaker	12f0e525c6	Merge pull request #10961 from neondatabase/erik/release-7930-slow-getpage pageserver: tweak slow GetPage logging (#10956)	2025-02-24 23:05:10 +01:00
Erik Grinaker	6c6e5bfc2b	pageserver: tweak slow GetPage logging	2025-02-24 21:27:38 +01:00
Erik Grinaker	9c0fefee25	Merge pull request #10921 from neondatabase/rc/release/2025-02-21 Storage release 2025-02-21	2025-02-21 18:55:58 +01:00
Vlad Lazar	97e2e27f68	storcon: use `Duration` for durations in the storage controller tenant config (#10928 ) ## Problem The storage controller treats durations in the tenant config as strings. These are loaded from the db. The pageserver maps these durations to a seconds only format and we always get a mismatch compared to what's in the db. ## Summary of changes Treat durations as durations inside the storage controller and not as strings. Nothing changes in the cross service API's themselves or the way things are stored in the db. I also added some logging which would have made the investigation a 10min job: 1. Reason for why the reconciliation was spawned 2. Location config diff between the observed and wanted states	2025-02-21 17:14:43 +01:00
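The mismatch can be illustrated with two spellings of the same duration. The tiny parser below is a hypothetical stand-in that handles only `s`, `m`, and `h` suffixes; the formats actually stored by the storage controller and pageserver are not shown in the commit message.

```rust
use std::time::Duration;

/// Parse strings such as "600s", "10m" or "1h" into a `Duration`.
fn parse_duration(s: &str) -> Option<Duration> {
    let s = s.trim();
    // Position of the first non-digit character, i.e. where the unit starts.
    let split = s.find(|c: char| !c.is_ascii_digit())?;
    let (num, unit) = s.split_at(split);
    let n: u64 = num.parse().ok()?;
    let secs = match unit {
        "s" => n,
        "m" => n.checked_mul(60)?,
        "h" => n.checked_mul(3600)?,
        _ => return None,
    };
    Some(Duration::from_secs(secs))
}

/// Compare two config values as durations rather than as raw strings.
fn same_duration(a: &str, b: &str) -> bool {
    match (parse_duration(a), parse_duration(b)) {
        (Some(x), Some(y)) => x == y,
        _ => false,
    }
}
```

As strings, `"10m"` and `"600s"` differ, which would look like a config mismatch; compared as durations they are equal.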
Erik Grinaker	cea2b222d0	Merge branch 'release' into rc/release/2025-02-21	2025-02-21 12:38:14 +01:00
github-actions[bot]	5b2afd953c	Storage release 2025-02-21	2025-02-21 06:02:10 +00:00
Alex Chi Z.	74e789b155	Merge pull request #10878 from neondatabase/rc/release/2025-02-18	2025-02-18 23:04:21 -05:00
Alex Chi Z.	235439e639	fix(pageserver): make repartition error critical (#10872 ) ## Problem Read errors during repartition should be a critical error. ## Summary of changes <del>We only have one call site</del> We have two call sites of `repartition` where one of them is during the initial image upload optimization and another is during image layer creation, so I added a `critical!` here instead of inside `collect_keyspace`. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-02-18 15:29:19 -05:00
Alex Chi Z.	a78438f15c	Revert "feat(pageserver): repartition on L0-L1 boundary (#10548 )" (#10870 ) This reverts commit `443c8d0b4b`. ## Problem We observe a massive amount of compaction errors. ## Summary of changes If the tenant did not write any L1 layers (i.e., they accumulate L0 layers where number of them is below L0 threshold), image creation will always fail. Therefore, it's not correct to simply use the disk_consistent_lsn or L0/L1 boundary for the image creation.	2025-02-18 13:39:01 -05:00
Arseny Sher	3d370679a1	Merge pull request #10855 from neondatabase/rel-02-14-39d42d846ae38 Cherry-pick #10845 into release	2025-02-17 18:46:22 +03:00
John Spray	c37e3020ab	pageserver_api: fix decoding old-version TimelineInfo (#10845 ) ## Problem In #10707 some new fields were introduced in TimelineInfo. I forgot that we do not only use TimelineInfo for encoding, but also decoding when the storage controller calls into a pageserver, so this broke some calls from controller to pageserver while in a mixed-version state. ## Summary of changes - Make new fields have default behavior so that they are optional	2025-02-17 18:43:14 +03:00
Arseny Sher	a036708da1	Merge pull request #10820 from neondatabase/rc/release/2025-02-14 Storage release 2025-02-14	2025-02-14 19:36:36 +03:00
John Spray	1e7ad80ee7	storage controller: prioritize reconciles for user-facing operations (#10822 ) ## Problem Some situations may produce a large number of pending reconciles. If we experience an issue where reconciles are processed more slowly than expected, that can prevent us responding promptly to user requests like tenant/timeline CRUD. This is a cleaner implementation of the hotfix in https://github.com/neondatabase/neon/pull/10815 ## Summary of changes - Introduce a second semaphore for high priority tasks, with configurable units (default 256). The intent is that in practical situations these user-facing requests should never have to wait. - Use the high priority semaphore for: tenant/timeline CRUD, and shard splitting operations. Use normal priority for everything else.	2025-02-14 17:34:51 +03:00
John Spray	581be23100	storcon: fix eliding parameters from proxied URL labels (#10817 ) ## Problem We had code for stripping IDs out of proxied paths to reduce cardinality of metrics, but it was only stripping out tenant IDs, and leaving in timeline IDs and query parameters (e.g. LSN in lsn->timestamp lookups). ## Summary of changes - Use a more general regex approach. There is still some risk that a future pageserver API might include a parameter in `/the/path/`, but we control that API and it is not often extended. We will also alert on metrics cardinality in staging so that if we made that mistake we would notice.	2025-02-14 17:34:25 +03:00
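A low-cardinality path label can be sketched without a regex. The commit describes a regex-based approach; the version below swaps in plain string handling to stay dependency-free, and both `path_label` and the "32 or more hex characters" heuristic are assumptions for illustration.

```rust
/// Heuristic: long runs of hex digits (optionally with dashes) are IDs.
fn looks_like_id(segment: &str) -> bool {
    segment.len() >= 32
        && segment
            .chars()
            .all(|c| c.is_ascii_hexdigit() || c == '-')
}

/// Drop the query string and replace ID-like path segments with a placeholder.
fn path_label(path_and_query: &str) -> String {
    let path = path_and_query.split('?').next().unwrap_or("");
    path.split('/')
        .map(|seg| if looks_like_id(seg) { ":id" } else { seg })
        .collect::<Vec<_>>()
        .join("/")
}
```

Both the tenant ID and the timeline ID collapse to the same placeholder, and query parameters such as an LSN never reach the metric label.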
github-actions[bot]	8ca7ea859d	Storage release 2025-02-14	2025-02-14 06:02:05 +00:00
Christian Schwarz	efea8223bb	Merge pull request #10774 from neondatabase/releases/2025-02-11-smgr-op-latency-metrics-hotfix	2025-02-11 21:16:44 +01:00
Christian Schwarz	d3d3bfc6d0	fix(page_service / batching): smgr op latency metric of dropped responses include flush time (#10756 ) # Problem Say we have a batch of 10 responses to send out. Then, even with #10728, we've still only called observe_execution_end_flush_start for the first 3 responses. The remaining 7 response timers are still ticking. When compute now closes the connection, the waiting flush fails with an error and we `drop()` the remaining 7 responses' smgr op timers. The `impl Drop for SmgrOpTimer` will observe an execution time that includes the flush time. In practice, this is suspected to produce the `+Inf` observations in the smgr op latency histogram we've seen since the introduction of pipelining, even after shipping #10728. refs: - fixup of https://github.com/neondatabase/neon/pull/10042 - fixup of https://github.com/neondatabase/neon/pull/10728 - fixes https://github.com/neondatabase/neon/issues/10754	2025-02-11 20:25:13 +01:00
Christian Schwarz	b3a911ff8c	fix(page_service / batching): smgr op latency metrics includes the flush time of preceding requests (#10728 ) Before this PR, if a batch contains N responses, the smgr op latency reported for response (N-i) would include the time we spent flushing the preceding requests. refs: - fixup of https://github.com/neondatabase/neon/pull/10042 - fixes https://github.com/neondatabase/neon/issues/10674	2025-02-11 20:25:08 +01:00
John Spray	a54853abd5	Merge pull request #10712 from neondatabase/rc/release/2025-02-07 Storage release 2025-02-07	2025-02-07 18:21:13 +00:00
Arpad Müller	69007f7ac8	Revert recent AWS SDK update (#10724 ) We've been seeing some regressions in staging since the AWS SDK updates: https://github.com/neondatabase/neon/issues/10695 . We aren't sure the regression was caused by the SDK update, but the issues do involve S3, so it's not unlikely. By reverting the SDK update we find out whether it was really the SDK update, or something else. Reverts the two PRs: * https://github.com/neondatabase/neon/pull/10588 * https://github.com/neondatabase/neon/pull/10699 https://neondb.slack.com/archives/C08C2G15M6U/p1738576986047179	2025-02-07 18:18:45 +00:00
github-actions[bot]	d255fa4b7e	Storage release 2025-02-07	2025-02-07 06:02:18 +00:00
Arpad Müller	40d6b3a34e	Merge pull request #10602 from neondatabase/rc/release/2025-01-31 Storage release 2025-01-31	2025-02-03 16:43:04 +01:00
github-actions[bot]	a018878e27	Storage release 2025-01-31	2025-01-31 06:02:08 +00:00
Christian Schwarz	e5b3eb1e64	Merge pull request #10500 from neondatabase/rc/release/2025-01-24 Storage release 2025-01-24	2025-01-25 00:54:56 +01:00
github-actions[bot]	f35e1356a1	Storage release 2025-01-24	2025-01-24 06:02:13 +00:00
Christian Schwarz	4dec0dddc6	Merge pull request #10447 from neondatabase/releases/2025-01-20-hotfix Release: storage hotfix 2025-01-20	2025-01-20 15:55:44 +01:00
Christian Schwarz	e0c504af38	fix(page_service / handle): panic when parallel client disconnect & Timeline shutdown Refs - fixes https://github.com/neondatabase/neon/issues/10444	2025-01-20 14:37:16 +01:00
Alex Chi Z.	3399eea2ed	Merge pull request #10436 from neondatabase/rc/release/2025-01-17 Storage release 2025-01-17	2025-01-17 12:36:17 -05:00
Alex Chi Z	6a29c809d5	Merge branch 'release' of https://github.com/neondatabase/neon into rc/release/2025-01-17	2025-01-17 10:44:25 -05:00
github-actions[bot]	a62c01df4c	Storage release 2025-01-17	2025-01-17 06:02:11 +00:00
Vlad Lazar	4c093c6314	Merge pull request #10338 from neondatabase/rc/release/2025-01-10 Storage release 2025-01-10	2025-01-10 19:21:43 +00:00
github-actions[bot]	32f58f8228	Storage release 2025-01-10	2025-01-10 06:02:00 +00:00
Erik Grinaker	96c36c0894	Merge pull request #10263 from neondatabase/rc/release/2025-01-03 Storage release 2025-01-03	2025-01-03 20:32:37 +01:00
Erik Grinaker	d719709316	Revert "pageserver: revert flush backpressure (#8550 ) (#10135 )" (#10270 ) This reverts commit `f3ecd5d76a`. It is [suspected](https://neondb.slack.com/archives/C033RQ5SPDH/p1735907405716759) to have caused significant read amplification in the [ingest benchmark](https://neonprod.grafana.net/d/de3mupf4g68e8e/perf-test3a-ingest-benchmark?orgId=1&from=now-30d&to=now&timezone=utc&var-new_project_endpoint_id=ep-solitary-sun-w22bmut6&var-large_tenant_endpoint_id=ep-holy-bread-w203krzs) (specifically during index creation). We will revisit an intermediate improvement here to unblock [upload parallelism](https://github.com/neondatabase/neon/issues/10096) before properly addressing [compaction backpressure](https://github.com/neondatabase/neon/issues/8390).	2025-01-03 16:51:16 +01:00
Erik Grinaker	97912f19fc	pageserver,safekeeper: disable heap profiling (#10268 ) ## Problem Since enabling continuous profiling in staging, we've seen frequent seg faults. This is suspected to be because jemalloc and pprof-rs take a stack trace at the same time, and the handlers aren't signal safe. jemalloc does this probabilistically on every allocation, regardless of whether someone is taking a heap profile, which means that any CPU profile has a chance to cause a seg fault. Touches #10225. ## Summary of changes For now, just disable heap profiles -- CPU profiles are more important, and we need to be able to take them without risking a crash.	2025-01-03 16:51:16 +01:00
github-actions[bot]	49724aa3b6	Storage release 2025-01-03	2025-01-03 06:02:03 +00:00
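
The failure mode described in d3d3bfc6d0 (#10756) can be illustrated with a small, self-contained sketch. It is not the pageserver code: `FakeClock`, `OpTimer`, `observe_execution_end`, `run_batch`, the batch size of 3 and the 5 ms / 1000 ms figures are all hypothetical, chosen only to make the behaviour deterministic. The sketch shows how a timer that records its value in `Drop` ends up including flush time when it was never frozen beforehand. The `freeze_all_before_flush` variant is one possible mitigation, assumed here for contrast, and is not claimed to be what the actual PR does.

```rust
use std::cell::{Cell, RefCell};
use std::rc::Rc;

/// Hypothetical manual clock (milliseconds) so the example is deterministic.
#[derive(Clone, Default)]
struct FakeClock(Rc<Cell<u64>>);

impl FakeClock {
    fn now(&self) -> u64 {
        self.0.get()
    }
    fn advance(&self, ms: u64) {
        self.0.set(self.0.get() + ms);
    }
}

/// Hypothetical timer, loosely modelled on the `SmgrOpTimer` named in the commit message.
struct OpTimer {
    clock: FakeClock,
    started_at: u64,
    execution_end: Option<u64>,
    observations: Rc<RefCell<Vec<u64>>>,
}

impl OpTimer {
    fn start(clock: &FakeClock, observations: &Rc<RefCell<Vec<u64>>>) -> Self {
        OpTimer {
            clock: clock.clone(),
            started_at: clock.now(),
            execution_end: None,
            observations: Rc::clone(observations),
        }
    }

    /// Freeze the execution time so that later flush time is not counted.
    fn observe_execution_end(&mut self) {
        if self.execution_end.is_none() {
            self.execution_end = Some(self.clock.now());
        }
    }
}

impl Drop for OpTimer {
    fn drop(&mut self) {
        // A timer that was never frozen falls back to "now", which
        // silently includes any time spent waiting on the flush.
        let end = self.execution_end.unwrap_or_else(|| self.clock.now());
        self.observations.borrow_mut().push(end - self.started_at);
    }
}

/// Simulate a batch of 3 responses whose flush stalls before the client disconnects.
fn run_batch(freeze_all_before_flush: bool) -> Vec<u64> {
    let clock = FakeClock::default();
    let observations = Rc::new(RefCell::new(Vec::new()));
    let mut timers: Vec<OpTimer> = (0..3)
        .map(|_| OpTimer::start(&clock, &observations))
        .collect();

    clock.advance(5); // executing the batch takes 5 ms

    if freeze_all_before_flush {
        for timer in timers.iter_mut() {
            timer.observe_execution_end();
        }
    } else {
        // Only the first response was frozen before the flush stalled.
        timers[0].observe_execution_end();
    }

    clock.advance(1000); // the flush stalls for 1000 ms, then fails
    drop(timers); // the remaining timers are dropped on the error path

    let result = observations.borrow().clone();
    result
}

fn main() {
    println!("{:?}", run_batch(false)); // prints [5, 1005, 1005]
    println!("{:?}", run_batch(true)); // prints [5, 5, 5]
}
```

In the first run, the two timers that were still ticking when the flush failed report 1005 ms for work that took 5 ms, which is the kind of inflated observation the commit message suspects behind the `+Inf` histogram buckets.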

Compare commits

116 Commits

release-co ... release

Diff Content Not Available
