rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-04 20:50:40 +00:00

Author	SHA1	Message	Date
John Spray	affe408433	storage scrubber: GC ancestor shard layers (#8196 ) ## Problem After a shard split, the pageserver leaves the ancestor shard's content in place. It may be referenced by child shards, but eventually child shards will de-reference most ancestor layers as they write their own data and do GC. We would like to eventually clean up those ancestor layers to reclaim space. ## Summary of changes - Extend the physical GC command with `--mode=full`, which includes cleaning up unreferenced ancestor shard layers - Add test `test_scrubber_physical_gc_ancestors` - Remove colored log output: in testing this is irritating ANSI code spam in logs, and in interactive use doesn't add much. - Refactor storage controller API client code out of storcon_client into a `storage_controller/client` crate - During physical GC of ancestors, call into the storage controller to check that the latest shards seen in S3 reflect the latest state of the tenant, and there is no shard split in progress.	2024-07-22 14:36:56 +02:00
Christian Schwarz	9b883e4651	pageserver: remove obsolete cached_metric_collection_interval (#8370 ) We're removing the usage of this long-meaningless config field in https://github.com/neondatabase/aws/pull/1599 Once that PR has been deployed to staging and prod, we can merge this PR.	2024-07-22 14:36:56 +02:00
Arpad Müller	ed7ee73cba	Enable zstd in tests (#8368 ) Successor of #8288 , just enable zstd in tests. Also adds a test that creates easily compressable data. Part of #5431 --------- Co-authored-by: John Spray <john@neon.tech> Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2024-07-22 14:36:56 +02:00
John Spray	27da0e9cf5	tests: increase test_pg_regress and test_isolation timeouts (#8418 ) ## Problem These tests time out ~1 in 50 runs when in debug mode. There is no indication of a real issue: they're just wrappers that have large numbers of individual tests contained within on pytest case. ## Summary of changes - Bump pg_regress timeout from 600 to 900s - Bump test_isolation timeout from 300s (default) to 600s In future it would be nice to break out these tests to run individual cases (or batches thereof) as separate tests, rather than this monolith.	2024-07-22 14:36:56 +02:00
John Spray	de9bf2af6c	tests: fix metrics check in test_s3_eviction (#8419 ) ## Problem This test would occasionally fail its metric check. This could happen in the rare case that the nodes had all been restarted before their most recent eviction. The metric check was added in https://github.com/neondatabase/neon/pull/8348 ## Summary of changes - Check metrics before each restart, accumulate into a bool that we assert on at the end of the test	2024-07-22 14:36:56 +02:00
Christian Schwarz	3d2c2ce139	NeonEnv.from_repo_dir: use storage_controller_db instead of `attachments.json` (#8382 ) When `NeonEnv.from_repo_dir` was introduced, storage controller stored its state exclusively `attachments.json`. Since then, it has moved to using Postgres, which stores its state in `storage_controller_db`. But `NeonEnv.from_repo_dir` wasn't adjusted to do this. This PR rectifies the situation. Context for this is failures in `test_pageserver_characterize_throughput_with_n_tenants` CF: https://neondb.slack.com/archives/C033RQ5SPDH/p1721035799502239?thread_ts=1720901332.293769&cid=C033RQ5SPDH Notably, `from_repo_dir` is also used by the backwards- and forwards-compatibility. Thus, the changes in this PR affect those tests as well. However, it turns out that the compatibility snapshot already contains the `storage_controller_db`. Thus, it should just work and in fact we can remove hacks like `fixup_storage_controller`. Follow-ups created as part of this work: * https://github.com/neondatabase/neon/issues/8399 * https://github.com/neondatabase/neon/issues/8400	2024-07-22 14:36:56 +02:00
Joonas Koivunen	ff174a88c0	test: allow requests to any pageserver get cancelled (#8413 ) Fix flakyness on `test_sharded_timeline_detach_ancestor` which does not reproduce on a fast enough runner by allowing cancelled request before completing on all pageservers. It was only allowed on half of the pageservers. Failure evidence: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8352/9972357040/index.html#suites/a1c2be32556270764423c495fad75d47/7cca3e3d94fe12f2	2024-07-22 14:36:56 +02:00
John Spray	b21e131d11	pageserver: exclude un-read layers from short residence statistic (#8396 ) ## Problem The `evictions_with_low_residence_duration` is used as an indicator of cache thrashing. However, there are situations where it is quite legitimate to only have a short residence during compaction, where a delta is downloaded, used to generate an image layer, and then discarded. This can lead to false positive alerts. ## Summary of changes - Only track low residence duration for layers that have been accessed at least once (compaction doesn't count as an access). This will give us a metric that indicates thrashing on layers that the _user_ is using, rather than those we're downloading for housekeeping purposes. Once we add "layer visibility" as an explicit property of layers, this can also be used as a cleaner condition (residence of non-visible layers should never be alertable)	2024-07-22 14:36:56 +02:00
Tristan Partin	ad5d784fb7	Hide import behind TYPE_CHECKING	2024-07-22 14:36:56 +02:00
Tristan Partin	85d47637ee	Run each migration in its own transaction Previously, every migration was run in the same transaction. This is preparatory work for fixing CVE-2024-4317.	2024-07-22 14:36:56 +02:00
Arpad Müller	9dc71f5a88	Avoid the storage controller in test_tenant_creation_fails (#8392 ) As described in #8385, the likely source for flakiness in test_tenant_creation_fails is the following sequence of events: 1. test instructs the storage controller to create the tenant 2. storage controller adds the tenant and persists it to the database. issues a creation request 3. the pageserver restarts with the failpoint disabled 4. storage controller's background reconciliation still wants to create the tenant 5. pageserver gets new request to create the tenant from background reconciliation This commit just avoids the storage controller entirely. It has its own set of issues, as the re-attach request will obviously not include the tenant, but it's still useful to test for non-existence of the tenant. The generation is also not optional any more during tenant attachment. If you omit it, the pageserver yields an error. We change the signature of `tenant_attach` to reflect that. Alternative to #8385 Fixes #8266	2024-07-22 14:36:56 +02:00
Joonas Koivunen	957f99cad5	feat(timeline_detach_ancestor): success idempotency (#8354 ) Right now timeline detach ancestor reports an error (409, "no ancestor") on a new attempt after successful completion. This makes it troublesome for storage controller retries. Fix it to respond with `200 OK` as if the operation had just completed quickly. Additionally, the returned timeline identifiers in the 200 OK response are now ordered so that responses between different nodes for error comparison are done by the storage controller added in #8353. Design-wise, this PR introduces a new strategy for accessing the latest uploaded IndexPart: `RemoteTimelineClient::initialized_upload_queue(&self) -> Result<UploadQueueAccessor<'_>, NotInitialized>`. It should be a more scalable way to query the latest uploaded `IndexPart` than to add a query method for each question directly on `RemoteTimelineClient`. GC blocking will need to be introduced to make the operation fully idempotent. However, it is idempotent for the cases demonstrated by tests. Cc: #6994	2024-07-22 14:36:56 +02:00
John Spray	2a3a136474	pageserver: use PITR GC cutoffs as authoritative (#8365 ) ## Problem Pageserver GC uses a size-based condition (GC "horizon" in addition to time-based "PITR"). Eventually we plan to retire the size-based condition: https://github.com/neondatabase/neon/issues/6374 Currently, we always apply the more conservative of the two, meaning that tenants always retain at least 64MB of history (default horizon), even after a very long time has passed. This is particularly acute in cases where someone has dropped tables/databases, and then leaves a database idle: the horizon can prevent GCing very large quantities of historical data (we already account for this in synthetic size by ignoring gc horizon). We're not entirely removing GC horizon right now because we don't want to 100% rely on standby_horizon for robustness of physical replication, but we can tweak our logic to avoid retaining that 64MB LSN length indefinitely. ## Summary of changes - Rework `Timeline::find_gc_cutoffs`, with new logic: - If there is no PITR set, then use `DEFAULT_PITR_INTERVAL` (1 week) to calculate a time threshold. Retain either the horizon or up to that thresholds, whichever requires less data. - When there is a PITR set, and we have unambiguously resolved the timestamp to an LSN, then ignore the GC horizon entirely. For typical PITRs (1 day, 1 week), this will still easily retain enough data to avoid stressing read only replicas. The key property we end up with, whether a PITR is set or not, is that after enough time has passed, our GC cutoff on an idle timeline will catch up with the last_record_lsn. Using `DEFAULT_PITR_INTERVAL` is a bit of an arbitrary hack, but this feels like it isn't really worth the noise of exposing in TenantConfig. We could just make it a different named constant though. The end-end state will be that there is no gc_horizon at all, and that tenants with pitr_interval=0 would truly retain no history, so this constant would go away.	2024-07-22 14:36:56 +02:00
Joonas Koivunen	cfaf30f5e8	feat(storcon): timeline detach ancestor passthrough (#8353 ) Currently storage controller does not support forwarding timeline detach ancestor requests to pageservers. Add support for forwarding `PUT .../:tenant_id/timelines/:timeline_id/detach_ancestor`. Implement the support mostly as is, because the timeline detach ancestor will be made (mostly) idempotent in future PR. Cc: #6994	2024-07-22 14:36:56 +02:00
Christian Schwarz	72c2d0812e	remove page_service `show <tenant_id>` (#8372 ) This operation isn't used in practice, so let's remove it. Context: in https://github.com/neondatabase/neon/pull/8339	2024-07-22 14:36:56 +02:00
Arseny Sher	537ecf45f8	Fix test_timeline_copy flakiness. fixes https://github.com/neondatabase/neon/issues/8355	2024-07-22 14:31:12 +02:00
Konstantin Knizhnik	7973c3e941	Add neon.running_xacts_overflow_policy to make it possible for RO replica to startup without primary even in case running xacts overflow (#8323 ) ## Problem Right now if there are too many running xacts to be restored from CLOG at replica startup, then replica is not trying to restore them and wait for non-overflown running-xacs WAL record from primary. But if primary is not active, then replica will not start at all. Too many running xacts can be caused by transactions with large number of subtractions. But right now it can be also cause by two reasons: - Lack of shutdown checkpoint which updates `oldestRunningXid` (because of immediate shutdown) - nextXid alignment on 1024 boundary (which cause loosing ~1k XIDs on each restart) Both problems are somehow addressed now. But we have existed customers with "sparse" CLOG and lack of checkpoints. To be able to start RO replicas for such customers I suggest to add GUC which allows replica to start even in case of subxacts overflow. ## Summary of changes Add `neon.running_xacts_overflow_policy` with the following values: - ignore: restore from CLOG last N XIDs and accept connections - skip: do not restore any XIDs from CXLOGbut still accept connections - wait: wait non-overflown running xacts record from primary node ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-07-15 09:34:35 -04:00
Vlad Lazar	085bbaf5f8	tests: allow list breaching min resident size in statvfs test (#8358 ) ## Problem This test would sometimes violate the min resident size during disk eviction and fail due to the generate warning log. Disk usage candidate collection only takes into account active tenants. However, the statvfs call takes into account the entire tenants directory, which includes tenants which haven't become active yet. After re-starting the pageserver, disk usage eviction may kick in before both tenants have become active. Hence, the logic will try to satisfy thedisk usage requirements by evicting everything belonging to the active tenant, and hence violating the tenant minimum resident size. ## Summary of changes Allow the warning	2024-07-15 09:28:35 -04:00
John Spray	3f8819827c	pageserver: circuit breaker on compaction (#8359 ) ## Problem We already back off on compaction retries, but the impact of a failing compaction can be so great that backing off up to 300s isn't enough. The impact is consuming a lot of I/O+CPU in the case of image layer generation for large tenants, and potentially also leaking disk space. Compaction failures are extremely rare and almost always indicate a bug, frequently a bug that will not let compaction to proceed until it is fixed. Related: https://github.com/neondatabase/neon/issues/6738 ## Summary of changes - Introduce a CircuitBreaker type - Add a circuit breaker for compaction, with a policy that after 5 failures, compaction will not be attempted again for 24 hours. - Add metrics that we can alert on: any >0 value for `pageserver_circuit_breaker_broken_total` should generate an alert. - Add a test that checks this works as intended. Couple notes to reviewers: - Circuit breakers are intrinsically a defense-in-depth measure: this is not the solution to any underlying issues, it is just a general mitigation for "unknown unknowns" that might be encountered in future. - This PR isn't primarily about writing a perfect CircuitBreaker type: the one in this PR is meant to be just enough to mitigate issues in compaction, and make it easy to monitor/alert on these failures. We can refine this type in future as/when we want to use it elsewhere.	2024-07-15 09:28:35 -04:00
Sasha Krassovsky	119ddf6ccf	Grant execute on snapshot functions to neon_superuser (#8346 ) ## Problem I need `neon_superuser` to be allowed to create snapshots for replication tests ## Summary of changes Adds a migration that grants these functions to neon_superuser	2024-07-15 09:28:35 -04:00
Joonas Koivunen	90f447b79d	test: limit `test_layer_download_timeouted` to MOCK_S3 (#8331 ) Requests against REAL_S3 on CI can consistently take longer than 1s; testing the short timeouts against it made no sense in hindsight, as MOCK_S3 works just as well. evidence: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8229/9857994025/index.html#suites/b97efae3a617afb71cb8142f5afa5224/6828a50921660a32	2024-07-15 09:28:35 -04:00
Arseny Sher	8532d72276	Fix memory context of NeonWALReader allocation. Allocating it in short living context is wrong because it is reused during backend lifetime.	2024-07-15 09:28:35 -04:00
John Spray	d3ff47f572	storage controller: add node deletion API (#8226 ) ## Problem In anticipation of later adding a really nice drain+delete API, I initially only added an intentionally basic `/drop` API that is just about usable for deleting nodes in a pinch, but requires some ugly storage controller restarts to persuade it to restart secondaries. ## Summary of changes I started making a few tiny fixes, and ended up writing the delete API... - Quality of life nit: ordering of node + tenant listings in storcon_cli - Papercut: Fix the attach_hook using the wrong operation type for reporting slow locks - Make Service::spawn tolerate `generation_pageserver` columns that point to nonexistent node IDs. I started out thinking of this as a general resilience thing, but when implementing the delete API I realized it was actually a legitimate end state after the delete API is called (as that API doesn't wait for all reconciles to succeed). - Add a `DELETE` API for nodes, which does not gracefully drain, but does reschedule everything. This becomes safe to use when the system is in any state, but will incur availability gaps for any tenants that weren't already live-migrated away. If tenants have already been drained, this becomes a totally clean + safe way to decom a node. - Add a test and a storcon_cli wrapper for it This is meant to be a robust initial API that lets us remove nodes without doing ugly things like restarting the storage controller -- it's not quite a totally graceful node-draining routine yet. There's more work in https://github.com/neondatabase/neon/issues/8333 to get to our end-end state.	2024-07-15 09:28:35 -04:00
John Spray	8cc768254f	safekeeper: eviction metrics (#8348 ) ## Problem Follow up to https://github.com/neondatabase/neon/pull/8335, to improve observability of how many evict/restores we are doing. ## Summary of changes - Add `safekeeper_eviction_events_started_total` and `safekeeper_eviction_events_completed_total`, with a "kind" label of evict or restore. This gives us rates, and also ability to calculate how many are in progress. - Generalize SafekeeperMetrics test type to use the same helpers as pageserver, and enable querying any metric. - Read the new metrics at the end of the eviction test.	2024-07-15 09:28:35 -04:00
Christian Schwarz	5bba3e3c75	pageserver: remove `trace_read_requests` (#8338 ) `trace_read_requests` is a per `Tenant`-object option. But the `handle_pagerequests` loop doesn't know which `Tenant` object (i.e., which shard) the request is for. The remaining use of the `Tenant` object is to check `tenant.cancel`. That check is incorrect [if the pageserver hosts multiple shards](https://github.com/neondatabase/neon/issues/7427#issuecomment-2220577518). I'll fix that in a future PR where I completely eliminate the holding of `Tenant/Timeline` objects across requests. See [my code RFC](https://github.com/neondatabase/neon/pull/8286) for the high level idea. Note that we can always bring the tracing functionality if we need it. But since it's actually about logging the `page_service` wire bytes, it should be a `page_service`-level config option, not per-Tenant. And for enabling tracing on a single connection, we can implement a `set pageserver_trace_connection;` option.	2024-07-15 09:28:35 -04:00
John Spray	547acde6cd	safekeeper: add eviction_min_resident to stop evictions thrashing (#8335 ) ## Problem - The condition for eviction is not time-based: it is possible for a timeline to be restored in response to a client, that client times out, and then as soon as the timeline is restored it is immediately evicted again. - There is no delay on eviction at startup of the safekeeper, so when it starts up and sees many idle timelines, it does many evictions which will likely be immediately restored when someone uses the timeline. ## Summary of changes - Add `eviction_min_resident` parameter, and use it in `ready_for_eviction` to avoid evictions if the timeline has been resident for less than this period. - This also implicitly delays evictions at startup for `eviction_min_resident` - Set this to a very low number for the existing eviction test, which expects immediate eviction. The default period is 15 minutes. The general reasoning for that is that in the worst case where we thrash ~10k timelines on one safekeeper, downloading 16MB for each one, we should set a period that would not overwhelm the node's bandwidth.	2024-07-15 09:28:35 -04:00
John Spray	f28abb953d	tests: stabilize test_sharding_split_compaction (#8318 ) ## Problem This test incorrectly assumed that a post-split compaction would only drop content. This was easily destabilized by any changes to image generation rules. ## Summary of changes - Before split, do a full image layer generation pass, to guarantee that post-split compaction should only drop data, never create it. - Fix the force_image_layer_creation mode of compaction that we use from tests like this: previously it would try and generate image layers even if one already existed with the same layer key, which caused compaction to fail.	2024-07-15 09:28:35 -04:00
Christian Schwarz	bfc7338246	pageserver: move `page_service`'s `import basebackup` / `import wal` to mgmt API (#8292 ) I want to fix bugs in `page_service` ([issue](https://github.com/neondatabase/neon/issues/7427)) and the `import basebackup` / `import wal` stand in the way / make the refactoring more complicated. We don't use these methods anyway in practice, but, there have been some objections to removing the functionality completely. So, this PR preserves the existing functionality but moves it into the HTTP management API. Note that I don't try to fix existing bugs in the code, specifically not fixing * it only ever worked correctly for unsharded tenants * it doesn't clean up on error All errors are mapped to `ApiError::InternalServerError`.	2024-07-15 09:28:35 -04:00
Yuchen Liang	0b6492e7d3	tests: increase approx size equal threshold to avoid `test_lsn_lease_size` flakiness (#8282 ) ## Summary of changes Increase the `assert_size_approx_equal` threshold to avoid flakiness of `test_lsn_lease_size`. Still needs more investigation to fully resolve #8293. - Also set `autovacuum=off` for the endpoint we are running in the test. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-07-15 09:28:35 -04:00
John Spray	7cfaecbeb6	tests: stabilize test_timeline_size_quota_on_startup (#8255 ) ## Problem `test_timeline_size_quota_on_startup` assumed that writing data beyond the size limit would always be blocked. This is not so: the limit is only enforced if feedback makes it back from the pageserver to the safekeeper + compute. Closes: https://github.com/neondatabase/neon/issues/6562 ## Summary of changes - Modify the test to wait for the pageserver to catch up. The size limit was never actually being enforced robustly, the original version of this test was just writing much more than 30MB and about 98% of the time getting lucky such that the feedback happened to arrive before the tests for loop was done. - If the test fails, log the logical size as seen by the pageserver.	2024-07-15 09:28:35 -04:00
John Spray	108bf56e44	tests: use smaller layers in test_pg_regress (#8232 ) ## Problem Debug-mode runs of test_pg_regress are rather slow since https://github.com/neondatabase/neon/pull/8105, and occasionally exceed their 600s timeout. ## Summary of changes - Use 8MiB layer files, avoiding large ephemeral layers On a hetzner AX102, this takes the runtime from 230s to 190s. Which hopefully will be enough to get the runtime on github runners more reliably below its 600s timeout. This has the side benefit of exercising more of the pageserver stack (including compaction) under a workload that exercises a more diverse set of postgres functionality than most of our tests.	2024-07-15 09:28:35 -04:00
Tristan Partin	55d37c77b9	Hide import behind TYPE_CHECKING No need to import it if we aren't type checking anything.	2024-07-15 09:28:35 -04:00
Konstantin Knizhnik	05b4169644	Increase timeout for wating subscriber caught-up (#8118 ) ## Problem test_subscriber_restart has quit large failure rate' https://neonprod.grafana.net/d/fddp4rvg7k2dcf/regression-test-failures?orgId=1&var-test_name=test_subscriber_restart&var-max_count=100&var-restrict=false I can be caused by too small timeout (5 seconds) to wait until changes are propagated. Related to #8097 ## Summary of changes Increase timeout to 30 seconds. ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-07-08 17:22:36 +01:00
Alexander Bayandin	d1495755e7	SELECT 💣(); (#8270 ) ## Problem We want to be able to test how our infrastructure reacts on segfaults in Postgres (for example, we collect cores, and get some required logs/metrics, etc) ## Summary of changes - Add `trigger_segfauls` function to `neon_test_utils` to trigger a segfault in Postgres - Add `trigger_panic` function to `neon_test_utils` to trigger SIGABRT (by using `elog(PANIC, ...)) - Fix cleanup logic in regression tests in endpoint crashed	2024-07-08 17:22:36 +01:00
John Spray	64334f497d	tests: make location_conf_churn more robust (#8271 ) ## Problem This test directly manages locations on pageservers and configuration of an endpoint. However, it did not switch off the parts of the storage controller that attempt to do the same: occasionally, the test would fail in a strange way such as a compute failing to accept a reconfiguration request. ## Summary of changes - Wire up the storage controller's compute notification hook to a no-op handler - Configure the tenant's scheduling policy to Stop.	2024-07-08 17:22:35 +01:00
John Spray	32fc2dd683	tests: extend allow list in deletion test (#8268 ) ## Problem `1ea5d8b132` tolerated this as an error message, but it can show up in logs as well. Example failure: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8201/9780147712/index.html#testresult/263422f5f5f292ea/retries ## Summary of changes - Tolerate "failed to delete 1 objects" in pageserver logs, this occurs occasionally when injected failures exhaust deletion's retries.	2024-07-08 17:22:35 +01:00
Konstantin Knizhnik	3ee82a9895	implement rolling hyper-log-log algorithm (#8068 ) ## Problem See #7466 ## Summary of changes Implement algorithm descried in https://hal.science/hal-00465313/document Now new GUC is added: `neon.wss_max_duration` which specifies size of sliding window (in seconds). Default value is 1 hour. It is possible to request estimation of working set sizes (within this window using new function `approximate_working_set_size_seconds`. Old function `approximate_working_set_size` is preserved for backward compatibility. But its scope is also limited by `neon.wss_max_duration`. Version of Neon extension is changed to 1.4 ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Matthias van de Meent <matthias@neon.tech>	2024-07-08 17:22:35 +01:00
Yuchen Liang	32828cddd6	feat(pageserver): integrate lsn lease into synthetic size (#8220 ) Part of #7497, closes #8071. (accidentally closed #8208, reopened here) ## Problem After the changes in #8084, we need synthetic size to also account for leased LSNs so that users do not get free retention by running a small ephemeral endpoint for a long time. ## Summary of changes This PR integrates LSN leases into the synthetic size calculation. We model leases as read-only branches started at the leased LSN (except it does not have a timeline id). Other changes: - Add new unit tests testing whether a lease behaves like a read-only branch. - Change `/size_debug` response to include lease point in the SVG visualization. - Fix `/lsn_lease` HTTP API to do proper parsing for POST. Signed-off-by: Yuchen Liang <yuchen@neon.tech> Co-authored-by: Joonas Koivunen <joonas@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-07-08 17:22:35 +01:00
John Spray	7e2a3d2728	pageserver: downgrade stale generation messages to INFO (#8256 ) ## Problem When generations were new, these messages were an important way of noticing if something unexpected was going on. We found some real issues when investigating tests that unexpectedly tripped them. At time has gone on, this code is now pretty battle-tested, and as we do more live migrations etc, it's fairly normal to see the occasional message from a node with a stale generation. At this point the cognitive load on developers to selectively allow-list these logs outweighs the benefit of having them at warn severity. Closes: https://github.com/neondatabase/neon/issues/8080 ## Summary of changes - Downgrade "Dropped remote consistent LSN updates" and "Dropping stale deletions" messages to INFO - Remove all the allow-list entries for these logs.	2024-07-08 17:22:35 +01:00
Vlad Lazar	b917868ada	tests: perform graceful rolling restarts in storcon scale test (#8173 ) ## Problem Scale test doesn't exercise drain & fill. ## Summary of changes Make scale test exercise drain & fill	2024-07-08 17:22:35 +01:00
Christian Schwarz	47e06a2cc6	page_service: stop exposing `get_last_record_rlsn` (#8244 ) Compute doesn't use it, let's eliminate it. Ref to Slack thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1719920261995529	2024-07-08 17:22:35 +01:00
Heikki Linnakangas	db70c175e6	Simplify test_wal_page_boundary_start test (#8214 ) All the code to ensure the WAL record lands at a page boundary was unnecessary for reproducing the original problem. In fact, it's a pretty basic test that checks that outbound replication (= neon as publisher) still works after restarting the endpoint. It just used to be very broken before commit `5ceccdc7de`, which also added this test. To verify that: 1. Check out commit `f3af5f4660` (because the next commit, `7dd58e1449`, fixed the same bug in a different way, making it infeasible to revert the bug fix in an easy way) 2. Revert the bug fix from commit `5ceccdc7de` with this: ``` diff --git a/pgxn/neon/walproposer_pg.c b/pgxn/neon/walproposer_pg.c index 7debb6325..9f03bbd99 100644 --- a/pgxn/neon/walproposer_pg.c +++ b/pgxn/neon/walproposer_pg.c @@ -1437,8 +1437,10 @@ XLogWalPropWrite(WalProposer wp, char buf, Size nbytes, XLogRecPtr recptr) * * https://github.com/neondatabase/neon/issues/5749 */ +#if 0 if (!wp->config->syncSafekeepers) XLogUpdateWalBuffers(buf, recptr, nbytes); +#endif while (nbytes > 0) { ``` 3. Run the test_wal_page_boundary_start regression test. It fails, as expected 4. Apply this commit to the test, and run it again. It still fails, with the same error mentioned in issue #5749: ``` PG:2024-06-30 20:49:08.805 GMT [1248196] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"pub1"') PG:2024-06-30 21:37:52.567 GMT [1467972] LOG: starting logical decoding for slot "sub1" PG:2024-06-30 21:37:52.567 GMT [1467972] DETAIL: Streaming transactions committing after 0/1532330, reading WAL from 0/1531C78. PG:2024-06-30 21:37:52.567 GMT [1467972] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"pub1"') PG:2024-06-30 21:37:52.567 GMT [1467972] LOG: logical decoding found consistent point at 0/1531C78 PG:2024-06-30 21:37:52.567 GMT [1467972] DETAIL: There are no running transactions. PG:2024-06-30 21:37:52.567 GMT [1467972] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"pub1"') PG:2024-06-30 21:37:52.568 GMT [1467972] ERROR: could not find record while sending logically-decoded data: invalid contrecord length 312 (expected 6) at 0/1533FD8 ```	2024-07-08 17:22:35 +01:00
Konstantin Knizhnik	2863d1df63	Add test for proper handling of connection failure to avoid 'cannot wait on socket event without a socket' error (#8231 ) ## Problem See https://github.com/neondatabase/cloud/issues/14289 and PR #8210 ## Summary of changes Add test for problems fixed in #8210 ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-07-08 17:22:35 +01:00
John Spray	7ff9989dd5	pageserver: simpler, stricter config error handling (#8177 ) ## Problem Tenant attachment has error paths for failures to write local configuration, but these types of local storage I/O errors should be considered fatal for the process. Related thread on an earlier PR that touched this code: https://github.com/neondatabase/neon/pull/7947#discussion_r1655134114 ## Summary of changes - Make errors writing tenant config fatal (abort process) - When reading tenant config, make all I/O errors except ENOENT fatal - Replace use of bare anyhow errors with `LoadConfigError`	2024-07-08 17:22:35 +01:00
Heikki Linnakangas	57f476ff5a	Restore running xacts from CLOG on replica startup (#7288 ) We have one pretty serious MVCC visibility bug with hot standby replicas. We incorrectly treat any transactions that are in progress in the primary, when the standby is started, as aborted. That can break MVCC for queries running concurrently in the standby. It can also lead to hint bits being set incorrectly, and that damage can last until the replica is restarted. The fundamental bug was that we treated any replica start as starting from a shut down server. The fix for that is straightforward: we need to set 'wasShutdown = false' in InitWalRecovery() (see changes in the postgres repo). However, that introduces a new problem: with wasShutdown = false, the standby will not open up for queries until it receives a running-xacts WAL record from the primary. That's correct, and that's how Postgres hot standby always works. But it's a problem for Neon, because: * It changes the historical behavior for existing users. Currently, the standby immediately opens up for queries, so if they now need to wait, we can breka existing use cases that were working fine (assuming you don't hit the MVCC issues). * The problem is much worse for Neon than it is for standalone PostgreSQL, because in Neon, we can start a replica from an arbitrary LSN. In standalone PostgreSQL, the replica always starts WAL replay from a checkpoint record, and the primary arranges things so that there is always a running-xacts record soon after each checkpoint record. You can still hit this issue with PostgreSQL if you have a transaction with lots of subtransactions running in the primary, but it's pretty rare in practice. To mitigate that, we introduce another way to collect the running-xacts information at startup, without waiting for the running-xacts WAL record: We can the CLOG for XIDs that haven't been marked as committed or aborted. It has limitations with subtransactions too, but should mitigate the problem for most users. See https://github.com/neondatabase/neon/issues/7236. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-07-04 18:58:34 +03:00
Heikki Linnakangas	30027d94a2	Fix tracking of the nextMulti in the pageserver's copy of CheckPoint (#6528 ) Whenever we see an XLOG_MULTIXACT_CREATE_ID WAL record, we need to update the nextMulti and NextMultiOffset fields in the pageserver's copy of the CheckPoint struct, to cover the new multi-XID. In PostgreSQL, this is done by updating an in-memory struct during WAL replay, but because in Neon you can start a compute node at any LSN, we need to have an up-to-date value pre-calculated in the pageserver at all times. We do the same for nextXid. However, we had a bug in WAL ingestion code that does that: the multi-XIDs will wrap around at 2^32, just like XIDs, so we need to do the comparisons in a wraparound-aware fashion. Fix that, and add tests. Fixes issue #6520 Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-07-01 01:49:49 +03:00
John Spray	b8bbaafc03	storage controller: fix heatmaps getting disabled during shard split (#8197 ) ## Problem At the start of do_tenant_shard_split, we drop any secondary location for the parent shards. The reconciler uses presence of secondary locations as a condition for enabling heatmaps. On the pageserver, child shards inherit their configuration from parents, but the storage controller assumes the child's ObservedState is the same as the parent's config from the prepare phase. The result is that some child shards end up with inaccurate ObservedState, and until something next migrates or restarts, those tenant shards aren't uploading heatmaps, so their secondary locations are downloading everything that was resident at the moment of the split (including ancestor layers which are often cleaned up shortly after the split). Closes: https://github.com/neondatabase/neon/issues/8189 ## Summary of changes - Use PlacementPolicy to control enablement of heatmap upload, rather than the literal presence of secondaries in IntentState: this way we avoid switching them off during shard split - test: during tenant split test, assert that the child shards have heatmap uploads enabled.	2024-06-28 18:27:13 +01:00
John Spray	063553a51b	pageserver: remove tenant create API (#8135 ) ## Problem For some time, we have created tenants with calls to location_conf. The legacy "POST /v1/tenant" path was only used in some tests. ## Summary of changes - Remove the API - Relocate TenantCreateRequest to the controller API file (this used to be used in both pageserver and controller APIs) - Rewrite tenant_create test helper to use location_config API, as control plane and storage controller do - Update docker-compose test script to create tenants with location_config API (this small commit is also present in https://github.com/neondatabase/neon/pull/7947)	2024-06-28 09:14:19 +01:00
Vlad Lazar	89cf8df93b	stocon: bump number of concurrent reconciles per operation (#8179 ) ## Problem Background node operations take a long time for loaded nodes. ## Summary of changes Increase number of concurrent reconciles an operation is allowed to spawn. This should make drain and fill operations faster and the new value is still well below the total limit of concurrent reconciles.	2024-06-27 13:16:41 +00:00
Arseny Sher	6f20a18e8e	Allow to change compute safekeeper list without restart. - Add --safekeepers option to neon_local reconfigure - Add it to python Endpoint reconfigure - Implement config reload in walproposer by restarting the whole bgw when safekeeper list changes. ref https://github.com/neondatabase/neon/issues/6341	2024-06-27 15:08:35 +03:00

1 2 3 4 5 ...

817 Commits