rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-03 12:10:36 +00:00

Author	SHA1	Message	Date
John Spray	affe408433	storage scrubber: GC ancestor shard layers (#8196 ) ## Problem After a shard split, the pageserver leaves the ancestor shard's content in place. It may be referenced by child shards, but eventually child shards will de-reference most ancestor layers as they write their own data and do GC. We would like to eventually clean up those ancestor layers to reclaim space. ## Summary of changes - Extend the physical GC command with `--mode=full`, which includes cleaning up unreferenced ancestor shard layers - Add test `test_scrubber_physical_gc_ancestors` - Remove colored log output: in testing this is irritating ANSI code spam in logs, and in interactive use doesn't add much. - Refactor storage controller API client code out of storcon_client into a `storage_controller/client` crate - During physical GC of ancestors, call into the storage controller to check that the latest shards seen in S3 reflect the latest state of the tenant, and there is no shard split in progress.	2024-07-22 14:36:56 +02:00
Christian Schwarz	9b883e4651	pageserver: remove obsolete cached_metric_collection_interval (#8370 ) We're removing the usage of this long-meaningless config field in https://github.com/neondatabase/aws/pull/1599 Once that PR has been deployed to staging and prod, we can merge this PR.	2024-07-22 14:36:56 +02:00
Arpad Müller	ed7ee73cba	Enable zstd in tests (#8368 ) Successor of #8288 , just enable zstd in tests. Also adds a test that creates easily compressable data. Part of #5431 --------- Co-authored-by: John Spray <john@neon.tech> Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2024-07-22 14:36:56 +02:00
John Spray	9868bb3346	tests: turn on safekeeper eviction by default (#8352 ) ## Problem Ahead of enabling eviction in the field, where it will become the normal/default mode, let's enable it by default throughout our tests in case any issues become visible there. ## Summary of changes - Make default `extra_opts` for safekeepers enable offload & deletion - Set low timeouts in `extra_opts` so that tests running for tens of seconds have a chance to hit some of these background operations.	2024-07-22 14:36:56 +02:00
John Spray	27da0e9cf5	tests: increase test_pg_regress and test_isolation timeouts (#8418 ) ## Problem These tests time out ~1 in 50 runs when in debug mode. There is no indication of a real issue: they're just wrappers that have large numbers of individual tests contained within on pytest case. ## Summary of changes - Bump pg_regress timeout from 600 to 900s - Bump test_isolation timeout from 300s (default) to 600s In future it would be nice to break out these tests to run individual cases (or batches thereof) as separate tests, rather than this monolith.	2024-07-22 14:36:56 +02:00
John Spray	de9bf2af6c	tests: fix metrics check in test_s3_eviction (#8419 ) ## Problem This test would occasionally fail its metric check. This could happen in the rare case that the nodes had all been restarted before their most recent eviction. The metric check was added in https://github.com/neondatabase/neon/pull/8348 ## Summary of changes - Check metrics before each restart, accumulate into a bool that we assert on at the end of the test	2024-07-22 14:36:56 +02:00
Christian Schwarz	3d2c2ce139	NeonEnv.from_repo_dir: use storage_controller_db instead of `attachments.json` (#8382 ) When `NeonEnv.from_repo_dir` was introduced, storage controller stored its state exclusively `attachments.json`. Since then, it has moved to using Postgres, which stores its state in `storage_controller_db`. But `NeonEnv.from_repo_dir` wasn't adjusted to do this. This PR rectifies the situation. Context for this is failures in `test_pageserver_characterize_throughput_with_n_tenants` CF: https://neondb.slack.com/archives/C033RQ5SPDH/p1721035799502239?thread_ts=1720901332.293769&cid=C033RQ5SPDH Notably, `from_repo_dir` is also used by the backwards- and forwards-compatibility. Thus, the changes in this PR affect those tests as well. However, it turns out that the compatibility snapshot already contains the `storage_controller_db`. Thus, it should just work and in fact we can remove hacks like `fixup_storage_controller`. Follow-ups created as part of this work: * https://github.com/neondatabase/neon/issues/8399 * https://github.com/neondatabase/neon/issues/8400	2024-07-22 14:36:56 +02:00
Joonas Koivunen	ff174a88c0	test: allow requests to any pageserver get cancelled (#8413 ) Fix flakyness on `test_sharded_timeline_detach_ancestor` which does not reproduce on a fast enough runner by allowing cancelled request before completing on all pageservers. It was only allowed on half of the pageservers. Failure evidence: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8352/9972357040/index.html#suites/a1c2be32556270764423c495fad75d47/7cca3e3d94fe12f2	2024-07-22 14:36:56 +02:00
John Spray	ef3ebfaf67	pageserver: layer count & size metrics (#8410 ) ## Problem We lack insight into: - How much of a tenant's physical size is image vs. delta layers - Average sizes of image vs. delta layers - Total layer counts per timeline, indicating size of index_part object As well as general observability love, this is motivated by https://github.com/neondatabase/neon/issues/6738, where we need to define some sensible thresholds for storage amplification, and using total physical size may not work well (if someone does a lot of DROPs then it's legitimate for the physical-synthetic ratio to be huge), but the ratio between image layer size and delta layer size may be a better indicator of whether we're generating unreasonable quantities of image layers. ## Summary of changes - Add pageserver_layer_bytes and pageserver_layer_count metrics, labelled by timeline and `kind` (delta or image) - Add & subtract these with LayerInner's lifetime. I'm intentionally avoiding using a generic metric RAII guard object, to avoid bloating LayerInner: it already has all the information it needs to update metric on new+drop.	2024-07-22 14:36:56 +02:00
John Spray	c150ad4ee2	tests: add test_compaction_l0_memory (#8403 ) This test reproduces the case of a writer creating a deep stack of L0 layers. It uses realistic layer sizes and writes several gigabytes of data, therefore runs as a performance test although it is validating memory footprint rather than performance per se. It acts a regression test for two recent fixes: - https://github.com/neondatabase/neon/pull/8401 - https://github.com/neondatabase/neon/pull/8391 In future it will demonstrate the larger improvement of using a k-merge iterator for L0 compaction (#8184) This test can be extended to enforce limits on the memory consumption of other housekeeping steps, by restarting the pageserver and then running other things to do the same "how much did RSS increase" measurement.	2024-07-22 14:36:56 +02:00
John Spray	b21e131d11	pageserver: exclude un-read layers from short residence statistic (#8396 ) ## Problem The `evictions_with_low_residence_duration` is used as an indicator of cache thrashing. However, there are situations where it is quite legitimate to only have a short residence during compaction, where a delta is downloaded, used to generate an image layer, and then discarded. This can lead to false positive alerts. ## Summary of changes - Only track low residence duration for layers that have been accessed at least once (compaction doesn't count as an access). This will give us a metric that indicates thrashing on layers that the _user_ is using, rather than those we're downloading for housekeeping purposes. Once we add "layer visibility" as an explicit property of layers, this can also be used as a cleaner condition (residence of non-visible layers should never be alertable)	2024-07-22 14:36:56 +02:00
Tristan Partin	18e7c2b7a1	Add some typing to Endpoint.respec()	2024-07-22 14:36:56 +02:00
Tristan Partin	ad5d784fb7	Hide import behind TYPE_CHECKING	2024-07-22 14:36:56 +02:00
Tristan Partin	85d47637ee	Run each migration in its own transaction Previously, every migration was run in the same transaction. This is preparatory work for fixing CVE-2024-4317.	2024-07-22 14:36:56 +02:00
Arpad Müller	9dc71f5a88	Avoid the storage controller in test_tenant_creation_fails (#8392 ) As described in #8385, the likely source for flakiness in test_tenant_creation_fails is the following sequence of events: 1. test instructs the storage controller to create the tenant 2. storage controller adds the tenant and persists it to the database. issues a creation request 3. the pageserver restarts with the failpoint disabled 4. storage controller's background reconciliation still wants to create the tenant 5. pageserver gets new request to create the tenant from background reconciliation This commit just avoids the storage controller entirely. It has its own set of issues, as the re-attach request will obviously not include the tenant, but it's still useful to test for non-existence of the tenant. The generation is also not optional any more during tenant attachment. If you omit it, the pageserver yields an error. We change the signature of `tenant_attach` to reflect that. Alternative to #8385 Fixes #8266	2024-07-22 14:36:56 +02:00
Sasha Krassovsky	349373cb11	Allow reusing projects between runs of logical replication benchmarks (#8393 )	2024-07-22 14:36:56 +02:00
Joonas Koivunen	957f99cad5	feat(timeline_detach_ancestor): success idempotency (#8354 ) Right now timeline detach ancestor reports an error (409, "no ancestor") on a new attempt after successful completion. This makes it troublesome for storage controller retries. Fix it to respond with `200 OK` as if the operation had just completed quickly. Additionally, the returned timeline identifiers in the 200 OK response are now ordered so that responses between different nodes for error comparison are done by the storage controller added in #8353. Design-wise, this PR introduces a new strategy for accessing the latest uploaded IndexPart: `RemoteTimelineClient::initialized_upload_queue(&self) -> Result<UploadQueueAccessor<'_>, NotInitialized>`. It should be a more scalable way to query the latest uploaded `IndexPart` than to add a query method for each question directly on `RemoteTimelineClient`. GC blocking will need to be introduced to make the operation fully idempotent. However, it is idempotent for the cases demonstrated by tests. Cc: #6994	2024-07-22 14:36:56 +02:00
John Spray	2a3a136474	pageserver: use PITR GC cutoffs as authoritative (#8365 ) ## Problem Pageserver GC uses a size-based condition (GC "horizon" in addition to time-based "PITR"). Eventually we plan to retire the size-based condition: https://github.com/neondatabase/neon/issues/6374 Currently, we always apply the more conservative of the two, meaning that tenants always retain at least 64MB of history (default horizon), even after a very long time has passed. This is particularly acute in cases where someone has dropped tables/databases, and then leaves a database idle: the horizon can prevent GCing very large quantities of historical data (we already account for this in synthetic size by ignoring gc horizon). We're not entirely removing GC horizon right now because we don't want to 100% rely on standby_horizon for robustness of physical replication, but we can tweak our logic to avoid retaining that 64MB LSN length indefinitely. ## Summary of changes - Rework `Timeline::find_gc_cutoffs`, with new logic: - If there is no PITR set, then use `DEFAULT_PITR_INTERVAL` (1 week) to calculate a time threshold. Retain either the horizon or up to that thresholds, whichever requires less data. - When there is a PITR set, and we have unambiguously resolved the timestamp to an LSN, then ignore the GC horizon entirely. For typical PITRs (1 day, 1 week), this will still easily retain enough data to avoid stressing read only replicas. The key property we end up with, whether a PITR is set or not, is that after enough time has passed, our GC cutoff on an idle timeline will catch up with the last_record_lsn. Using `DEFAULT_PITR_INTERVAL` is a bit of an arbitrary hack, but this feels like it isn't really worth the noise of exposing in TenantConfig. We could just make it a different named constant though. The end-end state will be that there is no gc_horizon at all, and that tenants with pitr_interval=0 would truly retain no history, so this constant would go away.	2024-07-22 14:36:56 +02:00
Joonas Koivunen	cfaf30f5e8	feat(storcon): timeline detach ancestor passthrough (#8353 ) Currently storage controller does not support forwarding timeline detach ancestor requests to pageservers. Add support for forwarding `PUT .../:tenant_id/timelines/:timeline_id/detach_ancestor`. Implement the support mostly as is, because the timeline detach ancestor will be made (mostly) idempotent in future PR. Cc: #6994	2024-07-22 14:36:56 +02:00
Christian Schwarz	72c2d0812e	remove page_service `show <tenant_id>` (#8372 ) This operation isn't used in practice, so let's remove it. Context: in https://github.com/neondatabase/neon/pull/8339	2024-07-22 14:36:56 +02:00
Arseny Sher	537ecf45f8	Fix test_timeline_copy flakiness. fixes https://github.com/neondatabase/neon/issues/8355	2024-07-22 14:31:12 +02:00
Konstantin Knizhnik	7973c3e941	Add neon.running_xacts_overflow_policy to make it possible for RO replica to startup without primary even in case running xacts overflow (#8323 ) ## Problem Right now if there are too many running xacts to be restored from CLOG at replica startup, then replica is not trying to restore them and wait for non-overflown running-xacs WAL record from primary. But if primary is not active, then replica will not start at all. Too many running xacts can be caused by transactions with large number of subtractions. But right now it can be also cause by two reasons: - Lack of shutdown checkpoint which updates `oldestRunningXid` (because of immediate shutdown) - nextXid alignment on 1024 boundary (which cause loosing ~1k XIDs on each restart) Both problems are somehow addressed now. But we have existed customers with "sparse" CLOG and lack of checkpoints. To be able to start RO replicas for such customers I suggest to add GUC which allows replica to start even in case of subxacts overflow. ## Summary of changes Add `neon.running_xacts_overflow_policy` with the following values: - ignore: restore from CLOG last N XIDs and accept connections - skip: do not restore any XIDs from CXLOGbut still accept connections - wait: wait non-overflown running xacts record from primary node ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-07-15 09:34:35 -04:00
Vlad Lazar	085bbaf5f8	tests: allow list breaching min resident size in statvfs test (#8358 ) ## Problem This test would sometimes violate the min resident size during disk eviction and fail due to the generate warning log. Disk usage candidate collection only takes into account active tenants. However, the statvfs call takes into account the entire tenants directory, which includes tenants which haven't become active yet. After re-starting the pageserver, disk usage eviction may kick in before both tenants have become active. Hence, the logic will try to satisfy thedisk usage requirements by evicting everything belonging to the active tenant, and hence violating the tenant minimum resident size. ## Summary of changes Allow the warning	2024-07-15 09:28:35 -04:00
John Spray	3f8819827c	pageserver: circuit breaker on compaction (#8359 ) ## Problem We already back off on compaction retries, but the impact of a failing compaction can be so great that backing off up to 300s isn't enough. The impact is consuming a lot of I/O+CPU in the case of image layer generation for large tenants, and potentially also leaking disk space. Compaction failures are extremely rare and almost always indicate a bug, frequently a bug that will not let compaction to proceed until it is fixed. Related: https://github.com/neondatabase/neon/issues/6738 ## Summary of changes - Introduce a CircuitBreaker type - Add a circuit breaker for compaction, with a policy that after 5 failures, compaction will not be attempted again for 24 hours. - Add metrics that we can alert on: any >0 value for `pageserver_circuit_breaker_broken_total` should generate an alert. - Add a test that checks this works as intended. Couple notes to reviewers: - Circuit breakers are intrinsically a defense-in-depth measure: this is not the solution to any underlying issues, it is just a general mitigation for "unknown unknowns" that might be encountered in future. - This PR isn't primarily about writing a perfect CircuitBreaker type: the one in this PR is meant to be just enough to mitigate issues in compaction, and make it easy to monitor/alert on these failures. We can refine this type in future as/when we want to use it elsewhere.	2024-07-15 09:28:35 -04:00
Sasha Krassovsky	119ddf6ccf	Grant execute on snapshot functions to neon_superuser (#8346 ) ## Problem I need `neon_superuser` to be allowed to create snapshots for replication tests ## Summary of changes Adds a migration that grants these functions to neon_superuser	2024-07-15 09:28:35 -04:00
Joonas Koivunen	90f447b79d	test: limit `test_layer_download_timeouted` to MOCK_S3 (#8331 ) Requests against REAL_S3 on CI can consistently take longer than 1s; testing the short timeouts against it made no sense in hindsight, as MOCK_S3 works just as well. evidence: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8229/9857994025/index.html#suites/b97efae3a617afb71cb8142f5afa5224/6828a50921660a32	2024-07-15 09:28:35 -04:00
Arseny Sher	8532d72276	Fix memory context of NeonWALReader allocation. Allocating it in short living context is wrong because it is reused during backend lifetime.	2024-07-15 09:28:35 -04:00
John Spray	d3ff47f572	storage controller: add node deletion API (#8226 ) ## Problem In anticipation of later adding a really nice drain+delete API, I initially only added an intentionally basic `/drop` API that is just about usable for deleting nodes in a pinch, but requires some ugly storage controller restarts to persuade it to restart secondaries. ## Summary of changes I started making a few tiny fixes, and ended up writing the delete API... - Quality of life nit: ordering of node + tenant listings in storcon_cli - Papercut: Fix the attach_hook using the wrong operation type for reporting slow locks - Make Service::spawn tolerate `generation_pageserver` columns that point to nonexistent node IDs. I started out thinking of this as a general resilience thing, but when implementing the delete API I realized it was actually a legitimate end state after the delete API is called (as that API doesn't wait for all reconciles to succeed). - Add a `DELETE` API for nodes, which does not gracefully drain, but does reschedule everything. This becomes safe to use when the system is in any state, but will incur availability gaps for any tenants that weren't already live-migrated away. If tenants have already been drained, this becomes a totally clean + safe way to decom a node. - Add a test and a storcon_cli wrapper for it This is meant to be a robust initial API that lets us remove nodes without doing ugly things like restarting the storage controller -- it's not quite a totally graceful node-draining routine yet. There's more work in https://github.com/neondatabase/neon/issues/8333 to get to our end-end state.	2024-07-15 09:28:35 -04:00
John Spray	8cc768254f	safekeeper: eviction metrics (#8348 ) ## Problem Follow up to https://github.com/neondatabase/neon/pull/8335, to improve observability of how many evict/restores we are doing. ## Summary of changes - Add `safekeeper_eviction_events_started_total` and `safekeeper_eviction_events_completed_total`, with a "kind" label of evict or restore. This gives us rates, and also ability to calculate how many are in progress. - Generalize SafekeeperMetrics test type to use the same helpers as pageserver, and enable querying any metric. - Read the new metrics at the end of the eviction test.	2024-07-15 09:28:35 -04:00
Christian Schwarz	5bba3e3c75	pageserver: remove `trace_read_requests` (#8338 ) `trace_read_requests` is a per `Tenant`-object option. But the `handle_pagerequests` loop doesn't know which `Tenant` object (i.e., which shard) the request is for. The remaining use of the `Tenant` object is to check `tenant.cancel`. That check is incorrect [if the pageserver hosts multiple shards](https://github.com/neondatabase/neon/issues/7427#issuecomment-2220577518). I'll fix that in a future PR where I completely eliminate the holding of `Tenant/Timeline` objects across requests. See [my code RFC](https://github.com/neondatabase/neon/pull/8286) for the high level idea. Note that we can always bring the tracing functionality if we need it. But since it's actually about logging the `page_service` wire bytes, it should be a `page_service`-level config option, not per-Tenant. And for enabling tracing on a single connection, we can implement a `set pageserver_trace_connection;` option.	2024-07-15 09:28:35 -04:00
Peter Bendel	6caf702417	Run Performance bench on more platforms (#8312 ) ## Problem https://github.com/neondatabase/cloud/issues/14721 ## Summary of changes add one more platform to benchmarking job `57535c039c/.github/workflows/benchmarking.yml (L57C3-L126)` Run with pg 16, provisioner k8-neonvm by default on the new platform. Adjust some test cases to - not depend on database client <-> database server latency by pushing loops into server side pl/pgSQL functions - increase statement and test timeouts First successful run of these job steps https://github.com/neondatabase/neon/actions/runs/9869817756/job/27254280428	2024-07-15 09:28:35 -04:00
John Spray	547acde6cd	safekeeper: add eviction_min_resident to stop evictions thrashing (#8335 ) ## Problem - The condition for eviction is not time-based: it is possible for a timeline to be restored in response to a client, that client times out, and then as soon as the timeline is restored it is immediately evicted again. - There is no delay on eviction at startup of the safekeeper, so when it starts up and sees many idle timelines, it does many evictions which will likely be immediately restored when someone uses the timeline. ## Summary of changes - Add `eviction_min_resident` parameter, and use it in `ready_for_eviction` to avoid evictions if the timeline has been resident for less than this period. - This also implicitly delays evictions at startup for `eviction_min_resident` - Set this to a very low number for the existing eviction test, which expects immediate eviction. The default period is 15 minutes. The general reasoning for that is that in the worst case where we thrash ~10k timelines on one safekeeper, downloading 16MB for each one, we should set a period that would not overwhelm the node's bandwidth.	2024-07-15 09:28:35 -04:00
John Spray	f28abb953d	tests: stabilize test_sharding_split_compaction (#8318 ) ## Problem This test incorrectly assumed that a post-split compaction would only drop content. This was easily destabilized by any changes to image generation rules. ## Summary of changes - Before split, do a full image layer generation pass, to guarantee that post-split compaction should only drop data, never create it. - Fix the force_image_layer_creation mode of compaction that we use from tests like this: previously it would try and generate image layers even if one already existed with the same layer key, which caused compaction to fail.	2024-07-15 09:28:35 -04:00
Christian Schwarz	bfc7338246	pageserver: move `page_service`'s `import basebackup` / `import wal` to mgmt API (#8292 ) I want to fix bugs in `page_service` ([issue](https://github.com/neondatabase/neon/issues/7427)) and the `import basebackup` / `import wal` stand in the way / make the refactoring more complicated. We don't use these methods anyway in practice, but, there have been some objections to removing the functionality completely. So, this PR preserves the existing functionality but moves it into the HTTP management API. Note that I don't try to fix existing bugs in the code, specifically not fixing * it only ever worked correctly for unsharded tenants * it doesn't clean up on error All errors are mapped to `ApiError::InternalServerError`.	2024-07-15 09:28:35 -04:00
Yuchen Liang	0b6492e7d3	tests: increase approx size equal threshold to avoid `test_lsn_lease_size` flakiness (#8282 ) ## Summary of changes Increase the `assert_size_approx_equal` threshold to avoid flakiness of `test_lsn_lease_size`. Still needs more investigation to fully resolve #8293. - Also set `autovacuum=off` for the endpoint we are running in the test. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-07-15 09:28:35 -04:00
John Spray	7cfaecbeb6	tests: stabilize test_timeline_size_quota_on_startup (#8255 ) ## Problem `test_timeline_size_quota_on_startup` assumed that writing data beyond the size limit would always be blocked. This is not so: the limit is only enforced if feedback makes it back from the pageserver to the safekeeper + compute. Closes: https://github.com/neondatabase/neon/issues/6562 ## Summary of changes - Modify the test to wait for the pageserver to catch up. The size limit was never actually being enforced robustly, the original version of this test was just writing much more than 30MB and about 98% of the time getting lucky such that the feedback happened to arrive before the tests for loop was done. - If the test fails, log the logical size as seen by the pageserver.	2024-07-15 09:28:35 -04:00
John Spray	108bf56e44	tests: use smaller layers in test_pg_regress (#8232 ) ## Problem Debug-mode runs of test_pg_regress are rather slow since https://github.com/neondatabase/neon/pull/8105, and occasionally exceed their 600s timeout. ## Summary of changes - Use 8MiB layer files, avoiding large ephemeral layers On a hetzner AX102, this takes the runtime from 230s to 190s. Which hopefully will be enough to get the runtime on github runners more reliably below its 600s timeout. This has the side benefit of exercising more of the pageserver stack (including compaction) under a workload that exercises a more diverse set of postgres functionality than most of our tests.	2024-07-15 09:28:35 -04:00
Tristan Partin	cec216c5c0	Add long running replication tests These tests will help verify that replication, both physical and logical, works as expected in Neon. Co-authored-by: Sasha Krassovsky <sasha@neon.tech>	2024-07-15 09:28:35 -04:00
Tristan Partin	930201e033	Add PgBin.run_nonblocking() Allows a process to run without blocking program execution, which can be useful for certain test scenarios. Co-authored-by: Sasha Krassovsky <sasha@neon.tech>	2024-07-15 09:28:35 -04:00
Tristan Partin	8328580dc2	Log PG environment variables when a PgBin runs Useful for debugging situations like connecting to databases. Co-authored-by: Sasha Krassovsky <sasha@neon.tech>	2024-07-15 09:28:35 -04:00
Tristan Partin	8d9b632f2a	Add Neon HTTP API test fixture This is a Python binding to the Neon HTTP API. It isn't complete, but can be extended as necessary. Co-authored-by: Sasha Krassovsky <sasha@neon.tech>	2024-07-15 09:28:35 -04:00
Tristan Partin	55d37c77b9	Hide import behind TYPE_CHECKING No need to import it if we aren't type checking anything.	2024-07-15 09:28:35 -04:00
Konstantin Knizhnik	05b4169644	Increase timeout for wating subscriber caught-up (#8118 ) ## Problem test_subscriber_restart has quit large failure rate' https://neonprod.grafana.net/d/fddp4rvg7k2dcf/regression-test-failures?orgId=1&var-test_name=test_subscriber_restart&var-max_count=100&var-restrict=false I can be caused by too small timeout (5 seconds) to wait until changes are propagated. Related to #8097 ## Summary of changes Increase timeout to 30 seconds. ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-07-08 17:22:36 +01:00
Alexander Bayandin	d1495755e7	SELECT 💣(); (#8270 ) ## Problem We want to be able to test how our infrastructure reacts on segfaults in Postgres (for example, we collect cores, and get some required logs/metrics, etc) ## Summary of changes - Add `trigger_segfauls` function to `neon_test_utils` to trigger a segfault in Postgres - Add `trigger_panic` function to `neon_test_utils` to trigger SIGABRT (by using `elog(PANIC, ...)) - Fix cleanup logic in regression tests in endpoint crashed	2024-07-08 17:22:36 +01:00
John Spray	64334f497d	tests: make location_conf_churn more robust (#8271 ) ## Problem This test directly manages locations on pageservers and configuration of an endpoint. However, it did not switch off the parts of the storage controller that attempt to do the same: occasionally, the test would fail in a strange way such as a compute failing to accept a reconfiguration request. ## Summary of changes - Wire up the storage controller's compute notification hook to a no-op handler - Configure the tenant's scheduling policy to Stop.	2024-07-08 17:22:35 +01:00
John Spray	32fc2dd683	tests: extend allow list in deletion test (#8268 ) ## Problem `1ea5d8b132` tolerated this as an error message, but it can show up in logs as well. Example failure: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8201/9780147712/index.html#testresult/263422f5f5f292ea/retries ## Summary of changes - Tolerate "failed to delete 1 objects" in pageserver logs, this occurs occasionally when injected failures exhaust deletion's retries.	2024-07-08 17:22:35 +01:00
Konstantin Knizhnik	3ee82a9895	implement rolling hyper-log-log algorithm (#8068 ) ## Problem See #7466 ## Summary of changes Implement algorithm descried in https://hal.science/hal-00465313/document Now new GUC is added: `neon.wss_max_duration` which specifies size of sliding window (in seconds). Default value is 1 hour. It is possible to request estimation of working set sizes (within this window using new function `approximate_working_set_size_seconds`. Old function `approximate_working_set_size` is preserved for backward compatibility. But its scope is also limited by `neon.wss_max_duration`. Version of Neon extension is changed to 1.4 ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Matthias van de Meent <matthias@neon.tech>	2024-07-08 17:22:35 +01:00
Yuchen Liang	32828cddd6	feat(pageserver): integrate lsn lease into synthetic size (#8220 ) Part of #7497, closes #8071. (accidentally closed #8208, reopened here) ## Problem After the changes in #8084, we need synthetic size to also account for leased LSNs so that users do not get free retention by running a small ephemeral endpoint for a long time. ## Summary of changes This PR integrates LSN leases into the synthetic size calculation. We model leases as read-only branches started at the leased LSN (except it does not have a timeline id). Other changes: - Add new unit tests testing whether a lease behaves like a read-only branch. - Change `/size_debug` response to include lease point in the SVG visualization. - Fix `/lsn_lease` HTTP API to do proper parsing for POST. Signed-off-by: Yuchen Liang <yuchen@neon.tech> Co-authored-by: Joonas Koivunen <joonas@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-07-08 17:22:35 +01:00
John Spray	7e2a3d2728	pageserver: downgrade stale generation messages to INFO (#8256 ) ## Problem When generations were new, these messages were an important way of noticing if something unexpected was going on. We found some real issues when investigating tests that unexpectedly tripped them. At time has gone on, this code is now pretty battle-tested, and as we do more live migrations etc, it's fairly normal to see the occasional message from a node with a stale generation. At this point the cognitive load on developers to selectively allow-list these logs outweighs the benefit of having them at warn severity. Closes: https://github.com/neondatabase/neon/issues/8080 ## Summary of changes - Downgrade "Dropped remote consistent LSN updates" and "Dropping stale deletions" messages to INFO - Remove all the allow-list entries for these logs.	2024-07-08 17:22:35 +01:00
Alexander Bayandin	0e4832308d	CI(pg-clients): unify workflow with build-and-test (#8160 ) ## Problem `pg-clients` workflow looks different from the main `build-and-test` workflow for historical reasons (it was my very first task at Neon, and back then I wasn't really familiar with the rest of the CI pipelines). This PR unifies `pg-clients` workflow with `build-and-test` ## Summary of changes - Rename `pg_clients.yml` to `pg-clients.yml` - Run the workflow on changes in relevant files - Create Allure report for tests - Send slack notifications to `#on-call-qa-staging-stream` channel (instead of `#on-call-staging-stream`) - Update Client libraries once we're here	2024-07-08 17:22:35 +01:00

1 2 3 4 5 ...

1441 Commits