rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-05 13:10:37 +00:00

Author	SHA1	Message	Date
Vlad Lazar	587cb705b8	pageserver: roll open layer in timeline writer (#6661 ) ## Problem One WAL record can actually produce an arbitrary amount of key value pairs. This is problematic since it might cause our frozen layers to bloat past the max allowed size of S3 single shot uploads. [#6639](https://github.com/neondatabase/neon/pull/6639) introduced a "should roll" check after every batch of `ingest_batch_size` (100 WAL records by default). This helps, but the original problem still exists. ## Summary of changes This patch moves the responsibility of rolling the currently open layer to the `TimelineWriter`. Previously, this was done ad-hoc via calls to `check_checkpoint_distance`. The advantages of this approach are: * ability to split one batch over multiple open layers * less layer map locking * remove ad-hoc check_checkpoint_distance calls More specifically, we track the current size of the open layer in the writer. On each `put` check whether the current layer should be closed and a new one opened. Keeping track of the currently open layer results in less contention on the layer map lock. It only needs to be acquired on the first write and on writes that require a roll afterwards. Rolling the open layer can be triggered by: 1. The distance from the last LSN we rolled at. This bounds the amount of WAL that the safekeepers need to store. 2. The size of the currently open layer. 3. The time since the last roll. It helps safekeepers to regard pageserver as caught up and suspend activity. Closes #6624	2024-02-19 12:34:27 +00:00
Alexander Bayandin	61f99d703d	test_create_snapshot: do not try to copy pg_dynshmem dir (#6796 ) ## Problem `test_create_snapshot` is flaky[0] on CI and fails constantly on macOS, but with a slightly different error: ``` shutil.Error: [('/Users/bayandin/work/neon/test_output/test_create_snapshot[release-pg15-1-100]/repo/endpoints/ep-1/pgdata/pg_dynshmem', '/Users/bayandin/work/neon/test_output/compatibility_snapshot_pgv15/repo/endpoints/ep-1/pgdata/pg_dynshmem', "[Errno 2] No such file or directory: '/Users/bayandin/work/neon/test_output/test_create_snapshot[release-pg15-1-100]/repo/endpoints/ep-1/pgdata/pg_dynshmem'")] ``` Also (on macOS) `repo/endpoints/ep-1/pgdata/pg_dynshmem` is a symlink to `/dev/shm/`. - [0] https://github.com/neondatabase/neon/issues/6784 ## Summary of changes Ignore `pg_dynshmem` directory while copying a snapshot	2024-02-18 12:16:07 +00:00
Christian Schwarz	ca07fa5f8b	per-TenantShard read throttling (#6706 )	2024-02-16 21:26:59 +01:00
John Spray	5d039c6e9b	libs: add 'generations_api' auth scope (#6783 ) ## Problem Even if you're not enforcing auth, the JwtAuth middleware barfs on scopes it doesn't know about. Add `generations_api` scope, which was invented in the cloud control plane for the pageserver's /re-attach and /validate upcalls: this will be enforced in storage controller's implementation of these in a later PR. Unfortunately the scope's naming doesn't match the other scope's naming styles, so needs a manual serde decorator to give it an underscore. ## Summary of changes - Add `Scope::GenerationsApi` variant - Update pageserver + safekeeper auth code to print appropriate message if they see it.	2024-02-16 15:53:09 +00:00
Alexander Bayandin	59c5b374de	test_pageserver_max_throughput_getpage_at_latest_lsn: disable on CI (#6785 ) ## Problem `test_pageserver_max_throughput_getpage_at_latest_lsn` is flaky which makes CI status red pretty frequently. `benchmarks` is not a blocking job (doesn't block `deploy`), so having it red might hide failures in other jobs Ref: https://github.com/neondatabase/neon/issues/6724 ## Summary of changes - Disable `test_pageserver_max_throughput_getpage_at_latest_lsn` on CI until it fixed	2024-02-16 15:30:04 +00:00
Arpad Müller	0f3b87d023	Add test for pageserver_directory_entries_count metric (#6767 ) Adds a simple test to ensure the metric works. The test creates a bunch of relations to activate the metric. Follow-up of #6736	2024-02-16 14:53:36 +00:00
John Spray	f2e5212fed	storage controller: background reconcile, graceful shutdown, better logging (#6709 ) ## Problem Now that the storage controller is working end to end, we start burning down the robustness aspects. ## Summary of changes - Add a background task that periodically calls `reconcile_all`. This ensures that if earlier operations couldn't succeed (e.g. because a node was unavailable), we will eventually retry. This is a naive initial implementation can start an unlimited number of reconcile tasks: limiting reconcile concurrency is a later item in #6342 - Add a number of tracing spans in key locations: each background task, each reconciler task. - Add a top level CancellationToken and Gate, and use these to implement a graceful shutdown that waits for tasks to shut down. This is not bulletproof yet, because within these tasks we have remote HTTP calls that aren't wrapped in cancellation/timeouts, but it creates the structure, and if we don't shutdown promptly then k8s will kill us. - To protect shard splits from background reconciliation, expose the `SplitState` in memory and use it to guard any APIs that require an attached tenant.	2024-02-16 13:00:53 +00:00
Alexander Bayandin	c72cb44213	test_runner/performance: parametrize benchmarks (#6744 ) ## Problem Currently, we don't store `PLATFORM` for Nightly Benchmarks. It causes them to be merged as reruns in Allure report (because they have the same test name). ## Summary of changes - Parametrize benchmarks by - Postgres Version (14/15/16) - Build Type (debug/release/remote) - PLATFORM (neon-staging/github-actions-selfhosted/...) --------- Co-authored-by: Bodobolero <peterbendel@neon.tech>	2024-02-15 15:53:58 +00:00
John Spray	5fa747e493	pageserver: shard splitting refinements (parent deletion, hard linking) (#6725 ) ## Problem - We weren't deleting parent shard contents once the split was done - Re-downloading layers into child shards is wasteful ## Summary of changes - Hard-link layers into child chart local storage during split - Delete parent shards content at the end --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2024-02-15 10:21:53 +02:00
Shayan Hosseini	fff2468aa2	Add resource consume test funcs (#6747 ) ## Problem Building on #5875 to add handy test functions for autoscaling. Resolves #5609 ## Summary of changes This PR makes the following changes to #5875: - Enable `neon_test_utils` extension in the compute node docker image, so we could use it in the e2e tests (as discussed with @kelvich). - Removed test functions related to disk as we don't use them for autoscaling. - Fix the warning with printf-ing unsigned long variables. --------- Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2024-02-14 18:45:05 +00:00
John Spray	f39b0fce9b	Revert #6666 "tests: try to make restored-datadir comparison tests not flaky" (#6751 ) The #6666 change appears to have made the test fail more often. PR https://github.com/neondatabase/neon/pull/6712 should re-instate this change, along with its change to make the overall flow more reliable. This reverts commit `568f91420a`.	2024-02-14 10:57:01 +00:00
Arpad Müller	ee7bbdda0e	Create new metric for directory counts (#6736 ) There is O(n^2) issues due to how we store these directories (#6626), so it's good to keep an eye on them and ensure the numbers stay low. The new per-timeline metric `pageserver_directory_entries_count` isn't perfect, namely we don't calculate it every time we attach the timeline, but only if there is an actual change. Also, it is a collective metric over multiple scalars. Lastly, we only emit the metric if it is above a certain threshold. However, the metric still give a feel for the general size of the timeline. We care less for small values as the metric is mainly there to detect and track tenants with large directory counts. We also expose the directory counts in `TimelineInfo` so that one can get the detailed size distribution directly via the pageserver's API. Related: #6642 , https://github.com/neondatabase/cloud/issues/10273	2024-02-14 02:12:00 +01:00
John Spray	a8eb4042ba	tests: test_secondary_mode_eviction: avoid use of mocked statvfs (#6698 ) ## Problem Test sometimes fails with `used_blocks > total_blocks`, because when using mocked statvfs with the total blocks set to the size of data on disk before starting, we are implicitly asserting that nothing at all can be written to disk between startup and calling statvfs. Related: https://github.com/neondatabase/neon/issues/6511 ## Summary of changes - Use HTTP API to invoke disk usage eviction instead of mocked statvfs	2024-02-13 09:00:50 +02:00
Arpad Müller	a1f37cba1c	Add test that runs the S3 scrubber (#6641 ) In #6079 it was found that there is no test that executes the scrubber. We now add such a test, which does the following things: * create a tenant, write some data * run the scrubber * remove the tenant * run the scrubber again Each time, the scrubber runs the scan-metadata command. Before #6079 we would have errored, now we don't. Fixes #6080	2024-02-12 19:15:21 +01:00
Heikki Linnakangas	e5daf366ac	tests: Remove unnecessary port config with VanillaPostgres class VanillaPostgres constructor prints the "port={port}" line to the config file, no need to do it in the callers. The TODO comment that it would be nice if VanillaPostgres could pick the port by itself is still valid though.	2024-02-11 01:34:31 +02:00
Heikki Linnakangas	d77583c86a	tests: Remove obsolete allowlist entries Commit `9a6c0be823` removed the code that printed these warnings: marking {} as locally complete, while it doesnt exist in remote index No timelines to attach received Remove those warnings from all the allowlists in tests.	2024-02-11 01:34:31 +02:00
Heikki Linnakangas	241dcbf70c	tests: Remove "Running in ..." log message from every CLI call It's always the same directory, the test's "repo" directory.	2024-02-11 01:34:31 +02:00
Heikki Linnakangas	da626fb1fa	tests: Remove "postgres is running on ... branch" messages It seems like useless chatter. The endpoint.start() itself prints a "Running command ... neon_local endpoint start" message too.	2024-02-11 01:34:31 +02:00
John Spray	12b39c9db9	control_plane: add debug APIs for force-dropping tenant/node (#6702 ) ## Problem When debugging/supporting this service, we sometimes need it to just forget about a tenant or node, e.g. because of an issue cleanly tearing them down. For example, if I create a tenant with a PlacementPolicy that can't be scheduled on the nodes we have, we would never be able to schedule it for a DELETE to work. ## Summary of changes - Add APIs for dropping nodes and tenants that do no teardown other than removing the entity from the DB and removing any references to it.	2024-02-10 11:56:52 +00:00
Heikki Linnakangas	df5e2729a9	Remove now unused allowlisted errors. I'm not sure when we stopped emitting these, but they don't seem to be needed anymore.	2024-02-10 12:05:02 +02:00
Heikki Linnakangas	0fd3cd27cb	Tighten up the check for garbage after end-of-tar. Turn the warning into an error, if there is garbage after the end of imported tar file. However, it's normal for 'tar' to append extra empty blocks to the end, so tolerate those without warnings or errors.	2024-02-10 12:05:02 +02:00
Sasha Krassovsky	1a4dd58b70	Grant pg_monitor to neon_superuser (#6691 ) ## Problem The people want pg_monitor https://github.com/neondatabase/neon/issues/6682 ## Summary of changes Gives the people pg_monitor	2024-02-09 20:22:53 +00:00
Conrad Ludgate	cbd3a32d4d	proxy: decode username and password (#6700 ) ## Problem usernames and passwords can be URL 'percent' encoded in the connection string URL provided by serverless driver. ## Summary of changes Decode the parameters when getting conn info	2024-02-09 19:22:23 +00:00
Christian Schwarz	ca818c8bd7	fix(test_ondemand_download_timetravel): occasionally fails with slightly higher physical size (#6687 )	2024-02-09 20:09:37 +01:00
Heikki Linnakangas	5239cdc29f	Fix test_vm_bit_clear_on_heap_lock test The test was supposed to reproduce the bug fixed in commit `66fa176cc8`, i.e. that the clearing of the VM bit was not replayed in the pageserver on HEAP_LOCK records. But it was broken in many ways and failed to reproduce the original problem if you reverted the fix: - The comparison of XIDs was broken. The test read the XID in to a variable in python, but it was treated as a string rather than an integer. As a result, e.g. "999" > "1000". - The test accessed the locked tuple too early, in the loop. Accessing it early, before the pg_xact page had been removed, set the hint bits. That masked the problem on subsequent accesses. - The on-demand SLRU download that was introduced in commit `9a9d9beaee` hid the issue. Even though an SLRU segment was removed by Postgres, when it later tried to access it, it could still download it from the pageserver. To ensure that doesn't happen, shorten the GC period and compact and GC aggressively in the test. I also added a more direct check that the VM page is updated, using the get_page_at_lsn() debugging function. Right after locking the row, we now fetch the VM page from pageserver and directly compare it with the VM page in the page cache. They should match. That assertion is more robust to things like on-demand SLRU download that could mask the bug.	2024-02-09 15:56:41 +02:00
Heikki Linnakangas	84a0e7b022	tests: Allow setting shutdown mode separately from 'destroy' flag In neon_local, the default mode is now always 'fast', regardless of 'destroy'. You can override it with the "neon_local endpoint stop --mode=immediate" flag. In python tests, we still default to 'immediate' mode when using the stop_and_destroy() function, and 'fast' with plain stop(). I kept that to avoid changing behavior in existing tests. I don't think existing tests depend on it, but I wasn't 100% certain.	2024-02-09 15:56:41 +02:00
John Spray	8d98981fe5	tests: deflake test_sharding_split_unsharded (#6699 ) ## Problem This test was a subset of the larger sharding test, and it missed the validate() call on workload that was implicitly waiting for a tenant to become active before trying to split it. It could therefore fail to split due to tenant not yet being active. ## Summary of changes - Insert .validate() call, and move the Workload setup to after the check of shard ID (as the shard ID check should pass immediately)	2024-02-09 13:20:04 +00:00
Conrad Ludgate	ea089dc977	proxy: add per query array mode flag (#6678 ) ## Problem Drizzle needs to be able to configure the array_mode flag per query. ## Summary of changes Adds an array_mode flag to the query data json that will otherwise default to the header flag.	2024-02-09 10:29:20 +00:00
John Spray	951c9bf4ca	control_plane: fix shard splitting on unsharded tenant (#6689 ) ## Problem Previous test started with a new-style TenantShardId with a non-zero ShardCount. We also need to handle the case of a ShardCount() (aka `unsharded`) parent shard. A followup PR will refactor ShardCount to make its inner value private and thereby make this kind of mistake harder ## Summary of changes - Fix a place we were incorrectly treating a ShardCount as a number of shards rather than as thing that can be zero or the number of shards. - Add a test for this case.	2024-02-09 10:12:40 +00:00
Heikki Linnakangas	568f91420a	tests: try to make restored-datadir comparison tests not flaky (#6666 ) This test occasionally fails with a difference in "pg_xact/0000" file between the local and restored datadirs. My hypothesis is that something changed in the database between the last explicit checkpoint and the shutdown. I suspect autovacuum, it could certainly create transactions. To fix, be more precise about the point in time that we compare. Shut down the endpoint first, then read the last LSN (i.e. the shutdown checkpoint's LSN), from the local disk with pg_controldata. And use exactly that LSN in the basebackup. Closes #559. I'm proposing this as an alternative to https://github.com/neondatabase/neon/pull/6662.	2024-02-09 11:34:15 +02:00
Joonas Koivunen	a18aa14754	test: shutdown endpoints before deletion (#6619 ) this avoids a page_service error in the log sometimes. keeping the endpoint running while deleting has no function for this test.	2024-02-09 09:01:07 +00:00
John Spray	e8d2843df6	storage controller: improved handling of node availability on restart (#6658 ) - Automatically set a node's availability to Active if it is responsive in startup_reconcile - Impose a 5s timeout of HTTP request to list location conf, so that an unresponsive node can't hang it for minutes - Do several retries if the request fails with a retryable error, to be tolerant of concurrent pageserver & storage controller restarts - Add a readiness hook for use with k8s so that we can tell when the startup reconciliaton is done and the service is fully ready to do work. - Add /metrics to the list of un-authenticated endpoints (this is unrelated but we're touching the line in this PR already, and it fixes auth error spam in deployed container.) - A test for the above. Closes: #6670	2024-02-08 18:00:53 +00:00
John Spray	af91a28936	pageserver: shard splitting (#6379 ) ## Problem One doesn't know at tenant creation time how large the tenant will grow. We need to be able to dynamically adjust the shard count at runtime. This is implemented as "splitting" of shards into smaller child shards, which cover a subset of the keyspace that the parent covered. Refer to RFC: https://github.com/neondatabase/neon/pull/6358 Part of epic: #6278 ## Summary of changes This PR implements the happy path (does not cleanly recover from a crash mid-split, although won't lose any data), without any optimizations (e.g. child shards re-download their own copies of layers that the parent shard already had on local disk) - Add `/v1/tenant/:tenant_shard_id/shard_split` API to pageserver: this copies the shard's index to the child shards' paths, instantiates child `Tenant` object, and tears down parent `Tenant` object. - Add `splitting` column to `tenant_shards` table. This is written into an existing migration because we haven't deployed yet, so don't need to cleanly upgrade. - Add `/control/v1/tenant/:tenant_id/shard_split` API to attachment_service, - Add `test_sharding_split_smoke` test. This covers the happy path: future PRs will add tests that exercise failure cases.	2024-02-08 15:35:13 +00:00
Anna Khanova	c63e3e7e84	Proxy: improve http-pool (#6577 ) ## Problem The password check logic for the sql-over-http is a bit non-intuitive. ## Summary of changes 1. Perform scram auth using the same logic as for websocket cleartext password. 2. Split establish connection logic and connection pool. 3. Parallelize param parsing logic with authentication + wake compute. 4. Limit the total number of clients	2024-02-08 12:57:05 +01:00
Sasha Krassovsky	7b49e5e5c3	Remove compute migrations feature flag (#6653 )	2024-02-07 07:55:55 -09:00
John Spray	090a789408	storage controller: use PUT instead of POST (#6659 ) This was a typo, the server expects PUT.	2024-02-07 13:24:10 +00:00
Arpad Müller	f7516df6c1	Pass timestamp as a datetime (#6656 ) This saves some repetition. I did this in #6533 for `tenant_time_travel_remote_storage` already.	2024-02-07 12:56:53 +01:00
Konstantin Knizhnik	f3d7d23805	Some small WAL records can write a lot of data to KV storage, so perform checkpoint check more frequently (#6639 ) ## Problem See https://neondb.slack.com/archives/C04DGM6SMTM/p1707149618314539?thread_ts=1707081520.140049&cid=C04DGM6SMTM ## Summary of changes Perform checkpoint check after processing `ingest_batch_size` (default 100) WAL records. ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-02-07 08:47:19 +02:00
Alexander Bayandin	9f75da7c0a	test_lazy_startup: fix statement_timeout setting (#6654 ) ## Problem Test `test_lazy_startup` is flaky[0], sometimes (pretty frequently) it fails with `canceling statement due to statement timeout`. - [0] https://neon-github-public-dev.s3.amazonaws.com/reports/main/7803316870/index.html#suites/355b1a7a5b1e740b23ea53728913b4fa/7263782d30986c50/history ## Summary of changes - Fix setting `statement_timeout` setting by reusing a connection for all queries. - Also fix label (`lazy`, `eager`) assignment - Split `test_lazy_startup` into two, by `slru` laziness and make tests smaller	2024-02-07 00:31:26 +00:00
John Spray	6297843317	tests: flakiness fixes in pageserver tests (#6632 ) Fix several test flakes: - test_sharding_service_smoke had log failures on "Dropped LSN updates" - test_emergency_mode had log failures on a deletion queue shutdown check, where the check was incorrect because it was expecting channel receiver to stay alive after cancellation token was fired. - test_secondary_mode_eviction had racing heatmap uploads because the test was using a live migration hook to set up locations, where that migration was itself uploading heatmaps and generally making the situation more complex than it needed to be. These are the failure modes that I saw when spot checking the last few failures of each test. This will mostly/completely address #6511, but I'll leave that ticket open for a couple days and then check if either of the tests named in that ticket are flaky. Related #6511	2024-02-06 12:49:41 +00:00
John Spray	cb7c89332f	control_plane: fix tenant GET, clean up endpoints (#6553 ) Cleanups from https://github.com/neondatabase/neon/pull/6394 - There was a rogue `*` breaking the `GET /tenant/:tenant_id`, which passes through to shard zero - There was a duplicate migrate endpoint - There are un-prefixed API endpoints that were only needed for compat tests and can now be removed.	2024-02-05 14:29:05 +00:00
Joonas Koivunen	5e8deca268	metrics: remove broken tenants (#6586 ) Before tenant migration it made sense to leak broken tenants in the metrics until restart. Nowdays it makes less sense because on cancellations we set the tenant broken. The set metric still allows filterable alerting. Fixes: #6507	2024-02-05 14:49:35 +02:00
Joonas Koivunen	70f646ffe2	More logging fixes (#6584 ) I was on-call this week, these would had made me understand more/faster of the system: - move stray attaching start logging inside the span it starts, add generation - log ancestor timeline_id or bootstrapping in the beginning of timeline creation	2024-02-05 09:34:03 +02:00
Arpad Müller	0ac2606c8a	S3 restore test: Use a workaround to enable moto's self-copy support (#6594 ) While working on https://github.com/getmoto/moto/pull/7303 I discovered that if you enable bucket encryption, moto allows self-copies. So we can un-ignore the test. I tried it out locally, it works great. Followup of #6533, part of https://github.com/neondatabase/cloud/issues/8233	2024-02-02 23:45:57 +01:00
Sasha Krassovsky	2fd8e24c8f	Switch sleeps to wait_until (#6575 ) ## Problem I didn't know about `wait_until` and was relying on `sleep` to wait for stuff. This caused some tests to be flaky. https://github.com/neondatabase/neon/issues/6561 ## Summary of changes Switch to `wait_until`, this should make it tests less flaky	2024-02-02 21:32:40 +00:00
Heikki Linnakangas	c9876b0993	Fix double-free bug in walredo process. (#6534 ) At the end of ApplyRecord(), we called pfree on the decoded record, if it was "oversized". However, we had alread linked it to the "decode queue" list in XLogReaderState. If we later called XLogBeginRead(), it called ResetDecoder and tried to free the same record again. The conditions to hit this are: - a large WAL record (larger than aboue 64 kB I think, per DEFAULT_DECODE_BUFFER_SIZE), and - another WAL record processed by the same WAL redo process after the large one. I think the reason we haven't seen this earlier is that you don't get WAL records that large that are sent to the WAL redo process, except when logical replication is enabled. Logical replication adds data to the WAL records, making them larger. To fix, allocate the buffer ourselves, and don't link it to the decode queue. Alternatively, we could perhaps have just removed the pfree(), but frankly I'm a bit scared about the whole queue thing.	2024-02-02 21:49:11 +02:00
John Spray	786e9cf75b	control_plane: implement HTTP compute hook for attachment service (#6471 ) ## Problem When we change which physical pageservers a tenant is attached to, we must update the control plane so that it can update computes. This will be done via an HTTP hook, as described in https://www.notion.so/neondatabase/Sharding-Service-Control-Plane-interface-6de56dd310a043bfa5c2f5564fa98365#1fe185a35d6d41f0a54279ac1a41bc94 ## Summary of changes - Optional CLI args `--control-plane-jwt-token` and `-compute-hook-url` are added. If these are set, then we will use this HTTP endpoint, instead of trying to use neon_local LocalEnv to update compute configuration. - Implement an HTTP-driven version of ComputeHook that calls into the configured URL - Notify for all tenants on startup, to ensure that we don't miss notifications if we crash partway through a change, and carry a `pending_compute_notification` flag at runtime to allow notifications to fail without risking never sending the update. - Add a test for all this One might wonder: why not do a "forever" retry for compute hook notifications, rather than carrying a flag on the shard to call reconcile() again later. The reason is that we will later limit concurreny of reconciles, when dealing with larger numbers of shards, and if reconcile is stuck waiting for the control plane to accept a notification request, it could jam up the whole system and prevent us making other changes. Anyway: from the perspective of the outside world, we _do_ retry forever, but we don't retry forever within a given Reconciler lifetime. The `pending_compute_notification` logic is predicated on later adding a background task that just calls `Service::reconcile_all` on a schedule to make sure that anything+everything that can fail a Reconciler::reconcile call will eventually be retried.	2024-02-02 19:22:03 +00:00
John Spray	2e5eab69c6	tests: remove test_gc_cutoff (#6587 ) This test became flaky when postgres retry handling was fixed to use backoff delays -- each iteration in this test's loop was taking much longer because pgbench doesn't fail until postgres has given up on retrying to the pageserver. We are just removing it, because the condition it tests is no longer risky: we reload all metadata from remote storage on restart, so crashing directly between making local changes and doing remote uploads isn't interesting any more. Closes: https://github.com/neondatabase/neon/issues/2856 Closes: https://github.com/neondatabase/neon/issues/5329	2024-02-02 18:20:18 +00:00
Joonas Koivunen	caf868e274	test: assert we eventually free space (#6536 ) in `test_statvfs_pressure_{usage,min_avail_bytes}` we now race against initial logical size calculation on-demand downloading the layers. first wait out the initial logical sizes, then change the final asserts to be "eventual", which is not great but it is faster than failing and retrying. this issue seems to happen only in debug mode tests. Fixes: #6510	2024-02-02 19:46:47 +02:00
Arpad Müller	48b05b7c50	Add a time_travel_remote_storage http endpoint (#6533 ) Adds an endpoint to the pageserver to S3-recover an entire tenant to a specific given timestamp. Required input parameters: * `travel_to`: the target timestamp to recover the S3 state to * `done_if_after`: a timestamp that marks the beginning of the recovery process. retries of the query should keep this value constant. it must be after `travel_to`, and also after any changes we want to revert, and must represent a point in time before the endpoint is being called, all of these time points in terms of the time source used by S3. these criteria need to hold even in the face of clock differences, so I recommend waiting a specific amount of time, then taking `done_if_after`, then waiting some amount of time again, and only then issuing the request. Also important to note: the timestamps in S3 work at second accuracy, so one needs to add generous waits before and after for the process to work smoothly (at least 2-3 seconds). We ignore the added test for the mocked S3 for now due to a limitation in moto: https://github.com/getmoto/moto/issues/7300 . Part of https://github.com/neondatabase/cloud/issues/8233	2024-02-02 14:52:12 +01:00

1 2 3 4 5 ...

1071 Commits