rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-05 20:42:54 +00:00

Author	SHA1	Message	Date
John Spray	1b9a27d6e3	tests: reinstate test_bulk_insert (#8683 ) ## Problem This test was disabled. ## Summary of changes - Remove the skip marker. - Explicitly avoid doing compaction & gc during checkpoints (the default scale doesn't do anything here, but when experimenting with larger scales it messes things up) - Set a data size that gives a ~20s runtime on a Hetzner dev machine, previous one gave very noisy results because it was so small For reference on a Hetzner AX102: ``` ------------------------------ Benchmark results ------------------------------- test_bulk_insert[neon-release-pg16].insert: 25.664 s test_bulk_insert[neon-release-pg16].pageserver_writes: 5,428 MB test_bulk_insert[neon-release-pg16].peak_mem: 577 MB test_bulk_insert[neon-release-pg16].size: 0 MB test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB test_bulk_insert[neon-release-pg16].num_files_uploaded: 8 test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB test_bulk_insert[neon-release-pg16].wal_recovery: 25.373 s test_bulk_insert[neon-release-pg16].compaction: 0.035 s ```	2024-08-12 13:33:09 +01:00
Vlad Lazar	f5cef7bf7f	storcon: skip draining shard if its secondary is lagging too much (#8644 ) ## Problem Migrations of tenant shards with cold secondaries are holding up drains during production deployments. ## Summary of changes If a secondary location is lagging by more than 256MiB (configurable, but that's the default), then skip cutting it over to the secondary as part of the node drain.	2024-08-09 15:45:07 +01:00
John Spray	b7beaa0fd7	tests: improve stability of `test_storage_controller_many_tenants` (#8607 ) ## Problem The controller scale test does random migrations. These mutate secondary locations, and therefore can cause secondary optimizations to happen in the background, violating the test's expectation that consistency_check will work as there are no reconciliations running. Example: https://neon-github-public-dev.s3.amazonaws.com/reports/main/10247161379/index.html#suites/07874de07c4a1c9effe0d92da7755ebf/6316beacd3fb3060/ ## Summary of changes - Only migrate to existing secondary locations, not randomly picked nodes, so that we can do a fast reconcile_until_idle (otherwise reconcile_until_idle takes a long time to create new secondary locations). - Do a reconcile_until_idle before consistency_check.	2024-08-06 12:58:33 +01:00
Alexander Bayandin	e6e578821b	CI(benchmarking): set pub/sub projects for LR tests (#8483 ) ## Problem > Currently, long-running LR tests recreate endpoints every night. We'd like to have a long-running buildup of history to exercise the pageserver in this case (instead of "unit-testing" the same behavior every night). Closes #8317 ## Summary of changes - Update Postgres version for replication tests - Set `BENCHMARK_PROJECT_ID_PUB`/`BENCHMARK_PROJECT_ID_SUB` env vars to projects that were created for this purpose --------- Co-authored-by: Sasha Krassovsky <krassovskysasha@gmail.com>	2024-08-05 22:06:47 +00:00
Alex Chi Z.	d6c79b77df	test(pageserver): add test_gc_feedback_with_snapshots (#8474 ) should be working after https://github.com/neondatabase/neon/pull/8328 gets merged. Part of https://github.com/neondatabase/neon/issues/8002 adds a new perf benchmark case that ensures garbage can be collected with branches --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-07-31 17:55:19 -04:00
Vlad Lazar	9c5ad21341	storcon: make heartbeats restart aware (#8222 ) ## Problem Re-attach blocks the pageserver http server from starting up. Hence, it can't reply to heartbeats until that's done. This makes the storage controller mark the node off-line (not good). We worked around this by setting the interval after which nodes are marked offline to 5 minutes. This isn't a long term solution. ## Summary of changes * Introduce a new `NodeAvailability` state: `WarmingUp`. This state models the following time interval: * From receiving the re-attach request until the pageserver replies to the first heartbeat post re-attach * The heartbeat delta generator becomes aware of this state and uses a separate longer interval * Flag `max-warming-up-interval` now models the longer timeout and `max-offline-interval` the shorter one to match the names of the states Closes https://github.com/neondatabase/neon/issues/7552	2024-07-25 14:09:12 +01:00
John Spray	842c3d8c10	tests: simplify code around unstable `test_basebackup_with_high_slru_count` (#8477 ) ## Problem In `test_basebackup_with_high_slru_count`, the pageserver is sometimes mysteriously hanging on startup, having been started+stopped earlier in the test setup while populating template tenant data. - #7586 We can't see why this is hanging in this particular test. The test does some weird stuff though, like attaching a load of broken tenants and then doing a SIGQUIT kill of a pageserver. ## Summary of changes - Attach tenants normally instead of doing a failpoint dance to attach them as broken - Shut the pageserver down gracefully during init instead of using immediate mode - Remove the "sequential" variant of the unstable test, as this is going away soon anyway - Log before trying to acquire lock file, so that if it hangs we have a clearer sense of if that's really where it's hanging. It seems like it is, but that code does a non-blocking flock so it's surprising.	2024-07-24 11:26:24 +01:00
John Spray	ebda667ef8	tests: more generous memory allowance in test_compaction_l0_memory (#8446 ) ## Problem This test is new, the limit was set experimentally and it turns out the memory consumption in CI runs varies more than expected. Example failure: https://neon-github-public-dev.s3.amazonaws.com/reports/main/10010912745/index.html#suites/9eebd1154fe19f9311ca7613f38156a1/82e40cf86a243ad5/	2024-07-22 11:50:30 +01:00
Christian Schwarz	a2d170b6d0	NeonEnv.from_repo_dir: use storage_controller_db instead of `attachments.json` (#8382 ) When `NeonEnv.from_repo_dir` was introduced, storage controller stored its state exclusively in `attachments.json`. Since then, it has moved to using Postgres, which stores its state in `storage_controller_db`. But `NeonEnv.from_repo_dir` wasn't adjusted to do this. This PR rectifies the situation. Context for this is failures in `test_pageserver_characterize_throughput_with_n_tenants` CF: https://neondb.slack.com/archives/C033RQ5SPDH/p1721035799502239?thread_ts=1720901332.293769&cid=C033RQ5SPDH Notably, `from_repo_dir` is also used by the backwards- and forwards-compatibility tests. Thus, the changes in this PR affect those tests as well. However, it turns out that the compatibility snapshot already contains the `storage_controller_db`. Thus, it should just work and in fact we can remove hacks like `fixup_storage_controller`. Follow-ups created as part of this work: * https://github.com/neondatabase/neon/issues/8399 * https://github.com/neondatabase/neon/issues/8400	2024-07-18 10:56:07 +02:00
John Spray	975f8ac658	tests: add test_compaction_l0_memory (#8403 ) This test reproduces the case of a writer creating a deep stack of L0 layers. It uses realistic layer sizes and writes several gigabytes of data, therefore runs as a performance test although it is validating memory footprint rather than performance per se. It acts as a regression test for two recent fixes: - https://github.com/neondatabase/neon/pull/8401 - https://github.com/neondatabase/neon/pull/8391 In future it will demonstrate the larger improvement of using a k-merge iterator for L0 compaction (#8184) This test can be extended to enforce limits on the memory consumption of other housekeeping steps, by restarting the pageserver and then running other things to do the same "how much did RSS increase" measurement.	2024-07-17 17:35:27 +00:00
Sasha Krassovsky	7eb37fea26	Allow reusing projects between runs of logical replication benchmarks (#8393 )	2024-07-15 14:55:57 -07:00
Peter Bendel	c11b9cb43d	Run Performance bench on more platforms (#8312 ) ## Problem https://github.com/neondatabase/cloud/issues/14721 ## Summary of changes add one more platform to benchmarking job `57535c039c/.github/workflows/benchmarking.yml (L57C3-L126)` Run with pg 16, provisioner k8-neonvm by default on the new platform. Adjust some test cases to - not depend on database client <-> database server latency by pushing loops into server side pl/pgSQL functions - increase statement and test timeouts First successful run of these job steps https://github.com/neondatabase/neon/actions/runs/9869817756/job/27254280428	2024-07-11 10:07:12 +01:00
Tristan Partin	1c57f6bac3	Add long running replication tests These tests will help verify that replication, both physical and logical, works as expected in Neon. Co-authored-by: Sasha Krassovsky <sasha@neon.tech>	2024-07-08 07:30:22 -07:00
Tristan Partin	2a3410d1c3	Hide import behind TYPE_CHECKING No need to import it if we aren't type checking anything.	2024-07-08 07:30:22 -07:00
John Spray	c9e6dd45d3	pageserver: downgrade stale generation messages to INFO (#8256 ) ## Problem When generations were new, these messages were an important way of noticing if something unexpected was going on. We found some real issues when investigating tests that unexpectedly tripped them. As time has gone on, this code is now pretty battle-tested, and as we do more live migrations etc, it's fairly normal to see the occasional message from a node with a stale generation. At this point the cognitive load on developers to selectively allow-list these logs outweighs the benefit of having them at warn severity. Closes: https://github.com/neondatabase/neon/issues/8080 ## Summary of changes - Downgrade "Dropped remote consistent LSN updates" and "Dropping stale deletions" messages to INFO - Remove all the allow-list entries for these logs.	2024-07-04 15:05:41 +01:00
Vlad Lazar	bbb2fa7cdd	tests: perform graceful rolling restarts in storcon scale test (#8173 ) ## Problem Scale test doesn't exercise drain & fill. ## Summary of changes Make scale test exercise drain & fill	2024-07-04 06:04:19 +01:00
Peter Bendel	392a58bdce	add pagebench test cases for periodic pagebench on dedicated hardware (#8233 ) we want to run some specific pagebench test cases on dedicated hardware to get reproducible results run1: 1 client per tenant => characterize throughput with n tenants. - 500 tenants - scale 13 (200 MB database) - 1 hour duration - ca 380 GB layer snapshot files run2.singleclient: 1 client per tenant => characterize latencies run2.manyclient: N clients per tenant => characterize throughput scalability within one tenant. - 1 tenant with 1 client for latencies - 1 tenant with 64 clients because typically for a high number of connections we recommend the connection pooler which by default uses 64 connections (for scalability) - scale 136 (2048 MB database) - 20 minutes each	2024-07-03 16:22:33 +00:00
Alex Chi Z	04b2ac3fed	test: use aux file v2 policy in benchmarks (#8174 ) Use aux file v2 in benchmarks. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-06-26 20:33:15 +00:00
Alex Chi Z	9b98823d61	bottom-most-compaction: use in test_gc_feedback + fix bugs (#8103 ) Adds manual compaction trigger; add gc compaction to test_gc_feedback Part of https://github.com/neondatabase/neon/issues/8002 ``` test_gc_feedback[debug-pg15].logical_size: 50 Mb test_gc_feedback[debug-pg15].physical_size: 2269 Mb test_gc_feedback[debug-pg15].physical/logical ratio: 44.5302 test_gc_feedback[debug-pg15].max_total_num_of_deltas: 7 test_gc_feedback[debug-pg15].max_num_of_deltas_above_image: 2 test_gc_feedback[debug-pg15].logical_size_after_bottom_most_compaction: 50 Mb test_gc_feedback[debug-pg15].physical_size_after_bottom_most_compaction: 287 Mb test_gc_feedback[debug-pg15].physical/logical ratio after bottom_most_compaction: 5.6312 test_gc_feedback[debug-pg15].max_total_num_of_deltas_after_bottom_most_compaction: 4 test_gc_feedback[debug-pg15].max_num_of_deltas_above_image_after_bottom_most_compaction: 1 ``` ## Summary of changes * Add the manual compaction trigger * Use in test_gc_feedback * Add a guard to avoid running it with retain_lsns * Fix: Do `schedule_compaction_update` after compaction * Fix: Supply deltas in the correct order to reconstruct value --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-06-25 23:00:14 +00:00
John Spray	1ea5d8b132	tests: accommodate some messages that can fail tests (#8144 ) ## Problem - `test_storage_controller_many_tenants` can fail with warnings in the storage controller about tenant creation holding a lock for too long, because this test stresses the machine running the test with many concurrent timeline creations - `test_tenant_delete_smoke` can fail when synthetic remote storage errors show up ## Summary of changes - tolerate warnings about slow timeline creation in test_storage_controller_many_tenants - tolerate both possible errors during error_tolerant_delete	2024-06-24 17:03:53 +00:00
John Spray	15728be0e1	pageserver: always detach before deleting (#8082 ) In #7957 we enabled deletion without attachment, but retained the old-style deletion (return 202, delete in background) for attached tenants. In this PR, we remove the old-style deletion path, such that if the tenant delete API is invoked while a tenant is attached, it is simply detached before completing the deletion. This intentionally doesn't rip out all the old deletion code: in case a deletion was in progress at time of upgrade, we keep around the code for finishing it for one release cycle. The rest of the code removal happens in https://github.com/neondatabase/neon/pull/8091 Now that deletion will always be via the new path, the new path is also updated to use some retries around remote storage operations, to avoid tripping up the control plane with 500s if S3 has an intermittent issue.	2024-06-21 15:39:19 +01:00
Peter Bendel	82266a252c	Allow longer timeout for starting pageserver, safe keeper and storage controller in test cases to make test cases less flaky (#8079 ) ## Problem see https://github.com/neondatabase/neon/issues/8070 ## Summary of changes the neon_local subcommands to - start neon - start pageserver - start safekeeper - start storage controller get a new option -t=xx or --start-timeout=xx which allows to specify a longer timeout in seconds we wait for the process start. This is useful in test cases where the pageserver has to read a lot of layer data, like in pagebench test cases. In addition we exploit the new timeout option in the python test infrastructure (python fixtures) and modify the flaky testcase to increase the timeout from 10 seconds to 1 minute. Example from the test execution ```bash RUST_BACKTRACE=1 NEON_ENV_BUILDER_USE_OVERLAYFS_FOR_SNAPSHOTS=1 DEFAULT_PG_VERSION=15 BUILD_TYPE=release ./scripts/pytest test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py ... 2024-06-19 09:29:34.590 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local storage_controller start --start-timeout=60s" 2024-06-19 09:29:36.365 INFO [broker.py:34] starting storage_broker to listen incoming connections at "127.0.0.1:15001" 2024-06-19 09:29:36.365 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local pageserver start --id=1 --start-timeout=60s" 2024-06-19 09:29:36.366 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local safekeeper start 1 --start-timeout=60s" ```	2024-06-21 10:36:12 +00:00
Peter Bendel	56da624870	allow storage_controller error during pagebench (#8109 ) ## Problem `test_pageserver_max_throughput_getpage_at_latest_lsn` is a pagebench testcase which creates several tenants/timelines to verify pageserver performance. The test swaps environments around in the tenant duplication stage, so the storage controller uses two separate db instances (one in the duplication stage and another one in the benchmarking stage). In the benchmarking stage, the storage controller starts without any knowledge of nodes, but with knowledge of tenants (via attachments.json). When we re-attach and attempt to update the scheduler stats, the scheduler rightfully complains about the node not being known. The setup should preserve the storage controller across the two envs, but I think it's fine to just allow-list the error in this case. ## Summary of changes add the error message `2024-06-19T09:38:27.866085Z ERROR Scheduler missing node 1` to the list of allowed errors for storage_controller	2024-06-19 13:04:29 +00:00
Peter Bendel	46210035c5	add halfvec indexing and queries to periodic pgvector performance tests (#8057 ) ## Problem halfvec data type was introduced in pgvector 0.7.0 and is popular because it allows smaller vectors, smaller indexes and potentially better performance. So far we have not tested halfvec in our periodic performance tests. This PR adds halfvec indexing and halfvec queries to the test.	2024-06-14 18:36:50 +02:00
Peter Bendel	9ba9f32dfe	Reactivate page bench test in CI after ignoring CopyFail error in pageserver (#8023 ) ## Problem Testcase page bench test_pageserver_max_throughput_getpage_at_latest_lsn had been deactivated because it was flaky. We now ignore copy fail error messages like in `270d3be507/test_runner/regress/test_pageserver_getpage_throttle.py (L17-L20)` and want to reactivate it to see if it is still flaky ## Summary of changes - reactivate the test in CI - ignore CopyFail error message during page bench test cases ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist	2024-06-12 16:10:57 +02:00
Alex Chi Z	3e63d0f9e0	test(pageserver): quantify compaction outcome (#7867 ) A simple API to collect some statistics after compaction to easily understand the result. The tool reads the layer map and analyzes it range by range instead of doing single-key operations, which is more efficient than doing a benchmark to collect the result. It currently computes two key metrics: * Latest data access efficiency, which finds how many delta layers / image layers the system needs to iterate before returning any key in a key range. * (Approximate) PiTR efficiency, as in https://github.com/neondatabase/neon/issues/7770, which is simply the number of delta files in the range. The reason behind that is, assuming no image layer is created, PiTR efficiency is simply the cost of collecting records from the delta layers, and the replay time. Number of delta files (or in the future, estimated size of reads) is a simple yet efficient way of estimating how much effort the page server needs to reconstruct a page. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-06-10 10:42:13 +02:00
Peter Bendel	f9f69a2ee7	clarify how to load the dbpedia vector embeddings into a postgres database (#7894 ) ## Problem Improve the readme for the data load step in the pgvector performance test.	2024-05-28 17:21:09 +03:00
Peter Bendel	fabeff822f	Performance test for pgvector HNSW index build and queries (#7873 ) ## Problem We want to regularly verify the performance of pgvector HNSW parallel index builds and parallel similarity search using HNSW indexes. The first release that considerably improved the index-build parallelism was pgvector 0.7.0 and we want to make sure that we do not regress by our neon compute VM settings (swap, memory over commit, pg conf etc.) ## Summary of changes Prepare a Neon project with 1 million openAI vector embeddings (vector size 1536). Run HNSW indexing operations in the regression test for the various distance metrics. Run similarity queries using pgbench with 100 concurrent clients. I have also added the relevant metrics to the grafana dashboards pgbench and olape --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2024-05-28 11:05:33 +00:00
John Spray	c84656a53e	pageserver: implement auto-splitting (#7681 ) ## Problem Currently tenants are only split into multiple shards if a human being calls the API to do it. Issue: #7388 ## Summary of changes - Add a pageserver API for returning the top tenants by size - Add a step to the controller's background loop where if there is no reconciliation or optimization to be done, it looks for things to split. - Add a test that runs pgbench on many tenants concurrently, and checks that splitting happens as expected as tenants grow, without interrupting the client I/O. This PR is quite basic: there is a tasklist in https://github.com/neondatabase/neon/issues/7388 for further work. This PR is meant to be safe (off by default), and sufficient to enable our staging environment to run lots of sharded tenants without a human having to set them up.	2024-05-17 16:01:24 +00:00
Jure Bajic	affc18f912	Add performance regress `test_ondemand_download_churn.py` (#7242 ) Add performance regress test for on-demand download throughput. Closes https://github.com/neondatabase/neon/issues/7146 Co-authored-by: Christian Schwarz <christian@neon.tech> Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2024-05-15 18:41:12 +02:00
Joonas Koivunen	d9dcbffac3	python: allow using allowed_errors.py (#7719 ) See #7718. Fix it by renaming all `types.py` to `common_types.py`. Additionally, add an advert for using `allowed_errors.py` to test any added regex.	2024-05-13 15:16:23 +03:00
John Spray	107f535294	storage controller: fix handling of tenants with no timelines during scheduling optimization (#7673 ) ## Problem Storage controller was using a zero layer count in SecondaryProgress as a proxy for "not initialized". However, in tenants with zero timelines (a legitimate state), the layer count remains zero forever. This caused https://github.com/neondatabase/neon/pull/7583 to destabilize the storage controller scale test, which creates lots of tenants, some of which don't get any timelines. ## Summary of changes - Use a None mtime instead of zero layer count to determine if a SecondaryProgress should be ignored. - Adjust the test to use a shorter heatmap upload period to let it proceed faster while waiting for scheduling optimizations to complete.	2024-05-09 12:33:09 +01:00
Christian Schwarz	308227fa51	remove `neon_local --pageserver-config-override` (#7614 ) Preceding PR https://github.com/neondatabase/neon/pull/7613 reduced the usage of `--pageserver-config-override`. This PR builds on top of that work and fully removes the `neon_local --pageserver-config-override`. Tests that need a non-default `pageserver.toml` control it using two options: 1. Specify `NeonEnvBuilder.pageserver_config_override` before `NeonEnvBuilder.init_start()`. This uses a new `neon_local init --pageserver-config` flag. 2. After `init_start()`: `env.pageserver.stop()` + `NeonPageserver.edit_config_toml()` + `env.pageserver.start()` A few test cases were using `env.pageserver.start(overrides=("--pageserver-config-override...",))`. I changed them to use one of the options above. Future Work ----------- The `neon_local init --pageserver-config` flag still uses `pageserver --config-override` under the hood. In the future, neon_local should just write the `pageserver.toml` directly. The `NeonEnvBuilder.pageserver_config_override` field should be renamed to `pageserver_initial_config`. Let's save this churn for a separate refactor commit.	2024-05-07 16:29:59 +00:00
John Spray	a74b60066c	storage controller: test for large shard counts (#7475 ) ## Problem Storage controller was observed to have unexpectedly large memory consumption when loaded with many thousands of shards. This was recently fixed: - https://github.com/neondatabase/neon/pull/7493 ...but we need a general test that the controller is well behaved with thousands of shards. Closes: https://github.com/neondatabase/neon/issues/7460 Closes: https://github.com/neondatabase/neon/issues/7463 ## Summary of changes - Add test test_storage_controller_many_tenants to exercise the system's behaviour with a more substantial workload. This test measures memory consumption and reproduces #7460 before the other changes in this PR. - Tweak reconcile_all's return value to make it nonzero if it spawns no reconcilers, but _would_ have spawned some reconcilers if they weren't blocked by the reconcile concurrency limit. This makes the test's reconcile_until_idle behave as expected (i.e. not complete until the system is nice and calm). - Fix an issue where tenant migrations would leave a spurious secondary location when migrated to some location that was not already their secondary (this was an existing low-impact bug that tripped up the test's consistency checks). On the test with 8000 shards, the resident memory per shard is about 20KiB. This is not really per-shard memory: the primary source of memory growth is the number of concurrent network/db clients we create. With 8000 shards, the test takes 125s to run on my workstation.	2024-04-30 15:21:54 +00:00
macdoos	3b95e8072a	test_runner: replace all `.format()` with f-strings (#7194 )	2024-04-02 14:32:14 +01:00
Alexander Bayandin	3426619a79	test_runner/performance: skip test_bulk_insert (#7238 ) ## Problem `test_bulk_insert` becomes too slow, and it fails constantly: https://github.com/neondatabase/neon/issues/7124 ## Summary of changes - Skip `test_bulk_insert` until it's fixed	2024-03-26 15:10:15 +00:00
Arpad Müller	34fa34d15c	Dump layer map json in test_gc_feedback.py (#7179 ) The layer map json is an interesting file for that test, so dump it to make debugging easier.	2024-03-20 18:39:46 +00:00
Vlad Lazar	3d8830ac35	test_runner: re-enable large slru benchmark (#7125 ) Previously disabled due to https://github.com/neondatabase/neon/issues/7006.	2024-03-14 16:47:32 +00:00
John Spray	89cf714890	tests/neon_local: rename "attachment service" -> "storage controller" (#7087 ) Not a user-facing change, but can break any existing `.neon` directories created by neon_local, as the name of the database used by the storage controller changes. This PR changes all the locations apart from the path of `control_plane/attachment_service` (waiting for an opportune moment to do that one, because it's the most conflict-ish wrt ongoing PRs like #6676 )	2024-03-12 11:36:27 +00:00
Vlad Lazar	0f05ef67e2	pageserver: revert open layer rolling revert (#6962 ) ## Problem We reverted https://github.com/neondatabase/neon/pull/6661 a few days ago. The change led to OOMs in benchmarks followed by large WAL reingests. The issue was that we removed [this code](`d04af08567/pageserver/src/tenant/timeline/walreceiver/walreceiver_connection.rs (L409-L417)`). That call may trigger a roll of the open layer due to the keepalive messages received from the safekeeper. Removing it meant that enforcing of checkpoint timeout became even more lax and led to using up large amounts of memory for the in memory layer indices. ## Summary of changes Piggyback on keep alive messages to enforce checkpoint timeout. This is a hack, but it's exactly what the current code is doing. ## Alternatives Christian, Joonas and myself sketched out a timer based approach [here](https://github.com/neondatabase/neon/pull/6940). While discussing it further, it became obvious that's also a bit of a hack and not the desired end state. I chose not to take that further since it's not what we ultimately want and it'll be harder to rip out. Right now it's unclear what the ideal system behaviour is: * early flushing on memory pressure, or ... * detaching tenants on memory pressure	2024-03-07 19:53:10 +00:00
Joonas Koivunen	602a4da9a5	bench: run branch_creation_many at 500, seeded (#6959 ) We have a benchmark for creating a lot of branches, but it does random things, and the branch count is not the largest maximum we aim to support. If this PR would stabilize the benchmark total duration it means that there are some structures which are very much slower than others. Then we should add a seed-outputting variant to help find and reproduce such cases. Additionally, record for the benchmark: - shutdown duration - startup metrics once done (on restart) - duration of first compaction completion via debug logging	2024-03-07 16:23:42 +02:00
Vlad Lazar	2daa2f1d10	test: disable large slru basebackup bench in ci (#7025 ) The test is flaky due to https://github.com/neondatabase/neon/issues/7006.	2024-03-05 15:41:05 +00:00
Vlad Lazar	2b11466b59	pageserver: optimise disk io for vectored get (#6780 ) ## Problem The vectored read path proposed in https://github.com/neondatabase/neon/pull/6576 seems to be functionally correct, but in my testing (see below) it is about 10-20% slower than the naive sequential vectored implementation. ## Summary of changes There's three parts to this PR: 1. Supporting vectored blob reads. This is actually trickier than it sounds because on disk blobs are prefixed with a variable length size header. Since the blobs are not necessarily fixed size, we need to juggle the offsets such that the callers can retrieve the blobs from the resulting buffer. 2. Merge disk read requests issued by the vectored read path up to a maximum size. Again, the merging is complicated by the fact that blobs are not fixed size. We keep track of the begin and end offset of each blob and pass them into the vectored blob reader. In turn, the reader will return a buffer and the offsets at which the blobs begin and end. 3. A benchmark for basebackup requests against tenant with large SLRU block counts is added. This required a small change to pagebench and a new config variable for the pageserver which toggles the vectored get validation. We can probably optimise things further by adding a little bit of concurrency for our IO. In principle, it's as simple as spawning a task which deals with issuing IO and doing the serialisation and handling on the parent task which receives input via a channel.	2024-02-28 12:06:00 +00:00
Christian Schwarz	b6bd75964f	Revert "pageserver: roll open layer in timeline writer (#6661 )" + PR #6842 (#6938 ) This reverts commits `587cb705b8` (PR #6661) and `fcbe9fb184` (PR #6842). Conflicts: pageserver/src/tenant.rs pageserver/src/tenant/timeline.rs The conflicts were with * pageserver: adjust checkpoint distance for sharded tenants (#6852) * pageserver: add vectored get implementation (#6576) Also we had to keep the `allowed_errors` to make `test_forward_compatibility` happy, see the PR thread on GitHub for details.	2024-02-28 11:38:23 +00:00
Christian Schwarz	dedf66ba5b	remove `gc_feedback` mechanism (#6863 ) It's been dead-code-at-runtime for 9 months, let's remove it. We can always re-introduce it at a later point. Came across this while working on #6861, which will touch `time_for_new_image_layer`. This is an opportunity to make that function simpler.	2024-02-26 10:05:24 +01:00
Vlad Lazar	fcbe9fb184	test: adjust checkpoint distance in `test_layer_map` (#6842 ) `587cb705b8` changed the layer rolling logic to more closely obey the `checkpoint_distance` config. Previously, this test was getting layers significantly larger than the 8K it was asking for. Now the payload in the layers is closer to 8K (which means more layers in total). Tweak the `checkpoint_distance` to get a number of layers more reasonable for this test. Note that we still get more layers than before (~8K vs ~5K).	2024-02-20 19:42:54 +00:00
Alexander Bayandin	59c5b374de	test_pageserver_max_throughput_getpage_at_latest_lsn: disable on CI (#6785 ) ## Problem `test_pageserver_max_throughput_getpage_at_latest_lsn` is flaky which makes CI status red pretty frequently. `benchmarks` is not a blocking job (doesn't block `deploy`), so having it red might hide failures in other jobs Ref: https://github.com/neondatabase/neon/issues/6724 ## Summary of changes - Disable `test_pageserver_max_throughput_getpage_at_latest_lsn` on CI until it is fixed	2024-02-16 15:30:04 +00:00
Alexander Bayandin	9f75da7c0a	test_lazy_startup: fix statement_timeout setting (#6654 ) ## Problem Test `test_lazy_startup` is flaky[0], sometimes (pretty frequently) it fails with `canceling statement due to statement timeout`. - [0] https://neon-github-public-dev.s3.amazonaws.com/reports/main/7803316870/index.html#suites/355b1a7a5b1e740b23ea53728913b4fa/7263782d30986c50/history ## Summary of changes - Fix the `statement_timeout` setting by reusing a connection for all queries. - Also fix label (`lazy`, `eager`) assignment - Split `test_lazy_startup` into two, by `slru` laziness and make tests smaller	2024-02-07 00:31:26 +00:00
Konstantin Knizhnik	9a9d9beaee	Download SLRU segments on demand (#6151 ) ## Problem See https://github.com/neondatabase/cloud/issues/8673 ## Summary of changes Download missed SLRU segments from page server ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2024-01-31 21:39:18 +02:00
Christian Schwarz	fd4cce9417	test_pageserver_max_throughput_getpage_at_latest_lsn: remove n_tenants=100 combination (#6477 ) Need to fix the neon_local timeouts first (https://github.com/neondatabase/neon/issues/6473) and also not run them on every merge, but only nightly: https://github.com/neondatabase/neon/issues/6476	2024-01-25 18:17:53 +00:00

277 Commits