Commit Graph

7147 Commits

Author SHA1 Message Date
Christian Schwarz
c4f03a225c latency recorder sketches 2025-01-23 08:51:24 +01:00
Christian Schwarz
214ce815bc sketch cancellation 2025-01-22 19:19:42 +01:00
Christian Schwarz
f5b2eee23d prototype IoConcurrency propagation 2025-01-22 19:13:34 +01:00
Christian Schwarz
4b2a91cb5a sketch propagation through request context 2025-01-22 19:13:34 +01:00
Christian Schwarz
a6660a2883 without request context 2025-01-22 18:32:02 +01:00
Christian Schwarz
b0b9206908 Merge remote-tracking branch 'origin/main' into vlad/read-path-concurrent-io
Conflicts:
	pageserver/src/tenant/timeline.rs
	test_runner/fixtures/neon_fixtures.py
2025-01-22 14:31:12 +01:00
Christian Schwarz
4298e77f7a run unit tests in both modes 2025-01-22 14:30:01 +01:00
Alexey Kondratov
881e351f69 feat(compute): Allow installing both 0.8.0 and 0.7.4 pgvector (#10345)
## Problem

Both these versions are binary compatible, but the way pgvector
structures the SQL files forbids installing 0.7.4 if you have a 0.8.0
distribution. Yet, some users may need a previous version for backward
compatibility, e.g., when restoring a dump.

See this thread for discussion

https://neondb.slack.com/archives/C04DGM6SMTM/p1735911490242919?thread_ts=1731343604.259169&cid=C04DGM6SMTM

## Summary of changes

Put `vector--0.7.4.sql` file into compute image to allow installing this
version as well.

Tested on staging and it seems to be working as expected:
```sql
select * from pg_available_extensions where name = 'vector';
  name  | default_version | installed_version |                       comment                        
--------+-----------------+-------------------+------------------------------------------------------
 vector | 0.8.0           | (null)            | vector data type and ivfflat and hnsw access methods

create extension vector version '0.7.4';

select * from pg_available_extensions where name = 'vector';
  name  | default_version | installed_version |                       comment                        
--------+-----------------+-------------------+------------------------------------------------------
 vector | 0.8.0           | 0.7.4             | vector data type and ivfflat and hnsw access methods

alter extension vector update;

select * from pg_available_extensions where name = 'vector';
  name  | default_version | installed_version |                       comment                        
--------+-----------------+-------------------+------------------------------------------------------
 vector | 0.8.0           | 0.8.0             | vector data type and ivfflat and hnsw access methods

drop extension vector;
create extension vector;

select * from pg_available_extensions where name = 'vector';
  name  | default_version | installed_version |                       comment                        
--------+-----------------+-------------------+------------------------------------------------------
 vector | 0.8.0           | 0.8.0             | vector data type and ivfflat and hnsw access methods
```

If we find out it's a good approach, we can adopt the same for other
extensions with a stable ABI -- support both `current` and `current - 1`
releases.
2025-01-22 12:38:23 +00:00
Christian Schwarz
b31ce14083 initial logical size calculation: always poll to completion (#10471)
# Refs

- extracted from https://github.com/neondatabase/neon/pull/9353

# Problem

Before this PR, when task_mgr shutdown is signalled, e.g. during
pageserver shutdown or Tenant shutdown, initial logical size calculation
stops polling and drops the future that represents the calculation.

This is against the current policy that we poll all futures to
completion.

This became apparent during development of concurrent IO which warns if
we drop a `Timeline::get_vectored` future that still has in-flight IOs.

We may revise the policy in the future, but, right now initial logical
size calculation is the only part of the codebase that doesn't adhere to
the policy, so let's fix it.

## Code Changes

- make initial logical size calculation sensitive exclusively to `Timeline::cancel`
  - This should be sufficient for all cases of shutdown; the sensitivity
    to task_mgr shutdown is unnecessary.
- this broke the various cancel tests in `test_timeline_size.py`, e.g.,
  `test_timeline_initial_logical_size_calculation_cancellation`
  - the tests would time out because the await point was not sensitive to
    cancellation
  - to fix this, refactor `pausable_failpoint` so that it accepts a
    cancellation token (see the sketch below)
- side note: we _really_ should write our own failpoint library; maybe
  after we get heap-allocated RequestContext, we can plumb failpoints
  through there.
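
For illustration, a minimal sketch of the cancellation-sensitive failpoint
idea, assuming the `fail` crate and `tokio_util::sync::CancellationToken`
(the actual `pausable_failpoint` in the pageserver may differ):

```rust
use tokio_util::sync::CancellationToken;

/// Sketch: evaluate a pause-style failpoint on a blocking thread and race
/// it against a cancellation token, so shutdown never hangs on a paused
/// failpoint.
async fn pausable_failpoint(name: &'static str, cancel: &CancellationToken) {
    // `fail_point!` parks the calling thread while the failpoint is
    // configured as "pause", so it must not run on the async executor.
    let wait = tokio::task::spawn_blocking(move || {
        fail::fail_point!(name);
    });
    tokio::select! {
        // failpoint released (or not configured at all)
        _ = wait => {}
        // shutdown requested: stop awaiting; the blocking task keeps
        // running until the failpoint is released, we just no longer wait
        _ = cancel.cancelled() => {}
    }
}
```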
2025-01-22 12:28:26 +00:00
Christian Schwarz
dbb88cc59e test_get_vectored: don't parametrize inside the test; instead, use spawn_for_test like we do in all the other tests
This is a remnant from the early times of this PR.
2025-01-22 12:55:16 +01:00
Christian Schwarz
b4d87b9dfe fix(tests): actually enable pipelining by default in the test suite (#10472)
## Problem

PR #9993 was supposed to enable `page_service_pipelining` by default for
all `NeonEnv`s, but this was ineffective in our CI environment.

Thus, CI Python-based tests and benchmarks, unless explicitly
configuring pipelining, were still using serial protocol handling.

## Analysis

The root cause was that in our CI environment,
`config.compatibility_neon_binpath` is always truthy.
It's not in local environments, which is why this slipped through in
local testing.

Lesson: always add a log line to pageserver startup and spot-check tests
to ensure the intended default is picked up.

## Summary of changes

Fix it. Since enough time has passed, the compatibility snapshot contains
a recent enough software version that we don't need to worry about
`compatibility_neon_binpath` anymore.

## Future Work

The question of how to add a new default for everything except
compatibility tests, which is what the broken code was supposed to do, is
still unsolved.

Slack discussion:
https://neondb.slack.com/archives/C059ZC138NR/p1737490501941309
2025-01-22 10:10:43 +00:00
Conrad Ludgate
2b49d6ee05 feat: adjust the tonic features to remove axum dependency (#10348)
To help facilitate an upgrade to axum 0.8
(https://github.com/neondatabase/neon/pull/10332#pullrequestreview-2541989619)
this massages the tonic dependency features so that tonic does not
depend on axum.
2025-01-22 09:15:52 +00:00
Christian Schwarz
c5af3c576e the previous patch didn't cover test_version_mismatch; this one is far more universal 2025-01-22 01:03:34 +01:00
Christian Schwarz
3526d9aad3 pass forward compatibility 2025-01-22 00:40:19 +01:00
Christian Schwarz
a501095c5a fixup(commit b2dbc47b31 initial logical size calculation wasn't polled to completion; fix that, to make tests pass)
(requires previous commit)
2025-01-21 23:22:31 +01:00
Christian Schwarz
728052bd2e pausable_failpoint: add ability to provide a cancel flag, similar to what we have for sleep 2025-01-21 23:16:53 +01:00
Erik Grinaker
14e1f89053 pageserver: eagerly notify flush waiters (#10469)
## Problem

Currently, the layer flush loop will continue flushing layers as long as
any are pending, and only notify waiters once there are no further
layers to flush. This can cause waiters to wait longer than necessary,
and potentially starve them if pending layers keep arriving faster than
they can be flushed. The impact of this will increase when we add
compaction backpressure and propagate it up into the WAL receiver.

Extracted from #10405.

## Summary of changes

Break out of the layer flush loop once we've flushed up to the requested
LSN. If further flush requests have arrived in the meantime, flushing
will resume immediately after.
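
For illustration, a sketch of the intended control flow using
`tokio::sync::watch` channels (the names are stand-ins, not the actual
pageserver types):

```rust
use tokio::sync::watch;

type Lsn = u64; // stand-in for the real Lsn type

async fn flush_loop(
    mut request_rx: watch::Receiver<Lsn>,
    flushed_tx: watch::Sender<Lsn>,
) {
    while request_rx.changed().await.is_ok() {
        let requested = *request_rx.borrow_and_update();
        while *flushed_tx.borrow() < requested {
            let flushed_up_to = flush_one_layer(requested).await;
            // Eager notification: wake waiters as soon as their LSN is
            // durable, instead of waiting until no layers are pending.
            flushed_tx.send_replace(flushed_up_to);
        }
        // We break out here even if more layers are pending; if a newer
        // request arrived in the meantime, `changed()` fires immediately
        // and flushing resumes.
    }
}

// Placeholder: pretend one flush makes everything up to `up_to` durable.
async fn flush_one_layer(up_to: Lsn) -> Lsn {
    up_to
}
```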
2025-01-21 22:01:27 +00:00
Erik Grinaker
8a8c656c06 pageserver: add LayerMap::watch_level0_deltas() (#10470)
## Problem

For compaction backpressure, we need a mechanism to signal when
compaction has reduced the L0 delta layer count below the backpressure
threshold.

Extracted from #10405.

## Summary of changes

Add `LayerMap::watch_level0_deltas()` which returns a
`tokio::sync::watch::Receiver` signalling the current L0 delta layer
count.
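
For illustration, a sketch of the channel mechanics (the struct here is a
stand-in, not the real `LayerMap`):

```rust
use tokio::sync::watch;

struct Layer; // stand-in for a delta layer handle

struct LayerMap {
    level0_deltas: Vec<Layer>,
    level0_deltas_tx: watch::Sender<usize>,
}

impl LayerMap {
    fn new() -> Self {
        let (tx, _rx) = watch::channel(0usize);
        Self { level0_deltas: Vec::new(), level0_deltas_tx: tx }
    }

    /// A receiver that always reflects the current L0 delta layer count.
    fn watch_level0_deltas(&self) -> watch::Receiver<usize> {
        self.level0_deltas_tx.subscribe()
    }

    fn insert_level0_delta(&mut self, layer: Layer) {
        self.level0_deltas.push(layer);
        self.level0_deltas_tx.send_replace(self.level0_deltas.len());
    }
}

// Backpressure side: wait until compaction drives the count below the
// threshold.
async fn wait_below_threshold(mut rx: watch::Receiver<usize>, threshold: usize) {
    let _ = rx.wait_for(|count| *count < threshold).await;
}
```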
2025-01-21 21:18:09 +00:00
Erik Grinaker
a75e11cc00 pageserver: return duration from StorageTimeMetricsTimer (#10468)
## Problem

It's sometimes useful to obtain the elapsed duration from a
`StorageTimeMetricsTimer` for purposes beyond just recording it in
metrics (e.g. to log it).

Extracted from #10405.

## Summary of changes

Add `StorageTimeMetricsTimer.elapsed()` and return the duration from
`stop_and_record()`.
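
For illustration, the shape of such a timer (an illustrative sketch, not
the real pageserver struct):

```rust
use std::time::{Duration, Instant};

struct StorageTimeMetricsTimer {
    start: Instant,
    // histogram handle elided for the sketch
}

impl StorageTimeMetricsTimer {
    fn start() -> Self {
        Self { start: Instant::now() }
    }

    /// Elapsed time so far, without consuming the timer.
    fn elapsed(&self) -> Duration {
        self.start.elapsed()
    }

    /// Record into metrics and hand the duration back to the caller,
    /// e.g. so it can also be logged.
    fn stop_and_record(self) -> Duration {
        let duration = self.elapsed();
        // histogram.observe(duration.as_secs_f64());
        duration
    }
}
```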
2025-01-21 20:56:34 +00:00
Christian Schwarz
361210f8dc actually enable concurrent IO (and batching!) by default in the test suite (still need to figure out how to avoid breaking the compat test)
https://neondb.slack.com/archives/C059ZC138NR/p1737490501941309
2025-01-21 21:27:21 +01:00
Christian Schwarz
925dd17fb8 Revert "debug why CI tests don't run with sidecar-task"
This reverts commit a528b325ee.
2025-01-21 20:51:48 +01:00
Christian Schwarz
fe615520dd remove the timing histograms for traversal and walredo, since their meaning and utility are dubious with concurrent IO; https://github.com/neondatabase/neon/pull/9353#discussion_r1924181713
The issue is that get_vectored_reconstruct_data latency means something
very different now with concurrent IO than what it did before, because
all the time we spend on the data blocks is no longer part of the
get_vectored_reconstruct_data().await wall clock time.

GET_RECONSTRUCT_DATA_TIME: all 3 dashboards that use it are in my /personal/christian folder. I guess I'm free to break them 😄
https://github.com/search?q=repo%3Aneondatabase%2Fgrafana-dashboard-export%20pageserver_getpage_get_reconstruct_data_seconds&type=code

RECONSTRUCT_TIME
Used in a couple of dashboards that I think nobody uses:
- Timeline Inspector
- Sharding WAL streaming
- Pageserver
- walredo time throwaway

Vlad agrees with removing them for now.
Maybe in the future we'll add some back.

pageserver_getpage_get_reconstruct_data_seconds -> pageserver_getpage_io_plan_seconds
pageserver_getpage_reconstruct_data_seconds -> pageserver_getpage_io_execute_seconds
2025-01-21 20:32:15 +01:00
Alex Chi Z.
7d4bfcdc47 feat(pageserver): add config items for gc-compaction auto trigger (#10455)
## Problem

part of https://github.com/neondatabase/neon/issues/9114

The automatic trigger is already implemented at
https://github.com/neondatabase/neon/pull/10221 but I need to write some
tests and finish my experiments in staging before I can merge it with
confidence. Given that I have some other patches that will modify the
config items, I'd like to get the config items merged first to reduce
conflicts.

## Summary of changes

* add `l2_lsn` to index_part.json -- below that LSN, data have been
processed by gc-compaction
* add a set of gc-compaction auto trigger control items into the config (see the sketch below)
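
For illustration, a sketch of what such config items could look like as a
serde struct; the field names and defaults here are hypothetical, not the
actual pageserver config keys:

```rust
use serde::Deserialize;

/// Hypothetical shape of the gc-compaction auto-trigger settings.
#[derive(Debug, Deserialize)]
#[serde(default)]
struct GcCompactionSettings {
    /// Master switch for the automatic trigger.
    gc_compaction_enabled: bool,
    /// Trigger once at least this much garbage has accumulated below
    /// the gc cutoff LSN.
    gc_compaction_initial_threshold_kb: u64,
    /// Trigger once the garbage/total ratio exceeds this percentage.
    gc_compaction_ratio_percent: u64,
}

impl Default for GcCompactionSettings {
    fn default() -> Self {
        Self {
            gc_compaction_enabled: false,
            gc_compaction_initial_threshold_kb: 10 * 1024 * 1024, // 10 GiB
            gc_compaction_ratio_percent: 100,
        }
    }
}
```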

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-01-21 19:29:38 +00:00
Christian Schwarz
d6cdf1b13f debug assertion on correct record order: https://github.com/neondatabase/neon/pull/9353#discussion_r1923801121 2025-01-21 20:21:35 +01:00
a-masterov
737888e5c9 Remove the tests for pg_anon (#10382)
## Problem
We are removing the `pg_anon` v1 extension from Neon. So we don't need
to test it anymore and can remove the code for simplicity.
## Summary of changes
The code required for testing `pg_anon` is removed.
2025-01-21 19:17:14 +00:00
Christian Schwarz
b2dbc47b31 initial logical size calculation wasn't polled to completion; fix that, to make tests pass
(see prev commit for stack trace)

CI test failures

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9353/12892883355/index.html#suites/a1c2be32556270764423c495fad75d47/92cacda354b63fd7/
2025-01-21 20:09:39 +01:00
Christian Schwarz
a69dcadba4 test failure
christian@neon-hetzner-dev-christian:[~/src/neon-work-1]: NEON_PAGESERVER_USE_ONE_RUNTIME=current_thread DEFAULT_PG_VERSION=14 BUILD_TYPE=release poetry run pytest -k 'test_ancestor_detach_branched_from[release-pg14-False-True-after]'

2025-01-21T18:42:38.794431Z  WARN initial_size_calculation{tenant_id=cb106e50ddedc20995b0b1bb065ebcd9 shard_id=0000 timeline_id=e362ff10e7c4e116baee457de5c766d9}:logical_size_calculation_task: dropping ValuesReconstructState while some IOs have not been completed num_active_ios=1 sidecar_task_id=None backtrace=   0: <pageserver::tenant::storage_layer::ValuesReconstructState as core::ops::drop::Drop>::drop
             at /home/christian/src/neon-work-1/pageserver/src/tenant/storage_layer.rs:553:24
   1: core::ptr::drop_in_place<pageserver::tenant::storage_layer::ValuesReconstructState>
             at /home/christian/.rustup/toolchains/1.84.0-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:521:1
   2: core::ptr::drop_in_place<pageserver::tenant::timeline::Timeline::get::{{closure}}>
             at /home/christian/src/neon-work-1/pageserver/src/tenant/timeline.rs:1042:5
   3: core::ptr::drop_in_place<pageserver::pgdatadir_mapping::<impl pageserver::tenant::timeline::Timeline>::get_current_logical_size_non_incremental::{{closure}}>
             at /home/christian/src/neon-work-1/pageserver/src/pgdatadir_mapping.rs:1001:67
   4: core::ptr::drop_in_place<pageserver::tenant::timeline::Timeline::calculate_logical_size::{{closure}}>
             at /home/christian/src/neon-work-1/pageserver/src/tenant/timeline.rs:3100:18
   5: core::ptr::drop_in_place<pageserver::tenant::timeline::Timeline::logical_size_calculation_task::{{closure}}::{{closure}}::{{closure}}>
             at /home/christian/src/neon-work-1/pageserver/src/tenant/timeline.rs:3050:22
   6: core::ptr::drop_in_place<pageserver::tenant::timeline::Timeline::logical_size_calculation_task::{{closure}}::{{closure}}>
             at /home/christian/src/neon-work-1/pageserver/src/tenant/timeline.rs:3060:5
2025-01-21 19:43:25 +01:00
Christian Schwarz
c4e0ba38b8 drain spawned IOs if traversal fails; https://github.com/neondatabase/neon/pull/9353/files#r1923589380 2025-01-21 19:33:13 +01:00
Christian Schwarz
14e4fcdb2a delta_layer: remove ignore_key_with_err optimization; https://github.com/neondatabase/neon/pull/9353#discussion_r1920573294
it would cause an assertion failure because we wouldn't be consuming all IOs
2025-01-21 19:11:25 +01:00
Christian Schwarz
1862fdf9e2 clean up doc comment 2025-01-21 19:01:22 +01:00
Christian Schwarz
24e0a3f941 undo the WIP benchmarks, will clean those up and commit in a future PR 2025-01-21 19:00:20 +01:00
Christian Schwarz
a528b325ee debug why CI tests don't run with sidecar-task 2025-01-21 18:53:44 +01:00
Christian Schwarz
a3c756334b lift noise reporting to ValuesReconstructState::drop; it's actually better there
For all high-rooted, long-lived IoConcurrency instances,
IoConcurrency::drop will never run.

What we actually care about is that we leave no dangling IOs after
get_vectored_impl, which lives much shorter than a high-rooted
IoConcurrency.

However, the lifetime of `ValuesReconstructState` is generally == the
lifetime of get_vectored_impl.
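
For illustration, the shape of the drop-time check (a sketch assuming a
simple atomic counter; the real code warns via tracing and captures a
backtrace):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

struct ValuesReconstructState {
    num_active_ios: AtomicUsize,
    // ... reconstruct data elided ...
}

impl Drop for ValuesReconstructState {
    fn drop(&mut self) {
        let active = self.num_active_ios.load(Ordering::SeqCst);
        if active > 0 {
            // Same idea as the pageserver's warning, minus the tracing
            // dependency.
            eprintln!(
                "dropping ValuesReconstructState while {active} IOs have not completed"
            );
        }
    }
}
```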
2025-01-21 18:52:15 +01:00
Gleb Novikov
19bf7b78a0 fast import: basic python test (#10271)
We did not have any tests for the fast_import binary yet.

In this PR I have introduced:
- `FastImport` class and tools for testing in python
- basic test that runs fast import against vanilla postgres and checks
that data is there

Should be merged after https://github.com/neondatabase/neon/pull/10251
2025-01-21 16:50:44 +00:00
Christian Schwarz
93625344eb refactor: wrap the oneshots into a properly named abstraction (OnDiskValueIo, etc) 2025-01-21 17:45:36 +01:00
Christian Schwarz
7ca9112ec1 fix noise 2025-01-21 17:08:03 +01:00
Arpad Müller
7e4a39ea53 Fix two flakiness sources in test_scrubber_physical_gc_ancestors (#10457)
We currently have some flakiness in
`test_scrubber_physical_gc_ancestors`, see #10391.

The first flakiness kind is about the reconciler not actually becoming
idle within the timeout of 30 seconds. We see continuous forward
progress so this is likely not a hang. We also see this happen in
parallel to a test failure, so it is likely due to runners being
overloaded. Therefore, we increase the timeout.

The second flakiness kind is an assertion failure. This one is a little
trickier, but we saw in the successful run that the LSN advanced between
when the compaction ran (which created layer files) and when gc ran.
Apparently gc rejects reductions to the single image layer setting if the
cutoff LSN is the same as the LSN of the image layer: it will claim that
that layer is newer than the space cutoff and therefore skip it, while
thinking the old layer (that we want to delete) is the latest one (so
it's not deleted).

We address the second flakiness kind by inserting a tiny amount of WAL
between the compaction and gc. This should hopefully fix things.

Related issue: #10391

(not closing it with the merge of this PR, as we'll need to validate
that these changes had the intended effect).

Thanks to Chi for going over this together with me in a call.
2025-01-21 15:40:04 +00:00
Christian Schwarz
4e72b22b41 make noise from IoConcurrency::drop instead of the task, for more context 2025-01-21 14:51:54 +01:00
JC Grünhage
624a507544 Create Github releases with empty body for now (#10448)
## Problem
When releasing `release-7574`, the Github Release creation failed with
"body is too long" (see
https://github.com/neondatabase/neon/actions/runs/12834025431/job/35792346745#step:5:77).
There's lots of room for improvement of the release notes, but for now
we'll disable them instead.

## Summary of changes
- Disable automatic generation of release notes for Github releases
- Enable creation of Github releases for proxy/compute
2025-01-21 12:45:21 +00:00
Christian Schwarz
4014c390e2 initial logical size calculation can also reasonably use the sidecar because it's concurrency-limited 2025-01-21 12:50:33 +01:00
Christian Schwarz
bca4263eb8 inspect_image_layer can also have an IoConcurrency root; it's tests-only 2025-01-21 12:40:22 +01:00
Christian Schwarz
a958febd7a reference issue that will remove hard-coded sequential() 2025-01-21 12:36:23 +01:00
Christian Schwarz
fc27da43ff one more test can do without it 2025-01-21 12:25:30 +01:00
Christian Schwarz
cf2f0c27aa IoConcurrency roots for scan() and tests 2025-01-21 12:21:46 +01:00
Christian Schwarz
f54c5d5596 turns out create_image_layers is easy 2025-01-21 10:47:49 +01:00
Christian Schwarz
ce5452d2e5 followup 0a37164c29: also rename IoConcurrency::serial() 2025-01-21 00:47:37 +01:00
Christian Schwarz
af6c9ffac7 Ok, I now understand why it deadlocked in mode=sidecar-task
The reason is that even in mode=`sidecar-task`, there
are a bunch of places that are serial. Those places obviously deadlock.
2025-01-21 00:41:45 +01:00
Christian Schwarz
081ff26519 fixup 40ab9c2c5e: deadlock
Reproduced by

test_runner/regress/test_branching.py::test_branching_with_pgbench[debug-pg16-flat-1-10]

It kinda makes sense that this deadlocks in `sequential` mode.

However, it also deadlocks in `sidecar-task` mode.
I don't understand why.
2025-01-20 23:46:56 +01:00
Christian Schwarz
0a37164c29 replace env var with config variable; add test suite fixture env var to override default 2025-01-20 23:46:56 +01:00
Arpad Müller
2ab9f69825 Simplify pageserver_physical_gc function (#10104)
This simplifies the code in `pageserver_physical_gc` a little bit after
the feedback in #10007 that the code is too complicated.

Most importantly, we don't pass around `GcSummary` any more in a
complicated fashion, and we save on async stream-combinator-inception in
one place in favour of `try_stream!{}`.

Follow-up of #10007
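
For illustration, a small sketch of the `try_stream!` style (from the
`async-stream` crate); `fetch_page` is a made-up helper:

```rust
use async_stream::try_stream;
use futures::Stream;

// A fallible async stream written as straight-line code instead of
// nested stream combinators.
fn pages(n: usize) -> impl Stream<Item = Result<String, std::io::Error>> {
    try_stream! {
        for i in 0..n {
            let page = fetch_page(i).await?; // `?` propagates into the stream
            yield page;
        }
    }
}

async fn fetch_page(i: usize) -> Result<String, std::io::Error> {
    Ok(format!("page {i}"))
}
```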
2025-01-20 21:57:15 +00:00