rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-05 21:20:37 +00:00

Author	SHA1	Message	Date
Arpad Müller	d820aa1d08	Disable initdb cancellation (#6451 ) ## Problem The initdb cancellation added in #5921 is not sufficient to reliably abort the entire initdb process. Initdb also spawns children. The tests added by #6310 (#6385) and #6436 now do initdb cancellations on a more regular basis. In #6385, I attempted to issue `killpg` (after giving it a new process group ID) to kill not just the initdb but all its spawned subprocesses, but this didn't work. Initdb doesn't take that long in the end either, so we just wait until it concludes. ## Summary of changes * revert initdb cancellation support added in #5921 * still return `Err(Cancelled)` upon cancellation, but this is just to not have to remove the cancellation infrastructure * fixes to the `test_tenant_delete_races_timeline_creation` test to make it reliably pass Fixes #6385	2024-01-24 13:06:05 +01:00
Christian Schwarz	996abc9563	pagebench-based GetPage@LSN performance test (#6214 )	2024-01-24 12:51:53 +01:00
John Spray	a72af29d12	control_plane/attachment_service: implement PlacementPolicy::Detached (#6458 ) ## Problem The API for detaching things wasn't implemented yet, but one could hit this case indirectly from tests when using attach-hook, and find tenants unexpectedly attached again because their policy remained Single. ## Summary of changes Add PlacementPolicy::Detached, and: - add the behavior for it in schedule() - in tenant_migrate, refuse if the policy is detached - automatically set this policy in attach-hook if the caller has specified pageserver=null.	2024-01-24 12:49:30 +01:00
Sasha Krassovsky	4f51824820	Fix creating publications for all tables	2024-01-23 22:41:00 -08:00
Christian Schwarz	743f6dfb9b	fix(attachment_service): corrupted attachments.json when parallel requests (#6450 ) The pagebench integration PR (#6214) issues attachment requests in parallel. We observed corrupted attachments.json from time to time, especially in the test cases with high tenant counts. The atomic overwrite added in #6444 exposed the root cause cleanly: the `.commit()` calls of two request handlers could interleave or be reordered. See also: https://github.com/neondatabase/neon/pull/6444#issuecomment-1906392259 This PR makes changes to the `persistence` module to fix the above race: - mpsc queue for PendingWrites - one writer task performs the writes in mpsc queue order - request handlers that need to do writes do it using the new `mutating_transaction` function. `mutating_transaction`, while holding the lock, does the modifications, serializes the post-modification state, and pushes that as a `PendingWrite` into the mpsc queue. It then releases the lock and `await`s the completion of the write. The writer task executes the `PendingWrites` in queue order. Once the write has been executed, it wakes the writing tokio task.	2024-01-23 19:14:32 +00:00
Arpad Müller	faf275d4a2	Remove initdb on timeline delete (#6387 ) This PR: * makes `initdb.tar.zst` be deleted by default on timeline deletion (#6226), mirroring the safekeeper: https://github.com/neondatabase/neon/pull/6381 * adds a new `preserve_initdb_archive` endpoint for a timeline, to be used during the disaster recovery process, see reasoning [here](https://github.com/neondatabase/neon/issues/6226#issuecomment-1894574778) * makes the creation code look for `initdb-preserved.tar.zst` in addition to `initdb.tar.zst`. * makes the tests use the new endpoint fixes #6226	2024-01-23 18:22:59 +00:00
Vlad Lazar	001f0d6db7	pageserver: fix import failure caused by merge race (#6448 ) PR #6406 raced with #6372 and broke main.	2024-01-23 18:07:01 +01:00
Christian Schwarz	42c17a6fc6	attachment_service: use atomic overwrite to persist attachments.json (#6444 ) The pagebench integration PR (#6214) is the first to SIGQUIT & then restart attachment_service. With many tenants (100), we have found frequent failures on restart in the CI[^1]. [^1]: [Allure](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6214/7615750160/index.html#suites/e26265675583c610f99af77084ae58f1/851ff709578c4452/) ``` 2024-01-22T19:07:57.932021Z INFO request{method=POST path=/attach-hook request_id=2697503c-7b3e-4529-b8c1-d12ef912d3eb}: Request handled, status: 200 OK 2024-01-22T19:07:58.898213Z INFO Got SIGQUIT. Terminating 2024-01-22T19:08:02.176588Z INFO version: git-env:d56f31639356ed8e8ce832097f132f27ee19ac8a, launch_timestamp: 2024-01-22 19:08:02.174634554 UTC, build_tag build_tag-env:7615750160, state at /tmp/test_output/test_pageserver_max_throughput_getpage_at_latest_lsn[10-13-30]/repo/attachments.json, listening on 127.0.0.1:15048 thread 'main' panicked at /__w/neon/neon/control_plane/attachment_service/src/persistence.rs:95:17: Failed to load state from '/tmp/test_output/test_pageserver_max_throughput_getpage_at_latest_lsn[10-13-30]/repo/attachments.json': trailing characters at line 1 column 8957 (maybe your .neon/ dir was written by an older version?) 
stack backtrace: 0: rust_begin_unwind at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5 1: core::panicking::panic_fmt at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14 2: attachment_service::persistence::PersistentState::load_or_new::{{closure}} at ./control_plane/attachment_service/src/persistence.rs:95:17 3: attachment_service::persistence::Persistence::new::{{closure}} at ./control_plane/attachment_service/src/persistence.rs:103:56 4: attachment_service::main::{{closure}} at ./control_plane/attachment_service/src/main.rs:69:61 5: tokio::runtime::park::CachedParkThread::block_on::{{closure}} at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/park.rs:282:63 6: tokio::runtime::coop::with_budget at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:107:5 7: tokio::runtime::coop::budget at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:73:5 8: tokio::runtime::park::CachedParkThread::block_on at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/park.rs:282:31 9: tokio::runtime::context::blocking::BlockingRegionGuard::block_on at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/blocking.rs:66:9 10: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}} at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/multi_thread/mod.rs:87:13 11: tokio::runtime::context::runtime::enter_runtime at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/runtime.rs:65:16 12: tokio::runtime::scheduler::multi_thread::MultiThread::block_on at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/multi_thread/mod.rs:86:9 13: tokio::runtime::runtime::Runtime::block_on at 
./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/runtime.rs:350:50 14: attachment_service::main at ./control_plane/attachment_service/src/main.rs:99:5 15: core::ops::function::FnOnce::call_once at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/function.rs:250:5 note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace. ``` The attachment_service handles SIGQUIT by just exiting the process. In theory, the SIGQUIT could come in while we're writing out the `attachments.json`. Now, in above log output, there's a 1 second gap between the last request completing and the SIGQUIT coming in. So, there must be some other issue. But, let's have this change anyways, maybe it helps uncover the real cause for the test failure.	2024-01-23 17:21:06 +01:00
Vlad Lazar	37638fce79	pageserver: introduce vectored Timeline::get interface (#6372 ) 1. Introduce a naive `Timeline::get_vectored` implementation The return type is intended to be flexible enough for various types of callers. We return the pages in a map keyed by `Key` such that the caller doesn't have to map back to the key if it needs to know it. Some callers can ignore errors for specific pages, so we return a separate `Result<Bytes, PageReconstructError>` for each page and an overarching `GetVectoredError` for API misuse. The overhead of the mapping will be small and bounded since we enforce a maximum key count for the operation. 2. Use the `get_vectored` API for SLRU segment reconstruction and image layer creation.	2024-01-23 14:23:53 +00:00
Christian Schwarz	50288c16b1	fix(pagebench): avoid CopyFail error in success case (#6443 ) PR #6392 fixed CopyFail in the case where we get cancelled. But, we also want to use `client.shutdown()` if we don't get cancelled.	2024-01-23 15:11:32 +01:00
Conrad Ludgate	e03f8abba9	eager parsing of ip addr (#6446 ) ## Problem Parsing the IP address at check time is a little wasteful. ## Summary of changes Parse the IP when we get it from cplane. Adding a `None` variant to still allow malformed patterns	2024-01-23 13:25:01 +00:00
Anna Khanova	1905f0bced	proxy: store role not found in cache (#6439 ) ## Problem There are a lot of responses with a 404 "role not found" error, which are not getting cached in proxy. ## Summary of changes If an empty secret is returned together with the project_id, store it in the cache.	2024-01-23 13:15:05 +01:00
Conrad Ludgate	72de1cb511	remove some duped deps (#6422 ) ## Problem duplicated deps ## Summary of changes little bit of fiddling with deps to reduce duplicates needs consideration: https://github.com/notify-rs/notify/blob/main/CHANGELOG.md#notify-600-2023-05-17	2024-01-23 11:17:15 +00:00
Konstantin Knizhnik	00d9bf5b61	Implement lockless update of pageserver_connstring GUC in shared memory (#6314 ) ## Problem There is a "neon.pageserver_connstring" GUC with the PGC_SIGHUP option, which allows changing it using pg_reload_conf(). It is used by the control plane to update the pageserver connection string if the page server has crashed, been relocated, or new shards are added. It is copied to shared memory because the config cannot be loaded during query execution and we need to reestablish the connection to the page server. ## Summary of changes Copying the connection string to shared memory is done by the postmaster. Other backends should check the update counter to determine if the connection URL has changed and the connection needs to be reestablished. We cannot use standard Postgres LW-locks, because the postmaster has no proc entry and so cannot wait on this primitive. This is why a lockless access algorithm is implemented using two atomic counters to enforce consistent reading of the connection string value from shared memory. ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2024-01-23 07:55:05 +02:00
Sasha Krassovsky	71f495c7f7	Gate it behind feature flags	2024-01-22 14:53:29 -08:00
Sasha Krassovsky	0a7e050144	Fix test one last time	2024-01-22 14:53:29 -08:00
Sasha Krassovsky	55bfa91bd7	Fix test again again	2024-01-22 14:53:29 -08:00
Sasha Krassovsky	d90b2b99df	Fix test again	2024-01-22 14:53:29 -08:00
Sasha Krassovsky	27587e155d	Fix test	2024-01-22 14:53:29 -08:00
Sasha Krassovsky	55aede2762	Prevent duplicate insertions	2024-01-22 14:53:29 -08:00
Sasha Krassovsky	9f186b4d3e	Fix query	2024-01-22 14:53:29 -08:00
Sasha Krassovsky	585687d563	Fix syntax error	2024-01-22 14:53:29 -08:00
Sasha Krassovsky	65a98e425d	Switch to bigint	2024-01-22 14:53:29 -08:00
Sasha Krassovsky	b2e7249979	Sleep	2024-01-22 14:53:29 -08:00
Sasha Krassovsky	844303255a	Cargo fmt	2024-01-22 14:53:29 -08:00
Sasha Krassovsky	6d8df2579b	Fix dumb thing	2024-01-22 14:53:29 -08:00
Sasha Krassovsky	3c3b53f8ad	Update test	2024-01-22 14:53:29 -08:00
Sasha Krassovsky	30064eb197	Add scary comment	2024-01-22 14:53:29 -08:00
Sasha Krassovsky	869acfe29b	Make migrations transactional	2024-01-22 14:53:29 -08:00
Sasha Krassovsky	11a91eaf7b	Uncomment the thread	2024-01-22 14:53:29 -08:00
Sasha Krassovsky	394ef013d0	Push the migrations test	2024-01-22 14:53:29 -08:00
Sasha Krassovsky	a718287902	Make migrations happen on a separate thread	2024-01-22 14:53:29 -08:00
Sasha Krassovsky	2eac1adcb9	Make clippy happy	2024-01-22 14:53:29 -08:00
Sasha Krassovsky	3f90b2d337	Fix test_ddl_forwarding	2024-01-22 14:53:29 -08:00
Sasha Krassovsky	a40ed86d87	Add test for migrations, add initial migration	2024-01-22 14:53:29 -08:00
Sasha Krassovsky	1bf8bb88c5	Add support for migrations within compute_ctl	2024-01-22 14:53:29 -08:00
Vlad Lazar	f1901833a6	pageserver_api: migrate keyspace related functions from `pgdatadir_mapping` (#6406 ) The idea is to achieve separation between keyspace layout definition and operating on said keyspace. I've inlined all these function since they're small and we don't use LTO in the storage release builds at the moment. Closes https://github.com/neondatabase/neon/issues/6347	2024-01-22 19:16:38 +00:00
Arthur Petukhovsky	b41ee81308	Log warning on slow WAL removal (#6432 ) Also add `safekeeper_active_timelines` metric. Should help investigating #6403	2024-01-22 18:38:05 +00:00
Christian Schwarz	205b6111e6	attachment_service: /attach-hook: correctly handle detach (#6433 ) Before this patch, we would update the `tenant_state.intent` in memory but not persist the detachment to disk. I noticed this in https://github.com/neondatabase/neon/pull/6214 where we stop, then restart, the attachment service.	2024-01-22 18:27:05 +00:00
John Spray	93572a3e99	pageserver: mark tenant broken when cancelling attach (#6430 ) ## Problem When a tenant is in Attaching state, and waiting for the `concurrent_tenant_warmup` semaphore, it also listens for the tenant cancellation token. When that token fires, Tenant::attach drops out. Meanwhile, Tenant::set_stopping waits forever for the tenant to exit Attaching state. Fixes: https://github.com/neondatabase/neon/issues/6423 ## Summary of changes - In the absence of a valid state for the tenant, it is set to Broken in this path. A more elegant solution will require more refactoring, beyond this minimal fix.	2024-01-22 15:50:32 +00:00
Christian Schwarz	15c0df4de7	fixup(#6037 ): actually fix the issue, #6388 failed to do so (#6429 ) Before this patch, the select! still returned immediately if `futs` was empty. Must have tested a stale build in my manual testing of #6388.	2024-01-22 14:27:29 +00:00
Anna Khanova	3290fb09bf	Proxy: fix gc (#6426 ) ## Problem Gc currently doesn't work properly. ## Summary of changes Change statement on running gc.	2024-01-22 13:24:10 +00:00
hamishc	efdb2bf948	Added missing PG_VERSION arg into compute node dockerfile (#6382 ) ## Problem If you build the compute-node dockerfile with the PG_VERSION argument passed in (e.g. `docker build -f Dockerfile.compute-node --build-arg PG_VERSION=v15 .`), it fails, as some of the stages don't have the PG_VERSION arg defined. ## Summary of changes Added the PG_VERSION arg to the plv8-build, neon-pg-ext-build, and pg-embedding-pg-build stages of Dockerfile.compute-node	2024-01-22 11:05:27 +00:00
Conrad Ludgate	5559b16953	bump shlex (#6421 ) ## Problem https://rustsec.org/advisories/RUSTSEC-2024-0006 ## Summary of changes `cargo update -p shlex`	2024-01-22 09:14:30 +00:00
Konstantin Knizhnik	1aea65eb9d	Fix potential overflow in update_next_xid (#6412 ) ## Problem See https://neondb.slack.com/archives/C06F5UJH601/p1705731304237889 Adding 1 to xid in `update_next_xid` can cause overflow in debug mode. 0xffffffff is valid transaction ID. ## Summary of changes Use `wrapping_add` ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2024-01-21 22:11:00 +02:00
Conrad Ludgate	34ddec67d9	proxy small tweaks (#6398 ) ## Problem In https://github.com/neondatabase/neon/pull/6283 I did a couple changes that weren't directly related to the goal of extracting the state machine, so I'm putting them here ## Summary of changes - move postgres vs console provider into another enum - reduce error cases for link auth - slightly refactor link flow	2024-01-21 09:58:42 +01:00
Anna Khanova	9ace36d93c	Proxy: do not store empty key (#6415 ) ## Problem Currently we store in cache even if the project is undefined. That makes invalidation impossible. ## Summary of changes Do not store if project id is empty.	2024-01-20 16:14:53 +00:00
Heikki Linnakangas	e4898a6e60	Don't pass InvalidTransactionId to update_next_xid. (#6410 ) update_next_xid() doesn't have any special treatment for the invalid or other special XIDs, so it will treat InvalidTransactionId (0) as a regular XID. If the old nextXid is smaller than 2^31, 0 will look like a very old XID, and nothing happens. But if nextXid is greater than 2^31, 0 will look like a very new XID, and update_next_xid() will incorrectly bump up nextXid.	2024-01-20 18:04:16 +02:00
Joonas Koivunen	c77981289c	build: terminate long running tests (#6389 ) configures nextest to kill tests after 1 minute. slow period is set to 20s, which is how long our tests currently take in total; there will be 2 warnings and then the test will be killed and its output logged. Cc: #6361 Cc: #6368 -- likely this will be enough for a longer time, but it will be counterproductive when we want to attach and debug; the added line would have to be commented out.	2024-01-20 17:41:55 +02:00
Anna Khanova	f003dd6ad5	Remove rename in parameters (#6411 ) ## Problem Name in notifications is not compatible with console name. ## Summary of changes Rename fields to make it compatible.	2024-01-20 10:20:53 +00:00
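
The two `update_next_xid` fixes listed above (1aea65eb9d, #6412, and e4898a6e60, #6410) both come down to 32-bit transaction ID arithmetic. The following is a minimal sketch of that arithmetic, not the pageserver's actual code: the function names `next_xid` and `xid_precedes` are hypothetical, and the comparison deliberately has no special handling for XID 0, which is the situation #6410 describes.

```rust
/// Hypothetical helper: advance an XID by one.
/// `wrapping_add` wraps 0xffffffff around to 0 instead of panicking
/// on overflow, which a plain `xid + 1` would do in a debug build.
fn next_xid(xid: u32) -> u32 {
    xid.wrapping_add(1)
}

/// Hypothetical helper: circular "a is older than b" comparison.
/// The difference is taken modulo 2^32 and interpreted as a signed
/// 32-bit value, so each XID sees 2^31 values as older and 2^31 as newer.
/// Special XIDs (such as 0) are NOT treated specially here.
fn xid_precedes(a: u32, b: u32) -> bool {
    (a.wrapping_sub(b) as i32) < 0
}
```

With these definitions, `xid_precedes(0, 100)` is true (0 looks very old while nextXid is below 2^31), whereas `xid_precedes(0, 3_000_000_000)` is false (0 looks newer once nextXid is above 2^31), matching the behaviour the commit message warns about.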

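Commit 42c17a6fc6 (#6444) above switches `attachments.json` persistence to an atomic overwrite. The general technique can be sketched with the standard library alone, as below. This is an illustration under assumptions rather than the code from that PR: the function name `atomic_overwrite` and the `.tmp` sibling file are hypothetical, and the sketch relies on `rename` replacing the destination atomically, which holds on POSIX filesystems when both paths are on the same filesystem.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper: replace `path` with `contents` so that a reader
/// (or a restart after a crash) sees either the old file or the new one,
/// never a partially written mix.
fn atomic_overwrite(path: &Path, contents: &[u8]) -> io::Result<()> {
    // Write the new contents to a sibling temporary file first.
    let tmp = path.with_extension("tmp");
    let mut file = File::create(&tmp)?;
    file.write_all(contents)?;
    // Flush file data and metadata to disk before publishing the file.
    file.sync_all()?;
    // Publish the new contents by renaming over the destination.
    fs::rename(&tmp, path)?;
    Ok(())
}
```

A fully durable version would also fsync the parent directory after the rename; that step is omitted here to keep the sketch short.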

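Commit 743f6dfb9b (#6450) above serializes writes through an mpsc queue drained by a single writer task. The sketch below shows the same shape using only `std` threads and channels; the actual change uses tokio and awaits completion, so this is an analogy, not the PR's code. The `Vec<u32>` state, the `sink` standing in for the file, and the `spawn` constructor are all assumptions made for illustration; only the names `PendingWrite` and `mutating_transaction` come from the commit message.

```rust
use std::sync::mpsc::{self, Sender};
use std::sync::{Arc, Mutex};
use std::thread;

/// A serialized snapshot plus a channel used to signal completion.
struct PendingWrite {
    serialized: String,
    done: Sender<()>,
}

/// State and queue handle live under one lock, so the order in which
/// snapshots are taken equals the order in which they are enqueued.
struct Inner {
    state: Vec<u32>,
    queue: Sender<PendingWrite>,
}

struct Persistence {
    inner: Mutex<Inner>,
}

impl Persistence {
    /// Starts the single writer thread; `sink` stands in for the file.
    fn spawn(sink: Arc<Mutex<Vec<String>>>) -> Arc<Self> {
        let (tx, rx) = mpsc::channel::<PendingWrite>();
        thread::spawn(move || {
            // Writes are executed strictly in queue order.
            for write in rx {
                sink.lock().unwrap().push(write.serialized);
                let _ = write.done.send(());
            }
        });
        Arc::new(Persistence {
            inner: Mutex::new(Inner { state: Vec::new(), queue: tx }),
        })
    }

    /// Modify the state, snapshot it, and enqueue the write while holding
    /// the lock; then release the lock and wait for the write to finish.
    fn mutating_transaction(&self, f: impl FnOnce(&mut Vec<u32>)) {
        let (done_tx, done_rx) = mpsc::channel();
        {
            let mut inner = self.inner.lock().unwrap();
            f(&mut inner.state);
            let serialized = format!("{:?}", inner.state);
            inner
                .queue
                .send(PendingWrite { serialized, done: done_tx })
                .unwrap();
        } // lock released here, before blocking on completion
        done_rx.recv().unwrap();
    }
}
```

Because the snapshot is taken and enqueued under the same lock, two concurrent callers cannot have their writes interleaved or reordered, which is the race the commit message describes.
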
5432 Commits