The pagebench integration PR (#6214) is the first to send SIGQUIT to,
and then restart, the attachment_service.
With many tenants (100), we have seen frequent failures on restart in
CI[^1].
[^1]:
[Allure](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6214/7615750160/index.html#suites/e26265675583c610f99af77084ae58f1/851ff709578c4452/)
```
2024-01-22T19:07:57.932021Z INFO request{method=POST path=/attach-hook request_id=2697503c-7b3e-4529-b8c1-d12ef912d3eb}: Request handled, status: 200 OK
2024-01-22T19:07:58.898213Z INFO Got SIGQUIT. Terminating
2024-01-22T19:08:02.176588Z INFO version: git-env:d56f31639356ed8e8ce832097f132f27ee19ac8a, launch_timestamp: 2024-01-22 19:08:02.174634554 UTC, build_tag build_tag-env:7615750160, state at /tmp/test_output/test_pageserver_max_throughput_getpage_at_latest_lsn[10-13-30]/repo/attachments.json, listening on 127.0.0.1:15048
thread 'main' panicked at /__w/neon/neon/control_plane/attachment_service/src/persistence.rs:95:17:
Failed to load state from '/tmp/test_output/test_pageserver_max_throughput_getpage_at_latest_lsn[10-13-30]/repo/attachments.json': trailing characters at line 1 column 8957 (maybe your .neon/ dir was written by an older version?)
stack backtrace:
0: rust_begin_unwind
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5
1: core::panicking::panic_fmt
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14
2: attachment_service::persistence::PersistentState::load_or_new::{{closure}}
at ./control_plane/attachment_service/src/persistence.rs:95:17
3: attachment_service::persistence::Persistence::new::{{closure}}
at ./control_plane/attachment_service/src/persistence.rs:103:56
4: attachment_service::main::{{closure}}
at ./control_plane/attachment_service/src/main.rs:69:61
5: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/park.rs:282:63
6: tokio::runtime::coop::with_budget
at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:107:5
7: tokio::runtime::coop::budget
at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:73:5
8: tokio::runtime::park::CachedParkThread::block_on
at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/park.rs:282:31
9: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/blocking.rs:66:9
10: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}
at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/multi_thread/mod.rs:87:13
11: tokio::runtime::context::runtime::enter_runtime
at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/runtime.rs:65:16
12: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/multi_thread/mod.rs:86:9
13: tokio::runtime::runtime::Runtime::block_on
at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/runtime.rs:350:50
14: attachment_service::main
at ./control_plane/attachment_service/src/main.rs:99:5
15: core::ops::function::FnOnce::call_once
at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```
The attachment_service handles SIGQUIT by simply exiting the process.
In theory, the SIGQUIT could arrive while we're writing out
`attachments.json`.
However, in the log output above there is a 1-second gap between the
last request completing and the SIGQUIT arriving, so there must be some
other issue.
Still, let's make this change anyway; maybe it helps uncover the real
cause of the test failure.
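A minimal sketch of what such a change could look like, assuming a
write-to-temp-then-rename scheme (the helper name is hypothetical, not
the actual code): even if SIGQUIT lands mid-write, `attachments.json` is
either the old or the new content, never a truncated mix.
```rust
use std::{fs, io::Write, path::Path};

// Hypothetical sketch: crash-safe persistence via temp file + atomic rename.
fn persist_atomically(path: &Path, json: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("json.tmp");
    let mut f = fs::File::create(&tmp)?;
    f.write_all(json)?;
    f.sync_all()?; // flush file contents before the rename publishes them
    fs::rename(&tmp, path)?; // atomic on POSIX filesystems
    Ok(())
}
```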
1. Introduce a naive `Timeline::get_vectored` implementation
The return type is intended to be flexible enough for various types of
callers. We return the pages in a map keyed by `Key`, so that the
caller doesn't have to map results back to keys itself. Some callers can
ignore errors for specific pages, so we return a separate `Result<Bytes,
PageReconstructError>` for each page, plus an overarching
`GetVectoredError` for API misuse. The overhead of the mapping is
small and bounded since we enforce a maximum key count per operation;
a sketch of the shape of the API follows after this list.
2. Use the `get_vectored` API for SLRU segment reconstruction and image
layer creation.
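A self-contained sketch of the shape described above; the stand-in
types, the constant, and the `get_one` callback are assumptions for
illustration, not the actual pageserver API.
```rust
use std::collections::BTreeMap;

// Stand-in types so the sketch compiles on its own.
type Key = u128;
type Bytes = Vec<u8>;
#[derive(Debug)]
struct PageReconstructError;
#[derive(Debug)]
enum GetVectoredError {
    TooManyKeys(usize), // API misuse: caller exceeded the bounded key count
}

const MAX_GET_VECTORED_KEYS: usize = 32; // hypothetical bound

// Naive vectored get: validate the key count up front, then fetch each page
// individually, recording a per-page Result so callers may ignore errors
// for specific pages while still seeing the rest.
fn get_vectored(
    keys: &[Key],
    get_one: impl Fn(Key) -> Result<Bytes, PageReconstructError>,
) -> Result<BTreeMap<Key, Result<Bytes, PageReconstructError>>, GetVectoredError> {
    if keys.len() > MAX_GET_VECTORED_KEYS {
        return Err(GetVectoredError::TooManyKeys(keys.len()));
    }
    Ok(keys.iter().map(|&k| (k, get_one(k))).collect())
}
```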
## Problem
Parsing the IP address at check time is a little wasteful.
## Summary of changes
Parse the IP when we get it from cplane, adding a `None` variant to
still allow malformed patterns.
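A hedged sketch of that shape (names are hypothetical; the real proxy
type covers more pattern forms): parse once on receipt, match cheaply at
check time.
```rust
use std::net::IpAddr;

// Hypothetical pattern type: parsed once when received from cplane.
enum IpPattern {
    Single(IpAddr),
    None, // malformed entry: kept, but never matches
}

fn parse_pattern(s: &str) -> IpPattern {
    s.parse::<IpAddr>().map(IpPattern::Single).unwrap_or(IpPattern::None)
}

// At check time no parsing happens anymore, only comparisons.
fn is_allowed(patterns: &[IpPattern], peer: IpAddr) -> bool {
    patterns
        .iter()
        .any(|p| matches!(p, IpPattern::Single(ip) if *ip == peer))
}
```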
## Problem
There are a lot of responses with a 404 "role not found" error, which
are not getting cached in proxy.
## Summary of changes
If an empty secret is returned together with a project_id, store it in
the cache.
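A minimal sketch of that rule, with hypothetical types: negative results
("role not found") become cacheable as long as a project_id exists to
invalidate by.
```rust
use std::collections::HashMap;

// Hypothetical cache entry: secret == None encodes "role not found".
struct AuthInfo {
    secret: Option<Vec<u8>>,
    project_id: Option<String>,
}

fn maybe_cache(cache: &mut HashMap<String, AuthInfo>, key: String, info: AuthInfo) {
    // Cache real secrets, and also empty ones if we know the project_id,
    // so repeated 404 lookups hit the cache until invalidation.
    if info.secret.is_some() || info.project_id.is_some() {
        cache.insert(key, info);
    }
}
```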
## Problem
There is "neon.pageserver_connstring" GUC with PGC_SIGHUP option,
allowing to change it using
pg_reload_conf(). It is used by control plane to update pageserver
connection string if page server is crashed,
relocated or new shards are added.
It is copied to shared memory because config can not be loaded during
query execution and we need to
reestablish connection to page server.
## Summary of changes
Copying the connection string to shared memory is done by the
postmaster. Other backends check an update counter to determine whether
the connection URL has changed and the connection needs to be
reestablished.
We cannot use standard Postgres LWLocks, because the postmaster has no
proc entry and so cannot wait on this primitive. Instead, a lockless
access algorithm is implemented using two atomic counters to enforce
consistent reads of the connection string value from shared memory.
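A seqlock-style sketch of the two-counter scheme, written in Rust for
brevity (the real implementation is C in the extension; names are
stand-ins, memory orderings are simplified, and the racy buffer copy is
only well-defined in the C original):
```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicU64, Ordering};

struct SharedConnstring {
    started: AtomicU64,         // bumped by the postmaster before a write
    finished: AtomicU64,        // bumped by the postmaster after a write
    buf: UnsafeCell<[u8; 256]>, // connection string in "shared memory"
}
unsafe impl Sync for SharedConnstring {}

impl SharedConnstring {
    // Single writer (postmaster only): needs no lock at all.
    fn write(&self, s: &[u8]) {
        self.started.fetch_add(1, Ordering::AcqRel);
        unsafe { (*self.buf.get())[..s.len()].copy_from_slice(s) };
        self.finished.fetch_add(1, Ordering::AcqRel);
    }
    // Readers (backends): retry until both counters agree around the copy,
    // i.e. no write overlapped our read of the buffer.
    fn read(&self) -> Vec<u8> {
        loop {
            let before = self.finished.load(Ordering::Acquire);
            let copy = unsafe { (*self.buf.get()).to_vec() };
            if self.started.load(Ordering::Acquire) == before {
                return copy;
            }
            std::hint::spin_loop();
        }
    }
}
```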
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
The idea is to achieve separation between the keyspace layout definition
and operations on said keyspace. I've inlined all these functions since
they're small and we don't use LTO in the storage release builds
at the moment (see the sketch below).
Closes https://github.com/neondatabase/neon/issues/6347
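For illustration, the pattern looks roughly like this (the helper and
its body are hypothetical): without LTO, small cross-crate helpers only
get inlined when marked explicitly.
```rust
pub struct Key {
    pub field1: u8,
    // ... remaining fields elided ...
}

// Hypothetical layout predicate: trivial enough that an inline hint is
// what makes it free across crate boundaries in non-LTO release builds.
#[inline(always)]
pub fn is_rel_block_key(key: &Key) -> bool {
    key.field1 == 0x00
}
```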
Before this patch, we would update the `tenant_state.intent` in memory
but not persist the detachment to disk.
I noticed this in https://github.com/neondatabase/neon/pull/6214 where
we stop, then restart, the attachment service.
## Problem
When a tenant is in Attaching state, and waiting for the
`concurrent_tenant_warmup` semaphore, it also listens for the tenant
cancellation token. When that token fires, Tenant::attach drops out.
Meanwhile, Tenant::set_stopping waits forever for the tenant to exit
Attaching state.
Fixes: https://github.com/neondatabase/neon/issues/6423
## Summary of changes
- In the absence of a valid state for the tenant, it is set to Broken in
this path (see the sketch below). A more elegant solution will require
more refactoring, beyond this minimal fix.
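An illustrative sketch of the shape of the fix, with stand-in names (the
actual control flow in `Tenant::attach` differs): race the warmup permit
against cancellation, and leave the tenant in a terminal state either
way, so `set_stopping` can't wait forever on Attaching.
```rust
use tokio::sync::Semaphore;
use tokio_util::sync::CancellationToken;

#[derive(Debug)]
enum TenantState { Active, Broken }

async fn attach_with_warmup(
    warmup: &Semaphore,
    cancel: &CancellationToken,
) -> TenantState {
    tokio::select! {
        permit = warmup.acquire() => {
            let _permit = permit.expect("warmup semaphore closed");
            // ... perform the actual attach work here ...
            TenantState::Active
        }
        _ = cancel.cancelled() => {
            // Cancelled before reaching any valid state: mark Broken so
            // shutdown observes the tenant leaving Attaching.
            TenantState::Broken
        }
    }
}
```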
## Problem
If you build the compute-node dockerfile with the PG_VERSION argument
passed in (e.g. `docker build -f Dockerfile.compute-node --build-arg
PG_VERSION=v15 .`), it fails, as some of the stages don't have the
PG_VERSION arg defined.
## Summary of changes
Added the PG_VERSION arg to the plv8-build, neon-pg-ext-build, and
pg-embedding-pg-build stages of Dockerfile.compute-node
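This is standard Docker ARG scoping: a top-level ARG is not visible
inside a build stage unless redeclared there. An illustrative excerpt
(the base image name is an assumption):
```dockerfile
ARG PG_VERSION

FROM build-deps AS plv8-build
# Redeclare to bring the global ARG into this stage's scope.
ARG PG_VERSION
RUN echo "building plv8 for Postgres ${PG_VERSION}"
```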
## Problem
See https://neondb.slack.com/archives/C06F5UJH601/p1705731304237889
Adding 1 to the xid in `update_next_xid` can cause an overflow panic in
debug mode, since 0xFFFFFFFF is a valid transaction ID.
## Summary of changes
Use `wrapping_add`
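A one-function illustration of why (the function name is just for the
example): XIDs are modulo-2^32 values, so the increment is supposed to
wrap.
```rust
// In debug builds, `0xFFFF_FFFFu32 + 1` panics with "attempt to add with
// overflow"; wrapping_add makes the intended modulo-2^32 wraparound explicit.
fn next_xid(xid: u32) -> u32 {
    xid.wrapping_add(1) // 0xFFFF_FFFF -> 0
}
```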
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
## Problem
In https://github.com/neondatabase/neon/pull/6283 I made a couple of
changes that weren't directly related to the goal of extracting the
state machine, so I'm putting them here.
## Summary of changes
- move postgres vs console provider into another enum
- reduce error cases for link auth
- slightly refactor link flow
## Problem
Currently we store entries in the cache even if the project is
undefined. That makes invalidation impossible.
## Summary of changes
Do not store the entry if the project id is empty.
`update_next_xid()` doesn't have any special treatment for the invalid
or other special XIDs, so it treats `InvalidTransactionId` (0) as a
regular XID. If the old nextXid is smaller than 2^31, 0 looks like a
very old XID and nothing happens. But if nextXid is greater than 2^31,
0 looks like a very new XID, and `update_next_xid()` incorrectly bumps
up nextXid.
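A worked example of the circular comparison behind this (hedged: this
mirrors the usual modulo-2^32 XID arithmetic, not necessarily the exact
code in question):
```rust
// A candidate xid counts as "newer" than next_xid when the wrapping
// difference, reinterpreted as i32, is positive (modulo-2^32 ordering).
fn xid_follows(xid: u32, next_xid: u32) -> bool {
    xid.wrapping_sub(next_xid) as i32 > 0
}

fn main() {
    assert!(!xid_follows(0, 0x1000_0000)); // nextXid < 2^31: 0 looks very old
    assert!(xid_follows(0, 0x9000_0000)); // nextXid > 2^31: 0 looks very new!
}
```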
Configures nextest to kill tests after 1 minute. The slow period is set
to 20s, which is how long our tests currently take in total; there will
be 2 warnings, and then the test will be killed and its output logged.
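For reference, the matching nextest configuration would look roughly
like this (values inferred from the description above):
```toml
# .config/nextest.toml: warn every 20s; terminate after 3 periods = 1 minute,
# which yields 2 "slow" warnings before the kill.
[profile.default]
slow-timeout = { period = "20s", terminate-after = 3 }
```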
Cc: #6361
Cc: #6368 -- likely this will be enough for a long time, but it will be
counterproductive when we want to attach and debug; the added line would
have to be commented out.
## Problem
Some fields were missed in the initial spec.
## Summary of changes
Adds a `success` boolean (defaults to false unless specifically marked
as successful).
Adds a `duration_us` integer that tracks how many microseconds elapsed
from session start to request completion.
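A hedged sketch of the two fields in context (struct name and field
placement are assumptions from the description, not the actual spec):
```rust
use serde::Serialize;

#[derive(Serialize, Default)]
struct ProxyRequestEvent {
    // ... existing spec fields elided ...
    success: bool,    // false unless explicitly marked successful
    duration_us: u64, // microseconds from session start to request completion
}

fn finish(started: std::time::Instant, success: bool) -> ProxyRequestEvent {
    ProxyRequestEvent {
        success,
        duration_us: started.elapsed().as_micros() as u64,
        ..Default::default()
    }
}
```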
From #6037 on, until this patch, if the client opens the connection but
doesn't send a `PagestreamFeMessage` within the first 10ms, we'd close
the connection because `self.timeline_cancelled()` returns.
It returns because `self.shard_timelines` is still empty at that point:
it gets filled lazily within the handlers for the incoming messages.
Changes
-------
The question is: if we can't check for timeline cancellation, what else
do we need to be cancellable for? `tenant.cancel` is also a bad choice
because the `tenant` (shard) we pick at the top of handle_pagerequests
might indeed go away over the course of the connection lifetime, but
other shards may still be there.
The correct solution, I think, is to be responsive to task_mgr
cancellation, because the connection handler runs in a task_mgr task,
and that is already the canonical way we shut down a tenant's /
timeline's page_service connections (see `Tenant::shutdown` /
`Timeline::shutdown`).
So, rename the function and make it sensitive to task_mgr cancellation.
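An illustrative read loop under that design, with stand-in types (an
mpsc channel stands in for the socket; the real handler reads
`PagestreamFeMessage` frames): shutdown is observed even before the
first message arrives.
```rust
use tokio::sync::mpsc;
use tokio_util::sync::CancellationToken;

async fn handle_pagerequests(
    shutdown: CancellationToken, // task_mgr-style cancellation
    mut messages: mpsc::Receiver<Vec<u8>>,
) {
    loop {
        tokio::select! {
            _ = shutdown.cancelled() => break, // tenant/timeline shutdown
            msg = messages.recv() => match msg {
                Some(_frame) => { /* decode and dispatch the message */ }
                None => break, // client closed the connection
            },
        }
    }
}
```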
In the most straightforward way: the safekeeper performs it in the
DELETE endpoint implementation, with no coordination between
safekeepers.
The delete_force endpoint in the code is renamed to delete, as there is
now only one way to delete.
## Problem
For PRs with the `run-benchmarks` label, we don't upload results to the
database, making it harder to debug such tests. The only way to see the
numbers is by examining the GitHub Actions output, which is really
inconvenient.
This PR adds zenbenchmark metrics to Allure reports.
## Summary of changes
- Create a JSON file with zenbenchmark results and attach it to the
Allure report
In 7f828890cf we changed the logic for persisting control files.
Previously, the control file was updated if `peer_horizon_lsn` jumped by
more than one segment, which meant `peer_horizon_lsn` was initialized on
disk as soon as the safekeeper received its first `AppendRequest`.
This caused an issue with `truncateLsn`, which can now sometimes be
zero. This PR fixes it: `truncateLsn`/`peer_horizon_lsn` can never be
zero once we know `timeline_start_lsn`.
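A tiny sketch of the resulting invariant (names are stand-ins; the
actual fix lives in the safekeeper's control-file update path):
```rust
type Lsn = u64; // stand-in for the real Lsn newtype

// Once timeline_start_lsn is known, never let the persisted horizon be
// zero: clamp it from below. A fresh timeline with peer_horizon_lsn == 0
// then persists timeline_start_lsn instead of 0.
fn effective_peer_horizon(peer_horizon_lsn: Lsn, timeline_start_lsn: Lsn) -> Lsn {
    std::cmp::max(peer_horizon_lsn, timeline_start_lsn)
}
```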
Closes https://github.com/neondatabase/neon/issues/6248