rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-27 01:50:38 +00:00

Author	SHA1	Message	Date
Christian Schwarz	de780d2e0f	make TenantState::{Loading,Attaching,Activating} owned by spawn_load / spawn_attach See the Mermaid diagram in the doc comment for the now-possible state transitions. The two core insights / changes are: - spawn_load and spawn_attach own the tenant state until they're done - once load()/attach() calls are done - if they failed, transition them to Broken directly (we know that there's no background activity because we didn't call activate yet) - if they succeed, call activate. We can make it infallible. How? Later. - set_broken() and set_stopping() are changed to wait for spawn_load() / spawn_attach() to finish. This sounds scary because it might hinder detach or shutdown, but actually, concurrent attach+detach, or attach+shutdown, or load+shutdown, or attach+shutdown were just racy. With this change, they're not anymore. We can add a CancellationToken stored in Tenant for load/attach and cancel it from set_stopping() or set_broken() if necessary in the future. So, why can activate() be infallible now: because we declare that spawn_load and spawn_attach own the tenant state until they're done. And we enforce that ownership using the wait_for at the start of set_stopping and set_broken.	2023-05-25 15:02:43 +02:00
Christian Schwarz	f18d9f555b	Revert "Revert "use tokio::sync::watch::Receiver::wait_for"" This reverts commit `eaf270c648`.	2023-05-25 14:58:49 +02:00
Christian Schwarz	05a2fe08d1	Merge branch 'problame/infallible-timeline-activate/4-make-infallible' into problame/async-timeline-get/dont-hold-timelines-lock-inside-tenant-state-send-modify	2023-05-25 14:58:19 +02:00
Christian Schwarz	eaf270c648	Revert "use tokio::sync::watch::Receiver::wait_for" This reverts commit `fe4ef121b6`.	2023-05-25 14:57:41 +02:00
Christian Schwarz	ddad0928c5	Merge branch 'problame/infallible-timeline-activate/3-funnel-storage-broker-client' into problame/infallible-timeline-activate/4-make-infallible	2023-05-25 14:53:32 +02:00
Christian Schwarz	96c550222b	apply heikki's comment suggestion	2023-05-25 14:53:20 +02:00
Christian Schwarz	cf8ff7edad	explainer comment on storage_broker::connect async weirdness	2023-05-25 14:51:48 +02:00
Christian Schwarz	da6573f551	Merge branch 'problame/infallible-timeline-activate/3-funnel-storage-broker-client' into problame/infallible-timeline-activate/4-make-infallible	2023-05-25 10:54:30 +02:00
Christian Schwarz	2fee8c884f	Merge remote-tracking branch 'origin/main' into problame/infallible-timeline-activate/3-funnel-storage-broker-client	2023-05-25 10:54:03 +02:00
Christian Schwarz	e11ba24ec5	tenant loops: operate on the Arc<Tenant> directly (#4298 ) (Instead of going through mgr every iteration.) The `wait_for_active_tenant` function's `wait` argument could be removed because it was only used for the loop that waits for the tenant to show up in the tenants map. Since we're passing the tenant in, we no longer need to get it from the tenants map. NB that there's no guarantee that the tenant object is in the tenants map at the time the background loop function starts running. But the tenant mgr guarantees that it will be quite soon. See `tenant_map_insert` way upwards in the call hierarchy for details. This is prep work to eliminate `subscribe_for_state_updates` (PR #4299 ) Fixes: #3501	2023-05-25 10:49:09 +02:00
Christian Schwarz	fe4ef121b6	use tokio::sync::watch::Receiver::wait_for	2023-05-25 10:44:26 +02:00
Christian Schwarz	641ca994dc	assert_eq suggestion	2023-05-25 09:55:32 +02:00
Alex Chi Z	7126197000	pagectl: refactor ctl and support dump kv in delta (#4268 ) This PR refactors the original page_binutils with a single tool pagectl, use clap derive for better command line parsing, and adds the dump kv tool to extract information from delta file. This helps me better understand what's inside the page server. We can add support for other types of file and more functionalities in the future. --------- Signed-off-by: Alex Chi <iskyzh@gmail.com>	2023-05-24 19:36:07 +03:00
Christian Schwarz	413598b19b	fix merge fallout (?)	2023-05-24 17:42:51 +02:00
Christian Schwarz	b345f32e3f	Merge branch 'problame/infallible-timeline-activate/4-make-infallible' into problame/async-timeline-get/dont-hold-timelines-lock-inside-tenant-state-send-modify	2023-05-24 17:25:35 +02:00
Christian Schwarz	69cfa9fe61	launch_wal_receiver: apply joonas's review suggestion (visibility + doc comment)	2023-05-24 17:20:03 +02:00
Christian Schwarz	2c424c8f4e	Revert "activate_timelines counter is now == not_broken_timelines.len()" not_broken_timelines is an iterator, doesn't have `len()`. This reverts commit `4001f441c0`.	2023-05-24 17:19:22 +02:00
Christian Schwarz	4001f441c0	activate_timelines counter is now == not_broken_timelines.len()	2023-05-24 17:14:49 +02:00
Christian Schwarz	ef956c47fc	make it clear that `walreceiver_status` is always used in the branch where it's produced	2023-05-24 17:12:35 +02:00
Christian Schwarz	8606b6abe5	Merge remote-tracking branch 'origin/problame/infallible-timeline-activate/3-funnel-storage-broker-client' into problame/infallible-timeline-activate/4-make-infallible	2023-05-24 17:02:18 +02:00
Christian Schwarz	732f60317b	Merge remote-tracking branch 'origin/main' into problame/infallible-timeline-activate/3-funnel-storage-broker-client	2023-05-24 16:58:25 +02:00
Christian Schwarz	afc48e2cd9	refactor responsibility for tenant/timeline activation (#4317 ) (This is prep work to make `Timeline::activate()` infallible.) The current possibility for failure in `Timeline::activate()` is the broker client's presence / absence. It should be an assert, but we're careful with these. So, I'm planning to pass in the broker client to activate(), thereby eliminating the possibility of its absence. In the unit tests, we don't have a broker client. So, I thought I'd be in trouble because the unit tests also called `activate()` before this PR. However, closer inspection reveals a long-standing FIXME about this, which is addressed by this patch. It turns out that the unit tests don't actually need the background loops to be running. They just need the state value to be `Active`. So, for the tests, we just set it to that value but don't spawn the background loops. We'll need to revisit this if we ever do more Rust unit tests in the future. But right now, this refactoring improves the code, so, let's revisit when we get there. Patch series: - #4316 - #4317 - #4318 - #4319	2023-05-24 16:54:11 +02:00
Christian Schwarz	df52587bef	attach-time tenant config (#4255 ) This PR adds support for supplying the tenant config upon /attach. Before this change, when relocating a tenant using `/detach` and `/attach`, the tenant config after `/attach` would be the default config from `pageserver.toml`. That is undesirable for settings such as the PITR-interval: if the tenant's config on the source was `30 days` and the default config on the attach-side is `7 days`, then the first GC run would eradicate 23 days worth of PITR capability. The API change is backwards-compatible: if the body is empty, we continue to use the default config. We'll remove that capability as soon as the cloud.git code is updated to use attach-time tenant config (https://github.com/neondatabase/neon/issues/4282 keeps track of this). unblocks https://github.com/neondatabase/cloud/issues/5092 fixes https://github.com/neondatabase/neon/issues/1555 part of https://github.com/neondatabase/neon/issues/886 (Tenant Relocation) Implementation ============== The preliminary PRs for this work were (most-recent to least-recent) * https://github.com/neondatabase/neon/pull/4279 * https://github.com/neondatabase/neon/pull/4267 * https://github.com/neondatabase/neon/pull/4252 * https://github.com/neondatabase/neon/pull/4235	2023-05-24 17:46:30 +03:00
Christian Schwarz	b54431bbd3	pass the BrokerClientChannel by value & clone it as necessary It's a wrapper around an inner Arc anyways Also, this gets rid of the OnceCell	2023-05-24 12:29:05 +02:00
Christian Schwarz	def5eb8542	Merge branch 'problame/infallible-timeline-activate/2-pushup-tenant-and-timeline-activation' into problame/infallible-timeline-activate/3-funnel-storage-broker-client	2023-05-24 11:57:37 +02:00
Christian Schwarz	07da786ed3	apply joonas's suggestion to use parent: None + follows_from	2023-05-24 11:56:26 +02:00
Christian Schwarz	75c3c43b2e	don't unwrap() the `activate()` result in spawn_load / spawn_attach	2023-05-24 11:36:07 +02:00
Christian Schwarz	bdf03eab58	Merge branch 'problame/infallible-timeline-activate/2-pushup-tenant-and-timeline-activation' into problame/infallible-timeline-activate/3-funnel-storage-broker-client	2023-05-24 11:32:38 +02:00
Christian Schwarz	32c85fa87a	Merge remote-tracking branch 'origin/main' into problame/infallible-timeline-activate/2-pushup-tenant-and-timeline-activation	2023-05-24 11:31:00 +02:00
Konstantin Knizhnik	417f37b2e8	Pass set of wanted image layers from GC to compaction (#3673 ) ## Describe your changes Right now the only criterion for image layer generation is the number of delta layers since the last image layer. If we have a "stairs" layout of delta layers (see link below), then there can be a lot of old delta layers which cannot be reclaimed by GC because they are not fully covered by image layers. This PR constructs a list of "wanted" image layers in GC (the image layers needed to be able to remove old layers) and passes this list to the compaction task, which generates the image layers. So, in addition to the delta count criterion, we now also take into account the "wishes" of GC. ## Issue ticket number and link See https://neondb.slack.com/archives/C033RQ5SPDH/p1676914249982519 ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2023-05-24 08:01:41 +03:00
Christian Schwarz	00f7fc324d	tenant_map_insert: don't expose the vacant entry to the closure (#4316 ) This tightens up the API a little. Byproduct of some refactoring work that I'm doing right now.	2023-05-23 15:16:12 -04:00
Christian Schwarz	b2e0c58a8c	Merge branch 'problame/infallible-timeline-activate/4-make-infallible' into problame/async-timeline-get/dont-hold-timelines-lock-inside-tenant-state-send-modify	2023-05-23 20:44:34 +02:00
Christian Schwarz	94f30f0660	Merge branch 'problame/infallible-timeline-activate/3-funnel-storage-broker-client' into problame/infallible-timeline-activate/4-make-infallible	2023-05-23 20:44:12 +02:00
Christian Schwarz	a55d224923	tests would fail because broker client needs to be launched on a tokio runtime thread	2023-05-23 20:43:10 +02:00
Christian Schwarz	4f586ac101	Merge branch 'problame/infallible-timeline-activate/2-pushup-tenant-and-timeline-activation' into problame/infallible-timeline-activate/3-funnel-storage-broker-client	2023-05-23 20:42:54 +02:00
Christian Schwarz	feb2e80b83	tests were failing because activate() was outside of a span with tenant_id	2023-05-23 20:36:32 +02:00
Christian Schwarz	ee22e81583	don't hold timelines lock inside set_stopping()	2023-05-23 20:11:15 +02:00
Christian Schwarz	3e604eaa39	refactor: introduce TenantState::Activating to avoid holding timelines lock inside Tenant::activate	2023-05-23 20:03:12 +02:00
Christian Schwarz	8bcb542a3b	refactor: make timeline activation infallible Timeline::activate() was only fallible because `launch_wal_receiver` was. `launch_wal_receiver` was fallible only because of some preliminary checks in `WalReceiver::start`. Turns out these checks can be shifted to the type system by delaying creation of the `WalReceiver` struct to the point where we activate the timeline. The changes in this PR were enabled by my previous refactoring that funneled the broker_client from pageserver startup to the activate() call sites.	2023-05-23 19:27:06 +02:00
Christian Schwarz	17b081d294	refactor: eliminate global storage_broker client state (This is prep work to make `Timeline::activate` infallible.) This patch removes the global storage_broker client instance from the pageserver codebase. Instead, pageserver startup instantiates it and passes it down to the `Timeline::activate` function, which in turn passes it to the WalReceiver, which is the entity that actually uses it.	2023-05-23 19:27:06 +02:00
Christian Schwarz	d5337e6a65	refactor responsibility for tenant/timeline activation (This is prep work to make `Timeline::activate()` infallible.) The current possibility for failure in `Timeline::activate()` is the broker client's presence / absence. It should be an assert, but we're careful with these. So, I'm planning to pass in the broker client to activate(), thereby eliminating the possibility of its absence. In the unit tests, we don't have a broker client. So, I thought I'd be in trouble because the unit tests also called `activate()` before this PR. However, closer inspection reveals a long-standing FIXME about this, which is addressed by this patch. It turns out that the unit tests don't actually need the background loops to be running. They just need the state value to be `Active`. So, for the tests, we just set it to that value but don't spawn the background loops. We'll need to revisit this if we ever do more Rust unit tests in the future. But right now, this refactoring improves the code, so, let's revisit when we get there.	2023-05-23 19:26:36 +02:00
Christian Schwarz	cc96a5186d	tenant_map_insert: don't expose the vacant entry to the closure This tightens up the API a little. Byproduct of some refactoring work that I'm doing right now.	2023-05-23 19:25:47 +02:00
Christian Schwarz	4d41b2d379	fix: `max_lsn_wal_lag` broken in tenant conf (#4279 ) This patch fixes parsing of the `max_lsn_wal_lag` tenant config item. We were incorrectly expecting a string before, but the type is a NonZeroU64. So, when setting it in the config, the (updated) test case would fail with ``` E psycopg2.errors.InternalError_: Tenant a1fa9cc383e32ddafb73ff920de5f2e6 will not become active. Current state: Broken due to: Failed to parse config from file '.../repo/tenants/a1fa9cc383e32ddafb73ff920de5f2e6/config' as pageserver config: configure option max_lsn_wal_lag is not a string. Backtrace: ``` So, not even the assertions added are necessary. The test coverage for tenant config is rather thin in general. For example, the `test_tenant_conf.py` test doesn't cover all the options. I'll add a new regression test as part of attach-time-tenant-conf PR https://github.com/neondatabase/neon/pull/4255	2023-05-23 16:29:59 +03:00
Shany Pozin	d6cf347670	Add an option to set "latest gc cutoff lsn" in pageserver binutils (#4290 ) ## Problem [#2539](https://github.com/neondatabase/neon/issues/2539) ## Summary of changes Add support for latest_gc_cutoff_lsn update in pageserver_binutils	2023-05-23 15:48:43 +03:00
Christian Schwarz	b391c94440	tenant create / update-config: reject unknown fields (#4267 ) This PR enforces that the tenant create / update-config APIs reject requests with unknown fields. This is a desirable property because some tenant config settings control the lifetime of user data (e.g., GC horizon or PITR interval). Suppose we inadvertently rename the `pitr_interval` field in the Rust code. Then, right now, a client that still uses the old name will send a tenant config request to configure a new PITR interval. Before this PR, we would accept such a request, ignore the old name field, and use the pageserver.toml default value for what the new PITR interval is. With this PR, we will instead reject such a request. One might argue that the client could simply check whether the config it sent has been applied, using the `/v1/tenant/.../config` endpoint. That is correct for tenant create and update-config. But, attach will soon [^1] grow the ability to have attach-time config as well. If we ignore unknown fields and fall back to global defaults in that case, we risk data loss. Example: 1. Default PITR in pageservers is 7 days. 2. Create a tenant and set its PITR to 30 days. 3. For 30 days, fill the tenant continuously with data. 4. Detach the tenant. 5. Attach tenant. Attach must use the 30-day PITR setting in this scenario. If it were to fall back to the 7-day default value, we would lose 23 days of PITR capability for the tenant. So, the PR that adds attach-time tenant config will build on the (clunky) infrastructure added in this PR [^1]: https://github.com/neondatabase/neon/pull/4255 Implementation Notes ==================== This could have been a simple `#[serde(deny_unknown_fields)]` but sadly, that is documented- but silent-at-compile-time-incompatible with `#[serde(flatten)]`. But we are still using this by adding an outer struct and using unit tests to ensure it is correct. 
`neon_local tenant config` now uses the `.remove()` pattern + bail if there are leftover config args. That's in line with what `neon_local tenant create` does. We should dedupe that logic in a future PR. --------- Signed-off-by: Alex Chi <iskyzh@gmail.com> Co-authored-by: Alex Chi <iskyzh@gmail.com>	2023-05-18 21:16:09 -04:00
Anastasia Lubennikova	8ebae74c6f	Fix handling of XLOG_XACT_COMMIT/ABORT: Previously we didn't handle XACT_XINFO_HAS_INVALS and XACT_XINFO_HAS_DROPPED_STAT correctly, which led to getting incorrect value of twophase_xid for records with XACT_XINFO_HAS_TWOPHASE. This caused 'twophase file for xid {} does not exist' errors in test_isolation	2023-05-18 14:36:45 +01:00
Christian Schwarz	89307822b0	mgmt api: share a single tenant config model struct in Rust and OpenAPI (#4252 ) This is prep for https://github.com/neondatabase/neon/pull/4255 [1/X] OpenAPI: share a single definition of TenantConfig DRYs up the pageserver OpenAPI YAML's representation of tenant config. All the fields of tenant config are now located in a model schema called TenantConfig. The tenant create & config-change endpoints have separate schemas, TenantCreateInfo and TenantConfigureArg, respectively. These schemas inherit from TenantConfig, using allOf 1. The tenant config-GET handler's response was previously named TenantConfig. It's now named TenantConfigResponse. None of these changes affect how the request looks on the wire. The generated Go code will change for Console because the OpenAPI code generator maps `allOf` to a Go struct embedding. Luckily, usage of tenant config in Console is still very lightweight, but that will change in the near future. So, this is a good chance to set things straight. The console changes are tracked in https://github.com/neondatabase/cloud/pull/5046 [2/x]: extract the tenant config parts of create & config requests [3/x]: code movement: move TenantConfigRequestConfig next to TenantCreateRequestConfig [4/x] type-alias TenantConfigRequestConfig = TenantCreateRequestConfig; They are exactly the same. 
[5/x] switch to qualified use for tenant create/config request api models [6/x] rename models::TenantConfig{RequestConfig,} and remove the alias [7/x] OpenAPI: sync tenant create & configure body names from Rust code [8/x]: dedupe the two TryFrom<...> for TenantConfOpt impls The only difference is that the TenantConfigRequest impl does ``` tenant_conf.max_lsn_wal_lag = request_data.max_lsn_wal_lag; tenant_conf.trace_read_requests = request_data.trace_read_requests; ``` and the TenantCreateRequest impl does ``` if let Some(max_lsn_wal_lag) = request_data.max_lsn_wal_lag { tenant_conf.max_lsn_wal_lag = Some(max_lsn_wal_lag); } if let Some(trace_read_requests) = request_data.trace_read_requests { tenant_conf.trace_read_requests = Some(trace_read_requests); } ``` As far as I can tell, these are identical.	2023-05-17 12:31:17 +02:00
Christian Schwarz	1bceceac5a	add helper to debug_assert that current span has a TenantId (#4248 ) We already have `debug_assert_current_span_has_tenant_and_timeline_id`. Have the same for just TenantId.	2023-05-17 11:03:46 +02:00
Christian Schwarz	4431779e32	refactor: attach: use create_tenant_files + schedule_local_tenant_processing (#4235 ) With this patch, the attach handler now follows the same pattern as tenant create with regards to instantiation of the new tenant: 1. Prepare on-disk state using `create_tenant_files`. 2. Use the same code path as pageserver startup to load it into memory and start background loops (`schedule_local_tenant_processing`). It's a bit sad we can't use the `PageServerConfig::tenant_attaching_mark_file_path` method inside `create_tenant_files` because it operates in a temporary directory. However, it's a small price to pay for the gained simplicity. During implementation, I noticed that we don't handle failures post `create_tenant_files` well. I left TODO comments in the code linking to the issue that I created for this [^1]. Also, I'll dedupe the spawn_load and spawn_attach code in a future commit. refs https://github.com/neondatabase/neon/issues/1555 part of https://github.com/neondatabase/neon/issues/886 (Tenant Relocation) [^1]: https://github.com/neondatabase/neon/issues/4233	2023-05-16 12:53:17 -04:00
Joseph Koshakow	511b0945c3	Replace usages of wait_for_active_timeline (#4243 ) This commit replaces all usages of connection_manager.rs: wait_for_active_timeline with Timeline::wait_to_become_active. wait_to_become_active is better and in the right module. close https://github.com/neondatabase/neon/issues/4189 Co-authored-by: Shany Pozin <shany@neon.tech>	2023-05-16 10:38:39 -04:00

1344 Commits