rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-17 21:20:37 +00:00

Author	SHA1	Message	Date
Vlad Lazar	35854928d9	pageserver: use identity file as node id authority and remove init command and config-override flags (#7766 ) Ansible will soon write the node id to `identity.toml` in the work dir for new pageservers. On the pageserver side, we read the node id from the identity file if it is present and use that as the source of truth. If the identity file is missing, cannot be read, or does not deserialise, start-up is aborted. This PR also removes the `--init` mode and the `--config-override` flag from the `pageserver` binary. The neon_local is already not using these flags anymore. Ansible still uses them until the linked change is merged & deployed, so, this PR has to land simultaneously or after the Ansible change due to that. Related Ansible change: https://github.com/neondatabase/aws/pull/1322 Cplane change to remove config-override usages: https://github.com/neondatabase/cloud/pull/13417 Closes: https://github.com/neondatabase/neon/issues/7736 Overall plan: https://www.notion.so/neondatabase/Rollout-Plan-simplified-pageserver-initialization-f935ae02b225444e8a41130b7d34e4ea?pvs=4 Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-07-23 11:41:12 +01:00
dotdister	1303d47778	Fix comment in Control Plane (#8406 ) ## Problem There are something wrong in the comment of `control_plane/src/broker.rs` and `control_plane/src/pageserver.rs` ## Summary of changes Fixed the comment about component name and their data path in `control_plane/src/broker.rs` and `control_plane/src/pageserver.rs`.	2024-07-18 09:33:46 +01:00
Christian Schwarz	e26ef640c1	pageserver: remove `trace_read_requests` (#8338 ) `trace_read_requests` is a per `Tenant`-object option. But the `handle_pagerequests` loop doesn't know which `Tenant` object (i.e., which shard) the request is for. The remaining use of the `Tenant` object is to check `tenant.cancel`. That check is incorrect [if the pageserver hosts multiple shards](https://github.com/neondatabase/neon/issues/7427#issuecomment-2220577518). I'll fix that in a future PR where I completely eliminate the holding of `Tenant/Timeline` objects across requests. See [my code RFC](https://github.com/neondatabase/neon/pull/8286) for the high level idea. Note that we can always bring the tracing functionality if we need it. But since it's actually about logging the `page_service` wire bytes, it should be a `page_service`-level config option, not per-Tenant. And for enabling tracing on a single connection, we can implement a `set pageserver_trace_connection;` option.	2024-07-11 15:17:07 +02:00
Christian Schwarz	1a49f1c15c	pageserver: move `page_service`'s `import basebackup` / `import wal` to mgmt API (#8292 ) I want to fix bugs in `page_service` ([issue](https://github.com/neondatabase/neon/issues/7427)) and the `import basebackup` / `import wal` stand in the way / make the refactoring more complicated. We don't use these methods anyway in practice, but, there have been some objections to removing the functionality completely. So, this PR preserves the existing functionality but moves it into the HTTP management API. Note that I don't try to fix existing bugs in the code, specifically not fixing * it only ever worked correctly for unsharded tenants * it doesn't clean up on error All errors are mapped to `ApiError::InternalServerError`.	2024-07-09 23:17:42 +02:00
John Spray	063553a51b	pageserver: remove tenant create API (#8135 ) ## Problem For some time, we have created tenants with calls to location_conf. The legacy "POST /v1/tenant" path was only used in some tests. ## Summary of changes - Remove the API - Relocate TenantCreateRequest to the controller API file (this used to be used in both pageserver and controller APIs) - Rewrite tenant_create test helper to use location_config API, as control plane and storage controller do - Update docker-compose test script to create tenants with location_config API (this small commit is also present in https://github.com/neondatabase/neon/pull/7947)	2024-06-28 09:14:19 +01:00
Peter Bendel	82266a252c	Allow longer timeout for starting pageserver, safe keeper and storage controller in test cases to make test cases less flaky (#8079 ) ## Problem see https://github.com/neondatabase/neon/issues/8070 ## Summary of changes the neon_local subcommands to - start neon - start pageserver - start safekeeper - start storage controller get a new option -t=xx or --start-timeout=xx which allows to specify a longer timeout in seconds we wait for the process start. This is useful in test cases where the pageserver has to read a lot of layer data, like in pagebench test cases. In addition we exploit the new timeout option in the python test infrastructure (python fixtures) and modify the flaky testcase to increase the timeout from 10 seconds to 1 minute. Example from the test execution ```bash RUST_BACKTRACE=1 NEON_ENV_BUILDER_USE_OVERLAYFS_FOR_SNAPSHOTS=1 DEFAULT_PG_VERSION=15 BUILD_TYPE=release ./scripts/pytest test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py ... 2024-06-19 09:29:34.590 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local storage_controller start --start-timeout=60s" 2024-06-19 09:29:36.365 INFO [broker.py:34] starting storage_broker to listen incoming connections at "127.0.0.1:15001" 2024-06-19 09:29:36.365 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local pageserver start --id=1 --start-timeout=60s" 2024-06-19 09:29:36.366 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local safekeeper start 1 --start-timeout=60s" ```	2024-06-21 10:36:12 +00:00
Yuchen Liang	30b890e378	feat(pageserver): use leases to temporarily block gc (#8084 ) Part of #7497, extracts from #7996, closes #8063. ## Problem With the LSN lease API introduced in https://github.com/neondatabase/neon/issues/7808, we want to implement the real lease logic so that GC will keep all the layers needed to reconstruct all pages at all the leased LSNs with valid leases at a given time. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-06-18 17:37:06 +00:00
Christian Schwarz	8728d5a5fd	neon_local: use `pageserver.toml` as source of truth for `struct PageServerConf` (#7642 ) Before this PR, `neon_local` would store a copy of a subset of the initial `pageserver.toml` in its `.neon/config`, e.g, `listen_pg_addr`. That copy is represented as `struct PageServerConf`. This copy was used to inform e.g., `neon_local endpoint` and other commands that depend on Pageserver about which port to connect to. The problem with that scheme is that the duplicated information in `.neon/config` can get stale if `pageserver.toml` is changed. This PR fixes that by eliminating populating `struct PageServerConf` from the `pageserver.toml`s. The `[[pageservers]]` TOML table in the `.neon/config` is obsolete. As of this PR, `neon_local` will fail to start and print an error informing about this change. Code-level changes: - Remove the `--pg-version` flag, it was only used for some checks during `neon_local init` - Remove the warn-but-continue behavior for when auth key creation fails but auth keys are not required. It's just complexity that is unjustified for a tool like `neon_local`. - Introduce a type-system-level distinction between the runtime state and the two (!) toml formats that are almost the same but not quite. - runtime state: `struct PageServerConf`, now without `serde` derives - toml format 1: the state in `.neon/config` => `struct OnDiskState` - toml format 2: the `neon_local init --config TMPFILE` that, unlike `struct OnDiskState`, allows specifying `pageservers` - Remove `[[pageservers]]` from the `struct OnDiskState` and load the data from the individual `pageserver.toml`s instead.	2024-05-08 14:32:21 +00:00
Christian Schwarz	02d42861e4	`neon_local init`: write `pageserver.toml` directly; no `pageserver --init --config-override` (#7638 ) This does to `neon_local` what https://github.com/neondatabase/aws/pull/1322 does to our production deployment. After both are merged, there are no users of `pageserver --init` / `pageserver --config-override` left, and we can remove those flags eventually.	2024-05-08 09:03:29 +00:00
Alex Chi Z	017c34b773	feat(pageserver): generate basebackup from aux file v2 storage (#7517 ) This pull request adds the new basebackup read path + aux file write path. In the regression test, all logical replication tests are run with matrix aux_file_v2=false/true. Also fixed the vectored get code path to correctly return missing key error when being called from the unified sequential get code path. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-05-07 16:30:18 +00:00
Christian Schwarz	308227fa51	remove `neon_local --pageserver-config-override` (#7614 ) Preceding PR https://github.com/neondatabase/neon/pull/7613 reduced the usage of `--pageserver-config-override`. This PR builds on top of that work and fully removes the `neon_local --pageserver-config-override`. Tests that need a non-default `pageserver.toml` control it using two options: 1. Specify `NeonEnvBuilder.pageserver_config_override` before `NeonEnvBuilder.init_start()`. This uses a new `neon_local init --pageserver-config` flag. 2. After `init_start()`: `env.pageserver.stop()` + `NeonPageserver.edit_config_toml()` + `env.pageserver.start()` A few test cases were using `env.pageserver.start(overrides=("--pageserver-config-override...",))`. I changed them to use one of the options above. Future Work ----------- The `neon_local init --pageserver-config` flag still uses `pageserver --config-override` under the hood. In the future, neon_local should just write the `pageserver.toml` directly. The `NeonEnvBuilder.pageserver_config_override` field should be renamed to `pageserver_initial_config`. Let's save this churn for a separate refactor commit.	2024-05-07 16:29:59 +00:00
Christian Schwarz	ac7dc82103	use less `neon_local --pageserver-config-override` / `pageserver -c` (#7613 )	2024-05-06 22:31:26 +02:00
Christian Schwarz	df1def7018	refactor(pageserver): remove --update-init flag (#7612 ) We don't actually use it. refs https://github.com/neondatabase/neon/issues/7555	2024-05-06 16:40:44 +02:00
Christian Schwarz	1e7cd6ac9f	refactor: move `NodeMetadata` to `pageserver_api`; use it from `neon_local` (#7606 ) This is the first step towards representing all of Pageserver configuration as clean `serde::Serialize`able Rust structs in `pageserver_api`. The `neon_local` code will then use those structs instead of the crude `toml_edit` / string concatenation that it does today. refs https://github.com/neondatabase/neon/issues/7555 --------- Co-authored-by: Alex Chi Z <iskyzh@gmail.com>	2024-05-03 13:15:38 -04:00
Vlad Lazar	1f417af9fd	pagserver: use vectored read path in benchmarks (#7498 ) ## Problem Benchmarks don't use the vectored read path. ## Summary of changes * Update the benchmarks to use the vectored read path for both singular and vectored gets. * Disable validation for the benchmarks	2024-04-29 17:26:35 +01:00
Alex Chi Z	dbe0aa653a	feat(pageserver): add aux-file-v2 flag on tenant level (#7505 ) Changing metadata format is not easy. This pull request adds a tenant-level flag on whether to enable aux file v2. As long as we don't roll this out to the user and guarantee our staging projects can persist tenant config correctly, we can test the aux file v2 change with setting this flag. Previous discussion at https://github.com/neondatabase/neon/pull/7424. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-04-26 11:48:47 -04:00
Vlad Lazar	e4a279db13	pageserver: coalesce read paths (#7477 ) ## Problem We are currently supporting two read paths. No bueno. ## Summary of changes High level: use vectored read path to serve get page requests - gated by `get_impl` config Low level: 1. Add ps config, `get_impl` to specify which read path to use when serving get page requests 2. Fix base cached image handling for the vectored read path. This was subtly broken: previously we would not mark keys that went past their cached lsn as complete. This is a self standing change which could be its own PR, but I've included it here because writing separate tests for it is tricky. 3. Fork get page to use either the legacy or vectored implementation 4. Validate the use of vectored read path when serving get page requests against the legacy implementation. Controlled by `validate_vectored_get` ps config. 5. Use the vectored read path to serve get page requests in tests (with validation). ## Note Since the vectored read path does not go through the page cache to read buffers, this change also amounts to a removal of the buffer page cache. Materialized page cache is still used.	2024-04-25 13:29:17 +01:00
Vlad Lazar	090123a429	pageserver: check for new image layers based on ingested WAL (#7230 ) ## Problem Part of the legacy (but current) compaction algorithm is to find a stack of overlapping delta layers which will be turned into an image layer. This operation is exponential in terms of the number of matching layers and we do it roughly every 20 seconds. ## Summary of changes Only check if a new image layer is required if we've ingested a certain amount of WAL since the last check. The amount of wal is expressed in terms of multiples of checkpoint distance, with the intuition being that that there's little point doing the check if we only have two new L1 layers (not enough to create a new image).	2024-03-28 17:44:55 +00:00
Alex Chi Z	a8384a074e	fixup(#7168 ): neon_local: use pageserver defaults for known but unspecified config overrides (#7166 ) e2e tests cannot run on macOS unless the file engine env var is supplied. ``` ./scripts/pytest test_runner/regress/test_neon_superuser.py -s ``` will fail with tokio-epoll-uring not supported. This is because we persist the file engine config by default. In this pull request, we only persist when someone specifies it, so that it can use the default platform-variant config in the page server. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-03-19 10:43:24 -04:00
John Spray	9752ad8489	pageserver, controller: improve secondary download APIs for large shards (#7131 ) ## Problem The existing secondary download API relied on the caller to wait as long as it took to complete -- for large shards that could be a long time, so typical clients that might have a baked-in ~30s timeout would have a problem. ## Summary of changes - Take a `wait_ms` query parameter to instruct the pageserver how long to wait: if the download isn't complete in this duration, then 201 is returned instead of 200. - For both 200 and 201 responses, include response body describing download progress, in terms of layers and bytes. This is sufficient for the caller to track how much data is being transferred and log/present that status. - In storage controller live migrations, use this API to apply a much longer outer timeout, with smaller individual per-request timeouts, and log the progress of the downloads. - Add a test that injects layer download delays to exercise the new behavior	2024-03-15 19:45:58 +00:00
Christian Schwarz	8075f0965a	fix(test suite) `virtual_file_io_engine` and `get_vectored_impl` patametrization doesn't work (#7113 ) # Problem While investigating #7124, I noticed that the benchmark was always using the `DEFAULT_*` `virtual_file_io_engine` , i.e., `tokio-epoll-uring` as of https://github.com/neondatabase/neon/pull/7077. The fundamental problem is that the `control_plane` code has its own view of `PageServerConfig`, which, I believe, will always be a subset of the real pageserver's `pageserver/src/config.rs`. For the `virtual_file_io_engine` and `get_vectored_impl` parametrization of the test suite, we were constructing a dict on the Python side that contained these parameters, then handed it to `control_plane::PageServerConfig`'s derived `serde::Deserialize`. The default in serde is to ignore unknown fields, so, the Deserialize impl silently ignored the fields. In consequence, the fields weren't propagated to the `pageserver --init` call, and the tests ended up using the `pageserver/src/config.rs::DEFAULT_` values for the respective options all the time. Tests that explicitly used overrides in `env.pageserver.start()` and similar were not affected by this. But, it means that all the test suite runs where with parametrization didn't properly exercise the code path. # Changes - use `serde(deny_unknown_fields)` to expose the problem - With this change, the Python tests that override `virtual_file_io_engine` and `get_vectored_impl` fail on `pageserver --init`, exposing the problem. - use destructuring to uncover the issue in the future - fix the issue by adding the missing fields to the `control_plane` crate's `PageServerConf` - A better solution would be for control plane to re-use a struct provided by the pageserver crate, so that everything is in one place in `pageserver/src/config.rs`, but, our config parsing code is (almost) beyond repair anyways. - fix the `pageserver_virtual_file_io_engine` to be responsive to the env var - => required to make parametrization work in benchmarks # Testing Before merging this PR, I re-ran the regression tests & CI with the full matrix of `virtual_file_io_engine` and `tokio-epoll-uring`, see `9c7ea364e0`	2024-03-14 11:18:55 +00:00
John Spray	7ae8364b0b	storage controller: register nodes in re-attach request (#7040 ) ## Problem Currently we manually register nodes with the storage controller, and use a script during deploy to register with the cloud control plane. Rather than extend that script further, nodes should just register on startup. ## Summary of changes - Extend the re-attach request to include an optional NodeRegisterRequest - If the `register` field is set, handle it like a normal node registration before executing the normal re-attach work. - Update tests/neon_local that used to rely on doing an explicit register step that could be enabled/disabled. --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-03-12 14:47:12 +00:00
John Spray	89cf714890	tests/neon_local: rename "attachment service" -> "storage controller" (#7087 ) Not a user-facing change, but can break any existing `.neon` directories created by neon_local, as the name of the database used by the storage controller changes. This PR changes all the locations apart from the path of `control_plane/attachment_service` (waiting for an opportune moment to do that one, because it's the most conflict-ish wrt ongoing PRs like #6676 )	2024-03-12 11:36:27 +00:00
John Spray	7329413705	storage controller: enable setting PlacementPolicy in tenant creation (#7037 ) ## Problem Tenants created via the storage controller have a `PlacementPolicy` that defines their HA/secondary/detach intent. For backward compat we can just set it to Single, for onboarding tenants using /location_conf it is automatically set to Double(1) if there are at least two pageservers, but for freshly created tenants we didn't have a way to specify it. This unblocks writing tests that create HA tenants on the storage controller and do failure injection testing. ## Summary of changes - Add optional fields to TenantCreateRequest for specifying PlacementPolicy. This request structure is used both on pageserver API and storage controller API, but this method is only meaningful for the storage controller (same as existing `shard_parameters` attribute). - Use the value from the creation request in tenant creation, if provided.	2024-03-08 15:34:53 +00:00
John Spray	a3ef50c9b6	storage controller: use 'lazy' mode for location_config (#6987 ) ## Problem If large numbers of shards are attached to a pageserver concurrently, for example after another node fails, it can cause excessive I/O queue depths due to all the newly attached shards trying to calculate logical sizes concurrently. #6907 added the `lazy` flag to handle this. ## Summary of changes - Use `lazy=true` from all /location_config calls in the storage controller Reconciler.	2024-03-06 11:26:29 +00:00
Joonas Koivunen	752bf5a22f	build: clippy disallow futures::pin_mut macro (#7016 ) `std` has had `pin!` macro for some time, there is no need for us to use the older alternatives. Cannot disallow `tokio::pin` because tokio macros use that.	2024-03-05 10:14:37 +00:00
John Spray	a8ec18c0f4	refactor: move storage controller API structs into pageserver_api (#6927 ) ## Problem This is a precursor to adding a convenience CLI for the storage controller. ## Summary of changes - move controller api structs into pageserver_api::controller_api to make them visible to other crates - rename pageserver_api::control_api to pageserver_api::upcall_api to match the /upcall/v1/ naming in the storage controller. Why here rather than a totally separate crate? It's convenient to have all the pageserver-related stuff in one place, and if we ever wanted to move it to a different crate it's super easy to do that later.	2024-02-27 17:24:01 +00:00
Arpad Müller	045bc6af8b	Add new compaction abstraction, simulator, and implementation. (#6830 ) Rebased version of #5234, part of #6768 This consists of three parts: 1. A refactoring and new contract for implementing and testing compaction. The logic is now in a separate crate, with no dependency on the 'pageserver' crate. It defines an interface that the real pageserver must implement, in order to call the compaction algorithm. The interface models things like delta and image layers, but just the parts that the compaction algorithm needs to make decisions. That makes it easier unit test the algorithm and experiment with different implementations. I did not convert the current code to the new abstraction, however. When compaction algorithm is set to "Legacy", we just use the old code. It might be worthwhile to convert the old code to the new abstraction, so that we can compare the behavior of the new algorithm against the old one, using the same simulated cases. If we do that, have to be careful that the converted code really is equivalent to the old. This inclues only trivial changes to the main pageserver code. All the new code is behind a tenant config option. So this should be pretty safe to merge, even if the new implementation is buggy, as long as we don't enable it. 2. A new compaction algorithm, implemented using the new abstraction. The new algorithm is tiered compaction. It is inspired by the PoC at PR #4539, although I did not use that code directly, as I needed the new implementation to fit the new abstraction. The algorithm here is less advanced, I did not implement partial image layers, for example. I wanted to keep it simple on purpose, so that as we add bells and whistles, we can see the effects using the included simulator. One difference to #4539 and your typical LSM tree implementations is how we keep track of the LSM tree levels. This PR doesn't have a permanent concept of a level, tier or sorted run at all. There are just delta and image layers. However, when compaction starts, we look at the layers that exist, and arrange them into levels, depending on their shapes. That is ephemeral: when the compaction finishes, we forget that information. This allows the new algorithm to work without any extra bookkeeping. That makes it easier to transition from the old algorithm to new, and back again. There is just a new tenant config option to choose the compaction algorithm. The default is "Legacy", meaning the current algorithm in 'main'. If you set it to "Tiered", the new algorithm is used. 3. A simulator, which implements the new abstraction. The simulator can be used to analyze write and storage amplification, without running a test with the full pageserver. It can also draw an SVG animation of the simulation, to visualize how layers are created and deleted. To run the simulator: cargo run --bin compaction-simulator run-suite --------- Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2024-02-27 17:15:46 +01:00
Vlad Lazar	5accf6e24a	attachment_service: JWT auth enforcement (#6897 ) ## Problem Attachment service does not do auth based on JWT scopes. ## Summary of changes Do JWT based permission checking for requests coming into the attachment service. Requests into the attachment service must use different tokens based on the endpoint: * `/control` and `/debug` require `admin` scope * `/upcall` requires `generations_api` scope * `/v1/...` requires `pageserverapi` scope Requests into the pageserver from the attachment service must use `pageserverapi` scope.	2024-02-26 18:17:06 +00:00
Christian Schwarz	dedf66ba5b	remove `gc_feedback` mechanism (#6863 ) It's been dead-code-at-runtime for 9 months, let's remove it. We can always re-introduce it at a later point. Came across this while working on #6861, which will touch `time_for_new_image_layer`. This is an opporunity to make that function simpler.	2024-02-26 10:05:24 +01:00
John Spray	9c48b5c4ab	controller: improved handling of offline nodes (#6846 ) Stacks on https://github.com/neondatabase/neon/pull/6823 - Pending a heartbeating mechanism (#6844 ), use /re-attach calls as a cue to mark an offline node as active, so that a node which is unavailable during controller startup doesn't require manual intervention if it later starts/restarts. - Tweak scheduling logic so that when we schedule the attached location for a tenant, we prefer to select from secondary locations rather than picking a fresh one. This is an interim state until we implement #6844 and full chaos testing for handling failures.	2024-02-22 14:01:06 +00:00
Christian Schwarz	ca07fa5f8b	per-TenantShard read throttling (#6706 )	2024-02-16 21:26:59 +01:00
Konstantin Knizhnik	9a9d9beaee	Download SLRU segments on demand (#6151 ) ## Problem See https://github.com/neondatabase/cloud/issues/8673 ## Summary of changes Download missed SLRU segments from page server ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2024-01-31 21:39:18 +02:00
John Spray	58f6cb649e	control_plane: database persistence for attachment_service (#6468 ) ## Problem Spun off from https://github.com/neondatabase/neon/pull/6394 -- this PR is just the persistence parts and the changes that enable it to work nicely ## Summary of changes - Revert #6444 and #6450 - In neon_local, start a vanilla postgres instance for the attachment service to use. - Adopt `diesel` crate for database access in attachment service. This uses raw SQL migrations as the source of truth for the schema, so it's a soft dependency: we can switch libraries pretty easily. - Rewrite persistence.rs to use postgres (via diesel) instead of JSON. - Preserve JSON read+write at startup and shutdown: this enables using the JSON format in compatibility tests, so that we don't have to commit to our DB schema yet. - In neon_local, run database creation + migrations before starting attachment service - Run the initial reconciliation in Service::spawn in the background, so that the pageserver + attachment service don't get stuck waiting for each other to start, when restarting both together in a test.	2024-01-26 17:20:44 +00:00
Christian Schwarz	689ad72e92	fix(neon_local): leaks child process if it fails to start & pass checks (#6474 ) refs https://github.com/neondatabase/neon/issues/6473 Before this PR, if process_started() didn't return Ok(true) until we ran out of retries, we'd return an error but leave the process running. Try it by adding a 20s sleep to the pageserver `main()`, e.g., right before we claim the pidfile. Without this PR, output looks like so: ``` (.venv) cs@devvm-mbp:[~/src/neon-work-2]: ./target/debug/neon_local start Starting neon broker at 127.0.0.1:50051. storage_broker started, pid: 2710939 . attachment_service started, pid: 2710949 Starting pageserver node 1 at '127.0.0.1:64000' in ".neon/pageserver_1"..... pageserver has not started yet, continuing to wait..... pageserver 1 start failed: pageserver did not start in 10 seconds No process is holding the pidfile. The process must have already exited. Leave in place to avoid race conditions: ".neon/pageserver_1/pageserver.pid" No process is holding the pidfile. The process must have already exited. Leave in place to avoid race conditions: ".neon/safekeepers/sk1/safekeeper.pid" Stopping storage_broker with pid 2710939 immediately....... storage_broker has not stopped yet, continuing to wait..... neon broker stop failed: storage_broker with pid 2710939 did not stop in 10 seconds Stopping attachment_service with pid 2710949 immediately....... attachment_service has not stopped yet, continuing to wait..... attachment service stop failed: attachment_service with pid 2710949 did not stop in 10 seconds ``` and we leak the pageserver process ``` (.venv) cs@devvm-mbp:[~/src/neon-work-2]: ps aux \| grep pageserver cs 2710959 0.0 0.2 2377960 47616 pts/4 Sl 14:36 0:00 /home/cs/src/neon-work-2/target/debug/pageserver -D .neon/pageserver_1 -c id=1 -c pg_distrib_dir='/home/cs/src/neon-work-2/pg_install' -c http_auth_type='Trust' -c pg_auth_type='Trust' -c listen_http_addr='127.0.0.1:9898' -c listen_pg_addr='127.0.0.1:64000' -c broker_endpoint='http://127.0.0.1:50051/' -c control_plane_api='http://127.0.0.1:1234/' -c remote_storage={local_path='../local_fs_remote_storage/pageserver'} ``` After this PR, there is no leaked process.	2024-01-25 19:20:02 +01:00
John Spray	b6ec11ad78	control_plane: generalize attachment_service to handle sharding (#6251 ) ## Problem To test sharding, we need something to control it. We could write python code for doing this from the test runner, but this wouldn't be usable with neon_local run directly, and when we want to write tests with large number of shards/tenants, Rust is a better fit efficiently handling all the required state. This service enables automated tests to easily get a system with sharding/HA without the test itself having to set this all up by hand: existing tests can be run against sharded tenants just by setting a shard count when creating the tenant. ## Summary of changes Attachment service was previously a map of TenantId->TenantState, where the principal state stored for each tenant was the generation and the last attached pageserver. This enabled it to serve the re-attach and validate requests that the pageserver requires. In this PR, the scope of the service is extended substantially to do overall management of tenants in the pageserver, including tenant/timeline creation, live migration, evacuation of offline pageservers etc. This is done using synchronous code to make declarative changes to the tenant's intended state (`TenantState.policy` and `TenantState.intent`), which are then translated into calls into the pageserver by the `Reconciler`. Top level summary of modules within `control_plane/attachment_service/src`: - `tenant_state`: structure that represents one tenant shard. - `service`: implements the main high level such as tenant/timeline creation, marking a node offline, etc. - `scheduler`: for operations that need to pick a pageserver for a tenant, construct a scheduler and call into it. - `compute_hook`: receive notifications when a tenant shard is attached somewhere new. Once we have locations for all the shards in a tenant, emit an update to postgres configuration via the neon_local `LocalEnv`. - `http`: HTTP stubs. These mostly map to methods on `Service`, but are separated for readability and so that it'll be easier to adapt if/when we switch to another RPC layer. - `node`: structure that describes a pageserver node. The most important attribute of a node is its availability: marking a node offline causes tenant shards to reschedule away from it. This PR is a precursor to implementing the full sharding service for prod (#6342). What's the difference between this and a production-ready controller for pageservers? - JSON file persistence to be replaced with a database - Limited observability. - No concurrency limits. Marking a pageserver offline will try and migrate every tenant to a new pageserver concurrently, even if there are thousands. - Very simple scheduler that only knows to pick the pageserver with fewest tenants, and place secondary locations on a different pageserver than attached locations: it does not try to place shards for the same tenant on different pageservers. This matters little in tests, because picking the least-used pageserver usually results in round-robin placement. - Scheduler state is rebuilt exhaustively for each operation that requires a scheduler. - Relies on neon_local mechanisms for updating postgres: in production this would be something that flows through the real control plane. --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2024-01-17 18:01:08 +00:00
John Spray	df9e9de541	pageserver: API updates for sharding (#6330 ) The theme of the changes in this PR is that they're enablers for #6251 which are superficial struct/api changes. This is a spinoff from #6251: - Various APIs + clients thereof take TenantShardId rather than TenantId - The creation API gets a ShardParameters member, which may be used to configure shard count and stripe size. This enables the attachment service to present a "virtual pageserver" creation endpoint that creates multiple shards. - The attachment service will use tenant size information to drive shard splitting. Make a version of `TenantHistorySize` that is usable for decoding these API responses. - ComputeSpec includes a shard stripe size.	2024-01-16 09:21:00 +00:00
John Spray	3c560d27a8	pageserver: implement secondary-mode downloads (#6123 ) Follows on from #6050 , in which we upload heatmaps. Secondary locations will now poll those heatmaps and download layers mentioned in the heatmap. TODO: - [X] ~Unify/reconcile stats for behind-schedule execution with warn_when_period_overrun (https://github.com/neondatabase/neon/pull/6050#discussion_r1426560695)~ - [x] Give downloads their own concurrency config independent of uploads Deferred optimizations: - https://github.com/neondatabase/neon/issues/6199 - https://github.com/neondatabase/neon/issues/6200 Eviction will be the next PR: - #5342	2024-01-05 12:29:20 +00:00
Christian Schwarz	1a9854bfb7	add a Rust client for Pageserver management API (#6127 ) Part of getpage@lsn benchmark epic: https://github.com/neondatabase/neon/issues/5771 This PR moves the control plane's spread-all-over-the-place client for the pageserver management API into a separate module within the pageserver crate. I need that client to be async in my benchmarking work, so, this PR switches to the async version of `reqwest`. That is also the right direction generally IMO. The switch to async in turn mandated converting most of the `control_plane/` code to async. Note that some of the client methods should be taking `TenantShardId` instead of `TenantId`, but, none of the callers seem to be sharding-aware. Leaving that for another time: https://github.com/neondatabase/neon/issues/6154	2023-12-15 18:33:45 +01:00
John Spray	c4e0ef507f	pageserver: heatmap uploads (#6050 ) Dependency (commits inline): https://github.com/neondatabase/neon/pull/5842 ## Problem Secondary mode tenants need a manifest of what to download. Ultimately this will be some kind of heat-scored set of layers, but as a robust first step we will simply use the set of resident layers: secondary tenant locations will aim to match the on-disk content of the attached location. ## Summary of changes - Add heatmap types representing the remote structure - Add hooks to Tenant/Timeline for generating these heatmaps - Create a new `HeatmapUploader` type that is external to `Tenant`, and responsible for walking the list of attached tenants and scheduling heatmap uploads. Notes to reviewers: - Putting the logic for uploads (and later, secondary mode downloads) outside of `Tenant` is an opinionated choice, motivated by: - Enable future smarter scheduling of operations, e.g. uploading the stalest tenant first, rather than having all tenants compete for a fair semaphore on a first-come-first-served basis. Similarly for downloads, we may wish to schedule the tenants with the hottest un-downloaded layers first. - Enable accessing upload-related state without synchronization (it belongs to HeatmapUploader, rather than being some Mutex<>'d part of Tenant) - Avoid further expanding the scope of Tenant/Timeline types, which are already among the largest in the codebase - You might reasonably wonder how much of the uploader code could be a generic job manager thing. Probably some of it: but let's defer pulling that out until we have at least two users (perhaps secondary downloads will be the second one) to highlight which bits are really generic. Compromises: - Later, instead of using digests of heatmaps to decide whether anything changed, I would prefer to avoid walking the layers in tenants that don't have changes: tracking that will be a bit invasive, as it needs input from both remote_timeline_client and Layer.	2023-12-14 13:09:24 +00:00
Arpad Müller	b71b8ecfc2	Add existing_initdb_timeline_id param to timeline creation (#5912 ) This PR adds an `existing_initdb_timeline_id` option to timeline creation APIs, taking an optional timeline ID. Follow-up of #5390. If the `existing_initdb_timeline_id` option is specified via the HTTP API, the pageserver downloads the existing initdb archive from the given timeline ID and extracts it, instead of running initdb itself. --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-11-30 22:32:04 +01:00
John Spray	57ae9cd07f	pageserver: add `flush_ms` and document `/location_config` API (#5860 ) - During migration of tenants, it is useful for callers to `/location_conf` to flush a tenant's layers while transitioning to AttachedStale: this optimization reduces the redundant WAL replay work that the tenant's new attached pageserver will have to do. Test coverage for this will come as part of the larger tests for live migration in #5745 #5842 - Flushing is controlled with `flush_ms` query parameter: it is the caller's job to decide how long they want to wait for a flush to complete. If flush is not complete within the time limit, the pageserver proceeds to succeed anyway: flushing is only an optimization. - Add swagger definitions for all this: the location_config API is the primary interface for driving tenant migration as described in docs/rfcs/028-pageserver-migration.md, and will eventually replace the various /attach /detach /load /ignore APIs. --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-11-30 14:22:07 +00:00
John Spray	ab631e6792	pageserver: make TenantsMap shard-aware (#5819 ) ## Problem When using TenantId as the key, we are unable to handle multiple tenant shards attached to the same pageserver for the same tenant ID. This is an expected scenario if we have e.g. 8 shards and 5 pageservers. ## Summary of changes - TenantsMap is now a BTreeMap instead of a HashMap: this enables looking up by range. In future, we will need this for page_service, as incoming requests will just specify the Key, and we'll have to figure out which shard to route it to. - A new key type TenantShardId is introduced, to act as the key in TenantsMap, and as the id type in external APIs. Its human readable serialization is backward compatible with TenantId, and also forward-compatible as long as sharding is not actually used (when we construct a TenantShardId with ShardCount(0), it serializes to an old-fashioned TenantId). - Essential tenant APIs are updated to accept TenantShardIds: tenant/timeline create, tenant delete, and /location_conf. These are the APIs that will enable driving sharded tenants. Other apis like /attach /detach /load /ignore will not work with sharding: those will soon be deprecated and replaced with /location_conf as part of the live migration work. Closes: #5787	2023-11-15 23:20:21 +02:00
John Spray	7709c91fe5	neon_local: use remote storage by default, add `cargo neon tenant migrate` (#5760 ) ## Problem Currently the only way to exercise tenant migration is via python test code. We need a convenient way for developers to do it directly in a neon local environment. ## Summary of changes - Add a `--num-pageservers` argument to `cargo neon init` so that it's easy to run with multiple pageservers - Modify default pageserver overrides in neon_local to set up `LocalFs` remote storage, as any migration/attach/detach stuff doesn't work in the legacy local storage mode. This also unblocks removing the pageserver's support for the legacy local mode. - Add a new `cargo neon tenant migrate` command that orchestrates tenant migration, including endpoints.	2023-11-14 09:51:51 +00:00
duguorong009	25a37215f3	fix: replace all `std::PathBuf`s with `camino::Utf8PathBuf` (#5352 ) Fixes #4689 by replacing all of `std::Path` , `std::PathBuf` with `camino::Utf8Path`, `camino::Utf8PathBuf` in - pageserver - safekeeper - control_plane - libs/remote_storage Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-10-04 17:52:23 +03:00
John Spray	7b6337db58	tests: enable multiple pageservers in `neon_local` and `neon_fixture` (#5231 ) ## Problem Currently our testing environment only supports running a single pageserver at a time. This is insufficient for testing failover and migrations. - Dependency of writing tests for #5207 ## Summary of changes - `neon_local` and `neon_fixture` now handle multiple pageservers - This is a breaking change to the `.neon/config` format: any local environments will need recreating - Existing tests continue to work unchanged: - The default number of pageservers is 1 - `NeonEnv.pageserver` is now a helper property that retrieves the first pageserver if there is only one, else throws. - Pageserver data directories are now at `.neon/pageserver_{n}` where n is 1,2,3... - Compatibility tests get some special casing to migrate neon_local configs: these are not meant to be backward/forward compatible, but they were treated that way by the test.	2023-09-08 16:19:57 +01:00
John Spray	61d661a6c3	pageserver: generation number fetch on startup and use in /attach (#5163 ) ## Problem - #5050 Closes: https://github.com/neondatabase/neon/issues/5136 ## Summary of changes - A new configuration property `control_plane_api` controls other functionality in this PR: if it is unset (default) then everything still works as it does today. - If `control_plane_api` is set, then on startup we call out to control plane `/re-attach` endpoint to discover our attachments and their generations. If an attachment is missing from the response we implicitly detach the tenant. - Calls to pageserver `/attach` API may include a `generation` parameter. If `control_plane_api` is set, then this parameter is mandatory. - RemoteTimelineClient's loading of index_part.json is generation-aware, and will try to load the index_part with the most recent generation <= its own generation. - The `neon_local` testing environment now includes a new binary `attachment_service` which implements the endpoints that the pageserver requires to operate. This is on by default if running `cargo neon` by hand. In `test_runner/` tests, it is off by default: existing tests continue to run with in the legacy generation-less mode. Caveats: - The re-attachment during startup assumes that we are only re-attaching tenants that have previously been attached, and not totally new tenants -- this relies on the control plane's attachment logic to keep retrying so that we should eventually see the attach API call. That's important because the `/re-attach` API doesn't tell us which timelines we should attach -- we still use local disk state for that. Ref: https://github.com/neondatabase/neon/issues/5173 - Testing: generations are only enabled for one integration test right now (test_pageserver_restart), as a smoke test that all the machinery basically works. Writing fuller tests that stress tenant migration will come later, and involve extending our test fixtures to deal with multiple pageservers. - I'm not in love with "attachment_service" as a name for the neon_local component, but it's not very important because we can easily rename these test bits whenever we want. - Limited observability when in re-attach on startup: when I add generation validation for deletions in a later PR, I want to wrap up the control plane API calls in some small client class that will expose metrics for things like errors calling the control plane API, which will act as a strong red signal that something is not right. Co-authored-by: Christian Schwarz <christian@neon.tech> Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-09-06 14:44:48 +01:00
Heikki Linnakangas	df3bae2ce3	Use `compute_ctl` to manage Postgres in tests. (#3886 ) This adds test coverage for 'compute_ctl', as it is now used by all the python tests. There are a few differences in how 'compute_ctl' is called in the tests, compared to the real web console: - In the tests, the postgresql.conf file is included as one large string in the spec file, and it is written out as it is to the data directory. I added a new field for that to the spec file. The real web console, however, sets all the necessary settings in the 'settings' field, and 'compute_ctl' creates the postgresql.conf from those settings. - In the tests, the information needed to connect to the storage, i.e. tenant_id, timeline_id, connection strings to pageserver and safekeepers, are now passed as new fields in the spec file. The real web console includes them as the GUCs in the 'settings' field. (Both of these are different from what the test control plane used to do: It used to write the GUCs directly in the postgresql.conf file). The plan is to change the control plane to use the new method, and remove the old method, but for now, support both. Some tests that were sensitive to the amount of WAL generated needed small changes, to accommodate that compute_ctl runs the background health monitor which makes a few small updates. Also some tests shut down the pageserver, and now that the background health check can run some queries while the pageserver is down, that can produce a few extra errors in the logs, which needed to be allowlisted. Other changes: - remove obsolete comments about PostgresNode; - create standby.signal file for Static compute node; - log output of `compute_ctl` and `postgres` is merged into `endpoints/compute.log`. --------- Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>	2023-06-06 14:59:36 +01:00
Konstantin Knizhnik	952d6e43a2	Add pageserver parameter forced_image_creation_limit which can be used… (#4353 ) This parameter can be use to restrict number of image layers generated because of GC request (wanted image layers). Been set to zero it completely eliminates creation of such image layers. So it allows to avoid extra storage consumption after merging #3673 ## Problem PR #3673 forces generation of missed image layers. So i short term is cause cause increase (in worst case up to two times) size of storage. It was intended (by me) that GC period is comparable with PiTR interval. But looks like it is not the case now - GC is performed much more frequently. It may cause the problem with space exhaustion: GC forces new image creation while large PiTR still prevent GC from collecting old layers. ## Summary of changes Add new pageserver parameter` forced_image_creation_limit` which restrict number of created image layers which are requested by GC. ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist	2023-05-31 21:37:20 +03:00
Heikki Linnakangas	a560b28829	Make new tenant/timeline IDs mandatory in create APIs. (#4304 ) We used to generate the ID, if the caller didn't specify it. That's bad practice, however, because network is never fully reliable, so it's possible we create a new tenant but the caller doesn't know about it, and because it doesn't know the tenant ID, it has no way of retrying or checking if it succeeded. To discourage that, make it mandatory. The web control plane has not relied on the auto-generation for a long time.	2023-05-26 16:19:36 +03:00

1 2

72 Commits