- add parse_query_param()
- use Cow<> where possible
- move param parsing code to utils::http::request
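A minimal sketch of what such a helper could look like (the exact signature, error type, and where `Cow` ends up in the real `utils::http::request` module may differ):

```rust
use std::borrow::Cow;
use std::str::FromStr;
use hyper::Request;

/// Hypothetical helper: find a query parameter in the request URI and parse it into T.
/// Returns Ok(None) if the parameter is absent.
fn parse_query_param<T: FromStr, B>(request: &Request<B>, name: &str) -> Result<Option<T>, String> {
    let Some(query) = request.uri().query() else {
        return Ok(None);
    };
    for pair in query.split('&') {
        let mut it = pair.splitn(2, '=');
        let (key, value) = (it.next().unwrap_or(""), it.next().unwrap_or(""));
        if key == name {
            // A Cow lets a percent-decoding step borrow the input when no
            // decoding is needed and allocate only when it is.
            let value: Cow<'_, str> = Cow::Borrowed(value);
            return value
                .parse::<T>()
                .map(Some)
                .map_err(|_| format!("failed to parse query param '{name}'"));
        }
    }
    Ok(None)
}
```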
This was originally PR https://github.com/neondatabase/neon/pull/3502
which targeted a different branch.
closes #3510
Related to: https://github.com/neondatabase/neon/issues/2848
`pageserver_storage_operations_seconds` is the most expensive metric we
have, as there are a lot of tenants/timelines and the histogram had 42
buckets. These are quite sparse too, so instead of having a histogram
per timeline, create a new histogram
`pageserver_storage_operations_seconds_global` without tenant and
timeline dimensions, and replace `pageserver_storage_operations_seconds`
with a sum and a counter.
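A rough sketch of the shape of this change, using the `prometheus` crate directly (the `_sum`/`_count` metric names and label sets here are assumptions):

```rust
use once_cell::sync::Lazy;
use prometheus::{register_counter_vec, register_histogram_vec, CounterVec, HistogramVec};

// One global histogram without tenant/timeline labels, so the 42-bucket series
// is not multiplied by the number of timelines.
static STORAGE_TIME_GLOBAL: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "pageserver_storage_operations_seconds_global",
        "Time spent on storage operations",
        &["operation"]
    )
    .expect("failed to register metric")
});

// Per-timeline observability is reduced to what the histogram's _sum series provided;
// a matching per-timeline count counter would be defined the same way.
static STORAGE_TIME_SUM_PER_TIMELINE: Lazy<CounterVec> = Lazy::new(|| {
    register_counter_vec!(
        "pageserver_storage_operations_seconds_sum",
        "Sum of time spent on storage operations",
        &["operation", "tenant_id", "timeline_id"]
    )
    .expect("failed to register metric")
});
```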
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Describe your changes
Added a metric that allows monitoring tenant states.
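A minimal sketch of what such a state metric could look like (the metric name, label, and hook below are assumptions, not necessarily the shipped code):

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_gauge_vec, IntGaugeVec};

// One gauge series per tenant state, adjusted on every state transition.
static TENANT_STATE_METRIC: Lazy<IntGaugeVec> = Lazy::new(|| {
    register_int_gauge_vec!(
        "pageserver_tenant_states_count",
        "Number of tenants per state",
        &["state"]
    )
    .expect("failed to register metric")
});

// Hypothetical hook called from the tenant state machine.
fn on_tenant_state_change(old_state: &str, new_state: &str) {
    TENANT_STATE_METRIC.with_label_values(&[old_state]).dec();
    TENANT_STATE_METRIC.with_label_values(&[new_state]).inc();
}
```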
## Issue ticket number and link
https://github.com/neondatabase/neon/issues/3161
## Checklist before requesting a review
- [X] I have performed a self-review of my code.
- [X] I have added an e2e test for it.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Describe your changes
Whenever a tenant is detached or the pageserver is restarted, the
`pageserver_last_record_lsn` metric is dropped.
This fix resurrects the value from the metadata whenever the tenant is
attached again.
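A rough sketch of the fix's shape (the gauge registration and helper names are illustrative; the real code reads the LSN from the timeline metadata on attach):

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_gauge_vec, IntGaugeVec};

static LAST_RECORD_LSN: Lazy<IntGaugeVec> = Lazy::new(|| {
    register_int_gauge_vec!(
        "pageserver_last_record_lsn",
        "Last record LSN per timeline",
        &["tenant_id", "timeline_id"]
    )
    .expect("failed to register metric")
});

// Hypothetical: called while (re-)attaching a timeline, before WAL streaming resumes,
// so the gauge reflects the on-disk metadata instead of reading 0.
fn restore_last_record_lsn_metric(tenant_id: &str, timeline_id: &str, lsn_from_metadata: u64) {
    LAST_RECORD_LSN
        .with_label_values(&[tenant_id, timeline_id])
        .set(lsn_from_metadata as i64);
}
```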
## Issue ticket number and link
[3571](https://github.com/neondatabase/cloud/issues/3571)
## Checklist before requesting a review
- [X] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
Currently `attach` doesn't write a tenant config file, because we don't back
it up in the first place. As a result, the current implementation of
`Tenant::persist_tenant_config` does not allow changing a tenant's
configuration through the HTTP API: the request fails because the file
wasn't created on attach and
`OpenOptions::truncate(true).write(true).create_new(false)` is used.
I think this patch is the least controversial middle ground, as it
*enables* changing tenant configuration even for attached tenants (not
just created tenants).
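For illustration, one way the write path can tolerate a missing config file (a simplified sketch, not necessarily what the patch does):

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::path::Path;

fn persist_tenant_config(path: &Path, serialized: &[u8]) -> std::io::Result<()> {
    // Before: without create(true), the open fails with NotFound when the file
    // is missing, so the first config change after `attach` cannot succeed.
    //
    // After: create(true) creates the file if it is missing and truncate(true)
    // overwrites it otherwise, so attached tenants can be reconfigured too.
    let mut file = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true)
        .open(path)?;
    file.write_all(serialized)?;
    file.sync_all()
}
```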
Walreceiver is a per-timeline abstraction. Move it there to reflect
the hierarchy of abstractions and task_mgr tasks.
The code that sets up the global storage_broker client
is not timeline-scoped. So, break it out into a separate module.
The motivation for this change is to prepare the code base for replacing
the task_mgr global task registry with a more ownership-oriented
approach to manage task lifetimes.
I removed TaskStateUpdate::Init because, after doing the changes,
rustc warned that it was never constructed.
A quick search through the commit history shows that this
has always been true since
commit fb68d01449
Author: Dmitry Rodionov <dmitry@neon.tech>
Date: Mon Sep 26 23:57:02 2022 +0300
Preserve task result in TaskHandle by keeping join handle around (#2521)
So, the warning is not an indication of some accidental code removal.
This is PR: https://github.com/neondatabase/neon/pull/3456
Change the signature so that it takes an Arc<Timeline> reference to the
source timeline, instead of just the ID. All the callers have an Arc
reference at hand, so this is more convenient for everyone.
Reorder the code a bit and improve the comments, to make it more clear
what it does and why.
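Schematically, the signature change looks like this (the function name and parameter list are a guess for illustration; placeholder types stand in for the pageserver's own):

```rust
use std::sync::Arc;

// Placeholders standing in for the pageserver's Timeline / TimelineId / Lsn types.
pub struct Timeline;
pub type TimelineId = u128;
pub type Lsn = u64;

// Before (roughly): only the ID was passed, and the callee had to look the
// source timeline up again, adding an extra failure path.
pub fn branch_timeline_by_id(_src: TimelineId, _dst: TimelineId, _start_lsn: Lsn) -> anyhow::Result<Arc<Timeline>> {
    unimplemented!("look up the source timeline by ID, then branch from it")
}

// After (roughly): callers already hold an Arc<Timeline>, so they pass it directly.
pub fn branch_timeline(_src: &Arc<Timeline>, _dst: TimelineId, _start_lsn: Lsn) -> anyhow::Result<Arc<Timeline>> {
    unimplemented!("branch directly from the given source timeline")
}
```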
This patch wraps the tenants hashmap in an enum that represents the
tenant manager's three major states:
- Initializing
- Open for business
- Shutting down.
See the enum doc comments for details.
In response, all the users of `TENANTS` are now forced to distinguish
those states.
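A sketch of the shape of such an enum (field types simplified, with placeholders for the pageserver's own `Tenant`/`TenantId` types):

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Placeholders standing in for the pageserver's own types.
pub struct Tenant;
pub type TenantId = u128;

/// Sketch: wrapping the map forces every user of TENANTS to handle the lifecycle state.
pub enum TenantsMap {
    /// Pageserver startup has not populated the map yet.
    Initializing,
    /// Normal operation: tenants can be looked up, created, attached, and loaded.
    Open(HashMap<TenantId, Arc<Tenant>>),
    /// Shutdown has begun: existing tenants can still be accessed for draining,
    /// but creating, attaching, or loading new tenants is refused.
    ShuttingDown(HashMap<TenantId, Arc<Tenant>>),
}
```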
The only major change is in `run_if_no_tenant_in_memory`, which,
before this patch, was used by the /attach and /load endpoints.
This patch rewrites that method under the name `tenant_map_insert`,
replacing the anyhow::Result with a std Result and a dedicated error
type.
Introducing this error type allows using `tenant_map_insert` in
`tenant_create`, thereby unifying all code paths that create tenant
objects to use `tenant_map_insert`.
This is beneficial because we can now systematically prevent tenants
from being created, attached, or `/load`ed during pageserver shutdown.
The management API remains available, but the endpoints that create
new tenants will fail with an error.
More work would need to be done to properly distinguish these errors
through HTTP status codes such as 503.
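One way such a dedicated error type could look, so callers can later map shutdown to a 503 (a sketch; the variant names are assumptions):

```rust
/// Sketch of a dedicated error type for tenant_map_insert-style operations.
#[derive(Debug, thiserror::Error)]
enum TenantMapInsertError {
    #[error("pageserver is still initializing")]
    StillInitializing,
    #[error("pageserver is shutting down")]
    ShuttingDown,
    #[error("tenant already exists in the map")]
    TenantAlreadyExists,
    #[error(transparent)]
    Other(#[from] anyhow::Error),
}
```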
Motivation
==========
Layer Eviction Needs Context
----------------------------
Before we start implementing layer eviction, we need to collect some
access statistics per layer file or maybe even page.
Part of these statistics should be the initiator of a page read request
to answer the question of whether it was page_service vs. one of the
background loops, and if the latter, which of them?
Further, it would be nice to learn more about what activity in the pageserver
initiated an on-demand download of a layer file.
We will use this information to test out layer eviction policies.
Read more about the current plan for layer eviction here:
https://github.com/neondatabase/neon/issues/2476#issuecomment-1370822104
task_mgr problems + cancellation + tenant/timeline lifecycle
------------------------------------------------------------
Apart from layer eviction, we have long-standing problems with task_mgr,
task cancellation, and various races around tenant / timeline lifecycle
transitions.
One approach to solve these is to abandon task_mgr in favor of a
mechanism similar to Golang's context.Context, albeit extended to
support waiting for completion, and specialized to the needs in the
pageserver.
Heikki solves all of the above at once in PR
https://github.com/neondatabase/neon/pull/3228 , which is not yet
merged at the time of writing.
What Is This Patch About
========================
This patch addresses the immediate needs of layer eviction by
introducing a `RequestContext` structure that is plumbed through the
pageserver - all the way from the various entrypoints (page_service,
management API, tenant background loops) down to
Timeline::{get,get_reconstruct_data}.
The struct carries a description of the kind of activity that initiated
the call. We re-use task_mgr::TaskKind for this.
Also, it carries the desired on-demand download behavior of the entrypoint.
Timeline::get_reconstruct_data can then log the TaskKind that initiated
the on-demand download.
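The core of the struct, as a simplified sketch following the description above (field names, variants, and constructors may differ from the actual code):

```rust
/// On-demand download policy carried by the context.
#[derive(Clone, Copy, Debug)]
pub enum DownloadBehavior {
    /// Download missing layers as needed.
    Download,
    /// Download, but log a warning (the kill switch mentioned above turns Error into this).
    Warn,
    /// Fail the request instead of downloading.
    Error,
}

/// Stand-in for task_mgr::TaskKind, which the real code re-uses.
#[derive(Clone, Copy, Debug)]
pub enum TaskKind {
    PageRequestHandler,
    Compaction,
    GarbageCollector,
}

pub struct RequestContext {
    pub task_kind: TaskKind,
    pub download_behavior: DownloadBehavior,
}

impl RequestContext {
    /// Child context for work that completes before the parent task exits.
    pub fn attached_child(&self) -> Self {
        Self { task_kind: self.task_kind, download_behavior: self.download_behavior }
    }

    /// Child context for spawned tasks that may outlive the parent task.
    pub fn detached_child(&self, task_kind: TaskKind, download_behavior: DownloadBehavior) -> Self {
        Self { task_kind, download_behavior }
    }
}
```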
I developed this patch by checking out Heikki's big RequestContext
PR https://github.com/neondatabase/neon/pull/3228 , then deleting all
the functionality that is not needed for layer eviction.
After that, I added a few things on top:
1. The concept of attached_child and detached_child in preparation for
cancellation signalling through RequestContext, which will be added in
a future patch.
2. A kill switch to turn DownloadBehavior::Error into a warning.
3. Renamed WalReceiverConnection to WalReceiverConnectionPoller and
added an additional TaskKind, WalReceiverConnectionHandler. These were
necessary to create proper detached_child-type RequestContexts for the
various tasks that walreceiver starts.
How To Review This Patch
========================
Start your review with the module-level comment in context.rs.
It explains the idea of RequestContext, what parts of it are implemented
in this patch, and the future plans for RequestContext.
Then review the various `task_mgr::spawn` call sites. At each of them,
we should be creating a new detached_child RequestContext.
Then review the (few) RequestContext::attached_child call sites and
ensure that the spawned tasks do not outlive the task that spawns them.
If they do, these call sites should use detached_child() instead.
Then review the todo_child() call sites and judge whether it's worth the
trouble of plumbing through a parent context from the caller(s).
Lastly, go through the bulk of mechanical changes that simply forwards
the &ctx.
Before this patch, when `initialize_with_lock` was called via
`timeline_init_and_sync`, we would transition the timeline like so:
load_local_timeline/load_remote_timeline:
  timeline_init_and_sync
    Timeline::new
      () => Loading
    initialize_with_lock:
      set_state(Active)
        Loading => Active
      timeline.activate()
        Active => Active
This makes debugging problematic cases easier in the future, as we can
just request the model inputs and use them locally to reproduce the
issue with the model.
TimelineState::Suspended was dubious to begin with. I suppose
that the intention was that timelines could transition back and
forth between the Active and Suspended states.
But practically, the code before this patch never did that.
The transitions were:
() ==Timeline::new==> Suspended ==*==> {Active,Broken,Stopping}
One exception: Tenant::set_stopping() could transition timelines like
so:
!Broken ==Tenant::set_stopping()==> Suspended
But Tenant itself cannot transition from stopping state to any other
state.
Thus, this patch removes TimelineState::Suspended and introduces a new
state Loading. The aforementioned transitions change as follows:
- () ==Timeline::new==> Suspended ==*==> {Active,Broken,Stopping}
+ () ==Timeline::new==> Loading ==*==> {Active,Broken,Stopping}
- !Broken ==Tenant::set_stopping()==> Suspended
+ !Broken ==Tenant::set_stopping()==> Stopping
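After the change, the state enum looks roughly like this (a sketch; doc comments summarize the transitions above):

```rust
/// Sketch of the timeline lifecycle states after removing Suspended.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TimelineState {
    /// Initial state after Timeline::new, before the timeline is activated.
    Loading,
    /// Fully initialized; walreceiver may run and reads are served.
    Active,
    /// Something went irrecoverably wrong; the timeline stays unusable.
    Broken,
    /// Tenant shutdown (or detach) has begun; background loops should exit.
    Stopping,
}
```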
Walreceiver's connection manager loop watches TimelineState to decide
whether it should retry connecting, or exit.
This patch changes the loop to exit when it observes the transition
into Stopping state.
Walreceiver isn't supposed to be started until the timeline transitions
into Active state. So, this patch also adds some warn!() messages
in case this happens anyway.
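As a rough illustration of the loop's exit condition (reusing the `TimelineState` sketch above; the real connection manager is considerably more involved):

```rust
use tokio::sync::watch;

// Sketch: the connection manager loop watches the timeline state and exits on Stopping.
async fn connection_manager_loop(mut state_rx: watch::Receiver<TimelineState>) {
    loop {
        let state = *state_rx.borrow();
        match state {
            TimelineState::Active => {
                // (re)connect to a safekeeper and stream WAL
            }
            TimelineState::Stopping | TimelineState::Broken => {
                // Shutdown (or breakage) observed; stop retrying and exit.
                break;
            }
            TimelineState::Loading => {
                // Not supposed to happen: walreceiver should only start on Active timelines.
                tracing::warn!("walreceiver running on a timeline that is still Loading");
            }
        }
        if state_rx.changed().await.is_err() {
            break; // sender dropped, the timeline is gone
        }
    }
}
```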
Small changes, but hopefully this will help with the panic detected in
staging, for which we cannot get the debugging information right now
(end-of-branch before branch-point).
Before, only the timelines which had passed the `gc_horizon` were
processed, which failed with orphans at the tree_sort phase. An example
input is in the added `test_branched_empty_timeline_size` test case.
The PR changes iteration to go through all timelines, and in
addition, any learned branch points are calculated as they
would have been in the original implementation if the ancestor branch had
been over the `gc_horizon`.
This also changes how tenants where all timelines are below `gc_horizon`
are handled. Previously a tenant_size of 0 was returned, but now they will
have approximately `initdb_lsn` worth of tenant_size.
The PR also adds several new tenant size tests that describe various corner
cases of branching structure and `gc_horizon` setting.
They are currently disabled to not consume time during CI.
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
This reverts commit 826e89b9ce.
The problem with that commit was that it deleted the TempDir while
there were still EphemeralFile instances open.
At first I thought this could be fixed by simply adding
Handle::current().block_on(task_mgr::shutdown(None, Some(tenant_id), None))
to TenantHarness::drop, but it turned out to be insufficient.
So, reverting the commit until we find a proper solution.
refs https://github.com/neondatabase/neon/issues/3385
Before this patch, we would start all layer downloads simultaneously.
There is at most one download_all_remote_layers task per timeline.
Hence, the specified limit is per timeline.
There is still no global concurrency limit for layer downloads.
We'll have to revisit that at some point and also prioritize on-demand
initiated downloads over download_all_remote_layers downloads.
But that's for another day.
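A sketch of how a per-task concurrency limit like this can be enforced with a semaphore (simplified, tokio-based; the helper names are illustrative):

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

/// Sketch: download all remote layers for one timeline, at most `limit` at a time.
/// Since there is at most one such task per timeline, the limit is per timeline.
async fn download_all_remote_layers(layer_names: Vec<String>, limit: usize) {
    let semaphore = Arc::new(Semaphore::new(limit));
    let mut tasks = Vec::new();
    for layer in layer_names {
        let sem = Arc::clone(&semaphore);
        tasks.push(tokio::spawn(async move {
            // Each download holds a permit for its duration.
            let _permit = sem.acquire_owned().await.expect("semaphore closed");
            download_layer(&layer).await;
        }));
    }
    for t in tasks {
        let _ = t.await;
    }
}

async fn download_layer(_name: &str) {
    // placeholder for the actual remote-storage download
}
```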
- handle errors in calculate_synthetic_size_worker. Don't exit the bgworker if one tenant failed.
- add cached_synthetic_tenant_size to cache values calculated by the bgworker
- code cleanup: remove unneeded info! messages, clean comments
- handle collect_metrics_task() error. Don't exit collect_metrics worker if one task failed.
- add unit test to cover case when we have multiple branches at the same lsn
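A minimal sketch of the "don't exit the worker if one tenant failed" pattern from the first item above (function and helper names are illustrative):

```rust
// Sketch: iterate over tenants, log failures, and keep the background worker alive.
async fn calculate_synthetic_size_iteration(tenant_ids: Vec<String>) {
    for tenant_id in tenant_ids {
        match calculate_synthetic_size_for_tenant(&tenant_id).await {
            Ok(size) => {
                // Cache the value so consumers can read it cheaply between iterations.
                cache_synthetic_size(&tenant_id, size);
            }
            Err(e) => {
                // One failing tenant must not take down the whole worker.
                tracing::error!("synthetic size calculation failed for tenant {}: {:#}", tenant_id, e);
            }
        }
    }
}

async fn calculate_synthetic_size_for_tenant(_tenant_id: &str) -> anyhow::Result<u64> {
    Ok(0) // placeholder for the real calculation
}

fn cache_synthetic_size(_tenant_id: &str, _size: u64) {
    // placeholder: store into something like cached_synthetic_tenant_size
}
```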