rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-16 18:02:56 +00:00

Author	SHA1	Message	Date
Christian Schwarz	cd48ea784f	TenantInfo: expose generation number (#6348 ) Generally useful when debugging / troubleshooting. I found this useful when manually duplicating a tenant from a script[^1] where I can't use `neon_fixtures.Pageserver.tenant_attach`'s automatic integration with the neon_local's attachment_service. [^1]: https://github.com/neondatabase/neon/pull/6349	2024-01-12 18:27:11 +01:00
Joonas Koivunen	2d22661061	refactor: calculate_synthetic_size_worker, remove PRE::NeedsDownload (#6111 ) Changes I wanted to make on #6106 but decided to leave out to keep that commit clean for including in the #6090. Finally remove `PageReconstructionError::NeedsDownload`.	2023-12-13 14:23:19 +00:00
Joonas Koivunen	3b04f3a749	fix: accidential return Ok (#6106 ) Error indicating request cancellation OR timeline shutdown was deemed as a reason to exit the background worker that calculated synthetic size. Fix it to only be considered for avoiding logging such of such errors.	2023-12-11 21:27:53 +00:00
John Spray	f1fc1fd639	pageserver: further refactoring from TenantId to TenantShardId (#6059 ) ## Problem In https://github.com/neondatabase/neon/pull/5957, the most essential types were updated to use TenantShardId rather than TenantId. That unblocked other work, but didn't fully enable running multiple shards from the same tenant on the same pageserver. ## Summary of changes - Use TenantShardId in page cache key for materialized pages - Update mgr.rs get_tenant() and list_tenants() functions to use a shard id, and update all callers. - Eliminate the exactly_one_or_none helper in mgr.rs and all code that used it - Convert timeline HTTP routes to use tenant_shard_id Note on page cache: ``` struct MaterializedPageHashKey { /// Why is this TenantShardId rather than TenantId? /// /// Usually, the materialized value of a page@lsn is identical on any shard in the same tenant. However, this /// this not the case for certain internally-generated pages (e.g. relation sizes). In future, we may make this /// key smaller by omitting the shard, if we ensure that reads to such pages always skip the cache, or are /// special-cased in some other way. tenant_shard_id: TenantShardId, timeline_id: TimelineId, key: Key, } ```	2023-12-11 15:52:33 +00:00
Shany Pozin	b7a988ba46	Support cancellation for find_lsn_for_timestamp API (#5904 ) ## Problem #5900 ## Summary of changes Added cancellation token as param in all relevant code paths and actually used it in the find_lsn_for_timestamp main loop	2023-11-23 17:08:32 +02:00
John Spray	85cd97af61	pageserver: add `InProgress` tenant map state, use a sync lock for the map (#5367 ) ## Problem Follows on from #5299 - We didn't have a generic way to protect a tenant undergoing changes: `Tenant` had states, but for our arbitrary transitions between secondary/attached, we need a general way to say "reserve this tenant ID, and don't allow any other ops on it, but don't try and report it as being in any particular state". - The TenantsMap structure was behind an async RwLock, but it was never correct to hold it across await points: that would block any other changes for all tenants. ## Summary of changes - Add the `TenantSlot::InProgress` value. This means: - Incoming administrative operations on the tenant should retry later - Anything trying to read the live state of the tenant (e.g. a page service reader) should retry later or block. - Store TenantsMap in `std::sync::RwLock` - Provide an extended `get_active_tenant_with_timeout` for page_service to use, which will wait on InProgress slots as well as non-active tenants. Closes: https://github.com/neondatabase/neon/issues/5378 --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-11-06 14:03:22 +00:00
Christian Schwarz	c6ca1d76d2	consumption_metrics: fix periodicness behavior & reporting (#5625 ) Before this PR, the ticker was running at default miss behavior `Delay`. For example, here is the startup output with 25k tenants: ``` 2023-10-19T09:57:21.682466Z INFO synthetic_size_worker: starting calculate_synthetic_size_worker 2023-10-19T10:50:44.678202Z WARN synthetic_size_worker: task iteration took longer than the configured period elapsed=3202.995707156s period=10m task=ConsumptionMetricsSyntheticSizeWorker 2023-10-19T10:52:17.408056Z WARN synthetic_size_worker: task iteration took longer than the configured period elapsed=2695.72556035s period=10m task=ConsumptionMetricsSyntheticSizeWorker ``` The first message's `elapsed` value is correct. It matches the delta between the log line timestamps. The second one is logged ca 1.5min after, though, but reports a much larger `elapsed` than 1.5min. This PR fixes the behavior by copying what `eviction_task.rs` does.	2023-10-23 16:31:38 +02:00
Christian Schwarz	9256788273	limit imitate accesses concurrency, using same semaphore as compactions (#5578 ) Before this PR, when we restarted pageserver, we'd see a rush of `$number_of_tenants` concurrent eviction tasks starting to do imitate accesses building up in the period of `[init_order allows activations, $random_access_delay + EvictionPolicyLayerAccessThreshold::period]`. We simply cannot handle that degree of concurrent IO. We already solved the problem for compactions by adding a semaphore. So, this PR shares that semaphore for use by evictions. Part of https://github.com/neondatabase/neon/issues/5479 Which is again part of https://github.com/neondatabase/neon/issues/4743 Risks / Changes In System Behavior ================================== * we don't do evictions as timely as we currently do * we log a bunch of warnings about eviction taking too long * imitate accesses and compactions compete for the same concurrency limit, so, they'll slow each other down through this shares semaphore Changes ======= - Move the `CONCURRENT_COMPACTIONS` semaphore into `tasks.rs` - Rename it to `CONCURRENT_BACKGROUND_TASKS` - Use it also for the eviction imitate accesses: - Imitate acceses are both per-TIMELINE and per-TENANT - The per-TENANT is done through coalescing all the per-TIMELINE tasks via a tokio mutex `eviction_task_tenant_state`. - We acquire the CONCURRENT_BACKGROUND_TASKS permit early, at the beginning of the eviction iteration, much before the imitate acesses start (and they may not even start at all in the given iteration, as they happen only every $threshold). - Acquiring early is sub-optimal because when the per-timline tasks coalesce on the `eviction_task_tenant_state` mutex, they are already holding a CONCURRENT_BACKGROUND_TASKS permit. - It's also unfair because tenants with many timelines win the CONCURRENT_BACKGROUND_TASKS more often. - I don't think there's another way though, without refactoring more of the imitate accesses logic, e.g, making it all per-tenant. - Add metrics for queue depth behind the semaphore. I found these very useful to understand what work is queued in the system. - The metrics are tagged by the new `BackgroundLoopKind`. - On a green slate, I would have used `TaskKind`, but we already had pre-existing labels whose names didn't map exactly to task kind. Also the task kind is kind of a lower-level detail, so, I think it's fine to have a separate enum to identify background work kinds. Future Work =========== I guess I could move the eviction tasks from a ticker to "sleep for $period". The benefit would be that the semaphore automatically "smears" the eviction task scheduling over time, so, we only have the rush on restart but a smeared-out rush afterward. The downside is that this perverts the meaning of "$period", as we'd actually not run the eviction at a fixed period. It also means the the "took to long" warning & metric becomes meaningless. Then again, that is already the case for the compaction and gc tasks, which do sleep for `$period` instead of using a ticker.	2023-10-17 11:29:48 +02:00
duguorong009	25a37215f3	fix: replace all `std::PathBuf`s with `camino::Utf8PathBuf` (#5352 ) Fixes #4689 by replacing all of `std::Path` , `std::PathBuf` with `camino::Utf8Path`, `camino::Utf8PathBuf` in - pageserver - safekeeper - control_plane - libs/remote_storage Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-10-04 17:52:23 +03:00
Joonas Koivunen	5d8597c2f0	refactor(consumption_metrics): post-split cleanup (#5327 ) Split off from #5297. Builds upon #5326. Handles original review comments which I did not move to earlier split PRs. Completes test support for verifying events by notifying of the last batch of events. Adds cleaning up of tempfiles left because of an unlucky shutdown or SIGKILL. Finally closes #5175. Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2023-09-18 23:30:01 +03:00
Joonas Koivunen	e62ab176b8	refactor(consumption_metrics): split (#5326 ) Split off from #5297. Builds upon #5325, should contain only the splitting. Next up: #5327.	2023-09-16 18:45:08 +03:00
Joonas Koivunen	9cf4ae86ff	refactor(consumption_metrics): pre-split cleanup (#5325 ) Cleanups in preparation to splitting the consumption_metrics.rs in #5326. Split off from #5297.	2023-09-16 18:08:33 +03:00
Joonas Koivunen	f902777202	fix: consumption metrics on restart (#5323 ) Write collected metrics to disk to recover previously sent metrics on restart. Recover the previously collected metrics during startup, send them over at right time - send cached synthetic size before actual is calculated - when `last_record_lsn` rolls back on startup - stay at last sent `written_size` metric - send `written_size_delta_bytes` metric as 0 Add test support: stateful verification of events in python tests. Fixes: #5206 Cc: #5175 (loggings, will be enhanced in follow-up)	2023-09-16 11:24:42 +03:00
Joonas Koivunen	a7f4ee02a3	fix(consumption_metrics): exp backoff retry (#5317 ) Split off from #5297. Depends on #5315. Cc: #5175 for retry	2023-09-16 01:11:01 +03:00
Joonas Koivunen	00c4c8e2e8	feat(consumption_metrics): remove event deduplication support (#5316 ) We no longer use pageserver deduplication anywhere. Give out a warning instead. Split off from #5297. Cc: #5175 for dedup.	2023-09-16 00:06:19 +03:00
Joonas Koivunen	c5d226d9c7	refactor(consumption_metrics): prereq refactorings, tests (#5315 ) Split off from #5297. There should be no functional changes here: - refactor tenant metric "production" like previously timeline, allows unit testing, though not interesting enough yet to test - introduce type aliases for tuples - extra refactoring for `collect`, was initially thinking it was useful but will do a inline later - shorter binding names - support for future allocation reuse quests with IdempotencyKey - move code out of tokio::select to make it rustfmt-able - generification, allow later replacement of `&'static str` with enum - add tests that assert sent event contents exactly	2023-09-15 19:44:14 +03:00
Joonas Koivunen	04776ade6c	fix(consumption): rename _size_ => _data_ (#4866 ) I failed at renaming the metric middle part while managing to do a great job with the suffix. Fix the middle part as well.	2023-08-01 19:18:25 +03:00
Joonas Koivunen	3a00a5deb2	refactor: tidy consumption metrics (#4860 ) Tidying up I've been wanting to do for some time. Follow-up to #4857.	2023-08-01 18:14:16 +03:00
Joonas Koivunen	78fa2b13e5	test: written_size_bytes_delta (#4857 ) Two stabs at this, by mocking a http receiver and the globals out (now reverted) and then by separating the timeline dependency and just testing what kind of events certain timelines produce. I think this pattern could work for some of our problems. Follow-up to #4822.	2023-08-01 15:30:36 +03:00
Joonas Koivunen	326189d950	consumption_metrics: send timeline_written_size_delta (#4822 ) We want to have timeline_written_size_delta which is defined as difference to the previously sent `timeline_written_size` from the current `timeline_written_size`. Solution is to send it. On the first round `disk_consistent_lsn` is used which is captured during `load` time. After that an incremental "event" is sent on every collection. Incremental "events" are not part of deduplication. I've added some infrastructure to allow somewhat typesafe `EventType::Absolute` and `EventType::Incremental` factories per metrics, now that we have our first `EventType::Incremental` usage.	2023-07-31 22:10:19 +03:00
Joonas Koivunen	269e20aeab	fix: filter out zero synthetic sizes (#4639 ) Apparently sending the synthetic_size == 0 is causing problems, so filter out sending zeros. Slack discussion: https://neondb.slack.com/archives/C03F5SM1N02/p1688574285718989?thread_ts=1687874910.681049&cid=C03F5SM1N02	2023-07-06 12:32:34 +03:00
Shany Pozin	26828560a8	Add timeouts and retries to consumption metrics reporting client (#4563 ) ## Problem #4528 ## Summary of changes Add a 60 seconds default timeout to the reqwest client Add retries for up to 3 times to call into the metric consumption endpoint --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-07-03 15:20:49 +03:00
Christian Schwarz	edcf4d61a4	distinguish imitated from real size::gather_input calls in metrics (#4224 ) Before this PR, the gather_inputs() calls made to imitate synthetic size calculation accesses were accounted towards the real logical size calculation metric. This PR forces all callers to declare the cause for making logical size calculations, making the decision which cause counts towards which metric explicit. This is follow-up to ``` commit `1d266a6365` Author: Christian Schwarz <christian@neon.tech> Date: Thu May 11 16:09:29 2023 +0200 logical size calculation metrics: differentiate regular vs imitated (#4197) ``` After merging this patch, I hope to be able to explain why we have ca 30x more "logical size" ops in prod than "imitate logical size" for any given observation interval. refs https://github.com/neondatabase/neon/issues/4154	2023-05-12 17:57:33 +00:00
Christian Schwarz	5869234290	logical size calculation: spawn with in_current_span (#4196 ) While investigating https://github.com/neondatabase/neon/issues/4154 I found that the `Calculating logical size for timeline` tracing events created from within the logical size computation code are not always attributable to the background task that caused it. My goal is to be able to distinguish in the logs whether a `Calculating logical size for timeline` was logged as part of a real synthetic size calculation VS an imitation by the eviction task. I want this distinction so I can prove my assumption that the disk IO peaks which we see every 24h on prod are due to eviction's imitate synthetic size calculations. The alternative here, which I would have preferred, but is more work: link RequestContext's into a child->parent list and dump this list when we log `Calculating logical size for timeline`. I would have preferred that over what we have in this PR because, technically, the ondemand logical size computation can outlive the caller that spawned it. This is against the idea of correctly nested spans. I guess in OpenTelemetry land, the correct modelling would be a link between the caller's span and the task_mgr task's span. Anyways, I think the case where we hang up on the spawned ondemand logical size calculation is quite rare. So, I'm willing to tolerate incorrectly nested spans for these edge-cases. refs https://github.com/neondatabase/neon/issues/4154	2023-05-12 15:36:30 +02:00
Anastasia Lubennikova	40799d8ae7	Add debug messages to catch abnormal consumption metric values	2023-02-17 17:57:45 +02:00
Anastasia Lubennikova	6139e8e426	Revert "Add debug messages around sending cached metrics" This reverts commit `a839860c2e`.	2023-02-16 17:46:15 +02:00
Anastasia Lubennikova	d9ba3c5f5e	Revert "Add debug messages around timeline.get_current_logical_size" This reverts commit `a5ce2b5330`.	2023-02-16 17:46:15 +02:00
Anastasia Lubennikova	7991bd3b69	Fix periodic metric sending: don't reset timer on every iteration (#3617 ) Previously timer was reset on every collect_metrics_iteration and sending of cached metrics was never triggered. This is a follow-up for `a69da4a7`.	2023-02-16 10:56:42 +02:00
Anastasia Lubennikova	a839860c2e	Add debug messages around sending cached metrics	2023-02-15 16:02:02 +02:00
Anastasia Lubennikova	a5ce2b5330	Add debug messages around timeline.get_current_logical_size	2023-02-15 16:02:02 +02:00
Anastasia Lubennikova	877a2d70e3	Periodically send cached consumption metrics (#3520 ) Add new pageserver config setting `cached_metric_collection_interval` with default `1 hour`. This setting controls how often unchanged cached consumption metrics are sent to the HTTP endpoint. This is a workaround for billing service limitations. fixes #3485	2023-02-06 17:53:10 +02:00
Anastasia Lubennikova	83048a4adc	Handle errors during metric collection. (#3521 ) Don't exit the loop if one of the tenants failed to scrape its metrics. fixes #3490	2023-02-03 12:37:34 +02:00
Christian Schwarz	dc64962ffc	tenant::mgr: explicit tracking of initializing & shutting-down states This patch wrap the tenants hashmap into an enum that represents the tenant manager's three major states: - Initializing - Open for business - Shutting down. See the enum doc comments for details. In response, all the users of `TENANTS` are now forced to distinguish those states. The only major change is in `run_if_no_tenant_in_memory`, which, before this patch, was used by the /attach and /load endpoints. This patch rewrites that method under the name `tenant_map_insert`, replacing the anyhow::Result with a std Result and a dedicated error type. Introducing this error types allows using `tenant_map_insert` in `tenant_create`, thereby unifying all code paths that create tenants objects to use `tenant_map_insert`. This is beneficial because we can now systematically prevent tenants from being created, attached, or `/load`ed during pageserver shutdown. The management API remains available, but the endpoints that create new tenants will fail with an error. More work would need to be done to properly distinguish these errors through HTTP status codes such as 503.	2023-01-26 11:24:48 +01:00
Christian Schwarz	01b4b0c2f3	Introduce RequestContext Motivation ========== Layer Eviction Needs Context ---------------------------- Before we start implementing layer eviction, we need to collect some access statistics per layer file or maybe even page. Part of these statistics should be the initiator of a page read request to answer the question of whether it was page_service vs. one of the background loops, and if the latter, which of them? Further, it would be nice to learn more about what activity in the pageserver initiated an on-demand download of a layer file. We will use this information to test out layer eviction policies. Read more about the current plan for layer eviction here: https://github.com/neondatabase/neon/issues/2476#issuecomment-1370822104 task_mgr problems + cancellation + tenant/timeline lifecycle ------------------------------------------------------------ Apart from layer eviction, we have long-standing problems with task_mgr, task cancellation, and various races around tenant / timeline lifecycle transitions. One approach to solve these is to abandon task_mgr in favor of a mechanism similar to Golang's context.Context, albeit extended to support waiting for completion, and specialized to the needs in the pageserver. Heikki solves all of the above at once in PR https://github.com/neondatabase/neon/pull/3228 , which is not yet merged at the time of writing. What Is This Patch About ======================== This patch addresses the immediate needs of layer eviction by introducing a `RequestContext` structure that is plumbed through the pageserver - all the way from the various entrypoints (page_service, management API, tenant background loops) down to Timeline::{get,get_reconstruct_data}. The struct carries a description of the kind of activity that initiated the call. We re-use task_mgr::TaskKind for this. Also, it carries the desired on-demand download behavior of the entrypoint. Timeline::get_reconstruct_data can then log the TaskKind that initiated the on-demand download. I developed this patch by git-checking-out Heikki's big RequestContext PR https://github.com/neondatabase/neon/pull/3228 , then deleting all the functionality that we do not need to address the needs for layer eviction. After that, I added a few things on top: 1. The concept of attached_child and detached_child in preparation for cancellation signalling through RequestContext, which will be added in a future patch. 2. A kill switch to turn DownloadBehavior::Error into a warning. 3. Renamed WalReceiverConnection to WalReceiverConnectionPoller and added an additional TaskKind WalReceiverConnectionHandler.These were necessary to create proper detached_child-type RequestContexts for the various tasks that walreceiver starts. How To Review This Patch ======================== Start your review with the module-level comment in context.rs. It explains the idea of RequestContext, what parts of it are implemented in this patch, and the future plans for RequestContext. Then review the various `task_mgr::spawn` call sites. At each of them, we should be creating a new detached_child RequestContext. Then review the (few) RequestContext::attached_child call sites and ensure that the spawned tasks do not outlive the task that spawns them. If they do, these call sites should use detached_child() instead. Then review the todo_child() call sites and judge whether it's worth the trouble of plumbing through a parent context from the caller(s). Lastly, go through the bulk of mechanical changes that simply forwards the &ctx.	2023-01-25 14:53:30 +01:00
Joonas Koivunen	58fb6fe861	fix: dont stop pageserver if we fail to calculate synthetic size	2023-01-20 19:55:19 +02:00
Anastasia Lubennikova	2cbe84b78f	Proxy metrics (#3290 ) Implement proxy metrics collection. Only collect metric for outbound traffic. Add proxy CLI parameters: - metric-collection-endpoint - metric-collection-interval. Add test_proxy_metric_collection test. Move shared consumption metrics code to libs/consumption_metrics. Refactor the code.	2023-01-16 15:17:28 +00:00
Anastasia Lubennikova	c6d383e239	code cleanup	2023-01-13 11:51:28 +02:00
Anastasia Lubennikova	26f39c03f2	review code cleanup: - handle errors in calculate_synthetic_size_worker. Don't exit the bgworker if one tenant failed. - add cached_synthetic_tenant_size to cache values calculated by the bgworker - code cleanup: remove unneeded info! messages, clean comments - handle collect_metrics_task() error. Don't exit collect_metrics worker if one task failed. - add unit test to cover case when we have multiple branches at the same lsn	2023-01-13 11:51:28 +02:00
Anastasia Lubennikova	0675859bb0	Add background worker that periodically spawns synthetic size calculation. Add new pageserver config param calculate_synthetic_size_interval	2023-01-13 11:51:28 +02:00
Anastasia Lubennikova	8ff7bc5df1	Add timleline_logical_size metric. Send this metric only when it is fully calculated. Make consumption metrics more stable: - Send per-timeline metrics only for active timelines. - Adjust test assertions to make test_metric_collection test more stable.	2022-12-29 19:13:54 +02:00
Anastasia Lubennikova	894ac30734	Rename billing_metrics to consumption_metrics. Use more appropriate term, because not all of these metrics are used for billing.	2022-12-29 14:25:47 +02:00

41 Commits