
rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-08 05:52:55 +00:00

Author	SHA1	Message	Date
bojanserafimov	a9bd05760f	Improve layer map docstrings (#3382 )	2023-01-19 10:29:15 -05:00
Kirill Bulatov	90f66aa51b	Enable logs in unit tests	2023-01-18 17:43:27 +02:00
Kirill Bulatov	826e89b9ce	Use actual temporary dir for pageserver unit tests	2023-01-18 17:43:27 +02:00
Kirill Bulatov	c6b56d2967	Add more io::Error context when fail to operate on a path (#3254 ) I have a test failure that shows ``` Caused by: 0: Failed to reconstruct a page image: 1: Directory not empty (os error 39) ``` but does not really show where exactly that happens. https://neon-github-public-dev.s3.amazonaws.com/reports/pr-3227/release/3823785365/index.html#categories/c0057473fc9ec8fb70876fd29a171ce8/7088dab272f2c7b7/?attachment=60fe6ed2add4d82d The PR aims to add more context in debugging that issue.	2023-01-17 22:07:38 +02:00
Kirill Bulatov	1ebd145c29	Actualize the comment (#3362 ) Follow-up of https://github.com/neondatabase/neon/pull/3326#issuecomment-1384265759	2023-01-17 13:30:42 +02:00
Christian Schwarz	48dd9565ac	TaskHandle: tone down `sender is dropped while join handle is still alive` Rationale: see comments added as part of this commit. fixes https://github.com/neondatabase/neon/issues/3339	2023-01-17 09:42:22 +01:00
Christian Schwarz	58c8c1076c	download_all_remote_layers API: require client to specify max_concurrent_downloads Before this patch, we would start all layer downloads simultaneously. There is at most one download_all_remote_layers task per timeline. Hence, the specified limit is per timeline. There is still no global concurrency limit for layer downloads. We'll have to revisit that at some point and also prioritize on-demand initiated downloads over download_all_remote_layers downloads. But that's for another day.	2023-01-16 19:29:06 +01:00
Joonas Koivunen	a8a9bee602	walredo: simple tests and bench updates (#3045 ) Separated from #2875. The microbenchmark has been validated to show similar difference as to larger scale OLTP benchmark.	2023-01-16 18:24:45 +02:00
Anastasia Lubennikova	2cbe84b78f	Proxy metrics (#3290 ) Implement proxy metrics collection. Only collect metric for outbound traffic. Add proxy CLI parameters: - metric-collection-endpoint - metric-collection-interval. Add test_proxy_metric_collection test. Move shared consumption metrics code to libs/consumption_metrics. Refactor the code.	2023-01-16 15:17:28 +00:00
Kirill Bulatov	bce4233d3a	Rework Cargo.toml dependencies (#3322 ) * Use workspace variables from cargo, coming with rustc [1.64](https://github.com/rust-lang/rust/blob/master/RELEASES.md#version-1640-2022-09-22) See https://doc.rust-lang.org/nightly/cargo/reference/workspaces.html#the-package-table and https://doc.rust-lang.org/nightly/cargo/reference/workspaces.html#the-dependencies-table sections. Now, all dependencies in all non-root `Cargo.toml` files are defined as ``` clap.workspace = true ``` sometimes, when extra features are needed, as ``` bytes = {workspace = true, features = ['serde'] } ``` With the actual declarations (with shared features and version numbers/file paths/etc.) in the root Cargo.toml. Features are additive: https://doc.rust-lang.org/nightly/cargo/reference/specifying-dependencies.html#inheriting-a-dependency-from-a-workspace * Uses the mechanism above to set common, 2021, edition and license across the workspace * Mechanically bumps a few dependencies * Updates hakari format, as it suggested: ``` work/neon/neon kb/cargo-templated ❯ cargo hakari generate info: no changes detected info: new hakari format version available: 3 (current: 2) (add or update `dep-format-version = "3"` in hakari.toml, then run `cargo hakari generate && cargo hakari manage-deps`) ```	2023-01-13 18:13:34 +02:00
Kirill Bulatov	99808558de	Avoid duplicate timeline insert (#3326 ) `initialize_with_lock` inserts `Arc<Timeline>` before returning it: `c1731bc4f0/pageserver/src/tenant.rs (L222)` but `setup_timeline` function did another insert, which got removed in this PR: `c1731bc4f0/pageserver/src/tenant.rs (L486)` On top, a better comment and function renames are added.	2023-01-13 12:05:54 +00:00
Anastasia Lubennikova	c6d383e239	code cleanup	2023-01-13 11:51:28 +02:00
Anastasia Lubennikova	5e3e0fbf6f	remove unneeded Cargo.lock changes	2023-01-13 11:51:28 +02:00
Anastasia Lubennikova	26f39c03f2	review code cleanup: - handle errors in calculate_synthetic_size_worker. Don't exit the bgworker if one tenant failed. - add cached_synthetic_tenant_size to cache values calculated by the bgworker - code cleanup: remove unneeded info! messages, clean comments - handle collect_metrics_task() error. Don't exit collect_metrics worker if one task failed. - add unit test to cover case when we have multiple branches at the same lsn	2023-01-13 11:51:28 +02:00
Anastasia Lubennikova	148e020fb9	Fix logical size calculation: sort updates in topological order so that the parent timeline always precedes its children. fixes #3179	2023-01-13 11:51:28 +02:00
Anastasia Lubennikova	0675859bb0	Add background worker that periodically spawns synthetic size calculation. Add new pageserver config param calculate_synthetic_size_interval	2023-01-13 11:51:28 +02:00
Heikki Linnakangas	57a6e931ea	Comment, formatting and other cosmetic cleanup.	2023-01-12 19:05:13 +02:00
Heikki Linnakangas	0cceb14e48	Add a FIXME on ugly error message parsing.	2023-01-12 19:05:13 +02:00
Heikki Linnakangas	d7c41cbbee	Replace tokio::watch with CancellationToken. PR #3228 starts to use CancellationTokens more widely, this is a small part extracted from that.	2023-01-12 17:37:15 +02:00
Heikki Linnakangas	c1731bc4f0	Push on-demand download into Timeline::get() function itself. This makes Timeline::get() async, and all functions that call it directly or indirectly with it. The with_ondemand_download() mechanism is gone, Timeline::get() now always downloads files, whether you want it or not. That is what all the current callers want, so even though this loses the capability to get a page only if it's already in the pageserver, without downloading, we were not using that capability. There were some places that used 'no_ondemand_download' in the WAL ingestion code that would error out if a layer file was not found locally, but those were dubious. We do actually want to on-demand download in all of those places. Per discussion at https://github.com/neondatabase/neon/pull/3233#issuecomment-1368032358	2023-01-12 11:53:10 +02:00
Christian Schwarz	8eebd5f039	run on-demand compaction in a task_mgr task With this patch, tenant_detach and timeline_delete's task_mgr::shutdown_tasks() call will wait for on-demand compaction to finish. Before this patch, the on-demand compaction would grab the layer_removal_cs after tenant_detach / timeline_delete had removed the timeline directory. This resulted in error No such file or directory (os error 2) NB: I already implemented this pattern for ondemand GC a while back. fixes https://github.com/neondatabase/neon/issues/3136	2023-01-09 19:08:22 +01:00
Christian Schwarz	d4d0aa6ed6	gc_iteration_internal: better log message & debug log level if nothing to do fixes https://github.com/neondatabase/neon/issues/3107	2023-01-09 13:53:59 +01:00
Kirill Bulatov	a457256fef	Fix log message matching (#3291 ) Spotted https://neon-github-public-dev.s3.amazonaws.com/reports/main/debug/3871991071/index.html#suites/158be07438eb5188d40b466b6acfaeb3/22966d740e33b677/ failing on `main`, fixes that by using a proper regex match string. Also removes one clippy lint suppression.	2023-01-09 14:25:12 +02:00
Shany Pozin	7920b39a27	Adding transition reason to the log when a tenant is moved to Broken state (#3289 ) #3160	2023-01-09 10:24:50 +02:00
Christian Schwarz	3526323bc4	prepare Timeline::get_reconstruct_data for becoming async (#3271 ) This patch restructures the code so that PR https://github.com/neondatabase/neon/pull/3228 can seamlessly replace the return PageReconstructResult::NeedsDownload with a download_remote_layer().await. Background: PR https://github.com/neondatabase/neon/pull/3228 will turn get_reconstruct_data() async and do the on-demand download right in place, instead of returning a PageReconstructResult::NeedsDownload. Current rustc requires that the layers lock guard be not in scope across an await point. For on-demand download inside get_reconstruct_data(), we need to do download_remote_layer().await. Supersedes https://github.com/neondatabase/neon/pull/3260 See my comment there: https://github.com/neondatabase/neon/pull/3260#issuecomment-1370752407 Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2023-01-06 19:42:25 +02:00
Christian Schwarz	d7f1e30112	remote_timeline_client: more metrics & metrics-related cleanups - Clean up redundant metric removal in TimelineMetrics::drop. RemoteTimelineClientMetrics is responsible for cleaning up REMOTE_OPERATION_TIME and REMOTE_UPLOAD_QUEUE_UNFINISHED_TASKS. - Rename `pageserver_remote_upload_queue_unfinished_tasks` to `pageserver_remote_timeline_client_calls_unfinished`. The new name reflects that the metric is with respect to the entire call to remote timeline client. This includes wait time in the upload queue and hence it's a longer span than what `pageserver_remote_OPERATION_seconds` measures. - Add the `pageserver_remote_timeline_client_calls_started` histogram. See the metric description for why we need it. - Add helper functions `call_begin` etc to `RemoteTimelineClientMetrics` to centralize the logic for updating the metrics above (they relate to each other, see comments in code). - Use these constructs to track ongoing downloads in `pageserver_remote_timeline_client_calls_unfinished` refs https://github.com/neondatabase/neon/issues/2029 fixes https://github.com/neondatabase/neon/issues/3249 closes https://github.com/neondatabase/neon/pull/3250	2023-01-05 11:50:17 +01:00
Christian Schwarz	6a9d1030a6	use RemoteTimelineClient for downloading index part during tenant_attach Before this change, we would not .measure_remote_op for index part downloads. And more generally, it's good to pass not just uploads but also downloads through RemoteTimelineClient, e.g., if we ever want to implement some timeline-scoped policies there. Found this while working on https://github.com/neondatabase/neon/pull/3250 where I add a metric to measure the degree of concurrent downloads. Layer download was missing in a test that I added there.	2023-01-05 11:08:50 +01:00
Heikki Linnakangas	8c6e607327	Refactor send_tarball() (#3259 ) The Basebackup struct is really just a convenient place to carry the various parameters around in send_tarball and its subroutines. Make it internal to the send_tarball function.	2023-01-04 23:03:16 +02:00
Kirill Bulatov	10dae79c6d	Tone down safekeeper and pageserver walreceiver errors (#3227 ) Closes https://github.com/neondatabase/neon/issues/3114 Adds more typization into errors that appear during protocol messages (`FeMessage`), postgres and walreceiver connections. Socket IO errors are now better detected and logged with lesser (INFO, DEBUG) error level, without traces that they were logged before, when they were wrapped in anyhow context.	2023-01-03 20:42:04 +00:00
Heikki Linnakangas	e9583db73b	Remove code and test to generate flamegraph on GetPage requests. (#3257 ) It was nice to have and useful at the time, but unfortunately the method used to gather the profiling data doesn't play nicely with 'async'. PR #3228 will turn 'get_page_at_lsn' function async, which will break the profiling support. Let's remove it, and re-introduce some kind of profiling later, using some different method, if we feel like we need it again.	2023-01-03 20:11:32 +02:00
Vadim Kharitonov	0b428f7c41	Enable licenses check for 3rd-parties	2023-01-03 15:11:50 +01:00
Heikki Linnakangas	8b692e131b	Enable on-demand download in WalIngest. (#3233 ) Makes the top-level functions in WalIngest async, and replaces no_ondemand_download calls with with_ondemand_download. This hopefully fixes the problem reported in issue #3230, although I don't have a self-contained test case for it.	2023-01-03 14:44:42 +02:00
Heikki Linnakangas	0a0e55c3d0	Replace 'tar' crate with 'tokio-tar' (#3202 ) The synchronous 'tar' crate has required us to use block_in_place and SyncIoBridge to work together with the async I/O in the client connection. Switch to 'tokio-tar' crate that uses async I/O natively. As part of this, move the CopyDataWriter implementation to postgres_backend_async.rs. Even though it's only used in one place currently, it's in principle generally applicable whenever you want to use COPY out. Unfortunately we cannot use the 'tokio-tar' as it is: the Builder implementation requires the writer to have 'static lifetime. So we have to use a modified version without that requirement. The 'static lifetime was required just for the Drop implementation that writes the end-of-archive sections if the Builder is dropped without calling `finish`. But we don't actually want that behavior anyway; in fact we had to jump through some hoops with the AbortableWrite hack to skip those. With the modified version of 'tokio-tar' without that Drop implementation, we don't need AbortableWrite either. Co-authored-by: Kirill Bulatov <kirill@neon.tech>	2023-01-03 12:39:11 +02:00
Shany Pozin	182dc785d6	Set PITR default to 7 days (#3245 ) https://github.com/neondatabase/cloud/issues/3406	2023-01-02 18:05:23 +02:00
Heikki Linnakangas	81afd7011c	Use rustls for everything. I looked at "cargo tree" output and noticed that through various dependencies, we are depending on both native-tls and rustls. We have tried to standardize on rustls for everything, but dependencies on native-tls have crept in recently. One such dependency came from 'reqwest' with default features in pageserver, used for consumption_metrics. Another dependency was from 'sentry'. Both 'reqwest' and 'sentry' use native-tls by default, but can use 'rustls' if compiled with the right feature flags.	2023-01-02 11:14:35 +02:00
Egor Suvorov	9f94d098aa	Remove unused AuthType::MD5	2022-12-31 02:27:08 +03:00
Anastasia Lubennikova	8ff7bc5df1	Add timleline_logical_size metric. Send this metric only when it is fully calculated. Make consumption metrics more stable: - Send per-timeline metrics only for active timelines. - Adjust test assertions to make test_metric_collection test more stable.	2022-12-29 19:13:54 +02:00
Heikki Linnakangas	890ff3803e	Allow update_gc_info to download files on-demand.	2022-12-29 16:23:25 +02:00
Heikki Linnakangas	fefe19a284	Avoid calling find_lsn_for_timestamp call while holding lock. Refactor update_gc_info function so that it calls the potentially expensive find_lsn_for_timestamp() function before acquiring the lock. This will also be needed if we make find_lsn_for_timestamp() async in the future; it cannot be awaited while holding the lock.	2022-12-29 16:23:25 +02:00
Anastasia Lubennikova	894ac30734	Rename billing_metrics to consumption_metrics. Use more appropriate term, because not all of these metrics are used for billing.	2022-12-29 14:25:47 +02:00
Shany Pozin	c0290467fa	Fix #2907 Remove missing_layers from IndexPart (#3217 ) #2907	2022-12-29 12:33:30 +02:00
Konstantin Knizhnik	0e7c03370e	Lazy calculation of traversal_id which is needed only for error reporting (#3221) See https://neondb.slack.com/archives/C0277TKAJCA/p1672245908989789 and https://neondb.slack.com/archives/C033RQ5SPDH/p1671885245981359	2022-12-29 12:20:28 +02:00
Anastasia Lubennikova	f731e9b3de	Fix serialization of billing metrics (#3215 ) Fixes: - serialize TenantId and TimelineId as strings, - skip TimelineId if none - serialize `metric_type` field as `type` - add `idempotency_key` field to uniquely identify metrics	2022-12-29 12:11:04 +02:00
Dmitry Rodionov	bd7a9e6274	switch to debug from info to produce less noise	2022-12-29 13:01:17 +03:00
Egor Suvorov	42c6ddef8e	Rename ZENITH_AUTH_TOKEN to NEON_AUTH_TOKEN Changes are: * Pageserver: start reading from NEON_AUTH_TOKEN by default. Warn if ZENITH_AUTH_TOKEN is used instead. * Compute, Docs: fix the default token name. * Control plane: change name of the token in configs and start sequences. Compatibility: * Control plane in tests: works, no compatibility expected. * Control plane for local installations: never officially supported auth anyways. If someone did enable it, `pageserver.toml` should be updated with the new `neon.pageserver_connstring` and `neon.safekeeper_token_env`. * Pageserver is backward compatible: you can run new Pageserver with old commands and environment configurations, but not vice-versa. The culprit is the hard-coded `NEON_AUTH_TOKEN`. * Compute has no code changes. As long as you update its configuration file with `pageserver_connstring` in sync with the start up scripts, you are good to go. * Safekeeper has no code changes and has never used `ZENITH_AUTH_TOKEN` in the first place.	2022-12-29 03:33:43 +03:00
Shany Pozin	172c7e5f92	Split upload queue code from storage_sync.rs (#3216 ) https://github.com/neondatabase/neon/issues/3208	2022-12-28 15:12:06 +02:00
Shany Pozin	0c7b02ebc3	Move tenant related files to tenant directory (#3214 ) Related to https://github.com/neondatabase/neon/issues/3208	2022-12-28 09:20:01 +02:00
Konstantin Knizhnik	1137b58b4d	Fix LayerMap::search to not return delta layer preceding image layer (#3197) While @bojanserafimov is still working on the best replacement for the R-Tree in layer_map.rs, there is an obvious pitfall in the current `search` method implementation: it returns a delta layer even if there is an image layer with a greater LSN. I think that it should be fixed.	2022-12-26 18:21:41 +02:00
Kirill Bulatov	b77c33ee06	Move tenant-related modules below `tenant` module (#3190 ) No real code changes besides moving code around and adjusting the imports.	2022-12-23 15:40:37 +02:00
Kirill Bulatov	0bafb2a6c7	Do more on-demand downloads where needed (#3194) The PR aims to fix two missing redownloads in a flaky test_remote_storage_upload_queue_retries[local_fs] ([example](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-3190/release/3759194738/index.html#categories/80f1dcdd7c08252126be7e9f44fe84e6/8a70800f7ab13620/)) 1. missing redownload during walreceiver work ``` 2022-12-22T16:09:51.509891Z ERROR wal_connection_manager{tenant=fb62b97553e40f949de8bdeab7f93563 timeline=4f153bf6a58fd63832f6ee175638d049}: wal receiver task finished with an error: walreceiver connection handling failure Caused by: Layer needs downloading Stack backtrace: 0: pageserver::tenant::timeline::PageReconstructResult<T>::no_ondemand_download at /__w/neon/neon/pageserver/src/tenant/timeline.rs:467:59 1: pageserver::walingest::WalIngest::new at /__w/neon/neon/pageserver/src/walingest.rs:61:32 2: pageserver::walreceiver::walreceiver_connection::handle_walreceiver_connection::{{closure}} at /__w/neon/neon/pageserver/src/walreceiver/walreceiver_connection.rs:178:25 .... ``` That looks sad, but inevitable during the current approach: seems that we need to wait for old layers to arrive in order to accept new data. For that, `WalIngest::new` now started to return the `PageReconstructResult`. Sync methods from `import_datadir.rs` use `WalIngest::new` too, but both of them import WAL during timeline creation, so no layers to download are needed there, ergo the `PageReconstructResult` is converted to `anyhow::Result` with `no_ondemand_download`. 2. missing redownload during compaction work ``` 2022-12-22T16:09:51.090296Z ERROR compaction_loop{tenant_id=fb62b97553e40f949de8bdeab7f93563}:compact_timeline{timeline=4f153bf6a58fd63832f6ee175638d049}: could not compact, repartitioning keyspace failed: Layer needs downloading Stack backtrace: 0: pageserver::tenant::timeline::PageReconstructResult<T>::no_ondemand_download at /__w/neon/neon/pageserver/src/tenant/timeline.rs:467:59 1: pageserver::pgdatadir_mapping::<impl pageserver::tenant::timeline::Timeline>::collect_keyspace::{{closure}} at /__w/neon/neon/pageserver/src/pgdatadir_mapping.rs:506:41 <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19 pageserver::tenant::timeline::Timeline::repartition::{{closure}} at /__w/neon/neon/pageserver/src/tenant/timeline.rs:2161:50 <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19 2: pageserver::tenant::timeline::Timeline::compact::{{closure}} at /__w/neon/neon/pageserver/src/tenant/timeline.rs:700:14 <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19 3: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll at /github/home/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-0.1.37/src/instrument.rs:272:9 4: pageserver::tenant::Tenant::compaction_iteration::{{closure}} at /__w/neon/neon/pageserver/src/tenant.rs:1232:85 <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19 pageserver::tenant_tasks::compaction_loop::{{closure}}::{{closure}} at /__w/neon/neon/pageserver/src/tenant_tasks.rs:76:62 <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19 pageserver::tenant_tasks::compaction_loop::{{closure}} at /__w/neon/neon/pageserver/src/tenant_tasks.rs:91:6 ```	2022-12-23 15:39:59 +02:00

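Note on commit 148e020fb9 (sort updates in topological order so that the parent timeline precedes its children): the idea can be sketched roughly as below. This is a minimal illustration only, not the code from the commit. `TimelineUpdate`, its fields, and `sort_parents_first` are hypothetical names invented for the example, and timelines are assumed to form a forest with at most one parent each.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Hypothetical stand-in for a per-timeline update: an id plus an optional parent id.
#[derive(Debug, Clone, PartialEq)]
struct TimelineUpdate {
    id: u32,
    parent: Option<u32>,
}

/// Returns the updates ordered so that every parent appears before its children.
/// Updates whose parent is absent from the input are treated as roots.
/// Updates that are part of a parent cycle are never reached and are dropped.
fn sort_parents_first(updates: Vec<TimelineUpdate>) -> Vec<TimelineUpdate> {
    let known: HashSet<u32> = updates.iter().map(|u| u.id).collect();
    let mut children: HashMap<u32, Vec<TimelineUpdate>> = HashMap::new();
    let mut queue: VecDeque<TimelineUpdate> = VecDeque::new();

    // Roots go straight to the queue; everything else waits under its parent.
    for u in updates {
        match u.parent {
            Some(p) if known.contains(&p) => children.entry(p).or_default().push(u),
            _ => queue.push_back(u),
        }
    }

    // Breadth-first walk: a child is enqueued only after its parent is emitted.
    let mut sorted = Vec::new();
    while let Some(u) = queue.pop_front() {
        if let Some(kids) = children.remove(&u.id) {
            queue.extend(kids);
        }
        sorted.push(u);
    }
    sorted
}

fn main() {
    let updates = vec![
        TimelineUpdate { id: 3, parent: Some(2) },
        TimelineUpdate { id: 2, parent: Some(1) },
        TimelineUpdate { id: 1, parent: None },
    ];
    for u in sort_parents_first(updates) {
        println!("{:?}", u);
    }
}
```

Processing in this order means any value derived from a parent is already available when a child is handled.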

1122 Commits
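
Note on commit fefe19a284 (call find_lsn_for_timestamp() before acquiring the lock): the pattern of doing the expensive work first and holding the lock only for the final update can be sketched as below. This is an illustration under assumptions, not the pageserver code. `GcInfo`, `expensive_cutoff_lookup`, and this `update_gc_info` signature are hypothetical, and a plain `std::sync::Mutex` stands in for whatever lock the real code uses.

```rust
use std::sync::Mutex;

/// Hypothetical GC bookkeeping guarded by a lock.
#[derive(Debug, Default)]
struct GcInfo {
    cutoff: u64,
}

/// Hypothetical stand-in for a potentially expensive lookup.
fn expensive_cutoff_lookup(timestamp: u64) -> u64 {
    // Pretend this scans a lot of data; here it is just arithmetic.
    timestamp / 10
}

fn update_gc_info(gc_info: &Mutex<GcInfo>, timestamp: u64) {
    // Do the slow work first, without holding the lock.
    let cutoff = expensive_cutoff_lookup(timestamp);

    // Take the lock only for the short assignment.
    let mut guard = gc_info.lock().unwrap();
    guard.cutoff = cutoff;
}

fn main() {
    let gc_info = Mutex::new(GcInfo::default());
    update_gc_info(&gc_info, 1000);
    println!("{:?}", gc_info.lock().unwrap());
}
```

Keeping the guard out of scope during the lookup also matters if the lookup later becomes async, as the commit message notes, since the lock cannot be held while awaiting.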