rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-06 13:40:37 +00:00

Author	SHA1	Message	Date
Heikki Linnakangas	8b710b9753	Fix segfault if pageserver connection is lost during backend startup. It's not OK to return early from within a PG_TRY-CATCH block. The PG_TRY macro sets the global PG_exception_stack variable, and PG_END_TRY restores it. If we jump out in between with "return NULL", the PG_exception_stack is left to point to garbage. (I'm surprised the comments in PG_TRY_CATCH don't warn about this.) Add test that re-attaches tenant in pageserver while Postgres is running. If the tenant is detached while compute is connected and busy running queries, those queries will fail if they try to fetch any pages. But when the tenant is re-attached, things should start working again, without disconnecting the client <-> postgres connections. Without this fix, this reproduced the segfault. Fixes issue #3231	2023-01-05 18:51:47 +02:00
Heikki Linnakangas	c187de1101	Copy error message before it's freed. pageserver_disconnect() call invalidates 'pageserver_conn', including the error message pointer we got from PQerrorMessage(pageserver_conn). Copy the message to a temporary variable before disconnecting, like we do in a few other places. In passing, clear 'pageserver_conn_wes' variable in a few places where it was free'd. I didn't see any live bug from this, but since pageserver_disconnect() checks if it's NULL, let's not leave it dangling to already-free'd memory.	2023-01-05 18:51:47 +02:00
Kirill Bulatov	8712e1899e	Move initial timeline creation into pytest (#3270 ) For every Python test, we start the storage first, and expect that later, in the test, when we start a compute, it will work without specific timeline and tenant creation or their IDs specified. For that, we have a concept of "default" branch that was created on the control plane level first, but that's not needed at all, given that it's only Python tests that need it: let them create the initial timeline during set-up. Before, control plane started and stopped pageserver for timeline creation, now Python harness runs an extra tenant creation request on test env init. I had to adjust the metrics test; it turns out it registered the metrics from the default tenant after an extra pageserver restart. The new model does not send the metrics before the collection time happens, and that was 30s before.	2023-01-05 17:48:27 +02:00
Christian Schwarz	d7f1e30112	remote_timeline_client: more metrics & metrics-related cleanups - Clean up redundant metric removal in TimelineMetrics::drop. RemoteTimelineClientMetrics is responsible for cleaning up REMOTE_OPERATION_TIME and REMOTE_UPLOAD_QUEUE_UNFINISHED_TASKS. - Rename `pageserver_remote_upload_queue_unfinished_tasks` to `pageserver_remote_timeline_client_calls_unfinished`. The new name reflects that the metric is with respect to the entire call to remote timeline client. This includes wait time in the upload queue and hence it's a longer span than what `pageserver_remote_OPERATION_seconds` measures. - Add the `pageserver_remote_timeline_client_calls_started` histogram. See the metric description for why we need it. - Add helper functions `call_begin` etc to `RemoteTimelineClientMetrics` to centralize the logic for updating the metrics above (they relate to each other, see comments in code). - Use these constructs to track ongoing downloads in `pageserver_remote_timeline_client_calls_unfinished` refs https://github.com/neondatabase/neon/issues/2029 fixes https://github.com/neondatabase/neon/issues/3249 closes https://github.com/neondatabase/neon/pull/3250	2023-01-05 11:50:17 +01:00
Christian Schwarz	6a9d1030a6	use RemoteTimelineClient for downloading index part during tenant_attach Before this change, we would not .measure_remote_op for index part downloads. And more generally, it's good to pass not just uploads but also downloads through RemoteTimelineClient, e.g., if we ever want to implement some timeline-scoped policies there. Found this while working on https://github.com/neondatabase/neon/pull/3250 where I add a metric to measure the degree of concurrent downloads. Layer download was missing in a test that I added there.	2023-01-05 11:08:50 +01:00
Heikki Linnakangas	8c6e607327	Refactor send_tarball() (#3259 ) The Basebackup struct is really just a convenient place to carry the various parameters around in send_tarball and its subroutines. Make it internal to the send_tarball function.	2023-01-04 23:03:16 +02:00
Vadim Kharitonov	f436fb2dfb	Fix panics at compute_ctl:monitor	2023-01-04 17:26:42 +01:00
Kirill Bulatov	8932d14d50	Revert "Run Python tests in 8 threads (#3206 )" (#3264 ) This reverts commit `56a4466d0a`. Seems that flakiness increased after this commit, while the time decrease was a couple of seconds. With every regular Python test spawning 1 etcd, 3 safekeepers, 1 pageserver, a few CLI commands and post-run cleanup hooks, it might be hard to run many such tests in parallel. We could return to this later, after we consider alternative test structure and/or CI runner structure.	2023-01-04 17:31:51 +02:00
Kirill Bulatov	efad64bc7f	Expect compute shutdown test log error (#3262 ) https://neon-github-public-dev.s3.amazonaws.com/reports/pr-3261/debug/3833043374/index.html#suites/ffbb7f9930a77115316b58ff32b7c719/1f6ebaedc0a113a1/ Spotted a flaky test that appeared after https://github.com/neondatabase/neon/pull/3227 changes	2023-01-04 10:45:11 +00:00
Kirill Bulatov	10dae79c6d	Tone down safekeeper and pageserver walreceiver errors (#3227 ) Closes https://github.com/neondatabase/neon/issues/3114 Adds more typing to errors that appear during protocol messages (`FeMessage`), postgres and walreceiver connections. Socket IO errors are now better detected and logged at a lower (INFO, DEBUG) level, without the traces they were logged with before, when they were wrapped in anyhow context.	2023-01-03 20:42:04 +00:00
Heikki Linnakangas	e9583db73b	Remove code and test to generate flamegraph on GetPage requests. (#3257 ) It was nice to have and useful at the time, but unfortunately the method used to gather the profiling data doesn't play nicely with 'async'. PR #3228 will turn 'get_page_at_lsn' function async, which will break the profiling support. Let's remove it, and re-introduce some kind of profiling later, using some different method, if we feel like we need it again.	2023-01-03 20:11:32 +02:00
Vadim Kharitonov	0b428f7c41	Enable licenses check for 3rd-parties	2023-01-03 15:11:50 +01:00
Heikki Linnakangas	8b692e131b	Enable on-demand download in WalIngest. (#3233 ) Makes the top-level functions in WalIngest async, and replaces no_ondemand_download calls with with_ondemand_download. This hopefully fixes the problem reported in issue #3230, although I don't have a self-contained test case for it.	2023-01-03 14:44:42 +02:00
Heikki Linnakangas	0a0e55c3d0	Replace 'tar' crate with 'tokio-tar' (#3202 ) The synchronous 'tar' crate has required us to use block_in_place and SyncIoBridge to work together with the async I/O in the client connection. Switch to 'tokio-tar' crate that uses async I/O natively. As part of this, move the CopyDataWriter implementation to postgres_backend_async.rs. Even though it's only used in one place currently, it's in principle generally applicable whenever you want to use COPY out. Unfortunately we cannot use the 'tokio-tar' as it is: the Builder implementation requires the writer to have 'static lifetime. So we have to use a modified version without that requirement. The 'static lifetime was required just for the Drop implementation that writes the end-of-archive sections if the Builder is dropped without calling `finish`. But we don't actually want that behavior anyway; in fact we had to jump through some hoops with the AbortableWrite hack to skip those. With the modified version of 'tokio-tar' without that Drop implementation, we don't need AbortableWrite either. Co-authored-by: Kirill Bulatov <kirill@neon.tech>	2023-01-03 12:39:11 +02:00
Christian Schwarz	5bc9f8eae0	README: Fedora needs protobuf-devel Otherwise, common protobufs such as Google's empty.proto are missing, resulting in storage_broker build.rs failure. I encountered this on Fedora 36.	2023-01-03 11:05:23 +01:00
Sergey Melnikov	4c4d3dc87a	Add new pageserver to us-east-2 staging (#3248 )	2023-01-02 22:14:05 +04:00
Shany Pozin	182dc785d6	Set PITR default to 7 days (#3245 ) https://github.com/neondatabase/cloud/issues/3406	2023-01-02 18:05:23 +02:00
Kirill Bulatov	a9cca7a0fd	Use proper error code for BeMessage error responses (#3240 ) Based on https://github.com/neondatabase/neon/pull/3227#discussion_r1059430067 Seems that the constant, used for internal error during BeMessage error response serialization is incorrect. Currently used one is `CXX000`, yet all docs mention `XX000` instead: * https://www.postgresql.org/docs/current/errcodes-appendix.html * https://docs.rs/postgres/latest/postgres/error/struct.SqlState.html#associatedconstant.INTERNAL_ERROR I have checked it with the patch and logs described in https://github.com/neondatabase/neon/pull/3227#discussion_r1059949982	2023-01-02 16:51:05 +02:00
Arseny Sher	6fd64cd5f6	Allow failure to report metrics in test_metric_collection. Per CI https://github.com/neondatabase/neon/actions/runs/3822039946/attempts/1 shutdown seems to be racy.	2023-01-02 15:37:05 +03:00
Kirill Bulatov	56a4466d0a	Run Python tests in 8 threads (#3206 ) I have experimented with the number of runner threads, and it looks like 8 threads win us a few seconds. Bumping the thread count more did not improve the situation much: * 20 threads were not allowed by pytest * 16 threads were flaking quite notably My guess would be that all pageservers, safekeepers, and other nodes we start occupy too much of the CPU and other resources to make this approach more scalable.	2023-01-02 14:34:06 +02:00
Arseny Sher	41b8e67305	Fix `81afd7011` by enabling reqwest feature for sentry. It disabled transport altogether.	2023-01-02 15:29:33 +03:00
Heikki Linnakangas	81afd7011c	Use rustls for everything. I looked at "cargo tree" output and noticed that through various dependencies, we are depending on both native-tls and rustls. We have tried to standardize on rustls for everything, but dependencies on native-tls have crept in recently. One such dependency came from 'reqwest' with default features in pageserver, used for consumption_metrics. Another dependency was from 'sentry'. Both 'reqwest' and 'sentry' use native-tls by default, but can use 'rustls' if compiled with the right feature flags.	2023-01-02 11:14:35 +02:00
dependabot[bot]	3468db8a2b	Bump setuptools from 65.5.0 to 65.5.1 (#3212 )	2023-01-02 08:47:28 +01:00
Egor Suvorov	9f94d098aa	Remove unused AuthType::MD5	2022-12-31 02:27:08 +03:00
Egor Suvorov	cb61944982	Safekeeper: refactor auth validation * Load public auth key on startup and store it in the config. * Get rid of a separate `auth` parameter which was passed all over the place.	2022-12-31 02:27:08 +03:00
Dmitry Ivanov	c700c7db2e	[proxy] Add more labels to the pricing metrics	2022-12-29 22:25:52 +03:00
Dmitry Rodionov	7c7d225d98	add pageserver to new region see https://github.com/neondatabase/aws/pull/116	2022-12-29 20:29:42 +03:00
Anastasia Lubennikova	8ff7bc5df1	Add timeline_logical_size metric. Send this metric only when it is fully calculated. Make consumption metrics more stable: - Send per-timeline metrics only for active timelines. - Adjust test assertions to make test_metric_collection test more stable.	2022-12-29 19:13:54 +02:00
Heikki Linnakangas	890ff3803e	Allow update_gc_info to download files on-demand.	2022-12-29 16:23:25 +02:00
Heikki Linnakangas	fefe19a284	Avoid calling find_lsn_for_timestamp while holding lock. Refactor update_gc_info function so that it calls the potentially expensive find_lsn_for_timestamp() function before acquiring the lock. This will also be needed if we make find_lsn_for_timestamp() async in the future; it cannot be awaited while holding the lock.	2022-12-29 16:23:25 +02:00
Vadim Kharitonov	434fcac357	Remove unused HTTP endpoints from compute_ctl	2022-12-29 13:59:40 +01:00
Anastasia Lubennikova	894ac30734	Rename billing_metrics to consumption_metrics. Use more appropriate term, because not all of these metrics are used for billing.	2022-12-29 14:25:47 +02:00
Shany Pozin	c0290467fa	Fix #2907 Remove missing_layers from IndexPart (#3217 ) #2907	2022-12-29 12:33:30 +02:00
Konstantin Knizhnik	0e7c03370e	Lazy calculation of traversal_id which is needed only for error reporting (#3221 ) See https://neondb.slack.com/archives/C0277TKAJCA/p1672245908989789 and https://neondb.slack.com/archives/C033RQ5SPDH/p1671885245981359	2022-12-29 12:20:28 +02:00
Anastasia Lubennikova	f731e9b3de	Fix serialization of billing metrics (#3215 ) Fixes: - serialize TenantId and TimelineId as strings, - skip TimelineId if none - serialize `metric_type` field as `type` - add `idempotency_key` field to uniquely identify metrics	2022-12-29 12:11:04 +02:00
Dmitry Rodionov	bd7a9e6274	switch to debug from info to produce less noise	2022-12-29 13:01:17 +03:00
Egor Suvorov	42c6ddef8e	Rename ZENITH_AUTH_TOKEN to NEON_AUTH_TOKEN Changes are: * Pageserver: start reading from NEON_AUTH_TOKEN by default. Warn if ZENITH_AUTH_TOKEN is used instead. * Compute, Docs: fix the default token name. * Control plane: change name of the token in configs and start sequences. Compatibility: * Control plane in tests: works, no compatibility expected. * Control plane for local installations: never officially supported auth anyways. If someone did enable it, `pageserver.toml` should be updated with the new `neon.pageserver_connstring` and `neon.safekeeper_token_env`. * Pageserver is backward compatible: you can run new Pageserver with old commands and environment configurations, but not vice-versa. The culprit is the hard-coded `NEON_AUTH_TOKEN`. * Compute has no code changes. As long as you update its configuration file with `pageserver_connstring` in sync with the start up scripts, you are good to go. * Safekeeper has no code changes and has never used `ZENITH_AUTH_TOKEN` in the first place.	2022-12-29 03:33:43 +03:00
Shany Pozin	172c7e5f92	Split upload queue code from storage_sync.rs (#3216 ) https://github.com/neondatabase/neon/issues/3208	2022-12-28 15:12:06 +02:00
Shany Pozin	0c7b02ebc3	Move tenant related files to tenant directory (#3214 ) Related to https://github.com/neondatabase/neon/issues/3208	2022-12-28 09:20:01 +02:00
Arseny Sher	f6bf7b2003	Add tenant_id to safekeeper spans. Now that it's hard to map timeline id into project in the console, this should help a little.	2022-12-27 20:19:12 +03:00
Arseny Sher	fee8bf3a17	Remove global_commit_lsn. It is complicated and fragile to maintain and not really needed; update commit_lsn locally only when we have enough WAL flushed. ref https://github.com/neondatabase/neon/issues/3069	2022-12-27 20:19:12 +03:00
Arseny Sher	1ad6e186bc	Refuse ProposerElected if it is going to truncate correct WAL. Prevents commit_lsn monotonicity violation (otherwise harmless). closes https://github.com/neondatabase/neon/issues/3069	2022-12-27 20:19:12 +03:00
Konstantin Knizhnik	140c0edac8	Yet another port of local file system cache (#2622 )	2022-12-27 14:42:51 +02:00
Anna Stepanyan	5826e19b56	update the grafana links in the PR release template (#3156 )	2022-12-27 10:25:19 +01:00
Konstantin Knizhnik	1137b58b4d	Fix LayerMap::search to not return delta layer preceding image layer (#3197 ) While @bojanserafimov is still working on the best replacement of R-Tree in layer_map.rs, there is an obvious pitfall in the current `search` method implementation: it returns a delta layer even if there is an image layer with a greater LSN. I think that it should be fixed.	2022-12-26 18:21:41 +02:00
Anastasia Lubennikova	1468c65ffb	Enable billing metric_collection_endpoint on staging	2022-12-23 18:04:45 +02:00
Kirill Bulatov	b77c33ee06	Move tenant-related modules below `tenant` module (#3190 ) No real code changes besides moving code around and adjusting the imports.	2022-12-23 15:40:37 +02:00
Kirill Bulatov	0bafb2a6c7	Do more on-demand downloads where needed (#3194 ) The PR aims to fix two missing redownloads in a flaky test_remote_storage_upload_queue_retries[local_fs] ([example](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-3190/release/3759194738/index.html#categories/80f1dcdd7c08252126be7e9f44fe84e6/8a70800f7ab13620/)) 1. missing redownload during walreceiver work ``` 2022-12-22T16:09:51.509891Z ERROR wal_connection_manager{tenant=fb62b97553e40f949de8bdeab7f93563 timeline=4f153bf6a58fd63832f6ee175638d049}: wal receiver task finished with an error: walreceiver connection handling failure Caused by: Layer needs downloading Stack backtrace: 0: pageserver::tenant::timeline::PageReconstructResult<T>::no_ondemand_download at /__w/neon/neon/pageserver/src/tenant/timeline.rs:467:59 1: pageserver::walingest::WalIngest::new at /__w/neon/neon/pageserver/src/walingest.rs:61:32 2: pageserver::walreceiver::walreceiver_connection::handle_walreceiver_connection::{{closure}} at /__w/neon/neon/pageserver/src/walreceiver/walreceiver_connection.rs:178:25 .... ``` That looks sad, but inevitable with the current approach: it seems that we need to wait for old layers to arrive in order to accept new data. For that, `WalIngest::new` now returns the `PageReconstructResult`. Sync methods from `import_datadir.rs` use `WalIngest::new` too, but both of them import WAL during timeline creation, so no layers to download are needed there, ergo the `PageReconstructResult` is converted to `anyhow::Result` with `no_ondemand_download`. 2.
missing redownload during compaction work ``` 2022-12-22T16:09:51.090296Z ERROR compaction_loop{tenant_id=fb62b97553e40f949de8bdeab7f93563}:compact_timeline{timeline=4f153bf6a58fd63832f6ee175638d049}: could not compact, repartitioning keyspace failed: Layer needs downloading Stack backtrace: 0: pageserver::tenant::timeline::PageReconstructResult<T>::no_ondemand_download at /__w/neon/neon/pageserver/src/tenant/timeline.rs:467:59 1: pageserver::pgdatadir_mapping::<impl pageserver::tenant::timeline::Timeline>::collect_keyspace::{{closure}} at /__w/neon/neon/pageserver/src/pgdatadir_mapping.rs:506:41 <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19 pageserver::tenant::timeline::Timeline::repartition::{{closure}} at /__w/neon/neon/pageserver/src/tenant/timeline.rs:2161:50 <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19 2: pageserver::tenant::timeline::Timeline::compact::{{closure}} at /__w/neon/neon/pageserver/src/tenant/timeline.rs:700:14 <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19 3: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll at /github/home/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-0.1.37/src/instrument.rs:272:9 4: pageserver::tenant::Tenant::compaction_iteration::{{closure}} at /__w/neon/neon/pageserver/src/tenant.rs:1232:85 <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19 pageserver::tenant_tasks::compaction_loop::{{closure}}::{{closure}} at /__w/neon/neon/pageserver/src/tenant_tasks.rs:76:62 
<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19 pageserver::tenant_tasks::compaction_loop::{{closure}} at /__w/neon/neon/pageserver/src/tenant_tasks.rs:91:6 ```	2022-12-23 15:39:59 +02:00
Sergey Melnikov	c01f92c081	Fully remove old staging deploy (#3191 )	2022-12-22 20:09:45 +01:00
Sergey Melnikov	7bc17b373e	Fix calculate-deploy-targets (#3189 ) Was broken in https://github.com/neondatabase/neon/pull/3180	2022-12-22 16:28:36 +01:00
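
The locking change described in commit fefe19a284 above (run the potentially expensive lookup first, then take the lock) can be illustrated with a minimal Rust sketch. This is not the pageserver code: `GcInfo`, `expensive_lookup`, and the arithmetic inside it are hypothetical stand-ins chosen for illustration, and a plain `std::sync::Mutex` is assumed in place of whatever lock the real `update_gc_info` uses.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the state that the lock protects.
#[derive(Debug, Default, PartialEq)]
struct GcInfo {
    pitr_cutoff: u64,
}

/// Hypothetical stand-in for a slow lookup such as find_lsn_for_timestamp().
/// The multiplication is arbitrary; it only makes the result depend on the input.
fn expensive_lookup(timestamp: u64) -> u64 {
    timestamp * 8
}

/// Illustrative version of the refactored pattern, not the real function.
fn update_gc_info(gc_info: &Mutex<GcInfo>, timestamp: u64) {
    // Do the slow work first, while no lock is held.
    let cutoff = expensive_lookup(timestamp);

    // Hold the lock only for the short critical section that stores the result.
    let mut guard = gc_info.lock().unwrap();
    guard.pitr_cutoff = cutoff;
}

fn main() {
    let gc_info = Mutex::new(GcInfo::default());
    update_gc_info(&gc_info, 5);
    println!("{:?}", gc_info.lock().unwrap());
}
```

Ordering the work this way keeps the critical section short, and, as the commit message notes, it also matters if the lookup later becomes async, since it could then not be awaited while the lock is held.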

2598 Commits