It's not OK to return early from within a PG_TRY-CATCH block. The
PG_TRY macro sets the global PG_exception_stack variable, and
PG_END_TRY restores it. If we jump out in between with "return NULL",
the PG_exception_stack is left to point to garbage. (I'm surprised the
comments in PG_TRY_CATCH don't warn about this.)
Add test that re-attaches tenant in pageserver while Postgres is
running. If the tenant is detached while compute is connected and
busy running queries, those queries will fail if they try to fetch any
pages. But when the tenant is re-attached, things should start working
again, without disconnecting the client <-> postgres connections.
Without this fix, this reproduced the segfault.
Fixes issue #3231
pageserver_disconnect() call invalidates 'pageserver_conn', including
the error message pointer we got from PQerrorMessage(pageserver_conn).
Copy the message to a temporary variable before disconnecting, like
we do in a few other places.
In the passing, clear 'pageserver_conn_wes' variable in a few places
where it was free'd. I didn't see any live bug from this, but since
pageserver_disconnect() checks if it's NULL, let's not leave it
dangling to already-free'd memory.
For every Python test, we start the storage first, and expect that
later, in the test, when we start a compute, it will work without
specific timeline and tenant creation or their IDs specified.
For that, we have a concept of "default" branch that was created on the
control plane level first, but that's not needed at all, given that it's
only Python tests that need it: let them create the initial timeline
during set-up.
Before, control plane started and stopped pageserver for timeline
creation, now Python harness runs an extra tenant creation request on
test env init.
I had to adjust the metrics test, turns out it registered the metrics
from the default tenant after an extra pageserver restart.
New model does not sent the metrics before the collection time happens,
and that was 30s before.
- Clean up redundant metric removal in TimelineMetrics::drop.
RemoteTimelineClientMetrics is responsible for cleaning up
REMOTE_OPERATION_TIME andREMOTE_UPLOAD_QUEUE_UNFINISHED_TASKS.
- Rename `pageserver_remote_upload_queue_unfinished_tasks` to
`pageserver_remote_timeline_client_calls_unfinished`. The new name
reflects that the metric is with respect to the entire call to remote
timeline client. This includes wait time in the upload queue and hence
it's a longer span than what `pageserver_remote_OPERATION_seconds`
measures.
- Add the `pageserver_remote_timeline_client_calls_started` histogram.
See the metric description for why we need it.
- Add helper functions `call_begin` etc to `RemoteTimelineClientMetrics`
to centralize the logic for updating the metrics above (they relate to
each other, see comments in code).
- Use these constructs to track ongoing downloads in
`pageserver_remote_timeline_client_calls_unfinished`
refs https://github.com/neondatabase/neon/issues/2029
fixes https://github.com/neondatabase/neon/issues/3249
closes https://github.com/neondatabase/neon/pull/3250
Before this change, we would not .measure_remote_op for index part
downloads.
And more generally, it's good to pass not just uploads but also
downloads through RemoteTimelineClient, e.g., if we ever want to
implement some timeline-scoped policies there.
Found this while working on https://github.com/neondatabase/neon/pull/3250
where I add a metric to measure the degree of concurrent downloads.
Layer download was missing in a test that I added there.
The Basebackup struct is really just a convenient place to carry the
various parameters around in send_tarball and its subroutines. Make it
internal to the send_tarball function.
This reverts commit 56a4466d0a.
Seems that flackiness increased after this commit, while the time
decrease was a couple of seconds.
With every regular Python test spawing 1 etcd, 3 safekeepers, 1
pageserver, few CLI commands and post-run cleanup hooks, it might be
hard to run many such tests in parallel.
We could return to this later, after we consider alternative test
structure and/or CI runner structure.
Closes https://github.com/neondatabase/neon/issues/3114
Adds more typization into errors that appear during protocol messages (`FeMessage`), postgres and walreceiver connections.
Socket IO errors are now better detected and logged with lesser (INFO, DEBUG) error level, without traces that they were logged before, when they were wrapped in anyhow context.
It was nice to have and useful at the time, but unfortunately the method
used to gather the profiling data doesn't play nicely with 'async'. PR
#3228 will turn 'get_page_at_lsn' function async, which will break the
profiling support. Let's remove it, and re-introduce some kind of
profiling later, using some different method, if we feel like we need it
again.
Makes the top-level functions in WalIngest async, and replaces
no_ondemand_download calls with with_ondemand_download.
This hopefully fixes the problem reported in issue #3230, although I
don't have a self-contained test case for it.
The synchronous 'tar' crate has required us to use block_in_place and
SyncIoBridge to work together with the async I/O in the client
connection. Switch to 'tokio-tar' crate that uses async I/O natively.
As part of this, move the CopyDataWriter implementation to
postgres_backend_async.rs. Even though it's only used in one place
currently, it's in principle generally applicable whenever you want to
use COPY out.
Unfortunately we cannot use the 'tokio-tar' as it is: the Builder
implementation requires the writer to have 'static lifetime. So we
have to use a modified version without that requirement. The 'static
lifetime was required just for the Drop implementation that writes
the end-of-archive sections if the Builder is dropped without calling
`finish`. But we don't actually want that behavior anyway; in fact
we had to jump through some hoops with the AbortableWrite hack to skip
those. With the modified version of 'tokio-tar' without that Drop
implementation, we don't need AbortableWrite either.
Co-authored-by: Kirill Bulatov <kirill@neon.tech>
I have experimented with the runner threads number, and looks like 8
threads win us a few seconds.
Bumping the thread count more did not improve the situation much:
* 20 threads were not allowed by pytest
* 16 threads were flacking quite notably
My guess would be that all pageservers, safekeepers, and other nodes we
start occupy quite much of the CPU and other resources to make this
approach more scalable.
I looked at "cargo tree" output and noticed that through various
dependencies, we are depending on both native-tls and rustls. We have
tried to standardize on rustls for everything, but dependencies on
native-tls have crept in recently. One such dependency came from
'reqwest' with default features in pageserver, used for
consumption_metrics. Another dependency was from 'sentry'. Both
'reqwest' and 'sentry' use native-tls by default, but can use 'rustls'
if compiled with the right feature flags.
Send this metric only when it is fully calculated.
Make consumption metrics more stable:
- Send per-timeline metrics only for active timelines.
- Adjust test assertions to make test_metric_collection test more stable.
Refactor update_gc_info function so that it calls the potentially
expensive find_lsn_for_timestamp() function before acquiring the
lock. This will also be needed if we make find_lsn_for_timestamp()
async in the future; it cannot be awaited while holding the lock.
Fixes:
- serialize TenantId and TimelineId as strings,
- skip TimelineId if none
- serialize `metric_type` field as `type`
- add `idempotency_key` field to uniquely identify metrics
Changes are:
* Pageserver: start reading from NEON_AUTH_TOKEN by default.
Warn if ZENITH_AUTH_TOKEN is used instead.
* Compute, Docs: fix the default token name.
* Control plane: change name of the token in configs and start
sequences.
Compatibility:
* Control plane in tests: works, no compatibility expected.
* Control plane for local installations: never officially supported
auth anyways. If someone did enable it, `pageserver.toml` should be updated
with the new `neon.pageserver_connstring` and `neon.safekeeper_token_env`.
* Pageserver is backward compatible: you can run new Pageserver with old
commands and environment configurations, but not vice-versa.
The culprit is the hard-coded `NEON_AUTH_TOKEN`.
* Compute has no code changes. As long as you update its configuration
file with `pageserver_connstring` in sync with the start up scripts,
you are good to go.
* Safekeeper has no code changes and has never used `ZENITH_AUTH_TOKEN` in
the first place.
While @bojanserafimov is still working on best replacement of R-Tree in
layer_map.rs there is obvious pitfall in the current `search` method
implementation: is returns delta layer even if there is image layer if
greater LSN. I think that it should be fixed.
The PR aims to fix two missing redownloads in a flacky
test_remote_storage_upload_queue_retries[local_fs]
([example](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-3190/release/3759194738/index.html#categories/80f1dcdd7c08252126be7e9f44fe84e6/8a70800f7ab13620/))
1. missing redownload during walreceiver work
```
2022-12-22T16:09:51.509891Z ERROR wal_connection_manager{tenant=fb62b97553e40f949de8bdeab7f93563 timeline=4f153bf6a58fd63832f6ee175638d049}: wal receiver task finished with an error: walreceiver connection handling failure
Caused by:
Layer needs downloading
Stack backtrace:
0: pageserver::tenant::timeline::PageReconstructResult<T>::no_ondemand_download
at /__w/neon/neon/pageserver/src/tenant/timeline.rs:467:59
1: pageserver::walingest::WalIngest::new
at /__w/neon/neon/pageserver/src/walingest.rs:61:32
2: pageserver::walreceiver::walreceiver_connection::handle_walreceiver_connection::{{closure}}
at /__w/neon/neon/pageserver/src/walreceiver/walreceiver_connection.rs:178:25
....
```
That looks sad, but inevitable during the current approach: seems that
we need to wait for old layers to arrive in order to accept new data.
For that, `WalIngest::new` now started to return the
`PageReconstructResult`.
Sync methods from `import_datadir.rs` use `WalIngest::new` too, but both
of them import WAL during timeline creation, so no layers to download
are needed there, ergo the `PageReconstructResult` is converted to
`anyhow::Result` with `no_ondemand_download`.
2. missing redownload during compaction work
```
2022-12-22T16:09:51.090296Z ERROR compaction_loop{tenant_id=fb62b97553e40f949de8bdeab7f93563}:compact_timeline{timeline=4f153bf6a58fd63832f6ee175638d049}: could not compact, repartitioning keyspace failed: Layer needs downloading
Stack backtrace:
0: pageserver::tenant::timeline::PageReconstructResult<T>::no_ondemand_download
at /__w/neon/neon/pageserver/src/tenant/timeline.rs:467:59
1: pageserver::pgdatadir_mapping::<impl pageserver::tenant::timeline::Timeline>::collect_keyspace::{{closure}}
at /__w/neon/neon/pageserver/src/pgdatadir_mapping.rs:506:41
<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
pageserver::tenant::timeline::Timeline::repartition::{{closure}}
at /__w/neon/neon/pageserver/src/tenant/timeline.rs:2161:50
<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
2: pageserver::tenant::timeline::Timeline::compact::{{closure}}
at /__w/neon/neon/pageserver/src/tenant/timeline.rs:700:14
<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
3: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
at /github/home/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-0.1.37/src/instrument.rs:272:9
4: pageserver::tenant::Tenant::compaction_iteration::{{closure}}
at /__w/neon/neon/pageserver/src/tenant.rs:1232:85
<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
pageserver::tenant_tasks::compaction_loop::{{closure}}::{{closure}}
at /__w/neon/neon/pageserver/src/tenant_tasks.rs:76:62
<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
pageserver::tenant_tasks::compaction_loop::{{closure}}
at /__w/neon/neon/pageserver/src/tenant_tasks.rs:91:6
```