Commit Graph

1148 Commits

Author SHA1 Message Date
Bojan Serafimov
e392d25828 Improve image_layer_exists docstring 2023-01-11 12:54:13 -05:00
Bojan Serafimov
2951be5386 Improve search docstring 2023-01-11 12:52:15 -05:00
Bojan Serafimov
7c6af6d729 Comments 2023-01-10 21:43:32 -05:00
Bojan Serafimov
cb5b9375d2 Organize modules, rename structs 2023-01-10 21:03:26 -05:00
Bojan Serafimov
d3d17f2c7c Simplify 2023-01-10 17:30:40 -05:00
Bojan Serafimov
c3f5e00ad1 Comments 2023-01-10 15:32:18 -05:00
Bojan Serafimov
d0095d4457 Comments 2023-01-10 15:02:06 -05:00
Bojan Serafimov
7d057e1038 Simplify 2023-01-10 11:52:01 -05:00
Bojan Serafimov
f22437086e Rename file 2023-01-10 11:09:02 -05:00
Bojan Serafimov
7c6909c31f Merge branch 'main' into immutable_bst_layer_map 2023-01-10 11:02:45 -05:00
Christian Schwarz
8eebd5f039 run on-demand compaction in a task_mgr task
With this patch, tenant_detach and timeline_delete's
task_mgr::shutdown_tasks() call will wait for on-demand
compaction to finish.
Before this patch, the on-demand compaction would grab the
layer_removal_cs after tenant_detach / timeline_delete had
removed the timeline directory.
This resulted in error

  No such file or directory (os error 2)

NB: I already implemented this pattern for ondemand GC a while back.

fixes https://github.com/neondatabase/neon/issues/3136
2023-01-09 19:08:22 +01:00
Christian Schwarz
d4d0aa6ed6 gc_iteration_internal: better log message & debug log level if nothing to do
fixes https://github.com/neondatabase/neon/issues/3107
2023-01-09 13:53:59 +01:00
Kirill Bulatov
a457256fef Fix log message matching (#3291)
Spotted
https://neon-github-public-dev.s3.amazonaws.com/reports/main/debug/3871991071/index.html#suites/158be07438eb5188d40b466b6acfaeb3/22966d740e33b677/
failing on `main`, fixes that by using a proper regex match string.

Also removes one clippy lint suppression.
2023-01-09 14:25:12 +02:00
Shany Pozin
7920b39a27 Adding transition reason to the log when a tenant is moved to Broken state (#3289)
#3160
2023-01-09 10:24:50 +02:00
Christian Schwarz
3526323bc4 prepare Timeline::get_reconstruct_data for becoming async (#3271)
This patch restructures the code so that PR
https://github.com/neondatabase/neon/pull/3228 can seamlessly
replace the return PageReconstructResult::NeedsDownload with
a download_remote_layer().await.

Background:

PR https://github.com/neondatabase/neon/pull/3228 will turn
get_reconstruct_data() async and do the on-demand
download right in place, instead of returning a
PageReconstructResult::NeedsDownload.

Current rustc requires that the layers lock guard be not in scope
across an await point.

For on-demand download inside get_reconstruct_data(), we need
to do download_remote_layer().await.

Supersedes https://github.com/neondatabase/neon/pull/3260

See my comment there:
https://github.com/neondatabase/neon/pull/3260#issuecomment-1370752407

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2023-01-06 19:42:25 +02:00
Bojan Serafimov
a2642966f2 Add arbitrary key partitioning for benchmark 2023-01-05 20:21:19 -05:00
Christian Schwarz
d7f1e30112 remote_timeline_client: more metrics & metrics-related cleanups
- Clean up redundant metric removal in TimelineMetrics::drop.
RemoteTimelineClientMetrics is responsible for cleaning up
REMOTE_OPERATION_TIME andREMOTE_UPLOAD_QUEUE_UNFINISHED_TASKS.

- Rename `pageserver_remote_upload_queue_unfinished_tasks` to
`pageserver_remote_timeline_client_calls_unfinished`. The new name
reflects that the metric is with respect to the entire call to remote
timeline client. This includes wait time in the upload queue and hence
it's a longer span than what `pageserver_remote_OPERATION_seconds`
measures.

- Add the `pageserver_remote_timeline_client_calls_started` histogram.
See the metric description for why we need it.

- Add helper functions `call_begin` etc to `RemoteTimelineClientMetrics`
to centralize the logic for updating the metrics above (they relate to
each other, see comments in code).

- Use these constructs to track ongoing downloads in
`pageserver_remote_timeline_client_calls_unfinished`

refs https://github.com/neondatabase/neon/issues/2029
fixes https://github.com/neondatabase/neon/issues/3249
closes https://github.com/neondatabase/neon/pull/3250
2023-01-05 11:50:17 +01:00
Christian Schwarz
6a9d1030a6 use RemoteTimelineClient for downloading index part during tenant_attach
Before this change, we would not .measure_remote_op for index part
downloads.

And more generally, it's good to pass not just uploads but also
downloads through RemoteTimelineClient, e.g., if we ever want to
implement some timeline-scoped policies there.

Found this while working on https://github.com/neondatabase/neon/pull/3250
where I add a metric to measure the degree of concurrent downloads.
Layer download was missing in a test that I added there.
2023-01-05 11:08:50 +01:00
Bojan Serafimov
3d6bc126ed Partially implement get_difficulty_map bench 2023-01-05 03:23:10 -05:00
Bojan Serafimov
fb6569c880 Implement get_difficulty_map 2023-01-05 02:50:48 -05:00
Bojan Serafimov
115549261c Add get_difficulty_map method 2023-01-05 01:58:18 -05:00
Bojan Serafimov
01e09fc56c Cleanup bench_layer_map 2023-01-05 01:35:36 -05:00
Heikki Linnakangas
8c6e607327 Refactor send_tarball() (#3259)
The Basebackup struct is really just a convenient place to carry the
various parameters around in send_tarball and its subroutines. Make it
internal to the send_tarball function.
2023-01-04 23:03:16 +02:00
Bojan Serafimov
8618f1d225 Remove DummyImage and DummyDelta 2023-01-03 23:54:14 -05:00
Bojan Serafimov
6f374d24be Merge branch 'main' into immutable_bst_layer_map 2023-01-03 18:02:20 -05:00
Kirill Bulatov
10dae79c6d Tone down safekeeper and pageserver walreceiver errors (#3227)
Closes https://github.com/neondatabase/neon/issues/3114

Adds more typization into errors that appear during protocol messages (`FeMessage`), postgres and walreceiver connections.

Socket IO errors are now better detected and logged with lesser (INFO, DEBUG) error level, without traces that they were logged before, when they were wrapped in anyhow context.
2023-01-03 20:42:04 +00:00
Heikki Linnakangas
e9583db73b Remove code and test to generate flamegraph on GetPage requests. (#3257)
It was nice to have and useful at the time, but unfortunately the method
used to gather the profiling data doesn't play nicely with 'async'. PR
#3228 will turn 'get_page_at_lsn' function async, which will break the
profiling support. Let's remove it, and re-introduce some kind of
profiling later, using some different method, if we feel like we need it
again.
2023-01-03 20:11:32 +02:00
Vadim Kharitonov
0b428f7c41 Enable licenses check for 3rd-parties 2023-01-03 15:11:50 +01:00
Heikki Linnakangas
8b692e131b Enable on-demand download in WalIngest. (#3233)
Makes the top-level functions in WalIngest async, and replaces
no_ondemand_download calls with with_ondemand_download.

This hopefully fixes the problem reported in issue #3230, although I
don't have a self-contained test case for it.
2023-01-03 14:44:42 +02:00
Heikki Linnakangas
0a0e55c3d0 Replace 'tar' crate with 'tokio-tar' (#3202)
The synchronous 'tar' crate has required us to use block_in_place and
SyncIoBridge to work together with the async I/O in the client
connection. Switch to 'tokio-tar' crate that uses async I/O natively.

As part of this, move the CopyDataWriter implementation to
postgres_backend_async.rs. Even though it's only used in one place
currently, it's in principle generally applicable whenever you want to
use COPY out.

Unfortunately we cannot use the 'tokio-tar' as it is: the Builder
implementation requires the writer to have 'static lifetime. So we
have to use a modified version without that requirement. The 'static
lifetime was required just for the Drop implementation that writes
the end-of-archive sections if the Builder is dropped without calling
`finish`. But we don't actually want that behavior anyway; in fact
we had to jump through some hoops with the AbortableWrite hack to skip
those. With the modified version of 'tokio-tar' without that Drop
implementation, we don't need AbortableWrite either.

Co-authored-by: Kirill Bulatov <kirill@neon.tech>
2023-01-03 12:39:11 +02:00
Shany Pozin
182dc785d6 Set PITR default to 7 days (#3245)
https://github.com/neondatabase/cloud/issues/3406
2023-01-02 18:05:23 +02:00
Heikki Linnakangas
81afd7011c Use rustls for everything.
I looked at "cargo tree" output and noticed that through various
dependencies, we are depending on both native-tls and rustls. We have
tried to standardize on rustls for everything, but dependencies on
native-tls have crept in recently. One such dependency came from
'reqwest' with default features in pageserver, used for
consumption_metrics. Another dependency was from 'sentry'. Both
'reqwest' and 'sentry' use native-tls by default, but can use 'rustls'
if compiled with the right feature flags.
2023-01-02 11:14:35 +02:00
Egor Suvorov
9f94d098aa Remove unused AuthType::MD5 2022-12-31 02:27:08 +03:00
Bojan Serafimov
f0a23b474c Fix a count_deltas bug 2022-12-29 14:30:04 -05:00
Anastasia Lubennikova
8ff7bc5df1 Add timleline_logical_size metric.
Send this metric only when it is fully calculated.

Make consumption metrics more stable:
- Send per-timeline metrics only for active timelines.
- Adjust test assertions to make test_metric_collection test more stable.
2022-12-29 19:13:54 +02:00
Heikki Linnakangas
890ff3803e Allow update_gc_info to download files on-demand. 2022-12-29 16:23:25 +02:00
Heikki Linnakangas
fefe19a284 Avoid calling find_lsn_for_timestamp call while holding lock.
Refactor update_gc_info function so that it calls the potentially
expensive find_lsn_for_timestamp() function before acquiring the
lock. This will also be needed if we make find_lsn_for_timestamp()
async in the future; it cannot be awaited while holding the lock.
2022-12-29 16:23:25 +02:00
Anastasia Lubennikova
894ac30734 Rename billing_metrics to consumption_metrics.
Use more appropriate term, because not all of these metrics are used for billing.
2022-12-29 14:25:47 +02:00
Shany Pozin
c0290467fa Fix #2907 Remove missing_layers from IndexPart (#3217)
#2907
2022-12-29 12:33:30 +02:00
Konstantin Knizhnik
0e7c03370e Lazy calculation of traversal_id which is needed only for error repoting (#3221)
See 
https://neondb.slack.com/archives/C0277TKAJCA/p1672245908989789
and 
https://neondb.slack.com/archives/C033RQ5SPDH/p1671885245981359
2022-12-29 12:20:28 +02:00
Anastasia Lubennikova
f731e9b3de Fix serialization of billing metrics (#3215)
Fixes:
- serialize TenantId and TimelineId as strings,
- skip TimelineId if none
- serialize `metric_type` field as `type`
- add `idempotency_key` field to uniquely identify metrics
2022-12-29 12:11:04 +02:00
Dmitry Rodionov
bd7a9e6274 switch to debug from info to produce less noise 2022-12-29 13:01:17 +03:00
Bojan Serafimov
6bfa99e0c4 Fix infinite recursion 2022-12-28 21:30:35 -05:00
Egor Suvorov
42c6ddef8e Rename ZENITH_AUTH_TOKEN to NEON_AUTH_TOKEN
Changes are:

* Pageserver: start reading from NEON_AUTH_TOKEN by default.
  Warn if ZENITH_AUTH_TOKEN is used instead.
* Compute, Docs: fix the default token name.
* Control plane: change name of the token in configs and start
  sequences.

Compatibility:

* Control plane in tests: works, no compatibility expected.
* Control plane for local installations: never officially supported
  auth anyways. If someone did enable it, `pageserver.toml` should be updated
  with the new `neon.pageserver_connstring` and `neon.safekeeper_token_env`.
* Pageserver is backward compatible: you can run new Pageserver with old
  commands and environment configurations, but not vice-versa.
  The culprit is the hard-coded `NEON_AUTH_TOKEN`.
* Compute has no code changes. As long as you update its configuration
  file with `pageserver_connstring` in sync with the start up scripts,
  you are good to go.
* Safekeeper has no code changes and has never used `ZENITH_AUTH_TOKEN` in
  the first place.
2022-12-29 03:33:43 +03:00
Bojan Serafimov
bfeaee90ca implement image_layer_exists (not tested) 2022-12-28 19:22:10 -05:00
Bojan Serafimov
d835d8c853 Implement count_deltas (untested, probably buggy) 2022-12-28 17:51:06 -05:00
Bojan Serafimov
3b40d7b61a Big refactor (WIP) 2022-12-28 14:12:41 -05:00
Shany Pozin
172c7e5f92 Split upload queue code from storage_sync.rs (#3216)
https://github.com/neondatabase/neon/issues/3208
2022-12-28 15:12:06 +02:00
Shany Pozin
0c7b02ebc3 Move tenant related files to tenant directory (#3214)
Related to https://github.com/neondatabase/neon/issues/3208
2022-12-28 09:20:01 +02:00
Bojan Serafimov
2737e33565 Implement image_coverage (untested) 2022-12-27 15:10:56 -05:00