Commit Graph

179 Commits

Author SHA1 Message Date
Kirill Bulatov
ebea298415 Update most of the dependencies to their latest versions (#4026)
See https://github.com/neondatabase/neon/pull/3991

Brings the changes back with the right way to use new `toml_edit` to
deserialize values and a unit test for this.

All non-trivial updates extracted into separate commits, also `carho hakari` data and its manifest format were updated.

3 sets of crates remain unupdated:

* `base64` — touches proxy in a lot of places and changed its api (by 0.21 version) quite strongly since our version (0.13).
* `opentelemetry` and `opentelemetry-*` crates

```
error[E0308]: mismatched types
  --> libs/tracing-utils/src/http.rs:65:21
   |
65 |     span.set_parent(parent_ctx);
   |          ---------- ^^^^^^^^^^ expected struct `opentelemetry_api::context::Context`, found struct `opentelemetry::Context`
   |          |
   |          arguments to this method are incorrect
   |
   = note: struct `opentelemetry::Context` and struct `opentelemetry_api::context::Context` have similar names, but are actually distinct types
note: struct `opentelemetry::Context` is defined in crate `opentelemetry_api`
  --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/opentelemetry_api-0.19.0/src/context.rs:77:1
   |
77 | pub struct Context {
   | ^^^^^^^^^^^^^^^^^^
note: struct `opentelemetry_api::context::Context` is defined in crate `opentelemetry_api`
  --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/opentelemetry_api-0.18.0/src/context.rs:77:1
   |
77 | pub struct Context {
   | ^^^^^^^^^^^^^^^^^^
   = note: perhaps two different versions of crate `opentelemetry_api` are being used?
note: associated function defined here
  --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-opentelemetry-0.18.0/src/span_ext.rs:43:8
   |
43 |     fn set_parent(&self, cx: Context);
   |        ^^^^^^^^^^

For more information about this error, try `rustc --explain E0308`.
error: could not compile `tracing-utils` due to previous error
warning: build failed, waiting for other jobs to finish...
error: could not compile `tracing-utils` due to previous error
```

`tracing-opentelemetry` of version `0.19` is not yet released, that is supposed to have the update we need.

* similarly, `rustls`, `tokio-rustls`, `rustls-*` and `tls-listener` crates have similar issue:

```
error[E0308]: mismatched types
   --> libs/postgres_backend/tests/simple_select.rs:112:78
    |
112 |     let mut make_tls_connect = tokio_postgres_rustls::MakeRustlsConnect::new(client_cfg);
    |                                --------------------------------------------- ^^^^^^^^^^ expected struct `rustls::client::client_conn::ClientConfig`, found struct `ClientConfig`
    |                                |
    |                                arguments to this function are incorrect
    |
    = note: struct `ClientConfig` and struct `rustls::client::client_conn::ClientConfig` have similar names, but are actually distinct types
note: struct `ClientConfig` is defined in crate `rustls`
   --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/rustls-0.21.0/src/client/client_conn.rs:125:1
    |
125 | pub struct ClientConfig {
    | ^^^^^^^^^^^^^^^^^^^^^^^
note: struct `rustls::client::client_conn::ClientConfig` is defined in crate `rustls`
   --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/rustls-0.20.8/src/client/client_conn.rs:91:1
    |
91  | pub struct ClientConfig {
    | ^^^^^^^^^^^^^^^^^^^^^^^
    = note: perhaps two different versions of crate `rustls` are being used?
note: associated function defined here
   --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-postgres-rustls-0.9.0/src/lib.rs:23:12
    |
23  |     pub fn new(config: ClientConfig) -> Self {
    |            ^^^

For more information about this error, try `rustc --explain E0308`.
error: could not compile `postgres_backend` due to previous error
warning: build failed, waiting for other jobs to finish...
```

* aws crates: I could not make new API to work with bucket endpoint overload, and console e2e tests failed.
Other our tests passed, further investigation is worth to be done in https://github.com/neondatabase/neon/issues/4008
2023-04-14 18:28:54 +03:00
Christian Schwarz
8895f28dae make evictions_low_residence_duration_metric_threshold per-tenant (#3949)
Before this patch, if a tenant would override its eviction_policy
setting to use a lower LayerAccessThreshold::threshold than the
`evictions_low_residence_duration_metric_threshold`, the evictions done
for that tenant would count towards the
`evictions_with_low_residence_duration` metric.

That metric is used to identify pre-mature evictions, commonly triggered
by disk-usage-based eviction under disk pressure.

We don't want that to happen for the legitimate evictions of the tenant
that overrides its eviction_policy.

So, this patch
- moves the setting into TenantConf
- adds test coverage
- updates the staging & prod yamls

Forward Compatibility:
Software before this patch will ignore the new tenant conf field and use
the global one instead.
So we can roll back safely.

Backward Compatibility:
Parsing old configs with software as of this patch will fail in
`PageServerConf::parse_and_validate` with error 
`unrecognized pageserver option 'evictions_low_residence_duration_metric_threshold'`
if the option is still present in the global section.
We deal with this by updating the configs in Ansible.

fixes https://github.com/neondatabase/neon/issues/3940
2023-04-14 13:25:45 +03:00
Dmitry Rodionov
15d1f85552 Add reason to TenantState::Broken (#3954)
Reason and backtrace are added to the Broken state. Backtrace is automatically collected when tenant entered the broken state. The format for API, CLI and metrics is changed and unified to return tenant state name in camel case. Previously snake case was used for metrics and camel case was used for everything else. Now tenant state field in TenantInfo swagger spec is changed to contain state name in "slug" field and other fields (currently only reason and backtrace for Broken variant in "data" field). To allow for this breaking change state was removed from TenantInfo swagger spec because it was not used anywhere.

Please note that the tenant's broken reason is not persisted on disk so the reason is lost when pageserver is restarted.

Requires changes to grafana dashboard that monitors tenant states.

Closes #3001

---------

Co-authored-by: theirix <theirix@gmail.com>
2023-04-13 12:11:43 +03:00
Konstantin Knizhnik
732acc54c1 Add check for duplicates of generated image layers (#3869)
## Describe your changes

## Issue ticket number and link

#3673

## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

---------

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2023-04-13 10:19:34 +03:00
Kirill Bulatov
f7995b3c70 Revert "Update most of the dependencies to their latest versions (#3991)" (#4013)
This reverts commit a64044a7a9.

See https://neondb.slack.com/archives/C03H1K0PGKH/p1681306682795559
2023-04-12 14:51:59 +00:00
Kirill Bulatov
a64044a7a9 Update most of the dependencies to their latest versions (#3991)
All non-trivial updates extracted into separate commits, also `carho
hakari` data and its manifest format were updated.

3 sets of crates remain unupdated:

* `base64` — touches proxy in a lot of places and changed its api (by
0.21 version) quite strongly since our version (0.13).
* `opentelemetry` and `opentelemetry-*` crates

```
error[E0308]: mismatched types
  --> libs/tracing-utils/src/http.rs:65:21
   |
65 |     span.set_parent(parent_ctx);
   |          ---------- ^^^^^^^^^^ expected struct `opentelemetry_api::context::Context`, found struct `opentelemetry::Context`
   |          |
   |          arguments to this method are incorrect
   |
   = note: struct `opentelemetry::Context` and struct `opentelemetry_api::context::Context` have similar names, but are actually distinct types
note: struct `opentelemetry::Context` is defined in crate `opentelemetry_api`
  --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/opentelemetry_api-0.19.0/src/context.rs:77:1
   |
77 | pub struct Context {
   | ^^^^^^^^^^^^^^^^^^
note: struct `opentelemetry_api::context::Context` is defined in crate `opentelemetry_api`
  --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/opentelemetry_api-0.18.0/src/context.rs:77:1
   |
77 | pub struct Context {
   | ^^^^^^^^^^^^^^^^^^
   = note: perhaps two different versions of crate `opentelemetry_api` are being used?
note: associated function defined here
  --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-opentelemetry-0.18.0/src/span_ext.rs:43:8
   |
43 |     fn set_parent(&self, cx: Context);
   |        ^^^^^^^^^^

For more information about this error, try `rustc --explain E0308`.
error: could not compile `tracing-utils` due to previous error
warning: build failed, waiting for other jobs to finish...
error: could not compile `tracing-utils` due to previous error
```

`tracing-opentelemetry` of version `0.19` is not yet released, that is
supposed to have the update we need.

* similarly, `rustls`, `tokio-rustls`, `rustls-*` and `tls-listener`
crates have similar issue:

```
error[E0308]: mismatched types
   --> libs/postgres_backend/tests/simple_select.rs:112:78
    |
112 |     let mut make_tls_connect = tokio_postgres_rustls::MakeRustlsConnect::new(client_cfg);
    |                                --------------------------------------------- ^^^^^^^^^^ expected struct `rustls::client::client_conn::ClientConfig`, found struct `ClientConfig`
    |                                |
    |                                arguments to this function are incorrect
    |
    = note: struct `ClientConfig` and struct `rustls::client::client_conn::ClientConfig` have similar names, but are actually distinct types
note: struct `ClientConfig` is defined in crate `rustls`
   --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/rustls-0.21.0/src/client/client_conn.rs:125:1
    |
125 | pub struct ClientConfig {
    | ^^^^^^^^^^^^^^^^^^^^^^^
note: struct `rustls::client::client_conn::ClientConfig` is defined in crate `rustls`
   --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/rustls-0.20.8/src/client/client_conn.rs:91:1
    |
91  | pub struct ClientConfig {
    | ^^^^^^^^^^^^^^^^^^^^^^^
    = note: perhaps two different versions of crate `rustls` are being used?
note: associated function defined here
   --> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-postgres-rustls-0.9.0/src/lib.rs:23:12
    |
23  |     pub fn new(config: ClientConfig) -> Self {
    |            ^^^

For more information about this error, try `rustc --explain E0308`.
error: could not compile `postgres_backend` due to previous error
warning: build failed, waiting for other jobs to finish...
```

* aws crates: I could not make new API to work with bucket endpoint
overload, and console e2e tests failed.
Other our tests passed, further investigation is worth to be done in
https://github.com/neondatabase/neon/issues/4008
2023-04-12 15:32:38 +03:00
Kirill Bulatov
d8939d4162 Move walreceiver start and stop behind a struct (#3973)
The PR changes module function-based walreceiver interface with a
`WalReceiver` struct that exposes a few public methods, `new`, `start`
and `stop` now.

Later, the same struct is planned to be used for getting walreceiver
stats (and, maybe, other extra data) to display during missing wal
errors for https://github.com/neondatabase/neon/issues/2106

Now though, the change required extra logic changes:

* due to the `WalReceiver` struct added, it became easier to pass `ctx`
and later do a `detached_child` instead of

bfee412701/pageserver/src/tenant/timeline.rs (L1379-L1381)

* `WalReceiver::start` which is now the public API to start the
walreceiver, could return an `Err` which now may turn a tenant into
`Broken`, same as the timeline that it tries to load during startup.

* `WalReceiverConf` was added to group walreceiver parameters from
pageserver's tenant config
2023-04-12 12:39:02 +03:00
Joonas Koivunen
a415670bc3 feat: log evictions (#3930)
this will help log analysis with the counterpart of already logging all
remote download needs and downloads. ended up with a easily regexable
output in the final round.
2023-04-03 14:15:41 +03:00
Arseny Sher
d733bc54b8 Rename ReplicationFeedback and its fields.
This is the the feedback originating from pageserver, so change previous
confusing names to
s/ReplicationFeedback/PageserverFeedback
s/ps_writelsn/last_receive_lsn
s/ps_flushlsn/disk_consistent_lsn
s/ps_apply_lsn/remote_consistent_lsn

I haven't changed on the wire format to keep compatibility. However,
understanding of new field names is added to compute, so once all computes
receive this patch we can change the wire names as well. Safekeepers/pageservers
are deployed roughly at the same time and it is ok to live without feedbacks
during the short period, so this is not a problem there.
2023-04-03 01:52:41 +04:00
Arthur Petukhovsky
814abd9f84 Switch to safekeeper in the same AZ (#3883)
Add a condition to switch walreceiver connection to safekeeper that is
located in the same availability zone. Switch happens when commit_lsn of
a candidate is not less than commit_lsn from the active connection. This
condition is expected not to trigger instantly, because commit_lsn of a
current connection is usually greater than commit_lsn of updates from
the broker. That means that if WAL is written continuously, switch can
take a lot of time, but it should happen eventually.

Now protoc 3.15+ is required for building neon.

Fixes https://github.com/neondatabase/neon/issues/3200
2023-04-02 11:32:27 +03:00
Christian Schwarz
a64dd3ecb5 disk-usage-based layer eviction (#3809)
This patch adds a pageserver-global background loop that evicts layers
in response to a shortage of available bytes in the $repo/tenants
directory's filesystem.

The loop runs periodically at a configurable `period`.

Each loop iteration uses `statvfs` to determine filesystem-level space
usage. It compares the returned usage data against two different types
of thresholds. The iteration tries to evict layers until app-internal
accounting says we should be below the thresholds. We cross-check this
internal accounting with the real world by making another `statvfs` at
the end of the iteration. We're good if that second statvfs shows that
we're _actually_ below the configured thresholds. If we're still above
one or more thresholds, we emit a warning log message, leaving it to the
operator to investigate further.

There are two thresholds:
- `max_usage_pct` is the relative available space, expressed in percent
of the total filesystem space. If the actual usage is higher, the
threshold is exceeded.
- `min_avail_bytes` is the absolute available space in bytes. If the
actual usage is lower, the threshold is exceeded.

The iteration evicts layers in LRU fashion with a reservation of up to
`tenant_min_resident_size` bytes of the most recent layers per tenant.
The layers not part of the per-tenant reservation are evicted
least-recently-used first until we're below all thresholds. The
`tenant_min_resident_size` can be overridden per tenant as
`min_resident_size_override` (bytes).

In addition to the loop, there is also an HTTP endpoint to perform one
loop iteration synchronous to the request. The endpoint takes an
absolute number of bytes that the iteration needs to evict before
pressure is relieved. The tests use this endpoint, which is a great
simplification over setting up loopback-mounts in the tests, which would
be required to test the statvfs part of the implementation. We will rely
on manual testing in staging to test the statvfs parts.

The HTTP endpoint is also handy in emergencies where an operator wants
the pageserver to evict a given amount of space _now. Hence, it's
arguments documented in openapi_spec.yml. The response type isn't
documented though because we don't consider it stable. The endpoint
should _not_ be used by Console but it could be used by on-call.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2023-03-31 14:47:57 +03:00
Christian Schwarz
fa54a57ca2 random_init_delay: remove the minimum of 10 seconds (#3914)
Before this patch, the range from which the random delay is picked is at
minimum 10 seconds.
With this patch, they delay is bounded to whatever the given `period`
is, and zero, if period id Duration::ZERO.

Motivation for this: the disk usage eviction tests that we'll add in
https://github.com/neondatabase/neon/pull/3905 need to wait for the disk
usage eviction background loop to do its job.
They set a period of 1s.
It seems wasteful to wait 10 seconds in the tests.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-03-30 18:38:45 +02:00
Kirill Bulatov
ac9c7e8c4a Replace pin! from tokio to the std one (#3903)
With fresh rustc brought by
https://github.com/neondatabase/neon/pull/3902, we can use
`std::pin::pin!` macro instead of the tokio one.
One place did not need the macro at all, other places were adjusted.
2023-03-29 14:14:56 +03:00
Joonas Koivunen
f14895b48e eviction: avoid post-restart download by synthetic_size (#3871)
As of #3867, we do artificial layer accesses to layers that will be
needed after the next restart, but not until then because of caches.

With this patch, we also do that for the accesses that the synthetic
size calculation worker does if consumption metrics are enabled.

The actual size calculation is not of importance, but we need to
calculate all of the sizes, so we only call tenant::size::gather_inputs.

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-03-27 19:20:23 +02:00
Christian Schwarz
fe15624570 eviction_task: only refresh layer accesses once per p.threshold (#3877)
Without this, we run it every p.period, which can be quite low. For
example, the running experiment with 3000 tenants in prod uses a period
of 1 minute.

Doing it once per p.threshold is enough to prevent eviction.
2023-03-27 14:33:40 +03:00
Christian Schwarz
ff51e96fbd fix synthetic size for (last_record_lsn - gc_horizon) < initdb_lsn (#3874)
fix synthetic size for (last_record_lsn - gc_horizon) < initdb_lsn

Assume a single-timeline project.
If the gc_horizon covers all WAL (last_record_lsn < gc_horizon)
but we have written more data than just initdb, the synthetic
size calculation worker needs to calculate the logical size
at LSN initdb_lsn (Segment BranchStart).

Before this patch, that calculation would incorrectly return
the initial logical size calculation result that we cache in
the Timeline::initial_logical_size. Presumably, because there
was confusion around initdb_lsn vs. initial size calculation.

The fix is to only hand out the initialized_size() only if
the LSN matches.

The distinction in the metrics between "init logical size" and "logical
size" was also incorrect because of the above. So, remove it.

There was a special case for `size != 0`. This was to cover the case of
LogicalSize::empty_initial(), but `initial_part_end` is `None` in that
case, so the new `LogicalSize::initialized_size()` will return None
in that case as well.

Lastly, to prevent confusion like this in the future, rename all
occurrences of `init_lsn` to either just `lsn` or a more specific name.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2023-03-27 12:45:10 +02:00
Dmitry Rodionov
870ba43a1f return proper http codes in timeline delete endpoint (#3876)
return proper http codes in timeline delete endpoint
+ fix openapi spec for detach to include 404 responses
2023-03-24 19:25:39 +02:00
Kirill Bulatov
8bd565e09e Ensure branches with no layers have their remote storage counterpart created eventually (#3857)
Discovered during writing a test for
https://github.com/neondatabase/neon/pull/3843
2023-03-22 17:42:31 +02:00
Joonas Koivunen
6033dfdf4a Re-access layers before threshold eviction (#3867)
To avoid re-downloading evicted files on restart, re-compute logical
size and partitioning before each threshold based eviction run.

Cc: #3802

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-03-22 16:26:27 +02:00
Shany Pozin
0f7de84785 Allow calling detach on ignored tenant (#3834)
## Describe your changes
Added a query param to detach API
Allow to remove local state of a tenant even if its not in the memory
(following ignore API)
## Issue ticket number and link
#3828
## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

---------

Co-authored-by: Kirill Bulatov <kirill@neon.tech>
2023-03-22 07:17:00 +00:00
Kirill Bulatov
dd22c87100 Remove older layer metadata format support code (#3854)
The PR enforces current newest `index_part.json` format in the type
system (version `1`), not allowing any previous forms of it, that were
used in the past.
Similarly, the code to mitigate the
https://github.com/neondatabase/neon/issues/3024 issue is now also
removed.

Current code does not produce old formats and extra files in the
index_part.json, in the future we will be able to use
https://github.com/neondatabase/aversion or other approach to make
version transitions more explicit.

See https://neondb.slack.com/archives/C033RQ5SPDH/p1679134185248119 for
the justification on the breaking changes.
2023-03-21 23:33:28 +02:00
Christian Schwarz
881356c417 add metrics to detect eviction-induced thrashing (#3837)
This patch adds two metrics that will enable us to detect *thrashing* of
layers, i.e., repetitions of `eviction, on-demand-download, eviction,
... ` for a given layer.

The first metric counts all layer evictions per timeline. It requires no
further explanation. The second metric counts the layer evictions where
the layer was resident for less than a given threshold.

We can alert on increments to the second metric. The first metric will
serve as a baseline, and further, it's generally interesting, outside of
thrashing.

The second metric's threshold is configurable in PageServerConf and
defaults to 24h. The threshold value is reproduced as a label in the
metric because the counter's value is semantically tied to that
threshold. Since changes to the config and hence the label value are
infrequent, this will have low storage overhead in the metrics storage.

The data source to determine the time that the layer was resident is the
file's `mtime`. Using `mtime` is more of a crutch. It would be better if
Pageserver did its own persistent bookkeeping of residence change events
instead of relying on the filesystem. We had some discussion about this:
https://github.com/neondatabase/neon/pull/3809#issuecomment-1470448900

My position is that `mtime` is good enough for now. It can theoretically
jump forward if someone copies files without resetting `mtime`. But that
shouldn't happen in practice. Note that moving files back and forth
doesn't change `mtime`, nor does `chown` or `chmod`. Lastly, `rsync -a`,
which is typically used for filesystem-level backup / restore, correctly
syncs `mtime`.

I've added a label that identifies the data source to keep options open
for a future, better data source than `mtime`. Since this value will
stay the same for the time being, it's not a problem for metrics
storage.

refs https://github.com/neondatabase/neon/issues/3728
2023-03-20 16:11:36 +01:00
Shany Pozin
93f3f4ab5f Return NotFound in mgmt API requests when tenant is not present in the pageserver (#3818)
## Describe your changes
Add Error enum for tenant state response to allow better error handling
in mgmt api
## Issue ticket number and link
#2238
## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
2023-03-19 10:44:42 +02:00
Christian Schwarz
b917270c67 remove unused TenantConfig::update function 2023-03-16 16:25:35 +01:00
Arthur Petukhovsky
b067378d0d Measure cross-AZ traffic in safekeepers (#3806)
Create `safekeeper_pg_io_bytes_total` metric to track total amount of
bytes written/read in a postgres connections to safekeepers. This metric
has the following labels:
- `client_az` – availability zone of the connection initiator, or
`"unknown"`
- `sk_az` – availability zone of the safekeeper, or `"unknown"`
- `app_name` – `application_name` of the postgres client
- `dir` – data direction, either `"read"` or `"write"`
- `same_az` – `"true"`, `"false"` or `"unknown"`. Can be derived from
`client_az` and `sk_az`, exists purely for convenience.

This is implemented by passing availability zone in the connection
string, like this: `-c tenant_id=AAA timeline_id=BBB
availability-zone=AZ-1`.

Update ansible deployment scripts to add availability_zone argument
to safekeeper and pageserver in systemd service files.
2023-03-16 17:24:01 +03:00
Konstantin Knizhnik
5396273541 Avoid holes between generated image layers (#3771)
## Describe your changes

When we perform partitioning of the whole key space, we take in account
actual ranges of relation present in the database. So if we have
relation with relid=1 and size 100 and relation with relid=2 with size
200 then result of KeySpace::partition may contain partitions
<100000000..100000099> and <200000000..200000199>. Generated image
layers will contain the same boundaries.
But when GC is checking image coverage to find out of old layer is fully
covered by newer image layer and so can be deleted, it takes in account
only full key range. I.e. if there is delta layer <100000000..300000000>
then it never be garbage collected because image layers
<100000000..100000099> and <200000000..200000199> are not completely
covering it.This is how it looks in practice:


000000067F000032AC00000A300000000000-000000067F000032AC00000A330000000000__000000000F761828

000000067F000032AC00000A31000000001F-000000067F000032AC00000A620000000005__0000000001696070-000000000442A551

000000067F000032AC00000A3300FFFFFFFF-000000067F000032AC00000A650100000000__000000000F761828

So there are two image layers covering delta layer but ... there is a
hole: A330000000000...A3300FFFFFFFF and as a result delta layer is not
collected.

## Issue ticket number and link

This PR is deeply related with #3673 because it is addressing the same
problem: old layers are not utilized by GC.
The test test_gc_old_layers.py in #3673 can be used to see effect of
this patch.

## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-03-14 16:15:07 +02:00
Joonas Koivunen
c23c8946a3 chore: clippies introduced with rust 1.68 (#3781)
- handle automatically fixable future clippies
- tune run-clippy.sh to remove macos specifics which we no longer have

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2023-03-14 15:29:02 +02:00
Joonas Koivunen
d6bb8caad4 refactor: correct return value for not found L0's on LayerMap::replace (#3805)
in prev implementation, an `ok_or_else(...)?` is used to cause a
"precondition error" on LayerMap::replace, however we only see this
particular error if an L0 for which replace fails is not in the layermap
because it is not in `l0_delta_layers`. changes or fixes this to be
Replacement::NotFound instead, making it more clear that an error would
only be raised for actual preconditions, like trying to replace layer
with completly unrelated layer.
2023-03-14 13:18:26 +02:00
Konstantin Knizhnik
f0573f5991 Remove block cursor cache (#3740)
## Describe your changes

Do not pin current block in BlockCursor

## Issue ticket number and link

See #3712 

There are places (see get_reconstruct_data) in our code when thread is
holding read layers lock and then try to read file and so lock page
cache slot. So we have edge in dependency graph layers->page cache slot.
At the same time (as Christian noticed) we can lock page cache slot in
BlockCursor and then try obtain shard lock on layers.
So there is backward edge in dependency graph page cache slot>layers
which forms loop and may cause deadlock.

There are three possible fixes of the problem:
1. Perform compaction under `layers` shared lock. See PR #3732. It fixes
the problem but make it not possible to append any data to pageserver
until compaction is completed.
2. Do not hold `layers` lock while accessing layers (not sure if it is
possible to do because it definitely introduce some new race
conditions).
3. Do not pin current pages in BockCursor (this PR).

My experiments shows that this cache in BlockCursor is not so useful:
the number of hits/misses for cursor cache on pgbench workload (-i -s
10/-c 10 -T 100/-c 10 -S -T 100):
```
hits: 163011
misses: 1023602
```
So number of cache misses is 10x times  larger.
And results for read-only pgbench are mostly the same:
```
with cache: 14581
w/out cache: 14429
```
 
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
2023-03-13 14:07:35 +02:00
Joonas Koivunen
8699342249 Ondemand rx bytes and layer count (#3777)
Adds two new *global* metrics:
- pageserver_remote_ondemand_downloaded_layers_total
- pageserver_remote_ondemand_downloaded_bytes_total

An existing test is repurposed once more to check that we do get some
reasonable counts. These are to replace guessing from the nic RX bytes
metric how much was on-demand downloaded.

First part of #3745: This does not add the "(un)?avoidable" metric,
which I plan to add as a new metric, which will be a subset of the
counts of the metrics added here.
2023-03-13 09:26:49 +02:00
Arseny Sher
b80fe41af3 Refactor postgres protocol parsing.
1) Remove allocation and data copy during each message read. Instead, parsing
functions now accept BytesMut from which data they form messages, with
pointers (e.g. in CopyData) pointing directly into BytesMut buffer. Accordingly,
move ConnectionError containing IO error subtype into framed.rs providing this
and leave in pq_proto only ProtocolError.

2) Remove anyhow from pq_proto.

3) Move FeStartupPacket out of FeMessage. Now FeStartupPacket::parse returns it
directly, eliminating dead code where user wants startup packet but has to match
for others.

proxy stream.rs is adapted to framed.rs with minimal changes. It also benefits
from framed.rs improvements described above.
2023-03-09 20:45:56 +03:00
Arseny Sher
0d8ced8534 Remove sync postgres_backend, tidy up its split usage.
- Add support for splitting async postgres_backend into read and write halfes.
  Safekeeper needs this for bidirectional streams. To this end, encapsulate
  reading-writing postgres messages to framed.rs with split support without any
  additional changes (relying on BufRead for reading and BytesMut out buffer for
  writing).
- Use async postgres_backend throughout safekeeper (and in proxy auth link
  part).
- In both safekeeper COPY streams, do read-write from the same thread/task with
  select! for easier error handling.
- Tidy up finishing CopyBoth streams in safekeeper sending and receiving WAL
  -- join split parts back catching errors from them before returning.

Initially I hoped to do that read-write without split at all, through polling
IO:
https://github.com/neondatabase/neon/pull/3522
However that turned out to be more complicated than I initially expected
due to 1) borrow checking and 2) anon Future types. 1) required Rc<Refcell<...>>
which is Send construct just to satisfy the checker; 2) can be workaround with
transmute. But this is so messy that I decided to leave split.
2023-03-09 20:45:56 +03:00
Arseny Sher
7627d85345 Move async postgres_backend to its own crate.
To untie cyclic dependency between sync and async versions of postgres_backend,
copy QueryError and some logging/error routines to postgres_backend.rs. This is
temporal glue to make commits smaller, sync version will be dropped by the
upcoming commit completely.
2023-03-09 20:45:56 +03:00
Shany Pozin
7b9057ad01 Add timeout to download copy (#3675)
## Describe your changes
Adding a timeout handling for the remote download of layers of 120
seconds for each operation
Note that these downloads are being retried for N times 

## Issue ticket number and link
Fixes: #3672

## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-03-06 18:52:59 +02:00
Christian Schwarz
66a5159511 fix: compaction: no index upload scheduled if no on-demand downloads
Commit

    0cf7fd0fb8
    Compaction with on-demand download (#3598)

introduced a subtle bug: if we don't have to do on-demand downloads,
we only take one ROUND in fn compact() and exit early.
Thereby, we miss scheduling the index part upload for any layers
created by fn compact_inner().

Before that commit, we didn't have this problem.
So, this patch fixes it.

Since no regression test caught this, I went ahead and extended the
timeline size tests to assert that, if remote storage is configured,
1. pageserver_remote_physical_size matches the other physical sizes
2. file sizes reported by the layer map info endpoint match the other
   physical size metrics

Without the pageserver code fix, the regression test would
fail at the physical size assertion, complaining that
any of the resident physical size != remote physical size metric
50790400.0 != 18399232.0
I figured out what the problem is by comparing the remote storage
and local directories like so, and noticed that the image layer
in the local directory wasn't present on the remote side.
It's size was exactly the difference
    50790400.0 - 18399232.0  =32391168.0

fixes https://github.com/neondatabase/neon/issues/3738
2023-03-03 16:11:54 +01:00
Christian Schwarz
38022ff11c gc: only decrement resident size if GC'd layer is resident
Before this patch, GC would call PersistentLayer::delete()
on every GC'ed layer.
RemoteLayer::delete() returned Ok(()) unconditionally.
GC would then proceed by decrementing the resident size metric,
even though the layer is a RemoteLayer.

This patch makes the following changes:
- Rename PersistentLayer::delete() to delete_resident_layer_file().
  That name is unambiguous.
- Make RemoteLayer::delete_resident_layer_file return an Err().
  We would have uncovered this bug if we had done that from the start.
- Change GC / Timeline::delete_historic_layer check whether
  the layer is remote or not, and only call delete_resident_layer_file()
  if it's not remote. This brings us in line with how eviction does it.
- Add a regression test.

fixes https://github.com/neondatabase/neon/issues/3722
2023-03-03 12:10:24 +01:00
Christian Schwarz
1b9b9d60d4 eviction: add comment explaining resident size decrement on error
https://github.com/neondatabase/neon/issues/3722
2023-03-03 12:10:24 +01:00
Christian Schwarz
68141a924d eviction: remove needless if-let around resident size decrement
The branch was always taken at runtime, so, this should not
constitute a behavioral change.

refs https://github.com/neondatabase/neon/issues/3722
2023-03-03 12:10:24 +01:00
Christian Schwarz
764d27f696 fix checkpoint_timeout serialization in TenantConf
Without this change, when actually setting this conf opt, the tenant
would become Broken next time we load it.
Why?
The serde_toml representation that persist_tenant_conf would write out
would be a TOML inline table of `secs` and `nsecs`.
But our hand-rolled TenantConf parser expects a TOML string.

I checked that all other `Duration` values in TenantConfOpt
use the humantime serialization.

Issues like this would likely be systematically prevent by
https://github.com/neondatabase/neon/issues/3682
2023-03-03 12:10:24 +01:00
Konstantin Knizhnik
412e0aa985 Skip largest N holes during compaction (#3597)
## Describe your changes

This is yet another attempt to address problem with storage size
ballooning #2948
Previous PR #3348 tries to address this problem by maintaining list of
holes for each layer.
The problem with this approach is that we have to load all layer on
pageserver start.
Lazy loading of layers is not possible any more.

This PR tries to collect information of N largest holes on compaction
time and exclude this holes from produced layers.
It can cause generation of larger number of layers (up to 2 times) and
producing small layers.
But it requires minimal changes in code and doesn't affect storage
format.

For graphical explanation please see thread:
https://github.com/neondatabase/neon/pull/3597#discussion_r1112704451

## Issue ticket number and link

#2948
#3348

## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
2023-02-22 18:28:01 +02:00
Joonas Koivunen
225add041f calculate_logical_size: no longer use spawn_blocking (#3664)
Calculation of logical size is now async because of layer downloads, so
we shouldn't use spawn_blocking for it. Use of `spawn_blocking`
exhausted resources which are needed by `tokio::io::copy` when copying
from a stream to a file which lead to deadlock.

Fixes: #3657
2023-02-21 21:09:31 +02:00
Joonas Koivunen
fe462de85b fix: log download failed error (#3661)
Fixes #3659
2023-02-21 19:31:53 +02:00
Joonas Koivunen
b220ba6cd1 add random init delay for background tasks (#3655)
Fixes #3649.
2023-02-21 12:42:11 +01:00
Joonas Koivunen
7de373210d Warn when background tasks exceed their configured period (#3654)
Fixes #3648.
2023-02-21 13:02:19 +02:00
Christian Schwarz
485b269674 eviction: tone down logs to debug!() level if there were no evictions
fixes #3647
2023-02-20 18:01:59 +01:00
Christian Schwarz
ee1eda9921 eviction: remove EvictionStats::not_considered_due_to_clock_skew
Rationale: see the block comment added in this patch.

fixes #3641
2023-02-20 18:01:59 +01:00
Christian Schwarz
e363911c85 timeline: propagate span to download_remote_layer (#3644)
fixes #3643
refs #3604
2023-02-20 17:18:13 +02:00
Shany Pozin
af210c8b42 Allow running do_gc in non testing env (#3639)
## Describe your changes
Since the current default gc period is set to 1 hour, whenever there is
an immediate need to reduce PITR and run gc, the user has to wait 1 hour
for PITR change to take effect
By enabling this API the user can configure PITR and immediately call
the do_gc API to trigger gc
## Issue ticket number and link
#3590
## Checklist before requesting a review
- [X] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
2023-02-20 13:23:13 +02:00
Joonas Koivunen
8e6b27bf7c fix: avoid busy loop on replacement failure (#3613)
Add an AtomicBool per RemoteLayer, use it to mark together with closed
semaphore that remotelayer is unusable until restart or ignore+load.

https://github.com/neondatabase/neon/issues/3533#issuecomment-1431481554
2023-02-17 14:15:29 +02:00
Anastasia Lubennikova
0d3aefb274 Only use active timelines in synthetic_size calculation 2023-02-16 17:58:53 +02:00