Commit Graph

5557 Commits

Author SHA1 Message Date
Vlad Lazar
bbb2fa7cdd tests: perform graceful rolling restarts in storcon scale test (#8173)
## Problem
Scale test doesn't exercise drain & fill.

## Summary of changes
Make scale test exercise drain & fill
2024-07-04 06:04:19 +01:00
John Spray
778787d8e9 pageserver: add supplementary branch usage stats (#8131)
## Problem

The metrics we have today aren't convenient for planning around the
impact of timeline archival on costs.

Closes: https://github.com/neondatabase/neon/issues/8108

## Summary of changes

- Add metric `pageserver_archive_size`, which indicates the logical
bytes of data which we would expect to write into an archived branch.
- Add metric `pageserver_pitr_history_size`, which indicates the
distance between last_record_lsn and the PITR cutoff.

These metrics are somewhat temporary: when we implement #8088 and
associated consumption metric changes, these will reach a final form.
For now, an "archived" branch is just any branch outside of its parent's
PITR window: later, archival will become an explicit state (which will
_usually_ correspond to falling outside the parent's PITR window).

The overall volume of timeline metrics is something to watch, but we are
removing many more in https://github.com/neondatabase/neon/pull/8245
than this PR is adding.
2024-07-03 22:29:43 +01:00
Alex Chi Z
90b51dcf16 fix(pageserver): ensure test creates valid layer map (#8191)
I'd like to add some constraints to the layer map we generate in tests.

(1) is the layer map that the current compaction algorithm will produce.
There is a property that for all delta layer, all delta layer overlaps
with it on the LSN axis will have the same LSN range.
(2) is the layer map that cannot be produced with the legacy compaction
algorithm.
(3) is the layer map that will be produced by the future
tiered-compaction algorithm. The current validator does not allow that
but we can modify the algorithm to allow it in the future.

## Summary of changes

Add a validator to check if the layer map is valid and refactor the test
cases to include delta layer start/end LSN.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
2024-07-03 18:46:58 +00:00
Christian Schwarz
a85aa03d18 page_service: stop exposing get_last_record_rlsn (#8244)
Compute doesn't use it, let's eliminate it.

Ref to Slack thread:
https://neondb.slack.com/archives/C033RQ5SPDH/p1719920261995529
2024-07-03 20:05:01 +02:00
Japin Li
cdaed4d79c Fix outdated comment (#8149)
Commit 97b48c23f changes the log wait timeout from 1 second to 100
milliseconds but forgets to update the comment.
2024-07-03 13:55:36 -04:00
John Spray
ea0b22a9b0 pageserver: reduce ops tracked at per-timeline detail (#8245)
## Problem

We record detailed histograms for all page_service op types, which
mostly aren't very interesting, but make our prometheus scrapes huge.

Closes: #8223 

## Summary of changes

- Only track GetPageAtLsn histograms on a per-timeline granularity. For
all other operation types, rely on existing node-wide histograms.
2024-07-03 17:27:34 +01:00
Peter Bendel
392a58bdce add pagebench test cases for periodic pagebench on dedicated hardware (#8233)
we want to run some specific pagebench test cases on dedicated hardware
to get reproducible results

run1: 1 client per tenant => characterize throughput with n tenants.
-  500 tenants
- scale 13 (200 MB database)
- 1 hour duration
- ca 380 GB layer snapshot files

run2.singleclient: 1 client per tenant => characterize latencies
run2.manyclient: N clients per tenant => characterize throughput
scalability within one tenant.
- 1 tenant with 1 client for latencies
- 1 tenant with 64 clients because typically for a high number of
connections we recommend the connection pooler
which by default uses 64 connections (for scalability)
- scale 136 (2048 MB database)
- 20 minutes each
2024-07-03 16:22:33 +00:00
Arpad Müller
e0891ec8c8 Only support compressed reads if the compression setting is present (#8238)
PR #8106 was created with the assumption that no blob is larger than
`256 MiB`. Due to #7852 we have checking for *writes* of blobs larger
than that limit, but we didn't have checking for *reads* of such large
blobs: in theory, we could be reading these blobs every day but we just
don't happen to write the blobs for some reason.

Therefore, we now add a warning for *reads* of such large blobs as well.

To make deploying compression less dangerous, we therefore only assume a
blob is compressed if the compression setting is present in the config.
This also means that we can't back out of compression once we enabled
it.

Part of https://github.com/neondatabase/neon/issues/5431
2024-07-03 18:02:10 +02:00
John Spray
97f7188a07 pageserver: don't try to flush if shutdown during attach (#8235)
## Problem

test_location_conf_churn fails on log errors when it tries to shutdown a
pageserver immediately after starting a tenant attach, like this:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8224/9761000525/index.html#/testresult/15fb6beca5c7327c

```
shutdown:shutdown{tenant_id=35f5c55eb34e7e5e12288c5d8ab8b909 shard_id=0000}:timeline_shutdown{timeline_id=30936747043353a98661735ad09cbbfe shutdown_mode=FreezeAndFlush}: failed to freeze and flush: cannot flush frozen layers when flush_loop is not running, state is Exited\n')
```

This is happening because Tenant::shutdown fires its cancellation token
early if the tenant is not fully attached by the time shutdown is
called, so the flush loop is shutdown by the time we try and flush.

## Summary of changes

- In the early-cancellation case, also set the shutdown mode to Hard to
skip trying to do a flush that will fail.
2024-07-03 13:13:06 +00:00
Alexander Bayandin
aae3876318 CI: update docker/* actions to latest versions (#7694)
## Problem

GitHub Actions complain that we use actions that depend on deprecated
Node 16:

```
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: docker/setup-buildx-action@v2
```

But also, the latest `docker/setup-buildx-action` fails with the following
error:
```
/nvme/actions-runner/_work/_actions/docker/setup-buildx-action/v3/webpack:/docker-setup-buildx/node_modules/@actions/cache/lib/cache.js:175
            throw new Error(`Path Validation Error: Path(s) specified in the action for caching do(es) not exist, hence no cache is being saved.`);
^
Error: Path Validation Error: Path(s) specified in the action for caching do(es) not exist, hence no cache is being saved.
    at Object.rejected (/nvme/actions-runner/_work/_actions/docker/setup-buildx-action/v3/webpack:/docker-setup-buildx/node_modules/@actions/cache/lib/cache.js:175:1)
    at Generator.next (<anonymous>)
    at fulfilled (/nvme/actions-runner/_work/_actions/docker/setup-buildx-action/v3/webpack:/docker-setup-buildx/node_modules/@actions/cache/lib/cache.js:29:1)
```

We can work this around by setting `cache-binary: false` for `uses:
docker/setup-buildx-action@v3`

## Summary of changes
- Update `docker/setup-buildx-action` from `v2` to `v3`, set
`cache-binary: false`
- Update `docker/login-action` from `v2` to `v3`
- Update `docker/build-push-action` from `v4`/`v5` to `v6`
2024-07-03 12:19:13 +01:00
Heikki Linnakangas
dae55badf3 Simplify test_wal_page_boundary_start test (#8214)
All the code to ensure the WAL record lands at a page boundary was
unnecessary for reproducing the original problem. In fact, it's a pretty
basic test that checks that outbound replication (= neon as publisher)
still works after restarting the endpoint. It just used to be very
broken before commit 5ceccdc7de, which also added this test.

To verify that:

1. Check out commit f3af5f4660 (because the next commit, 7dd58e1449,
fixed the same bug in a different way, making it infeasible to revert
the bug fix in an easy way)
2. Revert the bug fix from commit 5ceccdc7de with this:

```
diff --git a/pgxn/neon/walproposer_pg.c b/pgxn/neon/walproposer_pg.c
index 7debb6325..9f03bbd99 100644
--- a/pgxn/neon/walproposer_pg.c
+++ b/pgxn/neon/walproposer_pg.c
@@ -1437,8 +1437,10 @@ XLogWalPropWrite(WalProposer *wp, char *buf, Size nbytes, XLogRecPtr recptr)
 	 *
 	 * https://github.com/neondatabase/neon/issues/5749
 	 */
+#if 0
 	if (!wp->config->syncSafekeepers)
 		XLogUpdateWalBuffers(buf, recptr, nbytes);
+#endif

 	while (nbytes > 0)
 	{
```

3. Run the test_wal_page_boundary_start regression test. It fails, as
expected

4. Apply this commit to the test, and run it again. It still fails, with
the same error mentioned in issue #5749:

```
PG:2024-06-30 20:49:08.805 GMT [1248196] STATEMENT:  START_REPLICATION SLOT "sub1" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"pub1"')
PG:2024-06-30 21:37:52.567 GMT [1467972] LOG:  starting logical decoding for slot "sub1"
PG:2024-06-30 21:37:52.567 GMT [1467972] DETAIL:  Streaming transactions committing after 0/1532330, reading WAL from 0/1531C78.
PG:2024-06-30 21:37:52.567 GMT [1467972] STATEMENT:  START_REPLICATION SLOT "sub1" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"pub1"')
PG:2024-06-30 21:37:52.567 GMT [1467972] LOG:  logical decoding found consistent point at 0/1531C78
PG:2024-06-30 21:37:52.567 GMT [1467972] DETAIL:  There are no running transactions.
PG:2024-06-30 21:37:52.567 GMT [1467972] STATEMENT:  START_REPLICATION SLOT "sub1" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"pub1"')
PG:2024-06-30 21:37:52.568 GMT [1467972] ERROR:  could not find record while sending logically-decoded data: invalid contrecord length 312 (expected 6) at 0/1533FD8
```
2024-07-03 13:22:53 +03:00
Alex Chi Z
4273309962 docker: add storage_scrubber into the docker image (#8239)
## Problem

We will run this tool in the k8s cluster. To make it accessible from
k8s, we need to package it into the docker image.

part of https://github.com/neondatabase/cloud/issues/14024
2024-07-03 09:48:56 +01:00
Konstantin Knizhnik
4a0c2aebe0 Add test for proper handling of connection failure to avoid 'cannot wait on socket event without a socket' error (#8231)
## Problem

See https://github.com/neondatabase/cloud/issues/14289
and PR #8210 

## Summary of changes

Add test for problems fixed in #8210

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-07-02 21:45:42 +03:00
Alex Chi Z
891cb5a9a8 fix(pageserver): comments about metadata key range (#8236)
Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-07-02 16:54:32 +00:00
John Spray
f5832329ac tense of errors (#8234)
I forgot a commit when merging
https://github.com/neondatabase/neon/pull/8177
2024-07-02 17:17:22 +01:00
Alexander Bayandin
6216df7765 CI(benchmarking): move psql queries to actions/run-python-test-set (#8230)
## Problem

Some of the Nightly benchmarks fail with the error
```
+ /tmp/neon/pg_install/v14/bin/pgbench --version
/tmp/neon/pg_install/v14/bin/pgbench: error while loading shared libraries: libpq.so.5: cannot open shared object file: No such file or directory
```
Originally, we added the `pgbench --version` call to check that
`pgbench` is installed and to fail earlier if it's not.
The failure happens because we don't have `LD_LIBRARY_PATH` set for
every job, and it also affects `psql` command.
We can move it to `actions/run-python-test-set` so as not to duplicate
code (as it already have `LD_LIBRARY_PATH` set).

## Summary of changes
- Remove `pgbench --version` call
- Move `psql` commands to common `actions/run-python-test-set`
2024-07-02 15:21:23 +00:00
Christian Schwarz
5de896e7d8 L0 flush: opt-in mechanism to bypass PageCache reads and writes (#8190)
part of https://github.com/neondatabase/neon/issues/7418

# Motivation

(reproducing #7418)

When we do an `InMemoryLayer::write_to_disk`, there is a tremendous
amount of random read I/O, as deltas from the ephemeral file (written in
LSN order) are written out to the delta layer in key order.

In benchmarks (https://github.com/neondatabase/neon/pull/7409) we can
see that this delta layer writing phase is substantially more expensive
than the initial ingest of data, and that within the delta layer write a
significant amount of the CPU time is spent traversing the page cache.

# High-Level Changes

Add a new mode for L0 flush that works as follows:

* Read the full ephemeral file into memory -- layers are much smaller
than total memory, so this is afforable
* Do all the random reads directly from this in memory buffer instead of
using blob IO/page cache/disk reads.
* Add a semaphore to limit how many timelines may concurrently do this
(limit peak memory).
* Make the semaphore configurable via PS config.

# Implementation Details

The new `BlobReaderRef::Slice` is a temporary hack until we can ditch
`blob_io` for `InMemoryLayer` => Plan for this is laid out in
https://github.com/neondatabase/neon/issues/8183

# Correctness

The correctness of this change is quite obvious to me: we do what we did
before (`blob_io`) but read from memory instead of going to disk.

The highest bug potential is in doing owned-buffers IO. I refactored the
API a bit in preliminary PR
https://github.com/neondatabase/neon/pull/8186 to make it less
error-prone, but still, careful review is requested.

# Performance

I manually measured single-client ingest performance from `pgbench -i
...`.

Full report:
https://neondatabase.notion.site/2024-06-28-benchmarking-l0-flush-performance-e98cff3807f94cb38f2054d8c818fe84?pvs=4

tl;dr:

* no speed improvements during ingest,  but
* significantly lower pressure on PS PageCache (eviction rate drops to
1/3)
  * (that's why I'm working on this)
* noticable but modestly lower CPU time

This is good enough for merging this PR because the changes require
opt-in.

We'll do more testing in staging & pre-prod.

# Stability / Monitoring

**memory consumption**: there's no _hard_ limit on max `InMemoryLayer`
size (aka "checkpoint distance") , hence there's no hard limit on the
memory allocation we do for flushing. In practice, we a) [log a
warning](23827c6b0d/pageserver/src/tenant/timeline.rs (L5741-L5743))
when we flush oversized layers, so we'd know which tenant is to blame
and b) if we were to put a hard limit in place, we would have to decide
what to do if there is an InMemoryLayer that exceeds the limit.
It seems like a better option to guarantee a max size for frozen layer,
dependent on `checkpoint_distance`. Then limit concurrency based on
that.

**metrics**: we do have the
[flush_time_histo](23827c6b0d/pageserver/src/tenant/timeline.rs (L3725-L3726)),
but that includes the wait time for the semaphore. We could add a
separate metric for the time spent after acquiring the semaphore, so one
can infer the wait time. Seems unnecessary at this point, though.
2024-07-02 16:29:09 +02:00
Arpad Müller
25eefdeb1f Add support for reading and writing compressed blobs (#8106)
Add support for reading and writing zstd-compressed blobs for use in
image layer generation, but maybe one day useful also for delta layers.
The reading of them is unconditional while the writing is controlled by
the `image_compression` config variable allowing for experiments.

For the on-disk format, we re-use some of the bitpatterns we currently
keep reserved for blobs larger than 256 MiB. This assumes that we have
never ever written any such large blobs to image layers.

After the preparation in #7852, we now are unable to read blobs with a
size larger than 256 MiB (or write them).

A non-goal of this PR is to come up with good heuristics of when to
compress a bitpattern. This is left for future work.

Parts of the PR were inspired by #7091.

cc  #7879

Part of #5431
2024-07-02 14:14:12 +00:00
Vlad Lazar
28929d9cfa pageserver: rate limit log for loads of layers visited (#8228)
## Problem
At high percentiles we see more than 800 layers being visited by the
read path. We need the tenant/timeline to investigate.

## Summary of changes
Add a rate limited log line when the average number of layers visited
per key is in the last specified histogram bucket.
I plan to use this to identify tenants in us-east-2 staging that exhibit
this behaviour. Will revert before next week's release.
2024-07-02 14:14:10 +01:00
Christian Schwarz
9b4b4bbf6f fix: noisy logging when download gets cancelled during shutdown (#8224)
Before this PR, during timeline shutdown, we'd occasionally see
log lines like this one:

```
2024-06-26T18:28:11.063402Z  INFO initial_size_calculation{tenant_id=$TENANT,shard_id=0000 timeline_id=$TIMELINE}:logical_size_calculation_task:get_or_maybe_download{layer=000000000000000000000000000000000000-000000067F0001A3950001C1630100000000__0000000D88265898}: layer file download failed, and caller has been cancelled: Cancelled, shutting down
Stack backtrace:
   0: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Result<core::convert::Infallible,E>>>::from_residual
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/result.rs:1964:27
      pageserver::tenant::remote_timeline_client::RemoteTimelineClient::download_layer_file::{{closure}}
             at /home/nonroot/pageserver/src/tenant/remote_timeline_client.rs:531:13
      pageserver::tenant::storage_layer::layer::LayerInner::download_and_init::{{closure}}
             at /home/nonroot/pageserver/src/tenant/storage_layer/layer.rs:1136:14
      pageserver::tenant::storage_layer::layer::LayerInner::download_init_and_wait::{{closure}}::{{closure}}
             at /home/nonroot/pageserver/src/tenant/storage_layer/layer.rs:1082:74
```

We can eliminate the anyhow backtrace with no loss of information
because the conversion to anyhow::Error happens in exactly one place.

refs #7427
2024-07-02 13:13:27 +00:00
John Spray
1a0f545c16 pageserver: simpler, stricter config error handling (#8177)
## Problem

Tenant attachment has error paths for failures to write local
configuration, but these types of local storage I/O errors should be
considered fatal for the process. Related thread on an earlier PR that
touched this code:
https://github.com/neondatabase/neon/pull/7947#discussion_r1655134114

## Summary of changes

- Make errors writing tenant config fatal (abort process)
- When reading tenant config, make all I/O errors except ENOENT fatal
- Replace use of bare anyhow errors with `LoadConfigError`
2024-07-02 12:45:04 +00:00
Christian Schwarz
7dcdbaa25e remote_storage config: move handling of empty inline table {} to callers (#8193)
Before this PR, `RemoteStorageConfig::from_toml` would support
deserializing an
empty `{}` TOML inline table to a `None`, otherwise try `Some()`.

We can instead let
* in proxy: let clap derive handle the Option
* in PS & SK: assume that if the field is specified, it must be a valid
  RemtoeStorageConfig

(This PR started with a much simpler goal of factoring out the
`deserialize_item` function because I need that in another PR).
2024-07-02 12:53:08 +02:00
Konstantin Knizhnik
0497b99f3a Check status of connection after PQconnectStartParams (#8210)
## Problem

See https://github.com/neondatabase/cloud/issues/14289

## Summary of changes

Check connection status after calling PQconnectStartParams

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-07-02 06:56:10 +03:00
Vlad Lazar
9882ac8e06 docs: Graceful storage controller cluster restarts RFC (#7704)
RFC for "Graceful Restarts of Storage Controller Managed Clusters". 
Related https://github.com/neondatabase/neon/issues/7387
2024-07-01 18:44:28 +01:00
Heikki Linnakangas
0789160ffa tests: Make neon_xlogflush() flush all WAL, if you omit the LSN arg (#8215)
This makes it much more convenient to use in the common case that you
want to flush all the WAL. (Passing pg_current_wal_insert_lsn() as the
argument doesn't work for the same reasons as explained in the comments:
we need to be back off to the beginning of a page if the previous record
ended at page boundary.)

I plan to use this to fix the issue that Arseny Sher called out at
https://github.com/neondatabase/neon/pull/7288#discussion_r1660063852
2024-07-01 10:55:18 -05:00
Alexander Bayandin
9c32604aa9 CI(gather-rust-build-stats): fix build with libpq (#8219)
## Problem
I've missed setting `PQ_LIB_DIR` in
https://github.com/neondatabase/neon/pull/8206 in
`gather-rust-build-stats` job and it fails now:
```
  = note: /usr/bin/ld: cannot find -lpq
          collect2: error: ld returned 1 exit status
          

error: could not compile `storage_controller` (bin "storage_controller") due to 1 previous error
```

https://github.com/neondatabase/neon/actions/runs/9743960062/job/26888597735

## Summary of changes
- Set `PQ_LIB_DIR` for `gather-rust-build-stats` job
2024-07-01 16:42:23 +01:00
Alex Chi Z
b02aafdfda fix(pageserver): include aux file in basebackup only once (#8207)
Extracted from https://github.com/neondatabase/neon/pull/6560, currently
we include multiple copies of aux files in the basebackup.

## Summary of changes

Fix the loop.

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-07-01 14:36:49 +00:00
Alexander Bayandin
e823b92947 CI(build-tools): Remove libpq from build image (#8206)
## Problem
We use `build-tools` image as a base image to build other images, and it
has a pretty old `libpq-dev` installed (v13; it wasn't that old until I
removed system Postgres 14 from `build-tools` image in
https://github.com/neondatabase/neon/pull/6540)

## Summary of changes
- Remove `libpq-dev` from `build-tools` image
- Set `LD_LIBRARY_PATH` for tests (for different Postgres binaries that
we use, like psql and pgbench)
- Set `PQ_LIB_DIR` to build Storage Controller
- Set `LD_LIBRARY_PATH`/`DYLD_LIBRARY_PATH` in the Storage Controller
where it calls Postgres binaries
2024-07-01 13:11:55 +01:00
John Spray
aea5cfe21e pageserver: add metric pageserver_secondary_resident_physical_size (#8204)
## Problem

We lack visibility of how much local disk space is used by secondary
tenant locations

Close: https://github.com/neondatabase/neon/issues/8181

## Summary of changes

- Add `pageserver_secondary_resident_physical_size`, tagged by tenant
- Register & de-register label sets from SecondaryTenant
- Add+use wrappers in SecondaryDetail that update metrics when
adding+removing layers/timelines
2024-07-01 12:48:20 +01:00
Heikki Linnakangas
9ce193082a Restore running xacts from CLOG on replica startup (#7288)
We have one pretty serious MVCC visibility bug with hot standby
replicas. We incorrectly treat any transactions that are in progress
in the primary, when the standby is started, as aborted. That can
break MVCC for queries running concurrently in the standby. It can
also lead to hint bits being set incorrectly, and that damage can last
until the replica is restarted.

The fundamental bug was that we treated any replica start as starting
from a shut down server. The fix for that is straightforward: we need
to set 'wasShutdown = false' in InitWalRecovery() (see changes in the
postgres repo).

However, that introduces a new problem: with wasShutdown = false, the
standby will not open up for queries until it receives a running-xacts
WAL record from the primary. That's correct, and that's how Postgres
hot standby always works. But it's a problem for Neon, because:

* It changes the historical behavior for existing users. Currently,
  the standby immediately opens up for queries, so if they now need to
  wait, we can breka existing use cases that were working fine
  (assuming you don't hit the MVCC issues).

* The problem is much worse for Neon than it is for standalone
  PostgreSQL, because in Neon, we can start a replica from an
  arbitrary LSN. In standalone PostgreSQL, the replica always starts
  WAL replay from a checkpoint record, and the primary arranges things
  so that there is always a running-xacts record soon after each
  checkpoint record. You can still hit this issue with PostgreSQL if
  you have a transaction with lots of subtransactions running in the
  primary, but it's pretty rare in practice.

To mitigate that, we introduce another way to collect the
running-xacts information at startup, without waiting for the
running-xacts WAL record: We can the CLOG for XIDs that haven't been
marked as committed or aborted. It has limitations with
subtransactions too, but should mitigate the problem for most users.

See https://github.com/neondatabase/neon/issues/7236.

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-07-01 12:58:12 +03:00
Heikki Linnakangas
75c84c846a tests: Make neon_xlogflush() flush all WAL, if you omit the LSN arg
This makes it much more convenient to use in the common case that you
want to flush all the WAL. (Passing pg_current_wal_insert_lsn() as the
argument doesn't work for the same reasons as explained in the
comments: we need to be back off to the beginning of a page if the
previous record ended at page boundary.)

I plan to use this to fix the issue that Arseny Sher called out at
https://github.com/neondatabase/neon/pull/7288#discussion_r1660063852
2024-07-01 12:58:08 +03:00
Heikki Linnakangas
57535c039c tests: remove a leftover 'running' flag (#8216)
The 'running' boolean was replaced with a semaphore in commit
f0e2bb79b2, but this initialization was missed. Remove it so that if a
test tries to access it, you get an error rather than always claiming
that the endpoint is not running.

Spotted by Arseny at
https://github.com/neondatabase/neon/pull/7288#discussion_r1660068657
2024-07-01 11:23:31 +03:00
Heikki Linnakangas
30027d94a2 Fix tracking of the nextMulti in the pageserver's copy of CheckPoint (#6528)
Whenever we see an XLOG_MULTIXACT_CREATE_ID WAL record, we need to
update the nextMulti and NextMultiOffset fields in the pageserver's
copy of the CheckPoint struct, to cover the new multi-XID. In
PostgreSQL, this is done by updating an in-memory struct during WAL
replay, but because in Neon you can start a compute node at any LSN,
we need to have an up-to-date value pre-calculated in the pageserver
at all times. We do the same for nextXid.

However, we had a bug in WAL ingestion code that does that: the
multi-XIDs will wrap around at 2^32, just like XIDs, so we need to do
the comparisons in a wraparound-aware fashion.

Fix that, and add tests.

Fixes issue #6520

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-07-01 01:49:49 +03:00
Alex Chi Z
bc704917a3 fix(pageserver): ensure tenant harness has different names (#8205)
rename the tenant test harness name

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-06-28 15:13:25 -04:00
John Spray
b8bbaafc03 storage controller: fix heatmaps getting disabled during shard split (#8197)
## Problem

At the start of do_tenant_shard_split, we drop any secondary location
for the parent shards. The reconciler uses presence of secondary
locations as a condition for enabling heatmaps.

On the pageserver, child shards inherit their configuration from
parents, but the storage controller assumes the child's ObservedState is
the same as the parent's config from the prepare phase. The result is
that some child shards end up with inaccurate ObservedState, and until
something next migrates or restarts, those tenant shards aren't
uploading heatmaps, so their secondary locations are downloading
everything that was resident at the moment of the split (including
ancestor layers which are often cleaned up shortly after the split).

Closes: https://github.com/neondatabase/neon/issues/8189

## Summary of changes

- Use PlacementPolicy to control enablement of heatmap upload, rather
than the literal presence of secondaries in IntentState: this way we
avoid switching them off during shard split
- test: during tenant split test, assert that the child shards have
heatmap uploads enabled.
2024-06-28 18:27:13 +01:00
Arthur Petukhovsky
e1a06b40b7 Add rate limiter for partial uploads (#8203)
Too many concurrect partial uploads can hurt disk performance, this
commit adds a limiter.

Context:
https://neondb.slack.com/archives/C04KGFVUWUQ/p1719489018814669?thread_ts=1719440183.134739&cid=C04KGFVUWUQ
2024-06-28 18:16:21 +01:00
John Spray
babbe125da pageserver: drop out of secondary download if iteration time has passed (#8198)
## Problem

Very long running downloads can be wasteful, because the heatmap they're
using is outdated after a few minutes.

Closes: https://github.com/neondatabase/neon/issues/8182

## Summary of changes

- Impose a deadline on timeline downloads, using the same period as we
use for scheduling, and returning an UpdateError::Restart when it is
reached. This restart will involve waiting for a scheduling interval,
but that's a good thing: it helps let other tenants proceed.
- Refactor download_timeline so that the part where we update the state
for local layers is done even if we fall out of the layer download loop
with an error: this is important, especially for big tenants, because
only layers in the SecondaryDetail state will be considered for
eviction.
2024-06-28 17:05:09 +00:00
Heikki Linnakangas
ca2f7d06b2 Cherry-pick upstream fix for TruncateMultiXact assertion (#8195)
We hit that bug in a new test being added in PR #6528. We'd get the fix
from upstream with the next minor release anyway, but cherry-pick it now
to unblock PR #6528.

Upstream commit b1ffe3ff0b.

See
https://github.com/neondatabase/neon/pull/6528#issuecomment-2167367910
2024-06-28 16:47:05 +03:00
Arthur Petukhovsky
c22c6a6c9e Add buckets to safekeeper ops metrics (#8194)
In #8188 I forgot to specify buckets for new operations metrics. This
commit fixes that.
2024-06-28 11:09:11 +01:00
Christian Schwarz
deec3bc578 virtual_file: take a Slice in the read APIs, eliminate read_exact_at_n, fix UB for engine std-fs (#8186)
part of https://github.com/neondatabase/neon/issues/7418

I reviewed how the VirtualFile API's `read` methods look like and came
to the conclusion that we've been using `IoBufMut` / `BoundedBufMut` /
`Slice` wrong.

This patch rectifies the situation.

# Change 1: take `tokio_epoll_uring::Slice` in the read APIs

Before, we took an `IoBufMut`, which is too low of a primitive and while
it _seems_ convenient to be able to pass in a `Vec<u8>` without any
fuzz, it's actually very unclear at the callsite that we're going to
fill up that `Vec` up to its `capacity()`, because that's what
`IoBuf::bytes_total()` returns and that's what
`VirtualFile::read_exact_at` fills.

By passing a `Slice` instead, a caller that "just wants to read into a
`Vec`" is forced to be explicit about it, adding either `slice_full()`
or `slice(x..y)`, and these methods panic if the read is outside of the
bounds of the `Vec::capacity()`.

Last, passing slices is more similar to what the `std::io` APIs look
like.

# Change 2: fix UB in `virtual_file_io_engine=std-fs`

While reviewing call sites, I noticed that the
`io_engine::IoEngine::read_at` method for `StdFs` mode has been
constructing an `&mut[u8]` from raw parts that were uninitialized.

We then used `std::fs::File::read_exact` to initialize that memory, but,
IIUC we must not even be constructing an `&mut[u8]` where some of the
memory isn't initialized.

So, stop doing that and add a helper ext trait on `Slice` to do the
zero-initialization.

# Change 3: eliminate  `read_exact_at_n`

The `read_exact_at_n` doesn't make sense because the caller can just

1. `slice = buf.slice()` the exact memory it wants to fill 
2. `slice = read_exact_at(slice)`
3. `buf = slice.into_inner()`

Again, the `std::io` APIs specify the length of the read via the Rust
slice length.
We should do the same for the owned buffers IO APIs, i.e., via
`Slice::bytes_total()`.

# Change 4: simplify filling of `PageWriteGuard`

The `PageWriteGuardBuf::init_up_to` was never necessary.
Remove it. See changes to doc comment for more details.

---

Reviewers should probably look at the added test case first, it
illustrates my case a bit.
2024-06-28 11:20:37 +02:00
John Spray
063553a51b pageserver: remove tenant create API (#8135)
## Problem

For some time, we have created tenants with calls to location_conf. The
legacy "POST /v1/tenant" path was only used in some tests.

## Summary of changes

- Remove the API
- Relocate TenantCreateRequest to the controller API file (this used to
be used in both pageserver and controller APIs)
- Rewrite tenant_create test helper to use location_config API, as
control plane and storage controller do
- Update docker-compose test script to create tenants with
location_config API (this small commit is also present in
https://github.com/neondatabase/neon/pull/7947)
2024-06-28 09:14:19 +01:00
Tristan Partin
5700233a47 Add application_name to compute activity monitor connection string
This was missed in my previous attempt to mark every connection string
with an application name. See 0c3e3a8667.
2024-06-27 12:38:15 -07:00
Arthur Petukhovsky
1d66ca79a9 Improve slow operations observability in safekeepers (#8188)
After https://github.com/neondatabase/neon/pull/8022 was deployed to
staging, I noticed many cases of timeouts. After inspecting the logs, I
realized that some operations are taking ~20 seconds and they're doing
while holding shared state lock. Usually it happens right after
redeploy, because compute reconnections put high load on disks. This
commit tries to improve observability around slow operations.

Non-observability changes:
- `TimelineState::finish_change` now skips update if nothing has changed
- `wal_residence_guard()` timeout is set to 30s
2024-06-27 18:39:43 +01:00
Alex Chi Z
23827c6b0d feat(pageserver): add delta layer iterator (#8064)
part of https://github.com/neondatabase/neon/issues/8002

## Summary of changes

Add delta layer iterator and tests.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-06-27 16:03:48 +00:00
Christian Schwarz
66b0bf41a1 fix: shutdown does not kill walredo processes (#8150)
While investigating Pageserver logs from the cases where systemd hangs
during shutdown (https://github.com/neondatabase/cloud/issues/11387), I
noticed that even if Pageserver shuts down cleanly[^1], there are
lingering walredo processes.

[^1]: Meaning, pageserver finishes its shutdown procedure and calls
`exit(0)` on its own terms, instead of hitting the systemd unit's
`TimeoutSec=` limit and getting SIGKILLed.

While systemd should never lock up like it does, maybe we can avoid
hitting that bug by cleaning up properly.

Changes
-------

This PR adds a shutdown method to `WalRedoManager` and hooks it up to
tenant shutdown.

We keep track of intent to shutdown through the new `enum
ProcessOnceCell` stored inside the pre-existing `redo_process` field.
A gate is added to keep track of running processes, using the new type
`struct Process`.

Future Work
-----------

Requests that don't need the redo process will not observe the shutdown
(see doc comment).
Doing so would be nice for completeness sake, but doesn't provide much
benefit because `Tenant` and `Timeline` already shut down all walredo
users.

Testing
-------


I did manual testing to confirm that the problem exists before this PR
and that it's gone after.
Setup:
* `neon_local` with a single tenant, create some data using `pgbench`
* ensure walredo process is running, not pid
* watch `strace -e kill,wait4 -f -p "$(pgrep pageserver)"`
* `neon_local pageserver stop`

With this PR, we always observe

```
$ strace -e kill,wait4 -f -p "$(pgrep pageserver)"
...
[pid 591120] --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=591215, si_uid=1000} ---
[pid 591134] kill(591174, SIGKILL)      = 0
[pid 591134] wait4(591174,  <unfinished ...>
[pid 591142] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=591174, si_uid=1000, si_status=SIGKILL, si_utime=0, si_stime=0} ---
[pid 591134] <... wait4 resumed>[{WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL}], 0, NULL) = 591174
...
+++ exited with 0 +++
```

Before this PR, we'd usually observe just

```
...
[pid 596239] --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=596455, si_uid=1000} ---
...
+++ exited with 0 +++
```

Refs
----

refs https://github.com/neondatabase/cloud/issues/11387
2024-06-27 15:58:28 +02:00
Vlad Lazar
89cf8df93b stocon: bump number of concurrent reconciles per operation (#8179)
## Problem
Background node operations take a long time for loaded nodes.

## Summary of changes
Increase number of concurrent reconciles an operation is allowed to
spawn.
This should make drain and fill operations faster and the new value is
still well below the total limit of concurrent reconciles.
2024-06-27 13:16:41 +00:00
Alexander Bayandin
54a06de4b5 CI: Use runner.arch in cache keys along with runner.os (#8175)
## Problem
The cache keys that we use on CI are the same for X64 and ARM64
(`runner.arch`)

## Summary of changes
- Include `runner.arch` along with `runner.os` into cache keys
2024-06-27 13:56:03 +01:00
Arseny Sher
6f20a18e8e Allow to change compute safekeeper list without restart.
- Add --safekeepers option to neon_local reconfigure
- Add it to python Endpoint reconfigure
- Implement config reload in walproposer by restarting the whole bgw when
  safekeeper list changes.

ref https://github.com/neondatabase/neon/issues/6341
2024-06-27 15:08:35 +03:00
Vlad Lazar
d557002675 strocon: don't overcommit when making node fill plan (#8171)
## Problem
The fill requirement was not taken into account when looking through the
shards of a given node to fill from.

## Summary of Changes
Ensure that we do not fill a node past the recommendation from
`Scheduler::compute_fill_requirement`.
2024-06-27 11:56:57 +01:00
Cihan Demirci
32b75e7c73 CI: additional trigger on merge to main (#8176)
Before we consolidate workflows we want to be triggered by merges to main.

https://github.com/neondatabase/cloud/issues/14862
2024-06-26 22:36:41 +00:00