Commit Graph

929 Commits

Author SHA1 Message Date
John Spray
d515727e94 tests: make test_multi_attach more stable (#9202)
## Problem

`test_multi_attach` is sometimes failing with `invalid compute status
for configuration request: Configuration`. This is likely a result of
the test attempting to reconfigure the compute at the same time as the
storage controller is doing so.

This test was originally written before the storage controller existed,
and is not expecting anything else to be reconfiguring computes at the
same time.

## Summary of changes

- Configure the tenant into scheduling policy `Stop` in the storage
controller at the start of the test, so that it won't try to do anything
to the tenant while the test is running.
2024-10-01 10:15:18 +01:00
John Spray
651ae44569 storage controller: drop out of blocking compute notification loop if migration origin becomes unavailable (#9147)
## Problem

The live migration code waits forever for the compute notification hook,
on the basis that until it succeeds, the compute is probably using the
old location and we shouldn't detach it.

However, if a pageserver stops or restarts in the background, then this
original location might no longer be available, so there is no point
waiting. Waiting is also actively harmful, because it prevents other
reconciliations happening for the tenant shard, such as during an
upgrade where a stuck "drain" migration might prevent the later "fill"
migration from moving the shard back to its original location.

## Summary of changes

- Refactor the notification wait loop into a function
- Add a checks during the loop, for the origin node's cancellation token
and an explicit HTTP request to the origin node to confirm the shard is
still attached there.

Closes: https://github.com/neondatabase/neon/issues/8901
2024-10-01 07:57:22 +00:00
Heikki Linnakangas
69ea2776e9 tests: Remove creation of extra timelines in some tests
neon_cli.create_tenant() creates a new tenant *and* a timeline on the
tenant, with name "main". In most tests, there's no need to create
another timeline on the same tenant.

There are some more tests that do that, but in the remaining cases, I
wasn't be 100% if the presence of extra root timelines affect what the
tests test, so I left them alone.
2024-09-30 17:56:40 +03:00
Heikki Linnakangas
4dc9cb7cf9 tests: Remove some spurious list_timelines calls
These calls seem really out of place. We know what the initial tenant
and branch are in these tests, just like in all other tests.
2024-09-30 17:56:37 +03:00
John Spray
7424e7269c tests: longer timeout in test_delete_timeline_client_hangup (#9161)
## Problem

This test waits for a request to finish, and then expects deletion to
complete almost immediately. The request completes, but it's a 202, the
timeline is still deleting in the background: we need to be more
patient.

## Summary of changes

- Adjust iterations from 2 to 10 when waiting for deletion
2024-09-30 15:46:07 +01:00
a-masterov
5dc68e4e6a test_compatibility: fix the regexes detecting the version (#9205)
## Problem
The Neon components, built locally and by the GitHub workflow have
slightly different version prefixes (git: vs git-env:)
This does not allow running tests against local builds correctly.

## Summary of changes
The regular expressions were changed to work with both
prefixes.
2024-09-30 16:37:14 +02:00
Heikki Linnakangas
d696c41807 Bump default neon extension version to 1.5 (#9188)
Commit 263dfba6ee introduced neon extension version 1.5, which included
some new functions and views for metrics. It didn't bump the default
neon extension number yet, so that we could still safely roll back to
the old binary if necessary. This bumps the default version.
2024-09-30 09:20:52 +03:00
Yuchen Liang
42ef08db47 fix(pageserver): LSN lease edge cases around restarts/migrations (#9055)
Part of #7497, closes #8817.

## Problem

See #8817. 

## Summary of changes

**compute_ctl**

- Renew lsn lease as soon as `/configure` updates pageserver_connstr,
use `state_changed` Condvar for synchronization.

**pageserver**

As mentioned in
https://github.com/neondatabase/neon/issues/8817#issuecomment-2315768076,
we still want some permanent error reported if a lease cannot be
granted. By considering attachment mode and the added
`lsn_lease_deadline` when processing lease requests, we can also bound
the case of bad requests to a very short period after migration/restart.

- Refactor https://github.com/neondatabase/neon/pull/9024 and move
`lsn_lease_deadline` to `AttachedTenantConf` so timeline can easily
access it.
- Have separate HTTP `init_lsn_lease` and  libpq `renew_lsn_lease` API.
  - Always do LSN verification for the initial HTTP lease request.
- LSN verification for the renewal is **still done** when tenants are
not in `AttachedSingle` and we have pass the `lsn_lease_deadline`, which
give plenty of time for compute to renew the lease.
 
**neon_local**

- add and call `timeline_init_lsn_lease` mgmt_api at static endpoint
start. The initial lsn lease http request is sent when we run `cargo
neon endpoint start <static endpoint>`.


## Testing

- Extend `test_readonly_node_gc` to do pageserver restarts and
migration.

## Future Work

- The control plane should make the initial lease request through HTTP
when creating a static endpoint. This is currently only done in
`neon_local`.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-09-27 09:56:52 -04:00
Tristan Partin
fc962c9605 Use long options when calling initdb
Verbosity in this case is good when reading the code. Short options are
better when operating in an interactive shell.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-09-27 08:22:16 -05:00
Arseny Sher
c4cdfe66ac Fix flakiness of test_timeline_copy.
Timeline might be not initialized when timeline_start_lsn is
queried. Spotted by CI.
2024-09-26 19:01:45 +03:00
Alexander Bayandin
7fdf1ab5b6 CI: run compatibility tests on Postgres 17 (#9145)
## Problem

The latest storage release has generated artifacts for Postgres 17,
so we can enable compatibility tests this version

## Summary of changes
- Unskip `test_backward_compatibility` / `test_forward_compatibility` on
Postgres 17
2024-09-26 15:17:01 +01:00
Yuchen Liang
d447f49bc3 fix(pageserver): handle lsn lease requests for unnormalized lsns (#9137)
Fixes https://github.com/neondatabase/neon/issues/9098.

## Problem

See
https://github.com/neondatabase/neon/issues/9098#issuecomment-2372484969.

### Related

A similar problem happened with branch creation, which was discussed
[here](https://github.com/neondatabase/neon/pull/2143#issuecomment-1199969052)
and fixed by https://github.com/neondatabase/neon/pull/2529.

## Summary of changes

- Normalize the lsn on pageserver side upon lsn lease request, stores
the normalized LSN.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-09-25 14:57:38 +00:00
John Spray
4b711caf5e storage controller: make proxying of GETs to pageservers more robust (#9065)
## Problem

These commits are split off from
https://github.com/neondatabase/neon/pull/8971/commits where I was
fixing this to make a better scale test pass -- Vlad also independently
recognized these issues with cloudbench in
https://github.com/neondatabase/neon/issues/9062.

1. The storage controller proxies GET requests to pageservers based on
their intent, not the ground truth of where they're really attached.
2. Proxied requests can race with scheduling to tenants, resulting in
404 responses if the request hits the wrong pageserver.

Closes: https://github.com/neondatabase/neon/issues/9062

## Summary of changes

1. If a shard has a running reconciler, then use the database
generation_pageserver to decide who to proxy the request to
2. If such a request gets a 404 response and its scheduled node has
changed since the request was dispatched.
2024-09-25 13:56:39 +00:00
Heikki Linnakangas
af5c54ed14 test: Make test_lfc_resize more robust (#9117)
1. Increase statement_timeout. It defaults to 120 s, which is not quite
enough on slow or busy systems with debug build. On my laptop, the index
creation takes about 100 s. On buildfarm, we've seen failures, e.g:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9084/10997888708/index.html#suites/821f97908a487f1d7d3a2a4dd1571e99/db1834bddfe8c5b9/

2. Keep twiddling the LFC size through the whole test. Before, we would
do it for the first 10 seconds, but that only covers a small part of the
pgbench initialization phase. Change the loop so that the pgbench run
time determines how long the test runs, and we keep changing the LFC for
the whole time.

In the passing, also fix bogus test description, copy-pasted from a
completely unrelated test.
2024-09-24 23:38:16 +03:00
Heikki Linnakangas
70fe007519 test: Make test_hot_standby_feedback more forgiving of slow initialization (#9113)
Don't start waiting for the index to appear in the secondary until it
has been created in the primary. Before, if the "pgbench -i" step took
more than 60 s, we would give up.

There was a flaky test failure along those lines at:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9105/10997477941/index.html#suites/950eff205b552e248417890b8b8f189e/73cf4b5648fa6f74/
Hopefully, this avoids such failures in the future.
2024-09-24 16:41:59 +03:00
Matthias van de Meent
d865881d59 NOAI (#9084)
We can't FlushOneBuffer when we're in redo-only mode on PageServer, so
make execution of that function conditional on us not running in
pageserver walredo mode.
2024-09-23 21:16:42 +00:00
Heikki Linnakangas
263dfba6ee Add views for metrics about pageserver requests (#9008)
The metrics include a histogram of how long we need to wait for a
GetPage request, number of reconnects, and number of requests among
other things.

The metrics are not yet exported anywhere, but you can query them
manually.

Note: This does *not* bump the default version of the 'neon' extension. We
will do that later, as a separate PR. The reason is that this allows us to roll back
the compute image smoothly, if necessary. Once the image that includes the
new extension .so file with the new functions has been rolled out, and we're
confident that we don't need to roll back the image anymore, we can change
default extension version and actually start using the new functions and views.

This is what the view looks like:

```
postgres=# select * from neon_perf_counters ;
                metric                 | bucket_le |  value   
---------------------------------------+-----------+----------
 getpage_wait_seconds_count            |           |      300
 getpage_wait_seconds_sum              |           | 0.048506
 getpage_wait_seconds_bucket           |     2e-05 |        0
 getpage_wait_seconds_bucket           |     3e-05 |        0
 getpage_wait_seconds_bucket           |     6e-05 |       71
 getpage_wait_seconds_bucket           |    0.0001 |      124
 getpage_wait_seconds_bucket           |    0.0002 |      248
 getpage_wait_seconds_bucket           |    0.0003 |      279
 getpage_wait_seconds_bucket           |    0.0006 |      297
 getpage_wait_seconds_bucket           |     0.001 |      298
 getpage_wait_seconds_bucket           |     0.002 |      298
 getpage_wait_seconds_bucket           |     0.003 |      298
 getpage_wait_seconds_bucket           |     0.006 |      300
 getpage_wait_seconds_bucket           |      0.01 |      300
 getpage_wait_seconds_bucket           |      0.02 |      300
 getpage_wait_seconds_bucket           |      0.03 |      300
 getpage_wait_seconds_bucket           |      0.06 |      300
 getpage_wait_seconds_bucket           |       0.1 |      300
 getpage_wait_seconds_bucket           |       0.2 |      300
 getpage_wait_seconds_bucket           |       0.3 |      300
 getpage_wait_seconds_bucket           |       0.6 |      300
 getpage_wait_seconds_bucket           |         1 |      300
 getpage_wait_seconds_bucket           |         2 |      300
 getpage_wait_seconds_bucket           |         3 |      300
 getpage_wait_seconds_bucket           |         6 |      300
 getpage_wait_seconds_bucket           |        10 |      300
 getpage_wait_seconds_bucket           |        20 |      300
 getpage_wait_seconds_bucket           |        30 |      300
 getpage_wait_seconds_bucket           |        60 |      300
 getpage_wait_seconds_bucket           |       100 |      300
 getpage_wait_seconds_bucket           |  Infinity |      300
 getpage_prefetch_requests_total       |           |       69
 getpage_sync_requests_total           |           |      231
 getpage_prefetch_misses_total         |           |        0
 getpage_prefetch_discards_total       |           |        0
 pageserver_requests_sent_total        |           |      323
 pageserver_requests_disconnects_total |           |        0
 pageserver_send_flushes_total         |           |      323
 file_cache_hits_total                 |           |        0
(39 rows)
```
2024-09-23 21:28:50 +03:00
Heikki Linnakangas
9a32aa828d Fix init of WAL page header at startup (#8914)
If the primary is started at an LSN within the first of a 16 MB WAL
segment, the "long XLOG page header" at the beginning of the segment was
not initialized correctly. That has gone unnnoticed, because under
normal circumstances, nothing looks at the page header. The WAL that is
streamed to the safekeepers starts at the new record's LSN, not at the
beginning of the page, so that bogus page header didn't propagate
elsewhere, and a primary server doesn't normally read the WAL its
written. Which is good because the contents of the page would be bogus
anyway, as it wouldn't contain any of the records before the LSN where
the new record is written.

Except that in the following cases a primary does read its own WAL:

1. When there are two-phase transactions in prepared state at
checkpoint. The checkpointer reads the two-phase state from the
XLOG_XACT_PREPARE record, and writes it to a file in pg_twophase/.

2. Logical decoding reads the WAL starting from the replication slot's
restart LSN.

This PR fixes the problem with two-phase transactions. For that, it's
sufficient to initialize the page header correctly. The checkpointer
only needs to read XLOG_XACT_PREPARE records that were generated after
the server startup, so it's still OK that older WAL is missing / bogus.

I have not investigated if we have a problem with logical decoding,
however. Let's deal with that separately.

Special thanks to @Lzjing-1997, who independently found the same bug
and opened a PR to fix it, although I did not use that PR.
2024-09-21 04:00:38 +03:00
Yuchen Liang
1708743e78 pageserver: wait for lsn lease duration after transition into AttachedSingle (#9024)
Part of #7497, closes https://github.com/neondatabase/neon/issues/8890.

## Problem

Since leases are in-memory objects, we need to take special care of them
after pageserver restarts and while doing a live migration. The approach
we took for pageserver restart is to wait for at least lease duration
before doing first GC. We want to do the same for live migration. Since
we do not do any GC when a tenant is in `AttachedStale` or
`AttachedMulti` mode, only the transition from `AttachedMulti` to
`AttachedSingle` requires this treatment.

## Summary of changes

- Added `lsn_lease_deadline` field in `GcBlock::reasons`: the tenant is
temporarily blocked from GC until we reach the deadline. This
information does not persist to S3.
- In `GCBlock::start`, skip the GC iteration if we are blocked by the
lsn lease deadline.
- In `TenantManager::upsert_location`, set the lsn_lease_deadline to
`Instant::now() + lsn_lease_length` so the granted leases have a chance
to be renewed before we run GC for the first time after transitioned
from AttachedMulti to AttachedSingle.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2024-09-19 17:27:10 +01:00
Arseny Sher
32a0e759bd safekeeper: add wal_last_modified to debug_dump.
Adds to debug_dump option to include highest modified time among all WAL
segments. In passing replace some str with OsStr to have less unwraps.
2024-09-19 16:17:25 +03:00
Alexander Bayandin
ac6a1151ae test_postgres_version: reenable version check for prereleased versions 2024-09-18 14:51:59 +01:00
Christian Schwarz
135e7e4306 add neon_local subcommand for the broker & use that from regression tests (#8948)
There's currently no way to just start/stop broker from `neon_local`.

This PR
* adds a sub-command
* uses that sub-command from the test suite instead of the pre-existing
Python `subprocess` based approach.

Found this useful during investigation
https://github.com/neondatabase/cloud/issues/16886.
2024-09-18 09:10:27 +02:00
Arpad Müller
a1b71b73fe Rename some S3 usages to "remote storage" in exposed messages (#8999)
In exposed messages like log messages we mentioned "S3", which is not
entirely accurate as we support Azure blob storage now as well.
2024-09-17 19:15:01 +02:00
Vlad Lazar
b719d58863 storcon: forward requests from stepped down instance to the current leader (#8954)
## Problem
It turns out that we can't rely on external orchestration to promptly
route trafic to the new leader. This is downtime inducing.
Forwarding provides a safe way out.

## Safety
We forward when:
1. Request is not one of ["/control/v1/step_down", "/status", "/ready",
"/metrics"]
2. Current instance is in [`LeadershipStatus::SteppedDown`] state
3. There is a leader in the database to forward to
4. Leader from step (3) is not the current instance

If a storcon instance is persisted in the database, then we know that it
is the current leader.
There's one exception: time between handling step-down request and the
new leader updating the
database.

Let's treat the happy case first. The stepped down node does not produce
any side effects,
since all request handling happens on the leader.

As for the edge case, we are guaranteed to always have a maximum of two
running instances.
Hence, if we are in the edge case scenario the leader persisted in the
database is the
stepped down instance that received the request. Condition (4) above
covers this scenario.

## Summary of changes
* Conversion utilities for reqwest <-> hyper. I'm not happy with these,
but I don't see a better way. Open to suggestions.
* Add request forwarding logic
* Update each request handler. Again, not happy with this. If anyone
knows a nice to wrap the handlers, lmk. Me and Joonas tried :/
* Update each handler to maybe forward
* Tweak tests to showcase new behaviour
2024-09-17 09:25:42 +01:00
Heikki Linnakangas
2db840d8b8 Move a few test functions related to auth tokens to separate file (#9018)
For readability. neon_fixtures.py is huge.
2024-09-17 06:53:18 +03:00
Tristan Partin
5e16c7bb0b Generate pgbench data on the server for most tests
This should generally be faster when running tests, especially those
that run with higher scales.

Ignoring test_lfc_resize since it seems like we are hitting a query
timeout for some reason that I have yet to investigate. A little bit of
improvemnt is better than none.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-09-16 23:37:36 +01:00
Tristan Partin
5876c441ab Grant access to pg_show_replication_origin_status for neon_superuser
Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-09-16 16:38:55 +01:00
Matthias van de Meent
78938d1b59 [compute/postgres] feature: PostgreSQL 17 (#8573)
This adds preliminary PG17 support to Neon, based on RC1 / 2024-09-04
07b828e9d4

NOTICE: The data produced by the included version of the PostgreSQL fork
may not be compatible with the future full release of PostgreSQL 17 due to
expected or unexpected future changes in magic numbers and internals.
DO NOT EXPECT DATA IN V17-TENANTS TO BE COMPATIBLE WITH THE 17.0
RELEASE!

Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-09-12 23:18:41 +01:00
Heikki Linnakangas
8dc069037b Remove NeonEnvBuilder.start() function
It feels wrong to me to start() from the builder object. Surely the
thing you start is the environment itself, not its configuration.
2024-09-12 01:28:56 +03:00
Heikki Linnakangas
723c0971e8 Don't create 'empty' branch in neon_simple_env (#8965)
Now that we've given up hope on sharing the neon_simple_env between
tests, there's no reason to not use the 'main' branch directly.
2024-09-09 12:38:34 +03:00
Joonas Koivunen
3dbd34aa78 feat(storcon): forward gc blocking and unblocking (#8956)
Currently using gc blocking and unblocking with storage controller
managed pageservers is painful. Implement the API on storage controller.

Fixes: #8893
2024-09-06 22:42:55 +01:00
Alex Chi Z.
ac5815b594 feat(storage-controller): add node shards api (#8896)
For control-plane managed tenants, we have the page in the admin console
that lists all tenants on a specific pageserver. But for
storage-controller managed ones, we don't have that functionality for
now.

## Summary of changes

Adds an API that lists all shards on a given node (intention + observed)

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-09-06 14:14:21 -04:00
Arseny Sher
11cf16e3f3 safekeeper: add term_bump endpoint.
When walproposer observes now higher term it restarts instead of
crashing whole compute with PANIC; this avoids compute crash after
term_bump call. After successfull election we're still checking
last_log_term of the highest given vote to ensure basebackup is good,
and PANIC otherwise.

It will be used for migration per
035-safekeeper-dynamic-membership-change.md
and
https://github.com/neondatabase/docs/pull/21

ref https://github.com/neondatabase/neon/issues/8700
2024-09-06 19:13:50 +03:00
Arseny Sher
e287f36a05 safekeeper: fix endpoint restart immediately after xlog switch.
Check that truncation point is not from the future by comparing it with
write_record_lsn, not write_lsn, and explain that xlog switch changes
their normal order.

ref https://github.com/neondatabase/neon/issues/8911
2024-09-06 18:09:21 +03:00
Vlad Lazar
e86fef05dd storcon: track preferred AZ for each tenant shard (#8937)
## Problem
We want to do AZ aware scheduling, but don't have enough metadata.

## Summary of changes
Introduce a `preferred_az_id` concept for each managed tenant shard.

In a future PR, the scheduler will use this as a soft preference. 
The idea is to try and keep the shard attachments within the same AZ.
Under the assumption that the compute was placed in the correct AZ,
this reduces the chances of cross AZ trafic from between compute and PS.

In terms of code changes we:
1. Add a new nullable `preferred_az_id` column to the `tenant_shards`
table. Also include an in-memory counterpart.
2. Populate the preferred az on tenant creation and shard splits.
3. Add an endpoint which allows to bulk-set preferred AZs.

(3) gives us the migration path. I'll write a script which queries the
cplane db in the region and sets the preferred az of all shards with an 
active compute to the AZ of said compute. For shards without an active compute, 
I'll use the AZ of the currently attached pageserver
since this is what cplane uses now to schedule computes.
2024-09-06 13:11:17 +01:00
Christian Schwarz
850421ec06 refactor(pageserver): rely on serde derive for toml deserialization (#7656)
This PR simplifies the pageserver configuration parsing as follows:

* introduce the `pageserver_api::config::ConfigToml` type
* implement `Default` for `ConfigToml`
* use serde derive to do the brain-dead leg-work of processing the toml
document
  * use `serde(default)` to fill in default values
* in `pageserver` crate:
* use `toml_edit` to deserialize the pageserver.toml string into a
`ConfigToml`
  * `PageServerConfig::parse_and_validate` then
    * consumes the `ConfigToml`
    * destructures it exhaustively into its constituent fields
    * constructs the `PageServerConfig`

The rules are:

* in `ConfigToml`, use `deny_unknown_fields` everywhere
* static default values go in `pageserver_api`
* if there cannot be a static default value (e.g. which default IO
engine to use, because it depends on the runtime), make the field in
`ConfigToml` an `Option`
* if runtime-augmentation of a value is needed, do that in
`parse_and_validate`
* a good example is `virtual_file_io_engine` or `l0_flush`, both of
which need to execute code to determine the effective value in
`PageServerConf`

The benefits:

* massive amount of brain-dead repetitive code can be deleted
* "unused variable" compile-time errors when removing a config value,
due to the exhaustive destructuring in `parse_and_validate`
* compile-time errors guide you when adding a new config field

Drawbacks:

* serde derive is sometimes a bit too magical
* `deny_unknown_fields` is easy to miss

Future Work / Benefits:
* make `neon_local` use `pageserver_api` to construct `ConfigToml` and
write it to `pageserver.toml`
* This provides more type safety / coompile-time errors than the current
approach.

### Refs

Fixes #3682 

### Future Work

* `remote_storage` deser doesn't reject unknown fields
https://github.com/neondatabase/neon/issues/8915
* clean up `libs/pageserver_api/src/config.rs` further
  * break up into multiple files, at least for tenant config
* move `models` as appropriate / refine distinction between config and
API models / be explicit about when it's the same
  * use `pub(crate)` visibility on `mod defaults` to detect stale values
2024-09-05 14:59:49 +02:00
John Spray
1a9b54f1d9 storage controller: read from database in validate API (#8784)
## Problem

The initial implementation of the validate API treats the in-memory
generations as authoritative.
- This is true when only one storage controller is running, but if a
rogue controller was running that hadn't been shut down properly, and
some pageserver requests were routed to that bad controller, it could
incorrectly return valid=true for stale generations.
- The generation in the main in-memory map gets out of date while a live
migration is in flight, and if the origin location for the migration
tries to do some deletions even though it is in AttachedStale (for
example because it had already started compaction), these might be
wrongly validated + executed.

## Summary of changes

- Continue to do the in-memory check: if this returns valid=false it is
sufficient to reject requests.
- When valid=true, do an additional read from the database to confirm
the generation is fresh.
- Revise behavior for validation on missing shards: this used to always
return valid=true as a convenience for deletions and shard splits, so
that pageservers weren't prevented from completing any enqueued
deletions for these shards after they're gone. However, this becomes
unsafe when we consider split brain scenarios. We could reinstate this
in future if we wanted to store some tombstones for deleted shards.
- Update test_scrubber_physical_gc to cope with the behavioral change:
they must now explicitly flush the deletion queue before splits, to
avoid tripping up on deletions that are enqueued at the time of the
split (these tests assert "scrubber deletes nothing", which check fails
if the split leaves behind some remote objects that are legitimately
GC'able)
- Add `test_storage_controller_validate_during_migration`, which uses
failpoints to create a situation where incorrect generation validation
during a live migration could result in a corruption

The rate of validate calls for tenants is pretty low: it happens as a
consequence deletions from GC and compaction, which are both
concurrency-limited on the pageserver side.
2024-09-04 15:00:40 +01:00
Joonas Koivunen
7a1397cf37 storcon: boilerplate to upsert safekeeper records on deploy (#8879)
We currently do not record safekeepers in the storage controller
database. We want to migrate timelines across safekeepers eventually, so
start recording the safekeepers on deploy.

Cc: #8698
2024-09-04 10:10:05 +00:00
Vlad Lazar
75310fe441 storcon: make hb interval an argument and speed up tests (#8880)
## Problem
Each test might wait for up to 5s in order to HB the pageserver.

## Summary of changes
Make the heartbeat interval configurable and use a really tight one for
neon local => startup quicker
2024-09-04 10:09:41 +01:00
Arseny Sher
80512e2779 safekeeper: add endpoint resetting uploaded partial segment state.
Endpoint implementation sends msg to manager requesting to do the
reset. Manager stops current partial backup upload task if it exists and
performs the reset.

Also slightly tweak eviction condition: all full segments before
flush_lsn must be uploaded (and committed) and there must be only one
segment left on disk (partial). This allows to evict timelines which
started not on the first segment and didn't fill the whole
segment (previous condition wasn't good because last_removed_segno was
0).

ref https://github.com/neondatabase/neon/issues/8759
2024-09-03 17:21:36 +03:00
Christian Schwarz
bf0531d107 fixup(#8839): test_forward_compatibility needs to allow lag warning as well (#8891)
Found in
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8885/10665614629/index.html#suites/0fbaeb107ef328d03993d44a1fb15690/ea10ba1c140fba1d
2024-09-02 15:10:10 +01:00
Arpad Müller
9746b6ea31 Implement archival_config timeline endpoint in the storage controller (#8680)
Implement the timeline specific `archival_config` endpoint also in the
storage controller.

It's mostly a copy-paste of the detach handler: the task is the same: do
the same operation on all shards.

Part of #8088.
2024-09-02 13:51:45 +02:00
Arpad Müller
3ec785f30d Add safekeeper scrubber test (#8785)
The test is very rudimentary, it only checks that before and after
tenant deletion, we can run `scan_metadata` for the safekeeper node
kind. Also, we don't actually expect any uploaded data, for that we
don't have enough WAL (needs to create at least one S3-uploaded file,
the scrubber doesn't recognize partial files yet).

The `scan_metadata` scrubber subcommand is extended to support either
specifying a database connection string, which was previously the only
way, and required a database to be present, or specifying the timeline
information manually via json. This is ideal for testing scenarios
because in those, the number of timelines is usually limited,
but it is involved to spin up a database just to write the timeline
information.
2024-08-31 01:12:25 +02:00
Arpad Müller
96b5c4d33d Don't unarchive a timeline if its ancestor is archived (#8853)
If a timeline unarchival request comes in, give an error if the parent
timeline is archived. This prevents us from the situation of having an
archived timeline with children that are not archived.

Follow up of #8824

Part of #8088

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2024-08-29 12:54:02 +00:00
Conrad Ludgate
a644f01b6a proxy+pageserver: shared leaky bucket impl (#8539)
In proxy I switched to a leaky-bucket impl using the GCRA algorithm. I
figured I could share the code with pageserver and remove the
leaky_bucket crate dependency with some very basic tokio timers and
queues for fairness.

The underlying algorithm should be fairly clear how it works from the
comments I have left in the code.

---

In benchmarking pageserver, @problame found that the new implementation
fixes a getpage throughput discontinuity in pageserver under the
`pagebench get-page-latest-lsn` benchmark with the clickbench dataset
(`test_perf_olap.py`).
The discontinuity is that for any of `--num-clients={2,3,4}`, getpage
throughput remains 10k.
With `--num-clients=5` and greater, getpage throughput then jumps to the
configured 20k rate limit.
With the changes in this PR, the discontinuity is gone, and we scale
throughput linearly to `--num-clients` until the configured rate limit.

More context in
https://github.com/neondatabase/cloud/issues/16886#issuecomment-2315257641.

closes https://github.com/neondatabase/cloud/issues/16886

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
2024-08-29 11:26:52 +00:00
Christian Schwarz
c2f8fdccd7 ingest: rate-limited warning if WAL commit timestamps lags for > wait_lsn_timeout (#8839)
refs https://github.com/neondatabase/cloud/issues/13750

The logging in this commit will make it easier to detect lagging ingest.

We're trusting compute timestamps --- ideally we'd use SK timestmaps
instead.
But trusting the compute timestamp is ok for now.
2024-08-29 12:06:00 +01:00
Andrew Rudenko
acc075071d feat(compute_ctl): add periodic lease lsn requests for static computes (#7994)
Part of #7497

## Problem

Static computes pinned at some fix LSN could be created initially within
PITR interval but eventually go out it. To make sure that Static
computes are not affected by GC, we need to start using the LSN lease
API (introduced in #8084) in compute_ctl.

## Summary of changes

**compute_ctl**
- Spawn a thread for when a static compute starts to periodically ping
pageserver(s) to make LSN lease requests.
- Add `test_readonly_node_gc` to test if static compute can read all
pages without error.
  - (test will fail on main without the code change here)

**page_service**
- `wait_or_get_last_lsn` will now allow `request_lsn` less than
`latest_gc_cutoff_lsn` to proceed if there is a lease on `request_lsn`.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>
2024-08-28 19:09:26 +00:00
Christian Schwarz
9627747d35 bypass PageCache for InMemoryLayer + avoid Value::deser on L0 flush (#8537)
Part of [Epic: Bypass PageCache for user data
blocks](https://github.com/neondatabase/neon/issues/7386).

# Problem

`InMemoryLayer` still uses the `PageCache` for all data stored in the
`VirtualFile` that underlies the `EphemeralFile`.

# Background

Before this PR, `EphemeralFile` is a fancy and (code-bloated) buffered
writer around a `VirtualFile` that supports `blob_io`.

The `InMemoryLayerInner::index` stores offsets into the `EphemeralFile`.
At those offset, we find a varint length followed by the serialized
`Value`.

Vectored reads (`get_values_reconstruct_data`) are not in fact vectored
- each `Value` that needs to be read is read sequentially.

The `will_init` bit of information which we use to early-exit the
`get_values_reconstruct_data` for a given key is stored in the
serialized `Value`, meaning we have to read & deserialize the `Value`
from the `EphemeralFile`.

The L0 flushing **also** needs to re-determine the `will_init` bit of
information, by deserializing each value during L0 flush.

# Changes

1. Store the value length and `will_init` information in the
`InMemoryLayer::index`. The `EphemeralFile` thus only needs to store the
values.
2. For `get_values_reconstruct_data`:
- Use the in-memory `index` figures out which values need to be read.
Having the `will_init` stored in the index enables us to do that.
- View the EphemeralFile as a byte array of "DIO chunks", each 512 bytes
in size (adjustable constant). A "DIO chunk" is the minimal unit that we
can read under direct IO.
- Figure out which chunks need to be read to retrieve the serialized
bytes for thes values we need to read.
- Coalesce chunk reads such that each DIO chunk is only read once to
serve all value reads that need data from that chunk.
- Merge adjacent chunk reads into larger
`EphemeralFile::read_exact_at_eof_ok` of up to 128k (adjustable
constant).
3. The new `EphemeralFile::read_exact_at_eof_ok` fills the IO buffer
from the underlying VirtualFile and/or its in-memory buffer.
4. The L0 flush code is changed to use the `index` directly, `blob_io` 
5. We can remove the `ephemeral_file::page_caching` construct now.

The `get_values_reconstruct_data` changes seem like a bit overkill but
they are necessary so we issue the equivalent amount of read system
calls compared to before this PR where it was highly likely that even if
the first PageCache access was a miss, remaining reads within the same
`get_values_reconstruct_data` call from the same `EphemeralFile` page
were a hit.

The "DIO chunk" stuff is truly unnecessary for page cache bypass, but,
since we're working on [direct
IO](https://github.com/neondatabase/neon/issues/8130) and
https://github.com/neondatabase/neon/issues/8719 specifically, we need
to do _something_ like this anyways in the near future.

# Alternative Design

The original plan was to use the `vectored_blob_io` code it relies on
the invariant of Delta&Image layers that `index order == values order`.

Further, `vectored_blob_io` code's strategy for merging IOs is limited
to adjacent reads. However, with direct IO, there is another level of
merging that should be done, specifically, if multiple reads map to the
same "DIO chunk" (=alignment-requirement-sized and -aligned region of
the file), then it's "free" to read the chunk into an IO buffer and
serve the two reads from that buffer.
=> https://github.com/neondatabase/neon/issues/8719

# Testing / Performance

Correctness of the IO merging code is ensured by unit tests.

Additionally, minimal tests are added for the `EphemeralFile`
implementation and the bit-packed `InMemoryLayerIndexValue`.

Performance testing results are presented below.
All pref testing done on my M2 MacBook Pro, running a Linux VM.
It's a release build without `--features testing`.

We see definitive improvement in ingest performance microbenchmark and
an ad-hoc microbenchmark for getpage against InMemoryLayer.

```
baseline: commit 7c74112b2a origin/main
HEAD: ef1c55c52e
```

<details>

```
cargo bench --bench bench_ingest -- 'ingest 128MB/100b seq, no delta'

baseline

ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [483.50 ms 498.73 ms 522.53 ms]
                        thrpt:  [244.96 MiB/s 256.65 MiB/s 264.73 MiB/s]

HEAD

ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [479.22 ms 482.92 ms 487.35 ms]
                        thrpt:  [262.64 MiB/s 265.06 MiB/s 267.10 MiB/s]
```

</details>

We don't have a micro-benchmark for InMemoryLayer and it's quite
cumbersome to add one. So, I did manual testing in `neon_local`.

<details>

```

  ./target/release/neon_local stop
  rm -rf .neon
  ./target/release/neon_local init
  ./target/release/neon_local start
  ./target/release/neon_local tenant create --set-default
  ./target/release/neon_local endpoint create foo
  ./target/release/neon_local endpoint start foo
  psql 'postgresql://cloud_admin@127.0.0.1:55432/postgres'
psql (13.16 (Debian 13.16-0+deb11u1), server 15.7)

CREATE TABLE wal_test (
    id SERIAL PRIMARY KEY,
    data TEXT
);

DO $$
DECLARE
    i INTEGER := 1;
BEGIN
    WHILE i <= 500000 LOOP
        INSERT INTO wal_test (data) VALUES ('data');
        i := i + 1;
    END LOOP;
END $$;

-- => result is one L0 from initdb and one 137M-sized ephemeral-2

DO $$
DECLARE
    i INTEGER := 1;
    random_id INTEGER;
    random_record wal_test%ROWTYPE;
    start_time TIMESTAMP := clock_timestamp();
    selects_completed INTEGER := 0;
    min_id INTEGER := 1;  -- Minimum ID value
    max_id INTEGER := 100000;  -- Maximum ID value, based on your insert range
    iters INTEGER := 100000000;  -- Number of iterations to run
BEGIN
    WHILE i <= iters LOOP
        -- Generate a random ID within the known range
        random_id := min_id + floor(random() * (max_id - min_id + 1))::int;

        -- Select the row with the generated random ID
        SELECT * INTO random_record
        FROM wal_test
        WHERE id = random_id;

        -- Increment the select counter
        selects_completed := selects_completed + 1;

        -- Check if a second has passed
        IF EXTRACT(EPOCH FROM clock_timestamp() - start_time) >= 1 THEN
            -- Print the number of selects completed in the last second
            RAISE NOTICE 'Selects completed in last second: %', selects_completed;

            -- Reset counters for the next second
            selects_completed := 0;
            start_time := clock_timestamp();
        END IF;

        -- Increment the loop counter
        i := i + 1;
    END LOOP;
END $$;

./target/release/neon_local stop

baseline: commit 7c74112b2a origin/main

NOTICE:  Selects completed in last second: 1864
NOTICE:  Selects completed in last second: 1850
NOTICE:  Selects completed in last second: 1851
NOTICE:  Selects completed in last second: 1918
NOTICE:  Selects completed in last second: 1911
NOTICE:  Selects completed in last second: 1879
NOTICE:  Selects completed in last second: 1858
NOTICE:  Selects completed in last second: 1827
NOTICE:  Selects completed in last second: 1933

ours

NOTICE:  Selects completed in last second: 1915
NOTICE:  Selects completed in last second: 1928
NOTICE:  Selects completed in last second: 1913
NOTICE:  Selects completed in last second: 1932
NOTICE:  Selects completed in last second: 1846
NOTICE:  Selects completed in last second: 1955
NOTICE:  Selects completed in last second: 1991
NOTICE:  Selects completed in last second: 1973
```

NB: the ephemeral file sizes differ by ca 1MiB, ours being 1MiB smaller.

</details>

# Rollout

This PR changes the code in-place and  is not gated by a feature flag.
2024-08-28 18:31:41 +00:00
Heikki Linnakangas
2d10306f7a Remove support for pageserver <-> compute protocol version 1 (#8774)
Protocol version 2 has been the default for a while now, and we no
longer have any computes running in production that used protocol
version 1. This completes the migration by removing support for v1 in
both the pageserver and the compute.

See issue #6211.
2024-08-27 18:36:33 +03:00
Alexey Kondratov
9b9f90c562 fix(walproposer): Do not restart on safekeepers reordering (#8840)
## Problem

Currently, we compare `neon.safekeepers` values as is, so we
unnecessarily restart walproposer even if safekeepers set didn't change.
This leads to errors like:
```log
FATAL:  [WP] restarting walproposer to change safekeeper list
from safekeeper-8.us-east-2.aws.neon.tech:6401,safekeeper-11.us-east-2.aws.neon.tech:6401,safekeeper-10.us-east-2.aws.neon.tech:6401
to safekeeper-11.us-east-2.aws.neon.tech:6401,safekeeper-8.us-east-2.aws.neon.tech:6401,safekeeper-10.us-east-2.aws.neon.tech:6401
```

## Summary of changes

Split the GUC into the list of individual safekeepers and properly
compare. We could've done that somewhere on the upper level, e.g.,
control plane, but I think it's still better when the actual config
consumer is smarter and doesn't rely on upper levels.
2024-08-27 15:49:47 +02:00