Commit Graph

4345 Commits

Author SHA1 Message Date
John Spray
f756b374c7 WIP diesel for persistence 2024-01-07 22:12:38 +00:00
John Spray
a1c1c1e747 wip 2024-01-07 20:59:27 +00:00
John Spray
7c3d2f0c61 wip reconciliation 2024-01-07 20:59:27 +00:00
John Spray
53c87cdb92 wip persistence 2024-01-07 20:59:27 +00:00
John Spray
bbdd27a522 tests: avoid using "create tenant" as a metric test 2024-01-07 20:59:27 +00:00
John Spray
e0cd0fac0d control_plane: selectively pass through timeline create errors 2024-01-07 20:59:27 +00:00
John Spray
ac12b7b45e control_plane: use mgmt_api::Error in client 2024-01-07 20:59:27 +00:00
John Spray
8da1034b8f tests: update test_broken_timeline to hit HTTP API directly 2024-01-07 20:59:27 +00:00
John Spray
62e48e9ebd control_plane, tests: enable running compute with explicit pageserver 2024-01-07 20:59:27 +00:00
John Spray
4a97058239 tests: update generations tests 2024-01-07 20:59:27 +00:00
John Spray
0c85cd3766 tests: use auth tokens for attachment service 2024-01-07 20:59:27 +00:00
John Spray
5ae1efaea8 pageserver_api: make ShardIdentity::stripe_size public 2024-01-07 20:59:27 +00:00
John Spray
502504d0ad pageserver_api: enrich types for use in reconciler 2024-01-07 20:59:27 +00:00
John Spray
30a46df7da tests: fix type decorations on helpers 2024-01-07 20:59:27 +00:00
John Spray
fa6ffa1782 pageserver: enrich a remote client log message 2024-01-07 20:59:27 +00:00
John Spray
078f8f955d tests: update restart+regress tests to use sharding 2024-01-07 20:59:27 +00:00
John Spray
e83c793c54 test: add test_sharding 2024-01-07 20:59:27 +00:00
John Spray
02c00b1418 compute_api: add shard_stripe_size to ComputeSpec 2024-01-07 20:59:27 +00:00
John Spray
55939ef897 test: update fixtures for sharding/attachment service 2024-01-07 20:59:27 +00:00
John Spray
7183c5719b utils: implement FromStr for NodeId 2024-01-07 20:59:27 +00:00
John Spray
6a72fdc5d8 control_plane: major rework of attachment_service 2024-01-07 20:59:27 +00:00
John Spray
10519937e6 pageserver/client: update APIs for sharding 2024-01-07 20:59:27 +00:00
John Spray
b63ba61da1 pageserver: refactor creation API (add ShardParams) 2024-01-07 20:59:26 +00:00
John Spray
af844571e6 pageserver: only upload initdb from shard 0 2024-01-07 20:59:07 +00:00
John Spray
6f4d1daaec pageserver_api: accomodate TenantShardId in location conf 2024-01-07 20:59:07 +00:00
John Spray
11d97ea65d pageserver_api: add a public version of TenantHistorySize 2024-01-07 20:59:07 +00:00
John Spray
214c01b498 pageserver/mgmt_api: include status in API errors 2024-01-07 20:59:07 +00:00
John Spray
0a2e90ab65 pageserver: move EvictionPolicy into models/
This makes the config type more convenient to work with: it
was previously carrying a serde_json::Value to force the
contained types to be opaque, which doesn't make a lot of
sense for an external interface.
2024-01-07 20:59:07 +00:00
Konstantin Knizhnik
8bd0fc4dd1 Support sharding at compute side
refer #5508
2024-01-07 20:15:05 +00:00
Arseny Sher
a41c4122e3 Don't suspend compute if there is active LR subscriber.
https://github.com/neondatabase/neon/issues/6258
2024-01-06 01:24:44 +04:00
Alexander Bayandin
7de829e475 test_runner: replace black with ruff format (#6268)
## Problem

`black` is slow sometimes, we can replace it with `ruff format` (a new
feature in 0.1.2 [0]), which produces pretty similar to black style [1].

On my local machine (MacBook M1 Pro 16GB):
```
# `black` on main
$ hyperfine "BLACK_CACHE_DIR=/dev/null poetry run black ."
Benchmark 1: BLACK_CACHE_DIR=/dev/null poetry run black .
  Time (mean ± σ):      3.131 s ±  0.090 s    [User: 5.194 s, System: 0.859 s]
  Range (min … max):    3.047 s …  3.354 s    10 runs
```
```
# `ruff format` on the current PR
$ hyperfine "RUFF_NO_CACHE=true poetry run ruff format"      
Benchmark 1: RUFF_NO_CACHE=true poetry run ruff format
  Time (mean ± σ):     300.7 ms ±  50.2 ms    [User: 259.5 ms, System: 76.1 ms]
  Range (min … max):   267.5 ms … 420.2 ms    10 runs
```

## Summary of changes
- Replace `black` with `ruff format` everywhere

- [0] https://docs.astral.sh/ruff/formatter/
- [1] https://docs.astral.sh/ruff/formatter/#black-compatibility
2024-01-05 15:35:07 +00:00
John Spray
3c560d27a8 pageserver: implement secondary-mode downloads (#6123)
Follows on from #6050 , in which we upload heatmaps. Secondary locations
will now poll those heatmaps and download layers mentioned in the
heatmap.

TODO:
- [X] ~Unify/reconcile stats for behind-schedule execution with
warn_when_period_overrun
(https://github.com/neondatabase/neon/pull/6050#discussion_r1426560695)~
- [x] Give downloads their own concurrency config independent of uploads

Deferred optimizations:
- https://github.com/neondatabase/neon/issues/6199
- https://github.com/neondatabase/neon/issues/6200

Eviction will be the next PR:
- #5342
2024-01-05 12:29:20 +00:00
Christian Schwarz
d260426a14 is_rel_block_key: exclude the relsize key (#6266)
Before this PR, `is_rel_block_key` returns true for the blknum
`0xffffffff`,
which is a blknum that's actually never written by Postgres, but used by
Neon Pageserver to store the relsize.

Quoting @MMeent:

> PostgreSQL can't extend the relation beyond size of 0xFFFFFFFF blocks,
> so block number 0xFFFFFFFE is the last valid block number.

This PR changes the definition of the function to exclude blknum
0xffffffff.

My motivation for doing this change is to fix the `pagebench` getpage
benchmark, which uses `is_rel_block_key` to filter the keyspace for
valid pages to request from page_service.
fixes https://github.com/neondatabase/neon/issues/6210

I checked other users of the function.

The first one is `key_is_shard0`, which already had added an exemption
for 0xffffffff. So, there's no functional change with this PR.

The second one is `DatadirModification::flush`[^1]. With this PR,
`.flush()` will skip the relsize key, whereas it didn't
before. This means we will pile up all the relsize key-value pairs
`(Key,u32)`
in `DatadirModification::pending_updates` until `.commit()` is called.

The only place I can think of where that would be a problem is if we
import from a full basebackup, and don't `.commit()` regularly,
like we currently don't do in `import_basebackup_from_tar`.
It exposes us to input-controlled allocations.
However, that was already the case for the other keys that are skipped,
so, one can argue that this change is not making the situation much
worse.

[^1]: That type's `flush()` and `commit()` methods are terribly named,
but,
      that's for another time
2024-01-05 11:48:06 +01:00
Arthur Petukhovsky
f3b5db1443 Add API for safekeeper timeline copy (#6091)
Implement API for cloning a single timeline inside a safekeeper. Also
add API for calculating a sha256 hash of WAL, which is used in tests.

`/copy` API works by copying objects inside S3 for all but the last
segments, and the last segments are copied on-disk. A special temporary
directory is created for a timeline, because copy can take a lot of
time, especially for large timelines. After all files segments have been
prepared, this directory is mounted to the main tree and timeline is
loaded to memory.

Some caveats:
- large timelines can take a lot of time to copy, because we need to
copy many S3 segments
- caller should wait for HTTP call to finish indefinetely and don't
close the HTTP connection, because it will stop the process, which is
not continued in the background
- `until_lsn` must be a valid LSN, otherwise bad things can happen
- API will return 200 if specified `timeline_id` already exists, even if
it's not a copy
- each safekeeper will try to copy S3 segments, so it's better to not
call this API in-parallel on different safekeepers
2024-01-04 17:40:38 +00:00
John Spray
18e9208158 pageserver: improved error handling for shard routing error, timeline not found (#6262)
## Problem

- When a client requests a key that isn't found in any shard on the node
(edge case that only happens if a compute's config is out of date), we
should prompt them to reconnect (as this includes a backoff), since they
will not be able to complete the request until they eventually get a
correct pageserver connection string.
- QueryError::Other is used excessively: this contains a type-ambiguous
anyhow::Error and is logged very verbosely (including backtrace).

## Summary of changes

- Introduce PageStreamError to replace use of anyhow::Error in request
handlers for getpage, etc.
- Introduce Reconnect and NotFound variants to QueryError
- Map the "shard routing error" case to PageStreamError::Reconnect ->
QueryError::Reconnect
- Update type conversions for LSN timeouts and tenant/timeline not found
errors to use PageStreamError::NotFound->QueryError::NotFound
2024-01-04 10:40:03 +00:00
Sasha Krassovsky
7662df6ca0 Fix minimum backoff to 1ms 2024-01-03 21:09:19 -08:00
John Spray
c119af8ddd pageserver: run at least 2 background task threads
Otherwise an assertion in CONCURRENT_BACKGROUND_TASKS will
trip if you try to run the pageserver on a single core.
2024-01-03 14:22:40 +00:00
John Spray
a2e083ebe0 pageserver: make walredo shard-aware
This does not have a functional impact, but enables all
the logging in this code to include the shard_id
label.
2024-01-03 14:22:40 +00:00
John Spray
73a944205b pageserver: log details on shard routing error 2024-01-03 14:22:40 +00:00
John Spray
34ebfbdd6f pageserver: fix handling getpage with multiple shards on one node
Previously, we would wait for the LSN to be visible on whichever
timeline we happened to load at the start of the connection, then
proceed to look up the correct timeline for the key and do the read.

If the timeline holding the key was behind the timeline we used
for the LSN wait, then we might serve an apparently-successful read result
that actually contains data from behind the requested lsn.
2024-01-03 14:22:40 +00:00
John Spray
ef7c9c2ccc pageserver: fix active tenant lookup hitting secondaries with sharding
If there is some secondary shard for a tenant on the same
node as an attached shard, the secondary shard could trip up
this code and cause page_service to incorrectly
get an error instead of finding the attached shard.
2024-01-03 14:22:40 +00:00
John Spray
6c79e12630 pageserver: drop unwanted keys during compaction after split 2024-01-03 14:22:40 +00:00
John Spray
753d97bd77 pageserver: don't delete ancestor shard layers 2024-01-03 14:22:40 +00:00
John Spray
edc962f1d7 test_runner: test_issue_5878 log allow list (#6259)
## Problem


https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6254/7388706419/index.html#suites/5a4b8734277a9878cb429b80c314f470/e54c4f6f6ed22672

## Summary of changes

Permit the log message: because the test helper's detach function
increments the generation number, a detach/attach cycle can cause the
error if the test runner node is slow enough for the opportunistic
deletion queue flush on detach not to complete by the time we call
attach.
2024-01-03 14:22:17 +00:00
Arseny Sher
65b4e6e7d6 Remove empty safekeeper init since truncateLsn.
It has caveats such as creating half empty segment which can't be
offloaded. Instead we'll pursue approach of pull_timeline, seeding new state
from some peer.
2024-01-03 18:20:19 +04:00
Alexander Bayandin
17b256679b vm-image-spec: build pgbouncer from Neon's fork (#6249)
## Problem

We need to add one more patch to pgbouncer (for
https://github.com/neondatabase/neon/issues/5801). I've decided to
cherry-pick all required patches to a pgbouncer fork
(`neondatabase/pgbouncer`) and use it instead.

See
https://github.com/neondatabase/pgbouncer/releases/tag/pgbouncer_1_21_0-neon-1

## Summary of changes
- Revert the previous patch (for deallocate/discard all) — the fork
already contains it.
- Remove `libssl-dev` dependency — we build pgbouncer without `openssl`
support.
- Clone git tag and build pgbouncer from source code.
2024-01-03 13:02:04 +00:00
John Spray
673a865055 tests: tolerate 304 when evicting layers (#6261)
In tests that evict layers, explicit eviction can race with automatic
eviction of the same layer and result in a 304
2024-01-03 11:50:58 +00:00
Cuong Nguyen
fb518aea0d Add batch ingestion mechanism to avoid high contention (#5886)
## Problem
For context, this problem was observed in a research project where we
try to make neon run in multiple regions and I was asked by @hlinnaka to
make this PR.

In our project, we use the pageserver in a non-conventional way such
that we would send a larger number of requests to the pageserver than
normal (imagine postgres without the buffer pool). I measured the time
from the moment a WAL record left the safekeeper to when it reached the
pageserver
([code](e593db1f5a/pageserver/src/tenant/timeline/walreceiver/walreceiver_connection.rs (L282-L287)))
and observed that when the number of get_page_at_lsn requests was high,
the wal receiving time increased significantly (see the left side of the
graphs below).

Upon further investigation, I found that the delay was caused by this
line


d2ca410919/pageserver/src/tenant/timeline.rs (L2348)

The `get_layer_for_write` method is called for every value during WAL
ingestion and it tries to acquire layers write lock every time, thus
this results in high contention when read lock is acquired more
frequently.


![Untitled](https://github.com/neondatabase/neon/assets/6244849/85460f4d-ead1-4532-bc64-736d0bfd7f16)

![Untitled2](https://github.com/neondatabase/neon/assets/6244849/84199ab7-5f0e-413b-a42b-f728f2225218)

## Summary of changes

It is unnecessary to call `get_layer_for_write` repeatedly for all
values in a WAL message since they would end up in the same memory layer
anyway, so I created the batched versions of `InMemoryLayer::put_value`,
`InMemoryLayer ::put_tombstone`, `Timeline::put_value`, and
`Timeline::put_tombstone`, that acquire the locks once for a batch of
values.

Additionally, `DatadirModification` is changed to store multiple
versions of uncommitted values, and `WalIngest::ingest_record()` can now
ingest records without immediately committing them.

With these new APIs, the new ingestion loop can be changed to commit for
every `ingest_batch_size` records. The `ingest_batch_size` variable is
exposed as a config. If it is set to 1 then we get the same behavior
before this change. I found that setting this value to 100 seems to work
the best, and you can see its effect on the right side of the above
graphs.

---------

Co-authored-by: John Spray <john@neon.tech>
2024-01-03 10:41:58 +00:00
John Spray
42f41afcbd tests: update pytest and boto3 dependencies (#6253)
## Problem

The version of pytest we were using emits a number of
DeprecationWarnings on latest python: these are fixed in latest release.

boto3 and python-dateutil also have deprecation warnings, but
unfortunately these aren't fixed upstream yet.



## Summary of changes

- Update pytest
- Update boto3 (this doesn't fix deprecation warnings, but by the time I
figured that out I had already done the update, and it's good hygiene
anyway)
2024-01-03 10:36:53 +00:00
Arseny Sher
f71110383c Remove second check for max_slot_wal_keep_size download size.
Already checked in GetLogRepRestartLSN, a rebase artifact.
2024-01-03 13:13:32 +04:00