Postgres can write multiline logs, and they are difficult to handle
after they are mixed with other logs. This PR combines multiline logs
from postgres into a single line, where previous line breaks are
replaced with unicode zero-width spaces. Then postgres logs are written
to stderr with `PG:` prefix.
It makes it easy to distinguish postgres logs from all other compute
logs with a simple grep, e.g. `|= "PG:"`
## Problem
Current cache doesn't support any updates from the cplane.
## Summary of changes
* Added redis notifier listner.
* Added cache which can be invalidated with the notifier. If the
notifier is not available, it's just a normal ttl cache.
* Updated cplane api.
The motivation behind this organization of the data is the following:
* In the Neon data model there are projects. Projects could have
multiple branches and each branch could have more than one endpoint.
* Also there is one special `main` branch.
* Password reset works per branch.
* Allowed IPs are the same for every branch in the project (except,
maybe, the main one).
* The main branch can be changed to the other branch.
* The endpoint can be moved between branches.
Every event described above requires some special processing on the
porxy (or cplane) side.
The idea of invalidating for the project is that whenever one of the
events above is happening with the project, proxy can invalidate all
entries for the entire project.
This approach also requires some additional API change (returning
project_id inside the auth info).
PR #6266 broke the getpage_latest_lsn benchmark.
Before this patch, we'd fail with
```
not implemented: split up range
```
because `r.start = rel size key` and `r.end = rel size key + 1`.
The filtering of the key ranges in that loop is a bit ugly, but,
I measured:
* setup with 180k layer files (20k tenants * 9 layers).
* total physical size is 463GiB
* 5k tenants, the range filtering takes `0.6 seconds` on an
i3en.3xlarge.
That's a tiny fraction of the overall time it takes for pagebench to get
ready to send requests. So, this is good enough for now / there are
other bottlenecks that are bigger.
## Problem
The code for tenant create and tenant attach was just a special case of
what upsert_location does.
## Summary of changes
- Use `upsert_location` for create and attach APIs
- Clean up error handling in upsert_location so that it can generate
appropriate HTTP response codes
- Update tests that asserted the old non-idempotent behavior of attach
- Rework the `test_ignore_while_attaching` test, and fix tenant shutdown
during activation, which this test was supposed to cover, but it was
actually just waiting for activation to complete.
This uses the [newly stable](https://blog.rust-lang.org/2023/12/21/async-fn-rpit-in-traits.html)
async trait feature for three internal traits. One requires `Send`
bounds to be present so uses `impl Future<...> + Send` instead.
Advantages:
* less macro usage
* no extra boxing
Disadvantages:
* impl syntax needed for `Send` bounds is a bit more verbose (but only
required in one place)
Part of #5771
Extracted from https://github.com/neondatabase/neon/pull/6214
This PR makes the test suite sensitive to the new env var
`NEON_ENV_BUILDER_FROM_REPO_DIR_USE_OVERLAYFS`.
If it is set, `NeonEnvBuilder.from_repo_dir` uses overlayfs
to duplicate the the snapshot repo dir contents.
Since mounting requires root privileges, we use sudo to perform
the mounts. That, and macOS support, is also why copytree remains
the default.
If we ever run on a filesystem with copy reflink support, we should
consider that as an alternative.
This PR can be tried on a Linux machine on the
`test_backward_compatiblity` test, which uses `from_repo_dir`.
During a previous incident, we noticed that this particular line can be
repeatedly logged every 100ms if the memory usage continues is
persistently high enough to warrant upscaling.
Per the added comment: Ideally we'd still like to include this log line,
because it's useful information, but the simple way to include it
produces far too many log lines, and the more complex ways to
deduplicate the log lines while still including the information are
probably not worth the effort right now.
This fixes the clippy lint firing on macOS on the conversion which
needed for portability. For some reason, the logic in
https://github.com/rust-lang/rust-clippy/pull/11669 to avoid an overlap
is not working.
## Summary of changes
### RequestMonitoring
We want to add an event stream with information on each request for
easier analysis than what we can do with diagnostic logs alone
(https://github.com/neondatabase/cloud/issues/8807). This
RequestMonitoring will keep a record of the final state of a request. On
drop it will be pushed into a queue to be uploaded.
Because this context is a bag of data, I don't want this information to
impact logic of request handling. I personally think that weakly typed
data (such as all these options) makes for spaghetti code. I will
however allow for this data to impact rate-limiting and blocking of
requests, as this does not _really_ change how a request is handled.
### Parquet
Each `RequestMonitoring` is flushed into a channel where it is converted
into `RequestData`, which is accumulated into parquet files. Each file
will have a certain number of rows per row group, and several row groups
will eventually fill up the file, which we then upload to S3.
We will also upload smaller files if they take too long to construct.
This is a lightweight change to keep the scrubber providing sensible
output when using sharding.
- The timeline count was wrong when using sharding
- When checking for tenant existence, we didn't re-use results between
different shards in the same tenant
Closes: https://github.com/neondatabase/neon/issues/5929
These functions don't need updating for sharding: it's fine for them to
remain shard-naive, as they're only used in the context of dumping a
layer file. The sharding metadata doesn't live in the layer file, it
lives in the index.
## Problem
Noticed while making other changes that there were `pub` items that were
unused.
## Summary of changes
- Make everything `pub(crate)` in metrics.rs, apart from items used from
`bin/`
- Fix the timelines eviction metric: it was never being incremented
- Remove an unused ephemeral_bytes counter.
## Problem
`black` is slow sometimes, we can replace it with `ruff format` (a new
feature in 0.1.2 [0]), which produces pretty similar to black style [1].
On my local machine (MacBook M1 Pro 16GB):
```
# `black` on main
$ hyperfine "BLACK_CACHE_DIR=/dev/null poetry run black ."
Benchmark 1: BLACK_CACHE_DIR=/dev/null poetry run black .
Time (mean ± σ): 3.131 s ± 0.090 s [User: 5.194 s, System: 0.859 s]
Range (min … max): 3.047 s … 3.354 s 10 runs
```
```
# `ruff format` on the current PR
$ hyperfine "RUFF_NO_CACHE=true poetry run ruff format"
Benchmark 1: RUFF_NO_CACHE=true poetry run ruff format
Time (mean ± σ): 300.7 ms ± 50.2 ms [User: 259.5 ms, System: 76.1 ms]
Range (min … max): 267.5 ms … 420.2 ms 10 runs
```
## Summary of changes
- Replace `black` with `ruff format` everywhere
- [0] https://docs.astral.sh/ruff/formatter/
- [1] https://docs.astral.sh/ruff/formatter/#black-compatibility
Before this PR, `is_rel_block_key` returns true for the blknum
`0xffffffff`,
which is a blknum that's actually never written by Postgres, but used by
Neon Pageserver to store the relsize.
Quoting @MMeent:
> PostgreSQL can't extend the relation beyond size of 0xFFFFFFFF blocks,
> so block number 0xFFFFFFFE is the last valid block number.
This PR changes the definition of the function to exclude blknum
0xffffffff.
My motivation for doing this change is to fix the `pagebench` getpage
benchmark, which uses `is_rel_block_key` to filter the keyspace for
valid pages to request from page_service.
fixes https://github.com/neondatabase/neon/issues/6210
I checked other users of the function.
The first one is `key_is_shard0`, which already had added an exemption
for 0xffffffff. So, there's no functional change with this PR.
The second one is `DatadirModification::flush`[^1]. With this PR,
`.flush()` will skip the relsize key, whereas it didn't
before. This means we will pile up all the relsize key-value pairs
`(Key,u32)`
in `DatadirModification::pending_updates` until `.commit()` is called.
The only place I can think of where that would be a problem is if we
import from a full basebackup, and don't `.commit()` regularly,
like we currently don't do in `import_basebackup_from_tar`.
It exposes us to input-controlled allocations.
However, that was already the case for the other keys that are skipped,
so, one can argue that this change is not making the situation much
worse.
[^1]: That type's `flush()` and `commit()` methods are terribly named,
but,
that's for another time
Implement API for cloning a single timeline inside a safekeeper. Also
add API for calculating a sha256 hash of WAL, which is used in tests.
`/copy` API works by copying objects inside S3 for all but the last
segments, and the last segments are copied on-disk. A special temporary
directory is created for a timeline, because copy can take a lot of
time, especially for large timelines. After all files segments have been
prepared, this directory is mounted to the main tree and timeline is
loaded to memory.
Some caveats:
- large timelines can take a lot of time to copy, because we need to
copy many S3 segments
- caller should wait for HTTP call to finish indefinetely and don't
close the HTTP connection, because it will stop the process, which is
not continued in the background
- `until_lsn` must be a valid LSN, otherwise bad things can happen
- API will return 200 if specified `timeline_id` already exists, even if
it's not a copy
- each safekeeper will try to copy S3 segments, so it's better to not
call this API in-parallel on different safekeepers
## Problem
- When a client requests a key that isn't found in any shard on the node
(edge case that only happens if a compute's config is out of date), we
should prompt them to reconnect (as this includes a backoff), since they
will not be able to complete the request until they eventually get a
correct pageserver connection string.
- QueryError::Other is used excessively: this contains a type-ambiguous
anyhow::Error and is logged very verbosely (including backtrace).
## Summary of changes
- Introduce PageStreamError to replace use of anyhow::Error in request
handlers for getpage, etc.
- Introduce Reconnect and NotFound variants to QueryError
- Map the "shard routing error" case to PageStreamError::Reconnect ->
QueryError::Reconnect
- Update type conversions for LSN timeouts and tenant/timeline not found
errors to use PageStreamError::NotFound->QueryError::NotFound
Previously, we would wait for the LSN to be visible on whichever
timeline we happened to load at the start of the connection, then
proceed to look up the correct timeline for the key and do the read.
If the timeline holding the key was behind the timeline we used
for the LSN wait, then we might serve an apparently-successful read result
that actually contains data from behind the requested lsn.
If there is some secondary shard for a tenant on the same
node as an attached shard, the secondary shard could trip up
this code and cause page_service to incorrectly
get an error instead of finding the attached shard.
It has caveats such as creating half empty segment which can't be
offloaded. Instead we'll pursue approach of pull_timeline, seeding new state
from some peer.
## Problem
We need to add one more patch to pgbouncer (for
https://github.com/neondatabase/neon/issues/5801). I've decided to
cherry-pick all required patches to a pgbouncer fork
(`neondatabase/pgbouncer`) and use it instead.
See
https://github.com/neondatabase/pgbouncer/releases/tag/pgbouncer_1_21_0-neon-1
## Summary of changes
- Revert the previous patch (for deallocate/discard all) — the fork
already contains it.
- Remove `libssl-dev` dependency — we build pgbouncer without `openssl`
support.
- Clone git tag and build pgbouncer from source code.
## Problem
For context, this problem was observed in a research project where we
try to make neon run in multiple regions and I was asked by @hlinnaka to
make this PR.
In our project, we use the pageserver in a non-conventional way such
that we would send a larger number of requests to the pageserver than
normal (imagine postgres without the buffer pool). I measured the time
from the moment a WAL record left the safekeeper to when it reached the
pageserver
([code](e593db1f5a/pageserver/src/tenant/timeline/walreceiver/walreceiver_connection.rs (L282-L287)))
and observed that when the number of get_page_at_lsn requests was high,
the wal receiving time increased significantly (see the left side of the
graphs below).
Upon further investigation, I found that the delay was caused by this
line
d2ca410919/pageserver/src/tenant/timeline.rs (L2348)
The `get_layer_for_write` method is called for every value during WAL
ingestion and it tries to acquire layers write lock every time, thus
this results in high contention when read lock is acquired more
frequently.


## Summary of changes
It is unnecessary to call `get_layer_for_write` repeatedly for all
values in a WAL message since they would end up in the same memory layer
anyway, so I created the batched versions of `InMemoryLayer::put_value`,
`InMemoryLayer ::put_tombstone`, `Timeline::put_value`, and
`Timeline::put_tombstone`, that acquire the locks once for a batch of
values.
Additionally, `DatadirModification` is changed to store multiple
versions of uncommitted values, and `WalIngest::ingest_record()` can now
ingest records without immediately committing them.
With these new APIs, the new ingestion loop can be changed to commit for
every `ingest_batch_size` records. The `ingest_batch_size` variable is
exposed as a config. If it is set to 1 then we get the same behavior
before this change. I found that setting this value to 100 seems to work
the best, and you can see its effect on the right side of the above
graphs.
---------
Co-authored-by: John Spray <john@neon.tech>
## Problem
The version of pytest we were using emits a number of
DeprecationWarnings on latest python: these are fixed in latest release.
boto3 and python-dateutil also have deprecation warnings, but
unfortunately these aren't fixed upstream yet.
## Summary of changes
- Update pytest
- Update boto3 (this doesn't fix deprecation warnings, but by the time I
figured that out I had already done the update, and it's good hygiene
anyway)
The tool still needs a lot of work. These are the easiest fix and
feature:
- use similar adaptive config with s3 as remote_storage, use retries
- process only particular tenants
Tenants need to be from the correct region, they are not deduplicated,
but the feature is useful for re-checking small amount of tenants after
a large run.
To exercise MAX_SEND_SIZE sending from safekeeper; we've had a bug with WAL
records torn across several XLogData messages. Add failpoint to safekeeper to
slow down sending. Also check for corrupted WAL complains in standby log.
Make the test a bit simpler in passing, e.g. we don't need explicit commits as
autocommit is enabled by default.
https://neondb.slack.com/archives/C05L7D1JAUS/p1703774799114719https://github.com/neondatabase/cloud/issues/9057
Otherwise they are left orphaned when compute_ctl is terminated with a
signal. It was invisible most of the time because normally neon_local or k8s
kills postgres directly and then compute_ctl finishes gracefully. However, in
some tests compute_ctl gets stuck waiting for sync-safekeepers which
intentionally never ends because safekeepers are offline, and we want to stop
compute_ctl without leaving orphanes behind.
This is a quite rough approach which doesn't wait for children termination. A
better way would be to convert compute_ctl to async which would make waiting
easy.