## Problem
We have switched to new test results and new coverage results, so no
need to collect these data in old formats.
## Summary of changes
- Remove "Upload coverage report" for old coverage report
- Remove "Store Allure test stat in the DB" for old test results format
This commit adds a function to `KeySpace` which updates a key key space
by removing all overlaps with a second key space. This can involve
splitting or removing of existing ranges.
The implementation is not particularly efficient: O(M * N * log(N))
where N is the number of ranges in the current key space and M is the
number of ranges in the key space we are checking against. In practice,
this shouldn't matter much since, in the short term, the only caller of
this function will be the vectored read path and the number of key
spaces invovled will be small. This follows from the upper bound placed
on the number of keys accepted by the vectored read path.
A couple other small utility functions are added. They'll be used by the
vectored search path as well.
When the read path needs to follow a key into the ancestor timeline, it
needs to wait for said ancestor to become active and aware of it's
branching lsn. The logic is lifted into a separate function with it's
own new error type.
This is done because the vectored read path needs the same logic. It's
also the reason for the newly introduced error type.
When we'll switch the read path to proxy into `get_vectored`, we can
remove the duplicated variants from `PageReconstructError`.
This refactoring makes it easier to experimentally replace
BACKGROUND_RUNTIME with a single-threaded runtime. Found this useful
[during benchmarking](https://github.com/neondatabase/neon/pull/6555).
## Problem
Right now if get_role_secret response wasn't cached (e.g. cache already
reached max size) it will send the second (exactly the same request).
## Summary of changes
Avoid needless request.
changes:
- two messages instead of message every second when gate was closing
- replace the gate name string by using a pointer
- slow GateGuards are likely to log who they were (see example)
example found in regress tests: <https://github.com/neondatabase/neon/pull/6542#issuecomment-1919009256>
## Problem
See https://github.com/neondatabase/cloud/issues/8673
## Summary of changes
Download missed SLRU segments from page server
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
## Problem
The `--path` argument is only used in testing, for compat tests that use
a JSON snapshot of state rather than the postgres database. In regular
deployments, it should be omitted (currently one has to specify `--path
""`)
## Summary of changes
Make `--path` optional.
Some tests which are unit test alike do not need to run on different pg
versions. Logging test is one of them which I found for unrelated
reasons.
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Depends on: https://github.com/neondatabase/neon/pull/6468
## Problem
The sharding service will be used as a "virtual pageserver" by the
control plane -- so it needs the set of pageserver APIs that the control
plane uses, and to present them under identical URLs, including prefix
(/v1).
## Summary of changes
- Add missing APIs:
- Tenant deletion
- Timeline deletion
- Node list (used in test now, later in tools)
- `/location_config` API (for migrating tenants into the sharding
service)
- Rework attachment service URLs:
- `/v1` prefix is used for pageserver-compatible APIs
- `/upcall/v1` prefix is used for APIs that are called by the pageserver
(re-attach and validate)
- `/debug/v1` prefix is used for endpoints that are for testing
- `/control/v1` prefix is used for new sharding service APIs that do not
mimic a pageserver API, such as registering and configuring nodes.
- Add test_sharding_service. The sharding service already had some
collateral coverage from its use in general tests, but this is the first
dedicated testing for it.
## Problem
See https://neondb.slack.com/archives/C04DGM6SMTM/p1706531433057289
## Summary of changes
1. Do not decrease reconnect timeout until maximal interval value (1
second) is reached
2. Compute reconnect time after connection attempt is taken to exclude
connect time itself from the interval measurement.
So now backend should not perform more than 4 reconnect attempts per
second.
But please notice that backoff is performed locally in each backend and
so if there are many active backends,
then connection (and so error) rate may be much higher.
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
We currently can't create subscriptions in PG14 and PG15 because only
superusers can, and PG16 requires adding roles to
pg_create_subscription.
## Summary of changes
I added changes to PG14 and PG15 that allow neon_superuser to bypass the
superuser requirement. For PG16, I didn't do that but added a migration
that adds neon_superuser to pg_create_subscription. Also added a test to
make sure it works.
## Problem
PR #6500 has removed the limiting by number of versions/deletions for
time travel calls. We never get informed about how many versions there
are, and thus the call would just hang without any indication of
progress.
## Summary of changes
We improve the pageserver's behaviour with large prefixes, i.e. those
with many keys, removed or currently still available.
* Add a hard limit of 100k versions/deletions. For the reasoning see
https://github.com/neondatabase/cloud/issues/8233#issuecomment-1915021625
, but TLDR it will roughly support tenants of 2 TiB size, of course
depending on general write activity and duration of the s3 retention
window. The goal is to have a limit at all so that the process doesn't
accumulate increasing numbers of versions until an eventual crash.
* Lower the RAM footprint for the `VerOrDelete` datastructure. This
means we now don't cache a lot of redundant metadata in RAM like the
owner ID. The top level datastructure's footprint goes down from 264
bytes to 80 (but it contains strings that are not counted in there).
Follow-up of #6500, part of https://github.com/neondatabase/cloud/issues/8233
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Since fdatasync is used for flushing WAL, changing file size is unsafe. Make
segment creation atomic by using tmp file + rename to avoid using partially
initialized segments.
fixes https://github.com/neondatabase/neon/issues/6402
It hanged if file size is less than of a normal segment. Normally that doesn't
happen, but it might in case of crash during segment init. We're going to fix
that half initialized segment by durably renaming it after cooking, so this fix
won't be needed, but better avoid busy loop anyway.
fixes https://github.com/neondatabase/neon/issues/6401
Before this patch, when requesting basebackup for a not-found tenant or
timeline, we'd emit an ERROR-level log entry with a huge stack trace.
See #6366 "Details" section for an example
With this patch, we log at INFO level and only a single line.
Example:
```
2024-01-19T14:16:11.479800Z INFO page_service_conn_main{peer_addr=127.0.0.1:43448}: query handler for 'basebackup d69a536d529a68fcf85bc070030cdf4b 035484e9c28d8d0138a492caadd03ffd 0/2204340 --gzip' entity not found: Tenant d69a536d529a68fcf85bc070030cdf4b not found
2024-01-19T14:19:35.807819Z INFO page_service_conn_main{peer_addr=127.0.0.1:48862}: query handler for 'basebackup d69a536d529a68fcf85bc070030cdf4a 035484e9c28d8d0138a492caadd03ffd 0/2204340 --gzip' entity not found: Timeline d69a536d529a68fcf85bc070030cdf4a/035484e9c28d8d0138a492caadd03ffd was not found
```
fixes https://github.com/neondatabase/neon/issues/6366
Changes
-------
- Change `handle_basebackup_request` to return a `QueryError`
- The new `impl From<WaitLsnError> for QueryError` is needed so the `?`
at `wait_lsn()` call in `handle_basebackup_request` works again. It's
duplicating `impl From<WaitLsnError> for PageStreamError`.
- Remove hard-to-spot conversion of `handle_basebackup_request` return
value to anyhow::Result (the place where I replaced `anyhow::Ok` with
`Result::<(), QueryError>::Ok(())`
- Add forgotten distinguished handling for "Tenant not found" case in
`impl From<GetActiveTenantError> for QueryError`
This was not at all pleasant, and I find it very hard to follow the
various error conversions.
It took me a while to spot the hard-to-spot `anyhow::Ok` thing above.
It would have been caught by the compiler if we weren't auto-converting
`anyhow::Error` into `QueryError::Other`.
We should move away from that, in my opinion, instead forcing each
`.context()` site to become `.context().map_err(QueryError::Other)`.
But that's for a future PR.
When using spawn + wait_with_output instead of
std::process::Command::output or tokio::process::Command::output we must
configure the redirection.
Fixes: #6523 by discarding the stdout completely, we only care about
stderr if any.
## Problem
Taking my ideas from https://github.com/neondatabase/neon/pull/6283 and
doing a bit less radical changes. smaller commits.
Proxy flow was quite deeply nested, which makes adding more interesting
error handling quite tricky.
## Summary of changes
I recommend reviewing commit by commit.
1. move handshake logic into a separate file
2. move passthrough logic into a separate file
3. no longer accept a closure in CancelMap session logic
4. Remove connect_to_db, copy logic into handle_client
5. flatten auth_and_wake_compute in authenticate
6. record info for link auth
## Problem
The tenants we want to recover might have tens of thousands of keys, or
more. At that point, the AWS API returns a paginated response.
## Summary of changes
Support paginated responses for `list_object_versions` requests.
Follow-up of #6155, part of https://github.com/neondatabase/cloud/issues/8233
## Problem
Creating sharded tenants will require an instance of the sharding
service -- the initial goal is to deploy one of these in a staging
region (https://github.com/neondatabase/cloud/issues/9718). It will run
as a kubernetes container, similar to the storage broker, so needs to be
built into the container image.
## Summary of changes
Add `attachment_service` binary to container image
## Problem
There's no efficient way of querying the layer map for a range.
## Summary of changes
Introduce a range query for the layer map (`LayerMap::range_search`).
There's two broad steps to it:
1. Find all coverage changes for layers that intersect the queried range
(see `LayerCoverage::range_overlaps`).
The slightly tricky part is dealing with the start of the range. We can
either be aligned with a layer or not and we need
to treat these cases differently.
2. Iterate over the coverage changes and collect the result. For this we
use a two pointer approach: the trailing pointer tracks the start of the
current range (current location in the key space) and the forward
pointer tracks the next coverage change.
Plugging the range search into the read path is deferred to a future PR.
## Performance
I adapted the layer map benchmarks on a local branch. Range searches are
between 2x and 2.5x slower than point searches. That's in line with what I
expected since we query thelayer map twice.
Since `Timeline::get` will proxy to `Timeline::get_vectored` we can
special case the one element layer map range search
at that point.
This is the "partial revert" of #6384. The summaries turned out to be
expensive due to naive vec usage, but also inconclusive because of the
additional context required. In addition to removing summary traces,
small refactoring is done.
## Problem
Measuring cardinality using logs is expensive and slow.
## Summary of changes
Implement a pre-aggregated HyperLogLog-based cardinality estimate.
HyperLogLog estimates the cardinality of a set by using the probability
that the uniform hash of a value will have a run of n 0s at the end is
`1/2^n`, therefore, having observed a run of `n` 0s suggests we have
measured `2^n` distinct values. By using multiple shards, we can use the
harmonic mean to get a more accurate estimate.
We record this into a Prometheus time-series. HyperLogLog counts can be
merged by taking the `max` of each shard. We can apply a `max_over_time`
in order to find the estimate of cardinality of distinct values over
time
## Problem
See https://neondb.slack.com/archives/C06F5UJH601/p1706373716661439
## Summary of changes
Use None instead of 0 as initial accumulator value for calculating
maximal multixact XID.
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
## Problem
http-over-sql allowes host to be in format api.aws.... however it's not
the case for the websocket flow.
## Summary of changes
Relax endpoint check for the ws serverless connections.
PR #5824 introduced the concept of io engines in pageserver and
implemented `tokio-epoll-uring` in addition to our current method,
`std-fs`.
We used `tokio-epoll-uring` in CI for a day to get more exposure to
the code. Now it's time to switch CI back so that we test with `std-fs`
as well, because that's what we're (still) using in production.
## Problem
Spun off from https://github.com/neondatabase/neon/pull/6394 -- this PR
is just the persistence parts and the changes that enable it to work
nicely
## Summary of changes
- Revert #6444 and #6450
- In neon_local, start a vanilla postgres instance for the attachment
service to use.
- Adopt `diesel` crate for database access in attachment service. This
uses raw SQL migrations as the source of truth for the schema, so it's a
soft dependency: we can switch libraries pretty easily.
- Rewrite persistence.rs to use postgres (via diesel) instead of JSON.
- Preserve JSON read+write at startup and shutdown: this enables using
the JSON format in compatibility tests, so that we don't have to commit
to our DB schema yet.
- In neon_local, run database creation + migrations before starting
attachment service
- Run the initial reconciliation in Service::spawn in the background, so
that the pageserver + attachment service don't get stuck waiting for
each other to start, when restarting both together in a test.
The top level retries weren't enough, probably because we do so many
network requests. Fine grained retries ensure that there is higher
potential for the entire test to succeed.
To demonstrate this, consider the following example: let's assume that
each request has 5% chance of failing and we do 10 requests. Then
chances of success without any retries is 0.95^10 = 0.6. With 3 top
level retries it is 1-0.4^3 = 0.936. With 3 fine grained retries it is
(1-0.05^3)^10 = 0.9988 (roundings implicit). So chances of failure are
6.4% for the top level retry vs 0.12% for the fine grained retry.
Follow-up of #6155
## Problem
- `docker.io/neondatabase/build-tools:pinned` image is frequently
outdated on Docker Hub because there's no automated way to update it.
- `update_build_tools_image.yml` workflow contains legacy roll-back
logic, which is not required anymore because it updates only a single
image.
## Summary of changes
- Make `update_build_tools_image.yml` workflow push images to both ECR
and Docker Hub
- Remove unneeded roll-back logic
## Problem
The support for sharding in the pageserver was written before
https://github.com/neondatabase/neon/pull/6205 landed, so when it landed
we couldn't directly test sharding.
## Summary of changes
- Add `test_sharding_smoke` which tests the basics of creating a
sharding tenant, creating a timeline within it, checking that data
within it is distributed.
- Add modes to pg_regress tests for running with 4 shards as well as
with 1.
This patch introduces a new set of grafana metrics for a histogram:
pageserver_get_vectored_seconds_bucket{task_kind="Compaction|PageRequestHandler"}.
While it has a `task_kind` label, only compaction and SLRU fetches are
tracked. This reduces the increase in cardinality to 24.
The metric should allow us to isolate performance regressions while the
vectorized get is being implemented. Once the implementation is
complete, it'll also allow us to quantify the improvements.
## Problem
Triggered `e2e-tests` job is not cancelled along with other jobs in a PR
if the PR get new commits. We can improve the situation by setting
`concurrency_group` for the remote workflow
(https://github.com/neondatabase/cloud/pull/9622 adds
`concurrency_group` group input to the remote workflow).
Ref https://neondb.slack.com/archives/C059ZC138NR/p1706087124297569
Cloud's part added in https://github.com/neondatabase/cloud/pull/9622
## Summary of changes
- Set `concurrency_group` parameter when triggering `e2e-tests`
- At the beginning of a CI pipeline, trigger Cloud's
`cancel-previous-in-concurrency-group.yml` workflow which cancels
previously triggered e2e-tests
refs https://github.com/neondatabase/neon/issues/6473
Before this PR, if process_started() didn't return Ok(true) until we
ran out of retries, we'd return an error but leave the process running.
Try it by adding a 20s sleep to the pageserver `main()`, e.g., right
before we claim the pidfile.
Without this PR, output looks like so:
```
(.venv) cs@devvm-mbp:[~/src/neon-work-2]: ./target/debug/neon_local start
Starting neon broker at 127.0.0.1:50051.
storage_broker started, pid: 2710939
.
attachment_service started, pid: 2710949
Starting pageserver node 1 at '127.0.0.1:64000' in ".neon/pageserver_1".....
pageserver has not started yet, continuing to wait.....
pageserver 1 start failed: pageserver did not start in 10 seconds
No process is holding the pidfile. The process must have already exited. Leave in place to avoid race conditions: ".neon/pageserver_1/pageserver.pid"
No process is holding the pidfile. The process must have already exited. Leave in place to avoid race conditions: ".neon/safekeepers/sk1/safekeeper.pid"
Stopping storage_broker with pid 2710939 immediately.......
storage_broker has not stopped yet, continuing to wait.....
neon broker stop failed: storage_broker with pid 2710939 did not stop in 10 seconds
Stopping attachment_service with pid 2710949 immediately.......
attachment_service has not stopped yet, continuing to wait.....
attachment service stop failed: attachment_service with pid 2710949 did not stop in 10 seconds
```
and we leak the pageserver process
```
(.venv) cs@devvm-mbp:[~/src/neon-work-2]: ps aux | grep pageserver
cs 2710959 0.0 0.2 2377960 47616 pts/4 Sl 14:36 0:00 /home/cs/src/neon-work-2/target/debug/pageserver -D .neon/pageserver_1 -c id=1 -c pg_distrib_dir='/home/cs/src/neon-work-2/pg_install' -c http_auth_type='Trust' -c pg_auth_type='Trust' -c listen_http_addr='127.0.0.1:9898' -c listen_pg_addr='127.0.0.1:64000' -c broker_endpoint='http://127.0.0.1:50051/' -c control_plane_api='http://127.0.0.1:1234/' -c remote_storage={local_path='../local_fs_remote_storage/pageserver'}
```
After this PR, there is no leaked process.