## Problem
We have a number of outdated actions in the CI pipeline, and GitHub
complains about some of them.
## Summary of changes
- Update `actions/checkout@1` (a really old one) in
`vm-compute-node-image`
- Update `actions/checkout@3` in `build-build-tools-image`
- Update `docker/setup-buildx-action` in all workflows / jobs; it was
downgraded in https://github.com/neondatabase/neon/pull/7445, but it
seems to work fine now
This test failed once with `relation "test" does not exist` when trying
to run the query on the standby. It's possible that the standby is
started before the CREATE TABLE is processed in the pageserver, so the
standby opens up for queries before it has received the CREATE TABLE
transaction from the primary. To fix, wait for the standby to catch up
with the primary before running the queries.
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8025/9483658488/index.html
## Problem
Some code paths during secondary mode download return Ok() rather than
Err(UpdateError::Cancelled). This is functionally okay, but the end of
TenantDownloader::download has a sanity check that progress is 100% on
success, and it prints a "Correcting drift..." warning if not. This
warning can be emitted in a test, e.g.
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8049/9503642976/index.html#/testresult/fff1624ba6adae9e.
## Summary of changes
- In secondary download cancellation paths, return
Err(UpdateError::Cancelled) rather than Ok(), so that we drop out of the
download function and never reach the progress sanity check (sketched
below).
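A minimal sketch of the pattern; `UpdateError::Cancelled` matches the PR, the rest of the types are simplified stand-ins for the real secondary downloader code:

```rust
#[derive(Debug)]
enum UpdateError {
    Cancelled,
}

// `cancelled` stands in for a check of the real cancellation token.
fn download_step(cancelled: bool) -> Result<(), UpdateError> {
    if cancelled {
        // Previously this path returned Ok(()), so the caller fell through to
        // the end-of-download "progress must be 100%" sanity check and logged
        // "Correcting drift...". Returning an error short-circuits instead.
        return Err(UpdateError::Cancelled);
    }
    // ... perform the actual download work ...
    Ok(())
}
```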
# Problem
Suppose our vectored get starts with an inexact materialized page cache
hit ("cached lsn") that is shadowed by a newer image layer.
Like so:
```
<inmemory layers>
+-+ < delta layer
| |
-|-|----- < image layer
| |
| |
-|-|----- < cached lsn for requested key
+_+
```
The correct visitation order is
1. inmemory layers
2. delta layer records in LSN range `[image_layer.lsn,
oldest_inmemory_layer.lsn_range.start)`
3. image layer
However, when the vectored get code visits the delta layer, it
(incorrectly!) returns with state `Complete`.
The reason is that it calls `on_lsn_advanced` with
`self.lsn_range.start`, i.e., the start of the layer's own LSN range.
Instead, it should use `lsn_range.start`, i.e., the start of the LSN
range from the correct visitation order listed above.
# Solution
Use `lsn_range.start` instead of `self.lsn_range.start`.
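A deliberately simplified sketch of the fix; the type and callback names are illustrative, not the actual pageserver code:

```rust
type Lsn = u64;

struct DeltaLayerVisit {
    /// The start of the layer's own, full LSN range.
    layer_lsn_start: Lsn,
}

impl DeltaLayerVisit {
    /// `lsn_range` is the range the visitation plan asked us to cover
    /// (step 2 above); it may start higher than the layer's own range,
    /// e.g. at the image layer's LSN.
    fn visit(&self, lsn_range: std::ops::Range<Lsn>, on_lsn_advanced: &mut dyn FnMut(Lsn)) {
        // ... collect delta records within `lsn_range` ...

        // Bug: `on_lsn_advanced(self.layer_lsn_start)` claimed coverage all
        // the way down to the layer's own start, marking keys `Complete`
        // before the image layer below was visited.
        // Fix: advance only to the start of the requested range.
        on_lsn_advanced(lsn_range.start);
    }
}
```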
# Refs
discovered by & fixes https://github.com/neondatabase/neon/issues/6967
Co-authored-by: Vlad Lazar <vlad@neon.tech>
https://github.com/neondatabase/neon/issues/8002
We need a mock WAL record to make it easier to write unit tests. This
pull request adds such a record: it has a `clear` flag and an `append`
field. The tests for legacy-enhanced compaction are not modified yet and
will be part of the next pull request.
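A sketch of what such a record could look like; the field names come from the description above, and the replay function is an illustrative guess at the semantics:

```rust
struct MockWalRecord {
    /// If set, the materialized value is reset before appending.
    clear: bool,
    /// Bytes appended to the current value on replay.
    append: Vec<u8>,
}

fn replay(value: &mut Vec<u8>, record: &MockWalRecord) {
    if record.clear {
        value.clear();
    }
    value.extend_from_slice(&record.append);
}

fn main() {
    let mut value = b"foo".to_vec();
    replay(&mut value, &MockWalRecord { clear: false, append: b"bar".to_vec() });
    assert_eq!(value, b"foobar");
    replay(&mut value, &MockWalRecord { clear: true, append: b"baz".to_vec() });
    assert_eq!(value, b"baz");
}
```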
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
If a standby is started right after switching to a new WAL segment, the
request LSN in the SLRU download request would point to the beginning of
the segment (e.g. 0/5000000), while the not-modified-since LSN would
point to just after the page header (e.g. 0/5000028). It's effectively
the same position, as there cannot be any WAL records in between, but
the pageserver rightly errors out on any request where the request LSN <
not-modified-since LSN.
To fix, round down the not-modified-since LSN to the beginning of the
page, like the request LSN.
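A sketch of the rounding, assuming Postgres' standard 8 KiB WAL page size (XLOG_BLCKSZ); 0/5000028 sits just past the long page header at the start of a segment and rounds down to 0/5000000, matching the request LSN:

```rust
const XLOG_BLCKSZ: u64 = 8192;

fn round_down_to_wal_page(lsn: u64) -> u64 {
    lsn - (lsn % XLOG_BLCKSZ)
}

fn main() {
    // 0/5000028 -> 0/5000000
    assert_eq!(round_down_to_wal_page(0x0500_0028), 0x0500_0000);
}
```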
Fixes issue #8030
Some test cases add random keys to the timeline, but those keys are not
part of `collect_keyspace`, which causes compaction to remove them.
This pull request adds a field to supply extra keyspaces during unit
tests.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
The page bench test case
test_pageserver_max_throughput_getpage_at_latest_lsn had been
deactivated because it was flaky.
We now ignore copy fail error messages, as in
270d3be507/test_runner/regress/test_pageserver_getpage_throttle.py (L17-L20)
and want to reactivate it to see if it is still flaky.
## Summary of changes
- reactivate the test in CI
- ignore CopyFail error messages during page bench test cases
## Problem
The previous code would attempt to drain to unavailable or unschedulable
nodes.
## Summary of Changes
Remove such nodes from the list of nodes to fill.
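A sketch of the filtering with simplified types; the real node states live in the storage controller, and the names here are illustrative:

```rust
#[derive(PartialEq)]
enum Availability {
    Active,
    Offline,
}

#[derive(PartialEq)]
enum SchedulingPolicy {
    Active,
    Draining,
    Paused,
}

struct Node {
    availability: Availability,
    policy: SchedulingPolicy,
}

/// Only nodes that are both available and schedulable remain fill targets.
fn fill_candidates(nodes: Vec<Node>) -> Vec<Node> {
    nodes
        .into_iter()
        .filter(|n| n.availability == Availability::Active
            && n.policy == SchedulingPolicy::Active)
        .collect()
}
```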
## Problem
The merging of #7818 caused the problem with the docker-compose file.
Running docker compose is now impossible due to the unavailability of
the neon-test-extensions:latest image
## Summary of changes
Fix the problem:
Add the latest tag to the neon-test-extensions image and use the
profiles feature of the docker-compose file to avoid loading the
neon-test-extensions container if it is not needed.
- Fix the dockerhub URLs
- `neondatabase/compute-node` image has been replaced with Postgres
version specific images like `neondatabase/compute-node-v16`
- Use TAG=latest in the example, rather than some old tag. That's a
sensible default for people to copy-paste
- For convenience, use a Postgres connection URL in the `psql` example
that also includes the password. That way, there's no need to set up
.pgpass
- Update the image names in `docker ps` example to match what you get
when you follow the example
- Split the first and second parts of the test into two separate tests
- In the first test, disable the aggressive GC, compaction, and
autovacuum. They are only needed by the second test. I'd like to get the
first test to a point where the VM page is never all-zeros. Disabling
autovacuum in the first test is hopefully enough to accomplish that.
- Compare the full page images, don't skip the page header. After fixing
the previous point, there should be no discrepancy. The LSN still won't
match, though, because of commit 387a36874c.
Fixes issue https://github.com/neondatabase/neon/issues/7984
The S3 scrubber contains "S3" in its name, but we want to make it
generic in terms of which storage is used (#7547). Therefore, rename it
to "storage scrubber", following the naming scheme of already existing
components "storage broker" and "storage controller".
Part of #7547
## Problem
We need the ability to prepare a subset of storage controller managed
pageservers for decommissioning. The storage controller cannot currently
express this in terms of scheduling constraints (it's a pretty special
case, so I'm not sure it even should).
## Summary of Changes
A new `drain` command is added to `storcon_cli`. It takes a set of nodes
to drain and migrates primary attachments outside of said set. Simple
round-robin assignment is used under the assumption that nodes outside
of the draining set are evenly balanced.
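A sketch of the assignment logic under that assumption; the shard and node identifiers are illustrative, not the actual types:

```rust
/// Assign drained shards to the remaining nodes in round-robin order.
fn assign_round_robin(shards: &[u32], targets: &[&str]) -> Vec<(u32, String)> {
    shards
        .iter()
        .zip(targets.iter().cycle())
        .map(|(shard, node)| (*shard, node.to_string()))
        .collect()
}

fn main() {
    let assignments = assign_round_robin(&[1, 2, 3, 4], &["ps-a", "ps-b"]);
    assert_eq!(assignments[2], (3, "ps-a".to_string()));
}
```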
Note that secondary locations are not migrated. This is fine for
staging, but the migration API will have to be extended for prod in
order to allow migration of secondaries as well.
I've tested this out against a neon local cluster. The immediate use for
this command will be to migrate staging to ARM (aarch64) pageservers.
Related https://github.com/neondatabase/cloud/issues/14029
## Problem
The storage controller does not track the number of shards attached to a
given pageserver. This is a requirement for various scheduling
operations (e.g. draining and filling will use this to figure out if the
cluster is balanced)
## Summary of Changes
Track the number of shards attached to each node.
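A minimal sketch of the bookkeeping with simplified types; the real counters live in the storage controller's scheduler state:

```rust
use std::collections::HashMap;

type NodeId = u64;

#[derive(Default)]
struct NodeShardCounts {
    attached: HashMap<NodeId, usize>,
}

impl NodeShardCounts {
    fn on_attach(&mut self, node: NodeId) {
        *self.attached.entry(node).or_insert(0) += 1;
    }

    fn on_detach(&mut self, node: NodeId) {
        if let Some(count) = self.attached.get_mut(&node) {
            *count = count.saturating_sub(1);
        }
    }

    /// Used by draining/filling to judge whether the cluster is balanced.
    fn attached_on(&self, node: NodeId) -> usize {
        self.attached.get(&node).copied().unwrap_or(0)
    }
}
```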
Related https://github.com/neondatabase/neon/issues/7387
A demo of a building block for compaction. The GC-compaction operation
iterates over all layers below/intersecting with the GC horizon, and
does a full layer rewrite of all of them. The end result is an image
layer covering the full keyspace at the GC horizon, and a bunch of delta
layers above the GC horizon. This helps us collect the garbage of the
test_gc_feedback test case to reduce space amplification.
This operation can be manually triggered using an HTTP API or be
triggered based on some metrics. Actual method TBD.
The test is very basic and it's very likely that most of the algorithm
will be rewritten. I would like to get this merged so that I can have a
basic skeleton for the algorithm and then make incremental changes.
<img width="924" alt="image"
src="https://github.com/neondatabase/neon/assets/4198311/f3d49f4e-634f-4f56-986d-bfefc6ae6ee2">
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
We need unique tenant harness names in case you want to inspect the
results of the last failing run. We are not using any proc macros to get
the test name as there is no stable way of doing that, and there will
not be one in the future, so we need to fix these duplicates.
Also, clean up the duplicated tests to not mix `?` and `unwrap/assert`.
We've stored metadata as bytes within `index_part.json` for historical
reasons. #7693 added support for reading out a normal json serialization
of the `TimelineMetadata`.
Change the serialization to only write `TimelineMetadata` as json going
forward, keeping backward compatibility for reading the metadata as
bytes. Because of the failure to include `alias = "metadata"` in #7693,
one more follow-up is required to make the switch from the old name to
`"metadata": <json>`, but that affects only the field name in the
serialized format.
In documentation and naming, an effort is made to add enough warning
signs around TimelineMetadata so that it will receive no changes in the
future. We can add those fields to `IndexPart` directly instead.
Additionally, the path to cleaning up `metadata.rs` is documented in the
`metadata.rs` module comment. If we must extend `TimelineMetadata`
before that, the duplication suggested in [review comment] is the way to
go.
[review comment]:
https://github.com/neondatabase/neon/pull/7699#pullrequestreview-2107081558
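A heavily simplified sketch of the serde mechanism involved; the field names and type contents are illustrative, not the actual `IndexPart` definition:

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct TimelineMetadata {
    // illustrative only; the real type is frozen and much larger
    disk_consistent_lsn: u64,
}

#[derive(Serialize, Deserialize)]
struct IndexPart {
    // `alias` lets deserialization accept the previously used field name
    // while new writes use the new one; this is the mechanism the missing
    // `alias = "metadata"` would have provided in #7693.
    #[serde(alias = "old_field_name")]
    metadata: TimelineMetadata,
}
```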
## Problem
We need automated tests of extensions shipped with Neon to detect
possible problems.
## Summary of changes
A new image neon-test-extensions is added. Workflow changes to test the
shipped extensions are added as well.
Currently, the regression tests shipped with the extensions are in use.
Some extensions, namely rum, timescaledb, rdkit, postgis, pgx_ulid,
pgtap, pg_tiktoken, pg_jsonschema, pg_graphql, kq_imcx, and
wal2json_2_5, are excluded due to problems or the absence of internal
tests.
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
New features have degraded layer flushing, most recently with
#7927. Changes:
- inline `Timeline::freeze_inmem_layer` to the only caller
- carry the TimelineWriterState guard to the actual point of freezing
the layer
- this allows us to `#[cfg(feature = "testing")]` the assertion added in
#7927
- remove duplicate `flush_frozen_layer` in favor of splitting the
`flush_frozen_layers_and_wait`
- this requires starting the flush loop earlier for `checkpoint_distance
< initdb size` tests
Quite a few existing test cases create their own timelines instead of
using the default one. This pull request highlights that and hopefully
people can write simpler tests in the future.
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Yuchen Liang <70461588+yliang412@users.noreply.github.com>
As seen with the pgvector 0.7.0 index builds, we can receive large
batches of images, leading to very large L0 layers in the range of 1GB.
These large layers are produced because we are only able to roll the
layer after we have witnessed two different Lsns in a single
`DataDirModification::commit`. As single-Lsn batches of images can
span multiple `DataDirModification` lifespans, we currently rarely get
to write two different Lsns in a single `put_batch`.
The solution is to remember the TimelineWriterState until we actually
open the next layer or someone else flushes (while holding the
write_guard), instead of eagerly forgetting it.
Additional changes are test fixes that avoid the "initdb image layer
optimization" or ignore initdb layers in assertions.
Cc: #7197 because small `checkpoint_distance` will now trigger the
"initdb image layer optimization"
Reverts neondatabase/neon#7956
Rationale: compute incompatibilities
Slack thread:
https://neondb.slack.com/archives/C033RQ5SPDH/p1718011276665839?thread_ts=1718008160.431869&cid=C033RQ5SPDH
Relevant quotes from @hlinnaka
> If we go through with the current release candidate, but the compute
is pinned, people who create new projects will get that warning, which
is silly. To them, it looks like the ICU version was downgraded, because
initdb was run with newer version.
> We should upgrade the ICU version eventually. And when we do that,
users with old projects that use ICU will start to see that warning. I
think that's acceptable, as long as we do homework, notify users, and
communicate that properly.
> When we do that, we should try to upgrade the storage and compute
versions at roughly the same time.
A simple API to collect some statistics after compaction to easily
understand the result.
The tool reads the layer map, and analyzes range by range instead of
doing single-key operations, which is more efficient than running a
benchmark to collect the result. It currently computes two key metrics:
* Latest data access efficiency, which measures how many delta layers /
image layers the system needs to visit before returning any key in a
key range.
* (Approximate) PiTR efficiency, as in
https://github.com/neondatabase/neon/issues/7770, which is simply the
number of delta files in the range. The reasoning is that, assuming no
image layer is created, PiTR efficiency is simply the cost of collecting
records from the delta layers plus the replay time. The number of delta
files (or, in the future, the estimated size of reads) is a simple yet
efficient way of estimating how much effort the pageserver needs to
reconstruct a page.
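A sketch of the second metric under those assumptions, with layer descriptions simplified to key ranges and a delta flag:

```rust
struct LayerDesc {
    key_begin: u64,
    key_end: u64,
    is_delta: bool,
}

/// Approximate PiTR efficiency for a key range: the number of delta layers
/// overlapping it, i.e. how many delta files replay would have to touch.
fn pitr_efficiency(layers: &[LayerDesc], key_begin: u64, key_end: u64) -> usize {
    layers
        .iter()
        .filter(|l| l.is_delta && l.key_begin < key_end && l.key_end > key_begin)
        .count()
}
```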
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Due to the upcoming End of Life (EOL) for Debian 11, we need to upgrade
the base OS for Pageservers from Debian 11 to Debian 12 for security
reasons.
When deploying a new Pageserver on Debian 12 with the same binary built
on
Debian 11, we encountered the following errors:
```
could not execute operation: pageserver error, status: 500,
msg: Command failed with status ExitStatus(unix_wait_status(32512)):
/usr/local/neon/v16/bin/initdb: error while loading shared libraries:
libicuuc.so.67: cannot open shared object file: No such file or directory
```
and
```
could not execute operation: pageserver error, status: 500,
msg: Command failed with status ExitStatus(unix_wait_status(32512)):
/usr/local/neon/v14/bin/initdb: error while loading shared libraries:
libssl.so.1.1: cannot open shared object file: No such file or directory
```
These issues occur when creating new projects.
## Summary of changes
- To address these issues, we configured the PostgreSQL build to use
statically linked OpenSSL and ICU libraries.
- This resolves the missing shared library errors when running the
binaries on Debian 12.
Closes: https://github.com/neondatabase/cloud/issues/12648
Storing migrations in their own SQL files makes them much easier to
reason about, and allows other SQL tooling, like language servers and
formatters, to operate on them.
I also brought back the removed migrations so that we can more easily
understand what they were. I included a "-- SKIP" comment describing why
those migrations are now skipped. We no longer skip migrations by
checking whether they are empty, but instead check whether the migration
starts with "-- SKIP".
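A sketch of the new check; whether leading whitespace is tolerated is an implementation detail, and this version trims it:

```rust
fn should_skip_migration(sql: &str) -> bool {
    sql.trim_start().starts_with("-- SKIP")
}

fn main() {
    assert!(should_skip_migration("-- SKIP: superseded by a later migration\n"));
    assert!(!should_skip_migration("ALTER TABLE t ADD COLUMN c int;"));
}
```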
## Problem
We don't carry run-* labels over from external contributors' PRs to
ci-run/pr-* PRs, which is inconvenient. We need to sync labels in the
approved-for-ci-run workflow.
## Summary of changes
Added a procedure to carry over the labels from the original PR.
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
In #7927 I needed to fix this test case, but the fixes should be
possible to land irrespective of the layer ingestion code change.
The most important fix is the behavior when an image layer is found: the
assertion message formatting raises a runtime error, which obscures the
fact that we found an image layer.
We had a random sleep at the beginning of the partial backup task, which
was needed for the first partial backup deploy. It helped with gradual
upload of segments without causing network overload. Now partial backup
is deployed everywhere, so we don't need this random sleep anymore.
We also had a related issue, in which the manager task was not shut down
for a long time. The cause was this random sleep, which didn't take
timeline cancellation into account while the manager task waited for
partial backup to complete.
Fixes https://github.com/neondatabase/neon/issues/7967
Currently we warn even when going over by a single byte. Even that will
be hit much more rarely once #7927 lands, but let's get this in earlier.
The rationale for 2*checkpoint_distance: anything smaller is not really
worth a warning.
We have a global allowed_error for this warning, which cannot be
removed even with #7927, because many tests use a very small
`checkpoint_distance`.
M-series macOS has different alignments/sizes for some fields (which I
did not investigate in detail), and therefore this test cannot pass on
macOS. Fixed by using `<=` for the comparison so that we do not test for
an exact match.
observed by @yliang412
Signed-off-by: Alex Chi Z <chi@neon.tech>
This adds retries to the bulk deletion, because if each request fails
with some probability, the chance that at least one request in a chain
of requests fails grows quickly with the length of the chain.
We've had similar issues with the S3 DR tests, which in the end resulted
in adding retries at the remote_storage level. Retries at the top level
are not sufficient when one remote_storage "operation" is multiple
network requests in a trench coat, especially when there is no notion of
saving the progress: even if prior deletions had been successful, we'd
still need to get a 404 in order to continue the loop and get to the
point where we failed in the last iteration, and we might fail again
before we've even reached it.
Retries at the bottom level avoid this issue because they have a notion
of progress, and when one network operation fails, only that operation
is retried.
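A sketch of per-object retries at the bottom level; `delete_object` stands in for a single network request, and the backoff policy is elided:

```rust
async fn delete_object(_key: &str) -> Result<(), std::io::Error> {
    // stand-in for one DeleteObjects network request
    Ok(())
}

/// Each object is retried individually, so progress made on earlier keys is
/// never repeated when a later request fails.
async fn delete_objects_with_retries(
    keys: &[String],
    max_attempts: u32,
) -> Result<(), std::io::Error> {
    for key in keys {
        let mut attempt = 0;
        loop {
            match delete_object(key).await {
                Ok(()) => break,
                Err(e) if attempt + 1 < max_attempts => {
                    attempt += 1;
                    eprintln!("retrying deletion of {key} after error: {e}");
                    // exponential backoff would go here
                }
                Err(e) => return Err(e),
            }
        }
    }
    Ok(())
}
```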
First part of #7931.
Related to #7341: tenant deletion ends up shutting down timelines twice,
once before actually starting and a second time when per-timeline
deletion is requested. Shutting down TimelineMetrics twice causes
underflows. Add an atomic boolean and only do the shutdown once.
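A minimal sketch of the guard, assuming nothing about the surrounding metrics types:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

struct TimelineMetricsShutdown {
    done: AtomicBool,
}

impl TimelineMetricsShutdown {
    fn shutdown(&self) {
        // `swap` returns the previous value: the first caller sees `false`
        // and proceeds; every later caller sees `true` and returns early, so
        // the metrics are decremented exactly once.
        if self.done.swap(true, Ordering::Relaxed) {
            return;
        }
        // ... remove/decrement the per-timeline metrics here ...
    }
}
```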
Closes #7406.
## Problem
When a `get_lsn_by_timestamp` request is cancelled, an anyhow error is
raised to handle that case, and it is verbosely logged. However, we
don't benefit from the full backtrace provided by anyhow in this
case.
## Summary of changes
This PR introduces a new `ApiError` variant to handle errors caused by
cancelled requests more robustly; see the sketch after this list.
- A new enum variant `ApiError::Cancelled`
- Currently the cancelled request is mapped to status code 500.
- This error needs to be handled in the proxy's `http_util` as well.
- Added a failpoint test to simulate a cancelled `get_lsn_by_timestamp`
request.
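A sketch of the new variant and its current status mapping, simplified from the description above (the real `ApiError` has many more variants):

```rust
enum ApiError {
    /// The request was cancelled, e.g. because the client disconnected.
    Cancelled,
    InternalServerError(String),
}

fn status_code(err: &ApiError) -> u16 {
    match err {
        // Mapped to 500 for now, but without the anyhow backtrace noise.
        ApiError::Cancelled => 500,
        ApiError::InternalServerError(_) => 500,
    }
}
```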
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
## Problem
As described in #7952, the controller's attempt to reconcile a tenant
before finally deleting it can get hung up waiting for the compute
notification hook to accept updates.
The fact that we try to reconcile a tenant at all during deletion is
part of a more general design issue (#5080), where deletion was
implemented as an operation on an attached tenant, requiring the tenant
to be attached in order to delete it, which is not in principle
necessary.
Closes: #7952
## Summary of changes
- In the pageserver deletion API, only do the traditional deletion path
if the tenant is attached. If it's secondary, then tear down the
secondary location, and then do a remote delete. If it's not attached at
all, just do the remote delete.
- In the storage controller, instead of ensuring a tenant is attached
before deletion, do a best-effort detach of the tenant, and then call
into some arbitrary pageserver to issue a deletion of remote content.
The pageserver retains its existing delete behavior when invoked on
attached locations. We can remove this later, when all users of the API
are updated to do a detach-before-delete. That will enable removing the
"weird" code paths during startup that sometimes load a tenant and then
immediately delete it, and removing the deletion markers on tenants.
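A sketch of the resulting dispatch in the pageserver deletion API, with the location states simplified:

```rust
enum LocationState {
    Attached,
    Secondary,
    NotPresent,
}

fn delete_tenant(state: LocationState) {
    match state {
        LocationState::Attached => {
            // traditional deletion path on the attached location
        }
        LocationState::Secondary => {
            // tear down the secondary location, then delete remote content
        }
        LocationState::NotPresent => {
            // nothing local: just delete the remote content
        }
    }
}
```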