## Problem
We don't build our Docker images for the ARM architecture, which makes it harder to run them on ARM machines (MacBooks with Apple Silicon, for example).
## Summary of changes
- Build `neondatabase/neon` for ARM and create a multi-arch image
- Build `neondatabase/compute-node-vXX` for ARM and create a multi-arch
image
- Run `test-images` job on ARM as well
## Problem
Currently, the `latest` tag is added to the images in several cases:
```
github.ref_name == 'main' || github.ref_name == 'release' || github.ref_name == 'release-proxy'
```
This leads to a race: the `latest` tag jumps back and forth depending on which branch built images most recently.
## Summary of changes
- Do not push `latest` images to prod ECR (we don't use it)
- Use `docker buildx imagetools` instead of `crane` for tagging images
- Unify the `vm-compute-node-image` job with the others and use Docker Hub as the primary source for images (syncing images to ECR)
- Tag images with `latest` only for commits in `main`
The OpenAPI description documents the error responses as follows:
- 200 is used for "detached or has been detached previously"
- 400 is used for "cannot be detached right now" -- it's an odd thing,
but good enough
- 404 is used for tenant or timeline not found
- 409 is used for "can never be detached" (root timeline)
- 500 is used for transient errors (basically ill-defined shutdown
errors)
- 503 is used for busy (other tenant ancestor detach underway,
pageserver shutdown)
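For illustration, a hedged sketch of how such a mapping could look in Rust; the error type and function are hypothetical, not the actual pageserver handler (200 is simply the success path):

```rust
use http::StatusCode;

// Hypothetical error type for ancestor detach; the variants only mirror
// the list above, the real pageserver code differs.
enum DetachError {
    NotFound,         // tenant or timeline not found
    NotDetachableNow, // cannot be detached right now
    NeverDetachable,  // root timeline can never be detached
    ShutdownError,    // transient, ill-defined shutdown errors
    Busy,             // other detach underway or pageserver shutting down
}

fn status_for(e: &DetachError) -> StatusCode {
    match e {
        DetachError::NotFound => StatusCode::NOT_FOUND,                  // 404
        DetachError::NotDetachableNow => StatusCode::BAD_REQUEST,        // 400
        DetachError::NeverDetachable => StatusCode::CONFLICT,            // 409
        DetachError::ShutdownError => StatusCode::INTERNAL_SERVER_ERROR, // 500
        DetachError::Busy => StatusCode::SERVICE_UNAVAILABLE,            // 503
    }
}
```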
Cc: #6994
## Problem
`report-benchmarks-failures` was skipped when a dependent job failed.
## Summary of changes
- Fix the if-condition by adding `&& failure()` to it, so that the job also runs when a dependent job fails.
"taking a fullbackup" is an ugly multi-liner copypasted in multiple
places, most recently with timeline ancestor detach tests. move it under
`PgBin` which is not a great place, but better than yet another utility
function.
Additionally:
- cleanup `psql_env` repetition (PgBin already configures that)
- move the backup tar comparison as a yet another free utility function
- use backup tar comparison in `test_import.py` where a size check was
done previously
- cleanup extra timeline creation from test
Cc: #7715
Using InvalidateBuffer is wrong because, if the page is concurrently dirtied, it throws away the dirty page without calling smgrwrite(). In Neon, that means the last-written LSN update for the page is missed.
In v16, use the new InvalidateVictimBuffer() function that does what
we need. In v15 and v14, backport the InvalidateVictimBuffer()
function.
Fixes issue https://github.com/neondatabase/neon/issues/7802
The process_query function in page_service.rs contained some redundant duplication. Remove it: build a vector of whitespace-separated parts at the start and then match commands with `slice::strip_prefix`. Keep `starts_with` only in the places with multiple whitespace-separated parameters, where we want to preserve grep/rg-ability.
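For illustration, a minimal sketch of the pattern (command and handler names are placeholders, not the real page_service.rs code):

```rust
// Split the query into whitespace-separated parts once, then dispatch.
fn dispatch(query: &str) -> Result<(), String> {
    let parts: Vec<&str> = query.split_whitespace().collect();
    // single-keyword command: strip the prefix, the rest are its arguments
    if let Some(args) = parts.strip_prefix(&["fullbackup"]) {
        return handle_fullbackup(args);
    }
    // command with multiple whitespace-separated parameters: keep a raw
    // `starts_with` so the full literal remains greppable with grep/rg
    if query.starts_with("lease lsn ") {
        return handle_lease_lsn(&parts[2..]);
    }
    Err(format!("unknown command: {query}"))
}

fn handle_fullbackup(_args: &[&str]) -> Result<(), String> { Ok(()) }
fn handle_lease_lsn(_args: &[&str]) -> Result<(), String> { Ok(()) }
```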
Follow-up to #7815, requested in
https://github.com/neondatabase/neon/pull/7815#pullrequestreview-2068835674
In safekeepers we have several background tasks. Previously the `WAL backup` task was spawned by another task called `wal_backup_launcher`. That task received notifications via `wal_backup_launcher_rx` and decided whether to spawn or kill the backup task associated with the timeline. This was inconvenient because every code segment that touched shared state was responsible for pushing a notification into the `wal_backup_launcher_tx` channel. It was also error-prone: notifications are easy to miss, and pushing them in the wrong order could even lead to deadlock in some cases.
We also had a similar issue with the `is_active` timeline flag. That flag was calculated from the state, and code modifying the state had to call a function to update the flag. We had a few bugs where we forgot to update `is_active` in places where it could change.
To fix these issues, this PR adds a new `timeline_manager` background task associated with each timeline. This task is responsible for managing all background tasks, including the `is_active` flag used for pushing broker messages. It subscribes to timeline state updates in a loop and decides when to spawn or kill background tasks.
There is a new structure called `TimelinesSet`. It stores a set of `Arc<Timeline>` and allows copying the set so it can be iterated without holding the mutex. This is what replaces the `is_active` flag for the broker: the broker push task now holds a reference to a `TimelinesSet` of active timelines and uses it instead of iterating over all timelines and filtering by `is_active`.
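A simplified sketch of the idea (field and method names are illustrative, not the exact safekeeper code):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

struct Timeline; // stand-in for the real safekeeper timeline type
type TimelineId = u64; // simplified; the real key is a tenant+timeline id

#[derive(Default)]
struct TimelinesSet {
    inner: Mutex<HashMap<TimelineId, Arc<Timeline>>>,
}

impl TimelinesSet {
    // The manager flips membership when the timeline becomes (in)active.
    fn set_present(&self, id: TimelineId, tli: &Arc<Timeline>, present: bool) {
        let mut set = self.inner.lock().unwrap();
        if present {
            set.insert(id, Arc::clone(tli));
        } else {
            set.remove(&id);
        }
    }

    // Broker push copies the `Arc`s out and iterates without the lock.
    fn timelines(&self) -> Vec<Arc<Timeline>> {
        self.inner.lock().unwrap().values().cloned().collect()
    }
}
```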
Also added some metrics for manager iterations and active backup tasks. Ideally the manager should not need many iterations, and not many backup tasks should be running at the same time.
Fixes #7751
---------
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
The logic added in the original PR (#7434) only worked before sudo was introduced, because 'sudo foo' only fails with NotFound if 'sudo' itself doesn't exist; if 'foo' doesn't exist, sudo fails with a normal error exit.
This means that compute_ctl may fail to restart if it exits after
successfully enabling swap.
We want to introduce a concept of temporary, expiring LSN leases. This adds both an HTTP API and a page service API for obtaining temporary LSN leases.
This adds a dummy implementation to unblock integration work on this API. A functional implementation of the lease feature is deferred to a later step.
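As a rough illustration of the concept (types and names are hypothetical, not the actual pageserver API):

```rust
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

type Lsn = u64; // stand-in for the real Lsn type

// A lease pins an LSN until a deadline; obtaining it again renews it.
// GC must not advance past the smallest non-expired leased LSN.
struct LsnLeases {
    length: Duration,
    leases: BTreeMap<Lsn, Instant>, // lsn -> expiration deadline
}

impl LsnLeases {
    fn obtain(&mut self, lsn: Lsn) -> Instant {
        let deadline = Instant::now() + self.length;
        let slot = self.leases.entry(lsn).or_insert(deadline);
        if *slot < deadline {
            *slot = deadline; // renewal extends, never shortens
        }
        *slot
    }
}
```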
Fixes #7808
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Detaching a timeline from its ancestor can leave the resulting timeline with more L0 layers than the compaction threshold. Most of the time the detached timeline has already made progress, so the next L0 -> L1 compaction happens near the original branch point and not near the last_record_lsn.
Add a test to ensure that inheriting the historical L0s does not change fullbackup. Additionally:
- add `wait_until_completed` to the test-only timeline checkpoint and compact HTTP endpoints; with `?wait_until_completed=true` the endpoints wait until the remote client has completed uploads
- for delta layers, describe L0-ness with the `/layer` endpoint
Cc: #6994
## Problem
In `test_storage_controller_many_tenants` we
[occasionally](https://neon-github-public-dev.s3.amazonaws.com/reports/main/9155810417/index.html#/testresult/8fbdf57a0e859c2d)
see it hit the retry limit on serializable transactions. That's likely due to a combination of relatively slow fsync on the Hetzner nodes running the test and the way the test does lots of parallel timeline creations, putting high load on the drive.
Running the storage controller's db with fsync=off may help here.
## Summary of changes
- Set `fsync=off` in the postgres config for the database used by the
storage controller in tests
Previously we worked around file comparison issues by dropping unlogged
relations in the pg_regress tests, but this would lead to an unnecessary
diff when compared to upstream in our Postgres fork. Instead, we can
precompute the files that we know will be different, and ignore them.
Hot standby feedback xmins can be greater than next_xid due to sparse updates of nextXid on the pageserver (to do fewer writes, it advances next xid in increments of 1024). ProcessStandbyHSFeedback ignores such xids from the future; to fix this, clamp the received xmin to next_xid.
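Schematically (a Rust-flavored sketch; the actual fix is in the C code of ProcessStandbyHSFeedback, and real xid comparisons honor wraparound):

```rust
// Clamp a feedback xmin that is "from the future" down to next_xid instead
// of discarding the feedback entirely.
fn effective_feedback_xmin(received_xmin: u32, next_xid: u32) -> u32 {
    received_xmin.min(next_xid)
}
```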
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
To avoid the pageserver GC'ing data still needed by a standby, propagate the standby apply LSN through the standby -> safekeeper -> broker -> pageserver flow and hold off GC for it. Each GC iteration resets the value, removing the horizon when the standby goes away; pushes are assumed to happen at least once between GC iterations. As a safety guard, the maximum allowed lag behind the normal GC horizon is hardcoded at 10 GB.
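Schematically (a simplified sketch, not the real GC code):

```rust
type Lsn = u64;
const MAX_STANDBY_LAG: u64 = 10 * 1024 * 1024 * 1024; // hardcoded 10 GB guard

// Hold GC back to the standby apply LSN, but never let a stuck standby drag
// the horizon more than MAX_STANDBY_LAG behind the normal cutoff.
fn effective_gc_cutoff(normal_cutoff: Lsn, standby_apply_lsn: Option<Lsn>) -> Lsn {
    match standby_apply_lsn {
        Some(lsn) => normal_cutoff
            .min(lsn)
            .max(normal_cutoff.saturating_sub(MAX_STANDBY_LAG)),
        // Value was reset by the previous GC iteration: standby is gone.
        None => normal_cutoff,
    }
}
```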
Add a test for the feature.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
Noticed this issue in staging.
When a tenant is under somewhat heavy timeline creation/deletion
thrashing, it becomes quite common for secondary downloads to encounter
404s downloading layers. This is tolerated by design, because heatmaps
are not guaranteed to be up to date with what layers/timelines actually
exist.
However, we were not updating the SecondaryProgress structure in this case, so after such a download pass we would leave a SecondaryProgress state with lower "downloaded" stats than "total" stats. This causes the storage controller to consider the secondary location ineligible for optimization actions such as those we do after shard splits.
This issue has relatively low impact because a typical tenant will eventually upload a heatmap for which we do download all the layers, thereby enabling the controller to proceed with migrations; the heavy thrashing of timeline creation/deletion is an artifact of our nightly stress tests.
## Summary of changes
- In the layer 404 case, subtract the skipped layer's stats from the totals, so that at the end of the download pass we still end up in a complete state (see the sketch below).
- When updating `last_downloaded`, do a sanity check that our progress
is complete. In debug builds, assert out if this is not the case. In
prod builds, correct the stats and log a warning.
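A simplified sketch of the accounting (field names abbreviated; the real struct is `SecondaryProgress`):

```rust
#[derive(Default, Debug)]
struct Progress {
    layers_total: u64,
    bytes_total: u64,
    layers_downloaded: u64,
    bytes_downloaded: u64,
}

impl Progress {
    // A heatmap layer that 404'd no longer exists: deduct it from the
    // totals so a finished pass can still report a complete state.
    fn on_layer_gone(&mut self, layer_bytes: u64) {
        self.layers_total = self.layers_total.saturating_sub(1);
        self.bytes_total = self.bytes_total.saturating_sub(layer_bytes);
    }

    // Called when updating `last_downloaded`: assert in debug builds,
    // correct the stats and warn in production builds.
    fn finalize(&mut self) {
        debug_assert_eq!(self.layers_downloaded, self.layers_total);
        if self.layers_downloaded != self.layers_total {
            eprintln!("warning: inconsistent secondary progress {self:?}");
            self.layers_total = self.layers_downloaded;
            self.bytes_total = self.bytes_downloaded;
        }
    }
}
```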
## Problem
We want to add alerts for when people's replication slots break, and also metrics for retained WAL so that we can warn customers when their storage gets bloated.
## Summary of changes
Adds the metrics. Addresses
https://github.com/neondatabase/neon/issues/7593
## Problem
Part of https://github.com/neondatabase/neon/issues/7462
On the metadata keyspace, vectored get does not stop if a key is not found and reads past the image layer. However, the semantics differ from single-key get: if a key does not exist in the image layer, the key either never existed or has been deleted. This pull request fixes that by recording image layer coverage during the vectored get and stopping once the full keyspace is covered by an image layer. A corresponding test case is added to ensure that generating an image layer reduces the number of delta layers accessed.
This optimization (or bug fix) also applies to rel block keyspaces. If a key is missing, we know it's missing once the first image layer is reached, so the pageserver will not attempt to read lower layers, which could otherwise incur layer downloads and evictions.
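Conceptually (a heavily simplified sketch of the idea, not the actual vectored read path):

```rust
use std::collections::BTreeSet;
use std::ops::Range;

// An image layer is a complete snapshot of the key range it covers: any
// wanted key inside that range is resolved (present in the image or
// definitely nonexistent) and must not be searched for in older layers.
fn vectored_get(mut wanted: BTreeSet<u64>, images_newest_first: &[Range<u64>]) {
    for image in images_newest_first {
        wanted.retain(|key| !image.contains(key));
        if wanted.is_empty() {
            // Full keyspace covered by image layers: stop descending
            // instead of reading (and maybe downloading) older layers.
            break;
        }
    }
}
```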
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Part of https://github.com/neondatabase/neon/issues/7462
The sparse keyspace does not generate image layers for now. This pull request adds support for generating image layers for sparse keyspaces.
## Summary of changes
* Use the scan interface to generate compaction data for sparse
keyspace.
* Track the number of delta layer reads during a scan.
* Read-trigger compaction: when a scan of the keyspace touches too many delta files, generate an image layer. There is one hard-coded threshold for now: the max number of delta layers we want to touch in a scan (see the sketch after this list).
* L0 compaction does not need to compute holes for metadata keyspace.
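A sketch of the read-trigger heuristic from the list above (names and the threshold value are illustrative):

```rust
const MAX_DELTAS_PER_SCAN: usize = 16; // the one hard-coded threshold

// Tracked during a scan of the sparse keyspace; if the scan had to open too
// many delta layers, schedule image layer creation for that keyspace.
fn should_create_image_layer(delta_layers_accessed: usize) -> bool {
    delta_layers_accessed > MAX_DELTAS_PER_SCAN
}
```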
Known issue: the scan interface currently reads past the image layer, which causes `delta_layer_accessed` to keep increasing even after image layers are generated. The pull request to fix that will be separate and orthogonal to this one.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Upgrade pgvector to 0.7.0.
This PR is based on Heikki's PR #6753 and just uses pgvector 0.7.0 instead of 0.6.0.
I have now done all planned manual tests. The pull request is ready to be reviewed and merged, and can be deployed to production together with, or after, swap enablement.
See https://github.com/neondatabase/autoscaling/issues/800
Fixes https://github.com/neondatabase/neon/issues/6516
Fixes https://github.com/neondatabase/neon/issues/7780
## Documentation input for usage recommendations
### maintenance_work_mem
In Neon, `maintenance_work_mem` is very small by default (it depends on the RAM configured for your compute, but can be as low as 64 MB).
To optimize pgvector index build time, you may have to increase it according to your working set size (the size of the tuples for vector index creation).
You can do so in the current session using
`SET maintenance_work_mem='10 GB';`
The value you choose should fit into the memory of your compute size and not exceed 50-60% of the available RAM.
The value above has been used successfully on a 7 CU endpoint.
### max_parallel_maintenance_workers
`max_parallel_maintenance_workers` is also small by default (2). For efficient parallel pgvector index creation you have to raise it with
`SET max_parallel_maintenance_workers = 7`
to make use of all the CPUs available, assuming you have configured your endpoint to use 7 CU.
## ID input for changelog
The pgvector extension in Neon has been upgraded from version 0.5.1 to version 0.7.0.
Please see https://github.com/pgvector/pgvector/ for documentation of the new capabilities in pgvector version 0.7.0.
If you have existing databases with pgvector 0.5.1 already installed, there is a slight difference in behavior in the following corner cases, even if you don't run `ALTER EXTENSION UPDATE`:
### L2 distance from NULL::vector
For the following script, comparing NULL::vector to non-null vectors, the resulting output changes:
```sql
SET enable_seqscan = off;
CREATE TABLE t (val vector(3));
INSERT INTO t (val) VALUES ('[0,0,0]'), ('[1,2,3]'), ('[1,1,1]'), (NULL);
CREATE INDEX ON t USING hnsw (val vector_l2_ops);
INSERT INTO t (val) VALUES ('[1,2,4]');
SELECT * FROM t ORDER BY val <-> (SELECT NULL::vector);
```
and now the output is
```
val
---------
[1,1,1]
[1,2,4]
[1,2,3]
[0,0,0]
(4 rows)
```
For the following script
```sql
SET enable_seqscan = off;
CREATE TABLE t (val vector(3));
INSERT INTO t (val) VALUES ('[0,0,0]'), ('[1,2,3]'), ('[1,1,1]'), (NULL);
CREATE INDEX ON t USING ivfflat (val vector_l2_ops) WITH (lists = 1);
INSERT INTO t (val) VALUES ('[1,2,4]');
SELECT * FROM t ORDER BY val <-> (SELECT NULL::vector);
```
the output now is
```
val
---------
[0,0,0]
[1,2,3]
[1,1,1]
[1,2,4]
(4 rows)
```
### Changed error messages
If you provide invalid literals for the vector datatype, you may get improved/changed error messages, for example:
```sql
neondb=> SELECT '[4e38,1]'::vector;
ERROR: "4e38" is out of range for type vector
LINE 1: SELECT '[4e38,1]'::vector;
^
```
---------
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
## Problem
The heatmap upload period is configurable, but secondary mode downloads
were using a fixed download period.
Closes: #6200
## Summary of changes
- Use the upload period in the heatmap to adjust the download period. In practice, this reduces the frequency of downloads from the current 60-second period to whatever heatmaps use, which is 5-10 minutes depending on the environment.
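A sketch of the period selection (names are illustrative):

```rust
use std::time::Duration;

const DEFAULT_DOWNLOAD_PERIOD: Duration = Duration::from_secs(60);

// Follow the upload period advertised in the downloaded heatmap; fall back
// to the old fixed period when the heatmap doesn't carry one.
fn next_download_period(heatmap_upload_period: Option<Duration>) -> Duration {
    heatmap_upload_period.unwrap_or(DEFAULT_DOWNLOAD_PERIOD)
}
```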
This is an improvement rather than an optimum: we could be smarter about periods and schedule downloads to occur around the time we expect the next upload, rather than just reusing the same period, but that's something we can address in the future if it comes up.
## Summary of changes
Updates the parquet lib. One remaining change that we need is in an open PR against upstream; hopefully we can remove the git dependency by 52.0.0: https://github.com/apache/arrow-rs/pull/5773
I'm not sure why the parquet files got a little bit bigger. I tested them and they still open fine. 🤷
As a side effect of the update, chrono was updated and added yet another deprecation warning (hence the safekeepers change).
The comment says that this checks whether there's enough space on the page for the logical message *and* an XLOG_SWITCH. So the sizes of the logical message and the XLOG_SWITCH record should be added together, not subtracted.
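Schematically, the corrected check (names illustrative; the real check is in wal_craft's lib.rs) adds the two sizes instead of subtracting:

```rust
// Both records must fit on the page together, so their sizes add; the
// previous subtraction is what overflowed in the panic below.
fn fits_on_page(free_on_page: u64, logical_msg_size: u64, xlog_switch_size: u64) -> bool {
    free_on_page >= logical_msg_size + xlog_switch_size
}
```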
I saw a panic in the test that led me to investigate and notice this
(https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7803/9142396223/index.html):
RuntimeError: Run ['/tmp/neon/bin/wal_craft', 'in-existing', 'last_wal_record_xlog_switch_ends_on_page_boundary', "host=localhost port=16165 user=cloud_admin dbname=postgres options='-cstatement_timeout=120s '"] failed:
stdout:
stderr:
thread 'main' panicked at libs/postgres_ffi/wal_craft/src/lib.rs:370:27:
attempt to subtract with overflow
stack backtrace:
0: rust_begin_unwind
at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/std/src/panicking.rs:645:5
1: core::panicking::panic_fmt
at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/panicking.rs:72:14
2: core::panicking::panic
at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/panicking.rs:145:5
3: <wal_craft::LastWalRecordXlogSwitchEndsOnPageBoundary as wal_craft::Crafter>::craft::<postgres::client::Client>
at libs/postgres_ffi/wal_craft/src/lib.rs:370:27
4: wal_craft::main::{closure#0}
at libs/postgres_ffi/wal_craft/src/bin/wal_craft.rs:21:17
5: wal_craft::main
at libs/postgres_ffi/wal_craft/src/bin/wal_craft.rs:66:47
6: <fn() -> core::result::Result<(), anyhow::Error> as core::ops::function::FnOnce<()>>::call_once
at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Part of https://github.com/neondatabase/neon/issues/7462
## Summary of changes
This pull request adds two APIs to the pageserver management API: list_aux_files and ingest_aux_files. The aux file pagebench is intended to be used on an empty timeline, because the data does not go through the safekeeper. LSNs are advanced by 8 for each ingestion to avoid invariant checks inside the pageserver.
For now, I only care about space amplification / read amplification, so the bench is designed in a very simple way: ingest 10000 files, then manually dump the layer map to analyze.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Part of https://github.com/neondatabase/neon/issues/7462
## Summary of changes
Tenant config is not persisted unless the tenant is attached on the storage controller. In this pull request, we persist the aux file policy flag in `index_part.json`.
Admins can set `switch_aux_file_policy` on the storage controller or via the pageserver API. When the first aux file is written, the write path compares the aux file policy target with the current policy. If the switch is allowed, we do it; otherwise the original policy stays in effect. The test cases show what admins can and cannot do.
The `last_aux_file_policy` is stored in `IndexPart`. Updates to the
persisted policy are done via
`schedule_index_upload_for_aux_file_policy_update`. On the write path,
the writer will update the field.
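A simplified sketch of the switch decision on the write path (variant and function names only loosely follow the real code):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum AuxFilePolicy { V1, V2, CrossValidation }

// Decide the policy when an aux file is written: before the first aux file
// exists the switch to the admin-set target is allowed; afterwards the
// already-persisted policy stays in effect.
fn policy_for_write(
    last_persisted: Option<AuxFilePolicy>, // from index_part.json
    target: Option<AuxFilePolicy>,         // switch_aux_file_policy
) -> AuxFilePolicy {
    match last_persisted {
        Some(current) => current, // not switch-able anymore
        None => {
            let chosen = target.unwrap_or(AuxFilePolicy::V1);
            // here the real code schedules
            // schedule_index_upload_for_aux_file_policy_update(chosen)
            chosen
        }
    }
}
```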
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Problem
Currently tenants are only split into multiple shards if a human being
calls the API to do it.
Issue: #7388
## Summary of changes
- Add a pageserver API for returning the top tenants by size
- Add a step to the controller's background loop: if there is no reconciliation or optimization to be done, look for things to split (sketched after this list).
- Add a test that runs pgbench on many tenants concurrently, and checks
that splitting happens as expected as tenants grow, without interrupting
the client I/O.
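A sketch of the split-candidate check in the background loop (names and the threshold are illustrative; the feature is off unless a threshold is configured):

```rust
struct TenantSize {
    tenant_id: u64,  // simplified tenant identifier
    size_bytes: u64, // size reported by the new pageserver API
    shard_count: u8,
}

// Runs only when the loop has no reconciliation or optimization to do.
fn pick_split_candidate(
    top_tenants_by_size: &[TenantSize], // from the new pageserver API
    split_threshold_bytes: Option<u64>, // None => auto-splitting disabled
) -> Option<u64> {
    let threshold = split_threshold_bytes?;
    top_tenants_by_size
        .iter()
        .find(|t| t.shard_count == 1 && t.size_bytes > threshold)
        .map(|t| t.tenant_id)
}
```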
This PR is quite basic: there is a tasklist in
https://github.com/neondatabase/neon/issues/7388 for further work. This
PR is meant to be safe (off by default), and sufficient to enable our
staging environment to run lots of sharded tenants without a human
having to set them up.
## Problem
The storage controller generally assumes that things like updating
generation numbers are atomic: it should use a strict isolation level.
## Summary of changes
- Wrap all database operations in a SERIALIZABLE transaction.
- Retry serialization failures, as these do not indicate problems and
are normal when plenty of concurrent work is happening.
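A generic sketch of the retry wrapper (the error inspection is illustrative; the real controller matches diesel's serialization-failure error):

```rust
fn with_serializable_retries<T, E>(
    max_retries: usize,
    is_serialization_failure: impl Fn(&E) -> bool,
    mut op: impl FnMut() -> Result<T, E>, // runs one SERIALIZABLE transaction
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            // Serialization failures are normal under concurrency: retry.
            Err(e) if is_serialization_failure(&e) && attempt < max_retries => {
                attempt += 1;
            }
            other => return other,
        }
    }
}
```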
Using this isolation level for all reads is overkill, but much simpler
than reasoning about it on a per-operation basis, and does not hurt
performance.
Tested this with a modified version of the storage_controller_many_tenants test with 128k shards, to check that our performance is still fine: it is.
## Problem
- When a layer with the legacy local path format was evicted and then re-downloaded, a panic occurred because the path the remote storage downloaded to didn't match the path stored in `Layer`.
- While investigating, I also realized that secondary locations would have a similar issue with evictions.
Closes: #7783
## Summary of changes
- Make the remote timeline client take local paths as input: it should not have its own ideas about local paths; instead it just uses the layer path that the `Layer` has.
- Make the secondary state store an explicit local path, populated on the scan of local disk at startup. This provides the same behavior as for `Layer`: our local_layer_path is a _default_, but the actual layer path can be anything (e.g. an old-style one).
- Add tests for both cases.
By having 100 copy operations in flight we climb up to 2500 requests per minute, or ~41/s. This is probably still less than what is allowed, but fast enough for our purposes.
Before this PR, the changed tests would overwrite the entire
`tenant_config` because `pageserver_config_override` is merged
non-recursively into the `ps_cfg`.
This meant they would override the
`PAGESERVER_DEFAULT_TENANT_CONFIG_COMPACTION_ALGORITHM`, impacting our
matrix build for `compaction_algorithm=Tiered|Legacy` in
https://github.com/neondatabase/neon/pull/7748.
I found the tests fixed in this PR using the
`NEON_PAGESERVER_PANIC_ON_UNSPECIFIED_COMPACTION_ALGORITHM` env var that
I added in #7748. Therefore, I think this is an exhaustive fix. This is
better than just searching the code base for `tenant_config`, which is
what I had sketched in #7747.
refs #7749
Tiered compaction employs two sliding windows over the keyspace: `KeyspaceWindow` for image layer generation and `Window` for delta layer generation. This applies some fixes to both windows:
* The distinction between the two windows is not very clear. Do the absolute minimum and mention where each is used in the struct's rustdoc description. Maybe we should rename them (say `WindowForImage` and `WindowForDelta`) or merge them into one window implementation.
* Require the keys to strictly increase. `accum_key_values` already combines repeated keys, so no logic is needed in `Window::feed` for the same key repeating. This is a follow-up to address the request in https://github.com/neondatabase/neon/pull/7671#pullrequestreview-2051995541
* In `choose_next_delta`, the comment claimed we use 1.25 as the factor, but it was 1.66 instead. Fix this discrepancy by making the two operations `* 5 / 4`.
As of #6202 we support `AWS_PROFILE` as well, which is more convenient. Change the docs to use it instead of `SSO_ACCOUNT_ID`. Also, remove `SSO_ACCOUNT_ID` from BucketConfig, as it is confusing to the code's reader: it's no longer the "main" way of setting up authentication for the scrubber.
This is a breaking change for the on-disk format, as we persist `sso_account_id` to disk, but it was quite inconsistent with the other methods, which are not persisted. Also, I don't think we want to support the case where one version writes the JSON and another version reads it.
Related: #7667