preliminary for #9270
The auth::Backend didn't need to be in the mega ProxyConfig object, so I
split it off and passed it manually in the few places it was necessary.
I've also refined some of the uses of config I saw while doing this
small refactor.
I've also followed the trend and make the console redirect backend it's
own struct, same as LocalBackend and ControlPlaneBackend.
## Problem
Action `run-python-test-set` fails if it is not used for `regress_tests`
on release PR, because it expects
`test_compatibility.py::test_create_snapshot` to generate a snapshot,
and the test exists only in `regress_tests` suite.
For example, in https://github.com/neondatabase/neon/pull/9291
[`test-postgres-client-libs`](https://github.com/neondatabase/neon/actions/runs/11209615321/job/31155111544)
job failed.
## Summary of changes
- Add `skip-if-does-not-exist` input to `.github/actions/upload` action
(the same way we do for `.github/actions/download`)
- Set `skip-if-does-not-exist=true` for "Upload compatibility snapshot"
step in `run-python-test-set` action
## Problem
We faced the problem of incompatibility of the different components of
different versions.
This should be detected automatically to prevent production bugs.
## Summary of changes
The test for this situation was implemented
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Problem
Fixes#8340
## Summary of changes
Introduced ErrorKind::quota to handle quota-related errors
## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
This job takes an extraordinary amount of time for what I understand it
to do. The obvious win is caching dependencies.
Rory disabled caching in cd5732d9d8.
I assume this was to get gen3 runners up and running.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
This test restarts services in an undefined order (whatever neon_local
does), which means we should be tolerant of warnings that come from
restarting the storage controller while a pageserver is running.
We can see failures with warnings from dropped requests, e.g.
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9307/11229000712/index.html#/testresult/d33d5cb206331e28
```
WARN request{method=GET path=/v1/location_config request_id=b7dbda15-6efb-4610-8b19-a3772b65455f}: request was dropped before completing\n')
```
## Summary of changes
- allow-list the `request was dropped before completing` message on
pageservers before restarting services
When there are no timelines in remote storage, the storage scrubber
would incorrectly trip an assertion with "Must be set if results are
present", referring to the last processed tenant ID. When there are no
timelines we don't expect there to be a tenant ID either.
The assertion was introduced in 37aa6fd.
Only apply the assertion when any timelines are present.
## Problem
Storage controller `/control` API mostly requires admin tokens, for
interactive use by engineers. But for endpoints used by scripts, we
should not require admin tokens.
Discussion at
https://neondb.slack.com/archives/C033RQ5SPDH/p1728550081788989?thread_ts=1728548232.265019&cid=C033RQ5SPDH
## Summary of changes
- Introduce the 'infra' JWT scope, which was not previously used in the
neon repo
- For pageserver & safekeeper node registrations, require infra scope
instead of admin
Note that admin will still work, as the controller auth checks permit
admin tokens for all endpoints irrespective of what scope they require.
This seems to paper over a behavioral difference in Python 3.9 and
Python 3.12 with how dataclasses work with mutable variables. On Python
3.12, I get the following error:
ValueError: mutable default <class 'dict'> for field EXTRACTORS is not allowed: use default_factory
This obviously doesn't occur in our testing environment. When I do what
the error tells me, EXTRACTORS doesn't seem to exist as an attribute on
the class in at least Python 3.9.
The solution provided in this commit seems like the least amount of
friction to keep the wheels turning.
Signed-off-by: Tristan Partin <tristan@neon.tech>
This is simpler than using subprocess.
One difference is in how moto's log output is now collected. Previously,
moto's logs went to stderr, and were collected and printed at the end of
the test by pytest, like this:
2024-10-07T22:45:12.3705222Z ----------------------------- Captured stderr call -----------------------------
2024-10-07T22:45:12.3705577Z 127.0.0.1 - - [07/Oct/2024 22:35:14] "PUT /pageserver-test-deletion-queue-2e6efa8245ec92a37a07004569c29eb7 HTTP/1.1" 200 -
2024-10-07T22:45:12.3706181Z 127.0.0.1 - - [07/Oct/2024 22:35:15] "GET /pageserver-test-deletion-queue-2e6efa8245ec92a37a07004569c29eb7/?list-type=2&delimiter=/&prefix=/tenants/43da25eac0f41412696dd31b94dbb83c/timelines/ HTTP/1.1" 200 -
2024-10-07T22:45:12.3706894Z 127.0.0.1 - - [07/Oct/2024 22:35:16] "PUT /pageserver-test-deletion-queue-2e6efa8245ec92a37a07004569c29eb7//tenants/43da25eac0f41412696dd31b94dbb83c/timelines/eabba5f0c1c72c8656d3ef1d85b98c1d/initdb.tar.zst?x-id=PutObject HTTP/1.1" 200 -
Note the timestamps: the timestamp at the beginning of the line is the
time that the stderr was dumped, i.e. the end of the test, which makes
those timestamps rather useless. The timestamp in the middle of the line
is when the operation actually happened, but it has only 1 s
granularity.
With this change, moto's log lines are printed in the "live log call"
section, as they happen, which makes the timestamps more useful:
2024-10-08 12:12:31.129 INFO [_internal.py:97] 127.0.0.1 - - [08/Oct/2024 12:12:31] "GET /pageserver-test-deletion-queue-e24e7525d437e1874d8a52030dcabb4f/?list-type=2&delimiter=/&prefix=/tenants/7b6a16b1460eda5204083fba78bc360f/timelines/ HTTP/1.1" 200 -
2024-10-08 12:12:32.612 INFO [_internal.py:97] 127.0.0.1 - - [08/Oct/2024 12:12:32] "PUT /pageserver-test-deletion-queue-e24e7525d437e1874d8a52030dcabb4f//tenants/7b6a16b1460eda5204083fba78bc360f/timelines/7ab4c2b67fa8c712cada207675139877/initdb.tar.zst?x-id=PutObject HTTP/1.1" 200 -
## Problem
We need a way to incrementally switch to direct IO. During the rollout
we might want to switch to O_DIRECT on image and delta layer read path
first before others.
## Summary of changes
- Revisited and simplified direct io config in `PageserverConf`.
- We could add a fallback mode for open, but for read there isn't a
reasonable alternative (without creating another buffered virtual file).
- Added a wrapper around `VirtualFile`, current implementation become
`VirtualFileInner`
- Use `open_v2`, `create_v2`, `open_with_options_v2` when we want to use
the IO mode specified in PS config.
- Once we onboard all IO through VirtualFile using this new API, we will
delete the old code path.
- Make io mode live configurable for benchmarking.
- Only guaranteed for files opened after the config change, so do it
before the experiment.
As an example, we are using `open_v2` with
`virtual_file::IoMode::Direct` in
https://github.com/neondatabase/neon/pull/9169
We also remove `io_buffer_alignment` config in
a04cfd754b and use it as a compile time
constant. This way we don't have to carry the alignment around or make
frequent call to retrieve this information from the static variable.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Add /installed_extensions endpoint to collect
statistics about extension usage.
It returns a list of installed extensions in the format:
```json
{
"extensions": [
{
"extname": "extension_name",
"versions": ["1.0", "1.1"],
"n_databases": 5,
}
]
}
```
---------
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
## Problem
The path to TPC-H queries was incorrectly changed in #9306.
This path is used for `test_tpch` parameterization, so all perf tests
started to fail:
```
==================================== ERRORS ====================================
__________ ERROR collecting test_runner/performance/test_perf_olap.py __________
test_runner/performance/test_perf_olap.py:205: in <module>
@pytest.mark.parametrize("query", tpch_queuies())
test_runner/performance/test_perf_olap.py:196: in tpch_queuies
assert queries_dir.exists(), f"TPC-H queries dir not found: {queries_dir}"
E AssertionError: TPC-H queries dir not found: /__w/neon/neon/test_runner/performance/performance/tpc-h/queries
E assert False
E + where False = <bound method Path.exists of PosixPath('/__w/neon/neon/test_runner/performance/performance/tpc-h/queries')>()
E + where <bound method Path.exists of PosixPath('/__w/neon/neon/test_runner/performance/performance/tpc-h/queries')> = PosixPath('/__w/neon/neon/test_runner/performance/performance/tpc-h/queries').exists
```
## Summary of changes
- Fix the path to tpc-h queries
## Problem
Previously, observed state updates from the reconciler may have
clobbered inline changes made to the observed state by other code paths.
## Summary of changes
Model observed state changes from reconcilers as deltas. This means that
we only update what has changed. Handling for node going off-line concurrently
during the reconcile is also added: set observed state to None in such cases to
respect the convention.
Closes https://github.com/neondatabase/neon/issues/9124
Instead of printing the full absolute path for every file, print just
the filenames.
Before:
2024-10-08 13:19:39.98 INFO [test_pageserver_generations.py:669] Found file /home/heikki/git-sandbox/neon/test_output/test_upgrade_generationless_local_file_paths[debug-pg16]/repo/pageserver_1/tenants/0c04a8df7691a367ad0bb1cc1373ba4d/timelines/f41022551e5f96ce8dbefb9b5d35ab45/000000067F0000000100000A8D0100000000-000000067F0000000100000AC10000000002__00000000014F16F0-v1-00000001
2024-10-08 13:19:39.99 INFO [test_pageserver_generations.py:673] Renamed /home/heikki/git-sandbox/neon/test_output/test_upgrade_generationless_local_file_paths[debug-pg16]/repo/pageserver_1/tenants/0c04a8df7691a367ad0bb1cc1373ba4d/timelines/f41022551e5f96ce8dbefb9b5d35ab45/000000067F0000000100000A8D0100000000-000000067F0000000100000AC10000000002__00000000014F16F0-v1-00000001 -> /home/heikki/git-sandbox/neon/test_output/test_upgrade_generationless_local_file_paths[debug-pg16]/repo/pageserver_1/tenants/0c04a8df7691a367ad0bb1cc1373ba4d/timelines/f41022551e5f96ce8dbefb9b5d35ab45/000000067F0000000100000A8D0100000000-000000067F0000000100000AC10000000002__00000000014F16F0
After:
2024-10-08 13:24:39.726 INFO [test_pageserver_generations.py:667] Renaming files in /home/heikki/git-sandbox/neon/test_output/test_upgrade_generationless_local_file_paths[debug-pg16]/repo/pageserver_1/tenants/3439538816c520adecc541cc8b1de21c/timelines/6a7be8ee707b355de48dd91b326d6ae1
2024-10-08 13:24:39.728 INFO [test_pageserver_generations.py:673] Renamed
000000067F0000000100000A8D0100000000-000000067F0000000100000AC10000000002__00000000014F16F0-v1-00000001 -> 000000067F0000000100000A8D0100000000-000000067F0000000100000AC10000000002__00000000014F16F0
`download_byte_range()` is basically a copy of `download()` with an
additional option passed to the backend SDKs. This can cause these code
paths to diverge, and prevents combining various options.
This patch adds `DownloadOpts::byte_(start|end)` and move byte range
handling into `download()`.
The "Starting postgres endpoint <name>" message is not needed, because
the neon_cli.py prints the neon_local command line used to start the
endpoint. That contains the same information. The "Postgres startup took
XX seconds" message is not very useful because no one pays attention to
those in the python test logs when things are going smoothly, and if you
do wonder about the startup speed, the same information and more can be
found in the compute log.
Before:
2024-10-07 22:32:27.794 INFO [neon_fixtures.py:3492] Starting postgres endpoint ep-1
2024-10-07 22:32:27.794 INFO [neon_cli.py:73] Running command "/tmp/neon/bin/neon_local endpoint start --safekeepers 1 ep-1"
2024-10-07 22:32:27.901 INFO [neon_fixtures.py:3690] Postgres startup took 0.11398935317993164 seconds
After:
2024-10-07 22:32:27.794 INFO [neon_cli.py:73] Running command "/tmp/neon/bin/neon_local endpoint start --safekeepers 1 ep-1"
Implements an initial mechanism for offloading of archived timelines.
Offloading is implemented as specified in the RFC.
For now, there is no persistence, so a restart of the pageserver will
retrigger downloads until the timeline is offloaded again.
We trigger offloading in the compaction loop because we need the signal
for whether compaction is done and everything has been uploaded or not.
Part of #8088
## Problem
When a node goes offline, we trigger reconciles to migrate shards away
from it. If multiple nodes go offline at the same time, we handled them in
sequence. Hence, we might migrate shards from the first offline node to the second
offline node and increase the unavailability period.
## Summary of changes
Refactor heartbeat delta handling to:
1. Update in memory state for all nodes first
2. Handle availability transitions one by one (we have full picture for each node after (1))
Closes https://github.com/neondatabase/neon/issues/9126
## Problem
On macOS:
```
/Users/runner/work/neon/neon//pgxn/neon/file_cache.c:623:19: error: variable 'has_remaining_pages' is used uninitialized whenever 'for' loop exits because its condition is false [-Werror,-Wsometimes-uninitialized]
```
## Summary of changes
- Initialise `has_remaining_pages` with `false`
It didn't serve much value, and was only used twice.
Path(__file__).parent is a pretty easy invocation to use.
Signed-off-by: Tristan Partin <tristan@neon.tech>
If you override CFLAGS, you also override any flags that PostgreSQL
configure script had picked. That includes many options that enable
extra compiler warnings, like '-Wall', '-Wmissing-prototypes', and so
forth. The override was added in commit 171385ac14, but the intention
of that was to be *more* strict, by enabling '-Werror', not less
strict. The proper way of setting '-Werror', as documented in the docs
and mentioned in PR #2405, is to set COPT='-Werror', but leave CFLAGS
alone.
All the compiler warnings with the standard PostgreSQL flags have now
been fixed, so we can do this without adding noise.
Part of the cleanup issue #9217.
It's not a bug, the variable is initialized when it's used, but the
compiler isn't smart enough to see that through all the conditions.
Part of the cleanup issue #9217.
/pgxn/neon/walproposer_compat.c:192:9: warning: function ‘WalProposerLibLog’ might be a candidate for ‘gnu_printf’ format attribute [-Wsuggest-attribute=format]
192 | vsnprintf(buf, sizeof(buf), fmt, args);
| ^~~~~~~~~
The warning:
warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
It's PostgreSQL project style to stick to the old C90 style.
(Alternatively, we could disable it for our extension.)
Part of the cleanup issue #9217.
Prototypes for neon_writev(), neon_readv(), and neon_regisersync()
were missing. But instead of adding the missing prototypes, mark all
the smgr functions 'static'.
Part of the cleanup issue #9217.
Silences these compiler warnings:
/pgxn/neon_walredo/walredoproc.c:452:1: warning: ‘CreateFakeSharedMemoryAndSemaphores’ was used with no prototype before its definition [-Wmissing-prototypes]
452 | CreateFakeSharedMemoryAndSemaphores()
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/pgxn/neon/walproposer_pg.c:541:1: warning: no previous prototype for ‘GetWalpropShmemState’ [-Wmissing-prototypes]
541 | GetWalpropShmemState()
| ^~~~~~~~~~~~~~~~~~~~
Part of the cleanup issue #9217.
In v16 merge, we copied much of heap RMGR, to distinguish vanilla
Postgres heap records from records generated with neon patches, with
the additional CID fields. This function is only used by the
HEAP_TRUNCATE records, however, which we didn't need to copy.
Part of the cleanup issue #9217.
In short: Currently we reserve 75% of memory to the LFC, meaning that if
we scale up to keep postgres using less than 25% of the compute's
memory.
This means that for certain memory-heavy workloads, we end up scaling
much higher than is actually needed — in the worst case, up to 4x,
although in practice it tends not to be quite so bad.
Part of neondatabase/autoscaling#1030.
Update hyper and tonic again in the storage broker, this time with a fix
for the issue that made us revert the update last time.
The first commit is a revert of #9268, the second a fix for the issue.
fixes#9231.
I'm trying to debug a situation with the LR benchmark publisher not
being in the correct state. This should aid in debugging, while just
being generally useful.
PR: https://github.com/neondatabase/neon/pull/9265
Signed-off-by: Tristan Partin <tristan@neon.tech>
In PostgreSQL v16, BUFFERTAGS_EQUAL was replaced with a static inline
macro, BufferTagsEqual. Let's use the new name going forward, and have
backwards-compatibility glue to allow using the new name on v14 and v15,
rather than the other way round. This also makes BufferTagsEquals
consistent with InitBufferTag, for which we were already using the new
name.
Requires https://github.com/neondatabase/neon/pull/9086 first to have
`local_proxy_config`. This logic can still be reviewed implementation
wise.
Create JWT Auth functionality related roles without attributes and
`neon_superuser` group.
Read the JWT related roles from `local_proxy_config` `JWKS` settings and
handle them differently than other console created roles.
The prefetch-queue hash table uses a BufferTag struct as the hash key,
and it's hashed using hash_bytes(). It's important that all the padding
bytes in the key are cleared, because hash_bytes() will include them.
I was getting compiler warnings like this on v14 and v15, when compiling
with -Warray-bounds:
In function ‘prfh_lookup_hash_internal’,
inlined from ‘prfh_lookup’ at
pg_install/v14/include/postgresql/server/lib/simplehash.h:821:9,
inlined from ‘neon_read_at_lsnv’ at pgxn/neon/pagestore_smgr.c:2789:11,
inlined from ‘neon_read_at_lsn’ at pgxn/neon/pagestore_smgr.c:2904:2:
pg_install/v14/include/postgresql/server/storage/relfilenode.h:90:43:
warning: array subscript ‘PrefetchRequest[0]’ is partly outside array
bounds of ‘BufferTag[1]’ {aka ‘struct buftag[1]’} [-Warray-bounds]
89 | ((node1).relNode == (node2).relNode && \
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
90 | (node1).dbNode == (node2).dbNode && \
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~
91 | (node1).spcNode == (node2).spcNode)
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
pg_install/v14/include/postgresql/server/storage/buf_internals.h:116:9:
note: in expansion of macro ‘RelFileNodeEquals’
116 | RelFileNodeEquals((a).rnode, (b).rnode) && \
| ^~~~~~~~~~~~~~~~~
pgxn/neon/neon_pgversioncompat.h:25:31: note: in expansion of macro
‘BUFFERTAGS_EQUAL’
25 | #define BufferTagsEqual(a, b) BUFFERTAGS_EQUAL(*(a), *(b))
| ^~~~~~~~~~~~~~~~
pgxn/neon/pagestore_smgr.c:220:34: note: in expansion of macro
‘BufferTagsEqual’
220 | #define SH_EQUAL(tb, a, b) (BufferTagsEqual(&(a)->buftag,
&(b)->buftag))
| ^~~~~~~~~~~~~~~
pg_install/v14/include/postgresql/server/lib/simplehash.h:280:77: note:
in expansion of macro ‘SH_EQUAL’
280 | #define SH_COMPARE_KEYS(tb, ahash, akey, b) (ahash ==
SH_GET_HASH(tb, b) && SH_EQUAL(tb, b->SH_KEY, akey))
| ^~~~~~~~
pg_install/v14/include/postgresql/server/lib/simplehash.h:799:21: note:
in expansion of macro ‘SH_COMPARE_KEYS’
799 | if (SH_COMPARE_KEYS(tb, hash, key, entry))
| ^~~~~~~~~~~~~~~
pgxn/neon/pagestore_smgr.c: In function ‘neon_read_at_lsn’:
pgxn/neon/pagestore_smgr.c:2742:25: note: object ‘buftag’ of size 20
2742 | BufferTag buftag = {0};
| ^~~~~~
This commit silences those warnings, although it's not clear to me why
the compiler complained like that in the first place. I found the issue
with padding bytes while looking into those warnings, but that was
coincidental, I don't think the padding bytes explain the warnings as
such.
In v16, the BUFFERTAGS_EQUAL macro was replaced with a static inline
function, and that also silences the compiler warning. Not clear to me
why.
## Problem
See #9199
## Summary of changes
Fix update of hits/misses for LFC and prefetch introduced in
78938d1b59
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
If peer safekeeper needs garbage collected segment it will be fetched
now from s3 using on-demand WAL download. Reduces danger of running out of disk space when safekeeper fails.
1. Adds local-proxy to compute image and vm spec
2. Updates local-proxy config processing, writing PID to a file eagerly
3. Updates compute-ctl to understand local proxy compute spec and to
send SIGHUP to local-proxy over that pid.
closes https://github.com/neondatabase/cloud/issues/16867
NeonWALReader needs to know LSN before which WAL is not available
locally, that is, basebackup LSN. Previously it was taken from
WalpropShmemState, but that's racy, as walproposer sets its there only
after successfull election. Get it directly with GetRedoStartLsn.
Should fix flakiness of
test_ondemand_wal_download_in_replication_slot_funcs etc.
ref #9201
## Problem
Creation of a timelines during a reconciliation can lead to
unavailability if the user attempts to
start a compute before the storage controller has notified cplane of the
cut-over.
## Summary of changes
Create timelines on all currently attached locations. For the latest
location, we still look
at the database (this is a previously). With this change we also look
into the observed state
to find *other* attached locations.
Related https://github.com/neondatabase/neon/issues/9144
## Problem
Secondary tenant heatmaps were always downloaded, even when they hadn't
changed. This can be avoided by using a conditional GET request passing
the `ETag` of the previous heatmap.
## Summary of changes
The `ETag` was already plumbed down into the heatmap downloader, and
just needed further plumbing into the remote storage backends.
* Add a `DownloadOpts` struct and pass it to
`RemoteStorage::download()`.
* Add an optional `DownloadOpts::etag` field, which uses a conditional
GET and returns `DownloadError::Unmodified` on match.
## Problem
The S3 tests couldn't use SSO authentication for local tests against S3.
## Summary of changes
Enable the `sso` feature of `aws-config`. Also run `cargo hakari
generate` which made some updates to `workspace_hack`.
In the passing, rename it to NeonLocalCli, to reflect that the binary
is called 'neon_local'.
Add wrapper for the 'timeline_import' command, eliminating the last
raw call to the raw_cli() function from tests, except for a few in
test_neon_cli.py which are about testing the 'neon_local' iteself. All
the other calls are now made through the strongly-typed wrapper
functions
Add wrappers for a few commands that didn't have them before. Move the
logic to generate tenant and timeline IDs from NeonCli to the callers,
so that NeonCli is more purely just a type-safe wrapper around
'neon_local'.
* I had to install `m4` in order to be able to run locally
* The docs/docker.md was missing a pointer to where the compute node
code is
(Was originally on #8888 but I am pulling this out)
## Problem
`Oversized vectored read [...]` logs are spewing in prod because we have
a few keys that
are unexpectedly large:
* reldir/relblock - these are unbounded, so it's known technical debt
* slru block - they can be a bit bigger than 128KiB due to storage
format overhead
## Summary of changes
* Bump threshold to 130KiB
* Don't warn on oversized reldir and dbdir keys
Closes https://github.com/neondatabase/neon/issues/8967
Follow-up of #9234 to give hyper 1.0 the version-free name, and the
legacy version of hyper the one with the version number inside. As we
move away from hyper 0.14, we can remove the `hyper0` name piece by
piece.
Part of #9255
Because:
- it's nice to be up-to-date,
- we already had axum 0.7 in our dependency tree, so this avoids having
to compile two versions, and
- removes one of the remaining dpendencies to hyper version 0
Also bumps the 'tokio-tungstenite' dependency, to avoid having two
versions in the dependency tree.
The apt install stage before this commit:
0 upgraded, 391 newly installed, 0 to remove and 9 not upgraded.
Need to get 261 MB of archives.
after:
0 upgraded, 367 newly installed, 0 to remove and 9 not upgraded.
Need to get 220 MB of archives.
As seen in https://github.com/neondatabase/cloud/issues/17335, during
releases we can have ingest lags that are above the limits for warnings.
However, such lags are part of normal pageserver startup.
Therefore, calculate a certain cooldown timestamp until which we accept
lags up to a certain size. The heuristic is chosen to grow the later we get
to fully load the tenant, and we also add 60 seconds as a grace period
after that term.
Fixes#9231 .
Upgrade hyper to 1.4.0 and use hyper 1.4 instead of 0.14 in the storage
broker, together with tonic 0.12. The two upgrades go hand in hand.
Thanks to the broker being independent from other components, we can
upgrade its hyper version without touching the other components, which
makes things easier.
## Problem
```
Warning: The file chosen for install of requests 2.32.0 (requests-2.32.0-py3-none-any.whl) is yanked. Reason for being yanked: Yanked due to conflicts with CVE-2024-35195 mitigation
```
## Summary of changes
- Update `requests` to fix the warning
- Update `psycopg2-binary`
Some parentheses in conditional expressions are redundant or necessary
for clarity conditional expressions in walproposer.
## Summary of changes
Change some parentheses to clarify conditions in walproposer.
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
## Problem
We are seeing frequent pageserver startup timelines while it calls
syncfs(). There is an existing fixture that syncs _after_ tests, but not
before the first one. We hypothesize that some failures are happening on
the first test in a job.
## Summary of changes
- extend the existing sync_after_each_test to be a sync between all
tests, including sync'ing before running the first test. That should
remove any ambiguity about whether the sync is happening on the correct
node.
This is an alternative to https://github.com/neondatabase/neon/pull/8957
-- I didn't realize until I saw Alexander's comment on that PR that we
have an existing hook that syncs filesystems and can be extended.
close https://github.com/neondatabase/neon/issues/9160
For whatever reason, pg17's WAL pattern seems different from others,
which triggers some flaky behavior within the compaction smoke test.
## Summary of changes
* Run L0 compaction before proceeding with the read benchmark.
* So that we can ensure the num of L0 layers is 0 and test the
compaction behavior only with L1 layers.
We have a threshold for triggering L0 compaction. In some cases, the
test case did not produce enough L0 layers to do a L0 compaction,
therefore leaving the layer map with 3+ L0 layers above the L1 layers.
This increases the average read depth for the timeline.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We don't have an alert for long running reconciles. Stuck reconciles are
problematic
as we've seen in a recent incident.
## Summary of changes
Add a new metric `storage_controller_reconcile_long_running_total` with
labels: `{tenant_id, shard_number, seq}`.
The metric is removed after the long running reconcile finishes. These
events should be rare, so we won't break
the bank on cardinality.
Related https://github.com/neondatabase/neon/issues/9150
## Problem
If a timeline was deleted right before waiting for LSNs to catch up
before the cut-over,
then we would wait forever.
## Summary of changes
Fix the issue and add a test for timeline deletions mid migration.
Related https://github.com/neondatabase/neon/issues/9144
## Problem
Recent change to avoid the "became visible" log messages from certain
tasks missed a task: the logical size calculation that happens as a
child of synthetic size calculation.
Related: https://github.com/neondatabase/neon/issues/9058
## Summary of changes
- Add OnDemandLogicalSize to the list of permitted tasks for reads
making a covered layer visible
- Tweak the log message to use layer name instead of key: this is more
terse, and easier to use when debugging, as one can search for it
elsewhere to see when the layer was written/downloaded etc.
```shell
$ cargo run -p proxy --bin proxy -- --auth-backend=web --webauth-confirmation-timeout=5s
```
```
$ psql -h localhost -p 4432
NOTICE: Welcome to Neon!
Authenticate by visiting within 5s:
http://localhost:3000/psql_session/e946900c8a9bc6e9
psql: error: connection to server at "localhost" (::1), port 4432 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
connection to server at "localhost" (127.0.0.1), port 4432 failed: ERROR: Disconnected due to inactivity after 5s.
```
In PG17, there is this newfangled custom wait events system. This commit
adds that feature to Neon, so that users can see what their backends may
be waiting for when a PostgreSQL backend is playing the waiting game in
Neon code.
check-rust-style fails because tonic version too old, this does not seem
to be an easy fix, so ignore it from the deny list.
Signed-off-by: Alex Chi Z <chi@neon.tech>
MaxBackends doesn't include auxiliary processes. Whenever an aux process
made IO operations that updated the counters, they would scribble over
shared memory beoynd the end of the array. The relsize cache hash table
comes after the array, so the symptom was an error about hash table
corruption in the relsize cache hash.
When endpoint is stopped in immediate mode and started again there is a
chance of old connection delivering some WAL to safekeepers after second
start checked need for sync-safekeepers and thus grabbed basebackup LSN.
It makes basebackup unusable, so compute panics. Avoid flakiness by
waiting for walreceivers on safekeepers to be gone in such cases. A
better way would be to bump term on safekeepers if sync-safekeepers is
skipped, but it needs more infrastructure.
ref https://github.com/neondatabase/neon/issues/9079
Found while searching for other issues in shared memory.
The bug should be benign, in that it over-allocates memory for this
struct, but doesn't allow for out-of-bounds writes.
Following #7656, `TenantConfOpt::TryFrom<toml_edit::Item>` appears to be
dead code. This patch removes `TenantConfOpt::TryFrom<toml_edit::Item>`.
The code does appear to be dead, since the TOML config is deserialized
into `TenantConfig` (via `LocationConfig`) and then converted into
`TenantConfOpt`.
This was verified by adding a panic to `try_from()` and running the
pageserver unit tests as well as a local end-to-end cluster (including
creating a new tenant and restarting the pageserver). This did not fail,
so this is not used on the common happy path at least. No explicit
`try_from` or `try_into` calls were found either.
Resolves#8918.
aux v2 migration is near the end and I rewrote the RFC based on what I
proposed (several months before...) and what I actually implemented.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
These are the perf counters added in commit 263dfba6ee.
Note: This relies on 'neon' extension version 1.5. The default was
bumped to 1.5 in commit d696c41807.
---------
Co-authored-by: Matthias van de Meent <matthias@neon.tech>
On bookworm, 'cmake' is new enough that we can just use it. On bullseye,
we can get a new-enough package from backports. By including 'cmake' in
the build-deps stage, we don't need to install it separately in all the
later build stages that need it.
See https://github.com/neondatabase/neon/pull/2699, where we switched to
downloading and building a specific version.
Microsoft exposes JWKs without the alg header. It's only included on the
tokens. Not a problem.
Also noticed that wrt the `typ` header:
> It will typically not be used by applications when it is already known
that the object is a JWT. This parameter is ignored by JWT
implementations; any processing of this parameter is performed by the
JWT application.
Since we know we are expecting JWTs only, I've followed the guidance and
removed the validation.
## Problem
We need the
[pg_session_jwt](https://github.com/neondatabase/pg_session_jwt/)
extension in the compute image. This PR adds it.
## Summary of changes
I added the `pg_session_jwt` extension in a very similar way to how the
pggraphql and pgtiktoken extensions were added (since they're all
written with pgrx). Then I tested this.
```
$ cd docker-compose/
$ PG_VERSION=16 TAG=10667533475 docker-compose up --build -d
$ psql postgresql://cloud_admin:cloud_admin@localhost:55433/postgres
cloud_admin@postgres=# create extension pg_session_jwt;
CREATE EXTENSION
Time: 43.048 ms
cloud_admin@postgres=# \df auth.*;
List of functions
┌────────┬──────────────────┬──────────────────┬─────────────────────┬──────┐
│ Schema │ Name │ Result data type │ Argument data types │ Type │
├────────┼──────────────────┼──────────────────┼─────────────────────┼──────┤
│ auth │ get │ jsonb │ s text │ func │
│ auth │ init │ void │ kid bigint, s jsonb │ func │
│ auth │ jwt_session_init │ void │ s text │ func │
│ auth │ user_id │ text │ │ func │
└────────┴──────────────────┴──────────────────┴─────────────────────┴──────┘
(4 rows)
cloud_admin@postgres=# select auth.init(cast('1' as bigint), to_jsonb(TEXT '{ "kty": "EC", "kid": "571683be-33cf-4e67-bccc-8905c0ebb862", "crv": "P-521", "alg": "ES512", "x": "AM_GsnQvKML2yXdn_OsN8PdgO1Sf9XMXih5vQMKLmJkp-Iz_FFWJUt6uyR_qp4brr8Ji2kjGJgN4cQJpg2kskH7V", "y": "AZg-salw24lCmsBP-BCBa5jT6INkTwLtCOC7o0BIxDVvmIEH1-PQAJVYVJPTFvPMi_PLa0QlOm-ufJYkynwa2Mau" }'));
ERROR: called `Result::unwrap()` on an `Err` value: Error("invalid type: string \"{ \\\"kty\\\": \\\"EC\\\", \\\"kid\\\": \\\"571683be-33cf-4e67-bccc-8905c0ebb862\\\", \\\"crv\\\": \\\"P-521\\\", \\\"alg\\\": \\\"ES512\\\", \\\"x\\\": \\\"AM_GsnQvKML2yXdn_OsN8PdgO1Sf9XMXih5vQMKLmJkp-Iz_FFWJUt6uyR_qp4brr8Ji2kjGJgN4cQJpg2kskH7V\\\", \\\"y\\\": \\\"AZg-salw24lCmsBP-BCBa5jT6INkTwLtCOC7o0BIxDVvmIEH1-PQAJVYVJPTFvPMi_PLa0QlOm-ufJYkynwa2Mau\\\" }\", expected struct JwkEcKey", line: 0, column: 0)
Time: 6.991 ms
```
## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Move the download location to a proper URL
## Problem
`test_multi_attach` is sometimes failing with `invalid compute status
for configuration request: Configuration`. This is likely a result of
the test attempting to reconfigure the compute at the same time as the
storage controller is doing so.
This test was originally written before the storage controller existed,
and is not expecting anything else to be reconfiguring computes at the
same time.
## Summary of changes
- Configure the tenant into scheduling policy `Stop` in the storage
controller at the start of the test, so that it won't try to do anything
to the tenant while the test is running.
* tracing-utils now returns a `Layer` impl. Removes the need for crates
to
import OTel crates.
* Drop the /v1/traces URI check. Verified that the code does the right
thing.
* Leave a TODO to hook in an error handler for OTel to log errors to
when it
assumes the regular pipeline cannot be used/is broken.
## Problem
The live migration code waits forever for the compute notification hook,
on the basis that until it succeeds, the compute is probably using the
old location and we shouldn't detach it.
However, if a pageserver stops or restarts in the background, then this
original location might no longer be available, so there is no point
waiting. Waiting is also actively harmful, because it prevents other
reconciliations happening for the tenant shard, such as during an
upgrade where a stuck "drain" migration might prevent the later "fill"
migration from moving the shard back to its original location.
## Summary of changes
- Refactor the notification wait loop into a function
- Add a checks during the loop, for the origin node's cancellation token
and an explicit HTTP request to the origin node to confirm the shard is
still attached there.
Closes: https://github.com/neondatabase/neon/issues/8901
neon_cli.create_tenant() creates a new tenant *and* a timeline on the
tenant, with name "main". In most tests, there's no need to create
another timeline on the same tenant.
There are some more tests that do that, but in the remaining cases, I
wasn't be 100% if the presence of extra root timelines affect what the
tests test, so I left them alone.
## Problem
This test waits for a request to finish, and then expects deletion to
complete almost immediately. The request completes, but it's a 202, the
timeline is still deleting in the background: we need to be more
patient.
## Summary of changes
- Adjust iterations from 2 to 10 when waiting for deletion
## Problem
The Neon components, built locally and by the GitHub workflow have
slightly different version prefixes (git: vs git-env:)
This does not allow running tests against local builds correctly.
## Summary of changes
The regular expressions were changed to work with both
prefixes.
## Problem
Legacy functions that were called as `mgr::` and relied on the static
TENANTS, see #5796
## Summary of changes
- Move the last stray function (immediate_gc) into TenantManager
Closes: https://github.com/neondatabase/neon/issues/5796
Commit 263dfba6ee introduced neon extension version 1.5, which included
some new functions and views for metrics. It didn't bump the default
neon extension number yet, so that we could still safely roll back to
the old binary if necessary. This bumps the default version.
## Problem
`pgbench-pgvector` job from Nightly Benchmarks fails with the error:
```
/__w/_temp/f45bc2eb-4c4c-4f0a-8030-99079303fa65.sh: line 17: LD_LIBRARY_PATH: unbound variable
```
## Summary of changes
- Fix `LD_LIBRARY_PATH: unbound variable` error in benchmarks
## Problem
Nightly Benchmarks have been broken for some time due to various
reasons, this PR fixes it
## Summary of changes
- Pull `build-tools` image from dockerhub for `benchmarking` workflow
- Use `aws-actions/configure-aws-credentials` to upload/download
artifacts from S3
- Fix Postgres 16 installation (for pgbench)
Part of https://github.com/neondatabase/cloud/issues/13127. Resolves
#9153
What changed in this PR:
1. Adds `ComputeSpec.disk_quota_bytes: Option<u64>`
2. Adds new arg to compute_ctl: `--set-disk-quota-for-fs <mountpoint>`
3. Implements running `/neonvm/bin/set-disk-quota` with the right value
if both cmdline arg AND field in the spec are specified
4. Patches `/etc/sudoers.d` to allow `compute_ctl` to set quota with
sudo
This PR is very similar to the swap support added earlier, you can take
a look at it as prior art: #7434
In theory, it can be implemented outside of compute_ctl when we will
have a separate neonvm daemon, but we are not there yet. Current
implementation is the simplest possible to unblock computes with larger
disks.
All code related to usage of `/neonvm/bin/set-disk-quota` is located in
`disk_quota.rs`. We need to call this script with the following
arguments: `/neonvm/bin/set-disk-quota {size_kb} {mountpoint}`. Quotas
are set on the filesystem level, so we need to provide path to the
directory that filesystem was mounted to.
I tested this change locally with
https://github.com/neondatabase/cloud/pull/17270. It should be safe to
merge, because this feature is gated by both cmdline arg and field in
the spec. If control-plane doesn't set values in both places,
compute_ctl won't be affected by this change.
close https://github.com/neondatabase/neon/issues/8140
The original issue is rather vague on what we should do. After
discussion w/ @problame we decided to narrow down the problems we want
to solve in that issue.
* read path -- do not panic for now.
* write path -- panic only on write errors (i.e., device error, fsync
error), but not on no-space for now.
The guideline is that if the pageserver behavior could lead to violation
of persistent constraints (i.e., return an operation as successful but
not actually persisting things), we should panic. Fsync is the place
where both of us agree that we should panic, because if fsync fails, the
kernel will mark dirty pages as clean, and the next fsync will not
necessarily return false. This would make the storage client assume the
operation is successful.
## Summary of changes
Make fsync panic on fatal errors.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
To make it faster. On my laptop, it takes about 30 before this commit.
In the arm64 debug variant in CI, it takes about 120 s. Reduce it by
factor of 4.
Part of #8130
## Problem
After deploying https://github.com/neondatabase/infra/pull/1927, we
shipped `io_buffer_alignment=512` to all prod region. The
`AdjacentVectoredReadBuilder` code path is no longer taken and we are
running pageserver unit tests 6 times in the CI. Removing it would
reduce the test duration by 30-60s.
## Summary of changes
- Remove `AdjacentVectoredReadBuilder` code.
- Bump the minimum `io_buffer_alignment` requirement to at least 512
bytes.
- Use default `io_buffer_alignment` for Rust unit tests.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Part of #7497, closes#8817.
## Problem
See #8817.
## Summary of changes
**compute_ctl**
- Renew lsn lease as soon as `/configure` updates pageserver_connstr,
use `state_changed` Condvar for synchronization.
**pageserver**
As mentioned in
https://github.com/neondatabase/neon/issues/8817#issuecomment-2315768076,
we still want some permanent error reported if a lease cannot be
granted. By considering attachment mode and the added
`lsn_lease_deadline` when processing lease requests, we can also bound
the case of bad requests to a very short period after migration/restart.
- Refactor https://github.com/neondatabase/neon/pull/9024 and move
`lsn_lease_deadline` to `AttachedTenantConf` so timeline can easily
access it.
- Have separate HTTP `init_lsn_lease` and libpq `renew_lsn_lease` API.
- Always do LSN verification for the initial HTTP lease request.
- LSN verification for the renewal is **still done** when tenants are
not in `AttachedSingle` and we have pass the `lsn_lease_deadline`, which
give plenty of time for compute to renew the lease.
**neon_local**
- add and call `timeline_init_lsn_lease` mgmt_api at static endpoint
start. The initial lsn lease http request is sent when we run `cargo
neon endpoint start <static endpoint>`.
## Testing
- Extend `test_readonly_node_gc` to do pageserver restarts and
migration.
## Future Work
- The control plane should make the initial lease request through HTTP
when creating a static endpoint. This is currently only done in
`neon_local`.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Verbosity in this case is good when reading the code. Short options are
better when operating in an interactive shell.
Signed-off-by: Tristan Partin <tristan@neon.tech>
I hope this lets us capture backtraces in CI. At least it makes it
work on my laptop, which is valuable even if we need to do more for
CI.
See issue #2800.
## Problem
We get some unexpected errors, but don't know who they're happening for.
## Summary of change
Add tenant id and peer address to PG connection error logs.
Related https://github.com/neondatabase/cloud/issues/17336
We separated client error from basebackup error log lines in
https://github.com/neondatabase/neon/pull/7523, but we didn't do
anything for the metrics. In this patch, we fixed it.
ref https://github.com/neondatabase/neon/issues/8970
## Summary of changes
We use the same criteria as in `log_query_error` producing an info line
(instead of error) for the metrics. We added a `client_error` category
for the basebackup query time metrics.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
- In https://github.com/neondatabase/neon/pull/8784, the validate
controller API is modified to check generations directly in the
database. It batches tenants into separate queries to avoid generating a
huge statement, but
- While updating this, I realized that "control_plane_client" is a kind
of confusing name for the client code now that it primarily talks to the
storage controller (the case of talking to the control plane will go
away in a few months).
## Summary of changes
- Big rename to "ControllerUpcallClient" -- this reflects the storage
controller's api naming, where the paths used by the pageserver are in
`/upcall/`
- When sending validate requests, break them up into chunks so that we
avoid possible edge cases of generating any HTTP requests that require
database I/O across many thousands of tenants.
This PR mixes a functional change with a refactor, but the commits are
cleanly separated -- only the last commit is a functional change.
---------
Co-authored-by: Christian Schwarz <christian@neon.tech>
There was a tricky race condition in compute_ctl, that sometimes makes
configurator skip updates. It makes a deadlock because:
- control-plane cannot configure compute, because it's in
ConfigurationPending state
- compute_ctl doesn't do any reconfiguration because
`configurator_main_loop` missed notification for it
Full sequence that reproduces the issue:
1. `start_compute` finishes works and changes status
`self.set_status(ComputeStatus::Running);`
2. configurator received update about `Running` state and dropped the
mutex lock in the iteration
3. `/configure` request was triggered at the same time as step 1, and
got the mutex lock
4. same `/configure` request set the spec and updated the state to
`ConfigurationPending`, also sent a notification
5. next iteration in configurator got the mutex lock, but missed the
notification
There are more details in this slack thread:
https://neondb.slack.com/archives/C03438W3FLZ/p1727281028478689?thread_ts=1727261220.483799&cid=C03438W3FLZ
---------
Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>
## Problem
The latest storage release has generated artifacts for Postgres 17,
so we can enable compatibility tests this version
## Summary of changes
- Unskip `test_backward_compatibility` / `test_forward_compatibility` on
Postgres 17
We don't want to allow any new child timelines of archived timelines. If
you want any new child timelines, you should first un-archive the
timeline.
Part of #8088
The warning:
warning: first doc comment paragraph is too long
--> pageserver/src/tenant/checks.rs:7:1
|
7 | / /// Checks whether a layer map is valid (i.e., is a valid result
of the current compaction algorithm if no...
8 | | /// The function checks if we can split the LSN range of a delta
layer only at the LSNs of the delta layer...
9 | | ///
10 | | /// ```plain
| |_
|
= help: for further information visit
https://rust-lang.github.io/rust-clippy/master/index.html#too_long_first_doc_paragraph
= note: `#[warn(clippy::too_long_first_doc_paragraph)]` on by default
help: add an empty line
|
7 ~ /// Checks whether a layer map is valid (i.e., is a valid result of
the current compaction algorithm if nothing goes wrong).
8 + ///
|
Fix by applying the suggestion.
A cherry-pick from the previous release (#9131)
## Problem
Login to prod ECR doesn't work anymore:
```
Retrieving registries data through *** SDK...
*** ECR detected with eu-central-1 region
Error: The security token included in the request is invalid.
```
## Summary of changes
- Fix login to prod ECR by using `aws-actions/configure-aws-credentials`
The pg_install/build directory contains .o files and such intermediate
results from the build, which are not needed in the final tarball.
Except for src/test/regress/regress.so and a few other .so files in that
directory; keep those.
This reduces the size of the neon-Linux-X64-release-artifact.tar.zst
artifact from about 1.5 GB to 700 MB.
(I attempted this a long time ago already, by moving the build/
directory out of pg_install altogether, see PR #2127. But I never got
around to finish that work.)
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Problem
These commits are split off from
https://github.com/neondatabase/neon/pull/8971/commits where I was
fixing this to make a better scale test pass -- Vlad also independently
recognized these issues with cloudbench in
https://github.com/neondatabase/neon/issues/9062.
1. The storage controller proxies GET requests to pageservers based on
their intent, not the ground truth of where they're really attached.
2. Proxied requests can race with scheduling to tenants, resulting in
404 responses if the request hits the wrong pageserver.
Closes: https://github.com/neondatabase/neon/issues/9062
## Summary of changes
1. If a shard has a running reconciler, then use the database
generation_pageserver to decide who to proxy the request to
2. If such a request gets a 404 response and its scheduled node has
changed since the request was dispatched.
## Problem
Storage controller didn't previously consider AZ locality between
compute and pageservers
when scheduling nodes. Control plane has this feature, and, since we are
migrating tenants
away from it, we need feature parity to avoid perf degradations.
## Summary of changes
The change itself is fairly simple:
1. Thread az info into the scheduler
2. Add an extra member to the scheduling scores
Step (2) deserves some more discussion. Let's break it down by the shard
type being scheduled:
**Attached Shards**
We wish for attached shards of a tenant to end up in the preferred AZ of
the tenant since that
is where the compute is like to be.
The AZ member for `NodeAttachmentSchedulingScore` has been placed
below the affinity score (so it's got the second biggest weight for
picking the node). The rationale for going
below the affinity score is to avoid having all shards of a single
tenant placed on the same node in 2 node
regions, since that would mean that one tenant can drive the general
workload of an entire pageserver.
I'm not 100% sure this is the right decision, so open to discussing
hoisting the AZ up to first place.
**Secondary Shards**
We wish for secondary shards of a tenant to be scheduled in a different
AZ from the preferred one
for HA purposes.
The AZ member for `NodeSecondarySchedulingScore` has been placed first,
so nodes in different AZs
from the preferred one will always be considered first. On small
clusters, this can mean that all the secondaries
of a tenant are scheduled to the same pageserver, but secondaries don't
use up as many resources as the
attached location, so IMO the argument made for attached shards doesn't
hold.
Related: https://github.com/neondatabase/neon/issues/8848
As @koivunej mentioned in the storage channel, for regress test, we
don't need to create a log file for the scrubber, and we should reduce
noisy logs.
## Summary of changes
* Disable log file creation for storage scrubber
* Only log at info level
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
1. Increase statement_timeout. It defaults to 120 s, which is not quite
enough on slow or busy systems with debug build. On my laptop, the index
creation takes about 100 s. On buildfarm, we've seen failures, e.g:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9084/10997888708/index.html#suites/821f97908a487f1d7d3a2a4dd1571e99/db1834bddfe8c5b9/
2. Keep twiddling the LFC size through the whole test. Before, we would
do it for the first 10 seconds, but that only covers a small part of the
pgbench initialization phase. Change the loop so that the pgbench run
time determines how long the test runs, and we keep changing the LFC for
the whole time.
In the passing, also fix bogus test description, copy-pasted from a
completely unrelated test.
## Problem
Compilation of neon extension on macOS produces a warning
```
pgxn/neon/neon_perf_counters.c:50:1: error: non-void function does not return a value [-Werror,-Wreturn-type]
```
## Summary of changes
- Change the return type of `NeonPerfCountersShmemInit` to void
In the `imitate_synthetic_size_calculation_worker` function, we might
obtain the `Cancelled` error variant instead of hitting the cancellation
token based path. Therefore, catch `Cancelled` and handle it analogously
to the cancellation case.
Fixes#8886.
Part of #8130.
## Problem
Currently, decompression is performed within the `read_blobs`
implementation and the decompressed blob will be appended to the end of
the `BytesMut` buffer. We will lose this flexibility of extending the
buffer when we switch to using our own dio-aligned buffer (WIP in
https://github.com/neondatabase/neon/pull/8730). To facilitate the
adoption of aligned buffer, we need to refactor the code to perform
decompression outside `read_blobs`.
## Summary of changes
- `VectoredBlobReader::read_blobs` will return `VectoredBlob` without
performing decompression and appending decompressed blob. It becomes the
caller's responsibility to decompress the buffer.
- Added a new `BufView` type that functions as `Cow<Bytes, &[u8]>`.
- Perform decompression within `VectoredBlob::read` so that people don't
have to explicitly thinking about compression when using the reader
interface.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Even with the 100 ms interval, on my laptop the pageserver always
becomes available on second attempt, so this saves about 900 ms at
every test startup.
## Problem
All the other patches were moved to the compute directory, and only one
was left in the patches subdirectory in the root directory.
## Summary of changes
The patch was moved to the compute directory as others
The PostgreSQL 17 vendor module is now based on postgres/postgres @
d7ec59a63d745ba74fba0e280bbf85dc6d1caa3e, presumably the final code
change before the V17 tag.
## Problem
Scheduling on tenant creation uses different heuristics compared to the
scheduling done during
background optimizations. This results in scenarios where shards are
created and then immediately
migrated by the optimizer.
## Summary of changes
1. Make scheduler aware of the type of the shard it is scheduling
(attached vs secondary).
We wish to have different heuristics.
2. For attached shards, include the attached shard count from the
context in the node score
calculation. This brings initial shard scheduling in line with what the
optimization passes do.
3. Add a test for (2).
This looks like a bigger change than required, but the refactoring
serves as the basis for az-aware
shard scheduling where we also need to make the distinction between
attached and secondary shards.
Closes https://github.com/neondatabase/neon/issues/8969
## Problem
We need to be able to run the regression tests against a cloud-based
Neon staging instance to prepare the migration to the arm architecture.
## Summary of changes
Some tests were modified to work on the cloud instance (i.e. added
passwords, server-side copy changed to client-side, etc)
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Part of #8128, fixes#8872.
## Problem
See #8872.
## Summary of changes
- Retry `list_timeline_blobs` another time if
- there are layer file keys listed but not index.
- failed to download index.
- Instrument code with `analyze-tenant` and `analyze-timeline` span.
- Remove `initdb_archive` check, it could have been deleted.
- Return with exit code 1 on fatal error if `--exit-code` parameter is set.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Instead of adding them to the VM image late in the build process, when
putting together the final VM image, include them in the earlier
compute image already. That makes it more convenient to edit the
files, and to test them.
Seems nice to keep all these together. This also provides a nice place
for a README file to describe the compute image build process. For
now, it briefly describes the contents of the directory, but can be
expanded.
We can't FlushOneBuffer when we're in redo-only mode on PageServer, so
make execution of that function conditional on us not running in
pageserver walredo mode.
## Problem
LFC cache entry is chunk (right now size of chunk is 1Mb). LFC
statistics shows number of chunks, but not number of used pages. And
autoscaling team wants to know how sparse LFC is:
https://neondb.slack.com/archives/C04DGM6SMTM/p1726782793595969
It is possible to obtain it from the view `select count(*) from
local_cache`.
Nut it is expensive operation, enumerating all entries in LFC under
lock.
## Summary of changes
This PR added "file_cache_used_pages" to `neon_lfc_stats` view:
```
select * from neon_lfc_stats;
lfc_key | lfc_value
-----------------------+-----------
file_cache_misses | 3139029
file_cache_hits | 4098394
file_cache_used | 1024
file_cache_writes | 3173728
file_cache_size | 1024
file_cache_used_pages | 25689
(6 rows)
```
Please notice that this PR doesn't change neon extension API, so no need
to create new version of Neon extension.
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
The metrics include a histogram of how long we need to wait for a
GetPage request, number of reconnects, and number of requests among
other things.
The metrics are not yet exported anywhere, but you can query them
manually.
Note: This does *not* bump the default version of the 'neon' extension. We
will do that later, as a separate PR. The reason is that this allows us to roll back
the compute image smoothly, if necessary. Once the image that includes the
new extension .so file with the new functions has been rolled out, and we're
confident that we don't need to roll back the image anymore, we can change
default extension version and actually start using the new functions and views.
This is what the view looks like:
```
postgres=# select * from neon_perf_counters ;
metric | bucket_le | value
---------------------------------------+-----------+----------
getpage_wait_seconds_count | | 300
getpage_wait_seconds_sum | | 0.048506
getpage_wait_seconds_bucket | 2e-05 | 0
getpage_wait_seconds_bucket | 3e-05 | 0
getpage_wait_seconds_bucket | 6e-05 | 71
getpage_wait_seconds_bucket | 0.0001 | 124
getpage_wait_seconds_bucket | 0.0002 | 248
getpage_wait_seconds_bucket | 0.0003 | 279
getpage_wait_seconds_bucket | 0.0006 | 297
getpage_wait_seconds_bucket | 0.001 | 298
getpage_wait_seconds_bucket | 0.002 | 298
getpage_wait_seconds_bucket | 0.003 | 298
getpage_wait_seconds_bucket | 0.006 | 300
getpage_wait_seconds_bucket | 0.01 | 300
getpage_wait_seconds_bucket | 0.02 | 300
getpage_wait_seconds_bucket | 0.03 | 300
getpage_wait_seconds_bucket | 0.06 | 300
getpage_wait_seconds_bucket | 0.1 | 300
getpage_wait_seconds_bucket | 0.2 | 300
getpage_wait_seconds_bucket | 0.3 | 300
getpage_wait_seconds_bucket | 0.6 | 300
getpage_wait_seconds_bucket | 1 | 300
getpage_wait_seconds_bucket | 2 | 300
getpage_wait_seconds_bucket | 3 | 300
getpage_wait_seconds_bucket | 6 | 300
getpage_wait_seconds_bucket | 10 | 300
getpage_wait_seconds_bucket | 20 | 300
getpage_wait_seconds_bucket | 30 | 300
getpage_wait_seconds_bucket | 60 | 300
getpage_wait_seconds_bucket | 100 | 300
getpage_wait_seconds_bucket | Infinity | 300
getpage_prefetch_requests_total | | 69
getpage_sync_requests_total | | 231
getpage_prefetch_misses_total | | 0
getpage_prefetch_discards_total | | 0
pageserver_requests_sent_total | | 323
pageserver_requests_disconnects_total | | 0
pageserver_send_flushes_total | | 323
file_cache_hits_total | | 0
(39 rows)
```
Part of https://github.com/neondatabase/neon/issues/8002
Close https://github.com/neondatabase/neon/issues/8920
Legacy compaction (as well as gc-compaction) rely on the GC process to
remove unused layer files, but this relies on many factors (i.e., key
partition) to ensure data in a dropped table can be eventually removed.
In gc-compaction, we consider the keyspace information when doing the
compaction process. If a key is not in the keyspace, we will skip that
key and not include it in the final output.
However, this is not easy to implement because gc-compaction considers
branch points (i.e., retain_lsns) and the retained keyspaces could
change across different LSNs. Therefore, for now, we only remove aux v1
keys in the compaction process.
## Summary of changes
* Add `FilterIterator` to filter out keys.
* Integrate `FilterIterator` with gc-compaction.
* Add `collect_gc_compaction_keyspace` for a spec of keyspaces that can
be retained during the gc-compaction process.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Not used in production, but in benchmarks, to demonstrate minimal RTT.
(It would be nice to not have to copy the 8KiB of zeroes, but, that
would require larger protocol changes).
Found this useful in investigation
https://github.com/neondatabase/neon/pull/8952.
## Problem
Previously, the storage controller may send compute notifications
containing stale pageservers (i.e. pageserver serving the shard was
detached). This happened because detaches did not update the compute
hook state.
## Summary of Changes
Update compute hook state on shard detach.
Fixes#8928
These were not referenced in any of the other Cargo.toml files in the
workspace. They were not being built because of that, so there was
little harm in having them listed, but let's be tidy.
cargo update ciborium iana-time-zone lazy_static schannel uuid
cargo update hyper@0.14
cargo update --precise 2.9.7 ureq
It might be worthwhile just update all our dependencies at some point,
but this is aimed at pruning the dependency tree, to make the build a
little faster. That's also why I didn't update ureq to the latest
version: that would've added a dependency to yet another version of
rustls.
We frequently mess up our submodule references. This adds one safeguard:
it checks that the submodule references are only updated "forwards", not
to some older commit, or a commit that's not a descended of the previous
one.
As next step, I'm thinking that we should automate things so that when
you merge a PR to the 'neon' repository that updates the submodule
references, the REL_*_STABLE_neon branches are automatically updated to
match the submodule references. That way, you never need to manually
merge PRs in the postgres repository, it's all triggered from commits in
the 'neon' repository. But that's not included here.
Moves the per-timeline code to load timeline metadata into a new
dedicated function called `load_timeline_metadata`. The old
`load_timeline_metadata` becomes `load_timelines_metadata`.
Split out of #8907
Part of #8088
If the primary is started at an LSN within the first of a 16 MB WAL
segment, the "long XLOG page header" at the beginning of the segment was
not initialized correctly. That has gone unnnoticed, because under
normal circumstances, nothing looks at the page header. The WAL that is
streamed to the safekeepers starts at the new record's LSN, not at the
beginning of the page, so that bogus page header didn't propagate
elsewhere, and a primary server doesn't normally read the WAL its
written. Which is good because the contents of the page would be bogus
anyway, as it wouldn't contain any of the records before the LSN where
the new record is written.
Except that in the following cases a primary does read its own WAL:
1. When there are two-phase transactions in prepared state at
checkpoint. The checkpointer reads the two-phase state from the
XLOG_XACT_PREPARE record, and writes it to a file in pg_twophase/.
2. Logical decoding reads the WAL starting from the replication slot's
restart LSN.
This PR fixes the problem with two-phase transactions. For that, it's
sufficient to initialize the page header correctly. The checkpointer
only needs to read XLOG_XACT_PREPARE records that were generated after
the server startup, so it's still OK that older WAL is missing / bogus.
I have not investigated if we have a problem with logical decoding,
however. Let's deal with that separately.
Special thanks to @Lzjing-1997, who independently found the same bug
and opened a PR to fix it, although I did not use that PR.
## Problem
When layer visibility was added, an info log was included for the
situation where actual access to a layer disagrees with the visibility
calculation. This situation is safe, but I was interested in seeing when
it happens.
The log is pretty high volume, so this PR refines it to fire less often.
## Summary of changes
- For cases where accessing non-visible layers is normal, don't log at
all.
- Extend a unit test to increase confidence that the updates to
visibility on access are working as expected
- During compaction, only call the visibility calculation routine if
some image layers were created: previously, frequent calls resulted in
the visibility of layers getting reset every time we passed through
create_image_layers.
## Problem
Seems that PS might be too eager in reporting throttled tasks
## Summary of changes
Introduce a sleep counter. If the sleep counter increases, then the
acquire tasks was throttled.
close https://github.com/neondatabase/neon/issues/8903
In https://github.com/neondatabase/neon/issues/8903 we observed JSON
decoding error to have the following error message in the log:
```
Error processing HTTP request: Resource temporarily unavailable: 3956 (pageserver-6.ap-southeast-1.aws.neon.tech) error receiving body: error decoding response body
```
This is hard to understand. In this patch, we make the error message
more reasonable.
## Summary of changes
* receive body error is now an internal server error, passthrough the
`reqwest::Error` (only decoding error) as `anyhow::Error`.
* instead of formatting the error using `to_string`, we use the
alternative `anyhow::Error` formatting, so that it prints out the cause
of the error (i.e., what exactly cannot serde decode).
I would expect seeing something like `error receiving body: error
decoding response body: XXX field not found` after this patch, though I
didn't set up a testing environment to observe the exact behavior.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
It's pretty expensive to run, and there is very little difference
between debug and release builds that could lead to different clippy
warnings.
This is extracted from PR #8912. That PR wandered off into various
improvements we could make, but we seem to have consensus on this part
at least.
## Problem
Safekeeper's OpenAPI spec is incorrect:
```
Semantic error at paths./v1/tenant/{tenant_id}/timeline/{timeline_id}.get.responses.404.content.application/json.schema.$ref
$refs must reference a valid location in the document
Jump to line 126
```
Checked on https://editor.swagger.io
## Summary of changes
- Add `NotFoundError`
- Add `description` and `license` fields to make Cloud OpenAPI spec
linter happy
We have 3 places where we implement layer map checks.
## Summary of changes
Now we have a single check function being called in all places.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Part of #7497, closes https://github.com/neondatabase/neon/issues/8890.
## Problem
Since leases are in-memory objects, we need to take special care of them
after pageserver restarts and while doing a live migration. The approach
we took for pageserver restart is to wait for at least lease duration
before doing first GC. We want to do the same for live migration. Since
we do not do any GC when a tenant is in `AttachedStale` or
`AttachedMulti` mode, only the transition from `AttachedMulti` to
`AttachedSingle` requires this treatment.
## Summary of changes
- Added `lsn_lease_deadline` field in `GcBlock::reasons`: the tenant is
temporarily blocked from GC until we reach the deadline. This
information does not persist to S3.
- In `GCBlock::start`, skip the GC iteration if we are blocked by the
lsn lease deadline.
- In `TenantManager::upsert_location`, set the lsn_lease_deadline to
`Instant::now() + lsn_lease_length` so the granted leases have a chance
to be renewed before we run GC for the first time after transitioned
from AttachedMulti to AttachedSingle.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
misc changes split out from #8855
- **allow cloning the request context in a read-only fashion for
background tasks**
- **propagate endpoint and request context through the jwk cache**
- **only allow password based auth for md5 during testing**
- **remove auth info from conn info**
https://github.com/neondatabase/neon/pull/9028 changed the image layer
creation log into trace level. However, I personally find logging image
layer creation useful when reading the logs -- it makes it clear that
the image layer creation is happening and gives a clear idea of the
progress. Therefore, I propose to continue logging them for
create_image_layers set of functions.
## Summary of changes
* Add info logging for all image layers created in legacy compaction.
* Add info logging for all layers creation in testing functions.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Different keyspaces may require different floor LSNs in vectored
delta layer visits. This patch adds support for such cases.
## Summary of changes
Different keyspaces wishing to read the same layer might
require different stop lsns (or lsn floor). The start LSN
of the read (or the lsn ceil) will always be the same.
With this observation, we fix skipping of image layers by
indexing the fringe by layer id plus lsn floor.
This is very simple, but means that we can visit delta layers twice
in certain cases. Still, I think it's very unlikely for any extra
merging to have taken place in this case, so perhaps it makes sense to go
with the simpler patch.
Fixes https://github.com/neondatabase/neon/issues/9012
Alternative to https://github.com/neondatabase/neon/pull/9025
This constant in 'tenant_conf_defaults' was unused, but there's
another constant with the same name in the global 'defaults'. I wish
the setting was configurable per-tenant, but it isn't, so let's remove
the confusing duplicate.
The DEFAULT_CONCURRENT_TENANT_SIZE_LOGICAL_SIZE_QUERIES constant was
unused, because we had just hardcoded it to 1 where the constant
should've been used.
Remove the ConfigurableSemaphore::Default implementation, since it was
unused.
There's some more code that still checks for uninit and delete
markers, see callers of is_delete_mark and is_uninit_mark, and github
issue #5718. But these functions were outright dead.
Dead code is generally useless, but with Postgres constants in
particular, I'm also worried that if they're not used anywhere, we
might fail to update them at a Postgres version update, and get very
confused later when they have wrong values.
When I checked the log in Grafana I couldn't find the scrubber version.
Then I realized that it should be logged after the logger gets
initialized.
## Summary of changes
Log after initializing the logger for the scrubber.
Signed-off-by: Alex Chi Z <chi@neon.tech>
(Found this useful during investigation
https://github.com/neondatabase/cloud/issues/16886.)
Problem
-------
Before this PR, `neon_local` sequentially does the following:
1. launch storcon process
2. wait for storcon to signal readiness
[here](75310fe441/control_plane/src/storage_controller.rs (L804-L808))
3. start pageserver
4. wait for pageserver to become ready
[here](c43e664ff5/control_plane/src/pageserver.rs (L343-L346))
5. etc
The problem is that storcon's readiness waits for the
[`startup_reconcile`](cbcd4058ed/storage_controller/src/service.rs (L520-L523))
to complete.
But pageservers aren't started at this point.
So, worst case we wait for `STARTUP_RECONCILE_TIMEOUT/2`, i.e., 15s.
This is more than the 10s default timeout allowed by neon_local.
So, the result is that `neon_local start` fails to start storcon and
stops everything.
Solution
--------
In this PR I choose the the radical solution to start everything in
parallel.
It junks up the output because we do stuff like `print!(".")` to
indicate progress.
We should just abandon that.
And switch to `utils::logging` + `tracing` with separate spans for each
component.
I can do that in this PR or we leave it as a follow-up.
Alternatives Considered
-----------------------
The Pageserver's `/v1/status` or in fact any endpoint of the mgmt API
will not `accept()` on the mgmt API socket until after the `re-attach`
call to storcon returned success.
So, it's insufficient to change the startup order to start Pageservers
first.
We cannot easily change Pageserver startup order because
`init_tenant_mgr` must complete before we start serving the mgmt API.
Otherwise tenant detach calls et al can race with `init_tenant_mgr`.
We'd have to add a "loading" state to tenant mgr and make all API
endpoints except `/v1/status` wait for _that_ to complete.
Related
-------
- https://github.com/neondatabase/neon/pull/6475
## Problem
It turns out the previous approach (with `skip_if` input) doesn't work
(from https://github.com/neondatabase/neon/pull/9017).
Revert it and use more straightforward if-conditions
## Summary of changes
- Revert efbe8db7f1
- Add if-condition to`promote-compatibility-data` job and relevant
comments
Commit ca5390a89d made a similar change to DeltaLayerWriter.
We bumped into this with Stas with our hackathon project, to create a
standalong program to create image layers directly from a Postgres data
directory. It needs to create image layers without having a Timeline and
other pageserver machinery.
This downgrades the "created image layer {}" message from INFO to TRACE
level. TRACE is used for the corresponding message on delta layer
creation too. The path logged in the message is now the temporary path,
before the file is renamed to its final name. Again commit ca5390a89d
made the same change for the message on delta layer creation.
There's currently no way to just start/stop broker from `neon_local`.
This PR
* adds a sub-command
* uses that sub-command from the test suite instead of the pre-existing
Python `subprocess` based approach.
Found this useful during investigation
https://github.com/neondatabase/cloud/issues/16886.
## Problem
We do use `actions/checkout` with `fetch-depth: 0` when it's not
required
## Summary of changes
- Remove unneeded `fetch-depth: 0`
- Add a comment if `fetch-depth: 0` is required
We added another migration in 5876c441ab,
but didn't bump this value. This had no effect, but best to fix it
anyway.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
We've got 2 non-blocking failures on the release pipeline:
- `promote-compatibility-data` job got skipped _presumably_ because one
of the dependencies of `deploy` job (`push-to-acr-dev`) got skipped
(https://github.com/neondatabase/neon/pull/8940)
- `coverage-report` job fails because we don't build debug artifacts in
the release branch (https://github.com/neondatabase/neon/pull/8561)
## Summary of changes
- Always run `push-to-acr-dev` / `push-to-acr-prod` jobs, but add
`skip_if` parameter to the reusable workflow, which can skip the job
internally, without skipping externally
- Do not run `coverage-report` on release branches
## Problem
It turns out that we can't rely on external orchestration to promptly
route trafic to the new leader. This is downtime inducing.
Forwarding provides a safe way out.
## Safety
We forward when:
1. Request is not one of ["/control/v1/step_down", "/status", "/ready",
"/metrics"]
2. Current instance is in [`LeadershipStatus::SteppedDown`] state
3. There is a leader in the database to forward to
4. Leader from step (3) is not the current instance
If a storcon instance is persisted in the database, then we know that it
is the current leader.
There's one exception: time between handling step-down request and the
new leader updating the
database.
Let's treat the happy case first. The stepped down node does not produce
any side effects,
since all request handling happens on the leader.
As for the edge case, we are guaranteed to always have a maximum of two
running instances.
Hence, if we are in the edge case scenario the leader persisted in the
database is the
stepped down instance that received the request. Condition (4) above
covers this scenario.
## Summary of changes
* Conversion utilities for reqwest <-> hyper. I'm not happy with these,
but I don't see a better way. Open to suggestions.
* Add request forwarding logic
* Update each request handler. Again, not happy with this. If anyone
knows a nice to wrap the handlers, lmk. Me and Joonas tried :/
* Update each handler to maybe forward
* Tweak tests to showcase new behaviour
pg_distrib_dir doesn't include the Postgres version and only depends
on env variables which cannot change during a test run, so it can be
marked as session-scoped. Similarly, the platform cannot change during
a test run.
There was another copy of it in utils.py. The only difference is that
the version in utils.py tolerates files that are concurrently
removed. That seems fine for the few callers in neon_fixtures.py too.
This should generally be faster when running tests, especially those
that run with higher scales.
Ignoring test_lfc_resize since it seems like we are hitting a query
timeout for some reason that I have yet to investigate. A little bit of
improvemnt is better than none.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
`gather-rust-build-stats` extra CI job fails with
```
"PQ_LIB_DIR" doesn't exist in the configured path: "/__w/neon/neon/pg_install/v16/lib"
```
## Summary of changes
- Use the path to Postgres 17 for the `gather-rust-build-stats` job.
The job uses Postgres built by `make walproposer-lib`
Most extensions are not required to run Neon-based PostgreSQL, but the
Neon extension is _quite_ critical, so let's make sure we include it.
## Problem
Staging doesn't have working compute images for PG17
## Summary of changes
Disable some PG17 filters so that we get the critical components into the PG17 image
This adds preliminary PG17 support to Neon, based on RC1 / 2024-09-04
07b828e9d4
NOTICE: The data produced by the included version of the PostgreSQL fork
may not be compatible with the future full release of PostgreSQL 17 due to
expected or unexpected future changes in magic numbers and internals.
DO NOT EXPECT DATA IN V17-TENANTS TO BE COMPATIBLE WITH THE 17.0
RELEASE!
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Makes it consistent with the "timeline create" and "timeline import"
commands, which allowed you to pass the timeline id as argument. This
also makes it unnecessary to parse the timeline ID from the output in
the python function that calls it.
The tenant ID was not actually generated here but in NeonEnvBuilder.
And the "neon_local init" command hasn't been able to generate the
initial tenant since 8712e1899e anyway.
## Problem
Having run in production for a while, we see that nodes are generally
safely oversubscribed by about a factor of 2.
## Summary of changes
Tweak the is_overloaded method to check for utililzation over 200%
rather than over 100%
## Problem
In https://github.com/neondatabase/neon/pull/8621, validation of keys
during ingest was removed because the places where we actually store
keys are now past the point where we have already converted them to
CompactKey (i128) representation.
## Summary of changes
Reinstate validation at an earlier stage in ingest. This doesn't cover
literally every place we write a key, but it covers most cases where
we're trusting postgres to give us a valid key (i.e. one that doesn't
try and use a custom spacenode).
The current code assumes that most of this functionality is
version-independent, which is only true up to v16 - PostgreSQL 17 has a
new field in CheckPoint that we need to keep track of.
This basically removes the file-level dependency on v14, and replaces it
with switches that load the correct version dependencies where required.
PR #7782 set the dependency in Cargo.toml to 'master', and locked the
version to commit that contained a specific fix, because we needed the
fix before it was included in a versioned release. The fix was later
included in parquet crate version 52.0.0, so we can now switch back to
using a released version. The latest release is 53.0.0, switch straight
to that.
---------
Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>
close https://github.com/neondatabase/neon/issues/8838
## Summary of changes
This patch modifies the split delta layer writer to avoid taking
start_key and end_key when creating/finishing the layer writer. The
start_key for the delta layers will be the first key provided to the
layer writer, and the end_key would be the `last_key.next()`. This
simplifies the delta layer writer API.
On that, the layer key hack is removed. Image layers now use the full
key range, and delta layers use the first/last key provided by the user.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
I wish it worked, but it's been broken for a long time, so let's admit
defeat and remove it.
The idea of sharing the same pageserver and safekeeper environment
between tests is still sound, and it could save a lot of time in our
CI. We should perhaps put some time into doing that, but we're better
off starting from scratch than trying to make TEST_SHARED_FIXTURES
work in its current form.
I wanted to use some features from the newer version. The PR that needed
the new version is not ready yet (and might never be), but seems nice to
stay up in any case.
We modified the crate in an incompatible way and upgraded to the new
version in PR #8076. However, it was reverted in #8654. The revert
reverted the Cargo.lock reference to it, but since Cargo.toml still
points to the (tip of the) 'neon' branch, every time you make any other
unrelated changes to Cargo.toml, it also tries to update the
rust-postgres crates to the tip of the 'neon' branch again, which
doesn't work.
To fix, lock the crates to the exact commit SHA that works.
Currently using gc blocking and unblocking with storage controller
managed pageservers is painful. Implement the API on storage controller.
Fixes: #8893
For control-plane managed tenants, we have the page in the admin console
that lists all tenants on a specific pageserver. But for
storage-controller managed ones, we don't have that functionality for
now.
## Summary of changes
Adds an API that lists all shards on a given node (intention + observed)
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
One of the PRs opened by a `neondatabase` org member got labelled as
`external` because the `gh api` call failed in the wrong way:
```
Get "https://api.github.com/orgs/neondatabase/members/<username>": dial tcp 140.82.114.5:443: i/o timeout
is-member=false
```
## Summary of changes
- Check that the error message is expected before labelling PRs
- Retry `gh api` call for 10 times in case of unexpected error messages
- Add `workflow_dispatch` trigger
When walproposer observes now higher term it restarts instead of
crashing whole compute with PANIC; this avoids compute crash after
term_bump call. After successfull election we're still checking
last_log_term of the highest given vote to ensure basebackup is good,
and PANIC otherwise.
It will be used for migration per
035-safekeeper-dynamic-membership-change.md
and
https://github.com/neondatabase/docs/pull/21
ref https://github.com/neondatabase/neon/issues/8700
Check that truncation point is not from the future by comparing it with
write_record_lsn, not write_lsn, and explain that xlog switch changes
their normal order.
ref https://github.com/neondatabase/neon/issues/8911
Addresses the 1.82 beta clippy lint `too_long_first_doc_paragraph` by
adding newlines to the first sentence if it is short enough, and making
a short first sentence if there is the need.
## Problem
We want to do AZ aware scheduling, but don't have enough metadata.
## Summary of changes
Introduce a `preferred_az_id` concept for each managed tenant shard.
In a future PR, the scheduler will use this as a soft preference.
The idea is to try and keep the shard attachments within the same AZ.
Under the assumption that the compute was placed in the correct AZ,
this reduces the chances of cross AZ trafic from between compute and PS.
In terms of code changes we:
1. Add a new nullable `preferred_az_id` column to the `tenant_shards`
table. Also include an in-memory counterpart.
2. Populate the preferred az on tenant creation and shard splits.
3. Add an endpoint which allows to bulk-set preferred AZs.
(3) gives us the migration path. I'll write a script which queries the
cplane db in the region and sets the preferred az of all shards with an
active compute to the AZ of said compute. For shards without an active compute,
I'll use the AZ of the currently attached pageserver
since this is what cplane uses now to schedule computes.
## Problem
https://github.com/neondatabase/neon/pull/8852 introduced a new nullable
column for the `nodes` table: `availability_zone_id`
## Summary of changes
* Make neon local and the test suite always provide an az id
* Make the az id field in the ps registration request mandatory
* Migrate the column to non-nullable and adjust in memory state
accordingly
* Remove the code that was used to populate the az id for pre-existing
nodes
## Problem
Building on MacOS failed due to missing m4. Although a window was
popping up claiming to install m4, this was not helping.
## Summary of changes
Add instructions to install m4 using brew and link it (thanks to Folke
for helping).
Sometimes, the benchmarks fail to start up pageserver in 10s without any
obvious reason. Benchmarks run sequentially on otherwise idle runners.
Try running `sync(2)` after each bench to force a cleaner slate.
Implement this via:
- SYNC_AFTER_EACH_TEST environment variable enabled autouse fixture
- autouse fixture seems to be outermost fixture, so it works as expected
- set SYNC_AFTER_EACH_TEST=true for benchmarks in build_and_test
workflow
Evidence:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/10678984691/index.html#suites/5008d72a1ba3c0d618a030a938fc035c/1210266507534c0f/
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
This PR simplifies the pageserver configuration parsing as follows:
* introduce the `pageserver_api::config::ConfigToml` type
* implement `Default` for `ConfigToml`
* use serde derive to do the brain-dead leg-work of processing the toml
document
* use `serde(default)` to fill in default values
* in `pageserver` crate:
* use `toml_edit` to deserialize the pageserver.toml string into a
`ConfigToml`
* `PageServerConfig::parse_and_validate` then
* consumes the `ConfigToml`
* destructures it exhaustively into its constituent fields
* constructs the `PageServerConfig`
The rules are:
* in `ConfigToml`, use `deny_unknown_fields` everywhere
* static default values go in `pageserver_api`
* if there cannot be a static default value (e.g. which default IO
engine to use, because it depends on the runtime), make the field in
`ConfigToml` an `Option`
* if runtime-augmentation of a value is needed, do that in
`parse_and_validate`
* a good example is `virtual_file_io_engine` or `l0_flush`, both of
which need to execute code to determine the effective value in
`PageServerConf`
The benefits:
* massive amount of brain-dead repetitive code can be deleted
* "unused variable" compile-time errors when removing a config value,
due to the exhaustive destructuring in `parse_and_validate`
* compile-time errors guide you when adding a new config field
Drawbacks:
* serde derive is sometimes a bit too magical
* `deny_unknown_fields` is easy to miss
Future Work / Benefits:
* make `neon_local` use `pageserver_api` to construct `ConfigToml` and
write it to `pageserver.toml`
* This provides more type safety / coompile-time errors than the current
approach.
### Refs
Fixes#3682
### Future Work
* `remote_storage` deser doesn't reject unknown fields
https://github.com/neondatabase/neon/issues/8915
* clean up `libs/pageserver_api/src/config.rs` further
* break up into multiple files, at least for tenant config
* move `models` as appropriate / refine distinction between config and
API models / be explicit about when it's the same
* use `pub(crate)` visibility on `mod defaults` to detect stale values
## Problem
A tenant may ingest a lot of data between being drained for node restart
and being moved back
in the fill phase. This is expensive and causes the fill to stall.
## Summary of changes
We make a tactical change to reduce secondary warm-up time for
migrations in fills.
Part of https://github.com/neondatabase/neon/issues/8623
## Summary of changes
It seems that we have tenants with aux policy set to v1 but don't have
any aux files in the storage. It is still safe to force migrate them
without notifying the customers. This patch adds more details to the
warning to identify the cases where we have to reach out to the users
before retiring aux v1.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
There was a confusion on the REL_14_STABLE_neon branch. PR
https://github.com/neondatabase/postgres/pull/471 was merged ot the
branch, but the corresponding PRs on the other REL_15_STABLE_neon and
REL_16_STABLE_neon branches were not merged. Also, the submodule
reference in the neon repository was never updated, so even though the
REL_14_STABLE_neon branch contained the commit, it was never used.
That PR https://github.com/neondatabase/postgres/pull/471 was a few
bricks shy of a load (no tests, some differences between the different
branches), so to get us to a good state, revert that change from the
REL_14_STABLE_neon branch. This PR in the neon repository updates the
submodule reference past two commites on the REL_14_STABLE_neon branch:
first the commit from PR
https://github.com/neondatabase/postgres/pull/471, and immediately after
that the revert of the same commit. This brings us back to square one,
but now the submodule reference matches the tip of the
REL_14_STABLE_neon branch again.
## Problem
The initial implementation of the validate API treats the in-memory
generations as authoritative.
- This is true when only one storage controller is running, but if a
rogue controller was running that hadn't been shut down properly, and
some pageserver requests were routed to that bad controller, it could
incorrectly return valid=true for stale generations.
- The generation in the main in-memory map gets out of date while a live
migration is in flight, and if the origin location for the migration
tries to do some deletions even though it is in AttachedStale (for
example because it had already started compaction), these might be
wrongly validated + executed.
## Summary of changes
- Continue to do the in-memory check: if this returns valid=false it is
sufficient to reject requests.
- When valid=true, do an additional read from the database to confirm
the generation is fresh.
- Revise behavior for validation on missing shards: this used to always
return valid=true as a convenience for deletions and shard splits, so
that pageservers weren't prevented from completing any enqueued
deletions for these shards after they're gone. However, this becomes
unsafe when we consider split brain scenarios. We could reinstate this
in future if we wanted to store some tombstones for deleted shards.
- Update test_scrubber_physical_gc to cope with the behavioral change:
they must now explicitly flush the deletion queue before splits, to
avoid tripping up on deletions that are enqueued at the time of the
split (these tests assert "scrubber deletes nothing", which check fails
if the split leaves behind some remote objects that are legitimately
GC'able)
- Add `test_storage_controller_validate_during_migration`, which uses
failpoints to create a situation where incorrect generation validation
during a live migration could result in a corruption
The rate of validate calls for tenants is pretty low: it happens as a
consequence deletions from GC and compaction, which are both
concurrency-limited on the pageserver side.
Commit cfa45ff5ee (PR #8860) updated the vendor/postgres submodules, but
didn't use the same commit SHAs that were pushed as the corresponding
REL_*_STABLE_neon branches in the postgres repository. The contents were
the same, but the REL_*_STABLE_neon branches pointed to squashed
versions of the commits, whereas the SHAs used in the submodules
referred to the pre-squash revisions.
Note: The vendor/postgres-v14 submodule still doesn't match with the tip
of REL_14_STABLE_neon branch, because there has been one more commit on
that branch since then. That's another confusion which we should fix,
but let's do that separately. This commit doesn't change the code that
gets built in any way, only changes the submodule references to point to
the correct SHAs in the REL_*_STABLE_neon branch histories, rather than
some detached commits.
We currently do not record safekeepers in the storage controller
database. We want to migrate timelines across safekeepers eventually, so
start recording the safekeepers on deploy.
Cc: #8698
## Problem
Each test might wait for up to 5s in order to HB the pageserver.
## Summary of changes
Make the heartbeat interval configurable and use a really tight one for
neon local => startup quicker
ref https://github.com/neondatabase/neon/issues/8872
## Summary of changes
We saw stuck storage scrubber in staging caused by infinite retries. I
believe here we should use `min` instead of `max` to avoid getting
minutes or hours of retry backoff.
Signed-off-by: Alex Chi Z <chi@neon.tech>
Set the field to optional, otherwise there will be decode errors when
newer version of the storage controller receives the JSON from older
version of the pageservers.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Currently, DatadirModification keeps a key-indexed map of all pending
writes, even though we (almost) never need to read back dirty pages for
anything other than metadata pages (e.g. relation sizes).
Related: https://github.com/neondatabase/neon/issues/6345
## Summary of changes
- commit() modifications before ingesting database creation wal records,
so that they are guaranteed to be able to get() everything they need
directly from the underlying Timeline.
- Split dirty pages in DatadirModification into pending_metadata_pages
and pending_data_pages. The data ones don't need to be in a
key-addressable format, so they just go in a Vec instead.
- Special case handling of zero-page writes in DatadirModification,
putting them in a map which is flushed on the end of a WAL record. This
handles the case where during ingest, we might first write a zero page,
and then ingest a postgres write to that page. We used to do this via
the key-indexed map of writes, but in this PR we change the data page
write path to not bother indexing these by key.
My least favorite thing about this PR is that I needed to change the
DatadirModification interface to add the on_record_end call. This is not
very invasive because there's really only one place we use it, but it
changes the object's behaviour from being clearly an aggregation of many
records to having some per-record state. I could avoid this by
implicitly doing the work when someone calls set_lsn or commit -- I'm
open to opinions on whether that's cleaner or dirtier.
## Performance
There may be some efficiency improvement here, but the primary
motivation is to enable an earlier stage of ingest to operate without
access to a Timeline. The `pending_data_pages` part is the "fast path"
bulk write data that can in principle be generated without a Timeline,
in parallel with other ingest batches, and ultimately on the safekeeper.
`test_bulk_insert` on AX102 shows approximately the same results as in
the previous PR #8591:
```
------------------------------ Benchmark results -------------------------------
test_bulk_insert[neon-release-pg16].insert: 23.577 s
test_bulk_insert[neon-release-pg16].pageserver_writes: 5,428 MB
test_bulk_insert[neon-release-pg16].peak_mem: 637 MB
test_bulk_insert[neon-release-pg16].size: 0 MB
test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB
test_bulk_insert[neon-release-pg16].num_files_uploaded: 8
test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB
test_bulk_insert[neon-release-pg16].wal_recovery: 18.264 s
test_bulk_insert[neon-release-pg16].compaction: 0.052 s
```
wal_storage.rs already checks this, but since this is a quite legit scenario
check it at safekeeper.rs (consensus level) as well.
ref https://github.com/neondatabase/neon/issues/8212
This is a take 2; previous PR #8640 had been reverted because interplay
with another change broke test_last_log_term_switch.
Endpoint implementation sends msg to manager requesting to do the
reset. Manager stops current partial backup upload task if it exists and
performs the reset.
Also slightly tweak eviction condition: all full segments before
flush_lsn must be uploaded (and committed) and there must be only one
segment left on disk (partial). This allows to evict timelines which
started not on the first segment and didn't fill the whole
segment (previous condition wasn't good because last_removed_segno was
0).
ref https://github.com/neondatabase/neon/issues/8759
## Problem
Neon local set-up does not inject an az id in `metadata.json`. See real
change in https://github.com/neondatabase/neon/pull/8852.
## Summary of changes
We piggyback on the existing `availability_zone` pageserver
configuration in order to avoid making neon local even more complex.
## Problem
Metrics event idempotency keys differ across S3 and Vector. The events
should be identical.
Resolves#8605.
## Summary of changes
Pre-generate the idempotency keys and pass the same set into both
metrics sinks.
Co-authored-by: John Spray <john@neon.tech>
Implement the timeline specific `archival_config` endpoint also in the
storage controller.
It's mostly a copy-paste of the detach handler: the task is the same: do
the same operation on all shards.
Part of #8088.
## Problem
This is a followup to #8783
- The old blocking ensure_attached function had been retained to handle
the case where a shard had a None generation_pageserver, but this wasn't
really necessary.
- There was a subtle `.1` in the code where a struct would have been
clearer
Closes#8819
## Summary of changes
- Add ShardGenerationState to represent the results of peek_generation
- Instead of calling ensure_attached when a tenant has a non-attached
shard, check the shard's policy and return 409 if it isn't Attached,
else return 503 if the shard's policy is attached but it hasn't been
reconciled yet (i.e. has a None generation_pageserver)
The test is very rudimentary, it only checks that before and after
tenant deletion, we can run `scan_metadata` for the safekeeper node
kind. Also, we don't actually expect any uploaded data, for that we
don't have enough WAL (needs to create at least one S3-uploaded file,
the scrubber doesn't recognize partial files yet).
The `scan_metadata` scrubber subcommand is extended to support either
specifying a database connection string, which was previously the only
way, and required a database to be present, or specifying the timeline
information manually via json. This is ideal for testing scenarios
because in those, the number of timelines is usually limited,
but it is involved to spin up a database just to write the timeline
information.
The pull request https://github.com/neondatabase/neon/pull/8679
explicitly mentioned that it will evict layers earlier than before.
Given that the eviction metrics is solely based on eviction threshold
(which is 86400s now), we should consider the early eviction and do not
fire alert if it's a covered layer.
## Summary of changes
Record eviction timer only when the layer is visible + accessed.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Summary of changes
- Setting default io_buffer_alignment to 512 bytes.
- Fix places that assumed `DEFAULT_IO_BUFFER_ALIGNMENT=0`
- Adapt unit tests to handle merge with `chunk size <= 4096`.
## Testing and Performance
We have done sufficient performance de-risking.
Enabling it by default completes our correctness de-risking before the
next release.
Context: https://neondb.slack.com/archives/C07BZ38E6SD/p1725026845455259
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
When implementing bottom-most gc-compaction, we analyzed the structure
of layer maps that the current compaction algorithm could produce, and
decided to only support structures without delta layer overlaps and LSN
intersections with the exception of single key layers.
## Summary of changes
This patch adds the layer map valid check in the storage scrubber.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
`promote-compatibility-data` job got broken and slightly outdated after
- https://github.com/neondatabase/neon/pull/8552 -- we don't upload
artifacts for ARM64
- https://github.com/neondatabase/neon/pull/8561 -- we don't prepare
`debug` artifacts in the release branch anymore
## Summary of changes
- Promote artifacts from release PRs to the latest version (but do it
from `release` branch)
- Upload artifacts for both X64 and ARM64
## Problem
Live migration retries when it fails to notify the compute of the new
location. It should sleep between attempts.
Closes: https://github.com/neondatabase/neon/issues/8820
## Summary of changes
- Do an `exponential_backoff` in the retry loop for compute
notifications
`safekeeper::random_test test_random_schedules` debug test takes over 2
minutes to run on our arm runners. Running it 6 times with pageserver
settings seems redundant.
In #8863 I replaced the threadpool with tokio tasks, but there was a
behaviour I missed regarding cancellation. Adding the JoinHandle wrapper
that triggers abort on drop should fix this.
Another change, any panics that occur in password hashing will be
propagated through the resume_unwind functionality.
Removes additional async_trait usages from safekeeper and neon_local.
Also removes now redundant dependencies of the `async_trait` crate.
cc earlier work: #6305, #6464, #7303, #7342, #7212, #8296
It's better to reject invalid keys on the write path than storing it and
panic-ing the pageserver.
https://github.com/neondatabase/neon/issues/8636
## Summary of changes
If a key cannot be represented using i128, we don't allow writing that
key into the pageserver.
There are two versions of the check valid function: the normal one that
simply rejects i128 keys, and the stronger one that rejects all keys
that we don't support.
The current behavior when a key gets rejected is that safekeeper will
keep retrying streaming that key to the pageserver. And once such key
gets written, no new computes can be started. Therefore, there could be
a large amount of pageserver warnings if a key cannot be ingested. To
validate this behavior by yourself, the reviewer can (1) use the
stronger version of the valid check (2) run the following SQL.
```
set neon.regress_test_mode = true;
CREATE TABLESPACE regress_tblspace LOCATION '/Users/skyzh/Work/neon-test/tablespace';
CREATE SCHEMA testschema;
CREATE TABLE testschema.foo (i int) TABLESPACE regress_tblspace;
insert into testschema.foo values (1), (2), (3);
```
For now, I'd like to merge the patch with only rejecting non-i128 keys.
It's still unknown whether the stronger version covers all the cases
that basebackup doesn't support. Furthermore, the behavior of rejecting
a key will produce large amounts of warnings due to safekeeper retry.
Therefore, I'd like to reject the minimum set of keys that we don't
support (i128 ones) for now. (well, erroring out is better than panic on
`to_compact_key`)
The next step is to fix the safekeeper behavior (i.e., on such key
rejections, stop streaming WAL), so that we can properly stop writing.
An alternative solution is to simply drop these keys on the write path.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Some tests were very slow and some tests occasionally stalled. This PR
improves some test performance and replaces the custom threadpool in
order to fix the stalling of tests.
refs https://github.com/neondatabase/neon/issues/7524
Problem
-------
When browsing Pageserver logs, background loop iterations that take a
long time are hard to spot / easy to miss because they tend to not
produce any log messages unless:
- they overrun their period, but that's only one message when the
iteration completes late
- they do something that produces logs (e.g., create image layers)
Further, a slow iteration that is still running does will not
log nor bump the metrics of `warn_when_period_overrun`until _after_
it has finished. Again, that makes a still-running iteration hard to
spot.
Solution
--------
This PR adds a wrapper around the per-tenant background loops
that, while a slow iteration is ongoing, emit a log message
every $period.
If a timeline unarchival request comes in, give an error if the parent
timeline is archived. This prevents us from the situation of having an
archived timeline with children that are not archived.
Follow up of #8824
Part of #8088
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
# Motivation
In https://github.com/neondatabase/neon/pull/8832 I get tokio runtime
worker stack overflow errors in debug builds.
In a similar vein, I had tokio runtimer worker stack overflow when
trying to eliminate `async_trait`
(https://github.com/neondatabase/neon/pull/8296).
The 2MiB default is kind of arbitrary - so this PR bumps it to 4MiB.
It also adds an env var to control it.
# Risk Assessment
With our 4 runtimes, the worst case stack memory usage is `4 (runtimes)
* ($num_cpus (executor threads) + 512 (blocking pool threads)) * 4MiB`.
On i3en.3xlarge, that's `8384 MiB`.
On im4gn.2xlarge, that's `8320 MiB`.
Before this change, it was half that.
Looking at production metrics, we _do_ have the headroom to accomodate
this worst case case.
# Alternatives
The problems only occur with debug builds, so technically we could only
raise the stack size for debug builds.
However, it would be another configuration where `debug != release`.
# Future Work
If we ever enable single runtime mode in prod (=>
https://github.com/neondatabase/neon/issues/7312 ) then the worst case
will drop to 25% of its current value.
Eliminating the use of `tokio::spawn_blocking` / `tokio::fs` in favor of
`tokio-epoll-uring` (=> https://github.com/neondatabase/neon/issues/7370
) would reduce the worst case to `4 (runtimes) * $num_cpus (executor
threads) * 4 MiB`.
In proxy I switched to a leaky-bucket impl using the GCRA algorithm. I
figured I could share the code with pageserver and remove the
leaky_bucket crate dependency with some very basic tokio timers and
queues for fairness.
The underlying algorithm should be fairly clear how it works from the
comments I have left in the code.
---
In benchmarking pageserver, @problame found that the new implementation
fixes a getpage throughput discontinuity in pageserver under the
`pagebench get-page-latest-lsn` benchmark with the clickbench dataset
(`test_perf_olap.py`).
The discontinuity is that for any of `--num-clients={2,3,4}`, getpage
throughput remains 10k.
With `--num-clients=5` and greater, getpage throughput then jumps to the
configured 20k rate limit.
With the changes in this PR, the discontinuity is gone, and we scale
throughput linearly to `--num-clients` until the configured rate limit.
More context in
https://github.com/neondatabase/cloud/issues/16886#issuecomment-2315257641.
closes https://github.com/neondatabase/cloud/issues/16886
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
refs https://github.com/neondatabase/cloud/issues/13750
The logging in this commit will make it easier to detect lagging ingest.
We're trusting compute timestamps --- ideally we'd use SK timestmaps
instead.
But trusting the compute timestamp is ok for now.
## Problem
See #8620
## Summary of changes
Remove walloping of replorigin file because it is reconstructed by PS
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Part of #7497
## Problem
Static computes pinned at some fix LSN could be created initially within
PITR interval but eventually go out it. To make sure that Static
computes are not affected by GC, we need to start using the LSN lease
API (introduced in #8084) in compute_ctl.
## Summary of changes
**compute_ctl**
- Spawn a thread for when a static compute starts to periodically ping
pageserver(s) to make LSN lease requests.
- Add `test_readonly_node_gc` to test if static compute can read all
pages without error.
- (test will fail on main without the code change here)
**page_service**
- `wait_or_get_last_lsn` will now allow `request_lsn` less than
`latest_gc_cutoff_lsn` to proceed if there is a lease on `request_lsn`.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>
Part of [Epic: Bypass PageCache for user data
blocks](https://github.com/neondatabase/neon/issues/7386).
# Problem
`InMemoryLayer` still uses the `PageCache` for all data stored in the
`VirtualFile` that underlies the `EphemeralFile`.
# Background
Before this PR, `EphemeralFile` is a fancy and (code-bloated) buffered
writer around a `VirtualFile` that supports `blob_io`.
The `InMemoryLayerInner::index` stores offsets into the `EphemeralFile`.
At those offset, we find a varint length followed by the serialized
`Value`.
Vectored reads (`get_values_reconstruct_data`) are not in fact vectored
- each `Value` that needs to be read is read sequentially.
The `will_init` bit of information which we use to early-exit the
`get_values_reconstruct_data` for a given key is stored in the
serialized `Value`, meaning we have to read & deserialize the `Value`
from the `EphemeralFile`.
The L0 flushing **also** needs to re-determine the `will_init` bit of
information, by deserializing each value during L0 flush.
# Changes
1. Store the value length and `will_init` information in the
`InMemoryLayer::index`. The `EphemeralFile` thus only needs to store the
values.
2. For `get_values_reconstruct_data`:
- Use the in-memory `index` figures out which values need to be read.
Having the `will_init` stored in the index enables us to do that.
- View the EphemeralFile as a byte array of "DIO chunks", each 512 bytes
in size (adjustable constant). A "DIO chunk" is the minimal unit that we
can read under direct IO.
- Figure out which chunks need to be read to retrieve the serialized
bytes for thes values we need to read.
- Coalesce chunk reads such that each DIO chunk is only read once to
serve all value reads that need data from that chunk.
- Merge adjacent chunk reads into larger
`EphemeralFile::read_exact_at_eof_ok` of up to 128k (adjustable
constant).
3. The new `EphemeralFile::read_exact_at_eof_ok` fills the IO buffer
from the underlying VirtualFile and/or its in-memory buffer.
4. The L0 flush code is changed to use the `index` directly, `blob_io`
5. We can remove the `ephemeral_file::page_caching` construct now.
The `get_values_reconstruct_data` changes seem like a bit overkill but
they are necessary so we issue the equivalent amount of read system
calls compared to before this PR where it was highly likely that even if
the first PageCache access was a miss, remaining reads within the same
`get_values_reconstruct_data` call from the same `EphemeralFile` page
were a hit.
The "DIO chunk" stuff is truly unnecessary for page cache bypass, but,
since we're working on [direct
IO](https://github.com/neondatabase/neon/issues/8130) and
https://github.com/neondatabase/neon/issues/8719 specifically, we need
to do _something_ like this anyways in the near future.
# Alternative Design
The original plan was to use the `vectored_blob_io` code it relies on
the invariant of Delta&Image layers that `index order == values order`.
Further, `vectored_blob_io` code's strategy for merging IOs is limited
to adjacent reads. However, with direct IO, there is another level of
merging that should be done, specifically, if multiple reads map to the
same "DIO chunk" (=alignment-requirement-sized and -aligned region of
the file), then it's "free" to read the chunk into an IO buffer and
serve the two reads from that buffer.
=> https://github.com/neondatabase/neon/issues/8719
# Testing / Performance
Correctness of the IO merging code is ensured by unit tests.
Additionally, minimal tests are added for the `EphemeralFile`
implementation and the bit-packed `InMemoryLayerIndexValue`.
Performance testing results are presented below.
All pref testing done on my M2 MacBook Pro, running a Linux VM.
It's a release build without `--features testing`.
We see definitive improvement in ingest performance microbenchmark and
an ad-hoc microbenchmark for getpage against InMemoryLayer.
```
baseline: commit 7c74112b2a origin/main
HEAD: ef1c55c52e
```
<details>
```
cargo bench --bench bench_ingest -- 'ingest 128MB/100b seq, no delta'
baseline
ingest-small-values/ingest 128MB/100b seq, no delta
time: [483.50 ms 498.73 ms 522.53 ms]
thrpt: [244.96 MiB/s 256.65 MiB/s 264.73 MiB/s]
HEAD
ingest-small-values/ingest 128MB/100b seq, no delta
time: [479.22 ms 482.92 ms 487.35 ms]
thrpt: [262.64 MiB/s 265.06 MiB/s 267.10 MiB/s]
```
</details>
We don't have a micro-benchmark for InMemoryLayer and it's quite
cumbersome to add one. So, I did manual testing in `neon_local`.
<details>
```
./target/release/neon_local stop
rm -rf .neon
./target/release/neon_local init
./target/release/neon_local start
./target/release/neon_local tenant create --set-default
./target/release/neon_local endpoint create foo
./target/release/neon_local endpoint start foo
psql 'postgresql://cloud_admin@127.0.0.1:55432/postgres'
psql (13.16 (Debian 13.16-0+deb11u1), server 15.7)
CREATE TABLE wal_test (
id SERIAL PRIMARY KEY,
data TEXT
);
DO $$
DECLARE
i INTEGER := 1;
BEGIN
WHILE i <= 500000 LOOP
INSERT INTO wal_test (data) VALUES ('data');
i := i + 1;
END LOOP;
END $$;
-- => result is one L0 from initdb and one 137M-sized ephemeral-2
DO $$
DECLARE
i INTEGER := 1;
random_id INTEGER;
random_record wal_test%ROWTYPE;
start_time TIMESTAMP := clock_timestamp();
selects_completed INTEGER := 0;
min_id INTEGER := 1; -- Minimum ID value
max_id INTEGER := 100000; -- Maximum ID value, based on your insert range
iters INTEGER := 100000000; -- Number of iterations to run
BEGIN
WHILE i <= iters LOOP
-- Generate a random ID within the known range
random_id := min_id + floor(random() * (max_id - min_id + 1))::int;
-- Select the row with the generated random ID
SELECT * INTO random_record
FROM wal_test
WHERE id = random_id;
-- Increment the select counter
selects_completed := selects_completed + 1;
-- Check if a second has passed
IF EXTRACT(EPOCH FROM clock_timestamp() - start_time) >= 1 THEN
-- Print the number of selects completed in the last second
RAISE NOTICE 'Selects completed in last second: %', selects_completed;
-- Reset counters for the next second
selects_completed := 0;
start_time := clock_timestamp();
END IF;
-- Increment the loop counter
i := i + 1;
END LOOP;
END $$;
./target/release/neon_local stop
baseline: commit 7c74112b2a origin/main
NOTICE: Selects completed in last second: 1864
NOTICE: Selects completed in last second: 1850
NOTICE: Selects completed in last second: 1851
NOTICE: Selects completed in last second: 1918
NOTICE: Selects completed in last second: 1911
NOTICE: Selects completed in last second: 1879
NOTICE: Selects completed in last second: 1858
NOTICE: Selects completed in last second: 1827
NOTICE: Selects completed in last second: 1933
ours
NOTICE: Selects completed in last second: 1915
NOTICE: Selects completed in last second: 1928
NOTICE: Selects completed in last second: 1913
NOTICE: Selects completed in last second: 1932
NOTICE: Selects completed in last second: 1846
NOTICE: Selects completed in last second: 1955
NOTICE: Selects completed in last second: 1991
NOTICE: Selects completed in last second: 1973
```
NB: the ephemeral file sizes differ by ca 1MiB, ours being 1MiB smaller.
</details>
# Rollout
This PR changes the code in-place and is not gated by a feature flag.
We get many HTTP connect timeout errors from scrubber logs, and it
turned out that the scrubber is retrying, and this is not an actual
error. In the future, we should revisit all places where we log errors
in the storage scrubber, and only error when necessary (i.e., errors
that might need manual fixing)
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
In order to build AZ aware scheduling, the storage controller needs to
know what AZ pageservers are in.
Related https://github.com/neondatabase/neon/issues/8848
## Summary of changes
This patch set adds a new nullable column to the `nodes` table:
`availability_zone_id`. The node registration
request is extended to include the AZ id (pageservers already have this
in their `metadata.json` file).
If the node is already registered, then we update the persistent and
in-memory state with the provided AZ.
Otherwise, we add the node with the AZ to begin with.
A couple assumptions are made here:
1. Pageserver AZ ids are stable
2. AZ ids do not change over time
Once all pageservers have a configured AZ, we can remove the optionals
in the code and make the database column not nullable.
Part of #8130, closes#8719.
## Problem
Currently, vectored blob io only coalesce blocks if they are immediately
adjacent to each other. When we switch to Direct IO, we need a way to
coalesce blobs that are within the dio-aligned boundary but has gap
between them.
## Summary of changes
- Introduces a `VectoredReadCoalesceMode` for `VectoredReadPlanner` and
`StreamingVectoredReadPlanner` which has two modes:
- `AdjacentOnly` (current implementation)
- `Chunked(<alignment requirement>)`
- New `ChunkedVectorBuilder` that considers batching `dio-align`-sized
read, the start and end of the vectored read will respect
`stx_dio_offset_align` / `stx_dio_mem_align` (`vectored_read.start` and
`vectored_read.blobs_at.first().start_offset` will be two different
value).
- Since we break the assumption that blobs within single `VectoredRead`
are next to each other (implicit end offset), we start to store blob end
offsets in the `VectoredRead`.
- Adapted existing tests to run in both `VectoredReadCoalesceMode`.
- The io alignment can also be live configured at runtime.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
## Problem
Storage controller upgrades (restarts, more generally) can cause
multi-second availability gaps.
While the storage controller does not sit on the main data path, it's
generally not acceptable
to block management requests for extended periods of time (e.g.
https://github.com/neondatabase/neon/issues/8034).
## Summary of changes
This RFC describes the issues around the current storage controller
restart procedure
and describes an implementation which reduces downtime to a few
milliseconds on the happy path.
Related https://github.com/neondatabase/neon/issues/7797
## Problem
When folks open github issues for feature requests, they don't have a
clear recipient: engineers usually see them during bug triage, but that
doesn't necessarily get the work prioritized.
## Summary of changes
Give end users a clearer path to submitting feedback to Neon
Protocol version 2 has been the default for a while now, and we no
longer have any computes running in production that used protocol
version 1. This completes the migration by removing support for v1 in
both the pageserver and the compute.
See issue #6211.
## Problem
Currently, we compare `neon.safekeepers` values as is, so we
unnecessarily restart walproposer even if safekeepers set didn't change.
This leads to errors like:
```log
FATAL: [WP] restarting walproposer to change safekeeper list
from safekeeper-8.us-east-2.aws.neon.tech:6401,safekeeper-11.us-east-2.aws.neon.tech:6401,safekeeper-10.us-east-2.aws.neon.tech:6401
to safekeeper-11.us-east-2.aws.neon.tech:6401,safekeeper-8.us-east-2.aws.neon.tech:6401,safekeeper-10.us-east-2.aws.neon.tech:6401
```
## Summary of changes
Split the GUC into the list of individual safekeepers and properly
compare. We could've done that somewhere on the upper level, e.g.,
control plane, but I think it's still better when the actual config
consumer is smarter and doesn't rely on upper levels.
## Problem
pg_hintplan test seems to be flaky, sometimes it fails, while usually it
passes
## Summary of changes
The regression test is changed to filter out the Neon service queries. The
expected file is changed as well.
## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
Routes and their handlers were in a bit different order in 1) routes
list 2) their implementation 3) python client 4) openapi spec, making
addition of new ones intimidating. Make it the same everywhere, roughly
lexicographically but preserving some of existing logic.
No functional changes.
Part of #8002, the final big PR in the batch.
## Summary of changes
This pull request uses the new split layer writer in the gc-compaction.
* It changes how layers are split. Previously, we split layers based on
the original split point, but this creates too many layers
(test_gc_feedback has one key per layer).
* Therefore, we first verify if the layer map can be processed by the
current algorithm (See https://github.com/neondatabase/neon/pull/8191,
it's basically the same check)
* On that, we proceed with the compaction. This way, it creates a large
enough layer close to the target layer size.
* Added a new set of functions `with_discard` in the split layer writer.
This helps us skip layers if we are going to produce the same persistent
key.
* The delta writer will keep the updates of the same key in a single
file. This might create a super large layer, but we can optimize it
later.
* The split layer writer is used in the gc-compaction algorithm, and it
will split layers based on size.
* Fix the image layer summary block encoded the wrong key range.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
Co-authored-by: Christian Schwarz <christian@neon.tech>
refs https://github.com/neondatabase/neon/issues/6989
Problem
-------
After unclean shutdown, we get restarted, start reading the local
filesystem,
and make decisions based on those reads. However, some of the data might
have
not yet been fsynced when the unclean shutdown completed.
Durability matters even though Pageservers are conceptually just a cache
of state in S3. For example:
- the cloud control plane is no control loop => pageserver responses
to tenant attachmentm, etc, needs to be durable.
- the storage controller does not rely on this (as much?)
- we don't have layer file checksumming, so, downloaded+renamed but not
fsynced layer files are technically not to be trusted
- https://github.com/neondatabase/neon/issues/2683
Solution
--------
`syncfs` the tenants directory during startup, before we start reading
from it.
This is a bit overkill because we do remove some temp files
(InMemoryLayer!)
later during startup. Further, these temp files are particularly likely
to
be dirty in the kernel page cache. However, we don't want to refactor
that
cleanup code right now, and the dirty data on pageservers is generally
not that high. Last, with [direct
IO](https://github.com/neondatabase/neon/issues/8130) we're going to
have near-zero kernel page cache anyway quite soon.
This PR:
* Implements the rule that archived timelines require all of their
children to be archived as well, as specified in the RFC. There is no
fancy locking mechanism though, so the precondition can still be broken.
As a TODO for later, we still allow unarchiving timelines with archived
parents.
* Adds an `is_archived` flag to `TimelineInfo`
* Adds timeline_archival_config to `PageserverHttpClient`
* Adds a new `test_timeline_archive` test, loosely based on
`test_timeline_delete`
Part of #8088
## Problem
We need some metric to sneak peek into how many people use inbound
logical replication (Neon is a subscriber).
## Summary of changes
This commit adds a new metric `compute_subscriptions_count`, which is
number of subscriptions grouped by enabled/disabled state.
Resolves: neondatabase/cloud#16146
Add binary for local-proxy that uses the local auth backend. Runs only
the http serverless driver support and offers config reload based on a
config file and SIGHUP
## Problem
- If a reconciler was waiting to be able to notify computes about a
change, but the control plane was waiting for the controller to finish a
timeline creation/deletion, the overall system can deadlock.
- If a tenant shard was migrated concurrently with a timeline
creation/deletion, there was a risk that the timeline operation could be
applied to a non-latest-generation location, and thereby not really be
persistent. This has never happened in practice, but would eventually
happen at scale.
Closes: #8743
## Summary of changes
- Introduce `Service::tenant_remote_mutation` helper, which looks up
shards & generations and passes them into an inner function that may do
remote I/O to pageservers. Before returning success, this helper checks
that generations haven't incremented, to guarantee that changes are
persistent.
- Convert tenant_timeline_create, tenant_timeline_delete, and
tenant_timeline_detach_ancestor to use this helper.
- These functions no longer block on ensure_attached unless the tenant
was never attached at all, so they should make progress even if we can't
complete compute notifications.
This increases the database load from timeline/create operations, but
only with cheap read transactions.
## Problem
Previously, the controller only used the shard counts for scheduling.
This works well when hosting only many-sharded tenants, but works much
less well when hosting single-sharded tenants that have a greater
deviation in size-per-shard.
Closes: https://github.com/neondatabase/neon/issues/7798
## Summary of changes
- Instead of UtilizationScore, carry the full PageserverUtilization
through into the Scheduler.
- Use the PageserverUtilization::score() instead of shard count when
ordering nodes in scheduling.
Q: Why did test_sharding_split_smoke need updating in this PR?
A: There's an interesting side effect during shard splits: because we do
not decrement the shard count in the utilization when we de-schedule the
shards from before the split, the controller will now prefer to pick
_different_ nodes for shards compared with which ones held secondaries
before the split. We could use our knowledge of splitting to fix up the
utilizations more actively in this situation, but I'm leaning toward
leaving the code simpler, as in practical systems the impact of one
shard on the utilization of a node should be fairly low (single digit
%).
In case of corrupted delta layers, we can detect the corruption and bail
out the compaction.
## Summary of changes
* Detect wrong delta desc of key range
* Detect unordered deltas
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We have been naughty and curl-ed storcon to fix-up drains and fills.
## Summary of changes
Add support for starting/cancelling drain/fill operations via
`storcon_cli`.
close https://github.com/neondatabase/neon/issues/8579
## Summary of changes
The `is_l0` check now takes both layer key range and the layer type.
This allows us to have image layers covering the full key range in
btm-most compaction (upcoming PR). However, we still don't allow delta
layers to cover the full key range, and I will make btm-most compaction
to generate delta layers with the key range of the keys existing in the
layer instead of `Key::MIN..Key::HACK_MAX` (upcoming PR).
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Failed / flaky tests for different arches don't have any difference in
GitHub Autocomment
## Summary of changes
- Add arch to build type for GitHub autocomment
Our new policy is to use the "rebase" method and slice all the Neon
commits into a nice patch set when doing a new major version, and use
"merge" method on minor version upgrades on the release branches.
"git merge" preserves the git history of Neon commits on the Postgres
branches. While it's nice to rebase all the Neon changes to a logical
patch set against upstream, having to do it between every minor release
is a fair amount work, and it loses the history, and is more
error-prone.
part of https://github.com/neondatabase/neon/issues/8623
We want to discover potential aux v1 customers that we might have missed
from the migrations.
## Summary of changes
Log warnings on basebackup, load timeline, and the first put_file.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
See https://neondb.slack.com/archives/C07J14D8GTX/p1724347552023709
Manipulations with LRU list in relation size cache are performed under
shared lock
## Summary of changes
Take exclusive lock
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
This reverts commit cbe8c77997.
This change was originally made to test a hypothesis, but after that,
the proper fix#8669 was merged, so now it's not needed. Moreover, the
test is still flaky, so probably this bug was not a reason of the
flakiness.
Related to #8097
This copies a piece of code from `test_scrubber_physical_gc_ancestors`
to fix a source of flakiness: later on we rely on stuff being older than
a second, but the test can run faster under optimal conditions (as
happened to me locally, but also obvservable in
[this](https://neon-github-public-dev.s3.amazonaws.com/reports/main/10470762360/index.html#testresult/f713b02657db4b4c/retries)
allure report):
```
test_runner/regress/test_storage_scrubber.py:169: in test_scrubber_physical_gc
assert gc_summary["remote_storage_errors"] == 0
E assert 1 == 0
```
## Problem/Solution
TimelineWriter::put_batch is simply a loop over individual puts. Each
put acquires and releases locks, and checks for potentially starting a
new layer. Batching these is more efficient, but more importantly
unlocks future changes where we can pre-build serialized buffers much
earlier in the ingest process, potentially even on the safekeeper
(imagine a future model where some variant of DatadirModification lives
on the safekeeper).
Ensuring that the values in put_batch are written to one layer also
enables a simplification upstream, where we no longer need to write
values in LSN-order. This saves us a sort, but also simplifies follow-on
refactors to DatadirModification: we can store metadata keys and data
keys separately at that level without needing to zip them together in
LSN order later.
## Why?
In this PR, these changes are simplify optimizations, but they are
motivated by evolving the ingest path in the direction of disentangling
extracting DatadirModification from Timeline. It may not obvious how
right now, but the general idea is that we'll end up with three phases
of ingest:
- A) Decode walrecords and build a datadirmodification with all the
simple data contents already in a big serialized buffer ready to write
to an ephemeral layer **<-- this part can be pipelined and parallelized,
and done on a safekeeper!**
- B) Let that datadirmodification see a Timeline, so that it can also
generate all the metadata updates that require a read-modify-write of
existing pages
- C) Dump the results of B into an ephemeral layer.
Related: https://github.com/neondatabase/neon/issues/8452
## Caveats
Doing a big monolithic buffer of values to write to disk is ordinarily
an anti-pattern: we prefer nice streaming I/O. However:
- In future, when we do this first decode stage on the safekeeper, it
would be inefficient to serialize a Vec of Value, and then later
deserialize it just to add blob size headers while writing into the
ephemeral layer format. The idea is that for bulk write data, we will
serialize exactly once.
- The monolithic buffer is a stepping stone to pipelining more of this:
by seriailizing earlier (rather than at the final put_value), we will be
able to parallelize the wal decoding and bulk serialization of data page
writes.
- The ephemeral layer's buffered writer already stalls writes while it
waits to flush: so while yes we'll stall for a couple milliseconds to
write a couple megabytes, we already have stalls like this, just
distributed across smaller writes.
## Benchmarks
This PR is primarily a stepping stone to safekeeper ingest filtering,
but also provides a modest efficiency improvement to the `wal_recovery`
part of `test_bulk_ingest`.
test_bulk_ingest:
```
test_bulk_insert[neon-release-pg16].insert: 23.659 s
test_bulk_insert[neon-release-pg16].pageserver_writes: 5,428 MB
test_bulk_insert[neon-release-pg16].peak_mem: 626 MB
test_bulk_insert[neon-release-pg16].size: 0 MB
test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB
test_bulk_insert[neon-release-pg16].num_files_uploaded: 8
test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB
test_bulk_insert[neon-release-pg16].wal_recovery: 18.981 s
test_bulk_insert[neon-release-pg16].compaction: 0.055 s
vs. tip of main:
test_bulk_insert[neon-release-pg16].insert: 24.001 s
test_bulk_insert[neon-release-pg16].pageserver_writes: 5,428 MB
test_bulk_insert[neon-release-pg16].peak_mem: 604 MB
test_bulk_insert[neon-release-pg16].size: 0 MB
test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB
test_bulk_insert[neon-release-pg16].num_files_uploaded: 8
test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB
test_bulk_insert[neon-release-pg16].wal_recovery: 23.586 s
test_bulk_insert[neon-release-pg16].compaction: 0.054 s
```
close https://github.com/neondatabase/neon/issues/8558
* Directly generate image layers for sparse keyspaces during initdb
optimization.
* Support force image layer generation for sparse keyspaces.
* Fix a bug of incorrect image layer key range in case of duplicated
keys. (The added line: `start = img_range.end;`) This can cause
overlapping image layers and keys to disappear.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
After #8655, we needed to mark some tests to shut down immediately. To
aid these tests, try the new pattern of `flush_ep_to_pageserver`
followed by a non-compacting checkpoint. This moves the general graceful
shutdown problem of having too much to flush at shutdown into the test.
Also, add logging for how long the graceful shutdown took, if we got to
complete it for faster log eyeballing.
Fixes: #8712
Cc: #8715, #8708
Going through the list of recent flaky tests, trying to fix those
related to graceful shutdown.
- test_forward_compatibility: flush and wait for uploads to avoid
graceful shutdown
- test_layer_bloating: in the end the endpoint and vanilla are still up
=> immediate shutdown
- test_lagging_sk: pageserver shutdown is not related to the test =>
immediate shutdown
- test_lsn_lease_size: pageserver flushing is not needed => immediate
shutdown
Additionally:
- remove `wait_for_upload` usage from workload fixture
Cc: #8708Fixes: #8710
This removes workspace hack from all libs, not from any binaries. This
does not change the behaviour of the hack.
Running
```
cargo clean
cargo build --release --bin proxy
```
Before this change took 5m16s. After this change took 3m3s. This is
because this allows the build to be parallelisable much more.
## Problem
We want to run our regression test suite on ARM.
## Summary of changes
- run regression tests on release ARM builds
- run `build-neon` (including rust tests) on debug ARM builds
- add `arch` parameter to test to distinguish them in the allure report
and in a database
## Problem
Compaction jobs and other background loops are concurrency-limited
through a global semaphore.
The current counters allow quantifying how _many_ tasks are waiting.
But there is no way to tell how _much_ delay is added by the semaphore.
So, add a counter that aggregates the wall clock time seconds spent
acquiring the semaphore.
The metrics can be used as follows:
* retroactively calculate average acquisition time in a given time range
* compare the degree of background loop backlog among pageservers
The metric is insufficient to calculate
* run-up of ongoing acquisitions that haven't finished acquiring yet
* Not easily feasible because ["Cancelling a call to acquire makes you
lose your place in the
queue"](https://docs.rs/tokio/latest/tokio/sync/struct.Semaphore.html#method.acquire)
## Summary of changes
* Refactor the metrics to follow the current best practice for typed
metrics in `metrics.rs`.
* Add the new counter.
## Problem
We don't have a convenient way for a human to ask "how far are secondary
downloads along for this tenant".
This is useful when driving migrations of tenants to the storage
controller, as we first create a secondary location and want to see it
warm up before we cut over. That can already be done via storcon_cli,
but we would like a way that doesn't require direct API access.
## Summary of changes
Add a metric that reports to total size of layers in the heatmap: this
may be used in conjunction with the existing
`pageserver_secondary_resident_physical_size` to estimate "warmth" of
the secondary location.
## Problem
Storage controllers did not have the right token to speak to their peers
for leadership transitions.
## Summary of changes
Accept a peer jwt token for the storage controller.
Epic: https://github.com/neondatabase/cloud/issues/14701
## Problem
#8736 is getting too big. splitting off some simple changes here
## Summary of changes
Local proxy wont always be using tls, so make it optional. Local proxy
wont be using ws for now, so make it optional. Remove a dead config var.
## Problem
Previously, we would run db migrations before doing the step-down
sequence. This meant that the current leader would have to deal with
the schema changes and that's generally not safe.
## Summary of changes
Push the step-down procedure earlier in start-up and
do db migrations right after it (but before we load-up the in-memory
state from the db).
Epic: https://github.com/neondatabase/cloud/issues/14701
## Problem
The default Postgres version is set to 15 in code, while we use 16 in
most of the other places (and Postgres 17 is coming)
## Summary of changes
- Run `benchmarks` job with Postgres 16 (instead of Postgres 14)
- Set `DEFAULT_PG_VERSION` to 16 in all places
- Remove deprecated `--pg-version` pytest argument
- Update `test_metadata_bincode_serde_ensure_roundtrip` for Postgres 16
Removes the `_generic` postfix from the `GenericRemoteStorage` using
APIs, as `remote_storage` is the "default" now, and add a `_s3` postfix
to the remaining APIs using the S3 SDK (only in tenant snapshot). Also,
remove two unused functions: `list_objects_with_retries` and
`stream_tenants functions`.
Part of https://github.com/neondatabase/neon/issues/7547
Migrates most of the remaining parts of the scrubber to remote_storage:
* `pageserver_physical_gc`
* `scan_metadata` for pageservers (safekeepers were done in #8595)
* `download()` in `tenant_snapshot`. The main `tenant_snapshot` is not
migrated as it uses version history to be able to work in the face of
ongoing changes.
Part of #7547
It's been rolled out everywhere, no configs are referencing it.
All code that's made dead by the removal of the config option is removed
as part of this PR.
The `page_caching::PreWarmingWriter` in `::No` mode is equivalent to a
`size_tracking_writer`, so, use that.
part of https://github.com/neondatabase/neon/issues/7418
After the rollout has succeeded, we now set the default image
compression to be enabled.
We also remove its explicit mention from `neon_fixtures.py` added in
#8368 as it is now the default (and we switch to `zstd(1)` which is a
bit nicer on CPU time).
Part of https://github.com/neondatabase/neon/issues/5431
Part of #8128.
## Description
This PR creates a unified command to run both physical gc and metadata
health check as a cron job. This also enables us to add additional tasks
to the cron job in the future.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
## Problem
Multiple increase/decrease LFC limit may cause unlimited growth of LFC
file because punched holes while LFC shrinking are not reused when LFC
is extended.
## Summary of changes
Keep track of holes and reused them when LFC size is increased.
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
Current superuser check always passes because it returns a tuple like
`(False,)`, and then the `if not superuser` passes.
## Summary of changes
Fixes the issue by unwrapping the tuple. Verified that it works against
a project where I don't have superuser.
## Problem
When a secondary location is trying to catch up while a tenant is
receiving new writes, it can become quite wasteful:
- Downloading L0s which are soon destroyed by compaction to L1s
- Downloading older layer files which are soon made irrelevant when
covered by image layers.
## Summary of changes
Sort the layer files in the heatmap:
- L0 layers are the lowest priority
- Other layers are sorted to download the highest LSNs first.
Previously, we protected from multiple ProposerElected messages from the same
walproposer with the following condition:
msg.term == self.get_last_log_term() && self.flush_lsn() >
msg.start_streaming_at
It is not exhaustive, i.e. we could still proceed to truncating WAL even though
safekeeper inserted something since the divergence point has been
calculated. While it was most likely safe because walproposer can't use
safekeeper position to commit WAL until last_log_term reaches the current
walproposer term, let's be more careful and properly calculate the divergence
point like walproposer does.
## Problem
https://github.com/neondatabase/neon/pull/8588 implemented the mechanism
for storage controller
leadership transfers. However, there's no tests that exercise the
behaviour.
## Summary of changes
1. Teach `neon_local` how to handle multiple storage controller
instances. Each storage controller
instance gets its own subdirectory (`storage_controller_1, ...`).
`storage_controller start|stop` subcommands
have also been extended to optionally accept an instance id.
2. Add a storage controller proxy test fixture. It's a basic HTTP server
that forwards requests from pageserver
and test env to the currently configured storage controller.
3. Add a test which exercises storage controller leadership transfer.
4. Finally fix a couple bugs that the test surfaced
We've had physical replication support for a long time, but we never
created an RFC for the feature. This RFC does that after the fact. Even
though we've already implemented the feature, let's have a design
discussion as if it hadn't done that. It can still be quite insightful.
This is written from a pretty compute-centric viewpoint, not much
on how it works in the control plane.
## Problem
We want to store Nightly Replication test results in the database and
notify the relevant Slack channel about failures
## Summary of changes
- Store test results in the database
- Notify `on-call-compute-staging-stream` about failures
## Problem
`secrets.GITHUB_TOKEN` (with any permissions) is not enough to get
a user's membership info if they decide to hide it.
## Summary of changes
- Use `secrets.CI_ACCESS_TOKEN` for `gh api` call
- Use `pull_request_target` instead of `pull_request` event to get
access to secrets
## Problem
Logical replication BGW checking replication lag is not reloading config
## Summary of changes
Add handling of reload config request
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Per #8674, disallow node configuration while drain/fill are ongoing.
Implement it by adding a only-http wrapper
`Service::external_node_configure` which checks for operation existing
before configuring.
Additionally:
- allow cancelling drain/fill after a pageserver has restarted and
transitioned to WarmingUp
Fixes: #8674
## Problem
On macOS, clippy fails with the following error:
```
error: unused import: `crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt`
--> pageserver/src/tenant/remote_timeline_client/download.rs:26:5
|
26 | use crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt;
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= note: `-D unused-imports` implied by `-D warnings`
= help: to override `-D warnings` add `#[allow(unused_imports)]`
```
Introduced in https://github.com/neondatabase/neon/pull/8717
## Summary of changes
- allow `unused_imports` for
`crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt` on macOS
in download.rs
## Problem
The control file contains the id of the safekeeper that uploaded it.
Previously, when sending a snapshot of the control file to another sk,
it would eventually be gc-ed by the receiving sk. This is incorrect
because the original sk might still need it later.
## Summary of Changes
When sending a snapshot and the control file contains an uploaded
segment:
* Create a copy of the segment in s3 with the destination sk in the
object name
* Tweak the streamed control file to point to the object create in the
previous step
Note that the snapshot endpoint now has to know the id of the requestor,
so the api has been extended to include the node if of the destination
sk.
Closes https://github.com/neondatabase/neon/issues/8542
The `tokio_epoll_uring::Slice` / `tokio_uring::Slice` type is weird.
The new `FullSlice` newtype is better. See the doc comment for details.
The naming is not ideal, but we'll clean that up in a future refactoring
where we move the `FullSlice` into `tokio_epoll_uring`. Then, we'll do
the following:
* tokio_epoll_uring::Slice is removed
* `FullSlice` becomes `tokio_epoll_uring::IoBufView`
* new type `tokio_epoll_uring::IoBufMutView` for the current
`tokio_epoll_uring::Slice<IoBufMut>`
Context
-------
I did this work in preparation for
https://github.com/neondatabase/neon/pull/8537.
There, I'm changing the type that the `inmemory_layer.rs` passes to
`DeltaLayerWriter::put_value_bytes` and thus it seemed like a good
opportunity to make this cleanup first.
## Problem
A bunch of small fixes and improvements for CI, that are too small to
have a separate PR for them
## Summary of changes
- CI(build-and-test): fix parenthesis
- CI(actionlint): fix path to workflow file
- CI: remove default args from actions/checkout
- CI: remove `gen3` label, using a couple `self-hosted` +
`small{,-arm64}`/`large{,-arm64}` is enough
- CI: prettify Slack messages, hide links behind text messages
- C(build-and-test): add more dependencies to `conclusion` job
## Problem
It's recommended that a couple of additional RUSTFLAGS be set up to
improve the performance of Rust applications on AWS Graviton.
See
57dc813626/rust.md
Note: Apple Silicon is compatible with neoverse-n1:
```
$ clang --version
Apple clang version 15.0.0 (clang-1500.3.9.4)
Target: arm64-apple-darwin23.6.0
Thread model: posix
InstalledDir: /Applications/Xcode_15.4.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
$
$ clang --print-supported-cpus 2>&1 | grep neoverse-
neoverse-512tvb
neoverse-e1
neoverse-n1
neoverse-n2
neoverse-v1
neoverse-v2
```
## Summary of changes
- Add `-Ctarget-feature=+lse -Ctarget-cpu=neoverse-n1` to RUSTFLAGS for
ARM images
Some benchmarks and tests might still fail because of #8655 (tracked in
#8708) because we are not fast enough to shut down ([one evidence]).
Partially this is explained by the current validation mode of streaming
k-merge, but otherwise because that is where we use a lot of time in
compaction. Outside of L0 => L1 compaction, the image layer generation
is already guarded by vectored reads doing cancellation checks.
32768 is a wild guess based on looking how many keys we put in each
layer in a bench (1-2 million), but I assume it will be good enough
divisor. Doing checks more often will start showing up as contention
which we cannot currently measure. Doing checks less often might be
reasonable.
[one evidence]:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/10384136483/index.html#suites/9681106e61a1222669b9d22ab136d07b/96e6d53af234924/
Earlier PR: #8706.
## Problem
During `Run rust tests` step (for debug builds), we accidentally rebuild
neon twice (by `cargo test --doc` and by `cargo nextest run`).
It happens because we don't set `cov_prefix` for the `cargo test --doc`
command, which triggers rebuilding with different build flags, and one
more rebuild by `cargo nextest run`.
## Summary of changes
- Set `cov_prefix` for `cargo test --doc` to prevent unneeded rebuilds
## Problem
This command is kind of a hack, used when we're migrating large tenants
and want to get their resident size down. It sets the tenant config to a
fixed value, which omitted heatmap_period, so caused secondaries to get
out of date.
## Summary of changes
- Set heatmap period to the same 300s default that we use elsewhere when
updating eviction settings
This is not as elegant as some general purpose partial modification of
the config, but it practically makes the command safer to use.
basic JWT implementation that caches JWKs and verifies signatures.
this code is currently not reachable from proxy, I just wanted to get
something merged in.
We can get CompactionError::Other(Cancelled) via the error handling with
a few ways.
[evidence](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8655/10301613380/index.html#suites/cae012a1e6acdd9fdd8b81541972b6ce/653a33de17802bb1/).
Hopefully fix it by:
1. replace the `map_err` which hid the
`GetReadyAncestorError::Cancelled` with `From<GetReadyAncestorError> for
GetVectoredError` conversion
2. simplifying the code in pgdatadir_mapping to eliminate the token
anyhow wrapping for deserialization errors
3. stop wrapping GetVectoredError as anyhow errors
4. stop wrapping PageReconstructError as anyhow errors
Additionally, produce warnings if we treat any other error (as was legal
before this PR) as missing key.
Cc: #8708.
## Problem
`author_association` doesn't properly work if a GitHub user decides not
to show affiliation with the org in their profile (i.e. if it's private)
## Summary of changes
- Call
`/orgs/ORG/members/USERNAME` API to check whether
a PR/issue author is a member of the org
## Problem
When pageservers do compaction, they frequently create image layers that
make earlier layers un-needed for reads, but then keep those earlier
layers around for 24 hours waiting for time-based eviction to expire
them.
Now that we track layer visibility, we can use it as an input to
eviction, and avoid the 24 hour "disk bump" that happens around
pageserver restarts.
## Summary of changes
- During time-based eviction, if a layer is marked Covered, use the
eviction period as the threshold: i.e. these layers get to remain
resident for at least one iteration of the eviction loop, but then get
evicted. With current settings this means they get evicted after 1h
instead of 24h.
- During disk usage eviction, prioritized evicting covered layers above
all other layers.
Caveats:
- Using the period as the threshold for time based eviction in this case
is a bit of a hack, but it avoids adding yet another configuration
property, and in any case the value of a new property would be somewhat
arbitrary: there's no "right" length of time to keep covered layers
around just in case.
- We had previously planned on removing time-based eviction: this change
would motivate us to keep it around, but we can still simplify the code
later to just do the eviction of covered layers, rather than applying a
TTL policy to all layers.
With additional phases from #8430 the `detach_ancestor::Error` became
untenable. Split it up into phases, and introduce laundering for
remaining `anyhow::Error` to propagate them as most often
`Error::ShuttingDown`.
Additionally, complete FIXMEs.
Cc: #6994
## Problem
The storage scrubber was reporting warnings for lots of timelines like:
```
WARN Missed some shards at count ShardCount(0) tenant_id=25eb7a83d9a2f90ac0b765b6ca84cf4c
```
These were spurious: these tenants are fine. There was a bug in
accumulating the ShardIndex for each tenant, whereby multiple timelines
would lead us to add the same ShardIndex more than one.
Closes: #8646
## Summary of changes
- Accumulate ShardIndex in a BTreeSet instead of a Vec
- Extend the test to reproduce the issue
## Problem
See https://github.com/neondatabase/neon/issues/8499
## Summary of changes
Save HEAP_COMBOCID flag in WAL and do not clear it in redo handlers.
Related Postgres PRs:
https://github.com/neondatabase/postgres/pull/457https://github.com/neondatabase/postgres/pull/458https://github.com/neondatabase/postgres/pull/459
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
With the persistent gc blocking, we can now retry reparenting timelines
which had failed for whatever reason on the previous attempt(s).
Restructure the detach_ancestor into three phases:
- prepare (insert persistent gc blocking, copy lsn prefix, layers)
- detach and reparent
- reparenting can fail, so we might need to retry this portion
- complete (remove persistent gc blocking)
Cc: #6994
After #8655 we've had a few issues (mostly tracked on #8708) with the
graceful shutdown. In order to shutdown more of the processes and catch
more errors, for example, from all pageservers, do an immediate shutdown
for those nodes which fail the initial (possibly graceful) shutdown.
Cc: #6485
## Problem
We use a set of **Neon** reuse databases in benchmarking.yml which are
still using pg14.
Because we want to compare apples to apples and have migrated the AWS
reuse clusters to pg16 we should also use pg16 for Neon.
## Summary of changes
- Automatically restore the test databases for Neon project
A few of the benchmarks have started failing after #8655 where they are
waiting for compactor task. Reads done by image layer creation should
already be cancellation sensitive because vectored get does a check each
time, but try sprinkling additional cancellation points to:
- each partition
- after each vectored read batch
It seems that some benchmarks are failing because they are simply not
stopping to ingest wal on shutdown. It might mean that the tests were
never ran on a stable pageserver situation and WAL has always been left
to be ingested on safekeepers, but let's see if this silences the
failures and "stops the bleeding".
Cc: https://github.com/neondatabase/neon/issues/8712
## Problem
We want to mark new PRs and issues created by external users
## Summary of changes
- Add a new workflow which adds `external` label for issues and PRs
created by external users
## Problem
When the utilization API was added, it was just a stub with disk space
information.
Disk space information isn't a very good metric for assigning tenants to
pageservers, because pageservers making full use of their disks would
always just have 85% utilization, irrespective of how much pressure they
had for disk space.
## Summary of changes
- Use the new layer visibiilty metric to calculate a "wanted size" per
tenant, and sum these to get a total local disk space wanted per
pageserver. This acts as the primary signal for utilization.
- Also use the shard count to calculate a utilization score, and take
the max of this and the disk-driven utilization. The shard count limit
is currently set as a constant 20,000, which matches contemporary
operational practices when loading pageservers.
The shard count limit means that for tiny/empty tenants, on a machine
with 3.84TB disk, each tiny tenant influences the utilization score as
if it had size 160MB.
## Problem
When pooled connections are used, session semantic its not preserved,
including GUC settings.
Many customers have particular problem with setting search_path.
But pgbouncer 1.20 has `track_extra_parameters` settings which allows to
track parameters included in startup package which are reported by
Postgres. Postgres has [an official list of parameters that it reports
to the
client](https://www.postgresql.org/docs/15/protocol-flow.html#PROTOCOL-ASYNC).
This PR makes Postgres also report `search_path` and so allows to
include it in `track_extra_parameters`.
## Summary of changes
Set GUC_REPORT flag for `search_path`.
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
Pageserver exposes some vectored get related configs which are not in
use.
## Summary of changes
Remove the following pageserver configs: get_impl, get_vectored_impl,
and `validate_get_vectored`.
They are not used in the pageserver since
https://github.com/neondatabase/neon/pull/8601.
Manual overrides have been removed from the aws repo in
https://github.com/neondatabase/aws/pull/1664.
## Problem
This follows a PR that insists all input keys are representable in 16
bytes:
- https://github.com/neondatabase/neon/pull/8648
& a PR that prevents postgres from sending us keys that use the high
bits of field2:
- https://github.com/neondatabase/neon/pull/8657
Motivation for this change:
1. Ingest is bottlenecked on CPU
2. InMemoryLayer can create huge (~1M value) BTreeMap<Key,_> for its
index.
3. Maps over i128 are much faster than maps over an arbitrary 18 byte
struct.
It may still be worthwhile to make the index two-tier to optimize for
the case where only the last 4 bytes (blkno) of the key vary frequently,
but simply using the i128 representation of keys has a big impact for
very little effort.
Related: #8452
## Summary of changes
- Introduce `CompactKey` type which contains an i128
- Use this instead of Key in InMemoryLayer's index, converting back and
forth as needed.
## Performance
All the small-value `bench_ingest` cases show improved throughput.
The one that exercises this index most directly shows a 35% throughput
increase:
```
ingest-small-values/ingest 128MB/100b seq, no delta
time: [374.29 ms 378.56 ms 383.38 ms]
thrpt: [333.88 MiB/s 338.13 MiB/s 341.98 MiB/s]
change:
time: [-26.993% -26.117% -25.111%] (p = 0.00 < 0.05)
thrpt: [+33.531% +35.349% +36.974%]
Performance has improved.
```
## Problem
We use infrastructure as code (TF) to deploy AWS Aurora and AWS RDS
Postgres database clusters.
Whenever we have a change in TF (e.g. **every year** to upgrade to a
higher Postgres version or when we change the cluster configuration) TF
will apply the change and create a new AWS database cluster.
However our benchmarking testcase also expects databases in these
clusters and tables loaded with data.
So we add auto-detection - if the AWS RDS instances are "empty" we
create the necessary databases and restore a pg_dump.
**Important Notes:**
- These steps are NOT run in each benchmarking run, but only after a new
RDS instance has been deployed.
- the benchmarking workflows use GitHub secrets to find the connection
string for the database. These secrets still need to be (manually or
programmatically using git cli) updated if some port of the connection
string (e.g. user, password or hostname) changes.
## Summary of changes
In each benchmarking run check if
- database has already been created - if not create it
- database has already been restored - if not restore it
Supported databases
- tpch
- clickbench
- user example
Supported platforms:
- AWS RDS Postgres
- AWS Aurora serverless Postgres
Sample workflow run - but this one uses Neon database to test the
restore step and not real AWS databases
https://github.com/neondatabase/neon/actions/runs/10321441086/job/28574350581
Sample workflow run - with real AWS database clusters
https://github.com/neondatabase/neon/actions/runs/10346816389/job/28635997653
Verification in second run - with real AWS database clusters - that
second time the restore is skipped
https://github.com/neondatabase/neon/actions/runs/10348469517/job/28640778223
## Problem
Storage controller restarts cause temporary unavailability from the
control plane POV. See RFC for more details.
## Summary of changes
* A couple of small refactors of the storage controller start-up
sequence to make extending it easier.
* A leader table is added to track the storage controller instance
that's currently the leader (if any)
* A peer client is added such that storage controllers can send
`step_down` requests to each other (implemented in
https://github.com/neondatabase/neon/pull/8512).
* Implement the leader cut-over as described in the RFC
* Add `start-as-candidate` flag to the storage controller to gate the
rolling restart behaviour. When the flag is `false` (the default), the
only change from the current start-up sequence is persisting the leader
entry to the database.
It should give us all possible allowed_errors more consistently.
While getting the workflows to pass on
https://github.com/neondatabase/neon/pull/8632 it was noticed that
allowed_errors are rarely hit (1/4). This made me realize that we always
do an immediate stop by default. Doing a graceful shutdown would had
made the draining more apparent and likely we would not have needed the
#8632 hotfix.
Downside of doing this is that we will see more timeouts if tests are
randomly leaving pause failpoints which fail the shutdown.
The net outcome should however be positive, we could even detect too
slow shutdowns caused by a bug or deadlock.
## Problem
This test was disabled.
## Summary of changes
- Remove the skip marker.
- Explicitly avoid doing compaction & gc during checkpoints (the default
scale doesn't do anything here, but when experimeting with larger scales
it messes things up)
- Set a data size that gives a ~20s runtime on a Hetzner dev machine,
previous one gave very noisy results because it was so small
For reference on a Hetzner AX102:
```
------------------------------ Benchmark results -------------------------------
test_bulk_insert[neon-release-pg16].insert: 25.664 s
test_bulk_insert[neon-release-pg16].pageserver_writes: 5,428 MB
test_bulk_insert[neon-release-pg16].peak_mem: 577 MB
test_bulk_insert[neon-release-pg16].size: 0 MB
test_bulk_insert[neon-release-pg16].data_uploaded: 1,922 MB
test_bulk_insert[neon-release-pg16].num_files_uploaded: 8
test_bulk_insert[neon-release-pg16].wal_written: 1,382 MB
test_bulk_insert[neon-release-pg16].wal_recovery: 25.373 s
test_bulk_insert[neon-release-pg16].compaction: 0.035 s
```
It should take syncrep flush_lsn into account because WAL before it on endpoint
restart is lost, which makes replication miss some data if slot had already been
advanced too far. This commit adds test reproducing the issue and bumps
vendor/postgres to commit with the actual fix.
## Problem
In several workflows, we have repeating code which is separated into
two steps:
```bash
mkdir -p $(pwd)/.docker-custom
echo DOCKER_CONFIG=/tmp/.docker-custom >> $GITHUB_ENV
...
rm -rf $(pwd)/.docker-custom
```
Such copy-paste is prone to errors; for example, in one case, instead of
`$(pwd)/.docker-custom`, we use `/tmp/.docker-custom`, which is shared
between workflows.
## Summary of changes
- Create a new action `actions/set-docker-config-dir`, which sets
`DOCKER_CONFIG` and deletes it in a Post action part
Noticed this while debugging a test failure in #8673 which only occurs
with real S3 instead of mock S3: if you authenticate to S3 via
`AWS_PROFILE`, then it requires the `HOME` env var to be set so that it
can read inside the `~/.aws` directory.
The scrubber abstraction `StorageScrubber::scrubber_cli` in
`neon_fixtures.py` would otherwise not work. My earlier PR #6556 has
done similar things for the `neon_local` wrapper.
You can try:
```
aws sso login --profile dev
export ENABLE_REAL_S3_REMOTE_STORAGE=y REMOTE_STORAGE_S3_BUCKET=neon-github-ci-tests REMOTE_STORAGE_S3_REGION=eu-central-1 AWS_PROFILE=dev
RUST_BACKTRACE=1 BUILD_TYPE=debug DEFAULT_PG_VERSION=16 ./scripts/pytest -vv --tb=short -k test_scrubber_tenant_snapshot
```
before and after this patch: this patch fixes it.
## Problem
This page had many dead links, and was confusing for folks looking for
documentation about our product.
Closes: https://github.com/neondatabase/neon/issues/8535
## Summary of changes
- Add a link to the product docs up top
- Remove dead/placeholder links
## Problem
We install and try to use `cachepot`. But it is not configured correctly
and doesn't work (after https://github.com/neondatabase/neon/pull/2290)
## Summary of changes
- Remove `cachepot`
## Problem
Migrations of tenant shards with cold secondaries are holding up drains
in during production deployments.
## Summary of changes
If a secondary locations is lagging by more than 256MiB (configurable,
but that's the default), then skip cutting it over to the secondary as part of the node drain.
## Problem
This type of error can happen during shutdown & was triggering a circuit
breaker alert.
## Summary of changes
- Map NotIntialized::Stopped to CompactionError::ShuttingDown, so that
we may handle it cleanly
## Problem
Azure login fails in `pin-build-tools-image` workflow because the job
doesn't have the required permissions.
```
Error: Please make sure to give write permissions to id-token in the workflow.
Error: Login failed with Error: Error message: Unable to get ACTIONS_ID_TOKEN_REQUEST_URL env variable. Double check if the 'auth-type' is correct. Refer to https://github.com/Azure/login#readme for more information.
```
## Summary of changes
- Add `id-token: write` permission to `pin-build-tools-image`
- Add an input to force image tagging
- Unify pushing to Docker Hub with other registries
- Split the job into two to have less if's
part of https://github.com/neondatabase/neon/issues/8653
Disable create tablespace stmt. It turns out it requires much less
effort to do the regress test mode flag than patching the test cases,
and given that we might need to support tablespaces in the future, I
decided to add a new flag `regress_test_mode` to change the behavior of
create tablespace.
Tested manually that without setting regress_test_mode, create
tablespace will be rejected.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
This reverts #8076 - which was already reverted from the release branch
since forever (it would have been a breaking change to release for all
users who currently set TimeZone options). It's causing conflicts now so
we should revert it here as well.
## Problem
Latency from one cloud provider to another one is higher than within the
same cloud provider.
Some of our benchmarks are latency sensitive - we run a pgbench or psql
in the github action runner and the system under test is running in Neon
(database project).
For realistic perf tps and latency results we need to compare apples to
apples and run the database client in the same "latency distance" for
all tests.
## Summary of changes
Move job steps that test Neon databases deployed on Azure into Azure
action runners.
- bench strategy variant using azure database
- pgvector strategy variant using azure database
- pgbench-compare strategy variants using azure database
## Test run
https://github.com/neondatabase/neon/actions/runs/10314848502
## Problem
We're adding more third party dependencies to support more diverse +
realistic test cases in `test_runner/logical_repl`. I ❤️ these
tests, they are a good thing.
The slight glitch is that python packaging is hard, and some third party
python packages have issues. For example the current kafka dependency
doesn't work on latest python. We can mitigate that by only importing
these more specialized dependencies in the tests that use them.
## Summary of changes
- Move the `kafka` import into a test body, so that folks running the
regular `test_runner/regress` tests don't have to have a working kafka
client package.
## Problem
This code was to mitigate risk in
https://github.com/neondatabase/neon/pull/8427
As expected, we did not hit this code path - the new continuous updates
of gc_info are working fine, we can remove this code now.
## Summary of changes
- Remove block that double-checks retain_lsns
avoid "leaking" the completions of BackgroundPurges by:
1. switching it to TaskTracker for provided close+wait
2. stop using tokio::fs::remove_dir_all which will consume two units of
memory instead of one blocking task
Additionally, use more graceful shutdown in tests which do actually some
background cleanup.
## Problem
See
https://neondb.slack.com/archives/C03QLRH7PPD/p1723038557449239?thread_ts=1722868375.476789&cid=C03QLRH7PPD
Logical replication subscription by default use `synchronous_commit=off`
which cause problems with safekeeper
## Summary of changes
Set `synchronous_commit=on` for logical replication subscription in
test_subscriber_restart.py
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
## Problem
Some developers build on MacOS, which doesn't have io_uring.
## Summary of changes
- Add `io_engine_for_bench`, which on linux will give io_uring or panic
if it's unavailable, and on MacOS will always panic.
We do not want to run such benchmarks with StdFs: the results aren't
interesting, and will actively waste the time of any developers who
start investigating performance before they realize they're using a
known-slow I/O backend.
Why not just conditionally compile this benchmark on linux only? Because
even on linux, I still want it to refuse to run if it can't get
io_uring.
Part of #8130, [RFC: Direct IO For Pageserver](https://github.com/neondatabase/neon/blob/problame/direct-io-rfc/docs/rfcs/034-direct-io-for-pageserver.md)
## Description
Add pageserver config for evaluating/enabling direct I/O.
- Disabled: current default, uses buffered io as is.
- Evaluate: still uses buffered io, but could do alignment checking and
perf simulation (pad latency by direct io RW to a fake file).
- Enabled: uses direct io, behavior on alignment error is configurable.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
We've noticed increased memory usage with the latest release. Drain the
joinset of `page_service` connection handlers to avoid leaking them
until shutdown. An alternative would be to use a TaskTracker.
TaskTracker was not discussed in original PR #8339 review, so not hot
fixing it in here either.
Earlier I was thinking we'd need a (ancestor_lsn, timeline_id) ordered
list of reparented. Turns out we did not need it at all. Replace it with
an unordered hashset. Additionally refactor the reparented direct
children query out, it will later be used from more places.
Split off from #8430.
Cc: #6994
Ephemeral files cleanup on drop but did not delay shutdown, leading to
problems with restarting the tenant. The solution is as proposed:
- make ephemeral files carry the gate guard to delay `Timeline::gate`
closing
- flush in-memory layers and strong references to those on
`Timeline::shutdown`
The above are realized by making LayerManager an `enum` with `Open` and
`Closed` variants, and fail requests to modify `LayerMap`.
Additionally:
- fix too eager anyhow conversions in compaction
- unify how we freeze layers and handle errors
- optimize likely_resident_layers to read LayerFileManager hashmap
values instead of bouncing through LayerMap
Fixes: #7830
## Problem
1. Hard to correlate startup parameters with the endpoint that provided
them.
2. Some configurations are not needed in the `ProxyConfig` struct.
## Summary of changes
Because of some borrow checker fun, I needed to switch to an
interior-mutability implementation of our `RequestMonitoring` context
system. Using https://docs.rs/try-lock/latest/try_lock/ as a cheap lock
for such a use-case (needed to be thread safe).
Removed the lock of each startup message, instead just logging only the
startup params in a successful handshake.
Also removed from values from `ProxyConfig` and kept as arguments.
(needed for local-proxy config)
Timeline cancellation running in parallel with gc yields error log lines
like:
```
Gc failed 1 times, retrying in 2s: TimelineCancelled
```
They are completely harmless though and normal to occur. Therefore, only
print those messages at an info level. Still print them at all so that
we know what is going on if we focus on a single timeline.
Part of #8128.
## Problem
Currently, scrubber `scan_metadata` command will return with an error
code if the metadata on remote storage is corrupted with fatal errors.
To safely deploy this command in a cronjob, we want to differentiate
between failures while running scrubber command and the erroneous
metadata. At the same time, we also want our regression tests to catch
corrupted metadata using the scrubber command.
## Summary of changes
- Return with error code only when the scrubber command fails
- Uses explicit checks on errors and warnings to determine metadata
health in regression tests.
**Resolve conflict with `tenant-snapshot` command (after shard split):**
[`test_scrubber_tenant_snapshot`](https://github.com/neondatabase/neon/blob/yuchen/scrubber-scan-cleanup-before-prod/test_runner/regress/test_storage_scrubber.py#L23)
failed before applying 422a8443dd
- When taking a snapshot, the old `index_part.json` in the unsharded
tenant directory is not kept.
- The current `list_timeline_blobs` implementation consider no
`index_part.json` as a parse error.
- During the scan, we are only analyzing shards with highest shard
count, so we will not get a parse error. but we do need to add the
layers to tenant object listing, otherwise we will get index is
referencing a layer that is not in remote storage error.
- **Action:** Add s3_layers from `list_timeline_blobs` regardless of
parsing error
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
## Problem
We lack a rust bench for the inmemory layer and delta layer write paths:
it is useful to benchmark these components independent of postgres & WAL
decoding.
Related: https://github.com/neondatabase/neon/issues/8452
## Summary of changes
- Refactor DeltaLayerWriter to avoid carrying a Timeline, so that it can
be cleanly tested + benched without a Tenant/Timeline test harness. It
only needed the Timeline for building `Layer`, so this can be done in a
separate step.
- Add `bench_ingest`, which exercises a variety of workload "shapes"
(big values, small values, sequential keys, random keys)
- Include a small uncontroversial optimization: in `freeze`, only
exhaustively walk values to assert ordering relative to end_lsn in debug
mode.
These benches are limited by drive performance on a lot of machines, but
still useful as a local tool for iterating on CPU/memory improvements
around this code path.
Anecdotal measurements on Hetzner AX102 (Ryzen 7950xd):
```
ingest-small-values/ingest 128MB/100b seq
time: [1.1160 s 1.1230 s 1.1289 s]
thrpt: [113.38 MiB/s 113.98 MiB/s 114.70 MiB/s]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) low mild
Benchmarking ingest-small-values/ingest 128MB/100b rand: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 10.0s. You may wish to increase target time to 18.9s.
ingest-small-values/ingest 128MB/100b rand
time: [1.9001 s 1.9056 s 1.9110 s]
thrpt: [66.982 MiB/s 67.171 MiB/s 67.365 MiB/s]
Benchmarking ingest-small-values/ingest 128MB/100b rand-1024keys: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 10.0s. You may wish to increase target time to 11.0s.
ingest-small-values/ingest 128MB/100b rand-1024keys
time: [1.0715 s 1.0828 s 1.0937 s]
thrpt: [117.04 MiB/s 118.21 MiB/s 119.46 MiB/s]
ingest-small-values/ingest 128MB/100b seq, no delta
time: [425.49 ms 429.07 ms 432.04 ms]
thrpt: [296.27 MiB/s 298.32 MiB/s 300.83 MiB/s]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) low mild
ingest-big-values/ingest 128MB/8k seq
time: [373.03 ms 375.84 ms 379.17 ms]
thrpt: [337.58 MiB/s 340.57 MiB/s 343.13 MiB/s]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
ingest-big-values/ingest 128MB/8k seq, no delta
time: [81.534 ms 82.811 ms 83.364 ms]
thrpt: [1.4994 GiB/s 1.5095 GiB/s 1.5331 GiB/s]
Found 1 outliers among 10 measurements (10.00%)
```
## Problem
Sometimes, a layer is Covered by hasn't yet been evicted from local disk
(e.g. shortly after image layer generation). It is not good use of
resources to download these to a secondary location, as there's a good
chance they will never be read.
This follows the previous change that added layer visibility:
- #8511
Part of epic:
- https://github.com/neondatabase/neon/issues/8398
## Summary of changes
- When generating heatmaps, only include Visible layers
- Update test_secondary_downloads to filter to visible layers when
listing layers from an attached location
## Problem
In staging, we could see that occasionally tenants were wrapping their
pageserver_visible_physical_size metric past zero to 2^64.
This is harmless right now, but will matter more later when we start
using visible size in things like the /utilization endpoint.
## Summary of changes
- Add debug asserts that detect this case. `test_gc_of_remote_layers`
works as a reproducer for this issue once the asserts are added.
- Tighten up the interface around access_stats so that only Layer can
mutate it.
- In Layer, wrap calls to `record_access` in code that will update the
visible size statistic if the access implicitly marks the layer visible
(this was what caused the bug)
- In LayerManager::rewrite_layers, use the proper set_visibility layer
function instead of directly using access_stats (this is an additional
path where metrics could go bad.)
- Removed unused instances of LayerAccessStats in DeltaLayer and
ImageLayer which I noticed while reviewing the code paths that call
record_access.
## Problem
The controller scale test does random migrations. These mutate secondary
locations, and therefore can cause secondary optimizations to happen in
the background, violating the test's expectation that consistency_check
will work as there are no reconciliations running.
Example:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/10247161379/index.html#suites/07874de07c4a1c9effe0d92da7755ebf/6316beacd3fb3060/
## Summary of changes
- Only migrate to existing secondary locations, not randomly picked
nodes, so that we can do a fast reconcile_until_idle (otherwise
reconcile_until_idle is takes a long time to create new secondary
locations).
- Do a reconcile_until_idle before consistency_check.
## Problem
We need to test the logical replication with some external consumers.
## Summary of changes
A test of the logical replication with Debezium as a consumer was added.
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
We don't use it for packaging, and 'poetry install' will soon error
otherwise. Also remove name and version fields as these are not required for
non-packaging mode.
#8600 missed the hunk changing index_part.json informative version.
Include it in this PR, in addition add more non-warning index_part.json
versions to scrubber.
## Problem
We have been maintaining two read paths (legacy and vectored) for a
while now. The legacy read-path was only used for cross validation in some tests.
## Summary of changes
* Tweak all tests that were using the legacy read path to use the
vectored read path instead
* Remove the read path dispatching based on the pageserver configs
* Remove the legacy read path code
We will be able to remove the single blob io code in
`pageserver/src/tenant/blob_io.rs` when https://github.com/neondatabase/neon/issues/7386 is complete.
Closes https://github.com/neondatabase/neon/issues/8005
Currently, we do not have facilities to persistently block GC on a
tenant for whatever reason. We could do a tenant configuration update,
but that is risky for generation numbers and would also be transient.
Introduce a `gc_block` facility in the tenant, which manages per
timeline blocking reasons.
Additionally, add HTTP endpoints for enabling/disabling manual gc
blocking for a specific timeline. For debugging, individual tenant
status now includes a similar string representation logged when GC is
skipped.
Cc: #6994
Add dry-run mode that does not produce any image layer + delta layer. I
will use this code to do some experiments and see how much space we can
reclaim for tenants on staging. Part of
https://github.com/neondatabase/neon/issues/8002
* Add dry-run mode that runs the full compaction process without
updating the layer map. (We never call finish on the writers and the
files will be removed before exiting the function).
* Add compaction statistics and print them at the end of compaction.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
> Currently, long-running LR tests recreate endpoints every night. We'd
like to have along-running buildup of history to exercise the pageserver
in this case (instead of "unit-testing" the same behavior everynight).
Closes#8317
## Summary of changes
- Update Postgres version for replication tests
- Set `BENCHMARK_PROJECT_ID_PUB`/`BENCHMARK_PROJECT_ID_SUB` env vars to
projects that were created for this purpose
---------
Co-authored-by: Sasha Krassovsky <krassovskysasha@gmail.com>
Currently if `GET
/v1/tenant/x/timeline/y?force-await-initial-logical-size=true` is
requested for a root timeline created within the current pageserver
session, the request handler panics hitting the debug assertion. These
timelines will always have an accurate (at initdb import) calculated
logical size. Fix is to never attempt prioritizing timeline size
calculation if we already have an exact value.
Split off from #8528.
## Problem
In some cases, a deadlock between `build-and-test` and
`trigger-e2e-tests` workflows can happen:
```
Build and Test
Canceling since a deadlock for concurrency group 'Build and Test-8600/merge-anysha' was detected between 'top level workflow' and 'trigger-e2e-tests'
```
I don't understand the reason completely, probably `${{ github.workflow
}}` got evaluated to the same value and somehow caused the issue.
We don't need to limit concurrency for `trigger-e2e-tests`
workflow.
See
https://neondb.slack.com/archives/C059ZC138NR/p1722869486708179?thread_ts=1722869027.960029&cid=C059ZC138NR
## Problem
We don't trigger e2e tests for draft PRs, but we do trigger them once a
PR is in the "Ready for review" state.
Sometimes, a PR can be marked as "Ready for review" before we finish
image building. In such cases, triggering e2e tests fails.
## Summary of changes
- Make `trigger-e2e-tests` job poll status of `promote-images` job from
the build-and-test workflow for the last commit. And trigger only if the
status is `success`
- Remove explicit image checking from the workflow
- Add `concurrency` for `triggere-e2e-tests` workflow to make it
possible to cancel jobs in progress (if PR moves from "Draft" to "Ready
for review" several times in a row)
## Problem
PR #7992 was merged without correspondent changes in Postgres submodules
and this is why test_oid_overflow.py is failed now.
## Summary of changes
Bump Postgres versions
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
There is an unused safekeeper option `partial_backup_enabled`.
`partial_backup_enabled` was implemented in #6530, but this option was
always turned into enabled in #8022.
If you intended to keep this option for a specific reason, I will close
this PR.
## Summary of changes
I removed an unused safekeeper option `partial_backup_enabled`.
part of https://github.com/neondatabase/neon/issues/8002
## Summary of changes
Add a `SplitImageWriter` that automatically splits image layer based on
estimated target image layer size. This does not consider compression
and we might need a better metrics.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
We need both compaction and gc lock for gc-compaction. The lock order
should be the same everywhere, otherwise there could be a deadlock where
A waits for B and B waits for A.
We also had a double-lock issue. The compaction lock gets acquired in
the outer `compact` function. Note that the unit tests directly call
`compact_with_gc`, and therefore not triggering the issue.
## Summary of changes
Ensure all places acquire compact lock and then gc lock. Remove an extra
compact lock acqusition.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Currently, our backward compatibility tests only look one release back.
That means, for example, that when we switch on image layer compression
by default, we'll test reading of uncompressed layers for one release,
and then stop doing it. When we make an index_part.json format change,
we'll test against the old format for a week, then stop (unless we write
separate unit tests for each old format).
The reality in the field is that data in old formats will continue to
exist for weeks/months/years. When we make major format changes, we
should retain examples of the old format data, and continuously verify
that the latest code can still read them.
This test uses contents from a new path in the public S3 bucket,
`compatibility-data-snapshots/`. It is populated by hand. The first
important artifact is one from before we switch on compression, so that
we will keep testing reads of uncompressed data. We will generate more
artifacts ahead of other key changes, like when we update remote storage
format for archival timelines.
Closes: https://github.com/neondatabase/cloud/issues/15576
This commit tries to fix regular load spikes on staging, caused by too
many eviction and partial upload operations running at the same time.
Usually it was hapenning after restart, for partial backup the load was
delayed.
- Add a semaphore for evictions (2 permits by default)
- Rename `resident_since` to `evict_not_before` and smooth out the curve
by using random duration
- Use random duration in partial uploads as well
related to https://github.com/neondatabase/neon/issues/6338
some discussion in
https://neondb.slack.com/archives/C033RQ5SPDH/p1720601531744029
Makes `flush_frozen_layer` add a barrier to the upload queue and makes
it wait for that barrier to be reached until it lets the flushing be
completed.
This gives us backpressure and ensures that writes can't build up in an
unbounded fashion.
Fixes#7317
Chaos injection bridges the gap between automated testing (where we do
lots of different things with small, short-lived tenants), and staging
(where we do many fewer things, but with larger, long-lived tenants).
This PR adds a first type of chaos which isn't really very chaotic: it's
live migration of tenants between healthy pageservers. This nevertheless
provides continuous checks that things like clean, prompt shutdown of
tenants works for realistically deployed pageservers with realistically
large tenants.
## Problem
Previously, when we do a timeline deletion, shards will delete layers
that belong to an ancestor. That is not a correctness issue, because
when we delete a timeline, we're always deleting it from all shards, and
destroying data for that timeline is clearly fine.
However, there exists a race where one shard might start doing this
deletion while another shard has not yet received the deletion request,
and might try to access an ancestral layer. This creates ambiguity over
the "all layers referenced by my index should always exist" invariant,
which is important to detecting and reporting corruption.
Now that we have a GC mode for clearing up ancestral layers, we can rely
on that to clean up such layers, and avoid deleting them right away.
This makes things easier to reason about: there are now no cases where a
shard will delete a layer that belongs to a ShardIndex other than
itself.
## Summary of changes
- Modify behavior of RemoteTimelineClient::delete_all
- Add `test_scrubber_physical_gc_timeline_deletion` to exercise this
case
- Tweak AWS SDK config in the scrubber to enable retries. Motivated by
seeing the test for this feature encounter some transient "service
error" S3 errors (which are probably nothing to do with the changes in
this PR)
## Problem
`allure_attach_from_dir` method might create `tar.zst` archives even
if `--alluredir` is not set (i.e. Allure results collection is disabled)
## Summary of changes
- Don't run `allure_attach_from_dir` if `--alluredir` is not set
part of https://github.com/neondatabase/neon/issues/8002
Due to the limitation of the current layer map implementation, we cannot
directly replace a layer. It's interpreted as an insert and a deletion,
and there will be file exist error when renaming the newly-created layer
to replace the old layer. We work around that by changing the end key of
the image layer. A long-term fix would involve a refactor around the
layer file naming. For delta layers, we simply skip layers with the same
key range produced, though it is possible to add an extra key as an
alternative solution.
* The image layer range for the layers generated from gc-compaction will
be Key::MIN..(Key..MAX-1), to avoid being recognized as an L0 delta
layer.
* Skip existing layers if it turns out that we need to generate a layer
with the same persistent key in the same generation.
Note that it is possible that the newly-generated layer has different
content from the existing layer. For example, when the user drops a
retain_lsn, the compaction could have combined or dropped some records,
therefore creating a smaller layer than the existing one. We discard the
"optimized" layer for now because we cannot deal with such rewrites
within the same generation.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
## Problem
We recently added a "visibility" state to layers, but nothing
initializes it.
Part of:
- #8398
## Summary of changes
- Add a dependency on `range-set-blaze`, which is used as a fast
incrementally updated alternative to KeySpace. We could also use this to
replace the internals of KeySpaceRandomAccum if we wanted to. Writing a
type that does this kind of "BtreeMap & merge overlapping entries" thing
isn't super complicated, but no reason to write this ourselves when
there's a third party impl available.
- Add a function to layermap to calculate visibilities for each layer
- Add a function to Timeline to call into layermap and then apply these
visibilities to the Layer objects.
- Invoke the calculation during startup, after image layer creations,
and when removing branches. Branch removal and image layer creation are
the two ways that a layer can go from Visible to Covered.
- Add unit test & benchmark for the visibility calculation
- Expose `pageserver_visible_physical_size` metric, which should always
be <= `pageserver_remote_physical_size`.
- This metric will feed into the /v1/utilization endpoint later: the
visible size indicates how much space we would like to use on this
pageserver for this tenant.
- When `pageserver_visible_physical_size` is greater than
`pageserver_resident_physical_size`, this is a sign that the tenant has
long-idle branches, which result in layers that are visible in
principle, but not used in practice.
This does not keep visibility hints up to date in all cases:
particularly, when creating a child timeline, any previously covered
layers will not get marked Visible until they are accessed.
Updates after image layer creation could be implemented as more of a
special case, but this would require more new code: the existing depth
calculation code doesn't maintain+yield the list of deltas that would be
covered by an image layer.
## Performance
This operation is done rarely (at startup and at timeline deletion), so
needs to be efficient but not ultra-fast.
There is a new `visibility` bench that measures runtime for a synthetic
100k layers case (`sequential`) and a real layer map (`real_map`) with
~26k layers.
The benchmark shows runtimes of single digit milliseconds (on a ryzen
7950). This confirms that the runtime shouldn't be a problem at startup
(as we already incur S3-level latencies there), but that it's slow
enough that we definitely shouldn't call it more often than necessary,
and it may be worthwhile to optimize further later (things like: when
removing a branch, only bother scanning layers below the branchpoint)
```
visibility/sequential time: [4.5087 ms 4.5894 ms 4.6775 ms]
change: [+2.0826% +3.9097% +5.8995%] (p = 0.00 < 0.05)
Performance has regressed.
Found 24 outliers among 100 measurements (24.00%)
2 (2.00%) high mild
22 (22.00%) high severe
min: 0/1696070, max: 93/1C0887F0
visibility/real_map time: [7.0796 ms 7.0832 ms 7.0871 ms]
change: [+0.3900% +0.4505% +0.5164%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe
min: 0/1696070, max: 93/1C0887F0
visibility/real_map_many_branches
time: [4.5285 ms 4.5355 ms 4.5434 ms]
change: [-1.0012% -0.8004% -0.5969%] (p = 0.00 < 0.05)
Change within noise threshold.
```
Before, we had four versions of linux-raw-sys in our dependency graph:
```
linux-raw-sys@0.1.4
linux-raw-sys@0.3.8
linux-raw-sys@0.4.13
linux-raw-sys@0.6.4
```
now it's only two:
```
linux-raw-sys@0.4.13
linux-raw-sys@0.6.4
```
The changes in this PR are minimal. In order to get to its state one
only has to update procfs in Cargo.toml to 0.16 and do `cargo update -p
tempfile -p is-terminal -p prometheus`.
# Motivation
The working theory for hung systemd during PS deploy
(https://github.com/neondatabase/cloud/issues/11387) is that leftover
walredo processes trigger a race condition.
In https://github.com/neondatabase/neon/pull/8150 I arranged that a
clean Tenant shutdown does actually kill its walredo processes.
But many prod machines don't manage to shut down all their tenants until
the 10s systemd timeout hits and, presumably, triggers the race
condition in systemd / the Linux kernel that causes the frozen systemd
# Solution
This PR bolts on a rather ugly mechanism to shut down tenant managers
out of order 8s after we've received the SIGTERM from systemd.
# Changes
- add a global registry of `Weak<WalRedoManager>`
- add a special thread spawned during `shutdown_pageserver` that sleeps
for 8s, then shuts down all redo managers in the registry and prevents
new redo managers from being created
- propagate the new failure mode of tenant spawning throughout the code
base
- make sure shut down tenant manager results in
PageReconstructError::Cancelled so that if Timeline::get calls come in
after the shutdown, they do the right thing
## Problem
In https://github.com/neondatabase/neon/pull/8241 I've accidentally
removed `create-test-report` dependency on `benchmarks` job
## Summary of changes
- Run `create-test-report` after `benchmarks` job
Uses the newly added APIs from #8541 named `stream_tenants_generic` and
`stream_objects_with_retries` and extends them with
`list_objects_with_retries_generic` and
`stream_tenant_timelines_generic` to migrate the `find-garbage` command
of the scrubber to `GenericRemoteStorage`.
Part of https://github.com/neondatabase/neon/issues/7547
## Problem
This code was confusing, untested and covered:
- an impossible case, where intent state is AttacheStale (we never do
this)
- a rare edge case (going from AttachedMulti to Attached), which we were
not testing, and in any case the pageserver internally does the same
Tenant reset in this transition as it would do if we incremented
generation.
Closes: https://github.com/neondatabase/neon/issues/8367
## Summary of changes
- Simplify the logic to only skip incrementing the generation if the
location already has the expected generation and the exact same mode.
In some cases, we can get a negative metric for replication_delay_bytes.
My best guess from all the research I've done is that we evaluate
pg_last_wal_receive_lsn() before pg_last_wal_replay_lsn(), and that by
the time everything is said and done, the replay LSN has advanced past
the receive LSN. In this case, our lag can effectively be modeled as
0 due to the speed of the WAL reception and replay.
Since the introduction of sharding, the protocol handling loop in
`handle_pagerequests` cannot know anymore which concrete
`Tenant`/`Timeline` object any of the incoming `PagestreamFeMessage`
resolves to.
In fact, one message might resolve to one `Tenant`/`Timeline` while
the next one may resolve to another one.
To avoid going to tenant manager, we added the `shard_timelines` which
acted as an ever-growing cache that held timeline gate guards open for
the lifetime of the connection.
The consequence of holding the gate guards open was that we had to be
sensitive to every cached `Timeline::cancel` on each interaction with
the network connection, so that Timeline shutdown would not have to wait
for network connection interaction.
We can do better than that, meaning more efficiency & better
abstraction.
I proposed a sketch for it in
* https://github.com/neondatabase/neon/pull/8286
and this PR implements an evolution of that sketch.
The main idea is is that `mod page_service` shall be solely concerned
with the following:
1. receiving requests by speaking the protocol / pagestream subprotocol
2. dispatching the request to a corresponding method on the correct
shard/`Timeline` object
3. sending response by speaking the protocol / pagestream subprotocol.
The cancellation sensitivity responsibilities are clear cut:
* while in `page_service` code, sensitivity to page_service cancellation
is sufficient
* while in `Timeline` code, sensitivity to `Timeline::cancel` is
sufficient
To enforce these responsibilities, we introduce the notion of a
`timeline::handle::Handle` to a `Timeline` object that is checked out
from a `timeline::handle::Cache` for **each request**.
The `Handle` derefs to `Timeline` and is supposed to be used for a
single async method invocation on `Timeline`.
See the lengthy doc comment in `mod handle` for details of the design.
part of https://github.com/neondatabase/neon/issues/8002
For child branches, we will pull the image of the modified keys from the
parant into the child branch, which creates a full history for
generating key retention. If there are not enough delta keys, the image
won't be wrote eventually, and we will only keep the deltas inside the
child branch. We could avoid the wasteful work to pull the image from
the parent if we can know the number of deltas in advance, in the future
(currently we always pull image for all modified keys in the child
branch)
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We run regression tests on `release` & `debug` builds for each of the
three supported Postgres versions (6 in total).
With upcoming ARM support and Postgres 17, the number of jobs will jump
to 16, which is a lot.
See the internal discussion here:
https://neondb.slack.com/archives/C033A2WE6BZ/p1722365908404329
## Summary of changes
- Run `regress-tests` job in debug builds only with the latest Postgres
version
- Do not do `debug` builds on release branches
part of https://github.com/neondatabase/neon/issues/8184
# Problem
We want to bypass PS PageCache for all data block reads, but
`compact_level0_phase1` currently uses `ValueRef::load` to load the WAL
records from delta layers.
Internally, that maps to `FileBlockReader:read_blk` which hits the
PageCache
[here](e78341e1c2/pageserver/src/tenant/block_io.rs (L229-L236)).
# Solution
This PR adds a mode for `compact_level0_phase1` that uses the
`MergeIterator` for reading the `Value`s from the delta layer files.
`MergeIterator` is a streaming k-merge that uses vectored blob_io under
the hood, which bypasses the PS PageCache for data blocks.
Other notable changes:
* change the `DiskBtreeReader::into_stream` to buffer the node, instead
of holding a `PageCache` `PageReadGuard`.
* Without this, we run out of page cache slots in
`test_pageserver_compaction_smoke`.
* Generally, `PageReadGuard`s aren't supposed to be held across await
points, so, this is a general bugfix.
# Testing / Validation / Performance
`MergeIterator` has not yet been used in production; it's being
developed as part of
* https://github.com/neondatabase/neon/issues/8002
Therefore, this PR adds a validation mode that compares the existing
approach's value iterator with the new approach's stream output, item by
item.
If they're not identical, we log a warning / fail the unit/regression
test.
To avoid flooding the logs, we apply a global rate limit of once per 10
seconds.
In any case, we use the existing approach's value.
Expected performance impact that will be monitored in staging / nightly
benchmarks / eventually pre-prod:
* with validation:
* increased CPU usage
* ~doubled VirtualFile read bytes/second metric
* no change in disk IO usage because the kernel page cache will likely
have the pages buffered on the second read
* without validation:
* slightly higher DRAM usage because each iterator participating in the
k-merge has a dedicated buffer (as opposed to before, where compactions
would rely on the PS PageCaceh as a shared evicting buffer)
* less disk IO if previously there were repeat PageCache misses (likely
case on a busy production Pageserver)
* lower CPU usage: PageCache out of the picture, fewer syscalls are made
(vectored blob io batches reads)
# Rollout
The new code is used with validation mode enabled-by-default.
This gets us validation everywhere by default, specifically in
- Rust unit tests
- Python tests
- Nightly pagebench (shouldn't really matter)
- Staging
Before the next release, I'll merge the following aws.git PR that
configures prod to continue using the existing behavior:
* https://github.com/neondatabase/aws/pull/1663
# Interactions With Other Features
This work & rollout should complete before Direct IO is enabled because
Direct IO would double the IOPS & latency for each compaction read
(#8240).
# Future Work
The streaming k-merge's memory usage is proportional to the amount of
memory per participating layer.
But `compact_level0_phase1` still loads all keys into memory for
`all_keys_iter`.
Thus, it continues to have active memory usage proportional to the
number of keys involved in the compaction.
Future work should replace `all_keys_iter` with a streaming keys
iterator.
This PR has a draft in its first commit, which I later reverted because
it's not necessary to achieve the goal of this PR / issue #8184.
Change Azure storage configuration to point to new variables/secrets. They have
the `_NEW` suffix in order not to disrupt any tests while we complete the
switch.
Part of #8128, followup to #8480. closes#8421.
Enable scrubber to optionally post metadata scan health results to
storage controller.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Part of #8128, followed by #8502.
## Problem
Currently we lack mechanism to alert unhealthy `scan_metadata` status if
we start running this scrubber command as part of a cronjob. With the
storage controller client introduced to storage scrubber in #8196, it is
viable to set up alert by storing health status in the storage
controller database.
We intentionally do not store the full output to the database as the
json blobs potentially makes the table really huge. Instead, only a
health status and a timestamp recording the last time metadata health
status is posted on a tenant shard.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
This tests the ability to push into ACR using OIDC. Proved it worked by running slightly modified YAML.
In `promote-images` we push the following images `neon compute-tools {vm-,}compute-node-{v14,v15,v16}` into `neoneastus2`.
https://github.com/neondatabase/cloud/issues/14640
## Problem
We don't allow regular end-users to use `k8s-pod` provisioner,
but we still use it in nightly benchmarks
## Summary of changes
- Remove `provisioner` input from `neon-create-project` action, use
`k8s-neonvm` as a default provioner
- Change `neon-` platform prefix to `neonvm-`
- Remove `neon-captest-freetier` and `neon-captest-new` as we already
have their `neonvm` counterparts
Add two new functions `stream_objects_with_retries` and
`stream_tenants_generic` and use them in the `find-large-objects`
subcommand, migrating it to `remote_storage`.
Also adds the `size` field to the `ListingObject` struct.
Part of #7547
If compression is enabled, we currently try compressing each image
larger than a specific size and if the compressed version is smaller, we
write that one, otherwise we use the uncompressed image. However, this
might sometimes be a wasteful process, if there is a substantial amount
of images that don't compress well.
The compression metrics added in #8420
`pageserver_compression_image_in_bytes_total` and
`pageserver_compression_image_out_bytes_total` are well designed for
answering the question how space efficient the total compression process
is end-to-end, which helps one to decide whether to enable it or not.
To answer the question of how much waste there is in terms of trial
compression, so CPU time, we add two metrics:
* one about the images that have been trial-compressed (considered), and
* one about the images where the compressed image has actually been
written (chosen).
There is different ways of weighting them, like for example one could
look at the count, or the compressed data. But the main contributor to
compression CPU usage is amount of data processed, so we weight the
images by their *uncompressed* size. In other words, the two metrics
are:
* `pageserver_compression_image_in_bytes_considered`
* `pageserver_compression_image_in_bytes_chosen`
Part of #5431
## Problem
Old storage buckets can contain a lot of tenants that aren't known to
the control plane at all, because they belonged to test jobs that get
their control plane state cleaned up shortly after running.
In general, it's somewhat unsafe to purge these, as it's hard to
distinguish "control plane doesn't know about this, so it's garbage"
from "control plane said it didn't know about this, which is a bug in
the scrubber, control plane, or API URL configured".
However, the most common case is that we see only a small husk of a
tenant in S3 from a specific old behavior of the software, for example:
- We had a bug where heatmaps weren't deleted on tenant delete
- When WAL DR was first deployed, we didn't delete initdb.tar.zst on
tenant deletion
## Summary of changes
- Add a KnownBug variant for the garbage reason
- Include such cases in the "safe" deletion mode (`--mode=deleted`)
- Add code that inspects tenants missing in control plane to identify
cases of known bugs (this is kind of slow, but should go away once we've
cleaned all these up)
- Add an additional `-min-age` safety check similar to physical GC,
where even if everything indicates objects aren't needed, we won't
delete something that has been modified too recently.
---------
Co-authored-by: Yuchen Liang <70461588+yliang412@users.noreply.github.com>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
The secondary download HTTP API is meant to return 200 if the download
is complete, and 202 if it is still in progress. In #8198 the download
implementation was changed to drop out with success early if it
over-runs a time budget, which resulted in 200 responses for incomplete
downloads.
This breaks storcon_cli's "tenant-warmup" command, which uses the OK
status to indicate download complete.
## Summary of changes
- Only return 200 if we get an Ok() _and_ the progress stats indicate
the download is complete.
## Problem
We need to test logical replication with 3rd-party tools regularly.
## Summary of changes
Added a test using ClickHouse as a client
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Uses the Stream based `list_streaming` function added by #8457 in tenant
deletion, as suggested in https://github.com/neondatabase/neon/pull/7932#issuecomment-2150480180 .
We don't have to worry about retries, as the function is wrapped inside
an outer retry block. If there is a retryable error either during the
listing or during deletion, we just do a fresh start.
Also adds `+ Send` bounds as they are required by the
`delete_tenant_remote` function.
## Problem
After https://github.com/neondatabase/neon/pull/7990 `regress_test` job
started to fail with an error:
```
...
File "/__w/neon/neon/test_runner/fixtures/benchmark_fixture.py", line 485, in pytest_terminal_summary
terminalreporter.write(f"{test_report.head_line}.{recorded_property['name']}: ")
TypeError: 'bool' object is not subscriptable
```
https://github.com/neondatabase/neon/actions/runs/10125750938/job/28002582582
It happens because the current implementation doesn't expect pytest's
`user_properties` can be used for anything else but benchmarks (and
https://github.com/neondatabase/neon/pull/7990 started to use it for
tracking `preserve_database_files` parameter)
## Summary of changes
- Make NeonBenchmarker use only records with`neon_benchmarker_` prefix
## Problem
There's a `NeonEnvBuilder#preserve_database_files` parameter that allows
you to keep database files for debugging purposes (by default, files get
cleaned up), but there's no way to get these files from a CI run.
This PR adds handling of `NeonEnvBuilder#preserve_database_files` and
adds the compressed test output directory to Allure reports (for tests
with this parameter enabled).
Ref https://github.com/neondatabase/neon/issues/6967
## Summary of changes
- Compress and add the whole test output directory to Allure reports
- Currently works only with `neon_env_builder` fixture
- Remove `preserve_database_files = True` from sharding tests as
unneeded
---------
Co-authored-by: Christian Schwarz <christian@neon.tech>
Persists whether a timeline is archived or not in `index_part.json`. We
only return success if the upload has actually worked successfully.
Also introduces a new `index_part.json` version number.
Fixes#8459
Part of #8088
close https://github.com/neondatabase/neon/issues/8435
## Summary of changes
If L0 compaction did not include all L0 layers, skip image generation.
There are multiple possible solutions to the original issue, i.e., an
alternative is to wrap the partial L0 compaction in a loop until it
compacts all L0 layers. However, considering that we should weight all
tenants equally, the current solution can ensure everyone gets a chance
to run compaction, and those who write too much won't get a chance to
create image layers. This creates a natural backpressure feedback that
they get a slower read due to no image layers are created, slowing down
their writes, and eventually compaction could keep up with their writes
+ generate image layers.
Consider deployment, we should add an alert on "skipping image layer
generation", so that we won't run into the case that image layers are
not generated => incidents again.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Problem
-------
wait_lsn timeouts result in a user-facing errors like
```
$ /tmp/neon/pg_install/v16/bin/pgbench -s3424 -i -I dtGvp user=neondb_owner dbname=neondb host=ep-tiny-wave-w23owa37.eastus2.azure.neon.build sslmode=require options='-cstatement_timeout=0 '
dropping old tables...
NOTICE: table "pgbench_accounts" does not exist, skipping
NOTICE: table "pgbench_branches" does not exist, skipping
NOTICE: table "pgbench_history" does not exist, skipping
NOTICE: table "pgbench_tellers" does not exist, skipping
creating tables...
generating data (server-side)...
vacuuming...
pgbench: error: query failed: ERROR: [NEON_SMGR] [shard 0] could not read block 214338 in rel 1663/16389/16839.0 from page server at lsn C/E1C12828
DETAIL: page server returned error: LSN timeout: Timed out while waiting for WAL record at LSN C/E1418528 to arrive, last_record_lsn 6/999D9CA8 disk consistent LSN=6/999D9CA8, WalReceiver status: (update 2024-07-25 08:30:07): connecting to node 25, safekeeper candidates (id|update_time|commit_lsn): [(21|08:30:16|C/E1C129E0), (23|08:30:16|C/E1C129E0), (25|08:30:17|C/E1C129E0)]
CONTEXT: while scanning block 214338 of relation "public.pgbench_accounts"
pgbench: detail: Query was: vacuum analyze pgbench_accounts
```
Solution
--------
Its better to be slow than to fail the queries.
If the app has a deadline, it can use `statement_timeout`.
In the long term, we want to eliminate wait_lsn timeout.
In the short term (this PR), we bump the wait_lsn timeout to
a larger value to reduce the frequency at which these wait_lsn timeouts
occur.
We will observe SLOs and specifically
`pageserver_wait_lsn_seconds_bucket`
before we eliminate the timeout completely.
## Problem
We are missing the step-down primitive required to implement rolling
restarts of the storage controller.
## Summary of changes
Add `/control/v1/step_down` endpoint which puts the storage controller
into a state where it rejects
all API requests apart from `/control/v1/step_down`, `/status` and
`/metrics`. When receiving the request,
storage controller cancels all pending reconciles and waits for them to
exit gracefully. The response contains
a snapshot of the in-memory observed state.
Related:
* https://github.com/neondatabase/cloud/issues/14701
* https://github.com/neondatabase/neon/issues/7797
* https://github.com/neondatabase/neon/pull/8310
## Problem
Vectored get is already enabled in all prod regions without validation.
The pageserver defaults
are out of sync however.
## Summary of changes
Update the pageserver defaults to match the prod config. Also means that
when running tests locally,
people don't have to use the env vars to get the prod config.
## Problem
This is an experiment to see if 16x concurrency is actually helping, or
if it's just giving us very noisy results. If the total runtime with a
lower concurrency is similar, then a lower concurrency is preferable to
reduce the impact of resource-hungry tests running concurrently.
## Problem
This test relies on writing image layers before the split. It can fail
to do so durably if the image layers are written ahead of the remote
consistent LSN, so we should have been doing a checkpoint rather than
just a compaction
## Problem
The scrubber would like to check the highest mtime in a tenant's objects
as a safety check during purges. It recently switched to use
GenericRemoteStorage, so we need to expose that in the listing methods.
## Summary of changes
- In Listing.keys, return a ListingObject{} including a last_modified
field, instead of a RemotePath
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
follow up for #8475
## Summary of changes
Using own private docker registry in `cache-from` and `cache-to`
settings in docker build-push actions
There is a race condition between timeline shutdown and the split task.
Timeline shutdown first shuts down the upload queue, and only then fires
the cancellation token. A parallel running timeline split operation
might thus encounter a cancelled upload queue before the cancellation
token is fired, and print a noisy error.
Fix this by mapping `anyhow::Error{ NotInitialized::ShuttingDown }) to
`FlushLayerError::Cancelled` instead of `FlushLayerError::Other(_)`.
Fixes#8496
update pg_jsonschema extension to v 0.3.1
update pg_graphql extension to v1.5.7
update pgx_ulid extension to v0.1.5
update pg_tiktoken extension, patch Cargo.toml to use new pgrx
This pull request (should) fix the failure of test_gc_feedback. See the
explanation in the newly-added test case.
Part of https://github.com/neondatabase/neon/issues/8002
Allow incomplete history for the compaction algorithm.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Storcon shutdown did not produce a clean observed state. This is not a
problem at the moment, but we will need to stop all reconciles with
clean observed state for rolling restarts.
I tried to test this by collecting the observed state during shutdown
and comparing it with the in-memory observed
state, but it doesn't work because a lot of tests use the cursed attach
hook to create tenants directly through the ps.
## Summary of Changes
Rework storcon shutdown as follows:
* Reconcilers get a separate cancellation token which is a child token
of the global `Service::cancel`.
* Reconcilers get a separate gate
* Add a mechanism to drain the reconciler result queue before
* Put all of this together into a clean shutdown sequence
Related https://github.com/neondatabase/cloud/issues/14701
## Problem
This test was destabilized by
https://github.com/neondatabase/neon/pull/8431. The threshold is
arbitrary & failures are still quite close to it. At a high level the
test is asserting "eviction was approximately fair to these tenants",
which appears to still be the case when the abs diff between ratios is
slightly higher at ~0.6-0.7.
## Summary of changes
- Change threshold from 0.06 to 0.065. Based on the last ~10 failures
that should be sufficient.
2024-07-25 15:00:42 +01:00
699 changed files with 54455 additions and 20368 deletions
if ${PG_BINARIES}/psql "${{ env.BENCHMARK_CONNSTR }}" -tAc "SELECT 1 FROM benchmark_restore_status WHERE databasename='${{ env.DATABASE_NAME }}' AND restore_done=true;" | grep -q 1; then
echo "Restore already done for database ${{ env.DATABASE_NAME }} on platform ${{ env.PLATFORM }}. Skipping this database."
skip=true
fi
echo "skip=${skip}" | tee -a $GITHUB_OUTPUT
- name:Check and create database if it does not exist
# `!failure() && !cancelled()` is required because the workflow transitively depends on the job that can be skipped: `push-to-acr-dev` and `push-to-acr-prod`
# 'gdb' is included so that we get backtraces of core dumps produced in
# regression tests
RUNset -e \
&& apt update \
&& apt install -y \
@@ -24,6 +27,7 @@ RUN set -e \
cmake \
curl \
flex \
gdb \
git \
gnupg \
gzip \
@@ -58,7 +62,7 @@ RUN set -e \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
# protobuf-compiler (protoc)
ENV PROTOC_VERSION25.1
ENVPROTOC_VERSION=25.1
RUN curl -fsSL "https://github.com/protocolbuffers/protobuf/releases/download/v${PROTOC_VERSION}/protoc-${PROTOC_VERSION}-linux-$(uname -m | sed 's/aarch64/aarch_64/g').zip" -o "protoc.zip"\
&& unzip -q protoc.zip -d protoc \
&& mv protoc/bin/protoc /usr/local/bin/protoc \
@@ -99,7 +103,7 @@ RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-$(uname -m).zip" -o "aws
# recommended approach from https://www.rust-lang.org/tools/install
@@ -126,7 +132,7 @@ make -j`sysctl -n hw.logicalcpu` -s
To run the `psql` client, install the `postgresql-client` package or modify `PATH` and `LD_LIBRARY_PATH` to include `pg_install/bin` and `pg_install/lib`, respectively.
To run the integration tests or Python scripts (not required to use the code), install
Python (3.9 or higher), and install the python3 packages using `./scripts/pysync` (requires [poetry>=1.3](https://python-poetry.org/)) in the project directory.
Python (3.9 or higher), and install the python3 packages using `./scripts/pysync` (requires [poetry>=1.8](https://python-poetry.org/)) in the project directory.
#### Running neon database
@@ -262,7 +268,7 @@ By default, this runs both debug and release modes, and all supported postgres v
testing locally, it is convenient to run just one set of permutations, like this:
select lfc_value as lfc_misses from neon.neon_lfc_stats where lfc_key='file_cache_misses';
- metric_name:lfc_used
type:gauge
help:'LFC chunks used (chunk = 1MB)'
key_labels:
values:[lfc_used]
query:|
select lfc_value as lfc_used from neon.neon_lfc_stats where lfc_key='file_cache_used';
- metric_name:lfc_hits
type:gauge
help:'lfc_hits'
key_labels:
values:[lfc_hits]
query:|
select lfc_value as lfc_hits from neon.neon_lfc_stats where lfc_key='file_cache_hits';
- metric_name:lfc_writes
type:gauge
help:'lfc_writes'
key_labels:
values:[lfc_writes]
query:|
select lfc_value as lfc_writes from neon.neon_lfc_stats where lfc_key='file_cache_writes';
- metric_name:lfc_cache_size_limit
type:gauge
help:'LFC cache size limit in bytes'
key_labels:
values:[lfc_cache_size_limit]
query:|
select pg_size_bytes(current_setting('neon.file_cache_size_limit')) as lfc_cache_size_limit;
- metric_name:connection_counts
type:gauge
help:'Connection counts'
key_labels:
- datname
- state
values:[count]
query:|
select datname, state, count(*) as count from pg_stat_activity where state <> '' group by datname, state;
- metric_name:pg_stats_userdb
type:gauge
help:'Stats for several oldest non-system dbs'
key_labels:
- datname
value_label:kind
values:
- db_size
- deadlocks
# Rows
- inserted
- updated
- deleted
# We export stats for 10 non-system database. Without this limit
# it is too easy to abuse the system by creating lots of databases.
query:|
select pg_database_size(datname) as db_size, deadlocks,
tup_inserted as inserted, tup_updated as updated, tup_deleted as deleted,
datname
from pg_stat_database
where datname IN (
select datname
from pg_database
where datname <> 'postgres' and not datistemplate
order by oid
limit 10
);
- metric_name:max_cluster_size
type:gauge
help:'neon.max_cluster_size setting'
key_labels:
values:[max_cluster_size]
query:|
select setting::int as max_cluster_size from pg_settings where name = 'neon.max_cluster_size';
- metric_name:db_total_size
type:gauge
help:'Size of all databases'
key_labels:
values:[total]
query:|
select sum(pg_database_size(datname)) as total from pg_database;
- metric_name:getpage_wait_seconds_count
type:counter
help:'Number of getpage requests'
values:[getpage_wait_seconds_count]
query_ref:neon_perf_counters
- metric_name:getpage_wait_seconds_sum
type:counter
help:'Time spent in getpage requests'
values:[getpage_wait_seconds_sum]
query_ref:neon_perf_counters
- metric_name:getpage_prefetch_requests_total
type:counter
help:'Number of getpage issued for prefetching'
values:[getpage_prefetch_requests_total]
query_ref:neon_perf_counters
- metric_name:getpage_sync_requests_total
type:counter
help:'Number of synchronous getpage issued'
values:[getpage_sync_requests_total]
query_ref:neon_perf_counters
- metric_name:getpage_prefetch_misses_total
type:counter
help:'Total number of readahead misses; consisting of either prefetches that don''t satisfy the LSN bounds once the prefetch got read by the backend, or cases where somehow no readahead was issued for the read'
values:[getpage_prefetch_misses_total]
query_ref:neon_perf_counters
- metric_name:getpage_prefetch_discards_total
type:counter
help:'Number of prefetch responses issued but not used'
values:[getpage_prefetch_discards_total]
query_ref:neon_perf_counters
- metric_name:pageserver_requests_sent_total
type:counter
help:'Number of all requests sent to the pageserver (not just GetPage requests)'
values:[pageserver_requests_sent_total]
query_ref:neon_perf_counters
- metric_name:pageserver_disconnects_total
type:counter
help:'Number of times that the connection to the pageserver was lost'
values:[pageserver_disconnects_total]
query_ref:neon_perf_counters
- metric_name:pageserver_send_flushes_total
type:counter
help:'Number of flushes to the pageserver connection'
values:[pageserver_send_flushes_total]
query_ref:neon_perf_counters
- metric_name:getpage_wait_seconds_bucket
type:counter
help:'Histogram buckets of getpage request latency'
key_labels:
- bucket_le
values:[value]
query_ref:getpage_wait_seconds_buckets
# DEPRECATED
- metric_name:lfc_approximate_working_set_size
type:gauge
help:'Approximate working set size in pages of 8192 bytes'
key_labels:
values:[approximate_working_set_size]
query:|
select neon.approximate_working_set_size(false) as approximate_working_set_size;
assert_eq!(&PageServerConf::from(&conf),&self.conf,"during neon_local init, we derive the runtime state of ps conf (self.conf) from the --config flag fully");
// TODO(christian): instead of what we do here, create a pageserver_api::config::ConfigToml (PR #7656)
@@ -127,16 +125,19 @@ impl PageServerNode {
}
// Apply the user-provided overrides
overrides.push(
toml_edit::ser::to_string_pretty(&conf)
.expect("we deserialized this from toml earlier"),
);
overrides.push({
letmutdoc=
toml_edit::ser::to_document(&conf).expect("we deserialized this from toml earlier");
// `id` is written out to `identity.toml` instead of `pageserver.toml`
doc.remove("id").expect("it's part of the struct");
doc.to_string()
});
// Turn `overrides` into a toml document.
// TODO: above code is legacy code, it should be refactored to use toml_edit directly.
- [neondatabase/neon](https://hub.docker.com/repository/docker/neondatabase/neon) — image with pre-built `pageserver`, `safekeeper` and `proxy` binaries and all the required runtime dependencies. Built from [/Dockerfile](/Dockerfile).
- [neondatabase/compute-node-v16](https://hub.docker.com/repository/docker/neondatabase/compute-node-v16) — compute node image with pre-built Postgres binaries from [neondatabase/postgres](https://github.com/neondatabase/postgres). Similar images exist for v15 and v14.
- [neondatabase/compute-node-v16](https://hub.docker.com/repository/docker/neondatabase/compute-node-v16) — compute node image with pre-built Postgres binaries from [neondatabase/postgres](https://github.com/neondatabase/postgres). Similar images exist for v15 and v14. Built from [/compute-node/Dockerfile](/compute/Dockerfile.compute-node).
@@ -14,7 +14,7 @@ picked tenant (which requested on-demand activation) for around 30 seconds
during the restart at 2024-04-03 16:37 UTC.
Note that lots of shutdowns on loaded pageservers do not finish within the
[10 second systemd enforced timeout](https://github.com/neondatabase/aws/blob/0a5280b383e43c063d43cbf87fa026543f6d6ad4/.github/ansible/systemd/pageserver.service#L16). This means we are shutting down without flushing ephemeral layers
[10 second systemd enforced timeout](https://github.com/neondatabase/infra/blob/0a5280b383e43c063d43cbf87fa026543f6d6ad4/.github/ansible/systemd/pageserver.service#L16). This means we are shutting down without flushing ephemeral layers
and have to reingest data in order to serve requests after restarting, potentially making first request latencies worse.
This problem is not yet very acutely felt in storage controller managed pageservers since
storing safekeepers per timeline instead of per tenant because we can't migrate
tenants atomically.
The important question is how updated configuration is delivered from
storage_controller to control plane to provide it to computes. As always, there
are two options, pull and push. Let's do it the same push as with pageserver
`/notify-attach` because 1) it keeps storage_controller out of critical compute
start path 2) provides easier upgrade: there won't be such a thing as 'timeline
managed by control plane / storcon', cplane just takes the value out of its db
when needed 3) uniformity. It makes storage_controller responsible for retrying notifying
control plane until it succeeds.
So, cplane `/notify-safekeepers` for the timeline accepts `Configuration` and
updates it in the db if the provided conf generation is higher (the cplane db
should also store generations for this). Similarly to [`/notify-attach`](https://www.notion.so/neondatabase/Storage-Controller-Control-Plane-interface-6de56dd310a043bfa5c2f5564fa98365), it
should update db which makes the call successful, and then try to schedule
`apply_config` if possible, it is ok if not. storage_controller
should rate limit calling the endpoint, but likely this won't be needed, as migration
throughput is limited by `pull_timeline`.
Timeline (branch) creation in cplane should call storage_controller POST
`tenant/:tenant_id/timeline` like it currently does for sharded tenants.
Response should be augmented with `safekeeper_conf: Configuration`. The call
should be retried until succeeds.
Timeline deletion and tenant deletion in cplane should call appropriate
storage_controller endpoints like it currently does for sharded tenants. The
calls should be retried until they succeed.
### storage_controller implementation
Current 'load everything on startup and keep in memory' easy design is fine.
Single timeline shouldn't take more than 100 bytes (it's 16 byte tenant_id, 16
byte timeline_id, int generation, vec of ~3 safekeeper ids plus some flags), so
10^6 of timelines shouldn't take more than 100MB.
Similar to pageserver attachment Intents storage_controller would have in-memory
`MigrationRequest` (or its absense) for each timeline and pool of tasks trying
to make these request reality; this ensures one instance of storage_controller
won't do several migrations on the same timeline concurrently. In the first
version it is simpler to have more manual control and no retries, i.e. migration
failure removes the request. Later we can build retries and automatic
scheduling/migration. `MigrationRequest` is
```
enum MigrationRequest {
To(Vec<NodeId>),
FinishPending,
}
```
`FinishPending` requests to run the procedure to ensure state is clean: current
configuration is not joint and majority of safekeepers are aware of it, but do
not attempt to migrate anywhere. If current configuration fetched on step 1 is
not joint it jumps to step 7. It should be run at startup for all timelines (but
similarly, in the first version it is ok to trigger it manually).
#### Schema
`safekeepers` table mirroring current `nodes` should be added, except that for
`scheduling_policy` field (seems like `status` is a better name for it): it is enough
to have at least in the beginning only 3 fields: 1) `active` 2) `offline` 3)
`decomissioned`.
`timelines` table:
```
table! {
// timeline_id is primary key
timelines (tenant_id, timeline_id) {
timeline_id -> Varchar,
tenant_id -> Varchar,
generation -> Int4,
sk_set -> Array<Int4>, // list of safekeeper ids
new_sk_set -> Nullable<Array<Int4>>, // list of safekeeper ids, null if not joint conf
cplane_notified_generation -> Int4,
}
}
```
#### API
Node management is similar to pageserver:
1) POST `/control/v1/safekeepers` upserts safekeeper.
2) GET `/control/v1/safekeepers` lists safekeepers.
3) GET `/control/v1/safekeepers/:node_id` gets safekeeper.
4) PUT `/control/v1/safekepers/:node_id/status` changes status to e.g.
`offline` or `decomissioned`. Initially it is simpler not to schedule any
migrations here.
Safekeeper deploy scripts should register safekeeper at storage_contorller as
they currently do with cplane, under the same id.
Timeline creation/deletion: already existing POST `tenant/:tenant_id/timeline`
would 1) choose initial set of safekeepers; 2) write to the db initial
`Configuration` with `INSERT ON CONFLICT DO NOTHING` returning existing row in
case of conflict; 3) create timeline on the majority of safekeepers (already
created is ok).
We don't want to block timeline creation when one safekeeper is down. Currently
this is solved by compute implicitly creating timeline on any safekeeper it is
connected to. This creates ugly timeline state on safekeeper when timeline is
created, but start LSN is not defined yet. It would be nice to remove this; to
do that, controller can in the background retry to create timeline on
safekeeper(s) which missed that during initial creation call. It can do that
through `pull_timeline` from majority so it doesn't need to remember
`parent_lsn` in its db.
Timeline deletion removes the row from the db and forwards deletion to the
current configuration members. Without additional actions deletions might leak,
see below on this; initially let's ignore these, reporting to cplane success if
at least one safekeeper deleted the timeline (this will remove s3 data).
Tenant deletion repeats timeline deletion for all timelines.
Migration API: the first version is the simplest and the most imperative:
1) PUT `/control/v1/safekeepers/migrate` schedules `MigrationRequest`s to move
all timelines from one safekeeper to another. It accepts json
```
{
"src_sk": u32,
"dst_sk": u32,
"limit": Optional<u32>,
}
```
Returns list of scheduled requests.
2) PUT `/control/v1/tenant/:tenant_id/timeline/:timeline_id/safekeeper_migrate` schedules `MigrationRequest`
to move single timeline to given set of safekeepers:
```
{
"desired_set": Vec<u32>,
}
```
Returns scheduled request.
Similar call should be added for the tenant.
It would be great to have some way of subscribing to the results (apart from
looking at logs/metrics).
Migration is executed as described above. One subtlety is that (local) deletion on
source safekeeper might fail, which is not a problem if we are going to
decomission the node but leaves garbage otherwise. I'd propose in the first version
1) Don't attempt deletion at all if node status is `offline`.
2) If it failed, just issue warning.
And add PUT `/control/v1/safekeepers/:node_id/scrub` endpoint which would find and
remove garbage timelines for manual use. It will 1) list all timelines on the
safekeeper 2) compare each one against configuration storage: if timeline
doesn't exist at all (had been deleted), it can be deleted. Otherwise, it can
be deleted under generation number if node is not member of current generation.
Automating this is untrivial; we'd need to register all potential missing
deletions <tenant_id,timeline_id,generation,node_id> in the same transaction
which switches configurations. Similarly when timeline is fully deleted to
prevent cplane operation from blocking when some safekeeper is not available
deletion should be also registered.
One more task pool should infinitely retry notifying control plane about changed
safekeeper sets.
3) GET `/control/v1/tenant/:tenant_id/timeline/:timeline_id/` should return
current in memory state of the timeline and pending `MigrationRequest`,
if any.
4) PUT `/control/v1/tenant/:tenant_id/timeline/:timeline_id/safekeeper_migrate_abort` tries to abort the
migration by switching configuration from the joint to the one with (previous) `sk_set` under CAS
(incrementing generation as always).
#### Dealing with multiple instances of storage_controller
Operations described above executed concurrently might create some errors but do
not prevent progress, so while we normally don't want to run multiple instances
of storage_controller it is fine to have it temporarily, e.g. during redeploy.
Any interactions with db update in-memory controller state, e.g. if migration
request failed because different one is in progress, controller remembers that
and tries to finish it.
## Testing
`neon_local` should be switched to use storage_controller, playing role of
control plane.
There should be following layers of tests:
1) Model checked TLA+ spec specifies the algorithm and verifies its basic safety.
2) To cover real code and at the same time test many schedules we should have
simulation tests. For that, configuration storage, storage_controller <->
safekeeper communication and pull_timeline need to be mocked and main switch
procedure wrapped to as a node (thread) in simulation tests, using these
mocks. Test would inject migrations like it currently injects
safekeeper/walproposer restars. Main assert is the same -- committed WAL must
not be lost.
3) Since simulation testing injects at relatively high level points (not
syscalls), it omits some code, in particular `pull_timeline`. Thus it is
better to have basic tests covering whole system as well. Extended version of
`test_restarts_under_load` would do: start background load and do migration
under it, then restart endpoint and check that no reported commits
had been lost. I'd also add one more creating classic network split scenario, with
one compute talking to AC and another to BD while migration from nodes ABC to ABD
happens.
4) Simple e2e test should ensure that full flow including cplane notification works.
## Order of implementation and rollout
Note that
- Control plane parts and integration with it is fully independent from everything else
(tests would use simulation and neon_local).
- There is a lot of infra work making storage_controller aware of timelines and safekeepers
and its impl/rollout should be separate from migration itself.
- Initially walproposer can just stop working while it observers joint configuration.
Such window would be typically very short anyway.
To rollout smoothly, both walproposer and safekeeper should have flag
`configurations_enabled`; when set to false, they would work as currently, i.e.
walproposer is able to commit on whatever safekeeper set it is provided. Until
all timelines are managed by storcon we'd need to use current script to migrate
and update/drop entries in the storage_controller database if it has any.
Safekeepers would need to be able to talk both current and new protocol version
with compute to reduce number of computes restarted in prod once v2 protocol is
deployed (though before completely switching we'd need to force this).
Let's have the following rollout order:
- storage_controller becomes aware of safekeepers;
- storage_controller gets timeline creation for new timelines and deletion requests, but
doesn't manage all timelines yet. Migration can be tested on these new timelines.
To keep control plane and storage_controller databases in sync while control
plane still chooses the safekeepers initially (until all timelines are imported
it can choose better), `TimelineCreateRequest` can get optional safekeepers
field with safekeepers chosen by cplane.
- Then we can import all existing timelines from control plane to
storage_controller and gradually enable configurations region by region.
Very rough implementation order:
- Add concept of configurations to safekeepers (including control file),
implement v3 protocol.
- Implement walproposer changes, including protocol.
- Implement storconn part. Use it in neon_local (and pytest).
- Make cplane store safekeepers per timeline instead of per tenant.
This RFC describes the issues around the current storage controller restart procedure
and describes an implementation which reduces downtime to a few milliseconds on the happy path.
## Motivation
Storage controller upgrades (restarts, more generally) can cause multi-second availability gaps.
While the storage controller does not sit on the main data path, it's generally not acceptable
to block management requests for extended periods of time (e.g. https://github.com/neondatabase/neon/issues/8034).
### Current Implementation
The storage controller runs in a Kubernetes Deployment configured for one replica and strategy set to [Recreate](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#recreate-deployment).
In non Kubernetes terms, during an upgrade, the currently running storage controller is stopped and, only after,
a new instance is created.
At start-up, the storage controller calls into all the pageservers it manages (retrieved from DB) to learn the
latest locations of all tenant shards present on them. This is usually fast, but can push into tens of seconds
under unfavourable circumstances: pageservers are heavily loaded or unavailable.
## Prior Art
There's probably as many ways of handling restarts gracefully as there are distributed systems. Some examples include:
* Active/Standby architectures: Two or more instance of the same service run, but traffic is only routed to one of them.
For fail-over, traffic is routed to one of the standbys (which becomes active).
* Consensus Algorithms (Raft, Paxos and friends): The part of consensus we care about here is leader election: peers communicate to each other
and use a voting scheme that ensures the existence of a single leader (e.g. Raft epochs).
## Requirements
* Reduce storage controller unavailability during upgrades to milliseconds
* Minimize the interval in which it's possible for more than one storage controller
to issue reconciles.
* Have one uniform implementation for restarts and upgrades
* Fit in with the current Kubernetes deployment scheme
## Non Goals
* Implement our own consensus algorithm from scratch
* Completely eliminate downtime storage controller downtime. Instead we aim to reduce it to the point where it looks
like a transient error to the control plane
## Impacted Components
* storage controller
* deployment orchestration (i.e. Ansible)
* helm charts
## Terminology
* Observed State: in-memory mapping between tenant shards and their current pageserver locations - currently built up
at start-up by quering pageservers
* Deployment: Kubernetes [primitive](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/) that models
a set of replicas
## Implementation
### High Level Flow
At a very high level the proposed idea is to start a new storage controller instance while
the previous one is still running and cut-over to it when it becomes ready. The new instance,
should coordinate with the existing one and transition responsibility gracefully. While the controller
has built in safety against split-brain situations (via generation numbers), we'd like to avoid such
scenarios since they can lead to availability issues for tenants that underwent changes while two controllers
were operating at the same time and require operator intervention to remedy.
### Kubernetes Deployment Configuration
On the Kubernetes configuration side, the proposal is to update the storage controller `Deployment`
to use `spec.strategy.type = RollingUpdate`, `spec.strategy.rollingUpdate.maxSurge=1` and `spec.strategy.maxUnavailable=0`.
Under the hood, Kubernetes creates a new replica set and adds one pod to it (`maxSurge=1`). The old replica set does not
scale down until the new replica set has one replica in the ready state (`maxUnavailable=0`).
The various possible failure scenarios are investigated in the [Handling Failures](#handling-failures) section.
### Storage Controller Start-Up
This section describes the primitives required on the storage controller side and the flow of the happy path.
#### Database Table For Leader Synchronization
A new table should be added to the storage controller database for leader synchronization during startup.
This table will always contain at most one row. The proposed name for the table is `leader` and the schema
contains two elements:
*`hostname`: represents the hostname for the current storage controller leader - should be addressible
from other pods in the deployment
*`start_timestamp`: holds the start timestamp for the current storage controller leader (UTC timezone) - only required
for failure case handling: see [Previous Leader Crashes Before New Leader Readiness](#previous-leader-crashes-before-new-leader-readiness)
Storage controllers will read the leader row at start-up and then update it to mark themselves as the leader
at the end of the start-up sequence. We want compare-and-exchange semantics for the update: avoid the
situation where two concurrent updates succeed and overwrite each other. The default Postgres isolation
level is `READ COMMITTED`, which isn't strict enough here. This update transaction should use at least `REPEATABLE
READ` isolation level in order to [prevent lost updates](https://www.interdb.jp/pg/pgsql05/08.html). Currently,
the storage controller uses the stricter `SERIALIZABLE` isolation level for all transactions. This more than suits
our needs here.
```
START TRANSACTION ISOLATION LEVEL REPEATABLE READ
UPDATE leader SET hostname=<new_hostname>, start_timestamp=<new_start_ts>
WHERE hostname=<old_hostname>, start_timestampt=<old_start_ts>;
```
If the transaction fails or if no rows have been updated, then the compare-and-exchange is regarded as a failure.
#### Step Down API
A new HTTP endpoint should be added to the storage controller: `POST /control/v1/step_down`. Upon receiving this
request the leader cancels any pending reconciles and goes into a mode where it replies with 503 to all other APIs
and does not issue any location configurations to its pageservers. The successful HTTP response will return a serialized
snapshot of the observed state.
If other step down requests come in after the initial one, the request is handled and the observed state is returned (required
for failure scenario handling - see [Handling Failures](#handling-failures)).
#### Graceful Restart Happy Path
At start-up, the first thing the storage controller does is retrieve the sole row from the new
`leader` table. If such an entry exists, send a `/step_down` PUT API call to the current leader.
This should be retried a few times with a short backoff (see [1]). The aspiring leader loads the
observed state into memory and the start-up sequence proceeds as usual, but *without* querying the
pageservers in order to build up the observed state.
Before doing any reconciliations or persistence change, update the `leader` database table as described in the [Database Table For Leader Synchronization](database-table-for-leader-synchronization)
section. If this step fails, the storage controller process exits.
Note that no row will exist in the `leaders` table for the first graceful restart. In that case, force update the `leader` table
(without the WHERE clause) and perform with the pre-existing start-up procedure (i.e. build observed state by querying pageservers).
Summary of proposed new start-up sequence:
1. Call `/step_down`
2. Perform any pending database migrations
3. Load state from database
4. Load observed state returned in step (1) into memory
5. Do initial heartbeat round (may be moved after 5)
7. Mark self as leader by updating the database
8. Reschedule and reconcile everything
Some things to note from the steps above:
* The storage controller makes no changes to the cluster state before step (5) (i.e. no location config
calls to the pageserver and no compute notifications)
* Ask the current leader to step down before loading state from database so we don't get a lost update
if the transactions overlap.
* Before loading the observed state at step (3), cross-validate against the database. If validation fails,
fall back to asking the pageservers about their current locations.
* Database migrations should only run **after** the previous instance steps down (or the step down times out).
[1] The API call might fail because there's no storage controller running (i.e. [restart](#storage-controller-crash-or-restart)),
so we don't want to extend the unavailability period by much. We still want to retry since that's not the common case.
### Handling Failures
#### Storage Controller Crash Or Restart
The storage controller may crash or be restarted outside of roll-outs. When a new pod is created, its call to
`/step_down` will fail since the previous leader is no longer reachable. In this case perform the pre-existing
start-up procedure and update the leader table (with the WHERE clause). If the update fails, the storage controller
exists and consistency is maintained.
#### Previous Leader Crashes Before New Leader Readiness
When the previous leader (P1) crashes before the new leader (P2) passses the readiness check, Kubernetes will
reconcile the old replica set and create a new pod for it (P1'). The `/step_down` API call will fail for P1'
(see [2]).
Now we have two cases to consider:
* P2 updates the `leader` table first: The database update from P1' will fail and P1' will exit, or be terminated
by Kubernetes depending on timings.
* P1' updates the `leader` table first: The `hostname` field of the `leader` row stays the same, but the `start_timestamp` field changes.
The database update from P2 will fail (since `start_timestamp` does not match). P2 will exit and Kubernetes will
create a new replacement pod for it (P2'). Now the entire dance starts again, but with P1' as the leader and P2' as the incumbent.
[2] P1 and P1' may (more likely than not) be the same pod and have the same hostname. The implementation
should avoid this self reference and fail the API call at the client if the persisted hostname matches
the current one.
#### Previous Leader Crashes After New Leader Readiness
The deployment's replica sets already satisfy the deployment's replica count requirements and the
Kubernetes deployment rollout will just clean up the dead pod.
#### New Leader Crashes Before Pasing Readiness Check
The deployment controller scales up the new replica sets by creating a new pod. The entire procedure is repeated
with the new pod.
#### Network Partition Between New Pod and Previous Leader
This feels very unlikely, but should be considered in any case. P2 (the new aspiring leader) fails the `/step_down`
API call into P1 (the current leader). P2 proceeds with the pre-existing startup procedure and updates the `leader` table.
Kubernetes will terminate P1, but there may be a brief period where both storage controller can drive reconciles.
### Dealing With Split Brain Scenarios
As we've seen in the previous section, we can end up with two storage controller running at the same time. The split brain
duration is not bounded since the Kubernetes controller might become partitioned from the pods (unlikely though). While these
scenarios are not fatal, they can cause tenant unavailability, so we'd like to reduce the chances of this happening.
The rest of this section sketches some safety measure. It's likely overkill to implement all of them however.
### Ensure Leadership Before Producing Side Effects
The storage controller has two types of side effects: location config requests into pageservers and compute notifications into the control plane.
Before issuing either, the storage controller could check that it is indeed still the leader by querying the database. Side effects might still be
applied if they race with the database updatem, but the situation will eventually be detected. The storage controller process should terminate in these cases.
### Leadership Lease
Up until now, the leadership defined by this RFC is static. In order to bound the length of the split brain scenario, we could require the leadership
to be renewed periodically. Two new columns would be added to the leaders table:
1.`last_renewed` - timestamp indicating when the lease was last renewed
2.`lease_duration` - duration indicating the amount of time after which the lease expires
The leader periodically attempts to renew the lease by checking that it is in fact still the legitimate leader and updating `last_renewed` in the
same transaction. If the update fails, the process exits. New storage controller instances wishing to become leaders must wait for the current lease
to expire before acquiring leadership if they have not succesfully received a response to the `/step_down` request.
### Notify Pageserver Of Storage Controller Term
Each time that leadership changes, we can bump a `term` integer column in the `leader` table. This term uniquely identifies a leader.
Location config requests and re-attach responses can include this term. On the pageserver side, keep the latest term in memory and refuse
anything which contains a stale term (i.e. smaller than the current one).
### Observability
* The storage controller should expose a metric which describes it's state (`Active | WarmingUp | SteppedDown`).
Per region alerts should be added on this metric which triggers when:
+ no storage controller has been in the `Active` state for an extended period of time
+ more than one storage controllers are in the `Active` state
* An alert that periodically verifies that the `leader` table is in sync with the metric above would be very useful.
We'd have to expose the storage controller read only database to Grafana (perhaps it is already done).
## Alternatives
### Kubernetes Leases
Kubernetes has a [lease primitive](https://kubernetes.io/docs/concepts/architecture/leases/) which can be used to implement leader election.
Only one instance may hold a lease at any given time. This lease needs to be periodically renewed and has an expiration period.
In our case, it would work something like this:
*`/step_down` deletes the lease or stops it from renewing
* lease acquisition becomes part of the start-up procedure
The kubert crate implements a [lightweight lease API](https://docs.rs/kubert/latest/kubert/lease/struct.LeaseManager.html), but it's still
not exactly trivial to implement.
This approach has the benefit of baked in observability (`kubectl describe lease`), but:
* We offload the responsibility to Kubernetes which makes it harder to debug when things go wrong.
* More code surface than the simple "row in database" approach. Also, most of this code would be in
a dependency not subject to code review, etc.
* Hard to test. Our testing infra does not run the storage controller in Kubernetes and changing it do
so is not simple and complictes and the test set-up.
To my mind, the "row in database" approach is straightforward enough that we don't have to offload this
This is a retrospective RFC describing a new storage strategy for AUX files.
## Motivation
The original aux file storage strategy stores everything in a single `AUX_FILES_KEY`.
Every time the compute node streams a `neon-file` record to the pageserver, it will
update the aux file hash map, and then write the serialized hash map into the key.
This creates serious space bloat. There was a fix to log delta records (i.e., update
a key in the hash map) to the aux file key. In this way, the pageserver only stores
the deltas at each of the LSNs. However, this improved v1 storage strategy still
requires us to store everything in an aux file cache in memory, because we cannot
fetch a single key (or file) from the compound `AUX_FILES_KEY`.
### Prior art
For storing large amount of small files, we can use a key-value store where the key
is the filename and the value is the file content.
## Requirements
- No space bloat, fixed space amplification.
- No write bloat, fixed write amplification.
## Impacted Components
pageserver
## Sparse Keyspace
In pageserver, we had assumed the keyspaces are always contiguous. For example, if the keyspace 0x0000-0xFFFF
exists in the pageserver, every single key in the key range would exist in the storage. Based on the prior
assumption, there are code that traverses the keyspace by iterating every single key.
```rust
loop{
// do something
key=key.next();
}
```
If a keyspace is very large, for example, containing `2^64` keys, this loop will take infinite time to run.
Therefore, we introduce the concept of sparse keyspace in this RFC. For a sparse keyspace, not every key would
exist in the key range. Developers should not attempt to iterate every single key in the keyspace. Instead,
they should fetch all the layer files in the key range, and then do a merge of them.
In aux file v2, we store aux files within the sparse keyspace of the prefix `AUX_KEY_PREFIX`.
## AUX v2 Keyspace and Key Mapping
Pageserver uses fixed-size keys. The key is 128b. In order to store files of arbitrary filenames into the
keyspace, we assign a predetermined prefix based on the directory storing the aux file, and use the FNV hash
of the filename for the rest bits of the key. The encoding scheme is defined in `encode_aux_file_key`.
For example, `pg_logical/mappings/test1` will be encoded as:
```
62 0000 01 01 7F8B83D94F7081693471ABF91C
^ aux prefix
^ assigned prefix of pg_logical/
^ assigned prefix of mappings/
^ 13B FNV hash of test1
^ not used due to key representation
```
The prefixes of the directories should be assigned every time we add a new type of aux file into the storage within `aux_file.rs`. For all directories without an assigned prefix, it will be put into the `0xFFFF` keyspace.
Note that inside pageserver, there are two representations of the keys: the 18B full key representation
and the 16B compact key representation. For the 18B representation, some fields have restricted ranges
of values. Therefore, the aux keys only use the 16B compact portion of the full key.
It is possible that two files get mapped to the same key due to hash collision. Therefore, the value of
each of the aux key is an array that contains all filenames and file content that should be stored in
this key.
We use `Value::Image` to store the aux keys. Therefore, page reconstruction works in the same way as before,
and we do not need addition code to support reconstructing the value. We simply get the latest image from
the storage.
## Inbound Logical Replication Key Mapping
For inbound logical replication, Postgres needs the `replorigin_checkpoint` file to store the data.
This file not directly stored in the pageserver using the aux v2 mechanism. It is constructed during
generating the basebackup by scanning the `REPL_ORIGIN_KEY_PREFIX` keyspace.
## Sparse Keyspace Read Path
There are two places we need to read the aux files from the pageserver:
* On the write path, when the compute node adds an aux file to the pageserver, we will retrieve the key from the storage, append the file to the hashed key, and write it back. The current `get` API already supports that.
* We use the vectored get API to retrieve all aux files during generating the basebackup. Because we need to scan a sparse keyspace, we slightly modified the vectored get path. The vectorized API will attempt to retrieve every single key within the requested key range, and therefore, we modified it in a way that keys within `NON_INHERITED_SPARSE_RANGE` will not trigger missing key error.
## Compaction and Image Layer Generation
With the add of sparse keyspaces, we also modified the compaction code to accommodate the fact that sparse keyspaces do not have every single key stored in the storage.
* L0 compaction: we modified the hole computation code so that it can handle sparse keyspaces when computing holes.
* Image layer creation: instead of calling `key.next()` and getting/reconstructing images for every single key, we use the vectored get API to scan all keys in the keyspace at a given LSN. Image layers are only created if there are too many delta layers between the latest LSN and the last image layer we generated for sparse keyspaces. The created image layer always cover the full aux key range for now, and could be optimized later.
## Migration
We decided not to make the new aux storage strategy (v1) compatible with the original one (v1). One feasible way of doing a seamless migration is to store new data in aux v2 while old data in aux v1, but this complicates file deletions. We want all users to start with a clean state with no aux files in the storage, and therefore, we need to do manual migrations for users using aux v1 by using the [migration script](https://github.com/neondatabase/aux_v2_migration).
During the period of migration, we store the aux policy in the `index_part.json` file. When a tenant is attached
with no policy set, the pageserver will scan the aux file keyspaces to identify the current aux policy being used (v1 or v2).
If a timeline has aux v1 files stored, it will use aux file policy v1 unless we do a manual migration for them. Otherwise, the default aux file policy for new timelines is aux v2. Users enrolled in logical replication before we set aux v2 as default use aux v1 policy. Users who tried setting up inbound replication (which was not supported at that time) may also create some file entries in aux v1 store, even if they did not enroll in the logical replication testing program.
The code for aux v2 migration is in https://github.com/neondatabase/aux_v2_migration. The toolkit scans all projects with logical replication enabled. For all these projects, it put the computes into maintenance mode (suspend all of then), call the migration API to switch the aux file policy on the pageserver (which drops all replication states), and restart all the computes.
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.