## Problem
Retagging container images and pushing container images taken from one
registry to another is very tangled up with artifact building and not
separated by component. This makes not building compute for storage
releases and vice versa pretty tricky. To enable that, I want to clean
up retagging and pushing of container images and then continue on making
the pipelines for releases leaner by not building unnecessary things.
## Summary of changes
- Add a reusable workflow that can push to ACR, ECR and Docker Hub,
while being very flexible in terms of source and target images. This
allows for retagging and pushing images between container registries.
- Stop pushing images to registries aside of docker hub in the jobs that
build the images
- Split image pushing into 4 different jobs (not mentioning special
cases):
- neon-dev
- neon-prod
- compute-dev
- compute-prod
## TODO
- Consider also using this for `pin-build-tools-image`, as it's
basically another instance of the same thing.
## Known limitations
- The ECR part of this workflow supports authenticating to multiple AWS
accounts and therefore multiple ECR endpoints, but the ACR part only
supports one Azure Account. If someone with more knowledge on Azure can
tell me whether an equivalent to
https://github.com/aws-actions/amazon-ecr-login?tab=readme-ov-file#login-to-ecr-on-multiple-aws-accounts
is easily possible, that'd be great.
- The `image_map` input is a bit complex. It expects something along the
lines of
```
{
"docker.io/neondatabase/compute-node-v14:13196061314": [
"docker.io/neondatabase/compute-node-v14:13196061314",
"369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v14:13196061314",
"neoneastus2.azurecr.io/neondatabase/compute-node-v14:13196061314"
],
"docker.io/neondatabase/compute-node-v15:13196061314": [
"docker.io/neondatabase/compute-node-v15:13196061314",
"369495373322.dkr.ecr.eu-central-1.amazonaws.com/compute-node-v15:13196061314",
"neoneastus2.azurecr.io/neondatabase/compute-node-v15:13196061314"
]
}
```
to map from source to target image. We have a small python step to
generate this map for the 4 main image pushing jobs. The concrete
example is taken from
https://github.com/neondatabase/neon/actions/runs/13196061314/job/36838584098?pr=10613#step:3:6
and shortened to two images.
## Problem
We respect `skip_pg_catalog_updates` at the initial start, but ignore at
the follow-up `/configure`. Yet, it's used for storage->cplane->compute
notify requests after migrations, shard split, etc. So every time we get
them, applying the new config takes much longer than it should because
we go through Postgres catalog checks. Cplane sets this flag, when it
does serves notify attach call
9068c7d743
Related to `inc-403`, for example
## Summary of changes
Look at `skip_pg_catalog_updates` in `compute.reconfigure()`
## Problem
PRs created by external contributors, in some cases might list failed
jobs
- `Trigger E2E Tests / cancel-previous-e2e-tests`
- `Trigger E2E Tests / tag`
They don't block the merge, and tests in fact pass (their counterparts
in internal PR), but because jobs are triggered from an external PR (and
not from the corresponding internal one) they still present as red
marks.
For example https://github.com/neondatabase/neon/pull/10778
## Summary of changes
- Check permissions before triggering e2e tests
## Problem
L0 compaction frequently gets starved out by other background tasks and
image/GC compaction. L0 compaction must be responsive to keep read
amplification under control.
Touches #10694.
Resolves#10689.
## Summary of changes
Use a separate semaphore for the L0-only compaction pass.
* Add a `CONCURRENT_L0_COMPACTION_TASKS` semaphore and
`BackgroundLoopKind::L0Compaction`.
* Add a setting `compaction_l0_semaphore` (default off via
`compaction_l0_first`).
* Use the L0 semaphore when doing an `OnlyL0Compaction` pass.
* Use the background semaphore when doing a regular compaction pass
(which includes an initial L0 pass).
* While waiting for the background semaphore, yield for L0 compaction if
triggered.
* Add `CompactFlags::NoYield` to disable L0 yielding, and set it for the
HTTP API route.
* Remove the old `use_compaction_semaphore` setting and
compaction-scoped semaphore.
* Remove the warning when waiting for a semaphore; it's noisy and we
have metrics.
## Problem
With current approach for the base images in `Dockerfiles`, it's hard to
track when image is updated, and as they are base, than update will
invalidate all the layers, as base image changed.
That also becomes more complicated, as we have a number of runners, and
they may have different images with the tag `bookworm-slim`, so that
will lead to invalidate caches, when image build on one runner will be
used on another runners.
To fix that problem, we could pin our base images to the specific sha,
and that not only align images across runners, and also will allow us to
have reproducible build and don't depend on any spontaneous changes in
upstream.
Fix: https://github.com/neondatabase/cloud/issues/24084
## Summary of changes
Beside of the main goal, that PR also included some small changes around
Dockerfiles:
1. Main change: use `SHA` for `bookworm-slim` and `bullseye-slim` debian
images
2. For the layers requiring `curl` we could add `curl` and `unzip` to
the `build-deps` image, and use it as a base image for all the steps,
removing extra dependency on `alpine/curl`
3. added `retry-on-host-error=on` for the `wgetrc` as it happened to me:
fail to resolve hostname
## Problem
Previously, Workload was reconfiguring the compute before each run of
writes, which was meant to be a no-op when nothing changed, but was
actually writing extra data due to an issue being fixed in
https://github.com/neondatabase/neon/pull/10696.
The row counts in tests were too low in some cases, these tests were
only working because of those extra writes that shouldn't have been
happening, and moreover were relying on checkpoints happening.
## Summary of changes
- Only reconfigure compute if the attached pageserver actually changed.
If pageserver is set to None, that means controller is managing
everything, so never reconfigure compute.
- Update tests that wrote too few rows.
---------
Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>
The old values assumed that you have at least about 18 GB of RAM
available (shared_buffers=10GB and maintenance_work_mem=8GB). That's a
lot when testing locally. Make it configurable, and make the default
assumption much smaller: 256 MB.
This is nice for local testing, but it's also in preparation for
starting to use VMs to run these jobs. When launched in a VM, the
control plane can set these env variables according to the max size of
the VM.
Also change the formula for how RAM is distributed: use 10% of RAM for
shared_buffers, and 70% for maintenance_work_mem. That leaves a good
amount for misc. other stuff and the OS. A very large shared_buffers
setting won't typically help with bulk loading. It won't help with the
network and I/O of processing all the tables, unless maybe if the whole
database fits in shared buffers, but even then it's not much faster than
using local disk. Bulk loading is all sequential I/O. It also won't help
much with index creation, which is also sequential I/O. A large
maintenance_work_mem can be quite useful, however, so that's where we
put most of the RAM.
Resolves: https://github.com/neondatabase/neon/issues/5734
When we query pageserver and it's unavailable after some retries,
postgres sigabrt's. This is intended behavior so I've added tests
checking it
## Problem
When image compaction yields for L0 compaction, it may not immediately
schedule L0 compaction, because it just goes on to compact the next
pending timeline.
Touches #10694.
Requires #10744.
## Summary of changes
Extend `CompactionOutcome` with `YieldForL0` and `Skipped` variants, and
immediately schedule an L0 compaction pass in the `YieldForL0` case.
## Problem
Image compaction can starve out L0 compaction if a tenant has several
timelines with L0 debt.
Touches #10694.
Requires #10740.
## Summary of changes
* Add an initial L0 compaction pass, in order of L0 count.
* Add a tenant option `compaction_l0_first` to control the L0 pass
(disabled by default).
* Add `CompactFlags::OnlyL0Compaction` to run an L0-only compaction
pass.
* Clean up the compaction iteration logic.
A later PR will use separate semaphores for the L0 and image compaction
passes to avoid cross-tenant L0 starvation. That PR will also make image
compaction yield if _any_ of the tenant's timelines have pending L0
compaction to further avoid starvation.
Avoids compiling the crate and its dependencies into binaries that don't
need them. Shrinks the compute_ctl binary from about 31MB to 28MB in the
release-line-debug-size-lto profile.
Run "apt install" first, and only then COPY the files from the
intermediary build layers to the final image. This way, if you modify
any of the sources that trigger e.g. rebuilding compute_ctl, the "apt
install" step can still be cached.
## Problem
In https://github.com/neondatabase/neon/pull/10411 fill logic changes
such that it benefits us to test it with & without AZs set up. I didn't
extend the test inline in that PR because there were overlapping test
changes in flight to add `num_az` parameter.
## Summary of changes
- Parameterise test on AZ count (1 or 2)
- When AZ count is 2, use a different balance check that just asserts
the _tenants_ are balanced (since AZ affinity is chosen on a per-tenant
basis)
The compute_ctl HTTP server has the following purposes:
- Allow management via the control plane
- Provide an endpoint for scaping metrics
- Provide APIs for compute internal clients
- Neon Postgres extension for installing remote extensions
- local_proxy for installing extensions and adding grants
The first two purposes require the HTTP server to be available outside
the compute.
The Neon threat model is a bad actor within our internal network. We
need to reduce the surface area of attack. By exposing unnecessary
unauthenticated HTTP endpoints to the internal network, we increase the
surface area of attack. For endpoints described in the third bullet
point, we can just run an extra HTTP server, which is only bound to the
loopback interface since all consumers of those endpoints are within the
compute.
This changes the default value of the `timeline_offloading` pageserver
and tenant configs to true, now that offloading has been rolled out
without problems.
There is also a small fix in the tenant config merge function, where we
applied the `lazy_slru_download` value instead of `timeline_offloading`.
Related issue: https://github.com/neondatabase/cloud/issues/21353
# Problem
Say we have a batch of 10 responses to send out.
Then, even with
- #10728
we've still only called observe_execution_end_flush_start for the first
3 responses.
The remaining 7 response timers are still ticking.
When compute now closes the connection, the waiting flush fails with an
error and we `drop()` the remaining 7 responses' smgr op timers. The
`impl Drop for SmgrOpTimer` will observe an execution time that includes
the flush time.
In practice, this is supsected to produce the `+Inf` observations in the
smgr op latency histogram we've seen since the introduction of
pipelining, even after shipping #10728.
refs:
- fixup of https://github.com/neondatabase/neon/pull/10042
- fixup of https://github.com/neondatabase/neon/pull/10728
- fixes https://github.com/neondatabase/neon/issues/10754
## Problem
These tests can encounter a bug in the pageserver read path (#9185)
which occurs under the very specific circumstances that the tests
create, but is very unlikely to happen in the field.
We will fix the bug, but in the meantime let's un-flake the tests.
Related: https://github.com/neondatabase/neon/issues/10720
## Summary of changes
- Permit "could not find data for key" errors in tests affected by #9185
I don't see the point of static linking, postgres itself and many of the
extensions are already built dynamically.
One reason for the change is that I'm working on bigger changes to start
using systemd in the compute, and as part of that I wanted to add the
--with-systemd configure option to pgbouncer, and there doesn't seem to
be a static version of libsystemd (at least not on Debian).
## Problem
/database_schema endpoint returns incomplete output from `pg_dump`
## Summary of changes
The Tokio process was not used properly. The returned stream does not
include `process::Child`, and the process is scheduled to be killed
immediately after the `get_database_schema` call when `cmd` goes out of
scope.
The solution in this PR is to return a special Stream implementation
that retains `process::Child`.
compute_ctl is mostly written in synchronous fashion, intended to run in
a single thread. However various parts had become async, and they
launched their own tokio runtimes to run the async code. For example, VM
monitor ran in its own multi-threaded runtime, and apply_spec_sql()
launched another multi-threaded runtime to run the per-database SQL
commands in parallel. In addition to that, a few places used a
current-thread runtime to run async code in the main thread, or launched
a current-thread runtime in a *different* thread to run background
tasks.
Unify the runtimes so that there is only one tokio runtime. It's created
very early at process startup, and the main thread "enters" the runtime,
so that it's always available for tokio::spawn() and runtime.block_on()
calls. All code that needs to run async code uses the same runtime.
The main thread still mostly runs in a synchronous fashion. When it
needs to run async code, it uses rt.block_on().
Spawn fewer additional threads, prefer to spawn tokio tasks instead.
Convert some code that ran synchronously in background threads into
async. I didn't go all the way, though, some background threads are
still spawned.
## Problem
close https://github.com/neondatabase/neon/issues/10213
`range_search` only returns the top-most layers that may satisfy the
search, so it doesn't include all layers that might be accessed (the
user needs to recursively call this function). We need to retrieve the
full layer map and find overlaps in order to have a correct heuristics
of the job split.
## Summary of changes
Retrieve all layers and find overlaps instead of doing `range_search`.
The patch also reduces the time holding the layer map read guard.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
The compaction loop currently runs periodically, which can cause it to
wait for up to 20 seconds before starting L0 compaction by default.
Also, when we later separate the semaphores for L0 compaction and image
compaction, we want to give up waiting for the image compaction
semaphore if L0 compaction is needed on any timeline.
Touches #10694.
## Summary of changes
Notify the compaction loop when an L0 flush (on any timeline) exceeds
`compaction_threshold`.
Also do some opportunistic cleanups in the area.
Building the compute rust binaries from scratch is pretty slow, it takes
between 4-15 minutes on my laptop, depending on which compiler flags and
other tricks I use. A cache mount allows caching the dependencies and
incremental builds, which speeds up rebuilding significantly when you
only makes a small change in a source file.
The awscli was downloaded at the last stages of the overall compute
image build, which meant that if you modified any part of the build, it
would trigger a re-download of the awscli. That's a bit annoying when
developing locally and rebuilding the compute image repeatedly. Move it
to a separate layer, to cache separately and to avoid the spurious
rebuilds.
The compute_id will be used when verifying claims sent by the control
plane.
Signed-off-by: Tristan Partin <tristan@neon.tech>
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
Make it possible to dump WAL records in format recognised by walredo
process.
Intended usage:
```
pg_waldump -R 1663/5/16396 -B 771727 000000010000000100000034 --save-records=/tmp/walredo.records
postgres --wal-redo < /tmp/walredo.records > /tmp/page.img
```
## Summary of changes
Related Postgres PRs:
https://github.com/neondatabase/postgres/pull/575https://github.com/neondatabase/postgres/pull/572
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
helps investigate https://github.com/neondatabase/neon/issues/10482
## Summary of changes
In debug mode and testing mode, we will record all files visited by a
read operation, and print it out when it errors.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Reduce the read amplification when doing `repartition`.
## Summary of changes
Compute the L0-L1 boundary LSN and do repartition here.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
The upgrade test for pg_jwt does not work correctly.
## Summary of changes
The script for the upgrade test is modified to use the database
`contrib_regression`.
# Problem
walredo shutdown is done in the compaction task. Let's move it to tenant
housekeeping.
# Summary of changes
* Rename "ingest housekeeping" to "tenant housekeeping".
* Move walredo shutdown into tenant housekeeping.
* Add a constant `WALREDO_IDLE_TIMEOUT` set to 3 minutes (previously 10x
compaction threshold).
This patch does a bunch of superficial cleanups of `tenant::tasks` to
avoid noise in subsequent PRs. There are no functional changes.
PS: enable "hide whitespace" when reviewing, due to the unindentation of
large async blocks.
# Problem
Before this PR, the `shard_id` field was missing when page_service logs
a reconstruct error.
This was caused by batching-related refactorings.
Example from staging:
```
2025-01-30T07:10:04.346022Z ERROR page_service_conn_main{peer_addr=...}:process_query{tenant_id=... timeline_id=...}:handle_pagerequests:request:handle_get_page_at_lsn_request_batched{req_lsn=FFFFFFFF/FFFFFFFF}: error reading relation or page version: Read error: whole vectored get request failed because one or more of the requested keys were missing: could not find data for key ...
```
# Changes
Delay creation of the handler-specific span until after shard routing
This also avoids the need for the record() call in the pagestream hot
path.
# Testing
Manual testing with a failpoint that is part of this PR's history but
will be squashed away.
# Refs
- fixes https://github.com/neondatabase/neon/issues/10599
## Problem
This test would sometimes fail its assertion that a timeline does not
revert to active once archived. That's because it was using the
in-memory offload state, not the persistent state, so this was sometimes
lost across a pageserver restart.
Closes: https://github.com/neondatabase/neon/issues/10389
## Summary of changes
- When reading offload status, read from pageserver API _and_ remote
storage before considering the timeline offloaded
I plan to use this when launching a fast_import job in a VM. There's
currently no good way for an executable running in a NeonVM to exit
gracefully and have the VM shut down. The inittab we use always respawns
the payload command. The idea is that the control plane can use
"fast_import ... && poweroff" as the command, so that when fast_import
completes successfully, the VM is terminated, and the k8s Pod and
VirtualMachine object are marked as completed successfully.
I'm working on bigger changes to how we launch VMs, and will try to come
up with a nicer system for that, but in the meanwhile, this quick hack
allows us to proceed with using VMs for one-off jobs like fast_import.
## Problem
There are a couple of log warnings tripping up
`test_timeline_archival_chaos`
- `[stopping left-over name="timeline_delete"
tenant_shard_id=2d526292b67dac0e6425266d7079c253
timeline_id=Some(44ba36bfdee5023672c93778985facd9)
kind=TimelineDeletionWorker\n')](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10672/13161357302/index.html#/testresult/716b997bb1d8a021)`
- `ignoring attempt to restart exited flush_loop
503d8f401d8887cfaae873040a6cc193/d5eed0673ba37d8992f7ec411363a7e3\n')`
Related: https://github.com/neondatabase/neon/issues/10389
## Summary of changes
- Downgrade the 'ignoring attempt to restart' to info -- there's nothing
in the design that forbids this happening, i.e. someone calling
maybe_spawn_flush_loop concurrently with shutdown()
- Prevent timeline deletion tasks outliving tenants by carrying a
gateguard. This logically makes sense because the deletion process does
call into Tenant to update manifests.
## Problem
L0 compaction can get starved by other background tasks. It needs to be
responsive to avoid read amp blowing up during heavy write workloads.
Touches #10694.
## Summary of changes
Add a separate semaphore for compaction, configurable via
`use_compaction_semaphore` (disabled by default). This is primarily for
testing in staging; it needs further work (in particular to split
image/L0 compaction jobs) before it can be enabled.
## Problem
Endpoint kept running while timeline was deleted, causing forbidden
warnings on the pageserver when the tenant is not found.
## Summary of changes
- Explicitly stop the endpoint before the end of the test, so that it
isn't trying to talk to the pageserver in the background while things
are torn down
## Problem
This is tech debt. While we introduced generations for tenants, some
legacy situations without generations needed to delete things inline
(async operation) instead of enqueing them (sync operation).
## Summary of changes
- Remove the async code, replace calls with the sync variant, and assert
that the generation is always set