## Problem
For the interpreted protocol, the pageserver is not returning the correct LSN
in replies to keep-alive requests, because the interpreted protocol arm
was not updating `last_rec_lsn`.
## Summary of changes
* Return the correct LSN in keep-alive responses (a sketch follows below)
* Fix shard field in wal sender traces
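A minimal sketch of the shape of the fix, with hypothetical stand-in types (`Lsn`, `InterpretedBatch`) rather than the real wal_receiver ones:

```rust
// Hypothetical stand-ins for the wal_receiver types; a sketch only.
#[derive(Clone, Copy)]
struct Lsn(u64);

struct InterpretedBatch {
    next_record_lsn: Option<Lsn>,
}

/// The interpreted-protocol arm must advance `last_rec_lsn` just like the
/// raw-WAL arm, so keep-alive replies report the real ingest position.
fn apply_interpreted_batch(batch: &InterpretedBatch, last_rec_lsn: &mut Lsn) {
    if let Some(lsn) = batch.next_record_lsn {
        *last_rec_lsn = lsn;
    }
}

fn main() {
    let mut last_rec_lsn = Lsn(0);
    let batch = InterpretedBatch { next_record_lsn: Some(Lsn(42)) };
    apply_interpreted_batch(&batch, &mut last_rec_lsn);
    // The keep-alive reply now carries the tracked LSN rather than a stale one.
    assert_eq!(last_rec_lsn.0, 42);
}
```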
We keep the compiler up to date, pointing to the latest release, as many
other projects in the Rust ecosystem do.
[Release notes](https://releases.rs/docs/1.83.0/).
Also update `cargo-hakari`, `cargo-deny`, `cargo-hack` and
`cargo-nextest` to their latest versions.
Prior update was in #9445.
## Problem
We currently see elevated levels of errors for GetBlob requests. This is
because 404 and 304 are counted as errors for metric reporting.
## Summary of Changes
Bring the implementation in line with the S3 client and treat 404 and
304 responses as ok for metric purposes.
Related: https://github.com/neondatabase/cloud/issues/20666
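A minimal sketch of the status mapping, using a hypothetical helper rather than the actual client code:

```rust
// A minimal sketch (hypothetical helper, not the actual client code): classify
// HTTP statuses for metrics so expected responses aren't counted as errors.
fn metric_outcome(http_status: u16) -> &'static str {
    match http_status {
        // 304 (not modified) and 404 (not found) are expected outcomes,
        // so count them as "ok", matching the S3 client's behaviour.
        200..=299 | 304 | 404 => "ok",
        _ => "error",
    }
}

fn main() {
    assert_eq!(metric_outcome(404), "ok");
    assert_eq!(metric_outcome(304), "ok");
    assert_eq!(metric_outcome(503), "error");
}
```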
## Problem
For cancellation, a connection is held open for the duration of all the
cancel checks.
## Summary of changes
Spawn cancellation checks in the background and close the connection
immediately. Use `task_tracker` for the cancellation checks.
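A rough sketch of the approach, assuming `tokio` and `tokio-util`'s `TaskTracker`; this is not proxy's actual code:

```rust
// Sketch of the approach, assuming tokio and tokio-util; not proxy's actual code.
use tokio_util::task::TaskTracker;

async fn run_cancel_checks(session_id: u64) {
    // ... verify the cancel request and notify the compute node ...
    let _ = session_id;
}

#[tokio::main]
async fn main() {
    let tracker = TaskTracker::new();

    // Spawn the checks in the background and return immediately, so the
    // client connection can be closed without waiting for them.
    tracker.spawn(run_cancel_checks(42));

    // On shutdown, wait for any in-flight cancellation checks to finish.
    tracker.close();
    tracker.wait().await;
}
```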
# Problem
VM (visibility map) pages are stored and managed as any regular relation
page, in the VM fork of the main relation. They are also sharded like
other pages. Regular WAL writes to the VM pages (typically performed by
vacuum) are routed to the correct shard as usual. However, VM pages are
also updated via `ClearVmBits` metadata records emitted when main
relation pages are updated. These metadata records were sent to all
shards, like other metadata records. This had the following effects:
* On shards responsible for VM pages, the `ClearVmBits` applies as
expected.
* On shard 0, which knows about the VM relation and its size but doesn't
necessarily have any VM pages, the `ClearVmBits` writes may have been
applied without also having applied the explicit WAL writes to VM pages.
* If VM pages are spread across multiple shards (unlikely with 256MB
stripe size), all shards may have applied `ClearVmBits` if the pages
fall within their local view of the relation size, even for pages they
do not own.
* On other shards, this caused a relation size cache miss and a DbDir
and RelDir lookup before dropping the `ClearVmBits`. With many
relations, this could cause significant CPU overhead.
This is not believed to be a correctness problem, but this will be
verified in #9914.
Resolves #9855.
# Changes
Route `ClearVmBits` metadata records only to the shards responsible for
the VM pages.
Verification of the current VM handling and cleanup of incomplete VM
pages on shard 0 (and potentially elsewhere) is left as follow-up work.
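A simplified illustration of the routing idea, with made-up stripe math standing in for the real shard-ownership check on the VM fork page key:

```rust
// Simplified illustration with made-up stripe math; the real change consults
// the shard identity's ownership check for the VM fork page key.
#[derive(Clone, Copy, PartialEq, Debug)]
struct ShardNumber(u32);

struct ShardIdentity {
    number: ShardNumber,
    count: u32,
    stripe_size_pages: u32,
}

impl ShardIdentity {
    /// Would this shard store the given block of the VM fork?
    fn owns_vm_block(&self, vm_blkno: u32) -> bool {
        if self.count <= 1 {
            return true;
        }
        let stripe = vm_blkno / self.stripe_size_pages;
        ShardNumber(stripe % self.count) == self.number
    }
}

fn main() {
    let shard = ShardIdentity { number: ShardNumber(1), count: 8, stripe_size_pages: 32_768 };
    // Only ship (or apply) ClearVmBits for VM blocks this shard actually owns;
    // other shards never see the record and skip the relation lookups entirely.
    println!("owns VM block 40000: {}", shard.owns_vm_block(40_000));
}
```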
## Problem
close https://github.com/neondatabase/neon/issues/9859
## Summary of changes
Ensure that the deletion queue gets fully flushed (i.e., the deletion
lists get applied) during a graceful shutdown.
It is still possible that an incomplete shutdown would leave deletion
lists behind and cause a race upon the next startup, but we assume this
is unlikely to happen, and even if it does, the pageserver should
already be in a tainted state and the tenant should be attached
elsewhere with a new generation number.
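A minimal sketch of where the flush fits into shutdown; the type and method names are stand-ins, not the pageserver's real API:

```rust
// Sketch only; the struct and method names are stand-ins for the pageserver's
// deletion queue client, not its real API.
struct DeletionQueueClient;

impl DeletionQueueClient {
    /// Push any accumulated deletion lists through validation and execution.
    async fn flush_execute(&self) {
        // ... drain pending deletion lists here ...
    }
}

async fn graceful_shutdown(deletion_queue: &DeletionQueueClient) {
    // Flush before tearing down, so no deletion lists are left behind to
    // race with the next startup.
    deletion_queue.flush_execute().await;
    // ... rest of shutdown ...
}

#[tokio::main]
async fn main() {
    graceful_shutdown(&DeletionQueueClient).await;
}
```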
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
* Promote two logs about mpsc send errors to error level. The channels
are unbounded, so there shouldn't be any send errors.
* Fix one multiline log from `anyhow::Error`: use `Debug` instead of
`Display`.
## Problem
When ingesting implicit `ClearVmBits` operations, we silently drop the
writes if the relation or page is unknown. There are implicit
assumptions around VM pages wrt. explicit/implicit updates, sharding,
and relation sizes, which can possibly drop writes incorrectly. Adding a
few metrics will allow us to investigate further and tighten up the
logic.
Touches #9855.
## Summary of changes
Add a `pageserver_wal_ingest_clear_vm_bits_unknown` metric to record
dropped `ClearVmBits` writes.
Also add comments clarifying the behavior of relation sizes on non-zero
shards.
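A sketch of the metric using the prometheus crate directly; the `kind` label scheme here is an assumption, not necessarily the real metric's shape:

```rust
// Sketch using the prometheus crate directly; the label scheme ("relation"
// vs "page") is an assumption, not necessarily the real metric's shape.
use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounterVec};

static CLEAR_VM_BITS_UNKNOWN: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "pageserver_wal_ingest_clear_vm_bits_unknown",
        "ClearVmBits writes dropped because the VM relation or page is unknown",
        &["kind"]
    )
    .unwrap()
});

fn main() {
    // Unknown VM relation on this shard: drop the write, but count it.
    CLEAR_VM_BITS_UNKNOWN.with_label_values(&["relation"]).inc();
    // Page beyond the locally known relation size: likewise.
    CLEAR_VM_BITS_UNKNOWN.with_label_values(&["page"]).inc();
}
```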
The valid-layer assumption is a necessary condition for a layer map to be
valid. It is a stronger check imposed by gc-compaction than the actual
valid layer map definition: the system can work as long as there are no
overlapping layers. Therefore, we degrade that check into a warning.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We don't have any observability for the relation size cache. We have
seen cache misses cause significant performance impact with high
relation counts.
Touches #9855.
## Summary of changes
Adds the following metrics (see the sketch after the list):
* `pageserver_relsize_cache_entries`
* `pageserver_relsize_cache_hits`
* `pageserver_relsize_cache_misses`
* `pageserver_relsize_cache_misses_old`
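A rough sketch of how these counters could map onto cache lookups, under the assumption that `misses_old` counts entries that exist but were cached at a newer LSN than the request:

```rust
// Rough sketch only. Assumption: a cached size is usable if it was computed at
// or before the requested LSN, and "misses_old" counts entries that exist but
// are too new for the request.
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};

static HITS: AtomicU64 = AtomicU64::new(0);
static MISSES: AtomicU64 = AtomicU64::new(0);
static MISSES_OLD: AtomicU64 = AtomicU64::new(0);

/// relation id -> (LSN the size was cached at, size in blocks)
struct RelSizeCache {
    map: HashMap<u32, (u64, u32)>,
}

impl RelSizeCache {
    fn get(&self, rel: u32, lsn: u64) -> Option<u32> {
        match self.map.get(&rel) {
            Some((cached_at, size)) if *cached_at <= lsn => {
                HITS.fetch_add(1, Ordering::Relaxed);
                Some(*size)
            }
            Some(_) => {
                MISSES_OLD.fetch_add(1, Ordering::Relaxed);
                None
            }
            None => {
                MISSES.fetch_add(1, Ordering::Relaxed);
                None
            }
        }
    }
}

fn main() {
    let cache = RelSizeCache { map: HashMap::from([(1, (100, 8))]) };
    assert_eq!(cache.get(1, 200), Some(8)); // hit
    assert_eq!(cache.get(2, 200), None); // miss
    assert_eq!(HITS.load(Ordering::Relaxed), 1);
}
```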
## Problem
https://github.com/neondatabase/neon/pull/9746 lifted decoding and
interpretation of WAL to the safekeeper.
This reduced the ingested amount on the pageservers by around 10x for a
tenant with 8 shards, but doubled
the ingested amount for single sharded tenants.
Also, https://github.com/neondatabase/neon/pull/9746 uses bincode which
doesn't support schema evolution.
Technically the schema can be evolved, but it's very cumbersome.
## Summary of changes
This patch set addresses both problems by adding protobuf support for
the interpreted wal records and adding compression support. Compressed
protobuf reduced the ingested amount by 100x on the 32 shards
`test_sharded_ingest` case (compared to non-interpreted proto). For the
1 shard case the reduction is 5x.
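For illustration, a hypothetical message shape (the actual `.proto` definitions differ) encoded with prost and compressed with zstd:

```rust
// Hypothetical message shape for illustration; the actual .proto definitions
// for interpreted WAL records differ. Uses prost for encoding and zstd for
// compression.
use prost::Message;

#[derive(Clone, PartialEq, Message)]
struct InterpretedWalRecordsProto {
    /// Serialized records destined for this shard.
    #[prost(bytes = "vec", repeated, tag = "1")]
    records: Vec<Vec<u8>>,
    /// LSN to advance to after applying the batch.
    #[prost(uint64, optional, tag = "2")]
    next_record_lsn: Option<u64>,
}

fn encode_batch(batch: &InterpretedWalRecordsProto) -> std::io::Result<Vec<u8>> {
    let proto_bytes = batch.encode_to_vec();
    // Compressing the protobuf payload is what drives the large reduction in
    // ingested bytes on sharded tenants.
    zstd::encode_all(proto_bytes.as_slice(), 3)
}

fn main() -> std::io::Result<()> {
    let batch = InterpretedWalRecordsProto {
        records: vec![b"one interpreted record".to_vec()],
        next_record_lsn: Some(0x1_0000_0000),
    };
    println!("wire size: {} bytes", encode_batch(&batch)?.len());
    Ok(())
}
```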
Sister change to `rust-postgres` is
[here](https://github.com/neondatabase/rust-postgres/pull/33).
## Links
Related: https://github.com/neondatabase/neon/issues/9336
Epic: https://github.com/neondatabase/neon/issues/9329
## Problem
The `pre-merge-checks` workflow relies on the build-tools image.
If changes to the `build-tools` image have been merged into the main
branch since the last CI run for a PR (with other changes to the
`build-tools`), the image will be rebuilt during the merge queue run.
Otherwise, cached images are used.
Rebuilding the image adds approximately 10 minutes on x86-64 and 20
minutes on arm64 to the process.
## Summary of changes
- parametrise `build-build-tools-image` job with arch and Debian version
- Run `pre-merge-checks` only on Debian 12 x86-64 image
## Problem
The ingest benchmark tests project migration to Neon, involving these steps:
- COPY relation data
- create indexes
- create constraints
Previously we used only 4 COPY jobs, 4 create-index jobs, and 7
maintenance workers. After increasing `effective_io_concurrency` on
compute, we see that we can sustain more parallelism in the ingest bench.
## Summary of changes
Increase COPY jobs to 8, create-index jobs to 8, and maintenance workers
to 16.
## Problem
The RequestContext::span shouldn't live for the entire postgres
connection, only the handshake.
## Summary of changes
* Slight refactor to `RequestContext` to discard the span upon
handshake completion (see the sketch after this list).
* Make sure the temporary future for the handshake is dropped (not bound
to a variable)
* Runs our nightly fmt script
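A sketch of the span handling with simplified types; the real `RequestContext` carries more state:

```rust
// Sketch of the idea with simplified types: the span is only held for the
// handshake and discarded as soon as it completes.
use tracing::{info_span, Span};

struct RequestContext {
    span: Option<Span>,
}

impl RequestContext {
    fn new() -> Self {
        Self { span: Some(info_span!("handshake")) }
    }

    /// Called once the handshake finishes; the span no longer lives for the
    /// rest of the postgres connection.
    fn handshake_done(&mut self) {
        self.span.take();
    }
}

fn main() {
    let mut ctx = RequestContext::new();
    // ... drive the handshake future inside ctx.span, then drop that future ...
    ctx.handshake_done();
    assert!(ctx.span.is_none());
}
```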
Before, we hardcoded the pg_version to 140000, while the code expected
version numbers like 14. Now we use an enum, plus code from
`extension_server.rs` to auto-detect the correct version. The enum helps
when we add support for a new version: it ensures that compilation fails
if one forgets to add the version to one of the `match` locations.
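Illustrative only (not compute_ctl's actual type), showing why the enum catches a forgotten version at compile time:

```rust
// Illustrative only, not compute_ctl's actual type: an exhaustive enum means a
// newly added version must be handled in every `match` before the code compiles.
#[derive(Clone, Copy, Debug)]
enum PostgresVersion {
    V14,
    V15,
    V16,
    V17,
}

impl PostgresVersion {
    fn bindir(self) -> &'static str {
        // Adding a `V18` variant later makes this match non-exhaustive,
        // so the compiler points at every place that needs updating.
        match self {
            PostgresVersion::V14 => "v14/bin",
            PostgresVersion::V15 => "v15/bin",
            PostgresVersion::V16 => "v16/bin",
            PostgresVersion::V17 => "v17/bin",
        }
    }
}

fn main() {
    println!("{:?} -> {}", PostgresVersion::V16, PostgresVersion::V16.bindir());
}
```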
cc https://github.com/neondatabase/neon/pull/9218
## Problem
Any errors from these async blocks are unconditionally logged at error
level
even though we already handle such errors based on context.
## Summary of changes
* Log raw errors from creating and executing cplane requests at debug
level.
* Inline macro calls to retain the correct callsite.
## Problem
The vast majority of the error/warn logs from cplane are about time or
data transfer quotas exceeded or endpoint-not-found errors and not
operational errors in proxy or cplane.
## Summary of changes
* Demote cplane error replies to info level.
* Raise other errors from warn back to error.
## Problem
For any given tenant shard, pageservers receive all of the tenant's WAL
from the safekeeper.
This soft-blocks us from using larger shard counts due to bandwidth
concerns and CPU overhead of filtering
out the records.
## Summary of changes
This PR lifts the decoding and interpretation of WAL from the pageserver
into the safekeeper.
A customised PG replication protocol is used where instead of sending
raw WAL, the safekeeper sends
filtered, interpreted records. The receiver drives the protocol
selection, so, on the pageserver side, usage
of the new protocol is gated by a new pageserver config:
`wal_receiver_protocol`.
More granularly the changes are:
1. Optionally inject the protocol and shard identity into the arguments
used for starting replication
2. On the safekeeper side, implement a new wal sending primitive which
decodes and interprets records
before sending them over
3. On the pageserver side, implement the ingestion of this new
replication message type. It's very similar
to what we already have for raw wal (minus decoding and interpreting).
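An approximate sketch of the protocol gate; the config key `wal_receiver_protocol` is real, but the variant names and replication arguments below are assumptions:

```rust
// Approximate shape of the gate; the variant names and replication arguments
// here are assumptions, not the actual implementation.
#[derive(Clone, Copy, Debug, Default)]
enum WalReceiverProtocol {
    /// Existing behaviour: the safekeeper streams raw WAL.
    #[default]
    Vanilla,
    /// New behaviour: the safekeeper sends filtered, interpreted records.
    Interpreted,
}

fn start_replication_args(protocol: WalReceiverProtocol, shard: &str) -> Vec<String> {
    match protocol {
        WalReceiverProtocol::Vanilla => vec![],
        // The shard identity and format are injected into the replication
        // arguments so the safekeeper can filter and interpret per shard.
        WalReceiverProtocol::Interpreted => {
            vec!["format=interpreted".to_string(), format!("shard={shard}")]
        }
    }
}

fn main() {
    // Disabled by default; enabled per pageserver via config.
    println!("{:?}", start_replication_args(WalReceiverProtocol::default(), "0001"));
}
```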
## Notes
* This PR currently uses my [branch of
rust-postgres](https://github.com/neondatabase/rust-postgres/tree/vlad/interpreted-wal-record-replication-support)
which includes the deserialization logic for the new replication message
type. PR for that is open
[here](https://github.com/neondatabase/rust-postgres/pull/32).
* This PR contains changes for both pageservers and safekeepers. It's
safe to merge because the new protocol is disabled by default on the
pageserver side. We can gradually start enabling it in subsequent
releases.
* CI tests are running on https://github.com/neondatabase/neon/pull/9747
## Links
Related: https://github.com/neondatabase/neon/issues/9336
Epic: https://github.com/neondatabase/neon/issues/9329
## Problem
Prefetch is disabled on macOS because `posix_fadvise` is not available.
But Neon's prefetch does not use this function, and for testing on macOS
it is very convenient to have prefetch available.
## Summary of changes
Define `USE_PREFETCH` in Makefile.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
We have a couple of CI workflows that still run on Debian Bullseye, and
the default Debian version in images is Bullseye as well (we explicitly
set Bookworm when building).
## Summary of changes
- Run `pgbench-pgvector` on Bookworm (fix a couple of packages)
- Run `trigger_bench_on_ec2_machine_in_eu_central_1` on Bookworm
- Change default `DEBIAN_VERSION` in Dockerfiles to Bookworm
- Make `pinned` docker tag an alias to `pinned-bookworm`
## Problem
close https://github.com/neondatabase/neon/issues/9761
The test assumed that no new L0 layers are flushed throughout the
process, which is not true.
## Summary of changes
Fix the test case `test_compaction_l0_memory` by flushing in-memory
layers before compaction.
Signed-off-by: Alex Chi Z <chi@neon.tech>
* The futures-util crate we use was yanked. Bump it and its siblings to
new patch release.
https://github.com/rust-lang/futures-rs/releases/tag/0.3.31
* cargo-deny: Drop an unused license.
* cargo-deny: Don't warn about duplicate crate. Duplicate crates are
unavoidable and the noise just hides real warnings.
## Problem
LFC is not enabled by default in tests, but it is enabled in production.
This increases the risk of errors in production that are not caught by
the routine test workflow.
However, enabling LFC for all the tests may overload the disk on our
servers and increase the number of failures.
So, we try enabling LFC in one configuration to evaluate the possible risk.
## Summary of changes
A new environment variable, `USE_LFC`, is introduced. If it is set to
`true`, LFC is enabled by default in all tests.
In our workflow, we enable LFC for PG17, release, x86-64, and disable it
for all other combinations.
---------
Co-authored-by: Alexey Masterov <alexeymasterov@neon.tech>
Co-authored-by: a-masterov <72613290+a-masterov@users.noreply.github.com>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: Stas Kelvic <stas@neon.tech>
# Context
This PR contains PoC-level changes for a product feature that allows
onboarding large databases into Neon without going through the regular
data path.
# Changes
This internal RFC provides all the context
* https://github.com/neondatabase/cloud/pull/19799
In the language of the RFC, this PR covers
* the Importer code (`fast_import`)
* all the Pageserver changes (mgmt API changes, flow implementation,
etc)
* a basic test for the Pageserver changes
# Reviewing
As acknowledged in the RFC, the code added in this PR is not ready for
general availability.
Also, the **architecture is not to be discussed in this PR**, but in the
RFC and associated Slack channel instead.
Reviewers of this PR should take that into consideration.
The quality bar to apply during review depends on what area of the code
is being reviewed:
* Importer code (`fast_import`): practically anything goes
* Core flow (`flow.rs`):
* Malicious input data must be expected and the existing threat models
apply.
* The code must be safe to execute on *dedicated* Pageserver
instances:
* This means in particular that tenants *on other* Pageserver instances
must not be affected negatively wrt data confidentiality, integrity or
availability.
* Other code: the usual quality bar
* Pay special attention to correct use of gate guards, timeline
cancellation in all places during shutdown & migration, etc.
* Consider the broader system impact; if you find potentially
problematic interactions with Storage features that were not covered in
the RFC, bring that up during the review.
I recommend submitting three separate reviews, for the three high-level
areas with different quality bars.
# References
(Internal-only)
* refs https://github.com/neondatabase/cloud/issues/17507
* refs https://github.com/neondatabase/company_projects/issues/293
* refs https://github.com/neondatabase/company_projects/issues/309
* refs https://github.com/neondatabase/cloud/issues/20646
---------
Co-authored-by: Stas Kelvich <stas.kelvich@gmail.com>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: John Spray <john@neon.tech>
to keep it consistent with existing compute metrics.
A flux-fleet change is not needed, because it doesn't filter compute
metrics by metric name.
## Problem
Follow-up of https://github.com/neondatabase/neon/pull/9682; that patch
didn't fully address the problem: what if shutdown fails for whatever
reason and then we reattach the tenant? Then we will still remove the
future layer. The underlying problem is that the fix for #5878 gets
voided because of the generation optimizations.
Of course, we also need to ensure that deletes happen after uploads, but
note that we only schedule deletes when there are no ongoing upload
tasks, so that's fine.
## Summary of changes
* Add a test case to reproduce the behavior (by changing the original
test case to attach the same generation).
* If layer upload happens after the deletion, drain the deletion queue
before uploading.
* If blocked_deletion is enabled, directly remove it from the
blocked_deletion queue.
* Local fs backend fix to avoid race between deletion and preload.
* test_emergency_mode does not need to wait for uploads (and it's
generally not possible to wait for uploads).
* ~~Optimize deletion executor to skip validation if there are no files
to delete.~~ this doesn't work
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Follow up to https://github.com/neondatabase/neon/pull/9682, hopefully
we can detect some issues or assure ourselves that this is ready for
production.
## Summary of changes
* Add a compaction-detach-ancestor smoke test.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
The HTTP router allowlists matched both on the path and the query
string. This meant that only `/profile/cpu` would be allowed without
auth, while `/profile/cpu?format=svg` would require auth.
Follows #9764.
## Summary of changes
* Match allowlists on the URI path, rather than the entire URI (see the
sketch after this list).
* Fix the allowlist for Safekeeper to use `/profile/cpu` rather than the
old `/pprof/profile`.
* Just use a constant slice for the allowlist; it's only a handful of
items, and these handlers are not on hot paths.
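A minimal sketch of the path-only matching (the allowlist entries other than `/profile/cpu` are hypothetical):

```rust
// A minimal sketch of the fix (hypothetical allowlist entries besides
// /profile/cpu): match on the URI path so a query string can't defeat it.
const AUTH_ALLOWLIST: &[&str] = &["/metrics", "/profile/cpu"];

fn needs_auth(uri: &http::Uri) -> bool {
    // Previously the comparison used the full URI (path plus query), so
    // "/profile/cpu?format=svg" incorrectly required auth.
    !AUTH_ALLOWLIST.contains(&uri.path())
}

fn main() {
    let uri: http::Uri = "/profile/cpu?format=svg".parse().unwrap();
    assert!(!needs_auth(&uri));
}
```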
## Problem
close https://github.com/neondatabase/neon/issues/9836
Looking at the Azure SDK, the only related issue I can find is
https://github.com/azure/azure-sdk-for-rust/issues/1549. Azure uses
reqwest as the backend, so I assume there's some underlying magic
unknown to us that might have caused the hang in #9836.
The observation is:
* We didn't get an explicit out of resource HTTP error from Azure.
* The connection simply gets stuck and times out.
* But when we retry after we reach the timeout, it succeeds.
This issue is hard to pin down -- maybe something went wrong on the ABS
side, or something is wrong on our side. But we know that a retry will
usually succeed if we give up the stuck connection.
Therefore, I propose that we preempt stuck HTTP operations and actively
retry. This mitigates the problem; in the long run, we need to keep an
eye on ABS usage and see whether we can fully resolve it.
The reasoning behind this timeout mechanism: we use a much smaller
timeout than before so that we can preempt stuck operations, but a
normal listing operation can legitimately take longer than the initial
timeout if it returns a lot of keys. Therefore, after we terminate a
connection, we double the timeout, so that such requests eventually
succeed.
## Summary of changes
* Use exponential growth for ABS list timeout.
* Rather than using a fixed timeout, use a timeout that starts small and
grows (see the sketch after this list).
* Rather than exposing timeouts to the `list_streaming` caller as soon as
we see them, only do so after we have retried a few times.
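A sketch of the preempt-and-retry idea with made-up names and constants; the real timeouts and retry counts in the Azure listing code may differ:

```rust
// Sketch of the retry/preemption idea with made-up names and constants; the
// real timeouts and retry counts in the Azure listing code may differ.
use std::time::Duration;

async fn list_with_growing_timeout<F, Fut, T>(mut attempt: F) -> Result<T, &'static str>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = T>,
{
    // Start small to preempt stuck connections quickly, then double the budget
    // so a legitimately slow listing (many keys) can still complete.
    let mut budget = Duration::from_secs(30);
    for _ in 0..4 {
        match tokio::time::timeout(budget, attempt()).await {
            Ok(result) => return Ok(result),
            Err(_elapsed) => budget *= 2, // give up the stuck connection, retry
        }
    }
    // Only now surface the timeout to the caller.
    Err("listing timed out after several attempts")
}

#[tokio::main]
async fn main() {
    let keys = list_with_growing_timeout(|| async { vec!["key-1", "key-2"] }).await;
    println!("{keys:?}");
}
```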
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Along with the migration to Python 3.11, I replaced `C(str, Enum)` with
`C(StrEnum)`; one such example is the `PgVersion` enum.
This required more changes in `PgVersion` itself (before, it accepted
both `str` and `int`; after, it supports only `str`), which caused the
`test_bulk_insert` test to fail.
## Summary of changes
- `test_bulk_insert`: explicitly cast pg_version from `timeline_detail`
to str
I found the rightward drift of the `renew_jwks` function hard to review.
This PR splits out some major logic and uses early returns to make the
happy path more linear.
## Problem
We use a pretty old version of `mypy` (1.3, released 1.5 years ago),
which produces false positives for `typing.Self`.
## Summary of changes
- Bump `mypy` from 1.3 to 1.13
- Fix new warnings and errors
- Use `typing.Self` whenever we `return self`
Without adding a newline, we can end up with a conf line that looks like
the following:
```
dynamic_shared_memory_type = mmap# Managed by compute_ctl: begin
```
This leads to Postgres logging:
```
LOG:  configuration file "/var/db/postgres/compute/pgdata/postgresql.conf" contains errors; unaffected changes were applied
```
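A minimal sketch of the fix, assuming a hypothetical helper rather than compute_ctl's real config-writing code:

```rust
// A minimal sketch of the fix (hypothetical helper, not compute_ctl's code):
// terminate the existing content with a newline before appending the managed
// section, so a setting can't fuse with the marker comment.
fn append_managed_section(mut conf: String, managed: &str) -> String {
    if !conf.is_empty() && !conf.ends_with('\n') {
        conf.push('\n');
    }
    conf.push_str("# Managed by compute_ctl: begin\n");
    conf.push_str(managed);
    conf.push_str("# Managed by compute_ctl: end\n");
    conf
}

fn main() {
    let conf = append_managed_section(
        "dynamic_shared_memory_type = mmap".to_string(),
        "shared_preload_libraries = 'neon'\n",
    );
    assert!(conf.contains("mmap\n# Managed by compute_ctl: begin"));
}
```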
Signed-off-by: Tristan Partin <tristan@neon.tech>
The notifications need to be sent whenever the waiters heap changes, per
the comment in `update_status`. But if `advance` is called when there
are no waiters, or the new LSN is lower than what any waiter is waiting
for so that no one needs to be woken up, there's no need to send
notifications. This saves some CPU cycles in the common case that there
are no waiters.
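An illustrative sketch of the check, with the waiters kept in a simplified min-heap of LSNs:

```rust
// Illustrative only: decide whether `advance` needs to notify, given a min-heap
// of waited-for LSNs (types simplified).
use std::cmp::Reverse;
use std::collections::BinaryHeap;

struct Waiters {
    /// Min-heap of the LSNs that waiters are blocked on.
    heap: BinaryHeap<Reverse<u64>>,
}

impl Waiters {
    fn advance_needs_notify(&self, new_lsn: u64) -> bool {
        match self.heap.peek() {
            // No waiters at all: nothing to wake, skip the notification.
            None => false,
            // Notify only if the smallest awaited LSN has been reached.
            Some(Reverse(min_waited)) => new_lsn >= *min_waited,
        }
    }
}

fn main() {
    let mut waiters = Waiters { heap: BinaryHeap::new() };
    assert!(!waiters.advance_needs_notify(100));
    waiters.heap.push(Reverse(150));
    assert!(!waiters.advance_needs_notify(100));
    assert!(waiters.advance_needs_notify(200));
}
```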
## Problem
In https://github.com/neondatabase/neon/issues/9754 and the flakiness of
`test_readonly_node_gc`, we saw that although our logic for controlling
GC was sound, the validation of getpage requests was not, because it
could not consider LSN leases when requests arrived shortly after
restart.
Closes https://github.com/neondatabase/neon/issues/9754
## Summary of changes
This is the "Option 3" discussed verbally -- rather than holding back gc
cutoff, we waive the usual validation of request LSN if we are still
waiting for leases to be sent after startup
- When validating LSN in `wait_or_get_last_lsn`, skip the validation
relative to GC cutoff if the timeline is still in its LSN lease grace
period
- Re-enable test_readonly_node_gc
Calling unwrap on the encoder is a little overzealous. One of the errors
that can be returned by the encode function is the non-existence of
metrics for a metric family, so we should filter such instances out
beforehand. I believe this panic was caused by a race condition between
the Prometheus collector and the compute collecting the installed
extensions metric for the first time. The HTTP server is spawned on a
separate thread before we even start bringing up Postgres.
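A sketch of the guard using the prometheus crate; the compute's actual collector wiring is omitted:

```rust
// Sketch of the guard using the prometheus crate: skip metric families that
// have no samples yet instead of letting the encoder fail and unwrapping.
use prometheus::{Encoder, TextEncoder};

fn encode_metrics(registry: &prometheus::Registry) -> Result<Vec<u8>, prometheus::Error> {
    let families: Vec<_> = registry
        .gather()
        .into_iter()
        // e.g. a family registered but not yet collected, as in the race with
        // the installed-extensions metric.
        .filter(|family| !family.get_metric().is_empty())
        .collect();

    let mut buf = Vec::new();
    TextEncoder::new().encode(&families, &mut buf)?;
    Ok(buf)
}

fn main() {
    let registry = prometheus::Registry::new();
    let out = encode_metrics(&registry).expect("encoding filtered families");
    println!("{}", String::from_utf8_lossy(&out));
}
```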
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
We don't have a convenient way to gather CPU profiles from a running
binary, e.g. during production incidents or end-to-end benchmarks, nor
during microbenchmarks (particularly on macOS).
We would also like to have continuous profiling in production, likely
using [Grafana Cloud
Profiles](https://grafana.com/products/cloud/profiles-for-continuous-profiling/).
We may choose to use either eBPF profiles or pprof profiles for this
(pending testing and discussion with SREs), but pprof profiles appear
useful regardless for the reasons listed above. See
https://github.com/neondatabase/cloud/issues/14888.
This PR is intended as a proof of concept, to try it out in staging and
drive further discussions about profiling more broadly.
Touches #9534.
Touches https://github.com/neondatabase/cloud/issues/14888.
## Summary of changes
Adds an HTTP route `/profile/cpu` that takes a CPU profile and returns
it. Defaults to a 5-second pprof Protobuf profile for use with e.g.
`pprof` or Grafana Alloy, but can also emit an SVG flamegraph. Query
parameters:
* `format`: output format (`pprof` or `svg`)
* `frequency`: sampling frequency in Hz (default 100)
* `seconds`: number of seconds to profile (default 5)
Also integrates pprof profiles into Criterion benchmarks, such that
flamegraph reports can be taken with `cargo bench ... --profile-duration
<seconds>`. Output under `target/criterion/*/profile/flamegraph.svg`.
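Roughly how a profile is captured and rendered, sketched with the pprof crate (its `flamegraph` feature); the HTTP handler and pprof-format encoding are omitted:

```rust
// Minimal sketch with the pprof crate ("flamegraph" feature enabled); not the
// actual route handler.
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // ~100 Hz sampling, matching the route's default `frequency`.
    let guard = pprof::ProfilerGuard::new(100)?;

    // Profile whatever runs during this window (the `seconds` parameter).
    tokio::time::sleep(Duration::from_secs(5)).await;

    // Render the report as an SVG flamegraph (the `format=svg` case).
    let report = guard.report().build()?;
    let mut svg = Vec::new();
    report.flamegraph(&mut svg)?;
    std::fs::write("flamegraph.svg", svg)?;
    Ok(())
}
```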
Example profiles:
* pprof profile (use [`pprof`](https://github.com/google/pprof)):
[profile.pb.gz](https://github.com/user-attachments/files/17756788/profile.pb.gz)
* Web interface: `pprof -http :6060 profile.pb.gz`
* Interactive flamegraph:
[profile.svg.gz](https://github.com/user-attachments/files/17756782/profile.svg.gz)
Fixes the masking in the CancelKeyData display format. Due to a negative
i32 being cast to u64, the top bits all had an `0xffffffff` prefix. In
the bitwise OR that followed, these took priority.
This PR also compresses 3 logs during sql-over-http into 1 log with
durations as label fields, as previously discussed.
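A small sketch of the underlying issue and fix, using a hypothetical packing helper rather than proxy's exact code:

```rust
// Sketch of the underlying issue and fix: a negative i32 sign-extends when cast
// directly to u64 (hypothetical packing helper, not proxy's exact code).
fn pack_cancel_key(backend_pid: i32, cancel_key: i32) -> u64 {
    // Going through u32 first keeps only the low 32 bits, so the OR below
    // can't be clobbered by a 0xffffffff prefix from sign extension.
    ((backend_pid as u32 as u64) << 32) | (cancel_key as u32 as u64)
}

fn main() {
    let packed = pack_cancel_key(1234, -5678);
    assert_eq!(packed >> 32, 1234);
    // Buggy version for contrast: sign extension pollutes the high bits.
    let buggy = ((1234i32 as u64) << 32) | (-5678i32 as u64);
    assert_ne!(buggy >> 32, 1234);
}
```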
## Problem
On Debian 12 (Bookworm), Python 3.11 is the latest available version.
## Summary of changes
- Update Python to 3.11 in build-tools
- Fix ruff check / format
- Fix mypy
- Use `StrEnum` instead of pair `str`, `Enum`
- Update docs
## Problem
I made a mistake when merging Postgres PRs.
## Summary of changes
Restore consistency of the referenced submodules.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>