## Problem
Follow-up of https://github.com/neondatabase/neon/pull/10550 for the case where the
upper limit is set below the compaction threshold. It does not make sense to
enforce behavior like "if there are >= 50 L0s, only compact 10 of them".
## Summary of changes
Use the maximum of compaction threshold and upper limit when selecting
L0 files to compact.
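A minimal sketch of that selection rule, with illustrative names rather than the actual pageserver code:

```rust
/// Hypothetical sketch: cap the number of L0 layers picked for one compaction pass.
/// `compaction_threshold` is the L0 count that triggers compaction;
/// `compaction_upper_limit` bounds how many L0s one pass may take.
fn l0_selection_cap(compaction_threshold: usize, compaction_upper_limit: usize) -> usize {
    // Never let the effective cap drop below the trigger threshold: if compaction
    // only starts at >= 50 L0s, it should take at least 50, not 10.
    std::cmp::max(compaction_threshold, compaction_upper_limit)
}

fn select_l0s_for_compaction(mut l0_layers: Vec<u64>, threshold: usize, upper_limit: usize) -> Vec<u64> {
    l0_layers.truncate(l0_selection_cap(threshold, upper_limit));
    l0_layers
}
```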
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Repartitioning is slow, but it is only used for image layer creation. We can
skip it when there is a large backlog of L0 layers from ingestion.
## Summary of changes
If L0 compaction is not complete, do not repartition and do not create
image layers.
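A rough sketch of the gating, assuming a flag that reports whether the L0 backlog has been fully compacted (names are illustrative):

```rust
/// Hypothetical sketch: only repartition and build image layers once L0
/// compaction has caught up; otherwise return early and let the next
/// compaction iteration try again.
fn maybe_repartition_and_create_images(l0_compaction_fully_done: bool) -> bool {
    if !l0_compaction_fully_done {
        // Skip the expensive repartition pass while L0 debt is still high.
        return false;
    }
    // ... repartition the keyspace and create image layers here ...
    true
}
```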
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We don't have good observability for per-timeline compaction debt,
specifically the number of delta layers in the frozen, L0, and L1
levels.
Touches https://github.com/neondatabase/cloud/issues/23283.
## Summary of changes
* Add a `level` label for `pageserver_layer_{count,size}` with values
`l0`, `l1`, and `frozen`.
* Track metrics for frozen layers.
There is already a `kind={delta,image}` label. `kind=image` is only
possible for `level=l1`.
We don't include the currently open ephemeral layer, only frozen layers.
There is always exactly one ephemeral layer, whose dynamic size is already
tracked in `pageserver_timeline_ephemeral_bytes`.
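As a rough illustration of the labeling scheme (using the `prometheus` crate directly; metric and label names follow this description, not necessarily the exact pageserver definitions):

```rust
use lazy_static::lazy_static;
use prometheus::{register_int_gauge_vec, IntGaugeVec};

lazy_static! {
    // Hypothetical sketch of a per-timeline layer-count gauge with the new
    // `level` label; `level` takes the values "frozen", "l0" and "l1".
    static ref LAYER_COUNT: IntGaugeVec = register_int_gauge_vec!(
        "pageserver_layer_count",
        "Number of layers per timeline, by kind and level",
        &["tenant_id", "timeline_id", "kind", "level"]
    )
    .unwrap();
}

fn record_frozen_layers(tenant_id: &str, timeline_id: &str, frozen: i64) {
    // Frozen layers are always deltas; images only exist at level "l1".
    LAYER_COUNT
        .with_label_values(&[tenant_id, timeline_id, "delta", "frozen"])
        .set(frozen);
}
```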
## Problem
So far, benchmarking.yml only runs benchmarks with PostgreSQL version 16.
However, Neon recently changed the default for new customers to PostgreSQL
version 17.
See related [epic](https://github.com/neondatabase/cloud/issues/23295)
## Summary of changes
We do not want to run every job step with both pg 16 and pg 17, because this
would need excessive resources (runners, computes) and extend the
benchmarking run wall-clock time too much.
So we select an opinionated subset of test cases that we also include in
weekly reporting, and add a Postgres v17 job step for them.
For the re-used projects, the associated Neon projects have been created and
their connection strings added to the neondatabase organization secrets.
A follow-up is to add reporting for these new runs to some Grafana
dashboards.
## Problem
1. d04d924 added separate metrics for total requests and for failures, but
that doesn't make much sense; we can have a single unified counter labeled
with `http_status`.
2. `test_compute_migrations_retry` had a race: it was waiting for the last
successful migration, not for an actual failure. This was revealed after
adding an assertion on the failure metric in d04d924.
## Summary of changes
1. Switch to unified counters for `compute_ctl` requests (see the sketch below).
2. Add a waiting loop into `test_compute_migrations_retry` to eliminate
the race.
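A minimal sketch of such a unified counter, assuming the `prometheus` crate; the metric and label names here are illustrative, not the actual `compute_ctl` ones:

```rust
use lazy_static::lazy_static;
use prometheus::{register_int_counter_vec, IntCounterVec};

lazy_static! {
    // Hypothetical sketch: one counter for all requests, with failures being
    // simply the subset of samples with a non-2xx `http_status`.
    static ref CTL_REQUESTS: IntCounterVec = register_int_counter_vec!(
        "compute_ctl_requests_total",
        "compute_ctl requests by path and HTTP status",
        &["path", "http_status"]
    )
    .unwrap();
}

fn observe_request(path: &str, status: u16) {
    CTL_REQUESTS
        .with_label_values(&[path, &status.to_string()])
        .inc();
}
```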
Part of neondatabase/cloud#17590
## Problem
If we are GC-ing because a new image layer was added while traversing the
timeline, GC may remove layers that are still required to fulfill the current
get request (the read path cannot "look back" and notice the new image
layer).
## Summary of changes
Prevent GC from progressing on the current timeline while it is being
visited for a read.
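A minimal sketch of the idea, assuming a per-timeline reader/writer guard that GC must acquire exclusively (the actual synchronization in the pageserver may differ):

```rust
use std::sync::Arc;
use tokio::sync::RwLock;

/// Hypothetical sketch: reads hold a shared guard for the whole timeline
/// traversal; GC takes the exclusive guard, so it cannot delete layers out
/// from under an in-flight get request.
#[derive(Default)]
struct GcBlock {
    lock: RwLock<()>,
}

async fn serve_get_request(gc_block: Arc<GcBlock>) {
    let _read_guard = gc_block.lock.read().await;
    // ... traverse layers and materialize the page; GC is blocked meanwhile ...
}

async fn gc_iteration(gc_block: Arc<GcBlock>) {
    let _gc_guard = gc_block.lock.write().await;
    // ... safe to remove layers made obsolete by a newer image layer ...
}
```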
Epic: https://github.com/neondatabase/neon/issues/9376
Luckily they were the same version, so we didn't spend time compiling two
versions, though that could happen in the future.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
Keeping CancelMap as an in-memory structure increases code complexity and
puts additional load on Redis streams.
## Summary of changes
- Implement a set of KV ops for the Redis client;
- Remove the cancel notifications code;
- Send KV ops over a bounded channel to a background task that adds and
removes the cancel keys.
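A rough sketch of that hand-off, with illustrative op and function names (the real proxy code will differ):

```rust
use tokio::sync::mpsc;

// Hypothetical sketch: request handlers enqueue cancel-key operations, and a
// single background task applies them to Redis.
enum CancelKeyOp {
    Set { key: String, value: String },
    Delete { key: String },
}

async fn cancel_key_writer(mut rx: mpsc::Receiver<CancelKeyOp>) {
    while let Some(op) = rx.recv().await {
        match op {
            // In the real task these would be SET / DEL calls on the Redis client.
            CancelKeyOp::Set { key: _key, value: _value } => { /* redis SET */ }
            CancelKeyOp::Delete { key: _key } => { /* redis DEL */ }
        }
    }
}

fn spawn_cancel_key_writer() -> mpsc::Sender<CancelKeyOp> {
    // Bounded channel: backpressure on the handlers instead of unbounded growth.
    let (tx, rx) = mpsc::channel(1024);
    tokio::spawn(cancel_key_writer(rx));
    tx
}
```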
Closes #9660
## Problem
We have to test the extensions shipped with Neon for compatibility before the
upgrade.
## Summary of changes
Added the test for compatibility with the upgraded extensions.
## Problem
Follow-up of the incident: we should not use the same bound as both the lower
and upper limit on the number of files to compact. This patch adds a separate
upper limit, set to 50 for now.
## Summary of changes
Add `compaction_upper_limit`.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
## Problem
Ref: https://github.com/neondatabase/cloud/issues/23314
We suspect that some inconsistency in benchmark test runs could be due to the
different types of runners they land on.
To align both failure rates and benchmark results, let's run them on
`small-metal` servers for now and see how test stability progresses.
## Summary of changes
Run the benchmark jobs on `small-metal` servers.
## Problem
Several parts of `compute_ctl` have very low error visibility:
1. DB migrations that run async in the background after compute start.
2. Requests made to control plane (currently only `GetSpec`).
3. Requests made to the remote extensions server.
## Summary of changes
Add new counters to quickly evaluate the amount of errors among the
fleet.
Part of neondatabase/cloud#17590
## Problem
https://github.com/neondatabase/neon/pull/10448 removed release notes,
because if their generation failed, the whole release failed.
People liked them, though, and wanted some basic release notes as a fallback
instead of removing them completely.
## Summary of changes
Include basic release notes that link to the release PR and to a diff to
the previous release.
## Problem
We've seen the ingest connection manager get stuck shortly after a
migration.
## Summary of changes
A speculative mitigation is to use the same mechanism that get-page requests
use for kicking LSN ingest: the connection manager monitors LSN waits and
queries the broker if no updates are received for the timeline.
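A minimal sketch of what that monitoring could look like, assuming a watch channel carrying the last-ingested LSN and a hypothetical `requery_broker` hook; the real connection manager logic differs in the details:

```rust
use std::time::Duration;
use tokio::sync::watch;
use tokio::time::timeout;

/// Hypothetical sketch: while a get-page caller waits for WAL to reach
/// `wanted`, watch ingest progress and, if nothing arrives within the
/// deadline, ask the broker again for safekeeper candidates.
async fn wait_for_lsn(wanted: u64, mut ingested_lsn: watch::Receiver<u64>) {
    loop {
        if *ingested_lsn.borrow() >= wanted {
            return;
        }
        match timeout(Duration::from_secs(10), ingested_lsn.changed()).await {
            Ok(_) => continue,                 // progress was made; re-check the LSN
            Err(_elapsed) => requery_broker(), // stuck: kick the connection manager
        }
    }
}

fn requery_broker() {
    // ... ask the storage broker for fresh safekeeper timeline info ...
}
```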
Closes https://github.com/neondatabase/neon/issues/10351
## Problem
We need a setting to disable the flush upload wait, to test L0 flush
backpressure in staging.
## Summary of changes
Add `l0_flush_wait_upload` setting.
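A sketch of how such a knob could look in a config struct; the field name comes from this description, while the serde layout and default are assumptions:

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug, Clone)]
pub struct FlushConfigSketch {
    /// Hypothetical sketch: when false, a flush does not wait for the layer
    /// upload to complete, removing upload-driven flush backpressure
    /// (useful for testing L0 flush backpressure in staging).
    #[serde(default = "default_l0_flush_wait_upload")]
    pub l0_flush_wait_upload: bool,
}

fn default_l0_flush_wait_upload() -> bool {
    true
}
```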
## Problem
Request data and usage metrics S3 uploads share the same identifier in logs,
causing confusion about which type of upload failed.
## Summary of changes
Use the correct identifier for usage metrics uploads.
neondatabase/cloud#23084
Only a few things needed updating:
- async_trait was removed
- Message::Text takes a Utf8Bytes object instead of a String
Signed-off-by: Tristan Partin <tristan@neon.tech>
Co-authored-by: Conrad Ludgate <connor@neon.tech>
In #10308, we noticed many warnings about the local layer having a different
size on disk than recorded in the metadata.
However, the layer downloader would never re-download layer files if the size
or generation number changed. This is obviously a bug, which we aim to fix
with this PR.
This change also moves the code that decides what to do about a layer into a
dedicated function: before, we handled the "routing" via control flow, but it
has become too complicated, and it is nicer to have the different verdicts
for a layer spelled out in a list/match.
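A rough sketch of what such a verdict enum and match could look like (type and field names are illustrative, not the actual pageserver code):

```rust
/// Hypothetical sketch of the per-layer decision spelled out in one place.
enum LayerVerdict {
    Keep,
    Redownload { reason: &'static str },
    Delete,
}

fn decide_layer_action(
    wanted_by_metadata: bool,
    local_size: Option<u64>,
    metadata_size: u64,
    generations_match: bool,
) -> LayerVerdict {
    if !wanted_by_metadata {
        return LayerVerdict::Delete;
    }
    match local_size {
        None => LayerVerdict::Redownload { reason: "not present locally" },
        Some(size) if size != metadata_size => LayerVerdict::Redownload { reason: "size mismatch" },
        Some(_) if !generations_match => LayerVerdict::Redownload { reason: "generation mismatch" },
        Some(_) => LayerVerdict::Keep,
    }
}
```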
This reverts commit 9e55d79803.
We'll still need this until we can tune L0 flush backpressure and
compaction. I'll add a setting to disable this separately.
## Problem
This one is fairly embarrassing: the safekeeper node id was used in the
pageserver application name when connecting to safekeepers.
## Summary of changes
Use the right node id.
Closes https://github.com/neondatabase/neon/issues/10461
## Problem
We want to verify whether the pageserver stripe size has an impact on ingest
performance.
We want to verify whether ingest performance has improved or regressed with
Postgres version 17.
## Summary of changes
- Allow creating new projects with different Postgres versions
- Allow pre-sharding new projects with different stripe sizes instead of
relying on the storage controller to shard_split the project once a threshold
is exceeded
Replaces https://github.com/neondatabase/neon/pull/10509
Test run https://github.com/neondatabase/neon/actions/runs/12986410381
We no longer need libpq to build the storage controller, as we have used
`diesel-async` since #10280. Therefore, remove the env var that gave
cargo/rustc the location of libpq.
Follow-up of #10280
During broker deploys, pageservers log this noisy WARN en masse.
I can trivially reproduce the WARN message in neon_local by SIGKILLing
broker during e.g. `pgbench -i`.
I don't understand why tonic is not detecting the error as
`Code::Unavailable`.
Until we find time to understand that / fix upstream, this PR adds the
error message to the existing list of known error messages that get
demoted to INFO level.
Refs:
- https://github.com/neondatabase/neon/issues/9562
## Problem
We were logging a warning after a single request timeout while listing
objects.
Closes: https://github.com/neondatabase/neon/issues/10166
## Summary of changes
- These timeouts are a pretty normal part of life, so back off and only log a
warning after two in a row.
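A small sketch of that back-off, assuming a counter of consecutive timeouts tracked by the listing retry loop (names are illustrative):

```rust
/// Hypothetical sketch: a single timeout is logged at INFO; only two or more
/// in a row are escalated to WARN.
#[derive(Default)]
struct ListingTimeoutTracker {
    consecutive_timeouts: u32,
}

impl ListingTimeoutTracker {
    fn on_timeout(&mut self) {
        self.consecutive_timeouts += 1;
        if self.consecutive_timeouts >= 2 {
            tracing::warn!(
                "object listing timed out {} times in a row",
                self.consecutive_timeouts
            );
        } else {
            tracing::info!("object listing timed out, retrying");
        }
    }

    fn on_success(&mut self) {
        self.consecutive_timeouts = 0;
    }
}
```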
## Problem
From time to time, folks discover our `control_plane/` folder and make
the (reasonable) mistake of thinking it's a tool for running full-sized
Neon systems, whereas in reality it is a tool for dev/test.
## Summary of changes
- Change control_plane's readme title to "Local Development Control
Plane (`neon_local`)"
- Change "Running local installation" to "Running a local development
environment" in the main readme
## Problem
The intent of this parameter is to have pageservers consider themselves
"full" if they've got lots of shards, even if they have plenty of
capacity. It works, but because we typically successfully oversubscribe
capacity up to 200%, the MAX_SHARDS limit is effectively doubled, so
this 20,000 value ends up meaning 40,000, whereas the original intent
was to limit nodes to ~10000 shards.
## Summary of changes
- Change MAX_SHARDS to 5000, so that a node with 5000 shards gets 100%
utilization, which in practice is equivalent to being considered "half full"
by the storage controller in capacity terms.
This is all a bit subtle and indirect. Originally the limit was baked into
the pageserver with the idea that the pageserver knows better than the
storage controller what its own resources tolerate, but in practice it would
probably be easier to understand all this if we just did it controller-side.
So there's scope to refactor here in the future.
Switches the storcon away from using diesel's synchronous APIs in favour
of `diesel-async`.
Advantages:
* Fewer C dependencies, especially no openssl, which might be behind this
bug: https://github.com/neondatabase/cloud/issues/21010
* Better to have only async code than a mix of async plus `spawn_blocking`
We had to turn off the connection pool for migrations, as diesel migrations
don't support async APIs, so we still use `spawn_blocking` in that one place.
This is exactly what one of the `diesel-async` examples does as well.
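A minimal sketch of that one remaining blocking spot, assuming `diesel_migrations` and a plain synchronous `PgConnection` inside `spawn_blocking` (paths and error handling simplified):

```rust
use diesel::{Connection, PgConnection};
use diesel_migrations::{embed_migrations, EmbeddedMigrations, MigrationHarness};

// Hypothetical sketch: diesel's migration harness is synchronous, so run it on
// a blocking thread while the rest of the storage controller uses diesel-async.
const MIGRATIONS: EmbeddedMigrations = embed_migrations!("./migrations");

async fn run_migrations(database_url: String) -> anyhow::Result<()> {
    tokio::task::spawn_blocking(move || -> anyhow::Result<()> {
        let mut conn = PgConnection::establish(&database_url)?;
        conn.run_pending_migrations(MIGRATIONS)
            .map_err(|e| anyhow::anyhow!(e))?;
        Ok(())
    })
    .await??;
    Ok(())
}
```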
## Problem
In ingest benchmarks, we see L0 compaction delays of over 10 minutes due
to image compaction. We can't stall L0 flushes for that long.
## Summary of changes
Disable L0 flush stalls, and bump the default L0 flush delay threshold
from 20 to 30 L0 layers.
## Problem
If compaction fails, we disable L0 flush stalls to avoid persistent
stalls. However, the logic would unset the failure marker on offload
failures or shutdown. This can lead to sudden L0 flush stalls if we try
and fail to offload a timeline with compaction failures, or if there is
some kind of shutdown race.
Touches #10405.
## Summary of changes
Don't touch the compaction failure marker on offload failures or
shutdown.
## Problem
After talking about it with @bayandin again, this should replace the changes
from https://github.com/neondatabase/neon/pull/10475. While the previous
changes worked, they are less visually clear about what happens, and we might
end up in a situation where we update `latest` but don't actually have the
tagged image pushed that contains the same changes. The latter could result
in hard-to-debug situations.
## Summary of changes
Revert c283aaaf8d and make
promote-images-prod depend on promote-images-dev instead.
## Problem
The containers' log output is mixed with the tests' output, so you must
scroll up to find the error.
## Summary of changes
Printing of the containers' logs has been moved to a separate step.