## Problem
In #8550, we made the flush loop wait for uploads after every layer.
This was to avoid unbounded buildup of uploads, and to reduce compaction
debt. However, the approach has several problems:
* It prevents upload parallelism.
* It prevents flush and upload pipelining.
* It slows down ingestion even when there is no need to backpressure.
* It does not directly backpressure WAL ingestion (only via
`disk_consistent_lsn`), and will build up in-memory layers.
* It does not directly backpressure based on compaction debt and read
amplification.
An alternative solution to these problems is proposed in #8390.
In the meanwhile, we revert the change to reduce the impact on ingest
throughput. This does reintroduce some risk of unbounded
upload/compaction buildup. Until
https://github.com/neondatabase/neon/issues/8390, this can be addressed
in other ways:
* Use `max_replication_apply_lag` (aka `remote_consistent_lsn`), which
will more directly limit upload debt.
* Shard the tenant, which will spread the flush/upload work across more
Pageservers and move the bottleneck to Safekeeper.
Touches #10095.
## Summary of changes
Remove waiting on the upload queue in the flush loop.
## Problem
When entry was dropped and password wasn't set, new entry
had uninitialized memory in controlplane adapter
Resolves: https://github.com/neondatabase/cloud/issues/14914
## Summary of changes
Initialize password in all cases, add tests.
Minor formatting for less indentation
## Problem
`benchmarking` job fails because `aws-oicd-role-arn` input is not set
## Summary of changes:
- Set `aws-oicd-role-arn` for `benchmarking job
- Always require `aws-oicd-role-arn` to be set
- Rename `aws_oicd_role_arn` to `aws-oicd-role-arn` for consistency
## Problem
close https://github.com/neondatabase/neon/issues/10124
gc-compaction split_gc_jobs is holding the repartition lock for too long
time.
## Summary of changes
* Ensure split_gc_compaction_jobs drops the repartition lock once it
finishes cloning the structures.
* Update comments.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Improved comments will help others when they read the code, and the log
messages will help others understand why the logical replication monitor
works the way it does.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
LFC used_pages statistic is not updated in case of LFC resize (shrinking
`neon.file_cache_size_limit`)
## Summary of changes
Update `lfc_ctl->used_pages` in `lfc_change_limit_hook`
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
The test was failing with the scary but generic message `Remote storage
metadata corrupted`.
The underlying scrubber error is `Orphan layer detected: ...`.
The test kills pageserver at random points, hence it's expected that we
leak layers if we're killed in the window after layer upload but before
it's referenced from index part.
Refer to generation numbers RFC for details.
Refs:
- fixes https://github.com/neondatabase/neon/issues/9988
- root-cause analysis
https://github.com/neondatabase/neon/issues/9988#issuecomment-2520673167
## Problem
`test_prefetch` is flaky
(https://github.com/neondatabase/neon/issues/9961), but if it passes,
the run time is less than 30 seconds — we don't need an extended timeout
for it.
## Summary of changes
- Remove extended test timeout for `test_prefetch`
## Problem
We want to extract safekeeper http client to separate crate for use in
storage controller and neon_local. However, many types used in the API
are internal to safekeeper.
## Summary of changes
Move them to safekeeper_api crate. No functional changes.
ref https://github.com/neondatabase/neon/issues/9011
## Problem
When moving the comment on proxy-releases from the yaml doc into a
javascript code block, I missed converting the comment marker from `#`
to `//`.
## Summary of changes
Correctly convert comment marker.
## Problem
I've noticed that debug builds with LFC fail more frequently and for
some reason ,their failure do block merging (but it should not)
## Summary of changes
- Do not run Debug builds with LFC
## Problem
When we update our scheduler/optimization code to respect AZs properly
(https://github.com/neondatabase/neon/pull/9916), the choice of AZ
becomes a much higher-stakes decision. We will pretty much always run a
tenant in its preferred AZ, and that AZ is fixed for the lifetime of the
tenant (unless a human intervenes)
Eventually, when we do auto-balancing based on utilization, I anticipate
that part of that will be to automatically change the AZ of tenants if
our original scheduling decisions have caused imbalance, but as an
interim measure, we can at least avoid making this scheduling decision
based purely on which AZ contains the emptiest node.
This is a precursor to https://github.com/neondatabase/neon/pull/9947
## Summary of changes
- When creating a tenant, instead of scheduling a shard and then reading
its preferred AZ back, make the AZ decision first.
- Instead of choosing AZ based on which node is emptiest, use the median
utilization of nodes in each AZ to pick the AZ to use. This avoids bad
AZ decisions during periods when some node has very low utilization
(such as after replacing a dead node)
I considered also making the selection a weighted pseudo-random choice
based on utilization, but wanted to avoid destabilising tests with that
for now.
## Problem
Now notifications about failures in `pg_regress` tests run on the
staging cloud instance, reach the channel `on-call-staging-stream`,
while they should reach `on-call-qa-staging-stream`
## Summary of changes
The channel changed.
## Problem
CI currently uses static credentials in some places. These are less
secure and hard to maintain, so we are going to deprecate them and use
OIDC auth.
## Summary of changes
- ci(fix): Use OIDC auth to upload artifact on s3
- ci(fix): Use OIDC auth to login on ECR
## Problem
Now that https://github.com/neondatabase/cloud/issues/15245 is done, we
can remove the old code.
## Summary of changes
Removes support for the ManagementV2 API, in favour of the ProxyV1 API.
## Problem
To give Storage more time on preprod — create a release branch on Friday
## Summary of changes
- Automatically create Storage release PR on Friday instead of Monday
This adds an API to the storage controller to list safekeepers
registered to it.
This PR does a `diesel print-schema > storage_controller/src/schema.rs`
because of an inconsistency between up.sql and schema.rs, introduced by
[this](2c142f14f7)
commit, so there is some updates of `schema.rs` due to that. As a
followup to this, we should maybe think about running `diesel
print-schema` in CI.
Part of #9981
## Problem
`test_check_visibility_map` has been seen to time out in debug tests.
## Summary of changes
Bump the timeout to 10 minutes (test reports indicate 7 minutes is
sufficient).
We don't want to disable the test entirely in debug builds, to exercise
this with debug assertions enabled.
Resolves#10069.
This adds some validation of invariants that we want to uphold wrt the
tenant manifest and `index_part.json`:
* the data the manifest has about a timeline must match with the data in
`index_part.json`. It might actually change, e.g. when we do reparenting
during detach ancestor, but that requires the timeline to be
unoffloaded, i.e. removed from the manifest.
* any timeline mentioned in index part, must, if present, be archived.
If we unarchive, we first update the tenant manifest to unoffload, and
only then update index part. And one needs to archive before offloading.
* it is legal for timelines to be mentioned in the manifest but have no
`index_part`: this is a temporary state visible during deletion of the
timeline. if the pageserver crashed, an attach of the tenant will clean
the state up.
* it is also legal for offloaded timelines to have an
`ancestor_retain_lsn` of None while having an `ancestor_timeline_id`.
This is for the to-be-added flattening functionality: the plan is to set
former to None if we have flattened a timeline.
follow-up of #9942
part of #8088
## Problem
We saw the drain/fill operations not drain fast enough in ap-southeast.
## Summary of changes
These are some quick changes to speed it up:
* double reconcile concurrency - this is now half of the available
reconcile bandwidth
* reduce the waiter polling timeout - this way we can spawn new
reconciliations faster
## Problem
Cplane and storage controller tenant config changes are not additive.
Any change overrides all existing tenant configs. This would be fine if
both did client side patching, but that's not the case.
Once this merges, we must update cplane to use the PATCH endpoint.
## Summary of changes
### High Level
Allow for patching of tenant configuration with a `PATCH
/v1/tenant/config` endpoint.
It takes the same data as it's PUT counterpart. For example the payload
below will update `gc_period` and unset `compaction_period`. All other
fields are left in their original state.
```
{
"tenant_id": "1234",
"gc_period": "10s",
"compaction_period": null
}
```
### Low Level
* PS and storcon gain `PATCH /v1/tenant/config` endpoints. PS endpoint
is only used for cplane managed instances.
* `storcon_cli` is updated to have separate commands for
`set-tenant-config` and `patch-tenant-config`
Related https://github.com/neondatabase/cloud/issues/21043
add owned_by_superuser field to filter out system extensions.
While on it, also correct related code:
- fix the metric setting: use set() instead of inc() in a loop.
inc() is not idempotent and can lead to incorrect results
if the function called multiple times. Currently it is only called at
compute start, but this will change soon.
- fix the return type of the installed_extensions endpoint
to match the metric. Currently it is only used in the test.
## Problem
Linking walproposer library (e.g. `cargo t`) produces linker errors:
/home/myrrc/neon/pgxn/neon/walproposer_compat.c:169: undefined reference
to `pg_snprintf'
The library with these symbols (libpgcommon.a) is present
## Summary of changes
Changed order of libraries resolution for linker
## Problem
We added support for LFC for tests but are still using it only for the
PG17 release.
## Summary of changes
LFC is enabled for all PG versions. Errors in tests with LFC enabled now
block merging as usual. We keep tests with disabled LFC for PG17
release. Tests on debug builds with LFC enabled still don't affect
permission to merge.
## Problem
If the control plane cannot be reached for some reason, compute_ctl
panics
## Summary of changes
panic is removed in favour of returning an error.
Code is reformatted a bit for more flat control flow
Resolves: #5391
## Problem
We get slru truncation commands on non-zero shards.
Compaction will drop the slru dir keys and ingest will fail when
receiving such records.
https://github.com/neondatabase/neon/pull/10080 fixed it for clog, but
not for multixact.
## Summary of changes
Only truncate multixact slrus on shard zero. I audited the rest of the
ingest code and it looks
fine from this pov.
## Problem
With pipelining enabled, the time a request spends in the batcher stage
counts towards the smgr op latency.
If pipelining is disabled, that time is not accounted for.
In practice, this results in a jump in smgr getpage latencies in various
dashboards and degrades the internal SLO.
## Solution
In a similar vein to #10042 and with a similar rationale, this PR stops
counting the time spent in batcher stage towards smgr op latency.
The smgr op latency metric is reduced to the actual execution time.
Time spent in batcher stage is tracked in a separate histogram.
I expect to remove that histogram after batching rollout is complete,
but it will be helpful in the meantime to reason about the rollout.
## Problem
Protobuf doesn't support 128 bit integers, so we encode the keys as two
64 bit integers. Issue is that when we split the 128 bit compact key we
use signed 64 bit integers to represent the two halves. This may result
in a negative lower half when relnode is larger than `0x00800000`. When
we convert the lower half to an i128 we get a negative `CompactKey`.
## Summary of Changes
Use unsigned integers when encoding into Protobuf.
## Deployment
* Prod: We disabled the interpreted proto, so no compat concerns.
* Staging: Disable the interpreted proto, do one release, and then
release the fixed version.
We do this because a negative int32 will convert to a large uint32 value
and could give
a key in the actual pageserver space. In production we would around this
by adding new
fields to the proto and deprecating the old ones, but we can make our
lives easy here.
* Pre-prod: Same as staging
## Problem
When dev deployments are disabled (or fail), the tags for releases
aren't created. It makes more sense to have tag and release creation
before the deployment to prevent situations like
[this](https://github.com/neondatabase/neon/pull/9959).
It is not enough to move the tag creation before the deployment. If the
deployment fails, re-running the job isn't possible because the API call
to create the tag will fail.
## Summary of changes
- Tag/Release creation now happens before the deployment
- The two steps for tag and release have been merged into a bigger one
- There's new checks to ensure the that if the tags/releases already
exist as expected, things will continue just fine.
## Problem
In #9786 we stop storing SLRUs on non-zero shards.
However, there was one code path during ingest that still tries to
enumerate SLRU relations on all shards. This fails if it sees a tenant
who has never seen any write to an SLRU, or who has done such thorough
compaction+GC that it has dropped its SLRU directory key.
## Summary of changes
- Avoid trying to list SLRU relations on nonzero shards
Neon doesn't have seqscan detection of its own, so stop read_stream from
trying to utilize that readahead, and instead make it issue readahead of
its own.
## Problem
@knizhnik noticed that we didn't issue smgrprefetch[v] calls for
seqscans in PG17 due to the move to the read_stream API, which assumes
that the underlying IO facilities do seqscan detection for readahead.
That is a wrong assumption when Neon is involved, so let's remove the
code that applies that assumption.
## Summary of changes
Remove the cases where seqscans are detected and prefetch is disabled as
a consequence, and instead don't do that detection.
PG PR: https://github.com/neondatabase/postgres/pull/532
## Problem
resolve
https://github.com/neondatabase/neon/issues/9988#issuecomment-2528239437
## Summary of changes
* New verbose mode for storage scrubber scan metadata (pageserver) that
contains the error messages.
* Filter allowed_error list from the JSON output to determine the
healthy flag status.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
We have metrics for GetPage request latencies, but this is an extra
measure to capture requests that take way too long in the logs. The log
message is printed every 10 s, until the response is received:
```
PG:2024-12-09 16:02:07.715 GMT [1782845] LOG: [NEON_SMGR] [shard 0] no response received from pageserver for 10.000 s, still waiting (sent 10613 requests, received 10612 responses)
PG:2024-12-09 16:02:17.723 GMT [1782845] LOG: [NEON_SMGR] [shard 0] no response received from pageserver for 20.008 s, still waiting (sent 10613 requests, received 10612 responses)
PG:2024-12-09 16:02:19.719 GMT [1782845] LOG: [NEON_SMGR] [shard 0] received response from pageserver after 22.006 s
```
## Problem
close https://github.com/neondatabase/cloud/issues/19671
```
Timeline -----------------------------
^ last GC happened LSN
^ original retention period setting = 24hr
> refresh-gc-info updates the gc_info
^ planned cutoff (gc_info)
^ customer set retention to 48hr, and it's still within the last GC LSN
^1 ^2 we have two choices: (1) update the planned cutoff to
move backwards, or (2) keep the current one
```
In this patch, we decided to keep the current cutoff instead of moving
back the gc_info to avoid races. In the future, we could allow the
planned gc cutoff to go back once cplane sends a retention_history
tenant config update, but this requires a careful revisit of the code.
## Summary of changes
Ensure that GC cutoffs never go back if retention settings get changed.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
See https://github.com/neondatabase/neon/issues/9961
Current implementation of prefetch buffer resize doesn't correctly
handle in-flight requests
## Summary of changes
1. Fix index of entry we should wait for if new prefetch buffer size is
smaller than number of in-flight requests.
2. Correctly set flush position
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Azure has a different per-request limit of 256 items for bulk deletion
compared to the number of 1000 on AWS. Therefore, we need to support
multiple values. Due to `GenericRemoteStorage`, we can't add an
associated constant, but it has to be a function.
The PR replaces the `MAX_KEYS_PER_DELETE` constant with a function of
the same name, implemented on both the `RemoteStorage` trait as well as
on `GenericRemoteStorage`.
The value serves as hint of how many objects to pass to the
`delete_objects` function.
Reading:
* https://learn.microsoft.com/en-us/rest/api/storageservices/blob-batch
* https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObjects.html
Part of #7931