## Problem
`test_check_visibility_map` has been seen to time out in debug tests.
## Summary of changes
Bump the timeout to 10 minutes (test reports indicate 7 minutes is
sufficient).
We don't want to disable the test entirely in debug builds, to exercise
this with debug assertions enabled.
Resolves#10069.
This adds some validation of invariants that we want to uphold wrt the
tenant manifest and `index_part.json`:
* the data the manifest has about a timeline must match with the data in
`index_part.json`. It might actually change, e.g. when we do reparenting
during detach ancestor, but that requires the timeline to be
unoffloaded, i.e. removed from the manifest.
* any timeline mentioned in index part, must, if present, be archived.
If we unarchive, we first update the tenant manifest to unoffload, and
only then update index part. And one needs to archive before offloading.
* it is legal for timelines to be mentioned in the manifest but have no
`index_part`: this is a temporary state visible during deletion of the
timeline. if the pageserver crashed, an attach of the tenant will clean
the state up.
* it is also legal for offloaded timelines to have an
`ancestor_retain_lsn` of None while having an `ancestor_timeline_id`.
This is for the to-be-added flattening functionality: the plan is to set
former to None if we have flattened a timeline.
follow-up of #9942
part of #8088
## Problem
We saw the drain/fill operations not drain fast enough in ap-southeast.
## Summary of changes
These are some quick changes to speed it up:
* double reconcile concurrency - this is now half of the available
reconcile bandwidth
* reduce the waiter polling timeout - this way we can spawn new
reconciliations faster
## Problem
Cplane and storage controller tenant config changes are not additive.
Any change overrides all existing tenant configs. This would be fine if
both did client side patching, but that's not the case.
Once this merges, we must update cplane to use the PATCH endpoint.
## Summary of changes
### High Level
Allow for patching of tenant configuration with a `PATCH
/v1/tenant/config` endpoint.
It takes the same data as it's PUT counterpart. For example the payload
below will update `gc_period` and unset `compaction_period`. All other
fields are left in their original state.
```
{
"tenant_id": "1234",
"gc_period": "10s",
"compaction_period": null
}
```
### Low Level
* PS and storcon gain `PATCH /v1/tenant/config` endpoints. PS endpoint
is only used for cplane managed instances.
* `storcon_cli` is updated to have separate commands for
`set-tenant-config` and `patch-tenant-config`
Related https://github.com/neondatabase/cloud/issues/21043
add owned_by_superuser field to filter out system extensions.
While on it, also correct related code:
- fix the metric setting: use set() instead of inc() in a loop.
inc() is not idempotent and can lead to incorrect results
if the function called multiple times. Currently it is only called at
compute start, but this will change soon.
- fix the return type of the installed_extensions endpoint
to match the metric. Currently it is only used in the test.
## Problem
Linking walproposer library (e.g. `cargo t`) produces linker errors:
/home/myrrc/neon/pgxn/neon/walproposer_compat.c:169: undefined reference
to `pg_snprintf'
The library with these symbols (libpgcommon.a) is present
## Summary of changes
Changed order of libraries resolution for linker
## Problem
We added support for LFC for tests but are still using it only for the
PG17 release.
## Summary of changes
LFC is enabled for all PG versions. Errors in tests with LFC enabled now
block merging as usual. We keep tests with disabled LFC for PG17
release. Tests on debug builds with LFC enabled still don't affect
permission to merge.
## Problem
If the control plane cannot be reached for some reason, compute_ctl
panics
## Summary of changes
panic is removed in favour of returning an error.
Code is reformatted a bit for more flat control flow
Resolves: #5391
## Problem
We get slru truncation commands on non-zero shards.
Compaction will drop the slru dir keys and ingest will fail when
receiving such records.
https://github.com/neondatabase/neon/pull/10080 fixed it for clog, but
not for multixact.
## Summary of changes
Only truncate multixact slrus on shard zero. I audited the rest of the
ingest code and it looks
fine from this pov.
## Problem
With pipelining enabled, the time a request spends in the batcher stage
counts towards the smgr op latency.
If pipelining is disabled, that time is not accounted for.
In practice, this results in a jump in smgr getpage latencies in various
dashboards and degrades the internal SLO.
## Solution
In a similar vein to #10042 and with a similar rationale, this PR stops
counting the time spent in batcher stage towards smgr op latency.
The smgr op latency metric is reduced to the actual execution time.
Time spent in batcher stage is tracked in a separate histogram.
I expect to remove that histogram after batching rollout is complete,
but it will be helpful in the meantime to reason about the rollout.
## Problem
Protobuf doesn't support 128 bit integers, so we encode the keys as two
64 bit integers. Issue is that when we split the 128 bit compact key we
use signed 64 bit integers to represent the two halves. This may result
in a negative lower half when relnode is larger than `0x00800000`. When
we convert the lower half to an i128 we get a negative `CompactKey`.
## Summary of Changes
Use unsigned integers when encoding into Protobuf.
## Deployment
* Prod: We disabled the interpreted proto, so no compat concerns.
* Staging: Disable the interpreted proto, do one release, and then
release the fixed version.
We do this because a negative int32 will convert to a large uint32 value
and could give
a key in the actual pageserver space. In production we would around this
by adding new
fields to the proto and deprecating the old ones, but we can make our
lives easy here.
* Pre-prod: Same as staging
## Problem
When dev deployments are disabled (or fail), the tags for releases
aren't created. It makes more sense to have tag and release creation
before the deployment to prevent situations like
[this](https://github.com/neondatabase/neon/pull/9959).
It is not enough to move the tag creation before the deployment. If the
deployment fails, re-running the job isn't possible because the API call
to create the tag will fail.
## Summary of changes
- Tag/Release creation now happens before the deployment
- The two steps for tag and release have been merged into a bigger one
- There's new checks to ensure the that if the tags/releases already
exist as expected, things will continue just fine.
## Problem
In #9786 we stop storing SLRUs on non-zero shards.
However, there was one code path during ingest that still tries to
enumerate SLRU relations on all shards. This fails if it sees a tenant
who has never seen any write to an SLRU, or who has done such thorough
compaction+GC that it has dropped its SLRU directory key.
## Summary of changes
- Avoid trying to list SLRU relations on nonzero shards
Neon doesn't have seqscan detection of its own, so stop read_stream from
trying to utilize that readahead, and instead make it issue readahead of
its own.
## Problem
@knizhnik noticed that we didn't issue smgrprefetch[v] calls for
seqscans in PG17 due to the move to the read_stream API, which assumes
that the underlying IO facilities do seqscan detection for readahead.
That is a wrong assumption when Neon is involved, so let's remove the
code that applies that assumption.
## Summary of changes
Remove the cases where seqscans are detected and prefetch is disabled as
a consequence, and instead don't do that detection.
PG PR: https://github.com/neondatabase/postgres/pull/532
## Problem
resolve
https://github.com/neondatabase/neon/issues/9988#issuecomment-2528239437
## Summary of changes
* New verbose mode for storage scrubber scan metadata (pageserver) that
contains the error messages.
* Filter allowed_error list from the JSON output to determine the
healthy flag status.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
We have metrics for GetPage request latencies, but this is an extra
measure to capture requests that take way too long in the logs. The log
message is printed every 10 s, until the response is received:
```
PG:2024-12-09 16:02:07.715 GMT [1782845] LOG: [NEON_SMGR] [shard 0] no response received from pageserver for 10.000 s, still waiting (sent 10613 requests, received 10612 responses)
PG:2024-12-09 16:02:17.723 GMT [1782845] LOG: [NEON_SMGR] [shard 0] no response received from pageserver for 20.008 s, still waiting (sent 10613 requests, received 10612 responses)
PG:2024-12-09 16:02:19.719 GMT [1782845] LOG: [NEON_SMGR] [shard 0] received response from pageserver after 22.006 s
```
## Problem
close https://github.com/neondatabase/cloud/issues/19671
```
Timeline -----------------------------
^ last GC happened LSN
^ original retention period setting = 24hr
> refresh-gc-info updates the gc_info
^ planned cutoff (gc_info)
^ customer set retention to 48hr, and it's still within the last GC LSN
^1 ^2 we have two choices: (1) update the planned cutoff to
move backwards, or (2) keep the current one
```
In this patch, we decided to keep the current cutoff instead of moving
back the gc_info to avoid races. In the future, we could allow the
planned gc cutoff to go back once cplane sends a retention_history
tenant config update, but this requires a careful revisit of the code.
## Summary of changes
Ensure that GC cutoffs never go back if retention settings get changed.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
See https://github.com/neondatabase/neon/issues/9961
Current implementation of prefetch buffer resize doesn't correctly
handle in-flight requests
## Summary of changes
1. Fix index of entry we should wait for if new prefetch buffer size is
smaller than number of in-flight requests.
2. Correctly set flush position
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Azure has a different per-request limit of 256 items for bulk deletion
compared to the number of 1000 on AWS. Therefore, we need to support
multiple values. Due to `GenericRemoteStorage`, we can't add an
associated constant, but it has to be a function.
The PR replaces the `MAX_KEYS_PER_DELETE` constant with a function of
the same name, implemented on both the `RemoteStorage` trait as well as
on `GenericRemoteStorage`.
The value serves as hint of how many objects to pass to the
`delete_objects` function.
Reading:
* https://learn.microsoft.com/en-us/rest/api/storageservices/blob-batch
* https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObjects.html
Part of #7931
Hello! I was interested in potentially making some contributions to Neon
and looking through the issue backlog I found
[8200](https://github.com/neondatabase/neon/issues/8200) which seemed
like a good first issue to attempt to tackle. I see it was assigned a
while ago so apologies if I'm stepping on any toes with this PR. I also
apologize for the size of this PR. I'm not sure if there is a simple way
to reduce it given the footprint of the components being changed.
## Problem
This PR is attempting to address part of the problem outlined in issue
[8200](https://github.com/neondatabase/neon/issues/8200). Namely to
remove global static usage of timeline state in favour of
`Arc<GlobalTimelines>` and to replace wasteful clones of
`SafeKeeperConf` with `Arc<SafeKeeperConf>`. I did not opt to tackle
`RemoteStorage` in this PR to minimize the amount of changes as this PR
is already quite large. I also did not opt to introduce an
`SafekeeperApp` wrapper struct to similarly minimize changes but I can
tackle either or both of these omissions in this PR if folks would like.
## Summary of changes
- Remove static usage of `GlobalTimelines` in favour of
`Arc<GlobalTimelines>`
- Wrap `SafeKeeperConf` in `Arc` to avoid wasteful clones of the
underlying struct
## Some additional thoughts
- We seem to currently store `SafeKeeperConf` in `GlobalTimelines` and
then expose it through a public`get_global_config` function which
requires locking. This seems needlessly wasteful and based on observed
usage we could remove this public accessor and force consumers to
acquire `SafeKeeperConf` through the new Arc reference.
## Problem
close https://github.com/neondatabase/neon/issues/10049, close
https://github.com/neondatabase/neon/issues/10030, close
https://github.com/neondatabase/neon/issues/8861
part of https://github.com/neondatabase/neon/issues/9114
The legacy gc process calls `get_latest_gc_cutoff`, which uses a Rcu
different than the gc_info struct. In the gc_compaction_smoke test case,
the "latest" cutoff could be lower than the gc_info struct, causing
gc-compaction to collect data that could be accessed by
`latest_gc_cutoff`. Technically speaking, there's nothing wrong with
gc-compaction using gc_info without considering latest_gc_cutoff,
because gc_info is the source of truth. But anyways, let's fix it.
## Summary of changes
* gc-compaction uses `latest_gc_cutoff` instead of gc_info to determine
the gc horizon.
* if a gc-compaction is scheduled via tenant compaction iteration, it
will take the gc_block lock to avoid racing with functionalities like
detach ancestor (if it's triggered via manual compaction API without
scheduling, then it won't take the lock)
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
Currently, we run the `pg_regress` tests only for PG16
However, PG17 is a part of Neon and should be tested as well
## Summary of changes
Modified the workflow and added a patch for PG17 enabling the
`pg_regress` tests.
The problem with leftovers was solved by using branches.
For a while already, we've been unable to update the Azure SDK crates
due to Azure adopting use of a non-tokio async runtime, see #7545.
The effort to upstream the fix got stalled, and I think it's better to
switch to a patched version of the SDK that is up to date.
Now we have a fork of the SDK under the neondatabase github org, to
which I have applied Conrad's rebased patches to:
https://github.com/neondatabase/azure-sdk-for-rust/tree/neon .
The existence of a fork will also help with shipping bulk delete support
before it's upstreamed (#7931).
Also, in related news, the Azure SDK has gotten a rift in development,
where the main branch pertains to a future, to-be-officially-blessed
release of the SDK, and the older versions, which we are currently
using, are on the `legacy` branch. Upstream doesn't really want patches
for the `legacy` branch any more, they want to focus on the `main`
efforts. However, even then, the `legacy` branch is still newer than
what we are having right now, so let's switch to `legacy` for now.
Depending on how long it takes, we can switch to the official version of
the SDK once it's released or switch to the upstream `main` branch if
there is changes we want before that.
As a nice side effect of this PR, we now use reqwest 0.12 everywhere,
dropping the dependency on version 0.11.
Fixes#7545
Result of running:
cargo update -p aws-types -p aws-sigv4 -p aws-credential-types -p
aws-smithy-types -p aws-smithy-async -p aws-sdk-kms -p aws-sdk-iam -p
aws-sdk-s3 -p aws-config
We want to keep the AWS SDK up to date as that way we benefit from new
developments and improvements.
## Problem
We saw a tenant get stuck when it had been put into Pause scheduling
mode to pin it to a pageserver, then it was left idle for a while and
the control plane tried to detach it.
Close: https://github.com/neondatabase/neon/issues/9957
## Summary of changes
- When changing policy to Detached or Secondary, set the scheduling
policy to Active.
- Add a test that exercises this
- When persisting tenant shards, set their `generation_pageserver` to
null if the placement policy is not Attached (this enables consistency
checks to work, and avoids leaving state in the DB that could be
confusing/misleading in future)
## Problem
In #9962 I changed the smgr metrics to include time spent on flush.
It isn't under our (=storage team's) control how long that flush takes
because the client can stop reading requests.
## Summary of changes
Stop the timer as soon as we've buffered up the response in the
`pgb_writer`.
Track flush time in a separate metric.
---------
Co-authored-by: Yuchen Liang <70461588+yliang412@users.noreply.github.com>
If the pageserver connection is lost while receiving the prefetch
request, the prefetch queue is cleared. The error message prints the
values from the prefetch slot, but because the slot was already cleared,
they're all zeros:
LOG: [NEON_SMGR] [shard 0] No response from reading prefetch entry 0:
0/0/0.0 block 0. This can be caused by a concurrent disconnect
To fix, make local copies of the values.
In the passing, also add a sanity check that if the receive() call
succeeds, the prefetch slot is still intact.
## Problem
part of https://github.com/neondatabase/neon/issues/9114, stacked PR
over #9809
The compaction scheduler now schedules partial compaction jobs.
## Summary of changes
* Add the compaction job splitter based on size.
* Schedule subcompactions using the compaction scheduler.
* Test subcompaction scheduler in the smoke regress test.
* Temporarily disable layer map checks
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
With the current metrics we can't identify which shards are ingesting
data at any given time.
## Summary of changes
Add a metric for the number of wal records received for processing by
each shard. This is per (tenant, timeline, shard).
## Problem
We didn't have a codeowner for `/compute`, so nobody was auto-assigned
for PRs like #9973
## Summary of changes
While on it:
1. Group codeowners into sections.
2. Remove control plane from the `/compute_tools` because it's primarily
the internal `compute_ctl` code.
3. Add control plane (and compute) to `/libs/compute_api` because that's
the shared public interface of the compute.
We've seen cases where stray keys end up on the wrong shard. This
shouldn't happen. Add debug assertions to prevent this. In release
builds, we should be lenient in order to handle changing key ownership
policies.
Touches #9914.
## Problem
There's no metrics for disk consistent LSN and remote LSN. This stuff is
useful when looking at ingest performance.
## Summary of changes
Two per timeline metrics are added: `pageserver_disk_consistent_lsn` and
`pageserver_projected_remote_consistent_lsn`. I went for the projected
remote lsn instead of the visible one
because that more closely matches remote storage write tput. Ideally we
would have both, but these metrics are expensive.
## Problem
I'm writing an ingest benchmark in #9812. To time S3 uploads, I need to
schedule a flush of the Pageserver's in-memory layer, but don't actually
want to wait around for it to complete (which will take a minute).
## Summary of changes
Add a parameter `wait_until_flush` (default `true`) for
`timeline/checkpoint` to control whether to wait for the flush to
complete.
## Problem
FSM pages are managed like regular relation pages, and owned by a single
shard. However, when truncating the FSM relation the last FSM page was
zeroed out on all shards. This is unnecessary and potentially confusing.
The superfluous keys will be removed during compactions, as they do not
belong on these shards.
Resolves#10027.
## Summary of changes
Only zero out the truncated FSM page on the owning shard.
## Problem
part of https://github.com/neondatabase/neon/issues/9114
gc-compaction can take a long time. This patch adds support for
scheduling a gc-compaction job. The compaction loop will first handle
L0->L1 compaction, and then gc compaction. The scheduled jobs are stored
in a non-persistent queue within the tenant structure.
This will be the building block for the partial compaction trigger -- if
the system determines that we need to do a gc compaction, it will
partition the keyspace and schedule several jobs. Each of these jobs
will run for a short amount of time (i.e, 1 min). L0 compaction will be
prioritized over gc compaction.
## Summary of changes
* Add compaction scheduler in tenant.
* Run scheduled compaction in integration tests.
* Change the manual compaction API to allow schedule a compaction
instead of immediately doing it.
* Add LSN upper bound as gc-compaction parameter. If we schedule partial
compactions, gc_cutoff might move across different runs. Therefore, we
need to pass a pre-determined gc_cutoff beforehand. (TODO: support LSN
lower bound so that we can compact arbitrary "rectangle" in the layer
map)
* Refactor the gc_compaction internal interface.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
## Problem
We need a higher concurrency during reconfiguration in case of many DBs,
but the instance is already running and used by the client. We can
easily get out of `max_connections` limit, and the current code won't
handle that.
## Summary of changes
Default to 1, but also allow control plane to override this value for
specific projects. It's also recommended to bump
`superuser_reserved_connections` += `reconfigure_concurrency` for such
projects to ensure that we always have enough spare connections for
reconfiguration process to succeed.
Quick workaround for neondatabase/cloud#17846
## Problem
The node shard scan timeout of 1 second is a bit too aggressive, and
we've seen this cause test failures. The scans are performed in parallel
across nodes, and the entire operation has a 15 second timeout.
Resolves#9801.
## Summary of changes
Increase the timeout to 5 seconds. This is still enough to time out on a
network failure and retry successfully within 15 seconds.