## Problem
These commits are split off from
https://github.com/neondatabase/neon/pull/8971/commits where I was
fixing this to make a better scale test pass -- Vlad also independently
recognized these issues with cloudbench in
https://github.com/neondatabase/neon/issues/9062.
1. The storage controller proxies GET requests to pageservers based on
their intent, not the ground truth of where they're really attached.
2. Proxied requests can race with scheduling to tenants, resulting in
404 responses if the request hits the wrong pageserver.
Closes: https://github.com/neondatabase/neon/issues/9062
## Summary of changes
1. If a shard has a running reconciler, then use the database
generation_pageserver to decide who to proxy the request to
2. If such a request gets a 404 response and its scheduled node has
changed since the request was dispatched.
## Problem
Storage controller didn't previously consider AZ locality between
compute and pageservers
when scheduling nodes. Control plane has this feature, and, since we are
migrating tenants
away from it, we need feature parity to avoid perf degradations.
## Summary of changes
The change itself is fairly simple:
1. Thread az info into the scheduler
2. Add an extra member to the scheduling scores
Step (2) deserves some more discussion. Let's break it down by the shard
type being scheduled:
**Attached Shards**
We wish for attached shards of a tenant to end up in the preferred AZ of
the tenant since that
is where the compute is like to be.
The AZ member for `NodeAttachmentSchedulingScore` has been placed
below the affinity score (so it's got the second biggest weight for
picking the node). The rationale for going
below the affinity score is to avoid having all shards of a single
tenant placed on the same node in 2 node
regions, since that would mean that one tenant can drive the general
workload of an entire pageserver.
I'm not 100% sure this is the right decision, so open to discussing
hoisting the AZ up to first place.
**Secondary Shards**
We wish for secondary shards of a tenant to be scheduled in a different
AZ from the preferred one
for HA purposes.
The AZ member for `NodeSecondarySchedulingScore` has been placed first,
so nodes in different AZs
from the preferred one will always be considered first. On small
clusters, this can mean that all the secondaries
of a tenant are scheduled to the same pageserver, but secondaries don't
use up as many resources as the
attached location, so IMO the argument made for attached shards doesn't
hold.
Related: https://github.com/neondatabase/neon/issues/8848
As @koivunej mentioned in the storage channel, for regress test, we
don't need to create a log file for the scrubber, and we should reduce
noisy logs.
## Summary of changes
* Disable log file creation for storage scrubber
* Only log at info level
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
1. Increase statement_timeout. It defaults to 120 s, which is not quite
enough on slow or busy systems with debug build. On my laptop, the index
creation takes about 100 s. On buildfarm, we've seen failures, e.g:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9084/10997888708/index.html#suites/821f97908a487f1d7d3a2a4dd1571e99/db1834bddfe8c5b9/
2. Keep twiddling the LFC size through the whole test. Before, we would
do it for the first 10 seconds, but that only covers a small part of the
pgbench initialization phase. Change the loop so that the pgbench run
time determines how long the test runs, and we keep changing the LFC for
the whole time.
In the passing, also fix bogus test description, copy-pasted from a
completely unrelated test.
## Problem
Compilation of neon extension on macOS produces a warning
```
pgxn/neon/neon_perf_counters.c:50:1: error: non-void function does not return a value [-Werror,-Wreturn-type]
```
## Summary of changes
- Change the return type of `NeonPerfCountersShmemInit` to void
In the `imitate_synthetic_size_calculation_worker` function, we might
obtain the `Cancelled` error variant instead of hitting the cancellation
token based path. Therefore, catch `Cancelled` and handle it analogously
to the cancellation case.
Fixes#8886.
Part of #8130.
## Problem
Currently, decompression is performed within the `read_blobs`
implementation and the decompressed blob will be appended to the end of
the `BytesMut` buffer. We will lose this flexibility of extending the
buffer when we switch to using our own dio-aligned buffer (WIP in
https://github.com/neondatabase/neon/pull/8730). To facilitate the
adoption of aligned buffer, we need to refactor the code to perform
decompression outside `read_blobs`.
## Summary of changes
- `VectoredBlobReader::read_blobs` will return `VectoredBlob` without
performing decompression and appending decompressed blob. It becomes the
caller's responsibility to decompress the buffer.
- Added a new `BufView` type that functions as `Cow<Bytes, &[u8]>`.
- Perform decompression within `VectoredBlob::read` so that people don't
have to explicitly thinking about compression when using the reader
interface.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Even with the 100 ms interval, on my laptop the pageserver always
becomes available on second attempt, so this saves about 900 ms at
every test startup.
## Problem
All the other patches were moved to the compute directory, and only one
was left in the patches subdirectory in the root directory.
## Summary of changes
The patch was moved to the compute directory as others
The PostgreSQL 17 vendor module is now based on postgres/postgres @
d7ec59a63d745ba74fba0e280bbf85dc6d1caa3e, presumably the final code
change before the V17 tag.
## Problem
Scheduling on tenant creation uses different heuristics compared to the
scheduling done during
background optimizations. This results in scenarios where shards are
created and then immediately
migrated by the optimizer.
## Summary of changes
1. Make scheduler aware of the type of the shard it is scheduling
(attached vs secondary).
We wish to have different heuristics.
2. For attached shards, include the attached shard count from the
context in the node score
calculation. This brings initial shard scheduling in line with what the
optimization passes do.
3. Add a test for (2).
This looks like a bigger change than required, but the refactoring
serves as the basis for az-aware
shard scheduling where we also need to make the distinction between
attached and secondary shards.
Closes https://github.com/neondatabase/neon/issues/8969
## Problem
We need to be able to run the regression tests against a cloud-based
Neon staging instance to prepare the migration to the arm architecture.
## Summary of changes
Some tests were modified to work on the cloud instance (i.e. added
passwords, server-side copy changed to client-side, etc)
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Part of #8128, fixes#8872.
## Problem
See #8872.
## Summary of changes
- Retry `list_timeline_blobs` another time if
- there are layer file keys listed but not index.
- failed to download index.
- Instrument code with `analyze-tenant` and `analyze-timeline` span.
- Remove `initdb_archive` check, it could have been deleted.
- Return with exit code 1 on fatal error if `--exit-code` parameter is set.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Instead of adding them to the VM image late in the build process, when
putting together the final VM image, include them in the earlier
compute image already. That makes it more convenient to edit the
files, and to test them.
Seems nice to keep all these together. This also provides a nice place
for a README file to describe the compute image build process. For
now, it briefly describes the contents of the directory, but can be
expanded.
We can't FlushOneBuffer when we're in redo-only mode on PageServer, so
make execution of that function conditional on us not running in
pageserver walredo mode.
## Problem
LFC cache entry is chunk (right now size of chunk is 1Mb). LFC
statistics shows number of chunks, but not number of used pages. And
autoscaling team wants to know how sparse LFC is:
https://neondb.slack.com/archives/C04DGM6SMTM/p1726782793595969
It is possible to obtain it from the view `select count(*) from
local_cache`.
Nut it is expensive operation, enumerating all entries in LFC under
lock.
## Summary of changes
This PR added "file_cache_used_pages" to `neon_lfc_stats` view:
```
select * from neon_lfc_stats;
lfc_key | lfc_value
-----------------------+-----------
file_cache_misses | 3139029
file_cache_hits | 4098394
file_cache_used | 1024
file_cache_writes | 3173728
file_cache_size | 1024
file_cache_used_pages | 25689
(6 rows)
```
Please notice that this PR doesn't change neon extension API, so no need
to create new version of Neon extension.
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
The metrics include a histogram of how long we need to wait for a
GetPage request, number of reconnects, and number of requests among
other things.
The metrics are not yet exported anywhere, but you can query them
manually.
Note: This does *not* bump the default version of the 'neon' extension. We
will do that later, as a separate PR. The reason is that this allows us to roll back
the compute image smoothly, if necessary. Once the image that includes the
new extension .so file with the new functions has been rolled out, and we're
confident that we don't need to roll back the image anymore, we can change
default extension version and actually start using the new functions and views.
This is what the view looks like:
```
postgres=# select * from neon_perf_counters ;
metric | bucket_le | value
---------------------------------------+-----------+----------
getpage_wait_seconds_count | | 300
getpage_wait_seconds_sum | | 0.048506
getpage_wait_seconds_bucket | 2e-05 | 0
getpage_wait_seconds_bucket | 3e-05 | 0
getpage_wait_seconds_bucket | 6e-05 | 71
getpage_wait_seconds_bucket | 0.0001 | 124
getpage_wait_seconds_bucket | 0.0002 | 248
getpage_wait_seconds_bucket | 0.0003 | 279
getpage_wait_seconds_bucket | 0.0006 | 297
getpage_wait_seconds_bucket | 0.001 | 298
getpage_wait_seconds_bucket | 0.002 | 298
getpage_wait_seconds_bucket | 0.003 | 298
getpage_wait_seconds_bucket | 0.006 | 300
getpage_wait_seconds_bucket | 0.01 | 300
getpage_wait_seconds_bucket | 0.02 | 300
getpage_wait_seconds_bucket | 0.03 | 300
getpage_wait_seconds_bucket | 0.06 | 300
getpage_wait_seconds_bucket | 0.1 | 300
getpage_wait_seconds_bucket | 0.2 | 300
getpage_wait_seconds_bucket | 0.3 | 300
getpage_wait_seconds_bucket | 0.6 | 300
getpage_wait_seconds_bucket | 1 | 300
getpage_wait_seconds_bucket | 2 | 300
getpage_wait_seconds_bucket | 3 | 300
getpage_wait_seconds_bucket | 6 | 300
getpage_wait_seconds_bucket | 10 | 300
getpage_wait_seconds_bucket | 20 | 300
getpage_wait_seconds_bucket | 30 | 300
getpage_wait_seconds_bucket | 60 | 300
getpage_wait_seconds_bucket | 100 | 300
getpage_wait_seconds_bucket | Infinity | 300
getpage_prefetch_requests_total | | 69
getpage_sync_requests_total | | 231
getpage_prefetch_misses_total | | 0
getpage_prefetch_discards_total | | 0
pageserver_requests_sent_total | | 323
pageserver_requests_disconnects_total | | 0
pageserver_send_flushes_total | | 323
file_cache_hits_total | | 0
(39 rows)
```
Part of https://github.com/neondatabase/neon/issues/8002
Close https://github.com/neondatabase/neon/issues/8920
Legacy compaction (as well as gc-compaction) rely on the GC process to
remove unused layer files, but this relies on many factors (i.e., key
partition) to ensure data in a dropped table can be eventually removed.
In gc-compaction, we consider the keyspace information when doing the
compaction process. If a key is not in the keyspace, we will skip that
key and not include it in the final output.
However, this is not easy to implement because gc-compaction considers
branch points (i.e., retain_lsns) and the retained keyspaces could
change across different LSNs. Therefore, for now, we only remove aux v1
keys in the compaction process.
## Summary of changes
* Add `FilterIterator` to filter out keys.
* Integrate `FilterIterator` with gc-compaction.
* Add `collect_gc_compaction_keyspace` for a spec of keyspaces that can
be retained during the gc-compaction process.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Not used in production, but in benchmarks, to demonstrate minimal RTT.
(It would be nice to not have to copy the 8KiB of zeroes, but, that
would require larger protocol changes).
Found this useful in investigation
https://github.com/neondatabase/neon/pull/8952.
## Problem
Previously, the storage controller may send compute notifications
containing stale pageservers (i.e. pageserver serving the shard was
detached). This happened because detaches did not update the compute
hook state.
## Summary of Changes
Update compute hook state on shard detach.
Fixes#8928
These were not referenced in any of the other Cargo.toml files in the
workspace. They were not being built because of that, so there was
little harm in having them listed, but let's be tidy.
cargo update ciborium iana-time-zone lazy_static schannel uuid
cargo update hyper@0.14
cargo update --precise 2.9.7 ureq
It might be worthwhile just update all our dependencies at some point,
but this is aimed at pruning the dependency tree, to make the build a
little faster. That's also why I didn't update ureq to the latest
version: that would've added a dependency to yet another version of
rustls.
We frequently mess up our submodule references. This adds one safeguard:
it checks that the submodule references are only updated "forwards", not
to some older commit, or a commit that's not a descended of the previous
one.
As next step, I'm thinking that we should automate things so that when
you merge a PR to the 'neon' repository that updates the submodule
references, the REL_*_STABLE_neon branches are automatically updated to
match the submodule references. That way, you never need to manually
merge PRs in the postgres repository, it's all triggered from commits in
the 'neon' repository. But that's not included here.
Moves the per-timeline code to load timeline metadata into a new
dedicated function called `load_timeline_metadata`. The old
`load_timeline_metadata` becomes `load_timelines_metadata`.
Split out of #8907
Part of #8088
If the primary is started at an LSN within the first of a 16 MB WAL
segment, the "long XLOG page header" at the beginning of the segment was
not initialized correctly. That has gone unnnoticed, because under
normal circumstances, nothing looks at the page header. The WAL that is
streamed to the safekeepers starts at the new record's LSN, not at the
beginning of the page, so that bogus page header didn't propagate
elsewhere, and a primary server doesn't normally read the WAL its
written. Which is good because the contents of the page would be bogus
anyway, as it wouldn't contain any of the records before the LSN where
the new record is written.
Except that in the following cases a primary does read its own WAL:
1. When there are two-phase transactions in prepared state at
checkpoint. The checkpointer reads the two-phase state from the
XLOG_XACT_PREPARE record, and writes it to a file in pg_twophase/.
2. Logical decoding reads the WAL starting from the replication slot's
restart LSN.
This PR fixes the problem with two-phase transactions. For that, it's
sufficient to initialize the page header correctly. The checkpointer
only needs to read XLOG_XACT_PREPARE records that were generated after
the server startup, so it's still OK that older WAL is missing / bogus.
I have not investigated if we have a problem with logical decoding,
however. Let's deal with that separately.
Special thanks to @Lzjing-1997, who independently found the same bug
and opened a PR to fix it, although I did not use that PR.
## Problem
When layer visibility was added, an info log was included for the
situation where actual access to a layer disagrees with the visibility
calculation. This situation is safe, but I was interested in seeing when
it happens.
The log is pretty high volume, so this PR refines it to fire less often.
## Summary of changes
- For cases where accessing non-visible layers is normal, don't log at
all.
- Extend a unit test to increase confidence that the updates to
visibility on access are working as expected
- During compaction, only call the visibility calculation routine if
some image layers were created: previously, frequent calls resulted in
the visibility of layers getting reset every time we passed through
create_image_layers.
## Problem
Seems that PS might be too eager in reporting throttled tasks
## Summary of changes
Introduce a sleep counter. If the sleep counter increases, then the
acquire tasks was throttled.
close https://github.com/neondatabase/neon/issues/8903
In https://github.com/neondatabase/neon/issues/8903 we observed JSON
decoding error to have the following error message in the log:
```
Error processing HTTP request: Resource temporarily unavailable: 3956 (pageserver-6.ap-southeast-1.aws.neon.tech) error receiving body: error decoding response body
```
This is hard to understand. In this patch, we make the error message
more reasonable.
## Summary of changes
* receive body error is now an internal server error, passthrough the
`reqwest::Error` (only decoding error) as `anyhow::Error`.
* instead of formatting the error using `to_string`, we use the
alternative `anyhow::Error` formatting, so that it prints out the cause
of the error (i.e., what exactly cannot serde decode).
I would expect seeing something like `error receiving body: error
decoding response body: XXX field not found` after this patch, though I
didn't set up a testing environment to observe the exact behavior.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
It's pretty expensive to run, and there is very little difference
between debug and release builds that could lead to different clippy
warnings.
This is extracted from PR #8912. That PR wandered off into various
improvements we could make, but we seem to have consensus on this part
at least.