Fixes#3978. `test_partial_evict_tenant` can fail multiple times so even
though we retry it as flaky, it will still haunt us.
Originally was going to just relax the comparison, then ended up
replacing warming up to use full table scans instead of `pgbench
--select-only`. This seems to help by producing the expected layer
accesses. There might be something off with how many layers pg16
produces compared to pg14 and pg15. Created #5392.
## Problem
Pageservers must not delete objects or advertise updates to
remote_consistent_lsn without checking that they hold the latest
generation for the tenant in question (see [the RFC](
https://github.com/neondatabase/neon/blob/main/docs/rfcs/025-generation-numbers.md))
In this PR:
- A new "deletion queue" subsystem is introduced, through which
deletions flow
- `RemoteTimelineClient` is modified to send deletions through the
deletion queue:
- For GC & compaction, deletions flow through the full generation
verifying process
- For timeline deletions, deletions take a fast path that bypasses
generation verification
- The `last_uploaded_consistent_lsn` value in `UploadQueue` is replaced
with a mechanism that maintains a "projected" lsn (equivalent to the
previous property), and a "visible" LSN (which is the one that we may
share with safekeepers).
- Until `control_plane_api` is set, all deletions skip generation
validation
- Tests are introduced for the new functionality in
`test_pageserver_generations.py`
Once this lands, if a pageserver is configured with the
`control_plane_api` configuration added in
https://github.com/neondatabase/neon/pull/5163, it becomes safe to
attach a tenant to multiple pageservers concurrently.
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
Fixes#5072. See proof from
https://github.com/neondatabase/neon/issues/5072#issuecomment-1735580798.
Turns out multiple threads can get the same nanoseconds since epoch, so
switch to using millis (for finding the prefix later on) and randomness
via `thread_rng` (protect against adversial ci runners).
Also changes the "per test looking alike" prefix to more "general"
prefix.
The TaskKind dimension added in #5339 is insufficient to understand what
kind of data causes the cache hits.
Regarding performance considerations: I'm not too worried because we're
moving from 3 to 4 one-byte sized fields; likely the space now used by
the new field was padding before. Didn't check this, though, and it
doesn't matter, we need the data.
What I don't like about this PR is that we have an `Unknown` content
type, and I also don't like that there's no compile-time way to assert
that it's set to something != `Unknown` when calling the page cache.
But, this is what I could come up with before tomorrow’s release, and I
think it covers the hot paths.
## Problem
See https://neondb.slack.com/archives/C033RQ5SPDH/p1694595347598249
## Summary of changes
Update last written LSN after walloging all createdb stuff
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Should have added them in the initial PR #5186.
Would have been nice to test the failure cases as well, but, without
mocking the FS, that's too hard / platform-dependent.
This PR adds a `task_kind` label to page cache access metrics.
These are to validate our hypothesis that the high hit page cache rate
we observe in prod is due to internal tasks, not getpage requests from
compute.
We believe the latter should near-always be a pageserver-page-cache
_miss_ because compute has it's own page cache, and hence there is no
locality of reference for its accesses to pageserver page cache.
Before this PR, we didn't have `RequestContext` propagation to any code
below the on-demand downloader.
The vast majority of changes in this PR is concerned with adding that
propagation.
## Problem
Compute start time has improved, but the timing of connection retries
from the proxy is rather slow, meaning we could be making clients wait
hundreds of milliseconds longer than necessary.
## Summary of changes
Previously, retry time in ms was `100 * 1.5**n`, and `n` starts at 1,
giving: 150, 225, 337, 506, 759, 1139, 1709, ...
This PR changes that to `25 * sqrt(2)**(n - 1)` instead, giving: 25, 35,
50, 71, 100, 141, 200, ...
## Problem
Do not allow new installation of deprecated `hnsw` extension.
The same approach as in https://github.com/neondatabase/neon/pull/5345
## Summary of changes
- Remove `trusted = true` from `hnsw.control`
- Remove `hnsw` related targets from Makefile
## Problem
Before releasing new version to production, we'd like to run a set of
required checks on the incoming release.
The simplest approach, which doesn't require many changes — dedicate one
staging region to `preprod` installation.
The proposed changes to the release flow are the following:
- When a release PR is merged into the release branch — trigger
deployment from the release branch to a dedicated staging-preprod region
(for now, it's going to be `eu-west-1` — Ireland)
Corresponding infrastructure PR:
https://github.com/neondatabase/aws/pull/585
## Summary of changes
- Trigger `deploy.dev` workflow with `-f deployPreprodRegion=true` for
release branch
## Problem
In 83e7e5dbbd dependencies were only
updated for Mac users. Without libicu, postgres 16 build fails.
## Summary of changes
Update dependencies on Ubuntu and fedora to include libicu.
## Problem
See https://neondb.slack.com/archives/C05L7D1JAUS/p1693775881474019
## Summary of changes
Do not perform local file cache resizing in processes having no PGPROC
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Split off from #5297. Builds upon #5326. Handles original review
comments which I did not move to earlier split PRs. Completes test
support for verifying events by notifying of the last batch of events.
Adds cleaning up of tempfiles left because of an unlucky shutdown or
SIGKILL.
Finally closes#5175.
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
Only notable change is including neondatabase/autoscaling#523, which we
hope will help with making sure that TCP connections are properly
terminated before shutdown (which hopefully fixes a leak in the
pageserver).
## Problem
We didn't have a Postgres 16 snapshot of data to run compatibility tests
on, but now we have it (since the release).
## Summary of changes
- remove `@skip_on_postgres(PgVersion.V16, ...)` from compatibility
tests
this should allow test
test_delete_tenant_exercise_crash_safety_failpoints with
debug-pg16-Check.RETRY_WITH_RESTART-mock_s3-tenant-delete-before-remove-timelines-dir-True
to pass more reliably.
The test is still flaky, perhaps more after #5233, see #3831.
Do one more `timeline_checkpoint` *after* shutting down safekeepers
*before* shutting down pageserver. Put more effort into not compacting
or creating image layers.
With the addition of the "stateful event verification" the
test_consumption_metrics.py is now too crowded. Split it up for
pageserver and proxy.
Split from #5297.
Write collected metrics to disk to recover previously sent metrics on
restart.
Recover the previously collected metrics during startup, send them over
at right time
- send cached synthetic size before actual is calculated
- when `last_record_lsn` rolls back on startup
- stay at last sent `written_size` metric
- send `written_size_delta_bytes` metric as 0
Add test support: stateful verification of events in python tests.
Fixes: #5206
Cc: #5175 (loggings, will be enhanced in follow-up)
Split off from #5297.
There should be no functional changes here:
- refactor tenant metric "production" like previously timeline, allows
unit testing, though not interesting enough yet to test
- introduce type aliases for tuples
- extra refactoring for `collect`, was initially thinking it was useful
but will do a inline later
- shorter binding names
- support for future allocation reuse quests with IdempotencyKey
- move code out of tokio::select to make it rustfmt-able
- generification, allow later replacement of `&'static str` with enum
- add tests that assert sent event contents exactly
## Problem
VM should be updated if XLH_LOCK_ALL_FROZEN_CLEARED flags is set in
XLOG_HEAP_LOCK,XLOG_HEAP_2_LOCK_UPDATED WAL records
## Summary of changes
Add handling of this records in walingest.rs
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
It was utterly broken on v15 before commit 83e7e5dbbd, which fixed the
incorrect definition of XLOG_DBASE_CREATE_WAL_LOG. We never noticed
because we had no tests for it.
## Problem
In many places in test code, paths are built manually from what
NeonEnv.tenant_dir and NeonEnv.timeline_dir could do.
## Summary of changes
1. NeonEnv.tenant_dir and NeonEnv.timeline_dir moved under class
NeonPageserver as the path they use is per-pageserver instance.
2. Used these everywhere to replace manual path building
Closes#5258
---------
Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
## Problem
See https://neondb.slack.com/archives/C05L7D1JAUS/p1694614585955029https://www.notion.so/neondatabase/Duplicate-key-issue-651627ce843c45188fbdcb2d30fd2178
## Summary of changes
Swap old/new block references
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
## Problem
- `gh pr list` fails with `unknown argument "main"; please quote all
values that have spaces due to using a variable with the wrong name
- `permissions: write-all` are too wide for the job
## Summary of changes
- For variable name `HEAD` -> `BRANCH`
- Grant only required permissions for each job
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Refactor tests to have less globals.
This will allow to hopefully write more complex tests for our new metric
collection requirements in #5297. Includes reverted work from #4761
related to test globals.
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: MMeent <matthias@neon.tech>
## Problem
There was a bug in lfc_ensure_opened which actually disables LFC
## Summary of changes
Return true ifLFC file is normally opened
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
- pagestore_smgr.c had unnecessary WALSync() (see #5287 )
- Compute node dockerfile didn't build the neon_rmgr extension
- Add PostgreSQL 16 image to docker-compose tests
- Fix issue with high CPU usage in Safekeeper due to a bug in WALSender
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Problem
If @github-actions creates release PR, the CI pipeline is not triggered
(but we have `release-notify.yml` workflow that we expect to run on this
event).
I suspect this happened because @github-actions is not a repository
member.
Ref
https://github.com/neondatabase/neon/pull/5283#issuecomment-1715209291
## Summary of changes
- Use `CI_ACCESS_TOKEN` to create a PR
- Use `gh` instead of `thomaseizinger/create-pull-request`
- Restrict permissions for GITHUB_TOKEN to `contents: write` only
(required for `git push`)
## Problem
`ci-run/pr-*` branches (and attached PRs) should be deleted
automatically when their parent PRs get closed.
But there are not
## Summary of changes
- Fix if-condition
## Problem
We don't have this instruction written anywhere but in internal Slack
## Summary of changes
- Add `How to run a CI pipeline on Pull Requests from external
contributors` section to `CONTRIBUTING.md`
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
Motivation
==========
We observed two "indigestion" events on staging, each shortly after
restarting `pageserver-0.eu-west-1.aws.neon.build`. It has ~8k tenants.
The indigestion manifests as `Timeline::get` calls failing with
`exceeded evict iter limit` .
The error is from `page_cache.rs`; it was unable to find a free page and
hence failed with the error.
The indigestion events started occuring after we started deploying
builds that contained the following commits:
```
[~/src/neon]: git log --oneline c0ed362790caa368aa65ba57d352a2f1562fd6bf..15eaf78083ecff62b7669
091da1a1c8b4f60ebf8
15eaf7808 Disallow block_in_place and Handle::block_on (#5101)
a18d6d9ae Make File opening in VirtualFile async-compatible (#5280)
76cc87398 Use tokio locks in VirtualFile and turn with_file into macro (#5247)
```
The second and third commit are interesting.
They add .await points to the VirtualFile code.
Background
==========
On the read path, which is the dominant user of page cache & VirtualFile
during pageserver restart, `Timeline::get` `page_cache` and VirtualFile
interact as follows:
1. Timeline::get tries to read from a layer
2. This read goes through the page cache.
3. If we have a page miss (which is known to be common after restart),
page_cache uses `find_victim` to find an empty slot, and once it has
found a slot, it gives exclusive ownership of it to the caller through a
`PageWriteGuard`.
4. The caller is supposed to fill the write guard with data from the
underlying backing store, i.e., the layer `VirtualFile`.
5. So, we call into `VirtualFile::read_at`` to fill the write guard.
The `find_victim` method finds an empty slot using a basic
implementation of clock page replacement algorithm.
Slots that are currently in use (`PageReadGuard` / `PageWriteGuard`)
cannot become victims.
If there have been too many iterations, `find_victim` gives up with
error `exceeded evict iter limit`.
Root Cause For Indigestion
==========================
The second and third commit quoted in the "Motivation" section
introduced `.await` points in the VirtualFile code.
These enable tokio to preempt us and schedule another future __while__
we hold the `PageWriteGuard` and are calling `VirtualFile::read_at`.
This was not possible before these commits, because there simply were no
await points that weren't Poll::Ready immediately.
With the offending commits, there is now actual usage of
`tokio::sync::RwLock` to protect the VirtualFile file descriptor cache.
And we __know__ from other experiments that, during the post-restart
"rush", the VirtualFile fd cache __is__ too small, i.e., all slots are
taken by _ongoing_ VirtualFile operations and cannot be victims.
So, assume that VirtualFile's `find_victim_slot`'s
`RwLock::write().await` calls _will_ yield control to the executor.
The above can lead to the pathological situation if we have N runnable
tokio tasks, each wanting to do `Timeline::get`, but only M slots, N >>
M.
Suppose M of the N tasks win a PageWriteGuard and get preempted at some
.await point inside `VirtualFile::read_at`.
Now suppose tokio schedules the remaining N-M tasks for fairness, then
schedules the first M tasks again.
Each of the N-M tasks will run `find_victim()` until it hits the
`exceeded evict iter limit`.
Why? Because the first M tasks took all the slots and are still holding
them tight through their `PageWriteGuard`.
The result is massive wastage of CPU time in `find_victim()`.
The effort to find a page is futile, but each of the N-M tasks still
attempts it.
This delays the time when tokio gets around to schedule the first M
tasks again.
Eventually, tokio will schedule them, they will make progress, fill the
`PageWriteGuard`, release it.
But in the meantime, the N-M tasks have already bailed with error
`exceeded evict iter limit`.
Eventually, higher level mechanisms will retry for the N-M tasks, and
this time, there won't be as many concurrent tasks wanting to do
`Timeline::get`.
So, it will shake out.
But, it's a massive indigestion until then.
This PR
=======
This PR reverts the offending commits until we find a proper fix.
```
Revert "Use tokio locks in VirtualFile and turn with_file into macro (#5247)"
This reverts commit 76cc87398c.
Revert "Make File opening in VirtualFile async-compatible (#5280)"
This reverts commit a18d6d9ae3.
```
This adds PostgreSQL 16 as a vendored postgresql version, and adapts the
code to support this version.
The important changes to PostgreSQL 16 compared to the PostgreSQL 15
changeset include the addition of a neon_rmgr instead of altering Postgres's
original WAL format.
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
The sequence that can lead to a deadlock:
1. DELETE request gets all the way to `tenant.shutdown(progress,
false).await.is_err() ` , while holding TENANTS.read()
2. POST request for tenant creation comes in, calls `tenant_map_insert`,
it does `let mut guard = TENANTS.write().await;`
3. Something that `tenant.shutdown()` needs to wait for needs a
`TENANTS.read().await`.
The only case identified in exhaustive manual scanning of the code base
is this one:
Imitate size access does `get_tenant().await`, which does
`TENANTS.read().await` under the hood.
In the above case (1) waits for (3), (3)'s read-lock request is queued
behind (2)'s write-lock, and (2) waits for (1).
Deadlock.
I made a reproducer/proof-that-above-hypothesis-holds in
https://github.com/neondatabase/neon/pull/5281 , but, it's not ready for
merge yet and we want the fix _now_.
fixes https://github.com/neondatabase/neon/issues/5284