- trivial serialization roundtrip test for
`pageserver::repository::Value`
- add missing `start_paused = true` to a 15s test, making it effectively a 0s test
- completely unrelated avoidance of a future clippy lint (helps users on the beta channel)
## Problem
This is a log hygiene fix for an occasional test failure: the warn-level
logging in `imitate_timeline_cached_layer_accesses` can't distinguish
actual errors from shutdown cases.
## Summary of changes
Replaced `anyhow::Error` with an explicit `CollectKeySpaceError` type that
includes a conversion from `PageReconstructError::Cancelled`.
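A minimal sketch of the error-type shape this describes (using `thiserror`; variant names beyond `Cancelled` are illustrative, not the exact pageserver definitions):

```rust
use thiserror::Error;

/// Illustrative stand-in for the existing pageserver error.
#[derive(Debug, Error)]
enum PageReconstructError {
    #[error("cancelled")]
    Cancelled,
    #[error(transparent)]
    Other(#[from] anyhow::Error),
}

/// An explicit error type lets callers (and log filters) tell a real failure
/// from a benign shutdown-time cancellation.
#[derive(Debug, Error)]
enum CollectKeySpaceError {
    #[error("cancelled")]
    Cancelled,
    #[error(transparent)]
    PageRead(PageReconstructError),
    #[error(transparent)]
    Other(#[from] anyhow::Error),
}

impl From<PageReconstructError> for CollectKeySpaceError {
    fn from(e: PageReconstructError) -> Self {
        match e {
            // Map cancellation to cancellation so shutdown doesn't get logged as an error.
            PageReconstructError::Cancelled => CollectKeySpaceError::Cancelled,
            other => CollectKeySpaceError::PageRead(other),
        }
    }
}
```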
In OOM situations, knowing exactly how many walredo processes existed at
a given time would help afterwards to understand why the pageserver was
OOM-killed. Add a `pageserver_wal_redo_process_total` metric to keep
track of the total number of walredo processes started, shut down, and
killed since pageserver start.
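A hedged sketch of what such a counter could look like with the `prometheus` crate (the `outcome` label and the helper functions are illustrative; only the metric name comes from this PR):

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounterVec};

/// One counter with a label per lifecycle event, so
/// `started - (shutdown + killed)` approximates the number of live
/// walredo processes at any point in time.
static WAL_REDO_PROCESS_TOTAL: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "pageserver_wal_redo_process_total",
        "Walredo processes started/shutdown/killed since pageserver start",
        &["outcome"]
    )
    .expect("failed to register metric")
});

fn on_process_started() {
    WAL_REDO_PROCESS_TOTAL.with_label_values(&["started"]).inc();
}
fn on_process_shutdown() {
    WAL_REDO_PROCESS_TOTAL.with_label_values(&["shutdown"]).inc();
}
fn on_process_killed() {
    WAL_REDO_PROCESS_TOTAL.with_label_values(&["killed"]).inc();
}
```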
Closes #5722
---------
Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Christian Schwarz <me@cschwarz.com>
Includes the changes from #3689 that address its point 1, plus some
further improvements. In particular, this PR does the following:
* set `min_lsn` to a safe value to create branches from (and verify it
in tests)
* return `min_lsn` instead of `max_lsn` for `NoData` and `Past` (verify
it in test for `Past`, `NoData` is harder and not as important)
* return `commit_lsn` instead of `max_lsn` for `Future` (and verify it in
the tests)
* add some comments
Split out of #5686 to get something more minimal out to users.
Minor bugfix, noticed during manual code review. Use the same `JoinSet`
for in-progress tenants so we get the same benefit of buffered logging
as for attached tenants, and no single in-progress task can hold up
shutdown of other tenants.
The `get_timestamp_of_lsn` pageserver endpoint was added in #5497, but
the yml it added was wrong: the LSN is expected in hex format, not in
integer (decimal) format.
## Problem
We have observed the shutdown of a timeline taking a long time when a
deletion arrives at a busy time for the system. This suggests that we
are not respecting cancellation tokens promptly enough.
## Summary of changes
- Refactor timeline shutdown so that, rather than having a `shutdown()`
function that takes a flag for optionally flushing, there are two
distinct functions: one for graceful flushing shutdown, and another that
does the "normal" shutdown where we just set a cancellation token and
then tear down as fast as we can. This makes things a bit easier to
reason about, and enables us to remove the hand-written variant of
shutdown that was maintained in `delete.rs`.
- Layer flush task checks cancellation token more carefully
- Logical size calculation's handling of cancellation tokens is
simplified: rather than passing one in, it respects the Timeline's
cancellation token.
This PR doesn't touch RemoteTimelineClient, which will be a key thing to
fix as well, so that a slow remote storage op doesn't hold up shutdown.
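A minimal sketch of the split described above, assuming a `tokio_util::sync::CancellationToken`; names like `flush_and_shutdown` are illustrative rather than the actual Timeline API:

```rust
use tokio_util::sync::CancellationToken;

// Illustrative timeline; only the shutdown split is shown.
struct Timeline {
    cancel: CancellationToken,
}

impl Timeline {
    /// Graceful path: flush in-memory data first, then do the fast shutdown.
    async fn flush_and_shutdown(&self) {
        // Best-effort: a flush failure should not prevent shutdown.
        if let Err(e) = self.freeze_and_flush().await {
            eprintln!("flush on shutdown failed, proceeding anyway: {e}");
        }
        self.shutdown().await;
    }

    /// Fast path: signal cancellation and tear down as quickly as possible.
    async fn shutdown(&self) {
        self.cancel.cancel();
        // ... wait here for background tasks / gate holders to drain ...
    }

    async fn freeze_and_flush(&self) -> Result<(), String> {
        Ok(()) // placeholder for the real flush
    }
}
```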
## Problem
We have a funny 3-day timeout for connections between the compute and
pageserver. We want to get rid of it, so to do that we need to make sure
the compute is resilient to connection failures.
Closes: https://github.com/neondatabase/neon/issues/5518
## Summary of changes
This test makes the pageserver randomly drop the connection if the
failpoint is enabled, and ensures we can keep querying the pageserver.
This PR also reduces the default timeout to 10 minutes from 3 days.
* lower the level on auth success from info to debug (fixes #5820)
* don't log stacktraces on auth errors (as requested on Slack); we do this by introducing an `AuthError` type instead of using `anyhow` and `bail`.
* return errors that have been censored, for improved security.
## Problem
#5711 and #5367 raced -- the `SlotGuard` type needs `Gate` to properly
enforce its invariant that we may not drop an `Arc<Tenant>` from a slot.
## Summary of changes
Replace the TODO with the intended check of Gate.
## Problem
- Close #5784
## Summary of changes
- Update the `GetActiveTenantError` -> `QueryError` conversion process
in `pageserver/src/page_service.rs`
- Update the pytest logging exceptions in
`./test_runner/regress/test_tenant_detach.py`
## Problem
For quickly rotating JWT secrets, we want to be able to reload the JWT
public key file in the pageserver, and also support multiple JWT keys.
See #4897.
## Summary of changes
* Allow directories for the `auth_validation_public_key_path` config
param instead of just files. For the safekeepers, all of their config
options also support multiple JWT keys.
* For the pageservers, make the JWT public keys easily globally swappable
by using the `arc-swap` crate (see the sketch below).
* Add an endpoint to the pageserver, triggered by a POST to
`/v1/reload_auth_validation_keys`, that reloads the JWT public keys from
the pre-configured path (for security reasons, you cannot upload any
keys yourself).
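A hedged sketch of the swappable-keys idea using `arc-swap` (the types, the `JWT_KEYS` static, and `reload_from_dir` are illustrative, not the actual pageserver code):

```rust
use std::sync::Arc;
use arc_swap::ArcSwap;
use once_cell::sync::Lazy;

/// Illustrative stand-in for the decoded JWT public keys.
struct JwtKeys {
    keys: Vec<String>, // e.g. PEM-encoded public keys
}

/// Keys are globally readable and can be swapped atomically without
/// blocking readers; a reload handler builds a fresh set and swaps it in.
static JWT_KEYS: Lazy<ArcSwap<JwtKeys>> =
    Lazy::new(|| ArcSwap::from_pointee(JwtKeys { keys: Vec::new() }));

fn reload_from_dir(dir: &std::path::Path) -> std::io::Result<()> {
    let mut keys = Vec::new();
    for entry in std::fs::read_dir(dir)? {
        keys.push(std::fs::read_to_string(entry?.path())?);
    }
    // Readers that already loaded the old Arc keep using it;
    // new readers see the new set immediately.
    JWT_KEYS.store(Arc::new(JwtKeys { keys }));
    Ok(())
}

fn validate(_token: &str) -> bool {
    let keys = JWT_KEYS.load(); // cheap, lock-free snapshot
    !keys.keys.is_empty() // real code would try each key until one verifies
}
```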
Fixes #4897
---------
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Problem
When enabled, this failpoint would busy-spin in a loop that emits log
messages.
## Summary of changes
Move the failpoint inside a backoff::exponential block: it will still
spam the log, but at a much lower rate.
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
This was preventing it from being cleanly converted to a
`CalculateLogicalSizeError::Cancelled`, resulting in "Logical size
calculation failed" errors in the logs.
## Problem
See: https://github.com/neondatabase/neon/issues/5796
## Summary of changes
Completing the refactor is quite verbose and can be done in stages: each
interface that is currently called directly from a top-level mgr.rs
function can be moved into TenantManager once the relevant subsystems
have access to it.
Landing the initial change that creates TenantManager is useful because
it enables new code to use it without having to be altered later, and
sets us up to incrementally fix the existing code to use an explicit
`Arc<TenantManager>` instead of relying on the static TENANTS.
## Problem
Follows on from #5299
- We didn't have a generic way to protect a tenant undergoing changes:
`Tenant` had states, but for our arbitrary transitions between
secondary/attached, we need a general way to say "reserve this tenant
ID, and don't allow any other ops on it, but don't try and report it as
being in any particular state".
- The TenantsMap structure was behind an async RwLock, but it was never
correct to hold it across await points: that would block any other
changes for all tenants.
## Summary of changes
- Add the `TenantSlot::InProgress` value (sketched below). This means:
- Incoming administrative operations on the tenant should retry later
- Anything trying to read the live state of the tenant (e.g. a page
service reader) should retry later or block.
- Store TenantsMap in `std::sync::RwLock`
- Provide an extended `get_active_tenant_with_timeout` for page_service
to use, which will wait on InProgress slots as well as non-active
tenants.
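A hedged sketch of the slot idea (the real slot type holds more state; this only shows the `InProgress` placeholder and the sync-`RwLock` rule):

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Illustrative types; the real map holds Arc<Tenant> and richer slot contents.
struct Tenant;
type TenantId = String;

/// A slot is either a live tenant, or a placeholder marking
/// "an operation is changing this tenant, retry later".
enum TenantSlot {
    Attached(Arc<Tenant>),
    /// Reserved while attach/detach/load is in flight; administrative callers
    /// should retry, page_service should wait.
    InProgress,
}

/// The map is behind a std (not tokio) RwLock: it must never be held across
/// an await point, and a sync lock makes that rule easy to enforce.
type TenantsMap = RwLock<HashMap<TenantId, TenantSlot>>;

fn get_attached(map: &TenantsMap, id: &TenantId) -> Option<Arc<Tenant>> {
    match map.read().unwrap().get(id) {
        Some(TenantSlot::Attached(t)) => Some(Arc::clone(t)),
        Some(TenantSlot::InProgress) | None => None, // caller retries or waits
    }
}
```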
Closes: https://github.com/neondatabase/neon/issues/5378
---------
Co-authored-by: Christian Schwarz <christian@neon.tech>
## Problem
When shutting down a Tenant, it isn't just important to cause any
background tasks to stop. It's also important to wait until they have
stopped before declaring shutdown complete, in cases where we may re-use
the tenant's local storage for something else, such as running in
secondary mode, or creating a new tenant with the same ID.
## Summary of changes
A `Gate` class is added, inspired by
[seastar::gate](https://docs.seastar.io/master/classseastar_1_1gate.html).
For types that have an important lifetime that corresponds to some
physical resource, use of a Gate as well as a CancellationToken provides
a robust pattern for async requests & shutdown:
- Requests must always acquire the gate as long as they are using the
object
- Shutdown must set the cancellation token, and then `close()` the gate
to wait for requests in progress before returning.
This is not for memory safety: it's for expressing the difference
between "Arc<Tenant> exists", and "This tenant's files on disk are
eligible to be read/written".
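A minimal sketch of the request/shutdown pattern (the toy `Gate` below is built from a `tokio::sync::watch` channel and omits the real Gate's rejection of new entrants after close; it is illustrative, not the utils implementation):

```rust
use tokio::sync::watch;
use tokio_util::sync::CancellationToken;

/// Each in-flight request holds a GateGuard; `close()` resolves once every
/// guard has been dropped.
struct Gate(watch::Sender<()>);
struct GateGuard(#[allow(dead_code)] watch::Receiver<()>);

impl Gate {
    fn new() -> Self {
        Gate(watch::channel(()).0)
    }
    fn enter(&self) -> GateGuard {
        GateGuard(self.0.subscribe())
    }
    async fn close(&self) {
        self.0.closed().await; // completes when all guards are dropped
    }
}

struct Timeline {
    cancel: CancellationToken,
    gate: Gate,
}

impl Timeline {
    /// Request path: hold the gate for as long as we touch on-disk state.
    async fn handle_request(&self) -> Result<(), &'static str> {
        let _guard = self.gate.enter();
        if self.cancel.is_cancelled() {
            return Err("shutting down");
        }
        // ... read/write layer files while the guard is held ...
        Ok(())
    }

    /// Shutdown path: signal cancellation, then wait for in-flight requests.
    async fn shutdown(&self) {
        self.cancel.cancel();
        self.gate.close().await;
        // Only now is it safe to reuse or delete this timeline's local files.
    }
}
```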
- Both Tenant and Timeline get a Gate & CancellationToken.
- The Timeline gate is held during eviction of layers, and during
page_service requests.
- Existing cancellation support in page_service is refined to use the
timeline-scope cancellation token instead of a process-scope
cancellation token. This replaces the use of `task_mgr::associate_with`:
tasks no longer change their tenant/timeline identity after being
spawned.
The Tenant's Gate is not yet used, but will be important for
Tenant-scoped operations in secondary mode, where we must ensure that
our secondary-mode downloads for a tenant are gated wrt the activity of
an attached Tenant.
This is part of a broader move away from using the global-state driven
`task_mgr` shutdown tokens:
- less global state where we rely on implicit knowledge of what task a
given function is running in, and more explicit references to the
cancellation token that a particular function/type will respect, making
shutdown easier to reason about.
- eventually avoid the big global TASKS mutex.
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Improve the serde impls for several types (`Lsn`, `TenantId`,
`TimelineId`) by making them sensitive to
`Serializer::is_human_readable` (true for json, false for bincode).
Fixes #3511 by:
- Implement the custom serde for `Lsn`
- Implement the custom serde for `Id`
- Add the helper module `serde_as_u64` in `libs/utils/src/lsn.rs`
- Remove the unnecessary attr `#[serde_as(as = "DisplayFromStr")]` in
all possible structs
Additionally some safekeeper types gained serde tests.
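A hedged sketch of the `is_human_readable`-sensitive pattern (the hex `HI/LO` text form mirrors how LSNs are usually displayed, but this is illustrative rather than the exact `Lsn` impl):

```rust
use serde::{Deserialize, Deserializer, Serialize, Serializer};

/// Illustrative LSN newtype: text for human-readable formats (JSON),
/// raw u64 for compact binary formats (bincode).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Lsn(u64);

impl Serialize for Lsn {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        if serializer.is_human_readable() {
            serializer.collect_str(&format_args!("{:X}/{:X}", self.0 >> 32, self.0 & 0xffff_ffff))
        } else {
            serializer.serialize_u64(self.0)
        }
    }
}

impl<'de> Deserialize<'de> for Lsn {
    fn deserialize<D: Deserializer<'de>>(deserializer: D) -> Result<Self, D::Error> {
        if deserializer.is_human_readable() {
            let s = String::deserialize(deserializer)?;
            let (hi, lo) = s
                .split_once('/')
                .ok_or_else(|| serde::de::Error::custom("expected HI/LO"))?;
            let hi = u64::from_str_radix(hi, 16).map_err(serde::de::Error::custom)?;
            let lo = u64::from_str_radix(lo, 16).map_err(serde::de::Error::custom)?;
            Ok(Lsn((hi << 32) | lo))
        } else {
            Ok(Lsn(u64::deserialize(deserializer)?))
        }
    }
}
```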
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Problem
The scrubber didn't know how to find the latest index_part when
generations were in use.
## Summary of changes
- Teach the scrubber to do the same dance that pageserver does when
finding the latest index_part.json
- Teach the scrubber how to understand layer files with generation
suffixes.
- General improvement to testability: scan_metadata has a
machine-readable output that the testing `S3Scrubber` wrapper can read.
- Existing test coverage of scrubber was false-passing because it just
didn't see any data due to prefixing of data in the bucket. Fix that.
This is incremental improvement: the more confidence we can have in the
scrubber, the more we can use it in integration tests to validate the
state of remote storage.
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
Some of the log messages were lost with #4938. This PR adds some of
them back, most notably:
- starting an on-demand download
- successful completion of an on-demand download
- ability to see when there were many waiters for the layer download
- "unexpectedly on-demand downloading ..." is now `info!`
Additionally, some rare events which should never happen are now logged
at error level.
When introducing `get_and_upgrade`, I forgot that an `evict_and_wait`
would already have incremented the counter for started evictions, but an
upgrade would just "silently" cancel the eviction, as no drop would ever
run. These metrics are likely sources for alerts with the next release,
so it's important to keep them correct.
In an earlier PR
https://github.com/neondatabase/neon/pull/5743#discussion_r1378625244 I
added a FIXME, and there is a simple solution suggested by @jcsp, so
implement it. I am not sure why I did not implement this originally:
there is no concept of a permanent failure, so this failure will happen
quite often. I don't think the frequency is a problem, however.
Sadly, for std::fs::FileType there is only decimal and hex formatting,
no octal.
Following from discussion on
https://github.com/neondatabase/neon/pull/5436 where hacking an implicit
die-on-fatal-io behavior into an Error type was a source of disagreement
-- in this PR, dying on fatal I/O errors is explicit, with `fatal_err`
and `maybe_fatal_err` helpers in the `MaybeFatalIo` trait, which is
implemented for std::io::Result.
To enable this approach with `crashsafe_overwrite`, the return type of
that function is changed to std::io::Result -- the previous error enum
for this function was not used for any logic, and the utility of saying
exactly which step in the function failed is outweighed by the hygiene
of having an I/O function return an io::Result.
The initial use case for these helpers is the deletion queue.
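A hedged sketch of the helper shape (the error classification in `maybe_fatal_err` here is an assumption for illustration; the real trait's rules live in the pageserver):

```rust
use std::io;

/// Make "die on fatal I/O error" an explicit, visible call at each call
/// site instead of hiding it inside an error type.
trait MaybeFatalIo<T> {
    /// Any error is fatal: log and abort the process.
    fn fatal_err(self, context: &str) -> T;
    /// Only errors we consider unrecoverable are fatal; others are returned.
    fn maybe_fatal_err(self, context: &str) -> io::Result<T>;
}

impl<T> MaybeFatalIo<T> for io::Result<T> {
    fn fatal_err(self, context: &str) -> T {
        self.unwrap_or_else(|e| {
            eprintln!("fatal I/O error: {context}: {e}");
            std::process::abort();
        })
    }

    fn maybe_fatal_err(self, context: &str) -> io::Result<T> {
        // Assumption for this sketch: only EIO/EROFS count as unrecoverable.
        if let Err(e) = &self {
            if matches!(e.raw_os_error(), Some(5 /* EIO */) | Some(30 /* EROFS */)) {
                eprintln!("fatal I/O error: {context}: {e}");
                std::process::abort();
            }
        }
        self
    }
}
```

A call site then reads something like `std::fs::rename(&tmp, &dst).fatal_err("rename during overwrite")` (hypothetical usage).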
With the layer implementation as done in #4938, it is possible via
cancellation to cause two concurrent downloads on the same path, due to
how `RemoteTimelineClient::download_remote_layer` handles tempfiles.
Thread the init semaphore through the spawned download task to make this
impossible.
Right before merging, I added a loop to `fn
LayerInner::get_or_maybe_download`, which was always supposed to be
there. However, I had forgotten to restart initialization instead of
waiting for the eviction to happen, which is needed to support the
original design goal of "eviction should always lose to redownload (or
init)". This was wrong. After this fix, nothing bad will happen even if
the `spawn_blocking` queue is blocked on something.
Part of #5737.
## Problem
Some requests with `Authorization` header did not properly set the
`Bearer ` prefix. Problem explained here
https://github.com/neondatabase/cloud/issues/6390.
## Summary of changes
Added `Bearer ` prefix to missing requests.
## Problem
test_stderr hangs on macOS.
See https://neondb.slack.com/archives/C036U0GRMRB/p1698438997903919
## Summary of changes
Always handle POLLHUP to prevent infinite loop.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
The `LayerInner::version` never needed to be read in more than one
place; this was clarified while fixing #5737, of which this is the first
step. It reduces the amount of possibly-wrong atomics usage in Layer,
but does not really fix anything.
## Problem
In #5658 we suppressed the first-iteration output from these logs, but
the volume of warnings is still problematic.
## Summary of changes
- Downgrade all slow task warnings to INFO. The information is still
there if we actively want to know about which tasks are running slowly,
without polluting the overall stream of warnings with situations that
are unsurprising to us.
- Revert the previous change so that we output on the first iteration as
we used to do. There is no reason to suppress these, now that the
severity is just info.
## Problem
Neon doesn't compile on nightly and had numerous clippy complaints.
## Summary of changes
1. Fixed troublesome dependency
2. Fixed or ignored the lints where appropriate
- include Layer generation in the default display, with
Generation::Broken as `-broken`
- omit layer from `layer_gc` span because the API it works with needs to
support N layers, so the API needs to log each layer
## Problem
If there were stray files in the timelines/ dir after tenant deletion,
the pageserver could panic with an out-of-range error.
## Summary of changes
Use iterator `take()`, which doesn't care if the number of elements
available is less than requested.
The flush task logs a backtrace if it tries to upload and remote
timeline client is already in stopped state.
Therefore we cannot shut them down concurrently: flush task must be shut
down first.
This wasn't noticed earlier because:
- Timeline deletions IRL usually happen when not much is being written
- In tests, there is a global allow-list for this log
It's not obvious whether removing the global log allow-list is safe;
this PR was prompted by how the log spam got in my way when testing
deletion changes.
#5649 added the concept of dangling layers, which #4938 uses, but only
partially. I forgot to change `schedule_compaction_update` to not
schedule deletions, in order to uphold the "if you have a layer, you can
read it" invariant.
With that fix now remembered, I don't think these checks should ever
fail except for a mistake like the one I already made. These changes
might be useful for protecting future changes, even though the Layer
carrying the generation AND the `schedule_(gc|compaction)_update`
require strong Arcs.
The rationale for keeping the `#[cfg(feature = "testing")]` gate is that
the checks would worsen any leak situation which might come up.
- Add a new util `project_build_tag` macro, similar to
`project_git_version`
- Update `set_build_info_metric` to accept and make use of the
`build_tag` info (a sketch of the labeled metric is below)
- Update all code that uses `set_build_info_metric`
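A hedged sketch of a labeled build-info gauge with the `prometheus` crate (metric and label names here are illustrative, not necessarily what `set_build_info_metric` uses):

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_gauge_vec, IntGaugeVec};

static BUILD_INFO: Lazy<IntGaugeVec> = Lazy::new(|| {
    register_int_gauge_vec!(
        "build_info",
        "Build/version information",
        &["revision", "build_tag"]
    )
    .expect("failed to register metric")
});

fn set_build_info_metric(revision: &str, build_tag: &str) {
    // The gauge value itself is always 1; the information lives in the labels.
    BUILD_INFO.with_label_values(&[revision, build_tag]).set(1);
}
```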
## Problem
The pageserver had two ways of loading a tenant:
- `spawn_load` would trust on-disk content to reflect all existing
timelines
- `spawn_attach` would list timelines in remote storage.
It was incorrect for `spawn_load` to trust local disk content, because
it doesn't know if the tenant might have been attached and written
somewhere else. To make this correct would require some generation
number checks, but the payoff is to avoid one S3 op per tenant at
startup, so it's not worth the complexity -- it is much simpler to have
one way to load a tenant.
## Summary of changes
- `Tenant` objects are always created with `Tenant::spawn`: there is no
more distinction between "load" and "attach".
- The ability to run without remote storage (for `neon_local`) is
preserved by adding a branch inside `attach` that uses a fallback
`load_local` if no remote_storage is present.
- Fix attaching a tenant when it has a timeline with no IndexPart: this
can occur if a newly created timeline manages to upload a layer before
it has uploaded an index.
- The attach marker file that used to indicate whether a tenant should
be "loaded" or "attached" is no longer needed, and is removed.
- The GenericRemoteStorage interface gets a `list()` method that maps
more directly to what ListObjects does, returning both keys and common
prefixes (see the sketch after this list). The existing `list_files` and
`list_prefixes` methods are just calls into `list()` now -- these can be
removed later if we would like to shrink the interface a bit.
- The remote deletion marker is moved into `timelines/` and detected as
part of listing timelines rather than as a separate GET request. If any
existing tenants have a marker in the old location (unlikely, only
happens if something crashes mid-delete), then they will rely on the
control plane retrying to complete their deletion.
- Revise S3 calls for timeline listing and tenant load to take a
cancellation token, and retry forever: it never makes sense to make a
Tenant broken because of a transient S3 issue.
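A hedged sketch of the listing shape referenced above (types and names are illustrative, not the actual `GenericRemoteStorage` signatures):

```rust
/// A single listing result that surfaces both objects and "directories"
/// (common prefixes), as S3 ListObjectsV2 does.
#[derive(Default, Debug)]
struct Listing {
    /// Common prefixes under the requested prefix (e.g. per-timeline dirs).
    prefixes: Vec<String>,
    /// Object keys under the requested prefix.
    keys: Vec<String>,
}

trait RemoteStorageExt {
    fn list(&self, prefix: Option<&str>, delimiter: Option<char>) -> std::io::Result<Listing>;

    /// list_files/list_prefixes can then be thin wrappers over list().
    fn list_files(&self, prefix: Option<&str>) -> std::io::Result<Vec<String>> {
        Ok(self.list(prefix, None)?.keys)
    }
    fn list_prefixes(&self, prefix: Option<&str>) -> std::io::Result<Vec<String>> {
        Ok(self.list(prefix, Some('/'))?.prefixes)
    }
}
```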
## Breaking changes
- The remote deletion marker is moved from `deleted` to
`timelines/deleted` within the tenant prefix. Markers in the old
location will be ignored: it is the control plane's responsibility to
retry deletions until they succeed. Markers in the new location will be
tolerated by the previous release of pageserver via
https://github.com/neondatabase/neon/pull/5632
- The local `attaching` marker file is no longer written. Therefore, if
the pageserver is downgraded after running this code, the old pageserver
will not be able to distinguish between partially attached tenants and
fully attached tenants. This would only impact tenants that were partway
through attaching at the moment of downgrade. In the unlikely event
that we do experience an incident that prompts us to roll back, then we
may check for attach operations in flight, and manually insert
`attaching` marker files as needed.
---------
Co-authored-by: Christian Schwarz <christian@neon.tech>
Previously, if the walredo process crashed, we would try to spawn a fresh one
every 2 seconds, which is expensive in itself, but also results in a
high I/O load from the part of the compaction prior to the failure,
which we re-run every 2 seconds.
Closes: https://github.com/neondatabase/neon/issues/5671