## Problem
We need automated tests of extensions shipped with Neon to detect
possible problems.
## Summary of changes
A new image neon-test-extensions is added. Workflow changes to test the
shipped extensions are added as well.
Currently, the regression tests, shipped with extensions are in use.
Some extensions, i.e. rum, timescaledb, rdkit, postgis, pgx_ulid, pgtap,
pg_tiktoken, pg_jsonschema, pg_graphql, kq_imcx, wal2json_2_5 are
excluded due to problems or absence of internal tests.
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
The new features have deteriorated layer flushing, most recently with
#7927. Changes:
- inline `Timeline::freeze_inmem_layer` to the only caller
- carry the TimelineWriterState guard to the actual point of freezing
the layer
- this allows us to `#[cfg(feature = "testing")]` the assertion added in
#7927
- remove duplicate `flush_frozen_layer` in favor of splitting the
`flush_frozen_layers_and_wait`
- this requires starting the flush loop earlier for `checkpoint_distance
< initdb size` tests
Quite a few existing test cases create their own timelines instead of
using the default one. This pull request highlights that and hopefully
people can write simpler tests in the future.
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Yuchen Liang <70461588+yliang412@users.noreply.github.com>
As seen with the pgvector 0.7.0 index builds, we can receive large
batches of images, leading to very large L0 layers in the range of 1GB.
These large layers are produced because we are only able to roll the
layer after we have witnessed two different Lsns in a single
`DataDirModification::commit`. As the single Lsn batches of images can
span over multiple `DataDirModification` lifespans, we will rarely get
to write two different Lsns in a single `put_batch` currently.
The solution is to remember the TimelineWriterState instead of eagerly
forgetting it until we really open the next layer or someone else
flushes (while holding the write_guard).
Additional changes are test fixes to avoid "initdb image layer
optimization" or ignoring initdb layers for assertion.
Cc: #7197 because small `checkpoint_distance` will now trigger the
"initdb image layer optimization"
Reverts neondatabase/neon#7956
Rationale: compute incompatibilties
Slack thread:
https://neondb.slack.com/archives/C033RQ5SPDH/p1718011276665839?thread_ts=1718008160.431869&cid=C033RQ5SPDH
Relevant quotes from @hlinnaka
> If we go through with the current release candidate, but the compute
is pinned, people who create new projects will get that warning, which
is silly. To them, it looks like the ICU version was downgraded, because
initdb was run with newer version.
> We should upgrade the ICU version eventually. And when we do that,
users with old projects that use ICU will start to see that warning. I
think that's acceptable, as long as we do homework, notify users, and
communicate that properly.
> When do that, we should to try to upgrade the storage and compute
versions at roughly the same time.
A simple API to collect some statistics after compaction to easily
understand the result.
The tool reads the layer map, and analyze range by range instead of
doing single-key operations, which is more efficient than doing a
benchmark to collect the result. It currently computes two key metrics:
* Latest data access efficiency, which finds how many delta layers /
image layers the system needs to iterate before returning any key in a
key range.
* (Approximate) PiTR efficiency, as in
https://github.com/neondatabase/neon/issues/7770, which is simply the
number of delta files in the range. The reason behind that is, assume no
image layer is created, PiTR efficiency is simply the cost of collect
records from the delta layers, and the replay time. Number of delta
files (or in the future, estimated size of reads) is a simple yet
efficient way of estimating how much effort the page server needs to
reconstruct a page.
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
Due to the upcoming End of Life (EOL) for Debian 11, we need to upgrade
the base OS for Pageservers from Debian 11 to Debian 12 for security
reasons.
When deploying a new Pageserver on Debian 12 with the same binary built
on
Debian 11, we encountered the following errors:
```
could not execute operation: pageserver error, status: 500,
msg: Command failed with status ExitStatus(unix_wait_status(32512)):
/usr/local/neon/v16/bin/initdb: error while loading shared libraries:
libicuuc.so.67: cannot open shared object file: No such file or directory
```
and
```
could not execute operation: pageserver error, status: 500,
msg: Command failed with status ExitStatus(unix_wait_status(32512)):
/usr/local/neon/v14/bin/initdb: error while loading shared libraries:
libssl.so.1.1: cannot open shared object file: No such file or directory
```
These issues occur when creating new projects.
## Summary of changes
- To address these issues, we configured PostgreSQL build to use
statically linked OpenSSL and ICU libraries.
- This resolves the missing shared library errors when running the
binaries on Debian 12.
Closes: https://github.com/neondatabase/cloud/issues/12648
## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [x] Do not forget to reformat commit message to not include the above
checklist
It makes them much easier to reason about, and allows other SQL tooling
to operate on them like language servers, formatters, etc.
I also brought back the removed migrations such that we can more easily
understand what they were. I included a "-- SKIP" comment describing why
those migrations are now skipped. We no longer skip migrations by
checking if it is empty, but instead check to see if the migration
starts with "-- SKIP".
## Problem
We don't carry run-* labels from external contributors' PRs to
ci-run/pr-* PRs. This is not really convenient.
Need to sync labels in approved-for-ci-run workflow.
## Summary of changes
Added the procedure of transition of labels from the original PR
## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
In #7927 I needed to fix this test case, but the fixes should be
possible to land irrespective of the layer ingestion code change.
The most important fix is the behavior if an image layer is found: the
assertion message formatting raises a runtime error, which obscures the
fact that we found an image layer.
We had a random sleep in the beginning of partial backup task, which was
needed for the first partial backup deploy. It helped with gradual
upload of segments without causing network overload. Now partial backup
is deployed everywhere, so we don't need this random sleep anymore.
We also had an issue related to this, in which manager task was not shut
down for a long time. The cause of the issue is this random sleep that
didn't take timeline cancellation into account, meanwhile manager task
waited for partial backup to complete.
Fixes https://github.com/neondatabase/neon/issues/7967
currently we warn even by going over a single byte. even that will be
hit much more rarely once #7927 lands, but get this in earlier.
rationale for 2*checkpoint_distance: anything smaller is not really
worth a warn.
we have an global allowed_error for this warning, which still cannot be
removed nor can it be removed with #7927 because of many tests with very
small `checkpoint_distance`.
M-series macOS has different alignments/size for some fields (which I
did not investigate in detail) and therefore this test cannot pass on
macOS. Fixed by using `<=` for the comparison so that we do not test for
an exact match.
observed by @yliang412
Signed-off-by: Alex Chi Z <chi@neon.tech>
This adds retries to the bulk deletion, because if there is a certain
chance n that a request fails, the chance that at least one of the
requests in a chain of requests fails increases exponentially.
We've had similar issues with the S3 DR tests, which in the end yielded
in adding retries at the remote_storage level. Retries at the top level
are not sufficient when one remote_storage "operation" is multiple
network requests in a trench coat, especially when there is no notion of
saving the progress: even if prior deletions had been successful, we'd
still need to get a 404 in order to continue the loop and get to the
point where we failed in the last iteration. Maybe we'll fail again but
before we've even reached it.
Retries at the bottom level avoid this issue because they have the
notion of progress and also when one network operation fails, only that
operation is retried.
First part of #7931.
Related to #7341 tenant deletion will end up shutting down timelines
twice, once before actually starting and the second time when per
timeline deletion is requested. Shutting down TimelineMetrics causes
underflows. Add an atomic boolean and only do the shutdown once.
Closes#7406.
## Problem
When a `get_lsn_by_timestamp` request is cancelled, an anyhow error is
exposed to handle that case, which verbosely logs the error. However, we
don't benefit from having the full backtrace provided by anyhow in this
case.
## Summary of changes
This PR introduces a new `ApiError` type to handle errors caused by
cancelled request more robustly.
- A new enum variant `ApiError::Cancelled`
- Currently the cancelled request is mapped to status code 500.
- Need to handle this error in proxy's `http_util` as well.
- Added a failpoint test to simulate cancelled `get_lsn_by_timestamp`
request.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
## Problem
As described in #7952, the controller's attempt to reconcile a tenant
before finally deleting it can get hung up waiting for the compute
notification hook to accept updates.
The fact that we try and reconcile a tenant at all during deletion is
part of a more general design issue (#5080), where deletion was
implemented as an operation on attached tenant, requiring the tenant to
be attached in order to delete it, which is not in principle necessary.
Closes: #7952
## Summary of changes
- In the pageserver deletion API, only do the traditional deletion path
if the tenant is attached. If it's secondary, then tear down the
secondary location, and then do a remote delete. If it's not attached at
all, just do the remote delete.
- In the storage controller, instead of ensuring a tenant is attached
before deletion, do a best-effort detach of the tenant, and then call
into some arbitrary pageserver to issue a deletion of remote content.
The pageserver retains its existing delete behavior when invoked on
attached locations. We can remove this later when all users of the API
are updated to either do a detach-before-delete. This will enable
removing the "weird" code paths during startup that sometimes load a
tenant and then immediately delete it, and removing the deletion markers
on tenants.
## Problem
Currently we serialize the `TimelineMetadata` into bytes to put it into
`index_part.json`. This `Vec<u8>` (hopefully `[u8; 512]`) representation
was chosen because of problems serializing TimelineId and Lsn between
different serializers (bincode, json). After #5335, the serialization of
those types became serialization format aware or format agnostic.
We've removed the pageserver local `metadata` file writing in #6769.
## Summary of changes
Allow switching from the current serialization format to plain JSON for
the legacy TimelineMetadata format in the future by adding a competitive
serialization method to the current one
(`crate::tenant::metadata::modern_serde`), which accepts both old bytes
and new plain JSON.
The benefits of this are that dumping the index_part.json with pretty
printing no longer produces more than 500 lines of output, but after
enabling it produces lines only proportional to the layer count, like:
```json
{
"version": ???,
"layer_metadata": { ... },
"disk_consistent_lsn": "0/15FD5D8",
"legacy_metadata": {
"disk_consistent_lsn": "0/15FD5D8",
"prev_record_lsn": "0/15FD5A0",
"ancestor_timeline": null,
"ancestor_lsn": "0/0",
"latest_gc_cutoff_lsn": "0/149FD18",
"initdb_lsn": "0/149FD18",
"pg_version": 15
}
}
```
In the future, I propose we completely stop using this legacy metadata
type and wasting time trying to come up with another version numbering
scheme in addition to the informative-only one already found in
`index_part.json`, and go ahead with storing metadata or feature flags
on the `index_part.json` itself.
#7699 is the "one release after" changes which starts to produce
metadata in the index_part.json as json.
fixes https://github.com/neondatabase/neon/issues/7790 (duplicating most
of the issue description here for posterity)
# Background
From the time before always-authoritative `index_part.json`, we had to
handle duplicate layers. See the RFC for an illustration of how
duplicate layers could happen:
a8e6d259cb/docs/rfcs/027-crash-consistent-layer-map-through-index-part.md (L41-L50)
As of #5198 , we should not be exposed to that problem anymore.
# Problem 1
We still have
1. [code in
Pageserver](82960b2175/pageserver/src/tenant/timeline.rs (L4502-L4521))
than handles duplicate layers
2. [tests in the test
suite](d9dcbffac3/test_runner/regress/test_duplicate_layers.py (L15))
that demonstrates the problem using a failpoint
However, the test in the test suite doesn't use the failpoint to induce
a crash that could legitimately happen in production.
What is does instead is to return early with an `Ok()`, so that the code
in Pageserver that handles duplicate layers (item 1) actually gets
exercised.
That "return early" would be a bug in the routine if it happened in
production.
So, the tests in the test suite are tests for their own sake, but don't
serve to actually regress-test any production behavior.
# Problem 2
Further, if production code _did_ (it nowawdays doesn't!) create a
duplicate layer, the code in Pageserver that handles the condition (item
1 above) is too little and too late:
* the code handles it by discarding the newer `struct Layer`; that's
good.
* however, on disk, we have already overwritten the old with the new
layer file
* the fact that we do it atomically doesn't matter because ...
* if the new layer file is not bit-identical, then we have a cache
coherency problem
* PS PageCache block cache: caches old bit battern
* blob_io offsets stored in variables, based on pre-overwrite bit
pattern / offsets
* => reading based on these offsets from the new file might yield
different data than before
# Solution
- Remove the test suite code pertaining to Problem 1
- Move & rename test suite code that actually tests RFC-27
crash-consistent layer map.
- Remove the Pageserver code that handles duplicate layers too late
(Problem 1)
- Use `RENAME_NOREPLACE` to prevent over-rename the file during
`.finish()`, bail with an error if it happens (Problem 2)
- This bailing prevents the caller from even trying to insert into the
layer map, as they don't even get a `struct Layer` at hand.
- Add `abort`s in the place where we have the layer map lock and check
for duplicates (Problem 2)
- Note again, we can't reach there because we bail from `.finish()` much
earlier in the code.
- Share the logic to clean up after failed `.finish()` between image
layers and delta layers (drive-by cleanup)
- This exposed that test `image_layer_rewrite` was overwriting layer
files in place. Fix the test.
# Future Work
This PR adds a new failure scenario that was previously "papered over"
by the overwriting of layers:
1. Start a compaction that will produce 3 layers: A, B, C
2. Layer A is `finish()`ed successfully.
3. Layer B fails mid-way at some `put_value()`.
4. Compaction bails out, sleeps 20s.
5. Some disk space gets freed in the meantime.
6. Compaction wakes from sleep, another iteration starts, it attempts to
write Layer A again. But the `.finish()` **fails because A already
exists on disk**.
The failure in step 5 is new with this PR, and it **causes the
compaction to get stuck**.
Before, it would silently overwrite the file and "successfully" complete
the second iteration.
The mitigation for this is to `/reset` the tenant.
## Problem
This was an oversight when adding heatmaps: because they are at the top
level of the tenant, they aren't included in the catch-all list & delete
that happens for timeline paths.
This doesn't break anything, but it leaves behind a few kilobytes of
garbage in the S3 bucket after a tenant is deleted, generating work for
the scrubber.
## Summary of changes
- During deletion, explicitly remove the heatmap file
- In test_tenant_delete_smoke, upload a heatmap so that the test would
fail its "remote storage empty after delete" check if we didn't delete
it.
RemoteTimelineClient maintains a copy of "next IndexPart" as a number of
fields which are like an IndexPart but this is not immediately obvious.
Instead of multiple fields, maintain a `dirty` ("next IndexPart") and
`clean` ("uploaded IndexPart") fields.
Additional cleanup:
- rename `IndexPart::disk_consistent_lsn` accessor
`duplicated_disk_consistent_lsn`
- no one except scrubber should be looking at it, even scrubber is a
stretch
- remove usage elsewhere (pagectl used by tests, metadata scan endpoint)
- serialize index part *before* the index upload operation
- avoid upload operation being retried because of serialization error
- serialization error is fatal anyway for timeline -- it can only make
transient local progress after that, at least the error is bubbled up
now
- gather exploded IndexPart fields into single actual
`UploadQueueInitialized::dirty` of which the uploaded snapshot is
serialized
- implement the long wished monotonicity check with the `clean`
IndexPart with an assertion which is not expected to fire
Continued work from #7860 towards next step of #6994.
in build logs we get a lot of lines for building the compute node images
because of verbose tar unpack. we know the sha256 so we don't need to
log the contents. my hope is that this will allow us more reliably use
the github live updating log view.
I suspected a wakeup could be lost with
`remote_storage::support::DownloadStream` if the cancellation and inner
stream wakeups happen simultaneously. The next poll would only return
the cancellation error without setting the wakeup. There is no lost
wakeup because the single future for getting the cancellation error is
consumed when the value is ready, and a new future is created for the
*next* value. The new future is always polled. Similarly, if only the
`Stream::poll_next` is being used after a `Some(_)` value has been
yielded, it makes no sense to have an expectation of a wakeup for the
*(N+1)th* stream value already set because when a value is wanted,
`Stream::poll_next` will be called.
A test is added to show that the above is true.
Additionally, there was a question of these cancellations and timeouts
flowing to attached or secondary tenant downloads. A test is added to
show that this, in fact, happens.
Lastly, a warning message is logged when a download stream is polled
after a timeout or cancellation error (currently unexpected) so we can
rule it out while troubleshooting.
## Problem
Page LSN is not set while VM update.
May be reason of test_vm_bits flukyness.
Buit more serious issues can be also caused by wrong LSN.
Related: https://github.com/neondatabase/neon/pull/7935
## Summary of changes
- In `apply_in_neon`, set the LSN bytes when applying records of type
`ClearVisibilityMapFlags`
## Problem
There was a bug in dynamic rate limiter, which exhausted CPU in proxy
and proxy wasn't able to accept any connections.
## Summary of changes
1. `if self.available > 1` -> `if self.available >= 1`
2. remove `timeout_at` to use just timeout
3. remove potential infinite loops which can exhaust CPUs.
## Problem
CreateImageLayersError and CompactionError had proper From
implementations, but compact_legacy was explicitly squashing all image
layer errors into an anyhow::Error anyway.
This led to errors like:
```
Error processing HTTP request: InternalServerError(timeline shutting down
Stack backtrace:
0: <<anyhow::Error as core::convert::From<pageserver::tenant::timeline::CreateImageLayersError>>::from as core::ops::function::FnOnce<(pageserver::tenant::timeline::CreateImageLayersError,)>>::call_once
at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/ops/function.rs:250:5
1: <core::result::Result<alloc::vec::Vec<pageserver::tenant::storage_layer::layer::ResidentLayer>, pageserver::tenant::timeline::CreateImageLayersError>>::map_err::<anyhow::Error, <anyhow::Error as core::convert::From<pageserver::tenant::timeline::CreateImageLayersError>>::from>
at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6/library/core/src/result.rs:829:27
2: <pageserver::tenant::timeline::Timeline>::compact_legacy::{closure#0}
at pageserver/src/tenant/timeline/compaction.rs:125:36
3: <pageserver::tenant::timeline::Timeline>::compact::{closure#0}
at pageserver/src/tenant/timeline.rs:1719:84
4: pageserver::http::routes::timeline_checkpoint_handler::{closure#0}::{closure#0}
```
Closes: https://github.com/neondatabase/neon/issues/7861
The storage controller has 'drop' APIs for tenants and nodes, for use in
situations where something weird has happened:
- node-drop is useful until we implement proper node decom, or if we
have a partially provisioned node that somehow gets registered with the
storage controller but is then dead.
- tenant-drop is useful if we accidentally add a tenant that shouldn't
be there at all, or if we want to make the controller forget about a
tenant without deleting its data. For example, if one uses the
tenant-warmup command with a bad tenant ID and needs to clean that up.
The drop commands require an `--unsafe` parameter, to reduce the chance
that someone incorrectly assumes these are the normal/clean ways to
delete things.
This PR also adds a convenience command for setting the time based
eviction parameters on a tenant. This is useful when onboarding an
existing tenant that has high resident size due to storage amplification
in compaction: setting a lower time based eviction threshold brings down
the resident size ahead of doing a shard split.
Store logical replication origin in KV storage
## Problem
See #6977
## Summary of changes
* Extract origin_lsn from commit WAl record
* Add ReplOrigin key to KV storage and store origin_lsn
* In basebackup replace snapshot origin_lsn with last committed
origin_lsn at basebackup LSN
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Alex Chi Z <chi@neon.tech>
## Problem
Currently, we leave `index_part.json` objects from old generations
behind each time a pageserver restarts or a tenant is migrated. This
doesn't break anything, but it's annoying when a tenant has been around
for a long time and starts to accumulate 10s-100s of these.
Partially implements: #7043
## Summary of changes
- Add a new `pageserver-physical-gc` command to `s3_scrubber`
The name is a bit of a mouthful, but I think it makes sense:
- GC is the accurate term for what we are doing here: removing data that
takes up storage but can never be accessed.
- "physical" is a necessary distinction from the "normal" GC that we do
online in the pageserver, which operates at a higher level in terms of
LSNs+layers, whereas this type of GC is purely about S3 objects.
- "pageserver" makes clear that this command deals exclusively with
pageserver data, not safekeeper.
The keyspace utils like `is_rel_size_key` or `is_rel_fsm_block_key` and
many others are free functions and have to be either imported separately
or specified with the full path starting in `pageserver_api:🔑:`.
This is less convenient than if these functions were just inherent
impls.
Follow-up of #7890Fixes#6438
when running the regress tests locally without any environment variables
we use on CI, `test_pageserver_compaction_smoke` fails with division by
zero. fix it temporarily by allowing no vectored read happening. to be
cleaned when vectored get validation gets removed and the default value
can be changed.
Cc: https://github.com/neondatabase/neon/issues/7381
In issue #5590 it was proposed to implement metrics for Azure blob
storage. This PR implements them except for the part that performs the
rename, which is left for a followup.
Closes#5590
The general partial backup idea is that each safekeeper keeps only one
partial segment in remote storage at a time. Sometimes this is not true,
for example if we uploaded object to S3 but got an error when tried to
remove the previous upload. In this case we still keep a list of all
potentially uploaded objects in safekeeper state.
This commit prints a warning to logs if there is too many objects in
safekeeper state. This is not expected and we should try to fix this
state, we can do this by running gc.
I haven't seen this being an issue anywhere, but printing a warning is
something that I wanted to do and forgot in initial PR.
## Problem
- Because GC exposes all errors as an anyhow::Error, we have
intermittent issues with spurious log errors during shutdown, e.g. in
this failure of a performance test
https://neon-github-public-dev.s3.amazonaws.com/reports/main/9300804302/index.html#suites/07874de07c4a1c9effe0d92da7755ebf/214a2154f6f0217a/
```
Gc failed 1 times, retrying in 2s: shutting down
```
GC really doesn't do a lot of complicated IO: it doesn't benefit from
the backtrace capabilities of anyhow::Error, and can be expressed more
robustly as an enum.
## Summary of changes
- Add GcError type and use it instead of anyhow::Error in GC functions
- In `gc_iteration_internal`, return GcError::Cancelled on shutdown
rather than Ok(()) (we only used Ok before because we didn't have a
clear cancellation error variant to use).
- In `gc_iteration_internal`, skip past timelines that are shutting
down, to avoid having to go through another GC iteration if we happen to
see a deleting timeline during a GC run.
- In `refresh_gc_info_internal`, avoid an error case where a timeline
might not be found after being looked up, by carrying an Arc<Timeline>
instead of a TimelineId between the first loop and second loop in the
function.
- In HTTP request handler, handle Cancelled variants as 503 instead of
turning all GC errors into 500s.
## Problem
In all cases, AncestorStopping is equivalent to Cancelled.
This became more obvious in
https://github.com/neondatabase/neon/pull/7912#discussion_r1620582309
when updating these error types.
## Summary of changes
- Remove AncestorStopping, always use Cancelled instead
This pull request adds necessary interfaces to deterministically create
scenarios we want to test. Simplify some test cases to use this
interface to make it stable + reproducible.
Compaction test will be able to use this interface. Also the upcoming
delete tombstone tests will use this interface to make test
reproducible.
## Summary of changes
* `force_create_image_layer`
* `force_create_delta_layer`
* `force_advance_lsn`
* `create_test_timeline_with_states`
* `branch_timeline_test_with_states`
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
This is a preparation for
https://github.com/neondatabase/neon/issues/6337.
The idea is to add FullAccessTimeline, which will act as a guard for
tasks requiring access to WAL files. Eviction will be blocked on these
tasks and WAL won't be deleted from disk until there is at least one
active FullAccessTimeline.
To get FullAccessTimeline, tasks call `tli.full_access_guard().await?`.
After eviction is implemented, this function will be responsible for
downloading missing WAL file and waiting until the download finishes.
This commit also contains other small refactorings:
- Separate `get_tenant_dir` and `get_timeline_dir` functions for
building a local path. This is useful for looking at usages and finding
tasks requiring access to local filesystem.
- `timeline_manager` is now responsible for spawning all background
tasks
- WAL removal task is now spawned instantly after horizon is updated
## Problem
- Initial size calculations tend to fail with `Bad state (not active)`
Closes: https://github.com/neondatabase/neon/issues/7911
## Summary of changes
- In `wait_lsn`, return WaitLsnError::Cancelled rather than BadState
when the state is Stopping
- Replace PageReconstructError's `Other` variant with a specific
`BadState` variant
- Avoid returning anyhow::Error from get_ready_ancestor_timeline -- this
was only used for the case where there was no ancestor. All callers of
this function had implicitly checked that the ancestor timeline exists
before calling it, so they can pass in the ancestor instead of handling
an error.
epoch is a historical and potentially confusing name. It semantically means
lastLogTerm from the raft paper, so let's use it.
This commit changes only internal namings, not public interface (http).