This fixes all kinds of problems related to missing params,
like broken timestamps (due to `integer_datetimes`).
This solution is not ideal, but it will help. Meanwhile,
I'm going to dedicate some time to improving connection machinery.
Note that this **does not** fix problems with passing certain parameters
in a reverse direction, i.e. **from client to compute**. This is a
separate matter and will be dealt with in an upcoming PR.
this should help us in the future to have more freedom with spawning
tasks and cancelling things, most importantly blocking tasks (assuming
the CancellationToken::is_cancelled is performant enough).
CancellationToken allows creation of hierarchical cancellations, which
would also simplify the task_mgr shutdown operation, rendering it
unnecessary.
I saw an excessive number of index file upload operations in
production, even when nothing on the timeline changes. It was because
our GC schedules index file upload if the GC cutoff LSN is advanced,
even if the GC had nothing else to do. The GC cutoff LSN marches
steadily forwards, even when there is no user activity on the
timeline, when the cutoff is determined by the time-based PITR
interval setting. To dial that down, only schedule index file upload
when GC is about to actually remove something.
Previously, the /v1/tenant/:tenant_id/timeline/:timeline_id/do_gc API
call performed a flush and compaction on the timeline before
GC. Change it not to do that, and change all the tests that used that
API to perform compaction explicitly.
The compaction happens at a slightly different point now. Previously,
the code performed the `refresh_gc_info_internal` step first, and only
then did compaction on all the timelines. I don't think that was what
was originally intended here. Presumably the idea with compaction was
to make some old layer files available for GC. But if we're going to
flush the current in-memory layer to disk, surely you would want to
include the newly-written layer in the compaction too. I guess this
didn't make any difference to the tests in practice, but in any case,
the tests now perform the flush and compaction before any of the GC
steps.
Some of the tests might not need the compaction at all, but I didn't
try hard to determine which ones might need it. I left it out from a
few tests that intentionally tested calling do_gc with an invalid
tenant or timeline ID, though.
If we get cancelled before jh.await returns we've take()n the join handle but
drop the result on the floor.
Fix it by setting self.join_handle = None after the .await
fixes https://github.com/neondatabase/neon/issues/3104
We do the accounting exclusively after updating remote IndexPart successfully.
This is cleaner & more robust than doing it upon completion of
individual layer file uploads / deletions since we can uset .set()
insteaf of add()/sub().
NB: Originally, this work was intended to be part of #3013 but it
turns out that it's completely orthogonal.
So, spin it out into this PR for easier review.
Since this change is additive, it won't break anything.
Before this patch, when we decide to rename a layer file to backup
because of layer file size mismatch, we would not remove the layer from
the layer map, but remote the on-disk file.
Because we re-download the file immediately after, we simply end up with
two layer objects in memory that reference the same file in the layer
map. So, GetPage() would work fine until one of the layers gets
delete()'d. The other layer's delete() would then fail.
Future work: prevent insertion of the same layer at LayerMap level
so that we notice such bugs sooner.
Removes the race during pageserver initial timeline creation that lead to partial layer uploads.
This race is only reproducible in test code, we do not create initial timelines in cloud (yet, at least), but still nice to remove the non-deterministic behavior.
refactor: use new type LayerFileName when referring to layer file names in PathBuf/RemotePath
Before this patch, we would sometimes carry around plain file names in
`Path` types and/or awkwardly "rebase" paths to have a unified
representation of the layer file name between local and remote.
This patch introduces a new type `LayerFileName` which replaces the use
of `Path` / `PathBuf` / `RemotePath` in the `storage_sync2` APIs.
Instead of holding a string, it contains the parsed representation of
the image and delta file name.
When we need the file name, e.g., to construct a local path or
remote object key, we construct the name ad-hoc.
`LayerFileName` is also serde {Dese,Se}rializable, and in an initial
version of this patch, it was supposed to be used directly inside
`IndexPart`, replacing `RemotePath`.
However,
commit 3122f3282f
Ignore backup files (ones with .n.old suffix) in download_missing
fixed handling of `*.old` backup file names in IndexPart, and we need
to carry that behavior forward.
The solution is to remove `*.old` backup files names during
deserialization. When we re-serialize the IndexPart, the `*.old` file
will be gone.
This leaks the `.old` file in the remote storage, but makes it safe
to clean it up later.
There is additional churn by a preliminary refactoring that got squashed
into this change:
split off LayerMap's needs from trait Layer into super trait
That refactoring renames `Layer` to `PersistentLayer` and splits off a subset
of the functions into a super-trait called `Layer`.
The upser trait implements just the functions needed by `LayerMap`, whereas
`PersisentLayer` adds the context of the pageserver.
The naming is imperfect as some functions that reside in `PersistentLayer`
have nothing persistence-specific to it. But it's a step in the right direction.
This is rather a hack to resolve immediate issue:
https://github.com/neondatabase/neon/issues/3024
Properly cleaning this file from index part requires changes to
initialization of remote queue. Because we need to clean it up earlier
than we start warking around files.
With on-demand there will be no walk around layer files becase
download_missing is no longer needed, so I believe it will be
natural to unify this with load_layer_map
Changes:
* Remove `RemoteObjectId` concept from remote_storage.
Operate directly on /-separated names instead.
These names are now represented by struct `RemotePath` which was renamed from struct `RelativePath`
* Require remote storage to operate on relative paths for its contents, thus simplifying the way to derive them in pageserver and safekeeper
* Make `IndexPart` to use `String` instead of `RelativePath` for its entries, since those are just the layer names
This patch centralize the logic of creating & reading pid files into the
new pid_file module and improves upon / makes explicit a few race conditions
that existed with the previous code.
Starting Processes / Creating Pidfiles
======================================
Before this patch, we had three places that had very similar-looking
match lock_file::create_lock_file { ... }
blocks.
After this change, they can use a straight-forward call provided
by the pid_file:
pid_file::claim_pid_file_for_pid()
Stopping Processes / Reading Pidfiles
=====================================
The new pid_file module provides a function to read a pidfile,
called read_pidfile(), that returns a
pub enum PidFileRead {
NotExist,
NotHeldByAnyProcess(PidFileGuard),
LockedByOtherProcess(Pid),
}
If we get back NotExist, there is nothing to kill.
If we get back NotHeldByAnyProcess, the pid file is stale and we must
ignore its contents.
If it's LockedByOtherProcess, it's either another pidfile reader
or, more likely, the daemon that is still running.
In this case, we can read the pid in the pidfile and kill it.
There's still a small window where this is racy, but it's not a
regression compared to what we have before.
The NotHeldByAnyProcess is an improvement over what we had before
this patch. Before, we would blindly read the pidfile contents
and kill, even if no other process held the flock.
If the pidfile was stale (NotHeldByAnyProcess), then that kill
would either result in ESRCH or hit some other unrelated process
on the system. This patch avoids the latter cacse by grabbing
an exclusive flock before reading the pidfile, and returning the
flock to the caller in the form of a guard object, to avoid
concurrent reads / kills.
It's hopefully irrelevant in practice, but it's a little robustness
that we get for free here.
Maintain flock on Pidfile of ETCD / any InitialPidFile::Create()
================================================================
Pageserver and safekeeper create their pidfiles themselves.
But for etcd, neon_local creates the pidfile (InitialPidFile::Create()).
Before this change, we would unlock the etcd pidfile as soon as
`neon_local start` exits, simply because no-one else kept the FD open.
During `neon_local stop`, that results in a stale pid file,
aka, NotHeldByAnyProcess, and it would henceforth not trust that
the PID stored in the file is still valid.
With this patch, we make the etcd process inherit the pidfile FD,
thereby keeping the flock held until it exits.
The new "trace_read_requests" option was missing from the
parse_toml_tenant_conf function that reads the config file. Because of
that, the option was ignored, which caused the test_read_trace.py test
to fail. It used to work before commit 9a6c0be823, because the
TenantConfigOpt struct was constructed directly in tenant_create_handler,
but now it is saved and read back from disk even for a newly created
tenant.
The abovementioned bug was fixed in commit 09393279c6 already, which
added the missing code to parse_toml_tenant_conf() to parse the
new "trace_read_requests" option. This commit fixes one more function
that was missed earlier, and adds more detail to the error message if
parsing the config file fails.
It used to be a separate piece of state, but after 9a6c0be823 it's just
an alias for the Tenant being in Attaching state. It was only used in
one assertion in a test, but that check doesn't make sense anymore, so
just remove it.
Fixes https://github.com/neondatabase/neon/issues/2930
Added basic instrumentation to integrate sentry with the proxy, pageserver, and safekeeper processes.
Currently in sentry there are three projects, one for each process. Sentry url is sent to all three processes separately via cli args.
Closes https://github.com/neondatabase/neon/issues/2537
Follow-up of https://github.com/neondatabase/neon/pull/2950
With the new model that prevents attaching without the remote storage,
it has started to be even more odd to add attach-with-files
functionality (in addition to the issues raised previously).
Adds two separate commands:
* `POST {tenant_id}/ignore` that places a mark file to skip such tenant
on every start and removes it from memory
* `POST {tenant_id}/schedule_load` that tries to load a tenant from
local FS similar to what pageserver does now on startup, but without
directory removals
Move missing_layers property to Option<HashSet<RelativePath>>
This will allow the safe removal of it once the upgrade of all page servers is done with this new code
```
warning: named argument `file` is not used by name
--> pageserver/src/tenant/timeline.rs:1078:54
|
1078 | trace!("downloading image file: {}", file = path.display());
| -- ^^^^ this named argument is referred to by position in formatting string
| |
| this formatting argument uses named argument `file` by position
|
= note: `#[warn(named_arguments_used_positionally)]` on by default
help: use the named argument by name to avoid ambiguity
|
1078 | trace!("downloading image file: {file}", file = path.display());
| ++++
```
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
I'm not a fan of "Paused", for two reasons:
- Paused implies that the tenant/timeline with no activity on it. That's
not true; the tenant/timeline can still have active tasks working on it.
- Paused implies that it can be resumed later. It can not. A tenant or
timeline in this state cannot be switched back to Active state anymore.
A completely new Tenant or Timeline struct can be constructed for the
same tenant or timeline later, e.g. if you detach and later re-attach
the same tenant, but that's a different thing.
Stopping describes the state better. I also considered "ShuttingDown",
but Stopping is simpler as it's a single word.
The code in this change was extracted from PR #2595, i.e., Heikki’s draft
PR for on-demand download.
High-Level Changes
- storage_sync module rewrite
- Changes to Tenant Loading
- Changes to Timeline States
- Crash-safe & Resumable Tenant Attach
There are several follow-up work items planned.
Refer to the Epic issue on GitHub:
https://github.com/neondatabase/neon/issues/2029
Metadata:
closes https://github.com/neondatabase/neon/pull/2785
unsquashed history of this patch: archive/pr-2785-storage-sync2/pre-squash
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
===============================================================================
storage_sync module rewrite
===========================
The storage_sync code is rewritten. New module name is storage_sync2, mostly to
make a more reasonable git diff.
The updated block comment in storage_sync2.rs describes the changes quite well,
so, we will not reproduce that comment here. TL;DR:
- Global sync queue and RemoteIndex are replaced with per-timeline
`RemoteTimelineClient` structure that contains a queue for UploadOperations
to ensure proper ordering and necessary metadata.
- Before deleting local layer files, wait for ongoing UploadOps to finish
(wait_completion()).
- Download operations are not queued and executed immediately.
Changes to Tenant Loading
=========================
Initial sync part was rewritten as well and represents the other major change
that serves as a foundation for on-demand downloads. Routines for attaching and
loading shifted directly to Tenant struct and now are asynchronous and spawned
into the background.
Since this patch doesn’t introduce on-demand download of layers we fully
synchronize with the remote during pageserver startup. See details in
`Timeline::reconcile_with_remote` and `Timeline::download_missing`.
Changes to Tenant States
========================
The “Active” state has lost its “background_jobs_running: bool” member. That
variable indicated whether the GC & Compaction background loops are spawned or
not. With this patch, they are now always spawned. Unit tests (#[test]) use the
TenantConf::{gc_period,compaction_period} to disable their effect (15db566).
This patch introduces a new tenant state, “Attaching”. A tenant that is being
attached starts in this state and transitions to “Active” once it finishes
download.
The `GET /tenant` endpoints returns `TenantInfo::has_in_progress_downloads`. We
derive the value for that field from the tenant state now, to remain
backwards-compatible with cloud.git. We will remove that field when we switch
to on-demand downloads.
Changes to Timeline States
==========================
The TimelineInfo::awaits_download field is now equivalent to the tenant being
in Attaching state. Previously, download progress was tracked per timeline.
With this change, it’s only tracked per tenant. When on-demand downloads
arrive, the field will be completely obsolete. Deprecation is tracked in
isuse #2930.
Crash-safe & Resumable Tenant Attach
====================================
Previously, the attach operation was not persistent. I.e., when tenant attach
was interrupted by a crash, the pageserver would not continue attaching after
pageserver restart. In fact, the half-finished tenant directory on disk would
simply be skipped by tenant_mgr because it lacked the metadata file (it’s
written last). This patch introduces an “attaching” marker file inside that is
present inside the tenant directory while the tenant is attaching. During
pageserver startup, tenant_mgr will resume attach if that file is present. If
not, it assumes that the local tenant state is consistent and tries to load the
tenant. If that fails, the tenant transitions into Broken state.
Nothing interesting in these changes. Passing through the
RUST_BACKTRACE=full will hopefully save someone else panick reproduction
time.
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
When a new root timeline is created, we want to flush all the data to
disk before we return success to the caller. We were using
checkpoint(CheckpointConfig::Forced) for that, but that also performs
compaction. With the default settings, compaction will have no work
after we have imported an empty database, as the image of that is too
small to require compaction. However, with very small
checkpoint_distance and compaction_target_size, compaction will run, and
it can take a while.
PR #2785 adds new tests that use very small checkpoint_distance and
compaction_target_size settings, and the test sometimes failed with
"operation timed out" error in the client, when the create_timeline step
took too long.
Many python tests were setting the GC/compaction period to large
values, to effectively disable GC / compaction. Reserve value 0 to
mean "explicitly disabled". We also set them to 0 in unit tests now,
although currently, unit tests don't launch the background jobs at
all, so it won't have any effect.
Fixes https://github.com/neondatabase/neon/issues/2917
* Fix https://github.com/neondatabase/neon/issues/1854
* Never log Safekeeper::conninfo in walproposer as it now contains a secret token
* control_panel, test_runner: generate and pass JWT tokens for Safekeeper to compute and pageserver
* Compute: load JWT token for Safekepeer from the environment variable. Do not reuse the token from
pageserver_connstring because it's embedded in there weirdly.
* Pageserver: load JWT token for Safekeeper from the environment variable.
* Rewrite docs/authentication.md
There will be different scopes for those two, so authorization code should be different.
The `check_permission` function is now not in the shared library. Its implementation
is very similar to the one which will be added for Safekeeper. In fact, we may reuse
the same existing root-like 'PageServerApi' scope, but I would prefer to have separate
root-like scopes for services.
Also, generate_management_token in tests is generate_pageserver_token now.