This is preliminary work for/from #4220 (async
`Layer::get_value_reconstruct_data`).
There, we want to switch `Timeline::layers` to be a
`tokio::sync::RwLock`.
That will require the `TimelineWriter` to become async.
That will require `freeze_inmem_layer` to become async.
So, inside check_checkpoint_distance, we will have
`freeze_inmem_layer().await`.
But current rustc isn't smart enough to understand that we
`drop(layers)` earlier, and hence, will complain about the `!Send`
`layers` being held across the `freeze_inmem_layer().await`-point.
This patch puts the guard into a scope, so rustc will shut up in the
next patch where we make the transition for `TimelineWriter`.
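A minimal sketch of the scoping pattern (illustrative names, not the actual pageserver types):
```
// The `!Send` guard is confined to a block so it is provably dropped
// before the `.await` point; a bare `drop(guard)` is not enough for
// current rustc.
async fn check_checkpoint_distance_sketch(layers: &std::sync::RwLock<Vec<u64>>) {
    let needs_freeze = {
        let guard = layers.read().unwrap();
        guard.len() > 10
        // guard is dropped here, at the end of the scope
    };
    if needs_freeze {
        freeze_inmem_layer_sketch().await;
    }
}

async fn freeze_inmem_layer_sketch() { /* stand-in for the real async freeze */ }
```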
obsoletes https://github.com/neondatabase/neon/pull/4474
c058e1cec2 skipped `truncate_wal()` if `write_lsn` is equal to the
truncation position, but didn't take into account that `write_lsn` is
reset on restart.
Fixes regression looking like:
```
ERROR WAL acceptor{cid=22 ...}:panic{thread=WAL acceptor 19b6c1743666ec02991a7633c57178db/b07db8c88f4c76ea5ed0954c04cc1e74 location=safekeeper/src/wal_storage.rs:230:13}: unexpected write into non-partial segment file
```
This fix will prevent skipping WAL truncation when we are running for
the first time after restart.
Commit `create_test_timeline: always put@initdb_lsn the minimum required keys`
already switched us over to using valid initdb_lsns.
All that's left to do is to actually flush the minimum keys so that
we move from disk_consistent_lsn=Lsn(0) to disk_consistent_lsn=initdb_lsn.
Co-authored-by: Christian Schwarz <christian@neon.tech>
Part of https://github.com/neondatabase/neon/pull/4364
Clarify who's responsible for initializing the layer map. There were
previously two different ways to do it:
- create_empty_timeline and bootstrap_timeline let prepare_timeline()
initialize an empty layer map.
- branch_timeline passed a flag to initialize_with_lock() to tell
initialize_with_lock to call load_layer_map(). Because it was a
newly created timeline, load_layer_map() never found any layer
files, so it just initialized an empty layer map.
With this commit, prepare_new_timeline() always does it. The LSN to
initialize it with is passed as argument.
Other changes per function:
prepare_timeline:
- rename to 'prepare_new_timeline' to make it clear that it's only used
when creating a new timeline, not when loading an existing timeline
- always initialize an empty layer map. The caller can pass the LSN to
initialize it with. (Previously, prepare_timeline would optionally
load the layer map at 'initdb_lsn'. Some callers used that, while others
let initialize_with_lock do it.)
initialize_with_lock:
- As mentioned above, remove the option to load the layer map
- Acquire the 'timelines' lock in the function itself. None of the callers
did any other work while holding the lock.
- Rename it to finish_creation() to make its intent more clear. It's only
used when creating a new timeline now.
create_timeline_data:
- Rename to create_timeline_struct() for clarity. It just initializes
the Timeline struct, not any other "data"
create_timeline_files:
- use create_dir rather than create_dir_all, to be a little more strict.
We know that the parent directory should already exist, and the timeline
directory should not exist.
- Move the call to create_timeline_struct() to the caller. It was just
being "passed through".
Part of https://github.com/neondatabase/neon/pull/4364
This patch inlines `initialize_with_lock` and then reorganizes the code
such that we can `load_layer_map` without holding the
`Tenant::timelines` lock.
As a nice aside, we can get rid of the dummy() uninit mark, which has
always been a terrible hack.
Part of https://github.com/neondatabase/neon/pull/4364
This is a full switch: fs IO operations are also tokio ones, working through
the thread pool. Similar to the pageserver, we have multiple runtimes for easier `top`
usage and isolation.
Notable points:
- Now that the guts of safekeeper.rs are full of .await's, we need to be very
careful not to drop a task at a random point, leaving the timeline in an unclear
state. Currently the only writer is the walreceiver and we don't have top-level
cancellation there, so we are good. But to be safe we should probably
add a fuse that panics if a task is dropped while an operation on a timeline
is in progress.
- The timeline lock is a Tokio one now, as we do disk IO under it.
- Collecting metrics got a crutch: since the prometheus Collector is
synchronous, it spawns a thread with a current-thread runtime to collect the data.
- Anything involving closures becomes significantly more complicated, as
async fns are already kind of closures, and async closures are unstable.
- The main thread now tracks the other main tasks, which got much easier.
- The only sync place left is initial data loading, as otherwise clippy
complains about the timeline map lock being held across await points -- which is
not a problem here as it happens only in the single-threaded runtime of the main
thread. But having it sync doesn't hurt either.
I'm concerned about the performance of thread-pool IO offloading, async traits, and
the many await points; but we can try and see how it goes.
fixes https://github.com/neondatabase/neon/issues/3036
fixes https://github.com/neondatabase/neon/issues/3966
## Problem
Part of https://github.com/neondatabase/neon/issues/4418
## Summary of changes
This PR implements the local manifest interfaces. After the refactor of
timeline is done, we can integrate this with the current storage. The
reader will stop at the first corrupted record.
---------
Signed-off-by: Alex Chi <iskyzh@gmail.com>
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
There were a few problems with null handling:
* query_raw_txt() accepted a vector of strings, so it always (erroneously)
treated "null" as a string instead of null. Change the rust pg client
to accept a vector of Option<String> instead of plain Strings, and adapt
the code here to pass nulls as None.
* pg_text_to_json() had a check that always interpreted the string "NULL"
as null. That is wrong: nulls were already handled by matching None.
This bug showed up as a failed attempt to parse arrays containing NULL
elements. Fix the code by checking for the presence of quotes while parsing an
array (no quotes -> null, quoted -> the string "null").
The array parser fix also slightly changes behavior by always clearing the
current entry when pushing to the resulting vector. This seems to be
an omission in the previous code; however, it looks like it was harmless,
as the entry was left uncleared only at the end of a nested or top-level
array.
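A simplified sketch of the quote-based rule (illustrative, not the actual proxy parser):
```
// While scanning a Postgres array literal, an unquoted NULL element maps to
// JSON null, whereas a quoted "NULL" stays a string.
fn array_element_to_json(raw: &str, was_quoted: bool) -> serde_json::Value {
    if !was_quoted && raw.eq_ignore_ascii_case("null") {
        serde_json::Value::Null
    } else {
        serde_json::Value::String(raw.to_string())
    }
}
```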
With this commit the client can pass the following optional headers:
`Neon-Raw-Text-Output: true`. Return postgres values as text, without parsing them. So numbers, objects, booleans, nulls and arrays will be returned as text. That can be useful in cases when client code wants to implement its own parsing or reuse parsing libraries from e.g. node-postgres.
`Neon-Array-Mode: true`. Return postgres rows as arrays instead of objects. That is a more compact representation and also helps in some edge
cases where it is hard to use rows represented as objects (e.g. when several fields have the same name).
Delete data from s3 when timeline deletion is requested
## Summary of changes
UploadQueue is altered to support scheduling of delete operations in the
stopped state. This looks weird, and I'm considering whether there are
better options/refactorings for the upload client to make it look better.
Probably can be part of https://github.com/neondatabase/neon/issues/4378
Deletion is implemented directly in the existing endpoint because the changes are not
that significant. If we want more safety we can separate them or create a
feature flag for the new behavior.
resolves [#4193](https://github.com/neondatabase/neon/issues/4193)
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Problem
part of https://github.com/neondatabase/neon/issues/4392
## Summary of changes
This PR adds a new HashMap that maps persistent layer desc to the layer
object *inside* LayerMap. Originally I went directly towards adding such a
layer cache in Timeline, but the changes were too many and could not be
reviewed as a reasonably-sized PR. Therefore, we take this intermediate
step to change part of the codebase to use persistent layer desc, and
come up with other PRs to move this hash map of layer desc to the
timeline struct.
Also, file_size is now part of the layer desc.
---------
Signed-off-by: Alex Chi <iskyzh@gmail.com>
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
## Problem
Attach failures are not reported in the public part of the API (in the
`attachment_status` field of TenantInfo).
## Summary of changes
Expose TenantState::Broken as TenantAttachmentStatus::Failed.
The way it's written, the Failed status will be reported even if no
attach happened (i.e., if the tenant became broken on startup). This is
in line with the other members: Active will be resolved to Attached even
if no actual attach took place.
This can be tweaked if needed. At the current stage it would be overengineering without clear motivation.
resolves #4344
## Problem
close https://github.com/neondatabase/neon/issues/4266
## Summary of changes
With this PR, rust-analyzer should be able to give lints and auto-complete
in `mod tests`, which makes writing tests easier.
Previously, rust-analyzer could not do auto-completion there.
---------
Signed-off-by: Alex Chi <iskyzh@gmail.com>
It should keep remote_consistent_lsn commonly up to date on projects that are
not actively writing, which removes the spike of pageserver -> safekeeper
reconnections on storage node restarts.
Initial logical size calculation could still hinder our fast startup
efforts in #4397. See #4183. In the deployment of 2023-06-06,
about 200 initial logical sizes were calculated on the hosts which
took the longest to complete initial load (12s).
Implements the three step/tier initialization ordering described in
#4397:
1. load local tenants
2. do initial logical sizes per walreceivers for 10s
3. background tasks
Ordering is controlled by:
- waiting on `utils::completion::Barrier`s on background tasks
- having one attempt for each Timeline to do initial logical size
calculation
- `pageserver/src/bin/pageserver.rs` releasing background jobs after
timeout or completion of initial logical size calculation
The timeout is there just to safeguard in case a legitimate non-broken
timeline initial logical size calculation goes long. The timeout is
configurable, by default 10s, which I think would be fine for production
systems. In the test cases I've been looking at, it seems that these
steps are completed as fast as possible.
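As an illustration of the gating pattern only (the real code uses `utils::completion::Barrier`, whose API may differ), a `tokio::sync::watch` channel can act as such a barrier:
```
use tokio::sync::watch;

#[derive(Clone)]
struct Barrier(watch::Receiver<bool>);
struct Completion(watch::Sender<bool>);

fn channel() -> (Completion, Barrier) {
    let (tx, rx) = watch::channel(false);
    (Completion(tx), Barrier(rx))
}

impl Completion {
    // Called once initial load / initial logical sizes are done (or on timeout).
    fn complete(self) {
        let _ = self.0.send(true);
    }
}

impl Barrier {
    async fn wait(mut self) {
        while !*self.0.borrow() {
            if self.0.changed().await.is_err() {
                break; // sender dropped: release rather than hang forever
            }
        }
    }
}

async fn background_task(barrier: Barrier) {
    barrier.wait().await; // held back until step 3 is released
    // ... compaction, GC, consumption metrics, etc.
}
```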
Co-authored-by: Christian Schwarz <christian@neon.tech>
This adds test coverage for 'compute_ctl', as it is now used by all
the python tests.
There are a few differences in how 'compute_ctl' is called in the
tests, compared to the real web console:
- In the tests, the postgresql.conf file is included as one large
string in the spec file, and it is written out as it is to the data
directory. I added a new field for that to the spec file. The real
web console, however, sets all the necessary settings in the
'settings' field, and 'compute_ctl' creates the postgresql.conf from
those settings.
- In the tests, the information needed to connect to the storage, i.e.
tenant_id, timeline_id, connection strings to pageserver and
safekeepers, are now passed as new fields in the spec file. The real
web console includes them as the GUCs in the 'settings' field. (Both
of these are different from what the test control plane used to do:
It used to write the GUCs directly in the postgresql.conf file). The
plan is to change the control plane to use the new method, and
remove the old method, but for now, support both.
Some tests that were sensitive to the amount of WAL generated needed
small changes, to accommodate that compute_ctl runs the background
health monitor which makes a few small updates. Also some tests shut
down the pageserver, and now that the background health check can run
some queries while the pageserver is down, that can produce a few
extra errors in the logs, which needed to be allowlisted.
Other changes:
- remove obsolete comments about PostgresNode;
- create standby.signal file for Static compute node;
- log output of `compute_ctl` and `postgres` is merged into
`endpoints/compute.log`.
---------
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
We have two ways of tenant shutdown; we should have just one.
Changes are mostly mechanical, simple refactorings.
The added `warn!` on the "shutdown all remaining tasks" path should trigger test
failures in the meantime, until the "tenant/timeline
owns all spawned tasks" issue is solved.
Cc: #4327.
…ompiler
Added the `allowed_error` to the `positive_env` so any tests completing
the attach are allowed to have this printed out. They are allowed to do so
because `random_init_delay` can produce a value close to zero, and thus the
first run will be near the attach. Though... unsure if we ever really need
the eviction task to run **before** it can evict something, as in after
20min or 24h.
In the failed test case, however, the period is 20s, so it is interesting that we
didn't run into this sooner.
Evidence of flakiness:
https://github.com/neondatabase/neon/actions/runs/5175677035/jobs/9323705929?pr=4399#step:4:38536
Walreceiver logs are a bit hard to understand because of partial span
usage, extra messages, and ignored errors popping up as huge stacktraces.
Fixes #3330 (by spans; also demote info -> debug).
- arrange walreceiver spans into a hierarchy:
- `wal_connection_manager{tenant_id, timeline_id}` ->
`connection{node_id}` -> `poller`
- unifies the error reporting inside `wal_receiver`:
- All ok errors are now `walreceiver connection handling ended: {e:#}`
- All unknown errors are still stacktraceful task_mgr reported errors
with context `walreceiver connection handling failure`
- Remove `connect` special casing, was: `DB connection stream finished`
for ok errors
- Remove `done replicating` special casing, was `Replication stream
finished` for ok errors
- lowered log levels for (non-exhaustive list):
- `WAL receiver manager started, connecting to broker` (at startup)
- `WAL receiver shutdown requested, shutting down` (at shutdown)
- `Connection manager loop ended, shutting down` (at shutdown)
- `sender is dropped while join handle is still alive` (at lucky
shutdown, see #2885)
- `timeline entered terminal state {:?}, stopping wal connection manager
loop` (at shutdown)
- `connected!` (at startup)
- `Walreceiver db connection closed` (at disconnects?, was without span)
- `Connection cancelled` (at shutdown, was without span)
- `observed timeline state change, new state is {new_state:?}` (never
after Timeline::activate was made infallible)
- changed:
- `Timeline dropped state updates sender, stopping wal connection
manager loop`
- was out of date; sender is not dropped but `Broken | Stopping` state
transition
- also made `debug!`
- `Timeline dropped state updates sender before becoming active,
stopping wal connection manager loop`
- was out of date: sender is again not dropped but `Broken | Stopping`
state transition
- also made `debug!`
- log fixes:
- stop double reporting panics via JoinError
As seen on deployment of 2023-06-01 release, times were improving but
there were some outliers caused by:
- timelines `eviction_task` starting while activating and running
imitation
- timelines `initial logical size` calculation
This PR fixes it so that `eviction_task` is delayed like the other
background tasks, fixing an oversight from the earlier #4372.
After this PR, activation will have two phases:
1. load and activate tenants AND calculate some initial logical sizes
2. rest of initial logical sizes AND background tasks
- compaction, gc, disk usage based eviction, timelines `eviction_task`,
consumption metrics
## Describe your changes
Port HNSW implementation for ANN search to Postgres
## Issue ticket number and link
https://www.pinecone.io/learn/hnsw
We now spawn a new task for every HTTP request, and wait on the
JoinHandle. If Hyper drops the Future, the spawned task will keep
running. This protects the rest of the pageserver code from unexpected
async cancellations.
This creates a CancellationToken for each request and passes it to the
handler function. If the HTTP request is dropped by the client, the
CancellationToken is signaled. None of the handler functions make use
of the CancellationToken currently, but now they could.
The CancellationToken arguments also work like documentation. When
you're looking at a function signature and you see that it takes a
CancellationToken as argument, it's a nice hint that the function might
run for a long time, and won't be async cancelled. The default
assumption in the pageserver is now that async functions are not
cancellation-safe anyway, unless explicitly marked as such, but this is a
nice extra reminder.
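A minimal sketch of the wrapper's shape, assuming tokio_util's `CancellationToken` (illustrative, not the actual hyper glue):
```
use tokio_util::sync::CancellationToken;

// The handler runs in its own spawned task, so dropping this wrapper future
// (e.g. because the client disconnected) does not cancel the handler mid-way;
// it only signals the token.
async fn run_request<F, Fut>(handler: F) -> anyhow::Result<()>
where
    F: FnOnce(CancellationToken) -> Fut,
    Fut: std::future::Future<Output = anyhow::Result<()>> + Send + 'static,
{
    let cancel = CancellationToken::new();
    let child = cancel.child_token();
    // Cancels the token when this wrapper future is dropped.
    let _guard = cancel.drop_guard();
    tokio::spawn(handler(child)).await?
}
```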
Spawning a task for each request is OK from a performance point of view
because spawning is very cheap in Tokio, and none of our HTTP requests
are very performance critical anyway.
Fixes issue #3478
## Problem
Part of https://github.com/neondatabase/neon/issues/4373
## Summary of changes
This PR adds `PersistentLayerDesc`, which will be used in LayerMap
mapping and probably layer cache. After this PR and after we change
LayerMap to map to layer desc, we can safely drop RemoteLayerDesc.
---------
Signed-off-by: Alex Chi <iskyzh@gmail.com>
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
## Problem
This PR includes doc changes to the current metrics as well as adding
new metrics. With the new set of metrics, we can quantitatively analyze
the read amp., write amp. and space amp. in the system, when used
together with https://github.com/neondatabase/neonbench
close https://github.com/neondatabase/neon/issues/4312
ref https://github.com/neondatabase/neon/issues/4347
Compaction metrics are TBD; one idea is to report the number of L0 files and the
number of layers in the system, and we can do this in the future when we
start working on compaction.
## Summary of changes
* Add `READ_NUM_FS_LAYERS` for computing read amp.
* Add `MATERIALIZED_PAGE_CACHE_HIT_UPON_REQUEST`.
* Add `GET_RECONSTRUCT_DATA_TIME`. GET_RECONSTRUCT_DATA_TIME +
RECONSTRUCT_TIME + WAIT_LSN_TIME should be approximately total time of
reads.
* Add `5.0` and `10.0` to `STORAGE_IO_TIME_BUCKETS`, given that some fsync
runs are slow (i.e., > 1s) in some cases.
* Some `WAL_REDO` metrics are only used when Postgres is involved in the
redo process.
---------
Signed-off-by: Alex Chi <iskyzh@gmail.com>
compute_ctl can panic, but `tracing` is used for logging. Panic stderr
output can interleave with messages from normal logging. The fix is to
use the established way (pageserver, safekeeper, storage_broker) of using
`tracing` to report panics.
This parameter can be used to restrict the number of image layers generated
because of a GC request (wanted image layers).
When set to zero, it completely eliminates the creation of such image layers,
which allows avoiding extra storage consumption after merging #3673.
## Problem
PR #3673 forces generation of missing image layers. So in the short term it
can cause an increase (in the worst case up to two times) in the size of storage.
It was intended (by me) that the GC period is comparable to the PiTR interval.
But it looks like that is not the case now: GC is performed much more
frequently. This may cause a problem with space exhaustion: GC forces
new image creation while a large PiTR interval still prevents GC from collecting old
layers.
## Summary of changes
Add a new pageserver parameter `forced_image_creation_limit` which
restricts the number of created image layers that are requested by GC.
Startup continues to be slow; this is work towards alleviating it.
Summary of changes:
- tidy the functional improvements from #4366 into
`utils::completion::{Completion, Barrier}`
- extend "initial load completion" usage up to tenant background tasks
- previously only global background tasks
- spawn_blocking the tenant load directory traversal
- demote some logging
- remove some unwraps
- propagate some spans to `spawn_blocking`
The runtime effect should be a major speedup to loading, but after that, the
`BACKGROUND_RUNTIME` will be blocked for a long time (minutes). Possible
follow-ups:
- complete initial tenant sizes before allowing background tasks to
block the `BACKGROUND_RUNTIME`
I've added logs for broker push duration after every iteration in https://github.com/neondatabase/neon/pull/4142. This log has not found any real issues, so we can replace it with metrics, to slightly reduce log volume.
A LogQL query found that pushes longer than 500ms happened only 90 times in the last month. https://neonprod.grafana.net/goto/KTNj9UwVg?orgId=1
`{unit="safekeeper.service"} |= "timeline updates to broker in" | regexp "to broker in (?P<duration>.*)" | duration > 500ms`
## Problem
In the future, we want to compare code coverage on a PR with coverage on
the main branch.
Currently, we store only code coverage HTML reports, I suggest we start
storing reports in "lcov info" format that we can use/parse in the
future. Currently, the file size is ~7Mb (it's a text-based format and
could be compressed into a ~400Kb archive)
- More about "lcov info" format:
https://manpages.ubuntu.com/manpages/jammy/man1/geninfo.1.html#files
- Part of https://github.com/neondatabase/neon/issues/3543
## Summary of changes
- Change `scripts/coverage` to output lcov coverage to
`report/lcov.info` file instead of stdout (we already upload the whole
`report/` directory to S3)
Startup can take a long time. We suspect it's the initial logical size
calculations. The long-term solution is to not block the tokio executors but
do most of the I/O in spawn_blocking.
See: #4025, #4183
Short-term solution to above:
- Delay global background tasks until initial tenant loads complete
- Just limit how many initial logical size calculations we can have at the
same time to `cores / 2` (see the sketch below)
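A sketch of the concurrency cap (assumed shape, not the exact pageserver code):
```
use std::sync::Arc;
use tokio::sync::Semaphore;

fn make_limiter() -> Arc<Semaphore> {
    let cores = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(2);
    Arc::new(Semaphore::new(std::cmp::max(cores / 2, 1)))
}

async fn initial_logical_size_with_cap(limiter: Arc<Semaphore>) {
    // At most `cores / 2` calculations hold a permit at any time.
    let _permit = limiter
        .acquire_owned()
        .await
        .expect("semaphore is never closed");
    // ... the actual logical size calculation goes here ...
}
```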
This PR is for trying in staging.
This is preliminary work for/from #4220 (async
`Layer::get_value_reconstruct_data`).
The motivation is to avoid locking `Tenant::timelines` in places that
can't be `async`, because in #4333 we want to convert Tenant::timelines
from `std::sync::Mutex` to `tokio::sync::Mutex`.
But, the changes here are useful in general because they clean up &
document tenant state transitions.
That also paves the way for #4350, which is an alternative to #4333 that
refactors the pageserver code so that we can keep the
`Tenant::timelines` mutex sync.
This patch consists of the following core insights and changes:
* spawn_load and spawn_attach own the tenant state until they're done
* once load()/attach() calls are done ...
* if they failed, transition them to Broken directly (we know that
there's no background activity because we didn't call activate yet)
* if they succeed, call activate. We can make it infallible. How? Later.
* set_broken() and set_stopping() are changed to wait for spawn_load() /
spawn_attach() to finish.
* This sounds scary because it might hinder detach or shutdown, but
actually, concurrent attach+detach, attach+shutdown, or
load+shutdown were just racy before this PR.
So, with this change, they're not anymore.
In the future, we can add a `CancellationToken` stored in Tenant to
cancel `load` and `attach` faster, i.e., make `spawn_load` /
`spawn_attach` transition them to Broken state sooner.
See the doc comments on TenantState for the state transitions that are
now possible.
It might seem scary, but actually, this patch reduces the possible state
transitions.
We introduce a new state `TenantState::Activating` to avoid grabbing the
`Tenant::timelines` lock inside the `send_modify` closure.
These were the humble beginnings of this PR (see Motivation section),
and I think it's still the right thing to have this `Activating` state,
even if we decide against async `Tenant::timelines` mutex. The reason is
that `send_modify` locks internally, and by moving locking of
Tenant::timelines out of the closure, the internal locking of
`send_modify` becomes a leaf of the lock graph, and so, we eliminate
deadlock risk.
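A simplified sketch of the resulting state set (illustrative; the real enum carries extra data on some variants):
```
enum TenantState {
    // spawn_load / spawn_attach own the state until load()/attach() is done.
    Loading,
    Attaching,
    // New: timelines are activated outside of the `send_modify` closure, so
    // Tenant::timelines is not locked while the watch channel's internal
    // lock is held.
    Activating,
    Active,
    Stopping,
    Broken,
}
```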
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
#4155 inadvertently switched to a version of the VM builder that leaves
the file cache integration disabled by default. This re-enables the
vm-informant's file cache integration.
(as a refresher: The vm-informant is the autoscaling component that sits
inside the VM and manages postgres / compute_ctl)
See also: https://github.com/neondatabase/autoscaling/pull/265
If the timeline is already being deleted, return an error. We used to
notice the duplicate request and error out in
persist_index_part_with_deleted_flag(), but it's better to detect it
earlier. Add an explicit lock for the deletion.
Note: This doesn't do anything about the async cancellation problem
(github issue #3478): if the original HTTP request dropped, because the
client disconnected, the timeline deletion stops half-way through the
operation. That needs to be fixed, too, but that's a separate story.
(This is a simpler replacement for PR #4194. I'm also working on the
cancellation shielding, see PR #4314.)
Previously, you used it like this:
|r| RequestSpan(my_handler).handle(r)
But I don't see the point of the RequestSpan struct. It's just a
wrapper around the handler function. With this commit, the call
becomes:
|r| request_span(r, my_handler)
Which seems a little simpler.
At first I thought that the RequestSpan struct would allow "chaining"
other kinds of decorators like RequestSpan, so that you could do
something like this:
|r| CheckPermissions(RequestSpan(my_handler)).handle(r)
But it doesn't work like that. If each of those structs wrap a handler
*function*, it would actually look like this:
|r| CheckPermissions(|r| RequestSpan(my_handler).handle(r))).handle(r)
This commit doesn't make that kind of chaining any easier, but seems a
little more straightforward anyway.
Require the error type to be ApiError. It implicitly required that
anyway, because the function used error::handler, which downcasted the
error to an ApiError. If the error was in fact anything else than
ApiError, it would just panic. Better to check it at compilation time.
Also make the last-resort error handler more forgiving, so that it
returns a 500 Internal Server Error response, instead of panicking,
if a request handler returns some other error than an ApiError.
Compaction is usually a compute-heavy process and might affect other
futures running on the thread of the compaction. Therefore, we add
`block_in_place` as a temporary solution to avoid blocking other futures
on the same thread as compaction in the runtime. As we are migrating
towards a fully-async-style pageserver, we can revert this change when
everything is async and when we move compaction to a separate runtime.
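A sketch of the stopgap (assumed shape, not the exact call site):
```
// `block_in_place` hands the current worker thread over to the blocking
// section, so other futures are not starved while compaction crunches.
// Note: it requires the multi-threaded runtime.
async fn compact_sketch() -> anyhow::Result<()> {
    tokio::task::block_in_place(|| {
        // ... synchronous, compute-heavy compaction work ...
        Ok(())
    })
}
```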
---------
Signed-off-by: Alex Chi <iskyzh@gmail.com>
## Problem
GitHub Autocomment script posts a comment only for PRs. It's harder
to debug failed tests on main or release branches.
## Summary of changes
- Change the GitHub Autocomment script to be able to post a comment to
either a PR or a commit of a branch
We used to generate the ID, if the caller didn't specify it. That's bad
practice, however, because network is never fully reliable, so it's
possible we create a new tenant but the caller doesn't know about it,
and because it doesn't know the tenant ID, it has no way of retrying or
checking if it succeeded. To discourage that, make it mandatory. The web
control plane has not relied on the auto-generation for a long time.
We currently have a semaphore based rate limiter which we hope will keep
us under S3 limits. However, the semaphore does not consider time, so
I've been hesitant to raise the concurrency limit of 100.
See #3698.
The PR introduces a leaky-bucket based rate limiter instead of the
`tokio::sync::Semaphore`, which will allow us to raise the limit later
on. The configuration changes are not contained here.
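A minimal leaky-bucket sketch to illustrate the idea (not the implementation used by the PR): unlike a plain semaphore, the permitted rate is tied to time, so at most one request leaks out per `period`.
```
use std::time::Duration;
use tokio::sync::Mutex;
use tokio::time::Instant;

struct LeakyBucket {
    period: Duration,          // e.g. 10ms for ~100 requests/second
    next_slot: Mutex<Instant>, // earliest time the next request may proceed
}

impl LeakyBucket {
    fn new(requests_per_second: u32) -> Self {
        Self {
            period: Duration::from_secs(1) / requests_per_second,
            next_slot: Mutex::new(Instant::now()),
        }
    }

    async fn acquire(&self) {
        let wake_at = {
            let mut next = self.next_slot.lock().await;
            let wake_at = (*next).max(Instant::now());
            *next = wake_at + self.period;
            wake_at
        };
        tokio::time::sleep_until(wake_at).await;
    }
}
```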
## Problem
Test `test_metric_collection` became flaky:
```
AssertionError: assert not ['2023-05-25T14:03:41.644042Z ERROR metrics_collection: failed to send metrics: reqwest::Error { kind: Request, url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("localhost")), port: Some(18022), path: "/billing/api/v1/usage_events", query: None, fragment: None }, source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 99, kind: AddrNotAvailable, message: "Cannot assign requested address" })) }',
...]
```
I suspect it is caused by having 2 places where we define the
`httpserver_listen_address` fixture (which is internally used by
the `pytest-httpserver` plugin).
## Summary of changes
- Remove the definition of `httpserver_listen_address` from
`test_runner/regress/test_ddl_forwarding.py` and keep one in
`test_runner/fixtures/neon_fixtures.py`
- Also remove the unused `httpserver_listen_address` parameter from
`test_proxy_metric_collection`
Timeline::activate() was only fallible because `launch_wal_receiver`
was.
`launch_wal_receiver` was fallible only because of some preliminary
checks in `WalReceiver::start`.
Turns out these checks can be shifted to the type system by delaying
creation of the `WalReceiver` struct to the point where we activate the
timeline.
The changes in this PR were enabled by my previous refactoring that
funneled the broker_client from pageserver startup to the activate()
call sites.
Patch series:
- #4316
- #4317
- #4318
- #4319
Routine `vm-builder` version bump, from autoscaling repo release. You
can find the release notes here:
https://github.com/neondatabase/autoscaling/releases/tag/v0.8.0
The changes are from v0.7.2 — most of them were already included in
v0.7.3-alpha3.
Of particular note: This (finally) fixes the cgroup issues, so we should
now be able to scale up when we're about to run out of memory.
**NB:** This has the effect of limiting the DB's memory usage in a way it
wasn't limited before. We may run into issues because of that. There is
currently no way to disable that behavior, other than switching the
endpoint back to the k8s-pod provisioner.
(This is prep work to make `Timeline::activate` infallible.)
This patch removes the global storage_broker client instance from the
pageserver codebase.
Instead, pageserver startup instantiates it and passes it down to the
`Timeline::activate` function, which in turn passes it to the
WalReceiver, which is the entity that actually uses it.
Patch series:
- #4316
- #4317
- #4318
- #4319
Before this patch, it would use error type `TenantStateError` which has
many more error variants than can actually happen with
`mgr::get_tenant`.
Along the way, I also introduced `SetNewTenantConfigError` because it
uses `mgr::get_tenant` and also can only fail in much fewer ways than
`TenantStateError` suggests.
The new `page_service.rs`'s `GetActiveTimelineError` and
`GetActiveTenantError` types were necessary to avoid an `Other` variant
on the `GetTenantError`.
This patch is a by-product of reading code that subscribes to
`Tenant::state` changes.
Can't really connect it to any given project.
(Instead of going through mgr every iteration.)
The `wait_for_active_tenant` function's `wait` argument could be removed
because it was only used for the loop that waits for the tenant to show
up in the tenants map. Since we're passing the tenant in, we no longer
need to get it from the tenants map.
NB that there's no guarantee that the tenant object is in the tenants
map at the time the background loop function starts running. But the
tenant mgr guarantees that it will be quite soon. See
`tenant_map_insert` way upwards in the call hierarchy for details.
This is prep work to eliminate `subscribe_for_state_updates` (PR #4299 )
Fixes: #3501
This PR refactors the original page_binutils into a single tool, pagectl,
uses clap derive for better command-line parsing, and adds the dump-kv
tool to extract information from delta files. This helps me better
understand what's inside the page server. We can add support for other
types of files and more functionality in the future.
---------
Signed-off-by: Alex Chi <iskyzh@gmail.com>
(This is prep work to make `Timeline::activate()` infallible.)
The current possibility for failure in `Timeline::activate()` is the
broker client's presence / absence. It should be an assert, but we're
careful with these. So, I'm planning to pass in the broker client to
activate(), thereby eliminating the possibility of its absence.
In the unit tests, we don't have a broker client. So, I thought I'd be
in trouble because the unit tests also called `activate()` before this
PR.
However, closer inspection reveals a long-standing FIXME about this,
which is addressed by this patch.
It turns out that the unit tests don't actually need the background
loops to be running. They just need the state value to be `Active`. So,
for the tests, we just set it to that value but don't spawn the
background loops.
We'll need to revisit this if we ever do more Rust unit tests in the
future. But right now, this refactoring improves the code, so, let's
revisit when we get there.
Patch series:
- #4316
- #4317
- #4318
- #4319
## Problem
`pytest` 6 truncates error messages and this is not configured.
It's fixed in `pytest` 7: it prints the whole message (the truncation limit
is higher) if `--verbose` is set (it's set on CI).
## Summary of changes
- `pytest` and `pytest` plugins are updated to their latest versions
- linters (`black` and `ruff`) are updated to their latest versions
- `mypy` and types are updated to their latest versions, new warnings
are fixed
- while we're here, allure is updated to its latest version as well
Checking out proxy logs for an endpoint is a frequent (often the first) operation
during investigation of user issues; let's remove the annoying extra step of
mapping endpoint id -> session id here.
## Describe your changes
Right now the only criterion for image layer generation is the number of
delta layers since the last image layer.
If we have a "stairs" layout of delta layers (see the link below), it can
happen that there are a lot of old delta layers which cannot be reclaimed
by GC because they are not fully covered with image layers.
This PR constructs a list of "wanted" image layers in GC (which image
layers are needed to be able to remove old layers)
and passes this list to the compaction task, which performs generation of image
layers.
So now, in addition to the delta-count criterion, we also take into account
the "wishes" of GC.
## Issue ticket number and link
See
https://neondb.slack.com/archives/C033RQ5SPDH/p1676914249982519
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
In the v0.6.0 release, vm-builder was changed to be Neon-specific, so
it's handling all the stuff that Dockerfile.vm-compute-node used to do.
This commit bumps vm-builder to v0.7.3-alpha3.
This commit introduces an SQL-over-HTTP endpoint in the proxy, with a JSON
response structure resembling that of the node-postgres driver. This method,
using HTTP POST, achieves smaller amortized latencies in edge setups due to
fewer round trips and better reuse of open connections by the v8 engine.
This update involves several intricacies:
1. SQL injection protection: We employed the extended query protocol, modifying
the rust-postgres driver to send queries in one roundtrip using a text
protocol rather than binary, bypassing potential issues like those identified
in https://github.com/sfackler/rust-postgres/issues/1030.
2. Postgres type compatibility: As not all postgres types have binary
representations (e.g., acl's in pg_class), we adjusted rust-postgres to
respond with text protocol, simplifying serialization and fixing queries with
text-only types in response.
3. Data type conversion: Considering JSON supports fewer data types than
Postgres, we perform conversions where possible and pass all other types as
strings (see the sketch after this list). Key conversions include:
- postgres int2, int4, float4, float8 -> json number (NaN and Inf remain
text)
- postgres bool, null, text -> json bool, null, string
- postgres array -> json array
- postgres json and jsonb -> json object
4. Alignment with node-postgres: To facilitate integration with js libraries,
we've matched the response structure of node-postgres, returning command tags
and column oids. Command tag capturing was added to the rust-postgres
functionality as part of this change.
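A hedged sketch of the value mapping described in point 3 (not the proxy's actual conversion code):
```
use serde_json::{json, Value};

fn pg_text_to_json_sketch(pg_type: &str, raw: Option<&str>) -> Value {
    let Some(raw) = raw else { return Value::Null };
    match pg_type {
        "int2" | "int4" => raw.parse::<i64>().map(Value::from).unwrap_or_else(|_| json!(raw)),
        "float4" | "float8" => match raw.parse::<f64>() {
            // NaN and Inf have no JSON representation; keep them as text.
            Ok(f) if f.is_finite() => json!(f),
            _ => json!(raw),
        },
        "bool" => json!(raw == "t"),
        "json" | "jsonb" => serde_json::from_str(raw).unwrap_or_else(|_| json!(raw)),
        // Everything else (including arrays, in this simplified sketch) stays text.
        _ => json!(raw),
    }
}
```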
This patch fixes parsing of the `max_lsn_wal_lag` tenant config item.
We were incorrectly expecting a string before, but the type is a
NonZeroU64.
So, when setting it in the config, the (updated) test case would fail
with
```
E psycopg2.errors.InternalError_: Tenant a1fa9cc383e32ddafb73ff920de5f2e6 will not become active. Current state: Broken due to: Failed to parse config from file '.../repo/tenants/a1fa9cc383e32ddafb73ff920de5f2e6/config' as pageserver config: configure option max_lsn_wal_lag is not a string. Backtrace:
```
So, not even the assertions added are necessary.
The test coverage for tenant config is rather thin in general.
For example, the `test_tenant_conf.py` test doesn't cover all the
options.
I'll add a new regression test as part of attach-time-tenant-conf PR
https://github.com/neondatabase/neon/pull/4255
## Problem
`osgeo.org` is experiencing some problems with DNS resolving which
breaks `compute-node-image` (because it can't download postgis)
## Summary of changes
- Add `140.211.15.30 download.osgeo.org` to /etc/hosts by passing it via
the container option
This PR enforces that the tenant create / update-config APIs reject
requests with unknown fields.
This is a desirable property because some tenant config settings control
the lifetime of user data (e.g., GC horizon or PITR interval).
Suppose we inadvertently rename the `pitr_interval` field in the Rust
code.
Then, right now, a client that still uses the old name will send a
tenant config request to configure a new PITR interval.
Before this PR, we would accept such a request, ignore the old name
field, and use the pageserver.toml default value for what the new PITR
interval is.
With this PR, we will instead reject such a request.
One might argue that the client could simply check whether the config it
sent has been applied, using the `/v1/tenant/.../config` endpoint.
That is correct for tenant create and update-config.
But, attach will soon [^1] grow the ability to have attach-time config
as well.
If we ignore unknown fields and fall back to global defaults in that
case, we risk data loss.
Example:
1. Default PITR in pageservers is 7 days.
2. Create a tenant and set its PITR to 30 days.
3. For 30 days, fill the tenant continuously with data.
4. Detach the tenant.
5. Attach tenant.
Attach must use the 30-day PITR setting in this scenario.
If it were to fall back to the 7-day default value, we would lose 23
days of PITR capability for the tenant.
So, the PR that adds attach-time tenant config will build on the
(clunky) infrastructure added in this PR
[^1]: https://github.com/neondatabase/neon/pull/4255
Implementation Notes
====================
This could have been a simple `#[serde(deny_unknown_fields)]`, but sadly
that is incompatible with `#[serde(flatten)]` (the incompatibility is documented,
but silent at compile time). We still use it by adding an outer struct and using
unit tests to ensure it is correct.
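A sketch of the workaround with an illustrative field subset (not the full tenant config):
```
use serde::Deserialize;

// List the fields explicitly on an outer struct so `deny_unknown_fields`
// can be used, since it does not combine with `#[serde(flatten)]`.
#[derive(Deserialize)]
#[serde(deny_unknown_fields)]
struct TenantConfigRequestSketch {
    tenant_id: String,
    gc_horizon: Option<u64>,
    pitr_interval: Option<String>,
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn unknown_field_is_rejected() {
        // A misspelled field must fail instead of silently falling back to defaults.
        let result = serde_json::from_str::<TenantConfigRequestSketch>(
            r#"{"tenant_id": "t", "pitr_intervall": "30 days"}"#,
        );
        assert!(result.is_err());
    }
}
```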
`neon_local tenant config` now uses the `.remove()` pattern + bail if
there are leftover config args. That's in line with what
`neon_local tenant create` does. We should dedupe that logic in a future
PR.
---------
Signed-off-by: Alex Chi <iskyzh@gmail.com>
Co-authored-by: Alex Chi <iskyzh@gmail.com>
Previously we didn't handle XACT_XINFO_HAS_INVALS and XACT_XINFO_HAS_DROPPED_STAT correctly,
which led to getting an incorrect value of twophase_xid for records with XACT_XINFO_HAS_TWOPHASE.
This caused 'twophase file for xid {} does not exist' errors in test_isolation.
We had a hot debate on whether we should try to make our code
cancellation-safe, or just accept that it's not, and make sure that
our Futures are driven to completion. The decision is that we drive
Futures to completion. This documents the decision, and summarizes the
reasoning for that.
Discussion that sparked this:
https://github.com/neondatabase/neon/pull/4198#discussion_r1190209316
Disable background tasks so that compaction does not download all layers,
but also stop safekeepers before checkpointing and use a read-only endpoint.
Fixes: #3666
Co-authored-by: Christian Schwarz <christian@neon.tech>
- Group tests by Postgres version
- Merge different build types
- Add a command to GitHub comment on how to rerun all failed tests
(different command for different Postgres versions)
- Restore a link to a test report in the build summary
This is prep for https://github.com/neondatabase/neon/pull/4255
[1/X] OpenAPI: share a single definition of TenantConfig
DRYs up the pageserver OpenAPI YAML's representation of
tenant config.
All the fields of tenant config are now located in a model schema
called TenantConfig.
The tenant create & config-change endpoints have separate schemas,
TenantCreateInfo and TenantConfigureArg, respectively.
These schemas inherit from TenantConfig, using allOf 1.
The tenant config-GET handler's response was previously named
TenantConfig.
It's now named TenantConfigResponse.
None of these changes affect how the request looks on the wire.
The generated Go code will change for Console because the OpenAPI code
generator maps `allOf` to a Go struct embedding.
Luckily, usage of tenant config in Console is still very lightweight,
but that will change in the near future.
So, this is a good chance to set things straight.
The console changes are tracked in
https://github.com/neondatabase/cloud/pull/5046
[2/x]: extract the tenant config parts of create & config requests
[3/x]: code movement: move TenantConfigRequestConfig next to
TenantCreateRequestConfig
[4/x] type-alias TenantConfigRequestConfig = TenantCreateRequestConfig;
They are exactly the same.
[5/x] switch to qualified use for tenant create/config request api
models
[6/x] rename models::TenantConfig{RequestConfig,} and remove the alias
[7/x] OpenAPI: sync tenant create & configure body names from Rust code
[8/x]: dedupe the two TryFrom<...> for TenantConfOpt impls
The only difference is that the TenantConfigRequest impl does
```
tenant_conf.max_lsn_wal_lag = request_data.max_lsn_wal_lag;
tenant_conf.trace_read_requests = request_data.trace_read_requests;
```
and the TenantCreateRequest impl does
```
if let Some(max_lsn_wal_lag) = request_data.max_lsn_wal_lag {
tenant_conf.max_lsn_wal_lag = Some(max_lsn_wal_lag);
}
if let Some(trace_read_requests) = request_data.trace_read_requests {
tenant_conf.trace_read_requests = Some(trace_read_requests);
}
```
As far as I can tell, these are identical.
## Problem
`neondatabase/zenith-coverage-data` is too big:
- It takes ~6 minutes to clone and push the repo
- GitHub fails to publish an HTML report to github.io
Part of https://github.com/neondatabase/neon/issues/3543
## Summary of changes
Replace pushing code coverage report to
`neondatabase/zenith-coverage-data` with uploading it to S3
With this patch, the attach handler now follows the same pattern as
tenant create with regards to instantiation of the new tenant:
1. Prepare on-disk state using `create_tenant_files`.
2. Use the same code path as pageserver startup to load it into memory
and start background loops (`schedule_local_tenant_processing`).
It's a bit sad we can't use the
`PageServerConfig::tenant_attaching_mark_file_path` method inside
`create_tenant_files` because it operates in a temporary directory.
However, it's a small price to pay for the gained simplicity.
During implementation, I noticed that we don't handle failures post
`create_tenant_files` well. I left TODO comments in the code linking to
the issue that I created for this [^1].
Also, I'll dedupe the spawn_load and spawn_attach code in a future
commit.
refs https://github.com/neondatabase/neon/issues/1555
part of https://github.com/neondatabase/neon/issues/886 (Tenant
Relocation)
[^1]: https://github.com/neondatabase/neon/issues/4233
## Problem
Compatibility tests don't support Postgres 15 yet, but we're still
trying to upload compatibility snapshot (which we do not collect).
Ref
https://github.com/neondatabase/neon/actions/runs/4991394158/jobs/8940369368#step:4:38129
## Summary of changes
Add `pg_version` parameter to `run-python-test-set` actions and do not
upload compatibility snapshot for Postgres 15
This commit replaces all usages of connection_manager.rs:
wait_for_active_timeline with Timeline::wait_to_become_active.
wait_to_become_active is better and in the right module.
close https://github.com/neondatabase/neon/issues/4189
Co-authored-by: Shany Pozin <shany@neon.tech>
Use an enum instead of an array. Before, there was no connection
between the definition of a metric and the point where it was used, aside from
matching string literals. Now it's possible to use IDE features to check
for references. This also avoids a mismatch between the set of
metrics that was defined and the set of metrics that was actually used.
What is interesting is that the `init logical size` case is not used. I
think `LogicalSize` is a duplicate of `InitLogicalSize`, so the latter was removed.
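A sketch of the enum-keyed approach (variant names besides `LogicalSize` are illustrative):
```
#[derive(Clone, Copy)]
enum MetricsKey {
    LogicalSize,
    RemoteConsistentLsn,
    WrittenSize,
}

impl MetricsKey {
    // Each metric site now references a variant, so the compiler (and IDE)
    // tie the definition to its uses instead of matching string literals.
    fn as_str(self) -> &'static str {
        match self {
            MetricsKey::LogicalSize => "logical_size",
            MetricsKey::RemoteConsistentLsn => "remote_consistent_lsn",
            MetricsKey::WrittenSize => "written_size",
        }
    }
}
```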
Conflicts:
- Changes in PG15's xlogrecovery.c resulted in non-substantial conflicts between
ecb01e6ebb5a67f3fc00840695682a8b1ba40461 and
aee72b7be903e52d9bdc6449aa4c17fb852d8708
Fixes #4207
This PR is simply the patch from
https://github.com/neondatabase/neon/issues/4008 except we enabled
`force_path_style` for custom endpoints. This is because at some
version the S3 SDK switched, by default, to virtual-host-style access, which
is not supported by MinIO in the default configuration. By enforcing
path-style access for custom endpoints, we can pass all e2e test cases.
SDK 0.55 is not the latest version and we can bump it further later when
all flaky tests in this PR are resolved.
This PR also (hopefully) fixes flaky test
`test_ondemand_download_timetravel`.
close https://github.com/neondatabase/neon/issues/4008
Signed-off-by: Alex Chi <iskyzh@gmail.com>
Should fix flakiness caused by the error
```
FATAL: could not resize shared memory segment "/PostgreSQL.3944613150" to 1048576 bytes: No space left on device
```
## Describe your changes
`pageserver_disconnect()` calls `PQfinish()` which deallocates resources
on the connection structure. `PQerrorMessage()` hands back a pointer to
an allocated resource. Duplicate the error message prior to calling
`pageserver_disconnect()`.
## Issue ticket number and link
Fixes https://github.com/neondatabase/neon/issues/4214
Wait for the upload to complete before returning 201 Created from
`branch_timeline` or when `bootstrap_timeline` happens. Should either of
those waits fail, then on the retried request wait for the uploads again.
This should work as expected, assuming the control plane does not start to
use timeline creation as a wait_for_upload mechanism.
Fixes #3865, started from
https://github.com/neondatabase/neon/pull/3857/files#r1144468177
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
After tenant attach, there is a window where the child timeline is
loaded and accepts GetPage requests, but its parent is not. If a
GetPage request needs to traverse to the parent, it needs to wait for
the parent timeline to become active, or it might miss some records on
the parent timeline.
It's also possible that the parent timeline is active, but it hasn't
yet received all the WAL up to the branch point from the safekeeper.
This happens if a pageserver crashes soon after creating a timeline,
so that the WAL leading to the branch point has not yet been uploaded
to remote storage. After restart, the WAL will be re-streamed and
ingested from the safekeeper, but that takes a while. Because of that,
it's not enough to check that the parent timeline is active, we also
need to wait for the WAL to arrive on the parent timeline, just like
at the beginning of GetPage handling. We probably should change the
behavior at create_timeline so that a timeline can only be created
after all the WAL up to the branch point has been uploaded to remote
storage, but that's not currently the case and out of scope for this
PR (see github issue #4218).
@NanoBjorn encountered this while working on tenant migration. After
migrating a tenant with a parent and child branch, connecting to the
child branch failed with an error like:
```
FATAL: "base/16385" is not a valid data directory
DETAIL: File "base/16385/PG_VERSION" is missing.
```
This commit adds two tests that reproduce the bug, with slightly
different symptoms.
Before this PR, the gather_inputs() calls made to imitate synthetic size
calculation accesses were accounted towards the real logical size
calculation metric.
This PR forces all callers to declare the cause for making logical size
calculations, making the decision which cause counts towards which
metric explicit.
This is follow-up to
```
commit 1d266a6365
Author: Christian Schwarz <christian@neon.tech>
Date: Thu May 11 16:09:29 2023 +0200
logical size calculation metrics: differentiate regular vs imitated (#4197)
```
After merging this patch, I hope to be able to explain why we have about
30x more "logical size" ops in prod than "imitate logical size" for any
given observation interval.
refs https://github.com/neondatabase/neon/issues/4154
We already have the warn!() log line for this condition. This PR adds a
corresponding metric on which we can have a dedicated alert. Cheaper and
more reliable than alerting on the logs, because we run into log rate
limits from time to time these days.
refs https://github.com/neondatabase/neon/issues/4222
This PR adds test runs on Postgres 15 and creates a unified Allure report
with results for all tests.
- Split `.github/actions/allure-report` into
`.github/actions/allure-report-store` and
`.github/actions/allure-report-generate`
- Add debug or release pytest parameter for all tests (depending on
`BUILD_TYPE` env variable)
- Add Postgres version as a pytest parameter for all tests (depending on
`DEFAULT_PG_VERSION` env variable)
- Fix `test_wal_restore` and `restore_from_wal.sh` to support path with
`[`/`]` in it (fixed by applying spellcheck to the script and fixing all
warnings), `restore_from_wal_archive.sh` is deleted as unused.
- All known failures on Postgres 15 marked with xfail
While investigating https://github.com/neondatabase/neon/issues/4154 I
found that the `Calculating logical size for timeline` tracing events
created from within the logical size computation code are not always
attributable to the background task that caused it.
My goal is to be able to distinguish in the logs whether a `Calculating
logical size for timeline` was logged as part of a real synthetic size
calculation VS an imitation by the eviction task.
I want this distinction so I can prove my assumption that the disk IO
peaks which we see every 24h on prod are due to eviction's imitate
synthetic size calculations.
The alternative here, which I would have preferred, but is more work:
link RequestContext's into a child->parent list and dump this list when
we log `Calculating logical size for timeline`.
I would have preferred that over what we have in this PR because,
technically, the ondemand logical size computation can outlive the
caller that spawned it. This is against the idea of correctly nested
spans.
I guess in OpenTelemetry land, the correct modelling would be a link
between the caller's span and the task_mgr task's span.
Anyways, I think the case where we hang up on the spawned ondemand
logical size calculation is quite rare. So, I'm willing to tolerate
incorrectly nested spans for these edge-cases.
refs https://github.com/neondatabase/neon/issues/4154
I would like to know whether and by how much the eviction iterations
spike in the $period-sized window that happens every $threshold, when
all the timelines do the imitate accesses.
refs https://github.com/neondatabase/neon/issues/4154
I tried to use failpoint_sleep_millis_async(...) in a source file that
didn't do `use std::time::Duration`, and got a compiler error:
```
error[E0433]: failed to resolve: use of undeclared type `Duration`
--> pageserver/src/walingest.rs:316:17
|
316 | utils::failpoint_sleep_millis_async!("wal-ingest-logical-message-sleep");
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ not found in this scope
|
= note: this error originates in the macro `utils::failpoint_sleep_millis_async` (in Nightly builds, run with -Z macro-backtrace for more info)
help: consider importing one of these items
|
24 | use chrono::Duration;
|
24 | use core::time::Duration;
|
24 | use humantime::Duration;
|
24 | use serde_with::__private__::Duration;
|
and 2 other candidates
```
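A sketch of the fix (the real `utils` macro body differs, and `sleep_duration_from_failpoint` is a hypothetical helper standing in for the real failpoint lookup): fully qualify `Duration` inside the macro so callers do not need their own `use std::time::Duration;`.
```
#[macro_export]
macro_rules! failpoint_sleep_millis_async {
    ($name:literal) => {{
        if let Some(millis) = $crate::sleep_duration_from_failpoint($name) {
            // `::std::time::Duration` instead of a bare `Duration`:
            let d = ::std::time::Duration::from_millis(millis);
            ::tokio::time::sleep(d).await;
        }
    }};
}
```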
I want this distinction so I can prove my assumption that the disk IO
peaks which we see every 24h on prod are due to eviction's imitate
synthetic size calculations.
refs https://github.com/neondatabase/neon/issues/4154
Control Plane currently [^1] polls for `has_in_progress_downloads ==
false` after /attach to determine that an attach operation succeeded.
As pointed out in the OpenAPI spec as of neon#4151, polling for
`has_in_progress_downloads` is incorrect.
This patch changes the situation by
- removing `has_in_progress_downloads`
- adding a new field `attachment_status`.
- changing instructions for `/attach` to poll for `attachment_status ==
attached`.
This makes the instructions in `/attach` actionable for Control Plane.
NB that we don't expose the TenantState in the OpenAPI docs, even though
we expose it in the endpoint. That is with good reason because we don't
want to commit to a fixed set of tenant states forever. Hence, the
separate `attachment_status` field that exposes the bare minimum
required to make /attach + subsequent polling 100% safe wrt split brain.
It would have been nice to report failures explicitly, but the problem
is that we lose that state when we restart. So, we return `attached`
upon attach failure. The tenant is Broken in that case, causing Control
Plane's subsequent health check to fail. Control Plane can roll back
the relocation operation then.
NB: the reliance on the subsequent health check is no change to what we
had before this patch!
NB: we can always add additional TenantAttachmentStatus'es in the future
to communicate failure.
This PR also moves the attach-marker file's creation to the API
handler's synchronous part. That was done to avoid the need to
distinguish
* `Attaching but marker not yet written => AttachmentStatus::Maybe` from
* `Attaching, marker written, but attach failed for other reason =>
AttachmentStatus::Attached`
Coincidentally, this also adds more transactionality to the /attach API
because we only return 202 once we've written the marker file. But, in
the end, it doesn't affect how the control plane interacts with us or
how it needs to do retries. So, we don't mention any of this in the API
docs.
[^1]: The one-click tenant relocation PR cloud#4740, currently WIP, is
the first real user.
Before this patch, the following sequence would lead to the resurrection of a deleted timeline:
- create timeline
- wait for its index part to reach s3
- delete timeline
- wait an arbitrary amount of time, including 0 seconds
- detach tenant
- attach tenant
- the timeline is there and Active again
This happens because we only kept track of the deletion in the tenant dir (by deleting the timeline dir) but not in S3.
The solution is to turn the deleted timeline's IndexPart into a tombstone.
The deletion status of the timeline is expressed in the `deleted_at: Option<NaiveDateTime>` field of IndexPart.
It's `None` while the timeline is alive and `Some(deletion time stamp)` if it is deleted.
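As a rough sketch of that field (assuming chrono's `serde` feature; the real IndexPart has many more fields, such as the layer file list and metadata, which are elided here):

```rust
use chrono::NaiveDateTime;
use serde::{Deserialize, Serialize};

// Illustrative sketch of the tombstone field only.
#[derive(Serialize, Deserialize)]
struct IndexPart {
    // None => timeline is alive.
    // Some => timeline was deleted at this timestamp; treat the IndexPart as
    //         a tombstone and do not resurrect the timeline on attach.
    deleted_at: Option<NaiveDateTime>,
    // ... remaining fields elided ...
}
```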
We change the timeline deletion handler to upload this tombstoned IndexPart.
The handler does not return success if the upload fails.
Coincidentally, this fixes the long-standing TODO about `std::fs::remove_dir_all` not being atomic.
It need not be atomic anymore because we set the `deleted_at=Some()` before starting the `remove_dir_all`.
The tombstone is in the IndexPart only, not in the `metadata`.
So, we only have the tombstone and the `remove_dir_all` benefits mentioned above if remote storage is configured.
This was a conscious trade-off because there's no good format evolution story for the current metadata file format.
The introduction of this additional step into `delete_timeline` was painful because delete_timeline needs to be
1. cancel-safe
2. idempotent
3. safe to call concurrently
These are mostly self-inflicted limitations that can be avoided by using request-coalescing.
PR https://github.com/neondatabase/neon/pull/4159 will do that.
fixes https://github.com/neondatabase/neon/issues/3560
refs https://github.com/neondatabase/neon/issues/3889 (part of tenant relocation)
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
Currently, if we unexpectedly download from the eviction task, the log
lines look like what we have in
https://github.com/neondatabase/neon/issues/4154
```
2023-05-04T14:42:57.586772Z WARN eviction_task{tenant_id=$TENANT timeline_id=$TIMELINE}:eviction_iteration{policy_kind="LayerAccessThreshold"}: unexpectedly on-demand downloading remote layer remote $TIMELINE/000000067F000032AC0000400C00FFFFFFFF-000000067F000032AC000040140000000008__0000000001696070-0000000003DC76E9 for task kind Eviction
```
We know these are caused by the imitate accesses.
But we don't know which one (my bet is on update_gc_info).
I didn't want to pollute the other tasks' logs with the additional
spans, so we use `.instrument()` when we call non-eviction-task code.
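A minimal sketch of that `.instrument()` pattern; the span name and helper function are placeholders, not the actual eviction-task code:

```rust
use std::future::Future;
use tracing::{info_span, Instrument};

// Wrap a future from non-eviction-task code in an extra span so that any
// warnings it logs (e.g. "unexpectedly on-demand downloading") carry the
// imitate-access context.
async fn with_imitation_span<F, T>(fut: F) -> T
where
    F: Future<Output = T>,
{
    fut.instrument(info_span!("imitate_logical_size_access")).await
}
```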
refs https://github.com/neondatabase/neon/issues/4154
If compute_ctl is launched without a spec file, it fetches it from the
control plane with an HTTP request. We cannot get the startup tracing
context from the compute spec in that case, because we don't have it
available on start. We could still read the tracing context from the
compute spec after we have fetched it, but that would leave the fetch
itself out of the context. Pass the tracing context in environment
variables instead.
Add a `wal_backup_parallel_jobs` cmdline argument to specify the maximum
number of parallel segment uploads. The new default value is 5, meaning that
safekeepers will try to upload 5 segments concurrently if they are
available. Setting this value to 1 is equivalent to the sequential
upload that we had before.
Part of the https://github.com/neondatabase/neon/issues/3957
## Describe your changes
Compiles kq_imcx extension
At this moment, there are some issues with the extension:
1. I'm cloning it directly from the master branch. It would be better to
fetch a tag or an archive instead.
2. PG14:
```
postgres=# CREATE EXTENSION IF NOT EXISTS kq_imcx;
postgres=# select * from kq_calendar_cache_info();
2023-02-08 13:55:22.853 UTC [412] ERROR: relation "ketteq.slice_type" does not exist at character 34
2023-02-08 13:55:22.853 UTC [412] QUERY: select min(s.id), max(s.id) from ketteq.slice_type s
2023-02-08 13:55:22.853 UTC [412] STATEMENT: select * from kq_calendar_cache_info();
ERROR: relation "ketteq.slice_type" does not exist
LINE 1: select min(s.id), max(s.id) from ketteq.slice_type s
```
3. PG15:
`cannot request additional shared memory outside shmem_request_hook`
Note: I don't think we need to publish info about this extension in the
docs.
## Issue ticket number and link
neondatabase/cloud#3387
We currently return 202 as soon as the tenant is allocated in memory
before we've written out the marker file. So, the /attach API currently
does not have a transactional character. For example, it can happen that
we respond with a 202 and then crash before writing out the marker file.
In such a case, it is important that the client
1. observes the lost attach (by polling tenant status and observing 404)
2. and consequently retries the attach.
It has to keep doing this in a loop until it observes the tenant as "Active" in
the tenant status. If the client doesn't follow this protocol and
instead goes to another pageserver to attach the tenant, we risk a
split-brain situation where both the first and second pageserver write
to the tenant's S3 state.
The improved description highlights the consequences of this behavior
for clients that use the /attach endpoint.
The tenant relocation that is currently being implemented in cloud#4740
implements retries of Attach and it does poll afterwards, but, it polls
`has_in_progress_downloads`.
That is incorrect, as described in the patch body.
The motivation for this write-up is that, in a future PR, we'll extend
the /attach endpoint with an option to provide the tenant config. If we
decide to leave the non-transactional behavior of /attach unmodified, we
will be able to avoid persisting the tenant config. Conversely, if we
decide that the /attach API should become transactional, we'll need to
persist the tenant config in the attach-marker-file before acknowledging
receipt of the /attach operation.
refs https://github.com/neondatabase/cloud/pull/4740
refs https://github.com/neondatabase/neon/issues/2238
refs https://github.com/neondatabase/neon/issues/1555
Our scale-to-zero logic was optimized for short auto-suspend intervals,
e.g. minutes or hours. In this case, if the compute was restarted by k8s for
some reason (OOM, k8s node went down, pod relocation, etc.),
`last_active` got bumped and we started counting the auto-suspend timeout again.
It's not a big deal, i.e. we suspend a completely idle compute not after 5
minutes, but after 10 minutes or so.
Yet, some clients may want days or even weeks. And the chance that the compute
gets restarted during such an interval is pretty high, in which case we
might not be able to suspend some computes for weeks.
After this commit, we won't initialize `last_active` on start, so
`/status` could return an unset attribute. This means that there was no
user activity since start. The control plane should deal with it by taking
the `max()` of all available activity timestamps: `started_at`,
`last_active`, etc.
compute_ctl part of neondatabase/cloud#4853
This patch adds a regression test for the threshold-based layer
eviction.
The test asserts the basic invariant that, if left alone, the residence
statuses will stabilize, with some layers resident and some layers
evicted.
Thereby, we cover both the aspect of last-access-time-threshold-based
eviction, and the "imitate access" hacks that we put in recently.
The aggressive `period` and `threshold` values revealed a subtle bug
which is also fixed in this patch.
The symptom was that, without the Rust changes of this patch, there
would be occasional test failures due to `WARN... unexpectedly
downloading` log messages.
These log messages were caused by the "imitate access" calls of the
eviction task.
But, the whole point of the "imitate access" hack was to prevent
eviction of the layers that we access there.
After some digging, I found the root cause, which is the following race
condition:
1. Compact: Write out an L1 layer from several L0 layers. This records
residence event `LayerCreate` with the current timestamp.
2. Eviction: imitate access logical size calculation. This accesses the
L0 layers because the L1 layer is not yet in the layer map.
3. Compact: Grab layer map lock, add the new L1 to layer map and remove
the L0s, release layer map lock.
4. Eviction: observes the new L1 layer whose only activity timestamp is
the `LayerCreate` event.
The L1 layer had no chance of being accessed until after (3).
So, if enough time passes between (1) and (3), then (4) will observe a
layer with `now-last_activity > threshold` and evict it.
The fix is to require the first `record_residence_event` to happen while
we already hold the layer map lock.
The API requires a ref to a `BatchedUpdates` as a witness that we are
inside a layer map lock.
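A simplified sketch of that witness pattern; types and method names are stand-ins for the real layer map API:

```rust
// Can only be constructed while the layer map lock is held, so a reference
// to it serves as proof that the caller holds the lock.
struct BatchedUpdates<'a> {
    _guard: std::marker::PhantomData<&'a ()>,
}

struct Layer;

impl Layer {
    // Requiring a &BatchedUpdates forces callers to record the first
    // residence event while they still hold the layer map lock, closing the
    // race window between compaction and the eviction task described above.
    fn record_residence_event(&self, _witness: &BatchedUpdates<'_>) {
        // record "layer became resident at <now>" ...
    }
}
```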
That is not fool-proof, e.g., new call sites for `insert_historic` could
just completely forget to record the residence event.
It would be nice to prevent this at the type level.
In the meantime, we have rate-limited log messages to warn us if such
an implementation error sneaks in in the future.
fixes https://github.com/neondatabase/neon/issues/3593
fixes https://github.com/neondatabase/neon/issues/3942
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
The field means the same thing as the `wal_end` field in the XLogData
struct. And in the postgres-protocol crate's corresponding
PrimaryKeepAlive struct, it's also called `wal_end`. Let's be
consistent.
As noted by Arthur at
https://github.com/neondatabase/neon/pull/4144#pullrequestreview-1411031881
When a new connection is established to the safekeeper, the 'end_pos'
field is initially set to Lsn::INVALID (i.e 0/0). If there is no WAL to
send to the client, we send KeepAlive messages with Lsn::INVALID. That
confuses the pageserver: it thinks that the safekeeper is lagging far
behind the tip of the branch, and will reconnect to a different
safekeeper. Then the same thing happens with the new safekeeper, until
some WAL is streamed, which sets 'end_pos' to a valid value.
This fix always sets `end_pos` to the most recent `commit_lsn` value.
This is useful for sending the latest `commit_lsn` to the receiver, so it
will know how far advanced this safekeeper is compared to the others.
Fixes https://github.com/neondatabase/neon/issues/3972
Supersedes https://github.com/neondatabase/neon/pull/4144
Add essential safekeeper and pageserver::walreceiver metrics. Mostly
counters, such as the number of received queries, broker messages,
removed WAL segments, or connection switch events in the walreceiver.
Also log the broker push loop duration.
Closes https://github.com/neondatabase/neon/issues/2106
Before:
```
Extracting base backup to create postgres instance: path=/Users/someonetoignore/work/neon/neon_main/test_output/test_pageserver_lsn_wait_error_safekeeper_stop/repo/endpoints/ep-2/pgdata port=15017
stderr: command failed: page server 'basebackup' command failed
Caused by:
0: db error: ERROR: Timed out while waiting for WAL record at LSN 0/FFFFFFFF to arrive, last_record_lsn 0/A2C3F58 disk consistent LSN=0/16B5A50
1: ERROR: Timed out while waiting for WAL record at LSN 0/FFFFFFFF to arrive, last_record_lsn 0/A2C3F58 disk consistent LSN=0/16B5A50
Stack backtrace:
```
After:
```
Extracting base backup to create postgres instance: path=/Users/someonetoignore/work/neon/neon/test_output/test_pageserver_lsn_wait_error_safekeeper_stop/repo/endpoints/ep-2/pgdata port=15011
stderr: command failed: page server 'basebackup' command failed
Caused by:
0: db error: ERROR: Timed out while waiting for WAL record at LSN 0/FFFFFFFF to arrive, last_record_lsn 0/A2C3F58 disk consistent LSN=0/16B5A50, WalReceiver status (update 2023-04-26 14:20:39): streaming WAL from node 12346, commit|streaming Lsn: 0/A2C3F58|0/A2C3F58, safekeeper candidates (id|update_time|commit_lsn): [(12348|14:20:40|0/A2C3F58), (12346|14:20:40|0/A2C3F58), (12347|14:20:40|0/A2C3F58)]
1: ERROR: Timed out while waiting for WAL record at LSN 0/FFFFFFFF to arrive, last_record_lsn 0/A2C3F58 disk consistent LSN=0/16B5A50, WalReceiver status (update 2023-04-26 14:20:39): streaming WAL from node 12346, commit|streaming Lsn: 0/A2C3F58|0/A2C3F58, safekeeper candidates (id|update_time|commit_lsn): [(12348|14:20:40|0/A2C3F58), (12346|14:20:40|0/A2C3F58), (12347|14:20:40|0/A2C3F58)]
Stack backtrace:
```
As the issue requests, the PR adds the context in logs only, but I think
we should expose the context in a similar way via the HTTP management API.
That should be simple with the new API, but is better done in a separate PR.
Co-authored-by: Kirill Bulatov <kirill@neon.tech>
If the other end of a TCP connection closes its read end of the socket,
you get an EPIPE when you try to send. I saw that happen in the CI once:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4136/release/4869464644/index.html#suites/c19bc2126511ef8cb145cca25c438215/7ec87b016c0b4b50/
```
2023-05-03T07:53:22.394152Z ERROR Task 'serving compute connection task' tenant_id: Some(c204447079e02e7ba8f593cb8bc57e76), timeline_id: Some(b666f26600e6deaa9f43e1aeee5bacb7) exited with error: Postgres connection error
Caused by:
Broken pipe (os error 32)
Stack backtrace:
0: pageserver::page_service::page_service_conn_main::{{closure}}
at /__w/neon/neon/pageserver/src/page_service.rs:282:17
<core::panic::unwind_safe::AssertUnwindSafe<F> as core::future::future::Future>::poll
at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panic/unwind_safe.rs:296:9
<futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll::{{closure}}
at /__w/neon/neon/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/future/future/catch_unwind.rs:36:42
<core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panic/unwind_safe.rs:271:9
...
```
In passing, add a comment to explain what the "expected" in the
`is_expected_io_error` function means.
wal_craft had accumulated some trouble by using `use anyhow::*;`. This fixes
that and removes redundant conversions (there is never a need to convert a
Path to an OsStr), especially in the `Process` args.
Originally in #4100 but we merged a later PR instead for the fixes. I
dropped the `postmaster.pid` polling in favor of just having a longer
connect timeout.
Noticed while describing `RequestSpan`: this fix omits the otherwise
logged message about the request being cancelled when the request
handler panics. This was missed in #4064.
It's more convenient than parsing the postgresql.conf file.
Extracted from PR #3886. I started working on another patch (to make it
safe to run two "neon_local endpoint create" commands concurrently), and
realized that this change will make that simpler too.
Sentry caught a panic on a load testing server related to metric removals:
https://neondatabase.sentry.io/issues/4142396994
Turn the `expect` into logging, but also add logging for each removal,
so we can identify in which cases we do a double remove. The cause of the
double removal (or of never adding) is not obvious or expected.
Originally added in #3837.
The fix is rather indirect, due to the accidental over-application of
`anyhow`: if handle_pagerequests returns a `QueryError`, it will now be
bubbled up as-is as a `QueryError`. `QueryError` allows the inner
`std::io::Error` to be inspected, and thus we can filter certain error
kinds which are perfectly normal, without a huge log message.
For a very long time (b2f5102) the errors were converted to `anyhow` by
mistake, which made this difficult or impossible, even though from the
types it would *appear* that we propagate wrapped `std::io::Error`s and
can filter them.
Fixes #4113; most likely filters some other errors as well.
Add patch directive to Cargo.toml to use patched version of
sharded-slab:
98d16753ab
The patch changes the MAX_THREADS limit from 4096 to 32768. This is a
temporary workaround for using tracing from many threads in the safekeeper
code, until the async safekeepers patch is merged to main.
Note that the patch can affect other Rust services, not only the safekeeper
binary.
Add an HTTP endpoint to initialize a safekeeper timeline from peer
safekeepers. This is useful for initializing a new safekeeper to replace
a failed one. Not fully "correct" in all cases, but should work in
most.
This code is not suitable for production workloads but can be tested on
staging to get started. The new endpoint is separate from the usual code
paths and should not affect anything if no one explicitly uses it. We
can roll back this commit in case of issues.
This reverts commit 732acc5.
Reverted PR: #3869
As noted in PR #4094, we do in fact try to insert duplicates to the
layer map, if L0->L1 compaction is interrupted. We do not have a proper
fix for that right now, and we are in a hurry to make a release to
production, so revert the changes related to this to the state that we
have in production currently. We know that we have a bug here, but
better to live with the bug that we've had in production for a long
time, than rush a fix to production without testing it in staging first.
Cc: #4094, #4088
As soon as we have received the SSLRequest packet, and have figured
out the hostname to connect to from the SNI, we can start passing
through data. We don't need to parse the StartupPacket that the client
will send next.
In order not to create NodePorts for each compute, we can set up
services that accept connections on wildcard domains and then use
information from the domain name to route the connection to some internal
service. There are ready-made solutions for HTTPS and TLS connections,
but the postgresql protocol uses opportunistic TLS and we haven't found
any ready-made solutions for it.
This patch introduces `pg_sni_router` which routes connections to
`aaa--bbb--123.external.domain` to `aaa.bbb.123.internal.domain`.
In the long run we can avoid console -> compute psql communication,
but for now this router seems to be the easier way forward.
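A minimal sketch of the rewrite rule, assuming the `--` convention above; the domain suffixes are placeholders:

```rust
// Map the external wildcard-style hostname to the internal service name,
// e.g. "aaa--bbb--123.external.domain" -> "aaa.bbb.123.internal.domain".
fn external_to_internal(host: &str) -> Option<String> {
    let prefix = host.strip_suffix(".external.domain")?;
    Some(format!("{}.internal.domain", prefix.replace("--", ".")))
}

#[test]
fn rewrites_external_host() {
    assert_eq!(
        external_to_internal("aaa--bbb--123.external.domain").as_deref(),
        Some("aaa.bbb.123.internal.domain")
    );
}
```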
Refactor walsenders out of timeline.rs, to make it less convoluted, into a
separate WalSenders struct with its own lock but otherwise the same structure.
Tracking of the in-memory remote_consistent_lsn is also moved there, as it is
mainly received from the pageserver.
State of walsender (feedback) is also restructured to be cleaner; now it is
either PageserverFeedback or StandbyFeedback(StandbyReply, HotStandbyFeedback),
but not both.
Introduce read timeouts to our `page_service` connections. Without read
timeouts, we essentially leak connections.
This is a port of #3995. Split the refactorings to the other PR: #4097.
Fixes #4028.
- Increase `connect_timeout` to 30s, which should be enough for
most of the cases
- If the script cannot connect to the DB (or any other
`psycopg2.OperationalError` occur) — do not fail the script, log
the error and proceed. Problems with fetching flaky tests shouldn't
block the PR
Refactoring part of #4093.
Numerous `Send + Sync` bounds were a distraction that was not needed
at all. The proper `Bytes` usage and one `"error_message".to_string()`
are just drive-by fixes.
Not using the `PostgresBackendTCP` allows us to start setting read
timeouts (and more). `PostgresBackendTCP` is still used from proxy, so
it cannot be removed.
PR `build: run clippy for powerset of features (#4077)` brought us a
`clippy --release` pass.
It was merged after #4030, which fails under `clippy --release` with
```
error: static `TENANT_ID_EXTRACTOR` is never used
--> pageserver/src/tenant/timeline.rs:4270:16
|
4270 | pub static TENANT_ID_EXTRACTOR: once_cell::sync::Lazy<
| ^^^^^^^^^^^^^^^^^^^
|
= note: `-D dead-code` implied by `-D warnings`
error: static `TIMELINE_ID_EXTRACTOR` is never used
--> pageserver/src/tenant/timeline.rs:4276:16
|
4276 | pub static TIMELINE_ID_EXTRACTOR: once_cell::sync::Lazy<
| ^^^^^^^^^^^^^^^^^^^^^
```
A merge queue would have prevented this.
Notes:
- This still needs UI support from the Console
- I've not tuned any GUCs for PostgreSQL to make this work better
- Safekeeper has gotten a tweak in which WAL is sent and how: It now
sends zero-ed WAL data from the start of the timeline's first segment up to
the first byte of the timeline to be compatible with normal PostgreSQL
WAL streaming.
- This includes the commits of #3714
Fixes one part of https://github.com/neondatabase/neon/issues/769
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
- Remove repeated tenant & timeline from span
- Demote logging of the path to debug level
- Log completion at info level, in the same function where we log errors
- distinguish between layer file download success & on-demand download
succeeding as a whole in the log message wording
- Assert that the span contains a tenant id and a timeline id
fixes https://github.com/neondatabase/neon/issues/3945
Before:
```
INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{tenant_id=$TENANT_ID timeline_id=$TIMELINE_ID layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: download complete: /storage/pageserver/data/tenants/$TENANT_ID/timelines/$TIMELINE_ID/000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91
INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{tenant_id=$TENANT_ID timeline_id=$TIMELINE_ID layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: Rebuilt layer map. Did 9 insertions to process a batch of 1 updates.
```
After:
```
INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: layer file download finished
INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: Rebuilt layer map. Did 9 insertions to process a batch of 1 updates.
INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: on-demand download successful
```
It fixes the miscalculation of the metric for projects that use multiple
branches for the same endpoint.
We were under-billing users with such projects. So we need to
communicate the change in the Release Notes.
It had a couple of inherent races:
1) Even if compute is killed before the call, some more data might still arrive
to safekeepers after commit_lsn on them is polled, advancing it. Then checkpoint
on pageserver might not include this tail, and so upload of expected LSN won't
happen until one more checkpoint.
2) commit_lsn is updated asynchronously -- compute can commit a transaction before
communicating commit_lsn to even a single safekeeper (sync-safekeepers can be used
to force the advancement). This makes the semantics of
wait_for_sk_commit_lsn_to_reach_remote_storage quite complicated.
Replace it with last_flush_lsn_upload which
1) Learns last flush LSN on compute;
2) Waits for it to arrive to pageserver;
3) Checkpoints it;
4) Waits for the upload.
In some tests this keeps compute alive longer than before, but this doesn't seem
to be important.
There is a chance this fixes https://github.com/neondatabase/neon/issues/3209
Follow-up fix after https://github.com/neondatabase/neon/pull/4067
```
+ crane tag neondatabase/vm-compute-node-v14:3064 latest
Error: fetching "neondatabase/vm-compute-node-v14:3064": GET https://index.docker.io/v2/neondatabase/vm-compute-node-v14/manifests/3064: MANIFEST_UNKNOWN: manifest unknown; unknown tag=3064
```
I reverted to the previous approach for promoting images
(log in to one registry, save images to the local fs, log out and log in to
another registry, and push images from the local fs). It turns out that what
works for one Google project (kaniko) doesn't work for another (crane)
[sigh]
- Update kaniko to 1.9.2 (from 1.7.0); the problem with reproducible builds is fixed
- Log in to ECR and Docker Hub at once, so we can push to several
registries; this makes the `push-docker-hub` job unneeded
- `push-docker-hub` replaced with `promote-images` in the `needs:` clause;
pushing images to the production ECR moved to the `promote-images` job
## Describe your changes
Deploy `main` proxies to the preview environments
We don't deploy storage there yet, as it's tricky.
## Issue ticket number and link
https://github.com/neondatabase/cloud/issues/4737
And add corresponding unit test.
The fix is to use `.remove()` instead of `.get()` when processing the
arguments hash map.
The code uses emptiness of the hash map to determine whether all
arguments have been processed.
This was likely a copy-paste error.
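A minimal illustration of the bug class (not the actual code being fixed):

```rust
use std::collections::HashMap;

// When emptiness of the map is the "all arguments consumed" check, each
// argument must be taken out with .remove(), not just read with .get().
fn take_arg(args: &mut HashMap<String, String>, key: &str) -> Option<String> {
    // .get() would leave the entry behind, so the final "is the map empty?"
    // check would spuriously report unprocessed arguments.
    args.remove(key)
}
```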
refs https://github.com/neondatabase/neon/issues/3942
Just record the time needed for waiting for the LSN and then for the basebackup,
in a log message, in millis. This is related to ongoing investigations into
cold start performance.
This could also be a counter. It cannot be added next to the smgr
histograms, because we don't want another histogram per timeline.
The aim is to allow drilling deeper into which timelines were slow, and
to understand why some need two basebackups.
For the "worst-case /storage usage panel", we need to compute
```
remote size + local-only size
```
We currently don't have a metric for local-only layers.
The number of in-flight layers in the upload queue is just that, so, let
Prometheus scrape it.
The metric is two counters (started and finished).
The delta is the amount of in-flight uploads in the queue.
The metrics are incremented in the respective `call_unfinished_metric_*`
functions.
These track ongoing operations by file_kind and op_kind.
We only need this metric for layer uploads, so, there's the new
RemoteTimelineClientMetricsCallTrackSize type that forces all call sites
to decide whether they want the size tracked or not.
If we find that other file_kinds or op_kinds (metadata
uploads, layer downloads, layer deletes) are interesting, we can just
enable them, and they'll be just another label combination within the
metrics that this PR adds.
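A hedged sketch of the two-counter idea using the prometheus crate; the metric and label names here are assumptions, not the exact ones added by this PR:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounterVec};

// Bumped when an upload queue operation starts and when it finishes.
// The in-flight count is `started - finished`, computed at query time
// in PromQL, e.g.: calls_started - calls_finished.
static CALLS_STARTED: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "remote_timeline_client_calls_started",
        "Upload queue operations started",
        &["file_kind", "op_kind"]
    )
    .unwrap()
});

static CALLS_FINISHED: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "remote_timeline_client_calls_finished",
        "Upload queue operations finished",
        &["file_kind", "op_kind"]
    )
    .unwrap()
});
```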
fixes https://github.com/neondatabase/neon/issues/3922
Add a simple disarmable drop guard to log if a request is cancelled before
it is completed. We currently don't have this, which makes it difficult
to know when a request was dropped.
This patch extends the libmetrics logging setup functionality with a
`tracing` layer that increments a Prometheus counter each time we log a
log message. We have the counter per tracing event level. This allows
for monitoring WARN and ERR log volume without parsing the log. Also, it
would allow cross-checking whether logs got dropped on the way into
Loki.
It would be nicer if we could hook deeper into the tracing logging
layer, to avoid evaluating the filter twice.
But I don't know how to do it.
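A rough sketch of the idea (not the libmetrics implementation), using plain atomics instead of Prometheus counters to stay self-contained:

```rust
use tracing::{Event, Level, Subscriber};
use tracing_subscriber::layer::{Context, Layer};

// A tracing layer that bumps a per-level counter for every emitted event;
// in the real setup this would be an IntCounterVec labeled by level.
struct LogEventCounter {
    errors: std::sync::atomic::AtomicU64,
    warnings: std::sync::atomic::AtomicU64,
}

impl<S: Subscriber> Layer<S> for LogEventCounter {
    fn on_event(&self, event: &Event<'_>, _ctx: Context<'_, S>) {
        use std::sync::atomic::Ordering::Relaxed;
        let level = event.metadata().level();
        if *level == Level::ERROR {
            self.errors.fetch_add(1, Relaxed);
        } else if *level == Level::WARN {
            self.warnings.fetch_add(1, Relaxed);
        }
    }
}
```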
Make several attempts to get the spec from the control plane, retrying network
errors and all reasonable HTTP response codes. Do not hang waiting for the
spec without confirmation from the control plane that the compute is known
and is in the `Empty` state.
Adjust the way we track the `total_startup_ms` metric: it should be
calculated from the moment we received the spec, not from the moment
`compute_ctl` started. Also introduce a new `wait_for_spec_ms` metric
to track the time spent sleeping and waiting for the spec to be delivered
from the control plane.
Part of neondatabase/cloud#3533
Changes the vm-informant's postgres connection string's dbname from
"neondb" (which sometimes doesn't exist) to "postgres" (which
_hopefully_ should exist more often?).
Currently there are a handful of VMs in prod that aren't working with
autoscaling because they don't have the "neondb" database.
The vm-informant doesn't require any database in particular; it's just
connecting as `cloud_admin` to be able to adjust the file cache
settings.
See https://github.com/neondatabase/neon/pull/3991
This brings the changes back, with the right way to use the new `toml_edit`
to deserialize values, and a unit test for it.
All non-trivial updates are extracted into separate commits; also the `cargo hakari` data and its manifest format were updated.
3 sets of crates remain unupdated:
* `base64` — touches proxy in a lot of places and changed its api (by 0.21 version) quite strongly since our version (0.13).
* `opentelemetry` and `opentelemetry-*` crates
```
error[E0308]: mismatched types
--> libs/tracing-utils/src/http.rs:65:21
|
65 | span.set_parent(parent_ctx);
| ---------- ^^^^^^^^^^ expected struct `opentelemetry_api::context::Context`, found struct `opentelemetry::Context`
| |
| arguments to this method are incorrect
|
= note: struct `opentelemetry::Context` and struct `opentelemetry_api::context::Context` have similar names, but are actually distinct types
note: struct `opentelemetry::Context` is defined in crate `opentelemetry_api`
--> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/opentelemetry_api-0.19.0/src/context.rs:77:1
|
77 | pub struct Context {
| ^^^^^^^^^^^^^^^^^^
note: struct `opentelemetry_api::context::Context` is defined in crate `opentelemetry_api`
--> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/opentelemetry_api-0.18.0/src/context.rs:77:1
|
77 | pub struct Context {
| ^^^^^^^^^^^^^^^^^^
= note: perhaps two different versions of crate `opentelemetry_api` are being used?
note: associated function defined here
--> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-opentelemetry-0.18.0/src/span_ext.rs:43:8
|
43 | fn set_parent(&self, cx: Context);
| ^^^^^^^^^^
For more information about this error, try `rustc --explain E0308`.
error: could not compile `tracing-utils` due to previous error
warning: build failed, waiting for other jobs to finish...
error: could not compile `tracing-utils` due to previous error
```
`tracing-opentelemetry` version `0.19`, which is supposed to have the update we need, is not yet released.
* similarly, the `rustls`, `tokio-rustls`, `rustls-*` and `tls-listener` crates have the same issue:
```
error[E0308]: mismatched types
--> libs/postgres_backend/tests/simple_select.rs:112:78
|
112 | let mut make_tls_connect = tokio_postgres_rustls::MakeRustlsConnect::new(client_cfg);
| --------------------------------------------- ^^^^^^^^^^ expected struct `rustls::client::client_conn::ClientConfig`, found struct `ClientConfig`
| |
| arguments to this function are incorrect
|
= note: struct `ClientConfig` and struct `rustls::client::client_conn::ClientConfig` have similar names, but are actually distinct types
note: struct `ClientConfig` is defined in crate `rustls`
--> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/rustls-0.21.0/src/client/client_conn.rs:125:1
|
125 | pub struct ClientConfig {
| ^^^^^^^^^^^^^^^^^^^^^^^
note: struct `rustls::client::client_conn::ClientConfig` is defined in crate `rustls`
--> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/rustls-0.20.8/src/client/client_conn.rs:91:1
|
91 | pub struct ClientConfig {
| ^^^^^^^^^^^^^^^^^^^^^^^
= note: perhaps two different versions of crate `rustls` are being used?
note: associated function defined here
--> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-postgres-rustls-0.9.0/src/lib.rs:23:12
|
23 | pub fn new(config: ClientConfig) -> Self {
| ^^^
For more information about this error, try `rustc --explain E0308`.
error: could not compile `postgres_backend` due to previous error
warning: build failed, waiting for other jobs to finish...
```
* aws crates: I could not make the new API work with the bucket endpoint overload, and the console e2e tests failed.
The rest of our tests passed; further investigation is worth doing in https://github.com/neondatabase/neon/issues/4008
Initially, the idea was to ensure that when we come to check data
availability, the special service table already contains one row, so that
if we lose it for some reason, we error out.
Yet, to do the availability check we start the compute anyway! So it
doesn't really add any value, but it affects each compute start, since we
update at least one row in the database. Also, this writes some WAL, so
if the timeline is close to `neon.max_cluster_size` it could prevent the
compute from starting up.
That said, do CREATE TABLE IF NOT EXISTS + UPSERT right in the
`/check_writability` handler.
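A hedged sketch of that handler logic using tokio-postgres; the table and column names are placeholders, not necessarily what compute_ctl actually uses:

```rust
// Create the service table lazily and upsert a single row to prove the
// timeline is writable, instead of relying on a pre-seeded row.
async fn check_writability(
    client: &tokio_postgres::Client,
) -> Result<(), tokio_postgres::Error> {
    client
        .batch_execute(
            "CREATE TABLE IF NOT EXISTS health_check (id int PRIMARY KEY, updated_at timestamptz);
             INSERT INTO health_check (id, updated_at) VALUES (1, now())
             ON CONFLICT (id) DO UPDATE SET updated_at = now();",
        )
        .await
}
```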
Before this patch, if a tenant would override its eviction_policy
setting to use a lower LayerAccessThreshold::threshold than the
`evictions_low_residence_duration_metric_threshold`, the evictions done
for that tenant would count towards the
`evictions_with_low_residence_duration` metric.
That metric is used to identify premature evictions, commonly triggered
by disk-usage-based eviction under disk pressure.
We don't want that to happen for the legitimate evictions of the tenant
that overrides its eviction_policy.
So, this patch
- moves the setting into TenantConf
- adds test coverage
- updates the staging & prod yamls
Forward Compatibility:
Software before this patch will ignore the new tenant conf field and use
the global one instead.
So we can roll back safely.
Backward Compatibility:
Parsing old configs with software as of this patch will fail in
`PageServerConf::parse_and_validate` with error
`unrecognized pageserver option 'evictions_low_residence_duration_metric_threshold'`
if the option is still present in the global section.
We deal with this by updating the configs in Ansible.
fixes https://github.com/neondatabase/neon/issues/3940
With this commit one can request compute reconfiguration
from the running `compute_ctl` with compute in `Running` state
by sending a new spec:
```shell
curl -d "{\"spec\": $(cat ./compute-spec-new.json)}" http://localhost:3080/configure
```
Internally, we start a separate configurator thread that waits in a loop on a
`Condvar` for the `ConfigurationPending` compute state. Then it does the
reconfiguration, sets the compute back to the `Running` state and notifies
other waiters.
It will need some follow-ups, e.g. for retry logic for control-plane
requests, but should be useful for testing in the current state. This
shouldn't affect any existing environment, since computes are configured
in a different way there.
Resolves neondatabase/cloud#4433
To not be taken by surprise by an upstream git re-tag or by malicious activity,
let's verify the checksums of the extensions we download.
Also, unify the installation of `pg_graphql` and `pg_tiktoken`
with the other extensions.
We use the term "endpoint" in for compute Postgres nodes in the web UI
and user-facing documentation now. Adjust the nomenclature in the code.
This changes the name of the "neon_local pg" command to "neon_local
endpoint". Also adjust names of classes, variables etc. in the python
tests accordingly.
This also changes the directory structure so that endpoints are now
stored in:
.neon/endpoints/<endpoint id>
instead of:
.neon/pgdatadirs/tenants/<tenant_id>/<endpoint (node) name>
The tenant ID is no longer part of the path. That means that you
cannot have two endpoints with the same name/ID in two different
tenants anymore. That's consistent with how we treat endpoints in the
real control plane and proxy: the endpoint ID must be globally unique.
## Describe your changes
Do not forget to process required manual stuff after release
---------
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Reason and backtrace are added to the Broken state. The backtrace is automatically collected when a tenant enters the broken state. The format for the API, CLI and metrics is changed and unified to return the tenant state name in camel case. Previously, snake case was used for metrics and camel case for everything else. The tenant state field in the TenantInfo swagger spec is changed to contain the state name in a "slug" field and other fields (currently only reason and backtrace for the Broken variant) in a "data" field. To allow for this breaking change, the state was removed from the TenantInfo swagger spec, because it was not used anywhere.
Please note that the tenant's broken reason is not persisted on disk so the reason is lost when pageserver is restarted.
Requires changes to grafana dashboard that monitors tenant states.
Closes #3001
---------
Co-authored-by: theirix <theirix@gmail.com>
## Issue ticket number and link
#3673
---------
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
When no SNI is provided, use the default certificate; otherwise we can't
get to the options parameter, which can also be used to set the endpoint name.
This means that the non-SNI flow will not work for CNAME domains in verify-full
mode.
All non-trivial updates are extracted into separate commits; also the `cargo
hakari` data and its manifest format were updated.
3 sets of crates remain unupdated:
* `base64` — touches proxy in a lot of places and changed its api (by
0.21 version) quite strongly since our version (0.13).
* `opentelemetry` and `opentelemetry-*` crates
```
error[E0308]: mismatched types
--> libs/tracing-utils/src/http.rs:65:21
|
65 | span.set_parent(parent_ctx);
| ---------- ^^^^^^^^^^ expected struct `opentelemetry_api::context::Context`, found struct `opentelemetry::Context`
| |
| arguments to this method are incorrect
|
= note: struct `opentelemetry::Context` and struct `opentelemetry_api::context::Context` have similar names, but are actually distinct types
note: struct `opentelemetry::Context` is defined in crate `opentelemetry_api`
--> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/opentelemetry_api-0.19.0/src/context.rs:77:1
|
77 | pub struct Context {
| ^^^^^^^^^^^^^^^^^^
note: struct `opentelemetry_api::context::Context` is defined in crate `opentelemetry_api`
--> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/opentelemetry_api-0.18.0/src/context.rs:77:1
|
77 | pub struct Context {
| ^^^^^^^^^^^^^^^^^^
= note: perhaps two different versions of crate `opentelemetry_api` are being used?
note: associated function defined here
--> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-opentelemetry-0.18.0/src/span_ext.rs:43:8
|
43 | fn set_parent(&self, cx: Context);
| ^^^^^^^^^^
For more information about this error, try `rustc --explain E0308`.
error: could not compile `tracing-utils` due to previous error
warning: build failed, waiting for other jobs to finish...
error: could not compile `tracing-utils` due to previous error
```
`tracing-opentelemetry` version `0.19`, which is supposed to have the update
we need, is not yet released.
* similarly, the `rustls`, `tokio-rustls`, `rustls-*` and `tls-listener`
crates have the same issue:
```
error[E0308]: mismatched types
--> libs/postgres_backend/tests/simple_select.rs:112:78
|
112 | let mut make_tls_connect = tokio_postgres_rustls::MakeRustlsConnect::new(client_cfg);
| --------------------------------------------- ^^^^^^^^^^ expected struct `rustls::client::client_conn::ClientConfig`, found struct `ClientConfig`
| |
| arguments to this function are incorrect
|
= note: struct `ClientConfig` and struct `rustls::client::client_conn::ClientConfig` have similar names, but are actually distinct types
note: struct `ClientConfig` is defined in crate `rustls`
--> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/rustls-0.21.0/src/client/client_conn.rs:125:1
|
125 | pub struct ClientConfig {
| ^^^^^^^^^^^^^^^^^^^^^^^
note: struct `rustls::client::client_conn::ClientConfig` is defined in crate `rustls`
--> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/rustls-0.20.8/src/client/client_conn.rs:91:1
|
91 | pub struct ClientConfig {
| ^^^^^^^^^^^^^^^^^^^^^^^
= note: perhaps two different versions of crate `rustls` are being used?
note: associated function defined here
--> /Users/someonetoignore/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-postgres-rustls-0.9.0/src/lib.rs:23:12
|
23 | pub fn new(config: ClientConfig) -> Self {
| ^^^
For more information about this error, try `rustc --explain E0308`.
error: could not compile `postgres_backend` due to previous error
warning: build failed, waiting for other jobs to finish...
```
* aws crates: I could not make the new API work with the bucket endpoint
overload, and the console e2e tests failed.
The rest of our tests passed; further investigation is worth doing in
https://github.com/neondatabase/neon/issues/4008
The PR replaces the module-function-based walreceiver interface with a
`WalReceiver` struct that now exposes a few public methods: `new`, `start`
and `stop` (see the sketch after the list below).
Later, the same struct is planned to be used for getting walreceiver
stats (and, maybe, other extra data) to display during missing-WAL
errors for https://github.com/neondatabase/neon/issues/2106
The change also required extra logic changes:
* due to the `WalReceiver` struct added, it became easier to pass `ctx`
and later do a `detached_child` instead of
bfee412701/pageserver/src/tenant/timeline.rs (L1379-L1381)
* `WalReceiver::start` which is now the public API to start the
walreceiver, could return an `Err` which now may turn a tenant into
`Broken`, same as the timeline that it tries to load during startup.
* `WalReceiverConf` was added to group walreceiver parameters from
pageserver's tenant config
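A simplified sketch of the resulting interface; the field types and method signatures are illustrative, not the real pageserver definitions:

```rust
// Groups the walreceiver parameters taken from the pageserver's tenant config.
struct WalReceiverConf {
    wal_connect_timeout: std::time::Duration,
    lagging_wal_timeout: std::time::Duration,
    max_lsn_wal_lag: u64,
}

struct WalReceiver {
    conf: WalReceiverConf,
    // handle to the background task, broker subscription, etc. (elided)
}

impl WalReceiver {
    fn new(conf: WalReceiverConf) -> Self {
        WalReceiver { conf }
    }

    fn start(&self) -> anyhow::Result<()> {
        // Spawn the connection manager task. An Err here may turn the
        // tenant/timeline Broken, as described above.
        Ok(())
    }

    async fn stop(&self) {
        // Signal shutdown and wait for the background task to exit.
    }
}
```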
Sometimes it contained real values, sometimes just defaults if the
spec was not received yet. Make the state clearer by making it an
Option instead.
One consequence is that if some of the required settings like
neon.tenant_id are missing from the spec file sent to the /configure
endpoint, it is spotted earlier and you get an immediate HTTP error
response. Not that it matters very much, but it's nicer nevertheless.
Aarch64 doesn't implement some old syscalls like open and select. Use
openat instead of open to check if seccomp is supported. Leave both
select and pselect6 in the allowlist since we don't call select syscall
directly and may hope that libc will call pselect6 on aarch64.
To check whether some syscall is supported, it is possible to use
`scmp_sys_resolver` from the seccomp package:
```
> apt install seccomp
> scmp_sys_resolver -a x86_64 select
23
> scmp_sys_resolver -a aarch64 select
-10101
> scmp_sys_resolver -a aarch64 pselect6
72
```
A negative value means that the syscall is not supported.
Another cross-check is to look up the actual syscall table in
`unistd.h`. To resolve all the macros, one can use `gcc -E`, as it is
done in the `dump_sys_aarch64()` function in
libseccomp/src/arch-syscall-validate.
---------
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
TCP_KEEPALIVE is not enabled by default, so this prevents hung-up connections
in case of abrupt client termination. Add a 'closed' flag to PostgresBackendReader
and pass it during handle join to prevent attempts to read from the socket if we
errored out previously -- now, with timeouts, this is a common situation.
It looks like
2023-04-10T18:08:37.493448Z INFO {cid=68}:WAL
receiver{ttid=59f91ad4e821ab374f9ccdf918da3a85/16438f99d61572c72f0c7b0ed772785d}:
terminated: timed out
Presumably fixes https://github.com/neondatabase/neon/issues/3971
Aarch64 doesn't implement some old syscalls like open and select. Use
openat instead of open to check if seccomp is supported. Leave both
select and pselect6 in the allowlist since we don't call select syscall
directly and may hope that libc will call pselect6 on aarch64.
To check whether some syscall is supported, it is possible to use
`scmp_sys_resolver` from the seccomp package:
```
> apt install seccomp
> scmp_sys_resolver -a x86_64 select
23
> scmp_sys_resolver -a aarch64 select
-10101
> scmp_sys_resolver -a aarch64 pselect6
72
```
A negative value means that the syscall is not supported.
Another cross-check is to look up the actual syscall table
in `unistd.h`. To resolve all the macros, one can use `gcc -E`, as
it is done in the `dump_sys_aarch64()` function in
libseccomp/src/arch-syscall-validate.
This is in preparation for using compute_ctl to launch postgres nodes
in the neon_local control plane. And it seems like a good idea to
separate the public interfaces anyway.
One non-mechanical change here is that the 'metrics' field is moved
under the Mutex, instead of using atomics. We were not using atomics
for performance but for convenience here, and it seems more clear to
not use atomics in the model for the HTTP response type.
We have enabled prefetch by default, let's use this in Nightly
Benchmarks:
- effective_io_concurrency=100 by default (instead of 32)
- maintenance_io_concurrency=100 by default (instead of 32)
Rename `neon-captest-prefetch` to `neon-captest-new` (for pgbench with
initialisation) and `neon-captest-reuse` (for OLAP scenarios)
Replaces `Box<(dyn io::AsyncRead + Unpin + Send + Sync + 'static)>` with
`impl io::AsyncRead + Unpin + Send + Sync + 'static` usages in the
`RemoteStorage` interface, to make it closer to
[`#![feature(async_fn_in_trait)]`](https://blog.rust-lang.org/inside-rust/2022/11/17/async-fn-in-trait-nightly.html)
For `GenericRemoteStorage`, replaces `type Target = dyn RemoteStorage`
with another impl with `RemoteStorage` methods inside it.
We could reuse the trait, but that would require importing the trait in every
file where it's used, and would take us farther from the unstable feature.
After this PR, I've managed to create a patch with the changes:
https://github.com/neondatabase/neon/compare/kb/less-dyn-storage...kb/nightly-async-trait?expand=1
The current rust implementation does not like recursive async trait calls,
so `UnreliableWrapper` was removed: it contained a
`GenericRemoteStorage` that implemented the `RemoteStorage` trait, and
itself implemented the trait, which nightly rustc did not like and
proposed to box the future.
Similarly, `GenericRemoteStorage` cannot implement `RemoteStorage` for
nightly rustc to work, since it calls various remote storages' methods
internally.
I've compiled current `main` and the nightly branch both with `time env
RUSTC_WRAPPER="" cargo +nightly build --all --timings` command, and got
```
Finished dev [optimized + debuginfo] target(s) in 2m 04s
env RUSTC_WRAPPER="" cargo +nightly build --all --timings 1283.19s user 50.40s system 1074% cpu 2:04.15 total
for the new feature tried and
Finished dev [optimized + debuginfo] target(s) in 2m 40s
env RUSTC_WRAPPER="" cargo +nightly build --all --timings 1288.59s user 52.06s system 834% cpu 2:40.71 total
for the old async_trait approach.
```
On my machine, the `remote_storage` lib compilation takes ~10 less time
with the nightly feature (left) than the regular main (right).

Full cargo reports are available at
[timings.zip](https://github.com/neondatabase/neon/files/11179369/timings.zip)
The old code here ignored non-wildcard common names and passed None instead. With my recent changes
I started throwing an error in that case. The old logic doesn't seem like a great choice, so instead
of passing None I now actually set non-wildcard common names too. That way it is possible to avoid
handling the None case in downstream code.
This commit adds an option to start compute without spec and then pass
it a valid spec via `POST /configure` API endpoint. This is a main
prerequisite for maintaining the pool of compute nodes in the
control-plane.
For example:
1. Start compute with
```shell
cargo run --bin compute_ctl -- -i no-compute \
-p http://localhost:9095 \
-D compute_pgdata \
-C "postgresql://cloud_admin@127.0.0.1:5434/postgres" \
-b ./pg_install/v15/bin/postgres
```
2. Configure it with
```shell
curl -d "{\"spec\": $(cat ./compute-spec.json)}" http://localhost:3080/configure
```
Internally, it's implemented using a `Condvar` + `Mutex`. The compute spec
is moved under the Mutex, as it can now be updated in the HTTP handler.
Also, the `RwLock` was replaced with a `Mutex` because the latter works well
with `Condvar`.
First part of the neondatabase/cloud#4433
This allows skipping compatibility tests based on the `CHECK_ONDISK_DATA_COMPATIBILITY` environment variable. When the variable is missing (the default), compatibility tests won't be run.
Otherwise it can lag a lot, preventing WAL segment cleanup. Also, take the max
of wal_backup_lsn on update; pulling it down is pointless.
Should help with https://github.com/neondatabase/neon/issues/3957, but
will not fix it completely.
## Describe your changes
In https://github.com/neondatabase/cloud/issues/4354 we are making
scheduling of projects based on available disk space and overcommit, so
we need to know disk size and just in case instance type of the
pageserver
## Issue ticket number and link
https://github.com/neondatabase/cloud/issues/4354
In real-environment testing we noted that the disk-usage-based eviction sails 1
percentage point above the configured value, which might be a source of
confusion ("I configured 85% but the pageserver sails at 86%"), so it is
better to get rid of that confusion now.
Co-authored-by: Christian Schwarz <christian@neon.tech>
Make it possible to specify a directory where the proxy will look for
extra certificates. The proxy will iterate through the subdirs of that directory
and load the `key.pem` and `cert.pem` files from each subdir. The certs directory
structure may look like this:
certs
|--example.com
| |--key.pem
| |--cert.pem
|--foo.bar
|--key.pem
|--cert.pem
Actual domain names are taken from the certs and keys; subdir names are
ignored.
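A minimal sketch of that lookup scheme; error handling and the actual TLS config construction are elided:

```rust
use std::path::{Path, PathBuf};

// Walk the subdirectories of the certs dir and pick up the key.pem/cert.pem
// pair from each one; the domain names come from the certificates themselves.
fn collect_cert_pairs(certs_dir: &Path) -> std::io::Result<Vec<(PathBuf, PathBuf)>> {
    let mut pairs = Vec::new();
    for entry in std::fs::read_dir(certs_dir)? {
        let subdir = entry?.path();
        if subdir.is_dir() {
            let key = subdir.join("key.pem");
            let cert = subdir.join("cert.pem");
            if key.exists() && cert.exists() {
                pairs.push((key, cert));
            }
        }
    }
    Ok(pairs)
}
```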
This PR adds posting a comment with test results. Each workflow run
updates the comment with new results.
The layout and the information that we post can be changed to fit our needs;
right now, it contains failed tests and tests which changed status after a
rerun (i.e. flaky tests).
This PR adds a plugin that automatically reruns (up to 3 times) flaky
tests. Internally, it uses data from `TEST_RESULT_CONNSTR` database and
`pytest-rerunfailures` plugin.
As a first approximation, we consider a test flaky if it has failed on
the main branch in the last 10 days.
Flaky tests are fetched by `scripts/flaky_tests.py` script (it's
possible to use it in a standalone mode to learn which tests are flaky),
stored to a JSON file, and then the file is passed to the pytest plugin.
In S3, the pageserver only lists tenants (prefixes), no other keys.
Remove the list operation from the API, since the S3 impl does not seem to
work normally and is not used anyway.
For some reason, the `tracing::instrument` proc macro doesn't always print
elements specified via `fields()`, or even show that it's impossible
(e.g. there's no Display impl).
Work around this using the `?foo` notation.
Before:
2023-04-03T14:48:06.017504Z INFO handle_client🤝 received SslRequest
After:
2023-04-03T14:51:24.424176Z INFO handle_client{session_id=7bd07be8-3462-404e-8ccc-0a5332bf3ace}🤝 received SslRequest
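For illustration, a minimal sketch of the workaround; the function and span name are hypothetical and the example assumes the uuid crate:

```rust
use tracing::info_span;
use uuid::Uuid;

// `?session_id` records the field with its Debug implementation, which
// works even when the type has no Display impl.
fn make_client_span(session_id: Uuid) -> tracing::Span {
    info_span!("handle_client", ?session_id)
}
```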
Leave disk_usage_based_eviction above the current max usage in prod
(82%ish), so that deploying this commit won't trigger
disk_usage_based_eviction.
As indicated in the TODO, we'll decrease the value to 80% later.
Also update the staging YAMLs to use the anchor syntax for
`evictions_low_residence_duration_metric_threshold` like we do in the
prod YAMLs as of this patch.
This will help log analysis as the counterpart of already logging all
remote download needs and downloads. Ended up with an easily regexable
output in the final round.
This is the feedback originating from the pageserver, so change the previous
confusing names to
s/ReplicationFeedback/PageserverFeedback
s/ps_writelsn/last_receive_lsn
s/ps_flushlsn/disk_consistent_lsn
s/ps_apply_lsn/remote_consistent_lsn
I haven't changed the on-the-wire format, to keep compatibility. However,
understanding of the new field names is added to compute, so once all computes
receive this patch we can change the wire names as well. Safekeepers/pageservers
are deployed at roughly the same time and it is ok to live without feedback
during that short period, so this is not a problem there.
Add a condition to switch walreceiver connection to safekeeper that is
located in the same availability zone. Switch happens when commit_lsn of
a candidate is not less than commit_lsn from the active connection. This
condition is expected not to trigger instantly, because commit_lsn of a
current connection is usually greater than commit_lsn of updates from
the broker. That means that if WAL is written continuously, switch can
take a lot of time, but it should happen eventually.
Now protoc 3.15+ is required for building neon.
Fixes https://github.com/neondatabase/neon/issues/3200
Instead of checking the neon.safekeepers GUC value in an existing pg node data
dir, just always run sync-safekeepers when safekeepers are configured. Without
this change, creation of a new compute didn't run it. That's ok for a new
timeline/branch (it doesn't return anything useful anyway, and the LSN is known
by the pageserver), but restarting a compute for an existing timeline bore the
risk of getting a basebackup not at the latest LSN, i.e. basically broken -- it
might not have prev_lsn, and even if it did, walproposer would complain anyway.
fixes https://github.com/neondatabase/neon/issues/2963
This patch adds a pageserver-global background loop that evicts layers
in response to a shortage of available bytes in the $repo/tenants
directory's filesystem.
The loop runs periodically at a configurable `period`.
Each loop iteration uses `statvfs` to determine filesystem-level space
usage. It compares the returned usage data against two different types
of thresholds. The iteration tries to evict layers until app-internal
accounting says we should be below the thresholds. We cross-check this
internal accounting with the real world by making another `statvfs` at
the end of the iteration. We're good if that second statvfs shows that
we're _actually_ below the configured thresholds. If we're still above
one or more thresholds, we emit a warning log message, leaving it to the
operator to investigate further.
There are two thresholds:
- `max_usage_pct` is the maximum allowed usage, expressed in percent
of the total filesystem space. If the actual usage is higher, the
threshold is exceeded.
- `min_avail_bytes` is the minimum absolute available space in bytes. If the
actual available space is lower, the threshold is exceeded.
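A minimal sketch of the two threshold checks (the struct and field names here are illustrative, not the actual pageserver config types):
```
// Filesystem usage as reported by statvfs, reduced to two numbers.
struct DiskUsage {
    total_bytes: u64,
    avail_bytes: u64,
}

struct Thresholds {
    max_usage_pct: u64,   // e.g. 80 (percent of total filesystem space)
    min_avail_bytes: u64, // e.g. 10 * 1024 * 1024 * 1024
}

// Returns true if either threshold is exceeded, i.e. eviction pressure exists.
fn has_pressure(usage: &DiskUsage, t: &Thresholds) -> bool {
    let used_bytes = usage.total_bytes - usage.avail_bytes;
    let usage_pct = used_bytes * 100 / usage.total_bytes.max(1);
    usage_pct > t.max_usage_pct || usage.avail_bytes < t.min_avail_bytes
}
```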
The iteration evicts layers in LRU fashion with a reservation of up to
`tenant_min_resident_size` bytes of the most recent layers per tenant.
The layers not part of the per-tenant reservation are evicted
least-recently-used first until we're below all thresholds. The
`tenant_min_resident_size` can be overridden per tenant as
`min_resident_size_override` (bytes).
In addition to the loop, there is also an HTTP endpoint to perform one
loop iteration synchronously with the request. The endpoint takes an
absolute number of bytes that the iteration needs to evict before
pressure is relieved. The tests use this endpoint, which is a great
simplification over setting up loopback mounts in the tests, which would
be required to test the statvfs part of the implementation. We will rely
on manual testing in staging to test the statvfs parts.
The HTTP endpoint is also handy in emergencies where an operator wants
the pageserver to evict a given amount of space _now_. Hence, its
arguments are documented in openapi_spec.yml. The response type isn't
documented, though, because we don't consider it stable. The endpoint
should _not_ be used by Console, but it could be used by on-call.
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
## Describe your changes
Fix prefetch for parallel bitmap scan
Currently we can't tell from the logs whether shutting down tracing takes a
long time or not. We do see that shutting down computes gets delayed for
some reason and hits the grace period limit. Move the shutdown
message slightly later, to the point where there is nothing left to do but
exit.
Before this patch, the range from which the random delay is picked is at
minimum 10 seconds.
With this patch, the delay is bounded by whatever the given `period`
is, and is zero if the period is Duration::ZERO.
Motivation for this: the disk usage eviction tests that we'll add in
https://github.com/neondatabase/neon/pull/3905 need to wait for the disk
usage eviction background loop to do its job.
They set a period of 1s.
It seems wasteful to wait 10 seconds in the tests.
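A sketch of the new delay selection (the helper name is hypothetical; assumes rand 0.8+, where `gen_range` works over `Duration`):
```
use rand::Rng;
use std::time::Duration;

// Pick the initial random delay: bounded by `period`, zero if `period` is zero.
fn initial_random_delay(period: Duration) -> Duration {
    if period.is_zero() {
        return Duration::ZERO;
    }
    rand::thread_rng().gen_range(Duration::ZERO..=period)
}
```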
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Describe your changes
Get rid of the legacy labeling. Also, `neon_region_slug` with the same
value as `neon_region` doesn't make much sense, so just drop it. This
allows us to drop the relabeling from zenith to neon in the log
collector.
## Describe your changes
In [this linter
run](https://github.com/neondatabase/cloud/actions/runs/4553032319/jobs/8029101300?pr=4391)
I accidentally found out that the spec is invalid. Referencing other schemas
from properties should be done the way I changed it to.
I could not find documentation specifically for embedding schemas in
`components.schemas`, but it seems the approach is inherited from
JSON Schema:
https://json-schema.org/understanding-json-schema/structuring.html#ref
With fresh rustc brought by
https://github.com/neondatabase/neon/pull/3902, we can use
`std::pin::pin!` macro instead of the tokio one.
One place did not need the macro at all, other places were adjusted.
neon_local sends SIGQUIT, which otherwise dumps core by default. Also, remove
obsolete install_shutdown_handlers; in all binaries it was overridden by
ShutdownSignals::handle later.
ref https://github.com/neondatabase/neon/issues/3847
## Describe your changes
Monitor free space in the local file system and shrink the local file cache
size if it is under the watermark.
Neon is using local storage for temp files (temp tables + intermediate
results), unlogged relations, and the local file cache.
Ideally, all space not used for temporary files should be used for the local
file cache.
Temporary files and even unlogged relations are intended to be short-lived
(because they can be lost at any moment if the compute restarts).
So the policy is to overcommit the local cache size and shrink it if there
is not enough free space.
Since temporary files are expected to be needed only for a short time,
there is no need to permanently shrink the local file cache size. Instead, we
just throw away the least recently accessed entries from the local file cache,
releasing some space on the local disk.
---------
Co-authored-by: sharnoff <sharnoff@neon.tech>
As of #3867, we do artificial layer accesses to layers that will be
needed after the next restart, but not until then because of caches.
With this patch, we also do that for the accesses that the synthetic
size calculation worker does if consumption metrics are enabled.
The actual size calculation is not of importance, but we need to
calculate all of the sizes, so we only call tenant::size::gather_inputs.
Co-authored-by: Christian Schwarz <christian@neon.tech>
Without this, we run it every p.period, which can be quite low. For
example, the running experiment with 3000 tenants in prod uses a period
of 1 minute.
Doing it once per p.threshold is enough to prevent eviction.
fix synthetic size for (last_record_lsn - gc_horizon) < initdb_lsn
Assume a single-timeline project.
If the gc_horizon covers all WAL (last_record_lsn < gc_horizon)
but we have written more data than just initdb, the synthetic
size calculation worker needs to calculate the logical size
at LSN initdb_lsn (Segment BranchStart).
Before this patch, that calculation would incorrectly return
the initial logical size calculation result that we cache in
Timeline::initial_logical_size, presumably because there
was confusion around initdb_lsn vs. the initial size calculation.
The fix is to hand out initialized_size() only if
the LSN matches.
The distinction in the metrics between "init logical size" and "logical
size" was also incorrect because of the above. So, remove it.
There was a special case for `size != 0`. This was to cover the case of
LogicalSize::empty_initial(), but `initial_part_end` is `None` in that
case, so the new `LogicalSize::initialized_size()` will return None
in that case as well.
Lastly, to prevent confusion like this in the future, rename all
occurrences of `init_lsn` to either just `lsn` or a more specific name.
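A simplified sketch of the fix (names are illustrative, not the actual Timeline API): the cached initial size is handed out only when the requested LSN is the one it was computed at.
```
#[derive(Clone, Copy, PartialEq, Eq)]
struct Lsn(u64);

struct LogicalSize {
    // LSN at which the initial logical size was computed;
    // None for LogicalSize::empty_initial().
    initial_part_end: Option<Lsn>,
    initial_size: Option<u64>,
}

impl LogicalSize {
    fn initialized_size(&self, lsn: Lsn) -> Option<u64> {
        match self.initial_part_end {
            Some(computed_at) if computed_at == lsn => self.initial_size,
            _ => None,
        }
    }
}
```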
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
whoami() was never called, 'is_test' was never set.
'restart()' might be useful, but it wasn't hooked up to the CLI, so it was
dead code. It's not clear what kind of a restart it should perform,
anyway: just restart Postgres, or re-initialize the data directory
from a fresh basebackup like "stop"+"start" does.
To avoid re-downloading evicted files on restart, re-compute logical
size and partitioning before each threshold based eviction run.
Cc: #3802
Co-authored-by: Christian Schwarz <christian@neon.tech>
* Add the REPOSITORY env to build args to avoid the following error when
executing without the credentials for the repository.
```
ERROR: Service 'compute' failed to build: Head
"https://369495373322.dkr.ecr.eu-central-1.amazonaws.com/v2/compute-node-v15/manifests/2221":
no basic auth credentials
```
* update the tag version in the documentation to support storage broker
## Describe your changes
Added a query param to the detach API.
Allow removing the local state of a tenant even if it's not in memory
(following the ignore API).
## Issue ticket number and link
#3828
---------
Co-authored-by: Kirill Bulatov <kirill@neon.tech>
The PR enforces the current newest `index_part.json` format in the type
system (version `1`), not allowing any previous forms of it that were
used in the past.
Similarly, the code to mitigate the
https://github.com/neondatabase/neon/issues/3024 issue is now also
removed.
The current code does not produce old formats or extra files in
index_part.json; in the future we will be able to use
https://github.com/neondatabase/aversion or another approach to make
version transitions more explicit.
See https://neondb.slack.com/archives/C033RQ5SPDH/p1679134185248119 for
the justification on the breaking changes.
We read the pageserver connection string from the spec file, so let's
read the auth token from the same place.
We've been talking about pre-launching compute nodes that are not
associated with any particular tenant at startup, so that the spec
file is delivered to the compute node later. We cannot change the env
variables after the process has been launched.
We still pass the token to 'postgres' binary in the NEON_AUTH_TOKEN
env variable, but compute_ctl is now responsible for setting it.
## Describe your changes
https://neondb.slack.com/archives/C039YKBRZB4/p1679413279637059
- Remove the neon.safekeeper_token_env GUC. It was used to set the
name of an environment variable, which was then used in pageserver
and safekeeper connection strings in place of the
password. Instead, always look up the environment variable called
NEON_AUTH_TOKEN. That's what neon.safekeeper_token_env was always
set to in practice, and I don't see the need for the extra level of
indirection or configurability.
- Instead of substituting $NEON_AUTH_TOKEN in the connection strings,
pass $NEON_AUTH_TOKEN "out-of-band" as the password, when we connect
to the pageserver or safekeepers. That's simpler.
- Also use the password from $NEON_AUTH_TOKEN in compute_ctl, when it
connects to the pageserver to get the "base backup".
Walproposer used to live in the backend, while pagestore_smgr was an
extension. But now that both are part of the neon extension,
walproposer can access the same 'neon_tenant' and 'neon_timeline'
variables as the pageserver_smgr code.
This patch adds two metrics that will enable us to detect *thrashing* of
layers, i.e., repetitions of `eviction, on-demand-download, eviction,
... ` for a given layer.
The first metric counts all layer evictions per timeline. It requires no
further explanation. The second metric counts the layer evictions where
the layer was resident for less than a given threshold.
We can alert on increments to the second metric. The first metric will
serve as a baseline, and further, it's generally interesting, outside of
thrashing.
The second metric's threshold is configurable in PageServerConf and
defaults to 24h. The threshold value is reproduced as a label in the
metric because the counter's value is semantically tied to that
threshold. Since changes to the config and hence the label value are
infrequent, this will have low storage overhead in the metrics storage.
The data source to determine the time that the layer was resident is the
file's `mtime`. Using `mtime` is more of a crutch. It would be better if
Pageserver did its own persistent bookkeeping of residence change events
instead of relying on the filesystem. We had some discussion about this:
https://github.com/neondatabase/neon/pull/3809#issuecomment-1470448900
My position is that `mtime` is good enough for now. It can theoretically
jump forward if someone copies files without resetting `mtime`. But that
shouldn't happen in practice. Note that moving files back and forth
doesn't change `mtime`, nor does `chown` or `chmod`. Lastly, `rsync -a`,
which is typically used for filesystem-level backup / restore, correctly
syncs `mtime`.
I've added a label that identifies the data source to keep options open
for a future, better data source than `mtime`. Since this value will
stay the same for the time being, it's not a problem for metrics
storage.
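A sketch of the mtime-based residence check (hypothetical helper, not the actual metrics code): on eviction, derive how long the layer file was resident from its mtime and compare against the configured threshold.
```
use std::path::Path;
use std::time::{Duration, SystemTime};

fn was_short_residence(layer_path: &Path, threshold: Duration) -> std::io::Result<bool> {
    let mtime = std::fs::metadata(layer_path)?.modified()?;
    let resident_for = SystemTime::now()
        .duration_since(mtime)
        .unwrap_or(Duration::ZERO); // mtime in the future: treat as zero
    Ok(resident_for < threshold)
}
```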
refs https://github.com/neondatabase/neon/issues/3728
The control plane currently only supports EdDSA. We need to either teach
the storage to use EdDSA, or the control plane to use RSA. EdDSA is more
modern, so let's use that.
We could support both, but it would require a little more code and tests,
and we don't really need the flexibility since we control both sides.
This allows you to run without the 'openssl' binary as long as you
don't enable authentication. This becomes more important with the next
commit, which switches the JWT algorithm to EdDSA. LibreSSL does not
support EdDSA, and LibreSSL comes with macOS, so the next commit makes
it much more likely for the key generation to fail for macOS users.
To allow running without a keypair, don't generate the authentication
token in the 'neon_local init' step. Instead, generate a new token on
every request that needs one, using the private key.
Shrinks the total number of metrics collected for each timeline by
about 50%.
See https://github.com/neondatabase/neon/issues/2848. This doesn't fully
solve the problem, we still collect a lot of metrics even with this, but
this gives us a lot of headroom.
## Describe your changes
Add Error enum for tenant state response to allow better error handling
in mgmt api
## Issue ticket number and link
#2238
## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
Re-enable cgroup shenanigans in VMs, with some special care taken to
make sure that our version of cgroup-tools supports cgroup v2 (debian
bullseye does not, and probably won't because it requires a breaking
change in libcgroup).
This involves manually building libcgroup / cgroup-tools from source,
then copying the output into the final build stage.
We originally considered pulling the package from debian's testing repo
(which is up-to-date), but decided against it. Refer to the PR for more
details.
Prior work, for reference:
* 2153d2e0 - Run compute_ctl in a cgroup in VMs
* 1360361f - Fix missing VM cgconfig.conf
* 8dae8799 - Disable VM cgroup shenanigans
Create the `safekeeper_pg_io_bytes_total` metric to track the total amount of
bytes written/read in postgres connections to safekeepers. This metric
has the following labels:
- `client_az` – availability zone of the connection initiator, or
`"unknown"`
- `sk_az` – availability zone of the safekeeper, or `"unknown"`
- `app_name` – `application_name` of the postgres client
- `dir` – data direction, either `"read"` or `"write"`
- `same_az` – `"true"`, `"false"` or `"unknown"`. It can be derived from
`client_az` and `sk_az` and exists purely for convenience (see the sketch
below).
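A minimal sketch of how the `same_az` label can be derived (illustrative only, not the actual safekeeper code):
```
fn same_az_label(client_az: &str, sk_az: &str) -> &'static str {
    if client_az == "unknown" || sk_az == "unknown" {
        "unknown"
    } else if client_az == sk_az {
        "true"
    } else {
        "false"
    }
}
```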
This is implemented by passing availability zone in the connection
string, like this: `-c tenant_id=AAA timeline_id=BBB
availability-zone=AZ-1`.
Update ansible deployment scripts to add availability_zone argument
to safekeeper and pageserver in systemd service files.
This failure case was probably introduced by b220ba6, because earlier
the GC would always have run fast enough for a restart every 1s. However,
the test got added later, so we have just been lucky.
Fixes #3824 by allowing this error to happen.
This makes it possible to enable authentication only for the mgmt HTTP
API or the compute API. The HTTP API doesn't need to be directly
accessible from compute nodes, and it can be secured through network
policies. This also allows rolling out authentication in a piecemeal
fashion.
Otherwise they get lost. Normally the buffer is empty before the proxy pass,
but this is not the case with the pipeline mode of our npm driver; this fixes
the connection hangup introduced by b80fe41af3 for it.
fixes https://github.com/neondatabase/neon/issues/3822
There was a warning for trailing garbage after end-of-tar archive, but
it didn't always work. The reason is that we created a StreamReader
over the original copyin-stream, but performed the check for garbage
on the copyin-stream. There could be some garbage bytes buffered in
the StreamReader, which were not caught by the warning.
I considered turning the warning into a fatal error, aborting the
import, but I wasn't sure if we handle aborting the import properly.
Do we clean up the timeline directory on error? If we don't, we should
make that more robust, but that's a different story.
Also, normally a valid tar archive ends with two 512-byte blocks of zeros.
The tokio_tar crate stops at the first all-zeros block. Read and check
the second all-zeros block, and error out if it's not there, or contains
something unexpected.
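A hedged sketch of that end-of-archive check (not the actual import code; assumes a tokio `AsyncRead` and the anyhow crate): after tokio_tar stops at the first all-zeros block, read the second 512-byte block ourselves and verify it is also all zeros.
```
use tokio::io::{AsyncRead, AsyncReadExt};

async fn check_second_zero_block<R: AsyncRead + Unpin>(reader: &mut R) -> anyhow::Result<()> {
    let mut block = [0u8; 512];
    // Errors with UnexpectedEof if the second block is missing entirely.
    reader.read_exact(&mut block).await?;
    if block.iter().any(|&b| b != 0) {
        anyhow::bail!("unexpected non-zero data after end-of-tar marker");
    }
    Ok(())
}
```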
## Describe your changes
When we perform partitioning of the whole key space, we take into account
the actual ranges of the relations present in the database. So if we have a
relation with relid=1 and size 100, and a relation with relid=2 with size
200, then the result of KeySpace::partition may contain the partitions
<100000000..100000099> and <200000000..200000199>. Generated image
layers will have the same boundaries.
But when GC checks image coverage to find out whether an old layer is fully
covered by a newer image layer and so can be deleted, it takes into account
only the full key range. I.e. if there is a delta layer <100000000..300000000>,
then it will never be garbage collected, because the image layers
<100000000..100000099> and <200000000..200000199> do not completely
cover it. This is how it looks in practice:
000000067F000032AC00000A300000000000-000000067F000032AC00000A330000000000__000000000F761828
000000067F000032AC00000A31000000001F-000000067F000032AC00000A620000000005__0000000001696070-000000000442A551
000000067F000032AC00000A3300FFFFFFFF-000000067F000032AC00000A650100000000__000000000F761828
So there are two image layers covering the delta layer, but ... there is a
hole: A330000000000...A3300FFFFFFFF, and as a result the delta layer is not
collected.
## Issue ticket number and link
This PR is deeply related to #3673 because it addresses the same
problem: old layers are not garbage collected.
The test test_gc_old_layers.py in #3673 can be used to see the effect of
this patch.
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
- handle automatically fixable future clippy lints
- tune run-clippy.sh to remove macOS specifics which we no longer have
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Describe your changes
We have previously changed the neon-proxy to use RollingUpdate. This
should be enabled in the legacy proxy too, in order to avoid breaking
client connections and to allow, for example, backups to run even
during deployment. (https://github.com/neondatabase/neon/pull/3683)
## Issue ticket number and link
https://github.com/neondatabase/neon/issues/3333
In the previous implementation, an `ok_or_else(...)?` was used to cause a
"precondition error" on LayerMap::replace; however, we only see this
particular error if an L0 for which replace fails is not in the layer map
because it is not in `l0_delta_layers`. This changes it to
Replacement::NotFound instead, making it clearer that an error would
only be raised for actual precondition violations, like trying to replace a
layer with a completely unrelated layer.
## Describe your changes
Do not pin current block in BlockCursor
## Issue ticket number and link
See #3712
There are places (see get_reconstruct_data) in our code where a thread is
holding the read lock on layers and then tries to read a file and thus lock a
page cache slot. So we have an edge layers -> page cache slot in the
dependency graph.
At the same time (as Christian noticed), we can lock a page cache slot in
BlockCursor and then try to obtain a shared lock on layers.
So there is a backward edge page cache slot -> layers in the dependency graph,
which forms a loop and may cause a deadlock.
There are three possible fixes of the problem:
1. Perform compaction under `layers` shared lock. See PR #3732. It fixes
the problem but make it not possible to append any data to pageserver
until compaction is completed.
2. Do not hold the `layers` lock while accessing layers (not sure if it is
possible to do because it definitely introduces some new race
conditions).
3. Do not pin current pages in BlockCursor (this PR).
My experiments show that this cache in BlockCursor is not so useful:
the number of hits/misses for the cursor cache on a pgbench workload (-i -s
10/-c 10 -T 100/-c 10 -S -T 100):
```
hits: 163011
misses: 1023602
```
So the number of cache misses is about 10x larger.
And the results for read-only pgbench are mostly the same:
```
with cache: 14581
w/out cache: 14429
```
I moved management API v2 to ogen and the generated code seems to be
more strict about the content type. Let's set it properly, as it is JSON
after all.
Adds two new *global* metrics:
- pageserver_remote_ondemand_downloaded_layers_total
- pageserver_remote_ondemand_downloaded_bytes_total
An existing test is repurposed once more to check that we do get some
reasonable counts. These metrics replace guessing how much was on-demand
downloaded from the NIC RX bytes metric.
First part of #3745: This does not add the "(un)?avoidable" metric,
which I plan to add as a new metric, which will be a subset of the
counts of the metrics added here.
After enabling autoscaling, we faced the issue that customers are not
able to see the number of CPUs they are using at the moment. Therefore I've
added these two options:
1. a PostgreSQL function that customers can call whenever they want
2. a `compute_ctl` endpoint to show this number in the console
The recently added `unsafe-postgres` feature allows building pgx extensions
against Postgres forks that decided to change their ABI name (like us).
With that, we can build extensions without forking them, using stock
pgx. As this feature is new, a few manual version bumps were required.
Previously, we only accepted RS256. Seems like a pointless limitation,
and when I was testing it with RS512 tokens, it took me a while to
understand why it wasn't working.
Since we allow switching endpoints between different branches, it is important
to use a composite key. Otherwise, we may try to calculate the delta between
metric values for two different branches.
## Describe your changes
Try to re-establish the connection with the pageserver if a send fails, to
make pageserver restarts transparent for clients.
## Issue ticket number and link
https://github.com/neondatabase/neon/issues/1138
---------
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
1) Remove the allocation and data copy during each message read. Instead,
parsing functions now accept a BytesMut from whose data they form messages,
with pointers (e.g. in CopyData) pointing directly into the BytesMut buffer.
Accordingly, move ConnectionError, containing an IO error subtype, into
framed.rs, which provides this, and leave only ProtocolError in pq_proto.
2) Remove anyhow from pq_proto.
3) Move FeStartupPacket out of FeMessage. Now FeStartupPacket::parse returns it
directly, eliminating dead code where user wants startup packet but has to match
for others.
proxy stream.rs is adapted to framed.rs with minimal changes. It also benefits
from framed.rs improvements described above.
- Add support for splitting the async postgres_backend into read and write
halves. Safekeeper needs this for bidirectional streams. To this end,
encapsulate reading/writing postgres messages in framed.rs with split support,
without any additional changes (relying on BufRead for reading and a BytesMut
out buffer for writing).
- Use async postgres_backend throughout safekeeper (and in proxy auth link
part).
- In both safekeeper COPY streams, do read-write from the same thread/task with
select! for easier error handling.
- Tidy up finishing CopyBoth streams in safekeeper sending and receiving WAL
-- join split parts back catching errors from them before returning.
Initially I hoped to do that read-write without split at all, through polling
IO:
https://github.com/neondatabase/neon/pull/3522
However, that turned out to be more complicated than I initially expected
due to 1) borrow checking and 2) anonymous Future types. 1) required
`Rc<RefCell<...>>`, a non-Send construct, just to satisfy the checker; 2) can
be worked around with transmute. But this is so messy that I decided to keep
the split.
To untie the cyclic dependency between the sync and async versions of
postgres_backend, copy QueryError and some logging/error routines to
postgres_backend.rs. This is temporary glue to make commits smaller; the sync
version will be dropped completely by the upcoming commit.
This commit adds a basic HTTP API endpoint that allows scraping the
`pg_stat_statements` data and getting a list of slow queries. New
insights like cache hit rate and so on could be added later.
Extension `pg_stat_statements` is checked / created only if compute
tries to load the corresponding shared library. The latter is configured
by control-plane and currently covered with feature flag.
Co-authored by Eduard Dyckman (bird.duskpoet@gmail.com)
Adds a newtype that creates a span with request_id from
https://github.com/neondatabase/neon/pull/3708 for every HTTP request
served.
Moves request logging and error handlers under the new wrapper, so every
request-related event is now logged under the request span.
For compatibility reasons, the error handler is left on the general router,
since not every service uses the new handler wrappers yet.
We have a few tests that try to set image_creation_threshold, but it
didn't actually have any effect because we were missing some critical
code to load the setting from config file into memory.
The two modified tests in `test_remote_storage.py` perform
compaction and GC, and assert that GC removes some layers. That
only happens if new image layers are created by the
compaction. The tests explicitly disabled image layer creation by
setting image_creation_threshold to a high value, but it didn't
take effect because reading image_creation_threshold from the config
file was broken, which is why the tests passed. Fix the tests to
set image_creation_threshold low instead, so that GC has work to
do.
Change 'test_tenant_conf.py' so that it exercises the added code.
This might explain why we're apparently missing test coverage for GC
(issue #3415), although I didn't try to address that here, nor did I
check if this improves it.
## Describe your changes
Adds a field to the OpenAPI spec for the page server which describes the
`do_gc` command.
## Issue ticket number and link
#3669
* Stop allocating and maintaining 128MB hash table for last written
LSN cache as it is not needed in wal-redo.
* Do not require access to the initialized data directory. That
saves a few dozen megabytes of empty but initialized data directory.
Currently such directories occupy about 10% of the disk space
on the pageservers, as most tenants are empty.
* Move the shmem-initialization code to the extension instead of postgres.
The test already allows the background thread trying to checkpoint to
fail; however, the resulting log message is currently not allowed, thus
causing flakiness.
Before this patch, the persistent peer_horizon_lsn was always sent to the
walproposer, making it initially calculate the value as the max of the
persistent values and in turn pulling back the in-memory value. Instead, send
the in-memory value and take the max when the safekeeper sets it.
closes https://github.com/neondatabase/neon/issues/3752
## Describe your changes
Add timeout handling for remote downloads of layers: 120 seconds for each
operation.
Note that these downloads are retried N times.
## Issue ticket number and link
Fixes: #3672
---------
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Describe your changes
Restart the walredo process and retry applying walredo records in case of
abnormal walredo process termination.
## Issue ticket number and link
See #1700
Commit
0cf7fd0fb8
Compaction with on-demand download (#3598)
introduced a subtle bug: if we don't have to do on-demand downloads,
we only take one ROUND in fn compact() and exit early.
Thereby, we miss scheduling the index part upload for any layers
created by fn compact_inner().
Before that commit, we didn't have this problem.
So, this patch fixes it.
Since no regression test caught this, I went ahead and extended the
timeline size tests to assert that, if remote storage is configured,
1. pageserver_remote_physical_size matches the other physical sizes
2. file sizes reported by the layer map info endpoint match the other
physical size metrics
Without the pageserver code fix, the regression test would
fail at the physical size assertion, complaining that
one of the resident physical sizes != the remote physical size metric:
50790400.0 != 18399232.0
I figured out what the problem was by comparing the remote storage
and local directories, and noticed that the image layer
in the local directory wasn't present on the remote side.
Its size was exactly the difference:
50790400.0 - 18399232.0 = 32391168.0
fixes https://github.com/neondatabase/neon/issues/3738
- use parse_metrics() in all places where we parse Prometheus metrics
- query_all: make `filter` argument optional
- encourage using properly parsed, typed metrics by changing get_metrics()
to return already-parsed metrics. The new get_metric_str() method,
like in the Safekeeper type, returns the raw text response.
Before this patch, GC would call PersistentLayer::delete()
on every GC'ed layer.
RemoteLayer::delete() returned Ok(()) unconditionally.
GC would then proceed by decrementing the resident size metric,
even though the layer is a RemoteLayer.
This patch makes the following changes:
- Rename PersistentLayer::delete() to delete_resident_layer_file().
That name is unambiguous.
- Make RemoteLayer::delete_resident_layer_file return an Err().
We would have uncovered this bug if we had done that from the start.
- Change GC / Timeline::delete_historic_layer to check whether
the layer is remote or not, and only call delete_resident_layer_file()
if it's not remote. This brings us in line with how eviction does it.
- Add a regression test.
fixes https://github.com/neondatabase/neon/issues/3722
Without this change, when actually setting this conf opt, the tenant
would become Broken next time we load it.
Why?
The serde_toml representation that persist_tenant_conf would write out
would be a TOML inline table of `secs` and `nsecs`.
But our hand-rolled TenantConf parser expects a TOML string.
I checked that all other `Duration` values in TenantConfOpt
use the humantime serialization.
Issues like this would likely be systematically prevented by
https://github.com/neondatabase/neon/issues/3682
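To illustrate the mismatch (a minimal sketch, not the actual TenantConfOpt; assumes the `toml` and `humantime-serde` crates): the default serde representation of `Duration` is a table of seconds and nanoseconds, while the hand-rolled parser expects a humantime string.
```
use serde::Serialize;
use std::time::Duration;

#[derive(Serialize)]
struct DefaultRepr {
    threshold: Duration, // serializes as a secs/nanos table
}

#[derive(Serialize)]
struct HumantimeRepr {
    #[serde(with = "humantime_serde")]
    threshold: Duration, // serializes as a human-readable string like "1day"
}

fn main() {
    let d = Duration::from_secs(24 * 60 * 60);
    // Emits a nested secs/nanos table, which the hand-rolled string parser rejects.
    println!("{}", toml::to_string(&DefaultRepr { threshold: d }).unwrap());
    // Emits a humantime string, matching what the parser expects.
    println!("{}", toml::to_string(&HumantimeRepr { threshold: d }).unwrap());
}
```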
## Issue ticket number and link
#3479
## Describe your changes
Rebase vendored PostgreSQL onto 14.7 and 15.2
## Issue ticket number and link
#3579
```
The version of PostgreSQL that we use is updated to 14.7 for PostgreSQL
14 and 15.2 for PostgreSQL 15.
```
We have started scraping some APIs for metadata rather frequently. This
includes the layer eviction tester; I believe the console does that too.
This should eliminate these logs:
https://neonprod.grafana.net/goto/rr_ace1Vz?orgId=1 (note the rate of
around 2k messages per minute)
## Describe your changes
This is yet another attempt to address the problem of storage size
ballooning (#2948).
The previous PR #3348 tried to address this problem by maintaining a list of
holes for each layer.
The problem with that approach is that we have to load all layers on
pageserver start; lazy loading of layers is not possible any more.
This PR instead collects information about the N largest holes at compaction
time and excludes these holes from the produced layers.
It can cause generation of a larger number of layers (up to 2x) and
produce small layers.
But it requires minimal changes in the code and doesn't affect the storage
format.
For graphical explanation please see thread:
https://github.com/neondatabase/neon/pull/3597#discussion_r1112704451
## Issue ticket number and link
#2948, #3348
## Describe your changes
When we deploy the proxy with the default Recreate strategy, there's
always some downtime and existing connections will be shut down. Change
the strategy to RollingUpdate and delay the kill signal by one week. AWS
Network Loadbalancer keeps the existing connections alive for as long as
the pods are alive, but will direct new connections to new pods.
## Issue ticket number and link
https://github.com/neondatabase/neon/issues/3333
Previously, we applied the rate limiting only up to receiving the headers
from S3, or somewhere near it. This commit adds an adapter which carries
the permit until the AsyncRead has been disposed.
Fixes #3662.
Calculation of the logical size is now async because of layer downloads, so
we shouldn't use spawn_blocking for it. The use of `spawn_blocking`
exhausted resources which are needed by `tokio::io::copy` when copying
from a stream to a file, which led to a deadlock.
Fixes: #3657
These are happening in tests because of #3655, but they sure took some
time to appear.
This makes the `Compaction failed, retrying in 2s: Cannot run compaction
iteration on inactive tenant` message a globally allowed error, because it
has been seen failing on different test cases.
## Describe your changes
Add a layer for building pg extensions written in Rust.
It required forking:
* `cargo-pgx` (in order not to hit an ABI mismatch error; `cargo-pgx`
hardcoded the ABI, tcdi/pgx#1032)
* `pg_jsonschema` (to use the forked `cargo-pgx` version)
* `pgx-contrib-spiext` (to use the forked `cargo-pgx`)
* `pg_graphql` (to use the forked `cargo-pgx` and `pgx-contrib-spiext`
versions)
Before the patch:
```
postgres=# create extension pg_jsonschema;
2023-02-02 17:45:23.120 UTC [35] ERROR: incompatible library "/usr/local/lib/pg_jsonschema.so": ABI mismatch
2023-02-02 17:45:23.120 UTC [35] DETAIL: Server has ABI "Neon Postgres", library has "PostgreSQL".
2023-02-02 17:45:23.120 UTC [35] STATEMENT: create extension pg_jsonschema;
ERROR: incompatible library "/usr/local/lib/pg_jsonschema.so": ABI mismatch
DETAIL: Server has ABI "Neon Postgres", library has "PostgreSQL".
```
After
```
postgres=# create extension pg_jsonschema;
CREATE EXTENSION
postgres=# select json_matches_schema('{"type": "object"}', '{}');
json_matches_schema
---------------------
t
postgres=# create extension pg_graphql;
CREATE EXTENSION
postgres=# create table book(id int primary key, title text);
CREATE TABLE
postgres=# insert into book(id, title) values (1, 'book 1');
INSERT 0 1
postgres=# select graphql.resolve($$
query {
bookCollection {
edges {
node {
id
}
}
}
}
$$);
resolve
----------------------------------------------------------------
{"data": {"bookCollection": {"edges": [{"node": {"id": 1}}]}}}
(1 row)
```
## Issue ticket number and link
Closes #3429, #3096
`pg_jsonschema` extension will be available for our customers
Enables the tracing panic hook, introduced for the pageserver in #3475, also
in:
- proxy
- safekeeper
- storage_broker
For proxy, a drop guard which resets the original std panic hook was
added on the first commit. Other binaries don't need it so they never
reset anything by `disarm`ing the drop guard.
The aim of the change is to make sure all panics a) have span
information and b) are logged similarly to other messages, not interleaved
with other messages as happens right now. Interleaving happens
because std prints panics to stderr, while other logging happens on
stdout. If this were handled gracefully by some utility, the log message
splitter would treat panics as belonging to the previous message, because
it expects a message to start with a timestamp.
Cc: #3468
console-release.local is a legacy manual CNAME to
neon-internal-api.aws.neon.tech in r53.
We could use the neon-internal-api.aws.neon.tech name directly.
This was already deployed to staging in
https://github.com/neondatabase/neon/pull/3642:
`console-staging.local` is a legacy manual CNAME to
`neon-internal-api.aws.neon.build` in r53.
We could use the `neon-internal-api.aws.neon.build` name directly.
## Describe your changes
Since the current default gc period is set to 1 hour, whenever there is
an immediate need to reduce PITR and run GC, the user has to wait 1 hour
for the PITR change to take effect.
By enabling this API, the user can configure PITR and immediately call
the do_gc API to trigger GC.
## Issue ticket number and link
#3590
## Describe your changes
test_on_demand_download is flaky because it does not wait until the created
image layer is transferred to S3.
test_tenants_with_remote_storage just leaves garbage at the end of the
overwritten file.
The right solution for test_on_demand_download is to add some API call to
wait for the completion of synchronization with S3 (not just based on the
last record LSN). But right now it is solved using a sleep.
## Issue ticket number and link
#3209
On the surface, this doesn't add much, but there are some benefits:
* We can do graceful shutdowns and thus record more code coverage data.
* We now have a foundation for the more interesting behaviors, e.g. "stop
accepting new connections after SIGTERM but keep serving the existing ones".
* We give the otel machinery a chance to flush trace events before
finally shutting down.
This commit sets up OpenTelemetry tracing and an exporter, so that spans
can be exported as OpenTelemetry traces as well.
All outgoing HTTP requests will be traced. A separate (child)
span is created for each outgoing HTTP request, and the tracing
context is also propagated to the server in the HTTP headers.
If tracing is enabled in the control plane and compute node too, you
can now get an end-to-end distributed trace of what happens when a new
connection is established, starting from the handshake with the
client, creating the 'start_compute' operation in the control plane,
starting the compute node, all the way down to fetching the base
backup and the availability checks in compute_ctl.
Co-authored-by: Dmitry Ivanov <dima@neon.tech>
Fixes #3468.
This does change how the panics look, and most importantly, make sure
they are not interleaved with other messages. Adds a `GET /v1/panic`
endpoint for panic testing (useful for sentry dedup and this hook
testing).
The panics are now logged within a new error level span called `panic`
which separates it from other error level events. The panic info is
unpacked into span fields:
- thread=mgmt request worker
- location="pageserver/src/http/routes.rs:898:9"
Co-authored-by: Christian Schwarz <christian@neon.tech>
## Describe your changes
```
$ poetry add werkzeug@latest "moto[server]@latest"
Using version ^2.2.3 for werkzeug
Using version ^4.1.2 for moto
Updating dependencies
Resolving dependencies... (1.6s)
Writing lock file
Package operations: 0 installs, 2 updates, 1 removal
• Removing pytz (2022.1)
• Updating werkzeug (2.1.2 -> 2.2.3)
• Updating moto (3.1.18 -> 4.1.2)
```
Resolves:
- https://github.com/neondatabase/neon/security/dependabot/14
- https://github.com/neondatabase/neon/security/dependabot/13
`@dependabot` failed to create a PR for some reason (I guess because it
also needed to handle `moto` dependency)
It's not a property of the credentials that we receive from the
client, so remove it from ClientCredentials. Instead, pass it as an
argument directly to the 'authenticate' function, where it's actually
used. All the rest of the changes are just plumbing to pass it through
the call stack to 'authenticate'.
Repeatedly (twice) try to download the compaction-targeted layers before the
actual compaction. Adds tests for both L0 compaction downloading layers and
image creation downloading layers. Image creation support existed already.
Fixes #3591
Co-authored-by: Christian Schwarz <christian@neon.tech>
Refactor the tenant_size_model code. Segment now contains just the
minimum amount of information needed to calculate the size. Other
information that is useful for building up the segment tree, and for
display purposes, is now kept elsewhere. The code in 'main.rs' has a new
ScenarioBuilder struct for that.
Calculating which Segments are "needed" is now the responsibility of the
caller of tenant_size_model, not part of the calculation itself. So it's
up to the caller to make all the decisions about retention periods for
each branch.
The output of the sizing calculation is now a Vec of SizeResults, rather
than a tree. It uses a tree representation internally, when doing the
calculation, but it's not exposed to the caller anymore.
Refactor the way the recursive calculation is performed.
Rewrite the code in size.rs that builds the Segment model. Get rid of
the intermediate representation with Update structs. Build the Segments
directly, with some local HashMaps and Vecs to track branch points to
help with that.
retention_period is now an input to gather_inputs(), rather than an
output.
Update the pageserver HTTP API: rename the /size endpoint to /synthetic_size,
with the following parameters:
- /synthetic_size?inputs_only to get debug info;
- /synthetic_size?retention_period=0 to override the cutoff that is used to
calculate the size.
Pass the header -H "Accept: text/html" to get HTML output; otherwise JSON is
returned.
Update python tests and openapi spec.
---------
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
## Describe your changes
Updates PITR and GC_PERIOD default value doc
Clients may specify endpoint/project name via `options=project=...`,
so we should not only remove `project=` from `options` but also
drop `options` entirely, because connection pools don't support it.
Discussion: https://neondb.slack.com/archives/C033A2WE6BZ/p1676464382670119
The size of a *database* cannot be the sum of the sizes of *all databases*;
indicate that the logical size is calculated for a branch.
Upstream proxy erroneously stores user & dbname in compute node info
cache entries, thus causing "funny" connection problems if such an entry
is reused while connecting to e.g. a different DB on the same compute node.
This PR fixes the problem but doesn't eliminate the root cause just yet.
I'll revisit this code and make it more type-safe in the upcoming PR.
Alternative to #3586.
Introduces the use of current_logical_size.current_size as a boundary after
which we start to update the metric gauge on ingested WAL. Previously, any
incremented value (ingested WAL) would have updated the gauge, but this would
have left the metric at zero for timelines which never receive any WAL, even
if the size had been calculated. Now the gauge is updated right away as the
calculation completes, not requiring any WAL to be received.
This PR replaces the ill-advised `unsafe Sync` impl with a de-facto
standard way to solve the underlying problem.
TLDR:
- tokio::task::spawn requires future to be Send
- ∀t. (t : Sync) <=> (&t : Send)
- ∀t. (t : Send + !Sync) => (&t : !Send)
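A tiny example of that rule (generic Rust, not the PR's code): a shared reference to a `!Sync` type is itself `!Send`, which is exactly what makes a future holding such a reference unacceptable to `tokio::task::spawn`.
```
use std::cell::Cell;

fn assert_send<T: Send>(_: T) {}

fn main() {
    // Cell<i32> is Send but !Sync.
    let c = Cell::new(1);
    assert_send(c); // ok: Cell<i32>: Send

    // String is Sync, so &String is Send.
    let s = String::from("hi");
    assert_send(&s); // ok

    // This would NOT compile: &Cell<i32> is !Send because Cell<i32> is !Sync.
    // assert_send(&Cell::new(1));
}
```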
Without this patch, basebackup fails if we evict all layers before that.
This slipped in as part of
commit 01b4b0c2f3
Author: Christian Schwarz <christian@neon.tech>
Date: Fri Jan 13 17:02:22 2023 +0100
Introduce RequestContext
This patch adds a per-timeline periodic task that executes an eviction
policy. The eviction policy is configurable per tenant.
Two policies exist:
- NoEviction (the default one)
- LayerAccessThreshold
The LayerAccessThreshold policy examines the last access timestamp per
layer in the layer map and evicts the layer if that last access is
further in the past than a configurable threshold value.
This policy kind is evaluated periodically at a configurable period.
It logs a summary statistic at `info!()` or `warn!()` level, depending
on whether any evictions failed.
This feature has no explicit killswitch since it's off by default.
#3536 added the custom Debug implementations, but using the derived Debug
on Key led to too verbose output. Instead of making `Key`'s `Debug`
unconditionally or conditionally do the `Display` variant (for tablespace'd
keys), I opted to build a newtype providing `Debug` for
`Range<Key>` via `Display`, which seemed to work unconditionally.
Also orders Key to have: 1. comment, 2. derive, 3. `struct Key`.
This script can be used to remove tenant directories on safekeepers for
projects which no longer exist (deleted in the console).
To run this script you need to upload it to the safekeeper (e.g. with SSH)
and run it with python3. Ansible can be used to run this script on
multiple safekeepers.
Fixes https://github.com/neondatabase/cloud/issues/3356
The auto-eviction PR (#3552) operates in two phases:
1. find candidate layers
2. evict them
For (2), a batch API like the one added in this commit is useful.
Note that this PR requires #3558 to be merged first.
Otherwise, the tests won't pass.
This changes the way we compare `Arc<dyn PersistentLayer>` in Timeline's
`LayerMap` not to use `Arc::ptr_eq` which has been witnessed in
development of #3557 to yield wrong results. It gives wrong results
because it compares fat pointers, which are `(object, vtable)` tuples
for `dyn Trait` and there are no guarantees that the `vtable`s are
unique. As in there were multiple vtables for `RemoteLayer` which is why
the comparison failed in #3557.
This is a known issue in Rust: clippy warns against it, and the Rust std
might be moving to the solution which has been reproduced in this PR:
compare only the object pointers by "casting out" the vtable pointer.
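A sketch of that comparison (trait and type names are placeholders): cast the fat `*const dyn Trait` to a thin pointer so only the object address is compared, never the vtable.
```
use std::sync::Arc;

trait Layer {}
struct RemoteLayer;
impl Layer for RemoteLayer {}

// Compare two `Arc<dyn Layer>` by data pointer only, ignoring the vtable half.
fn same_layer(a: &Arc<dyn Layer>, b: &Arc<dyn Layer>) -> bool {
    Arc::as_ptr(a) as *const () == Arc::as_ptr(b) as *const ()
}

fn main() {
    let a: Arc<dyn Layer> = Arc::new(RemoteLayer);
    let b = Arc::clone(&a);
    let c: Arc<dyn Layer> = Arc::new(RemoteLayer);
    assert!(same_layer(&a, &b));
    assert!(!same_layer(&a, &c));
}
```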
Follow-up to #3536, to actually use the new `Debug` in replacing the
layers, and use replacement with manual eviction endpoint.
Turns out the two paths share a lot of handling of `Replacement` but
didn't unify the two (need 3). There are also upcoming refactorings
from other PRs to this.
This patch adds basic access statistics for historic layers
and exposes them in the management API's `LayerMapInfo`.
We record the accesses in the `{Delta,Image}Layer::load()` function
because it's the common path of
* page_service (`Timeline::get_reconstruct_data()`)
* Compaction (`PersistentLayer::iter()` and `PersistentLayer::key_iter()`)
The stats survive residence status changes, and record these as well.
When scraping the layer map endpoint to record its evolution over time,
one must account for stat resets because they are in-memory only and
will reset on pageserver restart.
Use the launch timestamp header added by (#3527) to identify pageserver restarts.
This is PR https://github.com/neondatabase/neon/pull/3496
Add new pageserver config setting `cached_metric_collection_interval`
with default `1 hour`.
This setting controls how often unchanged cached consumption metrics are sent to
the HTTP endpoint.
This is a workaround for billing service limitations.
Fixes #3485
Follow-up to #3513.
This removes the old blanket `std::fmt::Debug` impl on `dyn Layer` which
did not seem to be used from anywhere (no compilation errors after
removing).
Adds `std::fmt::Debug` requirement and implementations for `trait Layer`
implementors:
- LayerDescriptor (derived)
- RemoteLayer (manual)
- DeltaLayer (manual)
- ImageLayer (manual)
Manual implementations are used to skip PageserverConf, tenant and
timeline ids, large collections.
Adds and adjusts some doc comments to be more rustdoc alike.
## Describe your changes
Expose the currently calculated synthetic size as a Prometheus metric
## Issue ticket number and link
#3509
The PR adds an endpoint to show a tenant's current config: `GET
/v1/tenant/:tenant_id/config`
A tenant's config consists of two parts: the tenant overrides (which can be
changed via other management API requests) and the default part,
which substitutes all missing overrides (constant, hardcoded in the pageserver).
The API returns the custom overrides and the final tenant config, after
applying all the defaults.
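Conceptually, the effective config is the hardcoded defaults with every present override applied on top; a small sketch with made-up fields (not the actual pageserver ones):
```rust
#[derive(Clone, Debug)]
struct TenantConf {
    gc_horizon: u64,
    pitr_interval_secs: u64,
}

#[derive(Default, Clone, Debug)]
struct TenantConfOpt {
    gc_horizon: Option<u64>,
    pitr_interval_secs: Option<u64>,
}

impl TenantConfOpt {
    /// Fill every missing override with the hardcoded default.
    fn merge(&self, default: &TenantConf) -> TenantConf {
        TenantConf {
            gc_horizon: self.gc_horizon.unwrap_or(default.gc_horizon),
            pitr_interval_secs: self
                .pitr_interval_secs
                .unwrap_or(default.pitr_interval_secs),
        }
    }
}
```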
Along the way, it had to fix two things in the config:
* allow shortening the JSON version and omitting all `null`s (same as the
TOML serializer behaves by default), and understand such a shortened format
when deserializing. A unit test is added.
* fix a bug where the `PUT /v1/tenant/config` endpoint rewrote the local
file with what had come in the request, but only updated (rather than
rewrote) the old values in the in-memory state.
That got uncovered while adjusting the e2e test and was fixed to do the
replacement everywhere, otherwise there's no way to revert existing
overrides. Fixes #3471 (commit
dc688affe8)
* fixes https://github.com/neondatabase/neon/issues/3472 by reordering
the config saving operations
This patch adds a LaunchTimestamp type to the `metrics` crate,
along with a `libmetric_` Prometheus metric.
The initial user is pageserver.
In addition to exposing the Prometheus metric, it also includes
the launch timestamp as a header in the API responses.
The motivation for this is that we plan to scrape the pageserver's
/v1/tenant/:tenant_id/timeline/:timeline_id/layer
HTTP endpoint over time. It will soon expose access metrics (#3496)
which reset upon process restart. We will use the pageserver's launch
ID to identify a restart between two scrape points.
However, there are other potential uses. For example, we could use
the Prometheus metric to annotate Grafana plots whenever the launch
timestamp changes.
Cc: #3486
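A rough sketch of how such a launch-timestamp gauge can be exposed with the `prometheus` crate; the metric name here is a placeholder, not necessarily the one the patch registers:
```rust
use std::time::{SystemTime, UNIX_EPOCH};

pub struct LaunchTimestamp(SystemTime);

impl LaunchTimestamp {
    pub fn generate() -> Self {
        LaunchTimestamp(SystemTime::now())
    }
}

pub fn set_launch_timestamp_metric(launch_ts: &LaunchTimestamp) {
    let secs = launch_ts
        .0
        .duration_since(UNIX_EPOCH)
        .expect("launch timestamp is after the UNIX epoch")
        .as_secs_f64();
    // Placeholder metric name: a gauge whose value is the UNIX timestamp of
    // process start, which changes exactly when the process restarts.
    let gauge = prometheus::register_gauge!(
        "libmetrics_launch_timestamp",
        "Unix timestamp (seconds) of the process launch"
    )
    .expect("failed to register gauge");
    gauge.set(secs);
}
```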
Adds a method to replace a particular layer in the LayerMap, for the
purposes of remote layer download and layer eviction. In those use cases
the read lock on the layer map needs to be released after the initial
search, but other operations could modify the layer map before the
replacing thread gets to run.
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
## Describe your changes
Add libmetrics_build_info metrics with the commit sha to storage_broker's
/metrics, to match the behaviour of proxy, pageserver and safekeeper.
The project/endpoint should be set in the original (non-as_ref'd) creds,
because we call `wake_compute` not only in `try_password_hack` but also
later in the connection retry logic.
This PR also removes the obsolete `as_ref` method and makes the code
simpler because we no longer need this complication after a recent
refactoring.
Further action points: finally introduce typestate in creds (planned).
We do not need a special enum variant for testing the file names, nor
its special handling across the code.
Current tests are able to create regular layers with normal layer names,
as the PR shows.
Closes https://github.com/neondatabase/neon/issues/3439
Adds a set of commands to manipulate the layer map:
* dump the layer map contents
* evict the layer from the layer map (remove the local file, put the
remote layer in the layer map instead)
* download the layer (operation, reversing the eviction)
The commands will change later, when statistics are added on top, so
the swagger schema is not adjusted.
The commands might have issues with a big number of layers: no pagination
is done for the dump command, and the eviction and download commands look
for the layer to evict/download by iterating all layers sequentially and
comparing the layer names.
For now, that seems tolerable (a "big" number of layers is ~2_000)
and further experiments are needed.
---------
Co-authored-by: Christian Schwarz <christian@neon.tech>
Adds two new tags, `run-extra-build-macos` and `run-extra-build-stats`
to trigger corresponding build jobs on any PR.
On every build for `main` or PR with `run-extra-build-stats` tag, publish a GitHub commit status with the link to the `cargo build --all --release --timings` report.
This patch adds a timed LRU cache implementation and a compute node info cache on top of that.
Cache entries might expire on their own (default ttl=5mins) or become invalid due to real-world events,
e.g. compute node scale-to-zero event, so we add a connection retry loop with a wake-up call.
Solved problems:
- [x] Find a decent LRU implementation.
- [x] Implement timed LRU on top of that.
- [x] Cache results of `proxy_wake_compute` API call.
- [x] Don't invalidate newer cache entries for the same key.
- [x] Add cmdline configuration knobs (requires some refactoring).
- [x] Add failed connection estab metric.
- [x] Refactor auth backends to make things simpler (retries, cache
placement, etc).
- [x] Address review comments (add code comments + cleanup).
- [x] Retry `/proxy_wake_compute` if we couldn't connect to a compute
(e.g. stalled cache entry).
- [x] Add high-level description for `TimedLru`.
TODOs (will be addressed later):
- [ ] Add cache metrics (hit, spurious hit, miss).
- [ ] Synchronize http requests across concurrent per-client tasks
(https://github.com/neondatabase/neon/pull/3331#issuecomment-1399216069).
- [ ] Cache results of `proxy_get_role_secret` API call.
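A minimal sketch of the timed-LRU idea, built on the `lru` crate (assumed dependency, e.g. `lru = "0.12"`); this is not the actual `TimedLru` from proxy, just the shape of it:
```rust
use std::hash::Hash;
use std::num::NonZeroUsize;
use std::time::{Duration, Instant};

use lru::LruCache;

struct Entry<V> {
    value: V,
    expires_at: Instant,
}

pub struct TimedLru<K: Hash + Eq, V> {
    cache: LruCache<K, Entry<V>>,
    ttl: Duration,
}

impl<K: Hash + Eq, V: Clone> TimedLru<K, V> {
    pub fn new(capacity: NonZeroUsize, ttl: Duration) -> Self {
        Self {
            cache: LruCache::new(capacity),
            ttl,
        }
    }

    /// Insert (or refresh) an entry with a fresh TTL.
    pub fn insert(&mut self, key: K, value: V) {
        let expires_at = Instant::now() + self.ttl;
        self.cache.put(key, Entry { value, expires_at });
    }

    /// Look up an entry, evicting it if its TTL has already expired.
    pub fn get(&mut self, key: &K) -> Option<V> {
        let now = Instant::now();
        // Check freshness first; only a bool escapes this borrow.
        let fresh = match self.cache.get(key) {
            Some(entry) => entry.expires_at > now,
            None => return None,
        };
        if fresh {
            self.cache.get(key).map(|entry| entry.value.clone())
        } else {
            // Stale entry: drop it so the wake-up + retry path repopulates it.
            self.cache.pop(key);
            None
        }
    }
}
```
On a miss or a stale hit, the caller would go through the normal wake-up call and insert the fresh result.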
- add parse_query_param()
- use Cow<> where possible
- move param parsing code to utils::http::request
This was originally PR https://github.com/neondatabase/neon/pull/3502
which targeted a different branch.
closes #3510
Add 'branch' input to specify commit for deploy scripts/configs. Commit
can't be passed to workflow as ref, and we need to pin configs to
specific commit for main/release deploys
Update deploy input descriptions to match GH interface
Extract deploy jobs from build_and_test.yml to deploy-dev and
deploy-prod workflows.
Add a trigger to run these workflows after Neon is built and tested on the
main and release branches.
This will allow us to redeploy/rollback/patch config without full
rebuild.
Related to: https://github.com/neondatabase/neon/issues/2848
`pageserver_storage_operations_seconds` is the most expensive metric we
have, as there are a lot of tenants/timelines and the histogram had 42
buckets. These are quite sparse too, so instead of having a histogram
per timeline, create a new histogram
`pageserver_storage_operations_seconds_global` without tenant and
timeline dimensions and replace `pageserver_storage_operations_seconds`
with sum and counter.
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
I added these spans to trace how long the queries take, but I didn't realize
that there's a difference between:
let _ = span.entered();
and
let _guard = span.entered();
The former drops the guard immediately, while the latter holds it
until the end of the scope. As a result, the span was ended
immediately, and the query was executed outside the span.
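A minimal illustration of the difference with the real `tracing` API (the span name here is illustrative):
```rust
use tracing::info_span;

fn run_query() {
    // BUG: `let _ = ...` drops the returned `EnteredSpan` immediately,
    // so the span closes right away and the query below runs outside it.
    let _ = info_span!("get_db_size").entered();

    // ... run the query ...
}

fn run_query_fixed() {
    // Binding to a named variable keeps the guard alive until the end of
    // the scope, so the whole query is recorded inside the span.
    let _guard = info_span!("get_db_size").entered();

    // ... run the query ...
}
```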
## Describe your changes
Added a metric that allows monitoring tenant states
## Issue ticket number and link
https://github.com/neondatabase/neon/issues/3161
## Checklist before requesting a review
- [X] I have performed a self-review of my code.
- [X] I have added an e2e test for it.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Describe your changes
Whenever a tenant is detached or the pageserver is restarted, the
pageserver_last_record_lsn metric is dropped.
This fix resurrects the value from the metadata whenever the tenant is
attached again.
## Issue ticket number and link
[3571](https://github.com/neondatabase/cloud/issues/3571)
## Checklist before requesting a review
- [X] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
To fix `Error: The requested DurationSeconds exceeds the
MaxSessionDuration set for this role.`
Co-authored-by: Rory de Zoete <rdezoete@Rorys-Mac-Studio.fritz.box>
Currently `attach` doesn't write a tenant config, because we don't back
it up in the first place. With the current implementation of
`Tenant::persist_tenant_config`, changing the tenant's
configuration through the HTTP API will fail, because the file
wasn't created on attach and
`OpenOptions::truncate(true).write(true).create_new(false)` is used.
I think this patch allows for the least controversial middle ground, which
*enables* changing the tenant configuration even for attached tenants (not
just created tenants).
Walreceiver is a per-timeline abstraction. Move it there to reflect
the hierarchy of abstractions and task_mgr tasks.
The code that sets up the global storage_broker client
is not timeline-scoped. So, break it out into a separate module.
The motivation for this change is to prepare the code base for replacing
the task_mgr global task registry with a more ownership-oriented
approach to manage task lifetimes.
I removed TaskStateUpdate::Init because, after doing the changes,
rustc warned that it was never constructed.
A quick search through the commit history shows that this
has always been true since
commit fb68d01449
Author: Dmitry Rodionov <dmitry@neon.tech>
Date: Mon Sep 26 23:57:02 2022 +0300
Preserve task result in TaskHandle by keeping join handle around (#2521)
So, the warning is not an indication of some accidental code removal.
This is PR: https://github.com/neondatabase/neon/pull/3456
Change the signature so that it takes an Arc<Timeline> reference to the
source timeline, instead of just the ID. All the callers have an Arc
reference at hand, so this is more convenient for everyone.
Reorder the code a bit and improve the comments, to make it more clear
what it does and why.
To fix errors such as:
`An error occurred (ImageAlreadyExistsException) when calling the
PutImage operation: Image with digest
'sha256:da6d8ad97d84e3aec4e6a240c3a35868b626692ee5d199cdd3fe45d29a8e54df'
and tag 'latest' already exists in the repository with name
'compute-node-v14' in registry with id '369495373322'`
Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@Rorys-Mac-Studio.fritz.box>
This allows fine-grained distributed tracing of the 'start_compute'
operation from the cloud console. The startup actions performed by
'compute_ctl' are now performed in a child of the 'start_compute'
context, so you can trace through the whole compute start operation.
This needs a corresponding change in the cloud console to fill in the
'startup_tracing_context' field in the json spec. If it's missing, the
startup operations are simply traced as a separate trace, without
a parent.
This allows tracing the startup actions e.g. with Jaeger
(https://www.jaegertracing.io/). We use the "tracing-opentelemetry"
crate, which turns tracing spans into OpenTelemetry spans, so you can
use the usual "#[instrument]" directives to add tracing.
I put the tracing initialization code into a separate crate,
`tracing-utils`, so that we can reuse it in other programs. We
probably want to set up tracing in the same way in all our programs.
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
This patch wraps the tenants hashmap in an enum that represents the
tenant manager's three major states:
- Initializing
- Open for business
- Shutting down.
See the enum doc comments for details.
In response, all the users of `TENANTS` are now forced to distinguish
those states.
The only major change is in `run_if_no_tenant_in_memory`, which,
before this patch, was used by the /attach and /load endpoints.
This patch rewrites that method under the name `tenant_map_insert`,
replacing the anyhow::Result with a std Result and a dedicated error
type.
Introducing this error type allows using `tenant_map_insert` in
`tenant_create`, thereby unifying all code paths that create tenant
objects to use `tenant_map_insert`.
This is beneficial because we can now systematically prevent tenants
from being created, attached, or `/load`ed during pageserver shutdown.
The management API remains available, but the endpoints that create
new tenants will fail with an error.
More work would need to be done to properly distinguish these errors
through HTTP status codes such as 503.
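A sketch of the shape of the wrapper; the variant names approximate the three states described above, and the ID/tenant types are placeholders:
```rust
use std::collections::HashMap;
use std::sync::Arc;

// Placeholders standing in for the real pageserver types.
#[derive(PartialEq, Eq, Hash)]
struct TenantId([u8; 16]);
struct Tenant;

enum TenantsMap {
    /// Pageserver startup: the map has not been populated from local disk yet.
    Initializing,
    /// Normal operation: tenants can be looked up, created, attached and loaded.
    Open(HashMap<TenantId, Arc<Tenant>>),
    /// Pageserver shutdown: lookups still work so in-flight work can finish,
    /// but inserting new tenants (create/attach/load) fails with an error.
    ShuttingDown(HashMap<TenantId, Arc<Tenant>>),
}
```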
after executing
```sql
CREATE EXTENSION unit;
```
I saw this error:
```
ERROR: could not open file "/usr/local/pgsql/share/extension/unit_prefixes.data" for reading: No such file or directory (SQLSTATE 58P01)
```
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
For use in production in case on-demand download turns out to be
problematic during tenant_attach, or when we eventually introduce layer
eviction.
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Motivation
==========
Layer Eviction Needs Context
----------------------------
Before we start implementing layer eviction, we need to collect some
access statistics per layer file or maybe even page.
Part of these statistics should be the initiator of a page read request
to answer the question of whether it was page_service vs. one of the
background loops, and if the latter, which of them?
Further, it would be nice to learn more about what activity in the pageserver
initiated an on-demand download of a layer file.
We will use this information to test out layer eviction policies.
Read more about the current plan for layer eviction here:
https://github.com/neondatabase/neon/issues/2476#issuecomment-1370822104
task_mgr problems + cancellation + tenant/timeline lifecycle
------------------------------------------------------------
Apart from layer eviction, we have long-standing problems with task_mgr,
task cancellation, and various races around tenant / timeline lifecycle
transitions.
One approach to solve these is to abandon task_mgr in favor of a
mechanism similar to Golang's context.Context, albeit extended to
support waiting for completion, and specialized to the needs in the
pageserver.
Heikki solves all of the above at once in PR
https://github.com/neondatabase/neon/pull/3228 , which is not yet
merged at the time of writing.
What Is This Patch About
========================
This patch addresses the immediate needs of layer eviction by
introducing a `RequestContext` structure that is plumbed through the
pageserver - all the way from the various entrypoints (page_service,
management API, tenant background loops) down to
Timeline::{get,get_reconstruct_data}.
The struct carries a description of the kind of activity that initiated
the call. We re-use task_mgr::TaskKind for this.
Also, it carries the desired on-demand download behavior of the entrypoint.
Timeline::get_reconstruct_data can then log the TaskKind that initiated
the on-demand download.
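A condensed sketch of the struct's shape; the real definitions live in context.rs and task_mgr, and the details here are approximations:
```rust
// Placeholder enum; the real TaskKind lives in task_mgr and has many more variants.
#[derive(Clone, Copy, Debug)]
pub enum TaskKind {
    PageRequestHandler,
    Compaction,
    GarbageCollector,
}

#[derive(Clone, Copy, Debug)]
pub enum DownloadBehavior {
    /// Download missing layer files transparently.
    Download,
    /// Download, but log a warning: this task was not expected to need downloads.
    Warn,
    /// Treat a missing layer file as an error.
    Error,
}

#[derive(Clone, Debug)]
pub struct RequestContext {
    task_kind: TaskKind,
    download_behavior: DownloadBehavior,
}

impl RequestContext {
    pub fn new(task_kind: TaskKind, download_behavior: DownloadBehavior) -> Self {
        Self { task_kind, download_behavior }
    }

    /// Child context for work that stays within the parent task's lifetime.
    pub fn attached_child(&self) -> Self {
        self.clone()
    }

    /// Child context for a spawned task that may outlive its creator.
    pub fn detached_child(&self, task_kind: TaskKind, download_behavior: DownloadBehavior) -> Self {
        Self { task_kind, download_behavior }
    }

    pub fn task_kind(&self) -> TaskKind {
        self.task_kind
    }
}
```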
I developed this patch by git-checking-out Heikki's big RequestContext
PR https://github.com/neondatabase/neon/pull/3228 , then deleting all
the functionality that we do not need to address the needs for layer
eviction.
After that, I added a few things on top:
1. The concept of attached_child and detached_child in preparation for
cancellation signalling through RequestContext, which will be added in
a future patch.
2. A kill switch to turn DownloadBehavior::Error into a warning.
3. Renamed WalReceiverConnection to WalReceiverConnectionPoller and
added an additional TaskKind, WalReceiverConnectionHandler. These were
necessary to create proper detached_child-type RequestContexts for the
various tasks that walreceiver starts.
How To Review This Patch
========================
Start your review with the module-level comment in context.rs.
It explains the idea of RequestContext, what parts of it are implemented
in this patch, and the future plans for RequestContext.
Then review the various `task_mgr::spawn` call sites. At each of them,
we should be creating a new detached_child RequestContext.
Then review the (few) RequestContext::attached_child call sites and
ensure that the spawned tasks do not outlive the task that spawns them.
If they do, these call sites should use detached_child() instead.
Then review the todo_child() call sites and judge whether it's worth the
trouble of plumbing through a parent context from the caller(s).
Lastly, go through the bulk of mechanical changes that simply forwards
the &ctx.
Before this patch, when `initialize_with_lock` was called via
`timeline_init_and_sync`, we would transition the timeline like so:
```
load_local_timeline/load_remote_timeline:
    timeline_init_and_sync
        Timeline::new
            () => Loading
        initialize_with_lock:
            set_state(Active)
                Loading => Active
            timeline.activate()
                Active => Active
```
This makes debugging problematic cases in the future easier, as we can
just request the model inputs and use them locally to reproduce the issue
with the model.
I noticed that `compute_ctl` updates all roles on each start, search for
rows like
> - web_access:[FILTERED] -> update
in the compute startup log.
It happens because we had an ad-hoc hack for comparing md5 hashes, which
doesn't work with the SCRAM hashes stored in `pg_authid`. It doesn't
really hurt, as nothing changes, but we just run >= 2 extra queries on
each start, so fix it.
The TimelineState::Suspended was dubious to begin with. I suppose
that the intention was that timelines could transition back and
forth between Active and Suspended states.
But practically, the code before this patch never did that.
The transitions were:
() ==Timeline::new==> Suspended ==*==> {Active,Broken,Stopping}
One exception: Tenant::set_stopping() could transition timelines like
so:
!Broken ==Tenant::set_stopping()==> Suspended
But Tenant itself cannot transition from stopping state to any other
state.
Thus, this patch removes TimelineState::Suspended and introduces a new
state Loading. The aforementioned transitions change as follows:
- () ==Timeline::new==> Suspended ==*==> {Active,Broken,Stopping}
+ () ==Timeline::new==> Loading ==*==> {Active,Broken,Stopping}
- !Broken ==Tenant::set_stopping()==> Suspended
+ !Broken ==Tenant::set_stopping()==> Stopping
Walreceiver's connection manager loop watches TimelineState to decide
whether it should retry connecting, or exit.
This patch changes the loop to exit when it observes the transition
into Stopping state.
Walreceiver isn't supposed to be started until the timeline transitions
into Active state. So, this patch also adds some warn!() messages
in case this happens anyways.
Small changes, but hopefully this will help with the panic detected in
staging, for which we cannot get the debugging information right now
(end-of-branch before branch-point).
Before, only the timelines which had passed the `gc_horizon` were
processed, which failed with orphans at the tree_sort phase. An example
input is in the added `test_branched_empty_timeline_size` test case.
The PR changes iteration to happen through all timelines, and in
addition to that, any learned branch points will be calculated as they
would have been in the original implementation if the ancestor branch had
been over the `gc_horizon`.
This also changes how tenants where all timelines are below `gc_horizon`
are handled. Previously tenant_size 0 was returned, but now they will
have approximately `initdb_lsn` worth of tenant_size.
The PR also adds several new tenant size tests that describe various corner
cases of branching structure and `gc_horizon` setting.
They are currently disabled to not consume time during CI.
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
Previously, we were trying to re-assign owned objects of an already
deleted role. This was causing a crash loop in the case when compute
was restarted with a spec that includes a delta operation for role
deletion. To avoid such cases, check that the role is still present before
calling `reassign_owned_objects`.
Resolves neondatabase/cloud#3553
This reverts commit 826e89b9ce.
The problem with that commit was that it deletes the TempDir while
there are still EphemeralFile instances open.
At first I thought this could be fixed by simply adding
Handle::current().block_on(task_mgr::shutdown(None, Some(tenant_id), None))
to TenantHarness::drop, but it turned out to be insufficient.
So, reverting the commit until we find a proper solution.
refs https://github.com/neondatabase/neon/issues/3385
Refactors Compute::prepare_and_run. It's split into subroutines
differently, to make it easier to attach tracing spans to the
different stages. The high-level logic for waiting for Postgres to
exit is moved to the caller.
Replace 'env_logger' with 'tracing', and add `#instrument` directives
to the different stages of the startup process. This is a fairly
mechanical change, except for the changes in 'spec.rs'. 'spec.rs'
contained some complicated formatting, where parts of log messages
were printed directly to stdout with `print`s. That was a bit messed
up because the log normally goes to stderr, but those lines were
printed to stdout. In our docker images, stderr and stdout both go to
the same place so you wouldn't notice, but I don't think it was
intentional.
This changes the log format to the default
'tracing_subscriber::format' format. It's different from the Postgres
log format, however, and because both compute_tools and Postgres print
to the same log, it's now a mix of two different formats. I'm not
sure how the Grafana log parsing pipeline can handle that. If it's a
problem, we can build a custom formatter to change the compute_tools log
format to be the same as Postgres's, like it was before this commit,
or we can change the Postgres log format to match tracing_formatter's,
or we can start printing compute_tools' log output to a different
destination than Postgres.
Before this patch, we would start all layer downloads simultaneously.
There is at most one download_all_remote_layers task per timeline.
Hence, the specified limit is per timeline.
There is still no global concurrency limit for layer downloads.
We'll have to revisit that at some point and also prioritize on-demand
initiated downloads over download_all_remote_layers downloads.
But that's for another day.
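One way to express such a per-timeline limit is with a semaphore; a sketch under the assumption of tokio tasks (names here are illustrative, not the actual implementation):
```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

async fn download_all(layers: Vec<String>, max_concurrent: usize) {
    let sem = Arc::new(Semaphore::new(max_concurrent));
    let mut tasks = Vec::new();
    for layer in layers {
        // Acquiring the permit before spawning caps the number of
        // simultaneously running downloads for this timeline.
        let permit = Arc::clone(&sem).acquire_owned().await.unwrap();
        tasks.push(tokio::spawn(async move {
            let _permit = permit; // held for the duration of the download
            download_one(&layer).await;
        }));
    }
    for t in tasks {
        t.await.unwrap();
    }
}

async fn download_one(_layer: &str) {
    // ... actual remote download ...
}
```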
The general idea is that the VM informant binary is added to the
vm-compute-node images only. `compute_tools` then will run whatever's at
`/bin/vm-informant`, if the path exists.
- handle errors in calculate_synthetic_size_worker. Don't exit the bgworker if one tenant failed.
- add cached_synthetic_tenant_size to cache values calculated by the bgworker
- code cleanup: remove unneeded info! messages, clean comments
- handle collect_metrics_task() error. Don't exit collect_metrics worker if one task failed.
- add unit test to cover case when we have multiple branches at the same lsn
If we ran `compact_prefetch_buffers` with exactly one hole in the
buffers, the code would fail to remove the last, now unused, entry from
the array.
This is now fixed.
Also, add and adjust some comments in the compaction code so that the
algorithm used is a bit more clear.
Fixes #3192
This makes Timeline::get() async, along with all functions that call it
directly or indirectly. The with_ondemand_download() mechanism
is gone, Timeline::get() now always downloads files, whether you want
it or not. That is what all the current callers want, so even though
this loses the capability to get a page only if it's already in the
pageserver, without downloading, we were not using that capability.
There were some places that used 'no_ondemand_download' in the WAL
ingestion code that would error out if a layer file was not found
locally, but those were dubious. We do actually want to on-demand
download in all of those places.
Per discussion at
https://github.com/neondatabase/neon/pull/3233#issuecomment-1368032358
When the number of GitHub Actions workers is changed, some jobs get killed.
When helm is killed during the upgrade, the release gets stuck in the
pending-upgrade state. --atomic should initiate an automatic rollback in this case.
With this patch, tenant_detach and timeline_delete's
task_mgr::shutdown_tasks() call will wait for on-demand
compaction to finish.
Before this patch, the on-demand compaction would grab the
layer_removal_cs after tenant_detach / timeline_delete had
removed the timeline directory.
This resulted in error
No such file or directory (os error 2)
NB: I already implemented this pattern for ondemand GC a while back.
fixes https://github.com/neondatabase/neon/issues/3136
- Fix and improve comments
- Rename 'physical_size' local variable to 'resident_size' for clarity.
- Remove one 'unnecessary wait_for_upload' call. The
'wait_for_sk_commit_lsn_to_reach_remote_storage' call after shutting
down compute is sufficient.
## Describe your changes
Added a PR template
## Issue ticket number and link
#3162
## Checklist before requesting a review
- [ ] I have performed a self-review of my code
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
This is a hacky implementation of WebSocket server, embedded into our
postgres proxy. The server is used to allow https://github.com/neondatabase/serverless
to connect to our postgres from browser and serverless javascript functions.
How it will work (general schema):
- browser opens a websocket connection to
`wss://ep-abc-xyz-123.xx-central-1.aws.neon.tech/`
- proxy accepts this connection and terminates TLS (https)
- inside encrypted tunnel (HTTPS), browser initiates plain
(non-encrypted) postgres connection
- proxy performs auth as in usual plain pg connection and forwards
connection to the compute
Related issue: #3225
Follow-up of https://github.com/neondatabase/neon/pull/3270, which broke
an example from the main README.md.
Fixes that by adding a way to specify a default tenant, and modifies
the basic neon_local test to start postgres and check branching.
Not all neon_local commands are implemented, so not all README.md
content is tested yet.
It's not OK to return early from within a PG_TRY-CATCH block. The
PG_TRY macro sets the global PG_exception_stack variable, and
PG_END_TRY restores it. If we jump out in between with "return NULL",
the PG_exception_stack is left to point to garbage. (I'm surprised the
comments in PG_TRY_CATCH don't warn about this.)
Add test that re-attaches tenant in pageserver while Postgres is
running. If the tenant is detached while compute is connected and
busy running queries, those queries will fail if they try to fetch any
pages. But when the tenant is re-attached, things should start working
again, without disconnecting the client <-> postgres connections.
Without this fix, this reproduced the segfault.
Fixes issue #3231
pageserver_disconnect() call invalidates 'pageserver_conn', including
the error message pointer we got from PQerrorMessage(pageserver_conn).
Copy the message to a temporary variable before disconnecting, like
we do in a few other places.
In passing, clear the 'pageserver_conn_wes' variable in a few places
where it was free'd. I didn't see any live bug from this, but since
pageserver_disconnect() checks if it's NULL, let's not leave it
dangling to already-free'd memory.
For every Python test, we start the storage first, and expect that
later, in the test, when we start a compute, it will work without
specific timeline and tenant creation or their IDs specified.
For that, we have a concept of "default" branch that was created on the
control plane level first, but that's not needed at all, given that it's
only Python tests that need it: let them create the initial timeline
during set-up.
Before, control plane started and stopped pageserver for timeline
creation, now Python harness runs an extra tenant creation request on
test env init.
I had to adjust the metrics test; it turns out it registered the metrics
from the default tenant after an extra pageserver restart.
The new model does not send the metrics before the collection time happens,
and that was 30s before.
- Clean up redundant metric removal in TimelineMetrics::drop.
RemoteTimelineClientMetrics is responsible for cleaning up
REMOTE_OPERATION_TIME and REMOTE_UPLOAD_QUEUE_UNFINISHED_TASKS.
- Rename `pageserver_remote_upload_queue_unfinished_tasks` to
`pageserver_remote_timeline_client_calls_unfinished`. The new name
reflects that the metric is with respect to the entire call to remote
timeline client. This includes wait time in the upload queue and hence
it's a longer span than what `pageserver_remote_OPERATION_seconds`
measures.
- Add the `pageserver_remote_timeline_client_calls_started` histogram.
See the metric description for why we need it.
- Add helper functions `call_begin` etc to `RemoteTimelineClientMetrics`
to centralize the logic for updating the metrics above (they relate to
each other, see comments in code).
- Use these constructs to track ongoing downloads in
`pageserver_remote_timeline_client_calls_unfinished`
refs https://github.com/neondatabase/neon/issues/2029
fixes https://github.com/neondatabase/neon/issues/3249
closes https://github.com/neondatabase/neon/pull/3250
Before this change, we would not .measure_remote_op for index part
downloads.
And more generally, it's good to pass not just uploads but also
downloads through RemoteTimelineClient, e.g., if we ever want to
implement some timeline-scoped policies there.
Found this while working on https://github.com/neondatabase/neon/pull/3250
where I add a metric to measure the degree of concurrent downloads.
Layer download was missing in a test that I added there.
The Basebackup struct is really just a convenient place to carry the
various parameters around in send_tarball and its subroutines. Make it
internal to the send_tarball function.
This reverts commit 56a4466d0a.
Seems that flakiness increased after this commit, while the time
decrease was only a couple of seconds.
With every regular Python test spawning 1 etcd, 3 safekeepers, 1
pageserver, a few CLI commands and post-run cleanup hooks, it might be
hard to run many such tests in parallel.
We could return to this later, after we consider alternative test
structure and/or CI runner structure.
Closes https://github.com/neondatabase/neon/issues/3114
Adds stronger typing to errors that appear during protocol messages (`FeMessage`), postgres and walreceiver connections.
Socket IO errors are now better detected and logged at lower (INFO, DEBUG) levels, without the backtraces they were logged with before, when they were wrapped in an anyhow context.
It was nice to have and useful at the time, but unfortunately the method
used to gather the profiling data doesn't play nicely with 'async'. PR
#3228 will turn 'get_page_at_lsn' function async, which will break the
profiling support. Let's remove it, and re-introduce some kind of
profiling later, using some different method, if we feel like we need it
again.
Makes the top-level functions in WalIngest async, and replaces
no_ondemand_download calls with with_ondemand_download.
This hopefully fixes the problem reported in issue #3230, although I
don't have a self-contained test case for it.
The synchronous 'tar' crate has required us to use block_in_place and
SyncIoBridge to work together with the async I/O in the client
connection. Switch to 'tokio-tar' crate that uses async I/O natively.
As part of this, move the CopyDataWriter implementation to
postgres_backend_async.rs. Even though it's only used in one place
currently, it's in principle generally applicable whenever you want to
use COPY out.
Unfortunately we cannot use the 'tokio-tar' as it is: the Builder
implementation requires the writer to have 'static lifetime. So we
have to use a modified version without that requirement. The 'static
lifetime was required just for the Drop implementation that writes
the end-of-archive sections if the Builder is dropped without calling
`finish`. But we don't actually want that behavior anyway; in fact
we had to jump through some hoops with the AbortableWrite hack to skip
those. With the modified version of 'tokio-tar' without that Drop
implementation, we don't need AbortableWrite either.
Co-authored-by: Kirill Bulatov <kirill@neon.tech>
I have experimented with the number of runner threads, and it looks like 8
threads win us a few seconds.
Bumping the thread count further did not improve the situation much:
* 20 threads were not allowed by pytest
* 16 threads were flaking quite notably
My guess would be that all the pageservers, safekeepers, and other nodes we
start occupy too much of the CPU and other resources for this
approach to scale further.
I looked at "cargo tree" output and noticed that through various
dependencies, we are depending on both native-tls and rustls. We have
tried to standardize on rustls for everything, but dependencies on
native-tls have crept in recently. One such dependency came from
'reqwest' with default features in pageserver, used for
consumption_metrics. Another dependency was from 'sentry'. Both
'reqwest' and 'sentry' use native-tls by default, but can use 'rustls'
if compiled with the right feature flags.
Send this metric only when it is fully calculated.
Make consumption metrics more stable:
- Send per-timeline metrics only for active timelines.
- Adjust test assertions to make test_metric_collection test more stable.
Refactor update_gc_info function so that it calls the potentially
expensive find_lsn_for_timestamp() function before acquiring the
lock. This will also be needed if we make find_lsn_for_timestamp()
async in the future; it cannot be awaited while holding the lock.
Fixes:
- serialize TenantId and TimelineId as strings,
- skip TimelineId if none
- serialize `metric_type` field as `type`
- add `idempotency_key` field to uniquely identify metrics
Changes are:
* Pageserver: start reading from NEON_AUTH_TOKEN by default.
Warn if ZENITH_AUTH_TOKEN is used instead.
* Compute, Docs: fix the default token name.
* Control plane: change name of the token in configs and start
sequences.
Compatibility:
* Control plane in tests: works, no compatibility expected.
* Control plane for local installations: never officially supported
auth anyways. If someone did enable it, `pageserver.toml` should be updated
with the new `neon.pageserver_connstring` and `neon.safekeeper_token_env`.
* Pageserver is backward compatible: you can run new Pageserver with old
commands and environment configurations, but not vice-versa.
The culprit is the hard-coded `NEON_AUTH_TOKEN`.
* Compute has no code changes. As long as you update its configuration
file with `pageserver_connstring` in sync with the start up scripts,
you are good to go.
* Safekeeper has no code changes and has never used `ZENITH_AUTH_TOKEN` in
the first place.
While @bojanserafimov is still working on the best replacement for the R-Tree in
layer_map.rs, there is an obvious pitfall in the current `search` method
implementation: it returns a delta layer even if there is an image layer at a
greater LSN. I think that it should be fixed.
The PR aims to fix two missing redownloads in the flaky
test_remote_storage_upload_queue_retries[local_fs]
([example](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-3190/release/3759194738/index.html#categories/80f1dcdd7c08252126be7e9f44fe84e6/8a70800f7ab13620/))
1. missing redownload during walreceiver work
```
2022-12-22T16:09:51.509891Z ERROR wal_connection_manager{tenant=fb62b97553e40f949de8bdeab7f93563 timeline=4f153bf6a58fd63832f6ee175638d049}: wal receiver task finished with an error: walreceiver connection handling failure
Caused by:
Layer needs downloading
Stack backtrace:
0: pageserver::tenant::timeline::PageReconstructResult<T>::no_ondemand_download
at /__w/neon/neon/pageserver/src/tenant/timeline.rs:467:59
1: pageserver::walingest::WalIngest::new
at /__w/neon/neon/pageserver/src/walingest.rs:61:32
2: pageserver::walreceiver::walreceiver_connection::handle_walreceiver_connection::{{closure}}
at /__w/neon/neon/pageserver/src/walreceiver/walreceiver_connection.rs:178:25
....
```
That looks sad, but is inevitable with the current approach: it seems that
we need to wait for old layers to arrive in order to accept new data.
For that, `WalIngest::new` now returns a
`PageReconstructResult`.
Sync methods from `import_datadir.rs` use `WalIngest::new` too, but both
of them import WAL during timeline creation, so no layers to download
are needed there, ergo the `PageReconstructResult` is converted to
`anyhow::Result` with `no_ondemand_download`.
2. missing redownload during compaction work
```
2022-12-22T16:09:51.090296Z ERROR compaction_loop{tenant_id=fb62b97553e40f949de8bdeab7f93563}:compact_timeline{timeline=4f153bf6a58fd63832f6ee175638d049}: could not compact, repartitioning keyspace failed: Layer needs downloading
Stack backtrace:
0: pageserver::tenant::timeline::PageReconstructResult<T>::no_ondemand_download
at /__w/neon/neon/pageserver/src/tenant/timeline.rs:467:59
1: pageserver::pgdatadir_mapping::<impl pageserver::tenant::timeline::Timeline>::collect_keyspace::{{closure}}
at /__w/neon/neon/pageserver/src/pgdatadir_mapping.rs:506:41
<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
pageserver::tenant::timeline::Timeline::repartition::{{closure}}
at /__w/neon/neon/pageserver/src/tenant/timeline.rs:2161:50
<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
2: pageserver::tenant::timeline::Timeline::compact::{{closure}}
at /__w/neon/neon/pageserver/src/tenant/timeline.rs:700:14
<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
3: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
at /github/home/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-0.1.37/src/instrument.rs:272:9
4: pageserver::tenant::Tenant::compaction_iteration::{{closure}}
at /__w/neon/neon/pageserver/src/tenant.rs:1232:85
<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
pageserver::tenant_tasks::compaction_loop::{{closure}}::{{closure}}
at /__w/neon/neon/pageserver/src/tenant_tasks.rs:76:62
<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
pageserver::tenant_tasks::compaction_loop::{{closure}}
at /__w/neon/neon/pageserver/src/tenant_tasks.rs:91:6
```
Scan core dumps directory on exit. In case of existing core dumps
call gdb/lldb to get a backtrace and log it. By default look for
core dumps in postgres data directory as core.<pid>. That is how
core collection is configured in our k8s nodes (and a reasonable
convention in general).
The 1.66 release speeds up compile times by over 10% according to tests.
Also, its Clippy finds plenty of old nits in our code:
* useless conversion, `foo as u8` where `foo: u8` and similar, removed
`as u8` and similar
* useless references and dereferenced (that were automatically adjusted
by the compiler), removed various `&` and `*`
* bool -> u8 conversion via `if/else`, changed to `u8::from`
* Map `.iter()` calls where only values were used, changed to
`.values()` instead
Lints that stand out:
* `Eq` is missing in our protoc generated structs. Silenced, does not
seem crucial for us.
* `fn default` looks like the one from `Default` trait, so I've
implemented that instead and replaced the `dummy_*` method in tests with
`::default()` invocation
* Clippy detected that
```
if retry_attempt < u32::MAX {
retry_attempt += 1;
}
```
is a saturating add and proposed to replace it.
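The suggested replacement is roughly:
```rust
retry_attempt = retry_attempt.saturating_add(1);
```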
The code in this change was extracted from #2595 (Heikki’s on-demand
download draft PR).
High-Level Changes
- New RemoteLayer Type
- On-Demand Download As An Effect Of Page Reconstruction
- Breaking Semantics For Physical Size Metrics
There are several follow-up work items planned.
Refer to the Epic issue on GitHub: https://github.com/neondatabase/neon/issues/2029
closes https://github.com/neondatabase/neon/pull/3013
Co-authored-by: Kirill Bulatov <kirill@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
New RemoteLayer Type
====================
Instead of downloading all layers during tenant attach, we create
RemoteLayer instances for each of them and add them to the layer map.
On-Demand Download As An Effect Of Page Reconstruction
======================================================
At the heart of pageserver is Timeline::get_reconstruct_data(). It
traverses the layer map until it has collected all the data it needs to
produce the page image. Most code in the code base uses it, through many
layers of indirection.
Before this patch, the function would use synchronous filesystem IO to
load data from disk-resident layer files if the data was not cached.
That is not possible with RemoteLayer, because the layer file has not
been downloaded yet. So, we do the download when get_reconstruct_data
gets there, i.e., “on demand”.
The mechanics of how the download is done are rather involved, because
of the infamous async-sync-async sandwich problem that plagues the async
Rust world. We use the new PageReconstructResult type to work around
this. Its introduction is the cause for a good amount of code churn in
this patch. Refer to the block comment on `with_ondemand_download()`
for details.
Breaking Semantics For Physical Size Metrics
============================================
We rename prometheus metric pageserver_{current,resident}_physical_size to
reflect what this metric actually represents with on-demand download.
This intentionally BREAKS the existing grafana dashboard and the cost model data
pipeline. Breaking is desirable because the meaning of this metric has changed
with on-demand download. See
https://docs.google.com/document/d/12AFpvKY-7FZdR5a4CaD6Ir_rI3QokdCLSPJ6upHxJBo/edit#
for how we will handle this breakage.
Likewise, we rename the new billing_metrics’s PhysicalSize => ResidentSize.
This is not yet used anywhere, so, this is not a breaking change.
There is still a field called TimelineInfo::current_physical_size. It
is now the sum of the layer sizes in layer map, regardless of whether
local or remote. To compute that sum, we added a new trait method
PersistentLayer::file_size().
When updating the Python tests, we got rid of
current_physical_size_non_incremental. An earlier commit removed it from
the OpenAPI spec already, so this is not a breaking change.
test_timeline_size.py has grown additional assertions on the
resident_physical_size metric.
Add new background job to collect billing metrics for each tenant and
send them to the HTTP endpoint.
Metrics are cached, so we don't send unchanged metrics.
Add metric collection config parameters:
metric_collection_endpoint (default None, i.e. disabled)
metric_collection_interval (default 60s)
Add test_metric_collection.py to test metric collection
and sending to the mocked HTTP endpoint.
Use port distributor in metric_collection test
review fixes: only update the cache after metrics were sent successfully, simplify code
disable metric collection if metric_collection_endpoint is not provided in config
It's better to request the tasks to shut down only after setting the
timeline state to Stopping. Otherwise, it's possible that a new task
spawns after we have waited for the existing tasks to shut down, but
before we have changed the state. We would fail to wait for them.
Feels nicer from a readability point of view too.
Remote operations fail sometimes due to network failures or other
external reasons. Add retry logic to all the remote downloads, so that
a transient failure at pageserver startup or tenant attach doesn't
cause the whole tenant to be marked as Broken.
Like in the uploads retry logic, we print the failure to the log as a
WARNing after three retries, but keep retrying. We will retry up to 10
times now, before returning the error to the caller.
To test the retries, I created a new RemoteStorage wrapper that simulates
failures, by returning an error for the first N times that a remote
operation is performed. It can be enabled by setting a new
"test_remote_failures" option in the pageserver config file.
Fixes #3112
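A sketch of the retry pattern described above (warn after a few attempts, keep retrying up to a cap), assuming tokio and tracing as dependencies; this is not the actual download code:
```rust
async fn download_with_retries<T, E, F, Fut>(mut op: F) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
    E: std::fmt::Display,
{
    const WARN_AFTER: u32 = 3;
    const MAX_ATTEMPTS: u32 = 10;
    let mut attempt = 0;
    loop {
        match op().await {
            Ok(v) => return Ok(v),
            // Give up and surface the error once we hit the attempt cap.
            Err(e) if attempt + 1 >= MAX_ATTEMPTS => return Err(e),
            Err(e) => {
                if attempt >= WARN_AFTER {
                    tracing::warn!("download failed, retrying (attempt {attempt}): {e}");
                }
                attempt += 1;
                // Simple exponential backoff between attempts.
                tokio::time::sleep(std::time::Duration::from_millis(100 * 2u64.pow(attempt))).await;
            }
        }
    }
}
```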
The new Timeline::freeze_and_flush function is equivalent to calling
Timeline::checkpoint(CheckpointConfig::Flush). There was only one
non-test caller that used CheckpointConfig::Forced, so replace that
with a call to the new Timeline::freeze_and_flush, followed by an
explicit call to Timeline::compact.
That only caller handled the mgmt API's 'checkpoint' endpoint.
Perhaps we should split that into separate 'flush' and 'compact'
endpoints too, but I didn't go that far yet.
Commit 6dec85b19d removed the `checkpoint_before_gc` argument, but failed
to update the comment. Remove its description, and while we're at it, try
to explain better how the `horizon` and `pitr` arguments are used.
This splits the storage_sync2::schedule_index_file into two (public)
functions:
1. `schedule_index_upload_for_metadata_update`, for when the metadata
(e.g. disk_consistent_lsn or last_gc_cutoff) has changed, and
2. `schedule_index_upload_for_file_changes`, for when layer file uploads
or deletions have been scheduled.
We now keep track of whether there have been any uploads or deletions
since the last index-file upload, and skip the upload in
`schedule_index_upload_for_file_changes` if there haven't been any
changes. That allows us to call the function liberally in timeline.rs,
whenever layer file uploads or deletions might've been scheduled,
without starting a lot of unnecessary index file uploads.
GC was covered earlier by commit c262390214, but that missed that we
have the same problem with compaction.
- Refactor logical_size_calculation_task, moving the pieces that are
specific to try_spawn_size_init_task into that function.
This allows us to spawn additional size calculation tasks that are not
init size calculation tasks.
- As part of this refactoring, stop logging cancellations as errors.
They are part of regular operations.
Logging them as errors was inadvertently introduced in earlier commit
427c1b2e9661161439e65aabc173d695cfc03ab4
initial logical size calculation: if it fails, retry on next call
- Change tenant size model request code to spawn task_mgr tasks using
the refactored logical_size_calculation_task function.
Using a task_mgr task ensures that the calculation cannot outlive
the timeline.
- There are presumably still some subtle race conditions if a size
request comes in at exactly the same time as a detach / delete
request.
- But that's the concern of a different area of the code (e.g., tenant_mgr)
and requires holistic solutions, such as the proposed TenantGuard.
- Make size calculation cancellable using CancellationToken.
This is more of a cherry on top.
NB: the test code doesn't use this because we _must_ return from
the failpoint: the failpoint lib doesn't allow just
continuing execution in combination with executing the closure.
This commit fixes the tests introduced earlier in this patch series.
This exacerbates the problem pointed out in the previous commit.
Why? Because with this patch, deleting a timeline also exposes the issue.
Extend the test to expose the problem.
Before this patch, if the task fails, we would not reset
self.initial_size_computation_started.
So, if it fails, we will return the approximate value forever.
In practice, it probably never failed because the local filesystem
is quite reliable.
But with on-demand download, the logical size calculation may need
to download layers, which is more likely to fail at times.
There will be internal retries with a timeout, but eventually,
the downloads will give up.
We want to retry in those cases.
While we're at it, also change the handling of the timeline state
watch so that we treat it as an error. Most likely, we'll not be
called again, but if we are, retrying is the right thing.
This prevents us from overwriting all blocks of a relation when we
extend the relation without first caching the size - get_cached_relsize
does not guarantee a correct result when it returns `false`.
This fixes all kinds of problems related to missing params,
like broken timestamps (due to `integer_datetimes`).
This solution is not ideal, but it will help. Meanwhile,
I'm going to dedicate some time to improving connection machinery.
Note that this **does not** fix problems with passing certain parameters
in a reverse direction, i.e. **from client to compute**. This is a
separate matter and will be dealt with in an upcoming PR.
This should help us in the future to have more freedom with spawning
tasks and cancelling things, most importantly blocking tasks (assuming
CancellationToken::is_cancelled is performant enough).
CancellationToken allows creating hierarchical cancellations, which
would also simplify the task_mgr shutdown operation, rendering it
unnecessary.
Do not run Nightly Benchmarks on `neon-captest-new`.
This is a temporary solution to avoid spikes in the storage we consume
during the test run. To collect data for the default instance, we could
run tests weekly (i.e. not daily).
IMDSv2 has limits, and if we query it on every s3 interaction we are
going to go over those limits. Changes the s3_bucket client
configuration to use:
- ChainCredentialsProvider to handle env variables or imds usage
- LazyCachingCredentialsProvider to actually cache any credentials
Related: https://github.com/awslabs/aws-sdk-rust/issues/629
Possibly related: https://github.com/neondatabase/neon/issues/3118
I saw an excessive number of index file upload operations in
production, even when nothing on the timeline changes. It was because
our GC schedules index file upload if the GC cutoff LSN is advanced,
even if the GC had nothing else to do. The GC cutoff LSN marches
steadily forwards, even when there is no user activity on the
timeline, when the cutoff is determined by the time-based PITR
interval setting. To dial that down, only schedule index file upload
when GC is about to actually remove something.
Previously, the /v1/tenant/:tenant_id/timeline/:timeline_id/do_gc API
call performed a flush and compaction on the timeline before
GC. Change it not to do that, and change all the tests that used that
API to perform compaction explicitly.
The compaction happens at a slightly different point now. Previously,
the code performed the `refresh_gc_info_internal` step first, and only
then did compaction on all the timelines. I don't think that was what
was originally intended here. Presumably the idea with compaction was
to make some old layer files available for GC. But if we're going to
flush the current in-memory layer to disk, surely you would want to
include the newly-written layer in the compaction too. I guess this
didn't make any difference to the tests in practice, but in any case,
the tests now perform the flush and compaction before any of the GC
steps.
Some of the tests might not need the compaction at all, but I didn't
try hard to determine which ones might need it. I left it out from a
few tests that intentionally tested calling do_gc with an invalid
tenant or timeline ID, though.
- Use only one templated section for most postgres-versioned steps
- Clean up neon_walredo, too, when running neon-pg-ext-clean
- Depend on the various cleanup steps for `clean` instead of manually
executing those cleanup steps.
If we get cancelled before jh.await returns, we've take()n the join handle but
dropped the result on the floor.
Fix it by setting self.join_handle = None after the .await.
fixes https://github.com/neondatabase/neon/issues/3104
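A sketch of the pattern; the struct and method names here are assumptions, not the actual TaskHandle code:
```rust
use tokio::task::JoinHandle;

struct TaskHandle {
    join_handle: Option<JoinHandle<()>>,
}

impl TaskHandle {
    async fn shutdown(&mut self) {
        // Await through a mutable reference instead of `take()`ing up front:
        // if this future is cancelled mid-await, the handle (and its result)
        // stays in place for the next caller.
        if let Some(jh) = &mut self.join_handle {
            let _ = jh.await;
        }
        // Clear the slot only once the await has actually completed.
        self.join_handle = None;
    }
}
```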
We do the accounting exclusively after updating remote IndexPart successfully.
This is cleaner & more robust than doing it upon completion of
individual layer file uploads / deletions, since we can use .set()
instead of add()/sub().
NB: Originally, this work was intended to be part of #3013 but it
turns out that it's completely orthogonal.
So, spin it out into this PR for easier review.
Since this change is additive, it won't break anything.
Temporarily disable `test_seqscans` for remote projects; they consume
too much space and time. We can try to re-enable it after switching
to per-test projects.
Closes https://github.com/neondatabase/neon/issues/1984
Closes https://github.com/neondatabase/neon/pull/2830
A follow-up of https://github.com/neondatabase/neon/pull/2830, I've
noticed that benchmarks failed again due to out of space issues.
Removes most of the pageserver and safekeeper files from disk after
every pytest suite run.
```
$ poetry run pytest -vvsk "test_tenant_redownloads_truncated_file_on_startup[local_fs]"
# ...
$ du -h test_output/test_tenant_redownloads_truncated_file_on_startup\[local_fs\]
# ...
104K test_output/test_tenant_redownloads_truncated_file_on_startup[local_fs]
$ poetry run pytest -vvsk "test_tenant_redownloads_truncated_file_on_startup[local_fs]" --preserve-database-files
# ...
$ du -h test_output/test_tenant_redownloads_truncated_file_on_startup\[local_fs\]
# ...
123M test_output/test_tenant_redownloads_truncated_file_on_startup[local_fs]
```
Co-authored-by: Bojan Serafimov <bojan.serafimov7@gmail.com>
Before this patch, when we decided to rename a layer file to a backup
because of a layer file size mismatch, we would not remove the layer from
the layer map, but would remove the on-disk file.
Because we re-download the file immediately after, we simply end up with
two layer objects in memory that reference the same file in the layer
map. So, GetPage() would work fine until one of the layers gets
delete()'d. The other layer's delete() would then fail.
Future work: prevent insertion of the same layer at LayerMap level
so that we notice such bugs sooner.
Replace actions/setup-python@v4 with the ansible image to fix
```
Version 3.10 was not found in the local cache
Error: The version '3.10' with architecture 'x64' was not found for this operating system.
```
Removes the race during pageserver initial timeline creation that led to partial layer uploads.
This race is only reproducible in test code, as we do not create initial timelines in cloud (yet, at least), but it is still nice to remove the non-deterministic behavior.
This patch aims to fix some of the inconsistencies in error reporting,
for example "Internal error" or "Console request failed" instead of
"password authentication failed for user '<NAME>'".
refactor: use new type LayerFileName when referring to layer file names in PathBuf/RemotePath
Before this patch, we would sometimes carry around plain file names in
`Path` types and/or awkwardly "rebase" paths to have a unified
representation of the layer file name between local and remote.
This patch introduces a new type `LayerFileName` which replaces the use
of `Path` / `PathBuf` / `RemotePath` in the `storage_sync2` APIs.
Instead of holding a string, it contains the parsed representation of
the image and delta file name.
When we need the file name, e.g., to construct a local path or
remote object key, we construct the name ad-hoc.
`LayerFileName` is also serde {Dese,Se}rializable, and in an initial
version of this patch, it was supposed to be used directly inside
`IndexPart`, replacing `RemotePath`.
However,
commit 3122f3282f
Ignore backup files (ones with .n.old suffix) in download_missing
fixed handling of `*.old` backup file names in IndexPart, and we need
to carry that behavior forward.
The solution is to remove `*.old` backup files names during
deserialization. When we re-serialize the IndexPart, the `*.old` file
will be gone.
This leaks the `.old` file in the remote storage, but makes it safe
to clean it up later.
There is additional churn from a preliminary refactoring that got squashed
into this change:
split off LayerMap's needs from trait Layer into a super trait
That refactoring renames `Layer` to `PersistentLayer` and splits off a subset
of the functions into a super-trait called `Layer`.
The super trait contains just the functions needed by `LayerMap`, whereas
`PersistentLayer` adds the context of the pageserver.
The naming is imperfect, as some functions that reside in `PersistentLayer`
have nothing persistence-specific about them. But it's a step in the right direction.
Part of https://github.com/neondatabase/neon/pull/2410 and
https://github.com/neondatabase/neon/pull/2407
* adds `hashFiles('rust-toolchain.toml')` into Rust cache keys, thus
removing one of the manual steps to do when upgrading rustc
* copies Python and Rust style checks from the `codestyle.yml` workflow
* adjusts shell defaults in the main workflow
* replaces `codestyle.yml` with a `neon_extra_builds.yml` workflow
The new workflow runs on commits to `main` (`codestyle.yml` was run per
PR), and runs two custom builds on GH agents:
* macos-latest, to ensure the entire project compiles on it (no tests
run)
There were no frequent breakages on macOS in our builds, so we can check
it rarely without making every storage PR wait for it to complete.
The updated mac build uses release builds now, so it should presumably be
a bit faster due to overall smaller files to cache between builds.
* ubuntu-latest, without caches, to produce full compilation stats for
Rust builds and upload it as an artifact to GitHub
The old `clippy build --timings` stats were collected from builds that
use caches and incremental compilation, and hence could never produce a
full report, so that step got removed.
Increase table size four times to fix the following error:
```
______________________ test_seqscans[remote-100000-100-0] ______________________
test_runner/performance/test_seqscans.py:57: in test_seqscans
assert int(shared_buffers) < int(table_size)
E assert 536870912 < 181239808
E + where 536870912 = int(536870912)
E + and 181239808 = int(181239808)
```
536870912 / 181239808 ≈ 2.96
Closes https://github.com/neondatabase/neon/issues/3052
From what I could understand from the PR, we did not wait long enough
before the attach failed.
Extended the wait period a bit and added a check for the status instead of
a plain `sleep`, so that we fail if we don't get the expected status.
This is rather a hack to resolve the immediate issue:
https://github.com/neondatabase/neon/issues/3024
Properly cleaning this file from the index part requires changes to the
initialization of the remote queue, because we need to clean it up before
we start walking over the files.
With on-demand downloads there will be no walk over layer files, because
download_missing is no longer needed, so I believe it will be
natural to unify this with load_layer_map.
Several fixes are included, among others:
- Prefetching for index bulkdelete calls (e.g. during vacuum), plus v14 compiler warning fix
- A fix for setting LSN on heap pages while setting vm bits
- Some style updates that were lost in the previous wave (v15 only)
Migrate Nightly Benchmarks from captest to staging.
- Migrate GitHub Workflows
- Replace `zenith-benchmarker` with regular runners
- Remove `environment` parameter from Neon GitHub Actions, add
`postgres_version`
- The only job left on captest is `neon-captest-reuse`, which will be
moved to staging after its project migration.
Ref https://github.com/neondatabase/cloud/issues/2836
Changes:
* Remove `RemoteObjectId` concept from remote_storage.
Operate directly on /-separated names instead.
These names are now represented by struct `RemotePath` which was renamed from struct `RelativePath`
* Require remote storage to operate on relative paths for its contents, thus simplifying the way to derive them in pageserver and safekeeper
* Make `IndexPart` use `String` instead of `RelativePath` for its entries, since those are just the layer names
This patch centralizes the logic of creating & reading pid files into the
new pid_file module and improves upon / makes explicit a few race conditions
that existed with the previous code.
Starting Processes / Creating Pidfiles
======================================
Before this patch, we had three places that had very similar-looking
match lock_file::create_lock_file { ... }
blocks.
After this change, they can use a straight-forward call provided
by the pid_file:
pid_file::claim_pid_file_for_pid()
Stopping Processes / Reading Pidfiles
=====================================
The new pid_file module provides a function to read a pidfile,
called read_pidfile(), that returns a
pub enum PidFileRead {
NotExist,
NotHeldByAnyProcess(PidFileGuard),
LockedByOtherProcess(Pid),
}
If we get back NotExist, there is nothing to kill.
If we get back NotHeldByAnyProcess, the pid file is stale and we must
ignore its contents.
If it's LockedByOtherProcess, it's either another pidfile reader
or, more likely, the daemon that is still running.
In this case, we can read the pid in the pidfile and kill it.
There's still a small window where this is racy, but it's not a
regression compared to what we have before.
The NotHeldByAnyProcess is an improvement over what we had before
this patch. Before, we would blindly read the pidfile contents
and kill, even if no other process held the flock.
If the pidfile was stale (NotHeldByAnyProcess), then that kill
would either result in ESRCH or hit some other unrelated process
on the system. This patch avoids the latter case by grabbing
an exclusive flock before reading the pidfile, and returning the
flock to the caller in the form of a guard object, to avoid
concurrent reads / kills.
It's hopefully irrelevant in practice, but it's a little robustness
that we get for free here.
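A hypothetical caller on the stop path would then look roughly like this (a
sketch assuming the API above; the exact signatures and error handling in
neon_local may differ):
```
// Sketch of the stop-path decision described above; error handling elided.
fn stop_daemon(pidfile_path: &std::path::Path) -> anyhow::Result<()> {
    match pid_file::read_pidfile(pidfile_path)? {
        pid_file::PidFileRead::NotExist => {
            // No pidfile: nothing to kill.
        }
        pid_file::PidFileRead::NotHeldByAnyProcess(_guard) => {
            // Stale pidfile: no process holds the flock, so the stored PID
            // must not be trusted. Holding `_guard` blocks concurrent readers.
        }
        pid_file::PidFileRead::LockedByOtherProcess(pid) => {
            // Most likely the daemon itself; kill it by the stored PID.
            nix::sys::signal::kill(pid, nix::sys::signal::Signal::SIGTERM)?;
        }
    }
    Ok(())
}
```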
Maintain flock on Pidfile of ETCD / any InitialPidFile::Create()
================================================================
Pageserver and safekeeper create their pidfiles themselves.
But for etcd, neon_local creates the pidfile (InitialPidFile::Create()).
Before this change, we would unlock the etcd pidfile as soon as
`neon_local start` exits, simply because no-one else kept the FD open.
During `neon_local stop`, that results in a stale pid file,
i.e., NotHeldByAnyProcess, and `neon_local stop` would henceforth not trust that
the PID stored in the file is still valid.
With this patch, we make the etcd process inherit the pidfile FD,
thereby keeping the flock held until it exits.
* We were missing one cluster in production:
`prod-ap-southeast-1-epsilon` configs.
* We had `metrics` enabled. This means creating `ServiceScrape` objects,
but since those clusters don't have `kube-prometheus-stack` like older
ones, we are missing the CRDs, so the helm deploy fails.
The new "trace_read_requests" option was missing from the
parse_toml_tenant_conf function that reads the config file. Because of
that, the option was ignored, which caused the test_read_trace.py test
to fail. It used to work before commit 9a6c0be823, because the
TenantConfigOpt struct was constructed directly in tenant_create_handler,
but now it is saved and read back from disk even for a newly created
tenant.
The abovementioned bug was fixed in commit 09393279c6 already, which
added the missing code to parse_toml_tenant_conf() to parse the
new "trace_read_requests" option. This commit fixes one more function
that was missed earlier, and adds more detail to the error message if
parsing the config file fails.
helm values for the new `storage-broker`: gRPC, over a secure connection
with a proper certificate, but no authentication.
Uses ALB ingress in the old cluster and nginx ingress for the new one.
The chart is deployed and the addresses are functional, while the
pipeline doesn't exist yet.
It used to be a separate piece of state, but after 9a6c0be823 it's just
an alias for the Tenant being in Attaching state. It was only used in
one assertion in a test, but that check doesn't make sense anymore, so
just remove it.
Fixes https://github.com/neondatabase/neon/issues/2930
Added basic instrumentation to integrate Sentry with the proxy, pageserver, and safekeeper processes.
Currently there are three Sentry projects, one for each process. The Sentry URL is passed to each of the three processes separately via CLI args.
Closes https://github.com/neondatabase/neon/issues/2537
Follow-up of https://github.com/neondatabase/neon/pull/2950
With the new model that prevents attaching without remote storage,
it has become even more awkward to add attach-with-files
functionality (in addition to the issues raised previously).
Adds two separate commands:
* `POST {tenant_id}/ignore` that places a mark file to skip such tenant
on every start and removes it from memory
* `POST {tenant_id}/schedule_load` that tries to load a tenant from
local FS similar to what pageserver does now on startup, but without
directory removals
Move the missing_layers property to Option<HashSet<RelativePath>>
This will allow removing it safely once all pageservers have been upgraded to this new code
- Replace `seqscan_prefetch_buffers` with `effective_io_concurrency` and
`maintenance_io_concurrency` for `clickbench-compare` job (see
https://github.com/neondatabase/neon/pull/2876)
- Get the database name at runtime (it can be `main` or `neondb` or
something else)
```
warning: named argument `file` is not used by name
--> pageserver/src/tenant/timeline.rs:1078:54
|
1078 | trace!("downloading image file: {}", file = path.display());
| -- ^^^^ this named argument is referred to by position in formatting string
| |
| this formatting argument uses named argument `file` by position
|
= note: `#[warn(named_arguments_used_positionally)]` on by default
help: use the named argument by name to avoid ambiguity
|
1078 | trace!("downloading image file: {file}", file = path.display());
| ++++
```
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
- **Enable `enable_seqscan_prefetch` by default**
- Drop use of `seqscan_prefetch_buffers` in favor of
`[maintenance,effective]_io_concurrency`
This includes adding some fields to the HeapScan execution node, and
vacuum state.
- Cleanup some conditionals in vacuumlazy.c
- Clarify enable_seqscan_prefetch GUC description
- Fix issues in heap SeqScan prefetching where synchronize_seqscan
machinery wasn't handled properly.
- Change "WAL service" to "safekeepers" in the architecture section. The
safekeepers together form the WAL service, but we don't use that term
much in the code.
- Replace the short list of pageserver components with a link /docs. We
have more details there.
- Add "Other resources" to Documention section, with links to some blog
posts and a video presentation.
- Remove notice at the top about the Zenith -> Neon rename. There are
still a few references to Zenith in the codebase, but not so many that
we would need to call it out at the top anymore.
Our shutdown procedure for "pageserver init" was buggy. Firstly, it
merely sent the process a SIGKILL, but did not wait for it to actually
exit. Normally, it should exit quickly as SIGKILL cannot be caught or
ignored by the target process, but it's still asynchronous and the
process can still be alive when the kill(2) call returns. Secondly,
"neon_local" removed the PID file after sending SIGKILL, even though the
process was still running. That hid the first problem: if we didn't
remove the PID file, and you start a new pageserver process while the
old one is still running, you would get an error when the new process
tries to lock the PID file.
We've been seeing a lot of "Cannot assign requested address" failures in
the CI lately. Our theory is that when we run "pageserver init"
immediately followed by "pageserver start", the first process is still
running and listening on the port when the second invocation starts up.
This commit hopefully fixes the problem.
It is generally a bad idea for the "neon_local" to remove the PID file
on the child process's behalf. The correct way would be for the server
process to remove the PID file, after it has fully shutdown everything
else. We don't currently have a robust way to ensure that everything has
truly shut down and closed, however.
A simpler way is to simply never remove the PID file. It's not necessary
to remove the PID file for correctness: we cannot rely on the cleanup to
happen anyway, if the server process crashes for example. Because of
that, we already have all the logic in place to deal with a stale PID
file that belonged to a process that already exited. Let's rely on that
on normal shutdown too.
We used to have a bug where the pageserver just got stuck if the
client sent a CopyDone message before reaching end of tar stream. That
showed up with an empty tar file, as one example. That was
inadvertently fixed by code refactorings, but let's add a regression
test for it, so that we don't accidentally re-introduce the bug later.
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
I'm not a fan of "Paused", for two reasons:
- Paused implies that the tenant/timeline has no activity on it. That's
not true; the tenant/timeline can still have active tasks working on it.
- Paused implies that it can be resumed later. It can not. A tenant or
timeline in this state cannot be switched back to Active state anymore.
A completely new Tenant or Timeline struct can be constructed for the
same tenant or timeline later, e.g. if you detach and later re-attach
the same tenant, but that's a different thing.
Stopping describes the state better. I also considered "ShuttingDown",
but Stopping is simpler as it's a single word.
The code in this change was extracted from PR #2595, i.e., Heikki’s draft
PR for on-demand download.
High-Level Changes
- storage_sync module rewrite
- Changes to Tenant Loading
- Changes to Tenant States
- Changes to Timeline States
- Crash-safe & Resumable Tenant Attach
There are several follow-up work items planned.
Refer to the Epic issue on GitHub:
https://github.com/neondatabase/neon/issues/2029
Metadata:
closes https://github.com/neondatabase/neon/pull/2785
unsquashed history of this patch: archive/pr-2785-storage-sync2/pre-squash
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
===============================================================================
storage_sync module rewrite
===========================
The storage_sync code is rewritten. New module name is storage_sync2, mostly to
make a more reasonable git diff.
The updated block comment in storage_sync2.rs describes the changes quite well,
so, we will not reproduce that comment here. TL;DR:
- Global sync queue and RemoteIndex are replaced with per-timeline
`RemoteTimelineClient` structure that contains a queue for UploadOperations
to ensure proper ordering and necessary metadata (see the sketch after this list).
- Before deleting local layer files, wait for ongoing UploadOps to finish
(wait_completion()).
- Download operations are not queued; they are executed immediately.
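Very roughly, the per-timeline structure described in the first bullet has this
shape (a simplified sketch, not the real storage_sync2 types):
```
// Simplified sketch of the per-timeline upload queue; the real
// RemoteTimelineClient has more operation kinds, metadata, and async plumbing.
use std::collections::VecDeque;

enum UploadOp {
    UploadLayer { layer_file_name: String },
    UploadMetadata { disk_consistent_lsn: u64 },
    DeleteLayer { layer_file_name: String },
}

struct RemoteTimelineClient {
    /// Upload operations for this one timeline, executed in order, so that
    /// the remote index never references a layer that wasn't uploaded yet.
    upload_queue: VecDeque<UploadOp>,
}

impl RemoteTimelineClient {
    /// Callers (e.g. code that wants to delete local layer files) wait for
    /// the queued uploads to complete first.
    async fn wait_completion(&self) { /* ... */ }
}
```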
Changes to Tenant Loading
=========================
The initial sync part was rewritten as well and represents the other major change
that serves as a foundation for on-demand downloads. Routines for attaching and
loading moved directly into the Tenant struct and are now asynchronous and spawned
into the background.
Since this patch doesn’t introduce on-demand download of layers, we fully
synchronize with the remote during pageserver startup. See details in
`Timeline::reconcile_with_remote` and `Timeline::download_missing`.
Changes to Tenant States
========================
The “Active” state has lost its “background_jobs_running: bool” member. That
variable indicated whether the GC & Compaction background loops are spawned or
not. With this patch, they are now always spawned. Unit tests (#[test]) use the
TenantConf::{gc_period,compaction_period} to disable their effect (15db566).
This patch introduces a new tenant state, “Attaching”. A tenant that is being
attached starts in this state and transitions to “Active” once it finishes
download.
The `GET /tenant` endpoint returns `TenantInfo::has_in_progress_downloads`. We
derive the value for that field from the tenant state now, to remain
backwards-compatible with cloud.git. We will remove that field when we switch
to on-demand downloads.
Changes to Timeline States
==========================
The TimelineInfo::awaits_download field is now equivalent to the tenant being
in Attaching state. Previously, download progress was tracked per timeline.
With this change, it’s only tracked per tenant. When on-demand downloads
arrive, the field will be completely obsolete. Deprecation is tracked in
issue #2930.
Crash-safe & Resumable Tenant Attach
====================================
Previously, the attach operation was not persistent. I.e., when tenant attach
was interrupted by a crash, the pageserver would not continue attaching after
pageserver restart. In fact, the half-finished tenant directory on disk would
simply be skipped by tenant_mgr because it lacked the metadata file (it’s
written last). This patch introduces an “attaching” marker file inside that is
present inside the tenant directory while the tenant is attaching. During
pageserver startup, tenant_mgr will resume attach if that file is present. If
not, it assumes that the local tenant state is consistent and tries to load the
tenant. If that fails, the tenant transitions into Broken state.
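In rough outline, the startup decision described above amounts to the following
(an illustrative sketch; the actual marker file name and flow live in tenant_mgr):
```
// Illustrative sketch of the startup decision for one tenant directory.
use std::path::Path;

enum TenantStartup {
    ResumeAttach, // crash/restart mid-attach: re-run attach from remote storage
    LoadLocal,    // local state assumed consistent; failures lead to Broken
}

fn decide_tenant_startup(tenant_dir: &Path) -> TenantStartup {
    // The "attaching" marker is created before any remote data is downloaded
    // and removed only once the attach has fully completed.
    if tenant_dir.join("attaching").exists() {
        TenantStartup::ResumeAttach
    } else {
        TenantStartup::LoadLocal
    }
}
```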
Change the default region for staging from `us-east-1` to `us-east-2`
for project creation.
Remove REGION_ID from `neon-branch-create` since we don't use it.
Nothing interesting in these changes. Passing through
RUST_BACKTRACE=full will hopefully save someone else panic reproduction
time.
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Sometimes CI build fails with
error: couldn't read storage_broker/src/../proto/storage_broker.rs: No such file or directory (os error 2)
--> storage_broker/src/lib.rs:14:5
|
14 | include!("../proto/storage_broker.rs");
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The root cause is not clear, but it looks like interference with cachepot. Per
cargo docs, build scripts shouldn't output to anywhere but OUT_DIR; let's follow
this and see if it helps.
When a new root timeline is created, we want to flush all the data to
disk before we return success to the caller. We were using
checkpoint(CheckpointConfig::Forced) for that, but that also performs
compaction. With the default settings, compaction will have no work
after we have imported an empty database, as the image of that is too
small to require compaction. However, with very small
checkpoint_distance and compaction_target_size, compaction will run, and
it can take a while.
PR #2785 adds new tests that use very small checkpoint_distance and
compaction_target_size settings, and the test sometimes failed with
"operation timed out" error in the client, when the create_timeline step
took too long.
Add ClickBench benchmark, an OLAP-style benchmark, to Nightly
Benchmarks.
The full run of 43 queries on the original dataset takes more than 6h
(only 34 queries got processed in 6h) on our default-sized compute.
Having this, currently, would mean having some really unstable tests
because of our regular deployment to staging/captest environment (see
https://github.com/neondatabase/cloud/issues/1872).
I've reduced the dataset size to the first 10^7 rows from the original
10^8 rows. Now it takes ~30-40 minutes to pass.
Ref https://github.com/ClickHouse/ClickBench/tree/main/aurora-postgresql
Ref https://benchmark.clickhouse.com/
Add `neon-branch-create` / `neon-branch-delete` to allow using branches
in tests.
I have a couple of use cases in mind:
- For destructive tests with a big DB, we can create the DB once in
advance and then use branches without the need to recreate the DB itself
after tests change it.
- We can run tests in parallel (if they're compute-bound).
Also migrate to API v2 for `neon-project-create` / `neon-project-delete`
Many python tests were setting the GC/compaction period to large
values, to effectively disable GC / compaction. Reserve value 0 to
mean "explicitly disabled". We also set them to 0 in unit tests now,
although currently, unit tests don't launch the background jobs at
all, so it won't have any effect.
Fixes https://github.com/neondatabase/neon/issues/2917
Fix `test_seqscans` by disabling statement timeout.
Also, replace increasing statement timeout with disabling it for
performance tests. This should make tests more stable and allow us to
observe performance degradation instead of test failures.
**NB**: this PR doesn't update Python to 3.11; it makes tests
compatible with it and fixes a couple of warnings by updating
dependencies.
- `poetry add asyncpg@latest` to fix `./scripts/pysync`
- `poetry add boto3@latest "boto3-stubs[s3]@latest"` to fix
```
DeprecationWarning: 'cgi' is deprecated and slated for removal in Python 3.13
```
- `poetry update certifi` to fix
```
DeprecationWarning: path is deprecated. Use files() instead. Refer to https://importlib-resources.readthedocs.io/en/latest/using.html#migrating-from-legacy for migration advice.
```
- Move `types-toml` from `dev-dependencies` to `dependencies` to keep it
aligned with other `types-*` deps
* Fix https://github.com/neondatabase/neon/issues/1854
* Never log Safekeeper::conninfo in walproposer as it now contains a secret token
* control_panel, test_runner: generate and pass JWT tokens for Safekeeper to compute and pageserver
* Compute: load JWT token for Safekeeper from the environment variable. Do not reuse the token from
pageserver_connstring because it's embedded in there weirdly.
* Pageserver: load JWT token for Safekeeper from the environment variable.
* Rewrite docs/authentication.md
There will be different scopes for those two, so authorization code should be different.
The `check_permission` function is now not in the shared library. Its implementation
is very similar to the one which will be added for Safekeeper. In fact, we may reuse
the same existing root-like 'PageServerApi' scope, but I would prefer to have separate
root-like scopes for services.
Also, generate_management_token in tests is generate_pageserver_token now.
Downsides are:
* We store all components of the config separately. `Url` stores them inside a single
`String` and a bunch of ints which point to different parts of the URL, which is
probably more efficient.
* It is now impossible to pass arbitrary connection strings to the configuration file,
one has to support all components explicitly. However, we never supported anything
except for `host:port` anyway.
Upsides are:
* This significantly restricts the space of possible connection strings, some of which
may be either invalid or unsupported. E.g. Postgres connection strings may include
a bunch of parameters in the query part (e.g. `connect_timeout=`, `options=`). These are neither
validated by the current implementation nor passed to the postgres client library.
Hence, storing separate fields expresses the intention better.
* The same connection configuration may be represented as a URL in multiple ways
(e.g. either `password=` in the query part or a standard URL password).
Now we have a single canonical way.
* Escaping is provided for `options=`.
Other possibilities considered:
* `newtype` with a `String` inside and some validation on creation.
This is more efficient, but harder to log for two reasons:
* Passwords should never end up in logs, so we would have to somehow strip them before logging.
* Escaped `options=` are harder to read, especially if URL-encoded,
and we use `options=` a lot.
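Put together, a hedged sketch of the kind of struct the upsides argue for could
look like this (field names and the type name are illustrative, not the actual
config type):
```
// Illustrative sketch of "separate fields instead of a URL".
struct ConnectionConfig {
    host: String,
    port: u16,
    /// Deliberately kept out of any Display/Debug output so it can never
    /// end up in logs.
    password: Option<String>,
    /// Extra `options=` parameters, escaped when the connection string is built.
    options: Vec<String>,
}

impl ConnectionConfig {
    /// Build the single canonical connection string on demand.
    fn to_connstr(&self) -> String {
        format!("host={} port={}", self.host, self.port)
    }
}
```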
Which ought to replace etcd. This patch only adds the binary and adjusts
Dockerfile to include it; subsequent ones will add deploy of helm chart and the
actual replacement.
It is a simple and fast pub-sub message bus. In this patch only safekeeper
message is supported, but others can be easily added.
Compilation now requires protoc to be installed. Installing protobuf-compiler
package is fine for Debian/Ubuntu.
ref https://github.com/neondatabase/neon/pull/2733,
https://github.com/neondatabase/neon/issues/2394
Imagine that you have a tenant with a single branch like this:
---------------==========>
               ^
               gc horizon
where:
---- is the portion of the branch that is older than retention period
==== is the portion of the branch that is newer than retention period.
Before this commit, the sizing model included the logical size at the
GC horizon, but not the WAL after that. In particular, that meant that
on a newly created tenant with just one timeline, where the retention
period covered the whole history of the timeline, i.e. gc_cutoff was 0,
the calculated tenant size was always zero.
We now include the WAL after the GC horizon in the size. So in the
above example, the calculated tenant size would be the logical size
of the database at the GC horizon, plus all the WAL after it (marked with
===).
This adds a new `insert_point` function to the sizing model, alongside
`modify_branch`, and changes the code in size.rs to use the new
function. The new function takes an absolute lsn and logical size as
argument, so we no longer need to calculate the difference to the
previous point. Also, the end-size is now optional, because we now
need to add a point to represent the end of each branch to the model,
but we don't want to or need to calculate the logical size at that
point.
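A hypothetical shape of the new call, for orientation only (the actual signature
in libs/tenant_size_model may differ):
```
// Hypothetical sketch of the sizing-model addition described above.
struct SizeModel {
    // branches and their (lsn, size) points ...
}

impl SizeModel {
    /// Insert a point at an absolute LSN on a branch. Unlike `modify_branch`,
    /// the caller does not compute a delta against the previous point, and
    /// the logical size is optional: end-of-branch points don't need one.
    fn insert_point(&mut self, branch: &str, lsn: u64, logical_size: Option<u64>) {
        let _ = (branch, lsn, logical_size); // placeholder body
    }
}
```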
- Pass through FAILPOINTS environment variable to the pageserver in
"neon_local pageserver start" command
- On startup, list any failpoints that were set with FAILPOINTS to the log (see the sketch below)
- Add optional "extra_env_vars" argument to the NeonPageserver.start()
function in the python fixture, so that you can pass FAILPOINTS
None of the tests use this functionality yet; that comes in a separate
commit.
closes https://github.com/neondatabase/neon/pull/2865
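For orientation, this is roughly how a process might surface the failpoints it
got via FAILPOINTS, assuming the fail crate's usual `name=action;name=action`
format (a sketch, not the actual pageserver code):
```
// Sketch: log every failpoint configured through the FAILPOINTS env var.
fn log_configured_failpoints() {
    if let Ok(failpoints) = std::env::var("FAILPOINTS") {
        for fp in failpoints.split(';').filter(|s| !s.is_empty()) {
            tracing::info!("failpoint set via FAILPOINTS: {}", fp);
        }
    }
}
```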
Increase the pgbench runtimes even further. The theory is that when
there are many other tests running at the same time, one pgbench run
could take a long time until it generates enough layers for GC to kick
in.
I saw these from the build of the compute docker image in the CI
(compute-node-image-v15):
pagestore_smgr.c: In function 'neon_prefetch':
pagestore_smgr.c:1654:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
1654 | BufferTag tag = (BufferTag) {
| ^~~~~~~~~
walproposer.c:197:1: warning: no previous prototype for 'WalProposerSync' [-Wmissing-prototypes]
197 | WalProposerSync(int argc, char *argv[])
| ^~~~~~~~~~~~~~~
libpagestore.c: In function 'pageserver_connect':
libpagestore.c:100:9: warning: variable 'wc' set but not used [-Wunused-but-set-variable]
100 | int wc;
| ^~
libpagestore.c: In function 'call_PQgetCopyData':
libpagestore.c:144:9: warning: variable 'wc' set but not used [-Wunused-but-set-variable]
144 | int wc;
| ^~
Harmless warnings, but let's be tidy.
In passing, I added some "extern" to a few function declarations
that were missing them, and marked WalProposerSync as "static". Those
changes are also purely cosmetic.
Commit d013a2b227 changed the test, so that it fails if pgbench runs
to completion without triggering the failpoint. That has now happened
several times in the CI. That's not expected, so this needs some
investigation, but as a quick fix just make the pgbench runs longer so
that we're closer to the situation before commit d013a2b227.
See https://github.com/neondatabase/neon/issues/2856
This allows us to error out in the case where we request flush but the
flush loop is not running.
Before, we would only track whether it was started, but not when it
exited.
It is better to use an enum with 3 states than a 2-state bool, because then
the error message can answer the question of whether we ever started
the flush loop or not.
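A minimal sketch of the three-state idea (illustrative names, not the actual
pageserver code):
```
// Three states instead of a bool: "never started" and "exited" produce
// different error messages when a flush is requested.
enum FlushLoopState {
    NotStarted,
    Running,
    Exited,
}

fn check_flush_allowed(state: &FlushLoopState) -> Result<(), &'static str> {
    match state {
        FlushLoopState::Running => Ok(()),
        FlushLoopState::NotStarted => Err("flush requested but the flush loop was never started"),
        FlushLoopState::Exited => Err("flush requested but the flush loop has exited"),
    }
}
```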
In a CI run, I got a test failure because of this error in the log,
from the test_get_tenant_size_with_multiple_branches test:
ERROR gc_loop{tenant_id=f1630516d4b526139836ced93be0c878}: Gc failed, retrying in 2s: No such file or directory (os error 2)
There are known race conditions between GC and timeline deletion,
which surely caused that error. But if we didn't know the cause, it
would be pretty hard to debug without a stack trace.
* Poll more frequently when waiting for process start/stop. This
speeds up startup and shutdown in tests. We did this already in
commit 52ce1c9d53, which reduced the interval to 100 ms, but it was
inadvertently increased back to 500 ms in commit d42700280f. Reduce
it to 100 ms again, for both start and stop operations.
* Harmonize the start and stop loops, printing the dots and notices
the same way in both. I considered extracting the logic to a
separate retry-function that takes a closure as argument that does
the polling, but as long as we only have two copies, the code
duplication isn't that bad.
* Remove newline after "Starting pageserver" and "Starting etcd"
messages, so that the progress-indicator dots that are printed once
a second are printed on the same line. Before:
Starting pageserver at '127.0.0.1:64000' in '.neon'
...
pageserver started, pid: 2538937
After:
Starting pageserver at '127.0.0.1:64000' in '.neon'...
pageserver started, pid: 2538937
The "Starting safekeeper" message already got this right.
* Update example output in README.md to match
Set correct `pg_distrib_dir` in `pageserver.toml` and in neon_local
`config`.
`test_forward_compatibility` shows flakiness during `neon_local pg
start`, so hopefully, the patch will help.
```
2022-11-15 16:07:34.091 GMT [13338] LOG: starting with zenith basebackup at LSN 0/A6A9310, prev 0/0
2022-11-15 16:07:34.091 GMT [13338] FATAL: cannot start in read-write mode from this base backup
2022-11-15 16:07:34.091 GMT [13337] LOG: startup process (PID 13338) exited with exit code 1
```
Despite tests working, on staging the library started to fail with the
following error:
```
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 2022-11-16T11:53:37.191211Z INFO init_tenant_mgr:local_tenant_timeline_files: Collected files for 16 tenants
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: thread 'main' panicked at 'A connector was not available. Either set a custom connector or enable the `rustls` and `native-tls` crate featu>
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: stack backtrace:
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 0: rust_begin_unwind
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:584:5
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 1: core::panicking::panic_fmt
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:142:14
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 2: core::panicking::panic_display
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:72:5
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 3: core::panicking::panic_str
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:56:5
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 4: core::option::expect_failed
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/option.rs:1854:5
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 5: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 6: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 7: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 8: <aws_types::credentials::provider::future::ProvideCredentials as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 9: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 10: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 11: <aws_types::credentials::provider::future::ProvideCredentials as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 12: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 13: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 14: <aws_smithy_http_tower::map_request::MapRequestFuture<F,E> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 15: <core::pin::Pin<P> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/future.rs:124:9
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 16: <aws_smithy_http_tower::parse_response::ParseResponseService<InnerService,ResponseHandler,RetryPolicy> as tower_service::Service<aws_>
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/aws-smithy-http-tower-0.51.0/src/parse_response.rs:109:34
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 17: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 18: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-0.1.37/src/instrument.rs:272:9
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 19: <core::pin::Pin<P> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/future.rs:124:9
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 20: <aws_smithy_client::timeout::TimeoutServiceFuture<InnerFuture> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/aws-smithy-client-0.51.0/src/timeout.rs:189:70
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 21: <tower::retry::future::ResponseFuture<P,S,Request> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tower-0.4.13/src/retry/future.rs:77:41
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 22: <aws_smithy_client::timeout::TimeoutServiceFuture<InnerFuture> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/aws-smithy-client-0.51.0/src/timeout.rs:189:70
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 23: aws_smithy_client::Client<C,M,R>::call_raw::{{closure}}
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/aws-smithy-client-0.51.0/src/lib.rs:227:56
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 24: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 25: aws_smithy_client::Client<C,M,R>::call::{{closure}}
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/aws-smithy-client-0.51.0/src/lib.rs:184:29
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 26: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 27: aws_sdk_s3::client::fluent_builders::GetObject::send::{{closure}}
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/aws-sdk-s3-0.21.0/src/client.rs:7735:40
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 28: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 29: remote_storage::s3_bucket::S3Bucket::download_object::{{closure}}
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at libs/remote_storage/src/s3_bucket.rs:205:20
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 30: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 31: <remote_storage::s3_bucket::S3Bucket as remote_storage::RemoteStorage>::download::{{closure}}
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at libs/remote_storage/src/s3_bucket.rs:399:11
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 32: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 33: <core::pin::Pin<P> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/future.rs:124:9
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 34: remote_storage::GenericRemoteStorage::download_storage_object::{{closure}}
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at libs/remote_storage/src/lib.rs:264:55
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 35: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 36: pageserver::storage_sync::download::download_index_part::{{closure}}
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at pageserver/src/storage_sync/download.rs:148:57
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 37: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 38: pageserver::storage_sync::download::download_index_parts::{{closure}}::{{closure}}::{{closure}}
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at pageserver/src/storage_sync/download.rs:77:75
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 39: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 40: <futures_util::stream::futures_unordered::FuturesUnordered<Fut> as futures_core::stream::Stream>::poll_next
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.24/src/stream/futures_unordered/mod.rs:514:17
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 41: futures_util::stream::stream::StreamExt::poll_next_unpin
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.24/src/stream/stream/mod.rs:1626:9
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 42: <futures_util::stream::stream::next::Next<St> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.24/src/stream/stream/next.rs:32:9
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 43: pageserver::storage_sync::download::download_index_parts::{{closure}}
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at pageserver/src/storage_sync/download.rs:80:69
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 44: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 45: tokio::park::thread::CachedParkThread::block_on::{{closure}}
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.21.1/src/park/thread.rs:267:54
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 46: tokio::coop::with_budget::{{closure}}
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.21.1/src/coop.rs:102:9
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 47: std::thread::local::LocalKey<T>::try_with
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/thread/local.rs:445:16
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 48: std::thread::local::LocalKey<T>::with
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/thread/local.rs:421:9
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 49: tokio::coop::with_budget
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.21.1/src/coop.rs:95:5
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 50: tokio::coop::budget
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.21.1/src/coop.rs:72:5
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 51: tokio::park::thread::CachedParkThread::block_on
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.21.1/src/park/thread.rs:267:31
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 52: tokio::runtime::enter::Enter::block_on
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.21.1/src/runtime/enter.rs:152:13
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 53: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.21.1/src/runtime/scheduler/multi_thread/mod.rs:79:9
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 54: tokio::runtime::Runtime::block_on
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.21.1/src/runtime/mod.rs:492:44
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 55: pageserver::storage_sync::spawn_storage_sync_task
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at pageserver/src/storage_sync.rs:656:34
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 56: pageserver::tenant_mgr::init_tenant_mgr
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at pageserver/src/tenant_mgr.rs:88:13
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 57: pageserver::start_pageserver
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at pageserver/src/bin/pageserver.rs:269:9
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 58: pageserver::main
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at pageserver/src/bin/pageserver.rs:103:5
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 59: core::ops::function::FnOnce::call_once
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/ops/function.rs:248:5
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```
It feels like better testing on that environment is needed later; maybe more e2e
tests have to be written (although we do have download tests, so something
else is happening here, TLS issues perhaps?)
Thanks to the race condition, GC sometimes fails with "no such file or
directory" error, if the tenant is detached concurrently. That's a
known issue, but it didn't cause test failures until we started to
check for unexpected ERRORs in the log in commit 46d30bf054. We should
fix the race condition, of course, but until we do, let's silence the
failures.
Previously, if the failpoint was not reached for some reason, the test
would only fail because it would reach the 5 minute timeout we have on
all python tests. That's very subtle. Make it fail explicitly, if the
failpoint is not hit on each iteration of the loop.
Extracted from a larger PR, see
https://github.com/neondatabase/neon/pull/2785/files#r1022765794
- Refactor the code a little bit, removing the silly for-loop over a
single element.
- Make it more clear in log messages that the errors are expected
- Check for a more precise error message "Failed to load delta layer"
instead of just "extracting base backup failed".
If there are any unexpected ERRORs or WARNs in pageserver.log after test
finishes, fail the test. This requires whitelisting the errors that *are*
expected in each test, and there's also a few common errors that are
printed by most tests, which are whitelisted in the fixture itself.
With this, we don't need the special abort() call in testing mode, when
compaction or GC fails. Those failures will print ERRORs to the logs,
which will be picked up by this new mechanism.
A bunch of errors are currently whitelisted that we probably shouldn't
be emitting in the first place, but fixing those is out of scope for this
commit, so I just left FIXME comments on them.
It's more or less expected from pageserver's point of view. Change the
error kind to ConnectionReset, so that it gets logged at INFO level
instead of ERROR.
We passed the pageserver's libpq endpoint URL as the 'compute_ctl
--connstr' argument, but that was bogus: the --connstr URL is supposed
to be the URL to the *Postgres* instance that compute_ctl launches and
monitors, not to the pageserver. compute_ctl does need the pageserver
URL too, but it is read from the cluster spec JSON, not --connstr.
That was pretty confusing, as you got a lot of "unknown command"
errors in the pageserver log, when compute_tools tries to run regular
SQL commands on the pageserver. The test still passed, however, as it
doesn't require the SQL commands to succeed. But to make this less
confusing, use an invalid hostname instead, so that the queries will
fail to even connect.
- Update vendored PostgreSQL to address prefetch issues
- Make flushed state explicit in PrefetchState
- Move flush logic into prefetch_wait_for, where possible
- Clean up some prefetch state handling code in the various code
elements handling state transitions.
- Fix a race condition in neon_read_at_lsn where a hash entry pointer
was used after the hash table was updated. This could result in
incorrect state transitions and assertion failures after disconnects
during prefetch_wait_for in neon_read_at_lsn.
Fixes #2780
Before we had separate images for v14 and v15, the compute node image
was called just "neondatabase/compute-node". It has been superseded by
the "neondatabase/compute-node-v14" and "neondatabase/compute-node-v15"
images. The old image is not used by the cloud console build or tests
anymore.
I saw this in 'perf' profile of a sequential scan:
> - 31.93% 0.21% compute request pageserver [.] <pageserver::walredo::PostgresRedoManager as pageserver::walredo::WalRedoManager>::request_redo
> - 31.72% <pageserver::walredo::PostgresRedoManager as pageserver::walredo::WalRedoManager>::request_redo
> - 31.26% pageserver::walredo::PostgresRedoManager::apply_batch_postgres
> + 7.64% <std::process::ChildStdin as std::io::Write>::write
> + 6.17% nix::poll::poll
> + 3.58% <std::process::ChildStderr as std::io::Read>::read
> + 2.96% std::sync::condvar::Condvar::notify_one
> + 2.48% std::sys::unix::locks::futex::Condvar::wait
> + 2.19% alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle
> + 1.14% std::sys::unix::locks::futex::Mutex::lock_contended
> 0.67% __rust_alloc_zeroed
> 0.62% __stpcpy_ssse3
> 0.56% std::sys::unix::locks::futex::Mutex::wake
Note the 'do_reserve_and_handle' overhead. That's caused by having to grow
the buffer used to construct the WAL redo request. This commit
eliminates that overhead. It's only about 2% of the overall CPU usage,
but every little helps.
Also reuse the temp buffer when reading records from a DeltaLayer, and
call Vec::reserve to avoid growing a buffer when reading a blob across
pages. I saw a reduction from 2% to 1% of CPU spent in
do_reserve_and_handle in that codepath, but that's such a small change
that it could be just noise. Seems like it shouldn't hurt though.
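The underlying pattern is simply reusing one allocation across requests instead
of building a fresh buffer each time, e.g. (a generic sketch, not the walredo
code itself):
```
// Generic buffer-reuse sketch: keep one Vec alive and reuse its capacity.
struct RequestBuilder {
    buf: Vec<u8>,
}

impl RequestBuilder {
    fn build_request(&mut self, payload: &[u8]) -> &[u8] {
        self.buf.clear(); // drops contents, keeps capacity
        // Reserve up front so extending doesn't hit repeated
        // do_reserve_and_handle growth.
        self.buf.reserve(payload.len());
        self.buf.extend_from_slice(payload);
        &self.buf
    }
}
```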
Closes https://github.com/neondatabase/neon/issues/2697
Example:
https://github.com/neondatabase/neon/actions/runs/3416774593/jobs/5688394855
Adds a set of tests on the storage Docker images before they are pushed
to the public registries:
* tests that pageserver binary has the correct version string (other
binaries are built with the same library, so it should be enough to test
one)
* tests that the compose file set-up works and all components are able
to start and perform a single SQL query (CREATE TABLE)
This change wraps the std::process:Child that we spawn for WAL redo
into a type that ensures that we try to SIGKILL + waitpid() on it.
If there is no explicit call to kill_and_wait(), the Drop implementation
will spawn a task that does it in the BACKGROUND_RUNTIME.
That's an ugly hack but I think it's better than doing kill+wait
synchronously from Drop, since I think the general assumption in the
Rust ecosystem is that Drop doesn't block.
Especially since the drop sites can be _any_ place that drops the last
Arc<PostgresRedoManager>, e.g., compaction or GC.
The benefit of having the new type over just adding a Drop impl to
PostgresRedoProcess is that we can construct it earlier than the full
PostgresRedoProcess in PostgresRedoProcess::launch().
That allows us to correctly kill+wait the child if there is an error in
PostgresRedoProcess::launch() after spawning it.
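In outline, such a wrapper looks something like the following (a heavily
simplified sketch; the real type in walredo.rs spawns onto the pageserver's
BACKGROUND_RUNTIME and logs errors):
```
// Heavily simplified kill-on-drop wrapper around the spawned walredo child.
struct KillOnDropChild {
    child: Option<std::process::Child>,
}

impl KillOnDropChild {
    /// Explicit, synchronous shutdown path.
    fn kill_and_wait(mut self) {
        if let Some(mut child) = self.child.take() {
            let _ = child.kill();
            let _ = child.wait();
        }
    }
}

impl Drop for KillOnDropChild {
    fn drop(&mut self) {
        // Drop shouldn't block, so hand kill+wait off to a background task
        // (a plain thread here; the real code uses a tokio runtime).
        if let Some(mut child) = self.child.take() {
            std::thread::spawn(move || {
                let _ = child.kill();
                let _ = child.wait();
            });
        }
    }
}
```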
I also took a stab at a regression test. I manually verified
that it fails before the fix to walredo.rs.
fixes https://github.com/neondatabase/neon/issues/2761
closes https://github.com/neondatabase/neon/pull/2776
Add `test_forward_compatibility`, which checks if it's going to
be possible to roll back a release to the previous version.
The test uses artifacts (Neon & Postgres binaries) from the previous
release to start Neon on the repo created by the current version. It
performs exactly the same checks as `test_backward_compatibility` does.
Single `ALLOW_BREAKING_CHANGES` env var got replaced by
`ALLOW_BACKWARD_COMPATIBILITY_BREAKAGE` &
`ALLOW_FORWARD_COMPATIBILITY_BREAKAGE` and can be set by `backward
compatibility breakage` and `forward compatibility breakage` labels
respectively.
When we repeatedly wait for the same events, it's faster to create the
event set once and reuse it. While testing with a sequential scan test
case, I saw WaitLatchOrSocket consuming a lot of CPU:
> - 40.52% 0.14% postgres postgres [.] WaitLatchOrSocket
> - 40.38% WaitLatchOrSocket
> + 17.83% AddWaitEventToSet
> + 9.47% close@plt
> + 8.29% CreateWaitEventSet
> + 4.57% WaitEventSetWait
This eliminates most of that overhead.
If we're not calling kill() before dropping the PostgresRedoProcess, we
currently leak it.
That's most likely the root cause for #2761.
This patch
1. adds an error log message for that case and
2. adds error handling for all errors on the kill() path. If we're a
`testing` build, we panic. Otherwise, we log an error and leak the
process.
The error handling changes (2) are necessary to conclusively state that
the root cause for #2761 is indeed (1). If we didn't have them, the root
cause could be missing error handling instead.
To make the log messages useful, I've added tracing::instrument
attributes that log the tenant_id and PID. That helps map the
PID of `defunct` processes back to pageserver log messages. Note that a
defunct process's `/proc/$PID/` directory isn't very useful: little more
than its PID is left.
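For reference, the attribute pattern looks roughly like this (a sketch with
illustrative names; the real spans are on the walredo kill/launch paths):
```
// Sketch: a span that carries tenant_id and the child's PID, so a leftover
// defunct PID can be mapped back to pageserver log lines.
use tracing::instrument;

#[instrument(skip_all, fields(tenant_id = %tenant_id, pid = child_pid))]
fn kill_walredo_child(tenant_id: &str, child_pid: u32, child: &mut std::process::Child) {
    if let Err(e) = child.kill() {
        // This error line inherits the tenant_id and pid span fields.
        tracing::error!("could not kill walredo child process: {}", e);
    }
}
```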
Once we have validated the root cause, we'll find a fix, but that's
still an ongoing discussion.
refs https://github.com/neondatabase/neon/issues/2761
closes https://github.com/neondatabase/neon/pull/2769
This PR replaces the following global variables in the test framework
with fixtures to make tests more configurable. I mainly need this for
the forward compatibility tests (draft in
https://github.com/neondatabase/neon/pull/2766).
```
base_dir
neon_binpath
pg_distrib_dir
top_output_dir
default_pg_version (this one got replaced with a fixture named pg_version)
```
Also, this PR adds more `Path` types where the code implies them.
Prefetch requests and responses are stored in a ringbuffer instead of a
queue, which means we can utilize prefetches of many relations
concurrently -- page reads of un-prefetched relations now don't imply
dropping prefetches.
In a future iteration, this may detect sequential scans based on the
read behavior of sequential scans, and will dynamically prefetch buffers
for such relations as needed. Right now, it still depends on explicit
prefetch requests from PostgreSQL.
The main improvement here is that we now have a buffer for prefetched
pages of 128 entries with random access. Before, we had a similarly sized
cache, but this cache did not allow for random access, which resulted in
dropped entries when multiple systems used the prefetching subsystem
concurrently.
See also: #2544
With a more realistic selection of gc_horizon in tests there is an
immediate failure when trying to query the logical size with lsn <
initdb_lsn. This fixes that, and adds an illustration that came out of
explaining this tenant size calculation to more people.
Cc: #2748, #2599.
Tenant size information is gathered by using existing parts of
`Tenant::gc_iteration` which are now separated as
`Tenant::refresh_gc_info`. `Tenant::refresh_gc_info` collects branch
points, and invokes `Timeline::update_gc_info`; nothing was supposed to
be changed there. The gathered branch points (through Timeline's
`GcInfo::retain_lsns`), `GcInfo::horizon_cutoff`, and
`GcInfo::pitr_cutoff` are used to build up a Vec of updates fed into the
`libs/tenant_size_model` to calculate the history size.
The gathered information is now exposed using `GET
/v1/tenant/{tenant_id}/size`, which will respond with the actual
calculated size. Initially the idea was to have this delivered as a tenant
background task and exported via a metric, but it might be too
computationally expensive to run it periodically as we don't yet know if
the returned values are any good.
Adds one new metric:
- pageserver_storage_operations_seconds with label `logical_size`
- separating from original `init_logical_size`
Adds a pageserver wide configuration variable:
- `concurrent_tenant_size_logical_size_queries` with default 1
This leaves a lot of TODO's, tracked on issue #2748.
`test_tenant_relocation` ends up starting a temporary postgres instance with a fixed port. The change makes the port configurable in scripts/export_import_between_pageservers.py and uses that in test_tenant_relocation.
The spawn_blocking is pointless in this case: get_tenant is not
expected to block for any meaningful amount of time. There are
get_tenant calls in most other functions in the file too, and they don't
bother with spawn_blocking. Let's remove the spawn_blocking from
tenant_status, too, to be consistent.
fixes https://github.com/neondatabase/neon/issues/2731
GC needs to know about all branch points, not only the ones for
timelines that are active at the moment of GC. If a timeline
is inactive, we won't know about its branch point, and in that
case GC can delete data that is needed by a child timeline.
For compaction it is less severe: delaying compaction only
affects performance, so it is still better to run it. There is
logic to exit quickly if there is nothing to compact.
- Refactor the way the WalProposerMain function is called when started
with --sync-safekeepers. The postgres binary now explicitly loads
the 'neon.so' library and calls the WalProposerMain in it. This is
simpler than the global function callback "hook" we previously used.
- Move the WAL redo process code to a new library, neon_walredo.so,
and use the same mechanism as for --sync-safekeepers to call the
WalRedoMain function, when launched with --walredo argument.
- Also move the seccomp code to neon_walredo.so library. I kept the
configure check in the postgres side for now, though.
The main reason for that change is that Postgres 15 requires OpenSSL
for `pgcrypto` to work. Also not a bad idea to have SSL-enabled
Postgres in general.
plv8 can only be built with a fairly new gold linker version. We used to install
it via binutils packages from testing, but it also updates libc and that causes
troubles in the resulting image as different extensions were built against
different libc versions. We could either use libc from debian-testing everywhere
or restrain from using testing packages and install necessary programs manually.
This patch uses the latter approach: gold for plv8 and cmake for h3 are
installed manually.
In passing, declare h3_postgis as a safe extension (a previous omission).
`GRANT CREATE ON SCHEMA public` fails if there is no schema `public`.
Disable it in release for now and make a better fix later (it is
needed for v15 support).
* Support configuring the log format as json or plain.
Separately test json and plain logger. They would be competing on the
same global subscriber otherwise.
* Implement log_format for pageserver config
* Implement configurable log format for safekeeper.
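A minimal sketch of how such a switch can look with `tracing_subscriber` (assuming its `json` feature; the exact builder calls in pageserver and safekeeper may differ):
```
// Picks the log format from the `log_format` config value at startup.
fn init_logging(log_format: &str) {
    let base = tracing_subscriber::fmt().with_target(false);
    match log_format {
        "json" => base.json().init(),
        _ => base.init(), // "plain"
    }
}

fn main() {
    init_logging("json");
    tracing::info!("logging initialized");
}
```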
Similar to https://github.com/neondatabase/neon/pull/2395, introduces a state field in Timeline that's possible to subscribe to.
Adjusts
* walreceiver, to not have any connections if the timeline is not Active
* remote storage sync, to not schedule uploads if the timeline is Broken
* timeline creation, to not create timelines if the tenant/timeline is broken
* timelines' states, to switch automatically based on the tenant state
Does not adjust the timeline's GC, checkpointing and layer flush behaviour much, since it's not safe to cancel these processes abruptly and there's task_mgr::shutdown_tasks that does a similar thing.
This API is rather pointless: a sane choice requires knowledge of peer
status anyway, and leader lifetimes can intersect in any case, which is fine for
us -- so manual election is straightforward. Here, we deterministically choose
among the reasonably caught-up safekeepers, shifting by timeline id to spread
the load.
A step towards custom broker https://github.com/neondatabase/neon/issues/2394
* Fix bogus early exit from GC.
Commit 91411c415a added this failpoint, but the early exit was not
intentional.
* Cleanup test_gc_cutoff.py test.
- Remove the 'scale' parameter, this isn't a benchmark
- Tweak pgbench and pageserver options to create garbage faster than
the GC can collect it away. The test used to take just under 5 minutes,
which was uncomfortably close to the default 5 minute test timeout, and
annoyingly long even without the hard limit. These changes bring it down to
about 1-2 minutes.
- Improve comments, fix typos
- Rename the failpoint. The old name, 'gc-before-save-metadata' implied
that the failpoint was before the metadata update, but it was in fact
much later in the function.
- Move the call to persist the metadata outside the lock, to avoid
holding it for too long.
To verify that this test still covers the original bug,
https://github.com/neondatabase/neon/issues/2539, I commented out
the metadata file update like this:
```
diff --git a/pageserver/src/tenant/timeline.rs b/pageserver/src/tenant/timeline.rs
index 1e857a9a..f8a9f34a 100644
--- a/pageserver/src/tenant/timeline.rs
+++ b/pageserver/src/tenant/timeline.rs
@@ -1962,7 +1962,7 @@ impl Timeline {
}
// Persist the new GC cutoff value in the metadata file, before
// we actually remove anything.
- self.update_metadata_file(self.disk_consistent_lsn.load(), HashMap::new())?;
+ //self.update_metadata_file(self.disk_consistent_lsn.load(), HashMap::new())?;
info!("GC starting");
```
It doesn't fail every time with that, but it did fail after about 5
runs.
If we cannot reconstruct an FSM or VM page, while creating image
layers, fill it with zeros instead. That should always be safe, for
the FSM and VM, in the sense that you won't lose actual user data. It
will get cleaned up by VACUUM later.
We had a bug with FSM/VM truncation, where we truncated the FSM and VM
at WAL replay to a smaller size than PostgreSQL originally did. We
thought that was harmless, as the FSM and VM are not critical for
correctness and can be zeroed out or truncated without affecting user
data. However, it led to a situation where PostgreSQL created
incremental WAL records for pages that we had already truncated away
in the pageserver, and when we tried to replay those WAL records, that
failed. That led to a permanent error in image layer creation, and
prevented it from ever finishing. See
https://github.com/neondatabase/neon/issues/2601. With this patch,
those pages will be filled with zeros in the image layer, which allows
the image layer creation to finish.
Part of https://github.com/neondatabase/neon/pull/2239
Regular, from-scratch timeline creation involves running initdb in a separate directory, importing the data from that directory into the pageserver and, finally, starting timeline-related background tasks.
This PR ensures we don't leave behind any directories that are not marked as temporary and that the pageserver removes such directories on restart, allowing timeline creation to be retried with the same IDs, if needed.
It would be good to later rewrite the logic to use a temporary directory, similar to what tenant creation does.
Yet currently that's harder than this change, so it's not done here.
Follow-up of #2636 and #2654, fixing the test detection feature.
Pageserver currently outputs features as
```
/target/debug/pageserver --version
Neon page server git:7734929a8202c8cc41596a861ffbe0b51b5f3cb9 failpoints: true, features: ["testing", "profiling"]
```
These two tests, test_timeline_physical_size_post_compaction and
test_timeline_physical_size_post_gc, assumed that after you have
waited for the WAL from a bulk insertion to arrive, and you run a
cycle of checkpoint and compaction, no new layer files are created.
Because if a new layer file is created while we are calculating the
incremental and non-incremental physical sizes, they might differ.
However, the tests used a very small checkpoint_distance, so even a
small amount of WAL generated in PostgreSQL could cause a new layer
file to be created. Autovacuum can kick in at any time, and do that.
That caused occasional failures in the test. I was able to reproduce it
reliably by adding a long delay between the incremental and
non-incremental size calculations:
```
--- a/pageserver/src/http/routes.rs
+++ b/pageserver/src/http/routes.rs
@@ -129,6 +129,9 @@ async fn build_timeline_info(
}
};
let current_physical_size = Some(timeline.get_physical_size());
+ if include_non_incremental_physical_size {
+ std::thread::sleep(std::time::Duration::from_millis(60000));
+ }
let info = TimelineInfo {
tenant_id: timeline.tenant_id,
```
To fix, disable autovacuum for the table. Autovacuum could still kick
in for other tables, e.g. catalog tables, but that seems less likely
to generate enough WAL to cause a new layer file to be flushed.
If this continues to be a problem in the future, we could simply retry
the physical size call a few times, if there's a mismatch. A mismatch
could happen every once in a while, but it's very unlikely to happen
more than once or twice in a row.
Fixes https://github.com/neondatabase/neon/issues/2212
- Measure size of redo WAL (new histogram), with bounds between 24B-32kB
- Add 2 more buckets at the upper end of the redo time histogram
We often (>0.1% of the time, for several hours each day) take more than 250ms to do the
redo round-trip to the postgres process. We need to measure these redo
times more precisely.
We've got at least one user in production that cannot create a
database with a trailing space in the name.
This happens because we use `url` crate for manipulating the
DATABASE_URL, but it follows a standard that doesn't fit really
well with Postgres. For example, it trims all trailing spaces
from the path:
> Remove any leading and trailing C0 control or space from input.
> See: https://url.spec.whatwg.org/#url-parsing
But we used `set_path()` to set database name and it's totally valid
to have trailing spaces in the database name in Postgres.
Thus, use `postgres::config::Config` to modify database name in the
connection details.
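A small sketch of the idea, using the `postgres` crate's `Config`; the helper function name is made up:
```
// Builds connection details without going through URL parsing, so a
// trailing space in the database name is preserved verbatim.
fn with_dbname(base: &str, dbname: &str) -> Result<postgres::Config, postgres::Error> {
    let mut config: postgres::Config = base.parse()?;
    config.dbname(dbname); // no URL path normalization applies here
    Ok(config)
}

fn main() -> Result<(), postgres::Error> {
    let config = with_dbname("host=localhost user=john", "mydb ")?;
    println!("{config:?}");
    Ok(())
}
```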
* Persists latest_gc_cutoff_lsn before performing GC
* Perform some refactoring and code deduplication
refer #2539
* Add test for persisting GC cutoff
* Fix python test style warnings
* Bump postgres version
* Reduce number of iterations in test_gc_cutoff test
* Bump postgres version
* Undo bumping postgres version
In the Postgres backend, we cannot link directly with libpq (check the
pgsql-hackers archive for all kinds of fun that ensued when we tried to
do that). Therefore, the libpq functions are used through the thin
wrapper functions in libpqwalreceiver.so, and libpqwalreceiver.so is
loaded dynamically. To hide the dynamic loading and make the calls
look like regular functions, we use macros to hide the function
pointers.
We had inherited the same indirections in libpqwalproposer, but it's
not needed since the neon extension is already a shared library that's
loaded dynamically. There's no problem calling the functions directly
there. Remove the indirections.
Speeds up layer_map::search somewhat. I also opened a PR in the upstream
rust-amplify repository with these changes,
see https://github.com/rust-amplify/rust-amplify/pull/148. We can switch
back to upstream version when that's merged.
Lookups in the R-tree call the "envelope" function for every comparison,
and our envelope function isn't very cheap, so that overhead adds up.
Create the envelope once, when the layer is inserted into the tree, and
store it along with the layer. That uses some more memory per layer, but
that's not very significant.
Speeds up the search operation 2x
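A rough sketch of the caching idea with the `rstar` crate (coordinates simplified to `i64`; the real layer map uses much wider integers and different types):
```
use rstar::{RTreeObject, AABB};

// Wrap the layer together with an envelope computed once at insertion time,
// so rstar's frequent envelope() calls become a cheap copy instead of
// recomputing the key/LSN rectangle.
struct CachedEnvelopeLayer<L> {
    layer: L,
    envelope: AABB<[i64; 2]>,
}

impl<L> CachedEnvelopeLayer<L> {
    fn new(layer: L, key_range: (i64, i64), lsn_range: (i64, i64)) -> Self {
        let envelope =
            AABB::from_corners([key_range.0, lsn_range.0], [key_range.1, lsn_range.1]);
        CachedEnvelopeLayer { layer, envelope }
    }
}

impl<L> RTreeObject for CachedEnvelopeLayer<L> {
    type Envelope = AABB<[i64; 2]>;

    fn envelope(&self) -> Self::Envelope {
        // Called for every comparison during a search; just return the cache.
        self.envelope.clone()
    }
}
```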
This is the first step in verifying layer files. Next up on the road is
hashing the files and verifying the hashes.
The metadata additions do not require any migration. The idea is that
the change is backward and forward-compatible with regard to
`index_part.json` due to the softness of JSON schema and the
deserialization options in use.
New types added:
- LayerFileMetadata for tracking the file metadata
- starting with only the file size
- in future hopefully a sha256 as well
- IndexLayerMetadata, the serialized counterpart of LayerFileMetadata
LayerFileMetadata needing to have all of its fields be Option is a problem, but
that is not possible to address without conflicting a lot more with other
ongoing work.
Co-authored-by: Kirill Bulatov <kirill@neon.tech>
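To illustrate the `index_part.json` compatibility argument above, a small serde sketch (the field name is taken from the description, everything else is simplified): a field that is absent in an old file deserializes as `None`, and unknown fields added later are ignored by old readers as long as `deny_unknown_fields` is not enabled.
```
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct IndexLayerMetadata {
    // Absent in old index_part.json files => None after deserialization.
    #[serde(default)]
    file_size: Option<u64>,
}

fn main() {
    let old: IndexLayerMetadata = serde_json::from_str("{}").unwrap();
    assert_eq!(old.file_size, None);

    let new: IndexLayerMetadata = serde_json::from_str(r#"{"file_size": 42}"#).unwrap();
    assert_eq!(new.file_size, Some(42));
}
```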
* We had an issue with `lineinfile` usage for the pageserver configuration
file: if the S3-bucket-related values were changed, it would have
produced duplicate keys, resulting in invalid toml.
So to fix the issue, we keep the configuration in a structured
format (yaml in this case), so we can always generate syntactically
correct toml.
Inventories are converted to yaml just so that it's easier to maintain
the configuration there. Another alternative would have been a separate
variable files.
* Keep the ansible collections dir, but locally installed collections
should not be tracked.
* etcd-client is not updated, since we plan to replace it with another client and the new version errors with some missing prost library error
* clap had released another major update that requires changing every CLI declaration again, deserves a separate PR
The 'local' part was always filled in, so that was easy to merge
into the TimelineInfo itself. 'remote' only contained two fields,
'remote_consistent_lsn' and 'awaits_download'. I made
'remote_consistent_lsn' an optional field, and 'awaits_download' is now
false if the timeline is not present remotely.
However, I kept stub versions of the 'local' and 'remote' structs for
backwards-compatibility, with a few fields that are actively used by
the control plane. They just duplicate the fields from TimelineInfo
now. They can be removed later, once the control plane has been
updated to use the new fields.
It was only None when you queried the status of a timeline with
'timeline_detail' mgmt API call, and it was still being downloaded. You
can check for that status with the 'tenant_status' API call instead,
checking for has_in_progress_downloads field.
Another case was if an error happened while trying to get the current
logical size, in a 'timeline_detail' request. It might make sense to
tolerate such errors, and leave the fields we cannot fill in as empty,
None, 0 or similar, but it doesn't make sense to me to leave the whole
'local' struct empty in that case.
With the ability to pass commit_lsn. This allows performing project WAL recovery
through a different (from the original) set of safekeepers (or under a different
ttid) by
1) moving WAL files to s3 under the proper ttid;
2) explicitly creating the timeline on safekeepers, setting commit_lsn to the
latest point;
3) putting the latest .partial file into the timeline directory on safekeepers, if
desired.
Extend test_s3_wal_replay to exercise this behaviour.
Also extends timeline_status endpoint to return postgres information.
I'm using the Rust compiler and cargo versions from Debian packages,
but the latest available cargo Debian package is quite old, version
1.57. The 'named-profiles' feature was not stabilized at that
version yet, so ever since commit a463749f5, I've had to manually add
this line to the Cargo.toml file to compile. I've been wishing that
someone would update the cargo Debian package, but it doesn't seem to
be happening any time soon.
This doesn't seem to bother anyone else but me, but it shouldn't hurt
anyone else either. If there was a good reason, I could install a
newer cargo version with 'rustup', but if all we need is this one line
in Cargo.toml, I'd prefer to continue using the Debian packages.
You cannot attach/detach an individual timeline, attach/detach always
applies to the whole tenant. However, you can *delete* a single timeline
from a tenant. Fix some comments and error messages that confused these
two operations.
Commit c634cb1d36 removed the trait and changed the function to return
a &TimelineWriter, as the FIXME said we should do, but forgot to remove
the FIXME.
* Test that we emit build info metric for pageserver, safekeeper and proxy with some non-zero length revision label
* Emit libmetrics_build_info on startup of pageserver, safekeeper and
proxy with label "revision" which tells the git revision.
The previous default of 1 s caused excessive CPU usage when there were
a lot of projects. Polling every timeline once a second was too aggressive
so let's reduce it.
Fixes https://github.com/neondatabase/neon/issues/2542, but we
probably also want to do something so that we don't poll timelines
that have received no new WAL or layers since last check.
* Add test for branching on page boundary
* Normalize start recovery point
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: Thang Pham <thang@neon.tech>
We had a problem where almost all of the threads were waiting on a futex syscall. More specifically:
- `/metrics` handler was inside `TimelineCollector::collect()`, waiting on a mutex for a single Timeline
- This exact timeline was inside `control_file::FileStorage::persist()`, waiting on a mutex for Lazy initialization of `PERSIST_CONTROL_FILE_SECONDS`
- `PERSIST_CONTROL_FILE_SECONDS: Lazy<Histogram>` was blocked on `prometheus::register`
- `prometheus::register` calls `DEFAULT_REGISTRY.write().register()` to take a write lock on Registry and add a new metric
- `DEFAULT_REGISTRY` lock was already taken inside `DEFAULT_REGISTRY.gather()`, which was called by `/metrics` handler to collect all metrics
This commit creates another Registry with a separate lock, to avoid deadlock in a case where `TimelineCollector` triggers registration of new metrics inside default registry.
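A minimal sketch of the separate-registry approach with the `prometheus` crate (the metric name and help text below are placeholders):
```
use prometheus::{Histogram, HistogramOpts, Registry};

// Metrics registered here take a different lock than DEFAULT_REGISTRY,
// so lazy registration cannot deadlock against a gather() of the
// default registry running in the /metrics handler.
fn build_timeline_registry() -> prometheus::Result<(Registry, Histogram)> {
    let registry = Registry::new();
    let persist_seconds = Histogram::with_opts(HistogramOpts::new(
        "persist_control_file_seconds",
        "Time spent persisting the control file",
    ))?;
    registry.register(Box::new(persist_seconds.clone()))?;
    Ok((registry, persist_seconds))
}

fn main() -> prometheus::Result<()> {
    let (registry, _persist_seconds) = build_timeline_registry()?;
    println!("{} metric families", registry.gather().len());
    Ok(())
}
```
The `/metrics` handler can then gather both registries; the point is that registering into this registry no longer contends with the default registry's lock.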
Creates new `pageserver_api` and `safekeeper_api` crates to serve as the
shared dependencies. Should reduce both recompile times and cold compile
times.
Decreases the size of the optimized `neon_local` binary: 380M -> 179M.
No significant changes for anything else (mostly as expected).
Compute node startup time is very important. After launching
PostgreSQL, use 'notify' to be notified immediately when it has
updated the PID file, instead of polling. The polling loop had 100 ms
interval so this shaves up to 100 ms from the startup time.
The loop checked if the TCP port is open for connections, by trying to
connect to it. That seems unnecessary. By the time the postmaster.pid
file says that it's ready, the port should be open. Remove that check.
* Preserve task result in TaskHandle by keeping join handle around
The solution is not great, but it should help to debug the staging issue.
I tried to do it in the least destructive way. TaskHandle is used in only
one place, so it is ok to use something less generic unless we want
to extend its usage across the codebase. In its current form,
for its single usage site, it looks too abstract.
Some problems around this code:
1. Task can drop the event sender and continue running
2. Task cannot be joined several times (probably not needed,
but still, can be surprising)
3. Had to split the task event into two types because anyhow::Error
does not implement Clone. So TaskContinueEvent derives Clone
but the usual task event does not. The Clone requirement appears
because we clone the current value in next_task_event;
taking it by reference is complicated.
4. The split between Init and Started is artificial and comes from
watch::channel's requirement to have some initial value.
To summarize 3 and 4: it may be a better idea to use
an RwLock or a bounded channel instead.
Changes are:
* Correct typo "firts" -> "first"
* Change <empty panic with comment explaining> to <panic with message
taken from the comment>
* Fix weird indentation that rustfmt was failing to handle
* Use existing `anyhow::{anyhow,bail}!` as `{anyhow,bail}!` if it's
already in scope
* Spell `Result<T, anyhow::Error>` as `anyhow::Result<T>`
* In general, closer to matching the rest of the codebase
* Change usages of `hash_map::Entry` to `Entry` when it's already in
scope
* A quick search shows our style on this one varies across the files
it's used in
* Fix extreme metrics bloat in storage sync
From 78 metrics per (timeline, tenant) pair down to (max) 10 metrics per
(timeline, tenant) pair, plus another 117 metrics in a global histogram that
replaces the previous per-timeline histogram.
* Drop image sync operation metric series when dropping TimelineMetrics.
- Split postgres_ffi into two version specific files.
- Preserve pg_version in timeline metadata.
- Use pg_version in safekeeper code. Check for postgres major version mismatch.
- Clean up the code to use DEFAULT_PG_VERSION constant everywhere, instead of hardcoding.
- Parameterize python tests: use DEFAULT_PG_VERSION env and pg_version fixture.
To run tests using a specific PostgreSQL version, pass the DEFAULT_PG_VERSION environment variable:
'DEFAULT_PG_VERSION='15' ./scripts/pytest test_runner/regress'
Currently not all tests pass, because the Rust code relies on the default version of PostgreSQL in a few places.
Replace the layer array and linear search with R-tree
So far, the in-memory layer map that holds information about layer
files that exist, has used a simple Vec, in no particular order, to
hold information about all the layers. That obviously doesn't scale
very well; with thousands of layer files the linear search was
consuming a lot of CPU. Replace it with a two-dimensional R-tree, with
Key and LSN ranges as the dimensions.
For the R-tree, use the 'rstar' crate. To be able to use that, we
convert the Keys and LSNs into 256-bit integers. 64 bits would be
enough to represent LSNs, and 128 bits would be enough to represent
Keys. However, we use 256 bits, because rstar internally performs
multiplication to calculate the area of rectangles, and the result of
multiplying two 128 bit integers doesn't necessarily fit in 128 bits,
causing integer overflow and, if overflow-checks are enabled,
panic. To avoid that, we use 256 bit integers.
Add a performance test that creates a lot of layer files, to
demonstrate the benefit.
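A quick check of the overflow argument above (the exact widths in the layer map differ, but the point is that multiplying two 128-bit extents can overflow 128 bits):
```
fn main() {
    let key_extent: u128 = u128::MAX; // worst-case width of a Key range
    let lsn_extent: u128 = 1 << 64;   // worst-case width of an LSN range
    // rstar multiplies the extents of a rectangle to compute its area;
    // this product does not fit in 128 bits, hence the 256-bit coordinates.
    assert!(key_extent.checked_mul(lsn_extent).is_none());
}
```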
Part of the general work on improving pageserver logs.
Brief summary of changes:
* Remove `ApiError::from_err`
* Remove `impl From<anyhow::Error> for ApiError`
* Convert `ApiError::{BadRequest, NotFound}` to use `anyhow::Error`
* Note: `NotFound` has more verbose formatting because it's more
likely to have useful information for the receiving "user"
* Explicitly convert from `tokio::task::JoinError`s into
`InternalServerError`s where appropriate
Also note: many of the places where errors were implicitly converted to
500s have now been updated to return a more appropriate error. Some
places where it's not yet possible to distinguish the error types have
been left as 500s.
Follow-up to PR #2433 (b8eb908a). There's still a few more unresolved
locations that have been left as-is for the same compatibility reasons
in the original PR.
Commit 43a4f7173e fixed the case that there are extra options in the
connection string, but broke it in the case when there are not. Fix
that. But on second thoughts, it's more straightforward to set the
options with ALTER DATABASE, so change the workflow yaml file to do
that instead.
In commit 6985f6cd6c, I tried passing extra GUCs in the 'options' part
of the connection string, but it didn't work because the pgbench test
overrode it with the statement_timeout. Change it so that it adds the
statement_timeout to any other options, instead of replacing them.
* Set last written lsn for created relation
* use current LSN for updating last written LSN of relation metadata
* Update LSN for the extended blocks even for pages without LSN (zeroed)
* Update pgxn/neon/pagestore_smgr.c
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Fixes #1873: previously any run of `make` caused the `postgres-v15-headers`
target to build. It copied a bunch of headers via `install -C`. Unfortunately,
some origins were symlinks in the `./pg_install/build` directory pointing
inside `./vendor/postgres-v15` (e.g. `pg_config_os.h` pointing to `linux.h`).
GNU coreutils' `install` ignores the `-C` key for non-regular files and
always overwrites the destination if the origin is a symlink. That in turn
made Cargo rebuild the `postgres_ffi` crate and all its dependencies because
it thinks that Postgres headers changed, even if they did not. That was slow.
Now we use a custom script that wraps the `install` program. It handles one
specific case and makes sure individual headers are never copied if their
content did not change. Hence, `postgres_ffi` is not rebuilt unless there were
some changes to the C code.
One may still have slow incremental single-threaded builds because Postgres
Makefiles spawn about 2800 sub-makes even if no files have been changed.
A no-op build takes "only" 3-4 seconds on my machine now when run with `-j30`,
and 20 seconds when run with `-j1`.
This commit does two things of note:
1. Bumps the bindgen dependency from `0.59.1` to `0.60.1`. This gets us
an actual error type from bindgen, so we can display what's wrong.
2. Adds `anyhow` as a build dependency, so our error message can be
prettier. It's already used heavily elsewhere in the crates in this
repo, so I figured the fact it's a build dependency doesn't matter
much.
I ran into this from running `cargo <cmd>` without running `make` first.
Here's a comparison of the compiler output in those two cases.
Before this commit:
```
error: failed to run custom build command for `postgres_ffi v0.1.0 ($repo_path/libs/postgres_ffi)`
Caused by:
process didn't exit successfully: `$repo_path/target/debug/build/postgres_ffi-2f7253b3ad3ca840/build-script-build` (exit status: 101)
--- stdout
cargo:rerun-if-changed=bindgen_deps.h
--- stderr
bindgen_deps.h:7:10: fatal error: 'c.h' file not found
bindgen_deps.h:7:10: fatal error: 'c.h' file not found, err: true
thread 'main' panicked at 'Unable to generate bindings: ()', libs/postgres_ffi/build.rs:135:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```
After this commit:
```
error: failed to run custom build command for `postgres_ffi v0.1.0 ($repo_path/libs/postgres_ffi)`
Caused by:
process didn't exit successfully: `$repo_path/target/debug/build/postgres_ffi-e01fb59602596748/build-script-build` (exit status: 1)
--- stdout
cargo:rerun-if-changed=bindgen_deps.h
--- stderr
bindgen_deps.h:7:10: fatal error: 'c.h' file not found
Error: Unable to generate bindings
Caused by:
clang diagnosed error: bindgen_deps.h:7:10: fatal error: 'c.h' file not found
```
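For illustration, a simplified `build.rs` along those lines, not the actual `postgres_ffi` build script (the paths and the output file name are assumptions):
```
use anyhow::Context;
use std::path::PathBuf;

fn main() -> anyhow::Result<()> {
    println!("cargo:rerun-if-changed=bindgen_deps.h");

    // With bindgen >= 0.60, generate() returns a real error type, so anyhow
    // can print a readable "Caused by" chain instead of a panic on `()`.
    let bindings = bindgen::Builder::default()
        .header("bindgen_deps.h")
        .generate()
        .context("Unable to generate bindings")?;

    let out_path = PathBuf::from(std::env::var("OUT_DIR")?).join("bindings.rs");
    bindings
        .write_to_file(&out_path)
        .context("Couldn't write bindings")?;
    Ok(())
}
```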
Also get rid of the `with_safekeepers` parameter in tests.
Its meaning has changed: `False` meant "no safekeepers" which is not
supported anymore, so we assume it's always `True`.
See #1648
Previously, we compiled neon separately for Postgres v14 and v15, for
the codestyle checks. But that was bogus; we actually just ran "make
postgres", which always compiled both versions. The version really only
affected the caching.
Fix that, by copying the build steps from the main build_and_test.yml
workflow.
Running "make" at the top level calls "make install" to install the
PostgreSQL headers into the pg_install/ directory. That always updated
the modification time of the headers even if there were no changes,
triggering recompilation of the postgres_ffi bindings. To avoid that,
use 'install -C', to install the PostgreSQL headers.
However, there was an upstream PostgreSQL issue that the
src/include/Makefile didn't respect the INSTALL configure option. That
was just fixed in upstream PostgreSQL, so cherry-pick that fix to our
vendor/postgres repositories.
Fixes https://github.com/neondatabase/neon/issues/1873.
Commit f44afbaf62 updated vendor/postgres-v15 to point to a commit that
was built on top of PostgreSQL 14 rather than 15. So we accidentally had
two copies of PostgreSQL v14 in the repository. Oops. This updates
it to point to the correct version.
* Changes of neon extension to support local prefetch
* Catch exceptions in pageserver_receive
* Bump postgres version
* Bump postgres version
* Bump postgres version
* Bump postgres version
* github/actions: add neon projects related actions
* workflows/benchmarking: create projects using API
* workflows/pg_clients: create projects using API
Instead of spawning helper threads, we now use Tokio tasks. There
are multiple Tokio runtimes, for different kinds of tasks. One for
serving libpq client connections, another for background operations
like GC and compaction, and so on. That's not strictly required, we
could use just one runtime, but with this you can still get an
overview of what's happening with "top -H".
There's one subtle behavior in how TenantState is updated. Before this
patch, if you deleted all timelines from a tenant, its GC and
compaction loops were stopped, and the tenant went back to Idle
state. We no longer do that. The empty tenant stays Active. The
changes to test_tenant_tasks.py are related to that.
There's still plenty of synchronous code and blocking. For example, we
still use blocking std::io functions for all file I/O, and the
communication with WAL redo processes still uses low-level unix
poll(). We might want to rewrite those later, but this will do for
now. The model is that local file I/O is considered to be fast enough
that blocking - and preventing other tasks running in the same thread -
is acceptable.
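A minimal sketch of the dedicated-runtime idea (names and the `once_cell` usage are illustrative, not the actual task_mgr code); naming the worker threads is what makes `top -H` readable:
```
use once_cell::sync::Lazy;
use tokio::runtime::Runtime;

pub static BACKGROUND_RUNTIME: Lazy<Runtime> = Lazy::new(|| {
    tokio::runtime::Builder::new_multi_thread()
        .thread_name("background op worker") // GC, compaction, ...
        .enable_all()
        .build()
        .expect("failed to create background op runtime")
});

pub static LIBPQ_RUNTIME: Lazy<Runtime> = Lazy::new(|| {
    tokio::runtime::Builder::new_multi_thread()
        .thread_name("libpq endpoint worker") // client connections
        .enable_all()
        .build()
        .expect("failed to create libpq endpoint runtime")
});

fn main() {
    BACKGROUND_RUNTIME.block_on(async {
        println!("background runtime is up");
    });
}
```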
We had a pattern like this:
match remote_storage {
GenericRemoteStorage::Local(storage) => {
let source = storage.remote_object_id(&file_path)?;
...
storage
.function(&source, ...)
.await
},
GenericRemoteStorage::S3(storage) => {
... exact same code as for the Local case ...
},
This removes the code duplication, by allowing you to call the functions
directly on GenericRemoteStorage.
Also change RemoveObjectId to be just a type alias for String. Now that
the callers of GenericRemoteStorage functions don't know whether they're
dealing with the LocalFs or S3 implementation, RemoveObjectId must be the
same type for both.
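Schematically, the deduplication amounts to moving the match into the enum so callers stop repeating both arms; a simplified sketch with made-up method names:
```
struct LocalFs;
struct S3Bucket;

impl LocalFs {
    async fn upload(&self, _id: &str) -> anyhow::Result<()> {
        Ok(()) // would write to a local path
    }
}

impl S3Bucket {
    async fn upload(&self, _id: &str) -> anyhow::Result<()> {
        Ok(()) // would PUT to S3
    }
}

enum GenericRemoteStorage {
    Local(LocalFs),
    S3(S3Bucket),
}

impl GenericRemoteStorage {
    // The single match lives here instead of at every call site.
    async fn upload(&self, id: &str) -> anyhow::Result<()> {
        match self {
            GenericRemoteStorage::Local(storage) => storage.upload(id).await,
            GenericRemoteStorage::S3(storage) => storage.upload(id).await,
        }
    }
}
```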
Because the metadata was not locked, it could be updated concurrently
such that we wouldn't actually have the tail block.
The current ordering works better, as we still only start XLogBeginInsert()
once we have all potentially interesting buffers loaded in memory, but
still have correct lock lifetimes.
See also: access/transam/README section Write-Ahead Log Coding
Another preparatory commit for pg15 support:
* generate bindings for both pg14 and pg15;
* update Makefile and CI scripts: now neon build depends on both PostgreSQL versions;
* some code refactoring to decrease version-specific dependencies.
Commit f081419e68 moved all the prometheus counters to `metrics.rs`,
but accidentally replaced a couple of `register_int_counter!(...)`
calls with just `IntCounter::new(...)`. Because of that, the counters
were not registered in the metrics registry, and were not exposed
through the metrics HTTP endpoint.
Fixes failures we're seeing in a bunch of 'performance' tests because
of the missing metrics.
This caught or reproduced several bugs when I originally wrote this test
back in May, including #1731, #1740, #1751, and #707. I believe all the
issues have been fixed now, but since this was a very fruitful test,
let's add it to the test suite.
We didn't commit this earlier, because the test was very slow especially
with a debug build. We've since changed the build options so that even
the debug builds are not quite so slow anymore.
* Add test for pageserver metric cleanup once a tenant is detached.
* Remove tenant specific timeline metrics on detach.
* Use definitions from timeline_metrics in page service.
* Move metrics to own file from layered_repository/timeline.rs
* TIMELINE_METRICS: define smgr metrics
* REMOVE SMGR cleanup from timeline_metrics. Doesn't seem to work as
expected.
* Virtual file centralized metrics, except for evicted files, as there's no
tenant id or timeline id.
* Use STORAGE_TIME from timeline_metrics in layered_repository.
* Remove timelineless gc metrics for tenant on detach.
* Rename timeline metrics -> metrics as it's more generic.
* Don't create a TimelineMetrics instance for VirtualFile
* Move the rest of the metric definitions to metrics.rs too.
* UUID -> ZTenantId
* Use consistent style for dict.
* Use Repository's Drop trait for dropping STORAGE_TIME metrics.
* No need for Arc, TimelineMetrics is used in just one place. Due to that,
we can fall back to using ZTenantId and ZTimelineId too, to avoid an additional
string allocation.
* Add submodule postgres-15
* Support pg_15 in pgxn/neon
* Renamed zenith -> neon in Makefile
* fix name of codestyle check
* Refactor build system to prepare for building multiple Postgres versions.
Rename "vendor/postgres" to "vendor/postgres-v14"
Change Postgres build and install directory paths to be version-specific:
- tmp_install/build -> pg_install/build/14
- tmp_install/* -> pg_install/14/*
And Makefile targets:
- "make postgres" -> "make postgres-v14"
- "make postgres-headers" -> "make postgres-v14-headers"
- etc.
Add Makefile aliases:
- "make postgres" to build "postgres-v14" and in future, "postgres-v15"
- similarly for "make postgres-headers"
Fix POSTGRES_DISTRIB_DIR path in pytest scripts
* Make postgres version a variable in codestyle workflow
* Support vendor/postgres-v15 in codestyle check workflow
* Support postgres-v15 building in Makefile
* fix pg version in Dockerfile.compute-node
* fix kaniko path
* Build neon extensions in version-specific directories
* fix obsolete mentions of vendor/postgres
* use vendor/postgres-v14 in Dockerfile.compute-node.legacy
* Use PG_VERSION_NUM to gate dependencies in inmem_smgr.c
* Use versioned ECR repositories and image names for compute-node.
The image name format is compute-node-vXX, where XX is postgres major version number.
For now only v14 is supported.
Old format unversioned name (compute-node) is left, because cloud repo depends on it.
* update vendor/postgres submodule url (zenith->neondatabase rename)
* Fix postgres path in python tests after rebase
* fix path in regress test
* Use separate dockerfiles to build compute-node:
Dockerfile.compute-node-v15 should be identical to Dockerfile.compute-node-v14 except for the version number.
This is a hack, because Kaniko doesn't support build ARGs properly
* bump vendor/postgres-v14 and vendor/postgres-v15
* Don't use Kaniko cache for v14 and v15 compute-node images
* Build compute-node images for different versions in different jobs
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
* Update relation size cache only when latest LSN is requested
* Fix tests
* Add a test case for timetravel query after pageserver restart.
This test is currently failing, the queries return incorrect results.
I don't know why, needs to be investigated.
FAILED test_runner/batch_others/test_readonly_node.py::test_timetravel - assert 85 == 100000
If you remove the pageserver restart from the test, it passes.
* yapf3 test_readonly_node.py
* Add comment about cache correction in case of setting incorrect latest flag
* Fix formatting for test_readonly_node.py
* Remove unused imports
* Fix mypy warning for test_readonly_node.py
* Fix formatting of test_readonly_node.py
* Bump postgres version
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
* Fix python style
* Fix import of test_backpressure in test_latency
* Apply changed to moved neon extension
* Apply changed to moved neon extension
* Merge with main
* Update pgxn/neon/pagestore_smgr.c
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Bump postgres version
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Move backpressure throttling implementation to neon extension and function for monitoring throttling time
* Add missing includes
* Bump postgres version
Seems a bit silly to have a separate crate just for the executable. It
relies on the control plane for everything it does, and it's the only
user of the control plane.
Slim down compute-node images:
- Optimize compute_ctl build for size, not performance & debug-ability
- Don't run unused stages. Saves time in not building the PLV8 extension.
- Do not include static libraries in clean postgres
- Do the installation and finishing touches in the final layer in one job
This allows docker (and kaniko) to only register one change to the files,
removing potentially duplicate changed files.
- The runtime library for libreadline-dev is libreadline8, changing the dependency saves 45 MB
- libprotobuf-c-dev -> libprotobuf-c1, saving 100 kB
- libossp-uuid-dev -> libossp-uuid16, saving 150 kB
- gdal-bin + libgdal-dev -> libgeos-c1v5 + libgdal28 + libproj19, saving 747MB
- binutils @ testing -> libc6 @ testing, saving 32 MB
Start the calculation on the first size request, return
partially calculated size during calculation, retry if failed.
Remove "fast" size init through the ancestor: the current approach is
fast enough for now and there are better ways to optimize the
calculation via incremental ancestor size computation
For better ergonomics. I always found it weird that we used UUID to
actually mean a tenant or timeline ID. It worked because it happened
to have the same length, 16 bytes, but it was hacky.
Update PLV8 to 3.1.4 - which is the latest release.
Update PostGIS to 3.3.0
Remove PLV8 from the final image -- there is an issue we hit when installing PLV8, and we don't quite know what it is yet.
The code correctly detected too short and too long inputs, but the error
message was bogus for the case where the input stream was too long:
Error: Provided stream has actual size 5 that is smaller than the given stream size 4
That check was only supposed to catch too-small inputs, but it in
fact caught too-long inputs too. That was good, because the check
below it, which was supposed to catch too-long inputs, was in fact
broken and never did anything. It tried to read input into a buffer of
size 0 to check if there is any extra data, but reading into a
zero-sized buffer always returns 0.
Merge batch_others and batch_pg_regress. The original idea was to
split all the python tests into multiple "batches" and run each batch
in parallel as a separate CI job. However, the batch_pg_regress batch
was pretty short compared to all the tests in batch_others. We could
split batch_others into multiple batches, but it actually seems better
to just treat them as one big pool of tests and let pytest handle
the parallelism on its own. If we need to split them across multiple
nodes in the future, we could use pytest-shard or something else,
instead of managing the batches ourselves.
Merge test_neon_regress.py, test_pg_regress.py and test_isolation.py
into one file, test_pg_regress.py. Seems more clear to group all
pg_regress-based tests into one file, now that they would all be in
the same directory.
Previously, the proxy didn't forward the auxiliary `options` parameter
and other parameters to the client's compute node, e.g.
```
$ psql "user=john host=localhost dbname=postgres options='-cgeqo=off'"
postgres=# show geqo;
┌──────┐
│ geqo │
├──────┤
│ on │
└──────┘
(1 row)
```
With this patch we now forward `options`, `application_name` and `replication`.
Further reading: https://www.postgresql.org/docs/current/libpq-connect.html
Fixes #1287.
* Add fork_at_current_lsn function which creates branch at current LSN
* Undo use of fork_at_current_lsn in test_branching because of short GC period
* Add missed return in fork_at_current_lsn
* Add missed return in fork_at_current_lsn
* Update test_runner/fixtures/neon_fixtures.py
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Update test_runner/fixtures/neon_fixtures.py
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Update test_runner/fixtures/neon_fixtures.py
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
`latest_gc_cutoff_lsn` tracks the cutoff point where GC has been
performed. Anything older than the cutoff might already have been GC'd
away, and cannot be queried by get_page_at_lsn requests. It's
protected by an RWLock. Whenever a get_page_at_lsn request comes in,
it first grabs the lock, reads the current `latest_gc_cutoff`, and
holds the lock until the request has been served. The lock ensures
that GC doesn't start concurrently and remove page versions that we
still need to satisfy the request.
With the lock, get_page_at_lsn request could potentially be blocked
for a long time. GC only holds the lock in exclusive mode for a short
duration, but depending on whether the RWLock is "fair", a read
request might be queued behind the GC's exclusive request, which in
turn might be queued behind a long-running read operation, like a
basebackup. If the lock implementation is not fair, i.e. if a reader
can always jump the queue if the lock is already held in read mode,
then another problem arises: GC might be starved if a constant stream
of GetPage requests comes in.
To avoid the long wait or starvation, introduce a Read-Copy-Update
mechanism to replace the lock on `latest_gc_cutoff_lsn`. With the RCU,
reader can always read the latest value without blocking (except for a
very short duration if the lock protecting the RCU is contended;
that's comparable to a spinlock). And a writer can always write a new
value without waiting for readers to finish using the old value. The
old readers will continue to see the old value through their guard
object, while new readers will see the new value.
This is purely theoretical ATM, we don't have any reports of either
starvation or blocking behind GC happening in practice. But it's
simple to fix, so let's nip that problem in the bud.
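A minimal RCU sketch along the lines described above, not the pageserver's actual implementation: readers take a cheap snapshot, writers publish a new value without waiting for old readers.
```
use std::sync::{Arc, Mutex};

struct Rcu<T> {
    current: Mutex<Arc<T>>,
}

impl<T> Rcu<T> {
    fn new(value: T) -> Self {
        Rcu { current: Mutex::new(Arc::new(value)) }
    }

    /// Readers clone the Arc; the old value stays alive for as long as any
    /// reader still holds its snapshot.
    fn read(&self) -> Arc<T> {
        self.current.lock().unwrap().clone()
    }

    /// Writers swap in a new value; existing readers keep seeing the old one.
    fn update(&self, value: T) {
        *self.current.lock().unwrap() = Arc::new(value);
    }
}

fn main() {
    let cutoff = Rcu::new(0u64);
    let seen_by_reader = cutoff.read(); // old snapshot, e.g. during a basebackup
    cutoff.update(42);                  // GC advances the cutoff without waiting
    assert_eq!(*seen_by_reader, 0);
    assert_eq!(*cutoff.read(), 42);
}
```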
It's used by e2e CI. Building Dockerfile.compute-node will take
an unreasonable amount of time without v2 runners.
TODO: remove once cloud repo CI is moved to v2 runners.
* Extract neon and neon_test_utils from postgres repo
* Remove neon from vendored postgres repo, and fix build_and_test.yml
* Move EmitWarningsOnPlaceholders to end of _PG_init in neon.c (from libpagestore.c)
* Fix Makefile location comments
* remove Makefile EXTRA_INSTALL flag
* Update Dockerfile.compute-node to build and include the neon extension
Previously, it could only distinguish REDO task durations down to 5ms, which
equates to approx. 200 pages/sec or 1.6 MB/sec of getpage@LSN traffic.
This patch improves that to 200,000 pages/sec or 1.6 GB/sec, allowing for
much more precise performance measurement of the redo process.
* Add postgis & plv8 extensions
* Update Dockerfile & Fix typo's
* Update dockerfile
* Update Dockerfile
* Update dockerfile
* Use plv8 step
* Reduce giga layer
* Reduce layer size further
* Prepare for rollout
* Fix dependency
* Pass on correct build tag
* No longer dependent on building tools
* Use version from vendor
* Revert "Use version from vendor"
This reverts commit 7c6670c477.
* Revert and push correct set
* Add configure step for new approach
* Re-add configure flags
Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@Rorys-Mac-Studio.fritz.box>
`///` is used for comments on the *next* code that follows, so the comment
actually applied to the `use std::collections::BTreeMap;` line that follows.
rustfmt complained about that:
error: an inner attribute is not permitted following an outer doc comment
--> /home/heikki/git-sandbox/neon/libs/utils/src/seqwait_async.rs:7:1
|
5 | ///
| --- previous doc comment
6 |
7 | #![warn(missing_docs)]
| ^^^^^^^^^^^^^^^^^^^^^^ not permitted following an outer attribute
8 |
9 | use std::collections::BTreeMap;
| ------------------------------- the inner attribute doesn't annotate this `use` import
|
= note: inner attributes, like `#![no_std]`, annotate the item enclosing them, and are usually found at the beginning of source files
help: to annotate the `use` import, change the attribute from inner to outer style
|
7 - #![warn(missing_docs)]
7 + #[warn(missing_docs)]
|
`//!` is the correct syntax for comments that apply to the whole file.
Every handler function now follows the same pattern:
1. extract parameters from the call
2. check permissions
3. execute command.
Previously, we extracted some parameters before permission check and
some after. Let's be consistent.
Added pytest to check correctness of the link authentication pipeline.
Context: this PR is the first step towards refactoring the link authentication pipeline to use https (instead of psql) to send the db info to the proxy. There was a test missing for this pipeline in this repo, so this PR adds that test as preparation for the actual change of psql -> https.
Co-authored-by: Bojan Serafimov <bojan.serafimov7@gmail.com>
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Co-authored-by: Stas Kelvic <stas@neon.tech>
Co-authored-by: Dimitrii Ivanov <dima@neon.tech>
There was a nominal split between the tests in layered_repository.rs and
repository.rs, such that tests specific to the layered implementation were
supposed to be in layered_repository.rs, and tests that should work with
any implementation of the traits were supposed to be in repository.rs.
In practice, the line was quite muddled. With minor tweaks, many of the
tests in layered_repository.rs should work with other implementations too,
and vice versa. And in practice we only have one implementation, so it's
more straightforward to gather all unit tests in one place.
usize/isize type corresponds to the CPU architecture's pointer width,
i.e. 64 bits on a 64-bit platform and 32 bits on a 32-bit platform.
The logical size of a database has nothing to do with that, so
u64/i64 is more appropriate.
It doesn't make any difference in practice as long as you're on a
64-bit platform, and it's hard to imagine anyone wanting to run the
pageserver on a 32-bit platform, but let's be tidy.
Also add a comment on why we use signed i64 for the logical size
variable, even though size should never be negative. I'm not sure the
reasons are very good, but at least this documents them, and hints at
some possible better solutions.
* Update workflow to fix dependency issue
* Update workflow
* Update workflow and dockerfile
* Specify tag
* Update main dockerfile as well
* Mirror rust image to docker hub
* Update submodule ref
Co-authored-by: Rory de Zoete <rdezoete@Rorys-Mac-Studio.fritz.box>
Compute node docker image requires compute-tools to build, but this
dependency (and the argument for which image to pick) weren't described in the
workflow file. This lead to out-of-date binaries in latest builds, which
subsequently broke these images.
Including, but not limited to:
* Fixes to neon management code to support walproposer-as-an-extension
* Fix issue in expected output of pg settings serialization.
* Show the logs of a failed --sync-safekeepers process in CI
* Add compat layer for renamed GUCs in postgres.conf
* Update vendor/postgres to the latest origin/main
- There was an issue with zero commit_lsn `reason: LaggingWal { current_commit_lsn: 0/0, new_commit_lsn: 1/6FD90D38, threshold: 10485760 } }`. The problem was in `send_wal.rs`, where we initialized `end_pos = Lsn(0)` and in some cases sent it to the pageserver.
- IDENTIFY_SYSTEM previously returned `flush_lsn` as a physical end of WAL. Now it returns `flush_lsn` (as it was) to walproposer and `commit_lsn` to everyone else including pageserver.
- There was an issue with backoff where connection was cancelled right after initialization: `connected!` -> `safekeeper_handle_db: Connection cancelled` -> `Backoff: waiting 3 seconds`. The problem was in sleeping before establishing the connection. This is fixed by reworking retry logic.
- There was an issue with getting the `NoKeepAlives` reason in a loop. The issue is probably the same as the previous one.
- There was an issue with filtering safekeepers based on retry attempts, which could filter out some safekeepers indefinitely. This is fixed by using a retry cooldown duration instead of retry attempts.
- Some `send_wal.rs` connections failed with errors without context. This is fixed by adding a timeline to safekeepers errors.
New retry logic works like this:
- Every candidate has a `next_retry_at` timestamp and is not considered for connection until that moment
- When walreceiver connection is closed, we update `next_retry_at` using exponential backoff, increasing the cooldown on every disconnect.
- When `last_record_lsn` was advanced using the WAL from the safekeeper, we reset the retry cooldown and exponential backoff, allowing walreceiver to reconnect to the same safekeeper instantly.
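A small sketch of the cooldown bookkeeping described above (constants and names are illustrative, not the actual walreceiver code):
```
use std::time::{Duration, Instant};

const INITIAL_BACKOFF: Duration = Duration::from_millis(100);
const MAX_BACKOFF: Duration = Duration::from_secs(30);

struct RetryState {
    next_retry_at: Option<Instant>,
    backoff: Duration,
}

impl RetryState {
    fn new() -> Self {
        RetryState { next_retry_at: None, backoff: INITIAL_BACKOFF }
    }

    /// A candidate safekeeper is eligible only once its cooldown has expired.
    fn is_eligible(&self, now: Instant) -> bool {
        self.next_retry_at.map_or(true, |t| now >= t)
    }

    /// On disconnect, push the next attempt further out (exponential backoff).
    fn on_disconnect(&mut self, now: Instant) {
        self.next_retry_at = Some(now + self.backoff);
        self.backoff = (self.backoff * 2).min(MAX_BACKOFF);
    }

    /// Once this safekeeper actually advanced last_record_lsn, reset the
    /// cooldown so we can reconnect to it immediately.
    fn on_progress(&mut self) {
        self.next_retry_at = None;
        self.backoff = INITIAL_BACKOFF;
    }
}

fn main() {
    let mut state = RetryState::new();
    state.on_disconnect(Instant::now());
    assert!(!state.is_eligible(Instant::now()));
    state.on_progress();
    assert!(state.is_eligible(Instant::now()));
}
```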
Re-export only things that are used by other modules.
In the future, I'm imagining that we run bindgen twice, for Postgres
v14 and v15. The two sets of bindings would go into separate
'bindings_v14' and 'bindings_v15' modules.
Rearrange postgres_ffi modules.
Move function, to avoid Postgres version dependency in timelines.rs
Move function to generate a logical-message WAL record to postgres_ffi.
* Do not create initial tenant and timeline (adjust Python tests for that)
* Rework config handling during init, add --update-config to manage local config updates
The pg_control_ffi.h name implies that it only includes stuff related to
pg_control.h. That's mostly true currently, but really the point of the
file is to include everything that we need to generate Rust definitions
from.
* Use main, not branch for ref check
* Add more debug
* Count main, not head
* Try new approach
* Conform to syntax
* Update approach
* Get full history
* Skip checkout
* Cleanup debug
* Remove more debug
Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>
This patch makes walreceiver logic more complicated, but it should work better in most cases. Added `test_wal_lagging` to test scenarios where alive safekeepers can lag behind other alive safekeepers.
- There was a bug where the `etcd_info.timeline.commit_lsn > Some(self.local_timeline.get_last_record_lsn())` filter excluded all safekeepers in some strange cases. I removed this filter; it should probably help with #2237
- Now walreceiver_connection reports status, including commit_lsn. This allows keeping safekeeper connection even when etcd is down.
- Safekeeper connection now fails if pageserver doesn't receive safekeeper messages for some time. Usually safekeeper sends messages at least once per second.
- `LaggingWal` check now uses `commit_lsn` directly from safekeeper. This fixes the issue with often reconnects, when compute generates WAL really fast.
- `NoWalTimeout` is rewritten to trigger only when we know about the new WAL and the connected safekeeper doesn't stream any WAL. This allows setting a small `lagging_wal_timeout` because it will trigger only when we observe that the connected safekeeper has stuck.
The new format has a few benefits: it's shorter, simpler and
human-readable as well. We don't use base64 anymore, since
url encoding got us covered.
We also show a better error in case we couldn't parse the
payload; the users should know it's all about passing the
correct project name.
This test is failing consistently on `main` now. It's better to temporarily disable it to avoid blocking others' PRs while investigating the root cause for the test failure.
See: #2255, #2256
Resolves #2212.
- use `wait_for_last_flush_lsn` in `test_timeline_physical_size_*` tests
## Context
Need to wait for the pageserver to catch up with the compute's last flush LSN because during the timeline physical size API call, it's possible that there are running `LayerFlushThread` threads. These threads flush new layers into disk and hence update the physical size. This results in a mismatch between the physical size reported by the API and the actual physical size on disk.
### Note
The `LayerFlushThread` threads are processed **concurrently**, so it's possible that the above error still persists even with this patch. However, making the tests wait to finish processing all the WALs (not flushing) before calculating the physical size should help reduce the "flakiness" significantly
Resolves #2097
- use timeline modification's `lsn` and timeline's `last_record_lsn` to determine the corresponding LSN to query data in `DatadirModification::get`
- update `test_import_from_pageserver`. Split the test into 2 variants: `small` and `multisegment`.
+ `small` is the old test
+ `multisegment` is to simulate #2097 by using a larger number of inserted rows to create multiple segment files of a relation. `multisegment` is configured to only run with a `release` build
To flush inmemory layer eventually when no new data arrives, which helps
safekeepers to suspend activity (stop pushing to the broker). Default 10m should
be ok.
This script can be used to migrate a tenant across breaking storage versions, or (in the future) upgrading postgres versions. See the comment at the top for an overview.
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
A fair amount of the time in our python tests is spent waiting for the
pageserver and safekeeper processes to shut down. It doesn't matter so
much when you're running a lot of tests in parallel, but it's quite
noticeable when running them sequentially.
A big part of the slowness is that after sending the SIGTERM
signal, we poll to see if the process is still running, and the
polling happened at a 1 s interval. Reduce it to 0.1 s.
A newer version of mypy fixes a buggy error when trying to update only the boto3 stubs.
However it brings new checks and starts to yell when we index into
cursor.fetchone without checking for None first. So this introduces a wrapper
to simplify querying for scalar values. I tried to use the cursor_factory connection
argument but without success. There may be a better way to do that,
but this looks the simplest.
Move all the fields that were returned by the wal_receiver endpoint into
timeline_detail. Internally, move those fields from the separate global
WAL_RECEIVERS hash into the LayeredTimeline struct. That way, all the
information about a timeline is kept in one place.
In passing, I noted that the 'thread_id' field was removed from
WalReceiverEntry in commit e5cb727572, but it forgot to update
openapi_spec.yml. This commit removes that too.
It failed in staging environment a few times, and all we got in the
logs was:
ERROR could not start the compute node: failed to get basebackup@0/2D6194F8 from pageserver host=zenith-us-stage-ps-2.local port=6400
giving control plane 30s to collect the error before shutdown
That's missing all the detail on *why* it failed.
What the WAL receiver really connects to is the safekeeper. The
"producer" term is a bit misleading, as the safekeeper doesn't produce
the WAL, the compute node does.
This change also applies to the name of the field used in the mgmt API
in the response of the
'/v1/tenant/:tenant_id/timeline/:timeline_id/wal_receiver' endpoint.
AFAICS that's not used anywhere else than one python test, so it
should be OK to change it.
Ref #1902.
- Track the layered timeline's `physical_size` using `pageserver_current_physical_size` metric when updating the layer map.
- Report the local timeline's `physical_size` in timeline GET APIs.
- Add `include-non-incremental-physical-size` URL flag to also report the local timeline's `physical_size_non_incremental` (similar to `logical_size_non_incremental`)
- Add a `UIntGaugeVec` and `UIntGauge` to represent `u64` prometheus metrics
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Previously DatadirTimeline was a separate struct, and there was a 1:1
relationship between each DatadirTimeline and LayeredTimeline. That was
a bit awkward; whenever you created a timeline, you also needed to create
the DatadirTimeline wrapper around it, and if you only had a reference
to the LayeredTimeline, you would need to look up the corresponding
DatadirTimeline struct through tenant_mgr::get_local_timeline_with_load().
There were a couple of calls like that from LayeredTimeline itself.
Refactor DatadirTimeline, so that it's a trait, and mark LayeredTimeline
as implementing that trait. That way, there's only one object,
LayeredTimeline, and you can call both Timeline and DatadirTimeline
functions on that. You can now also call DatadirTimeline functions from
LayeredTimeline itself.
I considered just moving all the functions from DatadirTimeline directly
to Timeline/LayeredTimeline, but I still like to have some separation.
Timeline provides a simple key-value API, and handles durably storing
key/value pairs, and branching. Whereas DatadirTimeline is stateless, and
provides an abstraction over the key-value store, to present an interface
with relations, databases, etc. Postgres concepts.
This simplifies the logical size calculation fast-path for branch
creation, introduced in commit 28243d68e6. LayeredTimeline can now
access the ancestor's logical size directly, so it doesn't need the
caller to pass it to it. I moved the fast-path to init_logical_size()
function itself. It now checks if the ancestor's last LSN is the same
as the branch point, i.e. if there haven't been any changes on the
ancestor after the branch, and copies the size from there. An
additional bonus is that the optimization will now work any time you
have a branch of another branch, with no changes from the ancestor,
not only at a create-branch command.
The layered_repository.rs file had grown to be very large. Split off
the LayeredTimeline struct and related code to a separate source file to
make it more manageable.
There are plans to move much of the code to track timelines from
tenant_mgr.rs to LayeredRepository. That will make layered_repository.rs
grow again, so now is a good time to split it.
There's a lot more cleanup to do, but this commit intentionally only
moves existing code and avoids doing anything else, for easier review.
[proxy] Add the `password hack` authentication flow
This lets us authenticate users which can use neither
SNI (due to old libpq) nor connection string `options`
(due to restrictions in other client libraries).
Note: `PasswordHack` will accept passwords which are not
encoded in base64 via the "password" field. The assumption
is that most user passwords will be valid utf-8 strings,
and the rest may still be passed via "password_".
## Overview
This patch reduces the number of memory allocations when running the page server under a heavy write workload. This mostly helps improve the speed of WAL record ingestion.
## Changes
- modified `DatadirModification` to allow reusing the struct's allocated memory after each modification
- modified `decode_wal_record` to allow passing a `DecodedWALRecord` reference, so the struct can be reused across `decode_wal_record` calls (see the sketch after the notes below)
- added a reusable buffer for serializing object inside the `InMemoryLayer::put_value` function
- added a performance test simulating a heavy write workload for testing the changes in this patch
### Semi-related changes
- remove redundant serializations when calling `DeltaLayer::put_value` during `InMemoryLayer::write_to_disk` function call [1]
- removed the info span `info_span!("processing record", lsn = %lsn)` during each WAL ingestion [2]
## Notes
- [1]: in `InMemoryLayer::write_to_disk`, a deserialization is called
```
let val = Value::des(&buf)?;
delta_layer_writer.put_value(key, *lsn, val)?;
```
`DeltaLayer::put_value` then creates a serialization based on the previous deserialization
```
let off = self.blob_writer.write_blob(&Value::ser(&val)?)?;
```
- [2]: related: https://github.com/neondatabase/neon/issues/733
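A minimal sketch of the allocation-reuse pattern mentioned in the changes above; the types and fields are illustrative, not the actual pageserver structs. The caller owns one decoded-record buffer and clears it between records instead of allocating a fresh one per WAL record.
```
// Illustrative stand-in for the real DecodedWALRecord: clearing keeps the
// allocated capacity so the next decode call doesn't reallocate.
#[derive(Default)]
struct DecodedRecord {
    blocks: Vec<Vec<u8>>,
    main_data: Vec<u8>,
}

impl DecodedRecord {
    fn clear(&mut self) {
        self.blocks.clear();
        self.main_data.clear();
    }
}

// Sketch of the changed signature: the caller passes the struct by reference.
fn decode_record(raw: &[u8], decoded: &mut DecodedRecord) {
    decoded.clear();
    // ...real parsing would fill `blocks` here; we just copy the payload...
    decoded.main_data.extend_from_slice(raw);
}

fn ingest(records: &[Vec<u8>]) {
    let mut decoded = DecodedRecord::default();
    for raw in records {
        decode_record(raw, &mut decoded);
        // apply `decoded` to the timeline here
    }
}
```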
* More precisely control size of inmem layer
* Force recompaction of L0 layers if they contain large non-WAL-logged BLOBs, to avoid too-large layers
* Add modified version of test_hot_update test (test_dup_key.py) which should generate large layers without large number of tables
* Change test name in test_dup_key
* Add Layer::get_max_key_range function
* Add layer::key_iter method and implement new approach of splitting layers during compaction based on total size of all key values
* Add test_large_schema test for checking layer file size after compaction
* Make clippy happy
* Restore checking LSN distance threshold for checkpoint in-memory layer
* Optimize storage keys iterator
* Update pageserver/src/layered_repository.rs
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Update pageserver/src/layered_repository.rs
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Update pageserver/src/layered_repository.rs
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Update pageserver/src/layered_repository.rs
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Update pageserver/src/layered_repository.rs
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Fix code style
* Reduce number of tables in test_large_schema to make it fit in timeout with debug build
* Fix style of test_large_schema.py
* Fix handling of duplicate layers
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
We were getting a warning like this from the pg_regress tests:
=================== warnings summary ===================
/usr/lib/python3/dist-packages/_pytest/config/__init__.py:663
/usr/lib/python3/dist-packages/_pytest/config/__init__.py:663: PytestAssertRewriteWarning: Module already imported so cannot be rewritten: fixtures.pg_stats
self.import_plugin(import_spec)
-- Docs: https://docs.pytest.org/en/stable/warnings.html
------------------ Benchmark results -------------------
To fix, reorder the imports in conftest.py. I'm not sure what exactly
the problem was or why the order matters, but the warning is gone and
that's good enough for me.
If the WAL arrives at the pageserver slowly, it's possible that the
branch is created before all the data on the parent branch have
arrived. That results in a failure:
test_runner/batch_others/test_tenant_relocation.py:259: in test_tenant_relocation
timeline_id_second, current_lsn_second = populate_branch(pg_second, create_table=False, expected_sum=1001000)
test_runner/batch_others/test_tenant_relocation.py:133: in populate_branch
assert cur.fetchone() == (expected_sum, )
E assert (500500,) == (1001000,)
E At index 0 diff: 500500 != 1001000
E Full diff:
E - (1001000,)
E + (500500,)
To fix, specify the LSN to branch at, so that the pageserver will wait
for it to arrive.
See https://github.com/neondatabase/neon/issues/2063
Resolves #2054
**Context**: branch creation needs to wait for GC to acquire `gc_cs` lock, which prevents creating new timelines during GC. However, because individual timeline GC iteration also requires `compaction_cs` lock, branch creation may also need to wait for compactions of multiple timelines. This results in large latency when creating a new branch, which we advertised as *"instantly"*.
This PR optimizes the latency of branch creation by separating GC into two phases:
1. Collect GC data (branching points, cutoff LSNs, etc)
2. Perform GC for each timeline
The GC bottleneck comes from step 2, which must wait for compaction of multiple timelines. This PR modifies the branch creation and GC functions to allow GC to hold the GC lock only in step 1. As a result, branch creation doesn't need to wait for compaction to finish but only needs to wait for GC data collection step, which is fast.
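A hedged sketch of the two-phase split under assumed names (`gc_cs`, `collect_gc_inputs`, and `gc_timeline` are placeholders, not the exact functions):
```
use std::sync::Mutex;

struct GcInput { /* branch points, cutoff LSNs, ... */ }

struct Tenant {
    gc_cs: Mutex<()>,
}

impl Tenant {
    fn gc_iteration(&self) {
        // Phase 1: quick; holds gc_cs, so branch creation only waits for this.
        let inputs: Vec<GcInput> = {
            let _guard = self.gc_cs.lock().unwrap();
            self.collect_gc_inputs()
        };
        // Phase 2: slow (may wait for per-timeline compaction); runs without
        // gc_cs, so concurrent branch creation no longer blocks on it.
        for input in inputs {
            self.gc_timeline(input);
        }
    }

    fn collect_gc_inputs(&self) -> Vec<GcInput> {
        Vec::new()
    }

    fn gc_timeline(&self, _input: GcInput) {}
}
```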
Simplifies the workflow. Makes the overall build a little faster, as
the build_postgres step doesn't need to upload the pg.tgz artifact,
and the build_neon step doesn't need to download it again.
This effectively reverts commit a490f64a68. That commit changed the
workflow so that the Postgres binaries were not included in the
neon.tgz artifact. With this commit, the pg.tgz artifact is gone, and
the Postgres binaries are part of neon.tgz again.
The "cargo metadata" and "cargo test --no-run" are used in the workflow
to just list names of the final binaries, but unless the same cargo
options like --release or --debug are used in those calls, they will in
fact recompile everything.
Reorganize existing READMEs and other documentation files into mdbook
format. The resulting Table of Contents is a mix of placeholders for
docs that we should write, and documentation files that we already had,
dropped into the most appropriate place.
Update the Pageserver overview diagram. Add sections on thread
management and WAL redo processes.
Add all the RFCs to the mdbook Table of Content too.
Per github issue #1979
On receipt of the ProposerElected message, WAL is truncated at the streaming point; this
code expected that, once a vote is given for the proposer / a term switch happened,
flush_lsn can be advanced only by this proposer (or a higher one). However, that
didn't take into account the possibility of accumulating written WAL and flushing it
after the vote is given -- flushing happens without term checks. That eventually led
to the violation in question.
ref #2048
* Deduce `last_segment` automatically
* Get rid of local `wal_dir`/`wal_seg_size` variables
* Prepare to test parsing of WAL from multiple specific points, not just the start;
extract `check_end_of_wal` function to check both partial and non-partial WAL segments.
neon.tgz artifact in the github workflow included the contents of
'tmp_install', but that seems pointless, because the same files are
included earlier already in the pg.tgz artifact.
Uploading large artifacts is slow in github actions. To speed that up,
make the artifact smaller.
The code coverage tool doesn't require debug symbols, so remove them.
We've discussed doing the same for *all* binaries, but it's nice to
have debugging symbols for debugging purposes, and so that you get
more complete stack traces. The discussion is ongoing, but let's at
least do this for the test symbols now.
- Updated dependencies with "cargo update"
- Updated workspace_hack with "cargo hakari generate"
There's no particular reason to do this now, just a periodic refresh.
"cargo clippy" started to complain about these, after running "cargo
update". Not sure why it didn't complain before, but seems reasonable to
fix these. (The "cargo update" is not included in this commit)
Change the build options to enable basic optimizations even in debug
mode, and always build dependencies with more optimizations. That
makes the debug-mode binaries somewhat faster, without messing up
stack traces and line-by-line debugging too much.
See https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#concurrency
* Previously there was a single concurrency group per each branch.
As the `main` branch got pushed into frequently, very few commits got
tested to the end. It resulted in "broken" `main` branch as there were
no fully successful workflow runs.
Now the `main` branch gets a separate concurrency group for each commit.
* As GitHub Actions syntax does not have the conditional operator, it is
emulated via logical and/or operations. Although undocumented, they
return one of their operands instead of plain true/false.
* Replace 3-space indentation with 2-space indentation while we are here
to be consistent with the rest of the file.
* Wait for all computes (except one) to complete before proceeding with
the single compute.
* It previously waited for too few seconds. As the test is randomized, it was
not failing all the time, but only in specific unlucky cases.
E.g. when there were no successful queries by concurrent computes,
and the single node had big timeouts and spent lots of time making the
transaction.
See https://github.com/neondatabase/neon/runs/7234456482?check_suite_focus=true
(around line 980).
* Wait for exactly one extra transaction by the single compute.
We need both storage **and** compute images for deploy, because control plane
picks the compute version based on the storage version. If it notices a fresh
storage it may bump the compute version. And if compute image failed to build
it may break things badly.
Before this patch, importing a physical backup followed the same path
as ingesting any WAL records:
1. All the data pages from the backup are first collected in the
DatadirModification object.
2. Then, they are "committed" to the Repository. They are written to
the in-memory layer
3. Finally, the in-memory layer is frozen, and flushed to disk as a
L0 delta layer file.
This was pretty inefficient. In step 1, the whole physical backup was
held in memory. If the backup is large, you simply run out of
memory. And in step 3, the resulting L0 delta layer file is large,
holding all the data again. That's a problem if the backup is larger
than 5 GB: Amazon S3 doesn't allow uploading files larger than 5 GB
(without using multi-part upload, see github issue #1910). So we want
to avoid that.
To alleviate those problems, optimize the codepath for importing a
physical backup. The basic flow is the same as before, but step 1
is optimized so that it doesn't accumulate all the data in memory,
and step 3 writes the data in image layers instead of one large delta
layer.
Previously, upon branching, if no starting LSN is specified, we
determine the start LSN based on the source timeline's last record LSN
in `timelines::create_timeline` function, which then calls `Repository::branch_timeline`
to create the timeline.
Inside the `LayeredRepository::branch_timeline` function, to start branching,
we try to acquire a GC lock to prevent GC from removing data needed
for the new timeline. However, a GC iteration takes time, so the GC lock
can be held for a long period of time. As a result, the previously determined
starting LSN can become invalid because of GC.
This PR fixes the above issue by delaying the LSN calculation part and moving it to be
inside `LayeredRepository::branch_timeline` function.
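A sketch of the reordering, with simplified types (`gc_cs` and `create_branch_at` are placeholders): the default start LSN is resolved only after the GC lock is held, so GC cannot invalidate it in between.
```
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

type Lsn = u64;

struct Repo {
    gc_cs: Mutex<()>,
    last_record_lsn: AtomicU64,
}

impl Repo {
    fn branch_timeline(&self, start_lsn: Option<Lsn>) {
        // Take the GC lock first...
        let _gc_guard = self.gc_cs.lock().unwrap();
        // ...and only then pick the default start LSN, so it is still valid
        // (not garbage-collected) when the branch is actually created.
        let start_lsn =
            start_lsn.unwrap_or_else(|| self.last_record_lsn.load(Ordering::Acquire));
        self.create_branch_at(start_lsn);
    }

    fn create_branch_at(&self, _lsn: Lsn) {}
}
```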
* ensure_server_config() function is added to ensure the server does not have background processes
which interfere with WAL generation
* Rework command line syntax
* Add `print-postgres-config` subcommand which prints the required server configuration
Download operations of all timelines for one tenant are now grouped
together, so when attach is invoked the pageserver downloads all of them
and registers them in a single apply_sync_status_update call. This way
branches can be used safely with attach/detach.
I noticed that the pageserver has a very large virtual memory size,
several GB, even though it doesn't actually use that much
memory. That's not much of a problem normally, but I hit it because I
wanted to run tests with a limited virtual memory size, by calling
setrlimit(RLIMIT_AS), but the highest limit you can set is 2 GB. I was
not able to start pageserver with a limit of 2 GB.
On Linux, each thread allocates 32 MB of virtual memory. I read this
on some random forum on the Internet, but unfortunately could not find
the source again now. Empirically, reducing the number of threads clearly
helps to bring down the virtual memory size.
Aside from the virtual memory usage, it seems excessive to launch 40
threads in both of those thread pools. The tokio default is to have as
many worker threads as there are CPU cores in the system. That seems
like a fine heuristic for us, too, so remove the explicit setting of
the pool size and rely on the default. Note that the GC and compaction
tasks are actually run with tokio spawn_blocking, so the threads that
are actually doing the work, and possibly waiting on I/O, are not
consuming threads from the thread pool. The WAL receiver work is done
in the tokio worker threads, but the WAL receivers are more CPU bound
so that seems OK.
Also remove the explicit maximum on blocking tasks. I'm not sure what
the right value for that would be, or whether the value we set (100)
would be better than the tokio default (512). Since the value was
arbitrary, let's just rely on the tokio default for that, too.
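For illustration, this is roughly what "rely on the defaults" looks like with tokio's runtime builder (a sketch, not the exact pageserver code): the explicit `worker_threads`/`max_blocking_threads` calls are simply dropped.
```
use tokio::runtime;

fn build_runtime() -> std::io::Result<runtime::Runtime> {
    // No .worker_threads(40) / .max_blocking_threads(100): the defaults are
    // one worker per CPU core and 512 blocking threads.
    runtime::Builder::new_multi_thread().enable_all().build()
}
```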
* Add tests for different postgres clients
* test/fixtures: sanitize test name for test_output_dir
* test/fixtures: do not look for etcd before runtime
* Add workflow for testing Postgres client libraries
* Avoid reconnecting to a safekeeper immediately after its failure by limiting candidates to those with the fewest connection attempts. Thus we don't have to wait lagging_wal_timeout (10s by default) before a switch happens even if no new changes are generated; the current test_restarts_under_load expects some commits to happen within 4s.
* Make the default max_lsn_wal_lag larger, otherwise constant reconnections happen during normal work.
* Fix wal_connection_attempts maintenance, preventing a busy loop of reconnections.
Mitigates the latency cost, making push throughput 1-1.5 orders of magnitude higher.
Also make leases per timeline, not per whole safekeeper, avoiding storing
garbage in etcd for deleted timelines while the safekeeper is alive.
Previously, we were granting create only to db owner, but now we have a
dedicated 'web_access' role to connect via web UI and proxy link auth.
We anyway grant read / write all data to all roles, so let's grant
create to everyone too. This creates some privilege objects in each db,
which we need to drop before deleting the role. So now we reassign all
owned objects to each db owner before deletion. This also fixes deletion
of roles that created some data in any db previously. Will be tested by
https://github.com/neondatabase/cloud/pull/1673
Later we should stop messing with Postgres ACL that much.
Resolves #1889.
This PR adds new tests to measure the WAL backpressure's performance under different workloads.
## Changes
- add new performance tests in `test_wal_backpressure.py`
- allow safekeeper's fsync to be configurable when running tests
* Update make instructions for release and debug build. Update PostgreSQL glossary to proper version (14)
* Continued cleanup of build instructions including removal of redundancies
This brings in the change to not use a shared memory in the WAL redo
process, to avoid running out of sysv shmem segments in the page server.
Also, removal of callmemaybe bits.
* Implement the page service 'fullbackup' endpoint that works like basebackup, but also sends relation files
* Add test_runner/batch_others/test_fullbackup.py
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
All endpoints except for POST /v1/timeline are tested; that one is not tested in any way yet.
Three attempts for each endpoint: correctly authenticated, badly authenticated, unauthenticated.
* `control_plane` crate (used by `neon_local`) now parses an `auth_enabled` bool for each Safekeeper
* If auth is enabled, a Safekeeper is passed a path to a public key via a new command line argument
* Added TODO comments to other places needing auth
I made the check at the launcher level, with the perspective of generally moving
the election (the decision of who offloads) there.
Also log timeline 'active' changes.
I still don't like the surroundings and feel we'd better get away without using
election API at all, but this is a quick fix to keep CI green.
ref #1815
* Added project option in case SNI data is missing. Resolving issue #1745.
* Added invariant checking for project name: if both sni_data and project_name are available then they should match.
E.g. this log line contains duplicated data:
INFO /timeline_create{tenant=8d367870988250a755101b5189bbbc17
new_timeline=Some(27e2580f51f5660642d8ce124e9ee4ac) lsn=None}:
bootstrapping{timeline=27e2580f51f5660642d8ce124e9ee4ac
tenant=8d367870988250a755101b5189bbbc17}:
created root timeline 27e2580f51f5660642d8ce124e9ee4ac
timeline.lsn 0/16960E8
this avoids variable duplication in `bootstrapping` subspan
Fixes #1768.
## Context
Previously, to test the `get_wal_receiver` API, we ran some DB transactions and then called the API to check the latest message's LSN from the WAL receiver. However, this test doesn't work reliably, because it's not guaranteed that the WAL receiver has received the latest WAL from postgres/safekeeper at the time of the API call.
This PR resolves the above issue by adding a "poll and wait" code that waits to retrieve the latest data from the WAL receiver.
This PR also fixes a bug where two hex LSN strings were compared directly; they should be converted to numbers before the comparison. See: https://github.com/neondatabase/neon/issues/1768#issuecomment-1133752122.
Resolves #1663.
## Changes
- ignore a "broken" [1] timeline on page server startup
- fix the race condition when creating multiple timelines in parallel for a tenant
- added tests for the above changes
[1]: a timeline is marked as "broken" if either
- failed to load the timeline's metadata or
- the timeline's disk consistent LSN is zero
- Uncomment the accidentally commented-out `self.keep_alive.abort()` line; because of it, the
task never finished, which blocked the launcher.
- Mess with initialization one more time, to fix the offloader trying to back up
segment 0. Now we initialize all required LSNs in handle_elected,
where we learn the start LSN for the first time.
- Fix blind attempt to provide safekeeper service file with remote storage
params.
A separate task is launched for each timeline and stopped when the timeline doesn't
need offloading. The decision of who offloads is made through etcd leader election;
currently there is no precondition for participating, which is a TODO.
neon_local and tests infrastructure for remote storage in safekeepers added,
along with the test itself.
ref #1009
Co-authored-by: Anton Shyrabokau <ahtoxa@Antons-MacBook-Pro.local>
* turn println into an info with proper message
* rename new_local_timeline to load_local_timeline because it does not
create a new timeline; it registers a timeline that already exists on disk in the
pageserver's in-memory structures
If the 'basebackup' command failed in the middle of building the tar
archive, the client would not report the error, but would attempt to
start up postgres with the partial contents of the data directory.
That fails because the control file is missing (it's added to the
archive last, precisely to make sure that you cannot start postgres
from a partial archive). But the client doesn't see the proper error
message that caused the basebackup to fail in the server, which is
confusing.
Two issues conspired to cause that:
1. The tar::Builder object that we use in the pageserver to construct
the tar stream has a Drop handler that automatically writes a valid
end-of-archive marker on drop. Because of that, the resulting tarball
looks complete, even if an error happens while we're building it. The
pageserver does send an ErrorResponse after the seemingly-valid
tarball, but:
2. The client stops reading the Copy stream, as soon as it sees the
tar end-of-archive marker. Therefore, it doesn't read the
ErrorResponse that comes after it.
We have two clients that call 'basebackup', one in `control_plane`
used by the `neon_local` binary, and another one in
`compute_tools`. Both had the same issue.
This PR fixes both issues, even though fixing either one would be
enough to fix the problem at hand. The pageserver now doesn't send the
end-of-archive marker on error, and the client now reads the copy
stream to the end, even if it sees an end-of-archive marker.
Fixes github issue #1715
In passing, change Basebackup to use a generic Write rather than
'dyn'.
By default, etcd makes a huge 10 GB mmap() allocation when it starts up.
It doesn't actually use that much memory, it's just address space, but
it caused me grief when I tried to use 'rr' to debug a python test run.
Apparently, when you replay the 'rr' trace, it does allocate memory for
all that address space.
The size of the initial mmap depends on the --quota-backend-bytes setting.
Our etcd clusters are very small, so let's set --quota-backend-bytes to
keep the virtual memory size small, to make debugging with 'rr' easier.
See https://github.com/etcd-io/etcd/issues/7910 and
5e4b008106
* Potential fix to #1626. Fixed typo in Makefile.
* Completed fix to #1626.
Summary:
changed 'error' to 'bail' in start_pageserver and start_safekeeper.
The logic would incorrectly remove an image layer, if a new image layer
existed, even though the older image layer was still needed by some
delta layers after it. See example given in the comment this adds.
Without this fix, I was getting a lot of "could not find data for key
010000000000000000000000000000000000" errors from GC, with the new test
case being added in PR #1735.
Fixes #707
The primary reason: make GitHub detect that we use Apache License 2.0
They do it via https://github.com/licensee/licensee Ruby library (gem).
Our COPYRIGHT file contains a part of the Apache License, which should
be added to a source file, not the license or copyright information itself,
which confuses the library.
Instead, the recommended way is to create a NOTICE file which references
license of the code and its bundled dependencies.
There were a bunch of dependencies for Python <3.9. They are not needed
after #1254. This commit makes it easier to add/remove dependencies because
lock file will be updated like this on any such operation.
Do not update dependencies yet to not break anything.
It would be better to not update xl_crc/rec_hdr at all when skipping contrecord,
but I would prefer to keep PR #1574 small.
Better audit of `find_end_of_wal_segment` is coming anyway in #544.
Previous invariant: `crc` contains an "unfinalized" CRC32 value,
its one's complement, like in postgres before FIN_CRC32C.
New invariant: `crc` always contains a "finalized" CRC32 value;
this is the semantics of crc32c_append, so we don't need to invert the CRC manually.
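A small sketch of the new invariant with the crc32c crate (the helper function is illustrative): the running value is always a finalized CRC, and `crc32c_append` continues from it directly.
```
// Compute a CRC32C over several chunks; `crc` stays "finalized" throughout,
// so no manual one's-complement step is needed between updates.
fn running_crc(chunks: &[&[u8]]) -> u32 {
    let mut crc = 0u32; // finalized CRC of the empty input
    for chunk in chunks {
        crc = crc32c::crc32c_append(crc, chunk);
    }
    crc
}
```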
* The actual generation logic is in a separate crate, `postgres_ffi/wal_generate`
* The crate also provides a binary for debug purposes, akin to `initdb`
* Two tests currently fail and are ignored
* There is no easy way to test this directly in Safekeeper as it starts restoring from commit_lsn.
So testing would require disconnecting Safekeeper just after it has received the WAL,
but before it is committed.
The CI times out after 10 minutes of no output. It's annoying if a
test hangs and is killed by the CI timeout, because you don't get
information about which test was running. Try to avoid that, by adding
a slightly smaller timeout in pytest itself. You can override it on a
per-test basis if needed, but let's try to keep our tests shorter than
that.
For the Postgres regression tests, use a longer 30 minute timeout.
They're not really a single test, but many tests wrapped in a single
pytest test. It's OK for them to run longer in aggregate, each
Postgres test is still fairly short.
We saw a case in staging, where there was a gap in the LSN ranges of
level 0 files, like this:
000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9
000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016960E9-00000000016E4DB9
000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016E4DB9-000000000BFCE3E1
000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000000BFCE3E1-000000000BFD0FE9
000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000060045901-000000007005EAC1
000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000007005EAC1-0000000080062E99
000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000080062E99-000000009007F481
000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000009007F481-00000000A009F7C9
000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000A009F7C9-00000000AA284EB9
000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000AA286471-00000000AA2886B9
Note that gap between 000000000BFD0FE9 and 0000000060045901. I don't
know how that happened, but in general the pageserver should be robust
if there are gaps like that, or overlapping files etc. In theory they
could happen as result of crashes, partial downloads from S3 etc.,
although it is a mystery what caused it in this case.
Looking at the compaction code, it was not safe in the face of gaps
like that. The compaction routine collected all the level 0 files, and
took their min(start)..max(end) as the range of the new files it
builds. That's wrong, if the level 0 files don't cover the whole LSN
range; the newly created files will miss any records in the gap. Fix
that, by only collecting contiguous sequences of level 0 files, so
that the end LSN of previous delta file is equal to the start of the
next one.
Fixes issue #1730
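A hedged sketch of the contiguity check with simplified types (not the actual compaction code): the level-0 deltas are grouped into runs where each layer starts exactly where the previous one ends, and each run is compacted separately instead of assuming min(start)..max(end) is fully covered.
```
type Lsn = u64;

struct L0Layer {
    lsn_start: Lsn,
    lsn_end: Lsn,
}

fn contiguous_runs(mut layers: Vec<L0Layer>) -> Vec<Vec<L0Layer>> {
    layers.sort_by_key(|l| l.lsn_start);
    let mut runs: Vec<Vec<L0Layer>> = Vec::new();
    for layer in layers {
        let continues_run = runs
            .last()
            .and_then(|run| run.last())
            .map_or(false, |prev| prev.lsn_end == layer.lsn_start);
        if continues_run {
            runs.last_mut().unwrap().push(layer);
        } else {
            // Gap before this layer (or very first layer): start a new run.
            runs.push(vec![layer]);
        }
    }
    runs
}
```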
Previously, the path was printed to the log with separate error!() calls.
It's better to include the whole path in the error object and have it
printed to the log as one message.
Also print the path in the ValueReconstructResult::Missing case.
This is what it looks like now:
2022-05-17T21:53:53.611801Z ERROR pagestream{timeline=5adcb4af3e95f00a31550d266aab7a37 tenant=74d9f9ad3293c030c6a6e196dd91c60f}: error reading relation or page version: could not find data for key 000000067F000032BE000000000000000001 at LSN 0/1698C48, for request at LSN 0/1698CF8
Caused by:
0: layer traversal: result Complete, cont_lsn 0/1698C48, layer: 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001698C48-0000000001698CC1
1: layer traversal: result Continue, cont_lsn 0/1698CC1, layer: inmem-0000000001698CC1-FFFFFFFFFFFFFFFF
Stack backtrace:
+ neondatabase/cloud#1103
This adds a couple of control endpoints to simplify compute state
discovery for control-plane. For example, now we may figure out
that Postgres wasn't able to start or basebackup failed within
seconds instead of just blindly polling the compute readiness
for a minute or two.
Also, we now expose startup metrics (the time of each step: basebackup,
sync safekeepers, config, total). The console grabs them after each
successful start and reports them as histograms to prometheus and grafana.
An OpenAPI spec is added and up-to-date, but is not currently used in the
console yet.
- Enabled process exporter for storage services
- Changed zenith_proxy prefix to just proxy
- Removed old `monitoring` directory
- Removed common prefix for metrics, now our common metrics have `libmetrics_` prefix, for example `libmetrics_serve_metrics_count`
- Added `test_metrics_normal_work`
The SyncQueue consisted of a tokio mpsc channel, and an atomic counter
to keep track of how many items there are in the channel. Updating the
atomic counter was racy, and sometimes the consumer would decrement
the counter before the producer had incremented it, leading to integer
wraparound to usize::MAX. Calling Vec::with_capacity(usize::MAX) leads
to a panic.
To fix, replace the channel with a VecDeque protected by a Mutex, and
a condition variable for signaling. Now that the queue is
protected by a standard blocking Mutex and Condvar, refactor the
functions touching it to be sync, not async.
A theoretical downside of this is that the calls to push items to the
queue and the storage sync thread that drains the queue might now need
to wait, if another thread is busy manipulating the queue. I believe
that's OK; the lock isn't held for very long, and these operations are
made in background threads, not in the hot GetPage@LSN path, so
they're not very latency-sensitive.
Fixes #1719. Also add a test case.
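A minimal sketch of the new queue shape (illustrative, not the exact pageserver type): the length can never be observed out of sync with the contents, because both live under the same Mutex.
```
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

struct SyncQueue<T> {
    queue: Mutex<VecDeque<T>>,
    cond: Condvar,
}

impl<T> SyncQueue<T> {
    fn new() -> Self {
        Self {
            queue: Mutex::new(VecDeque::new()),
            cond: Condvar::new(),
        }
    }

    fn push(&self, item: T) {
        self.queue.lock().unwrap().push_back(item);
        self.cond.notify_one();
    }

    /// Block until at least one item is available, then drain everything.
    fn pop_all(&self) -> Vec<T> {
        let mut q = self.queue.lock().unwrap();
        while q.is_empty() {
            q = self.cond.wait(q).unwrap();
        }
        q.drain(..).collect()
    }
}
```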
The contract for wait_for() was not very clear. It waits until the
given function returns successfully, without an exception, but the
wait_for_last_record_lsn() and wait_for_upload() functions used "a <
b" as the condition, i.e. they thought that wait_for() would poll
until the function returns true.
Inline the logic from wait_for() into those two functions, it's not
that complicated, and you get a more specific error message too, if it
fails. Also add a comment to wait_for() to make it more clear how it
works.
Also change remote_consistent_lsn() to return 0 instead of raising an
exception, if remote is None. That can happen if nothing has been
uploaded to remote storage for the timeline yet. It happened once in
the CI, and I was able to reproduce that locally too by adding a sleep
to the storage sync thread, to delay the first upload.
I got annoyed by all the noise in CI test output.
Before:
$ ./target/release/neon_local stop
Stop pageserver gracefully
Pageserver still receives connections
Pageserver stopped receiving connections
Pageserver status is: Reqwest error: error sending request for url (http://127.0.0.1:9898/v1/status): error trying to connect: tcp connect error: Connection refused (os error 111)
initializing for sk 1 for 7676
Stop safekeeper gracefully
Safekeeper still receives connections
Safekeeper stopped receiving connections
Safekeeper status is: Reqwest error: error sending request for url (http://127.0.0.1:7676/v1/status): error trying to connect: tcp connect error: Connection refused (os error 111)
After:
$ ./target/release/neon_local stop
Stopping pageserver gracefully...done!
Stopping safekeeper 1 gracefully...done!
Also removes the spurious "initializing for sk 1 for 7676" message from
"neon_local start"
Resolves#1488.
- implemented `GET tenant/:tenant_id/timeline/:timeline_id/wal_receiver` endpoint
- returned `thread_id` in `thread_mgr::spawn`
- added `latest_gc_cutoff_lsn` field to `LocalTimelineInfo` struct
It's very confusing, and because you don't get a stack trace and error
message in the logs, it makes debugging very hard. However, the
'test_pageserver_recovery' test relied on that behavior. To support that,
add a new "exit" action to the pageserver 'failpoints' command, so that
you can explicitly request to exit the process when a failpoint is hit.
Use timestamp->LSN mapping instead of file modification time.
Fix 'latest_gc_cutoff_lsn' - set it to the minimum of pitr_cutoff and gc_cutoff.
Add new test: test_pitr_gc
* There is no auth in Safekeeper HTTP at all currently,
so simply calling `check_permission` is not enough.
* There are no checks of Safekeeper still working with the data,
as "still working" is burry now: a timeline may be "active"
while there are no compute nodes and all data is propagated.
* Still, callmemaybe is deactivated, and timeline is removed from the
internal map. It can easily sneak back in case of race conditions
and implicit creations, though.
Try to follow Prometheus style-guide https://prometheus.io/docs/practices/naming/ for metrics names. More specifically:
- Use `pageserver_` prefix for all pageserver metrics
- Specify `_seconds` unit in time metrics
- Use unit as a suffix in other cases, such as `_hits`, `_bytes`, `_records`
- Use `_total` suffix for accumulating counters (note that Histograms append that suffix internally)
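Illustrative examples of the naming rules with the prometheus crate; the metric names below are examples chosen to show the pattern, not necessarily the real ones.
```
use prometheus::{register_histogram, register_int_counter, Histogram, IntCounter};

fn register_example_metrics() -> (Histogram, IntCounter) {
    // `pageserver_` prefix + `_seconds` unit suffix for a time metric.
    let getpage_seconds = register_histogram!(
        "pageserver_getpage_seconds",
        "Time spent serving a GetPage request"
    )
    .unwrap();
    // `_total` suffix for an accumulating counter.
    let wal_records_total = register_int_counter!(
        "pageserver_wal_records_received_total",
        "Number of WAL records received"
    )
    .unwrap();
    (getpage_seconds, wal_records_total)
}
```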
* Do not apply records with LSN smaller than LSN of cached image in delta layer
* Do not apply records with LSN smaller than LSN of cached image in delta layer
* Do not set LSN for new FPI page
refer #1656
* Add page_is_new, page_get_lsn, page_set_lsn functions
* Fix page_is_new implementation
* Add comment from XLogReadBufferForRedoExtended
Fixes#1628
- add [`comfy_table`](https://github.com/Nukesor/comfy-table/tree/main) and use it to construct table for `pg list` CLI command
Comparison
- Old:
```
NODE ADDRESS TIMELINE BRANCH NAME LSN STATUS
main 127.0.0.1:55432 3823dd05e35d71f6ccf33049de366d70 main 0/16FB140 running
migration_check 127.0.0.1:55433 3823dd05e35d71f6ccf33049de366d70 main 0/16FB140 running
```
- New:
```
NODE ADDRESS TIMELINE BRANCH NAME LSN STATUS
main 127.0.0.1:55432 3823dd05e35d71f6ccf33049de366d70 main 0/16FB140 running
migration_check 127.0.0.1:55433 3823dd05e35d71f6ccf33049de366d70 main 0/16FB140 running
```
Make it remember when the timeline starts in general, and on this safekeeper in
particular (the point might be later on a new safekeeper replacing a failed one).
Bumps control file and walproposer protocol versions.
While protocol is bumped, also add safekeeper node id to
AcceptorProposerGreeting.
ref #1561
Now the principle is the following: acceptor thread (libpq and http) errors will
bring the pageserver down, but all per-tenant thread failures will be treated
as errors.
A new `get_lsn_by_timestamp` command is added to the libpq page service
API.
An extra timestamp field is now stored after each
Clog page. It is the timestamp of the latest commit among all the
transactions on the Clog page. To find the overall latest commit, we
need to scan all Clog pages, but this isn't a very frequent operation
so that's not too bad.
To find the LSN that corresponds to a timestamp, we perform a binary
search. The binary search starts with min = last LSN when GC ran, and
max = latest LSN on the timeline. On each iteration of the search we
check if there are any commits with a higher-than-requested timestamp
at that LSN.
Implements github issue 1361.
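A hedged sketch of that binary search; the probe callback stands in for the real "are there any commits with a higher timestamp at this LSN?" check.
```
type Lsn = u64;
type Timestamp = i64;

/// Returns the first LSN in [low, high) at which a commit with a timestamp
/// above `search_timestamp` is visible, or `high` if there is none.
fn find_lsn_for_timestamp(
    mut low: Lsn,  // last LSN at which GC ran
    mut high: Lsn, // latest LSN on the timeline
    search_timestamp: Timestamp,
    commits_after: impl Fn(Lsn, Timestamp) -> bool,
) -> Lsn {
    while low < high {
        let mid = low + (high - low) / 2;
        if commits_after(mid, search_timestamp) {
            // Newer-than-requested commits already exist at `mid`: look earlier.
            high = mid;
        } else {
            low = mid + 1;
        }
    }
    low
}
```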
* Traverse frozen layers in get_reconstruct_data in reverse order
* Fix comments on frozen layers.
Note explicitly the order that the layers are in the queue.
* Add fail point to reproduce failpoint iteration error
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Now proxy binary accepts `--auth-backend` CLI option, which determines
auth scheme and cluster routing method. Following backends are currently
implemented:
* legacy
the old method: when the username ends with `@zenith`, it uses md5 auth with dbname as
the cluster name; otherwise, it sends a login link and waits for the console
to call back
* console
new SCRAM-based console API; uses SNI info to select the destination
cluster
* postgres
uses postgres to select auth secrets of existing roles. Useful for local
testing
* link
sends login link for all usernames
* `cloud::legacy` talks to Cloud API V1.
* `cloud::api` defines Cloud API v2.
* `cloud::local` mocks the Cloud API V2 using a local postgres instance.
* It's possible to choose between API versions using the `--api-version` flag.
This is needed to forward the `ClientKey` that's required
to connect the proxy to a compute.
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
Commit edba2e97 renamed pageserver/README to pageserver/README.md, but
forgot to update links to it. Fix.
Rename libs/postgres_ffi/README and safekeeper/README files to also
have the .md extension, so that github can render them nicely.
Quote ascii-diagram in safekeeper/README.md so that it renders
correctly.
wal_keep_size is already set to 0 in our cloud setup, but we don't use this value in tests. This commit fixes wal_keep_size in control_plane and adds tests for WAL recycling and lagging safekeepers.
When the failpoint feature is disabled, the macro throws away the passed code, so the code
inside is not guaranteed to compile when the feature is disabled. In this
particular case the code is obsolete, so remove it.
Remove WAL when it is consumed by all of: 1) the pageserver (remote_consistent_lsn), 2)
safekeeper peers, 3) s3 WAL offloading.
In the test, s3 offloading is for now mocked by directly bumping s3_wal_lsn.
ref #1403
Previously we used a table interface, but there was no easy way to pass
it as an override to the pageserver through the CLI. Use the same strategy as
for remote storage config parsing.
Add tenant config API and 'zenith tenant config' CLI command.
Add a 'show' query to the pageserver protocol for tenant-specific config parameters.
Refactoring: move tenant_config code to a separate module.
Save the tenant conf file to the tenant's directory when the tenant is created, to recover it on pageserver restart.
Ignore errors during tenant config loading while it is not supported by the console.
Define PiTR interval for GC.
refer #1320
This depends on a hacked version of the 'pprof-rs' crate. Because of
that, it's under an optional 'profiling' feature. It is disabled by
default, but enabled for release builds in CircleCI config. It doesn't
currently work on macOS.
The flamegraph is written to 'flamegraph.svg' in the pageserver
workdir when the 'pageserver' process exits.
Add a performance test that runs the perf_pgbench test, with profiling
enabled.
We only checked the cache page version when collecting WAL records in
an in-memory layer, not in a delta layer. Refactor the code so that we
always stop collecting WAL records when we reach a cached materialized
page.
Fix the assertion on the LSN range in
InMemoryLayer::get_value_reconstruct_data. It was supposed to check
that the requested LSN range is within the layer's LSN range, but the
inequality was backwards. That went unnoticed before, because the
caller always passed the layer's start LSN as the requested LSN
range's start LSN, but now we might stop the search earlier, if we have
a cached page version.
Co-authored-by: Konstantin Knizhnik <knizhnik@zenith.tech>
Unlink failure isn't serious on its own, we were about to remove the
file anyway, but it shouldn't happen and could be a symptom of
something more serious.
We just saw "No such file or directory" errors happening from
ephemeral file writeback in staging, and I suspect if we had this
warning in place, we would have seen these warnings too, if the
problem was that the ephemeral file was removed before dropping the
EphemeralFile struct. Next time it happens, we'll have more
information.
With this, we no longer need to build two versions of 'pem' and 'base64'
crates. Introduces a duplicate version of 'time' crate, though, but it's
still progress.
- Remove batch_others/test_pgbench.py. It was a quick check that pgbench
works, without actually recording any performance numbers, but that
doesn't seem very interesting anymore. Remove it to avoid confusing it
with the actual pgbench benchmarks
- Run pgbench with "-n" and "-S" options, for two different workloads:
simple-updates, and SELECT-only. Previously, we would only run it with
the "default" TPCB-like workload. That's more or less the same as the
simple-update (-n) workload, but I think the simple-upload workload
is more relevant for testing storage performance. The SELECT-only
workload is a new thing to measure.
- Merge test_perf_pgbench.py and test_perf_pgbench_remote.py. I added
a new "remote" implementation of the PgCompare class, which allows
running the same tests against an already-running Postgres instance.
- Make the PgBenchRunResult.parse_from_output function more
flexible. pgbench can print different lines depending on the
command-line options, but the parsing function expected a particular
set of lines.
The PgProtocol.connect() function took extra options for username,
database, etc. Remove those options, and have a generic way for each
subclass of PgProtocol to provide some default options, with the
capability to override them in the connect() call.
It was the only test that used the 'schema' argument to the connect()
function. I'm about to refactor the option handling and will remove
the special 'schema' argument altogether, so rewrite the test to not
use it.
There's a lot of renaming left to do in the code and docs, but this is
a start. Our binaries and many other things are still called "zenith",
but I didn't change those in the README, because otherwise the
examples won't work. I added a brief note at the top of the README to
explain that we're in the process of renaming, until we've renamed
everything.
It originated from the fact that we were calling fetch_full_index
without releasing the read guard, and fetch_full_index tries to acquire the
read lock again. For a plain mutex that is already a deadlock; for an RW lock the
deadlock was achieved by an attempt to acquire write access later in the
code while still having an active read guard up in the stack.
This is sort of a band-aid, because Kirill plans to change this code
during removal of the archiving mechanism.
* [proxy] Add SCRAM auth
* [proxy] Implement some tests for SCRAM
* Refactoring + test fixes
* Hide SCRAM mechanism behind `#[cfg(test)]`
Currently we only use it in tests, so we hide all relevant
module behind `#[cfg(test)]` to prevent "unused item" warnings.
* Add test for restore from WAL
* Fix python formatting
* Choose unused port in wal restore test
* Move recovery tests to zenith_utils/scripts
* Set LD_LIBRARY_PATH in wal recovery scripts
* Fix python test formatting
* Fix mypy warning
* Bump postgres version
* Bump postgres version
We now use a page cache for those, instead of slurping the whole index into
memory.
Fixes https://github.com/zenithdb/zenith/issues/1356
This is a backwards-incompatible change to the storage format, so
bump STORAGE_FORMAT_VERSION.
This introduces two new abstraction layers for I/O:
- Block I/O, and
- Blob I/O.
The BlockReader trait abstracts a file or something else that can be read
in 8kB pages. It is implemented by EphemeralFiles, and by a new
FileBlockReader struct that allows reading arbitrary VirtualFiles in that
manner, utilizing the page cache.
There is also a new BlockCursor struct that works as a cursor over a
BlockReader. When you create a BlockCursor and read the first page using
it, it keeps the reference to the page. If you access the same page again,
it avoids going to page cache and quickly returns the same page again.
That can save a lot of lookups in the page cache if you perform multiple
reads.
The Blob-oriented API allows reading and writing "blobs" of arbitrary
length. It is a layer on top of the block-oriented API. When you write
a blob with the write_blob() function, it writes a length field
followed by the actual data to the underlying block storage, and
returns the offset where the blob was stored. The blob can be
retrieved later using the offset.
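A minimal sketch of the blob layer's write side (simplified: an in-memory length prefix over any `Write`, whereas the real code sits on top of the block-oriented API):
```
use std::io::{self, Write};

struct BlobWriter<W: Write> {
    inner: W,
    offset: u64,
}

impl<W: Write> BlobWriter<W> {
    fn new(inner: W) -> Self {
        Self { inner, offset: 0 }
    }

    /// Write a length-prefixed blob and return the offset it was stored at,
    /// which the caller keeps to retrieve the blob later.
    fn write_blob(&mut self, blob: &[u8]) -> io::Result<u64> {
        let start = self.offset;
        let len = blob.len() as u32;
        self.inner.write_all(&len.to_le_bytes())?;
        self.inner.write_all(blob)?;
        self.offset += 4 + blob.len() as u64;
        Ok(start)
    }
}
```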
Finally, this replaces the I/O code in image-, delta-, and in-memory
layers to use the new abstractions. These replace the 'bookfile'
crate.
This is a backwards-incompatible change to the storage format.
We have these methods for some time in the API, so mentioning them in the
spec could be useful for console (see zenithdb/console#867), as we generate
pageserver HTTP API golang client there.
It happened in unit tests. If a thread tries to read a buffer while
already holding a lock on one buffer, the code to find a victim buffer
to evict could try to evict the buffer that's already locked. To fix,
skip locked buffers.
* Add a test case for reading historic page versions
Test read_page_at_lsn returns correct results when compared to page inspect.
Validate the possibility of reading pages from a dropped relation.
Ensure functions read the latest version when a null LSN is supplied.
Check that functions do not poison buffer cache with stale page versions.
Safekeepers now publish to and pull from etcd per-timeline data. The immediate goal is
WAL truncation, for which every safekeeper must know remote_consistent_lsn; the
next would be callmemaybe replacement.
Adds corresponding '--broker' argument to safekeeper and ability to run etcd in
tests.
Adds test checking remote_consistent_lsn is indeed communicated.
workspace_hack is needed to avoid recompilation when different crates
inside the workspace depend on the same packages but with different
features being enabled. Problem occurs when you build crates separately
one by one. So this is irrelevant to our CI setup because there we build
all binaries at once, but it may be relevant for local development.
this also changes cargo's resolver version to 2
Follow-up for #1417. Previously we had a problem uploading to S3
due to a huge amount of existing, not-yet-uploaded data. Now we have a
fresh pageserver with LSM storage on staging, so we can try enabling it
once again.
This is a backwards-incompatible change. The new pageserver cannot
read repositories created with an old pageserver binary, or vice
versa.
Simplify Repository to a value-store
------------------------------------
Move the responsibility of tracking relation metadata, like which
relations exist and what are their sizes, from Repository to a new
module, pgdatadir_mapping.rs. The interface to Repository is now a
simple key-value PUT/GET operations.
It's still not any old key-value store though. A Repository is still
responsible from handling branching, and every GET operation comes
with an LSN.
Mapping from Postgres data directory to keys/values
---------------------------------------------------
All the data is now stored in the key-value store. The
'pgdatadir_mapping.rs' module handles mapping from PostgreSQL objects
like relation pages and SLRUs, to key-value pairs.
The key to the Repository key-value store is a Key struct, which
consists of a few integer fields. It's wide enough to store a full
RelFileNode, fork and block number, and to distinguish those from
metadata keys.
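For illustration, a key of that shape could look roughly like this; the exact field layout and widths are an assumption based on the description above.
```
// Wide enough for a full RelFileNode + fork + block number, with a leading
// field to distinguish relation data from metadata keys. Ordered, so the
// keyspace can be range-partitioned.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Key {
    field1: u8,  // key "class": relation data vs. metadata
    field2: u32, // spcnode
    field3: u32, // dbnode
    field4: u32, // relnode
    field5: u8,  // fork number
    field6: u32, // block number
}

fn rel_block_key(spcnode: u32, dbnode: u32, relnode: u32, forknum: u8, blknum: u32) -> Key {
    Key {
        field1: 0,
        field2: spcnode,
        field3: dbnode,
        field4: relnode,
        field5: forknum,
        field6: blknum,
    }
}
```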
'pgdatadir_mapping.rs' is also responsible for maintaining a
"partitioning" of the keyspace. Partitioning means splitting the
keyspace so that each partition holds a roughly equal number of keys.
The partitioning is used when new image layer files are created, so
that each image layer file is roughly the same size.
The partitioning is also responsible for reclaiming space used by
deleted keys. The Repository implementation doesn't have any explicit
support for deleting keys. Instead, the deleted keys are simply
omitted from the partitioning, and when a new image layer is created,
the omitted keys are not copied over to the new image layer. We might
want to implement tombstone keys in the future, to reclaim space
faster, but this will work for now.
Changes to low-level layer file code
------------------------------------
The concept of a "segment" is gone. Each layer file can now store an
arbitrary range of Keys.
Checkpointing, compaction
-------------------------
The background tasks are somewhat different now. Whenever
checkpoint_distance is reached, the WAL receiver thread "freezes" the
current in-memory layer, and creates a new one. This is a quick
operation and doesn't perform any I/O yet. It then launches a
background "layer flushing thread" to write the frozen layer to disk,
as a new L0 delta layer. This mechanism takes care of durability. It
replaces the checkpointing thread.
Compaction is a new background operation that takes a bunch of L0
delta layers, and reshuffles the data in them. It runs in a separate
compaction thread.
Deployment
----------
This also contains changes to the ansible scripts that enable having
multiple different pageservers running at the same time in the staging
environment. We will use that to keep an old version of the pageserver
running, for clusters created with the old version, at the same time
with a new pageserver with the new binary.
Author: Heikki Linnakangas
Author: Konstantin Knizhnik <knizhnik@zenith.tech>
Author: Andrey Taranik <andrey@zenith.tech>
Reviewed-by: Matthias Van De Meent <matthias@zenith.tech>
Reviewed-by: Bojan Serafimov <bojan@zenith.tech>
Reviewed-by: Konstantin Knizhnik <knizhnik@zenith.tech>
Reviewed-by: Anton Shyrabokau <antons@zenith.tech>
Reviewed-by: Dhammika Pathirana <dham@zenith.tech>
Reviewed-by: Kirill Bulatov <kirill@zenith.tech>
Reviewed-by: Anastasia Lubennikova <anastasia@zenith.tech>
Reviewed-by: Alexey Kondratov <alexey@zenith.tech>
More rows, and test with serial and parallel plans. But fewer iterations,
so that the tests run in < 1 minute, and we don't need to mark them as
"slow".
With a Mutex, only one thread could read from the layer at a time. I did
some ad hoc profiling with pgbench and saw that a fair amount of time was
spent blocked on these Mutexes.
It doesn't make much sense to compare TimelineMetadata structs with
< or >. But we depended on that in the remote storage upload code,
so replace BTreeSets with Vecs there.
* [proxy] Propagate most errors to user
This change enables propagation of most errors to the user
(e.g. auth and connectivity errors). Some of them will be
stripped of sensitive information.
As a side effect, most occurrences of `anyhow::Error` were
replaced with concrete error types.
* [proxy] Box weighty errors
Have a separate routine and http endpoint to create a timeline on safekeepers. It is
not used yet, i.e. the timeline is still created implicitly, but we'll change that
once the infrastructure for learning which timelines are assigned to which safekeepers
is ready, preventing accidental creation by compute.
Changes the format of the safekeeper control file, allowing it to store the set of
peers. Knowing the peers provides part of the foundation for peer
recovery (calculating min horizons like truncate_lsn for WAL truncation and
commit_lsn for sync-safekeepers replacement) and for proper membership changes;
again, we don't use it yet.
Employing the control file version bump, extract tenant_id and timeline_id to the top level,
where they are more suitable. Also add a bunch of LSNs there and rename
truncate_lsn to the more specific peer_horizon_lsn.
* Add --id argument to safekeeper setting its unique u64 id.
In preparation for storage node messaging. IDs are supposed to be monotonically
assigned by the console. In tests it is issued by ZenithEnv; at the zenith cli
level and fixtures, string name is completely replaced by integer id. Example
TOML configs are adjusted accordingly.
Sequential ids are chosen over Zid mainly because they are compact and easy to
type/remember.
* add node id to pageserver
This adds node id parameter to pageserver configuration. Also I use a
simple builder to construct pageserver config struct to avoid setting
node id to some temporary invalid value. Some of the changes in test
fixtures are needed to split init and start operations for the environment.
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
* new deployment flow for staging and production
* ansible playbooks and circleci config fixes
* cleanup before merge
* additional cleanup before merge
* debug deployment to staging env
* debug deployment to staging env
* debug deployment to staging env
* debug deployment to staging env
* debug deployment to staging env
* debug deployment to staging env
* binaries artifacts path fix for ansible playbooks
* deployment flow refactored
* base64 decode fix for ssh key
* fix for console notification and production deploy settings
* cleanup after deployment tests
* fix - trigger release binaries download for production deploy
When several AppendRequests can be read from the socket without blocking,
they are processed together and fsync() on the segment file is only called
once. The segment file is no longer opened for every write request; the
last opened file is now cached inside PhysicalStorage. A new metric for WAL
flushes was added to the storage, FLUSH_WAL_SECONDS. More errors were
added to storage for non-sequential WAL writes, now write_lsn can be
moved only with calls to truncate_lsn(new_lsn).
New messages have been added to ProposerAcceptorMessage enum. They
can't be deserialized directly and now are used only for optimizing
flushes. Existing protocol wasn't changed and flush will be called for
every AppendRequest, as it was before.
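A hedged sketch of the batching idea, with a plain channel standing in for the socket (names and types are illustrative): drain whatever is already available without blocking, then flush once for the whole batch.
```
use std::sync::mpsc::{Receiver, TryRecvError};

fn process_batch<T>(rx: &Receiver<T>, mut handle: impl FnMut(T), mut flush: impl FnMut()) {
    // Block for the first request...
    if let Ok(first) = rx.recv() {
        handle(first);
        // ...then take everything else that is already queued, without blocking.
        loop {
            match rx.try_recv() {
                Ok(msg) => handle(msg),
                Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => break,
            }
        }
        // One fsync() for the whole batch instead of one per AppendRequest.
        flush();
    }
}
```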
Since commit fdd987c3ad, it was only used in InMemoryLayers. Let's
just "inline" the code into InMemoryLayer itself.
I originally did this as part of a bigger PR (#1267). With that PR,
one in-memory layer, and one ephemeral file, would hold page versions
belonging to multiple segments. Currently, PageVersions can only hold
versions for a single segment, so that would need to be changed.
Rather than modify PageVersions to support that, just remove it
altogether.
These tests have intimate knowledge of the directory layout and layer
file names used by the LayeredRepository implementation of the
Repository trait. Move them, so that all the tests that remain in
repository.rs are expected to work without changes with any
implementation of Repository. Not that we have any plans to create
another Repository implementaiton any time soon, but as long as we
have the Repository interface, let's try to maintain that abstraction
in the tests too.
The test creates a page version with a string like "foo 123 at 0/10"
as the content. But the LSN stored in that string was wrong: the page
version stored at LSN 0/20 would say "foo <blk> at 0/10".
wal_storage.rs was split up from timeline.rs, safekeeper.rs and send_wal.rs,
and now contains all WAL related code from the safekeeper. Now there are
PhysicalStorage for persisting WAL to disk and WalReader for reading it.
This allows optimizing PhysicalStorage without affecting too much of other
code.
Also there is a separate structure for persisting control file now in
control_file.rs.
The previous version of the spec caused parsing errors in generated clients,
as the return type is an object, not an array; also, one field was missing. In
passing, set `format: hex` on ancestor_id too, as the value conforms to
that format.
This change makes most parts of the code asynchronous, except
for the `mgmt` subsystem (we're going to drop it anyway).
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
If a heap UPDATE record modified two pages, and both pages needed to have
their VM bits cleared, and the VM bits were located on the same VM page,
we would emit two ZenithWalRecord::ClearVisibilityMapFlags records for
the same VM page. That produced warnings like this in the pageserver log:
Page version Wal(ClearVisibilityMapFlags { heap_blkno: 18, flags: 3 }) of rel 1663/13949/2619_vm blk 0 at 2A/346046A0 already exists
To fix, change ClearVisibilityMapFlags so that it can update the bits
for both pages as one operation.
This was already covered by several python tests, so no need to add a
new one. Fixes #1125.
Co-authored-by: Konstantin Knizhnik <knizhnik@zenith.tech>
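A hedged sketch of the reshaped record (field names are an assumption based on the description, not necessarily the exact ones): one record can carry the VM bits to clear for both the old and the new heap page.
```
enum ZenithWalRecord {
    ClearVisibilityMapFlags {
        new_heap_blkno: Option<u32>,
        old_heap_blkno: Option<u32>,
        flags: u8,
    },
}

// Both heap block numbers map to the same VM page, so a single record (and a
// single pass over that VM page) clears both bits.
fn heap_blocks_to_clear(rec: &ZenithWalRecord) -> Vec<u32> {
    match rec {
        ZenithWalRecord::ClearVisibilityMapFlags {
            new_heap_blkno,
            old_heap_blkno,
            ..
        } => new_heap_blkno
            .iter()
            .chain(old_heap_blkno.iter())
            .copied()
            .collect(),
    }
}
```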
It was printing a lot of stuff to the log with INFO level, for routine
things like receiving or sending messages. Reduce the noise. The amount
of logging was excessive, and it was also consuming a fair amount of CPU
(about 20% of safekeeper's CPU usage in a little test I ran).
* Always initialize flush_lsn/commit_lsn metrics on a specific timeline, no more `n/a`
* Update flush_lsn metrics missing from cba4da3f4d
* Ensure that flush_lsn found on load is >= than both commit_lsn and truncate_lsn
* Add some debug logging
Use GUC zenith.max_cluster_size to set the limit.
If limit is reached, extend requests will throw out-of-space error.
When current size is too close to the limit - throw a warning.
Add new test: test_timeline_size_quota.
Use log::error!() instead. I spotted a few of these "connection error"
lines in the logs, without timestamps and the other stuff we print for
all other log messages.
Timeline is active whenever there is at least 1 connection from compute or
pageserver is not caught up. Currently 'active' means callmemaybes are being
sent.
Fixes race: now suspend condition checking and callmemaybe unsubscribe happen
under the same lock.
Now it's possible to call Fe{Startup,}Message in both
sync and async contexts, which is good for proxy.
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
* Fix checkpoint.nextXid update
* Add test for checkpoint.nextXid
* Fix indentation of test_next_xid.py
* Fix mypy error in test_next_xid.py
* Tidy up the test case.
* Add a unit test
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Freeze vectors at the same end LSN
* Fix calculation of last LSN for inmem layer
* Do not advance disk_consistent_lsn if no open layer was evicted
* Fix calculation of freeze_end_lsn
* Let start_lsn be larger than oldest_pending_lsn
* Rename 'oldest_pending_lsn' and 'last_lsn', add comments.
* Fix future_layerfiles test
* Update comments concerning oldest_lsn
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
Now zenith_cli handles the wal_acceptors config internally, and if we
append wal_acceptors to postgresql.conf in python tests, it will contain
a duplicate wal_acceptors config.
Currently ztimelineids are unique, but all APIs accept the pair, so let's keep
it everywhere for uniformity.
Carry around ZTTId containing both ZTenantId and ZTimelineId for simplicity.
(existing clusters on staging ought to be preprocessed for that)
* Reproduce github issue #1047.
* Use RwLock to protect gc_cutoff_lsn
* Reduce number of updates in test_gc_aggressive
* Change test_prohibit_get_page_at_lsn_for_garbage_collected_pages test
* Change test_prohibit_get_page_at_lsn_for_garbage_collected_pages
* Lock latest_gc_cutoff_lsn in all operations accessing storage to prevent race conditions with GC
* Remove random sleep between wait_for_lsn and get_page_at_lsn
* Initialize latest_gc_cutoff with initdb_lsn and remove separate check that lsn >= initdb_lsn
* Update test_prohibit_branch_creation_on_pre_initdb_lsn test
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
to pass current_timeline_size to compute node
Put standby_status_update fields into ZenithFeedback and send them as one message.
Pass value sizes together with keys in the ZenithFeedback message.
This patch includes attach/detach http endpoints in pageservers, some
changes in callmemaybe handling inside the safekeeper, and an integration
test to check migration with and without load. There are still some
rough edges that will be addressed in follow-up patches.
Mainly because it has better support for installing the packages from
different python versions.
It also has a better dependency resolver than Pipenv and supports the
modern standard for python dependency management. This includes usage of
pyproject.toml for project-specific configuration instead of per-tool
config files. See the following links for details:
https://pip.pypa.io/en/stable/reference/build-system/pyproject-toml/
https://www.python.org/dev/peps/pep-0518/
* Do not delete layers beyond cutoff LSN
* Update pageserver/src/layered_repository/layer_map.rs
Co-authored-by: Heikki Linnakangas <heikki.linnakangas@iki.fi>
This introduces a new module to handle thread creation and shutdown.
All page server threads are now registered in a global hash map, and
there's a function to request individual threads to shut down gracefully.
Thread shutdown request is signalled to the thread with a flag, as well
as a Future that can be used to wake up async operations if shutdown is
requested. Use that facility to have the libpq listener thread respond
to pageserver shutdown, based on Kirill's earlier prototype
(https://github.com/zenithdb/zenith/pull/1088). That addresses
https://github.com/zenithdb/zenith/issues/1036: previously, the libpq
listener thread would not exit until one more connection arrived.
This also eliminates a resource leak in the accept() loop. Previously,
we added the JoinHandle of each new thread to a vector, but old handles
for threads that had already exited were never removed.
Log the error and continue. Hopefully it's a transient failure.
This might have been happening in staging earlier, when the safekeeper
had a problem where it opened connections very frequently to issue
"callmemaybe" commands. If you launch too many threads too fast, you might
run out of file descriptors or something. It's not totally clear what
happened, but with this commit, at least the page server will continue to run
and accept new connections, if a transient error happens.
'anyhow' crate can include a backtrace in all errors, when the
'backtrace' feature is enabled. Enable it, and change the places that used
'{:#}' or '{}' to '{:?}', so that the backtrace is printed.
A timeline ID is only guaranteed to be unique for a particular tenant,
so you need to use tenant ID + timeline ID as the key, rather than just
timeline ID.
The safekeeper currently makes the same assumption, and we should fix that
too, but this commit just addresses this one case in the page server.
In passing, reorder some function arguments to be more consistent.
The walkeeper launches two threads for each connection and uses a guard
object to remove the entry from the 'replicas' array when it finishes.
But only the background thread held onto the guard object, so if the
background thread finished before the other thread, the array entry would
be removed prematurely, which led to a panic in the check_stop_streaming()
call.
Fixes https://github.com/zenithdb/zenith/issues/1103
Hex-encode zids there for better output; since Serde doesn't support several
formats for one struct, the on-disk representation is changed as well, and
upgrade.rs is made to cope with it.
to avoid a subtle race condition.
Without a safekeeper, walreceiver reconnection can get stuck
because of an IO deadlock between walsender auth and a regular backend.
* Do not hold timelines lock during GC
refer #1087
* Add gc_cs mutex for preventing creation of new timelines during GC
* Make clippy happy
* Use Mutex<()> instead of Mutex<i32> for GC critical section
**NB: this PR must be merged only by 'Create a merge commit'!**
### Checklist when preparing for release
- [ ] Read or refresh [the release flow guide](https://github.com/neondatabase/cloud/wiki/Release:-general-flow)
- [ ] Ask in the [cloud Slack channel](https://neondb.slack.com/archives/C033A2WE6BZ) that you are going to rollout the release. Any blockers?
- [ ] Does this release contain any db migrations? Destructive ones? What is the rollback plan?
<!-- List everything that should be done **before** release, any issues / setting changes / etc -->
### Checklist after release
- [ ] Make sure instructions from PRs included in this release and labeled `manual_release_instructions` are executed (either by you or by people who wrote them).
- [ ] Based on the merged commits write release notes and open a PR into `website` repo ([example](https://github.com/neondatabase/website/pull/219/files))
# This file is only read when `yapf` is run from this directory.
# Hence we only list top-level directories here to avoid confusion.
# See source code for the exact file format: https://github.com/google/yapf/blob/c6077954245bc3add82dafd853a1c7305a6ebd20/yapf/yapflib/file_resources.py#L40-L43
storage_broker = { version = "0.1", path = "./storage_broker/" } # Note: main broker code is inside the binary crate, so linking with the library shouldn't be heavy.
# Neon
Neon is a serverless open-source alternative to AWS Aurora Postgres. It separates storage and compute and substitutes the PostgreSQL storage layer by redistributing data across a cluster of nodes.
## Quick start
Try the [Neon Free Tier](https://neon.tech/docs/introduction/technical-preview-free-tier/) to create a serverless Postgres instance. Then connect to it with your preferred Postgres client (psql, dbeaver, etc) or use the online [SQL Editor](https://neon.tech/docs/get-started-with-neon/query-with-neon-sql-editor/). See [Connect from any application](https://neon.tech/docs/connect/connect-from-any-app/) for connection instructions.
Alternatively, compile and run the project [locally](#running-local-installation).
## Architecture overview
A Neon installation consists of compute nodes and the Neon storage engine. Compute nodes are stateless PostgreSQL nodes backed by the Neon storage engine.
The Neon storage engine consists of two major components:
- Pageserver. Scalable storage backend for the compute nodes.
- Safekeepers. The safekeepers form a redundant WAL service that receives WAL from the compute node and stores it durably until it has been processed by the pageserver and uploaded to cloud storage.
Pageserver consists of:
- Repository - Neon storage implementation.
- WAL receiver - service that receives WAL from WAL service and stores it in the repository.
- Page service - service that communicates with compute nodes and responds with pages from the repository.
- WAL redo - service that builds pages from base images and WAL records on Page service request.
See developer documentation in [SUMMARY.md](/docs/SUMMARY.md) for more information.
## Running local installation
#### Installing dependencies on Linux
1. Install build dependencies and other applicable packages
* On Ubuntu or Debian, this set of packages should be sufficient to build the code:
Building Neon requires 3.15+ version of `protoc` (protobuf-compiler). If your distribution provides an older version, you can install a newer version from [here](https://github.com/protocolbuffers/protobuf/releases).
To run the `psql` client, install the `postgresql-client` package or modify `PATH` and `LD_LIBRARY_PATH` to include `tmp_install/bin` and `tmp_install/lib`, respectively.
2. Install Rust
```
# recommended approach from https://www.rust-lang.org/tools/install
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```
3. Install PostgreSQL Client
```
# from https://stackoverflow.com/questions/44654216/correct-way-to-install-psql-without-full-postgres-on-macos
brew install libpq
brew link --force libpq
```
#### Rustc version
The project uses [rust toolchain file](./rust-toolchain.toml) to define the version it's built with in CI for testing and local builds.
This file is automatically picked up by [`rustup`](https://rust-lang.github.io/rustup/overrides.html#the-toolchain-file) that installs (if absent) and uses the toolchain version pinned in the file.
rustup users who want to build with another toolchain can use [`rustup override`](https://rust-lang.github.io/rustup/overrides.html#directory-overrides) command to set a specific toolchain for the project's directory.
Non-rustup users most probably will not get the same toolchain automatically from the file, so they are responsible for manually verifying that their toolchain matches the version in the file.
Newer rustc versions will most probably work fine, yet older ones might not be supported due to some new features used by the project or the crates.
#### Building on Linux
1. Build neon and patched postgres
```
# Note: The path to the neon sources cannot contain a space.
# The preferred and default is to make a debug build. This will create a
# demonstrably slower build than a release build. For a release build,
# use "BUILD_TYPE=release make -j`nproc` -s"
# Remove -s for the verbose build log
make -j`nproc` -s
```
#### Dependency installation notes
To run the `psql` client, install the `postgresql-client` package or modify `PATH` and `LD_LIBRARY_PATH` to include `pg_install/bin` and `pg_install/lib`, respectively.
To run the integration tests or Python scripts (not required to use the code), install
Python (3.9 or higher), and install python3 packages using `./scripts/pysync` (requires [poetry>=1.3](https://python-poetry.org/)) in the project directory.
[docs](/docs) contains a top-level overview of all available markdown documentation.
- [sourcetree.md](/docs/sourcetree.md) contains an overview of the source tree layout.
To view your `rustdoc` documentation in a browser, try running `cargo doc --no-deps --open`
See also README files in some source directories, and `rustdoc` style documentation comments.
Other resources:
- [SELECT 'Hello, World'](https://neon.tech/blog/hello-world/): Blog post by Nikita Shamgunov on the high level architecture
- [Architecture decisions in Neon](https://neon.tech/blog/architecture-decisions-in-neon/): Blog post by Heikki Linnakangas
- [Neon: Serverless PostgreSQL!](https://www.youtube.com/watch?v=rES0yzeERns): Presentation on storage system by Heikki Linnakangas in the CMU Database Group seminar series
### Postgres-specific terms
Due to Neon's very close relation with PostgreSQL internals, numerous specific terms are used.
The same applies to certain spelling: e.g. we use MB to denote 1024 * 1024 bytes; while MiB would be technically more correct, it would be inconsistent with what PostgreSQL code and its documentation use.
~/git-sandbox/zenith (cli-v2)$ ./target/debug/cli start main
Creating data directory from snapshot at 0/15FFB08...
waiting for server to start....2021-04-13 09:27:43.919 EEST [984664] LOG: starting PostgreSQL 14devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2021-04-13 09:27:43.920 EEST [984664] LOG: listening on IPv6 address "::1", port 5432
2021-04-13 09:27:43.920 EEST [984664] LOG: listening on IPv4 address "127.0.0.1", port 5432
2021-04-13 09:27:43.927 EEST [984664] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2021-04-13 09:27:43.939 EEST [984665] LOG: database system was interrupted; last known up at 2021-04-13 09:27:33 EEST
2021-04-13 09:27:43.939 EEST [984665] LOG: creating missing WAL directory "pg_wal/archive_status"
2021-04-13 09:27:44.189 EEST [984665] LOG: database system was not properly shut down; automatic recovery in progress
2021-04-13 09:27:44.195 EEST [984665] LOG: invalid record length at 0/15FFB80: wanted 24, got 0
2021-04-13 09:27:44.195 EEST [984665] LOG: redo is not required
2021-04-13 09:27:44.225 EEST [984664] LOG: database system is ready to accept connections
Creating data directory from snapshot at 0/15FFB08...
waiting for server to start....2021-04-13 09:28:41.874 EEST [984766] LOG: starting PostgreSQL 14devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2021-04-13 09:28:41.875 EEST [984766] LOG: listening on IPv6 address "::1", port 5433
2021-04-13 09:28:41.875 EEST [984766] LOG: listening on IPv4 address "127.0.0.1", port 5433
2021-04-13 09:28:41.883 EEST [984766] LOG: listening on Unix socket "/tmp/.s.PGSQL.5433"
2021-04-13 09:28:41.896 EEST [984767] LOG: database system was interrupted; last known up at 2021-04-13 09:27:33 EEST
2021-04-13 09:28:42.265 EEST [984767] LOG: database system was not properly shut down; automatic recovery in progress
2021-04-13 09:28:42.269 EEST [984767] LOG: redo starts at 0/15FFB80
2021-04-13 09:28:42.272 EEST [984767] LOG: invalid record length at 0/161F4B0: wanted 24, got 0
2021-04-13 09:28:42.272 EEST [984767] LOG: redo done at 0/161F478 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2021-04-13 09:28:42.321 EEST [984766] LOG: database system is ready to accept connections
done
server started
Insert a row on the 'experimental' branch:
~/git-sandbox/zenith (cli-v2)$ psql postgres -p5433 -c "select * from foo"
t
-----------------------------
inserted on the main branch
(1 row)
~/git-sandbox/zenith (cli-v2)$ psql postgres -p5433 -c "insert into foo values ('inserted on experimental')"
INSERT 0 1
~/git-sandbox/zenith (cli-v2)$ psql postgres -p5433 -c "select * from foo"
t
-----------------------------
inserted on the main branch
inserted on experimental
(2 rows)
See that the other Postgres instance is still running on 'main' branch on port 5432:
~/git-sandbox/zenith (cli-v2)$ psql postgres -p5432 -c "select * from foo"
t
-----------------------------
inserted on the main branch
(1 row)
Everything is stored in the .zenith directory:
~/git-sandbox/zenith (cli-v2)$ ls -l .zenith/
total 12
drwxr-xr-x 4 heikki heikki 4096 Apr 13 09:28 datadirs
drwxr-xr-x 4 heikki heikki 4096 Apr 13 09:27 refs
drwxr-xr-x 4 heikki heikki 4096 Apr 13 09:28 timelines
The 'datadirs' directory contains the datadirs of the running instances:
~/git-sandbox/zenith (cli-v2)$ ls -l .zenith/datadirs/
total 8
drwx------ 18 heikki heikki 4096 Apr 13 09:27 3c0c634c1674079b2c6d4edf7c91523e
drwx------ 18 heikki heikki 4096 Apr 13 09:28 697e3c103d4b1763cd6e82e4ff361d76
~/git-sandbox/zenith (cli-v2)$ ls -l .zenith/datadirs/3c0c634c1674079b2c6d4edf7c91523e/
total 124
drwxr-xr-x 5 heikki heikki 4096 Apr 13 09:27 base
drwxr-xr-x 2 heikki heikki 4096 Apr 13 09:27 global
drwxr-xr-x 2 heikki heikki 4096 Apr 13 09:27 pg_commit_ts
drwxr-xr-x 2 heikki heikki 4096 Apr 13 09:27 pg_dynshmem
-rw------- 1 heikki heikki 4760 Apr 13 09:27 pg_hba.conf
-rw------- 1 heikki heikki 1636 Apr 13 09:27 pg_ident.conf
drwxr-xr-x 4 heikki heikki 4096 Apr 13 09:32 pg_logical
drwxr-xr-x 4 heikki heikki 4096 Apr 13 09:27 pg_multixact
drwxr-xr-x 2 heikki heikki 4096 Apr 13 09:27 pg_notify
drwxr-xr-x 2 heikki heikki 4096 Apr 13 09:27 pg_replslot
drwxr-xr-x 2 heikki heikki 4096 Apr 13 09:27 pg_serial
drwxr-xr-x 2 heikki heikki 4096 Apr 13 09:27 pg_snapshots
drwxr-xr-x 2 heikki heikki 4096 Apr 13 09:27 pg_stat
drwxr-xr-x 2 heikki heikki 4096 Apr 13 09:34 pg_stat_tmp
drwxr-xr-x 2 heikki heikki 4096 Apr 13 09:27 pg_subtrans
drwxr-xr-x 2 heikki heikki 4096 Apr 13 09:27 pg_tblspc
drwxr-xr-x 2 heikki heikki 4096 Apr 13 09:27 pg_twophase
-rw------- 1 heikki heikki 3 Apr 13 09:27 PG_VERSION
lrwxrwxrwx 1 heikki heikki 52 Apr 13 09:27 pg_wal -> ../../timelines/3c0c634c1674079b2c6d4edf7c91523e/wal
drwxr-xr-x 2 heikki heikki 4096 Apr 13 09:27 pg_xact
-rw------- 1 heikki heikki 88 Apr 13 09:27 postgresql.auto.conf
-rw------- 1 heikki heikki 28688 Apr 13 09:27 postgresql.conf
-rw------- 1 heikki heikki 96 Apr 13 09:27 postmaster.opts
-rw------- 1 heikki heikki 149 Apr 13 09:27 postmaster.pid
Note how 'pg_wal' is just a symlink to the 'timelines' directory. The
datadir is ephemeral, you can delete it at any time, and it can be reconstructed
from the snapshots and WAL stored in the 'timelines' directory. So if you push/pull
the repository, the 'datadirs' are not included. (They are like git working trees)
Creating data directory from snapshot at 0/15FFB08...
waiting for server to start....2021-04-13 09:37:05.476 EEST [985340] LOG: starting PostgreSQL 14devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2021-04-13 09:37:05.477 EEST [985340] LOG: listening on IPv6 address "::1", port 5433
2021-04-13 09:37:05.477 EEST [985340] LOG: listening on IPv4 address "127.0.0.1", port 5433
2021-04-13 09:37:05.487 EEST [985340] LOG: listening on Unix socket "/tmp/.s.PGSQL.5433"
2021-04-13 09:37:05.498 EEST [985341] LOG: database system was interrupted; last known up at 2021-04-13 09:27:33 EEST
2021-04-13 09:37:05.808 EEST [985341] LOG: database system was not properly shut down; automatic recovery in progress
2021-04-13 09:37:05.813 EEST [985341] LOG: redo starts at 0/15FFB80
2021-04-13 09:37:05.815 EEST [985341] LOG: invalid record length at 0/161F770: wanted 24, got 0
2021-04-13 09:37:05.815 EEST [985341] LOG: redo done at 0/161F738 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2021-04-13 09:37:05.866 EEST [985340] LOG: database system is ready to accept connections
done
server started
~/git-sandbox/zenith (cli-v2)$ psql postgres -p5433 -c "select * from foo"
We use JWT tokens in communication between almost all components (compute, pageserver, safekeeper, CLI) regardless of the protocol used (HTTP/PostgreSQL).
storage_broker currently has no authentication.
Authentication is optional and is disabled by default for easier debugging.
It is used in some tests, though.
Note that we do not cover authentication with `pg.neon.tech` here.
For HTTP connections we use the Bearer authentication scheme.
For PostgreSQL connections we expect the token to be passed as a password.
There is a caveat for `psql`: it silently truncates passwords to 100 symbols, so to correctly pass JWT via `psql` you have to either use `PGPASSWORD` environment variable, or store password in `psql`'s config file.
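As a hedged illustration of the password-based flow (everything below is a placeholder sketch, not a documented endpoint or user), a client library that reads the token from an environment variable sidesteps the `psql` truncation caveat entirely:
```python
# Illustration only: host, port, user and dbname are placeholders.
import os

import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=6400,                               # assumed pageserver libpq port
    user="some_user",                        # placeholder
    dbname="postgres",                       # placeholder
    password=os.environ["NEON_AUTH_TOKEN"],  # the JWT goes in the password field
)
```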
Current token scopes are described in `utils::auth::Scope`.
There are no expiration or rotation schemes.
_TODO_: some scopes allow both access to server management API and to the data.
These probably should be split into multiple scopes.
Tokens should not occur in logs.
They may sometimes occur in configuration files, although this is discouraged
because configs may be parsed and dumped into logs.
#### Tokens generation and validation
JWT tokens are signed using a private key.
Compute/pageserver/safekeeper use the private key's public counterpart to validate JWT tokens.
These components should not have access to the private key and may only get tokens from their configuration or external clients.
The key pair is generated once for an installation of compute/pageserver/safekeeper, e.g. by `neon_local init`.
There is currently no way to rotate the key without bringing down all components.
### Best practices
See [RFC 8725: JSON Web Token Best Current Practices](https://www.rfc-editor.org/rfc/rfc8725)
### Token format
The JWT tokens in Neon use "EdDSA" as the algorithm (defined in [RFC8037](https://www.rfc-editor.org/rfc/rfc8037)).
Example:
Header:
```
{
"alg": "EdDSA",
"typ": "JWT"
}
```
Payload:
```
{
"scope": "tenant", # "tenant", "pageserverapi", or "safekeeperdata"
"tenant_id": "5204921ff44f09de8094a1390a6a50f6",
}
```
Meanings of scope:
"tenant": Provides access to all data for a specific tenant
"pageserverapi": Provides blanket access to all tenants on the pageserver plus pageserver-wide APIs.
Should only be used e.g. for status check/tenant creation/list.
"safekeeperdata": Provides blanket access to all data on the safekeeper plus safekeeper-wide APIs.
Should only be used e.g. for status check.
Currently also used for connection from any pageserver to any safekeeper.
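As a hedged illustration of how such a token could be minted (assuming the PyJWT library with EdDSA support via the `cryptography` package; the key path below is a placeholder, not a value from any real deployment):
```python
import jwt  # PyJWT

# Placeholder path to the Ed25519 private key in PEM form.
with open("auth_private_key.pem") as f:
    private_key_pem = f.read()

# Mint a tenant-scoped token matching the example payload above.
token = jwt.encode(
    {"scope": "tenant", "tenant_id": "5204921ff44f09de8094a1390a6a50f6"},
    private_key_pem,
    algorithm="EdDSA",
)
print(token)
```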
### CLI
CLI generates a key pair during the call to `neon_local init` using `openssl`.
Configuration files for all components point to `public_key.pem` for JWT validation.
However, authentication is disabled by default.
There is no way to automatically enable it everywhere, you have to configure each component individually.
CLI also generates signed token (full access to Pageserver) and saves it in
the CLI's `config` file under `pageserver.auth_token`.
Note that pageserver's config does not have any similar parameter.
CLI is the only component which accesses that token.
Technically it could generate it from the private key on each run,
but it does not do that for some reason (_TODO_).
### Compute
#### Overview
Compute is a per-timeline PostgreSQL instance, so it should not have
any access to data of other tenants.
All tokens used by a compute are restricted to a specific tenant.
There is no auth isolation from other timelines of the same tenant,
but a non-rogue client never accesses another timeline even by accident:
timeline IDs are random and hard to guess.
#### Incoming connections
All incoming connections are from PostgreSQL clients.
Their authentication is just plain PostgreSQL authentication and out of scope for this document.
There is no administrative API except those provided by PostgreSQL.
#### Outgoing connections
Compute connects to Pageserver for getting pages. The connection string is
configured by the `neon.pageserver_connstring` PostgreSQL GUC,
e.g. `postgresql://no_user@localhost:15028`. If the `$NEON_AUTH_TOKEN`
environment variable is set, it is used as the password for the connection. (The
pageserver uses JWT tokens for authentication, so the password is really a
token.)
Compute connects to Safekeepers to write and commit data. The list of safekeeper
addresses is given in the `neon.safekeepers` GUC. The connections to the
safekeepers take the password from the `$NEON_AUTH_TOKEN` environment
variable, if set.
The `compute_ctl` binary that runs before the PostgreSQL server, and launches
PostgreSQL, also makes a connection to the pageserver. It uses it to fetch the
initial "base backup" dump, to initialize the PostgreSQL data directory. It also
uses `$NEON_AUTH_TOKEN` as the password for the connection.
### Pageserver
#### Overview
Pageserver keeps track of multiple tenants, each having multiple timelines.
For each timeline, it connects to the corresponding Safekeeper.
Information about "corresponding Safekeeper" is published by Safekeepers
in the storage_broker, but they do not publish access tokens, otherwise what is
the point of authentication.
Pageserver keeps a connection to some set of Safekeepers, which
may or may not correspond to active Computes.
Hence, we cannot obtain a per-timeline access token from a Compute.
E.g. if the timeline's Compute terminates before all WAL is
consumed by the Pageserver, the Pageserver continues consuming WAL.
Pageserver replicas' authentication is the same as the main's.
#### Incoming connections
Pageserver listens for connections from computes.
Each compute should present a token valid for the timeline's tenant.
Pageserver also has HTTP API: some parts are per-tenant,
some parts are server-wide, these are different scopes.
Authentication can be enabled separately for the HTTP mgmt API, and
for the libpq connections from compute. The `http_auth_type` and
`pg_auth_type` configuration variables in Pageserver's config may
have one of these values:
* `Trust` removes all authentication.
* `NeonJWT` enables JWT validation.
Tokens are validated using the public key which lies in a PEM file
specified in the `auth_validation_public_key_path` config.
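A hedged sketch of the check that validation conceptually performs, written in Python purely for illustration (the actual validation in the pageserver and safekeeper is Rust code):
```python
import jwt  # PyJWT, with EdDSA support via the "cryptography" package


def check_tenant_access(token: str, public_key_pem: str, requested_tenant_id: str) -> None:
    """Verify the signature with the public key and enforce the tenant scope."""
    claims = jwt.decode(token, public_key_pem, algorithms=["EdDSA"])
    if claims["scope"] == "tenant" and claims["tenant_id"] != requested_tenant_id:
        raise PermissionError("token is not valid for this tenant")
```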
#### Outgoing connections
Pageserver makes a connection to a Safekeeper for each active timeline.
As Pageserver may want to access any timeline it has on the disk,
it is given a blanket JWT token to access any data on any Safekeeper.
This token is passed through an environment variable called `NEON_AUTH_TOKEN`
(non-configurable as of writing this text).
A better way _may be_ to store a JWT token for each timeline next to it,
but maybe not.
### Safekeeper
#### Overview
Safekeeper keeps track of multiple tenants, each having multiple timelines.
#### Incoming connections
Safekeeper accepts connections from Compute/Pageserver, each
connection corresponds to a specific timeline and requires
a corresponding JWT token.
Safekeeper also has HTTP API: some parts are per-tenant,
some parts are server-wide, these are different scopes.
The `auth-validation-public-key-path` command line option controls
the authentication mode:
* If the option is missing, there is no authentication or JWT token validation.
* If the option is present, it should be a path to the public key PEM file used for JWT token validation.
#### Outgoing connections
No connections are initiated by a Safekeeper.
### In the source code
Tests do not use authentication by default.
If you need it, you can enable it by configuring the test's environment:
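For example (a sketch under assumed names: the `neon_env_builder` fixture and its `auth_enabled` attribute follow the test suite's conventions, but treat the exact API as an assumption):
```python
# Hypothetical test sketch; fixture and method names are assumptions.
def test_something_with_auth(neon_env_builder):
    neon_env_builder.auth_enabled = True
    env = neon_env_builder.init_start()
    # ... the rest of the test now runs against auth-enabled pageserver and safekeepers
```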
# Postgres core changes
This lists all the changes that have been made to the PostgreSQL
source tree, as a somewhat logical set of patches. The long-term goal
is to eliminate all these changes, by submitting patches to upstream
and refactoring code into extensions, so that you can run unmodified
PostgreSQL against Neon storage.
In Neon, we run PostgreSQL in the compute nodes, but we also run a special WAL redo process in the
page server. We currently use the same binary for both, with --wal-redo runtime flag to launch it in
the WAL redo mode. Some PostgreSQL changes are needed in the compute node, while others are just for
the WAL redo process.
2. Add PD_WAL_LOGGED.
- Why?
Postgres sometimes writes data to the page before it is WAL-logged. If such a page is swapped out, we will lose this change. The problem is currently solved by setting the PD_WAL_LOGGED bit in the page header. When a page without this bit set is written to the SMGR, it is forced to be written to the WAL as an FPI using the log_newpage_copy() function.
In addition to core PostgreSQL changes, there is a Neon extension in contrib/neon, to hook into the
smgr interface. Once all the core changes have been submitted to upstream or eliminated some other
way, the extension could live outside the postgres repository and build against vanilla PostgreSQL.
There was a wrong assumption that this can happen only during construction of some exotic indexes (like GiST). That is not true: the same situation can happen with COPY, VACUUM, and when record hint bits are set.
Below is a list of all the PostgreSQL source code changes, categorized into changes needed for
compute, and changes needed for the WAL redo process:
An alternative is to not store this flag in the page header, but to associate the bit with the shared buffer. Logically that is more correct, but in practice we would gain no advantages, neither in space nor in CPU overhead.
## Add t_cid to heap WAL records
```
src/backend/access/heap/heapam.c | 26 +-
src/include/access/heapam_xlog.h | 6 +-
```
We have added a new t_cid field to heap WAL records. This changes the WAL record format, making Neon WAL format incompatible with vanilla PostgreSQL!
### Problem we're trying to solve
The problem is that the XLOG_HEAP_INSERT record does not include the command id of the inserted row. And same with deletion/update. So in the primary, a row is inserted with current xmin + cmin. But in the replica, the cmin is always set to 1. That works in PostgreSQL, because the command id is only relevant to the inserting transaction itself. After commit/abort, no one cares about it anymore. But with Neon, we rely on WAL replay to reconstruct the page, even while the original transaction is still running.
### How to get rid of the patch
Bite the bullet and submit the patch to PostgreSQL, to add the t_cid to the WAL records. It makes the WAL records larger, which could make this unpopular in the PostgreSQL community. However, it might simplify some logical decoding code; Andres Freund briefly mentioned in PGCon 2022 discussion on Heikki's Neon presentation that logical decoding currently needs to jump through some hoops to reconstruct the same information.
3. XLogReadBufferForRedo does not always load and pin the requested buffer, so we need to add extra checks that the buffer is really pinned. Also, do not use BufferGetBlockNumber on a buffer returned by XLogReadBufferForRedo.
- Why?
XLogReadBufferForRedo does not pin pages that were not requested by wal-redo. This is specific to the wal-redo Postgres.
### Alternatives
Perhaps we could write an extra WAL record with the t_cid information, when a page is evicted that contains rows that were touched by a transaction that's still running. However, that seems very complicated.
- [neondatabase/neon](https://hub.docker.com/repository/docker/neondatabase/neon) — image with pre-built `pageserver`, `safekeeper` and `proxy` binaries and all the required runtime dependencies. Built from [/Dockerfile](/Dockerfile).
- [neondatabase/compute-node](https://hub.docker.com/repository/docker/neondatabase/compute-node) — compute node image with pre-built Postgres binaries from [neondatabase/postgres](https://github.com/neondatabase/postgres).
And two intermediate images used either to reduce build time or to deliver some additional binary tools from other repos:
- [zenithdb/build](https://hub.docker.com/repository/docker/zenithdb/build) — image with all the dependencies required to build Zenith and compute node images. This image is based on `rust:slim-buster`, so it also has a proper `rust` environment. Built from [/Dockerfile.build](/Dockerfile.build).
1. Image `zenithdb/compute-tools` is re-built automatically.
We build all images after a successful `release` tests run and push automatically to Docker Hub with two parallel CI jobs
2. Image `zenithdb/build` is built manually. If you want to introduce any new compile time dependencies to Zenith or compute node you have to update this image as well, build it and push to Docker Hub.
1. `neondatabase/compute-tools` and `neondatabase/compute-node`
You can see a [docker compose](https://docs.docker.com/compose/) example to create a neon cluster in [/docker-compose/docker-compose.yml](/docker-compose/docker-compose.yml). It creates the following containers.
- pageserver x 1
- safekeeper x 3
- storage_broker x 1
- compute x 1
- MinIO x 1 # This is Amazon S3 compatible object storage
### How to use
1. create containers
You can specify the version of the neon cluster using the following environment variables.
- PG_VERSION: postgres version for compute (default is 14)
- TAG: the tag version of [docker image](https://registry.hub.docker.com/r/neondatabase/neon/tags) (default is latest), which is tagged in [CI test](/.github/workflows/build_and_test.yml)
```
$ cd docker-compose/
$ docker-compose down # remove the containers if they exist
$ PG_VERSION=15 TAG=2937 docker-compose up --build -d # You can specify the postgres and image version
Creating network "dockercompose_default" with the default driver
Backpressure is used to limit the lag between the pageserver and the compute node or WAL service.
If the compute node or WAL service runs too far ahead of the pageserver,
the time to serve page requests increases. This may lead to timeout errors.
To tune the backpressure limits, use the `max_replication_write_lag`, `max_replication_flush_lag` and `max_replication_apply_lag` settings.
When the lag between the current LSN (pg_current_wal_flush_lsn() on the compute node) and the minimal write/flush/apply position of the replica exceeds the limit,
backends performing writes are blocked until the replica catches up.
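As a hedged illustration of watching these values from a compute connection (the connection string is a placeholder; the function and setting names are the ones mentioned above):
```python
import psycopg2

# Placeholder connection string; point it at a Neon compute node.
with psycopg2.connect("dbname=postgres") as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT pg_current_wal_flush_lsn()")
        print("current flush LSN:", cur.fetchone()[0])
        cur.execute("SHOW max_replication_write_lag")
        print("write lag limit:", cur.fetchone()[0])
```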
### Base image (page image)
### Basebackup
NOTE: It has nothing to do with PostgreSQL pg_basebackup.
### Branch
We can create a branch at a certain LSN using the `neon_local timeline branch` command.
Each Branch lives in a corresponding timeline and has an ancestor.
NOTE: This is an overloaded term.
A checkpoint record in the WAL marks a point in the WAL sequence at which it is guaranteed that all data files have been updated with all information from shared memory modified before that checkpoint;
### Checkpoint (Layered repository)
NOTE: This is an overloaded term.
Whenever enough WAL has been accumulated in memory, the page server
writes out the changes from the in-memory layer into a new delta layer file. This process
is called "checkpointing".
Configuration parameter `checkpoint_distance` defines the distance
from current LSN to perform checkpoint of in-memory layers.
Default is `DEFAULT_CHECKPOINT_DISTANCE`.
Set this parameter to `0` to force checkpoint of every layer.
Configuration parameter `checkpoint_period` defines the interval between checkpoint iterations.
Default is `DEFAULT_CHECKPOINT_PERIOD`.
### Compaction
A background operation on layer files. Compaction takes a number of L0
layer files, each of which covers the whole key space and a range of
LSN, and reshuffles the data in them into L1 files so that each file
covers the whole LSN range, but only part of the key space.
Compaction should also opportunistically leave out obsolete page versions
from the L1 files, and materialize other page versions for faster
access. That hasn't been implemented as of this writing, though.
### Compute node
Stateless Postgres node that stores data in pageserver.
@@ -44,10 +62,10 @@ Stateless Postgres node that stores data in pageserver.
### Garbage collection
The process of removing old on-disk layers that are not needed by any timeline anymore.
### Fork
Each of the separate segmented file sets in which a relation is stored. The main fork is where the actual data resides. There also exist two secondary forks for metadata: the free space map and the visibility map.
Each PostgreSQL fork is considered a separate relish.
### Layer
There are two kinds of layers, in-memory and on-disk layers. In-memory
layers are used to ingest incoming WAL, and provide fast access
to the recent page versions. On-disk layers are stored as files on disk, and
are immutable. See [pageserver-storage.md](./pageserver-storage.md) for more.
### Layer file (on-disk layer)
Layered repository on-disk format is based on immutable files. The
files are called "layer files". There are two kinds of layer files:
image files and delta files. An image file contains a "snapshot" of a
range of keys at a particular LSN, and a delta file contains WAL
records applicable to a range of keys, in a range of LSNs.
### Layer map
The layer map tracks what layers exist in a timeline.
### Layered repository
Neon repository implementation that keeps data in layers.
### LSN
The Log Sequence Number (LSN) is a unique identifier of the WAL record in the WAL log.
The insert position is a byte offset into the logs, increasing monotonically with each new record.
Internally, an LSN is a 64-bit integer, representing a byte position in the write-ahead log stream.
It is printed as two hexadecimal numbers of up to 8 digits each, separated by a slash.
Check also [PostgreSQL doc about pg_lsn type](https://www.postgresql.org/docs/devel/datatype-pg-lsn.html)
Values can be compared to calculate the volume of WAL data that separates them, so they are used to measure the progress of replication and recovery.
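As a small illustration of that textual form (a sketch, not code from the repository):
```python
def lsn_to_int(lsn: str) -> int:
    """Parse an LSN like '0/16B9188' into its underlying 64-bit integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def int_to_lsn(value: int) -> str:
    """Format a 64-bit integer back into the 'hi/lo' textual form."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"


# The difference between two LSNs is the number of WAL bytes between them.
assert lsn_to_int("0/16B9188") - lsn_to_int("0/16B9100") == 0x88
```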
In Postgres and Neon LSNs are used to describe certain points in WAL handling.
PostgreSQL LSNs and functions to monitor them:
* `pg_current_wal_insert_lsn()` - Returns the current write-ahead log insert location.
* `pg_current_wal_lsn()` - Returns the current write-ahead log write location.
* `pg_current_wal_flush_lsn()` - Returns the current write-ahead log flush location.
* `pg_last_wal_receive_lsn()` - Returns the last write-ahead log location that has been received and synced to disk by streaming replication. While streaming replication is in progress this will increase monotonically.
* `pg_last_wal_replay_lsn()` - Returns the last write-ahead log location that has been replayed during recovery. If recovery is still in progress this will increase monotonically.
The generic term in PostgreSQL for all objects in a database that have a name and a list of attributes defined in a specific order.
### Relish
We call each relation and other file that is stored in the
repository a "relish". It comes from "rel"-ish, as in "kind of a
rel", because it covers relations as well as other things that are
not relations, but are treated similarly for the purposes of the
storage layer.
### Replication slot
One repository corresponds to one Tenant.
How much history do we need to keep around for PITR and read-only nodes?
### Segment (PostgreSQL)
NOTE: This is an overloaded term.
A physical file that stores data for a given relation. File segments are
limited in size by a compile-time setting (1 gigabyte by default), so if a
relation exceeds that size, it is split into multiple segments.
### Segment (Layered Repository)
NOTE: This is an overloaded term.
Segment is a RELISH_SEG_SIZE slice of relish (identified by a SegmentTag).
### SLRU
SLRUs include pg_clog, pg_multixact/members, and
pg_multixact/offsets. There are other SLRUs in PostgreSQL, but
they don't need to be stored permanently (e.g. pg_subtrans),
or we do not support them in neon yet (pg_commit_ts).
Each SLRU segment is considered a separate relish.
### Tenant (Multitenancy)
Tenant represents a single customer, interacting with Neon.
WAL redo activity, timelines, and layers are managed for each tenant independently.
One pageserver[] can serve multiple tenants at once.
One safekeeper
See `docs/multitenancy.md` for more.