plv8 can only be built with a fairly new gold linker version. We used to install
it via binutils packages from testing, but that also updates libc, which causes
trouble in the resulting image, as different extensions end up built against
different libc versions. We could either use libc from debian-testing everywhere
or refrain from using testing packages and install the necessary programs manually.
This patch uses the latter approach: gold for plv8 and cmake for h3 are
installed manually.
In passing, declare h3_postgis as a safe extension (a previous omission).
`GRANT CREATE ON SCHEMA public` fails if there is no schema `public`.
Disable it in release for now and make a better fix later (it is
needed for v15 support).
* Support configuring the log format as json or plain.
Test the json and plain loggers separately; otherwise they would compete for the
same global subscriber (a sketch of the subscriber setup follows the list).
* Implement log_format for pageserver config
* Implement configurable log format for safekeeper.
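A minimal sketch of the idea, assuming the tracing / tracing_subscriber crates with the "json" feature enabled; the actual pageserver/safekeeper config plumbing is not shown and the function name is illustrative:
```
// Hypothetical helper: pick the tracing_subscriber output format based on a
// "log_format" config value ("json" or "plain").
fn init_logging(log_format: &str) {
    let builder = tracing_subscriber::fmt().with_target(false);
    match log_format {
        // structured, machine-readable JSON lines
        "json" => builder.json().init(),
        // plain, human-readable output
        _ => builder.init(),
    }
}

fn main() {
    init_logging("json");
    tracing::info!("logging initialized");
}
```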
Similar to https://github.com/neondatabase/neon/pull/2395, this introduces a state field in Timeline that can be subscribed to.
Adjusts:
* walreceiver to not hold any connections if the timeline is not Active
* remote storage sync to not schedule uploads if the timeline is Broken
* timeline creation to be rejected if the tenant/timeline is Broken
* timeline states to switch automatically based on the tenant state
Does not adjust the timeline's GC, checkpointing and layer flush behaviour much, since it's not safe to cancel these processes abruptly and task_mgr::shutdown_tasks already does a similar thing.
This API is rather pointless: a sane choice requires knowledge of the peers'
status anyway, and leaders' lifetimes can intersect in any case, which is fine
for us -- so manual elections are straightforward. Here, we deterministically
choose among the reasonably caught-up safekeepers, shifting by timeline id to
spread the load (a sketch of the idea follows below).
A step towards custom broker https://github.com/neondatabase/neon/issues/2394
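Roughly, the deterministic choice works like the sketch below (illustrative code, not the safekeeper implementation; the candidate ids and the timeline-id hash are placeholders):
```
// Pick one of the reasonably caught-up safekeepers, offset by (a hash of) the
// timeline id so that different timelines spread their choice across the set.
fn choose_safekeeper(caught_up: &[u64], timeline_id_hash: u64) -> Option<u64> {
    if caught_up.is_empty() {
        return None;
    }
    let idx = (timeline_id_hash as usize) % caught_up.len();
    Some(caught_up[idx])
}

fn main() {
    // Safekeeper node ids that are reasonably caught up for this timeline.
    let candidates = [1, 3, 4];
    assert_eq!(choose_safekeeper(&candidates, 17), Some(4)); // 17 % 3 == 2
    assert_eq!(choose_safekeeper(&[], 17), None);
}
```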
* Fix bogus early exit from GC.
Commit 91411c415a added this failpoint, but the early exit was not
intentional.
* Cleanup test_gc_cutoff.py test.
- Remove the 'scale' parameter, this isn't a benchmark
- Tweak pgbench and pageserver options to create garbage faster than the
  GC can collect it away. The test used to take just under 5 minutes,
  which was uncomfortably close to the default 5 minute test timeout, and
  annoyingly long even without the hard limit. These changes bring it down to
  about 1-2 minutes.
- Improve comments, fix typos
- Rename the failpoint. The old name, 'gc-before-save-metadata' implied
that the failpoint was before the metadata update, but it was in fact
much later in the function.
- Move the call to persist the metadata outside the lock, to avoid
holding it for too long.
To verify that this test still covers the original bug,
https://github.com/neondatabase/neon/issues/2539, I commented out the
metadata file update like this:
```
diff --git a/pageserver/src/tenant/timeline.rs b/pageserver/src/tenant/timeline.rs
index 1e857a9a..f8a9f34a 100644
--- a/pageserver/src/tenant/timeline.rs
+++ b/pageserver/src/tenant/timeline.rs
@@ -1962,7 +1962,7 @@ impl Timeline {
}
// Persist the new GC cutoff value in the metadata file, before
// we actually remove anything.
- self.update_metadata_file(self.disk_consistent_lsn.load(), HashMap::new())?;
+ //self.update_metadata_file(self.disk_consistent_lsn.load(), HashMap::new())?;
info!("GC starting");
```
It doesn't fail every time with that, but it did fail after about 5
runs.
If we cannot reconstruct an FSM or VM page, while creating image
layers, fill it with zeros instead. That should always be safe, for
the FSM and VM, in the sense that you won't lose actual user data. It
will get cleaned up by VACUUM later.
We had a bug with FSM/VM truncation, where we truncated the FSM and VM
at WAL replay to a smaller size than PostgreSQL originally did. We
thought it was harmless, as the FSM and VM are not critical for
correctness and can be zeroed out or truncated without affecting user
data. However, it led to a situation where PostgreSQL created
incremental WAL records for pages that we had already truncated away
in the pageserver, and when we tried to replay those WAL records, the
replay failed. That led to a permanent error in image layer creation, and
prevented it from ever finishing. See
https://github.com/neondatabase/neon/issues/2601. With this patch,
those pages will be filled with zeros in the image layer, which allows
the image layer creation to finish.
Part of https://github.com/neondatabase/neon/pull/2239
Regular, from-scratch timeline creation involves running initdb in a separate directory, importing the data from that directory into the pageserver, and, finally, starting the timeline-related background tasks.
This PR ensures we don't leave behind any directories that are not marked as temporary and that the pageserver removes such directories on restart, allowing timeline creation to be retried with the same IDs, if needed.
It would be good to later rewrite the logic to use a temporary directory, similar to what tenant creation does.
Currently that is harder than this change, so it is not done here.
Follow-up of #2636 and #2654, fixing the "testing" feature detection.
Pageserver currently outputs features as
```
/target/debug/pageserver --version
Neon page server git:7734929a8202c8cc41596a861ffbe0b51b5f3cb9 failpoints: true, features: ["testing", "profiling"]
```
These two tests, test_timeline_physical_size_post_compaction and
test_timeline_physical_size_post_gc, assumed that after you have
waited for the WAL from a bulk insertion to arrive, and you run a
cycle of checkpoint and compaction, no new layer files are created.
Because if a new layer file is created while we are calculating the
incremental and non-incremental physical sizes, they might differ.
However, the tests used a very small checkpoint_distance, so even a
small amount of WAL generated in PostgreSQL could cause a new layer
file to be created. Autovacuum can kick in at any time, and do that.
That caused occasional failures in the test. I was able to reproduce it
reliably by adding a long delay between the incremental and
non-incremental size calculations:
```
--- a/pageserver/src/http/routes.rs
+++ b/pageserver/src/http/routes.rs
@@ -129,6 +129,9 @@ async fn build_timeline_info(
}
};
let current_physical_size = Some(timeline.get_physical_size());
+ if include_non_incremental_physical_size {
+ std::thread::sleep(std::time::Duration::from_millis(60000));
+ }
let info = TimelineInfo {
tenant_id: timeline.tenant_id,
```
To fix, disable autovacuum for the table. Autovacuum could still kick
in for other tables, e.g. catalog tables, but that seems less likely
to generate enough WAL to cause a new layer file to be flushed.
If this continues to be a problem in the future, we could simply retry
the physical size call a few times, if there's a mismatch. A mismatch
could happen every once in a while, but it's very unlikely to happen
more than once or twice in a row.
Fixes https://github.com/neondatabase/neon/issues/2212
- Measure size of redo WAL (new histogram), with bounds between 24B-32kB
- Add 2 more buckets at the upper end of the redo time histogram
We often (>0.1% of several hours each day) take more than 250ms to do the
redo round-trip to the postgres process. We need to measure these redo
times more precisely.
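For illustration, a histogram with such bounds could be registered roughly like this (a sketch assuming the prometheus crate; the metric name and exact buckets are assumptions, not the pageserver's real ones):
```
use prometheus::{exponential_buckets, register_histogram, Histogram, HistogramOpts};

// Hypothetical metric; buckets grow exponentially from roughly 24 B upward.
fn register_wal_redo_size_histogram() -> Histogram {
    let opts = HistogramOpts::new(
        "pageserver_wal_redo_record_bytes",
        "Size of WAL records applied by the WAL redo process",
    )
    .buckets(exponential_buckets(24.0, 2.0, 11).unwrap());
    register_histogram!(opts).unwrap()
}

fn main() {
    register_wal_redo_size_histogram().observe(8192.0);
}
```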
We've got at least one user in production that cannot create a
database with a trailing space in the name.
This happens because we use the `url` crate for manipulating the
DATABASE_URL, but it follows a standard that doesn't fit very
well with Postgres. For example, it trims all trailing spaces
from the path:
> Remove any leading and trailing C0 control or space from input.
> See: https://url.spec.whatwg.org/#url-parsing
But we used `set_path()` to set the database name, and trailing spaces
are perfectly valid in Postgres database names.
Thus, use `postgres::config::Config` to modify the database name in the
connection details.
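A hedged sketch of the approach, assuming the postgres crate's Config type (the real call site differs): the database name is treated as an opaque string, so trailing spaces survive, unlike with `url::Url::set_path()`.
```
// Parse an existing connection string and override only the database name.
fn with_dbname(conninfo: &str, dbname: &str) -> Result<postgres::Config, postgres::Error> {
    let mut config: postgres::Config = conninfo.parse()?;
    config.dbname(dbname); // "mydb " keeps its trailing space verbatim
    Ok(config)
}

fn main() {
    // Trailing space in the database name is preserved by Config::dbname.
    let _config = with_dbname("host=localhost user=postgres", "db with trailing space ").unwrap();
}
```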
* Persists latest_gc_cutoff_lsn before performing GC
* Perform some refactoring and code deduplication
Refs #2539
* Add test for persisting GC cutoff
* Fix python test style warnings
* Bump postgres version
* Reduce number of iterations in test_gc_cutoff test
* Bump postgres version
* Undo bumping postgres version
In the Postgres backend, we cannot link directly with libpq (check the
pgsql-hackers archive for all kinds of fun that ensued when we tried to
do that). Therefore, the libpq functions are used through the thin
wrapper functions in libpqwalreceiver.so, and libpqwalreceiver.so is
loaded dynamically. To hide the dynamic loading and make the calls
look like regular functions, we use macros to hide the function
pointers.
We had inherited the same indirections in libpqwalproposer, but it's
not needed since the neon extension is already a shared library that's
loaded dynamically. There's no problem calling the functions directly
there. Remove the indirections.
Speeds up layer_map::search somewhat. I also opened a PR in the upstream
rust-amplify repository with these changes,
see https://github.com/rust-amplify/rust-amplify/pull/148. We can switch
back to upstream version when that's merged.
Lookups in the R-tree call the "envelope" function for every comparison,
and our envelope function isn't very cheap, so that overhead adds up.
Create the envelope once, when the layer is inserted into the tree, and
store it along with the layer. That uses some more memory per layer, but
that's not very significant.
Speeds up the search operation 2x.
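The idea, sketched with the rstar crate used for the layer map (types and ranges here are illustrative, not the actual layer structs):
```
use rstar::{RTreeObject, AABB};

// Store the layer together with its precomputed envelope.
struct CachedLayer<L> {
    layer: L,
    envelope: AABB<[i64; 2]>, // computed once, when the layer is inserted
}

impl<L> CachedLayer<L> {
    fn new(layer: L, key_range: (i64, i64), lsn_range: (i64, i64)) -> Self {
        // The (relatively expensive) envelope computation happens exactly once.
        let envelope =
            AABB::from_corners([key_range.0, lsn_range.0], [key_range.1, lsn_range.1]);
        CachedLayer { layer, envelope }
    }
}

impl<L> RTreeObject for CachedLayer<L> {
    type Envelope = AABB<[i64; 2]>;
    fn envelope(&self) -> Self::Envelope {
        // Searches call this for every comparison; returning the cached copy
        // avoids recomputing it each time.
        self.envelope.clone()
    }
}

fn main() {
    let layer = CachedLayer::new("layer-file", (0, 100), (10, 20));
    println!("{:?}", layer.envelope());
}
```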
This is the first step in verifying layer files. Next up on the road is
hashing the files and verifying the hashes.
The metadata additions do not require any migration. The idea is that
the change is backward and forward-compatible with regard to
`index_part.json` due to the softness of JSON schema and the
deserialization options in use.
New types added:
- LayerFileMetadata for tracking the file metadata
- starting with only the file size
- in future hopefully a sha256 as well
- IndexLayerMetadata, the serialized counterpart of LayerFileMetadata
It is unfortunate that all fields of LayerFileMetadata need to be Option, but
that cannot be addressed without conflicting a lot more with other
ongoing work.
Co-authored-by: Kirill Bulatov <kirill@neon.tech>
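A hedged sketch of the compatibility idea, assuming serde/serde_json (the field and type names mirror the description above, not the exact source): because every field is optional and defaulted, old and new `index_part.json` files can be read by either version without a migration.
```
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Default, Debug, PartialEq)]
struct IndexLayerMetadata {
    // Older index_part.json files carry no metadata at all, so this defaults
    // to None; in the future, hopefully a sha256 field as well.
    #[serde(default)]
    file_size: Option<u64>,
}

fn main() {
    // An old-format entry without the new field still deserializes cleanly.
    let old: IndexLayerMetadata = serde_json::from_str("{}").unwrap();
    assert_eq!(old, IndexLayerMetadata { file_size: None });
}
```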
* We had an issue with `lineinfile` usage for the pageserver configuration
file: if the S3-bucket-related values were changed, it would have
resulted in duplicate keys and thus invalid toml.
So to fix the issue, we should keep the configuration in structured
format (yaml in this case) so we can always generate syntactically
correct toml.
Inventories are converted to yaml just so that it's easier to maintain
the configuration there. Another alternative would have been a separate
variables file.
* Keep the ansible collections dir, but locally installed collections
should not be tracked.
* etcd-client is not updated, since we plan to replace it with another client and the new version fails with a missing prost library error
* clap has released another major update that requires changing every CLI declaration again; that deserves a separate PR
The 'local' part was always filled in, so that was easy to merge into
the TimelineInfo itself. 'remote' only contained two fields,
'remote_consistent_lsn' and 'awaits_download'. I made
'remote_consistent_lsn' an optional field, and 'awaits_download' is now
false if the timeline is not present remotely.
However, I kept stub versions of the 'local' and 'remote' structs for
backwards-compatibility, with a few fields that are actively used by
the control plane. They just duplicate the fields from TimelineInfo
now. They can be removed later, once the control plane has been
updated to use the new fields.
It was only None when you queried the status of a timeline with
'timeline_detail' mgmt API call, and it was still being downloaded. You
can check for that status with the 'tenant_status' API call instead,
checking for has_in_progress_downloads field.
Another case was if an error happened while trying to get the current
logical size, in a 'timeline_detail' request. It might make sense to
tolerate such errors, and leave the fields we cannot fill in as empty,
None, 0 or similar, but it doesn't make sense to me to leave the whole
'local' struct empty in that case.
With the ability to pass commit_lsn. This allows performing project WAL recovery
through a different (from the original) set of safekeepers (or under a different
ttid) by
1) moving WAL files to s3 under the proper ttid;
2) explicitly creating the timeline on safekeepers, setting commit_lsn to the
   latest point;
3) putting the latest .partial file into the timeline directory on the
   safekeepers, if desired.
Extend test_s3_wal_replay to exercise this behaviour.
Also extends timeline_status endpoint to return postgres information.
I'm using the Rust compiler and cargo versions from Debian packages,
but the latest available cargo Debian package is quite old, version
1.57. The 'named-profiles' feature was not stabilized at that
version yet, so ever since commit a463749f5, I've had to manually add
this line to the Cargo.toml file to compile. I've been wishing that
someone would update the cargo Debian package, but it doesn't seem to
be happening any time soon.
This doesn't seem to bother anyone else but me, but it shouldn't hurt
anyone else either. If there was a good reason, I could install a
newer cargo version with 'rustup', but if all we need is this one line
in Cargo.toml, I'd prefer to continue using the Debian packages.
You cannot attach/detach an individual timeline, attach/detach always
applies to the whole tenant. However, you can *delete* a single timeline
from a tenant. Fix some comments and error messages that confused these
two operations.
Commit c634cb1d36 removed the trait and changed the function to return
a &TimelineWriter, as the FIXME said we should do, but forgot to remove
the FIXME.
* Test that we emit build info metric for pageserver, safekeeper and proxy with some non-zero length revision label
* Emit libmetrics_build_info on startup of pageserver, safekeeper and
proxy with label "revision" which tells the git revision.
The previous default of 1 s caused excessive CPU usage when there were
a lot of projects. Polling every timeline once a second was too aggressive
so let's reduce it.
Fixes https://github.com/neondatabase/neon/issues/2542, but we
probably also want to do something so that we don't poll timelines
that have received no new WAL or layers since the last check.
* Add test for branching on page boundary
* Normalize start recovery point
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: Thang Pham <thang@neon.tech>
We had a problem where almost all of the threads were waiting on a futex syscall. More specifically:
- `/metrics` handler was inside `TimelineCollector::collect()`, waiting on a mutex for a single Timeline
- This exact timeline was inside `control_file::FileStorage::persist()`, waiting on a mutex for Lazy initialization of `PERSIST_CONTROL_FILE_SECONDS`
- `PERSIST_CONTROL_FILE_SECONDS: Lazy<Histogram>` was blocked on `prometheus::register`
- `prometheus::register` calls `DEFAULT_REGISTRY.write().register()` to take a write lock on Registry and add a new metric
- `DEFAULT_REGISTRY` lock was already taken inside `DEFAULT_REGISTRY.gather()`, which was called by `/metrics` handler to collect all metrics
This commit creates another Registry with a separate lock, to avoid deadlock in a case where `TimelineCollector` triggers registration of new metrics inside default registry.
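A minimal sketch of the separate-registry idea, assuming the prometheus and once_cell crates (names are illustrative): metrics that may be created lazily while a scrape is in progress register into their own Registry, so registration never contends with the default registry's lock held by `gather()`.
```
use once_cell::sync::Lazy;
use prometheus::{Histogram, HistogramOpts, Registry};

// Lazily created metrics go here, not into prometheus::default_registry().
static INTERNAL_REGISTRY: Lazy<Registry> = Lazy::new(Registry::new);

static PERSIST_CONTROL_FILE_SECONDS: Lazy<Histogram> = Lazy::new(|| {
    let histogram = Histogram::with_opts(HistogramOpts::new(
        "persist_control_file_seconds",
        "Time spent persisting the control file",
    ))
    .unwrap();
    // Registering here cannot deadlock against DEFAULT_REGISTRY.gather().
    INTERNAL_REGISTRY
        .register(Box::new(histogram.clone()))
        .unwrap();
    histogram
});

fn main() {
    PERSIST_CONTROL_FILE_SECONDS.observe(0.01);
    // A /metrics handler would gather from both registries.
    println!("{} metric families", INTERNAL_REGISTRY.gather().len());
}
```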
Creates new `pageserver_api` and `safekeeper_api` crates to serve as the
shared dependencies. Should reduce both recompile times and cold compile
times.
Decreases the size of the optimized `neon_local` binary: 380M -> 179M.
No significant changes for anything else (mostly as expected).
Compute node startup time is very important. After launching
PostgreSQL, use 'notify' to be notified immediately when it has
updated the PID file, instead of polling. The polling loop had 100 ms
interval so this shaves up to 100 ms from the startup time.
The loop checked if the TCP port is open for connections, by trying to
connect to it. That seems unnecessary. By the time the postmaster.pid
file says that it's ready, the port should be open. Remove that check.
* Preserve task result in TaskHandle by keeping join handle around
The solution is not great, but it should help to debug the staging issue.
I tried to do it in the least destructive way. TaskHandle is used in only
one place, so it is ok to use something less generic unless we want
to extend its usage across the codebase. In its current form,
for its single usage site, it looks too abstract.
Some problems around this code:
1. Task can drop event sender and continue running
2. Task cannot be joined several times (probably not needed,
but still, can be surprising)
3. Had to split the task event into two types because anyhow::Error
   does not implement Clone. So TaskContinueEvent derives Clone,
   but the usual task event does not. The Clone requirement appears
   because we clone the current value in next_task_event.
   Taking it by reference is complicated.
4. The split between Init and Started is artificial and comes from
   watch::channel's requirement to have some initial value.
To summarize 3 and 4: it may be a better idea to use an
RwLock or a bounded channel instead (a sketch of the watch::channel pattern follows below).
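A hedged sketch of the watch::channel pattern described in points 1-4 above, assuming tokio (this is not the actual TaskHandle code):
```
use tokio::sync::watch;

#[derive(Clone, Debug)]
enum TaskEvent {
    Init, // artificial initial state required by watch::channel (point 4)
    Started,
    Progress(String),
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = watch::channel(TaskEvent::Init);
    tokio::spawn(async move {
        let _ = tx.send(TaskEvent::Started);
        let _ = tx.send(TaskEvent::Progress("doing work".into()));
        // The task could drop `tx` here and keep running (point 1).
    });
    // next_task_event-style consumption: the current value is cloned out of
    // the watch because holding the borrow across awaits is awkward (point 3).
    while rx.changed().await.is_ok() {
        let event = (*rx.borrow()).clone();
        println!("{event:?}");
    }
}
```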
Changes are:
* Correct typo "firts" -> "first"
* Change <empty panic with comment explaining> to <panic with message
taken from the comment>
* Fix weird indentation that rustfmt was failing to handle
* Use existing `anyhow::{anyhow,bail}!` as `{anyhow,bail}!` if it's
already in scope
* Spell `Result<T, anyhow::Error>` as `anyhow::Result<T>`
* In general, closer to matching the rest of the codebase
* Change usages of `hash_map::Entry` to `Entry` when it's already in
scope
* A quick search shows our style on this one varies across the files
it's used in
* Fix extreme metrics bloat in storage sync
From 78 metrics per (timeline, tenant) pair down to (max) 10 metrics per
(timeline, tenant) pair, plus another 117 metrics in a global histogram that
replaces the previous per-timeline histogram.
* Drop image sync operation metric series when dropping TimelineMetrics.
- Split postgres_ffi into two version specific files.
- Preserve pg_version in timeline metadata.
- Use pg_version in safekeeper code. Check for postgres major version mismatch.
- Clean up the code to use DEFAULT_PG_VERSION constant everywhere, instead of hardcoding.
- Parameterize python tests: use DEFAULT_PG_VERSION env and pg_version fixture.
To run tests using a specific PostgreSQL version, pass the DEFAULT_PG_VERSION environment variable:
'DEFAULT_PG_VERSION='15' ./scripts/pytest test_runner/regress'
Currently not all tests pass, because the Rust code relies on the default PostgreSQL version in a few places.
Replace the layer array and linear search with R-tree
So far, the in-memory layer map that holds information about layer
files that exist, has used a simple Vec, in no particular order, to
hold information about all the layers. That obviously doesn't scale
very well; with thousands of layer files the linear search was
consuming a lot of CPU. Replace it with a two-dimensional R-tree, with
Key and LSN ranges as the dimensions.
For the R-tree, use the 'rstar' crate. To be able to use that, we
convert the Keys and LSNs into 256-bit integers. 64 bits would be
enough to represent LSNs, and 128 bits would be enough to represent
Keys. However, we use 256 bits, because rstar internally performs
multiplication to calculate the area of rectangles, and the result of
multiplying two 128 bit integers doesn't necessarily fit in 128 bits,
causing integer overflow and, if overflow-checks are enabled,
panic. To avoid that, we use 256 bit integers.
Add a performance test that creates a lot of layer files, to
demonstrate the benefit.
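The arithmetic behind that choice, as a small illustration (plain u128 here, standing in for the coordinate type): the area computation multiplies side lengths, and the product of two 128-bit lengths can need up to 256 bits.
```
// Returns whether width * height fits in 128 bits without overflowing.
fn area_fits_in_u128(width: u128, height: u128) -> bool {
    width.checked_mul(height).is_some()
}

fn main() {
    // A Key spans up to 128 bits, so a worst-case rectangle overflows u128.
    assert!(area_fits_in_u128(1u128 << 64, 1u128 << 63)); // 2^127 still fits
    assert!(!area_fits_in_u128(1u128 << 64, 1u128 << 64)); // 2^128 does not
}
```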
Part of the general work on improving pageserver logs.
Brief summary of changes:
* Remove `ApiError::from_err`
* Remove `impl From<anyhow::Error> for ApiError`
* Convert `ApiError::{BadRequest, NotFound}` to use `anyhow::Error`
* Note: `NotFound` has more verbose formatting because it's more
likely to have useful information for the receiving "user"
* Explicitly convert from `tokio::task::JoinError`s into
`InternalServerError`s where appropriate
Also note: many of the places where errors were implicitly converted to
500s have now been updated to return a more appropriate error. Some
places where it's not yet possible to distinguish the error types have
been left as 500s.
Follow-up to PR #2433 (b8eb908a). There's still a few more unresolved
locations that have been left as-is for the same compatibility reasons
in the original PR.
Commit 43a4f7173e fixed the case that there are extra options in the
connection string, but broke it in the case when there are not. Fix
that. But on second thought, it's more straightforward to set the
options with ALTER DATABASE, so change the workflow yaml file to do
that instead.
In commit 6985f6cd6c, I tried passing extra GUCs in the 'options' part
of the connection string, but it didn't work because the pgbench test
overrode it with the statement_timeout. Change it so that it adds the
statement_timeout to any other options, instead of replacing them.
* Set last written lsn for created relation
* use current LSN for updating last written LSN of relation metadata
* Update LSN for the extended blocks even for pages without LSN (zeroed)
* Update pgxn/neon/pagestore_smgr.c
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Fixes #1873: previously any run of `make` caused the `postgres-v15-headers`
target to build. It copied a bunch of headers via `install -C`. Unfortunately,
some origins were symlinks in the `./pg_install/build` directory pointing
inside `./vendor/postgres-v15` (e.g. `pg_config_os.h` pointing to `linux.h`).
GNU coreutils' `install` ignores the `-C` key for non-regular files and
always overwrites the destination if the origin is a symlink. That in turn
made Cargo rebuild the `postgres_ffi` crate and all its dependencies because
it thinks that Postgres headers changed, even if they did not. That was slow.
Now we use a custom script that wraps the `install` program. It handles one
specific case and makes sure individual headers are never copied if their
content did not change. Hence, `postgres_ffi` is not rebuilt unless there were
some changes to the C code.
One may still have slow incremental single-threaded builds because Postgres
Makefiles spawn about 2800 sub-makes even if no files have been changed.
A no-op build takes "only" 3-4 seconds on my machine now when run with `-j30`,
and 20 seconds when run with `-j1`.
This commit does two things of note:
1. Bumps the bindgen dependency from `0.59.1` to `0.60.1`. This gets us
an actual error type from bindgen, so we can display what's wrong.
2. Adds `anyhow` as a build dependency, so our error message can be
prettier. It's already used heavily elsewhere in the crates in this
repo, so I figured the fact it's a build dependency doesn't matter
much.
I ran into this from running `cargo <cmd>` without running `make` first.
Here's a comparison of the compiler output in those two cases.
Before this commit:
```
error: failed to run custom build command for `postgres_ffi v0.1.0 ($repo_path/libs/postgres_ffi)`
Caused by:
process didn't exit successfully: `$repo_path/target/debug/build/postgres_ffi-2f7253b3ad3ca840/build-script-build` (exit status: 101)
--- stdout
cargo:rerun-if-changed=bindgen_deps.h
--- stderr
bindgen_deps.h:7:10: fatal error: 'c.h' file not found
bindgen_deps.h:7:10: fatal error: 'c.h' file not found, err: true
thread 'main' panicked at 'Unable to generate bindings: ()', libs/postgres_ffi/build.rs:135:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```
After this commit:
```
error: failed to run custom build command for `postgres_ffi v0.1.0 ($repo_path/libs/postgres_ffi)`
Caused by:
process didn't exit successfully: `$repo_path/target/debug/build/postgres_ffi-e01fb59602596748/build-script-build` (exit status: 1)
--- stdout
cargo:rerun-if-changed=bindgen_deps.h
--- stderr
bindgen_deps.h:7:10: fatal error: 'c.h' file not found
Error: Unable to generate bindings
Caused by:
clang diagnosed error: bindgen_deps.h:7:10: fatal error: 'c.h' file not found
```
Also get rid of the `with_safekeepers` parameter in tests.
Its meaning had changed: `False` meant "no safekeepers", which is not
supported anymore, so we assume it's always `True`.
See #1648
Previously, we compiled neon separately for Postgres v14 and v15, for
the codestyle checks. But that was bogus; we actually just ran "make
postgres", which always compiled both versions. The version really only
affected the caching.
Fix that, by copying the build steps from the main build_and_test.yml
workflow.
Running "make" at the top level calls "make install" to install the
PostgreSQL headers into the pg_install/ directory. That always updated
the modification time of the headers even if there were no changes,
triggering recompilation of the postgres_ffi bindings. To avoid that,
use 'install -C', to install the PostgreSQL headers.
However, there was an upstream PostgreSQL issue that the
src/include/Makefile didn't respect the INSTALL configure option. That
was just fixed in upstream PostgreSQL, so cherry-pick that fix to our
vendor/postgres repositories.
Fixes https://github.com/neondatabase/neon/issues/1873.
Commit f44afbaf62 updated vendor/postgres-v15 to point to a commit that
was built on top of PostgreSQL 14 rather than 15. So we accidentally had
two copies of PostgreSQL v14 in the repository. Oops. This updates
it to point to the correct version.
* Changes of neon extension to support local prefetch
* Catch exceptions in pageserver_receive
* Bump postgres version
* Bump postgres version
* Bump postgres version
* Bump postgres version
* github/actions: add neon projects related actions
* workflows/benchmarking: create projects using API
* workflows/pg_clients: create projects using API
Instead of spawning helper threads, we now use Tokio tasks. There
are multiple Tokio runtimes, for different kinds of tasks. One for
serving libpq client connections, another for background operations
like GC and compaction, and so on. That's not strictly required, we
could use just one runtime, but with this you can still get an
overview of what's happening with "top -H".
There's one subtle behavior in how TenantState is updated. Before this
patch, if you deleted all timelines from a tenant, its GC and
compaction loops were stopped, and the tenant went back to Idle
state. We no longer do that. The empty tenant stays Active. The
changes to test_tenant_tasks.py are related to that.
There's still plenty of synchronous code and blocking. For example, we
still use blocking std::io functions for all file I/O, and the
communication with WAL redo processes still uses low-level unix
poll(). We might want to rewrite those later, but this will do for
now. The model is that local file I/O is considered to be fast enough
that blocking - and preventing other tasks running in the same thread -
is acceptable.
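A minimal sketch of the multiple-runtime setup, assuming tokio's multi-threaded runtime and once_cell (the runtime names here are illustrative, not the pageserver's actual ones):
```
use once_cell::sync::Lazy;
use tokio::runtime::{Builder, Runtime};

// Separate, named runtimes so the threads are identifiable in `top -H`.
static LIBPQ_RUNTIME: Lazy<Runtime> = Lazy::new(|| {
    Builder::new_multi_thread()
        .thread_name("libpq worker")
        .enable_all()
        .build()
        .expect("failed to create libpq runtime")
});

static BACKGROUND_RUNTIME: Lazy<Runtime> = Lazy::new(|| {
    Builder::new_multi_thread()
        .thread_name("background op worker") // GC, compaction, ...
        .enable_all()
        .build()
        .expect("failed to create background runtime")
});

fn main() {
    BACKGROUND_RUNTIME.block_on(async {
        println!("running on the named background runtime");
    });
}
```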
We had a pattern like this:
match remote_storage {
    GenericRemoteStorage::Local(storage) => {
        let source = storage.remote_object_id(&file_path)?;
        ...
        storage
            .function(&source, ...)
            .await
    },
    GenericRemoteStorage::S3(storage) => {
        ... exact same code as for the Local case ...
    },
}
This removes the code duplication, by allowing you to call the functions
directly on GenericRemoteStorage.
Also change RemoteObjectId to be just a type alias for String. Now that
the callers of GenericRemoteStorage functions don't know whether they're
dealing with the LocalFs or S3 implementation, RemoteObjectId must be the
same type for both.
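A self-contained toy version of the refactoring (names and operations are illustrative, not the pageserver's real remote storage API): the forwarding match lives on the enum once, instead of being repeated at every call site.
```
type RemoteObjectId = String;

struct LocalFs;
struct S3Bucket;

impl LocalFs {
    fn remote_object_id(&self, file_path: &str) -> RemoteObjectId {
        format!("file://{file_path}")
    }
}
impl S3Bucket {
    fn remote_object_id(&self, file_path: &str) -> RemoteObjectId {
        format!("s3://bucket/{file_path}")
    }
}

enum GenericRemoteStorage {
    Local(LocalFs),
    S3(S3Bucket),
}

impl GenericRemoteStorage {
    // The single forwarding match lives here; callers no longer duplicate it.
    fn remote_object_id(&self, file_path: &str) -> RemoteObjectId {
        match self {
            GenericRemoteStorage::Local(s) => s.remote_object_id(file_path),
            GenericRemoteStorage::S3(s) => s.remote_object_id(file_path),
        }
    }
}

fn main() {
    let storage = GenericRemoteStorage::Local(LocalFs);
    println!("{}", storage.remote_object_id("tenant/timeline/layer"));
}
```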
Because the metadata was not locked, it could be updated concurrently
such that we wouldn't actually have the tail block.
The current ordering works better, as we still only start XLogBeginInsert()
once we have all potentially interesting buffers loaded in memory, but
still have correct lock lifetimes.
See also: access/transam/README section Write-Ahead Log Coding
Another preparatory commit for pg15 support:
* generate bindings for both pg14 and pg15;
* update Makefile and CI scripts: now neon build depends on both PostgreSQL versions;
* some code refactoring to decrease version-specific dependencies.
Commit f081419e68 moved all the prometheus counters to `metrics.rs`,
but accidentally replaced a couple of `register_int_counter!(...)`
calls with just `IntCounter::new(...)`. Because of that, the counters
were not registered in the metrics registry, and were not exposed
through the metrics HTTP endpoint.
Fixes failures we're seeing in a bunch of 'performance' tests because
of the missing metrics.
This caught or reproduced several bugs when I originally wrote this test
back in May, including #1731, #1740, #1751, and #707. I believe all the
issues have been fixed now, but since this was a very fruitful test,
let's add it to the test suite.
We didn't commit this earlier, because the test was very slow especially
with a debug build. We've since changed the build options so that even
the debug builds are not quite so slow anymore.
* Add test for pageserver metric cleanup once a tenant is detached.
* Remove tenant specific timeline metrics on detach.
* Use definitions from timeline_metrics in page service.
* Move metrics to own file from layered_repository/timeline.rs
* TIMELINE_METRICS: define smgr metrics
* Remove SMGR cleanup from timeline_metrics. Doesn't seem to work as
  expected.
* Virtual file centralized metrics, except for evicted files, as there's no
  tenant id or timeline id.
* Use STORAGE_TIME from timeline_metrics in layered_repository.
* Remove timelineless gc metrics for tenant on detach.
* Rename timeline metrics -> metrics as it's more generic.
* Don't create a TimelineMetrics instance for VirtualFile
* Move the rest of the metric definitions to metrics.rs too.
* UUID -> ZTenantId
* Use consistent style for dict.
* Use Repository's Drop trait for dropping STORAGE_TIME metrics.
* No need for Arc, TimelineMetrics is used in just one place. Due to that,
  we can fall back to using ZTenantId and ZTimelineId to avoid an additional
  string allocation.
* Add submodule postgres-15
* Support pg_15 in pgxn/neon
* Renamed zenith -> neon in Makefile
* fix name of codestyle check
* Refactor build system to prepare for building multiple Postgres versions.
Rename "vendor/postgres" to "vendor/postgres-v14"
Change Postgres build and install directory paths to be version-specific:
- tmp_install/build -> pg_install/build/14
- tmp_install/* -> pg_install/14/*
And Makefile targets:
- "make postgres" -> "make postgres-v14"
- "make postgres-headers" -> "make postgres-v14-headers"
- etc.
Add Makefile aliases:
- "make postgres" to build "postgres-v14" and in future, "postgres-v15"
- similarly for "make postgres-headers"
Fix POSTGRES_DISTRIB_DIR path in pytest scripts
* Make postgres version a variable in codestyle workflow
* Support vendor/postgres-v15 in codestyle check workflow
* Support postgres-v15 building in Makefile
* fix pg version in Dockerfile.compute-node
* fix kaniko path
* Build neon extensions in version-specific directories
* fix obsolete mentions of vendor/postgres
* use vendor/postgres-v14 in Dockerfile.compute-node.legacy
* Use PG_VERSION_NUM to gate dependencies in inmem_smgr.c
* Use versioned ECR repositories and image names for compute-node.
The image name format is compute-node-vXX, where XX is postgres major version number.
For now only v14 is supported.
Old format unversioned name (compute-node) is left, because cloud repo depends on it.
* update vendor/postgres submodule url (zenith->neondatabase rename)
* Fix postgres path in python tests after rebase
* fix path in regress test
* Use separate dockerfiles to build compute-node:
Dockerfile.compute-node-v15 should be identical to Dockerfile.compute-node-v14 except for the version number.
This is a hack, because Kaniko doesn't support build ARGs properly
* bump vendor/postgres-v14 and vendor/postgres-v15
* Don't use Kaniko cache for v14 and v15 compute-node images
* Build compute-node images for different versions in different jobs
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
* Update relation size cache only when latest LSN is requested
* Fix tests
* Add a test case for timetravel query after pageserver restart.
This test is currently failing, the queries return incorrect results.
I don't know why, needs to be investigated.
FAILED test_runner/batch_others/test_readonly_node.py::test_timetravel - assert 85 == 100000
If you remove the pageserver restart from the test, it passes.
* yapf3 test_readonly_node.py
* Add comment about cache correction in case of setting incorrect latest flag
* Fix formatting for test_readonly_node.py
* Remove unused imports
* Fix mypy warning for test_readonly_node.py
* Fix formatting of test_readonly_node.py
* Bump postgres version
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
* Fix python style
* Fix import of test_backpressure in test_latency
* Apply changes to the moved neon extension
* Apply changes to the moved neon extension
* Merge with main
* Update pgxn/neon/pagestore_smgr.c
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Bump postgres version
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Move the backpressure throttling implementation to the neon extension and add a function for monitoring throttling time
* Add missing includes
* Bump postgres version
Seems a bit silly to have a separate crate just for the executable. It
relies on the control plane for everything it does, and it's the only
user of the control plane.
Slim down compute-node images:
- Optimize compute_ctl build for size, not performance & debug-ability
- Don't run unused stages. Saves time in not building the PLV8 extension.
- Do not include static libraries in clean postgres
- Do the installation and finishing touches in the final layer in one job
This allows docker (and kaniko) to only register one change to the files,
removing potentially duplicate changed files.
- The runtime library for libreadline-dev is libreadline8, changing the dependency saves 45 MB
- libprotobuf-c-dev -> libprotobuf-c1, saving 100 kB
- libossp-uuid-dev -> libossp-uuid16, saving 150 kB
- gdal-bin + libgdal-dev -> libgeos-c1v5 + libgdal28 + libproj19, saving 747MB
- binutils @ testing -> libc6 @ testing, saving 32 MB
Start the calculation on the first size request, return
partially calculated size during calculation, retry if failed.
Remove "fast" size init through the ancestor: the current approach is
fast enough for now and there are better ways to optimize the
calculation via incremental ancestor size computation
For better ergonomics. I always found it weird that we used UUID to
actually mean a tenant or timeline ID. It worked because it happened
to have the same length, 16 bytes, but it was hacky.
Update PLV8 to 3.1.4 - which is the latest release.
Update PostGIS to 3.3.0
Remove PLV8 from the final image -- there is an issue we hit when installing PLV8, and we don't quite know what it is yet.
The code correctly detected too short and too long inputs, but the error
message was bogus in the case where the input stream was too long:
Error: Provided stream has actual size 5 that is smaller than the given stream size 4
That check was only supposed to catch too-small inputs, but it in
fact caught too-long inputs too. That was good, because the check
below it that was supposed to catch too-long inputs was in fact
broken, and never did anything. It tried to read input into a buffer of
size 0, to check if there is any extra data, but reading into a
zero-sized buffer always returns 0.
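A minimal sketch of the fixed check over a std::io::Read source (illustrative, not the exact code): probe with a one-byte buffer after consuming the expected length, since reading into a zero-sized buffer always returns Ok(0) and therefore detects nothing.
```
use std::io::{Cursor, Error, ErrorKind, Read, Result};

fn ensure_no_trailing_data(mut stream: impl Read) -> Result<()> {
    let mut probe = [0u8; 1];
    if stream.read(&mut probe)? != 0 {
        return Err(Error::new(
            ErrorKind::InvalidData,
            "stream is larger than the given stream size",
        ));
    }
    Ok(())
}

fn main() {
    assert!(ensure_no_trailing_data(Cursor::new(b"")).is_ok());
    assert!(ensure_no_trailing_data(Cursor::new(b"extra")).is_err());
}
```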
Merge batch_others and batch_pg_regress. The original idea was to
split all the python tests into multiple "batches" and run each batch
in parallel as a separate CI job. However, the batch_pg_regress batch
was pretty short compared to all the tests in batch_others. We could
split batch_others into multiple batches, but it actually seems better
to just treat them as one big pool of tests and let pytest handle
the parallelism on its own. If we need to split them across multiple
nodes in the future, we could use pytest-shard or something else,
instead of managing the batches ourselves.
Merge test_neon_regress.py, test_pg_regress.py and test_isolation.py
into one file, test_pg_regress.py. Seems more clear to group all
pg_regress-based tests into one file, now that they would all be in
the same directory.
Previously, the proxy didn't forward the auxiliary `options` parameter
and other parameters to the client's compute node, e.g.
```
$ psql "user=john host=localhost dbname=postgres options='-cgeqo=off'"
postgres=# show geqo;
┌──────┐
│ geqo │
├──────┤
│ on │
└──────┘
(1 row)
```
With this patch we now forward `options`, `application_name` and `replication`.
Further reading: https://www.postgresql.org/docs/current/libpq-connect.html
Fixes #1287.
* Add fork_at_current_lsn function which creates branch at current LSN
* Undo use of fork_at_current_lsn in test_branching because of short GC period
* Add missed return in fork_at_current_lsn
* Add missed return in fork_at_current_lsn
* Update test_runner/fixtures/neon_fixtures.py
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Update test_runner/fixtures/neon_fixtures.py
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Update test_runner/fixtures/neon_fixtures.py
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
`latest_gc_cutoff_lsn` tracks the cutoff point where GC has been
performed. Anything older than the cutoff might already have been GC'd
away, and cannot be queried by get_page_at_lsn requests. It's
protected by an RWLock. Whenever a get_page_at_lsn request comes in,
it first grabs the lock and reads the current `latest_gc_cutoff`, and
holds the lock until the request has been served. The lock ensures
that GC doesn't start concurrently and remove page versions that we
still need to satisfy the request.
With the lock, a get_page_at_lsn request could potentially be blocked
for a long time. GC only holds the lock in exclusive mode for a short
duration, but depending on whether the RWLock is "fair", a read
request might be queued behind the GC's exclusive request, which in
turn might be queued behind a long-running read operation, like a
basebackup. If the lock implementation is not fair, i.e. if a reader
can always jump the queue if the lock is already held in read mode,
then another problem arises: GC might be starved if a constant stream
of GetPage requests comes in.
To avoid the long wait or starvation, introduce a Read-Copy-Update
mechanism to replace the lock on `latest_gc_cutoff_lsn`. With the RCU, a
reader can always read the latest value without blocking (except for a
very short duration if the lock protecting the RCU is contended;
that's comparable to a spinlock). And a writer can always write a new
value without waiting for readers to finish using the old value. The
old readers will continue to see the old value through their guard
object, while new readers will see the new value.
This is purely theoretical ATM, we don't have any reports of either
starvation or blocking behind GC happening in practice. But it's
simple to fix, so let's nip that problem in the bud.
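A minimal sketch of the Read-Copy-Update idea described above (a toy type, not the actual pageserver Rcu): readers clone the current Arc under a very short lock and keep using it via the returned handle, while writers publish a new Arc without waiting for old readers.
```
use std::sync::{Arc, Mutex};

struct Rcu<T>(Mutex<Arc<T>>);

impl<T> Rcu<T> {
    fn new(value: T) -> Self {
        Rcu(Mutex::new(Arc::new(value)))
    }
    // Readers hold the inner lock only long enough to clone the Arc; the
    // returned handle keeps the old value alive even after it's replaced.
    fn read(&self) -> Arc<T> {
        self.0.lock().unwrap().clone()
    }
    // Writers publish a new value immediately; existing readers keep their
    // old Arc until they drop it.
    fn store(&self, value: T) {
        *self.0.lock().unwrap() = Arc::new(value);
    }
}

fn main() {
    let cutoff = Rcu::new(100u64);
    let guard = cutoff.read(); // an in-flight request pins the old cutoff
    cutoff.store(200);         // GC publishes a new cutoff without waiting
    assert_eq!(*guard, 100);
    assert_eq!(*cutoff.read(), 200);
}
```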
It's used by e2e CI. Building Dockerfile.compute-node will take an
unreasonable amount of time without v2 runners.
TODO: remove once cloud repo CI is moved to v2 runners.
* Extract neon and neon_test_utils from postgres repo
* Remove neon from vendored postgres repo, and fix build_and_test.yml
* Move EmitWarningsOnPlaceholders to end of _PG_init in neon.c (from libpagestore.c)
* Fix Makefile location comments
* remove Makefile EXTRA_INSTALL flag
* Update Dockerfile.compute-node to build and include the neon extension
Previously, it could only distinguish REDO task durations down to 5ms, which
equates to approx. 200 pages/sec or 1.6MB/sec of getpage@LSN traffic.
This patch improves that to 200'000 pages/sec or 1.6GB/sec, allowing for
much more precise performance measurement of the redo process.
* Add postgis & plv8 extensions
* Update Dockerfile & fix typos
* Update dockerfile
* Update Dockerfile
* Update dockerfile
* Use plv8 step
* Reduce giga layer
* Reduce layer size further
* Prepare for rollout
* Fix dependency
* Pass on correct build tag
* No longer dependent on building tools
* Use version from vendor
* Revert "Use version from vendor"
This reverts commit 7c6670c477.
* Revert and push correct set
* Add configure step for new approach
* Re-add configure flags
Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@Rorys-Mac-Studio.fritz.box>
`///` is used for doc comments on the code that *follows*, so the comment
actually applied to the `use std::collections::BTreeMap;` line below it.
rustfmt complained about that:
error: an inner attribute is not permitted following an outer doc comment
--> /home/heikki/git-sandbox/neon/libs/utils/src/seqwait_async.rs:7:1
|
5 | ///
| --- previous doc comment
6 |
7 | #![warn(missing_docs)]
| ^^^^^^^^^^^^^^^^^^^^^^ not permitted following an outer attribute
8 |
9 | use std::collections::BTreeMap;
| ------------------------------- the inner attribute doesn't annotate this `use` import
|
= note: inner attributes, like `#![no_std]`, annotate the item enclosing them, and are usually found at the beginning of source files
help: to annotate the `use` import, change the attribute from inner to outer style
|
7 - #![warn(missing_docs)]
7 + #[warn(missing_docs)]
|
`//!` is the correct syntax for comments that apply to the whole file.
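For comparison, a tiny sketch of the corrected layout (an illustrative file, not the actual seqwait_async.rs):
```
//! Module-level documentation uses `//!`, which annotates the enclosing file,
//! so it can be followed by inner attributes and items without complaint.
#![warn(missing_docs)]

use std::collections::BTreeMap;

/// Outer `///` docs still attach to the *next* item, as intended here.
pub struct Waiters {
    /// Pending waiters keyed by sequence number.
    pub pending: BTreeMap<u64, String>,
}
```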
Every handler function now follows the same pattern:
1. extract parameters from the call
2. check permissions
3. execute command.
Previously, we extracted some parameters before permission check and
some after. Let's be consistent.
Added a pytest to check the correctness of the link authentication pipeline.
Context: this PR is the first step towards refactoring the link authentication pipeline to use https (instead of psql) to send the db info to the proxy. There was a test missing for this pipeline in this repo, so this PR adds that test as preparation for the actual change of psql -> https.
Co-authored-by: Bojan Serafimov <bojan.serafimov7@gmail.com>
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Co-authored-by: Stas Kelvic <stas@neon.tech>
Co-authored-by: Dimitrii Ivanov <dima@neon.tech>
There was a nominal split between the tests in layered_repository.rs and
repository.rs, such that tests specific to the layered implementation were
supposed to be in layered_repository.rs, and tests that should work with
any implementation of the traits were supposed to be in repository.rs.
In practice, the line was quite muddled. With minor tweaks, many of the
tests in layered_repository.rs should work with other implementations too,
and vice versa. And in practice we only have one implementation, so it's
more straightforward to gather all unit tests in one place.
usize/isize type corresponds to the CPU architecture's pointer width,
i.e. 64 bits on a 64-bit platform and 32 bits on a 32-bit platform.
The logical size of a database has nothing to do with that, so
u64/i64 is more appropriate.
It doesn't make any difference in practice as long as you're on a
64-bit platform, and it's hard to imagine anyone wanting to run the
pageserver on a 32-bit platform, but let's be tidy.
Also add a comment on why we use signed i64 for the logical size
variable, even though size should never be negative. I'm not sure the
reasons are very good, but at least this documents them, and hints at
some possible better solutions.
* Update workflow to fix dependency issue
* Update workflow
* Update workflow and dockerfile
* Specify tag
* Update main dockerfile as well
* Mirror rust image to docker hub
* Update submodule ref
Co-authored-by: Rory de Zoete <rdezoete@Rorys-Mac-Studio.fritz.box>
Compute node docker image requires compute-tools to build, but this
dependency (and the argument for which image to pick) weren't described in the
workflow file. This led to out-of-date binaries in latest builds, which
subsequently broke these images.
Including, but not limited to:
* Fixes to neon management code to support walproposer-as-an-extension
* Fix issue in expected output of pg settings serialization.
* Show the logs of a failed --sync-safekeepers process in CI
* Add compat layer for renamed GUCs in postgres.conf
* Update vendor/postgres to the latest origin/main
- There was an issue with zero commit_lsn `reason: LaggingWal { current_commit_lsn: 0/0, new_commit_lsn: 1/6FD90D38, threshold: 10485760 } }`. The problem was in `send_wal.rs`, where we initialized `end_pos = Lsn(0)` and in some cases sent it to the pageserver.
- IDENTIFY_SYSTEM previously returned `flush_lsn` as a physical end of WAL. Now it returns `flush_lsn` (as it was) to walproposer and `commit_lsn` to everyone else including pageserver.
- There was an issue with backoff where connection was cancelled right after initialization: `connected!` -> `safekeeper_handle_db: Connection cancelled` -> `Backoff: waiting 3 seconds`. The problem was in sleeping before establishing the connection. This is fixed by reworking retry logic.
- There was an issue with getting the `NoKeepAlives` reason in a loop. The issue is probably the same as the previous one.
- There was an issue with filtering safekeepers based on retry attempts, which could filter some safekeepers indefinitely. This is fixed by using a retry cooldown duration instead of retry attempts.
- Some `send_wal.rs` connections failed with errors without context. This is fixed by adding the timeline to safekeeper errors.
New retry logic works like this (a sketch follows the list):
- Every candidate has a `next_retry_at` timestamp and is not considered for connection until that moment
- When walreceiver connection is closed, we update `next_retry_at` using exponential backoff, increasing the cooldown on every disconnect.
- When `last_record_lsn` was advanced using the WAL from the safekeeper, we reset the retry cooldown and exponential backoff, allowing walreceiver to reconnect to the same safekeeper instantly.
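An illustrative sketch of that policy (the struct and field names are made up for the example, and the backoff constants are placeholders):
```
use std::time::{Duration, Instant};

#[derive(Default)]
struct RetryState {
    attempts: u32,
    next_retry_at: Option<Instant>,
}

impl RetryState {
    // Called when the walreceiver connection to this safekeeper closes:
    // grow the cooldown exponentially, capped at a maximum.
    fn on_disconnect(&mut self) {
        let backoff = Duration::from_secs(1u64 << self.attempts.min(6)); // 1s..64s
        self.next_retry_at = Some(Instant::now() + backoff);
        self.attempts += 1;
    }
    // Called when last_record_lsn advanced using WAL from this safekeeper:
    // reset the backoff so we may reconnect to it instantly.
    fn on_progress(&mut self) {
        self.attempts = 0;
        self.next_retry_at = None;
    }
    fn may_connect(&self, now: Instant) -> bool {
        self.next_retry_at.map_or(true, |t| now >= t)
    }
}

fn main() {
    let mut state = RetryState::default();
    state.on_disconnect();
    assert!(!state.may_connect(Instant::now())); // still cooling down
    state.on_progress();
    assert!(state.may_connect(Instant::now())); // may reconnect instantly
}
```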
Re-export only things that are used by other modules.
In the future, I'm imagining that we run bindgen twice, for Postgres
v14 and v15. The two sets of bindings would go into separate
'bindings_v14' and 'bindings_v15' modules.
Rearrange postgres_ffi modules.
Move a function to avoid a Postgres version dependency in timelines.rs.
Move the function that generates a logical-message WAL record to postgres_ffi.
* Do not create initial tenant and timeline (adjust Python tests for that)
* Rework config handling during init, add --update-config to manage local config updates
The pg_control_ffi.h name implies that it only includes stuff related to
pg_control.h. That's mostly true currently, but really the point of the
file is to include everything that we need to generate Rust definitions
from.
* Use main, not branch for ref check
* Add more debug
* Count main, not head
* Try new approach
* Conform to syntax
* Update approach
* Get full history
* Skip checkout
* Cleanup debug
* Remove more debug
Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>
This patch makes walreceiver logic more complicated, but it should work better in most cases. Added `test_wal_lagging` to test scenarios where alive safekeepers can lag behind other alive safekeepers.
- There was a bug which looks like `etcd_info.timeline.commit_lsn > Some(self.local_timeline.get_last_record_lsn())` filtered all safekeepers in some strange cases. I removed this filter, it should probably help with #2237
- Now walreceiver_connection reports status, including commit_lsn. This allows keeping safekeeper connection even when etcd is down.
- Safekeeper connection now fails if pageserver doesn't receive safekeeper messages for some time. Usually safekeeper sends messages at least once per second.
- `LaggingWal` check now uses `commit_lsn` directly from safekeeper. This fixes the issue with often reconnects, when compute generates WAL really fast.
- `NoWalTimeout` is rewritten to trigger only when we know about the new WAL and the connected safekeeper doesn't stream any WAL. This allows setting a small `lagging_wal_timeout` because it will trigger only when we observe that the connected safekeeper has stuck.
The new format has a few benefits: it's shorter, simpler and
human-readable as well. We don't use base64 anymore, since
url encoding got us covered.
We also show a better error in case we couldn't parse the
payload; the users should know it's all about passing the
correct project name.
This test is failing consistently on `main` now. It's better to temporarily disable it to avoid blocking others' PRs while investigating the root cause of the test failure.
See: #2255, #2256
Resolves #2212.
- use `wait_for_last_flush_lsn` in `test_timeline_physical_size_*` tests
## Context
Need to wait for the pageserver to catch up with the compute's last flush LSN because during the timeline physical size API call, it's possible that there are running `LayerFlushThread` threads. These threads flush new layers into disk and hence update the physical size. This results in a mismatch between the physical size reported by the API and the actual physical size on disk.
### Note
The `LayerFlushThread` threads are processed **concurrently**, so it's possible that the above error still persists even with this patch. However, making the tests wait to finish processing all the WALs (not flushing) before calculating the physical size should help reduce the "flakiness" significantly
Resolves #2097
- use timeline modification's `lsn` and timeline's `last_record_lsn` to determine the corresponding LSN to query data in `DatadirModification::get`
- update `test_import_from_pageserver`. Split the test into 2 variants: `small` and `multisegment`.
+ `small` is the old test
+ `multisegment` is to simulate #2097 by using a larger number of inserted rows to create multiple segment files of a relation. `multisegment` is configured to only run with a `release` build
Flush the in-memory layer eventually when no new data arrives, which helps
safekeepers suspend activity (stop pushing to the broker). The default of 10m
should be ok.
This script can be used to migrate a tenant across breaking storage versions, or (in the future) upgrading postgres versions. See the comment at the top for an overview.
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
A fair amount of the time in our python tests is spent waiting for the
pageserver and safekeeper processes to shut down. It doesn't matter so
much when you're running a lot of tests in parallel, but it's quite
noticeable when running them sequentially.
A big part of the slowness is that after sending the SIGTERM
signal, we poll to see if the process is still running, and the
polling happened at a 1 s interval. Reduce it to 0.1 s.
A newer version of mypy fixes a buggy error when trying to update only the boto3 stubs.
However, it brings new checks and starts to yell when we index into
cursor.fetchone without checking for None first. So this introduces a wrapper
to simplify querying for scalar values. I tried to use the cursor_factory connection
argument but without success. There may be a better way to do that,
but this looks the simplest.
Move all the fields that were returned by the wal_receiver endpoint into
timeline_detail. Internally, move those fields from the separate global
WAL_RECEIVERS hash into the LayeredTimeline struct. That way, all the
information about a timeline is kept in one place.
In passing, I noted that the 'thread_id' field was removed from
WalReceiverEntry in commit e5cb727572, but that commit forgot to update
openapi_spec.yml. This commit removes that too.
It failed in staging environment a few times, and all we got in the
logs was:
ERROR could not start the compute node: failed to get basebackup@0/2D6194F8 from pageserver host=zenith-us-stage-ps-2.local port=6400
giving control plane 30s to collect the error before shutdown
That's missing all the detail on *why* it failed.
What the WAL receiver really connects to is the safekeeper. The
"producer" term is a bit misleading, as the safekeeper doesn't produce
the WAL, the compute node does.
This change also applies to the name of the field used in the mgmt API
in the response of the
'/v1/tenant/:tenant_id/timeline/:timeline_id/wal_receiver' endpoint.
AFAICS that's not used anywhere else than one python test, so it
should be OK to change it.
Ref #1902.
- Track the layered timeline's `physical_size` using `pageserver_current_physical_size` metric when updating the layer map.
- Report the local timeline's `physical_size` in timeline GET APIs.
- Add `include-non-incremental-physical-size` URL flag to also report the local timeline's `physical_size_non_incremental` (similar to `logical_size_non_incremental`)
- Add a `UIntGaugeVec` and `UIntGauge` to represent `u64` prometheus metrics
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
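A hedged sketch of those gauge types, assuming the prometheus crate's generic gauges (the real definitions live in the repo's metrics code and may differ):
```
use prometheus::core::{AtomicU64, GenericGauge, GenericGaugeVec};
use prometheus::Opts;

pub type UIntGauge = GenericGauge<AtomicU64>;
pub type UIntGaugeVec = GenericGaugeVec<AtomicU64>;

fn main() {
    let opts = Opts::new(
        "pageserver_current_physical_size",
        "Current physical size, per timeline",
    );
    let gauge_vec = UIntGaugeVec::new(opts, &["tenant_id", "timeline_id"]).unwrap();
    gauge_vec
        .with_label_values(&["some-tenant", "some-timeline"])
        .set(42);
}
```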
Previously DatadirTimeline was a separate struct, and there was a 1:1
relationship between each DatadirTimeline and LayeredTimeline. That was
a bit awkward; whenever you created a timeline, you also needed to create
the DatadirTimeline wrapper around it, and if you only had a reference
to the LayeredTimeline, you would need to look up the corresponding
DatadirTimeline struct through tenant_mgr::get_local_timeline_with_load().
There were a couple of calls like that from LayeredTimeline itself.
Refactor DatadirTimeline, so that it's a trait, and mark LayeredTimeline
as implementing that trait. That way, there's only one object,
LayeredTimeline, and you can call both Timeline and DatadirTimeline
functions on that. You can now also call DatadirTimeline functions from
LayeredTimeline itself.
I considered just moving all the functions from DatadirTimeline directly
to Timeline/LayeredTimeline, but I still like to have some separation.
Timeline provides a simple key-value API, and handles durably storing
key/value pairs, and branching. Whereas DatadirTimeline is stateless, and
provides an abstraction over the key-value store, to present an interface
with relations, databases, etc. Postgres concepts.
This simplifies the logical size calculation fast-path for branch
creation, introduced in commit 28243d68e6. LayeredTimeline can now
access the ancestor's logical size directly, so the caller doesn't need
to pass it in. I moved the fast-path to the init_logical_size()
function itself. It now checks if the ancestor's last LSN is the same
as the branch point, i.e. if there haven't been any changes on the
ancestor after the branch, and copies the size from there. An
additional bonus is that the optimization will now work any time you
have a branch of another branch, with no changes from the ancestor,
not only at a create-branch command.
The layered_repository.rs file had grown to be very large. Split off
the LayeredTimeline struct and related code to a separate source file to
make it more manageable.
There are plans to move much of the code to track timelines from
tenant_mgr.rs to LayeredRepository. That will make layered_repository.rs
grow again, so now is a good time to split it.
There's a lot more cleanup to do, but this commit intentionally only
moves existing code and avoids doing anything else, for easier review.
[proxy] Add the `password hack` authentication flow
This lets us authenticate users which can use neither
SNI (due to old libpq) nor connection string `options`
(due to restrictions in other client libraries).
Note: `PasswordHack` will accept passwords which are not
encoded in base64 via the "password" field. The assumption
is that most user passwords will be valid utf-8 strings,
and the rest may still be passed via "password_".
## Overview
This patch reduces the number of memory allocations when running the page server under a heavy write workload. This mostly helps improve the speed of WAL record ingestion.
## Changes
- modified `DatadirModification` to allow reusing the struct's allocated memory after each modification
- modified `decode_wal_record` to allow passing a `DecodedWALRecord` reference. This helps reuse the struct in each `decode_wal_record` call
- added a reusable buffer for serializing object inside the `InMemoryLayer::put_value` function
- added a performance test simulating a heavy write workload for testing the changes in this patch
### Semi-related changes
- remove redundant serializations when calling `DeltaLayer::put_value` during `InMemoryLayer::write_to_disk` function call [1]
- removed the info span `info_span!("processing record", lsn = %lsn)` during each WAL ingestion [2]
## Notes
- [1]: in `InMemoryLayer::write_to_disk`, a deserialization is called
```
let val = Value::des(&buf)?;
delta_layer_writer.put_value(key, *lsn, val)?;
```
`DeltaLayer::put_value` then creates a serialization based on the previous deserialization
```
let off = self.blob_writer.write_blob(&Value::ser(&val)?)?;
```
- [2]: related: https://github.com/neondatabase/neon/issues/733
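To illustrate the allocation-reuse pattern described above (the types and function names below are hypothetical, not the actual `DecodedWALRecord`/`DatadirModification` definitions):
```
// Reuse one heap-allocated scratch struct across calls instead of
// allocating a fresh one per WAL record.
#[derive(Default)]
struct DecodedRecordScratch {
    blocks: Vec<u32>, // reused backing storage; cleared, not reallocated
}

fn decode_record(raw: &[u8], decoded: &mut DecodedRecordScratch) {
    decoded.blocks.clear(); // keeps the previously allocated capacity
    // ... fill `decoded.blocks` from `raw` (stand-in logic below) ...
    decoded.blocks.extend(raw.iter().map(|b| *b as u32));
}

fn ingest(records: &[&[u8]]) {
    let mut scratch = DecodedRecordScratch::default();
    for raw in records {
        // One allocation is amortized over all records.
        decode_record(raw, &mut scratch);
    }
}
```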
* More precisely control size of inmem layer
* Force recompaction of L0 layers if they contain large non-WAL-logged BLOBs, to avoid overly large layers
* Add modified version of test_hot_update test (test_dup_key.py) which should generate large layers without large number of tables
* Change test name in test_dup_key
* Add Layer::get_max_key_range function
* Add layer::key_iter method and implement new approach of splitting layers during compaction based on total size of all key values
* Add test_large_schema test for checking layer file size after compaction
* Make clippy happy
* Restore checking LSN distance threshold for checkpoint in-memory layer
* Optimize storage keys iterator
* Update pageserver/src/layered_repository.rs
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Update pageserver/src/layered_repository.rs
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Update pageserver/src/layered_repository.rs
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Update pageserver/src/layered_repository.rs
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Update pageserver/src/layered_repository.rs
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Fix code style
* Reduce number of tables in test_large_schema to make it fit in timeout with debug build
* Fix style of test_large_schema.py
* Fix handling of duplicate layers
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
We were getting a warning like this from the pg_regress tests:
=================== warnings summary ===================
/usr/lib/python3/dist-packages/_pytest/config/__init__.py:663
/usr/lib/python3/dist-packages/_pytest/config/__init__.py:663: PytestAssertRewriteWarning: Module already imported so cannot be rewritten: fixtures.pg_stats
self.import_plugin(import_spec)
-- Docs: https://docs.pytest.org/en/stable/warnings.html
------------------ Benchmark results -------------------
To fix, reorder the imports in conftest.py. I'm not sure what exactly
the problem was or why the order matters, but the warning is gone and
that's good enough for me.
If the WAL arrives at the pageserver slowly, it's possible that the
branch is created before all the data on the parent branch have
arrived. That results in a failure:
test_runner/batch_others/test_tenant_relocation.py:259: in test_tenant_relocation
timeline_id_second, current_lsn_second = populate_branch(pg_second, create_table=False, expected_sum=1001000)
test_runner/batch_others/test_tenant_relocation.py:133: in populate_branch
assert cur.fetchone() == (expected_sum, )
E assert (500500,) == (1001000,)
E At index 0 diff: 500500 != 1001000
E Full diff:
E - (1001000,)
E + (500500,)
To fix, specify the LSN to branch at, so that the pageserver will wait
for it to arrive.
See https://github.com/neondatabase/neon/issues/2063
Resolves #2054
**Context**: branch creation needs to wait for GC to acquire `gc_cs` lock, which prevents creating new timelines during GC. However, because individual timeline GC iteration also requires `compaction_cs` lock, branch creation may also need to wait for compactions of multiple timelines. This results in large latency when creating a new branch, which we advertised as *"instantly"*.
This PR optimizes the latency of branch creation by separating GC into two phases:
1. Collect GC data (branching points, cutoff LSNs, etc)
2. Perform GC for each timeline
The GC bottleneck comes from step 2, which must wait for compaction of multiple timelines. This PR modifies the branch creation and GC functions to allow GC to hold the GC lock only in step 1. As a result, branch creation doesn't need to wait for compaction to finish but only needs to wait for GC data collection step, which is fast.
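A minimal sketch of the two-phase structure described above (lock, struct, and method names here are illustrative, not the actual pageserver code):
```
use std::sync::Mutex;

struct GcData { /* branch points, cutoff LSNs, ... */ }

struct Tenant {
    gc_cs: Mutex<()>,
}

impl Tenant {
    fn gc_iteration(&self) {
        // Phase 1: collect GC data while holding gc_cs. This is fast, so
        // branch creation only ever waits for this short critical section.
        let gc_data = {
            let _guard = self.gc_cs.lock().unwrap();
            self.collect_gc_data()
        };
        // Phase 2: per-timeline GC (which may wait on compaction) runs
        // without gc_cs held, so it no longer blocks branch creation.
        self.gc_timelines(gc_data);
    }

    fn collect_gc_data(&self) -> GcData {
        GcData {}
    }

    fn gc_timelines(&self, _data: GcData) {}
}
```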
Simplifies the workflow. Makes the overall build a little faster, as
the build_postgres step doesn't need to upload the pg.tgz artifact,
and the build_neon step doesn't need to download it again.
This effectively reverts commit a490f64a68. That commit changed the
workflow so that the Postgres binaries were not included in the
neon.tgz artifact. With this commit, the pg.tgz artifact is gone, and
the Postgres binaries are part of neon.tgz again.
The "cargo metadata" and "cargo test --no-run" are used in the workflow
to just list names of the final binaries, but unless the same cargo
options like --release or --debug are used in those calls, they will in
fact recompile everything.
Reorganize existing READMEs and other documentation files into mdbook
format. The resulting Table of Contents is a mix of placeholders for
docs that we should write, and documentation files that we already had,
dropped into the most appropriate place.
Update the Pageserver overview diagram. Add sections on thread
management and WAL redo processes.
Add all the RFCs to the mdbook Table of Content too.
Per github issue #1979
On receipt of a ProposerElected message, WAL is truncated at the streaming
point; this code expected that, once a vote is given for the proposer / the
term switch has happened, flush_lsn can be advanced only by this proposer (or
a higher one). However, that didn't take into account the possibility of
accumulating written WAL and flushing it after the vote is given -- flushing
happens without term checks. That eventually led to the violation in question.
ref #2048
* Deduce `last_segment` automatically
* Get rid of local `wal_dir`/`wal_seg_size` variables
* Prepare to test parsing of WAL from multiple specific points, not just the start;
extract `check_end_of_wal` function to check both partial and non-partial WAL segments.
neon.tgz artifact in the github workflow included the contents of
'tmp_install', but that seems pointless, because the same files are
included earlier already in the pg.tgz artifact.
Uploading large artifacts is slow in github actions. To speed that up,
make the artifact smaller.
The code coverage tool doesn't require debug symbols, so remove them.
We've discussed doing the same for *all* binaries, but it's nice to
have debugging symbols for debugging purposes, and so that you get
more complete stack traces. The discussion is ongoing, but let's at
least do this for the test symbols now.
- Updated dependencies with "cargo update"
- Updated workspace_hack with "cargo hakari generate"
There's no particular reason to do this now, just a periodic refresh.
"cargo clippy" started to complain about these, after running "cargo
update". Not sure why it didn't complain before, but seems reasonable to
fix these. (The "cargo update" is not included in this commit)
Change the build options to enable basic optimizations even in debug
mode, and always build dependencies with more optimizations. That
makes the debug-mode binaries somewhat faster, without messing up
stack traces and line-by-line debugging too much.
See https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#concurrency
* Previously there was a single concurrency group per branch.
As the `main` branch was pushed to frequently, very few commits got
tested to the end. That resulted in a "broken" `main` branch, as there were
no fully successful workflow runs.
Now the `main` branch gets a separate concurrency group for each commit.
* As GitHub Actions syntax does not have the conditional operator, it is
emulated via logical and/or operations. Although undocumented, they
return one of their operands instead of plain true/false.
* Replace 3-space indentation with 2-space indentation while we are here
to be consistent with the rest of the file.
* Wait for all computes (except one) to complete before proceeding with
the single compute.
* It previously waited for too few seconds. As the test is randomized, it was
not failing all the time, but only in specific unlucky cases.
E.g. when there were no successful queries by concurrent computes,
and the single node had big timeouts and spent lots of time making the
transaction.
See https://github.com/neondatabase/neon/runs/7234456482?check_suite_focus=true
(around line 980).
* Wait for exactly one extra transaction by the single compute.
We need both storage **and** compute images for deploy, because control plane
picks the compute version based on the storage version. If it notices a fresh
storage it may bump the compute version. And if compute image failed to build
it may break things badly.
Before this patch, importing a physical backup followed the same path
as ingesting any WAL records:
1. All the data pages from the backup are first collected in the
DatadirModification object.
2. Then, they are "committed" to the Repository. They are written to
the in-memory layer
3. Finally, the in-memory layer is frozen, and flushed to disk as a
L0 delta layer file.
This was pretty inefficient. In step 1, the whole physical backup was
held in memory. If the backup is large, you simply run out of
memory. And in step 3, the resulting L0 delta layer file is large,
holding all the data again. That's a problem if the backup is larger
than 5 GB: Amazon S3 doesn't allow uploading files larger than 5 GB
(without using multi-part upload, see github issue #1910). So we want
to avoid that.
To alleviate those problems, optimize the codepath for importing a
physical backup. The basic flow is the same as before, but step 1
is optimized so that it doesn't accumulate all the data in memory,
and step 3 writes the data in image layers instead of one large delta
layer.
Previously, upon branching, if no starting LSN is specified, we
determine the start LSN based on the source timeline's last record LSN
in `timelines::create_timeline` function, which then calls `Repository::branch_timeline`
to create the timeline.
Inside the `LayeredRepository::branch_timeline` function, to start branching,
we try to acquire a GC lock to prevent GC from removing data needed
for the new timeline. However, a GC iteration takes time, so the GC lock
can be held for a long period of time. As a result, the previously determined
starting LSN can become invalid because of GC.
This PR fixes the above issue by delaying the LSN calculation part and moving it to be
inside `LayeredRepository::branch_timeline` function.
* ensure_server_config() function is added to ensure the server does not have background processes
which interfere with WAL generation
* Rework command line syntax
* Add `print-postgres-config` subcommand which prints the required server configuration
Download operations for all timelines of one tenant are now grouped
together, so when attach is invoked the pageserver downloads all of them
and registers them in a single apply_sync_status_update call. This way,
branches can be used safely with attach/detach.
I noticed that the pageserver has a very large virtual memory size,
several GB, even though it doesn't actually use that much
memory. That's not much of a problem normally, but I hit it because I
wanted to run tests with a limited virtual memory size, by calling
setrlimit(RLIMIT_AS), but the highest limit you can set is 2 GB. I was
not able to start pageserver with a limit of 2 GB.
On Linux, each thread allocates 32 MB of virtual memory. I read this
on some random forum on the Internet, but unfortunately could not find
the source again now. Empirically, reducing the number of threads clearly
helps to bring down the virtual memory size.
Aside from the virtual memory usage, it seems excessive to launch 40
threads in both of those thread pools. The tokio default is to have as
many worker threads as there are CPU cores in the system. That seems
like a fine heuristic for us, too, so remove the explicit setting of
the pool size and rely on the default. Note that the GC and compaction
tasks are actually run with tokio spawn_blocking, so the threads that
are actually doing the work, and possibly waiting on I/O, are not
consuming threads from the thread pool. The WAL receiver work is done
in the tokio worker threads, but the WAL receivers are more CPU bound
so that seems OK.
Also remove the explicit maximum on blocking tasks. I'm not sure what
the right value for that would be, or whether the value we set (100)
would be better than the tokio default (512). Since the value was
arbitrary, let's just rely on the tokio default for that, too.
* Add tests for different postgres clients
* test/fixtures: sanitize test name for test_output_dir
* test/fixtures: do not look for etcd before runtime
* Add workflow for testing Postgres client libraries
* Avoid reconnecting to a safekeeper immediately after its failure by limiting candidates to those with the fewest connection attempts. Thus we don't have to wait lagging_wal_timeout (10s by default) before the switch happens even if no new changes are generated, and the current test_restarts_under_load expects some commits to happen within 4s.
* Make the default max_lsn_wal_lag larger, otherwise constant reconnections happen during normal operation.
* Fix wal_connection_attempts maintenance, preventing a busy loop of reconnections.
Mitigates the latency cost, making push throughput 1-1.5 orders of magnitude higher.
Also makes leases per timeline, not per whole safekeeper, avoiding storing
garbage in etcd for deleted timelines while the safekeeper is alive.
Previously, we were granting create only to db owner, but now we have a
dedicated 'web_access' role to connect via web UI and proxy link auth.
We already grant read / write on all data to all roles, so let's grant
create to everyone too. This creates some privilege objects in each db,
which we need to drop before deleting the role. So now we reassign all
owned objects to each db owner before deletion. This also fixes deletion
of roles that previously created some data in any db. Will be tested by
https://github.com/neondatabase/cloud/pull/1673
Later we should stop messing with Postgres ACL that much.
Resolves #1889.
This PR adds new tests to measure the WAL backpressure's performance under different workloads.
## Changes
- add new performance tests in `test_wal_backpressure.py`
- allow safekeeper's fsync to be configurable when running tests
* Update make instructions for release and debug build. Update PostgreSQL glossary to proper version (14)
* Continued cleanup of build instructions including removal of redundancies
This brings in the change to not use shared memory in the WAL redo
process, to avoid running out of sysv shmem segments in the page server.
Also, removal of callmemaybe bits.
* Implement page service 'fullbackup' endpoint that works like basebackup, but also sends relation files
* Add test_runner/batch_others/test_fullbackup.py
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
All endpoints except for POST /v1/timeline are tested; this one is not tested in any way yet.
Three attempts for each endpoint: correctly authenticated, badly authenticated, unauthenticated.
* `control_plane` crate (used by `neon_local`) now parses an `auth_enabled` bool for each Safekeeper
* If auth is enabled, a Safekeeper is passed a path to a public key via a new command line argument
* Added TODO comments to other places needing auth
I made the check at the launcher level with the perspective of eventually moving
election (the decision of who offloads) there.
Also log timeline 'active' changes.
I still don't like the surroundings and feel we'd better get away without using
election API at all, but this is a quick fix to keep CI green.
ref #1815
* Added project option in case SNI data is missing. Resolving issue #1745.
* Added invariant checking for project name: if both sni_data and project_name are available then they should match.
E.g. this log line contains duplicated data:
INFO /timeline_create{tenant=8d367870988250a755101b5189bbbc17
new_timeline=Some(27e2580f51f5660642d8ce124e9ee4ac) lsn=None}:
bootstrapping{timeline=27e2580f51f5660642d8ce124e9ee4ac
tenant=8d367870988250a755101b5189bbbc17}:
created root timeline 27e2580f51f5660642d8ce124e9ee4ac
timeline.lsn 0/16960E8
this avoids variable duplication in `bootstrapping` subspan
Fixes #1768.
## Context
Previously, to test the `get_wal_receiver` API, we ran some DB transactions and then called the API to check the latest message's LSN from the WAL receiver. However, this test doesn't work reliably because it's not guaranteed that the WAL receiver will have received the latest WAL from postgres/safekeeper at the time the API call is made.
This PR resolves the above issue by adding "poll and wait" code that waits to retrieve the latest data from the WAL receiver.
This PR also fixes a bug where two hex LSN strings were compared directly; they should be converted to numbers before the comparison. See: https://github.com/neondatabase/neon/issues/1768#issuecomment-1133752122.
Resolves #1663.
## Changes
- ignore a "broken" [1] timeline on page server startup
- fix the race condition when creating multiple timelines in parallel for a tenant
- added tests for the above changes
[1]: a timeline is marked as "broken" if either
- failed to load the timeline's metadata or
- the timeline's disk consistent LSN is zero
- Uncomment the accidentally commented `self.keep_alive.abort()` line; because of
it the task never finished, which blocked the launcher.
- Rework initialization one more time, to fix the offloader trying to back up
segment 0. Now we initialize all required LSNs in handle_elected,
where we learn the start LSN for the first time.
- Fix blind attempt to provide safekeeper service file with remote storage
params.
A separate task is launched for each timeline and stopped when the timeline doesn't
need offloading. The decision of who offloads is made through etcd leader election;
currently there is no precondition for participating, that's a TODO.
neon_local and tests infrastructure for remote storage in safekeepers added,
along with the test itself.
ref #1009
Co-authored-by: Anton Shyrabokau <ahtoxa@Antons-MacBook-Pro.local>
* turn println into an info with proper message
* rename new_local_timeline to load_local_timeline because it does not
create a new timeline; it registers a timeline that exists on disk in the
pageserver's in-memory structures
If the 'basebackup' command failed in the middle of building the tar
archive, the client would not report the error, but would attempt
to start up postgres with the partial contents of the data directory.
That fails because the control file is missing (it's added to the
archive last, precisely to make sure that you cannot start postgres
from a partial archive). But the client doesn't see the proper error
message that caused the basebackup to fail in the server, which is
confusing.
Two issues conspired to cause that:
1. The tar::Builder object that we use in the pageserver to construct
the tar stream has a Drop handler that automatically writes a valid
end-of-archive marker on drop. Because of that, the resulting tarball
looks complete, even if an error happens while we're building it. The
pageserver does send an ErrorResponse after the seemingly-valid
tarball, but:
2. The client stops reading the Copy stream, as soon as it sees the
tar end-of-archive marker. Therefore, it doesn't read the
ErrorResponse that comes after it.
We have two clients that call 'basebackup', one in `control_plane`
used by the `neon_local` binary, and another one in
`compute_tools`. Both had the same issue.
This PR fixes both issues, even though fixing either one would be
enough to fix the problem at hand. The pageserver now doesn't send the
end-of-archive marker on error, and the client now reads the copy
stream to the end, even if it sees an end-of-archive marker.
Fixes github issue #1715
In passing, change Basebackup to use a generic Write rather than
'dyn'.
By default, etcd makes a huge 10 GB mmap() allocation when it starts up.
It doesn't actually use that much memory, it's just address space, but
it caused me grief when I tried to use 'rr' to debug a python test run.
Apparently, when you replay the 'rr' trace, it does allocate memory for
all that address space.
The size of the initial mmap depends on the --quota-backend-bytes setting.
Our etcd clusters are very small, so let's set --quota-backend-bytes to
keep the virtual memory size small, to make debugging with 'rr' easier.
See https://github.com/etcd-io/etcd/issues/7910 and
5e4b008106
* Potential fix to #1626. Fixed typo in Makefile.
* Completed fix to #1626.
Summary:
changed 'error' to 'bail' in start_pageserver and start_safekeeper.
The logic would incorrectly remove an image layer, if a new image layer
existed, even though the older image layer was still needed by some
delta layers after it. See example given in the comment this adds.
Without this fix, I was getting a lot of "could not find data for key
010000000000000000000000000000000000" errors from GC, with the new test
case being added in PR #1735.
Fixes #707
The primary reason: make GitHub detect that we use Apache License 2.0
They do it via https://github.com/licensee/licensee Ruby library (gem).
Our COPYRIGHT file contains a part of the Apache License, which should
be added to a source file, not the license or copyright information itself,
which confuses the library.
Instead, the recommended way is to create a NOTICE file which references
license of the code and its bundled dependencies.
There were a bunch of dependencies for Python <3.9. They are not needed
after #1254. This commit makes it easier to add/remove dependencies because
the lock file will be updated like this on any such operation.
Do not update dependencies yet, to avoid breaking anything.
It would be better to not update xl_crc/rec_hdr at all when skipping contrecord,
but I would prefer to keep PR #1574 small.
Better audit of `find_end_of_wal_segment` is coming anyway in #544.
Previous invariant: `crc` contains an "unfinalized" CRC32 value,
its one complement, like in postgres before FIN_CRC32C.
New invariant: `crc` always contains a "finalized" CRC32 value,
this is the semantics of crc32c_append, so we don't need to invert CRC manually.
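A small sketch of the new invariant, assuming the `crc32c` crate's `crc32c`/`crc32c_append` functions (which take and return finalized CRC32C values):
```
fn main() {
    let part1 = b"some WAL bytes";
    let part2 = b"more WAL bytes";

    // `crc` always holds a finalized CRC32C value...
    let mut crc = crc32c::crc32c(part1);
    // ...and crc32c_append consumes and returns finalized values too, so no
    // manual one's-complement "finalization" step is needed anywhere.
    crc = crc32c::crc32c_append(crc, part2);

    // Appending is equivalent to computing the CRC over the concatenation.
    assert_eq!(crc, crc32c::crc32c(b"some WAL bytesmore WAL bytes"));
}
```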
* Actual generation logic is in a separate crate `postgres_ffi/wal_generate`
* The crate also provides a binary for debug purposes akin to `initdb`
* Two tests currently fail and are ignored
* There is no easy way to test this directly in Safekeeper as it starts restoring from commit_lsn.
So testing would require disconnecting Safekeeper just after it has received the WAL,
but before it is committed.
The CI times out after 10 minutes of no output. It's annoying if a
test hangs and is killed by the CI timeout, because you don't get
information about which test was running. Try to avoid that, by adding
a slightly smaller timeout in pytest itself. You can override it on a
per-test basis if needed, but let's try to keep our tests shorter than
that.
For the Postgres regression tests, use a longer 30 minute timeout.
They're not really a single test, but many tests wrapped in a single
pytest test. It's OK for them to run longer in aggregate, each
Postgres test is still fairly short.
We saw a case in staging, where there was a gap in the LSN ranges of
level 0 files, like this:
000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9
000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016960E9-00000000016E4DB9
000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016E4DB9-000000000BFCE3E1
000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000000BFCE3E1-000000000BFD0FE9
000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000060045901-000000007005EAC1
000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000007005EAC1-0000000080062E99
000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000080062E99-000000009007F481
000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000009007F481-00000000A009F7C9
000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000A009F7C9-00000000AA284EB9
000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000AA286471-00000000AA2886B9
Note the gap between 000000000BFD0FE9 and 0000000060045901. I don't
know how that happened, but in general the pageserver should be robust
against gaps like that, overlapping files, etc. In theory they
could happen as a result of crashes, partial downloads from S3, etc.,
although it is a mystery what caused it in this case.
Looking at the compaction code, it was not safe in the face of gaps
like that. The compaction routine collected all the level 0 files, and
took their min(start)..max(end) as the range of the new files it
builds. That's wrong, if the level 0 files don't cover the whole LSN
range; the newly created files will miss any records in the gap. Fix
that, by only collecting contiguous sequences of level 0 files, so
that the end LSN of previous delta file is equal to the start of the
next one.
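The grouping step might look roughly like this (an illustrative sketch with made-up types, not the actual compaction code):
```
type Lsn = u64;

#[derive(Debug, Clone)]
struct L0Layer {
    lsn_start: Lsn,
    lsn_end: Lsn,
}

// Split L0 layers into contiguous runs; compaction then operates on one run
// at a time and never spans a gap in the LSN coverage.
fn contiguous_runs(mut layers: Vec<L0Layer>) -> Vec<Vec<L0Layer>> {
    layers.sort_by_key(|l| l.lsn_start);
    let mut runs: Vec<Vec<L0Layer>> = Vec::new();
    for layer in layers {
        let starts_new_run = match runs.last() {
            // There is a gap if this layer doesn't start where the previous one ended.
            Some(run) => run.last().unwrap().lsn_end != layer.lsn_start,
            None => true,
        };
        if starts_new_run {
            runs.push(vec![layer]);
        } else {
            runs.last_mut().unwrap().push(layer);
        }
    }
    runs
}
```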
Fixes issue #1730
Previously, the path was printed to the log with separate error!() calls.
It's better to include the whole path in the error object and have it
printed to the log as one message.
Also print the path in the ValueReconstructResult::Missing case.
This is what it looks like now:
2022-05-17T21:53:53.611801Z ERROR pagestream{timeline=5adcb4af3e95f00a31550d266aab7a37 tenant=74d9f9ad3293c030c6a6e196dd91c60f}: error reading relation or page version: could not find data for key 000000067F000032BE000000000000000001 at LSN 0/1698C48, for request at LSN 0/1698CF8
Caused by:
0: layer traversal: result Complete, cont_lsn 0/1698C48, layer: 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001698C48-0000000001698CC1
1: layer traversal: result Continue, cont_lsn 0/1698CC1, layer: inmem-0000000001698CC1-FFFFFFFFFFFFFFFF
Stack backtrace:
+ neondatabase/cloud#1103
This adds a couple of control endpoints to simplify compute state
discovery for control-plane. For example, now we may figure out
that Postgres wasn't able to start or basebackup failed within
seconds instead of just blindly polling the compute readiness
for a minute or two.
Also, we now expose startup metrics (time of each step: basebackup,
sync safekeepers, config, total). The console grabs them after each
successful start and reports them as histograms to Prometheus and Grafana.
An OpenAPI spec is added and up-to-date, but is not currently used in the
console yet.
- Enabled process exporter for storage services
- Changed zenith_proxy prefix to just proxy
- Removed old `monitoring` directory
- Removed common prefix for metrics, now our common metrics have `libmetrics_` prefix, for example `libmetrics_serve_metrics_count`
- Added `test_metrics_normal_work`
The SyncQueue consisted of a tokio mpsc channel, and an atomic counter
to keep track of how many items there are in the channel. Updating the
atomic counter was racy, and sometimes the consumer would decrement
the counter before the producer had incremented it, leading to integer
wraparound to usize::MAX. Calling Vec::with_capacity(usize::MAX) leads
to a panic.
To fix, replace the channel with a VecDeque protected by a Mutex, and
a condition variable for signaling. Now that the queue is protected by a
standard blocking Mutex and Condvar, refactor the functions touching it
to be sync, not async.
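A minimal sketch of this pattern (not the actual SyncQueue API): a VecDeque behind a Mutex, with a Condvar to wake the storage sync thread when items arrive.
```
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

struct SyncQueue<T> {
    queue: Mutex<VecDeque<T>>,
    cond: Condvar,
}

impl<T> SyncQueue<T> {
    fn new() -> Self {
        Self { queue: Mutex::new(VecDeque::new()), cond: Condvar::new() }
    }

    // Producers push under the lock; the queue length is observed under the
    // same lock, so there is no separate atomic counter that can drift.
    fn push(&self, item: T) {
        self.queue.lock().unwrap().push_back(item);
        self.cond.notify_one();
    }

    // The consumer drains everything currently queued, blocking on the
    // condition variable while the queue is empty.
    fn drain(&self) -> Vec<T> {
        let mut q = self.queue.lock().unwrap();
        while q.is_empty() {
            q = self.cond.wait(q).unwrap();
        }
        q.drain(..).collect()
    }
}
```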
A theoretical downside of this is that the calls to push items to the
queue and the storage sync thread that drains the queue might now need
to wait, if another thread is busy manipulating the queue. I believe
that's OK; the lock isn't held for very long, and these operations are
made in background threads, not in the hot GetPage@LSN path, so
they're not very latency-sensitive.
Fixes #1719. Also add a test case.
The contract for wait_for() was not very clear. It waits until the
given function returns successfully, without an exception, but the
wait_for_last_record_lsn() and wait_for_upload() functions used "a <
b" as the condition, i.e. they thought that wait_for() would poll
until the function returns true.
Inline the logic from wait_for() into those two functions, it's not
that complicated, and you get a more specific error message too, if it
fails. Also add a comment to wait_for() to make it more clear how it
works.
Also change remote_consistent_lsn() to return 0 instead of raising an
exception, if remote is None. That can happen if nothing has been
uploaded to remote storage for the timeline yet. It happened once in
the CI, and I was able to reproduce that locally too by adding a sleep
to the storage sync thread, to delay the first upload.
I got annoyed by all the noise in CI test output.
Before:
$ ./target/release/neon_local stop
Stop pageserver gracefully
Pageserver still receives connections
Pageserver stopped receiving connections
Pageserver status is: Reqwest error: error sending request for url (http://127.0.0.1:9898/v1/status): error trying to connect: tcp connect error: Connection refused (os error 111)
initializing for sk 1 for 7676
Stop safekeeper gracefully
Safekeeper still receives connections
Safekeeper stopped receiving connections
Safekeeper status is: Reqwest error: error sending request for url (http://127.0.0.1:7676/v1/status): error trying to connect: tcp connect error: Connection refused (os error 111)
After:
$ ./target/release/neon_local stop
Stopping pageserver gracefully...done!
Stopping safekeeper 1 gracefully...done!
Also removes the spurious "initializing for sk 1 for 7676" message from
"neon_local start"
Resolves #1488.
- implemented `GET tenant/:tenant_id/timeline/:timeline_id/wal_receiver` endpoint
- returned `thread_id` in `thread_mgr::spawn`
- added `latest_gc_cutoff_lsn` field to `LocalTimelineInfo` struct
It's very confusing, and because you don't get a stack trace or error
message in the logs, it makes debugging very hard. However, the
'test_pageserver_recovery' test relied on that behavior. To support that,
add a new "exit" action to the pageserver 'failpoints' command, so that
you can explicitly request to exit the process when a failpoint is hit.
Use timestamp->LSN mapping instead of file modification time.
Fix 'latest_gc_cutoff_lsn' - set it to the minimum of pitr_cutoff and gc_cutoff.
Add new test: test_pitr_gc
* There is no auth in Safekeeper HTTP at all currently,
so simply calling `check_permission` is not enough.
* There are no checks that the Safekeeper is still working with the data,
as "still working" is blurry now: a timeline may be "active"
while there are no compute nodes and all data is propagated.
* Still, callmemaybe is deactivated, and timeline is removed from the
internal map. It can easily sneak back in case of race conditions
and implicit creations, though.
Try to follow Prometheus style-guide https://prometheus.io/docs/practices/naming/ for metrics names. More specifically:
- Use `pageserver_` prefix for all pageserver metrics
- Specify `_seconds` unit in time metrics
- Use unit as a suffix in other cases, such as `_hits`, `_bytes`, `_records`
- Use `_total` suffix for accumulating counters (note that Histograms append that suffix internally)
* Do not apply records with LSN smaller than LSN of cached image in delta layer
* Do not apply records with LSN smaller than LSN of cached image in delta layer
* Do not set LSN for new FPI page
refer #1656
* Add page_is_new, page_get_lsn, page_set_lsn functions
* Fix page_is_new implementation
* Add comment from XLogReadBufferForRedoExtended
Fixes #1628
- add [`comfy_table`](https://github.com/Nukesor/comfy-table/tree/main) and use it to construct table for `pg list` CLI command
Comparison
- Old:
```
NODE ADDRESS TIMELINE BRANCH NAME LSN STATUS
main 127.0.0.1:55432 3823dd05e35d71f6ccf33049de366d70 main 0/16FB140 running
migration_check 127.0.0.1:55433 3823dd05e35d71f6ccf33049de366d70 main 0/16FB140 running
```
- New:
```
NODE ADDRESS TIMELINE BRANCH NAME LSN STATUS
main 127.0.0.1:55432 3823dd05e35d71f6ccf33049de366d70 main 0/16FB140 running
migration_check 127.0.0.1:55433 3823dd05e35d71f6ccf33049de366d70 main 0/16FB140 running
```
Make it remember when timeline starts in general and on this safekeeper in
particular (the point might be later on new safekeeper replacing failed one).
Bumps control file and walproposer protocol versions.
While protocol is bumped, also add safekeeper node id to
AcceptorProposerGreeting.
ref #1561
The principle is now as follows: an error in the acceptor threads (libpq and
http) will bring the pageserver down, but all per-tenant thread failures will
only be treated as errors.
A new `get_lsn_by_timestamp` command is added to the libpq page service
API.
An extra timestamp field is now stored after each
Clog page. It is the timestamp of the latest commit, among all the
transactions on the Clog page. To find the overall latest commit, we
need to scan all Clog pages, but this isn't a very frequent operation
so that's not too bad.
To find the LSN that corresponds to a timestamp, we perform a binary
search. The binary search starts with min = last LSN when GC ran, and
max = latest LSN on the timeline. On each iteration of the search we
check if there are any commits with a higher-than-requested timestamp
at that LSN.
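Roughly, the search looks like a standard lower-bound binary search over LSNs (the sketch below uses hypothetical helper names; the predicate stands in for the "any commits with a higher-than-requested timestamp at this LSN" check):
```
type Lsn = u64;
type Timestamp = i64;

// Returns the first LSN (between `min` and `max`) at which a commit with a
// timestamp greater than `search_ts` becomes visible, or `max` if none does.
fn find_lsn_for_timestamp(
    mut min: Lsn,
    mut max: Lsn,
    search_ts: Timestamp,
    has_newer_commit_at: impl Fn(Lsn, Timestamp) -> bool,
) -> Lsn {
    while min < max {
        let mid = min + (max - min) / 2;
        if has_newer_commit_at(mid, search_ts) {
            // A later commit is already visible at `mid`: the answer is at or below it.
            max = mid;
        } else {
            // No commit newer than the requested timestamp yet: move the lower bound up.
            min = mid + 1;
        }
    }
    min
}
```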
Implements github issue 1361.
* Traverse frozen layer in get_reconstruct_data in reverse order
* Fix comments on frozen layers.
Note explicitly the order that the layers are in the queue.
* Add fail point to reproduce failpoint iteration error
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
The proxy binary now accepts an `--auth-backend` CLI option, which determines
the auth scheme and cluster routing method. The following backends are currently
implemented:
* legacy
old method: when the username ends with `@zenith` it uses md5 auth and the dbname as
the cluster name; otherwise, it sends a login link and waits for the console
to call back
* console
new SCRAM-based console API; uses SNI info to select the destination
cluster
* postgres
uses postgres to select auth secrets of existing roles. Useful for local
testing
* link
sends login link for all usernames
* `cloud::legacy` talks to Cloud API V1.
* `cloud::api` defines Cloud API v2.
* `cloud::local` mocks the Cloud API V2 using a local postgres instance.
* It's possible to choose between API versions using the `--api-version` flag.
This is needed to forward the `ClientKey` that's required
to connect the proxy to a compute.
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
Commit edba2e97 renamed pageserver/README to pageserver/README.md, but
forgot to update links to it. Fix.
Rename libs/postgres_ffi/README and safekeeper/README files to also
have the .md extension, so that github can render them nicely.
Quote ascii-diagram in safekeeper/README.md so that it renders
correctly.
wal_keep_size is already set to 0 in our cloud setup, but we don't use this value in tests. This commit fixes wal_keep_size in control_plane and adds tests for WAL recycling and lagging safekeepers.
When the failpoint feature is disabled, the macro throws away the code passed
to it, so the code inside is not guaranteed to compile with the feature
disabled. In this particular case the code is obsolete, so remove it.
Remove when it is consumed by all of 1) pageserver (remote_consistent_lsn) 2)
safekeeper peers 3) s3 WAL offloading.
In the test, s3 offloading is for now mocked by directly bumping s3_wal_lsn.
ref #1403
Previously we used a table interface, but there was no easy way to pass
it as an override to the pageserver through the CLI. Use the same strategy
as for remote storage config parsing.
Add tenant config API and 'zenith tenant config' CLI command.
Add 'show' query to pageserver protocol for tenant-specific config parameters
Refactoring: move tenant_config code to a separate module.
Save the tenant conf file to the tenant's directory when the tenant is created, to recover it on pageserver restart.
Ignore errors during tenant config loading, while it is not yet supported by the console.
Define PiTR interval for GC.
refer #1320
This depends on a hacked version of the 'pprof-rs' crate. Because of
that, it's under an optional 'profiling' feature. It is disabled by
default, but enabled for release builds in CircleCI config. It doesn't
currently work on macOS.
The flamegraph is written to 'flamegraph.svg' in the pageserver
workdir when the 'pageserver' process exits.
Add a performance test that runs the perf_pgbench test, with profiling
enabled.
We only checked the cached page version when collecting WAL records in
an in-memory layer, not in a delta layer. Refactor the code so that we
always stop collecting WAL records when we reach a cached materialized
page.
Fix the assertion on the LSN range in
InMemoryLayer::get_value_reconstruct_data. It was supposed to check
that the requested LSN range is within the layer's LSN range, but the
inequality was backwards. That went unnoticed before, because the
caller always passed the layer's start LSN as the requested LSN
range's start LSN, but now we might stop the search earlier, if we have
a cached page version.
Co-authored-by: Konstantin Knizhnik <knizhnik@zenith.tech>
Unlink failure isn't serious on its own, we were about to remove the
file anyway, but it shouldn't happen and could be a symptom of
something more serious.
We just saw "No such file or directory" errors happening from
ephemeral file writeback in staging, and I suspect if we had this
warning in place, we would have seen these warnings too, if the
problem was that the ephemeral file was removed before dropping the
EphemeralFile struct. Next time it happens, we'll have more
information.
With this, we no longer need to build two versions of 'pem' and 'base64'
crates. Introduces a duplicate version of 'time' crate, though, but it's
still progress.
- Remove batch_others/test_pgbench.py. It was a quick check that pgbench
works, without actually recording any performance numbers, but that
doesn't seem very interesting anymore. Remove it to avoid confusing it
with the actual pgbench benchmarks
- Run pgbench with "-n" and "-S" options, for two different workloads:
simple-updates, and SELECT-only. Previously, we would only run it with
the "default" TPCB-like workload. That's more or less the same as the
simple-update (-n) workload, but I think the simple-update workload
is more relevant for testing storage performance. The SELECT-only
workload is a new thing to measure.
- Merge test_perf_pgbench.py and test_perf_pgbench_remote.py. I added
a new "remote" implementation of the PgCompare class, which allows
running the same tests against an already-running Postgres instance.
- Make the PgBenchRunResult.parse_from_output function more
flexible. pgbench can print different lines depending on the
command-line options, but the parsing function expected a particular
set of lines.
The PgProtocol.connect() function took extra options for username,
database, etc. Remove those options, and have a generic way for each
subclass of PgProtocol to provide some default options, with the
capability to override them in the connect() call.
It was the only test that used the 'schema' argument to the connect()
function. I'm about to refactor the option handling and will remove
the special 'schema' argument altogether, so rewrite the test to not
use it.
There's a lot of renaming left to do in the code and docs, but this is
a start. Our binaries and many other things are still called "zenith",
but I didn't change those in the README, because otherwise the
examples won't work. I added a brief note at the top of the README to
explain that we're in the process of renaming, until we've renamed
everything.
It originated from the fact that we were calling fetch_full_index
without releasing the read guard, and fetch_full_index tries to acquire
the read lock again. For a plain mutex this is already a deadlock; for an RW
lock the deadlock was reached by an attempt to acquire write access later in
the code while still holding an active read guard up the stack.
This is sort of a band-aid, because Kirill plans to change this code
during removal of the archiving mechanism.
* [proxy] Add SCRAM auth
* [proxy] Implement some tests for SCRAM
* Refactoring + test fixes
* Hide SCRAM mechanism behind `#[cfg(test)]`
Currently we only use it in tests, so we hide all relevant
module behind `#[cfg(test)]` to prevent "unused item" warnings.
* Add test for restore from WAL
* Fix python formatting
* Choose unused port in wal restore test
* Move recovery tests to zenith_utils/scripts
* Set LD_LIBRARY_PATH in wal recovery scripts
* Fix python test formatting
* Fix mypy warning
* Bump postgres version
* Bump postgres version
We now use a page cache for those, instead of slurping the whole index into
memory.
Fixes https://github.com/zenithdb/zenith/issues/1356
This is a backwards-incompatible change to the storage format, so
bump STORAGE_FORMAT_VERSION.
This introduces two new abstraction layers for I/O:
- Block I/O, and
- Blob I/O.
The BlockReader trait abstracts a file or something else that can be read
in 8kB pages. It is implemented by EphemeralFiles, and by a new
FileBlockReader struct that allows reading arbitrary VirtualFiles in that
manner, utilizing the page cache.
There is also a new BlockCursor struct that works as a cursor over a
BlockReader. When you create a BlockCursor and read the first page using
it, it keeps the reference to the page. If you access the same page again,
it avoids going to page cache and quickly returns the same page again.
That can save a lot of lookups in the page cache if you perform multiple
reads.
The Blob-oriented API allows reading and writing "blobs" of arbitrary
length. It is a layer on top of the block-oriented API. When you write
a blob with the write_blob() function, it writes a length field
followed by the actual data to the underlying block storage, and
returns the offset where the blob was stored. The blob can be
retrieved later using the offset.
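As a toy illustration of that contract (not the actual trait definitions; a `Vec<u8>` stands in for the underlying block storage):
```
use std::convert::TryInto;
use std::io::Write;

struct BlobWriter {
    buf: Vec<u8>, // stands in for the block-oriented storage underneath
}

impl BlobWriter {
    // Writes a length field followed by the data, returns the blob's offset.
    fn write_blob(&mut self, data: &[u8]) -> std::io::Result<u64> {
        let offset = self.buf.len() as u64;
        self.buf.write_all(&(data.len() as u32).to_be_bytes())?;
        self.buf.write_all(data)?;
        Ok(offset)
    }

    // Reads a blob back, given the offset returned by write_blob().
    fn read_blob(&self, offset: u64) -> Vec<u8> {
        let off = offset as usize;
        let len = u32::from_be_bytes(self.buf[off..off + 4].try_into().unwrap()) as usize;
        self.buf[off + 4..off + 4 + len].to_vec()
    }
}

fn main() -> std::io::Result<()> {
    let mut w = BlobWriter { buf: Vec::new() };
    let off = w.write_blob(b"page image or WAL record")?;
    assert_eq!(w.read_blob(off), b"page image or WAL record".to_vec());
    Ok(())
}
```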
Finally, this replaces the I/O code in image-, delta-, and in-memory
layers to use the new abstractions. These replace the 'bookfile'
crate.
This is a backwards-incompatible change to the storage format.
We have had these methods in the API for some time, so mentioning them in the
spec could be useful for the console (see zenithdb/console#867), as we generate
the pageserver HTTP API golang client there.
It happened in unit tests. If a thread tries to read a buffer while
already holding a lock on one buffer, the code to find a victim buffer
to evict could try to evict the buffer that's already locked. To fix,
skip locked buffers.
* Add a test case for reading historic page versions
Test that read_page_at_lsn returns correct results when compared to pageinspect.
Validate the possibility of reading pages from a dropped relation.
Ensure functions read the latest version when a null LSN is supplied.
Check that functions do not poison buffer cache with stale page versions.
Safekeepers now publish per-timeline data to etcd and pull it from there. The
immediate goal is WAL truncation, for which every safekeeper must know
remote_consistent_lsn; the next would be a callmemaybe replacement.
Adds corresponding '--broker' argument to safekeeper and ability to run etcd in
tests.
Adds test checking remote_consistent_lsn is indeed communicated.
workspace_hack is needed to avoid recompilation when different crates
inside the workspace depend on the same packages but with different
features being enabled. The problem occurs when you build crates separately,
one by one. So this is irrelevant to our CI setup, because there we build
all binaries at once, but it may be relevant for local development.
This also changes cargo's resolver version to 2.
Follow-up for #1417. Previously we had a problem uploading to S3
due to a huge amount of existing, not yet uploaded data. Now we have a
fresh pageserver with LSM storage on staging, so we can try enabling it
once again.
This is a backwards-incompatible change. The new pageserver cannot
read repositories created with an old pageserver binary, or vice
versa.
Simplify Repository to a value-store
------------------------------------
Move the responsibility of tracking relation metadata, like which
relations exist and what are their sizes, from Repository to a new
module, pgdatadir_mapping.rs. The interface to Repository is now a
simple key-value PUT/GET operations.
It's still not any old key-value store though. A Repository is still
responsible for handling branching, and every GET operation comes
with an LSN.
Mapping from Postgres data directory to keys/values
---------------------------------------------------
All the data is now stored in the key-value store. The
'pgdatadir_mapping.rs' module handles mapping from PostgreSQL objects
like relation pages and SLRUs, to key-value pairs.
The key to the Repository key-value store is a Key struct, which
consists of a few integer fields. It's wide enough to store a full
RelFileNode, fork and block number, and to distinguish those from
metadata keys.
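Schematically, the key might look something like this (field names and widths below are illustrative assumptions, not the actual Key definition):
```
// Wide enough to encode a full RelFileNode plus fork and block number,
// with a tag field to distinguish relation data from metadata keys.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Key {
    kind: u8,     // relation data vs. metadata keys
    spcnode: u32, // tablespace OID
    dbnode: u32,  // database OID
    relnode: u32, // relation OID
    forknum: u8,  // main fork, FSM, VM, ...
    blknum: u32,  // block number within the fork
}

fn rel_block_to_key(spcnode: u32, dbnode: u32, relnode: u32, forknum: u8, blknum: u32) -> Key {
    Key { kind: 0, spcnode, dbnode, relnode, forknum, blknum }
}
```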
'pgdatadir_mapping.rs' is also responsible for maintaining a
"partitioning" of the keyspace. Partitioning means splitting the
keyspace so that each partition holds a roughly equal number of keys.
The partitioning is used when new image layer files are created, so
that each image layer file is roughly the same size.
The partitioning is also responsible for reclaiming space used by
deleted keys. The Repository implementation doesn't have any explicit
support for deleting keys. Instead, the deleted keys are simply
omitted from the partitioning, and when a new image layer is created,
the omitted keys are not copied over to the new image layer. We might
want to implement tombstone keys in the future, to reclaim space
faster, but this will work for now.
Changes to low-level layer file code
------------------------------------
The concept of a "segment" is gone. Each layer file can now store an
arbitrary range of Keys.
Checkpointing, compaction
-------------------------
The background tasks are somewhat different now. Whenever
checkpoint_distance is reached, the WAL receiver thread "freezes" the
current in-memory layer, and creates a new one. This is a quick
operation and doesn't perform any I/O yet. It then launches a
background "layer flushing thread" to write the frozen layer to disk,
as a new L0 delta layer. This mechanism takes care of durability. It
replaces the checkpointing thread.
Compaction is a new background operation that takes a bunch of L0
delta layers, and reshuffles the data in them. It runs in a separate
compaction thread.
Deployment
----------
This also contains changes to the ansible scripts that enable having
multiple different pageservers running at the same time in the staging
environment. We will use that to keep an old version of the pageserver
running, for clusters created with the old version, at the same time
with a new pageserver with the new binary.
Author: Heikki Linnakangas
Author: Konstantin Knizhnik <knizhnik@zenith.tech>
Author: Andrey Taranik <andrey@zenith.tech>
Reviewed-by: Matthias Van De Meent <matthias@zenith.tech>
Reviewed-by: Bojan Serafimov <bojan@zenith.tech>
Reviewed-by: Konstantin Knizhnik <knizhnik@zenith.tech>
Reviewed-by: Anton Shyrabokau <antons@zenith.tech>
Reviewed-by: Dhammika Pathirana <dham@zenith.tech>
Reviewed-by: Kirill Bulatov <kirill@zenith.tech>
Reviewed-by: Anastasia Lubennikova <anastasia@zenith.tech>
Reviewed-by: Alexey Kondratov <alexey@zenith.tech>
More rows, and test with serial and parallel plans. But fewer iterations,
so that the tests run in < 1 minute, and we don't need to mark them as
"slow".
With a Mutex, only one thread could read from the layer at a time. I did
some ad hoc profiling with pgbench and saw that a fair amount of time was
spent blocked on these Mutexes.
It doesn't make much sense to compare TimelineMetadata structs with
< or >. But we depended on that in the remote storage upload code,
so replace BTreeSets with Vecs there.
* [proxy] Propagate most errors to user
This change enables propagation of most errors to the user
(e.g. auth and connectivity errors). Some of them will be
stripped of sensitive information.
As a side effect, most occurrences of `anyhow::Error` were
replaced with concrete error types.
* [proxy] Box weighty errors
Have a separate routine and http endpoint to create a timeline on safekeepers. It is
not used yet, i.e. the timeline is still created implicitly, but we'll change that
once the infrastructure for learning which timelines are assigned to which safekeepers
is ready, preventing accidental creation by compute.
Changes the format of the safekeeper control file, allowing it to store the set of
peers. Knowing the peers provides part of the foundation for peer
recovery (calculating min horizons like truncate_lsn for WAL truncation and
commit_lsn for sync-safekeepers replacement) and proper membership change;
similarly, we don't use it yet.
Taking advantage of the control file version bump, extracts tenant_id and
timeline_id to the top level where they are more suitable. Also adds a bunch of
LSNs there and renames truncate_lsn to the more specific peer_horizon_lsn.
* Add --id argument to safekeeper setting its unique u64 id.
In preparation for storage node messaging. IDs are supposed to be monotonically
assigned by the console. In tests it is issued by ZenithEnv; at the zenith cli
level and fixtures, string name is completely replaced by integer id. Example
TOML configs are adjusted accordingly.
Sequential ids are chosen over Zid mainly because they are compact and easy to
type/remember.
* add node id to pageserver
This adds node id parameter to pageserver configuration. Also I use a
simple builder to construct pageserver config struct to avoid setting
node id to some temporary invalid value. Some of the changes in test
fixtures are needed to split init and start operations for the environment.
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
* new deployment flow for staging and production
* ansible playbooks and circleci config fixes
* cleanup before merge
* additional cleanup before merge
* debug deployment to staging env
* debug deployment to staging env
* debug deployment to staging env
* debug deployment to staging env
* debug deployment to staging env
* debug deployment to staging env
* binaries artifacts path fix for ansible playbooks
* deployment flow refactored
* base64 decode fix for ssh key
* fix for console notification and production deploy settings
* cleanup after deployment tests
* fix - trigger release binaries download for production deploy
When several AppendRequests can be read from the socket without blocking,
they are processed together and fsync() on the segment file is only called
once. The segment file is no longer opened for every write request; the
last opened file is now cached inside PhysicalStorage. A new metric for WAL
flushes, FLUSH_WAL_SECONDS, was added to the storage. More errors were
added to the storage for non-sequential WAL writes; now write_lsn can be
moved only with calls to truncate_lsn(new_lsn).
New messages have been added to ProposerAcceptorMessage enum. They
can't be deserialized directly and now are used only for optimizing
flushes. Existing protocol wasn't changed and flush will be called for
every AppendRequest, as it was before.
Since commit fdd987c3ad, it was only used in InMemoryLayers. Let's
just "inline" the code into InMemoryLayer itself.
I originally did this as part of a bigger PR (#1267). With that PR,
one in-memory layer, and one ephemeral file, would hold page versions
belonging to multiple segments. Currently, PageVersions can only hold
versions for a single segment, so that would need to be changed.
Rather than modify PageVersions to support that, just remove it
altogether.
These tests have intimate knowledge of the directory layout and layer
file names used by the LayeredRepository implementation of the
Repository trait. Move them, so that all the tests that remain in
repository.rs are expected to work without changes with any
implementation of Repository. Not that we have any plans to create
another Repository implementation any time soon, but as long as we
have the Repository interface, let's try to maintain that abstraction
in the tests too.
The test creates a page version with a string like "foo 123 at 0/10"
as the content. But the LSN stored in that string was wrong: the page
version stored at LSN 0/20 would say "foo <blk> at 0/10".
wal_storage.rs was split up from timeline.rs, safekeeper.rs and send_wal.rs,
and now contains all WAL related code from the safekeeper. Now there are
PhysicalStorage for persisting WAL to disk and WalReader for reading it.
This allows optimizing PhysicalStorage without affecting too much of other
code.
Also there is a separate structure for persisting control file now in
control_file.rs.
The previous version of the spec caused parsing errors in generated clients,
as the return type is an object, not an array; also one field was missing. In
passing, set `format: hex` on ancestor_id too, as the value conforms to
that format.
This change makes most parts of the code asynchronous, except
for the `mgmt` subsystem (we're going to drop it anyway).
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
If a heap UPDATE record modified two pages, and both pages needed to have
their VM bits cleared, and the VM bits were located on the same VM page,
we would emit two ZenithWalRecord::ClearVisibilityMapFlags records for
the same VM page. That produced warnings like this in the pageserver log:
Page version Wal(ClearVisibilityMapFlags { heap_blkno: 18, flags: 3 }) of rel 1663/13949/2619_vm blk 0 at 2A/346046A0 already exists
To fix, change ClearVisibilityMapFlags so that it can update the bits
for both pages as one operation.
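A hedged sketch of the shape of the fix (the field names below are assumptions for illustration, not necessarily the actual ZenithWalRecord definition): one record can carry the VM-bit clearing for both the old and the new heap page, so a single UPDATE never emits two records for the same VM page.
```
#[allow(dead_code)]
enum WalRecordSketch {
    ClearVisibilityMapFlags {
        // Block of the new tuple version, if its VM bit must be cleared.
        new_heap_blkno: Option<u32>,
        // Block of the old tuple version, if its VM bit must be cleared.
        old_heap_blkno: Option<u32>,
        // Which visibility-map bits to clear.
        flags: u8,
    },
}
```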
This was already covered by several python tests, so no need to add a
new one. Fixes #1125.
Co-authored-by: Konstantin Knizhnik <knizhnik@zenith.tech>
It was printing a lot of stuff to the log with INFO level, for routine
things like receiving or sending messages. Reduce the noise. The amount
of logging was excessive, and it was also consuming a fair amount of CPU
(about 20% of safekeeper's CPU usage in a little test I ran).
* Always initialize flush_lsn/commit_lsn metrics on a specific timeline, no more `n/a`
* Update flush_lsn metrics missing from cba4da3f4d
* Ensure that flush_lsn found on load is >= than both commit_lsn and truncate_lsn
* Add some debug logging
Use GUC zenith.max_cluster_size to set the limit.
If limit is reached, extend requests will throw out-of-space error.
When current size is too close to the limit - throw a warning.
Add new test: test_timeline_size_quota.
Use log::error!() instead. I spotted a few of these "connection error"
lines in the logs, without timestamps and the other stuff we print for
all other log messages.
Timeline is active whenever there is at least 1 connection from compute or
pageserver is not caught up. Currently 'active' means callmemaybes are being
sent.
Fixes race: now suspend condition checking and callmemaybe unsubscribe happen
under the same lock.
Now it's possible to call Fe{Startup,}Message in both
sync and async contexts, which is good for proxy.
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
* Fix checkpoint.nextXid update
* Add test for checkpoint.nextXid
* Fix indentation of test_next_xid.py
* Fix mypy error in test_next_xid.py
* Tidy up the test case.
* Add a unit test
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Freeze vectors at the same end LSN
* Fix calculation of last LSN for inmem layer
* Do not advance disk_consistent_lsn if no open layer was evicted
* Fix calculation of freeze_end_lsn
* Let start_lsn be larger than oldest_pending_lsn
* Rename 'oldest_pending_lsn' and 'last_lsn', add comments.
* Fix future_layerfiles test
* Update comments concerning oldest_lsn
* Update comments concerning oldest_lsn
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
Now zenith_cli handles the wal_acceptors config internally, and if we
append wal_acceptors to postgresql.conf in python tests, then
it will contain a duplicate wal_acceptors config.
Currently ztimelineids are unique, but all APIs accept the pair, so let's keep
it everywhere for uniformity.
Carry around ZTTId containing both ZTenantId and ZTimelineId for simplicity.
(existing clusters on staging ought to be preprocessed for that)
* Reproduce github issue #1047.
* Use RwLock to protect gc_cutoff_lsn
* Reduce number of updates in test_gc_aggressive
* Change test_prohibit_get_page_at_lsn_for_garbage_collected_pages test
* Change test_prohibit_get_page_at_lsn_for_garbage_collected_pages
* Lock latest_gc_cutoff_lsn in all operations accessing storage to prevent race conditions with GC
* Remove random sleep between wait_for_lsn and get_page_at_lsn
* Initialize latest_gc_cutoff with initdb_lsn and remove separate check that lsn >= initdb_lsn
* Update test_prohibit_branch_creation_on_pre_initdb_lsn test
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
to pass current_timeline_size to compute node
Put standby_status_update fields into ZenithFeedback and send them as one message.
Pass values sizes together with keys in ZenithFeedback message.
This patch includes attach/detach http endpoints in pageservers. Some
changes in callmemaybe handling inside the safekeeper and an integration
test to check migration with and without load. There are still some
rough edges that will be addressed in follow up patches
Mainly because it has better support for installing packages for
different python versions.
It also has a better dependency resolver than Pipenv, and supports the modern
standard for python dependency management. This includes using
pyproject.toml for project-specific configuration instead of per-tool
conf files. See the following links for details:
https://pip.pypa.io/en/stable/reference/build-system/pyproject-toml/
https://www.python.org/dev/peps/pep-0518/
* Do not delete layers beyond the cutoff LSN
* Update pageserver/src/layered_repository/layer_map.rs
Co-authored-by: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Co-authored-by: Heikki Linnakangas <heikki.linnakangas@iki.fi>
This introduces a new module to handle thread creation and shutdown.
All page server threads are now registered in a global hash map, and
there's a function to request individual threads to shut down gracefully.
Thread shutdown request is signalled to the thread with a flag, as well
as a Future that can be used to wake up async operations if shutdown is
requested. Use that facility to have the libpq listener thread respond
to pageserver shutdown, based on Kirill's earlier prototype
(https://github.com/zenithdb/zenith/pull/1088). That addresses
https://github.com/zenithdb/zenith/issues/1036, previously the libpq
listener thread would not exit until one more connection arrives.
This also eliminates a resource leak in the accept() loop. Previously,
we added the JoinHandle of each new thread to a vector but old handles
for threads that had already exited were never removed.
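For illustration, a minimal sketch of such a registry, with made-up names and a Vec instead of the real hash map (the actual module also exposes a Future so async operations can be woken on shutdown):
```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::{Arc, Mutex};
use std::thread::JoinHandle;

// Hypothetical global registry of page server threads.
static NEXT_THREAD_ID: AtomicU64 = AtomicU64::new(1);
static THREADS: Mutex<Vec<(u64, Arc<AtomicBool>, JoinHandle<()>)>> = Mutex::new(Vec::new());

/// Spawn a thread and register it together with a per-thread shutdown flag.
fn spawn_registered<F>(f: F) -> u64
where
    F: FnOnce(Arc<AtomicBool>) + Send + 'static,
{
    let id = NEXT_THREAD_ID.fetch_add(1, Ordering::Relaxed);
    let shutdown = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&shutdown);
    let handle = std::thread::spawn(move || f(flag));
    THREADS.lock().unwrap().push((id, shutdown, handle));
    id
}

/// Ask one thread to shut down gracefully and wait for it. The JoinHandle is
/// removed from the registry, so handles of finished threads don't pile up.
fn shutdown_thread(id: u64) {
    let mut threads = THREADS.lock().unwrap();
    let pos = threads.iter().position(|(tid, _, _)| *tid == id);
    if let Some(pos) = pos {
        let (_, shutdown, handle) = threads.remove(pos);
        drop(threads); // don't hold the registry lock while joining
        shutdown.store(true, Ordering::Relaxed);
        let _ = handle.join();
    }
}
```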
Log the error and continue. Hopefully it's a transient failure.
This might have been happening in staging earlier, when the safekeeper
had a problem where it opened connections very frequently to issue
"callmemaybe" commands. If you launch too many threads too fast, you might
run out of file descriptors or something. It's not totally clear what
happened, but with this commit, at least the page server will continue to run
and accept new connections, if a transient error happens.
'anyhow' crate can include a backtrace in all errors, when the
'backtrace' feature is enabled. Enable it, and change the places that used
'{:#}' or '{}' to '{:?}', so that the backtrace is printed.
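A small sketch of what this looks like in practice, assuming `anyhow` with the `backtrace` feature enabled in Cargo.toml:
```rust
// Cargo.toml (sketch): anyhow = { version = "1", features = ["backtrace"] }
use anyhow::{Context, Result};

fn load_layer() -> Result<Vec<u8>> {
    let bytes = std::fs::read("missing-file").context("failed to read layer file")?;
    Ok(bytes)
}

fn main() {
    if let Err(err) = load_layer() {
        // "{}" and "{:#}" print only the error and its causes; "{:?}"
        // additionally prints the captured backtrace when the feature is
        // enabled (and RUST_BACKTRACE is set).
        eprintln!("{:?}", err);
    }
}
```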
A timeline ID is only guaranteed to be unique for a particular tenant,
so you need to use tenant ID + timeline ID as the key, rather than just
timeline ID.
The safekeeper currently makes the same assumption, and we should fix that
too, but this commit just addresses this one case in the page server.
In passing, reorder some function arguments to be more consistent.
The walkeeper launches two threads for each connection, and uses a guard
object to remove the entry from the 'replicas' array when it finishes. But only
the background thread held onto the guard object, so if the background
thread finished before the other thread, the array entry would be
removed prematurely, which led to a panic in the check_stop_streaming()
call.
Fixes https://github.com/zenithdb/zenith/issues/1103
Hex-encode zids there for better output; since Serde doesn't support several
formats for one struct, the on-disk representation is changed as well; make
upgrade.rs cope with it.
to avoid a subtle race condition.
Without a safekeeper, walreceiver reconnection can get stuck
because of an IO deadlock between walsender auth and a regular backend.
* Do not hold timelines lock during GC
refer #1087
* Add gc_cs mutex for preventing creation of new timelines during GC
* Make clippy happy
* Use Mutex<()> instead of Mutex<i32> for GC critical section
Introduce the concept of a "ZenithWalRecord", which can be a Postgres WAL
record that is replayed with the Postgres WAL redo process, or a built-in
type that is handled entirely by pageserver code.
Replace the special code to replay Postgres XACT commit/abort records
with new Zenith WAL records. A separate zenith WAL record is created for
each modified CLOG page. This allows removing the 'main_data_offset'
field from stored PostgreSQL WAL records, which saves some memory and
some disk space in delta layers.
Introduce zenith WAL records for updating bits in the visibility map.
Previously, when e.g. a heap insert cleared the VM bit, we duplicated the
heap insert WAL record for the affected VM page. That was very wasteful.
The heap WAL record could be massive, containing a full page image in
the worst case. This addresses github issue #941.
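A rough sketch of the idea; the actual variant and field names in the pageserver differ in detail:
```rust
// Illustrative only: a WAL record is either a raw Postgres record, replayed
// by the WAL redo process, or a built-in record handled by pageserver code.
enum ZenithWalRecord {
    /// A regular Postgres WAL record, replayed by the WAL redo process.
    Postgres { will_init: bool, rec: Vec<u8> },
    /// Built-in record: set commit/abort bits on one CLOG page.
    ClogSetCommitted { xids: Vec<u32> },
    ClogSetAborted { xids: Vec<u32> },
    /// Built-in record: clear visibility map bits for the heap page(s)
    /// touched by an insert/update/delete.
    ClearVisibilityMapFlags {
        new_heap_blkno: Option<u32>,
        old_heap_blkno: Option<u32>,
        flags: u8,
    },
}
```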
The first COPY generates about 230 MB of write I/O, but the second
COPY, after deleting most of the rows and vacuuming the rows away,
generates 370 MB of writes. Both COPYs insert the same amount of data,
so they should generate roughly the same amount of I/O. This commit
doesn't try to fix the issue, just adds a test case to demonstrate it.
Add a new 'checkpoint' command to the pageserver API. Previously,
we've used 'do_gc' for that, but many tests, including this new one,
really only want to perform a checkpoint and don't care about GC. For
now, I only used the command in the new test, though, and didn't
convert any existing tests to use it.
It creates a busy loop if the pageserver <-> safekeeper connection fails after it was
established (e.g. currently due to a 'segment checkpoint not found' error on
the pageserver).
Also wake up the callmemaybe thread regularly, once per recall_period, regardless of
channel activity.
This patch makes it possible to shut down a wal receiver when there are no messages
and the wal receiver is blocked inside tokio-postgres. In this case it
cannot check the shutdown flag.
This patch switches to using the async interface of tokio-postgres directly,
without sync wrappers. It opens the possibility to use tokio::select!
between physical_stream.next() and a shutdown channel readiness to
interrupt the replication process.
This also allows shutting down only a particular wal receiver without
using the global shutdown_requested flag.
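A minimal sketch of that select loop, with illustrative types and names (not the actual walreceiver code):
```rust
use futures::StreamExt;
use tokio::sync::watch;

// `physical_stream` stands in for the tokio-postgres replication stream;
// `shutdown_rx` is a per-walreceiver shutdown channel.
async fn walreceiver_loop(
    mut physical_stream: impl futures::Stream<Item = Vec<u8>> + Unpin,
    mut shutdown_rx: watch::Receiver<bool>,
) {
    loop {
        tokio::select! {
            msg = physical_stream.next() => match msg {
                Some(wal) => {
                    // decode and ingest the WAL chunk
                    let _ = wal;
                }
                None => break, // stream ended
            },
            _ = shutdown_rx.changed() => {
                // shutdown requested for this particular walreceiver only
                break;
            }
        }
    }
}
```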
Currently it's included with minimal changes and lives outside of the main
workspace. Later we may re-use and combine common parts with zenith
control_plane.
This change is mostly needed to unify cloud deployment pipeline:
1.1. build compute-tools image
1.2. build compute-node image based on the freshly built compute-tools
2. build zenith image
So we can roll out a new compute image and the new storage required by it to
operate properly. It also becomes easier to test the console against some
specific version of compute-node/-tools.
If safekeepers sync fast enough, the callmemaybe thread may never make a call before receiving an Unsubscribe request. This leads to a situation where the pageserver lacks data that exists on the safekeepers.
Do it separately, with a SafekeeperPostgresCommand enum as the result. Since the query is
always a C string, switch the postgres_backend process_query argument from Bytes to
&str.
Make passing ztli/ztenant id in safekeeper connection string optional; this is
needed for upcoming intra-safekeeper heartbeat cmd which is not bound to any
timeline.
Change meaning of lsns in HOT_STANDBY_FEEDBACK:
flush_lsn = disk_consistent_lsn,
apply_lsn = remote_consistent_lsn
Update compute node backpressure configuration accordingly.
Update compute node configuration:
set 'synchronous_commit=remote_write' in setup without safekeepers.
This way compute node doesn't have to wait for data checkpoint on pageserver.
This doesn't guarantee data durability, but we only use this setup for tests, so it's fine.
Introduce builder objects, DeltaLayerWriter and ImageLayerWriter.
This gives more flexibility, as the DeltaLayer::create and
ImageLayer::create functions don't need to know about the details of
the format of where the page versions are coming from. This allows us
to change the format used in InMemoryLayer more easily, without having
to modify Delta- and ImageLayer code.
Also refactor the code in InMemoryLayer::write_to_disk for clarity.
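Illustrative shape of such a builder, with made-up signatures; the real writers take richer arguments and `finish()` returns the completed layer object:
```rust
// Sketch only: the point is that callers feed page versions in one by one,
// so the writer doesn't need to know where they come from.
struct DeltaLayerWriter {
    // open file, segment/LSN range, bookkeeping ...
}

impl DeltaLayerWriter {
    fn new(/* tenant, timeline, segment, LSN range */) -> anyhow::Result<Self> {
        Ok(DeltaLayerWriter {})
    }

    /// Append one page version. The caller decides where the versions come
    /// from, e.g. an in-memory layer's ephemeral file.
    fn put_page_version(&mut self, _blknum: u32, _lsn: u64, _value: &[u8]) -> anyhow::Result<()> {
        Ok(())
    }

    /// Write the index and footer, producing the finished, immutable layer.
    fn finish(self) -> anyhow::Result<()> {
        Ok(())
    }
}
```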
Previously, the 'blknum' argument of various Layer functions was the
block number within the overall relation. That was pretty confusing,
because an individual layer only holds data from a one segment of the
relation. Furthermore, the 'put_truncation' function already dealt
with per-segment size, not overall relation size, adding to the
confusion.
Change the meaning of the 'blknum' argument to mean the block number
within the segment, not the overall relation.
If a commit record contains XIDs that are stored on different CLOG pages,
we duplicate the commit record for each affected CLOG page. In the redo
routine, we must only apply the parts of the record that apply to the
CLOG page being restored. We got that right in the loop that handles the
sub-XIDs, but incorrectly always set the bit that corresponds to the main
XID.
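To illustrate the fix, a small sketch of the relevant arithmetic, using the constants from Postgres's clog.c (the helper names are made up):
```rust
// Constants as in Postgres clog.c: 2 status bits per transaction.
const BLCKSZ: u32 = 8192;
const CLOG_XACTS_PER_BYTE: u32 = 4;
const CLOG_XACTS_PER_PAGE: u32 = BLCKSZ * CLOG_XACTS_PER_BYTE; // 32768

fn clog_page_of(xid: u32) -> u32 {
    xid / CLOG_XACTS_PER_PAGE
}

/// When replaying the per-page copy of a commit record while restoring CLOG
/// page `pageno`, only XIDs that actually live on that page may have their
/// bits set; the main XID gets no special treatment.
fn xids_on_page(pageno: u32, main_xid: u32, subxids: &[u32]) -> Vec<u32> {
    std::iter::once(main_xid)
        .chain(subxids.iter().copied())
        .filter(|&xid| clog_page_of(xid) == pageno)
        .collect()
}
```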
The logic to compute the page number was broken, and as a result, only
the first page of multixact members was updated correctly. All the
rest were left as zeros. Improve test_multixact.py to generate more
multixacts, to cover this case.
Also fix the check that the restored PG data directory matches the
original one. Previously, the test compared the 'pg_new' cluster,
which is a bit silly because the test restored the 'pg_new' cluster
only a few lines earlier, so if the multixact WAL redo is somehow
broken, the comparison will just compare two broken data directories
and report success. Change it to compare the original datadir, the one
where the multixacts were originally created, with a restored image of
the same.
The WAL stream uses two connections:
1. Compute node (walproposer) -> Safekeeper (ReceiveWalConn module)
When the compute node is shut down, the safekeeper needs to stop the respective receiving thread.
Prior to this PR it didn't work, because PostgresBackend didn't handle disconnection properly.
2. Safekeeper (ReplicationConn module) -> pageserver (walreceiver thread)
When the incoming WAL stream is gone, the safekeeper can stop streaming WAL and cancel the connection as soon as the replica has caught up.
Note that WAL can be streamed to multiple replicas simultaneously; only disconnect the ones that are caught up to the last_received_lsn.
The ephemeral files are not usable after restart, so just delete them.
Before this, you got "unrecognized filename in timeline dir" warnings
about them, as Konstantin noted at:
https://github.com/zenithdb/zenith/issues/906#issuecomment-995530870.
While we're at it, refactor away the list_files() function, moving the
logic fully into the caller. Seems more straightforward.
Rename save_decoded_record() to ingest_record(), and move the
responsibility for decoding the record into ingest_record().
Also move the responsibility of updating the CheckPoint relish to
ingest_record(). Put it in a new WalIngest struct, to help with tracking
that.
We depend on rustls in postgres_backend anyway, so might as well use it
for all TLS stuff. Seems better to depend on only one library both from a
security point of view, and because fewer dependencies means less code to
compile. With this commit, we no longer depend on OpenSSL.
Move the code for decoding a WAL stream into WAL records into
'postgres_ffi', and keep the code to parse the WAL records deeper in
'pageserver' crate, renamed to walrecord.rs.
This tidies up the dependencies a bit. 'walkeeper' reuses the same
waldecoder routines, and it used to depend on 'pageserver' because of
that. Now it only depends on 'postgres_ffi'.
(The comment in walkeeper/Cargo.toml that claimed that the dependency was
needed for ZTimelineId was obsolete. ZTimelineId is defined in
'zenith_utils', the dependency was actually needed for the waldecoder.)
0.28.0 includes two changes I submitted to upstream:
- Add support for older ListObjects API, needed to use rust-s3 with Google
Cloud Storage: https://github.com/durch/rust-s3/pull/229
- If file is smaller than one chunk, don't initiate multi-part upload.
https://github.com/durch/rust-s3/pull/228
These are not critical for Zenith right now, but let's stay up-to-date.
Currently, we return an all-zeros page if you request a block beyond end of
a relation. That has been implemented in LayeredTimeline::materialize_page,
so that if Layer::get_page_reconstruct_data returns Missing, it returns
an all-zeros page.
However InMemoryLayer and DeltaLayer would return Continue, not Missing,
in that case, and materialize_page would try to find the predecessor
layer. If there was a preceding image layer, then everything would still
work, but if there wasn't, it would return a "could not find predecessor
of layer" error. Fix that in InMemoryLayer and DeltaLayer, making them
check the size of the relation and return Missing in that case.
This is hard to reproduce at the moment, but it happened quickly with
pgbench when I modified InMemoryLayer::write_to_disk so that it didn't
always create a new ImageLayer.
- Don't spawn a separate thread for each connection.
Instead, use one thread per safekeeper that iterates over all connections and sends callback requests for them.
- Use tokio-postgres to connect to the pageserver, to avoid spawning a new thread for each connection.
callmemaybe review fixes:
- Spawn all request_callback tasks separately.
- Remember 'last_call_time' and only send request_callback if 'recall_period' has passed.
- If task hasn't finished till next recall, abort it and try again.
- Add pause/resume CallmeEvents to avoid spamming pageserver when connection already established.
This is needed for implementation of tenant rebalancing. With this
change safekeeper becomes aware of which pageserver is supposed to be
used for replication from this particular compute.
Should have been a part of cba4da3f4d to provide upgrade for previously
existing clusters. Separates version independent header (magic + version) out of
SafeKeeperState to choose what to deserialize.
This patch introduces fixes for several problems affecting
LLVM-based code coverage:
* Daemonizing parent processes should call _exit() to prevent
coverage data file corruption (*.profraw) due to concurrent writes.
* Implement proper shutdown handlers in safekeeper.
write_lsn - The last LSN received and processed by pageserver's walreceiver.
flush_lsn - same as write_lsn. At the pageserver it doesn't guarantee data persistence, but that's fine. We rely on safekeepers.
apply_lsn - The LSN at which pageserver guaranteed persistence of all received data (disk_consistent_lsn).
Persist full history of term switches on safekeepers instead of storing only the
single term of the highest entry (called epoch). This allows us to easily and
correctly find the divergence point of two logs and truncate the obsolete part
before overwriting it with entries of the newer proposer(s).
The full history of the proposer is transferred in a separate message before the proposer
starts streaming; it is immediately persisted by the safekeeper, though it might not
yet have entries for some older terms there. That's because we can't atomically
append to WAL and update the control file anyway, so locally available WAL must
be taken into account when looking at the history.
We should sometimes purge term history entries beyond truncate_lsn; this is not
done here.
Per https://github.com/zenithdb/rfcs/pull/12. Closes #296.
Bumps vendor/postgres.
While we're at it, reuse the Book and the VirtualFile that's backing
it even over unload() calls. Previously, we would keep the Book open,
but on load(), we would re-open it anyway, which didn't make much
sense. Now we reuse it it. Alternatively, perhaps we should close it
on unload() to save some memory, but this I'm not going to think too
hard about it right now as the whole load/unload thing is a bit of a
hack and needs to be rewritten.
This is hard to reproduce ATM, because the incorrect state would get
fixed by an unload(). A checkpoint creates the DeltaLayer, and it also
calls unload() afterwards, so the window is not very large. I hit it
occasionally with a scale 1000 pgbench test, after I had modified
InMemoryLayer::write_to_disk() to not write an image layer every time,
which made the DeltaLayers be accessed more often.
This doesn't show up at the moment, because we never create a delta
layer with end-LSN equal to the last LSN. We always create an image
layer at that LSN instead. For example, if the latest processed LSN is
100, we would create a delta layer with end LSN 100 (exclusive), and
an image layer at 100. But that's just how InMemoryLayer::write_to_disk
happens to work at the moment, there's no fundamental reason it needs
to always create that image layer. I noticed this bug when I tried to
change the logic in InMemoryLayer::write_to_disk to only create an
image layer after a few delta layers.
The "in-memory layer" is misnomer now, each in-memory layer is now actually
backed by a file. The files are ephemeral, in that they don't survive page
server crash or shutdown.
To avoid reading the file for every operation,
"ephemeral files" are cached in a page cache.
This includes changes from the 'inmemory-layer-chunks' branch to serialize
the page versions when they are added to the open layer. The difference is
that they are not serialized to the expandable in-memory "chunk buffer", but
written out to the file.
Out of scope LSNs include pre initdb LSNs, and LSNs prior to
latest_gc_cutoff.
To get there, there were also two cleanups:
* Fix error handling in the Execute message handler. This fixes the behaviour
when basebackup returned an error. Previously the pageserver thread just
died.
* Remove the "ancestor" file, which previously contained the ancestor id and
branch lsn. The same data can now be obtained from the metadata file.
The way we handled the ancestor file in the code introduced a case where, if
branching failed, the timeline directory was created with no data in it
except the ancestor file. That confused gc, because it scans
directories. So it is better to just remove the ancestor file and clean up
the timeline directory creation so it happens after all validity
checks have passed.
Ever since we've had frozen in-memory layers, having an 'end_lsn' no
longer means that the layer has been dropped. Need to check the 'dropped'
flag explicitly.
This was reliably causing a failure on the new 'test_parallel_copy' test
in https://github.com/zenithdb/zenith/pull/864. I'm not sure why it
doesn't happen on main branch, but the bug is pretty straightforward when
you see it.
This introduces a new timeline field, latest_gc_cutoff. It is updated
before each gc iteration. A new check is added to branch_timelines to
prevent branch creation with a start point less than latest_gc_cutoff.
This also adds a check to get_page_at_lsn which asserts that the lsn at
which the page is requested was not garbage collected. This check is
currently triggered for read-only nodes which are pinned to a specific
lsn: because they are not tracked, pageserver garbage collection
can remove data that they still reference. This is a bug and will
be fixed separately.
The buffer cache is shared across all tenants, allowing memory to be
dynamically allocated where it's needed the most. The cache works on 8 kB
pages, and uses the clock algorithm for replacement policy; same as the
PostgreSQL buffer cache.
One peculiarity is that the materialized page versions can be looked up
by an inexact LSN, to find the latest page version with an LSN >= the
search key.
The code is structured to support caching other kinds of pages in the same
cache in the future, but with a different mapping key.
Co-authored-by: Patrick Insinger <patrick@zenith.tech>
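A minimal sketch of the clock sweep described above, with illustrative types (the real cache also tracks the mapping key, pin counts and dirty state differently):
```rust
const PAGE_SZ: usize = 8192;

struct Slot {
    usage_count: u8,
    pinned: bool,
    buf: Box<[u8; PAGE_SZ]>,
}

struct PageCache {
    slots: Vec<Slot>,
    next_victim: usize,
}

impl PageCache {
    /// Sweep the clock hand until a slot with usage_count == 0 is found,
    /// decrementing usage counts along the way, as in the Postgres buffer
    /// cache (assumes not every slot is pinned, otherwise this sketch spins).
    fn find_victim(&mut self) -> usize {
        loop {
            let i = self.next_victim;
            self.next_victim = (self.next_victim + 1) % self.slots.len();
            let slot = &mut self.slots[i];
            if slot.pinned {
                continue;
            }
            if slot.usage_count == 0 {
                return i;
            }
            slot.usage_count -= 1;
        }
    }
}
```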
During parallel load of a table, Postgres sometimes requests a page from
the page server for which no WAL has been generated yet. That's normal;
Postgres expects the page to be full of zeros. There was a special case
for that in LayeredTimeline::materialize_page, but the problem remained
when you're crossing a segment boundary, so that there's no layer for
the segment at all.
It would be nice to have a more robust cross-check for this case. That
might need help from the Postgres side. But this extends the bandaid fix
we had in materialize_page() to the case where we cross a segment boundary.
Fixes https://github.com/zenithdb/zenith/issues/841
There were two separate locking issues that could lead to a deadlock,
both related to holding a lock for longer than necessary:
1. In the loop in `VirtualFile::with_file`, the "handle_guard" was
held across iterations of the loop. Because of that, if the handle was
changed by a concurrent thread, the loop would try to acquire the
handle lock, when it was still holding the lock from previous
iteration. To fix, release the lock earlier. There was no need to hold
it across iterations, it was just accidental.
2. In the same function, we also held the "slot_guard" longer than
necessary. It's only needed in the first part of the loop, where we
check if the current handle is valid. If it's not, the slot lock can
be immediately released. But it was not, it was kept over the
acquisition of the handle lock. I'm not sure if that alone could cause
problems, but let's release the lock as soon as possible anyway.
Add a test case, based on Konstantin's test program to demonstrate the
deadlock.
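A sketch of the general fix pattern, with illustrative locks and names rather than the actual VirtualFile code:
```rust
use std::sync::Mutex;

// Release each guard as soon as it is no longer needed, instead of holding it
// across the acquisition of the other lock or across loop iterations.
fn with_file(slot_lock: &Mutex<Option<u64>>, handle_lock: &Mutex<u64>) {
    loop {
        let slot_guard = slot_lock.lock().unwrap();
        let handle_is_valid = slot_guard.is_some();
        // The slot lock was only needed for the check above; drop it now,
        // before we touch the handle lock.
        drop(slot_guard);

        if handle_is_valid {
            // ... use the cached file handle ...
            return;
        }

        // Re-open the file under the handle lock. The guard is a local of
        // this iteration, so it is released before the next one starts.
        let _handle_guard = handle_lock.lock().unwrap();
        // ... open the file, publish the new handle, and retry ...
    }
}
```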
Currently, whenever a page version is needed from an image or delta
layer, we open the file and read and parse the bookfile headers. That's
pretty expensive. To reduce the overhead, introduce a cache of open file
descriptors, and use that to cache the Book objects so that we don't need
to read the metadata on every access.
Now the safekeeper control file is updated in the following way:
1. Write data to temp file
2. Fsync the temporary file (if sync option is specified)
3. Rename temporary file to actual control file
4. Fsync containing directory (if sync option is specified)
5. Fsync file after rename (if sync option is specified).
Note that action 5 is not mentioned anywhere as required but it is done
in postgres this way (see durable_rename).
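A sketch of that sequence in Rust, with simplified error handling and illustrative file names, assuming the sync option is enabled:
```rust
use std::fs::{self, File, OpenOptions};
use std::io::Write;
use std::path::Path;

/// Durable update of a control file (mirrors Postgres's durable_rename).
fn durable_update(dir: &Path, data: &[u8]) -> std::io::Result<()> {
    let tmp = dir.join("safekeeper.control.partial");
    let dst = dir.join("safekeeper.control");

    // 1-2. Write the temp file and fsync it.
    let mut f = File::create(&tmp)?;
    f.write_all(data)?;
    f.sync_all()?;

    // 3. Atomically rename it over the actual control file.
    fs::rename(&tmp, &dst)?;

    // 4. Fsync the containing directory so the rename itself is durable.
    File::open(dir)?.sync_all()?;

    // 5. Fsync the file under its new name, as durable_rename does.
    OpenOptions::new().write(true).open(&dst)?.sync_all()?;
    Ok(())
}
```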
Also, because of the rename machinery, switch to using a dedicated lock file
to prevent running several safekeepers concurrently on the same data
directory, plus some cleanup.
Fsync the control file after rename to match postgres behaviour.
* change zenith-perf-data checkout ref to be main
* set cluster id through secrets so there are no code changes required
when we wipe out clusters on staging
* display full pgbench output on error
Commit 960c7d69a8 changed the LSN returned in the Continue case in
InMemoryLayer::get_page_reconstruct_data(), but neglected to make the
same change in DeltaLayer.
Also add an escape hatch to the loop in materialize_page() to avoid
getting stuck in an infinite loop, if a bug like this reoccurs.
Git commit sha is displayed when --version flag is used and is written
to logs during service startup. Uses the git_version crate when git is
available, and the GIT_VERSION environment variable otherwise, which is the case for docker
builds.
The tokio futures added some overhead, so switch to plain non-blocking
I/O with poll(). In a simple pgbench test on my laptop (select-only
queries, scale-factor 1 `pgbench -P1 -T50 -S`), this gives about 10%
improvement, from about 4300 TPS to 4800 TPS.
Instead of building a separate Vec<u8> to hold each message, serialize all
the messages to one big Vec<u8>. This eliminates some Vec allocation and
memcpy() overhead. The downside is that if there are a lot of records to
replay, we have to serialize them all into one big chunk of memory.
That shouldn't be a problem in practice. If you need to replay millions
of records to reconstruct a page, we should've materialized a new image
of that page earlier already.
Tests are based on a self-hosted runner which is physically close
to our staging deployment in AWS; currently the tests consist of
various configurations of pgbench runs.
These changes also rework the benchmark fixture by removing globals and
allowing us to collect reports with the desired metrics and dump them to json
for further analysis. This is also applicable to the usual performance tests
which use local zenith binaries.
We might want to have custom serialize/deserialize functions for
WALRecords and PageVersions for performance reasons, see github issue 832.
But that would probably look a bit different from this, and currently
these functions are just dead code.
Adds simple global tracking of memory used by the in-memory layers. It's
very approximate, it doesn't take into account allocator, memory
fragmentation or many other things, but it's a good first step.
After storing a WAL record in the repository, the WAL receiver checks
the global memory usage. If it's above a configurable threshold (hard-coded
at 128 MB at the moment), it evicts a layer. The victim layer is
chosen by a GClock algorithm, similar to that used in the Postgres buffer
cache.
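A very rough sketch of the accounting side, with illustrative names; the eviction policy itself lives elsewhere in the real code:
```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Global counter of bytes held by in-memory layers, checked by the WAL
// receiver after each ingested record.
static INMEM_LAYER_BYTES: AtomicUsize = AtomicUsize::new(0);
const EVICTION_THRESHOLD: usize = 128 * 1024 * 1024; // 128 MB

fn account_grow(bytes: usize) {
    INMEM_LAYER_BYTES.fetch_add(bytes, Ordering::Relaxed);
}

fn maybe_evict_layer() {
    if INMEM_LAYER_BYTES.load(Ordering::Relaxed) > EVICTION_THRESHOLD {
        // Pick a victim with a GClock-style sweep, write it out to disk,
        // then subtract its size from the counter.
    }
}
```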
This stops the page server from using an unbounded amount of memory. It's
pretty crude, the eviction and materializing and writing a layer to disk
happens now in the WAL receiver thread. It would be nice to move that
to a background thread, and it would be nice to have a smarter policy on
when to materialize a new image layer and when to just write out a delta
layer, and it would be nice to have more accurate accounting of memory.
But this should fix the most pressing OOM issues, and is a step in the
right direction.
Co-authored-by: Patrick Insinger <patrickinsinger@gmail.com>
This calculation is not that heavy, but it is needed only in tests, and
when the number of tenants/timelines is high the calculation can take
noticeable time.
Resolves https://github.com/zenithdb/zenith/issues/804
The 'zenith' CLI utility can now be used to launch safekeepers. By
default, one safekeeper is configured. There are new 'safekeeper
start/stop' subcommands to manage the safekeepers. Each safekeeper is
given a name that can be used to identify the safekeeper to start/stop
with the 'zenith start/stop' commands. The safekeeper data is stored
in '.zenith/safekeepers/<name>'.
The 'zenith start' command now starts the pageserver and also all
safekeepers. 'zenith stop' stops pageserver, all safekeepers, and all
postgres nodes.
Introduce new 'zenith pageserver start/stop' subcommands for
starting/stopping just the page server.
The biggest change here is to the 'zenith init' command. This adds a
new 'zenith init --config=<path to toml file>' option. It takes a toml
config file that describes the environment. In the config file, you
can specify options for the pageserver, like the pg and http ports,
and authentication. For each safekeeper, you can define a name and the
pg and http ports. If you don't use the --config option, you get a
default configuration with a pageserver and one safekeeper. Note that
that's different from the previous default of no safekeepers. Any
fields that are omitted in the configuration file are filled with
defaults. You can also specify the initial tenant ID in the config
file. A couple of sample config files are added in the control_plane/
directory.
The --pageserver-pg-port, --pageserver-http-port, and
--pageserver-auth options to 'zenith init' are removed. Use a config
file instead.
Finally, change the python test fixtures to use the new 'zenith'
commands and the config file to describe the environment.
We've seen some failures with "Address already in use" errors in the
tests. It's not clear why, perhaps some server processes are not cleaned
up properly after test, or maybe the socket is still in TIME_WAIT state.
In any case, let's make the tests more robust by checking that the port
is free, before trying to use it.
We generate the initial tenantid and store it in the file, so it shouldn't
be missing. But let's cope with it. (This comes handy with the bigger
changes I'm working on at https://github.com/zenithdb/zenith/pull/788)
Instead of having a lot of separate fixtures for setting up the page
server, the compute nodes, the safekeepers etc., have one big ZenithEnv
object that encapsulates the whole environment. Every test either uses
a shared "zenith_simple_env" fixture, which contains the default setup
of a pageserver with no authentication, and no safekeepers. Tests that
want to use safekeepers or authentication set up a custom test-specific
ZenithEnv fixture.
Gathering information about the whole environment into one object makes
some things simpler. For example, when a new compute node is created,
you no longer need to pass the 'wal_acceptors' connection string as
argument to the 'postgres.create_start' function. The 'create_start'
function fetches that information directly from the ZenithEnv object.
Each test now gets its own test output directory, like
'test_output/test_foobar', even when TEST_SHARED_FIXTURES is used.
When TEST_SHARED_FIXTURES is not used, the zenith repo for each test
is created under a 'repo' subdir inside the test output dir, e.g.
'test_output/test_foobar/repo'
The -D option to specify working directory was broken:
$ mkdir foobar
$ ./target/debug/safekeeper -D foobar
Error: failed to open "foobar/safekeeper.log"
Caused by:
No such file or directory (os error 2)
This was because we both chdir'd into the specified directory, and also
prepended the directory to all the paths. So in the above example, it
actually tried to create the log file in "foobar/foobar/safekeeper.log".
Change it to work the same way as in the pageserver: chdir to the
specified directory, and leave 'workdir' always set to ".".
We wouldn't necessarily need the 'workdir' variable in the config at all,
and could assume that the current working directory is always the
safekeeper data directory, but I'd like to keep this consistent with the
the pageserver. The page server doesn't assume that for the sake of unit
tests. We don't currently have unit tests in the safekeeper that write
to disk but we might want to in the future.
* We actually need Python 3.7 because of dataclasses
* Rerun 'pipenv lock' under Python 3.7 and add 'pipenv' to dev deps
* Update docs on developing for Python 3.7
* CircleCI: use Python 3.7 via Docker image instead of Orb
* Fix bugs found by mypy
* Add some missing types and runtime checks, remove unused code
* Make ZenithPageserver start right away for better type safety
* Add `types-*` packages to Pipfile
* Pin mypy version and run it on CircleCI
This change causes writer halves of a TLS stream to always flush after a
portion of bytes has been written by `std::io::copy`. Furthermore, some
cosmetic and minor functional changes are made to facilitate debugging.
* Add yapf run to CircleCI
* Pin yapf version
* Enable `SPLIT_ALL_TOP_LEVEL_COMMA_SEPARATED_VALUES` setting
* Reformat all existing code with slight manual adjustments
* test_runner/README: note that yapf is forced
Change 'zenith.signal' file to a human-readable format, similar to
backup_label. It can contain a "PREV LSN: %X/%X" line, or a special
value to indicate that it's OK to start with invalid LSN ('none'), or
that it's a read-only node and generating WAL is forbidden
('invalid').
The 'zenith pg create' and 'zenith pg start' commands now take a node
name parameter, separate from the branch name. If the node name is not
given, it defaults to the branch name, so this doesn't break existing
scripts.
If you pass "foo@<lsn>" as the branch name, a read-only node anchored
at that LSN is created. The anchoring is performed by setting the
'recovery_target_lsn' option in the postgresql.conf file, and putting
the server into standby mode with 'standby.signal'.
We no longer store the synthetic checkpoint record in the WAL segment.
The postgres startup code has been changed to use the copy of the
checkpoint record in the pg_control file, when starting in zenith
mode.
This is in preparation for supporting read-only nodes. You can launch
multiple read-only nodes on the same branch, so we need an identifier
for each node, separate from the branch name.
Previously, the first WAL record on the 'main' branch overwrote the
initial checkpoint record, with invalid 'xl_prev'. That's harmless, but
also pretty ugly. I bumped into this while I was trying to tighten up the
checks for when a valid 'prev_lsn' is required. With this patch, the
first WAL record gets a valid 'xl_prev' value. It doesn't matter much
currently, but let's be tidy.
Which is mainly generational state (terms) and useful LSNs.
Also add a basic /status healthcheck request, which is now used in tests to
determine that the safekeeper is up; this fixes #726.
ref #115
This adds a fast-path for the common case that the record doesn't
cross a page boundary. We now split off a new Bytes directly from the
original input buffer in that case, instead of copying the record to a
new BytesMut. Shaves about 5% of the page server's CPU time on my
laptop, in the 'test_bulk_insert' test.
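A sketch of the fast path, with made-up helper names; the real decoder also tracks continuation page headers and partial records:
```rust
use bytes::{Bytes, BytesMut};

/// If the whole record fits on the current page, hand out a cheap slice of
/// the original input buffer instead of copying into a fresh BytesMut.
fn take_record(inputbuf: &mut BytesMut, total_len: usize, bytes_left_on_page: usize) -> Bytes {
    if total_len <= bytes_left_on_page {
        // fast path: zero-copy split from the original buffer
        inputbuf.split_to(total_len).freeze()
    } else {
        // slow path: the record crosses a page boundary, so copy it piece by
        // piece (skipping continuation page headers along the way)
        let mut recordbuf = BytesMut::with_capacity(total_len);
        recordbuf.extend_from_slice(&inputbuf.split_to(bytes_left_on_page));
        // ... read and append the remaining pages ...
        recordbuf.freeze()
    }
}
```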
* Use logging in python tests
* Use f-strings for logs
* Don't log test output while running
* Use only pytest logging handler
* Add more info about pytest logging
* Rename wal_acceptor binary to safekeeper
* Rename wal_acceptor.pid and wal_acceptor.log to safekeeper.pid and safekeeper.log
* Change some mentions of WAL acceptor to safekeeper
* Dockerfile: alias wal_acceptor to safekeeper temporarily until internal scripts are updated
This change brings the following improvements to our build system:
* Now BUILD_TYPE also affects rust apps.
* From now on, cargo will respect `-jN` passed via `make`. However, note
that `rustc` may spawn multiple threads depending on compile flags.
* Cargo is able to cooperate with make to better schedule parallel jobs,
which leads to better build times (-20s in release mode on my machine).
No need to use BytesMut in these functions. Plain Vec is simpler. And
should be marginally faster too; I saw BytesMut functions previously
in 'perf' profile, consuming around 5% of the overall pageserver CPU
time. That's gone with this patch, although I don't see any discernible
difference in the overall performance test results.
Previously, the WAL receiver would make a decoded copy of the current
Checkpoint before each WAL record, and compare it with the Checkpoint
after the record had been processed. If it had changed, the checkpoint
relish was updated in the repository. That's somewhat expensive; the
Checkpoint::encode() function is visible in the 'perf' profile. Change that
so that we set a flag whenever the Checkpoint struct is modified, so that
we don't need to compare the whole struct anymore.
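A sketch of the flag-based approach, with stubbed-out types standing in for the real postgres_ffi CheckPoint and the WAL ingestion state:
```rust
// Stub stand-in for the decoded pg_control CheckPoint struct.
struct CheckPoint {
    next_xid: u32,
}

impl CheckPoint {
    /// Returns true if the stored nextXid actually advanced (wraparound is
    /// ignored in this sketch).
    fn update_next_xid(&mut self, xid: u32) -> bool {
        if xid > self.next_xid {
            self.next_xid = xid;
            true
        } else {
            false
        }
    }
}

// WAL ingestion keeps the decoded checkpoint plus a dirty flag; the struct is
// only re-encoded and stored in the repository when the flag is set.
struct WalIngestState {
    checkpoint: CheckPoint,
    checkpoint_modified: bool,
}

impl WalIngestState {
    fn maybe_update_next_xid(&mut self, xid: u32) {
        if self.checkpoint.update_next_xid(xid) {
            self.checkpoint_modified = true;
        }
    }

    fn after_record(&mut self) {
        if self.checkpoint_modified {
            // encode the CheckPoint once and store it, then clear the flag
            self.checkpoint_modified = false;
        }
    }
}
```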
Previously, typos like `BUILD_TYPE=rlease` would silently
lead to building debug binaries. The current approach is also
more future-proof, since we might add `profile`, `valgrind`
as well as other build types.
- perform checkpoint for each tenant repository.
- wait for the completion of all threads.
Add new option 'immediate' to 'pageserver stop' command to terminate the pageserver immediately.
When a WAL record affects multiple pages, we currently duplicate the
record for each affected page. That's a bit wasteful, but not too bad
for b-tree splits and non-hot heap updates that affect two pages. But
buffering GiST index build WAL-logs the whole relation in 32 page chunks,
with one giant WAL record for each 32-page chunk. Currently we duplicate
that giant record for each of the 32 pages, which is really wasteful.
Github issue https://github.com/zenithdb/zenith/issues/720 tracks the
problem. This commit adds a test case to demonstrate it.
Whenever we start processing a request, we now enter a tracing "span"
that includes context information like the tenant and timeline ID, and
the operation we're performing. That context information gets attached
to every log message we create within the span. That way, we don't need
to include basic context information like that in every log message, and
it also becomes easier to filter the logs programmatically.
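For illustration, a minimal sketch of entering such a span with the `tracing` crate (the field names are illustrative):
```rust
use tracing::{info, info_span};

// Enter a span carrying tenant/timeline context, so every log line emitted
// inside it is automatically annotated with those fields.
fn handle_get_page_request(tenant_id: &str, timeline_id: &str) {
    let span = info_span!("get_page", tenant = %tenant_id, timeline = %timeline_id);
    let _enter = span.enter();

    // No need to repeat the IDs here; they come from the enclosing span.
    info!("serving request");
}
```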
This removes the explicit timeline and tenant IDs from most log messages,
as you get that information from the enclosing span now.
Also improve log messages in general, dialing down the level of some
messages that are not very useful, and adding information to others.
We now obey the RUST_LOG env variable, if it's set.
The 'tracing' crate allows for different log formatters, like JSON or
bunyan output. The one we use now is human-readable multi-line format,
which is nice when reading the log directly, but hard for
post-processing. For production, we'll probably want JSON output and
some tools for working with it, but that's left as a TODO. The log
format is easy to change.
**NB: this PR must be merged only by 'Create a merge commit'!**
### Checklist when preparing for release
- [ ] Read or refresh [the release flow guide](https://github.com/neondatabase/cloud/wiki/Release:-general-flow)
- [ ] Ask in the [cloud Slack channel](https://neondb.slack.com/archives/C033A2WE6BZ) that you are going to rollout the release. Any blockers?
- [ ] Does this release contain any db migrations? Destructive ones? What is the rollback plan?
<!-- List everything that should be done **before** release, any issues / setting changes / etc -->
### Checklist after release
- [ ] Based on the merged commits write release notes and open a PR into `website` repo ([example](https://github.com/neondatabase/website/pull/219/files))
# The path includes a test name (test_prepare_snapshot) and directory that the test creates (compatibility_snapshot_pg14), keep the path in sync with the test
Neon is a serverless open-source alternative to AWS Aurora Postgres. It separates storage and compute and substitutes the PostgreSQL storage layer by redistributing data across a cluster of nodes.
The project used to be called "Zenith". Many of the commands and code comments
still refer to "zenith", but we are in the process of renaming things.
## Quick start
[Join the waitlist](https://neon.tech/) for our free tier to receive your serverless postgres instance. Then connect to it with your preferred postgres client (psql, dbeaver, etc) or use the online SQL editor.
Alternatively, compile and run the project [locally](#running-local-installation).
## Architecture overview
A Neon installation consists of compute nodes and a Neon storage engine.
Compute nodes are stateless PostgreSQL nodes backed by the Neon storage engine.
The Neon storage engine consists of two major components:
- Pageserver. Scalable storage backend for the compute nodes.
- WAL service. The service receives WAL from the compute node and ensures that it is stored durably.
Pageserver consists of:
- Repository - Neon storage implementation.
- WAL receiver - service that receives WAL from WAL service and stores it in the repository.
- Page service - service that communicates with compute nodes and responds with pages from the repository.
- WAL redo - service that builds pages from base images and WAL records on Page service request.
## Running local installation
#### Installing dependencies on Linux
1. Install build dependencies and other applicable packages
* On Ubuntu or Debian, this set of packages should be sufficient to build the code:
# recommended approach from https://www.rust-lang.org/tools/install
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```
3. Install PostgreSQL Client
```
# from https://stackoverflow.com/questions/44654216/correct-way-to-install-psql-without-full-postgres-on-macos
brew install libpq
brew link --force libpq
```
#### Rustc version
The project uses [rust toolchain file](./rust-toolchain.toml) to define the version it's built with in CI for testing and local builds.
This file is automatically picked up by [`rustup`](https://rust-lang.github.io/rustup/overrides.html#the-toolchain-file) that installs (if absent) and uses the toolchain version pinned in the file.
Rustup users who want to build with another toolchain can use the [`rustup override`](https://rust-lang.github.io/rustup/overrides.html#directory-overrides) command to set a specific toolchain for the project's directory.
Non-rustup users most probably will not get the same toolchain automatically from the file, so they are responsible for manually verifying that their toolchain matches the version in the file.
Newer rustc versions will most probably work fine, yet older ones might not be supported due to new features used by the project or its crates.
#### Building on Linux
1. Build neon and patched postgres
```
# Note: The path to the neon sources cannot contain a space.
# The preferred and default option is to make a debug build. This will create
# a demonstrably slower build than a release build. For a release build,
# use "BUILD_TYPE=release make -j`nproc`"
make -j`nproc`
```
#### Dependency installation notes
To run the `psql` client, install the `postgresql-client` package or modify `PATH` and `LD_LIBRARY_PATH` to include `pg_install/bin` and `pg_install/lib`, respectively.
To run the integration tests or Python scripts (not required to use the code), install
Python (3.9 or higher), and install python3 packages using `./scripts/pysync` (requires [poetry](https://python-poetry.org/)) in the project directory.
#### Running neon database
1. Start pageserver and postgres on top of it (should be called from repo root):
```sh
# Create repository in .neon with proper paths to binaries and data
# Later that would be responsibility of a package install script
> ./target/debug/neon_local init
Starting pageserver at '127.0.0.1:64000' in '.neon'
# start pageserver
> ./target/debug/zenith start
Starting pageserver at '127.0.0.1:64000' in .zenith
```
### Postgres-specific terms
Due to Neon's very close relation with PostgreSQL internals, numerous specific terms are used.
The same applies to certain spelling: e.g. we use MB to denote 1024 * 1024 bytes; while MiB would be technically more correct, it is inconsistent with what the PostgreSQL code and its documentation use.