Add `neon-branch-create` / `neon-branch-delete` to allow using branches
in tests.
I have a couple of use cases in mind:
- For destructive tests with a big DB, we can create the DB once in
advance and then use branches without the need to recreate the DB itself
after tests change it.
- We can run tests in parallel (if there're compute-bound).
Also migrate API v2 for `neon-project-create` / `neon-project-delete`
Many python tests were setting the GC/compaction period to large
values, to effectively disable GC / compaction. Reserve value 0 to
mean "explicitly disabled". We also set them to 0 in unit tests now,
although currently, unit tests don't launch the background jobs at
all, so it won't have any effect.
Fixes https://github.com/neondatabase/neon/issues/2917
Fix `test_seqscans` by disabling statement timeout.
Also, replace increasing statement timeout with disabling it for
performance tests. This should make tests more stable and allow us to
observe performance degradation instead of test failures.
**NB**: this PR doesn't update Python to 3.11; it makes tests
compatible with it and fixes a couple of warnings by updating
dependencies.
- `poetry add asyncpg@latest` to fix `./scripts/pysync`
- `poetry add boto3@latest "boto3-stubs[s3]@latest"` to fix
```
DeprecationWarning: 'cgi' is deprecated and slated for removal in Python 3.13
```
- `poetry update certifi` to fix
```
DeprecationWarning: path is deprecated. Use files() instead. Refer to https://importlib-resources.readthedocs.io/en/latest/using.html#migrating-from-legacy for migration advice.
```
- Move `types-toml` from `dep-dependencies` to `dependencies` to keep it
aligned with other `types-*` deps
* Fix https://github.com/neondatabase/neon/issues/1854
* Never log Safekeeper::conninfo in walproposer as it now contains a secret token
* control_panel, test_runner: generate and pass JWT tokens for Safekeeper to compute and pageserver
* Compute: load JWT token for Safekepeer from the environment variable. Do not reuse the token from
pageserver_connstring because it's embedded in there weirdly.
* Pageserver: load JWT token for Safekeeper from the environment variable.
* Rewrite docs/authentication.md
There will be different scopes for those two, so authorization code should be different.
The `check_permission` function is now not in the shared library. Its implementation
is very similar to the one which will be added for Safekeeper. In fact, we may reuse
the same existing root-like 'PageServerApi' scope, but I would prefer to have separate
root-like scopes for services.
Also, generate_management_token in tests is generate_pageserver_token now.
Downsides are:
* We store all components of the config separately. `Url` stores them inside a single
`String` and a bunch of ints which point to different parts of the URL, which is
probably more efficient.
* It is now impossible to pass arbitrary connection strings to the configuration file,
one has to support all components explicitly. However, we never supported anything
except for `host:port` anyway.
Upsides are:
* This significantly restricts the space of possible connection strings, some of which
may be either invalid or unsupported. E.g. Postgres' connection strings may include
a bunch of parameters as query (e.g. `connect_timeout=`, `options=`). These are nether
validated by the current implementation, nor passed to the postgres client library,
Hence, storing separate fields expresses the intention better.
* The same connection configuration may be represented as a URL in multiple ways
(e.g. either `password=` in the query part or a standard URL password).
Now we have a single canonical way.
* Escaping is provided for `options=`.
Other possibilities considered:
* `newtype` with a `String` inside and some validation on creation.
This is more efficient, but harder to log for two reasons:
* Passwords should never end up in logs, so we have to somehow
* Escaped `options=` are harder to read, especially if URL-encoded,
and we use `options=` a lot.
Which ought to replace etcd. This patch only adds the binary and adjusts
Dockerfile to include it; subsequent ones will add deploy of helm chart and the
actual replacement.
It is a simple and fast pub-sub message bus. In this patch only safekeeper
message is supported, but others can be easily added.
Compilation now requires protoc to be installed. Installing protobuf-compiler
package is fine for Debian/Ubuntu.
ref
https://github.com/neondatabase/neon/pull/2733https://github.com/neondatabase/neon/issues/2394
Imagine that you have a tenant with a single branch like this:
---------------==========>
^
gc horizon
where:
---- is the portion of the branch that is older than retention period
==== is the portion of the branch that is newer than retention period.
Before this commit, the sizing model included the logical size at the
GC horizon, but not the WAL after that. In particular, that meant that
on a newly created tenant with just one timeline, where the retention
period covered the whole history of the timeline, i.e. gc_cutoff was 0,
the calculated tenant size was always zero.
We now include the WAL after the GC horizon in the size. So in the
above example, the calculated tenant size would be the logical size
of the database the GC horizon, plus all the WAL after it (marked with
===).
This adds a new `insert_point` function to the sizing model, alongside
`modify_branch`, and changes the code in size.rs to use the new
function. The new function takes an absolute lsn and logical size as
argument, so we no longer need to calculate the difference to the
previous point. Also, the end-size is now optional, because we now
need to add a point to represent the end of each branch to the model,
but we don't want to or need to calculate the logical size at that
point.
- Pass through FAILPOINTS environment variable to the pageserver in
"neon_local pageserver start" command
- On startup, list any failpoints that were set with FAILPOINTS to the log
- Add optional "extra_env_vars" argument to the NeonPageserver.start()
function in the python fixture, so that you can pass FAILPOINTS
None of the tests use this functionality yet; that comes in a separate
commit.
closes https://github.com/neondatabase/neon/pull/2865
Increse the pgbench runtimes even further. The theory is that when
there are many other tests running at the same time, one pgbench run
could take a long time until it generates enough layers for GC to kick
in.
I saw these from the build of the compute docker image in the CI
(compute-node-image-v15):
pagestore_smgr.c: In function 'neon_prefetch':
pagestore_smgr.c:1654:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
1654 | BufferTag tag = (BufferTag) {
| ^~~~~~~~~
walproposer.c:197:1: warning: no previous prototype for 'WalProposerSync' [-Wmissing-prototypes]
197 | WalProposerSync(int argc, char *argv[])
| ^~~~~~~~~~~~~~~
libpagestore.c: In function 'pageserver_connect':
libpagestore.c💯9: warning: variable 'wc' set but not used [-Wunused-but-set-variable]
100 | int wc;
| ^~
libpagestore.c: In function 'call_PQgetCopyData':
libpagestore.c:144:9: warning: variable 'wc' set but not used [-Wunused-but-set-variable]
144 | int wc;
| ^~
Harmless warnings, but let's be tidy.
In the passing, I added some "extern" to a few function declarations
that were missing them, and marked WalProposerSync as "static". Those
changes are also purely cosmetic.
Commit d013a2b227 changed the test, so that it fails if pgbench runs
to completion without triggering the failpoint. That has now happened
several times in the CI. That's not expected, so this needs some
investigation, but as a quick fix just make the pgbench runs longer so
that we're closer to the situation before commit d013a2b227.
See https://github.com/neondatabase/neon/issues/2856
This allows us to error out in the case where we request flush but the
flush loop is not running.
Before, we would only track whether it was started, but not when it
exited.
Better to use an enum with 3 states than a 2-state bool because then
the error message can answer the question whether we ever started
the flush loop or not.
In a CI run, I got a test failure because of this error in the log,
from the test_get_tenant_size_with_multiple_branches test:
ERROR gc_loop{tenant_id=f1630516d4b526139836ced93be0c878}: Gc failed, retrying in 2s: No such file or directory (os error 2)
There are known race conditions between GC and timeline deletion,
which surely caused that error. But if we didn't know the cause, it
would be pretty hard to debug without a stack trace.