Compare commits


35 Commits

Author SHA1 Message Date
KlimentSerafimov
78d919ef53 Revert "Added paths to openssl includes and libraries for OSX because make complained that it couldn't find them. (#1761)"
This reverts commit 65cf1a3221.
2022-05-20 12:32:36 -04:00
KlimentSerafimov
65cf1a3221 Added paths to openssl includes and libraries for OSX because make complained that it couldn't find them. (#1761) 2022-05-20 12:02:51 -04:00
bojanserafimov
a4aef5d8dc Compile psql with openssl (#1725) 2022-05-19 12:25:31 -04:00
Heikki Linnakangas
ffbb9dd155 Add a 5 minute timeout to python tests.
The CI times out after 10 minutes of no output. It's annoying if a
test hangs and is killed by the CI timeout, because you don't get
information about which test was running. Try to avoid that, by adding
a slightly smaller timeout in pytest itself. You can override it on a
per-test basis if needed, but let's try to keep our tests shorter than
that.

For the Postgres regression tests, use a longer 30-minute timeout.
They're not really a single test, but many tests wrapped in a single
pytest test. It's OK for them to run longer in aggregate; each
individual Postgres test is still fairly short.
2022-05-19 14:04:14 +03:00
Egor Suvorov
baf7a81dce git-upload: pass committer to 'git rebase' (fix #1749) (#1750)
No committer was specified, which resulted in `git rebase` failing if
the branch was not up to date.
2022-05-19 14:01:03 +03:00
Heikki Linnakangas
ee3bcf108d Fix compact_level0 for delta layers with overlap or gaps
We saw a case in staging, where there was a gap in the LSN ranges of
level 0 files, like this:

    000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9
    000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016960E9-00000000016E4DB9
    000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016E4DB9-000000000BFCE3E1
    000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000000BFCE3E1-000000000BFD0FE9
    000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000060045901-000000007005EAC1
    000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000007005EAC1-0000000080062E99
    000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000080062E99-000000009007F481
    000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000009007F481-00000000A009F7C9
    000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000A009F7C9-00000000AA284EB9
    000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000AA286471-00000000AA2886B9

Note the gap between 000000000BFD0FE9 and 0000000060045901. I don't
know how that happened, but in general the pageserver should be robust
against gaps like that, overlapping files, etc. In theory they
could happen as a result of crashes, partial downloads from S3, etc.,
although it is a mystery what caused it in this case.

Looking at the compaction code, it was not safe in the face of gaps
like that. The compaction routine collected all the level 0 files and
took their min(start)..max(end) as the range of the new files it
builds. That's wrong if the level 0 files don't cover the whole LSN
range; the newly created files will miss any records in the gap. Fix
that by only collecting contiguous sequences of level 0 files, so
that the end LSN of the previous delta file is equal to the start of
the next one.
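
Below is a minimal, hedged sketch of that contiguous-run grouping (the `LayerDesc` type and field names are hypothetical, not the pageserver's actual data structures):

```rust
/// Hypothetical minimal description of a level 0 delta layer.
struct LayerDesc {
    lsn_start: u64,
    lsn_end: u64, // exclusive
}

/// Split level 0 layers into runs whose LSN ranges are contiguous,
/// so that each run can be compacted on its own without spanning a gap.
fn contiguous_runs(mut layers: Vec<LayerDesc>) -> Vec<Vec<LayerDesc>> {
    layers.sort_by_key(|l| l.lsn_start);
    let mut runs: Vec<Vec<LayerDesc>> = Vec::new();
    for layer in layers {
        // Extend the current run only if the previous layer ends exactly
        // where this one starts; otherwise start a new run.
        let extends = runs
            .last()
            .and_then(|run| run.last())
            .map_or(false, |prev| prev.lsn_end == layer.lsn_start);
        if extends {
            runs.last_mut().unwrap().push(layer);
        } else {
            runs.push(vec![layer]);
        }
    }
    runs
}

fn main() {
    let layers = vec![
        LayerDesc { lsn_start: 0x10, lsn_end: 0x20 },
        LayerDesc { lsn_start: 0x20, lsn_end: 0x30 },
        // gap: 0x30..0x60 is not covered by any level 0 layer
        LayerDesc { lsn_start: 0x60, lsn_end: 0x70 },
    ];
    // Two runs: 0x10..0x30 and 0x60..0x70.
    println!("{} contiguous runs", contiguous_runs(layers).len());
}
```

Taking min(start)..max(end) over each run separately, rather than over all level 0 files at once, is what keeps records in a gap from being silently dropped.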

Fixes issue #1730
2022-05-19 10:19:38 +03:00
Heikki Linnakangas
0da4046704 Include traversal path in error message.
Previously, the path was printed to the log with separate error!() calls.
It's better to include the whole path in the error object and have it
printed to the log as one message.

Also print the path in the ValueReconstructResult::Missing case.

This is what it looks like now:

    2022-05-17T21:53:53.611801Z ERROR pagestream{timeline=5adcb4af3e95f00a31550d266aab7a37 tenant=74d9f9ad3293c030c6a6e196dd91c60f}: error reading relation or page version: could not find data for key 000000067F000032BE000000000000000001 at LSN 0/1698C48, for request at LSN 0/1698CF8

    Caused by:
        0: layer traversal: result Complete, cont_lsn 0/1698C48, layer: 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001698C48-0000000001698CC1
        1: layer traversal: result Continue, cont_lsn 0/1698CC1, layer: inmem-0000000001698CC1-FFFFFFFFFFFFFFFF

    Stack backtrace:
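
The "Caused by" chain above is what you get by attaching each traversal step as context on the error object itself rather than logging it separately; a rough sketch of that pattern with anyhow (a hypothetical helper, not the pageserver's actual code):

```rust
use anyhow::{anyhow, Error};

/// Sketch: fold the traversal path into the error, so the whole path is
/// logged as one message with a "Caused by:" chain instead of separate
/// error!() calls per step.
fn traversal_error(key: &str, path: &[String]) -> Error {
    // Start from the deepest traversal step; each outer step wraps it.
    let mut err = anyhow!("layer traversal: {}", path.last().expect("non-empty path"));
    for step in path.iter().rev().skip(1) {
        err = err.context(format!("layer traversal: {}", step));
    }
    // The outermost message is the original "not found" error.
    err.context(format!("could not find data for key {}", key))
}

fn main() {
    let path = vec![
        "result Complete, cont_lsn 0/1698C48, layer: ...".to_string(),
        "result Continue, cont_lsn 0/1698CC1, layer: inmem-...".to_string(),
    ];
    // Debug-formatting an anyhow::Error prints the full "Caused by:" chain.
    println!("{:?}", traversal_error("000000067F000032BE...", &path));
}
```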
2022-05-19 10:19:38 +03:00
Anastasia Lubennikova
cbd00d7ed9 Remove temp layer files during timeline initialization on pageserver start 2022-05-19 10:11:12 +03:00
Anastasia Lubennikova
4c30ae8ba3 Add random string as a part of tempfile name 2022-05-19 10:11:12 +03:00
Anastasia Lubennikova
3da4b3165e Fsync layer files before rename 2022-05-19 10:11:12 +03:00
Anastasia Lubennikova
c1b365fdf7 Use temp filename while writing ImageLayer file 2022-05-19 10:11:12 +03:00
Egor Suvorov
fab104d5f3 docs/sourcetree: add note about exact Python version used and how to choose it 2022-05-19 00:09:13 +02:00
Egor Suvorov
7dd27ecd20 Bump minimal supported Python version to 3.9
Most of the CI already runs with Python 3.9, since https://github.com/neondatabase/docker-images/pull/1
2022-05-19 00:09:13 +02:00
Egor Suvorov
bd2979d02c CircleCI/check-codestyle-python: print versions 2022-05-19 00:09:13 +02:00
Dmitry Rodionov
5914aab78a add comments, use expect instead of unwrap 2022-05-19 00:54:14 +03:00
Heikki Linnakangas
4a36d89247 Avoid spawning a layer-flush thread when there's no work to do.
The check_checkpoint_distance() always spawned a new thread, even if
there was no frozen layer to flush. That was a thinko, as @knizhnik
pointed out.
2022-05-19 00:51:48 +03:00
Egor Suvorov
432907ff5f Safekeeper: avoid holding mutex when deleting a tenant (#1746)
Following discussion with @arssher after #1653
2022-05-18 23:02:17 +03:00
Arthur Petukhovsky
98da0aa159 Add _total suffix to metrics name (#1741) 2022-05-18 15:17:04 +03:00
Alexey Kondratov
772c2fb4ff Report startup metrics and failure reason from compute_ctl (#1581)
+ neondatabase/cloud#1103

This adds a couple of control endpoints to simplify compute state
discovery for the control plane. For example, now we can figure out
within seconds that Postgres wasn't able to start or that basebackup
failed, instead of blindly polling the compute readiness
for a minute or two.

Also, we now expose startup metrics (the time of each step: basebackup,
sync safekeepers, config, total). The console grabs them after each
successful start and reports them as a histogram to Prometheus and Grafana.

The OpenAPI spec is added and up-to-date, but it is not used in the
console yet.
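
As a usage illustration, the new endpoints can be polled over plain HTTP. A minimal sketch with the `reqwest` crate (blocking feature), assuming compute_ctl is listening on its default port 3080 as in the code further below:

```rust
// Sketch: poll compute_ctl's status and startup-metrics endpoints.
// Requires the `reqwest` crate with the "blocking" feature enabled.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // /status returns the serialized ComputeState (status, last_active, error).
    let status = reqwest::blocking::get("http://localhost:3080/status")?.text()?;
    println!("compute state: {}", status);

    // /metrics.json returns the startup timings in milliseconds.
    let metrics = reqwest::blocking::get("http://localhost:3080/metrics.json")?.text()?;
    println!("startup metrics: {}", metrics);

    Ok(())
}
```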
2022-05-18 13:03:29 +04:00
Andrey Taranik
b9f84f4a83 Turn on storage deployment to neon-stress environment (#1729) 2022-05-17 23:04:04 +03:00
Arthur Petukhovsky
134eeeb096 Add more common storage metrics (#1722)
- Enabled process exporter for storage services
- Changed zenith_proxy prefix to just proxy
- Removed old `monitoring` directory
- Removed common prefix for metrics; now our common metrics have a `libmetrics_` prefix, for example `libmetrics_serve_metrics_count`
- Added `test_metrics_normal_work`
2022-05-17 19:29:01 +03:00
Heikki Linnakangas
55ea3f262e Fix race condition leading to panic in remote storage sync thread.
The SyncQueue consisted of a tokio mpsc channel, and an atomic counter
to keep track of how many items there are in the channel. Updating the
atomic counter was racy, and sometimes the consumer would decrement
the counter before the producer had incremented it, leading to integer
wraparound to usize::MAX. Calling Vec::with_capacity(usize::MAX) leads
to a panic.

To fix, replace the channel with a VecDeque protected by a Mutex, and
a condition variable for signaling. Now that the queue is protected
by a standard blocking Mutex and Condvar, refactor the functions
touching it to be sync, not async.

A theoretical downside of this is that the calls to push items to the
queue and the storage sync thread that drains the queue might now need
to wait, if another thread is busy manipulating the queue. I believe
that's OK; the lock isn't held for very long, and these operations are
made in background threads, not in the hot GetPage@LSN path, so
they're not very latency-sensitive.
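
A minimal, self-contained sketch of that queue shape (a VecDeque guarded by a std Mutex, with a Condvar for signaling; not the actual SyncQueue code):

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

/// Sketch of a blocking queue: the deque and its length live under one lock,
/// so there is no separate (racy) atomic counter to keep in sync.
struct BlockingQueue<T> {
    inner: Mutex<VecDeque<T>>,
    cond: Condvar,
}

impl<T> BlockingQueue<T> {
    fn new() -> Self {
        Self { inner: Mutex::new(VecDeque::new()), cond: Condvar::new() }
    }

    /// Push an item and wake one waiting consumer.
    fn push(&self, item: T) {
        let mut q = self.inner.lock().unwrap();
        q.push_back(item);
        self.cond.notify_one();
    }

    /// Block until at least one item is available, then drain the whole batch.
    fn pop_all(&self) -> Vec<T> {
        let mut q = self.inner.lock().unwrap();
        while q.is_empty() {
            q = self.cond.wait(q).unwrap();
        }
        q.drain(..).collect()
    }
}

fn main() {
    use std::sync::Arc;
    use std::thread;

    let queue = Arc::new(BlockingQueue::new());
    let consumer = {
        let queue = Arc::clone(&queue);
        thread::spawn(move || println!("drained {} item(s)", queue.pop_all().len()))
    };
    queue.push("layer-to-upload");
    consumer.join().unwrap();
}
```

Because the consumer observes the queue length under the same lock the producer pushes under, the count can never go negative, which is the wraparound that triggered the Vec::with_capacity(usize::MAX) panic.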

Fixes #1719. Also add a test case.
2022-05-17 18:14:57 +03:00
Heikki Linnakangas
f03779bf1a Fix wait_for_last_record_lsn() and wait_for_upload() python functions.
The contract for wait_for() was not very clear. It waits until the
given function returns successfully, without an exception, but the
wait_for_last_record_lsn() and wait_for_upload() functions used "a <
b" as the condition, i.e. they thought that wait_for() would poll
until the function returns true.

Inline the logic from wait_for() into those two functions; it's not
that complicated, and you get a more specific error message too if it
fails. Also add a comment to wait_for() to make it clearer how it
works.

Also change remote_consistent_lsn() to return 0 instead of raising an
exception, if remote is None. That can happen if nothing has been
uploaded to remote storage for the timeline yet. It happened once in
the CI, and I was able to reproduce that locally too by adding a sleep
to the storage sync thread, to delay the first upload.
2022-05-17 18:14:10 +03:00
Andrey Taranik
070c255522 Neon stress deploy (#1720)
* storage and proxy deployment for neon stress environment

* neon stress inventory fix
2022-05-17 18:03:01 +03:00
Heikki Linnakangas
9ccbb8d331 Make "neon_local stop" less verbose.
I got annoyed by all the noise in CI test output.

Before:

    $ ./target/release/neon_local stop
    Stop pageserver gracefully
    Pageserver still receives connections
    Pageserver stopped receiving connections
    Pageserver status is: Reqwest error: error sending request for url (http://127.0.0.1:9898/v1/status): error trying to connect: tcp connect error: Connection refused (os error 111)
    initializing for sk 1 for 7676
    Stop safekeeper gracefully
    Safekeeper still receives connections
    Safekeeper stopped receiving connections
    Safekeeper status is: Reqwest error: error sending request for url (http://127.0.0.1:7676/v1/status): error trying to connect: tcp connect error: Connection refused (os error 111)

After:

    $ ./target/release/neon_local stop
    Stopping pageserver gracefully...done!
    Stopping safekeeper 1 gracefully...done!

Also removes the spurious "initializing for sk 1 for 7676" message from
"neon_local start"
2022-05-17 10:31:13 +03:00
Kirill Bulatov
f2881bbd8a Start and stop single etcd and mock s3 servers globally in python tests 2022-05-17 01:17:44 +03:00
Kirill Bulatov
a884f4cf6b Add etcd to neon_local 2022-05-17 01:17:44 +03:00
Kirill Bulatov
9a0fed0880 Enable at least 1 safekeeper in every test 2022-05-17 01:17:44 +03:00
chaitanya sharma
bea84150b2 Fix the markdown rendering on 004-durability.md RFC 2022-05-17 00:16:42 +03:00
chaitanya sharma
85b5c0e989 List profiling as a feature with 'pageserver --enabled-features'
Fixes https://github.com/neondatabase/neon/issues/1627
2022-05-16 21:10:57 +03:00
Thang Pham
e4a70faa08 Add more information to timeline-related APIs (#1673)
Resolves #1488.

- implemented `GET tenant/:tenant_id/timeline/:timeline_id/wal_receiver` endpoint
- returned `thread_id` in `thread_mgr::spawn` 
- added `latest_gc_cutoff_lsn` field to `LocalTimelineInfo` struct
2022-05-16 11:05:43 -04:00
chaitanya sharma
c41549f630 Update readme build for osx (#1709) 2022-05-16 10:42:08 -04:00
Heikki Linnakangas
c700032dd2 Run the regression tests in CI also for PRs opened from forked repos. 2022-05-16 14:40:49 +03:00
Kirill Bulatov
33cac863d7 Test simple.conf and handle broker_endpoints better 2022-05-16 12:07:35 +03:00
Heikki Linnakangas
51ea9c3053 Don't swallow panics when the pageserver is built with failpoints.
It's very confusing, and because you don't get a stack trace or error
message in the logs, it makes debugging very hard. However, the
'test_pageserver_recovery' test relied on that behavior. To support that,
add a new "exit" action to the pageserver 'failpoints' command, so that
you can explicitly request that the process exit when a failpoint is hit.
2022-05-16 09:58:58 +03:00
94 changed files with 2573 additions and 1113 deletions

View File

@@ -0,0 +1,19 @@
[pageservers]
neon-stress-ps-1 console_region_id=1
neon-stress-ps-2 console_region_id=1
[safekeepers]
neon-stress-sk-1 console_region_id=1
neon-stress-sk-2 console_region_id=1
neon-stress-sk-3 console_region_id=1
[storage:children]
pageservers
safekeepers
[storage:vars]
console_mgmt_base_url = http://neon-stress-console.local
bucket_name = neon-storage-ireland
bucket_region = eu-west-1
etcd_endpoints = etcd-stress.local:2379
safekeeper_enable_s3_offload = false

View File

@@ -6,7 +6,7 @@ After=network.target auditd.service
Type=simple
User=pageserver
Environment=RUST_BACKTRACE=1 ZENITH_REPO_DIR=/storage/pageserver LD_LIBRARY_PATH=/usr/local/lib
ExecStart=/usr/local/bin/pageserver -c "pg_distrib_dir='/usr/local'" -c "listen_pg_addr='0.0.0.0:6400'" -c "listen_http_addr='0.0.0.0:9898'" -D /storage/pageserver/data
ExecStart=/usr/local/bin/pageserver -c "pg_distrib_dir='/usr/local'" -c "listen_pg_addr='0.0.0.0:6400'" -c "listen_http_addr='0.0.0.0:9898'" -c "broker_endpoints=['{{ etcd_endpoints }}']" -D /storage/pageserver/data
ExecReload=/bin/kill -HUP $MAINPID
KillMode=mixed
KillSignal=SIGINT

View File

@@ -222,6 +222,12 @@ jobs:
key: v2-python-deps-{{ checksum "poetry.lock" }}
paths:
- /home/circleci/.cache/pypoetry/virtualenvs
- run:
name: Print versions
when: always
command: |
poetry run python --version
poetry show
- run:
name: Run yapf to ensure code format
when: always
@@ -355,7 +361,7 @@ jobs:
when: always
command: |
du -sh /tmp/test_output/*
find /tmp/test_output -type f ! -name "pg.log" ! -name "pageserver.log" ! -name "safekeeper.log" ! -name "regression.diffs" ! -name "junit.xml" ! -name "*.filediff" ! -name "*.stdout" ! -name "*.stderr" ! -name "flamegraph.svg" -delete
find /tmp/test_output -type f ! -name "pg.log" ! -name "pageserver.log" ! -name "safekeeper.log" ! -name "etcd.log" ! -name "regression.diffs" ! -name "junit.xml" ! -name "*.filediff" ! -name "*.stdout" ! -name "*.stderr" ! -name "flamegraph.svg" ! -name "*.metrics" -delete
du -sh /tmp/test_output/*
- store_artifacts:
path: /tmp/test_output
@@ -587,6 +593,55 @@ jobs:
helm upgrade neon-proxy neondatabase/neon-proxy --install -f .circleci/helm-values/staging.proxy.yaml --set image.tag=${DOCKER_TAG} --wait
helm upgrade neon-proxy-scram neondatabase/neon-proxy --install -f .circleci/helm-values/staging.proxy-scram.yaml --set image.tag=${DOCKER_TAG} --wait
deploy-neon-stress:
docker:
- image: cimg/python:3.10
steps:
- checkout
- setup_remote_docker
- run:
name: Setup ansible
command: |
pip install --progress-bar off --user ansible boto3
- run:
name: Redeploy
command: |
cd "$(pwd)/.circleci/ansible"
./get_binaries.sh
echo "${TELEPORT_SSH_KEY}" | tr -d '\n'| base64 --decode >ssh-key
echo "${TELEPORT_SSH_CERT}" | tr -d '\n'| base64 --decode >ssh-key-cert.pub
chmod 0600 ssh-key
ssh-add ssh-key
rm -f ssh-key ssh-key-cert.pub
ansible-playbook deploy.yaml -i neon-stress.hosts
rm -f neon_install.tar.gz .neon_current_version
deploy-neon-stress-proxy:
docker:
- image: cimg/base:2021.04
environment:
KUBECONFIG: .kubeconfig
steps:
- checkout
- run:
name: Store kubeconfig file
command: |
echo "${NEON_STRESS_KUBECONFIG_DATA}" | base64 --decode > ${KUBECONFIG}
chmod 0600 ${KUBECONFIG}
- run:
name: Setup helm v3
command: |
curl -s https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm repo add neondatabase https://neondatabase.github.io/helm-charts
- run:
name: Re-deploy proxy
command: |
DOCKER_TAG=$(git log --oneline|wc -l)
helm upgrade neon-stress-proxy neondatabase/neon-proxy --install -f .circleci/helm-values/neon-stress.proxy.yaml --set image.tag=${DOCKER_TAG} --wait
deploy-release:
docker:
- image: cimg/python:3.10
@@ -771,6 +826,25 @@ workflows:
requires:
- docker-image
- deploy-neon-stress:
# Context gives an ability to login
context: Docker Hub
# deploy only for commits to main
filters:
branches:
only:
- main
requires:
- docker-image
- deploy-neon-stress-proxy:
# deploy only for commits to main
filters:
branches:
only:
- main
requires:
- docker-image
- docker-image-release:
# Context gives an ability to login
context: Docker Hub

View File

@@ -0,0 +1,34 @@
fullnameOverride: "neon-stress-proxy"
settings:
authEndpoint: "https://console.dev.neon.tech/authenticate_proxy_request/"
uri: "https://console.dev.neon.tech/psql_session/"
# -- Additional labels for zenith-proxy pods
podLabels:
zenith_service: proxy
zenith_env: staging
zenith_region: eu-west-1
zenith_region_slug: ireland
service:
annotations:
service.beta.kubernetes.io/aws-load-balancer-type: external
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internal
external-dns.alpha.kubernetes.io/hostname: neon-stress-proxy.local
type: LoadBalancer
exposedService:
annotations:
service.beta.kubernetes.io/aws-load-balancer-type: external
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
external-dns.alpha.kubernetes.io/hostname: connect.dev.neon.tech
metrics:
enabled: true
serviceMonitor:
enabled: true
selector:
release: kube-prometheus-stack

View File

@@ -1,6 +1,8 @@
name: Build and Test
on: push
on:
pull_request:
push:
jobs:
regression-check:

Cargo.lock (generated)
View File

@@ -166,7 +166,7 @@ dependencies = [
"cc",
"cfg-if",
"libc",
"miniz_oxide",
"miniz_oxide 0.4.4",
"object",
"rustc-demangle",
]
@@ -868,6 +868,18 @@ version = "0.4.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "279fb028e20b3c4c320317955b77c5e0c9701f05a1d309905d6fc702cdc5053e"
[[package]]
name = "flate2"
version = "1.0.23"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b39522e96686d38f4bc984b9198e3a0613264abaebaff2c5c918bfa6b6da09af"
dependencies = [
"cfg-if",
"crc32fast",
"libc",
"miniz_oxide 0.5.1",
]
[[package]]
name = "fnv"
version = "1.0.7"
@@ -1527,6 +1539,15 @@ dependencies = [
"autocfg",
]
[[package]]
name = "miniz_oxide"
version = "0.5.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d2b29bd4bc3f33391105ebee3589c19197c4271e3e5a9ec9bfe8127eeff8f082"
dependencies = [
"adler",
]
[[package]]
name = "mio"
version = "0.8.2"
@@ -1772,6 +1793,7 @@ dependencies = [
"crc32c",
"crossbeam-utils",
"daemonize",
"etcd_broker",
"fail",
"futures",
"git-version",
@@ -2087,6 +2109,20 @@ dependencies = [
"unicode-xid",
]
[[package]]
name = "procfs"
version = "0.10.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "95e344cafeaeefe487300c361654bcfc85db3ac53619eeccced29f5ea18c4c70"
dependencies = [
"bitflags",
"byteorder",
"flate2",
"hex",
"lazy_static",
"libc",
]
[[package]]
name = "prometheus"
version = "0.13.0"
@@ -2096,8 +2132,10 @@ dependencies = [
"cfg-if",
"fnv",
"lazy_static",
"libc",
"memchr",
"parking_lot 0.11.2",
"procfs",
"thiserror",
]

View File

@@ -15,4 +15,4 @@ RUN set -e \
# Final image that only has one binary
FROM debian:buster-slim
COPY --from=rust-build /home/circleci/project/target/release/zenith_ctl /usr/local/bin/zenith_ctl
COPY --from=rust-build /home/circleci/project/target/release/compute_ctl /usr/local/bin/compute_ctl

View File

@@ -12,12 +12,12 @@ endif
#
BUILD_TYPE ?= debug
ifeq ($(BUILD_TYPE),release)
PG_CONFIGURE_OPTS = --enable-debug
PG_CONFIGURE_OPTS = --enable-debug --with-openssl
PG_CFLAGS = -O2 -g3 $(CFLAGS)
# Unfortunately, `--profile=...` is a nightly feature
CARGO_BUILD_FLAGS += --release
else ifeq ($(BUILD_TYPE),debug)
PG_CONFIGURE_OPTS = --enable-debug --enable-cassert --enable-depend
PG_CONFIGURE_OPTS = --enable-debug --with-openssl --enable-cassert --enable-depend
PG_CFLAGS = -O0 -g3 $(CFLAGS)
else
$(error Bad build type `$(BUILD_TYPE)', see Makefile for options)

View File

@@ -23,6 +23,8 @@ Pageserver consists of:
## Running local installation
#### building on Ubuntu/ Debian (Linux)
1. Install build dependencies and other useful packages
On Ubuntu or Debian this set of packages should be sufficient to build the code:
@@ -31,21 +33,59 @@ apt install build-essential libtool libreadline-dev zlib1g-dev flex bison libsec
libssl-dev clang pkg-config libpq-dev
```
[Rust] 1.58 or later is also required.
2. [Install Rust](https://www.rust-lang.org/tools/install)
```
# recommended approach from https://www.rust-lang.org/tools/install
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```
To run the `psql` client, install the `postgresql-client` package or modify `PATH` and `LD_LIBRARY_PATH` to include `tmp_install/bin` and `tmp_install/lib`, respectively.
3. Install PostgreSQL Client
```
apt install postgresql-client
```
To run the integration tests or Python scripts (not required to use the code), install
Python (3.7 or higher), and install python3 packages using `./scripts/pysync` (requires poetry) in the project directory.
2. Build neon and patched postgres
4. Build neon and patched postgres
```sh
git clone --recursive https://github.com/neondatabase/neon.git
cd neon
make -j5
```
3. Start pageserver and postgres on top of it (should be called from repo root):
#### building on OSX (12.3.1)
1. Install XCode
```
xcode-select --install
```
2. [Install Rust](https://www.rust-lang.org/tools/install)
```
# recommended approach from https://www.rust-lang.org/tools/install
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```
3. Install PostgreSQL Client
```
# from https://stackoverflow.com/questions/44654216/correct-way-to-install-psql-without-full-postgres-on-macos
brew install libpq
brew link --force libpq
```
4. Build neon and patched postgres
```sh
git clone --recursive https://github.com/neondatabase/neon.git
cd neon
make -j5
```
#### dependency installation notes
To run the `psql` client, install the `postgresql-client` package or modify `PATH` and `LD_LIBRARY_PATH` to include `tmp_install/bin` and `tmp_install/lib`, respectively.
To run the integration tests or Python scripts (not required to use the code), install
Python (3.9 or higher), and install python3 packages using `./scripts/pysync` (requires poetry) in the project directory.
#### running neon database
1. Start pageserver and postgres on top of it (should be called from repo root):
```sh
# Create repository in .zenith with proper paths to binaries and data
# Later that would be responsibility of a package install script
@@ -75,7 +115,7 @@ Starting postgres node at 'host=127.0.0.1 port=55432 user=zenith_admin dbname=po
main 127.0.0.1:55432 de200bd42b49cc1814412c7e592dd6e9 main 0/16B5BA8 running
```
4. Now it is possible to connect to postgres and run some queries:
2. Now it is possible to connect to postgres and run some queries:
```text
> psql -p55432 -h 127.0.0.1 -U zenith_admin postgres
postgres=# CREATE TABLE t(key int primary key, value text);
@@ -89,7 +129,7 @@ postgres=# select * from t;
(1 row)
```
5. And create branches and run postgres on them:
3. And create branches and run postgres on them:
```sh
# create branch named migration_check
> ./target/debug/neon_local timeline branch --branch-name migration_check
@@ -133,7 +173,7 @@ postgres=# select * from t;
(1 row)
```
6. If you want to run tests afterwards (see below), you have to stop all the running the pageserver, safekeeper and postgres instances
4. If you want to run tests afterwards (see below), you have to stop all the running the pageserver, safekeeper and postgres instances
you have just started. You can stop them all with one command:
```sh
> ./target/debug/neon_local stop

View File

@@ -1,9 +1,9 @@
# Compute node tools
Postgres wrapper (`zenith_ctl`) is intended to be run as a Docker entrypoint or as a `systemd`
`ExecStart` option. It will handle all the `zenith` specifics during compute node
Postgres wrapper (`compute_ctl`) is intended to be run as a Docker entrypoint or as a `systemd`
`ExecStart` option. It will handle all the `Neon` specifics during compute node
initialization:
- `zenith_ctl` accepts cluster (compute node) specification as a JSON file.
- `compute_ctl` accepts cluster (compute node) specification as a JSON file.
- Every start is a fresh start, so the data directory is removed and
initialized again on each run.
- Next it will put configuration files into the `PGDATA` directory.
@@ -13,18 +13,18 @@ initialization:
- Check and alter/drop/create roles and databases.
- Hang waiting on the `postmaster` process to exit.
Also `zenith_ctl` spawns two separate service threads:
Also `compute_ctl` spawns two separate service threads:
- `compute-monitor` checks the last Postgres activity timestamp and saves it
into the shared `ComputeState`;
into the shared `ComputeNode`;
- `http-endpoint` runs a Hyper HTTP API server, which serves readiness and the
last activity requests.
Usage example:
```sh
zenith_ctl -D /var/db/postgres/compute \
-C 'postgresql://zenith_admin@localhost/postgres' \
-S /var/db/postgres/specs/current.json \
-b /usr/local/bin/postgres
compute_ctl -D /var/db/postgres/compute \
-C 'postgresql://zenith_admin@localhost/postgres' \
-S /var/db/postgres/specs/current.json \
-b /usr/local/bin/postgres
```
## Tests

View File

@@ -0,0 +1,174 @@
//!
//! Postgres wrapper (`compute_ctl`) is intended to be run as a Docker entrypoint or as a `systemd`
//! `ExecStart` option. It will handle all the `Neon` specifics during compute node
//! initialization:
//! - `compute_ctl` accepts cluster (compute node) specification as a JSON file.
//! - Every start is a fresh start, so the data directory is removed and
//! initialized again on each run.
//! - Next it will put configuration files into the `PGDATA` directory.
//! - Sync safekeepers and get commit LSN.
//! - Get `basebackup` from pageserver using the returned on the previous step LSN.
//! - Try to start `postgres` and wait until it is ready to accept connections.
//! - Check and alter/drop/create roles and databases.
//! - Hang waiting on the `postmaster` process to exit.
//!
//! Also `compute_ctl` spawns two separate service threads:
//! - `compute-monitor` checks the last Postgres activity timestamp and saves it
//! into the shared `ComputeNode`;
//! - `http-endpoint` runs a Hyper HTTP API server, which serves readiness and the
//! last activity requests.
//!
//! Usage example:
//! ```sh
//! compute_ctl -D /var/db/postgres/compute \
//! -C 'postgresql://zenith_admin@localhost/postgres' \
//! -S /var/db/postgres/specs/current.json \
//! -b /usr/local/bin/postgres
//! ```
//!
use std::fs::File;
use std::panic;
use std::path::Path;
use std::process::exit;
use std::sync::{Arc, RwLock};
use std::{thread, time::Duration};
use anyhow::Result;
use chrono::Utc;
use clap::Arg;
use log::{error, info};
use compute_tools::compute::{ComputeMetrics, ComputeNode, ComputeState, ComputeStatus};
use compute_tools::http::api::launch_http_server;
use compute_tools::logger::*;
use compute_tools::monitor::launch_monitor;
use compute_tools::params::*;
use compute_tools::pg_helpers::*;
use compute_tools::spec::*;
fn main() -> Result<()> {
// TODO: re-use `utils::logging` later
init_logger(DEFAULT_LOG_LEVEL)?;
// Env variable is set by `cargo`
let version: Option<&str> = option_env!("CARGO_PKG_VERSION");
let matches = clap::App::new("compute_ctl")
.version(version.unwrap_or("unknown"))
.arg(
Arg::new("connstr")
.short('C')
.long("connstr")
.value_name("DATABASE_URL")
.required(true),
)
.arg(
Arg::new("pgdata")
.short('D')
.long("pgdata")
.value_name("DATADIR")
.required(true),
)
.arg(
Arg::new("pgbin")
.short('b')
.long("pgbin")
.value_name("POSTGRES_PATH"),
)
.arg(
Arg::new("spec")
.short('s')
.long("spec")
.value_name("SPEC_JSON"),
)
.arg(
Arg::new("spec-path")
.short('S')
.long("spec-path")
.value_name("SPEC_PATH"),
)
.get_matches();
let pgdata = matches.value_of("pgdata").expect("PGDATA path is required");
let connstr = matches
.value_of("connstr")
.expect("Postgres connection string is required");
let spec = matches.value_of("spec");
let spec_path = matches.value_of("spec-path");
// Try to use just 'postgres' if no path is provided
let pgbin = matches.value_of("pgbin").unwrap_or("postgres");
let spec: ComputeSpec = match spec {
// First, try to get cluster spec from the cli argument
Some(json) => serde_json::from_str(json)?,
None => {
// Second, try to read it from the file if path is provided
if let Some(sp) = spec_path {
let path = Path::new(sp);
let file = File::open(path)?;
serde_json::from_reader(file)?
} else {
panic!("cluster spec should be provided via --spec or --spec-path argument");
}
}
};
let pageserver_connstr = spec
.cluster
.settings
.find("zenith.page_server_connstring")
.expect("pageserver connstr should be provided");
let tenant = spec
.cluster
.settings
.find("zenith.zenith_tenant")
.expect("tenant id should be provided");
let timeline = spec
.cluster
.settings
.find("zenith.zenith_timeline")
.expect("tenant id should be provided");
let compute_state = ComputeNode {
start_time: Utc::now(),
connstr: connstr.to_string(),
pgdata: pgdata.to_string(),
pgbin: pgbin.to_string(),
spec,
tenant,
timeline,
pageserver_connstr,
metrics: ComputeMetrics::new(),
state: RwLock::new(ComputeState::new()),
};
let compute = Arc::new(compute_state);
// Launch service threads first, so we were able to serve availability
// requests, while configuration is still in progress.
let _http_handle = launch_http_server(&compute).expect("cannot launch http endpoint thread");
let _monitor_handle = launch_monitor(&compute).expect("cannot launch compute monitor thread");
// Run compute (Postgres) and hang waiting on it.
match compute.prepare_and_run() {
Ok(ec) => {
let code = ec.code().unwrap_or(1);
info!("Postgres exited with code {}, shutting down", code);
exit(code)
}
Err(error) => {
error!("could not start the compute node: {}", error);
let mut state = compute.state.write().unwrap();
state.error = Some(format!("{:?}", error));
state.status = ComputeStatus::Failed;
drop(state);
// Keep serving HTTP requests, so the cloud control plane was able to
// get the actual error.
info!("giving control plane 30s to collect the error before shutdown");
thread::sleep(Duration::from_secs(30));
info!("shutting down");
Err(error)
}
}
}

View File

@@ -1,252 +0,0 @@
//!
//! Postgres wrapper (`zenith_ctl`) is intended to be run as a Docker entrypoint or as a `systemd`
//! `ExecStart` option. It will handle all the `zenith` specifics during compute node
//! initialization:
//! - `zenith_ctl` accepts cluster (compute node) specification as a JSON file.
//! - Every start is a fresh start, so the data directory is removed and
//! initialized again on each run.
//! - Next it will put configuration files into the `PGDATA` directory.
//! - Sync safekeepers and get commit LSN.
//! - Get `basebackup` from pageserver using the returned on the previous step LSN.
//! - Try to start `postgres` and wait until it is ready to accept connections.
//! - Check and alter/drop/create roles and databases.
//! - Hang waiting on the `postmaster` process to exit.
//!
//! Also `zenith_ctl` spawns two separate service threads:
//! - `compute-monitor` checks the last Postgres activity timestamp and saves it
//! into the shared `ComputeState`;
//! - `http-endpoint` runs a Hyper HTTP API server, which serves readiness and the
//! last activity requests.
//!
//! Usage example:
//! ```sh
//! zenith_ctl -D /var/db/postgres/compute \
//! -C 'postgresql://zenith_admin@localhost/postgres' \
//! -S /var/db/postgres/specs/current.json \
//! -b /usr/local/bin/postgres
//! ```
//!
use std::fs::File;
use std::panic;
use std::path::Path;
use std::process::{exit, Command, ExitStatus};
use std::sync::{Arc, RwLock};
use anyhow::{Context, Result};
use chrono::Utc;
use clap::Arg;
use log::info;
use postgres::{Client, NoTls};
use compute_tools::checker::create_writablity_check_data;
use compute_tools::config;
use compute_tools::http_api::launch_http_server;
use compute_tools::logger::*;
use compute_tools::monitor::launch_monitor;
use compute_tools::params::*;
use compute_tools::pg_helpers::*;
use compute_tools::spec::*;
use compute_tools::zenith::*;
/// Do all the preparations like PGDATA directory creation, configuration,
/// safekeepers sync, basebackup, etc.
fn prepare_pgdata(state: &Arc<RwLock<ComputeState>>) -> Result<()> {
let state = state.read().unwrap();
let spec = &state.spec;
let pgdata_path = Path::new(&state.pgdata);
let pageserver_connstr = spec
.cluster
.settings
.find("zenith.page_server_connstring")
.expect("pageserver connstr should be provided");
let tenant = spec
.cluster
.settings
.find("zenith.zenith_tenant")
.expect("tenant id should be provided");
let timeline = spec
.cluster
.settings
.find("zenith.zenith_timeline")
.expect("tenant id should be provided");
info!(
"starting cluster #{}, operation #{}",
spec.cluster.cluster_id,
spec.operation_uuid.as_ref().unwrap()
);
// Remove/create an empty pgdata directory and put configuration there.
create_pgdata(&state.pgdata)?;
config::write_postgres_conf(&pgdata_path.join("postgresql.conf"), spec)?;
info!("starting safekeepers syncing");
let lsn = sync_safekeepers(&state.pgdata, &state.pgbin)
.with_context(|| "failed to sync safekeepers")?;
info!("safekeepers synced at LSN {}", lsn);
info!(
"getting basebackup@{} from pageserver {}",
lsn, pageserver_connstr
);
get_basebackup(&state.pgdata, &pageserver_connstr, &tenant, &timeline, &lsn).with_context(
|| {
format!(
"failed to get basebackup@{} from pageserver {}",
lsn, pageserver_connstr
)
},
)?;
// Update pg_hba.conf received with basebackup.
update_pg_hba(pgdata_path)?;
Ok(())
}
/// Start Postgres as a child process and manage DBs/roles.
/// After that this will hang waiting on the postmaster process to exit.
fn run_compute(state: &Arc<RwLock<ComputeState>>) -> Result<ExitStatus> {
let read_state = state.read().unwrap();
let pgdata_path = Path::new(&read_state.pgdata);
// Run postgres as a child process.
let mut pg = Command::new(&read_state.pgbin)
.args(&["-D", &read_state.pgdata])
.spawn()
.expect("cannot start postgres process");
// Try default Postgres port if it is not provided
let port = read_state
.spec
.cluster
.settings
.find("port")
.unwrap_or_else(|| "5432".to_string());
wait_for_postgres(&port, pgdata_path)?;
let mut client = Client::connect(&read_state.connstr, NoTls)?;
handle_roles(&read_state.spec, &mut client)?;
handle_databases(&read_state.spec, &mut client)?;
handle_grants(&read_state.spec, &mut client)?;
create_writablity_check_data(&mut client)?;
// 'Close' connection
drop(client);
info!(
"finished configuration of cluster #{}",
read_state.spec.cluster.cluster_id
);
// Release the read lock.
drop(read_state);
// Get the write lock, update state and release the lock, so HTTP API
// was able to serve requests, while we are blocked waiting on
// Postgres.
let mut state = state.write().unwrap();
state.ready = true;
drop(state);
// Wait for child postgres process basically forever. In this state Ctrl+C
// will be propagated to postgres and it will be shut down as well.
let ecode = pg.wait().expect("failed to wait on postgres");
Ok(ecode)
}
fn main() -> Result<()> {
// TODO: re-use `utils::logging` later
init_logger(DEFAULT_LOG_LEVEL)?;
// Env variable is set by `cargo`
let version: Option<&str> = option_env!("CARGO_PKG_VERSION");
let matches = clap::App::new("zenith_ctl")
.version(version.unwrap_or("unknown"))
.arg(
Arg::new("connstr")
.short('C')
.long("connstr")
.value_name("DATABASE_URL")
.required(true),
)
.arg(
Arg::new("pgdata")
.short('D')
.long("pgdata")
.value_name("DATADIR")
.required(true),
)
.arg(
Arg::new("pgbin")
.short('b')
.long("pgbin")
.value_name("POSTGRES_PATH"),
)
.arg(
Arg::new("spec")
.short('s')
.long("spec")
.value_name("SPEC_JSON"),
)
.arg(
Arg::new("spec-path")
.short('S')
.long("spec-path")
.value_name("SPEC_PATH"),
)
.get_matches();
let pgdata = matches.value_of("pgdata").expect("PGDATA path is required");
let connstr = matches
.value_of("connstr")
.expect("Postgres connection string is required");
let spec = matches.value_of("spec");
let spec_path = matches.value_of("spec-path");
// Try to use just 'postgres' if no path is provided
let pgbin = matches.value_of("pgbin").unwrap_or("postgres");
let spec: ClusterSpec = match spec {
// First, try to get cluster spec from the cli argument
Some(json) => serde_json::from_str(json)?,
None => {
// Second, try to read it from the file if path is provided
if let Some(sp) = spec_path {
let path = Path::new(sp);
let file = File::open(path)?;
serde_json::from_reader(file)?
} else {
panic!("cluster spec should be provided via --spec or --spec-path argument");
}
}
};
let compute_state = ComputeState {
connstr: connstr.to_string(),
pgdata: pgdata.to_string(),
pgbin: pgbin.to_string(),
spec,
ready: false,
last_active: Utc::now(),
};
let compute_state = Arc::new(RwLock::new(compute_state));
// Launch service threads first, so we were able to serve availability
// requests, while configuration is still in progress.
let mut _threads = vec![
launch_http_server(&compute_state).expect("cannot launch compute monitor thread"),
launch_monitor(&compute_state).expect("cannot launch http endpoint thread"),
];
prepare_pgdata(&compute_state)?;
// Run compute (Postgres) and hang waiting on it. Panic if any error happens,
// it will help us to trigger unwind and kill postmaster as well.
match run_compute(&compute_state) {
Ok(ec) => exit(ec.success() as i32),
Err(error) => panic!("cannot start compute node, error: {}", error),
}
}

View File

@@ -1,11 +1,11 @@
use std::sync::{Arc, RwLock};
use std::sync::Arc;
use anyhow::{anyhow, Result};
use log::error;
use postgres::Client;
use tokio_postgres::NoTls;
use crate::zenith::ComputeState;
use crate::compute::ComputeNode;
pub fn create_writablity_check_data(client: &mut Client) -> Result<()> {
let query = "
@@ -23,9 +23,9 @@ pub fn create_writablity_check_data(client: &mut Client) -> Result<()> {
Ok(())
}
pub async fn check_writability(state: &Arc<RwLock<ComputeState>>) -> Result<()> {
let connstr = state.read().unwrap().connstr.clone();
let (client, connection) = tokio_postgres::connect(&connstr, NoTls).await?;
pub async fn check_writability(compute: &Arc<ComputeNode>) -> Result<()> {
let connstr = &compute.connstr;
let (client, connection) = tokio_postgres::connect(connstr, NoTls).await?;
if client.is_closed() {
return Err(anyhow!("connection to postgres closed"));
}

View File

@@ -0,0 +1,315 @@
//
// XXX: This starts to be scarry similar to the `PostgresNode` from `control_plane`,
// but there are several things that makes `PostgresNode` usage inconvenient in the
// cloud:
// - it inherits from `LocalEnv`, which contains **all-all** the information about
// a complete service running
// - it uses `PageServerNode` with information about http endpoint, which we do not
// need in the cloud again
// - many tiny pieces like, for example, we do not use `pg_ctl` in the cloud
//
// Thus, to use `PostgresNode` in the cloud, we need to 'mock' a bunch of required
// attributes (not required for the cloud). Yet, it is still tempting to unify these
// `PostgresNode` and `ComputeNode` and use one in both places.
//
// TODO: stabilize `ComputeNode` and think about using it in the `control_plane`.
//
use std::fs;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;
use std::process::{Command, ExitStatus, Stdio};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::RwLock;
use anyhow::{Context, Result};
use chrono::{DateTime, Utc};
use log::info;
use postgres::{Client, NoTls};
use serde::{Serialize, Serializer};
use crate::checker::create_writablity_check_data;
use crate::config;
use crate::pg_helpers::*;
use crate::spec::*;
/// Compute node info shared across several `compute_ctl` threads.
pub struct ComputeNode {
pub start_time: DateTime<Utc>,
pub connstr: String,
pub pgdata: String,
pub pgbin: String,
pub spec: ComputeSpec,
pub tenant: String,
pub timeline: String,
pub pageserver_connstr: String,
pub metrics: ComputeMetrics,
/// Volatile part of the `ComputeNode` so should be used under `RwLock`
/// to allow HTTP API server to serve status requests, while configuration
/// is in progress.
pub state: RwLock<ComputeState>,
}
fn rfc3339_serialize<S>(x: &DateTime<Utc>, s: S) -> Result<S::Ok, S::Error>
where
S: Serializer,
{
x.to_rfc3339().serialize(s)
}
#[derive(Serialize)]
#[serde(rename_all = "snake_case")]
pub struct ComputeState {
pub status: ComputeStatus,
/// Timestamp of the last Postgres activity
#[serde(serialize_with = "rfc3339_serialize")]
pub last_active: DateTime<Utc>,
pub error: Option<String>,
}
impl ComputeState {
pub fn new() -> Self {
Self {
status: ComputeStatus::Init,
last_active: Utc::now(),
error: None,
}
}
}
impl Default for ComputeState {
fn default() -> Self {
Self::new()
}
}
#[derive(Serialize, Clone, Copy, PartialEq, Eq)]
#[serde(rename_all = "snake_case")]
pub enum ComputeStatus {
Init,
Running,
Failed,
}
#[derive(Serialize)]
pub struct ComputeMetrics {
pub sync_safekeepers_ms: AtomicU64,
pub basebackup_ms: AtomicU64,
pub config_ms: AtomicU64,
pub total_startup_ms: AtomicU64,
}
impl ComputeMetrics {
pub fn new() -> Self {
Self {
sync_safekeepers_ms: AtomicU64::new(0),
basebackup_ms: AtomicU64::new(0),
config_ms: AtomicU64::new(0),
total_startup_ms: AtomicU64::new(0),
}
}
}
impl Default for ComputeMetrics {
fn default() -> Self {
Self::new()
}
}
impl ComputeNode {
pub fn set_status(&self, status: ComputeStatus) {
self.state.write().unwrap().status = status;
}
pub fn get_status(&self) -> ComputeStatus {
self.state.read().unwrap().status
}
// Remove `pgdata` directory and create it again with right permissions.
fn create_pgdata(&self) -> Result<()> {
// Ignore removal error, likely it is a 'No such file or directory (os error 2)'.
// If it is something different then create_dir() will error out anyway.
let _ok = fs::remove_dir_all(&self.pgdata);
fs::create_dir(&self.pgdata)?;
fs::set_permissions(&self.pgdata, fs::Permissions::from_mode(0o700))?;
Ok(())
}
// Get basebackup from the libpq connection to pageserver using `connstr` and
// unarchive it to `pgdata` directory overriding all its previous content.
fn get_basebackup(&self, lsn: &str) -> Result<()> {
let start_time = Utc::now();
let mut client = Client::connect(&self.pageserver_connstr, NoTls)?;
let basebackup_cmd = match lsn {
"0/0" => format!("basebackup {} {}", &self.tenant, &self.timeline), // First start of the compute
_ => format!("basebackup {} {} {}", &self.tenant, &self.timeline, lsn),
};
let copyreader = client.copy_out(basebackup_cmd.as_str())?;
let mut ar = tar::Archive::new(copyreader);
ar.unpack(&self.pgdata)?;
self.metrics.basebackup_ms.store(
Utc::now()
.signed_duration_since(start_time)
.to_std()
.unwrap()
.as_millis() as u64,
Ordering::Relaxed,
);
Ok(())
}
// Run `postgres` in a special mode with `--sync-safekeepers` argument
// and return the reported LSN back to the caller.
fn sync_safekeepers(&self) -> Result<String> {
let start_time = Utc::now();
let sync_handle = Command::new(&self.pgbin)
.args(&["--sync-safekeepers"])
.env("PGDATA", &self.pgdata) // we cannot use -D in this mode
.stdout(Stdio::piped())
.spawn()
.expect("postgres --sync-safekeepers failed to start");
// `postgres --sync-safekeepers` will print all log output to stderr and
// final LSN to stdout. So we pipe only stdout, while stderr will be automatically
// redirected to the caller output.
let sync_output = sync_handle
.wait_with_output()
.expect("postgres --sync-safekeepers failed");
if !sync_output.status.success() {
anyhow::bail!(
"postgres --sync-safekeepers exited with non-zero status: {}",
sync_output.status,
);
}
self.metrics.sync_safekeepers_ms.store(
Utc::now()
.signed_duration_since(start_time)
.to_std()
.unwrap()
.as_millis() as u64,
Ordering::Relaxed,
);
let lsn = String::from(String::from_utf8(sync_output.stdout)?.trim());
Ok(lsn)
}
/// Do all the preparations like PGDATA directory creation, configuration,
/// safekeepers sync, basebackup, etc.
pub fn prepare_pgdata(&self) -> Result<()> {
let spec = &self.spec;
let pgdata_path = Path::new(&self.pgdata);
// Remove/create an empty pgdata directory and put configuration there.
self.create_pgdata()?;
config::write_postgres_conf(&pgdata_path.join("postgresql.conf"), spec)?;
info!("starting safekeepers syncing");
let lsn = self
.sync_safekeepers()
.with_context(|| "failed to sync safekeepers")?;
info!("safekeepers synced at LSN {}", lsn);
info!(
"getting basebackup@{} from pageserver {}",
lsn, &self.pageserver_connstr
);
self.get_basebackup(&lsn).with_context(|| {
format!(
"failed to get basebackup@{} from pageserver {}",
lsn, &self.pageserver_connstr
)
})?;
// Update pg_hba.conf received with basebackup.
update_pg_hba(pgdata_path)?;
Ok(())
}
/// Start Postgres as a child process and manage DBs/roles.
/// After that this will hang waiting on the postmaster process to exit.
pub fn run(&self) -> Result<ExitStatus> {
let start_time = Utc::now();
let pgdata_path = Path::new(&self.pgdata);
// Run postgres as a child process.
let mut pg = Command::new(&self.pgbin)
.args(&["-D", &self.pgdata])
.spawn()
.expect("cannot start postgres process");
// Try default Postgres port if it is not provided
let port = self
.spec
.cluster
.settings
.find("port")
.unwrap_or_else(|| "5432".to_string());
wait_for_postgres(&mut pg, &port, pgdata_path)?;
let mut client = Client::connect(&self.connstr, NoTls)?;
handle_roles(&self.spec, &mut client)?;
handle_databases(&self.spec, &mut client)?;
handle_grants(&self.spec, &mut client)?;
create_writablity_check_data(&mut client)?;
// 'Close' connection
drop(client);
let startup_end_time = Utc::now();
self.metrics.config_ms.store(
startup_end_time
.signed_duration_since(start_time)
.to_std()
.unwrap()
.as_millis() as u64,
Ordering::Relaxed,
);
self.metrics.total_startup_ms.store(
startup_end_time
.signed_duration_since(self.start_time)
.to_std()
.unwrap()
.as_millis() as u64,
Ordering::Relaxed,
);
self.set_status(ComputeStatus::Running);
info!(
"finished configuration of compute for project {}",
self.spec.cluster.cluster_id
);
// Wait for child Postgres process basically forever. In this state Ctrl+C
// will propagate to Postgres and it will be shut down as well.
let ecode = pg
.wait()
.expect("failed to start waiting on Postgres process");
Ok(ecode)
}
pub fn prepare_and_run(&self) -> Result<ExitStatus> {
info!(
"starting compute for project {}, operation {}, tenant {}, timeline {}",
self.spec.cluster.cluster_id,
self.spec.operation_uuid.as_ref().unwrap(),
self.tenant,
self.timeline,
);
self.prepare_pgdata()?;
self.run()
}
}

View File

@@ -6,7 +6,7 @@ use std::path::Path;
use anyhow::Result;
use crate::pg_helpers::PgOptionsSerialize;
use crate::zenith::ClusterSpec;
use crate::spec::ComputeSpec;
/// Check that `line` is inside a text file and put it there if it is not.
/// Create file if it doesn't exist.
@@ -32,20 +32,20 @@ pub fn line_in_file(path: &Path, line: &str) -> Result<bool> {
}
/// Create or completely rewrite configuration file specified by `path`
pub fn write_postgres_conf(path: &Path, spec: &ClusterSpec) -> Result<()> {
pub fn write_postgres_conf(path: &Path, spec: &ComputeSpec) -> Result<()> {
// File::create() destroys the file content if it exists.
let mut postgres_conf = File::create(path)?;
write_zenith_managed_block(&mut postgres_conf, &spec.cluster.settings.as_pg_settings())?;
write_auto_managed_block(&mut postgres_conf, &spec.cluster.settings.as_pg_settings())?;
Ok(())
}
// Write Postgres config block wrapped with generated comment section
fn write_zenith_managed_block(file: &mut File, buf: &str) -> Result<()> {
writeln!(file, "# Managed by Zenith: begin")?;
fn write_auto_managed_block(file: &mut File, buf: &str) -> Result<()> {
writeln!(file, "# Managed by compute_ctl: begin")?;
writeln!(file, "{}", buf)?;
writeln!(file, "# Managed by Zenith: end")?;
writeln!(file, "# Managed by compute_ctl: end")?;
Ok(())
}

View File

@@ -1,37 +1,64 @@
use std::convert::Infallible;
use std::net::SocketAddr;
use std::sync::{Arc, RwLock};
use std::sync::Arc;
use std::thread;
use anyhow::Result;
use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Method, Request, Response, Server, StatusCode};
use log::{error, info};
use serde_json;
use crate::zenith::*;
use crate::compute::{ComputeNode, ComputeStatus};
// Service function to handle all available routes.
async fn routes(req: Request<Body>, state: Arc<RwLock<ComputeState>>) -> Response<Body> {
async fn routes(req: Request<Body>, compute: Arc<ComputeNode>) -> Response<Body> {
match (req.method(), req.uri().path()) {
// Timestamp of the last Postgres activity in the plain text.
// DEPRECATED in favour of /status
(&Method::GET, "/last_activity") => {
info!("serving /last_active GET request");
let state = state.read().unwrap();
let state = compute.state.read().unwrap();
// Use RFC3339 format for consistency.
Response::new(Body::from(state.last_active.to_rfc3339()))
}
// Has compute setup process finished? -> true/false
// Has compute setup process finished? -> true/false.
// DEPRECATED in favour of /status
(&Method::GET, "/ready") => {
info!("serving /ready GET request");
let state = state.read().unwrap();
Response::new(Body::from(format!("{}", state.ready)))
let status = compute.get_status();
Response::new(Body::from(format!("{}", status == ComputeStatus::Running)))
}
// Serialized compute state.
(&Method::GET, "/status") => {
info!("serving /status GET request");
let state = compute.state.read().unwrap();
Response::new(Body::from(serde_json::to_string(&*state).unwrap()))
}
// Startup metrics in JSON format. Keep /metrics reserved for a possible
// future use for Prometheus metrics format.
(&Method::GET, "/metrics.json") => {
info!("serving /metrics.json GET request");
Response::new(Body::from(serde_json::to_string(&compute.metrics).unwrap()))
}
// DEPRECATED, use POST instead
(&Method::GET, "/check_writability") => {
info!("serving /check_writability GET request");
let res = crate::checker::check_writability(&state).await;
let res = crate::checker::check_writability(&compute).await;
match res {
Ok(_) => Response::new(Body::from("true")),
Err(e) => Response::new(Body::from(e.to_string())),
}
}
(&Method::POST, "/check_writability") => {
info!("serving /check_writability POST request");
let res = crate::checker::check_writability(&compute).await;
match res {
Ok(_) => Response::new(Body::from("true")),
Err(e) => Response::new(Body::from(e.to_string())),
@@ -49,7 +76,7 @@ async fn routes(req: Request<Body>, state: Arc<RwLock<ComputeState>>) -> Respons
// Main Hyper HTTP server function that runs it and blocks waiting on it forever.
#[tokio::main]
async fn serve(state: Arc<RwLock<ComputeState>>) {
async fn serve(state: Arc<ComputeNode>) {
let addr = SocketAddr::from(([0, 0, 0, 0], 3080));
let make_service = make_service_fn(move |_conn| {
@@ -73,7 +100,7 @@ async fn serve(state: Arc<RwLock<ComputeState>>) {
}
/// Launch a separate Hyper HTTP API server thread and return its `JoinHandle`.
pub fn launch_http_server(state: &Arc<RwLock<ComputeState>>) -> Result<thread::JoinHandle<()>> {
pub fn launch_http_server(state: &Arc<ComputeNode>) -> Result<thread::JoinHandle<()>> {
let state = Arc::clone(state);
Ok(thread::Builder::new()

View File

@@ -0,0 +1 @@
pub mod api;

View File

@@ -0,0 +1,158 @@
openapi: "3.0.2"
info:
title: Compute node control API
version: "1.0"
servers:
- url: "http://localhost:3080"
paths:
/status:
get:
tags:
- "info"
summary: Get compute node internal status
description: ""
operationId: getComputeStatus
responses:
"200":
description: ComputeState
content:
application/json:
schema:
$ref: "#/components/schemas/ComputeState"
/metrics.json:
get:
tags:
- "info"
summary: Get compute node startup metrics in JSON format
description: ""
operationId: getComputeMetricsJSON
responses:
"200":
description: ComputeMetrics
content:
application/json:
schema:
$ref: "#/components/schemas/ComputeMetrics"
/ready:
get:
deprecated: true
tags:
- "info"
summary: Check whether compute startup process finished successfully
description: ""
operationId: computeIsReady
responses:
"200":
description: Compute is ready ('true') or not ('false')
content:
text/plain:
schema:
type: string
example: "true"
/last_activity:
get:
deprecated: true
tags:
- "info"
summary: Get timestamp of the last compute activity
description: ""
operationId: getLastComputeActivityTS
responses:
"200":
description: Timestamp of the last compute activity
content:
text/plain:
schema:
type: string
example: "2022-10-12T07:20:50.52Z"
/check_writability:
get:
deprecated: true
tags:
- "check"
summary: Check that we can write new data on this compute
description: ""
operationId: checkComputeWritabilityDeprecated
responses:
"200":
description: Check result
content:
text/plain:
schema:
type: string
description: Error text or 'true' if check passed
example: "true"
post:
tags:
- "check"
summary: Check that we can write new data on this compute
description: ""
operationId: checkComputeWritability
responses:
"200":
description: Check result
content:
text/plain:
schema:
type: string
description: Error text or 'true' if check passed
example: "true"
components:
securitySchemes:
JWT:
type: http
scheme: bearer
bearerFormat: JWT
schemas:
ComputeMetrics:
type: object
description: Compute startup metrics
required:
- sync_safekeepers_ms
- basebackup_ms
- config_ms
- total_startup_ms
properties:
sync_safekeepers_ms:
type: integer
basebackup_ms:
type: integer
config_ms:
type: integer
total_startup_ms:
type: integer
ComputeState:
type: object
required:
- status
- last_active
properties:
status:
$ref: '#/components/schemas/ComputeStatus'
last_active:
type: string
description: The last detected compute activity timestamp in UTC and RFC3339 format
example: "2022-10-12T07:20:50.52Z"
error:
type: string
description: Text of the error during compute startup, if any
ComputeStatus:
type: string
enum:
- init
- failed
- running
security:
- JWT: []

View File

@@ -4,11 +4,11 @@
//!
pub mod checker;
pub mod config;
pub mod http_api;
pub mod http;
#[macro_use]
pub mod logger;
pub mod compute;
pub mod monitor;
pub mod params;
pub mod pg_helpers;
pub mod spec;
pub mod zenith;

View File

@@ -1,4 +1,4 @@
use std::sync::{Arc, RwLock};
use std::sync::Arc;
use std::{thread, time};
use anyhow::Result;
@@ -6,16 +6,16 @@ use chrono::{DateTime, Utc};
use log::{debug, info};
use postgres::{Client, NoTls};
use crate::zenith::ComputeState;
use crate::compute::ComputeNode;
const MONITOR_CHECK_INTERVAL: u64 = 500; // milliseconds
// Spin in a loop and figure out the last activity time in the Postgres.
// Then update it in the shared state. This function never errors out.
// XXX: the only expected panic is at `RwLock` unwrap().
fn watch_compute_activity(state: &Arc<RwLock<ComputeState>>) {
fn watch_compute_activity(compute: &Arc<ComputeNode>) {
// Suppose that `connstr` doesn't change
let connstr = state.read().unwrap().connstr.clone();
let connstr = compute.connstr.clone();
// Define `client` outside of the loop to reuse existing connection if it's active.
let mut client = Client::connect(&connstr, NoTls);
let timeout = time::Duration::from_millis(MONITOR_CHECK_INTERVAL);
@@ -46,7 +46,7 @@ fn watch_compute_activity(state: &Arc<RwLock<ComputeState>>) {
AND usename != 'zenith_admin';", // XXX: find a better way to filter other monitors?
&[],
);
let mut last_active = state.read().unwrap().last_active;
let mut last_active = compute.state.read().unwrap().last_active;
if let Ok(backs) = backends {
let mut idle_backs: Vec<DateTime<Utc>> = vec![];
@@ -83,14 +83,14 @@ fn watch_compute_activity(state: &Arc<RwLock<ComputeState>>) {
}
// Update the last activity in the shared state if we got a more recent one.
let mut state = state.write().unwrap();
let mut state = compute.state.write().unwrap();
if last_active > state.last_active {
state.last_active = last_active;
debug!("set the last compute activity time to: {}", last_active);
}
}
Err(e) => {
info!("cannot connect to postgres: {}, retrying", e);
debug!("cannot connect to postgres: {}, retrying", e);
// Establish a new connection and try again.
client = Client::connect(&connstr, NoTls);
@@ -100,7 +100,7 @@ fn watch_compute_activity(state: &Arc<RwLock<ComputeState>>) {
}
/// Launch a separate compute monitor thread and return its `JoinHandle`.
pub fn launch_monitor(state: &Arc<RwLock<ComputeState>>) -> Result<thread::JoinHandle<()>> {
pub fn launch_monitor(state: &Arc<ComputeNode>) -> Result<thread::JoinHandle<()>> {
let state = Arc::clone(state);
Ok(thread::Builder::new()

View File

@@ -1,7 +1,9 @@
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::net::{SocketAddr, TcpStream};
use std::os::unix::fs::PermissionsExt;
use std::path::Path;
use std::process::Command;
use std::process::Child;
use std::str::FromStr;
use std::{fs, thread, time};
@@ -220,12 +222,12 @@ pub fn get_existing_dbs(client: &mut Client) -> Result<Vec<Database>> {
/// Wait for Postgres to become ready to accept connections:
/// - state should be `ready` in the `pgdata/postmaster.pid`
/// - and we should be able to connect to 127.0.0.1:5432
pub fn wait_for_postgres(port: &str, pgdata: &Path) -> Result<()> {
pub fn wait_for_postgres(pg: &mut Child, port: &str, pgdata: &Path) -> Result<()> {
let pid_path = pgdata.join("postmaster.pid");
let mut slept: u64 = 0; // ms
let pause = time::Duration::from_millis(100);
let timeout = time::Duration::from_millis(200);
let timeout = time::Duration::from_millis(10);
let addr = SocketAddr::from_str(&format!("127.0.0.1:{}", port)).unwrap();
loop {
@@ -236,14 +238,19 @@ pub fn wait_for_postgres(port: &str, pgdata: &Path) -> Result<()> {
bail!("timed out while waiting for Postgres to start");
}
if let Ok(Some(status)) = pg.try_wait() {
// Postgres exited; that is not what we expected, so bail out early.
let code = status.code().unwrap_or(-1);
bail!("Postgres exited unexpectedly with code {}", code);
}
if pid_path.exists() {
// XXX: dumb and the simplest way to get the last line in a text file
// TODO: better use `.lines().last()` later
let stdout = Command::new("tail")
.args(&["-n1", pid_path.to_str().unwrap()])
.output()?
.stdout;
let status = String::from_utf8(stdout)?;
let file = BufReader::new(File::open(&pid_path)?);
let status = file
.lines()
.last()
.unwrap()
.unwrap_or_else(|_| "unknown".to_string());
let can_connect = TcpStream::connect_timeout(&addr, timeout).is_ok();
// Now Postgres is ready to accept connections
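
Since `wait_for_postgres()` now also takes the spawned `Child`, a caller can detect an early Postgres crash instead of waiting out the whole timeout. A hypothetical caller sketch follows; the module path and the way Postgres is spawned are assumptions, not the exact compute_tools start-up code.

```rust
use std::path::Path;
use std::process::{Child, Command};

use anyhow::Result;
// Assumed module path for the function shown in the hunk above.
use compute_tools::pg_helpers::wait_for_postgres;

fn start_postgres(pgbin: &str, pgdata: &Path, port: &str) -> Result<Child> {
    let mut pg: Child = Command::new(pgbin)
        .args(&["-D", pgdata.to_str().unwrap()])
        .spawn()?;

    // If Postgres exits right away (bad config, port clash, ...), the try_wait()
    // check inside wait_for_postgres() notices it and bails out immediately.
    wait_for_postgres(&mut pg, port, pgdata)?;
    Ok(pg)
}
```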

View File

@@ -3,16 +3,53 @@ use std::path::Path;
use anyhow::Result;
use log::{info, log_enabled, warn, Level};
use postgres::Client;
use serde::Deserialize;
use crate::config;
use crate::params::PG_HBA_ALL_MD5;
use crate::pg_helpers::*;
use crate::zenith::ClusterSpec;
/// Cluster spec or configuration represented as an optional number of
/// delta operations + final cluster state description.
#[derive(Clone, Deserialize)]
pub struct ComputeSpec {
pub format_version: f32,
pub timestamp: String,
pub operation_uuid: Option<String>,
/// Expected cluster state at the end of transition process.
pub cluster: Cluster,
pub delta_operations: Option<Vec<DeltaOp>>,
}
/// Cluster state seen from the perspective of the external tools
/// like Rails web console.
#[derive(Clone, Deserialize)]
pub struct Cluster {
pub cluster_id: String,
pub name: String,
pub state: Option<String>,
pub roles: Vec<Role>,
pub databases: Vec<Database>,
pub settings: GenericOptions,
}
/// Single cluster state changing operation that could not be represented as
/// a static `Cluster` structure. For example:
/// - DROP DATABASE
/// - DROP ROLE
/// - ALTER ROLE name RENAME TO new_name
/// - ALTER DATABASE name RENAME TO new_name
#[derive(Clone, Deserialize)]
pub struct DeltaOp {
pub action: String,
pub name: PgIdent,
pub new_name: Option<PgIdent>,
}
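
To make the shape of `delta_operations` concrete, here is a hedged, standalone sketch of deserializing and dispatching delta ops. The struct is redeclared locally with `PgIdent` as a plain `String`, and the action names ("delete_db", "rename_role") are illustrative only; the real set of actions is defined by the control plane.

```rust
use serde::Deserialize;

// Local stand-in; in compute_tools `PgIdent` comes from pg_helpers.
type PgIdent = String;

#[derive(Debug, Deserialize)]
struct DeltaOp {
    action: String,
    name: PgIdent,
    new_name: Option<PgIdent>,
}

fn main() -> Result<(), serde_json::Error> {
    let ops: Vec<DeltaOp> = serde_json::from_str(
        r#"[
            {"action": "delete_db", "name": "old_db"},
            {"action": "rename_role", "name": "alice", "new_name": "bob"}
        ]"#,
    )?;

    for op in &ops {
        match op.action.as_str() {
            // Action names are illustrative; see the DeltaOp doc comment above.
            "delete_db" => println!("DROP DATABASE {}", op.name),
            "rename_role" => println!(
                "ALTER ROLE {} RENAME TO {}",
                op.name,
                op.new_name.as_deref().unwrap_or_default()
            ),
            other => println!("unsupported delta operation: {other}"),
        }
    }
    Ok(())
}
```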
/// It takes a cluster specification and does the following:
/// - Serialize the cluster config and put it into `postgresql.conf`, completely rewriting the file.
/// - Update `pg_hba.conf` to allow external connections.
pub fn handle_configuration(spec: &ClusterSpec, pgdata_path: &Path) -> Result<()> {
pub fn handle_configuration(spec: &ComputeSpec, pgdata_path: &Path) -> Result<()> {
// File `postgresql.conf` is no longer included in `basebackup`, so just
// always write the whole config into it, creating a new file.
config::write_postgres_conf(&pgdata_path.join("postgresql.conf"), spec)?;
@@ -39,7 +76,7 @@ pub fn update_pg_hba(pgdata_path: &Path) -> Result<()> {
/// Given a cluster spec json and open transaction it handles roles creation,
/// deletion and update.
pub fn handle_roles(spec: &ClusterSpec, client: &mut Client) -> Result<()> {
pub fn handle_roles(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
let mut xact = client.transaction()?;
let existing_roles: Vec<Role> = get_existing_roles(&mut xact)?;
@@ -165,7 +202,7 @@ pub fn handle_roles(spec: &ClusterSpec, client: &mut Client) -> Result<()> {
/// like `CREATE DATABASE` and `DROP DATABASE` do not support it. Statement-level
/// atomicity should be enough here due to the order of operations and various checks,
/// which together provide us idempotency.
pub fn handle_databases(spec: &ClusterSpec, client: &mut Client) -> Result<()> {
pub fn handle_databases(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
let existing_dbs: Vec<Database> = get_existing_dbs(client)?;
// Print a list of existing Postgres databases (only in debug mode)
@@ -254,7 +291,7 @@ pub fn handle_databases(spec: &ClusterSpec, client: &mut Client) -> Result<()> {
// Grant CREATE ON DATABASE to the database owner
// to allow clients to create trusted extensions.
pub fn handle_grants(spec: &ClusterSpec, client: &mut Client) -> Result<()> {
pub fn handle_grants(spec: &ComputeSpec, client: &mut Client) -> Result<()> {
info!("cluster spec grants:");
for db in &spec.cluster.databases {

View File

@@ -1,109 +0,0 @@
use std::process::{Command, Stdio};
use anyhow::Result;
use chrono::{DateTime, Utc};
use postgres::{Client, NoTls};
use serde::Deserialize;
use crate::pg_helpers::*;
/// Compute node state shared across several `zenith_ctl` threads.
/// Should be used under `RwLock` to allow HTTP API server to serve
/// status requests, while configuration is in progress.
pub struct ComputeState {
pub connstr: String,
pub pgdata: String,
pub pgbin: String,
pub spec: ClusterSpec,
/// Compute setup process has finished
pub ready: bool,
/// Timestamp of the last Postgres activity
pub last_active: DateTime<Utc>,
}
/// Cluster spec or configuration represented as an optional number of
/// delta operations + final cluster state description.
#[derive(Clone, Deserialize)]
pub struct ClusterSpec {
pub format_version: f32,
pub timestamp: String,
pub operation_uuid: Option<String>,
/// Expected cluster state at the end of transition process.
pub cluster: Cluster,
pub delta_operations: Option<Vec<DeltaOp>>,
}
/// Cluster state seen from the perspective of the external tools
/// like Rails web console.
#[derive(Clone, Deserialize)]
pub struct Cluster {
pub cluster_id: String,
pub name: String,
pub state: Option<String>,
pub roles: Vec<Role>,
pub databases: Vec<Database>,
pub settings: GenericOptions,
}
/// Single cluster state changing operation that could not be represented as
/// a static `Cluster` structure. For example:
/// - DROP DATABASE
/// - DROP ROLE
/// - ALTER ROLE name RENAME TO new_name
/// - ALTER DATABASE name RENAME TO new_name
#[derive(Clone, Deserialize)]
pub struct DeltaOp {
pub action: String,
pub name: PgIdent,
pub new_name: Option<PgIdent>,
}
/// Get basebackup from the libpq connection to pageserver using `connstr` and
/// unarchive it to `pgdata` directory overriding all its previous content.
pub fn get_basebackup(
pgdata: &str,
connstr: &str,
tenant: &str,
timeline: &str,
lsn: &str,
) -> Result<()> {
let mut client = Client::connect(connstr, NoTls)?;
let basebackup_cmd = match lsn {
"0/0" => format!("basebackup {} {}", tenant, timeline), // First start of the compute
_ => format!("basebackup {} {} {}", tenant, timeline, lsn),
};
let copyreader = client.copy_out(basebackup_cmd.as_str())?;
let mut ar = tar::Archive::new(copyreader);
ar.unpack(&pgdata)?;
Ok(())
}
/// Run `postgres` in a special mode with `--sync-safekeepers` argument
/// and return the reported LSN back to the caller.
pub fn sync_safekeepers(pgdata: &str, pgbin: &str) -> Result<String> {
let sync_handle = Command::new(&pgbin)
.args(&["--sync-safekeepers"])
.env("PGDATA", &pgdata) // we cannot use -D in this mode
.stdout(Stdio::piped())
.spawn()
.expect("postgres --sync-safekeepers failed to start");
// `postgres --sync-safekeepers` will print all log output to stderr and
// final LSN to stdout. So we pipe only stdout, while stderr will be automatically
// redirected to the caller output.
let sync_output = sync_handle
.wait_with_output()
.expect("postgres --sync-safekeepers failed");
if !sync_output.status.success() {
anyhow::bail!(
"postgres --sync-safekeepers exited with non-zero status: {}",
sync_output.status,
);
}
let lsn = String::from(String::from_utf8(sync_output.stdout)?.trim());
Ok(lsn)
}
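
Taken together, the two helpers above describe the compute start-up handshake: first agree with the safekeepers on the last committed LSN, then pull a basebackup at exactly that LSN from the pageserver. A hedged sketch of that flow, assuming the two functions above are in scope (this module is being removed by this change, so treat the call sites as illustrative; the tenant/timeline ids are placeholders):

```rust
use anyhow::Result;

fn prepare_pgdata(pgdata: &str, pgbin: &str, pageserver_connstr: &str) -> Result<()> {
    // 1. Ask Postgres to agree on the last committed LSN with the safekeepers.
    let lsn = sync_safekeepers(pgdata, pgbin)?;

    // 2. Fetch a basebackup at exactly that LSN from the pageserver
    //    ("0/0" would mean "first start of the compute", as handled above).
    get_basebackup(
        pgdata,
        pageserver_connstr,
        "1f11f11f11f11f11f11f11f11f11f11f", // placeholder tenant id
        "2e22e22e22e22e22e22e22e22e22e22e", // placeholder timeline id
        &lsn,
    )?;
    Ok(())
}
```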

View File

@@ -4,12 +4,12 @@ mod pg_helpers_tests {
use std::fs::File;
use compute_tools::pg_helpers::*;
use compute_tools::zenith::ClusterSpec;
use compute_tools::spec::ComputeSpec;
#[test]
fn params_serialize() {
let file = File::open("tests/cluster_spec.json").unwrap();
let spec: ClusterSpec = serde_json::from_reader(file).unwrap();
let spec: ComputeSpec = serde_json::from_reader(file).unwrap();
assert_eq!(
spec.cluster.databases.first().unwrap().to_pg_options(),
@@ -24,7 +24,7 @@ mod pg_helpers_tests {
#[test]
fn settings_serialize() {
let file = File::open("tests/cluster_spec.json").unwrap();
let spec: ClusterSpec = serde_json::from_reader(file).unwrap();
let spec: ComputeSpec = serde_json::from_reader(file).unwrap();
assert_eq!(
spec.cluster.settings.as_pg_settings(),

View File

@@ -9,3 +9,6 @@ auth_type = 'Trust'
id = 1
pg_port = 5454
http_port = 7676
[etcd_broker]
broker_endpoints = ['http://127.0.0.1:2379']

control_plane/src/etcd.rs (new file, 93 lines)
View File

@@ -0,0 +1,93 @@
use std::{
fs,
path::PathBuf,
process::{Command, Stdio},
};
use anyhow::Context;
use nix::{
sys::signal::{kill, Signal},
unistd::Pid,
};
use crate::{local_env, read_pidfile};
pub fn start_etcd_process(env: &local_env::LocalEnv) -> anyhow::Result<()> {
let etcd_broker = &env.etcd_broker;
println!(
"Starting etcd broker using {}",
etcd_broker.etcd_binary_path.display()
);
let etcd_data_dir = env.base_data_dir.join("etcd");
fs::create_dir_all(&etcd_data_dir).with_context(|| {
format!(
"Failed to create etcd data dir: {}",
etcd_data_dir.display()
)
})?;
let etcd_stdout_file =
fs::File::create(etcd_data_dir.join("etcd.stdout.log")).with_context(|| {
format!(
"Failed to create ectd stout file in directory {}",
etcd_data_dir.display()
)
})?;
let etcd_stderr_file =
fs::File::create(etcd_data_dir.join("etcd.stderr.log")).with_context(|| {
format!(
"Failed to create ectd stderr file in directory {}",
etcd_data_dir.display()
)
})?;
let client_urls = etcd_broker.comma_separated_endpoints();
let etcd_process = Command::new(&etcd_broker.etcd_binary_path)
.args(&[
format!("--data-dir={}", etcd_data_dir.display()),
format!("--listen-client-urls={client_urls}"),
format!("--advertise-client-urls={client_urls}"),
])
.stdout(Stdio::from(etcd_stdout_file))
.stderr(Stdio::from(etcd_stderr_file))
.spawn()
.context("Failed to spawn etcd subprocess")?;
let pid = etcd_process.id();
let etcd_pid_file_path = etcd_pid_file_path(env);
fs::write(&etcd_pid_file_path, pid.to_string()).with_context(|| {
format!(
"Failed to create etcd pid file at {}",
etcd_pid_file_path.display()
)
})?;
Ok(())
}
pub fn stop_etcd_process(env: &local_env::LocalEnv) -> anyhow::Result<()> {
let etcd_path = &env.etcd_broker.etcd_binary_path;
println!("Stopping etcd broker at {}", etcd_path.display());
let etcd_pid_file_path = etcd_pid_file_path(env);
let pid = Pid::from_raw(read_pidfile(&etcd_pid_file_path).with_context(|| {
format!(
"Failed to read etcd pid filea at {}",
etcd_pid_file_path.display()
)
})?);
kill(pid, Signal::SIGTERM).with_context(|| {
format!(
"Failed to stop etcd with pid {pid} at {}",
etcd_pid_file_path.display()
)
})?;
Ok(())
}
fn etcd_pid_file_path(env: &local_env::LocalEnv) -> PathBuf {
env.base_data_dir.join("etcd.pid")
}
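
These helpers encapsulate the broker lifecycle: spawn `etcd` with its logs under `<base_data_dir>/etcd/`, record the pid in `<base_data_dir>/etcd.pid`, and later SIGTERM that pid. A hedged sketch of how a caller might drive them; the actual wiring appears later in this change, in the CLI's start/stop handlers.

```rust
use control_plane::{etcd, local_env::LocalEnv};

// Restart the local etcd broker. Stopping first is best-effort: if no broker
// is running, reading the pid file fails and we simply start a new process.
fn restart_etcd(env: &LocalEnv) -> anyhow::Result<()> {
    if let Err(e) = etcd::stop_etcd_process(env) {
        eprintln!("etcd stop skipped: {e:#}");
    }
    etcd::start_etcd_process(env)
}
```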

View File

@@ -12,6 +12,7 @@ use std::path::Path;
use std::process::Command;
pub mod compute;
pub mod etcd;
pub mod local_env;
pub mod postgresql_conf;
pub mod safekeeper;

View File

@@ -4,6 +4,7 @@
//! script which will use local paths.
use anyhow::{bail, ensure, Context};
use reqwest::Url;
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use std::collections::HashMap;
@@ -59,13 +60,7 @@ pub struct LocalEnv {
#[serde(default)]
pub private_key_path: PathBuf,
// A comma separated broker (etcd) endpoints for storage nodes coordination, e.g. 'http://127.0.0.1:2379'.
#[serde(default)]
pub broker_endpoints: Option<String>,
/// A prefix to all to any key when pushing/polling etcd from a node.
#[serde(default)]
pub broker_etcd_prefix: Option<String>,
pub etcd_broker: EtcdBroker,
pub pageserver: PageServerConf,
@@ -81,6 +76,62 @@ pub struct LocalEnv {
branch_name_mappings: HashMap<String, Vec<(ZTenantId, ZTimelineId)>>,
}
/// Etcd broker config for cluster internal communication.
#[serde_as]
#[derive(Serialize, Deserialize, PartialEq, Eq, Clone, Debug)]
pub struct EtcdBroker {
/// A prefix to add to any key when pushing/polling etcd from a node.
#[serde(default)]
pub broker_etcd_prefix: Option<String>,
/// Broker (etcd) endpoints for storage node coordination, e.g. 'http://127.0.0.1:2379'.
#[serde(default)]
#[serde_as(as = "Vec<DisplayFromStr>")]
pub broker_endpoints: Vec<Url>,
/// Etcd binary path to use.
#[serde(default)]
pub etcd_binary_path: PathBuf,
}
impl EtcdBroker {
pub fn locate_etcd() -> anyhow::Result<PathBuf> {
let which_output = Command::new("which")
.arg("etcd")
.output()
.context("Failed to run 'which etcd' command")?;
let stdout = String::from_utf8_lossy(&which_output.stdout);
ensure!(
which_output.status.success(),
"'which etcd' invocation failed. Status: {}, stdout: {stdout}, stderr: {}",
which_output.status,
String::from_utf8_lossy(&which_output.stderr)
);
let etcd_path = PathBuf::from(stdout.trim());
ensure!(
etcd_path.is_file(),
"'which etcd' invocation was successful, but the path it returned is not a file or does not exist: {}",
etcd_path.display()
);
Ok(etcd_path)
}
pub fn comma_separated_endpoints(&self) -> String {
self.broker_endpoints.iter().map(Url::as_str).fold(
String::new(),
|mut comma_separated_urls, url| {
if !comma_separated_urls.is_empty() {
comma_separated_urls.push(',');
}
comma_separated_urls.push_str(url);
comma_separated_urls
},
)
}
}
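
The fold above just joins the parsed endpoint URLs with commas, e.g. when building the `--broker-endpoints` argument for safekeepers. An equivalent standalone illustration follows; note that `Url::as_str` returns the normalized form produced at parse time, typically including a trailing '/'.

```rust
use url::Url;

fn comma_separated(urls: &[Url]) -> String {
    urls.iter().map(Url::as_str).collect::<Vec<_>>().join(",")
}

fn main() {
    let endpoints = vec![
        Url::parse("http://127.0.0.1:2379").unwrap(),
        Url::parse("http://127.0.0.1:22379").unwrap(),
    ];
    // Prints "http://127.0.0.1:2379/,http://127.0.0.1:22379/"
    println!("{}", comma_separated(&endpoints));
}
```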
#[derive(Serialize, Deserialize, PartialEq, Eq, Clone, Debug)]
#[serde(default)]
pub struct PageServerConf {
@@ -184,12 +235,7 @@ impl LocalEnv {
if old_timeline_id == &timeline_id {
Ok(())
} else {
bail!(
"branch '{}' is already mapped to timeline {}, cannot map to another timeline {}",
branch_name,
old_timeline_id,
timeline_id
);
bail!("branch '{branch_name}' is already mapped to timeline {old_timeline_id}, cannot map to another timeline {timeline_id}");
}
} else {
existing_values.push((tenant_id, timeline_id));
@@ -225,7 +271,7 @@ impl LocalEnv {
///
/// Unlike 'load_config', this function fills in any defaults that are missing
/// from the config file.
pub fn create_config(toml: &str) -> anyhow::Result<Self> {
pub fn parse_config(toml: &str) -> anyhow::Result<Self> {
let mut env: LocalEnv = toml::from_str(toml)?;
// Find postgres binaries.
@@ -238,26 +284,11 @@ impl LocalEnv {
env.pg_distrib_dir = cwd.join("tmp_install")
}
}
if !env.pg_distrib_dir.join("bin/postgres").exists() {
bail!(
"Can't find postgres binary at {}",
env.pg_distrib_dir.display()
);
}
// Find zenith binaries.
if env.zenith_distrib_dir == Path::new("") {
env.zenith_distrib_dir = env::current_exe()?.parent().unwrap().to_owned();
}
for binary in ["pageserver", "safekeeper"] {
if !env.zenith_distrib_dir.join(binary).exists() {
bail!(
"Can't find binary '{}' in zenith distrib dir '{}'",
binary,
env.zenith_distrib_dir.display()
);
}
}
// If no initial tenant ID was given, generate it.
if env.default_tenant_id.is_none() {
@@ -351,6 +382,36 @@ impl LocalEnv {
"directory '{}' already exists. Perhaps already initialized?",
base_path.display()
);
if !self.pg_distrib_dir.join("bin/postgres").exists() {
bail!(
"Can't find postgres binary at {}",
self.pg_distrib_dir.display()
);
}
for binary in ["pageserver", "safekeeper"] {
if !self.zenith_distrib_dir.join(binary).exists() {
bail!(
"Can't find binary '{}' in zenith distrib dir '{}'",
binary,
self.zenith_distrib_dir.display()
);
}
}
for binary in ["pageserver", "safekeeper"] {
if !self.zenith_distrib_dir.join(binary).exists() {
bail!(
"Can't find binary '{binary}' in zenith distrib dir '{}'",
self.zenith_distrib_dir.display()
);
}
}
if !self.pg_distrib_dir.join("bin/postgres").exists() {
bail!(
"Can't find postgres binary at {}",
self.pg_distrib_dir.display()
);
}
fs::create_dir(&base_path)?;
@@ -408,7 +469,35 @@ impl LocalEnv {
fn base_path() -> PathBuf {
match std::env::var_os("ZENITH_REPO_DIR") {
Some(val) => PathBuf::from(val.to_str().unwrap()),
None => ".zenith".into(),
Some(val) => PathBuf::from(val),
None => PathBuf::from(".zenith"),
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn simple_conf_parsing() {
let simple_conf_toml = include_str!("../simple.conf");
let simple_conf_parse_result = LocalEnv::parse_config(simple_conf_toml);
assert!(
simple_conf_parse_result.is_ok(),
"failed to parse simple config {simple_conf_toml}, reason: {simple_conf_parse_result:?}"
);
let string_to_replace = "broker_endpoints = ['http://127.0.0.1:2379']";
let spoiled_url_str = "broker_endpoints = ['!@$XOXO%^&']";
let spoiled_url_toml = simple_conf_toml.replace(string_to_replace, spoiled_url_str);
assert!(
spoiled_url_toml.contains(spoiled_url_str),
"Failed to replace string {string_to_replace} in the toml file {simple_conf_toml}"
);
let spoiled_url_parse_result = LocalEnv::parse_config(&spoiled_url_toml);
assert!(
spoiled_url_parse_result.is_err(),
"expected toml with invalid Url {spoiled_url_toml} to fail the parsing, but got {spoiled_url_parse_result:?}"
);
}
}

View File

@@ -52,7 +52,7 @@ impl ResponseErrorMessageExt for Response {
Err(SafekeeperHttpError::Response(
match self.json::<HttpErrorBody>() {
Ok(err_body) => format!("Error: {}", err_body.msg),
Err(_) => format!("Http error ({}) at {}.", status.as_u16(), url),
Err(_) => format!("Http error ({}) at {url}.", status.as_u16()),
},
))
}
@@ -75,17 +75,12 @@ pub struct SafekeeperNode {
pub http_base_url: String,
pub pageserver: Arc<PageServerNode>,
broker_endpoints: Option<String>,
broker_etcd_prefix: Option<String>,
}
impl SafekeeperNode {
pub fn from_env(env: &LocalEnv, conf: &SafekeeperConf) -> SafekeeperNode {
let pageserver = Arc::new(PageServerNode::from_env(env));
println!("initializing for sk {} for {}", conf.id, conf.http_port);
SafekeeperNode {
id: conf.id,
conf: conf.clone(),
@@ -94,8 +89,6 @@ impl SafekeeperNode {
http_client: Client::new(),
http_base_url: format!("http://127.0.0.1:{}/v1", conf.http_port),
pageserver,
broker_endpoints: env.broker_endpoints.clone(),
broker_etcd_prefix: env.broker_etcd_prefix.clone(),
}
}
@@ -142,10 +135,12 @@ impl SafekeeperNode {
if !self.conf.sync {
cmd.arg("--no-sync");
}
if let Some(ref ep) = self.broker_endpoints {
cmd.args(&["--broker-endpoints", ep]);
let comma_separated_endpoints = self.env.etcd_broker.comma_separated_endpoints();
if !comma_separated_endpoints.is_empty() {
cmd.args(&["--broker-endpoints", &comma_separated_endpoints]);
}
if let Some(prefix) = self.broker_etcd_prefix.as_deref() {
if let Some(prefix) = self.env.etcd_broker.broker_etcd_prefix.as_deref() {
cmd.args(&["--broker-etcd-prefix", prefix]);
}
@@ -210,12 +205,13 @@ impl SafekeeperNode {
let pid = Pid::from_raw(pid);
let sig = if immediate {
println!("Stop safekeeper immediately");
print!("Stopping safekeeper {} immediately..", self.id);
Signal::SIGQUIT
} else {
println!("Stop safekeeper gracefully");
print!("Stopping safekeeper {} gracefully..", self.id);
Signal::SIGTERM
};
io::stdout().flush().unwrap();
match kill(pid, sig) {
Ok(_) => (),
Err(Errno::ESRCH) => {
@@ -237,25 +233,35 @@ impl SafekeeperNode {
// TODO Remove this "timeout" and handle it on caller side instead.
// Shutting down may take a long time,
// if safekeeper flushes a lot of data
let mut tcp_stopped = false;
for _ in 0..100 {
if let Err(_e) = TcpStream::connect(&address) {
println!("Safekeeper stopped receiving connections");
//Now check status
match self.check_status() {
Ok(_) => {
println!("Safekeeper status is OK. Wait a bit.");
thread::sleep(Duration::from_secs(1));
}
Err(err) => {
println!("Safekeeper status is: {}", err);
return Ok(());
if !tcp_stopped {
if let Err(err) = TcpStream::connect(&address) {
tcp_stopped = true;
if err.kind() != io::ErrorKind::ConnectionRefused {
eprintln!("\nSafekeeper connection failed with error: {err}");
}
}
} else {
println!("Safekeeper still receives connections");
thread::sleep(Duration::from_secs(1));
}
if tcp_stopped {
// Also check status on the HTTP port
match self.check_status() {
Err(SafekeeperHttpError::Transport(err)) if err.is_connect() => {
println!("done!");
return Ok(());
}
Err(err) => {
eprintln!("\nSafekeeper status check failed with error: {err}");
return Ok(());
}
Ok(()) => {
// keep waiting
}
}
}
print!(".");
io::stdout().flush().unwrap();
thread::sleep(Duration::from_secs(1));
}
bail!("Failed to stop safekeeper with pid {}", pid);

View File

@@ -121,6 +121,16 @@ impl PageServerNode {
);
let listen_pg_addr_param =
format!("listen_pg_addr='{}'", self.env.pageserver.listen_pg_addr);
let broker_endpoints_param = format!(
"broker_endpoints=[{}]",
self.env
.etcd_broker
.broker_endpoints
.iter()
.map(|url| format!("'{url}'"))
.collect::<Vec<_>>()
.join(",")
);
let mut args = Vec::with_capacity(20);
args.push("--init");
@@ -129,8 +139,19 @@ impl PageServerNode {
args.extend(["-c", &authg_type_param]);
args.extend(["-c", &listen_http_addr_param]);
args.extend(["-c", &listen_pg_addr_param]);
args.extend(["-c", &broker_endpoints_param]);
args.extend(["-c", &id]);
let broker_etcd_prefix_param = self
.env
.etcd_broker
.broker_etcd_prefix
.as_ref()
.map(|prefix| format!("broker_etcd_prefix='{prefix}'"));
if let Some(broker_etcd_prefix_param) = broker_etcd_prefix_param.as_deref() {
args.extend(["-c", broker_etcd_prefix_param]);
}
for config_override in config_overrides {
args.extend(["-c", config_override]);
}
@@ -260,12 +281,13 @@ impl PageServerNode {
let pid = Pid::from_raw(read_pidfile(&pid_file)?);
let sig = if immediate {
println!("Stop pageserver immediately");
print!("Stopping pageserver immediately..");
Signal::SIGQUIT
} else {
println!("Stop pageserver gracefully");
print!("Stopping pageserver gracefully..");
Signal::SIGTERM
};
io::stdout().flush().unwrap();
match kill(pid, sig) {
Ok(_) => (),
Err(Errno::ESRCH) => {
@@ -287,25 +309,36 @@ impl PageServerNode {
// TODO Remove this "timeout" and handle it on caller side instead.
// Shutting down may take a long time,
// if pageserver checkpoints a lot of data
let mut tcp_stopped = false;
for _ in 0..100 {
if let Err(_e) = TcpStream::connect(&address) {
println!("Pageserver stopped receiving connections");
//Now check status
match self.check_status() {
Ok(_) => {
println!("Pageserver status is OK. Wait a bit.");
thread::sleep(Duration::from_secs(1));
}
Err(err) => {
println!("Pageserver status is: {}", err);
return Ok(());
if !tcp_stopped {
if let Err(err) = TcpStream::connect(&address) {
tcp_stopped = true;
if err.kind() != io::ErrorKind::ConnectionRefused {
eprintln!("\nPageserver connection failed with error: {err}");
}
}
} else {
println!("Pageserver still receives connections");
thread::sleep(Duration::from_secs(1));
}
if tcp_stopped {
// Also check status on the HTTP port
match self.check_status() {
Err(PageserverHttpError::Transport(err)) if err.is_connect() => {
println!("done!");
return Ok(());
}
Err(err) => {
eprintln!("\nPageserver status check failed with error: {err}");
return Ok(());
}
Ok(()) => {
// keep waiting
}
}
}
print!(".");
io::stdout().flush().unwrap();
thread::sleep(Duration::from_secs(1));
}
bail!("Failed to stop pageserver with pid {}", pid);

View File

@@ -1,13 +1,20 @@
#!/bin/sh
set -eux
broker_endpoints_param="${BROKER_ENDPOINT:-absent}"
if [ "$broker_endpoints_param" != "absent" ]; then
broker_endpoints_param="-c broker_endpoints=['$broker_endpoints_param']"
else
broker_endpoints_param=''
fi
if [ "$1" = 'pageserver' ]; then
if [ ! -d "/data/tenants" ]; then
echo "Initializing pageserver data directory"
pageserver --init -D /data -c "pg_distrib_dir='/usr/local'" -c "id=10"
pageserver --init -D /data -c "pg_distrib_dir='/usr/local'" -c "id=10" $broker_endpoints_param
fi
echo "Staring pageserver at 0.0.0.0:6400"
pageserver -c "listen_pg_addr='0.0.0.0:6400'" -c "listen_http_addr='0.0.0.0:9898'" -D /data
pageserver -c "listen_pg_addr='0.0.0.0:6400'" -c "listen_http_addr='0.0.0.0:9898'" $broker_endpoints_param -D /data
else
"$@"
fi

View File

@@ -1,20 +1,20 @@
# Docker images of Zenith
# Docker images of Neon
## Images
Currently we build two main images:
- [zenithdb/zenith](https://hub.docker.com/repository/docker/zenithdb/zenith) — image with pre-built `pageserver`, `safekeeper` and `proxy` binaries and all the required runtime dependencies. Built from [/Dockerfile](/Dockerfile).
- [zenithdb/compute-node](https://hub.docker.com/repository/docker/zenithdb/compute-node) — compute node image with pre-built Postgres binaries from [zenithdb/postgres](https://github.com/zenithdb/postgres).
- [neondatabase/neon](https://hub.docker.com/repository/docker/zenithdb/zenith) — image with pre-built `pageserver`, `safekeeper` and `proxy` binaries and all the required runtime dependencies. Built from [/Dockerfile](/Dockerfile).
- [neondatabase/compute-node](https://hub.docker.com/repository/docker/zenithdb/compute-node) — compute node image with pre-built Postgres binaries from [neondatabase/postgres](https://github.com/neondatabase/postgres).
And additional intermediate images:
And an additional intermediate image:
- [zenithdb/compute-tools](https://hub.docker.com/repository/docker/zenithdb/compute-tools) — compute node configuration management tools.
- [neondatabase/compute-tools](https://hub.docker.com/repository/docker/neondatabase/compute-tools) — compute node configuration management tools.
## Building pipeline
1. Image `zenithdb/compute-tools` is re-built automatically.
We build all images after a successful `release` test run and push them automatically to Docker Hub with two parallel CI jobs:
2. Image `zenithdb/compute-node` is built independently in the [zenithdb/postgres](https://github.com/zenithdb/postgres) repo.
1. `neondatabase/compute-tools` and `neondatabase/compute-node`
3. Image `zenithdb/zenith` is built in this repo after a successful `release` tests run and pushed to Docker Hub automatically.
2. `neondatabase/neon`

View File

@@ -25,10 +25,14 @@ max_file_descriptors = '100'
# initial superuser role name to use when creating a new tenant
initial_superuser_name = 'zenith_admin'
broker_etcd_prefix = 'neon'
broker_endpoints = ['some://etcd']
# [remote_storage]
```
The config above shows default values for all basic pageserver settings.
The config above shows default values for all basic pageserver settings, except `broker_endpoints`: that one has to be set by the user;
see the corresponding section below.
Pageserver uses default values for all settings that are missing from the config, so it's not a hard error to leave the config blank.
Yet it validates the config values it can (e.g. the postgres install dir) and refuses to start if that validation fails.
@@ -46,6 +50,17 @@ Example: `${PAGESERVER_BIN} -c "checkpoint_period = '100 s'" -c "remote_storage=
Note that TOML distinguishes between strings and integers; the former require single or double quotes around them.
#### broker_endpoints
A list of broker endpoints (currently etcd) to connect to and pull information from.
Mandatory, with no default, since etcd has to be started as a separate process
and its connection URL has to be specified explicitly.
#### broker_etcd_prefix
A prefix added to every etcd key used, to separate one group of related instances from another within the same cluster.
Default is `neon`.
#### checkpoint_distance
`checkpoint_distance` is the amount of incoming WAL that is held in

View File

@@ -91,18 +91,22 @@ so manual installation of dependencies is not recommended.
A single virtual environment with all dependencies is described in the single `Pipfile`.
### Prerequisites
- Install Python 3.7 (the minimal supported version) or greater.
- Install Python 3.9 (the minimal supported version) or greater.
- Our setup with poetry should work with newer Python versions too, so feel free to open an issue with a `c/test-runner` label if something doesn't work as expected.
- If you have some trouble with other version you can resolve it by installing Python 3.7 separately, via pyenv or via system package manager e.g.:
- If you have trouble with another version, you can resolve it by installing Python 3.9 separately, via [pyenv](https://github.com/pyenv/pyenv) or via a system package manager, e.g.:
```bash
# In Ubuntu
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.7
sudo apt install python3.9
```
- Install `poetry`
- The exact version of `poetry` is not important; see the installation instructions available on poetry's [website](https://python-poetry.org/docs/#installation).
- Install dependencies via `./scripts/pysync`. Note that CI uses Python 3.7 so if you have different version some linting tools can yield different result locally vs in the CI.
- Install dependencies via `./scripts/pysync`.
- Note that CI uses a specific Python version (look for `PYTHON_VERSION` [here](https://github.com/neondatabase/docker-images/blob/main/rust/Dockerfile)),
so if you have a different version, some linting tools can yield different results locally vs in CI.
- You can explicitly specify which Python to use by running `poetry env use /path/to/python`, e.g. `poetry env use python3.9`.
This may also disable the `The currently activated Python version X.Y.Z is not supported by the project` warning.
Run `poetry shell` to activate the virtual environment.
Alternatively, use `poetry run` to run a single command in the venv, e.g. `poetry run pytest`.

View File

@@ -19,6 +19,10 @@ use utils::{
zid::{ZNodeId, ZTenantId, ZTenantTimelineId},
};
/// Default value to prefix all etcd keys with.
/// This allows isolating safekeeper/pageserver groups in the same etcd cluster.
pub const DEFAULT_NEON_BROKER_ETCD_PREFIX: &str = "neon";
#[derive(Debug, Deserialize, Serialize)]
struct SafekeeperTimeline {
safekeeper_id: ZNodeId,
@@ -104,28 +108,28 @@ impl SkTimelineSubscription {
/// The subscription kind to the timeline updates from safekeeper.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct SkTimelineSubscriptionKind {
broker_prefix: String,
broker_etcd_prefix: String,
kind: SubscriptionKind,
}
impl SkTimelineSubscriptionKind {
pub fn all(broker_prefix: String) -> Self {
pub fn all(broker_etcd_prefix: String) -> Self {
Self {
broker_prefix,
broker_etcd_prefix,
kind: SubscriptionKind::All,
}
}
pub fn tenant(broker_prefix: String, tenant: ZTenantId) -> Self {
pub fn tenant(broker_etcd_prefix: String, tenant: ZTenantId) -> Self {
Self {
broker_prefix,
broker_etcd_prefix,
kind: SubscriptionKind::Tenant(tenant),
}
}
pub fn timeline(broker_prefix: String, timeline: ZTenantTimelineId) -> Self {
pub fn timeline(broker_etcd_prefix: String, timeline: ZTenantTimelineId) -> Self {
Self {
broker_prefix,
broker_etcd_prefix,
kind: SubscriptionKind::Timeline(timeline),
}
}
@@ -134,12 +138,12 @@ impl SkTimelineSubscriptionKind {
match self.kind {
SubscriptionKind::All => Regex::new(&format!(
r"^{}/([[:xdigit:]]+)/([[:xdigit:]]+)/safekeeper/([[:digit:]])$",
self.broker_prefix
self.broker_etcd_prefix
))
.expect("wrong regex for 'everything' subscription"),
SubscriptionKind::Tenant(tenant_id) => Regex::new(&format!(
r"^{}/{tenant_id}/([[:xdigit:]]+)/safekeeper/([[:digit:]])$",
self.broker_prefix
self.broker_etcd_prefix
))
.expect("wrong regex for 'tenant' subscription"),
SubscriptionKind::Timeline(ZTenantTimelineId {
@@ -147,7 +151,7 @@ impl SkTimelineSubscriptionKind {
timeline_id,
}) => Regex::new(&format!(
r"^{}/{tenant_id}/{timeline_id}/safekeeper/([[:digit:]])$",
self.broker_prefix
self.broker_etcd_prefix
))
.expect("wrong regex for 'timeline' subscription"),
}
@@ -156,16 +160,16 @@ impl SkTimelineSubscriptionKind {
/// Etcd key to use for watching a certain timeline updates from safekeepers.
pub fn watch_key(&self) -> String {
match self.kind {
SubscriptionKind::All => self.broker_prefix.to_string(),
SubscriptionKind::All => self.broker_etcd_prefix.to_string(),
SubscriptionKind::Tenant(tenant_id) => {
format!("{}/{tenant_id}/safekeeper", self.broker_prefix)
format!("{}/{tenant_id}/safekeeper", self.broker_etcd_prefix)
}
SubscriptionKind::Timeline(ZTenantTimelineId {
tenant_id,
timeline_id,
}) => format!(
"{}/{tenant_id}/{timeline_id}/safekeeper",
self.broker_prefix
self.broker_etcd_prefix
),
}
}
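
For clarity, the key layout the three subscription kinds watch looks like this, assuming the default `neon` prefix and placeholder tenant/timeline ids. Safekeepers publish their timeline info under `<prefix>/<tenant_id>/<timeline_id>/safekeeper/<node_id>`, as the regexes above expect.

```rust
fn main() {
    let prefix = "neon"; // DEFAULT_NEON_BROKER_ETCD_PREFIX
    let tenant = "1f11f11f11f11f11f11f11f11f11f11f"; // placeholder ZTenantId
    let timeline = "2e22e22e22e22e22e22e22e22e22e22e"; // placeholder ZTimelineId

    // SubscriptionKind::All watches every key under the prefix:
    println!("{prefix}");
    // SubscriptionKind::Tenant narrows the watch to one tenant:
    println!("{prefix}/{tenant}/safekeeper");
    // SubscriptionKind::Timeline narrows it to a single timeline:
    println!("{prefix}/{tenant}/{timeline}/safekeeper");
}
```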

View File

@@ -4,7 +4,7 @@ version = "0.1.0"
edition = "2021"
[dependencies]
prometheus = {version = "0.13", default_features=false} # removes protobuf dependency
prometheus = {version = "0.13", default_features=false, features = ["process"]} # removes protobuf dependency
libc = "0.2"
lazy_static = "1.4"
once_cell = "1.8.0"

View File

@@ -3,7 +3,6 @@
//! Otherwise, we might not see all metrics registered via
//! a default registry.
use lazy_static::lazy_static;
use once_cell::race::OnceBox;
pub use prometheus::{exponential_buckets, linear_buckets};
pub use prometheus::{register_gauge, Gauge};
pub use prometheus::{register_gauge_vec, GaugeVec};
@@ -27,48 +26,15 @@ pub fn gather() -> Vec<prometheus::proto::MetricFamily> {
prometheus::gather()
}
static COMMON_METRICS_PREFIX: OnceBox<&str> = OnceBox::new();
/// Sets a prefix which will be used for all common metrics, typically a service
/// name like 'pageserver'. Should be executed exactly once in the beginning of
/// any executable which uses common metrics.
pub fn set_common_metrics_prefix(prefix: &'static str) {
// Not unwrap() because metrics may be initialized after multiple threads have been started.
COMMON_METRICS_PREFIX
.set(prefix.into())
.unwrap_or_else(|_| {
eprintln!(
"set_common_metrics_prefix() was called second time with '{}', exiting",
prefix
);
std::process::exit(1);
});
}
/// Prepends a prefix to a common metric name so they are distinguished between
/// different services, see <https://github.com/zenithdb/zenith/pull/681>
/// A call to set_common_metrics_prefix() is necessary prior to calling this.
pub fn new_common_metric_name(unprefixed_metric_name: &str) -> String {
// Not unwrap() because metrics may be initialized after multiple threads have been started.
format!(
"{}_{}",
COMMON_METRICS_PREFIX.get().unwrap_or_else(|| {
eprintln!("set_common_metrics_prefix() was not called, but metrics are used, exiting");
std::process::exit(1);
}),
unprefixed_metric_name
)
}
lazy_static! {
static ref DISK_IO_BYTES: IntGaugeVec = register_int_gauge_vec!(
new_common_metric_name("disk_io_bytes"),
"libmetrics_disk_io_bytes_total",
"Bytes written and read from disk, grouped by the operation (read|write)",
&["io_operation"]
)
.expect("Failed to register disk i/o bytes int gauge vec");
static ref MAXRSS_KB: IntGauge = register_int_gauge!(
new_common_metric_name("maxrss_kb"),
"libmetrics_maxrss_kb",
"Memory usage (Maximum Resident Set Size)"
)
.expect("Failed to register maxrss_kb int gauge");

View File

@@ -87,7 +87,8 @@ pub trait RemoteStorage: Send + Sync {
async fn delete(&self, path: &Self::RemoteObjectId) -> anyhow::Result<()>;
}
/// TODO kb
/// Every storage kind currently supported.
/// Serves as a simple way to pass around the [`RemoteStorage`] without dealing with generics.
pub enum GenericRemoteStorage {
Local(LocalFs),
S3(S3Bucket),

View File

@@ -5,7 +5,7 @@ use anyhow::anyhow;
use hyper::header::AUTHORIZATION;
use hyper::{header::CONTENT_TYPE, Body, Request, Response, Server};
use lazy_static::lazy_static;
use metrics::{new_common_metric_name, register_int_counter, Encoder, IntCounter, TextEncoder};
use metrics::{register_int_counter, Encoder, IntCounter, TextEncoder};
use routerify::ext::RequestExt;
use routerify::RequestInfo;
use routerify::{Middleware, Router, RouterBuilder, RouterService};
@@ -18,7 +18,7 @@ use super::error::ApiError;
lazy_static! {
static ref SERVE_METRICS_COUNT: IntCounter = register_int_counter!(
new_common_metric_name("serve_metrics_count"),
"libmetrics_metric_handler_requests_total",
"Number of metric requests made"
)
.expect("failed to define a metric");

View File

@@ -1,25 +0,0 @@
version: "3"
services:
prometheus:
container_name: prometheus
image: prom/prometheus:latest
volumes:
- ./prometheus.yaml:/etc/prometheus/prometheus.yml
# ports:
# - "9090:9090"
# TODO: find a proper portable solution
network_mode: "host"
grafana:
image: grafana/grafana:latest
volumes:
- ./grafana.yaml:/etc/grafana/provisioning/datasources/datasources.yaml
environment:
- GF_AUTH_ANONYMOUS_ENABLED=true
- GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
- GF_AUTH_DISABLE_LOGIN_FORM=true
# ports:
# - "3000:3000"
# TODO: find a proper portable solution
network_mode: "host"

View File

@@ -1,12 +0,0 @@
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
orgId: 1
url: http://localhost:9090
basicAuth: false
isDefault: false
version: 1
editable: false

View File

@@ -1,5 +0,0 @@
scrape_configs:
- job_name: 'default'
scrape_interval: 10s
static_configs:
- targets: ['localhost:9898']

View File

@@ -1,10 +1,10 @@
use anyhow::{anyhow, bail, Context, Result};
use clap::{App, AppSettings, Arg, ArgMatches};
use control_plane::compute::ComputeControlPlane;
use control_plane::local_env;
use control_plane::local_env::LocalEnv;
use control_plane::local_env::{EtcdBroker, LocalEnv};
use control_plane::safekeeper::SafekeeperNode;
use control_plane::storage::PageServerNode;
use control_plane::{etcd, local_env};
use pageserver::config::defaults::{
DEFAULT_HTTP_LISTEN_ADDR as DEFAULT_PAGESERVER_HTTP_ADDR,
DEFAULT_PG_LISTEN_ADDR as DEFAULT_PAGESERVER_PG_ADDR,
@@ -14,6 +14,7 @@ use safekeeper::defaults::{
DEFAULT_PG_LISTEN_PORT as DEFAULT_SAFEKEEPER_PG_PORT,
};
use std::collections::{BTreeSet, HashMap};
use std::path::Path;
use std::process::exit;
use std::str::FromStr;
use utils::{
@@ -32,28 +33,27 @@ const DEFAULT_PAGESERVER_ID: ZNodeId = ZNodeId(1);
const DEFAULT_BRANCH_NAME: &str = "main";
project_git_version!(GIT_VERSION);
fn default_conf() -> String {
fn default_conf(etcd_binary_path: &Path) -> String {
format!(
r#"
# Default built-in configuration, defined in main.rs
[etcd_broker]
broker_endpoints = ['http://localhost:2379']
etcd_binary_path = '{etcd_binary_path}'
[pageserver]
id = {pageserver_id}
listen_pg_addr = '{pageserver_pg_addr}'
listen_http_addr = '{pageserver_http_addr}'
id = {DEFAULT_PAGESERVER_ID}
listen_pg_addr = '{DEFAULT_PAGESERVER_PG_ADDR}'
listen_http_addr = '{DEFAULT_PAGESERVER_HTTP_ADDR}'
auth_type = '{pageserver_auth_type}'
[[safekeepers]]
id = {safekeeper_id}
pg_port = {safekeeper_pg_port}
http_port = {safekeeper_http_port}
id = {DEFAULT_SAFEKEEPER_ID}
pg_port = {DEFAULT_SAFEKEEPER_PG_PORT}
http_port = {DEFAULT_SAFEKEEPER_HTTP_PORT}
"#,
pageserver_id = DEFAULT_PAGESERVER_ID,
pageserver_pg_addr = DEFAULT_PAGESERVER_PG_ADDR,
pageserver_http_addr = DEFAULT_PAGESERVER_HTTP_ADDR,
etcd_binary_path = etcd_binary_path.display(),
pageserver_auth_type = AuthType::Trust,
safekeeper_id = DEFAULT_SAFEKEEPER_ID,
safekeeper_pg_port = DEFAULT_SAFEKEEPER_PG_PORT,
safekeeper_http_port = DEFAULT_SAFEKEEPER_HTTP_PORT,
)
}
@@ -167,12 +167,12 @@ fn main() -> Result<()> {
.subcommand(App::new("create")
.arg(tenant_id_arg.clone())
.arg(timeline_id_arg.clone().help("Use a specific timeline id when creating a tenant and its initial timeline"))
.arg(Arg::new("config").short('c').takes_value(true).multiple_occurrences(true).required(false))
)
.arg(Arg::new("config").short('c').takes_value(true).multiple_occurrences(true).required(false))
)
.subcommand(App::new("config")
.arg(tenant_id_arg.clone())
.arg(Arg::new("config").short('c').takes_value(true).multiple_occurrences(true).required(false))
)
.arg(Arg::new("config").short('c').takes_value(true).multiple_occurrences(true).required(false))
)
)
.subcommand(
App::new("pageserver")
@@ -275,7 +275,7 @@ fn main() -> Result<()> {
"pageserver" => handle_pageserver(sub_args, &env),
"pg" => handle_pg(sub_args, &env),
"safekeeper" => handle_safekeeper(sub_args, &env),
_ => bail!("unexpected subcommand {}", sub_name),
_ => bail!("unexpected subcommand {sub_name}"),
};
if original_env != env {
@@ -289,7 +289,7 @@ fn main() -> Result<()> {
Ok(Some(updated_env)) => updated_env.persist_config(&updated_env.base_data_dir)?,
Ok(None) => (),
Err(e) => {
eprintln!("command failed: {:?}", e);
eprintln!("command failed: {e:?}");
exit(1);
}
}
@@ -468,21 +468,21 @@ fn parse_timeline_id(sub_match: &ArgMatches) -> anyhow::Result<Option<ZTimelineI
.context("Failed to parse timeline id from the argument string")
}
fn handle_init(init_match: &ArgMatches) -> Result<LocalEnv> {
fn handle_init(init_match: &ArgMatches) -> anyhow::Result<LocalEnv> {
let initial_timeline_id_arg = parse_timeline_id(init_match)?;
// Create config file
let toml_file: String = if let Some(config_path) = init_match.value_of("config") {
// load and parse the file
std::fs::read_to_string(std::path::Path::new(config_path))
.with_context(|| format!("Could not read configuration file \"{}\"", config_path))?
.with_context(|| format!("Could not read configuration file '{config_path}'"))?
} else {
// Built-in default config
default_conf()
default_conf(&EtcdBroker::locate_etcd()?)
};
let mut env =
LocalEnv::create_config(&toml_file).context("Failed to create neon configuration")?;
LocalEnv::parse_config(&toml_file).context("Failed to create neon configuration")?;
env.init().context("Failed to initialize neon repository")?;
// default_tenantid was generated by the `env.init()` call above
@@ -497,7 +497,7 @@ fn handle_init(init_match: &ArgMatches) -> Result<LocalEnv> {
&pageserver_config_overrides(init_match),
)
.unwrap_or_else(|e| {
eprintln!("pageserver init failed: {}", e);
eprintln!("pageserver init failed: {e}");
exit(1);
});
@@ -920,20 +920,23 @@ fn handle_safekeeper(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Resul
Ok(())
}
fn handle_start_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> {
fn handle_start_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> anyhow::Result<()> {
etcd::start_etcd_process(env)?;
let pageserver = PageServerNode::from_env(env);
// Postgres nodes are not started automatically
if let Err(e) = pageserver.start(&pageserver_config_overrides(sub_match)) {
eprintln!("pageserver start failed: {}", e);
eprintln!("pageserver start failed: {e}");
try_stop_etcd_process(env);
exit(1);
}
for node in env.safekeepers.iter() {
let safekeeper = SafekeeperNode::from_env(env, node);
if let Err(e) = safekeeper.start() {
eprintln!("safekeeper '{}' start failed: {}", safekeeper.id, e);
eprintln!("safekeeper '{}' start failed: {e}", safekeeper.id);
try_stop_etcd_process(env);
exit(1);
}
}
@@ -963,5 +966,14 @@ fn handle_stop_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<
eprintln!("safekeeper '{}' stop failed: {}", safekeeper.id, e);
}
}
try_stop_etcd_process(env);
Ok(())
}
fn try_stop_etcd_process(env: &local_env::LocalEnv) {
if let Err(e) = etcd::stop_etcd_process(env) {
eprintln!("etcd stop failed: {e}");
}
}

View File

@@ -55,6 +55,7 @@ fail = "0.5.0"
git-version = "0.3.5"
postgres_ffi = { path = "../libs/postgres_ffi" }
etcd_broker = { path = "../libs/etcd_broker" }
metrics = { path = "../libs/metrics" }
utils = { path = "../libs/utils" }
remote_storage = { path = "../libs/remote_storage" }

View File

@@ -38,7 +38,6 @@ fn version() -> String {
}
fn main() -> anyhow::Result<()> {
metrics::set_common_metrics_prefix("pageserver");
let arg_matches = App::new("Zenith page server")
.about("Materializes WAL stream to pages and serves them to the postgres")
.version(&*version())
@@ -98,6 +97,8 @@ fn main() -> anyhow::Result<()> {
let features: &[&str] = &[
#[cfg(feature = "failpoints")]
"failpoints",
#[cfg(feature = "profiling")]
"profiling",
];
println!("{{\"features\": {features:?} }}");
return Ok(());
@@ -183,13 +184,8 @@ fn main() -> anyhow::Result<()> {
// as a ref.
let conf: &'static PageServerConf = Box::leak(Box::new(conf));
// If failpoints are used, terminate the whole pageserver process if they are hit.
// Initialize failpoints support
let scenario = FailScenario::setup();
if fail::has_failpoints() {
std::panic::set_hook(Box::new(|_| {
std::process::exit(1);
}));
}
// Basic initialization of things that don't change after startup
virtual_file::init(conf.max_file_descriptors);

View File

@@ -13,6 +13,7 @@ use std::str::FromStr;
use std::time::Duration;
use toml_edit;
use toml_edit::{Document, Item};
use url::Url;
use utils::{
postgres_backend::AuthType,
zid::{ZNodeId, ZTenantId, ZTimelineId},
@@ -111,6 +112,13 @@ pub struct PageServerConf {
pub profiling: ProfilingConfig,
pub default_tenant_conf: TenantConf,
/// A prefix to add in the etcd broker before every key.
/// Can be used for isolating different pageserver groups within the same etcd cluster.
pub broker_etcd_prefix: String,
/// Etcd broker endpoints to connect to.
pub broker_endpoints: Vec<Url>,
}
#[derive(Debug, Clone, PartialEq, Eq)]
@@ -175,6 +183,8 @@ struct PageServerConfigBuilder {
id: BuilderValue<ZNodeId>,
profiling: BuilderValue<ProfilingConfig>,
broker_etcd_prefix: BuilderValue<String>,
broker_endpoints: BuilderValue<Vec<Url>>,
}
impl Default for PageServerConfigBuilder {
@@ -200,6 +210,8 @@ impl Default for PageServerConfigBuilder {
remote_storage_config: Set(None),
id: NotSet,
profiling: Set(ProfilingConfig::Disabled),
broker_etcd_prefix: Set(etcd_broker::DEFAULT_NEON_BROKER_ETCD_PREFIX.to_string()),
broker_endpoints: Set(Vec::new()),
}
}
}
@@ -256,6 +268,14 @@ impl PageServerConfigBuilder {
self.remote_storage_config = BuilderValue::Set(remote_storage_config)
}
pub fn broker_endpoints(&mut self, broker_endpoints: Vec<Url>) {
self.broker_endpoints = BuilderValue::Set(broker_endpoints)
}
pub fn broker_etcd_prefix(&mut self, broker_etcd_prefix: String) {
self.broker_etcd_prefix = BuilderValue::Set(broker_etcd_prefix)
}
pub fn id(&mut self, node_id: ZNodeId) {
self.id = BuilderValue::Set(node_id)
}
@@ -264,7 +284,11 @@ impl PageServerConfigBuilder {
self.profiling = BuilderValue::Set(profiling)
}
pub fn build(self) -> Result<PageServerConf> {
pub fn build(self) -> anyhow::Result<PageServerConf> {
let broker_endpoints = self
.broker_endpoints
.ok_or(anyhow!("No broker endpoints provided"))?;
Ok(PageServerConf {
listen_pg_addr: self
.listen_pg_addr
@@ -300,6 +324,10 @@ impl PageServerConfigBuilder {
profiling: self.profiling.ok_or(anyhow!("missing profiling"))?,
// TenantConf is handled separately
default_tenant_conf: TenantConf::default(),
broker_endpoints,
broker_etcd_prefix: self
.broker_etcd_prefix
.ok_or(anyhow!("missing broker_etcd_prefix"))?,
})
}
}
@@ -341,7 +369,7 @@ impl PageServerConf {
/// validating the input and failing on errors.
///
/// This leaves any options not present in the file in the built-in defaults.
pub fn parse_and_validate(toml: &Document, workdir: &Path) -> Result<Self> {
pub fn parse_and_validate(toml: &Document, workdir: &Path) -> anyhow::Result<Self> {
let mut builder = PageServerConfigBuilder::default();
builder.workdir(workdir.to_owned());
@@ -373,6 +401,17 @@ impl PageServerConf {
}
"id" => builder.id(ZNodeId(parse_toml_u64(key, item)?)),
"profiling" => builder.profiling(parse_toml_from_str(key, item)?),
"broker_etcd_prefix" => builder.broker_etcd_prefix(parse_toml_string(key, item)?),
"broker_endpoints" => builder.broker_endpoints(
parse_toml_array(key, item)?
.into_iter()
.map(|endpoint_str| {
endpoint_str.parse::<Url>().with_context(|| {
format!("Array item {endpoint_str} for key {key} is not a valid url endpoint")
})
})
.collect::<anyhow::Result<_>>()?,
),
_ => bail!("unrecognized pageserver option '{key}'"),
}
}
@@ -526,6 +565,8 @@ impl PageServerConf {
remote_storage_config: None,
profiling: ProfilingConfig::Disabled,
default_tenant_conf: TenantConf::dummy_conf(),
broker_endpoints: Vec::new(),
broker_etcd_prefix: etcd_broker::DEFAULT_NEON_BROKER_ETCD_PREFIX.to_string(),
}
}
}
@@ -576,14 +617,36 @@ fn parse_toml_duration(name: &str, item: &Item) -> Result<Duration> {
Ok(humantime::parse_duration(s)?)
}
fn parse_toml_from_str<T>(name: &str, item: &Item) -> Result<T>
fn parse_toml_from_str<T>(name: &str, item: &Item) -> anyhow::Result<T>
where
T: FromStr<Err = anyhow::Error>,
T: FromStr,
<T as FromStr>::Err: std::fmt::Display,
{
let v = item
.as_str()
.with_context(|| format!("configure option {name} is not a string"))?;
T::from_str(v)
T::from_str(v).map_err(|e| {
anyhow!(
"Failed to parse string as {parse_type} for configure option {name}: {e}",
parse_type = stringify!(T)
)
})
}
fn parse_toml_array(name: &str, item: &Item) -> anyhow::Result<Vec<String>> {
let array = item
.as_array()
.with_context(|| format!("configure option {name} is not an array"))?;
array
.iter()
.map(|value| {
value
.as_str()
.map(str::to_string)
.with_context(|| format!("Array item {value:?} for key {name} is not a string"))
})
.collect()
}
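
The new `broker_endpoints` option is parsed in two steps: read the TOML array as strings with `parse_toml_array`, then validate each item as a `Url`. A hedged, standalone sketch of that same two-step parse using `toml_edit` and `url` (not the builder code above verbatim):

```rust
use anyhow::Context;
use toml_edit::Document;
use url::Url;

fn main() -> anyhow::Result<()> {
    // The same shape the pageserver config file uses.
    let doc: Document = "broker_endpoints = ['http://127.0.0.1:2379']".parse()?;
    let item = &doc["broker_endpoints"];

    let endpoints: Vec<Url> = item
        .as_array()
        .context("broker_endpoints is not an array")?
        .iter()
        .map(|value| {
            let s = value.as_str().context("array item is not a string")?;
            s.parse::<Url>()
                .with_context(|| format!("'{s}' is not a valid url endpoint"))
        })
        .collect::<anyhow::Result<_>>()?;

    println!("{endpoints:?}");
    Ok(())
}
```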
#[cfg(test)]
@@ -616,12 +679,16 @@ id = 10
fn parse_defaults() -> anyhow::Result<()> {
let tempdir = tempdir()?;
let (workdir, pg_distrib_dir) = prepare_fs(&tempdir)?;
// we have to create dummy pathes to overcome the validation errors
let config_string = format!("pg_distrib_dir='{}'\nid=10", pg_distrib_dir.display());
let broker_endpoint = "http://127.0.0.1:7777";
// we have to create dummy values to overcome the validation errors
let config_string = format!(
"pg_distrib_dir='{}'\nid=10\nbroker_endpoints = ['{broker_endpoint}']",
pg_distrib_dir.display()
);
let toml = config_string.parse()?;
let parsed_config = PageServerConf::parse_and_validate(&toml, &workdir)
.unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e}"));
.unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e:?}"));
assert_eq!(
parsed_config,
@@ -641,6 +708,10 @@ id = 10
remote_storage_config: None,
profiling: ProfilingConfig::Disabled,
default_tenant_conf: TenantConf::default(),
broker_endpoints: vec![broker_endpoint
.parse()
.expect("Failed to parse a valid broker endpoint URL")],
broker_etcd_prefix: etcd_broker::DEFAULT_NEON_BROKER_ETCD_PREFIX.to_string(),
},
"Correct defaults should be used when no config values are provided"
);
@@ -652,15 +723,16 @@ id = 10
fn parse_basic_config() -> anyhow::Result<()> {
let tempdir = tempdir()?;
let (workdir, pg_distrib_dir) = prepare_fs(&tempdir)?;
let broker_endpoint = "http://127.0.0.1:7777";
let config_string = format!(
"{ALL_BASE_VALUES_TOML}pg_distrib_dir='{}'",
"{ALL_BASE_VALUES_TOML}pg_distrib_dir='{}'\nbroker_endpoints = ['{broker_endpoint}']",
pg_distrib_dir.display()
);
let toml = config_string.parse()?;
let parsed_config = PageServerConf::parse_and_validate(&toml, &workdir)
.unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e}"));
.unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e:?}"));
assert_eq!(
parsed_config,
@@ -680,6 +752,10 @@ id = 10
remote_storage_config: None,
profiling: ProfilingConfig::Disabled,
default_tenant_conf: TenantConf::default(),
broker_endpoints: vec![broker_endpoint
.parse()
.expect("Failed to parse a valid broker endpoint URL")],
broker_etcd_prefix: etcd_broker::DEFAULT_NEON_BROKER_ETCD_PREFIX.to_string(),
},
"Should be able to parse all basic config values correctly"
);
@@ -691,6 +767,7 @@ id = 10
fn parse_remote_fs_storage_config() -> anyhow::Result<()> {
let tempdir = tempdir()?;
let (workdir, pg_distrib_dir) = prepare_fs(&tempdir)?;
let broker_endpoint = "http://127.0.0.1:7777";
let local_storage_path = tempdir.path().join("local_remote_storage");
@@ -710,6 +787,7 @@ local_path = '{}'"#,
let config_string = format!(
r#"{ALL_BASE_VALUES_TOML}
pg_distrib_dir='{}'
broker_endpoints = ['{broker_endpoint}']
{remote_storage_config_str}"#,
pg_distrib_dir.display(),
@@ -718,7 +796,9 @@ pg_distrib_dir='{}'
let toml = config_string.parse()?;
let parsed_remote_storage_config = PageServerConf::parse_and_validate(&toml, &workdir)
.unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e}"))
.unwrap_or_else(|e| {
panic!("Failed to parse config '{config_string}', reason: {e:?}")
})
.remote_storage_config
.expect("Should have remote storage config for the local FS");
@@ -728,7 +808,7 @@ pg_distrib_dir='{}'
max_concurrent_syncs: NonZeroUsize::new(
remote_storage::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNCS
)
.unwrap(),
.unwrap(),
max_sync_errors: NonZeroU32::new(remote_storage::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS)
.unwrap(),
storage: RemoteStorageKind::LocalFs(local_storage_path.clone()),
@@ -751,6 +831,7 @@ pg_distrib_dir='{}'
let max_concurrent_syncs = NonZeroUsize::new(111).unwrap();
let max_sync_errors = NonZeroU32::new(222).unwrap();
let s3_concurrency_limit = NonZeroUsize::new(333).unwrap();
let broker_endpoint = "http://127.0.0.1:7777";
let identical_toml_declarations = &[
format!(
@@ -773,6 +854,7 @@ concurrency_limit = {s3_concurrency_limit}"#
let config_string = format!(
r#"{ALL_BASE_VALUES_TOML}
pg_distrib_dir='{}'
broker_endpoints = ['{broker_endpoint}']
{remote_storage_config_str}"#,
pg_distrib_dir.display(),
@@ -781,7 +863,9 @@ pg_distrib_dir='{}'
let toml = config_string.parse()?;
let parsed_remote_storage_config = PageServerConf::parse_and_validate(&toml, &workdir)
.unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e}"))
.unwrap_or_else(|e| {
panic!("Failed to parse config '{config_string}', reason: {e:?}")
})
.remote_storage_config
.expect("Should have remote storage config for S3");

View File

@@ -123,6 +123,53 @@ paths:
schema:
$ref: "#/components/schemas/Error"
/v1/tenant/{tenant_id}/timeline/{timeline_id}/wal_receiver:
parameters:
- name: tenant_id
in: path
required: true
schema:
type: string
format: hex
- name: timeline_id
in: path
required: true
schema:
type: string
format: hex
get:
description: Get wal receiver's data attached to the timeline
responses:
"200":
description: WalReceiverEntry
content:
application/json:
schema:
$ref: "#/components/schemas/WalReceiverEntry"
"401":
description: Unauthorized Error
content:
application/json:
schema:
$ref: "#/components/schemas/UnauthorizedError"
"403":
description: Forbidden Error
content:
application/json:
schema:
$ref: "#/components/schemas/ForbiddenError"
"404":
description: Error when no wal receiver is running or found
content:
application/json:
schema:
$ref: "#/components/schemas/NotFoundError"
"500":
description: Generic operation error
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
/v1/tenant/{tenant_id}/timeline/{timeline_id}/attach:
parameters:
@@ -520,6 +567,21 @@ components:
type: integer
current_logical_size_non_incremental:
type: integer
WalReceiverEntry:
type: object
required:
- thread_id
- wal_producer_connstr
properties:
thread_id:
type: integer
wal_producer_connstr:
type: string
last_received_msg_lsn:
type: string
format: hex
last_received_msg_ts:
type: integer
Error:
type: object

View File

@@ -224,6 +224,30 @@ async fn timeline_detail_handler(request: Request<Body>) -> Result<Response<Body
json_response(StatusCode::OK, timeline_info)
}
async fn wal_receiver_get_handler(request: Request<Body>) -> Result<Response<Body>, ApiError> {
let tenant_id: ZTenantId = parse_request_param(&request, "tenant_id")?;
check_permission(&request, Some(tenant_id))?;
let timeline_id: ZTimelineId = parse_request_param(&request, "timeline_id")?;
let wal_receiver = tokio::task::spawn_blocking(move || {
let _enter =
info_span!("wal_receiver_get", tenant = %tenant_id, timeline = %timeline_id).entered();
crate::walreceiver::get_wal_receiver_entry(tenant_id, timeline_id)
})
.await
.map_err(ApiError::from_err)?
.ok_or_else(|| {
ApiError::NotFound(format!(
"WAL receiver not found for tenant {} and timeline {}",
tenant_id, timeline_id
))
})?;
json_response(StatusCode::OK, wal_receiver)
}
async fn timeline_attach_handler(request: Request<Body>) -> Result<Response<Body>, ApiError> {
let tenant_id: ZTenantId = parse_request_param(&request, "tenant_id")?;
check_permission(&request, Some(tenant_id))?;
@@ -485,6 +509,10 @@ pub fn make_router(
"/v1/tenant/:tenant_id/timeline/:timeline_id",
timeline_detail_handler,
)
.get(
"/v1/tenant/:tenant_id/timeline/:timeline_id/wal_receiver",
wal_receiver_get_handler,
)
.post(
"/v1/tenant/:tenant_id/timeline/:timeline_id/attach",
timeline_attach_handler,

View File

@@ -18,7 +18,7 @@ use itertools::Itertools;
use lazy_static::lazy_static;
use tracing::*;
use std::cmp::{max, min, Ordering};
use std::cmp::{max, Ordering};
use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::collections::{BTreeSet, HashSet};
@@ -1357,7 +1357,9 @@ impl LayeredTimeline {
let mut timeline_owned;
let mut timeline = self;
let mut path: Vec<(ValueReconstructResult, Lsn, Arc<dyn Layer>)> = Vec::new();
// For debugging purposes, collect the path of layers that we traversed
// through. It's included in the error message if we fail to find the key.
let mut traversal_path: Vec<(ValueReconstructResult, Lsn, Arc<dyn Layer>)> = Vec::new();
let cached_lsn = if let Some((cached_lsn, _)) = &reconstruct_state.img {
*cached_lsn
@@ -1387,32 +1389,24 @@ impl LayeredTimeline {
if prev_lsn <= cont_lsn {
// Didn't make any progress in last iteration. Error out to avoid
// getting stuck in the loop.
// For debugging purposes, print the path of layers that we traversed
// through.
for (r, c, l) in path {
error!(
"PATH: result {:?}, cont_lsn {}, layer: {}",
r,
c,
l.filename().display()
);
}
bail!("could not find layer with more data for key {} at LSN {}, request LSN {}, ancestor {}",
key,
Lsn(cont_lsn.0 - 1),
request_lsn,
timeline.ancestor_lsn)
return layer_traversal_error(format!(
"could not find layer with more data for key {} at LSN {}, request LSN {}, ancestor {}",
key,
Lsn(cont_lsn.0 - 1),
request_lsn,
timeline.ancestor_lsn
), traversal_path);
}
prev_lsn = cont_lsn;
}
ValueReconstructResult::Missing => {
bail!(
"could not find data for key {} at LSN {}, for request at LSN {}",
key,
cont_lsn,
request_lsn
)
return layer_traversal_error(
format!(
"could not find data for key {} at LSN {}, for request at LSN {}",
key, cont_lsn, request_lsn
),
traversal_path,
);
}
}
@@ -1447,7 +1441,7 @@ impl LayeredTimeline {
reconstruct_state,
)?;
cont_lsn = lsn_floor;
path.push((result, cont_lsn, open_layer.clone()));
traversal_path.push((result, cont_lsn, open_layer.clone()));
continue;
}
}
@@ -1462,7 +1456,7 @@ impl LayeredTimeline {
reconstruct_state,
)?;
cont_lsn = lsn_floor;
path.push((result, cont_lsn, frozen_layer.clone()));
traversal_path.push((result, cont_lsn, frozen_layer.clone()));
continue 'outer;
}
}
@@ -1477,7 +1471,7 @@ impl LayeredTimeline {
reconstruct_state,
)?;
cont_lsn = lsn_floor;
path.push((result, cont_lsn, layer));
traversal_path.push((result, cont_lsn, layer));
} else if timeline.ancestor_timeline.is_some() {
// Nothing on this timeline. Traverse to parent
result = ValueReconstructResult::Continue;
@@ -1621,22 +1615,30 @@ impl LayeredTimeline {
pub fn check_checkpoint_distance(self: &Arc<LayeredTimeline>) -> Result<()> {
let last_lsn = self.get_last_record_lsn();
// Has more than 'checkpoint_distance' of WAL been accumulated?
let distance = last_lsn.widening_sub(self.last_freeze_at.load());
if distance >= self.get_checkpoint_distance().into() {
// Yes. Freeze the current in-memory layer.
self.freeze_inmem_layer(true);
self.last_freeze_at.store(last_lsn);
}
if let Ok(guard) = self.layer_flush_lock.try_lock() {
drop(guard);
let self_clone = Arc::clone(self);
thread_mgr::spawn(
thread_mgr::ThreadKind::LayerFlushThread,
Some(self.tenant_id),
Some(self.timeline_id),
"layer flush thread",
false,
move || self_clone.flush_frozen_layers(false),
)?;
// Launch a thread to flush the frozen layer to disk, unless
// a thread was already running. (If the thread was running
// at the time that we froze the layer, it must've seen the
// the layer we just froze before it exited; see comments
// in flush_frozen_layers())
if let Ok(guard) = self.layer_flush_lock.try_lock() {
drop(guard);
let self_clone = Arc::clone(self);
thread_mgr::spawn(
thread_mgr::ThreadKind::LayerFlushThread,
Some(self.tenant_id),
Some(self.timeline_id),
"layer flush thread",
false,
move || self_clone.flush_frozen_layers(false),
)?;
}
}
Ok(())
}
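As an aside, here is the "spawn a flush thread unless one is already running" idea above, reduced to a self-contained sketch using only std types (not the pageserver's thread_mgr):

// Illustrative sketch only, assuming a shared flush lock per timeline.
use std::sync::{Arc, Mutex};
use std::thread;

fn maybe_spawn_flush(flush_lock: &Arc<Mutex<()>>) {
    // If try_lock succeeds, no flush thread currently holds the lock.
    if let Ok(guard) = flush_lock.try_lock() {
        drop(guard);
        let lock = Arc::clone(flush_lock);
        thread::spawn(move || {
            let _guard = lock.lock().unwrap();
            // ... flush all frozen layers to disk here ...
        });
    }
    // Otherwise a flush thread is already running; it will pick up the layer
    // we just froze before it exits, so there is nothing more to do.
}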
@@ -1944,41 +1946,87 @@ impl LayeredTimeline {
Ok(new_path)
}
///
/// Collect a bunch of Level 0 layer files, and compact and reshuffle them
/// as Level 1 files.
///
fn compact_level0(&self, target_file_size: u64) -> Result<()> {
let layers = self.layers.read().unwrap();
let level0_deltas = layers.get_level0_deltas()?;
// We compact or "shuffle" the level-0 delta layers when they've
// accumulated over the compaction threshold.
if level0_deltas.len() < self.get_compaction_threshold() {
return Ok(());
}
let mut level0_deltas = layers.get_level0_deltas()?;
drop(layers);
// FIXME: this function probably won't work correctly if there's overlap
// in the deltas.
let lsn_range = level0_deltas
.iter()
.map(|l| l.get_lsn_range())
.reduce(|a, b| min(a.start, b.start)..max(a.end, b.end))
.unwrap();
// Only compact if enough layers have accumulated.
if level0_deltas.is_empty() || level0_deltas.len() < self.get_compaction_threshold() {
return Ok(());
}
let all_values_iter = level0_deltas.iter().map(|l| l.iter()).kmerge_by(|a, b| {
if let Ok((a_key, a_lsn, _)) = a {
if let Ok((b_key, b_lsn, _)) = b {
match a_key.cmp(b_key) {
Ordering::Less => true,
Ordering::Equal => a_lsn <= b_lsn,
Ordering::Greater => false,
// Gather the files to compact in this iteration.
//
// Start with the oldest Level 0 delta file, and collect any other
// level 0 files that form a contiguous sequence, such that the end
// LSN of previous file matches the start LSN of the next file.
//
// Note that if the files don't form such a sequence, we might
// "compact" just a single file. That's a bit pointless, but it allows
// us to get rid of the level 0 file, and compact the other files on
// the next iteration. This could probably be made smarter, but such
// "gaps" in the sequence of level 0 files should only happen in case
// of a crash, partial download from cloud storage, or something like
// that, so it's not a big deal in practice.
level0_deltas.sort_by_key(|l| l.get_lsn_range().start);
let mut level0_deltas_iter = level0_deltas.iter();
let first_level0_delta = level0_deltas_iter.next().unwrap();
let mut prev_lsn_end = first_level0_delta.get_lsn_range().end;
let mut deltas_to_compact = vec![Arc::clone(first_level0_delta)];
for l in level0_deltas_iter {
let lsn_range = l.get_lsn_range();
if lsn_range.start != prev_lsn_end {
break;
}
deltas_to_compact.push(Arc::clone(l));
prev_lsn_end = lsn_range.end;
}
let lsn_range = Range {
start: deltas_to_compact.first().unwrap().get_lsn_range().start,
end: deltas_to_compact.last().unwrap().get_lsn_range().end,
};
info!(
"Starting Level0 compaction in LSN range {}-{} for {} layers ({} deltas in total)",
lsn_range.start,
lsn_range.end,
deltas_to_compact.len(),
level0_deltas.len()
);
for l in deltas_to_compact.iter() {
info!("compact includes {}", l.filename().display());
}
// We don't need the original list of layers anymore. Drop it so that
// we don't accidentally use it later in the function.
drop(level0_deltas);
// This iterator walks through all key-value pairs from all the layers
// we're compacting, in key, LSN order.
let all_values_iter = deltas_to_compact
.iter()
.map(|l| l.iter())
.kmerge_by(|a, b| {
if let Ok((a_key, a_lsn, _)) = a {
if let Ok((b_key, b_lsn, _)) = b {
match a_key.cmp(b_key) {
Ordering::Less => true,
Ordering::Equal => a_lsn <= b_lsn,
Ordering::Greater => false,
}
} else {
false
}
} else {
false
true
}
} else {
true
}
});
});
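As a toy illustration of the comparator above (standalone code, not pageserver types), kmerge_by yields items in (key, LSN) order across several per-layer iterators:

// Illustrative sketch only; keys and LSNs are plain integers here.
use itertools::Itertools;
use std::cmp::Ordering;

fn main() {
    // (key, lsn) pairs, already sorted within each "layer".
    let layer_a = vec![(1u32, 10u64), (2, 30)];
    let layer_b = vec![(1u32, 20u64), (3, 5)];

    let merged: Vec<(u32, u64)> = vec![layer_a, layer_b]
        .into_iter()
        .kmerge_by(|a, b| match a.0.cmp(&b.0) {
            Ordering::Less => true,
            Ordering::Equal => a.1 <= b.1, // same key: lower LSN first
            Ordering::Greater => false,
        })
        .collect();

    assert_eq!(merged, vec![(1, 10), (1, 20), (2, 30), (3, 5)]);
}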
// Merge the contents of all the input delta layers into a new set
// of delta layers, based on the current partitioning.
@@ -2044,8 +2092,8 @@ impl LayeredTimeline {
// Now that we have reshuffled the data to set of new delta layers, we can
// delete the old ones
let mut layer_paths_do_delete = HashSet::with_capacity(level0_deltas.len());
for l in level0_deltas {
let mut layer_paths_do_delete = HashSet::with_capacity(deltas_to_compact.len());
for l in deltas_to_compact {
l.delete()?;
if let Some(path) = l.local_path() {
layer_paths_do_delete.insert(path);
@@ -2367,6 +2415,32 @@ impl LayeredTimeline {
}
}
/// Helper function for get_reconstruct_data() to add the path of layers traversed
/// to an error, as anyhow context information.
fn layer_traversal_error(
msg: String,
path: Vec<(ValueReconstructResult, Lsn, Arc<dyn Layer>)>,
) -> anyhow::Result<()> {
// We want the original 'msg' to be the outermost context. The outermost context
// is the most high-level information, which also gets propagated to the client.
let mut msg_iter = path
.iter()
.map(|(r, c, l)| {
format!(
"layer traversal: result {:?}, cont_lsn {}, layer: {}",
r,
c,
l.filename().display()
)
})
.chain(std::iter::once(msg));
// Construct initial message from the first traversed layer
let err = anyhow!(msg_iter.next().unwrap());
// Append all subsequent traversals, and the error message 'msg', as contexts.
Err(msg_iter.fold(err, |err, msg| err.context(msg)))
}
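To see how the fold above shapes the final error, here is a tiny standalone sketch (made-up messages) of chaining anyhow contexts so that 'msg' ends up as the outermost, user-visible line:

// Illustrative sketch only; messages are invented.
use anyhow::anyhow;

fn main() {
    // Innermost-first traversal records, as collected in traversal_path.
    let traversal = vec![
        "layer traversal: result Continue, cont_lsn 0/169607A, layer: <delta 1>".to_string(),
        "layer traversal: result Missing, cont_lsn 0/1696070, layer: <delta 0>".to_string(),
    ];
    let msg = "could not find data for key K at LSN L".to_string();

    let mut msg_iter = traversal.into_iter().chain(std::iter::once(msg));
    let err = anyhow!(msg_iter.next().unwrap());
    let err = msg_iter.fold(err, |err, m| err.context(m));

    // The alternate Debug output prints the outermost context (the 'msg' added
    // last) first, followed by a "Caused by" chain of the traversed layers.
    println!("{:?}", err);
}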
struct LayeredTimelineWriter<'a> {
tl: &'a LayeredTimeline,
_write_guard: MutexGuard<'a, ()>,

View File

@@ -37,6 +37,7 @@ use crate::virtual_file::VirtualFile;
use crate::walrecord;
use crate::{DELTA_FILE_MAGIC, STORAGE_FORMAT_VERSION};
use anyhow::{bail, ensure, Context, Result};
use rand::{distributions::Alphanumeric, Rng};
use serde::{Deserialize, Serialize};
use std::fs;
use std::io::{BufWriter, Write};
@@ -420,6 +421,28 @@ impl DeltaLayer {
}
}
fn temp_path_for(
conf: &PageServerConf,
timelineid: ZTimelineId,
tenantid: ZTenantId,
key_start: Key,
lsn_range: &Range<Lsn>,
) -> PathBuf {
let rand_string: String = rand::thread_rng()
.sample_iter(&Alphanumeric)
.take(8)
.map(char::from)
.collect();
conf.timeline_path(&timelineid, &tenantid).join(format!(
"{}-XXX__{:016X}-{:016X}.{}.temp",
key_start,
u64::from(lsn_range.start),
u64::from(lsn_range.end),
rand_string
))
}
///
/// Open the underlying file and read the metadata into memory, if it's
/// not loaded already.
@@ -607,12 +630,8 @@ impl DeltaLayerWriter {
//
// Note: This overwrites any existing file. There shouldn't be any.
// FIXME: throw an error instead?
let path = conf.timeline_path(&timelineid, &tenantid).join(format!(
"{}-XXX__{:016X}-{:016X}.temp",
key_start,
u64::from(lsn_range.start),
u64::from(lsn_range.end)
));
let path = DeltaLayer::temp_path_for(conf, timelineid, tenantid, key_start, &lsn_range);
let mut file = VirtualFile::create(&path)?;
// make room for the header block
file.seek(SeekFrom::Start(PAGE_SZ as u64))?;
@@ -705,6 +724,8 @@ impl DeltaLayerWriter {
}),
};
// fsync the file
file.sync_all()?;
// Rename the file to its final name
//
// Note: This overwrites any existing file. There shouldn't be any.
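For reference, the temp-file-then-rename idea used here, reduced to a standalone sketch (illustrative names, not the pageserver's actual API):

// Illustrative sketch only, assuming a caller that has the final path and the bytes.
use rand::{distributions::Alphanumeric, Rng};
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

fn write_layer_atomically(final_path: &Path, contents: &[u8]) -> std::io::Result<()> {
    // Random suffix so concurrent writers never collide on the temp name.
    let rand_string: String = rand::thread_rng()
        .sample_iter(&Alphanumeric)
        .take(8)
        .map(char::from)
        .collect();
    let temp_path = final_path.with_extension(format!("{rand_string}.temp"));

    let mut file = File::create(&temp_path)?;
    file.write_all(contents)?;
    // fsync before the rename, so a crash cannot expose a renamed-but-unsynced file.
    file.sync_all()?;
    // The rename is atomic; any leftover *.temp files are removed on startup
    // (see the collect_timeline_files change further down).
    fs::rename(&temp_path, final_path)
}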

View File

@@ -444,6 +444,13 @@ where
///
/// stack[0] is the current root page, stack.last() is the leaf.
///
/// We maintain the invariant that the length of the stack is always greater than zero.
/// Two exceptions are:
/// 1. `Self::flush_node`. The method pushes a new node if it extracted the last one,
/// so other methods never observe the intermediate state and the invariant still holds.
/// 2. `Self::finish`. It consumes self and does not return it back,
/// which means that this is where the structure is destroyed.
/// Thus a stack of zero length cannot be observed by other methods.
stack: Vec<BuildNode<L>>,
/// Last key that was appended to the tree. Used to sanity check that append
@@ -482,7 +489,10 @@ where
fn append_internal(&mut self, key: &[u8; L], value: Value) -> Result<()> {
// Try to append to the current leaf buffer
let last = self.stack.last_mut().unwrap();
let last = self
.stack
.last_mut()
.expect("should always have at least one item");
let level = last.level;
if last.push(key, value) {
return Ok(());
@@ -512,19 +522,25 @@ where
Ok(())
}
/// Flush the bottommost node in the stack to disk. Appends a downlink to its parent,
/// and recursively flushes the parent too, if it becomes full. If the root page becomes full,
/// creates a new root page, increasing the height of the tree.
fn flush_node(&mut self) -> Result<()> {
let last = self.stack.pop().unwrap();
// Get the current bottommost node in the stack and flush it to disk.
let last = self
.stack
.pop()
.expect("should always have at least one item");
let buf = last.pack();
let downlink_key = last.first_key();
let downlink_ptr = self.writer.write_blk(buf)?;
// Append the downlink to the parent
// Append the downlink to the parent. If there is no parent, ie. this was the root page,
// create a new root page, increasing the height of the tree.
if self.stack.is_empty() {
self.stack.push(BuildNode::new(last.level + 1));
}
self.append_internal(&downlink_key, Value::from_blknum(downlink_ptr))?;
Ok(())
self.append_internal(&downlink_key, Value::from_blknum(downlink_ptr))
}
///
@@ -540,7 +556,10 @@ where
self.flush_node()?;
}
let root = self.stack.first().unwrap();
let root = self
.stack
.first()
.expect("by the check above we left one item there");
let buf = root.pack();
let root_blknum = self.writer.write_blk(buf)?;

View File

@@ -34,6 +34,7 @@ use crate::{IMAGE_FILE_MAGIC, STORAGE_FORMAT_VERSION};
use anyhow::{bail, ensure, Context, Result};
use bytes::Bytes;
use hex;
use rand::{distributions::Alphanumeric, Rng};
use serde::{Deserialize, Serialize};
use std::fs;
use std::io::Write;
@@ -241,6 +242,22 @@ impl ImageLayer {
}
}
fn temp_path_for(
conf: &PageServerConf,
timelineid: ZTimelineId,
tenantid: ZTenantId,
fname: &ImageFileName,
) -> PathBuf {
let rand_string: String = rand::thread_rng()
.sample_iter(&Alphanumeric)
.take(8)
.map(char::from)
.collect();
conf.timeline_path(&timelineid, &tenantid)
.join(format!("{}.{}.temp", fname, rand_string))
}
///
/// Open the underlying file and read the metadata into memory, if it's
/// not loaded already.
@@ -398,7 +415,7 @@ impl ImageLayer {
///
pub struct ImageLayerWriter {
conf: &'static PageServerConf,
_path: PathBuf,
path: PathBuf,
timelineid: ZTimelineId,
tenantid: ZTenantId,
key_range: Range<Key>,
@@ -416,12 +433,10 @@ impl ImageLayerWriter {
key_range: &Range<Key>,
lsn: Lsn,
) -> anyhow::Result<ImageLayerWriter> {
// Create the file
//
// Note: This overwrites any existing file. There shouldn't be any.
// FIXME: throw an error instead?
let path = ImageLayer::path_for(
&PathOrConf::Conf(conf),
// Create the file initially with a temporary filename.
// We'll atomically rename it to the final name when we're done.
let path = ImageLayer::temp_path_for(
conf,
timelineid,
tenantid,
&ImageFileName {
@@ -441,7 +456,7 @@ impl ImageLayerWriter {
let writer = ImageLayerWriter {
conf,
_path: path,
path,
timelineid,
tenantid,
key_range: key_range.clone(),
@@ -512,6 +527,25 @@ impl ImageLayerWriter {
index_root_blk,
}),
};
// fsync the file
file.sync_all()?;
// Rename the file to its final name
//
// Note: This overwrites any existing file. There shouldn't be any.
// FIXME: throw an error instead?
let final_path = ImageLayer::path_for(
&PathOrConf::Conf(self.conf),
self.timelineid,
self.tenantid,
&ImageFileName {
key_range: self.key_range.clone(),
lsn: self.lsn,
},
);
std::fs::rename(self.path, &final_path)?;
trace!("created image layer {}", layer.path().display());
Ok(layer)

View File

@@ -730,7 +730,18 @@ impl postgres_backend::Handler for PageServerHandler {
for failpoint in failpoints.split(';') {
if let Some((name, actions)) = failpoint.split_once('=') {
info!("cfg failpoint: {} {}", name, actions);
fail::cfg(name, actions).unwrap();
// We recognize one extra "action" that's not natively recognized
// by the failpoints crate: exit, to immediately kill the process
if actions == "exit" {
fail::cfg_callback(name, || {
info!("Exit requested by failpoint");
std::process::exit(1);
})
.unwrap();
} else {
fail::cfg(name, actions).unwrap();
}
} else {
bail!("Invalid failpoints format");
}
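For context, the actions configured here take effect wherever the code declares a failpoint with the fail crate's macro. A minimal sketch — the failpoint name is one used by the test changes further down, the surrounding function is illustrative:

// Illustrative sketch only; the real failpoint sits inside the checkpoint code path.
fn checkpoint_after_sync_step() {
    // With `failpoints checkpoint-after-sync=exit`, the callback registered
    // above runs here and the whole process exits immediately.
    fail::fail_point!("checkpoint-after-sync");
    // ... rest of the checkpoint step ...
}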

View File

@@ -9,7 +9,7 @@
//!
//! * public API to interact with the external world:
//! * [`start_local_timeline_sync`] to launch a background async loop to handle the synchronization
//! * [`schedule_timeline_checkpoint_upload`] and [`schedule_timeline_download`] to enqueue a new upload and download tasks,
//! * [`schedule_layer_upload`], [`schedule_layer_download`], and [`schedule_layer_delete`] to enqueue a new task
//! to be processed by the async loop
//!
//! Here's a schematic overview of all interactions that the backup machinery and the rest of the pageserver perform:
@@ -44,8 +44,8 @@
//! query their downloads later if they are accessed.
//!
//! Some time later, during pageserver checkpoints, in-memory data is flushed onto disk along with its metadata.
//! If the storage sync loop was successfully started before, pageserver schedules the new checkpoint file uploads after every checkpoint.
//! The checkpoint uploads are disabled, if no remote storage configuration is provided (no sync loop is started this way either).
//! If the storage sync loop was successfully started before, pageserver schedules the layer files and the updated metadata file for upload, every time a layer is flushed to disk.
//! The uploads are disabled, if no remote storage configuration is provided (no sync loop is started this way either).
//! See [`crate::layered_repository`] for the upload calls and the adjacent logic.
//!
//! Synchronization logic is able to communicate back with updated timeline sync states, [`crate::repository::TimelineSyncStatusUpdate`],
@@ -54,7 +54,7 @@
//! * once after the sync loop startup, to signal pageserver which timelines will be synchronized in the near future
//! * after every loop step, in case a timeline needs to be reloaded or evicted from pageserver's memory
//!
//! When the pageserver terminates, the sync loop finishes a current sync task (if any) and exits.
//! When the pageserver terminates, the sync loop finishes current sync task (if any) and exits.
//!
//! The storage logic considers `image` as a set of local files (layers), fully representing a certain timeline at given moment (identified with `disk_consistent_lsn` from the corresponding `metadata` file).
//! Timeline can change its state, by adding more files on disk and advancing its `disk_consistent_lsn`: this happens after pageserver checkpointing and is followed
@@ -66,13 +66,13 @@
//! when the newer image is downloaded
//!
//! Pageserver maintains similar to the local file structure remotely: all layer files are uploaded with the same names under the same directory structure.
//! Yet instead of keeping the `metadata` file remotely, we wrap it with more data in [`IndexShard`], containing the list of remote files.
//! Yet instead of keeping the `metadata` file remotely, we wrap it with more data in [`IndexPart`], containing the list of remote files.
//! This file gets read to populate the cache, if the remote timeline data is missing from it and gets updated after every successful download.
//! This way, we optimize S3 storage access by not running the `S3 list` command that could be expensive and slow: knowing both [`ZTenantId`] and [`ZTimelineId`],
//! we can always reconstruct the path to the timeline, use this to get the same path on the remote storage and retrieve its shard contents, if needed, same as any layer files.
//!
//! By default, pageserver reads the remote storage index data only for timelines located locally, to synchronize those, if needed.
//! Bulk index data download happens only initially, on pageserer startup. The rest of the remote storage stays unknown to pageserver and loaded on demand only,
//! Bulk index data download happens only initially, on pageserver startup. The rest of the remote storage stays unknown to pageserver and loaded on demand only,
//! when a new timeline is scheduled for the download.
//!
//! NOTES:
@@ -89,13 +89,12 @@
//! Synchronization is done with the queue being emptied via separate thread asynchronously,
//! attempting to fully store pageserver's local data on the remote storage in a custom format, beneficial for storing.
//!
//! A queue is implemented in the [`sync_queue`] module as a pair of sender and receiver channels, to block on zero tasks instead of checking the queue.
//! The pair's shared buffer of a fixed size serves as an implicit queue, holding [`SyncTask`] for local files upload/download operations.
//! A queue is implemented in the [`sync_queue`] module as a VecDeque to hold the tasks, and a condition variable for blocking when the queue is empty.
//!
//! The queue gets emptied by a single thread with a loop that polls the tasks in batches of deduplicated tasks.
//! A task from the batch corresponds to a single timeline, with its files to sync merged together: given that only one sync loop step is active at a time,
//! timeline uploads and downloads can happen concurrently, in no particular order due to the incremental nature of the timeline layers.
//! Deletion happens only after a successful upload only, otherwise the compation output might make the timeline inconsistent until both tasks are fully processed without errors.
//! Deletion happens only after a successful upload only, otherwise the compaction output might make the timeline inconsistent until both tasks are fully processed without errors.
//! Upload and download update the remote data (in-memory index and S3 json index part file) only after every layer is successfully synchronized, while the deletion task
//! does the opposite: it requires the remote data to be updated successfully first; the blob files become invisible to the pageserver this way.
//!
@@ -138,8 +137,6 @@
//! NOTE: No real contents or checksum check happens right now and is a subject to improve later.
//!
//! After the whole timeline is downloaded, [`crate::tenant_mgr::apply_timeline_sync_status_updates`] function is used to update pageserver memory stage for the timeline processed.
//!
//! When pageserver signals shutdown, current sync task gets finished and the loop exists.
mod delete;
mod download;
@@ -153,10 +150,7 @@ use std::{
num::{NonZeroU32, NonZeroUsize},
ops::ControlFlow,
path::{Path, PathBuf},
sync::{
atomic::{AtomicUsize, Ordering},
Arc,
},
sync::{Arc, Condvar, Mutex},
};
use anyhow::{anyhow, bail, Context};
@@ -167,7 +161,6 @@ use remote_storage::{GenericRemoteStorage, RemoteStorage};
use tokio::{
fs,
runtime::Runtime,
sync::mpsc::{self, error::TryRecvError, UnboundedReceiver, UnboundedSender},
time::{Duration, Instant},
};
use tracing::*;
@@ -428,6 +421,14 @@ fn collect_timeline_files(
entry_path.display()
)
})?;
} else if entry_path.extension().and_then(OsStr::to_str) == Some("temp") {
info!("removing temp layer file at {}", entry_path.display());
std::fs::remove_file(&entry_path).with_context(|| {
format!(
"failed to remove temp layer file at {}",
entry_path.display()
)
})?;
} else {
timeline_files.insert(entry_path);
}
@@ -453,97 +454,77 @@ fn collect_timeline_files(
Ok((timeline_id, metadata, timeline_files))
}
/// Wraps mpsc channel bits around into a queue interface.
/// mpsc approach was picked to allow blocking the sync loop if no tasks are present, to avoid meaningless spinning.
/// Global queue of sync tasks.
///
/// 'queue' is protected by a mutex, and 'condvar' is used to wait for tasks to arrive.
struct SyncQueue {
len: AtomicUsize,
max_timelines_per_batch: NonZeroUsize,
sender: UnboundedSender<(ZTenantTimelineId, SyncTask)>,
queue: Mutex<VecDeque<(ZTenantTimelineId, SyncTask)>>,
condvar: Condvar,
}
impl SyncQueue {
fn new(
max_timelines_per_batch: NonZeroUsize,
) -> (Self, UnboundedReceiver<(ZTenantTimelineId, SyncTask)>) {
let (sender, receiver) = mpsc::unbounded_channel();
(
Self {
len: AtomicUsize::new(0),
max_timelines_per_batch,
sender,
},
receiver,
)
fn new(max_timelines_per_batch: NonZeroUsize) -> Self {
Self {
max_timelines_per_batch,
queue: Mutex::new(VecDeque::new()),
condvar: Condvar::new(),
}
}
/// Queue a new task
fn push(&self, sync_id: ZTenantTimelineId, new_task: SyncTask) {
match self.sender.send((sync_id, new_task)) {
Ok(()) => {
self.len.fetch_add(1, Ordering::Relaxed);
}
Err(e) => {
error!("failed to push sync task to queue: {e}");
}
let mut q = self.queue.lock().unwrap();
q.push_back((sync_id, new_task));
if q.len() <= 1 {
self.condvar.notify_one();
}
}
/// Fetches a task batch, getting every existing entry from the queue, grouping by timelines and merging the tasks for every timeline.
/// A timeline has to care to not to delete cetain layers from the remote storage before the corresponding uploads happen.
/// Otherwise, due to "immutable" nature of the layers, the order of their deletion/uploading/downloading does not matter.
/// A timeline has to take care not to delete certain layers from the remote storage before the corresponding uploads happen.
/// Other than that, due to "immutable" nature of the layers, the order of their deletion/uploading/downloading does not matter.
/// Hence, we merge the layers together into single task per timeline and run those concurrently (with the deletion happening only after successful uploading).
async fn next_task_batch(
&self,
// The queue is based on two ends of a channel and has to be accessible statically without blocking for submissions from the sync code.
// Its receiver needs &mut, so we cannot place it in the same container with the other end and get both static and non-blocking access.
// Hence toss this around to use it from the sync loop directly as &mut.
sync_queue_receiver: &mut UnboundedReceiver<(ZTenantTimelineId, SyncTask)>,
) -> HashMap<ZTenantTimelineId, SyncTaskBatch> {
// request the first task in blocking fashion to do less meaningless work
let (first_sync_id, first_task) = if let Some(first_task) = sync_queue_receiver.recv().await
{
self.len.fetch_sub(1, Ordering::Relaxed);
first_task
} else {
info!("Queue sender part was dropped, aborting");
return HashMap::new();
};
fn next_task_batch(&self) -> (HashMap<ZTenantTimelineId, SyncTaskBatch>, usize) {
// Wait for the first task in blocking fashion
let mut q = self.queue.lock().unwrap();
while q.is_empty() {
q = self
.condvar
.wait_timeout(q, Duration::from_millis(1000))
.unwrap()
.0;
if thread_mgr::is_shutdown_requested() {
return (HashMap::new(), q.len());
}
}
let (first_sync_id, first_task) = q.pop_front().unwrap();
let mut timelines_left_to_batch = self.max_timelines_per_batch.get() - 1;
let mut tasks_to_process = self.len();
let tasks_to_process = q.len();
let mut batches = HashMap::with_capacity(tasks_to_process);
batches.insert(first_sync_id, SyncTaskBatch::new(first_task));
let mut tasks_to_reenqueue = Vec::with_capacity(tasks_to_process);
// Pull the queue channel until we get all tasks that were there at the beginning of the batch construction.
// Greedily grab as many other tasks that we can.
// Yet do not put all timelines in the batch, but only the first ones that fit the timeline limit.
// Still merge the rest of the pulled tasks and reenqueue those for later.
while tasks_to_process > 0 {
match sync_queue_receiver.try_recv() {
Ok((sync_id, new_task)) => {
self.len.fetch_sub(1, Ordering::Relaxed);
tasks_to_process -= 1;
match batches.entry(sync_id) {
hash_map::Entry::Occupied(mut v) => v.get_mut().add(new_task),
hash_map::Entry::Vacant(v) => {
timelines_left_to_batch = timelines_left_to_batch.saturating_sub(1);
if timelines_left_to_batch == 0 {
tasks_to_reenqueue.push((sync_id, new_task));
} else {
v.insert(SyncTaskBatch::new(new_task));
}
}
// Re-enqueue the tasks that don't fit in this batch.
while let Some((sync_id, new_task)) = q.pop_front() {
match batches.entry(sync_id) {
hash_map::Entry::Occupied(mut v) => v.get_mut().add(new_task),
hash_map::Entry::Vacant(v) => {
timelines_left_to_batch = timelines_left_to_batch.saturating_sub(1);
if timelines_left_to_batch == 0 {
tasks_to_reenqueue.push((sync_id, new_task));
} else {
v.insert(SyncTaskBatch::new(new_task));
}
}
Err(TryRecvError::Disconnected) => {
debug!("Sender disconnected, batch collection aborted");
break;
}
Err(TryRecvError::Empty) => {
debug!("No more data in the sync queue, task batch is not full");
break;
}
}
}
@@ -553,14 +534,15 @@ impl SyncQueue {
tasks_to_reenqueue.len()
);
for (id, task) in tasks_to_reenqueue {
self.push(id, task);
q.push_back((id, task));
}
batches
(batches, q.len())
}
#[cfg(test)]
fn len(&self) -> usize {
self.len.load(Ordering::Relaxed)
self.queue.lock().unwrap().len()
}
}
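The essence of the new queue, boiled down to a standalone Mutex + Condvar sketch (generic item type, simplified shutdown check; not the actual SyncQueue):

// Illustrative sketch only; the real queue also batches and deduplicates tasks.
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};
use std::time::Duration;

struct BlockingQueue<T> {
    queue: Mutex<VecDeque<T>>,
    condvar: Condvar,
}

impl<T> BlockingQueue<T> {
    fn new() -> Self {
        Self { queue: Mutex::new(VecDeque::new()), condvar: Condvar::new() }
    }

    fn push(&self, item: T) {
        let mut q = self.queue.lock().unwrap();
        q.push_back(item);
        // Wake the consumer if it is parked on an empty queue.
        self.condvar.notify_one();
    }

    /// Block until an item arrives, waking up once a second so the caller can
    /// check for shutdown, mirroring the loop in next_task_batch() above.
    fn pop_blocking(&self, shutdown_requested: impl Fn() -> bool) -> Option<T> {
        let mut q = self.queue.lock().unwrap();
        loop {
            if let Some(item) = q.pop_front() {
                return Some(item);
            }
            if shutdown_requested() {
                return None;
            }
            q = self.condvar.wait_timeout(q, Duration::from_secs(1)).unwrap().0;
        }
    }
}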
@@ -823,7 +805,7 @@ pub fn schedule_layer_download(tenant_id: ZTenantId, timeline_id: ZTimelineId) {
debug!("Download task for tenant {tenant_id}, timeline {timeline_id} sent")
}
/// Uses a remote storage given to start the storage sync loop.
/// Launch a thread to perform remote storage sync tasks.
/// See module docs for loop step description.
pub(super) fn spawn_storage_sync_thread<P, S>(
conf: &'static PageServerConf,
@@ -836,7 +818,7 @@ where
P: Debug + Send + Sync + 'static,
S: RemoteStorage<RemoteObjectId = P> + Send + Sync + 'static,
{
let (sync_queue, sync_queue_receiver) = SyncQueue::new(max_concurrent_timelines_sync);
let sync_queue = SyncQueue::new(max_concurrent_timelines_sync);
SYNC_QUEUE
.set(sync_queue)
.map_err(|_queue| anyhow!("Could not initialize sync queue"))?;
@@ -864,7 +846,7 @@ where
local_timeline_files,
);
let loop_index = remote_index.clone();
let remote_index_clone = remote_index.clone();
thread_mgr::spawn(
ThreadKind::StorageSync,
None,
@@ -875,12 +857,7 @@ where
storage_sync_loop(
runtime,
conf,
(
Arc::new(storage),
loop_index,
sync_queue,
sync_queue_receiver,
),
(Arc::new(storage), remote_index_clone, sync_queue),
max_sync_errors,
);
Ok(())
@@ -896,12 +873,7 @@ where
fn storage_sync_loop<P, S>(
runtime: Runtime,
conf: &'static PageServerConf,
(storage, index, sync_queue, mut sync_queue_receiver): (
Arc<S>,
RemoteIndex,
&SyncQueue,
UnboundedReceiver<(ZTenantTimelineId, SyncTask)>,
),
(storage, index, sync_queue): (Arc<S>, RemoteIndex, &SyncQueue),
max_sync_errors: NonZeroU32,
) where
P: Debug + Send + Sync + 'static,
@@ -909,16 +881,35 @@ fn storage_sync_loop<P, S>(
{
info!("Starting remote storage sync loop");
loop {
let loop_index = index.clone();
let loop_storage = Arc::clone(&storage);
let (batched_tasks, remaining_queue_length) = sync_queue.next_task_batch();
if thread_mgr::is_shutdown_requested() {
info!("Shutdown requested, stopping");
break;
}
REMAINING_SYNC_ITEMS.set(remaining_queue_length as i64);
if remaining_queue_length > 0 || !batched_tasks.is_empty() {
info!("Processing tasks for {} timelines in batch, more tasks left to process: {remaining_queue_length}", batched_tasks.len());
} else {
debug!("No tasks to process");
continue;
}
// Concurrently perform all the tasks in the batch
let loop_step = runtime.block_on(async {
tokio::select! {
step = loop_step(
step = process_batches(
conf,
(loop_storage, loop_index, sync_queue, &mut sync_queue_receiver),
max_sync_errors,
loop_storage,
&index,
batched_tasks,
sync_queue,
)
.instrument(info_span!("storage_sync_loop_step")) => step,
.instrument(info_span!("storage_sync_loop_step")) => ControlFlow::Continue(step),
_ = thread_mgr::shutdown_watcher() => ControlFlow::Break(()),
}
});
@@ -944,31 +935,18 @@ fn storage_sync_loop<P, S>(
}
}
async fn loop_step<P, S>(
async fn process_batches<P, S>(
conf: &'static PageServerConf,
(storage, index, sync_queue, sync_queue_receiver): (
Arc<S>,
RemoteIndex,
&SyncQueue,
&mut UnboundedReceiver<(ZTenantTimelineId, SyncTask)>,
),
max_sync_errors: NonZeroU32,
) -> ControlFlow<(), HashMap<ZTenantId, HashMap<ZTimelineId, TimelineSyncStatusUpdate>>>
storage: Arc<S>,
index: &RemoteIndex,
batched_tasks: HashMap<ZTenantTimelineId, SyncTaskBatch>,
sync_queue: &SyncQueue,
) -> HashMap<ZTenantId, HashMap<ZTimelineId, TimelineSyncStatusUpdate>>
where
P: Debug + Send + Sync + 'static,
S: RemoteStorage<RemoteObjectId = P> + Send + Sync + 'static,
{
let batched_tasks = sync_queue.next_task_batch(sync_queue_receiver).await;
let remaining_queue_length = sync_queue.len();
REMAINING_SYNC_ITEMS.set(remaining_queue_length as i64);
if remaining_queue_length > 0 || !batched_tasks.is_empty() {
info!("Processing tasks for {} timelines in batch, more tasks left to process: {remaining_queue_length}", batched_tasks.len());
} else {
debug!("No tasks to process");
return ControlFlow::Continue(HashMap::new());
}
let mut sync_results = batched_tasks
.into_iter()
.map(|(sync_id, batch)| {
@@ -993,6 +971,7 @@ where
ZTenantId,
HashMap<ZTimelineId, TimelineSyncStatusUpdate>,
> = HashMap::new();
while let Some((sync_id, state_update)) = sync_results.next().await {
debug!("Finished storage sync task for sync id {sync_id}");
if let Some(state_update) = state_update {
@@ -1003,7 +982,7 @@ where
}
}
ControlFlow::Continue(new_timeline_states)
new_timeline_states
}
async fn process_sync_task_batch<P, S>(
@@ -1376,7 +1355,6 @@ where
P: Debug + Send + Sync + 'static,
S: RemoteStorage<RemoteObjectId = P> + Send + Sync + 'static,
{
info!("Updating remote index for the timeline");
let updated_remote_timeline = {
let mut index_accessor = index.write().await;
@@ -1443,7 +1421,7 @@ where
IndexPart::from_remote_timeline(&timeline_path, updated_remote_timeline)
.context("Failed to create an index part from the updated remote timeline")?;
info!("Uploading remote data for the timeline");
info!("Uploading remote index for the timeline");
upload_index_part(conf, storage, sync_id, new_index_part)
.await
.context("Failed to upload new index part")
@@ -1685,7 +1663,7 @@ mod tests {
#[tokio::test]
async fn separate_task_ids_batch() {
let (sync_queue, mut sync_queue_receiver) = SyncQueue::new(NonZeroUsize::new(100).unwrap());
let sync_queue = SyncQueue::new(NonZeroUsize::new(100).unwrap());
assert_eq!(sync_queue.len(), 0);
let sync_id_2 = ZTenantTimelineId {
@@ -1720,7 +1698,7 @@ mod tests {
let submitted_tasks_count = sync_queue.len();
assert_eq!(submitted_tasks_count, 3);
let mut batch = sync_queue.next_task_batch(&mut sync_queue_receiver).await;
let (mut batch, _) = sync_queue.next_task_batch();
assert_eq!(
batch.len(),
submitted_tasks_count,
@@ -1746,7 +1724,7 @@ mod tests {
#[tokio::test]
async fn same_task_id_separate_tasks_batch() {
let (sync_queue, mut sync_queue_receiver) = SyncQueue::new(NonZeroUsize::new(100).unwrap());
let sync_queue = SyncQueue::new(NonZeroUsize::new(100).unwrap());
assert_eq!(sync_queue.len(), 0);
let download = LayersDownload {
@@ -1769,7 +1747,7 @@ mod tests {
let submitted_tasks_count = sync_queue.len();
assert_eq!(submitted_tasks_count, 3);
let mut batch = sync_queue.next_task_batch(&mut sync_queue_receiver).await;
let (mut batch, _) = sync_queue.next_task_batch();
assert_eq!(
batch.len(),
1,
@@ -1801,7 +1779,7 @@ mod tests {
#[tokio::test]
async fn same_task_id_same_tasks_batch() {
let (sync_queue, mut sync_queue_receiver) = SyncQueue::new(NonZeroUsize::new(1).unwrap());
let sync_queue = SyncQueue::new(NonZeroUsize::new(1).unwrap());
let download_1 = LayersDownload {
layers_to_skip: HashSet::from([PathBuf::from("sk1")]),
};
@@ -1823,11 +1801,11 @@ mod tests {
sync_queue.push(TEST_SYNC_ID, SyncTask::download(download_1.clone()));
sync_queue.push(TEST_SYNC_ID, SyncTask::download(download_2.clone()));
sync_queue.push(sync_id_2, SyncTask::download(download_3.clone()));
sync_queue.push(sync_id_2, SyncTask::download(download_3));
sync_queue.push(TEST_SYNC_ID, SyncTask::download(download_4.clone()));
assert_eq!(sync_queue.len(), 4);
let mut smallest_batch = sync_queue.next_task_batch(&mut sync_queue_receiver).await;
let (mut smallest_batch, _) = sync_queue.next_task_batch();
assert_eq!(
smallest_batch.len(),
1,

View File

@@ -119,7 +119,7 @@ mod tests {
#[tokio::test]
async fn delete_timeline_negative() -> anyhow::Result<()> {
let harness = RepoHarness::create("delete_timeline_negative")?;
let (sync_queue, _) = SyncQueue::new(NonZeroUsize::new(100).unwrap());
let sync_queue = SyncQueue::new(NonZeroUsize::new(100).unwrap());
let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID);
let storage = LocalFs::new(
tempdir()?.path().to_path_buf(),
@@ -152,7 +152,7 @@ mod tests {
#[tokio::test]
async fn delete_timeline() -> anyhow::Result<()> {
let harness = RepoHarness::create("delete_timeline")?;
let (sync_queue, _) = SyncQueue::new(NonZeroUsize::new(100).unwrap());
let sync_queue = SyncQueue::new(NonZeroUsize::new(100).unwrap());
let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID);
let layer_files = ["a", "b", "c", "d"];

View File

@@ -286,7 +286,7 @@ mod tests {
#[tokio::test]
async fn download_timeline() -> anyhow::Result<()> {
let harness = RepoHarness::create("download_timeline")?;
let (sync_queue, _) = SyncQueue::new(NonZeroUsize::new(100).unwrap());
let sync_queue = SyncQueue::new(NonZeroUsize::new(100).unwrap());
let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID);
let layer_files = ["a", "b", "layer_to_skip", "layer_to_keep_locally"];
@@ -385,7 +385,7 @@ mod tests {
#[tokio::test]
async fn download_timeline_negatives() -> anyhow::Result<()> {
let harness = RepoHarness::create("download_timeline_negatives")?;
let (sync_queue, _) = SyncQueue::new(NonZeroUsize::new(100).unwrap());
let sync_queue = SyncQueue::new(NonZeroUsize::new(100).unwrap());
let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID);
let storage = LocalFs::new(tempdir()?.path().to_owned(), harness.conf.workdir.clone())?;

View File

@@ -240,7 +240,7 @@ mod tests {
#[tokio::test]
async fn regular_layer_upload() -> anyhow::Result<()> {
let harness = RepoHarness::create("regular_layer_upload")?;
let (sync_queue, _) = SyncQueue::new(NonZeroUsize::new(100).unwrap());
let sync_queue = SyncQueue::new(NonZeroUsize::new(100).unwrap());
let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID);
let layer_files = ["a", "b"];
@@ -327,7 +327,7 @@ mod tests {
#[tokio::test]
async fn layer_upload_after_local_fs_update() -> anyhow::Result<()> {
let harness = RepoHarness::create("layer_upload_after_local_fs_update")?;
let (sync_queue, _) = SyncQueue::new(NonZeroUsize::new(100).unwrap());
let sync_queue = SyncQueue::new(NonZeroUsize::new(100).unwrap());
let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID);
let layer_files = ["a1", "b1"];

View File

@@ -281,6 +281,7 @@ pub fn activate_tenant(tenant_id: ZTenantId) -> anyhow::Result<()> {
false,
move || crate::tenant_threads::gc_loop(tenant_id),
)
.map(|_thread_id| ()) // update the `Result::Ok` type to match the outer function's return signature
.with_context(|| format!("Failed to launch GC thread for tenant {tenant_id}"));
if let Err(e) = &gc_spawn_result {

View File

@@ -139,7 +139,7 @@ pub fn spawn<F>(
name: &str,
shutdown_process_on_error: bool,
f: F,
) -> std::io::Result<()>
) -> std::io::Result<u64>
where
F: FnOnce() -> anyhow::Result<()> + Send + 'static,
{
@@ -193,7 +193,7 @@ where
drop(jh_guard);
// The thread is now running. Nothing more to do here
Ok(())
Ok(thread_id)
}
/// This wrapper function runs in a newly-spawned thread. It initializes the

View File

@@ -45,6 +45,8 @@ pub struct LocalTimelineInfo {
#[serde_as(as = "Option<DisplayFromStr>")]
pub prev_record_lsn: Option<Lsn>,
#[serde_as(as = "DisplayFromStr")]
pub latest_gc_cutoff_lsn: Lsn,
#[serde_as(as = "DisplayFromStr")]
pub disk_consistent_lsn: Lsn,
pub current_logical_size: Option<usize>, // is None when timeline is Unloaded
pub current_logical_size_non_incremental: Option<usize>,
@@ -68,6 +70,7 @@ impl LocalTimelineInfo {
disk_consistent_lsn: datadir_tline.tline.get_disk_consistent_lsn(),
last_record_lsn,
prev_record_lsn: Some(datadir_tline.tline.get_prev_record_lsn()),
latest_gc_cutoff_lsn: *datadir_tline.tline.get_latest_gc_cutoff_lsn(),
timeline_state: LocalTimelineState::Loaded,
current_logical_size: Some(datadir_tline.get_current_logical_size()),
current_logical_size_non_incremental: if include_non_incremental_logical_size {
@@ -91,6 +94,7 @@ impl LocalTimelineInfo {
disk_consistent_lsn: metadata.disk_consistent_lsn(),
last_record_lsn: metadata.disk_consistent_lsn(),
prev_record_lsn: metadata.prev_record_lsn(),
latest_gc_cutoff_lsn: metadata.latest_gc_cutoff_lsn(),
timeline_state: LocalTimelineState::Unloaded,
current_logical_size: None,
current_logical_size_non_incremental: None,

View File

@@ -18,6 +18,8 @@ use lazy_static::lazy_static;
use postgres_ffi::waldecoder::*;
use postgres_protocol::message::backend::ReplicationMessage;
use postgres_types::PgLsn;
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use std::cell::Cell;
use std::collections::HashMap;
use std::str::FromStr;
@@ -35,11 +37,19 @@ use utils::{
zid::{ZTenantId, ZTenantTimelineId, ZTimelineId},
};
//
// We keep one WAL Receiver active per timeline.
//
struct WalReceiverEntry {
///
/// A WAL receiver's data stored inside the global `WAL_RECEIVERS`.
/// We keep one WAL receiver active per timeline.
///
#[serde_as]
#[derive(Debug, Serialize, Deserialize, Clone)]
pub struct WalReceiverEntry {
thread_id: u64,
wal_producer_connstr: String,
#[serde_as(as = "Option<DisplayFromStr>")]
last_received_msg_lsn: Option<Lsn>,
/// the timestamp (in microseconds) of the last received message
last_received_msg_ts: Option<u128>,
}
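A standalone illustration (hypothetical struct, plain u64 instead of Lsn) of what the `serde_as(as = "Option<DisplayFromStr>")` attribute does to the JSON returned by the new endpoint:

// Illustrative sketch only; the real entry uses the pageserver's Lsn type.
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};

#[serde_as]
#[derive(Serialize, Deserialize)]
struct Entry {
    thread_id: u64,
    #[serde_as(as = "Option<DisplayFromStr>")]
    last_received_msg_lsn: Option<u64>,
}

fn main() {
    let e = Entry { thread_id: 7, last_received_msg_lsn: Some(22587664) };
    // The LSN field goes through its Display impl, i.e. it becomes a JSON string:
    // {"thread_id":7,"last_received_msg_lsn":"22587664"}
    println!("{}", serde_json::to_string(&e).unwrap());
}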
lazy_static! {
@@ -74,7 +84,7 @@ pub fn launch_wal_receiver(
receiver.wal_producer_connstr = wal_producer_connstr.into();
}
None => {
thread_mgr::spawn(
let thread_id = thread_mgr::spawn(
ThreadKind::WalReceiver,
Some(tenantid),
Some(timelineid),
@@ -88,7 +98,10 @@ pub fn launch_wal_receiver(
)?;
let receiver = WalReceiverEntry {
thread_id,
wal_producer_connstr: wal_producer_connstr.into(),
last_received_msg_lsn: None,
last_received_msg_ts: None,
};
receivers.insert((tenantid, timelineid), receiver);
@@ -99,15 +112,13 @@ pub fn launch_wal_receiver(
Ok(())
}
// Look up current WAL producer connection string in the hash table
fn get_wal_producer_connstr(tenantid: ZTenantId, timelineid: ZTimelineId) -> String {
/// Look up a WAL receiver's data in the global `WAL_RECEIVERS`
pub fn get_wal_receiver_entry(
tenant_id: ZTenantId,
timeline_id: ZTimelineId,
) -> Option<WalReceiverEntry> {
let receivers = WAL_RECEIVERS.lock().unwrap();
receivers
.get(&(tenantid, timelineid))
.unwrap()
.wal_producer_connstr
.clone()
receivers.get(&(tenant_id, timeline_id)).cloned()
}
//
@@ -118,7 +129,18 @@ fn thread_main(conf: &'static PageServerConf, tenant_id: ZTenantId, timeline_id:
info!("WAL receiver thread started");
// Look up the current WAL producer address
let wal_producer_connstr = get_wal_producer_connstr(tenant_id, timeline_id);
let wal_producer_connstr = {
match get_wal_receiver_entry(tenant_id, timeline_id) {
Some(e) => e.wal_producer_connstr,
None => {
info!(
"Unable to create the WAL receiver thread: no WAL receiver entry found for tenant {} and timeline {}",
tenant_id, timeline_id
);
return;
}
}
};
// Make a connection to the WAL safekeeper, or directly to the primary PostgreSQL server,
// and start streaming WAL from it.
@@ -318,6 +340,28 @@ fn walreceiver_main(
let apply_lsn = u64::from(timeline_remote_consistent_lsn);
let ts = SystemTime::now();
// Update the current WAL receiver's data stored inside the global hash table `WAL_RECEIVERS`
{
let mut receivers = WAL_RECEIVERS.lock().unwrap();
let entry = match receivers.get_mut(&(tenant_id, timeline_id)) {
Some(e) => e,
None => {
anyhow::bail!(
"no WAL receiver entry found for tenant {} and timeline {}",
tenant_id,
timeline_id
);
}
};
entry.last_received_msg_lsn = Some(last_lsn);
entry.last_received_msg_ts = Some(
ts.duration_since(SystemTime::UNIX_EPOCH)
.expect("Received message time should be before UNIX EPOCH!")
.as_micros(),
);
}
// Send zenith feedback message.
// Regular standby_status_update fields are put into this message.
let zenith_status_update = ZenithFeedback {

poetry.lock generated
View File

@@ -822,7 +822,7 @@ python-versions = "*"
[[package]]
name = "moto"
version = "3.1.7"
version = "3.1.9"
description = "A library that allows your python tests to easily mock out the boto library"
category = "main"
optional = false
@@ -868,6 +868,7 @@ ds = ["sshpubkeys (>=3.1.0)"]
dynamodb = ["docker (>=2.5.1)"]
dynamodb2 = ["docker (>=2.5.1)"]
dynamodbstreams = ["docker (>=2.5.1)"]
ebs = ["sshpubkeys (>=3.1.0)"]
ec2 = ["sshpubkeys (>=3.1.0)"]
efs = ["sshpubkeys (>=3.1.0)"]
glue = ["pyparsing (>=3.0.0)"]
@@ -953,6 +954,17 @@ importlib-metadata = {version = ">=0.12", markers = "python_version < \"3.8\""}
dev = ["pre-commit", "tox"]
testing = ["pytest", "pytest-benchmark"]
[[package]]
name = "prometheus-client"
version = "0.14.1"
description = "Python client for the Prometheus monitoring system."
category = "main"
optional = false
python-versions = ">=3.6"
[package.extras]
twisted = ["twisted"]
[[package]]
name = "psycopg2-binary"
version = "2.9.3"
@@ -1003,7 +1015,7 @@ python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*"
[[package]]
name = "pyjwt"
version = "2.3.0"
version = "2.4.0"
description = "JSON Web Token implementation in Python"
category = "main"
optional = false
@@ -1082,6 +1094,17 @@ python-versions = "*"
[package.dependencies]
pytest = ">=3.2.5"
[[package]]
name = "pytest-timeout"
version = "2.1.0"
description = "pytest plugin to abort hanging tests"
category = "main"
optional = false
python-versions = ">=3.6"
[package.dependencies]
pytest = ">=5.0.0"
[[package]]
name = "pytest-xdist"
version = "2.5.0"
@@ -1375,7 +1398,7 @@ testing = ["pytest (>=6)", "pytest-checkdocs (>=2.4)", "pytest-flake8", "pytest-
[metadata]
lock-version = "1.1"
python-versions = "^3.7"
content-hash = "dc63b6e02d0ceccdc4b5616e9362c149a27fdcc6c54fda63a3b115a5b980c42e"
content-hash = "4ee85b435461dec70b406bf7170302fe54e9e247bdf628a9cb6b5fb9eb9afd82"
[metadata.files]
aiopg = [
@@ -1693,8 +1716,8 @@ mccabe = [
{file = "mccabe-0.6.1.tar.gz", hash = "sha256:dd8d182285a0fe56bace7f45b5e7d1a6ebcbf524e8f3bd87eb0f125271b8831f"},
]
moto = [
{file = "moto-3.1.7-py3-none-any.whl", hash = "sha256:4ab6fb8dd150343e115d75e3dbdb5a8f850fc7236790819d7cef438c11ee6e89"},
{file = "moto-3.1.7.tar.gz", hash = "sha256:20607a0fd0cf6530e05ffb623ca84d3f45d50bddbcec2a33705a0cf471e71289"},
{file = "moto-3.1.9-py3-none-any.whl", hash = "sha256:8928ec168e5fd88b1127413b2fa570a80d45f25182cdad793edd208d07825269"},
{file = "moto-3.1.9.tar.gz", hash = "sha256:ba683e70950b6579189bc12d74c1477aa036c090c6ad8b151a22f5896c005113"},
]
mypy = [
{file = "mypy-0.910-cp35-cp35m-macosx_10_9_x86_64.whl", hash = "sha256:a155d80ea6cee511a3694b108c4494a39f42de11ee4e61e72bc424c490e46457"},
@@ -1741,6 +1764,10 @@ pluggy = [
{file = "pluggy-1.0.0-py2.py3-none-any.whl", hash = "sha256:74134bbf457f031a36d68416e1509f34bd5ccc019f0bcc952c7b909d06b37bd3"},
{file = "pluggy-1.0.0.tar.gz", hash = "sha256:4224373bacce55f955a878bf9cfa763c1e360858e330072059e10bad68531159"},
]
prometheus-client = [
{file = "prometheus_client-0.14.1-py3-none-any.whl", hash = "sha256:522fded625282822a89e2773452f42df14b5a8e84a86433e3f8a189c1d54dc01"},
{file = "prometheus_client-0.14.1.tar.gz", hash = "sha256:5459c427624961076277fdc6dc50540e2bacb98eebde99886e59ec55ed92093a"},
]
psycopg2-binary = [
{file = "psycopg2-binary-2.9.3.tar.gz", hash = "sha256:761df5313dc15da1502b21453642d7599d26be88bff659382f8f9747c7ebea4e"},
{file = "psycopg2_binary-2.9.3-cp310-cp310-macosx_10_14_x86_64.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl", hash = "sha256:539b28661b71da7c0e428692438efbcd048ca21ea81af618d845e06ebfd29478"},
@@ -1831,8 +1858,8 @@ pyflakes = [
{file = "pyflakes-2.3.1.tar.gz", hash = "sha256:f5bc8ecabc05bb9d291eb5203d6810b49040f6ff446a756326104746cc00c1db"},
]
pyjwt = [
{file = "PyJWT-2.3.0-py3-none-any.whl", hash = "sha256:e0c4bb8d9f0af0c7f5b1ec4c5036309617d03d56932877f2f7a0beeb5318322f"},
{file = "PyJWT-2.3.0.tar.gz", hash = "sha256:b888b4d56f06f6dcd777210c334e69c737be74755d3e5e9ee3fe67dc18a0ee41"},
{file = "PyJWT-2.4.0-py3-none-any.whl", hash = "sha256:72d1d253f32dbd4f5c88eaf1fdc62f3a19f676ccbadb9dbc5d07e951b2b26daf"},
{file = "PyJWT-2.4.0.tar.gz", hash = "sha256:d42908208c699b3b973cbeb01a969ba6a96c821eefb1c5bfe4c390c01d67abba"},
]
pyparsing = [
{file = "pyparsing-3.0.6-py3-none-any.whl", hash = "sha256:04ff808a5b90911829c55c4e26f75fa5ca8a2f5f36aa3a51f68e27033341d3e4"},
@@ -1873,6 +1900,10 @@ pytest-lazy-fixture = [
{file = "pytest-lazy-fixture-0.6.3.tar.gz", hash = "sha256:0e7d0c7f74ba33e6e80905e9bfd81f9d15ef9a790de97993e34213deb5ad10ac"},
{file = "pytest_lazy_fixture-0.6.3-py3-none-any.whl", hash = "sha256:e0b379f38299ff27a653f03eaa69b08a6fd4484e46fd1c9907d984b9f9daeda6"},
]
pytest-timeout = [
{file = "pytest-timeout-2.1.0.tar.gz", hash = "sha256:c07ca07404c612f8abbe22294b23c368e2e5104b521c1790195561f37e1ac3d9"},
{file = "pytest_timeout-2.1.0-py3-none-any.whl", hash = "sha256:f6f50101443ce70ad325ceb4473c4255e9d74e3c7cd0ef827309dfa4c0d975c6"},
]
pytest-xdist = [
{file = "pytest-xdist-2.5.0.tar.gz", hash = "sha256:4580deca3ff04ddb2ac53eba39d76cb5dd5edeac050cb6fbc768b0dd712b4edf"},
{file = "pytest_xdist-2.5.0-py3-none-any.whl", hash = "sha256:6fe5c74fec98906deb8f2d2b616b5c782022744978e7bd4695d39c8f42d0ce65"},

View File

@@ -38,7 +38,6 @@ async fn flatten_err(
#[tokio::main]
async fn main() -> anyhow::Result<()> {
metrics::set_common_metrics_prefix("zenith_proxy");
let arg_matches = App::new("Neon proxy/router")
.version(GIT_VERSION)
.arg(

View File

@@ -5,7 +5,7 @@ use crate::stream::{MetricsStream, PqStream, Stream};
use anyhow::{bail, Context};
use futures::TryFutureExt;
use lazy_static::lazy_static;
use metrics::{new_common_metric_name, register_int_counter, IntCounter};
use metrics::{register_int_counter, IntCounter};
use std::sync::Arc;
use tokio::io::{AsyncRead, AsyncWrite};
use utils::pq_proto::{BeMessage as Be, *};
@@ -15,17 +15,17 @@ const ERR_PROTO_VIOLATION: &str = "protocol violation";
lazy_static! {
static ref NUM_CONNECTIONS_ACCEPTED_COUNTER: IntCounter = register_int_counter!(
new_common_metric_name("num_connections_accepted"),
"proxy_accepted_connections_total",
"Number of TCP client connections accepted."
)
.unwrap();
static ref NUM_CONNECTIONS_CLOSED_COUNTER: IntCounter = register_int_counter!(
new_common_metric_name("num_connections_closed"),
"proxy_closed_connections_total",
"Number of TCP client connections closed."
)
.unwrap();
static ref NUM_BYTES_PROXIED_COUNTER: IntCounter = register_int_counter!(
new_common_metric_name("num_bytes_proxied"),
"proxy_io_bytes_total",
"Number of bytes sent/received between any client and backend."
)
.unwrap();

View File

@@ -5,7 +5,7 @@ description = ""
authors = []
[tool.poetry.dependencies]
python = "^3.7"
python = "^3.9"
pytest = "^6.2.5"
psycopg2-binary = "^2.9.1"
typing-extensions = "^3.10.0"
@@ -23,6 +23,8 @@ boto3-stubs = "^1.20.40"
moto = {version = "^3.0.0", extras = ["server"]}
backoff = "^1.11.1"
pytest-lazy-fixture = "^0.6.3"
prometheus-client = "^0.14.1"
pytest-timeout = "^2.1.0"
[tool.poetry.dev-dependencies]
yapf = "==0.31.0"

View File

@@ -9,3 +9,4 @@ minversion = 6.0
log_format = %(asctime)s.%(msecs)-3d %(levelname)s [%(filename)s:%(lineno)d] %(message)s
log_date_format = %Y-%m-%d %H:%M:%S
log_cli = true
timeout = 300

View File

@@ -31,8 +31,7 @@ const LOCK_FILE_NAME: &str = "safekeeper.lock";
const ID_FILE_NAME: &str = "safekeeper.id";
project_git_version!(GIT_VERSION);
fn main() -> Result<()> {
metrics::set_common_metrics_prefix("safekeeper");
fn main() -> anyhow::Result<()> {
let arg_matches = App::new("Zenith safekeeper")
.about("Store WAL stream to local file system and push it to WAL receivers")
.version(GIT_VERSION)
@@ -177,7 +176,7 @@ fn main() -> Result<()> {
if let Some(addr) = arg_matches.value_of("broker-endpoints") {
let collected_ep: Result<Vec<Url>, ParseError> = addr.split(',').map(Url::parse).collect();
conf.broker_endpoints = Some(collected_ep?);
conf.broker_endpoints = collected_ep.context("Failed to parse broker endpoint urls")?;
}
if let Some(prefix) = arg_matches.value_of("broker-etcd-prefix") {
conf.broker_etcd_prefix = prefix.to_string();
@@ -309,7 +308,7 @@ fn start_safekeeper(mut conf: SafeKeeperConf, given_id: Option<ZNodeId>, init: b
.unwrap();
threads.push(callmemaybe_thread);
if conf.broker_endpoints.is_some() {
if !conf.broker_endpoints.is_empty() {
let conf_ = conf.clone();
threads.push(
thread::Builder::new()
@@ -318,6 +317,8 @@ fn start_safekeeper(mut conf: SafeKeeperConf, given_id: Option<ZNodeId>, init: b
broker::thread_main(conf_);
})?,
);
} else {
warn!("No broker endpoints providing, starting without node sync")
}
let conf_ = conf.clone();

View File

@@ -34,19 +34,19 @@ pub fn thread_main(conf: SafeKeeperConf) {
/// Key to per timeline per safekeeper data.
fn timeline_safekeeper_path(
broker_prefix: String,
broker_etcd_prefix: String,
zttid: ZTenantTimelineId,
sk_id: ZNodeId,
) -> String {
format!(
"{}/{sk_id}",
SkTimelineSubscriptionKind::timeline(broker_prefix, zttid).watch_key()
SkTimelineSubscriptionKind::timeline(broker_etcd_prefix, zttid).watch_key()
)
}
/// Push once in a while data about all active timelines to the broker.
async fn push_loop(conf: SafeKeeperConf) -> anyhow::Result<()> {
let mut client = Client::connect(&conf.broker_endpoints.as_ref().unwrap(), None).await?;
let mut client = Client::connect(&conf.broker_endpoints, None).await?;
// Get and maintain lease to automatically delete obsolete data
let lease = client.lease_grant(LEASE_TTL_SEC, None).await?;
@@ -91,7 +91,7 @@ async fn push_loop(conf: SafeKeeperConf) -> anyhow::Result<()> {
/// Subscribe and fetch all the interesting data from the broker.
async fn pull_loop(conf: SafeKeeperConf) -> Result<()> {
let mut client = Client::connect(&conf.broker_endpoints.as_ref().unwrap(), None).await?;
let mut client = Client::connect(&conf.broker_endpoints, None).await?;
let mut subscription = etcd_broker::subscribe_to_safekeeper_timeline_updates(
&mut client,
@@ -99,7 +99,6 @@ async fn pull_loop(conf: SafeKeeperConf) -> Result<()> {
)
.await
.context("failed to subscribe for safekeeper info")?;
loop {
match subscription.fetch_data().await {
Some(new_info) => {

View File

@@ -27,7 +27,6 @@ pub mod defaults {
pub const DEFAULT_PG_LISTEN_PORT: u16 = 5454;
pub const DEFAULT_PG_LISTEN_ADDR: &str = formatcp!("127.0.0.1:{DEFAULT_PG_LISTEN_PORT}");
pub const DEFAULT_NEON_BROKER_PREFIX: &str = "neon";
pub const DEFAULT_HTTP_LISTEN_PORT: u16 = 7676;
pub const DEFAULT_HTTP_LISTEN_ADDR: &str = formatcp!("127.0.0.1:{DEFAULT_HTTP_LISTEN_PORT}");
@@ -51,7 +50,7 @@ pub struct SafeKeeperConf {
pub ttl: Option<Duration>,
pub recall_period: Duration,
pub my_id: ZNodeId,
pub broker_endpoints: Option<Vec<Url>>,
pub broker_endpoints: Vec<Url>,
pub broker_etcd_prefix: String,
pub s3_offload_enabled: bool,
}
@@ -81,8 +80,8 @@ impl Default for SafeKeeperConf {
ttl: None,
recall_period: defaults::DEFAULT_RECALL_PERIOD,
my_id: ZNodeId(0),
broker_endpoints: None,
broker_etcd_prefix: defaults::DEFAULT_NEON_BROKER_PREFIX.to_string(),
broker_endpoints: Vec::new(),
broker_etcd_prefix: etcd_broker::DEFAULT_NEON_BROKER_ETCD_PREFIX.to_string(),
s3_offload_enabled: true,
}
}

View File

@@ -679,29 +679,32 @@ impl GlobalTimelines {
/// Deactivates and deletes all timelines for the tenant, see `delete()`.
/// Returns map of all timelines which the tenant had, `true` if a timeline was active.
/// There may be a race if new timelines are created simultaneously.
pub fn delete_force_all_for_tenant(
conf: &SafeKeeperConf,
tenant_id: &ZTenantId,
) -> Result<HashMap<ZTenantTimelineId, TimelineDeleteForceResult>> {
info!("deleting all timelines for tenant {}", tenant_id);
let mut state = TIMELINES_STATE.lock().unwrap();
let mut deleted = HashMap::new();
for (zttid, tli) in &state.timelines {
if zttid.tenant_id == *tenant_id {
deleted.insert(
*zttid,
GlobalTimelines::delete_force_internal(
conf,
zttid,
tli.deactivate_for_delete()?,
)?,
);
let mut to_delete = HashMap::new();
{
// Keep mutex in this scope.
let timelines = &mut TIMELINES_STATE.lock().unwrap().timelines;
for (&zttid, tli) in timelines.iter() {
if zttid.tenant_id == *tenant_id {
to_delete.insert(zttid, tli.deactivate_for_delete()?);
}
}
// TODO: test that the correct subset of timelines is removed. It's complicated because they are implicitly created currently.
timelines.retain(|zttid, _| !to_delete.contains_key(zttid));
}
// TODO: test that the exact subset of timelines is removed.
state
.timelines
.retain(|zttid, _| !deleted.contains_key(zttid));
let mut deleted = HashMap::new();
for (zttid, was_active) in to_delete {
deleted.insert(
zttid,
GlobalTimelines::delete_force_internal(conf, &zttid, was_active)?,
);
}
// There may be inactive timelines, so delete the whole tenant dir as well.
match std::fs::remove_dir_all(conf.tenant_dir(tenant_id)) {
Ok(_) => (),
Err(e) if e.kind() == std::io::ErrorKind::NotFound => (),
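In miniature, the locking pattern this rewrite adopts — collect the work while holding the mutex, then do the fallible filesystem work after releasing it — with made-up types:

// Illustrative sketch only; keys and values stand in for the safekeeper's timeline state.
use std::collections::HashMap;
use std::sync::Mutex;

fn delete_all_for_tenant(
    state: &Mutex<HashMap<(u64, u64), String>>, // (tenant, timeline) -> path
    tenant: u64,
) -> std::io::Result<()> {
    // Phase 1: under the lock, pick the victims and remove them from the map.
    let to_delete: Vec<String> = {
        let mut timelines = state.lock().unwrap();
        let paths: Vec<String> = timelines
            .iter()
            .filter(|((t, _), _)| *t == tenant)
            .map(|(_, path)| path.clone())
            .collect();
        timelines.retain(|(t, _), _| *t != tenant);
        paths
    }; // lock released here
    // Phase 2: slow filesystem deletion happens without holding the lock.
    for path in to_delete {
        std::fs::remove_dir_all(path)?;
    }
    Ok(())
}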

View File

@@ -80,12 +80,14 @@ class GitRepo:
print('No changes detected, quitting')
return
run([
git_with_user = [
'git',
'-c',
'user.name=vipvap',
'-c',
'user.email=vipvap@zenith.tech',
]
run(git_with_user + [
'commit',
'--author="vipvap <vipvap@zenith.tech>"',
f'--message={message}',
@@ -94,7 +96,7 @@ class GitRepo:
for _ in range(5):
try:
run(['git', 'fetch', 'origin', branch])
run(['git', 'rebase', f'origin/{branch}'])
run(git_with_user + ['rebase', f'origin/{branch}'])
run(['git', 'push', 'origin', branch])
return

View File

@@ -51,7 +51,6 @@ Useful environment variables:
should go.
`TEST_SHARED_FIXTURES`: Try to re-use a single pageserver for all the tests.
`ZENITH_PAGESERVER_OVERRIDES`: add a `;`-separated set of configs that will be passed as
`FORCE_MOCK_S3`: inits every test's pageserver with a mock S3 used as a remote storage.
`--pageserver-config-override=${value}` parameter values when zenith cli is invoked
`RUST_LOG`: logging configuration to pass into Zenith CLI

View File

@@ -10,13 +10,6 @@ from fixtures.zenith_fixtures import ZenithEnv, ZenithEnvBuilder, ZenithPageserv
# Create ancestor branches off the main branch.
#
def test_ancestor_branch(zenith_env_builder: ZenithEnvBuilder):
# Use safekeeper in this test to avoid a subtle race condition.
# Without safekeeper, walreceiver reconnection can get stuck
# because of IO deadlock.
#
# See https://github.com/zenithdb/zenith/issues/1068
zenith_env_builder.num_safekeepers = 1
env = zenith_env_builder.init_start()
# Override defaults, 1M gc_horizon and 4M checkpoint_distance.


@@ -94,7 +94,6 @@ def check_backpressure(pg: Postgres, stop_event: threading.Event, polling_interv
@pytest.mark.skip("See https://github.com/neondatabase/neon/issues/1587")
def test_backpressure_received_lsn_lag(zenith_env_builder: ZenithEnvBuilder):
zenith_env_builder.num_safekeepers = 1
env = zenith_env_builder.init_start()
# Create a branch for us
env.zenith_cli.create_branch('test_backpressure')


@@ -6,8 +6,6 @@ from fixtures.zenith_fixtures import ZenithEnvBuilder
# Test restarting page server, while safekeeper and compute node keep
# running.
def test_next_xid(zenith_env_builder: ZenithEnvBuilder):
# One safekeeper is enough for this test.
zenith_env_builder.num_safekeepers = 1
env = zenith_env_builder.init_start()
pg = env.postgres.create_start('main')


@@ -1,6 +1,12 @@
from uuid import uuid4, UUID
import pytest
from fixtures.zenith_fixtures import ZenithEnv, ZenithEnvBuilder, ZenithPageserverHttpClient
from fixtures.zenith_fixtures import (
DEFAULT_BRANCH_NAME,
ZenithEnv,
ZenithEnvBuilder,
ZenithPageserverHttpClient,
ZenithPageserverApiException,
)
# test that we cannot override node id
@@ -48,6 +54,39 @@ def check_client(client: ZenithPageserverHttpClient, initial_tenant: UUID):
assert local_timeline_details['timeline_state'] == 'Loaded'
def test_pageserver_http_get_wal_receiver_not_found(zenith_simple_env: ZenithEnv):
env = zenith_simple_env
client = env.pageserver.http_client()
tenant_id, timeline_id = env.zenith_cli.create_tenant()
# no PG compute node is running, so no WAL receiver is running
with pytest.raises(ZenithPageserverApiException) as e:
_ = client.wal_receiver_get(tenant_id, timeline_id)
assert "Not Found" in str(e.value)
def test_pageserver_http_get_wal_receiver_success(zenith_simple_env: ZenithEnv):
env = zenith_simple_env
client = env.pageserver.http_client()
tenant_id, timeline_id = env.zenith_cli.create_tenant()
pg = env.postgres.create_start(DEFAULT_BRANCH_NAME, tenant_id=tenant_id)
res = client.wal_receiver_get(tenant_id, timeline_id)
assert list(res.keys()) == [
"thread_id",
"wal_producer_connstr",
"last_received_msg_lsn",
"last_received_msg_ts",
]
# make a DB modification, then expect the WAL receiver's data to be updated
pg.safe_psql("CREATE TABLE t(key int primary key, value text)")
res2 = client.wal_receiver_get(tenant_id, timeline_id)
assert res2["last_received_msg_lsn"] > res["last_received_msg_lsn"]
def test_pageserver_http_api_client(zenith_simple_env: ZenithEnv):
env = zenith_simple_env
client = env.pageserver.http_client()


@@ -5,8 +5,6 @@ from fixtures.log_helper import log
# Test restarting page server, while safekeeper and compute node keep
# running.
def test_pageserver_restart(zenith_env_builder: ZenithEnvBuilder):
# One safekeeper is enough for this test.
zenith_env_builder.num_safekeepers = 1
env = zenith_env_builder.init_start()
env.zenith_cli.create_branch('test_pageserver_restart')


@@ -45,14 +45,14 @@ def test_pageserver_recovery(zenith_env_builder: ZenithEnvBuilder):
# Configure failpoints
pscur.execute(
"failpoints checkpoint-before-sync=sleep(2000);checkpoint-after-sync=panic")
"failpoints checkpoint-before-sync=sleep(2000);checkpoint-after-sync=exit")
# Do some updates until pageserver is crashed
try:
while True:
cur.execute("update foo set x=x+1")
except Exception as err:
log.info(f"Excepted server crash {err}")
log.info(f"Expected server crash {err}")
log.info("Wait before server restart")
env.pageserver.stop()
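The failpoint string above follows a simple `name=action` format, with `;` separating multiple failpoints. A tiny helper in the same spirit, shown only as an illustration (it is not part of the fixtures):

def failpoints_command(points: dict) -> str:
    # e.g. {"checkpoint-before-sync": "sleep(2000)", "checkpoint-after-sync": "exit"}
    # -> "failpoints checkpoint-before-sync=sleep(2000);checkpoint-after-sync=exit"
    return "failpoints " + ";".join(f"{name}={action}" for name, action in points.items())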


@@ -6,7 +6,7 @@ from contextlib import closing
from pathlib import Path
import time
from uuid import UUID
from fixtures.zenith_fixtures import ZenithEnvBuilder, assert_local, wait_for, wait_for_last_record_lsn, wait_for_upload
from fixtures.zenith_fixtures import ZenithEnvBuilder, assert_local, wait_until, wait_for_last_record_lsn, wait_for_upload
from fixtures.log_helper import log
from fixtures.utils import lsn_from_hex, lsn_to_hex
import pytest
@@ -32,7 +32,6 @@ import pytest
@pytest.mark.parametrize('storage_type', ['local_fs', 'mock_s3'])
def test_remote_storage_backup_and_restore(zenith_env_builder: ZenithEnvBuilder, storage_type: str):
# zenith_env_builder.rust_log_override = 'debug'
zenith_env_builder.num_safekeepers = 1
if storage_type == 'local_fs':
zenith_env_builder.enable_local_fs_remote_storage()
elif storage_type == 'mock_s3':
@@ -110,9 +109,9 @@ def test_remote_storage_backup_and_restore(zenith_env_builder: ZenithEnvBuilder,
client.timeline_attach(UUID(tenant_id), UUID(timeline_id))
log.info("waiting for timeline redownload")
wait_for(number_of_iterations=10,
interval=1,
func=lambda: assert_local(client, UUID(tenant_id), UUID(timeline_id)))
wait_until(number_of_iterations=10,
interval=1,
func=lambda: assert_local(client, UUID(tenant_id), UUID(timeline_id)))
detail = client.timeline_detail(UUID(tenant_id), UUID(timeline_id))
assert detail['local'] is not None


@@ -3,12 +3,14 @@ import os
import pathlib
import subprocess
import threading
import typing
from uuid import UUID
from fixtures.log_helper import log
from typing import Optional
import signal
import pytest
from fixtures.zenith_fixtures import PgProtocol, PortDistributor, Postgres, ZenithEnvBuilder, ZenithPageserverHttpClient, assert_local, wait_for, wait_for_last_record_lsn, wait_for_upload, zenith_binpath, pg_distrib_dir
from fixtures.zenith_fixtures import PgProtocol, PortDistributor, Postgres, ZenithEnvBuilder, Etcd, ZenithPageserverHttpClient, assert_local, wait_until, wait_for_last_record_lsn, wait_for_upload, zenith_binpath, pg_distrib_dir
from fixtures.utils import lsn_from_hex
@@ -21,7 +23,8 @@ def new_pageserver_helper(new_pageserver_dir: pathlib.Path,
pageserver_bin: pathlib.Path,
remote_storage_mock_path: pathlib.Path,
pg_port: int,
http_port: int):
http_port: int,
broker: Optional[Etcd]):
"""
Cannot use ZenithPageserver yet because it depends on the zenith CLI,
which currently lacks support for multiple pageservers
@@ -38,6 +41,9 @@ def new_pageserver_helper(new_pageserver_dir: pathlib.Path,
f"-c remote_storage={{local_path='{remote_storage_mock_path}'}}",
]
if broker is not None:
cmd.append(f"-c broker_endpoints=['{broker.client_url()}']", )
subprocess.check_output(cmd, text=True)
# actually run new pageserver
@@ -103,7 +109,6 @@ def load(pg: Postgres, stop_event: threading.Event, load_ok_event: threading.Eve
def test_tenant_relocation(zenith_env_builder: ZenithEnvBuilder,
port_distributor: PortDistributor,
with_load: str):
zenith_env_builder.num_safekeepers = 1
zenith_env_builder.enable_local_fs_remote_storage()
env = zenith_env_builder.init_start()
@@ -180,12 +185,13 @@ def test_tenant_relocation(zenith_env_builder: ZenithEnvBuilder,
pageserver_bin,
remote_storage_mock_path,
new_pageserver_pg_port,
new_pageserver_http_port):
new_pageserver_http_port,
zenith_env_builder.broker):
# call to attach timeline to new pageserver
new_pageserver_http.timeline_attach(tenant, timeline)
# new pageserver should be in sync (modulo wal tail or vacuum activity) with the old one because there were no new writes since the checkpoint
new_timeline_detail = wait_for(
new_timeline_detail = wait_until(
number_of_iterations=5,
interval=1,
func=lambda: assert_local(new_pageserver_http, tenant, timeline))


@@ -1,8 +1,12 @@
from contextlib import closing
from datetime import datetime
import os
import pytest
from fixtures.zenith_fixtures import ZenithEnvBuilder
from fixtures.log_helper import log
from fixtures.metrics import parse_metrics
from fixtures.utils import lsn_to_hex
@pytest.mark.parametrize('with_safekeepers', [False, True])
@@ -38,3 +42,79 @@ def test_tenants_normal_work(zenith_env_builder: ZenithEnvBuilder, with_safekeep
cur.execute("INSERT INTO t SELECT generate_series(1,100000), 'payload'")
cur.execute("SELECT sum(key) FROM t")
assert cur.fetchone() == (5000050000, )
def test_metrics_normal_work(zenith_env_builder: ZenithEnvBuilder):
zenith_env_builder.num_safekeepers = 3
env = zenith_env_builder.init_start()
tenant_1, _ = env.zenith_cli.create_tenant()
tenant_2, _ = env.zenith_cli.create_tenant()
timeline_1 = env.zenith_cli.create_timeline('test_metrics_normal_work', tenant_id=tenant_1)
timeline_2 = env.zenith_cli.create_timeline('test_metrics_normal_work', tenant_id=tenant_2)
pg_tenant1 = env.postgres.create_start('test_metrics_normal_work', tenant_id=tenant_1)
pg_tenant2 = env.postgres.create_start('test_metrics_normal_work', tenant_id=tenant_2)
for pg in [pg_tenant1, pg_tenant2]:
with closing(pg.connect()) as conn:
with conn.cursor() as cur:
cur.execute("CREATE TABLE t(key int primary key, value text)")
cur.execute("INSERT INTO t SELECT generate_series(1,100000), 'payload'")
cur.execute("SELECT sum(key) FROM t")
assert cur.fetchone() == (5000050000, )
collected_metrics = {
"pageserver": env.pageserver.http_client().get_metrics(),
}
for sk in env.safekeepers:
collected_metrics[f'safekeeper{sk.id}'] = sk.http_client().get_metrics_str()
for name in collected_metrics:
basepath = os.path.join(zenith_env_builder.repo_dir, f'{name}.metrics')
with open(basepath, 'w') as stdout_f:
print(collected_metrics[name], file=stdout_f, flush=True)
all_metrics = [parse_metrics(m, name) for name, m in collected_metrics.items()]
ps_metrics = all_metrics[0]
sk_metrics = all_metrics[1:]
ttids = [{
'tenant_id': tenant_1.hex, 'timeline_id': timeline_1.hex
}, {
'tenant_id': tenant_2.hex, 'timeline_id': timeline_2.hex
}]
# Test metrics per timeline
for tt in ttids:
log.info(f"Checking metrics for {tt}")
ps_lsn = int(ps_metrics.query_one("pageserver_last_record_lsn", filter=tt).value)
sk_lsns = [int(sk.query_one("safekeeper_commit_lsn", filter=tt).value) for sk in sk_metrics]
log.info(f"ps_lsn: {lsn_to_hex(ps_lsn)}")
log.info(f"sk_lsns: {list(map(lsn_to_hex, sk_lsns))}")
assert ps_lsn <= max(sk_lsns)
assert ps_lsn > 0
# Test common metrics
for metrics in all_metrics:
log.info(f"Checking common metrics for {metrics.name}")
log.info(
f"process_cpu_seconds_total: {metrics.query_one('process_cpu_seconds_total').value}")
log.info(f"process_threads: {int(metrics.query_one('process_threads').value)}")
log.info(
f"process_resident_memory_bytes (MB): {metrics.query_one('process_resident_memory_bytes').value / 1024 / 1024}"
)
log.info(
f"process_virtual_memory_bytes (MB): {metrics.query_one('process_virtual_memory_bytes').value / 1024 / 1024}"
)
log.info(f"process_open_fds: {int(metrics.query_one('process_open_fds').value)}")
log.info(f"process_max_fds: {int(metrics.query_one('process_max_fds').value)}")
log.info(
f"process_start_time_seconds (UTC): {datetime.fromtimestamp(metrics.query_one('process_start_time_seconds').value)}"
)


@@ -0,0 +1,97 @@
#
# Little stress test for the checkpointing and remote storage code.
#
# The test creates several tenants, and runs a simple workload on
# each tenant, in parallel. The test uses remote storage, and a tiny
# checkpoint_distance setting so that a lot of layer files are created.
#
import asyncio
from contextlib import closing
from uuid import UUID
import pytest
from fixtures.zenith_fixtures import ZenithEnvBuilder, ZenithEnv, Postgres, wait_for_last_record_lsn, wait_for_upload
from fixtures.utils import lsn_from_hex
async def tenant_workload(env: ZenithEnv, pg: Postgres):
pageserver_conn = await env.pageserver.connect_async()
pg_conn = await pg.connect_async()
tenant_id = await pg_conn.fetchval("show zenith.zenith_tenant")
timeline_id = await pg_conn.fetchval("show zenith.zenith_timeline")
await pg_conn.execute("CREATE TABLE t(key int primary key, value text)")
for i in range(1, 100):
await pg_conn.execute(
f"INSERT INTO t SELECT {i}*1000 + g, 'payload' from generate_series(1,1000) g")
# we rely on autocommit after each statement,
# since waiting for the acceptors happens there
res = await pg_conn.fetchval("SELECT count(*) FROM t")
assert res == i * 1000
async def all_tenants_workload(env: ZenithEnv, tenants_pgs):
workers = []
for tenant, pg in tenants_pgs:
worker = tenant_workload(env, pg)
workers.append(asyncio.create_task(worker))
# await all workers
await asyncio.gather(*workers)
@pytest.mark.parametrize('storage_type', ['local_fs', 'mock_s3'])
def test_tenants_many(zenith_env_builder: ZenithEnvBuilder, storage_type: str):
if storage_type == 'local_fs':
zenith_env_builder.enable_local_fs_remote_storage()
elif storage_type == 'mock_s3':
zenith_env_builder.enable_s3_mock_remote_storage('test_remote_storage_backup_and_restore')
else:
raise RuntimeError(f'Unknown storage type: {storage_type}')
env = zenith_env_builder.init_start()
tenants_pgs = []
for i in range(1, 5):
# Use a tiny checkpoint distance, to create a lot of layers quickly
tenant, _ = env.zenith_cli.create_tenant(
conf={
'checkpoint_distance': '5000000',
})
env.zenith_cli.create_timeline(f'test_tenants_many', tenant_id=tenant)
pg = env.postgres.create_start(
f'test_tenants_many',
tenant_id=tenant,
)
tenants_pgs.append((tenant, pg))
asyncio.run(all_tenants_workload(env, tenants_pgs))
# Wait for the remote storage uploads to finish
pageserver_http = env.pageserver.http_client()
for tenant, pg in tenants_pgs:
with closing(pg.connect()) as conn:
with conn.cursor() as cur:
cur.execute("show zenith.zenith_tenant")
tenant_id = cur.fetchone()[0]
cur.execute("show zenith.zenith_timeline")
timeline_id = cur.fetchone()[0]
cur.execute("SELECT pg_current_wal_flush_lsn()")
current_lsn = lsn_from_hex(cur.fetchone()[0])
# wait until pageserver receives all the data
wait_for_last_record_lsn(pageserver_http, UUID(tenant_id), UUID(timeline_id), current_lsn)
# run final checkpoint manually to flush all the data to remote storage
env.pageserver.safe_psql(f"checkpoint {tenant_id} {timeline_id}")
wait_for_upload(pageserver_http, UUID(tenant_id), UUID(timeline_id), current_lsn)


@@ -70,7 +70,6 @@ def wait_for_pageserver_catchup(pgmain: Postgres, polling_interval=1, timeout=60
def test_timeline_size_quota(zenith_env_builder: ZenithEnvBuilder):
zenith_env_builder.num_safekeepers = 1
env = zenith_env_builder.init_start()
new_timeline_id = env.zenith_cli.create_branch('test_timeline_size_quota')


@@ -12,8 +12,8 @@ from contextlib import closing
from dataclasses import dataclass, field
from multiprocessing import Process, Value
from pathlib import Path
from fixtures.zenith_fixtures import PgBin, Postgres, Safekeeper, ZenithEnv, ZenithEnvBuilder, PortDistributor, SafekeeperPort, zenith_binpath, PgProtocol
from fixtures.utils import etcd_path, get_dir_size, lsn_to_hex, mkdir_if_needed, lsn_from_hex
from fixtures.zenith_fixtures import PgBin, Etcd, Postgres, Safekeeper, ZenithEnv, ZenithEnvBuilder, PortDistributor, SafekeeperPort, zenith_binpath, PgProtocol
from fixtures.utils import get_dir_size, lsn_to_hex, mkdir_if_needed, lsn_from_hex
from fixtures.log_helper import log
from typing import List, Optional, Any
@@ -22,7 +22,6 @@ from typing import List, Optional, Any
# succeed and data is written
def test_normal_work(zenith_env_builder: ZenithEnvBuilder):
zenith_env_builder.num_safekeepers = 3
zenith_env_builder.broker = True
env = zenith_env_builder.init_start()
env.zenith_cli.create_branch('test_safekeepers_normal_work')
@@ -328,10 +327,8 @@ def test_race_conditions(zenith_env_builder: ZenithEnvBuilder, stop_value):
# Test that safekeepers push their info to the broker and learn peer status from it
@pytest.mark.skipif(etcd_path() is None, reason="requires etcd which is not present in PATH")
def test_broker(zenith_env_builder: ZenithEnvBuilder):
zenith_env_builder.num_safekeepers = 3
zenith_env_builder.broker = True
zenith_env_builder.enable_local_fs_remote_storage()
env = zenith_env_builder.init_start()
@@ -371,10 +368,8 @@ def test_broker(zenith_env_builder: ZenithEnvBuilder):
# Test that old WAL consumed by peers and pageserver is removed from safekeepers.
@pytest.mark.skipif(etcd_path() is None, reason="requires etcd which is not present in PATH")
def test_wal_removal(zenith_env_builder: ZenithEnvBuilder):
zenith_env_builder.num_safekeepers = 2
zenith_env_builder.broker = True
# to advance remote_consistent_lsn
zenith_env_builder.enable_local_fs_remote_storage()
env = zenith_env_builder.init_start()
@@ -557,8 +552,6 @@ def test_sync_safekeepers(zenith_env_builder: ZenithEnvBuilder,
def test_timeline_status(zenith_env_builder: ZenithEnvBuilder):
zenith_env_builder.num_safekeepers = 1
env = zenith_env_builder.init_start()
env.zenith_cli.create_branch('test_timeline_status')
@@ -599,6 +592,9 @@ class SafekeeperEnv:
num_safekeepers: int = 1):
self.repo_dir = repo_dir
self.port_distributor = port_distributor
self.broker = Etcd(datadir=os.path.join(self.repo_dir, "etcd"),
port=self.port_distributor.get_port(),
peer_port=self.port_distributor.get_port())
self.pg_bin = pg_bin
self.num_safekeepers = num_safekeepers
self.bin_safekeeper = os.path.join(str(zenith_binpath), 'safekeeper')
@@ -645,6 +641,8 @@ class SafekeeperEnv:
safekeeper_dir,
"--id",
str(i),
"--broker-endpoints",
self.broker.client_url(),
"--daemonize"
]
@@ -698,7 +696,6 @@ def test_safekeeper_without_pageserver(test_output_dir: str,
repo_dir,
port_distributor,
pg_bin,
num_safekeepers=1,
)
with env:


@@ -15,7 +15,6 @@ def test_wal_restore(zenith_env_builder: ZenithEnvBuilder,
pg_bin: PgBin,
test_output_dir,
port_distributor: PortDistributor):
zenith_env_builder.num_safekeepers = 1
env = zenith_env_builder.init_start()
env.zenith_cli.create_branch("test_wal_restore")
pg = env.postgres.create_start('test_wal_restore')


@@ -94,8 +94,6 @@ def test_cli_tenant_create(zenith_simple_env: ZenithEnv):
def test_cli_ipv4_listeners(zenith_env_builder: ZenithEnvBuilder):
# Start with single sk
zenith_env_builder.num_safekeepers = 1
env = zenith_env_builder.init_start()
# Connect to sk port on v4 loopback
@@ -111,8 +109,6 @@ def test_cli_ipv4_listeners(zenith_env_builder: ZenithEnvBuilder):
def test_cli_start_stop(zenith_env_builder: ZenithEnvBuilder):
# Start with single sk
zenith_env_builder.num_safekeepers = 1
env = zenith_env_builder.init_start()
# Stop default ps/sk


@@ -1,9 +1,12 @@
import os
import pytest
from fixtures.utils import mkdir_if_needed
from fixtures.zenith_fixtures import ZenithEnv, base_dir, pg_distrib_dir
# The isolation tests run for a long time, especially in debug mode,
# so use a larger-than-default timeout.
@pytest.mark.timeout(1800)
def test_isolation(zenith_simple_env: ZenithEnv, test_output_dir, pg_bin, capsys):
env = zenith_simple_env


@@ -1,9 +1,12 @@
import os
import pytest
from fixtures.utils import mkdir_if_needed
from fixtures.zenith_fixtures import ZenithEnv, check_restored_datadir_content, base_dir, pg_distrib_dir
# The pg_regress tests run for a long time, especially in debug mode,
# so use a larger-than-default timeout.
@pytest.mark.timeout(1800)
def test_pg_regress(zenith_simple_env: ZenithEnv, test_output_dir: str, pg_bin, capsys):
env = zenith_simple_env


@@ -236,14 +236,14 @@ class ZenithBenchmarker:
"""
Fetch the "cumulative # of bytes written" metric from the pageserver
"""
metric_name = r'pageserver_disk_io_bytes{io_operation="write"}'
metric_name = r'libmetrics_disk_io_bytes_total{io_operation="write"}'
return self.get_int_counter_value(pageserver, metric_name)
def get_peak_mem(self, pageserver) -> int:
"""
Fetch the "maxrss" metric from the pageserver
"""
metric_name = r'pageserver_maxrss_kb'
metric_name = r'libmetrics_maxrss_kb'
return self.get_int_counter_value(pageserver, metric_name)
def get_int_counter_value(self, pageserver, metric_name) -> int:


@@ -0,0 +1,38 @@
from dataclasses import dataclass
from prometheus_client.parser import text_string_to_metric_families
from prometheus_client.samples import Sample
from typing import Dict, List
from collections import defaultdict
from fixtures.log_helper import log
class Metrics:
metrics: Dict[str, List[Sample]]
name: str
def __init__(self, name: str = ""):
self.metrics = defaultdict(list)
self.name = name
def query_all(self, name: str, filter: Dict[str, str]) -> List[Sample]:
res = []
for sample in self.metrics[name]:
if all(sample.labels[k] == v for k, v in filter.items()):
res.append(sample)
return res
def query_one(self, name: str, filter: Dict[str, str] = {}) -> Sample:
res = self.query_all(name, filter)
assert len(res) == 1, f"expected single sample for {name} {filter}, found {res}"
return res[0]
def parse_metrics(text: str, name: str = ""):
metrics = Metrics(name)
gen = text_string_to_metric_families(text)
for family in gen:
for sample in family.samples:
metrics.metrics[sample.name].append(sample)
return metrics
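A short usage sketch for the new fixtures.metrics helper, mirroring how test_metrics_normal_work consumes it; the metrics text here is hand-written Prometheus exposition format, whereas a real test would feed in env.pageserver.http_client().get_metrics():

from fixtures.metrics import parse_metrics

text = 'pageserver_last_record_lsn{tenant_id="abc",timeline_id="def"} 42.0\n'
ps_metrics = parse_metrics(text, "pageserver")
sample = ps_metrics.query_one("pageserver_last_record_lsn",
                              filter={"tenant_id": "abc", "timeline_id": "def"})
assert int(sample.value) == 42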


@@ -1,8 +1,9 @@
import os
import shutil
import subprocess
from pathlib import Path
from typing import Any, List
from typing import Any, List, Optional
from fixtures.log_helper import log
@@ -80,9 +81,12 @@ def print_gc_result(row):
.format_map(row))
# path to etcd binary or None if not present.
def etcd_path():
return shutil.which("etcd")
def etcd_path() -> Path:
path_output = shutil.which("etcd")
if path_output is None:
raise RuntimeError('etcd not found in PATH')
else:
return Path(path_output)
# Traverse directory to get total size.


@@ -34,7 +34,12 @@ from typing_extensions import Literal
import requests
import backoff # type: ignore
from .utils import (etcd_path, get_self_dir, mkdir_if_needed, subprocess_capture, lsn_from_hex)
from .utils import (etcd_path,
get_self_dir,
mkdir_if_needed,
subprocess_capture,
lsn_from_hex,
lsn_to_hex)
from fixtures.log_helper import log
"""
This file contains pytest fixtures. A fixture is a test resource that can be
@@ -61,7 +66,7 @@ DEFAULT_POSTGRES_DIR = 'tmp_install'
DEFAULT_BRANCH_NAME = 'main'
BASE_PORT = 15000
WORKER_PORT_NUM = 100
WORKER_PORT_NUM = 1000
def pytest_addoption(parser):
@@ -178,7 +183,7 @@ def shareable_scope(fixture_name, config) -> Literal["session", "function"]:
return 'function' if os.environ.get('TEST_SHARED_FIXTURES') is None else 'session'
@pytest.fixture(scope=shareable_scope)
@pytest.fixture(scope='session')
def worker_seq_no(worker_id: str):
# worker_id is a pytest-xdist fixture
# it can be master or gw<number>
@@ -189,7 +194,7 @@ def worker_seq_no(worker_id: str):
return int(worker_id[2:])
@pytest.fixture(scope=shareable_scope)
@pytest.fixture(scope='session')
def worker_base_port(worker_seq_no: int):
# divide the available ports into per-worker ranges
# so that each worker gets a disjoint set of ports for its services
@@ -242,11 +247,30 @@ class PortDistributor:
'port range configured for test is exhausted, consider enlarging the range')
@pytest.fixture(scope=shareable_scope)
@pytest.fixture(scope='session')
def port_distributor(worker_base_port):
return PortDistributor(base_port=worker_base_port, port_number=WORKER_PORT_NUM)
@pytest.fixture(scope='session')
def default_broker(request: Any, port_distributor: PortDistributor):
client_port = port_distributor.get_port()
# multiple pytest sessions could be launched in parallel, so give each one a different datadir
etcd_datadir = os.path.join(get_test_output_dir(request), f"etcd_datadir_{client_port}")
pathlib.Path(etcd_datadir).mkdir(exist_ok=True, parents=True)
broker = Etcd(datadir=etcd_datadir, port=client_port, peer_port=port_distributor.get_port())
yield broker
broker.stop()
@pytest.fixture(scope='session')
def mock_s3_server(port_distributor: PortDistributor):
mock_s3_server = MockS3Server(port_distributor.get_port())
yield mock_s3_server
mock_s3_server.kill()
class PgProtocol:
""" Reusable connection logic """
def __init__(self, **kwargs):
@@ -410,31 +434,26 @@ class ZenithEnvBuilder:
def __init__(self,
repo_dir: Path,
port_distributor: PortDistributor,
pageserver_remote_storage: Optional[RemoteStorage] = None,
broker: Etcd,
mock_s3_server: MockS3Server,
remote_storage: Optional[RemoteStorage] = None,
pageserver_config_override: Optional[str] = None,
num_safekeepers: int = 0,
num_safekeepers: int = 1,
pageserver_auth_enabled: bool = False,
rust_log_override: Optional[str] = None,
default_branch_name=DEFAULT_BRANCH_NAME,
broker: bool = False):
default_branch_name=DEFAULT_BRANCH_NAME):
self.repo_dir = repo_dir
self.rust_log_override = rust_log_override
self.port_distributor = port_distributor
self.pageserver_remote_storage = pageserver_remote_storage
self.remote_storage = remote_storage
self.broker = broker
self.mock_s3_server = mock_s3_server
self.pageserver_config_override = pageserver_config_override
self.num_safekeepers = num_safekeepers
self.pageserver_auth_enabled = pageserver_auth_enabled
self.default_branch_name = default_branch_name
self.broker = broker
self.env: Optional[ZenithEnv] = None
self.s3_mock_server: Optional[MockS3Server] = None
if os.getenv('FORCE_MOCK_S3') is not None:
bucket_name = f'{repo_dir.name}_bucket'
log.warning(f'Unconditionally initializing mock S3 server for bucket {bucket_name}')
self.enable_s3_mock_remote_storage(bucket_name)
def init(self) -> ZenithEnv:
# Cannot create more than one environment from one builder
assert self.env is None, "environment already initialized"
@@ -455,9 +474,8 @@ class ZenithEnvBuilder:
"""
def enable_local_fs_remote_storage(self, force_enable=True):
assert force_enable or self.pageserver_remote_storage is None, "remote storage is enabled already"
self.pageserver_remote_storage = LocalFsStorage(
Path(self.repo_dir / 'local_fs_remote_storage'))
assert force_enable or self.remote_storage is None, "remote storage is enabled already"
self.remote_storage = LocalFsStorage(Path(self.repo_dir / 'local_fs_remote_storage'))
"""
Sets up the pageserver to use the S3 mock server, creates the bucket, if it's not present already.
@@ -466,22 +484,19 @@ class ZenithEnvBuilder:
"""
def enable_s3_mock_remote_storage(self, bucket_name: str, force_enable=True):
assert force_enable or self.pageserver_remote_storage is None, "remote storage is enabled already"
if not self.s3_mock_server:
self.s3_mock_server = MockS3Server(self.port_distributor.get_port())
mock_endpoint = self.s3_mock_server.endpoint()
mock_region = self.s3_mock_server.region()
assert force_enable or self.remote_storage is None, "remote storage is enabled already"
mock_endpoint = self.mock_s3_server.endpoint()
mock_region = self.mock_s3_server.region()
boto3.client(
's3',
endpoint_url=mock_endpoint,
region_name=mock_region,
aws_access_key_id=self.s3_mock_server.access_key(),
aws_secret_access_key=self.s3_mock_server.secret_key(),
aws_access_key_id=self.mock_s3_server.access_key(),
aws_secret_access_key=self.mock_s3_server.secret_key(),
).create_bucket(Bucket=bucket_name)
self.pageserver_remote_storage = S3Storage(bucket=bucket_name,
endpoint=mock_endpoint,
region=mock_region)
self.remote_storage = S3Storage(bucket=bucket_name,
endpoint=mock_endpoint,
region=mock_region)
def __enter__(self):
return self
@@ -495,10 +510,6 @@ class ZenithEnvBuilder:
for sk in self.env.safekeepers:
sk.stop(immediate=True)
self.env.pageserver.stop(immediate=True)
if self.s3_mock_server:
self.s3_mock_server.kill()
if self.env.broker is not None:
self.env.broker.stop()
class ZenithEnv:
@@ -537,10 +548,12 @@ class ZenithEnv:
self.repo_dir = config.repo_dir
self.rust_log_override = config.rust_log_override
self.port_distributor = config.port_distributor
self.s3_mock_server = config.s3_mock_server
self.s3_mock_server = config.mock_s3_server
self.zenith_cli = ZenithCli(env=self)
self.postgres = PostgresFactory(self)
self.safekeepers: List[Safekeeper] = []
self.broker = config.broker
self.remote_storage = config.remote_storage
# generate initial tenant ID here instead of letting 'zenith init' generate it,
# so that we don't need to dig it out of the config file afterwards.
@@ -551,14 +564,10 @@ class ZenithEnv:
default_tenant_id = '{self.initial_tenant.hex}'
""")
self.broker = None
if config.broker:
# keep etcd datadir inside 'repo'
self.broker = Etcd(datadir=os.path.join(self.repo_dir, "etcd"),
port=self.port_distributor.get_port(),
peer_port=self.port_distributor.get_port())
toml += textwrap.dedent(f"""
broker_endpoints = 'http://127.0.0.1:{self.broker.port}'
toml += textwrap.dedent(f"""
[etcd_broker]
broker_endpoints = ['{self.broker.client_url()}']
etcd_binary_path = '{self.broker.binary_path}'
""")
# Create config for pageserver
@@ -579,7 +588,6 @@ class ZenithEnv:
# Create a corresponding ZenithPageserver object
self.pageserver = ZenithPageserver(self,
port=pageserver_port,
remote_storage=config.pageserver_remote_storage,
config_override=config.pageserver_config_override)
# Create config and a Safekeeper object for each safekeeper
@@ -603,15 +611,13 @@ class ZenithEnv:
self.zenith_cli.init(toml)
def start(self):
# Start up the page server, all the safekeepers and the broker
# Start up broker, pageserver and all safekeepers
self.broker.try_start()
self.pageserver.start()
for safekeeper in self.safekeepers:
safekeeper.start()
if self.broker is not None:
self.broker.start()
def get_safekeeper_connstrs(self) -> str:
""" Get list of safekeeper endpoints suitable for safekeepers GUC """
return ','.join([f'localhost:{wa.port.pg}' for wa in self.safekeepers])
@@ -624,7 +630,10 @@ class ZenithEnv:
@pytest.fixture(scope=shareable_scope)
def _shared_simple_env(request: Any, port_distributor) -> Iterator[ZenithEnv]:
def _shared_simple_env(request: Any,
port_distributor: PortDistributor,
mock_s3_server: MockS3Server,
default_broker: Etcd) -> Iterator[ZenithEnv]:
"""
Internal fixture backing the `zenith_simple_env` fixture. If TEST_SHARED_FIXTURES
is set, this is shared by all tests using `zenith_simple_env`.
@@ -638,7 +647,8 @@ def _shared_simple_env(request: Any, port_distributor) -> Iterator[ZenithEnv]:
repo_dir = os.path.join(str(top_output_dir), "shared_repo")
shutil.rmtree(repo_dir, ignore_errors=True)
with ZenithEnvBuilder(Path(repo_dir), port_distributor) as builder:
with ZenithEnvBuilder(Path(repo_dir), port_distributor, default_broker,
mock_s3_server) as builder:
env = builder.init_start()
# For convenience in tests, create a branch from the freshly-initialized cluster.
@@ -660,12 +670,13 @@ def zenith_simple_env(_shared_simple_env: ZenithEnv) -> Iterator[ZenithEnv]:
yield _shared_simple_env
_shared_simple_env.postgres.stop_all()
if _shared_simple_env.s3_mock_server:
_shared_simple_env.s3_mock_server.kill()
@pytest.fixture(scope='function')
def zenith_env_builder(test_output_dir, port_distributor) -> Iterator[ZenithEnvBuilder]:
def zenith_env_builder(test_output_dir,
port_distributor: PortDistributor,
mock_s3_server: MockS3Server,
default_broker: Etcd) -> Iterator[ZenithEnvBuilder]:
"""
Fixture to create a Zenith environment for test.
@@ -683,7 +694,8 @@ def zenith_env_builder(test_output_dir, port_distributor) -> Iterator[ZenithEnvB
repo_dir = os.path.join(test_output_dir, "repo")
# Return the builder to the caller
with ZenithEnvBuilder(Path(repo_dir), port_distributor) as builder:
with ZenithEnvBuilder(Path(repo_dir), port_distributor, default_broker,
mock_s3_server) as builder:
yield builder
@@ -786,6 +798,15 @@ class ZenithPageserverHttpClient(requests.Session):
assert isinstance(res_json, dict)
return res_json
def wal_receiver_get(self, tenant_id: uuid.UUID, timeline_id: uuid.UUID) -> Dict[Any, Any]:
res = self.get(
f"http://localhost:{self.port}/v1/tenant/{tenant_id.hex}/timeline/{timeline_id.hex}/wal_receiver"
)
self.verbose_error(res)
res_json = res.json()
assert isinstance(res_json, dict)
return res_json
def get_metrics(self) -> str:
res = self.get(f"http://localhost:{self.port}/metrics")
self.verbose_error(res)
@@ -971,9 +992,10 @@ class ZenithCli:
cmd = ['init', f'--config={tmp.name}']
if initial_timeline_id:
cmd.extend(['--timeline-id', initial_timeline_id.hex])
append_pageserver_param_overrides(cmd,
self.env.pageserver.remote_storage,
self.env.pageserver.config_override)
append_pageserver_param_overrides(
params_to_update=cmd,
remote_storage=self.env.remote_storage,
pageserver_config_override=self.env.pageserver.config_override)
res = self.raw_cli(cmd)
res.check_returncode()
@@ -994,9 +1016,10 @@ class ZenithCli:
def pageserver_start(self, overrides=()) -> 'subprocess.CompletedProcess[str]':
start_args = ['pageserver', 'start', *overrides]
append_pageserver_param_overrides(start_args,
self.env.pageserver.remote_storage,
self.env.pageserver.config_override)
append_pageserver_param_overrides(
params_to_update=start_args,
remote_storage=self.env.remote_storage,
pageserver_config_override=self.env.pageserver.config_override)
s3_env_vars = None
if self.env.s3_mock_server:
@@ -1166,16 +1189,11 @@ class ZenithPageserver(PgProtocol):
Initializes the repository via `zenith init`.
"""
def __init__(self,
env: ZenithEnv,
port: PageserverPort,
remote_storage: Optional[RemoteStorage] = None,
config_override: Optional[str] = None):
def __init__(self, env: ZenithEnv, port: PageserverPort, config_override: Optional[str] = None):
super().__init__(host='localhost', port=port.pg, user='zenith_admin')
self.env = env
self.running = False
self.service_port = port
self.remote_storage = remote_storage
self.config_override = config_override
def start(self, overrides=()) -> 'ZenithPageserver':
@@ -1215,21 +1233,21 @@ class ZenithPageserver(PgProtocol):
def append_pageserver_param_overrides(
params_to_update: List[str],
pageserver_remote_storage: Optional[RemoteStorage],
remote_storage: Optional[RemoteStorage],
pageserver_config_override: Optional[str] = None,
):
if pageserver_remote_storage is not None:
if isinstance(pageserver_remote_storage, LocalFsStorage):
pageserver_storage_override = f"local_path='{pageserver_remote_storage.root}'"
elif isinstance(pageserver_remote_storage, S3Storage):
pageserver_storage_override = f"bucket_name='{pageserver_remote_storage.bucket}',\
bucket_region='{pageserver_remote_storage.region}'"
if remote_storage is not None:
if isinstance(remote_storage, LocalFsStorage):
pageserver_storage_override = f"local_path='{remote_storage.root}'"
elif isinstance(remote_storage, S3Storage):
pageserver_storage_override = f"bucket_name='{remote_storage.bucket}',\
bucket_region='{remote_storage.region}'"
if pageserver_remote_storage.endpoint is not None:
pageserver_storage_override += f",endpoint='{pageserver_remote_storage.endpoint}'"
if remote_storage.endpoint is not None:
pageserver_storage_override += f",endpoint='{remote_storage.endpoint}'"
else:
raise Exception(f'Unknown storage configuration {pageserver_remote_storage}')
raise Exception(f'Unknown storage configuration {remote_storage}')
params_to_update.append(
f'--pageserver-config-override=remote_storage={{{pageserver_storage_override}}}')
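For reference, this is roughly what append_pageserver_param_overrides produces for a local_fs configuration; the path is illustrative, and the example assumes LocalFsStorage and the helper are importable from fixtures.zenith_fixtures, where they appear to be defined:

from pathlib import Path
from fixtures.zenith_fixtures import LocalFsStorage, append_pageserver_param_overrides

params = []
append_pageserver_param_overrides(
    params_to_update=params,
    remote_storage=LocalFsStorage(Path('/tmp/repo/local_fs_remote_storage')))
# params now contains something like:
# ["--pageserver-config-override=remote_storage={local_path='/tmp/repo/local_fs_remote_storage'}"]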
@@ -1815,10 +1833,13 @@ class SafekeeperHttpClient(requests.Session):
assert isinstance(res_json, dict)
return res_json
def get_metrics(self) -> SafekeeperMetrics:
def get_metrics_str(self) -> str:
request_result = self.get(f"http://localhost:{self.port}/metrics")
request_result.raise_for_status()
all_metrics_text = request_result.text
return request_result.text
def get_metrics(self) -> SafekeeperMetrics:
all_metrics_text = self.get_metrics_str()
metrics = SafekeeperMetrics()
for match in re.finditer(
@@ -1840,26 +1861,36 @@ class Etcd:
datadir: str
port: int
peer_port: int
binary_path: Path = etcd_path()
handle: Optional[subprocess.Popen[Any]] = None # handle of running daemon
def client_url(self):
return f'http://127.0.0.1:{self.port}'
def check_status(self):
s = requests.Session()
s.mount('http://', requests.adapters.HTTPAdapter(max_retries=1)) # do not retry
s.get(f"http://localhost:{self.port}/health").raise_for_status()
s.get(f"{self.client_url()}/health").raise_for_status()
def try_start(self):
if self.handle is not None:
log.debug(f'etcd is already running on port {self.port}')
return
def start(self):
pathlib.Path(self.datadir).mkdir(exist_ok=True)
etcd_full_path = etcd_path()
if etcd_full_path is None:
raise Exception('etcd not found')
if not self.binary_path.is_file():
raise RuntimeError(f"etcd broker binary '{self.binary_path}' is not a file")
client_url = self.client_url()
log.info(f'Starting etcd to listen for incoming connections at "{client_url}"')
with open(os.path.join(self.datadir, "etcd.log"), "wb") as log_file:
args = [
etcd_full_path,
self.binary_path,
f"--data-dir={self.datadir}",
f"--listen-client-urls=http://localhost:{self.port}",
f"--advertise-client-urls=http://localhost:{self.port}",
f"--listen-peer-urls=http://localhost:{self.peer_port}"
f"--listen-client-urls={client_url}",
f"--advertise-client-urls={client_url}",
f"--listen-peer-urls=http://127.0.0.1:{self.peer_port}"
]
self.handle = subprocess.Popen(args, stdout=log_file, stderr=log_file)
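A minimal usage sketch for the reworked Etcd helper, based only on the methods visible above; the datadir and ports are illustrative, and the etcd binary must be present in PATH:

from fixtures.zenith_fixtures import Etcd

broker = Etcd(datadir='/tmp/etcd_data', port=2379, peer_port=2380)
broker.try_start()          # starts etcd once; calling it again is a no-op
broker.check_status()       # raises if the /health endpoint reports an error
print(broker.client_url())  # -> 'http://127.0.0.1:2379'
broker.stop()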
@@ -1911,7 +1942,12 @@ def test_output_dir(request: Any) -> str:
return test_dir
SKIP_DIRS = frozenset(('pg_wal', 'pg_stat', 'pg_stat_tmp', 'pg_subtrans', 'pg_logical'))
SKIP_DIRS = frozenset(('pg_wal',
'pg_stat',
'pg_stat_tmp',
'pg_subtrans',
'pg_logical',
'pg_replslot/wal_proposer_slot'))
SKIP_FILES = frozenset(('pg_internal.init',
'pg.log',
@@ -2037,7 +2073,11 @@ def check_restored_datadir_content(test_output_dir: str, env: ZenithEnv, pg: Pos
assert (mismatch, error) == ([], [])
def wait_for(number_of_iterations: int, interval: int, func):
def wait_until(number_of_iterations: int, interval: int, func):
"""
Wait until 'func' returns successfully, without exception. Returns the last return value
from the function.
"""
last_exception = None
for i in range(number_of_iterations):
try:
@@ -2064,9 +2104,15 @@ def remote_consistent_lsn(pageserver_http_client: ZenithPageserverHttpClient,
timeline: uuid.UUID) -> int:
detail = pageserver_http_client.timeline_detail(tenant, timeline)
lsn_str = detail['remote']['remote_consistent_lsn']
assert isinstance(lsn_str, str)
return lsn_from_hex(lsn_str)
if detail['remote'] is None:
# No remote information at all. This happens right after creating
# a timeline, before any part of it has been uploaded to remote
# storage yet.
return 0
else:
lsn_str = detail['remote']['remote_consistent_lsn']
assert isinstance(lsn_str, str)
return lsn_from_hex(lsn_str)
def wait_for_upload(pageserver_http_client: ZenithPageserverHttpClient,
@@ -2074,8 +2120,15 @@ def wait_for_upload(pageserver_http_client: ZenithPageserverHttpClient,
timeline: uuid.UUID,
lsn: int):
"""waits for local timeline upload up to specified lsn"""
wait_for(10, 1, lambda: remote_consistent_lsn(pageserver_http_client, tenant, timeline) >= lsn)
for i in range(10):
current_lsn = remote_consistent_lsn(pageserver_http_client, tenant, timeline)
if current_lsn >= lsn:
return
log.info("waiting for remote_consistent_lsn to reach {}, now {}, iteration {}".format(
lsn_to_hex(lsn), lsn_to_hex(current_lsn), i + 1))
time.sleep(1)
raise Exception("timed out while waiting for remote_consistent_lsn to reach {}, was {}".format(
lsn_to_hex(lsn), lsn_to_hex(current_lsn)))
def last_record_lsn(pageserver_http_client: ZenithPageserverHttpClient,
@@ -2093,5 +2146,12 @@ def wait_for_last_record_lsn(pageserver_http_client: ZenithPageserverHttpClient,
timeline: uuid.UUID,
lsn: int):
"""waits for pageserver to catch up to a certain lsn"""
wait_for(10, 1, lambda: last_record_lsn(pageserver_http_client, tenant, timeline) >= lsn)
for i in range(10):
current_lsn = last_record_lsn(pageserver_http_client, tenant, timeline)
if current_lsn >= lsn:
return
log.info("waiting for last_record_lsn to reach {}, now {}, iteration {}".format(
lsn_to_hex(lsn), lsn_to_hex(current_lsn), i + 1))
time.sleep(1)
raise Exception("timed out while waiting for last_record_lsn to reach {}, was {}".format(
lsn_to_hex(lsn), lsn_to_hex(current_lsn)))


@@ -1,9 +1,11 @@
import pytest
from contextlib import closing
from fixtures.zenith_fixtures import ZenithEnvBuilder
from fixtures.benchmark_fixture import ZenithBenchmarker
# This test sometimes runs for longer than the global 5 minute timeout.
@pytest.mark.timeout(600)
def test_startup(zenith_env_builder: ZenithEnvBuilder, zenbenchmark: ZenithBenchmarker):
zenith_env_builder.num_safekeepers = 3
env = zenith_env_builder.init_start()