Compare commits

..

62 Commits

Author SHA1 Message Date
Konstantin Knizhnik
0a049ae17a Not working version 2022-05-05 08:53:21 +03:00
Konstantin Knizhnik
32557b16b4 Use prepared dictionary for layer reconstruction 2022-05-04 18:17:33 +03:00
Konstantin Knizhnik
076b8e3d04 Use zstd::bulk::Decompressor::decompress instead of decompress_to_buffer 2022-05-03 11:28:32 +03:00
Konstantin Knizhnik
39eadf6236 Use zstd::bulk::Decompressor to decode WAL records, to minimize the number of context initializations 2022-05-03 09:59:33 +03:00
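
The two commits above reuse a single zstd decompression context, optionally primed with a prepared dictionary, instead of creating a new one per WAL record. A minimal sketch of that pattern, assuming the zstd crate's bulk API (the dictionary and record contents here are made up for illustration):

    use zstd::bulk::{Compressor, Decompressor};

    fn main() -> std::io::Result<()> {
        // A raw "prepared dictionary"; in the layer code this would be built
        // from sampled WAL records rather than hard-coded.
        let dict = b"some shared prefix material";
        let records: Vec<Vec<u8>> = (0..4)
            .map(|i| format!("some shared prefix material, record {i}").into_bytes())
            .collect();

        // Create the contexts once and reuse them for every record, instead of
        // paying the context-initialization cost on each call.
        let mut compressor = Compressor::with_dictionary(3, dict)?;
        let mut decompressor = Decompressor::with_dictionary(dict)?;

        for rec in &records {
            let compressed = compressor.compress(rec)?;
            // The caller supplies an upper bound on the decompressed size.
            let restored = decompressor.decompress(&compressed, rec.len())?;
            assert_eq!(&restored, rec);
        }
        Ok(())
    }
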
Heikki Linnakangas
4472d49c1e Reuse the zstd Compressor context when building delta layer. 2022-05-03 01:47:39 +03:00
Konstantin Knizhnik
dc057ace2f Fix formatting 2022-05-02 07:58:07 +03:00
Konstantin Knizhnik
0e49d748b8 Fix bug in dictionary creation 2022-05-02 07:58:07 +03:00
Konstantin Knizhnik
fc7d1ba043 Do not compress delta layers if there are too few elements 2022-05-02 07:58:07 +03:00
Konstantin Knizhnik
e28b3dee37 Implement compression of image and delta layers 2022-05-02 07:58:07 +03:00
Dhammika Pathirana
992874c916 Fix update ps settings doc
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-05-01 13:52:08 -07:00
Dhammika Pathirana
3128e8c75c Fix tenant conf test
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-05-01 13:13:25 -07:00
Dhammika Pathirana
f3f12db2cb Add gc churn threshold knob (#1594)
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-05-01 13:13:17 -07:00
Andrey Taranik
038ea4c128 proxy notice message update (#1600) 2022-04-30 22:04:08 +03:00
Kirill Bulatov
7e1db8c8a1 Show which virtual file got the deserialization errors 2022-04-29 21:40:57 +03:00
Andrey Taranik
aa933d3961 proxy settings update for new domain (#1597) 2022-04-29 20:05:14 +03:00
Dmitry Rodionov
67b4e38092 Temporarily disable test_backpressure_received_lsn_lag 2022-04-29 15:53:56 +03:00
Dmitry Rodionov
05f8e6a050 Use fsync+rename for atomic downloads from remote storage
Use failpoint in test_remote_storage to check the behavior
2022-04-29 15:53:56 +03:00
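
The atomic-download commit above follows the usual shape: write to a temporary path, fsync the file, rename it over the final path, then fsync the parent directory so the rename itself is durable. A sketch using only std (paths and data are illustrative, not the pageserver's actual ones):

    use std::fs::{self, File};
    use std::io::Write;
    use std::path::Path;

    // Write `data` to `final_path` so that after a crash we see either the old
    // file or the complete new one, never a partial download.
    fn atomic_write(final_path: &Path, data: &[u8]) -> std::io::Result<()> {
        let tmp_path = final_path.with_extension("temp");

        // 1. Write to a temporary file and flush its contents to disk.
        let mut tmp = File::create(&tmp_path)?;
        tmp.write_all(data)?;
        tmp.sync_all()?;

        // 2. Atomically move the temporary file into place.
        fs::rename(&tmp_path, final_path)?;

        // 3. fsync the parent directory so the rename itself survives a crash
        //    (opening a directory read-only works on Unix).
        if let Some(parent) = final_path.parent() {
            File::open(parent)?.sync_all()?;
        }
        Ok(())
    }

    fn main() -> std::io::Result<()> {
        atomic_write(Path::new("/tmp/layer_download"), b"layer bytes")
    }
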
chaitanya sharma
76388abeb6 Rename READMEs with .md extension, and fix links to them.
Commit edba2e97 renamed pageserver/README to pageserver/README.md, but
forgot to update links to it. Fix.

Rename libs/postgres_ffi/README and safekeeper/README files to also
have the .md extension, so that github can render them nicely.

Quote ascii-diagram in safekeeper/README.md so that it renders
correctly.
2022-04-29 14:23:42 +03:00
Kirill Bulatov
2911eb084a Remove timeline files on detach 2022-04-29 09:19:18 +03:00
Kirill Bulatov
6cca57f95a Properly remove from the local timeline map 2022-04-29 09:19:18 +03:00
Kirill Bulatov
4a46b01caf Properly populate local timeline map 2022-04-29 09:19:18 +03:00
Anastasia Lubennikova
5c5c3c64f3 Fix tenant config parsing. Add a test 2022-04-28 11:49:19 +03:00
Arthur Petukhovsky
29539b0561 Set wal_keep_size to zero (#1507)
wal_keep_size is already set to 0 in our cloud setup, but we don't use this value in tests. This commit fixes wal_keep_size in control_plane and adds tests for WAL recycling and lagging safekeepers.
2022-04-27 19:09:28 +03:00
Dmitry Rodionov
695b5f9d88 Remove obsolete failpoint in proxy
When the failpoint feature is disabled, the macro discards the code passed to
it, so that code is not guaranteed to compile with the feature disabled. In this
particular case the code is obsolete, so remove it.
2022-04-27 14:34:33 +03:00
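
As the message notes, the fail crate's fail_point! macro compiles the enclosed closure only when the "failpoints" feature is on. A hedged sketch of how such a point is typically used (the failpoint name and error text are illustrative, not taken from the removed code):

    // Cargo.toml (assumed): fail = "0.5"; the closure below is only compiled
    // when the crate's "failpoints" feature is enabled.
    use fail::fail_point;

    fn download_layer() -> anyhow::Result<()> {
        // When triggered via the FAILPOINTS env var, return an injected error.
        fail_point!("layer-download", |_| Err(anyhow::anyhow!(
            "simulated download failure"
        )));
        // ... real download logic goes here ...
        Ok(())
    }

    fn main() {
        // Without FAILPOINTS set (or with the feature off) this just succeeds.
        download_layer().unwrap();
    }
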
Dhammika Pathirana
66694e736a Fix add ps tenant config
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-04-27 00:05:13 -07:00
Dhammika Pathirana
091cefaa92 Fix add compaction for key partitioning
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-04-27 00:05:13 -07:00
Dhammika Pathirana
aeb4f81c3b Add branch traversal unit test
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-04-27 00:05:13 -07:00
Dhammika Pathirana
6391862d8a Add branch traversal test
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-04-27 00:05:13 -07:00
Dhammika Pathirana
b2e35fffa6 Fix ancestor layer traversal (#1484)
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-04-27 00:05:13 -07:00
Arseny Sher
8b9d523f3c Remove old WAL on safekeepers.
Remove when it is consumed by all of 1) pageserver (remote_consistent_lsn) 2)
safekeeper peers 3) s3 WAL offloading.

In the test, s3 offloading is for now mocked by directly bumping s3_wal_lsn.

ref #1403
2022-04-26 23:02:23 +04:00
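
The removal rule above boils down to truncating WAL up to the minimum of the three consumption horizons. A hedged sketch (the function and parameter names are illustrative, not the safekeeper's actual fields):

    use std::cmp::min;

    /// Log sequence number, as a plain u64 for this sketch.
    type Lsn = u64;

    /// WAL older than this LSN has been consumed by everyone and may be removed.
    fn wal_removal_horizon(
        remote_consistent_lsn: Lsn, // what the pageserver has durably consumed
        min_peer_lsn: Lsn,          // the slowest safekeeper peer
        s3_wal_lsn: Lsn,            // how far WAL offloading to S3 has progressed
    ) -> Lsn {
        min(remote_consistent_lsn, min(min_peer_lsn, s3_wal_lsn))
    }

    fn main() {
        assert_eq!(wal_removal_horizon(0x30, 0x20, 0x40), 0x20);
    }
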
Arseny Sher
3fd234da07 Enable etcd for safekeepers in deploy. 2022-04-26 18:13:50 +04:00
Kirill Bulatov
778744d35c Limit concurrent S3 and IAM interactions 2022-04-26 13:49:37 +03:00
Dmitry Rodionov
eabf6f89e4 Use item.get for tenant config toml parsing
Previously we used the table interface, but there was no easy way to pass
it as an override to the pageserver through the CLI. Use the same strategy as
for the remote storage config parsing.
2022-04-26 10:15:19 +03:00
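
A minimal sketch of the item.get(...) style of reading optional values from parsed TOML, assuming the plain toml crate (the actual pageserver code may use a different TOML library); the keys mirror tenant settings that appear later in this diff:

    use toml::Value;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let doc: Value = r#"
            gc_horizon = 67108864
            gc_period = "100 s"
        "#
        .parse()?;

        // Look each setting up individually; absent keys simply stay None,
        // which makes the file easy to treat as a set of overrides.
        let gc_horizon: Option<i64> = doc.get("gc_horizon").and_then(Value::as_integer);
        let gc_period: Option<&str> = doc.get("gc_period").and_then(Value::as_str);

        println!("gc_horizon={:?} gc_period={:?}", gc_horizon, gc_period);
        Ok(())
    }
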
Kirill Bulatov
fec050ce97 Fix macos clippy issues 2022-04-25 16:23:34 +03:00
Kirill Bulatov
d060a97c54 Simplify clippy runs 2022-04-25 16:23:34 +03:00
Anastasia Lubennikova
78a6cb247f allow the users to create extensions: GRANT CREATE ON DATABASE 2022-04-25 15:35:44 +03:00
Kirill Bulatov
8f6a161271 Show better layer load errors 2022-04-25 14:54:39 +03:00
Andrey Taranik
56f6269a8e rename docker images to neondatabase docker account (#1570)
* rename docker images to neondatabase docker account

* docker images build fix (permissions for Cargo.lock)
2022-04-25 11:34:51 +03:00
Heikki Linnakangas
1fb3d08185 Use a 1-byte length header for short blobs.
Notably, this shaves 3 bytes from each small WAL record stored in
ephemeral or delta layers.
2022-04-22 21:31:27 +03:00
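
The commit above trades a fixed 4-byte length header for a variable one: small blobs get a single length byte, larger ones keep a 4-byte header with a marker bit. A hedged sketch of that kind of encoding (the exact on-disk format of the ephemeral/delta layers may differ):

    /// Encode a blob length: one byte for lengths below 0x80, otherwise four
    /// bytes in big-endian order with the top bit set as a marker.
    fn write_len(out: &mut Vec<u8>, len: usize) {
        if len < 0x80 {
            out.push(len as u8);
        } else {
            assert!(len <= 0x7fff_ffff);
            out.extend_from_slice(&((len as u32) | 0x8000_0000).to_be_bytes());
        }
    }

    /// Decode a length written by `write_len`, returning (length, header bytes).
    fn read_len(buf: &[u8]) -> (usize, usize) {
        if buf[0] & 0x80 == 0 {
            (buf[0] as usize, 1)
        } else {
            let mut hdr = [0u8; 4];
            hdr.copy_from_slice(&buf[..4]);
            ((u32::from_be_bytes(hdr) & 0x7fff_ffff) as usize, 4)
        }
    }

    fn main() {
        let mut buf = Vec::new();
        write_len(&mut buf, 42);      // 1 byte instead of 4: saves 3 bytes
        write_len(&mut buf, 100_000); // falls back to the 4-byte form
        assert_eq!(read_len(&buf), (42, 1));
        assert_eq!(read_len(&buf[1..]), (100_000, 4));
    }
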
bojanserafimov
867aede715 Add idle compute restart time test (#1514) 2022-04-22 10:45:47 -04:00
Dmitry Ivanov
d3f356e7a8 Update rust-postgres project-wide (#1525)
* Update `rust-postgres` project-wide

This commit points to https://github.com/neondatabase/rust-postgres/commits/neon
in order to test our patches on top of the latest version of this crate.

* [proxy] Update `hmac` and `sha2`
2022-04-22 17:31:58 +03:00
Konstantin Knizhnik
5f83c9290b Make it possible to specify per-tenant configuration parameters
Add tenant config API and 'zenith tenant config' CLI command.
Add 'show' query to the pageserver protocol for tenant-specific config parameters

Refactoring: move tenant_config code to a separate module.
Save the tenant conf file to the tenant's directory when the tenant is created, to recover it on pageserver restart.
Ignore errors during tenant config loading while it is not yet supported by the console.

Define PiTR interval for GC.

refer #1320
2022-04-22 11:24:29 +03:00
Heikki Linnakangas
a4700c9bbe Use pprof to get flamegraph of get_page and get_relsize requests.
This depends on a hacked version of the 'pprof-rs' crate. Because of
that, it's under an optional 'profiling' feature. It is disabled by
default, but enabled for release builds in CircleCI config. It doesn't
currently work on macOS.

The flamegraph is written to 'flamegraph.svg' in the pageserver
workdir when the 'pageserver' process exits.

Add a performance test that runs the perf_pgbench test, with profiling
enabled.
2022-04-21 20:32:48 +03:00
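
For reference, the usual shape of flamegraph generation with the pprof-rs crate is roughly as below; this is a sketch, not the pageserver's actual profiling hook, which relies on a patched crate behind the optional 'profiling' feature:

    // Cargo.toml (assumed): pprof = { version = "0.x", features = ["flamegraph"] }
    use std::fs::File;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Sample the process ~100 times per second while the guard is alive.
        let guard = pprof::ProfilerGuard::new(100)?;

        // ... serve get_page / get_relsize requests here; busy-loop as a stand-in ...
        let mut acc = 0u64;
        for i in 0..10_000_000u64 {
            acc = acc.wrapping_add(i * i);
        }
        println!("{acc}");

        // On shutdown, render everything sampled so far into an SVG flamegraph.
        let report = guard.report().build()?;
        report.flamegraph(File::create("flamegraph.svg")?)?;
        Ok(())
    }
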
Heikki Linnakangas
dafdf9b952 Handle EINTR 2022-04-21 16:37:36 +03:00
Heikki Linnakangas
263d60f12d Add prometheus metric for time spent waiting for WAL to arrive 2022-04-21 16:37:32 +03:00
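
A sketch of how such a wait-time metric is typically exposed with the prometheus crate that this repo's metrics helpers wrap; the metric name and the measured section are illustrative, not the ones added by this commit:

    use lazy_static::lazy_static;
    use prometheus::{register_histogram, Histogram};
    use std::time::Instant;

    lazy_static! {
        // Seconds spent blocked waiting for WAL to arrive before serving a request.
        static ref WAIT_LSN_SECONDS: Histogram = register_histogram!(
            "pageserver_wait_lsn_seconds",
            "Time spent waiting for WAL to arrive",
        )
        .unwrap();
    }

    fn wait_for_wal() {
        let start = Instant::now();
        // ... block until the requested LSN has arrived ...
        WAIT_LSN_SECONDS.observe(start.elapsed().as_secs_f64());
    }

    fn main() {
        wait_for_wal();
    }
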
Arseny Sher
abcd7a4b1f Insert less data in test_wal_restore.
Otherwise it sometimes hits 2m statement timeout in CI.
2022-04-21 16:00:15 +04:00
Kirill Bulatov
81cad6277a Move library crates into a dedicated directory and rename them 2022-04-21 13:30:33 +03:00
Kirill Bulatov
629688fd6c Drop redundant resolver setting for 2021 edition 2022-04-21 13:30:33 +03:00
Heikki Linnakangas
9d3779c124 Add a counter for materialized page cache hits. 2022-04-20 21:26:03 +03:00
Heikki Linnakangas
334a1d6b5d Fix materialized page caching with delta layers.
We only checked the cache page version when collecting WAL records in
an in-memory layer, not in a delta layer. Refactor the code so that we
always stop collecting WAL records when we reach a cached materialized
page.

Fix the assertion on the LSN range in
InMemoryLayer::get_value_reconstruct_data. It was supposed to check
that the requested LSN range is within the layer's LSN range, but the
inequality was backwards. That went unnoticed before, because the
caller always passed the layer's start LSN as the requested LSN
range's start LSN, but now we might stop the search earlier, if we have
a cached page version.

Co-authored-by: Konstantin Knizhnik <knizhnik@zenith.tech>
2022-04-20 21:25:59 +03:00
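
The fix described above amounts to stopping the backward walk through layers as soon as a cached materialized page is found, regardless of which layer kind is being read. A heavily simplified sketch of that control flow (types and names here are illustrative, not the pageserver's real ones):

    /// Outcome of reading one layer while reconstructing a page.
    enum ValueReconstructResult {
        Complete, // hit a cached materialized page or a full image: stop here
        Continue, // only WAL records so far: descend into the next, older layer
    }

    /// One layer's contribution: append its WAL records, and stop the search as
    /// soon as a cached materialized page covers the request, no matter whether
    /// this is an in-memory, delta, or image layer.
    fn visit_layer(
        collected: &mut Vec<&'static str>,
        layer_records: &[&'static str],
        has_cached_page: bool,
    ) -> ValueReconstructResult {
        collected.extend_from_slice(layer_records);
        if has_cached_page {
            ValueReconstructResult::Complete
        } else {
            ValueReconstructResult::Continue
        }
    }

    fn main() {
        let mut collected = Vec::new();
        // Newest layer first; the second layer has a cached materialized page.
        let layers = [(&["r3"][..], false), (&["r2"][..], true), (&["r1"][..], false)];
        for (records, cached) in layers {
            if let ValueReconstructResult::Complete = visit_layer(&mut collected, records, cached) {
                break; // older layers (and their records) are not needed
            }
        }
        assert_eq!(collected, ["r3", "r2"]);
    }
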
Dmitry Rodionov
e41ad3be0f add more context to writeback error 2022-04-20 17:07:07 +03:00
Heikki Linnakangas
e113c6fa8d Print a warning if unlinking an ephemeral file fails.
Unlink failure isn't serious on its own, since we were about to remove the
file anyway, but it shouldn't happen and could be a symptom of
something more serious.

We just saw "No such file or directory" errors happening from
ephemeral file writeback in staging, and I suspect if we had this
warning in place, we would have seen these warnings too, if the
problem was that the ephemeral file was removed before dropping the
EphemeralFile struct. Next time it happens, we'll have more
information.
2022-04-20 16:23:16 +03:00
Heikki Linnakangas
cbdfd8c719 Update 'routerify' dependency in proxy.
routerify version 3 is used in zenith_utils; use the same version in proxy
to avoid having to build two versions.
2022-04-20 14:42:05 +03:00
Heikki Linnakangas
86bf4301b7 Remove unnecessary dependency on 'webpki' 2022-04-20 14:36:54 +03:00
Heikki Linnakangas
9eaa21317c Update jsonwebtoken crate.
With this, we no longer need to build two versions of 'pem' and 'base64'
crates. It introduces a duplicate version of the 'time' crate, but it's
still progress.
2022-04-20 14:27:49 +03:00
Heikki Linnakangas
e660e12f79 Update rustls-split and rustls versions.
All dependencies now use rustls 0.20.2, so we no longer need to build two
versions of it.
2022-04-20 14:07:55 +03:00
Konstantin Knizhnik
ac52f4f2d6 Set superuser when initializing database for wal recovery (#1544) 2022-04-20 13:24:38 +03:00
Heikki Linnakangas
5e95338ee9 Improve logging in test_wal_restore.py
- Capture the output of restore_from_wal.sh in a log file
- Kill "restored" Postgres server on test failure
2022-04-20 11:18:40 +03:00
Heikki Linnakangas
170badd626 Capture the postgres log in all tests that start a vanilla Postgres. 2022-04-20 11:18:40 +03:00
Kirill Bulatov
91fb21225a Show more logs during S3 sync 2022-04-20 02:57:03 +03:00
Kirill Bulatov
3e6087a12f Remove S3 archiving 2022-04-19 23:13:52 +03:00
Kirill Bulatov
44bfc529f6 Require specifying the upload size in remote storage 2022-04-19 23:13:52 +03:00
172 changed files with 6290 additions and 4548 deletions

View File

@@ -1,14 +1,14 @@
- name: Upload Zenith binaries
- name: Upload Neon binaries
hosts: storage
gather_facts: False
remote_user: admin
tasks:
- name: get latest version of Zenith binaries
- name: get latest version of Neon binaries
register: current_version_file
set_fact:
current_version: "{{ lookup('file', '.zenith_current_version') | trim }}"
current_version: "{{ lookup('file', '.neon_current_version') | trim }}"
tags:
- pageserver
- safekeeper
@@ -19,11 +19,11 @@
- pageserver
- safekeeper
- name: upload and extract Zenith binaries to /usr/local
- name: upload and extract Neon binaries to /usr/local
ansible.builtin.unarchive:
owner: root
group: root
src: zenith_install.tar.gz
src: neon_install.tar.gz
dest: /usr/local
become: true
tags:

View File

@@ -4,10 +4,10 @@ set -e
RELEASE=${RELEASE:-false}
# look at docker hub for latest tag fo zenith docker image
# look at docker hub for latest tag for neon docker image
if [ "${RELEASE}" = "true" ]; then
echo "search latest relase tag"
VERSION=$(curl -s https://registry.hub.docker.com/v1/repositories/zenithdb/zenith/tags |jq -r -S '.[].name' | grep release | sed 's/release-//g' | tail -1)
VERSION=$(curl -s https://registry.hub.docker.com/v1/repositories/neondatabase/neon/tags |jq -r -S '.[].name' | grep release | sed 's/release-//g' | tail -1)
if [ -z "${VERSION}" ]; then
echo "no any docker tags found, exiting..."
exit 1
@@ -16,7 +16,7 @@ if [ "${RELEASE}" = "true" ]; then
fi
else
echo "search latest dev tag"
VERSION=$(curl -s https://registry.hub.docker.com/v1/repositories/zenithdb/zenith/tags |jq -r -S '.[].name' | grep -v release | tail -1)
VERSION=$(curl -s https://registry.hub.docker.com/v1/repositories/neondatabase/neon/tags |jq -r -S '.[].name' | grep -v release | tail -1)
if [ -z "${VERSION}" ]; then
echo "no any docker tags found, exiting..."
exit 1
@@ -28,25 +28,25 @@ fi
echo "found ${VERSION}"
# do initial cleanup
rm -rf zenith_install postgres_install.tar.gz zenith_install.tar.gz .zenith_current_version
mkdir zenith_install
rm -rf neon_install postgres_install.tar.gz neon_install.tar.gz .neon_current_version
mkdir neon_install
# retrive binaries from docker image
echo "getting binaries from docker image"
docker pull --quiet zenithdb/zenith:${TAG}
ID=$(docker create zenithdb/zenith:${TAG})
docker pull --quiet neondatabase/neon:${TAG}
ID=$(docker create neondatabase/neon:${TAG})
docker cp ${ID}:/data/postgres_install.tar.gz .
tar -xzf postgres_install.tar.gz -C zenith_install
docker cp ${ID}:/usr/local/bin/pageserver zenith_install/bin/
docker cp ${ID}:/usr/local/bin/safekeeper zenith_install/bin/
docker cp ${ID}:/usr/local/bin/proxy zenith_install/bin/
docker cp ${ID}:/usr/local/bin/postgres zenith_install/bin/
tar -xzf postgres_install.tar.gz -C neon_install
docker cp ${ID}:/usr/local/bin/pageserver neon_install/bin/
docker cp ${ID}:/usr/local/bin/safekeeper neon_install/bin/
docker cp ${ID}:/usr/local/bin/proxy neon_install/bin/
docker cp ${ID}:/usr/local/bin/postgres neon_install/bin/
docker rm -vf ${ID}
# store version to file (for ansible playbooks) and create binaries tarball
echo ${VERSION} > zenith_install/.zenith_current_version
echo ${VERSION} > .zenith_current_version
tar -czf zenith_install.tar.gz -C zenith_install .
echo ${VERSION} > neon_install/.neon_current_version
echo ${VERSION} > .neon_current_version
tar -czf neon_install.tar.gz -C neon_install .
# do final cleaup
rm -rf zenith_install postgres_install.tar.gz
rm -rf neon_install postgres_install.tar.gz

View File

@@ -14,3 +14,4 @@ safekeepers
console_mgmt_base_url = http://console-release.local
bucket_name = zenith-storage-oregon
bucket_region = us-west-2
etcd_endpoints = etcd-release.local:2379

View File

@@ -15,3 +15,4 @@ safekeepers
console_mgmt_base_url = http://console-staging.local
bucket_name = zenith-staging-storage-us-east-1
bucket_region = us-east-1
etcd_endpoints = etcd-staging.local:2379

View File

@@ -6,7 +6,7 @@ After=network.target auditd.service
Type=simple
User=safekeeper
Environment=RUST_BACKTRACE=1 ZENITH_REPO_DIR=/storage/safekeeper/data LD_LIBRARY_PATH=/usr/local/lib
ExecStart=/usr/local/bin/safekeeper -l {{ inventory_hostname }}.local:6500 --listen-http {{ inventory_hostname }}.local:7676 -p {{ first_pageserver }}:6400 -D /storage/safekeeper/data
ExecStart=/usr/local/bin/safekeeper -l {{ inventory_hostname }}.local:6500 --listen-http {{ inventory_hostname }}.local:7676 -p {{ first_pageserver }}:6400 -D /storage/safekeeper/data --broker-endpoints={{ etcd_endpoints }}
ExecReload=/bin/kill -HUP $MAINPID
KillMode=mixed
KillSignal=SIGINT

View File

@@ -1,18 +1,18 @@
version: 2.1
executors:
zenith-xlarge-executor:
neon-xlarge-executor:
resource_class: xlarge
docker:
# NB: when changed, do not forget to update rust image tag in all Dockerfiles
- image: zimg/rust:1.58
zenith-executor:
neon-executor:
docker:
- image: zimg/rust:1.58
jobs:
check-codestyle-rust:
executor: zenith-xlarge-executor
executor: neon-xlarge-executor
steps:
- checkout
- run:
@@ -22,7 +22,7 @@ jobs:
# A job to build postgres
build-postgres:
executor: zenith-xlarge-executor
executor: neon-xlarge-executor
parameters:
build_type:
type: enum
@@ -67,9 +67,9 @@ jobs:
paths:
- tmp_install
# A job to build zenith rust code
build-zenith:
executor: zenith-xlarge-executor
# A job to build Neon rust code
build-neon:
executor: neon-xlarge-executor
parameters:
build_type:
type: enum
@@ -113,7 +113,7 @@ jobs:
CARGO_FLAGS=
elif [[ $BUILD_TYPE == "release" ]]; then
cov_prefix=()
CARGO_FLAGS=--release
CARGO_FLAGS="--release --features profiling"
fi
export CARGO_INCREMENTAL=0
@@ -132,20 +132,6 @@ jobs:
- ~/.cargo/git
- target
# Run style checks
# has to run separately from cargo fmt section
# since needs to run with dependencies
- run:
name: cargo clippy
command: |
if [[ $BUILD_TYPE == "debug" ]]; then
cov_prefix=(scripts/coverage "--profraw-prefix=$CIRCLE_JOB" --dir=/tmp/zenith/coverage run)
elif [[ $BUILD_TYPE == "release" ]]; then
cov_prefix=()
fi
"${cov_prefix[@]}" ./run_clippy.sh
# Run rust unit tests
- run:
name: cargo test
@@ -223,7 +209,7 @@ jobs:
- "*"
check-codestyle-python:
executor: zenith-executor
executor: neon-executor
steps:
- checkout
- restore_cache:
@@ -246,7 +232,7 @@ jobs:
command: poetry run mypy .
run-pytest:
executor: zenith-executor
executor: neon-executor
parameters:
# pytest args to specify the tests to run.
#
@@ -369,7 +355,7 @@ jobs:
when: always
command: |
du -sh /tmp/test_output/*
find /tmp/test_output -type f ! -name "pg.log" ! -name "pageserver.log" ! -name "safekeeper.log" ! -name "regression.diffs" ! -name "junit.xml" ! -name "*.filediff" ! -name "*.stdout" ! -name "*.stderr" -delete
find /tmp/test_output -type f ! -name "pg.log" ! -name "pageserver.log" ! -name "safekeeper.log" ! -name "regression.diffs" ! -name "junit.xml" ! -name "*.filediff" ! -name "*.stdout" ! -name "*.stderr" ! -name "flamegraph.svg" -delete
du -sh /tmp/test_output/*
- store_artifacts:
path: /tmp/test_output
@@ -390,7 +376,7 @@ jobs:
- "*"
coverage-report:
executor: zenith-xlarge-executor
executor: neon-xlarge-executor
steps:
- attach_workspace:
at: /tmp/zenith
@@ -420,7 +406,7 @@ jobs:
COMMIT_URL=https://github.com/neondatabase/neon/commit/$CIRCLE_SHA1
scripts/git-upload \
--repo=https://$VIP_VAP_ACCESS_TOKEN@github.com/zenithdb/zenith-coverage-data.git \
--repo=https://$VIP_VAP_ACCESS_TOKEN@github.com/neondatabase/zenith-coverage-data.git \
--message="Add code coverage for $COMMIT_URL" \
copy /tmp/zenith/coverage/report $CIRCLE_SHA1 # COPY FROM TO_RELATIVE
@@ -437,7 +423,7 @@ jobs:
\"target_url\": \"$REPORT_URL\"
}"
# Build zenithdb/zenith:latest image and push it to Docker hub
# Build neondatabase/neon:latest image and push it to Docker hub
docker-image:
docker:
- image: cimg/base:2021.04
@@ -451,18 +437,18 @@ jobs:
- run:
name: Build and push Docker image
command: |
echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin
echo $NEON_DOCKER_PWD | docker login -u $NEON_DOCKER_LOGIN --password-stdin
DOCKER_TAG=$(git log --oneline|wc -l)
docker build \
--pull \
--build-arg GIT_VERSION=${CIRCLE_SHA1} \
--build-arg AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}" \
--build-arg AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}" \
--tag zenithdb/zenith:${DOCKER_TAG} --tag zenithdb/zenith:latest .
docker push zenithdb/zenith:${DOCKER_TAG}
docker push zenithdb/zenith:latest
--tag neondatabase/neon:${DOCKER_TAG} --tag neondatabase/neon:latest .
docker push neondatabase/neon:${DOCKER_TAG}
docker push neondatabase/neon:latest
# Build zenithdb/compute-node:latest image and push it to Docker hub
# Build neondatabase/compute-node:latest image and push it to Docker hub
docker-image-compute:
docker:
- image: cimg/base:2021.04
@@ -470,31 +456,31 @@ jobs:
- checkout
- setup_remote_docker:
docker_layer_caching: true
# Build zenithdb/compute-tools:latest image and push it to Docker hub
# Build neondatabase/compute-tools:latest image and push it to Docker hub
# TODO: this should probably also use versioned tag, not just :latest.
# XXX: but should it? We build and use it only locally now.
- run:
name: Build and push compute-tools Docker image
command: |
echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin
echo $NEON_DOCKER_PWD | docker login -u $NEON_DOCKER_LOGIN --password-stdin
docker build \
--build-arg AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}" \
--build-arg AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}" \
--tag zenithdb/compute-tools:latest -f Dockerfile.compute-tools .
docker push zenithdb/compute-tools:latest
--tag neondatabase/compute-tools:latest -f Dockerfile.compute-tools .
docker push neondatabase/compute-tools:latest
- run:
name: Init postgres submodule
command: git submodule update --init --depth 1
- run:
name: Build and push compute-node Docker image
command: |
echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin
echo $NEON_DOCKER_PWD | docker login -u $NEON_DOCKER_LOGIN --password-stdin
DOCKER_TAG=$(git log --oneline|wc -l)
docker build --tag zenithdb/compute-node:${DOCKER_TAG} --tag zenithdb/compute-node:latest vendor/postgres
docker push zenithdb/compute-node:${DOCKER_TAG}
docker push zenithdb/compute-node:latest
docker build --tag neondatabase/compute-node:${DOCKER_TAG} --tag neondatabase/compute-node:latest vendor/postgres
docker push neondatabase/compute-node:${DOCKER_TAG}
docker push neondatabase/compute-node:latest
# Build production zenithdb/zenith:release image and push it to Docker hub
# Build production neondatabase/neon:release image and push it to Docker hub
docker-image-release:
docker:
- image: cimg/base:2021.04
@@ -508,18 +494,18 @@ jobs:
- run:
name: Build and push Docker image
command: |
echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin
echo $NEON_DOCKER_PWD | docker login -u $NEON_DOCKER_LOGIN --password-stdin
DOCKER_TAG="release-$(git log --oneline|wc -l)"
docker build \
--pull \
--build-arg GIT_VERSION=${CIRCLE_SHA1} \
--build-arg AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}" \
--build-arg AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}" \
--tag zenithdb/zenith:${DOCKER_TAG} --tag zenithdb/zenith:release .
docker push zenithdb/zenith:${DOCKER_TAG}
docker push zenithdb/zenith:release
--tag neondatabase/neon:${DOCKER_TAG} --tag neondatabase/neon:release .
docker push neondatabase/neon:${DOCKER_TAG}
docker push neondatabase/neon:release
# Build production zenithdb/compute-node:release image and push it to Docker hub
# Build production neondatabase/compute-node:release image and push it to Docker hub
docker-image-compute-release:
docker:
- image: cimg/base:2021.04
@@ -527,29 +513,29 @@ jobs:
- checkout
- setup_remote_docker:
docker_layer_caching: true
# Build zenithdb/compute-tools:release image and push it to Docker hub
# Build neondatabase/compute-tools:release image and push it to Docker hub
# TODO: this should probably also use versioned tag, not just :latest.
# XXX: but should it? We build and use it only locally now.
- run:
name: Build and push compute-tools Docker image
command: |
echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin
echo $NEON_DOCKER_PWD | docker login -u $NEON_DOCKER_LOGIN --password-stdin
docker build \
--build-arg AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}" \
--build-arg AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}" \
--tag zenithdb/compute-tools:release -f Dockerfile.compute-tools .
docker push zenithdb/compute-tools:release
--tag neondatabase/compute-tools:release -f Dockerfile.compute-tools .
docker push neondatabase/compute-tools:release
- run:
name: Init postgres submodule
command: git submodule update --init --depth 1
- run:
name: Build and push compute-node Docker image
command: |
echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin
echo $NEON_DOCKER_PWD | docker login -u $NEON_DOCKER_LOGIN --password-stdin
DOCKER_TAG="release-$(git log --oneline|wc -l)"
docker build --tag zenithdb/compute-node:${DOCKER_TAG} --tag zenithdb/compute-node:release vendor/postgres
docker push zenithdb/compute-node:${DOCKER_TAG}
docker push zenithdb/compute-node:release
docker build --tag neondatabase/compute-node:${DOCKER_TAG} --tag neondatabase/compute-node:release vendor/postgres
docker push neondatabase/compute-node:${DOCKER_TAG}
docker push neondatabase/compute-node:release
deploy-staging:
docker:
@@ -575,7 +561,7 @@ jobs:
rm -f ssh-key ssh-key-cert.pub
ansible-playbook deploy.yaml -i staging.hosts
rm -f zenith_install.tar.gz .zenith_current_version
rm -f neon_install.tar.gz .neon_current_version
deploy-staging-proxy:
docker:
@@ -625,7 +611,7 @@ jobs:
rm -f ssh-key ssh-key-cert.pub
ansible-playbook deploy.yaml -i production.hosts
rm -f zenith_install.tar.gz .zenith_current_version
rm -f neon_install.tar.gz .neon_current_version
deploy-release-proxy:
docker:
@@ -704,8 +690,8 @@ workflows:
matrix:
parameters:
build_type: ["debug", "release"]
- build-zenith:
name: build-zenith-<< matrix.build_type >>
- build-neon:
name: build-neon-<< matrix.build_type >>
matrix:
parameters:
build_type: ["debug", "release"]
@@ -720,7 +706,7 @@ workflows:
test_selection: batch_pg_regress
needs_postgres_source: true
requires:
- build-zenith-<< matrix.build_type >>
- build-neon-<< matrix.build_type >>
- run-pytest:
name: other-tests-<< matrix.build_type >>
matrix:
@@ -728,7 +714,7 @@ workflows:
build_type: ["debug", "release"]
test_selection: batch_others
requires:
- build-zenith-<< matrix.build_type >>
- build-neon-<< matrix.build_type >>
- run-pytest:
name: benchmarks
context: PERF_TEST_RESULT_CONNSTR
@@ -737,7 +723,7 @@ workflows:
run_in_parallel: false
save_perf_report: true
requires:
- build-zenith-release
- build-neon-release
- coverage-report:
# Context passes credentials for gh api
context: CI_ACCESS_TOKEN
@@ -833,6 +819,6 @@ workflows:
# XXX: Successful build doesn't mean everything is OK, but
# the job to be triggered takes so much time to complete (~22 min)
# that it's better not to wait for the commented-out steps
- build-zenith-release
- build-neon-release
# - pg_regress-tests-release
# - other-tests-release

View File

@@ -1,9 +1,12 @@
# Helm chart values for zenith-proxy.
# This is a YAML-formatted file.
image:
repository: neondatabase/neon
settings:
authEndpoint: "https://console.zenith.tech/authenticate_proxy_request/"
uri: "https://console.zenith.tech/psql_session/"
authEndpoint: "https://console.neon.tech/authenticate_proxy_request/"
uri: "https://console.neon.tech/psql_session/"
# -- Additional labels for zenith-proxy pods
podLabels:
@@ -25,7 +28,7 @@ exposedService:
service.beta.kubernetes.io/aws-load-balancer-type: external
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
external-dns.alpha.kubernetes.io/hostname: start.zenith.tech
external-dns.alpha.kubernetes.io/hostname: start.zenith.tech,connect.neon.tech,pg.neon.tech
metrics:
enabled: true

View File

@@ -1,9 +1,12 @@
# Helm chart values for zenith-proxy.
# This is a YAML-formatted file.
image:
repository: neondatabase/neon
settings:
authEndpoint: "https://console.stage.zenith.tech/authenticate_proxy_request/"
uri: "https://console.stage.zenith.tech/psql_session/"
authEndpoint: "https://console.stage.neon.tech/authenticate_proxy_request/"
uri: "https://console.stage.neon.tech/psql_session/"
# -- Additional labels for zenith-proxy pods
podLabels:
@@ -17,7 +20,7 @@ exposedService:
service.beta.kubernetes.io/aws-load-balancer-type: external
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
external-dns.alpha.kubernetes.io/hostname: start.stage.zenith.tech
external-dns.alpha.kubernetes.io/hostname: connect.stage.neon.tech
metrics:
enabled: true

View File

@@ -10,6 +10,8 @@ dep-format-version = "2"
# Hakari works much better with the new feature resolver.
# For more about the new feature resolver, see:
# https://blog.rust-lang.org/2021/03/25/Rust-1.51.0.html#cargos-new-feature-resolver
# Have to keep the resolver still here since hakari requires this field,
# despite it's now the default for 2021 edition & cargo.
resolver = "2"
# Add triples corresponding to platforms commonly used by developers here.

View File

@@ -36,8 +36,7 @@ jobs:
- name: Install macOs postgres dependencies
if: matrix.os == 'macos-latest'
run: |
brew install flex bison
run: brew install flex bison
- name: Set pg revision for caching
id: pg_ver
@@ -53,8 +52,7 @@ jobs:
- name: Build postgres
if: steps.cache_pg.outputs.cache-hit != 'true'
run: |
make postgres
run: make postgres
- name: Cache cargo deps
id: cache_cargo
@@ -64,13 +62,10 @@ jobs:
~/.cargo/registry
~/.cargo/git
target
key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}
key: ${{ runner.os }}-cargo-${{ hashFiles('./Cargo.lock') }}
# Use `env CARGO_INCREMENTAL=0` to mitigate https://github.com/rust-lang/rust/issues/91696 for rustc 1.57.0
- name: Run cargo build
run: |
env CARGO_INCREMENTAL=0 cargo build --workspace --bins --examples --tests
- name: Run cargo clippy
run: ./run_clippy.sh
- name: Run cargo test
run: |
env CARGO_INCREMENTAL=0 cargo test -- --nocapture --test-threads=1
run: cargo test --all --all-targets

758 Cargo.lock (generated)

File diff suppressed because it is too large.

View File

@@ -3,15 +3,12 @@ members = [
"compute_tools",
"control_plane",
"pageserver",
"postgres_ffi",
"proxy",
"safekeeper",
"workspace_hack",
"zenith",
"zenith_metrics",
"zenith_utils",
"libs/*",
]
resolver = "2"
[profile.release]
# This is useful for profiling and, to some extent, debug.
@@ -21,4 +18,4 @@ debug = true
# This is only needed for proxy's tests.
# TODO: we should probably fork `tokio-postgres-rustls` instead.
[patch.crates-io]
tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }

View File

@@ -26,7 +26,9 @@ COPY . .
# Show build caching stats to check if it was used in the end.
# Has to be the part of the same RUN since cachepot daemon is killed in the end of this RUN, loosing the compilation stats.
RUN mold -run cargo build --release && cachepot -s
RUN set -e \
&& sudo -E "PATH=$PATH" mold -run cargo build --release \
&& cachepot -s
# Build final image
#

View File

@@ -8,7 +8,9 @@ ARG AWS_SECRET_ACCESS_KEY
COPY . .
RUN mold -run cargo build -p compute_tools --release && cachepot -s
RUN set -e \
&& sudo -E "PATH=$PATH" mold -run cargo build -p compute_tools --release \
&& cachepot -s
# Final image that only has one binary
FROM debian:buster-slim

View File

@@ -11,11 +11,11 @@ clap = "3.0"
env_logger = "0.9"
hyper = { version = "0.14", features = ["full"] }
log = { version = "0.4", features = ["std", "serde"] }
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
regex = "1"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
tar = "0.4"
tokio = { version = "1.17", features = ["macros", "rt", "rt-multi-thread"] }
tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
workspace_hack = { version = "0.1", path = "../workspace_hack" }

View File

@@ -129,6 +129,7 @@ fn run_compute(state: &Arc<RwLock<ComputeState>>) -> Result<ExitStatus> {
handle_roles(&read_state.spec, &mut client)?;
handle_databases(&read_state.spec, &mut client)?;
handle_grants(&read_state.spec, &mut client)?;
create_writablity_check_data(&mut client)?;
// 'Close' connection
@@ -157,7 +158,7 @@ fn run_compute(state: &Arc<RwLock<ComputeState>>) -> Result<ExitStatus> {
}
fn main() -> Result<()> {
// TODO: re-use `zenith_utils::logging` later
// TODO: re-use `utils::logging` later
init_logger(DEFAULT_LOG_LEVEL)?;
// Env variable is set by `cargo`

View File

@@ -244,3 +244,24 @@ pub fn handle_databases(spec: &ClusterSpec, client: &mut Client) -> Result<()> {
Ok(())
}
// Grant CREATE ON DATABASE to the database owner
// to allow clients create trusted extensions.
pub fn handle_grants(spec: &ClusterSpec, client: &mut Client) -> Result<()> {
info!("cluster spec grants:");
for db in &spec.cluster.databases {
let dbname = &db.name;
let query: String = format!(
"GRANT CREATE ON DATABASE {} TO {}",
dbname.quote(),
db.owner.quote()
);
info!("grant query {}", &query);
client.execute(query.as_str(), &[])?;
}
Ok(())
}

View File

@@ -5,7 +5,7 @@ edition = "2021"
[dependencies]
tar = "0.4.33"
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
serde = { version = "1.0", features = ["derive"] }
serde_with = "1.12.0"
toml = "0.5"
@@ -19,5 +19,5 @@ reqwest = { version = "0.11", default-features = false, features = ["blocking",
pageserver = { path = "../pageserver" }
safekeeper = { path = "../safekeeper" }
zenith_utils = { path = "../zenith_utils" }
utils = { path = "../libs/utils" }
workspace_hack = { version = "0.1", path = "../workspace_hack" }

View File

@@ -11,11 +11,12 @@ use std::sync::Arc;
use std::time::Duration;
use anyhow::{Context, Result};
use zenith_utils::connstring::connection_host_port;
use zenith_utils::lsn::Lsn;
use zenith_utils::postgres_backend::AuthType;
use zenith_utils::zid::ZTenantId;
use zenith_utils::zid::ZTimelineId;
use utils::{
connstring::connection_host_port,
lsn::Lsn,
postgres_backend::AuthType,
zid::{ZTenantId, ZTimelineId},
};
use crate::local_env::LocalEnv;
use crate::postgresql_conf::PostgresConf;
@@ -272,12 +273,7 @@ impl PostgresNode {
conf.append("wal_sender_timeout", "5s");
conf.append("listen_addresses", &self.address.ip().to_string());
conf.append("port", &self.address.port().to_string());
// Never clean up old WAL. TODO: We should use a replication
// slot or something proper, to prevent the compute node
// from removing WAL that hasn't been streamed to the safekeeper or
// page server yet. (gh issue #349)
conf.append("wal_keep_size", "10TB");
conf.append("wal_keep_size", "0");
// Configure the node to fetch pages from pageserver
let pageserver_connstr = {

View File

@@ -11,9 +11,11 @@ use std::env;
use std::fs;
use std::path::{Path, PathBuf};
use std::process::{Command, Stdio};
use zenith_utils::auth::{encode_from_key_file, Claims, Scope};
use zenith_utils::postgres_backend::AuthType;
use zenith_utils::zid::{ZNodeId, ZTenantId, ZTenantTimelineId, ZTimelineId};
use utils::{
auth::{encode_from_key_file, Claims, Scope},
postgres_backend::AuthType,
zid::{ZNodeId, ZTenantId, ZTenantTimelineId, ZTimelineId},
};
use crate::safekeeper::SafekeeperNode;

View File

@@ -15,13 +15,15 @@ use reqwest::blocking::{Client, RequestBuilder, Response};
use reqwest::{IntoUrl, Method};
use safekeeper::http::models::TimelineCreateRequest;
use thiserror::Error;
use zenith_utils::http::error::HttpErrorBody;
use zenith_utils::zid::{ZNodeId, ZTenantId, ZTimelineId};
use utils::{
connstring::connection_address,
http::error::HttpErrorBody,
zid::{ZNodeId, ZTenantId, ZTimelineId},
};
use crate::local_env::{LocalEnv, SafekeeperConf};
use crate::storage::PageServerNode;
use crate::{fill_rust_env_vars, read_pidfile};
use zenith_utils::connstring::connection_address;
#[derive(Error, Debug)]
pub enum SafekeeperHttpError {

View File

@@ -1,3 +1,4 @@
use std::collections::HashMap;
use std::io::Write;
use std::net::TcpStream;
use std::path::PathBuf;
@@ -9,21 +10,23 @@ use anyhow::{bail, Context};
use nix::errno::Errno;
use nix::sys::signal::{kill, Signal};
use nix::unistd::Pid;
use pageserver::http::models::{TenantCreateRequest, TimelineCreateRequest};
use pageserver::http::models::{TenantConfigRequest, TenantCreateRequest, TimelineCreateRequest};
use pageserver::timelines::TimelineInfo;
use postgres::{Config, NoTls};
use reqwest::blocking::{Client, RequestBuilder, Response};
use reqwest::{IntoUrl, Method};
use thiserror::Error;
use zenith_utils::http::error::HttpErrorBody;
use zenith_utils::lsn::Lsn;
use zenith_utils::postgres_backend::AuthType;
use zenith_utils::zid::{ZTenantId, ZTimelineId};
use utils::{
connstring::connection_address,
http::error::HttpErrorBody,
lsn::Lsn,
postgres_backend::AuthType,
zid::{ZTenantId, ZTimelineId},
};
use crate::local_env::LocalEnv;
use crate::{fill_rust_env_vars, read_pidfile};
use pageserver::tenant_mgr::TenantInfo;
use zenith_utils::connstring::connection_address;
#[derive(Error, Debug)]
pub enum PageserverHttpError {
@@ -342,10 +345,36 @@ impl PageServerNode {
pub fn tenant_create(
&self,
new_tenant_id: Option<ZTenantId>,
settings: HashMap<&str, &str>,
) -> anyhow::Result<Option<ZTenantId>> {
let tenant_id_string = self
.http_request(Method::POST, format!("{}/tenant", self.http_base_url))
.json(&TenantCreateRequest { new_tenant_id })
.json(&TenantCreateRequest {
new_tenant_id,
checkpoint_distance: settings
.get("checkpoint_distance")
.map(|x| x.parse::<u64>())
.transpose()?,
compaction_target_size: settings
.get("compaction_target_size")
.map(|x| x.parse::<u64>())
.transpose()?,
compaction_period: settings.get("compaction_period").map(|x| x.to_string()),
compaction_threshold: settings
.get("compaction_threshold")
.map(|x| x.parse::<usize>())
.transpose()?,
gc_horizon: settings
.get("gc_horizon")
.map(|x| x.parse::<u64>())
.transpose()?,
gc_period: settings.get("gc_period").map(|x| x.to_string()),
image_creation_threshold: settings
.get("image_creation_threshold")
.map(|x| x.parse::<usize>())
.transpose()?,
pitr_interval: settings.get("pitr_interval").map(|x| x.to_string()),
})
.send()?
.error_from_body()?
.json::<Option<String>>()?;
@@ -362,6 +391,35 @@ impl PageServerNode {
.transpose()
}
pub fn tenant_config(&self, tenant_id: ZTenantId, settings: HashMap<&str, &str>) -> Result<()> {
self.http_request(Method::PUT, format!("{}/tenant/config", self.http_base_url))
.json(&TenantConfigRequest {
tenant_id,
checkpoint_distance: settings
.get("checkpoint_distance")
.map(|x| x.parse::<u64>().unwrap()),
compaction_target_size: settings
.get("compaction_target_size")
.map(|x| x.parse::<u64>().unwrap()),
compaction_period: settings.get("compaction_period").map(|x| x.to_string()),
compaction_threshold: settings
.get("compaction_threshold")
.map(|x| x.parse::<usize>().unwrap()),
gc_horizon: settings
.get("gc_horizon")
.map(|x| x.parse::<u64>().unwrap()),
gc_period: settings.get("gc_period").map(|x| x.to_string()),
image_creation_threshold: settings
.get("image_creation_threshold")
.map(|x| x.parse::<usize>().unwrap()),
pitr_interval: settings.get("pitr_interval").map(|x| x.to_string()),
})
.send()?
.error_from_body()?;
Ok(())
}
pub fn timeline_list(&self, tenant_id: &ZTenantId) -> anyhow::Result<Vec<TimelineInfo>> {
let timeline_infos: Vec<TimelineInfo> = self
.http_request(

View File

@@ -7,8 +7,8 @@
- [glossary.md](glossary.md) — Glossary of all the terms used in codebase.
- [multitenancy.md](multitenancy.md) — how multitenancy is organized in the pageserver and Zenith CLI.
- [sourcetree.md](sourcetree.md) — Overview of the source tree layeout.
- [pageserver/README](/pageserver/README) — pageserver overview.
- [postgres_ffi/README](/postgres_ffi/README) — Postgres FFI overview.
- [pageserver/README.md](/pageserver/README.md) — pageserver overview.
- [postgres_ffi/README.md](/libs/postgres_ffi/README.md) — Postgres FFI overview.
- [test_runner/README.md](/test_runner/README.md) — tests infrastructure overview.
- [safekeeper/README](/safekeeper/README) — WAL service overview.
- [safekeeper/README.md](/safekeeper/README.md) — WAL service overview.
- [core_changes.md](core_changes.md) - Description of Zenith changes in Postgres core

View File

@@ -27,4 +27,4 @@ management_token = jwt.encode({"scope": "pageserverapi"}, auth_keys.priv, algori
tenant_token = jwt.encode({"scope": "tenant", "tenant_id": ps.initial_tenant}, auth_keys.priv, algorithm="RS256")
```
Utility functions to work with jwts in rust are located in zenith_utils/src/auth.rs
Utility functions to work with jwts in rust are located in libs/utils/src/auth.rs

View File

@@ -74,6 +74,10 @@ Every `compaction_period` seconds, the page server checks if
maintenance operations, like compaction, are needed on the layer
files. Default is 1 s, which should be fine.
#### compaction_target_size
File sizes for L0 delta and L1 image layers. Default is 128MB.
#### gc_horizon
`gz_horizon` determines how much history is retained, to allow
@@ -85,6 +89,14 @@ away.
Interval at which garbage collection is triggered. Default is 100 s.
#### image_creation_threshold
L0 delta layer threshold for L1 iamge layer creation. Default is 3.
#### pitr_interval
WAL retention duration for PITR branching. Default is 30 days.
#### initial_superuser_name
Name of the initial superuser role, passed to initdb when a new tenant
@@ -156,6 +168,9 @@ access_key_id = 'SOMEKEYAAAAASADSAH*#'
# Secret access key to connect to the bucket ("password" part of the credentials)
secret_access_key = 'SOMEsEcReTsd292v'
# S3 API query limit to avoid getting errors/throttling from AWS.
concurrency_limit = 100
```
###### General remote storage configuration
@@ -167,8 +182,8 @@ Besides, there are parameters common for all types of remote storage that can be
```toml
[remote_storage]
# Max number of concurrent connections to open for uploading to or downloading from the remote storage.
max_concurrent_sync = 100
# Max number of concurrent timeline synchronized (layers uploaded or downloaded) with the remote storage at the same time.
max_concurrent_timelines_sync = 50
# Max number of errors a single task can have before it's considered failed and not attempted to run anymore.
max_sync_errors = 10

View File

@@ -28,12 +28,7 @@ The pageserver has a few different duties:
- Receive WAL from the WAL service and decode it.
- Replay WAL that's applicable to the chunks that the Page Server maintains
For more detailed info, see `/pageserver/README`
`/postgres_ffi`:
Utility functions for interacting with PostgreSQL file formats.
Misc constants, copied from PostgreSQL headers.
For more detailed info, see [/pageserver/README](/pageserver/README.md)
`/proxy`:
@@ -62,7 +57,7 @@ PostgreSQL extension that contains functions needed for testing and debugging.
The zenith WAL service that receives WAL from a primary compute nodes and streams it to the pageserver.
It acts as a holding area and redistribution center for recently generated WAL.
For more detailed info, see `/safekeeper/README`
For more detailed info, see [/safekeeper/README](/safekeeper/README.md)
`/workspace_hack`:
The workspace_hack crate exists only to pin down some dependencies.
@@ -74,14 +69,21 @@ We use [cargo-hakari](https://crates.io/crates/cargo-hakari) for automation.
Main entry point for the 'zenith' CLI utility.
TODO: Doesn't it belong to control_plane?
`/zenith_metrics`:
`/libs`:
Unites granular neon helper crates under the hood.
`/libs/postgres_ffi`:
Utility functions for interacting with PostgreSQL file formats.
Misc constants, copied from PostgreSQL headers.
`/libs/utils`:
Generic helpers that are shared between other crates in this repository.
A subject for future modularization.
`/libs/metrics`:
Helpers for exposing Prometheus metrics from the server.
`/zenith_utils`:
Helpers that are shared between other crates in this repository.
## Using Python
Note that Debian/Ubuntu Python packages are stale, as it commonly happens,
so manual installation of dependencies is not recommended.

View File

@@ -1,5 +1,5 @@
[package]
name = "zenith_metrics"
name = "metrics"
version = "0.1.0"
edition = "2021"
@@ -8,4 +8,4 @@ prometheus = {version = "0.13", default_features=false} # removes protobuf depen
libc = "0.2"
lazy_static = "1.4"
once_cell = "1.8.0"
workspace_hack = { version = "0.1", path = "../workspace_hack" }
workspace_hack = { version = "0.1", path = "../../workspace_hack" }

View File

@@ -8,8 +8,8 @@ use std::io::{Read, Result, Write};
///
/// ```
/// # use std::io::{Result, Read};
/// # use zenith_metrics::{register_int_counter, IntCounter};
/// # use zenith_metrics::CountedReader;
/// # use metrics::{register_int_counter, IntCounter};
/// # use metrics::CountedReader;
/// #
/// # lazy_static::lazy_static! {
/// # static ref INT_COUNTER: IntCounter = register_int_counter!(
@@ -83,8 +83,8 @@ impl<T: Read> Read for CountedReader<'_, T> {
///
/// ```
/// # use std::io::{Result, Write};
/// # use zenith_metrics::{register_int_counter, IntCounter};
/// # use zenith_metrics::CountedWriter;
/// # use metrics::{register_int_counter, IntCounter};
/// # use metrics::CountedWriter;
/// #
/// # lazy_static::lazy_static! {
/// # static ref INT_COUNTER: IntCounter = register_int_counter!(

View File

@@ -17,8 +17,8 @@ log = "0.4.14"
memoffset = "0.6.2"
thiserror = "1.0"
serde = { version = "1.0", features = ["derive"] }
zenith_utils = { path = "../zenith_utils" }
workspace_hack = { version = "0.1", path = "../workspace_hack" }
utils = { path = "../utils" }
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
[build-dependencies]
bindgen = "0.59.1"

View File

@@ -88,8 +88,8 @@ fn main() {
// 'pg_config --includedir-server' would perhaps be the more proper way to find it,
// but this will do for now.
//
.clang_arg("-I../tmp_install/include/server")
.clang_arg("-I../tmp_install/include/postgresql/server")
.clang_arg("-I../../tmp_install/include/server")
.clang_arg("-I../../tmp_install/include/postgresql/server")
//
// Finish the builder and generate the bindings.
//

View File

@@ -43,7 +43,7 @@ impl ControlFileData {
/// Interpret a slice of bytes as a Postgres control file.
///
pub fn decode(buf: &[u8]) -> Result<ControlFileData> {
use zenith_utils::bin_ser::LeSer;
use utils::bin_ser::LeSer;
// Check that the slice has the expected size. The control file is
// padded with zeros up to a 512 byte sector size, so accept a
@@ -77,7 +77,7 @@ impl ControlFileData {
///
/// The CRC is recomputed to match the contents of the fields.
pub fn encode(&self) -> Bytes {
use zenith_utils::bin_ser::LeSer;
use utils::bin_ser::LeSer;
// Serialize into a new buffer.
let b = self.ser().unwrap();

View File

@@ -18,7 +18,7 @@ use crc32c::*;
use log::*;
use std::cmp::min;
use thiserror::Error;
use zenith_utils::lsn::Lsn;
use utils::lsn::Lsn;
pub struct WalStreamDecoder {
lsn: Lsn,

View File

@@ -28,7 +28,7 @@ use std::io::prelude::*;
use std::io::SeekFrom;
use std::path::{Path, PathBuf};
use std::time::SystemTime;
use zenith_utils::lsn::Lsn;
use utils::lsn::Lsn;
pub const XLOG_FNAME_LEN: usize = 24;
pub const XLOG_BLCKSZ: usize = 8192;
@@ -351,17 +351,17 @@ pub fn main() {
impl XLogRecord {
pub fn from_slice(buf: &[u8]) -> XLogRecord {
use zenith_utils::bin_ser::LeSer;
use utils::bin_ser::LeSer;
XLogRecord::des(buf).unwrap()
}
pub fn from_bytes<B: Buf>(buf: &mut B) -> XLogRecord {
use zenith_utils::bin_ser::LeSer;
use utils::bin_ser::LeSer;
XLogRecord::des_from(&mut buf.reader()).unwrap()
}
pub fn encode(&self) -> Bytes {
use zenith_utils::bin_ser::LeSer;
use utils::bin_ser::LeSer;
self.ser().unwrap().into()
}
@@ -373,19 +373,19 @@ impl XLogRecord {
impl XLogPageHeaderData {
pub fn from_bytes<B: Buf>(buf: &mut B) -> XLogPageHeaderData {
use zenith_utils::bin_ser::LeSer;
use utils::bin_ser::LeSer;
XLogPageHeaderData::des_from(&mut buf.reader()).unwrap()
}
}
impl XLogLongPageHeaderData {
pub fn from_bytes<B: Buf>(buf: &mut B) -> XLogLongPageHeaderData {
use zenith_utils::bin_ser::LeSer;
use utils::bin_ser::LeSer;
XLogLongPageHeaderData::des_from(&mut buf.reader()).unwrap()
}
pub fn encode(&self) -> Bytes {
use zenith_utils::bin_ser::LeSer;
use utils::bin_ser::LeSer;
self.ser().unwrap().into()
}
}
@@ -394,12 +394,12 @@ pub const SIZEOF_CHECKPOINT: usize = std::mem::size_of::<CheckPoint>();
impl CheckPoint {
pub fn encode(&self) -> Bytes {
use zenith_utils::bin_ser::LeSer;
use utils::bin_ser::LeSer;
self.ser().unwrap().into()
}
pub fn decode(buf: &[u8]) -> Result<CheckPoint, anyhow::Error> {
use zenith_utils::bin_ser::LeSer;
use utils::bin_ser::LeSer;
Ok(CheckPoint::des(buf)?)
}
@@ -477,7 +477,9 @@ mod tests {
#[test]
pub fn test_find_end_of_wal() {
// 1. Run initdb to generate some WAL
let top_path = PathBuf::from(env!("CARGO_MANIFEST_DIR")).join("..");
let top_path = PathBuf::from(env!("CARGO_MANIFEST_DIR"))
.join("..")
.join("..");
let data_dir = top_path.join("test_output/test_find_end_of_wal");
let initdb_path = top_path.join("tmp_install/bin/initdb");
let lib_path = top_path.join("tmp_install/lib");

View File

@@ -1,5 +1,5 @@
[package]
name = "zenith_utils"
name = "utils"
version = "0.1.0"
edition = "2021"
@@ -10,8 +10,8 @@ bytes = "1.0.1"
hyper = { version = "0.14.7", features = ["full"] }
lazy_static = "1.4.0"
pin-project-lite = "0.2.7"
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
postgres-protocol = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
postgres-protocol = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
routerify = "3"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
@@ -22,23 +22,23 @@ tracing-subscriber = { version = "0.3", features = ["env-filter"] }
nix = "0.23.0"
signal-hook = "0.3.10"
rand = "0.8.3"
jsonwebtoken = "7"
jsonwebtoken = "8"
hex = { version = "0.4.3", features = ["serde"] }
rustls = "0.19.1"
rustls-split = "0.2.1"
rustls = "0.20.2"
rustls-split = "0.3.0"
git-version = "0.3.5"
serde_with = "1.12.0"
zenith_metrics = { path = "../zenith_metrics" }
workspace_hack = { version = "0.1", path = "../workspace_hack" }
metrics = { path = "../metrics" }
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
[dev-dependencies]
byteorder = "1.4.3"
bytes = "1.0.1"
hex-literal = "0.3"
tempfile = "3.2"
webpki = "0.21"
criterion = "0.3"
rustls-pemfile = "0.2.1"
[[bench]]
name = "benchmarks"

View File

@@ -1,7 +1,7 @@
#![allow(unused)]
use criterion::{criterion_group, criterion_main, Criterion};
use zenith_utils::zid;
use utils::zid;
pub fn bench_zid_stringify(c: &mut Criterion) {
// Can only use public methods.

View File

@@ -1,10 +1,11 @@
#!/bin/bash
PG_BIN=$1
WAL_PATH=$2
DATA_DIR=$3
PORT=$4
SYSID=`od -A n -j 24 -N 8 -t d8 $WAL_PATH/000000010000000000000002* | cut -c 3-`
rm -fr $DATA_DIR
env -i LD_LIBRARY_PATH=$PG_BIN/../lib $PG_BIN/initdb -E utf8 -D $DATA_DIR --sysid=$SYSID
env -i LD_LIBRARY_PATH=$PG_BIN/../lib $PG_BIN/initdb -E utf8 -U zenith_admin -D $DATA_DIR --sysid=$SYSID
echo port=$PORT >> $DATA_DIR/postgresql.conf
REDO_POS=0x`$PG_BIN/pg_controldata -D $DATA_DIR | fgrep "REDO location"| cut -c 42-`
declare -i WAL_SIZE=$REDO_POS+114

View File

@@ -5,7 +5,7 @@
/// For example, to calculate the smallest value among some integers:
///
/// ```
/// use zenith_utils::accum::Accum;
/// use utils::accum::Accum;
///
/// let values = [1, 2, 3];
///

View File

@@ -1,8 +1,6 @@
// For details about authentication see docs/authentication.md
// TODO there are two issues for our use case in jsonwebtoken library which will be resolved in next release
// The first one is that there is no way to disable expiration claim, but it can be excluded from validation, so use this as a workaround for now.
// Relevant issue: https://github.com/Keats/jsonwebtoken/issues/190
// The second one is that we wanted to use ed25519 keys, but they are also not supported until next version. So we go with RSA keys for now.
//
// TODO: use ed25519 keys
// Relevant issue: https://github.com/Keats/jsonwebtoken/issues/162
use serde;
@@ -59,19 +57,19 @@ pub fn check_permission(claims: &Claims, tenantid: Option<ZTenantId>) -> Result<
}
pub struct JwtAuth {
decoding_key: DecodingKey<'static>,
decoding_key: DecodingKey,
validation: Validation,
}
impl JwtAuth {
pub fn new(decoding_key: DecodingKey<'_>) -> Self {
pub fn new(decoding_key: DecodingKey) -> Self {
let mut validation = Validation::new(JWT_ALGORITHM);
// The default 'required_spec_claims' is 'exp'. But we don't want to require
// expiration.
validation.required_spec_claims = [].into();
Self {
decoding_key: decoding_key.into_static(),
validation: Validation {
algorithms: vec![JWT_ALGORITHM],
validate_exp: false,
..Default::default()
},
decoding_key,
validation,
}
}

View File

@@ -5,12 +5,11 @@ use anyhow::anyhow;
use hyper::header::AUTHORIZATION;
use hyper::{header::CONTENT_TYPE, Body, Request, Response, Server};
use lazy_static::lazy_static;
use metrics::{new_common_metric_name, register_int_counter, Encoder, IntCounter, TextEncoder};
use routerify::ext::RequestExt;
use routerify::RequestInfo;
use routerify::{Middleware, Router, RouterBuilder, RouterService};
use tracing::info;
use zenith_metrics::{new_common_metric_name, register_int_counter, IntCounter};
use zenith_metrics::{Encoder, TextEncoder};
use std::future::Future;
use std::net::TcpListener;
@@ -36,7 +35,7 @@ async fn prometheus_metrics_handler(_req: Request<Body>) -> Result<Response<Body
let mut buffer = vec![];
let encoder = TextEncoder::new();
let metrics = zenith_metrics::gather();
let metrics = metrics::gather();
encoder.encode(&metrics, &mut buffer).unwrap();
let response = Response::builder()

View File

@@ -1,4 +1,4 @@
//! zenith_utils is intended to be a place to put code that is shared
//! `utils` is intended to be a place to put code that is shared
//! between other crates in this repository.
#![allow(clippy::manual_range_contains)]
@@ -70,7 +70,7 @@ pub mod signals;
// So the build script will be run only when GIT_VERSION envvar has changed.
//
// Why not to use buildscript to get git commit sha directly without procmacro from different crate?
// Caching and workspaces complicates that. In case zenith_utils is not
// Caching and workspaces complicates that. In case `utils` is not
// recompiled due to caching then version may become outdated.
// git_version crate handles that case by introducing a dependency on .git internals via include_bytes! macro,
// so if we changed the index state git_version will pick that up and rerun the macro.

View File

@@ -304,8 +304,8 @@ impl PostgresBackend {
pub fn start_tls(&mut self) -> anyhow::Result<()> {
match self.stream.take() {
Some(Stream::Bidirectional(bidi_stream)) => {
let session = rustls::ServerSession::new(&self.tls_config.clone().unwrap());
self.stream = Some(Stream::Bidirectional(bidi_stream.start_tls(session)?));
let conn = rustls::ServerConnection::new(self.tls_config.clone().unwrap())?;
self.stream = Some(Stream::Bidirectional(bidi_stream.start_tls(conn)?));
Ok(())
}
stream => {

View File

@@ -100,6 +100,21 @@ pub struct FeExecuteMessage {
#[derive(Debug)]
pub struct FeCloseMessage {}
/// Retry a read on EINTR
///
/// This runs the enclosed expression, and if it returns
/// Err(io::ErrorKind::Interrupted), retries it.
macro_rules! retry_read {
( $x:expr ) => {
loop {
match $x {
Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
res => break res,
}
}
};
}
impl FeMessage {
/// Read one message from the stream.
/// This function returns `Ok(None)` in case of EOF.
@@ -107,7 +122,7 @@ impl FeMessage {
///
/// ```
/// # use std::io;
/// # use zenith_utils::pq_proto::FeMessage;
/// # use utils::pq_proto::FeMessage;
/// #
/// # fn process_message(msg: FeMessage) -> anyhow::Result<()> {
/// # Ok(())
@@ -141,12 +156,12 @@ impl FeMessage {
// Each libpq message begins with a message type byte, followed by message length
// If the client closes the connection, return None. But if the client closes the
// connection in the middle of a message, we will return an error.
let tag = match stream.read_u8().await {
let tag = match retry_read!(stream.read_u8().await) {
Ok(b) => b,
Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
Err(e) => return Err(e.into()),
};
let len = stream.read_u32().await?;
let len = retry_read!(stream.read_u32().await)?;
// The message length includes itself, so it better be at least 4
let bodylen = len
@@ -207,7 +222,7 @@ impl FeStartupPacket {
// reading 4 bytes, to be precise), return None to indicate that the connection
// was closed. This matches the PostgreSQL server's behavior, which avoids noise
// in the log if the client opens connection but closes it immediately.
let len = match stream.read_u32().await {
let len = match retry_read!(stream.read_u32().await) {
Ok(len) => len as usize,
Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
Err(e) => return Err(e.into()),
@@ -217,7 +232,7 @@ impl FeStartupPacket {
bail!("invalid message length");
}
let request_code = stream.read_u32().await?;
let request_code = retry_read!(stream.read_u32().await)?;
// the rest of startup packet are params
let params_len = len - 8;

View File

@@ -4,7 +4,7 @@ use std::{
sync::Arc,
};
use rustls::Session;
use rustls::Connection;
/// Wrapper supporting reads of a shared TcpStream.
pub struct ArcTcpRead(Arc<TcpStream>);
@@ -56,7 +56,7 @@ impl BufStream {
pub enum ReadStream {
Tcp(BufReader<ArcTcpRead>),
Tls(rustls_split::ReadHalf<rustls::ServerSession>),
Tls(rustls_split::ReadHalf),
}
impl io::Read for ReadStream {
@@ -79,7 +79,7 @@ impl ReadStream {
pub enum WriteStream {
Tcp(Arc<TcpStream>),
Tls(rustls_split::WriteHalf<rustls::ServerSession>),
Tls(rustls_split::WriteHalf),
}
impl WriteStream {
@@ -107,11 +107,11 @@ impl io::Write for WriteStream {
}
}
type TlsStream<T> = rustls::StreamOwned<rustls::ServerSession, T>;
type TlsStream<T> = rustls::StreamOwned<rustls::ServerConnection, T>;
pub enum BidiStream {
Tcp(BufStream),
/// This variant is boxed, because [`rustls::ServerSession`] is considerably larger than [`BufStream`].
/// This variant is boxed, because [`rustls::ServerConnection`] is considerably larger than [`BufStream`].
Tls(Box<TlsStream<BufStream>>),
}
@@ -127,7 +127,7 @@ impl BidiStream {
if how == Shutdown::Read {
tls_boxed.sock.get_ref().shutdown(how)
} else {
tls_boxed.sess.send_close_notify();
tls_boxed.conn.send_close_notify();
let res = tls_boxed.flush();
tls_boxed.sock.get_ref().shutdown(how)?;
res
@@ -154,19 +154,23 @@ impl BidiStream {
// TODO would be nice to avoid the Arc here
let socket = Arc::try_unwrap(reader.into_inner().0).unwrap();
let (read_half, write_half) =
rustls_split::split(socket, tls_boxed.sess, read_buf_cfg, write_buf_cfg);
let (read_half, write_half) = rustls_split::split(
socket,
Connection::Server(tls_boxed.conn),
read_buf_cfg,
write_buf_cfg,
);
(ReadStream::Tls(read_half), WriteStream::Tls(write_half))
}
}
}
pub fn start_tls(self, mut session: rustls::ServerSession) -> io::Result<Self> {
pub fn start_tls(self, mut conn: rustls::ServerConnection) -> io::Result<Self> {
match self {
Self::Tcp(mut stream) => {
session.complete_io(&mut stream)?;
assert!(!session.is_handshaking());
Ok(Self::Tls(Box::new(TlsStream::new(session, stream))))
conn.complete_io(&mut stream)?;
assert!(!conn.is_handshaking());
Ok(Self::Tls(Box::new(TlsStream::new(conn, stream))))
}
Self::Tls { .. } => Err(io::Error::new(
io::ErrorKind::InvalidInput,

View File

@@ -29,7 +29,7 @@ impl<S, T: Future> SyncFuture<S, T> {
/// Example:
///
/// ```
/// # use zenith_utils::sync::SyncFuture;
/// # use utils::sync::SyncFuture;
/// # use std::future::Future;
/// # use tokio::io::AsyncReadExt;
/// #

View File

@@ -2,7 +2,7 @@ use bytes::{Buf, BytesMut};
use hex_literal::hex;
use serde::Deserialize;
use std::io::Read;
use zenith_utils::bin_ser::LeSer;
use utils::bin_ser::LeSer;
#[derive(Debug, PartialEq, Deserialize)]
pub struct HeaderData {

View File

@@ -8,9 +8,8 @@ use std::{
use byteorder::{BigEndian, ReadBytesExt, WriteBytesExt};
use bytes::{Buf, BufMut, Bytes, BytesMut};
use lazy_static::lazy_static;
use rustls::Session;
use zenith_utils::postgres_backend::{AuthType, Handler, PostgresBackend};
use utils::postgres_backend::{AuthType, Handler, PostgresBackend};
fn make_tcp_pair() -> (TcpStream, TcpStream) {
let listener = TcpListener::bind("127.0.0.1:0").unwrap();
@@ -23,11 +22,11 @@ fn make_tcp_pair() -> (TcpStream, TcpStream) {
lazy_static! {
static ref KEY: rustls::PrivateKey = {
let mut cursor = Cursor::new(include_bytes!("key.pem"));
rustls::internal::pemfile::rsa_private_keys(&mut cursor).unwrap()[0].clone()
rustls::PrivateKey(rustls_pemfile::rsa_private_keys(&mut cursor).unwrap()[0].clone())
};
static ref CERT: rustls::Certificate = {
let mut cursor = Cursor::new(include_bytes!("cert.pem"));
rustls::internal::pemfile::certs(&mut cursor).unwrap()[0].clone()
rustls::Certificate(rustls_pemfile::certs(&mut cursor).unwrap()[0].clone())
};
}
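The lazy_static block above switches from the removed rustls::internal::pemfile module to the rustls-pemfile crate. A hedged sketch of the same loading pattern, reading from files on disk instead of embedded bytes (paths and the "take the first key" shortcut are assumptions):

use std::{fs::File, io::BufReader};

// Illustrative only: load a certificate chain and an RSA key with rustls-pemfile,
// producing the owned rustls::Certificate / rustls::PrivateKey types used above.
fn load_tls_material(
    cert_path: &str,
    key_path: &str,
) -> std::io::Result<(Vec<rustls::Certificate>, rustls::PrivateKey)> {
    let certs = rustls_pemfile::certs(&mut BufReader::new(File::open(cert_path)?))?
        .into_iter()
        .map(rustls::Certificate)
        .collect();
    let mut keys = rustls_pemfile::rsa_private_keys(&mut BufReader::new(File::open(key_path)?))?;
    // Take the first key in the file; a real loader would check there is exactly one.
    Ok((certs, rustls::PrivateKey(keys.remove(0))))
}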
@@ -45,17 +44,23 @@ fn ssl() {
let ssl_response = client_sock.read_u8().unwrap();
assert_eq!(b'S', ssl_response);
let mut cfg = rustls::ClientConfig::new();
cfg.root_store.add(&CERT).unwrap();
let cfg = rustls::ClientConfig::builder()
.with_safe_defaults()
.with_root_certificates({
let mut store = rustls::RootCertStore::empty();
store.add(&CERT).unwrap();
store
})
.with_no_client_auth();
let client_config = Arc::new(cfg);
let dns_name = webpki::DNSNameRef::try_from_ascii_str("localhost").unwrap();
let mut session = rustls::ClientSession::new(&client_config, dns_name);
let dns_name = "localhost".try_into().unwrap();
let mut conn = rustls::ClientConnection::new(client_config, dns_name).unwrap();
session.complete_io(&mut client_sock).unwrap();
assert!(!session.is_handshaking());
conn.complete_io(&mut client_sock).unwrap();
assert!(!conn.is_handshaking());
let mut stream = rustls::Stream::new(&mut session, &mut client_sock);
let mut stream = rustls::Stream::new(&mut conn, &mut client_sock);
// StartupMessage
stream.write_u32::<BigEndian>(9).unwrap();
@@ -105,8 +110,10 @@ fn ssl() {
}
let mut handler = TestHandler { got_query: false };
let mut cfg = rustls::ServerConfig::new(rustls::NoClientAuth::new());
cfg.set_single_cert(vec![CERT.clone()], KEY.clone())
let cfg = rustls::ServerConfig::builder()
.with_safe_defaults()
.with_no_client_auth()
.with_single_cert(vec![CERT.clone()], KEY.clone())
.unwrap();
let tls_config = Some(Arc::new(cfg));
@@ -209,8 +216,10 @@ fn server_forces_ssl() {
}
let mut handler = TestHandler;
let mut cfg = rustls::ServerConfig::new(rustls::NoClientAuth::new());
cfg.set_single_cert(vec![CERT.clone()], KEY.clone())
let cfg = rustls::ServerConfig::builder()
.with_safe_defaults()
.with_no_client_auth()
.with_single_cert(vec![CERT.clone()], KEY.clone())
.unwrap();
let tls_config = Some(Arc::new(cfg));

View File

@@ -3,6 +3,14 @@ name = "pageserver"
version = "0.1.0"
edition = "2021"
[features]
# It is simpler infra-wise to have failpoints enabled by default.
# It shouldn't affect performance in any way because failpoints
# are not placed in hot code paths.
default = ["failpoints"]
profiling = ["pprof"]
failpoints = ["fail/failpoints"]
[dependencies]
chrono = "0.4.19"
rand = "0.8.3"
@@ -18,10 +26,10 @@ clap = "3.0"
daemonize = "0.4.1"
tokio = { version = "1.17", features = ["process", "sync", "macros", "fs", "rt", "io-util", "time"] }
tokio-util = { version = "0.7", features = ["io"] }
postgres-types = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
postgres-protocol = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
postgres-types = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
postgres-protocol = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" }
tokio-stream = "0.1.8"
anyhow = { version = "1.0", features = ["backtrace"] }
crc32c = "0.6.0"
@@ -31,6 +39,9 @@ humantime = "2.1.0"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1"
serde_with = "1.12.0"
humantime-serde = "1.1.1"
pprof = { git = "https://github.com/neondatabase/pprof-rs.git", branch = "wallclock-profiling", features = ["flamegraph"], optional = true }
toml_edit = { version = "0.13", features = ["easy"] }
scopeguard = "1.1.0"
@@ -46,11 +57,12 @@ fail = "0.5.0"
rusoto_core = "0.47"
rusoto_s3 = "0.47"
async-trait = "0.1"
async-compression = {version = "0.3", features = ["zstd", "tokio"]}
postgres_ffi = { path = "../postgres_ffi" }
zenith_metrics = { path = "../zenith_metrics" }
zenith_utils = { path = "../zenith_utils" }
zstd = "0.11.1"
postgres_ffi = { path = "../libs/postgres_ffi" }
metrics = { path = "../libs/metrics" }
utils = { path = "../libs/utils" }
workspace_hack = { version = "0.1", path = "../workspace_hack" }
[dev-dependencies]

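With failpoints now compiled in by default, a failpoint can be placed and activated roughly like this; this is a generic sketch of the fail crate, and the failpoint name and surrounding function are made up:

// Sketch only; "layer-upload-pause" is a hypothetical failpoint name.
fn upload_layer_to_remote() -> anyhow::Result<()> {
    // Compiles to a no-op unless the failpoint is activated at runtime.
    fail::fail_point!("layer-upload-pause", |_| {
        anyhow::bail!("failpoint 'layer-upload-pause' triggered")
    });
    // ... actual upload work would go here ...
    Ok(())
}

// A test can then activate the failpoint before exercising the code path:
// fail::cfg("layer-upload-pause", "return").unwrap();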
View File

@@ -25,7 +25,7 @@ use crate::repository::Timeline;
use crate::DatadirTimelineImpl;
use postgres_ffi::xlog_utils::*;
use postgres_ffi::*;
use zenith_utils::lsn::Lsn;
use utils::lsn::Lsn;
/// This is short-living object only for the time of tarball creation,
/// created mostly to avoid passing a lot of parameters between various functions

View File

@@ -7,7 +7,7 @@ use pageserver::layered_repository::dump_layerfile_from_path;
use pageserver::page_cache;
use pageserver::virtual_file;
use std::path::PathBuf;
use zenith_utils::GIT_VERSION;
use utils::GIT_VERSION;
fn main() -> Result<()> {
let arg_matches = App::new("Zenith dump_layerfile utility")

View File

@@ -2,14 +2,6 @@
use std::{env, path::Path, str::FromStr};
use tracing::*;
use zenith_utils::{
auth::JwtAuth,
logging,
postgres_backend::AuthType,
tcp_listener,
zid::{ZTenantId, ZTimelineId},
GIT_VERSION,
};
use anyhow::{bail, Context, Result};
@@ -18,22 +10,36 @@ use daemonize::Daemonize;
use pageserver::{
config::{defaults::*, PageServerConf},
http, page_cache, page_service,
remote_storage::{self, SyncStartupData},
repository::{Repository, TimelineSyncStatusUpdate},
tenant_mgr, thread_mgr,
http, page_cache, page_service, profiling, tenant_mgr, thread_mgr,
thread_mgr::ThreadKind,
timelines, virtual_file, LOG_FILE_NAME,
};
use zenith_utils::http::endpoint;
use zenith_utils::shutdown::exit_now;
use zenith_utils::signals::{self, Signal};
use utils::{
auth::JwtAuth,
http::endpoint,
logging,
postgres_backend::AuthType,
shutdown::exit_now,
signals::{self, Signal},
tcp_listener,
zid::{ZTenantId, ZTimelineId},
GIT_VERSION,
};
fn version() -> String {
format!(
"{} profiling:{} failpoints:{}",
GIT_VERSION,
cfg!(feature = "profiling"),
fail::has_failpoints()
)
}
fn main() -> anyhow::Result<()> {
zenith_metrics::set_common_metrics_prefix("pageserver");
metrics::set_common_metrics_prefix("pageserver");
let arg_matches = App::new("Zenith page server")
.about("Materializes WAL stream to pages and serves them to the postgres")
.version(GIT_VERSION)
.version(&*version())
.arg(
Arg::new("daemonize")
.short('d')
@@ -231,46 +237,8 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()
let signals = signals::install_shutdown_handlers()?;
// Initialize repositories with locally available timelines.
// Timelines that are only partially available locally (remote storage has more data than this pageserver)
// are scheduled for download and added to the repository once download is completed.
let SyncStartupData {
remote_index,
local_timeline_init_statuses,
} = remote_storage::start_local_timeline_sync(conf)
.context("Failed to set up local files sync with external storage")?;
for (tenant_id, local_timeline_init_statuses) in local_timeline_init_statuses {
// initialize local tenant
let repo = tenant_mgr::load_local_repo(conf, tenant_id, &remote_index);
for (timeline_id, init_status) in local_timeline_init_statuses {
match init_status {
remote_storage::LocalTimelineInitStatus::LocallyComplete => {
debug!("timeline {} for tenant {} is locally complete, registering it in repository", tenant_id, timeline_id);
// Let's fail here loudly to be on the safe side.
// XXX: It may be a better api to actually distinguish between repository startup
// and processing of newly downloaded timelines.
repo.apply_timeline_remote_sync_status_update(
timeline_id,
TimelineSyncStatusUpdate::Downloaded,
)
.with_context(|| {
format!(
"Failed to bootstrap timeline {} for tenant {}",
timeline_id, tenant_id
)
})?
}
remote_storage::LocalTimelineInitStatus::NeedsSync => {
debug!(
"timeline {} for tenant {} needs sync, \
so skipped for adding into repository until sync is finished",
tenant_id, timeline_id
);
}
}
}
}
// start profiler (if enabled)
let profiler_guard = profiling::init_profiler(conf);
// initialize authentication for incoming connections
let auth = match &conf.auth_type {
@@ -283,6 +251,8 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()
};
info!("Using auth: {:#?}", conf.auth_type);
let remote_index = tenant_mgr::init_tenant_mgr(conf)?;
// Spawn a new thread for the http endpoint
// bind before launching separate thread so the error reported before startup exits
let auth_cloned = auth.clone();
@@ -293,7 +263,7 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()
"http_endpoint_thread",
false,
move || {
let router = http::make_router(conf, auth_cloned, remote_index);
let router = http::make_router(conf, auth_cloned, remote_index)?;
endpoint::serve_thread_main(router, http_listener, thread_mgr::shutdown_watcher())
},
)?;
@@ -315,6 +285,7 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()
"Got {}. Terminating in immediate shutdown mode",
signal.name()
);
profiling::exit_profiler(conf, &profiler_guard);
std::process::exit(111);
}
@@ -323,6 +294,7 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()
"Got {}. Terminating gracefully in fast shutdown mode",
signal.name()
);
profiling::exit_profiler(conf, &profiler_guard);
pageserver::shutdown_pageserver();
unreachable!()
}

View File

@@ -1,334 +0,0 @@
//! A CLI helper to deal with remote storage (S3, usually) blobs as archives.
//! See [`compression`] for more details about the archives.
use std::{collections::BTreeSet, path::Path};
use anyhow::{bail, ensure, Context};
use clap::{App, Arg};
use pageserver::{
layered_repository::metadata::{TimelineMetadata, METADATA_FILE_NAME},
remote_storage::compression,
};
use tokio::{fs, io};
use zenith_utils::GIT_VERSION;
const LIST_SUBCOMMAND: &str = "list";
const ARCHIVE_ARG_NAME: &str = "archive";
const EXTRACT_SUBCOMMAND: &str = "extract";
const TARGET_DIRECTORY_ARG_NAME: &str = "target_directory";
const CREATE_SUBCOMMAND: &str = "create";
const SOURCE_DIRECTORY_ARG_NAME: &str = "source_directory";
#[tokio::main(flavor = "current_thread")]
async fn main() -> anyhow::Result<()> {
let arg_matches = App::new("pageserver zst blob [un]compressor utility")
.version(GIT_VERSION)
.subcommands(vec![
App::new(LIST_SUBCOMMAND)
.about("List the archive contents")
.arg(
Arg::new(ARCHIVE_ARG_NAME)
.required(true)
.takes_value(true)
.help("An archive to list the contents of"),
),
App::new(EXTRACT_SUBCOMMAND)
.about("Extracts the archive into the directory")
.arg(
Arg::new(ARCHIVE_ARG_NAME)
.required(true)
.takes_value(true)
.help("An archive to extract"),
)
.arg(
Arg::new(TARGET_DIRECTORY_ARG_NAME)
.required(false)
.takes_value(true)
.help("A directory to extract the archive into. Optional, will use the current directory if not specified"),
),
App::new(CREATE_SUBCOMMAND)
.about("Creates an archive with the contents of a directory (only the first level files are taken, metadata file has to be present in the same directory)")
.arg(
Arg::new(SOURCE_DIRECTORY_ARG_NAME)
.required(true)
.takes_value(true)
.help("A directory to use for creating the archive"),
)
.arg(
Arg::new(TARGET_DIRECTORY_ARG_NAME)
.required(false)
.takes_value(true)
.help("A directory to create the archive in. Optional, will use the current directory if not specified"),
),
])
.get_matches();
let subcommand_name = match arg_matches.subcommand_name() {
Some(name) => name,
None => bail!("No subcommand specified"),
};
let subcommand_matches = match arg_matches.subcommand_matches(subcommand_name) {
Some(matches) => matches,
None => bail!(
"No subcommand arguments were recognized for subcommand '{}'",
subcommand_name
),
};
let target_dir = Path::new(
subcommand_matches
.value_of(TARGET_DIRECTORY_ARG_NAME)
.unwrap_or("./"),
);
match subcommand_name {
LIST_SUBCOMMAND => {
let archive = match subcommand_matches.value_of(ARCHIVE_ARG_NAME) {
Some(archive) => Path::new(archive),
None => bail!("No '{}' argument is specified", ARCHIVE_ARG_NAME),
};
list_archive(archive).await
}
EXTRACT_SUBCOMMAND => {
let archive = match subcommand_matches.value_of(ARCHIVE_ARG_NAME) {
Some(archive) => Path::new(archive),
None => bail!("No '{}' argument is specified", ARCHIVE_ARG_NAME),
};
extract_archive(archive, target_dir).await
}
CREATE_SUBCOMMAND => {
let source_dir = match subcommand_matches.value_of(SOURCE_DIRECTORY_ARG_NAME) {
Some(source) => Path::new(source),
None => bail!("No '{}' argument is specified", SOURCE_DIRECTORY_ARG_NAME),
};
create_archive(source_dir, target_dir).await
}
unknown => bail!("Unknown subcommand {}", unknown),
}
}
async fn list_archive(archive: &Path) -> anyhow::Result<()> {
let archive = archive.canonicalize().with_context(|| {
format!(
"Failed to get the absolute path for the archive path '{}'",
archive.display()
)
})?;
ensure!(
archive.is_file(),
"Path '{}' is not an archive file",
archive.display()
);
println!("Listing an archive at path '{}'", archive.display());
let archive_name = match archive.file_name().and_then(|name| name.to_str()) {
Some(name) => name,
None => bail!(
"Failed to get the archive name from the path '{}'",
archive.display()
),
};
let archive_bytes = fs::read(&archive)
.await
.context("Failed to read the archive bytes")?;
let header = compression::read_archive_header(archive_name, &mut archive_bytes.as_slice())
.await
.context("Failed to read the archive header")?;
let empty_path = Path::new("");
println!("-------------------------------");
let longest_path_in_archive = header
.files
.iter()
.filter_map(|file| Some(file.subpath.as_path(empty_path).to_str()?.len()))
.max()
.unwrap_or_default()
.max(METADATA_FILE_NAME.len());
for regular_file in &header.files {
println!(
"File: {:width$} uncompressed size: {} bytes",
regular_file.subpath.as_path(empty_path).display(),
regular_file.size,
width = longest_path_in_archive,
)
}
println!(
"File: {:width$} uncompressed size: {} bytes",
METADATA_FILE_NAME,
header.metadata_file_size,
width = longest_path_in_archive,
);
println!("-------------------------------");
Ok(())
}
async fn extract_archive(archive: &Path, target_dir: &Path) -> anyhow::Result<()> {
let archive = archive.canonicalize().with_context(|| {
format!(
"Failed to get the absolute path for the archive path '{}'",
archive.display()
)
})?;
ensure!(
archive.is_file(),
"Path '{}' is not an archive file",
archive.display()
);
let archive_name = match archive.file_name().and_then(|name| name.to_str()) {
Some(name) => name,
None => bail!(
"Failed to get the archive name from the path '{}'",
archive.display()
),
};
if !target_dir.exists() {
fs::create_dir_all(target_dir).await.with_context(|| {
format!(
"Failed to create the target dir at path '{}'",
target_dir.display()
)
})?;
}
let target_dir = target_dir.canonicalize().with_context(|| {
format!(
"Failed to get the absolute path for the target dir path '{}'",
target_dir.display()
)
})?;
ensure!(
target_dir.is_dir(),
"Path '{}' is not a directory",
target_dir.display()
);
let mut dir_contents = fs::read_dir(&target_dir)
.await
.context("Failed to list the target directory contents")?;
let dir_entry = dir_contents
.next_entry()
.await
.context("Failed to list the target directory contents")?;
ensure!(
dir_entry.is_none(),
"Target directory '{}' is not empty",
target_dir.display()
);
println!(
"Extracting an archive at path '{}' into directory '{}'",
archive.display(),
target_dir.display()
);
let mut archive_file = fs::File::open(&archive).await.with_context(|| {
format!(
"Failed to get the archive name from the path '{}'",
archive.display()
)
})?;
let header = compression::read_archive_header(archive_name, &mut archive_file)
.await
.context("Failed to read the archive header")?;
compression::uncompress_with_header(&BTreeSet::new(), &target_dir, header, &mut archive_file)
.await
.context("Failed to extract the archive")
}
async fn create_archive(source_dir: &Path, target_dir: &Path) -> anyhow::Result<()> {
let source_dir = source_dir.canonicalize().with_context(|| {
format!(
"Failed to get the absolute path for the source dir path '{}'",
source_dir.display()
)
})?;
ensure!(
source_dir.is_dir(),
"Path '{}' is not a directory",
source_dir.display()
);
if !target_dir.exists() {
fs::create_dir_all(target_dir).await.with_context(|| {
format!(
"Failed to create the target dir at path '{}'",
target_dir.display()
)
})?;
}
let target_dir = target_dir.canonicalize().with_context(|| {
format!(
"Failed to get the absolute path for the target dir path '{}'",
target_dir.display()
)
})?;
ensure!(
target_dir.is_dir(),
"Path '{}' is not a directory",
target_dir.display()
);
println!(
"Compressing directory '{}' and creating resulting archive in directory '{}'",
source_dir.display(),
target_dir.display()
);
let mut metadata_file_contents = None;
let mut files_co_archive = Vec::new();
let mut source_dir_contents = fs::read_dir(&source_dir)
.await
.context("Failed to read the source directory contents")?;
while let Some(source_dir_entry) = source_dir_contents
.next_entry()
.await
.context("Failed to read a source dir entry")?
{
let entry_path = source_dir_entry.path();
if entry_path.is_file() {
if entry_path.file_name().and_then(|name| name.to_str()) == Some(METADATA_FILE_NAME) {
let metadata_bytes = fs::read(entry_path)
.await
.context("Failed to read metata file bytes in the source dir")?;
metadata_file_contents = Some(
TimelineMetadata::from_bytes(&metadata_bytes)
.context("Failed to parse metata file contents in the source dir")?,
);
} else {
files_co_archive.push(entry_path);
}
}
}
let metadata = match metadata_file_contents {
Some(metadata) => metadata,
None => bail!(
"No metadata file found in the source dir '{}', cannot create the archive",
source_dir.display()
),
};
let _ = compression::archive_files_as_stream(
&source_dir,
files_co_archive.iter(),
&metadata,
move |mut archive_streamer, archive_name| async move {
let archive_target = target_dir.join(&archive_name);
let mut archive_file = fs::File::create(&archive_target).await?;
io::copy(&mut archive_streamer, &mut archive_file).await?;
Ok(archive_target)
},
)
.await
.context("Failed to create an archive")?;
Ok(())
}

View File

@@ -6,8 +6,7 @@ use clap::{App, Arg};
use pageserver::layered_repository::metadata::TimelineMetadata;
use std::path::PathBuf;
use std::str::FromStr;
use zenith_utils::lsn::Lsn;
use zenith_utils::GIT_VERSION;
use utils::{lsn::Lsn, GIT_VERSION};
fn main() -> Result<()> {
let arg_matches = App::new("Zenith update metadata utility")

View File

@@ -4,22 +4,30 @@
//! file, or on the command line.
//! See also `settings.md` for a more detailed description of every parameter.
use anyhow::{bail, ensure, Context, Result};
use toml_edit;
use toml_edit::{Document, Item};
use zenith_utils::postgres_backend::AuthType;
use zenith_utils::zid::{ZNodeId, ZTenantId, ZTimelineId};
use std::convert::TryInto;
use anyhow::{anyhow, bail, ensure, Context, Result};
use std::env;
use std::num::{NonZeroU32, NonZeroUsize};
use std::path::{Path, PathBuf};
use std::str::FromStr;
use std::time::Duration;
use toml_edit;
use toml_edit::{Document, Item};
use utils::{
postgres_backend::AuthType,
zid::{ZNodeId, ZTenantId, ZTimelineId},
};
use crate::layered_repository::TIMELINES_SEGMENT_NAME;
use crate::tenant_config::{TenantConf, TenantConfOpt};
pub const ZSTD_MAX_SAMPLES: usize = 1024;
pub const ZSTD_MIN_SAMPLES: usize = 8; // magic requirement of zstd
pub const ZSTD_MAX_DICTIONARY_SIZE: usize = 64 * 1024;
pub const ZSTD_COMPRESSION_LEVEL: i32 = 0; // default compression level
pub const ZSTD_DECOMPRESS_BUFFER_LIMIT: usize = 64 * 1024; // TODO: handle larger WAL records?
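These limits bound the zstd dictionary training and (de)compression applied to layer data. A rough sketch of how such a dictionary could be built and used with the zstd crate's bulk API (the sample handling and function shape are illustrative, not the code from this patch):

// Illustrative only: train a dictionary from sample records and round-trip one record
// using the limits defined above.
fn zstd_roundtrip(samples: &[Vec<u8>], record: &[u8]) -> std::io::Result<Vec<u8>> {
    let dict = zstd::dict::from_samples(samples, ZSTD_MAX_DICTIONARY_SIZE)?;
    let mut compressor = zstd::bulk::Compressor::with_dictionary(ZSTD_COMPRESSION_LEVEL, &dict)?;
    let compressed = compressor.compress(record)?;
    let mut decompressor = zstd::bulk::Decompressor::with_dictionary(&dict)?;
    decompressor.decompress(&compressed, ZSTD_DECOMPRESS_BUFFER_LIMIT)
}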
pub mod defaults {
use crate::tenant_config::defaults::*;
use const_format::formatcp;
pub const DEFAULT_PG_LISTEN_PORT: u16 = 64000;
@@ -27,27 +35,22 @@ pub mod defaults {
pub const DEFAULT_HTTP_LISTEN_PORT: u16 = 9898;
pub const DEFAULT_HTTP_LISTEN_ADDR: &str = formatcp!("127.0.0.1:{DEFAULT_HTTP_LISTEN_PORT}");
// FIXME: This current value is very low. I would imagine something like 1 GB or 10 GB
// would be more appropriate. But a low value forces the code to be exercised more,
// which is good for now to trigger bugs.
// This parameter actually determines L0 layer file size.
pub const DEFAULT_CHECKPOINT_DISTANCE: u64 = 256 * 1024 * 1024;
// Target file size, when creating image and delta layers.
// This parameter determines L1 layer file size.
pub const DEFAULT_COMPACTION_TARGET_SIZE: u64 = 128 * 1024 * 1024;
pub const DEFAULT_COMPACTION_PERIOD: &str = "1 s";
pub const DEFAULT_COMPACTION_THRESHOLD: usize = 10;
pub const DEFAULT_GC_HORIZON: u64 = 64 * 1024 * 1024;
pub const DEFAULT_GC_PERIOD: &str = "100 s";
pub const DEFAULT_WAIT_LSN_TIMEOUT: &str = "60 s";
pub const DEFAULT_WAL_REDO_TIMEOUT: &str = "60 s";
pub const DEFAULT_SUPERUSER: &str = "zenith_admin";
pub const DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNC: usize = 10;
/// How many different timelines can be processed simultaneously when synchronizing layers with the remote storage.
/// During regular work, pageserver produces one layer file per timeline checkpoint, with bursts of concurrency
/// during start (where local and remote timelines are compared and initial sync tasks are scheduled) and timeline attach.
/// Both cases may trigger timeline download, that might download a lot of layers. This concurrency is limited by the clients internally, if needed.
pub const DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_TIMELINES_SYNC: usize = 50;
pub const DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS: u32 = 10;
/// Currently, sync happens with AWS S3, which has two limits on requests per second:
/// ~200 RPS for IAM services
/// https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/UsingWithRDS.IAMDBAuth.html
/// ~3500 PUT/COPY/POST/DELETE or 5500 GET/HEAD S3 requests
/// https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/
pub const DEFAULT_REMOTE_STORAGE_S3_CONCURRENCY_LIMIT: usize = 100;
pub const DEFAULT_PAGE_CACHE_SIZE: usize = 8192;
pub const DEFAULT_MAX_FILE_DESCRIPTORS: usize = 100;
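One common way to honor the S3 concurrency limit described above is to take a semaphore permit around each request. The sketch below is illustrative only and assumes a tokio Semaphore sized to the configured limit, e.g. Arc::new(Semaphore::new(concurrency_limit.get())):

use std::sync::Arc;
use tokio::sync::Semaphore;

// Illustrative pattern (not the actual client code): cap in-flight S3 requests
// by holding a permit for the duration of each call.
async fn with_s3_limit<F, T>(limiter: Arc<Semaphore>, call: F) -> T
where
    F: std::future::Future<Output = T>,
{
    let _permit = limiter.acquire_owned().await.expect("semaphore never closed");
    call.await
}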
@@ -62,14 +65,6 @@ pub mod defaults {
#listen_pg_addr = '{DEFAULT_PG_LISTEN_ADDR}'
#listen_http_addr = '{DEFAULT_HTTP_LISTEN_ADDR}'
#checkpoint_distance = {DEFAULT_CHECKPOINT_DISTANCE} # in bytes
#compaction_target_size = {DEFAULT_COMPACTION_TARGET_SIZE} # in bytes
#compaction_period = '{DEFAULT_COMPACTION_PERIOD}'
#compaction_threshold = '{DEFAULT_COMPACTION_THRESHOLD}'
#gc_period = '{DEFAULT_GC_PERIOD}'
#gc_horizon = {DEFAULT_GC_HORIZON}
#wait_lsn_timeout = '{DEFAULT_WAIT_LSN_TIMEOUT}'
#wal_redo_timeout = '{DEFAULT_WAL_REDO_TIMEOUT}'
@@ -78,6 +73,17 @@ pub mod defaults {
# initial superuser role name to use when creating a new tenant
#initial_superuser_name = '{DEFAULT_SUPERUSER}'
# [tenant_config]
#checkpoint_distance = {DEFAULT_CHECKPOINT_DISTANCE} # in bytes
#compaction_target_size = {DEFAULT_COMPACTION_TARGET_SIZE} # in bytes
#compaction_period = '{DEFAULT_COMPACTION_PERIOD}'
#compaction_threshold = {DEFAULT_COMPACTION_THRESHOLD}
#gc_period = '{DEFAULT_GC_PERIOD}'
#gc_horizon = {DEFAULT_GC_HORIZON}
#image_creation_threshold = {DEFAULT_IMAGE_CREATION_THRESHOLD}
#pitr_interval = '{DEFAULT_PITR_INTERVAL}'
# [remote_storage]
"###
@@ -95,25 +101,6 @@ pub struct PageServerConf {
/// Example (default): 127.0.0.1:9898
pub listen_http_addr: String,
// Flush out an inmemory layer, if it's holding WAL older than this
// This puts a backstop on how much WAL needs to be re-digested if the
// page server crashes.
// This parameter actually determines L0 layer file size.
pub checkpoint_distance: u64,
// Target file size, when creating image and delta layers.
// This parameter determines L1 layer file size.
pub compaction_target_size: u64,
// How often to check if there's compaction work to be done.
pub compaction_period: Duration,
// Level0 delta layer threshold for compaction.
pub compaction_threshold: usize,
pub gc_horizon: u64,
pub gc_period: Duration,
// Timeout when waiting for WAL receiver to catch up to an LSN given in a GetPage@LSN call.
pub wait_lsn_timeout: Duration,
// How long to wait for WAL redo to complete.
@@ -138,6 +125,28 @@ pub struct PageServerConf {
pub auth_validation_public_key_path: Option<PathBuf>,
pub remote_storage_config: Option<RemoteStorageConfig>,
pub profiling: ProfilingConfig,
pub default_tenant_conf: TenantConf,
}
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ProfilingConfig {
Disabled,
PageRequests,
}
impl FromStr for ProfilingConfig {
type Err = anyhow::Error;
fn from_str(s: &str) -> Result<ProfilingConfig, Self::Err> {
let result = match s {
"disabled" => ProfilingConfig::Disabled,
"page_requests" => ProfilingConfig::PageRequests,
_ => bail!("invalid value \"{s}\" for profiling option, valid values are \"disabled\" and \"page_requests\""),
};
Ok(result)
}
}
// use dedicated enum for builder to better indicate the intention
@@ -162,15 +171,6 @@ struct PageServerConfigBuilder {
listen_http_addr: BuilderValue<String>,
checkpoint_distance: BuilderValue<u64>,
compaction_target_size: BuilderValue<u64>,
compaction_period: BuilderValue<Duration>,
compaction_threshold: BuilderValue<usize>,
gc_horizon: BuilderValue<u64>,
gc_period: BuilderValue<Duration>,
wait_lsn_timeout: BuilderValue<Duration>,
wal_redo_timeout: BuilderValue<Duration>,
@@ -190,6 +190,8 @@ struct PageServerConfigBuilder {
remote_storage_config: BuilderValue<Option<RemoteStorageConfig>>,
id: BuilderValue<ZNodeId>,
profiling: BuilderValue<ProfilingConfig>,
}
impl Default for PageServerConfigBuilder {
@@ -199,14 +201,6 @@ impl Default for PageServerConfigBuilder {
Self {
listen_pg_addr: Set(DEFAULT_PG_LISTEN_ADDR.to_string()),
listen_http_addr: Set(DEFAULT_HTTP_LISTEN_ADDR.to_string()),
checkpoint_distance: Set(DEFAULT_CHECKPOINT_DISTANCE),
compaction_target_size: Set(DEFAULT_COMPACTION_TARGET_SIZE),
compaction_period: Set(humantime::parse_duration(DEFAULT_COMPACTION_PERIOD)
.expect("cannot parse default compaction period")),
compaction_threshold: Set(DEFAULT_COMPACTION_THRESHOLD),
gc_horizon: Set(DEFAULT_GC_HORIZON),
gc_period: Set(humantime::parse_duration(DEFAULT_GC_PERIOD)
.expect("cannot parse default gc period")),
wait_lsn_timeout: Set(humantime::parse_duration(DEFAULT_WAIT_LSN_TIMEOUT)
.expect("cannot parse default wait lsn timeout")),
wal_redo_timeout: Set(humantime::parse_duration(DEFAULT_WAL_REDO_TIMEOUT)
@@ -222,6 +216,7 @@ impl Default for PageServerConfigBuilder {
auth_validation_public_key_path: Set(None),
remote_storage_config: Set(None),
id: NotSet,
profiling: Set(ProfilingConfig::Disabled),
}
}
}
@@ -235,30 +230,6 @@ impl PageServerConfigBuilder {
self.listen_http_addr = BuilderValue::Set(listen_http_addr)
}
pub fn checkpoint_distance(&mut self, checkpoint_distance: u64) {
self.checkpoint_distance = BuilderValue::Set(checkpoint_distance)
}
pub fn compaction_target_size(&mut self, compaction_target_size: u64) {
self.compaction_target_size = BuilderValue::Set(compaction_target_size)
}
pub fn compaction_period(&mut self, compaction_period: Duration) {
self.compaction_period = BuilderValue::Set(compaction_period)
}
pub fn compaction_threshold(&mut self, compaction_threshold: usize) {
self.compaction_threshold = BuilderValue::Set(compaction_threshold)
}
pub fn gc_horizon(&mut self, gc_horizon: u64) {
self.gc_horizon = BuilderValue::Set(gc_horizon)
}
pub fn gc_period(&mut self, gc_period: Duration) {
self.gc_period = BuilderValue::Set(gc_period)
}
pub fn wait_lsn_timeout(&mut self, wait_lsn_timeout: Duration) {
self.wait_lsn_timeout = BuilderValue::Set(wait_lsn_timeout)
}
@@ -306,55 +277,46 @@ impl PageServerConfigBuilder {
self.id = BuilderValue::Set(node_id)
}
pub fn profiling(&mut self, profiling: ProfilingConfig) {
self.profiling = BuilderValue::Set(profiling)
}
pub fn build(self) -> Result<PageServerConf> {
Ok(PageServerConf {
listen_pg_addr: self
.listen_pg_addr
.ok_or(anyhow::anyhow!("missing listen_pg_addr"))?,
.ok_or(anyhow!("missing listen_pg_addr"))?,
listen_http_addr: self
.listen_http_addr
.ok_or(anyhow::anyhow!("missing listen_http_addr"))?,
checkpoint_distance: self
.checkpoint_distance
.ok_or(anyhow::anyhow!("missing checkpoint_distance"))?,
compaction_target_size: self
.compaction_target_size
.ok_or(anyhow::anyhow!("missing compaction_target_size"))?,
compaction_period: self
.compaction_period
.ok_or(anyhow::anyhow!("missing compaction_period"))?,
compaction_threshold: self
.compaction_threshold
.ok_or(anyhow::anyhow!("missing compaction_threshold"))?,
gc_horizon: self
.gc_horizon
.ok_or(anyhow::anyhow!("missing gc_horizon"))?,
gc_period: self.gc_period.ok_or(anyhow::anyhow!("missing gc_period"))?,
.ok_or(anyhow!("missing listen_http_addr"))?,
wait_lsn_timeout: self
.wait_lsn_timeout
.ok_or(anyhow::anyhow!("missing wait_lsn_timeout"))?,
.ok_or(anyhow!("missing wait_lsn_timeout"))?,
wal_redo_timeout: self
.wal_redo_timeout
.ok_or(anyhow::anyhow!("missing wal_redo_timeout"))?,
superuser: self.superuser.ok_or(anyhow::anyhow!("missing superuser"))?,
.ok_or(anyhow!("missing wal_redo_timeout"))?,
superuser: self.superuser.ok_or(anyhow!("missing superuser"))?,
page_cache_size: self
.page_cache_size
.ok_or(anyhow::anyhow!("missing page_cache_size"))?,
.ok_or(anyhow!("missing page_cache_size"))?,
max_file_descriptors: self
.max_file_descriptors
.ok_or(anyhow::anyhow!("missing max_file_descriptors"))?,
workdir: self.workdir.ok_or(anyhow::anyhow!("missing workdir"))?,
.ok_or(anyhow!("missing max_file_descriptors"))?,
workdir: self.workdir.ok_or(anyhow!("missing workdir"))?,
pg_distrib_dir: self
.pg_distrib_dir
.ok_or(anyhow::anyhow!("missing pg_distrib_dir"))?,
auth_type: self.auth_type.ok_or(anyhow::anyhow!("missing auth_type"))?,
.ok_or(anyhow!("missing pg_distrib_dir"))?,
auth_type: self.auth_type.ok_or(anyhow!("missing auth_type"))?,
auth_validation_public_key_path: self
.auth_validation_public_key_path
.ok_or(anyhow::anyhow!("missing auth_validation_public_key_path"))?,
.ok_or(anyhow!("missing auth_validation_public_key_path"))?,
remote_storage_config: self
.remote_storage_config
.ok_or(anyhow::anyhow!("missing remote_storage_config"))?,
id: self.id.ok_or(anyhow::anyhow!("missing id"))?,
.ok_or(anyhow!("missing remote_storage_config"))?,
id: self.id.ok_or(anyhow!("missing id"))?,
profiling: self.profiling.ok_or(anyhow!("missing profiling"))?,
// TenantConf is handled separately
default_tenant_conf: TenantConf::default(),
})
}
}
@@ -363,7 +325,7 @@ impl PageServerConfigBuilder {
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RemoteStorageConfig {
/// Max allowed number of concurrent sync operations between pageserver and the remote storage.
pub max_concurrent_sync: NonZeroUsize,
pub max_concurrent_timelines_sync: NonZeroUsize,
/// Max allowed errors before the sync task is considered failed and evicted.
pub max_sync_errors: NonZeroU32,
/// The storage connection configuration.
@@ -404,6 +366,9 @@ pub struct S3Config {
///
/// Example: `http://127.0.0.1:5000`
pub endpoint: Option<String>,
/// AWS S3 has various limits on its API calls, and we must not exceed them.
/// See [`defaults::DEFAULT_REMOTE_STORAGE_S3_CONCURRENCY_LIMIT`] for more details.
pub concurrency_limit: NonZeroUsize,
}
impl std::fmt::Debug for S3Config {
@@ -412,6 +377,7 @@ impl std::fmt::Debug for S3Config {
.field("bucket_name", &self.bucket_name)
.field("bucket_region", &self.bucket_region)
.field("prefix_in_bucket", &self.prefix_in_bucket)
.field("concurrency_limit", &self.concurrency_limit)
.finish()
}
}
@@ -457,20 +423,12 @@ impl PageServerConf {
let mut builder = PageServerConfigBuilder::default();
builder.workdir(workdir.to_owned());
let mut t_conf: TenantConfOpt = Default::default();
for (key, item) in toml.iter() {
match key {
"listen_pg_addr" => builder.listen_pg_addr(parse_toml_string(key, item)?),
"listen_http_addr" => builder.listen_http_addr(parse_toml_string(key, item)?),
"checkpoint_distance" => builder.checkpoint_distance(parse_toml_u64(key, item)?),
"compaction_target_size" => {
builder.compaction_target_size(parse_toml_u64(key, item)?)
}
"compaction_period" => builder.compaction_period(parse_toml_duration(key, item)?),
"compaction_threshold" => {
builder.compaction_threshold(parse_toml_u64(key, item)? as usize)
}
"gc_horizon" => builder.gc_horizon(parse_toml_u64(key, item)?),
"gc_period" => builder.gc_period(parse_toml_duration(key, item)?),
"wait_lsn_timeout" => builder.wait_lsn_timeout(parse_toml_duration(key, item)?),
"wal_redo_timeout" => builder.wal_redo_timeout(parse_toml_duration(key, item)?),
"initial_superuser_name" => builder.superuser(parse_toml_string(key, item)?),
@@ -484,12 +442,16 @@ impl PageServerConf {
"auth_validation_public_key_path" => builder.auth_validation_public_key_path(Some(
PathBuf::from(parse_toml_string(key, item)?),
)),
"auth_type" => builder.auth_type(parse_toml_auth_type(key, item)?),
"auth_type" => builder.auth_type(parse_toml_from_str(key, item)?),
"remote_storage" => {
builder.remote_storage_config(Some(Self::parse_remote_storage_config(item)?))
}
"tenant_config" => {
t_conf = Self::parse_toml_tenant_conf(item)?;
}
"id" => builder.id(ZNodeId(parse_toml_u64(key, item)?)),
_ => bail!("unrecognized pageserver option '{}'", key),
"profiling" => builder.profiling(parse_toml_from_str(key, item)?),
_ => bail!("unrecognized pageserver option '{key}'"),
}
}
@@ -515,41 +477,75 @@ impl PageServerConf {
);
}
conf.default_tenant_conf = t_conf.merge(TenantConf::default());
Ok(conf)
}
// subroutine of parse_and_validate to parse the `[tenant_config]` section
pub fn parse_toml_tenant_conf(item: &toml_edit::Item) -> Result<TenantConfOpt> {
let mut t_conf: TenantConfOpt = Default::default();
if let Some(checkpoint_distance) = item.get("checkpoint_distance") {
t_conf.checkpoint_distance =
Some(parse_toml_u64("checkpoint_distance", checkpoint_distance)?);
}
if let Some(compaction_target_size) = item.get("compaction_target_size") {
t_conf.compaction_target_size = Some(parse_toml_u64(
"compaction_target_size",
compaction_target_size,
)?);
}
if let Some(compaction_period) = item.get("compaction_period") {
t_conf.compaction_period =
Some(parse_toml_duration("compaction_period", compaction_period)?);
}
if let Some(compaction_threshold) = item.get("compaction_threshold") {
t_conf.compaction_threshold =
Some(parse_toml_u64("compaction_threshold", compaction_threshold)?.try_into()?);
}
if let Some(gc_horizon) = item.get("gc_horizon") {
t_conf.gc_horizon = Some(parse_toml_u64("gc_horizon", gc_horizon)?);
}
if let Some(gc_period) = item.get("gc_period") {
t_conf.gc_period = Some(parse_toml_duration("gc_period", gc_period)?);
}
if let Some(pitr_interval) = item.get("pitr_interval") {
t_conf.pitr_interval = Some(parse_toml_duration("pitr_interval", pitr_interval)?);
}
Ok(t_conf)
}
/// subroutine of parse_config(), to parse the `[remote_storage]` table.
fn parse_remote_storage_config(toml: &toml_edit::Item) -> anyhow::Result<RemoteStorageConfig> {
let local_path = toml.get("local_path");
let bucket_name = toml.get("bucket_name");
let bucket_region = toml.get("bucket_region");
let max_concurrent_sync: NonZeroUsize = if let Some(s) = toml.get("max_concurrent_sync") {
parse_toml_u64("max_concurrent_sync", s)
.and_then(|toml_u64| {
toml_u64.try_into().with_context(|| {
format!("'max_concurrent_sync' value {} is too large", toml_u64)
})
})
.ok()
.and_then(NonZeroUsize::new)
.context("'max_concurrent_sync' must be a non-zero positive integer")?
} else {
NonZeroUsize::new(defaults::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNC).unwrap()
};
let max_sync_errors: NonZeroU32 = if let Some(s) = toml.get("max_sync_errors") {
parse_toml_u64("max_sync_errors", s)
.and_then(|toml_u64| {
toml_u64.try_into().with_context(|| {
format!("'max_sync_errors' value {} is too large", toml_u64)
})
})
.ok()
.and_then(NonZeroU32::new)
.context("'max_sync_errors' must be a non-zero positive integer")?
} else {
NonZeroU32::new(defaults::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS).unwrap()
};
let max_concurrent_timelines_sync = NonZeroUsize::new(
parse_optional_integer("max_concurrent_timelines_sync", toml)?
.unwrap_or(defaults::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_TIMELINES_SYNC),
)
.context("Failed to parse 'max_concurrent_timelines_sync' as a positive integer")?;
let max_sync_errors = NonZeroU32::new(
parse_optional_integer("max_sync_errors", toml)?
.unwrap_or(defaults::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS),
)
.context("Failed to parse 'max_sync_errors' as a positive integer")?;
let concurrency_limit = NonZeroUsize::new(
parse_optional_integer("concurrency_limit", toml)?
.unwrap_or(defaults::DEFAULT_REMOTE_STORAGE_S3_CONCURRENCY_LIMIT),
)
.context("Failed to parse 'concurrency_limit' as a positive integer")?;
let storage = match (local_path, bucket_name, bucket_region) {
(None, None, None) => bail!("no 'local_path' nor 'bucket_name' option"),
@@ -580,6 +576,7 @@ impl PageServerConf {
.get("endpoint")
.map(|endpoint| parse_toml_string("endpoint", endpoint))
.transpose()?,
concurrency_limit,
}),
(Some(local_path), None, None) => RemoteStorageKind::LocalFs(PathBuf::from(
parse_toml_string("local_path", local_path)?,
@@ -588,7 +585,7 @@ impl PageServerConf {
};
Ok(RemoteStorageConfig {
max_concurrent_sync,
max_concurrent_timelines_sync,
max_sync_errors,
storage,
})
@@ -596,19 +593,13 @@ impl PageServerConf {
#[cfg(test)]
pub fn test_repo_dir(test_name: &str) -> PathBuf {
PathBuf::from(format!("../tmp_check/test_{}", test_name))
PathBuf::from(format!("../tmp_check/test_{test_name}"))
}
#[cfg(test)]
pub fn dummy_conf(repo_dir: PathBuf) -> Self {
PageServerConf {
id: ZNodeId(0),
checkpoint_distance: defaults::DEFAULT_CHECKPOINT_DISTANCE,
compaction_target_size: 4 * 1024 * 1024,
compaction_period: Duration::from_secs(10),
compaction_threshold: defaults::DEFAULT_COMPACTION_THRESHOLD,
gc_horizon: defaults::DEFAULT_GC_HORIZON,
gc_period: Duration::from_secs(10),
wait_lsn_timeout: Duration::from_secs(60),
wal_redo_timeout: Duration::from_secs(60),
page_cache_size: defaults::DEFAULT_PAGE_CACHE_SIZE,
@@ -621,6 +612,8 @@ impl PageServerConf {
auth_type: AuthType::Trust,
auth_validation_public_key_path: None,
remote_storage_config: None,
profiling: ProfilingConfig::Disabled,
default_tenant_conf: TenantConf::dummy_conf(),
}
}
}
@@ -630,7 +623,7 @@ impl PageServerConf {
fn parse_toml_string(name: &str, item: &Item) -> Result<String> {
let s = item
.as_str()
.with_context(|| format!("configure option {} is not a string", name))?;
.with_context(|| format!("configure option {name} is not a string"))?;
Ok(s.to_string())
}
@@ -639,26 +632,46 @@ fn parse_toml_u64(name: &str, item: &Item) -> Result<u64> {
// for our use, though.
let i: i64 = item
.as_integer()
.with_context(|| format!("configure option {} is not an integer", name))?;
.with_context(|| format!("configure option {name} is not an integer"))?;
if i < 0 {
bail!("configure option {} cannot be negative", name);
bail!("configure option {name} cannot be negative");
}
Ok(i as u64)
}
fn parse_optional_integer<I, E>(name: &str, item: &toml_edit::Item) -> anyhow::Result<Option<I>>
where
I: TryFrom<i64, Error = E>,
E: std::error::Error + Send + Sync + 'static,
{
let toml_integer = match item.get(name) {
Some(item) => item
.as_integer()
.with_context(|| format!("configure option {name} is not an integer"))?,
None => return Ok(None),
};
I::try_from(toml_integer)
.map(Some)
.with_context(|| format!("configure option {name} is too large"))
}
fn parse_toml_duration(name: &str, item: &Item) -> Result<Duration> {
let s = item
.as_str()
.with_context(|| format!("configure option {} is not a string", name))?;
.with_context(|| format!("configure option {name} is not a string"))?;
Ok(humantime::parse_duration(s)?)
}
fn parse_toml_auth_type(name: &str, item: &Item) -> Result<AuthType> {
fn parse_toml_from_str<T>(name: &str, item: &Item) -> Result<T>
where
T: FromStr<Err = anyhow::Error>,
{
let v = item
.as_str()
.with_context(|| format!("configure option {} is not a string", name))?;
AuthType::from_str(v)
.with_context(|| format!("configure option {name} is not a string"))?;
T::from_str(v)
}
#[cfg(test)]
@@ -675,15 +688,6 @@ mod tests {
listen_pg_addr = '127.0.0.1:64000'
listen_http_addr = '127.0.0.1:9898'
checkpoint_distance = 111 # in bytes
compaction_target_size = 111 # in bytes
compaction_period = '111 s'
compaction_threshold = 2
gc_period = '222 s'
gc_horizon = 222
wait_lsn_timeout = '111 s'
wal_redo_timeout = '111 s'
@@ -704,10 +708,8 @@ id = 10
let config_string = format!("pg_distrib_dir='{}'\nid=10", pg_distrib_dir.display());
let toml = config_string.parse()?;
let parsed_config =
PageServerConf::parse_and_validate(&toml, &workdir).unwrap_or_else(|e| {
panic!("Failed to parse config '{}', reason: {}", config_string, e)
});
let parsed_config = PageServerConf::parse_and_validate(&toml, &workdir)
.unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e}"));
assert_eq!(
parsed_config,
@@ -715,12 +717,6 @@ id = 10
id: ZNodeId(10),
listen_pg_addr: defaults::DEFAULT_PG_LISTEN_ADDR.to_string(),
listen_http_addr: defaults::DEFAULT_HTTP_LISTEN_ADDR.to_string(),
checkpoint_distance: defaults::DEFAULT_CHECKPOINT_DISTANCE,
compaction_target_size: defaults::DEFAULT_COMPACTION_TARGET_SIZE,
compaction_period: humantime::parse_duration(defaults::DEFAULT_COMPACTION_PERIOD)?,
compaction_threshold: defaults::DEFAULT_COMPACTION_THRESHOLD,
gc_horizon: defaults::DEFAULT_GC_HORIZON,
gc_period: humantime::parse_duration(defaults::DEFAULT_GC_PERIOD)?,
wait_lsn_timeout: humantime::parse_duration(defaults::DEFAULT_WAIT_LSN_TIMEOUT)?,
wal_redo_timeout: humantime::parse_duration(defaults::DEFAULT_WAL_REDO_TIMEOUT)?,
superuser: defaults::DEFAULT_SUPERUSER.to_string(),
@@ -731,6 +727,8 @@ id = 10
auth_type: AuthType::Trust,
auth_validation_public_key_path: None,
remote_storage_config: None,
profiling: ProfilingConfig::Disabled,
default_tenant_conf: TenantConf::default(),
},
"Correct defaults should be used when no config values are provided"
);
@@ -744,16 +742,13 @@ id = 10
let (workdir, pg_distrib_dir) = prepare_fs(&tempdir)?;
let config_string = format!(
"{}pg_distrib_dir='{}'",
ALL_BASE_VALUES_TOML,
"{ALL_BASE_VALUES_TOML}pg_distrib_dir='{}'",
pg_distrib_dir.display()
);
let toml = config_string.parse()?;
let parsed_config =
PageServerConf::parse_and_validate(&toml, &workdir).unwrap_or_else(|e| {
panic!("Failed to parse config '{}', reason: {}", config_string, e)
});
let parsed_config = PageServerConf::parse_and_validate(&toml, &workdir)
.unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e}"));
assert_eq!(
parsed_config,
@@ -761,12 +756,6 @@ id = 10
id: ZNodeId(10),
listen_pg_addr: "127.0.0.1:64000".to_string(),
listen_http_addr: "127.0.0.1:9898".to_string(),
checkpoint_distance: 111,
compaction_target_size: 111,
compaction_period: Duration::from_secs(111),
compaction_threshold: 2,
gc_horizon: 222,
gc_period: Duration::from_secs(222),
wait_lsn_timeout: Duration::from_secs(111),
wal_redo_timeout: Duration::from_secs(111),
superuser: "zzzz".to_string(),
@@ -777,6 +766,8 @@ id = 10
auth_type: AuthType::Trust,
auth_validation_public_key_path: None,
remote_storage_config: None,
profiling: ProfilingConfig::Disabled,
default_tenant_conf: TenantConf::default(),
},
"Should be able to parse all basic config values correctly"
);
@@ -805,37 +796,33 @@ local_path = '{}'"#,
for remote_storage_config_str in identical_toml_declarations {
let config_string = format!(
r#"{}
r#"{ALL_BASE_VALUES_TOML}
pg_distrib_dir='{}'
{}"#,
ALL_BASE_VALUES_TOML,
{remote_storage_config_str}"#,
pg_distrib_dir.display(),
remote_storage_config_str,
);
let toml = config_string.parse()?;
let parsed_remote_storage_config = PageServerConf::parse_and_validate(&toml, &workdir)
.unwrap_or_else(|e| {
panic!("Failed to parse config '{}', reason: {}", config_string, e)
})
.unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e}"))
.remote_storage_config
.expect("Should have remote storage config for the local FS");
assert_eq!(
parsed_remote_storage_config,
RemoteStorageConfig {
max_concurrent_sync: NonZeroUsize::new(
defaults::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNC
)
.unwrap(),
max_sync_errors: NonZeroU32::new(defaults::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS)
parsed_remote_storage_config,
RemoteStorageConfig {
max_concurrent_timelines_sync: NonZeroUsize::new(
defaults::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_TIMELINES_SYNC
)
.unwrap(),
storage: RemoteStorageKind::LocalFs(local_storage_path.clone()),
},
"Remote storage config should correctly parse the local FS config and fill other storage defaults"
);
max_sync_errors: NonZeroU32::new(defaults::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS)
.unwrap(),
storage: RemoteStorageKind::LocalFs(local_storage_path.clone()),
},
"Remote storage config should correctly parse the local FS config and fill other storage defaults"
);
}
Ok(())
}
@@ -851,52 +838,49 @@ pg_distrib_dir='{}'
let access_key_id = "SOMEKEYAAAAASADSAH*#".to_string();
let secret_access_key = "SOMEsEcReTsd292v".to_string();
let endpoint = "http://localhost:5000".to_string();
let max_concurrent_sync = NonZeroUsize::new(111).unwrap();
let max_concurrent_timelines_sync = NonZeroUsize::new(111).unwrap();
let max_sync_errors = NonZeroU32::new(222).unwrap();
let s3_concurrency_limit = NonZeroUsize::new(333).unwrap();
let identical_toml_declarations = &[
format!(
r#"[remote_storage]
max_concurrent_sync = {}
max_sync_errors = {}
bucket_name = '{}'
bucket_region = '{}'
prefix_in_bucket = '{}'
access_key_id = '{}'
secret_access_key = '{}'
endpoint = '{}'"#,
max_concurrent_sync, max_sync_errors, bucket_name, bucket_region, prefix_in_bucket, access_key_id, secret_access_key, endpoint
max_concurrent_timelines_sync = {max_concurrent_timelines_sync}
max_sync_errors = {max_sync_errors}
bucket_name = '{bucket_name}'
bucket_region = '{bucket_region}'
prefix_in_bucket = '{prefix_in_bucket}'
access_key_id = '{access_key_id}'
secret_access_key = '{secret_access_key}'
endpoint = '{endpoint}'
concurrency_limit = {s3_concurrency_limit}"#
),
format!(
"remote_storage={{max_concurrent_sync={}, max_sync_errors={}, bucket_name='{}', bucket_region='{}', prefix_in_bucket='{}', access_key_id='{}', secret_access_key='{}', endpoint='{}'}}",
max_concurrent_sync, max_sync_errors, bucket_name, bucket_region, prefix_in_bucket, access_key_id, secret_access_key, endpoint
"remote_storage={{max_concurrent_timelines_sync={max_concurrent_timelines_sync}, max_sync_errors={max_sync_errors}, bucket_name='{bucket_name}',\
bucket_region='{bucket_region}', prefix_in_bucket='{prefix_in_bucket}', access_key_id='{access_key_id}', secret_access_key='{secret_access_key}', endpoint='{endpoint}', concurrency_limit={s3_concurrency_limit}}}",
),
];
for remote_storage_config_str in identical_toml_declarations {
let config_string = format!(
r#"{}
r#"{ALL_BASE_VALUES_TOML}
pg_distrib_dir='{}'
{}"#,
ALL_BASE_VALUES_TOML,
{remote_storage_config_str}"#,
pg_distrib_dir.display(),
remote_storage_config_str,
);
let toml = config_string.parse()?;
let parsed_remote_storage_config = PageServerConf::parse_and_validate(&toml, &workdir)
.unwrap_or_else(|e| {
panic!("Failed to parse config '{}', reason: {}", config_string, e)
})
.unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e}"))
.remote_storage_config
.expect("Should have remote storage config for S3");
assert_eq!(
parsed_remote_storage_config,
RemoteStorageConfig {
max_concurrent_sync,
max_concurrent_timelines_sync,
max_sync_errors,
storage: RemoteStorageKind::AwsS3(S3Config {
bucket_name: bucket_name.clone(),
@@ -904,7 +888,8 @@ pg_distrib_dir='{}'
access_key_id: Some(access_key_id.clone()),
secret_access_key: Some(secret_access_key.clone()),
prefix_in_bucket: Some(prefix_in_bucket.clone()),
endpoint: Some(endpoint.clone())
endpoint: Some(endpoint.clone()),
concurrency_limit: s3_concurrency_limit,
}),
},
"Remote storage config should correctly parse the S3 config"

View File

@@ -1,6 +1,6 @@
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DisplayFromStr};
use zenith_utils::{
use utils::{
lsn::Lsn,
zid::{ZNodeId, ZTenantId, ZTimelineId},
};
@@ -20,11 +20,19 @@ pub struct TimelineCreateRequest {
}
#[serde_as]
#[derive(Serialize, Deserialize)]
#[derive(Serialize, Deserialize, Default)]
pub struct TenantCreateRequest {
#[serde(default)]
#[serde_as(as = "Option<DisplayFromStr>")]
pub new_tenant_id: Option<ZTenantId>,
pub checkpoint_distance: Option<u64>,
pub compaction_target_size: Option<u64>,
pub compaction_period: Option<String>,
pub compaction_threshold: Option<usize>,
pub gc_horizon: Option<u64>,
pub gc_period: Option<String>,
pub image_creation_threshold: Option<usize>,
pub pitr_interval: Option<String>,
}
#[serde_as]
@@ -36,3 +44,44 @@ pub struct TenantCreateResponse(#[serde_as(as = "DisplayFromStr")] pub ZTenantId
pub struct StatusResponse {
pub id: ZNodeId,
}
impl TenantCreateRequest {
pub fn new(new_tenant_id: Option<ZTenantId>) -> TenantCreateRequest {
TenantCreateRequest {
new_tenant_id,
..Default::default()
}
}
}
#[serde_as]
#[derive(Serialize, Deserialize)]
pub struct TenantConfigRequest {
pub tenant_id: ZTenantId,
#[serde(default)]
#[serde_as(as = "Option<DisplayFromStr>")]
pub checkpoint_distance: Option<u64>,
pub compaction_target_size: Option<u64>,
pub compaction_period: Option<String>,
pub compaction_threshold: Option<usize>,
pub gc_horizon: Option<u64>,
pub gc_period: Option<String>,
pub image_creation_threshold: Option<usize>,
pub pitr_interval: Option<String>,
}
impl TenantConfigRequest {
pub fn new(tenant_id: ZTenantId) -> TenantConfigRequest {
TenantConfigRequest {
tenant_id,
checkpoint_distance: None,
compaction_target_size: None,
compaction_period: None,
compaction_threshold: None,
gc_horizon: None,
gc_period: None,
image_creation_threshold: None,
pitr_interval: None,
}
}
}
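A sketch of how a partial config update could be built from these types and serialized into the JSON body the management API expects (the chosen fields and values are arbitrary; serde_json is assumed to be available):

// Illustrative: serialize a partial tenant config update.
fn tenant_config_body(tenant_id: ZTenantId) -> serde_json::Result<String> {
    let mut req = TenantConfigRequest::new(tenant_id);
    req.gc_period = Some("100 s".to_string());
    req.compaction_threshold = Some(10);
    serde_json::to_string(&req)
}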

View File

@@ -328,11 +328,7 @@ paths:
content:
application/json:
schema:
type: object
properties:
new_tenant_id:
type: string
format: hex
$ref: "#/components/schemas/TenantCreateInfo"
responses:
"201":
description: New tenant created successfully
@@ -371,7 +367,48 @@ paths:
application/json:
schema:
$ref: "#/components/schemas/Error"
/v1/tenant/config:
put:
description: |
Update tenant's config.
requestBody:
content:
application/json:
schema:
$ref: "#/components/schemas/TenantConfigInfo"
responses:
"200":
description: OK
content:
application/json:
schema:
type: array
items:
$ref: "#/components/schemas/TenantInfo"
"400":
description: Malformed tenant config request
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
"401":
description: Unauthorized Error
content:
application/json:
schema:
$ref: "#/components/schemas/UnauthorizedError"
"403":
description: Forbidden Error
content:
application/json:
schema:
$ref: "#/components/schemas/ForbiddenError"
"500":
description: Generic operation error
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
components:
securitySchemes:
JWT:
@@ -389,6 +426,45 @@ components:
type: string
state:
type: string
TenantCreateInfo:
type: object
properties:
new_tenant_id:
type: string
format: hex
tenant_id:
type: string
format: hex
gc_period:
type: string
gc_horizon:
type: integer
pitr_interval:
type: string
checkpoint_distance:
type: integer
compaction_period:
type: string
compaction_threshold:
type: integer
TenantConfigInfo:
type: object
properties:
tenant_id:
type: string
format: hex
gc_period:
type: string
gc_horizon:
type: integer
pitr_interval:
type: string
checkpoint_distance:
type: integer
compaction_period:
type: string
compaction_threshold:
type: integer
TimelineInfo:
type: object
required:
@@ -409,6 +485,7 @@ components:
type: object
required:
- awaits_download
- remote_consistent_lsn
properties:
awaits_download:
type: boolean

View File

@@ -1,36 +1,45 @@
use std::sync::Arc;
use anyhow::Result;
use anyhow::{Context, Result};
use hyper::StatusCode;
use hyper::{Body, Request, Response, Uri};
use tracing::*;
use zenith_utils::auth::JwtAuth;
use zenith_utils::http::endpoint::attach_openapi_ui;
use zenith_utils::http::endpoint::auth_middleware;
use zenith_utils::http::endpoint::check_permission;
use zenith_utils::http::error::ApiError;
use zenith_utils::http::{
endpoint,
error::HttpErrorBody,
json::{json_request, json_response},
request::parse_request_param,
};
use zenith_utils::http::{RequestExt, RouterBuilder};
use zenith_utils::zid::{ZTenantTimelineId, ZTimelineId};
use super::models::{
StatusResponse, TenantCreateRequest, TenantCreateResponse, TimelineCreateRequest,
StatusResponse, TenantConfigRequest, TenantCreateRequest, TenantCreateResponse,
TimelineCreateRequest,
};
use crate::config::RemoteStorageKind;
use crate::remote_storage::{
download_index_part, schedule_timeline_download, LocalFs, RemoteIndex, RemoteTimeline, S3Bucket,
};
use crate::remote_storage::{schedule_timeline_download, RemoteIndex};
use crate::repository::Repository;
use crate::tenant_config::TenantConfOpt;
use crate::timelines::{LocalTimelineInfo, RemoteTimelineInfo, TimelineInfo};
use crate::{config::PageServerConf, tenant_mgr, timelines, ZTenantId};
use crate::{config::PageServerConf, tenant_mgr, timelines};
use utils::{
auth::JwtAuth,
http::{
endpoint::{self, attach_openapi_ui, auth_middleware, check_permission},
error::{ApiError, HttpErrorBody},
json::{json_request, json_response},
request::parse_request_param,
RequestExt, RouterBuilder,
},
zid::{ZTenantId, ZTenantTimelineId, ZTimelineId},
};
struct State {
conf: &'static PageServerConf,
auth: Option<Arc<JwtAuth>>,
remote_index: RemoteIndex,
allowlist_routes: Vec<Uri>,
remote_storage: Option<GenericRemoteStorage>,
}
enum GenericRemoteStorage {
Local(LocalFs),
S3(S3Bucket),
}
impl State {
@@ -38,17 +47,34 @@ impl State {
conf: &'static PageServerConf,
auth: Option<Arc<JwtAuth>>,
remote_index: RemoteIndex,
) -> Self {
) -> anyhow::Result<Self> {
let allowlist_routes = ["/v1/status", "/v1/doc", "/swagger.yml"]
.iter()
.map(|v| v.parse().unwrap())
.collect::<Vec<_>>();
Self {
// Note that this remote storage is created separately from the main one used in the sync_loop.
// That's fine: it's stateless, and a bit of code duplication saves us from bloating the surrounding code with generics.
let remote_storage = conf
.remote_storage_config
.as_ref()
.map(|storage_config| match &storage_config.storage {
RemoteStorageKind::LocalFs(root) => {
LocalFs::new(root.clone(), &conf.workdir).map(GenericRemoteStorage::Local)
}
RemoteStorageKind::AwsS3(s3_config) => {
S3Bucket::new(s3_config, &conf.workdir).map(GenericRemoteStorage::S3)
}
})
.transpose()
.context("Failed to init generic remote storage")?;
Ok(Self {
conf,
auth,
allowlist_routes,
remote_index,
}
remote_storage,
})
}
}
@@ -122,8 +148,8 @@ async fn timeline_list_handler(request: Request<Body>) -> Result<Response<Body>,
timeline_id,
})
.map(|remote_entry| RemoteTimelineInfo {
remote_consistent_lsn: remote_entry.disk_consistent_lsn(),
awaits_download: remote_entry.get_awaits_download(),
remote_consistent_lsn: remote_entry.metadata.disk_consistent_lsn(),
awaits_download: remote_entry.awaits_download,
}),
})
}
@@ -153,43 +179,47 @@ async fn timeline_detail_handler(request: Request<Body>) -> Result<Response<Body
let timeline_id: ZTimelineId = parse_request_param(&request, "timeline_id")?;
let include_non_incremental_logical_size = get_include_non_incremental_logical_size(&request);
let span = info_span!("timeline_detail_handler", tenant = %tenant_id, timeline = %timeline_id);
let (local_timeline_info, remote_timeline_info) = async {
// any error here will render local timeline as None
// XXX .in_current_span does not attach messages in spawn_blocking future to current future's span
let local_timeline_info = tokio::task::spawn_blocking(move || {
let repo = tenant_mgr::get_repository_for_tenant(tenant_id)?;
let local_timeline = {
repo.get_timeline(timeline_id)
.as_ref()
.map(|timeline| {
LocalTimelineInfo::from_repo_timeline(
tenant_id,
timeline_id,
timeline,
include_non_incremental_logical_size,
)
})
.transpose()?
};
Ok::<_, anyhow::Error>(local_timeline)
})
.await
.ok()
.and_then(|r| r.ok())
.flatten();
let (local_timeline_info, span) = tokio::task::spawn_blocking(move || {
let entered = span.entered();
let repo = tenant_mgr::get_repository_for_tenant(tenant_id)?;
let local_timeline = {
repo.get_timeline(timeline_id)
.as_ref()
.map(|timeline| {
LocalTimelineInfo::from_repo_timeline(
tenant_id,
timeline_id,
timeline,
include_non_incremental_logical_size,
)
let remote_timeline_info = {
let remote_index_read = get_state(&request).remote_index.read().await;
remote_index_read
.timeline_entry(&ZTenantTimelineId {
tenant_id,
timeline_id,
})
.map(|remote_entry| RemoteTimelineInfo {
remote_consistent_lsn: remote_entry.metadata.disk_consistent_lsn(),
awaits_download: remote_entry.awaits_download,
})
.transpose()?
};
Ok::<_, anyhow::Error>((local_timeline, entered.exit()))
})
.await
.map_err(ApiError::from_err)??;
let remote_timeline_info = {
let remote_index_read = get_state(&request).remote_index.read().await;
remote_index_read
.timeline_entry(&ZTenantTimelineId {
tenant_id,
timeline_id,
})
.map(|remote_entry| RemoteTimelineInfo {
remote_consistent_lsn: remote_entry.disk_consistent_lsn(),
awaits_download: remote_entry.get_awaits_download(),
})
};
let _enter = span.entered();
(local_timeline_info, remote_timeline_info)
}
.instrument(info_span!("timeline_detail_handler", tenant = %tenant_id, timeline = %timeline_id))
.await;
if local_timeline_info.is_none() && remote_timeline_info.is_none() {
return Err(ApiError::NotFound(
@@ -212,41 +242,105 @@ async fn timeline_attach_handler(request: Request<Body>) -> Result<Response<Body
check_permission(&request, Some(tenant_id))?;
let timeline_id: ZTimelineId = parse_request_param(&request, "timeline_id")?;
let span = info_span!("timeline_attach_handler", tenant = %tenant_id, timeline = %timeline_id);
info!(
"Handling timeline {} attach for tenant: {}",
timeline_id, tenant_id,
);
let span = tokio::task::spawn_blocking(move || {
let entered = span.entered();
if tenant_mgr::get_timeline_for_tenant_load(tenant_id, timeline_id).is_ok() {
tokio::task::spawn_blocking(move || {
if tenant_mgr::get_local_timeline_with_load(tenant_id, timeline_id).is_ok() {
// TODO: maybe answer with 304 Not Modified here?
anyhow::bail!("Timeline is already present locally")
};
Ok(entered.exit())
Ok(())
})
.await
.map_err(ApiError::from_err)??;
let mut remote_index_write = get_state(&request).remote_index.write().await;
let sync_id = ZTenantTimelineId {
tenant_id,
timeline_id,
};
let state = get_state(&request);
let remote_index = &state.remote_index;
let _enter = span.entered(); // entered guard cannot live across awaits (non Send)
let index_entry = remote_index_write
.timeline_entry_mut(&ZTenantTimelineId {
tenant_id,
timeline_id,
})
.ok_or_else(|| ApiError::NotFound("Unknown remote timeline".to_string()))?;
let mut index_accessor = remote_index.write().await;
if let Some(remote_timeline) = index_accessor.timeline_entry_mut(&sync_id) {
if remote_timeline.awaits_download {
return Err(ApiError::Conflict(
"Timeline download is already in progress".to_string(),
));
}
if index_entry.get_awaits_download() {
return Err(ApiError::Conflict(
"Timeline download is already in progress".to_string(),
));
remote_timeline.awaits_download = true;
schedule_timeline_download(tenant_id, timeline_id);
return json_response(StatusCode::ACCEPTED, ());
} else {
// no timeline in the index, so release the lock before the potentially lengthy download operation
drop(index_accessor);
}
index_entry.set_awaits_download(true);
schedule_timeline_download(tenant_id, timeline_id);
let new_timeline = match try_download_shard_data(state, sync_id).await {
Ok(Some(mut new_timeline)) => {
tokio::fs::create_dir_all(state.conf.timeline_path(&timeline_id, &tenant_id))
.await
.context("Failed to create new timeline directory")?;
new_timeline.awaits_download = true;
new_timeline
}
Ok(None) => return Err(ApiError::NotFound("Unknown remote timeline".to_string())),
Err(e) => {
error!("Failed to retrieve remote timeline data: {:?}", e);
return Err(ApiError::NotFound(
"Failed to retrieve remote timeline".to_string(),
));
}
};
let mut index_accessor = remote_index.write().await;
match index_accessor.timeline_entry_mut(&sync_id) {
Some(remote_timeline) => {
if remote_timeline.awaits_download {
return Err(ApiError::Conflict(
"Timeline download is already in progress".to_string(),
));
}
remote_timeline.awaits_download = true;
}
None => index_accessor.add_timeline_entry(sync_id, new_timeline),
}
schedule_timeline_download(tenant_id, timeline_id);
json_response(StatusCode::ACCEPTED, ())
}
async fn try_download_shard_data(
state: &State,
sync_id: ZTenantTimelineId,
) -> anyhow::Result<Option<RemoteTimeline>> {
let shard = match state.remote_storage.as_ref() {
Some(GenericRemoteStorage::Local(local_storage)) => {
download_index_part(state.conf, local_storage, sync_id).await
}
Some(GenericRemoteStorage::S3(s3_storage)) => {
download_index_part(state.conf, s3_storage, sync_id).await
}
None => return Ok(None),
}
.with_context(|| format!("Failed to download index shard for timeline {}", sync_id))?;
let timeline_path = state
.conf
.timeline_path(&sync_id.timeline_id, &sync_id.tenant_id);
RemoteTimeline::from_index_part(&timeline_path, shard)
.map(Some)
.with_context(|| {
format!(
"Failed to convert index shard into remote timeline for timeline {}",
sync_id
)
})
}
async fn timeline_detach_handler(request: Request<Body>) -> Result<Response<Body>, ApiError> {
let tenant_id: ZTenantId = parse_request_param(&request, "tenant_id")?;
check_permission(&request, Some(tenant_id))?;
@@ -257,8 +351,8 @@ async fn timeline_detach_handler(request: Request<Body>) -> Result<Response<Body
let _enter =
info_span!("timeline_detach_handler", tenant = %tenant_id, timeline = %timeline_id)
.entered();
let repo = tenant_mgr::get_repository_for_tenant(tenant_id)?;
repo.detach_timeline(timeline_id)
let state = get_state(&request);
tenant_mgr::detach_timeline(state.conf, tenant_id, timeline_id)
})
.await
.map_err(ApiError::from_err)??;
@@ -275,7 +369,7 @@ async fn tenant_list_handler(request: Request<Body>) -> Result<Response<Body>, A
crate::tenant_mgr::list_tenants()
})
.await
.map_err(ApiError::from_err)??;
.map_err(ApiError::from_err)?;
json_response(StatusCode::OK, response_data)
}
@@ -287,6 +381,28 @@ async fn tenant_create_handler(mut request: Request<Body>) -> Result<Response<Bo
let request_data: TenantCreateRequest = json_request(&mut request).await?;
let remote_index = get_state(&request).remote_index.clone();
let mut tenant_conf = TenantConfOpt::default();
if let Some(gc_period) = request_data.gc_period {
tenant_conf.gc_period =
Some(humantime::parse_duration(&gc_period).map_err(ApiError::from_err)?);
}
tenant_conf.gc_horizon = request_data.gc_horizon;
tenant_conf.image_creation_threshold = request_data.image_creation_threshold;
if let Some(pitr_interval) = request_data.pitr_interval {
tenant_conf.pitr_interval =
Some(humantime::parse_duration(&pitr_interval).map_err(ApiError::from_err)?);
}
tenant_conf.checkpoint_distance = request_data.checkpoint_distance;
tenant_conf.compaction_target_size = request_data.compaction_target_size;
tenant_conf.compaction_threshold = request_data.compaction_threshold;
if let Some(compaction_period) = request_data.compaction_period {
tenant_conf.compaction_period =
Some(humantime::parse_duration(&compaction_period).map_err(ApiError::from_err)?);
}
let target_tenant_id = request_data
.new_tenant_id
.map(ZTenantId::from)
@@ -294,8 +410,9 @@ async fn tenant_create_handler(mut request: Request<Body>) -> Result<Response<Bo
let new_tenant_id = tokio::task::spawn_blocking(move || {
let _enter = info_span!("tenant_create", tenant = ?target_tenant_id).entered();
let conf = get_config(&request);
tenant_mgr::create_tenant_repository(get_config(&request), target_tenant_id, remote_index)
tenant_mgr::create_tenant_repository(conf, tenant_conf, target_tenant_id, remote_index)
})
.await
.map_err(ApiError::from_err)??;
@@ -306,6 +423,45 @@ async fn tenant_create_handler(mut request: Request<Body>) -> Result<Response<Bo
})
}
async fn tenant_config_handler(mut request: Request<Body>) -> Result<Response<Body>, ApiError> {
let request_data: TenantConfigRequest = json_request(&mut request).await?;
let tenant_id = request_data.tenant_id;
// check for management permission
check_permission(&request, Some(tenant_id))?;
let mut tenant_conf: TenantConfOpt = Default::default();
if let Some(gc_period) = request_data.gc_period {
tenant_conf.gc_period =
Some(humantime::parse_duration(&gc_period).map_err(ApiError::from_err)?);
}
tenant_conf.gc_horizon = request_data.gc_horizon;
tenant_conf.image_creation_threshold = request_data.image_creation_threshold;
if let Some(pitr_interval) = request_data.pitr_interval {
tenant_conf.pitr_interval =
Some(humantime::parse_duration(&pitr_interval).map_err(ApiError::from_err)?);
}
tenant_conf.checkpoint_distance = request_data.checkpoint_distance;
tenant_conf.compaction_target_size = request_data.compaction_target_size;
tenant_conf.compaction_threshold = request_data.compaction_threshold;
if let Some(compaction_period) = request_data.compaction_period {
tenant_conf.compaction_period =
Some(humantime::parse_duration(&compaction_period).map_err(ApiError::from_err)?);
}
tokio::task::spawn_blocking(move || {
let _enter = info_span!("tenant_config", tenant = ?tenant_id).entered();
tenant_mgr::update_tenant_config(tenant_conf, tenant_id)
})
.await
.map_err(ApiError::from_err)??;
json_response(StatusCode::OK, ())
}
async fn handler_404(_: Request<Body>) -> Result<Response<Body>, ApiError> {
json_response(
StatusCode::NOT_FOUND,
@@ -317,7 +473,7 @@ pub fn make_router(
conf: &'static PageServerConf,
auth: Option<Arc<JwtAuth>>,
remote_index: RemoteIndex,
) -> RouterBuilder<hyper::Body, ApiError> {
) -> anyhow::Result<RouterBuilder<hyper::Body, ApiError>> {
let spec = include_bytes!("openapi_spec.yml");
let mut router = attach_openapi_ui(endpoint::make_router(), spec, "/swagger.yml", "/v1/doc");
if auth.is_some() {
@@ -331,11 +487,14 @@ pub fn make_router(
}))
}
router
.data(Arc::new(State::new(conf, auth, remote_index)))
Ok(router
.data(Arc::new(
State::new(conf, auth, remote_index).context("Failed to initialize router state")?,
))
.get("/v1/status", status_handler)
.get("/v1/tenant", tenant_list_handler)
.post("/v1/tenant", tenant_create_handler)
.put("/v1/tenant/config", tenant_config_handler)
.get("/v1/tenant/:tenant_id/timeline", timeline_list_handler)
.post("/v1/tenant/:tenant_id/timeline", timeline_create_handler)
.get(
@@ -350,5 +509,5 @@ pub fn make_router(
"/v1/tenant/:tenant_id/timeline/:timeline_id/detach",
timeline_detach_handler,
)
.any(handler_404)
.any(handler_404))
}
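
The duration knobs handled above (gc_period, pitr_interval, compaction_period) are parsed with humantime::parse_duration, so clients send human-readable duration strings. A minimal illustration, assuming only the humantime crate as a dependency:

use std::time::Duration;

fn main() {
    assert_eq!(humantime::parse_duration("100s").unwrap(), Duration::from_secs(100));
    assert_eq!(
        humantime::parse_duration("2h 37min").unwrap(),
        Duration::from_secs(2 * 3600 + 37 * 60)
    );
    // Strings that don't parse are rejected and surface as an API error in the handlers.
    assert!(humantime::parse_duration("not-a-duration").is_err());
}
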


@@ -20,7 +20,7 @@ use postgres_ffi::waldecoder::*;
use postgres_ffi::xlog_utils::*;
use postgres_ffi::{pg_constants, ControlFileData, DBState_DB_SHUTDOWNED};
use postgres_ffi::{Oid, TransactionId};
use zenith_utils::lsn::Lsn;
use utils::lsn::Lsn;
///
/// Import all relation data pages from local disk into the repository.

File diff suppressed because it is too large


@@ -1,12 +1,20 @@
//!
//! Functions for reading and writing variable-sized "blobs".
//!
//! Each blob begins with a 4-byte length, followed by the actual data.
//! Each blob begins with a 1- or 4-byte length field, followed by the
//! actual data. If the length is smaller than 128, the length is
//! written as a single byte. If it is 128 or larger, the length is
//! written as a four-byte integer, in big-endian, with the high
//! bit set. This way, we can tell whether it is a 1- or 4-byte header
//! by peeking at the first byte.
//!
//! len < 128: 0XXXXXXX
//! len >= 128: 1XXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
//!
use crate::layered_repository::block_io::{BlockCursor, BlockReader};
use crate::page_cache::PAGE_SZ;
use std::cmp::min;
use std::io::Error;
use std::io::{Error, ErrorKind};
/// For reading
pub trait BlobCursor {
@@ -40,21 +48,30 @@ where
let mut buf = self.read_blk(blknum)?;
// read length
let mut len_buf = [0u8; 4];
let thislen = PAGE_SZ - off;
if thislen < 4 {
// it is split across two pages
len_buf[..thislen].copy_from_slice(&buf[off..PAGE_SZ]);
blknum += 1;
buf = self.read_blk(blknum)?;
len_buf[thislen..].copy_from_slice(&buf[0..4 - thislen]);
off = 4 - thislen;
// peek at the first byte, to determine if it's a 1- or 4-byte length
let first_len_byte = buf[off];
let len: usize = if first_len_byte < 0x80 {
// 1-byte length header
off += 1;
first_len_byte as usize
} else {
len_buf.copy_from_slice(&buf[off..off + 4]);
off += 4;
}
let len = u32::from_ne_bytes(len_buf) as usize;
// 4-byte length header
let mut len_buf = [0u8; 4];
let thislen = PAGE_SZ - off;
if thislen < 4 {
// it is split across two pages
len_buf[..thislen].copy_from_slice(&buf[off..PAGE_SZ]);
blknum += 1;
buf = self.read_blk(blknum)?;
len_buf[thislen..].copy_from_slice(&buf[0..4 - thislen]);
off = 4 - thislen;
} else {
len_buf.copy_from_slice(&buf[off..off + 4]);
off += 4;
}
len_buf[0] &= 0x7f;
u32::from_be_bytes(len_buf) as usize
};
dstbuf.clear();
@@ -130,10 +147,27 @@ where
{
fn write_blob(&mut self, srcbuf: &[u8]) -> Result<u64, Error> {
let offset = self.offset;
self.inner
.write_all(&((srcbuf.len()) as u32).to_ne_bytes())?;
if srcbuf.len() < 128 {
// Short blob. Write a 1-byte length header
let len_buf = srcbuf.len() as u8;
self.inner.write_all(&[len_buf])?;
self.offset += 1;
} else {
// Write a 4-byte length header
if srcbuf.len() > 0x7fff_ffff {
return Err(Error::new(
ErrorKind::Other,
format!("blob too large ({} bytes)", srcbuf.len()),
));
}
let mut len_buf = ((srcbuf.len()) as u32).to_be_bytes();
len_buf[0] |= 0x80;
self.inner.write_all(&len_buf)?;
self.offset += 4;
}
self.inner.write_all(srcbuf)?;
self.offset += 4 + srcbuf.len() as u64;
self.offset += srcbuf.len() as u64;
Ok(offset)
}
}
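
To make the header format above concrete, here is a small standalone sketch of the same 1-/4-byte length encoding and the peek-at-the-first-byte decoding. It is an illustration only, not the pageserver's BlobWriter/BlobCursor, and it works on an in-memory buffer instead of paged files.

fn encode_blob(buf: &mut Vec<u8>, blob: &[u8]) {
    if blob.len() < 0x80 {
        // Short blob: single length byte with the high bit clear.
        buf.push(blob.len() as u8);
    } else {
        // Long blob: big-endian u32 length with the high bit set.
        assert!(blob.len() <= 0x7fff_ffff, "blob too large");
        let mut len_buf = (blob.len() as u32).to_be_bytes();
        len_buf[0] |= 0x80;
        buf.extend_from_slice(&len_buf);
    }
    buf.extend_from_slice(blob);
}

fn decode_blob(buf: &[u8]) -> (&[u8], &[u8]) {
    // Peek at the first byte to decide between the 1- and 4-byte header.
    if buf[0] < 0x80 {
        let len = buf[0] as usize;
        (&buf[1..1 + len], &buf[1 + len..])
    } else {
        let mut len_buf = [0u8; 4];
        len_buf.copy_from_slice(&buf[0..4]);
        len_buf[0] &= 0x7f;
        let len = u32::from_be_bytes(len_buf) as usize;
        (&buf[4..4 + len], &buf[4 + len..])
    }
}

fn main() {
    let mut buf = Vec::new();
    encode_blob(&mut buf, b"short");     // 1-byte header: 0x05
    encode_blob(&mut buf, &[0xAA; 300]); // 4-byte header: 0x80 0x00 0x01 0x2C
    let (first, rest) = decode_blob(&buf);
    assert_eq!(first, &b"short"[..]);
    let (second, _) = decode_blob(rest);
    assert_eq!(second, &[0xAA; 300][..]);
}
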


@@ -73,7 +73,7 @@ pub struct BlockCursor<R>
where
R: BlockReader,
{
reader: R,
pub reader: R,
/// last accessed page
cache: Option<(u32, R::BlockLease)>,
}


@@ -23,6 +23,7 @@
//! "values" part. The actual page images and WAL records are stored in the
//! "values" part.
//!
use crate::config;
use crate::config::PageServerConf;
use crate::layered_repository::blob_io::{BlobCursor, BlobWriter, WriteBlobWriter};
use crate::layered_repository::block_io::{BlockBuf, BlockCursor, BlockReader, FileBlockReader};
@@ -35,12 +36,11 @@ use crate::page_cache::{PageReadGuard, PAGE_SZ};
use crate::repository::{Key, Value, KEY_SIZE};
use crate::virtual_file::VirtualFile;
use crate::walrecord;
use crate::{ZTenantId, ZTimelineId};
use crate::{DELTA_FILE_MAGIC, STORAGE_FORMAT_VERSION};
use anyhow::{bail, ensure, Context, Result};
use serde::{Deserialize, Serialize};
use tracing::*;
// avoid binding to Write (conflicts with std::io::Write)
// while being able to use std::fmt::Write's methods
use std::fmt::Write as _;
use std::fs;
@@ -51,8 +51,13 @@ use std::os::unix::fs::FileExt;
use std::path::{Path, PathBuf};
use std::sync::{RwLock, RwLockReadGuard, RwLockWriteGuard};
use zenith_utils::bin_ser::BeSer;
use zenith_utils::lsn::Lsn;
use utils::{
bin_ser::BeSer,
lsn::Lsn,
zid::{ZTenantId, ZTimelineId},
};
const DICTIONARY_OFFSET: u64 = PAGE_SZ as u64;
///
/// Header stored in the beginning of the file
@@ -193,6 +198,10 @@ pub struct DeltaLayerInner {
/// Reader object for reading blocks from the file. (None if not loaded yet)
file: Option<FileBlockReader<VirtualFile>>,
/// Compression dictionary.
dictionary: Vec<u8>, // empty if not loaded
prepared_dictionary: Option<zstd::dict::DecoderDictionary<'static>>, // None if not loaded
}
impl Layer for DeltaLayer {
@@ -222,6 +231,7 @@ impl Layer for DeltaLayer {
lsn_range: Range<Lsn>,
reconstruct_state: &mut ValueReconstructState,
) -> anyhow::Result<ValueReconstructResult> {
ensure!(lsn_range.start >= self.lsn_range.start);
let mut need_image = true;
ensure!(self.key_range.contains(&key));
@@ -229,6 +239,12 @@ impl Layer for DeltaLayer {
{
// Open the file and lock the metadata in memory
let inner = self.load()?;
let mut decompressor = match &inner.prepared_dictionary {
Some(dictionary) => Some(zstd::bulk::Decompressor::with_prepared_dictionary(
dictionary,
)?),
None => None,
};
// Scan the page versions backwards, starting from `lsn`.
let file = inner.file.as_ref().unwrap();
@@ -255,8 +271,25 @@ impl Layer for DeltaLayer {
// Ok, 'offsets' now contains the offsets of all the entries we need to read
let mut cursor = file.block_cursor();
for (entry_lsn, pos) in offsets {
let buf = cursor.read_blob(pos)?;
let val = Value::des(&buf)?;
let buf = cursor.read_blob(pos).with_context(|| {
format!(
"Failed to read blob from virtual file {}",
file.file.path.display()
)
})?;
let val = if let Some(decompressor) = &mut decompressor {
let decompressed =
decompressor.decompress(&buf, config::ZSTD_DECOMPRESS_BUFFER_LIMIT)?;
Value::des(&decompressed)
} else {
Value::des(&buf)
}
.with_context(|| {
format!(
"Failed to deserialize file blob from virtual file {}",
file.file.path.display()
)
})?;
match val {
Value::Image(img) => {
reconstruct_state.img = Some((entry_lsn, img));
@@ -287,7 +320,10 @@ impl Layer for DeltaLayer {
}
fn iter<'a>(&'a self) -> Box<dyn Iterator<Item = anyhow::Result<(Key, Lsn, Value)>> + 'a> {
let inner = self.load().unwrap();
let inner = match self.load() {
Ok(inner) => inner,
Err(e) => panic!("Failed to load a delta layer: {e:?}"),
};
match DeltaValueIter::new(inner) {
Ok(iter) => Box::new(iter),
@@ -326,7 +362,6 @@ impl Layer for DeltaLayer {
}
let inner = self.load()?;
println!(
"index_start_blk: {}, root {}",
inner.index_start_blk, inner.index_root_blk
@@ -342,6 +377,13 @@ impl Layer for DeltaLayer {
tree_reader.dump()?;
let mut cursor = file.block_cursor();
let mut decompressor = match &inner.prepared_dictionary {
Some(dictionary) => Some(zstd::bulk::Decompressor::with_prepared_dictionary(
dictionary,
)?),
None => None,
};
tree_reader.visit(
&[0u8; DELTA_KEY_SIZE],
VisitDirection::Forwards,
@@ -353,7 +395,14 @@ impl Layer for DeltaLayer {
let mut desc = String::new();
match cursor.read_blob(blob_ref.pos()) {
Ok(buf) => {
let val = Value::des(&buf);
let val = if let Some(decompressor) = &mut decompressor {
let decompressed = decompressor
.decompress(&buf, config::ZSTD_DECOMPRESS_BUFFER_LIMIT)
.unwrap();
Value::des(&decompressed)
} else {
Value::des(&buf)
};
match val {
Ok(Value::Image(img)) => {
write!(&mut desc, " img {} bytes", img.len()).unwrap();
@@ -419,7 +468,9 @@ impl DeltaLayer {
drop(inner);
let inner = self.inner.write().unwrap();
if !inner.loaded {
self.load_inner(inner)?;
self.load_inner(inner).with_context(|| {
format!("Failed to load delta layer {}", self.path().display())
})?;
} else {
// Another thread loaded it while we were not holding the lock.
}
@@ -469,6 +520,13 @@ impl DeltaLayer {
}
}
let mut cursor = file.block_cursor();
inner.dictionary = cursor.read_blob(DICTIONARY_OFFSET)?;
inner.prepared_dictionary = if inner.dictionary.is_empty() {
None
} else {
Some(zstd::dict::DecoderDictionary::copy(&inner.dictionary))
};
inner.index_start_blk = actual_summary.index_start_blk;
inner.index_root_blk = actual_summary.index_root_blk;
@@ -494,6 +552,8 @@ impl DeltaLayer {
inner: RwLock::new(DeltaLayerInner {
loaded: false,
file: None,
dictionary: Vec::new(),
prepared_dictionary: None,
index_start_blk: 0,
index_root_blk: 0,
}),
@@ -521,6 +581,8 @@ impl DeltaLayer {
inner: RwLock::new(DeltaLayerInner {
loaded: false,
file: None,
dictionary: Vec::new(),
prepared_dictionary: None,
index_start_blk: 0,
index_root_blk: 0,
}),
@@ -568,6 +630,7 @@ pub struct DeltaLayerWriter {
tree: DiskBtreeBuilder<BlockBuf, DELTA_KEY_SIZE>,
blob_writer: WriteBlobWriter<BufWriter<VirtualFile>>,
compressor: Option<zstd::bulk::Compressor<'static>>,
}
impl DeltaLayerWriter {
@@ -580,6 +643,7 @@ impl DeltaLayerWriter {
tenantid: ZTenantId,
key_start: Key,
lsn_range: Range<Lsn>,
dictionary: Vec<u8>,
) -> Result<DeltaLayerWriter> {
// Create the file initially with a temporary filename. We don't know
// the end key yet, so we cannot form the final filename yet. We will
@@ -595,14 +659,24 @@ impl DeltaLayerWriter {
));
let mut file = VirtualFile::create(&path)?;
// make room for the header block
file.seek(SeekFrom::Start(PAGE_SZ as u64))?;
file.seek(SeekFrom::Start(DICTIONARY_OFFSET))?;
let buf_writer = BufWriter::new(file);
let blob_writer = WriteBlobWriter::new(buf_writer, PAGE_SZ as u64);
let mut blob_writer = WriteBlobWriter::new(buf_writer, DICTIONARY_OFFSET);
let off = blob_writer.write_blob(&dictionary)?;
assert!(off == DICTIONARY_OFFSET);
let compressor = if dictionary.is_empty() {
None
} else {
Some(zstd::bulk::Compressor::with_dictionary(
config::ZSTD_COMPRESSION_LEVEL,
&dictionary,
)?)
};
// Initialize the b-tree index builder
let block_buf = BlockBuf::new();
let tree_builder = DiskBtreeBuilder::new(block_buf);
Ok(DeltaLayerWriter {
conf,
path,
@@ -612,6 +686,7 @@ impl DeltaLayerWriter {
lsn_range,
tree: tree_builder,
blob_writer,
compressor,
})
}
@@ -623,7 +698,13 @@ impl DeltaLayerWriter {
pub fn put_value(&mut self, key: Key, lsn: Lsn, val: Value) -> Result<()> {
assert!(self.lsn_range.start <= lsn);
let off = self.blob_writer.write_blob(&Value::ser(&val)?)?;
let body = &Value::ser(&val)?;
let off = if let Some(ref mut compressor) = self.compressor {
let compressed = compressor.compress(body)?;
self.blob_writer.write_blob(&compressed)?
} else {
self.blob_writer.write_blob(body)?
};
let blob_ref = BlobRef::new(off, val.will_init());
@@ -680,6 +761,8 @@ impl DeltaLayerWriter {
inner: RwLock::new(DeltaLayerInner {
loaded: false,
file: None,
dictionary: Vec::new(),
prepared_dictionary: None,
index_start_blk,
index_root_blk,
}),
@@ -717,6 +800,7 @@ struct DeltaValueIter<'a> {
all_offsets: Vec<(DeltaKey, BlobRef)>,
next_idx: usize,
reader: BlockCursor<Adapter<'a>>,
decompressor: Option<zstd::bulk::Decompressor<'a>>,
}
struct Adapter<'a>(RwLockReadGuard<'a, DeltaLayerInner>);
@@ -755,11 +839,26 @@ impl<'a> DeltaValueIter<'a> {
true
},
)?;
let decompressor = match &inner.prepared_dictionary {
Some(dictionary) => Some(zstd::bulk::Decompressor::with_prepared_dictionary(
dictionary,
)?),
None => None,
};
/*
let decompressor = if !inner.dictionary.is_empty() {
Some(zstd::bulk::Decompressor::with_dictionary(
&inner.dictionary,
)?)
} else {
None
};
*/
let iter = DeltaValueIter {
all_offsets,
next_idx: 0,
reader: BlockCursor::new(Adapter(inner)),
decompressor,
};
Ok(iter)
@@ -773,7 +872,13 @@ impl<'a> DeltaValueIter<'a> {
let lsn = delta_key.lsn();
let buf = self.reader.read_blob(blob_ref.pos())?;
let val = Value::des(&buf)?;
let val = if let Some(decompressor) = &mut self.decompressor {
let decompressed =
decompressor.decompress(&buf, config::ZSTD_DECOMPRESS_BUFFER_LIMIT)?;
Value::des(&decompressed)
} else {
Value::des(&buf)
}?;
self.next_idx += 1;
Ok(Some((key, lsn, val)))
} else {
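
For context on the compression plumbing above, this is a minimal end-to-end sketch of the zstd dictionary flow: train a dictionary from sample values, compress with Compressor::with_dictionary, and decode through a prepared DecoderDictionary, mirroring DeltaLayerInner::prepared_dictionary. The sample data, the 8 KiB dictionary size, level 3, and the 1 MiB decompress limit are arbitrary stand-ins for the ZSTD_* constants in the pageserver config.

fn main() -> std::io::Result<()> {
    // Pretend these are serialized values collected from an in-memory layer.
    let samples: Vec<Vec<u8>> = (0..2000)
        .map(|i| {
            format!("key={:08} value=some repetitive page payload {:08}", i, i % 10)
                .repeat(4)
                .into_bytes()
        })
        .collect();

    // Train a dictionary (the second argument caps the dictionary size).
    let dictionary = zstd::dict::from_samples(&samples, 8 * 1024)?;

    // Compress one value with the dictionary.
    let mut compressor = zstd::bulk::Compressor::with_dictionary(3, &dictionary)?;
    let compressed = compressor.compress(&samples[42])?;

    // Prepare the decoder dictionary once and reuse it for many reads.
    let prepared = zstd::dict::DecoderDictionary::copy(&dictionary);
    let mut decompressor = zstd::bulk::Decompressor::with_prepared_dictionary(&prepared)?;
    let decompressed = decompressor.decompress(&compressed, 1024 * 1024)?;

    assert_eq!(decompressed, samples[42]);
    Ok(())
}
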


@@ -16,8 +16,8 @@ use std::io::{Error, ErrorKind};
use std::ops::DerefMut;
use std::path::PathBuf;
use std::sync::{Arc, RwLock};
use zenith_utils::zid::ZTenantId;
use zenith_utils::zid::ZTimelineId;
use tracing::*;
use utils::zid::{ZTenantId, ZTimelineId};
use std::os::unix::fs::FileExt;
@@ -199,18 +199,24 @@ impl BlobWriter for EphemeralFile {
let mut buf = self.get_buf_for_write(blknum)?;
// Write the length field
let len_buf = u32::to_ne_bytes(srcbuf.len() as u32);
let thislen = PAGE_SZ - off;
if thislen < 4 {
// it needs to be split across pages
buf[off..(off + thislen)].copy_from_slice(&len_buf[..thislen]);
blknum += 1;
buf = self.get_buf_for_write(blknum)?;
buf[0..4 - thislen].copy_from_slice(&len_buf[thislen..]);
off = 4 - thislen;
if srcbuf.len() < 0x80 {
buf[off] = srcbuf.len() as u8;
off += 1;
} else {
buf[off..off + 4].copy_from_slice(&len_buf);
off += 4;
let mut len_buf = u32::to_be_bytes(srcbuf.len() as u32);
len_buf[0] |= 0x80;
let thislen = PAGE_SZ - off;
if thislen < 4 {
// it needs to be split across pages
buf[off..(off + thislen)].copy_from_slice(&len_buf[..thislen]);
blknum += 1;
buf = self.get_buf_for_write(blknum)?;
buf[0..4 - thislen].copy_from_slice(&len_buf[thislen..]);
off = 4 - thislen;
} else {
buf[off..off + 4].copy_from_slice(&len_buf);
off += 4;
}
}
// Write the payload
@@ -229,7 +235,13 @@ impl BlobWriter for EphemeralFile {
buf_remain = &buf_remain[this_blk_len..];
}
drop(buf);
self.size += 4 + srcbuf.len() as u64;
if srcbuf.len() < 0x80 {
self.size += 1;
} else {
self.size += 4;
}
self.size += srcbuf.len() as u64;
Ok(pos)
}
@@ -244,16 +256,31 @@ impl Drop for EphemeralFile {
// remove entry from the hash map
EPHEMERAL_FILES.write().unwrap().files.remove(&self.file_id);
// unlink file
// FIXME: print error
let _ = std::fs::remove_file(&self.file.path);
// unlink the file
let res = std::fs::remove_file(&self.file.path);
if let Err(e) = res {
warn!(
"could not remove ephemeral file '{}': {}",
self.file.path.display(),
e
);
}
}
}
pub fn writeback(file_id: u64, blkno: u32, buf: &[u8]) -> Result<(), std::io::Error> {
if let Some(file) = EPHEMERAL_FILES.read().unwrap().files.get(&file_id) {
file.write_all_at(buf, blkno as u64 * PAGE_SZ as u64)?;
Ok(())
match file.write_all_at(buf, blkno as u64 * PAGE_SZ as u64) {
Ok(_) => Ok(()),
Err(e) => Err(std::io::Error::new(
ErrorKind::Other,
format!(
"failed to write back to ephemeral file at {} error: {}",
file.path.display(),
e
),
)),
}
} else {
Err(std::io::Error::new(
ErrorKind::Other,
@@ -372,6 +399,12 @@ mod tests {
let pos = file.write_blob(&data)?;
blobs.push((pos, data));
}
// also test with large blobs
for i in 0..100 {
let data = format!("blob{}", i).as_bytes().repeat(100);
let pos = file.write_blob(&data)?;
blobs.push((pos, data));
}
let mut cursor = BlockCursor::new(&file);
for (pos, expected) in blobs {


@@ -8,7 +8,7 @@ use std::fmt;
use std::ops::Range;
use std::path::PathBuf;
use zenith_utils::lsn::Lsn;
use utils::lsn::Lsn;
// Note: LayeredTimeline::load_layer_map() relies on this sort order
#[derive(Debug, PartialEq, Eq, Clone)]


@@ -19,6 +19,7 @@
//! layer, and offsets to the other parts. The "index" is a B-tree,
//! mapping from Key to an offset in the "values" part. The
//! actual page images are stored in the "values" part.
use crate::config;
use crate::config::PageServerConf;
use crate::layered_repository::blob_io::{BlobCursor, BlobWriter, WriteBlobWriter};
use crate::layered_repository::block_io::{BlockBuf, BlockReader, FileBlockReader};
@@ -30,7 +31,6 @@ use crate::layered_repository::storage_layer::{
use crate::page_cache::PAGE_SZ;
use crate::repository::{Key, Value, KEY_SIZE};
use crate::virtual_file::VirtualFile;
use crate::{ZTenantId, ZTimelineId};
use crate::{IMAGE_FILE_MAGIC, STORAGE_FORMAT_VERSION};
use anyhow::{bail, ensure, Context, Result};
use bytes::Bytes;
@@ -44,8 +44,11 @@ use std::path::{Path, PathBuf};
use std::sync::{RwLock, RwLockReadGuard};
use tracing::*;
use zenith_utils::bin_ser::BeSer;
use zenith_utils::lsn::Lsn;
use utils::{
bin_ser::BeSer,
lsn::Lsn,
zid::{ZTenantId, ZTimelineId},
};
///
/// Header stored in the beginning of the file
@@ -148,6 +151,7 @@ impl Layer for ImageLayer {
reconstruct_state: &mut ValueReconstructState,
) -> anyhow::Result<ValueReconstructResult> {
assert!(self.key_range.contains(&key));
assert!(lsn_range.start >= self.lsn);
assert!(lsn_range.end >= self.lsn);
let inner = self.load()?;
@@ -165,7 +169,8 @@ impl Layer for ImageLayer {
offset
)
})?;
let value = Bytes::from(blob);
let decompressed = zstd::bulk::decompress(&blob, PAGE_SZ)?;
let value = Bytes::from(decompressed);
reconstruct_state.img = Some((self.lsn, value));
Ok(ValueReconstructResult::Complete)
@@ -251,7 +256,9 @@ impl ImageLayer {
drop(inner);
let mut inner = self.inner.write().unwrap();
if !inner.loaded {
self.load_inner(&mut inner)?;
self.load_inner(&mut inner).with_context(|| {
format!("Failed to load image layer {}", self.path().display())
})?
} else {
// Another thread loaded it while we were not holding the lock.
}
@@ -451,7 +458,8 @@ impl ImageLayerWriter {
///
pub fn put_image(&mut self, key: Key, img: &[u8]) -> Result<()> {
ensure!(self.key_range.contains(&key));
let off = self.blob_writer.write_blob(img)?;
let compressed = zstd::bulk::compress(img, config::ZSTD_COMPRESSION_LEVEL)?;
let off = self.blob_writer.write_blob(&compressed)?;
let mut keybuf: [u8; KEY_SIZE] = [0u8; KEY_SIZE];
key.write_to_byte_slice(&mut keybuf);
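
Image layers take the simpler, dictionary-less path shown above: put_image compresses each page image with zstd::bulk::compress, and the read path decompresses it with an upper bound of one page. A minimal sketch, assuming PAGE_SZ = 8192 and level 3 as stand-ins for the pageserver constants:

fn main() -> std::io::Result<()> {
    const PAGE_SZ: usize = 8192;
    let page = vec![0u8; PAGE_SZ]; // a highly compressible, all-zero page image

    let compressed = zstd::bulk::compress(&page, 3)?;
    assert!(compressed.len() < PAGE_SZ);

    let restored = zstd::bulk::decompress(&compressed, PAGE_SZ)?;
    assert_eq!(restored, page);
    Ok(())
}
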


@@ -4,6 +4,7 @@
//! held in an ephemeral file, not in memory. The metadata for each page version, i.e.
//! its position in the file, is kept in memory, though.
//!
use crate::config;
use crate::config::PageServerConf;
use crate::layered_repository::blob_io::{BlobCursor, BlobWriter};
use crate::layered_repository::block_io::BlockReader;
@@ -14,19 +15,21 @@ use crate::layered_repository::storage_layer::{
};
use crate::repository::{Key, Value};
use crate::walrecord;
use crate::{ZTenantId, ZTimelineId};
use anyhow::{bail, ensure, Result};
use std::collections::HashMap;
use tracing::*;
use utils::{
bin_ser::BeSer,
lsn::Lsn,
vec_map::VecMap,
zid::{ZTenantId, ZTimelineId},
};
// avoid binding to Write (conflicts with std::io::Write)
// while being able to use std::fmt::Write's methods
use std::fmt::Write as _;
use std::ops::Range;
use std::path::PathBuf;
use std::sync::RwLock;
use zenith_utils::bin_ser::BeSer;
use zenith_utils::lsn::Lsn;
use zenith_utils::vec_map::VecMap;
pub struct InMemoryLayer {
conf: &'static PageServerConf,
@@ -113,7 +116,7 @@ impl Layer for InMemoryLayer {
lsn_range: Range<Lsn>,
reconstruct_state: &mut ValueReconstructState,
) -> anyhow::Result<ValueReconstructResult> {
ensure!(lsn_range.start <= self.start_lsn);
ensure!(lsn_range.start >= self.start_lsn);
let mut need_image = true;
let inner = self.inner.read().unwrap();
@@ -124,13 +127,6 @@ impl Layer for InMemoryLayer {
if let Some(vec_map) = inner.index.get(&key) {
let slice = vec_map.slice_range(lsn_range);
for (entry_lsn, pos) in slice.iter().rev() {
match &reconstruct_state.img {
Some((cached_lsn, _)) if entry_lsn <= cached_lsn => {
return Ok(ValueReconstructResult::Complete)
}
_ => {}
}
let buf = reader.read_blob(*pos)?;
let value = Value::des(&buf)?;
match value {
@@ -323,21 +319,40 @@ impl InMemoryLayer {
// rare though, so we just accept the potential latency hit for now.
let inner = self.inner.read().unwrap();
let mut samples: Vec<Vec<u8>> = Vec::with_capacity(config::ZSTD_MAX_SAMPLES);
let mut buf = Vec::new();
let mut keys: Vec<(&Key, &VecMap<Lsn, u64>)> = inner.index.iter().collect();
keys.sort_by_key(|k| k.0);
let mut cursor = inner.file.block_cursor();
// First, train a compression dictionary on a sample of the stored values
'train: for (_key, vec_map) in keys.iter() {
// Collect this key's page versions as training samples
for (_lsn, pos) in vec_map.as_slice() {
cursor.read_blob_into_buf(*pos, &mut buf)?;
samples.push(buf.clone());
if samples.len() == config::ZSTD_MAX_SAMPLES {
break 'train;
}
}
}
let dictionary = if samples.len() >= config::ZSTD_MIN_SAMPLES {
zstd::dict::from_samples(&samples, config::ZSTD_MAX_DICTIONARY_SIZE)?
} else {
Vec::new()
};
let mut delta_layer_writer = DeltaLayerWriter::new(
self.conf,
self.timelineid,
self.tenantid,
Key::MIN,
self.start_lsn..inner.end_lsn.unwrap(),
dictionary,
)?;
let mut buf = Vec::new();
let mut cursor = inner.file.block_cursor();
let mut keys: Vec<(&Key, &VecMap<Lsn, u64>)> = inner.index.iter().collect();
keys.sort_by_key(|k| k.0);
for (key, vec_map) in keys.iter() {
let key = **key;
// Write all page versions


@@ -16,12 +16,12 @@ use crate::layered_repository::InMemoryLayer;
use crate::repository::Key;
use anyhow::Result;
use lazy_static::lazy_static;
use metrics::{register_int_gauge, IntGauge};
use std::collections::VecDeque;
use std::ops::Range;
use std::sync::Arc;
use tracing::*;
use zenith_metrics::{register_int_gauge, IntGauge};
use zenith_utils::lsn::Lsn;
use utils::lsn::Lsn;
lazy_static! {
static ref NUM_ONDISK_LAYERS: IntGauge =


@@ -10,7 +10,7 @@ use std::path::PathBuf;
use anyhow::ensure;
use serde::{Deserialize, Serialize};
use zenith_utils::{
use utils::{
bin_ser::BeSer,
lsn::Lsn,
zid::{ZTenantId, ZTimelineId},


@@ -4,13 +4,15 @@
use crate::repository::{Key, Value};
use crate::walrecord::ZenithWalRecord;
use crate::{ZTenantId, ZTimelineId};
use anyhow::Result;
use bytes::Bytes;
use std::ops::Range;
use std::path::PathBuf;
use zenith_utils::lsn::Lsn;
use utils::{
lsn::Lsn,
zid::{ZTenantId, ZTimelineId},
};
pub fn range_overlaps<T>(a: &Range<T>, b: &Range<T>) -> bool
where


@@ -7,9 +7,11 @@ pub mod layered_repository;
pub mod page_cache;
pub mod page_service;
pub mod pgdatadir_mapping;
pub mod profiling;
pub mod reltag;
pub mod remote_storage;
pub mod repository;
pub mod tenant_config;
pub mod tenant_mgr;
pub mod tenant_threads;
pub mod thread_mgr;
@@ -22,13 +24,10 @@ pub mod walredo;
use lazy_static::lazy_static;
use tracing::info;
use zenith_metrics::{register_int_gauge_vec, IntGaugeVec};
use zenith_utils::{
postgres_backend,
zid::{ZTenantId, ZTimelineId},
};
use utils::postgres_backend;
use crate::thread_mgr::ThreadKind;
use metrics::{register_int_gauge_vec, IntGaugeVec};
use layered_repository::LayeredRepository;
use pgdatadir_mapping::DatadirTimeline;


@@ -47,7 +47,7 @@ use std::{
use once_cell::sync::OnceCell;
use tracing::error;
use zenith_utils::{
use utils::{
lsn::Lsn,
zid::{ZTenantId, ZTimelineId},
};


@@ -19,20 +19,20 @@ use std::net::TcpListener;
use std::str;
use std::str::FromStr;
use std::sync::{Arc, RwLockReadGuard};
use std::time::Duration;
use tracing::*;
use zenith_metrics::{register_histogram_vec, HistogramVec};
use zenith_utils::auth::{self, JwtAuth};
use zenith_utils::auth::{Claims, Scope};
use zenith_utils::lsn::Lsn;
use zenith_utils::postgres_backend::is_socket_read_timed_out;
use zenith_utils::postgres_backend::PostgresBackend;
use zenith_utils::postgres_backend::{self, AuthType};
use zenith_utils::pq_proto::{BeMessage, FeMessage, RowDescriptor, SINGLE_COL_ROWDESC};
use zenith_utils::zid::{ZTenantId, ZTimelineId};
use utils::{
auth::{self, Claims, JwtAuth, Scope},
lsn::Lsn,
postgres_backend::{self, is_socket_read_timed_out, AuthType, PostgresBackend},
pq_proto::{BeMessage, FeMessage, RowDescriptor, SINGLE_COL_ROWDESC},
zid::{ZTenantId, ZTimelineId},
};
use crate::basebackup;
use crate::config::PageServerConf;
use crate::config::{PageServerConf, ProfilingConfig};
use crate::pgdatadir_mapping::DatadirTimeline;
use crate::profiling::profpoint_start;
use crate::reltag::RelTag;
use crate::repository::Repository;
use crate::repository::Timeline;
@@ -41,6 +41,7 @@ use crate::thread_mgr;
use crate::thread_mgr::ThreadKind;
use crate::walreceiver;
use crate::CheckpointConfig;
use metrics::{register_histogram_vec, HistogramVec};
// Wrapped in libpq CopyData
enum PagestreamFeMessage {
@@ -325,14 +326,17 @@ impl PageServerHandler {
let _enter = info_span!("pagestream", timeline = %timelineid, tenant = %tenantid).entered();
// Check that the timeline exists
let timeline = tenant_mgr::get_timeline_for_tenant_load(tenantid, timelineid)
let timeline = tenant_mgr::get_local_timeline_with_load(tenantid, timelineid)
.context("Cannot load local timeline")?;
/* switch client to COPYBOTH */
pgb.write_message(&BeMessage::CopyBothResponse)?;
while !thread_mgr::is_shutdown_requested() {
match pgb.read_message() {
let msg = pgb.read_message();
let profiling_guard = profpoint_start(self.conf, ProfilingConfig::PageRequests);
match msg {
Ok(message) => {
if let Some(message) = message {
trace!("query: {:?}", message);
@@ -384,6 +388,7 @@ impl PageServerHandler {
}
}
}
drop(profiling_guard);
}
Ok(())
}
@@ -517,7 +522,7 @@ impl PageServerHandler {
info!("starting");
// check that the timeline exists
let timeline = tenant_mgr::get_timeline_for_tenant_load(tenantid, timelineid)
let timeline = tenant_mgr::get_local_timeline_with_load(tenantid, timelineid)
.context("Cannot load local timeline")?;
let latest_gc_cutoff_lsn = timeline.tline.get_latest_gc_cutoff_lsn();
if let Some(lsn) = lsn {
@@ -651,7 +656,7 @@ impl postgres_backend::Handler for PageServerHandler {
info_span!("callmemaybe", timeline = %timelineid, tenant = %tenantid).entered();
// Check that the timeline exists
tenant_mgr::get_timeline_for_tenant_load(tenantid, timelineid)
tenant_mgr::get_local_timeline_with_load(tenantid, timelineid)
.context("Cannot load local timeline")?;
walreceiver::launch_wal_receiver(self.conf, tenantid, timelineid, &connstr)?;
@@ -662,7 +667,10 @@ impl postgres_backend::Handler for PageServerHandler {
// on connect
pgb.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
} else if query_string.starts_with("failpoints ") {
ensure!(fail::has_failpoints(), "Cannot manage failpoints because pageserver was compiled without failpoints support");
let (_, failpoints) = query_string.split_at("failpoints ".len());
for failpoint in failpoints.split(';') {
if let Some((name, actions)) = failpoint.split_once('=') {
info!("cfg failpoint: {} {}", name, actions);
@@ -672,6 +680,39 @@ impl postgres_backend::Handler for PageServerHandler {
}
}
pgb.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
} else if query_string.starts_with("show ") {
// show <tenant_id>
let (_, params_raw) = query_string.split_at("show ".len());
let params = params_raw.split(' ').collect::<Vec<_>>();
ensure!(params.len() == 1, "invalid number of parameters for the show command");
let tenantid = ZTenantId::from_str(params[0])?;
let repo = tenant_mgr::get_repository_for_tenant(tenantid)?;
pgb.write_message_noflush(&BeMessage::RowDescription(&[
RowDescriptor::int8_col(b"checkpoint_distance"),
RowDescriptor::int8_col(b"compaction_target_size"),
RowDescriptor::int8_col(b"compaction_period"),
RowDescriptor::int8_col(b"compaction_threshold"),
RowDescriptor::int8_col(b"gc_horizon"),
RowDescriptor::int8_col(b"gc_period"),
RowDescriptor::int8_col(b"image_creation_threshold"),
RowDescriptor::int8_col(b"pitr_interval"),
]))?
.write_message_noflush(&BeMessage::DataRow(&[
Some(repo.get_checkpoint_distance().to_string().as_bytes()),
Some(repo.get_compaction_target_size().to_string().as_bytes()),
Some(
repo.get_compaction_period()
.as_secs()
.to_string()
.as_bytes(),
),
Some(repo.get_compaction_threshold().to_string().as_bytes()),
Some(repo.get_gc_horizon().to_string().as_bytes()),
Some(repo.get_gc_period().as_secs().to_string().as_bytes()),
Some(repo.get_image_creation_threshold().to_string().as_bytes()),
Some(repo.get_pitr_interval().as_secs().to_string().as_bytes()),
]))?
.write_message(&BeMessage::CommandComplete(b"SELECT 1"))?;
} else if query_string.starts_with("do_gc ") {
// Run GC immediately on given timeline.
// FIXME: This is just for tests. See test_runner/batch_others/test_gc.py.
@@ -689,16 +730,20 @@ impl postgres_backend::Handler for PageServerHandler {
let tenantid = ZTenantId::from_str(caps.get(1).unwrap().as_str())?;
let timelineid = ZTimelineId::from_str(caps.get(2).unwrap().as_str())?;
let repo = tenant_mgr::get_repository_for_tenant(tenantid)?;
let gc_horizon: u64 = caps
.get(4)
.map(|h| h.as_str().parse())
.unwrap_or(Ok(self.conf.gc_horizon))?;
.unwrap_or_else(|| Ok(repo.get_gc_horizon()))?;
let repo = tenant_mgr::get_repository_for_tenant(tenantid)?;
let result = repo.gc_iteration(Some(timelineid), gc_horizon, true)?;
let result = repo.gc_iteration(Some(timelineid), gc_horizon, Duration::ZERO, true)?;
pgb.write_message_noflush(&BeMessage::RowDescription(&[
RowDescriptor::int8_col(b"layers_total"),
RowDescriptor::int8_col(b"layers_needed_by_cutoff"),
RowDescriptor::int8_col(b"layers_needed_by_pitr"),
RowDescriptor::int8_col(b"layers_needed_by_branches"),
RowDescriptor::int8_col(b"layers_not_updated"),
RowDescriptor::int8_col(b"layers_removed"),
@@ -707,6 +752,7 @@ impl postgres_backend::Handler for PageServerHandler {
.write_message_noflush(&BeMessage::DataRow(&[
Some(result.layers_total.to_string().as_bytes()),
Some(result.layers_needed_by_cutoff.to_string().as_bytes()),
Some(result.layers_needed_by_pitr.to_string().as_bytes()),
Some(result.layers_needed_by_branches.to_string().as_bytes()),
Some(result.layers_not_updated.to_string().as_bytes()),
Some(result.layers_removed.to_string().as_bytes()),
@@ -727,7 +773,7 @@ impl postgres_backend::Handler for PageServerHandler {
let tenantid = ZTenantId::from_str(caps.get(1).unwrap().as_str())?;
let timelineid = ZTimelineId::from_str(caps.get(2).unwrap().as_str())?;
let timeline = tenant_mgr::get_timeline_for_tenant_load(tenantid, timelineid)
let timeline = tenant_mgr::get_local_timeline_with_load(tenantid, timelineid)
.context("Couldn't load timeline")?;
timeline.tline.compact()?;
@@ -746,7 +792,7 @@ impl postgres_backend::Handler for PageServerHandler {
let tenantid = ZTenantId::from_str(caps.get(1).unwrap().as_str())?;
let timelineid = ZTimelineId::from_str(caps.get(2).unwrap().as_str())?;
let timeline = tenant_mgr::get_timeline_for_tenant_load(tenantid, timelineid)
let timeline = tenant_mgr::get_local_timeline_with_load(tenantid, timelineid)
.context("Cannot load local timeline")?;
timeline.tline.checkpoint(CheckpointConfig::Forced)?;

Some files were not shown because too many files have changed in this diff