Compare commits

..

47 Commits

Author SHA1 Message Date
Bojan Serafimov
940ea0ab2a Remove incorrect error handler 2022-06-07 09:28:09 -04:00
KlimentSerafimov
fecad1ca34 Resolving issue #1745. Added cluster option for SNI data (#1813)
* Added project option in case SNI data is missing. Resolving issue #1745.

* Added invariant checking for project name: if both sni_data and project_name are available then they should match.
2022-06-06 08:14:41 -04:00
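The invariant described in this commit can be sketched as follows (function and parameter names are hypothetical, not the actual proxy code):

```python
def resolve_project(sni_data, project_option):
    """Pick the project name from SNI data or the explicit project option.

    Invariant from the commit message: if both sources are present,
    they must match; if neither is present, that's an error.
    """
    if sni_data is not None and project_option is not None and sni_data != project_option:
        raise ValueError(
            f"SNI data {sni_data!r} does not match project option {project_option!r}"
        )
    project = sni_data if sni_data is not None else project_option
    if project is None:
        raise ValueError("project name missing from both SNI data and options")
    return project
```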
bojanserafimov
92de8423af Remove dead code (#1886) 2022-06-05 09:18:11 -04:00
Dmitry Rodionov
e442f5357b unify two identical failpoints in flush_frozen_layer
probably a merge artifact
2022-06-03 19:36:09 +03:00
Arseny Sher
5a723d44cd Parametrize test_normal_work.
I like to run a small test locally, but let's avoid duplication.
2022-06-03 20:32:53 +04:00
Kirill Bulatov
2623193876 Remove pageserver_connstr from WAL stream logic 2022-06-03 17:30:36 +03:00
Arseny Sher
70a53c4b03 Get back test_safekeeper_normal_work, but skip it by default.
It is handy for development.
2022-06-03 16:12:14 +04:00
Arseny Sher
9e108102b3 Silence etcd safekeeper info key parse errors.
When we subscribe to everything, it is OK to receive more than just
safekeeper timeline updates.
2022-06-03 16:12:14 +04:00
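The filtering this commit describes can be sketched like so. The key layout and function name below are hypothetical, for illustration only; the point is that non-matching keys are skipped silently rather than logged as errors:

```python
import re

# Hypothetical broker key layout, for illustration only.
SAFEKEEPER_TIMELINE_KEY = re.compile(
    r"^neon/(?P<tenant>[0-9a-f]+)/(?P<timeline>[0-9a-f]+)/safekeeper/(?P<sk_id>\d+)$"
)

def parse_safekeeper_key(key):
    """Return (tenant, timeline, sk_id) for safekeeper timeline keys, else None.

    When subscribed to everything under a prefix, receiving other kinds of
    keys is expected, so they are ignored instead of raising parse errors.
    """
    m = SAFEKEEPER_TIMELINE_KEY.match(key)
    if m is None:
        return None
    return m.group("tenant"), m.group("timeline"), int(m.group("sk_id"))
```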
huming
9c846a93e8 chore(doc) 2022-06-03 14:24:27 +03:00
Kirill Bulatov
c5007d3916 Remove unused module 2022-06-03 00:23:13 +03:00
Kirill Bulatov
5b06599770 Simplify etcd key regex parsing 2022-06-03 00:23:13 +03:00
Kirill Bulatov
1d16ee92d4 Fix the Lsn difference reconnection 2022-06-03 00:23:13 +03:00
Kirill Bulatov
7933804284 Fix and test regex parsing 2022-06-03 00:23:13 +03:00
Kirill Bulatov
a91e0c299d Reproduce etcd parsing bug in Python tests 2022-06-03 00:23:13 +03:00
Kirill Bulatov
b0c4ec0594 Log storage sync and etcd events a bit better 2022-06-03 00:23:13 +03:00
bojanserafimov
90e2c9ee1f Rename zenith to neon in python tests (#1871) 2022-06-02 16:21:28 -04:00
Egor Suvorov
aba5e5f8b5 GitHub Actions: pin Rust version to 1.58 like on CircleCI
* Fix failing `cargo clippy` while we're here.
  The behavior has been changed in Rust 1.60: https://github.com/rust-lang/rust-clippy/issues/8928
* Add Rust version to the Cargo deps cache key
2022-06-02 17:45:53 +02:00
Dmitry Rodionov
b155fe0e2f avoid perf test result context for pg regress 2022-06-02 17:41:34 +03:00
Ryan Russell
c71faae2c6 Docs readability cont
Signed-off-by: Ryan Russell <git@ryanrussell.org>
2022-06-02 15:05:12 +02:00
Kirill Bulatov
de7eda2dc6 Fix url path printing 2022-06-02 00:48:10 +03:00
Dmitry Rodionov
1188c9a95c remove extra span as this code is already covered by create timeline span
E.g. this log line contains duplicated data:
INFO /timeline_create{tenant=8d367870988250a755101b5189bbbc17
  new_timeline=Some(27e2580f51f5660642d8ce124e9ee4ac) lsn=None}:
  bootstrapping{timeline=27e2580f51f5660642d8ce124e9ee4ac
  tenant=8d367870988250a755101b5189bbbc17}:
  created root timeline 27e2580f51f5660642d8ce124e9ee4ac
  timeline.lsn 0/16960E8

this avoids variable duplication in `bootstrapping` subspan
2022-06-01 19:29:17 +03:00
Kirill Bulatov
e5cb727572 Replace callmemaybe with etcd subscriptions on safekeeper timeline info 2022-06-01 16:07:04 +03:00
Dmitry Rodionov
6623c5b9d5 add installation instructions for Fedora Linux 2022-06-01 15:59:53 +03:00
Anton Chaporgin
e5a2b0372d remove sk1 from inventory (#1845)
https://github.com/neondatabase/cloud/issues/1454
2022-06-01 15:40:45 +03:00
Alexey Kondratov
af6143ea1f Install missing openssl packages in the Github Actions workflow 2022-05-31 23:12:30 +03:00
Alexey Kondratov
ff233cf4c2 Use :local compute-tools tag to build compute-node image 2022-05-31 23:12:30 +03:00
Dmitry Rodionov
b1b67cc5a0 improve test normal work to start several computes 2022-05-31 22:42:11 +03:00
bojanserafimov
ca10cc12c1 Close file descriptors for redo process (#1834) 2022-05-31 14:14:09 -04:00
Thang Pham
c97cd684e0 Use HOMEBREW_PREFIX instead of hard-coded path (#1833) 2022-05-31 11:20:51 -04:00
Ryan Russell
54e163ac03 Improve Readability in Docs
Signed-off-by: Ryan Russell <ryanrussell@users.noreply.github.com>
2022-05-31 17:22:47 +03:00
Konstantin Knizhnik
595a6bc1e1 Bump vendor/postgres to fix basebackup LSN comparison. (#1835)
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
2022-05-31 14:47:06 +03:00
Arthur Petukhovsky
c3e0b6c839 Implement timeline-based metrics in safekeeper (#1823)
Now there's a timeline metrics collector, which goes through all timelines and reports metrics only for the active ones
2022-05-31 11:10:50 +03:00
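The collection scheme described in this commit can be sketched as below (a minimal Python sketch with invented names; the actual safekeeper implementation is in Rust):

```python
class Timeline:
    """Hypothetical stand-in for a safekeeper timeline."""
    def __init__(self, timeline_id, active, flush_lsn):
        self.timeline_id = timeline_id
        self.active = active
        self.flush_lsn = flush_lsn

def collect_metrics(timelines):
    """Walk all timelines and report metrics only for active ones,
    so dormant timelines don't bloat the metrics output."""
    return {
        tl.timeline_id: {"flush_lsn": tl.flush_lsn}
        for tl in timelines
        if tl.active
    }
```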
Arseny Sher
36281e3b47 Extend test_wal_backup with compute restart. 2022-05-30 13:57:17 +04:00
Anastasia Lubennikova
e014cb6026 rename zenith.zenith_tenant to neon.tenant_id in test 2022-05-30 12:24:44 +03:00
Anastasia Lubennikova
915e5c9114 Rename 'zenith_admin' to 'cloud_admin' on compute node start 2022-05-30 11:11:01 +03:00
Anastasia Lubennikova
67d6ff4100 Rename custom GUCs:
- zenith.zenith_tenant -> neon.tenant_id
- zenith.zenith_timeline -> neon.timeline_id
2022-05-30 11:11:01 +03:00
Anastasia Lubennikova
6a867bce6d Rename 'zenith_admin' role to 'cloud_admin' 2022-05-30 11:11:01 +03:00
Anastasia Lubennikova
751f1191b4 Rename 'wal_acceptors' GUC to 'safekeepers' 2022-05-30 11:11:01 +03:00
Anastasia Lubennikova
3accde613d Rename contrib/zenith to contrib/neon. Rename custom GUCs:
- zenith.page_server_connstring -> neon.pageserver_connstring
- zenith.zenith_tenant -> neon.tenantid
- zenith.zenith_timeline -> neon.timelineid
- zenith.max_cluster_size -> neon.max_cluster_size
2022-05-30 11:11:01 +03:00
Heikki Linnakangas
e3b320daab Remove obsolete Dockerfile.alpine
It hasn't been used for anything for a long time. The comments still
talked about librocksdb, which we also haven't used for a long time.
2022-05-28 21:22:19 +03:00
Heikki Linnakangas
4b4d3073b8 Fix misc typos 2022-05-28 14:56:23 +03:00
Kian-Meng Ang
f1c51a1267 Fix typos 2022-05-28 14:02:05 +03:00
bojanserafimov
500e8772f0 Add quick-start guide in readme (#1816) 2022-05-27 17:48:11 -04:00
Dmitry Ivanov
b3ec6e0661 [proxy] Propagate SASL/SCRAM auth errors to the user
This will replace the vague (and incorrect) "Internal error" with a nice
and helpful authentication error, e.g. "password doesn't match".
2022-05-27 21:50:43 +03:00
Dmitry Ivanov
5d813f9738 [proxy] Refactoring
This patch attempts to fix some of the technical debt
we had to introduce in previous patches.
2022-05-27 21:50:43 +03:00
Thang Pham
757746b571 Fix test_pageserver_http_get_wal_receiver_success flaky test. (#1786)
Fixes #1768.

## Context

Previously, to test the `get_wal_receiver` API, we ran some DB transactions and then called the API to check the latest message's LSN from the WAL receiver. However, this doesn't work reliably because it's not guaranteed that the WAL receiver will have received the latest WAL from the postgres/safekeeper at the time of the API call.

This PR resolves the above issue by adding a "poll and wait" code that waits to retrieve the latest data from the WAL receiver. 

This PR also fixes a bug that compared two LSNs as hex strings; they should be converted to numbers before comparison. See: https://github.com/neondatabase/neon/issues/1768#issuecomment-1133752122.
2022-05-27 13:33:53 -04:00
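The LSN bug mentioned in the last point is easy to reproduce: Postgres LSNs like `0/16960E8` are hex strings, and lexicographic comparison gives the wrong order once the digit counts or hex letters differ. A sketch of the fix (hypothetical helper name):

```python
def parse_lsn(lsn):
    """Convert a Postgres LSN string like '0/16960E8' into a 64-bit integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

# String comparison is wrong: 'A' sorts after '1', although
# 0xA00000 (10485760) is numerically smaller than 0x16960E8 (23683304).
assert ("0/A00000" > "0/16960E8")                       # lexicographic: wrong order
assert parse_lsn("0/A00000") < parse_lsn("0/16960E8")   # numeric: right order
```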
Arseny Sher
cb8bf1beb6 Prevent commit_lsn <= flush_lsn violation after a42eba3cd7.
Nothing has complained about it yet, but at least one assert definitely
doesn't hold, so let's keep it this way until a better version.
2022-05-27 20:23:30 +04:00
192 changed files with 4312 additions and 3454 deletions


@@ -6,7 +6,7 @@ RELEASE=${RELEASE:-false}
 # look at docker hub for latest tag for neon docker image
 if [ "${RELEASE}" = "true" ]; then
-    echo "search latest relase tag"
+    echo "search latest release tag"
     VERSION=$(curl -s https://registry.hub.docker.com/v1/repositories/neondatabase/neon/tags |jq -r -S '.[].name' | grep release | sed 's/release-//g' | grep -E '^[0-9]+$' | sort -n | tail -1)
     if [ -z "${VERSION}" ]; then
         echo "no any docker tags found, exiting..."
@@ -31,7 +31,7 @@ echo "found ${VERSION}"
 rm -rf neon_install postgres_install.tar.gz neon_install.tar.gz .neon_current_version
 mkdir neon_install
-# retrive binaries from docker image
+# retrieve binaries from docker image
 echo "getting binaries from docker image"
 docker pull --quiet neondatabase/neon:${TAG}
 ID=$(docker create neondatabase/neon:${TAG})


@@ -3,7 +3,6 @@
 zenith-us-stage-ps-2 console_region_id=27
 [safekeepers]
-zenith-us-stage-sk-1 console_region_id=27
 zenith-us-stage-sk-4 console_region_id=27
 zenith-us-stage-sk-5 console_region_id=27
 zenith-us-stage-sk-6 console_region_id=27


@@ -453,9 +453,6 @@ jobs:
       - checkout
       - setup_remote_docker:
          docker_layer_caching: true
-      # Build neondatabase/compute-tools:latest image and push it to Docker hub
-      # TODO: this should probably also use versioned tag, not just :latest.
-      # XXX: but should it? We build and use it only locally now.
       - run:
          name: Build and push compute-tools Docker image
          command: |
@@ -463,7 +460,10 @@ jobs:
            docker build \
              --build-arg AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}" \
              --build-arg AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}" \
-             --tag neondatabase/compute-tools:latest -f Dockerfile.compute-tools .
+             --tag neondatabase/compute-tools:local \
+             --tag neondatabase/compute-tools:latest \
+             -f Dockerfile.compute-tools .
+           # Only push :latest image
            docker push neondatabase/compute-tools:latest
      - run:
          name: Init postgres submodule
@@ -473,7 +473,9 @@ jobs:
          command: |
            echo $NEON_DOCKER_PWD | docker login -u $NEON_DOCKER_LOGIN --password-stdin
            DOCKER_TAG=$(git log --oneline|wc -l)
-           docker build --tag neondatabase/compute-node:${DOCKER_TAG} --tag neondatabase/compute-node:latest vendor/postgres
+           docker build --tag neondatabase/compute-node:${DOCKER_TAG} \
+             --tag neondatabase/compute-node:latest vendor/postgres \
+             --build-arg COMPUTE_TOOLS_TAG=local
            docker push neondatabase/compute-node:${DOCKER_TAG}
            docker push neondatabase/compute-node:latest
@@ -510,9 +512,6 @@ jobs:
      - checkout
      - setup_remote_docker:
          docker_layer_caching: true
-     # Build neondatabase/compute-tools:release image and push it to Docker hub
-     # TODO: this should probably also use versioned tag, not just :latest.
-     # XXX: but should it? We build and use it only locally now.
      - run:
          name: Build and push compute-tools Docker image
          command: |
@@ -520,7 +519,10 @@ jobs:
            docker build \
              --build-arg AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}" \
              --build-arg AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}" \
-             --tag neondatabase/compute-tools:release -f Dockerfile.compute-tools .
+             --tag neondatabase/compute-tools:release \
+             --tag neondatabase/compute-tools:local \
+             -f Dockerfile.compute-tools .
+           # Only push :release image
            docker push neondatabase/compute-tools:release
      - run:
          name: Init postgres submodule
@@ -530,7 +532,9 @@ jobs:
          command: |
            echo $NEON_DOCKER_PWD | docker login -u $NEON_DOCKER_LOGIN --password-stdin
            DOCKER_TAG="release-$(git log --oneline|wc -l)"
-           docker build --tag neondatabase/compute-node:${DOCKER_TAG} --tag neondatabase/compute-node:release vendor/postgres
+           docker build --tag neondatabase/compute-node:${DOCKER_TAG} \
+             --tag neondatabase/compute-node:release vendor/postgres \
+             --build-arg COMPUTE_TOOLS_TAG=local
            docker push neondatabase/compute-node:${DOCKER_TAG}
            docker push neondatabase/compute-node:release
@@ -746,7 +750,6 @@ workflows:
      - build-postgres-<< matrix.build_type >>
      - run-pytest:
          name: pg_regress-tests-<< matrix.build_type >>
-         context: PERF_TEST_RESULT_CONNSTR
          matrix:
            parameters:
              build_type: ["debug", "release"]


@@ -19,7 +19,7 @@ jobs:
 bench:
   # this workflow runs on self hosteed runner
   # it's environment is quite different from usual guthub runner
-  # probably the most important difference is that it doesnt start from clean workspace each time
+  # probably the most important difference is that it doesn't start from clean workspace each time
   # e g if you install system packages they are not cleaned up since you install them directly in host machine
   # not a container or something
   # See documentation for more info: https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners


@@ -12,7 +12,7 @@ jobs:
matrix: matrix:
# If we want to duplicate this job for different # If we want to duplicate this job for different
# Rust toolchains (e.g. nightly or 1.37.0), add them here. # Rust toolchains (e.g. nightly or 1.37.0), add them here.
rust_toolchain: [stable] rust_toolchain: [1.58]
os: [ubuntu-latest, macos-latest] os: [ubuntu-latest, macos-latest]
timeout-minutes: 30 timeout-minutes: 30
name: run regression test suite name: run regression test suite
@@ -40,11 +40,11 @@ jobs:
if: matrix.os == 'ubuntu-latest' if: matrix.os == 'ubuntu-latest'
run: | run: |
sudo apt update sudo apt update
sudo apt install build-essential libreadline-dev zlib1g-dev flex bison libseccomp-dev sudo apt install build-essential libreadline-dev zlib1g-dev flex bison libseccomp-dev libssl-dev
- name: Install macOs postgres dependencies - name: Install macOS postgres dependencies
if: matrix.os == 'macos-latest' if: matrix.os == 'macos-latest'
run: brew install flex bison run: brew install flex bison openssl
- name: Set pg revision for caching - name: Set pg revision for caching
id: pg_ver id: pg_ver
@@ -58,10 +58,27 @@ jobs:
tmp_install/ tmp_install/
key: ${{ runner.os }}-pg-${{ steps.pg_ver.outputs.pg_rev }} key: ${{ runner.os }}-pg-${{ steps.pg_ver.outputs.pg_rev }}
- name: Set extra env for macOS
if: matrix.os == 'macos-latest'
run: |
echo 'LDFLAGS=-L/usr/local/opt/openssl@3/lib' >> $GITHUB_ENV
echo 'CPPFLAGS=-I/usr/local/opt/openssl@3/include' >> $GITHUB_ENV
- name: Build postgres - name: Build postgres
if: steps.cache_pg.outputs.cache-hit != 'true' if: steps.cache_pg.outputs.cache-hit != 'true'
run: make postgres run: make postgres
# Plain configure output can contain weird errors like 'error: C compiler cannot create executables'
# and the real cause will be inside config.log
- name: Print configure logs in case of failure
if: failure()
continue-on-error: true
run: |
echo '' && echo '=== config.log ===' && echo ''
cat tmp_install/build/config.log
echo '' && echo '=== configure.log ===' && echo ''
cat tmp_install/build/configure.log
- name: Cache cargo deps - name: Cache cargo deps
id: cache_cargo id: cache_cargo
uses: actions/cache@v2 uses: actions/cache@v2
@@ -70,7 +87,7 @@ jobs:
~/.cargo/registry ~/.cargo/registry
~/.cargo/git ~/.cargo/git
target target
key: ${{ runner.os }}-cargo-${{ hashFiles('./Cargo.lock') }} key: ${{ runner.os }}-cargo-${{ hashFiles('./Cargo.lock') }}-rust-${{ matrix.rust_toolchain }}
- name: Run cargo clippy - name: Run cargo clippy
run: ./run_clippy.sh run: ./run_clippy.sh

.gitignore (1 line changed)

@@ -5,6 +5,7 @@
 __pycache__/
 test_output/
 .vscode
+.idea
 /.zenith
 /integration_tests/.zenith

Cargo.lock (54 lines changed)

@@ -292,9 +292,6 @@ name = "cc"
 version = "1.0.72"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "22a9137b95ea06864e018375b72adfb7db6e6f68cfc8df5a04d00288050485ee"
-dependencies = [
- "jobserver",
-]

 [[package]]
 name = "cexpr"
@@ -366,6 +363,16 @@ dependencies = [
  "textwrap 0.14.2",
 ]

+[[package]]
+name = "close_fds"
+version = "0.3.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "3bc416f33de9d59e79e57560f450d21ff8393adcf1cdfc3e6d8fb93d5f88a2ed"
+dependencies = [
+ "cfg-if",
+ "libc",
+]

 [[package]]
 name = "cmake"
 version = "0.1.48"
@@ -804,6 +811,7 @@ name = "etcd_broker"
 version = "0.1.0"
 dependencies = [
  "etcd-client",
+ "once_cell",
  "regex",
  "serde",
  "serde_json",
@@ -1359,15 +1367,6 @@ version = "1.0.1"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "1aab8fc367588b89dcee83ab0fd66b72b50b72fa1904d7095045ace2b0c81c35"

-[[package]]
-name = "jobserver"
-version = "0.1.24"
-source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "af25a77299a7f711a01975c35a6a424eb6862092cc2d6c72c4ed6cbc56dfc1fa"
-dependencies = [
- "libc",
-]

 [[package]]
 name = "js-sys"
 version = "0.3.56"
@@ -1801,6 +1800,7 @@ dependencies = [
  "bytes",
  "chrono",
  "clap 3.0.14",
+ "close_fds",
  "const_format",
  "crc32c",
  "crossbeam-utils",
@@ -1843,7 +1843,6 @@ dependencies = [
  "url",
  "utils",
  "workspace_hack",
- "zstd",
 ]

 [[package]]
@@ -3951,32 +3950,3 @@ name = "zeroize"
 version = "1.5.2"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "7c88870063c39ee00ec285a2f8d6a966e5b6fb2becc4e8dac77ed0d370ed6006"

-[[package]]
-name = "zstd"
-version = "0.11.1+zstd.1.5.2"
-source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "77a16b8414fde0414e90c612eba70985577451c4c504b99885ebed24762cb81a"
-dependencies = [
- "zstd-safe",
-]

-[[package]]
-name = "zstd-safe"
-version = "5.0.1+zstd.1.5.2"
-source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "7c12659121420dd6365c5c3de4901f97145b79651fb1d25814020ed2ed0585ae"
-dependencies = [
- "libc",
- "zstd-sys",
-]

-[[package]]
-name = "zstd-sys"
-version = "2.0.1+zstd.1.5.2"
-source = "registry+https://github.com/rust-lang/crates.io-index"
-checksum = "9fd07cbbc53846d9145dbffdf6dd09a7a0aa52be46741825f5c97bdd4f73f12b"
-dependencies = [
- "cc",
- "libc",
-]


@@ -25,7 +25,7 @@ COPY --from=pg-build /pg/tmp_install/include/postgresql/server tmp_install/inclu
 COPY . .

 # Show build caching stats to check if it was used in the end.
-# Has to be the part of the same RUN since cachepot daemon is killed in the end of this RUN, loosing the compilation stats.
+# Has to be the part of the same RUN since cachepot daemon is killed in the end of this RUN, losing the compilation stats.
 RUN set -e \
     && sudo -E "PATH=$PATH" mold -run cargo build --release \
     && cachepot -s


@@ -1,95 +0,0 @@
-#
-# Docker image for console integration testing.
-#
-# We may also reuse it in CI to unify installation process and as a general binaries building
-# tool for production servers.
-#
-# Dynamic linking is used for librocksdb and libstdc++ bacause librocksdb-sys calls
-# bindgen with "dynamic" feature flag. This also prevents usage of dockerhub alpine-rust
-# images which are statically linked and have guards against any dlopen. I would rather
-# prefer all static binaries so we may change the way librocksdb-sys builds or wait until
-# we will have our own storage and drop rockdb dependency.
-#
-# Cargo-chef is used to separate dependencies building from main binaries building. This
-# way `docker build` will download and install dependencies only of there are changes to
-# out Cargo.toml files.
-#
-
-#
-# build postgres separately -- this layer will be rebuilt only if one of
-# mentioned paths will get any changes
-#
-FROM alpine:3.13 as pg-build
-RUN apk add --update clang llvm compiler-rt compiler-rt-static lld musl-dev binutils \
-	make bison flex readline-dev zlib-dev perl linux-headers libseccomp-dev
-WORKDIR zenith
-COPY ./vendor/postgres vendor/postgres
-COPY ./Makefile Makefile
-# Build using clang and lld
-RUN CC='clang' LD='lld' CFLAGS='-fuse-ld=lld --rtlib=compiler-rt' make postgres -j4
-
-#
-# Calculate cargo dependencies.
-# This will always run, but only generate recipe.json with list of dependencies without
-# installing them.
-#
-FROM alpine:20210212 as cargo-deps-inspect
-RUN apk add --update rust cargo
-RUN cargo install cargo-chef
-WORKDIR zenith
-COPY . .
-RUN cargo chef prepare --recipe-path recipe.json
-
-#
-# Build cargo dependencies.
-# This temp cantainner would be build only if recipe.json was changed.
-#
-FROM alpine:20210212 as deps-build
-RUN apk add --update rust cargo openssl-dev clang build-base
-# rust-rocksdb can be built against system-wide rocksdb -- that saves about
-# 10 minutes during build. Rocksdb apk package is in testing now, but use it
-# anyway. In case of any troubles we can download and build rocksdb here manually
-# (to cache it as a docker layer).
-RUN apk --no-cache --update --repository https://dl-cdn.alpinelinux.org/alpine/edge/testing add rocksdb-dev
-WORKDIR zenith
-COPY --from=pg-build /zenith/tmp_install/include/postgresql/server tmp_install/include/postgresql/server
-COPY --from=cargo-deps-inspect /root/.cargo/bin/cargo-chef /root/.cargo/bin/
-COPY --from=cargo-deps-inspect /zenith/recipe.json recipe.json
-RUN ROCKSDB_LIB_DIR=/usr/lib/ cargo chef cook --release --recipe-path recipe.json
-
-#
-# Build zenith binaries
-#
-FROM alpine:20210212 as build
-RUN apk add --update rust cargo openssl-dev clang build-base
-RUN apk --no-cache --update --repository https://dl-cdn.alpinelinux.org/alpine/edge/testing add rocksdb-dev
-WORKDIR zenith
-COPY . .
-# Copy cached dependencies
-COPY --from=pg-build /zenith/tmp_install/include/postgresql/server tmp_install/include/postgresql/server
-COPY --from=deps-build /zenith/target target
-COPY --from=deps-build /root/.cargo /root/.cargo
-RUN cargo build --release
-
-#
-# Copy binaries to resulting image.
-# build-base hare to provide libstdc++ (it will also bring gcc, but leave it this way until we figure
-# out how to statically link rocksdb or avoid it at all).
-#
-FROM alpine:3.13
-RUN apk add --update openssl build-base libseccomp-dev
-RUN apk --no-cache --update --repository https://dl-cdn.alpinelinux.org/alpine/edge/testing add rocksdb
-COPY --from=build /zenith/target/release/pageserver /usr/local/bin
-COPY --from=build /zenith/target/release/safekeeper /usr/local/bin
-COPY --from=build /zenith/target/release/proxy /usr/local/bin
-COPY --from=pg-build /zenith/tmp_install /usr/local
-COPY docker-entrypoint.sh /docker-entrypoint.sh
-RUN addgroup zenith && adduser -h /data -D -G zenith zenith
-VOLUME ["/data"]
-WORKDIR /data
-USER zenith
-EXPOSE 6400
-ENTRYPOINT ["/docker-entrypoint.sh"]
-CMD ["pageserver"]


@@ -26,7 +26,7 @@ endif
 # macOS with brew-installed openssl requires explicit paths
 UNAME_S := $(shell uname -s)
 ifeq ($(UNAME_S),Darwin)
-	PG_CONFIGURE_OPTS += --with-includes=/usr/local/opt/openssl/include --with-libraries=/usr/local/opt/openssl/lib
+	PG_CONFIGURE_OPTS += --with-includes=$(HOMEBREW_PREFIX)/opt/openssl/include --with-libraries=$(HOMEBREW_PREFIX)/opt/openssl/lib
 endif

 # Choose whether we should be silent or verbose
@@ -74,16 +74,16 @@ postgres-headers: postgres-configure
 	+@echo "Installing PostgreSQL headers"
 	$(MAKE) -C tmp_install/build/src/include MAKELEVEL=0 install

-# Compile and install PostgreSQL and contrib/zenith
+# Compile and install PostgreSQL and contrib/neon
 .PHONY: postgres
 postgres: postgres-configure \
 	postgres-headers # to prevent `make install` conflicts with zenith's `postgres-headers`
 	+@echo "Compiling PostgreSQL"
 	$(MAKE) -C tmp_install/build MAKELEVEL=0 install
-	+@echo "Compiling contrib/zenith"
-	$(MAKE) -C tmp_install/build/contrib/zenith install
-	+@echo "Compiling contrib/zenith_test_utils"
-	$(MAKE) -C tmp_install/build/contrib/zenith_test_utils install
+	+@echo "Compiling contrib/neon"
+	$(MAKE) -C tmp_install/build/contrib/neon install
+	+@echo "Compiling contrib/neon_test_utils"
+	$(MAKE) -C tmp_install/build/contrib/neon_test_utils install
 	+@echo "Compiling pg_buffercache"
 	$(MAKE) -C tmp_install/build/contrib/pg_buffercache install
 	+@echo "Compiling pageinspect"


@@ -5,6 +5,11 @@ Neon is a serverless open source alternative to AWS Aurora Postgres. It separate
 The project used to be called "Zenith". Many of the commands and code comments
 still refer to "zenith", but we are in the process of renaming things.

+## Quick start
+[Join the waitlist](https://neon.tech/) for our free tier to receive your serverless postgres instance. Then connect to it with your preferred postgres client (psql, dbeaver, etc) or use the online SQL editor.
+Alternatively, compile and run the project [locally](#running-local-installation).

 ## Architecture overview

 A Neon installation consists of compute nodes and Neon storage engine.
@@ -24,13 +29,18 @@ Pageserver consists of:

 ## Running local installation

-#### building on Ubuntu/ Debian (Linux)
+#### building on Linux

 1. Install build dependencies and other useful packages

-On Ubuntu or Debian this set of packages should be sufficient to build the code:
-```text
-apt install build-essential libtool libreadline-dev zlib1g-dev flex bison libseccomp-dev \
-libssl-dev clang pkg-config libpq-dev libprotobuf-dev etcd
+* On Ubuntu or Debian this set of packages should be sufficient to build the code:
+```bash
+apt install build-essential libtool libreadline-dev zlib1g-dev flex bison libseccomp-dev \
+libssl-dev clang pkg-config libpq-dev etcd cmake postgresql-client
+```
+* On Fedora these packages are needed:
+```bash
+dnf install flex bison readline-devel zlib-devel openssl-devel \
+libseccomp-devel perl clang cmake etcd postgresql postgresql-contrib
 ```

 2. [Install Rust](https://www.rust-lang.org/tools/install)
@@ -39,16 +49,11 @@ libssl-dev clang pkg-config libpq-dev libprotobuf-dev etcd
 curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
 ```

-3. Install PostgreSQL Client
-```
-apt install postgresql-client
-```
-
-4. Build neon and patched postgres
+3. Build neon and patched postgres
 ```sh
 git clone --recursive https://github.com/neondatabase/neon.git
 cd neon
-make -j5
+make -j`nproc`
 ```

 #### building on OSX (12.3.1)
@@ -108,7 +113,7 @@ Safekeeper started
 > ./target/debug/neon_local pg start main
 Starting new postgres main on timeline de200bd42b49cc1814412c7e592dd6e9 ...
 Extracting base backup to create postgres instance: path=.zenith/pgdatadirs/tenants/9ef87a5bf0d92544f6fafeeb3239695c/main port=55432
-Starting postgres node at 'host=127.0.0.1 port=55432 user=zenith_admin dbname=postgres'
+Starting postgres node at 'host=127.0.0.1 port=55432 user=cloud_admin dbname=postgres'

 # check list of running postgres instances
 > ./target/debug/neon_local pg list
@@ -118,7 +123,7 @@ Starting postgres node at 'host=127.0.0.1 port=55432 user=zenith_admin dbname=po
 2. Now it is possible to connect to postgres and run some queries:
 ```text
-> psql -p55432 -h 127.0.0.1 -U zenith_admin postgres
+> psql -p55432 -h 127.0.0.1 -U cloud_admin postgres
 postgres=# CREATE TABLE t(key int primary key, value text);
 CREATE TABLE
 postgres=# insert into t values(1,1);
@@ -145,7 +150,7 @@ Created timeline 'b3b863fa45fa9e57e615f9f2d944e601' at Lsn 0/16F9A00 for tenant:
 > ./target/debug/neon_local pg start migration_check --branch-name migration_check
 Starting new postgres migration_check on timeline b3b863fa45fa9e57e615f9f2d944e601 ...
 Extracting base backup to create postgres instance: path=.zenith/pgdatadirs/tenants/9ef87a5bf0d92544f6fafeeb3239695c/migration_check port=55433
-Starting postgres node at 'host=127.0.0.1 port=55433 user=zenith_admin dbname=postgres'
+Starting postgres node at 'host=127.0.0.1 port=55433 user=cloud_admin dbname=postgres'

 # check the new list of running postgres instances
 > ./target/debug/neon_local pg list
@@ -155,7 +160,7 @@ Starting postgres node at 'host=127.0.0.1 port=55433 user=zenith_admin dbname=po
 # this new postgres instance will have all the data from 'main' postgres,
 # but all modifications would not affect data in original postgres
-> psql -p55433 -h 127.0.0.1 -U zenith_admin postgres
+> psql -p55433 -h 127.0.0.1 -U cloud_admin postgres
 postgres=# select * from t;
  key | value
 -----+-------
@@ -166,7 +171,7 @@ postgres=# insert into t values(2,2);
 INSERT 0 1

 # check that the new change doesn't affect the 'main' postgres
-> psql -p55432 -h 127.0.0.1 -U zenith_admin postgres
+> psql -p55432 -h 127.0.0.1 -U cloud_admin postgres
 postgres=# select * from t;
  key | value
 -----+-------
@@ -22,7 +22,7 @@ Also `compute_ctl` spawns two separate service threads:
 Usage example:
 ```sh
 compute_ctl -D /var/db/postgres/compute \
--C 'postgresql://zenith_admin@localhost/postgres' \
+-C 'postgresql://cloud_admin@localhost/postgres' \
 -S /var/db/postgres/specs/current.json \
 -b /usr/local/bin/postgres
 ```
@@ -21,7 +21,7 @@
 //! Usage example:
 //! ```sh
 //! compute_ctl -D /var/db/postgres/compute \
-//! -C 'postgresql://zenith_admin@localhost/postgres' \
+//! -C 'postgresql://cloud_admin@localhost/postgres' \
 //! -S /var/db/postgres/specs/current.json \
 //! -b /usr/local/bin/postgres
 //! ```
@@ -116,17 +116,17 @@ fn main() -> Result<()> {
 let pageserver_connstr = spec
 .cluster
 .settings
-.find("zenith.page_server_connstring")
+.find("neon.pageserver_connstring")
 .expect("pageserver connstr should be provided");
 let tenant = spec
 .cluster
 .settings
-.find("zenith.zenith_tenant")
+.find("neon.tenant_id")
 .expect("tenant id should be provided");
 let timeline = spec
 .cluster
 .settings
-.find("zenith.zenith_timeline")
+.find("neon.timeline_id")
 .expect("tenant id should be provided");
 let compute_state = ComputeNode {
@@ -262,7 +262,30 @@ impl ComputeNode {
 .unwrap_or_else(|| "5432".to_string());
 wait_for_postgres(&mut pg, &port, pgdata_path)?;
-let mut client = Client::connect(&self.connstr, NoTls)?;
+// If connection fails,
+// it may be the old node with `zenith_admin` superuser.
+//
+// In this case we need to connect with old `zenith_admin` name
+// and create new user. We cannot simply rename connected user,
+// but we can create a new one and grant it all privileges.
+let mut client = match Client::connect(&self.connstr, NoTls) {
+Err(e) => {
+info!(
+"cannot connect to postgres: {}, retrying with `zenith_admin` username",
+e
+);
+let zenith_admin_connstr = self.connstr.replacen("cloud_admin", "zenith_admin", 1);
+let mut client = Client::connect(&zenith_admin_connstr, NoTls)?;
+client.simple_query("CREATE USER cloud_admin WITH SUPERUSER")?;
+client.simple_query("GRANT zenith_admin TO cloud_admin")?;
+drop(client);
+// reconnect with connstring with expected name
+Client::connect(&self.connstr, NoTls)?
+}
+Ok(client) => client,
+};
 handle_roles(&self.spec, &mut client)?;
 handle_databases(&self.spec, &mut client)?;
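The fallback above hinges on rewriting the connection string to the legacy superuser before retrying. A minimal sketch of just that string rewrite, with plain `&str` standing in for the real connection handling (which is not reproduced here):

```rust
// Swap the first occurrence of the new superuser name for the legacy one,
// as the diff does with `self.connstr.replacen(...)`. Using replacen with
// count 1 means a later "cloud_admin" substring (e.g. a database name)
// would be left untouched.
fn legacy_admin_connstr(connstr: &str) -> String {
    connstr.replacen("cloud_admin", "zenith_admin", 1)
}

fn main() {
    let connstr = "postgresql://cloud_admin@localhost/postgres";
    assert_eq!(
        legacy_admin_connstr(connstr),
        "postgresql://zenith_admin@localhost/postgres"
    );
}
```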
@@ -43,7 +43,7 @@ fn watch_compute_activity(compute: &Arc<ComputeNode>) {
 FROM pg_stat_activity
 WHERE backend_type = 'client backend'
 AND pid != pg_backend_pid()
-AND usename != 'zenith_admin';", // XXX: find a better way to filter other monitors?
+AND usename != 'cloud_admin';", // XXX: find a better way to filter other monitors?
 &[],
 );
 let mut last_active = compute.state.read().unwrap().last_active;
@@ -85,7 +85,7 @@
 "vartype": "bool"
 },
 {
-"name": "wal_acceptors",
+"name": "safekeepers",
 "value": "127.0.0.1:6502,127.0.0.1:6503,127.0.0.1:6501",
 "vartype": "string"
 },
@@ -150,7 +150,7 @@
 "vartype": "integer"
 },
 {
-"name": "zenith.zenith_tenant",
+"name": "neon.tenant_id",
 "value": "b0554b632bd4d547a63b86c3630317e8",
 "vartype": "string"
 },
@@ -160,13 +160,13 @@
 "vartype": "integer"
 },
 {
-"name": "zenith.zenith_timeline",
+"name": "neon.timeline_id",
 "value": "2414a61ffc94e428f14b5758fe308e13",
 "vartype": "string"
 },
 {
 "name": "shared_preload_libraries",
-"value": "zenith",
+"value": "neon",
 "vartype": "string"
 },
 {
@@ -175,7 +175,7 @@
 "vartype": "string"
 },
 {
-"name": "zenith.page_server_connstring",
+"name": "neon.pageserver_connstring",
 "value": "host=127.0.0.1 port=6400",
 "vartype": "string"
 }
@@ -28,7 +28,7 @@ mod pg_helpers_tests {
 assert_eq!(
 spec.cluster.settings.as_pg_settings(),
-"fsync = off\nwal_level = replica\nhot_standby = on\nwal_acceptors = '127.0.0.1:6502,127.0.0.1:6503,127.0.0.1:6501'\nwal_log_hints = on\nlog_connections = on\nshared_buffers = 32768\nport = 55432\nmax_connections = 100\nmax_wal_senders = 10\nlisten_addresses = '0.0.0.0'\nwal_sender_timeout = 0\npassword_encryption = md5\nmaintenance_work_mem = 65536\nmax_parallel_workers = 8\nmax_worker_processes = 8\nzenith.zenith_tenant = 'b0554b632bd4d547a63b86c3630317e8'\nmax_replication_slots = 10\nzenith.zenith_timeline = '2414a61ffc94e428f14b5758fe308e13'\nshared_preload_libraries = 'zenith'\nsynchronous_standby_names = 'walproposer'\nzenith.page_server_connstring = 'host=127.0.0.1 port=6400'"
+"fsync = off\nwal_level = replica\nhot_standby = on\nsafekeepers = '127.0.0.1:6502,127.0.0.1:6503,127.0.0.1:6501'\nwal_log_hints = on\nlog_connections = on\nshared_buffers = 32768\nport = 55432\nmax_connections = 100\nmax_wal_senders = 10\nlisten_addresses = '0.0.0.0'\nwal_sender_timeout = 0\npassword_encryption = md5\nmaintenance_work_mem = 65536\nmax_parallel_workers = 8\nmax_worker_processes = 8\nneon.tenant_id = 'b0554b632bd4d547a63b86c3630317e8'\nmax_replication_slots = 10\nneon.timeline_id = '2414a61ffc94e428f14b5758fe308e13'\nshared_preload_libraries = 'neon'\nsynchronous_standby_names = 'walproposer'\nneon.pageserver_connstring = 'host=127.0.0.1 port=6400'"
 );
 }
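The renames exercised by this test boil down to a small lookup; a hypothetical helper (not part of the codebase) summarizing the old-to-new setting names touched across these commits:

```rust
// Illustrative mapping of the GUC renames in this compare:
// zenith.* settings become neon.*, and wal_acceptors becomes safekeepers.
fn renamed_setting(old: &str) -> &str {
    match old {
        "zenith.zenith_tenant" => "neon.tenant_id",
        "zenith.zenith_timeline" => "neon.timeline_id",
        "zenith.page_server_connstring" => "neon.pageserver_connstring",
        "wal_acceptors" => "safekeepers",
        // Everything else (e.g. shared_buffers) is unchanged.
        other => other,
    }
}

fn main() {
    assert_eq!(renamed_setting("zenith.zenith_tenant"), "neon.tenant_id");
    assert_eq!(renamed_setting("wal_acceptors"), "safekeepers");
    assert_eq!(renamed_setting("shared_buffers"), "shared_buffers");
}
```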
@@ -148,9 +148,9 @@ impl PostgresNode {
 // Read a few options from the config file
 let context = format!("in config file {}", cfg_path_str);
 let port: u16 = conf.parse_field("port", &context)?;
-let timeline_id: ZTimelineId = conf.parse_field("zenith.zenith_timeline", &context)?;
+let timeline_id: ZTimelineId = conf.parse_field("neon.timeline_id", &context)?;
-let tenant_id: ZTenantId = conf.parse_field("zenith.zenith_tenant", &context)?;
+let tenant_id: ZTenantId = conf.parse_field("neon.tenant_id", &context)?;
-let uses_wal_proposer = conf.get("wal_acceptors").is_some();
+let uses_wal_proposer = conf.get("safekeepers").is_some();
 // parse recovery_target_lsn, if any
 let recovery_target_lsn: Option<Lsn> =
@@ -303,11 +303,11 @@ impl PostgresNode {
 // uses only needed variables namely host, port, user, password.
 format!("postgresql://no_user:{}@{}:{}", password, host, port)
 };
-conf.append("shared_preload_libraries", "zenith");
+conf.append("shared_preload_libraries", "neon");
 conf.append_line("");
-conf.append("zenith.page_server_connstring", &pageserver_connstr);
+conf.append("neon.pageserver_connstring", &pageserver_connstr);
-conf.append("zenith.zenith_tenant", &self.tenant_id.to_string());
+conf.append("neon.tenant_id", &self.tenant_id.to_string());
-conf.append("zenith.zenith_timeline", &self.timeline_id.to_string());
+conf.append("neon.timeline_id", &self.timeline_id.to_string());
 if let Some(lsn) = self.lsn {
 conf.append("recovery_target_lsn", &lsn.to_string());
 }
@@ -341,7 +341,7 @@ impl PostgresNode {
 .map(|sk| format!("localhost:{}", sk.pg_port))
 .collect::<Vec<String>>()
 .join(",");
-conf.append("wal_acceptors", &safekeepers);
+conf.append("safekeepers", &safekeepers);
 } else {
 // We only use setup without safekeepers for tests,
 // and don't care about data durability on pageserver,
@@ -352,7 +352,6 @@ impl PostgresNode {
 // This isn't really a supported configuration, but can be useful for
 // testing.
 conf.append("synchronous_standby_names", "pageserver");
-conf.append("zenith.callmemaybe_connstring", &self.connstr());
 }
 let mut file = File::create(self.pgdata().join("postgresql.conf"))?;
@@ -499,7 +498,7 @@ impl PostgresNode {
 "host={} port={} user={} dbname={}",
 self.address.ip(),
 self.address.port(),
-"zenith_admin",
+"cloud_admin",
 "postgres"
 )
 }
@@ -77,7 +77,7 @@ pub fn stop_etcd_process(env: &local_env::LocalEnv) -> anyhow::Result<()> {
 let etcd_pid_file_path = etcd_pid_file_path(env);
 let pid = Pid::from_raw(read_pidfile(&etcd_pid_file_path).with_context(|| {
 format!(
-"Failed to read etcd pid filea at {}",
+"Failed to read etcd pid file at {}",
 etcd_pid_file_path.display()
 )
 })?);
@@ -119,16 +119,24 @@ impl EtcdBroker {
 }
 pub fn comma_separated_endpoints(&self) -> String {
-self.broker_endpoints.iter().map(Url::as_str).fold(
-String::new(),
-|mut comma_separated_urls, url| {
+self.broker_endpoints
+.iter()
+.map(|url| {
+// URL by default adds a '/' path at the end, which is not what etcd CLI wants.
+let url_string = url.as_str();
+if url_string.ends_with('/') {
+&url_string[0..url_string.len() - 1]
+} else {
+url_string
+}
+})
+.fold(String::new(), |mut comma_separated_urls, url| {
 if !comma_separated_urls.is_empty() {
 comma_separated_urls.push(',');
 }
 comma_separated_urls.push_str(url);
 comma_separated_urls
-},
-)
+})
 }
 }
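The change above works around `Url::as_str()` keeping a trailing `/`, which the etcd CLI rejects in its endpoint list. A self-contained sketch of the same trim-then-join logic, with plain `&str` standing in for `url::Url`:

```rust
// Trim a trailing '/' from each endpoint (as url::Url renders one by
// default), then join with commas for the etcd CLI's --endpoints flag.
fn comma_separated(endpoints: &[&str]) -> String {
    endpoints
        .iter()
        .map(|s| s.strip_suffix('/').unwrap_or(s))
        .collect::<Vec<_>>()
        .join(",")
}

fn main() {
    let urls = ["http://127.0.0.1:2379/", "http://127.0.0.1:2380"];
    assert_eq!(
        comma_separated(&urls),
        "http://127.0.0.1:2379,http://127.0.0.1:2380"
    );
}
```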
@@ -1,6 +1,7 @@
 use std::collections::HashMap;
 use std::io::Write;
 use std::net::TcpStream;
+use std::num::NonZeroU64;
 use std::path::PathBuf;
 use std::process::Command;
 use std::time::Duration;
@@ -11,6 +12,7 @@ use nix::errno::Errno;
 use nix::sys::signal::{kill, Signal};
 use nix::unistd::Pid;
 use pageserver::http::models::{TenantConfigRequest, TenantCreateRequest, TimelineCreateRequest};
+use pageserver::tenant_mgr::TenantInfo;
 use pageserver::timelines::TimelineInfo;
 use postgres::{Config, NoTls};
 use reqwest::blocking::{Client, RequestBuilder, Response};
@@ -26,7 +28,6 @@ use utils::{
 use crate::local_env::LocalEnv;
 use crate::{fill_aws_secrets_vars, fill_rust_env_vars, read_pidfile};
-use pageserver::tenant_mgr::TenantInfo;
 #[derive(Error, Debug)]
 pub enum PageserverHttpError {
@@ -37,6 +38,12 @@ pub enum PageserverHttpError {
 Response(String),
 }
+impl From<anyhow::Error> for PageserverHttpError {
+fn from(e: anyhow::Error) -> Self {
+Self::Response(e.to_string())
+}
+}
 type Result<T> = result::Result<T, PageserverHttpError>;
 pub trait ResponseErrorMessageExt: Sized {
@@ -410,6 +417,15 @@ impl PageServerNode {
 .map(|x| x.parse::<usize>())
 .transpose()?,
 pitr_interval: settings.get("pitr_interval").map(|x| x.to_string()),
+walreceiver_connect_timeout: settings
+.get("walreceiver_connect_timeout")
+.map(|x| x.to_string()),
+lagging_wal_timeout: settings.get("lagging_wal_timeout").map(|x| x.to_string()),
+max_lsn_wal_lag: settings
+.get("max_lsn_wal_lag")
+.map(|x| x.parse::<NonZeroU64>())
+.transpose()
+.context("Failed to parse 'max_lsn_wal_lag' as non zero integer")?,
 })
 .send()?
 .error_from_body()?
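The added `From<anyhow::Error>` impl is what lets the new `.context(...)?` calls in this file convert into `PageserverHttpError` via `?`. A compilable sketch of the same pattern, with a std `ParseIntError` standing in for `anyhow::Error` so no external crate is needed:

```rust
// Stand-in for the real PageserverHttpError; the variant name matches the
// diff, but the source error type here is std's ParseIntError, not anyhow.
#[derive(Debug)]
enum PageserverHttpError {
    Response(String),
}

impl From<std::num::ParseIntError> for PageserverHttpError {
    fn from(e: std::num::ParseIntError) -> Self {
        Self::Response(e.to_string())
    }
}

type Result<T> = std::result::Result<T, PageserverHttpError>;

fn parse_setting(raw: &str) -> Result<u64> {
    // `?` auto-converts the parse error through the From impl above,
    // mirroring how the added impl lets anyhow errors flow into Response.
    Ok(raw.parse::<u64>()?)
}

fn main() {
    assert_eq!(parse_setting("42").unwrap(), 42);
    assert!(parse_setting("oops").is_err());
}
```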
@@ -433,22 +449,41 @@ impl PageServerNode {
 tenant_id,
 checkpoint_distance: settings
 .get("checkpoint_distance")
-.map(|x| x.parse::<u64>().unwrap()),
+.map(|x| x.parse::<u64>())
+.transpose()
+.context("Failed to parse 'checkpoint_distance' as an integer")?,
 compaction_target_size: settings
 .get("compaction_target_size")
-.map(|x| x.parse::<u64>().unwrap()),
+.map(|x| x.parse::<u64>())
+.transpose()
+.context("Failed to parse 'compaction_target_size' as an integer")?,
 compaction_period: settings.get("compaction_period").map(|x| x.to_string()),
 compaction_threshold: settings
 .get("compaction_threshold")
-.map(|x| x.parse::<usize>().unwrap()),
+.map(|x| x.parse::<usize>())
+.transpose()
+.context("Failed to parse 'compaction_threshold' as an integer")?,
 gc_horizon: settings
 .get("gc_horizon")
-.map(|x| x.parse::<u64>().unwrap()),
+.map(|x| x.parse::<u64>())
+.transpose()
+.context("Failed to parse 'gc_horizon' as an integer")?,
 gc_period: settings.get("gc_period").map(|x| x.to_string()),
 image_creation_threshold: settings
 .get("image_creation_threshold")
-.map(|x| x.parse::<usize>().unwrap()),
+.map(|x| x.parse::<usize>())
+.transpose()
+.context("Failed to parse 'image_creation_threshold' as non zero integer")?,
 pitr_interval: settings.get("pitr_interval").map(|x| x.to_string()),
+walreceiver_connect_timeout: settings
+.get("walreceiver_connect_timeout")
+.map(|x| x.to_string()),
+lagging_wal_timeout: settings.get("lagging_wal_timeout").map(|x| x.to_string()),
+max_lsn_wal_lag: settings
+.get("max_lsn_wal_lag")
+.map(|x| x.parse::<NonZeroU64>())
+.transpose()
+.context("Failed to parse 'max_lsn_wal_lag' as non zero integer")?,
 })
 .send()?
 .error_from_body()?;
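The repeated `map → transpose → context` chain above replaces `unwrap()` panics with labelled errors when an optional setting fails to parse. A self-contained sketch of the pattern for one setting, with `map_err` standing in for `anyhow::Context` so it runs on std alone:

```rust
use std::collections::HashMap;
use std::num::NonZeroU64;

// Parse the setting only if present: Option<Result<..>> is flipped into
// Result<Option<..>> by transpose(), so an absent key is Ok(None) while a
// malformed value becomes a labelled Err instead of a panic.
fn max_lsn_wal_lag(settings: &HashMap<&str, &str>) -> Result<Option<NonZeroU64>, String> {
    settings
        .get("max_lsn_wal_lag")
        .map(|x| x.parse::<NonZeroU64>())
        .transpose()
        .map_err(|e| format!("Failed to parse 'max_lsn_wal_lag' as non zero integer: {}", e))
}

fn main() {
    let mut settings = HashMap::new();
    assert_eq!(max_lsn_wal_lag(&settings), Ok(None)); // absent setting is fine
    settings.insert("max_lsn_wal_lag", "1024");
    assert_eq!(
        max_lsn_wal_lag(&settings).unwrap(),
        Some(NonZeroU64::new(1024).unwrap())
    );
    settings.insert("max_lsn_wal_lag", "0");
    assert!(max_lsn_wal_lag(&settings).is_err()); // zero rejected by NonZeroU64
}
```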
@@ -6,7 +6,7 @@
 - [docker.md](docker.md) — Docker images and building pipeline.
 - [glossary.md](glossary.md) — Glossary of all the terms used in codebase.
 - [multitenancy.md](multitenancy.md) — how multitenancy is organized in the pageserver and Zenith CLI.
-- [sourcetree.md](sourcetree.md) — Overview of the source tree layeout.
+- [sourcetree.md](sourcetree.md) — Overview of the source tree layout.
 - [pageserver/README.md](/pageserver/README.md) — pageserver overview.
 - [postgres_ffi/README.md](/libs/postgres_ffi/README.md) — Postgres FFI overview.
 - [test_runner/README.md](/test_runner/README.md) — tests infrastructure overview.
@@ -188,7 +188,7 @@ Not currently committed but proposed:
 3. Prefetching
 - Why?
 As far as pages in Zenith are loaded on demand, to reduce node startup time
-and also sppedup some massive queries we need some mechanism for bulk loading to
+and also speedup some massive queries we need some mechanism for bulk loading to
 reduce page request round-trip overhead.
 Currently Postgres is supporting prefetching only for bitmap scan.
@@ -2,7 +2,7 @@
 ### Authentication
-### Backpresssure
+### Backpressure
 Backpressure is used to limit the lag between pageserver and compute node or WAL service.
@@ -115,7 +115,7 @@ Neon safekeeper LSNs. For more check [safekeeper/README_PROTO.md](/safekeeper/RE
 * `CommitLSN`: position in WAL confirmed by quorum safekeepers.
 * `RestartLSN`: position in WAL confirmed by all safekeepers.
 * `FlushLSN`: part of WAL persisted to the disk by safekeeper.
-* `VCL`: the largerst LSN for which we can guarantee availablity of all prior records.
+* `VCL`: the largest LSN for which we can guarantee availability of all prior records.
 Neon pageserver LSNs:
 * `last_record_lsn` - the end of last processed WAL record.
@@ -6,7 +6,7 @@ Zenith supports multitenancy. One pageserver can serve multiple tenants at once.
 ### Tenants in other commands
-By default during `zenith init` new tenant is created on the pageserver. Newly created tenant's id is saved to cli config, so other commands can use it automatically if no direct arugment `--tenantid=<tenantid>` is provided. So generally tenantid more frequently appears in internal pageserver interface. Its commands take tenantid argument to distinguish to which tenant operation should be applied. CLI support creation of new tenants.
+By default during `zenith init` new tenant is created on the pageserver. Newly created tenant's id is saved to cli config, so other commands can use it automatically if no direct argument `--tenantid=<tenantid>` is provided. So generally tenantid more frequently appears in internal pageserver interface. Its commands take tenantid argument to distinguish to which tenant operation should be applied. CLI support creation of new tenants.
 Examples for cli:
@@ -77,7 +77,7 @@ Upon storage node restart recent WAL files are applied to appropriate pages and
 ### **Checkpointing**
-No such mechanism is needed. Or we may look at the storage node as at kind of continuous chekpointer.
+No such mechanism is needed. Or we may look at the storage node as at kind of continuous checkpointer.
 ### **Full page writes (torn page protection)**
@@ -111,13 +111,13 @@ Since we are storing page diffs of variable sizes there is no structural depende
 ### **Chunk metadata**
-Chunk metadata is a file lies in chunk directory that stores info about current snapshots and PITR regions. Chunck should always consult this data when merging SSTables and applying delete markers.
+Chunk metadata is a file lies in chunk directory that stores info about current snapshots and PITR regions. Chunk should always consult this data when merging SSTables and applying delete markers.
 ### **Chunk splitting**
 *(NB: following paragraph is about how to avoid page splitting)*
-When chunks hits some soft storage limit (let's say 100Gb) it should be split in half and global matadata about chunk boundaries should be updated. Here i assume that chunk split is a local operation happening on single node. Process of chink splitting should look like following:
+When chunks hits some soft storage limit (let's say 100Gb) it should be split in half and global metadata about chunk boundaries should be updated. Here i assume that chunk split is a local operation happening on single node. Process of chink splitting should look like following:
 1. Find separation key and spawn two new chunks with [lo, mid) [mid, hi) boundaries.
@@ -166,7 +166,7 @@ Multi-tenant storage makes sense even on a laptop, when you work with different
 Few databases are stored in one chunk, replicated three times
-- When database can't fit into one storage node it can occupy lots of chunks that were split while database was growing. Chunk placement on nodes is controlled by us with some automatization, but we alway may manually move chunks around the cluster.
+- When database can't fit into one storage node it can occupy lots of chunks that were split while database was growing. Chunk placement on nodes is controlled by us with some automatization, but we always may manually move chunks around the cluster.
 <img width="940" alt="Screenshot_2021-02-22_at_16 49 10" src="https://user-images.githubusercontent.com/284219/108729815-fb071e00-753b-11eb-86e0-be6703e47d82.png">
@@ -123,7 +123,7 @@ Show currently attached storages. For example:
 > zenith storage list
 NAME USED TYPE OPTIONS PATH
 local 5.1G zenith-local /opt/zenith/store/local
-local.compr 20.4G zenith-local comression=on /opt/zenith/store/local.compr
+local.compr 20.4G zenith-local compression=on /opt/zenith/store/local.compr
 zcloud 60G zenith-remote zenith.tech/stas/mystore
 s3tank 80G S3
 ```
@@ -136,9 +136,9 @@ s3tank 80G S3
 ## pg
-Manages postgres data directories and can start postgreses with proper configuration. An experienced user may avoid using that (except pg create) and configure/run postgres by themself.
+Manages postgres data directories and can start postgres instances with proper configuration. An experienced user may avoid using that (except pg create) and configure/run postgres by themselves.
-Pg is a term for a single postgres running on some data. I'm trying to avoid here separation of datadir management and postgres instance management -- both that concepts bundled here together.
+Pg is a term for a single postgres running on some data. I'm trying to avoid separation of datadir management and postgres instance management -- both that concepts bundled here together.
 **zenith pg create** [--no-start --snapshot --cow] -s storage-name -n pgdata
@@ -31,7 +31,7 @@ Ideally, just one binary that incorporates all elements we need.
 #### Components:
-- **zenith-CLI** - interface for end-users. Turns commands to REST requests and handles responces to show them in a user-friendly way.
+- **zenith-CLI** - interface for end-users. Turns commands to REST requests and handles responses to show them in a user-friendly way.
 CLI proposal is here https://github.com/libzenith/rfcs/blob/003-laptop-cli.md/003-laptop-cli.md
 WIP code is here: https://github.com/libzenith/postgres/tree/main/pageserver/src/bin/cli
@@ -25,9 +25,9 @@ To make changes in the catalog you need to run compute nodes
 zenith start /home/pipedpiper/northwind:main -- starts a compute instance
 zenith start zenith://zenith.tech/northwind:main -- starts a compute instance in the cloud
 -- you can start a compute node against any hash or branch
-zenith start /home/pipedpiper/northwind:experimental --port 8008 -- start anothe compute instance (on different port)
+zenith start /home/pipedpiper/northwind:experimental --port 8008 -- start another compute instance (on different port)
 -- you can start a compute node against any hash or branch
-zenith start /home/pipedpiper/northwind:<hash> --port 8009 -- start anothe compute instance (on different port)
+zenith start /home/pipedpiper/northwind:<hash> --port 8009 -- start another compute instance (on different port)
 -- After running some DML you can run
 -- zenith status and see how there are two WAL streams one on top of
@@ -121,7 +121,7 @@ repository, launch an instance on the same branch in both clones, and
 later try to push/pull between them? Perhaps create a new timeline
 every time you start up an instance? Then you would detect that the
 timelines have diverged. That would match with the "epoch" concept
-that we have in the WAL safekeepr
+that we have in the WAL safekeeper
 ### zenith checkout/commit
@@ -2,9 +2,9 @@ While working on export/import commands, I understood that they fit really well
 We may think about backups as snapshots in a different format (i.e plain pgdata format, basebackup tar format, WAL-G format (if they want to support it) and so on). They use same storage API, the only difference is the code that packs/unpacks files.
-Even if zenith aims to maintains durability using it's own snapshots, backups will be useful for uploading data from postges to zenith.
+Even if zenith aims to maintains durability using it's own snapshots, backups will be useful for uploading data from postgres to zenith.
-So here is an attemt to design consistent CLI for diferent usage scenarios:
+So here is an attempt to design consistent CLI for different usage scenarios:
 #### 1. Start empty pageserver.
 That is what we have now.
@@ -3,7 +3,7 @@
 GetPage@LSN can be called with older LSNs, and the page server needs
 to be able to reconstruct older page versions. That's needed for
 having read-only replicas that lag behind the primary, or that are
-"anchored" at an older LSN, and internally in the page server whne you
+"anchored" at an older LSN, and internally in the page server when you
 branch at an older point in time. How do you do that?
 For now, I'm not considering incremental snapshots at all. I don't
@@ -192,7 +192,7 @@ for a particular relation readily available alongside the snapshot
 files, and you don't need to track what snapshot LSNs exist
 separately.
-(If we wanted to minize the number of files, you could include the
+(If we wanted to minimize the number of files, you could include the
 snapshot @300 and the WAL between 200 and 300 in the same file, but I
 feel it's probably better to keep them separate)
@@ -121,7 +121,7 @@ The properties of s3 that we depend on are:
 list objects
 streaming read of entire object
 read byte range from object
-streaming write new object (may use multipart upload for better relialibity)
+streaming write new object (may use multipart upload for better reliability)
 delete object (that should not disrupt an already-started read).
 Uploaded files, restored backups, or s3 buckets controlled by users could contain malicious content. We should always validate that objects contain the content they're supposed to. Incorrect, corrupt or malicious-looking contents should cause software (cloud tools, pageserver) to fail gracefully.
@@ -40,7 +40,7 @@ b) overwrite older pages with the newer pages -- if there is no replica we proba
 I imagine that newly created pages would just be added to the back of PageStore (again in queue-like fashion) and this way there wouldn't be any meaningful ordering inside of that queue. When we are forming a new incremental snapshot we may prohibit any updates to the current set of pages in PageStore (giving up on single page version rule) and cut off that whole set when snapshot creation is complete.
-With option b) we can also treat PageStor as an uncompleted increamental snapshot.
+With option b) we can also treat PageStor as an uncompleted incremental snapshot.
 ### LocalStore
@@ -123,7 +123,7 @@ As far as I understand Bookfile/Aversion addresses versioning and serialization
 As for exact data that should go to snapshots I think it is the following for each snapshot:
 * format version number
-* set of key/values to interpret content (e.g. is page compression enabled, is that a full or incremental snapshot, previous snapshot id, is there WAL at the end on file, etc) -- it is up to a reader to decide what to do if some keys are missing or some unknow key are present. If we add something backward compatible to the file we can keep the version number.
+* set of key/values to interpret content (e.g. is page compression enabled, is that a full or incremental snapshot, previous snapshot id, is there WAL at the end on file, etc) -- it is up to a reader to decide what to do if some keys are missing or some unknown key are present. If we add something backward compatible to the file we can keep the version number.
 * array of [BuffTag, corresponding offset in file] for pages -- IIUC that is analogous to ToC in Bookfile
 * array of [(BuffTag, LSN), corresponding offset in file] for the WAL records
 * pages, one by one
@@ -131,7 +131,7 @@ As for exact data that should go to snapshots I think it is the following for ea
 It is also important to be able to load metadata quickly since it would be one of the main factors impacting the time of page server start. E.g. if we would store/cache about 10TB of data per page server, the size of uncompressed page references would be about 30GB (10TB / ( 8192 bytes page size / ( ~18 bytes per ObjectTag + 8 bytes offset in the file))).
-1) Since our ToC/array of entries can be sorted by ObjectTag we can store the whole BufferTag only when realtion_id is changed and store only delta-encoded offsets for a given relation. That would reduce the average per-page metadata size to something less than 4 bytes instead of 26 (assuming that pages would follow the same order and offset delatas would be small).
+1) Since our ToC/array of entries can be sorted by ObjectTag we can store the whole BufferTag only when relation_id is changed and store only delta-encoded offsets for a given relation. That would reduce the average per-page metadata size to something less than 4 bytes instead of 26 (assuming that pages would follow the same order and offset deltas would be small).
 2) It makes sense to keep ToC at the beginning of the file to avoid extra seeks to locate it. Doesn't matter too much with the local files but matters on S3 -- if we are accessing a lot of ~1Gb files with the size of metadata ~ 1Mb then the time to transfer this metadata would be comparable with access latency itself (which is about a half of a second). So by slurping metadata with one read of file header instead of N reads we can improve the speed of page server start by this N factor.
 I think both of that optimizations can be done later, but that is something to keep in mind when we are designing our storage serialization routines.
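The delta encoding of offsets from point (1) can be sketched as follows (a toy illustration of the idea, not the actual serialization code; a real encoder would additionally varint-pack the small deltas):

```rust
// Encode sorted page offsets as deltas from the previous offset.
// Within one relation, consecutive pages are typically adjacent,
// so most deltas are small and compress far better than absolute
// 8-byte offsets.
fn delta_encode(offsets: &[u64]) -> Vec<u64> {
    let mut prev = 0;
    offsets
        .iter()
        .map(|&off| {
            let delta = off - prev;
            prev = off;
            delta
        })
        .collect()
}

// Decoding is the running sum of the deltas.
fn delta_decode(deltas: &[u64]) -> Vec<u64> {
    let mut acc = 0;
    deltas
        .iter()
        .map(|&d| {
            acc += d;
            acc
        })
        .collect()
}

fn main() {
    let offsets = vec![8192, 16384, 24576, 40960];
    let deltas = delta_encode(&offsets);
    assert_eq!(deltas, vec![8192, 8192, 8192, 16384]);
    assert_eq!(delta_decode(&deltas), offsets);
    println!("delta round-trip ok");
}
```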


@@ -7,13 +7,13 @@ and e.g. prevents electing two proposers with the same term -- it is actually
 called `term` in the code. The second, called `epoch`, reflects progress of log
 receival and this might lag behind `term`; safekeeper switches to epoch `n` when
 it has received all committed log records from all `< n` terms. This roughly
-correspones to proposed in
+corresponds to proposed in
 https://github.com/zenithdb/rfcs/pull/3/files
 This makes our biggest our difference from Raft. In Raft, every log record is
-stamped with term in which it was generated; while we essentialy store in
+stamped with term in which it was generated; while we essentially store in
 `epoch` only the term of the highest record on this safekeeper -- when we know
 it -- because during recovery generally we don't, and `epoch` is bumped directly
 to the term of the proposer who performs the recovery when it is finished. It is


@@ -124,7 +124,7 @@ Each storage node can subscribe to the relevant sets of keys and maintain a loca
 ### Safekeeper address discovery
-During the startup safekeeper should publish the address he is listening on as the part of `{"sk_#{sk_id}" => ip_address}`. Then the pageserver can resolve `sk_#{sk_id}` to the actual address. This way it would work both locally and in the cloud setup. Safekeeper should have `--advertised-address` CLI option so that we can listen on e.g. 0.0.0.0 but advertize something more useful.
+During the startup safekeeper should publish the address he is listening on as the part of `{"sk_#{sk_id}" => ip_address}`. Then the pageserver can resolve `sk_#{sk_id}` to the actual address. This way it would work both locally and in the cloud setup. Safekeeper should have `--advertised-address` CLI option so that we can listen on e.g. 0.0.0.0 but advertise something more useful.
 ### Safekeeper behavior
@@ -195,7 +195,7 @@ sequenceDiagram
 PS1->>SK1: start replication
 ```
-#### Behavour of services during typical operations
+#### Behaviour of services during typical operations
 ```mermaid
 sequenceDiagram
@@ -250,7 +250,7 @@ sequenceDiagram
 PS2->>M: Register downloaded timeline
 PS2->>M: Get safekeepers for timeline, subscribe to changes
 PS2->>SK1: Start replication to catch up
-note over O: PS2 catched up, time to switch compute
+note over O: PS2 caught up, time to switch compute
 O->>C: Restart compute with new pageserver url in config
 note over C: Wal push is restarted
 loop request pages
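The address-discovery scheme described above boils down to a key/value lookup. A minimal sketch (the exact key layout and addresses are illustrative assumptions; the broker is modeled by a plain map instead of an etcd client):

```rust
use std::collections::HashMap;

// Build the broker key under which a safekeeper publishes its
// advertised address on startup. Layout is an assumption for
// illustration, mirroring the `sk_#{sk_id}` scheme above.
fn advertised_address_key(sk_id: u64) -> String {
    format!("sk_{sk_id}")
}

fn main() {
    // A safekeeper listening on 0.0.0.0 publishes the routable
    // address given via its `--advertised-address` CLI option.
    let mut broker: HashMap<String, String> = HashMap::new();
    broker.insert(advertised_address_key(1), "10.0.0.7:5454".to_string());

    // A pageserver later resolves the safekeeper id to a dialable address.
    let resolved = broker.get(&advertised_address_key(1)).map(String::as_str);
    assert_eq!(resolved, Some("10.0.0.7:5454"));
    println!("resolved sk_1 -> {}", resolved.unwrap());
}
```

The same scheme works locally and in the cloud because only the published value changes, never the lookup logic.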


@@ -49,7 +49,7 @@ topics.
 RFC lifecycle:
-- Should be submitted in a pull request with and full RFC text in a commited markdown file and copy of the Summary and Motivation sections also included in the PR body.
+- Should be submitted in a pull request with the full RFC text in a committed markdown file and a copy of the Summary and Motivation sections also included in the PR body.
 - RFC should be published for review before most of the actual code is written. This isn't a strict rule, don't hesitate to experiment and build a POC in parallel with writing an RFC.
 - Add labels to the PR in the same manner as you do Issues. Example TBD
 - Request the review from your peers. Reviewing the RFCs from your peers is a priority, same as reviewing the actual code.


@@ -22,8 +22,8 @@ so we don't want to give users access to the functionality that we don't think i
 * pageserver - calculate the size consumed by a timeline and add it to the feedback message.
 * safekeeper - pass feedback message from pageserver to compute.
-* compute - receive feedback message, enforce size limit based on GUC `zenith.max_cluster_size`.
-* console - set and update `zenith.max_cluster_size` setting
+* compute - receive feedback message, enforce size limit based on GUC `neon.max_cluster_size`.
+* console - set and update `neon.max_cluster_size` setting
 ## Proposed implementation
@@ -49,7 +49,7 @@ This message is received by the safekeeper and propagated to compute node as a p
 Finally, when compute node receives the `current_timeline_size` from safekeeper (or from pageserver directly), it updates the global variable.
-And then every zenith_extend() operation checks if limit is reached `(current_timeline_size > zenith.max_cluster_size)` and throws `ERRCODE_DISK_FULL` error if so.
+And then every zenith_extend() operation checks if limit is reached `(current_timeline_size > neon.max_cluster_size)` and throws `ERRCODE_DISK_FULL` error if so.
 (see Postgres error codes [https://www.postgresql.org/docs/devel/errcodes-appendix.html](https://www.postgresql.org/docs/devel/errcodes-appendix.html))
 TODO:
@@ -75,5 +75,5 @@ We should warn users if the limit is soon to be reached.
 ### **Security implications**
 We treat compute as an untrusted component. That's why we try to isolate it with secure container runtime or a VM.
-Malicious users may change the `zenith.max_cluster_size`, so we need an extra size limit check.
+Malicious users may change the `neon.max_cluster_size`, so we need an extra size limit check.
 To cover this case, we also monitor the compute node size in the console.
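The enforcement check described above can be sketched as follows (function and parameter names are hypothetical; the real check lives in the extension's relation-extend path in C, and `53100` is the documented Postgres SQLSTATE for `disk_full`):

```rust
// Postgres SQLSTATE for disk_full (ERRCODE_DISK_FULL).
const ERRCODE_DISK_FULL: &str = "53100";

// Sketch of the check run on every relation extension: compare the
// size received via the feedback message against the configured
// `neon.max_cluster_size` limit and refuse to grow past it.
fn check_cluster_size(current_timeline_size: u64, max_cluster_size: u64) -> Result<(), String> {
    if current_timeline_size > max_cluster_size {
        Err(format!(
            "could not extend relation: cluster size limit exceeded (SQLSTATE {ERRCODE_DISK_FULL})"
        ))
    } else {
        Ok(())
    }
}

fn main() {
    let limit = 10 * 1024 * 1024; // 10 MiB limit for illustration
    assert!(check_cluster_size(9 * 1024 * 1024, limit).is_ok());
    assert!(check_cluster_size(11 * 1024 * 1024, limit).is_err());
    println!("size limit check sketch ok");
}
```

Because compute is untrusted and may tamper with the GUC, this check is a convenience for the user; the console-side monitoring mentioned above remains the authoritative limit.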


@@ -23,7 +23,7 @@ gc_horizon = '67108864'
 max_file_descriptors = '100'
 # initial superuser role name to use when creating a new tenant
-initial_superuser_name = 'zenith_admin'
+initial_superuser_name = 'cloud_admin'
 broker_etcd_prefix = 'neon'
 broker_endpoints = ['some://etcd']
@@ -31,14 +31,14 @@ broker_endpoints = ['some://etcd']
 # [remote_storage]
 ```
 The config above shows default values for all basic pageserver settings, besides `broker_endpoints`: that one has to be set by the user,
 see the corresponding section below.
 Pageserver uses default values for all settings that are missing in the config, so it's not a hard error to leave the config blank.
 Yet, it validates the config values it can (e.g. postgres install dir) and errors if the validation fails, refusing to start.
 Note the `[remote_storage]` section: it's a [table](https://toml.io/en/v1.0.0#table) in the TOML specification and
-- either has to be placed in the config after the table-less values such as `initial_superuser_name = 'zenith_admin'`
+- either has to be placed in the config after the table-less values such as `initial_superuser_name = 'cloud_admin'`
 - or can be placed anywhere if rewritten in identical form as an [inline table](https://toml.io/en/v1.0.0#inline-table): `remote_storage = {foo = 2}`
@@ -54,7 +54,7 @@ Note that TOML distinguishes between strings and integers, the former require si
 A list of endpoints (etcd currently) to connect and pull the information from.
 Mandatory, does not have a default, since it requires etcd to be started as a separate process,
 and its connection url should be specified separately.
 #### broker_etcd_prefix
@@ -105,17 +105,31 @@ Interval at which garbage collection is triggered. Default is 100 s.
 #### image_creation_threshold
-L0 delta layer threshold for L1 iamge layer creation. Default is 3.
+L0 delta layer threshold for L1 image layer creation. Default is 3.
 #### pitr_interval
 WAL retention duration for PITR branching. Default is 30 days.
+#### walreceiver_connect_timeout
+Time to wait to establish the wal receiver connection before failing.
+#### lagging_wal_timeout
+Time during which the pageserver did not get any WAL updates from a safekeeper (if any).
+Avoids a lagging pageserver by preemptively switching it away from stalled connections.
+#### max_lsn_wal_lag
+Difference between Lsn values of the latest available WAL on safekeepers: if the currently connected safekeeper starts to lag too long and too much,
+it gets swapped for a different one.
 #### initial_superuser_name
 Name of the initial superuser role, passed to initdb when a new tenant
 is initialized. It doesn't affect anything after initialization. The
-default is 'zenith_admin'. Note: the console
+default is 'cloud_admin'. Note: the console
 depends on that, so if you change it, bad things will happen.
 #### page_cache_size
@@ -185,7 +199,7 @@ If no IAM bucket access is used during the remote storage usage, use the `AWS_AC
 ###### General remote storage configuration
-Pagesever allows only one remote storage configured concurrently and errors if parameters from multiple different remote configurations are used.
+Pageserver allows only one remote storage to be configured at a time and errors if parameters from multiple different remote configurations are used.
 No default values are used for the remote storage configuration parameters.
 Besides, there are parameters common for all types of remote storage that can be configured, those have defaults:
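The two accepted placements of the `[remote_storage]` table described above can be illustrated like this (keys and values are placeholders, not the full set of remote storage parameters):

```toml
# Placement 1: table syntax -- must come after all table-less keys,
# otherwise the keys below it would be parsed as part of the table.
initial_superuser_name = 'cloud_admin'

[remote_storage]
local_path = '/some/local/path'

# Placement 2 (equivalent alternative): an inline table may appear
# anywhere among the table-less keys:
# remote_storage = { local_path = '/some/local/path' }
```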


@@ -10,7 +10,7 @@ Intended to be used in integration tests and in CLI tools for local installation
 `/docs`:
-Documentaion of the Zenith features and concepts.
+Documentation of the Zenith features and concepts.
 Now it is mostly dev documentation.
 `/monitoring`:
@@ -42,13 +42,13 @@ Integration tests, written in Python using the `pytest` framework.
 `/vendor/postgres`:
-PostgreSQL source tree, with the modifications needed for Zenith.
-`/vendor/postgres/contrib/zenith`:
+PostgreSQL source tree, with the modifications needed for Neon.
+`/vendor/postgres/contrib/neon`:
 PostgreSQL extension that implements storage manager API and network communications with remote page server.
-`/vendor/postgres/contrib/zenith_test_utils`:
+`/vendor/postgres/contrib/neon_test_utils`:
 PostgreSQL extension that contains functions needed for testing and debugging.
@@ -92,7 +92,7 @@ A single virtual environment with all dependencies is described in the single `P
 ### Prerequisites
 - Install Python 3.9 (the minimal supported version) or greater.
-- Our setup with poetry should work with newer python versions too. So feel free to open an issue with a `c/test-runner` label if something doesnt work as expected.
+- Our setup with poetry should work with newer python versions too. So feel free to open an issue with a `c/test-runner` label if something doesn't work as expected.
 - If you have some trouble with other version you can resolve it by installing Python 3.9 separately, via [pyenv](https://github.com/pyenv/pyenv) or via system package manager e.g.:
 ```bash
 # In Ubuntu


@@ -9,6 +9,7 @@
 serde = { version = "1.0", features = ["derive"] }
 serde_json = "1"
 serde_with = "1.12.0"
+once_cell = "1.8.0"
 utils = { path = "../utils" }
 workspace_hack = { version = "0.1", path = "../../workspace_hack" }


@@ -6,6 +6,7 @@ use std::{
 str::FromStr,
 };
+use once_cell::sync::Lazy;
 use regex::{Captures, Regex};
 use serde::{Deserialize, Serialize};
 use serde_with::{serde_as, DisplayFromStr};
@@ -31,7 +32,7 @@ struct SafekeeperTimeline {
 /// Published data about safekeeper's timeline. Fields made optional for easy migrations.
 #[serde_as]
-#[derive(Debug, Deserialize, Serialize)]
+#[derive(Debug, Clone, Deserialize, Serialize)]
 pub struct SkTimelineInfo {
 /// Term of the last entry.
 pub last_log_term: Option<u64>,
@@ -55,14 +56,16 @@ pub struct SkTimelineInfo {
 #[serde(default)]
 pub peer_horizon_lsn: Option<Lsn>,
 #[serde(default)]
-pub safekeeper_connection_string: Option<String>,
+pub safekeeper_connstr: Option<String>,
 }
 #[derive(Debug, thiserror::Error)]
 pub enum BrokerError {
 #[error("Etcd client error: {0}. Context: {1}")]
 EtcdClient(etcd_client::Error, String),
-#[error("Error during parsing etcd data: {0}")]
+#[error("Error during parsing etcd key: {0}")]
+InvalidKey(String),
+#[error("Error during parsing etcd value: {0}")]
 ParsingError(String),
 #[error("Internal error: {0}")]
 InternalError(String),
@@ -134,29 +137,6 @@ impl SkTimelineSubscriptionKind {
 }
 }
-fn watch_regex(&self) -> Regex {
-match self.kind {
-SubscriptionKind::All => Regex::new(&format!(
-r"^{}/([[:xdigit:]]+)/([[:xdigit:]]+)/safekeeper/([[:digit:]])$",
-self.broker_etcd_prefix
-))
-.expect("wrong regex for 'everything' subscription"),
-SubscriptionKind::Tenant(tenant_id) => Regex::new(&format!(
-r"^{}/{tenant_id}/([[:xdigit:]]+)/safekeeper/([[:digit:]])$",
-self.broker_etcd_prefix
-))
-.expect("wrong regex for 'tenant' subscription"),
-SubscriptionKind::Timeline(ZTenantTimelineId {
-tenant_id,
-timeline_id,
-}) => Regex::new(&format!(
-r"^{}/{tenant_id}/{timeline_id}/safekeeper/([[:digit:]])$",
-self.broker_etcd_prefix
-))
-.expect("wrong regex for 'timeline' subscription"),
-}
-}
 /// Etcd key to use for watching a certain timeline updates from safekeepers.
 pub fn watch_key(&self) -> String {
 match self.kind {
@@ -194,6 +174,7 @@ pub async fn subscribe_to_safekeeper_timeline_updates(
 subscription: SkTimelineSubscriptionKind,
 ) -> Result<SkTimelineSubscription, BrokerError> {
 info!("Subscribing to timeline updates, subscription kind: {subscription:?}");
+let kind = subscription.clone();
 let (watcher, mut stream) = client
 .watch(
@@ -209,12 +190,9 @@ pub async fn subscribe_to_safekeeper_timeline_updates(
 })?;
 let (timeline_updates_sender, safekeeper_timeline_updates) = mpsc::unbounded_channel();
-let subscription_kind = subscription.kind;
-let regex = subscription.watch_regex();
 let watcher_handle = tokio::spawn(async move {
 while let Some(resp) = stream.message().await.map_err(|e| BrokerError::InternalError(format!(
-"Failed to get messages from the subscription stream, kind: {subscription_kind:?}, error: {e}"
+"Failed to get messages from the subscription stream, kind: {:?}, error: {e}", subscription.kind
 )))? {
 if resp.canceled() {
 info!("Watch for timeline updates subscription was canceled, exiting");
@@ -235,9 +213,16 @@ pub async fn subscribe_to_safekeeper_timeline_updates(
 if EventType::Put == event.event_type() {
 if let Some(new_etcd_kv) = event.kv() {
 let new_kv_version = new_etcd_kv.version();
+let (key_str, value_str) = match extract_key_value_str(new_etcd_kv) {
+Ok(strs) => strs,
+Err(e) => {
+error!("Failed to represent etcd KV {new_etcd_kv:?} as pair of str: {e}");
+continue;
+},
+};
-match parse_etcd_key_value(subscription_kind, &regex, new_etcd_kv) {
-Ok(Some((zttid, timeline))) => {
+match parse_safekeeper_timeline(&subscription, key_str, value_str) {
+Ok((zttid, timeline)) => {
 match timeline_updates
 .entry(zttid)
 .or_default()
@@ -248,6 +233,8 @@ pub async fn subscribe_to_safekeeper_timeline_updates(
 if old_etcd_kv_version < new_kv_version {
 o.insert(timeline.info);
 timeline_etcd_versions.insert(zttid, new_kv_version);
+} else {
+debug!("Skipping etcd timeline update due to older version compared to one that's already stored");
 }
 }
 hash_map::Entry::Vacant(v) => {
@@ -256,7 +243,8 @@ pub async fn subscribe_to_safekeeper_timeline_updates(
 }
 }
 }
-Ok(None) => {}
+// it is normal to get other keys when we subscribe to everything
+Err(BrokerError::InvalidKey(e)) => debug!("Unexpected key for timeline update: {e}"),
 Err(e) => error!("Failed to parse timeline update: {e}"),
 };
 }
@@ -270,64 +258,72 @@ pub async fn subscribe_to_safekeeper_timeline_updates(
 }
 Ok(())
-});
+}.instrument(info_span!("etcd_broker")));
 Ok(SkTimelineSubscription {
-kind: subscription,
+kind,
 safekeeper_timeline_updates,
 watcher_handle,
 watcher,
 })
 }
-fn parse_etcd_key_value(
-subscription_kind: SubscriptionKind,
-regex: &Regex,
-kv: &KeyValue,
-) -> Result<Option<(ZTenantTimelineId, SafekeeperTimeline)>, BrokerError> {
-let caps = if let Some(caps) = regex.captures(kv.key_str().map_err(|e| {
-BrokerError::EtcdClient(e, format!("Failed to represent kv {kv:?} as key str"))
-})?) {
-caps
-} else {
-return Ok(None);
-};
-let (zttid, safekeeper_id) = match subscription_kind {
-SubscriptionKind::All => (
-ZTenantTimelineId::new(
-parse_capture(&caps, 1).map_err(BrokerError::ParsingError)?,
-parse_capture(&caps, 2).map_err(BrokerError::ParsingError)?,
-),
-NodeId(parse_capture(&caps, 3).map_err(BrokerError::ParsingError)?),
-),
-SubscriptionKind::Tenant(tenant_id) => (
-ZTenantTimelineId::new(
-tenant_id,
-parse_capture(&caps, 1).map_err(BrokerError::ParsingError)?,
-),
-NodeId(parse_capture(&caps, 2).map_err(BrokerError::ParsingError)?),
-),
-SubscriptionKind::Timeline(zttid) => (
-zttid,
-NodeId(parse_capture(&caps, 1).map_err(BrokerError::ParsingError)?),
-),
-};
-let info_str = kv.value_str().map_err(|e| {
-BrokerError::EtcdClient(e, format!("Failed to represent kv {kv:?} as value str"))
-})?;
-Ok(Some((
+fn extract_key_value_str(kv: &KeyValue) -> Result<(&str, &str), BrokerError> {
+let key = kv.key_str().map_err(|e| {
+BrokerError::EtcdClient(e, "Failed to extract key str out of etcd KV".to_string())
+})?;
+let value = kv.value_str().map_err(|e| {
+BrokerError::EtcdClient(e, "Failed to extract value str out of etcd KV".to_string())
+})?;
+Ok((key, value))
+}
+static SK_TIMELINE_KEY_REGEX: Lazy<Regex> = Lazy::new(|| {
+Regex::new("/([[:xdigit:]]+)/([[:xdigit:]]+)/safekeeper/([[:digit:]]+)$")
+.expect("wrong regex for safekeeper timeline etcd key")
+});
+fn parse_safekeeper_timeline(
+subscription: &SkTimelineSubscriptionKind,
+key_str: &str,
+value_str: &str,
+) -> Result<(ZTenantTimelineId, SafekeeperTimeline), BrokerError> {
+let broker_prefix = subscription.broker_etcd_prefix.as_str();
+if !key_str.starts_with(broker_prefix) {
+return Err(BrokerError::InvalidKey(format!(
+"KV has unexpected key '{key_str}' that does not start with broker prefix {broker_prefix}"
+)));
+}
+let key_part = &key_str[broker_prefix.len()..];
+let key_captures = match SK_TIMELINE_KEY_REGEX.captures(key_part) {
+Some(captures) => captures,
+None => {
+return Err(BrokerError::InvalidKey(format!(
+"KV has unexpected key part '{key_part}' that does not match required regex {}",
+SK_TIMELINE_KEY_REGEX.as_str()
+)));
+}
+};
+let info = serde_json::from_str(value_str).map_err(|e| {
+BrokerError::ParsingError(format!(
+"Failed to parse '{value_str}' as safekeeper timeline info: {e}"
+))
+})?;
+let zttid = ZTenantTimelineId::new(
+parse_capture(&key_captures, 1).map_err(BrokerError::ParsingError)?,
+parse_capture(&key_captures, 2).map_err(BrokerError::ParsingError)?,
+);
+let safekeeper_id = NodeId(parse_capture(&key_captures, 3).map_err(BrokerError::ParsingError)?);
+Ok((
 zttid,
 SafekeeperTimeline {
 safekeeper_id,
-info: serde_json::from_str(info_str).map_err(|e| {
-BrokerError::ParsingError(format!(
-"Failed to parse '{info_str}' as safekeeper timeline info: {e}"
-))
-})?,
+info,
 },
-)))
+))
 }
 fn parse_capture<T>(caps: &Captures, index: usize) -> Result<T, String>
@@ -346,3 +342,53 @@ where
 )
 })
 }
+#[cfg(test)]
+mod tests {
+use utils::zid::ZTimelineId;
+use super::*;
+#[test]
+fn typical_etcd_prefix_should_be_parsed() {
+let prefix = "neon";
+let tenant_id = ZTenantId::generate();
+let timeline_id = ZTimelineId::generate();
+let all_subscription = SkTimelineSubscriptionKind {
+broker_etcd_prefix: prefix.to_string(),
+kind: SubscriptionKind::All,
+};
+let tenant_subscription = SkTimelineSubscriptionKind {
+broker_etcd_prefix: prefix.to_string(),
+kind: SubscriptionKind::Tenant(tenant_id),
+};
+let timeline_subscription = SkTimelineSubscriptionKind {
+broker_etcd_prefix: prefix.to_string(),
+kind: SubscriptionKind::Timeline(ZTenantTimelineId::new(tenant_id, timeline_id)),
+};
+let typical_etcd_kv_strs = [
+(
+format!("{prefix}/{tenant_id}/{timeline_id}/safekeeper/1"),
+r#"{"last_log_term":231,"flush_lsn":"0/241BB70","commit_lsn":"0/241BB70","backup_lsn":"0/2000000","remote_consistent_lsn":"0/0","peer_horizon_lsn":"0/16960E8","safekeeper_connstr":"something.local:1234","pageserver_connstr":"postgresql://(null):@somethine.else.local:3456"}"#,
+),
+(
+format!("{prefix}/{tenant_id}/{timeline_id}/safekeeper/13"),
+r#"{"last_log_term":231,"flush_lsn":"0/241BB70","commit_lsn":"0/241BB70","backup_lsn":"0/2000000","remote_consistent_lsn":"0/0","peer_horizon_lsn":"0/16960E8","safekeeper_connstr":"something.local:1234","pageserver_connstr":"postgresql://(null):@somethine.else.local:3456"}"#,
+),
+];
+for (key_string, value_str) in typical_etcd_kv_strs {
+for subscription in [
+&all_subscription,
+&tenant_subscription,
+&timeline_subscription,
+] {
+let (id, _timeline) =
+parse_safekeeper_timeline(subscription, &key_string, value_str)
+.unwrap_or_else(|e| panic!("Should be able to parse etcd key string '{key_string}' and etcd value string '{value_str}' for subscription {subscription:?}, but got: {e}"));
+assert_eq!(id, ZTenantTimelineId::new(tenant_id, timeline_id));
+}
+}
+}
+}


@@ -3,6 +3,7 @@
 //! Otherwise, we might not see all metrics registered via
 //! a default registry.
 use lazy_static::lazy_static;
+pub use prometheus::{core, default_registry, proto};
 pub use prometheus::{exponential_buckets, linear_buckets};
 pub use prometheus::{register_gauge, Gauge};
 pub use prometheus::{register_gauge_vec, GaugeVec};


@@ -73,7 +73,7 @@ impl WalStreamDecoder {
 /// Returns one of the following:
 /// Ok((Lsn, Bytes)): a tuple containing the LSN of next record, and the record itself
 /// Ok(None): there is not enough data in the input buffer. Feed more by calling the `feed_bytes` function
-/// Err(WalDecodeError): an error occured while decoding, meaning the input was invalid.
+/// Err(WalDecodeError): an error occurred while decoding, meaning the input was invalid.
 ///
 pub fn poll_decode(&mut self) -> Result<Option<(Lsn, Bytes)>, WalDecodeError> {
 let recordbuf;

View File

@@ -531,7 +531,7 @@ impl CheckPoint {
 ///
 /// Returns 'true' if the XID was updated.
 pub fn update_next_xid(&mut self, xid: u32) -> bool {
-    // nextXid should nw greate than any XID in WAL, so increment provided XID and check for wraparround.
+    // nextXid should nw greater than any XID in WAL, so increment provided XID and check for wraparround.
     let mut new_xid = std::cmp::max(xid + 1, pg_constants::FIRST_NORMAL_TRANSACTION_ID);
     // To reduce number of metadata checkpoints, we forward align XID on XID_CHECKPOINT_INTERVAL.
     // XID_CHECKPOINT_INTERVAL should not be larger than BLCKSZ*CLOG_XACTS_PER_BYTE
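The hunk above forward-aligns `nextXid` on `XID_CHECKPOINT_INTERVAL` so the metadata checkpoint only needs updating once per interval. A rough std-only sketch of that rounding — the constants and the exact wraparound handling here are illustrative placeholders, not the real `postgres_ffi` logic:

```rust
// Illustrative placeholders, not the values postgres_ffi actually uses.
const FIRST_NORMAL_TRANSACTION_ID: u32 = 3;
const XID_CHECKPOINT_INTERVAL: u32 = 1024;

/// Advance past `xid` (wrapping add as a stand-in for XID wraparound),
/// clamp to the first normal XID, then round up to the next multiple of
/// XID_CHECKPOINT_INTERVAL so stored nextXid changes only once per interval.
fn aligned_next_xid(xid: u32) -> u32 {
    let new_xid = std::cmp::max(xid.wrapping_add(1), FIRST_NORMAL_TRANSACTION_ID);
    ((new_xid + XID_CHECKPOINT_INTERVAL - 1) / XID_CHECKPOINT_INTERVAL) * XID_CHECKPOINT_INTERVAL
}
```

With this scheme, most WAL records leave the stored value untouched because the already-aligned value is still ahead of the incoming XID.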

View File

@@ -80,7 +80,7 @@ impl Conf {
     .arg(self.datadir.as_os_str())
     .args(&["-c", "wal_keep_size=50MB"]) // Ensure old WAL is not removed
     .args(&["-c", "logging_collector=on"]) // stderr will mess up with tests output
-    .args(&["-c", "shared_preload_libraries=zenith"]) // can only be loaded at startup
+    .args(&["-c", "shared_preload_libraries=neon"]) // can only be loaded at startup
     // Disable background processes as much as possible
     .args(&["-c", "wal_writer_delay=10s"])
     .args(&["-c", "autovacuum=off"])
@@ -178,7 +178,7 @@ fn generate_internal<C: postgres::GenericClient>(
     client: &mut C,
     f: impl Fn(&mut C, PgLsn) -> Result<Option<PgLsn>>,
 ) -> Result<PgLsn> {
-    client.execute("create extension if not exists zenith_test_utils", &[])?;
+    client.execute("create extension if not exists neon_test_utils", &[])?;
     let wal_segment_size = client.query_one(
         "select cast(setting as bigint) as setting, unit \

View File

@@ -5,7 +5,7 @@ DATA_DIR=$3
 PORT=$4
 SYSID=`od -A n -j 24 -N 8 -t d8 $WAL_PATH/000000010000000000000002* | cut -c 3-`
 rm -fr $DATA_DIR
-env -i LD_LIBRARY_PATH=$PG_BIN/../lib $PG_BIN/initdb -E utf8 -U zenith_admin -D $DATA_DIR --sysid=$SYSID
+env -i LD_LIBRARY_PATH=$PG_BIN/../lib $PG_BIN/initdb -E utf8 -U cloud_admin -D $DATA_DIR --sysid=$SYSID
 echo port=$PORT >> $DATA_DIR/postgresql.conf
 REDO_POS=0x`$PG_BIN/pg_controldata -D $DATA_DIR | fgrep "REDO location"| cut -c 42-`
 declare -i WAL_SIZE=$REDO_POS+114

View File

@@ -5,7 +5,7 @@ PORT=$4
 SYSID=`od -A n -j 24 -N 8 -t d8 $WAL_PATH/000000010000000000000002* | cut -c 3-`
 rm -fr $DATA_DIR /tmp/pg_wals
 mkdir /tmp/pg_wals
-env -i LD_LIBRARY_PATH=$PG_BIN/../lib $PG_BIN/initdb -E utf8 -U zenith_admin -D $DATA_DIR --sysid=$SYSID
+env -i LD_LIBRARY_PATH=$PG_BIN/../lib $PG_BIN/initdb -E utf8 -U cloud_admin -D $DATA_DIR --sysid=$SYSID
 echo port=$PORT >> $DATA_DIR/postgresql.conf
 REDO_POS=0x`$PG_BIN/pg_controldata -D $DATA_DIR | fgrep "REDO location"| cut -c 42-`
 declare -i WAL_SIZE=$REDO_POS+114

View File

@@ -71,7 +71,7 @@ impl From<bincode::Error> for SerializeError {
 /// - Fixed integer encoding (i.e. 1u32 is 00000001 not 01)
 ///
 /// Does not allow trailing bytes in deserialization. If this is desired, you
-/// may set [`Options::allow_trailing_bytes`] to explicitly accomodate this.
+/// may set [`Options::allow_trailing_bytes`] to explicitly accommodate this.
 pub fn be_coder() -> impl Options {
     bincode::DefaultOptions::new()
         .with_big_endian()
@@ -85,7 +85,7 @@ pub fn be_coder() -> impl Options {
 /// - Fixed integer encoding (i.e. 1u32 is 00000001 not 01)
 ///
 /// Does not allow trailing bytes in deserialization. If this is desired, you
-/// may set [`Options::allow_trailing_bytes`] to explicitly accomodate this.
+/// may set [`Options::allow_trailing_bytes`] to explicitly accommodate this.
 pub fn le_coder() -> impl Options {
     bincode::DefaultOptions::new()
         .with_little_endian()
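The doc comments above promise fixed big-endian integer encoding (`1u32` serializes as `00 00 00 01`, never the single byte `01` that a varint scheme would emit). The property can be sketched with the standard library alone, without the bincode crate:

```rust
// Fixed big-endian encoding as described by the be_coder doc comment:
// a u32 always occupies exactly 4 bytes, most significant byte first.
fn encode_u32_be(v: u32) -> [u8; 4] {
    v.to_be_bytes()
}

fn decode_u32_be(buf: [u8; 4]) -> u32 {
    u32::from_be_bytes(buf)
}
```

Fixed-width fields like this are what make the on-disk format predictable enough to compute offsets without decoding, which is why the coders disable bincode's default varint encoding.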

View File

@@ -64,7 +64,7 @@ pub mod signals;
 /// One thing to note is that .git is not available in docker (and it is bad to include it there).
 /// So everything becides docker build is covered by git_version crate, and docker uses a `GIT_VERSION` argument to get the value required.
 /// It takes variable from build process env and puts it to the rustc env. And then we can retrieve it here by using env! macro.
-/// Git version received from environment variable used as a fallback in git_version invokation.
+/// Git version received from environment variable used as a fallback in git_version invocation.
 /// And to avoid running buildscript every recompilation, we use rerun-if-env-changed option.
 /// So the build script will be run only when GIT_VERSION envvar has changed.
 ///
/// ///

View File

@@ -336,11 +336,11 @@ impl PostgresBackend {
         let have_tls = self.tls_config.is_some();
         match msg {
             FeMessage::StartupPacket(m) => {
-                trace!("got startup message {:?}", m);
+                trace!("got startup message {m:?}");
                 match m {
                     FeStartupPacket::SslRequest => {
-                        info!("SSL requested");
+                        debug!("SSL requested");
                         self.write_message(&BeMessage::EncryptionResponse(have_tls))?;
                         if have_tls {
@@ -349,7 +349,7 @@
                     }
                 }
                 FeStartupPacket::GssEncRequest => {
-                    info!("GSS requested");
+                    debug!("GSS requested");
                     self.write_message(&BeMessage::EncryptionResponse(false))?;
                 }
                 FeStartupPacket::StartupMessage { .. } => {
@@ -433,12 +433,7 @@
                 // full cause of the error, not just the top-level context + its trace.
                 // We don't want to send that in the ErrorResponse though,
                 // because it's not relevant to the compute node logs.
-                if query_string.starts_with("callmemaybe") {
-                    // FIXME avoid printing a backtrace for tenant x not found errors until this is properly fixed
-                    error!("query handler for '{}' failed: {}", query_string, e);
-                } else {
-                    error!("query handler for '{}' failed: {:?}", query_string, e);
-                }
+                error!("query handler for '{}' failed: {:?}", query_string, e);
                 self.write_message_noflush(&BeMessage::ErrorResponse(&e.to_string()))?;
                 // TODO: untangle convoluted control flow
                 if e.to_string().contains("failed to run") {
@@ -475,7 +470,7 @@
                     self.write_message(&BeMessage::ErrorResponse(&e.to_string()))?;
                 }
                 // NOTE there is no ReadyForQuery message. This handler is used
-                // for basebackup and it uses CopyOut which doesnt require
+                // for basebackup and it uses CopyOut which doesn't require
                 // ReadyForQuery message and backend just switches back to
                 // processing mode after sending CopyDone or ErrorResponse.
             }

View File

@@ -269,7 +269,14 @@ impl FeStartupPacket {
             .next()
             .context("expected even number of params in StartupMessage")?;
         if name == "options" {
-            // deprecated way of passing params as cmd line args
+            // parsing options arguments "...&options=<var0>%3D<val0>+<var1>=<var1>..."
+            // '%3D' is '=' and '+' is ' '
+            // Note: we allow users that don't have SNI capabilities,
+            // to pass a special keyword argument 'project'
+            // to be used to determine the cluster name by the proxy.
+            //TODO: write unit test for this and refactor in its own function.
             for cmdopt in value.split(' ') {
                 let nameval: Vec<&str> = cmdopt.split('=').collect();
                 if nameval.len() == 2 {
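The comment added above documents the `options` string format: space-separated `name=value` pairs with `%3D` standing in for an escaped `=`. A standalone sketch of that parsing, using a hypothetical `parse_options` helper — the real code inlines this loop in `FeStartupPacket` handling, and the `+`-to-space substitution is assumed to have happened earlier:

```rust
// Hypothetical helper illustrating the "options" parsing described above:
// space-separated "name=value" pairs, with '%3D' decoded back to '='.
fn parse_options(value: &str) -> Vec<(String, String)> {
    value
        .split(' ')
        .filter_map(|cmdopt| {
            // '%3D' is an escaped '='; decode before splitting on the first '='
            let decoded = cmdopt.replace("%3D", "=");
            let mut parts = decoded.splitn(2, '=');
            match (parts.next(), parts.next()) {
                (Some(name), Some(val)) => Some((name.to_string(), val.to_string())),
                _ => None, // tokens without '=' are ignored
            }
        })
        .collect()
}
```

A client without SNI support could then smuggle `project=<cluster>` through this channel, which is what the proxy inspects it for.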
@@ -464,7 +471,7 @@ impl BeParameterStatusMessage<'static> {
     }
 }
 
-// One row desciption in RowDescription packet.
+// One row description in RowDescription packet.
 #[derive(Debug)]
 pub struct RowDescriptor<'a> {
     pub name: &'a [u8],
@@ -613,7 +620,7 @@ fn cstr_to_str(b: &Bytes) -> Result<&str> {
 impl<'a> BeMessage<'a> {
     /// Write message to the given buf.
     // Unlike the reading side, we use BytesMut
-    // here as msg len preceeds its body and it is handy to write it down first
+    // here as msg len precedes its body and it is handy to write it down first
     // and then fill the length. With Write we would have to either calc it
     // manually or have one more buffer.
     pub fn write(buf: &mut BytesMut, message: &BeMessage) -> io::Result<()> {
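The comment above explains the pattern: write a length placeholder first, then backfill it once the body is known. A std-only sketch of that pattern with `Vec<u8>` instead of `BytesMut`:

```rust
// Write-placeholder-then-backfill, as described above. Postgres message
// lengths count the 4-byte length field itself but not the 1-byte tag.
fn write_length_prefixed(buf: &mut Vec<u8>, body: &[u8]) {
    let len_pos = buf.len();
    buf.extend_from_slice(&[0u8; 4]); // reserve space for the u32 length
    buf.extend_from_slice(body);
    let len = (body.len() + 4) as u32;
    // backfill the reserved bytes now that the body length is known
    buf[len_pos..len_pos + 4].copy_from_slice(&len.to_be_bytes());
}
```

With `Write` this would need either a second buffer or a pre-pass to compute the length, which is exactly the trade-off the comment calls out.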
@@ -1047,7 +1054,7 @@ mod tests {
 #[test]
 fn test_zenithfeedback_serialization() {
     let mut zf = ZenithFeedback::empty();
-    // Fill zf wih some values
+    // Fill zf with some values
     zf.current_timeline_size = 12345678;
     // Set rounded time to be able to compare it with deserialized value,
     // because it is rounded up to microseconds during serialization.
@@ -1062,7 +1069,7 @@
 #[test]
 fn test_zenithfeedback_unknown_key() {
     let mut zf = ZenithFeedback::empty();
-    // Fill zf wih some values
+    // Fill zf with some values
     zf.current_timeline_size = 12345678;
     // Set rounded time to be able to compare it with deserialized value,
     // because it is rounded up to microseconds during serialization.

View File

@@ -193,7 +193,7 @@ pub struct ZTenantId(ZId);
 zid_newtype!(ZTenantId);
 
 // A pair uniquely identifying Zenith instance.
-#[derive(Debug, Clone, Copy, PartialOrd, Ord, PartialEq, Eq, Hash)]
+#[derive(Debug, Clone, Copy, PartialOrd, Ord, PartialEq, Eq, Hash, Serialize, Deserialize)]
 pub struct ZTenantTimelineId {
     pub tenant_id: ZTenantId,
     pub timeline_id: ZTimelineId,

View File

@@ -5,7 +5,7 @@ edition = "2021"
 [features]
 # It is simpler infra-wise to have failpoints enabled by default
-# It shouldn't affect perf in any way because failpoints
+# It shouldn't affect performance in any way because failpoints
 # are not placed in hot code paths
 default = ["failpoints"]
 profiling = ["pprof"]
@@ -54,15 +54,13 @@ crossbeam-utils = "0.8.5"
 fail = "0.5.0"
 git-version = "0.3.5"
-# 'experimental' is needed for the `zstd::bulk::Decompressor::upper_bound` function.
-zstd = { version = "0.11.1", features = ["experimental"] }
 postgres_ffi = { path = "../libs/postgres_ffi" }
 etcd_broker = { path = "../libs/etcd_broker" }
 metrics = { path = "../libs/metrics" }
 utils = { path = "../libs/utils" }
 remote_storage = { path = "../libs/remote_storage" }
 workspace_hack = { version = "0.1", path = "../workspace_hack" }
-close_fds = "0.3.2"
 
 [dev-dependencies]
 hex-literal = "0.3"

View File

@@ -22,12 +22,6 @@ use utils::{
 use crate::layered_repository::TIMELINES_SEGMENT_NAME;
 use crate::tenant_config::{TenantConf, TenantConfOpt};
 
-pub const ZSTD_MAX_SAMPLES: usize = 1024;
-pub const ZSTD_MIN_SAMPLES: usize = 8; // magic requirement of zstd
-pub const ZSTD_MAX_SAMPLE_BYTES: usize = 10 * 1024 * 1024; // max memory size for holding samples
-pub const ZSTD_MAX_DICTIONARY_SIZE: usize = 8 * 1024 - 4; // make dictionary + BLOB length fit in first page
-pub const ZSTD_COMPRESSION_LEVEL: i32 = 0; // default compression level
-
 pub mod defaults {
     use crate::tenant_config::defaults::*;
     use const_format::formatcp;
@@ -40,7 +34,7 @@ pub mod defaults {
     pub const DEFAULT_WAIT_LSN_TIMEOUT: &str = "60 s";
     pub const DEFAULT_WAL_REDO_TIMEOUT: &str = "60 s";
 
-    pub const DEFAULT_SUPERUSER: &str = "zenith_admin";
+    pub const DEFAULT_SUPERUSER: &str = "cloud_admin";
 
     pub const DEFAULT_PAGE_CACHE_SIZE: usize = 8192;
     pub const DEFAULT_MAX_FILE_DESCRIPTORS: usize = 100;
@@ -120,7 +114,7 @@ pub struct PageServerConf {
     pub default_tenant_conf: TenantConf,
 
     /// A prefix to add in etcd brokers before every key.
-    /// Can be used for isolating different pageserver groups withing the same etcd cluster.
+    /// Can be used for isolating different pageserver groups within the same etcd cluster.
     pub broker_etcd_prefix: String,
 
     /// Etcd broker endpoints to connect to.
@@ -486,6 +480,21 @@ impl PageServerConf {
         if let Some(pitr_interval) = item.get("pitr_interval") {
             t_conf.pitr_interval = Some(parse_toml_duration("pitr_interval", pitr_interval)?);
         }
+        if let Some(walreceiver_connect_timeout) = item.get("walreceiver_connect_timeout") {
+            t_conf.walreceiver_connect_timeout = Some(parse_toml_duration(
+                "walreceiver_connect_timeout",
+                walreceiver_connect_timeout,
+            )?);
+        }
+        if let Some(lagging_wal_timeout) = item.get("lagging_wal_timeout") {
+            t_conf.lagging_wal_timeout = Some(parse_toml_duration(
+                "lagging_wal_timeout",
+                lagging_wal_timeout,
+            )?);
+        }
+        if let Some(max_lsn_wal_lag) = item.get("max_lsn_wal_lag") {
+            t_conf.max_lsn_wal_lag = Some(parse_toml_from_str("max_lsn_wal_lag", max_lsn_wal_lag)?);
+        }
 
         Ok(t_conf)
     }
@@ -505,7 +514,7 @@ impl PageServerConf {
             max_file_descriptors: defaults::DEFAULT_MAX_FILE_DESCRIPTORS,
             listen_pg_addr: defaults::DEFAULT_PG_LISTEN_ADDR.to_string(),
             listen_http_addr: defaults::DEFAULT_HTTP_LISTEN_ADDR.to_string(),
-            superuser: "zenith_admin".to_string(),
+            superuser: "cloud_admin".to_string(),
             workdir: repo_dir,
             pg_distrib_dir: PathBuf::new(),
             auth_type: AuthType::Trust,

View File

@@ -1,3 +1,5 @@
+use std::num::NonZeroU64;
+
 use serde::{Deserialize, Serialize};
 use serde_with::{serde_as, DisplayFromStr};
 use utils::{
@@ -33,6 +35,9 @@ pub struct TenantCreateRequest {
     pub gc_period: Option<String>,
     pub image_creation_threshold: Option<usize>,
     pub pitr_interval: Option<String>,
+    pub walreceiver_connect_timeout: Option<String>,
+    pub lagging_wal_timeout: Option<String>,
+    pub max_lsn_wal_lag: Option<NonZeroU64>,
 }
 
 #[serde_as]
@@ -68,6 +73,9 @@ pub struct TenantConfigRequest {
     pub gc_period: Option<String>,
     pub image_creation_threshold: Option<usize>,
     pub pitr_interval: Option<String>,
+    pub walreceiver_connect_timeout: Option<String>,
+    pub lagging_wal_timeout: Option<String>,
+    pub max_lsn_wal_lag: Option<NonZeroU64>,
 }
 
 impl TenantConfigRequest {
@@ -82,6 +90,21 @@ impl TenantConfigRequest {
             gc_period: None,
             image_creation_threshold: None,
             pitr_interval: None,
+            walreceiver_connect_timeout: None,
+            lagging_wal_timeout: None,
+            max_lsn_wal_lag: None,
         }
     }
 }
+
+/// A WAL receiver's data stored inside the global `WAL_RECEIVERS`.
+/// We keep one WAL receiver active per timeline.
+#[serde_as]
+#[derive(Debug, Serialize, Deserialize, Clone)]
+pub struct WalReceiverEntry {
+    pub wal_producer_connstr: Option<String>,
+    #[serde_as(as = "Option<DisplayFromStr>")]
+    pub last_received_msg_lsn: Option<Lsn>,
+    /// the timestamp (in microseconds) of the last received message
+    pub last_received_msg_ts: Option<u128>,
+}

View File

@@ -229,23 +229,16 @@ async fn wal_receiver_get_handler(request: Request<Body>) -> Result<Response<Bod
     check_permission(&request, Some(tenant_id))?;
     let timeline_id: ZTimelineId = parse_request_param(&request, "timeline_id")?;
 
-    let wal_receiver = tokio::task::spawn_blocking(move || {
-        let _enter =
-            info_span!("wal_receiver_get", tenant = %tenant_id, timeline = %timeline_id).entered();
-        crate::walreceiver::get_wal_receiver_entry(tenant_id, timeline_id)
-    })
-    .await
-    .map_err(ApiError::from_err)?
-    .ok_or_else(|| {
-        ApiError::NotFound(format!(
-            "WAL receiver not found for tenant {} and timeline {}",
-            tenant_id, timeline_id
-        ))
-    })?;
-
-    json_response(StatusCode::OK, wal_receiver)
+    let wal_receiver_entry = crate::walreceiver::get_wal_receiver_entry(tenant_id, timeline_id)
+        .instrument(info_span!("wal_receiver_get", tenant = %tenant_id, timeline = %timeline_id))
+        .await
+        .ok_or_else(|| {
+            ApiError::NotFound(format!(
+                "WAL receiver data not found for tenant {tenant_id} and timeline {timeline_id}"
+            ))
+        })?;
+
+    json_response(StatusCode::OK, &wal_receiver_entry)
 }
 
 async fn timeline_attach_handler(request: Request<Body>) -> Result<Response<Body>, ApiError> {
@@ -402,6 +395,19 @@ async fn tenant_create_handler(mut request: Request<Body>) -> Result<Response<Bo
         Some(humantime::parse_duration(&pitr_interval).map_err(ApiError::from_err)?);
     }
+    if let Some(walreceiver_connect_timeout) = request_data.walreceiver_connect_timeout {
+        tenant_conf.walreceiver_connect_timeout = Some(
+            humantime::parse_duration(&walreceiver_connect_timeout).map_err(ApiError::from_err)?,
+        );
+    }
+    if let Some(lagging_wal_timeout) = request_data.lagging_wal_timeout {
+        tenant_conf.lagging_wal_timeout =
+            Some(humantime::parse_duration(&lagging_wal_timeout).map_err(ApiError::from_err)?);
+    }
+    if let Some(max_lsn_wal_lag) = request_data.max_lsn_wal_lag {
+        tenant_conf.max_lsn_wal_lag = Some(max_lsn_wal_lag);
+    }
     tenant_conf.checkpoint_distance = request_data.checkpoint_distance;
     tenant_conf.compaction_target_size = request_data.compaction_target_size;
     tenant_conf.compaction_threshold = request_data.compaction_threshold;
@@ -449,6 +455,18 @@ async fn tenant_config_handler(mut request: Request<Body>) -> Result<Response<Bo
         tenant_conf.pitr_interval =
             Some(humantime::parse_duration(&pitr_interval).map_err(ApiError::from_err)?);
     }
+    if let Some(walreceiver_connect_timeout) = request_data.walreceiver_connect_timeout {
+        tenant_conf.walreceiver_connect_timeout = Some(
+            humantime::parse_duration(&walreceiver_connect_timeout).map_err(ApiError::from_err)?,
+        );
+    }
+    if let Some(lagging_wal_timeout) = request_data.lagging_wal_timeout {
+        tenant_conf.lagging_wal_timeout =
+            Some(humantime::parse_duration(&lagging_wal_timeout).map_err(ApiError::from_err)?);
+    }
+    if let Some(max_lsn_wal_lag) = request_data.max_lsn_wal_lag {
+        tenant_conf.max_lsn_wal_lag = Some(max_lsn_wal_lag);
+    }
     tenant_conf.checkpoint_distance = request_data.checkpoint_distance;
     tenant_conf.compaction_target_size = request_data.compaction_target_size;

View File

@@ -15,7 +15,7 @@ pub struct KeySpace {
 impl KeySpace {
     ///
     /// Partition a key space into roughly chunks of roughly 'target_size' bytes
-    /// in each patition.
+    /// in each partition.
     ///
     pub fn partition(&self, target_size: u64) -> KeyPartitioning {
         // Assume that each value is 8k in size.
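A minimal std-only sketch of the greedy chunking idea in the comment above, assuming a fixed 8 KiB per key as the code does — note the real `KeySpace` partitions key ranges, not a flat list of keys, so this is an illustration of the sizing logic only:

```rust
// Greedy partitioning sketch: walk keys in order and start a new partition
// whenever the current one reaches target_size bytes, assuming each value
// is 8 KiB as the comment above does.
const ASSUMED_VALUE_SIZE: u64 = 8 * 1024;

fn partition_keys(keys: &[u64], target_size: u64) -> Vec<Vec<u64>> {
    let mut parts: Vec<Vec<u64>> = Vec::new();
    let mut current: Vec<u64> = Vec::new();
    let mut current_bytes = 0u64;
    for &key in keys {
        if current_bytes >= target_size && !current.is_empty() {
            parts.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
        current.push(key);
        current_bytes += ASSUMED_VALUE_SIZE;
    }
    if !current.is_empty() {
        parts.push(current);
    }
    parts
}
```

Each resulting partition then becomes a candidate unit for image-layer creation during compaction.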

View File

@@ -25,6 +25,7 @@ use std::collections::{BTreeSet, HashSet};
 use std::fs;
 use std::fs::{File, OpenOptions};
 use std::io::Write;
+use std::num::NonZeroU64;
 use std::ops::{Bound::Included, Deref, Range};
 use std::path::{Path, PathBuf};
 use std::sync::atomic::{self, AtomicBool};
@@ -557,6 +558,27 @@ impl LayeredRepository {
             .unwrap_or(self.conf.default_tenant_conf.pitr_interval)
     }
 
+    pub fn get_wal_receiver_connect_timeout(&self) -> Duration {
+        let tenant_conf = self.tenant_conf.read().unwrap();
+        tenant_conf
+            .walreceiver_connect_timeout
+            .unwrap_or(self.conf.default_tenant_conf.walreceiver_connect_timeout)
+    }
+
+    pub fn get_lagging_wal_timeout(&self) -> Duration {
+        let tenant_conf = self.tenant_conf.read().unwrap();
+        tenant_conf
+            .lagging_wal_timeout
+            .unwrap_or(self.conf.default_tenant_conf.lagging_wal_timeout)
+    }
+
+    pub fn get_max_lsn_wal_lag(&self) -> NonZeroU64 {
+        let tenant_conf = self.tenant_conf.read().unwrap();
+        tenant_conf
+            .max_lsn_wal_lag
+            .unwrap_or(self.conf.default_tenant_conf.max_lsn_wal_lag)
+    }
+
     pub fn update_tenant_config(&self, new_tenant_conf: TenantConfOpt) -> Result<()> {
         let mut tenant_conf = self.tenant_conf.write().unwrap();
@@ -823,7 +845,7 @@ impl LayeredRepository {
         for (timeline_id, timeline_entry) in timelines.iter() {
             timeline_ids.push(*timeline_id);
 
-            // This is unresolved question for now, how to do gc in presense of remote timelines
+            // This is unresolved question for now, how to do gc in presence of remote timelines
             // especially when this is combined with branching.
             // Somewhat related: https://github.com/zenithdb/zenith/issues/999
             if let Some(ancestor_timeline_id) = &timeline_entry.ancestor_timeline_id() {
@@ -1705,9 +1727,7 @@ impl LayeredTimeline {
             new_delta_path.clone(),
             self.conf.timeline_path(&self.timeline_id, &self.tenant_id),
         ])?;
 
-        fail_point!("checkpoint-before-sync");
-        fail_point!("flush-frozen");
+        fail_point!("flush-frozen-before-sync");
 
         // Finally, replace the frozen in-memory layer with the new on-disk layer
         {
@@ -1831,7 +1851,7 @@ impl LayeredTimeline {
         // collect any page versions that are no longer needed because
         // of the new image layers we created in step 2.
         //
-        // TODO: This hight level strategy hasn't been implemented yet.
+        // TODO: This high level strategy hasn't been implemented yet.
         // Below are functions compact_level0() and create_image_layers()
         // but they are a bit ad hoc and don't quite work like it's explained
         // above. Rewrite it.
@@ -1839,41 +1859,37 @@ impl LayeredTimeline {
         let target_file_size = self.get_checkpoint_distance();
 
-        // Define partitioning schema if needed
-        if let Ok(pgdir) =
-            tenant_mgr::get_local_timeline_with_load(self.tenant_id, self.timeline_id)
-        {
-            let (partitioning, lsn) = pgdir.repartition(
-                self.get_last_record_lsn(),
-                self.get_compaction_target_size(),
-            )?;
-            let timer = self.create_images_time_histo.start_timer();
-
-            // 2. Create new image layers for partitions that have been modified
-            // "enough".
-            let mut layer_paths_to_upload = HashSet::with_capacity(partitioning.parts.len());
-            for part in partitioning.parts.iter() {
-                if self.time_for_new_image_layer(part, lsn)? {
-                    let new_path = self.create_image_layer(part, lsn)?;
-                    layer_paths_to_upload.insert(new_path);
-                }
-            }
-            if self.upload_layers.load(atomic::Ordering::Relaxed) {
-                storage_sync::schedule_layer_upload(
-                    self.tenant_id,
-                    self.timeline_id,
-                    layer_paths_to_upload,
-                    None,
-                );
-            }
-            timer.stop_and_record();
-
-            // 3. Compact
-            let timer = self.compact_time_histo.start_timer();
-            self.compact_level0(target_file_size)?;
-            timer.stop_and_record();
-        } else {
-            debug!("Could not compact because no partitioning specified yet");
-        }
+        // 1. Partition the key space
+        let pgdir = tenant_mgr::get_local_timeline_with_load(self.tenant_id, self.timeline_id)?;
+        let (partitioning, lsn) = pgdir.repartition(
+            self.get_last_record_lsn(),
+            self.get_compaction_target_size(),
+        )?;
+        let timer = self.create_images_time_histo.start_timer();
+
+        // 2. Create new image layers for partitions that have been modified
+        // "enough".
+        let mut layer_paths_to_upload = HashSet::with_capacity(partitioning.parts.len());
+        for part in partitioning.parts.iter() {
+            if self.time_for_new_image_layer(part, lsn)? {
+                let new_path = self.create_image_layer(part, lsn)?;
+                layer_paths_to_upload.insert(new_path);
+            }
+        }
+        if self.upload_layers.load(atomic::Ordering::Relaxed) {
+            storage_sync::schedule_layer_upload(
+                self.tenant_id,
+                self.timeline_id,
+                layer_paths_to_upload,
+                None,
+            );
+        }
+        timer.stop_and_record();
+
+        // 3. Compact
+        let timer = self.compact_time_histo.start_timer();
+        self.compact_level0(target_file_size)?;
+        timer.stop_and_record();
 
         Ok(())
     }
@@ -2268,7 +2284,7 @@ impl LayeredTimeline {
         }
 
         // 3. Is it needed by a child branch?
-        // NOTE With that wee would keep data that
+        // NOTE With that we would keep data that
         // might be referenced by child branches forever.
         // We can track this in child timeline GC and delete parent layers when
         // they are no longer needed. This might be complicated with long inheritance chains.

View File

@@ -260,7 +260,7 @@ Whenever a GetPage@LSN request comes in from the compute node, the
 page server needs to reconstruct the requested page, as it was at the
 requested LSN. To do that, the page server first checks the recent
 in-memory layer; if the requested page version is found there, it can
-be returned immediatedly without looking at the files on
+be returned immediately without looking at the files on
 disk. Otherwise the page server needs to locate the layer file that
 contains the requested page version.
 
@@ -409,7 +409,7 @@ removed because there is no newer layer file for the table.
 Things get slightly more complicated with multiple branches. All of
 the above still holds, but in addition to recent files we must also
-retain older shapshot files that are still needed by child branches.
+retain older snapshot files that are still needed by child branches.
 For example, if child branch is created at LSN 150, and the 'customers'
 table is updated on the branch, you would have these files:

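The reconstruction rule described in the first hunk above (scan page versions backwards from the requested LSN until a full image or an initializing record is found, then replay forward) can be sketched in plain Rust. This is an illustrative model only; the `Version` enum and string "pages" are hypothetical stand-ins, not neon's actual types.

```rust
// Simplified model of GetPage@LSN reconstruction: collect versions
// backwards until a base is found, then replay them in LSN order.
#[derive(Clone)]
enum Version {
    Image(String),                             // full page image
    Delta { will_init: bool, change: String }, // WAL-record stand-in
}

// `versions` is sorted by LSN ascending.
fn reconstruct(versions: &[(u64, Version)], lsn: u64) -> Option<String> {
    let mut needed = Vec::new();
    // Backwards scan: stop at an image or an initializing record.
    for (v_lsn, v) in versions.iter().rev() {
        if *v_lsn > lsn {
            continue;
        }
        needed.push(v.clone());
        match v {
            Version::Image(_) => break,
            Version::Delta { will_init, .. } if *will_init => break,
            _ => {}
        }
    }
    // Replay forward from the base.
    let mut page = String::new();
    for v in needed.iter().rev() {
        match v {
            Version::Image(img) => page = img.clone(),
            Version::Delta { will_init, change } => {
                if *will_init {
                    page.clear(); // an initializing record replaces the page
                }
                page.push('+');
                page.push_str(change);
            }
        }
    }
    if needed.is_empty() { None } else { Some(page) }
}

fn main() {
    let versions = vec![
        (100, Version::Image("base".to_string())),
        (150, Version::Delta { will_init: false, change: "a".to_string() }),
        (200, Version::Delta { will_init: false, change: "b".to_string() }),
    ];
    assert_eq!(reconstruct(&versions, 150).unwrap(), "base+a");
    assert_eq!(reconstruct(&versions, 250).unwrap(), "base+a+b");
    assert_eq!(reconstruct(&versions, 50), None);
}
```

In the real pageserver a scan that finds no base is an error rather than a partial result; the sketch keeps that case trivial.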
View File

@@ -23,25 +23,6 @@
 //! "values" part. The actual page images and WAL records are stored in the
 //! "values" part.
 //!
-//! # Compression
-//!
-//! Each value is stored as a Blob, which can optionally be compressed. Compression
-//! is done by ZStandard, in dictionary mode, which gives pretty good compression
-//! ratio even for small inputs like WAL records.
-//!
-//! The dictionary is built separately for each delta layer file, and stored in
-//! the file itself.
-//!
-//! TODO: The ZStandard format includes constant 4-byte "magic bytes" in the beginning
-//! of each compressed block. With small values like WAL records, that's pretty wasteful.
-//! We could disable those bytes by setting the `include_magibytes' flag to false,
-//! but as of this writing that's considered experimental in the zstd crate, and the
-//! zstd::bulk::Decompressor::upper_bound() function doesn't work without the magic bytes
-//! so we would have to find a different way of allocating the decompression buffer if
-//! we did that.
-//!
-use crate::config;
 use crate::config::PageServerConf;
 use crate::layered_repository::blob_io::{BlobCursor, BlobWriter, WriteBlobWriter};
 use crate::layered_repository::block_io::{BlockBuf, BlockCursor, BlockReader, FileBlockReader};
@@ -55,7 +36,7 @@ use crate::repository::{Key, Value, KEY_SIZE};
 use crate::virtual_file::VirtualFile;
 use crate::walrecord;
 use crate::{DELTA_FILE_MAGIC, STORAGE_FORMAT_VERSION};
-use anyhow::{anyhow, bail, ensure, Context, Result};
+use anyhow::{bail, ensure, Context, Result};
 use rand::{distributions::Alphanumeric, Rng};
 use serde::{Deserialize, Serialize};
 use std::fs;
@@ -94,9 +75,6 @@ struct Summary {
 index_start_blk: u32,
 /// Block within the 'index', where the B-tree root page is stored
 index_root_blk: u32,
-/// Byte offset of the compression dictionary, or 0 if no compression
-dictionary_offset: u64,
 }
 impl From<&DeltaLayer> for Summary {
@@ -112,46 +90,33 @@ impl From<&DeltaLayer> for Summary {
 index_start_blk: 0,
 index_root_blk: 0,
-dictionary_offset: 0,
 }
 }
 }
+// Flag indicating that this version initialize the page
+const WILL_INIT: u64 = 1;
 ///
-/// Struct representing reference to BLOB in the file. The reference contains
-/// the offset to the BLOB within the file, a flag indicating if it's
-/// compressed or not, and also the `will_init` flag. The `will_init` flag
+/// Struct representing reference to BLOB in layers. Reference contains BLOB
+/// offset, and for WAL records it also contains `will_init` flag. The flag
 /// helps to determine the range of records that needs to be applied, without
 /// reading/deserializing records themselves.
 ///
 #[derive(Debug, Serialize, Deserialize, Copy, Clone)]
 struct BlobRef(u64);
-/// Flag indicating that this blob is compressed
-const BLOB_COMPRESSED: u64 = 1;
-/// Flag indicating that this version initializes the page
-const WILL_INIT: u64 = 2;
 impl BlobRef {
-pub fn compressed(&self) -> bool {
-(self.0 & BLOB_COMPRESSED) != 0
-}
 pub fn will_init(&self) -> bool {
 (self.0 & WILL_INIT) != 0
 }
 pub fn pos(&self) -> u64 {
-self.0 >> 2
+self.0 >> 1
 }
-pub fn new(pos: u64, compressed: bool, will_init: bool) -> BlobRef {
-let mut blob_ref = pos << 2;
-if compressed {
-blob_ref |= BLOB_COMPRESSED;
-}
+pub fn new(pos: u64, will_init: bool) -> BlobRef {
+let mut blob_ref = pos << 1;
 if will_init {
 blob_ref |= WILL_INIT;
 }
@@ -228,37 +193,6 @@ pub struct DeltaLayerInner {
 /// Reader object for reading blocks from the file. (None if not loaded yet)
 file: Option<FileBlockReader<VirtualFile>>,
-/// Compression dictionary, as raw bytes, and in prepared format ready for use
-/// for decompression. None if there is no dictionary, or if 'loaded' is false.
-dictionary: Option<(Vec<u8>, zstd::dict::DecoderDictionary<'static>)>,
-}
-impl DeltaLayerInner {
-// Create a new Decompressor, using the prepared dictionary
-fn create_decompressor(&self) -> Result<Option<zstd::bulk::Decompressor<'_>>> {
-if let Some((_, dict)) = &self.dictionary {
-let decompressor = zstd::bulk::Decompressor::with_prepared_dictionary(dict)?;
-Ok(Some(decompressor))
-} else {
-Ok(None)
-}
-}
-// Create a new Decompressor, without using the prepared dictionary.
-//
-// For the cases that you cannot use 'create_decompressor', if the
-// Decompressor needs to outlive 'self'.
-fn create_decompressor_not_prepared(
-&self,
-) -> Result<Option<zstd::bulk::Decompressor<'static>>> {
-if let Some((dict, _)) = &self.dictionary {
-let decompressor = zstd::bulk::Decompressor::with_dictionary(dict)?;
-Ok(Some(decompressor))
-} else {
-Ok(None)
-}
-}
 }
 impl Layer for DeltaLayer {
@@ -300,8 +234,6 @@ impl Layer for DeltaLayer {
 {
 // Open the file and lock the metadata in memory
 let inner = self.load()?;
-let mut decompressor = inner.create_decompressor()?;
-let mut decompress_buf = Vec::new();
 // Scan the page versions backwards, starting from `lsn`.
 let file = inner.file.as_ref().unwrap();
@@ -312,7 +244,7 @@ impl Layer for DeltaLayer {
 );
 let search_key = DeltaKey::from_key_lsn(&key, Lsn(lsn_range.end.0 - 1));
-let mut blob_refs: Vec<(Lsn, BlobRef)> = Vec::new();
+let mut offsets: Vec<(Lsn, u64)> = Vec::new();
 tree_reader.visit(&search_key.0, VisitDirection::Backwards, |key, value| {
 let blob_ref = BlobRef(value);
@@ -323,36 +255,21 @@ impl Layer for DeltaLayer {
 if entry_lsn < lsn_range.start {
 return false;
 }
-blob_refs.push((entry_lsn, blob_ref));
+offsets.push((entry_lsn, blob_ref.pos()));
 !blob_ref.will_init()
 })?;
 // Ok, 'offsets' now contains the offsets of all the entries we need to read
 let mut cursor = file.block_cursor();
-for (entry_lsn, blob_ref) in blob_refs {
-let buf = cursor.read_blob(blob_ref.pos()).with_context(|| {
+for (entry_lsn, pos) in offsets {
+let buf = cursor.read_blob(pos).with_context(|| {
 format!(
 "Failed to read blob from virtual file {}",
 file.file.path.display()
 )
 })?;
-let uncompressed_bytes = if blob_ref.compressed() {
-if let Some(ref mut decompressor) = decompressor {
-let decompressed_max_len = zstd::bulk::Decompressor::upper_bound(&buf)
-.ok_or_else(|| anyhow!("could not get decompressed length"))?;
-decompress_buf.clear();
-decompress_buf.reserve(decompressed_max_len);
-let _ = decompressor.decompress_to_buffer(&buf, &mut decompress_buf)?;
-&decompress_buf
-} else {
-bail!("blob is compressed, but there was no dictionary");
-}
-} else {
-&buf
-};
-let val = Value::des(uncompressed_bytes).with_context(|| {
+let val = Value::des(&buf).with_context(|| {
 format!(
 "Failed to deserialize file blob from virtual file {}",
 file.file.path.display()
@@ -430,6 +347,7 @@ impl Layer for DeltaLayer {
 }
 let inner = self.load()?;
 println!(
 "index_start_blk: {}, root {}",
 inner.index_start_blk, inner.index_root_blk
@@ -445,49 +363,19 @@ impl Layer for DeltaLayer {
 tree_reader.dump()?;
 let mut cursor = file.block_cursor();
-let mut decompressor = inner.create_decompressor()?;
-let mut decompress_buf = Vec::new();
 // A subroutine to dump a single blob
 let mut dump_blob = |blob_ref: BlobRef| -> anyhow::Result<String> {
-let buf = cursor.read_blob(blob_ref.pos()).with_context(|| {
-format!(
-"Failed to read blob from virtual file {}",
-file.file.path.display()
-)
-})?;
-let uncompressed_bytes = if blob_ref.compressed() {
-if let Some(ref mut decompressor) = decompressor {
-let decompressed_max_len = zstd::bulk::Decompressor::upper_bound(&buf)
-.ok_or_else(|| anyhow!("could not get decompressed length"))?;
-decompress_buf.clear();
-decompress_buf.reserve(decompressed_max_len);
-let _ = decompressor.decompress_to_buffer(&buf, &mut decompress_buf)?;
-&decompress_buf
-} else {
-bail!("blob is compressed, but there was no dictionary");
-}
-} else {
-&buf
-};
-let val = Value::des(uncompressed_bytes).with_context(|| {
-format!(
-"Failed to deserialize file blob from virtual file {}",
-file.file.path.display()
-)
-})?;
+let buf = cursor.read_blob(blob_ref.pos())?;
+let val = Value::des(&buf)?;
 let desc = match val {
 Value::Image(img) => {
-format!("img {} bytes, {} compressed", img.len(), buf.len())
+format!(" img {} bytes", img.len())
 }
 Value::WalRecord(rec) => {
 let wal_desc = walrecord::describe_wal_record(&rec)?;
 format!(
-"rec {} bytes, {} compressed, will_init {}: {}",
-uncompressed_bytes.len(),
+" rec {} bytes will_init: {} {}",
 buf.len(),
 rec.will_init(),
 wal_desc
@@ -606,7 +494,6 @@ impl DeltaLayer {
 let mut expected_summary = Summary::from(self);
 expected_summary.index_start_blk = actual_summary.index_start_blk;
 expected_summary.index_root_blk = actual_summary.index_root_blk;
-expected_summary.dictionary_offset = actual_summary.dictionary_offset;
 if actual_summary != expected_summary {
 bail!("in-file summary does not match expected summary. actual = {:?} expected = {:?}", actual_summary, expected_summary);
 }
@@ -625,13 +512,6 @@ impl DeltaLayer {
 }
 }
-// Load and prepare the dictionary, if any
-if actual_summary.dictionary_offset != 0 {
-let mut cursor = file.block_cursor();
-let dict = cursor.read_blob(actual_summary.dictionary_offset)?;
-let prepared_dict = zstd::dict::DecoderDictionary::copy(&dict);
-inner.dictionary = Some((dict, prepared_dict));
-}
 inner.index_start_blk = actual_summary.index_start_blk;
 inner.index_root_blk = actual_summary.index_root_blk;
@@ -657,7 +537,6 @@ impl DeltaLayer {
 inner: RwLock::new(DeltaLayerInner {
 loaded: false,
 file: None,
-dictionary: None,
 index_start_blk: 0,
 index_root_blk: 0,
 }),
@@ -685,7 +564,6 @@ impl DeltaLayer {
 inner: RwLock::new(DeltaLayerInner {
 loaded: false,
 file: None,
-dictionary: None,
 index_start_blk: 0,
 index_root_blk: 0,
 }),
@@ -721,16 +599,6 @@ impl DeltaLayer {
 ///
 /// 3. Call `finish`.
 ///
-///
-/// To train the dictionary for compression, the first ZSTD_MAX_SAMPLES values
-/// (or up ZSTD_MAX_SAMPLE_BYTES) are buffered in memory, before writing them
-/// to disk. When the "sample buffer" fills up, the buffered values are used
-/// to train a zstandard dictionary, which is then used to compress all the
-/// buffered values, and all subsequent values. So the dictionary is built
-/// based on just the first values, but in practice that usually gives pretty
-/// good compression for all subsequent data as well. Things like page and
-/// tuple headers are similar across all pages of the same relation.
-///
 pub struct DeltaLayerWriter {
 conf: &'static PageServerConf,
 path: PathBuf,
@@ -743,13 +611,6 @@ pub struct DeltaLayerWriter {
 tree: DiskBtreeBuilder<BlockBuf, DELTA_KEY_SIZE>,
 blob_writer: WriteBlobWriter<BufWriter<VirtualFile>>,
-compressor: Option<zstd::bulk::Compressor<'static>>,
-dictionary_offset: u64,
-training: bool,
-sample_key_lsn_willinit: Vec<(Key, Lsn, bool)>,
-sample_sizes: Vec<usize>,
-sample_data: Vec<u8>,
 }
 impl DeltaLayerWriter {
@@ -780,6 +641,7 @@ impl DeltaLayerWriter {
 // Initialize the b-tree index builder
 let block_buf = BlockBuf::new();
 let tree_builder = DiskBtreeBuilder::new(block_buf);
 Ok(DeltaLayerWriter {
 conf,
 path,
@@ -789,13 +651,6 @@ impl DeltaLayerWriter {
 lsn_range,
 tree: tree_builder,
 blob_writer,
-compressor: None,
-dictionary_offset: 0,
-training: true,
-sample_key_lsn_willinit: Vec::new(),
-sample_sizes: Vec::new(),
-sample_data: Vec::new(),
 })
 }
@@ -805,122 +660,18 @@ impl DeltaLayerWriter {
 /// The values must be appended in key, lsn order.
 ///
 pub fn put_value(&mut self, key: Key, lsn: Lsn, val: Value) -> Result<()> {
-let blob_content = &Value::ser(&val)?;
-// Are we still accumulating values for training the compression dictionary?
-if self.training {
-self.put_value_train(key, lsn, val.will_init(), blob_content)?;
-if self.sample_sizes.len() >= config::ZSTD_MAX_SAMPLES
-|| self.sample_data.len() >= config::ZSTD_MAX_SAMPLE_BYTES
-{
-self.finish_training()?;
-}
-} else {
-self.put_value_flush(key, lsn, val.will_init(), blob_content)?;
-}
-Ok(())
-}
-/// Accumulate one key-value pair in the samples buffer
-fn put_value_train(&mut self, key: Key, lsn: Lsn, will_init: bool, bytes: &[u8]) -> Result<()> {
-assert!(self.training);
-self.sample_key_lsn_willinit.push((key, lsn, will_init));
-self.sample_sizes.push(bytes.len());
-self.sample_data.extend_from_slice(bytes);
-Ok(())
-}
-/// Train the compression dictionary, and flush out all the accumulated
-/// key-value pairs to disk.
-fn finish_training(&mut self) -> Result<()> {
-assert!(self.training);
-assert!(self.sample_sizes.len() == self.sample_key_lsn_willinit.len());
-// Create the dictionary, if we had enough samples for it.
-//
-// If there weren't enough samples, we don't do any compression at
-// all. Possibly we could still benefit from compression; for example
-// if you have only one gigantic value in a single layer, it would
-// still be good to compress that, without a dictionary. But we don't
-// do that currently.
-if self.sample_sizes.len() >= config::ZSTD_MIN_SAMPLES {
-let dictionary = zstd::dict::from_continuous(
-&self.sample_data,
-&self.sample_sizes,
-config::ZSTD_MAX_DICTIONARY_SIZE,
-)?;
-let off = self.blob_writer.write_blob(&dictionary)?;
-self.dictionary_offset = off;
-let compressor = zstd::bulk::Compressor::with_dictionary(
-config::ZSTD_COMPRESSION_LEVEL,
-&dictionary,
-)?;
-self.compressor = Some(compressor);
-};
-self.training = false;
-// release the memory used by the sample buffers
-let sample_key_lsn_willinit = std::mem::take(&mut self.sample_key_lsn_willinit);
-let sample_sizes = std::mem::take(&mut self.sample_sizes);
-let sample_data = std::mem::take(&mut self.sample_data);
-// Compress and write out all the buffered key-value pairs
-let mut buf_idx: usize = 0;
-for ((key, lsn, will_init), len) in
-itertools::izip!(sample_key_lsn_willinit.iter(), sample_sizes.iter())
-{
-let end = buf_idx + len;
-self.put_value_flush(*key, *lsn, *will_init, &sample_data[buf_idx..end])?;
-buf_idx = end;
-}
-assert!(buf_idx == sample_data.len());
-Ok(())
-}
-/// Write a key-value pair to the file, compressing it if applicable.
-pub fn put_value_flush(
-&mut self,
-key: Key,
-lsn: Lsn,
-will_init: bool,
-bytes: &[u8],
-) -> Result<()> {
-assert!(!self.training);
 assert!(self.lsn_range.start <= lsn);
-let mut blob_content = bytes;
-let mut compressed = false;
-// Try to compress the blob
-let compressed_bytes;
-if let Some(ref mut compressor) = self.compressor {
-compressed_bytes = compressor.compress(blob_content)?;
-// If compressed version is not any smaller than the original,
-// store it uncompressed.
-if compressed_bytes.len() < blob_content.len() {
-blob_content = &compressed_bytes;
-compressed = true;
-}
-}
-// Write it to the file
-let off = self.blob_writer.write_blob(blob_content)?;
-let blob_ref = BlobRef::new(off, compressed, will_init);
-// And store the reference in the B-tree
+let off = self.blob_writer.write_blob(&Value::ser(&val)?)?;
+let blob_ref = BlobRef::new(off, val.will_init());
 let delta_key = DeltaKey::from_key_lsn(&key, lsn);
 self.tree.append(&delta_key.0, blob_ref.0)?;
 Ok(())
 }
-///
-/// Return an estimate of the file, if it was finished now.
-///
 pub fn size(&self) -> u64 {
 self.blob_writer.size() + self.tree.borrow_writer().size()
 }
@@ -928,11 +679,7 @@ impl DeltaLayerWriter {
 ///
 /// Finish writing the delta layer.
 ///
-pub fn finish(mut self, key_end: Key) -> anyhow::Result<DeltaLayer> {
-if self.training {
-self.finish_training()?;
-}
+pub fn finish(self, key_end: Key) -> anyhow::Result<DeltaLayer> {
 let index_start_blk =
 ((self.blob_writer.size() + PAGE_SZ as u64 - 1) / PAGE_SZ as u64) as u32;
@@ -956,7 +703,6 @@ impl DeltaLayerWriter {
 lsn_range: self.lsn_range.clone(),
 index_start_blk,
 index_root_blk,
-dictionary_offset: self.dictionary_offset,
 };
 file.seek(SeekFrom::Start(0))?;
 Summary::ser_into(&summary, &mut file)?;
@@ -973,7 +719,6 @@ impl DeltaLayerWriter {
 inner: RwLock::new(DeltaLayerInner {
 loaded: false,
 file: None,
-dictionary: None,
 index_start_blk,
 index_root_blk,
 }),
@@ -1013,9 +758,6 @@ struct DeltaValueIter<'a> {
 all_offsets: Vec<(DeltaKey, BlobRef)>,
 next_idx: usize,
 reader: BlockCursor<Adapter<'a>>,
-decompressor: Option<zstd::bulk::Decompressor<'a>>,
-decompress_buf: Vec<u8>,
 }
 struct Adapter<'a>(RwLockReadGuard<'a, DeltaLayerInner>);
@@ -1055,20 +797,10 @@ impl<'a> DeltaValueIter<'a> {
 },
 )?;
-// We cannot use inner.create_decompressor() here, because it returns
-// a Decompressor with lifetime that depends on 'inner', and that
-// doesn't live long enough here. Cannot use the prepared dictionary
-// for that reason either. Doesn't matter too much in practice because
-// this Iterator is used for bulk operations, and loading the dictionary
-// isn't that expensive in comparison.
-let decompressor = inner.create_decompressor_not_prepared()?;
 let iter = DeltaValueIter {
 all_offsets,
 next_idx: 0,
 reader: BlockCursor::new(Adapter(inner)),
-decompressor,
-decompress_buf: Vec::new(),
 };
 Ok(iter)
@@ -1082,31 +814,7 @@ impl<'a> DeltaValueIter<'a> {
 let lsn = delta_key.lsn();
 let buf = self.reader.read_blob(blob_ref.pos())?;
-let uncompressed_bytes = if blob_ref.compressed() {
-if let Some(decompressor) = &mut self.decompressor {
-let decompressed_max_len = zstd::bulk::Decompressor::upper_bound(&buf)
-.ok_or_else(|| {
-anyhow!(
-"could not get decompressed length at offset {}",
-blob_ref.pos()
-)
-})?;
-self.decompress_buf.clear();
-self.decompress_buf.reserve(decompressed_max_len);
-let _ = decompressor.decompress_to_buffer(&buf, &mut self.decompress_buf)?;
-&self.decompress_buf
-} else {
-bail!("blob is compressed, but there was no dictionary");
-}
-} else {
-&buf
-};
-let val = Value::des(uncompressed_bytes).with_context(|| {
-format!(
-"Failed to deserialize file blob at offset {}",
-blob_ref.pos()
-)
-})?;
+let val = Value::des(&buf)?;
 self.next_idx += 1;
 Ok(Some((key, lsn, val)))
 } else {

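The delta-layer diff above changes `BlobRef` from packing two flags (compressed, will_init) in the low bits to packing just `will_init`, so the offset now shifts by one bit instead of two. A standalone sketch of the encoding after this change (an assumed simplification using the same names, not the pageserver's exact code):

```rust
// BlobRef packs a file offset and a boolean flag into one u64:
// the offset is shifted left one bit, the low bit stores `will_init`.
const WILL_INIT: u64 = 1;

#[derive(Debug, Clone, Copy)]
struct BlobRef(u64);

impl BlobRef {
    fn new(pos: u64, will_init: bool) -> BlobRef {
        let mut v = pos << 1;
        if will_init {
            v |= WILL_INIT;
        }
        BlobRef(v)
    }
    fn pos(self) -> u64 {
        self.0 >> 1
    }
    fn will_init(self) -> bool {
        (self.0 & WILL_INIT) != 0
    }
}

fn main() {
    // Round-trip: the flag does not disturb the offset and vice versa.
    let r = BlobRef::new(4096, true);
    assert_eq!(r.pos(), 4096);
    assert!(r.will_init());
    let r = BlobRef::new(4096, false);
    assert_eq!(r.pos(), 4096);
    assert!(!r.will_init());
}
```

Keeping `will_init` in the reference lets the backwards scan in `get_value_reconstruct_data` stop at an initializing record without deserializing the values themselves, exactly as the updated doc comment describes.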
View File

@@ -7,7 +7,7 @@
 //! - Fixed-width keys
 //! - Fixed-width values (VALUE_SZ)
 //! - The tree is created in a bulk operation. Insert/deletion after creation
-//! is not suppported
+//! is not supported
 //! - page-oriented
 //!
 //! TODO:
@@ -498,8 +498,8 @@ where
 return Ok(());
 }
-// It did not fit. Try to compress, and it it succeeds to make some room
-// on the node, try appending to it again.
+// It did not fit. Try to compress, and if it succeeds to make
+// some room on the node, try appending to it again.
 #[allow(clippy::collapsible_if)]
 if last.compress() {
 if last.push(key, value) {

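The push/compress/retry pattern in the corrected comment can be shown with a toy node type. Everything here is a hypothetical stand-in (a run-length "compression" instead of the real B-tree node layout), illustrating only the control flow of the append path:

```rust
// Toy node: fixed capacity of (value, repeat-count) entries.
struct Node {
    data: Vec<(u8, u32)>,
    cap: usize,
}

impl Node {
    fn push(&mut self, v: u8) -> bool {
        if self.data.len() < self.cap {
            self.data.push((v, 1));
            true
        } else {
            false // node is full
        }
    }
    // Merge adjacent equal values; returns true if it freed any room.
    fn compress(&mut self) -> bool {
        let before = self.data.len();
        let mut out: Vec<(u8, u32)> = Vec::new();
        for (v, n) in self.data.drain(..) {
            match out.last_mut() {
                Some((lv, ln)) if *lv == v => *ln += n,
                _ => out.push((v, n)),
            }
        }
        self.data = out;
        self.data.len() < before
    }
}

// Append with one compress-and-retry, as in the B-tree builder comment:
// it did not fit; try to compress, and if that makes room, push again.
fn append(last: &mut Node, v: u8) -> bool {
    if last.push(v) {
        return true;
    }
    if last.compress() && last.push(v) {
        return true;
    }
    false // caller would have to start a new node
}

fn main() {
    let mut n = Node { data: vec![(7, 1), (7, 1), (7, 1)], cap: 3 };
    assert!(append(&mut n, 9)); // full, but compression frees room
    let mut full = Node { data: vec![(1, 1), (2, 1), (3, 1)], cap: 3 };
    assert!(!append(&mut full, 9)); // nothing compressible, still full
}
```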
View File

@@ -19,11 +19,6 @@
 //! layer, and offsets to the other parts. The "index" is a B-tree,
 //! mapping from Key to an offset in the "values" part. The
 //! actual page images are stored in the "values" part.
-//!
-//! Each page image is compressed with ZStandard. See Compression section
-//! in the delta_layer.rs for more discussion. Difference from a delta
-//! layer is that we don't currently use a dictionary for image layers.
-use crate::config;
 use crate::config::PageServerConf;
 use crate::layered_repository::blob_io::{BlobCursor, BlobWriter, WriteBlobWriter};
 use crate::layered_repository::block_io::{BlockBuf, BlockReader, FileBlockReader};
@@ -95,35 +90,6 @@ impl From<&ImageLayer> for Summary {
 }
 }
-///
-/// Struct representing reference to BLOB in the file. In an image layer,
-/// each blob is an image of the page. It can be compressed or not, and
-/// that is stored in low bit of the BlobRef.
-///
-#[derive(Debug, Serialize, Deserialize, Copy, Clone)]
-struct BlobRef(u64);
-/// Flag indicating that this blob is compressed
-const BLOB_COMPRESSED: u64 = 1;
-impl BlobRef {
-pub fn compressed(&self) -> bool {
-(self.0 & BLOB_COMPRESSED) != 0
-}
-pub fn pos(&self) -> u64 {
-self.0 >> 1
-}
-pub fn new(pos: u64, compressed: bool) -> BlobRef {
-let mut blob_ref = pos << 1;
-if compressed {
-blob_ref |= BLOB_COMPRESSED;
-}
-BlobRef(blob_ref)
-}
-}
 ///
 /// ImageLayer is the in-memory data structure associated with an on-disk image
 /// file. We keep an ImageLayer in memory for each file, in the LayerMap. If a
@@ -155,13 +121,6 @@ pub struct ImageLayerInner {
 file: Option<FileBlockReader<VirtualFile>>,
 }
-impl ImageLayerInner {
-fn create_decompressor(&self) -> Result<zstd::bulk::Decompressor<'_>> {
-let decompressor = zstd::bulk::Decompressor::new()?;
-Ok(decompressor)
-}
-}
 impl Layer for ImageLayer {
 fn filename(&self) -> PathBuf {
 PathBuf::from(self.layer_name().to_string())
@@ -201,33 +160,20 @@ impl Layer for ImageLayer {
 let inner = self.load()?;
-let mut decompressor = inner.create_decompressor()?;
 let file = inner.file.as_ref().unwrap();
 let tree_reader = DiskBtreeReader::new(inner.index_start_blk, inner.index_root_blk, file);
 let mut keybuf: [u8; KEY_SIZE] = [0u8; KEY_SIZE];
 key.write_to_byte_slice(&mut keybuf);
-if let Some(value) = tree_reader.get(&keybuf)? {
-let blob_ref = BlobRef(value);
-let blob_content =
-file.block_cursor()
-.read_blob(blob_ref.pos())
-.with_context(|| {
-format!(
-"failed to read value from data file {} at offset {}",
-self.filename().display(),
-blob_ref.pos()
-)
-})?;
-let uncompressed_bytes = if blob_ref.compressed() {
-decompressor.decompress(&blob_content, PAGE_SZ)?
-} else {
-blob_content
-};
-let value = Bytes::from(uncompressed_bytes);
+if let Some(offset) = tree_reader.get(&keybuf)? {
+let blob = file.block_cursor().read_blob(offset).with_context(|| {
+format!(
+"failed to read value from data file {} at offset {}",
+self.filename().display(),
+offset
+)
+})?;
+let value = Bytes::from(blob);
 reconstruct_state.img = Some((self.lsn, value));
 Ok(ValueReconstructResult::Complete)
@@ -273,17 +219,7 @@ impl Layer for ImageLayer {
 tree_reader.dump()?;
 tree_reader.visit(&[0u8; KEY_SIZE], VisitDirection::Forwards, |key, value| {
-let blob_ref = BlobRef(value);
-println!(
-"key: {} offset {}{}",
-hex::encode(key),
-blob_ref.pos(),
-if blob_ref.compressed() {
-" (compressed)"
-} else {
-""
-}
-);
+println!("key: {} offset {}", hex::encode(key), value);
 true
 })?;
@@ -487,8 +423,6 @@ pub struct ImageLayerWriter {
 blob_writer: WriteBlobWriter<VirtualFile>,
 tree: DiskBtreeBuilder<BlockBuf, KEY_SIZE>,
-compressor: Option<zstd::bulk::Compressor<'static>>,
 }
 impl ImageLayerWriter {
@@ -520,12 +454,6 @@ impl ImageLayerWriter {
 let block_buf = BlockBuf::new();
 let tree_builder = DiskBtreeBuilder::new(block_buf);
-// TODO: use a dictionary
-let compressor = {
-let compressor = zstd::bulk::Compressor::new(config::ZSTD_COMPRESSION_LEVEL)?;
-Some(compressor)
-};
 let writer = ImageLayerWriter {
 conf,
 path,
@@ -535,7 +463,6 @@ impl ImageLayerWriter {
 lsn,
 tree: tree_builder,
 blob_writer,
-compressor,
 };
 Ok(writer)
@@ -548,37 +475,11 @@ impl ImageLayerWriter {
 ///
 pub fn put_image(&mut self, key: Key, img: &[u8]) -> Result<()> {
 ensure!(self.key_range.contains(&key));
-let mut blob_content = img;
-let mut compressed = false;
-// Try to compress the blob
-let compressed_bytes;
-if blob_content.len() <= PAGE_SZ {
-if let Some(ref mut compressor) = self.compressor {
-compressed_bytes = compressor.compress(blob_content)?;
-// If compressed version is not any smaller than the original,
-// store it uncompressed. This not just an optimization, the
-// the decompression assumes that too. That simplifies the
-// decompression, because you don't need to jump through any
-// hoops to determine how large a buffer you need to hold the
-// decompression result.
-if compressed_bytes.len() < blob_content.len() {
-blob_content = &compressed_bytes;
-compressed = true;
-}
-}
-}
-// Write it to the file
-let off = self.blob_writer.write_blob(blob_content)?;
-let blob_ref = BlobRef::new(off, compressed);
-// And store the reference in the B-tree
+let off = self.blob_writer.write_blob(img)?;
 let mut keybuf: [u8; KEY_SIZE] = [0u8; KEY_SIZE];
 key.write_to_byte_slice(&mut keybuf);
-self.tree.append(&keybuf, blob_ref.0)?;
+self.tree.append(&keybuf, off)?;
 Ok(())
 }

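After this change, the image-layer write path is just "append the blob, record its raw byte offset in the index" with no flag bits packed into the stored value. A minimal sketch of that shape, under assumed simplifications (a `Vec<u8>` standing in for the values file, a `BTreeMap` standing in for the on-disk B-tree, and integer keys):

```rust
use std::collections::BTreeMap;

// Toy image layer: length-prefixed blobs in `values`, raw offsets in `index`.
struct ImageLayerSketch {
    values: Vec<u8>,           // stand-in for the blob file
    index: BTreeMap<u64, u64>, // key -> byte offset of the blob
}

impl ImageLayerSketch {
    fn new() -> Self {
        Self { values: Vec::new(), index: BTreeMap::new() }
    }
    // Mirrors put_image: write the blob, then store the raw offset.
    fn put_image(&mut self, key: u64, img: &[u8]) -> u64 {
        let off = self.values.len() as u64;
        // Length-prefix the blob so it can be read back.
        self.values.extend_from_slice(&(img.len() as u32).to_le_bytes());
        self.values.extend_from_slice(img);
        self.index.insert(key, off);
        off
    }
    // Mirrors the read path: look up the offset, read the blob back.
    fn get_image(&self, key: u64) -> Option<&[u8]> {
        let off = *self.index.get(&key)? as usize;
        let len = u32::from_le_bytes(self.values[off..off + 4].try_into().ok()?) as usize;
        Some(&self.values[off + 4..off + 4 + len])
    }
}

fn main() {
    let mut layer = ImageLayerSketch::new();
    layer.put_image(1, b"page one");
    layer.put_image(2, b"page two");
    assert_eq!(layer.get_image(1), Some(&b"page one"[..]));
    assert_eq!(layer.get_image(2), Some(&b"page two"[..]));
    assert_eq!(layer.get_image(3), None);
}
```

Dropping the per-blob `BlobRef` wrapper is what lets the dump path above print `value` directly as an offset.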
View File

@@ -37,7 +37,7 @@ use pgdatadir_mapping::DatadirTimeline;
 /// This is embedded in the metadata file, and also in the header of all the
 /// layer files. If you make any backwards-incompatible changes to the storage
 /// format, bump this!
-pub const STORAGE_FORMAT_VERSION: u16 = 4;
+pub const STORAGE_FORMAT_VERSION: u16 = 3;
 // Magic constants used to identify different kinds of files
 pub const IMAGE_FILE_MAGIC: u16 = 0x5A60;

View File

@@ -20,7 +20,7 @@
 //! assign a buffer for a page, you must hold the mapping lock and the lock on
 //! the slot at the same time.
 //!
-//! Whenever you need to hold both locks simultenously, the slot lock must be
+//! Whenever you need to hold both locks simultaneously, the slot lock must be
 //! acquired first. This consistent ordering avoids deadlocks. To look up a page
 //! in the cache, you would first look up the mapping, while holding the mapping
 //! lock, and then lock the slot. You must release the mapping lock in between,

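The rule in the corrected page-cache comment is the standard deadlock-avoidance discipline: every thread that needs both locks acquires them in one agreed order, so no cycle of waits can form. A minimal sketch with two std mutexes (the names `slot` and `mapping` are illustrative, not the page cache's real types):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Run `threads` threads that each take both locks `iters` times,
// always in the same order: slot first, then mapping.
fn run(threads: usize, iters: u64) -> (u64, u64) {
    let slot = Arc::new(Mutex::new(0u64));
    let mapping = Arc::new(Mutex::new(0u64));
    let mut handles = Vec::new();
    for _ in 0..threads {
        let slot = Arc::clone(&slot);
        let mapping = Arc::clone(&mapping);
        handles.push(thread::spawn(move || {
            for _ in 0..iters {
                let mut s = slot.lock().unwrap(); // first lock
                let mut m = mapping.lock().unwrap(); // second lock, same order everywhere
                *s += 1;
                *m += 1;
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    let s = *slot.lock().unwrap();
    let m = *mapping.lock().unwrap();
    (s, m)
}

fn main() {
    // Because the order is consistent, this cannot deadlock.
    assert_eq!(run(4, 1000), (4000, 4000));
}
```

If one thread took `mapping` then `slot` while another took `slot` then `mapping`, each could end up waiting on the lock the other holds; the consistent ordering rules that out.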
View File

@@ -7,7 +7,6 @@
 // *status* -- show actual info about this pageserver,
 // *pagestream* -- enter mode where smgr and pageserver talk with their
 // custom protocol.
-// *callmemaybe <zenith timelineid> $url* -- ask pageserver to start walreceiver on $url
 //
 use anyhow::{bail, ensure, Context, Result};
@@ -38,7 +37,6 @@ use crate::repository::Timeline;
use crate::tenant_mgr;
use crate::thread_mgr;
use crate::thread_mgr::ThreadKind;
use crate::walreceiver;
use crate::CheckpointConfig;
use metrics::{register_histogram_vec, HistogramVec};
use postgres_ffi::xlog_utils::to_pg_timestamp;
@@ -634,7 +632,7 @@ impl PageServerHandler {
return Ok(());
}
// auth is some, just checked above, when auth is some
-// then claims are always present because of checks during connetion init
+// then claims are always present because of checks during connection init
// so this expect won't trigger
let claims = self
.claims
@@ -716,30 +714,6 @@ impl postgres_backend::Handler for PageServerHandler {
// Check that the timeline exists
self.handle_basebackup_request(pgb, timelineid, lsn, tenantid)?;
pgb.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
} else if query_string.starts_with("callmemaybe ") {
// callmemaybe <zenith tenantid as hex string> <zenith timelineid as hex string> <connstr>
// TODO lazy static
let re = Regex::new(r"^callmemaybe ([[:xdigit:]]+) ([[:xdigit:]]+) (.*)$").unwrap();
let caps = re
.captures(query_string)
.with_context(|| format!("invalid callmemaybe: '{}'", query_string))?;
let tenantid = ZTenantId::from_str(caps.get(1).unwrap().as_str())?;
let timelineid = ZTimelineId::from_str(caps.get(2).unwrap().as_str())?;
let connstr = caps.get(3).unwrap().as_str().to_owned();
self.check_permission(Some(tenantid))?;
let _enter =
info_span!("callmemaybe", timeline = %timelineid, tenant = %tenantid).entered();
// Check that the timeline exists
tenant_mgr::get_local_timeline_with_load(tenantid, timelineid)
.context("Cannot load local timeline")?;
walreceiver::launch_wal_receiver(self.conf, tenantid, timelineid, &connstr)?;
pgb.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?;
} else if query_string.to_ascii_lowercase().starts_with("set ") {
// important because psycopg2 executes "SET datestyle TO 'ISO'"

View File

@@ -521,7 +521,7 @@ pub struct DatadirModification<'a, R: Repository> {
lsn: Lsn, lsn: Lsn,
// The modifications are not applied directly to the underyling key-value store. // The modifications are not applied directly to the underlying key-value store.
// The put-functions add the modifications here, and they are flushed to the // The put-functions add the modifications here, and they are flushed to the
// underlying key-value store by the 'finish' function. // underlying key-value store by the 'finish' function.
pending_updates: HashMap<Key, Value>, pending_updates: HashMap<Key, Value>,

View File

@@ -1,223 +0,0 @@
//! Timeline synchrnonization logic to delete a bulk of timeline's remote files from the remote storage.
use anyhow::Context;
use futures::stream::{FuturesUnordered, StreamExt};
use tracing::{debug, error, info};
use utils::zid::ZTenantTimelineId;
use crate::remote_storage::{
storage_sync::{SyncQueue, SyncTask},
RemoteStorage,
};
use super::{LayersDeletion, SyncData};
/// Attempts to remove the timleline layers from the remote storage.
/// If the task had not adjusted the metadata before, the deletion will fail.
pub(super) async fn delete_timeline_layers<'a, P, S>(
storage: &'a S,
sync_queue: &SyncQueue,
sync_id: ZTenantTimelineId,
mut delete_data: SyncData<LayersDeletion>,
) -> bool
where
P: std::fmt::Debug + Send + Sync + 'static,
S: RemoteStorage<RemoteObjectId = P> + Send + Sync + 'static,
{
if !delete_data.data.deletion_registered {
error!("Cannot delete timeline layers before the deletion metadata is not registered, reenqueueing");
delete_data.retries += 1;
sync_queue.push(sync_id, SyncTask::Delete(delete_data));
return false;
}
if delete_data.data.layers_to_delete.is_empty() {
info!("No layers to delete, skipping");
return true;
}
let layers_to_delete = delete_data
.data
.layers_to_delete
.drain()
.collect::<Vec<_>>();
debug!("Layers to delete: {layers_to_delete:?}");
info!("Deleting {} timeline layers", layers_to_delete.len());
let mut delete_tasks = layers_to_delete
.into_iter()
.map(|local_layer_path| async {
let storage_path = match storage.storage_path(&local_layer_path).with_context(|| {
format!(
"Failed to get the layer storage path for local path '{}'",
local_layer_path.display()
)
}) {
Ok(path) => path,
Err(e) => return Err((e, local_layer_path)),
};
match storage.delete(&storage_path).await.with_context(|| {
format!(
"Failed to delete remote layer from storage at '{:?}'",
storage_path
)
}) {
Ok(()) => Ok(local_layer_path),
Err(e) => Err((e, local_layer_path)),
}
})
.collect::<FuturesUnordered<_>>();
let mut errored = false;
while let Some(deletion_result) = delete_tasks.next().await {
match deletion_result {
Ok(local_layer_path) => {
debug!(
"Successfully deleted layer {} for timeline {sync_id}",
local_layer_path.display()
);
delete_data.data.deleted_layers.insert(local_layer_path);
}
Err((e, local_layer_path)) => {
errored = true;
error!(
"Failed to delete layer {} for timeline {sync_id}: {e:?}",
local_layer_path.display()
);
delete_data.data.layers_to_delete.insert(local_layer_path);
}
}
}
if errored {
debug!("Reenqueuing failed delete task for timeline {sync_id}");
delete_data.retries += 1;
sync_queue.push(sync_id, SyncTask::Delete(delete_data));
}
errored
}
#[cfg(test)]
mod tests {
use std::{collections::HashSet, num::NonZeroUsize};
use itertools::Itertools;
use tempfile::tempdir;
use tokio::fs;
use utils::lsn::Lsn;
use crate::{
remote_storage::{
storage_sync::test_utils::{create_local_timeline, dummy_metadata},
LocalFs,
},
repository::repo_harness::{RepoHarness, TIMELINE_ID},
};
use super::*;
#[tokio::test]
async fn delete_timeline_negative() -> anyhow::Result<()> {
let harness = RepoHarness::create("delete_timeline_negative")?;
let (sync_queue, _) = SyncQueue::new(NonZeroUsize::new(100).unwrap());
let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID);
let storage = LocalFs::new(tempdir()?.path().to_path_buf(), &harness.conf.workdir)?;
let deleted = delete_timeline_layers(
&storage,
&sync_queue,
sync_id,
SyncData {
retries: 1,
data: LayersDeletion {
deleted_layers: HashSet::new(),
layers_to_delete: HashSet::new(),
deletion_registered: false,
},
},
)
.await;
assert!(
!deleted,
"Should not start the deletion for task with delete metadata unregistered"
);
Ok(())
}
#[tokio::test]
async fn delete_timeline() -> anyhow::Result<()> {
let harness = RepoHarness::create("delete_timeline")?;
let (sync_queue, _) = SyncQueue::new(NonZeroUsize::new(100).unwrap());
let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID);
let layer_files = ["a", "b", "c", "d"];
let storage = LocalFs::new(tempdir()?.path().to_path_buf(), &harness.conf.workdir)?;
let current_retries = 3;
let metadata = dummy_metadata(Lsn(0x30));
let local_timeline_path = harness.timeline_path(&TIMELINE_ID);
let timeline_upload =
create_local_timeline(&harness, TIMELINE_ID, &layer_files, metadata.clone()).await?;
for local_path in timeline_upload.layers_to_upload {
let remote_path = storage.storage_path(&local_path)?;
let remote_parent_dir = remote_path.parent().unwrap();
if !remote_parent_dir.exists() {
fs::create_dir_all(&remote_parent_dir).await?;
}
fs::copy(&local_path, &remote_path).await?;
}
assert_eq!(
storage
.list()
.await?
.into_iter()
.map(|remote_path| storage.local_path(&remote_path).unwrap())
.filter_map(|local_path| { Some(local_path.file_name()?.to_str()?.to_owned()) })
.sorted()
.collect::<Vec<_>>(),
layer_files
.iter()
.map(|layer_str| layer_str.to_string())
.sorted()
.collect::<Vec<_>>(),
"Expect to have all layer files remotely before deletion"
);
let deleted = delete_timeline_layers(
&storage,
&sync_queue,
sync_id,
SyncData {
retries: current_retries,
data: LayersDeletion {
deleted_layers: HashSet::new(),
layers_to_delete: HashSet::from([
local_timeline_path.join("a"),
local_timeline_path.join("c"),
local_timeline_path.join("something_different"),
]),
deletion_registered: true,
},
},
)
.await;
assert!(deleted, "Should be able to delete timeline files");
assert_eq!(
storage
.list()
.await?
.into_iter()
.map(|remote_path| storage.local_path(&remote_path).unwrap())
.filter_map(|local_path| { Some(local_path.file_name()?.to_str()?.to_owned()) })
.sorted()
.collect::<Vec<_>>(),
vec!["b".to_string(), "d".to_string()],
"Expect to have only non-deleted files remotely"
);
Ok(())
}
}

View File

@@ -19,7 +19,7 @@ use utils::{
#[derive(Debug, Clone, Copy, Hash, PartialEq, Eq, Ord, PartialOrd, Serialize, Deserialize)]
/// Key used in the Repository kv-store.
///
-/// The Repository treates this as an opaque struct, but see the code in pgdatadir_mapping.rs
+/// The Repository treats this as an opaque struct, but see the code in pgdatadir_mapping.rs
/// for what we actually store in these fields.
pub struct Key {
pub field1: u8,
@@ -195,6 +195,7 @@ impl Display for TimelineSyncStatusUpdate {
f.write_str(s)
}
}
///
/// A repository corresponds to one .zenith directory. One repository holds multiple
/// timelines, forked off from the same initial call to 'initdb'.
@@ -210,7 +211,7 @@ pub trait Repository: Send + Sync {
) -> Result<()>;
/// Get Timeline handle for given zenith timeline ID.
-/// This function is idempotent. It doesnt change internal state in any way.
+/// This function is idempotent. It doesn't change internal state in any way.
fn get_timeline(&self, timelineid: ZTimelineId) -> Option<RepositoryTimeline<Self::Timeline>>;
/// Get Timeline handle for locally available timeline. Load it into memory if it is not loaded.
@@ -242,7 +243,7 @@ pub trait Repository: Send + Sync {
///
/// 'timelineid' specifies the timeline to GC, or None for all.
/// `horizon` specifies delta from last lsn to preserve all object versions (pitr interval).
-/// `checkpoint_before_gc` parameter is used to force compaction of storage before CG
+/// `checkpoint_before_gc` parameter is used to force compaction of storage before GC
/// to make tests more deterministic.
/// TODO Do we still need it or we can call checkpoint explicitly in tests where needed?
fn gc_iteration(
@@ -345,11 +346,11 @@ pub trait Timeline: Send + Sync {
/// Look up given page version.
///
-/// NOTE: It is considerd an error to 'get' a key that doesn't exist. The abstraction
+/// NOTE: It is considered an error to 'get' a key that doesn't exist. The abstraction
/// above this needs to store suitable metadata to track what data exists with
/// what keys, in separate metadata entries. If a non-existent key is requested,
-/// the Repository implementation may incorrectly return a value from an ancestore
-/// branch, for exampel, or waste a lot of cycles chasing the non-existing key.
+/// the Repository implementation may incorrectly return a value from an ancestor
+/// branch, for example, or waste a lot of cycles chasing the non-existing key.
///
fn get(&self, key: Key, lsn: Lsn) -> Result<Bytes>;
@@ -469,6 +470,9 @@ pub mod repo_harness {
gc_period: Some(tenant_conf.gc_period),
image_creation_threshold: Some(tenant_conf.image_creation_threshold),
pitr_interval: Some(tenant_conf.pitr_interval),
walreceiver_connect_timeout: Some(tenant_conf.walreceiver_connect_timeout),
lagging_wal_timeout: Some(tenant_conf.lagging_wal_timeout),
max_lsn_wal_lag: Some(tenant_conf.max_lsn_wal_lag),
}
}
}

View File

@@ -69,7 +69,7 @@
//! Yet instead of keeping the `metadata` file remotely, we wrap it with more data in [`IndexPart`], containing the list of remote files.
//! This file gets read to populate the cache, if the remote timeline data is missing from it and gets updated after every successful download.
//! This way, we optimize S3 storage access by not running the `S3 list` command that could be expencive and slow: knowing both [`ZTenantId`] and [`ZTimelineId`],
-//! we can always reconstruct the path to the timeline, use this to get the same path on the remote storage and retrive its shard contents, if needed, same as any layer files.
+//! we can always reconstruct the path to the timeline, use this to get the same path on the remote storage and retrieve its shard contents, if needed, same as any layer files.
//!
//! By default, pageserver reads the remote storage index data only for timelines located locally, to synchronize those, if needed.
//! Bulk index data download happens only initially, on pageserver startup. The rest of the remote storage stays unknown to pageserver and loaded on demand only,
@@ -96,7 +96,7 @@
//! timeline uploads and downloads can happen concurrently, in no particular order due to incremental nature of the timeline layers.
//! Deletion happens only after a successful upload only, otherwise the compaction output might make the timeline inconsistent until both tasks are fully processed without errors.
//! Upload and download update the remote data (inmemory index and S3 json index part file) only after every layer is successfully synchronized, while the deletion task
-//! does otherwise: it requires to have the remote data updated first succesfully: blob files will be invisible to pageserver this way.
+//! does otherwise: it requires to have the remote data updated first successfully: blob files will be invisible to pageserver this way.
//!
//! During the loop startup, an initial [`RemoteTimelineIndex`] state is constructed via downloading and merging the index data for all timelines,
//! present locally.
@@ -440,7 +440,7 @@ fn collect_timeline_files(
// initial collect will fail because there is no metadata.
// We either need to start download if we see empty dir after restart or attach caller should
// be aware of that and retry attach if awaits_download for timeline switched from true to false
-// but timelinne didnt appear locally.
+// but timelinne didn't appear locally.
// Check what happens with remote index in that case.
let timeline_metadata_path = match timeline_metadata_path {
Some(path) => path,
@@ -892,7 +892,7 @@ fn storage_sync_loop<P, S>(
REMAINING_SYNC_ITEMS.set(remaining_queue_length as i64);
if remaining_queue_length > 0 || !batched_tasks.is_empty() {
-info!("Processing tasks for {} timelines in batch, more tasks left to process: {remaining_queue_length}", batched_tasks.len());
+debug!("Processing tasks for {} timelines in batch, more tasks left to process: {remaining_queue_length}", batched_tasks.len());
} else {
debug!("No tasks to process");
continue;
@@ -1007,7 +1007,7 @@ where
// in local (implicitly, via Lsn values and related memory state) or remote (explicitly via remote layer file paths) metadata.
// When operating in a system without tasks failing over the error threshold,
// current batching and task processing systems aim to update the layer set and metadata files (remote and local),
-// without "loosing" such layer files.
+// without "losing" such layer files.
let (upload_result, status_update) = tokio::join!(
async {
if let Some(upload_data) = upload_data {
@@ -1162,7 +1162,7 @@ where
return Some(TimelineSyncStatusUpdate::Downloaded);
}
Err(e) => {
-error!("Timeline {sync_id} was expected to be in the remote index after a sucessful download, but it's absent: {e:?}");
+error!("Timeline {sync_id} was expected to be in the remote index after a successful download, but it's absent: {e:?}");
}
},
Err(e) => {
@@ -1186,7 +1186,7 @@ async fn update_local_metadata(
let remote_metadata = match remote_timeline {
Some(timeline) => &timeline.metadata,
None => {
-info!("No remote timeline to update local metadata from, skipping the update");
+debug!("No remote timeline to update local metadata from, skipping the update");
return Ok(());
}
};
@@ -1549,10 +1549,10 @@ fn compare_local_and_remote_timeline(
let remote_files = remote_entry.stored_files();
// TODO probably here we need more sophisticated logic,
-// if more data is available remotely can we just download whats there?
+// if more data is available remotely can we just download what's there?
// without trying to upload something. It may be tricky, needs further investigation.
// For now looks strange that we can request upload
-// and dowload for the same timeline simultaneously.
+// and download for the same timeline simultaneously.
// (upload needs to be only for previously unsynced files, not whole timeline dir).
// If one of the tasks fails they will be reordered in the queue which can lead
// to timeline being stuck in evicted state
@@ -1565,7 +1565,7 @@ fn compare_local_and_remote_timeline(
}),
));
(LocalTimelineInitStatus::NeedsSync, true)
-// we do not need to manupulate with remote consistent lsn here
+// we do not need to manipulate with remote consistent lsn here
// because it will be updated when sync will be completed
} else {
(LocalTimelineInitStatus::LocallyComplete, false)

View File

@@ -1,4 +1,4 @@
-//! Timeline synchrnonization logic to delete a bulk of timeline's remote files from the remote storage.
+//! Timeline synchronization logic to delete a bulk of timeline's remote files from the remote storage.
use anyhow::Context;
use futures::stream::{FuturesUnordered, StreamExt};

View File

@@ -1,4 +1,4 @@
-//! Timeline synchrnonization logic to fetch the layer files from remote storage into pageserver's local directory.
+//! Timeline synchronization logic to fetch the layer files from remote storage into pageserver's local directory.
use std::{collections::HashSet, fmt::Debug, path::Path};

View File

@@ -273,7 +273,7 @@ mod tests {
};
let index_part = IndexPart::from_remote_timeline(&timeline_path, remote_timeline.clone())
-.expect("Correct remote timeline should be convertable to index part");
+.expect("Correct remote timeline should be convertible to index part");
assert_eq!(
index_part.timeline_layers.iter().collect::<BTreeSet<_>>(),
@@ -305,7 +305,7 @@ mod tests {
);
let restored_timeline = RemoteTimeline::from_index_part(&timeline_path, index_part)
-.expect("Correct index part should be convertable to remote timeline");
+.expect("Correct index part should be convertible to remote timeline");
let original_metadata = &remote_timeline.metadata;
let restored_metadata = &restored_timeline.metadata;

View File

@@ -391,7 +391,7 @@ mod tests {
assert_eq!(
upload.metadata,
Some(metadata),
-"Successful upload should not chage its metadata"
+"Successful upload should not change its metadata"
);
let storage_files = storage.list().await?;

View File

@@ -10,6 +10,7 @@
//!
use crate::config::PageServerConf;
use serde::{Deserialize, Serialize};
use std::num::NonZeroU64;
use std::path::PathBuf;
use std::time::Duration;
use utils::zid::ZTenantId;
@@ -34,6 +35,9 @@ pub mod defaults {
pub const DEFAULT_GC_PERIOD: &str = "100 s";
pub const DEFAULT_IMAGE_CREATION_THRESHOLD: usize = 3;
pub const DEFAULT_PITR_INTERVAL: &str = "30 days";
pub const DEFAULT_WALRECEIVER_CONNECT_TIMEOUT: &str = "2 seconds";
pub const DEFAULT_WALRECEIVER_LAGGING_WAL_TIMEOUT: &str = "10 seconds";
pub const DEFAULT_MAX_WALRECEIVER_LSN_WAL_LAG: u64 = 10_000;
}
/// Per-tenant configuration options
@@ -68,6 +72,17 @@ pub struct TenantConf {
// Page versions older than this are garbage collected away.
#[serde(with = "humantime_serde")]
pub pitr_interval: Duration,
/// Maximum amount of time to wait while opening a connection to receive wal, before erroring.
#[serde(with = "humantime_serde")]
pub walreceiver_connect_timeout: Duration,
/// Considers safekeepers stalled after no WAL updates were received longer than this threshold.
/// A stalled safekeeper will be changed to a newer one when it appears.
#[serde(with = "humantime_serde")]
pub lagging_wal_timeout: Duration,
/// Considers safekeepers lagging when their WAL is behind another safekeeper for more than this threshold.
/// A lagging safekeeper will be changed after `lagging_wal_timeout` time elapses since the last WAL update,
/// to avoid eager reconnects.
pub max_lsn_wal_lag: NonZeroU64,
}
/// Same as TenantConf, but this struct preserves the information about
@@ -85,6 +100,11 @@ pub struct TenantConfOpt {
pub image_creation_threshold: Option<usize>,
#[serde(with = "humantime_serde")]
pub pitr_interval: Option<Duration>,
#[serde(with = "humantime_serde")]
pub walreceiver_connect_timeout: Option<Duration>,
#[serde(with = "humantime_serde")]
pub lagging_wal_timeout: Option<Duration>,
pub max_lsn_wal_lag: Option<NonZeroU64>,
}
impl TenantConfOpt {
@@ -108,6 +128,13 @@ impl TenantConfOpt {
.image_creation_threshold
.unwrap_or(global_conf.image_creation_threshold),
pitr_interval: self.pitr_interval.unwrap_or(global_conf.pitr_interval),
walreceiver_connect_timeout: self
.walreceiver_connect_timeout
.unwrap_or(global_conf.walreceiver_connect_timeout),
lagging_wal_timeout: self
.lagging_wal_timeout
.unwrap_or(global_conf.lagging_wal_timeout),
max_lsn_wal_lag: self.max_lsn_wal_lag.unwrap_or(global_conf.max_lsn_wal_lag),
}
}
@@ -136,6 +163,15 @@ impl TenantConfOpt {
if let Some(pitr_interval) = other.pitr_interval {
self.pitr_interval = Some(pitr_interval);
}
if let Some(walreceiver_connect_timeout) = other.walreceiver_connect_timeout {
self.walreceiver_connect_timeout = Some(walreceiver_connect_timeout);
}
if let Some(lagging_wal_timeout) = other.lagging_wal_timeout {
self.lagging_wal_timeout = Some(lagging_wal_timeout);
}
if let Some(max_lsn_wal_lag) = other.max_lsn_wal_lag {
self.max_lsn_wal_lag = Some(max_lsn_wal_lag);
}
}
}
@@ -155,6 +191,14 @@ impl TenantConf {
image_creation_threshold: DEFAULT_IMAGE_CREATION_THRESHOLD,
pitr_interval: humantime::parse_duration(DEFAULT_PITR_INTERVAL)
.expect("cannot parse default PITR interval"),
walreceiver_connect_timeout: humantime::parse_duration(
DEFAULT_WALRECEIVER_CONNECT_TIMEOUT,
)
.expect("cannot parse default walreceiver connect timeout"),
lagging_wal_timeout: humantime::parse_duration(DEFAULT_WALRECEIVER_LAGGING_WAL_TIMEOUT)
.expect("cannot parse default walreceiver lagging wal timeout"),
max_lsn_wal_lag: NonZeroU64::new(DEFAULT_MAX_WALRECEIVER_LSN_WAL_LAG)
.expect("cannot parse default max walreceiver Lsn wal lag"),
}
}
@@ -175,6 +219,16 @@ impl TenantConf {
gc_period: Duration::from_secs(10),
image_creation_threshold: defaults::DEFAULT_IMAGE_CREATION_THRESHOLD,
pitr_interval: Duration::from_secs(60 * 60),
walreceiver_connect_timeout: humantime::parse_duration(
defaults::DEFAULT_WALRECEIVER_CONNECT_TIMEOUT,
)
.unwrap(),
lagging_wal_timeout: humantime::parse_duration(
defaults::DEFAULT_WALRECEIVER_LAGGING_WAL_TIMEOUT,
)
.unwrap(),
max_lsn_wal_lag: NonZeroU64::new(defaults::DEFAULT_MAX_WALRECEIVER_LSN_WAL_LAG)
.unwrap(),
}
}
}

View File
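The `TenantConfOpt` additions above follow a common Rust configuration-overlay pattern: every per-tenant option is an `Option<T>`, and the effective config is materialized by falling back to the global value with `unwrap_or`. A minimal sketch of the same idea, using hypothetical `Conf`/`ConfOpt` types rather than the pageserver's actual structs:

```rust
// Effective, fully-resolved configuration (analogous to TenantConf).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Conf {
    gc_period_secs: u64,
    image_creation_threshold: usize,
}

// Partial per-tenant overrides (analogous to TenantConfOpt):
// a None field means "inherit the global default".
#[derive(Default)]
struct ConfOpt {
    gc_period_secs: Option<u64>,
    image_creation_threshold: Option<usize>,
}

impl ConfOpt {
    // Resolve against the global config, field by field.
    fn merge(&self, global: Conf) -> Conf {
        Conf {
            gc_period_secs: self.gc_period_secs.unwrap_or(global.gc_period_secs),
            image_creation_threshold: self
                .image_creation_threshold
                .unwrap_or(global.image_creation_threshold),
        }
    }
}

fn main() {
    let global = Conf { gc_period_secs: 100, image_creation_threshold: 3 };
    // The tenant overrides only the GC period; the threshold is inherited.
    let tenant = ConfOpt { gc_period_secs: Some(10), ..Default::default() };
    let effective = tenant.merge(global);
    assert_eq!(effective.gc_period_secs, 10);
    assert_eq!(effective.image_creation_threshold, 3);
}
```

The `update` method in the diff (`if let Some(x) = other.x { self.x = Some(x) }`) is the complementary half: it layers one partial config on top of another without resolving defaults yet.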

@@ -8,11 +8,10 @@ use crate::repository::{Repository, TimelineSyncStatusUpdate};
use crate::storage_sync::index::RemoteIndex;
use crate::storage_sync::{self, LocalTimelineInitStatus, SyncStartupData};
use crate::tenant_config::TenantConfOpt;
use crate::thread_mgr;
use crate::thread_mgr::ThreadKind;
use crate::timelines;
use crate::timelines::CreateRepo;
use crate::walredo::PostgresRedoManager;
use crate::{thread_mgr, timelines, walreceiver};
use crate::{DatadirTimelineImpl, RepositoryImpl};
use anyhow::{bail, Context};
use serde::{Deserialize, Serialize};
@@ -21,23 +20,30 @@ use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::fmt;
use std::sync::Arc;
use tokio::sync::mpsc;
use tracing::*;
use utils::lsn::Lsn;
-use utils::zid::{ZTenantId, ZTimelineId};
+use utils::zid::{ZTenantId, ZTenantTimelineId, ZTimelineId};
mod tenants_state {
use anyhow::ensure;
use std::{
collections::HashMap,
sync::{RwLock, RwLockReadGuard, RwLockWriteGuard},
};
use tokio::sync::mpsc;
use tracing::{debug, error};
use utils::zid::ZTenantId;
-use crate::tenant_mgr::Tenant;
+use crate::tenant_mgr::{LocalTimelineUpdate, Tenant};
lazy_static::lazy_static! {
static ref TENANTS: RwLock<HashMap<ZTenantId, Tenant>> = RwLock::new(HashMap::new());
/// Sends updates to the local timelines (creation and deletion) to the WAL receiver,
/// so that it can enable/disable corresponding processes.
static ref TIMELINE_UPDATE_SENDER: RwLock<Option<mpsc::UnboundedSender<LocalTimelineUpdate>>> = RwLock::new(None);
}
pub(super) fn read_tenants() -> RwLockReadGuard<'static, HashMap<ZTenantId, Tenant>> {
@@ -51,6 +57,39 @@ mod tenants_state {
.write()
.expect("Failed to write() tenants lock, it got poisoned")
}
pub(super) fn set_timeline_update_sender(
timeline_updates_sender: mpsc::UnboundedSender<LocalTimelineUpdate>,
) -> anyhow::Result<()> {
let mut sender_guard = TIMELINE_UPDATE_SENDER
.write()
.expect("Failed to write() timeline_update_sender lock, it got poisoned");
ensure!(sender_guard.is_none(), "Timeline update sender already set");
*sender_guard = Some(timeline_updates_sender);
Ok(())
}
pub(super) fn try_send_timeline_update(update: LocalTimelineUpdate) {
match TIMELINE_UPDATE_SENDER
.read()
.expect("Failed to read() timeline_update_sender lock, it got poisoned")
.as_ref()
{
Some(sender) => {
if let Err(e) = sender.send(update) {
error!("Failed to send timeline update: {}", e);
}
}
None => debug!("Timeline update sender is not enabled, cannot send update {update:?}"),
}
}
pub(super) fn stop_timeline_update_sender() {
TIMELINE_UPDATE_SENDER
.write()
.expect("Failed to write() timeline_update_sender lock, it got poisoned")
.take();
}
} }
struct Tenant { struct Tenant {
@@ -87,10 +126,10 @@ pub enum TenantState {
impl fmt::Display for TenantState { impl fmt::Display for TenantState {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
match self { match self {
TenantState::Active => f.write_str("Active"), Self::Active => f.write_str("Active"),
TenantState::Idle => f.write_str("Idle"), Self::Idle => f.write_str("Idle"),
TenantState::Stopping => f.write_str("Stopping"), Self::Stopping => f.write_str("Stopping"),
TenantState::Broken => f.write_str("Broken"), Self::Broken => f.write_str("Broken"),
} }
} }
} }
@@ -99,6 +138,11 @@ impl fmt::Display for TenantState {
/// Timelines that are only partially available locally (remote storage has more data than this pageserver) /// Timelines that are only partially available locally (remote storage has more data than this pageserver)
/// are scheduled for download and added to the repository once download is completed. /// are scheduled for download and added to the repository once download is completed.
pub fn init_tenant_mgr(conf: &'static PageServerConf) -> anyhow::Result<RemoteIndex> { pub fn init_tenant_mgr(conf: &'static PageServerConf) -> anyhow::Result<RemoteIndex> {
let (timeline_updates_sender, timeline_updates_receiver) =
mpsc::unbounded_channel::<LocalTimelineUpdate>();
tenants_state::set_timeline_update_sender(timeline_updates_sender)?;
walreceiver::init_wal_receiver_main_thread(conf, timeline_updates_receiver)?;
let SyncStartupData { let SyncStartupData {
remote_index, remote_index,
local_timeline_init_statuses, local_timeline_init_statuses,
@@ -113,16 +157,27 @@ pub fn init_tenant_mgr(conf: &'static PageServerConf) -> anyhow::Result<RemoteIn
// loading a tenant is serious, but it's better to complete the startup and // loading a tenant is serious, but it's better to complete the startup and
// serve other tenants, than fail completely. // serve other tenants, than fail completely.
error!("Failed to initialize local tenant {tenant_id}: {:?}", err); error!("Failed to initialize local tenant {tenant_id}: {:?}", err);
let mut m = tenants_state::write_tenants(); set_tenant_state(tenant_id, TenantState::Broken)?;
if let Some(tenant) = m.get_mut(&tenant_id) {
tenant.state = TenantState::Broken;
}
} }
} }
Ok(remote_index) Ok(remote_index)
} }
pub enum LocalTimelineUpdate {
Detach(ZTenantTimelineId),
Attach(ZTenantTimelineId, Arc<DatadirTimelineImpl>),
}
impl std::fmt::Debug for LocalTimelineUpdate {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
match self {
Self::Detach(ttid) => f.debug_tuple("Remove").field(ttid).finish(),
Self::Attach(ttid, _) => f.debug_tuple("Add").field(ttid).finish(),
}
}
}
/// Updates tenants' repositories, changing their timelines state in memory. /// Updates tenants' repositories, changing their timelines state in memory.
pub fn apply_timeline_sync_status_updates( pub fn apply_timeline_sync_status_updates(
conf: &'static PageServerConf, conf: &'static PageServerConf,
@@ -160,6 +215,7 @@ pub fn apply_timeline_sync_status_updates(
/// Shut down all tenants. This runs as part of pageserver shutdown. /// Shut down all tenants. This runs as part of pageserver shutdown.
/// ///
pub fn shutdown_all_tenants() { pub fn shutdown_all_tenants() {
tenants_state::stop_timeline_update_sender();
let mut m = tenants_state::write_tenants(); let mut m = tenants_state::write_tenants();
let mut tenantids = Vec::new(); let mut tenantids = Vec::new();
for (tenantid, tenant) in m.iter_mut() { for (tenantid, tenant) in m.iter_mut() {
@@ -173,7 +229,7 @@ pub fn shutdown_all_tenants() {
} }
drop(m); drop(m);
thread_mgr::shutdown_threads(Some(ThreadKind::WalReceiver), None, None); thread_mgr::shutdown_threads(Some(ThreadKind::WalReceiverManager), None, None);
thread_mgr::shutdown_threads(Some(ThreadKind::GarbageCollector), None, None); thread_mgr::shutdown_threads(Some(ThreadKind::GarbageCollector), None, None);
thread_mgr::shutdown_threads(Some(ThreadKind::Compactor), None, None); thread_mgr::shutdown_threads(Some(ThreadKind::Compactor), None, None);
@@ -247,32 +303,49 @@ pub fn get_tenant_state(tenantid: ZTenantId) -> Option<TenantState> {
Some(tenants_state::read_tenants().get(&tenantid)?.state) Some(tenants_state::read_tenants().get(&tenantid)?.state)
} }
/// pub fn set_tenant_state(tenant_id: ZTenantId, new_state: TenantState) -> anyhow::Result<()> {
/// Change the state of a tenant to Active and launch its compactor and GC
/// threads. If the tenant was already in Active state or Stopping, does nothing.
///
pub fn activate_tenant(tenant_id: ZTenantId) -> anyhow::Result<()> {
let mut m = tenants_state::write_tenants(); let mut m = tenants_state::write_tenants();
let tenant = m let tenant = m
.get_mut(&tenant_id) .get_mut(&tenant_id)
.with_context(|| format!("Tenant not found for id {tenant_id}"))?; .with_context(|| format!("Tenant not found for id {tenant_id}"))?;
let old_state = tenant.state;
tenant.state = new_state;
drop(m);
info!("activating tenant {tenant_id}"); match (old_state, new_state) {
(TenantState::Broken, TenantState::Broken)
match tenant.state { | (TenantState::Active, TenantState::Active)
// If the tenant is already active, nothing to do. | (TenantState::Idle, TenantState::Idle)
TenantState::Active => {} | (TenantState::Stopping, TenantState::Stopping) => {
debug!("tenant {tenant_id} already in state {new_state}");
// If it's Idle, launch the compactor and GC threads }
TenantState::Idle => { (TenantState::Broken, ignored) => {
thread_mgr::spawn( debug!("Ignoring {ignored} since tenant {tenant_id} is in broken state");
}
(_, TenantState::Broken) => {
debug!("Setting tenant {tenant_id} status to broken");
}
(TenantState::Stopping, ignored) => {
debug!("Ignoring {ignored} since tenant {tenant_id} is in stopping state");
}
(TenantState::Idle, TenantState::Active) => {
info!("activating tenant {tenant_id}");
let compactor_spawn_result = thread_mgr::spawn(
ThreadKind::Compactor, ThreadKind::Compactor,
Some(tenant_id), Some(tenant_id),
None, None,
"Compactor thread", "Compactor thread",
false, false,
move || crate::tenant_threads::compact_loop(tenant_id), move || crate::tenant_threads::compact_loop(tenant_id),
)?; );
if compactor_spawn_result.is_err() {
let mut m = tenants_state::write_tenants();
m.get_mut(&tenant_id)
.with_context(|| format!("Tenant not found for id {tenant_id}"))?
.state = old_state;
drop(m);
}
compactor_spawn_result?;
let gc_spawn_result = thread_mgr::spawn( let gc_spawn_result = thread_mgr::spawn(
ThreadKind::GarbageCollector, ThreadKind::GarbageCollector,
@@ -286,21 +359,31 @@ pub fn activate_tenant(tenant_id: ZTenantId) -> anyhow::Result<()> {
.with_context(|| format!("Failed to launch GC thread for tenant {tenant_id}")); .with_context(|| format!("Failed to launch GC thread for tenant {tenant_id}"));
if let Err(e) = &gc_spawn_result { if let Err(e) = &gc_spawn_result {
let mut m = tenants_state::write_tenants();
m.get_mut(&tenant_id)
.with_context(|| format!("Tenant not found for id {tenant_id}"))?
.state = old_state;
drop(m);
error!("Failed to start GC thread for tenant {tenant_id}, stopping its checkpointer thread: {e:?}"); error!("Failed to start GC thread for tenant {tenant_id}, stopping its checkpointer thread: {e:?}");
thread_mgr::shutdown_threads(Some(ThreadKind::Compactor), Some(tenant_id), None); thread_mgr::shutdown_threads(Some(ThreadKind::Compactor), Some(tenant_id), None);
return gc_spawn_result; return gc_spawn_result;
} }
tenant.state = TenantState::Active;
} }
(TenantState::Idle, TenantState::Stopping) => {
TenantState::Stopping => { info!("stopping idle tenant {tenant_id}");
// don't re-activate it if it's being stopped
} }
(TenantState::Active, TenantState::Stopping | TenantState::Idle) => {
TenantState::Broken => { info!("stopping tenant {tenant_id} threads due to new state {new_state}");
// cannot activate thread_mgr::shutdown_threads(
Some(ThreadKind::WalReceiverManager),
Some(tenant_id),
None,
);
thread_mgr::shutdown_threads(Some(ThreadKind::GarbageCollector), Some(tenant_id), None);
thread_mgr::shutdown_threads(Some(ThreadKind::Compactor), Some(tenant_id), None);
} }
} }
Ok(()) Ok(())
} }
@@ -325,15 +408,15 @@ pub fn get_local_timeline_with_load(
.with_context(|| format!("Tenant {tenant_id} not found"))?; .with_context(|| format!("Tenant {tenant_id} not found"))?;
if let Some(page_tline) = tenant.local_timelines.get(&timeline_id) { if let Some(page_tline) = tenant.local_timelines.get(&timeline_id) {
return Ok(Arc::clone(page_tline)); Ok(Arc::clone(page_tline))
} else {
let page_tline = load_local_timeline(&tenant.repo, timeline_id)
.with_context(|| format!("Failed to load local timeline for tenant {tenant_id}"))?;
tenant
.local_timelines
.insert(timeline_id, Arc::clone(&page_tline));
Ok(page_tline)
} }
let page_tline = load_local_timeline(&tenant.repo, timeline_id)
.with_context(|| format!("Failed to load local timeline for tenant {tenant_id}"))?;
tenant
.local_timelines
.insert(timeline_id, Arc::clone(&page_tline));
Ok(page_tline)
} }
pub fn detach_timeline( pub fn detach_timeline(
@@ -351,6 +434,9 @@ pub fn detach_timeline(
.detach_timeline(timeline_id) .detach_timeline(timeline_id)
.context("Failed to detach inmem tenant timeline")?; .context("Failed to detach inmem tenant timeline")?;
tenant.local_timelines.remove(&timeline_id); tenant.local_timelines.remove(&timeline_id);
tenants_state::try_send_timeline_update(LocalTimelineUpdate::Detach(
ZTenantTimelineId::new(tenant_id, timeline_id),
));
} }
None => bail!("Tenant {tenant_id} not found in local tenant state"), None => bail!("Tenant {tenant_id} not found in local tenant state"),
} }
@@ -379,6 +465,12 @@ fn load_local_timeline(
repartition_distance, repartition_distance,
)); ));
page_tline.init_logical_size()?; page_tline.init_logical_size()?;
tenants_state::try_send_timeline_update(LocalTimelineUpdate::Attach(
ZTenantTimelineId::new(repo.tenant_id(), timeline_id),
Arc::clone(&page_tline),
));
Ok(page_tline) Ok(page_tline)
} }
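The optional global sender that `tenants_state` gains above is a recurring pattern: set the channel exactly once at startup, send updates best-effort (a missing or closed receiver is logged, never an error for the caller), and drop the sender on shutdown. A minimal standalone sketch with std channels — all names here are illustrative, not the pageserver's API:

```rust
use std::sync::{mpsc, RwLock};

/// Hypothetical miniature of TIMELINE_UPDATE_SENDER's handling.
struct UpdateSender(RwLock<Option<mpsc::Sender<&'static str>>>);

impl UpdateSender {
    fn new() -> Self {
        Self(RwLock::new(None))
    }

    /// Install the sender exactly once, like set_timeline_update_sender.
    fn set(&self, tx: mpsc::Sender<&'static str>) -> Result<(), &'static str> {
        let mut guard = self.0.write().unwrap();
        if guard.is_some() {
            return Err("sender already set");
        }
        *guard = Some(tx);
        Ok(())
    }

    /// Best-effort send, like try_send_timeline_update: never an error
    /// for the caller, only a log line when the channel is gone.
    fn try_send(&self, update: &'static str) {
        match self.0.read().unwrap().as_ref() {
            Some(tx) => {
                let _ = tx.send(update);
            }
            None => eprintln!("sender not set, dropping {update}"),
        }
    }

    /// Drop the sender on shutdown, like stop_timeline_update_sender.
    fn stop(&self) {
        self.0.write().unwrap().take();
    }
}
```

The second `set` call failing (rather than silently replacing the sender) mirrors the `ensure!` in the diff: there is exactly one WAL receiver main thread, so a second registration indicates a bug.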


@@ -91,8 +91,8 @@ pub enum ThreadKind {
// associated with one later, after receiving a command from the client.
PageRequestHandler,
-// Thread that connects to a safekeeper to fetch WAL for one timeline.
-WalReceiver,
+// Main walreceiver manager thread that ensures that every timeline spawns a connection to safekeeper, to fetch WAL.
+WalReceiverManager,
// Thread that handles compaction of all timelines for a tenant.
Compactor,


@@ -283,8 +283,6 @@ fn bootstrap_timeline<R: Repository>(
tli: ZTimelineId,
repo: &R,
) -> Result<()> {
-let _enter = info_span!("bootstrapping", timeline = %tli, tenant = %tenantid).entered();
let initdb_path = conf
.tenant_path(&tenantid)
.join(format!("tmp-timeline-{}", tli));


@@ -336,7 +336,7 @@ impl VirtualFile {
// library RwLock doesn't allow downgrading without releasing the lock,
// and that doesn't seem worth the trouble.
//
-// XXX: `parking_lot::RwLock` can enable such downgrades, yet its implemenation is fair and
+// XXX: `parking_lot::RwLock` can enable such downgrades, yet its implementation is fair and
// may deadlock on subsequent read calls.
// Simply replacing all `RwLock` in project causes deadlocks, so use it sparingly.
let result = STORAGE_IO_TIME


@@ -12,7 +12,7 @@
//! The zenith Repository can store page versions in two formats: as
//! page images, or a WAL records. WalIngest::ingest_record() extracts
//! page images out of some WAL records, but most it stores as WAL
-//! records. If a WAL record modifies multple pages, WalIngest
+//! records. If a WAL record modifies multiple pages, WalIngest
//! will call Repository::put_wal_record or put_page_image functions
//! separately for each modified page.
//!

File diff suppressed because it is too large.


@@ -0,0 +1,405 @@
//! Actual Postgres connection handler that streams WAL to the pageserver.
//! Runs as a separate, cancellable Tokio task.
use std::{
str::FromStr,
sync::Arc,
time::{Duration, SystemTime},
};
use anyhow::{bail, ensure, Context};
use bytes::BytesMut;
use fail::fail_point;
use postgres::{SimpleQueryMessage, SimpleQueryRow};
use postgres_ffi::waldecoder::WalStreamDecoder;
use postgres_protocol::message::backend::ReplicationMessage;
use postgres_types::PgLsn;
use tokio::{pin, select, sync::watch, time};
use tokio_postgres::{replication::ReplicationStream, Client};
use tokio_stream::StreamExt;
use tracing::{debug, error, info, info_span, trace, warn, Instrument};
use utils::{
lsn::Lsn,
pq_proto::ZenithFeedback,
zid::{NodeId, ZTenantTimelineId},
};
use crate::{
http::models::WalReceiverEntry,
repository::{Repository, Timeline},
tenant_mgr,
walingest::WalIngest,
};
#[derive(Debug, Clone)]
pub enum WalConnectionEvent {
Started,
NewWal(ZenithFeedback),
End(Result<(), String>),
}
/// A wrapper around standalone Tokio task, to poll its updates or cancel the task.
#[derive(Debug)]
pub struct WalReceiverConnection {
handle: tokio::task::JoinHandle<()>,
cancellation: watch::Sender<()>,
events_receiver: watch::Receiver<WalConnectionEvent>,
}
impl WalReceiverConnection {
/// Initializes the connection task, returning a set of handles on top of it.
/// The task is started immediately after creation, and fails if no connection is established within the given timeout.
pub fn open(
id: ZTenantTimelineId,
safekeeper_id: NodeId,
wal_producer_connstr: String,
connect_timeout: Duration,
) -> Self {
let (cancellation, mut cancellation_receiver) = watch::channel(());
let (events_sender, events_receiver) = watch::channel(WalConnectionEvent::Started);
let handle = tokio::spawn(
async move {
let connection_result = handle_walreceiver_connection(
id,
&wal_producer_connstr,
&events_sender,
&mut cancellation_receiver,
connect_timeout,
)
.await
.map_err(|e| {
format!("Walreceiver connection for id {id} failed with error: {e:#}")
});
match &connection_result {
Ok(()) => {
debug!("Walreceiver connection for id {id} ended successfully")
}
Err(e) => warn!("{e}"),
}
events_sender
.send(WalConnectionEvent::End(connection_result))
.ok();
}
.instrument(info_span!("safekeeper_handle", sk = %safekeeper_id)),
);
Self {
handle,
cancellation,
events_receiver,
}
}
/// Polls for the next WAL receiver event, if there's any available since the last check.
/// Blocks if there's no new event available; returns `None` if no new events will ever occur.
/// Only the last event is returned; any events received between observations are lost.
pub async fn next_event(&mut self) -> Option<WalConnectionEvent> {
match self.events_receiver.changed().await {
Ok(()) => Some(self.events_receiver.borrow().clone()),
Err(_cancellation_error) => None,
}
}
/// Gracefully aborts the current WAL streaming task, waiting for the WAL currently being streamed to finish.
pub async fn shutdown(&mut self) -> anyhow::Result<()> {
self.cancellation.send(()).ok();
let handle = &mut self.handle;
handle
.await
.context("Failed to join on a walreceiver connection task")?;
Ok(())
}
}
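The `open`/`shutdown` pair above follows a common handle shape: bundle a cancellation sender with the task's join handle, and on shutdown signal first, then join. A hypothetical std-threads analogue of that shape (not the tokio code above, which uses a `watch` channel and `JoinHandle` instead):

```rust
use std::sync::mpsc;
use std::thread;

/// Illustrative worker handle: cancellation sender + join handle.
struct Worker {
    cancel: mpsc::Sender<()>,
    handle: thread::JoinHandle<u32>,
}

impl Worker {
    fn spawn() -> Self {
        let (cancel, cancelled) = mpsc::channel::<()>();
        let handle = thread::spawn(move || {
            let mut processed = 0u32;
            // Poll for cancellation between units of work, as the tokio
            // task does with `cancellation.changed()` in its select! loop.
            loop {
                match cancelled.try_recv() {
                    Ok(()) | Err(mpsc::TryRecvError::Disconnected) => break,
                    Err(mpsc::TryRecvError::Empty) => processed += 1,
                }
                if processed >= 100 {
                    break; // stand-in for "the stream ended on its own"
                }
            }
            processed
        });
        Worker { cancel, handle }
    }

    /// Like WalReceiverConnection::shutdown: signal, then join.
    fn shutdown(self) -> u32 {
        self.cancel.send(()).ok();
        self.handle.join().expect("worker panicked")
    }
}
```

The `.ok()` on the cancel send matters in both versions: if the task already exited, the receiver is gone and the send fails, but shutdown should still proceed to the join.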
async fn handle_walreceiver_connection(
id: ZTenantTimelineId,
wal_producer_connstr: &str,
events_sender: &watch::Sender<WalConnectionEvent>,
cancellation: &mut watch::Receiver<()>,
connect_timeout: Duration,
) -> anyhow::Result<()> {
// Connect to the database in replication mode.
info!("connecting to {wal_producer_connstr}");
let connect_cfg =
format!("{wal_producer_connstr} application_name=pageserver replication=true");
let (mut replication_client, connection) = time::timeout(
connect_timeout,
tokio_postgres::connect(&connect_cfg, postgres::NoTls),
)
.await
.context("Timed out while waiting for walreceiver connection to open")?
.context("Failed to open walreceiver connection")?;
// The connection object performs the actual communication with the database,
// so spawn it off to run on its own.
let mut connection_cancellation = cancellation.clone();
tokio::spawn(
async move {
info!("connected!");
select! {
connection_result = connection => match connection_result{
Ok(()) => info!("Walreceiver db connection closed"),
Err(connection_error) => {
if connection_error.is_closed() {
info!("Connection closed regularly: {connection_error}")
} else {
warn!("Connection aborted: {connection_error}")
}
}
},
_ = connection_cancellation.changed() => info!("Connection cancelled"),
}
}
.instrument(info_span!("safekeeper_handle_db")),
);
// Immediately increment the gauge, then create a job to decrement it on task exit.
// One of the pros of `defer!` is that this will *most probably*
// get called, even in presence of panics.
let gauge = crate::LIVE_CONNECTIONS_COUNT.with_label_values(&["wal_receiver"]);
gauge.inc();
scopeguard::defer! {
gauge.dec();
}
let identify = identify_system(&mut replication_client).await?;
info!("{identify:?}");
let end_of_wal = Lsn::from(u64::from(identify.xlogpos));
let mut caught_up = false;
let ZTenantTimelineId {
tenant_id,
timeline_id,
} = id;
let (repo, timeline) = tokio::task::spawn_blocking(move || {
let repo = tenant_mgr::get_repository_for_tenant(tenant_id)
.with_context(|| format!("no repository found for tenant {tenant_id}"))?;
let timeline = tenant_mgr::get_local_timeline_with_load(tenant_id, timeline_id)
.with_context(|| {
format!("local timeline {timeline_id} not found for tenant {tenant_id}")
})?;
Ok::<_, anyhow::Error>((repo, timeline))
})
.await
.with_context(|| format!("Failed to spawn blocking task to get repository and timeline for tenant {tenant_id} timeline {timeline_id}"))??;
//
// Start streaming the WAL, from where we left off previously.
//
// If we had previously received WAL up to some point in the middle of a WAL record, we
// better start from the end of last full WAL record, not in the middle of one.
let mut last_rec_lsn = timeline.get_last_record_lsn();
let mut startpoint = last_rec_lsn;
if startpoint == Lsn(0) {
bail!("No previous WAL position");
}
// There might be some padding after the last full record, skip it.
startpoint += startpoint.calc_padding(8u32);
info!("last_record_lsn {last_rec_lsn} starting replication from {startpoint}, server is at {end_of_wal}...");
let query = format!("START_REPLICATION PHYSICAL {startpoint}");
let copy_stream = replication_client.copy_both_simple(&query).await?;
let physical_stream = ReplicationStream::new(copy_stream);
pin!(physical_stream);
let mut waldecoder = WalStreamDecoder::new(startpoint);
let mut walingest = WalIngest::new(timeline.as_ref(), startpoint)?;
while let Some(replication_message) = {
select! {
// check for shutdown first
biased;
_ = cancellation.changed() => {
info!("walreceiver interrupted");
None
}
replication_message = physical_stream.next() => replication_message,
}
} {
let replication_message = replication_message?;
let status_update = match replication_message {
ReplicationMessage::XLogData(xlog_data) => {
// Pass the WAL data to the decoder, and see if we can decode
// more records as a result.
let data = xlog_data.data();
let startlsn = Lsn::from(xlog_data.wal_start());
let endlsn = startlsn + data.len() as u64;
trace!("received XLogData between {startlsn} and {endlsn}");
waldecoder.feed_bytes(data);
while let Some((lsn, recdata)) = waldecoder.poll_decode()? {
let _enter = info_span!("processing record", lsn = %lsn).entered();
// It is important to deal with the aligned records as lsn in getPage@LSN is
// aligned and can be several bytes bigger. Without this alignment we are
// at risk of hitting a deadlock.
ensure!(lsn.is_aligned());
walingest.ingest_record(&timeline, recdata, lsn)?;
fail_point!("walreceiver-after-ingest");
last_rec_lsn = lsn;
}
if !caught_up && endlsn >= end_of_wal {
info!("caught up at LSN {endlsn}");
caught_up = true;
}
let timeline_to_check = Arc::clone(&timeline.tline);
tokio::task::spawn_blocking(move || timeline_to_check.check_checkpoint_distance())
.await
.with_context(|| {
format!("Spawned checkpoint check task panicked for timeline {id}")
})?
.with_context(|| {
format!("Failed to check checkpoint distance for timeline {id}")
})?;
Some(endlsn)
}
ReplicationMessage::PrimaryKeepAlive(keepalive) => {
let wal_end = keepalive.wal_end();
let timestamp = keepalive.timestamp();
let reply_requested = keepalive.reply() != 0;
trace!("received PrimaryKeepAlive(wal_end: {wal_end}, timestamp: {timestamp:?} reply: {reply_requested})");
if reply_requested {
Some(last_rec_lsn)
} else {
None
}
}
_ => None,
};
if let Some(last_lsn) = status_update {
let remote_index = repo.get_remote_index();
let timeline_remote_consistent_lsn = remote_index
.read()
.await
// here we either do not have this timeline in remote index
// or there were no checkpoints for it yet
.timeline_entry(&ZTenantTimelineId {
tenant_id,
timeline_id,
})
.map(|remote_timeline| remote_timeline.metadata.disk_consistent_lsn())
// no checkpoint was uploaded
.unwrap_or(Lsn(0));
// The last LSN we processed. It is not guaranteed to survive pageserver crash.
let write_lsn = u64::from(last_lsn);
// `disk_consistent_lsn` is the LSN at which page server guarantees local persistence of all received data
let flush_lsn = u64::from(timeline.tline.get_disk_consistent_lsn());
// The last LSN that is synced to remote storage and is guaranteed to survive pageserver crash
// Used by safekeepers to remove WAL preceding `remote_consistent_lsn`.
let apply_lsn = u64::from(timeline_remote_consistent_lsn);
let ts = SystemTime::now();
// Update the current WAL receiver's data stored inside the global hash table `WAL_RECEIVERS`
{
super::WAL_RECEIVER_ENTRIES.write().await.insert(
id,
WalReceiverEntry {
wal_producer_connstr: Some(wal_producer_connstr.to_owned()),
last_received_msg_lsn: Some(last_lsn),
last_received_msg_ts: Some(
ts.duration_since(SystemTime::UNIX_EPOCH)
.expect("Received message time should be after UNIX EPOCH!")
.as_micros(),
),
},
);
}
// Send zenith feedback message.
// Regular standby_status_update fields are put into this message.
let zenith_status_update = ZenithFeedback {
current_timeline_size: timeline.get_current_logical_size() as u64,
ps_writelsn: write_lsn,
ps_flushlsn: flush_lsn,
ps_applylsn: apply_lsn,
ps_replytime: ts,
};
debug!("zenith_status_update {zenith_status_update:?}");
let mut data = BytesMut::new();
zenith_status_update.serialize(&mut data)?;
physical_stream
.as_mut()
.zenith_status_update(data.len() as u64, &data)
.await?;
if let Err(e) = events_sender.send(WalConnectionEvent::NewWal(zenith_status_update)) {
warn!("Wal connection event listener dropped, aborting the connection: {e}");
return Ok(());
}
}
}
Ok(())
}
/// Data returned from the postgres `IDENTIFY_SYSTEM` command
///
/// See the [postgres docs] for more details.
///
/// [postgres docs]: https://www.postgresql.org/docs/current/protocol-replication.html
#[derive(Debug)]
// As of nightly 2021-09-11, fields that are only read by the type's `Debug` impl still count as
// unused. Relevant issue: https://github.com/rust-lang/rust/issues/88900
#[allow(dead_code)]
struct IdentifySystem {
systemid: u64,
timeline: u32,
xlogpos: PgLsn,
dbname: Option<String>,
}
/// There was a problem parsing the response to
/// a postgres IDENTIFY_SYSTEM command.
#[derive(Debug, thiserror::Error)]
#[error("IDENTIFY_SYSTEM parse error")]
struct IdentifyError;
/// Run the postgres `IDENTIFY_SYSTEM` command
async fn identify_system(client: &mut Client) -> anyhow::Result<IdentifySystem> {
let query_str = "IDENTIFY_SYSTEM";
let response = client.simple_query(query_str).await?;
// get(N) from row, then parse it as some destination type.
fn get_parse<T>(row: &SimpleQueryRow, idx: usize) -> Result<T, IdentifyError>
where
T: FromStr,
{
let val = row.get(idx).ok_or(IdentifyError)?;
val.parse::<T>().or(Err(IdentifyError))
}
// extract the row contents into an IdentifySystem struct.
// written as a closure so I can use ? for Option here.
if let Some(SimpleQueryMessage::Row(first_row)) = response.get(0) {
Ok(IdentifySystem {
systemid: get_parse(first_row, 0)?,
timeline: get_parse(first_row, 1)?,
xlogpos: get_parse(first_row, 2)?,
dbname: get_parse(first_row, 3).ok(),
})
} else {
Err(IdentifyError.into())
}
}
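Before issuing `START_REPLICATION`, the code above skips padding with `startpoint += startpoint.calc_padding(8u32)` and later asserts `lsn.is_aligned()` on every decoded record. A sketch of what those helpers are assumed to compute (free functions here; the real methods live on the `Lsn` type in the `utils` crate):

```rust
/// Distance from `lsn` to the next multiple of `align`
/// (0 when already aligned) -- the assumed calc_padding semantics.
fn calc_padding(lsn: u64, align: u64) -> u64 {
    (align - lsn % align) % align
}

/// WAL record LSNs are kept 8-byte aligned; getPage@LSN requests use
/// aligned LSNs too, which is why the ingest loop insists on alignment.
fn is_aligned(lsn: u64) -> bool {
    lsn % 8 == 0
}
```

Adding the padding to an unaligned start position always lands on an aligned boundary, which is exactly the invariant the `ensure!(lsn.is_aligned())` check later relies on.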


@@ -28,6 +28,7 @@ use std::fs::OpenOptions;
use std::io::prelude::*;
use std::io::{Error, ErrorKind};
use std::os::unix::io::AsRawFd;
+use std::os::unix::prelude::CommandExt;
use std::path::PathBuf;
use std::process::Stdio;
use std::process::{Child, ChildStderr, ChildStdin, ChildStdout, Command};
@@ -122,7 +123,7 @@ lazy_static! {
///
/// This is the real implementation that uses a Postgres process to
-/// perform WAL replay. Only one thread can use the processs at a time,
+/// perform WAL replay. Only one thread can use the process at a time,
/// that is controlled by the Mutex. In the future, we might want to
/// launch a pool of processes to allow concurrent replay of multiple
/// records.
@@ -134,7 +135,7 @@ pub struct PostgresRedoManager {
process: Mutex<Option<PostgresRedoProcess>>,
}
-/// Can this request be served by zenith redo funcitons
+/// Can this request be served by zenith redo functions
/// or we need to pass it to wal-redo postgres process?
fn can_apply_in_zenith(rec: &ZenithWalRecord) -> bool {
// Currently, we don't have bespoken Rust code to replay any
@@ -554,6 +555,40 @@ impl PostgresRedoManager {
}
}
///
/// Command with ability not to give all file descriptors to child process
///
trait CloseFileDescriptors: CommandExt {
///
/// Close file descriptors (other than stdin, stdout, stderr) in child process
///
fn close_fds(&mut self) -> &mut Command;
}
impl<C: CommandExt> CloseFileDescriptors for C {
fn close_fds(&mut self) -> &mut Command {
unsafe {
self.pre_exec(move || {
// SAFETY: Code executed inside pre_exec should have async-signal-safety,
// which means it should be safe to execute inside a signal handler.
// The precise meaning depends on platform. See `man signal-safety`
// for the linux definition.
//
// The set_fds_cloexec_threadsafe function is documented to be
// async-signal-safe.
//
// Aside from this function, the rest of the code is re-entrant and
// doesn't make any syscalls. We're just passing constants.
//
// NOTE: It's easy to indirectly cause a malloc or lock a mutex,
// which is not async-signal-safe. Be careful.
close_fds::set_fds_cloexec_threadsafe(3, &[]);
Ok(())
})
}
}
}
///
/// Handle to the Postgres WAL redo process
///
@@ -607,9 +642,10 @@ impl PostgresRedoProcess {
.open(PathBuf::from(&datadir).join("postgresql.conf"))?;
config.write_all(b"shared_buffers=128kB\n")?;
config.write_all(b"fsync=off\n")?;
-config.write_all(b"shared_preload_libraries=zenith\n")?;
-config.write_all(b"zenith.wal_redo=on\n")?;
+config.write_all(b"shared_preload_libraries=neon\n")?;
+config.write_all(b"neon.wal_redo=on\n")?;
}
// Start postgres itself
let mut child = Command::new(conf.pg_bin_dir().join("postgres"))
.arg("--wal-redo")
@@ -620,6 +656,19 @@
.env("LD_LIBRARY_PATH", conf.pg_lib_dir())
.env("DYLD_LIBRARY_PATH", conf.pg_lib_dir())
.env("PGDATA", &datadir)
// The redo process is not trusted, so it runs in seccomp mode
// (see seccomp in zenith_wal_redo.c). We have to make sure it doesn't
// inherit any file descriptors from the pageserver that would allow
// an attacker to do bad things.
//
// The Rust standard library makes sure to mark any file descriptors with
// as close-on-exec by default, but that's not enough, since we use
// libraries that directly call libc open without setting that flag.
//
// One example is the pidfile of the daemonize library, which doesn't
// currently mark file descriptors as close-on-exec. Either way, we
// want to be on the safe side and prevent accidental regression.
.close_fds()
.spawn()
.map_err(|e| {
Error::new(


@@ -1,56 +1,58 @@
mod credentials; //! Client authentication mechanisms.
mod flow;
use crate::auth_backend::{console, legacy_console, link, postgres}; pub mod backend;
use crate::config::{AuthBackendType, ProxyConfig}; pub use backend::DatabaseInfo;
use crate::error::UserFacingError;
use crate::stream::PqStream; mod credentials;
use crate::{auth_backend, compute, waiters}; pub use credentials::ClientCredentials;
use console::ConsoleAuthError::SniMissing;
mod flow;
pub use flow::*;
use crate::{error::UserFacingError, waiters};
use std::io; use std::io;
use thiserror::Error; use thiserror::Error;
use tokio::io::{AsyncRead, AsyncWrite};
pub use credentials::ClientCredentials; /// Convenience wrapper for the authentication error.
pub use flow::*; pub type Result<T> = std::result::Result<T, AuthError>;
 /// Common authentication error.
 #[derive(Debug, Error)]
 pub enum AuthErrorImpl {
     /// Authentication error reported by the console.
     #[error(transparent)]
-    Console(#[from] auth_backend::AuthError),
+    Console(#[from] backend::AuthError),

     #[error(transparent)]
-    GetAuthInfo(#[from] auth_backend::console::ConsoleAuthError),
+    GetAuthInfo(#[from] backend::console::ConsoleAuthError),

     #[error(transparent)]
     Sasl(#[from] crate::sasl::Error),

-    /// For passwords that couldn't be processed by [`parse_password`].
+    /// For passwords that couldn't be processed by [`backend::legacy_console::parse_password`].
     #[error("Malformed password message")]
     MalformedPassword,

-    /// Errors produced by [`PqStream`].
+    /// Errors produced by [`crate::stream::PqStream`].
     #[error(transparent)]
     Io(#[from] io::Error),
 }

 impl AuthErrorImpl {
     pub fn auth_failed(msg: impl Into<String>) -> Self {
-        AuthErrorImpl::Console(auth_backend::AuthError::auth_failed(msg))
+        Self::Console(backend::AuthError::auth_failed(msg))
     }
 }

 impl From<waiters::RegisterError> for AuthErrorImpl {
     fn from(e: waiters::RegisterError) -> Self {
-        AuthErrorImpl::Console(auth_backend::AuthError::from(e))
+        Self::Console(backend::AuthError::from(e))
     }
 }

 impl From<waiters::WaitError> for AuthErrorImpl {
     fn from(e: waiters::WaitError) -> Self {
-        AuthErrorImpl::Console(auth_backend::AuthError::from(e))
+        Self::Console(backend::AuthError::from(e))
     }
 }
@@ -63,7 +65,7 @@ where
     AuthErrorImpl: From<T>,
 {
     fn from(e: T) -> Self {
-        AuthError(Box::new(e.into()))
+        Self(Box::new(e.into()))
     }
 }
@@ -72,34 +74,10 @@ impl UserFacingError for AuthError {
         use AuthErrorImpl::*;
         match self.0.as_ref() {
             Console(e) => e.to_string_client(),
+            GetAuthInfo(e) => e.to_string_client(),
+            Sasl(e) => e.to_string_client(),
             MalformedPassword => self.to_string(),
-            GetAuthInfo(e) if matches!(e, SniMissing) => e.to_string(),
             _ => "Internal error".to_string(),
         }
     }
 }
-
-async fn handle_user(
-    config: &ProxyConfig,
-    client: &mut PqStream<impl AsyncRead + AsyncWrite + Unpin + Send>,
-    creds: ClientCredentials,
-) -> Result<compute::NodeInfo, AuthError> {
-    match config.auth_backend {
-        AuthBackendType::LegacyConsole => {
-            legacy_console::handle_user(
-                &config.auth_endpoint,
-                &config.auth_link_uri,
-                client,
-                &creds,
-            )
-            .await
-        }
-        AuthBackendType::Console => {
-            console::handle_user(config.auth_endpoint.as_ref(), client, &creds).await
-        }
-        AuthBackendType::Postgres => {
-            postgres::handle_user(&config.auth_endpoint, client, &creds).await
-        }
-        AuthBackendType::Link => link::handle_user(config.auth_link_uri.as_ref(), client).await,
-    }
-}

proxy/src/auth/backend.rs (new file, 109 lines)

@@ -0,0 +1,109 @@
mod legacy_console;
mod link;
mod postgres;
pub mod console;
pub use legacy_console::{AuthError, AuthErrorImpl};
use super::ClientCredentials;
use crate::{
compute,
config::{AuthBackendType, ProxyConfig},
mgmt,
stream::PqStream,
waiters::{self, Waiter, Waiters},
};
use lazy_static::lazy_static;
use serde::{Deserialize, Serialize};
use tokio::io::{AsyncRead, AsyncWrite};
lazy_static! {
static ref CPLANE_WAITERS: Waiters<mgmt::ComputeReady> = Default::default();
}
/// Give caller an opportunity to wait for the cloud's reply.
pub async fn with_waiter<R, T, E>(
psql_session_id: impl Into<String>,
action: impl FnOnce(Waiter<'static, mgmt::ComputeReady>) -> R,
) -> Result<T, E>
where
R: std::future::Future<Output = Result<T, E>>,
E: From<waiters::RegisterError>,
{
let waiter = CPLANE_WAITERS.register(psql_session_id.into())?;
action(waiter).await
}
pub fn notify(psql_session_id: &str, msg: mgmt::ComputeReady) -> Result<(), waiters::NotifyError> {
CPLANE_WAITERS.notify(psql_session_id, msg)
}
/// Compute node connection params provided by the cloud.
/// Note how it implements serde traits, since we receive it over the wire.
#[derive(Serialize, Deserialize, Default)]
pub struct DatabaseInfo {
pub host: String,
pub port: u16,
pub dbname: String,
pub user: String,
pub password: Option<String>,
}
// Manually implement debug to omit personal and sensitive info.
impl std::fmt::Debug for DatabaseInfo {
fn fmt(&self, fmt: &mut std::fmt::Formatter) -> std::fmt::Result {
fmt.debug_struct("DatabaseInfo")
.field("host", &self.host)
.field("port", &self.port)
.finish()
}
}
impl From<DatabaseInfo> for tokio_postgres::Config {
fn from(db_info: DatabaseInfo) -> Self {
let mut config = tokio_postgres::Config::new();
config
.host(&db_info.host)
.port(db_info.port)
.dbname(&db_info.dbname)
.user(&db_info.user);
if let Some(password) = db_info.password {
config.password(password);
}
config
}
}
pub(super) async fn handle_user(
config: &ProxyConfig,
client: &mut PqStream<impl AsyncRead + AsyncWrite + Unpin + Send>,
creds: ClientCredentials,
) -> super::Result<compute::NodeInfo> {
use AuthBackendType::*;
match config.auth_backend {
LegacyConsole => {
legacy_console::handle_user(
&config.auth_endpoint,
&config.auth_link_uri,
client,
&creds,
)
.await
}
Console => {
console::Api::new(&config.auth_endpoint, &creds)?
.handle_user(client)
.await
}
Postgres => {
postgres::Api::new(&config.auth_endpoint, &creds)?
.handle_user(client)
.await
}
Link => link::handle_user(&config.auth_link_uri, client).await,
}
}
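The `with_waiter`/`notify` pair above lets one task park on a psql session id until the management channel delivers the compute-ready message for that id. The real `Waiters` type lives in the proxy's `waiters` module and is asynchronous; the following is only a synchronous stand-in sketching the same register/notify contract with standard-library channels:

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Mutex;

/// Minimal stand-in for the proxy's `Waiters` registry: callers register
/// under a session id, and a later `notify` for that id delivers the value.
struct Waiters<T> {
    pending: Mutex<HashMap<String, Sender<T>>>,
}

impl<T> Waiters<T> {
    fn new() -> Self {
        Self { pending: Mutex::new(HashMap::new()) }
    }

    /// Register a waiter; the returned receiver completes once `notify` runs.
    fn register(&self, id: String) -> Receiver<T> {
        let (tx, rx) = channel();
        self.pending.lock().unwrap().insert(id, tx);
        rx
    }

    /// Deliver a message to the waiter for `id`, if any is registered.
    fn notify(&self, id: &str, msg: T) -> Result<(), &'static str> {
        match self.pending.lock().unwrap().remove(id) {
            Some(tx) => tx.send(msg).map_err(|_| "waiter dropped"),
            None => Err("no waiter registered"),
        }
    }
}

fn main() {
    let waiters = Waiters::new();
    let rx = waiters.register("session-42".to_owned());
    waiters.notify("session-42", "compute-ready").unwrap();
    assert_eq!(rx.recv().unwrap(), "compute-ready");
}
```

The one-shot semantics (the sender is removed on `notify`) mirror how each psql session id is waited on exactly once during link authentication.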


@@ -0,0 +1,225 @@
//! Cloud API V2.
use crate::{
auth::{self, AuthFlow, ClientCredentials, DatabaseInfo},
compute,
error::UserFacingError,
scram,
stream::PqStream,
url::ApiUrl,
};
use serde::{Deserialize, Serialize};
use std::{future::Future, io};
use thiserror::Error;
use tokio::io::{AsyncRead, AsyncWrite};
use utils::pq_proto::{BeMessage as Be, BeParameterStatusMessage};
pub type Result<T> = std::result::Result<T, ConsoleAuthError>;
#[derive(Debug, Error)]
pub enum ConsoleAuthError {
#[error(transparent)]
BadProjectName(#[from] auth::credentials::ProjectNameError),
// We shouldn't include the actual secret here.
#[error("Bad authentication secret")]
BadSecret,
#[error("Console responded with a malformed compute address: '{0}'")]
BadComputeAddress(String),
#[error("Console responded with a malformed JSON: '{0}'")]
BadResponse(#[from] serde_json::Error),
/// HTTP status (other than 200) returned by the console.
#[error("Console responded with an HTTP status: {0}")]
HttpStatus(reqwest::StatusCode),
#[error(transparent)]
Io(#[from] std::io::Error),
}
impl UserFacingError for ConsoleAuthError {
fn to_string_client(&self) -> String {
use ConsoleAuthError::*;
match self {
BadProjectName(e) => e.to_string_client(),
_ => "Internal error".to_string(),
}
}
}
// TODO: convert into an enum with "error"
#[derive(Serialize, Deserialize, Debug)]
struct GetRoleSecretResponse {
role_secret: String,
}
// TODO: convert into an enum with "error"
#[derive(Serialize, Deserialize, Debug)]
struct GetWakeComputeResponse {
address: String,
}
/// Auth secret which is managed by the cloud.
pub enum AuthInfo {
/// Md5 hash of user's password.
Md5([u8; 16]),
/// [SCRAM](crate::scram) authentication info.
Scram(scram::ServerSecret),
}
#[must_use]
pub(super) struct Api<'a> {
endpoint: &'a ApiUrl,
creds: &'a ClientCredentials,
/// Cache project name, since we'll need it several times.
project: &'a str,
}
impl<'a> Api<'a> {
/// Construct an API object containing the auth parameters.
pub(super) fn new(endpoint: &'a ApiUrl, creds: &'a ClientCredentials) -> Result<Self> {
Ok(Self {
endpoint,
creds,
project: creds.project_name()?,
})
}
/// Authenticate the existing user or throw an error.
pub(super) async fn handle_user(
self,
client: &mut PqStream<impl AsyncRead + AsyncWrite + Unpin + Send>,
) -> auth::Result<compute::NodeInfo> {
handle_user(client, &self, Self::get_auth_info, Self::wake_compute).await
}
async fn get_auth_info(&self) -> Result<AuthInfo> {
let mut url = self.endpoint.clone();
url.path_segments_mut().push("proxy_get_role_secret");
url.query_pairs_mut()
.append_pair("project", self.project)
.append_pair("role", &self.creds.user);
// TODO: use a proper logger
println!("cplane request: {url}");
let resp = reqwest::get(url.into_inner()).await.map_err(io_error)?;
if !resp.status().is_success() {
return Err(ConsoleAuthError::HttpStatus(resp.status()));
}
let response: GetRoleSecretResponse =
serde_json::from_str(&resp.text().await.map_err(io_error)?)?;
scram::ServerSecret::parse(response.role_secret.as_str())
.map(AuthInfo::Scram)
.ok_or(ConsoleAuthError::BadSecret)
}
/// Wake up the compute node and return the corresponding connection info.
async fn wake_compute(&self) -> Result<DatabaseInfo> {
let mut url = self.endpoint.clone();
url.path_segments_mut().push("proxy_wake_compute");
url.query_pairs_mut().append_pair("project", self.project);
// TODO: use a proper logger
println!("cplane request: {url}");
let resp = reqwest::get(url.into_inner()).await.map_err(io_error)?;
if !resp.status().is_success() {
return Err(ConsoleAuthError::HttpStatus(resp.status()));
}
let response: GetWakeComputeResponse =
serde_json::from_str(&resp.text().await.map_err(io_error)?)?;
let (host, port) = parse_host_port(&response.address)
.ok_or(ConsoleAuthError::BadComputeAddress(response.address))?;
Ok(DatabaseInfo {
host,
port,
dbname: self.creds.dbname.to_owned(),
user: self.creds.user.to_owned(),
password: None,
})
}
}
/// Common logic for user handling in API V2.
/// We reuse this for a mock API implementation in [`super::postgres`].
pub(super) async fn handle_user<'a, Endpoint, GetAuthInfo, WakeCompute>(
client: &mut PqStream<impl AsyncRead + AsyncWrite + Unpin>,
endpoint: &'a Endpoint,
get_auth_info: impl FnOnce(&'a Endpoint) -> GetAuthInfo,
wake_compute: impl FnOnce(&'a Endpoint) -> WakeCompute,
) -> auth::Result<compute::NodeInfo>
where
GetAuthInfo: Future<Output = Result<AuthInfo>>,
WakeCompute: Future<Output = Result<DatabaseInfo>>,
{
let auth_info = get_auth_info(endpoint).await?;
let flow = AuthFlow::new(client);
let scram_keys = match auth_info {
AuthInfo::Md5(_) => {
// TODO: decide if we should support MD5 in api v2
return Err(auth::AuthErrorImpl::auth_failed("MD5 is not supported").into());
}
AuthInfo::Scram(secret) => {
let scram = auth::Scram(&secret);
Some(compute::ScramKeys {
client_key: flow.begin(scram).await?.authenticate().await?.as_bytes(),
server_key: secret.server_key.as_bytes(),
})
}
};
client
.write_message_noflush(&Be::AuthenticationOk)?
.write_message_noflush(&BeParameterStatusMessage::encoding())?;
Ok(compute::NodeInfo {
db_info: wake_compute(endpoint).await?,
scram_keys,
})
}
/// Upcast (almost) any error into an opaque [`io::Error`].
pub(super) fn io_error(e: impl Into<Box<dyn std::error::Error + Send + Sync>>) -> io::Error {
io::Error::new(io::ErrorKind::Other, e)
}
fn parse_host_port(input: &str) -> Option<(String, u16)> {
let (host, port) = input.split_once(':')?;
Some((host.to_owned(), port.parse().ok()?))
}
#[cfg(test)]
mod tests {
use super::*;
use serde_json::json;
#[test]
fn parse_db_info() -> anyhow::Result<()> {
let _: DatabaseInfo = serde_json::from_value(json!({
"host": "localhost",
"port": 5432,
"dbname": "postgres",
"user": "john_doe",
"password": "password",
}))?;
let _: DatabaseInfo = serde_json::from_value(json!({
"host": "localhost",
"port": 5432,
"dbname": "postgres",
"user": "john_doe",
}))?;
Ok(())
}
}
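The `parse_host_port` helper from this file is small enough to check in isolation. A standalone sketch of the same split-and-parse logic (plain Rust, no proxy dependencies):

```rust
/// Split a "host:port" string into its parts, as in the console code above.
/// Returns None when the separator is missing or the port is not a valid u16.
fn parse_host_port(input: &str) -> Option<(String, u16)> {
    let (host, port) = input.split_once(':')?;
    Some((host.to_owned(), port.parse().ok()?))
}

fn main() {
    assert_eq!(
        parse_host_port("compute.example.org:5432"),
        Some(("compute.example.org".to_owned(), 5432))
    );
    // A missing separator or an unparsable port yields None, not a panic.
    assert_eq!(parse_host_port("localhost"), None);
    assert_eq!(parse_host_port("localhost:not-a-port"), None);
}
```

Note that `split_once(':')` splits at the first colon, so a bare IPv6 address such as `::1:5432` would not round-trip; the console is assumed to hand back `hostname:port` here.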


@@ -1,20 +1,18 @@
 //! Cloud API V1.

-use super::console::DatabaseInfo;
-use crate::auth::ClientCredentials;
-use crate::stream::PqStream;
-use crate::{compute, waiters};
+use super::DatabaseInfo;
+use crate::{
+    auth::{self, ClientCredentials},
+    compute,
+    error::UserFacingError,
+    stream::PqStream,
+    waiters,
+};
 use serde::{Deserialize, Serialize};
+use thiserror::Error;
 use tokio::io::{AsyncRead, AsyncWrite};
 use utils::pq_proto::{BeMessage as Be, BeParameterStatusMessage};
-use thiserror::Error;
-use crate::error::UserFacingError;

 #[derive(Debug, Error)]
 pub enum AuthErrorImpl {
     /// Authentication error reported by the console.
@@ -45,7 +43,7 @@ pub struct AuthError(Box<AuthErrorImpl>);
 impl AuthError {
     /// Smart constructor for authentication error reported by `mgmt`.
     pub fn auth_failed(msg: impl Into<String>) -> Self {
-        AuthError(Box::new(AuthErrorImpl::AuthFailed(msg.into())))
+        Self(Box::new(AuthErrorImpl::AuthFailed(msg.into())))
     }
 }
@@ -54,7 +52,7 @@ where
     AuthErrorImpl: From<T>,
 {
     fn from(e: T) -> Self {
-        AuthError(Box::new(e.into()))
+        Self(Box::new(e.into()))
     }
 }
@@ -120,7 +118,7 @@ async fn handle_existing_user(
     auth_endpoint: &reqwest::Url,
     client: &mut PqStream<impl AsyncRead + AsyncWrite + Unpin + Send>,
     creds: &ClientCredentials,
-) -> Result<crate::compute::NodeInfo, crate::auth::AuthError> {
+) -> Result<compute::NodeInfo, auth::AuthError> {
     let psql_session_id = super::link::new_psql_session_id();
     let md5_salt = rand::random();
@@ -130,7 +128,7 @@ async fn handle_existing_user(
     // Read client's password hash
     let msg = client.read_password_message().await?;
-    let md5_response = parse_password(&msg).ok_or(crate::auth::AuthErrorImpl::MalformedPassword)?;
+    let md5_response = parse_password(&msg).ok_or(auth::AuthErrorImpl::MalformedPassword)?;

     let db_info = authenticate_proxy_client(
         auth_endpoint,
@@ -156,11 +154,11 @@ pub async fn handle_user(
     auth_link_uri: &reqwest::Url,
     client: &mut PqStream<impl AsyncRead + AsyncWrite + Unpin + Send>,
     creds: &ClientCredentials,
-) -> Result<crate::compute::NodeInfo, crate::auth::AuthError> {
+) -> auth::Result<compute::NodeInfo> {
     if creds.is_existing_user() {
         handle_existing_user(auth_endpoint, client, creds).await
     } else {
-        super::link::handle_user(auth_link_uri.as_ref(), client).await
+        super::link::handle_user(auth_link_uri, client).await
     }
 }


@@ -1,4 +1,4 @@
-use crate::{compute, stream::PqStream};
+use crate::{auth, compute, stream::PqStream};
 use tokio::io::{AsyncRead, AsyncWrite};
 use utils::pq_proto::{BeMessage as Be, BeParameterStatusMessage};
@@ -19,13 +19,13 @@ pub fn new_psql_session_id() -> String {
 }

 pub async fn handle_user(
-    redirect_uri: &str,
+    redirect_uri: &reqwest::Url,
     client: &mut PqStream<impl AsyncRead + AsyncWrite + Unpin>,
-) -> Result<compute::NodeInfo, crate::auth::AuthError> {
+) -> auth::Result<compute::NodeInfo> {
     let psql_session_id = new_psql_session_id();
-    let greeting = hello_message(redirect_uri, &psql_session_id);
+    let greeting = hello_message(redirect_uri.as_str(), &psql_session_id);

-    let db_info = crate::auth_backend::with_waiter(psql_session_id, |waiter| async {
+    let db_info = super::with_waiter(psql_session_id, |waiter| async {
         // Give user a URL to spawn a new database
         client
             .write_message_noflush(&Be::AuthenticationOk)?
@@ -34,9 +34,7 @@ pub async fn handle_user(
         .await?;

         // Wait for web console response (see `mgmt`)
-        waiter
-            .await?
-            .map_err(crate::auth::AuthErrorImpl::auth_failed)
+        waiter.await?.map_err(auth::AuthErrorImpl::auth_failed)
     })
     .await?;


@@ -0,0 +1,88 @@
//! Local mock of Cloud API V2.
use crate::{
auth::{
self,
backend::console::{self, io_error, AuthInfo, Result},
ClientCredentials, DatabaseInfo,
},
compute, scram,
stream::PqStream,
url::ApiUrl,
};
use tokio::io::{AsyncRead, AsyncWrite};
#[must_use]
pub(super) struct Api<'a> {
endpoint: &'a ApiUrl,
creds: &'a ClientCredentials,
}
impl<'a> Api<'a> {
/// Construct an API object containing the auth parameters.
pub(super) fn new(endpoint: &'a ApiUrl, creds: &'a ClientCredentials) -> Result<Self> {
Ok(Self { endpoint, creds })
}
/// Authenticate the existing user or throw an error.
pub(super) async fn handle_user(
self,
client: &mut PqStream<impl AsyncRead + AsyncWrite + Unpin + Send>,
) -> auth::Result<compute::NodeInfo> {
// We reuse user handling logic from a production module.
console::handle_user(client, &self, Self::get_auth_info, Self::wake_compute).await
}
/// This implementation fetches the auth info from a local postgres instance.
async fn get_auth_info(&self) -> Result<AuthInfo> {
// Perhaps we could persist this connection, but then we'd have to
// write more code for reopening it if it got closed, which doesn't
// seem worth it.
let (client, connection) =
tokio_postgres::connect(self.endpoint.as_str(), tokio_postgres::NoTls)
.await
.map_err(io_error)?;
tokio::spawn(connection);
let query = "select rolpassword from pg_catalog.pg_authid where rolname = $1";
let rows = client
.query(query, &[&self.creds.user])
.await
.map_err(io_error)?;
match &rows[..] {
// We can't get a secret if there's no such user.
[] => Err(io_error(format!("unknown user '{}'", self.creds.user)).into()),
// We shouldn't get more than one row anyway.
[row, ..] => {
let entry = row.try_get(0).map_err(io_error)?;
scram::ServerSecret::parse(entry)
.map(AuthInfo::Scram)
.or_else(|| {
// It could be an md5 hash if it's not a SCRAM secret.
let text = entry.strip_prefix("md5")?;
Some(AuthInfo::Md5({
let mut bytes = [0u8; 16];
hex::decode_to_slice(text, &mut bytes).ok()?;
bytes
}))
})
// Putting the secret into this message is a security hazard!
.ok_or(console::ConsoleAuthError::BadSecret)
}
}
}
/// We don't need to wake anything locally, so we just return the connection info.
async fn wake_compute(&self) -> Result<DatabaseInfo> {
Ok(DatabaseInfo {
// TODO: handle that near CLI params parsing
host: self.endpoint.host_str().unwrap_or("localhost").to_owned(),
port: self.endpoint.port().unwrap_or(5432),
dbname: self.creds.dbname.to_owned(),
user: self.creds.user.to_owned(),
password: None,
})
}
}
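In the mock backend above, the `rolpassword` entry pulled from `pg_authid` is first tried as a SCRAM secret and then, failing that, as an `md5`-prefixed hex hash. The real code relies on the `hex` crate and `scram::ServerSecret::parse`; the stdlib-only sketch below illustrates just the md5-shape check:

```rust
/// Sketch of the md5 fallback in the mock backend: accept "md5" followed by
/// exactly 32 hex digits and decode them into 16 raw bytes. Anything else
/// (e.g. a SCRAM secret) is rejected with None.
fn md5_bytes(entry: &str) -> Option<[u8; 16]> {
    let text = entry.strip_prefix("md5")?;
    if text.len() != 32 {
        return None;
    }
    let mut bytes = [0u8; 16];
    for (i, chunk) in text.as_bytes().chunks(2).enumerate() {
        let hi = (chunk[0] as char).to_digit(16)?;
        let lo = (chunk[1] as char).to_digit(16)?;
        bytes[i] = (hi * 16 + lo) as u8;
    }
    Some(bytes)
}

fn main() {
    // "md5" + 32 hex chars parses into 16 raw bytes.
    let parsed = md5_bytes("md5d41d8cd98f00b204e9800998ecf8427e").unwrap();
    assert_eq!(parsed[0], 0xd4);
    // SCRAM-looking entries don't match the md5 shape.
    assert!(md5_bytes("SCRAM-SHA-256$4096:abc").is_none());
}
```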


@@ -1,6 +1,5 @@
 //! User credentials used in authentication.

-use super::AuthError;
 use crate::compute;
 use crate::config::ProxyConfig;
 use crate::error::UserFacingError;
@@ -27,6 +26,11 @@ pub struct ClientCredentials {
// New console API requires SNI info to determine the cluster name. // New console API requires SNI info to determine the cluster name.
// Other Auth backends don't need it. // Other Auth backends don't need it.
pub sni_data: Option<String>, pub sni_data: Option<String>,
// project_name is passed as argument from options from url.
// In case sni_data is missing: project_name is used to determine cluster name.
// In case sni_data is available: project_name and sni_data should match (otherwise throws an error).
pub project_name: Option<String>,
} }
 impl ClientCredentials {
@@ -36,6 +40,52 @@ impl ClientCredentials {
     }
 }
#[derive(Debug, Error)]
pub enum ProjectNameError {
#[error("SNI is missing. EITHER please upgrade the postgres client library OR pass the project name as a parameter: '...&options=project%3D<project-name>...'.")]
Missing,
#[error("SNI is malformed.")]
Bad,
#[error("Inconsistent project name inferred from SNI and project option. String from SNI: '{0}', String from project option: '{1}'")]
Inconsistent(String, String),
}
impl UserFacingError for ProjectNameError {}
impl ClientCredentials {
/// Determine the project name from SNI, or from the project_name option parameter.
pub fn project_name(&self) -> Result<&str, ProjectNameError> {
// If both sni_data and project_name are set, check that they match;
// otherwise return a ProjectNameError::Inconsistent error.
if let Some(sni_data) = &self.sni_data {
let project_name_from_sni_data =
sni_data.split_once('.').ok_or(ProjectNameError::Bad)?.0;
if let Some(project_name_from_options) = &self.project_name {
if !project_name_from_options.eq(project_name_from_sni_data) {
return Err(ProjectNameError::Inconsistent(
project_name_from_sni_data.to_string(),
project_name_from_options.to_string(),
));
}
}
}
// determine the project name from self.sni_data if it exists, otherwise from self.project_name.
let ret = match &self.sni_data {
// if sni_data exists, use it to determine project name
Some(sni_data) => sni_data.split_once('.').ok_or(ProjectNameError::Bad)?.0,
// otherwise use project_name if it was set through the options parameter.
None => self
.project_name
.as_ref()
.ok_or(ProjectNameError::Missing)?
.as_str(),
};
Ok(ret)
}
}
 impl TryFrom<HashMap<String, String>> for ClientCredentials {
     type Error = ClientCredsParseError;
@@ -47,12 +97,14 @@ impl TryFrom<HashMap<String, String>> for ClientCredentials {
         };

         let user = get_param("user")?;
-        let db = get_param("database")?;
+        let dbname = get_param("database")?;
+        let project_name = get_param("project").ok();

         Ok(Self {
             user,
-            dbname: db,
+            dbname,
             sni_data: None,
+            project_name,
         })
     }
 }
@@ -63,8 +115,8 @@ impl ClientCredentials {
         self,
         config: &ProxyConfig,
         client: &mut PqStream<impl AsyncRead + AsyncWrite + Unpin + Send>,
-    ) -> Result<compute::NodeInfo, AuthError> {
+    ) -> super::Result<compute::NodeInfo> {
         // This method is just a convenient facade for `handle_user`
-        super::handle_user(config, client, self).await
+        super::backend::handle_user(config, client, self).await
     }
 }
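The project-name rules introduced in this file reduce to three cases: prefer the name derived from SNI, fall back to the `project` option, and reject the combination when both are present but disagree. A standalone sketch of that resolution (hypothetical free function, plain `String` errors instead of `ProjectNameError`):

```rust
/// Resolve a project name from optional SNI data and an optional `project`
/// option, following the precedence and consistency rules described above.
fn project_name(
    sni_data: Option<&str>,
    project_option: Option<&str>,
) -> Result<String, String> {
    // The project name is everything before the first '.' of the SNI hostname.
    let from_sni = match sni_data {
        Some(sni) => Some(
            sni.split_once('.')
                .ok_or_else(|| "SNI is malformed".to_owned())?
                .0,
        ),
        None => None,
    };
    match (from_sni, project_option) {
        (Some(a), Some(b)) if a != b => Err(format!(
            "inconsistent project name: '{a}' (SNI) vs '{b}' (option)"
        )),
        (Some(a), _) => Ok(a.to_owned()),
        (None, Some(b)) => Ok(b.to_owned()),
        (None, None) => Err("SNI is missing and no project option given".to_owned()),
    }
}

fn main() {
    assert_eq!(
        project_name(Some("myproj.cloud.example.com"), None).unwrap(),
        "myproj"
    );
    assert_eq!(project_name(None, Some("myproj")).unwrap(), "myproj");
    assert!(project_name(Some("myproj.example.com"), Some("other")).is_err());
    assert!(project_name(None, None).is_err());
}
```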


@@ -1,6 +1,6 @@
 //! Main authentication flow.

-use super::{AuthError, AuthErrorImpl};
+use super::AuthErrorImpl;
 use crate::stream::PqStream;
 use crate::{sasl, scram};
 use std::io;
@@ -32,7 +32,7 @@ impl AuthMethod for Scram<'_> {
 pub struct AuthFlow<'a, Stream, State> {
     /// The underlying stream which implements libpq's protocol.
     stream: &'a mut PqStream<Stream>,
-    /// State might contain ancillary data (see [`AuthFlow::begin`]).
+    /// State might contain ancillary data (see [`Self::begin`]).
     state: State,
 }
@@ -60,7 +60,7 @@ impl<'a, S: AsyncWrite + Unpin> AuthFlow<'a, S, Begin> {
 /// Stream wrapper for handling [SCRAM](crate::scram) auth.
 impl<S: AsyncRead + AsyncWrite + Unpin> AuthFlow<'_, S, Scram<'_>> {
     /// Perform user authentication. Raise an error in case authentication failed.
-    pub async fn authenticate(self) -> Result<scram::ScramKey, AuthError> {
+    pub async fn authenticate(self) -> super::Result<scram::ScramKey> {
         // Initial client message contains the chosen auth method's name.
         let msg = self.stream.read_password_message().await?;
         let sasl = sasl::FirstMessage::parse(&msg).ok_or(AuthErrorImpl::MalformedPassword)?;


@@ -1,31 +0,0 @@
pub mod console;
pub mod legacy_console;
pub mod link;
pub mod postgres;
pub use legacy_console::{AuthError, AuthErrorImpl};
use crate::mgmt;
use crate::waiters::{self, Waiter, Waiters};
use lazy_static::lazy_static;
lazy_static! {
static ref CPLANE_WAITERS: Waiters<mgmt::ComputeReady> = Default::default();
}
/// Give caller an opportunity to wait for the cloud's reply.
pub async fn with_waiter<R, T, E>(
psql_session_id: impl Into<String>,
action: impl FnOnce(Waiter<'static, mgmt::ComputeReady>) -> R,
) -> Result<T, E>
where
R: std::future::Future<Output = Result<T, E>>,
E: From<waiters::RegisterError>,
{
let waiter = CPLANE_WAITERS.register(psql_session_id.into())?;
action(waiter).await
}
pub fn notify(psql_session_id: &str, msg: mgmt::ComputeReady) -> Result<(), waiters::NotifyError> {
CPLANE_WAITERS.notify(psql_session_id, msg)
}


@@ -1,243 +0,0 @@
//! Declaration of Cloud API V2.
use crate::{
auth::{self, AuthFlow},
compute, scram,
};
use serde::{Deserialize, Serialize};
use thiserror::Error;
use crate::auth::ClientCredentials;
use crate::stream::PqStream;
use tokio::io::{AsyncRead, AsyncWrite};
use utils::pq_proto::{BeMessage as Be, BeParameterStatusMessage};
#[derive(Debug, Error)]
pub enum ConsoleAuthError {
// We shouldn't include the actual secret here.
#[error("Bad authentication secret")]
BadSecret,
#[error("Bad client credentials: {0:?}")]
BadCredentials(crate::auth::ClientCredentials),
#[error("SNI info is missing, please upgrade the postgres client library")]
SniMissing,
#[error("Unexpected SNI content")]
SniWrong,
#[error(transparent)]
BadUrl(#[from] url::ParseError),
#[error(transparent)]
Io(#[from] std::io::Error),
/// HTTP status (other than 200) returned by the console.
#[error("Console responded with an HTTP status: {0}")]
HttpStatus(reqwest::StatusCode),
#[error(transparent)]
Transport(#[from] reqwest::Error),
#[error("Console responded with a malformed JSON: '{0}'")]
MalformedResponse(#[from] serde_json::Error),
#[error("Console responded with a malformed compute address: '{0}'")]
MalformedComputeAddress(String),
}
#[derive(Serialize, Deserialize, Debug)]
struct GetRoleSecretResponse {
role_secret: String,
}
#[derive(Serialize, Deserialize, Debug)]
struct GetWakeComputeResponse {
address: String,
}
/// Auth secret which is managed by the cloud.
pub enum AuthInfo {
/// Md5 hash of user's password.
Md5([u8; 16]),
/// [SCRAM](crate::scram) authentication info.
Scram(scram::ServerSecret),
}
/// Compute node connection params provided by the cloud.
/// Note how it implements serde traits, since we receive it over the wire.
#[derive(Serialize, Deserialize, Default)]
pub struct DatabaseInfo {
pub host: String,
pub port: u16,
pub dbname: String,
pub user: String,
/// [Cloud API V1](super::legacy) returns cleartext password,
/// but [Cloud API V2](super::api) implements [SCRAM](crate::scram)
/// authentication, so we can leverage this method and cope without password.
pub password: Option<String>,
}
// Manually implement debug to omit personal and sensitive info.
impl std::fmt::Debug for DatabaseInfo {
fn fmt(&self, fmt: &mut std::fmt::Formatter) -> std::fmt::Result {
fmt.debug_struct("DatabaseInfo")
.field("host", &self.host)
.field("port", &self.port)
.finish()
}
}
impl From<DatabaseInfo> for tokio_postgres::Config {
fn from(db_info: DatabaseInfo) -> Self {
let mut config = tokio_postgres::Config::new();
config
.host(&db_info.host)
.port(db_info.port)
.dbname(&db_info.dbname)
.user(&db_info.user);
if let Some(password) = db_info.password {
config.password(password);
}
config
}
}
async fn get_auth_info(
auth_endpoint: &str,
user: &str,
cluster: &str,
) -> Result<AuthInfo, ConsoleAuthError> {
let mut url = reqwest::Url::parse(&format!("{auth_endpoint}/proxy_get_role_secret"))?;
url.query_pairs_mut()
.append_pair("project", cluster)
.append_pair("role", user);
// TODO: use a proper logger
println!("cplane request: {}", url);
let resp = reqwest::get(url).await?;
if !resp.status().is_success() {
return Err(ConsoleAuthError::HttpStatus(resp.status()));
}
let response: GetRoleSecretResponse = serde_json::from_str(resp.text().await?.as_str())?;
scram::ServerSecret::parse(response.role_secret.as_str())
.map(AuthInfo::Scram)
.ok_or(ConsoleAuthError::BadSecret)
}
/// Wake up the compute node and return the corresponding connection info.
async fn wake_compute(
auth_endpoint: &str,
cluster: &str,
) -> Result<(String, u16), ConsoleAuthError> {
let mut url = reqwest::Url::parse(&format!("{auth_endpoint}/proxy_wake_compute"))?;
url.query_pairs_mut().append_pair("project", cluster);
// TODO: use a proper logger
println!("cplane request: {}", url);
let resp = reqwest::get(url).await?;
if !resp.status().is_success() {
return Err(ConsoleAuthError::HttpStatus(resp.status()));
}
let response: GetWakeComputeResponse = serde_json::from_str(resp.text().await?.as_str())?;
let (host, port) = response
.address
.split_once(':')
.ok_or_else(|| ConsoleAuthError::MalformedComputeAddress(response.address.clone()))?;
let port: u16 = port
.parse()
.map_err(|_| ConsoleAuthError::MalformedComputeAddress(response.address.clone()))?;
Ok((host.to_string(), port))
}
pub async fn handle_user(
auth_endpoint: &str,
client: &mut PqStream<impl AsyncRead + AsyncWrite + Unpin>,
creds: &ClientCredentials,
) -> Result<compute::NodeInfo, crate::auth::AuthError> {
// Determine cluster name from SNI.
let cluster = creds
.sni_data
.as_ref()
.ok_or(ConsoleAuthError::SniMissing)?
.split_once('.')
.ok_or(ConsoleAuthError::SniWrong)?
.0;
let user = creds.user.as_str();
// Step 1: get the auth secret
let auth_info = get_auth_info(auth_endpoint, user, cluster).await?;
let flow = AuthFlow::new(client);
let scram_keys = match auth_info {
AuthInfo::Md5(_) => {
// TODO: decide if we should support MD5 in api v2
return Err(crate::auth::AuthErrorImpl::auth_failed("MD5 is not supported").into());
}
AuthInfo::Scram(secret) => {
let scram = auth::Scram(&secret);
Some(compute::ScramKeys {
client_key: flow.begin(scram).await?.authenticate().await?.as_bytes(),
server_key: secret.server_key.as_bytes(),
})
}
};
client
.write_message_noflush(&Be::AuthenticationOk)?
.write_message_noflush(&BeParameterStatusMessage::encoding())?;
// Step 2: wake compute
let (host, port) = wake_compute(auth_endpoint, cluster).await?;
Ok(compute::NodeInfo {
db_info: DatabaseInfo {
host,
port,
dbname: creds.dbname.clone(),
user: creds.user.clone(),
password: None,
},
scram_keys,
})
}
#[cfg(test)]
mod tests {
use super::*;
use serde_json::json;
#[test]
fn parse_db_info() -> anyhow::Result<()> {
let _: DatabaseInfo = serde_json::from_value(json!({
"host": "localhost",
"port": 5432,
"dbname": "postgres",
"user": "john_doe",
"password": "password",
}))?;
let _: DatabaseInfo = serde_json::from_value(json!({
"host": "localhost",
"port": 5432,
"dbname": "postgres",
"user": "john_doe",
}))?;
Ok(())
}
}


@@ -1,93 +0,0 @@
//! Local mock of Cloud API V2.

use super::console::{self, AuthInfo, DatabaseInfo};
use crate::scram;
use crate::{auth::ClientCredentials, compute};
use crate::stream::PqStream;
use tokio::io::{AsyncRead, AsyncWrite};
use utils::pq_proto::{BeMessage as Be, BeParameterStatusMessage};

async fn get_auth_info(
    auth_endpoint: &str,
    creds: &ClientCredentials,
) -> Result<AuthInfo, console::ConsoleAuthError> {
    // We wrap `tokio_postgres::Error` because we don't want to infect the
    // method's error type with a detail that's specific to debug mode only.
    let io_error = |e| std::io::Error::new(std::io::ErrorKind::Other, e);

    // Perhaps we could persist this connection, but then we'd have to
    // write more code for reopening it if it got closed, which doesn't
    // seem worth it.
    let (client, connection) = tokio_postgres::connect(auth_endpoint, tokio_postgres::NoTls)
        .await
        .map_err(io_error)?;
    tokio::spawn(connection);

    let query = "select rolpassword from pg_catalog.pg_authid where rolname = $1";
    let rows = client
        .query(query, &[&creds.user])
        .await
        .map_err(io_error)?;

    match &rows[..] {
        // We can't get a secret if there's no such user.
        [] => Err(console::ConsoleAuthError::BadCredentials(creds.to_owned())),
        // We shouldn't get more than one row anyway.
        [row, ..] => {
            let entry = row.try_get(0).map_err(io_error)?;
            scram::ServerSecret::parse(entry)
                .map(AuthInfo::Scram)
                .or_else(|| {
                    // It could be an md5 hash if it's not a SCRAM secret.
                    let text = entry.strip_prefix("md5")?;
                    Some(AuthInfo::Md5({
                        let mut bytes = [0u8; 16];
                        hex::decode_to_slice(text, &mut bytes).ok()?;
                        bytes
                    }))
                })
                // Putting the secret into this message is a security hazard!
                .ok_or(console::ConsoleAuthError::BadSecret)
        }
    }
}
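The SCRAM-then-MD5 fallback above can be sketched with the standard library only; `parse_md5_secret` is an illustrative name (the real code uses the `hex` crate's `decode_to_slice`), and the sample verifier string is made up for the demo.

```rust
// Illustrative sketch of the secret-parsing fallback: strip the "md5"
// prefix and decode the remaining 32 hex digits into 16 raw bytes.
fn parse_md5_secret(entry: &str) -> Option<[u8; 16]> {
    let text = entry.strip_prefix("md5")?;
    if text.len() != 32 {
        return None;
    }
    let mut bytes = [0u8; 16];
    for (i, chunk) in text.as_bytes().chunks(2).enumerate() {
        // to_digit(16) returns None on a non-hex character, aborting the parse.
        let hi = (chunk[0] as char).to_digit(16)?;
        let lo = (chunk[1] as char).to_digit(16)?;
        bytes[i] = (hi as u8) << 4 | lo as u8;
    }
    Some(bytes)
}

fn main() {
    // A SCRAM verifier has no "md5" prefix, so the fallback rejects it.
    assert!(parse_md5_secret("SCRAM-SHA-256$4096:...").is_none());
    // An md5 verifier is "md5" followed by 32 hex digits.
    let bytes = parse_md5_secret("md532eca5164b45fae29c60bcd1b4cb482a").unwrap();
    assert_eq!(bytes[0], 0x32);
    println!("first byte: {:#04x}", bytes[0]);
}
```

Trying SCRAM first and falling back to md5 keeps a single `AuthInfo` return type while tolerating both verifier formats Postgres can store in `rolpassword`.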
pub async fn handle_user(
    auth_endpoint: &reqwest::Url,
    client: &mut PqStream<impl AsyncRead + AsyncWrite + Unpin>,
    creds: &ClientCredentials,
) -> Result<compute::NodeInfo, crate::auth::AuthError> {
    let auth_info = get_auth_info(auth_endpoint.as_ref(), creds).await?;

    let flow = crate::auth::AuthFlow::new(client);
    let scram_keys = match auth_info {
        AuthInfo::Md5(_) => {
            // TODO: decide if we should support MD5 in api v2
            return Err(crate::auth::AuthErrorImpl::auth_failed("MD5 is not supported").into());
        }
        AuthInfo::Scram(secret) => {
            let scram = crate::auth::Scram(&secret);
            Some(compute::ScramKeys {
                client_key: flow.begin(scram).await?.authenticate().await?.as_bytes(),
                server_key: secret.server_key.as_bytes(),
            })
        }
    };

    client
        .write_message_noflush(&Be::AuthenticationOk)?
        .write_message_noflush(&BeParameterStatusMessage::encoding())?;

    Ok(compute::NodeInfo {
        db_info: DatabaseInfo {
            // TODO: handle that near CLI params parsing
            host: auth_endpoint.host_str().unwrap_or("localhost").to_owned(),
            port: auth_endpoint.port().unwrap_or(5432),
            dbname: creds.dbname.to_owned(),
            user: creds.user.to_owned(),
            password: None,
        },
        scram_keys,
    })
}


@@ -1,4 +1,4 @@
use crate::auth_backend::console::DatabaseInfo; use crate::auth::DatabaseInfo;
use crate::cancellation::CancelClosure; use crate::cancellation::CancelClosure;
use crate::error::UserFacingError; use crate::error::UserFacingError;
use std::io; use std::io;
@@ -37,7 +37,7 @@ pub struct NodeInfo {
impl NodeInfo { impl NodeInfo {
async fn connect_raw(&self) -> io::Result<(SocketAddr, TcpStream)> { async fn connect_raw(&self) -> io::Result<(SocketAddr, TcpStream)> {
let host_port = format!("{}:{}", self.db_info.host, self.db_info.port); let host_port = (self.db_info.host.as_str(), self.db_info.port);
let socket = TcpStream::connect(host_port).await?; let socket = TcpStream::connect(host_port).await?;
let socket_addr = socket.peer_addr()?; let socket_addr = socket.peer_addr()?;
socket2::SockRef::from(&socket).set_keepalive(true)?; socket2::SockRef::from(&socket).set_keepalive(true)?;
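The `connect_raw` change works because `TcpStream::connect` accepts any `impl ToSocketAddrs`, and the standard library implements that trait for both `host:port` strings and `(&str, u16)` tuples; the tuple form simply skips the intermediate `format!` allocation. A minimal std-only check:

```rust
use std::net::ToSocketAddrs;

fn main() {
    // Both forms resolve to the same socket address; the tuple avoids
    // building a temporary String just to parse it back apart.
    let from_string = "127.0.0.1:5432".to_socket_addrs().unwrap().next().unwrap();
    let from_tuple = ("127.0.0.1", 5432u16).to_socket_addrs().unwrap().next().unwrap();
    assert_eq!(from_string, from_tuple);
    println!("{from_tuple}");
}
```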


@@ -1,39 +1,39 @@
use anyhow::{ensure, Context}; use crate::url::ApiUrl;
use anyhow::{bail, ensure, Context};
use std::{str::FromStr, sync::Arc}; use std::{str::FromStr, sync::Arc};
#[non_exhaustive] #[derive(Debug)]
pub enum AuthBackendType { pub enum AuthBackendType {
/// Legacy Cloud API (V1).
LegacyConsole, LegacyConsole,
Console, /// Authentication via a web browser.
Postgres,
Link, Link,
/// Current Cloud API (V2).
Console,
/// Local mock of Cloud API (V2).
Postgres,
} }
impl FromStr for AuthBackendType { impl FromStr for AuthBackendType {
type Err = anyhow::Error; type Err = anyhow::Error;
fn from_str(s: &str) -> anyhow::Result<Self> { fn from_str(s: &str) -> anyhow::Result<Self> {
println!("ClientAuthMethod::from_str: '{}'", s);
use AuthBackendType::*; use AuthBackendType::*;
match s { Ok(match s {
"legacy" => Ok(LegacyConsole), "legacy" => LegacyConsole,
"console" => Ok(Console), "console" => Console,
"postgres" => Ok(Postgres), "postgres" => Postgres,
"link" => Ok(Link), "link" => Link,
_ => Err(anyhow::anyhow!("Invlid option for auth method")), _ => bail!("Invalid option `{s}` for auth method"),
} })
} }
} }
pub struct ProxyConfig { pub struct ProxyConfig {
/// TLS configuration for the proxy.
pub tls_config: Option<TlsConfig>, pub tls_config: Option<TlsConfig>,
pub auth_backend: AuthBackendType, pub auth_backend: AuthBackendType,
pub auth_endpoint: ApiUrl,
pub auth_endpoint: reqwest::Url, pub auth_link_uri: ApiUrl,
pub auth_link_uri: reqwest::Url,
} }
pub type TlsConfig = Arc<rustls::ServerConfig>; pub type TlsConfig = Arc<rustls::ServerConfig>;
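The reworked `FromStr` wraps the whole `match` in `Ok(…)` and uses `bail!` for the error arm, removing the per-arm `Ok` wrapping. The same shape can be sketched std-only, with a plain `String` error standing in for `anyhow::Error`:

```rust
use std::str::FromStr;

#[derive(Debug, PartialEq)]
enum AuthBackendType {
    LegacyConsole,
    Link,
    Console,
    Postgres,
}

impl FromStr for AuthBackendType {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        use AuthBackendType::*;
        // The early `return Err(...)` plays the role of anyhow's bail! here.
        Ok(match s {
            "legacy" => LegacyConsole,
            "console" => Console,
            "postgres" => Postgres,
            "link" => Link,
            _ => return Err(format!("Invalid option `{s}` for auth method")),
        })
    }
}

fn main() {
    // FromStr is what makes str::parse work.
    assert_eq!("console".parse::<AuthBackendType>(), Ok(AuthBackendType::Console));
    assert!("md5".parse::<AuthBackendType>().is_err());
}
```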


@@ -5,7 +5,6 @@
//! in somewhat transparent manner (again via communication with control plane API). //! in somewhat transparent manner (again via communication with control plane API).
mod auth; mod auth;
mod auth_backend;
mod cancellation; mod cancellation;
mod compute; mod compute;
mod config; mod config;
@@ -17,6 +16,7 @@ mod proxy;
mod sasl; mod sasl;
mod scram; mod scram;
mod stream; mod stream;
mod url;
mod waiters; mod waiters;
use anyhow::{bail, Context}; use anyhow::{bail, Context};
@@ -126,6 +126,7 @@ async fn main() -> anyhow::Result<()> {
})); }));
println!("Version: {GIT_VERSION}"); println!("Version: {GIT_VERSION}");
println!("Authentication backend: {:?}", config.auth_backend);
// Check that we can bind to address before further initialization // Check that we can bind to address before further initialization
println!("Starting http on {}", http_address); println!("Starting http on {}", http_address);


@@ -1,4 +1,4 @@
use crate::auth_backend; use crate::auth;
use anyhow::Context; use anyhow::Context;
use serde::Deserialize; use serde::Deserialize;
use std::{ use std::{
@@ -77,12 +77,12 @@ struct PsqlSessionResponse {
#[derive(Deserialize)] #[derive(Deserialize)]
enum PsqlSessionResult { enum PsqlSessionResult {
Success(auth_backend::console::DatabaseInfo), Success(auth::DatabaseInfo),
Failure(String), Failure(String),
} }
/// A message received by `mgmt` when a compute node is ready. /// A message received by `mgmt` when a compute node is ready.
pub type ComputeReady = Result<auth_backend::console::DatabaseInfo, String>; pub type ComputeReady = Result<auth::DatabaseInfo, String>;
impl PsqlSessionResult { impl PsqlSessionResult {
fn into_compute_ready(self) -> ComputeReady { fn into_compute_ready(self) -> ComputeReady {
@@ -113,7 +113,7 @@ fn try_process_query(pgb: &mut PostgresBackend, query_string: &str) -> anyhow::R
let resp: PsqlSessionResponse = serde_json::from_str(query_string)?; let resp: PsqlSessionResponse = serde_json::from_str(query_string)?;
match auth_backend::notify(&resp.session_id, resp.result.into_compute_ready()) { match auth::backend::notify(&resp.session_id, resp.result.into_compute_ready()) {
Ok(()) => { Ok(()) => {
pgb.write_message_noflush(&SINGLE_COL_ROWDESC)? pgb.write_message_noflush(&SINGLE_COL_ROWDESC)?
.write_message_noflush(&BeMessage::DataRow(&[Some(b"ok")]))? .write_message_noflush(&BeMessage::DataRow(&[Some(b"ok")]))?


@@ -95,7 +95,7 @@ async fn handle_client(
 /// Establish a (most probably, secure) connection with the client.
 /// For better testing experience, `stream` can be any object satisfying the traits.
-/// It's easier to work with owned `stream` here as we need to updgrade it to TLS;
+/// It's easier to work with owned `stream` here as we need to upgrade it to TLS;
 /// we also take an extra care of propagating only the select handshake errors to client.
 async fn handshake<S: AsyncRead + AsyncWrite + Unpin>(
     stream: S,

@@ -10,6 +10,7 @@ mod channel_binding;
 mod messages;
 mod stream;
+use crate::error::UserFacingError;
 use std::io;
 use thiserror::Error;
@@ -36,6 +37,20 @@ pub enum Error {
     Io(#[from] io::Error),
 }
+impl UserFacingError for Error {
+    fn to_string_client(&self) -> String {
+        use Error::*;
+        match self {
+            // This constructor contains the reason why auth has failed.
+            AuthenticationFailed(s) => s.to_string(),
+            // TODO: add support for channel binding
+            ChannelBindingFailed(_) => "channel binding is not supported yet".to_string(),
+            ChannelBindingBadMethod(m) => format!("unsupported channel binding method {m}"),
+            _ => "authentication protocol violation".to_string(),
+        }
+    }
+}
 /// A convenient result type for SASL exchange.
 pub type Result<T> = std::result::Result<T, Error>;
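The new `UserFacingError` impl separates what the server logs from what the client is told: detailed variants are redacted down to a generic message before crossing the wire. The pattern can be sketched std-only; the variant names and `ProtocolViolation` payload here are illustrative, not the crate's exact API:

```rust
use std::fmt;

#[derive(Debug)]
enum Error {
    AuthenticationFailed(String),
    ChannelBindingBadMethod(String),
    ProtocolViolation(String),
}

// Internal Display: full detail, intended only for server-side logs.
impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{self:?}")
    }
}

// Client-facing rendering: redact anything the user shouldn't see.
trait UserFacingError: fmt::Display {
    fn to_string_client(&self) -> String;
}

impl UserFacingError for Error {
    fn to_string_client(&self) -> String {
        use Error::*;
        match self {
            AuthenticationFailed(s) => s.clone(),
            ChannelBindingBadMethod(m) => format!("unsupported channel binding method {m}"),
            // Internal detail is deliberately dropped for everything else.
            _ => "authentication protocol violation".to_string(),
        }
    }
}

fn main() {
    let err = Error::ProtocolViolation("unexpected SASL message at state 3".into());
    assert_eq!(err.to_string_client(), "authentication protocol violation");
    println!("{}", err.to_string_client());
}
```

Keeping the redaction in one trait impl means every call site that talks to the client can use `to_string_client` without re-deciding what is safe to reveal.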

Some files were not shown because too many files have changed in this diff.