Compare commits

..

11 Commits

Author SHA1 Message Date
Heikki Linnakangas
e58d0fece1 New communicator, with "integrated" cache accessible from all processes 2025-04-29 11:52:44 +03:00
Alex Chi Z.
11f6044338 fix(pageserver): report synthetic size = 1 if all tls offloaded (2) (#11731)
## Problem

https://github.com/neondatabase/neon/pull/11648 did this for resident
size instead of synthetic size.

## Summary of changes

Report synthetic_size == 1 if all timelines are offloaded.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-04-28 13:45:45 +00:00
Konstantin Knizhnik
692c0f3fb8 Prepare to prewarm support (#11740)
## Problem

See
(original prewarm implementation)
https://github.com/neondatabase/neon/pull/9197
(functions for storing/restoring LFC state)
https://github.com/neondatabase/neon/pull/9587
(store prefetch results in LFC)
https://github.com/neondatabase/neon/pull/10442

## Summary of changes

Preparation for prewarm implementation.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2025-04-28 13:24:18 +00:00
Alexander Bayandin
2b1d2a55d6 CI: fix typo oicd -> oidc (#11747)
## Problem

It's OIDC (OpenID Connect), not OICD

## Summary of changes
- Rename actions input `aws-oicd-role-arn` -> `aws-oidc-role-arn`
2025-04-28 12:44:28 +00:00
Konstantin Knizhnik
60b9fb1baf Ignore unlogged LSNs in set last written LSN (#11743)
## Problem

See https://github.com/neondatabase/neon/issues/11718
and https://neondb.slack.com/archives/C033RQ5SPDH/p1745122797538509

GIST other indexes performing "unlogged build" are using so called fake
LSNs - not a real LSN, but something like 0/1. Been stored in lwlsn
cache they cause incorrect lookup at PS.

## Summary of changes

Do not store fake LSNs in LwLSN hash.

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2025-04-28 12:16:29 +00:00
Erik Grinaker
606f14034e pageserver: improve pageserver_smgr_query_seconds buckets (#11680)
## Problem

The `pageserver_smgr_query_seconds` buckets are too coarse, using powers
of 10: 1 µs, 10 µs, 100 µs, 1 ms, 10 ms, 100 ms, 1 s, 10 s, 100 s. This
is one of our most crucial latency metrics, and needs better resolution.

Touches #11594.

## Summary of changes

This patch uses buckets with better resolution around 1 ms (the typical
latency):

* 0.6 ms
* 1 ms
* 3 ms
* 6 ms
* 10 ms
* 30 ms
* 100 ms
* 1 s
* 3 s

These will be the same as the compute's `compute_getpage_wait_seconds`,
to make them comparable across the compute and Pageserver:
https://github.com/neondatabase/flux-fleet/pull/579. We sacrifice
buckets above 3 s, since these can already be considered "too slow".

This does not change the previously used `CRITICAL_OP_BUCKETS`, which is
also used for other operations on different timescales (e.g. LSN waits).
We should consider replacing this with more appropriate buckets for
specific operations, since it covers a large span with low resolution.
2025-04-28 11:52:44 +00:00
Conrad Ludgate
32393b4393 pg-sni-router: support compute TLS on different port (#11732)
## Problem

pg-sni-router isn't aware of compute TLS

## Summary of changes

If connections come in on port 4433, we require TLS to compute from
pg-sni-router
2025-04-28 11:29:44 +00:00
Alexander Bayandin
1a29f5672a CI(check-macos-build): trigger workflow automatically for PRs (#11706)
## Problem

- if-conditions for the `check-macos-build` workflow don't trigger it on
PRs with relevant changes (in Rust code or Postgres submodules).
- Jobs in the workflow depend on the presence of a cache, which is not
guaranteed.

## Summary of changes
- Fix if-conditions
- Use artifacts on top of cache whenever the workflow depends on it —
the cache might not be available
2025-04-28 09:03:10 +00:00
a-masterov
b8d47b5acf Run the extensions' tests on staging (#11164)
## Problem
We currently don't run end-to-end tests for PostgreSQL extensions on our
cloud infrastructure, which means we might miss problems that only occur
in a real cloud environment.

## Summary of changes
- Added a workflow to run extension tests against a cloud staging
instance
- Set up proper project configuration for extension testing
- Implemented test execution with appropriate environment settings
- Added error handling and reporting for test failures

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2025-04-28 08:13:49 +00:00
Alexander Lakhin
97e01ae6fd Add workflow to run particular test(s) N times (#11050)
## Problem
Provide an easy way to run particular test(s) N times on CI.

## Summary of changes

* Allow for passing the test selection and the number of test runs to
the existing "Build and Test Locally" workflow
* Allow for running multiple selected tests by the "Pytest regression
tests" step
* Introduce a new workflow to run specified test(s) several times
* Store results in a separate database to distinguish between testing
tests for stability and usual testing
2025-04-28 04:04:37 +00:00
Lokesh
459d51974c doc: minor updates to consumption-metrics document (#7153)
## Problem
Proposed minor changes to the `consumption_metrics` document.

## Summary of changes
- Fixed minor typos in the document.
- Minor formatting in the description of metrics `timeline_logical_size`
and `synthetic_storage_size`. Makes this consistent as with description
  of other metrics in the document.

## Checklist before requesting a review

- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

Co-authored-by: Mikhail Kot <mikhail@neon.tech>
2025-04-25 19:46:40 +00:00
120 changed files with 9834 additions and 820 deletions

View File

@@ -7,7 +7,7 @@ inputs:
type: boolean
required: false
default: false
aws-oicd-role-arn:
aws-oidc-role-arn:
description: 'OIDC role arn to interract with S3'
required: true
@@ -88,7 +88,7 @@ runs:
if: ${{ !cancelled() }}
with:
aws-region: eu-central-1
role-to-assume: ${{ inputs.aws-oicd-role-arn }}
role-to-assume: ${{ inputs.aws-oidc-role-arn }}
role-duration-seconds: 3600 # 1 hour should be more than enough to upload report
# Potentially we could have several running build for the same key (for example, for the main branch), so we use improvised lock for this

View File

@@ -8,7 +8,7 @@ inputs:
unique-key:
description: 'string to distinguish different results in the same run'
required: true
aws-oicd-role-arn:
aws-oidc-role-arn:
description: 'OIDC role arn to interract with S3'
required: true
@@ -39,7 +39,7 @@ runs:
if: ${{ !cancelled() }}
with:
aws-region: eu-central-1
role-to-assume: ${{ inputs.aws-oicd-role-arn }}
role-to-assume: ${{ inputs.aws-oidc-role-arn }}
role-duration-seconds: 3600 # 1 hour should be more than enough to upload report
- name: Upload test results

View File

@@ -15,7 +15,7 @@ inputs:
prefix:
description: "S3 prefix. Default is '${GITHUB_RUN_ID}/${GITHUB_RUN_ATTEMPT}'"
required: false
aws-oicd-role-arn:
aws-oidc-role-arn:
description: 'OIDC role arn to interract with S3'
required: true
@@ -25,7 +25,7 @@ runs:
- uses: aws-actions/configure-aws-credentials@v4
with:
aws-region: eu-central-1
role-to-assume: ${{ inputs.aws-oicd-role-arn }}
role-to-assume: ${{ inputs.aws-oidc-role-arn }}
role-duration-seconds: 3600
- name: Download artifact

View File

@@ -49,6 +49,10 @@ inputs:
description: 'A JSON object with project settings'
required: false
default: '{}'
default_endpoint_settings:
description: 'A JSON object with the default endpoint settings'
required: false
default: '{}'
outputs:
dsn:
@@ -66,9 +70,9 @@ runs:
# A shell without `set -x` to not to expose password/dsn in logs
shell: bash -euo pipefail {0}
run: |
project=$(curl \
res=$(curl \
"https://${API_HOST}/api/v2/projects" \
--fail \
-w "%{http_code}" \
--header "Accept: application/json" \
--header "Content-Type: application/json" \
--header "Authorization: Bearer ${API_KEY}" \
@@ -83,6 +87,15 @@ runs:
\"settings\": ${PROJECT_SETTINGS}
}
}")
code=${res: -3}
if [[ ${code} -ge 400 ]]; then
echo Request failed with error code ${code}
echo ${res::-3}
exit 1
else
project=${res::-3}
fi
# Mask password
echo "::add-mask::$(echo $project | jq --raw-output '.roles[] | select(.name != "web_access") | .password')"
@@ -126,6 +139,22 @@ runs:
-H "Accept: application/json" -H "Content-Type: application/json" -H "Authorization: Bearer ${ADMIN_API_KEY}" \
-d "{\"scheduling\": \"Essential\"}"
fi
# XXX
# This is a workaround for the default endpoint settings, which currently do not allow some settings in the public API.
# https://github.com/neondatabase/cloud/issues/27108
if [[ -n ${DEFAULT_ENDPOINT_SETTINGS} && ${DEFAULT_ENDPOINT_SETTINGS} != "{}" ]] ; then
PROJECT_DATA=$(curl -X GET \
"https://${API_HOST}/regions/${REGION_ID}/api/v1/admin/projects/${project_id}" \
-H "Accept: application/json" -H "Content-Type: application/json" -H "Authorization: Bearer ${ADMIN_API_KEY}" \
-d "{\"scheduling\": \"Essential\"}"
)
NEW_DEFAULT_ENDPOINT_SETTINGS=$(echo ${PROJECT_DATA} | jq -rc ".project.default_endpoint_settings + ${DEFAULT_ENDPOINT_SETTINGS}")
curl -X POST --fail \
"https://${API_HOST}/regions/${REGION_ID}/api/v1/admin/projects/${project_id}/default_endpoint_settings" \
-H "Accept: application/json" -H "Content-Type: application/json" -H "Authorization: Bearer ${ADMIN_API_KEY}" \
--data "${NEW_DEFAULT_ENDPOINT_SETTINGS}"
fi
env:
API_HOST: ${{ inputs.api_host }}
@@ -142,3 +171,4 @@ runs:
PSQL: ${{ inputs.psql_path }}
LD_LIBRARY_PATH: ${{ inputs.libpq_lib_path }}
PROJECT_SETTINGS: ${{ inputs.project_settings }}
DEFAULT_ENDPOINT_SETTINGS: ${{ inputs.default_endpoint_settings }}

View File

@@ -53,7 +53,7 @@ inputs:
description: 'benchmark durations JSON'
required: false
default: '{}'
aws-oicd-role-arn:
aws-oidc-role-arn:
description: 'OIDC role arn to interract with S3'
required: true
@@ -66,7 +66,7 @@ runs:
with:
name: neon-${{ runner.os }}-${{ runner.arch }}-${{ inputs.build_type }}${{ inputs.sanitizers == 'enabled' && '-sanitized' || '' }}-artifact
path: /tmp/neon
aws-oicd-role-arn: ${{ inputs.aws-oicd-role-arn }}
aws-oidc-role-arn: ${{ inputs.aws-oidc-role-arn }}
- name: Download Neon binaries for the previous release
if: inputs.build_type != 'remote'
@@ -75,7 +75,7 @@ runs:
name: neon-${{ runner.os }}-${{ runner.arch }}-${{ inputs.build_type }}-artifact
path: /tmp/neon-previous
prefix: latest
aws-oicd-role-arn: ${{ inputs.aws-oicd-role-arn }}
aws-oidc-role-arn: ${{ inputs.aws-oidc-role-arn }}
- name: Download compatibility snapshot
if: inputs.build_type != 'remote'
@@ -87,7 +87,7 @@ runs:
# The lack of compatibility snapshot (for example, for the new Postgres version)
# shouldn't fail the whole job. Only relevant test should fail.
skip-if-does-not-exist: true
aws-oicd-role-arn: ${{ inputs.aws-oicd-role-arn }}
aws-oidc-role-arn: ${{ inputs.aws-oidc-role-arn }}
- name: Checkout
if: inputs.needs_postgres_source == 'true'
@@ -228,13 +228,13 @@ runs:
# The lack of compatibility snapshot shouldn't fail the job
# (for example if we didn't run the test for non build-and-test workflow)
skip-if-does-not-exist: true
aws-oicd-role-arn: ${{ inputs.aws-oicd-role-arn }}
aws-oidc-role-arn: ${{ inputs.aws-oidc-role-arn }}
- uses: aws-actions/configure-aws-credentials@v4
if: ${{ !cancelled() }}
with:
aws-region: eu-central-1
role-to-assume: ${{ inputs.aws-oicd-role-arn }}
role-to-assume: ${{ inputs.aws-oidc-role-arn }}
role-duration-seconds: 3600 # 1 hour should be more than enough to upload report
- name: Upload test results
@@ -243,4 +243,4 @@ runs:
with:
report-dir: /tmp/test_output/allure/results
unique-key: ${{ inputs.build_type }}-${{ inputs.pg_version }}-${{ runner.arch }}
aws-oicd-role-arn: ${{ inputs.aws-oicd-role-arn }}
aws-oidc-role-arn: ${{ inputs.aws-oidc-role-arn }}

View File

@@ -14,11 +14,11 @@ runs:
name: coverage-data-artifact
path: /tmp/coverage
skip-if-does-not-exist: true # skip if there's no previous coverage to download
aws-oicd-role-arn: ${{ inputs.aws-oicd-role-arn }}
aws-oidc-role-arn: ${{ inputs.aws-oidc-role-arn }}
- name: Upload coverage data
uses: ./.github/actions/upload
with:
name: coverage-data-artifact
path: /tmp/coverage
aws-oicd-role-arn: ${{ inputs.aws-oicd-role-arn }}
aws-oidc-role-arn: ${{ inputs.aws-oidc-role-arn }}

View File

@@ -14,7 +14,7 @@ inputs:
prefix:
description: "S3 prefix. Default is '${GITHUB_SHA}/${GITHUB_RUN_ID}/${GITHUB_RUN_ATTEMPT}'"
required: false
aws-oicd-role-arn:
aws-oidc-role-arn:
description: "the OIDC role arn for aws auth"
required: false
default: ""
@@ -61,7 +61,7 @@ runs:
uses: aws-actions/configure-aws-credentials@v4
with:
aws-region: eu-central-1
role-to-assume: ${{ inputs.aws-oicd-role-arn }}
role-to-assume: ${{ inputs.aws-oidc-role-arn }}
role-duration-seconds: 3600
- name: Upload artifact

View File

@@ -81,7 +81,7 @@ jobs:
name: neon-${{ runner.os }}-${{ runner.arch }}-release-artifact
path: /tmp/neon/
prefix: latest
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
# we create a table that has one row for each database that we want to restore with the status whether the restore is done
- name: Create benchmark_restore_status table if it does not exist

View File

@@ -28,6 +28,16 @@ on:
required: false
default: 'disabled'
type: string
test-selection:
description: 'specification of selected test(s) to run'
required: false
default: ''
type: string
test-run-count:
description: 'number of runs to perform for selected tests'
required: false
default: 1
type: number
defaults:
run:
@@ -313,7 +323,7 @@ jobs:
with:
name: neon-${{ runner.os }}-${{ runner.arch }}-${{ inputs.build-type }}${{ inputs.sanitizers == 'enabled' && '-sanitized' || '' }}-artifact
path: /tmp/neon
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Check diesel schema
if: inputs.build-type == 'release' && inputs.arch == 'x64'
@@ -381,14 +391,15 @@ jobs:
run_with_real_s3: true
real_s3_bucket: neon-github-ci-tests
real_s3_region: eu-central-1
rerun_failed: true
rerun_failed: ${{ inputs.test-run-count == 1 }}
pg_version: ${{ matrix.pg_version }}
sanitizers: ${{ inputs.sanitizers }}
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
# `--session-timeout` is equal to (timeout-minutes - 10 minutes) * 60 seconds.
# Attempt to stop tests gracefully to generate test reports
# until they are forcibly stopped by the stricter `timeout-minutes` limit.
extra_params: --session-timeout=${{ inputs.sanitizers != 'enabled' && 3000 || 10200 }}
extra_params: --session-timeout=${{ inputs.sanitizers != 'enabled' && 3000 || 10200 }} --count=${{ inputs.test-run-count }}
${{ inputs.test-selection != '' && format('-k "{0}"', inputs.test-selection) || '' }}
env:
TEST_RESULT_CONNSTR: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR_NEW }}
CHECK_ONDISK_DATA_COMPATIBILITY: nonempty

View File

@@ -114,7 +114,7 @@ jobs:
name: neon-${{ runner.os }}-${{ runner.arch }}-release-artifact
path: /tmp/neon/
prefix: latest
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Create Neon Project
id: create-neon-project
@@ -132,7 +132,7 @@ jobs:
run_in_parallel: false
save_perf_report: ${{ env.SAVE_PERF_REPORT }}
pg_version: ${{ env.PG_VERSION }}
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
# Set --sparse-ordering option of pytest-order plugin
# to ensure tests are running in order of appears in the file.
# It's important for test_perf_pgbench.py::test_pgbench_remote_* tests
@@ -165,7 +165,7 @@ jobs:
if: ${{ !cancelled() }}
uses: ./.github/actions/allure-report-generate
with:
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Post to a Slack channel
if: ${{ github.event.schedule && failure() }}
@@ -222,8 +222,8 @@ jobs:
name: neon-${{ runner.os }}-${{ runner.arch }}-release-artifact
path: /tmp/neon/
prefix: latest
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Verify that cumulative statistics are preserved
uses: ./.github/actions/run-python-test-set
with:
@@ -233,7 +233,7 @@ jobs:
save_perf_report: ${{ env.SAVE_PERF_REPORT }}
extra_params: -m remote_cluster --timeout 3600
pg_version: ${{ env.DEFAULT_PG_VERSION }}
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
PERF_TEST_RESULT_CONNSTR: "${{ secrets.PERF_TEST_RESULT_CONNSTR }}"
@@ -282,7 +282,7 @@ jobs:
name: neon-${{ runner.os }}-${{ runner.arch }}-release-artifact
path: /tmp/neon/
prefix: latest
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Run Logical Replication benchmarks
uses: ./.github/actions/run-python-test-set
@@ -293,7 +293,7 @@ jobs:
save_perf_report: ${{ env.SAVE_PERF_REPORT }}
extra_params: -m remote_cluster --timeout 5400
pg_version: ${{ env.DEFAULT_PG_VERSION }}
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
PERF_TEST_RESULT_CONNSTR: "${{ secrets.PERF_TEST_RESULT_CONNSTR }}"
@@ -310,7 +310,7 @@ jobs:
save_perf_report: ${{ env.SAVE_PERF_REPORT }}
extra_params: -m remote_cluster --timeout 5400
pg_version: ${{ env.DEFAULT_PG_VERSION }}
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
PERF_TEST_RESULT_CONNSTR: "${{ secrets.PERF_TEST_RESULT_CONNSTR }}"
@@ -322,7 +322,7 @@ jobs:
uses: ./.github/actions/allure-report-generate
with:
store-test-results-into-db: true
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
REGRESS_TEST_RESULT_CONNSTR_NEW: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR_NEW }}
@@ -505,7 +505,7 @@ jobs:
name: neon-${{ runner.os }}-${{ runner.arch }}-release-artifact
path: /tmp/neon/
prefix: latest
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Create Neon Project
if: contains(fromJSON('["neonvm-captest-new", "neonvm-captest-new-many-tables", "neonvm-captest-freetier", "neonvm-azure-captest-freetier", "neonvm-azure-captest-new"]'), matrix.platform)
@@ -557,7 +557,7 @@ jobs:
save_perf_report: ${{ env.SAVE_PERF_REPORT }}
extra_params: -m remote_cluster --timeout 21600 -k test_perf_many_relations
pg_version: ${{ env.PG_VERSION }}
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
BENCHMARK_CONNSTR: ${{ steps.set-up-connstr.outputs.connstr }}
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
@@ -573,7 +573,7 @@ jobs:
save_perf_report: ${{ env.SAVE_PERF_REPORT }}
extra_params: -m remote_cluster --timeout 21600 -k test_pgbench_remote_init
pg_version: ${{ env.PG_VERSION }}
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
BENCHMARK_CONNSTR: ${{ steps.set-up-connstr.outputs.connstr }}
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
@@ -588,7 +588,7 @@ jobs:
save_perf_report: ${{ env.SAVE_PERF_REPORT }}
extra_params: -m remote_cluster --timeout 21600 -k test_pgbench_remote_simple_update
pg_version: ${{ env.PG_VERSION }}
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
BENCHMARK_CONNSTR: ${{ steps.set-up-connstr.outputs.connstr }}
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
@@ -603,7 +603,7 @@ jobs:
save_perf_report: ${{ env.SAVE_PERF_REPORT }}
extra_params: -m remote_cluster --timeout 21600 -k test_pgbench_remote_select_only
pg_version: ${{ env.PG_VERSION }}
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
BENCHMARK_CONNSTR: ${{ steps.set-up-connstr.outputs.connstr }}
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
@@ -621,7 +621,7 @@ jobs:
if: ${{ !cancelled() }}
uses: ./.github/actions/allure-report-generate
with:
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Post to a Slack channel
if: ${{ github.event.schedule && failure() }}
@@ -694,7 +694,7 @@ jobs:
name: neon-${{ runner.os }}-${{ runner.arch }}-release-artifact
path: /tmp/neon/
prefix: latest
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Set up Connection String
id: set-up-connstr
@@ -726,7 +726,7 @@ jobs:
save_perf_report: ${{ env.SAVE_PERF_REPORT }}
extra_params: -m remote_cluster --timeout 21600 -k test_pgvector_indexing
pg_version: ${{ env.PG_VERSION }}
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
PERF_TEST_RESULT_CONNSTR: "${{ secrets.PERF_TEST_RESULT_CONNSTR }}"
@@ -741,7 +741,7 @@ jobs:
save_perf_report: ${{ env.SAVE_PERF_REPORT }}
extra_params: -m remote_cluster --timeout 21600
pg_version: ${{ env.PG_VERSION }}
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
BENCHMARK_CONNSTR: ${{ steps.set-up-connstr.outputs.connstr }}
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
@@ -752,7 +752,7 @@ jobs:
if: ${{ !cancelled() }}
uses: ./.github/actions/allure-report-generate
with:
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Post to a Slack channel
if: ${{ github.event.schedule && failure() }}
@@ -828,7 +828,7 @@ jobs:
name: neon-${{ runner.os }}-${{ runner.arch }}-release-artifact
path: /tmp/neon/
prefix: latest
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Set up Connection String
id: set-up-connstr
@@ -871,7 +871,7 @@ jobs:
save_perf_report: ${{ env.SAVE_PERF_REPORT }}
extra_params: -m remote_cluster --timeout 43200 -k test_clickbench
pg_version: ${{ env.PG_VERSION }}
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
PERF_TEST_RESULT_CONNSTR: "${{ secrets.PERF_TEST_RESULT_CONNSTR }}"
@@ -885,7 +885,7 @@ jobs:
if: ${{ !cancelled() }}
uses: ./.github/actions/allure-report-generate
with:
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Post to a Slack channel
if: ${{ github.event.schedule && failure() }}
@@ -954,7 +954,7 @@ jobs:
name: neon-${{ runner.os }}-${{ runner.arch }}-release-artifact
path: /tmp/neon/
prefix: latest
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Get Connstring Secret Name
run: |
@@ -1003,7 +1003,7 @@ jobs:
save_perf_report: ${{ env.SAVE_PERF_REPORT }}
extra_params: -m remote_cluster --timeout 21600 -k test_tpch
pg_version: ${{ env.PG_VERSION }}
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
PERF_TEST_RESULT_CONNSTR: "${{ secrets.PERF_TEST_RESULT_CONNSTR }}"
@@ -1015,7 +1015,7 @@ jobs:
if: ${{ !cancelled() }}
uses: ./.github/actions/allure-report-generate
with:
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Post to a Slack channel
if: ${{ github.event.schedule && failure() }}
@@ -1078,7 +1078,7 @@ jobs:
name: neon-${{ runner.os }}-${{ runner.arch }}-release-artifact
path: /tmp/neon/
prefix: latest
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Set up Connection String
id: set-up-connstr
@@ -1121,7 +1121,7 @@ jobs:
save_perf_report: ${{ env.SAVE_PERF_REPORT }}
extra_params: -m remote_cluster --timeout 21600 -k test_user_examples
pg_version: ${{ env.PG_VERSION }}
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
PERF_TEST_RESULT_CONNSTR: "${{ secrets.PERF_TEST_RESULT_CONNSTR }}"
@@ -1132,7 +1132,7 @@ jobs:
if: ${{ !cancelled() }}
uses: ./.github/actions/allure-report-generate
with:
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Post to a Slack channel
if: ${{ github.event.schedule && failure() }}

View File

@@ -34,11 +34,10 @@ permissions:
jobs:
build-pgxn:
if: |
(inputs.pg_versions != '[]' || inputs.rebuild_everything) && (
contains(github.event.pull_request.labels.*.name, 'run-extra-build-macos') ||
contains(github.event.pull_request.labels.*.name, 'run-extra-build-*') ||
github.ref_name == 'main'
)
inputs.pg_versions != '[]' || inputs.rebuild_everything ||
contains(github.event.pull_request.labels.*.name, 'run-extra-build-macos') ||
contains(github.event.pull_request.labels.*.name, 'run-extra-build-*') ||
github.ref_name == 'main'
timeout-minutes: 30
runs-on: macos-15
strategy:
@@ -100,13 +99,21 @@ jobs:
run: |
make postgres-headers-${{ matrix.postgres-version }} -j$(sysctl -n hw.ncpu)
- name: Upload "pg_install/${{ matrix.postgres-version }}" artifact
uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4.6.2
with:
name: pg_install--${{ matrix.postgres-version }}
path: pg_install/${{ matrix.postgres-version }}
# The artifact is supposed to be used by the next job in the same workflow,
# so theres no need to store it for too long.
retention-days: 1
build-walproposer-lib:
if: |
(inputs.pg_versions != '[]' || inputs.rebuild_everything) && (
contains(github.event.pull_request.labels.*.name, 'run-extra-build-macos') ||
contains(github.event.pull_request.labels.*.name, 'run-extra-build-*') ||
github.ref_name == 'main'
)
inputs.pg_versions != '[]' || inputs.rebuild_everything ||
contains(github.event.pull_request.labels.*.name, 'run-extra-build-macos') ||
contains(github.event.pull_request.labels.*.name, 'run-extra-build-*') ||
github.ref_name == 'main'
timeout-minutes: 30
runs-on: macos-15
needs: [build-pgxn]
@@ -127,12 +134,11 @@ jobs:
id: pg_rev
run: echo pg_rev=$(git rev-parse HEAD:vendor/postgres-v17) | tee -a "${GITHUB_OUTPUT}"
- name: Cache postgres v17 build
id: cache_pg
uses: actions/cache@5a3ec84eff668545956fd18022155c47e93e2684 # v4.2.3
- name: Download "pg_install/v17" artifact
uses: actions/download-artifact@d3f86a106a0bac45b974a628896c90dbdf5c8093 # v4.3.0
with:
name: pg_install--v17
path: pg_install/v17
key: v1-${{ runner.os }}-${{ runner.arch }}-${{ env.BUILD_TYPE }}-pg-v17-${{ steps.pg_rev.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
- name: Cache walproposer-lib
id: cache_walproposer_lib
@@ -163,13 +169,21 @@ jobs:
run:
make walproposer-lib -j$(sysctl -n hw.ncpu)
- name: Upload "pg_install/build/walproposer-lib" artifact
uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4.6.2
with:
name: pg_install--build--walproposer-lib
path: pg_install/build/walproposer-lib
# The artifact is supposed to be used by the next job in the same workflow,
# so theres no need to store it for too long.
retention-days: 1
cargo-build:
if: |
(inputs.pg_versions != '[]' || inputs.rebuild_rust_code || inputs.rebuild_everything) && (
contains(github.event.pull_request.labels.*.name, 'run-extra-build-macos') ||
contains(github.event.pull_request.labels.*.name, 'run-extra-build-*') ||
github.ref_name == 'main'
)
inputs.pg_versions != '[]' || inputs.rebuild_rust_code || inputs.rebuild_everything ||
contains(github.event.pull_request.labels.*.name, 'run-extra-build-macos') ||
contains(github.event.pull_request.labels.*.name, 'run-extra-build-*') ||
github.ref_name == 'main'
timeout-minutes: 30
runs-on: macos-15
needs: [build-pgxn, build-walproposer-lib]
@@ -188,45 +202,43 @@ jobs:
with:
submodules: true
- name: Set pg v14 for caching
id: pg_rev_v14
run: echo pg_rev=$(git rev-parse HEAD:vendor/postgres-v14) | tee -a "${GITHUB_OUTPUT}"
- name: Set pg v15 for caching
id: pg_rev_v15
run: echo pg_rev=$(git rev-parse HEAD:vendor/postgres-v15) | tee -a "${GITHUB_OUTPUT}"
- name: Set pg v16 for caching
id: pg_rev_v16
run: echo pg_rev=$(git rev-parse HEAD:vendor/postgres-v16) | tee -a "${GITHUB_OUTPUT}"
- name: Set pg v17 for caching
id: pg_rev_v17
run: echo pg_rev=$(git rev-parse HEAD:vendor/postgres-v17) | tee -a "${GITHUB_OUTPUT}"
- name: Cache postgres v14 build
id: cache_pg
uses: actions/cache@5a3ec84eff668545956fd18022155c47e93e2684 # v4.2.3
- name: Download "pg_install/v14" artifact
uses: actions/download-artifact@d3f86a106a0bac45b974a628896c90dbdf5c8093 # v4.3.0
with:
name: pg_install--v14
path: pg_install/v14
key: v1-${{ runner.os }}-${{ runner.arch }}-${{ env.BUILD_TYPE }}-pg-v14-${{ steps.pg_rev_v14.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
- name: Cache postgres v15 build
id: cache_pg_v15
uses: actions/cache@5a3ec84eff668545956fd18022155c47e93e2684 # v4.2.3
with:
path: pg_install/v15
key: v1-${{ runner.os }}-${{ runner.arch }}-${{ env.BUILD_TYPE }}-pg-v15-${{ steps.pg_rev_v15.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
- name: Cache postgres v16 build
id: cache_pg_v16
uses: actions/cache@5a3ec84eff668545956fd18022155c47e93e2684 # v4.2.3
with:
path: pg_install/v16
key: v1-${{ runner.os }}-${{ runner.arch }}-${{ env.BUILD_TYPE }}-pg-v16-${{ steps.pg_rev_v16.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
- name: Cache postgres v17 build
id: cache_pg_v17
uses: actions/cache@5a3ec84eff668545956fd18022155c47e93e2684 # v4.2.3
with:
path: pg_install/v17
key: v1-${{ runner.os }}-${{ runner.arch }}-${{ env.BUILD_TYPE }}-pg-v17-${{ steps.pg_rev_v17.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
- name: Cache cargo deps (only for v17)
- name: Download "pg_install/v15" artifact
uses: actions/download-artifact@d3f86a106a0bac45b974a628896c90dbdf5c8093 # v4.3.0
with:
name: pg_install--v15
path: pg_install/v15
- name: Download "pg_install/v16" artifact
uses: actions/download-artifact@d3f86a106a0bac45b974a628896c90dbdf5c8093 # v4.3.0
with:
name: pg_install--v16
path: pg_install/v16
- name: Download "pg_install/v17" artifact
uses: actions/download-artifact@d3f86a106a0bac45b974a628896c90dbdf5c8093 # v4.3.0
with:
name: pg_install--v17
path: pg_install/v17
- name: Download "pg_install/build/walproposer-lib" artifact
uses: actions/download-artifact@d3f86a106a0bac45b974a628896c90dbdf5c8093 # v4.3.0
with:
name: pg_install--build--walproposer-lib
path: pg_install/build/walproposer-lib
# `actions/download-artifact` doesn't preserve permissions:
# https://github.com/actions/download-artifact?tab=readme-ov-file#permission-loss
- name: Make pg_install/v*/bin/* executable
run: |
chmod +x pg_install/v*/bin/*
- name: Cache cargo deps
uses: actions/cache@5a3ec84eff668545956fd18022155c47e93e2684 # v4.2.3
with:
path: |
@@ -236,13 +248,6 @@ jobs:
target
key: v1-${{ runner.os }}-${{ runner.arch }}-cargo-${{ hashFiles('./Cargo.lock') }}-${{ hashFiles('./rust-toolchain.toml') }}-rust
- name: Cache walproposer-lib
id: cache_walproposer_lib
uses: actions/cache@5a3ec84eff668545956fd18022155c47e93e2684 # v4.2.3
with:
path: pg_install/build/walproposer-lib
key: v1-${{ runner.os }}-${{ runner.arch }}-${{ env.BUILD_TYPE }}-walproposer_lib-v17-${{ steps.pg_rev_v17.outputs.pg_rev }}-${{ hashFiles('Makefile') }}
- name: Install build dependencies
run: |
brew install flex bison openssl protobuf icu4c
@@ -252,8 +257,8 @@ jobs:
echo 'LDFLAGS=-L/usr/local/opt/openssl@3/lib' >> $GITHUB_ENV
echo 'CPPFLAGS=-I/usr/local/opt/openssl@3/include' >> $GITHUB_ENV
- name: Run cargo build (only for v17)
- name: Run cargo build
run: cargo build --all --release -j$(sysctl -n hw.ncpu)
- name: Check that no warnings are produced (only for v17)
- name: Check that no warnings are produced
run: ./run_clippy.sh

View File

@@ -0,0 +1,120 @@
name: Build and Run Selected Test
on:
workflow_dispatch:
inputs:
test-selection:
description: 'Specification of selected test(s), as accepted by pytest -k'
required: true
type: string
run-count:
description: 'Number of test runs to perform'
required: true
type: number
archs:
description: 'Archs to run tests on, e. g.: ["x64", "arm64"]'
default: '["x64"]'
required: true
type: string
build-types:
description: 'Build types to run tests on, e. g.: ["debug", "release"]'
default: '["release"]'
required: true
type: string
pg-versions:
description: 'Postgres versions to use for testing, e.g,: [{"pg_version":"v16"}, {"pg_version":"v17"}])'
default: '[{"pg_version":"v17"}]'
required: true
type: string
defaults:
run:
shell: bash -euxo pipefail {0}
env:
RUST_BACKTRACE: 1
COPT: '-Werror'
jobs:
meta:
uses: ./.github/workflows/_meta.yml
with:
github-event-name: ${{ github.event_name }}
github-event-json: ${{ toJSON(github.event) }}
build-and-test-locally:
needs: [ meta ]
strategy:
fail-fast: false
matrix:
arch: ${{ fromJson(inputs.archs) }}
build-type: ${{ fromJson(inputs.build-types) }}
uses: ./.github/workflows/_build-and-test-locally.yml
with:
arch: ${{ matrix.arch }}
build-tools-image: ghcr.io/neondatabase/build-tools:pinned-bookworm
build-tag: ${{ needs.meta.outputs.build-tag }}
build-type: ${{ matrix.build-type }}
test-cfg: ${{ inputs.pg-versions }}
test-selection: ${{ inputs.test-selection }}
test-run-count: ${{ fromJson(inputs.run-count) }}
secrets: inherit
create-test-report:
needs: [ build-and-test-locally ]
if: ${{ !cancelled() }}
permissions:
id-token: write # aws-actions/configure-aws-credentials
statuses: write
contents: write
pull-requests: write
outputs:
report-url: ${{ steps.create-allure-report.outputs.report-url }}
runs-on: [ self-hosted, small ]
container:
image: ghcr.io/neondatabase/build-tools:pinned-bookworm
credentials:
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
options: --init
steps:
- name: Harden the runner (Audit all outbound calls)
uses: step-security/harden-runner@4d991eb9b905ef189e4c376166672c3f2f230481 # v2.11.0
with:
egress-policy: audit
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Create Allure report
if: ${{ !cancelled() }}
id: create-allure-report
uses: ./.github/actions/allure-report-generate
with:
store-test-results-into-db: true
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
REGRESS_TEST_RESULT_CONNSTR_NEW: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR_DEV }}
- uses: actions/github-script@v7
if: ${{ !cancelled() }}
with:
# Retry script for 5XX server errors: https://github.com/actions/github-script#retries
retries: 5
script: |
const report = {
reportUrl: "${{ steps.create-allure-report.outputs.report-url }}",
reportJsonUrl: "${{ steps.create-allure-report.outputs.report-json-url }}",
}
const coverage = {}
const script = require("./scripts/comment-test-report.js")
await script({
github,
context,
fetch,
report,
coverage,
})

View File

@@ -317,7 +317,7 @@ jobs:
extra_params: --splits 5 --group ${{ matrix.pytest_split_group }}
benchmark_durations: ${{ needs.get-benchmarks-durations.outputs.json }}
pg_version: v16
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
PERF_TEST_RESULT_CONNSTR: "${{ secrets.PERF_TEST_RESULT_CONNSTR }}"
@@ -384,7 +384,7 @@ jobs:
uses: ./.github/actions/allure-report-generate
with:
store-test-results-into-db: true
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
REGRESS_TEST_RESULT_CONNSTR_NEW: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR_NEW }}
@@ -451,14 +451,14 @@ jobs:
with:
name: neon-${{ runner.os }}-${{ runner.arch }}-${{ matrix.build_type }}-artifact
path: /tmp/neon
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Get coverage artifact
uses: ./.github/actions/download
with:
name: coverage-data-artifact
path: /tmp/coverage
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Merge coverage data
run: scripts/coverage "--profraw-prefix=$GITHUB_JOB" --dir=/tmp/coverage merge

View File

@@ -117,7 +117,7 @@ jobs:
uses: ./.github/actions/allure-report-generate
with:
store-test-results-into-db: true
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
REGRESS_TEST_RESULT_CONNSTR_NEW: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR_NEW }}

112
.github/workflows/cloud-extensions.yml vendored Normal file
View File

@@ -0,0 +1,112 @@
name: Cloud Extensions Test
on:
schedule:
# * is a special character in YAML so you have to quote this string
# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of the month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12 or JAN-DEC)
# │ │ │ │ ┌───────────── day of the week (0 - 6 or SUN-SAT)
- cron: '45 1 * * *' # run once a day, timezone is utc
workflow_dispatch: # adds ability to run this manually
inputs:
region_id:
description: 'Project region id. If not set, the default region will be used'
required: false
default: 'aws-us-east-2'
defaults:
run:
shell: bash -euxo pipefail {0}
permissions:
id-token: write # aws-actions/configure-aws-credentials
statuses: write
contents: write
jobs:
regress:
env:
POSTGRES_DISTRIB_DIR: /tmp/neon/pg_install
TEST_OUTPUT: /tmp/test_output
BUILD_TYPE: remote
strategy:
fail-fast: false
matrix:
pg-version: [16, 17]
runs-on: [ self-hosted, small ]
container:
# We use the neon-test-extensions image here as it contains the source code for the extensions.
image: ghcr.io/neondatabase/neon-test-extensions-v${{ matrix.pg-version }}:latest
credentials:
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
options: --init
steps:
- name: Harden the runner (Audit all outbound calls)
uses: step-security/harden-runner@4d991eb9b905ef189e4c376166672c3f2f230481 # v2.11.0
with:
egress-policy: audit
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Evaluate the settings
id: project-settings
run: |
if [[ $((${{ matrix.pg-version }})) -lt 17 ]]; then
ULID=ulid
else
ULID=pgx_ulid
fi
LIBS=timescaledb:rag_bge_small_en_v15,rag_jina_reranker_v1_tiny_en:$ULID
settings=$(jq -c -n --arg libs $LIBS '{preload_libraries:{use_defaults:false,enabled_libraries:($libs| split(":"))}}')
echo settings=$settings >> $GITHUB_OUTPUT
- name: Create Neon Project
id: create-neon-project
uses: ./.github/actions/neon-project-create
with:
region_id: ${{ inputs.region_id }}
postgres_version: ${{ matrix.pg-version }}
project_settings: ${{ steps.project-settings.outputs.settings }}
# We need these settings to get the expected output results.
# We cannot use the environment variables e.g. PGTZ due to
# https://github.com/neondatabase/neon/issues/1287
default_endpoint_settings: >
{
"pg_settings": {
"DateStyle": "Postgres,MDY",
"TimeZone": "America/Los_Angeles",
"compute_query_id": "off",
"neon.allow_unstable_extensions": "on"
}
}
api_key: ${{ secrets.NEON_STAGING_API_KEY }}
admin_api_key: ${{ secrets.NEON_STAGING_ADMIN_API_KEY }}
- name: Run the regression tests
run: /run-tests.sh -r /ext-src
env:
BENCHMARK_CONNSTR: ${{ steps.create-neon-project.outputs.dsn }}
SKIP: "pg_hint_plan-src,pg_repack-src,pg_cron-src,plpgsql_check-src"
- name: Delete Neon Project
if: ${{ always() }}
uses: ./.github/actions/neon-project-delete
with:
project_id: ${{ steps.create-neon-project.outputs.project_id }}
api_key: ${{ secrets.NEON_STAGING_API_KEY }}
- name: Post to a Slack channel
if: ${{ github.event.schedule && failure() }}
uses: slackapi/slack-github-action@fcfb566f8b0aab22203f066d80ca1d7e4b5d05b3 # v1.27.1
with:
channel-id: ${{ vars.SLACK_ON_CALL_QA_STAGING_STREAM }}
slack-message: |
Periodic extensions test on staging: ${{ job.status }}
<${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|GitHub Run>
env:
SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}

View File

@@ -89,7 +89,7 @@ jobs:
name: neon-${{ runner.os }}-${{ runner.arch }}-release-artifact
path: /tmp/neon/
prefix: latest
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Create a new branch
id: create-branch
@@ -105,7 +105,7 @@ jobs:
test_selection: cloud_regress
pg_version: ${{matrix.pg-version}}
extra_params: -m remote_cluster
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
BENCHMARK_CONNSTR: ${{steps.create-branch.outputs.dsn}}
@@ -122,7 +122,7 @@ jobs:
if: ${{ !cancelled() }}
uses: ./.github/actions/allure-report-generate
with:
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Post to a Slack channel
if: ${{ github.event.schedule && failure() }}

View File

@@ -32,7 +32,7 @@ jobs:
fail-fast: false # allow other variants to continue even if one fails
matrix:
include:
- target_project: new_empty_project_stripe_size_2048
- target_project: new_empty_project_stripe_size_2048
stripe_size: 2048 # 16 MiB
postgres_version: 16
disable_sharding: false
@@ -98,7 +98,7 @@ jobs:
name: neon-${{ runner.os }}-${{ runner.arch }}-release-artifact
path: /tmp/neon/
prefix: latest
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Create Neon Project
if: ${{ startsWith(matrix.target_project, 'new_empty_project') }}
@@ -110,10 +110,10 @@ jobs:
compute_units: '[7, 7]' # we want to test large compute here to avoid compute-side bottleneck
api_key: ${{ secrets.NEON_STAGING_API_KEY }}
shard_split_project: ${{ matrix.stripe_size != null && 'true' || 'false' }}
admin_api_key: ${{ secrets.NEON_STAGING_ADMIN_API_KEY }}
admin_api_key: ${{ secrets.NEON_STAGING_ADMIN_API_KEY }}
shard_count: 8
stripe_size: ${{ matrix.stripe_size }}
disable_sharding: ${{ matrix.disable_sharding }}
disable_sharding: ${{ matrix.disable_sharding }}
- name: Initialize Neon project
if: ${{ startsWith(matrix.target_project, 'new_empty_project') }}
@@ -171,7 +171,7 @@ jobs:
extra_params: -s -m remote_cluster --timeout 86400 -k test_ingest_performance_using_pgcopydb
pg_version: v${{ matrix.postgres_version }}
save_perf_report: true
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
BENCHMARK_INGEST_SOURCE_CONNSTR: ${{ secrets.BENCHMARK_INGEST_SOURCE_CONNSTR }}
TARGET_PROJECT_TYPE: ${{ matrix.target_project }}

View File

@@ -33,9 +33,9 @@ jobs:
fail-fast: false # allow other variants to continue even if one fails
matrix:
include:
- target: new_branch
- target: new_branch
custom_scripts: insert_webhooks.sql@200 select_any_webhook_with_skew.sql@300 select_recent_webhook.sql@397 select_prefetch_webhook.sql@3 IUD_one_transaction.sql@100
- target: reuse_branch
- target: reuse_branch
custom_scripts: insert_webhooks.sql@200 select_any_webhook_with_skew.sql@300 select_recent_webhook.sql@397 select_prefetch_webhook.sql@3 IUD_one_transaction.sql@100
max-parallel: 1 # we want to run each stripe size sequentially to be able to compare the results
permissions:
@@ -43,7 +43,7 @@ jobs:
statuses: write
id-token: write # aws-actions/configure-aws-credentials
env:
TEST_PG_BENCH_DURATIONS_MATRIX: "1h" # todo update to > 1 h
TEST_PG_BENCH_DURATIONS_MATRIX: "1h" # todo update to > 1 h
TEST_PGBENCH_CUSTOM_SCRIPTS: ${{ matrix.custom_scripts }}
POSTGRES_DISTRIB_DIR: /tmp/neon/pg_install
PG_VERSION: 16 # pre-determined by pre-determined project
@@ -85,7 +85,7 @@ jobs:
name: neon-${{ runner.os }}-${{ runner.arch }}-release-artifact
path: /tmp/neon/
prefix: latest
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Create Neon Branch for large tenant
if: ${{ matrix.target == 'new_branch' }}
@@ -129,7 +129,7 @@ jobs:
${PSQL} "${BENCHMARK_CONNSTR}" -c "SET statement_timeout = 0; DELETE FROM webhook.incoming_webhooks WHERE created_at > '2025-02-27 23:59:59+00';"
echo "$(date '+%Y-%m-%d %H:%M:%S') - Finished deleting rows in table webhook.incoming_webhooks from prior runs"
- name: Benchmark pgbench with custom-scripts
- name: Benchmark pgbench with custom-scripts
uses: ./.github/actions/run-python-test-set
with:
build_type: ${{ env.BUILD_TYPE }}
@@ -138,7 +138,7 @@ jobs:
save_perf_report: true
extra_params: -m remote_cluster --timeout 7200 -k test_perf_oltp_large_tenant_pgbench
pg_version: ${{ env.PG_VERSION }}
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
BENCHMARK_CONNSTR: ${{ steps.set-up-connstr.outputs.connstr }}
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
@@ -153,7 +153,7 @@ jobs:
save_perf_report: true
extra_params: -m remote_cluster --timeout 172800 -k test_perf_oltp_large_tenant_maintenance
pg_version: ${{ env.PG_VERSION }}
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
BENCHMARK_CONNSTR: ${{ steps.set-up-connstr.outputs.connstr_without_pooler }}
VIP_VAP_ACCESS_TOKEN: "${{ secrets.VIP_VAP_ACCESS_TOKEN }}"
@@ -179,8 +179,8 @@ jobs:
if: ${{ !cancelled() }}
uses: ./.github/actions/allure-report-generate
with:
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Post to a Slack channel
if: ${{ github.event.schedule && failure() }}
uses: slackapi/slack-github-action@fcfb566f8b0aab22203f066d80ca1d7e4b5d05b3 # v1.27.1

View File

@@ -69,10 +69,6 @@ jobs:
check-macos-build:
needs: [ check-permissions, files-changed ]
if: |
contains(github.event.pull_request.labels.*.name, 'run-extra-build-macos') ||
contains(github.event.pull_request.labels.*.name, 'run-extra-build-*') ||
github.ref_name == 'main'
uses: ./.github/workflows/build-macos.yml
with:
pg_versions: ${{ needs.files-changed.outputs.postgres_changes }}

View File

@@ -147,7 +147,7 @@ jobs:
if: ${{ !cancelled() }}
uses: ./.github/actions/allure-report-generate
with:
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Post to a Slack channel
if: ${{ github.event.schedule && failure() }}

View File

@@ -103,7 +103,7 @@ jobs:
name: neon-${{ runner.os }}-${{ runner.arch }}-release-artifact
path: /tmp/neon/
prefix: latest
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Create Neon Project
id: create-neon-project
@@ -122,7 +122,7 @@ jobs:
run_in_parallel: false
extra_params: -m remote_cluster
pg_version: ${{ env.DEFAULT_PG_VERSION }}
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
BENCHMARK_CONNSTR: ${{ steps.create-neon-project.outputs.dsn }}
@@ -139,7 +139,7 @@ jobs:
uses: ./.github/actions/allure-report-generate
with:
store-test-results-into-db: true
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
REGRESS_TEST_RESULT_CONNSTR_NEW: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR_NEW }}
@@ -178,7 +178,7 @@ jobs:
name: neon-${{ runner.os }}-${{ runner.arch }}-release-artifact
path: /tmp/neon/
prefix: latest
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Create Neon Project
id: create-neon-project
@@ -195,7 +195,7 @@ jobs:
run_in_parallel: false
extra_params: -m remote_cluster
pg_version: ${{ env.DEFAULT_PG_VERSION }}
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
BENCHMARK_CONNSTR: ${{ steps.create-neon-project.outputs.dsn }}
@@ -212,7 +212,7 @@ jobs:
uses: ./.github/actions/allure-report-generate
with:
store-test-results-into-db: true
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
REGRESS_TEST_RESULT_CONNSTR_NEW: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR_NEW }}

View File

@@ -66,7 +66,7 @@ jobs:
name: neon-${{ runner.os }}-${{ runner.arch }}-release-artifact
path: /tmp/neon/
prefix: latest
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Run tests
uses: ./.github/actions/run-python-test-set
@@ -76,7 +76,7 @@ jobs:
run_in_parallel: false
extra_params: -m remote_cluster
pg_version: ${{ matrix.pg-version }}
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
NEON_API_KEY: ${{ secrets.NEON_STAGING_API_KEY }}
RANDOM_SEED: ${{ inputs.random_seed }}
@@ -88,6 +88,6 @@ jobs:
uses: ./.github/actions/allure-report-generate
with:
store-test-results-into-db: true
aws-oicd-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
REGRESS_TEST_RESULT_CONNSTR_NEW: ${{ secrets.REGRESS_TEST_RESULT_CONNSTR_NEW }}

230
Cargo.lock generated
View File

@@ -253,6 +253,17 @@ version = "1.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a8ab6b55fe97976e46f91ddbed8d147d966475dc29b2032757ba47e02376fbc3"
[[package]]
name = "atomic_enum"
version = "0.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "99e1aca718ea7b89985790c94aad72d77533063fe00bc497bb79a7c2dae6a661"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.100",
]
[[package]]
name = "autocfg"
version = "1.1.0"
@@ -687,13 +698,40 @@ dependencies = [
"tracing",
]
[[package]]
name = "axum"
version = "0.7.9"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "edca88bc138befd0323b20752846e6587272d3b03b0343c8ea28a6f819e6e71f"
dependencies = [
"async-trait",
"axum-core 0.4.5",
"bytes",
"futures-util",
"http 1.1.0",
"http-body 1.0.0",
"http-body-util",
"itoa",
"matchit 0.7.3",
"memchr",
"mime",
"percent-encoding",
"pin-project-lite",
"rustversion",
"serde",
"sync_wrapper 1.0.1",
"tower 0.5.2",
"tower-layer",
"tower-service",
]
[[package]]
name = "axum"
version = "0.8.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6d6fd624c75e18b3b4c6b9caf42b1afe24437daaee904069137d8bab077be8b8"
dependencies = [
"axum-core",
"axum-core 0.5.0",
"base64 0.22.1",
"bytes",
"form_urlencoded",
@@ -704,7 +742,7 @@ dependencies = [
"hyper 1.4.1",
"hyper-util",
"itoa",
"matchit",
"matchit 0.8.4",
"memchr",
"mime",
"percent-encoding",
@@ -724,6 +762,26 @@ dependencies = [
"tracing",
]
[[package]]
name = "axum-core"
version = "0.4.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "09f2bd6146b97ae3359fa0cc6d6b376d9539582c7b4220f041a33ec24c226199"
dependencies = [
"async-trait",
"bytes",
"futures-util",
"http 1.1.0",
"http-body 1.0.0",
"http-body-util",
"mime",
"pin-project-lite",
"rustversion",
"sync_wrapper 1.0.1",
"tower-layer",
"tower-service",
]
[[package]]
name = "axum-core"
version = "0.5.0"
@@ -750,8 +808,8 @@ version = "0.10.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "460fc6f625a1f7705c6cf62d0d070794e94668988b1c38111baeec177c715f7b"
dependencies = [
"axum",
"axum-core",
"axum 0.8.1",
"axum-core 0.5.0",
"bytes",
"futures-util",
"headers",
@@ -1086,6 +1144,25 @@ version = "0.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "37b2a672a2cb129a2e41c10b1224bb368f9f37a2b16b612598138befd7b37eb5"
[[package]]
name = "cbindgen"
version = "0.28.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "eadd868a2ce9ca38de7eeafdcec9c7065ef89b42b32f0839278d55f35c54d1ff"
dependencies = [
"clap",
"heck 0.4.1",
"indexmap 2.9.0",
"log",
"proc-macro2",
"quote",
"serde",
"serde_json",
"syn 2.0.100",
"tempfile",
"toml",
]
[[package]]
name = "cc"
version = "1.2.16"
@@ -1206,7 +1283,7 @@ version = "4.5.18"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "4ac6a0c7b1a9e9a5186361f67dfa1b88213572f427fb9ab038efb2bd8c582dab"
dependencies = [
"heck",
"heck 0.5.0",
"proc-macro2",
"quote",
"syn 2.0.100",
@@ -1264,13 +1341,40 @@ dependencies = [
"unicode-width",
]
[[package]]
name = "communicator"
version = "0.1.0"
dependencies = [
"atomic_enum",
"bytes",
"cbindgen",
"http 1.1.0",
"libc",
"neonart",
"nix 0.27.1",
"pageserver_client_grpc",
"pageserver_data_api",
"prost 0.13.3",
"thiserror 1.0.69",
"tokio",
"tokio-epoll-uring",
"tokio-pipe",
"tonic",
"tracing",
"tracing-subscriber",
"uring-common",
"utils",
"zerocopy 0.8.24",
"zerocopy-derive 0.8.24",
]
[[package]]
name = "compute_api"
version = "0.1.0"
dependencies = [
"anyhow",
"chrono",
"indexmap 2.0.1",
"indexmap 2.9.0",
"jsonwebtoken",
"regex",
"remote_storage",
@@ -1288,7 +1392,7 @@ dependencies = [
"aws-sdk-kms",
"aws-sdk-s3",
"aws-smithy-types",
"axum",
"axum 0.8.1",
"axum-extra",
"base64 0.13.1",
"bytes",
@@ -1301,7 +1405,7 @@ dependencies = [
"flate2",
"futures",
"http 1.1.0",
"indexmap 2.0.1",
"indexmap 2.9.0",
"jsonwebtoken",
"metrics",
"nix 0.27.1",
@@ -1927,7 +2031,7 @@ checksum = "0892a17df262a24294c382f0d5997571006e7a4348b4327557c4ff1cd4a8bccc"
dependencies = [
"darling",
"either",
"heck",
"heck 0.5.0",
"proc-macro2",
"quote",
"syn 2.0.100",
@@ -2041,7 +2145,7 @@ name = "endpoint_storage"
version = "0.0.1"
dependencies = [
"anyhow",
"axum",
"axum 0.8.1",
"axum-extra",
"camino",
"camino-tempfile",
@@ -2588,7 +2692,7 @@ dependencies = [
"futures-sink",
"futures-util",
"http 0.2.9",
"indexmap 2.0.1",
"indexmap 2.9.0",
"slab",
"tokio",
"tokio-util",
@@ -2607,7 +2711,7 @@ dependencies = [
"futures-sink",
"futures-util",
"http 1.1.0",
"indexmap 2.0.1",
"indexmap 2.9.0",
"slab",
"tokio",
"tokio-util",
@@ -2703,6 +2807,12 @@ dependencies = [
"http 1.1.0",
]
[[package]]
name = "heck"
version = "0.4.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "95505c38b4572b2d910cecb0281560f54b440a19336cbbcb27bf6ce6adc6f5a8"
[[package]]
name = "heck"
version = "0.5.0"
@@ -3191,12 +3301,12 @@ dependencies = [
[[package]]
name = "indexmap"
version = "2.0.1"
version = "2.9.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ad227c3af19d4914570ad36d30409928b75967c298feb9ea1969db3a610bb14e"
checksum = "cea70ddb795996207ad57735b50c5982d8844f38ba9ee5f1aedcfb708a2aa11e"
dependencies = [
"equivalent",
"hashbrown 0.14.5",
"hashbrown 0.15.2",
"serde",
]
@@ -3219,7 +3329,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "232929e1d75fe899576a3d5c7416ad0d88dbfbb3c3d6aa00873a7408a50ddb88"
dependencies = [
"ahash",
"indexmap 2.0.1",
"indexmap 2.9.0",
"is-terminal",
"itoa",
"log",
@@ -3242,7 +3352,7 @@ dependencies = [
"crossbeam-utils",
"dashmap 6.1.0",
"env_logger",
"indexmap 2.0.1",
"indexmap 2.9.0",
"itoa",
"log",
"num-format",
@@ -3594,6 +3704,12 @@ dependencies = [
"regex-automata 0.1.10",
]
[[package]]
name = "matchit"
version = "0.7.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0e7465ac9959cc2b1404e8e2367b43684a6d13790fe23056cc8c6c5a6b7bcb94"
[[package]]
name = "matchit"
version = "0.8.4"
@@ -3639,7 +3755,7 @@ version = "0.0.22"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b9e6777fc80a575f9503d908c8b498782a6c3ee88a06cb416dc3941401e43b94"
dependencies = [
"heck",
"heck 0.5.0",
"proc-macro2",
"quote",
"syn 2.0.100",
@@ -3785,6 +3901,15 @@ version = "0.8.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e5ce46fe64a9d73be07dcbe690a38ce1b293be448fd8ce1e6c1b8062c9f72c6a"
[[package]]
name = "neonart"
version = "0.1.0"
dependencies = [
"rand 0.8.5",
"tracing",
"zerocopy 0.8.24",
]
[[package]]
name = "never-say-never"
version = "6.6.666"
@@ -4208,6 +4333,8 @@ dependencies = [
"humantime-serde",
"pageserver_api",
"pageserver_client",
"pageserver_client_grpc",
"pageserver_data_api",
"rand 0.8.5",
"reqwest",
"serde",
@@ -4284,6 +4411,8 @@ dependencies = [
"pageserver_api",
"pageserver_client",
"pageserver_compaction",
"pageserver_data_api",
"peekable",
"pem",
"pin-project-lite",
"postgres-protocol",
@@ -4295,6 +4424,7 @@ dependencies = [
"pprof",
"pq_proto",
"procfs",
"prost 0.13.3",
"rand 0.8.5",
"range-set-blaze",
"regex",
@@ -4326,6 +4456,7 @@ dependencies = [
"tokio-tar",
"tokio-util",
"toml_edit",
"tonic",
"tracing",
"tracing-utils",
"url",
@@ -4390,6 +4521,18 @@ dependencies = [
"workspace_hack",
]
[[package]]
name = "pageserver_client_grpc"
version = "0.1.0"
dependencies = [
"bytes",
"http 1.1.0",
"pageserver_data_api",
"thiserror 1.0.69",
"tonic",
"tracing",
]
[[package]]
name = "pageserver_compaction"
version = "0.1.0"
@@ -4413,6 +4556,17 @@ dependencies = [
"workspace_hack",
]
[[package]]
name = "pageserver_data_api"
version = "0.1.0"
dependencies = [
"prost 0.13.3",
"thiserror 1.0.69",
"tonic",
"tonic-build",
"utils",
]
[[package]]
name = "papaya"
version = "0.2.1"
@@ -4539,6 +4693,15 @@ dependencies = [
"sha2",
]
[[package]]
name = "peekable"
version = "0.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "225f9651e475709164f871dc2f5724956be59cb9edb055372ffeeab01ec2d20b"
dependencies = [
"smallvec",
]
[[package]]
name = "pem"
version = "3.0.3"
@@ -5010,7 +5173,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "22505a5c94da8e3b7c2996394d1c933236c4d743e81a410bcca4e6989fc066a4"
dependencies = [
"bytes",
"heck",
"heck 0.5.0",
"itertools 0.12.1",
"log",
"multimap",
@@ -5031,7 +5194,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0c1318b19085f08681016926435853bbf7858f9c082d0999b80550ff5d9abe15"
dependencies = [
"bytes",
"heck",
"heck 0.5.0",
"itertools 0.12.1",
"log",
"multimap",
@@ -5134,7 +5297,7 @@ dependencies = [
"hyper 0.14.30",
"hyper 1.4.1",
"hyper-util",
"indexmap 2.0.1",
"indexmap 2.9.0",
"ipnet",
"itertools 0.10.5",
"itoa",
@@ -5645,7 +5808,7 @@ dependencies = [
"async-trait",
"getrandom 0.2.11",
"http 1.1.0",
"matchit",
"matchit 0.8.4",
"opentelemetry",
"reqwest",
"reqwest-middleware",
@@ -6806,7 +6969,7 @@ version = "0.26.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "4c6bee85a5a24955dc440386795aa378cd9cf82acd5f764469152d2270e581be"
dependencies = [
"heck",
"heck 0.5.0",
"proc-macro2",
"quote",
"rustversion",
@@ -7231,6 +7394,16 @@ dependencies = [
"syn 2.0.100",
]
[[package]]
name = "tokio-pipe"
version = "0.2.12"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f213a84bffbd61b8fa0ba8a044b4bbe35d471d0b518867181e82bd5c15542784"
dependencies = [
"libc",
"tokio",
]
[[package]]
name = "tokio-postgres"
version = "0.7.10"
@@ -7413,7 +7586,7 @@ version = "0.22.14"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f21c7aaf97f1bd9ca9d4f9e73b0a6c74bd5afef56f2bc931943a6e1c37e04e38"
dependencies = [
"indexmap 2.0.1",
"indexmap 2.9.0",
"serde",
"serde_spanned",
"toml_datetime",
@@ -7426,9 +7599,13 @@ version = "0.12.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "877c5b330756d856ffcc4553ab34a5684481ade925ecc54bcd1bf02b1d0d4d52"
dependencies = [
"async-stream",
"async-trait",
"axum 0.7.9",
"base64 0.22.1",
"bytes",
"flate2",
"h2 0.4.4",
"http 1.1.0",
"http-body 1.0.0",
"http-body-util",
@@ -7440,6 +7617,7 @@ dependencies = [
"prost 0.13.3",
"rustls-native-certs 0.8.0",
"rustls-pemfile 2.1.1",
"socket2",
"tokio",
"tokio-rustls 0.26.0",
"tokio-stream",
@@ -7939,7 +8117,7 @@ name = "vm_monitor"
version = "0.1.0"
dependencies = [
"anyhow",
"axum",
"axum 0.8.1",
"cgroups-rs",
"clap",
"futures",
@@ -8449,7 +8627,7 @@ dependencies = [
"hyper 1.4.1",
"hyper-util",
"indexmap 1.9.3",
"indexmap 2.0.1",
"indexmap 2.9.0",
"itertools 0.12.1",
"lazy_static",
"libc",

View File

@@ -8,6 +8,7 @@ members = [
"pageserver/compaction",
"pageserver/ctl",
"pageserver/client",
"pageserver/client_grpc",
"pageserver/pagebench",
"proxy",
"safekeeper",
@@ -29,6 +30,7 @@ members = [
"libs/pq_proto",
"libs/tenant_size_model",
"libs/metrics",
"libs/neonart",
"libs/postgres_connection",
"libs/remote_storage",
"libs/tracing-utils",
@@ -41,6 +43,7 @@ members = [
"libs/proxy/postgres-types2",
"libs/proxy/tokio-postgres2",
"endpoint_storage",
"pgxn/neon/communicator",
]
[workspace.package]
@@ -142,6 +145,7 @@ parquet = { version = "53", default-features = false, features = ["zstd"] }
parquet_derive = "53"
pbkdf2 = { version = "0.12.1", features = ["simple", "std"] }
pem = "3.0.3"
peekable = "0.3.0"
pin-project-lite = "0.2"
pprof = { version = "0.14", features = ["criterion", "flamegraph", "frame-pointer", "prost-codec"] }
procfs = "0.16"
@@ -187,7 +191,6 @@ thiserror = "1.0"
tikv-jemallocator = { version = "0.6", features = ["profiling", "stats", "unprefixed_malloc_on_supported_platforms"] }
tikv-jemalloc-ctl = { version = "0.6", features = ["stats"] }
tokio = { version = "1.43.1", features = ["macros"] }
tokio-epoll-uring = { git = "https://github.com/neondatabase/tokio-epoll-uring.git" , branch = "main" }
tokio-io-timeout = "1.2.0"
tokio-postgres-rustls = "0.12.0"
tokio-rustls = { version = "0.26.0", default-features = false, features = ["tls12", "ring"]}
@@ -196,7 +199,7 @@ tokio-tar = "0.3"
tokio-util = { version = "0.7.10", features = ["io", "rt"] }
toml = "0.8"
toml_edit = "0.22"
tonic = {version = "0.12.3", default-features = false, features = ["channel", "tls", "tls-roots"]}
tonic = {version = "0.12.3", default-features = false, features = ["channel", "server", "tls", "tls-roots", "gzip"]}
tower = { version = "0.5.2", default-features = false }
tower-http = { version = "0.6.2", features = ["auth", "request-id", "trace"] }
@@ -228,6 +231,9 @@ x509-cert = { version = "0.2.5" }
env_logger = "0.11"
log = "0.4"
tokio-epoll-uring = { git = "https://github.com/neondatabase/tokio-epoll-uring.git" , branch = "main" }
uring-common = { git = "https://github.com/neondatabase/tokio-epoll-uring.git" , branch = "main" }
## Libraries from neondatabase/ git forks, ideally with changes to be upstreamed
postgres = { git = "https://github.com/neondatabase/rust-postgres.git", branch = "neon" }
postgres-protocol = { git = "https://github.com/neondatabase/rust-postgres.git", branch = "neon" }
@@ -245,9 +251,12 @@ compute_api = { version = "0.1", path = "./libs/compute_api/" }
consumption_metrics = { version = "0.1", path = "./libs/consumption_metrics/" }
http-utils = { version = "0.1", path = "./libs/http-utils/" }
metrics = { version = "0.1", path = "./libs/metrics/" }
neonart = { version = "0.1", path = "./libs/neonart/" }
pageserver = { path = "./pageserver" }
pageserver_api = { version = "0.1", path = "./libs/pageserver_api/" }
pageserver_client = { path = "./pageserver/client" }
pageserver_client_grpc = { path = "./pageserver/client_grpc" }
pageserver_data_api = { path = "./pageserver/data_api" }
pageserver_compaction = { version = "0.1", path = "./pageserver/compaction/" }
postgres_backend = { version = "0.1", path = "./libs/postgres_backend/" }
postgres_connection = { version = "0.1", path = "./libs/postgres_connection/" }
@@ -271,6 +280,7 @@ wal_decoder = { version = "0.1", path = "./libs/wal_decoder" }
workspace_hack = { version = "0.1", path = "./workspace_hack/" }
## Build dependencies
cbindgen = "0.28.0"
criterion = "0.5.1"
rcgen = "0.13"
rstest = "0.18"

View File

@@ -18,10 +18,12 @@ ifeq ($(BUILD_TYPE),release)
PG_LDFLAGS = $(LDFLAGS)
# Unfortunately, `--profile=...` is a nightly feature
CARGO_BUILD_FLAGS += --release
NEON_CARGO_ARTIFACT_TARGET_DIR = $(ROOT_PROJECT_DIR)/target/release
else ifeq ($(BUILD_TYPE),debug)
PG_CONFIGURE_OPTS = --enable-debug --with-openssl --enable-cassert --enable-depend
PG_CFLAGS += -O0 -g3 $(CFLAGS)
PG_LDFLAGS = $(LDFLAGS)
NEON_CARGO_ARTIFACT_TARGET_DIR = $(ROOT_PROJECT_DIR)/target/debug
else
$(error Bad build type '$(BUILD_TYPE)', see Makefile for options)
endif
@@ -180,11 +182,16 @@ postgres-check-%: postgres-%
.PHONY: neon-pg-ext-%
neon-pg-ext-%: postgres-%
+@echo "Compiling communicator $*"
$(CARGO_CMD_PREFIX) cargo build -p communicator $(CARGO_BUILD_FLAGS)
+@echo "Compiling neon $*"
mkdir -p $(POSTGRES_INSTALL_DIR)/build/neon-$*
$(MAKE) PG_CONFIG=$(POSTGRES_INSTALL_DIR)/$*/bin/pg_config COPT='$(COPT)' \
LIBCOMMUNICATOR_PATH=$(NEON_CARGO_ARTIFACT_TARGET_DIR) \
-C $(POSTGRES_INSTALL_DIR)/build/neon-$* \
-f $(ROOT_PROJECT_DIR)/pgxn/neon/Makefile install
+@echo "Compiling neon_walredo $*"
mkdir -p $(POSTGRES_INSTALL_DIR)/build/neon-walredo-$*
$(MAKE) PG_CONFIG=$(POSTGRES_INSTALL_DIR)/$*/bin/pg_config COPT='$(COPT)' \

View File

@@ -1800,8 +1800,8 @@ COPY compute/patches/pg_repack.patch /ext-src
RUN cd /ext-src/pg_repack-src && patch -p1 </ext-src/pg_repack.patch && rm -f /ext-src/pg_repack.patch
COPY --chmod=755 docker-compose/run-tests.sh /run-tests.sh
RUN apt-get update && apt-get install -y libtap-parser-sourcehandler-pgtap-perl\
&& apt clean && rm -rf /ext-src/*.tar.gz /var/lib/apt/lists/*
RUN apt-get update && apt-get install -y libtap-parser-sourcehandler-pgtap-perl jq \
&& apt clean && rm -rf /ext-src/*.tar.gz /ext-src/*.patch /var/lib/apt/lists/*
ENV PATH=/usr/local/pgsql/bin:$PATH
ENV PGHOST=compute
ENV PGPORT=55433

View File

@@ -65,7 +65,7 @@ for pg_version in ${TEST_VERSION_ONLY-14 15 16 17}; do
docker compose cp "${TMPDIR}/data" compute:/postgres/contrib/file_fdw/data
rm -rf "${TMPDIR}"
# Apply patches
docker compose exec -i neon-test-extensions bash -c "(cd /postgres && patch -p1)" <"../compute/patches/contrib_pg${pg_version}.patch"
docker compose exec -T neon-test-extensions bash -c "(cd /postgres && patch -p1)" <"../compute/patches/contrib_pg${pg_version}.patch"
# We are running tests now
rm -f testout.txt testout_contrib.txt
docker compose exec -e USE_PGXS=1 -e SKIP=timescaledb-src,rdkit-src,postgis-src,pg_jsonschema-src,kq_imcx-src,wal2json_2_5-src,rag_jina_reranker_v1_tiny_en-src,rag_bge_small_en_v15-src \

View File

@@ -0,0 +1,99 @@
# PostgreSQL Extensions for Testing
This directory contains PostgreSQL extensions used primarily for:
1. Testing extension upgrades between different Compute versions
2. Running regression tests with regular users (mostly for cloud instances)
## Directory Structure
Each extension directory follows a standard structure:
- `extension-name-src/` - Directory containing test files for the extension
- `test-upgrade.sh` - Script for testing upgrade scenarios
- `regular-test.sh` - Script for testing with regular users
- Additional test files depending on the extension
## Available Extensions
This directory includes the following extensions:
- `hll-src` - HyperLogLog, a fixed-size data structure for approximating cardinality
- `hypopg-src` - Extension to create hypothetical indexes
- `ip4r-src` - IPv4/v6 and subnet data types
- `pg_cron-src` - Run periodic jobs in PostgreSQL
- `pg_graphql-src` - GraphQL support for PostgreSQL
- `pg_hint_plan-src` - Execution plan hints
- `pg_ivm-src` - Incremental view maintenance
- `pg_jsonschema-src` - JSON Schema validation
- `pg_repack-src` - Reorganize tables with minimal locks
- `pg_roaringbitmap-src` - Roaring bitmap implementation
- `pg_semver-src` - Semantic version data type
- `pg_session_jwt-src` - JWT authentication for PostgreSQL
- `pg_tiktoken-src` - OpenAI Tiktoken tokenizer
- `pg_uuidv7-src` - UUIDv7 implementation for PostgreSQL
- `pgjwt-src` - JWT tokens for PostgreSQL
- `pgrag-src` - Retrieval Augmented Generation for PostgreSQL
- `pgtap-src` - Unit testing framework for PostgreSQL
- `pgvector-src` - Vector similarity search
- `pgx_ulid-src` - ULID data type
- `plv8-src` - JavaScript language for PostgreSQL stored procedures
- `postgresql-unit-src` - SI units for PostgreSQL
- `prefix-src` - Prefix matching for strings
- `rag_bge_small_en_v15-src` - BGE embedding model for RAG
- `rag_jina_reranker_v1_tiny_en-src` - Jina reranker model for RAG
- `rum-src` - RUM access method for text search
## Usage
### Extension Upgrade Testing
The extensions in this directory are used by the `test-upgrade.sh` script to test upgrading extensions between different versions of Neon Compute nodes. The script:
1. Creates a database with extensions installed on an old Compute version
2. Creates timelines for each extension
3. Switches to a new Compute version and tests the upgrade process
4. Verifies extension functionality after upgrade
### Regular User Testing
For testing with regular users (particularly for cloud instances), each extension directory typically contains a `regular-test.sh` script that:
1. Drops the database if it exists
2. Creates a fresh test database
3. Installs the extension
4. Runs regression tests
A note about pg_regress: Since pg_regress attempts to set `lc_messages` for the database by default, which is forbidden for regular users, we create databases manually and use the `--use-existing` option to bypass this limitation.
### CI Workflows
Two main workflows use these extensions:
1. **Cloud Extensions Test** - Tests extensions on Neon cloud projects
2. **Force Test Upgrading of Extension** - Tests upgrading extensions between different Compute versions
These workflows are integrated into the build-and-test pipeline through shell scripts:
- `docker_compose_test.sh` - Tests extensions in a Docker Compose environment
- `test_extensions_upgrade.sh` - Tests extension upgrades between different Compute versions
## Adding New Extensions
To add a new extension for testing:
1. Create a directory named `extension-name-src` in this directory
2. Add at minimum:
- `regular-test.sh` for testing with regular users
- If `regular-test.sh` doesn't exist, the system will look for `neon-test.sh`
- If neither exists, it will try to run `make installcheck`
- `test-upgrade.sh` is only needed if you want to test upgrade scenarios
3. Update the list of extensions in the `test_extensions_upgrade.sh` script if needed for upgrade testing
### Patching Extension Sources
If you need to patch the extension sources:
1. Place the patch file in the extension's directory
2. Apply the patch in the appropriate script (`test-upgrade.sh`, `neon-test.sh`, `regular-test.sh`, or `Makefile`)
3. The patch will be applied during the testing process

View File

@@ -0,0 +1,7 @@
#!/bin/sh
set -ex
cd "$(dirname ${0})"
PG_REGRESS=$(dirname "$(pg_config --pgxs)")/../test/regress/pg_regress
dropdb --if-exists contrib_regression
createdb contrib_regression
${PG_REGRESS} --use-existing --inputdir=./ --bindir='/usr/local/pgsql/bin' --dbname=contrib_regression setup add_agg agg_oob auto_sparse card_op cast_shape copy_binary cumulative_add_cardinality_correction cumulative_add_comprehensive_promotion cumulative_add_sparse_edge cumulative_add_sparse_random cumulative_add_sparse_step cumulative_union_comprehensive cumulative_union_explicit_explicit cumulative_union_explicit_promotion cumulative_union_probabilistic_probabilistic cumulative_union_sparse_full_representation cumulative_union_sparse_promotion cumulative_union_sparse_sparse disable_hashagg equal explicit_thresh hash hash_any meta_func murmur_bigint murmur_bytea nosparse notequal scalar_oob storedproc transaction typmod typmod_insert union_op

View File

@@ -0,0 +1,7 @@
#!/bin/sh
set -ex
cd "$(dirname ${0})"
dropdb --if-exists contrib_regression
createdb contrib_regression
PG_REGRESS=$(dirname "$(pg_config --pgxs)")/../test/regress/pg_regress
${PG_REGRESS} --inputdir=./ --bindir='/usr/local/pgsql/bin' --use-existing --inputdir=test --dbname=contrib_regression hypopg hypo_brin hypo_index_part hypo_include hypo_hash hypo_hide_index

View File

@@ -0,0 +1,7 @@
#!/bin/sh
set -ex
cd "$(dirname ${0})"
dropdb --if-exist contrib_regression
createdb contrib_regression
PG_REGRESS=$(dirname "$(pg_config --pgxs)")/../test/regress/pg_regress
${PG_REGRESS} --use-existing --inputdir=./ --bindir='/usr/local/pgsql/bin' --dbname=contrib_regression ip4r ip4r-softerr ip4r-v11

View File

@@ -0,0 +1,7 @@
#!/bin/sh
set -ex
cd "$(dirname ${0})"
dropdb --if-exist contrib_regression
createdb contrib_regression
PG_REGRESS=$(dirname "$(pg_config --pgxs)")/../test/regress/pg_regress
${PG_REGRESS} --use-existing --inputdir=./ --bindir='/usr/local/pgsql/bin' --dbname=contrib_regression pg_cron-test

View File

@@ -0,0 +1,23 @@
#!/bin/bash
set -ex
cd "$(dirname "${0}")"
PGXS="$(dirname "$(pg_config --pgxs)" )"
REGRESS="${PGXS}/../test/regress/pg_regress"
TESTDIR="test"
TESTS=$(ls "${TESTDIR}/sql" | sort )
TESTS=${TESTS//\.sql/}
TESTS=${TESTS/empty_mutations/}
TESTS=${TESTS/function_return_row_is_selectable/}
TESTS=${TESTS/issue_300/}
TESTS=${TESTS/permissions_connection_column/}
TESTS=${TESTS/permissions_functions/}
TESTS=${TESTS/permissions_node_column/}
TESTS=${TESTS/permissions_table_level/}
TESTS=${TESTS/permissions_types/}
TESTS=${TESTS/row_level_security/}
TESTS=${TESTS/sqli_connection/}
dropdb --if-exist contrib_regression
createdb contrib_regression
psql -v ON_ERROR_STOP=1 -f test/fixtures.sql -d contrib_regression
${REGRESS} --use-existing --dbname=contrib_regression --inputdir=${TESTDIR} ${TESTS}

View File

@@ -0,0 +1,7 @@
#!/bin/sh
set -ex
cd "$(dirname ${0})"
dropdb --if-exist contrib_regression
createdb contrib_regression
PG_REGRESS=$(dirname "$(pg_config --pgxs)")/../test/regress/pg_regress
${PG_REGRESS} --use-existing --inputdir=./ --bindir='/usr/local/pgsql/bin' --encoding=UTF8 --dbname=contrib_regression init base_plan pg_hint_plan ut-init ut-A ut-S ut-J ut-L ut-G ut-R ut-fdw ut-W ut-T ut-fini hints_anywhere plpgsql oldextversions

View File

@@ -0,0 +1,9 @@
#!/bin/sh
set -ex
dropdb --if-exist contrib_regression
createdb contrib_regression
cd "$(dirname ${0})"
PG_REGRESS=$(dirname "$(pg_config --pgxs)")/../test/regress/pg_regress
patch -p1 <regular.patch
${PG_REGRESS} --use-existing --inputdir=./ --bindir='/usr/local/pgsql/bin' --dbname=contrib_regression pg_ivm create_immv refresh_immv
patch -R -p1 <regular.patch

View File

@@ -0,0 +1,309 @@
diff --git a/expected/pg_ivm.out b/expected/pg_ivm.out
index e8798ee..4081680 100644
--- a/expected/pg_ivm.out
+++ b/expected/pg_ivm.out
@@ -1363,61 +1363,6 @@ SELECT * FROM mv ORDER BY i;
| 2 | 4 | 2 | 2 | 2
(1 row)
-ROLLBACK;
--- IMMV containing user defined type
-BEGIN;
-CREATE TYPE mytype;
-CREATE FUNCTION mytype_in(cstring)
- RETURNS mytype AS 'int4in'
- LANGUAGE INTERNAL STRICT IMMUTABLE;
-NOTICE: return type mytype is only a shell
-CREATE FUNCTION mytype_out(mytype)
- RETURNS cstring AS 'int4out'
- LANGUAGE INTERNAL STRICT IMMUTABLE;
-NOTICE: argument type mytype is only a shell
-CREATE TYPE mytype (
- LIKE = int4,
- INPUT = mytype_in,
- OUTPUT = mytype_out
-);
-CREATE FUNCTION mytype_eq(mytype, mytype)
- RETURNS bool AS 'int4eq'
- LANGUAGE INTERNAL STRICT IMMUTABLE;
-CREATE FUNCTION mytype_lt(mytype, mytype)
- RETURNS bool AS 'int4lt'
- LANGUAGE INTERNAL STRICT IMMUTABLE;
-CREATE FUNCTION mytype_cmp(mytype, mytype)
- RETURNS integer AS 'btint4cmp'
- LANGUAGE INTERNAL STRICT IMMUTABLE;
-CREATE OPERATOR = (
- leftarg = mytype, rightarg = mytype,
- procedure = mytype_eq);
-CREATE OPERATOR < (
- leftarg = mytype, rightarg = mytype,
- procedure = mytype_lt);
-CREATE OPERATOR CLASS mytype_ops
- DEFAULT FOR TYPE mytype USING btree AS
- OPERATOR 1 <,
- OPERATOR 3 = ,
- FUNCTION 1 mytype_cmp(mytype,mytype);
-CREATE TABLE t_mytype (x mytype);
-SELECT create_immv('mv_mytype',
- 'SELECT * FROM t_mytype');
-NOTICE: could not create an index on immv "mv_mytype" automatically
-DETAIL: This target list does not have all the primary key columns, or this view does not contain GROUP BY or DISTINCT clause.
-HINT: Create an index on the immv for efficient incremental maintenance.
- create_immv
--------------
- 0
-(1 row)
-
-INSERT INTO t_mytype VALUES ('1'::mytype);
-SELECT * FROM mv_mytype;
- x
----
- 1
-(1 row)
-
ROLLBACK;
-- outer join is not supported
SELECT create_immv('mv(a,b)',
@@ -1510,112 +1455,6 @@ SELECT create_immv('mv_ivm_only_values1', 'values(1)');
ERROR: VALUES is not supported on incrementally maintainable materialized view
SELECT create_immv('mv_ivm_only_values2', 'SELECT * FROM (values(1)) AS tmp');
ERROR: VALUES is not supported on incrementally maintainable materialized view
--- views containing base tables with Row Level Security
-DROP USER IF EXISTS ivm_admin;
-NOTICE: role "ivm_admin" does not exist, skipping
-DROP USER IF EXISTS ivm_user;
-NOTICE: role "ivm_user" does not exist, skipping
-CREATE USER ivm_admin;
-CREATE USER ivm_user;
---- create a table with RLS
-SET SESSION AUTHORIZATION ivm_admin;
-CREATE TABLE rls_tbl(id int, data text, owner name);
-INSERT INTO rls_tbl VALUES
- (1,'foo','ivm_user'),
- (2,'bar','postgres');
-CREATE TABLE num_tbl(id int, num text);
-INSERT INTO num_tbl VALUES
- (1,'one'),
- (2,'two'),
- (3,'three'),
- (4,'four'),
- (5,'five'),
- (6,'six');
---- Users can access only their own rows
-CREATE POLICY rls_tbl_policy ON rls_tbl FOR SELECT TO PUBLIC USING(owner = current_user);
-ALTER TABLE rls_tbl ENABLE ROW LEVEL SECURITY;
-GRANT ALL on rls_tbl TO PUBLIC;
-GRANT ALL on num_tbl TO PUBLIC;
---- create a view owned by ivm_user
-SET SESSION AUTHORIZATION ivm_user;
-SELECT create_immv('ivm_rls', 'SELECT * FROM rls_tbl');
-NOTICE: could not create an index on immv "ivm_rls" automatically
-DETAIL: This target list does not have all the primary key columns, or this view does not contain GROUP BY or DISTINCT clause.
-HINT: Create an index on the immv for efficient incremental maintenance.
- create_immv
--------------
- 1
-(1 row)
-
-SELECT id, data, owner FROM ivm_rls ORDER BY 1,2,3;
- id | data | owner
-----+------+----------
- 1 | foo | ivm_user
-(1 row)
-
-RESET SESSION AUTHORIZATION;
---- inserts rows owned by different users
-INSERT INTO rls_tbl VALUES
- (3,'baz','ivm_user'),
- (4,'qux','postgres');
-SELECT id, data, owner FROM ivm_rls ORDER BY 1,2,3;
- id | data | owner
-----+------+----------
- 1 | foo | ivm_user
- 3 | baz | ivm_user
-(2 rows)
-
---- combination of diffent kinds of commands
-WITH
- i AS (INSERT INTO rls_tbl VALUES(5,'quux','postgres'), (6,'corge','ivm_user')),
- u AS (UPDATE rls_tbl SET owner = 'postgres' WHERE id = 1),
- u2 AS (UPDATE rls_tbl SET owner = 'ivm_user' WHERE id = 2)
-SELECT;
---
-(1 row)
-
-SELECT id, data, owner FROM ivm_rls ORDER BY 1,2,3;
- id | data | owner
-----+-------+----------
- 2 | bar | ivm_user
- 3 | baz | ivm_user
- 6 | corge | ivm_user
-(3 rows)
-
----
-SET SESSION AUTHORIZATION ivm_user;
-SELECT create_immv('ivm_rls2', 'SELECT * FROM rls_tbl JOIN num_tbl USING(id)');
-NOTICE: could not create an index on immv "ivm_rls2" automatically
-DETAIL: This target list does not have all the primary key columns, or this view does not contain GROUP BY or DISTINCT clause.
-HINT: Create an index on the immv for efficient incremental maintenance.
- create_immv
--------------
- 3
-(1 row)
-
-RESET SESSION AUTHORIZATION;
-WITH
- x AS (UPDATE rls_tbl SET data = data || '_2' where id in (3,4)),
- y AS (UPDATE num_tbl SET num = num || '_2' where id in (3,4))
-SELECT;
---
-(1 row)
-
-SELECT * FROM ivm_rls2 ORDER BY 1,2,3;
- id | data | owner | num
-----+-------+----------+---------
- 2 | bar | ivm_user | two
- 3 | baz_2 | ivm_user | three_2
- 6 | corge | ivm_user | six
-(3 rows)
-
-DROP TABLE rls_tbl CASCADE;
-NOTICE: drop cascades to 2 other objects
-DETAIL: drop cascades to table ivm_rls
-drop cascades to table ivm_rls2
-DROP TABLE num_tbl CASCADE;
-DROP USER ivm_user;
-DROP USER ivm_admin;
-- automatic index creation
BEGIN;
CREATE TABLE base_a (i int primary key, j int);
diff --git a/sql/pg_ivm.sql b/sql/pg_ivm.sql
index d3c1a01..203213d 100644
--- a/sql/pg_ivm.sql
+++ b/sql/pg_ivm.sql
@@ -454,53 +454,6 @@ DELETE FROM base_t WHERE v = 5;
SELECT * FROM mv ORDER BY i;
ROLLBACK;
--- IMMV containing user defined type
-BEGIN;
-
-CREATE TYPE mytype;
-CREATE FUNCTION mytype_in(cstring)
- RETURNS mytype AS 'int4in'
- LANGUAGE INTERNAL STRICT IMMUTABLE;
-CREATE FUNCTION mytype_out(mytype)
- RETURNS cstring AS 'int4out'
- LANGUAGE INTERNAL STRICT IMMUTABLE;
-CREATE TYPE mytype (
- LIKE = int4,
- INPUT = mytype_in,
- OUTPUT = mytype_out
-);
-
-CREATE FUNCTION mytype_eq(mytype, mytype)
- RETURNS bool AS 'int4eq'
- LANGUAGE INTERNAL STRICT IMMUTABLE;
-CREATE FUNCTION mytype_lt(mytype, mytype)
- RETURNS bool AS 'int4lt'
- LANGUAGE INTERNAL STRICT IMMUTABLE;
-CREATE FUNCTION mytype_cmp(mytype, mytype)
- RETURNS integer AS 'btint4cmp'
- LANGUAGE INTERNAL STRICT IMMUTABLE;
-
-CREATE OPERATOR = (
- leftarg = mytype, rightarg = mytype,
- procedure = mytype_eq);
-CREATE OPERATOR < (
- leftarg = mytype, rightarg = mytype,
- procedure = mytype_lt);
-
-CREATE OPERATOR CLASS mytype_ops
- DEFAULT FOR TYPE mytype USING btree AS
- OPERATOR 1 <,
- OPERATOR 3 = ,
- FUNCTION 1 mytype_cmp(mytype,mytype);
-
-CREATE TABLE t_mytype (x mytype);
-SELECT create_immv('mv_mytype',
- 'SELECT * FROM t_mytype');
-INSERT INTO t_mytype VALUES ('1'::mytype);
-SELECT * FROM mv_mytype;
-
-ROLLBACK;
-
-- outer join is not supported
SELECT create_immv('mv(a,b)',
'SELECT a.i, b.i FROM mv_base_a a LEFT JOIN mv_base_b b ON a.i=b.i');
@@ -579,71 +532,6 @@ SELECT create_immv('mv_ivm31', 'SELECT sum(i)/sum(j) FROM mv_base_a');
SELECT create_immv('mv_ivm_only_values1', 'values(1)');
SELECT create_immv('mv_ivm_only_values2', 'SELECT * FROM (values(1)) AS tmp');
-
--- views containing base tables with Row Level Security
-DROP USER IF EXISTS ivm_admin;
-DROP USER IF EXISTS ivm_user;
-CREATE USER ivm_admin;
-CREATE USER ivm_user;
-
---- create a table with RLS
-SET SESSION AUTHORIZATION ivm_admin;
-CREATE TABLE rls_tbl(id int, data text, owner name);
-INSERT INTO rls_tbl VALUES
- (1,'foo','ivm_user'),
- (2,'bar','postgres');
-CREATE TABLE num_tbl(id int, num text);
-INSERT INTO num_tbl VALUES
- (1,'one'),
- (2,'two'),
- (3,'three'),
- (4,'four'),
- (5,'five'),
- (6,'six');
-
---- Users can access only their own rows
-CREATE POLICY rls_tbl_policy ON rls_tbl FOR SELECT TO PUBLIC USING(owner = current_user);
-ALTER TABLE rls_tbl ENABLE ROW LEVEL SECURITY;
-GRANT ALL on rls_tbl TO PUBLIC;
-GRANT ALL on num_tbl TO PUBLIC;
-
---- create a view owned by ivm_user
-SET SESSION AUTHORIZATION ivm_user;
-SELECT create_immv('ivm_rls', 'SELECT * FROM rls_tbl');
-SELECT id, data, owner FROM ivm_rls ORDER BY 1,2,3;
-RESET SESSION AUTHORIZATION;
-
---- inserts rows owned by different users
-INSERT INTO rls_tbl VALUES
- (3,'baz','ivm_user'),
- (4,'qux','postgres');
-SELECT id, data, owner FROM ivm_rls ORDER BY 1,2,3;
-
---- combination of diffent kinds of commands
-WITH
- i AS (INSERT INTO rls_tbl VALUES(5,'quux','postgres'), (6,'corge','ivm_user')),
- u AS (UPDATE rls_tbl SET owner = 'postgres' WHERE id = 1),
- u2 AS (UPDATE rls_tbl SET owner = 'ivm_user' WHERE id = 2)
-SELECT;
-SELECT id, data, owner FROM ivm_rls ORDER BY 1,2,3;
-
----
-SET SESSION AUTHORIZATION ivm_user;
-SELECT create_immv('ivm_rls2', 'SELECT * FROM rls_tbl JOIN num_tbl USING(id)');
-RESET SESSION AUTHORIZATION;
-
-WITH
- x AS (UPDATE rls_tbl SET data = data || '_2' where id in (3,4)),
- y AS (UPDATE num_tbl SET num = num || '_2' where id in (3,4))
-SELECT;
-SELECT * FROM ivm_rls2 ORDER BY 1,2,3;
-
-DROP TABLE rls_tbl CASCADE;
-DROP TABLE num_tbl CASCADE;
-
-DROP USER ivm_user;
-DROP USER ivm_admin;
-
-- automatic index creation
BEGIN;
CREATE TABLE base_a (i int primary key, j int);

View File

@@ -1,8 +1,13 @@
EXTENSION = pg_jsonschema
DATA = pg_jsonschema--1.0.sql
REGRESS = jsonschema_valid_api jsonschema_edge_cases
REGRESS_OPTS = --load-extension=pg_jsonschema
PG_CONFIG ?= pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)
PG_REGRESS := $(dir $(PGXS))../../src/test/regress/pg_regress
.PHONY installcheck:
installcheck:
dropdb --if-exists contrib_regression
createdb contrib_regression
psql -d contrib_regression -c "CREATE EXTENSION $(EXTENSION)"
$(PG_REGRESS) --use-existing --dbname=contrib_regression $(REGRESS)

View File

@@ -0,0 +1,7 @@
#!/bin/sh
set -ex
cd "$(dirname ${0})"
dropdb --if-exist contrib_regression
createdb contrib_regression
PG_REGRESS=$(dirname "$(pg_config --pgxs)")/../test/regress/pg_regress
${PG_REGRESS} --use-existing --inputdir=./ --bindir='/usr/local/pgsql/bin' --dbname=contrib_regression roaringbitmap

View File

@@ -0,0 +1,12 @@
#!/bin/bash
set -ex
# For v16 it's required to create a type which is impossible without superuser access
# do not run this test so far
if [[ "${PG_VERSION}" = v16 ]]; then
exit 0
fi
cd "$(dirname ${0})"
dropdb --if-exist contrib_regression
createdb contrib_regression
PG_REGRESS=$(dirname "$(pg_config --pgxs)")/../test/regress/pg_regress
${PG_REGRESS} --use-existing --inputdir=./ --bindir='/usr/local/pgsql/bin' --inputdir=test --dbname=contrib_regression base corpus

View File

@@ -6,4 +6,10 @@ export PGOPTIONS = -c pg_session_jwt.jwk={"crv":"Ed25519","kty":"OKP","x":"R_Abz
PG_CONFIG ?= pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)
PG_REGRESS := $(dir $(PGXS))../../src/test/regress/pg_regress
.PHONY installcheck:
installcheck:
dropdb --if-exists contrib_regression
createdb contrib_regression
psql -d contrib_regression -c "CREATE EXTENSION $(EXTENSION)"
$(PG_REGRESS) --use-existing --dbname=contrib_regression $(REGRESS)

View File

@@ -5,4 +5,6 @@ REGRESS = pg_tiktoken
installcheck: regression-test
regression-test:
$(PG_REGRESS) --inputdir=. --outputdir=. --dbname=contrib_regression $(REGRESS)
dropdb --if-exists contrib_regression
createdb contrib_regression
$(PG_REGRESS) --inputdir=. --outputdir=. --use-existing --dbname=contrib_regression $(REGRESS)

View File

@@ -0,0 +1,7 @@
#!/bin/sh
set -ex
cd "$(dirname "${0}")"
dropdb --if-exist contrib_regression
createdb contrib_regression
PG_REGRESS=$(dirname "$(pg_config --pgxs)")/../test/regress/pg_regress
${PG_REGRESS} --use-existing --inputdir=./ --bindir='/usr/local/pgsql/bin' --inputdir=test --dbname=contrib_regression 001_setup 002_uuid_generate_v7 003_uuid_v7_to_timestamptz 004_uuid_timestamptz_to_v7 005_uuid_v7_to_timestamp 006_uuid_timestamp_to_v7

View File

@@ -1,4 +1,6 @@
#!/bin/bash
set -ex
cd "$(dirname "${0}")"
pg_prove test.sql
dropdb --if-exists contrib_regression
createdb contrib_regression
pg_prove -d contrib_regression test.sql

View File

@@ -0,0 +1,8 @@
#!/bin/sh
set -ex
cd "$(dirname "${0}")"
dropdb --if-exist contrib_regression
createdb contrib_regression
psql -d contrib_regression -c "CREATE EXTENSION vector" -c "CREATE EXTENSION rag"
PG_REGRESS=$(dirname "$(pg_config --pgxs)")/../test/regress/pg_regress
${PG_REGRESS} --inputdir=./ --bindir='/usr/local/pgsql/bin' --use-existing --load-extension=vector --load-extension=rag --dbname=contrib_regression basic_functions text_processing api_keys chunking_functions document_processing embedding_api_functions voyageai_functions

View File

@@ -0,0 +1,10 @@
#!/bin/sh
set -ex
cd "$(dirname ${0})"
make installcheck || true
dropdb --if-exist contrib_regression
createdb contrib_regression
PG_REGRESS=$(dirname "$(pg_config --pgxs)")/../test/regress/pg_regress
sed -i '/hastap/d' test/build/run.sch
sed -Ei 's/\b(aretap|enumtap|ownership|privs|usergroup)\b//g' test/build/run.sch
${PG_REGRESS} --use-existing --dbname=contrib_regression --inputdir=./ --bindir='/usr/local/pgsql/bin' --inputdir=test --max-connections=879 --schedule test/schedule/main.sch --schedule test/build/run.sch

View File

@@ -0,0 +1,8 @@
#!/bin/sh
set -ex
cd "$(dirname ${0})"
dropdb --if-exist contrib_regression
createdb contrib_regression
psql -d contrib_regression -c "CREATE EXTENSION vector"
PG_REGRESS=$(dirname "$(pg_config --pgxs)")/../test/regress/pg_regress
${PG_REGRESS} --inputdir=./ --bindir='/usr/local/pgsql/bin' --inputdir=test --use-existing --dbname=contrib_regression bit btree cast copy halfvec hnsw_bit hnsw_halfvec hnsw_sparsevec hnsw_vector ivfflat_bit ivfflat_halfvec ivfflat_vector sparsevec vector_type

View File

@@ -4,13 +4,21 @@ PGFILEDESC = "pgx_ulid - ULID type for PostgreSQL"
PG_CONFIG ?= pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
PG_REGRESS = $(dir $(PGXS))/../../src/test/regress/pg_regress
PG_MAJOR_VERSION := $(word 2, $(subst ., , $(shell $(PG_CONFIG) --version)))
ifeq ($(shell test $(PG_MAJOR_VERSION) -lt 17; echo $$?),0)
REGRESS_OPTS = --load-extension=ulid
REGRESS = 00_ulid_generation 01_ulid_conversions 03_ulid_errors
EXTNAME = ulid
else
REGRESS_OPTS = --load-extension=pgx_ulid
REGRESS = 00_ulid_generation 01_ulid_conversions 02_ulid_conversions 03_ulid_errors
EXTNAME = pgx_ulid
endif
include $(PGXS)
.PHONY: installcheck
installcheck: regression-test
regression-test:
dropdb --if-exists contrib_regression
createdb contrib_regression
psql -d contrib_regression -c "CREATE EXTENSION $(EXTNAME)"
$(PG_REGRESS) --inputdir=. --outputdir=. --use-existing --dbname=contrib_regression $(REGRESS)

View File

@@ -0,0 +1,12 @@
#!/bin/bash
set -ex
cd "$(dirname ${0})"
dropdb --if-exist contrib_regression
createdb contrib_regression
PG_REGRESS=$(dirname "$(pg_config --pgxs)")/../test/regress/pg_regress
REGRESS="$(make -n installcheck | awk '{print substr($0,index($0,"init-extension"));}')"
REGRESS="${REGRESS/startup_perms/}"
REGRESS="${REGRESS/startup /}"
REGRESS="${REGRESS/find_function_perms/}"
REGRESS="${REGRESS/guc/}"
${PG_REGRESS} --inputdir=./ --bindir='/usr/local/pgsql/bin' --use-existing --dbname=contrib_regression ${REGRESS}

View File

@@ -0,0 +1,7 @@
#!/bin/sh
set -ex
cd "$(dirname ${0})"
dropdb --if-exist contrib_regression
createdb contrib_regression
PG_REGRESS=$(dirname "$(pg_config --pgxs)")/../test/regress/pg_regress
${PG_REGRESS} --inputdir=./ --bindir='/usr/local/pgsql/bin' --use-existing --dbname=contrib_regression extension tables unit binary unicode prefix units time temperature functions language_functions round derived compare aggregate iec custom crosstab convert

View File

@@ -0,0 +1,7 @@
#!/bin/sh
set -ex
cd "$(dirname ${0})"
dropdb --if-exist contrib_regression
createdb contrib_regression
PG_REGRESS=$(dirname "$(pg_config --pgxs)")/../test/regress/pg_regress
${PG_REGRESS} --use-existing --inputdir=./ --bindir='/usr/local/pgsql/bin' --dbname=contrib_regression create_extension prefix falcon explain queries

View File

@@ -3,8 +3,13 @@ MODULE_big = rag_bge_small_en_v15
OBJS = $(patsubst %.rs,%.o,$(wildcard src/*.rs))
REGRESS = basic_functions embedding_functions basic_functions_enhanced embedding_functions_enhanced
REGRESS_OPTS = --load-extension=vector --load-extension=rag_bge_small_en_v15
PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)
PG_REGRESS := $(dir $(PGXS))../../src/test/regress/pg_regress
.PHONY installcheck:
installcheck:
dropdb --if-exists contrib_regression
createdb contrib_regression
psql -d contrib_regression -c "CREATE EXTENSION vector" -c "CREATE EXTENSION rag_bge_small_en_v15"
$(PG_REGRESS) --use-existing --dbname=contrib_regression $(REGRESS)

View File

@@ -3,8 +3,13 @@ MODULE_big = rag_jina_reranker_v1_tiny_en
OBJS = $(patsubst %.rs,%.o,$(wildcard src/*.rs))
REGRESS = reranking_functions reranking_functions_enhanced
REGRESS_OPTS = --load-extension=vector --load-extension=rag_jina_reranker_v1_tiny_en
PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)
PG_REGRESS := $(dir $(PGXS))../../src/test/regress/pg_regress
.PHONY installcheck:
installcheck:
dropdb --if-exists contrib_regression
createdb contrib_regression
psql -d contrib_regression -c "CREATE EXTENSION vector" -c "CREATE EXTENSION rag_jina_reranker_v1_tiny_en"
$(PG_REGRESS) --use-existing --dbname=contrib_regression $(REGRESS)

View File

@@ -1,25 +1,27 @@
-- Reranking function tests
SELECT rag_jina_reranker_v1_tiny_en.rerank_distance('the cat sat on the mat', 'the baboon played with the balloon');
rerank_distance
SELECT ROUND(rag_jina_reranker_v1_tiny_en.rerank_distance('the cat sat on the mat', 'the baboon played with the balloon')::NUMERIC,4);
round
--------
0.8989
(1 row)
SELECT ARRAY(SELECT ROUND(x::NUMERIC,4) FROM unnest(rag_jina_reranker_v1_tiny_en.rerank_distance('the cat sat on the mat',
ARRAY['the baboon played with the balloon', 'the tanks fired at the buildings'])) AS x);
array
-----------------
0.8989152
{0.8989,1.3018}
(1 row)
SELECT rag_jina_reranker_v1_tiny_en.rerank_distance('the cat sat on the mat', ARRAY['the baboon played with the balloon', 'the tanks fired at the buildings']);
rerank_distance
-----------------------
{0.8989152,1.3018152}
SELECT ROUND(rag_jina_reranker_v1_tiny_en.rerank_score('the cat sat on the mat', 'the baboon played with the balloon')::NUMERIC,4);
round
---------
-0.8989
(1 row)
SELECT rag_jina_reranker_v1_tiny_en.rerank_score('the cat sat on the mat', 'the baboon played with the balloon');
rerank_score
--------------
-0.8989152
(1 row)
SELECT rag_jina_reranker_v1_tiny_en.rerank_score('the cat sat on the mat', ARRAY['the baboon played with the balloon', 'the tanks fired at the buildings']);
rerank_score
-------------------------
{-0.8989152,-1.3018152}
SELECT ARRAY(SELECT ROUND(x::NUMERIC,4) FROM unnest(rag_jina_reranker_v1_tiny_en.rerank_score('the cat sat on the mat',
ARRAY['the baboon played with the balloon', 'the tanks fired at the buildings'])) as x);
array
-------------------
{-0.8989,-1.3018}
(1 row)

View File

@@ -1,41 +1,41 @@
-- Reranking function tests - single passage
SELECT rag_jina_reranker_v1_tiny_en.rerank_distance('the cat sat on the mat', 'the baboon played with the balloon');
rerank_distance
-----------------
0.8989152
SELECT ROUND(rag_jina_reranker_v1_tiny_en.rerank_distance('the cat sat on the mat', 'the baboon played with the balloon')::NUMERIC,4);
round
--------
0.8989
(1 row)
SELECT rag_jina_reranker_v1_tiny_en.rerank_distance('the cat sat on the mat', 'the tanks fired at the buildings');
rerank_distance
-----------------
1.3018152
SELECT ROUND(rag_jina_reranker_v1_tiny_en.rerank_distance('the cat sat on the mat', 'the tanks fired at the buildings')::NUMERIC,4);
round
--------
1.3018
(1 row)
SELECT rag_jina_reranker_v1_tiny_en.rerank_distance('query about cats', 'information about felines');
rerank_distance
-----------------
1.3133051
SELECT ROUND(rag_jina_reranker_v1_tiny_en.rerank_distance('query about cats', 'information about felines')::NUMERIC,4);
round
--------
1.3133
(1 row)
SELECT rag_jina_reranker_v1_tiny_en.rerank_distance('', 'empty query test');
rerank_distance
-----------------
0.7075559
SELECT ROUND(rag_jina_reranker_v1_tiny_en.rerank_distance('', 'empty query test')::NUMERIC,4);
round
--------
0.7076
(1 row)
-- Reranking function tests - array of passages
SELECT rag_jina_reranker_v1_tiny_en.rerank_distance('the cat sat on the mat',
ARRAY['the baboon played with the balloon', 'the tanks fired at the buildings']);
rerank_distance
-----------------------
{0.8989152,1.3018152}
SELECT ARRAY(SELECT ROUND(x::NUMERIC,4) FROM unnest(rag_jina_reranker_v1_tiny_en.rerank_distance('the cat sat on the mat',
ARRAY['the baboon played with the balloon', 'the tanks fired at the buildings'])) AS x);
array
-----------------
{0.8989,1.3018}
(1 row)
SELECT rag_jina_reranker_v1_tiny_en.rerank_distance('query about programming',
ARRAY['Python is a programming language', 'Java is also a programming language', 'SQL is used for databases']);
rerank_distance
------------------------------------
{0.16591403,0.33475375,0.10132827}
SELECT ARRAY(SELECT ROUND(x::NUMERIC,4) FROM unnest(rag_jina_reranker_v1_tiny_en.rerank_distance('query about programming',
ARRAY['Python is a programming language', 'Java is also a programming language', 'SQL is used for databases'])) AS x);
array
------------------------
{0.1659,0.3348,0.1013}
(1 row)
SELECT rag_jina_reranker_v1_tiny_en.rerank_distance('empty array test', ARRAY[]::text[]);
@@ -45,43 +45,43 @@ SELECT rag_jina_reranker_v1_tiny_en.rerank_distance('empty array test', ARRAY[]:
(1 row)
-- Reranking score function tests - single passage
SELECT rag_jina_reranker_v1_tiny_en.rerank_score('the cat sat on the mat', 'the baboon played with the balloon');
rerank_score
--------------
-0.8989152
SELECT ROUND(rag_jina_reranker_v1_tiny_en.rerank_score('the cat sat on the mat', 'the baboon played with the balloon')::NUMERIC,4);
round
---------
-0.8989
(1 row)
SELECT rag_jina_reranker_v1_tiny_en.rerank_score('the cat sat on the mat', 'the tanks fired at the buildings');
rerank_score
--------------
-1.3018152
SELECT ROUND(rag_jina_reranker_v1_tiny_en.rerank_score('the cat sat on the mat', 'the tanks fired at the buildings')::NUMERIC,4);
round
---------
-1.3018
(1 row)
SELECT rag_jina_reranker_v1_tiny_en.rerank_score('query about cats', 'information about felines');
rerank_score
--------------
-1.3133051
SELECT ROUND(rag_jina_reranker_v1_tiny_en.rerank_score('query about cats', 'information about felines')::NUMERIC,4);
round
---------
-1.3133
(1 row)
SELECT rag_jina_reranker_v1_tiny_en.rerank_score('', 'empty query test');
rerank_score
--------------
-0.7075559
SELECT ROUND(rag_jina_reranker_v1_tiny_en.rerank_score('', 'empty query test')::NUMERIC,4);
round
---------
-0.7076
(1 row)
-- Reranking score function tests - array of passages
SELECT rag_jina_reranker_v1_tiny_en.rerank_score('the cat sat on the mat',
ARRAY['the baboon played with the balloon', 'the tanks fired at the buildings']);
rerank_score
-------------------------
{-0.8989152,-1.3018152}
SELECT ARRAY(SELECT ROUND(x::NUMERIC,4) FROM unnest(rag_jina_reranker_v1_tiny_en.rerank_score('the cat sat on the mat',
ARRAY['the baboon played with the balloon', 'the tanks fired at the buildings'])) AS x);
array
-------------------
{-0.8989,-1.3018}
(1 row)
SELECT rag_jina_reranker_v1_tiny_en.rerank_score('query about programming',
ARRAY['Python is a programming language', 'Java is also a programming language', 'SQL is used for databases']);
rerank_score
---------------------------------------
{-0.16591403,-0.33475375,-0.10132827}
SELECT ARRAY(SELECT ROUND(x::NUMERIC,4) FROM unnest(rag_jina_reranker_v1_tiny_en.rerank_score('query about programming',
ARRAY['Python is a programming language', 'Java is also a programming language', 'SQL is used for databases'])) AS x);
array
---------------------------
{-0.1659,-0.3348,-0.1013}
(1 row)
SELECT rag_jina_reranker_v1_tiny_en.rerank_score('empty array test', ARRAY[]::text[]);

View File

@@ -1,8 +1,10 @@
-- Reranking function tests
SELECT rag_jina_reranker_v1_tiny_en.rerank_distance('the cat sat on the mat', 'the baboon played with the balloon');
SELECT ROUND(rag_jina_reranker_v1_tiny_en.rerank_distance('the cat sat on the mat', 'the baboon played with the balloon')::NUMERIC,4);
SELECT rag_jina_reranker_v1_tiny_en.rerank_distance('the cat sat on the mat', ARRAY['the baboon played with the balloon', 'the tanks fired at the buildings']);
SELECT ARRAY(SELECT ROUND(x::NUMERIC,4) FROM unnest(rag_jina_reranker_v1_tiny_en.rerank_distance('the cat sat on the mat',
ARRAY['the baboon played with the balloon', 'the tanks fired at the buildings'])) AS x);
SELECT rag_jina_reranker_v1_tiny_en.rerank_score('the cat sat on the mat', 'the baboon played with the balloon');
SELECT ROUND(rag_jina_reranker_v1_tiny_en.rerank_score('the cat sat on the mat', 'the baboon played with the balloon')::NUMERIC,4);
SELECT rag_jina_reranker_v1_tiny_en.rerank_score('the cat sat on the mat', ARRAY['the baboon played with the balloon', 'the tanks fired at the buildings']);
SELECT ARRAY(SELECT ROUND(x::NUMERIC,4) FROM unnest(rag_jina_reranker_v1_tiny_en.rerank_score('the cat sat on the mat',
ARRAY['the baboon played with the balloon', 'the tanks fired at the buildings'])) as x);

View File

@@ -1,35 +1,35 @@
-- Reranking function tests - single passage
SELECT rag_jina_reranker_v1_tiny_en.rerank_distance('the cat sat on the mat', 'the baboon played with the balloon');
SELECT ROUND(rag_jina_reranker_v1_tiny_en.rerank_distance('the cat sat on the mat', 'the baboon played with the balloon')::NUMERIC,4);
SELECT rag_jina_reranker_v1_tiny_en.rerank_distance('the cat sat on the mat', 'the tanks fired at the buildings');
SELECT ROUND(rag_jina_reranker_v1_tiny_en.rerank_distance('the cat sat on the mat', 'the tanks fired at the buildings')::NUMERIC,4);
SELECT rag_jina_reranker_v1_tiny_en.rerank_distance('query about cats', 'information about felines');
SELECT ROUND(rag_jina_reranker_v1_tiny_en.rerank_distance('query about cats', 'information about felines')::NUMERIC,4);
SELECT rag_jina_reranker_v1_tiny_en.rerank_distance('', 'empty query test');
SELECT ROUND(rag_jina_reranker_v1_tiny_en.rerank_distance('', 'empty query test')::NUMERIC,4);
-- Reranking function tests - array of passages
SELECT rag_jina_reranker_v1_tiny_en.rerank_distance('the cat sat on the mat',
ARRAY['the baboon played with the balloon', 'the tanks fired at the buildings']);
SELECT ARRAY(SELECT ROUND(x::NUMERIC,4) FROM unnest(rag_jina_reranker_v1_tiny_en.rerank_distance('the cat sat on the mat',
ARRAY['the baboon played with the balloon', 'the tanks fired at the buildings'])) AS x);
SELECT rag_jina_reranker_v1_tiny_en.rerank_distance('query about programming',
ARRAY['Python is a programming language', 'Java is also a programming language', 'SQL is used for databases']);
SELECT ARRAY(SELECT ROUND(x::NUMERIC,4) FROM unnest(rag_jina_reranker_v1_tiny_en.rerank_distance('query about programming',
ARRAY['Python is a programming language', 'Java is also a programming language', 'SQL is used for databases'])) AS x);
SELECT rag_jina_reranker_v1_tiny_en.rerank_distance('empty array test', ARRAY[]::text[]);
-- Reranking score function tests - single passage
SELECT rag_jina_reranker_v1_tiny_en.rerank_score('the cat sat on the mat', 'the baboon played with the balloon');
SELECT ROUND(rag_jina_reranker_v1_tiny_en.rerank_score('the cat sat on the mat', 'the baboon played with the balloon')::NUMERIC,4);
SELECT rag_jina_reranker_v1_tiny_en.rerank_score('the cat sat on the mat', 'the tanks fired at the buildings');
SELECT ROUND(rag_jina_reranker_v1_tiny_en.rerank_score('the cat sat on the mat', 'the tanks fired at the buildings')::NUMERIC,4);
SELECT rag_jina_reranker_v1_tiny_en.rerank_score('query about cats', 'information about felines');
SELECT ROUND(rag_jina_reranker_v1_tiny_en.rerank_score('query about cats', 'information about felines')::NUMERIC,4);
SELECT rag_jina_reranker_v1_tiny_en.rerank_score('', 'empty query test');
SELECT ROUND(rag_jina_reranker_v1_tiny_en.rerank_score('', 'empty query test')::NUMERIC,4);
-- Reranking score function tests - array of passages
SELECT rag_jina_reranker_v1_tiny_en.rerank_score('the cat sat on the mat',
ARRAY['the baboon played with the balloon', 'the tanks fired at the buildings']);
SELECT ARRAY(SELECT ROUND(x::NUMERIC,4) FROM unnest(rag_jina_reranker_v1_tiny_en.rerank_score('the cat sat on the mat',
ARRAY['the baboon played with the balloon', 'the tanks fired at the buildings'])) AS x);
SELECT rag_jina_reranker_v1_tiny_en.rerank_score('query about programming',
ARRAY['Python is a programming language', 'Java is also a programming language', 'SQL is used for databases']);
SELECT ARRAY(SELECT ROUND(x::NUMERIC,4) FROM unnest(rag_jina_reranker_v1_tiny_en.rerank_score('query about programming',
ARRAY['Python is a programming language', 'Java is also a programming language', 'SQL is used for databases'])) AS x);
SELECT rag_jina_reranker_v1_tiny_en.rerank_score('empty array test', ARRAY[]::text[]);

View File

@@ -0,0 +1,7 @@
#!/bin/sh
set -ex
cd "$(dirname ${0})"
dropdb --if-exist contrib_regression
createdb contrib_regression
PG_REGRESS=$(dirname "$(pg_config --pgxs)")/../test/regress/pg_regress
${PG_REGRESS} --inputdir=./ --bindir='/usr/local/pgsql/bin' --use-existing --dbname=contrib_regression rum rum_hash ruminv timestamp orderby orderby_hash altorder altorder_hash limits int2 int4 int8 float4 float8 money oid time timetz date interval macaddr inet cidr text varchar char bytea bit varbit numeric rum_weight expr array

44
docker-compose/run-tests.sh Normal file → Executable file
View File

@@ -1,6 +1,42 @@
#!/bin/bash
set -x
if [[ -v BENCHMARK_CONNSTR ]]; then
uri_no_proto="${BENCHMARK_CONNSTR#postgres://}"
uri_no_proto="${uri_no_proto#postgresql://}"
if [[ $uri_no_proto == *\?* ]]; then
base="${uri_no_proto%%\?*}" # before '?'
else
base="$uri_no_proto"
fi
if [[ $base =~ ^([^:]+):([^@]+)@([^:/]+):?([0-9]*)/(.+)$ ]]; then
export PGUSER="${BASH_REMATCH[1]}"
export PGPASSWORD="${BASH_REMATCH[2]}"
export PGHOST="${BASH_REMATCH[3]}"
export PGPORT="${BASH_REMATCH[4]:-5432}"
export PGDATABASE="${BASH_REMATCH[5]}"
echo export PGUSER="${BASH_REMATCH[1]}"
echo export PGPASSWORD="${BASH_REMATCH[2]}"
echo export PGHOST="${BASH_REMATCH[3]}"
echo export PGPORT="${BASH_REMATCH[4]:-5432}"
echo export PGDATABASE="${BASH_REMATCH[5]}"
else
echo "Invalid PostgreSQL base URI"
exit 1
fi
fi
REGULAR_USER=false
while getopts r arg; do
case $arg in
r)
REGULAR_USER=true
shift $((OPTIND-1))
;;
*) :
;;
esac
done
extdir=${1}
cd "${extdir}" || exit 2
@@ -12,6 +48,11 @@ for d in ${LIST}; do
FAILED="${d} ${FAILED}"
break
fi
if [[ ${REGULAR_USER} = true ]] && [ -f "${d}"/regular-test.sh ]; then
"${d}/regular-test.sh" || FAILED="${d} ${FAILED}"
continue
fi
if [ -f "${d}/neon-test.sh" ]; then
"${d}/neon-test.sh" || FAILED="${d} ${FAILED}"
else
@@ -19,5 +60,8 @@ for d in ${LIST}; do
fi
done
[ -z "${FAILED}" ] && exit 0
for d in ${FAILED}; do
cat "$(find $d -name regression.diffs)"
done
echo "${FAILED}"
exit 1

View File

@@ -13,7 +13,7 @@ For design details see [the RFC](./rfcs/021-metering.md) and [the discussion on
batch format is
```json
{ "events" : [metric1, metric2, ...]]}
{ "events" : [metric1, metric2, ...] }
```
See metric format examples below.
@@ -49,11 +49,13 @@ Size of the remote storage (S3) directory.
This is an absolute, per-tenant metric.
- `timeline_logical_size`
Logical size of the data in the timeline
Logical size of the data in the timeline.
This is an absolute, per-timeline metric.
- `synthetic_storage_size`
Size of all tenant's branches including WAL
Size of all tenant's branches including WAL.
This is the same metric that `tenant/{tenant_id}/size` endpoint returns.
This is an absolute, per-tenant metric.
@@ -106,10 +108,10 @@ This is an incremental, per-endpoint metric.
```
The metric is incremental, so the value is the difference between the current and the previous value.
If there is no previous value, the value, the value is the current value and the `start_time` equals `stop_time`.
If there is no previous value, the value is the current value and the `start_time` equals `stop_time`.
### TODO
- [ ] Handle errors better: currently if one tenant fails to gather metrics, the whole iteration fails and metrics are not sent for any tenant.
- [ ] Add retries
- [ ] Tune the interval
- [ ] Tune the interval

View File

@@ -1,106 +0,0 @@
# Feature Rollout on Storage Controller
This RFC describes the rollout interface from a user's perspective. I do not have a concreate implementation idea
yet, but the operations described below should be intuitively feasible to implement -- most of them map to a single
SQL inside the storcon database and then some reconcile operations.
The rollout RFC makes it possible for the storcon to gradually modify tenant configs based on filters and percentages.
What will it look like if we want to rollout gc-compaction gradually?
Create a feature called gc-compaction.
```
$ storcon-cli feature create --job gc-compaction
```
Add two config sets to the rollout: gc-compaction without verification, and gc-compaction with read path verification.
```
$ storcon-cli feature config-set add --job gc-compaction --id enable_wo_verification --config '{ "gc_compaction_enabled": true }'
$ storcon-cli feature config-set add --job gc-compaction --id enable --config '{ "gc_compaction_enabled": true, "gc_compaction_verification": true }'
$ storcon-cli feature config-set list --job gc-compaction
default {}
enable_wo_verification { "gc_compaction_enabled": true }
enable { "gc_compaction_enabled": true, "gc_compaction_verification": true }
```
Week 1: rollout to 1% of active small tenants with the verification mode enabled.
```
$ storcon-cli feature rollout --job gc-compaction --config-set enable --filter "remote_size < 10GB" --coverage-percentage 1
20000 attached tenants satisfying the filter and randomly picked 200 tenants to apply `enable`, use `feature status` to view the rollout status
```
```
$ storcon-cli feature status --job gc-compaction
enable_wo_verification 0
enable 200
default 100000
$ storcon-cli feature status --job gc-compaction --filter "remote_size < 10GB"
enable_wo_verification 0
enable 200
default 19800
$ storcon-cli feature history --job gc-compaction
id,time,config-set,filter,percentage,count
1,2025-04-25 14:00:00,enable,"remote_size < 10GB",1,200
$ storcon-cli feature history --job gc-compaction --id 1
<tenant_ids that are involved in this rollout>
```
Week 2: rollout to all active small tenants with the verification mode enabled, previously rolled-out tenants switch to full rollout that disables verification mode.
```
$ storcon-cli feature rollout --job gc-compaction --config-set enable_wo_verification --filter "config_set=enable" --coverage-percentage all
$ storcon-cli feature rollout --job gc-compaction --config-set enable --filter "remote_size < 10GB" --coverage-percentage all
```
Week 3: rollout gradually to 50% larger tenants before Jun 1. The storage controller will randomly select some tenants every day at 12am to rollout the change, and will finish the rollout to at least 50% of the tenants by Jun 1.
```
$ storcon-cli feature scheduled-rollout --job gc-compaction --config-set enable --filter "remote_size < 100GB" --coverage-percentage 50 --cron "0 0 * * *" --before 2025-06-01 00:00:00
```
Week 4: we discover a bug over a specific tenant and want to disable gc-compaction on it,
```
$ storcon-cli feature rollout --job gc-compaction --config-set default --filter "tenant-id=<id>" --coverage-percentage all
rollout succeeded, operation_id=11
```
Then we realize that this bug might affect all tenants and decide to disable it for all tenants:
```
$ storcon-cli feature rollout --job gc-compaction --config-set default --coverage-percentage all
rollout succeeded, operation_id=10
```
We get a fix and can re-enable it on those tenants which had the feature enabled previously.
```
$ storcon-cli feature rollout --job gc-compaction --revert 11
$ storcon-cli feature rollout --job gc-compaction --revert 10
```
Week 5: enable by default
For newly-attached tenants, we want to enable gc-compaction by default.
```
$ storcon-cli feature set-default-when-attached --job gc-compaction --config-set enable
```
Week 6: full rollout
```
$ storcon-cli feature rollout --job gc-compaction --config-set enable --coverage-percentage all
```
Then we make a tenant default config change in the infra repo, get it deployed, and we can delete the feature rollout record in the storcon database.
```
$ storcon-cli feature delete --job gc-compaction
```

11
libs/neonart/Cargo.toml Normal file
View File

@@ -0,0 +1,11 @@
[package]
name = "neonart"
version = "0.1.0"
edition.workspace = true
license.workspace = true
[dependencies]
tracing.workspace = true
rand.workspace = true # for tests
zerocopy = "0.8"

View File

@@ -0,0 +1,377 @@
mod lock_and_version;
mod node_ptr;
mod node_ref;
use std::vec::Vec;
use crate::algorithm::lock_and_version::ResultOrRestart;
use crate::algorithm::node_ptr::{MAX_PREFIX_LEN, NodePtr};
use crate::algorithm::node_ref::ChildOrValue;
use crate::algorithm::node_ref::{NodeRef, ReadLockedNodeRef, WriteLockedNodeRef};
use crate::epoch::EpochPin;
use crate::{Allocator, Key, Value};
pub(crate) type RootPtr<V> = node_ptr::NodePtr<V>;
pub fn new_root<V: Value>(allocator: &Allocator) -> RootPtr<V> {
node_ptr::new_root(allocator)
}
pub(crate) fn search<'e, K: Key, V: Value>(
key: &K,
root: RootPtr<V>,
epoch_pin: &'e EpochPin,
) -> Option<V> {
loop {
let root_ref = NodeRef::from_root_ptr(root);
if let Ok(result) = lookup_recurse(key.as_bytes(), root_ref, None, epoch_pin) {
break result;
}
// retry
}
}
pub(crate) fn update_fn<'e, K: Key, V: Value, F>(
key: &K,
value_fn: F,
root: RootPtr<V>,
allocator: &Allocator,
epoch_pin: &'e EpochPin,
) where
F: FnOnce(Option<&V>) -> Option<V>,
{
let value_fn_cell = std::cell::Cell::new(Some(value_fn));
loop {
let root_ref = NodeRef::from_root_ptr(root);
let this_value_fn = |arg: Option<&V>| value_fn_cell.take().unwrap()(arg);
let key_bytes = key.as_bytes();
if let Ok(()) = update_recurse(
key_bytes,
this_value_fn,
root_ref,
None,
allocator,
epoch_pin,
0,
key_bytes,
) {
break;
}
// retry
}
}
pub(crate) fn dump_tree<'e, V: Value + std::fmt::Debug>(root: RootPtr<V>, epoch_pin: &'e EpochPin) {
let root_ref = NodeRef::from_root_ptr(root);
let _ = dump_recurse(&[], root_ref, &epoch_pin, 0);
}
// Error means you must retry.
//
// This corresponds to the 'lookupOpt' function in the paper
fn lookup_recurse<'e, V: Value>(
key: &[u8],
node: NodeRef<'e, V>,
parent: Option<ReadLockedNodeRef<V>>,
epoch_pin: &'e EpochPin,
) -> ResultOrRestart<Option<V>> {
let rnode = node.read_lock_or_restart()?;
if let Some(parent) = parent {
parent.read_unlock_or_restart()?;
}
// check if prefix matches, may increment level
let prefix_len = if let Some(prefix_len) = rnode.prefix_matches(key) {
prefix_len
} else {
rnode.read_unlock_or_restart()?;
return Ok(None);
};
let key = &key[prefix_len..];
// find child (or leaf value)
let next_node = rnode.find_child_or_value_or_restart(key[0])?;
match next_node {
None => Ok(None), // key not found
Some(ChildOrValue::Value(vptr)) => {
// safety: It's OK to follow the pointer because we checked the version.
let v = unsafe { (*vptr).clone() };
Ok(Some(v))
}
Some(ChildOrValue::Child(v)) => lookup_recurse(&key[1..], v, Some(rnode), epoch_pin),
}
}
// This corresponds to the 'insertOpt' function in the paper
pub(crate) fn update_recurse<'e, V: Value, F>(
key: &[u8],
value_fn: F,
node: NodeRef<'e, V>,
rparent: Option<(ReadLockedNodeRef<V>, u8)>,
allocator: &Allocator,
epoch_pin: &'e EpochPin,
level: usize,
orig_key: &[u8],
) -> ResultOrRestart<()>
where
F: FnOnce(Option<&V>) -> Option<V>,
{
let rnode = node.read_lock_or_restart()?;
let prefix_match_len = rnode.prefix_matches(key);
if prefix_match_len.is_none() {
let (rparent, parent_key) = rparent.expect("direct children of the root have no prefix");
let mut wparent = rparent.upgrade_to_write_lock_or_restart()?;
let mut wnode = rnode.upgrade_to_write_lock_or_restart()?;
if let Some(new_value) = value_fn(None) {
insert_split_prefix(
key,
new_value,
&mut wnode,
&mut wparent,
parent_key,
allocator,
);
}
wnode.write_unlock();
wparent.write_unlock();
return Ok(());
}
let prefix_match_len = prefix_match_len.unwrap();
let key = &key[prefix_match_len as usize..];
let level = level + prefix_match_len as usize;
let next_node = rnode.find_child_or_value_or_restart(key[0])?;
if next_node.is_none() {
if rnode.is_full() {
let (rparent, parent_key) = rparent.expect("root node cannot become full");
let mut wparent = rparent.upgrade_to_write_lock_or_restart()?;
let wnode = rnode.upgrade_to_write_lock_or_restart()?;
if let Some(new_value) = value_fn(None) {
insert_and_grow(key, new_value, &wnode, &mut wparent, parent_key, allocator);
wnode.write_unlock_obsolete();
wparent.write_unlock();
} else {
wnode.write_unlock();
wparent.write_unlock();
}
} else {
let mut wnode = rnode.upgrade_to_write_lock_or_restart()?;
if let Some((rparent, _)) = rparent {
rparent.read_unlock_or_restart()?;
}
if let Some(new_value) = value_fn(None) {
insert_to_node(&mut wnode, key, new_value, allocator);
}
wnode.write_unlock();
}
return Ok(());
} else {
let next_node = next_node.unwrap(); // checked above it's not None
if let Some((rparent, _)) = rparent {
rparent.read_unlock_or_restart()?;
}
match next_node {
ChildOrValue::Value(existing_value_ptr) => {
assert!(key.len() == 1);
let wnode = rnode.upgrade_to_write_lock_or_restart()?;
// safety: Now that we have acquired the write lock, we have exclusive access to the
// value
let vmut = unsafe { existing_value_ptr.cast_mut().as_mut() }.unwrap();
if let Some(new_value) = value_fn(Some(vmut)) {
*vmut = new_value;
} else {
// TODO: Treat this as deletion?
}
wnode.write_unlock();
Ok(())
}
ChildOrValue::Child(next_child) => {
// recurse to next level
update_recurse(
&key[1..],
value_fn,
next_child,
Some((rnode, key[0])),
allocator,
epoch_pin,
level + 1,
orig_key,
)
}
}
}
}
#[derive(Clone)]
enum PathElement {
Prefix(Vec<u8>),
KeyByte(u8),
}
impl std::fmt::Debug for PathElement {
fn fmt(&self, fmt: &mut std::fmt::Formatter<'_>) -> Result<(), std::fmt::Error> {
match self {
PathElement::Prefix(prefix) => write!(fmt, "{:?}", prefix),
PathElement::KeyByte(key_byte) => write!(fmt, "{}", key_byte),
}
}
}
fn dump_recurse<'e, V: Value + std::fmt::Debug>(
path: &[PathElement],
node: NodeRef<'e, V>,
epoch_pin: &'e EpochPin,
level: usize,
) -> ResultOrRestart<()> {
let indent = str::repeat(" ", level);
let rnode = node.read_lock_or_restart()?;
let mut path = Vec::from(path);
let prefix = rnode.get_prefix();
if prefix.len() != 0 {
path.push(PathElement::Prefix(Vec::from(prefix)));
}
for key_byte in 0..u8::MAX {
match rnode.find_child_or_value_or_restart(key_byte)? {
None => continue,
Some(ChildOrValue::Child(child_ref)) => {
let rchild = child_ref.read_lock_or_restart()?;
eprintln!(
"{} {:?}, {}: prefix {:?}",
indent,
&path,
key_byte,
rchild.get_prefix()
);
let mut child_path = path.clone();
child_path.push(PathElement::KeyByte(key_byte));
dump_recurse(&child_path, child_ref, epoch_pin, level + 1)?;
}
Some(ChildOrValue::Value(val)) => {
eprintln!("{} {:?}, {}: {:?}", indent, path, key_byte, unsafe {
val.as_ref().unwrap()
});
}
}
}
Ok(())
}
///```text
/// [fooba]r -> value
///
/// [foo]b -> [a]r -> value
/// e -> [ls]e -> value
///```
fn insert_split_prefix<'a, V: Value>(
key: &[u8],
value: V,
node: &mut WriteLockedNodeRef<V>,
parent: &mut WriteLockedNodeRef<V>,
parent_key: u8,
allocator: &Allocator,
) {
let old_node = node;
let old_prefix = old_node.get_prefix();
let common_prefix_len = common_prefix(key, old_prefix);
// Allocate a node for the new value.
let new_value_node = allocate_node_for_value(&key[common_prefix_len + 1..], value, allocator);
// Allocate a new internal node with the common prefix
let mut prefix_node = node_ref::new_internal(&key[..common_prefix_len], allocator);
// Add the old node and the new nodes to the new internal node
prefix_node.insert_child(old_prefix[common_prefix_len], old_node.as_ptr());
prefix_node.insert_child(key[common_prefix_len], new_value_node);
// Modify the prefix of the old child in place
old_node.truncate_prefix(old_prefix.len() - common_prefix_len - 1);
// replace the pointer in the parent
parent.replace_child(parent_key, prefix_node.into_ptr());
}
fn insert_to_node<V: Value>(
wnode: &mut WriteLockedNodeRef<V>,
key: &[u8],
value: V,
allocator: &Allocator,
) {
if wnode.is_leaf() {
wnode.insert_value(key[0], value);
} else {
let value_child = allocate_node_for_value(&key[1..], value, allocator);
wnode.insert_child(key[0], value_child);
}
}
// On entry: 'parent' and 'node' are locked
fn insert_and_grow<V: Value>(
key: &[u8],
value: V,
wnode: &WriteLockedNodeRef<V>,
parent: &mut WriteLockedNodeRef<V>,
parent_key_byte: u8,
allocator: &Allocator,
) {
let mut bigger_node = wnode.grow(allocator);
if wnode.is_leaf() {
bigger_node.insert_value(key[0], value);
} else {
let value_child = allocate_node_for_value(&key[1..], value, allocator);
bigger_node.insert_child(key[0], value_child);
}
// Replace the pointer in the parent
parent.replace_child(parent_key_byte, bigger_node.into_ptr());
}
// Allocate a new leaf node to hold 'value'. If key is long, we may need to allocate
// new internal nodes to hold it too
fn allocate_node_for_value<V: Value>(key: &[u8], value: V, allocator: &Allocator) -> NodePtr<V> {
let mut prefix_off = key.len().saturating_sub(MAX_PREFIX_LEN + 1);
let mut leaf_node = node_ref::new_leaf(&key[prefix_off..key.len() - 1], allocator);
leaf_node.insert_value(*key.last().unwrap(), value);
let mut node = leaf_node;
while prefix_off > 0 {
// Need another internal node
let remain_prefix = &key[0..prefix_off];
prefix_off = remain_prefix.len().saturating_sub(MAX_PREFIX_LEN + 1);
let mut internal_node = node_ref::new_internal(
&remain_prefix[prefix_off..remain_prefix.len() - 1],
allocator,
);
internal_node.insert_child(*remain_prefix.last().unwrap(), node.into_ptr());
node = internal_node;
}
node.into_ptr()
}
fn common_prefix(a: &[u8], b: &[u8]) -> usize {
for i in 0..MAX_PREFIX_LEN {
if a[i] != b[i] {
return i;
}
}
panic!("prefixes are equal");
}

View File

@@ -0,0 +1,85 @@
use std::sync::atomic::{AtomicU64, Ordering};
pub(crate) struct AtomicLockAndVersion {
inner: AtomicU64,
}
impl AtomicLockAndVersion {
pub(crate) fn new() -> AtomicLockAndVersion {
AtomicLockAndVersion {
inner: AtomicU64::new(0),
}
}
}
pub(crate) type ResultOrRestart<T> = Result<T, ()>;
const fn restart<T>() -> ResultOrRestart<T> {
Err(())
}
impl AtomicLockAndVersion {
pub(crate) fn read_lock_or_restart(&self) -> ResultOrRestart<u64> {
let version = self.await_node_unlocked();
if is_obsolete(version) {
return restart();
}
Ok(version)
}
pub(crate) fn check_or_restart(&self, version: u64) -> ResultOrRestart<()> {
self.read_unlock_or_restart(version)
}
pub(crate) fn read_unlock_or_restart(&self, version: u64) -> ResultOrRestart<()> {
if self.inner.load(Ordering::Acquire) != version {
return restart();
}
Ok(())
}
pub(crate) fn upgrade_to_write_lock_or_restart(&self, version: u64) -> ResultOrRestart<()> {
if self
.inner
.compare_exchange(
version,
set_locked_bit(version),
Ordering::Acquire,
Ordering::Relaxed,
)
.is_err()
{
return restart();
}
Ok(())
}
pub(crate) fn write_unlock(&self) {
// reset locked bit and overflow into version
self.inner.fetch_add(2, Ordering::Release);
}
pub(crate) fn write_unlock_obsolete(&self) {
// set obsolete, reset locked, overflow into version
self.inner.fetch_add(3, Ordering::Release);
}
// Helper functions
fn await_node_unlocked(&self) -> u64 {
let mut version = self.inner.load(Ordering::Acquire);
while (version & 2) == 2 {
// spinlock
std::thread::yield_now();
version = self.inner.load(Ordering::Acquire)
}
version
}
}
fn set_locked_bit(version: u64) -> u64 {
return version + 2;
}
fn is_obsolete(version: u64) -> bool {
return (version & 1) == 1;
}

View File

@@ -0,0 +1,983 @@
use std::marker::PhantomData;
use std::ptr::NonNull;
use super::lock_and_version::AtomicLockAndVersion;
use crate::Allocator;
use crate::Value;
pub(crate) const MAX_PREFIX_LEN: usize = 8;
enum NodeTag {
Internal4,
Internal16,
Internal48,
Internal256,
Leaf4,
Leaf16,
Leaf48,
Leaf256,
}
#[repr(C)]
struct NodeBase {
tag: NodeTag,
lock_and_version: AtomicLockAndVersion,
}
pub(crate) struct NodePtr<V> {
ptr: *mut NodeBase,
phantom_value: PhantomData<V>,
}
impl<V> std::fmt::Debug for NodePtr<V> {
fn fmt(&self, fmt: &mut std::fmt::Formatter<'_>) -> Result<(), std::fmt::Error> {
write!(fmt, "0x{}", self.ptr.addr())
}
}
impl<V> Copy for NodePtr<V> {}
impl<V> Clone for NodePtr<V> {
fn clone(&self) -> NodePtr<V> {
NodePtr {
ptr: self.ptr,
phantom_value: PhantomData,
}
}
}
enum NodeVariant<'a, V> {
Internal4(&'a NodeInternal4<V>),
Internal16(&'a NodeInternal16<V>),
Internal48(&'a NodeInternal48<V>),
Internal256(&'a NodeInternal256<V>),
Leaf4(&'a NodeLeaf4<V>),
Leaf16(&'a NodeLeaf16<V>),
Leaf48(&'a NodeLeaf48<V>),
Leaf256(&'a NodeLeaf256<V>),
}
enum NodeVariantMut<'a, V> {
Internal4(&'a mut NodeInternal4<V>),
Internal16(&'a mut NodeInternal16<V>),
Internal48(&'a mut NodeInternal48<V>),
Internal256(&'a mut NodeInternal256<V>),
Leaf4(&'a mut NodeLeaf4<V>),
Leaf16(&'a mut NodeLeaf16<V>),
Leaf48(&'a mut NodeLeaf48<V>),
Leaf256(&'a mut NodeLeaf256<V>),
}
pub(crate) enum ChildOrValuePtr<V> {
Child(NodePtr<V>),
Value(*const V),
}
#[repr(C)]
struct NodeInternal4<V> {
tag: NodeTag,
lock_and_version: AtomicLockAndVersion,
prefix: [u8; MAX_PREFIX_LEN],
prefix_len: u8,
num_children: u8,
child_keys: [u8; 4],
child_ptrs: [NodePtr<V>; 4],
}
#[repr(C)]
struct NodeInternal16<V> {
tag: NodeTag,
lock_and_version: AtomicLockAndVersion,
prefix: [u8; MAX_PREFIX_LEN],
prefix_len: u8,
num_children: u8,
child_keys: [u8; 16],
child_ptrs: [NodePtr<V>; 16],
}
const INVALID_CHILD_INDEX: u8 = u8::MAX;
#[repr(C)]
struct NodeInternal48<V> {
tag: NodeTag,
lock_and_version: AtomicLockAndVersion,
prefix: [u8; MAX_PREFIX_LEN],
prefix_len: u8,
num_children: u8,
child_indexes: [u8; 256],
child_ptrs: [NodePtr<V>; 48],
}
#[repr(C)]
pub(crate) struct NodeInternal256<V> {
tag: NodeTag,
lock_and_version: AtomicLockAndVersion,
prefix: [u8; MAX_PREFIX_LEN],
prefix_len: u8,
num_children: u16,
child_ptrs: [NodePtr<V>; 256],
}
#[repr(C)]
struct NodeLeaf4<V> {
tag: NodeTag,
lock_and_version: AtomicLockAndVersion,
prefix: [u8; MAX_PREFIX_LEN],
prefix_len: u8,
num_values: u8,
child_keys: [u8; 4],
child_values: [Option<V>; 4],
}
#[repr(C)]
struct NodeLeaf16<V> {
tag: NodeTag,
lock_and_version: AtomicLockAndVersion,
prefix: [u8; MAX_PREFIX_LEN],
prefix_len: u8,
num_values: u8,
child_keys: [u8; 16],
child_values: [Option<V>; 16],
}
#[repr(C)]
struct NodeLeaf48<V> {
tag: NodeTag,
lock_and_version: AtomicLockAndVersion,
prefix: [u8; MAX_PREFIX_LEN],
prefix_len: u8,
num_values: u8,
child_indexes: [u8; 256],
child_values: [Option<V>; 48],
}
#[repr(C)]
struct NodeLeaf256<V> {
tag: NodeTag,
lock_and_version: AtomicLockAndVersion,
prefix: [u8; MAX_PREFIX_LEN],
prefix_len: u8,
num_values: u16,
child_values: [Option<V>; 256],
}
impl<V> NodePtr<V> {
pub(crate) fn is_leaf(&self) -> bool {
match self.variant() {
NodeVariant::Internal4(_) => false,
NodeVariant::Internal16(_) => false,
NodeVariant::Internal48(_) => false,
NodeVariant::Internal256(_) => false,
NodeVariant::Leaf4(_) => true,
NodeVariant::Leaf16(_) => true,
NodeVariant::Leaf48(_) => true,
NodeVariant::Leaf256(_) => true,
}
}
pub(crate) fn lockword(&self) -> &AtomicLockAndVersion {
match self.variant() {
NodeVariant::Internal4(n) => &n.lock_and_version,
NodeVariant::Internal16(n) => &n.lock_and_version,
NodeVariant::Internal48(n) => &n.lock_and_version,
NodeVariant::Internal256(n) => &n.lock_and_version,
NodeVariant::Leaf4(n) => &n.lock_and_version,
NodeVariant::Leaf16(n) => &n.lock_and_version,
NodeVariant::Leaf48(n) => &n.lock_and_version,
NodeVariant::Leaf256(n) => &n.lock_and_version,
}
}
pub(crate) fn is_null(&self) -> bool {
self.ptr.is_null()
}
pub(crate) const fn null() -> NodePtr<V> {
NodePtr {
ptr: std::ptr::null_mut(),
phantom_value: PhantomData,
}
}
fn variant(&self) -> NodeVariant<V> {
unsafe {
match (*self.ptr).tag {
NodeTag::Internal4 => NodeVariant::Internal4(
NonNull::new_unchecked(self.ptr.cast::<NodeInternal4<V>>()).as_ref(),
),
NodeTag::Internal16 => NodeVariant::Internal16(
NonNull::new_unchecked(self.ptr.cast::<NodeInternal16<V>>()).as_ref(),
),
NodeTag::Internal48 => NodeVariant::Internal48(
NonNull::new_unchecked(self.ptr.cast::<NodeInternal48<V>>()).as_ref(),
),
NodeTag::Internal256 => NodeVariant::Internal256(
NonNull::new_unchecked(self.ptr.cast::<NodeInternal256<V>>()).as_ref(),
),
NodeTag::Leaf4 => NodeVariant::Leaf4(
NonNull::new_unchecked(self.ptr.cast::<NodeLeaf4<V>>()).as_ref(),
),
NodeTag::Leaf16 => NodeVariant::Leaf16(
NonNull::new_unchecked(self.ptr.cast::<NodeLeaf16<V>>()).as_ref(),
),
NodeTag::Leaf48 => NodeVariant::Leaf48(
NonNull::new_unchecked(self.ptr.cast::<NodeLeaf48<V>>()).as_ref(),
),
NodeTag::Leaf256 => NodeVariant::Leaf256(
NonNull::new_unchecked(self.ptr.cast::<NodeLeaf256<V>>()).as_ref(),
),
}
}
}
fn variant_mut(&mut self) -> NodeVariantMut<V> {
unsafe {
match (*self.ptr).tag {
NodeTag::Internal4 => NodeVariantMut::Internal4(
NonNull::new_unchecked(self.ptr.cast::<NodeInternal4<V>>()).as_mut(),
),
NodeTag::Internal16 => NodeVariantMut::Internal16(
NonNull::new_unchecked(self.ptr.cast::<NodeInternal16<V>>()).as_mut(),
),
NodeTag::Internal48 => NodeVariantMut::Internal48(
NonNull::new_unchecked(self.ptr.cast::<NodeInternal48<V>>()).as_mut(),
),
NodeTag::Internal256 => NodeVariantMut::Internal256(
NonNull::new_unchecked(self.ptr.cast::<NodeInternal256<V>>()).as_mut(),
),
NodeTag::Leaf4 => NodeVariantMut::Leaf4(
NonNull::new_unchecked(self.ptr.cast::<NodeLeaf4<V>>()).as_mut(),
),
NodeTag::Leaf16 => NodeVariantMut::Leaf16(
NonNull::new_unchecked(self.ptr.cast::<NodeLeaf16<V>>()).as_mut(),
),
NodeTag::Leaf48 => NodeVariantMut::Leaf48(
NonNull::new_unchecked(self.ptr.cast::<NodeLeaf48<V>>()).as_mut(),
),
NodeTag::Leaf256 => NodeVariantMut::Leaf256(
NonNull::new_unchecked(self.ptr.cast::<NodeLeaf256<V>>()).as_mut(),
),
}
}
}
}
impl<V: Value> NodePtr<V> {
pub(crate) fn prefix_matches(&self, key: &[u8]) -> Option<usize> {
let node_prefix = self.get_prefix();
assert!(node_prefix.len() <= key.len()); // because we only use fixed-size keys
if &key[0..node_prefix.len()] != node_prefix {
None
} else {
Some(node_prefix.len())
}
}
pub(crate) fn get_prefix(&self) -> &[u8] {
match self.variant() {
NodeVariant::Internal4(n) => n.get_prefix(),
NodeVariant::Internal16(n) => n.get_prefix(),
NodeVariant::Internal48(n) => n.get_prefix(),
NodeVariant::Internal256(n) => n.get_prefix(),
NodeVariant::Leaf4(n) => n.get_prefix(),
NodeVariant::Leaf16(n) => n.get_prefix(),
NodeVariant::Leaf48(n) => n.get_prefix(),
NodeVariant::Leaf256(n) => n.get_prefix(),
}
}
pub(crate) fn is_full(&self) -> bool {
match self.variant() {
NodeVariant::Internal4(n) => n.is_full(),
NodeVariant::Internal16(n) => n.is_full(),
NodeVariant::Internal48(n) => n.is_full(),
NodeVariant::Internal256(n) => n.is_full(),
NodeVariant::Leaf4(n) => n.is_full(),
NodeVariant::Leaf16(n) => n.is_full(),
NodeVariant::Leaf48(n) => n.is_full(),
NodeVariant::Leaf256(n) => n.is_full(),
}
}
pub(crate) fn find_child_or_value(&self, key_byte: u8) -> Option<ChildOrValuePtr<V>> {
match self.variant() {
NodeVariant::Internal4(n) => n.find_child(key_byte).map(|c| ChildOrValuePtr::Child(c)),
NodeVariant::Internal16(n) => n.find_child(key_byte).map(|c| ChildOrValuePtr::Child(c)),
NodeVariant::Internal48(n) => n.find_child(key_byte).map(|c| ChildOrValuePtr::Child(c)),
NodeVariant::Internal256(n) => {
n.find_child(key_byte).map(|c| ChildOrValuePtr::Child(c))
}
NodeVariant::Leaf4(n) => n
.get_leaf_value(key_byte)
.map(|v| ChildOrValuePtr::Value(v)),
NodeVariant::Leaf16(n) => n
.get_leaf_value(key_byte)
.map(|v| ChildOrValuePtr::Value(v)),
NodeVariant::Leaf48(n) => n
.get_leaf_value(key_byte)
.map(|v| ChildOrValuePtr::Value(v)),
NodeVariant::Leaf256(n) => n
.get_leaf_value(key_byte)
.map(|v| ChildOrValuePtr::Value(v)),
}
}
pub(crate) fn truncate_prefix(&mut self, new_prefix_len: usize) {
match self.variant_mut() {
NodeVariantMut::Internal4(n) => n.truncate_prefix(new_prefix_len),
NodeVariantMut::Internal16(n) => n.truncate_prefix(new_prefix_len),
NodeVariantMut::Internal48(n) => n.truncate_prefix(new_prefix_len),
NodeVariantMut::Internal256(n) => n.truncate_prefix(new_prefix_len),
NodeVariantMut::Leaf4(n) => n.truncate_prefix(new_prefix_len),
NodeVariantMut::Leaf16(n) => n.truncate_prefix(new_prefix_len),
NodeVariantMut::Leaf48(n) => n.truncate_prefix(new_prefix_len),
NodeVariantMut::Leaf256(n) => n.truncate_prefix(new_prefix_len),
}
}
pub(crate) fn grow(&self, allocator: &Allocator) -> NodePtr<V> {
match self.variant() {
NodeVariant::Internal4(n) => n.grow(allocator),
NodeVariant::Internal16(n) => n.grow(allocator),
NodeVariant::Internal48(n) => n.grow(allocator),
NodeVariant::Internal256(_) => panic!("cannot grow Internal256 node"),
NodeVariant::Leaf4(n) => n.grow(allocator),
NodeVariant::Leaf16(n) => n.grow(allocator),
NodeVariant::Leaf48(n) => n.grow(allocator),
NodeVariant::Leaf256(_) => panic!("cannot grow Leaf256 node"),
}
}
pub(crate) fn insert_child(&mut self, key_byte: u8, child: NodePtr<V>) {
match self.variant_mut() {
NodeVariantMut::Internal4(n) => n.insert_child(key_byte, child),
NodeVariantMut::Internal16(n) => n.insert_child(key_byte, child),
NodeVariantMut::Internal48(n) => n.insert_child(key_byte, child),
NodeVariantMut::Internal256(n) => n.insert_child(key_byte, child),
NodeVariantMut::Leaf4(_)
| NodeVariantMut::Leaf16(_)
| NodeVariantMut::Leaf48(_)
| NodeVariantMut::Leaf256(_) => panic!("insert_child called on leaf node"),
}
}
pub(crate) fn replace_child(&mut self, key_byte: u8, replacement: NodePtr<V>) {
match self.variant_mut() {
NodeVariantMut::Internal4(n) => n.replace_child(key_byte, replacement),
NodeVariantMut::Internal16(n) => n.replace_child(key_byte, replacement),
NodeVariantMut::Internal48(n) => n.replace_child(key_byte, replacement),
NodeVariantMut::Internal256(n) => n.replace_child(key_byte, replacement),
NodeVariantMut::Leaf4(_)
| NodeVariantMut::Leaf16(_)
| NodeVariantMut::Leaf48(_)
| NodeVariantMut::Leaf256(_) => panic!("replace_child called on leaf node"),
}
}
pub(crate) fn insert_value(&mut self, key_byte: u8, value: V) {
match self.variant_mut() {
NodeVariantMut::Internal4(_)
| NodeVariantMut::Internal16(_)
| NodeVariantMut::Internal48(_)
| NodeVariantMut::Internal256(_) => panic!("insert_value called on internal node"),
NodeVariantMut::Leaf4(n) => n.insert_value(key_byte, value),
NodeVariantMut::Leaf16(n) => n.insert_value(key_byte, value),
NodeVariantMut::Leaf48(n) => n.insert_value(key_byte, value),
NodeVariantMut::Leaf256(n) => n.insert_value(key_byte, value),
}
}
}
pub fn new_root<V: Value>(allocator: &Allocator) -> NodePtr<V> {
NodePtr {
ptr: allocator.alloc(NodeInternal256::<V>::new()).as_ptr().cast(),
phantom_value: PhantomData,
}
}
pub fn new_internal<V: Value>(prefix: &[u8], allocator: &Allocator) -> NodePtr<V> {
let mut node = allocator.alloc(NodeInternal4 {
tag: NodeTag::Internal4,
lock_and_version: AtomicLockAndVersion::new(),
prefix: [8; MAX_PREFIX_LEN],
prefix_len: prefix.len() as u8,
num_children: 0,
child_keys: [0; 4],
child_ptrs: [const { NodePtr::null() }; 4],
});
node.prefix[0..prefix.len()].copy_from_slice(prefix);
node.as_ptr().into()
}
pub fn new_leaf<V: Value>(prefix: &[u8], allocator: &Allocator) -> NodePtr<V> {
let mut node = allocator.alloc(NodeLeaf4 {
tag: NodeTag::Leaf4,
lock_and_version: AtomicLockAndVersion::new(),
prefix: [8; MAX_PREFIX_LEN],
prefix_len: prefix.len() as u8,
num_values: 0,
child_keys: [0; 4],
child_values: [const { None }; 4],
});
node.prefix[0..prefix.len()].copy_from_slice(prefix);
node.as_ptr().into()
}
impl<V: Value> NodeInternal4<V> {
fn get_prefix(&self) -> &[u8] {
&self.prefix[0..self.prefix_len as usize]
}
fn truncate_prefix(&mut self, new_prefix_len: usize) {
assert!(new_prefix_len < self.prefix_len as usize);
let prefix = &mut self.prefix;
let offset = self.prefix_len as usize - new_prefix_len;
for i in 0..new_prefix_len {
prefix[i] = prefix[i + offset];
}
self.prefix_len = new_prefix_len as u8;
}
fn find_child(&self, key: u8) -> Option<NodePtr<V>> {
for i in 0..self.num_children as usize {
if self.child_keys[i] == key {
return Some(self.child_ptrs[i]);
}
}
None
}
fn replace_child(&mut self, key_byte: u8, replacement: NodePtr<V>) {
for i in 0..self.num_children as usize {
if self.child_keys[i] == key_byte {
self.child_ptrs[i] = replacement;
return;
}
}
panic!("could not re-find parent with key {}", key_byte);
}
fn is_full(&self) -> bool {
self.num_children == 4
}
fn insert_child(&mut self, key_byte: u8, child: NodePtr<V>) {
assert!(self.num_children < 4);
let idx = self.num_children as usize;
self.child_keys[idx] = key_byte;
self.child_ptrs[idx] = child;
self.num_children += 1;
}
fn grow(&self, allocator: &Allocator) -> NodePtr<V> {
let mut node16 = allocator.alloc(NodeInternal16 {
tag: NodeTag::Internal16,
lock_and_version: AtomicLockAndVersion::new(),
prefix: self.prefix.clone(),
prefix_len: self.prefix_len,
num_children: self.num_children,
child_keys: [0; 16],
child_ptrs: [const { NodePtr::null() }; 16],
});
for i in 0..self.num_children as usize {
node16.child_keys[i] = self.child_keys[i];
node16.child_ptrs[i] = self.child_ptrs[i];
}
node16.as_ptr().into()
}
}
impl<V: Value> NodeInternal16<V> {
fn get_prefix(&self) -> &[u8] {
&self.prefix[0..self.prefix_len as usize]
}
fn truncate_prefix(&mut self, new_prefix_len: usize) {
assert!(new_prefix_len < self.prefix_len as usize);
let prefix = &mut self.prefix;
let offset = self.prefix_len as usize - new_prefix_len;
for i in 0..new_prefix_len {
prefix[i] = prefix[i + offset];
}
self.prefix_len = new_prefix_len as u8;
}
fn find_child(&self, key_byte: u8) -> Option<NodePtr<V>> {
for i in 0..self.num_children as usize {
if self.child_keys[i] == key_byte {
return Some(self.child_ptrs[i]);
}
}
None
}
fn replace_child(&mut self, key_byte: u8, replacement: NodePtr<V>) {
for i in 0..self.num_children as usize {
if self.child_keys[i] == key_byte {
self.child_ptrs[i] = replacement;
return;
}
}
panic!("could not re-find parent with key {}", key_byte);
}
fn is_full(&self) -> bool {
self.num_children == 16
}
fn insert_child(&mut self, key_byte: u8, child: NodePtr<V>) {
assert!(self.num_children < 16);
let idx = self.num_children as usize;
self.child_keys[idx] = key_byte;
self.child_ptrs[idx] = child;
self.num_children += 1;
}
fn grow(&self, allocator: &Allocator) -> NodePtr<V> {
let mut node48 = allocator.alloc(NodeInternal48 {
tag: NodeTag::Internal48,
lock_and_version: AtomicLockAndVersion::new(),
prefix: self.prefix.clone(),
prefix_len: self.prefix_len,
num_children: self.num_children,
child_indexes: [INVALID_CHILD_INDEX; 256],
child_ptrs: [const { NodePtr::null() }; 48],
});
for i in 0..self.num_children as usize {
let idx = self.child_keys[i] as usize;
node48.child_indexes[idx] = i as u8;
node48.child_ptrs[i] = self.child_ptrs[i];
}
node48.as_ptr().into()
}
}
impl<V: Value> NodeInternal48<V> {
fn get_prefix(&self) -> &[u8] {
&self.prefix[0..self.prefix_len as usize]
}
fn truncate_prefix(&mut self, new_prefix_len: usize) {
assert!(new_prefix_len < self.prefix_len as usize);
let prefix = &mut self.prefix;
let offset = self.prefix_len as usize - new_prefix_len;
for i in 0..new_prefix_len {
prefix[i] = prefix[i + offset];
}
self.prefix_len = new_prefix_len as u8;
}
fn find_child(&self, key_byte: u8) -> Option<NodePtr<V>> {
let idx = self.child_indexes[key_byte as usize];
if idx != INVALID_CHILD_INDEX {
Some(self.child_ptrs[idx as usize])
} else {
None
}
}
fn replace_child(&mut self, key_byte: u8, replacement: NodePtr<V>) {
let idx = self.child_indexes[key_byte as usize];
if idx != INVALID_CHILD_INDEX {
self.child_ptrs[idx as usize] = replacement
} else {
panic!("could not re-find parent with key {}", key_byte);
}
}
fn is_full(&self) -> bool {
self.num_children == 48
}
fn insert_child(&mut self, key_byte: u8, child: NodePtr<V>) {
assert!(self.num_children < 48);
assert!(self.child_indexes[key_byte as usize] == INVALID_CHILD_INDEX);
let idx = self.num_children;
self.child_indexes[key_byte as usize] = idx;
self.child_ptrs[idx as usize] = child;
self.num_children += 1;
}
fn grow(&self, allocator: &Allocator) -> NodePtr<V> {
let mut node256 = allocator.alloc(NodeInternal256 {
tag: NodeTag::Internal256,
lock_and_version: AtomicLockAndVersion::new(),
prefix: self.prefix.clone(),
prefix_len: self.prefix_len,
num_children: self.num_children as u16,
child_ptrs: [const { NodePtr::null() }; 256],
});
for i in 0..256 {
let idx = self.child_indexes[i];
if idx != INVALID_CHILD_INDEX {
node256.child_ptrs[i] = self.child_ptrs[idx as usize];
}
}
node256.as_ptr().into()
}
}
impl<V: Value> NodeInternal256<V> {
fn get_prefix(&self) -> &[u8] {
&self.prefix[0..self.prefix_len as usize]
}
fn truncate_prefix(&mut self, new_prefix_len: usize) {
assert!(new_prefix_len < self.prefix_len as usize);
let prefix = &mut self.prefix;
let offset = self.prefix_len as usize - new_prefix_len;
for i in 0..new_prefix_len {
prefix[i] = prefix[i + offset];
}
self.prefix_len = new_prefix_len as u8;
}
fn find_child(&self, key_byte: u8) -> Option<NodePtr<V>> {
let idx = key_byte as usize;
if !self.child_ptrs[idx].is_null() {
Some(self.child_ptrs[idx])
} else {
None
}
}
fn replace_child(&mut self, key_byte: u8, replacement: NodePtr<V>) {
let idx = key_byte as usize;
if !self.child_ptrs[idx].is_null() {
self.child_ptrs[idx] = replacement
} else {
panic!("could not re-find parent with key {}", key_byte);
}
}
fn is_full(&self) -> bool {
self.num_children == 256
}
fn insert_child(&mut self, key_byte: u8, child: NodePtr<V>) {
assert!(self.num_children < 256);
assert!(self.child_ptrs[key_byte as usize].is_null());
self.child_ptrs[key_byte as usize] = child;
self.num_children += 1;
}
}
impl<V: Value> NodeLeaf4<V> {
fn get_prefix(&self) -> &[u8] {
&self.prefix[0..self.prefix_len as usize]
}
fn truncate_prefix(&mut self, new_prefix_len: usize) {
assert!(new_prefix_len < self.prefix_len as usize);
let prefix = &mut self.prefix;
let offset = self.prefix_len as usize - new_prefix_len;
for i in 0..new_prefix_len {
prefix[i] = prefix[i + offset];
}
self.prefix_len = new_prefix_len as u8;
}
fn get_leaf_value<'a: 'b, 'b>(&'a self, key: u8) -> Option<&'b V> {
for i in 0..self.num_values {
if self.child_keys[i as usize] == key {
assert!(self.child_values[i as usize].is_some());
return self.child_values[i as usize].as_ref();
}
}
None
}
fn is_full(&self) -> bool {
self.num_values == 4
}
fn insert_value(&mut self, key_byte: u8, value: V) {
assert!(self.num_values < 16);
let idx = self.num_values as usize;
self.child_keys[idx] = key_byte;
self.child_values[idx] = Some(value);
self.num_values += 1;
}
fn grow(&self, allocator: &Allocator) -> NodePtr<V> {
let mut node16 = allocator.alloc(NodeLeaf16 {
tag: NodeTag::Leaf16,
lock_and_version: AtomicLockAndVersion::new(),
prefix: self.prefix.clone(),
prefix_len: self.prefix_len,
num_values: self.num_values,
child_keys: [0; 16],
child_values: [const { None }; 16],
});
for i in 0..self.num_values as usize {
node16.child_keys[i] = self.child_keys[i];
node16.child_values[i] = self.child_values[i].clone();
}
node16.as_ptr().into()
}
}
impl<V: Value> NodeLeaf16<V> {
fn get_prefix(&self) -> &[u8] {
&self.prefix[0..self.prefix_len as usize]
}
fn truncate_prefix(&mut self, new_prefix_len: usize) {
assert!(new_prefix_len < self.prefix_len as usize);
let prefix = &mut self.prefix;
let offset = self.prefix_len as usize - new_prefix_len;
for i in 0..new_prefix_len {
prefix[i] = prefix[i + offset];
}
self.prefix_len = new_prefix_len as u8;
}
fn get_leaf_value(&self, key: u8) -> Option<&V> {
for i in 0..self.num_values {
if self.child_keys[i as usize] == key {
assert!(self.child_values[i as usize].is_some());
return self.child_values[i as usize].as_ref();
}
}
None
}
fn is_full(&self) -> bool {
self.num_values == 16
}
fn insert_value(&mut self, key_byte: u8, value: V) {
assert!(self.num_values < 16);
let idx = self.num_values as usize;
self.child_keys[idx] = key_byte;
self.child_values[idx] = Some(value);
self.num_values += 1;
}
fn grow(&self, allocator: &Allocator) -> NodePtr<V> {
let mut node48 = allocator.alloc(NodeLeaf48 {
tag: NodeTag::Leaf48,
lock_and_version: AtomicLockAndVersion::new(),
prefix: self.prefix.clone(),
prefix_len: self.prefix_len,
num_values: self.num_values,
child_indexes: [INVALID_CHILD_INDEX; 256],
child_values: [const { None }; 48],
});
for i in 0..self.num_values {
let idx = self.child_keys[i as usize];
node48.child_indexes[idx as usize] = i;
node48.child_values[i as usize] = self.child_values[i as usize].clone();
}
node48.as_ptr().into()
}
}
impl<V: Value> NodeLeaf48<V> {
fn get_prefix(&self) -> &[u8] {
&self.prefix[0..self.prefix_len as usize]
}
fn truncate_prefix(&mut self, new_prefix_len: usize) {
assert!(new_prefix_len < self.prefix_len as usize);
let prefix = &mut self.prefix;
let offset = self.prefix_len as usize - new_prefix_len;
for i in 0..new_prefix_len {
prefix[i] = prefix[i + offset];
}
self.prefix_len = new_prefix_len as u8;
}
fn get_leaf_value(&self, key: u8) -> Option<&V> {
let idx = self.child_indexes[key as usize];
if idx != INVALID_CHILD_INDEX {
assert!(self.child_values[idx as usize].is_some());
self.child_values[idx as usize].as_ref()
} else {
None
}
}
fn is_full(&self) -> bool {
self.num_values == 48
}
fn insert_value(&mut self, key_byte: u8, value: V) {
assert!(self.num_values < 48);
assert!(self.child_indexes[key_byte as usize] == INVALID_CHILD_INDEX);
let idx = self.num_values;
self.child_indexes[key_byte as usize] = idx;
self.child_values[idx as usize] = Some(value);
self.num_values += 1;
}
fn grow(&self, allocator: &Allocator) -> NodePtr<V> {
let mut node256 = allocator.alloc(NodeLeaf256 {
tag: NodeTag::Leaf256,
lock_and_version: AtomicLockAndVersion::new(),
prefix: self.prefix.clone(),
prefix_len: self.prefix_len,
num_values: self.num_values as u16,
child_values: [const { None }; 256],
});
for i in 0..256 {
let idx = self.child_indexes[i];
if idx != INVALID_CHILD_INDEX {
node256.child_values[i] = self.child_values[idx as usize].clone();
}
}
node256.as_ptr().into()
}
}
impl<V: Value> NodeLeaf256<V> {
fn get_prefix(&self) -> &[u8] {
&self.prefix[0..self.prefix_len as usize]
}
fn truncate_prefix(&mut self, new_prefix_len: usize) {
assert!(new_prefix_len < self.prefix_len as usize);
let prefix = &mut self.prefix;
let offset = self.prefix_len as usize - new_prefix_len;
for i in 0..new_prefix_len {
prefix[i] = prefix[i + offset];
}
self.prefix_len = new_prefix_len as u8;
}
fn get_leaf_value(&self, key: u8) -> Option<&V> {
let idx = key as usize;
self.child_values[idx].as_ref()
}
fn is_full(&self) -> bool {
self.num_values == 256
}
fn insert_value(&mut self, key_byte: u8, value: V) {
assert!(self.num_values < 256);
assert!(self.child_values[key_byte as usize].is_none());
self.child_values[key_byte as usize] = Some(value);
self.num_values += 1;
}
}
impl<V: Value> NodeInternal256<V> {
pub(crate) fn new() -> NodeInternal256<V> {
NodeInternal256 {
tag: NodeTag::Internal256,
lock_and_version: AtomicLockAndVersion::new(),
prefix: [0; MAX_PREFIX_LEN],
prefix_len: 0,
num_children: 0,
child_ptrs: [const { NodePtr::null() }; 256],
}
}
}
impl<V: Value> From<*mut NodeInternal4<V>> for NodePtr<V> {
fn from(val: *mut NodeInternal4<V>) -> NodePtr<V> {
NodePtr {
ptr: val.cast(),
phantom_value: PhantomData,
}
}
}
impl<V: Value> From<*mut NodeInternal16<V>> for NodePtr<V> {
fn from(val: *mut NodeInternal16<V>) -> NodePtr<V> {
NodePtr {
ptr: val.cast(),
phantom_value: PhantomData,
}
}
}
impl<V: Value> From<*mut NodeInternal48<V>> for NodePtr<V> {
fn from(val: *mut NodeInternal48<V>) -> NodePtr<V> {
NodePtr {
ptr: val.cast(),
phantom_value: PhantomData,
}
}
}
impl<V: Value> From<*mut NodeInternal256<V>> for NodePtr<V> {
fn from(val: *mut NodeInternal256<V>) -> NodePtr<V> {
NodePtr {
ptr: val.cast(),
phantom_value: PhantomData,
}
}
}
impl<V: Value> From<*mut NodeLeaf4<V>> for NodePtr<V> {
fn from(val: *mut NodeLeaf4<V>) -> NodePtr<V> {
NodePtr {
ptr: val.cast(),
phantom_value: PhantomData,
}
}
}
impl<V: Value> From<*mut NodeLeaf16<V>> for NodePtr<V> {
fn from(val: *mut NodeLeaf16<V>) -> NodePtr<V> {
NodePtr {
ptr: val.cast(),
phantom_value: PhantomData,
}
}
}
impl<V: Value> From<*mut NodeLeaf48<V>> for NodePtr<V> {
fn from(val: *mut NodeLeaf48<V>) -> NodePtr<V> {
NodePtr {
ptr: val.cast(),
phantom_value: PhantomData,
}
}
}
impl<V: Value> From<*mut NodeLeaf256<V>> for NodePtr<V> {
fn from(val: *mut NodeLeaf256<V>) -> NodePtr<V> {
NodePtr {
ptr: val.cast(),
phantom_value: PhantomData,
}
}
}

View File

@@ -0,0 +1,202 @@
use std::fmt::Debug;
use std::marker::PhantomData;
use super::lock_and_version::ResultOrRestart;
use super::node_ptr;
use super::node_ptr::ChildOrValuePtr;
use super::node_ptr::NodePtr;
use crate::EpochPin;
use crate::algorithm::lock_and_version::AtomicLockAndVersion;
use crate::{Allocator, Value};
pub struct NodeRef<'e, V> {
ptr: NodePtr<V>,
phantom: PhantomData<&'e EpochPin>,
}
impl<'e, V> Debug for NodeRef<'e, V> {
fn fmt(&self, fmt: &mut std::fmt::Formatter<'_>) -> Result<(), std::fmt::Error> {
write!(fmt, "{:?}", self.ptr)
}
}
impl<'e, V: Value> NodeRef<'e, V> {
pub(crate) fn from_root_ptr(root_ptr: NodePtr<V>) -> NodeRef<'e, V> {
NodeRef {
ptr: root_ptr,
phantom: PhantomData,
}
}
pub(crate) fn read_lock_or_restart(&self) -> ResultOrRestart<ReadLockedNodeRef<'e, V>> {
let version = self.lockword().read_lock_or_restart()?;
Ok(ReadLockedNodeRef {
ptr: self.ptr,
version,
phantom: self.phantom,
})
}
fn lockword(&self) -> &AtomicLockAndVersion {
self.ptr.lockword()
}
}
/// A reference to a node that has been optimistically read-locked. The functions re-check
/// the version after each read.
pub struct ReadLockedNodeRef<'e, V> {
ptr: NodePtr<V>,
version: u64,
phantom: PhantomData<&'e EpochPin>,
}
pub(crate) enum ChildOrValue<'e, V> {
Child(NodeRef<'e, V>),
Value(*const V),
}
impl<'e, V: Value> ReadLockedNodeRef<'e, V> {
pub(crate) fn is_full(&self) -> bool {
self.ptr.is_full()
}
pub(crate) fn get_prefix(&self) -> &[u8] {
self.ptr.get_prefix()
}
/// Note: because we're only holding a read lock, the prefix can change concurrently.
/// You must be prepared to restart, if read_unlock() returns error later.
///
/// Returns the length of the prefix, or None if it's not a match
pub(crate) fn prefix_matches(&self, key: &[u8]) -> Option<usize> {
self.ptr.prefix_matches(key)
}
pub(crate) fn find_child_or_value_or_restart(
&self,
key_byte: u8,
) -> ResultOrRestart<Option<ChildOrValue<'e, V>>> {
let child_or_value = self.ptr.find_child_or_value(key_byte);
self.ptr.lockword().check_or_restart(self.version)?;
match child_or_value {
None => Ok(None),
Some(ChildOrValuePtr::Value(vptr)) => Ok(Some(ChildOrValue::Value(vptr))),
Some(ChildOrValuePtr::Child(child_ptr)) => Ok(Some(ChildOrValue::Child(NodeRef {
ptr: child_ptr,
phantom: self.phantom,
}))),
}
}
pub(crate) fn upgrade_to_write_lock_or_restart(
self,
) -> ResultOrRestart<WriteLockedNodeRef<'e, V>> {
self.ptr
.lockword()
.upgrade_to_write_lock_or_restart(self.version)?;
Ok(WriteLockedNodeRef {
ptr: self.ptr,
phantom: self.phantom,
})
}
pub(crate) fn read_unlock_or_restart(self) -> ResultOrRestart<()> {
self.ptr.lockword().check_or_restart(self.version)?;
Ok(())
}
}
/// A reference to a node that has been optimistically read-locked. The functions re-check
/// the version after each read.
pub struct WriteLockedNodeRef<'e, V> {
ptr: NodePtr<V>,
phantom: PhantomData<&'e EpochPin>,
}
impl<'e, V: Value> WriteLockedNodeRef<'e, V> {
pub(crate) fn is_leaf(&self) -> bool {
self.ptr.is_leaf()
}
pub(crate) fn write_unlock(mut self) {
self.ptr.lockword().write_unlock();
self.ptr = NodePtr::null();
}
pub(crate) fn write_unlock_obsolete(mut self) {
self.ptr.lockword().write_unlock_obsolete();
self.ptr = NodePtr::null();
}
pub(crate) fn get_prefix(&self) -> &[u8] {
self.ptr.get_prefix()
}
pub(crate) fn truncate_prefix(&mut self, new_prefix_len: usize) {
self.ptr.truncate_prefix(new_prefix_len)
}
pub(crate) fn insert_child(&mut self, key_byte: u8, child: NodePtr<V>) {
self.ptr.insert_child(key_byte, child)
}
pub(crate) fn insert_value(&mut self, key_byte: u8, value: V) {
self.ptr.insert_value(key_byte, value)
}
pub(crate) fn grow(&self, allocator: &Allocator) -> NewNodeRef<V> {
let new_node = self.ptr.grow(allocator);
NewNodeRef { ptr: new_node }
}
pub(crate) fn as_ptr(&self) -> NodePtr<V> {
self.ptr
}
pub(crate) fn replace_child(&mut self, key_byte: u8, replacement: NodePtr<V>) {
self.ptr.replace_child(key_byte, replacement);
}
}
impl<'e, V> Drop for WriteLockedNodeRef<'e, V> {
fn drop(&mut self) {
if !self.ptr.is_null() {
self.ptr.lockword().write_unlock();
}
}
}
pub(crate) struct NewNodeRef<V> {
ptr: NodePtr<V>,
}
impl<V: Value> NewNodeRef<V> {
pub(crate) fn insert_child(&mut self, key_byte: u8, child: NodePtr<V>) {
self.ptr.insert_child(key_byte, child)
}
pub(crate) fn insert_value(&mut self, key_byte: u8, value: V) {
self.ptr.insert_value(key_byte, value)
}
pub(crate) fn into_ptr(self) -> NodePtr<V> {
let ptr = self.ptr;
ptr
}
}
pub(crate) fn new_internal<V: Value>(prefix: &[u8], allocator: &Allocator) -> NewNodeRef<V> {
NewNodeRef {
ptr: node_ptr::new_internal(prefix, allocator),
}
}
pub(crate) fn new_leaf<V: Value>(prefix: &[u8], allocator: &Allocator) -> NewNodeRef<V> {
NewNodeRef {
ptr: node_ptr::new_leaf(prefix, allocator),
}
}

View File

@@ -0,0 +1,107 @@
use std::marker::PhantomData;
use std::mem::MaybeUninit;
use std::ops::{Deref, DerefMut};
use std::ptr::NonNull;
use std::sync::atomic::{AtomicUsize, Ordering};
pub struct Allocator {
area: *mut MaybeUninit<u8>,
allocated: AtomicUsize,
size: usize,
}
// FIXME: I don't know if these are really safe...
unsafe impl Send for Allocator {}
unsafe impl Sync for Allocator {}
#[repr(transparent)]
pub struct AllocatedBox<'a, T> {
inner: NonNull<T>,
_phantom: PhantomData<&'a Allocator>,
}
// FIXME: I don't know if these are really safe...
unsafe impl<'a, T> Send for AllocatedBox<'a, T> {}
unsafe impl<'a, T> Sync for AllocatedBox<'a, T> {}
impl<T> Deref for AllocatedBox<'_, T> {
type Target = T;
fn deref(&self) -> &T {
unsafe { self.inner.as_ref() }
}
}
impl<T> DerefMut for AllocatedBox<'_, T> {
fn deref_mut(&mut self) -> &mut T {
unsafe { self.inner.as_mut() }
}
}
impl<T> AsMut<T> for AllocatedBox<'_, T> {
fn as_mut(&mut self) -> &mut T {
unsafe { self.inner.as_mut() }
}
}
impl<T> AllocatedBox<'_, T> {
pub fn as_ptr(&self) -> *mut T {
self.inner.as_ptr()
}
}
const MAXALIGN: usize = std::mem::align_of::<usize>();
impl Allocator {
pub fn new_uninit(area: &'static mut [MaybeUninit<u8>]) -> Allocator {
let ptr = area.as_mut_ptr();
let size = area.len();
Self::new_from_ptr(ptr, size)
}
pub fn new(area: &'static mut [u8]) -> Allocator {
let ptr: *mut MaybeUninit<u8> = area.as_mut_ptr().cast();
let size = area.len();
Self::new_from_ptr(ptr, size)
}
pub fn new_from_ptr(ptr: *mut MaybeUninit<u8>, size: usize) -> Allocator {
let padding = ptr.align_offset(MAXALIGN);
Allocator {
area: ptr,
allocated: AtomicUsize::new(padding),
size,
}
}
pub fn alloc<'a, T: Sized>(&'a self, value: T) -> AllocatedBox<'a, T> {
let sz = std::mem::size_of::<T>();
// pad all allocations to MAXALIGN boundaries
assert!(std::mem::align_of::<T>() <= MAXALIGN);
let sz = sz.next_multiple_of(MAXALIGN);
let offset = self.allocated.fetch_add(sz, Ordering::Relaxed);
if offset + sz > self.size {
panic!("out of memory");
}
let inner = unsafe {
let inner = self.area.offset(offset as isize).cast::<T>();
*inner = value;
NonNull::new_unchecked(inner)
};
AllocatedBox {
inner,
_phantom: PhantomData,
}
}
pub fn _dealloc_node<T>(&self, _node: AllocatedBox<T>) {
// doesn't free it immediately.
}
}

23
libs/neonart/src/epoch.rs Normal file
View File

@@ -0,0 +1,23 @@
//! This is similar to crossbeam_epoch crate, but works in shared memory
//!
//! FIXME: not implemented yet. (We haven't implemented removing any nodes from the ART
//! tree, which is why we get away without this now)
pub(crate) struct EpochPin {}
pub(crate) fn pin_epoch() -> EpochPin {
EpochPin {}
}
/*
struct CollectorGlobal {
epoch: AtomicU64,
participants: CachePadded<AtomicU64>, // make it an array
}
struct CollectorQueue {
}
*/

301
libs/neonart/src/lib.rs Normal file
View File

@@ -0,0 +1,301 @@
//! Adaptive Radix Tree (ART) implementation, with Optimistic Lock Coupling.
//!
//! The data structure is described in these two papers:
//!
//! [1] Leis, V. & Kemper, Alfons & Neumann, Thomas. (2013).
//! The adaptive radix tree: ARTful indexing for main-memory databases.
//! Proceedings - International Conference on Data Engineering. 38-49. 10.1109/ICDE.2013.6544812.
//! https://db.in.tum.de/~leis/papers/ART.pdf
//!
//! [2] Leis, Viktor & Scheibner, Florian & Kemper, Alfons & Neumann, Thomas. (2016).
//! The ART of practical synchronization.
//! 1-8. 10.1145/2933349.2933352.
//! https://db.in.tum.de/~leis/papers/artsync.pdf
//!
//! [1] describes the base data structure, and [2] describes the Optimistic Lock Coupling that we
//! use.
//!
//! The papers mention a few different variants. We have made the following choices in this
//! implementation:
//!
//! - All keys have the same length
//!
//! - Multi-value leaves. The values are stored directly in one of the four different leaf node
//! types.
//!
//! - For collapsing inner nodes, we use the Pessimistic approach, where each inner node stores a
//! variable length "prefix", which stores the keys of all the one-way nodes which have been
//! removed. However, similar to the "hybrid" approach described in the paper, each node only has
//! space for a constant-size prefix of 8 bytes. If a node would have a longer prefix, then we
//! create create one-way nodes to store them. (There was no particular reason for this choice,
//! the "hybrid" approach described in the paper might be better.)
//!
//! - For concurrency, we use Optimistic Lock Coupling. The paper [2] also describes another method,
//! ROWEX, which generally performs better when there is contention, but that is not important
//! for use and Optimisic Lock Coupling is simpler to implement.
//!
//! ## Requirements
//!
//! This data structure is currently used for the integrated LFC, relsize and last-written LSN cache
//! in the compute communicator, part of the 'neon' Postgres extension. We have some unique
//! requirements, which is why we had to write our own. Namely:
//!
//! - The data structure has to live in fixed-sized shared memory segment. That rules out any
//! built-in Rust collections and most crates. (Except possibly with the 'allocator_api' rust
//! feature, which still nightly-only experimental as of this writing).
//!
//! - The data structure is accessed from multiple processes. Only one process updates the data
//! structure, but other processes perform reads. That rules out using built-in Rust locking
//! primitives like Mutex and RwLock, and most crates too.
//!
//! - Within the one process with write-access, multiple threads can perform updates concurrently.
//! That rules out using PostgreSQL LWLocks for the locking.
//!
//! The implementation is generic, and doesn't depend on any PostgreSQL specifics, but it has been
//! written with that usage and the above constraints in mind. Some noteworthy assumptions:
//!
//! - Contention is assumed to be rare. In the integrated cache in PostgreSQL, there's higher level
//! locking in the PostgreSQL buffer manager, which ensures that two backends should not try to
//! read / write the same page at the same time. (Prefetching can conflict with actual reads,
//! however.)
//!
//! - The keys in the integrated cache are 17 bytes long.
//!
//! ## Usage
//!
//! Because this is designed to be used as a Postgres shared memory data structure, initialization
//! happens in three stages:
//!
//! 0. A fixed area of shared memory is allocated at postmaster startup.
//!
//! 1. TreeInitStruct::new() is called to initialize it, still in Postmaster process, before any
//! other process or thread is running. It returns a TreeInitStruct, which is inherited by all
//! the processes through fork().
//!
//! 2. One process may have write-access to the struct, by calling
//! [TreeInitStruct::attach_writer]. (That process is the communicator process.)
//!
//! 3. Other processes get read-access to the struct, by calling [TreeInitStruct::attach_reader]
//!
//! "Write access" means that you can insert / update / delete values in the tree.
//!
//! NOTE: The Values stored in the tree are sometimes moved, when a leaf node fills up and a new
//! larger node needs to be allocated. The versioning and epoch-based allocator ensure that the data
//! structure stays consistent, but if the Value has interior mutability, like atomic fields,
//! updates to such fields might be lost if the leaf node is concurrently moved! If that becomes a
//! problem, the version check could be passed up to the caller, so that the caller could detect the
//! lost updates and retry the operation.
//!
//! ## Implementation
//!
//! node_ptr: Provides low-level implementations of the four different node types (eight actually,
//! since there is an Internal and Leaf variant of each)
//!
//! lock_and_version.rs: Provides an abstraction for the combined lock and version counter on each
//! node.
//!
//! node_ref.rs: The code in node_ptr.rs deals with raw pointers. node_ref.rs provides more type-safe
//! abstractions on top.
//!
//! algorithm.rs: Contains the functions to implement lookups and updates in the tree
//!
//! allocator.rs: Provides a facility to allocate memory for the tree nodes. (We must provide our
//! own abstraction for that because we need the data structure to live in a pre-allocated shared
//! memory segment).
//!
//! epoch.rs: The data structure requires that when a node is removed from the tree, it is not
//! immediately deallocated, but stays around for as long as concurrent readers might still have
//! pointers to them. This is enforced by an epoch system. This is similar to
//! e.g. crossbeam_epoch, but we couldn't use that either because it has to work across processes
//! communicating over the shared memory segment.
//!
//! ## See also
//!
//! There are some existing Rust ART implementations out there, but none of them filled all
//! the requirements:
//!
//! - https://github.com/XiangpengHao/congee
//! - https://github.com/declanvk/blart
//!
//! ## TODO
//!
//! - Removing values has not been implemented
mod algorithm;
mod allocator;
mod epoch;
use algorithm::RootPtr;
use allocator::AllocatedBox;
use std::fmt::Debug;
use std::marker::PhantomData;
use std::sync::atomic::{AtomicBool, Ordering};
use crate::epoch::EpochPin;
#[cfg(test)]
mod tests;
pub use allocator::Allocator;
/// Fixed-length key type.
///
pub trait Key: Clone + Debug {
const KEY_LEN: usize;
fn as_bytes(&self) -> &[u8];
}
/// Values stored in the tree
///
/// Values need to be Cloneable, because when a node "grows", the value is copied to a new node and
/// the old sticks around until all readers that might see the old value are gone.
pub trait Value: Clone {}
struct Tree<K: Key, V: Value> {
root: RootPtr<V>,
writer_attached: AtomicBool,
phantom_key: PhantomData<K>,
}
/// Struct created at postmaster startup
pub struct TreeInitStruct<'t, K: Key, V: Value> {
tree: AllocatedBox<'t, Tree<K, V>>,
allocator: &'t Allocator,
}
/// The worker process has a reference to this. The write operations are only safe
/// from the worker process
pub struct TreeWriteAccess<'t, K: Key, V: Value>
where
K: Key,
V: Value,
{
tree: AllocatedBox<'t, Tree<K, V>>,
allocator: &'t Allocator,
}
/// The backends have a reference to this. It cannot be used to modify the tree
pub struct TreeReadAccess<'t, K: Key, V: Value>
where
K: Key,
V: Value,
{
tree: AllocatedBox<'t, Tree<K, V>>,
}
impl<'a, 't: 'a, K: Key, V: Value> TreeInitStruct<'t, K, V> {
pub fn new(allocator: &'t Allocator) -> TreeInitStruct<'t, K, V> {
let tree = allocator.alloc(Tree {
root: algorithm::new_root(allocator),
writer_attached: AtomicBool::new(false),
phantom_key: PhantomData,
});
TreeInitStruct { tree, allocator }
}
pub fn attach_writer(self) -> TreeWriteAccess<'t, K, V> {
let previously_attached = self.tree.writer_attached.swap(true, Ordering::Relaxed);
if previously_attached {
panic!("writer already attached");
}
TreeWriteAccess {
tree: self.tree,
allocator: self.allocator,
}
}
pub fn attach_reader(self) -> TreeReadAccess<'t, K, V> {
TreeReadAccess { tree: self.tree }
}
}
impl<'t, K: Key + Clone, V: Value> TreeWriteAccess<'t, K, V> {
pub fn start_write(&'t self) -> TreeWriteGuard<'t, K, V> {
// TODO: grab epoch guard
TreeWriteGuard {
allocator: self.allocator,
tree: &self.tree,
epoch_pin: epoch::pin_epoch(),
}
}
pub fn start_read(&'t self) -> TreeReadGuard<'t, K, V> {
TreeReadGuard {
tree: &self.tree,
epoch_pin: epoch::pin_epoch(),
}
}
}
impl<'t, K: Key + Clone, V: Value> TreeReadAccess<'t, K, V> {
pub fn start_read(&'t self) -> TreeReadGuard<'t, K, V> {
TreeReadGuard {
tree: &self.tree,
epoch_pin: epoch::pin_epoch(),
}
}
}
pub struct TreeReadGuard<'t, K, V>
where
K: Key,
V: Value,
{
tree: &'t AllocatedBox<'t, Tree<K, V>>,
epoch_pin: EpochPin,
}
impl<'t, K: Key, V: Value> TreeReadGuard<'t, K, V> {
pub fn get(&self, key: &K) -> Option<V> {
algorithm::search(key, self.tree.root, &self.epoch_pin)
}
}
pub struct TreeWriteGuard<'t, K, V>
where
K: Key,
V: Value,
{
tree: &'t AllocatedBox<'t, Tree<K, V>>,
allocator: &'t Allocator,
epoch_pin: EpochPin,
}
impl<'t, K: Key, V: Value> TreeWriteGuard<'t, K, V> {
pub fn insert(&mut self, key: &K, value: V) {
self.update_with_fn(key, |_| Some(value))
}
pub fn update_with_fn<F>(&mut self, key: &K, value_fn: F)
where
F: FnOnce(Option<&V>) -> Option<V>,
{
algorithm::update_fn(
key,
value_fn,
self.tree.root,
self.allocator,
&self.epoch_pin,
)
}
pub fn get(&mut self, key: &K) -> Option<V> {
algorithm::search(key, self.tree.root, &self.epoch_pin)
}
}
impl<'t, K: Key, V: Value + Debug> TreeWriteGuard<'t, K, V> {
pub fn dump(&mut self) {
algorithm::dump_tree(self.tree.root, &self.epoch_pin)
}
}

90
libs/neonart/src/tests.rs Normal file
View File

@@ -0,0 +1,90 @@
use std::collections::HashSet;
use crate::Allocator;
use crate::TreeInitStruct;
use crate::{Key, Value};
use rand::seq::SliceRandom;
use rand::thread_rng;
const TEST_KEY_LEN: usize = 16;
#[derive(Clone, Copy, Debug)]
struct TestKey([u8; TEST_KEY_LEN]);
impl Key for TestKey {
const KEY_LEN: usize = TEST_KEY_LEN;
fn as_bytes(&self) -> &[u8] {
&self.0
}
}
impl From<u128> for TestKey {
fn from(val: u128) -> TestKey {
TestKey(val.to_be_bytes())
}
}
impl Value for usize {}
fn test_inserts<K: Into<TestKey> + Copy>(keys: &[K]) {
const MEM_SIZE: usize = 10000000;
let area = Box::leak(Box::new_uninit_slice(MEM_SIZE));
let allocator = Box::leak(Box::new(Allocator::new_uninit(area)));
let init_struct = TreeInitStruct::<TestKey, usize>::new(allocator);
let tree_writer = init_struct.attach_writer();
for (idx, k) in keys.iter().enumerate() {
let mut w = tree_writer.start_write();
w.insert(&(*k).into(), idx);
eprintln!("INSERTED {:?}", Into::<TestKey>::into(*k));
}
//tree_writer.start_read().dump();
for (idx, k) in keys.iter().enumerate() {
let r = tree_writer.start_read();
let value = r.get(&(*k).into());
assert_eq!(value, Some(idx));
}
}
#[test]
fn dense() {
// This exercises splitting a node with prefix
let keys: &[u128] = &[0, 1, 2, 3, 256];
test_inserts(keys);
// Dense keys
let mut keys: Vec<u128> = (0..10000).collect();
test_inserts(&keys);
// Do the same in random orders
for _ in 1..10 {
keys.shuffle(&mut thread_rng());
test_inserts(&keys);
}
}
#[test]
fn sparse() {
// sparse keys
let mut keys: Vec<TestKey> = Vec::new();
let mut used_keys = HashSet::new();
for _ in 0..10000 {
loop {
let key = rand::random::<u128>();
if used_keys.get(&key).is_some() {
continue;
}
used_keys.insert(key);
keys.push(key.into());
break;
}
}
test_inserts(&keys);
}

View File

@@ -77,7 +77,9 @@ impl StorageModel {
}
SizeResult {
total_size,
// If total_size is 0, it means that the tenant has all timelines offloaded; we need to report 1
// here so that the data point shows up in the s3 files.
total_size: total_size.max(1),
segments: segment_results,
}
}

View File

@@ -42,12 +42,14 @@ nix.workspace = true
num_cpus.workspace = true
num-traits.workspace = true
once_cell.workspace = true
peekable.workspace = true
pin-project-lite.workspace = true
postgres_backend.workspace = true
postgres-protocol.workspace = true
postgres-types.workspace = true
postgres_initdb.workspace = true
pprof.workspace = true
prost.workspace = true
rand.workspace = true
range-set-blaze = { version = "0.1.16", features = ["alloc"] }
regex.workspace = true
@@ -60,6 +62,7 @@ serde_path_to_error.workspace = true
serde_with.workspace = true
sysinfo.workspace = true
tokio-tar.workspace = true
tonic.workspace = true
thiserror.workspace = true
tikv-jemallocator.workspace = true
tokio = { workspace = true, features = ["process", "sync", "fs", "rt", "io-util", "time"] }
@@ -76,6 +79,7 @@ url.workspace = true
walkdir.workspace = true
metrics.workspace = true
pageserver_api.workspace = true
pageserver_data_api.workspace = true
pageserver_client.workspace = true # for ResponseErrorMessageExt TOOD refactor that
pageserver_compaction.workspace = true
pem.workspace = true

View File

@@ -0,0 +1,13 @@
[package]
name = "pageserver_client_grpc"
version = "0.1.0"
edition = "2024"
[dependencies]
bytes.workspace = true
http.workspace = true
thiserror.workspace = true
tonic.workspace = true
tracing.workspace = true
pageserver_data_api.workspace = true

View File

@@ -0,0 +1,221 @@
//! Pageserver Data API client
//!
//! - Manage connections to pageserver
//! - Send requests to correct shards
//!
use std::collections::HashMap;
use std::sync::RwLock;
use bytes::Bytes;
use http;
use thiserror::Error;
use tonic;
use tonic::metadata::AsciiMetadataValue;
use tonic::transport::Channel;
use pageserver_data_api::model::*;
use pageserver_data_api::proto;
type Shardno = u16;
use pageserver_data_api::client::PageServiceClient;
type MyPageServiceClient = pageserver_data_api::client::PageServiceClient<
tonic::service::interceptor::InterceptedService<tonic::transport::Channel, AuthInterceptor>,
>;
#[derive(Error, Debug)]
pub enum PageserverClientError {
#[error("could not connect to service: {0}")]
ConnectError(#[from] tonic::transport::Error),
#[error("could not perform request: {0}`")]
RequestError(#[from] tonic::Status),
#[error("could not perform request: {0}`")]
InvalidUri(#[from] http::uri::InvalidUri),
}
pub struct PageserverClient {
_tenant_id: String,
_timeline_id: String,
_auth_token: Option<String>,
shard_map: HashMap<Shardno, String>,
channels: RwLock<HashMap<Shardno, Channel>>,
auth_interceptor: AuthInterceptor,
}
impl PageserverClient {
/// TODO: this doesn't currently react to changes in the shard map.
pub fn new(
tenant_id: &str,
timeline_id: &str,
auth_token: &Option<String>,
shard_map: HashMap<Shardno, String>,
) -> Self {
Self {
_tenant_id: tenant_id.to_string(),
_timeline_id: timeline_id.to_string(),
_auth_token: auth_token.clone(),
shard_map,
channels: RwLock::new(HashMap::new()),
auth_interceptor: AuthInterceptor::new(tenant_id, timeline_id, auth_token.as_ref()),
}
}
pub async fn process_rel_exists_request(
&self,
request: &RelExistsRequest,
) -> Result<bool, PageserverClientError> {
// Current sharding model assumes that all metadata is present only at shard 0.
let shard_no = 0;
let mut client = self.get_client(shard_no).await?;
let request = proto::RelExistsRequest::from(request);
let response = client.rel_exists(tonic::Request::new(request)).await?;
Ok(response.get_ref().exists)
}
pub async fn process_rel_size_request(
&self,
request: &RelSizeRequest,
) -> Result<u32, PageserverClientError> {
// Current sharding model assumes that all metadata is present only at shard 0.
let shard_no = 0;
let mut client = self.get_client(shard_no).await?;
let request = proto::RelSizeRequest::from(request);
let response = client.rel_size(tonic::Request::new(request)).await?;
Ok(response.get_ref().num_blocks)
}
pub async fn get_page(&self, request: &GetPageRequest) -> Result<Bytes, PageserverClientError> {
// FIXME: calculate the shard number correctly
let shard_no = 0;
let mut client = self.get_client(shard_no).await?;
let request = proto::GetPageRequest::from(request);
let response = client.get_page(tonic::Request::new(request)).await?;
Ok(response.into_inner().page_image)
}
/// Process a request to get the size of a database.
pub async fn process_dbsize_request(
&self,
request: &DbSizeRequest,
) -> Result<u64, PageserverClientError> {
// Current sharding model assumes that all metadata is present only at shard 0.
let shard_no = 0;
let mut client = self.get_client(shard_no).await?;
let request = proto::DbSizeRequest::from(request);
let response = client.db_size(tonic::Request::new(request)).await?;
Ok(response.get_ref().num_bytes)
}
/// Process a request to get the size of a database.
pub async fn get_base_backup(
&self,
request: &GetBaseBackupRequest,
gzip: bool,
) -> std::result::Result<
tonic::Response<tonic::codec::Streaming<proto::GetBaseBackupResponseChunk>>,
PageserverClientError,
> {
// Current sharding model assumes that all metadata is present only at shard 0.
let shard_no = 0;
let mut client = self.get_client(shard_no).await?;
if gzip {
client = client.accept_compressed(tonic::codec::CompressionEncoding::Gzip);
}
let request = proto::GetBaseBackupRequest::from(request);
let response = client.get_base_backup(tonic::Request::new(request)).await?;
Ok(response)
}
/// Get a client for given shard
///
/// This implements very basic caching. If we already have a client for the given shard,
/// reuse it. If not, create a new client and put it to the cache.
async fn get_client(
&self,
shard_no: u16,
) -> Result<MyPageServiceClient, PageserverClientError> {
let reused_channel: Option<Channel> = {
let channels = self.channels.read().unwrap();
channels.get(&shard_no).cloned()
};
let channel = if let Some(reused_channel) = reused_channel {
reused_channel
} else {
let endpoint: tonic::transport::Endpoint = self
.shard_map
.get(&shard_no)
.expect("no url for shard {shard_no}")
.parse()?;
let channel = endpoint.connect().await?;
// Insert it to the cache so that it can be reused on subsequent calls. It's possible
// that another thread did the same concurrently, in which case we will overwrite the
// client in the cache.
{
let mut channels = self.channels.write().unwrap();
channels.insert(shard_no, channel.clone());
}
channel
};
let client = PageServiceClient::with_interceptor(channel, self.auth_interceptor.clone());
Ok(client)
}
}
/// Inject tenant_id, timeline_id and authentication token to all pageserver requests.
#[derive(Clone)]
struct AuthInterceptor {
tenant_id: AsciiMetadataValue,
timeline_id: AsciiMetadataValue,
auth_token: Option<AsciiMetadataValue>,
}
impl AuthInterceptor {
fn new(tenant_id: &str, timeline_id: &str, auth_token: Option<&String>) -> Self {
Self {
tenant_id: tenant_id.parse().expect("could not parse tenant id"),
timeline_id: timeline_id.parse().expect("could not parse timeline id"),
auth_token: auth_token.map(|x| x.parse().expect("could not parse auth token")),
}
}
}
impl tonic::service::Interceptor for AuthInterceptor {
fn call(&mut self, mut req: tonic::Request<()>) -> Result<tonic::Request<()>, tonic::Status> {
req.metadata_mut()
.insert("neon-tenant-id", self.tenant_id.clone());
req.metadata_mut()
.insert("neon-timeline-id", self.timeline_id.clone());
if let Some(auth_token) = &self.auth_token {
req.metadata_mut()
.insert("neon-auth-token", auth_token.clone());
}
Ok(req)
}
}

View File

@@ -0,0 +1,18 @@
[package]
name = "pageserver_data_api"
version = "0.1.0"
edition = "2024"
[dependencies]
# For Lsn.
#
# TODO: move Lsn to separate crate? This draws in a lot more dependencies
utils.workspace = true
prost.workspace = true
thiserror.workspace = true
tonic.workspace = true
[build-dependencies]
tonic-build.workspace = true

View File

@@ -0,0 +1,8 @@
fn main() -> Result<(), Box<dyn std::error::Error>> {
// Generate rust code from .proto protobuf.
tonic_build::configure()
.bytes(&["."])
.compile_protos(&["proto/page_service.proto"], &["proto"])
.unwrap_or_else(|e| panic!("failed to compile protos {:?}", e));
Ok(())
}

View File

@@ -0,0 +1,84 @@
// Page service presented by pageservers, for computes
//
// Each request must come with the following metadata:
// - neon-tenant-id
// - neon-timeline-id
// - neon-auth-token (if auth is enabled)
//
// TODO: what else? Priority? OpenTelemetry tracing?
//
syntax = "proto3";
package page_service;
service PageService {
rpc RelExists(RelExistsRequest) returns (RelExistsResponse);
// Returns size of a relation, as # of blocks
rpc RelSize (RelSizeRequest) returns (RelSizeResponse);
rpc GetPage (GetPageRequest) returns (GetPageResponse);
// Returns total size of a database, as # of bytes
rpc DbSize (DbSizeRequest) returns (DbSizeResponse);
rpc GetBaseBackup (GetBaseBackupRequest) returns (stream GetBaseBackupResponseChunk);
}
message RequestCommon {
uint64 request_lsn = 1;
uint64 not_modified_since_lsn = 2;
}
message RelTag {
uint32 spc_oid = 1;
uint32 db_oid = 2;
uint32 rel_number = 3;
uint32 fork_number = 4;
}
message RelExistsRequest {
RequestCommon common = 1;
RelTag rel = 2;
}
message RelExistsResponse {
bool exists = 1;
}
message RelSizeRequest {
RequestCommon common = 1;
RelTag rel = 2;
}
message RelSizeResponse {
uint32 num_blocks = 1;
}
message GetPageRequest {
RequestCommon common = 1;
RelTag rel = 2;
uint32 block_number = 3;
}
message GetPageResponse {
bytes page_image = 1;
}
message DbSizeRequest {
RequestCommon common = 1;
uint32 db_oid = 2;
}
message DbSizeResponse {
uint64 num_bytes = 1;
}
message GetBaseBackupRequest {
RequestCommon common = 1;
bool replica = 2;
}
message GetBaseBackupResponseChunk {
bytes chunk = 1;
}

View File

@@ -0,0 +1,17 @@
//! This crate has two modules related to the Pageserver Data API:
//!
//! proto: code auto-generated from the protobuf definition
//! model: slightly more ergonomic structs representing the same API
//!
//! See protobuf spec under the protos/ subdirectory.
//!
//! This crate is used by both the client and the server. Try to keep it slim.
//!
pub mod model;
// Code generated by protobuf.
pub mod proto {
tonic::include_proto!("page_service");
}
pub use proto::page_service_client as client;

View File

@@ -0,0 +1,239 @@
//! Structs representing the API
//!
//! These mirror the pageserver APIs and the structs automatically generated
//! from the protobuf specification. The differences are:
//!
//! - Types that are in fact required by the API are not Options. The protobuf "required"
//! attribute is deprecated and 'prost' marks a lot of members as optional because of that.
//! (See https://github.com/tokio-rs/prost/issues/800 for a gripe on this)
//!
//! - Use more precise datatypes, e.g. Lsn and uints shorter than 32 bits.
use utils::lsn::Lsn;
use crate::proto;
#[derive(Clone, Debug)]
pub struct RequestCommon {
pub request_lsn: Lsn,
pub not_modified_since_lsn: Lsn,
}
#[derive(Clone, Debug, Eq, PartialEq, Hash, PartialOrd, Ord)]
pub struct RelTag {
pub spc_oid: u32,
pub db_oid: u32,
pub rel_number: u32,
pub fork_number: u8,
}
#[derive(Clone, Debug)]
pub struct RelExistsRequest {
pub common: RequestCommon,
pub rel: RelTag,
}
#[derive(Clone, Debug)]
pub struct RelSizeRequest {
pub common: RequestCommon,
pub rel: RelTag,
}
#[derive(Clone, Debug)]
pub struct RelSizeResponse {
pub num_blocks: u32,
}
#[derive(Clone, Debug)]
pub struct GetPageRequest {
pub common: RequestCommon,
pub rel: RelTag,
pub block_number: u32,
}
#[derive(Clone, Debug)]
pub struct GetPageResponse {
pub page_image: std::vec::Vec<u8>,
}
#[derive(Clone, Debug)]
pub struct DbSizeRequest {
pub common: RequestCommon,
pub db_oid: u32,
}
#[derive(Clone, Debug)]
pub struct DbSizeResponse {
pub num_bytes: u64,
}
#[derive(Clone, Debug)]
pub struct GetBaseBackupRequest {
pub common: RequestCommon,
pub replica: bool,
}
//--- Conversions to/from the generated proto types
use thiserror::Error;
#[derive(Error, Debug)]
pub enum ProtocolError {
#[error("the value for field `{0}` is invalid")]
InvalidValue(&'static str),
#[error("the required field `{0}` is missing ")]
Missing(&'static str),
}
impl From<ProtocolError> for tonic::Status {
fn from(e: ProtocolError) -> Self {
match e {
ProtocolError::InvalidValue(_field) => tonic::Status::invalid_argument(e.to_string()),
ProtocolError::Missing(_field) => tonic::Status::invalid_argument(e.to_string()),
}
}
}
impl From<&RelTag> for proto::RelTag {
fn from(value: &RelTag) -> proto::RelTag {
proto::RelTag {
spc_oid: value.spc_oid,
db_oid: value.db_oid,
rel_number: value.rel_number,
fork_number: value.fork_number as u32,
}
}
}
impl TryFrom<&proto::RelTag> for RelTag {
type Error = ProtocolError;
fn try_from(value: &proto::RelTag) -> Result<RelTag, ProtocolError> {
Ok(RelTag {
spc_oid: value.spc_oid,
db_oid: value.db_oid,
rel_number: value.rel_number,
fork_number: value
.fork_number
.try_into()
.or(Err(ProtocolError::InvalidValue("fork_number")))?,
})
}
}
impl From<&RequestCommon> for proto::RequestCommon {
fn from(value: &RequestCommon) -> proto::RequestCommon {
proto::RequestCommon {
request_lsn: value.request_lsn.into(),
not_modified_since_lsn: value.not_modified_since_lsn.into(),
}
}
}
impl From<&proto::RequestCommon> for RequestCommon {
fn from(value: &proto::RequestCommon) -> RequestCommon {
RequestCommon {
request_lsn: value.request_lsn.into(),
not_modified_since_lsn: value.not_modified_since_lsn.into(),
}
}
}
impl From<&RelExistsRequest> for proto::RelExistsRequest {
fn from(value: &RelExistsRequest) -> proto::RelExistsRequest {
proto::RelExistsRequest {
common: Some((&value.common).into()),
rel: Some((&value.rel).into()),
}
}
}
impl TryFrom<&proto::RelExistsRequest> for RelExistsRequest {
type Error = ProtocolError;
fn try_from(value: &proto::RelExistsRequest) -> Result<RelExistsRequest, ProtocolError> {
Ok(RelExistsRequest {
common: (&value.common.ok_or(ProtocolError::Missing("common"))?).into(),
rel: (&value.rel.ok_or(ProtocolError::Missing("rel"))?).try_into()?,
})
}
}
impl From<&RelSizeRequest> for proto::RelSizeRequest {
fn from(value: &RelSizeRequest) -> proto::RelSizeRequest {
proto::RelSizeRequest {
common: Some((&value.common).into()),
rel: Some((&value.rel).into()),
}
}
}
impl TryFrom<&proto::RelSizeRequest> for RelSizeRequest {
type Error = ProtocolError;
fn try_from(value: &proto::RelSizeRequest) -> Result<RelSizeRequest, ProtocolError> {
Ok(RelSizeRequest {
common: (&value.common.ok_or(ProtocolError::Missing("common"))?).into(),
rel: (&value.rel.ok_or(ProtocolError::Missing("rel"))?).try_into()?,
})
}
}
impl From<&GetPageRequest> for proto::GetPageRequest {
fn from(value: &GetPageRequest) -> proto::GetPageRequest {
proto::GetPageRequest {
common: Some((&value.common).into()),
rel: Some((&value.rel).into()),
block_number: value.block_number,
}
}
}
impl TryFrom<&proto::GetPageRequest> for GetPageRequest {
type Error = ProtocolError;
fn try_from(value: &proto::GetPageRequest) -> Result<GetPageRequest, ProtocolError> {
Ok(GetPageRequest {
common: (&value.common.ok_or(ProtocolError::Missing("common"))?).into(),
rel: (&value.rel.ok_or(ProtocolError::Missing("rel"))?).try_into()?,
block_number: value.block_number,
})
}
}
impl From<&DbSizeRequest> for proto::DbSizeRequest {
fn from(value: &DbSizeRequest) -> proto::DbSizeRequest {
proto::DbSizeRequest {
common: Some((&value.common).into()),
db_oid: value.db_oid,
}
}
}
impl TryFrom<&proto::DbSizeRequest> for DbSizeRequest {
type Error = ProtocolError;
fn try_from(value: &proto::DbSizeRequest) -> Result<DbSizeRequest, ProtocolError> {
Ok(DbSizeRequest {
common: (&value.common.ok_or(ProtocolError::Missing("common"))?).into(),
db_oid: value.db_oid,
})
}
}
impl From<&GetBaseBackupRequest> for proto::GetBaseBackupRequest {
fn from(value: &GetBaseBackupRequest) -> proto::GetBaseBackupRequest {
proto::GetBaseBackupRequest {
common: Some((&value.common).into()),
replica: value.replica,
}
}
}
impl TryFrom<&proto::GetBaseBackupRequest> for GetBaseBackupRequest {
type Error = ProtocolError;
fn try_from(
value: &proto::GetBaseBackupRequest,
) -> Result<GetBaseBackupRequest, ProtocolError> {
Ok(GetBaseBackupRequest {
common: (&value.common.ok_or(ProtocolError::Missing("common"))?).into(),
replica: value.replica,
})
}
}

View File

@@ -23,6 +23,8 @@ tokio.workspace = true
tokio-util.workspace = true
pageserver_client.workspace = true
pageserver_client_grpc.workspace = true
pageserver_data_api.workspace = true
pageserver_api.workspace = true
utils = { path = "../../libs/utils/" }
workspace_hack = { version = "0.1", path = "../../workspace_hack" }

View File

@@ -9,6 +9,9 @@ use anyhow::Context;
use pageserver_api::shard::TenantShardId;
use pageserver_client::mgmt_api::ForceAwaitLogicalSize;
use pageserver_client::page_service::BasebackupRequest;
use pageserver_client_grpc;
use pageserver_data_api::model::{GetBaseBackupRequest, RequestCommon};
use rand::prelude::*;
use tokio::sync::Barrier;
use tokio::task::JoinSet;
@@ -22,6 +25,8 @@ use crate::util::{request_stats, tokio_thread_local_stats};
/// basebackup@LatestLSN
#[derive(clap::Parser)]
pub(crate) struct Args {
#[clap(long, default_value = "false")]
grpc: bool,
#[clap(long, default_value = "http://localhost:9898")]
mgmt_api_endpoint: String,
#[clap(long, default_value = "postgres://postgres@localhost:64000")]
@@ -52,7 +57,7 @@ impl LiveStats {
struct Target {
timeline: TenantTimelineId,
lsn_range: Option<Range<Lsn>>,
lsn_range: Range<Lsn>,
}
#[derive(serde::Serialize)]
@@ -105,7 +110,7 @@ async fn main_impl(
anyhow::Ok(Target {
timeline,
// TODO: support lsn_range != latest LSN
lsn_range: Some(info.last_record_lsn..(info.last_record_lsn + 1)),
lsn_range: info.last_record_lsn..(info.last_record_lsn + 1),
})
}
});
@@ -149,14 +154,27 @@ async fn main_impl(
for tl in &timelines {
let (sender, receiver) = tokio::sync::mpsc::channel(1); // TODO: not sure what the implications of this are
work_senders.insert(tl, sender);
tasks.push(tokio::spawn(client(
args,
*tl,
Arc::clone(&start_work_barrier),
receiver,
Arc::clone(&all_work_done_barrier),
Arc::clone(&live_stats),
)));
let client_task = if args.grpc {
tokio::spawn(client_grpc(
args,
*tl,
Arc::clone(&start_work_barrier),
receiver,
Arc::clone(&all_work_done_barrier),
Arc::clone(&live_stats),
))
} else {
tokio::spawn(client(
args,
*tl,
Arc::clone(&start_work_barrier),
receiver,
Arc::clone(&all_work_done_barrier),
Arc::clone(&live_stats),
))
};
tasks.push(client_task);
}
let work_sender = async move {
@@ -165,7 +183,7 @@ async fn main_impl(
let (timeline, work) = {
let mut rng = rand::thread_rng();
let target = all_targets.choose(&mut rng).unwrap();
let lsn = target.lsn_range.clone().map(|r| rng.gen_range(r));
let lsn = rng.gen_range(target.lsn_range.clone());
(
target.timeline,
Work {
@@ -215,7 +233,7 @@ async fn main_impl(
#[derive(Copy, Clone)]
struct Work {
lsn: Option<Lsn>,
lsn: Lsn,
gzip: bool,
}
@@ -240,7 +258,7 @@ async fn client(
.basebackup(&BasebackupRequest {
tenant_id: timeline.tenant_id,
timeline_id: timeline.timeline_id,
lsn,
lsn: Some(lsn),
gzip,
})
.await
@@ -270,3 +288,71 @@ async fn client(
all_work_done_barrier.wait().await;
}
#[instrument(skip_all)]
async fn client_grpc(
args: &'static Args,
timeline: TenantTimelineId,
start_work_barrier: Arc<Barrier>,
mut work: tokio::sync::mpsc::Receiver<Work>,
all_work_done_barrier: Arc<Barrier>,
live_stats: Arc<LiveStats>,
) {
let shard_map = HashMap::from([(0, args.page_service_connstring.clone())]);
let client = pageserver_client_grpc::PageserverClient::new(
&timeline.tenant_id.to_string(),
&timeline.timeline_id.to_string(),
&None,
shard_map,
);
start_work_barrier.wait().await;
while let Some(Work { lsn, gzip }) = work.recv().await {
let start = Instant::now();
//tokio::time::sleep(std::time::Duration::from_secs(1)).await;
info!("starting get_base_backup");
let mut basebackup_stream = client
.get_base_backup(
&GetBaseBackupRequest {
common: RequestCommon {
request_lsn: lsn,
not_modified_since_lsn: lsn,
},
replica: false,
},
gzip,
)
.await
.with_context(|| format!("start basebackup for {timeline}"))
.unwrap()
.into_inner();
info!("starting receive");
use futures::StreamExt;
let mut size = 0;
let mut nchunks = 0;
while let Some(chunk) = basebackup_stream.next().await {
let chunk = chunk
.with_context(|| format!("error during basebackup"))
.unwrap();
size += chunk.chunk.len();
nchunks += 1;
}
info!(
"basebackup size is {} bytes, avg chunk size {} bytes",
size,
size as f32 / nchunks as f32
);
let elapsed = start.elapsed();
live_stats.inc();
STATS.with(|stats| {
stats.borrow().lock().unwrap().observe(elapsed).unwrap();
});
}
all_work_done_barrier.wait().await;
}

View File

@@ -1,4 +1,4 @@
use std::collections::{HashSet, VecDeque};
use std::collections::{HashMap, HashSet, VecDeque};
use std::future::Future;
use std::num::NonZeroUsize;
use std::pin::Pin;
@@ -8,6 +8,8 @@ use std::time::{Duration, Instant};
use anyhow::Context;
use camino::Utf8PathBuf;
use futures::StreamExt;
use futures::stream::FuturesOrdered;
use pageserver_api::key::Key;
use pageserver_api::keyspace::KeySpaceAccum;
use pageserver_api::models::{PagestreamGetPageRequest, PagestreamRequest};
@@ -25,6 +27,8 @@ use crate::util::{request_stats, tokio_thread_local_stats};
/// GetPage@LatestLSN, uniformly distributed across the compute-accessible keyspace.
#[derive(clap::Parser)]
pub(crate) struct Args {
#[clap(long, default_value = "false")]
grpc: bool,
#[clap(long, default_value = "http://localhost:9898")]
mgmt_api_endpoint: String,
#[clap(long, default_value = "postgres://postgres@localhost:64000")]
@@ -295,7 +299,29 @@ async fn main_impl(
.unwrap();
Box::pin(async move {
client_libpq(args, worker_id, ss, cancel, rps_period, ranges, weights).await
if args.grpc {
client_grpc(
args,
worker_id,
ss,
cancel,
rps_period,
ranges,
weights,
)
.await
} else {
client_libpq(
args,
worker_id,
ss,
cancel,
rps_period,
ranges,
weights,
)
.await
}
})
};
@@ -434,3 +460,100 @@ async fn client_libpq(
}
}
}
async fn client_grpc(
args: &Args,
worker_id: WorkerId,
shared_state: Arc<SharedState>,
cancel: CancellationToken,
rps_period: Option<Duration>,
ranges: Vec<KeyRange>,
weights: rand::distributions::weighted::WeightedIndex<i128>,
) {
let shard_map = HashMap::from([(0, args.page_service_connstring.clone())]);
let client = pageserver_client_grpc::PageserverClient::new(
&worker_id.timeline.tenant_id.to_string(),
&worker_id.timeline.timeline_id.to_string(),
&None,
shard_map,
);
let client = Arc::new(client);
shared_state.start_work_barrier.wait().await;
let client_start = Instant::now();
let mut ticks_processed = 0;
let mut inflight = FuturesOrdered::new();
while !cancel.is_cancelled() {
// Detect if a request took longer than the RPS rate
if let Some(period) = &rps_period {
let periods_passed_until_now =
usize::try_from(client_start.elapsed().as_micros() / period.as_micros()).unwrap();
if periods_passed_until_now > ticks_processed {
shared_state
.live_stats
.missed((periods_passed_until_now - ticks_processed) as u64);
}
ticks_processed = periods_passed_until_now;
}
while inflight.len() < args.queue_depth.get() {
let start = Instant::now();
let req = {
let mut rng = rand::thread_rng();
let r = &ranges[weights.sample(&mut rng)];
let key: i128 = rng.gen_range(r.start..r.end);
let key = Key::from_i128(key);
assert!(key.is_rel_block_key());
let (rel_tag, block_no) = key
.to_rel_block()
.expect("we filter non-rel-block keys out above");
pageserver_data_api::model::GetPageRequest {
common: pageserver_data_api::model::RequestCommon {
request_lsn: if rng.gen_bool(args.req_latest_probability) {
Lsn::MAX
} else {
r.timeline_lsn
},
not_modified_since_lsn: r.timeline_lsn,
},
rel: pageserver_data_api::model::RelTag {
spc_oid: rel_tag.spcnode,
db_oid: rel_tag.dbnode,
rel_number: rel_tag.relnode,
fork_number: rel_tag.forknum,
},
block_number: block_no,
}
};
let client_clone = client.clone();
let getpage_fut = async move {
let result = client_clone.get_page(&req).await;
(start, result)
};
inflight.push_back(getpage_fut);
}
let (start, result) = inflight.next().await.unwrap();
result.expect("getpage request should succeed");
let end = Instant::now();
shared_state.live_stats.request_done();
ticks_processed += 1;
STATS.with(|stats| {
stats
.borrow()
.lock()
.unwrap()
.observe(end.duration_since(start))
.unwrap();
});
if let Some(period) = &rps_period {
let next_at = client_start
+ Duration::from_micros(
(ticks_processed) as u64 * u64::try_from(period.as_micros()).unwrap(),
);
tokio::time::sleep_until(next_at.into()).await;
}
}
}

View File

@@ -151,10 +151,14 @@ where
.map_err(|_| BasebackupError::Shutdown)?,
),
};
basebackup
let res = basebackup
.send_tarball()
.instrument(info_span!("send_tarball", backup_lsn=%backup_lsn))
.await
.await;
info!("basebackup done!");
res
}
/// This is short-living object only for the time of tarball creation,

View File

@@ -16,6 +16,7 @@ use http_utils::tls_certs::ReloadingCertificateResolver;
use metrics::launch_timestamp::{LaunchTimestamp, set_launch_timestamp_metric};
use metrics::set_build_info_metric;
use nix::sys::socket::{setsockopt, sockopt};
use pageserver::compute_service;
use pageserver::config::{PageServerConf, PageserverIdentity, ignored_fields};
use pageserver::controller_upcall_client::StorageControllerUpcallClient;
use pageserver::deletion_queue::DeletionQueue;
@@ -27,7 +28,7 @@ use pageserver::task_mgr::{
use pageserver::tenant::{TenantSharedResources, mgr, secondary};
use pageserver::{
CancellableTask, ConsumptionMetricsTasks, HttpEndpointListener, HttpsEndpointListener, http,
page_cache, page_service, task_mgr, virtual_file,
page_cache, task_mgr, virtual_file,
};
use postgres_backend::AuthType;
use remote_storage::GenericRemoteStorage;
@@ -745,7 +746,7 @@ fn start_pageserver(
// Spawn a task to listen for libpq connections. It will spawn further tasks
// for each connection. We created the listener earlier already.
let perf_trace_dispatch = otel_guard.as_ref().map(|g| g.dispatch.clone());
let page_service = page_service::spawn(
let compute_service = compute_service::spawn(
conf,
tenant_manager.clone(),
pg_auth,
@@ -782,7 +783,7 @@ fn start_pageserver(
pageserver::shutdown_pageserver(
http_endpoint_listener,
https_endpoint_listener,
page_service,
compute_service,
consumption_metrics_tasks,
disk_usage_eviction_task,
&tenant_manager,

View File

@@ -0,0 +1,286 @@
//!
//! The Compute Service listens for compute connections, and serves requests like
//! the GetPage@LSN requests.
//!
//! We support two protocols:
//!
//! 1. Legacy, connection-oriented libpq based protocol. That's
//! handled by the code in page_service.rs.
//!
//! 2. gRPC based protocol. See compute_service_grpc.rs.
//!
//! To make the transition smooth, without having to open up new firewall ports
//! etc, both protocols are served on the same port. When a new TCP connection
//! is accepted, we peek at the first few bytes incoming from the client to
//! determine which protocol it speaks.
//!
//! TODO: This gets easier once we drop the legacy protocol support. Or if we
//! open a separate port for them.
use std::sync::Arc;
use anyhow::Context;
use futures::FutureExt;
use pageserver_api::config::PageServicePipeliningConfig;
use postgres_backend::AuthType;
use tokio::task::JoinHandle;
use tokio_util::sync::CancellationToken;
use tracing::*;
use utils::auth::SwappableJwtAuth;
use utils::sync::gate::{Gate, GateGuard};
use crate::compute_service_grpc::launch_compute_service_grpc_server;
use crate::config::PageServerConf;
use crate::context::{DownloadBehavior, RequestContext, RequestContextBuilder};
use crate::page_service::libpq_page_service_conn_main;
use crate::task_mgr::{self, COMPUTE_REQUEST_RUNTIME, TaskKind};
use crate::tenant::mgr::TenantManager;
///////////////////////////////////////////////////////////////////////////////
pub type ConnectionHandlerResult = anyhow::Result<()>;
pub struct Connections {
cancel: CancellationToken,
tasks: tokio::task::JoinSet<ConnectionHandlerResult>,
gate: Gate,
}
impl Connections {
pub(crate) async fn shutdown(self) {
let Self {
cancel,
mut tasks,
gate,
} = self;
cancel.cancel();
while let Some(res) = tasks.join_next().await {
Self::handle_connection_completion(res);
}
gate.close().await;
}
fn handle_connection_completion(res: Result<anyhow::Result<()>, tokio::task::JoinError>) {
match res {
Ok(Ok(())) => {}
Ok(Err(e)) => error!("error in page_service connection task: {:?}", e),
Err(e) => error!("page_service connection task panicked: {:?}", e),
}
}
}
pub struct Listener {
cancel: CancellationToken,
/// Cancel the listener task through `listen_cancel` to shut down the listener
/// and get a handle on the existing connections.
task: JoinHandle<Connections>,
}
pub fn spawn(
conf: &'static PageServerConf,
tenant_manager: Arc<TenantManager>,
pg_auth: Option<Arc<SwappableJwtAuth>>,
perf_trace_dispatch: Option<Dispatch>,
tcp_listener: tokio::net::TcpListener,
tls_config: Option<Arc<rustls::ServerConfig>>,
) -> Listener {
let cancel = CancellationToken::new();
let libpq_ctx = RequestContext::todo_child(
TaskKind::LibpqEndpointListener,
// listener task shouldn't need to download anything. (We will
// create a separate sub-contexts for each connection, with their
// own download behavior. This context is used only to listen and
// accept connections.)
DownloadBehavior::Error,
);
let task = COMPUTE_REQUEST_RUNTIME.spawn(task_mgr::exit_on_panic_or_error(
"compute connection listener",
compute_connection_listener_main(
conf,
tenant_manager,
pg_auth,
perf_trace_dispatch,
tcp_listener,
conf.pg_auth_type,
tls_config,
conf.page_service_pipelining.clone(),
libpq_ctx,
cancel.clone(),
)
.map(anyhow::Ok),
));
Listener { cancel, task }
}
impl Listener {
pub async fn stop_accepting(self) -> Connections {
self.cancel.cancel();
self.task
.await
.expect("unreachable: we wrap the listener task in task_mgr::exit_on_panic_or_error")
}
}
/// Listener loop. Listens for connections, and launches a new handler
/// task for each.
///
/// Returns Ok(()) upon cancellation via `cancel`, returning the set of
/// open connections.
///
#[allow(clippy::too_many_arguments)]
pub async fn compute_connection_listener_main(
conf: &'static PageServerConf,
tenant_manager: Arc<TenantManager>,
auth: Option<Arc<SwappableJwtAuth>>,
perf_trace_dispatch: Option<Dispatch>,
listener: tokio::net::TcpListener,
auth_type: AuthType,
tls_config: Option<Arc<rustls::ServerConfig>>,
pipelining_config: PageServicePipeliningConfig,
listener_ctx: RequestContext,
listener_cancel: CancellationToken,
) -> Connections {
let connections_cancel = CancellationToken::new();
let connections_gate = Gate::default();
let mut connection_handler_tasks = tokio::task::JoinSet::default();
// The connection handling task passes the gRPC protocol
// connections to this channel. The tonic gRPC server reads the
// channel and takes over the connections from there.
let (grpc_connections_tx, grpc_connections_rx) = tokio::sync::mpsc::channel(1000);
// Set up the gRPC service
launch_compute_service_grpc_server(
grpc_connections_rx,
conf,
tenant_manager.clone(),
auth.clone(),
auth_type,
connections_cancel.clone(),
&listener_ctx,
);
// Main listener loop
loop {
let gate_guard = match connections_gate.enter() {
Ok(guard) => guard,
Err(_) => break,
};
let accepted = tokio::select! {
biased;
_ = listener_cancel.cancelled() => break,
next = connection_handler_tasks.join_next(), if !connection_handler_tasks.is_empty() => {
let res = next.expect("we dont poll while empty");
Connections::handle_connection_completion(res);
continue;
}
accepted = listener.accept() => accepted,
};
match accepted {
Ok((socket, peer_addr)) => {
// Connection established. Spawn a new task to handle it.
debug!("accepted connection from {}", peer_addr);
let local_auth = auth.clone();
let connection_ctx = RequestContextBuilder::from(&listener_ctx)
.task_kind(TaskKind::PageRequestHandler)
.download_behavior(DownloadBehavior::Download)
.perf_span_dispatch(perf_trace_dispatch.clone())
.detached_child();
connection_handler_tasks.spawn(page_service_conn_main(
conf,
tenant_manager.clone(),
local_auth,
socket,
auth_type,
tls_config.clone(),
pipelining_config.clone(),
connection_ctx,
connections_cancel.child_token(),
gate_guard,
grpc_connections_tx.clone(),
));
}
Err(err) => {
// accept() failed. Log the error, and loop back to retry on next connection.
error!("accept() failed: {:?}", err);
}
}
}
debug!("page_service listener loop terminated");
Connections {
cancel: connections_cancel,
tasks: connection_handler_tasks,
gate: connections_gate,
}
}
/// Handle a new incoming connection.
///
/// This peeks at the first few incoming bytes and dispatches the connection
/// to the legacy libpq handler or the new gRPC handler accordingly.
#[instrument(skip_all, fields(peer_addr, application_name, compute_mode))]
#[allow(clippy::too_many_arguments)]
pub async fn page_service_conn_main(
conf: &'static PageServerConf,
tenant_manager: Arc<TenantManager>,
auth: Option<Arc<SwappableJwtAuth>>,
socket: tokio::net::TcpStream,
auth_type: AuthType,
tls_config: Option<Arc<rustls::ServerConfig>>,
pipelining_config: PageServicePipeliningConfig,
connection_ctx: RequestContext,
cancel: CancellationToken,
gate_guard: GateGuard,
grpc_connections_tx: tokio::sync::mpsc::Sender<tokio::io::Result<tokio::net::TcpStream>>,
) -> ConnectionHandlerResult {
let mut buf: [u8; 4] = [0; 4];
socket
.set_nodelay(true)
.context("could not set TCP_NODELAY")?;
// Peek
socket.peek(&mut buf).await?;
let mut grpc = false;
if buf[0] == 0x16 {
// looks like a TLS handshake. Assume gRPC.
// XXX: Starting with v17, PostgreSQL also supports "direct TLS mode". But
// the compute doesn't use it.
grpc = true;
}
if buf[0] == b'G' || buf[0] == b'P' {
// Looks like 'GET' or 'POST'
// or 'PRI', indicating gRPC over HTTP/2 with prior knowledge
grpc = true;
}
// Dispatch
if grpc {
grpc_connections_tx.send(Ok(socket)).await?;
info!("connection sent to channel");
Ok(())
} else {
libpq_page_service_conn_main(
conf,
tenant_manager,
auth,
socket,
auth_type,
tls_config,
pipelining_config,
connection_ctx,
cancel,
gate_guard,
)
.await
}
}

View File

@@ -0,0 +1,746 @@
//!
//! Compute <-> Pageserver API handler. This is for the new gRPC-based protocol
//!
//! TODO:
//!
//! - Many of the API endpoints are still missing
//!
//! - This is very much not optimized.
//!
//! - Much of the code was copy-pasted from page_service.rs. Like the code to get the
//! Timeline object, and the JWT auth. Could refactor and share.
//!
//!
use std::pin::Pin;
use std::str::FromStr;
use std::sync::Arc;
use std::task::Poll;
use std::time::Duration;
use std::time::Instant;
use crate::TenantManager;
use crate::auth::check_permission;
use crate::basebackup;
use crate::basebackup::BasebackupError;
use crate::config::PageServerConf;
use crate::context::{DownloadBehavior, RequestContext, RequestContextBuilder};
use crate::task_mgr::TaskKind;
use crate::tenant::Timeline;
use crate::tenant::mgr::ShardResolveResult;
use crate::tenant::mgr::ShardSelector;
use crate::tenant::storage_layer::IoConcurrency;
use crate::tenant::timeline::WaitLsnTimeout;
use tokio::io::{AsyncWriteExt, ReadHalf, SimplexStream};
use tokio::task::JoinHandle;
use tokio_util::codec::{Decoder, FramedRead};
use tokio_util::sync::CancellationToken;
use futures::stream::StreamExt;
use pageserver_data_api::model;
use pageserver_data_api::proto::page_service_server::PageService;
use pageserver_data_api::proto::page_service_server::PageServiceServer;
use anyhow::Context;
use bytes::BytesMut;
use jsonwebtoken::TokenData;
use tracing::Instrument;
use tracing::{debug, error};
use utils::auth::SwappableJwtAuth;
use utils::id::{TenantId, TenantTimelineId, TimelineId};
use utils::lsn::Lsn;
use utils::simple_rcu::RcuReadGuard;
use crate::tenant::PageReconstructError;
use postgres_ffi::BLCKSZ;
use tonic;
use tonic::codec::CompressionEncoding;
use tonic::service::interceptor::InterceptedService;
use pageserver_api::key::rel_block_to_key;
use crate::pgdatadir_mapping::Version;
use postgres_ffi::pg_constants::DEFAULTTABLESPACE_OID;
use postgres_backend::AuthType;
pub use pageserver_data_api::proto;
pub(super) fn launch_compute_service_grpc_server(
tcp_connections_rx: tokio::sync::mpsc::Receiver<tokio::io::Result<tokio::net::TcpStream>>,
conf: &'static PageServerConf,
tenant_manager: Arc<TenantManager>,
auth: Option<Arc<SwappableJwtAuth>>,
auth_type: AuthType,
connections_cancel: CancellationToken,
listener_ctx: &RequestContext,
) {
// Set up the gRPC service
let service_ctx = RequestContextBuilder::from(listener_ctx)
.task_kind(TaskKind::PageRequestHandler)
.download_behavior(DownloadBehavior::Download)
.attached_child();
let service = crate::compute_service_grpc::PageServiceService {
conf,
tenant_mgr: tenant_manager.clone(),
ctx: Arc::new(service_ctx),
};
let authenticator = PageServiceAuthenticator {
auth: auth.clone(),
auth_type,
};
let server = InterceptedService::new(
PageServiceServer::new(service).send_compressed(CompressionEncoding::Gzip),
authenticator,
);
let cc = connections_cancel.clone();
tokio::spawn(async move {
tonic::transport::Server::builder()
.add_service(server)
.serve_with_incoming_shutdown(
tokio_stream::wrappers::ReceiverStream::new(tcp_connections_rx),
cc.cancelled(),
)
.await
});
}
struct PageServiceService {
conf: &'static PageServerConf,
tenant_mgr: Arc<TenantManager>,
ctx: Arc<RequestContext>,
}
/// An error happened in a get() operation.
impl From<PageReconstructError> for tonic::Status {
fn from(e: PageReconstructError) -> Self {
match e {
PageReconstructError::Other(err) => tonic::Status::unknown(err.to_string()),
PageReconstructError::AncestorLsnTimeout(_) => {
tonic::Status::unavailable(e.to_string())
}
PageReconstructError::Cancelled => tonic::Status::aborted(e.to_string()),
PageReconstructError::WalRedo(_) => tonic::Status::internal(e.to_string()),
PageReconstructError::MissingKey(_) => tonic::Status::internal(e.to_string()),
}
}
}
fn convert_reltag(value: &model::RelTag) -> pageserver_api::reltag::RelTag {
pageserver_api::reltag::RelTag {
spcnode: value.spc_oid,
dbnode: value.db_oid,
relnode: value.rel_number,
forknum: value.fork_number,
}
}
#[tonic::async_trait]
impl PageService for PageServiceService {
type GetBaseBackupStream = GetBaseBackupStream;
async fn rel_exists(
&self,
request: tonic::Request<proto::RelExistsRequest>,
) -> std::result::Result<tonic::Response<proto::RelExistsResponse>, tonic::Status> {
let ttid = self.extract_ttid(request.metadata())?;
let req: model::RelExistsRequest = request.get_ref().try_into()?;
let rel = convert_reltag(&req.rel);
let span = tracing::info_span!("rel_exists", tenant_id = %ttid.tenant_id, timeline_id = %ttid.timeline_id, rel = %rel, req_lsn = %req.common.request_lsn);
async {
let timeline = self.get_timeline(ttid, ShardSelector::Zero).await?;
let ctx = self.ctx.with_scope_timeline(&timeline);
let latest_gc_cutoff_lsn = timeline.get_applied_gc_cutoff_lsn();
let lsn = Self::wait_or_get_last_lsn(
&timeline,
req.common.request_lsn,
req.common.not_modified_since_lsn,
&latest_gc_cutoff_lsn,
&ctx,
)
.await?;
let exists = timeline
.get_rel_exists(rel, Version::Lsn(lsn), &ctx)
.await?;
Ok(tonic::Response::new(proto::RelExistsResponse { exists }))
}
.instrument(span)
.await
}
/// Returns size of a relation, as # of blocks
async fn rel_size(
&self,
request: tonic::Request<proto::RelSizeRequest>,
) -> std::result::Result<tonic::Response<proto::RelSizeResponse>, tonic::Status> {
let ttid = self.extract_ttid(request.metadata())?;
let req: model::RelSizeRequest = request.get_ref().try_into()?;
let rel = convert_reltag(&req.rel);
let span = tracing::info_span!("rel_size", tenant_id = %ttid.tenant_id, timeline_id = %ttid.timeline_id, rel = %rel, req_lsn = %req.common.request_lsn);
async {
let timeline = self.get_timeline(ttid, ShardSelector::Zero).await?;
let ctx = self.ctx.with_scope_timeline(&timeline);
let latest_gc_cutoff_lsn = timeline.get_applied_gc_cutoff_lsn();
let lsn = Self::wait_or_get_last_lsn(
&timeline,
req.common.request_lsn,
req.common.not_modified_since_lsn,
&latest_gc_cutoff_lsn,
&ctx,
)
.await?;
let num_blocks = timeline.get_rel_size(rel, Version::Lsn(lsn), &ctx).await?;
Ok(tonic::Response::new(proto::RelSizeResponse { num_blocks }))
}
.instrument(span)
.await
}
async fn get_page(
&self,
request: tonic::Request<proto::GetPageRequest>,
) -> std::result::Result<tonic::Response<proto::GetPageResponse>, tonic::Status> {
let ttid = self.extract_ttid(request.metadata())?;
let req: model::GetPageRequest = request.get_ref().try_into()?;
// Calculate shard number.
//
// FIXME: this should probably be part of the data_api crate.
let rel = convert_reltag(&req.rel);
let key = rel_block_to_key(rel, req.block_number);
let timeline = self.get_timeline(ttid, ShardSelector::Page(key)).await?;
let ctx = self.ctx.with_scope_timeline(&timeline);
let latest_gc_cutoff_lsn = timeline.get_applied_gc_cutoff_lsn();
let lsn = Self::wait_or_get_last_lsn(
&timeline,
req.common.request_lsn,
req.common.not_modified_since_lsn,
&latest_gc_cutoff_lsn,
&ctx,
)
.await?;
let shard_id = timeline.tenant_shard_id.shard_number;
let span = tracing::info_span!("get_page", tenant_id = %ttid.tenant_id, shard_id = %shard_id, timeline_id = %ttid.timeline_id, rel = %rel, block_number = %req.block_number, req_lsn = %req.common.request_lsn);
async {
let gate_guard = match timeline.gate.enter() {
Ok(guard) => guard,
Err(_) => {
return Err(tonic::Status::unavailable("timeline is shutting down"));
}
};
let io_concurrency = IoConcurrency::spawn_from_conf(self.conf, gate_guard);
let page_image = timeline
.get_rel_page_at_lsn(
rel,
req.block_number,
Version::Lsn(lsn),
&ctx,
io_concurrency,
)
.await?;
Ok(tonic::Response::new(proto::GetPageResponse {
page_image: page_image,
}))
}
.instrument(span)
.await
}
async fn db_size(
&self,
request: tonic::Request<proto::DbSizeRequest>,
) -> Result<tonic::Response<proto::DbSizeResponse>, tonic::Status> {
let ttid = self.extract_ttid(request.metadata())?;
let req: model::DbSizeRequest = request.get_ref().try_into()?;
let span = tracing::info_span!("get_page", tenant_id = %ttid.tenant_id, timeline_id = %ttid.timeline_id, db_oid = %req.db_oid, req_lsn = %req.common.request_lsn);
async {
let timeline = self.get_timeline(ttid, ShardSelector::Zero).await?;
let ctx = self.ctx.with_scope_timeline(&timeline);
let latest_gc_cutoff_lsn = timeline.get_applied_gc_cutoff_lsn();
let lsn = Self::wait_or_get_last_lsn(
&timeline,
req.common.request_lsn,
req.common.not_modified_since_lsn,
&latest_gc_cutoff_lsn,
&ctx,
)
.await?;
let total_blocks = timeline
.get_db_size(DEFAULTTABLESPACE_OID, req.db_oid, Version::Lsn(lsn), &ctx)
.await?;
Ok(tonic::Response::new(proto::DbSizeResponse {
num_bytes: total_blocks as u64 * BLCKSZ as u64,
}))
}
.instrument(span)
.await
}
async fn get_base_backup(
&self,
request: tonic::Request<proto::GetBaseBackupRequest>,
) -> Result<tonic::Response<Self::GetBaseBackupStream>, tonic::Status> {
let ttid = self.extract_ttid(request.metadata())?;
let req: model::GetBaseBackupRequest = request.get_ref().try_into()?;
let timeline = self.get_timeline(ttid, ShardSelector::Zero).await?;
let ctx = self.ctx.with_scope_timeline(&timeline);
let latest_gc_cutoff_lsn = timeline.get_applied_gc_cutoff_lsn();
let lsn = Self::wait_or_get_last_lsn(
&timeline,
req.common.request_lsn,
req.common.not_modified_since_lsn,
&latest_gc_cutoff_lsn,
&ctx,
)
.await?;
let span = tracing::info_span!("get_base_backup", tenant_id = %ttid.tenant_id, timeline_id = %ttid.timeline_id, req_lsn = %req.common.request_lsn);
tracing::info!("starting basebackup");
#[allow(dead_code)]
enum TestMode {
/// Create real basebackup, in streaming fashion
Streaming,
/// Create real basebackup, but fully materialize it in the 'simplex' pipe buffer first
Materialize,
/// Create a dummy all-zeros basebackup, in streaming fashion
DummyStreaming,
/// Create a dummy all-zeros basebackup, but fully materialize it first
DummyMaterialize,
}
let mode = TestMode::Streaming;
let buf_size = match mode {
TestMode::Streaming | TestMode::DummyStreaming => 64 * 1024,
TestMode::Materialize | TestMode::DummyMaterialize => 64 * 1024 * 1024,
};
let (simplex_read, mut simplex_write) = tokio::io::simplex(buf_size);
let basebackup_task = match mode {
TestMode::DummyStreaming => {
tokio::spawn(
async move {
// hold onto the guard for as long as the basebackup runs
let _latest_gc_cutoff_lsn = latest_gc_cutoff_lsn;
let zerosbuf: [u8; 1024] = [0; 1024];
let nbytes = 16900000;
let mut bytes_written = 0;
while bytes_written < nbytes {
let s = std::cmp::min(1024, nbytes - bytes_written);
let _ = simplex_write.write_all(&zerosbuf[0..s]).await;
bytes_written += s;
}
simplex_write
.shutdown()
.await
.context("shutdown of basebackup pipe")?;
Ok(())
}
.instrument(span),
)
}
TestMode::DummyMaterialize => {
let zerosbuf: [u8; 1024] = [0; 1024];
let nbytes = 16900000;
let mut bytes_written = 0;
while bytes_written < nbytes {
let s = std::cmp::min(1024, nbytes - bytes_written);
let _ = simplex_write.write_all(&zerosbuf[0..s]).await;
bytes_written += s;
}
simplex_write
.shutdown()
.await
.expect("shutdown of basebackup pipe");
tracing::info!("basebackup (dummy) materialized");
let result = Ok(());
tokio::spawn(std::future::ready(result))
}
TestMode::Materialize => {
let result = basebackup::send_basebackup_tarball(
&mut simplex_write,
&timeline,
Some(lsn),
None,
false,
req.replica,
&ctx,
)
.await;
simplex_write
.shutdown()
.await
.expect("shutdown of basebackup pipe");
tracing::info!("basebackup materialized");
// Launch a task that writes the basebackup tarball to the simplex pipe
tokio::spawn(std::future::ready(result))
}
TestMode::Streaming => {
tokio::spawn(
async move {
// hold onto the guard for as long as the basebackup runs
let _latest_gc_cutoff_lsn = latest_gc_cutoff_lsn;
let result = basebackup::send_basebackup_tarball(
&mut simplex_write,
&timeline,
Some(lsn),
None,
false,
req.replica,
&ctx,
)
.await;
simplex_write
.shutdown()
.await
.context("shutdown of basebackup pipe")?;
result
}
.instrument(span),
)
}
};
let response = new_basebackup_response_stream(simplex_read, basebackup_task);
Ok(tonic::Response::new(response))
}
}
/// NB: this is a different value than [`crate::http::routes::ACTIVE_TENANT_TIMEOUT`].
/// NB: and also different from page_service::ACTIVE_TENANT_TIMEOUT
const ACTIVE_TENANT_TIMEOUT: Duration = Duration::from_millis(30000);
impl PageServiceService {
async fn get_timeline(
&self,
ttid: TenantTimelineId,
shard_selector: ShardSelector,
) -> Result<Arc<Timeline>, tonic::Status> {
let timeout = ACTIVE_TENANT_TIMEOUT;
let wait_start = Instant::now();
let deadline = wait_start + timeout;
let tenant_shard = loop {
let resolved = self
.tenant_mgr
.resolve_attached_shard(&ttid.tenant_id, shard_selector);
match resolved {
ShardResolveResult::Found(tenant_shard) => break tenant_shard,
ShardResolveResult::NotFound => {
return Err(tonic::Status::not_found("tenant not found"));
}
ShardResolveResult::InProgress(barrier) => {
// We can't authoritatively answer right now: wait for InProgress state
// to end, then try again
tokio::select! {
_ = barrier.wait() => {
// The barrier completed: proceed around the loop to try looking up again
},
_ = tokio::time::sleep(deadline.duration_since(Instant::now())) => {
return Err(tonic::Status::unavailable("tenant is in InProgress state"));
}
}
}
}
};
tracing::debug!("Waiting for tenant to enter active state...");
tenant_shard
.wait_to_become_active(deadline.duration_since(Instant::now()))
.await
.map_err(|e| {
tonic::Status::unavailable(format!("tenant is not in active state: {e}"))
})?;
let timeline = tenant_shard
.get_timeline(ttid.timeline_id, true)
.map_err(|e| tonic::Status::unavailable(format!("could not get timeline: {e}")))?;
// FIXME: need to do something with the 'gate' here?
Ok(timeline)
}
/// Extract TenantTimelineId from the request metadata
///
/// Note: the interceptor has already authenticated the request
///
/// TOOD: Could we use "binary" metadata for these, for efficiency? gRPC has such a concept
fn extract_ttid(
&self,
metadata: &tonic::metadata::MetadataMap,
) -> Result<TenantTimelineId, tonic::Status> {
let tenant_id = metadata
.get("neon-tenant-id")
.ok_or(tonic::Status::invalid_argument(
"neon-tenant-id metadata missing",
))?;
let tenant_id = tenant_id.to_str().map_err(|_| {
tonic::Status::invalid_argument("invalid UTF-8 characters in neon-tenant-id metadata")
})?;
let tenant_id = TenantId::from_str(tenant_id)
.map_err(|_| tonic::Status::invalid_argument("invalid neon-tenant-id metadata"))?;
let timeline_id =
metadata
.get("neon-timeline-id")
.ok_or(tonic::Status::invalid_argument(
"neon-timeline-id metadata missing",
))?;
let timeline_id = timeline_id.to_str().map_err(|_| {
tonic::Status::invalid_argument("invalid UTF-8 characters in neon-timeline-id metadata")
})?;
let timeline_id = TimelineId::from_str(timeline_id)
.map_err(|_| tonic::Status::invalid_argument("invalid neon-timelineid metadata"))?;
Ok(TenantTimelineId::new(tenant_id, timeline_id))
}
// XXX: copied from PageServerHandler
async fn wait_or_get_last_lsn(
timeline: &Timeline,
request_lsn: Lsn,
not_modified_since: Lsn,
latest_gc_cutoff_lsn: &RcuReadGuard<Lsn>,
ctx: &RequestContext,
) -> Result<Lsn, tonic::Status> {
let last_record_lsn = timeline.get_last_record_lsn();
// Sanity check the request
if request_lsn < not_modified_since {
return Err(tonic::Status::invalid_argument(format!(
"invalid request with request LSN {} and not_modified_since {}",
request_lsn, not_modified_since,
)));
}
// Check explicitly for INVALID just to get a less scary error message if the request is obviously bogus
if request_lsn == Lsn::INVALID {
return Err(tonic::Status::invalid_argument("invalid LSN(0) in request"));
}
// Clients should only read from recent LSNs on their timeline, or from locations holding an LSN lease.
//
// We may have older data available, but we make a best effort to detect this case and return an error,
// to distinguish a misbehaving client (asking for old LSN) from a storage issue (data missing at a legitimate LSN).
if request_lsn < **latest_gc_cutoff_lsn && !timeline.is_gc_blocked_by_lsn_lease_deadline() {
let gc_info = &timeline.gc_info.read().unwrap();
if !gc_info.lsn_covered_by_lease(request_lsn) {
return Err(tonic::Status::not_found(format!(
"tried to request a page version that was garbage collected. requested at {} gc cutoff {}",
request_lsn, **latest_gc_cutoff_lsn
)));
}
}
// Wait for WAL up to 'not_modified_since' to arrive, if necessary
if not_modified_since > last_record_lsn {
timeline
.wait_lsn(
not_modified_since,
crate::tenant::timeline::WaitLsnWaiter::PageService,
WaitLsnTimeout::Default,
ctx,
)
.await
.map_err(|_| {
tonic::Status::unavailable("not_modified_since LSN not arrived yet")
})?;
// Since we waited for 'not_modified_since' to arrive, that is now the last
// record LSN. (Or close enough for our purposes; the last-record LSN can
// advance immediately after we return anyway)
Ok(not_modified_since)
} else {
// It might be better to use max(not_modified_since, latest_gc_cutoff_lsn)
// here instead. That would give the same result, since we know that there
// haven't been any modifications since 'not_modified_since'. Using an older
// LSN might be faster, because that could allow skipping recent layers when
// finding the page. However, we have historically used 'last_record_lsn', so
// stick to that for now.
Ok(std::cmp::min(last_record_lsn, request_lsn))
}
}
}
#[derive(Clone)]
pub struct PageServiceAuthenticator {
pub auth: Option<Arc<SwappableJwtAuth>>,
pub auth_type: AuthType,
}
impl tonic::service::Interceptor for PageServiceAuthenticator {
fn call(
&mut self,
req: tonic::Request<()>,
) -> std::result::Result<tonic::Request<()>, tonic::Status> {
// Check the tenant_id in any case
let tenant_id =
req.metadata()
.get("neon-tenant-id")
.ok_or(tonic::Status::invalid_argument(
"neon-tenant-id metadata missing",
))?;
let tenant_id = tenant_id.to_str().map_err(|_| {
tonic::Status::invalid_argument("invalid UTF-8 characters in neon-tenant-id metadata")
})?;
let tenant_id = TenantId::from_str(tenant_id)
.map_err(|_| tonic::Status::invalid_argument("invalid neon-tenant-id metadata"))?;
// when accessing management api supply None as an argument
// when using to authorize tenant pass corresponding tenant id
let auth = if let Some(auth) = &self.auth {
auth
} else {
// auth is set to Trust, nothing to check so just return ok
return Ok(req);
};
let jwt = req
.metadata()
.get("neon-auth-token")
.ok_or(tonic::Status::unauthenticated("no neon-auth-token"))?;
let jwt = jwt.to_str().map_err(|_| {
tonic::Status::invalid_argument("invalid UTF-8 characters in neon-auth-token metadata")
})?;
let jwtdata: TokenData<utils::auth::Claims> = auth
.decode(jwt)
.map_err(|err| tonic::Status::unauthenticated(format!("invalid JWT token: {}", err)))?;
let claims = jwtdata.claims;
if matches!(claims.scope, utils::auth::Scope::Tenant) && claims.tenant_id.is_none() {
return Err(tonic::Status::unauthenticated(
"jwt token scope is Tenant, but tenant id is missing",
));
}
debug!(
"jwt scope check succeeded for scope: {:#?} by tenant id: {:?}",
claims.scope, claims.tenant_id,
);
// The token is valid. Check if it's allowed to access the tenant ID
// given in the request.
check_permission(&claims, Some(tenant_id))
.map_err(|err| tonic::Status::permission_denied(err.to_string()))?;
// All checks out
Ok(req)
}
}
/// Stream of GetBaseBackupResponseChunk messages.
///
/// The first part of the Chain chunks the tarball. The second part checks the return value
/// of the send_basebackup_tarball Future that created the tarball.
type GetBaseBackupStream = futures::stream::Chain<BasebackupChunkedStream, CheckResultStream>;
fn new_basebackup_response_stream(
simplex_read: ReadHalf<SimplexStream>,
basebackup_task: JoinHandle<Result<(), BasebackupError>>,
) -> GetBaseBackupStream {
let framed = FramedRead::new(simplex_read, GetBaseBackupResponseDecoder {});
framed.chain(CheckResultStream { basebackup_task })
}
/// Stream that uses GetBaseBackupResponseDecoder
type BasebackupChunkedStream =
tokio_util::codec::FramedRead<ReadHalf<SimplexStream>, GetBaseBackupResponseDecoder>;
struct GetBaseBackupResponseDecoder;
impl Decoder for GetBaseBackupResponseDecoder {
type Item = proto::GetBaseBackupResponseChunk;
type Error = tonic::Status;
fn decode(&mut self, src: &mut BytesMut) -> Result<Option<Self::Item>, Self::Error> {
if src.len() < 64 * 1024 {
return Ok(None);
}
let item = proto::GetBaseBackupResponseChunk {
chunk: bytes::Bytes::from(std::mem::take(src)),
};
Ok(Some(item))
}
fn decode_eof(&mut self, src: &mut BytesMut) -> Result<Option<Self::Item>, Self::Error> {
if src.is_empty() {
return Ok(None);
}
let item = proto::GetBaseBackupResponseChunk {
chunk: bytes::Bytes::from(std::mem::take(src)),
};
Ok(Some(item))
}
}
struct CheckResultStream {
basebackup_task: tokio::task::JoinHandle<Result<(), BasebackupError>>,
}
impl futures::Stream for CheckResultStream {
type Item = Result<proto::GetBaseBackupResponseChunk, tonic::Status>;
fn poll_next(
mut self: Pin<&mut Self>,
ctx: &mut std::task::Context<'_>,
) -> Poll<Option<Self::Item>> {
let task = Pin::new(&mut self.basebackup_task);
match task.poll(ctx) {
Poll::Pending => Poll::Pending,
Poll::Ready(Ok(Ok(()))) => Poll::Ready(None),
Poll::Ready(Ok(Err(basebackup_err))) => {
error!(error=%basebackup_err, "error getting basebackup");
Poll::Ready(Some(Err(tonic::Status::internal(
"could not get basebackup",
))))
}
Poll::Ready(Err(join_err)) => {
error!(error=%join_err, "JoinError getting basebackup");
Poll::Ready(Some(Err(tonic::Status::internal(
"could not get basebackup",
))))
}
}
}
}

View File

@@ -21,6 +21,8 @@ pub use pageserver_api::keyspace;
use tokio_util::sync::CancellationToken;
mod assert_u64_eq_usize;
pub mod aux_file;
pub mod compute_service;
pub mod compute_service_grpc;
pub mod metrics;
pub mod page_cache;
pub mod page_service;
@@ -82,7 +84,7 @@ impl CancellableTask {
pub async fn shutdown_pageserver(
http_listener: HttpEndpointListener,
https_listener: Option<HttpsEndpointListener>,
page_service: page_service::Listener,
compute_service: compute_service::Listener,
consumption_metrics_worker: ConsumptionMetricsTasks,
disk_usage_eviction_task: Option<DiskUsageEvictionTask>,
tenant_manager: &TenantManager,
@@ -167,11 +169,11 @@ pub async fn shutdown_pageserver(
}
});
// Shut down the libpq endpoint task. This prevents new connections from
// Shut down the compute service endpoint task. This prevents new connections from
// being accepted.
let remaining_connections = timed(
page_service.stop_accepting(),
"shutdown LibpqEndpointListener",
compute_service.stop_accepting(),
"shutdown compte service listener",
Duration::from_secs(1),
)
.await;

View File

@@ -1774,8 +1774,12 @@ static SMGR_QUERY_STARTED_PER_TENANT_TIMELINE: Lazy<IntCounterVec> = Lazy::new(|
.expect("failed to define a metric")
});
// Alias so all histograms recording per-timeline smgr timings use the same buckets.
static SMGR_QUERY_TIME_PER_TENANT_TIMELINE_BUCKETS: &[f64] = CRITICAL_OP_BUCKETS;
/// Per-timeline smgr histogram buckets should be the same as the compute buckets, such that the
/// metrics are comparable across compute and Pageserver. See also:
/// <https://github.com/neondatabase/neon/blob/1a87975d956a8ad17ec8b85da32a137ec4893fcc/pgxn/neon/neon_perf_counters.h#L18-L27>
/// <https://github.com/neondatabase/flux-fleet/blob/556182a939edda87ff1d85a6b02e5cec901e0e9e/apps/base/compute-metrics/scrape-compute-sql-exporter.yaml#L29-L35>
static SMGR_QUERY_TIME_PER_TENANT_TIMELINE_BUCKETS: &[f64] =
&[0.0006, 0.001, 0.003, 0.006, 0.01, 0.03, 0.1, 1.0, 3.0];
static SMGR_QUERY_TIME_PER_TENANT_TIMELINE: Lazy<HistogramVec> = Lazy::new(|| {
register_histogram_vec!(

View File

@@ -13,7 +13,6 @@ use crate::PERF_TRACE_TARGET;
use anyhow::{Context, bail};
use async_compression::tokio::write::GzipEncoder;
use bytes::Buf;
use futures::FutureExt;
use itertools::Itertools;
use jsonwebtoken::TokenData;
use once_cell::sync::OnceCell;
@@ -40,7 +39,6 @@ use pq_proto::framed::ConnectionError;
use pq_proto::{BeMessage, FeMessage, FeStartupPacket, RowDescriptor};
use strum_macros::IntoStaticStr;
use tokio::io::{AsyncRead, AsyncWrite, AsyncWriteExt, BufWriter};
use tokio::task::JoinHandle;
use tokio_util::sync::CancellationToken;
use tracing::*;
use utils::auth::{Claims, Scope, SwappableJwtAuth};
@@ -49,15 +47,13 @@ use utils::id::{TenantId, TimelineId};
use utils::logging::log_slow;
use utils::lsn::Lsn;
use utils::simple_rcu::RcuReadGuard;
use utils::sync::gate::{Gate, GateGuard};
use utils::sync::gate::GateGuard;
use utils::sync::spsc_fold;
use crate::auth::check_permission;
use crate::basebackup::BasebackupError;
use crate::config::PageServerConf;
use crate::context::{
DownloadBehavior, PerfInstrumentFutureExt, RequestContext, RequestContextBuilder,
};
use crate::context::{PerfInstrumentFutureExt, RequestContext, RequestContextBuilder};
use crate::metrics::{
self, COMPUTE_COMMANDS_COUNTERS, ComputeCommandKind, GetPageBatchBreakReason, LIVE_CONNECTIONS,
SmgrOpTimer, TimelineMetrics,
@@ -67,7 +63,6 @@ use crate::span::{
debug_assert_current_span_has_tenant_and_timeline_id,
debug_assert_current_span_has_tenant_and_timeline_id_no_shard_id,
};
use crate::task_mgr::{self, COMPUTE_REQUEST_RUNTIME, TaskKind};
use crate::tenant::mgr::{
GetActiveTenantError, GetTenantError, ShardResolveResult, ShardSelector, TenantManager,
};
@@ -85,171 +80,6 @@ const ACTIVE_TENANT_TIMEOUT: Duration = Duration::from_millis(30000);
/// Threshold at which to log slow GetPage requests.
const LOG_SLOW_GETPAGE_THRESHOLD: Duration = Duration::from_secs(30);
///////////////////////////////////////////////////////////////////////////////
pub struct Listener {
cancel: CancellationToken,
/// Cancel the listener task through `listen_cancel` to shut down the listener
/// and get a handle on the existing connections.
task: JoinHandle<Connections>,
}
pub struct Connections {
cancel: CancellationToken,
tasks: tokio::task::JoinSet<ConnectionHandlerResult>,
gate: Gate,
}
pub fn spawn(
conf: &'static PageServerConf,
tenant_manager: Arc<TenantManager>,
pg_auth: Option<Arc<SwappableJwtAuth>>,
perf_trace_dispatch: Option<Dispatch>,
tcp_listener: tokio::net::TcpListener,
tls_config: Option<Arc<rustls::ServerConfig>>,
) -> Listener {
let cancel = CancellationToken::new();
let libpq_ctx = RequestContext::todo_child(
TaskKind::LibpqEndpointListener,
// listener task shouldn't need to download anything. (We will
// create a separate sub-contexts for each connection, with their
// own download behavior. This context is used only to listen and
// accept connections.)
DownloadBehavior::Error,
);
let task = COMPUTE_REQUEST_RUNTIME.spawn(task_mgr::exit_on_panic_or_error(
"libpq listener",
libpq_listener_main(
conf,
tenant_manager,
pg_auth,
perf_trace_dispatch,
tcp_listener,
conf.pg_auth_type,
tls_config,
conf.page_service_pipelining.clone(),
libpq_ctx,
cancel.clone(),
)
.map(anyhow::Ok),
));
Listener { cancel, task }
}
impl Listener {
pub async fn stop_accepting(self) -> Connections {
self.cancel.cancel();
self.task
.await
.expect("unreachable: we wrap the listener task in task_mgr::exit_on_panic_or_error")
}
}
impl Connections {
pub(crate) async fn shutdown(self) {
let Self {
cancel,
mut tasks,
gate,
} = self;
cancel.cancel();
while let Some(res) = tasks.join_next().await {
Self::handle_connection_completion(res);
}
gate.close().await;
}
fn handle_connection_completion(res: Result<anyhow::Result<()>, tokio::task::JoinError>) {
match res {
Ok(Ok(())) => {}
Ok(Err(e)) => error!("error in page_service connection task: {:?}", e),
Err(e) => error!("page_service connection task panicked: {:?}", e),
}
}
}
///
/// Main loop of the page service.
///
/// Listens for connections, and launches a new handler task for each.
///
/// Returns Ok(()) upon cancellation via `cancel`, returning the set of
/// open connections.
///
#[allow(clippy::too_many_arguments)]
pub async fn libpq_listener_main(
conf: &'static PageServerConf,
tenant_manager: Arc<TenantManager>,
auth: Option<Arc<SwappableJwtAuth>>,
perf_trace_dispatch: Option<Dispatch>,
listener: tokio::net::TcpListener,
auth_type: AuthType,
tls_config: Option<Arc<rustls::ServerConfig>>,
pipelining_config: PageServicePipeliningConfig,
listener_ctx: RequestContext,
listener_cancel: CancellationToken,
) -> Connections {
let connections_cancel = CancellationToken::new();
let connections_gate = Gate::default();
let mut connection_handler_tasks = tokio::task::JoinSet::default();
loop {
let gate_guard = match connections_gate.enter() {
Ok(guard) => guard,
Err(_) => break,
};
let accepted = tokio::select! {
biased;
_ = listener_cancel.cancelled() => break,
next = connection_handler_tasks.join_next(), if !connection_handler_tasks.is_empty() => {
let res = next.expect("we dont poll while empty");
Connections::handle_connection_completion(res);
continue;
}
accepted = listener.accept() => accepted,
};
match accepted {
Ok((socket, peer_addr)) => {
// Connection established. Spawn a new task to handle it.
debug!("accepted connection from {}", peer_addr);
let local_auth = auth.clone();
let connection_ctx = RequestContextBuilder::from(&listener_ctx)
.task_kind(TaskKind::PageRequestHandler)
.download_behavior(DownloadBehavior::Download)
.perf_span_dispatch(perf_trace_dispatch.clone())
.detached_child();
connection_handler_tasks.spawn(page_service_conn_main(
conf,
tenant_manager.clone(),
local_auth,
socket,
auth_type,
tls_config.clone(),
pipelining_config.clone(),
connection_ctx,
connections_cancel.child_token(),
gate_guard,
));
}
Err(err) => {
// accept() failed. Log the error, and loop back to retry on next connection.
error!("accept() failed: {:?}", err);
}
}
}
debug!("page_service listener loop terminated");
Connections {
cancel: connections_cancel,
tasks: connection_handler_tasks,
gate: connections_gate,
}
}
type ConnectionHandlerResult = anyhow::Result<()>;
/// Perf root spans start at the per-request level, after shard routing.
@@ -261,9 +91,10 @@ struct ConnectionPerfSpanFields {
compute_mode: Option<String>,
}
/// note: the caller has already set TCP_NODELAY on the socket
#[instrument(skip_all, fields(peer_addr, application_name, compute_mode))]
#[allow(clippy::too_many_arguments)]
async fn page_service_conn_main(
pub async fn libpq_page_service_conn_main(
conf: &'static PageServerConf,
tenant_manager: Arc<TenantManager>,
auth: Option<Arc<SwappableJwtAuth>>,
@@ -279,10 +110,6 @@ async fn page_service_conn_main(
.with_label_values(&["page_service"])
.guard();
socket
.set_nodelay(true)
.context("could not set TCP_NODELAY")?;
let socket_fd = socket.as_raw_fd();
let peer_addr = socket.peer_addr().context("get peer address")?;
@@ -393,7 +220,7 @@ struct PageServerHandler {
gate_guard: GateGuard,
}
struct TimelineHandles {
pub struct TimelineHandles {
wrapper: TenantManagerWrapper,
/// Note on size: the typical size of this map is 1. The largest size we expect
/// to see is the number of shards divided by the number of pageservers (typically < 2),

View File

@@ -1,10 +1,10 @@
# pgxs/neon/Makefile
MODULE_big = neon
OBJS = \
$(WIN32RES) \
communicator.o \
communicator_new.o \
extension_server.o \
file_cache.o \
hll.o \
@@ -22,7 +22,8 @@ OBJS = \
walproposer.o \
walproposer_pg.o \
control_plane_connector.o \
walsender_hooks.o
walsender_hooks.o \
$(LIBCOMMUNICATOR_PATH)/libcommunicator.a
PG_CPPFLAGS = -I$(libpq_srcdir)
SHLIB_LINK_INTERNAL = $(libpq)

View File

@@ -88,9 +88,6 @@ typedef PGAlignedBlock PGIOAlignedBlock;
page_server_api *page_server;
static uint32 local_request_counter;
#define GENERATE_REQUEST_ID() (((NeonRequestId)MyProcPid << 32) | ++local_request_counter)
/*
* Various settings related to prompt (fast) handling of PageStream responses
* at any CHECK_FOR_INTERRUPTS point.
@@ -788,6 +785,27 @@ prefetch_read(PrefetchRequest *slot)
}
}
/*
* Wait completion of previosly registered prefetch request.
* Prefetch result should be placed in LFC by prefetch_wait_for.
*/
bool
communicator_prefetch_receive(BufferTag tag)
{
PrfHashEntry *entry;
PrefetchRequest hashkey;
hashkey.buftag = tag;
entry = prfh_lookup(MyPState->prf_hash, &hashkey);
if (entry != NULL && prefetch_wait_for(entry->slot->my_ring_index))
{
prefetch_set_unused(entry->slot->my_ring_index);
return true;
}
return false;
}
/*
* Disconnect hook - drop prefetches when the connection drops
*
@@ -906,7 +924,6 @@ prefetch_do_request(PrefetchRequest *slot, neon_request_lsns *force_request_lsns
NeonGetPageRequest request = {
.hdr.tag = T_NeonGetPageRequest,
.hdr.reqid = GENERATE_REQUEST_ID(),
/* lsn and not_modified_since are filled in below */
.rinfo = BufTagGetNRelFileInfo(slot->buftag),
.forknum = slot->buftag.forkNum,
@@ -915,8 +932,6 @@ prefetch_do_request(PrefetchRequest *slot, neon_request_lsns *force_request_lsns
Assert(mySlotNo == MyPState->ring_unused);
slot->reqid = request.hdr.reqid;
if (force_request_lsns)
slot->request_lsns = *force_request_lsns;
else
@@ -934,6 +949,7 @@ prefetch_do_request(PrefetchRequest *slot, neon_request_lsns *force_request_lsns
Assert(mySlotNo == MyPState->ring_unused);
/* loop */
}
slot->reqid = request.hdr.reqid;
/* update prefetch state */
MyPState->n_requests_inflight += 1;
@@ -1937,7 +1953,6 @@ communicator_exists(NRelFileInfo rinfo, ForkNumber forkNum, neon_request_lsns *r
{
NeonExistsRequest request = {
.hdr.tag = T_NeonExistsRequest,
.hdr.reqid = GENERATE_REQUEST_ID(),
.hdr.lsn = request_lsns->request_lsn,
.hdr.not_modified_since = request_lsns->not_modified_since,
.rinfo = rinfo,
@@ -2212,7 +2227,6 @@ communicator_nblocks(NRelFileInfo rinfo, ForkNumber forknum, neon_request_lsns *
{
NeonNblocksRequest request = {
.hdr.tag = T_NeonNblocksRequest,
.hdr.reqid = GENERATE_REQUEST_ID(),
.hdr.lsn = request_lsns->request_lsn,
.hdr.not_modified_since = request_lsns->not_modified_since,
.rinfo = rinfo,
@@ -2285,7 +2299,6 @@ communicator_dbsize(Oid dbNode, neon_request_lsns *request_lsns)
{
NeonDbSizeRequest request = {
.hdr.tag = T_NeonDbSizeRequest,
.hdr.reqid = GENERATE_REQUEST_ID(),
.hdr.lsn = request_lsns->request_lsn,
.hdr.not_modified_since = request_lsns->not_modified_since,
.dbNode = dbNode,
@@ -2353,7 +2366,6 @@ communicator_read_slru_segment(SlruKind kind, int64 segno, neon_request_lsns *re
request = (NeonGetSlruSegmentRequest) {
.hdr.tag = T_NeonGetSlruSegmentRequest,
.hdr.reqid = GENERATE_REQUEST_ID(),
.hdr.lsn = request_lsns->request_lsn,
.hdr.not_modified_since = request_lsns->not_modified_since,
.kind = kind,

View File

@@ -37,6 +37,8 @@ extern int communicator_prefetch_lookupv(NRelFileInfo rinfo, ForkNumber forknum,
BlockNumber nblocks, void **buffers, bits8 *mask);
extern void communicator_prefetch_register_bufferv(BufferTag tag, neon_request_lsns *frlsns,
BlockNumber nblocks, const bits8 *mask);
extern bool communicator_prefetch_receive(BufferTag tag);
extern int communicator_read_slru_segment(SlruKind kind, int64 segno,
neon_request_lsns *request_lsns,
void *buffer);

372
pgxn/neon/communicator/Cargo.lock generated Normal file
View File

@@ -0,0 +1,372 @@
# This file is automatically @generated by Cargo.
# It is not intended for manual editing.
version = 4
[[package]]
name = "addr2line"
version = "0.24.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "dfbe277e56a376000877090da837660b4427aad530e3028d44e0bffe4f89a1c1"
dependencies = [
"gimli",
]
[[package]]
name = "adler2"
version = "2.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "512761e0bb2578dd7380c6baaa0f4ce03e84f95e960231d1dec8bf4d7d6e2627"
[[package]]
name = "backtrace"
version = "0.3.74"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8d82cb332cdfaed17ae235a638438ac4d4839913cc2af585c3c6746e8f8bee1a"
dependencies = [
"addr2line",
"cfg-if",
"libc",
"miniz_oxide",
"object",
"rustc-demangle",
"windows-targets",
]
[[package]]
name = "base64"
version = "0.22.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "72b3254f16251a8381aa12e40e3c4d2f0199f8c6508fbecb9d91f575e0fbb8c6"
[[package]]
name = "bytes"
version = "1.10.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d71b6127be86fdcfddb610f7182ac57211d4b18a3e9c82eb2d17662f2227ad6a"
[[package]]
name = "cfg-if"
version = "1.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "baf1de4339761588bc0619e3cbc0120ee582ebb74b53b4efbf79117bd2da40fd"
[[package]]
name = "communicator"
version = "0.1.0"
dependencies = [
"tonic",
]
[[package]]
name = "fnv"
version = "1.0.7"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3f9eec918d3f24069decb9af1554cad7c880e2da24a9afd88aca000531ab82c1"
[[package]]
name = "futures-core"
version = "0.3.31"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "05f29059c0c2090612e8d742178b0580d2dc940c837851ad723096f87af6663e"
[[package]]
name = "gimli"
version = "0.31.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "07e28edb80900c19c28f1072f2e8aeca7fa06b23cd4169cefe1af5aa3260783f"
[[package]]
name = "http"
version = "1.3.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f4a85d31aea989eead29a3aaf9e1115a180df8282431156e533de47660892565"
dependencies = [
"bytes",
"fnv",
"itoa",
]
[[package]]
name = "http-body"
version = "1.0.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1efedce1fb8e6913f23e0c92de8e62cd5b772a67e7b3946df930a62566c93184"
dependencies = [
"bytes",
"http",
]
[[package]]
name = "http-body-util"
version = "0.1.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b021d93e26becf5dc7e1b75b1bed1fd93124b374ceb73f43d4d4eafec896a64a"
dependencies = [
"bytes",
"futures-core",
"http",
"http-body",
"pin-project-lite",
]
[[package]]
name = "itoa"
version = "1.0.15"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "4a5f13b858c8d314ee3e8f639011f7ccefe71f97f96e50151fb991f267928e2c"
[[package]]
name = "libc"
version = "0.2.171"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c19937216e9d3aa9956d9bb8dfc0b0c8beb6058fc4f7a4dc4d850edf86a237d6"
[[package]]
name = "memchr"
version = "2.7.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "78ca9ab1a0babb1e7d5695e3530886289c18cf2f87ec19a575a0abdce112e3a3"
[[package]]
name = "miniz_oxide"
version = "0.8.7"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ff70ce3e48ae43fa075863cef62e8b43b71a4f2382229920e0df362592919430"
dependencies = [
"adler2",
]
[[package]]
name = "object"
version = "0.36.7"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "62948e14d923ea95ea2c7c86c71013138b66525b86bdc08d2dcc262bdb497b87"
dependencies = [
"memchr",
]
[[package]]
name = "once_cell"
version = "1.21.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "42f5e15c9953c5e4ccceeb2e7382a716482c34515315f7b03532b8b4e8393d2d"
[[package]]
name = "percent-encoding"
version = "2.3.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e3148f5046208a5d56bcfc03053e3ca6334e51da8dfb19b6cdc8b306fae3283e"
[[package]]
name = "pin-project"
version = "1.1.10"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "677f1add503faace112b9f1373e43e9e054bfdd22ff1a63c1bc485eaec6a6a8a"
dependencies = [
"pin-project-internal",
]
[[package]]
name = "pin-project-internal"
version = "1.1.10"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6e918e4ff8c4549eb882f14b3a4bc8c8bc93de829416eacf579f1207a8fbf861"
dependencies = [
"proc-macro2",
"quote",
"syn",
]
[[package]]
name = "pin-project-lite"
version = "0.2.16"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3b3cff922bd51709b605d9ead9aa71031d81447142d828eb4a6eba76fe619f9b"
[[package]]
name = "proc-macro2"
version = "1.0.94"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a31971752e70b8b2686d7e46ec17fb38dad4051d94024c88df49b667caea9c84"
dependencies = [
"unicode-ident",
]
[[package]]
name = "quote"
version = "1.0.40"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1885c039570dc00dcb4ff087a89e185fd56bae234ddc7f056a945bf36467248d"
dependencies = [
"proc-macro2",
]
[[package]]
name = "rustc-demangle"
version = "0.1.24"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "719b953e2095829ee67db738b3bfa9fa368c94900df327b3f07fe6e794d2fe1f"
[[package]]
name = "syn"
version = "2.0.100"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b09a44accad81e1ba1cd74a32461ba89dee89095ba17b32f5d03683b1b1fc2a0"
dependencies = [
"proc-macro2",
"quote",
"unicode-ident",
]
[[package]]
name = "tokio"
version = "1.44.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e6b88822cbe49de4185e3a4cbf8321dd487cf5fe0c5c65695fef6346371e9c48"
dependencies = [
"backtrace",
"pin-project-lite",
]
[[package]]
name = "tokio-stream"
version = "0.1.17"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "eca58d7bba4a75707817a2c44174253f9236b2d5fbd055602e9d5c07c139a047"
dependencies = [
"futures-core",
"pin-project-lite",
"tokio",
]
[[package]]
name = "tonic"
version = "0.13.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "85839f0b32fd242bb3209262371d07feda6d780d16ee9d2bc88581b89da1549b"
dependencies = [
"base64",
"bytes",
"http",
"http-body",
"http-body-util",
"percent-encoding",
"pin-project",
"tokio-stream",
"tower-layer",
"tower-service",
"tracing",
]
[[package]]
name = "tower-layer"
version = "0.3.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "121c2a6cda46980bb0fcd1647ffaf6cd3fc79a013de288782836f6df9c48780e"
[[package]]
name = "tower-service"
version = "0.3.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8df9b6e13f2d32c91b9bd719c00d1958837bc7dec474d94952798cc8e69eeec3"
[[package]]
name = "tracing"
version = "0.1.41"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "784e0ac535deb450455cbfa28a6f0df145ea1bb7ae51b821cf5e7927fdcfbdd0"
dependencies = [
"pin-project-lite",
"tracing-attributes",
"tracing-core",
]
[[package]]
name = "tracing-attributes"
version = "0.1.28"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "395ae124c09f9e6918a2310af6038fba074bcf474ac352496d5910dd59a2226d"
dependencies = [
"proc-macro2",
"quote",
"syn",
]
[[package]]
name = "tracing-core"
version = "0.1.33"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e672c95779cf947c5311f83787af4fa8fffd12fb27e4993211a84bdfd9610f9c"
dependencies = [
"once_cell",
]
[[package]]
name = "unicode-ident"
version = "1.0.18"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5a5f39404a5da50712a4c1eecf25e90dd62b613502b7e925fd4e4d19b5c96512"
[[package]]
name = "windows-targets"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9b724f72796e036ab90c1021d4780d4d3d648aca59e491e6b98e725b84e99973"
dependencies = [
"windows_aarch64_gnullvm",
"windows_aarch64_msvc",
"windows_i686_gnu",
"windows_i686_gnullvm",
"windows_i686_msvc",
"windows_x86_64_gnu",
"windows_x86_64_gnullvm",
"windows_x86_64_msvc",
]
[[package]]
name = "windows_aarch64_gnullvm"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "32a4622180e7a0ec044bb555404c800bc9fd9ec262ec147edd5989ccd0c02cd3"
[[package]]
name = "windows_aarch64_msvc"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "09ec2a7bb152e2252b53fa7803150007879548bc709c039df7627cabbd05d469"
[[package]]
name = "windows_i686_gnu"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8e9b5ad5ab802e97eb8e295ac6720e509ee4c243f69d781394014ebfe8bbfa0b"
[[package]]
name = "windows_i686_gnullvm"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0eee52d38c090b3caa76c563b86c3a4bd71ef1a819287c19d586d7334ae8ed66"
[[package]]
name = "windows_i686_msvc"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "240948bc05c5e7c6dabba28bf89d89ffce3e303022809e73deaefe4f6ec56c66"
[[package]]
name = "windows_x86_64_gnu"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "147a5c80aabfbf0c7d901cb5895d1de30ef2907eb21fbbab29ca94c5b08b1a78"
[[package]]
name = "windows_x86_64_gnullvm"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "24d5b23dc417412679681396f2b49f3de8c1473deb516bd34410872eff51ed0d"
[[package]]
name = "windows_x86_64_msvc"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "589f6da84c646204747d1270a2a5661ea66ed1cced2631d546fdfb155959f9ec"

View File

@@ -0,0 +1,35 @@
[package]
name = "communicator"
version = "0.1.0"
edition = "2024"
[lib]
crate-type = ["staticlib"]
[dependencies]
bytes.workspace = true
http.workspace = true
libc.workspace = true
nix.workspace = true
atomic_enum = "0.3.0"
prost.workspace = true
tonic = { version = "0.12.0", default-features = false, features=["codegen", "prost", "transport"] }
tokio = { version = "1.43.1", features = ["macros", "net", "io-util", "rt", "rt-multi-thread"] }
tokio-pipe = { version = "0.2.12" }
thiserror.workspace = true
tracing.workspace = true
tracing-subscriber.workspace = true
zerocopy = "0.8.0"
zerocopy-derive = "0.8.0"
tokio-epoll-uring.workspace = true
uring-common.workspace = true
pageserver_client_grpc.workspace = true
pageserver_data_api.workspace = true
neonart.workspace = true
utils.workspace = true
[build-dependencies]
cbindgen.workspace = true

View File

@@ -0,0 +1,123 @@
# Communicator
This package provides the so-called "compute-pageserver communicator",
or just "communicator" in short. It runs in a PostgreSQL server, as
part of the neon extension, and handles the communication with the
pageservers. On the PostgreSQL side, the glue code in pgxn/neon/ uses
the communicator to implement the PostgreSQL Storage Manager (SMGR)
interface.
## Design criteria
- Low latency
- Saturate a 10 Gbit / s network interface without becoming a bottleneck
## Source code view
pgxn/neon/communicator_new.c
Contains the glue that interact with PostgreSQL code and the Rust
communicator code.
pgxn/neon/communicator/src/backend_interface.rs
The entry point for calls from each backend.
pgxn/neon/communicator/src/init.rs
Initialization at server startup
pgxn/neon/communicator/src/worker_process/
Worker process main loop and glue code
At compilation time, pgxn/neon/communicator/ produces a static
library, libcommunicator.a. It is linked to the neon.so extension
library.
The real networking code, which is independent of PostgreSQL, is in
the pageserver/client_grpc crate.
## Process view
The communicator runs in a dedicated background worker process, the
"communicator process". The communicator uses a multi-threaded Tokio
runtime to execute the IO requests. So the communicator process has
multiple threads running. That's unusual for Postgres processes and
care must be taken to make that work.
### Backend <-> worker communication
Each backend has a number of I/O request slots in shared memory. The
slots are statically allocated for each backend, and must not be
accessed by other backends. The worker process reads requests from the
shared memory slots, and writes responses back to the slots.
To submit an IO request, first pick one of your backend's free slots,
and write the details of the IO request in the slot. Finally, update
the 'state' field of the slot to Submitted. That informs the worker
process that it can start processing the request. Once the state has
been set to Submitted, the backend *must not* access the slot anymore,
until the worker process sets its state to 'Completed'. In other
words, each slot is owned by either the backend or the worker process
at all times, and the 'state' field indicates who has ownership at the
moment.
To inform the worker process that a request slot has a pending IO
request, there's a pipe shared by the worker process and all backend
processes. After you have changed the slot's state to Submitted, write
the index of the request slot to the pipe. This wakes up the worker
process.
(Note that the pipe is just used for wakeups, but the worker process
is free to pick up Submitted IO requests even without receiving the
wakeup. As of this writing, it doesn't do that, but it might be useful
in the future to reduce latency even further, for example.)
When the worker process has completed processing the request, it
writes the result back in the request slot. A GetPage request can also
contain a pointer to buffer in the shared buffer cache. In that case,
the worker process writes the resulting page contents directly to the
buffer, and just a result code in the request slot. It then updates
the 'state' field to Completed, which passes the owner ship back to
the originating backend. Finally, it signals the process Latch of the
originating backend, waking it up.
### Differences between PostgreSQL v16, v17 and v18
PostgreSQL v18 introduced the new AIO mechanism. The PostgreSQL AIO
mechanism uses a very similar mechanism as described in the previous
section, for the communication between AIO worker processes and
backends. With our communicator, the AIO worker processes are not
used, but we use the same PgAioHandle request slots as in upstream.
For Neon-specific IO requests like GetDbSize, a neon request slot is
used. But for the actual IO requests, the request slot merely contains
a pointer to the PgAioHandle slot. The worker process updates the
status of that, calls the IO callbacks upon completionetc, just like
the upstream AIO worker processes do.
## Sequence diagram
neon
PostgreSQL extension backend_interface.rs worker_process.rs processor tonic
| . . . .
| smgr_read() . . . .
+-------------> + . . .
. | . . .
. | rcommunicator_ . . .
. | get_page_at_lsn . . .
. +------------------> + . .
| . .
| write request to . . .
| slot . .
| . .
| . .
| submit_request() . .
+-----------------> + .
| | .
| | db_size_request . .
+---------------->.
. TODO
### Compute <-> pageserver protocol
The protocol between Compute and the pageserver is based on gRPC. See `protos/`.

View File

@@ -0,0 +1,24 @@
use cbindgen;
use std::env;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let crate_dir = env::var("CARGO_MANIFEST_DIR").unwrap();
cbindgen::generate(crate_dir).map_or_else(
|error| match error {
cbindgen::Error::ParseSyntaxError { .. } => {
// This means there was a syntax error in the Rust sources. Don't panic, because
// we want the build to continue and the Rust compiler to hit the error. The
// Rust compiler produces a better error message than cbindgen.
eprintln!("Generating C bindings failed because of a Rust syntax error");
}
e => panic!("Unable to generate C bindings: {:?}", e),
},
|bindings| {
bindings.write_to_file("communicator_bindings.h");
},
);
Ok(())
}

View File

@@ -0,0 +1,4 @@
language = "C"
[enum]
prefix_with_name = true

View File

@@ -0,0 +1,204 @@
//! This module implements a request/response "slot" for submitting requests from backends
//! to the communicator process.
//!
//! NB: The "backend" side of this code runs in Postgres backend processes,
//! which means that it is not safe to use the 'tracing' crate for logging, nor
//! to launch threads or use tokio tasks.
use std::cell::UnsafeCell;
use std::sync::atomic::fence;
use std::sync::atomic::{AtomicI32, Ordering};
use crate::neon_request::{NeonIORequest, NeonIOResult};
use atomic_enum::atomic_enum;
/// One request/response slot. Each backend has its own set of slots that it uses.
///
/// This is the moral equivalent of PgAioHandle for Postgres AIO requests
/// Like PgAioHandle, try to keep this small.
///
/// There is an array of these in shared memory. Therefore, this must be Sized.
///
/// ## Lifecycle of a request
///
/// The slot is always owned by either the backend process or the communicator
/// process, depending on the 'state'. Only the owning process is allowed to
/// read or modify the slot, except for reading the 'state' itself to check who
/// owns it.
///
/// A slot begins in the Idle state, where it is owned by the backend process.
/// To submit a request, the backend process fills the slot with the request
/// data, and changes it to the Submitted state. After changing the state, the
/// slot is owned by the communicator process, and the backend is not allowed
/// to access it until the communicator process marks it as Completed.
///
/// When the communicator process sees that the slot is in Submitted state, it
/// starts to process the request. After processing the request, it stores the
/// result in the slot, and changes the state to Completed. It is now owned by
/// the backend process again, which may now read the result, and reuse the
/// slot for a new request.
///
/// For correctness of the above protocol, we really only need two states:
/// "owned by backend" and "owned by communicator process. But to help with
/// debugging, there are a few more states. When the backend starts to fill in
/// the request details in the slot, it first sets the state from Idle to
/// Filling, and when it's done with that, from Filling to Submitted. In the
/// Filling state, the slot is still owned by the backend. Similarly, when the
/// communicator process starts to process a request, it sets it to Processing
/// state first, but the slot is still owned by the communicator process.
///
/// This struct doesn't handle waking up the communicator process when a request
/// has been submitted or when a response is ready. We only store the 'owner_procno'
/// which can be used for waking up the backend on completion, but the wakeups are
/// performed elsewhere.
pub struct NeonIOHandle {
/// similar to PgAioHandleState
state: AtomicNeonIOHandleState,
/// The owning process's ProcNumber. The worker process uses this to set the process's
/// latch on completion.
///
/// (This could be calculated from num_neon_request_slots_per_backend and the index of
/// this slot in the overall 'neon_requst_slots array')
owner_procno: AtomicI32,
/// SAFETY: This is modified by fill_request(), after it has established ownership
/// of the slot by setting state from Idle to Filling
request: UnsafeCell<NeonIORequest>,
/// valid when state is Completed
///
/// SAFETY: This is modified by RequestProcessingGuard::complete(). There can be
/// only one RequestProcessingGuard outstanding for a slot at a time, because
/// it is returned by start_processing_request() which checks the state, so
/// RequestProcessingGuard has exclusive access to the slot.
result: UnsafeCell<NeonIOResult>,
}
// The protocol described in the "Lifecycle of a request" section above ensures
// the safe access to the fields
unsafe impl Send for NeonIOHandle {}
unsafe impl Sync for NeonIOHandle {}
impl Default for NeonIOHandle {
fn default() -> NeonIOHandle {
NeonIOHandle {
owner_procno: AtomicI32::new(-1),
request: UnsafeCell::new(NeonIORequest::Empty),
result: UnsafeCell::new(NeonIOResult::Empty),
state: AtomicNeonIOHandleState::new(NeonIOHandleState::Idle),
}
}
}
#[atomic_enum]
#[derive(Eq, PartialEq)]
pub enum NeonIOHandleState {
Idle,
/// backend is filling in the request
Filling,
/// Backend has submitted the request to the communicator, but the
/// communicator process has not yet started processing it.
Submitted,
/// Communicator is processing the request
Processing,
/// Communicator has completed the request, and the 'result' field is now
/// valid, but the backend has not read the result yet.
Completed,
}
pub struct RequestProcessingGuard<'a>(&'a NeonIOHandle);
unsafe impl<'a> Send for RequestProcessingGuard<'a> {}
unsafe impl<'a> Sync for RequestProcessingGuard<'a> {}
impl<'a> RequestProcessingGuard<'a> {
pub fn get_request(&self) -> &NeonIORequest {
unsafe { &*self.0.request.get() }
}
pub fn get_owner_procno(&self) -> i32 {
self.0.owner_procno.load(Ordering::Relaxed)
}
pub fn completed(self, result: NeonIOResult) {
unsafe {
*self.0.result.get() = result;
};
// Ok, we have completed the IO. Mark the request as completed. After that,
// we no longer have ownership of the slot, and must not modify it.
let old_state = self
.0
.state
.swap(NeonIOHandleState::Completed, Ordering::Release);
assert!(old_state == NeonIOHandleState::Processing);
}
}
impl NeonIOHandle {
pub fn fill_request(&self, request: &NeonIORequest, proc_number: i32) {
// Verify that the slot is in Idle state previously, and start filling it.
//
// XXX: This step isn't strictly necessary. Assuming the caller didn't screw up
// and try to use a slot that's already in use, we could fill the slot and
// switch it directly from Idle to Submitted state.
if let Err(s) = self.state.compare_exchange(
NeonIOHandleState::Idle,
NeonIOHandleState::Filling,
Ordering::Relaxed,
Ordering::Relaxed,
) {
panic!("unexpected state in request slot: {s:?}");
}
// This fence synchronizes-with store/swap in `communicator_process_main_loop`.
fence(Ordering::Acquire);
self.owner_procno.store(proc_number, Ordering::Relaxed);
unsafe { *self.request.get() = *request }
self.state
.store(NeonIOHandleState::Submitted, Ordering::Release);
}
pub fn try_get_result(&self) -> Option<NeonIOResult> {
// FIXME: ordering?
let state = self.state.load(Ordering::Relaxed);
if state == NeonIOHandleState::Completed {
// This fence synchronizes-with store/swap in `communicator_process_main_loop`.
fence(Ordering::Acquire);
let result = unsafe { *self.result.get() };
self.state.store(NeonIOHandleState::Idle, Ordering::Relaxed);
Some(result)
} else {
None
}
}
pub fn start_processing_request<'a>(&'a self) -> Option<RequestProcessingGuard<'a>> {
// Read the IO request from the slot indicated in the wakeup
//
// XXX: using compare_exchange for this is not strictly necessary, as long as
// the communicator process has _some_ means of tracking which requests it's
// already processing. That could be a flag somewhere in communicator's private
// memory, for example.
if let Err(s) = self.state.compare_exchange(
NeonIOHandleState::Submitted,
NeonIOHandleState::Processing,
Ordering::Relaxed,
Ordering::Relaxed,
) {
// FIXME surprising state. This is unexpected at the moment, but if we
// started to process requests more aggressively, without waiting for the
// read from the pipe, then this could happen
panic!("unexpected state in request slot: {s:?}");
}
fence(Ordering::Acquire);
Some(RequestProcessingGuard(self))
}
}

View File

@@ -0,0 +1,196 @@
//! This code runs in each backend process. That means that launching Rust threads, panicking
//! etc. is forbidden!
use crate::backend_comms::NeonIOHandle;
use crate::init::CommunicatorInitStruct;
use crate::integrated_cache::{BackendCacheReadOp, IntegratedCacheReadAccess};
use crate::neon_request::CCachedGetPageVResult;
use crate::neon_request::{NeonIORequest, NeonIOResult};
pub struct CommunicatorBackendStruct<'t> {
my_proc_number: i32,
next_neon_request_idx: u32,
my_start_idx: u32, // First request slot that belongs to this backend
my_end_idx: u32, // end + 1 request slot that belongs to this backend
neon_request_slots: &'t [NeonIOHandle],
submission_pipe_write_fd: std::ffi::c_int,
pending_cache_read_op: Option<BackendCacheReadOp<'t>>,
integrated_cache: &'t IntegratedCacheReadAccess<'t>,
}
#[unsafe(no_mangle)]
pub extern "C" fn rcommunicator_backend_init(
cis: Box<CommunicatorInitStruct>,
my_proc_number: i32,
) -> &'static mut CommunicatorBackendStruct<'static> {
let start_idx = my_proc_number as u32 * cis.num_neon_request_slots_per_backend;
let end_idx = start_idx + cis.num_neon_request_slots_per_backend;
let integrated_cache = Box::leak(Box::new(cis.integrated_cache_init_struct.backend_init()));
let bs: &'static mut CommunicatorBackendStruct =
Box::leak(Box::new(CommunicatorBackendStruct {
my_proc_number,
next_neon_request_idx: start_idx,
my_start_idx: start_idx,
my_end_idx: end_idx,
neon_request_slots: cis.neon_request_slots,
submission_pipe_write_fd: cis.submission_pipe_write_fd,
pending_cache_read_op: None,
integrated_cache,
}));
bs
}
/// Start a request. You can poll for its completion and get the result by
/// calling bcomm_poll_dbsize_request_completion(). The communicator will wake
/// us up by setting our process latch, so to wait for the completion, wait on
/// the latch and call bcomm_poll_dbsize_request_completion() every time the
/// latch is set.
///
/// Safety: The C caller must ensure that the references are valid.
#[unsafe(no_mangle)]
pub extern "C" fn bcomm_start_io_request<'t>(
bs: &'t mut CommunicatorBackendStruct,
request: &NeonIORequest,
immediate_result_ptr: &mut NeonIOResult,
) -> i32 {
assert!(bs.pending_cache_read_op.is_none());
// Check if the request can be satisfied from the cache first
if let NeonIORequest::RelSize(req) = request {
if let Some(nblocks) = bs.integrated_cache.get_rel_size(&req.reltag()) {
*immediate_result_ptr = NeonIOResult::RelSize(nblocks);
return -1;
}
}
// Create neon request and submit it
let request_idx = bs.start_neon_request(request);
// Tell the communicator about it
bs.submit_request(request_idx);
return request_idx;
}
#[unsafe(no_mangle)]
pub extern "C" fn bcomm_start_get_page_v_request<'t>(
bs: &'t mut CommunicatorBackendStruct,
request: &NeonIORequest,
immediate_result_ptr: &mut CCachedGetPageVResult,
) -> i32 {
let NeonIORequest::GetPageV(get_pagev_request) = request else {
panic!("invalid request passed to bcomm_start_get_page_v_request()");
};
assert!(matches!(request, NeonIORequest::GetPageV(_)));
assert!(bs.pending_cache_read_op.is_none());
// Check if the request can be satisfied from the cache first
let mut all_cached = true;
let read_op = bs.integrated_cache.start_read_op();
for i in 0..get_pagev_request.nblocks {
if let Some(cache_block) = read_op.get_page(
&get_pagev_request.reltag(),
get_pagev_request.block_number + i as u32,
) {
(*immediate_result_ptr).cache_block_numbers[i as usize] = cache_block;
} else {
// not found in cache
all_cached = false;
break;
}
}
if all_cached {
bs.pending_cache_read_op = Some(read_op);
return -1;
}
// Create neon request and submit it
let request_idx = bs.start_neon_request(request);
// Tell the communicator about it
bs.submit_request(request_idx);
return request_idx;
}
/// Check if a request has completed. Returns:
///
/// -1 if the request is still being processed
/// 0 on success
#[unsafe(no_mangle)]
pub extern "C" fn bcomm_poll_request_completion(
bs: &mut CommunicatorBackendStruct,
request_idx: u32,
result_p: &mut NeonIOResult,
) -> i32 {
match bs.neon_request_slots[request_idx as usize].try_get_result() {
None => -1, // still processing
Some(result) => {
*result_p = result;
0
}
}
}
// LFC functions
/// Finish a local file cache read
///
//
#[unsafe(no_mangle)]
pub extern "C" fn bcomm_finish_cache_read(bs: &mut CommunicatorBackendStruct) -> bool {
if let Some(op) = bs.pending_cache_read_op.take() {
op.finish()
} else {
panic!("bcomm_finish_cache_read() called with no cached read pending");
}
}
impl<'t> CommunicatorBackendStruct<'t> {
/// Send a wakeup to the communicator process
fn submit_request(self: &CommunicatorBackendStruct<'t>, request_idx: i32) {
// wake up communicator by writing the idx to the submission pipe
//
// This can block, if the pipe is full. That should be very rare,
// because the communicator tries hard to drain the pipe to prevent
// that. Also, there's a natural upper bound on how many wakeups can be
// queued up: there is only a limited number of request slots for each
// backend.
//
// If it does block very briefly, that's not too serious.
let idxbuf = request_idx.to_ne_bytes();
let _res = nix::unistd::write(self.submission_pipe_write_fd, &idxbuf);
// FIXME: check result, return any errors
}
/// Note: there's no guarantee on when the communicator might pick it up. You should ring
/// the doorbell. But it might pick it up immediately.
pub(crate) fn start_neon_request(&mut self, request: &NeonIORequest) -> i32 {
let my_proc_number = self.my_proc_number;
// Grab next free slot
// FIXME: any guarantee that there will be any?
let idx = self.next_neon_request_idx;
let next_idx = idx + 1;
self.next_neon_request_idx = if next_idx == self.my_end_idx {
self.my_start_idx
} else {
next_idx
};
self.neon_request_slots[idx as usize].fill_request(request, my_proc_number);
return idx as i32;
}
}

View File

@@ -0,0 +1,109 @@
//! Implement the "low-level" parts of the file cache.
//!
//! This module just deals with reading and writing the file, and keeping track
//! which blocks in the cache file are in use and which are free. The "high
//! level" parts of tracking which block in the cache file corresponds to which
//! relation block is handled in 'integrated_cache' instead.
//!
//! This module is only used to access the file from the communicator
//! process. The backend processes *also* read the file (and sometimes also
//! write it? ), but the backends use direct C library calls for that.
use std::fs::File;
use std::path::Path;
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};
use tokio_epoll_uring;
use crate::BLCKSZ;
pub type CacheBlock = u64;
pub struct FileCache {
uring_system: tokio_epoll_uring::SystemHandle,
file: Arc<File>,
// TODO: there's no reclamation mechanism, the cache grows
// indefinitely. This is the next free block, i.e. the current
// size of the file
next_free_block: AtomicU64,
}
impl FileCache {
pub fn new(
file_cache_path: &Path,
uring_system: tokio_epoll_uring::SystemHandle,
) -> Result<FileCache, std::io::Error> {
let file = std::fs::OpenOptions::new()
.read(true)
.write(true)
.truncate(true)
.create(true)
.open(file_cache_path)?;
tracing::info!("Created cache file {file_cache_path:?}");
Ok(FileCache {
file: Arc::new(file),
uring_system,
next_free_block: AtomicU64::new(0),
})
}
// File cache management
pub async fn read_block(
&self,
cache_block: CacheBlock,
dst: impl uring_common::buf::IoBufMut + Send + Sync,
) -> Result<(), std::io::Error> {
assert!(dst.bytes_total() == BLCKSZ);
let file = self.file.clone();
let ((_file, _buf), res) = self
.uring_system
.read(file, cache_block as u64 * BLCKSZ as u64, dst)
.await;
let res = res.map_err(map_io_uring_error)?;
if res != BLCKSZ {
panic!("unexpected read result");
}
Ok(())
}
pub async fn write_block(
&self,
cache_block: CacheBlock,
src: impl uring_common::buf::IoBuf + Send + Sync,
) -> Result<(), std::io::Error> {
assert!(src.bytes_init() == BLCKSZ);
let file = self.file.clone();
let ((_file, _buf), res) = self
.uring_system
.write(file, cache_block as u64 * BLCKSZ as u64, src)
.await;
let res = res.map_err(map_io_uring_error)?;
if res != BLCKSZ {
panic!("unexpected read result");
}
Ok(())
}
pub fn alloc_block(&self) -> CacheBlock {
self.next_free_block.fetch_add(1, Ordering::Relaxed)
}
}
fn map_io_uring_error(err: tokio_epoll_uring::Error<std::io::Error>) -> std::io::Error {
match err {
tokio_epoll_uring::Error::Op(err) => err,
tokio_epoll_uring::Error::System(err) => {
std::io::Error::new(std::io::ErrorKind::Other, err)
}
}
}

View File

@@ -0,0 +1,130 @@
//! Initialization functions. These are executed in the postmaster process,
//! at different stages of server startup.
//!
//!
//! Communicator initialization steps:
//!
//! 1. At postmaster startup, before shared memory is allocated,
//! rcommunicator_shmem_size() is called to get the amount of
//! shared memory that this module needs.
//!
//! 2. Later, after the shared memory has been allocated,
//! rcommunicator_shmem_init() is called to initialize the shmem
//! area.
//!
//! Per process initialization:
//!
//! When a backend process starts up, it calls rcommunicator_backend_init().
//! In the communicator worker process, other functions are called, see
//! `worker_process` module.
use std::ffi::c_int;
use std::mem;
use crate::backend_comms::NeonIOHandle;
use crate::integrated_cache::IntegratedCacheInitStruct;
const NUM_NEON_REQUEST_SLOTS_PER_BACKEND: u32 = 5;
/// This struct is created in the postmaster process, and inherited to
/// the communicator process and all backend processes through fork()
#[repr(C)]
pub struct CommunicatorInitStruct {
#[allow(dead_code)]
pub max_procs: u32,
pub submission_pipe_read_fd: std::ffi::c_int,
pub submission_pipe_write_fd: std::ffi::c_int,
// Shared memory data structures
pub num_neon_request_slots_per_backend: u32,
pub neon_request_slots: &'static [NeonIOHandle],
pub integrated_cache_init_struct: IntegratedCacheInitStruct<'static>,
}
impl std::fmt::Debug for CommunicatorInitStruct {
fn fmt(&self, fmt: &mut std::fmt::Formatter<'_>) -> Result<(), std::fmt::Error> {
fmt.debug_struct("CommunicatorInitStruct")
.field("max_procs", &self.max_procs)
.field("submission_pipe_read_fd", &self.submission_pipe_read_fd)
.field("submission_pipe_write_fd", &self.submission_pipe_write_fd)
.field(
"num_neon_request_slots_per_backend",
&self.num_neon_request_slots_per_backend,
)
.field("neon_request_slots length", &self.neon_request_slots.len())
.finish()
}
}
#[unsafe(no_mangle)]
pub extern "C" fn rcommunicator_shmem_size(max_procs: u32) -> u64 {
let mut size = 0;
let num_neon_request_slots = max_procs * NUM_NEON_REQUEST_SLOTS_PER_BACKEND;
size += mem::size_of::<NeonIOHandle>() * num_neon_request_slots as usize;
// For integrated_cache's Allocator. TODO: make this adjustable
size += IntegratedCacheInitStruct::shmem_size(max_procs);
size as u64
}
/// Initialize the shared memory segment. Returns a backend-private
/// struct, which will be inherited by backend processes through fork
#[unsafe(no_mangle)]
pub extern "C" fn rcommunicator_shmem_init(
submission_pipe_read_fd: c_int,
submission_pipe_write_fd: c_int,
max_procs: u32,
shmem_area_ptr: *mut u8,
shmem_area_len: u64,
) -> &'static mut CommunicatorInitStruct {
let mut ptr = shmem_area_ptr;
// Carve out the request slots from the shmem area and initialize them
let num_neon_request_slots_per_backend = NUM_NEON_REQUEST_SLOTS_PER_BACKEND;
let num_neon_request_slots = max_procs * num_neon_request_slots_per_backend;
let len_used;
let neon_request_slots: &mut [NeonIOHandle] = unsafe {
ptr = ptr.add(ptr.align_offset(std::mem::align_of::<NeonIOHandle>()));
let neon_request_slots_ptr: *mut NeonIOHandle = ptr.cast();
for _i in 0..num_neon_request_slots {
let slot: *mut NeonIOHandle = ptr.cast();
*slot = NeonIOHandle::default();
ptr = ptr.byte_add(mem::size_of::<NeonIOHandle>());
}
len_used = ptr.byte_offset_from(shmem_area_ptr) as usize;
assert!(len_used <= shmem_area_len as usize);
std::slice::from_raw_parts_mut(neon_request_slots_ptr, num_neon_request_slots as usize)
};
let remaining_area =
unsafe { std::slice::from_raw_parts_mut(ptr, shmem_area_len as usize - len_used) };
// Give the rest of the area to the integrated cache
let integrated_cache_init_struct =
IntegratedCacheInitStruct::shmem_init(max_procs, remaining_area);
eprintln!(
"PIPE READ {} WRITE {}",
submission_pipe_read_fd, submission_pipe_write_fd
);
let cis: &'static mut CommunicatorInitStruct = Box::leak(Box::new(CommunicatorInitStruct {
max_procs,
submission_pipe_read_fd,
submission_pipe_write_fd,
num_neon_request_slots_per_backend: NUM_NEON_REQUEST_SLOTS_PER_BACKEND,
neon_request_slots,
integrated_cache_init_struct,
}));
cis
}

Some files were not shown because too many files have changed in this diff Show More