Compare commits

32 Commits

Author SHA1 Message Date
HaoyuHuang
eb1d6eb1fe c 2025-07-25 16:18:29 +00:00
HaoyuHuang
3cbd25adef Merge branch 'tristan957/pgconf-port' of https://github.com/neondatabase/neon into tristan957/pgconf-port 2025-07-25 15:52:48 +00:00
HaoyuHuang
2685b93e81 c 2025-07-25 15:45:35 +00:00
HaoyuHuang
5b2d3b7cac Merge branch 'tristan957/hadron-pg_hba' of https://github.com/neondatabase/neon into tristan957/hadron-pg_hba 2025-07-25 15:43:35 +00:00
Vlad Lazar
b0dfe0ffa6 storcon: attempt all non-essential location config calls during reconciliations (#12745)
## Problem

We saw the following in the field:

Context and observations:
* The storage controller keeps track of the latest generations and the
pageserver that issued the latest generation in the database
* When the storage controller needs to proxy a request (e.g. timeline
creation) to the pageservers, it will use the pageserver that
issued the latest generation from the db (generation_pageserver).
* pageserver-2.cell-2 got into a bad state and wasn't able to apply
location_config (e.g. detach a shard)

What happened:
1. pageserver-2.cell-2 was a secondary for our shard since we were not
able to detach it
2. control plane asked to detach a tenant (presumably because it was
idle)
a. In response storcon clears the generation_pageserver from the db and
attempts to detach all locations
b. it tries to detach pageserver-2.cell-2 first, but fails, which fails
the entire reconciliation leaving the good attached location still there
c. return success to cplane

3. control plane asks to re-attach the tenant
a. In response storcon performs a reconciliation
b. it finds that the observed state matches the intent (remember we did
not detach the primary at step(2))
c. skips incrementing the generation and setting the
generation_pageserver column

Now any requests that need to be proxied to pageservers and rely on the
generation_pageserver db column fail because that's not set

## Summary of changes

1. We do all non-essential location config calls (setting up
secondaries,
detaches) at the end of the reconciliation. Previously, we bailed out
of the reconciliation on the first failure. With this patch we attempt
all of the RPCs.
This allows the observed state to update even if another RPC failed for
unrelated reasons.

2. If the overall reconciliation failed, we don't want to remove nodes
from the
observed state as a safe-guard. With the previous patch, we'll get a
deletion delta to process, which would be ignored. Ignoring it is not
the right thing to do since it's out of sync with the db state.
Hence, on reconciliation failures map deletion from the observed state
to the uncertain state. Future reconciliation will query the node to
refresh their observed state.
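
The two changes above can be sketched as follows. This is a minimal Python illustration, not the storage controller's actual code: `Observed`, `reconcile_non_essential`, and the `detach` callback are hypothetical names, and marking the failed nodes themselves as uncertain is an assumption.

```python
from enum import Enum
from typing import Callable, Dict, List


class Observed(Enum):
    ATTACHED = "attached"
    UNCERTAIN = "uncertain"  # a future reconciliation must re-query the node


def reconcile_non_essential(
    observed: Dict[str, Observed],
    to_detach: List[str],
    detach: Callable[[str], None],
) -> List[str]:
    """Attempt every detach RPC, even if an earlier one failed.

    Returns the nodes whose RPC failed. If anything failed, deletions from
    the observed state are mapped to UNCERTAIN instead of being applied.
    """
    failed: List[str] = []
    succeeded: List[str] = []
    for node in to_detach:
        try:
            detach(node)
            succeeded.append(node)
        except Exception:
            failed.append(node)  # record the failure and keep going

    for node in succeeded:
        if failed:
            observed[node] = Observed.UNCERTAIN  # do not drop on overall failure
        else:
            observed.pop(node, None)
    for node in failed:
        observed[node] = Observed.UNCERTAIN  # assumption: failed nodes are re-queried too
    return failed
```

With this shape, a broken secondary no longer prevents the detach RPC from reaching the healthy attached location.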

Closes LKB-204
2025-07-25 14:03:17 +00:00
Erik Grinaker
185ead8395 pageserver: verify gRPC GetPages on correct shard (#12722)
Verify that gRPC `GetPageRequest` has been sent to the shard that owns
the pages. This avoids spurious `NotFound` errors if a compute misroutes
a request, which can appear scarier (e.g. data loss).

Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).
2025-07-25 13:43:04 +00:00
Erik Grinaker
37e322438b pageserver: document gRPC compute accessibility (#12724)
Document that the Pageserver gRPC port is accessible by computes, and
should not provide internal services.

Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).
2025-07-25 13:35:44 +00:00
Gustavo Bazan
fca2c32e59 [ci/docker] task: Apply some quick wins for tools dockerfile (#12740)
## Problem

The Dockerfile for build tools has some small issues that are easy to
fix to make it follow some of docker best practices

## Summary of changes

Apply some small quick wins on the Dockerfile for build tools

- Usage of apt-get over apt
- usage of --no-cache-dir for pip install
2025-07-25 12:39:01 +00:00
Conrad Ludgate
d19aebcf12 [proxy] introduce moka for the project-info cache (#12710)
## Problem

LKB-2502 The garbage collection of the project info cache is garbage. 

What we observed: If we get unlucky, we might throw away a very hot
entry if the cache is full. The GC loop is dependent on getting a lucky
shard of the projects2ep table that clears a lot of cold entries. The GC
does not take into account active use, and the interval it runs at is
too sparse to do any good.

Can we switch to a proper cache implementation?

Complications:
1. We need to invalidate by project/account.
2. We need to expire based on `retry_delay_ms`.

## Summary of changes

1. Replace `retry_delay_ms: Duration` with `retry_at: Instant` when
deserializing.
2. Split the EndpointControls from the RoleControls into two different
caches.
3. Introduce an expiry policy based on error retry info.
4. Introduce `moka` as a dependency, replacing our `TimedLru`.

See the follow up PR for changing all TimedLru instances to use moka:
#12726.
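
To illustrate changes 1 and 3, here is a minimal Python sketch of a cache whose entries expire at an absolute deadline derived from the retry delay. It is not `moka`'s API (which is a Rust crate); `ExpiringCache` and `retry_at_from_delay` are hypothetical names, and the default TTL is an assumed parameter.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


def retry_at_from_delay(delay_ms: int, now: float) -> float:
    """Convert a relative retry delay into an absolute deadline, once, at decode time."""
    return now + delay_ms / 1000.0


class ExpiringCache:
    """Entries expire at a per-entry deadline instead of waiting for a GC pass."""

    def __init__(self, default_ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self._default_ttl_s = default_ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[Any, float]] = {}

    def insert(self, key: str, value: Any, retry_at: Optional[float] = None) -> None:
        # Error entries carry their own deadline; successes use the default TTL.
        if retry_at is None:
            retry_at = self._clock() + self._default_ttl_s
        self._entries[key] = (value, retry_at)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop lazily on access
            return None
        return value
```

Storing an absolute deadline means the expiry check is a single comparison and does not depend on when the entry was inserted.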
2025-07-25 11:40:47 +00:00
Conrad Ludgate
a70a5bccff move subzero_core to proxy libs (#12742)
We have a dedicated libs folder for proxy related libraries. Let's move
the subzero_core stub there.
2025-07-25 10:44:28 +00:00
Conrad Ludgate
d9cedb4a95 [tokio-postgres] fix regression in buffer reuse (#12739)
Follow up to #12701, which introduced a new regression. When profiling
locally I noticed that writes have the tendency to always reallocate. On
investigation I found that even if the `Connection`'s write buffer is
empty, if it still shares the same data pointer as the `Client`'s write
buffer then the client cannot reclaim it.

The best way I found to fix this is to just drop the `Connection`'s
write buffer each time we fully flush it.

Additionally, I remembered that `BytesMut` has an `unsplit` method which
allows even better sharing than the previous optimisation I had when
'encoding'.
2025-07-25 09:03:21 +00:00
Tristan Partin
7ea879d47e Explicitly set port in postgresql.conf
Hadron doesn't bind Postgres to the default Postgres port of 5432 it
seems, and requires using a different port. Neon binds to the default
port. Writing out the default port makes no difference, so do it always
regardless of Lakebase mode.

Signed-off-by: Tristan Partin <tristan.partin@databricks.com>
2025-07-24 21:57:20 -05:00
Jarupat Jisarojito
cf3f5f23b3 Add databricks setting via write_postgres_conf
## Summary of changes

Move databricks settings that will be appended to postgres.conf to
`write_postgres_conf`

## How is this tested?

Existing tests.

Co-authored-by: Tristan Partin <tristan.partin@databricks.com>
2025-07-24 20:45:17 -05:00
Jarupat Jisarojito
a66df1f4bd [BRC-1405] Mount databricks pg_hba and pg_ident from configmap to dblet pod
## Problem

For certificate auth, we need to configure pg_hba and pg_ident for it to
work.
This PR https://github.com/databricks/universe/pull/655011 in universe
will create a config map and deploy it to the `hadron-compute` namespace.

HCC needs to mount this config map to all pg compute pods.

## Summary of changes

Create `databricks_pg_hba` and `databricks_pg_ident` to configure where
the files are located on the pod. These configs are passed down to
`compute_ctl`. `compute_ctl` uses these configs to update the `pg_hba.conf`
and `pg_ident.conf` files.

We append `include_if_exists {databricks_pg_hba}` to `pg_hba.conf`, and
similarly for `pg_ident.conf`, so that they refer to the databricks config
files without much change to the existing default pg config files.
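
A minimal Python sketch of the append step, assuming the directive should be added at most once. `append_include` is a hypothetical helper, not the `compute_ctl` implementation, and the example path in the usage note is made up.

```python
from pathlib import Path


def append_include(conf_path: Path, include_path: str) -> bool:
    """Append an `include_if_exists` line to a pg_hba.conf/pg_ident.conf file.

    Returns True if the line was added, False if it was already present.
    """
    line = f"include_if_exists {include_path}"
    existing = conf_path.read_text() if conf_path.exists() else ""
    if line in existing.splitlines():
        return False  # already present, leave the file untouched
    with conf_path.open("a") as f:
        if existing and not existing.endswith("\n"):
            f.write("\n")  # keep the directive on its own line
        f.write(line + "\n")
    return True
```

For example, `append_include(Path("pg_hba.conf"), "/path/to/databricks_pg_hba.conf")` leaves the default rules in place and only adds one trailing line.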

I renamed `secret_mounts` to `compute_mounts` because now it is used to
configure secret and config map mounts.

Co-authored-by: Tristan Partin <tristan.partin@databricks.com>
2025-07-24 20:39:02 -05:00
William Huang
9fda727018 [BRC-1425] Plumb through and set the requisite GUCs when starting the compute instance
## Problem

We need to set the following Postgres GUCs to the correct values before
starting Postgres in the compute instance:

```
databricks.workspace_url
databricks.enable_databricks_identity_login
databricks.enable_sql_restrictions
```

## Summary of changes

Plumbed through `workspace_url` and other GUC settings via
`DatabricksSettings` in `ComputeSpec`. The spec is sent to the compute
instance when it starts up and the GUCs are written to `postgresql.conf`
before the postgres process is launched.
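
A minimal Python sketch of how such settings could be rendered as `postgresql.conf` lines. `render_guc` is a hypothetical helper and the values in the usage note are placeholders; only the GUC names come from this change.

```python
from typing import Union


def render_guc(name: str, value: Union[bool, str]) -> str:
    """Render one `name = value` line in postgresql.conf syntax."""
    if isinstance(value, bool):
        return f"{name} = {'on' if value else 'off'}"
    # String values are single-quoted; embedded single quotes are doubled.
    escaped = str(value).replace("'", "''")
    return f"{name} = '{escaped}'"
```

For example, `render_guc("databricks.enable_sql_restrictions", True)` yields `databricks.enable_sql_restrictions = on`.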

Co-authored-by: Tristan Partin <tristan.partin@databricks.com>
2025-07-24 20:37:05 -05:00
Jarupat Jisarojito
1f8e8c50ae Copy pg server cert and key to pgdata with correct permission
## Problem

We need to copy certificate and key from secret mount directory to
`pgdata` directory where `postgres` is the owner and we can set the key
permission to 0600.

## Summary of changes

- Added a new pgparam `pg_compute_tls_settings` to specify where the k8s
secret for the certificate and key is mounted.
- Added a new field to `ComputeSpec` called `databricks_settings`. This
is a struct that will be used to store any other settings that need to
be propagated to Compute but should not be persisted to `ComputeSpec` in
the database.
- When the compute container starts up, as part of the `prepare_pgdata`
function, it copies `server.key` and `server.crt` from the k8s mounted
directory to the `pgdata` directory.
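
A minimal Python sketch of the copy step, assuming the process already runs as the `postgres` user so no `chown` is needed. `copy_tls_files` is a hypothetical helper, and the `0644` mode for the certificate is an assumption; only the `0600` key permission comes from this change.

```python
import os
from pathlib import Path


def copy_tls_files(mount_dir: Path, pgdata: Path) -> None:
    """Copy server.crt and server.key into pgdata with explicit permissions."""
    for name, mode in (("server.crt", 0o644), ("server.key", 0o600)):
        src = mount_dir / name
        dst = pgdata / name
        # Create the destination with the final mode so the key is never
        # briefly readable by other users.
        fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, mode)
        with os.fdopen(fd, "wb") as out:
            out.write(src.read_bytes())
        # Covers a pre-existing file and a restrictive or permissive umask.
        os.chmod(dst, mode)
```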

Co-authored-by: Tristan Partin <tristan.partin@databricks.com>
2025-07-24 20:34:58 -05:00
Tristan Partin
b623fbae0c Cancel PG query if stuck at refreshing configuration (#12717)
## Problem

While configuring or reconfiguring PG due to PageServer movements, it's
possible PG may get stuck if PageServer is moved around after fetching
the spec from StorageController.

## Summary of changes

To fix this issue, this PR introduces two changes:
1. Fail the PG query directly if the query cannot request configuration
for a certain number of times.
2. Introduce a new state `RefreshConfiguration` in compute tools to
differentiate it from `RefreshConfigurationPending`. If the compute tool is
already in the `RefreshConfiguration` state, then it will not accept new
configuration refresh requests.

## How is this tested?
Chaos testing.

Co-authored-by: Chen Luo <chen.luo@databricks.com>
2025-07-25 00:01:59 +00:00
Tristan Partin
512210bb5a [BRC-2368] Add PS and compute_ctl metrics to report pagestream request errors (#12716)
## Problem

In our experience running the system so far, almost all of the "hang
compute" situations are due to the compute (postgres) pointing at the
wrong pageservers. We currently mainly rely on the Prometheus exporter
(PGExporter) running on PG to detect and report any down time, but these
can be unreliable because the read and write probes the PGExporter runs
do not always generate pageserver requests due to caching, even though
the real user might be experiencing down time when touching uncached
pages.

We are also about to start disk-wiping node pool rotation operations in
prod clusters for our pageservers, and it is critical to have a
convenient way to monitor the impact of these node pool rotations so
that we can quickly respond to any issues. These metrics should provide
very clear signals to address this operational need.

## Summary of changes

Added a pair of metrics to detect issues between postgres' PageStream
protocol (e.g. get_page_at_lsn, get_base_backup, etc.) communications
with pageservers:
* On the compute node (compute_ctl), exports a counter metric that is
incremented every time postgres requests a configuration refresh.
Postgres today only requests these configuration refreshes when it
cannot connect to a pageserver or if the pageserver rejects its request
by disconnecting.
* On the pageserver, exports a counter metric that is incremented every
time it receives a PageStream request that cannot be handled because the
tenant is not known or if the request was routed to the wrong shard
(e.g. secondary).

### How I plan to use metrics
I plan to use the metrics added here to create alerts. The alerts can
fire, for example, if these counters have been continuously increasing
for over a certain period of time. During rollouts, misrouted requests
may occasionally happen, but they should soon die down as
reconfigurations make progress. We can start with something like raising
the alert if the counters have been increasing continuously for over 5
minutes.
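
The alert condition described above can be sketched in Python as follows. This is an illustration of the rule only, not an actual alerting configuration; `increasing_for` and the sample format are assumptions.

```python
from typing import Optional, Sequence, Tuple


def increasing_for(samples: Sequence[Tuple[float, float]], window_s: float) -> bool:
    """Return True if the counter rose at every scrape for at least `window_s` seconds.

    `samples` are (timestamp_seconds, counter_value) pairs in time order.
    """
    streak_start: Optional[float] = None
    for (t_prev, v_prev), (_, v) in zip(samples, samples[1:]):
        if v > v_prev:
            if streak_start is None:
                streak_start = t_prev  # first scrape of the current rising streak
        else:
            streak_start = None  # a flat or reset counter ends the streak
    return streak_start is not None and samples[-1][0] - streak_start >= window_s
```

A short burst of misrouted requests during a rollout produces a streak that ends once reconfiguration catches up, so it does not trip a 5-minute (`window_s=300`) threshold.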

## How is this tested?

New integration tests in
`test_runner/regress/test_hadron_ps_connectivity_metrics.py`

Co-authored-by: William Huang <william.huang@databricks.com>
2025-07-24 19:05:00 +00:00
HaoyuHuang
9eebd6fc79 A few more compute_ctl changes (#12713)
## Summary of changes
A bunch of no-op changes. 

The only other thing is that the lock is released early in the terminate
func.
2025-07-24 19:01:30 +00:00
Tristan Partin
11527b9df7 [BRC-2951] Enforce PG backpressure parameters at the shard level (#12694)
## Problem
Currently PG backpressure parameters are enforced globally. With tenant
splitting, this makes it hard to balance small tenants and large
tenants. For large tenants with more shards, we need to increase the
lagging because each shard receives total/shard_count amount of data,
while doing so could be suboptimal to small tenants with fewer shards.

## Summary of changes
This PR makes these parameters enforced at the shard level, i.e.,
PG will compute the actual lag limit by multiplying by the shard count.
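
As a worked example of the scaling, here is a minimal Python sketch. `effective_lag_limit` is a hypothetical name, and treating a shard count below 1 as a single shard is an assumption about unsharded tenants.

```python
def effective_lag_limit(per_shard_limit_bytes: int, shard_count: int) -> int:
    """Scale a per-shard lag limit to the limit PG enforces for the whole tenant."""
    # Assumption: an unsharded tenant behaves like one shard.
    return per_shard_limit_bytes * max(shard_count, 1)
```

With a per-shard limit of 15 MiB, an 8-shard tenant gets a 120 MiB total limit while a single-shard tenant keeps 15 MiB, so the same setting suits both.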

## How is this tested?
Added regression test.

Co-authored-by: Chen Luo <chen.luo@databricks.com>
2025-07-24 18:41:29 +00:00
Tristan Partin
89554af1bd [BRC-1778] Have PG signal compute_ctl to refresh configuration if it suspects that it is talking to the wrong PSs (#12712)
## Problem

This is a follow-up to TODO, as part
of the effort to rewire the compute reconfiguration/notification
mechanism to make it more robust. Please refer to that commit or ticket
BRC-1778 for full context of the problem.

## Summary of changes

The previous change added a mechanism in `compute_ctl` that makes it
possible to refresh the configuration of PG on-demand by having
`compute_ctl` go out to download a new config from the control
plane/HCC. This change wired this mechanism up with PG so that PG will
signal `compute_ctl` to refresh its configuration when it suspects that
it could be talking to incorrect pageservers due to a stale
configuration.

PG will become suspicious that it is talking to the wrong pageservers in
the following situations:
1. It cannot connect to a pageserver (e.g., getting a network-level
connection refused error)
2. It can connect to a pageserver, but the pageserver does not return
any data for the GetPage request
3. It can connect to a pageserver, but the pageserver returns a
malformed response
4. It can connect to a pageserver, but there is an error receiving the
GetPage request response for any other reason

This change also includes a minor tweak to `compute_ctl`'s config
refresh behavior. Upon receiving a request to refresh PG configuration,
`compute_ctl` will reach out to download a config, but it will not
attempt to apply the configuration if the config is the same as the old
config it is replacing. This optimization is added because the act of
reconfiguring itself requires working pageserver connections. In many
failure situations it is likely that PG detects an issue with a
pageserver before the control plane can detect the issue, migrate
tenants, and update the compute config. In this case even the latest
compute config won't point PG to working pageservers, causing the
configuration attempt to hang and negatively impact PG's
time-to-recovery. With this change, `compute_ctl` only attempts
reconfiguration if the refreshed config points PG to different
pageservers.
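
The refresh optimization can be sketched in Python as follows. `ComputeConfig`, `should_reconfigure`, and `refresh` are hypothetical names, and reducing the config to a single pageserver connection string is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ComputeConfig:
    pageserver_connstring: str


def should_reconfigure(old: ComputeConfig, new: ComputeConfig) -> bool:
    """Only reconfigure when the refreshed config points at different pageservers."""
    return old.pageserver_connstring != new.pageserver_connstring


def refresh(
    current: ComputeConfig,
    fetch: Callable[[], ComputeConfig],
    apply: Callable[[ComputeConfig], None],
) -> ComputeConfig:
    """Fetch a new config and apply it only if it differs from the current one."""
    new = fetch()
    if not should_reconfigure(current, new):
        return current  # skip: applying needs working pageserver connections
    apply(new)
    return new
```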

## How is this tested?

The new code paths are exercised in all existing tests because this
mechanism is on by default.

Explicitly tested in `test_runner/regress/test_change_pageserver.py`.

Co-authored-by: William Huang <william.huang@databricks.com>
2025-07-24 16:44:45 +00:00
Peter Bendel
f391186aa7 TPC-C like periodic benchmark using benchbase (#12665)
## Problem

We don't have a well-documented, periodic benchmark for TPC-C like OLTP
workload.

## Summary of changes

# Benchbase TPC-C-like Performance Results

Runs TPC-C-like benchmarks on Neon databases using
[Benchbase](https://github.com/cmu-db/benchbase).
Docker images are built
[here](https://github.com/neondatabase-labs/benchbase-docker-images)

We run the benchmarks at different scale factors aligned with different
compute sizes we offer to customers.
For each scale factor, we determine a max rate (see Throughput in warmup
phase) and then run the benchmark at a target rate of approx. 70 % of
the max rate.
We use different warehouse sizes which determine the working set size -
it is optimized for LFC size of the respective pricing tier.
Usually we should get LFC hit rates above 70 % for this setup and quite
good, consistent (non-flaky) latencies.

## Expected performance as of first testing this

| Tier | CU | Warehouses | Terminals | Max TPS | LFC size | Working set size | LFC hit rate | Median latency | p95 latency |
|------------|--------|---------------|-----------|---------|----------|------------------|--------------|----------------|-------------|
| free | 0.25-2 | 50 - 5 GB | 150 | 800 | 5 GB | 6.3 GB | 95 % | 170 ms | 600 ms |
| serverless | 2-8 | 500 - 50 GB | 230 | 2000 | 26 GB | ?? GB | 91 % | 50 ms | 200 ms |
| business | 2-16 | 1000 - 100 GB | 330 | 2900 | 51 GB | 50 GB | 72 % | 40 ms | 180 ms |

Each run 
- first loads the database (not shown in the dashboard). 
- Then we run a warmup phase for 20 minutes to warm up the database and
the LFC at unlimited target rate (max rate) (highest throughput but
flaky latencies).
The warmup phase can be used to determine the max rate and adjust it in
the github workflow in case Neon is faster in the future.
- Then we run the benchmark at a target rate of approx. 70 % of the max
rate for 1 hour (expecting consistent latencies and throughput).
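
As a worked example of the target rate, here is a minimal Python sketch using the max rates from the table above. `target_rate` is a hypothetical helper; the workflow's generator script may round differently, since the text only says "approx. 70 %".

```python
def target_rate(max_rate_tps: int, percent: int = 70) -> int:
    """Target TPS as an integer percentage of the measured max rate."""
    return max_rate_tps * percent // 100


# Max TPS per tier, taken from the table above.
for tier, max_tps in (("free", 800), ("serverless", 2000), ("business", 2900)):
    print(f"{tier}: {target_rate(max_tps)} TPS")
```

This gives 560, 1400, and 2030 TPS for the three tiers.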

## Important notes on implementation:
- we want to eventually publish the process for reproducing these
benchmarks
- thus we want to reduce all dependencies necessary to run the
benchmark; the only things needed are
   - docker
   - the docker images referenced above for benchbase
   - python >= 3.9 to run some config generation steps and create diagrams
- to reduce dependencies we deliberately do NOT use some of our python
fixture test infrastructure to make the dependency chain really small -
so pls don't add a review comment "should reuse fixture xy"
- we also upload all generator python scripts, generated bash shell
scripts and configs as well as raw results to S3 bucket that we later
want to publish once this benchmark is reviewed and approved.
2025-07-24 16:26:54 +00:00
Paul Banks
94b41b531b storecon: Fix panic due to race with chaos migration on staging (#12727)
## Problem

* Fixes LKB-743

We get regular assertion failures on staging caused by a race with chaos
injector. If chaos injector decides to migrate a tenant shard between
the background optimisation planning and applying optimisations then we
attempt to migrate an already migrated shard and hit an assertion
failure.

## Summary of changes

@VladLazar fixed a variant of this issue by
adding `validate_optimization` recently, however it didn't validate the
specific property this other assertion requires. Fix is just to update
it to cover all the expected properties.
2025-07-24 16:14:47 +00:00
Erik Grinaker
d793088225 pgxn: set MACOSX_DEPLOYMENT_TARGET (#12723)
## Problem

Compiling `neon-pg-ext-v17` results in these linker warnings for
`libcommunicator.a`:

```
$ make -j`nproc` -s neon-pg-ext-v17
Installing PostgreSQL v17 headers
Compiling PostgreSQL v17
Compiling neon-specific Postgres extensions for v17
ld: warning: object file (/Users/erik.grinaker/Projects/neon/target/debug/libcommunicator.a[1159](25ac62e5b3c53843-curve25519.o)) was built for newer 'macOS' version (15.5) than being linked (15.0)
ld: warning: object file (/Users/erik.grinaker/Projects/neon/target/debug/libcommunicator.a[1160](0bbbd18bda93c05b-aes_nohw.o)) was built for newer 'macOS' version (15.5) than being linked (15.0)
ld: warning: object file (/Users/erik.grinaker/Projects/neon/target/debug/libcommunicator.a[1161](00c879ee3285a50d-montgomery.o)) was built for newer 'macOS' version (15.5) than being linked (15.0)
[...]
```

## Summary of changes

Set `MACOSX_DEPLOYMENT_TARGET` to the current local SDK version (15.5 in
this case), which links against object files for that version.
2025-07-24 14:48:35 +00:00
John Spray
67ad420e26 tests: turn down error rate in test_compute_pageserver_connection_stress (#12721)
## Problem

Compute retries are finite (e.g. 5x in a basebackup) -- with a 50%
failure rate we have a pretty good chance of exceeding that and the test
failing.

Fixes: https://databricks.atlassian.net/browse/LKB-2278

## Summary of changes

- Turn connection error rate down to 20%
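
To see why the lower rate helps, here is a back-of-the-envelope sketch in Python. It assumes 5 attempts per operation and independent failures, neither of which is confirmed by the test itself; the function names are hypothetical.

```python
def p_exhaust(error_rate: float, attempts: int = 5) -> float:
    """Probability that every attempt of one operation fails."""
    return error_rate ** attempts


def p_any_exhaust(error_rate: float, operations: int, attempts: int = 5) -> float:
    """Probability that at least one of `operations` operations exhausts its retries."""
    return 1.0 - (1.0 - p_exhaust(error_rate, attempts)) ** operations
```

Under these assumptions a single operation exhausts its retries with probability 0.5^5 = 3.125 % at a 50 % error rate, versus 0.2^5 = 0.032 % at 20 %. Over many operations in one test run the first figure compounds quickly, which `p_any_exhaust` makes easy to check.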

Co-authored-by: John Spray <john.spray@databricks.com>
2025-07-24 14:42:39 +00:00
Tristan Partin
90cd5a5be8 [BRC-1778] Add mechanism to compute_ctl to pull a new config (#12711)
## Problem

We have been dealing with a number of issues with the SC compute
notification mechanism. Various race conditions exist in the
PG/HCC/cplane/PS distributed system, and relying on the SC to send
notifications to the compute node to notify it of PS changes is not
robust. We decided to pursue a more robust option where the compute node
itself discovers whether it may be pointing to the incorrect PSs and
proactively reconfigures itself if issues are suspected.

## Summary of changes

To support this self-healing reconfiguration mechanism several pieces
are needed. This PR adds a mechanism to `compute_ctl` called "refresh
configuration", where the compute node reaches out to the control plane
to pull a new config and reconfigure PG using the new config, instead of
listening for a notification message containing a config to arrive from
the control plane. Main changes to compute_ctl:

1. The `compute_ctl` state machine now has a new State,
`RefreshConfigurationPending`. The compute node may enter this state
upon receiving a signal that it may be using the incorrect page servers.
2. Upon entering the `RefreshConfigurationPending` state, the background
configurator thread in `compute_ctl` wakes up, pulls a new config from
the control plane, and reconfigures PG (with `pg_ctl reload`) according
to the new config.
3. The compute node may enter the new `RefreshConfigurationPending`
state from `Running` or `Failed` states. If the configurator managed to
configure the compute node successfully, it will enter the `Running`
state, otherwise, it stays in `RefreshConfigurationPending` and the
configurator thread will wait for the next notification if an incorrect
config is still suspected.
4. Added various plumbing in `compute_ctl` data structures to allow the
configurator thread to perform the config fetch.
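
The state transitions in items 1 to 3 can be sketched in Python as follows. This is an illustration only, not the `compute_ctl` code: the function names and callbacks are hypothetical, and states other than the three named above are omitted.

```python
from enum import Enum, auto
from typing import Any, Callable


class State(Enum):
    RUNNING = auto()
    FAILED = auto()
    REFRESH_CONFIGURATION_PENDING = auto()


def request_refresh(state: State) -> State:
    """Handle an "incorrect config suspected" signal."""
    if state in (State.RUNNING, State.FAILED):
        return State.REFRESH_CONFIGURATION_PENDING
    return state  # already pending: nothing to do


def configurator_step(
    state: State,
    fetch_config: Callable[[], Any],
    apply_config: Callable[[Any], None],
) -> State:
    """One wake-up of the configurator thread."""
    if state is not State.REFRESH_CONFIGURATION_PENDING:
        return state
    try:
        apply_config(fetch_config())  # pull from the control plane, then reload PG
    except Exception:
        # Stay pending and wait for the next notification.
        return State.REFRESH_CONFIGURATION_PENDING
    return State.RUNNING
```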

The "incorrect config suspected" notification is delivered using an HTTP
endpoint, `/refresh_configuration`, on `compute_ctl`. This endpoint is
currently not called by anyone other than the tests. In a follow up PR I
will set up some code in the PG extension/libpagestore to call this HTTP
endpoint whenever PG suspects that it is pointing to the wrong page
servers.

## How is this tested?

Modified `test_runner/regress/test_change_pageserver.py` to add a
scenario where we use the new `/refresh_configuration` mechanism instead
of the existing `/configure` mechanism (which requires us to send a full
config to compute_ctl) to have the compute node reload and reconfigure
its pageservers.

I took one shortcut to reduce the scope of this change when it comes to
testing: the compute node uses a local config file instead of pulling a
config over the network from the HCC. This simplifies the test setup in
the following ways:
* The existing test framework is set up to use local config files for
compute nodes only, so it's convenient if I just stick with it.
* The HCC today generates a compute config with production settings
(e.g., assuming 4 CPUs, 16GB RAM, with local file caches), which is
probably not suitable in tests. We may need to add another test-only
endpoint config to the control plane to make this work.

The config-fetch part of the code is relatively straightforward (and
well-covered in both production and the KIND test) so it is probably
fine to replace it with loading from the local config file for these
integration tests.

In addition to making sure that the tests pass, I also manually
inspected the logs to make sure that the compute node is indeed
reloading the config using the new mechanism instead of going down the
old `/configure` path (it turns out the test has bugs which cause
compute `/configure` messages to be sent despite the test intending to
disable/blackhole them).

```test
2024-09-24T18:53:29.573650Z  INFO http request{otel.name=/refresh_configuration http.method=POST}: serving /refresh_configuration POST request
2024-09-24T18:53:29.573689Z  INFO configurator_main_loop: compute node suspects its configuration is out of date, now refreshing configuration
2024-09-24T18:53:29.573706Z  INFO configurator_main_loop: reloading config.json from path: /workspaces/hadron/test_output/test_change_pageserver_using_refresh[release-pg16]/repo/endpoints/ep-1/spec.json
PG:2024-09-24 18:53:29.574 GMT [52799] LOG:  received SIGHUP, reloading configuration files
PG:2024-09-24 18:53:29.575 GMT [52799] LOG:  parameter "neon.extension_server_port" cannot be changed without restarting the server
PG:2024-09-24 18:53:29.575 GMT [52799] LOG:  parameter "neon.pageserver_connstring" changed to "postgresql://no_user@localhost:15008"
...
```

Co-authored-by: William Huang <william.huang@databricks.com>
2025-07-24 14:26:21 +00:00
Christian Schwarz
643448b1a2 test_hot_standby_gc: work around standby_horizon-related flakiness/raciness uncovered by #12431 (#12704)
PR #12431 set initial lease deadline = 0s for tests.
This turned test_hot_standby_gc flaky because it now runs GC: it started
failing with `tried to request a page version that was garbage
collected`
because the replica reads below applied gc cutoff.

The leading theory is that we run the timeline_gc() before the first
standby_horizon push arrives at PS. That is definitively a thing that
can happen with the current standby_horizon mechanism, and it's now
tracked as such in https://databricks.atlassian.net/browse/LKB-2499.

We don't have logs to confirm this theory though, but regardless,
try the fix in this PR and see if it stabilizes things.

Refs
- flaky test issue: https://databricks.atlassian.net/browse/LKB-2465

## Problem

## Summary of changes
2025-07-24 14:00:22 +00:00
Conrad Ludgate
8daebb6ed4 [proxy] remove TokioMechanism and HyperMechanism (#12672)
Another go at #12341. LKB-2497

We now only need 1 connect mechanism (and 1 more for testing) which
saves us some code and complexity. We should be able to remove the final
connect mechanism when we create a separate worker task for
pglb->compute connections - either via QUIC streams or via in-memory
channels.

This also now ensures that connect_once always returns a ConnectionError
type - something simple enough we can probably define a serialisation
for in pglb.

* I've abstracted connect_to_compute to always use TcpMechanism and the
ProxyConfig.
* I've abstracted connect_to_compute_and_auth to perform authentication,
managing any retries for stale computes
* I had to introduce a separate `managed` function for taking ownership
of the compute connection into the Client/Connection pair
2025-07-24 12:37:04 +00:00
Alexey Kondratov
ab14521ea5 fix(compute): Turn off database collector in postgres_exporter (#12684)
## Problem

`postgres_exporter` has database collector enabled by default and it
doesn't filter out invalid databases, see

06a553c816/collector/pg_database.go (L67)
so if it hits one, it starts spamming logs
```
ERROR:  [NEON_SMGR] [reqid d9700000018] could not read db size of db 705302 from page server at lsn 5/A2457EB0
```

## Summary of changes

We don't use `pg_database_size_bytes` metric anyway, see

5e19b3fd89/apps/base/compute-metrics/scrape-compute-pg-exporter-neon.yaml (L29)
so just turn it off by passing `--no-collector.database`.
2025-07-24 11:52:31 +00:00
dependabot[bot]
e82021d6fe build(deps): bump the npm_and_yarn group across 1 directory with 2 updates (#12678)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-24 10:51:09 +00:00
Conrad Ludgate
9997661138 [proxy/tokio-postgres] garbage collection for codec buffers (#12701)
## Problem

A large insert or a large row will cause the codec to allocate a large
buffer. The codec never shrinks the buffer however. LKB-2496

## Summary of changes

1. Introduce a naive GC system for codec buffers
2. Try and reduce copies as much as possible
2025-07-24 10:30:02 +00:00
Ivan Efremov
0e427fc117 Update proxy-bench workflow to use bare-metal script (#12703)
Pass the params for run.sh in proxy-bench repo to use bare-metal config.
Fix the paths and cleanup procedure.
2025-07-24 08:23:07 +00:00
78 changed files with 3794 additions and 1083 deletions

384
.github/workflows/benchbase_tpcc.yml vendored Normal file

@@ -0,0 +1,384 @@
name: TPC-C like benchmark using benchbase
on:
  schedule:
    # * is a special character in YAML so you have to quote this string
    #        ┌───────────── minute (0 - 59)
    #        │ ┌───────────── hour (0 - 23)
    #        │ │ ┌───────────── day of the month (1 - 31)
    #        │ │ │ ┌───────────── month (1 - 12 or JAN-DEC)
    #        │ │ │ │ ┌───────────── day of the week (0 - 6 or SUN-SAT)
    - cron: '0 6 * * *' # run once a day at 6 AM UTC
  workflow_dispatch: # adds ability to run this manually
defaults:
  run:
    shell: bash -euxo pipefail {0}
concurrency:
  # Allow only one workflow globally because we do not want to be too noisy in production environment
  group: benchbase-tpcc-workflow
  cancel-in-progress: false
permissions:
  contents: read
jobs:
  benchbase-tpcc:
    strategy:
      fail-fast: false # allow other variants to continue even if one fails
      matrix:
        include:
          - warehouses: 50 # defines number of warehouses and is used to compute number of terminals
            max_rate: 800 # measured max TPS at scale factor based on experiments. Adjust if performance is better/worse
            min_cu: 0.25 # simulate free tier plan (0.25 -2 CU)
            max_cu: 2
          - warehouses: 500 # serverless plan (2-8 CU)
            max_rate: 2000
            min_cu: 2
            max_cu: 8
          - warehouses: 1000 # business plan (2-16 CU)
            max_rate: 2900
            min_cu: 2
            max_cu: 16
      max-parallel: 1 # we want to run each workload size sequentially to avoid noisy neighbors
    permissions:
      contents: write
      statuses: write
      id-token: write # aws-actions/configure-aws-credentials
    env:
      PG_CONFIG: /tmp/neon/pg_install/v17/bin/pg_config
      PSQL: /tmp/neon/pg_install/v17/bin/psql
      PG_17_LIB_PATH: /tmp/neon/pg_install/v17/lib
      POSTGRES_VERSION: 17
    runs-on: [ self-hosted, us-east-2, x64 ]
    timeout-minutes: 1440
    steps:
      - name: Harden the runner (Audit all outbound calls)
        uses: step-security/harden-runner@4d991eb9b905ef189e4c376166672c3f2f230481 # v2.11.0
        with:
          egress-policy: audit
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
      - name: Configure AWS credentials # necessary to download artefacts
        uses: aws-actions/configure-aws-credentials@e3dd6a429d7300a6a4c196c26e071d42e0343502 # v4.0.2
        with:
          aws-region: eu-central-1
          role-to-assume: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
          role-duration-seconds: 18000 # 5 hours is currently max associated with IAM role
      - name: Download Neon artifact
        uses: ./.github/actions/download
        with:
          name: neon-${{ runner.os }}-${{ runner.arch }}-release-artifact
          path: /tmp/neon/
          prefix: latest
          aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
      - name: Create Neon Project
        id: create-neon-project-tpcc
        uses: ./.github/actions/neon-project-create
        with:
          region_id: aws-us-east-2
          postgres_version: ${{ env.POSTGRES_VERSION }}
          compute_units: '[${{ matrix.min_cu }}, ${{ matrix.max_cu }}]'
          api_key: ${{ secrets.NEON_PRODUCTION_API_KEY_4_BENCHMARKS }}
          api_host: console.neon.tech # production (!)
      - name: Initialize Neon project
        env:
          BENCHMARK_TPCC_CONNSTR: ${{ steps.create-neon-project-tpcc.outputs.dsn }}
          PROJECT_ID: ${{ steps.create-neon-project-tpcc.outputs.project_id }}
        run: |
          echo "Initializing Neon project with project_id: ${PROJECT_ID}"
          export LD_LIBRARY_PATH=${PG_17_LIB_PATH}
          # Retry logic for psql connection with 1 minute sleep between attempts
          for attempt in {1..3}; do
            echo "Attempt ${attempt}/3: Creating extensions in Neon project"
            if ${PSQL} "${BENCHMARK_TPCC_CONNSTR}" -c "CREATE EXTENSION IF NOT EXISTS neon; CREATE EXTENSION IF NOT EXISTS neon_utils;"; then
              echo "Successfully created extensions"
              break
            else
              echo "Failed to create extensions on attempt ${attempt}"
              if [ ${attempt} -lt 3 ]; then
                echo "Waiting 60 seconds before retry..."
                sleep 60
              else
                echo "All attempts failed, exiting"
                exit 1
              fi
            fi
          done
          echo "BENCHMARK_TPCC_CONNSTR=${BENCHMARK_TPCC_CONNSTR}" >> $GITHUB_ENV
      - name: Generate BenchBase workload configuration
        env:
          WAREHOUSES: ${{ matrix.warehouses }}
          MAX_RATE: ${{ matrix.max_rate }}
        run: |
          echo "Generating BenchBase configs for warehouses: ${WAREHOUSES}, max_rate: ${MAX_RATE}"
          # Extract hostname and password from connection string
          # Format: postgresql://username:password@hostname/database?params (no port for Neon)
          HOSTNAME=$(echo "${BENCHMARK_TPCC_CONNSTR}" | sed -n 's|.*://[^:]*:[^@]*@\([^/]*\)/.*|\1|p')
          PASSWORD=$(echo "${BENCHMARK_TPCC_CONNSTR}" | sed -n 's|.*://[^:]*:\([^@]*\)@.*|\1|p')
          echo "Extracted hostname: ${HOSTNAME}"
          # Use runner temp (NVMe) as working directory
          cd "${RUNNER_TEMP}"
          # Copy the generator script
          cp "${GITHUB_WORKSPACE}/test_runner/performance/benchbase_tpc_c_helpers/generate_workload_size.py" .
          # Generate configs and scripts
          python3 generate_workload_size.py \
            --warehouses ${WAREHOUSES} \
            --max-rate ${MAX_RATE} \
            --hostname ${HOSTNAME} \
            --password ${PASSWORD} \
            --runner-arch ${{ runner.arch }}
          # Fix path mismatch: move generated configs and scripts to expected locations
          mv ../configs ./configs
          mv ../scripts ./scripts
      - name: Prepare database (load data)
env:
WAREHOUSES: ${{ matrix.warehouses }}
run: |
cd "${RUNNER_TEMP}"
echo "Loading ${WAREHOUSES} warehouses into database..."
# Run the loader script and capture output to log file while preserving stdout/stderr
./scripts/load_${WAREHOUSES}_warehouses.sh 2>&1 | tee "load_${WAREHOUSES}_warehouses.log"
echo "Database loading completed"
- name: Run TPC-C benchmark (warmup phase, then benchmark at 70% of configured max TPS)
env:
WAREHOUSES: ${{ matrix.warehouses }}
run: |
cd "${RUNNER_TEMP}"
echo "Running TPC-C benchmark with ${WAREHOUSES} warehouses..."
# Run the optimal rate benchmark
./scripts/execute_${WAREHOUSES}_warehouses_opt_rate.sh
echo "Benchmark execution completed"
- name: Run TPC-C benchmark (warmup phase, then ramp down TPS and up again in 5 minute intervals)
env:
WAREHOUSES: ${{ matrix.warehouses }}
run: |
cd "${RUNNER_TEMP}"
echo "Running TPC-C ramp-down-up with ${WAREHOUSES} warehouses..."
# Run the ramp-down-up benchmark
./scripts/execute_${WAREHOUSES}_warehouses_ramp_up.sh
echo "Benchmark execution completed"
- name: Process results (upload to test results database and generate diagrams)
env:
WAREHOUSES: ${{ matrix.warehouses }}
MIN_CU: ${{ matrix.min_cu }}
MAX_CU: ${{ matrix.max_cu }}
PROJECT_ID: ${{ steps.create-neon-project-tpcc.outputs.project_id }}
REVISION: ${{ github.sha }}
PERF_DB_CONNSTR: ${{ secrets.PERF_TEST_RESULT_CONNSTR }}
run: |
cd "${RUNNER_TEMP}"
echo "Creating temporary Python environment for results processing..."
# Create temporary virtual environment
python3 -m venv temp_results_env
source temp_results_env/bin/activate
# Install required packages in virtual environment
pip install matplotlib pandas psycopg2-binary
echo "Copying results processing scripts..."
# Copy both processing scripts
cp "${GITHUB_WORKSPACE}/test_runner/performance/benchbase_tpc_c_helpers/generate_diagrams.py" .
cp "${GITHUB_WORKSPACE}/test_runner/performance/benchbase_tpc_c_helpers/upload_results_to_perf_test_results.py" .
echo "Processing load phase metrics..."
# Find and process load log
LOAD_LOG=$(find . -name "load_${WAREHOUSES}_warehouses.log" -type f | head -1)
if [ -n "$LOAD_LOG" ]; then
echo "Processing load metrics from: $LOAD_LOG"
python upload_results_to_perf_test_results.py \
--load-log "$LOAD_LOG" \
--run-type "load" \
--warehouses "${WAREHOUSES}" \
--min-cu "${MIN_CU}" \
--max-cu "${MAX_CU}" \
--project-id "${PROJECT_ID}" \
--revision "${REVISION}" \
--connection-string "${PERF_DB_CONNSTR}"
else
echo "Warning: Load log file not found: load_${WAREHOUSES}_warehouses.log"
fi
echo "Processing warmup results for optimal rate..."
# Find and process warmup results
WARMUP_CSV=$(find results_warmup -name "*.results.csv" -type f | head -1)
WARMUP_JSON=$(find results_warmup -name "*.summary.json" -type f | head -1)
if [ -n "$WARMUP_CSV" ] && [ -n "$WARMUP_JSON" ]; then
echo "Generating warmup diagram from: $WARMUP_CSV"
python generate_diagrams.py \
--input-csv "$WARMUP_CSV" \
--output-svg "warmup_${WAREHOUSES}_warehouses_performance.svg" \
--title-suffix "Warmup at max TPS"
echo "Uploading warmup metrics from: $WARMUP_JSON"
python upload_results_to_perf_test_results.py \
--summary-json "$WARMUP_JSON" \
--results-csv "$WARMUP_CSV" \
--run-type "warmup" \
--min-cu "${MIN_CU}" \
--max-cu "${MAX_CU}" \
--project-id "${PROJECT_ID}" \
--revision "${REVISION}" \
--connection-string "${PERF_DB_CONNSTR}"
else
echo "Warning: Missing warmup results files (CSV: $WARMUP_CSV, JSON: $WARMUP_JSON)"
fi
echo "Processing optimal rate results..."
# Find and process optimal rate results
OPTRATE_CSV=$(find results_opt_rate -name "*.results.csv" -type f | head -1)
OPTRATE_JSON=$(find results_opt_rate -name "*.summary.json" -type f | head -1)
if [ -n "$OPTRATE_CSV" ] && [ -n "$OPTRATE_JSON" ]; then
echo "Generating optimal rate diagram from: $OPTRATE_CSV"
python generate_diagrams.py \
--input-csv "$OPTRATE_CSV" \
--output-svg "benchmark_${WAREHOUSES}_warehouses_performance.svg" \
--title-suffix "70% of max TPS"
echo "Uploading optimal rate metrics from: $OPTRATE_JSON"
python upload_results_to_perf_test_results.py \
--summary-json "$OPTRATE_JSON" \
--results-csv "$OPTRATE_CSV" \
--run-type "opt-rate" \
--min-cu "${MIN_CU}" \
--max-cu "${MAX_CU}" \
--project-id "${PROJECT_ID}" \
--revision "${REVISION}" \
--connection-string "${PERF_DB_CONNSTR}"
else
echo "Warning: Missing optimal rate results files (CSV: $OPTRATE_CSV, JSON: $OPTRATE_JSON)"
fi
echo "Processing warmup 2 results for ramp down/up phase..."
# Find and process the second warmup run (tail -1 takes the last file that find lists)
WARMUP_CSV=$(find results_warmup -name "*.results.csv" -type f | tail -1)
WARMUP_JSON=$(find results_warmup -name "*.summary.json" -type f | tail -1)
if [ -n "$WARMUP_CSV" ] && [ -n "$WARMUP_JSON" ]; then
echo "Generating warmup diagram from: $WARMUP_CSV"
python generate_diagrams.py \
--input-csv "$WARMUP_CSV" \
--output-svg "warmup_2_${WAREHOUSES}_warehouses_performance.svg" \
--title-suffix "Warmup at max TPS"
echo "Uploading warmup metrics from: $WARMUP_JSON"
python upload_results_to_perf_test_results.py \
--summary-json "$WARMUP_JSON" \
--results-csv "$WARMUP_CSV" \
--run-type "warmup" \
--min-cu "${MIN_CU}" \
--max-cu "${MAX_CU}" \
--project-id "${PROJECT_ID}" \
--revision "${REVISION}" \
--connection-string "${PERF_DB_CONNSTR}"
else
echo "Warning: Missing warmup results files (CSV: $WARMUP_CSV, JSON: $WARMUP_JSON)"
fi
echo "Processing ramp results..."
# Find and process ramp results
RAMPUP_CSV=$(find results_ramp_up -name "*.results.csv" -type f | head -1)
RAMPUP_JSON=$(find results_ramp_up -name "*.summary.json" -type f | head -1)
if [ -n "$RAMPUP_CSV" ] && [ -n "$RAMPUP_JSON" ]; then
echo "Generating ramp diagram from: $RAMPUP_CSV"
python generate_diagrams.py \
--input-csv "$RAMPUP_CSV" \
--output-svg "ramp_${WAREHOUSES}_warehouses_performance.svg" \
--title-suffix "ramp TPS down and up in 5 minute intervals"
echo "Uploading ramp metrics from: $RAMPUP_JSON"
python upload_results_to_perf_test_results.py \
--summary-json "$RAMPUP_JSON" \
--results-csv "$RAMPUP_CSV" \
--run-type "ramp-up" \
--min-cu "${MIN_CU}" \
--max-cu "${MAX_CU}" \
--project-id "${PROJECT_ID}" \
--revision "${REVISION}" \
--connection-string "${PERF_DB_CONNSTR}"
else
echo "Warning: Missing ramp results files (CSV: $RAMPUP_CSV, JSON: $RAMPUP_JSON)"
fi
# Deactivate and clean up virtual environment
deactivate
rm -rf temp_results_env
rm upload_results_to_perf_test_results.py
echo "Results processing completed and environment cleaned up"
- name: Set date for upload
id: set-date
run: echo "date=$(date +%Y-%m-%d)" >> $GITHUB_OUTPUT
- name: Configure AWS credentials # necessary to upload results
uses: aws-actions/configure-aws-credentials@e3dd6a429d7300a6a4c196c26e071d42e0343502 # v4.0.2
with:
aws-region: us-east-2
role-to-assume: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
role-duration-seconds: 900 # 900 is minimum value
- name: Upload benchmark results to S3
env:
S3_BUCKET: neon-public-benchmark-results
S3_PREFIX: benchbase-tpc-c/${{ steps.set-date.outputs.date }}/${{ github.run_id }}/${{ matrix.warehouses }}-warehouses
run: |
echo "Redacting passwords from configuration files before upload..."
# Mask all passwords in XML config files
find "${RUNNER_TEMP}/configs" -name "*.xml" -type f -exec sed -i 's|<password>[^<]*</password>|<password>redacted</password>|g' {} \;
echo "Uploading benchmark results to s3://${S3_BUCKET}/${S3_PREFIX}/"
# Upload the entire benchmark directory recursively
aws s3 cp --only-show-errors --recursive "${RUNNER_TEMP}" s3://${S3_BUCKET}/${S3_PREFIX}/
echo "Upload completed"
- name: Delete Neon Project
if: ${{ always() }}
uses: ./.github/actions/neon-project-delete
with:
project_id: ${{ steps.create-neon-project-tpcc.outputs.project_id }}
api_key: ${{ secrets.NEON_PRODUCTION_API_KEY_4_BENCHMARKS }}
api_host: console.neon.tech # production (!)
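
Note on the "Generate BenchBase workload configuration" step above: the two `sed` expressions pull the hostname and password out of a DSN of the form `postgresql://username:password@hostname/database?params`. The same extraction can be sketched in Python with the standard library's `urllib.parse`. This is only an illustration: the function name `split_dsn` and the sample DSN are hypothetical and not part of this change. Unlike the `sed` host pattern, `urlsplit` also copes with an explicit port.

```python
from urllib.parse import urlsplit


def split_dsn(dsn: str) -> tuple[str, str]:
    """Return (hostname, password) from a postgresql:// connection string.

    The password is returned as written in the DSN (not percent-decoded),
    which matches what the sed expressions in the workflow extract.
    """
    parts = urlsplit(dsn)
    if parts.hostname is None or parts.password is None:
        raise ValueError("connection string has no hostname or password")
    return parts.hostname, parts.password


if __name__ == "__main__":
    # Hypothetical DSN, for illustration only.
    host, _password = split_dsn(
        "postgresql://user:s3cret@ep-example.us-east-2.aws.neon.tech/neondb?sslmode=require"
    )
    print(f"Extracted hostname: {host}")
```

Failing loudly on a malformed DSN is a deliberate choice in this sketch: the `sed -n ... p` form in the workflow yields an empty string instead, which would only surface later when the generator script runs.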

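Note on the "Upload benchmark results to S3" step: before copying `${RUNNER_TEMP}` to the bucket, the step rewrites every `<password>` element in the generated XML configs with `sed`. A minimal Python sketch of the same substitution follows, assuming (as the `sed` pattern does) that a password never contains `<`. The helper name `redact_passwords` is hypothetical.

```python
import re

# Same pattern as the sed expression in the workflow: anything up to the next '<'.
_PASSWORD_RE = re.compile(r"<password>[^<]*</password>")


def redact_passwords(xml_text: str) -> str:
    """Replace the body of every <password> element with 'redacted'."""
    return _PASSWORD_RE.sub("<password>redacted</password>", xml_text)
```

The pattern is a plain text substitution rather than XML parsing, so the rest of the config file is left byte-for-byte unchanged.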

@@ -3,7 +3,7 @@ name: Periodic proxy performance test on unit-perf-aws-arm runners
on:
push: # TODO: remove after testing
branches:
- test-proxy-bench # Runs on pushes to branches starting with test-proxy-bench
- test-proxy-bench # Runs on pushes to test-proxy-bench branch
# schedule:
# * is a special character in YAML so you have to quote this string
# ┌───────────── minute (0 - 59)
@@ -32,7 +32,7 @@ jobs:
statuses: write
contents: write
pull-requests: write
runs-on: [self-hosted, unit-perf-aws-arm]
runs-on: [ self-hosted, unit-perf-aws-arm ]
timeout-minutes: 60 # 1h timeout
container:
image: ghcr.io/neondatabase/build-tools:pinned-bookworm
@@ -55,30 +55,58 @@ jobs:
{
echo "PROXY_BENCH_PATH=$PROXY_BENCH_PATH"
echo "NEON_DIR=${RUNNER_TEMP}/neon"
echo "NEON_PROXY_PATH=${RUNNER_TEMP}/neon/bin/proxy"
echo "TEST_OUTPUT=${PROXY_BENCH_PATH}/test_output"
echo ""
} >> "$GITHUB_ENV"
- name: Run proxy-bench
run: ${PROXY_BENCH_PATH}/run.sh
- name: Cache poetry deps
uses: actions/cache@v4
with:
path: ~/.cache/pypoetry/virtualenvs
key: v2-${{ runner.os }}-${{ runner.arch }}-python-deps-bookworm-${{ hashFiles('poetry.lock') }}
- name: Ingest Bench Results # neon repo script
- name: Install Python deps
shell: bash -euxo pipefail {0}
run: ./scripts/pysync
- name: show ulimits
shell: bash -euxo pipefail {0}
run: |
ulimit -a
- name: Run proxy-bench
working-directory: ${{ env.PROXY_BENCH_PATH }}
run: ./run.sh --with-grafana --bare-metal
- name: Ingest Bench Results
if: always()
working-directory: ${{ env.NEON_DIR }}
run: |
mkdir -p $TEST_OUTPUT
python $NEON_DIR/scripts/proxy_bench_results_ingest.py --out $TEST_OUTPUT
- name: Push Metrics to Proxy perf database
shell: bash -euxo pipefail {0}
if: always()
env:
PERF_TEST_RESULT_CONNSTR: "${{ secrets.PROXY_TEST_RESULT_CONNSTR }}"
REPORT_FROM: $TEST_OUTPUT
working-directory: ${{ env.NEON_DIR }}
run: $NEON_DIR/scripts/generate_and_push_perf_report.sh
- name: Docker cleanup
if: always()
run: docker compose down
- name: Notify Failure
if: failure()
run: echo "Proxy bench job failed" && exit 1
run: echo "Proxy bench job failed" && exit 1
- name: Cleanup Test Resources
if: always()
shell: bash -euxo pipefail {0}
run: |
# Cleanup the test resources
if [[ -d "${TEST_OUTPUT}" ]]; then
rm -rf ${TEST_OUTPUT}
fi
if [[ -d "${PROXY_BENCH_PATH}/test_output" ]]; then
rm -rf ${PROXY_BENCH_PATH}/test_output
fi

Cargo.lock generated

@@ -211,11 +211,11 @@ dependencies = [
[[package]]
name = "async-lock"
version = "3.2.0"
version = "3.4.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7125e42787d53db9dd54261812ef17e937c95a51e4d291373b670342fa44310c"
checksum = "ff6e472cdea888a4bd64f342f09b3f50e1886d32afe8df3d663c01140b811b18"
dependencies = [
"event-listener 4.0.0",
"event-listener 5.4.0",
"event-listener-strategy",
"pin-project-lite",
]
@@ -1404,9 +1404,9 @@ dependencies = [
[[package]]
name = "concurrent-queue"
version = "2.3.0"
version = "2.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f057a694a54f12365049b0958a1685bb52d567f5593b355fbf685838e873d400"
checksum = "4ca0197aee26d1ae37445ee532fefce43251d24cc7c166799f4d46817f1d3973"
dependencies = [
"crossbeam-utils",
]
@@ -2232,9 +2232,9 @@ checksum = "0206175f82b8d6bf6652ff7d71a1e27fd2e4efde587fd368662814d6ec1d9ce0"
[[package]]
name = "event-listener"
version = "4.0.0"
version = "5.4.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "770d968249b5d99410d61f5bf89057f3199a077a04d087092f58e7d10692baae"
checksum = "3492acde4c3fc54c845eaab3eed8bd00c7a7d881f78bfc801e43a93dec1331ae"
dependencies = [
"concurrent-queue",
"parking",
@@ -2243,11 +2243,11 @@ dependencies = [
[[package]]
name = "event-listener-strategy"
version = "0.4.0"
version = "0.5.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "958e4d70b6d5e81971bebec42271ec641e7ff4e170a6fa605f2b8a8b65cb97d3"
checksum = "8be9f3dfaaffdae2972880079a491a1a8bb7cbed0b8dd7a347f668b4150a3b93"
dependencies = [
"event-listener 4.0.0",
"event-listener 5.4.0",
"pin-project-lite",
]
@@ -2516,6 +2516,20 @@ version = "0.4.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "304de19db7028420975a296ab0fcbbc8e69438c4ed254a1e41e2a7f37d5f0e0a"
[[package]]
name = "generator"
version = "0.8.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d18470a76cb7f8ff746cf1f7470914f900252ec36bbc40b569d74b1258446827"
dependencies = [
"cc",
"cfg-if",
"libc",
"log",
"rustversion",
"windows 0.61.3",
]
[[package]]
name = "generic-array"
version = "0.14.7"
@@ -2834,7 +2848,7 @@ checksum = "f9c7c7c8ac16c798734b8a24560c1362120597c40d5e1459f09498f8f6c8f2ba"
dependencies = [
"cfg-if",
"libc",
"windows",
"windows 0.52.0",
]
[[package]]
@@ -3105,7 +3119,7 @@ dependencies = [
"iana-time-zone-haiku",
"js-sys",
"wasm-bindgen",
"windows-core",
"windows-core 0.52.0",
]
[[package]]
@@ -3656,6 +3670,19 @@ version = "0.4.26"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "30bde2b3dc3671ae49d8e2e9f044c7c005836e7a023ee57cffa25ab82764bb9e"
[[package]]
name = "loom"
version = "0.7.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "419e0dc8046cb947daa77eb95ae174acfbddb7673b4151f56d1eed8e93fbfaca"
dependencies = [
"cfg-if",
"generator",
"scoped-tls",
"tracing",
"tracing-subscriber",
]
[[package]]
name = "lru"
version = "0.12.3"
@@ -3872,6 +3899,25 @@ dependencies = [
"windows-sys 0.52.0",
]
[[package]]
name = "moka"
version = "0.12.10"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a9321642ca94a4282428e6ea4af8cc2ca4eac48ac7a6a4ea8f33f76d0ce70926"
dependencies = [
"crossbeam-channel",
"crossbeam-epoch",
"crossbeam-utils",
"loom",
"parking_lot 0.12.1",
"portable-atomic",
"rustc_version",
"smallvec",
"tagptr",
"thiserror 1.0.69",
"uuid",
]
[[package]]
name = "multimap"
version = "0.8.3"
@@ -5407,6 +5453,7 @@ dependencies = [
"lasso",
"measured",
"metrics",
"moka",
"once_cell",
"opentelemetry",
"ouroboros",
@@ -6420,6 +6467,12 @@ dependencies = [
"pin-project-lite",
]
[[package]]
name = "scoped-tls"
version = "1.0.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e1cf6437eb19a8f4a6cc0f7dca544973b0b78843adbfeb3683d1a94a0024a294"
[[package]]
name = "scopeguard"
version = "1.1.0"
@@ -7269,6 +7322,12 @@ dependencies = [
"winapi",
]
[[package]]
name = "tagptr"
version = "0.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7b2093cf4c8eb1e67749a6762251bc9cd836b6fc171623bd0a9d324d37af2417"
[[package]]
name = "tar"
version = "0.4.40"
@@ -8638,10 +8697,32 @@ version = "0.52.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e48a53791691ab099e5e2ad123536d0fff50652600abaf43bbf952894110d0be"
dependencies = [
"windows-core",
"windows-core 0.52.0",
"windows-targets 0.52.6",
]
[[package]]
name = "windows"
version = "0.61.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9babd3a767a4c1aef6900409f85f5d53ce2544ccdfaa86dad48c91782c6d6893"
dependencies = [
"windows-collections",
"windows-core 0.61.2",
"windows-future",
"windows-link",
"windows-numerics",
]
[[package]]
name = "windows-collections"
version = "0.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3beeceb5e5cfd9eb1d76b381630e82c4241ccd0d27f1a39ed41b2760b255c5e8"
dependencies = [
"windows-core 0.61.2",
]
[[package]]
name = "windows-core"
version = "0.52.0"
@@ -8651,6 +8732,86 @@ dependencies = [
"windows-targets 0.52.6",
]
[[package]]
name = "windows-core"
version = "0.61.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c0fdd3ddb90610c7638aa2b3a3ab2904fb9e5cdbecc643ddb3647212781c4ae3"
dependencies = [
"windows-implement",
"windows-interface",
"windows-link",
"windows-result",
"windows-strings",
]
[[package]]
name = "windows-future"
version = "0.2.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "fc6a41e98427b19fe4b73c550f060b59fa592d7d686537eebf9385621bfbad8e"
dependencies = [
"windows-core 0.61.2",
"windows-link",
"windows-threading",
]
[[package]]
name = "windows-implement"
version = "0.60.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a47fddd13af08290e67f4acabf4b459f647552718f683a7b415d290ac744a836"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.100",
]
[[package]]
name = "windows-interface"
version = "0.59.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "bd9211b69f8dcdfa817bfd14bf1c97c9188afa36f4750130fcdf3f400eca9fa8"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.100",
]
[[package]]
name = "windows-link"
version = "0.1.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5e6ad25900d524eaabdbbb96d20b4311e1e7ae1699af4fb28c17ae66c80d798a"
[[package]]
name = "windows-numerics"
version = "0.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9150af68066c4c5c07ddc0ce30421554771e528bde427614c61038bc2c92c2b1"
dependencies = [
"windows-core 0.61.2",
"windows-link",
]
[[package]]
name = "windows-result"
version = "0.3.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "56f42bd332cc6c8eac5af113fc0c1fd6a8fd2aa08a0119358686e5160d0586c6"
dependencies = [
"windows-link",
]
[[package]]
name = "windows-strings"
version = "0.4.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "56e6c93f3a0c3b36176cb1327a4958a0353d5d166c2a35cb268ace15e91d3b57"
dependencies = [
"windows-link",
]
[[package]]
name = "windows-sys"
version = "0.48.0"
@@ -8709,6 +8870,15 @@ dependencies = [
"windows_x86_64_msvc 0.52.6",
]
[[package]]
name = "windows-threading"
version = "0.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b66463ad2e0ea3bbf808b7f1d371311c80e115c0b71d60efc142cafbcfb057a6"
dependencies = [
"windows-link",
]
[[package]]
name = "windows_aarch64_gnullvm"
version = "0.48.0"
@@ -8845,6 +9015,8 @@ dependencies = [
"clap",
"clap_builder",
"const-oid",
"crossbeam-epoch",
"crossbeam-utils",
"crypto-bigint 0.5.5",
"der 0.7.8",
"deranged",
@@ -8890,6 +9062,7 @@ dependencies = [
"once_cell",
"p256 0.13.2",
"parquet",
"portable-atomic",
"prettyplease",
"proc-macro2",
"prost 0.13.5",


@@ -46,10 +46,10 @@ members = [
"libs/proxy/json",
"libs/proxy/postgres-protocol2",
"libs/proxy/postgres-types2",
"libs/proxy/subzero_core",
"libs/proxy/tokio-postgres2",
"endpoint_storage",
"pgxn/neon/communicator",
"proxy/subzero_core",
]
[workspace.package]
@@ -136,6 +136,7 @@ md5 = "0.7.0"
measured = { version = "0.0.22", features=["lasso"] }
measured-process = { version = "0.0.22" }
memoffset = "0.9"
moka = { version = "0.12", features = ["sync"] }
nix = { version = "0.30.1", features = ["dir", "fs", "mman", "process", "socket", "signal", "poll"] }
# Do not update to >= 7.0.0, at least. The update will have a significant impact
# on compute startup metrics (start_postgres_ms), >= 25% degradation.


@@ -39,13 +39,13 @@ COPY build-tools/patches/pgcopydbv017.patch /pgcopydbv017.patch
RUN if [ "${DEBIAN_VERSION}" = "bookworm" ]; then \
set -e && \
apt update && \
apt install -y --no-install-recommends \
apt-get update && \
apt-get install -y --no-install-recommends \
ca-certificates wget gpg && \
wget -qO - https://www.postgresql.org/media/keys/ACCC4CF8.asc | gpg --dearmor -o /usr/share/keyrings/postgresql-keyring.gpg && \
echo "deb [signed-by=/usr/share/keyrings/postgresql-keyring.gpg] http://apt.postgresql.org/pub/repos/apt bookworm-pgdg main" > /etc/apt/sources.list.d/pgdg.list && \
apt-get update && \
apt install -y --no-install-recommends \
apt-get install -y --no-install-recommends \
build-essential \
autotools-dev \
libedit-dev \
@@ -89,8 +89,7 @@ RUN useradd -ms /bin/bash nonroot -b /home
# Use strict mode for bash to catch errors early
SHELL ["/bin/bash", "-euo", "pipefail", "-c"]
RUN mkdir -p /pgcopydb/bin && \
mkdir -p /pgcopydb/lib && \
RUN mkdir -p /pgcopydb/{bin,lib} && \
chmod -R 755 /pgcopydb && \
chown -R nonroot:nonroot /pgcopydb
@@ -106,8 +105,8 @@ RUN echo 'Acquire::Retries "5";' > /etc/apt/apt.conf.d/80-retries && \
# 'gdb' is included so that we get backtraces of core dumps produced in
# regression tests
RUN set -e \
&& apt update \
&& apt install -y \
&& apt-get update \
&& apt-get install -y --no-install-recommends \
autoconf \
automake \
bison \
@@ -183,22 +182,22 @@ RUN curl -sL "https://github.com/peak/s5cmd/releases/download/v${S5CMD_VERSION}/
ENV LLVM_VERSION=20
RUN curl -fsSL 'https://apt.llvm.org/llvm-snapshot.gpg.key' | apt-key add - \
&& echo "deb http://apt.llvm.org/${DEBIAN_VERSION}/ llvm-toolchain-${DEBIAN_VERSION}-${LLVM_VERSION} main" > /etc/apt/sources.list.d/llvm.stable.list \
&& apt update \
&& apt install -y clang-${LLVM_VERSION} llvm-${LLVM_VERSION} \
&& apt-get update \
&& apt-get install -y --no-install-recommends clang-${LLVM_VERSION} llvm-${LLVM_VERSION} \
&& bash -c 'for f in /usr/bin/clang*-${LLVM_VERSION} /usr/bin/llvm*-${LLVM_VERSION}; do ln -s "${f}" "${f%-${LLVM_VERSION}}"; done' \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
# Install node
ENV NODE_VERSION=24
RUN curl -fsSL https://deb.nodesource.com/setup_${NODE_VERSION}.x | bash - \
&& apt install -y nodejs \
&& apt-get install -y --no-install-recommends nodejs \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
# Install docker
RUN curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg \
&& echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/debian ${DEBIAN_VERSION} stable" > /etc/apt/sources.list.d/docker.list \
&& apt update \
&& apt install -y docker-ce docker-ce-cli \
&& apt-get update \
&& apt-get install -y --no-install-recommends docker-ce docker-ce-cli \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
# Configure sudo & docker
@@ -215,12 +214,11 @@ RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-$(uname -m).zip" -o "aws
# Mold: A Modern Linker
ENV MOLD_VERSION=v2.37.1
RUN set -e \
&& git clone https://github.com/rui314/mold.git \
&& git clone -b "${MOLD_VERSION}" --depth 1 https://github.com/rui314/mold.git \
&& mkdir mold/build \
&& cd mold/build \
&& git checkout ${MOLD_VERSION} \
&& cd mold/build \
&& cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=clang++ .. \
&& cmake --build . -j $(nproc) \
&& cmake --build . -j "$(nproc)" \
&& cmake --install . \
&& cd .. \
&& rm -rf mold
@@ -254,7 +252,7 @@ ENV ICU_VERSION=67.1
ENV ICU_PREFIX=/usr/local/icu
# Download and build static ICU
RUN wget -O /tmp/libicu-${ICU_VERSION}.tgz https://github.com/unicode-org/icu/releases/download/release-${ICU_VERSION//./-}/icu4c-${ICU_VERSION//./_}-src.tgz && \
RUN wget -O "/tmp/libicu-${ICU_VERSION}.tgz" https://github.com/unicode-org/icu/releases/download/release-${ICU_VERSION//./-}/icu4c-${ICU_VERSION//./_}-src.tgz && \
echo "94a80cd6f251a53bd2a997f6f1b5ac6653fe791dfab66e1eb0227740fb86d5dc /tmp/libicu-${ICU_VERSION}.tgz" | sha256sum --check && \
mkdir /tmp/icu && \
pushd /tmp/icu && \
@@ -265,8 +263,7 @@ RUN wget -O /tmp/libicu-${ICU_VERSION}.tgz https://github.com/unicode-org/icu/re
make install && \
popd && \
rm -rf icu && \
rm -f /tmp/libicu-${ICU_VERSION}.tgz && \
popd
rm -f /tmp/libicu-${ICU_VERSION}.tgz
# Switch to nonroot user
USER nonroot:nonroot
@@ -279,19 +276,19 @@ ENV PYTHON_VERSION=3.11.12 \
PYENV_ROOT=/home/nonroot/.pyenv \
PATH=/home/nonroot/.pyenv/shims:/home/nonroot/.pyenv/bin:/home/nonroot/.poetry/bin:$PATH
RUN set -e \
&& cd $HOME \
&& cd "$HOME" \
&& curl -sSO https://raw.githubusercontent.com/pyenv/pyenv-installer/master/bin/pyenv-installer \
&& chmod +x pyenv-installer \
&& ./pyenv-installer \
&& export PYENV_ROOT=/home/nonroot/.pyenv \
&& export PATH="$PYENV_ROOT/bin:$PATH" \
&& export PATH="$PYENV_ROOT/shims:$PATH" \
&& pyenv install ${PYTHON_VERSION} \
&& pyenv global ${PYTHON_VERSION} \
&& pyenv install "${PYTHON_VERSION}" \
&& pyenv global "${PYTHON_VERSION}" \
&& python --version \
&& pip install --upgrade pip \
&& pip install --no-cache-dir --upgrade pip \
&& pip --version \
&& pip install pipenv wheel poetry
&& pip install --no-cache-dir pipenv wheel poetry
# Switch to nonroot user (again)
USER nonroot:nonroot
@@ -317,13 +314,13 @@ RUN curl -sSO https://static.rust-lang.org/rustup/dist/$(uname -m)-unknown-linux
. "$HOME/.cargo/env" && \
cargo --version && rustup --version && \
rustup component add llvm-tools rustfmt clippy && \
cargo install rustfilt --locked --version ${RUSTFILT_VERSION} && \
cargo install cargo-hakari --locked --version ${CARGO_HAKARI_VERSION} && \
cargo install cargo-deny --locked --version ${CARGO_DENY_VERSION} && \
cargo install cargo-hack --locked --version ${CARGO_HACK_VERSION} && \
cargo install cargo-nextest --locked --version ${CARGO_NEXTEST_VERSION} && \
cargo install cargo-chef --locked --version ${CARGO_CHEF_VERSION} && \
cargo install diesel_cli --locked --version ${CARGO_DIESEL_CLI_VERSION} \
cargo install rustfilt --locked --version "${RUSTFILT_VERSION}" && \
cargo install cargo-hakari --locked --version "${CARGO_HAKARI_VERSION}" && \
cargo install cargo-deny --locked --version "${CARGO_DENY_VERSION}" && \
cargo install cargo-hack --locked --version "${CARGO_HACK_VERSION}" && \
cargo install cargo-nextest --locked --version "${CARGO_NEXTEST_VERSION}" && \
cargo install cargo-chef --locked --version "${CARGO_CHEF_VERSION}" && \
cargo install diesel_cli --locked --version "${CARGO_DIESEL_CLI_VERSION}" \
--features postgres-bundled --no-default-features && \
rm -rf /home/nonroot/.cargo/registry && \
rm -rf /home/nonroot/.cargo/git
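
Note on the ICU download line in this Dockerfile: it relies on bash pattern substitution. `${ICU_VERSION//./-}` and `${ICU_VERSION//./_}` replace every dot in the version, so `67.1` becomes `67-1` in the release tag and `67_1` in the tarball name. A small Python sketch of the same URL construction follows; `icu_source_url` is a hypothetical name used only for illustration.

```python
def icu_source_url(version: str) -> str:
    """Build the ICU source tarball URL the way the Dockerfile's wget line does."""
    tag = version.replace(".", "-")   # bash: ${ICU_VERSION//./-}
    stem = version.replace(".", "_")  # bash: ${ICU_VERSION//./_}
    return (
        "https://github.com/unicode-org/icu/releases/download/"
        f"release-{tag}/icu4c-{stem}-src.tgz"
    )
```

Because the Dockerfile pins the tarball's sha256, bumping `ICU_VERSION` changes this URL and requires updating the checksum on the following line as well.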


@@ -6,7 +6,7 @@
"": {
"name": "build-tools",
"devDependencies": {
"@redocly/cli": "1.34.4",
"@redocly/cli": "1.34.5",
"@sourcemeta/jsonschema": "10.0.0"
}
},
@@ -472,9 +472,9 @@
}
},
"node_modules/@redocly/cli": {
"version": "1.34.4",
"resolved": "https://registry.npmjs.org/@redocly/cli/-/cli-1.34.4.tgz",
"integrity": "sha512-seH/GgrjSB1EeOsgJ/4Ct6Jk2N7sh12POn/7G8UQFARMyUMJpe1oHtBwT2ndfp4EFCpgBAbZ/82Iw6dwczNxEA==",
"version": "1.34.5",
"resolved": "https://registry.npmjs.org/@redocly/cli/-/cli-1.34.5.tgz",
"integrity": "sha512-5IEwxs7SGP5KEXjBKLU8Ffdz9by/KqNSeBk6YUVQaGxMXK//uYlTJIPntgUXbo1KAGG2d2q2XF8y4iFz6qNeiw==",
"dev": true,
"license": "MIT",
"dependencies": {
@@ -484,14 +484,14 @@
"@opentelemetry/sdk-trace-node": "1.26.0",
"@opentelemetry/semantic-conventions": "1.27.0",
"@redocly/config": "^0.22.0",
"@redocly/openapi-core": "1.34.4",
"@redocly/respect-core": "1.34.4",
"@redocly/openapi-core": "1.34.5",
"@redocly/respect-core": "1.34.5",
"abort-controller": "^3.0.0",
"chokidar": "^3.5.1",
"colorette": "^1.2.0",
"core-js": "^3.32.1",
"dotenv": "16.4.7",
"form-data": "^4.0.0",
"form-data": "^4.0.4",
"get-port-please": "^3.0.1",
"glob": "^7.1.6",
"handlebars": "^4.7.6",
@@ -522,9 +522,9 @@
"license": "MIT"
},
"node_modules/@redocly/openapi-core": {
"version": "1.34.4",
"resolved": "https://registry.npmjs.org/@redocly/openapi-core/-/openapi-core-1.34.4.tgz",
"integrity": "sha512-hf53xEgpXIgWl3b275PgZU3OTpYh1RoD2LHdIfQ1JzBNTWsiNKczTEsI/4Tmh2N1oq9YcphhSMyk3lDh85oDjg==",
"version": "1.34.5",
"resolved": "https://registry.npmjs.org/@redocly/openapi-core/-/openapi-core-1.34.5.tgz",
"integrity": "sha512-0EbE8LRbkogtcCXU7liAyC00n9uNG9hJ+eMyHFdUsy9lB/WGqnEBgwjA9q2cyzAVcdTkQqTBBU1XePNnN3OijA==",
"dev": true,
"license": "MIT",
"dependencies": {
@@ -544,21 +544,21 @@
}
},
"node_modules/@redocly/respect-core": {
"version": "1.34.4",
"resolved": "https://registry.npmjs.org/@redocly/respect-core/-/respect-core-1.34.4.tgz",
"integrity": "sha512-MitKyKyQpsizA4qCVv+MjXL4WltfhFQAoiKiAzrVR1Kusro3VhYb6yJuzoXjiJhR0ukLP5QOP19Vcs7qmj9dZg==",
"version": "1.34.5",
"resolved": "https://registry.npmjs.org/@redocly/respect-core/-/respect-core-1.34.5.tgz",
"integrity": "sha512-GheC/g/QFztPe9UA9LamooSplQuy9pe0Yr8XGTqkz0ahivLDl7svoy/LSQNn1QH3XGtLKwFYMfTwFR2TAYyh5Q==",
"dev": true,
"license": "MIT",
"dependencies": {
"@faker-js/faker": "^7.6.0",
"@redocly/ajv": "8.11.2",
"@redocly/openapi-core": "1.34.4",
"@redocly/openapi-core": "1.34.5",
"better-ajv-errors": "^1.2.0",
"colorette": "^2.0.20",
"concat-stream": "^2.0.0",
"cookie": "^0.7.2",
"dotenv": "16.4.7",
"form-data": "4.0.0",
"form-data": "^4.0.4",
"jest-diff": "^29.3.1",
"jest-matcher-utils": "^29.3.1",
"js-yaml": "4.1.0",
@@ -582,21 +582,6 @@
"dev": true,
"license": "MIT"
},
"node_modules/@redocly/respect-core/node_modules/form-data": {
"version": "4.0.0",
"resolved": "https://registry.npmjs.org/form-data/-/form-data-4.0.0.tgz",
"integrity": "sha512-ETEklSGi5t0QMZuiXoA/Q6vcnxcLQP5vdugSpuAyi6SVGi2clPPp+xgEhuMaHC+zGgn31Kd235W35f7Hykkaww==",
"dev": true,
"license": "MIT",
"dependencies": {
"asynckit": "^0.4.0",
"combined-stream": "^1.0.8",
"mime-types": "^2.1.12"
},
"engines": {
"node": ">= 6"
}
},
"node_modules/@sinclair/typebox": {
"version": "0.27.8",
"resolved": "https://registry.npmjs.org/@sinclair/typebox/-/typebox-0.27.8.tgz",
@@ -1345,9 +1330,9 @@
"license": "MIT"
},
"node_modules/form-data": {
"version": "4.0.3",
"resolved": "https://registry.npmjs.org/form-data/-/form-data-4.0.3.tgz",
"integrity": "sha512-qsITQPfmvMOSAdeyZ+12I1c+CKSstAFAwu+97zrnWAbIr5u8wfsExUzCesVLC8NgHuRUqNN4Zy6UPWUTRGslcA==",
"version": "4.0.4",
"resolved": "https://registry.npmjs.org/form-data/-/form-data-4.0.4.tgz",
"integrity": "sha512-KrGhL9Q4zjj0kiUt5OO4Mr/A/jlI2jDYs5eHBpYHPcBEVSiipAvn2Ko2HnPe20rmcuuvMHNdZFp+4IlGTMF0Ow==",
"dev": true,
"license": "MIT",
"dependencies": {


@@ -2,7 +2,7 @@
"name": "build-tools",
"private": true,
"devDependencies": {
"@redocly/cli": "1.34.4",
"@redocly/cli": "1.34.5",
"@sourcemeta/jsonschema": "10.0.0"
}
}


@@ -26,7 +26,13 @@ commands:
- name: postgres-exporter
user: nobody
sysvInitAction: respawn
shell: 'DATA_SOURCE_NAME="user=cloud_admin sslmode=disable dbname=postgres application_name=postgres-exporter pgaudit.log=none" /bin/postgres_exporter --config.file=/etc/postgres_exporter.yml'
# Turn off database collector (`--no-collector.database`), we don't use `pg_database_size_bytes` metric anyway, see
# https://github.com/neondatabase/flux-fleet/blob/5e19b3fd897667b70d9a7ad4aa06df0ca22b49ff/apps/base/compute-metrics/scrape-compute-pg-exporter-neon.yaml#L29
# but it's enabled by default and it doesn't filter out invalid databases, see
# https://github.com/prometheus-community/postgres_exporter/blob/06a553c8166512c9d9c5ccf257b0f9bba8751dbc/collector/pg_database.go#L67
# so if it hits one, it starts spamming logs
# ERROR: [NEON_SMGR] [reqid d9700000018] could not read db size of db 705302 from page server at lsn 5/A2457EB0
shell: 'DATA_SOURCE_NAME="user=cloud_admin sslmode=disable dbname=postgres application_name=postgres-exporter pgaudit.log=none" /bin/postgres_exporter --no-collector.database --config.file=/etc/postgres_exporter.yml'
- name: pgbouncer-exporter
user: postgres
sysvInitAction: respawn

View File

@@ -26,7 +26,13 @@ commands:
- name: postgres-exporter
user: nobody
sysvInitAction: respawn
shell: 'DATA_SOURCE_NAME="user=cloud_admin sslmode=disable dbname=postgres application_name=postgres-exporter pgaudit.log=none" /bin/postgres_exporter --config.file=/etc/postgres_exporter.yml'
# Turn off database collector (`--no-collector.database`), we don't use `pg_database_size_bytes` metric anyway, see
# https://github.com/neondatabase/flux-fleet/blob/5e19b3fd897667b70d9a7ad4aa06df0ca22b49ff/apps/base/compute-metrics/scrape-compute-pg-exporter-neon.yaml#L29
# but it's enabled by default and it doesn't filter out invalid databases, see
# https://github.com/prometheus-community/postgres_exporter/blob/06a553c8166512c9d9c5ccf257b0f9bba8751dbc/collector/pg_database.go#L67
# so if it hits one, it starts spamming logs
# ERROR: [NEON_SMGR] [reqid d9700000018] could not read db size of db 705302 from page server at lsn 5/A2457EB0
shell: 'DATA_SOURCE_NAME="user=cloud_admin sslmode=disable dbname=postgres application_name=postgres-exporter pgaudit.log=none" /bin/postgres_exporter --no-collector.database --config.file=/etc/postgres_exporter.yml'
- name: pgbouncer-exporter
user: postgres
sysvInitAction: respawn

View File

@@ -52,8 +52,14 @@ stateDiagram-v2
Init --> Running : Started Postgres
Running --> TerminationPendingFast : Requested termination
Running --> TerminationPendingImmediate : Requested termination
Running --> ConfigurationPending : Received a /configure request with spec
Running --> RefreshConfigurationPending : Received a /refresh_configuration request, compute node will pull a new spec and reconfigure
RefreshConfigurationPending --> RefreshConfiguration: Received compute spec and started configuration
RefreshConfiguration --> Running : Compute has been re-configured
RefreshConfiguration --> RefreshConfigurationPending : Configuration failed and will be retried
TerminationPendingFast --> Terminated : Terminated compute with 30s delay for cplane to inspect status
TerminationPendingImmediate --> Terminated : Terminated compute immediately
Failed --> RefreshConfigurationPending : Received a /refresh_configuration request
Failed --> [*] : Compute exited
Terminated --> [*] : Compute exited
```
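The refresh-related edges of the diagram can be read as a small transition table. The following is a minimal sketch, not the `compute_ctl` implementation: `Status` is a trimmed-down stand-in for `ComputeStatus`, and `refresh_transition_allowed` is a hypothetical helper introduced only for illustration.

```rust
/// Trimmed-down stand-in for `ComputeStatus` (assumption: only the states
/// needed to illustrate the refresh edges are listed here).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Status {
    Init,
    Running,
    Failed,
    RefreshConfigurationPending,
    RefreshConfiguration,
    Terminated,
}

/// Hypothetical helper: returns true for the refresh-related edges drawn in
/// the state diagram, and false for everything else.
fn refresh_transition_allowed(from: Status, to: Status) -> bool {
    use Status::*;
    matches!(
        (from, to),
        (Running, RefreshConfigurationPending)
            | (Failed, RefreshConfigurationPending)
            | (RefreshConfigurationPending, RefreshConfiguration)
            | (RefreshConfiguration, Running)
            | (RefreshConfiguration, RefreshConfigurationPending)
    )
}

fn main() {
    // A running compute may be asked to refresh its configuration.
    assert!(refresh_transition_allowed(
        Status::Running,
        Status::RefreshConfigurationPending
    ));
    // A failed refresh goes back to the pending state to be retried.
    assert!(refresh_transition_allowed(
        Status::RefreshConfiguration,
        Status::RefreshConfigurationPending
    ));
    // The diagram has no refresh edge out of Init or Terminated.
    assert!(!refresh_transition_allowed(
        Status::Init,
        Status::RefreshConfigurationPending
    ));
    assert!(!refresh_transition_allowed(
        Status::Terminated,
        Status::RefreshConfigurationPending
    ));
    println!("refresh edges behave as drawn");
}
```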

View File

@@ -49,10 +49,10 @@ use compute_tools::compute::{
BUILD_TAG, ComputeNode, ComputeNodeParams, forward_termination_signal,
};
use compute_tools::extension_server::get_pg_version_string;
use compute_tools::logger::*;
use compute_tools::params::*;
use compute_tools::pg_isready::get_pg_isready_bin;
use compute_tools::spec::*;
use compute_tools::{hadron_metrics, installed_extensions, logger::*};
use rlimit::{Resource, setrlimit};
use signal_hook::consts::{SIGINT, SIGQUIT, SIGTERM};
use signal_hook::iterator::Signals;
@@ -205,6 +205,9 @@ fn main() -> Result<()> {
// enable core dumping for all child processes
setrlimit(Resource::CORE, rlimit::INFINITY, rlimit::INFINITY)?;
installed_extensions::initialize_metrics();
hadron_metrics::initialize_metrics();
let connstr = Url::parse(&cli.connstr).context("cannot parse connstr as a URL")?;
let config = get_config(&cli)?;
@@ -235,6 +238,9 @@ fn main() -> Result<()> {
pg_isready_bin: get_pg_isready_bin(&cli.pgbin),
instance_id: std::env::var("INSTANCE_ID").ok(),
lakebase_mode: cli.lakebase_mode,
build_tag: BUILD_TAG.to_string(),
control_plane_uri: cli.control_plane_uri,
config_path_test_only: cli.config,
},
config,
)?;

View File

@@ -21,6 +21,7 @@ use postgres::NoTls;
use postgres::error::SqlState;
use remote_storage::{DownloadError, RemotePath};
use std::collections::{HashMap, HashSet};
use std::ffi::OsString;
use std::os::unix::fs::{PermissionsExt, symlink};
use std::path::Path;
use std::process::{Command, Stdio};
@@ -40,8 +41,9 @@ use utils::shard::{ShardCount, ShardIndex, ShardNumber};
use crate::configurator::launch_configurator;
use crate::disk_quota::set_disk_quota;
use crate::hadron_metrics::COMPUTE_ATTACHED;
use crate::installed_extensions::get_installed_extensions;
use crate::logger::startup_context_from_env;
use crate::logger::{self, startup_context_from_env};
use crate::lsn_lease::launch_lsn_lease_bg_task_for_static;
use crate::metrics::COMPUTE_CTL_UP;
use crate::monitor::launch_monitor;
@@ -120,6 +122,10 @@ pub struct ComputeNodeParams {
// Path to the `pg_isready` binary.
pub pg_isready_bin: String,
pub lakebase_mode: bool,
pub build_tag: String,
pub control_plane_uri: Option<String>,
pub config_path_test_only: Option<OsString>,
}
type TaskHandle = Mutex<Option<JoinHandle<()>>>;
@@ -407,6 +413,52 @@ struct StartVmMonitorResult {
vm_monitor: Option<JoinHandle<Result<()>>>,
}
/// Databricks-specific environment variables to be passed to the `postgres` sub-process.
pub struct DatabricksEnvVars {
/// The Databricks "endpoint ID" of the compute instance. Used by `postgres` to check
/// the token scopes of internal auth tokens.
pub endpoint_id: String,
/// Hostname of the Databricks workspace URL this compute instance belongs to.
/// Used by postgres to verify Databricks PAT tokens.
pub workspace_host: String,
}
impl DatabricksEnvVars {
pub fn new(compute_spec: &ComputeSpec, compute_id: Option<&String>) -> Self {
// compute_id is a string format of "{endpoint_id}/{compute_idx}"
// endpoint_id is a uuid. We only need to pass down endpoint_id to postgres.
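// Panics if compute_id is not set. Note that a compute_id without a '/' does not panic:
// `split('/').next()` always yields a first element, so the whole string becomes the endpoint_id.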
let endpoint_id = compute_id.unwrap().split('/').next().unwrap().to_string();
let workspace_host = compute_spec
.databricks_settings
.as_ref()
.map(|s| s.databricks_workspace_host.clone())
.unwrap_or_default();
Self {
endpoint_id,
workspace_host,
}
}
/// Constants for the names of Databricks-specific postgres environment variables.
const DATABRICKS_ENDPOINT_ID_ENVVAR: &'static str = "DATABRICKS_ENDPOINT_ID";
const DATABRICKS_WORKSPACE_HOST_ENVVAR: &'static str = "DATABRICKS_WORKSPACE_HOST";
/// Convert DatabricksEnvVars to a list of string pairs that can be passed as env vars. Consumes `self`.
pub fn to_env_var_list(self) -> Vec<(String, String)> {
vec![
(
Self::DATABRICKS_ENDPOINT_ID_ENVVAR.to_string(),
self.endpoint_id,
),
(
Self::DATABRICKS_WORKSPACE_HOST_ENVVAR.to_string(),
self.workspace_host,
),
]
}
}
impl ComputeNode {
pub fn new(params: ComputeNodeParams, config: ComputeConfig) -> Result<Self> {
let connstr = params.connstr.as_str();
@@ -1405,15 +1457,20 @@ impl ComputeNode {
let pgdata_path = Path::new(&self.params.pgdata);
let tls_config = self.tls_config(&pspec.spec);
let databricks_settings = spec.databricks_settings.as_ref();
let postgres_port = self.params.connstr.port();
// Remove/create an empty pgdata directory and put configuration there.
self.create_pgdata()?;
config::write_postgres_conf(
pgdata_path,
&self.params,
&pspec.spec,
postgres_port,
self.params.internal_http_port,
tls_config,
databricks_settings,
self.params.lakebase_mode,
)?;
// Syncing safekeepers is only safe with primary nodes: if a primary
@@ -1453,8 +1510,20 @@ impl ComputeNode {
)
})?;
// Update pg_hba.conf received with basebackup.
update_pg_hba(pgdata_path, None)?;
if let Some(settings) = databricks_settings {
copy_tls_certificates(
&settings.pg_compute_tls_settings.key_file,
&settings.pg_compute_tls_settings.cert_file,
pgdata_path,
)?;
// Update pg_hba.conf received with basebackup including additional databricks settings.
update_pg_hba(pgdata_path, Some(&settings.databricks_pg_hba))?;
update_pg_ident(pgdata_path, Some(&settings.databricks_pg_ident))?;
} else {
// Update pg_hba.conf received with basebackup.
update_pg_hba(pgdata_path, None)?;
}
// Place pg_dynshmem under /dev/shm. This allows us to use
// 'dynamic_shared_memory_type = mmap' so that the files are placed in
@@ -1567,14 +1636,31 @@ impl ComputeNode {
pub fn start_postgres(&self, storage_auth_token: Option<String>) -> Result<PostgresHandle> {
let pgdata_path = Path::new(&self.params.pgdata);
let env_vars: Vec<(String, String)> = if self.params.lakebase_mode {
let databricks_env_vars = {
let state = self.state.lock().unwrap();
let spec = &state.pspec.as_ref().unwrap().spec;
DatabricksEnvVars::new(spec, Some(&self.params.compute_id))
};
info!(
"Starting Postgres for databricks endpoint id: {}",
&databricks_env_vars.endpoint_id
);
let mut env_vars = databricks_env_vars.to_env_var_list();
env_vars.extend(storage_auth_token.map(|t| ("NEON_AUTH_TOKEN".to_string(), t)));
env_vars
} else if let Some(storage_auth_token) = &storage_auth_token {
vec![("NEON_AUTH_TOKEN".to_owned(), storage_auth_token.to_owned())]
} else {
vec![]
};
// Run postgres as a child process.
let mut pg = maybe_cgexec(&self.params.pgbin)
.args(["-D", &self.params.pgdata])
.envs(if let Some(storage_auth_token) = &storage_auth_token {
vec![("NEON_AUTH_TOKEN", storage_auth_token)]
} else {
vec![]
})
.envs(env_vars)
.stderr(Stdio::piped())
.spawn()
.expect("cannot start postgres process");
@@ -1796,12 +1882,12 @@ impl ComputeNode {
let states_allowing_configuration_refresh = [
ComputeStatus::Running,
ComputeStatus::Failed,
// ComputeStatus::RefreshConfigurationPending,
ComputeStatus::RefreshConfigurationPending,
];
let state = self.state.lock().expect("state lock poisoned");
let mut state = self.state.lock().expect("state lock poisoned");
if states_allowing_configuration_refresh.contains(&state.status) {
// state.status = ComputeStatus::RefreshConfigurationPending;
state.status = ComputeStatus::RefreshConfigurationPending;
self.state_changed.notify_all();
Ok(())
} else if state.status == ComputeStatus::Init {
@@ -1877,12 +1963,16 @@ impl ComputeNode {
// Write new config
let pgdata_path = Path::new(&self.params.pgdata);
let postgres_port = self.params.connstr.port();
config::write_postgres_conf(
pgdata_path,
&self.params,
&spec,
postgres_port,
self.params.internal_http_port,
tls_config,
spec.databricks_settings.as_ref(),
self.params.lakebase_mode,
)?;
self.pg_reload_conf()?;
@@ -1988,6 +2078,8 @@ impl ComputeNode {
// wait
ComputeStatus::Init
| ComputeStatus::Configuration
| ComputeStatus::RefreshConfiguration
| ComputeStatus::RefreshConfigurationPending
| ComputeStatus::Empty => {
state = self.state_changed.wait(state).unwrap();
}
@@ -2544,6 +2636,34 @@ LIMIT 100",
);
}
}
/// Set the compute spec and update related metrics.
/// This is the central place where pspec is updated.
pub fn set_spec(params: &ComputeNodeParams, state: &mut ComputeState, pspec: ParsedSpec) {
state.pspec = Some(pspec);
ComputeNode::update_attached_metric(params, state);
let _ = logger::update_ids(&params.instance_id, &Some(params.compute_id.clone()));
}
pub fn update_attached_metric(params: &ComputeNodeParams, state: &mut ComputeState) {
// Update the pg_cctl_attached gauge when all identifiers are available.
if let Some(instance_id) = &params.instance_id {
if let Some(pspec) = &state.pspec {
// Clear all values in the metric
COMPUTE_ATTACHED.reset();
// Set new metric value
COMPUTE_ATTACHED
.with_label_values(&[
&params.compute_id,
instance_id,
&pspec.tenant_id.to_string(),
&pspec.timeline_id.to_string(),
])
.set(1);
}
}
}
}
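For reference, the `"{endpoint_id}/{compute_idx}"` handling in `DatabricksEnvVars::new` can be made strict about the format. This is a minimal sketch under stated assumptions: `endpoint_id_from_compute_id` and `env_var_list` are hypothetical helpers, not part of this change, and only the two environment variable names are taken from the diff.

```rust
/// Hypothetical strict parser: returns the endpoint id only when the input
/// has the shape "{endpoint_id}/{compute_idx}" with both parts non-empty.
fn endpoint_id_from_compute_id(compute_id: &str) -> Option<&str> {
    let (endpoint_id, compute_idx) = compute_id.split_once('/')?;
    if endpoint_id.is_empty() || compute_idx.is_empty() {
        return None;
    }
    Some(endpoint_id)
}

/// Builds the (name, value) pairs in the same order as `to_env_var_list`.
/// The variable names are the ones defined in the diff.
fn env_var_list(endpoint_id: &str, workspace_host: &str) -> Vec<(String, String)> {
    vec![
        ("DATABRICKS_ENDPOINT_ID".to_string(), endpoint_id.to_string()),
        (
            "DATABRICKS_WORKSPACE_HOST".to_string(),
            workspace_host.to_string(),
        ),
    ]
}

fn main() {
    // The compute id and host below are made-up example values.
    let compute_id = "0f3c9a2e-example/0";
    match endpoint_id_from_compute_id(compute_id) {
        Some(endpoint_id) => {
            for (name, value) in env_var_list(endpoint_id, "workspace.example.com") {
                println!("{name}={value}");
            }
        }
        None => eprintln!("compute_id {compute_id:?} is not in the expected format"),
    }
}
```

Returning `Option` leaves the decision to panic, log, or fall back to the caller, at the cost of one extra branch at the call site.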
pub async fn installed_extensions(conf: tokio_postgres::Config) -> Result<()> {

View File

@@ -7,11 +7,14 @@ use std::io::prelude::*;
use std::path::Path;
use compute_api::responses::TlsConfig;
use compute_api::spec::{ComputeAudit, ComputeMode, ComputeSpec, GenericOption};
use compute_api::spec::{
ComputeAudit, ComputeMode, ComputeSpec, DatabricksSettings, GenericOption,
};
use crate::compute::ComputeNodeParams;
use crate::pg_helpers::{
GenericOptionExt, GenericOptionsSearch, PgOptionsSerialize, escape_conf_value,
DatabricksSettingsExt as _, GenericOptionExt, GenericOptionsSearch, PgOptionsSerialize,
escape_conf_value,
};
use crate::tls::{self, SERVER_CRT, SERVER_KEY};
@@ -40,12 +43,16 @@ pub fn line_in_file(path: &Path, line: &str) -> Result<bool> {
}
/// Create or completely rewrite configuration file specified by `path`
#[allow(clippy::too_many_arguments)]
pub fn write_postgres_conf(
pgdata_path: &Path,
params: &ComputeNodeParams,
spec: &ComputeSpec,
postgres_port: Option<u16>,
extension_server_port: u16,
tls_config: &Option<TlsConfig>,
databricks_settings: Option<&DatabricksSettings>,
lakebase_mode: bool,
) -> Result<()> {
let path = pgdata_path.join("postgresql.conf");
// File::create() destroys the file content if it exists.
@@ -285,6 +292,24 @@ pub fn write_postgres_conf(
writeln!(file, "log_destination='stderr,syslog'")?;
}
if lakebase_mode {
// Explicitly set the port based on the connstr, overriding any previous port setting.
// Note: It is important that we don't specify a different port again after this.
let port = postgres_port.expect("port must be present in connstr");
writeln!(file, "port = {port}")?;
// These are Databricks-specific settings.
// This should be at the end of the file but before `compute_ctl_temp_override.conf` below
// so that it can override any settings above.
// `compute_ctl_temp_override.conf` is intended to override any settings above during specific operations.
// To prevent potential breakage in the future, we keep it above `compute_ctl_temp_override.conf`.
writeln!(file, "# Databricks settings start")?;
if let Some(settings) = databricks_settings {
writeln!(file, "{}", settings.as_pg_settings())?;
}
writeln!(file, "# Databricks settings end")?;
}
// This is essential to keep this line at the end of the file,
// because it is intended to override any settings above.
writeln!(file, "include_if_exists = 'compute_ctl_temp_override.conf'")?;
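The ordering comments in this hunk rely on the rule that, when a setting appears more than once in `postgresql.conf`, the last occurrence takes effect. The sketch below illustrates that ordering only. `render_conf` and `effective_settings` are hypothetical helpers, and the toy resolver is an assumption for illustration, not Postgres's configuration parser.

```rust
use std::collections::HashMap;
use std::fmt::Write;

/// Hypothetical renderer that mimics the tail of `write_postgres_conf` in
/// lakebase mode: an earlier port, the connstr port, the Databricks block,
/// and finally the temp-override include.
fn render_conf(earlier_port: u16, connstr_port: u16, databricks_settings: Option<&str>) -> String {
    let mut out = String::new();
    writeln!(out, "port = {earlier_port}").unwrap();
    // Written later, so it overrides the earlier port.
    writeln!(out, "port = {connstr_port}").unwrap();
    writeln!(out, "# Databricks settings start").unwrap();
    if let Some(settings) = databricks_settings {
        writeln!(out, "{settings}").unwrap();
    }
    writeln!(out, "# Databricks settings end").unwrap();
    // Must stay last so the override file can win over everything above.
    writeln!(out, "include_if_exists = 'compute_ctl_temp_override.conf'").unwrap();
    out
}

/// Toy resolver (assumption): the last `key = value` line for a key wins.
fn effective_settings(conf: &str) -> HashMap<String, String> {
    let mut map = HashMap::new();
    for line in conf.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        if let Some((key, value)) = line.split_once('=') {
            map.insert(key.trim().to_string(), value.trim().to_string());
        }
    }
    map
}

fn main() {
    let conf = render_conf(5432, 55432, Some("ssl = on"));
    let settings = effective_settings(&conf);
    println!("effective port: {}", settings["port"]);
    println!("{conf}");
}
```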

View File

@@ -1,23 +1,40 @@
use std::sync::Arc;
use std::fs::File;
use std::thread;
use std::{path::Path, sync::Arc};
use compute_api::responses::ComputeStatus;
use anyhow::Result;
use compute_api::responses::{ComputeConfig, ComputeStatus};
use tracing::{error, info, instrument};
use crate::compute::ComputeNode;
use crate::compute::{ComputeNode, ParsedSpec};
use crate::spec::get_config_from_control_plane;
#[instrument(skip_all)]
fn configurator_main_loop(compute: &Arc<ComputeNode>) {
info!("waiting for reconfiguration requests");
loop {
let mut state = compute.state.lock().unwrap();
/* BEGIN_HADRON */
// RefreshConfiguration should only be used inside the loop
assert_ne!(state.status, ComputeStatus::RefreshConfiguration);
/* END_HADRON */
// We have to re-check the status after re-acquiring the lock because it could be that
// the status has changed while we were waiting for the lock, and we might not need to
// wait on the condition variable. Otherwise, we might end up in some soft-/deadlock, i.e.
// we are waiting for a condition variable that will never be signaled.
if state.status != ComputeStatus::ConfigurationPending {
state = compute.state_changed.wait(state).unwrap();
if compute.params.lakebase_mode {
while state.status != ComputeStatus::ConfigurationPending
&& state.status != ComputeStatus::RefreshConfigurationPending
&& state.status != ComputeStatus::Failed
{
info!("configurator: compute status: {:?}, sleeping", state.status);
state = compute.state_changed.wait(state).unwrap();
}
} else {
// We have to re-check the status after re-acquiring the lock because it could be that
// the status has changed while we were waiting for the lock, and we might not need to
// wait on the condition variable. Otherwise, we might end up in some soft-/deadlock, i.e.
// we are waiting for a condition variable that will never be signaled.
if state.status != ComputeStatus::ConfigurationPending {
state = compute.state_changed.wait(state).unwrap();
}
}
// Re-check the status after waking up
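The lakebase branch above waits in a `while` loop rather than a single `if`, which is the usual way to use a condition variable: the predicate is re-checked after every wakeup, so spurious wakeups and unrelated status changes are tolerated. A minimal, standard-library-only sketch of that pattern follows; `Shared`, `Status`, and `wait_for_work` are hypothetical names, not types from `compute_tools`.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Trimmed-down stand-in for `ComputeStatus` (assumption).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Status {
    Running,
    ConfigurationPending,
    RefreshConfigurationPending,
    Failed,
}

/// Stand-in for the `state` / `state_changed` pair on `ComputeNode`.
struct Shared {
    state: Mutex<Status>,
    state_changed: Condvar,
}

/// Blocks until the status is one the configurator has to act on.
/// The predicate is re-evaluated after every wakeup.
fn wait_for_work(shared: &Shared) -> Status {
    let mut state = shared.state.lock().unwrap();
    while !matches!(
        *state,
        Status::ConfigurationPending | Status::RefreshConfigurationPending | Status::Failed
    ) {
        state = shared.state_changed.wait(state).unwrap();
    }
    *state
}

fn main() {
    let shared = Arc::new(Shared {
        state: Mutex::new(Status::Running),
        state_changed: Condvar::new(),
    });

    // Simulates the HTTP handler flipping the status and notifying waiters.
    let notifier = Arc::clone(&shared);
    let handle = thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        *notifier.state.lock().unwrap() = Status::RefreshConfigurationPending;
        notifier.state_changed.notify_all();
    });

    let status = wait_for_work(&shared);
    handle.join().unwrap();
    println!("configurator woke up with status {status:?}");
}
```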
@@ -37,6 +54,133 @@ fn configurator_main_loop(compute: &Arc<ComputeNode>) {
// XXX: used to test that API is blocking
// std::thread::sleep(std::time::Duration::from_millis(10000));
compute.set_status(new_status);
} else if state.status == ComputeStatus::RefreshConfigurationPending {
info!(
"compute node suspects its configuration is out of date, now refreshing configuration"
);
state.set_status(ComputeStatus::RefreshConfiguration, &compute.state_changed);
// Drop the lock guard here to avoid holding the lock while downloading config from the control plane / HCC.
// This is the only thread that can move compute_ctl out of the `RefreshConfiguration` state, so it
// is safe to drop the lock like this.
drop(state);
let get_config_result: anyhow::Result<ComputeConfig> =
if let Some(config_path) = &compute.params.config_path_test_only {
// This path is only to make testing easier. In production we always get the config from the HCC.
info!(
"reloading config.json from path: {}",
config_path.to_string_lossy()
);
let path = Path::new(config_path);
if let Ok(file) = File::open(path) {
match serde_json::from_reader::<File, ComputeConfig>(file) {
Ok(config) => Ok(config),
Err(e) => {
error!("could not parse config file: {}", e);
Err(anyhow::anyhow!("could not parse config file: {}", e))
}
}
} else {
error!(
"could not open config file at path: {:?}",
config_path.to_string_lossy()
);
Err(anyhow::anyhow!(
"could not open config file at path: {}",
config_path.to_string_lossy()
))
}
} else if let Some(control_plane_uri) = &compute.params.control_plane_uri {
get_config_from_control_plane(control_plane_uri, &compute.params.compute_id)
} else {
Err(anyhow::anyhow!(
"neither config_path_test_only nor control_plane_uri is set"
))
};
// Parse any received ComputeSpec and transpose the result into a Result<Option<ParsedSpec>>.
let parsed_spec_result: Result<Option<ParsedSpec>> =
get_config_result.and_then(|config| {
if let Some(spec) = config.spec {
if let Ok(pspec) = ParsedSpec::try_from(spec) {
Ok(Some(pspec))
} else {
Err(anyhow::anyhow!("could not parse spec"))
}
} else {
Ok(None)
}
});
let new_status: ComputeStatus;
match parsed_spec_result {
// Control plane (HCM) returned a spec and we were able to parse it.
Ok(Some(pspec)) => {
{
let mut state = compute.state.lock().unwrap();
// Defensive programming to make sure this thread is indeed the only one that can move the compute
// node out of the `RefreshConfiguration` state. Would be nice if we can encode this invariant
// into the type system.
assert_eq!(state.status, ComputeStatus::RefreshConfiguration);
if state.pspec.as_ref().map(|ps| ps.pageserver_connstr.clone())
== Some(pspec.pageserver_connstr.clone())
{
info!(
"Refresh configuration: Retrieved spec is the same as the current spec. Waiting for control plane to update the spec before attempting reconfiguration."
);
state.status = ComputeStatus::Running;
compute.state_changed.notify_all();
drop(state);
std::thread::sleep(std::time::Duration::from_secs(5));
continue;
}
// state.pspec is consumed by compute.reconfigure() below. Note that compute.reconfigure() will acquire
// the compute.state lock again so we need to have the lock guard go out of scope here. We could add a
// "locked" variant of compute.reconfigure() that takes the lock guard as an argument to make this cleaner,
// but it's not worth forking the codebase too much for this minor point alone right now.
state.pspec = Some(pspec);
}
match compute.reconfigure() {
Ok(_) => {
info!("Refresh configuration: compute node configured");
new_status = ComputeStatus::Running;
}
Err(e) => {
error!(
"Refresh configuration: could not configure compute node: {}",
e
);
// Set the compute node back to the `RefreshConfigurationPending` state if the configuration
// was not successful. It should be okay to treat this situation the same as if the loop
// hasn't executed yet as long as the detection side keeps notifying.
new_status = ComputeStatus::RefreshConfigurationPending;
}
}
}
// Control plane (HCM)'s response does not contain a spec. This is the "Empty" attachment case.
Ok(None) => {
info!(
"Compute Manager signaled that this compute is no longer attached to any storage. Exiting."
);
// We just immediately terminate the whole compute_ctl in this case. It's not necessary to attempt a
// clean shutdown as Postgres is probably not responding anyway (which is why we are in this refresh
// configuration state).
std::process::exit(1);
}
// Various error cases:
// - The request to the control plane (HCM) either failed or returned a malformed spec.
// - compute_ctl itself is configured incorrectly (e.g., compute_id is not set).
Err(e) => {
error!(
"Refresh configuration: error getting a parsed spec: {:?}",
e
);
new_status = ComputeStatus::RefreshConfigurationPending;
// We may be dealing with an overloaded HCM if we end up in this path. Back off for 5 seconds
// before retrying to avoid hammering the HCM.
std::thread::sleep(std::time::Duration::from_secs(5));
}
}
compute.set_status(new_status);
} else if state.status == ComputeStatus::Failed {
info!("compute node is now in Failed state, exiting");
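The `RefreshConfigurationPending` branch above boils down to a decision over a `Result<Option<ParsedSpec>>` plus a comparison with the current pageserver connection string. The sketch below restates that decision as a pure function. `Spec`, `NextStep`, and `next_step` are hypothetical names used only for illustration, and the error type is simplified to `String`.

```rust
/// Stand-in for `ParsedSpec`; only the field compared in the diff is kept.
struct Spec {
    pageserver_connstr: String,
}

/// What the configurator does next, as described in the branch above.
#[derive(Debug, PartialEq, Eq)]
enum NextStep {
    /// New spec differs from the current one: store it and reconfigure.
    Reconfigure,
    /// Spec is unchanged: return to Running and wait for the control plane.
    BackToRunning,
    /// No spec in the response: the compute is no longer attached, exit.
    Exit,
    /// Fetching or parsing failed: stay pending and retry after a backoff.
    RetryAfterBackoff,
}

fn next_step(current: Option<&Spec>, fetched: Result<Option<Spec>, String>) -> NextStep {
    match fetched {
        Ok(Some(new_spec)) => {
            let current_connstr = current.map(|spec| spec.pageserver_connstr.as_str());
            if current_connstr == Some(new_spec.pageserver_connstr.as_str()) {
                NextStep::BackToRunning
            } else {
                NextStep::Reconfigure
            }
        }
        Ok(None) => NextStep::Exit,
        Err(_) => NextStep::RetryAfterBackoff,
    }
}

fn main() {
    // Connection strings below are made-up example values.
    let current = Spec {
        pageserver_connstr: "host=ps-1 port=6400".to_string(),
    };
    let fetched = Ok(Some(Spec {
        pageserver_connstr: "host=ps-2 port=6400".to_string(),
    }));
    println!("{:?}", next_step(Some(&current), fetched));
}
```

Keeping the decision separate from the side effects (sleeping, exiting, taking the lock) makes each of the four outcomes easy to unit test.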

View File

@@ -43,7 +43,12 @@ pub(in crate::http) async fn configure(
// configure request for tracing purposes.
state.startup_span = Some(tracing::Span::current());
state.pspec = Some(pspec);
if compute.params.lakebase_mode {
ComputeNode::set_spec(&compute.params, &mut state, pspec);
} else {
state.pspec = Some(pspec);
}
state.set_status(ComputeStatus::ConfigurationPending, &compute.state_changed);
drop(state);
}

View File

@@ -13,6 +13,7 @@ use metrics::{Encoder, TextEncoder};
use crate::communicator_socket_client::connect_communicator_socket;
use crate::compute::ComputeNode;
use crate::hadron_metrics;
use crate::http::JsonResponse;
use crate::metrics::collect;
@@ -21,11 +22,18 @@ pub(in crate::http) async fn get_metrics() -> Response {
// When we call TextEncoder::encode() below, it will immediately return an
// error if a metric family has no metrics, so we need to preemptively
// filter out metric families with no metrics.
let metrics = collect()
let mut metrics = collect()
.into_iter()
.filter(|m| !m.get_metric().is_empty())
.collect::<Vec<MetricFamily>>();
// Add Hadron metrics.
let hadron_metrics: Vec<MetricFamily> = hadron_metrics::collect()
.into_iter()
.filter(|m| !m.get_metric().is_empty())
.collect();
metrics.extend(hadron_metrics);
let encoder = TextEncoder::new();
let mut buffer = vec![];

View File

@@ -7,28 +7,23 @@ use axum::{
response::{IntoResponse, Response},
};
use http::StatusCode;
use tracing::debug;
use crate::compute::ComputeNode;
// use crate::hadron_metrics::POSTGRES_PAGESTREAM_REQUEST_ERRORS;
use crate::hadron_metrics::POSTGRES_PAGESTREAM_REQUEST_ERRORS;
use crate::http::JsonResponse;
// The /refresh_configuration POST method is used to nudge compute_ctl to pull a new spec
// from the HCC and attempt to reconfigure Postgres with the new spec. The method does not wait
// for the reconfiguration to complete. Rather, it simply delivers a signal that will cause
// configuration to be reloaded in a best effort manner. Invocation of this method does not
// guarantee that a reconfiguration will occur. The caller should consider keep sending this
// request while it believes that the compute configuration is out of date.
/// The /refresh_configuration POST method is used to nudge compute_ctl to pull a new spec
/// from the HCC and attempt to reconfigure Postgres with the new spec. The method does not wait
/// for the reconfiguration to complete. Rather, it simply delivers a signal that will cause
/// configuration to be reloaded in a best effort manner. Invocation of this method does not
/// guarantee that a reconfiguration will occur. The caller should keep sending this
/// request for as long as it believes that the compute configuration is out of date.
pub(in crate::http) async fn refresh_configuration(
State(compute): State<Arc<ComputeNode>>,
) -> Response {
debug!("serving /refresh_configuration POST request");
// POSTGRES_PAGESTREAM_REQUEST_ERRORS.inc();
POSTGRES_PAGESTREAM_REQUEST_ERRORS.inc();
match compute.signal_refresh_configuration().await {
Ok(_) => StatusCode::OK.into_response(),
Err(e) => {
tracing::error!("error handling /refresh_configuration request: {}", e);
JsonResponse::error(StatusCode::INTERNAL_SERVER_ERROR, e)
}
Err(e) => JsonResponse::error(StatusCode::INTERNAL_SERVER_ERROR, e),
}
}

View File

@@ -1,7 +1,7 @@
use crate::compute::{ComputeNode, forward_termination_signal};
use crate::http::JsonResponse;
use axum::extract::State;
use axum::response::Response;
use axum::response::{IntoResponse, Response};
use axum_extra::extract::OptionalQuery;
use compute_api::responses::{ComputeStatus, TerminateMode, TerminateResponse};
use http::StatusCode;
@@ -33,7 +33,29 @@ pub(in crate::http) async fn terminate(
if !matches!(state.status, ComputeStatus::Empty | ComputeStatus::Running) {
return JsonResponse::invalid_status(state.status);
}
// If compute is Empty, there's no Postgres to terminate. The regular compute_ctl termination path
// assumes Postgres to be configured and running, so we just special-handle this case by exiting
// the process directly.
if compute.params.lakebase_mode && state.status == ComputeStatus::Empty {
drop(state);
info!("terminating empty compute - will exit process");
// Queue a task to exit the process after 5 seconds. The 5-second delay aims to
// give enough time for the HTTP response to be sent so that HCM doesn't get an abrupt
// connection termination.
tokio::spawn(async {
tokio::time::sleep(tokio::time::Duration::from_secs(5)).await;
info!("exiting process after terminating empty compute");
std::process::exit(0);
});
return StatusCode::OK.into_response();
}
// For Running status, proceed with normal termination
state.set_status(mode.into(), &compute.state_changed);
drop(state);
}
forward_termination_signal(false);

View File

@@ -23,11 +23,11 @@ use super::{
middleware::authorize::Authorize,
routes::{
check_writability, configure, database_schema, dbs_and_roles, extension_server, extensions,
grants, insights, lfc, metrics, metrics_json, promote, status, terminate,
grants, hadron_liveness_probe, insights, lfc, metrics, metrics_json, promote,
refresh_configuration, status, terminate,
},
};
use crate::compute::ComputeNode;
use crate::http::routes::{hadron_liveness_probe, refresh_configuration};
/// `compute_ctl` has two servers: internal and external. The internal server
/// binds to the loopback interface and handles communication from clients on

View File

@@ -142,7 +142,7 @@ pub fn update_pg_hba(pgdata_path: &Path, databricks_pg_hba: Option<&String>) ->
// Update pg_hba to contain Databricks-specific settings before adding neon settings
// PG uses the first record that matches to perform authentication, so we need to have
// our rules before the default ones from neon.
// See https://www.postgresql.org/docs/16/auth-pg-hba-conf.html
// See https://www.postgresql.org/docs/current/auth-pg-hba-conf.html
if let Some(databricks_pg_hba) = databricks_pg_hba {
if config::line_in_file(
&pghba_path,

View File

@@ -560,7 +560,9 @@ enum EndpointCmd {
Create(EndpointCreateCmdArgs),
Start(EndpointStartCmdArgs),
Reconfigure(EndpointReconfigureCmdArgs),
RefreshConfiguration(EndpointRefreshConfigurationArgs),
Stop(EndpointStopCmdArgs),
UpdatePageservers(EndpointUpdatePageserversCmdArgs),
GenerateJwt(EndpointGenerateJwtCmdArgs),
}
@@ -721,6 +723,13 @@ struct EndpointReconfigureCmdArgs {
safekeepers: Option<String>,
}
#[derive(clap::Args)]
#[clap(about = "Refresh the endpoint's configuration by forcing it to reload its spec")]
struct EndpointRefreshConfigurationArgs {
#[clap(help = "Postgres endpoint id")]
endpoint_id: String,
}
#[derive(clap::Args)]
#[clap(about = "Stop an endpoint")]
struct EndpointStopCmdArgs {
@@ -738,6 +747,16 @@ struct EndpointStopCmdArgs {
mode: EndpointTerminateMode,
}
#[derive(clap::Args)]
#[clap(about = "Update the pageservers in the spec file of the compute endpoint")]
struct EndpointUpdatePageserversCmdArgs {
#[clap(help = "Postgres endpoint id")]
endpoint_id: String,
#[clap(short = 'p', long, help = "Specified pageserver id")]
pageserver_id: Option<NodeId>,
}
#[derive(clap::Args)]
#[clap(about = "Generate a JWT for an endpoint")]
struct EndpointGenerateJwtCmdArgs {
@@ -1625,6 +1644,44 @@ async fn handle_endpoint(subcmd: &EndpointCmd, env: &local_env::LocalEnv) -> Res
println!("Starting existing endpoint {endpoint_id}...");
endpoint.start(args).await?;
}
EndpointCmd::UpdatePageservers(args) => {
let endpoint_id = &args.endpoint_id;
let endpoint = cplane
.endpoints
.get(endpoint_id.as_str())
.with_context(|| format!("postgres endpoint {endpoint_id} is not found"))?;
let pageservers = match args.pageserver_id {
Some(pageserver_id) => {
let pageserver =
PageServerNode::from_env(env, env.get_pageserver_conf(pageserver_id)?);
vec![(
PageserverProtocol::Libpq,
pageserver.pg_connection_config.host().clone(),
pageserver.pg_connection_config.port(),
)]
}
None => {
let storage_controller = StorageController::from_env(env);
storage_controller
.tenant_locate(endpoint.tenant_id)
.await?
.shards
.into_iter()
.map(|shard| {
(
PageserverProtocol::Libpq,
Host::parse(&shard.listen_pg_addr)
.expect("Storage controller reported malformed host"),
shard.listen_pg_port,
)
})
.collect::<Vec<_>>()
}
};
endpoint.update_pageservers_in_config(pageservers).await?;
}
EndpointCmd::Reconfigure(args) => {
let endpoint_id = &args.endpoint_id;
let endpoint = cplane
@@ -1678,6 +1735,14 @@ async fn handle_endpoint(subcmd: &EndpointCmd, env: &local_env::LocalEnv) -> Res
.reconfigure(Some(pageservers), None, safekeepers, None)
.await?;
}
EndpointCmd::RefreshConfiguration(args) => {
let endpoint_id = &args.endpoint_id;
let endpoint = cplane
.endpoints
.get(endpoint_id.as_str())
.with_context(|| format!("postgres endpoint {endpoint_id} is not found"))?;
endpoint.refresh_configuration().await?;
}
EndpointCmd::Stop(args) => {
let endpoint_id = &args.endpoint_id;
let endpoint = cplane

View File

@@ -793,6 +793,7 @@ impl Endpoint {
autoprewarm: args.autoprewarm,
offload_lfc_interval_seconds: args.offload_lfc_interval_seconds,
suspend_timeout_seconds: -1, // Only used in neon_local.
databricks_settings: None,
};
// this strange code is needed to support respec() in tests
@@ -937,7 +938,9 @@ impl Endpoint {
| ComputeStatus::Configuration
| ComputeStatus::TerminationPendingFast
| ComputeStatus::TerminationPendingImmediate
| ComputeStatus::Terminated => {
| ComputeStatus::Terminated
| ComputeStatus::RefreshConfigurationPending
| ComputeStatus::RefreshConfiguration => {
bail!("unexpected compute status: {:?}", state.status)
}
}
@@ -960,6 +963,29 @@ impl Endpoint {
Ok(())
}
// Update the pageservers in the spec file of the endpoint. This is useful to test the spec refresh scenario.
pub async fn update_pageservers_in_config(
&self,
pageservers: Vec<(PageserverProtocol, Host, u16)>,
) -> Result<()> {
let config_path = self.endpoint_path().join("config.json");
let mut config: ComputeConfig = {
let file = std::fs::File::open(&config_path)?;
serde_json::from_reader(file)?
};
let pageserver_connstring = Self::build_pageserver_connstr(&pageservers);
assert!(!pageserver_connstring.is_empty());
let mut spec = config.spec.unwrap();
spec.pageserver_connstring = Some(pageserver_connstring);
config.spec = Some(spec);
let file = std::fs::File::create(&config_path)?;
serde_json::to_writer_pretty(file, &config)?;
Ok(())
}
// Call the /status HTTP API
pub async fn get_status(&self) -> Result<ComputeStatusResponse> {
let client = reqwest::Client::new();
@@ -1125,6 +1151,33 @@ impl Endpoint {
Ok(response)
}
pub async fn refresh_configuration(&self) -> Result<()> {
let client = reqwest::Client::builder()
.timeout(Duration::from_secs(30))
.build()
.unwrap();
let response = client
.post(format!(
"http://{}:{}/refresh_configuration",
self.internal_http_address.ip(),
self.internal_http_address.port()
))
.send()
.await?;
let status = response.status();
if !(status.is_client_error() || status.is_server_error()) {
Ok(())
} else {
let url = response.url().to_owned();
let msg = match response.text().await {
Ok(err_body) => format!("Error: {err_body}"),
Err(_) => format!("Http error ({}) at {}.", status.as_u16(), url),
};
Err(anyhow::anyhow!(msg))
}
}
pub fn connstr(&self, user: &str, db_name: &str) -> String {
format!(
"postgresql://{}@{}:{}/{}",

View File

@@ -172,6 +172,11 @@ pub enum ComputeStatus {
TerminationPendingImmediate,
// Terminated Postgres
Terminated,
// A spec refresh is being requested
RefreshConfigurationPending,
// A spec refresh is being applied. We cannot refresh configuration again until the current
// refresh is done, i.e., signal_refresh_configuration() will return 500 error.
RefreshConfiguration,
}
#[derive(Deserialize, Serialize)]
@@ -184,6 +189,10 @@ impl Display for ComputeStatus {
match self {
ComputeStatus::Empty => f.write_str("empty"),
ComputeStatus::ConfigurationPending => f.write_str("configuration-pending"),
ComputeStatus::RefreshConfiguration => f.write_str("refresh-configuration"),
ComputeStatus::RefreshConfigurationPending => {
f.write_str("refresh-configuration-pending")
}
ComputeStatus::Init => f.write_str("init"),
ComputeStatus::Running => f.write_str("running"),
ComputeStatus::Configuration => f.write_str("configuration"),
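Review note: the two new variants follow the same kebab-case `Display` convention as the existing ones. The sketch below is a reduced, standalone illustration of that mapping, not the real enum: it keeps only four variants, and `refresh_in_flight` is a hypothetical helper that does not appear in this change.

```rust
use std::fmt;

/// Reduced, hypothetical copy of the status enum; the real one has more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ComputeStatus {
    Running,
    ConfigurationPending,
    RefreshConfigurationPending,
    RefreshConfiguration,
}

impl fmt::Display for ComputeStatus {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Each variant maps to a fixed kebab-case string.
        f.write_str(match self {
            ComputeStatus::Running => "running",
            ComputeStatus::ConfigurationPending => "configuration-pending",
            ComputeStatus::RefreshConfigurationPending => "refresh-configuration-pending",
            ComputeStatus::RefreshConfiguration => "refresh-configuration",
        })
    }
}

/// Hypothetical helper (an assumption, not in the diff): true while a refresh
/// is either queued or currently being applied.
pub fn refresh_in_flight(status: ComputeStatus) -> bool {
    matches!(
        status,
        ComputeStatus::RefreshConfigurationPending | ComputeStatus::RefreshConfiguration
    )
}
```

A caller that must not start a second refresh could check `refresh_in_flight` first, which matches the comment on `RefreshConfiguration` about rejecting concurrent refreshes.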

View File

@@ -193,6 +193,9 @@ pub struct ComputeSpec {
///
/// We use this value to derive other values, such as the installed extensions metric.
pub suspend_timeout_seconds: i64,
// Databricks specific options for compute instance.
pub databricks_settings: Option<DatabricksSettings>,
}
/// Feature flag to signal `compute_ctl` to enable certain experimental functionality.

View File

@@ -15,6 +15,7 @@ use tokio::sync::mpsc;
use crate::cancel_token::RawCancelToken;
use crate::codec::{BackendMessages, FrontendMessage, RecordNotices};
use crate::config::{Host, SslMode};
use crate::connection::gc_bytesmut;
use crate::query::RowStream;
use crate::simple_query::SimpleQueryStream;
use crate::types::{Oid, Type};
@@ -95,20 +96,13 @@ impl InnerClient {
Ok(PartialQuery(Some(self)))
}
// pub fn send_with_sync<F>(&mut self, f: F) -> Result<&mut Responses, Error>
// where
// F: FnOnce(&mut BytesMut) -> Result<(), Error>,
// {
// self.start()?.send_with_sync(f)
// }
pub fn send_simple_query(&mut self, query: &str) -> Result<&mut Responses, Error> {
self.responses.waiting += 1;
self.buffer.clear();
// simple queries do not need sync.
frontend::query(query, &mut self.buffer).map_err(Error::encode)?;
let buf = self.buffer.split().freeze();
let buf = self.buffer.split();
self.send_message(FrontendMessage::Raw(buf))
}
@@ -125,7 +119,7 @@ impl Drop for PartialQuery<'_> {
if let Some(client) = self.0.take() {
client.buffer.clear();
frontend::sync(&mut client.buffer);
let buf = client.buffer.split().freeze();
let buf = client.buffer.split();
let _ = client.send_message(FrontendMessage::Raw(buf));
}
}
@@ -141,7 +135,7 @@ impl<'a> PartialQuery<'a> {
client.buffer.clear();
f(&mut client.buffer)?;
frontend::flush(&mut client.buffer);
let buf = client.buffer.split().freeze();
let buf = client.buffer.split();
client.send_message(FrontendMessage::Raw(buf))
}
@@ -154,7 +148,7 @@ impl<'a> PartialQuery<'a> {
client.buffer.clear();
f(&mut client.buffer)?;
frontend::sync(&mut client.buffer);
let buf = client.buffer.split().freeze();
let buf = client.buffer.split();
let _ = client.send_message(FrontendMessage::Raw(buf));
Ok(&mut self.0.take().unwrap().responses)
@@ -191,6 +185,7 @@ impl Client {
ssl_mode: SslMode,
process_id: i32,
secret_key: i32,
write_buf: BytesMut,
) -> Client {
Client {
inner: InnerClient {
@@ -201,7 +196,7 @@ impl Client {
waiting: 0,
received: 0,
},
buffer: Default::default(),
buffer: write_buf,
},
cached_typeinfo: Default::default(),
@@ -317,6 +312,9 @@ impl Client {
DISCARD SEQUENCES;",
)?;
// Clean up memory usage.
gc_bytesmut(&mut self.inner_mut().buffer);
Ok(())
}

View File

@@ -1,13 +1,13 @@
use std::io;
use bytes::{Bytes, BytesMut};
use bytes::BytesMut;
use fallible_iterator::FallibleIterator;
use postgres_protocol2::message::backend;
use tokio::sync::mpsc::UnboundedSender;
use tokio_util::codec::{Decoder, Encoder};
pub enum FrontendMessage {
Raw(Bytes),
Raw(BytesMut),
RecordNotices(RecordNotices),
}
@@ -17,7 +17,10 @@ pub struct RecordNotices {
}
pub enum BackendMessage {
Normal { messages: BackendMessages },
Normal {
messages: BackendMessages,
ready: bool,
},
Async(backend::Message),
}
@@ -40,11 +43,11 @@ impl FallibleIterator for BackendMessages {
pub struct PostgresCodec;
impl Encoder<Bytes> for PostgresCodec {
impl Encoder<BytesMut> for PostgresCodec {
type Error = io::Error;
fn encode(&mut self, item: Bytes, dst: &mut BytesMut) -> io::Result<()> {
dst.extend_from_slice(&item);
fn encode(&mut self, item: BytesMut, dst: &mut BytesMut) -> io::Result<()> {
dst.unsplit(item);
Ok(())
}
}
@@ -56,6 +59,7 @@ impl Decoder for PostgresCodec {
fn decode(&mut self, src: &mut BytesMut) -> Result<Option<BackendMessage>, io::Error> {
let mut idx = 0;
let mut ready = false;
while let Some(header) = backend::Header::parse(&src[idx..])? {
let len = header.len() as usize + 1;
if src[idx..].len() < len {
@@ -79,6 +83,7 @@ impl Decoder for PostgresCodec {
idx += len;
if header.tag() == backend::READY_FOR_QUERY_TAG {
ready = true;
break;
}
}
@@ -88,6 +93,7 @@ impl Decoder for PostgresCodec {
} else {
Ok(Some(BackendMessage::Normal {
messages: BackendMessages(src.split_to(idx)),
ready,
}))
}
}
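Review note: the decoder now reports whether the batch it returns ended at a `ReadyForQuery` message. The std-only sketch below shows the scanning loop under simplifying assumptions: it works on a plain byte slice instead of `BytesMut`, ignores asynchronous messages, and does not validate the length field. `scan_messages` and `frame` are hypothetical names.

```rust
/// Tag byte of the Postgres `ReadyForQuery` backend message.
const READY_FOR_QUERY_TAG: u8 = b'Z';

/// Scans `src` for complete backend messages.
/// Returns the number of bytes covered by complete messages and whether the
/// scan stopped because it reached a `ReadyForQuery` message.
pub fn scan_messages(src: &[u8]) -> (usize, bool) {
    let mut idx = 0;
    let mut ready = false;
    // A header is a 1-byte tag followed by a 4-byte big-endian length.
    // The length counts itself but not the tag byte.
    while src.len() - idx >= 5 {
        let tag = src[idx];
        let len_field =
            u32::from_be_bytes([src[idx + 1], src[idx + 2], src[idx + 3], src[idx + 4]]) as usize;
        let len = len_field + 1; // add the tag byte
        if src.len() - idx < len {
            break; // message is incomplete, wait for more data
        }
        idx += len;
        if tag == READY_FOR_QUERY_TAG {
            ready = true;
            break; // end of this response batch
        }
    }
    (idx, ready)
}

/// Hypothetical helper that builds one framed message for experiments.
pub fn frame(tag: u8, body: &[u8]) -> Vec<u8> {
    let mut out = vec![tag];
    out.extend_from_slice(&(body.len() as u32 + 4).to_be_bytes());
    out.extend_from_slice(body);
    out
}
```

Stopping at `ReadyForQuery` gives the connection a natural point at which the read buffer is likely to be idle, which is where this change triggers the buffer garbage collection.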

View File

@@ -250,19 +250,20 @@ impl Config {
{
let stream = connect_tls(stream, self.ssl_mode, tls).await?;
let mut stream = StartupStream::new(stream);
connect_raw::startup(&mut stream, self).await?;
connect_raw::authenticate(&mut stream, self).await?;
Ok(stream)
}
pub async fn authenticate<S, T>(&self, stream: &mut StartupStream<S, T>) -> Result<(), Error>
pub fn authenticate<S, T>(
&self,
stream: &mut StartupStream<S, T>,
) -> impl Future<Output = Result<(), Error>>
where
S: AsyncRead + AsyncWrite + Unpin,
T: TlsStream + Unpin,
{
connect_raw::startup(stream, self).await?;
connect_raw::authenticate(stream, self).await
connect_raw::authenticate(stream, self)
}
}

View File

@@ -7,7 +7,7 @@ use tokio::net::TcpStream;
use tokio::sync::mpsc;
use crate::client::SocketConfig;
use crate::config::Host;
use crate::config::{Host, SslMode};
use crate::connect_raw::StartupStream;
use crate::connect_socket::connect_socket;
use crate::tls::{MakeTlsConnect, TlsConnect};
@@ -45,28 +45,53 @@ where
T: TlsConnect<TcpStream>,
{
let socket = connect_socket(host_addr, host, port, config.connect_timeout).await?;
let mut stream = config.tls_and_authenticate(socket, tls).await?;
let stream = config.tls_and_authenticate(socket, tls).await?;
managed(
stream,
host_addr,
host.clone(),
port,
config.ssl_mode,
config.connect_timeout,
)
.await
}
pub async fn managed<TlsStream>(
mut stream: StartupStream<TcpStream, TlsStream>,
host_addr: Option<IpAddr>,
host: Host,
port: u16,
ssl_mode: SslMode,
connect_timeout: Option<std::time::Duration>,
) -> Result<(Client, Connection<TcpStream, TlsStream>), Error>
where
TlsStream: AsyncRead + AsyncWrite + Unpin,
{
let (process_id, secret_key) = wait_until_ready(&mut stream).await?;
let socket_config = SocketConfig {
host_addr,
host: host.clone(),
host,
port,
connect_timeout: config.connect_timeout,
connect_timeout,
};
let mut stream = stream.into_framed();
let write_buf = std::mem::take(stream.write_buffer_mut());
let (client_tx, conn_rx) = mpsc::unbounded_channel();
let (conn_tx, client_rx) = mpsc::channel(4);
let client = Client::new(
client_tx,
client_rx,
socket_config,
config.ssl_mode,
ssl_mode,
process_id,
secret_key,
write_buf,
);
let stream = stream.into_framed();
let connection = Connection::new(stream, conn_tx, conn_rx);
Ok((client, connection))

View File

@@ -2,51 +2,28 @@ use std::io;
use std::pin::Pin;
use std::task::{Context, Poll, ready};
use bytes::{Bytes, BytesMut};
use bytes::BytesMut;
use fallible_iterator::FallibleIterator;
use futures_util::{Sink, SinkExt, Stream, TryStreamExt};
use futures_util::{SinkExt, Stream, TryStreamExt};
use postgres_protocol2::authentication::sasl;
use postgres_protocol2::authentication::sasl::ScramSha256;
use postgres_protocol2::message::backend::{AuthenticationSaslBody, Message};
use postgres_protocol2::message::frontend;
use tokio::io::{AsyncRead, AsyncWrite, ReadBuf};
use tokio_util::codec::{Framed, FramedParts, FramedWrite};
use tokio_util::codec::{Framed, FramedParts};
use crate::Error;
use crate::codec::PostgresCodec;
use crate::config::{self, AuthKeys, Config};
use crate::connection::{GC_THRESHOLD, INITIAL_CAPACITY};
use crate::maybe_tls_stream::MaybeTlsStream;
use crate::tls::TlsStream;
pub struct StartupStream<S, T> {
inner: FramedWrite<MaybeTlsStream<S, T>, PostgresCodec>,
inner: Framed<MaybeTlsStream<S, T>, PostgresCodec>,
read_buf: BytesMut,
}
impl<S, T> Sink<Bytes> for StartupStream<S, T>
where
S: AsyncRead + AsyncWrite + Unpin,
T: AsyncRead + AsyncWrite + Unpin,
{
type Error = io::Error;
fn poll_ready(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
Pin::new(&mut self.inner).poll_ready(cx)
}
fn start_send(mut self: Pin<&mut Self>, item: Bytes) -> io::Result<()> {
Pin::new(&mut self.inner).start_send(item)
}
fn poll_flush(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
Pin::new(&mut self.inner).poll_flush(cx)
}
fn poll_close(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
Pin::new(&mut self.inner).poll_close(cx)
}
}
impl<S, T> Stream for StartupStream<S, T>
where
S: AsyncRead + AsyncWrite + Unpin,
@@ -55,6 +32,8 @@ where
type Item = io::Result<Message>;
fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
// We don't use `self.inner.poll_next()` as that might over-read into the read buffer.
// read 1 byte tag, 4 bytes length.
let header = ready!(self.as_mut().poll_fill_buf_exact(cx, 5)?);
@@ -121,36 +100,28 @@ where
}
pub fn into_framed(mut self) -> Framed<MaybeTlsStream<S, T>, PostgresCodec> {
let write_buf = std::mem::take(self.inner.write_buffer_mut());
let io = self.inner.into_inner();
let mut parts = FramedParts::new(io, PostgresCodec);
parts.read_buf = self.read_buf;
parts.write_buf = write_buf;
Framed::from_parts(parts)
*self.inner.read_buffer_mut() = self.read_buf;
self.inner
}
pub fn new(io: MaybeTlsStream<S, T>) -> Self {
let mut parts = FramedParts::new(io, PostgresCodec);
parts.write_buf = BytesMut::with_capacity(INITIAL_CAPACITY);
let mut inner = Framed::from_parts(parts);
// This is the default already, but nice to be explicit.
// We divide by two because writes will overshoot the boundary.
// We don't want constant overshoots to cause us to constantly re-shrink the buffer.
inner.set_backpressure_boundary(GC_THRESHOLD / 2);
Self {
inner: FramedWrite::new(io, PostgresCodec),
read_buf: BytesMut::new(),
inner,
read_buf: BytesMut::with_capacity(INITIAL_CAPACITY),
}
}
}
pub(crate) async fn startup<S, T>(
stream: &mut StartupStream<S, T>,
config: &Config,
) -> Result<(), Error>
where
S: AsyncRead + AsyncWrite + Unpin,
T: AsyncRead + AsyncWrite + Unpin,
{
let mut buf = BytesMut::new();
frontend::startup_message(&config.server_params, &mut buf).map_err(Error::encode)?;
stream.send(buf.freeze()).await.map_err(Error::io)
}
pub(crate) async fn authenticate<S, T>(
stream: &mut StartupStream<S, T>,
config: &Config,
@@ -159,6 +130,10 @@ where
S: AsyncRead + AsyncWrite + Unpin,
T: TlsStream + Unpin,
{
frontend::startup_message(&config.server_params, stream.inner.write_buffer_mut())
.map_err(Error::encode)?;
stream.inner.flush().await.map_err(Error::io)?;
match stream.try_next().await.map_err(Error::io)? {
Some(Message::AuthenticationOk) => {
can_skip_channel_binding(config)?;
@@ -172,7 +147,8 @@ where
.as_ref()
.ok_or_else(|| Error::config("password missing".into()))?;
authenticate_password(stream, pass).await?;
frontend::password_message(pass, stream.inner.write_buffer_mut())
.map_err(Error::encode)?;
}
Some(Message::AuthenticationSasl(body)) => {
authenticate_sasl(stream, body, config).await?;
@@ -191,6 +167,7 @@ where
None => return Err(Error::closed()),
}
stream.inner.flush().await.map_err(Error::io)?;
match stream.try_next().await.map_err(Error::io)? {
Some(Message::AuthenticationOk) => Ok(()),
Some(Message::ErrorResponse(body)) => Err(Error::db(body)),
@@ -208,20 +185,6 @@ fn can_skip_channel_binding(config: &Config) -> Result<(), Error> {
}
}
async fn authenticate_password<S, T>(
stream: &mut StartupStream<S, T>,
password: &[u8],
) -> Result<(), Error>
where
S: AsyncRead + AsyncWrite + Unpin,
T: AsyncRead + AsyncWrite + Unpin,
{
let mut buf = BytesMut::new();
frontend::password_message(password, &mut buf).map_err(Error::encode)?;
stream.send(buf.freeze()).await.map_err(Error::io)
}
async fn authenticate_sasl<S, T>(
stream: &mut StartupStream<S, T>,
body: AuthenticationSaslBody,
@@ -276,10 +239,10 @@ where
return Err(Error::config("password or auth keys missing".into()));
};
let mut buf = BytesMut::new();
frontend::sasl_initial_response(mechanism, scram.message(), &mut buf).map_err(Error::encode)?;
stream.send(buf.freeze()).await.map_err(Error::io)?;
frontend::sasl_initial_response(mechanism, scram.message(), stream.inner.write_buffer_mut())
.map_err(Error::encode)?;
stream.inner.flush().await.map_err(Error::io)?;
let body = match stream.try_next().await.map_err(Error::io)? {
Some(Message::AuthenticationSaslContinue(body)) => body,
Some(Message::ErrorResponse(body)) => return Err(Error::db(body)),
@@ -292,10 +255,10 @@ where
.await
.map_err(|e| Error::authentication(e.into()))?;
let mut buf = BytesMut::new();
frontend::sasl_response(scram.message(), &mut buf).map_err(Error::encode)?;
stream.send(buf.freeze()).await.map_err(Error::io)?;
frontend::sasl_response(scram.message(), stream.inner.write_buffer_mut())
.map_err(Error::encode)?;
stream.inner.flush().await.map_err(Error::io)?;
let body = match stream.try_next().await.map_err(Error::io)? {
Some(Message::AuthenticationSaslFinal(body)) => body,
Some(Message::ErrorResponse(body)) => return Err(Error::db(body)),

View File

@@ -44,6 +44,27 @@ pub struct Connection<S, T> {
state: State,
}
pub const INITIAL_CAPACITY: usize = 2 * 1024;
pub const GC_THRESHOLD: usize = 16 * 1024;
/// Garbage collect the [`BytesMut`] if it has too much spare capacity.
pub fn gc_bytesmut(buf: &mut BytesMut) {
// We use a different mode to shrink the buf when above the threshold.
// When above the threshold, we only re-allocate when the buf has 2x spare capacity.
let reclaim = GC_THRESHOLD.checked_sub(buf.len()).unwrap_or(buf.len());
// `try_reclaim` tries to get the capacity from any shared `BytesMut`s,
// before then comparing the length against the capacity.
if buf.try_reclaim(reclaim) {
let capacity = usize::max(buf.len(), INITIAL_CAPACITY);
// Allocate a new `BytesMut` so that we deallocate the old version.
let mut new = BytesMut::with_capacity(capacity);
new.extend_from_slice(buf);
*buf = new;
}
}
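Review note: the shrink rule in `gc_bytesmut` can be read as "reallocate only when the spare capacity is large relative to the contents". The sketch below restates that rule with a plain `Vec<u8>` as an assumption-laden stand-in: `Vec` has no shared-buffer reclaim step like `BytesMut::try_reclaim`, so the spare capacity is compared directly, and `gc_vec` is a hypothetical name.

```rust
pub const INITIAL_CAPACITY: usize = 2 * 1024;
pub const GC_THRESHOLD: usize = 16 * 1024;

/// Shrinks `buf` when it holds too much spare capacity.
/// Returns true if the buffer was reallocated.
pub fn gc_vec(buf: &mut Vec<u8>) -> bool {
    // Below the threshold: shrink once the total capacity reaches GC_THRESHOLD.
    // Above the threshold: shrink only when spare capacity is at least the length,
    // i.e. the capacity is at least twice the contents.
    let spare_needed = GC_THRESHOLD.checked_sub(buf.len()).unwrap_or(buf.len());
    if buf.capacity() - buf.len() >= spare_needed {
        // Allocate a fresh buffer so the oversized allocation is released.
        let mut new: Vec<u8> = Vec::with_capacity(usize::max(buf.len(), INITIAL_CAPACITY));
        new.extend_from_slice(buf.as_slice());
        *buf = new;
        true
    } else {
        false
    }
}
```

For example, a 64 KiB buffer holding 100 bytes is replaced by a small one, while a 4 KiB buffer holding 100 bytes is left alone.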
pub enum Never {}
impl<S, T> Connection<S, T>
@@ -86,7 +107,14 @@ where
continue;
}
BackendMessage::Async(_) => continue,
BackendMessage::Normal { messages } => messages,
BackendMessage::Normal { messages, ready } => {
// if we read a ReadyForQuery from postgres, let's try GC the read buffer.
if ready {
gc_bytesmut(self.stream.read_buffer_mut());
}
messages
}
}
}
};
@@ -177,12 +205,7 @@ where
// Send a terminate message to postgres
Poll::Ready(None) => {
trace!("poll_write: at eof, terminating");
let mut request = BytesMut::new();
frontend::terminate(&mut request);
Pin::new(&mut self.stream)
.start_send(request.freeze())
.map_err(Error::io)?;
frontend::terminate(self.stream.write_buffer_mut());
trace!("poll_write: sent eof, closing");
trace!("poll_write: done");
@@ -205,6 +228,13 @@ where
{
Poll::Ready(()) => {
trace!("poll_flush: flushed");
// Since our codec prefers to share the buffer with the `Client`,
// if we don't release our share, then the `Client` would have to re-alloc
// the buffer when they next use it.
debug_assert!(self.stream.write_buffer().is_empty());
*self.stream.write_buffer_mut() = BytesMut::new();
Poll::Ready(Ok(()))
}
Poll::Pending => {

View File

@@ -48,7 +48,7 @@ mod cancel_token;
mod client;
mod codec;
pub mod config;
mod connect;
pub mod connect;
pub mod connect_raw;
mod connect_socket;
mod connect_tls;

View File

@@ -843,12 +843,11 @@ fn start_pageserver(
},
);
// Spawn a Pageserver gRPC server task. It will spawn separate tasks for
// each stream/request.
// Spawn a Pageserver gRPC server task. It will spawn separate tasks for each request/stream.
// It uses a separate compute request Tokio runtime (COMPUTE_REQUEST_RUNTIME).
//
// TODO: this uses a separate Tokio runtime for the page service. If we want
// other gRPC services, they will need their own port and runtime. Is this
// necessary?
// NB: this port is exposed to computes. It should only provide services that we're okay with
// computes accessing. Internal services should use a separate port.
let mut page_service_grpc = None;
if let Some(grpc_listener) = grpc_listener {
page_service_grpc = Some(GrpcPageServiceHandler::spawn(

View File

@@ -2005,6 +2005,10 @@ async fn put_tenant_location_config_handler(
let state = get_state(&request);
let conf = state.conf;
fail::fail_point!("put-location-conf-handler", |_| {
Err(ApiError::ResourceUnavailable("failpoint".into()))
});
// The `Detached` state is special, it doesn't upsert a tenant, it removes
// its local disk content and drops it from memory.
if let LocationConfigMode::Detached = request_data.config.mode {

View File

@@ -535,6 +535,7 @@ impl timeline::handle::TenantManager<TenantManagerTypes> for TenantManagerWrappe
match resolved {
ShardResolveResult::Found(tenant_shard) => break tenant_shard,
ShardResolveResult::NotFound => {
MISROUTED_PAGESTREAM_REQUESTS.inc();
return Err(GetActiveTimelineError::Tenant(
GetActiveTenantError::NotFound(GetTenantError::NotFound(*tenant_id)),
));
@@ -3428,8 +3429,6 @@ impl GrpcPageServiceHandler {
/// NB: errors returned from here are intercepted in get_pages(), and may be converted to a
/// GetPageResponse with an appropriate status code to avoid terminating the stream.
///
/// TODO: verify that the requested pages belong to this shard.
///
/// TODO: get_vectored() currently enforces a batch limit of 32. Postgres will typically send
/// batches up to effective_io_concurrency = 100. Either we have to accept large batches, or
/// split them up in the client or server.
@@ -3455,6 +3454,19 @@ impl GrpcPageServiceHandler {
lsn = %req.read_lsn,
);
for &blkno in &req.block_numbers {
let shard = timeline.get_shard_identity();
let key = rel_block_to_key(req.rel, blkno);
if !shard.is_key_local(&key) {
return Err(tonic::Status::invalid_argument(format!(
"block {blkno} of relation {} requested on wrong shard {} (is on {})",
req.rel,
timeline.get_shard_index(),
ShardIndex::new(shard.get_shard_number(&key), shard.count),
)));
}
}
let latest_gc_cutoff_lsn = timeline.get_applied_gc_cutoff_lsn(); // hold guard
let effective_lsn = PageServerHandler::effective_request_lsn(
&timeline,

View File

@@ -33,6 +33,10 @@ SHLIB_LINK = -lcurl
UNAME_S := $(shell uname -s)
ifeq ($(UNAME_S), Darwin)
SHLIB_LINK += -framework Security -framework CoreFoundation -framework SystemConfiguration
# Link against object files for the current macOS version, to avoid spurious linker warnings.
MACOSX_DEPLOYMENT_TARGET := $(shell xcrun --sdk macosx --show-sdk-version)
export MACOSX_DEPLOYMENT_TARGET
endif
EXTENSION = neon

View File

@@ -14,7 +14,7 @@
#include "extension_server.h"
#include "neon_utils.h"
static int extension_server_port = 0;
int hadron_extension_server_port = 0;
static int extension_server_request_timeout = 60;
static int extension_server_connect_timeout = 60;
@@ -47,7 +47,7 @@ neon_download_extension_file_http(const char *filename, bool is_library)
curl_easy_setopt(handle, CURLOPT_CONNECTTIMEOUT, (long)extension_server_connect_timeout /* seconds */ );
compute_ctl_url = psprintf("http://localhost:%d/extension_server/%s%s",
extension_server_port, filename, is_library ? "?is_library=true" : "");
hadron_extension_server_port, filename, is_library ? "?is_library=true" : "");
elog(LOG, "Sending request to compute_ctl: %s", compute_ctl_url);
@@ -82,7 +82,7 @@ pg_init_extension_server()
DefineCustomIntVariable("neon.extension_server_port",
"connection string to the compute_ctl",
NULL,
&extension_server_port,
&hadron_extension_server_port,
0, 0, INT_MAX,
PGC_POSTMASTER,
0, /* no flags required */

View File

@@ -13,6 +13,8 @@
#include <math.h>
#include <sys/socket.h>
#include <curl/curl.h>
#include "libpq-int.h"
#include "access/xlog.h"
@@ -86,6 +88,10 @@ static int pageserver_response_log_timeout = 10000;
/* 2.5 minutes. A bit higher than highest default TCP retransmission timeout */
static int pageserver_response_disconnect_timeout = 150000;
static int conf_refresh_reconnect_attempt_threshold = 16;
// Hadron: timeout for refresh errors (1 minute)
static uint64 kRefreshErrorTimeoutUSec = 1 * USECS_PER_MINUTE;
typedef struct
{
char connstring[MAX_SHARDS][MAX_PAGESERVER_CONNSTRING_SIZE];
@@ -130,7 +136,7 @@ static uint64 pagestore_local_counter = 0;
typedef enum PSConnectionState {
PS_Disconnected, /* no connection yet */
PS_Connecting_Startup, /* connection starting up */
PS_Connecting_PageStream, /* negotiating pagestream */
PS_Connecting_PageStream, /* negotiating pagestream */
PS_Connected, /* connected, pagestream established */
} PSConnectionState;
@@ -401,7 +407,7 @@ get_shard_number(BufferTag *tag)
}
static inline void
CLEANUP_AND_DISCONNECT(PageServer *shard)
CLEANUP_AND_DISCONNECT(PageServer *shard)
{
if (shard->wes_read)
{
@@ -423,7 +429,7 @@ CLEANUP_AND_DISCONNECT(PageServer *shard)
* complete the connection (e.g. due to receiving an earlier cancellation
* during connection start).
* Returns true if successfully connected; false if the connection failed.
*
*
* Throws errors in unrecoverable situations, or when this backend's query
* is canceled.
*/
@@ -1030,6 +1036,101 @@ pageserver_disconnect_shard(shardno_t shard_no)
shard->state = PS_Disconnected;
}
// BEGIN HADRON
/*
* Nudge compute_ctl to refresh our configuration. Called when we suspect we may be
* connecting to the wrong pageservers due to a stale configuration.
*
* This is a best-effort operation. If we couldn't send the local loopback HTTP request
* to compute_ctl or if the request fails for any reason, we just log the error and move
* on.
*/
extern int hadron_extension_server_port;
// The timestamp (usec) of the first error that occurred while trying to refresh the configuration.
// Will be reset to 0 after a successful refresh.
static uint64 first_recorded_refresh_error_usec = 0;
// Request compute_ctl to refresh the configuration. This operation may fail, e.g., if compute_ctl
// is already in the configuration state. The function returns true if the caller needs to cancel the
// current query to avoid a deadlock or livelock.
static bool
hadron_request_configuration_refresh() {
static CURL *handle = NULL;
CURLcode res;
char *compute_ctl_url;
bool cancel_query = false;
if (!lakebase_mode)
return false;
if (handle == NULL)
{
handle = alloc_curl_handle();
curl_easy_setopt(handle, CURLOPT_CUSTOMREQUEST, "POST");
curl_easy_setopt(handle, CURLOPT_TIMEOUT, 3L /* seconds */ );
curl_easy_setopt(handle, CURLOPT_POSTFIELDS, "");
}
// Set the URL
compute_ctl_url = psprintf("http://localhost:%d/refresh_configuration", hadron_extension_server_port);
elog(LOG, "Sending refresh configuration request to compute_ctl: %s", compute_ctl_url);
curl_easy_setopt(handle, CURLOPT_URL, compute_ctl_url);
res = curl_easy_perform(handle);
if (res != CURLE_OK )
{
elog(WARNING, "refresh_configuration request failed: %s\n", curl_easy_strerror(res));
}
else
{
long http_code = 0;
res = curl_easy_getinfo(handle, CURLINFO_RESPONSE_CODE, &http_code);
if ( res != CURLE_OK )
{
elog(WARNING, "compute_ctl refresh_configuration request getinfo failed: %s\n", curl_easy_strerror(res));
}
else
{
elog(LOG, "compute_ctl refresh_configuration got HTTP response: %ld\n", http_code);
if( http_code == 200 )
{
first_recorded_refresh_error_usec = 0;
}
else
{
if (first_recorded_refresh_error_usec == 0)
{
first_recorded_refresh_error_usec = GetCurrentTimestamp();
}
else if (GetCurrentTimestamp() - first_recorded_refresh_error_usec > kRefreshErrorTimeoutUSec)
{
first_recorded_refresh_error_usec = 0;
cancel_query = true;
}
}
}
}
// In regular Postgres usage, it is not necessary to manually free memory allocated by palloc (psprintf) because
// it will be cleaned up after the "memory context" is reset (e.g. after the query or the transaction is finished).
// However, the number of times this function gets called during a single query/transaction can be unbounded due to
// the various retry loops around calls to pageservers. Therefore, we need to manually free this memory here.
if (compute_ctl_url != NULL)
{
pfree(compute_ctl_url);
}
return cancel_query;
}
// END HADRON
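Review note: the cancel decision in `hadron_request_configuration_refresh` is a small state machine over the timestamp of the first failed refresh. The Rust sketch below mirrors that logic for illustration only. `RefreshErrorTracker` and `on_response` are hypothetical names, an `Option` replaces the `0` sentinel, and, as in the C code, only HTTP responses (not transport failures) feed the tracker.

```rust
/// One minute in microseconds, mirroring the refresh error timeout.
const REFRESH_ERROR_TIMEOUT_USEC: u64 = 60 * 1_000_000;

/// Tracks when the current run of failed refresh requests started.
#[derive(Default)]
pub struct RefreshErrorTracker {
    first_error_usec: Option<u64>,
}

impl RefreshErrorTracker {
    /// Records one HTTP response and returns true if the caller should
    /// cancel the current query.
    pub fn on_response(&mut self, http_code: u32, now_usec: u64) -> bool {
        if http_code == 200 {
            // A successful refresh clears the error window.
            self.first_error_usec = None;
            return false;
        }
        match self.first_error_usec {
            None => {
                // First failure: start the window, do not cancel yet.
                self.first_error_usec = Some(now_usec);
                false
            }
            Some(first) if now_usec.saturating_sub(first) > REFRESH_ERROR_TIMEOUT_USEC => {
                // Failures have persisted past the timeout: reset and cancel.
                self.first_error_usec = None;
                true
            }
            Some(_) => false,
        }
    }
}
```

Because the window is reset when a cancel is signalled, the next failure starts a fresh one-minute window rather than cancelling every subsequent query immediately.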
static bool
pageserver_send(shardno_t shard_no, NeonRequest *request)
{
@@ -1064,6 +1165,11 @@ pageserver_send(shardno_t shard_no, NeonRequest *request)
while (!pageserver_connect(shard_no, shard->n_reconnect_attempts < max_reconnect_attempts ? LOG : ERROR))
{
shard->n_reconnect_attempts += 1;
if (shard->n_reconnect_attempts > conf_refresh_reconnect_attempt_threshold
&& hadron_request_configuration_refresh() )
{
neon_shard_log(shard_no, ERROR, "request failed too many times, cancelling query");
}
}
shard->n_reconnect_attempts = 0;
} else {
@@ -1171,17 +1277,26 @@ pageserver_receive(shardno_t shard_no)
pfree(msg);
pageserver_disconnect(shard_no);
resp = NULL;
/*
* Always poke compute_ctl to request a configuration refresh if we have issues receiving data from pageservers after
* successfully connecting to them. It could be an indication that we are connecting to the wrong pageservers (e.g. the PS
* is in secondary mode or otherwise refuses to respond to our requests).
*/
hadron_request_configuration_refresh();
}
else if (rc == -2)
{
char *msg = pchomp(PQerrorMessage(pageserver_conn));
pageserver_disconnect(shard_no);
hadron_request_configuration_refresh();
neon_shard_log(shard_no, ERROR, "pageserver_receive disconnect: could not read COPY data: %s", msg);
}
else
{
pageserver_disconnect(shard_no);
hadron_request_configuration_refresh();
neon_shard_log(shard_no, ERROR, "pageserver_receive disconnect: unexpected PQgetCopyData return value: %d", rc);
}
@@ -1249,21 +1364,34 @@ pageserver_try_receive(shardno_t shard_no)
neon_shard_log(shard_no, LOG, "pageserver_receive disconnect: psql end of copy data: %s", pchomp(PQerrorMessage(pageserver_conn)));
pageserver_disconnect(shard_no);
resp = NULL;
hadron_request_configuration_refresh();
}
else if (rc == -2)
{
char *msg = pchomp(PQerrorMessage(pageserver_conn));
pageserver_disconnect(shard_no);
hadron_request_configuration_refresh();
neon_shard_log(shard_no, LOG, "pageserver_receive disconnect: could not read COPY data: %s", msg);
resp = NULL;
}
else
{
pageserver_disconnect(shard_no);
hadron_request_configuration_refresh();
neon_shard_log(shard_no, ERROR, "pageserver_receive disconnect: unexpected PQgetCopyData return value: %d", rc);
}
/*
* Always poke compute_ctl to request a configuration refresh if we have issues receiving data from pageservers after
* successfully connecting to them. It could be an indication that we are connecting to the wrong pageservers (e.g. the PS
* is in secondary mode or otherwise refuses to respond to our requests).
*/
if ( rc < 0 && hadron_request_configuration_refresh() )
{
neon_shard_log(shard_no, ERROR, "refresh_configuration request failed, cancelling query");
}
shard->nresponses_received++;
return (NeonResponse *) resp;
}
@@ -1460,6 +1588,16 @@ pg_init_libpagestore(void)
PGC_SU_BACKEND,
0, /* no flags required */
NULL, NULL, NULL);
DefineCustomIntVariable("hadron.conf_refresh_reconnect_attempt_threshold",
"Threshold of the number of consecutive failed pageserver "
"connection attempts (per shard) before signaling "
"compute_ctl for a configuration refresh.",
NULL,
&conf_refresh_reconnect_attempt_threshold,
16, 0, INT_MAX,
PGC_USERSET,
0,
NULL, NULL, NULL);
DefineCustomIntVariable("neon.pageserver_response_log_timeout",
"pageserver response log timeout",

View File

@@ -507,19 +507,45 @@ backpressure_lag_impl(void)
LSN_FORMAT_ARGS(flushPtr),
LSN_FORMAT_ARGS(applyPtr));
if ((writePtr != InvalidXLogRecPtr && max_replication_write_lag > 0 && myFlushLsn > writePtr + max_replication_write_lag * MB))
if (lakebase_mode)
{
return (myFlushLsn - writePtr - max_replication_write_lag * MB);
}
// in case PG does not have shard map initialized, we assume PG always has 1 shard at minimum.
shardno_t num_shards = Max(1, get_num_shards());
int tenant_max_replication_apply_lag = num_shards * max_replication_apply_lag;
int tenant_max_replication_flush_lag = num_shards * max_replication_flush_lag;
int tenant_max_replication_write_lag = num_shards * max_replication_write_lag;
if ((flushPtr != InvalidXLogRecPtr && max_replication_flush_lag > 0 && myFlushLsn > flushPtr + max_replication_flush_lag * MB))
{
return (myFlushLsn - flushPtr - max_replication_flush_lag * MB);
}
if ((writePtr != InvalidXLogRecPtr && tenant_max_replication_write_lag > 0 && myFlushLsn > writePtr + tenant_max_replication_write_lag * MB))
{
return (myFlushLsn - writePtr - tenant_max_replication_write_lag * MB);
}
if ((applyPtr != InvalidXLogRecPtr && max_replication_apply_lag > 0 && myFlushLsn > applyPtr + max_replication_apply_lag * MB))
if ((flushPtr != InvalidXLogRecPtr && tenant_max_replication_flush_lag > 0 && myFlushLsn > flushPtr + tenant_max_replication_flush_lag * MB))
{
return (myFlushLsn - flushPtr - tenant_max_replication_flush_lag * MB);
}
if ((applyPtr != InvalidXLogRecPtr && tenant_max_replication_apply_lag > 0 && myFlushLsn > applyPtr + tenant_max_replication_apply_lag * MB))
{
return (myFlushLsn - applyPtr - tenant_max_replication_apply_lag * MB);
}
}
else
{
return (myFlushLsn - applyPtr - max_replication_apply_lag * MB);
if ((writePtr != InvalidXLogRecPtr && max_replication_write_lag > 0 && myFlushLsn > writePtr + max_replication_write_lag * MB))
{
return (myFlushLsn - writePtr - max_replication_write_lag * MB);
}
if ((flushPtr != InvalidXLogRecPtr && max_replication_flush_lag > 0 && myFlushLsn > flushPtr + max_replication_flush_lag * MB))
{
return (myFlushLsn - flushPtr - max_replication_flush_lag * MB);
}
if ((applyPtr != InvalidXLogRecPtr && max_replication_apply_lag > 0 && myFlushLsn > applyPtr + max_replication_apply_lag * MB))
{
return (myFlushLsn - applyPtr - max_replication_apply_lag * MB);
}
}
}
return 0;
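Review note: in `lakebase_mode` each lag limit is multiplied by the shard count, so the limit is applied per tenant instead of per shard. The Rust sketch below works through one of the three checks with plain integers. `lag_excess` is a hypothetical name, LSNs are modeled as `u64` byte positions, and `0` stands in for `InvalidXLogRecPtr`.

```rust
const MB: u64 = 1024 * 1024;
/// Stand-in for `InvalidXLogRecPtr`.
const INVALID_LSN: u64 = 0;

/// Returns how many bytes `my_flush_lsn` is ahead of `remote_lsn` beyond the
/// allowed lag, or 0 if the lag is within the limit or the check is disabled.
pub fn lag_excess(my_flush_lsn: u64, remote_lsn: u64, max_lag_mb: u64, num_shards: u64) -> u64 {
    // Assume at least one shard if the shard map is not initialized yet.
    let shards = num_shards.max(1);
    let limit_bytes = shards * max_lag_mb * MB;
    if remote_lsn != INVALID_LSN && max_lag_mb > 0 && my_flush_lsn > remote_lsn + limit_bytes {
        my_flush_lsn - remote_lsn - limit_bytes
    } else {
        0
    }
}
```

With a flush position of 100 MB, a remote position of 10 MB, and a 15 MB limit, one shard gives 75 MB of excess, four shards give 30 MB, and eight shards give none, since the tenant-wide limit is then 120 MB.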

View File

@@ -54,6 +54,7 @@ json = { path = "../libs/proxy/json" }
lasso = { workspace = true, features = ["multi-threaded"] }
measured = { workspace = true, features = ["lasso"] }
metrics.workspace = true
moka.workspace = true
once_cell.workspace = true
opentelemetry = { workspace = true, features = ["trace"] }
papaya = "0.2.0"
@@ -110,7 +111,7 @@ zerocopy.workspace = true
# uncomment this to use the real subzero-core crate
# subzero-core = { git = "https://github.com/neondatabase/subzero", rev = "396264617e78e8be428682f87469bb25429af88a", features = ["postgresql"], optional = true }
# this is a stub for the subzero-core crate
subzero-core = { path = "./subzero_core", features = ["postgresql"], optional = true}
subzero-core = { path = "../libs/proxy/subzero_core", features = ["postgresql"], optional = true}
ouroboros = { version = "0.18", optional = true }
# jwt stuff

View File

@@ -1,4 +1,14 @@
use std::ops::{Deref, DerefMut};
use std::{
ops::{Deref, DerefMut},
time::{Duration, Instant},
};
use moka::Expiry;
use crate::control_plane::messages::ControlPlaneErrorMessage;
/// Default TTL used when caching errors from control plane.
pub const DEFAULT_ERROR_TTL: Duration = Duration::from_secs(30);
/// A generic trait which exposes types of cache's key and value,
/// as well as the notion of cache entry invalidation.
@@ -87,3 +97,59 @@ impl<C: Cache, V> DerefMut for Cached<C, V> {
&mut self.value
}
}
pub type ControlPlaneResult<T> = Result<T, Box<ControlPlaneErrorMessage>>;
#[derive(Clone, Copy)]
pub struct CplaneExpiry {
pub error: Duration,
}
impl Default for CplaneExpiry {
fn default() -> Self {
Self {
error: DEFAULT_ERROR_TTL,
}
}
}
impl CplaneExpiry {
pub fn expire_early<V>(
&self,
value: &ControlPlaneResult<V>,
updated: Instant,
) -> Option<Duration> {
match value {
Ok(_) => None,
Err(err) => Some(self.expire_err_early(err, updated)),
}
}
pub fn expire_err_early(&self, err: &ControlPlaneErrorMessage, updated: Instant) -> Duration {
err.status
.as_ref()
.and_then(|s| s.details.retry_info.as_ref())
.map_or(self.error, |r| r.retry_at.into_std() - updated)
}
}
impl<K, V> Expiry<K, ControlPlaneResult<V>> for CplaneExpiry {
fn expire_after_create(
&self,
_key: &K,
value: &ControlPlaneResult<V>,
created_at: Instant,
) -> Option<Duration> {
self.expire_early(value, created_at)
}
fn expire_after_update(
&self,
_key: &K,
value: &ControlPlaneResult<V>,
updated_at: Instant,
_duration_until_expiry: Option<Duration>,
) -> Option<Duration> {
self.expire_early(value, updated_at)
}
}
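
Review note: the expiry policy above can be sketched without moka, using only the standard library. `CachedError` is a hypothetical stand-in for `ControlPlaneErrorMessage`, reduced to the one field the policy reads. The sketch uses `saturating_duration_since` so that a `retry_at` already in the past yields a zero TTL; that choice is an assumption of this sketch, not something the diff states.

```rust
use std::time::{Duration, Instant};

/// Mirrors DEFAULT_ERROR_TTL from the diff.
const DEFAULT_ERROR_TTL: Duration = Duration::from_secs(30);

/// Hypothetical stand-in for ControlPlaneErrorMessage: only the
/// optional retry deadline matters for expiry.
struct CachedError {
    retry_at: Option<Instant>,
}

/// Ok values return None, meaning "use the cache-wide TTL".
/// Errors expire at their retry deadline, or after the default
/// error TTL when no deadline was provided.
fn expire_early<V>(value: &Result<V, CachedError>, updated: Instant) -> Option<Duration> {
    match value {
        Ok(_) => None,
        Err(err) => Some(match err.retry_at {
            Some(at) => at.saturating_duration_since(updated),
            None => DEFAULT_ERROR_TTL,
        }),
    }
}
```

Returning `None` for `Ok` values leaves them to the `time_to_live` set on the cache builder, so only errors get a shorter, per-entry lifetime.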


@@ -1,84 +1,17 @@
use std::collections::{HashMap, HashSet, hash_map};
use std::collections::HashSet;
use std::convert::Infallible;
use std::time::Duration;
use async_trait::async_trait;
use clashmap::ClashMap;
use clashmap::mapref::one::Ref;
use rand::Rng;
use tokio::time::Instant;
use moka::sync::Cache;
use tracing::{debug, info};
use crate::cache::common::{ControlPlaneResult, CplaneExpiry};
use crate::config::ProjectInfoCacheOptions;
use crate::control_plane::messages::{ControlPlaneErrorMessage, Reason};
use crate::control_plane::{EndpointAccessControl, RoleAccessControl};
use crate::intern::{AccountIdInt, EndpointIdInt, ProjectIdInt, RoleNameInt};
use crate::types::{EndpointId, RoleName};
#[async_trait]
pub(crate) trait ProjectInfoCache {
fn invalidate_endpoint_access(&self, endpoint_id: EndpointIdInt);
fn invalidate_endpoint_access_for_project(&self, project_id: ProjectIdInt);
fn invalidate_endpoint_access_for_org(&self, account_id: AccountIdInt);
fn invalidate_role_secret_for_project(&self, project_id: ProjectIdInt, role_name: RoleNameInt);
}
struct Entry<T> {
expires_at: Instant,
value: T,
}
impl<T> Entry<T> {
pub(crate) fn new(value: T, ttl: Duration) -> Self {
Self {
expires_at: Instant::now() + ttl,
value,
}
}
pub(crate) fn get(&self) -> Option<&T> {
(!self.is_expired()).then_some(&self.value)
}
fn is_expired(&self) -> bool {
self.expires_at <= Instant::now()
}
}
struct EndpointInfo {
role_controls: HashMap<RoleNameInt, Entry<ControlPlaneResult<RoleAccessControl>>>,
controls: Option<Entry<ControlPlaneResult<EndpointAccessControl>>>,
}
type ControlPlaneResult<T> = Result<T, Box<ControlPlaneErrorMessage>>;
impl EndpointInfo {
pub(crate) fn get_role_secret_with_ttl(
&self,
role_name: RoleNameInt,
) -> Option<(ControlPlaneResult<RoleAccessControl>, Duration)> {
let entry = self.role_controls.get(&role_name)?;
let ttl = entry.expires_at - Instant::now();
Some((entry.get()?.clone(), ttl))
}
pub(crate) fn get_controls_with_ttl(
&self,
) -> Option<(ControlPlaneResult<EndpointAccessControl>, Duration)> {
let entry = self.controls.as_ref()?;
let ttl = entry.expires_at - Instant::now();
Some((entry.get()?.clone(), ttl))
}
pub(crate) fn invalidate_endpoint(&mut self) {
self.controls = None;
}
pub(crate) fn invalidate_role_secret(&mut self, role_name: RoleNameInt) {
self.role_controls.remove(&role_name);
}
}
/// Cache for project info.
/// This is used to cache auth data for endpoints.
/// Invalidation is done by console notifications or by TTL (if console notifications are disabled).
@@ -86,8 +19,9 @@ impl EndpointInfo {
/// We also store endpoint-to-project mapping in the cache, to be able to access per-endpoint data.
/// One may ask why the data is stored per project when a user request carries only endpoint data.
/// On the cplane side updates are done per project (or per branch), so it's easier to invalidate the whole project cache.
pub struct ProjectInfoCacheImpl {
cache: ClashMap<EndpointIdInt, EndpointInfo>,
pub struct ProjectInfoCache {
role_controls: Cache<(EndpointIdInt, RoleNameInt), ControlPlaneResult<RoleAccessControl>>,
ep_controls: Cache<EndpointIdInt, ControlPlaneResult<EndpointAccessControl>>,
project2ep: ClashMap<ProjectIdInt, HashSet<EndpointIdInt>>,
// FIXME(stefan): we need a way to GC the account2ep map.
@@ -96,16 +30,13 @@ pub struct ProjectInfoCacheImpl {
config: ProjectInfoCacheOptions,
}
#[async_trait]
impl ProjectInfoCache for ProjectInfoCacheImpl {
fn invalidate_endpoint_access(&self, endpoint_id: EndpointIdInt) {
impl ProjectInfoCache {
pub fn invalidate_endpoint_access(&self, endpoint_id: EndpointIdInt) {
info!("invalidating endpoint access for `{endpoint_id}`");
if let Some(mut endpoint_info) = self.cache.get_mut(&endpoint_id) {
endpoint_info.invalidate_endpoint();
}
self.ep_controls.invalidate(&endpoint_id);
}
fn invalidate_endpoint_access_for_project(&self, project_id: ProjectIdInt) {
pub fn invalidate_endpoint_access_for_project(&self, project_id: ProjectIdInt) {
info!("invalidating endpoint access for project `{project_id}`");
let endpoints = self
.project2ep
@@ -113,13 +44,11 @@ impl ProjectInfoCache for ProjectInfoCacheImpl {
.map(|kv| kv.value().clone())
.unwrap_or_default();
for endpoint_id in endpoints {
if let Some(mut endpoint_info) = self.cache.get_mut(&endpoint_id) {
endpoint_info.invalidate_endpoint();
}
self.ep_controls.invalidate(&endpoint_id);
}
}
fn invalidate_endpoint_access_for_org(&self, account_id: AccountIdInt) {
pub fn invalidate_endpoint_access_for_org(&self, account_id: AccountIdInt) {
info!("invalidating endpoint access for org `{account_id}`");
let endpoints = self
.account2ep
@@ -127,13 +56,15 @@ impl ProjectInfoCache for ProjectInfoCacheImpl {
.map(|kv| kv.value().clone())
.unwrap_or_default();
for endpoint_id in endpoints {
if let Some(mut endpoint_info) = self.cache.get_mut(&endpoint_id) {
endpoint_info.invalidate_endpoint();
}
self.ep_controls.invalidate(&endpoint_id);
}
}
fn invalidate_role_secret_for_project(&self, project_id: ProjectIdInt, role_name: RoleNameInt) {
pub fn invalidate_role_secret_for_project(
&self,
project_id: ProjectIdInt,
role_name: RoleNameInt,
) {
info!(
"invalidating role secret for project_id `{}` and role_name `{}`",
project_id, role_name,
@@ -144,47 +75,52 @@ impl ProjectInfoCache for ProjectInfoCacheImpl {
.map(|kv| kv.value().clone())
.unwrap_or_default();
for endpoint_id in endpoints {
if let Some(mut endpoint_info) = self.cache.get_mut(&endpoint_id) {
endpoint_info.invalidate_role_secret(role_name);
}
self.role_controls.invalidate(&(endpoint_id, role_name));
}
}
}
impl ProjectInfoCacheImpl {
impl ProjectInfoCache {
pub(crate) fn new(config: ProjectInfoCacheOptions) -> Self {
// we cache errors for 30 seconds, unless retry_at is set.
let expiry = CplaneExpiry::default();
Self {
cache: ClashMap::new(),
role_controls: Cache::builder()
.name("role_access_controls")
.max_capacity(config.size * config.max_roles)
.time_to_live(config.ttl)
.expire_after(expiry)
.build(),
ep_controls: Cache::builder()
.name("endpoint_access_controls")
.max_capacity(config.size)
.time_to_live(config.ttl)
.expire_after(expiry)
.build(),
project2ep: ClashMap::new(),
account2ep: ClashMap::new(),
config,
}
}
fn get_endpoint_cache(
&self,
endpoint_id: &EndpointId,
) -> Option<Ref<'_, EndpointIdInt, EndpointInfo>> {
let endpoint_id = EndpointIdInt::get(endpoint_id)?;
self.cache.get(&endpoint_id)
}
pub(crate) fn get_role_secret_with_ttl(
pub(crate) fn get_role_secret(
&self,
endpoint_id: &EndpointId,
role_name: &RoleName,
) -> Option<(ControlPlaneResult<RoleAccessControl>, Duration)> {
) -> Option<ControlPlaneResult<RoleAccessControl>> {
let endpoint_id = EndpointIdInt::get(endpoint_id)?;
let role_name = RoleNameInt::get(role_name)?;
let endpoint_info = self.get_endpoint_cache(endpoint_id)?;
endpoint_info.get_role_secret_with_ttl(role_name)
self.role_controls.get(&(endpoint_id, role_name))
}
pub(crate) fn get_endpoint_access_with_ttl(
pub(crate) fn get_endpoint_access(
&self,
endpoint_id: &EndpointId,
) -> Option<(ControlPlaneResult<EndpointAccessControl>, Duration)> {
let endpoint_info = self.get_endpoint_cache(endpoint_id)?;
endpoint_info.get_controls_with_ttl()
) -> Option<ControlPlaneResult<EndpointAccessControl>> {
let endpoint_id = EndpointIdInt::get(endpoint_id)?;
self.ep_controls.get(&endpoint_id)
}
pub(crate) fn insert_endpoint_access(
@@ -203,34 +139,14 @@ impl ProjectInfoCacheImpl {
self.insert_project2endpoint(project_id, endpoint_id);
}
if self.cache.len() >= self.config.size {
// If there are too many entries, wait until the next gc cycle.
return;
}
debug!(
key = &*endpoint_id,
"created a cache entry for endpoint access"
);
let controls = Some(Entry::new(Ok(controls), self.config.ttl));
let role_controls = Entry::new(Ok(role_controls), self.config.ttl);
match self.cache.entry(endpoint_id) {
clashmap::Entry::Vacant(e) => {
e.insert(EndpointInfo {
role_controls: HashMap::from_iter([(role_name, role_controls)]),
controls,
});
}
clashmap::Entry::Occupied(mut e) => {
let ep = e.get_mut();
ep.controls = controls;
if ep.role_controls.len() < self.config.max_roles {
ep.role_controls.insert(role_name, role_controls);
}
}
}
self.ep_controls.insert(endpoint_id, Ok(controls));
self.role_controls
.insert((endpoint_id, role_name), Ok(role_controls));
}
pub(crate) fn insert_endpoint_access_err(
@@ -238,55 +154,30 @@ impl ProjectInfoCacheImpl {
endpoint_id: EndpointIdInt,
role_name: RoleNameInt,
msg: Box<ControlPlaneErrorMessage>,
ttl: Option<Duration>,
) {
if self.cache.len() >= self.config.size {
// If there are too many entries, wait until the next gc cycle.
return;
}
debug!(
key = &*endpoint_id,
"created a cache entry for an endpoint access error"
);
let ttl = ttl.unwrap_or(self.config.ttl);
let controls = if msg.get_reason() == Reason::RoleProtected {
// RoleProtected is the only role-specific error that control plane can give us.
// If a given role name does not exist, it still returns a successful response,
// just with an empty secret.
None
} else {
// We can cache all the other errors in EndpointInfo.controls,
// because they don't depend on what role name we pass to control plane.
Some(Entry::new(Err(msg.clone()), ttl))
};
let role_controls = Entry::new(Err(msg), ttl);
match self.cache.entry(endpoint_id) {
clashmap::Entry::Vacant(e) => {
e.insert(EndpointInfo {
role_controls: HashMap::from_iter([(role_name, role_controls)]),
controls,
// RoleProtected is the only role-specific error that control plane can give us.
// If a given role name does not exist, it still returns a successful response,
// just with an empty secret.
if msg.get_reason() != Reason::RoleProtected {
// We can cache all the other errors in ep_controls because they don't
// depend on what role name we pass to control plane.
self.ep_controls
.entry(endpoint_id)
.and_compute_with(|entry| match entry {
// leave the entry alone if it's already Ok
Some(entry) if entry.value().is_ok() => moka::ops::compute::Op::Nop,
// replace the entry
_ => moka::ops::compute::Op::Put(Err(msg.clone())),
});
}
clashmap::Entry::Occupied(mut e) => {
let ep = e.get_mut();
if let Some(entry) = &ep.controls
&& !entry.is_expired()
&& entry.value.is_ok()
{
// If we have cached non-expired, non-error controls, keep them.
} else {
ep.controls = controls;
}
if ep.role_controls.len() < self.config.max_roles {
ep.role_controls.insert(role_name, role_controls);
}
}
}
self.role_controls
.insert((endpoint_id, role_name), Err(msg));
}
fn insert_project2endpoint(&self, project_id: ProjectIdInt, endpoint_id: EndpointIdInt) {
@@ -307,58 +198,19 @@ impl ProjectInfoCacheImpl {
}
}
pub fn maybe_invalidate_role_secret(&self, endpoint_id: &EndpointId, role_name: &RoleName) {
let Some(endpoint_id) = EndpointIdInt::get(endpoint_id) else {
return;
};
let Some(role_name) = RoleNameInt::get(role_name) else {
return;
};
let Some(mut endpoint_info) = self.cache.get_mut(&endpoint_id) else {
return;
};
let entry = endpoint_info.role_controls.entry(role_name);
let hash_map::Entry::Occupied(role_controls) = entry else {
return;
};
if role_controls.get().is_expired() {
role_controls.remove();
}
pub fn maybe_invalidate_role_secret(&self, _endpoint_id: &EndpointId, _role_name: &RoleName) {
// TODO: Expire the value early if the key is idle.
// Currently not an issue as we would just use the TTL to decide, which is what already happens.
}
pub async fn gc_worker(&self) -> anyhow::Result<Infallible> {
let mut interval =
tokio::time::interval(self.config.gc_interval / (self.cache.shards().len()) as u32);
let mut interval = tokio::time::interval(self.config.gc_interval);
loop {
interval.tick().await;
if self.cache.len() < self.config.size {
// If there are not too many entries, wait until the next gc cycle.
continue;
}
self.gc();
self.ep_controls.run_pending_tasks();
self.role_controls.run_pending_tasks();
}
}
fn gc(&self) {
let shard = rand::rng().random_range(0..self.project2ep.shards().len());
debug!(shard, "project_info_cache: performing epoch reclamation");
// acquire a random shard lock
let mut removed = 0;
let shard = self.project2ep.shards()[shard].write();
for (_, endpoints) in shard.iter() {
for endpoint in endpoints {
self.cache.remove(endpoint);
removed += 1;
}
}
// We can drop this shard only after making sure that all endpoints are removed.
drop(shard);
info!("project_info_cache: removed {removed} endpoints");
}
}
#[cfg(test)]
@@ -368,12 +220,12 @@ mod tests {
use crate::control_plane::{AccessBlockerFlags, AuthSecret};
use crate::scram::ServerSecret;
use std::sync::Arc;
use std::time::Duration;
#[tokio::test]
async fn test_project_info_cache_settings() {
tokio::time::pause();
let cache = ProjectInfoCacheImpl::new(ProjectInfoCacheOptions {
size: 2,
let cache = ProjectInfoCache::new(ProjectInfoCacheOptions {
size: 1,
max_roles: 2,
ttl: Duration::from_secs(1),
gc_interval: Duration::from_secs(600),
@@ -423,22 +275,17 @@ mod tests {
},
);
let (cached, ttl) = cache
.get_role_secret_with_ttl(&endpoint_id, &user1)
.unwrap();
let cached = cache.get_role_secret(&endpoint_id, &user1).unwrap();
assert_eq!(cached.unwrap().secret, secret1);
assert_eq!(ttl, cache.config.ttl);
let (cached, ttl) = cache
.get_role_secret_with_ttl(&endpoint_id, &user2)
.unwrap();
let cached = cache.get_role_secret(&endpoint_id, &user2).unwrap();
assert_eq!(cached.unwrap().secret, secret2);
assert_eq!(ttl, cache.config.ttl);
// Shouldn't add more than 2 roles.
let user3: RoleName = "user3".into();
let secret3 = Some(AuthSecret::Scram(ServerSecret::mock([3; 32])));
cache.role_controls.run_pending_tasks();
cache.insert_endpoint_access(
account_id,
project_id,
@@ -455,31 +302,18 @@ mod tests {
},
);
assert!(
cache
.get_role_secret_with_ttl(&endpoint_id, &user3)
.is_none()
);
cache.role_controls.run_pending_tasks();
assert_eq!(cache.role_controls.entry_count(), 2);
let cached = cache
.get_endpoint_access_with_ttl(&endpoint_id)
.unwrap()
.0
.unwrap();
assert_eq!(cached.allowed_ips, allowed_ips);
tokio::time::sleep(Duration::from_secs(2)).await;
tokio::time::advance(Duration::from_secs(2)).await;
let cached = cache.get_role_secret_with_ttl(&endpoint_id, &user1);
assert!(cached.is_none());
let cached = cache.get_role_secret_with_ttl(&endpoint_id, &user2);
assert!(cached.is_none());
let cached = cache.get_endpoint_access_with_ttl(&endpoint_id);
assert!(cached.is_none());
cache.role_controls.run_pending_tasks();
assert_eq!(cache.role_controls.entry_count(), 0);
}
#[tokio::test]
async fn test_caching_project_info_errors() {
let cache = ProjectInfoCacheImpl::new(ProjectInfoCacheOptions {
let cache = ProjectInfoCache::new(ProjectInfoCacheOptions {
size: 10,
max_roles: 10,
ttl: Duration::from_secs(1),
@@ -519,34 +353,23 @@ mod tests {
status: None,
});
let get_role_secret = |endpoint_id, role_name| {
cache
.get_role_secret_with_ttl(endpoint_id, role_name)
.unwrap()
.0
};
let get_endpoint_access =
|endpoint_id| cache.get_endpoint_access_with_ttl(endpoint_id).unwrap().0;
let get_role_secret =
|endpoint_id, role_name| cache.get_role_secret(endpoint_id, role_name).unwrap();
let get_endpoint_access = |endpoint_id| cache.get_endpoint_access(endpoint_id).unwrap();
// stores role-specific errors only for get_role_secret
cache.insert_endpoint_access_err(
(&endpoint_id).into(),
(&user1).into(),
role_msg.clone(),
None,
);
cache.insert_endpoint_access_err((&endpoint_id).into(), (&user1).into(), role_msg.clone());
assert_eq!(
get_role_secret(&endpoint_id, &user1).unwrap_err().error,
role_msg.error
);
assert!(cache.get_endpoint_access_with_ttl(&endpoint_id).is_none());
assert!(cache.get_endpoint_access(&endpoint_id).is_none());
// stores non-role specific errors for both get_role_secret and get_endpoint_access
cache.insert_endpoint_access_err(
(&endpoint_id).into(),
(&user1).into(),
generic_msg.clone(),
None,
);
assert_eq!(
get_role_secret(&endpoint_id, &user1).unwrap_err().error,
@@ -558,11 +381,7 @@ mod tests {
);
// error isn't returned for other roles in the same endpoint
assert!(
cache
.get_role_secret_with_ttl(&endpoint_id, &user2)
.is_none()
);
assert!(cache.get_role_secret(&endpoint_id, &user2).is_none());
// success for a role does not overwrite errors for other roles
cache.insert_endpoint_access(
@@ -590,7 +409,6 @@ mod tests {
(&endpoint_id).into(),
(&user2).into(),
generic_msg.clone(),
None,
);
assert!(get_role_secret(&endpoint_id, &user2).is_err());
assert!(get_endpoint_access(&endpoint_id).is_ok());
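
Review note: the error-caching rule exercised by `test_caching_project_info_errors` can be sketched with plain `HashMap`s. All names below (`ErrCache`, `insert_err`, the integer keys, the two-variant `Reason`) are hypothetical simplifications; the real code uses moka caches and `and_compute_with`.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the control plane error reason.
#[derive(Clone, Debug, PartialEq)]
enum Reason {
    RoleProtected,
    Other,
}

type Cached<T> = Result<T, Reason>;

/// Hypothetical two-map cache: one entry per endpoint,
/// one entry per (endpoint, role) pair.
#[derive(Default)]
struct ErrCache {
    ep_controls: HashMap<u32, Cached<&'static str>>,
    role_controls: HashMap<(u32, u32), Cached<&'static str>>,
}

impl ErrCache {
    fn insert_err(&mut self, endpoint: u32, role: u32, reason: Reason) {
        // Role-specific errors never touch the endpoint-level entry.
        if reason != Reason::RoleProtected {
            // Keep an existing successful endpoint entry; replace anything else.
            let keep = matches!(self.ep_controls.get(&endpoint), Some(Ok(_)));
            if !keep {
                self.ep_controls.insert(endpoint, Err(reason.clone()));
            }
        }
        // The role-level entry is always overwritten with the error.
        self.role_controls.insert((endpoint, role), Err(reason));
    }
}
```

This sketch checks and inserts in two steps; the diff's `and_compute_with` performs the same decision per key in a single call.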


@@ -246,17 +246,14 @@ impl<K: Hash + Eq + Clone, V: Clone> TimedLru<K, V> {
impl<K: Hash + Eq, V: Clone> TimedLru<K, V> {
/// Retrieve a cached entry in convenient wrapper, alongside timing information.
pub(crate) fn get_with_created_at<Q>(
&self,
key: &Q,
) -> Option<Cached<&Self, (<Self as Cache>::Value, Instant)>>
pub(crate) fn get<Q>(&self, key: &Q) -> Option<Cached<&Self, <Self as Cache>::Value>>
where
K: Borrow<Q> + Clone,
Q: Hash + Eq + ?Sized,
{
self.get_raw(key, |key, entry| Cached {
token: Some((self, key.clone())),
value: (entry.value.clone(), entry.created_at),
value: entry.value.clone(),
})
}
}


@@ -25,6 +25,7 @@ use crate::control_plane::messages::MetricsAuxInfo;
use crate::error::{ReportableError, UserFacingError};
use crate::metrics::{Metrics, NumDbConnectionsGuard};
use crate::pqproto::StartupMessageParams;
use crate::proxy::connect_compute::TlsNegotiation;
use crate::proxy::neon_option;
use crate::types::Host;
@@ -84,6 +85,14 @@ pub(crate) enum ConnectionError {
#[error("error acquiring resource permit: {0}")]
TooManyConnectionAttempts(#[from] ApiLockError),
#[cfg(test)]
#[error("retryable: {retryable}, wakeable: {wakeable}, kind: {kind:?}")]
TestError {
retryable: bool,
wakeable: bool,
kind: crate::error::ErrorKind,
},
}
impl UserFacingError for ConnectionError {
@@ -94,6 +103,8 @@ impl UserFacingError for ConnectionError {
"Failed to acquire permit to connect to the database. Too many database connection attempts are currently ongoing.".to_owned()
}
ConnectionError::TlsError(_) => COULD_NOT_CONNECT.to_owned(),
#[cfg(test)]
ConnectionError::TestError { .. } => self.to_string(),
}
}
}
@@ -104,6 +115,8 @@ impl ReportableError for ConnectionError {
ConnectionError::TlsError(_) => crate::error::ErrorKind::Compute,
ConnectionError::WakeComputeError(e) => e.get_error_kind(),
ConnectionError::TooManyConnectionAttempts(e) => e.get_error_kind(),
#[cfg(test)]
ConnectionError::TestError { kind, .. } => *kind,
}
}
}
@@ -256,6 +269,7 @@ impl ConnectInfo {
async fn connect_raw(
&self,
config: &ComputeConfig,
tls: TlsNegotiation,
) -> Result<(SocketAddr, MaybeTlsStream<TcpStream, RustlsStream>), TlsError> {
let timeout = config.timeout;
@@ -298,7 +312,7 @@ impl ConnectInfo {
match connect_once(&*addrs).await {
Ok((sockaddr, stream)) => Ok((
sockaddr,
tls::connect_tls(stream, self.ssl_mode, config, host).await?,
tls::connect_tls(stream, self.ssl_mode, config, host, tls).await?,
)),
Err(err) => {
warn!("couldn't connect to compute node at {host}:{port}: {err}");
@@ -329,9 +343,10 @@ impl ConnectInfo {
ctx: &RequestContext,
aux: &MetricsAuxInfo,
config: &ComputeConfig,
tls: TlsNegotiation,
) -> Result<ComputeConnection, ConnectionError> {
let pause = ctx.latency_timer_pause(crate::metrics::Waiting::Compute);
let (socket_addr, stream) = self.connect_raw(config).await?;
let (socket_addr, stream) = self.connect_raw(config, tls).await?;
drop(pause);
tracing::Span::current().record("compute_id", tracing::field::display(&aux.compute_id));


@@ -7,6 +7,7 @@ use thiserror::Error;
use tokio::io::{AsyncRead, AsyncWrite};
use crate::pqproto::request_tls;
use crate::proxy::connect_compute::TlsNegotiation;
use crate::proxy::retry::CouldRetry;
#[derive(Debug, Error)]
@@ -35,6 +36,7 @@ pub async fn connect_tls<S, T>(
mode: SslMode,
tls: &T,
host: &str,
negotiation: TlsNegotiation,
) -> Result<MaybeTlsStream<S, T::Stream>, TlsError>
where
S: AsyncRead + AsyncWrite + Unpin + Send,
@@ -49,12 +51,15 @@ where
SslMode::Prefer | SslMode::Require => {}
}
if !request_tls(&mut stream).await? {
if SslMode::Require == mode {
return Err(TlsError::Required);
}
return Ok(MaybeTlsStream::Raw(stream));
match negotiation {
// No TLS request needed
TlsNegotiation::Direct => {}
// TLS request successful
TlsNegotiation::Postgres if request_tls(&mut stream).await? => {}
// TLS request failed but is required
TlsNegotiation::Postgres if SslMode::Require == mode => return Err(TlsError::Required),
// TLS request failed but is not required
TlsNegotiation::Postgres => return Ok(MaybeTlsStream::Raw(stream)),
}
Ok(MaybeTlsStream::Tls(
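
Review note: the new `match negotiation` decision can be modelled as a pure function, which makes the four arms easy to check in isolation. `Outcome`, `TlsRequired` and `negotiate` are hypothetical names for this sketch; the `request_tls` closure stands in for the async `request_tls(&mut stream)` call, and `tls_required` stands in for `SslMode::Require == mode`.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum TlsNegotiation {
    /// TLS is assumed; no SSLRequest is sent.
    Direct,
    /// TLS must be requested with the postgres SSLRequest message.
    Postgres,
}

/// Hypothetical result type: which kind of stream we end up with.
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    Tls,
    Raw,
}

/// Hypothetical error mirroring TlsError::Required.
#[derive(Debug, PartialEq, Eq)]
struct TlsRequired;

fn negotiate(
    negotiation: TlsNegotiation,
    tls_required: bool,
    request_tls: impl Fn() -> bool,
) -> Result<Outcome, TlsRequired> {
    match negotiation {
        // No TLS request needed.
        TlsNegotiation::Direct => {}
        // TLS request accepted by the server.
        TlsNegotiation::Postgres if request_tls() => {}
        // TLS request refused, but TLS is required.
        TlsNegotiation::Postgres if tls_required => return Err(TlsRequired),
        // TLS request refused, and TLS is optional.
        TlsNegotiation::Postgres => return Ok(Outcome::Raw),
    }
    Ok(Outcome::Tls)
}
```

The arms are evaluated top to bottom, so the SSLRequest is sent at most once and never in `Direct` mode.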


@@ -159,11 +159,11 @@ impl FromStr for CacheOptions {
#[derive(Debug)]
pub struct ProjectInfoCacheOptions {
/// Max number of entries.
pub size: usize,
pub size: u64,
/// Entry's time-to-live.
pub ttl: Duration,
/// Max number of roles per endpoint.
pub max_roles: usize,
pub max_roles: u64,
/// Gc interval.
pub gc_interval: Duration,
}


@@ -16,8 +16,9 @@ use crate::pglb::ClientRequestError;
use crate::pglb::handshake::{HandshakeData, handshake};
use crate::pglb::passthrough::ProxyPassthrough;
use crate::protocol2::{ConnectHeader, ConnectionInfo, read_proxy_protocol};
use crate::proxy::connect_compute::{TcpMechanism, connect_to_compute};
use crate::proxy::{ErrorSource, forward_compute_params_to_client, send_client_greeting};
use crate::proxy::{
ErrorSource, connect_compute, forward_compute_params_to_client, send_client_greeting,
};
use crate::util::run_until_cancelled;
pub async fn task_main(
@@ -215,14 +216,11 @@ pub(crate) async fn handle_client<S: AsyncRead + AsyncWrite + Unpin + Send>(
};
auth_info.set_startup_params(&params, true);
let mut node = connect_to_compute(
let mut node = connect_compute::connect_to_compute(
ctx,
&TcpMechanism {
locks: &config.connect_compute_locks,
},
config,
&node_info,
config.wake_compute_retry_config,
&config.connect_to_compute,
connect_compute::TlsNegotiation::Postgres,
)
.or_else(|e| async { Err(stream.throw_error(e, Some(ctx)).await) })
.await?;


@@ -3,7 +3,6 @@
use std::net::IpAddr;
use std::str::FromStr;
use std::sync::Arc;
use std::time::Duration;
use ::http::HeaderName;
use ::http::header::AUTHORIZATION;
@@ -17,6 +16,7 @@ use tracing::{Instrument, debug, info, info_span, warn};
use super::super::messages::{ControlPlaneErrorMessage, GetEndpointAccessControl, WakeCompute};
use crate::auth::backend::ComputeUserInfo;
use crate::auth::backend::jwt::AuthRule;
use crate::cache::common::DEFAULT_ERROR_TTL;
use crate::context::RequestContext;
use crate::control_plane::caches::ApiCaches;
use crate::control_plane::errors::{
@@ -118,7 +118,6 @@ impl NeonControlPlaneClient {
cache_key.into(),
role.into(),
msg.clone(),
retry_info.map(|r| Duration::from_millis(r.retry_delay_ms)),
);
Err(err)
@@ -347,18 +346,11 @@ impl super::ControlPlaneApi for NeonControlPlaneClient {
) -> Result<RoleAccessControl, GetAuthInfoError> {
let key = endpoint.normalize();
if let Some((role_control, ttl)) = self
.caches
.project_info
.get_role_secret_with_ttl(&key, role)
{
if let Some(role_control) = self.caches.project_info.get_role_secret(&key, role) {
return match role_control {
Err(mut msg) => {
Err(msg) => {
info!(key = &*key, "found cached get_role_access_control error");
// if retry_delay_ms is set change it to the remaining TTL
replace_retry_delay_ms(&mut msg, |_| ttl.as_millis() as u64);
Err(GetAuthInfoError::ApiError(ControlPlaneError::Message(msg)))
}
Ok(role_control) => {
@@ -383,17 +375,14 @@ impl super::ControlPlaneApi for NeonControlPlaneClient {
) -> Result<EndpointAccessControl, GetAuthInfoError> {
let key = endpoint.normalize();
if let Some((control, ttl)) = self.caches.project_info.get_endpoint_access_with_ttl(&key) {
if let Some(control) = self.caches.project_info.get_endpoint_access(&key) {
return match control {
Err(mut msg) => {
Err(msg) => {
info!(
key = &*key,
"found cached get_endpoint_access_control error"
);
// if retry_delay_ms is set change it to the remaining TTL
replace_retry_delay_ms(&mut msg, |_| ttl.as_millis() as u64);
Err(GetAuthInfoError::ApiError(ControlPlaneError::Message(msg)))
}
Ok(control) => {
@@ -426,17 +415,12 @@ impl super::ControlPlaneApi for NeonControlPlaneClient {
macro_rules! check_cache {
() => {
if let Some(cached) = self.caches.node_info.get_with_created_at(&key) {
let (cached, (info, created_at)) = cached.take_value();
if let Some(cached) = self.caches.node_info.get(&key) {
let (cached, info) = cached.take_value();
return match info {
Err(mut msg) => {
Err(msg) => {
info!(key = &*key, "found cached wake_compute error");
// if retry_delay_ms is set, reduce it by the amount of time it spent in cache
replace_retry_delay_ms(&mut msg, |delay| {
delay.saturating_sub(created_at.elapsed().as_millis() as u64)
});
Err(WakeComputeError::ControlPlane(ControlPlaneError::Message(
msg,
)))
@@ -503,9 +487,7 @@ impl super::ControlPlaneApi for NeonControlPlaneClient {
"created a cache entry for the wake compute error"
);
let ttl = retry_info.map_or(Duration::from_secs(30), |r| {
Duration::from_millis(r.retry_delay_ms)
});
let ttl = retry_info.map_or(DEFAULT_ERROR_TTL, |r| r.retry_at - Instant::now());
self.caches.node_info.insert_ttl(key, Err(msg.clone()), ttl);
@@ -517,14 +499,6 @@ impl super::ControlPlaneApi for NeonControlPlaneClient {
}
}
fn replace_retry_delay_ms(msg: &mut ControlPlaneErrorMessage, f: impl FnOnce(u64) -> u64) {
if let Some(status) = &mut msg.status
&& let Some(retry_info) = &mut status.details.retry_info
{
retry_info.retry_delay_ms = f(retry_info.retry_delay_ms);
}
}
/// Parse http response body, taking status code into account.
fn parse_body<T: for<'a> serde::Deserialize<'a>>(
status: StatusCode,


@@ -13,7 +13,7 @@ use tracing::{debug, info};
use super::{EndpointAccessControl, RoleAccessControl};
use crate::auth::backend::ComputeUserInfo;
use crate::auth::backend::jwt::{AuthRule, FetchAuthRules, FetchAuthRulesError};
use crate::cache::project_info::ProjectInfoCacheImpl;
use crate::cache::project_info::ProjectInfoCache;
use crate::config::{CacheOptions, ProjectInfoCacheOptions};
use crate::context::RequestContext;
use crate::control_plane::{CachedNodeInfo, ControlPlaneApi, NodeInfoCache, errors};
@@ -119,7 +119,7 @@ pub struct ApiCaches {
/// Cache for the `wake_compute` API method.
pub(crate) node_info: NodeInfoCache,
/// Cache which stores project_id -> endpoint_ids mapping.
pub project_info: Arc<ProjectInfoCacheImpl>,
pub project_info: Arc<ProjectInfoCache>,
}
impl ApiCaches {
@@ -134,7 +134,7 @@ impl ApiCaches {
wake_compute_cache_config.ttl,
true,
),
project_info: Arc::new(ProjectInfoCacheImpl::new(project_info_cache_config)),
project_info: Arc::new(ProjectInfoCache::new(project_info_cache_config)),
}
}
}


@@ -1,8 +1,10 @@
use std::fmt::{self, Display};
use std::time::Duration;
use measured::FixedCardinalityLabel;
use serde::{Deserialize, Serialize};
use smol_str::SmolStr;
use tokio::time::Instant;
use crate::auth::IpPattern;
use crate::intern::{AccountIdInt, BranchIdInt, EndpointIdInt, ProjectIdInt, RoleNameInt};
@@ -231,7 +233,13 @@ impl Reason {
#[derive(Copy, Clone, Debug, Deserialize)]
#[allow(dead_code)]
pub(crate) struct RetryInfo {
pub(crate) retry_delay_ms: u64,
#[serde(rename = "retry_delay_ms", deserialize_with = "milliseconds_from_now")]
pub(crate) retry_at: Instant,
}
fn milliseconds_from_now<'de, D: serde::Deserializer<'de>>(d: D) -> Result<Instant, D::Error> {
let millis = u64::deserialize(d)?;
Ok(Instant::now() + Duration::from_millis(millis))
}
#[derive(Debug, Deserialize, Clone)]
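
Review note: the `RetryInfo` change replaces a relative delay with an absolute deadline fixed at deserialization time, so later readers derive the remaining wait instead of adjusting `retry_delay_ms` by hand. A minimal std-only sketch of that idea follows; the function names `retry_at` and `remaining` are hypothetical, and the saturating subtraction is an assumption of this sketch.

```rust
use std::time::{Duration, Instant};

/// Convert the relative delay from the response into an absolute deadline,
/// anchored at the moment the response was parsed.
fn retry_at(received_at: Instant, retry_delay_ms: u64) -> Instant {
    received_at + Duration::from_millis(retry_delay_ms)
}

/// Time left until the deadline as seen at `now`; zero once it has passed.
fn remaining(retry_at: Instant, now: Instant) -> Duration {
    retry_at.saturating_duration_since(now)
}
```

With a deadline stored in the message, a cached error no longer needs `replace_retry_delay_ms` to account for the time it spent in the cache.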


@@ -17,7 +17,6 @@ use crate::auth::backend::ComputeUserInfo;
use crate::auth::backend::jwt::AuthRule;
use crate::auth::{AuthError, IpPattern, check_peer_addr_is_in_list};
use crate::cache::{Cached, TimedLru};
use crate::config::ComputeConfig;
use crate::context::RequestContext;
use crate::control_plane::messages::{ControlPlaneErrorMessage, MetricsAuxInfo};
use crate::intern::{AccountIdInt, EndpointIdInt, ProjectIdInt};
@@ -72,16 +71,6 @@ pub(crate) struct NodeInfo {
pub(crate) aux: MetricsAuxInfo,
}
impl NodeInfo {
pub(crate) async fn connect(
&self,
ctx: &RequestContext,
config: &ComputeConfig,
) -> Result<compute::ComputeConnection, compute::ConnectionError> {
self.conn_info.connect(ctx, &self.aux, config).await
}
}
#[derive(Copy, Clone, Default, Debug)]
pub(crate) struct AccessBlockerFlags {
pub public_access_blocked: bool,


@@ -0,0 +1,82 @@
use thiserror::Error;
use crate::auth::Backend;
use crate::auth::backend::ComputeUserInfo;
use crate::cache::Cache;
use crate::compute::{AuthInfo, ComputeConnection, ConnectionError, PostgresError};
use crate::config::ProxyConfig;
use crate::context::RequestContext;
use crate::control_plane::client::ControlPlaneClient;
use crate::error::{ReportableError, UserFacingError};
use crate::proxy::connect_compute::{TlsNegotiation, connect_to_compute};
use crate::proxy::retry::ShouldRetryWakeCompute;
#[derive(Debug, Error)]
pub enum AuthError {
#[error(transparent)]
Auth(#[from] PostgresError),
#[error(transparent)]
Connect(#[from] ConnectionError),
}
impl UserFacingError for AuthError {
fn to_string_client(&self) -> String {
match self {
AuthError::Auth(postgres_error) => postgres_error.to_string_client(),
AuthError::Connect(connection_error) => connection_error.to_string_client(),
}
}
}
impl ReportableError for AuthError {
fn get_error_kind(&self) -> crate::error::ErrorKind {
match self {
AuthError::Auth(postgres_error) => postgres_error.get_error_kind(),
AuthError::Connect(connection_error) => connection_error.get_error_kind(),
}
}
}
/// Try to connect to the compute node, retrying if necessary.
#[tracing::instrument(skip_all)]
pub(crate) async fn connect_to_compute_and_auth(
ctx: &RequestContext,
config: &ProxyConfig,
user_info: &Backend<'_, ComputeUserInfo>,
auth_info: AuthInfo,
tls: TlsNegotiation,
) -> Result<ComputeConnection, AuthError> {
let mut attempt = 0;
// NOTE: This is messy, but should hopefully be detangled with PGLB.
// We wanted to separate the concerns of **connect** to compute (a PGLB operation),
// from **authenticate** to compute (a NeonKeeper operation).
//
// This unfortunately removed retry handling for one error case where
// the compute was cached, and we connected, but the compute cache was actually stale
// and is associated with the wrong endpoint. We detect this when the **authentication** fails.
// As such, we retry once here if the `authenticate` function fails and the error is valid to retry.
loop {
attempt += 1;
let mut node = connect_to_compute(ctx, config, user_info, tls).await?;
let res = auth_info.authenticate(ctx, &mut node).await;
match res {
Ok(()) => return Ok(node),
Err(e) => {
if attempt < 2
&& let Backend::ControlPlane(cplane, user_info) = user_info
&& let ControlPlaneClient::ProxyV1(cplane_proxy_v1) = &**cplane
&& e.should_retry_wake_compute()
{
tracing::warn!(error = ?e, "retrying wake compute");
let key = user_info.endpoint_cache_key();
cplane_proxy_v1.caches.node_info.invalidate(&key);
continue;
}
return Err(e)?;
}
}
}
}
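
Reviewer note: the loop above retries authentication at most once, and only after invalidating the cached compute entry. A minimal, self-contained sketch of that control flow follows. `NodeCache`, `AuthFailure`, and `connect_with_one_retry` are hypothetical stand-ins introduced for illustration, not the proxy's real types, and the sketch is synchronous where the real code is async.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the proxy's node-info cache.
#[derive(Default)]
pub struct NodeCache {
    entries: HashMap<String, String>,
}

impl NodeCache {
    pub fn insert(&mut self, key: &str, addr: &str) {
        self.entries.insert(key.to_string(), addr.to_string());
    }

    pub fn invalidate(&mut self, key: &str) {
        self.entries.remove(key);
    }

    pub fn contains(&self, key: &str) -> bool {
        self.entries.contains_key(key)
    }
}

/// Hypothetical error type; the real code asks `should_retry_wake_compute()`.
#[derive(Debug, PartialEq)]
pub enum AuthFailure {
    /// The cached address may be stale, so waking the compute again could help.
    StaleCompute,
    /// Retrying cannot help (for example, a wrong password).
    Fatal,
}

impl AuthFailure {
    fn should_retry_wake_compute(&self) -> bool {
        matches!(self, AuthFailure::StaleCompute)
    }
}

/// Runs `connect_and_auth` (given the 1-based attempt number) and retries
/// at most once, invalidating the cache entry before the second attempt.
/// Returns the attempt number that succeeded.
pub fn connect_with_one_retry<F>(
    cache: &mut NodeCache,
    key: &str,
    mut connect_and_auth: F,
) -> Result<u32, AuthFailure>
where
    F: FnMut(u32) -> Result<(), AuthFailure>,
{
    let mut attempt = 0;
    loop {
        attempt += 1;
        match connect_and_auth(attempt) {
            Ok(()) => return Ok(attempt),
            Err(e) if attempt < 2 && e.should_retry_wake_compute() => {
                // Drop the possibly stale entry so the next attempt re-resolves it.
                cache.invalidate(key);
                continue;
            }
            Err(e) => return Err(e),
        }
    }
}
```

The `attempt < 2` bound is what stops a persistently stale or broken compute from causing an unbounded retry loop: a second retryable failure is returned to the caller as-is.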


@@ -1,18 +1,15 @@
use async_trait::async_trait;
use tokio::time;
use tracing::{debug, info, warn};
use crate::compute::{self, COULD_NOT_CONNECT, ComputeConnection};
use crate::config::{ComputeConfig, RetryConfig};
use crate::config::{ComputeConfig, ProxyConfig, RetryConfig};
use crate::context::RequestContext;
use crate::control_plane::errors::WakeComputeError;
use crate::control_plane::locks::ApiLocks;
use crate::control_plane::{self, NodeInfo};
use crate::error::ReportableError;
use crate::metrics::{
ConnectOutcome, ConnectionFailureKind, Metrics, RetriesMetricGroup, RetryType,
};
use crate::proxy::retry::{CouldRetry, ShouldRetryWakeCompute, retry_after, should_retry};
use crate::proxy::retry::{ShouldRetryWakeCompute, retry_after, should_retry};
use crate::proxy::wake_compute::{WakeComputeBackend, wake_compute};
use crate::types::Host;
@@ -35,29 +32,32 @@ pub(crate) fn invalidate_cache(node_info: control_plane::CachedNodeInfo) -> Node
node_info.invalidate()
}
#[async_trait]
pub(crate) trait ConnectMechanism {
type Connection;
type ConnectError: ReportableError;
type Error: From<Self::ConnectError>;
async fn connect_once(
&self,
ctx: &RequestContext,
node_info: &control_plane::CachedNodeInfo,
config: &ComputeConfig,
) -> Result<Self::Connection, Self::ConnectError>;
) -> Result<Self::Connection, compute::ConnectionError>;
}
pub(crate) struct TcpMechanism {
struct TcpMechanism<'a> {
/// connect_to_compute concurrency lock
pub(crate) locks: &'static ApiLocks<Host>,
locks: &'a ApiLocks<Host>,
tls: TlsNegotiation,
}
#[async_trait]
impl ConnectMechanism for TcpMechanism {
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum TlsNegotiation {
/// TLS is assumed
Direct,
/// We must ask for TLS using the postgres SSLRequest message
Postgres,
}
impl ConnectMechanism for TcpMechanism<'_> {
type Connection = ComputeConnection;
type ConnectError = compute::ConnectionError;
type Error = compute::ConnectionError;
#[tracing::instrument(skip_all, fields(
pid = tracing::field::Empty,
@@ -68,25 +68,47 @@ impl ConnectMechanism for TcpMechanism {
ctx: &RequestContext,
node_info: &control_plane::CachedNodeInfo,
config: &ComputeConfig,
) -> Result<ComputeConnection, Self::Error> {
) -> Result<ComputeConnection, compute::ConnectionError> {
let permit = self.locks.get_permit(&node_info.conn_info.host).await?;
permit.release_result(node_info.connect(ctx, config).await)
permit.release_result(
node_info
.conn_info
.connect(ctx, &node_info.aux, config, self.tls)
.await,
)
}
}
/// Try to connect to the compute node, retrying if necessary.
#[tracing::instrument(skip_all)]
pub(crate) async fn connect_to_compute<M: ConnectMechanism, B: WakeComputeBackend>(
pub(crate) async fn connect_to_compute<B: WakeComputeBackend>(
ctx: &RequestContext,
config: &ProxyConfig,
user_info: &B,
tls: TlsNegotiation,
) -> Result<ComputeConnection, compute::ConnectionError> {
connect_to_compute_inner(
ctx,
&TcpMechanism {
locks: &config.connect_compute_locks,
tls,
},
user_info,
config.wake_compute_retry_config,
&config.connect_to_compute,
)
.await
}
/// Try to connect to the compute node, retrying if necessary.
pub(crate) async fn connect_to_compute_inner<M: ConnectMechanism, B: WakeComputeBackend>(
ctx: &RequestContext,
mechanism: &M,
user_info: &B,
wake_compute_retry_config: RetryConfig,
compute: &ComputeConfig,
) -> Result<M::Connection, M::Error>
where
M::ConnectError: CouldRetry + ShouldRetryWakeCompute + std::fmt::Debug,
M::Error: From<WakeComputeError>,
{
) -> Result<M::Connection, compute::ConnectionError> {
let mut num_retries = 0;
let node_info =
wake_compute(&mut num_retries, ctx, user_info, wake_compute_retry_config).await?;
@@ -120,7 +142,7 @@ where
},
num_retries.into(),
);
return Err(err.into());
return Err(err);
}
node_info
} else {
@@ -161,7 +183,7 @@ where
},
num_retries.into(),
);
return Err(e.into());
return Err(e);
}
warn!(error = ?e, num_retries, retriable = true, COULD_NOT_CONNECT);


@@ -1,6 +1,7 @@
#[cfg(test)]
mod tests;
pub(crate) mod connect_auth;
pub(crate) mod connect_compute;
pub(crate) mod retry;
pub(crate) mod wake_compute;
@@ -23,17 +24,13 @@ use tokio::net::TcpStream;
use tokio::sync::oneshot;
use tracing::Instrument;
use crate::cache::Cache;
use crate::cancellation::{CancelClosure, CancellationHandler};
use crate::compute::{ComputeConnection, PostgresError, RustlsStream};
use crate::config::ProxyConfig;
use crate::context::RequestContext;
use crate::control_plane::client::ControlPlaneClient;
pub use crate::pglb::copy_bidirectional::{ErrorSource, copy_bidirectional_client_compute};
use crate::pglb::{ClientMode, ClientRequestError};
use crate::pqproto::{BeMessage, CancelKeyData, StartupMessageParams};
use crate::proxy::connect_compute::{TcpMechanism, connect_to_compute};
use crate::proxy::retry::ShouldRetryWakeCompute;
use crate::rate_limiter::EndpointRateLimiter;
use crate::stream::{PqStream, Stream};
use crate::types::EndpointCacheKey;
@@ -95,61 +92,24 @@ pub(crate) async fn handle_client<S: AsyncRead + AsyncWrite + Unpin + Send>(
let mut auth_info = compute::AuthInfo::with_auth_keys(creds.keys);
auth_info.set_startup_params(params, params_compat);
let mut node;
let mut attempt = 0;
let connect = TcpMechanism {
locks: &config.connect_compute_locks,
};
let backend = auth::Backend::ControlPlane(cplane, creds.info);
// NOTE: This is messy, but should hopefully be detangled with PGLB.
// We wanted to separate the concerns of **connect** to compute (a PGLB operation),
// from **authenticate** to compute (a NeonKeeper operation).
//
// This unfortunately removed retry handling for one error case where
// the compute was cached, and we connected, but the compute cache was actually stale
// and is associated with the wrong endpoint. We detect this when the **authentication** fails.
// As such, we retry once here if the `authenticate` function fails and the error is valid to retry.
loop {
attempt += 1;
// TODO: callback to pglb
let res = connect_auth::connect_to_compute_and_auth(
ctx,
config,
&backend,
auth_info,
connect_compute::TlsNegotiation::Postgres,
)
.await;
// TODO: callback to pglb
let res = connect_to_compute(
ctx,
&connect,
&backend,
config.wake_compute_retry_config,
&config.connect_to_compute,
)
.await;
let mut node = match res {
Ok(node) => node,
Err(e) => Err(client.throw_error(e, Some(ctx)).await)?,
};
match res {
Ok(n) => node = n,
Err(e) => return Err(client.throw_error(e, Some(ctx)).await)?,
}
let auth::Backend::ControlPlane(cplane, user_info) = &backend else {
unreachable!("ensured above");
};
let res = auth_info.authenticate(ctx, &mut node).await;
match res {
Ok(()) => {
send_client_greeting(ctx, &config.greetings, client);
break;
}
Err(e) if attempt < 2 && e.should_retry_wake_compute() => {
tracing::warn!(error = ?e, "retrying wake compute");
#[allow(irrefutable_let_patterns)]
if let ControlPlaneClient::ProxyV1(cplane_proxy_v1) = &**cplane {
let key = user_info.endpoint_cache_key();
cplane_proxy_v1.caches.node_info.invalidate(&key);
}
}
Err(e) => Err(client.throw_error(e, Some(ctx)).await)?,
}
}
send_client_greeting(ctx, &config.greetings, client);
let auth::Backend::ControlPlane(_, user_info) = backend else {
unreachable!("ensured above");


@@ -31,18 +31,6 @@ impl CouldRetry for io::Error {
}
}
impl CouldRetry for postgres_client::error::DbError {
fn could_retry(&self) -> bool {
use postgres_client::error::SqlState;
matches!(
self.code(),
&SqlState::CONNECTION_FAILURE
| &SqlState::CONNECTION_EXCEPTION
| &SqlState::CONNECTION_DOES_NOT_EXIST
| &SqlState::SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION,
)
}
}
impl ShouldRetryWakeCompute for postgres_client::error::DbError {
fn should_retry_wake_compute(&self) -> bool {
use postgres_client::error::SqlState;
@@ -73,17 +61,6 @@ impl ShouldRetryWakeCompute for postgres_client::error::DbError {
}
}
impl CouldRetry for postgres_client::Error {
fn could_retry(&self) -> bool {
if let Some(io_err) = self.source().and_then(|x| x.downcast_ref()) {
io::Error::could_retry(io_err)
} else if let Some(db_err) = self.source().and_then(|x| x.downcast_ref()) {
postgres_client::error::DbError::could_retry(db_err)
} else {
false
}
}
}
impl ShouldRetryWakeCompute for postgres_client::Error {
fn should_retry_wake_compute(&self) -> bool {
if let Some(db_err) = self.source().and_then(|x| x.downcast_ref()) {
@@ -102,6 +79,8 @@ impl CouldRetry for compute::ConnectionError {
compute::ConnectionError::TlsError(err) => err.could_retry(),
compute::ConnectionError::WakeComputeError(err) => err.could_retry(),
compute::ConnectionError::TooManyConnectionAttempts(_) => false,
#[cfg(test)]
compute::ConnectionError::TestError { retryable, .. } => *retryable,
}
}
}
@@ -110,6 +89,8 @@ impl ShouldRetryWakeCompute for compute::ConnectionError {
match self {
// the cache entry was not checked for validity
compute::ConnectionError::TooManyConnectionAttempts(_) => false,
#[cfg(test)]
compute::ConnectionError::TestError { wakeable, .. } => *wakeable,
_ => true,
}
}


@@ -15,6 +15,7 @@ use rstest::rstest;
use rustls::crypto::ring;
use rustls::pki_types;
use tokio::io::{AsyncRead, AsyncWrite, DuplexStream};
use tokio::time::Instant;
use tracing_test::traced_test;
use super::retry::CouldRetry;
@@ -24,13 +25,13 @@ use crate::context::RequestContext;
use crate::control_plane::client::{ControlPlaneClient, TestControlPlaneClient};
use crate::control_plane::messages::{ControlPlaneErrorMessage, Details, MetricsAuxInfo, Status};
use crate::control_plane::{self, CachedNodeInfo, NodeInfo, NodeInfoCache};
use crate::error::{ErrorKind, ReportableError};
use crate::error::ErrorKind;
use crate::pglb::ERR_INSECURE_CONNECTION;
use crate::pglb::handshake::{HandshakeData, handshake};
use crate::pqproto::BeMessage;
use crate::proxy::NeonOptions;
use crate::proxy::connect_compute::{ConnectMechanism, connect_to_compute};
use crate::proxy::retry::{ShouldRetryWakeCompute, retry_after};
use crate::proxy::connect_compute::{ConnectMechanism, connect_to_compute_inner};
use crate::proxy::retry::retry_after;
use crate::stream::{PqStream, Stream};
use crate::tls::client_config::compute_client_config_with_certs;
use crate::tls::server_config::CertResolver;
@@ -430,71 +431,36 @@ impl TestConnectMechanism {
#[derive(Debug)]
struct TestConnection;
#[derive(Debug)]
struct TestConnectError {
retryable: bool,
wakeable: bool,
kind: crate::error::ErrorKind,
}
impl ReportableError for TestConnectError {
fn get_error_kind(&self) -> crate::error::ErrorKind {
self.kind
}
}
impl std::fmt::Display for TestConnectError {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{self:?}")
}
}
impl std::error::Error for TestConnectError {}
impl CouldRetry for TestConnectError {
fn could_retry(&self) -> bool {
self.retryable
}
}
impl ShouldRetryWakeCompute for TestConnectError {
fn should_retry_wake_compute(&self) -> bool {
self.wakeable
}
}
#[async_trait]
impl ConnectMechanism for TestConnectMechanism {
type Connection = TestConnection;
type ConnectError = TestConnectError;
type Error = anyhow::Error;
async fn connect_once(
&self,
_ctx: &RequestContext,
_node_info: &control_plane::CachedNodeInfo,
_config: &ComputeConfig,
) -> Result<Self::Connection, Self::ConnectError> {
) -> Result<Self::Connection, compute::ConnectionError> {
let mut counter = self.counter.lock().unwrap();
let action = self.sequence[*counter];
*counter += 1;
match action {
ConnectAction::Connect => Ok(TestConnection),
ConnectAction::Retry => Err(TestConnectError {
ConnectAction::Retry => Err(compute::ConnectionError::TestError {
retryable: true,
wakeable: true,
kind: ErrorKind::Compute,
}),
ConnectAction::RetryNoWake => Err(TestConnectError {
ConnectAction::RetryNoWake => Err(compute::ConnectionError::TestError {
retryable: true,
wakeable: false,
kind: ErrorKind::Compute,
}),
ConnectAction::Fail => Err(TestConnectError {
ConnectAction::Fail => Err(compute::ConnectionError::TestError {
retryable: false,
wakeable: true,
kind: ErrorKind::Compute,
}),
ConnectAction::FailNoWake => Err(TestConnectError {
ConnectAction::FailNoWake => Err(compute::ConnectionError::TestError {
retryable: false,
wakeable: false,
kind: ErrorKind::Compute,
@@ -536,7 +502,7 @@ impl TestControlPlaneClient for TestConnectMechanism {
details: Details {
error_info: None,
retry_info: Some(control_plane::messages::RetryInfo {
retry_delay_ms: 1,
retry_at: Instant::now() + Duration::from_millis(1),
}),
user_facing_message: None,
},
@@ -620,7 +586,7 @@ async fn connect_to_compute_success() {
let mechanism = TestConnectMechanism::new(vec![Wake, Connect]);
let user_info = helper_create_connect_info(&mechanism);
let config = config();
connect_to_compute(&ctx, &mechanism, &user_info, config.retry, &config)
connect_to_compute_inner(&ctx, &mechanism, &user_info, config.retry, &config)
.await
.unwrap();
mechanism.verify();
@@ -634,7 +600,7 @@ async fn connect_to_compute_retry() {
let mechanism = TestConnectMechanism::new(vec![Wake, Retry, Wake, Connect]);
let user_info = helper_create_connect_info(&mechanism);
let config = config();
connect_to_compute(&ctx, &mechanism, &user_info, config.retry, &config)
connect_to_compute_inner(&ctx, &mechanism, &user_info, config.retry, &config)
.await
.unwrap();
mechanism.verify();
@@ -649,7 +615,7 @@ async fn connect_to_compute_non_retry_1() {
let mechanism = TestConnectMechanism::new(vec![Wake, Retry, Wake, Fail]);
let user_info = helper_create_connect_info(&mechanism);
let config = config();
connect_to_compute(&ctx, &mechanism, &user_info, config.retry, &config)
connect_to_compute_inner(&ctx, &mechanism, &user_info, config.retry, &config)
.await
.unwrap_err();
mechanism.verify();
@@ -664,7 +630,7 @@ async fn connect_to_compute_non_retry_2() {
let mechanism = TestConnectMechanism::new(vec![Wake, Fail, Wake, Connect]);
let user_info = helper_create_connect_info(&mechanism);
let config = config();
connect_to_compute(&ctx, &mechanism, &user_info, config.retry, &config)
connect_to_compute_inner(&ctx, &mechanism, &user_info, config.retry, &config)
.await
.unwrap();
mechanism.verify();
@@ -686,7 +652,7 @@ async fn connect_to_compute_non_retry_3() {
backoff_factor: 2.0,
};
let config = config();
connect_to_compute(
connect_to_compute_inner(
&ctx,
&mechanism,
&user_info,
@@ -707,7 +673,7 @@ async fn wake_retry() {
let mechanism = TestConnectMechanism::new(vec![WakeRetry, Wake, Connect]);
let user_info = helper_create_connect_info(&mechanism);
let config = config();
connect_to_compute(&ctx, &mechanism, &user_info, config.retry, &config)
connect_to_compute_inner(&ctx, &mechanism, &user_info, config.retry, &config)
.await
.unwrap();
mechanism.verify();
@@ -722,7 +688,7 @@ async fn wake_non_retry() {
let mechanism = TestConnectMechanism::new(vec![WakeRetry, WakeFail]);
let user_info = helper_create_connect_info(&mechanism);
let config = config();
connect_to_compute(&ctx, &mechanism, &user_info, config.retry, &config)
connect_to_compute_inner(&ctx, &mechanism, &user_info, config.retry, &config)
.await
.unwrap_err();
mechanism.verify();
@@ -741,7 +707,7 @@ async fn fail_but_wake_invalidates_cache() {
let user = helper_create_connect_info(&mech);
let cfg = config();
connect_to_compute(&ctx, &mech, &user, cfg.retry, &cfg)
connect_to_compute_inner(&ctx, &mech, &user, cfg.retry, &cfg)
.await
.unwrap();
@@ -762,7 +728,7 @@ async fn fail_no_wake_skips_cache_invalidation() {
let user = helper_create_connect_info(&mech);
let cfg = config();
connect_to_compute(&ctx, &mech, &user, cfg.retry, &cfg)
connect_to_compute_inner(&ctx, &mech, &user, cfg.retry, &cfg)
.await
.unwrap();
@@ -783,7 +749,7 @@ async fn retry_but_wake_invalidates_cache() {
let user_info = helper_create_connect_info(&mechanism);
let cfg = config();
connect_to_compute(&ctx, &mechanism, &user_info, cfg.retry, &cfg)
connect_to_compute_inner(&ctx, &mechanism, &user_info, cfg.retry, &cfg)
.await
.unwrap();
mechanism.verify();
@@ -806,7 +772,7 @@ async fn retry_no_wake_skips_invalidation() {
let user_info = helper_create_connect_info(&mechanism);
let cfg = config();
connect_to_compute(&ctx, &mechanism, &user_info, cfg.retry, &cfg)
connect_to_compute_inner(&ctx, &mechanism, &user_info, cfg.retry, &cfg)
.await
.unwrap_err();
mechanism.verify();
@@ -829,7 +795,7 @@ async fn retry_no_wake_error_fast() {
let user_info = helper_create_connect_info(&mechanism);
let cfg = config();
connect_to_compute(&ctx, &mechanism, &user_info, cfg.retry, &cfg)
connect_to_compute_inner(&ctx, &mechanism, &user_info, cfg.retry, &cfg)
.await
.unwrap_err();
mechanism.verify();
@@ -852,7 +818,7 @@ async fn retry_cold_wake_skips_invalidation() {
let user_info = helper_create_connect_info(&mechanism);
let cfg = config();
connect_to_compute(&ctx, &mechanism, &user_info, cfg.retry, &cfg)
connect_to_compute_inner(&ctx, &mechanism, &user_info, cfg.retry, &cfg)
.await
.unwrap();
mechanism.verify();


@@ -131,11 +131,11 @@ where
Ok(())
}
struct MessageHandler<C: ProjectInfoCache + Send + Sync + 'static> {
struct MessageHandler<C: Send + Sync + 'static> {
cache: Arc<C>,
}
impl<C: ProjectInfoCache + Send + Sync + 'static> Clone for MessageHandler<C> {
impl<C: Send + Sync + 'static> Clone for MessageHandler<C> {
fn clone(&self) -> Self {
Self {
cache: self.cache.clone(),
@@ -143,8 +143,8 @@ impl<C: ProjectInfoCache + Send + Sync + 'static> Clone for MessageHandler<C> {
}
}
impl<C: ProjectInfoCache + Send + Sync + 'static> MessageHandler<C> {
pub(crate) fn new(cache: Arc<C>) -> Self {
impl MessageHandler<ProjectInfoCache> {
pub(crate) fn new(cache: Arc<ProjectInfoCache>) -> Self {
Self { cache }
}
@@ -224,7 +224,7 @@ impl<C: ProjectInfoCache + Send + Sync + 'static> MessageHandler<C> {
}
}
fn invalidate_cache<C: ProjectInfoCache>(cache: Arc<C>, msg: Notification) {
fn invalidate_cache(cache: Arc<ProjectInfoCache>, msg: Notification) {
match msg {
Notification::EndpointSettingsUpdate(ids) => ids
.iter()
@@ -247,8 +247,8 @@ fn invalidate_cache<C: ProjectInfoCache>(cache: Arc<C>, msg: Notification) {
}
}
async fn handle_messages<C: ProjectInfoCache + Send + Sync + 'static>(
handler: MessageHandler<C>,
async fn handle_messages(
handler: MessageHandler<ProjectInfoCache>,
redis: ConnectionWithCredentialsProvider,
cancellation_token: CancellationToken,
) -> anyhow::Result<()> {
@@ -284,13 +284,10 @@ async fn handle_messages<C: ProjectInfoCache + Send + Sync + 'static>(
/// Handle console's invalidation messages.
#[tracing::instrument(name = "redis_notifications", skip_all)]
pub async fn task_main<C>(
pub async fn task_main(
redis: ConnectionWithCredentialsProvider,
cache: Arc<C>,
) -> anyhow::Result<Infallible>
where
C: ProjectInfoCache + Send + Sync + 'static,
{
cache: Arc<ProjectInfoCache>,
) -> anyhow::Result<Infallible> {
let handler = MessageHandler::new(cache);
// 6h - 1m.
// There will be 1 minute overlap between two tasks. But at least we can be sure that no message is lost.


@@ -1,17 +1,11 @@
use std::io;
use std::net::{IpAddr, SocketAddr};
use std::sync::Arc;
use std::time::Duration;
use async_trait::async_trait;
use ed25519_dalek::SigningKey;
use hyper_util::rt::{TokioExecutor, TokioIo, TokioTimer};
use jose_jwk::jose_b64;
use postgres_client::config::SslMode;
use postgres_client::maybe_tls_stream::MaybeTlsStream;
use rand_core::OsRng;
use rustls::pki_types::{DnsName, ServerName};
use tokio::net::{TcpStream, lookup_host};
use tokio_rustls::TlsConnector;
use tracing::field::display;
use tracing::{debug, info};
@@ -21,23 +15,22 @@ use super::conn_pool_lib::{Client, ConnInfo, EndpointConnPool, GlobalConnPool};
use super::http_conn_pool::{self, HttpConnPool, LocalProxyClient, poll_http2_client};
use super::local_conn_pool::{self, EXT_NAME, EXT_SCHEMA, EXT_VERSION, LocalConnPool};
use crate::auth::backend::local::StaticAuthRules;
use crate::auth::backend::{ComputeCredentialKeys, ComputeCredentials, ComputeUserInfo};
use crate::auth::backend::{ComputeCredentials, ComputeUserInfo};
use crate::auth::{self, AuthError};
use crate::compute;
use crate::compute_ctl::{
ComputeCtlError, ExtensionInstallRequest, Privilege, SetRoleGrantsRequest,
};
use crate::config::{ComputeConfig, ProxyConfig};
use crate::config::ProxyConfig;
use crate::context::RequestContext;
use crate::control_plane::CachedNodeInfo;
use crate::control_plane::client::ApiLockError;
use crate::control_plane::errors::{GetAuthInfoError, WakeComputeError};
use crate::control_plane::locks::ApiLocks;
use crate::error::{ErrorKind, ReportableError, UserFacingError};
use crate::intern::EndpointIdInt;
use crate::proxy::connect_compute::ConnectMechanism;
use crate::proxy::retry::{CouldRetry, ShouldRetryWakeCompute};
use crate::pqproto::StartupMessageParams;
use crate::proxy::{connect_auth, connect_compute};
use crate::rate_limiter::EndpointRateLimiter;
use crate::types::{EndpointId, Host, LOCAL_PROXY_SUFFIX};
use crate::types::{EndpointId, LOCAL_PROXY_SUFFIX};
pub(crate) struct PoolingBackend {
pub(crate) http_conn_pool:
@@ -186,20 +179,42 @@ impl PoolingBackend {
tracing::Span::current().record("conn_id", display(conn_id));
info!(%conn_id, "pool: opening a new connection '{conn_info}'");
let backend = self.auth_backend.as_ref().map(|()| keys.info);
crate::proxy::connect_compute::connect_to_compute(
let mut params = StartupMessageParams::default();
params.insert("database", &conn_info.dbname);
params.insert("user", &conn_info.user_info.user);
let mut auth_info = compute::AuthInfo::with_auth_keys(keys.keys);
auth_info.set_startup_params(&params, true);
let node = connect_auth::connect_to_compute_and_auth(
ctx,
&TokioMechanism {
conn_id,
conn_info,
pool: self.pool.clone(),
locks: &self.config.connect_compute_locks,
keys: keys.keys,
},
self.config,
&backend,
self.config.wake_compute_retry_config,
&self.config.connect_to_compute,
auth_info,
connect_compute::TlsNegotiation::Postgres,
)
.await
.await?;
let (client, connection) = postgres_client::connect::managed(
node.stream,
Some(node.socket_addr.ip()),
postgres_client::config::Host::Tcp(node.hostname.to_string()),
node.socket_addr.port(),
node.ssl_mode,
Some(self.config.connect_to_compute.timeout),
)
.await?;
Ok(poll_client(
self.pool.clone(),
ctx,
conn_info,
client,
connection,
conn_id,
node.aux,
))
}
// Wake up the destination if needed
@@ -228,19 +243,38 @@ impl PoolingBackend {
)),
options: conn_info.user_info.options.clone(),
});
crate::proxy::connect_compute::connect_to_compute(
let node = connect_compute::connect_to_compute(
ctx,
&HyperMechanism {
conn_id,
conn_info,
pool: self.http_conn_pool.clone(),
locks: &self.config.connect_compute_locks,
},
self.config,
&backend,
self.config.wake_compute_retry_config,
&self.config.connect_to_compute,
connect_compute::TlsNegotiation::Direct,
)
.await
.await?;
let stream = match node.stream.into_framed().into_inner() {
MaybeTlsStream::Raw(s) => Box::pin(s) as AsyncRW,
MaybeTlsStream::Tls(s) => Box::pin(s) as AsyncRW,
};
let (client, connection) = hyper::client::conn::http2::Builder::new(TokioExecutor::new())
.timer(TokioTimer::new())
.keep_alive_interval(Duration::from_secs(20))
.keep_alive_while_idle(true)
.keep_alive_timeout(Duration::from_secs(5))
.handshake(TokioIo::new(stream))
.await
.map_err(LocalProxyConnError::H2)?;
Ok(poll_http2_client(
self.http_conn_pool.clone(),
ctx,
&conn_info,
client,
connection,
conn_id,
node.aux.clone(),
))
}
/// Connect to postgres over localhost.
@@ -380,6 +414,8 @@ fn create_random_jwk() -> (SigningKey, jose_jwk::Key) {
pub(crate) enum HttpConnError {
#[error("pooled connection closed at inconsistent state")]
ConnectionClosedAbruptly(#[from] tokio::sync::watch::error::SendError<uuid::Uuid>),
#[error("could not connect to compute")]
ConnectError(#[from] compute::ConnectionError),
#[error("could not connect to postgres in compute")]
PostgresConnectionError(#[from] postgres_client::Error),
#[error("could not connect to local-proxy in compute")]
@@ -399,10 +435,19 @@ pub(crate) enum HttpConnError {
TooManyConnectionAttempts(#[from] ApiLockError),
}
impl From<connect_auth::AuthError> for HttpConnError {
fn from(value: connect_auth::AuthError) -> Self {
match value {
connect_auth::AuthError::Auth(compute::PostgresError::Postgres(error)) => {
Self::PostgresConnectionError(error)
}
connect_auth::AuthError::Connect(error) => Self::ConnectError(error),
}
}
}
#[derive(Debug, thiserror::Error)]
pub(crate) enum LocalProxyConnError {
#[error("error with connection to local-proxy")]
Io(#[source] std::io::Error),
#[error("could not establish h2 connection")]
H2(#[from] hyper::Error),
}
@@ -410,6 +455,7 @@ pub(crate) enum LocalProxyConnError {
impl ReportableError for HttpConnError {
fn get_error_kind(&self) -> ErrorKind {
match self {
HttpConnError::ConnectError(_) => ErrorKind::Compute,
HttpConnError::ConnectionClosedAbruptly(_) => ErrorKind::Compute,
HttpConnError::PostgresConnectionError(p) => {
if p.as_db_error().is_some() {
@@ -434,6 +480,7 @@ impl ReportableError for HttpConnError {
impl UserFacingError for HttpConnError {
fn to_string_client(&self) -> String {
match self {
HttpConnError::ConnectError(p) => p.to_string_client(),
HttpConnError::ConnectionClosedAbruptly(_) => self.to_string(),
HttpConnError::PostgresConnectionError(p) => p.to_string(),
HttpConnError::LocalProxyConnectionError(p) => p.to_string(),
@@ -449,36 +496,9 @@ impl UserFacingError for HttpConnError {
}
}
impl CouldRetry for HttpConnError {
fn could_retry(&self) -> bool {
match self {
HttpConnError::PostgresConnectionError(e) => e.could_retry(),
HttpConnError::LocalProxyConnectionError(e) => e.could_retry(),
HttpConnError::ComputeCtl(_) => false,
HttpConnError::ConnectionClosedAbruptly(_) => false,
HttpConnError::JwtPayloadError(_) => false,
HttpConnError::GetAuthInfo(_) => false,
HttpConnError::AuthError(_) => false,
HttpConnError::WakeCompute(_) => false,
HttpConnError::TooManyConnectionAttempts(_) => false,
}
}
}
impl ShouldRetryWakeCompute for HttpConnError {
fn should_retry_wake_compute(&self) -> bool {
match self {
HttpConnError::PostgresConnectionError(e) => e.should_retry_wake_compute(),
// we never checked cache validity
HttpConnError::TooManyConnectionAttempts(_) => false,
_ => true,
}
}
}
impl ReportableError for LocalProxyConnError {
fn get_error_kind(&self) -> ErrorKind {
match self {
LocalProxyConnError::Io(_) => ErrorKind::Compute,
LocalProxyConnError::H2(_) => ErrorKind::Compute,
}
}
@@ -489,215 +509,3 @@ impl UserFacingError for LocalProxyConnError {
"Could not establish HTTP connection to the database".to_string()
}
}
impl CouldRetry for LocalProxyConnError {
fn could_retry(&self) -> bool {
match self {
LocalProxyConnError::Io(_) => false,
LocalProxyConnError::H2(_) => false,
}
}
}
impl ShouldRetryWakeCompute for LocalProxyConnError {
fn should_retry_wake_compute(&self) -> bool {
match self {
LocalProxyConnError::Io(_) => false,
LocalProxyConnError::H2(_) => false,
}
}
}
struct TokioMechanism {
pool: Arc<GlobalConnPool<postgres_client::Client, EndpointConnPool<postgres_client::Client>>>,
conn_info: ConnInfo,
conn_id: uuid::Uuid,
keys: ComputeCredentialKeys,
/// connect_to_compute concurrency lock
locks: &'static ApiLocks<Host>,
}
#[async_trait]
impl ConnectMechanism for TokioMechanism {
type Connection = Client<postgres_client::Client>;
type ConnectError = HttpConnError;
type Error = HttpConnError;
async fn connect_once(
&self,
ctx: &RequestContext,
node_info: &CachedNodeInfo,
compute_config: &ComputeConfig,
) -> Result<Self::Connection, Self::ConnectError> {
let permit = self.locks.get_permit(&node_info.conn_info.host).await?;
let mut config = node_info.conn_info.to_postgres_client_config();
let config = config
.user(&self.conn_info.user_info.user)
.dbname(&self.conn_info.dbname)
.connect_timeout(compute_config.timeout);
if let ComputeCredentialKeys::AuthKeys(auth_keys) = self.keys {
config.auth_keys(auth_keys);
}
let pause = ctx.latency_timer_pause(crate::metrics::Waiting::Compute);
let res = config.connect(compute_config).await;
drop(pause);
let (client, connection) = permit.release_result(res)?;
tracing::Span::current().record("pid", tracing::field::display(client.get_process_id()));
tracing::Span::current().record(
"compute_id",
tracing::field::display(&node_info.aux.compute_id),
);
if let Some(query_id) = ctx.get_testodrome_id() {
info!("latency={}, query_id={}", ctx.get_proxy_latency(), query_id);
}
Ok(poll_client(
self.pool.clone(),
ctx,
self.conn_info.clone(),
client,
connection,
self.conn_id,
node_info.aux.clone(),
))
}
}
struct HyperMechanism {
pool: Arc<GlobalConnPool<LocalProxyClient, HttpConnPool<LocalProxyClient>>>,
conn_info: ConnInfo,
conn_id: uuid::Uuid,
/// connect_to_compute concurrency lock
locks: &'static ApiLocks<Host>,
}
#[async_trait]
impl ConnectMechanism for HyperMechanism {
type Connection = http_conn_pool::Client<LocalProxyClient>;
type ConnectError = HttpConnError;
type Error = HttpConnError;
async fn connect_once(
&self,
ctx: &RequestContext,
node_info: &CachedNodeInfo,
config: &ComputeConfig,
) -> Result<Self::Connection, Self::ConnectError> {
let host_addr = node_info.conn_info.host_addr;
let host = &node_info.conn_info.host;
let permit = self.locks.get_permit(host).await?;
let pause = ctx.latency_timer_pause(crate::metrics::Waiting::Compute);
let tls = if node_info.conn_info.ssl_mode == SslMode::Disable {
None
} else {
Some(&config.tls)
};
let port = node_info.conn_info.port;
let res = connect_http2(host_addr, host, port, config.timeout, tls).await;
drop(pause);
let (client, connection) = permit.release_result(res)?;
tracing::Span::current().record(
"compute_id",
tracing::field::display(&node_info.aux.compute_id),
);
if let Some(query_id) = ctx.get_testodrome_id() {
info!("latency={}, query_id={}", ctx.get_proxy_latency(), query_id);
}
Ok(poll_http2_client(
self.pool.clone(),
ctx,
&self.conn_info,
client,
connection,
self.conn_id,
node_info.aux.clone(),
))
}
}
async fn connect_http2(
host_addr: Option<IpAddr>,
host: &str,
port: u16,
timeout: Duration,
tls: Option<&Arc<rustls::ClientConfig>>,
) -> Result<
(
http_conn_pool::LocalProxyClient,
http_conn_pool::LocalProxyConnection,
),
LocalProxyConnError,
> {
let addrs = match host_addr {
Some(addr) => vec![SocketAddr::new(addr, port)],
None => lookup_host((host, port))
.await
.map_err(LocalProxyConnError::Io)?
.collect(),
};
let mut last_err = None;
let mut addrs = addrs.into_iter();
let stream = loop {
let Some(addr) = addrs.next() else {
return Err(last_err.unwrap_or_else(|| {
LocalProxyConnError::Io(io::Error::new(
io::ErrorKind::InvalidInput,
"could not resolve any addresses",
))
}));
};
match tokio::time::timeout(timeout, TcpStream::connect(addr)).await {
Ok(Ok(stream)) => {
stream.set_nodelay(true).map_err(LocalProxyConnError::Io)?;
break stream;
}
Ok(Err(e)) => {
last_err = Some(LocalProxyConnError::Io(e));
}
Err(e) => {
last_err = Some(LocalProxyConnError::Io(io::Error::new(
io::ErrorKind::TimedOut,
e,
)));
}
}
};
let stream = if let Some(tls) = tls {
let host = DnsName::try_from(host)
.map_err(io::Error::other)
.map_err(LocalProxyConnError::Io)?
.to_owned();
let stream = TlsConnector::from(tls.clone())
.connect(ServerName::DnsName(host), stream)
.await
.map_err(LocalProxyConnError::Io)?;
Box::pin(stream) as AsyncRW
} else {
Box::pin(stream) as AsyncRW
};
let (client, connection) = hyper::client::conn::http2::Builder::new(TokioExecutor::new())
.timer(TokioTimer::new())
.keep_alive_interval(Duration::from_secs(20))
.keep_alive_while_idle(true)
.keep_alive_timeout(Duration::from_secs(5))
.handshake(TokioIo::new(stream))
.await?;
Ok((client, connection))
}


@@ -149,8 +149,8 @@ impl DbSchemaCache {
ctx: &RequestContext,
config: &'static ProxyConfig,
) -> Result<Arc<(ApiConfig, DbSchemaOwned)>, RestError> {
match self.get_with_created_at(endpoint_id) {
Some(Cached { value: (v, _), .. }) => Ok(v),
match self.get(endpoint_id) {
Some(Cached { value: v, .. }) => Ok(v),
None => {
info!("db_schema cache miss for endpoint: {:?}", endpoint_id);
let remote_value = self


@@ -981,6 +981,7 @@ impl Reconciler {
));
}
let mut first_err = None;
for (node, conf) in changes {
if self.cancel.is_cancelled() {
return Err(ReconcileError::Cancel);
@@ -990,7 +991,12 @@ impl Reconciler {
// shard _available_ (the attached location), and configuring secondary locations
// can be done lazily when the node becomes available (via background reconciliation).
if node.is_available() {
self.location_config(&node, conf, None, false).await?;
let res = self.location_config(&node, conf, None, false).await;
if let Err(err) = res {
if first_err.is_none() {
first_err = Some(err);
}
}
} else {
// If the node is unavailable, we skip and consider the reconciliation successful: this
// is a common case where a pageserver is marked unavailable: we demote a location on
@@ -1002,6 +1008,10 @@ impl Reconciler {
}
}
if let Some(err) = first_err {
return Err(err);
}
// The condition below identifies a detach. We must have no attached intent and
// must have been attached to something previously. Pass this information to
// the [`ComputeHook`] such that it can update its tenant-wide state.
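
Reviewer sketch (not part of the change): the hunk above replaces an early return on the first `location_config` failure with "attempt every change, remember the first error, fail afterwards". The pattern can be sketched in Python as below. `apply_all_keep_first_error` and the `apply` callable are hypothetical stand-ins for the reconciler loop and `location_config`.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


def apply_all_keep_first_error(
    changes: Iterable[Tuple[str, Dict]],
    apply: Callable[[str, Dict], None],  # hypothetical stand-in for location_config
) -> Optional[Exception]:
    """Attempt every change; return the first error seen, or None."""
    first_err: Optional[Exception] = None
    for node, conf in changes:
        try:
            apply(node, conf)
        except Exception as err:
            # Keep going so one bad node cannot block the remaining nodes,
            # but remember the first failure so the caller still fails.
            if first_err is None:
                first_err = err
    return first_err
```

With this shape, a node that cannot apply its config no longer prevents the healthy attached location from being detached, which is the failure sequence described in the commit message.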


@@ -1530,10 +1530,19 @@ impl Service {
// so that waiters will see the correct error after waiting.
tenant.set_last_error(result.sequence, e);
// Skip deletions on reconcile failures
let upsert_deltas =
deltas.filter(|delta| matches!(delta, ObservedStateDelta::Upsert(_)));
tenant.apply_observed_deltas(upsert_deltas);
// If the reconciliation failed, don't clear the observed state for places where we
// detached. Instead, mark the observed state as uncertain.
let failed_reconcile_deltas = deltas.map(|delta| {
if let ObservedStateDelta::Delete(node_id) = delta {
ObservedStateDelta::Upsert(Box::new((
node_id,
ObservedStateLocation { conf: None },
)))
} else {
delta
}
});
tenant.apply_observed_deltas(failed_reconcile_deltas);
}
}
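
Reviewer sketch (not part of the change): after a failed reconcile, the new code no longer drops `Delete` deltas. It rewrites each one into an `Upsert` whose `conf` is `None`, meaning the location's state is uncertain. A small Python model of that mapping follows. The `Upsert` and `Delete` dataclasses are simplified, hypothetical stand-ins for `ObservedStateDelta`.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class Upsert:
    node_id: int
    conf: Optional[dict]  # None models "observed state is uncertain"


@dataclass(frozen=True)
class Delete:
    node_id: int


Delta = Union[Upsert, Delete]


def deltas_after_failed_reconcile(deltas: List[Delta]) -> List[Delta]:
    """Turn every Delete into an uncertain Upsert; pass Upserts through."""
    return [
        Upsert(d.node_id, None) if isinstance(d, Delete) else d
        for d in deltas
    ]
```

The effect is that a later reconciliation sees a location with unknown state and has to configure it again, rather than assuming the detach either happened or never needed to.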


@@ -249,6 +249,10 @@ impl IntentState {
}
pub(crate) fn push_secondary(&mut self, scheduler: &mut Scheduler, new_secondary: NodeId) {
// Every assertion here should probably have a corresponding check in
// `validate_optimization` unless it is an invariant that should never be violated. Note
// that the lock is not held between planning optimizations and applying them so you have to
// assume any valid state transition of the intent state may have occurred
assert!(!self.secondary.contains(&new_secondary));
assert!(self.attached != Some(new_secondary));
scheduler.update_node_ref_counts(
@@ -1335,8 +1339,9 @@ impl TenantShard {
true
}
/// Check that the desired modifications to the intent state are compatible with
/// the current intent state
/// Check that the desired modifications to the intent state are compatible with the current
/// intent state. Note that the lock is not held between planning optimizations and applying
/// them so any valid state transition of the intent state may have occurred.
fn validate_optimization(&self, optimization: &ScheduleOptimization) -> bool {
match optimization.action {
ScheduleOptimizationAction::MigrateAttachment(MigrateAttachment {
@@ -1352,6 +1357,9 @@ impl TenantShard {
}) => {
// It's legal to remove a secondary that is not present in the intent state
!self.intent.secondary.contains(&new_node_id)
// Ensure the secondary hasn't already been promoted to attached by a concurrent
// optimization/migration.
&& self.intent.attached != Some(new_node_id)
}
ScheduleOptimizationAction::CreateSecondary(new_node_id) => {
!self.intent.secondary.contains(&new_node_id)


@@ -587,7 +587,9 @@ class NeonLocalCli(AbstractNeonCli):
]
extra_env_vars = env or {}
if basebackup_request_tries is not None:
extra_env_vars["NEON_COMPUTE_TESTING_BASEBACKUP_TRIES"] = str(basebackup_request_tries)
extra_env_vars["NEON_COMPUTE_TESTING_BASEBACKUP_RETRIES"] = str(
basebackup_request_tries
)
if remote_ext_base_url is not None:
args.extend(["--remote-ext-base-url", remote_ext_base_url])
@@ -623,6 +625,7 @@ class NeonLocalCli(AbstractNeonCli):
pageserver_id: int | None = None,
safekeepers: list[int] | None = None,
check_return_code=True,
timeout_sec: float | None = None,
) -> subprocess.CompletedProcess[str]:
args = ["endpoint", "reconfigure", endpoint_id]
if tenant_id is not None:
@@ -631,7 +634,16 @@ class NeonLocalCli(AbstractNeonCli):
args.extend(["--pageserver-id", str(pageserver_id)])
if safekeepers is not None:
args.extend(["--safekeepers", (",".join(map(str, safekeepers)))])
return self.raw_cli(args, check_return_code=check_return_code)
return self.raw_cli(args, check_return_code=check_return_code, timeout=timeout_sec)
def endpoint_refresh_configuration(
self,
endpoint_id: str,
) -> subprocess.CompletedProcess[str]:
args = ["endpoint", "refresh-configuration", endpoint_id]
res = self.raw_cli(args)
res.check_returncode()
return res
def endpoint_stop(
self,
@@ -657,6 +669,22 @@ class NeonLocalCli(AbstractNeonCli):
lsn: Lsn | None = None if lsn_str == "null" else Lsn(lsn_str)
return lsn, proc
def endpoint_update_pageservers(
self,
endpoint_id: str,
pageserver_id: int | None = None,
) -> subprocess.CompletedProcess[str]:
args = [
"endpoint",
"update-pageservers",
endpoint_id,
]
if pageserver_id is not None:
args.extend(["--pageserver-id", str(pageserver_id)])
res = self.raw_cli(args)
res.check_returncode()
return res
def mappings_map_branch(
self, name: str, tenant_id: TenantId, timeline_id: TimelineId
) -> subprocess.CompletedProcess[str]:


@@ -4930,15 +4930,38 @@ class Endpoint(PgProtocol, LogUtils):
def is_running(self):
return self._running._value > 0
def reconfigure(self, pageserver_id: int | None = None, safekeepers: list[int] | None = None):
def reconfigure(
self,
pageserver_id: int | None = None,
safekeepers: list[int] | None = None,
timeout_sec: float = 120,
):
assert self.endpoint_id is not None
# If `safekeepers` is not None, remember them as active and use them
# in the following commands.
if safekeepers is not None:
self.active_safekeepers = safekeepers
self.env.neon_cli.endpoint_reconfigure(
self.endpoint_id, self.tenant_id, pageserver_id, self.active_safekeepers
)
start_time = time.time()
while True:
try:
self.env.neon_cli.endpoint_reconfigure(
self.endpoint_id,
self.tenant_id,
pageserver_id,
self.active_safekeepers,
timeout_sec=timeout_sec,
)
return
except RuntimeError as e:
if time.time() - start_time > timeout_sec:
raise e
log.warning(f"Reconfigure failed with error: {e}. Retrying...")
time.sleep(5)
def refresh_configuration(self):
assert self.endpoint_id is not None
self.env.neon_cli.endpoint_refresh_configuration(self.endpoint_id)
def respec(self, **kwargs: Any) -> None:
"""Update the endpoint.json file used by control_plane."""
@@ -4986,6 +5009,10 @@ class Endpoint(PgProtocol, LogUtils):
log.debug("Updating compute config to: %s", json.dumps(config, indent=4))
json.dump(config, file, indent=4)
def update_pageservers_in_config(self, pageserver_id: int | None = None):
assert self.endpoint_id is not None
self.env.neon_cli.endpoint_update_pageservers(self.endpoint_id, pageserver_id)
def wait_for_migrations(self, wait_for: int = NUM_COMPUTE_MIGRATIONS) -> None:
"""
Wait for all compute migrations to be run. Remember that migrations only


@@ -78,6 +78,9 @@ class Workload:
"""
if self._endpoint is not None:
with ENDPOINT_LOCK:
# It's important that we update config.json before issuing the reconfigure request to make sure
# that PG-initiated spec refresh doesn't mess things up by reverting to the old spec.
self._endpoint.update_pageservers_in_config()
self._endpoint.reconfigure()
def endpoint(self, pageserver_id: int | None = None) -> Endpoint:
@@ -97,10 +100,10 @@ class Workload:
self._endpoint.start(pageserver_id=pageserver_id)
self._configured_pageserver = pageserver_id
else:
if self._configured_pageserver != pageserver_id:
self._configured_pageserver = pageserver_id
self._endpoint.reconfigure(pageserver_id=pageserver_id)
self._endpoint_config = pageserver_id
# It's important that we update config.json before issuing the reconfigure request to make sure
# that PG-initiated spec refresh doesn't mess things up by reverting to the old spec.
self._endpoint.update_pageservers_in_config(pageserver_id=pageserver_id)
self._endpoint.reconfigure(pageserver_id=pageserver_id)
connstring = self._endpoint.safe_psql(
"SELECT setting FROM pg_settings WHERE name='neon.pageserver_connstring'"


@@ -0,0 +1,152 @@
#!/usr/bin/env python3
"""
Generate TPS and latency charts from BenchBase TPC-C results CSV files.
This script reads a CSV file containing BenchBase results and generates two charts:
1. TPS (requests per second) over time
2. P95 and P99 latencies over time
Both charts are combined in a single SVG file.
"""
import argparse
import sys
from pathlib import Path
import matplotlib.pyplot as plt # type: ignore[import-not-found]
import pandas as pd # type: ignore[import-untyped]
def load_results_csv(csv_file_path):
"""Load BenchBase results CSV file into a pandas DataFrame."""
try:
df = pd.read_csv(csv_file_path)
# Validate required columns exist
required_columns = [
"Time (seconds)",
"Throughput (requests/second)",
"95th Percentile Latency (millisecond)",
"99th Percentile Latency (millisecond)",
]
missing_columns = [col for col in required_columns if col not in df.columns]
if missing_columns:
print(f"Error: Missing required columns: {missing_columns}")
sys.exit(1)
return df
except FileNotFoundError:
print(f"Error: CSV file not found: {csv_file_path}")
sys.exit(1)
except pd.errors.EmptyDataError:
print(f"Error: CSV file is empty: {csv_file_path}")
sys.exit(1)
except Exception as e:
print(f"Error reading CSV file: {e}")
sys.exit(1)
def generate_charts(df, input_filename, output_svg_path, title_suffix=None):
"""Generate combined TPS and latency charts and save as SVG."""
# Get the filename without extension for chart titles
file_label = Path(input_filename).stem
# Build title ending with optional suffix
if title_suffix:
title_ending = f"{title_suffix} - {file_label}"
else:
title_ending = file_label
# Create figure with two subplots
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 10))
# Chart 1: Time vs TPS
ax1.plot(
df["Time (seconds)"],
df["Throughput (requests/second)"],
linewidth=1,
color="blue",
alpha=0.7,
)
ax1.set_xlabel("Time (seconds)")
ax1.set_ylabel("TPS (Requests Per Second)")
ax1.set_title(f"Benchbase TPC-C Like Throughput (TPS) - {title_ending}")
ax1.grid(True, alpha=0.3)
ax1.set_xlim(0, df["Time (seconds)"].max())
# Chart 2: Time vs P95 and P99 Latencies
ax2.plot(
df["Time (seconds)"],
df["95th Percentile Latency (millisecond)"],
linewidth=1,
color="orange",
alpha=0.7,
label="Latency P95",
)
ax2.plot(
df["Time (seconds)"],
df["99th Percentile Latency (millisecond)"],
linewidth=1,
color="red",
alpha=0.7,
label="Latency P99",
)
ax2.set_xlabel("Time (seconds)")
ax2.set_ylabel("Latency (ms)")
ax2.set_title(f"Benchbase TPC-C Like Latency - {title_ending}")
ax2.grid(True, alpha=0.3)
ax2.set_xlim(0, df["Time (seconds)"].max())
ax2.legend()
plt.tight_layout()
# Save as SVG
try:
plt.savefig(output_svg_path, format="svg", dpi=300, bbox_inches="tight")
print(f"Charts saved to: {output_svg_path}")
except Exception as e:
print(f"Error saving SVG file: {e}")
sys.exit(1)
def main():
"""Main function to parse arguments and generate charts."""
parser = argparse.ArgumentParser(
description="Generate TPS and latency charts from BenchBase TPC-C results CSV"
)
parser.add_argument(
"--input-csv", type=str, required=True, help="Path to the input CSV results file"
)
parser.add_argument(
"--output-svg", type=str, required=True, help="Path for the output SVG chart file"
)
parser.add_argument(
"--title-suffix",
type=str,
required=False,
help="Optional suffix to add to chart titles (e.g., 'Warmup', 'Benchmark Phase')",
)
args = parser.parse_args()
# Validate input file exists
if not Path(args.input_csv).exists():
print(f"Error: Input CSV file does not exist: {args.input_csv}")
sys.exit(1)
# Create output directory if it doesn't exist
output_path = Path(args.output_svg)
output_path.parent.mkdir(parents=True, exist_ok=True)
# Load data and generate charts
df = load_results_csv(args.input_csv)
generate_charts(df, args.input_csv, args.output_svg, args.title_suffix)
print(f"Successfully generated charts from {len(df)} data points")
if __name__ == "__main__":
main()


@@ -0,0 +1,339 @@
import argparse
import html
import math
import os
import sys
from pathlib import Path
CONFIGS_DIR = Path("../configs")
SCRIPTS_DIR = Path("../scripts")
# Constants
## TODO increase times after testing
WARMUP_TIME_SECONDS = 1200 # 20 minutes
BENCHMARK_TIME_SECONDS = 3600 # 1 hour
RAMP_STEP_TIME_SECONDS = 300 # 5 minutes
BASE_TERMINALS = 130
TERMINALS_PER_WAREHOUSE = 0.2
OPTIMAL_RATE_FACTOR = 0.7 # 70% of max rate
BATCH_SIZE = 1000
LOADER_THREADS = 4
TRANSACTION_WEIGHTS = "45,43,4,4,4" # NewOrder, Payment, OrderStatus, Delivery, StockLevel
# Ramp-up rate multipliers
RAMP_RATE_FACTORS = [1.5, 1.1, 0.9, 0.7, 0.6, 0.4, 0.6, 0.7, 0.9, 1.1]
# Templates for XML configs
WARMUP_XML = """<?xml version="1.0"?>
<parameters>
<type>POSTGRES</type>
<driver>org.postgresql.Driver</driver>
<url>jdbc:postgresql://{hostname}/neondb?sslmode=require&amp;ApplicationName=tpcc&amp;reWriteBatchedInserts=true</url>
<username>neondb_owner</username>
<password>{password}</password>
<reconnectOnConnectionFailure>true</reconnectOnConnectionFailure>
<isolation>TRANSACTION_READ_COMMITTED</isolation>
<batchsize>{batch_size}</batchsize>
<scalefactor>{warehouses}</scalefactor>
<loaderThreads>0</loaderThreads>
<terminals>{terminals}</terminals>
<works>
<work>
<time>{warmup_time}</time>
<weights>{transaction_weights}</weights>
<rate>unlimited</rate>
<arrival>POISSON</arrival>
<distribution>ZIPFIAN</distribution>
</work>
</works>
<transactiontypes>
<transactiontype><name>NewOrder</name></transactiontype>
<transactiontype><name>Payment</name></transactiontype>
<transactiontype><name>OrderStatus</name></transactiontype>
<transactiontype><name>Delivery</name></transactiontype>
<transactiontype><name>StockLevel</name></transactiontype>
</transactiontypes>
</parameters>
"""
MAX_RATE_XML = """<?xml version="1.0"?>
<parameters>
<type>POSTGRES</type>
<driver>org.postgresql.Driver</driver>
<url>jdbc:postgresql://{hostname}/neondb?sslmode=require&amp;ApplicationName=tpcc&amp;reWriteBatchedInserts=true</url>
<username>neondb_owner</username>
<password>{password}</password>
<reconnectOnConnectionFailure>true</reconnectOnConnectionFailure>
<isolation>TRANSACTION_READ_COMMITTED</isolation>
<batchsize>{batch_size}</batchsize>
<scalefactor>{warehouses}</scalefactor>
<loaderThreads>0</loaderThreads>
<terminals>{terminals}</terminals>
<works>
<work>
<time>{benchmark_time}</time>
<weights>{transaction_weights}</weights>
<rate>unlimited</rate>
<arrival>POISSON</arrival>
<distribution>ZIPFIAN</distribution>
</work>
</works>
<transactiontypes>
<transactiontype><name>NewOrder</name></transactiontype>
<transactiontype><name>Payment</name></transactiontype>
<transactiontype><name>OrderStatus</name></transactiontype>
<transactiontype><name>Delivery</name></transactiontype>
<transactiontype><name>StockLevel</name></transactiontype>
</transactiontypes>
</parameters>
"""
OPT_RATE_XML = """<?xml version="1.0"?>
<parameters>
<type>POSTGRES</type>
<driver>org.postgresql.Driver</driver>
<url>jdbc:postgresql://{hostname}/neondb?sslmode=require&amp;ApplicationName=tpcc&amp;reWriteBatchedInserts=true</url>
<username>neondb_owner</username>
<password>{password}</password>
<reconnectOnConnectionFailure>true</reconnectOnConnectionFailure>
<isolation>TRANSACTION_READ_COMMITTED</isolation>
<batchsize>{batch_size}</batchsize>
<scalefactor>{warehouses}</scalefactor>
<loaderThreads>0</loaderThreads>
<terminals>{terminals}</terminals>
<works>
<work>
<time>{benchmark_time}</time>
<rate>{opt_rate}</rate>
<weights>{transaction_weights}</weights>
<arrival>POISSON</arrival>
<distribution>ZIPFIAN</distribution>
</work>
</works>
<transactiontypes>
<transactiontype><name>NewOrder</name></transactiontype>
<transactiontype><name>Payment</name></transactiontype>
<transactiontype><name>OrderStatus</name></transactiontype>
<transactiontype><name>Delivery</name></transactiontype>
<transactiontype><name>StockLevel</name></transactiontype>
</transactiontypes>
</parameters>
"""
RAMP_UP_XML = """<?xml version="1.0"?>
<parameters>
<type>POSTGRES</type>
<driver>org.postgresql.Driver</driver>
<url>jdbc:postgresql://{hostname}/neondb?sslmode=require&amp;ApplicationName=tpcc&amp;reWriteBatchedInserts=true</url>
<username>neondb_owner</username>
<password>{password}</password>
<reconnectOnConnectionFailure>true</reconnectOnConnectionFailure>
<isolation>TRANSACTION_READ_COMMITTED</isolation>
<batchsize>{batch_size}</batchsize>
<scalefactor>{warehouses}</scalefactor>
<loaderThreads>0</loaderThreads>
<terminals>{terminals}</terminals>
<works>
{works}
</works>
<transactiontypes>
<transactiontype><name>NewOrder</name></transactiontype>
<transactiontype><name>Payment</name></transactiontype>
<transactiontype><name>OrderStatus</name></transactiontype>
<transactiontype><name>Delivery</name></transactiontype>
<transactiontype><name>StockLevel</name></transactiontype>
</transactiontypes>
</parameters>
"""
WORK_TEMPLATE = f""" <work>\n <time>{RAMP_STEP_TIME_SECONDS}</time>\n <rate>{{rate}}</rate>\n <weights>{TRANSACTION_WEIGHTS}</weights>\n <arrival>POISSON</arrival>\n <distribution>ZIPFIAN</distribution>\n </work>\n"""
# Templates for shell scripts
EXECUTE_SCRIPT = """# Create results directories
mkdir -p results_warmup
mkdir -p results_{suffix}
chmod 777 results_warmup results_{suffix}
# Run warmup phase
docker run --network=host --rm \
-v $(pwd)/configs:/configs \
-v $(pwd)/results_warmup:/results \
{docker_image}\
-b tpcc \
-c /configs/execute_{warehouses}_warehouses_warmup.xml \
-d /results \
--create=false --load=false --execute=true
# Run benchmark phase
docker run --network=host --rm \
-v $(pwd)/configs:/configs \
-v $(pwd)/results_{suffix}:/results \
{docker_image}\
-b tpcc \
-c /configs/execute_{warehouses}_warehouses_{suffix}.xml \
-d /results \
--create=false --load=false --execute=true\n"""
LOAD_XML = """<?xml version="1.0"?>
<parameters>
<type>POSTGRES</type>
<driver>org.postgresql.Driver</driver>
<url>jdbc:postgresql://{hostname}/neondb?sslmode=require&amp;ApplicationName=tpcc&amp;reWriteBatchedInserts=true</url>
<username>neondb_owner</username>
<password>{password}</password>
<reconnectOnConnectionFailure>true</reconnectOnConnectionFailure>
<isolation>TRANSACTION_READ_COMMITTED</isolation>
<batchsize>{batch_size}</batchsize>
<scalefactor>{warehouses}</scalefactor>
<loaderThreads>{loader_threads}</loaderThreads>
</parameters>
"""
LOAD_SCRIPT = """# Create results directory for loading
mkdir -p results_load
chmod 777 results_load
docker run --network=host --rm \
-v $(pwd)/configs:/configs \
-v $(pwd)/results_load:/results \
{docker_image}\
-b tpcc \
-c /configs/load_{warehouses}_warehouses.xml \
-d /results \
--create=true --load=true --execute=false\n"""
def write_file(path, content):
path.parent.mkdir(parents=True, exist_ok=True)
try:
with open(path, "w") as f:
f.write(content)
except OSError as e:
print(f"Error writing {path}: {e}")
sys.exit(1)
# If it's a shell script, set executable permission
if str(path).endswith(".sh"):
os.chmod(path, 0o755)
def escape_xml_password(password):
"""Escape XML special characters in password."""
return html.escape(password, quote=True)
def get_docker_arch_tag(runner_arch):
"""Map GitHub Actions runner.arch to Docker image architecture tag."""
arch_mapping = {"X64": "amd64", "ARM64": "arm64"}
return arch_mapping.get(runner_arch, "amd64") # Default to amd64
def main():
parser = argparse.ArgumentParser(description="Generate BenchBase workload configs and scripts.")
parser.add_argument("--warehouses", type=int, required=True, help="Number of warehouses")
parser.add_argument("--max-rate", type=int, required=True, help="Max rate (TPS)")
parser.add_argument("--hostname", type=str, required=True, help="Database hostname")
parser.add_argument("--password", type=str, required=True, help="Database password")
parser.add_argument(
"--runner-arch", type=str, required=True, help="GitHub Actions runner architecture"
)
args = parser.parse_args()
warehouses = args.warehouses
max_rate = args.max_rate
hostname = args.hostname
password = args.password
runner_arch = args.runner_arch
# Escape password for safe XML insertion
escaped_password = escape_xml_password(password)
# Get the appropriate Docker architecture tag
docker_arch = get_docker_arch_tag(runner_arch)
docker_image = f"ghcr.io/neondatabase-labs/benchbase-postgres:latest-{docker_arch}"
opt_rate = math.ceil(max_rate * OPTIMAL_RATE_FACTOR)
# Calculate terminals as BASE_TERMINALS plus 20% of warehouses, rounded up to the next integer
terminals = math.ceil(BASE_TERMINALS + warehouses * TERMINALS_PER_WAREHOUSE)
ramp_rates = [math.ceil(max_rate * factor) for factor in RAMP_RATE_FACTORS]
# Write configs
write_file(
CONFIGS_DIR / f"execute_{warehouses}_warehouses_warmup.xml",
WARMUP_XML.format(
warehouses=warehouses,
hostname=hostname,
password=escaped_password,
terminals=terminals,
batch_size=BATCH_SIZE,
warmup_time=WARMUP_TIME_SECONDS,
transaction_weights=TRANSACTION_WEIGHTS,
),
)
write_file(
CONFIGS_DIR / f"execute_{warehouses}_warehouses_max_rate.xml",
MAX_RATE_XML.format(
warehouses=warehouses,
hostname=hostname,
password=escaped_password,
terminals=terminals,
batch_size=BATCH_SIZE,
benchmark_time=BENCHMARK_TIME_SECONDS,
transaction_weights=TRANSACTION_WEIGHTS,
),
)
write_file(
CONFIGS_DIR / f"execute_{warehouses}_warehouses_opt_rate.xml",
OPT_RATE_XML.format(
warehouses=warehouses,
opt_rate=opt_rate,
hostname=hostname,
password=escaped_password,
terminals=terminals,
batch_size=BATCH_SIZE,
benchmark_time=BENCHMARK_TIME_SECONDS,
transaction_weights=TRANSACTION_WEIGHTS,
),
)
ramp_works = "".join([WORK_TEMPLATE.format(rate=rate) for rate in ramp_rates])
write_file(
CONFIGS_DIR / f"execute_{warehouses}_warehouses_ramp_up.xml",
RAMP_UP_XML.format(
warehouses=warehouses,
works=ramp_works,
hostname=hostname,
password=escaped_password,
terminals=terminals,
batch_size=BATCH_SIZE,
),
)
# Loader config
write_file(
CONFIGS_DIR / f"load_{warehouses}_warehouses.xml",
LOAD_XML.format(
warehouses=warehouses,
hostname=hostname,
password=escaped_password,
batch_size=BATCH_SIZE,
loader_threads=LOADER_THREADS,
),
)
# Write scripts
for suffix in ["max_rate", "opt_rate", "ramp_up"]:
script = EXECUTE_SCRIPT.format(
warehouses=warehouses, suffix=suffix, docker_image=docker_image
)
write_file(SCRIPTS_DIR / f"execute_{warehouses}_warehouses_{suffix}.sh", script)
# Loader script
write_file(
SCRIPTS_DIR / f"load_{warehouses}_warehouses.sh",
LOAD_SCRIPT.format(warehouses=warehouses, docker_image=docker_image),
)
print(f"Generated configs and scripts for {warehouses} warehouses and max rate {max_rate}.")
if __name__ == "__main__":
main()
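
Reviewer sketch (not part of the change): the derived workload parameters in the generator above come down to three `math.ceil` expressions. The snippet below repeats that arithmetic in isolation, with the constants copied from the script. `derive_workload` is a hypothetical helper name used only for illustration.

```python
import math

# Constants copied from the generator script above.
BASE_TERMINALS = 130
TERMINALS_PER_WAREHOUSE = 0.2
OPTIMAL_RATE_FACTOR = 0.7
RAMP_RATE_FACTORS = [1.5, 1.1, 0.9, 0.7, 0.6, 0.4, 0.6, 0.7, 0.9, 1.1]


def derive_workload(warehouses: int, max_rate: int) -> dict:
    """Hypothetical helper bundling the three derived values."""
    return {
        # 130 base terminals plus one terminal per five warehouses, rounded up.
        "terminals": math.ceil(BASE_TERMINALS + warehouses * TERMINALS_PER_WAREHOUSE),
        # The "optimal" rate is 70% of the measured maximum, rounded up.
        "opt_rate": math.ceil(max_rate * OPTIMAL_RATE_FACTOR),
        # One rate per ramp step; each step runs for RAMP_STEP_TIME_SECONDS.
        "ramp_rates": [math.ceil(max_rate * f) for f in RAMP_RATE_FACTORS],
    }


if __name__ == "__main__":
    print(derive_workload(warehouses=10, max_rate=1000))
```

For 10 warehouses this gives 132 terminals (130 + 2). Because the factors are binary floats, `math.ceil` may land one above the decimal product for some inputs, so treat the rates as approximate targets rather than exact values.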


@@ -0,0 +1,591 @@
#!/usr/bin/env python3
# ruff: noqa
# we exclude the file from ruff because on the github runner we have python 3.9 and ruff
# is running with newer python 3.12 which suggests changes incompatible with python 3.9
"""
Upload BenchBase TPC-C results from summary.json and results.csv files to perf_test_results database.
This script extracts metrics from BenchBase *.summary.json and *.results.csv files and uploads them
to a PostgreSQL database table for performance tracking and analysis.
"""
import argparse
import json
import re
import sys
from datetime import datetime, timezone
from pathlib import Path
import pandas as pd # type: ignore[import-untyped]
import psycopg2
def load_summary_json(json_file_path):
"""Load summary.json file and return parsed data."""
try:
with open(json_file_path) as f:
return json.load(f)
except FileNotFoundError:
print(f"Error: Summary JSON file not found: {json_file_path}")
sys.exit(1)
except json.JSONDecodeError as e:
print(f"Error: Invalid JSON in file {json_file_path}: {e}")
sys.exit(1)
except Exception as e:
print(f"Error loading JSON file {json_file_path}: {e}")
sys.exit(1)
def get_metric_info(metric_name):
"""Get metric unit and report type for a given metric name."""
metrics_config = {
"Throughput": {"unit": "req/s", "report_type": "higher_is_better"},
"Goodput": {"unit": "req/s", "report_type": "higher_is_better"},
"Measured Requests": {"unit": "requests", "report_type": "higher_is_better"},
"95th Percentile Latency": {"unit": "µs", "report_type": "lower_is_better"},
"Maximum Latency": {"unit": "µs", "report_type": "lower_is_better"},
"Median Latency": {"unit": "µs", "report_type": "lower_is_better"},
"Minimum Latency": {"unit": "µs", "report_type": "lower_is_better"},
"25th Percentile Latency": {"unit": "µs", "report_type": "lower_is_better"},
"90th Percentile Latency": {"unit": "µs", "report_type": "lower_is_better"},
"99th Percentile Latency": {"unit": "µs", "report_type": "lower_is_better"},
"75th Percentile Latency": {"unit": "µs", "report_type": "lower_is_better"},
"Average Latency": {"unit": "µs", "report_type": "lower_is_better"},
}
return metrics_config.get(metric_name, {"unit": "", "report_type": "higher_is_better"})
def extract_metrics(summary_data):
"""Extract relevant metrics from summary JSON data."""
metrics = []
# Direct top-level metrics
direct_metrics = {
"Throughput (requests/second)": "Throughput",
"Goodput (requests/second)": "Goodput",
"Measured Requests": "Measured Requests",
}
for json_key, clean_name in direct_metrics.items():
if json_key in summary_data:
metrics.append((clean_name, summary_data[json_key]))
# Latency metrics from nested "Latency Distribution" object
if "Latency Distribution" in summary_data:
latency_data = summary_data["Latency Distribution"]
latency_metrics = {
"95th Percentile Latency (microseconds)": "95th Percentile Latency",
"Maximum Latency (microseconds)": "Maximum Latency",
"Median Latency (microseconds)": "Median Latency",
"Minimum Latency (microseconds)": "Minimum Latency",
"25th Percentile Latency (microseconds)": "25th Percentile Latency",
"90th Percentile Latency (microseconds)": "90th Percentile Latency",
"99th Percentile Latency (microseconds)": "99th Percentile Latency",
"75th Percentile Latency (microseconds)": "75th Percentile Latency",
"Average Latency (microseconds)": "Average Latency",
}
for json_key, clean_name in latency_metrics.items():
if json_key in latency_data:
metrics.append((clean_name, latency_data[json_key]))
return metrics
def build_labels(summary_data, project_id):
"""Build labels JSON object from summary data and project info."""
labels = {}
# Extract required label keys from summary data
label_keys = [
"DBMS Type",
"DBMS Version",
"Benchmark Type",
"Final State",
"isolation",
"scalefactor",
"terminals",
]
for key in label_keys:
if key in summary_data:
labels[key] = summary_data[key]
# Add project_id from workflow
labels["project_id"] = project_id
return labels
def build_suit_name(scalefactor, terminals, run_type, min_cu, max_cu):
"""Build the suit name according to specification."""
return f"benchbase-tpc-c-{scalefactor}-{terminals}-{run_type}-{min_cu}-{max_cu}"
def convert_timestamp_to_utc(timestamp_ms):
"""Convert millisecond timestamp to PostgreSQL-compatible UTC timestamp."""
try:
dt = datetime.fromtimestamp(timestamp_ms / 1000.0, tz=timezone.utc)
return dt.isoformat()
except (ValueError, TypeError) as e:
print(f"Warning: Could not convert timestamp {timestamp_ms}: {e}")
return datetime.now(timezone.utc).isoformat()
def insert_metrics(conn, metrics_data):
"""Insert metrics data into the perf_test_results table."""
insert_query = """
INSERT INTO perf_test_results
(suit, revision, platform, metric_name, metric_value, metric_unit,
metric_report_type, recorded_at_timestamp, labels)
VALUES (%(suit)s, %(revision)s, %(platform)s, %(metric_name)s, %(metric_value)s,
%(metric_unit)s, %(metric_report_type)s, %(recorded_at_timestamp)s, %(labels)s)
"""
try:
with conn.cursor() as cursor:
cursor.executemany(insert_query, metrics_data)
conn.commit()
print(f"Successfully inserted {len(metrics_data)} metrics into perf_test_results")
# Log some sample data for verification
if metrics_data:
print(
f"Sample metric: {metrics_data[0]['metric_name']} = {metrics_data[0]['metric_value']} {metrics_data[0]['metric_unit']}"
)
except Exception as e:
print(f"Error inserting metrics into database: {e}")
sys.exit(1)
def create_benchbase_results_details_table(conn):
"""Create benchbase_results_details table if it doesn't exist."""
create_table_query = """
CREATE TABLE IF NOT EXISTS benchbase_results_details (
id BIGSERIAL PRIMARY KEY,
suit TEXT,
revision CHAR(40),
platform TEXT,
recorded_at_timestamp TIMESTAMP WITH TIME ZONE,
requests_per_second NUMERIC,
average_latency_ms NUMERIC,
minimum_latency_ms NUMERIC,
p25_latency_ms NUMERIC,
median_latency_ms NUMERIC,
p75_latency_ms NUMERIC,
p90_latency_ms NUMERIC,
p95_latency_ms NUMERIC,
p99_latency_ms NUMERIC,
maximum_latency_ms NUMERIC
);
CREATE INDEX IF NOT EXISTS benchbase_results_details_recorded_at_timestamp_idx
ON benchbase_results_details USING BRIN (recorded_at_timestamp);
CREATE INDEX IF NOT EXISTS benchbase_results_details_suit_idx
ON benchbase_results_details USING BTREE (suit text_pattern_ops);
"""
try:
with conn.cursor() as cursor:
cursor.execute(create_table_query)
conn.commit()
print("Successfully created/verified benchbase_results_details table")
except Exception as e:
print(f"Error creating benchbase_results_details table: {e}")
sys.exit(1)
def process_csv_results(csv_file_path, start_timestamp_ms, suit, revision, platform):
"""Process CSV results and return data for database insertion."""
try:
# Read CSV file
df = pd.read_csv(csv_file_path)
# Validate required columns exist
required_columns = [
"Time (seconds)",
"Throughput (requests/second)",
"Average Latency (millisecond)",
"Minimum Latency (millisecond)",
"25th Percentile Latency (millisecond)",
"Median Latency (millisecond)",
"75th Percentile Latency (millisecond)",
"90th Percentile Latency (millisecond)",
"95th Percentile Latency (millisecond)",
"99th Percentile Latency (millisecond)",
"Maximum Latency (millisecond)",
]
missing_columns = [col for col in required_columns if col not in df.columns]
if missing_columns:
print(f"Error: Missing required columns in CSV: {missing_columns}")
return []
csv_data = []
for _, row in df.iterrows():
# Calculate timestamp: start_timestamp_ms + (time_seconds * 1000)
time_seconds = row["Time (seconds)"]
row_timestamp_ms = start_timestamp_ms + (time_seconds * 1000)
# Convert to UTC timestamp
row_timestamp = datetime.fromtimestamp(
row_timestamp_ms / 1000.0, tz=timezone.utc
).isoformat()
csv_row = {
"suit": suit,
"revision": revision,
"platform": platform,
"recorded_at_timestamp": row_timestamp,
"requests_per_second": float(row["Throughput (requests/second)"]),
"average_latency_ms": float(row["Average Latency (millisecond)"]),
"minimum_latency_ms": float(row["Minimum Latency (millisecond)"]),
"p25_latency_ms": float(row["25th Percentile Latency (millisecond)"]),
"median_latency_ms": float(row["Median Latency (millisecond)"]),
"p75_latency_ms": float(row["75th Percentile Latency (millisecond)"]),
"p90_latency_ms": float(row["90th Percentile Latency (millisecond)"]),
"p95_latency_ms": float(row["95th Percentile Latency (millisecond)"]),
"p99_latency_ms": float(row["99th Percentile Latency (millisecond)"]),
"maximum_latency_ms": float(row["Maximum Latency (millisecond)"]),
}
csv_data.append(csv_row)
print(f"Processed {len(csv_data)} rows from CSV file")
return csv_data
except FileNotFoundError:
print(f"Error: CSV file not found: {csv_file_path}")
return []
except Exception as e:
print(f"Error processing CSV file {csv_file_path}: {e}")
return []
def insert_csv_results(conn, csv_data):
"""Insert CSV results into benchbase_results_details table."""
if not csv_data:
print("No CSV data to insert")
return
insert_query = """
INSERT INTO benchbase_results_details
(suit, revision, platform, recorded_at_timestamp, requests_per_second,
average_latency_ms, minimum_latency_ms, p25_latency_ms, median_latency_ms,
p75_latency_ms, p90_latency_ms, p95_latency_ms, p99_latency_ms, maximum_latency_ms)
VALUES (%(suit)s, %(revision)s, %(platform)s, %(recorded_at_timestamp)s, %(requests_per_second)s,
%(average_latency_ms)s, %(minimum_latency_ms)s, %(p25_latency_ms)s, %(median_latency_ms)s,
%(p75_latency_ms)s, %(p90_latency_ms)s, %(p95_latency_ms)s, %(p99_latency_ms)s, %(maximum_latency_ms)s)
"""
try:
with conn.cursor() as cursor:
cursor.executemany(insert_query, csv_data)
conn.commit()
print(
f"Successfully inserted {len(csv_data)} detailed results into benchbase_results_details"
)
# Log some sample data for verification
sample = csv_data[0]
print(
f"Sample detail: {sample['requests_per_second']} req/s at {sample['recorded_at_timestamp']}"
)
except Exception as e:
print(f"Error inserting CSV results into database: {e}")
sys.exit(1)
def parse_load_log(log_file_path, scalefactor):
"""Parse load log file and extract load metrics."""
try:
with open(log_file_path) as f:
log_content = f.read()
# Regex patterns to match the timestamp lines
loading_pattern = r"\[INFO \] (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}.*Loading data into TPCC database"
finished_pattern = r"\[INFO \] (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}.*Finished loading data into TPCC database"
loading_match = re.search(loading_pattern, log_content)
finished_match = re.search(finished_pattern, log_content)
if not loading_match or not finished_match:
print(f"Warning: Could not find loading timestamps in log file {log_file_path}")
return None
# Parse timestamps
loading_time = datetime.strptime(loading_match.group(1), "%Y-%m-%d %H:%M:%S")
finished_time = datetime.strptime(finished_match.group(1), "%Y-%m-%d %H:%M:%S")
# Calculate duration in seconds
duration_seconds = (finished_time - loading_time).total_seconds()
# Calculate throughput in MB/s: scalefactor is the warehouse count, and 10 warehouses are approx. 1 GB (1024 MB) of data
load_throughput = (scalefactor * 1024 / 10.0) / duration_seconds
# Convert end time to UTC timestamp for database
finished_time_utc = finished_time.replace(tzinfo=timezone.utc).isoformat()
print(f"Load metrics: Duration={duration_seconds}s, Throughput={load_throughput:.2f} MB/s")
return {
"duration_seconds": duration_seconds,
"throughput_mb_per_sec": load_throughput,
"end_timestamp": finished_time_utc,
}
except FileNotFoundError:
print(f"Warning: Load log file not found: {log_file_path}")
return None
except Exception as e:
print(f"Error parsing load log file {log_file_path}: {e}")
return None
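
Note on `parse_load_log`: the duration and throughput calculation can be tried in isolation. The sketch below reuses the two regular expressions from the function. The sample log lines, the function name `load_duration_and_throughput`, and the guard against a non-positive duration are assumptions added for illustration; real BenchBase output may carry different logger fields.

```python
import re
from datetime import datetime

# Hypothetical log excerpt, shaped only to satisfy the patterns used above.
SAMPLE_LOG = """\
[INFO ] 2025-07-25 10:00:00,123 [main] DBWorkload - Loading data into TPCC database
[INFO ] 2025-07-25 10:05:00,456 [main] DBWorkload - Finished loading data into TPCC database
"""

TIMESTAMP_PREFIX = r"\[INFO \] (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}.*"
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def load_duration_and_throughput(log_content: str, warehouses: int):
    """Return (duration_seconds, throughput_mb_per_sec), or None if it cannot be computed."""
    start = re.search(TIMESTAMP_PREFIX + "Loading data into TPCC database", log_content)
    end = re.search(TIMESTAMP_PREFIX + "Finished loading data into TPCC database", log_content)
    if not start or not end:
        return None
    started_at = datetime.strptime(start.group(1), TIME_FORMAT)
    finished_at = datetime.strptime(end.group(1), TIME_FORMAT)
    duration_seconds = (finished_at - started_at).total_seconds()
    if duration_seconds <= 0:
        # Guard added in this sketch only: avoids dividing by zero for sub-second loads.
        return None
    # 10 warehouses are treated as roughly 1 GB (1024 MB) of data.
    throughput_mb_per_sec = (warehouses * 1024 / 10.0) / duration_seconds
    return duration_seconds, throughput_mb_per_sec


print(load_duration_and_throughput(SAMPLE_LOG, warehouses=100))
```

The capture group stops at whole seconds, so the milliseconds after the comma are discarded and the computed duration has one-second resolution.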
def insert_load_metrics(conn, load_metrics, suit, revision, platform, labels_json):
"""Insert load metrics into perf_test_results table."""
if not load_metrics:
print("No load metrics to insert")
return
load_metrics_data = [
{
"suit": suit,
"revision": revision,
"platform": platform,
"metric_name": "load_duration_seconds",
"metric_value": load_metrics["duration_seconds"],
"metric_unit": "seconds",
"metric_report_type": "lower_is_better",
"recorded_at_timestamp": load_metrics["end_timestamp"],
"labels": labels_json,
},
{
"suit": suit,
"revision": revision,
"platform": platform,
"metric_name": "load_throughput",
"metric_value": load_metrics["throughput_mb_per_sec"],
"metric_unit": "MB/second",
"metric_report_type": "higher_is_better",
"recorded_at_timestamp": load_metrics["end_timestamp"],
"labels": labels_json,
},
]
insert_query = """
INSERT INTO perf_test_results
(suit, revision, platform, metric_name, metric_value, metric_unit,
metric_report_type, recorded_at_timestamp, labels)
VALUES (%(suit)s, %(revision)s, %(platform)s, %(metric_name)s, %(metric_value)s,
%(metric_unit)s, %(metric_report_type)s, %(recorded_at_timestamp)s, %(labels)s)
"""
try:
with conn.cursor() as cursor:
cursor.executemany(insert_query, load_metrics_data)
conn.commit()
print(f"Successfully inserted {len(load_metrics_data)} load metrics into perf_test_results")
except Exception as e:
print(f"Error inserting load metrics into database: {e}")
sys.exit(1)
def main():
"""Main function to parse arguments and upload results."""
parser = argparse.ArgumentParser(
description="Upload BenchBase TPC-C results to perf_test_results database"
)
parser.add_argument(
"--summary-json", type=str, required=False, help="Path to the summary.json file"
)
parser.add_argument(
"--run-type",
type=str,
required=True,
choices=["warmup", "opt-rate", "ramp-up", "load"],
help="Type of benchmark run",
)
parser.add_argument("--min-cu", type=float, required=True, help="Minimum compute units")
parser.add_argument("--max-cu", type=float, required=True, help="Maximum compute units")
parser.add_argument("--project-id", type=str, required=True, help="Neon project ID")
parser.add_argument(
"--revision", type=str, required=True, help="Git commit hash (40 characters)"
)
parser.add_argument(
"--connection-string", type=str, required=True, help="PostgreSQL connection string"
)
parser.add_argument(
"--results-csv",
type=str,
required=False,
help="Path to the results.csv file for detailed metrics upload",
)
parser.add_argument(
"--load-log",
type=str,
required=False,
help="Path to the load log file for load phase metrics",
)
parser.add_argument(
"--warehouses",
type=int,
required=False,
help="Number of warehouses (scalefactor) for load metrics calculation",
)
args = parser.parse_args()
# Validate inputs
if args.summary_json and not Path(args.summary_json).exists():
print(f"Error: Summary JSON file does not exist: {args.summary_json}")
sys.exit(1)
if not args.summary_json and not args.load_log:
print("Error: Either summary JSON or load log file must be provided")
sys.exit(1)
if len(args.revision) != 40:
print(f"Warning: Revision should be 40 characters, got {len(args.revision)}")
# Load and process summary data if provided
summary_data = None
metrics = []
if args.summary_json:
summary_data = load_summary_json(args.summary_json)
metrics = extract_metrics(summary_data)
if not metrics:
print("Warning: No metrics found in summary JSON")
# Build common data for all metrics
if summary_data:
scalefactor = summary_data.get("scalefactor", "unknown")
terminals = summary_data.get("terminals", "unknown")
labels = build_labels(summary_data, args.project_id)
else:
# For load-only processing, use warehouses argument as scalefactor
scalefactor = args.warehouses if args.warehouses else "unknown"
terminals = "unknown"
labels = {"project_id": args.project_id}
suit = build_suit_name(scalefactor, terminals, args.run_type, args.min_cu, args.max_cu)
platform = f"prod-us-east-2-{args.project_id}"
# Convert timestamp - only needed for summary metrics and CSV processing
current_timestamp_ms = None
start_timestamp_ms = None
recorded_at = None
if summary_data:
current_timestamp_ms = summary_data.get("Current Timestamp (milliseconds)")
start_timestamp_ms = summary_data.get("Start timestamp (milliseconds)")
if current_timestamp_ms:
recorded_at = convert_timestamp_to_utc(current_timestamp_ms)
else:
print("Warning: No timestamp found in JSON, using current time")
recorded_at = datetime.now(timezone.utc).isoformat()
if not start_timestamp_ms:
print("Warning: No start timestamp found in JSON, CSV upload may be incorrect")
start_timestamp_ms = (
current_timestamp_ms or datetime.now(timezone.utc).timestamp() * 1000
)
# Print Grafana dashboard link for cross-service endpoint debugging
if start_timestamp_ms and current_timestamp_ms:
grafana_url = (
f"https://neonprod.grafana.net/d/cdya0okb81zwga/cross-service-endpoint-debugging"
f"?orgId=1&from={int(start_timestamp_ms)}&to={int(current_timestamp_ms)}"
f"&timezone=utc&var-env=prod&var-input_project_id={args.project_id}"
)
print(f'Cross service endpoint dashboard for "{args.run_type}" phase: {grafana_url}')
# Prepare metrics data for database insertion (only if we have summary metrics)
metrics_data = []
if metrics and recorded_at:
for metric_name, metric_value in metrics:
metric_info = get_metric_info(metric_name)
row = {
"suit": suit,
"revision": args.revision,
"platform": platform,
"metric_name": metric_name,
"metric_value": float(metric_value), # Ensure numeric type
"metric_unit": metric_info["unit"],
"metric_report_type": metric_info["report_type"],
"recorded_at_timestamp": recorded_at,
"labels": json.dumps(labels), # Convert to JSON string for JSONB column
}
metrics_data.append(row)
print(f"Prepared {len(metrics_data)} summary metrics for upload to database")
print(f"Suit: {suit}")
print(f"Platform: {platform}")
# Connect to database and insert metrics
try:
conn = psycopg2.connect(args.connection_string)
# Insert summary metrics into perf_test_results (if any)
if metrics_data:
insert_metrics(conn, metrics_data)
else:
print("No summary metrics to upload")
# Process and insert detailed CSV results if provided
if args.results_csv:
print(f"Processing detailed CSV results from: {args.results_csv}")
# Create table if it doesn't exist
create_benchbase_results_details_table(conn)
# Process CSV data
csv_data = process_csv_results(
args.results_csv, start_timestamp_ms, suit, args.revision, platform
)
# Insert CSV data
if csv_data:
insert_csv_results(conn, csv_data)
else:
print("No CSV data to upload")
else:
print("No CSV file provided, skipping detailed results upload")
# Process and insert load metrics if provided
if args.load_log:
print(f"Processing load metrics from: {args.load_log}")
# Parse load log and extract metrics
load_metrics = parse_load_log(args.load_log, scalefactor)
# Insert load metrics
if load_metrics:
insert_load_metrics(
conn, load_metrics, suit, args.revision, platform, json.dumps(labels)
)
else:
print("No load metrics to upload")
else:
print("No load log file provided, skipping load metrics upload")
conn.close()
print("Database upload completed successfully")
except psycopg2.Error as e:
print(f"Database connection/query error: {e}")
sys.exit(1)
except Exception as e:
print(f"Unexpected error: {e}")
sys.exit(1)
if __name__ == "__main__":
main()


@@ -26,7 +26,7 @@ def test_compute_pageserver_connection_stress(neon_env_builder: NeonEnvBuilder):
# Enable failpoint before starting everything else up so that we exercise the retry
# on fetching basebackup
pageserver_http = env.pageserver.http_client()
pageserver_http.configure_failpoints(("simulated-bad-compute-connection", "50%return(15)"))
pageserver_http.configure_failpoints(("simulated-bad-compute-connection", "20%return(15)"))
env.create_branch("test_compute_pageserver_connection_stress")
endpoint = env.endpoints.create_start("test_compute_pageserver_connection_stress")


@@ -3,14 +3,35 @@ from __future__ import annotations
import asyncio
from typing import TYPE_CHECKING
import pytest
from fixtures.log_helper import log
from fixtures.neon_fixtures import NeonEnvBuilder
from fixtures.remote_storage import RemoteStorageKind
if TYPE_CHECKING:
from fixtures.neon_fixtures import NeonEnvBuilder
from fixtures.neon_fixtures import Endpoint, NeonEnvBuilder
def test_change_pageserver(neon_env_builder: NeonEnvBuilder):
def reconfigure_endpoint(endpoint: Endpoint, pageserver_id: int, use_explicit_reconfigure: bool):
# It's important that we always update config.json before issuing any reconfigure requests
# to make sure that PG-initiated config refresh doesn't mess things up by reverting to the old config.
endpoint.update_pageservers_in_config(pageserver_id=pageserver_id)
# PG will automatically refresh its configuration if it detects connectivity issues with pageservers.
# We also allow the test to explicitly request a reconfigure so that the test can be sure that the
# endpoint is running with the latest configuration.
#
# Note that explicit reconfiguration is not required for the system to function or for this test to pass.
# It is kept for reference as this is how this test used to work before the capability of initiating
# configuration refreshes was added to compute nodes.
if use_explicit_reconfigure:
endpoint.reconfigure(pageserver_id=pageserver_id)
@pytest.mark.parametrize("use_explicit_reconfigure_for_failover", [False, True])
def test_change_pageserver(
neon_env_builder: NeonEnvBuilder, use_explicit_reconfigure_for_failover: bool
):
"""
A relatively low level test of reconfiguring a compute's pageserver at runtime. Usually this
is all done via the storage controller, but this test will disable the storage controller's compute
@@ -72,7 +93,10 @@ def test_change_pageserver(neon_env_builder: NeonEnvBuilder):
execute("SELECT count(*) FROM foo")
assert fetchone() == (100000,)
endpoint.reconfigure(pageserver_id=alt_pageserver_id)
# Reconfigure the endpoint to use the alt pageserver. We issue an explicit reconfigure request here
# regardless of test mode as this is testing the externally driven reconfiguration scenario, not the
# compute-initiated reconfiguration scenario upon detecting failures.
reconfigure_endpoint(endpoint, pageserver_id=alt_pageserver_id, use_explicit_reconfigure=True)
# Verify that the neon.pageserver_connstring GUC is set to the correct thing
execute("SELECT setting FROM pg_settings WHERE name='neon.pageserver_connstring'")
@@ -100,6 +124,12 @@ def test_change_pageserver(neon_env_builder: NeonEnvBuilder):
env.storage_controller.node_configure(env.pageservers[1].id, {"availability": "Offline"})
env.storage_controller.reconcile_until_idle()
reconfigure_endpoint(
endpoint,
pageserver_id=env.pageservers[0].id,
use_explicit_reconfigure=use_explicit_reconfigure_for_failover,
)
endpoint.reconfigure(pageserver_id=env.pageservers[0].id)
execute("SELECT count(*) FROM foo")
@@ -116,7 +146,11 @@ def test_change_pageserver(neon_env_builder: NeonEnvBuilder):
await asyncio.sleep(
1
) # Sleep for 1 second just to make sure we actually started our count(*) query
endpoint.reconfigure(pageserver_id=env.pageservers[1].id)
reconfigure_endpoint(
endpoint,
pageserver_id=env.pageservers[1].id,
use_explicit_reconfigure=use_explicit_reconfigure_for_failover,
)
def execute_count():
execute("SELECT count(*) FROM FOO")
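
Note on `reconfigure_endpoint`: the helper always writes the new pageserver into the endpoint config first and only then, optionally, issues the explicit reconfigure. A minimal sketch of that ordering follows. `FakeEndpoint` is a hypothetical stand-in that only records calls; it is not the fixture's `Endpoint` class.

```python
class FakeEndpoint:
    """Hypothetical stand-in for the test fixture's Endpoint; records calls in order."""

    def __init__(self) -> None:
        self.calls = []

    def update_pageservers_in_config(self, pageserver_id: int) -> None:
        self.calls.append(("update_config", pageserver_id))

    def reconfigure(self, pageserver_id: int) -> None:
        self.calls.append(("reconfigure", pageserver_id))


def reconfigure_endpoint(endpoint, pageserver_id: int, use_explicit_reconfigure: bool) -> None:
    # The config update always comes first, so a compute-initiated refresh
    # cannot pick up the old pageserver.
    endpoint.update_pageservers_in_config(pageserver_id=pageserver_id)
    if use_explicit_reconfigure:
        endpoint.reconfigure(pageserver_id=pageserver_id)


for explicit in (False, True):
    endpoint = FakeEndpoint()
    reconfigure_endpoint(endpoint, pageserver_id=2, use_explicit_reconfigure=explicit)
    print(explicit, endpoint.calls)
```

With `use_explicit_reconfigure=False` only the config update is recorded, which mirrors the parametrized case where the compute is expected to refresh its configuration on its own.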


@@ -0,0 +1,369 @@
from __future__ import annotations
import json
import os
import shutil
import subprocess
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import TYPE_CHECKING
import requests
from fixtures.log_helper import log
from typing_extensions import override
if TYPE_CHECKING:
from typing import Any
from fixtures.common_types import TenantId, TimelineId
from fixtures.neon_fixtures import NeonEnv
from fixtures.port_distributor import PortDistributor
def launch_compute_ctl(
env: NeonEnv,
endpoint_name: str,
external_http_port: int,
internal_http_port: int,
pg_port: int,
control_plane_port: int,
) -> subprocess.Popen[str]:
"""
Helper function to launch compute_ctl process with common configuration.
Returns the Popen process object.
"""
# Create endpoint directory structure following the standard pattern
endpoint_path = env.repo_dir / "endpoints" / endpoint_name
# Clean up any existing endpoint directory to avoid conflicts
if endpoint_path.exists():
shutil.rmtree(endpoint_path)
endpoint_path.mkdir(mode=0o755, parents=True, exist_ok=True)
# pgdata path - compute_ctl will create this directory during basebackup
pgdata_path = endpoint_path / "pgdata"
# Create log file in endpoint directory
log_file = endpoint_path / "compute.log"
log_handle = open(log_file, "w")
# Start compute_ctl pointing to our control plane
compute_ctl_path = env.neon_binpath / "compute_ctl"
connstr = f"postgresql://cloud_admin@localhost:{pg_port}/postgres"
# Find postgres binary path
pg_bin_path = env.pg_distrib_dir / env.pg_version.v_prefixed / "bin" / "postgres"
pg_lib_path = env.pg_distrib_dir / env.pg_version.v_prefixed / "lib"
env_vars = {
"INSTANCE_ID": "lakebase-instance-id",
"LD_LIBRARY_PATH": str(pg_lib_path), # Linux, etc.
"DYLD_LIBRARY_PATH": str(pg_lib_path), # macOS
}
cmd = [
str(compute_ctl_path),
"--external-http-port",
str(external_http_port),
"--internal-http-port",
str(internal_http_port),
"--pgdata",
str(pgdata_path),
"--connstr",
connstr,
"--pgbin",
str(pg_bin_path),
"--compute-id",
endpoint_name, # Use endpoint_name as compute-id
"--control-plane-uri",
f"http://127.0.0.1:{control_plane_port}",
"--lakebase-mode",
"true",
]
print(f"Launching compute_ctl with command: {cmd}")
# Start compute_ctl
process = subprocess.Popen(
cmd,
env=env_vars,
stdout=log_handle,
stderr=subprocess.STDOUT, # Combine stderr with stdout
text=True,
)
return process
def wait_for_compute_status(
compute_process: subprocess.Popen[str],
http_port: int,
expected_status: str,
timeout_seconds: int = 10,
) -> None:
"""
Wait for compute_ctl to reach the expected status.
Raises an exception if timeout is reached or process exits unexpectedly.
"""
start_time = time.time()
while time.time() - start_time < timeout_seconds:
try:
# Try to connect to the HTTP endpoint
response = requests.get(f"http://localhost:{http_port}/status", timeout=0.5)
if response.status_code == 200:
status_json = response.json()
# Check if it's in expected status
if status_json.get("status") == expected_status:
return
except (requests.ConnectionError, requests.Timeout):
pass
# Check if process has exited
if compute_process.poll() is not None:
raise Exception(
f"compute_ctl exited unexpectedly with code {compute_process.returncode}."
)
time.sleep(0.5)
# Timeout reached
compute_process.terminate()
raise Exception(
f"compute_ctl failed to reach {expected_status} status within {timeout_seconds} seconds."
)
class EmptySpecHandler(BaseHTTPRequestHandler):
"""HTTP handler that returns an Empty compute spec response"""
def do_GET(self):
if self.path.startswith("/compute/api/v2/computes/") and self.path.endswith("/spec"):
# Return empty status which will put compute in Empty state
response: dict[str, Any] = {
"status": "empty",
"spec": None,
"compute_ctl_config": {"jwks": {"keys": []}},
}
self.send_response(200)
self.send_header("Content-Type", "application/json")
self.end_headers()
self.wfile.write(json.dumps(response).encode())
else:
self.send_error(404)
@override
def log_message(self, format: str, *args: Any):
# Suppress request logging
pass
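
Note on `EmptySpecHandler`: a stub control plane like this can be exercised without compute_ctl. The sketch below uses only the standard library and binds to port 0 so the OS picks a free port; the test itself uses `port_distributor` instead. `StubSpecHandler` and `fetch_spec_once` are hypothetical names, and the response body is trimmed to two fields.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubSpecHandler(BaseHTTPRequestHandler):
    """Serves an empty compute spec on the spec path and 404 elsewhere."""

    def do_GET(self):
        if self.path.startswith("/compute/api/v2/computes/") and self.path.endswith("/spec"):
            body = json.dumps({"status": "empty", "spec": None}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Suppress request logging
        pass


def fetch_spec_once(compute_id: str) -> dict:
    """Start the stub server, fetch one spec for compute_id, then shut the server down."""
    server = HTTPServer(("127.0.0.1", 0), StubSpecHandler)  # port 0: OS-assigned
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", f"/compute/api/v2/computes/{compute_id}/spec")
        payload = json.loads(conn.getresponse().read())
        conn.close()
        return payload
    finally:
        server.shutdown()
        server.server_close()


print(fetch_spec_once("test-empty-compute"))
```

Calling `server_close()` after `shutdown()` releases the listening socket, which the tests above leave to process exit.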
def test_compute_terminate_empty(neon_simple_env: NeonEnv, port_distributor: PortDistributor):
"""
Test that terminating a compute in Empty status works correctly.
This tests the bug fix where terminating an Empty compute would hang
waiting for a non-existent postgres process to terminate.
"""
env = neon_simple_env
# Get ports for our test
control_plane_port = port_distributor.get_port()
external_http_port = port_distributor.get_port()
internal_http_port = port_distributor.get_port()
pg_port = port_distributor.get_port()
# Start a simple HTTP server that will serve the Empty spec
server = HTTPServer(("127.0.0.1", control_plane_port), EmptySpecHandler)
server_thread = threading.Thread(target=server.serve_forever)
server_thread.daemon = True
server_thread.start()
compute_process = None
try:
# Start compute_ctl with ephemeral tenant ID
compute_process = launch_compute_ctl(
env,
"test-empty-compute",
external_http_port,
internal_http_port,
pg_port,
control_plane_port,
)
# Wait for compute_ctl to start and report "empty" status
wait_for_compute_status(compute_process, external_http_port, "empty")
# Now send terminate request
response = requests.post(f"http://localhost:{external_http_port}/terminate")
# Verify that the termination request sends back a 200 OK response and is not abruptly terminated.
assert response.status_code == 200, (
f"Expected 200 OK, got {response.status_code}: {response.text}"
)
# Wait for compute_ctl to exit
exit_code = compute_process.wait(timeout=10)
assert exit_code == 0, f"compute_ctl exited with non-zero code: {exit_code}"
finally:
# Clean up
server.shutdown()
if compute_process and compute_process.poll() is None:
compute_process.terminate()
compute_process.wait()
class SwitchableConfigHandler(BaseHTTPRequestHandler):
"""HTTP handler that can switch between normal compute configs and compute configs without specs"""
return_empty_spec: bool = False
tenant_id: TenantId | None = None
timeline_id: TimelineId | None = None
pageserver_port: int | None = None
safekeeper_connstrs: list[str] | None = None
def do_GET(self):
if self.path.startswith("/compute/api/v2/computes/") and self.path.endswith("/spec"):
if self.return_empty_spec:
# Return empty status
response: dict[str, object | None] = {
"status": "empty",
"spec": None,
"compute_ctl_config": {
"jwks": {"keys": []},
},
}
else:
# Return normal attached spec
response = {
"status": "attached",
"spec": {
"format_version": 1.0,
"cluster": {
"roles": [],
"databases": [],
"postgresql_conf": "shared_preload_libraries='neon'",
},
"tenant_id": str(self.tenant_id) if self.tenant_id else "",
"timeline_id": str(self.timeline_id) if self.timeline_id else "",
"pageserver_connstring": f"postgres://no_user@localhost:{self.pageserver_port}"
if self.pageserver_port
else "",
"safekeeper_connstrings": self.safekeeper_connstrs or [],
"mode": "Primary",
"skip_pg_catalog_updates": True,
"reconfigure_concurrency": 1,
"suspend_timeout_seconds": -1,
},
"compute_ctl_config": {
"jwks": {"keys": []},
},
}
self.send_response(200)
self.send_header("Content-Type", "application/json")
self.end_headers()
self.wfile.write(json.dumps(response).encode())
else:
self.send_error(404)
@override
def log_message(self, format: str, *args: Any):
# Suppress request logging
pass
def test_compute_empty_spec_during_refresh_configuration(
neon_simple_env: NeonEnv, port_distributor: PortDistributor
):
"""
Test that compute exits when it receives an empty spec during refresh configuration state.
This test:
1. Start compute with a normal spec
2. Change the spec handler to return empty spec
3. Trigger some condition to force compute to refresh configuration
4. Verify that compute_ctl exits
"""
env = neon_simple_env
# Get ports for our test
control_plane_port = port_distributor.get_port()
external_http_port = port_distributor.get_port()
internal_http_port = port_distributor.get_port()
pg_port = port_distributor.get_port()
# Set up handler class variables
SwitchableConfigHandler.tenant_id = env.initial_tenant
SwitchableConfigHandler.timeline_id = env.initial_timeline
SwitchableConfigHandler.pageserver_port = env.pageserver.service_port.pg
# Convert comma-separated string to list
safekeeper_connstrs = env.get_safekeeper_connstrs()
if safekeeper_connstrs:
SwitchableConfigHandler.safekeeper_connstrs = safekeeper_connstrs.split(",")
else:
SwitchableConfigHandler.safekeeper_connstrs = []
SwitchableConfigHandler.return_empty_spec = False # Start with normal spec
# Start HTTP server with switchable spec handler
server = HTTPServer(("127.0.0.1", control_plane_port), SwitchableConfigHandler)
server_thread = threading.Thread(target=server.serve_forever)
server_thread.daemon = True
server_thread.start()
compute_process = None
try:
# Start compute_ctl with tenant and timeline IDs
# Use a unique endpoint name to avoid conflicts
endpoint_name = f"test-refresh-compute-{os.getpid()}"
compute_process = launch_compute_ctl(
env,
endpoint_name,
external_http_port,
internal_http_port,
pg_port,
control_plane_port,
)
# Wait for compute_ctl to start and report "running" status
wait_for_compute_status(compute_process, external_http_port, "running", timeout_seconds=30)
log.info("Compute is running. Now returning an empty spec and triggering a configuration refresh.")
# Switch spec fetch handler to return empty spec
SwitchableConfigHandler.return_empty_spec = True
# Trigger a configuration refresh
try:
requests.post(f"http://localhost:{internal_http_port}/refresh_configuration")
except requests.RequestException as e:
log.info(f"Call to /refresh_configuration failed: {e}")
log.info(
"Ignoring the error, assuming that compute_ctl is already refreshing or has exited"
)
# Wait for compute_ctl to exit (it should exit when it gets an empty spec during refresh)
exit_start_time = time.time()
while time.time() - exit_start_time < 30:
if compute_process.poll() is not None:
# Process exited
break
time.sleep(0.5)
# Verify that compute_ctl exited
exit_code = compute_process.poll()
if exit_code is None:
compute_process.terminate()
raise Exception("compute_ctl did not exit after receiving empty spec.")
# The exit code might not be 0 in this case since it's an unexpected termination
# but we mainly care that it did exit
assert exit_code is not None, "compute_ctl should have exited"
finally:
# Clean up
server.shutdown()
if compute_process and compute_process.poll() is None:
compute_process.terminate()
compute_process.wait()
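
Note on waiting for compute_ctl to exit: the hand-written `poll()` loop in `test_compute_empty_spec_during_refresh_configuration` could also be expressed with `Popen.wait(timeout=...)`, which `test_compute_terminate_empty` already uses. A minimal sketch, with a short-lived Python child standing in for compute_ctl; `wait_for_exit` is a hypothetical helper name.

```python
import subprocess
import sys


def wait_for_exit(process: subprocess.Popen, timeout_seconds: float) -> int:
    """Return the exit code, or terminate the process and raise if it outlives the timeout."""
    try:
        return process.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        process.terminate()
        process.wait()
        raise RuntimeError(f"process did not exit within {timeout_seconds} seconds")


# Stand-in child process; in the test this would be the compute_ctl Popen object.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(0.2)"])
print(wait_for_exit(child, timeout_seconds=5))
```

The trade-off is that `wait()` gives no hook for logging progress between polls, so the explicit loop remains reasonable when intermediate diagnostics are wanted.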


@@ -0,0 +1,137 @@
import json
import shutil
from fixtures.common_types import TenantShardId
from fixtures.log_helper import log
from fixtures.metrics import parse_metrics
from fixtures.neon_fixtures import Endpoint, NeonEnvBuilder, NeonPageserver
from requests.exceptions import ConnectionError
# Helper function to attempt reconfiguration of the compute to point to a new pageserver. Note that in these tests,
# we don't expect the reconfiguration attempts to go through, as we will be pointing the compute at a "wrong" pageserver.
def _attempt_reconfiguration(endpoint: Endpoint, new_pageserver_id: int, timeout_sec: float):
try:
endpoint.reconfigure(pageserver_id=new_pageserver_id, timeout_sec=timeout_sec)
except Exception as e:
log.info(f"reconfiguration failed with exception {e}")
pass
def read_misrouted_metric_value(pageserver: NeonPageserver) -> float:
return (
pageserver.http_client()
.get_metrics()
.query_one("pageserver_misrouted_pagestream_requests_total")
.value
)
def read_request_error_metric_value(endpoint: Endpoint) -> float:
return (
parse_metrics(endpoint.http_client().metrics())
.query_one("pg_cctl_pagestream_request_errors_total")
.value
)
def test_misrouted_to_secondary(
neon_env_builder: NeonEnvBuilder,
):
"""
Tests that the following metrics are incremented when compute tries to talk to a secondary pageserver:
- On pageserver receiving the request: pageserver_misrouted_pagestream_requests_total
- On compute: pg_cctl_pagestream_request_errors_total
"""
neon_env_builder.num_pageservers = 2
env = neon_env_builder.init_configs()
env.broker.start()
env.storage_controller.start()
for ps in env.pageservers:
ps.start()
for sk in env.safekeepers:
sk.start()
# Create a tenant that has one primary and one secondary. Due to primary/secondary placement constraints,
# the primary and secondary pageservers will be different.
tenant_id, _ = env.create_tenant(shard_count=1, placement_policy=json.dumps({"Attached": 1}))
endpoint = env.endpoints.create(
"main", tenant_id=tenant_id, config_lines=["neon.lakebase_mode = true"]
)
endpoint.respec(skip_pg_catalog_updates=False)
endpoint.start()
# Get the primary pageserver serving the zero shard of the tenant, and detach it from the primary pageserver.
# This test operation configures tenant directly on the pageserver/does not go through the storage controller,
# so the compute does not get any notifications and will keep pointing at the detached pageserver.
tenant_zero_shard = TenantShardId(tenant_id, shard_number=0, shard_count=1)
primary_ps = env.get_tenant_pageserver(tenant_zero_shard)
secondary_ps = (
env.pageservers[1] if primary_ps.id == env.pageservers[0].id else env.pageservers[0]
)
# Now try to point the compute at the pageserver that is acting as secondary for the tenant. Test that the metrics
# on both compute_ctl and the pageserver register the misrouted requests following the reconfiguration attempt.
assert read_misrouted_metric_value(secondary_ps) == 0
assert read_request_error_metric_value(endpoint) == 0
_attempt_reconfiguration(endpoint, new_pageserver_id=secondary_ps.id, timeout_sec=2.0)
assert read_misrouted_metric_value(secondary_ps) > 0
try:
assert read_request_error_metric_value(endpoint) > 0
except ConnectionError:
# When PG is configured to use a misconfigured pageserver, it will cancel the query after a certain number of failed
# reconfigure attempts. This will cause compute_ctl to exit.
log.info("Cannot connect to PG, ignoring")
pass
def test_misrouted_to_ps_not_hosting_tenant(
neon_env_builder: NeonEnvBuilder,
):
"""
Tests that the following metrics are incremented when compute tries to talk to a pageserver that does not host the tenant:
- On pageserver receiving the request: pageserver_misrouted_pagestream_requests_total
- On compute: pg_cctl_pagestream_request_errors_total
"""
neon_env_builder.num_pageservers = 2
env = neon_env_builder.init_configs()
env.broker.start()
env.storage_controller.start(handle_ps_local_disk_loss=False)
for ps in env.pageservers:
ps.start()
for sk in env.safekeepers:
sk.start()
tenant_id, _ = env.create_tenant(shard_count=1)
endpoint = env.endpoints.create(
"main", tenant_id=tenant_id, config_lines=["neon.lakebase_mode = true"]
)
endpoint.respec(skip_pg_catalog_updates=False)
endpoint.start()
tenant_ps_id = env.get_tenant_pageserver(
TenantShardId(tenant_id, shard_number=0, shard_count=1)
).id
non_hosting_ps = (
env.pageservers[1] if tenant_ps_id == env.pageservers[0].id else env.pageservers[0]
)
# Clear the disk of the non-hosting PS to make sure that it indeed doesn't have any information about the tenant.
non_hosting_ps.stop(immediate=True)
shutil.rmtree(non_hosting_ps.tenant_dir())
non_hosting_ps.start()
# Now try to point the compute to the non-hosting pageserver. Test that the metrics
# on both compute_ctl and the pageserver register the misrouted requests following the reconfiguration attempt.
assert read_misrouted_metric_value(non_hosting_ps) == 0
assert read_request_error_metric_value(endpoint) == 0
_attempt_reconfiguration(endpoint, new_pageserver_id=non_hosting_ps.id, timeout_sec=2.0)
assert read_misrouted_metric_value(non_hosting_ps) > 0
try:
assert read_request_error_metric_value(endpoint) > 0
except ConnectionError:
# When PG is configured to use a misconfigured pageserver, it will cancel the query after a certain number of failed
# reconfigure attempts. This will cause compute_ctl to exit.
log.info("Cannot connect to PG, ignoring")
pass
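
Note on the counter checks in both tests: each reads a counter before and after the reconfiguration attempt and expects it to move from 0 to a positive value. The sketch below shows how such a counter can be read from Prometheus text exposition output. It is not the implementation of `parse_metrics` or `query_one` from the fixtures: `read_counter` is a hypothetical helper, it assumes sample lines carry no trailing timestamp, and it sums across label sets where `query_one` returns a single sample.

```python
def read_counter(metrics_text: str, name: str) -> float:
    """Sum all samples of `name` found in Prometheus text exposition output."""
    total = 0.0
    found = False
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last space-separated field (no timestamps assumed).
        sample, _, value = line.rpartition(" ")
        metric_name = sample.split("{", 1)[0]
        if metric_name == name:
            total += float(value)
            found = True
    if not found:
        raise KeyError(name)
    return total


before = "pageserver_misrouted_pagestream_requests_total 0\n"
after = "pageserver_misrouted_pagestream_requests_total 4\n"
name = "pageserver_misrouted_pagestream_requests_total"
print(read_counter(before, name), read_counter(after, name))
```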


@@ -133,6 +133,9 @@ def test_hot_standby_gc(neon_env_builder: NeonEnvBuilder, pause_apply: bool):
tenant_conf = {
# set PITR interval to be small, so we can do GC
"pitr_interval": "0 s",
# we want to control gc and checkpoint frequency precisely
"gc_period": "0s",
"compaction_period": "0s",
}
env = neon_env_builder.init_start(initial_tenant_conf=tenant_conf)
timeline_id = env.initial_timeline
@@ -186,6 +189,23 @@ def test_hot_standby_gc(neon_env_builder: NeonEnvBuilder, pause_apply: bool):
client = pageserver.http_client()
client.timeline_checkpoint(tenant_shard_id, timeline_id)
client.timeline_compact(tenant_shard_id, timeline_id)
# Wait for standby horizon to get propagated.
# This shouldn't be necessary, but the current mechanism for
# standby_horizon propagation is imperfect. Detailed
# description in https://databricks.atlassian.net/browse/LKB-2499
while True:
val = client.get_metric_value(
"pageserver_standby_horizon",
{
"tenant_id": str(tenant_shard_id.tenant_id),
"shard_id": str(tenant_shard_id.shard_index),
"timeline_id": str(timeline_id),
},
)
log.info(f"waiting for next standby_horizon push from safekeeper, {val=}")
if val != 0:
break
time.sleep(0.1)
client.timeline_gc(tenant_shard_id, timeline_id, 0)
# Re-execute the query. The GetPage requests that this
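
Note on the `standby_horizon` wait: the `while True` loop in this hunk has no upper bound, so a horizon that never propagates would hang the test until the global timeout. A bounded variant could look like the sketch below. `wait_until` is a hypothetical helper, not an existing fixture, and the probe here is a stand-in for the `client.get_metric_value(...)` call.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(
    probe: Callable[[], T],
    is_ready: Callable[[T], bool],
    timeout_seconds: float = 30.0,
    interval_seconds: float = 0.1,
) -> T:
    """Call probe() until is_ready(value) is true; raise TimeoutError after the deadline."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        value = probe()
        if is_ready(value):
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"condition not met within {timeout_seconds} seconds, last value: {value!r}"
            )
        time.sleep(interval_seconds)


# Stand-in probe: the metric reads 0 twice before a non-zero horizon shows up.
fake_values = iter([0, 0, 42])
print(wait_until(lambda: next(fake_values), lambda v: v != 0, timeout_seconds=5))  # 42
```

`time.monotonic()` is used for the deadline so that wall-clock adjustments during the test run cannot shorten or extend the wait.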


@@ -1751,14 +1751,15 @@ def test_back_pressure_per_shard(neon_env_builder: NeonEnvBuilder):
"max_replication_apply_lag = 0",
"max_replication_flush_lag = 15MB",
"neon.max_cluster_size = 10GB",
"neon.lakebase_mode = true",
],
)
endpoint.respec(skip_pg_catalog_updates=False)
endpoint.start()
# generate 10MB of data
# generate 20MB of data
endpoint.safe_psql(
"CREATE TABLE usertable AS SELECT s AS KEY, repeat('a', 1000) as VALUE from generate_series(1, 10000) s;"
"CREATE TABLE usertable AS SELECT s AS KEY, repeat('a', 1000) as VALUE from generate_series(1, 20000) s;"
)
res = endpoint.safe_psql("SELECT neon.backpressure_throttling_time() as throttling_time")[0]
assert res[0] == 0, f"throttling_time should be 0, but got {res[0]}"
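
Note on the data size in this hunk: the "10MB" and "20MB" figures in the comments follow from the row count times the 1000-character payload. The sketch below works through that arithmetic and compares it with the `max_replication_flush_lag = 15MB` setting, assuming that value uses PostgreSQL's unit convention (1MB = 1024 kB). Per-row overhead is ignored, and the diff does not state why the size was doubled, so the comparison is only an observation.

```python
PAYLOAD_BYTES_PER_ROW = 1_000  # repeat('a', 1000)
FLUSH_LAG_LIMIT_BYTES = 15 * 1024 * 1024  # max_replication_flush_lag = 15MB


def payload_bytes(rows: int) -> int:
    """Raw size of the VALUE column for the given number of rows, ignoring row overhead."""
    return rows * PAYLOAD_BYTES_PER_ROW


for rows in (10_000, 20_000):
    size = payload_bytes(rows)
    print(rows, size, size > FLUSH_LAG_LIMIT_BYTES)
# 10000 10000000 False
# 20000 20000000 True
```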


@@ -4959,3 +4959,49 @@ def test_storage_controller_forward_404(neon_env_builder: NeonEnvBuilder):
env.storage_controller.configure_failpoints(
("reconciler-live-migrate-post-generation-inc", "off")
)
def test_re_attach_with_stuck_secondary(neon_env_builder: NeonEnvBuilder):
    """
    This test assumes that the secondary location cannot be configured for whatever reason.
    It then attempts to detach and attach the tenant back again and, finally, checks
    for observed state consistency by attempting to create a timeline.
    See LKB-204 for more details.
    """
    neon_env_builder.num_pageservers = 2
    env = neon_env_builder.init_configs()
    env.start()
    env.storage_controller.allowed_errors.append(".*failpoint.*")
    tenant_id, _ = env.create_tenant(shard_count=1, placement_policy='{"Attached":1}')
    env.storage_controller.reconcile_until_idle()
    locations = env.storage_controller.locate(tenant_id)
    assert len(locations) == 1
    primary: int = locations[0]["node_id"]
    not_primary = [ps.id for ps in env.pageservers if ps.id != primary]
    assert len(not_primary) == 1
    secondary = not_primary[0]
    env.get_pageserver(secondary).http_client().configure_failpoints(
        ("put-location-conf-handler", "return(1)")
    )
    env.storage_controller.tenant_policy_update(tenant_id, {"placement": "Detached"})
    with pytest.raises(Exception, match="failpoint"):
        env.storage_controller.reconcile_all()
    env.storage_controller.tenant_policy_update(tenant_id, {"placement": {"Attached": 1}})
    with pytest.raises(Exception, match="failpoint"):
        env.storage_controller.reconcile_all()
    env.storage_controller.pageserver_api().timeline_create(
        pg_version=PgVersion.NOT_SET, tenant_id=tenant_id, new_timeline_id=TimelineId.generate()
    )
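
Review note: the behaviour this test guards can be modelled in a few lines. The sketch below is a toy illustration only and is not the storage controller's code. `detach_all` and the per-node callables are hypothetical names. It shows the "attempt every location, then report failures" pattern named in the PR title, so that one stuck secondary does not stop the healthy location from being detached.

```python
from typing import Callable, Dict, List, Tuple


def detach_all(
    locations: Dict[int, Callable[[], None]],
) -> Tuple[List[int], Dict[int, Exception]]:
    """Attempt every detach; return (detached node ids, errors keyed by node id).

    Hypothetical model: each value is a callable that detaches one location
    and raises on failure.
    """
    detached: List[int] = []
    errors: Dict[int, Exception] = {}
    for node_id, detach in locations.items():
        try:
            detach()
            detached.append(node_id)
        except Exception as exc:
            # Record the failure and keep going instead of aborting the loop.
            errors[node_id] = exc
    return detached, errors


def stuck_detach() -> None:
    # Stands in for a pageserver whose location config call always fails.
    raise RuntimeError("failpoint")


# Node 2 is stuck and is tried first; node 1 is healthy.
detached, errors = detach_all({2: stuck_detach, 1: lambda: None})
```

With an abort-on-first-error loop, node 1 would never be attempted in this ordering. Here the caller can still surface `errors` as a failed reconciliation while knowing which locations really were detached.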


@@ -28,6 +28,8 @@ chrono = { version = "0.4", default-features = false, features = ["clock", "serd
clap = { version = "4", features = ["derive", "env", "string"] }
clap_builder = { version = "4", default-features = false, features = ["color", "env", "help", "std", "string", "suggestions", "usage"] }
const-oid = { version = "0.9", default-features = false, features = ["db", "std"] }
crossbeam-epoch = { version = "0.9" }
crossbeam-utils = { version = "0.8" }
crypto-bigint = { version = "0.5", features = ["generic-array", "zeroize"] }
der = { version = "0.7", default-features = false, features = ["derive", "flagset", "oid", "pem", "std"] }
deranged = { version = "0.3", default-features = false, features = ["powerfmt", "serde", "std"] }
@@ -73,6 +75,7 @@ num-traits = { version = "0.2", features = ["i128", "libm"] }
once_cell = { version = "1" }
p256 = { version = "0.13", features = ["jwk"] }
parquet = { version = "53", default-features = false, features = ["zstd"] }
portable-atomic = { version = "1", features = ["require-cas"] }
prost = { version = "0.13", features = ["no-recursion-limit", "prost-derive"] }
rand = { version = "0.9" }
regex = { version = "1" }