Compare commits

..

486 Commits

Author SHA1 Message Date
github-actions[bot]
58abce0207 Proxy release 2025-07-29 06:01 UTC 2025-07-29 06:01:52 +00:00
Ivan Efremov
6be572177c chore: Fix nightly lints (#12746)
- Remove some unused code
- Use `is_multiple_of()` instead of '%'
- Collapse consecuative "if let" statements
- Elided lifetime fixes

It is enough just to review the code of your team
2025-07-28 21:36:30 +00:00
Alex Chi Z.
fe7a4e1ab6 fix(test): wait compaction in timeline offload test (#12673)
## Problem

close LKB-753. `test_pageserver_metrics_removed_after_offload` is
unstable and it sometimes leave the metrics behind after tenant
offloading. It turns out that we triggered an image compaction before
the offload and the job was stopped after the offload request was
completed.

## Summary of changes

Wait all background tasks to finish before checking the metrics.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-28 16:27:55 +00:00
Heikki Linnakangas
40cae8cc36 Fix misc typos and some cosmetic code cleanup (#12695) 2025-07-28 16:21:35 +00:00
Heikki Linnakangas
02fc8b7c70 Add compatibility macros for MyProcNumber and PGIOAlignedBlock (#12715)
There were a few uses of these already, so collect them to the
compatibility header to avoid the repetition and scattered #ifdefs.

The definition of MyProcNumber is a little different from what was used
before, but the end result is the same. (PGPROC->pgprocno values were
just assigned sequentially to all PGPROC array members, see
InitProcGlobal(). That's a bit silly, which is why it was removed in
v17.)
2025-07-28 15:05:36 +00:00
John Spray
60feb168e2 pageserver: decrease MAX_SHARDS in utilization (#12668)
## Problem

When tenants have a lot of timelines, the number of tenants that a
pageserver can comfortably handle goes down. Branching is much more
widely used in practice now than it was when this code was written, and
we generally run pageservers with a few thousand tenants (where each
tenant has many timelines), rather than the 10k-20k we might have done
historically.

This should really be something configurable, or a more direct proxy for
resource utilization (such as non-archived timeline count), but this
change should be a low effort improvement.

## Summary of changes

* Change the target shard count (MAX_SHARDS) to 2500 from 5000 when
calculating pageserver utilization (i.e. a 200% overcommit now
corresponds to 5000 shards, not 10000 shards)

Co-authored-by: John Spray <john.spray@databricks.com>
2025-07-28 13:50:18 +00:00
a-masterov
da596a5162 Update the versions for ClickHouse and Debezium (#12741)
## Problem
The test for logical replication used the year-old versions of
ClickHouse and Debezium so that we may miss problems related to
up-to-date versions.
## Summary of changes
The ClickHouse version has been updated to 24.8.
The Debezium version has been updated to the latest stable one,
3.1.3Final.
Some problems with locally running the Debezium test have been fixed.

---------

Co-authored-by: Alexey Masterov <alexey.masterov@databricks.com>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2025-07-28 13:26:33 +00:00
Conrad Ludgate
effd6bf829 [proxy] add metrics for caches (#12752)
Exposes metrics for caches. LKB-2594

This exposes a high level namespace, `cache`, that all cache metrics can
be added to - this makes it easier to make library panels for the caches
as I understand it.

To calculate the current cache fill ratio, you could use the following
query:

```
(
    cache_inserted_total{cache="node_info"}
  - sum (cache_evicted_total{cache="node_info"}) without (cause)
)
  / cache_capacity{cache="node_info"}
```

To calculate the cache hit ratio, you could use the following query:

```
  cache_request_total{cache="node_info", outcome="hit"}
/ sum (cache_request_total{cache="node_info"}) without (outcome)
```
2025-07-28 10:41:49 +00:00
Tristan Partin
a6e0baf31a [BRC-1405] Mount databricks pg_hba and pg_ident from configmap (#12733)
## Problem

For certificate auth, we need to configure pg_hba and pg_ident for it to
work.

HCC needs to mount this config map to all pg compute pod.

## Summary of changes

Create `databricks_pg_hba` and `databricks_pg_ident` to configure where
the files are located on the pod. These configs are pass down to
`compute_ctl`. Compute_ctl uses these config to update `pg_hba.conf` and
`pg_ident.conf` file.

We append `include_if_exists {databricks_pg_hba}` to `pg_hba.conf` and
similarly to `pg_ident.conf`. So that it will refer to databricks config
file without much change to existing pg default config file.

---------

Co-authored-by: Jarupat Jisarojito <jarupat.jisarojito@databricks.com>
Co-authored-by: William Huang <william.huang@databricks.com>
Co-authored-by: HaoyuHuang <haoyu.huang.68@gmail.com>
2025-07-25 20:50:03 +00:00
Christian Schwarz
19b74b8837 fix(page_service): getpage requests don't hold applied_gc_cutoff_lsn guard (#12743)
Before this PR, getpage requests wouldn't hold the
`applied_gc_cutoff_lsn` guard until they were done.

Theoretical impact: if we’re not holding the `RcuReadGuard`, gc can
theoretically concurrently delete reconstruct data that we need to
reconstruct the page.

I don't think this practically occurs in production because the odds of
it happening are quite low, especially for primary read_write computes.
But RO replicas / standby_horizon relies on correct
`applied_gc_cutofff_lsn`, so, I'm fixing this as part of the work ok
replacing standby_horizon propagation mechanism with leases (LKB-88).

The change is feature-gated with a feature flag, and evaluated once when
entering `handle_pagestream` to avoid performance impact.

For observability, we add a field to the `handle_pagestream` span, and a
slow-log to the place in `gc_loop` where it waits for the in-flight
RcuReadGuard's to drain.

refs
- fixes https://databricks.atlassian.net/browse/LKB-2572
- standby_horizon leases epic:
https://databricks.atlassian.net/browse/LKB-2572

---------

Co-authored-by: Christian Schwarz <Christian Schwarz>
2025-07-25 20:25:04 +00:00
Folke Behrens
25718e324a proxy: Define service_info metric showing the run state (#12749)
## Problem

Monitoring dashboards show aggregates of all proxy instances, including
terminating ones. This can skew the results or make graphs less
readable. Also, alerts must be tuned to ignore certain signals from
terminating proxies.

## Summary of changes

Add a `service_info` metric currently with one label, `state`, showing
if an instance is in state `init`, `running`, or `terminating`. The
metric can be joined with other metrics to filter the presented time
series.
2025-07-25 18:27:21 +00:00
Dmitrii Kovalkov
ac8f44c70e tests: stop ps immediately in test_ps_unavailable_after_delete (#12728)
## Problem
test_ps_unavailable_after_delete is flaky. All test failures I've looked
at are because of ERROR log messages in pageserver, which happen because
storage controller tries runs a reconciliations during the graceful
shutdown of the pageserver.

I wasn't able to reproduce it locally, but I think stopping PS
immediately instead of gracefully should help. If not, we might just
silence those errors.

- Closes: https://databricks.atlassian.net/browse/LKB-745
2025-07-25 18:09:34 +00:00
Conrad Ludgate
d09664f039 [proxy] replace TimedLru with moka (#12726)
LKB-2536 TimedLru is hard to maintain. Let's use moka instead. Stacked
on top of #12710.
2025-07-25 17:39:48 +00:00
Mikhail
6689d6fd89 LFC prewarm perftest fixes: use existing staging project (#12651)
https://github.com/neondatabase/cloud/issues/19011

- Prewarm config changes are not publicly available.
  Correct the test by using a pre-filled 50 GB project on staging
- Create extension neon with schema neon to fix read performance tests
on staging, error example in
https://neon-github-public-dev.s3.amazonaws.com/reports/main/16483462789/index.html#suites/3d632da6dda4a70f5b4bd24904ab444c/919841e331089fc4/
- Don't create extra endpoint in LFC prewarm performance tests
2025-07-25 16:56:41 +00:00
Tristan Partin
33b400beae [BRC-1425] Plumb through and set the requisite GUCs when starting the compute instance (#12732)
## Problem

We need the set the following Postgres GUCs to the correct value before
starting Postgres in the compute instance:

```
databricks.workspace_url
databricks.enable_databricks_identity_login
databricks.enable_sql_restrictions
```

## Summary of changes

Plumbed through `workspace_url` and other GUC settings via
`DatabricksSettings` in `ComputeSpec`. The spec is sent to the compute
instance when it starts up and the GUCs are written to `postgresql.conf`
before the postgres process is launched.

---------

Co-authored-by: Jarupat Jisarojito <jarupat.jisarojito@databricks.com>
Co-authored-by: William Huang <william.huang@databricks.com>
2025-07-25 15:20:05 +00:00
Tristan Partin
ca07f7dba5 Copy pg server cert and key to pgdata with correct permission (#12731)
## Problem

Copy certificate and key from secret mount directory to `pgdata`
directory where `postgres` is the owner and we can set the key
permission to 0600.

## Summary of changes

- Added new pgparam `pg_compute_tls_settings` to specify where k8s
secret for certificate and key are mounted.
- Added a new field to `ComputeSpec` called `databricks_settings`. This
is a struct that will be used to store any other settings that needs to
be propagate to Compute but should not be persisted to `ComputeSpec` in
the database.
- Then when the compute container start up, as part of `prepare_pgdata`
function, it will copied `server.key` and `server.crt` from k8s mounted
directory to `pgdata` directory.

## How is this tested?

Add unit tests.
Manual test via KIND

Co-authored-by: Jarupat Jisarojito <jarupat.jisarojito@databricks.com>
2025-07-25 15:05:05 +00:00
Vlad Lazar
b0dfe0ffa6 storcon: attempt all non-essential location config calls during reconciliations (#12745)
## Problem

We saw the following in the field:

Context and observations:
* The storage controller keeps track of the latest generations and the
pageserver that issued the latest generation in the database
* When the storage controller needs to proxy a request (e.g. timeline
creation) to the pageservers, it will find use the pageserver that
issued the latest generation from the db (generation_pageserver).
* pageserver-2.cell-2 got into a bad state and wasn't able to apply
location_config (e.g. detach a shard)

What happened:
1. pageserver-2.cell-2 was a secondary for our shard since we were not
able to detach it
2. control plane asked to detach a tenant (presumably because it was
idle)
a. In response storcon clears the generation_pageserver from the db and
attempts to detach all locations
b. it tries to detach pageserver-2.cell-2 first, but fails, which fails
the entire reconciliation leaving the good attached location still there
c. return success to cplane

3. control plane asks to re-attach the tenant
a. In response storcon performs a reconciliation
b. it finds that the observed state matches the intent (remember we did
not detach the primary at step(2))
c. skips incrementing the genration and setting the
generation_pageserver column

Now any requests that need to be proxied to pageservers and rely on the
generation_pageserver db column fail because that's not set

## Summary of changes

1. We do all non-essential location config calls (setting up
secondaries,
detaches) at the end of the reconciliation. Previously, we bailed out
of the reconciliation on the first failure. With this patch we attempt
all of the RPCs.
This allows the observed state to update even if another RPC failed for
unrelated reasons.

2. If the overall reconciliation failed, we don't want to remove nodes
from the
observed state as a safe-guard. With the previous patch, we'll get a
deletion delta to process, which would be ignored. Ignoring it is not
the right thing to do since it's out of sync with the db state.
Hence, on reconciliation failures map deletion from the observed state
to the uncertain state. Future reconciliation will query the node to
refresh their observed state.

Closes LKB-204
2025-07-25 14:03:17 +00:00
Erik Grinaker
185ead8395 pageserver: verify gRPC GetPages on correct shard (#12722)
Verify that gRPC `GetPageRequest` has been sent to the shard that owns
the pages. This avoid spurious `NotFound` errors if a compute misroutes
a request, which can appear scarier (e.g. data loss).

Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).
2025-07-25 13:43:04 +00:00
Erik Grinaker
37e322438b pageserver: document gRPC compute accessibility (#12724)
Document that the Pageserver gRPC port is accessible by computes, and
should not provide internal services.

Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).
2025-07-25 13:35:44 +00:00
Gustavo Bazan
fca2c32e59 [ci/docker] task: Apply some quick wins for tools dockerfile (#12740)
## Problem

The Dockerfile for build tools has some small issues that are easy to
fix to make it follow some of docker best practices

## Summary of changes

Apply some small quick wins on the Dockerfile for build tools

- Usage of apt-get over apt
- usage of --no-cache-dir for pip install
2025-07-25 12:39:01 +00:00
Conrad Ludgate
d19aebcf12 [proxy] introduce moka for the project-info cache (#12710)
## Problem

LKB-2502 The garbage collection of the project info cache is garbage. 

What we observed: If we get unlucky, we might throw away a very hot
entry if the cache is full. The GC loop is dependent on getting a lucky
shard of the projects2ep table that clears a lot of cold entries. The GC
does not take into account active use, and the interval it runs at is
too sparse to do any good.

Can we switch to a proper cache implementation?

Complications:
1. We need to invalidate by project/account.
2. We need to expire based on `retry_delay_ms`.

## Summary of changes

1. Replace `retry_delay_ms: Duration` with `retry_at: Instant` when
deserializing.
2. Split the EndpointControls from the RoleControls into two different
caches.
3. Introduce an expiry policy based on error retry info.
4. Introduce `moka` as a dependency, replacing our `TimedLru`.

See the follow up PR for changing all TimedLru instances to use moka:
#12726.
2025-07-25 11:40:47 +00:00
Conrad Ludgate
a70a5bccff move subzero_core to proxy libs (#12742)
We have a dedicated libs folder for proxy related libraries. Let's move
the subzero_core stub there.
2025-07-25 10:44:28 +00:00
Conrad Ludgate
d9cedb4a95 [tokio-postgres] fix regression in buffer reuse (#12739)
Follow up to #12701, which introduced a new regression. When profiling
locally I noticed that writes have the tendency to always reallocate. On
investigation I found that even if the `Connection`'s write buffer is
empty, if it still shares the same data pointer as the `Client`'s write
buffer then the client cannot reclaim it.

The best way I found to fix this is to just drop the `Connection`'s
write buffer each time we fully flush it.

Additionally, I remembered that `BytesMut` has an `unsplit` method which
is allows even better sharing over the previous optimisation I had when
'encoding'.
2025-07-25 09:03:21 +00:00
Tristan Partin
b623fbae0c Cancel PG query if stuck at refreshing configuration (#12717)
## Problem

While configuring or reconfiguring PG due to PageServer movements, it's
possible PG may get stuck if PageServer is moved around after fetching
the spec from StorageController.

## Summary of changes

To fix this issue, this PR introduces two changes:
1. Fail the PG query directly if the query cannot request configuration
for certain number of times.
2. Introduce a new state `RefreshConfiguration` in compute tools to
differentiate it from `RefreshConfigurationPending`. If compute tool is
already in `RefreshConfiguration` state, then it will not accept new
request configuration requests.

## How is this tested?
Chaos testing.

Co-authored-by: Chen Luo <chen.luo@databricks.com>
2025-07-25 00:01:59 +00:00
Tristan Partin
512210bb5a [BRC-2368] Add PS and compute_ctl metrics to report pagestream request errors (#12716)
## Problem

In our experience running the system so far, almost all of the "hang
compute" situations are due to the compute (postgres) pointing at the
wrong pageservers. We currently mainly rely on the promethesus exporter
(PGExporter) running on PG to detect and report any down time, but these
can be unreliable because the read and write probes the PGExporter runs
do not always generate pageserver requests due to caching, even though
the real user might be experiencing down time when touching uncached
pages.

We are also about to start disk-wiping node pool rotation operations in
prod clusters for our pageservers, and it is critical to have a
convenient way to monitor the impact of these node pool rotations so
that we can quickly respond to any issues. These metrics should provide
very clear signals to address this operational need.

## Summary of changes

Added a pair of metrics to detect issues between postgres' PageStream
protocol (e.g. get_page_at_lsn, get_base_backup, etc.) communications
with pageservers:
* On the compute node (compute_ctl), exports a counter metric that is
incremented every time postgres requests a configuration refresh.
Postgres today only requests these configuration refreshes when it
cannot connect to a pageserver or if the pageserver rejects its request
by disconnecting.
* On the pageserver, exports a counter metric that is incremented every
time it receives a PageStream request that cannot be handled because the
tenant is not known or if the request was routed to the wrong shard
(e.g. secondary).

### How I plan to use metrics
I plan to use the metrics added here to create alerts. The alerts can
fire, for example, if these counters have been continuously increasing
for over a certain period of time. During rollouts, misrouted requests
may occasionally happen, but they should soon die down as
reconfigurations make progress. We can start with something like raising
the alert if the counters have been increasing continuously for over 5
minutes.

## How is this tested?

New integration tests in
`test_runner/regress/test_hadron_ps_connectivity_metrics.py`

Co-authored-by: William Huang <william.huang@databricks.com>
2025-07-24 19:05:00 +00:00
HaoyuHuang
9eebd6fc79 A few more compute_ctl changes (#12713)
## Summary of changes
A bunch of no-op changes. 

The only other thing is that the lock is released early in the terminate
func.
2025-07-24 19:01:30 +00:00
Tristan Partin
11527b9df7 [BRC-2951] Enforce PG backpressure parameters at the shard level (#12694)
## Problem
Currently PG backpressure parameters are enforced globally. With tenant
splitting, this makes it hard to balance small tenants and large
tenants. For large tenants with more shards, we need to increase the
lagging because each shard receives total/shard_count amount of data,
while doing so could be suboptimal to small tenants with fewer shards.

## Summary of changes
This PR makes these parameters to be enforced at the shard level, i.e.,
PG will compute the actual lag limit by multiply the shard count.

## How is this tested?
Added regression test.

Co-authored-by: Chen Luo <chen.luo@databricks.com>
2025-07-24 18:41:29 +00:00
Tristan Partin
89554af1bd [BRC-1778] Have PG signal compute_ctl to refresh configuration if it suspects that it is talking to the wrong PSs (#12712)
## Problem

This is a follow-up to TODO, as part
of the effort to rewire the compute reconfiguration/notification
mechanism to make it more robust. Please refer to that commit or ticket
BRC-1778 for full context of the problem.

## Summary of changes

The previous change added mechanism in `compute_ctl` that makes it
possible to refresh the configuration of PG on-demand by having
`compute_ctl` go out to download a new config from the control
plane/HCC. This change wired this mechanism up with PG so that PG will
signal `compute_ctl` to refresh its configuration when it suspects that
it could be talking to incorrect pageservers due to a stale
configuration.

PG will become suspicious that it is talking to the wrong pageservers in
the following situations:
1. It cannot connect to a pageserver (e.g., getting a network-level
connection refused error)
2. It can connect to a pageserver, but the pageserver does not return
any data for the GetPage request
3. It can connect to a pageserver, but the pageserver returns a
malformed response
4. It can connect to a pageserver, but there is an error receiving the
GetPage request response for any other reason

This change also includes a minor tweak to `compute_ctl`'s config
refresh behavior. Upon receiving a request to refresh PG configuration,
`compute_ctl` will reach out to download a config, but it will not
attempt to apply the configuration if the config is the same as the old
config is it replacing. This optimization is added because the act of
reconfiguring itself requires working pageserver connections. In many
failure situations it is likely that PG detects an issue with a
pageserver before the control plane can detect the issue, migrate
tenants, and update the compute config. In this case even the latest
compute config won't point PG to working pageservers, causing the
configuration attempt to hang and negatively impact PG's
time-to-recovery. With this change, `compute_ctl` only attempts
reconfiguration if the refreshed config points PG to different
pageservers.

## How is this tested?

The new code paths are exercised in all existing tests because this
mechanism is on by default.

Explicitly tested in `test_runner/regress/test_change_pageserver.py`.

Co-authored-by: William Huang <william.huang@databricks.com>
2025-07-24 16:44:45 +00:00
Peter Bendel
f391186aa7 TPC-C like periodic benchmark using benchbase (#12665)
## Problem

We don't have a well-documented, periodic benchmark for TPC-C like OLTP
workload.

## Summary of changes

# Benchbase TPC-C-like Performance Results

Runs TPC-C-like benchmarks on Neon databases using
[Benchbase](https://github.com/cmu-db/benchbase).
Docker images are built
[here](https://github.com/neondatabase-labs/benchbase-docker-images)

We run the benchmarks at different scale factors aligned with different
compute sizes we offer to customers.
For each scale factor, we determine a max rate (see Throughput in warmup
phase) and then run the benchmark at a target rate of approx. 70 % of
the max rate.
We use different warehouse sizes which determine the working set size -
it is optimized for LFC size of the respected pricing tier.
Usually we should get LFC hit rates above 70 % for this setup and quite
good, consistent (non-flaky) latencies.

## Expected performance as of first testing this

| Tier | CU | Warehouses | Terminals | Max TPS | LFC size | Working set
size | LFC hit rate | Median latency | p95 latency |

|------------|------------|---------------|-----------|---------|----------|------------------|--------------|----------------|-------------|
| free | 0.25-2 | 50 - 5 GB | 150 | 800 | 5 GB | 6.3 GB | 95 % | 170 ms
| 600 ms |
| serverless | 2-8 | 500 - 50 GB | 230 | 2000 | 26 GB | ?? GB | 91 % |
50 ms | 200 ms |
| business | 2-16 | 1000 - 100 GB | 330 | 2900 | 51 GB | 50 GB | 72 % |
40 ms | 180 ms |

Each run 
- first loads the database (not shown in the dashboard). 
- Then we run a warmup phase for 20 minutes to warm up the database and
the LFC at unlimited target rate (max rate) (highest throughput but
flaky latencies).
The warmup phase can be used to determine the max rate and adjust it in
the github workflow in case Neon is faster in the future.
- Then we run the benchmark at a target rate of approx. 70 % of the max
rate for 1 hour (expecting consistent latencies and throughput).

## Important notes on implementation:
- we want to eventually publish the process how to reproduce these
benchmarks
- thus we want to reduce all dependencies necessary to run the
benchmark, the only thing needed are
   - docker
   - the docker images referenced above for benchbase
- python >= 3.9 to run some config generation steps and create diagrams
- to reduce dependencies we deliberatly do NOT use some of our python
fixture test infrastructure to make the dependency chain really small -
so pls don't add a review comment "should reuse fixture xy"
- we also upload all generator python scripts, generated bash shell
scripts and configs as well as raw results to S3 bucket that we later
want to publish once this benchmark is reviewed and approved.
2025-07-24 16:26:54 +00:00
Paul Banks
94b41b531b storecon: Fix panic due to race with chaos migration on staging (#12727)
## Problem

* Fixes LKB-743

We get regular assertion failures on staging caused by a race with chaos
injector. If chaos injector decides to migrate a tenant shard between
the background optimisation planning and applying optimisations then we
attempt to migrate and already migrated shard and hit an assertion
failure.

## Summary of changes

@VladLazar fixed a variant of this issue by
adding`validate_optimization` recently, however it didn't validate the
specific property this other assertion requires. Fix is just to update
it to cover all the expected properties.
2025-07-24 16:14:47 +00:00
Erik Grinaker
d793088225 pgxn: set MACOSX_DEPLOYMENT_TARGET (#12723)
## Problem

Compiling `neon-pg-ext-v17` results in these linker warnings for
`libcommunicator.a`:

```
$ make -j`nproc` -s neon-pg-ext-v17
Installing PostgreSQL v17 headers
Compiling PostgreSQL v17
Compiling neon-specific Postgres extensions for v17
ld: warning: object file (/Users/erik.grinaker/Projects/neon/target/debug/libcommunicator.a[1159](25ac62e5b3c53843-curve25519.o)) was built for newer 'macOS' version (15.5) than being linked (15.0)
ld: warning: object file (/Users/erik.grinaker/Projects/neon/target/debug/libcommunicator.a[1160](0bbbd18bda93c05b-aes_nohw.o)) was built for newer 'macOS' version (15.5) than being linked (15.0)
ld: warning: object file (/Users/erik.grinaker/Projects/neon/target/debug/libcommunicator.a[1161](00c879ee3285a50d-montgomery.o)) was built for newer 'macOS' version (15.5) than being linked (15.0)
[...]
```

## Summary of changes

Set `MACOSX_DEPLOYMENT_TARGET` to the current local SDK version (15.5 in
this case), which links against object files for that version.
2025-07-24 14:48:35 +00:00
John Spray
67ad420e26 tests: turn down error rate in test_compute_pageserver_connection_stress (#12721)
## Problem

Compute retries are finite (e.g. 5x in a basebackup) -- with a 50%
failure rate we have pretty good chance of exceeding that and the test
failing.

Fixes: https://databricks.atlassian.net/browse/LKB-2278

## Summary of changes

- Turn connection error rate down to 20%

Co-authored-by: John Spray <john.spray@databricks.com>
2025-07-24 14:42:39 +00:00
Tristan Partin
90cd5a5be8 [BRC-1778] Add mechanism to compute_ctl to pull a new config (#12711)
## Problem

We have been dealing with a number of issues with the SC compute
notification mechanism. Various race conditions exist in the
PG/HCC/cplane/PS distributed system, and relying on the SC to send
notifications to the compute node to notify it of PS changes is not
robust. We decided to pursue a more robust option where the compute node
itself discovers whether it may be pointing to the incorrect PSs and
proactively reconfigure itself if issues are suspected.

## Summary of changes

To support this self-healing reconfiguration mechanism several pieces
are needed. This PR adds a mechanism to `compute_ctl` called "refresh
configuration", where the compute node reaches out to the control plane
to pull a new config and reconfigure PG using the new config, instead of
listening for a notification message containing a config to arrive from
the control plane. Main changes to compute_ctl:

1. The `compute_ctl` state machine now has a new State,
`RefreshConfigurationPending`. The compute node may enter this state
upon receiving a signal that it may be using the incorrect page servers.
2. Upon entering the `RefreshConfigurationPending` state, the background
configurator thread in `compute_ctl` wakes up, pulls a new config from
the control plane, and reconfigures PG (with `pg_ctl reload`) according
to the new config.
3. The compute node may enter the new `RefreshConfigurationPending`
state from `Running` or `Failed` states. If the configurator managed to
configure the compute node successfully, it will enter the `Running`
state, otherwise, it stays in `RefreshConfigurationPending` and the
configurator thread will wait for the next notification if an incorrect
config is still suspected.
4. Added various plumbing in `compute_ctl` data structures to allow the
configurator thread to perform the config fetch.

The "incorrect config suspected" notification is delivered using a HTTP
endpoint, `/refresh_configuration`, on `compute_ctl`. This endpoint is
currently not called by anyone other than the tests. In a follow up PR I
will set up some code in the PG extension/libpagestore to call this HTTP
endpoint whenever PG suspects that it is pointing to the wrong page
servers.

## How is this tested?

Modified `test_runner/regress/test_change_pageserver.py` to add a
scenario where we use the new `/refresh_configuration` mechanism instead
of the existing `/configure` mechanism (which requires us sending a full
config to compute_ctl) to have the compute node reload and reconfigure
its pageservers.

I took one shortcut to reduce the scope of this change when it comes to
testing: the compute node uses a local config file instead of pulling a
config over the network from the HCC. This simplifies the test setup in
the following ways:
* The existing test framework is set up to use local config files for
compute nodes only, so it's convenient if I just stick with it.
* The HCC today generates a compute config with production settings
(e.g., assuming 4 CPUs, 16GB RAM, with local file caches), which is
probably not suitable in tests. We may need to add another test-only
endpoint config to the control plane to make this work.

The config-fetch part of the code is relatively straightforward (and
well-covered in both production and the KIND test) so it is probably
fine to replace it with loading from the local config file for these
integration tests.

In addition to making sure that the tests pass, I also manually
inspected the logs to make sure that the compute node is indeed
reloading the config using the new mechanism instead of going down the
old `/configure` path (it turns out the test has bugs which causes
compute `/configure` messages to be sent despite the test intending to
disable/blackhole them).

```test
2024-09-24T18:53:29.573650Z  INFO http request{otel.name=/refresh_configuration http.method=POST}: serving /refresh_configuration POST request
2024-09-24T18:53:29.573689Z  INFO configurator_main_loop: compute node suspects its configuration is out of date, now refreshing configuration
2024-09-24T18:53:29.573706Z  INFO configurator_main_loop: reloading config.json from path: /workspaces/hadron/test_output/test_change_pageserver_using_refresh[release-pg16]/repo/endpoints/ep-1/spec.json
PG:2024-09-24 18:53:29.574 GMT [52799] LOG:  received SIGHUP, reloading configuration files
PG:2024-09-24 18:53:29.575 GMT [52799] LOG:  parameter "neon.extension_server_port" cannot be changed without restarting the server
PG:2024-09-24 18:53:29.575 GMT [52799] LOG:  parameter "neon.pageserver_connstring" changed to "postgresql://no_user@localhost:15008"
...
```

Co-authored-by: William Huang <william.huang@databricks.com>
2025-07-24 14:26:21 +00:00
Christian Schwarz
643448b1a2 test_hot_standby_gc: work around standby_horizon-related flakiness/raciness uncovered by #12431 (#12704)
PR #12431 set initial lease deadline = 0s for tests.
This turned test_hot_standby_gc flaky because it now runs GC: it started
failing with `tried to request a page version that was garbage
collected`
because the replica reads below applied gc cutoff.

The leading theory is that, we run the timeline_gc() before the first
standby_horizon push arrives at PS. That is definitively a thing that
can happen with the current standby_horizon mechanism, and it's now
tracked as such in https://databricks.atlassian.net/browse/LKB-2499.

We don't have logs to confirm this theory though, but regardless,
try the fix in this PR and see if it stabilizes things.

Refs
- flaky test issue: https://databricks.atlassian.net/browse/LKB-2465

## Problem

## Summary of changes
2025-07-24 14:00:22 +00:00
Conrad Ludgate
8daebb6ed4 [proxy] remove TokioMechanism and HyperMechanism (#12672)
Another go at #12341. LKB-2497

We now only need 1 connect mechanism (and 1 more for testing) which
saves us some code and complexity. We should be able to remove the final
connect mechanism when we create a separate worker task for
pglb->compute connections - either via QUIC streams or via in-memory
channels.

This also now ensures that connect_once always returns a ConnectionError
type - something simple enough we can probably define a serialisation
for in pglb.

* I've abstracted connect_to_compute to always use TcpMechanism and the
ProxyConfig.
* I've abstracted connect_to_compute_and_auth to perform authentication,
managing any retries for stale computes
* I had to introduce a separate `managed` function for taking ownership
of the compute connection into the Client/Connection pair
2025-07-24 12:37:04 +00:00
Alexey Kondratov
ab14521ea5 fix(compute): Turn off database collector in postgres_exporter (#12684)
## Problem

`postgres_exporter` has database collector enabled by default and it
doesn't filter out invalid databases, see

06a553c816/collector/pg_database.go (L67)
so if it hits one, it starts spamming logs
```
ERROR:  [NEON_SMGR] [reqid d9700000018] could not read db size of db 705302 from page server at lsn 5/A2457EB0
```

## Summary of changes

We don't use `pg_database_size_bytes` metric anyway, see

5e19b3fd89/apps/base/compute-metrics/scrape-compute-pg-exporter-neon.yaml (L29)
so just turn it off by passing `--no-collector.database`.
2025-07-24 11:52:31 +00:00
dependabot[bot]
e82021d6fe build(deps): bump the npm_and_yarn group across 1 directory with 2 updates (#12678)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-24 10:51:09 +00:00
Conrad Ludgate
9997661138 [proxy/tokio-postgres] garbage collection for codec buffers (#12701)
## Problem

A large insert or a large row will cause the codec to allocate a large
buffer. The codec never shrinks the buffer however. LKB-2496

## Summary of changes

1. Introduce a naive GC system for codec buffers
2. Try and reduce copies as much as possible
2025-07-24 10:30:02 +00:00
Ivan Efremov
0e427fc117 Update proxy-bench workflow to use bare-metal script (#12703)
Pass the params for run.sh in proxy-bench repo to use bare-metal config.
Fix the paths and cleanup procedure.
2025-07-24 08:23:07 +00:00
Tristan Partin
9b2e6f862a Set an upper limit on PG backpressure throttling (#12675)
## Problem
Tenant split test revealed another bug with PG backpressure throttling
that under some cases PS may never report its progress back to SK (e.g.,
observed when aborting tenant shard where the old shard needs to
re-establish SK connection and re-ingest WALs from a much older LSN). In
this case, PG may get stuck forever.

## Summary of changes
As a general precaution that PS feedback mechanism may not always be
reliable, this PR uses the previously introduced WAL write rate limit
mechanism to slow down write rates instead of completely pausing it. The
idea is to introduce a new
`databricks_effective_max_wal_bytes_per_second`, which is set to
`databricks_max_wal_mb_per_second` when no PS back pressure and is set
to `10KB` when there is back pressure. This way, PG can still write to
SK, though at a very low speed.

The PR also fixes the problem that the current WAL rate limiting
mechanism is too coarse grained and cannot enforce limits < 1MB. This is
because it always resets the rate limiter after 1 second, even if PG
could have written more data in the past second. The fix is to introduce
a `batch_end_time_us` which records the expected end time of the current
batch. For example, if PG writes 10MB of data in a single batch, and max
WAL write rate is set as `1MB/s`, then `batch_end_time_us` will be set
as 10 seconds later.

## How is this tested?
Tweaked the existing test, and also did manual testing on dev. I set
`max_replication_flush_lag` as 1GB, and loaded 500GB pgbench tables.
It's expected to see PG gets throttled periodically because PS will
accumulate 4GB of data before flushing.

Results:
when PG is throttled:
```
9500000 of 3300000000 tuples (0%) done (elapsed 10.36 s, remaining 3587.62 s)
9600000 of 3300000000 tuples (0%) done (elapsed 124.07 s, remaining 42523.59 s)
9700000 of 3300000000 tuples (0%) done (elapsed 255.79 s, remaining 86763.97 s)
9800000 of 3300000000 tuples (0%) done (elapsed 315.89 s, remaining 106056.52 s)
9900000 of 3300000000 tuples (0%) done (elapsed 412.75 s, remaining 137170.58 s)
```

when PS just flushed:
```
18100000 of 3300000000 tuples (0%) done (elapsed 433.80 s, remaining 78655.96 s)
18200000 of 3300000000 tuples (0%) done (elapsed 433.85 s, remaining 78231.71 s)
18300000 of 3300000000 tuples (0%) done (elapsed 433.90 s, remaining 77810.62 s)
18400000 of 3300000000 tuples (0%) done (elapsed 433.96 s, remaining 77395.86 s)
18500000 of 3300000000 tuples (0%) done (elapsed 434.03 s, remaining 76987.27 s)
18600000 of 3300000000 tuples (0%) done (elapsed 434.08 s, remaining 76579.59 s)
18700000 of 3300000000 tuples (0%) done (elapsed 434.13 s, remaining 76177.12 s)
18800000 of 3300000000 tuples (0%) done (elapsed 434.19 s, remaining 75779.45 s)
18900000 of 3300000000 tuples (0%) done (elapsed 434.84 s, remaining 75489.40 s)
19000000 of 3300000000 tuples (0%) done (elapsed 434.89 s, remaining 75097.90 s)
19100000 of 3300000000 tuples (0%) done (elapsed 434.94 s, remaining 74712.56 s)
19200000 of 3300000000 tuples (0%) done (elapsed 498.93 s, remaining 85254.20 s)
19300000 of 3300000000 tuples (0%) done (elapsed 498.97 s, remaining 84817.95 s)
19400000 of 3300000000 tuples (0%) done (elapsed 623.80 s, remaining 105486.76 s)
19500000 of 3300000000 tuples (0%) done (elapsed 745.86 s, remaining 125476.51 s)
```

Co-authored-by: Chen Luo <chen.luo@databricks.com>
2025-07-23 22:37:27 +00:00
Tristan Partin
12e87d7a9f Add neon.lakebase_mode boolean GUC (#12714)
This GUC will become useful for temporarily disabling Lakebase-specific
features during the code merge.

Signed-off-by: Tristan Partin <tristan.partin@databricks.com>
2025-07-23 22:37:20 +00:00
Mikhail
a56afee269 Accept primary compute spec in /promote, promotion corner cases testing (#12574)
https://github.com/neondatabase/cloud/issues/19011
- Accept `ComputeSpec` in `/promote` instead of just passing safekeepers
and LSN. Update API spec
- Add corner case tests for promotion when promotion or perwarm fails
(using failpoints)
- Print root error for prewarm and promotion in status handlers
2025-07-23 20:11:34 +00:00
Alex Chi Z.
9e6ca2932f fix(test): convert bool to lowercase when invoking neon-cli (#12688)
## Problem

There has been some inconsistencies of providing tenant config via
`tenant_create` and via other tenant config APIs due to how the
properties are processed: in `tenant_create`, the test framework calls
neon-cli and therefore puts those properties in the cmdline. In other
cases, it's done via the HTTP API by directly serializing to a JSON.
When using the cmdline, the program only accepts serde bool that is
true/false.

## Summary of changes

Convert Python bool into `true`/`false` when using neon-cli.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-23 18:56:37 +00:00
HaoyuHuang
63ea4b0579 A few more compute_tool changes (#12687)
## Summary of changes
All changes are no-op except that the tracing-appender lib is upgraded
from 0.2.2 to 0.2.3
2025-07-23 18:30:33 +00:00
Folke Behrens
20881ef65e otel: Use blocking reqwest in dedicated thread (#12699)
## Problem

OTel 0.28+ by default uses blocking operations in a dedicated thread and
doesn't start a tokio runtime. Reqwest as currently configured wants to
spawn tokio tasks.

## Summary of changes

Use blocking reqwest.

This PR just mitigates the current issue.
2025-07-23 18:21:36 +00:00
Conrad Ludgate
a695713727 [sql-over-http] Reset session state between pooled connection re-use (#12681)
Session variables can be set during one sql-over-http query and observed
on another when that pooled connection is re-used. To address this we
can use `RESET ALL;` before re-using the connection. LKB-2495

To be on the safe side, we can opt for a full `DISCARD ALL;`, but that
might have performance regressions since it also clears any query plans.
See pgbouncer docs
https://www.pgbouncer.org/config.html#server_reset_query.

`DISCARD ALL` is currently defined as:
```
CLOSE ALL;
SET SESSION AUTHORIZATION DEFAULT;
RESET ALL;
DEALLOCATE ALL;
UNLISTEN *;
SELECT pg_advisory_unlock_all();
DISCARD PLANS;
DISCARD TEMP;
DISCARD SEQUENCES;
```

I've opted to keep everything here except the `DISCARD PLANS`. I've
modified the code so that this query is executed in the background when
a connection is returned to the pool, rather than when taken from the
pool.

This should marginally improve performance for Neon RLS by removing 1
(localhost) round trip. I don't believe that keeping query plans could
be a security concern. It's a potential side channel, but I can't
imagine what you could extract from it.

---

Thanks to
https://github.com/neondatabase/neon/pull/12659#discussion_r2219016205
for probing the idea in my head.
2025-07-23 17:43:43 +00:00
Alex Chi Z.
5c57e8a11b feat(pageserver): rework reldirv2 rollout (#12576)
## Problem

LKB-197, #9516 

To make sure the migration path is smooth.

The previous plan is to store new relations in new keyspace and old ones
in old keyspace until it gets dropped. This makes the migration path
hard as we can't validate v2 writes and can't rollback. This patch gives
us a more smooth migration path:

- The first time we enable reldirv2 for a tenant, we copy over
everything in the old keyspace to the new one. This might create a short
spike of latency for the create relation operation, but it's oneoff.
- After that, we have identical v1/v2 keyspace and read/write both of
them. We validate reads every time we list the reldirs.
- If we are in `migrating` mode, use v1 as source of truth and log a
warning for failed v2 operations. If we are in `migrated` mode, use v2
as source of truth and error when writes fail.
- One compatibility test uses dataset from the time where we enabled
reldirv2 (of the original rollout plan), which only has relations
written to the v2 keyspace instead of the v1 keyspace. We had to adjust
it accordingly.
- Add `migrated_at` in index_part to indicate the LSN where we did the
initialize.

TODOs:

- Test if relv1 can be read below the migrated_at LSN.
- Move the initialization process to L0 compaction instead of doing it
on the write path.
- Disable relcache in the relv2 test case so that all code path gets
fully tested.

## Summary of changes

- New behavior of reldirv2 migration flags as described above.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-23 16:12:46 +00:00
Alexander Bayandin
84a2556c9f compute-node.Dockerfile: update bullseye-backports backports url (#12700)
## Problem

> bullseye-backports has reached end-of-life and is no longer supported
or updated

From: https://backports.debian.org/Instructions/

This causes the compute-node image build to fail with the following
error:
```
0.099 Err:5 http://deb.debian.org/debian bullseye-backports Release
0.099   404  Not Found [IP: 146.75.122.132 80]
...
1.293 E: The repository 'http://deb.debian.org/debian bullseye-backports Release' does not have a Release file.
```

## Summary of changes
- Use archive version of `bullseye-backports`
2025-07-23 14:45:52 +00:00
Conrad Ludgate
761e9e0e1d [proxy] move read_info from the compute connection to be as late as possible (#12660)
Second attempt at #12130, now with a smaller diff.

This allows us to skip allocating for things like parameter status and
notices that we will either just forward untouched, or discard.

LKB-2494
2025-07-23 13:33:21 +00:00
Dmitrii Kovalkov
94cb9a79d9 safekeeper: generation aware timeline tombstones (#12482)
## Problem
With safekeeper migration in mind, we can now pull/exclude the timeline
multiple times within the same safekeeper. To avoid races between out of
order requests, we need to ignore the pull/exclude requests if we have
already seen a higher generation.

- Closes: https://github.com/neondatabase/neon/issues/12186
- Closes: [LKB-949](https://databricks.atlassian.net/browse/LKB-949)

## Summary of changes
- Annotate timeline tombstones in safekeeper with request generation.
- Replace `ignore_tombstone` option with `mconf` in
`PullTimelineRequest`
- Switch membership in `pull_timeline` if the existing/pulled timeline
has an older generation.
- Refuse to switch membership if the timeline is being deleted
(`is_canceled`).
- Refuse to switch membership in compute greeting request if the
safekeeper is not a member of `mconf`.
- Pass `mconf` in `PullTimelineRequest` in safekeeper_service

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2025-07-23 11:01:04 +00:00
Tristan Partin
fc242afcc2 PG ignore PageserverFeedback from unknown shards (#12671)
## Problem
When testing tenant splits, I found that PG can get backpressure
throttled indefinitely if the split is aborted afterwards. It turns out
that each PageServer activates new shard separately even before the
split is committed and they may start sending PageserverFeedback to PG
directly. As a result, if the split is aborted, no one resets the
pageserver feedback in PG, and thus PG will be backpressure throttled
forever unless it's restarted manually.

## Summary of changes
This PR fixes this problem by having
`walprop_pg_process_safekeeper_feedback` simply ignore all pageserver
feedback from unknown shards. The source of truth here is defined by the
shard map, which is guaranteed to be reloaded only after the split is
committed.

Co-authored-by: Chen Luo <chen.luo@databricks.com>
2025-07-22 21:41:56 +00:00
Suhas Thalanki
e275221aef add hadron-specific metrics (#12686) 2025-07-22 21:17:45 +00:00
Alex Chi Z.
f859354466 feat(pageserver): add db rel count as feature flag property (#12632)
## Problem

As part of the reldirv2 rollout: LKB-197.


We will use number of db/rels as a criteria whether to rollout reldirv2
directly on the write path (simplest and easiest way of rollout). If the
number of rel/db is small then it shouldn't take too long time on the
write path.

## Summary of changes

* Compute db/rel count during basebackup.
* Also compute it during logical size computation.
* Collect maximum number of db/rel across all timelines in the feature
flag propeties.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-07-22 17:55:07 +00:00
Konstantin Knizhnik
b00a0096bf Reintialize page in allocNewBuffer only when buffer is returned (#12399)
## Problem

See https://github.com/neondatabase/neon/issues/12387

`allocNewBuffer` initialise page with zeros 
but not always return it because of parity checks.
In case of wrong parity the page is rejected and as a result we have
dirty page with zero LSN, which cause assertion failure on neon_write
when page is evicted from shared buffers.

## Summary of changes

Perform, page initialisation in `allocNewBuffer` only when buffer is
returned (parity check is passed).

Postgres PRs:
https://github.com/neondatabase/postgres/pull/661
https://github.com/neondatabase/postgres/pull/662
https://github.com/neondatabase/postgres/pull/663
https://github.com/neondatabase/postgres/pull/664

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Kosntantin Knizhnik <konstantin.knizhnik@databricks.com>
2025-07-22 17:50:26 +00:00
a-masterov
b3844903e5 Add new operations to Random operations test (#12213)
## Problem
We did not test some Public API calls, such as using a timestamp to
create a branch, reset_to_parent.
## Summary of changes
Tests now include some other operations: reset_to_parent, a branch
creation from any time in the past, etc.
Currently, the API calls are only exposed; the semantics are not
verified.

---------

Co-authored-by: Alexey Masterov <alexey.masterov@databricks.com>
2025-07-22 17:43:01 +00:00
Vlad Lazar
5b0972151c pageserver: silence shard resolution warning (#12685)
## Problem

We drive the get page requests that have started processing to
completion. So in the case when the compute received a reconfiguration
request and the old connection has a request procesing on the
pageserver, we are going to issue the warning.

I spot checked a few instances of the warning and in all cases the
compute was already connected to the correct pageserver.

## Summary of Changes

Downgrade to INFO. It would be nice to somehow figure out if the
connection has been terminated in the meantime, but the terminate libpq
message is still in the pipe while we're doing the shard resolution.

Closes LKB-2381
2025-07-22 17:34:23 +00:00
github-actions[bot]
fe28e71d71 Proxy release 2025-07-22 06:01 UTC 2025-07-22 06:01:40 +00:00
github-actions[bot]
7da277e82c Proxy release 2025-07-15 06:01 UTC 2025-07-15 06:01:18 +00:00
github-actions[bot]
dd5da6c7e0 Proxy release 2025-07-08 06:01 UTC 2025-07-08 06:01:17 +00:00
Conrad Ludgate
304ed02f74 Proxy release 2025-07-03 10:07 UTC 2025-07-03 11:07:21 +01:00
Conrad Ludgate
1a7c2450f5 fix redis credentials check (#12455)
## Problem

`keep_connection` does not exit, so it was never setting
`credentials_refreshed`.

## Summary of changes

Set `credentials_refreshed` to true when we first establish a
connection, and after we re-authenticate the connection.
2025-07-03 11:07:21 +01:00
Conrad Ludgate
d08748d4e4 Proxy release 2025-07-02 12:38 UTC 2025-07-02 13:38:03 +01:00
Ivan Efremov
fcf8127900 [proxy]: Fix redis IRSA expiration failure errors (#12430)
Relates to the
[#30688](https://github.com/neondatabase/cloud/issues/30688)
2025-07-02 13:38:03 +01:00
github-actions[bot]
c82d646598 Proxy release 2025-07-01 06:01 UTC 2025-07-01 06:01:46 +00:00
Conrad Ludgate
9b26b7bde9 Proxy release 2025-06-25 14:35 UTC 2025-06-25 15:35:55 +01:00
Conrad Ludgate
93350f7018 [proxy]: BatchQueue::call is not cancel safe - make it directly cancellation aware (#12345)
## Problem

https://github.com/neondatabase/cloud/issues/30539

If the current leader cancels the `call` function, then it has removed
the jobs from the queue, but will never finish sending the responses.
Because of this, it is not cancellation safe.

## Summary of changes

Document these functions as not cancellation safe. Move cancellation of
the queued jobs into the queue itself.

## Alternatives considered

1. We could spawn the task that runs the batch, since that won't get
cancelled.
* This requires `fn call(self: Arc<Self>)` or `fn call(&'static self)`.
2. We could add another scopeguard and return the requests back to the
queue.
* This requires that requests are always retry safe, and also requires
requests to be `Clone`.
2025-06-25 15:35:55 +01:00
github-actions[bot]
7f4f1785a8 Proxy release 2025-06-24 06:01 UTC 2025-06-24 06:01:22 +00:00
github-actions[bot]
bf2a21567d Proxy release 2025-06-17 06:01 UTC 2025-06-17 06:01:42 +00:00
github-actions[bot]
24053ff4ca Proxy release 2025-06-10 08:56 UTC 2025-06-10 08:56:51 +00:00
github-actions[bot]
b147439d6b Proxy release 2025-06-02 06:12 UTC 2025-06-02 06:12:02 +00:00
github-actions[bot]
54433c0839 Proxy release 2025-05-27 06:01 UTC 2025-05-27 06:01:30 +00:00
github-actions[bot]
40bb9ff62a Proxy release 2025-05-20 06:01 UTC 2025-05-20 06:01:25 +00:00
github-actions[bot]
4688b815b1 Proxy release 2025-05-14 09:39 UTC 2025-05-14 09:39:15 +00:00
github-actions[bot]
0982ca4636 Proxy release 2025-05-12 14:35 UTC 2025-05-12 14:35:27 +00:00
github-actions[bot]
7272d9f7b3 Proxy release 2025-05-06 09:47 UTC 2025-05-06 09:47:48 +00:00
github-actions[bot]
37d555aa59 Proxy release 2025/04/29 06:01 UTC 2025-04-29 06:01:28 +00:00
github-actions[bot]
cae3e2976b Proxy release 2025-04-22 2025-04-22 06:02:06 +00:00
github-actions[bot]
51ecd1bb37 Proxy release 2025-04-15 2025-04-15 06:01:10 +00:00
github-actions[bot]
1e6bb48076 Proxy release 2025-04-08 2025-04-08 06:01:37 +00:00
github-actions[bot]
1470af0b42 Proxy release 2025-04-01 2025-04-01 06:01:27 +00:00
github-actions[bot]
f92f92b91b Proxy release 2025-03-13 2025-03-13 13:43:01 +00:00
github-actions[bot]
dbb205ae92 Proxy release 2025-03-13 2025-03-13 09:50:35 +00:00
Conrad Ludgate
85072b715f Merge pull request #11106 from neondatabase/rc/release-proxy/2025-03-06
Proxy release 2025-03-06
2025-03-06 09:53:00 +00:00
github-actions[bot]
6c86fe7143 Proxy release 2025-03-06 2025-03-06 06:02:15 +00:00
JC Grünhage
66d5fe7f5b Merge pull request #11023 from neondatabase/rc/release-proxy/2025-02-27
Proxy release 2025-02-27
2025-02-27 19:10:58 +01:00
github-actions[bot]
a1b9528757 Proxy release 2025-02-27 2025-02-27 16:18:42 +00:00
Ivan Efremov
1423bb8aa2 Merge pull request #11011 from neondatabase/rc/release-proxy/2025-02-27
Proxy release 2025-02-27
2025-02-27 13:57:49 +02:00
github-actions[bot]
332f064a42 Proxy release 2025-02-27 2025-02-27 00:17:57 +00:00
Folke Behrens
c962f2b447 Merge pull request #10903 from neondatabase/rc/release-proxy/2025-02-20
Proxy release 2025-02-20
2025-02-20 10:37:47 +01:00
github-actions[bot]
446b3f9d28 Proxy release 2025-02-20 2025-02-20 06:02:01 +00:00
Conrad Ludgate
23352dc2e9 Merge pull request #10802 from neondatabase/rc/release-proxy/2025-02-13
Proxy release 2025-02-13
2025-02-13 08:41:01 +00:00
github-actions[bot]
c65fc5a955 Proxy release 2025-02-13 2025-02-13 06:02:01 +00:00
Ivan Efremov
3e624581cd Merge pull request #10691 from neondatabase/rc/release-proxy/2025-02-06
Proxy release 2025-02-06
2025-02-06 10:23:43 +02:00
github-actions[bot]
fedf4f169c Proxy release 2025-02-06 2025-02-06 06:02:11 +00:00
Folke Behrens
86d5798108 Merge pull request #10576 from neondatabase/rc/release-proxy/2025-01-30
Proxy release 2025-01-30
2025-01-30 08:52:09 +01:00
github-actions[bot]
8b4088dd8a Proxy release 2025-01-30 2025-01-30 06:02:00 +00:00
Ivan Efremov
c91905e643 Merge pull request #10416 from neondatabase/rc/release-proxy/2025-01-16
Proxy release 2025-01-16
2025-01-16 10:04:38 +02:00
github-actions[bot]
44b4e355a2 Proxy release 2025-01-16 2025-01-16 06:02:04 +00:00
Folke Behrens
03666a1f37 Merge pull request #10320 from neondatabase/rc/release-proxy/2025-01-09
Proxy release 2025-01-09
2025-01-09 10:19:07 +01:00
github-actions[bot]
9c92242ca0 Proxy release 2025-01-09 2025-01-09 06:02:06 +00:00
Conrad Ludgate
a354071dd0 Merge pull request #10180 from neondatabase/rc/release-proxy/2024-12-17
Proxy release 2024-12-17
2024-12-18 06:31:05 +00:00
github-actions[bot]
758680d4f8 Proxy release 2024-12-17 2024-12-17 22:06:42 +00:00
Conrad Ludgate
1738fd0a96 Merge pull request #10107 from neondatabase/rc/release-proxy/2024-12-12
Proxy release 2024-12-12
2024-12-12 10:21:30 +00:00
Conrad Ludgate
87b7edfc72 Merge branch 'release-proxy' into rc/release-proxy/2024-12-12 2024-12-12 09:58:31 +00:00
github-actions[bot]
def05700d5 Proxy release 2024-12-12 2024-12-12 06:02:08 +00:00
Ivan Efremov
b547681e08 Merge pull request #10024 from neondatabase/rc/release-proxy/2024-12-05
Proxy release 2024-12-05
2024-12-05 15:35:35 +02:00
Ivan Efremov
0fd211537b proxy: Present new auth backend cplane_proxy_v1 (#10012)
Implement a new auth backend based on the current Neon backend to switch
to the new Proxy V1 cplane API.

Implements [#21048](https://github.com/neondatabase/cloud/issues/21048)
2024-12-05 13:00:40 +02:00
Yuchen Liang
a83bd4e81c pageserver: fix buffered-writer on macos build (#10019)
## Problem

In https://github.com/neondatabase/neon/pull/9693, we forgot to check
macos build. The [CI
run](https://github.com/neondatabase/neon/actions/runs/12164541897/job/33926455468)
on main showed that macos build failed with unused variables and dead
code.

## Summary of changes

- add `allow(dead_code)` and `allow(unused_variables)` to the relevant
code that is not used on macos.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-12-05 13:00:40 +02:00
Conrad Ludgate
ecdad5e6d5 chore: update rust-postgres (#10002)
Like #9931 but without rebasing upstream just yet, to try and minimise
the differences.

Removes all proxy-specific commits from the rust-postgres fork, now that
proxy no longer depends on them. Merging upstream changes to come later.
2024-12-05 13:00:40 +02:00
Conrad Ludgate
d028929945 chore: update clap (#10009)
This updates clap to use a new version of anstream
2024-12-05 13:00:40 +02:00
Yuchen Liang
7b0e3db868 pageserver: make BufferedWriter do double-buffering (#9693)
Closes #9387.

## Problem

`BufferedWriter` cannot proceed while the owned buffer is flushing to
disk. We want to implement double buffering so that the flush can happen
in the background. See #9387.

## Summary of changes

- Maintain two owned buffers in `BufferedWriter`.
- The writer is in charge of copying the data into owned, aligned
buffer, once full, submit it to the flush task.
- The flush background task is in charge of flushing the owned buffer to
disk, and returned the buffer to the writer for reuse.
- The writer and the flush background task communicate through a
bi-directional channel.

For in-memory layer, we also need to be able to read from the buffered
writer in `get_values_reconstruct_data`. To handle this case, we did the
following
- Use replace `VirtualFile::write_all` with `VirtualFile::write_all_at`,
and use `Arc` to share it between writer and background task.
- leverage `IoBufferMut::freeze` to get a cheaply clonable `IoBuffer`,
one clone will be submitted to the channel, the other clone will be
saved within the writer to serve reads. When we want to reuse the
buffer, we can invoke `IoBuffer::into_mut`, which gives us back the
mutable aligned buffer.
- InMemoryLayer reads is now aware of the maybe_flushed part of the
buffer.

**Caveat**

- We removed the owned version of write, because this interface does not
work well with buffer alignment. The result is that without direct IO
enabled,
[`download_object`](a439d57050/pageserver/src/tenant/remote_timeline_client/download.rs (L243))
does one more memcpy than before this PR due to the switch to use
`_borrowed` version of the write.
- "Bypass aligned part of write" could be implemented later to avoid
large amount of memcpy.

**Testing**
- use an oneshot channel based control mechanism to make flush behavior
deterministic in test.
- test reading from `EphemeralFile` when the last submitted buffer is
not flushed, in-progress, and done flushing to disk.


## Performance


We see performance improvement for small values, and regression on big
values, likely due to being CPU bound + disk write latency.


[Results](https://www.notion.so/neondatabase/Benchmarking-New-BufferedWriter-11-20-2024-143f189e0047805ba99acda89f984d51?pvs=4)


## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
2024-12-05 13:00:40 +02:00
John Spray
088eb72dd7 tests: make storcon scale test AZ-aware (#9952)
## Problem

We have a scale test for the storage controller which also acts as a
good stress test for scheduling stability. However, it created nodes
with no AZs set.

## Summary of changes

- Bump node count to 6 and set AZs on them.

This is a precursor to other AZ-related PRs, to make sure any new code
that's landed is getting scale tested in an AZ-aware environment.
2024-12-05 13:00:40 +02:00
a-masterov
d550e3f626 Create a branch for compute release (#9637)
## Problem
We practice a manual release flow for the compute module. This will
allow automation of the compute release process.

## Summary of changes
The workflow was modified to make a compute release automatically on the
branch release-compute.
## Checklist before requesting a review

- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist
2024-12-05 13:00:40 +02:00
Erik Grinaker
8c6b41daf5 Display reqwest error source (#10004)
## Problem

Reqwest errors don't include details about the inner source error. This
means that we get opaque errors like:

```
receive body: error sending request for url (http://localhost:9898/v1/location_config)
```

Instead of the more helpful:

```
receive body: error sending request for url (http://localhost:9898/v1/location_config): operation timed out
```

Touches #9801.

## Summary of changes

Include the source error for `reqwest::Error` wherever it's displayed.
2024-12-05 13:00:40 +02:00
Alexey Kondratov
bbb050459b feat(compute): Set default application_name for pgbouncer connections (#9973)
## Problem

When client specifies `application_name`, pgbouncer propagates it to the
Postgres. Yet, if client doesn't do it, we have hard time figuring out
who opens a lot of Postgres connections (including the `cloud_admin`
ones).

See this investigation as an example:
https://neondb.slack.com/archives/C0836R0RZ0D

## Summary of changes

I haven't found this documented, but it looks like pgbouncer accepts
standard Postgres connstring parameters in the connstring in the
`[databases]` section, so put the default `application_name=pgbouncer`
there. That way, we will always see who opens Postgres connections. I
did tests, and if client specifies a `application_name`, pgbouncer
overrides this default, so it only works if it's not specified or set to
blank `&application_name=` in the connection string.

This is the last place we could potentially open some Postgres
connections without `application_name`. Everything else should be either
of two:
1. Direct client connections without `application_name`, but these
should be strictly non-`cloud_admin` ones
2. Some ad-hoc internal connections, so if we see spikes of unidentified
`cloud_admin` connections, we will need to investigate it again.

Fixes neondatabase/cloud#20948
2024-12-05 13:00:40 +02:00
Conrad Ludgate
cab498c787 feat(proxy): add option to forward startup params (#9979)
(stacked on #9990 and #9995)

Partially fixes #1287 with a custom option field to enable the fixed
behaviour. This allows us to gradually roll out the fix without silently
changing the observed behaviour for our customers.

related to https://github.com/neondatabase/cloud/issues/15284
2024-12-05 13:00:40 +02:00
Folke Behrens
6359342ffb Assign /libs/proxy/ to proxy team (#10003) 2024-12-05 13:00:40 +02:00
Erik Grinaker
13285c2a5e pageserver: return proper status code for heatmap_upload errors (#9991)
## Problem

During deploys, we see a lot of 500 errors due to heapmap uploads for
inactive tenants. These should be 503s instead.

Resolves #9574.

## Summary of changes

Make the secondary tenant scheduler use `ApiError` rather than
`anyhow::Error`, to propagate the tenant error and convert it to an
appropriate status code.
2024-12-05 13:00:40 +02:00
Peter Bendel
33790d14a3 fix parsing human time output like "50m37s" (#10001)
## Problem

In ingest_benchmark.yml workflow we use pgcopydb tool to migrate
project.
pgcopydb logs human time.

Our parsing of the human time doesn't work for times like "50m37s".

[Example
workflow](https://github.com/neondatabase/neon/actions/runs/12145539948/job/33867418065#step:10:479)

contains "57m45s"

but we
[reported](https://github.com/neondatabase/neon/actions/runs/12145539948/job/33867418065#step:10:500)
only the seconds part: 
45.000 s


## Summary of changes

add a regex pattern for Minute/Second combination
2024-12-05 13:00:40 +02:00
Peter Bendel
709b8cd371 optimize parms for ingest bench (#9999)
## Problem

we tried different parallelism settings for ingest bench 

## Summary of changes

the following settings seem optimal after merging
- SK side Wal filtering
- batched getpages

Settings:
- effective_io_concurrency 100
- concurrency limit 200 (different from Prod!)
- jobs 4, maintenance workers 7
- 10 GB chunk size
2024-12-05 13:00:40 +02:00
Vlad Lazar
1c9bbf1a92 storcon: return an error for drain attempts while paused (#9997)
## Problem

We currently allow drain operations to proceed while the node policy is
paused.

## Summary of changes

Return a precondition failed error in such cases. The orchestrator is
updated in https://github.com/neondatabase/infra/pull/2544 to skip drain
and fills if the pageserver is paused.

Closes: https://github.com/neondatabase/neon/issues/9907
2024-12-05 13:00:40 +02:00
Christian Schwarz
16163fb850 page_service: enable batching in Rust & Python Tests + Python benchmarks (#9993)
This is the first step towards batching rollout.

Refs

- rollout plan: https://github.com/neondatabase/cloud/issues/20620
- task https://github.com/neondatabase/neon/issues/9377
- uber-epic: https://github.com/neondatabase/neon/issues/9376
2024-12-05 13:00:40 +02:00
Alexander Bayandin
73ccc2b08c test_page_service_batching: fix non-numeric metrics (#9998)
## Problem

```
2024-12-03T15:42:46.5978335Z + poetry run python /__w/neon/neon/scripts/ingest_perf_test_result.py --ingest /__w/neon/neon/test_runner/perf-report-local
2024-12-03T15:42:49.5325077Z Traceback (most recent call last):
2024-12-03T15:42:49.5325603Z   File "/__w/neon/neon/scripts/ingest_perf_test_result.py", line 165, in <module>
2024-12-03T15:42:49.5326029Z     main()
2024-12-03T15:42:49.5326316Z   File "/__w/neon/neon/scripts/ingest_perf_test_result.py", line 155, in main
2024-12-03T15:42:49.5326739Z     ingested = ingest_perf_test_result(cur, item, recorded_at_timestamp)
2024-12-03T15:42:49.5327488Z                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-03T15:42:49.5327914Z   File "/__w/neon/neon/scripts/ingest_perf_test_result.py", line 99, in ingest_perf_test_result
2024-12-03T15:42:49.5328321Z     psycopg2.extras.execute_values(
2024-12-03T15:42:49.5328940Z   File "/github/home/.cache/pypoetry/virtualenvs/non-package-mode-_pxWMzVK-py3.11/lib/python3.11/site-packages/psycopg2/extras.py", line 1299, in execute_values
2024-12-03T15:42:49.5335618Z     cur.execute(b''.join(parts))
2024-12-03T15:42:49.5335967Z psycopg2.errors.InvalidTextRepresentation: invalid input syntax for type numeric: "concurrent-futures"
2024-12-03T15:42:49.5336287Z LINE 57:             'concurrent-futures',
2024-12-03T15:42:49.5336462Z                      ^
```

## Summary of changes
- `test_page_service_batching`: save non-numeric params as `labels`
- Add a runtime check that `metric_value` is NUMERIC
2024-12-05 13:00:40 +02:00
Christian Schwarz
c719be6474 tests & benchmarks: unify the way we customize the default tenant config (#9992)
Before this PR, some override callbacks used `.default()`, others
used `.setdefault()`.

As of this PR, all callbacks use `.setdefault()` which I think is least
prone to failure.

Aligning on a single way will set the right example for future tests
that need such customization.

The `test_pageserver_getpage_throttle.py` technically is a change in
behavior: before, it replaced the `tenant_config` field, now it just
configures the throttle. This is what I believe is intended anyway.
2024-12-05 13:00:40 +02:00
Arpad Müller
718645e56c Support tenant manifests in the scrubber (#9942)
Support tenant manifests in the storage scrubber:

* list the manifests, order them by generation
* delete all manifests except for the two most recent generations
* for the latest manifest: try parsing it.

I've tested this patch by running the against a staging bucket and it
successfully deleted stuff (and avoided deleting the latest two
generations).

In follow-up work, we might want to also check some invariants of the
manifest, as mentioned in #8088.

Part of #9386
Part of #8088

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2024-12-05 13:00:40 +02:00
Conrad Ludgate
fbc8c36983 chore(proxy): enforce single host+port (#9995)
proxy doesn't ever provide multiple hosts/ports, so this code adds a lot
of complexity of error handling for no good reason.

(stacked on #9990)
2024-12-05 13:00:40 +02:00
Alexey Immoreev
5519e42612 Improvement: add console redirect timeout warning (#9985)
## Problem

There is no information on session being cancelled in 2 minutes at the
moment

## Summary of changes

The timeout being logged for the user
2024-12-05 13:00:40 +02:00
Erik Grinaker
4157eaf4c5 pageserver: respond to multiple shutdown signals (#9982)
## Problem

The Pageserver signal handler would only respond to a single signal and
initiate shutdown. Subsequent signals were ignored. This meant that a
`SIGQUIT` sent after a `SIGTERM` had no effect (e.g. in the case of a
slow or stalled shutdown). The `test_runner` uses this to force shutdown
if graceful shutdown is slow.

Touches #9740.

## Summary of changes

Keep responding to signals after the initial shutdown signal has been
received.

Arguably, the `test_runner` should also use `SIGKILL` rather than
`SIGQUIT` in this case, but it seems reasonable to respond to `SIGQUIT`
regardless.
2024-12-05 13:00:40 +02:00
Conrad Ludgate
60241127e2 chore(proxy): remove postgres config parser and md5 support (#9990)
Keeping the `mock` postgres cplane adaptor using "stock" tokio-postgres
allows us to remove a lot of dead weight from our actual postgres
connection logic.
2024-12-05 13:00:40 +02:00
John Spray
f7d5322e8b pageserver: more detailed logs when calling re-attach (#9996)
## Problem

We saw a peculiar case where a pageserver apparently got a 0-tenant
response to `/re-attach` but we couldn't see the request landing on a
storage controller. It was hard to confirm retrospectively that the
pageserver was configured properly at the moment it sent the request.

## Summary of changes

- Log the URL to which we are sending the request
- Log the NodeId and metadata that we sent
2024-12-05 13:00:40 +02:00
John Spray
41bb9c5280 pageserver: only store SLRUs & aux files on shard zero (#9786)
## Problem

Since https://github.com/neondatabase/neon/pull/9423 the non-zero shards
no longer need SLRU content in order to do GC. This data is now
redundant on shards >0.

One release cycle after merging that PR, we may merge this one, which
also stops writing those pages to shards > 0, reaping the efficiency
benefit.

Closes: https://github.com/neondatabase/neon/issues/7512
Closes: https://github.com/neondatabase/neon/issues/9641

## Summary of changes

- Avoid storing SLRUs on non-zero shards
- Bonus: avoid storing aux files on non-zero shards
2024-12-05 13:00:40 +02:00
John Spray
69c0d61c5c storcon: in shard splits, inherit parent's AZ (#9946)
## Problem

Sharded tenants should be run in a single AZ for best performance, so
that computes have AZ-local latency to all the shards.

Part of https://github.com/neondatabase/neon/issues/8264

## Summary of changes

- When we split a tenant, instead of updating each shard's preferred AZ
to wherever it is scheduled, propagate the preferred AZ from the parent.
- Drop the check in `test_shard_preferred_azs` that asserts shards end
up in their preferred AZ: this will not be true again until the
optimize_attachment logic is updated to make this so. The existing check
wasn't testing anything about scheduling, it was just asserting that we
set preferred AZ in a way that matches the way things happen to be
scheduled at time of split.
2024-12-05 13:00:40 +02:00
Christian Schwarz
63cb8ce975 pageserver: only throttle pagestream requests & bring back throttling deduction for smgr latency metrics (#9962)
## Problem

In the batching PR 
- https://github.com/neondatabase/neon/pull/9870

I stopped deducting the time-spent-in-throttle fro latency metrics,
i.e.,
- smgr latency metrics (`SmgrOpTimer`)
- basebackup latency (+scan latency, which I think is part of
basebackup).

The reason for stopping the deduction was that with the introduction of
batching, the trick with tracking time-spent-in-throttle inside
RequestContext and swap-replacing it from the `impl Drop for
SmgrOpTimer` no longer worked with >1 requests in a batch.

However, deducting time-spent-in-throttle is desirable because our
internal latency SLO definition does not account for throttling.

## Summary of changes

- Redefine throttling to be a page_service pagestream request throttle
instead of a throttle for repository `Key` reads through `Timeline::get`
/ `Timeline::get_vectored`.
- This means reads done by `basebackup` are no longer subject to any
throttle.
- The throttle applies after batching, before handling of the request.
- Drive-by fix: make throttle sensitive to cancellation.
- Rename metric label `kind` from `timeline_get` to `pagestream` to
reflect the new scope of throttling.

To avoid config format breakage, we leave the config field named
`timeline_get_throttle` and ignore the `task_kinds` field.
This will be cleaned up in a future PR.

## Trade-Offs

Ideally, we would apply the throttle before reading a request off the
connection, so that we queue the minimal amount of work inside the
process.
However, that's not possible because we need to do shard routing.

The redefinition of the throttle to limit pagestream request rate
instead of repository `Key` rate comes with several downsides:
- We're no longer able to use the throttle mechanism for other other
tasks, e.g. image layer creation.
  However, in practice, we never used that capability anyways.
- We no longer throttle basebackup.
2024-12-05 13:00:40 +02:00
Erik Grinaker
907e4aa3c4 test_runner: use immediate shutdown in test_sharded_ingest (#9984)
## Problem

`test_sharded_ingest` ingests a lot of data, which can cause shutdown to
be slow e.g. due to local "S3 uploads" or compactions. This can cause
test flakes during teardown.

Resolves #9740.

## Summary of changes

Perform an immediate shutdown of the cluster.
2024-12-05 13:00:40 +02:00
Erik Grinaker
0a2a84b766 safekeeper,pageserver: add heap profiling (#9778)
## Problem

We don't have good observability for memory usage. This would be useful
e.g. to debug OOM incidents or optimize performance or resource usage.

We would also like to use continuous profiling with e.g. [Grafana Cloud
Profiles](https://grafana.com/products/cloud/profiles-for-continuous-profiling/)
(see https://github.com/neondatabase/cloud/issues/14888).

This PR is intended as a proof of concept, to try it out in staging and
drive further discussions about profiling more broadly.

Touches https://github.com/neondatabase/neon/issues/9534.
Touches https://github.com/neondatabase/cloud/issues/14888.
Depends on #9779.
Depends on #9780.

## Summary of changes

Adds a HTTP route `/profile/heap` that takes a heap profile and returns
it. Query parameters:

* `format`: output format (`jemalloc` or `pprof`; default `pprof`).

Unlike CPU profiles (see #9764), heap profiles are not symbolized and
require the original binary to translate addresses to function names. To
make this work with Grafana, we'll probably have to symbolize the
process server-side -- this is left as future work, as is other output
formats like SVG.

Heap profiles don't work on macOS due to limitations in jemalloc.
2024-12-05 13:00:40 +02:00
a-masterov
85b12ddd52 Add support for the extensions test for Postgres v17 (#9748)
## Problem
The extensions for Postgres v17 are ready but we do not test the
extensions shipped with v17
## Summary of changes
Build the test image based on Postgres v17. Run the tests for v17.

---------

Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
2024-12-05 13:00:40 +02:00
Christian Schwarz
dd76f1eeee page_service: batching observability & include throttled time in smgr metrics (#9870)
This PR 

- fixes smgr metrics https://github.com/neondatabase/neon/issues/9925 
- adds an additional startup log line logging the current batching
config
- adds a histogram of batch sizes global and per-tenant
- adds a metric exposing the current batching config

The issue described #9925 is that before this PR, request latency was
only observed *after* batching.
This means that smgr latency metrics (most importantly getpage latency)
don't account for
- `wait_lsn` time 
- time spent waiting for batch to fill up / the executor stage to pick
up the batch.

The fix is to use a per-request batching timer, like we did before the
initial batching PR.
We funnel those timers through the entire request lifecycle.

I noticed that even before the initial batching changes, we weren't
accounting for the time spent writing & flushing the response to the
wire.
This PR drive-by fixes that deficiency by dropping the timers at the
very end of processing the batch, i.e., after the `pgb.flush()` call.

I was **unable to maintain the behavior that we deduct
time-spent-in-throttle from various latency metrics.
The reason is that we're using a *single* counter in `RequestContext` to
track micros spent in throttle.
But there are *N* metrics timers in the batch, one per request.
As a consequence, the practice of consuming the counter in the drop
handler of each timer no longer works because all but the first timer
will encounter error `close() called on closed state`.
A failed attempt to maintain the current behavior can be found in
https://github.com/neondatabase/neon/pull/9951.

So, this PR remvoes the deduction behavior from all metrics.
I started a discussion on Slack about it the implications this has for
our internal SLO calculation:
https://neondb.slack.com/archives/C033RQ5SPDH/p1732910861704029

# Refs

- fixes https://github.com/neondatabase/neon/issues/9925
- sub-issue https://github.com/neondatabase/neon/issues/9377
- epic: https://github.com/neondatabase/neon/issues/9376
2024-12-05 13:00:40 +02:00
Christian Schwarz
8963ac85f9 storcon_cli tenant-describe: include tenant-wide information in output (#9899)
Before this PR, the storcon_cli didn't have a way to show the
tenant-wide information of the TenantDescribeResponse.

Sadly, the `Serialize` impl for the tenant config doesn't skip on
`None`, so, the output becomes a bit bloated.
Maybe we can use `skip_serializing_if(Option::is_none)` in the future.
=> https://github.com/neondatabase/neon/issues/9983
2024-12-05 13:00:40 +02:00
John Spray
4a488b3e24 storcon: use proper schedule context during node delete (#9958)
## Problem

I was touching `test_storage_controller_node_deletion` because for AZ
scheduling work I was adding a change to the storage controller (kick
secondaries during optimisation) that made a FIXME in this test defunct.
While looking at it I also realized that we can easily fix the way node
deletion currently doesn't use a proper ScheduleContext, using the
iterator type recently added for that purpose.

## Summary of changes

- A testing-only behavior in storage controller where if a secondary
location isn't yet ready during optimisation, it will be actively
polled.
- Remove workaround in `test_storage_controller_node_deletion` that
previously was needed because optimisation would get stuck on cold
secondaries.
- Update node deletion code to use a `TenantShardContextIterator` and
thereby a proper ScheduleContext
2024-12-05 13:00:40 +02:00
Alexey Kondratov
c4987b0b13 fix(testing): Use 1 MB shared_buffers even with LFC (#9969)
## Problem

After enabling LFC in tests and lowering `shared_buffers` we started
having more problems with `test_pg_regress`.

## Summary of changes

Set `shared_buffers` to 1MB to both exercise getPage requests/LFC, and
still have enough room for Postgres to operate. Everything smaller might
be not enough for Postgres under load, and can cause errors like 'no
unpinned buffers available'.

See Konstantin's comment [1] as well.

Fixes #9956

[1]:
https://github.com/neondatabase/neon/issues/9956#issuecomment-2511608097
2024-12-05 13:00:40 +02:00
Tristan Partin
84b4821118 Stop changing the value of neon.extension_server_port at runtime (#9972)
On reconfigure, we no longer passed a port for the extension server
which caused us to not write out the neon.extension_server_port line.
Thus, Postgres thought we were setting the port to the default value of
0. PGC_POSTMASTER GUCs cannot be set at runtime, which causes the
following log messages:

> LOG: parameter "neon.extension_server_port" cannot be changed without
restarting the server
> LOG: configuration file
"/var/db/postgres/compute/pgdata/postgresql.conf" contains errors;
unaffected changes were applied

Fixes: https://github.com/neondatabase/neon/issues/9945

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-12-05 13:00:40 +02:00
Conrad Ludgate
32ba9811f9 feat(proxy): emit JWT auth method and JWT issuer in parquet logs (#9971)
Fix the HTTP AuthMethod to accomodate the JWT authorization method.
Introduces the JWT issuer as an additional field in the parquet logs
2024-12-05 13:00:40 +02:00
Folke Behrens
a0cd64c4d3 Bump OTel, tracing, reqwest crates (#9970) 2024-12-05 13:00:40 +02:00
Arseny Sher
84687b743d Update consensus protocol spec (#9607)
The spec was written for the buggy protocol which we had before the one
more similar to Raft was implemented. Update the spec with what we
currently have.

ref https://github.com/neondatabase/neon/issues/8699
2024-12-05 13:00:40 +02:00
Folke Behrens
b6f93dcec9 proxy: Create Elasticache credentials provider lazily (#9967)
## Problem

The credentials providers tries to connect to AWS STS even when we use
plain Redis connections.

## Summary of changes

* Construct the CredentialsProvider only when needed ("irsa").
2024-12-05 13:00:40 +02:00
Alexander Bayandin
4f6c594973 CI(replication-tests): fix notifications about replication-tests failures (#9950)
## Problem

`if: ${{ github.event.schedule }}` gets skipped if a previous step has
failed, but we want to run the step for both `success` and `failure`

## Summary of changes
- Add `!cancelled()` to notification step if-condition, to skip only
cancelled jobs
2024-12-05 13:00:40 +02:00
Conrad Ludgate
a750c14735 fix(proxy): forward notifications from authentication (#9948)
Fixes https://github.com/neondatabase/cloud/issues/20973. 

This refactors `connect_raw` in order to return direct access to the
delayed notices.

I cannot find a way to test this with psycopg2 unfortunately, although
testing it with psql does return the expected results.
2024-12-05 13:00:40 +02:00
John Spray
9ce0dd4e55 storcon: add metric for AZ scheduling violations (#9949)
## Problem

We can't easily tell how far the state of shards is from their AZ
preferences. This can be a cause of performance issues, so it's
important for diagnosability that we can tell easily if there are
significant numbers of shards that aren't running in their preferred AZ.

Related: https://github.com/neondatabase/cloud/issues/15413

## Summary of changes

- In reconcile_all, count shards that are scheduled into the wrong AZ
(if they have a preference), and publish it as a prometheus gauge.
- Also calculate a statistic for how many shards wanted to reconcile but
couldn't.

This is clearly a lazy calculation: reconcile all only runs
periodically. But that's okay: shards in the wrong AZ is something that
only matters if it stays that way for some period of time.
2024-12-05 13:00:40 +02:00
Erik Grinaker
0e1a336607 test_runner: improve wait_until (#9936)
Improves `wait_until` by:

* Use `timeout` instead of `iterations`. This allows changing the
timeout/interval parameters independently.
* Make `timeout` and `interval` optional (default 20s and 0.5s). Most
callers don't care.
* Only output status every 1s by default, and add optional
`status_interval` parameter.
* Remove `show_intermediate_error`, this was always emitted anyway.

Most callers have been updated to use the defaults, except where they
had good reason otherwise.
2024-12-05 13:00:40 +02:00
Anastasia Lubennikova
7fc2912d06 Update pgvector to 0.8.0 (#9733) 2024-12-05 13:00:40 +02:00
John Spray
fdf231c237 storcon: don't take any Service locks in /status and /ready (#9944)
## Problem

We saw unexpected container terminations when running in k8s with with
small CPU resource requests.

The /status and /ready handlers called `maybe_forward`, which always
takes the lock on Service::inner.

If there is a lot of writer lock contention, and the container is
starved of CPU, this increases the likelihood that we will get killed by
the kubelet.

It isn't certain that this was a cause of issues, but it is a potential
source that we can eliminate.

## Summary of changes

- Revise logic to return immediately if the URL is in the non-forwarded
list, rather than calling maybe_forward
2024-12-05 13:00:40 +02:00
Konstantin Knizhnik
1e08b5dccc Fix issues with prefetch ring buffer resize (#9847)
## Problem

See https://neondb.slack.com/archives/C04DGM6SMTM/p1732110190129479


We observe the following error in the logs 
```
[XX000] ERROR: [NEON_SMGR] [shard 3] Incorrect prefetch read: status=1 response=0x7fafef335138 my=128 receive=128
```
most likely caused by changing `neon.readahead_buffer_size`

## Summary of changes

1. Copy shard state
2. Do not use prefetch_set_unused in readahead_buffer_resize
3. Change prefetch buffer overflow criteria

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-12-05 13:00:40 +02:00
Alexander Bayandin
030810ed3e Compute image: prepare Postgres v14-v16 for Debian 12 (#9954)
## Problem

Current compute images for Postgres 14-16 don't build on Debian 12
because of issues with extensions.
This PR fixes that, but for the current setup, it is mostly a no-op
change.

## Summary of changes
- Use `/bin/bash -euo pipefail` as SHELL to fail earlier
- Fix `plv8` build: backport a trivial patch for v8
- Fix `postgis` build: depend `sfgal` version on Debian version instead
of Postgres version


Tested in: https://github.com/neondatabase/neon/pull/9849
2024-12-05 13:00:40 +02:00
Konstantin Knizhnik
62b74bdc2c Add GUC controlling whether to pause recovery if some critical GUCs at replica have smaller value than on primary (#9057)
## Problem

See https://github.com/neondatabase/neon/issues/9023

## Summary of changes

Ass GUC `recovery_pause_on_misconfig` allowing not to pause in case of
replica and primary configuration mismatch

See https://github.com/neondatabase/postgres/pull/501
See https://github.com/neondatabase/postgres/pull/502
See https://github.com/neondatabase/postgres/pull/503
See https://github.com/neondatabase/postgres/pull/504


## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-12-05 13:00:40 +02:00
Folke Behrens
8b7e9ed820 Merge the consumption metric pushes (#9939)
#8564

## Problem

The main and backup consumption metric pushes are completely
independent,
resulting in different event time windows and different idempotency
keys.

## Summary of changes

* Merge the push tasks, but keep chunks the same size.
2024-12-05 13:00:40 +02:00
Christian Schwarz
5dad89acd4 page_service: rewrite batching to work without a timeout (#9851)
# Problem

The timeout-based batching adds latency to unbatchable workloads.

We can choose a short batching timeout (e.g. 10us) but that requires
high-resolution timers, which tokio doesn't have.
I thoroughly explored options to use OS timers (see
[this](https://github.com/neondatabase/neon/pull/9822) abandoned PR).
In short, it's not an attractive option because any timer implementation
adds non-trivial overheads.

# Solution

The insight is that, in the steady state of a batchable workload, the
time we spend in `get_vectored` will be hundreds of microseconds anyway.

If we prepare the next batch concurrently to `get_vectored`, we will
have a sizeable batch ready once `get_vectored` of the current batch is
done and do not need an explicit timeout.

This can be reasonably described as **pipelining of the protocol
handler**.

# Implementation

We model the sub-protocol handler for pagestream requests
(`handle_pagrequests`) as two futures that form a pipeline:

2. Batching: read requests from the connection and fill the current
batch
3. Execution: `take` the current batch, execute it using `get_vectored`,
and send the response.

The Reading and Batching stage are connected through a new type of
channel called `spsc_fold`.

See the long comment in the `handle_pagerequests_pipelined` for details.

# Changes

- Refactor `handle_pagerequests`
    - separate functions for
- reading one protocol message; produces a `BatchedFeMessage` with just
one page request in it
- batching; tried to merge an incoming `BatchedFeMessage` into an
existing `BatchedFeMessage`; returns `None` on success and returns back
the incoming message in case merging isn't possible
        - execution of a batched message
- unify the timeline handle acquisition & request span construction; it
now happen in the function that reads the protocol message
- Implement serial and pipelined model
    - serial: what we had before any of the batching changes
      - read one protocol message
      - execute protocol messages
    - pipelined: the design described above
- optionality for execution of the pipeline: either via concurrent
futures vs tokio tasks
- Pageserver config
  - remove batching timeout field
  - add ability to configure pipelining mode
- add ability to limit max batch size for pipelined configurations
(required for the rollout, cf
https://github.com/neondatabase/cloud/issues/20620 )
  - ability to configure execution mode
- Tests
  - remove `batch_timeout` parametrization
  - rename `test_getpage_merge_smoke` to `test_throughput`
- add parametrization to test different max batch sizes and execution
moes
  - rename `test_timer_precision` to `test_latency`
  - rename the test case file to `test_page_service_batching.py`
  - better descriptions of what the tests actually do

## On the holding The `TimelineHandle` in the pending batch

While batching, we hold the `TimelineHandle` in the pending batch.
Therefore, the timeline will not finish shutting down while we're
batching.

This is not a problem in practice because the concurrently ongoing
`get_vectored` call will fail quickly with an error indicating that the
timeline is shutting down.
This results in the Execution stage returning a `QueryError::Shutdown`,
which causes the pipeline / entire page service connection to shut down.
This drops all references to the
`Arc<Mutex<Option<Box<BatchedFeMessage>>>>` object, thereby dropping the
contained `TimelineHandle`s.

- => fixes https://github.com/neondatabase/neon/issues/9850

# Performance

Local run of the benchmarks, results in [this empty
commit](1cf5b1463f)
in the PR branch.

Key take-aways:
* `concurrent-futures` and `tasks` deliver identical `batching_factor`
* tail latency impact unknown, cf
https://github.com/neondatabase/neon/issues/9837
* `concurrent-futures` has higher throughput than `tasks` in all
workloads (=lower `time` metric)
* In unbatchable workloads, `concurrent-futures` has 5% higher
`CPU-per-throughput` than that of `tasks`, and 15% higher than that of
`serial`.
* In batchable-32 workload, `concurrent-futures` has 8% lower
`CPU-per-throughput` than that of `tasks` (comparison to tput of
`serial` is irrelevant)
* in unbatchable workloads, mean and tail latencies of
`concurrent-futures` is practically identical to `serial`, whereas
`tasks` adds 20-30us of overhead

Overall, `concurrent-futures` seems like a slightly more attractive
choice.

# Rollout

This change is disabled-by-default.

Rollout plan:
- https://github.com/neondatabase/cloud/issues/20620

# Refs

- epic: https://github.com/neondatabase/neon/issues/9376
- this sub-task: https://github.com/neondatabase/neon/issues/9377
- the abandoned attempt to improve batching timeout resolution:
https://github.com/neondatabase/neon/pull/9820
- closes https://github.com/neondatabase/neon/issues/9850
- fixes https://github.com/neondatabase/neon/issues/9835
2024-12-05 13:00:40 +02:00
Matthias van de Meent
547b2d2827 Fix timeout value used in XLogWaitForReplayOf (#9937)
The previous value assumed usec precision, while the timeout used is in
milliseconds, causing replica backends to wait for (potentially) many
hours for WAL replay without the expected progress reports in logs.

This fixes the issue.

Reported-By: Alexander Lakhin <exclusion@gmail.com>

## Problem


https://github.com/neondatabase/postgres/pull/279#issuecomment-2507671817

The timeout value was configured with the assumption the indicated value
would be microseconds, where it's actually milliseconds. That causes the
backend to wait for much longer (2h46m40s) before it emits the "I'm
waiting for recovery" message. While we do have wait events configured
on this, it's not great to have stuck backends without clear logs, so
this fixes the timeout value in all our PostgreSQL branches.

## PG PRs

* PG14: https://github.com/neondatabase/postgres/pull/542
* PG15: https://github.com/neondatabase/postgres/pull/543
* PG16: https://github.com/neondatabase/postgres/pull/544
* PG17: https://github.com/neondatabase/postgres/pull/545
2024-12-05 13:00:40 +02:00
Gleb Novikov
93f29a0065 Fixed fast_import pgbin in calling get_pg_version (#9933)
Was working on https://github.com/neondatabase/cloud/pull/20795 and
discovered that fast_import is not working normally.
2024-12-05 13:00:40 +02:00
John Spray
4f36494615 pageserver: download small objects using a smaller timeout (#9938)
## Problem

It appears that the Azure storage API tends to hang TCP connections more
than S3 does.

Currently we use a 2 minute timeout for all downloads. This is large
because sometimes the objects we download are large. However, waiting 2
minutes when doing something like downloading a manifest on tenant
attach is problematic, because when someone is doing a "create tenant,
create timeline" workflow, that 2 minutes is long enough for them
reasonably to give up creating that timeline.

Rather than propagate oversized timeouts further up the stack, we should
use a different timeout for objects that we expect to be small.

Closes: https://github.com/neondatabase/neon/issues/9836

## Summary of changes

- Add a `small_timeout` configuration attribute to remote storage,
defaulting to 30 seconds (still a very generous period to do something
like download an index)
- Add a DownloadKind parameter to DownloadOpts, so that callers can
indicate whether they expect the object to be small or large.
- In the azure client, use small timeout for HEAD requests, and for GET
requests if DownloadKind::Small is used.
- Use DownloadKind::Small for manifests, indices, and heatmap downloads.

This PR intentionally does not make the equivalent change to the S3
client, to reduce blast radius in case this has unexpected consequences
(we could accomplish the same thing by editing lots of configs, but just
skipping the code is simpler for right now)
2024-12-05 13:00:40 +02:00
Alexey Kondratov
0a550f3e7d feat(compute_ctl): Always set application_name (#9934)
## Problem

It was not always possible to judge what exactly some `cloud_admin`
connections were doing because we didn't consistently set
`application_name` everywhere.

## Summary of changes

Unify the way we connect to Postgres:
1. Switch to building configs everywhere
2. Always set `application_name` and make naming consistent

Follow-up for #9919
Part of neondatabase/cloud#20948
2024-12-05 13:00:40 +02:00
Erik Grinaker
4bb9554e4a safekeeper: use jemalloc (#9780)
## Problem

To add Safekeeper heap profiling in #9778, we need to switch to an
allocator that supports it. Pageserver and proxy already use jemalloc.

Touches #9534.

## Summary of changes

Use jemalloc in Safekeeper.
2024-12-05 13:00:40 +02:00
John Spray
008616cfe6 storage controller: use proper ScheduleContext when evacuating a node (#9908)
## Problem

When picking locations for a shard, we should use a ScheduleContext that
includes all the other shards in the tenant, so that we apply proper
anti-affinity between shards. If we don't do this, then it can lead to
unstable scheduling, where we place a shard somewhere that the optimizer
will then immediately move it away from.

We didn't always do this, because it was a bit awkward to accumulate the
context for a tenant rather than just walking tenants.

This was a TODO in `handle_node_availability_transition`:
```
                        // TODO: populate a ScheduleContext including all shards in the same tenant_id (only matters
                        // for tenants without secondary locations: if they have a secondary location, then this
                        // schedule() call is just promoting an existing secondary)
```

This is a precursor to https://github.com/neondatabase/neon/issues/8264,
where the current imperfect scheduling during node evacuation hampers
testing.

## Summary of changes

- Add an iterator type that yields each shard along with a
schedulecontext that includes all the other shards from the same tenant
- Use the iterator to replace hand-crafted logic in optimize_all_plan
(functionally identical)
- Use the iterator in `handle_node_availability_transition` to apply
proper anti-affinity during node evacuation.
2024-12-05 13:00:40 +02:00
Conrad Ludgate
e61ec94fbc chore(proxy): vendor a subset of rust-postgres (#9930)
Our rust-postgres fork is getting messy. Mostly because proxy wants more
control over the raw protocol than tokio-postgres provides. As such,
it's diverging more and more. Storage and compute also make use of
rust-postgres, but in more normal usage, thus they don't need our crazy
changes.

Idea: 
* proxy maintains their subset
* other teams use a minimal patch set against upstream rust-postgres

Reviewing this code will be difficult. To implement it, I
1. Copied tokio-postgres, postgres-protocol and postgres-types from
00940fcdb5
2. Updated their package names with the `2` suffix to make them compile
in the workspace.
3. Updated proxy to use those packages
4. Copied in the code from tokio-postgres-rustls 0.13 (with some patches
applied https://github.com/jbg/tokio-postgres-rustls/pull/32
https://github.com/jbg/tokio-postgres-rustls/pull/33)
5. Removed as much dead code as I could find in the vendored libraries
6. Updated the tokio-postgres-rustls code to use our existing channel
binding implementation
2024-12-05 13:00:40 +02:00
Erik Grinaker
e5152551ad test_runner/performance: add logical message ingest benchmark (#9749)
Adds a benchmark for logical message WAL ingestion throughput
end-to-end. Logical messages are essentially noops, and thus ignored by
the Pageserver.

Example results from my MacBook, with fsync enabled:

```
postgres_ingest: 14.445 s
safekeeper_ingest: 29.948 s
pageserver_ingest: 30.013 s
pageserver_recover_ingest: 8.633 s
wal_written: 10,340 MB
message_count: 1310720 messages
postgres_throughput: 715 MB/s
safekeeper_throughput: 345 MB/s
pageserver_throughput: 344 MB/s
pageserver_recover_throughput: 1197 MB/s
```

See
https://github.com/neondatabase/neon/issues/9642#issuecomment-2475995205
for running analysis.

Touches #9642.
2024-12-05 13:00:40 +02:00
Alexey Kondratov
b0822a5499 fix(compute_ctl): Allow usage of DB names with whitespaces (#9919)
## Problem

We used `set_path()` to replace the database name in the connection
string. It automatically does url-safe encoding if the path is not
already encoded, but it does it as per the URL standard, which assumes
that tabs can be safely removed from the path without changing the
meaning of the URL. See, e.g.,
https://url.spec.whatwg.org/#concept-basic-url-parser. It also breaks
for DBs with properly %-encoded names, like with `%20`, as they are kept
intact, but actually should be escaped.

Yet, this is not true for Postgres, where it's completely valid to have
trailing tabs in the database name.

I think this is the PR that caused this regression
https://github.com/neondatabase/neon/pull/9717, as it switched from
`postgres::config::Config` back to `set_path()`.

This was fixed a while ago already [1], btw, I just haven't added a test
to catch this regression back then :(

## Summary of changes

This commit changes the code back to use
`postgres/tokio_postgres::Config` everywhere.

While on it, also do some changes around, as I had to touch this code:
1. Bump some logging from `debug` to `info` in the spec apply path. We
do not use `debug` in prod, and it was tricky to understand what was
going on with this bug in prod.
2. Refactor configuration concurrency calculation code so it was
reusable. Yet, still keep `1` in the case of reconfiguration. The
database can be actively used at this moment, so we cannot guarantee
that there will be enough spare connection slots, and the underlying
code won't handle connection errors properly.
3. Simplify the installed extensions code. It was spawning a blocking
task inside async function, which doesn't make much sense. Instead, just
have a main sync function and call it with `spawn_blocking` in the API
code -- the only place we need it to be async.
4. Add regression python test to cover this and related problems in the
future. Also, add more extensive testing of schema dump and DBs and
roles listing API.

[1]:
4d1e48f3b9
[2]:
https://www.postgresql.org/message-id/flat/20151023003445.931.91267%40wrigleys.postgresql.org

Resolves neondatabase/cloud#20869
2024-12-05 13:00:40 +02:00
Alexander Bayandin
1fb6ab59e8 test_runner: rerun all failed tests (#9917)
## Problem

Currently, we rerun only known flaky tests. This approach was chosen to
reduce the number of tests that go unnoticed (by forcing people to take
a look at failed tests and rerun the job manually), but it has some
drawbacks:
- In PRs, people tend to push new changes without checking failed tests
(that's ok)
- In the main, tests are just restarted without checking
(understandable)
- Parametrised tests become flaky one by one, i.e. if `test[1]` is flaky
`, test[2]` is not marked as flaky automatically (which may or may not
be the case).

I suggest rerunning all failed tests to increase the stability of GitHub
jobs and using the Grafana Dashboard with flaky tests for deeper
analysis.

## Summary of changes
- Rerun all failed tests twice at max
2024-12-05 13:00:40 +02:00
Vlad Lazar
e16439400d pageserver: return correct LSN for interpreted proto keep alive responses (#9928)
## Problem

For the interpreted proto the pageserver is not returning the correct
LSN
in replies to keep alive requests. This is because the interpreted
protocol arm
was not updating `last_rec_lsn`.

## Summary of changes

* Return correct LSN in keep-alive responses
* Fix shard field in wal sender traces
2024-12-05 13:00:40 +02:00
Arpad Müller
e401f66698 Update rust to 1.83.0, also update cargo adjacent tools (#9926)
We keep the practice of keeping the compiler up to date, pointing to the
latest release. This is done by many other projects in the Rust
ecosystem as well.

[Release notes](https://releases.rs/docs/1.83.0/).

Also update `cargo-hakari`, `cargo-deny`, `cargo-hack` and
`cargo-nextest` to their latest versions.

Prior update was in #9445.
2024-12-05 13:00:40 +02:00
Erik Grinaker
2fa461b668 Makefile: build pg_visibility (#9922)
Build the `pg_visibility` extension for use with `neon_local`. This is
useful to inspect the visibility map for debugging.

Touches #9914.
2024-12-05 13:00:40 +02:00
Vlad Lazar
03d90bc0b3 remote_storage/abs: count 404 and 304 for get as ok for metrics (#9912)
## Problem

We currently see elevated levels of errors for GetBlob requests. This is
because 404 and 304 are counted as errors for metric reporting.

## Summary of Changes

Bring the implementation in line with the S3 client and treat 404 and
304 responses as ok for metric purposes.

Related: https://github.com/neondatabase/cloud/issues/20666
2024-12-05 13:00:40 +02:00
Ivan Efremov
268bc890ea proxy: spawn cancellation checks in the background (#9918)
## Problem
For cancellation, a connection is open during all the cancel checks.
## Summary of changes
Spawn cancellation checks in the background, and close connection
immediately.
Use task_tracker for cancellation checks.
2024-12-05 13:00:40 +02:00
Folke Behrens
8a6ee79f6f Merge pull request #9921 from neondatabase/rc/release-proxy/2024-11-28
Proxy release 2024-11-28
2024-11-28 11:09:06 +01:00
github-actions[bot]
9052c32b46 Proxy release 2024-11-28 2024-11-28 06:02:15 +00:00
Ivan Efremov
995e729ebe Merge pull request #9832 from neondatabase/rc/release-proxy/2024-11-21
Proxy release 2024-11-21
2024-11-21 09:41:31 +02:00
github-actions[bot]
76077e1ddf Proxy release 2024-11-21 2024-11-21 06:02:11 +00:00
Ivan Efremov
0467d88f06 Merge pull request #9756 from neondatabase/rc/proxy/2024-11-14
Proxy release 2024-11-14
2024-11-14 09:46:52 +02:00
Ivan Efremov
f5eec194e7 Merge pull request #9674 from neondatabase/rc/proxy/2024-11-07
Proxy release 2024-11-07
2024-11-07 12:07:12 +02:00
Conrad Ludgate
7e00be391d Merge pull request #9558 from neondatabase/rc/proxy/2024-10-29
Auth broker release 2024-10-29
2024-10-29 12:10:50 +00:00
Folke Behrens
d56599df2a Merge pull request #9499 from neondatabase/rc/proxy/2024-10-24
Proxy release 2024-10-24
2024-10-24 10:34:56 +02:00
Folke Behrens
9d9aab3680 Merge pull request #9426 from neondatabase/rc/proxy/2024-10-17
Proxy release 2024-10-17
2024-10-17 12:18:51 +02:00
Folke Behrens
a202b1b5cc Merge pull request #9341 from neondatabase/rc/proxy/2024-10-10
Proxy release 2024-10-10
2024-10-10 09:17:11 +02:00
Folke Behrens
90f731f3b1 Merge pull request #9256 from neondatabase/rc/proxy/2024-10-03
Proxy release 2024-10-03
2024-10-03 11:01:41 +02:00
Conrad Ludgate
7736b748d3 Merge pull request #9159 from neondatabase/rc/proxy/2024-09-26
Proxy release 2024-09-26
2024-09-26 09:22:33 +01:00
Conrad Ludgate
9c23333cb3 Merge pull request #9056 from neondatabase/rc/proxy/2024-09-19
Proxy release 2024-09-19
2024-09-19 10:41:17 +01:00
Conrad Ludgate
66a99009ba Merge pull request #8799 from neondatabase/rc/proxy/2024-08-22
Proxy release 2024-08-22
2024-08-22 10:04:56 +01:00
Conrad Ludgate
5d4c57491f Merge pull request #8723 from neondatabase/rc/proxy/2024-08-14
Proxy release 2024-08-14
2024-08-14 13:05:51 +01:00
Conrad Ludgate
73935ea3a2 Merge pull request #8647 from neondatabase/rc/proxy/2024-08-08
Proxy release 2024-08-08
2024-08-08 15:37:09 +01:00
Conrad Ludgate
32e595d4dd Merge branch 'release-proxy' into rc/proxy/2024-08-08 2024-08-08 13:53:33 +01:00
Conrad Ludgate
b0d69acb07 Merge pull request #8505 from neondatabase/rc/proxy/2024-07-25
Proxy release 2024-07-25
2024-07-25 11:07:19 +01:00
Conrad Ludgate
98355a419a Merge pull request #8351 from neondatabase/rc/proxy/2024-07-11
Proxy release 2024-07-11
2024-07-11 10:40:17 +01:00
Conrad Ludgate
cfb03d6cf0 Merge pull request #8178 from neondatabase/rc/proxy/2024-06-27
Proxy release 2024-06-27
2024-06-27 11:35:30 +01:00
Conrad Ludgate
d81ef3f962 Revert "proxy: update tokio-postgres to allow arbitrary config params (#8076)"
This reverts commit 78d9059fc7.
2024-06-27 09:46:58 +01:00
Conrad Ludgate
5d62c67e75 Merge pull request #8117 from neondatabase/rc/proxy/2024-06-20
Proxy release 2024-06-20
2024-06-20 11:42:35 +01:00
Anna Khanova
53d53d5b1e Merge pull request #7980 from neondatabase/rc/proxy/2024-06-06
Proxy release 2024-06-06
2024-06-06 13:14:40 +02:00
Anna Khanova
29fe6ea47a Merge pull request #7909 from neondatabase/rc/proxy/2024-05-30
Proxy release 2024-05-30
2024-05-30 14:59:41 +02:00
Alexander Bayandin
640327ccb3 Merge pull request #7880 from neondatabase/rc/proxy/2024-05-24
Proxy release 2024-05-24
2024-05-24 18:00:18 +01:00
Anna Khanova
7cf0f6b37e Merge pull request #7853 from neondatabase/rc/proxy/2024-05-23
Proxy release 2024-05-23
2024-05-23 12:09:13 +02:00
Anna Khanova
03c2c569be [proxy] Do not fail after parquet upload error (#7858)
## Problem

If the parquet upload was unsuccessful, it will panic.

## Summary of changes

Write error in logs instead.
2024-05-23 11:44:47 +02:00
Conrad Ludgate
eff6d4538a Merge pull request #7654 from neondatabase/rc/proxy/2024-05-08
Proxy release 2024-05-08
2024-05-08 11:56:20 +01:00
Conrad Ludgate
5ef7782e9c Merge pull request #7649 from neondatabase/rc/proxy/2024-05-08
Proxy release 2024-05-08
2024-05-08 06:54:03 +01:00
Conrad Ludgate
73101db8c4 Merge branch 'release-proxy' into rc/proxy/2024-05-08 2024-05-08 06:43:57 +01:00
Anna Khanova
bccdfc6d39 Merge pull request #7580 from neondatabase/rc/proxy/2024-05-02
Proxy release 2024-05-02
2024-05-02 12:00:01 +02:00
Anna Khanova
99595813bb proxy: keep track on the number of events from redis by type. (#7582)
## Problem

It's unclear what is the distribution of messages, proxy is consuming
from redis.

## Summary of changes

Add counter.
2024-05-02 11:56:19 +02:00
Anna Khanova
fe07b54758 Merge pull request #7507 from neondatabase/rc/proxy/2024-04-25
Proxy release 2024-04-25
2024-04-25 13:50:05 +02:00
Anna Khanova
a42d173e7b proxy: Fix cancellations (#7510)
## Problem

Cancellations were published to the channel, that was never read.

## Summary of changes

Fallback to global redis publishing.
2024-04-25 13:42:25 +02:00
Anna Khanova
e07f689238 Update connect to compute and wake compute retry configs (#7509)
## Problem

## Summary of changes

Decrease waiting time
2024-04-25 13:20:21 +02:00
Conrad Ludgate
7831eddc88 Merge pull request #7417 from neondatabase/rc/proxy/2024-04-18
Proxy release 2024-04-18
2024-04-18 12:03:07 +01:00
Conrad Ludgate
943b1bc80c Merge pull request #7366 from neondatabase/proxy-hotfix
Release proxy (2024-04-11 hotfix)
2024-04-12 10:15:14 +01:00
Conrad Ludgate
95a184e9b7 proxy: fix overloaded db connection closure (#7364)
## Problem

possible for the database connections to not close in time.

## Summary of changes

force the closing of connections if the client has hung up
2024-04-11 23:38:47 +01:00
Conrad Ludgate
3fa17e9d17 Merge pull request #7357 from neondatabase/rc/proxy/2024-04-11
Proxy release 2024-04-11
2024-04-11 11:49:45 +01:00
Anna Khanova
55e0fd9789 Merge pull request #7304 from neondatabase/rc/proxy/2024-04-04
Proxy release 2024-04-04
2024-04-04 12:40:11 +02:00
Anna Khanova
2a88889f44 Merge pull request #7254 from neondatabase/rc/proxy/2024-03-27
Proxy release 2024-03-27
2024-03-27 11:44:09 +01:00
Conrad Ludgate
5bad8126dc Merge pull request #7173 from neondatabase/rc/proxy/2024-03-19
Proxy release 2024-03-19
2024-03-19 12:11:42 +00:00
Anna Khanova
27bc242085 Merge pull request #7119 from neondatabase/rc/proxy/2024-03-14
Proxy release 2024-03-14
2024-03-14 14:57:05 +05:00
Anna Khanova
192b49cc6d Merge branch 'release-proxy' into rc/proxy/2024-03-14 2024-03-14 14:16:36 +05:00
Conrad Ludgate
e1b60f3693 Merge pull request #7041 from neondatabase/rc/proxy/2024-03-07
Proxy release 2024-03-07
2024-03-08 08:19:16 +00:00
Anna Khanova
2804f5323b Merge pull request #6997 from neondatabase/rc/proxy/2024-03-04
Proxy release 2024-03-04
2024-03-04 17:36:11 +04:00
Anna Khanova
676adc6b32 Merge branch 'release-proxy' into rc/proxy/2024-03-04 2024-03-04 16:41:46 +04:00
Nikita Kalyanov
96a4e8de66 Add /terminate API (#6745) (#6853)
this is to speed up suspends, see
https://github.com/neondatabase/cloud/issues/10284


Cherry-pick to release branch to build new compute images
2024-02-22 11:51:19 +02:00
Arseny Sher
01180666b0 Merge pull request #6803 from neondatabase/releases/2024-02-19
Release 2024-02-19
2024-02-19 16:38:35 +04:00
Conrad Ludgate
6c94269c32 Merge pull request #6758 from neondatabase/release-proxy-2024-02-14
2024-02-14 Proxy Release
2024-02-15 09:45:08 +00:00
Anna Khanova
edc691647d Proxy: remove fail fast logic to connect to compute (#6759)
## Problem

Flaky tests

## Summary of changes

Remove failfast logic
2024-02-15 07:42:12 +00:00
Conrad Ludgate
855d7b4781 hold cancel session (#6750)
## Problem

In a recent refactor, we accidentally dropped the cancel session early

## Summary of changes

Hold the cancel session during proxy passthrough
2024-02-14 14:57:22 +00:00
Anna Khanova
c49c9707ce Proxy: send cancel notifications to all instances (#6719)
## Problem

If cancel request ends up on the wrong proxy instance, it doesn't take
an effect.

## Summary of changes

Send redis notifications to all proxy pods about the cancel request.

Related issue: https://github.com/neondatabase/neon/issues/5839,
https://github.com/neondatabase/cloud/issues/10262
2024-02-14 14:57:22 +00:00
Anna Khanova
2227540a0d Proxy refactor auth+connect (#6708)
## Problem

Not really a problem, just refactoring.

## Summary of changes

Separate authenticate from wake compute.

Do not call wake compute second time if we managed to connect to
postgres or if we got it not from cache.
2024-02-14 14:57:22 +00:00
Conrad Ludgate
f1347f2417 proxy: add more http logging (#6726)
## Problem

hard to see where time is taken during HTTP flow.

## Summary of changes

add a lot more for query state. add a conn_id field to the sql-over-http
span
2024-02-14 14:57:22 +00:00
Conrad Ludgate
30b295b017 proxy: some more parquet data (#6711)
## Summary of changes

add auth_method and database to the parquet logs
2024-02-14 14:57:22 +00:00
Anna Khanova
1cef395266 Proxy: copy bidirectional fork (#6720)
## Problem

`tokio::io::copy_bidirectional` doesn't close the connection once one of
the sides closes it. It's not really suitable for the postgres protocol.

## Summary of changes

Fork `copy_bidirectional` and initiate a shutdown for both connections.

---------

Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>
2024-02-14 14:57:22 +00:00
John Spray
78d160f76d Merge pull request #6721 from neondatabase/releases/2024-02-12
Release 2024-02-12
2024-02-12 09:35:30 +00:00
Vlad Lazar
b9238059d6 Merge pull request #6617 from neondatabase/releases/2024-02-05
Release 2024-02-05
2024-02-05 12:50:38 +00:00
Arpad Müller
d0cb4b88c8 Don't preserve temp files on creation errors of delta layers (#6612)
There is currently no cleanup done after a delta layer creation error,
so delta layers can accumulate. The problem gets worse as the operation
gets retried and delta layers accumulate on the disk. Therefore, delete
them from disk (if something has been written to disk).
2024-02-05 09:58:18 +00:00
John Spray
1ec3e39d4e Merge pull request #6504 from neondatabase/releases/2024-01-29
Release 2024-01-29
2024-01-29 10:05:01 +00:00
John Spray
a1a74eef2c Merge pull request #6420 from neondatabase/releases/2024-01-22
Release 2024-01-22
2024-01-22 17:24:11 +00:00
John Spray
90e689adda pageserver: mark tenant broken when cancelling attach (#6430)
## Problem

When a tenant is in Attaching state, and waiting for the
`concurrent_tenant_warmup` semaphore, it also listens for the tenant
cancellation token. When that token fires, Tenant::attach drops out.
Meanwhile, Tenant::set_stopping waits forever for the tenant to exit
Attaching state.

Fixes: https://github.com/neondatabase/neon/issues/6423

## Summary of changes

- In the absence of a valid state for the tenant, it is set to Broken in
this path. A more elegant solution will require more refactoring, beyond
this minimal fix.

(cherry picked from commit 93572a3e99)
2024-01-22 16:20:57 +00:00
Christian Schwarz
f0b2d4b053 fixup(#6037): actually fix the issue, #6388 failed to do so (#6429)
Before this patch, the select! still retured immediately if `futs` was
empty. Must have tested a stale build in my manual testing of #6388.

(cherry picked from commit 15c0df4de7)
2024-01-22 15:23:12 +00:00
Anna Khanova
299d9474c9 Proxy: fix gc (#6426)
## Problem

Gc currently doesn't work properly.

## Summary of changes

Change statement on running gc.
2024-01-22 14:39:09 +01:00
Conrad Ludgate
7234208b36 bump shlex (#6421)
## Problem

https://rustsec.org/advisories/RUSTSEC-2024-0006

## Summary of changes

`cargo update -p shlex`

(cherry picked from commit 5559b16953)
2024-01-22 09:49:33 +00:00
Christian Schwarz
93450f11f5 Merge pull request #6354 from neondatabase/releases/2024-01-15
Release 2024-01-15

NB: the previous release PR https://github.com/neondatabase/neon/pull/6286 was accidentally merged by merge-by-squash instead of merge-by-merge-commit.
See https://github.com/neondatabase/neon/pull/6354#issuecomment-1891706321 for more context.
2024-01-15 14:30:25 +01:00
Christian Schwarz
2f0f9edf33 Merge remote-tracking branch 'origin/release' into releases/2024-01-15 2024-01-15 09:36:42 +00:00
Christian Schwarz
d424f2b7c8 empty commit so we can produce a merge commit 2024-01-15 09:36:22 +00:00
Christian Schwarz
21315e80bc Merge branch 'releases/2024-01-08--not-squashed' into releases/2024-01-15 2024-01-15 09:31:07 +00:00
vipvap
483b66d383 Merge branch 'release' into releases/2024-01-08 (not-squashed merge of #6286)
Release PR https://github.com/neondatabase/neon/pull/6286 got
accidentally merged-by-squash intstead of merge-by-merge-commit.

This commit shows how things would look like if 6286 had been
merged-by-squash.

```
git reset --hard 9f1327772
git merge --no-ff 5c0264b591
```

Co-authored-by: Christian Schwarz <christian@neon.tech>
2024-01-15 09:28:08 +00:00
vipvap
aa72a22661 Release 2024-01-08 (#6286)
Release 2024-01-08
2024-01-08 09:26:27 +00:00
Shany Pozin
5c0264b591 Merge branch 'release' into releases/2024-01-08 2024-01-08 09:34:06 +02:00
Arseny Sher
9f13277729 Merge pull request #6242 from neondatabase/releases/2024-01-02
Release 2024-01-02
2024-01-02 12:04:43 +04:00
Arseny Sher
54aa319805 Don't split WAL record across two XLogData's when sending from safekeepers.
As protocol demands. Not following this makes standby complain about corrupted
WAL in various ways.

https://neondb.slack.com/archives/C05L7D1JAUS/p1703774799114719
closes https://github.com/neondatabase/cloud/issues/9057
2024-01-02 10:54:00 +04:00
Arseny Sher
4a227484bf Add large insertion and slow WAL sending to test_hot_standby.
To exercise MAX_SEND_SIZE sending from safekeeper; we've had a bug with WAL
records torn across several XLogData messages. Add failpoint to safekeeper to
slow down sending. Also check for corrupted WAL complains in standby log.

Make the test a bit simpler in passing, e.g. we don't need explicit commits as
autocommit is enabled by default.

https://neondb.slack.com/archives/C05L7D1JAUS/p1703774799114719
https://github.com/neondatabase/cloud/issues/9057
2024-01-02 10:54:00 +04:00
Arseny Sher
2f83f85291 Add failpoint support to safekeeper.
Just a copy paste from pageserver.
2024-01-02 10:54:00 +04:00
Arseny Sher
d6cfcb0d93 Move failpoint support code to utils.
To enable them in safekeeper as well.
2024-01-02 10:54:00 +04:00
Arseny Sher
392843ad2a Fix safekeeper START_REPLICATION (term=n).
It was giving WAL only up to commit_lsn instead of flush_lsn, so recovery of
uncommitted WAL since cdb08f03 hanged. Add test for this.
2024-01-02 10:54:00 +04:00
Arseny Sher
bd4dae8f4a compute_ctl: kill postgres and sync-safekeeprs on exit.
Otherwise they are left orphaned when compute_ctl is terminated with a
signal. It was invisible most of the time because normally neon_local or k8s
kills postgres directly and then compute_ctl finishes gracefully. However, in
some tests compute_ctl gets stuck waiting for sync-safekeepers which
intentionally never ends because safekeepers are offline, and we want to stop
compute_ctl without leaving orphanes behind.

This is a quite rough approach which doesn't wait for children termination. A
better way would be to convert compute_ctl to async which would make waiting
easy.
2024-01-02 10:54:00 +04:00
Shany Pozin
b05fe53cfd Merge pull request #6240 from neondatabase/releases/2024-01-01
Release 2024-01-01
2024-01-01 11:07:30 +02:00
Christian Schwarz
c13a2f0df1 Merge pull request #6192 from neondatabase/releases/2023-12-19
Release 2023-12-19

We need to do a config change that requires restarting the pageservers.
Slip in two metrics-related commits that didn't make this week's regularly release.
2023-12-19 14:52:47 +01:00
Christian Schwarz
39be366fc5 higher resolution histograms for getpage@lsn (#6177)
part of https://github.com/neondatabase/cloud/issues/7811
2023-12-19 13:46:59 +00:00
Christian Schwarz
6eda0a3158 [PRE-MERGE] fix metric pageserver_initial_logical_size_start_calculation
(This is a pre-merge cherry-pick of https://github.com/neondatabase/neon/pull/6191)

It wasn't being incremented.

Fixup of

    commit 1c88824ed0
    Author: Christian Schwarz <christian@neon.tech>
    Date:   Fri Dec 1 12:52:59 2023 +0100

        initial logical size calculation: add a bunch of metrics (#5995)
2023-12-19 13:46:55 +00:00
Shany Pozin
306c7a1813 Merge pull request #6173 from neondatabase/sasha_release_bypassrls_replication
Grant BYPASSRLS and REPLICATION explicitly to neon_superuser roles
2023-12-18 22:16:36 +02:00
Sasha Krassovsky
80be423a58 Grant BYPASSRLS and REPLICATION explicitly to neon_superuser roles 2023-12-18 10:22:36 -08:00
Shany Pozin
5dcfef82f2 Merge pull request #6163 from neondatabase/releases/2023-12-18
Release 2023-12-18-2
2023-12-18 15:34:17 +02:00
Christian Schwarz
e67b8f69c0 [PRE-MERGE] pageserver: Reduce tracing overhead in timeline::get #6115
Pre-merge `git merge --squash` of
https://github.com/neondatabase/neon/pull/6115

Lowering the tracing level in get_value_reconstruct_data and
get_or_maybe_download from info to debug reduces the overhead
of span creation in non-debug environments.
2023-12-18 13:39:48 +01:00
Shany Pozin
e546872ab4 Merge pull request #6158 from neondatabase/releases/2023-12-18
Release 2023-12-18
2023-12-18 14:24:34 +02:00
John Spray
322ea1cf7c pageserver: on-demand activation cleanups (#6157)
## Problem

#6112 added some logs and metrics: clean these up a bit:
- Avoid counting startup completions for tenants launched after startup
- exclude no-op cases from timing histograms 
- remove a rogue log messages
2023-12-18 11:14:19 +00:00
Vadim Kharitonov
3633742de9 Merge pull request #6121 from neondatabase/releases/2023-12-13
Release 2023-12-13
2023-12-13 12:39:43 +01:00
Joonas Koivunen
079d3a37ba Merge remote-tracking branch 'origin/release' into releases/2023-12-13
this handles the hotfix introduced conflict.
2023-12-13 10:07:19 +00:00
Vadim Kharitonov
a46e77b476 Merge pull request #6090 from neondatabase/releases/2023-12-11
Release 2023-12-11
2023-12-12 12:10:35 +01:00
Tristan Partin
a92702b01e Add submodule paths as safe directories as a precaution
The check-codestyle-rust-arm job requires this for some reason, so let's
just add them everywhere we do this workaround.
2023-12-11 22:00:35 +00:00
Tristan Partin
8ff3253f20 Fix git ownership issue in check-codestyle-rust-arm
We have this workaround for other jobs. Looks like this one was
forgotten about.
2023-12-11 22:00:35 +00:00
Joonas Koivunen
04b82c92a7 fix: accidential return Ok (#6106)
Error indicating request cancellation OR timeline shutdown was deemed as
a reason to exit the background worker that calculated synthetic size.
Fix it to only be considered for avoiding logging such of such errors.

This conflicted on tenant_shard_id having already replaced tenant_id on
`main`.
2023-12-11 21:41:36 +00:00
Vadim Kharitonov
e5bf423e68 Merge branch 'release' into releases/2023-12-11 2023-12-11 11:55:48 +01:00
Vadim Kharitonov
60af392e45 Merge pull request #6057 from neondatabase/vk/patch_timescale_for_production
Revert timescaledb for pg14 and pg15 (#6056)
2023-12-06 16:21:16 +01:00
Vadim Kharitonov
661fc41e71 Revert timescaledb for pg14 and pg15 (#6056)
```
could not start the compute node: compute is in state "failed": db error: ERROR: could not access file "$libdir/timescaledb-2.10.1": No such file or directory Caused by: ERROR: could not access file "$libdir/timescaledb-2.10.1": No such file or directory
```
2023-12-06 16:14:07 +01:00
Shany Pozin
702c488f32 Merge pull request #6022 from neondatabase/releases/2023-12-04
Release 2023-12-04
2023-12-05 17:03:28 +02:00
Sasha Krassovsky
45c5122754 Remove trusted from wal2json 2023-12-04 12:36:19 -08:00
Shany Pozin
558394f710 fix merge 2023-12-04 11:41:27 +02:00
Shany Pozin
73b0898608 Merge branch 'release' into releases/2023-12-04 2023-12-04 11:36:26 +02:00
Joonas Koivunen
e65be4c2dc Merge pull request #6013 from neondatabase/releases/2023-12-01-hotfix
fix: use create_new instead of create for mutex file
2023-12-01 15:35:56 +02:00
Joonas Koivunen
40087b8164 fix: use create_new instead of create for mutex file 2023-12-01 12:54:49 +00:00
Shany Pozin
c762b59483 Merge pull request #5986 from neondatabase/Release-11-30-hotfix
Notify safekeeper readiness with systemd.
2023-11-30 10:01:05 +02:00
Arseny Sher
5d71601ca9 Notify safekeeper readiness with systemd.
To avoid downtime during deploy, as in busy regions initial load can currently
take ~30s.
2023-11-30 08:23:31 +03:00
Shany Pozin
a113c3e433 Merge pull request #5945 from neondatabase/release-2023-11-28-hotfix
Release 2023 11 28 hotfix
2023-11-28 08:14:59 +02:00
Anastasia Lubennikova
e81fc598f4 Update neon extension relocatable for existing installations (#5943) 2023-11-28 00:12:39 +00:00
Anastasia Lubennikova
48b845fa76 Make neon extension relocatable to allow SET SCHEMA (#5942) 2023-11-28 00:12:32 +00:00
Shany Pozin
27096858dc Merge pull request #5922 from neondatabase/releases/2023-11-27
Release 2023-11-27
2023-11-27 09:58:51 +02:00
Shany Pozin
4430d0ae7d Merge pull request #5876 from neondatabase/releases/2023-11-17
Release 2023-11-17
2023-11-20 09:11:58 +02:00
Joonas Koivunen
6e183aa0de Merge branch 'main' into releases/2023-11-17 2023-11-19 15:25:47 +00:00
Vadim Kharitonov
fd6d0b7635 Merge branch 'release' into releases/2023-11-17 2023-11-17 10:51:45 +01:00
Vadim Kharitonov
3710c32aae Merge pull request #5778 from neondatabase/releases/2023-11-03
Release 2023-11-03
2023-11-03 16:06:58 +01:00
Vadim Kharitonov
be83bee49d Merge branch 'release' into releases/2023-11-03 2023-11-03 11:18:15 +01:00
Alexander Bayandin
cf28e5922a Merge pull request #5685 from neondatabase/releases/2023-10-26
Release 2023-10-26
2023-10-27 10:42:12 +01:00
Em Sharnoff
7d384d6953 Bump vm-builder v0.18.2 -> v0.18.4 (#5666)
Only applicable change was neondatabase/autoscaling#584, setting
pgbouncer auth_dbname=postgres in order to fix superuser connections
from preventing dropping databases.
2023-10-26 20:15:45 +01:00
Em Sharnoff
4b3b37b912 Bump vm-builder v0.18.1 -> v0.18.2 (#5646)
Only applicable change was neondatabase/autoscaling#571, removing the
postgres_exporter flags `--auto-discover-databases` and
`--exclude-databases=...`
2023-10-26 20:15:29 +01:00
Shany Pozin
1d8d200f4d Merge pull request #5668 from neondatabase/sp/aux_files_cherry_pick
Cherry pick: Ignore missed AUX_FILES_KEY when generating image layer (#5660)
2023-10-26 10:08:16 +03:00
Konstantin Knizhnik
0d80d6ce18 Ignore missed AUX_FILES_KEY when generating image layer (#5660)
## Problem

Logical replication requires new AUX_FILES_KEY which is definitely
absent in existed database.
We do not have function to check if key exists in our KV storage.
So I have to handle the error in `list_aux_files` method.
But this key is also included in key space range and accessed y
`create_image_layer` method.

## Summary of changes

Check if AUX_FILES_KEY  exists before including it in keyspace.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Shany Pozin <shany@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-10-26 09:30:28 +03:00
Shany Pozin
f653ee039f Merge pull request #5638 from neondatabase/releases/2023-10-24
Release 2023-10-24
2023-10-24 12:10:52 +03:00
Em Sharnoff
e614a95853 Merge pull request #5610 from neondatabase/sharnoff/rc-2023-10-20-vm-monitor-fixes
Release 2023-10-20: vm-monitor memory.high throttling fixes
2023-10-20 00:11:06 -07:00
Em Sharnoff
850db4cc13 vm-monitor: Deny not fail downscale if no memory stats yet (#5606)
Fixes an issue we observed on staging that happens when the
autoscaler-agent attempts to immediately downscale the VM after binding,
which is typical for pooled computes.

The issue was occurring because the autoscaler-agent was requesting
downscaling before the vm-monitor had gathered sufficient cgroup memory
stats to be confident in approving it. When the vm-monitor returned an
internal error instead of denying downscaling, the autoscaler-agent
retried the connection and immediately hit the same issue (in part
because cgroup stats are collected per-connection, rather than
globally).
2023-10-19 21:56:55 -07:00
Em Sharnoff
8a316b1277 vm-monitor: Log full error on message handling failure (#5604)
There's currently an issue with the vm-monitor on staging that's not
really feasible to debug because the current display impl gives no
context to the errors (just says "failed to downscale").

Logging the full error should help.

For communications with the autoscaler-agent, it's ok to only provide
the outermost cause, because we can cross-reference with the VM logs.
At some point in the future, we may want to change that.
2023-10-19 21:56:50 -07:00
Em Sharnoff
4d13bae449 vm-monitor: Switch from memory.high to polling memory.stat (#5524)
tl;dr it's really hard to avoid throttling from memory.high, and it
counts tmpfs & page cache usage, so it's also hard to make sense of.

In the interest of fixing things quickly with something that should be
*good enough*, this PR switches to instead periodically fetch memory
statistics from the cgroup's memory.stat and use that data to determine
if and when we should upscale.

This PR fixes #5444, which has a lot more detail on the difficulties
we've hit with memory.high. This PR also supersedes #5488.
2023-10-19 21:56:36 -07:00
Vadim Kharitonov
49377abd98 Merge pull request #5577 from neondatabase/releases/2023-10-17
Release 2023-10-17
2023-10-17 12:21:20 +02:00
Christian Schwarz
a6b2f4e54e limit imitate accesses concurrency, using same semaphore as compactions (#5578)
Before this PR, when we restarted pageserver, we'd see a rush of
`$number_of_tenants` concurrent eviction tasks starting to do imitate
accesses building up in the period of `[init_order allows activations,
$random_access_delay + EvictionPolicyLayerAccessThreshold::period]`.

We simply cannot handle that degree of concurrent IO.

We already solved the problem for compactions by adding a semaphore.
So, this PR shares that semaphore for use by evictions.

Part of https://github.com/neondatabase/neon/issues/5479

Which is again part of https://github.com/neondatabase/neon/issues/4743

Risks / Changes In System Behavior
==================================

* we don't do evictions as timely as we currently do
* we log a bunch of warnings about eviction taking too long
* imitate accesses and compactions compete for the same concurrency
limit, so, they'll slow each other down through this shares semaphore

Changes
=======

- Move the `CONCURRENT_COMPACTIONS` semaphore into `tasks.rs`
- Rename it to `CONCURRENT_BACKGROUND_TASKS`
- Use it also for the eviction imitate accesses:
    - Imitate acceses are both per-TIMELINE and per-TENANT
    - The per-TENANT is done through coalescing all the per-TIMELINE
      tasks via a tokio mutex `eviction_task_tenant_state`.
    - We acquire the CONCURRENT_BACKGROUND_TASKS permit early, at the
      beginning of the eviction iteration, much before the imitate
      acesses start (and they may not even start at all in the given
      iteration, as they happen only every $threshold).
    - Acquiring early is **sub-optimal** because when the per-timline
      tasks coalesce on the `eviction_task_tenant_state` mutex,
      they are already holding a CONCURRENT_BACKGROUND_TASKS permit.
    - It's also unfair because tenants with many timelines win
      the CONCURRENT_BACKGROUND_TASKS more often.
    - I don't think there's another way though, without refactoring
      more of the imitate accesses logic, e.g, making it all per-tenant.
- Add metrics for queue depth behind the semaphore.
I found these very useful to understand what work is queued in the
system.

    - The metrics are tagged by the new `BackgroundLoopKind`.
    - On a green slate, I would have used `TaskKind`, but we already had
      pre-existing labels whose names didn't map exactly to task kind.
      Also the task kind is kind of a lower-level detail, so, I think
it's fine to have a separate enum to identify background work kinds.

Future Work
===========

I guess I could move the eviction tasks from a ticker to "sleep for
$period".
The benefit would be that the semaphore automatically "smears" the
eviction task scheduling over time, so, we only have the rush on restart
but a smeared-out rush afterward.

The downside is that this perverts the meaning of "$period", as we'd
actually not run the eviction at a fixed period. It also means the the
"took to long" warning & metric becomes meaningless.

Then again, that is already the case for the compaction and gc tasks,
which do sleep for `$period` instead of using a ticker.

(cherry picked from commit 9256788273)
2023-10-17 12:16:26 +02:00
Shany Pozin
face60d50b Merge pull request #5526 from neondatabase/releases/2023-10-11
Release 2023-10-11
2023-10-11 11:16:39 +03:00
Shany Pozin
9768aa27f2 Merge pull request #5516 from neondatabase/releases/2023-10-10
Release 2023-10-10
2023-10-10 14:16:47 +03:00
Shany Pozin
96b2e575e1 Merge pull request #5445 from neondatabase/releases/2023-10-03
Release 2023-10-03
2023-10-04 13:53:37 +03:00
Alexander Bayandin
7222777784 Update checksums for pg_jsonschema & pg_graphql (#5455)
## Problem

Folks have re-taged releases for `pg_jsonschema` and `pg_graphql` (to
increase timeouts on their CI), for us, these are a noop changes, 
but unfortunately, this will cause our builds to fail due to checksums 
mismatch (this might not strike right away because of the build cache).
- 8ba7c7be9d
- aa7509370a

## Summary of changes
- `pg_jsonschema` update checksum
- `pg_graphql` update checksum
2023-10-03 18:44:30 +01:00
Em Sharnoff
5469fdede0 Merge pull request #5422 from neondatabase/sharnoff/rc-2023-09-28-fix-restart-on-postmaster-SIGKILL
Release 2023-09-28: Fix (lack of) restart on neonvm postmaster SIGKILL
2023-09-28 10:48:51 -07:00
MMeent
72aa6b9fdd Fix neon_zeroextend's WAL logging (#5387)
When you log more than a few blocks, you need to reserve the space in
advance. We didn't do that, so we got errors. Now we do that, and
shouldn't get errors.
2023-09-28 09:37:28 -07:00
Em Sharnoff
ae0634b7be Bump vm-builder v0.17.11 -> v0.17.12 (#5407)
Only relevant change is neondatabase/autoscaling#534 - refer there for
more details.
2023-09-28 09:28:04 -07:00
Shany Pozin
70711f32fa Merge pull request #5375 from neondatabase/releases/2023-09-26
Release 2023-09-26
2023-09-26 15:19:45 +03:00
Vadim Kharitonov
52a88af0aa Merge pull request #5336 from neondatabase/releases/2023-09-19
Release 2023-09-19
2023-09-19 11:16:43 +02:00
Alexander Bayandin
b7a43bf817 Merge branch 'release' into releases/2023-09-19 2023-09-19 09:07:20 +01:00
Alexander Bayandin
dce91b33a4 Merge pull request #5318 from neondatabase/releases/2023-09-15-1
Postgres 14/15: Use previous extensions versions
2023-09-15 16:30:44 +01:00
Alexander Bayandin
23ee4f3050 Revert plv8 only 2023-09-15 15:45:23 +01:00
Alexander Bayandin
46857e8282 Postgres 14/15: Use previous extensions versions 2023-09-15 15:27:00 +01:00
Alexander Bayandin
368ab0ce54 Merge pull request #5313 from neondatabase/releases/2023-09-15
Release 2023-09-15
2023-09-15 10:39:56 +01:00
Konstantin Knizhnik
a5987eebfd References to old and new blocks were mixed in xlog_heap_update handler (#5312)
## Problem

See https://neondb.slack.com/archives/C05L7D1JAUS/p1694614585955029

https://www.notion.so/neondatabase/Duplicate-key-issue-651627ce843c45188fbdcb2d30fd2178

## Summary of changes

Swap old/new block references

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2023-09-15 10:11:41 +01:00
Alexander Bayandin
6686ede30f Update checksum for pg_hint_plan (#5309)
## Problem

The checksum for `pg_hint_plan` doesn't match:
```
sha256sum: WARNING: 1 computed checksum did NOT match
```

Ref
https://github.com/neondatabase/neon/actions/runs/6185715461/job/16793609251?pr=5307

It seems that the release was retagged yesterday:
https://github.com/ossc-db/pg_hint_plan/releases/tag/REL16_1_6_0

I don't see any malicious changes from 15_1.5.1:
https://github.com/ossc-db/pg_hint_plan/compare/REL15_1_5_1...REL16_1_6_0,
so it should be ok to update.

## Summary of changes
- Update checksum for `pg_hint_plan` 16_1.6.0
2023-09-15 09:54:42 +01:00
Em Sharnoff
373c7057cc vm-monitor: Fix cgroup throttling (#5303)
I believe this (not actual IO problems) is the cause of the "disk speed
issue" that we've had for VMs recently. See e.g.:

1. https://neondb.slack.com/archives/C03H1K0PGKH/p1694287808046179?thread_ts=1694271790.580099&cid=C03H1K0PGKH
2. https://neondb.slack.com/archives/C03H1K0PGKH/p1694511932560659

The vm-informant (and now, the vm-monitor, its replacement) is supposed
to gradually increase the `neon-postgres` cgroup's memory.high value,
because otherwise the kernel will throttle all the processes in the
cgroup.

This PR fixes a bug with the vm-monitor's implementation of this
behavior.

---

Other references, for the vm-informant's implementation:

- Original issue: neondatabase/autoscaling#44
- Original PR: neondatabase/autoscaling#223
2023-09-15 09:54:42 +01:00
Shany Pozin
7d6ec16166 Merge pull request #5296 from neondatabase/releases/2023-09-13
Release 2023-09-13
2023-09-13 13:49:14 +03:00
Shany Pozin
0e6fdc8a58 Merge pull request #5283 from neondatabase/releases/2023-09-12
Release 2023-09-12
2023-09-12 14:56:47 +03:00
Christian Schwarz
521438a5c6 fix deadlock around TENANTS (#5285)
The sequence that can lead to a deadlock:

1. DELETE request gets all the way to `tenant.shutdown(progress,
false).await.is_err() ` , while holding TENANTS.read()
2. POST request for tenant creation comes in, calls `tenant_map_insert`,
it does `let mut guard = TENANTS.write().await;`
3. Something that `tenant.shutdown()` needs to wait for needs a
`TENANTS.read().await`.
The only case identified in exhaustive manual scanning of the code base
is this one:
Imitate size access does `get_tenant().await`, which does
`TENANTS.read().await` under the hood.

In the above case (1) waits for (3), (3)'s read-lock request is queued
behind (2)'s write-lock, and (2) waits for (1).
Deadlock.

I made a reproducer/proof-that-above-hypothesis-holds in
https://github.com/neondatabase/neon/pull/5281 , but, it's not ready for
merge yet and we want the fix _now_.

fixes https://github.com/neondatabase/neon/issues/5284
2023-09-12 14:13:13 +03:00
Vadim Kharitonov
07d7874bc8 Merge pull request #5202 from neondatabase/releases/2023-09-05
Release 2023-09-05
2023-09-05 12:16:06 +02:00
Anastasia Lubennikova
1804111a02 Merge pull request #5161 from neondatabase/rc-2023-08-31
Release 2023-08-31
2023-08-31 16:53:17 +03:00
Arthur Petukhovsky
cd0178efed Merge pull request #5150 from neondatabase/release-sk-fix-active-timeline
Release 2023-08-30
2023-08-30 11:43:39 +02:00
Shany Pozin
333574be57 Merge pull request #5133 from neondatabase/releases/2023-08-29
Release 2023-08-29
2023-08-29 14:02:58 +03:00
Alexander Bayandin
79a799a143 Merge branch 'release' into releases/2023-08-29 2023-08-29 11:17:57 +01:00
Conrad Ludgate
9da06af6c9 Merge pull request #5113 from neondatabase/release-http-connection-fix
Release 2023-08-25
2023-08-25 17:21:35 +01:00
Conrad Ludgate
ce1753d036 proxy: dont return connection pending (#5107)
## Problem

We were returning Pending when a connection had a notice/notification
(introduced recently in #5020). When returning pending, the runtime
assumes you will call `cx.waker().wake()` in order to continue
processing.

We weren't doing that, so the connection task would get stuck

## Summary of changes

Don't return pending. Loop instead
2023-08-25 16:42:30 +01:00
Alek Westover
67db8432b4 Fix cargo deny errors (#5068)
## Problem
cargo deny lint broken

Links to the CVEs:

[rustsec.org/advisories/RUSTSEC-2023-0052](https://rustsec.org/advisories/RUSTSEC-2023-0052)

[rustsec.org/advisories/RUSTSEC-2023-0053](https://rustsec.org/advisories/RUSTSEC-2023-0053)
One is fixed, the other one isn't so we allow it (for now), to unbreak
CI. Then later we'll try to get rid of webpki in favour of the rustls
fork.

## Summary of changes
```
+ignore = ["RUSTSEC-2023-0052"]
```
2023-08-25 16:42:30 +01:00
Vadim Kharitonov
4e2e44e524 Enable neon-pool-opt-in (#5062) 2023-08-22 09:06:14 +01:00
Vadim Kharitonov
ed786104f3 Merge pull request #5060 from neondatabase/releases/2023-08-22
Release 2023-08-22
2023-08-22 09:41:02 +02:00
Stas Kelvich
84b74f2bd1 Merge pull request #4997 from neondatabase/sk/proxy-release-23-07-15
Fix lint
2023-08-15 18:54:20 +03:00
Arthur Petukhovsky
fec2ad6283 Fix lint 2023-08-15 18:49:02 +03:00
Stas Kelvich
98eebd4682 Merge pull request #4996 from neondatabase/sk/proxy_release
Disable neon-pool-opt-in
2023-08-15 18:37:50 +03:00
Arthur Petukhovsky
2f74287c9b Disable neon-pool-opt-in 2023-08-15 18:34:17 +03:00
Shany Pozin
aee1bf95e3 Merge pull request #4990 from neondatabase/releases/2023-08-15
Release 2023-08-15
2023-08-15 15:34:38 +03:00
Shany Pozin
b9de9d75ff Merge branch 'release' into releases/2023-08-15 2023-08-15 14:35:00 +03:00
Stas Kelvich
7943b709e6 Merge pull request #4940 from neondatabase/sk/release-23-05-25-proxy-fixup
Release: proxy retry fixup
2023-08-09 13:53:19 +03:00
Conrad Ludgate
d7d066d493 proxy: delay auth on retry (#4929)
## Problem

When an endpoint is shutting down, it can take a few seconds. Currently
when starting a new compute, this causes an "endpoint is in transition"
error. We need to add delays before retrying to ensure that we allow
time for the endpoint to shutdown properly.

## Summary of changes

Adds a delay before retrying in auth. connect_to_compute already has
this delay
2023-08-09 12:54:24 +03:00
Felix Prasanna
e78ac22107 release fix: revert vm builder bump from 0.13.1 -> 0.15.0-alpha1 (#4932)
This reverts commit 682dfb3a31.

hotfix for a CLI arg issue in the monitor
2023-08-08 21:08:46 +03:00
Vadim Kharitonov
76a8f2bb44 Merge pull request #4923 from neondatabase/releases/2023-08-08
Release 2023-08-08
2023-08-08 11:44:38 +02:00
Vadim Kharitonov
8d59a8581f Merge branch 'release' into releases/2023-08-08 2023-08-08 10:54:34 +02:00
Vadim Kharitonov
b1ddd01289 Define NEON_SMGR to make it possible for extensions to use Neon SMG API (#4889)
Co-authored-by: Konstantin Knizhnik <knizhnik@garret.ru>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2023-08-03 16:28:31 +03:00
Alexander Bayandin
6eae4fc9aa Release 2023-08-02: update pg_embedding (#4877)
Cherry-picking ca4d71a954 from `main` into
the `release`

Co-authored-by: Vadim Kharitonov <vadim2404@users.noreply.github.com>
2023-08-03 08:48:09 +02:00
Christian Schwarz
765455bca2 Merge pull request #4861 from neondatabase/releases/2023-08-01--2-fix-pipeline
ci: fix upload-postgres-extensions-to-s3 job
2023-08-01 13:22:07 +02:00
Christian Schwarz
4204960942 ci: fix upload-postgres-extensions-to-s3 job
commit

	commit 5f8fd640bf
	Author: Alek Westover <alek.westover@gmail.com>
	Date:   Wed Jul 26 08:24:03 2023 -0400

	    Upload Test Remote Extensions (#4792)

switched to using the release tag instead of `latest`, but,
the `promote-images` job only uploads `latest` to the prod ECR.

The switch to using release tag was good in principle, but,
reverting that part to make the release pipeine work.

Note that a proper fix should abandon use of `:latest` tag
at all: currently, if a `main` pipeline runs concurrently
with a `release` pipeline, the `release` pipeline may end
up using the `main` pipeline's images.
2023-08-01 12:01:45 +02:00
Christian Schwarz
67345d66ea Merge pull request #4858 from neondatabase/releases/2023-08-01
Release 2023-08-01
2023-08-01 10:44:01 +02:00
Shany Pozin
2266ee5971 Merge pull request #4803 from neondatabase/releases/2023-07-25
Release 2023-07-25
2023-07-25 14:21:07 +03:00
Shany Pozin
b58445d855 Merge pull request #4746 from neondatabase/releases/2023-07-18
Release 2023-07-18
2023-07-18 14:45:39 +03:00
Conrad Ludgate
36050e7f3d Merge branch 'release' into releases/2023-07-18 2023-07-18 12:00:09 +01:00
Alexander Bayandin
33360ed96d Merge pull request #4705 from neondatabase/release-2023-07-12
Release 2023-07-12 (only proxy)
2023-07-12 19:44:36 +01:00
Conrad Ludgate
39a28d1108 proxy wake_compute loop (#4675)
## Problem

If we fail to wake up the compute node, a subsequent connect attempt
will definitely fail. However, kubernetes won't fail the connection
immediately, instead it hangs until we timeout (10s).

## Summary of changes

Refactor the loop to allow fast retries of compute_wake and to skip a
connect attempt.
2023-07-12 18:40:11 +01:00
Conrad Ludgate
efa6aa134f allow repeated IO errors from compute node (#4624)
## Problem

#4598 compute nodes are not accessible some time after wake up due to
kubernetes DNS not being fully propagated.

## Summary of changes

Update connect retry mechanism to support handling IO errors and
sleeping for 100ms

## Checklist before requesting a review

- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
2023-07-12 18:40:06 +01:00
Alexander Bayandin
2c724e56e2 Merge pull request #4646 from neondatabase/releases/2023-07-06-hotfix
Release 2023-07-06 (add pg_embedding extension only)
2023-07-06 12:19:52 +01:00
Alexander Bayandin
feff887c6f Compile pg_embedding extension (#4634)
```
CREATE EXTENSION embedding;
CREATE TABLE t (val real[]);
INSERT INTO t (val) VALUES ('{0,0,0}'), ('{1,2,3}'), ('{1,1,1}'), (NULL);
CREATE INDEX ON t USING hnsw (val) WITH (maxelements = 10, dims=3, m=3);
INSERT INTO t (val) VALUES (array[1,2,4]);

SELECT * FROM t ORDER BY val <-> array[3,3,3];
   val   
---------
 {1,2,3}
 {1,2,4}
 {1,1,1}
 {0,0,0}
 
(5 rows)
```
2023-07-06 09:39:41 +01:00
Vadim Kharitonov
353d915fcf Merge pull request #4633 from neondatabase/releases/2023-07-05
Release 2023-07-05
2023-07-05 15:10:47 +02:00
Vadim Kharitonov
2e38098cbc Merge branch 'release' into releases/2023-07-05 2023-07-05 12:41:48 +02:00
Vadim Kharitonov
a6fe5ea1ac Merge pull request #4571 from neondatabase/releases/2023-06-27
Release 2023-06-27
2023-06-27 12:55:33 +02:00
Vadim Kharitonov
05b0aed0c1 Merge branch 'release' into releases/2023-06-27 2023-06-27 12:22:12 +02:00
Alex Chi Z
cd1705357d Merge pull request #4561 from neondatabase/releases/2023-06-23-hotfix
Release 2023-06-23 (pageserver-only)
2023-06-23 15:38:50 -04:00
Christian Schwarz
6bc7561290 don't use MGMT_REQUEST_RUNTIME for consumption metrics synthetic size worker
The consumption metrics synthetic size worker does logical size calculation.
Logical size calculation currently does synchronous disk IO.
This blocks the MGMT_REQUEST_RUNTIME's executor threads, starving other futures.

While there's work on the way to move the synchronous disk IO into spawn_blocking,
the quickfix here is to use the BACKGROUND_RUNTIME instead of MGMT_REQUEST_RUNTIME.

Actually it's not just a quickfix. We simply shouldn't be blocking MGMT_REQUEST_RUNTIME
executor threads on CPU or sync disk IO.
That work isn't done yet, as many of the mgmt tasks still _do_ disk IO.
But it's not as intensive as the logical size calculations that we're fixing here.

While we're at it, fix disk-usage-based eviction in a similar way.
It wasn't the culprit here, according to prod logs, but it can theoretically be
a little CPU-intensive.

More context, including graphs from Prod:
https://neondb.slack.com/archives/C03F5SM1N02/p1687541681336949

(cherry picked from commit d6e35222ea)
2023-06-23 20:54:07 +02:00
Christian Schwarz
fbd3ac14b5 Merge pull request #4544 from neondatabase/releases/2023-06-21-hotfix
Release 2023-06-21 (fixup for post-merge failed 2023-06-20)
2023-06-21 16:54:34 +03:00
Christian Schwarz
e437787c8f cargo update -p openssl (#4542)
To unblock release
https://github.com/neondatabase/neon/pull/4536#issuecomment-1600678054

Context: https://rustsec.org/advisories/RUSTSEC-2023-0044
2023-06-21 15:52:56 +03:00
Christian Schwarz
3460dbf90b Merge pull request #4536 from neondatabase/releases/2023-06-20
Release 2023-06-20 (actually 2023-06-21)
2023-06-21 14:19:14 +03:00
Vadim Kharitonov
6b89d99677 Merge pull request #4521 from neondatabase/release_2023-06-15
Release 2023 06 15
2023-06-15 17:40:01 +02:00
Vadim Kharitonov
6cc8ea86e4 Merge branch 'main' into release_2023-06-15 2023-06-15 16:50:44 +02:00
Shany Pozin
e62a492d6f Merge pull request #4486 from neondatabase/releases/2023-06-13
Release 2023-06-13
2023-06-13 15:21:35 +03:00
Alexey Kondratov
a475cdf642 [compute_ctl] Fix logging if catalog updates are skipped (#4480)
Otherwise, it wasn't clear from the log when Postgres started up
completely if catalog updates were skipped.

Follow-up for 4936ab6
2023-06-13 13:37:24 +02:00
Stas Kelvich
7002c79a47 Merge pull request #4447 from neondatabase/release_proxy_08-06-2023
Release proxy 08 06 2023
2023-06-08 21:02:54 +03:00
Vadim Kharitonov
ee6cf357b4 Merge pull request #4427 from neondatabase/releases/2023-06-06
Release 2023-06-06
2023-06-06 14:42:21 +02:00
Vadim Kharitonov
e5c2086b5f Merge branch 'release' into releases/2023-06-06 2023-06-06 12:33:56 +02:00
Shany Pozin
5f1208296a Merge pull request #4395 from neondatabase/releases/2023-06-01
Release 2023-06-01
2023-06-01 10:58:00 +03:00
Stas Kelvich
88e8e473cd Merge pull request #4345 from neondatabase/release-23-05-25-proxy
Release 23-05-25, take 3
2023-05-25 19:40:43 +03:00
Stas Kelvich
b0a77844f6 Add SQL-over-HTTP endpoint to Proxy
This commit introduces an SQL-over-HTTP endpoint in the proxy, with a JSON
response structure resembling that of the node-postgres driver. This method,
using HTTP POST, achieves smaller amortized latencies in edge setups due to
fewer round trips and an enhanced open connection reuse by the v8 engine.

This update involves several intricacies:
1. SQL injection protection: We employed the extended query protocol, modifying
   the rust-postgres driver to send queries in one roundtrip using a text
   protocol rather than binary, bypassing potential issues like those identified
   in https://github.com/sfackler/rust-postgres/issues/1030.

2. Postgres type compatibility: As not all postgres types have binary
   representations (e.g., acl's in pg_class), we adjusted rust-postgres to
   respond with text protocol, simplifying serialization and fixing queries with
   text-only types in response.

3. Data type conversion: Considering JSON supports fewer data types than
   Postgres, we perform conversions where possible, passing all other types as
   strings. Key conversions include:
   - postgres int2, int4, float4, float8 -> json number (NaN and Inf remain
     text)
   - postgres bool, null, text -> json bool, null, string
   - postgres array -> json array
   - postgres json and jsonb -> json object

4. Alignment with node-postgres: To facilitate integration with js libraries,
   we've matched the response structure of node-postgres, returning command tags
   and column oids. Command tag capturing was added to the rust-postgres
   functionality as part of this change.
2023-05-25 17:59:17 +03:00
Vadim Kharitonov
1baf464307 Merge pull request #4309 from neondatabase/releases/2023-05-23
Release 2023-05-23
2023-05-24 11:56:54 +02:00
Alexander Bayandin
e9b8e81cea Merge branch 'release' into releases/2023-05-23 2023-05-23 12:54:08 +01:00
Alexander Bayandin
85d6194aa4 Fix regress-tests job for Postgres 15 on release branch (#4254)
## Problem

Compatibility tests don't support Postgres 15 yet, but we're still
trying to upload compatibility snapshot (which we do not collect).

Ref
https://github.com/neondatabase/neon/actions/runs/4991394158/jobs/8940369368#step:4:38129

## Summary of changes

Add `pg_version` parameter to `run-python-test-set` actions and do not
upload compatibility snapshot for Postgres 15
2023-05-16 17:19:12 +01:00
Vadim Kharitonov
333a7a68ef Merge pull request #4245 from neondatabase/releases/2023-05-16
Release 2023-05-16
2023-05-16 13:38:40 +02:00
Vadim Kharitonov
6aa4e41bee Merge branch 'release' into releases/2023-05-16 2023-05-16 12:48:23 +02:00
Joonas Koivunen
840183e51f try: higher page_service timeouts to isolate an issue 2023-05-11 16:24:53 +03:00
Shany Pozin
cbccc94b03 Merge pull request #4184 from neondatabase/releases/2023-05-09
Release 2023-05-09
2023-05-09 15:30:36 +03:00
Stas Kelvich
fce227df22 Merge pull request #4163 from neondatabase/main
Release 23-05-05
2023-05-05 15:56:23 +03:00
Stas Kelvich
bd787e800f Merge pull request #4133 from neondatabase/main
Release 23-04-01
2023-05-01 18:52:46 +03:00
Shany Pozin
4a7704b4a3 Merge pull request #4131 from neondatabase/sp/hotfix_adding_sks_us_west
Hotfix: Adding 4 new pageservers and two sets of safekeepers to us west 2
2023-05-01 15:17:38 +03:00
Shany Pozin
ff1119da66 Add 2 new sets of safekeepers to us-west2 2023-05-01 14:35:31 +03:00
Shany Pozin
4c3ba1627b Add 4 new Pageservers for retool launch 2023-05-01 14:34:38 +03:00
Vadim Kharitonov
1407174fb2 Merge pull request #4110 from neondatabase/vk/release_2023-04-28
Release 2023 04 28
2023-04-28 17:43:16 +02:00
Vadim Kharitonov
ec9dcb1889 Merge branch 'release' into vk/release_2023-04-28 2023-04-28 16:32:26 +02:00
Joonas Koivunen
d11d781afc revert: "Add check for duplicates of generated image layers" (#4104)
This reverts commit 732acc5.

Reverted PR: #3869

As noted in PR #4094, we do in fact try to insert duplicates to the
layer map, if L0->L1 compaction is interrupted. We do not have a proper
fix for that right now, and we are in a hurry to make a release to
production, so revert the changes related to this to the state that we
have in production currently. We know that we have a bug here, but
better to live with the bug that we've had in production for a long
time, than rush a fix to production without testing it in staging first.

Cc: #4094, #4088
2023-04-28 16:31:35 +02:00
Anastasia Lubennikova
4e44565b71 Merge pull request #4000 from neondatabase/releases/2023-04-11
Release 2023-04-11
2023-04-11 17:47:41 +03:00
Stas Kelvich
4ed51ad33b Add more proxy cnames 2023-04-11 15:59:35 +03:00
Arseny Sher
1c1ebe5537 Merge pull request #3946 from neondatabase/releases/2023-04-04
Release 2023-04-04
2023-04-04 14:38:40 +04:00
Christian Schwarz
c19cb7f386 Merge pull request #3935 from neondatabase/releases/2023-04-03
Release 2023-04-03
2023-04-03 16:19:49 +02:00
Vadim Kharitonov
4b97d31b16 Merge pull request #3896 from neondatabase/releases/2023-03-28
Release 2023-03-28
2023-03-28 17:58:06 +04:00
Shany Pozin
923ade3dd7 Merge pull request #3855 from neondatabase/releases/2023-03-21
Release 2023-03-21
2023-03-21 13:12:32 +02:00
Arseny Sher
b04e711975 Merge pull request #3825 from neondatabase/release-2023-03-15
Release 2023.03.15
2023-03-15 15:38:00 +03:00
Arseny Sher
afd0a6b39a Forward framed read buf contents to compute before proxy pass.
Otherwise they get lost. Normally buffer is empty before proxy pass, but this is
not the case with pipeline mode of out npm driver; fixes connection hangup
introduced by b80fe41af3 for it.

fixes https://github.com/neondatabase/neon/issues/3822
2023-03-15 15:36:06 +04:00
Lassi Pölönen
99752286d8 Use RollingUpdate strategy also for legacy proxy (#3814)
## Describe your changes
We have previously changed the neon-proxy to use RollingUpdate. This
should be enabled in legacy proxy too in order to avoid breaking
connections for the clients and allow for example backups to run even
during deployment. (https://github.com/neondatabase/neon/pull/3683)

## Issue ticket number and link
https://github.com/neondatabase/neon/issues/3333
2023-03-15 15:35:51 +04:00
Arseny Sher
15df93363c Merge pull request #3804 from neondatabase/release-2023-03-13
Release 2023.03.13
2023-03-13 20:25:40 +03:00
Vadim Kharitonov
bc0ab741af Merge pull request #3758 from neondatabase/releases/2023-03-07
Release 2023-03-07
2023-03-07 12:38:47 +01:00
Christian Schwarz
51d9dfeaa3 Merge pull request #3743 from neondatabase/releases/2023-03-03
Release 2023-03-03
2023-03-03 19:20:21 +01:00
Shany Pozin
f63cb18155 Merge pull request #3713 from neondatabase/releases/2023-02-28
Release 2023-02-28
2023-02-28 12:52:24 +02:00
Arseny Sher
0de603d88e Merge pull request #3707 from neondatabase/release-2023-02-24
Release 2023-02-24

Hotfix for UNLOGGED tables. Contains #3706
Also contains rebase on 14.7 and 15.2 #3581
2023-02-25 00:32:11 +04:00
Heikki Linnakangas
240913912a Fix UNLOGGED tables.
Instead of trying to create missing files on the way, send init fork contents as
main fork from pageserver during basebackup. Add test for that. Call
put_rel_drop for init forks; previously they weren't removed. Bump
vendor/postgres to revert previous approach on Postgres side.

Co-authored-by: Arseny Sher <sher-ars@yandex.ru>

ref https://github.com/neondatabase/postgres/pull/264
ref https://github.com/neondatabase/postgres/pull/259
ref https://github.com/neondatabase/neon/issues/1222
2023-02-24 23:54:53 +04:00
MMeent
91a4ea0de2 Update vendored PostgreSQL versions to 14.7 and 15.2 (#3581)
## Describe your changes
Rebase vendored PostgreSQL onto 14.7 and 15.2

## Issue ticket number and link

#3579

## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [x] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [x] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
    ```
The version of PostgreSQL that we use is updated to 14.7 for PostgreSQL
14 and 15.2 for PostgreSQL 15.
    ```
2023-02-24 23:54:42 +04:00
Arseny Sher
8608704f49 Merge pull request #3691 from neondatabase/release-2023-02-23
Release 2023-02-23

Hotfix for the unlogged tables with indexes issue.

neondatabase/postgres#259
neondatabase/postgres#262
2023-02-23 13:39:33 +04:00
Arseny Sher
efef68ce99 Bump vendor/postgres to include hotfix for unlogged tables with indexes.
https://github.com/neondatabase/postgres/pull/259
https://github.com/neondatabase/postgres/pull/262
2023-02-23 08:49:43 +04:00
Joonas Koivunen
8daefd24da Merge pull request #3679 from neondatabase/releases/2023-02-22
Releases/2023-02-22
2023-02-22 15:56:55 +02:00
Arthur Petukhovsky
46cc8b7982 Remove safekeeper-1.ap-southeast-1.aws.neon.tech (#3671)
We migrated all timelines to
`safekeeper-3.ap-southeast-1.aws.neon.tech`, now old instance can be
removed.
2023-02-22 15:07:57 +02:00
Sergey Melnikov
38cd90dd0c Add -v to ansible invocations (#3670)
To get more debug output on failures
2023-02-22 15:07:57 +02:00
Joonas Koivunen
a51b269f15 fix: hold permit until GetObject eof (#3663)
previously we applied the ratelimiting only up to receiving the headers
from s3, or somewhere near it. the commit adds an adapter which carries
the permit until the AsyncRead has been disposed.

fixes #3662.
2023-02-22 15:07:57 +02:00
Joonas Koivunen
43bf6d0a0f calculate_logical_size: no longer use spawn_blocking (#3664)
Calculation of logical size is now async because of layer downloads, so
we shouldn't use spawn_blocking for it. Use of `spawn_blocking`
exhausted resources which are needed by `tokio::io::copy` when copying
from a stream to a file which lead to deadlock.

Fixes: #3657
2023-02-22 15:07:57 +02:00
Joonas Koivunen
15273a9b66 chore: ignore all compaction inactive tenant errors (#3665)
these are happening in tests because of #3655 but they sure took some
time to appear.

makes the `Compaction failed, retrying in 2s: Cannot run compaction
iteration on inactive tenant` into a globally allowed error, because it
has been seen failing on different test cases.
2023-02-22 15:07:57 +02:00
Joonas Koivunen
78aca668d0 fix: log download failed error (#3661)
Fixes #3659
2023-02-22 15:07:57 +02:00
Vadim Kharitonov
acbf4148ea Merge pull request #3656 from neondatabase/releases/2023-02-21
Release 2023-02-21
2023-02-21 16:03:48 +01:00
Vadim Kharitonov
6508540561 Merge branch 'release' into releases/2023-02-21 2023-02-21 15:31:16 +01:00
Arthur Petukhovsky
a41b5244a8 Add new safekeeper to ap-southeast-1 prod (#3645) (#3646)
To trigger deployment of #3645 to production.
2023-02-20 15:22:49 +00:00
Shany Pozin
2b3189be95 Merge pull request #3600 from neondatabase/releases/2023-02-14
Release 2023-02-14
2023-02-15 13:31:30 +02:00
Vadim Kharitonov
248563c595 Merge pull request #3553 from neondatabase/releases/2023-02-07
Release 2023-02-07
2023-02-07 14:07:44 +01:00
Vadim Kharitonov
14cd6ca933 Merge branch 'release' into releases/2023-02-07 2023-02-07 12:11:56 +01:00
Vadim Kharitonov
eb36403e71 Release 2023 01 31 (#3497)
Co-authored-by: Kirill Bulatov <kirill@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
Co-authored-by: Christian Schwarz <christian@neon.tech>
Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@garret.ru>
Co-authored-by: Shany Pozin <shany@neon.tech>
Co-authored-by: Sergey Melnikov <sergey@neon.tech>
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Co-authored-by: Rory de Zoete <33318916+zoete@users.noreply.github.com>
Co-authored-by: Rory de Zoete <rdezoete@Rorys-Mac-Studio.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>
Co-authored-by: Lassi Pölönen <lassi.polonen@iki.fi>
2023-01-31 15:06:35 +02:00
Anastasia Lubennikova
3c6f779698 Merge pull request #3411 from neondatabase/release_2023_01_23
Fix Release 2023 01 23
2023-01-23 20:10:03 +02:00
Joonas Koivunen
f67f0c1c11 More tenant size fixes (#3410)
Small changes, but hopefully this will help with the panic detected in
staging, for which we cannot get the debugging information right now
(end-of-branch before branch-point).
2023-01-23 17:46:13 +02:00
Shany Pozin
edb02d3299 Adding pageserver3 to staging (#3403) 2023-01-23 17:46:13 +02:00
Konstantin Knizhnik
664a69e65b Fix slru_segment_key_range function: segno was assigned to incorrect Key field (#3354) 2023-01-23 17:46:13 +02:00
Anastasia Lubennikova
478322ebf9 Fix tenant size orphans (#3377)
Before only the timelines which have passed the `gc_horizon` were
processed which failed with orphans at the tree_sort phase. Example
input in added `test_branched_empty_timeline_size` test case.

The PR changes iteration to happen through all timelines, and in
addition to that, any learned branch points will be calculated as they
would had been in the original implementation if the ancestor branch had
been over the `gc_horizon`.

This also changes how tenants where all timelines are below `gc_horizon`
are handled. Previously tenant_size 0 was returned, but now they will
have approximately `initdb_lsn` worth of tenant_size.

The PR also adds several new tenant size tests that describe various corner
cases of branching structure and `gc_horizon` setting.
They are currently disabled to not consume time during CI.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
2023-01-23 17:46:13 +02:00
Joonas Koivunen
802f174072 fix: dont stop pageserver if we fail to calculate synthetic size 2023-01-23 17:46:13 +02:00
Alexey Kondratov
47f9890bae [compute_ctl] Make role deletion spec processing idempotent (#3380)
Previously, we were trying to re-assign owned objects of the already
deleted role. This were causing a crash loop in the case when compute
was restarted with a spec that includes delta operation for role
deletion. To avoid such cases, check that role is still present before
calling `reassign_owned_objects`.

Resolves neondatabase/cloud#3553
2023-01-23 17:46:13 +02:00
Christian Schwarz
262265daad Revert "Use actual temporary dir for pageserver unit tests"
This reverts commit 826e89b9ce.

The problem with that commit was that it deletes the TempDir while
there are still EphemeralFile instances open.

At first I thought this could be fixed by simply adding

  Handle::current().block_on(task_mgr::shutdown(None, Some(tenant_id), None))

to TenantHarness::drop, but it turned out to be insufficient.

So, reverting the commit until we find a proper solution.

refs https://github.com/neondatabase/neon/issues/3385
2023-01-23 17:46:13 +02:00
bojanserafimov
300da5b872 Improve layer map docstrings (#3382) 2023-01-23 17:46:13 +02:00
Heikki Linnakangas
7b22b5c433 Switch to 'tracing' for logging, restructure code to make use of spans.
Refactors Compute::prepare_and_run. It's split into subroutines
differently, to make it easier to attach tracing spans to the
different stages. The high-level logic for waiting for Postgres to
exit is moved to the caller.

Replace 'env_logger' with 'tracing', and add `#instrument` directives
to different stages fo the startup process. This is a fairly
mechanical change, except for the changes in 'spec.rs'. 'spec.rs'
contained some complicated formatting, where parts of log messages
were printed directly to stdout with `print`s. That was a bit messed
up because the log normally goes to stderr, but those lines were
printed to stdout. In our docker images, stderr and stdout both go to
the same place so you wouldn't notice, but I don't think it was
intentional.

This changes the log format to the default
'tracing_subscriber::format' format. It's different from the Postgres
log format, however, and because both compute_tools and Postgres print
to the same log, it's now a mix of two different formats.  I'm not
sure how the Grafana log parsing pipeline can handle that. If it's a
problem, we can build custom formatter to change the compute_tools log
format to be the same as Postgres's, like it was before this commit,
or we can change the Postgres log format to match tracing_formatter's,
or we can start printing compute_tool's log output to a different
destination than Postgres
2023-01-23 17:46:12 +02:00
Kirill Bulatov
ffca97bc1e Enable logs in unit tests 2023-01-23 17:46:12 +02:00
Kirill Bulatov
cb356f3259 Use actual temporary dir for pageserver unit tests 2023-01-23 17:46:12 +02:00
Vadim Kharitonov
c85374295f Change SENTRY_ENVIRONMENT from "development" to "staging" 2023-01-23 17:46:12 +02:00
Anastasia Lubennikova
4992160677 Fix metric_collection_endpoint for prod.
It was incorrectly set to staging url
2023-01-23 17:46:12 +02:00
Heikki Linnakangas
bd535b3371 If an error happens while checking for core dumps, don't panic.
If we panic, we skip the 30s wait in 'main', and don't give the
console a chance to observe the error. Which is not nice.

Spotted by @ololobus at
https://github.com/neondatabase/neon/pull/3352#discussion_r1072806981
2023-01-23 17:46:12 +02:00
Kirill Bulatov
d90c5a03af Add more io::Error context when fail to operate on a path (#3254)
I have a test failure that shows 

```
Caused by:
    0: Failed to reconstruct a page image:
    1: Directory not empty (os error 39)
```

but does not really show where exactly that happens.

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-3227/release/3823785365/index.html#categories/c0057473fc9ec8fb70876fd29a171ce8/7088dab272f2c7b7/?attachment=60fe6ed2add4d82d

The PR aims to add more context in debugging that issue.
2023-01-23 17:46:12 +02:00
Anastasia Lubennikova
2d02cc9079 Merge pull request #3365 from neondatabase/main
Release 2023-01-17
2023-01-17 16:41:34 +02:00
Christian Schwarz
49ad94b99f Merge pull request #3301 from neondatabase/release-2023-01-10
Release 2023-01-10
2023-01-10 16:42:26 +01:00
Christian Schwarz
948a217398 Merge commit '95bf19b85a06b27a7fc3118dee03d48648efab15' into release-2023-01-10
Conflicts:
        .github/helm-values/neon-stress.proxy-scram.yaml
        .github/helm-values/neon-stress.proxy.yaml
        .github/helm-values/staging.proxy-scram.yaml
        .github/helm-values/staging.proxy.yaml
        All of the above were deleted in `main` after we hotfixed them
        in `release. Deleting them here
        storage_broker/src/bin/storage_broker.rs
        Hotfix toned down logging, but `main` has sinced implemented
        a proper fix. Taken `main`'s side, see
        https://neondb.slack.com/archives/C033RQ5SPDH/p1673354385387479?thread_ts=1673354306.474729&cid=C033RQ5SPDH

closes https://github.com/neondatabase/neon/issues/3287
2023-01-10 15:40:14 +01:00
Dmitry Rodionov
125381eae7 Merge pull request #3236 from neondatabase/dkr/retrofit-sk4-sk4-change
Move zenith-1-sk-3 to zenith-1-sk-4 (#3164)
2022-12-30 14:13:50 +03:00
Arthur Petukhovsky
cd01bbc715 Move zenith-1-sk-3 to zenith-1-sk-4 (#3164) 2022-12-30 12:32:52 +02:00
Dmitry Rodionov
d8b5e3b88d Merge pull request #3229 from neondatabase/dkr/add-pageserver-for-release
add pageserver to new region see https://github.com/neondatabase/aws/pull/116

decrease log volume for pageserver
2022-12-30 12:34:04 +03:00
Dmitry Rodionov
06d25f2186 switch to debug from info to produce less noise 2022-12-29 17:48:47 +02:00
Dmitry Rodionov
f759b561f3 add pageserver to new region see https://github.com/neondatabase/aws/pull/116 2022-12-29 17:17:35 +02:00
Sergey Melnikov
ece0555600 Push proxy metrics to Victoria Metrics (#3106) 2022-12-16 14:44:49 +02:00
Joonas Koivunen
73ea0a0b01 fix(remote_storage): use cached credentials (#3128)
IMDSv2 has limits, and if we query it on every s3 interaction we are
going to go over those limits. Changes the s3_bucket client
configuration to use:
- ChainCredentialsProvider to handle env variables or imds usage
- LazyCachingCredentialsProvider to actually cache any credentials

Related: https://github.com/awslabs/aws-sdk-rust/issues/629
Possibly related: https://github.com/neondatabase/neon/issues/3118
2022-12-16 14:44:49 +02:00
Arseny Sher
d8f6d6fd6f Merge pull request #3126 from neondatabase/broker-lb-release
Deploy broker with L4 LB in new env.
2022-12-16 01:25:28 +03:00
Arseny Sher
d24de169a7 Deploy broker with L4 LB in new env.
Seems to be fixing issue with missing keepalives.
2022-12-16 01:45:32 +04:00
Arseny Sher
0816168296 Hotfix: terminate subscription if channel is full.
Might help as a hotfix, but need to understand root better.
2022-12-15 12:23:56 +03:00
Dmitry Rodionov
277b44d57a Merge pull request #3102 from neondatabase/main
Hotfix. See commits for details
2022-12-14 19:38:43 +03:00
MMeent
68c2c3880e Merge pull request #3038 from neondatabase/main
Release 22-12-14
2022-12-14 14:35:47 +01:00
Arthur Petukhovsky
49da498f65 Merge pull request #2833 from neondatabase/main
Release 2022-11-16
2022-11-17 08:44:10 +01:00
Stas Kelvich
2c76ba3dd7 Merge pull request #2718 from neondatabase/main-rc-22-10-28
Release 22-10-28
2022-10-28 20:33:56 +03:00
Arseny Sher
dbe3dc69ad Merge branch 'main' into main-rc-22-10-28
Release 22-10-28.
2022-10-28 19:10:11 +04:00
Arseny Sher
8e5bb3ed49 Enable etcd compaction in neon_local. 2022-10-27 12:53:20 +03:00
Stas Kelvich
ab0be7b8da Avoid debian-testing packages in compute Dockerfiles
plv8 can only be built with a fairly new gold linker version. We used to install
it via binutils packages from testing, but it also updates libc and that causes
troubles in the resulting image as different extensions were built against
different libc versions. We could either use libc from debian-testing everywhere
or restrain from using testing packages and install necessary programs manually.
This patch uses the latter approach: gold for plv8 and cmake for h3 are
installed manually.

In a passing declare h3_postgis as a safe extension (previous omission).
2022-10-27 12:53:20 +03:00
bojanserafimov
b4c55f5d24 Move pagestream api to libs/pageserver_api (#2698) 2022-10-27 12:53:20 +03:00
mikecaat
ede70d833c Add a docker-compose example file (#1943) (#2666)
Co-authored-by: Masahiro Ikeda <masahiro.ikeda.us@hco.ntt.co.jp>
2022-10-27 12:53:20 +03:00
Sergey Melnikov
70c3d18bb0 Do not release to new staging proxies on release (#2685) 2022-10-27 12:53:20 +03:00
bojanserafimov
7a491f52c4 Add draw_timeline binary (#2688) 2022-10-27 12:53:20 +03:00
Alexander Bayandin
323c4ecb4f Add data format backward compatibility tests (#2626) 2022-10-27 12:53:20 +03:00
Anastasia Lubennikova
3d2466607e Merge pull request #2692 from neondatabase/main-rc
Release 2022-10-25
2022-10-25 18:18:58 +03:00
Anastasia Lubennikova
ed478b39f4 Merge branch 'release' into main-rc 2022-10-25 17:06:33 +03:00
Stas Kelvich
91585a558d Merge pull request #2678 from neondatabase/stas/hotfix_schema
Hotfix to disable grant create on public schema
2022-10-22 02:54:31 +03:00
Stas Kelvich
93467eae1f Hotfix to disable grant create on public schema
`GRANT CREATE ON SCHEMA public` fails if there is no schema `public`.
Disable it in release for now and make a better fix later (it is
needed for v15 support).
2022-10-22 02:26:28 +03:00
Stas Kelvich
f3aac81d19 Merge pull request #2668 from neondatabase/main
Release 2022-10-21
2022-10-21 15:21:42 +03:00
Stas Kelvich
979ad60c19 Merge pull request #2581 from neondatabase/main
Release 2022-10-07
2022-10-07 16:50:55 +03:00
Stas Kelvich
9316cb1b1f Merge pull request #2573 from neondatabase/main
Release 2022-10-06
2022-10-07 11:07:06 +03:00
Anastasia Lubennikova
e7939a527a Merge pull request #2377 from neondatabase/main
Release 2022-09-01
2022-09-01 20:20:44 +03:00
Arthur Petukhovsky
36d26665e1 Merge pull request #2299 from neondatabase/main
* Check for entire range during sasl validation (#2281)

* Gen2 GH runner (#2128)

* Re-add rustup override

* Try s3 bucket

* Set git version

* Use v4 cache key to prevent problems

* Switch to v5 for key

* Add second rustup fix

* Rebase

* Add kaniko steps

* Fix typo and set compress level

* Disable global run default

* Specify shell for step

* Change approach with kaniko

* Try less verbose shell spec

* Add submodule pull

* Add promote step

* Adjust dependency chain

* Try default swap again

* Use env

* Don't override aws key

* Make kaniko build conditional

* Specify runs on

* Try without dependency link

* Try soft fail

* Use image with git

* Try passing to next step

* Fix duplicate

* Try other approach

* Try other approach

* Fix typo

* Try other syntax

* Set env

* Adjust setup

* Try step 1

* Add link

* Try global env

* Fix mistake

* Debug

* Try other syntax

* Try other approach

* Change order

* Move output one step down

* Put output up one level

* Try other syntax

* Skip build

* Try output

* Re-enable build

* Try other syntax

* Skip middle step

* Update check

* Try first step of dockerhub push

* Update needs dependency

* Try explicit dir

* Add missing package

* Try other approach

* Try other approach

* Specify region

* Use with

* Try other approach

* Add debug

* Try other approach

* Set region

* Follow AWS example

* Try github approach

* Skip Qemu

* Try stdin

* Missing steps

* Add missing close

* Add echo debug

* Try v2 endpoint

* Use v1 endpoint

* Try without quotes

* Revert

* Try crane

* Add debug

* Split steps

* Fix duplicate

* Add shell step

* Conform to options

* Add verbose flag

* Try single step

* Try workaround

* First request fails hunch

* Try bullseye image

* Try other approach

* Adjust verbose level

* Try previous step

* Add more debug

* Remove debug step

* Remove rogue indent

* Try with larger image

* Add build tag step

* Update workflow for testing

* Add tag step for test

* Remove unused

* Update dependency chain

* Add ownership fix

* Use matrix for promote

* Force update

* Force build

* Remove unused

* Add new image

* Add missing argument

* Update dockerfile copy

* Update Dockerfile

* Update clone

* Update dockerfile

* Go to correct folder

* Use correct format

* Update dockerfile

* Remove cd

* Debug find where we are

* Add debug on first step

* Changedir to postgres

* Set workdir

* Use v1 approach

* Use other dependency

* Try other approach

* Try other approach

* Update dockerfile

* Update approach

* Update dockerfile

* Update approach

* Update dockerfile

* Update dockerfile

* Add workspace hack

* Update Dockerfile

* Update Dockerfile

* Update Dockerfile

* Change last step

* Cleanup pull in prep for review

* Force build images

* Add condition for latest tagging

* Use pinned version

* Try without name value

* Remove more names

* Shorten names

* Add kaniko comments

* Pin kaniko

* Pin crane and ecr helper

* Up one level

* Switch to pinned tag for rust image

* Force update for test

Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@b04468bf-cdf4-41eb-9c94-aff4ca55e4bf.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@Rorys-Mac-Studio.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@4795e9ee-4f32-401f-85f3-f316263b62b8.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@2f8bc4e5-4ec2-4ea2-adb1-65d863c4a558.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@27565b2b-72d5-4742-9898-a26c9033e6f9.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@ecc96c26-c6c4-4664-be6e-34f7c3f89a3c.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@7caff3a5-bf03-4202-bd0e-f1a93c86bdae.fritz.box>

* Add missing step output, revert one deploy step (#2285)

* Add missing step output, revert one deploy step

* Conform to syntax

* Update approach

* Add missing value

* Add missing needs

Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>

* Error for fatal not git repo (#2286)

Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>

* Use main, not branch for ref check (#2288)

* Use main, not branch for ref check

* Add more debug

* Count main, not head

* Try new approach

* Conform to syntax

* Update approach

* Get full history

* Skip checkout

* Cleanup debug

* Remove more debug

Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>

* Fix docker zombie process issue (#2289)

* Fix docker zombie process issue

* Init everywhere

Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>

* Fix 1.63 clippy lints (#2282)

* split out timeline metrics, track layer map loading and size calculation

* reset rust cache for clippy run to avoid an ICE

additionally remove trailing whitespaces

* Rename pg_control_ffi.h to bindgen_deps.h, for clarity.

The pg_control_ffi.h name implies that it only includes stuff related to
pg_control.h. That's mostly true currently, but really the point of the
file is to include everything that we need to generate Rust definitions
from.

* Make local mypy behave like CI mypy (#2291)

* Fix flaky pageserver restarts in tests (#2261)

* Remove extra type aliases (#2280)

* Update cachepot endpoint (#2290)

* Update cachepot endpoint

* Update dockerfile & remove env

* Update image building process

* Cannot use metadata endpoint for this

* Update workflow

* Conform to kaniko syntax

* Update syntax

* Update approach

* Update dockerfiles

* Force update

* Update dockerfiles

* Update dockerfile

* Cleanup dockerfiles

* Update s3 test location

* Revert s3 experiment

* Add more debug

* Specify aws region

* Remove debug, add prefix

* Remove one more debug

Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>

* workflows/benchmarking: increase timeout (#2294)

* Rework `init` in pageserver CLI  (#2272)

* Do not create initial tenant and timeline (adjust Python tests for that)
* Rework config handling during init, add --update-config to manage local config updates

* Fix: Always build images (#2296)

* Always build images

* Remove unused

Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>

* Move auto-generated 'bindings' to a separate inner module.

Re-export only things that are used by other modules.

In the future, I'm imagining that we run bindgen twice, for Postgres
v14 and v15. The two sets of bindings would go into separate
'bindings_v14' and 'bindings_v15' modules.

Rearrange postgres_ffi modules.

Move function, to avoid Postgres version dependency in timelines.rs
Move function to generate a logical-message WAL record to postgres_ffi.

* fix cargo test

* Fix walreceiver and safekeeper bugs (#2295)

- There was an issue with zero commit_lsn `reason: LaggingWal { current_commit_lsn: 0/0, new_commit_lsn: 1/6FD90D38, threshold: 10485760 } }`. The problem was in `send_wal.rs`, where we initialized `end_pos = Lsn(0)` and in some cases sent it to the pageserver.
- IDENTIFY_SYSTEM previously returned `flush_lsn` as a physical end of WAL. Now it returns `flush_lsn` (as it was) to walproposer and `commit_lsn` to everyone else including pageserver.
- There was an issue with backoff where connection was cancelled right after initialization: `connected!` -> `safekeeper_handle_db: Connection cancelled` -> `Backoff: waiting 3 seconds`. The problem was in sleeping before establishing the connection. This is fixed by reworking retry logic.
- There was an issue with getting `NoKeepAlives` reason in a loop. The issue is probably the same as the previous.
- There was an issue with filtering safekeepers based on retry attempts, which could filter some safekeepers indefinetely. This is fixed by using retry cooldown duration instead of retry attempts.
- Some `send_wal.rs` connections failed with errors without context. This is fixed by adding a timeline to safekeepers errors.

New retry logic works like this:
- Every candidate has a `next_retry_at` timestamp and is not considered for connection until that moment
- When walreceiver connection is closed, we update `next_retry_at` using exponential backoff, increasing the cooldown on every disconnect.
- When `last_record_lsn` was advanced using the WAL from the safekeeper, we reset the retry cooldown and exponential backoff, allowing walreceiver to reconnect to the same safekeeper instantly.

* on safekeeper registration pass availability zone param (#2292)

Co-authored-by: Kirill Bulatov <kirill@neon.tech>
Co-authored-by: Rory de Zoete <33318916+zoete@users.noreply.github.com>
Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@b04468bf-cdf4-41eb-9c94-aff4ca55e4bf.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@Rorys-Mac-Studio.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@4795e9ee-4f32-401f-85f3-f316263b62b8.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@2f8bc4e5-4ec2-4ea2-adb1-65d863c4a558.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@27565b2b-72d5-4742-9898-a26c9033e6f9.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@ecc96c26-c6c4-4664-be6e-34f7c3f89a3c.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@7caff3a5-bf03-4202-bd0e-f1a93c86bdae.fritz.box>
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
Co-authored-by: Anton Galitsyn <agalitsyn@users.noreply.github.com>
2022-08-18 15:32:33 +03:00
Arthur Petukhovsky
873347f977 Merge pull request #2275 from neondatabase/main
* github/workflows: Fix git dubious ownership (#2223)

* Move relation size cache from WalIngest to DatadirTimeline (#2094)

* Move relation sie cache to layered timeline

* Fix obtaining current LSN for relation size cache

* Resolve merge conflicts

* Resolve merge conflicts

* Reestore 'lsn' field in DatadirModification

* adjust DatadirModification lsn in ingest_record

* Fix formatting

* Pass lsn to get_relsize

* Fix merge conflict

* Update pageserver/src/pgdatadir_mapping.rs

Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>

* Update pageserver/src/pgdatadir_mapping.rs

Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>

Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>

* refactor: replace lazy-static with once-cell (#2195)

- Replacing all the occurrences of lazy-static with `once-cell::sync::Lazy`
- fixes #1147

Signed-off-by: Ankur Srivastava <best.ankur@gmail.com>

* Add more buckets to pageserver latency metrics (#2225)

* ignore record property warning to fix benchmarks

* increase statement timeout

* use event so it fires only if workload thread successfully finished

* remove debug log

* increase timeout to pass test with real s3

* avoid duplicate parameter, increase timeout

* Major migration script (#2073)

This script can be used to migrate a tenant across breaking storage versions, or (in the future) upgrading postgres versions. See the comment at the top for an overview.

Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>

* Fix etcd typos

* Fix links to safekeeper protocol docs. (#2188)

safekeeper/README_PROTO.md was moved to docs/safekeeper-protocol.md in
commit 0b14fdb078, as part of reorganizing the docs into 'mdbook' format.

Fixes issue #1475. Thanks to @banks for spotting the outdated references.

In addition to fixing the above issue, this patch also fixes other broken links as a result of 0b14fdb078. See https://github.com/neondatabase/neon/pull/2188#pullrequestreview-1055918480.

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: Thang Pham <thang@neon.tech>

* Update CONTRIBUTING.md

* Update CONTRIBUTING.md

* support node id and remote storage params in docker_entrypoint.sh

* Safe truncate (#2218)

* Move relation sie cache to layered timeline

* Fix obtaining current LSN for relation size cache

* Resolve merge conflicts

* Resolve merge conflicts

* Reestore 'lsn' field in DatadirModification

* adjust DatadirModification lsn in ingest_record

* Fix formatting

* Pass lsn to get_relsize

* Fix merge conflict

* Update pageserver/src/pgdatadir_mapping.rs

Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>

* Update pageserver/src/pgdatadir_mapping.rs

Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>

* Check if relation exists before trying to truncat it

refer #1932

* Add test reporducing FSM truncate problem

Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>

* Fix exponential backoff values

* Update back `vendor/postgres` back; it was changed accidentally. (#2251)

Commit 4227cfc96e accidentally reverted vendor/postgres to an older
version. Update it back.

* Add pageserver checkpoint_timeout option.

To flush inmemory layer eventually when no new data arrives, which helps
safekeepers to suspend activity (stop pushing to the broker). Default 10m should
be ok.

* Share exponential backoff code and fix logic for delete task failure (#2252)

* Fix bug when import large (>1GB) relations (#2172)

Resolves #2097 

- use timeline modification's `lsn` and timeline's `last_record_lsn` to determine the corresponding LSN to query data in `DatadirModification::get`
- update `test_import_from_pageserver`. Split the test into 2 variants: `small` and `multisegment`. 
  + `small` is the old test
  + `multisegment` is to simulate #2097 by using a larger number of inserted rows to create multiple segment files of a relation. `multisegment` is configured to only run with a `release` build

* Fix timeline physical size flaky tests (#2244)

Resolves #2212.

- use `wait_for_last_flush_lsn` in `test_timeline_physical_size_*` tests

## Context
Need to wait for the pageserver to catch up with the compute's last flush LSN because during the timeline physical size API call, it's possible that there are running `LayerFlushThread` threads. These threads flush new layers into disk and hence update the physical size. This results in a mismatch between the physical size reported by the API and the actual physical size on disk.

### Note
The `LayerFlushThread` threads are processed **concurrently**, so it's possible that the above error still persists even with this patch. However, making the tests wait to finish processing all the WALs (not flushing) before calculating the physical size should help reduce the "flakiness" significantly

* postgres_ffi/waldecoder: validate more header fields

* postgres_ffi/waldecoder: remove unused startlsn

* postgres_ffi/waldecoder: introduce explicit `enum State`

Previously it was emulated with a combination of nullable fields.
This change should make the logic more readable.

* disable `test_import_from_pageserver_multisegment` (#2258)

This test failed consistently on `main` now. It's better to temporarily disable it to avoid blocking others' PRs while investigating the root cause for the test failure.

See: #2255, #2256

* get_binaries uses DOCKER_TAG taken from docker image build step (#2260)

* [proxy] Rework wire format of the password hack and some errors (#2236)

The new format has a few benefits: it's shorter, simpler and
human-readable as well. We don't use base64 anymore, since
url encoding got us covered.

We also show a better error in case we couldn't parse the
payload; the users should know it's all about passing the
correct project name.

* test_runner/pg_clients: collect docker logs (#2259)

* get_binaries script fix (#2263)

* get_binaries uses DOCKER_TAG taken from docker image build step

* remove docker tag discovery at all and fix get_binaries for version variable

* Better storage sync logs (#2268)

* Find end of WAL on safekeepers using WalStreamDecoder.

We could make it inside wal_storage.rs, but taking into account that
 - wal_storage.rs reading is async
 - we don't need s3 here
 - error handling is different; error during decoding is normal
I decided to put it separately.

Test
cargo test test_find_end_of_wal_last_crossing_segment
prepared earlier by @yeputons passes now.

Fixes https://github.com/neondatabase/neon/issues/544
      https://github.com/neondatabase/cloud/issues/2004
Supersedes https://github.com/neondatabase/neon/pull/2066

* Improve walreceiver logic (#2253)

This patch makes walreceiver logic more complicated, but it should work better in most cases. Added `test_wal_lagging` to test scenarios where alive safekeepers can lag behind other alive safekeepers.

- There was a bug which looks like `etcd_info.timeline.commit_lsn > Some(self.local_timeline.get_last_record_lsn())` filtered all safekeepers in some strange cases. I removed this filter, it should probably help with #2237
- Now walreceiver_connection reports status, including commit_lsn. This allows keeping safekeeper connection even when etcd is down.
- Safekeeper connection now fails if pageserver doesn't receive safekeeper messages for some time. Usually safekeeper sends messages at least once per second.
- `LaggingWal` check now uses `commit_lsn` directly from safekeeper. This fixes the issue with often reconnects, when compute generates WAL really fast.
- `NoWalTimeout` is rewritten to trigger only when we know about the new WAL and the connected safekeeper doesn't stream any WAL. This allows setting a small `lagging_wal_timeout` because it will trigger only when we observe that the connected safekeeper has stuck.

* increase timeout in wait_for_upload to avoid spurious failures when testing with real s3

* Bump vendor/postgres to include XLP_FIRST_IS_CONTRECORD fix. (#2274)

* Set up a workflow to run pgbench against captest (#2077)

Signed-off-by: Ankur Srivastava <best.ankur@gmail.com>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@garret.ru>
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
Co-authored-by: Ankur Srivastava <ansrivas@users.noreply.github.com>
Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
Co-authored-by: Kirill Bulatov <kirill@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: Thang Pham <thang@neon.tech>
Co-authored-by: Stas Kelvich <stas.kelvich@gmail.com>
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
Co-authored-by: Egor Suvorov <egor@neon.tech>
Co-authored-by: Andrey Taranik <andrey@cicd.team>
Co-authored-by: Dmitry Ivanov <ivadmi5@gmail.com>
2022-08-15 21:30:45 +03:00
Arthur Petukhovsky
e814ac16f9 Merge pull request #2219 from neondatabase/main
Release 2022-08-04
2022-08-04 20:06:34 +03:00
Heikki Linnakangas
ad3055d386 Merge pull request #2203 from neondatabase/release-uuid-ossp
Deploy new storage and compute version to production

Release 2022-08-02
2022-08-02 15:08:14 +03:00
Heikki Linnakangas
94e03eb452 Merge remote-tracking branch 'origin/main' into 'release'
Release 2022-08-01
2022-08-02 12:43:49 +03:00
Sergey Melnikov
380f26ef79 Merge pull request #2170 from neondatabase/main (Release 2022-07-28)
Release 2022-07-28
2022-07-28 14:16:52 +03:00
Arthur Petukhovsky
3c5b7f59d7 Merge pull request #2119 from neondatabase/main
Release 2022-07-19
2022-07-19 11:58:48 +03:00
Arthur Petukhovsky
fee89f80b5 Merge pull request #2115 from neondatabase/main-2022-07-18
Release 2022-07-18
2022-07-18 19:21:11 +03:00
Arthur Petukhovsky
41cce8eaf1 Merge remote-tracking branch 'origin/release' into main-2022-07-18 2022-07-18 18:21:20 +03:00
Alexey Kondratov
f88fe0218d Merge pull request #1842 from neondatabase/release-deploy-hotfix
[HOTFIX] Release deploy fix

This PR uses this branch neondatabase/postgres#171 and several required commits from the main to use only locally built compute-tools. This should allow us to rollout safekeepers sync issue fix on prod
2022-06-01 11:04:30 +03:00
Alexey Kondratov
cc856eca85 Install missing openssl packages in the Github Actions workflow 2022-05-31 21:31:31 +02:00
Alexey Kondratov
cf350c6002 Use :local compute-tools tag to build compute-node image 2022-05-31 21:31:16 +02:00
Arseny Sher
0ce6b6a0a3 Merge pull request #1836 from neondatabase/release-hotfix-basebackup-lsn-page-boundary
Bump vendor/postgres to hotfix basebackup LSN comparison.
2022-05-31 16:54:03 +04:00
Arseny Sher
73f247d537 Bump vendor/postgres to hotfix basebackup LSN comparison. 2022-05-31 16:00:50 +04:00
Andrey Taranik
960be82183 Merge pull request #1792 from neondatabase/main
Release 2202-05-25 (second)
2022-05-25 16:37:57 +03:00
Andrey Taranik
806e5a6c19 Merge pull request #1787 from neondatabase/main
Release 2022-05-25
2022-05-25 13:34:11 +03:00
Alexey Kondratov
8d5df07cce Merge pull request #1385 from zenithdb/main
Release main 2022-03-22
2022-03-22 05:04:34 -05:00
Andrey Taranik
df7a9d1407 release fix 2022-03-16 (#1375) 2022-03-17 00:43:28 +03:00
229 changed files with 8268 additions and 10409 deletions

View File

@@ -31,7 +31,7 @@ config-variables:
- NEON_PROD_AWS_ACCOUNT_ID
- PGREGRESS_PG16_PROJECT_ID
- PGREGRESS_PG17_PROJECT_ID
- PREWARM_PGBENCH_SIZE
- PREWARM_PROJECT_ID
- REMOTE_STORAGE_AZURE_CONTAINER
- REMOTE_STORAGE_AZURE_REGION
- SLACK_CICD_CHANNEL_ID

384
.github/workflows/benchbase_tpcc.yml vendored Normal file
View File

@@ -0,0 +1,384 @@
name: TPC-C like benchmark using benchbase
on:
schedule:
# * is a special character in YAML so you have to quote this string
# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of the month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12 or JAN-DEC)
# │ │ │ │ ┌───────────── day of the week (0 - 6 or SUN-SAT)
- cron: '0 6 * * *' # run once a day at 6 AM UTC
workflow_dispatch: # adds ability to run this manually
defaults:
run:
shell: bash -euxo pipefail {0}
concurrency:
# Allow only one workflow globally because we do not want to be too noisy in production environment
group: benchbase-tpcc-workflow
cancel-in-progress: false
permissions:
contents: read
jobs:
benchbase-tpcc:
strategy:
fail-fast: false # allow other variants to continue even if one fails
matrix:
include:
- warehouses: 50 # defines number of warehouses and is used to compute number of terminals
max_rate: 800 # measured max TPS at scale factor based on experiments. Adjust if performance is better/worse
min_cu: 0.25 # simulate free tier plan (0.25 -2 CU)
max_cu: 2
- warehouses: 500 # serverless plan (2-8 CU)
max_rate: 2000
min_cu: 2
max_cu: 8
- warehouses: 1000 # business plan (2-16 CU)
max_rate: 2900
min_cu: 2
max_cu: 16
max-parallel: 1 # we want to run each workload size sequentially to avoid noisy neighbors
permissions:
contents: write
statuses: write
id-token: write # aws-actions/configure-aws-credentials
env:
PG_CONFIG: /tmp/neon/pg_install/v17/bin/pg_config
PSQL: /tmp/neon/pg_install/v17/bin/psql
PG_17_LIB_PATH: /tmp/neon/pg_install/v17/lib
POSTGRES_VERSION: 17
runs-on: [ self-hosted, us-east-2, x64 ]
timeout-minutes: 1440
steps:
- name: Harden the runner (Audit all outbound calls)
uses: step-security/harden-runner@4d991eb9b905ef189e4c376166672c3f2f230481 # v2.11.0
with:
egress-policy: audit
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Configure AWS credentials # necessary to download artefacts
uses: aws-actions/configure-aws-credentials@e3dd6a429d7300a6a4c196c26e071d42e0343502 # v4.0.2
with:
aws-region: eu-central-1
role-to-assume: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
role-duration-seconds: 18000 # 5 hours is currently max associated with IAM role
- name: Download Neon artifact
uses: ./.github/actions/download
with:
name: neon-${{ runner.os }}-${{ runner.arch }}-release-artifact
path: /tmp/neon/
prefix: latest
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
- name: Create Neon Project
id: create-neon-project-tpcc
uses: ./.github/actions/neon-project-create
with:
region_id: aws-us-east-2
postgres_version: ${{ env.POSTGRES_VERSION }}
compute_units: '[${{ matrix.min_cu }}, ${{ matrix.max_cu }}]'
api_key: ${{ secrets.NEON_PRODUCTION_API_KEY_4_BENCHMARKS }}
api_host: console.neon.tech # production (!)
- name: Initialize Neon project
env:
BENCHMARK_TPCC_CONNSTR: ${{ steps.create-neon-project-tpcc.outputs.dsn }}
PROJECT_ID: ${{ steps.create-neon-project-tpcc.outputs.project_id }}
run: |
echo "Initializing Neon project with project_id: ${PROJECT_ID}"
export LD_LIBRARY_PATH=${PG_17_LIB_PATH}
# Retry logic for psql connection with 1 minute sleep between attempts
for attempt in {1..3}; do
echo "Attempt ${attempt}/3: Creating extensions in Neon project"
if ${PSQL} "${BENCHMARK_TPCC_CONNSTR}" -c "CREATE EXTENSION IF NOT EXISTS neon; CREATE EXTENSION IF NOT EXISTS neon_utils;"; then
echo "Successfully created extensions"
break
else
echo "Failed to create extensions on attempt ${attempt}"
if [ ${attempt} -lt 3 ]; then
echo "Waiting 60 seconds before retry..."
sleep 60
else
echo "All attempts failed, exiting"
exit 1
fi
fi
done
echo "BENCHMARK_TPCC_CONNSTR=${BENCHMARK_TPCC_CONNSTR}" >> $GITHUB_ENV
- name: Generate BenchBase workload configuration
env:
WAREHOUSES: ${{ matrix.warehouses }}
MAX_RATE: ${{ matrix.max_rate }}
run: |
echo "Generating BenchBase configs for warehouses: ${WAREHOUSES}, max_rate: ${MAX_RATE}"
# Extract hostname and password from connection string
# Format: postgresql://username:password@hostname/database?params (no port for Neon)
HOSTNAME=$(echo "${BENCHMARK_TPCC_CONNSTR}" | sed -n 's|.*://[^:]*:[^@]*@\([^/]*\)/.*|\1|p')
PASSWORD=$(echo "${BENCHMARK_TPCC_CONNSTR}" | sed -n 's|.*://[^:]*:\([^@]*\)@.*|\1|p')
echo "Extracted hostname: ${HOSTNAME}"
# Use runner temp (NVMe) as working directory
cd "${RUNNER_TEMP}"
# Copy the generator script
cp "${GITHUB_WORKSPACE}/test_runner/performance/benchbase_tpc_c_helpers/generate_workload_size.py" .
# Generate configs and scripts
python3 generate_workload_size.py \
--warehouses ${WAREHOUSES} \
--max-rate ${MAX_RATE} \
--hostname ${HOSTNAME} \
--password ${PASSWORD} \
--runner-arch ${{ runner.arch }}
# Fix path mismatch: move generated configs and scripts to expected locations
mv ../configs ./configs
mv ../scripts ./scripts
- name: Prepare database (load data)
env:
WAREHOUSES: ${{ matrix.warehouses }}
run: |
cd "${RUNNER_TEMP}"
echo "Loading ${WAREHOUSES} warehouses into database..."
# Run the loader script and capture output to log file while preserving stdout/stderr
./scripts/load_${WAREHOUSES}_warehouses.sh 2>&1 | tee "load_${WAREHOUSES}_warehouses.log"
echo "Database loading completed"
- name: Run TPC-C benchmark (warmup phase, then benchmark at 70% of configuredmax TPS)
env:
WAREHOUSES: ${{ matrix.warehouses }}
run: |
cd "${RUNNER_TEMP}"
echo "Running TPC-C benchmark with ${WAREHOUSES} warehouses..."
# Run the optimal rate benchmark
./scripts/execute_${WAREHOUSES}_warehouses_opt_rate.sh
echo "Benchmark execution completed"
- name: Run TPC-C benchmark (warmup phase, then ramp down TPS and up again in 5 minute intervals)
env:
WAREHOUSES: ${{ matrix.warehouses }}
run: |
cd "${RUNNER_TEMP}"
echo "Running TPC-C ramp-down-up with ${WAREHOUSES} warehouses..."
# Run the optimal rate benchmark
./scripts/execute_${WAREHOUSES}_warehouses_ramp_up.sh
echo "Benchmark execution completed"
- name: Process results (upload to test results database and generate diagrams)
env:
WAREHOUSES: ${{ matrix.warehouses }}
MIN_CU: ${{ matrix.min_cu }}
MAX_CU: ${{ matrix.max_cu }}
PROJECT_ID: ${{ steps.create-neon-project-tpcc.outputs.project_id }}
REVISION: ${{ github.sha }}
PERF_DB_CONNSTR: ${{ secrets.PERF_TEST_RESULT_CONNSTR }}
run: |
cd "${RUNNER_TEMP}"
echo "Creating temporary Python environment for results processing..."
# Create temporary virtual environment
python3 -m venv temp_results_env
source temp_results_env/bin/activate
# Install required packages in virtual environment
pip install matplotlib pandas psycopg2-binary
echo "Copying results processing scripts..."
# Copy both processing scripts
cp "${GITHUB_WORKSPACE}/test_runner/performance/benchbase_tpc_c_helpers/generate_diagrams.py" .
cp "${GITHUB_WORKSPACE}/test_runner/performance/benchbase_tpc_c_helpers/upload_results_to_perf_test_results.py" .
echo "Processing load phase metrics..."
# Find and process load log
LOAD_LOG=$(find . -name "load_${WAREHOUSES}_warehouses.log" -type f | head -1)
if [ -n "$LOAD_LOG" ]; then
echo "Processing load metrics from: $LOAD_LOG"
python upload_results_to_perf_test_results.py \
--load-log "$LOAD_LOG" \
--run-type "load" \
--warehouses "${WAREHOUSES}" \
--min-cu "${MIN_CU}" \
--max-cu "${MAX_CU}" \
--project-id "${PROJECT_ID}" \
--revision "${REVISION}" \
--connection-string "${PERF_DB_CONNSTR}"
else
echo "Warning: Load log file not found: load_${WAREHOUSES}_warehouses.log"
fi
echo "Processing warmup results for optimal rate..."
# Find and process warmup results
WARMUP_CSV=$(find results_warmup -name "*.results.csv" -type f | head -1)
WARMUP_JSON=$(find results_warmup -name "*.summary.json" -type f | head -1)
if [ -n "$WARMUP_CSV" ] && [ -n "$WARMUP_JSON" ]; then
echo "Generating warmup diagram from: $WARMUP_CSV"
python generate_diagrams.py \
--input-csv "$WARMUP_CSV" \
--output-svg "warmup_${WAREHOUSES}_warehouses_performance.svg" \
--title-suffix "Warmup at max TPS"
echo "Uploading warmup metrics from: $WARMUP_JSON"
python upload_results_to_perf_test_results.py \
--summary-json "$WARMUP_JSON" \
--results-csv "$WARMUP_CSV" \
--run-type "warmup" \
--min-cu "${MIN_CU}" \
--max-cu "${MAX_CU}" \
--project-id "${PROJECT_ID}" \
--revision "${REVISION}" \
--connection-string "${PERF_DB_CONNSTR}"
else
echo "Warning: Missing warmup results files (CSV: $WARMUP_CSV, JSON: $WARMUP_JSON)"
fi
echo "Processing optimal rate results..."
# Find and process optimal rate results
OPTRATE_CSV=$(find results_opt_rate -name "*.results.csv" -type f | head -1)
OPTRATE_JSON=$(find results_opt_rate -name "*.summary.json" -type f | head -1)
if [ -n "$OPTRATE_CSV" ] && [ -n "$OPTRATE_JSON" ]; then
echo "Generating optimal rate diagram from: $OPTRATE_CSV"
python generate_diagrams.py \
--input-csv "$OPTRATE_CSV" \
--output-svg "benchmark_${WAREHOUSES}_warehouses_performance.svg" \
--title-suffix "70% of max TPS"
echo "Uploading optimal rate metrics from: $OPTRATE_JSON"
python upload_results_to_perf_test_results.py \
--summary-json "$OPTRATE_JSON" \
--results-csv "$OPTRATE_CSV" \
--run-type "opt-rate" \
--min-cu "${MIN_CU}" \
--max-cu "${MAX_CU}" \
--project-id "${PROJECT_ID}" \
--revision "${REVISION}" \
--connection-string "${PERF_DB_CONNSTR}"
else
echo "Warning: Missing optimal rate results files (CSV: $OPTRATE_CSV, JSON: $OPTRATE_JSON)"
fi
echo "Processing warmup 2 results for ramp down/up phase..."
# Find and process warmup results
WARMUP_CSV=$(find results_warmup -name "*.results.csv" -type f | tail -1)
WARMUP_JSON=$(find results_warmup -name "*.summary.json" -type f | tail -1)
if [ -n "$WARMUP_CSV" ] && [ -n "$WARMUP_JSON" ]; then
echo "Generating warmup diagram from: $WARMUP_CSV"
python generate_diagrams.py \
--input-csv "$WARMUP_CSV" \
--output-svg "warmup_2_${WAREHOUSES}_warehouses_performance.svg" \
--title-suffix "Warmup at max TPS"
echo "Uploading warmup metrics from: $WARMUP_JSON"
python upload_results_to_perf_test_results.py \
--summary-json "$WARMUP_JSON" \
--results-csv "$WARMUP_CSV" \
--run-type "warmup" \
--min-cu "${MIN_CU}" \
--max-cu "${MAX_CU}" \
--project-id "${PROJECT_ID}" \
--revision "${REVISION}" \
--connection-string "${PERF_DB_CONNSTR}"
else
echo "Warning: Missing warmup results files (CSV: $WARMUP_CSV, JSON: $WARMUP_JSON)"
fi
echo "Processing ramp results..."
# Find and process ramp results
RAMPUP_CSV=$(find results_ramp_up -name "*.results.csv" -type f | head -1)
RAMPUP_JSON=$(find results_ramp_up -name "*.summary.json" -type f | head -1)
if [ -n "$RAMPUP_CSV" ] && [ -n "$RAMPUP_JSON" ]; then
echo "Generating ramp diagram from: $RAMPUP_CSV"
python generate_diagrams.py \
--input-csv "$RAMPUP_CSV" \
--output-svg "ramp_${WAREHOUSES}_warehouses_performance.svg" \
--title-suffix "ramp TPS down and up in 5 minute intervals"
echo "Uploading ramp metrics from: $RAMPUP_JSON"
python upload_results_to_perf_test_results.py \
--summary-json "$RAMPUP_JSON" \
--results-csv "$RAMPUP_CSV" \
--run-type "ramp-up" \
--min-cu "${MIN_CU}" \
--max-cu "${MAX_CU}" \
--project-id "${PROJECT_ID}" \
--revision "${REVISION}" \
--connection-string "${PERF_DB_CONNSTR}"
else
echo "Warning: Missing ramp results files (CSV: $RAMPUP_CSV, JSON: $RAMPUP_JSON)"
fi
# Deactivate and clean up virtual environment
deactivate
rm -rf temp_results_env
rm upload_results_to_perf_test_results.py
echo "Results processing completed and environment cleaned up"
- name: Set date for upload
id: set-date
run: echo "date=$(date +%Y-%m-%d)" >> $GITHUB_OUTPUT
- name: Configure AWS credentials # necessary to upload results
uses: aws-actions/configure-aws-credentials@e3dd6a429d7300a6a4c196c26e071d42e0343502 # v4.0.2
with:
aws-region: us-east-2
role-to-assume: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
role-duration-seconds: 900 # 900 is minimum value
- name: Upload benchmark results to S3
env:
S3_BUCKET: neon-public-benchmark-results
S3_PREFIX: benchbase-tpc-c/${{ steps.set-date.outputs.date }}/${{ github.run_id }}/${{ matrix.warehouses }}-warehouses
run: |
echo "Redacting passwords from configuration files before upload..."
# Mask all passwords in XML config files
find "${RUNNER_TEMP}/configs" -name "*.xml" -type f -exec sed -i 's|<password>[^<]*</password>|<password>redacted</password>|g' {} \;
echo "Uploading benchmark results to s3://${S3_BUCKET}/${S3_PREFIX}/"
# Upload the entire benchmark directory recursively
aws s3 cp --only-show-errors --recursive "${RUNNER_TEMP}" s3://${S3_BUCKET}/${S3_PREFIX}/
echo "Upload completed"
- name: Delete Neon Project
if: ${{ always() }}
uses: ./.github/actions/neon-project-delete
with:
project_id: ${{ steps.create-neon-project-tpcc.outputs.project_id }}
api_key: ${{ secrets.NEON_PRODUCTION_API_KEY_4_BENCHMARKS }}
api_host: console.neon.tech # production (!)

View File

@@ -418,7 +418,7 @@ jobs:
statuses: write
id-token: write # aws-actions/configure-aws-credentials
env:
PGBENCH_SIZE: ${{ vars.PREWARM_PGBENCH_SIZE }}
PROJECT_ID: ${{ vars.PREWARM_PROJECT_ID }}
POSTGRES_DISTRIB_DIR: /tmp/neon/pg_install
DEFAULT_PG_VERSION: 17
TEST_OUTPUT: /tmp/test_output

View File

@@ -48,8 +48,20 @@ jobs:
uses: ./.github/workflows/build-build-tools-image.yml
secrets: inherit
generate-ch-tmppw:
runs-on: ubuntu-22.04
outputs:
tmp_val: ${{ steps.pwgen.outputs.tmp_val }}
steps:
- name: Generate a random password
id: pwgen
run: |
set +x
p=$(dd if=/dev/random bs=14 count=1 2>/dev/null | base64)
echo tmp_val="${p//\//}" >> "${GITHUB_OUTPUT}"
test-logical-replication:
needs: [ build-build-tools-image ]
needs: [ build-build-tools-image, generate-ch-tmppw ]
runs-on: ubuntu-22.04
container:
@@ -60,16 +72,20 @@ jobs:
options: --init --user root
services:
clickhouse:
image: clickhouse/clickhouse-server:24.6.3.64
image: clickhouse/clickhouse-server:24.8
env:
CLICKHOUSE_PASSWORD: ${{ needs.generate-ch-tmppw.outputs.tmp_val }}
ports:
- 9000:9000
- 8123:8123
zookeeper:
image: quay.io/debezium/zookeeper:2.7
image: quay.io/debezium/zookeeper:3.1.3.Final
ports:
- 2181:2181
- 2888:2888
- 3888:3888
kafka:
image: quay.io/debezium/kafka:2.7
image: quay.io/debezium/kafka:3.1.3.Final
env:
ZOOKEEPER_CONNECT: "zookeeper:2181"
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
@@ -79,7 +95,7 @@ jobs:
ports:
- 9092:9092
debezium:
image: quay.io/debezium/connect:2.7
image: quay.io/debezium/connect:3.1.3.Final
env:
BOOTSTRAP_SERVERS: kafka:9092
GROUP_ID: 1
@@ -125,6 +141,7 @@ jobs:
aws-oidc-role-arn: ${{ vars.DEV_AWS_OIDC_ROLE_ARN }}
env:
BENCHMARK_CONNSTR: ${{ steps.create-neon-project.outputs.dsn }}
CLICKHOUSE_PASSWORD: ${{ needs.generate-ch-tmppw.outputs.tmp_val }}
- name: Delete Neon Project
if: always()

View File

@@ -3,7 +3,7 @@ name: Periodic proxy performance test on unit-perf-aws-arm runners
on:
push: # TODO: remove after testing
branches:
- test-proxy-bench # Runs on pushes to branches starting with test-proxy-bench
- test-proxy-bench # Runs on pushes to test-proxy-bench branch
# schedule:
# * is a special character in YAML so you have to quote this string
# ┌───────────── minute (0 - 59)
@@ -32,7 +32,7 @@ jobs:
statuses: write
contents: write
pull-requests: write
runs-on: [self-hosted, unit-perf-aws-arm]
runs-on: [ self-hosted, unit-perf-aws-arm ]
timeout-minutes: 60 # 1h timeout
container:
image: ghcr.io/neondatabase/build-tools:pinned-bookworm
@@ -55,30 +55,58 @@ jobs:
{
echo "PROXY_BENCH_PATH=$PROXY_BENCH_PATH"
echo "NEON_DIR=${RUNNER_TEMP}/neon"
echo "NEON_PROXY_PATH=${RUNNER_TEMP}/neon/bin/proxy"
echo "TEST_OUTPUT=${PROXY_BENCH_PATH}/test_output"
echo ""
} >> "$GITHUB_ENV"
- name: Run proxy-bench
run: ${PROXY_BENCH_PATH}/run.sh
- name: Cache poetry deps
uses: actions/cache@v4
with:
path: ~/.cache/pypoetry/virtualenvs
key: v2-${{ runner.os }}-${{ runner.arch }}-python-deps-bookworm-${{ hashFiles('poetry.lock') }}
- name: Ingest Bench Results # neon repo script
- name: Install Python deps
shell: bash -euxo pipefail {0}
run: ./scripts/pysync
- name: show ulimits
shell: bash -euxo pipefail {0}
run: |
ulimit -a
- name: Run proxy-bench
working-directory: ${{ env.PROXY_BENCH_PATH }}
run: ./run.sh --with-grafana --bare-metal
- name: Ingest Bench Results
if: always()
working-directory: ${{ env.NEON_DIR }}
run: |
mkdir -p $TEST_OUTPUT
python $NEON_DIR/scripts/proxy_bench_results_ingest.py --out $TEST_OUTPUT
- name: Push Metrics to Proxy perf database
shell: bash -euxo pipefail {0}
if: always()
env:
PERF_TEST_RESULT_CONNSTR: "${{ secrets.PROXY_TEST_RESULT_CONNSTR }}"
REPORT_FROM: $TEST_OUTPUT
working-directory: ${{ env.NEON_DIR }}
run: $NEON_DIR/scripts/generate_and_push_perf_report.sh
- name: Docker cleanup
if: always()
run: docker compose down
- name: Notify Failure
if: failure()
run: echo "Proxy bench job failed" && exit 1
run: echo "Proxy bench job failed" && exit 1
- name: Cleanup Test Resources
if: always()
shell: bash -euxo pipefail {0}
run: |
# Cleanup the test resources
if [[ -d "${TEST_OUTPUT}" ]]; then
rm -rf ${TEST_OUTPUT}
fi
if [[ -d "${PROXY_BENCH_PATH}/test_output" ]]; then
rm -rf ${PROXY_BENCH_PATH}/test_output
fi

1
.gitignore vendored
View File

@@ -15,7 +15,6 @@ neon.iml
/.neon
/integration_tests/.neon
compaction-suite-results.*
pgxn/neon/communicator/communicator_bindings.h
docker-compose/docker-compose-parallel.yml
# Coverage

315
Cargo.lock generated
View File

@@ -211,11 +211,11 @@ dependencies = [
[[package]]
name = "async-lock"
version = "3.2.0"
version = "3.4.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7125e42787d53db9dd54261812ef17e937c95a51e4d291373b670342fa44310c"
checksum = "ff6e472cdea888a4bd64f342f09b3f50e1886d32afe8df3d663c01140b811b18"
dependencies = [
"event-listener 4.0.0",
"event-listener 5.4.0",
"event-listener-strategy",
"pin-project-lite",
]
@@ -259,17 +259,6 @@ version = "1.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a8ab6b55fe97976e46f91ddbed8d147d966475dc29b2032757ba47e02376fbc3"
[[package]]
name = "atomic_enum"
version = "0.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "99e1aca718ea7b89985790c94aad72d77533063fe00bc497bb79a7c2dae6a661"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.100",
]
[[package]]
name = "autocfg"
version = "1.1.0"
@@ -1305,32 +1294,15 @@ dependencies = [
[[package]]
name = "communicator"
version = "0.0.0"
version = "0.1.0"
dependencies = [
"atomic_enum",
"axum",
"bytes",
"cbindgen",
"clashmap",
"http 1.3.1",
"libc",
"measured",
"metrics",
"neon-shmem",
"nix 0.30.1",
"pageserver_api",
"pageserver_client_grpc",
"pageserver_page_api",
"prometheus",
"prost 0.13.5",
"strum_macros",
"thiserror 1.0.69",
"tokio",
"tokio-pipe",
"tonic",
"tracing",
"tracing-subscriber",
"uring-common",
"utils",
"workspace_hack",
]
@@ -1416,6 +1388,7 @@ dependencies = [
"tower-http",
"tower-otel",
"tracing",
"tracing-appender",
"tracing-opentelemetry",
"tracing-subscriber",
"tracing-utils",
@@ -1431,9 +1404,9 @@ dependencies = [
[[package]]
name = "concurrent-queue"
version = "2.3.0"
version = "2.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f057a694a54f12365049b0958a1685bb52d567f5593b355fbf685838e873d400"
checksum = "4ca0197aee26d1ae37445ee532fefce43251d24cc7c166799f4d46817f1d3973"
dependencies = [
"crossbeam-utils",
]
@@ -1670,9 +1643,9 @@ dependencies = [
[[package]]
name = "crossbeam-utils"
version = "0.8.21"
version = "0.8.19"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d0a5c400df2834b80a4c3327b3aad3a4c4cd4de0629063962b03235697506a28"
checksum = "248e3bacc7dc6baa3b21e405ee045c3047101a49145e7e9eca583ab4c2ca5345"
[[package]]
name = "crossterm"
@@ -2259,9 +2232,9 @@ checksum = "0206175f82b8d6bf6652ff7d71a1e27fd2e4efde587fd368662814d6ec1d9ce0"
[[package]]
name = "event-listener"
version = "4.0.0"
version = "5.4.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "770d968249b5d99410d61f5bf89057f3199a077a04d087092f58e7d10692baae"
checksum = "3492acde4c3fc54c845eaab3eed8bd00c7a7d881f78bfc801e43a93dec1331ae"
dependencies = [
"concurrent-queue",
"parking",
@@ -2270,11 +2243,11 @@ dependencies = [
[[package]]
name = "event-listener-strategy"
version = "0.4.0"
version = "0.5.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "958e4d70b6d5e81971bebec42271ec641e7ff4e170a6fa605f2b8a8b65cb97d3"
checksum = "8be9f3dfaaffdae2972880079a491a1a8bb7cbed0b8dd7a347f668b4150a3b93"
dependencies = [
"event-listener 4.0.0",
"event-listener 5.4.0",
"pin-project-lite",
]
@@ -2388,12 +2361,6 @@ version = "1.0.7"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3f9eec918d3f24069decb9af1554cad7c880e2da24a9afd88aca000531ab82c1"
[[package]]
name = "foldhash"
version = "0.1.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d9c4f5dac5e15c24eb999c26181a6ca40b39fe946cbe4c263c7209467bc83af2"
[[package]]
name = "form_urlencoded"
version = "1.2.1"
@@ -2549,6 +2516,20 @@ version = "0.4.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "304de19db7028420975a296ab0fcbbc8e69438c4ed254a1e41e2a7f37d5f0e0a"
[[package]]
name = "generator"
version = "0.8.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d18470a76cb7f8ff746cf1f7470914f900252ec36bbc40b569d74b1258446827"
dependencies = [
"cc",
"cfg-if",
"libc",
"log",
"rustversion",
"windows 0.61.3",
]
[[package]]
name = "generic-array"
version = "0.14.7"
@@ -2761,16 +2742,6 @@ version = "0.15.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "bf151400ff0baff5465007dd2f3e717f3fe502074ca563069ce3a6629d07b289"
[[package]]
name = "hashbrown"
version = "0.15.4"
source = "git+https://github.com/quantumish/hashbrown.git?rev=6610e6d#6610e6d2b1f288ef7b0709a3efefbc846395dc5e"
dependencies = [
"allocator-api2",
"equivalent",
"foldhash",
]
[[package]]
name = "hashlink"
version = "0.9.1"
@@ -2877,7 +2848,7 @@ checksum = "f9c7c7c8ac16c798734b8a24560c1362120597c40d5e1459f09498f8f6c8f2ba"
dependencies = [
"cfg-if",
"libc",
"windows",
"windows 0.52.0",
]
[[package]]
@@ -3148,7 +3119,7 @@ dependencies = [
"iana-time-zone-haiku",
"js-sys",
"wasm-bindgen",
"windows-core",
"windows-core 0.52.0",
]
[[package]]
@@ -3699,6 +3670,19 @@ version = "0.4.26"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "30bde2b3dc3671ae49d8e2e9f044c7c005836e7a023ee57cffa25ab82764bb9e"
[[package]]
name = "loom"
version = "0.7.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "419e0dc8046cb947daa77eb95ae174acfbddb7673b4151f56d1eed8e93fbfaca"
dependencies = [
"cfg-if",
"generator",
"scoped-tls",
"tracing",
"tracing-subscriber",
]
[[package]]
name = "lru"
version = "0.12.3"
@@ -3838,7 +3822,7 @@ dependencies = [
"prometheus",
"rand 0.9.1",
"rand_distr",
"twox-hash 1.6.3",
"twox-hash",
]
[[package]]
@@ -3915,6 +3899,25 @@ dependencies = [
"windows-sys 0.52.0",
]
[[package]]
name = "moka"
version = "0.12.10"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a9321642ca94a4282428e6ea4af8cc2ca4eac48ac7a6a4ea8f33f76d0ce70926"
dependencies = [
"crossbeam-channel",
"crossbeam-epoch",
"crossbeam-utils",
"loom",
"parking_lot 0.12.1",
"portable-atomic",
"rustc_version",
"smallvec",
"tagptr",
"thiserror 1.0.69",
"uuid",
]
[[package]]
name = "multimap"
version = "0.8.3"
@@ -3925,21 +3928,15 @@ checksum = "e5ce46fe64a9d73be07dcbe690a38ce1b293be448fd8ce1e6c1b8062c9f72c6a"
name = "neon-shmem"
version = "0.1.0"
dependencies = [
"ahash",
"criterion",
"hashbrown 0.15.4",
"libc",
"lock_api",
"nix 0.30.1",
"rand 0.9.1",
"rand_distr",
"rustc-hash 2.1.1",
"seahash",
"tempfile",
"thiserror 1.0.69",
"twox-hash 2.1.1",
"workspace_hack",
"xxhash-rust",
]
[[package]]
@@ -4394,16 +4391,13 @@ version = "0.1.0"
dependencies = [
"anyhow",
"async-trait",
"axum",
"bytes",
"camino",
"clap",
"futures",
"hdrhistogram",
"http 1.3.1",
"humantime",
"humantime-serde",
"metrics",
"pageserver_api",
"pageserver_client",
"pageserver_client_grpc",
@@ -4493,7 +4487,6 @@ dependencies = [
"pageserver_client",
"pageserver_compaction",
"pageserver_page_api",
"peekable",
"pem",
"pin-project-lite",
"postgres-protocol",
@@ -4507,7 +4500,6 @@ dependencies = [
"pprof",
"pq_proto",
"procfs",
"prost 0.13.5",
"rand 0.9.1",
"range-set-blaze",
"regex",
@@ -4544,7 +4536,7 @@ dependencies = [
"tower 0.5.2",
"tracing",
"tracing-utils",
"twox-hash 1.6.3",
"twox-hash",
"url",
"utils",
"uuid",
@@ -4756,7 +4748,7 @@ dependencies = [
"paste",
"seq-macro",
"thrift",
"twox-hash 1.6.3",
"twox-hash",
"zstd",
"zstd-sys",
]
@@ -4802,15 +4794,6 @@ dependencies = [
"sha2",
]
[[package]]
name = "peekable"
version = "0.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "225f9651e475709164f871dc2f5724956be59cb9edb055372ffeeab01ec2d20b"
dependencies = [
"smallvec",
]
[[package]]
name = "pem"
version = "3.0.3"
@@ -5448,7 +5431,6 @@ dependencies = [
"futures",
"gettid",
"hashbrown 0.14.5",
"hashlink",
"hex",
"hmac",
"hostname",
@@ -5470,6 +5452,7 @@ dependencies = [
"lasso",
"measured",
"metrics",
"moka",
"once_cell",
"opentelemetry",
"ouroboros",
@@ -6483,6 +6466,12 @@ dependencies = [
"pin-project-lite",
]
[[package]]
name = "scoped-tls"
version = "1.0.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e1cf6437eb19a8f4a6cc0f7dca544973b0b78843adbfeb3683d1a94a0024a294"
[[package]]
name = "scopeguard"
version = "1.1.0"
@@ -6505,12 +6494,6 @@ version = "0.4.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "621e3680f3e07db4c9c2c3fb07c6223ab2fab2e54bd3c04c3ae037990f428c32"
[[package]]
name = "seahash"
version = "4.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1c107b6f4780854c8b126e228ea8869f4d7b71260f962fefb57b996b8959ba6b"
[[package]]
name = "sec1"
version = "0.3.0"
@@ -7338,6 +7321,12 @@ dependencies = [
"winapi",
]
[[package]]
name = "tagptr"
version = "0.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7b2093cf4c8eb1e67749a6762251bc9cd836b6fc171623bd0a9d324d37af2417"
[[package]]
name = "tar"
version = "0.4.40"
@@ -7658,16 +7647,6 @@ dependencies = [
"syn 2.0.100",
]
[[package]]
name = "tokio-pipe"
version = "0.2.12"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f213a84bffbd61b8fa0ba8a044b4bbe35d471d0b518867181e82bd5c15542784"
dependencies = [
"libc",
"tokio",
]
[[package]]
name = "tokio-postgres"
version = "0.7.10"
@@ -8014,11 +7993,12 @@ dependencies = [
[[package]]
name = "tracing-appender"
version = "0.2.2"
version = "0.2.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "09d48f71a791638519505cefafe162606f706c25592e4bde4d97600c0195312e"
checksum = "3566e8ce28cc0a3fe42519fc80e6b4c943cc4c8cef275620eb8dac2d3d4e06cf"
dependencies = [
"crossbeam-channel",
"thiserror 1.0.69",
"time",
"tracing-subscriber",
]
@@ -8204,15 +8184,6 @@ dependencies = [
"static_assertions",
]
[[package]]
name = "twox-hash"
version = "2.1.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8b907da542cbced5261bd3256de1b3a1bf340a3d37f93425a07362a1d687de56"
dependencies = [
"rand 0.9.1",
]
[[package]]
name = "typed-json"
version = "0.1.1"
@@ -8725,10 +8696,32 @@ version = "0.52.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e48a53791691ab099e5e2ad123536d0fff50652600abaf43bbf952894110d0be"
dependencies = [
"windows-core",
"windows-core 0.52.0",
"windows-targets 0.52.6",
]
[[package]]
name = "windows"
version = "0.61.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9babd3a767a4c1aef6900409f85f5d53ce2544ccdfaa86dad48c91782c6d6893"
dependencies = [
"windows-collections",
"windows-core 0.61.2",
"windows-future",
"windows-link",
"windows-numerics",
]
[[package]]
name = "windows-collections"
version = "0.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3beeceb5e5cfd9eb1d76b381630e82c4241ccd0d27f1a39ed41b2760b255c5e8"
dependencies = [
"windows-core 0.61.2",
]
[[package]]
name = "windows-core"
version = "0.52.0"
@@ -8738,6 +8731,86 @@ dependencies = [
"windows-targets 0.52.6",
]
[[package]]
name = "windows-core"
version = "0.61.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c0fdd3ddb90610c7638aa2b3a3ab2904fb9e5cdbecc643ddb3647212781c4ae3"
dependencies = [
"windows-implement",
"windows-interface",
"windows-link",
"windows-result",
"windows-strings",
]
[[package]]
name = "windows-future"
version = "0.2.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "fc6a41e98427b19fe4b73c550f060b59fa592d7d686537eebf9385621bfbad8e"
dependencies = [
"windows-core 0.61.2",
"windows-link",
"windows-threading",
]
[[package]]
name = "windows-implement"
version = "0.60.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a47fddd13af08290e67f4acabf4b459f647552718f683a7b415d290ac744a836"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.100",
]
[[package]]
name = "windows-interface"
version = "0.59.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "bd9211b69f8dcdfa817bfd14bf1c97c9188afa36f4750130fcdf3f400eca9fa8"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.100",
]
[[package]]
name = "windows-link"
version = "0.1.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5e6ad25900d524eaabdbbb96d20b4311e1e7ae1699af4fb28c17ae66c80d798a"
[[package]]
name = "windows-numerics"
version = "0.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9150af68066c4c5c07ddc0ce30421554771e528bde427614c61038bc2c92c2b1"
dependencies = [
"windows-core 0.61.2",
"windows-link",
]
[[package]]
name = "windows-result"
version = "0.3.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "56f42bd332cc6c8eac5af113fc0c1fd6a8fd2aa08a0119358686e5160d0586c6"
dependencies = [
"windows-link",
]
[[package]]
name = "windows-strings"
version = "0.4.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "56e6c93f3a0c3b36176cb1327a4958a0353d5d166c2a35cb268ace15e91d3b57"
dependencies = [
"windows-link",
]
[[package]]
name = "windows-sys"
version = "0.48.0"
@@ -8796,6 +8869,15 @@ dependencies = [
"windows_x86_64_msvc 0.52.6",
]
[[package]]
name = "windows-threading"
version = "0.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b66463ad2e0ea3bbf808b7f1d371311c80e115c0b71d60efc142cafbcfb057a6"
dependencies = [
"windows-link",
]
[[package]]
name = "windows_aarch64_gnullvm"
version = "0.48.0"
@@ -8932,7 +9014,8 @@ dependencies = [
"clap",
"clap_builder",
"const-oid",
"criterion",
"crossbeam-epoch",
"crossbeam-utils",
"crypto-bigint 0.5.5",
"der 0.7.8",
"deranged",
@@ -8975,8 +9058,10 @@ dependencies = [
"num-iter",
"num-rational",
"num-traits",
"once_cell",
"p256 0.13.2",
"parquet",
"portable-atomic",
"prettyplease",
"proc-macro2",
"prost 0.13.5",
@@ -9082,12 +9167,6 @@ version = "0.13.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "4d25c75bf9ea12c4040a97f829154768bbbce366287e2dc044af160cd79a13fd"
[[package]]
name = "xxhash-rust"
version = "0.8.15"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "fdd20c5420375476fbd4394763288da7eb0cc0b8c11deed431a91562af7335d3"
[[package]]
name = "yansi"
version = "1.0.1"

View File

@@ -46,10 +46,10 @@ members = [
"libs/proxy/json",
"libs/proxy/postgres-protocol2",
"libs/proxy/postgres-types2",
"libs/proxy/subzero_core",
"libs/proxy/tokio-postgres2",
"endpoint_storage",
"pgxn/neon/communicator",
"proxy/subzero_core",
]
[workspace.package]
@@ -93,7 +93,6 @@ clap = { version = "4.0", features = ["derive", "env"] }
clashmap = { version = "1.0", features = ["raw-api"] }
comfy-table = "7.1"
const_format = "0.2"
crossbeam-utils = "0.8.21"
crc32c = "0.6"
diatomic-waker = { version = "0.2.3" }
either = "1.8"
@@ -137,6 +136,7 @@ md5 = "0.7.0"
measured = { version = "0.0.22", features=["lasso"] }
measured-process = { version = "0.0.22" }
memoffset = "0.9"
moka = { version = "0.12", features = ["sync"] }
nix = { version = "0.30.1", features = ["dir", "fs", "mman", "process", "socket", "signal", "poll"] }
# Do not update to >= 7.0.0, at least. The update will have a significant impact
# on compute startup metrics (start_postgres_ms), >= 25% degradation.
@@ -153,7 +153,6 @@ parquet = { version = "53", default-features = false, features = ["zstd"] }
parquet_derive = "53"
pbkdf2 = { version = "0.12.1", features = ["simple", "std"] }
pem = "3.0.3"
peekable = "0.3.0"
pin-project-lite = "0.2"
pprof = { version = "0.14", features = ["criterion", "flamegraph", "frame-pointer", "prost-codec"] }
procfs = "0.16"
@@ -192,7 +191,6 @@ smallvec = "1.11"
smol_str = { version = "0.2.0", features = ["serde"] }
socket2 = "0.5"
spki = "0.7.3"
spin = "0.9.8"
strum = "0.26"
strum_macros = "0.26"
"subtle" = "2.5.0"
@@ -204,6 +202,7 @@ thiserror = "1.0"
tikv-jemallocator = { version = "0.6", features = ["profiling", "stats", "unprefixed_malloc_on_supported_platforms"] }
tikv-jemalloc-ctl = { version = "0.6", features = ["stats"] }
tokio = { version = "1.43.1", features = ["macros"] }
tokio-epoll-uring = { git = "https://github.com/neondatabase/tokio-epoll-uring.git" , branch = "main" }
tokio-io-timeout = "1.2.0"
tokio-postgres-rustls = "0.12.0"
tokio-rustls = { version = "0.26.0", default-features = false, features = ["tls12", "ring"]}
@@ -224,6 +223,7 @@ tracing-log = "0.2"
tracing-opentelemetry = "0.31"
tracing-serde = "0.2.0"
tracing-subscriber = { version = "0.3", default-features = false, features = ["smallvec", "fmt", "tracing-log", "std", "env-filter", "json"] }
tracing-appender = "0.2.3"
try-lock = "0.2.5"
test-log = { version = "0.2.17", default-features = false, features = ["log"] }
twox-hash = { version = "1.6.3", default-features = false }
@@ -242,9 +242,6 @@ x509-cert = { version = "0.2.5" }
env_logger = "0.11"
log = "0.4"
tokio-epoll-uring = { git = "https://github.com/neondatabase/tokio-epoll-uring.git" , branch = "main" }
uring-common = { git = "https://github.com/neondatabase/tokio-epoll-uring.git" , branch = "main" }
## Libraries from neondatabase/ git forks, ideally with changes to be upstreamed
postgres = { git = "https://github.com/neondatabase/rust-postgres.git", branch = "neon" }
postgres-protocol = { git = "https://github.com/neondatabase/rust-postgres.git", branch = "neon" }

View File

@@ -39,13 +39,13 @@ COPY build-tools/patches/pgcopydbv017.patch /pgcopydbv017.patch
RUN if [ "${DEBIAN_VERSION}" = "bookworm" ]; then \
set -e && \
apt update && \
apt install -y --no-install-recommends \
apt-get update && \
apt-get install -y --no-install-recommends \
ca-certificates wget gpg && \
wget -qO - https://www.postgresql.org/media/keys/ACCC4CF8.asc | gpg --dearmor -o /usr/share/keyrings/postgresql-keyring.gpg && \
echo "deb [signed-by=/usr/share/keyrings/postgresql-keyring.gpg] http://apt.postgresql.org/pub/repos/apt bookworm-pgdg main" > /etc/apt/sources.list.d/pgdg.list && \
apt-get update && \
apt install -y --no-install-recommends \
apt-get install -y --no-install-recommends \
build-essential \
autotools-dev \
libedit-dev \
@@ -89,8 +89,7 @@ RUN useradd -ms /bin/bash nonroot -b /home
# Use strict mode for bash to catch errors early
SHELL ["/bin/bash", "-euo", "pipefail", "-c"]
RUN mkdir -p /pgcopydb/bin && \
mkdir -p /pgcopydb/lib && \
RUN mkdir -p /pgcopydb/{bin,lib} && \
chmod -R 755 /pgcopydb && \
chown -R nonroot:nonroot /pgcopydb
@@ -106,8 +105,8 @@ RUN echo 'Acquire::Retries "5";' > /etc/apt/apt.conf.d/80-retries && \
# 'gdb' is included so that we get backtraces of core dumps produced in
# regression tests
RUN set -e \
&& apt update \
&& apt install -y \
&& apt-get update \
&& apt-get install -y --no-install-recommends \
autoconf \
automake \
bison \
@@ -183,22 +182,22 @@ RUN curl -sL "https://github.com/peak/s5cmd/releases/download/v${S5CMD_VERSION}/
ENV LLVM_VERSION=20
RUN curl -fsSL 'https://apt.llvm.org/llvm-snapshot.gpg.key' | apt-key add - \
&& echo "deb http://apt.llvm.org/${DEBIAN_VERSION}/ llvm-toolchain-${DEBIAN_VERSION}-${LLVM_VERSION} main" > /etc/apt/sources.list.d/llvm.stable.list \
&& apt update \
&& apt install -y clang-${LLVM_VERSION} llvm-${LLVM_VERSION} \
&& apt-get update \
&& apt-get install -y --no-install-recommends clang-${LLVM_VERSION} llvm-${LLVM_VERSION} \
&& bash -c 'for f in /usr/bin/clang*-${LLVM_VERSION} /usr/bin/llvm*-${LLVM_VERSION}; do ln -s "${f}" "${f%-${LLVM_VERSION}}"; done' \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
# Install node
ENV NODE_VERSION=24
RUN curl -fsSL https://deb.nodesource.com/setup_${NODE_VERSION}.x | bash - \
&& apt install -y nodejs \
&& apt-get install -y --no-install-recommends nodejs \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
# Install docker
RUN curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg \
&& echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/debian ${DEBIAN_VERSION} stable" > /etc/apt/sources.list.d/docker.list \
&& apt update \
&& apt install -y docker-ce docker-ce-cli \
&& apt-get update \
&& apt-get install -y --no-install-recommends docker-ce docker-ce-cli \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
# Configure sudo & docker
@@ -215,12 +214,11 @@ RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-$(uname -m).zip" -o "aws
# Mold: A Modern Linker
ENV MOLD_VERSION=v2.37.1
RUN set -e \
&& git clone https://github.com/rui314/mold.git \
&& git clone -b "${MOLD_VERSION}" --depth 1 https://github.com/rui314/mold.git \
&& mkdir mold/build \
&& cd mold/build \
&& git checkout ${MOLD_VERSION} \
&& cd mold/build \
&& cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=clang++ .. \
&& cmake --build . -j $(nproc) \
&& cmake --build . -j "$(nproc)" \
&& cmake --install . \
&& cd .. \
&& rm -rf mold
@@ -254,7 +252,7 @@ ENV ICU_VERSION=67.1
ENV ICU_PREFIX=/usr/local/icu
# Download and build static ICU
RUN wget -O /tmp/libicu-${ICU_VERSION}.tgz https://github.com/unicode-org/icu/releases/download/release-${ICU_VERSION//./-}/icu4c-${ICU_VERSION//./_}-src.tgz && \
RUN wget -O "/tmp/libicu-${ICU_VERSION}.tgz" https://github.com/unicode-org/icu/releases/download/release-${ICU_VERSION//./-}/icu4c-${ICU_VERSION//./_}-src.tgz && \
echo "94a80cd6f251a53bd2a997f6f1b5ac6653fe791dfab66e1eb0227740fb86d5dc /tmp/libicu-${ICU_VERSION}.tgz" | sha256sum --check && \
mkdir /tmp/icu && \
pushd /tmp/icu && \
@@ -265,8 +263,7 @@ RUN wget -O /tmp/libicu-${ICU_VERSION}.tgz https://github.com/unicode-org/icu/re
make install && \
popd && \
rm -rf icu && \
rm -f /tmp/libicu-${ICU_VERSION}.tgz && \
popd
rm -f /tmp/libicu-${ICU_VERSION}.tgz
# Switch to nonroot user
USER nonroot:nonroot
@@ -279,19 +276,19 @@ ENV PYTHON_VERSION=3.11.12 \
PYENV_ROOT=/home/nonroot/.pyenv \
PATH=/home/nonroot/.pyenv/shims:/home/nonroot/.pyenv/bin:/home/nonroot/.poetry/bin:$PATH
RUN set -e \
&& cd $HOME \
&& cd "$HOME" \
&& curl -sSO https://raw.githubusercontent.com/pyenv/pyenv-installer/master/bin/pyenv-installer \
&& chmod +x pyenv-installer \
&& ./pyenv-installer \
&& export PYENV_ROOT=/home/nonroot/.pyenv \
&& export PATH="$PYENV_ROOT/bin:$PATH" \
&& export PATH="$PYENV_ROOT/shims:$PATH" \
&& pyenv install ${PYTHON_VERSION} \
&& pyenv global ${PYTHON_VERSION} \
&& pyenv install "${PYTHON_VERSION}" \
&& pyenv global "${PYTHON_VERSION}" \
&& python --version \
&& pip install --upgrade pip \
&& pip install --no-cache-dir --upgrade pip \
&& pip --version \
&& pip install pipenv wheel poetry
&& pip install --no-cache-dir pipenv wheel poetry
# Switch to nonroot user (again)
USER nonroot:nonroot
@@ -317,13 +314,13 @@ RUN curl -sSO https://static.rust-lang.org/rustup/dist/$(uname -m)-unknown-linux
. "$HOME/.cargo/env" && \
cargo --version && rustup --version && \
rustup component add llvm-tools rustfmt clippy && \
cargo install rustfilt --locked --version ${RUSTFILT_VERSION} && \
cargo install cargo-hakari --locked --version ${CARGO_HAKARI_VERSION} && \
cargo install cargo-deny --locked --version ${CARGO_DENY_VERSION} && \
cargo install cargo-hack --locked --version ${CARGO_HACK_VERSION} && \
cargo install cargo-nextest --locked --version ${CARGO_NEXTEST_VERSION} && \
cargo install cargo-chef --locked --version ${CARGO_CHEF_VERSION} && \
cargo install diesel_cli --locked --version ${CARGO_DIESEL_CLI_VERSION} \
cargo install rustfilt --locked --version "${RUSTFILT_VERSION}" && \
cargo install cargo-hakari --locked --version "${CARGO_HAKARI_VERSION}" && \
cargo install cargo-deny --locked --version "${CARGO_DENY_VERSION}" && \
cargo install cargo-hack --locked --version "${CARGO_HACK_VERSION}" && \
cargo install cargo-nextest --locked --version "${CARGO_NEXTEST_VERSION}" && \
cargo install cargo-chef --locked --version "${CARGO_CHEF_VERSION}" && \
cargo install diesel_cli --locked --version "${CARGO_DIESEL_CLI_VERSION}" \
--features postgres-bundled --no-default-features && \
rm -rf /home/nonroot/.cargo/registry && \
rm -rf /home/nonroot/.cargo/git

View File

@@ -6,7 +6,7 @@
"": {
"name": "build-tools",
"devDependencies": {
"@redocly/cli": "1.34.4",
"@redocly/cli": "1.34.5",
"@sourcemeta/jsonschema": "10.0.0"
}
},
@@ -472,9 +472,9 @@
}
},
"node_modules/@redocly/cli": {
"version": "1.34.4",
"resolved": "https://registry.npmjs.org/@redocly/cli/-/cli-1.34.4.tgz",
"integrity": "sha512-seH/GgrjSB1EeOsgJ/4Ct6Jk2N7sh12POn/7G8UQFARMyUMJpe1oHtBwT2ndfp4EFCpgBAbZ/82Iw6dwczNxEA==",
"version": "1.34.5",
"resolved": "https://registry.npmjs.org/@redocly/cli/-/cli-1.34.5.tgz",
"integrity": "sha512-5IEwxs7SGP5KEXjBKLU8Ffdz9by/KqNSeBk6YUVQaGxMXK//uYlTJIPntgUXbo1KAGG2d2q2XF8y4iFz6qNeiw==",
"dev": true,
"license": "MIT",
"dependencies": {
@@ -484,14 +484,14 @@
"@opentelemetry/sdk-trace-node": "1.26.0",
"@opentelemetry/semantic-conventions": "1.27.0",
"@redocly/config": "^0.22.0",
"@redocly/openapi-core": "1.34.4",
"@redocly/respect-core": "1.34.4",
"@redocly/openapi-core": "1.34.5",
"@redocly/respect-core": "1.34.5",
"abort-controller": "^3.0.0",
"chokidar": "^3.5.1",
"colorette": "^1.2.0",
"core-js": "^3.32.1",
"dotenv": "16.4.7",
"form-data": "^4.0.0",
"form-data": "^4.0.4",
"get-port-please": "^3.0.1",
"glob": "^7.1.6",
"handlebars": "^4.7.6",
@@ -522,9 +522,9 @@
"license": "MIT"
},
"node_modules/@redocly/openapi-core": {
"version": "1.34.4",
"resolved": "https://registry.npmjs.org/@redocly/openapi-core/-/openapi-core-1.34.4.tgz",
"integrity": "sha512-hf53xEgpXIgWl3b275PgZU3OTpYh1RoD2LHdIfQ1JzBNTWsiNKczTEsI/4Tmh2N1oq9YcphhSMyk3lDh85oDjg==",
"version": "1.34.5",
"resolved": "https://registry.npmjs.org/@redocly/openapi-core/-/openapi-core-1.34.5.tgz",
"integrity": "sha512-0EbE8LRbkogtcCXU7liAyC00n9uNG9hJ+eMyHFdUsy9lB/WGqnEBgwjA9q2cyzAVcdTkQqTBBU1XePNnN3OijA==",
"dev": true,
"license": "MIT",
"dependencies": {
@@ -544,21 +544,21 @@
}
},
"node_modules/@redocly/respect-core": {
"version": "1.34.4",
"resolved": "https://registry.npmjs.org/@redocly/respect-core/-/respect-core-1.34.4.tgz",
"integrity": "sha512-MitKyKyQpsizA4qCVv+MjXL4WltfhFQAoiKiAzrVR1Kusro3VhYb6yJuzoXjiJhR0ukLP5QOP19Vcs7qmj9dZg==",
"version": "1.34.5",
"resolved": "https://registry.npmjs.org/@redocly/respect-core/-/respect-core-1.34.5.tgz",
"integrity": "sha512-GheC/g/QFztPe9UA9LamooSplQuy9pe0Yr8XGTqkz0ahivLDl7svoy/LSQNn1QH3XGtLKwFYMfTwFR2TAYyh5Q==",
"dev": true,
"license": "MIT",
"dependencies": {
"@faker-js/faker": "^7.6.0",
"@redocly/ajv": "8.11.2",
"@redocly/openapi-core": "1.34.4",
"@redocly/openapi-core": "1.34.5",
"better-ajv-errors": "^1.2.0",
"colorette": "^2.0.20",
"concat-stream": "^2.0.0",
"cookie": "^0.7.2",
"dotenv": "16.4.7",
"form-data": "4.0.0",
"form-data": "^4.0.4",
"jest-diff": "^29.3.1",
"jest-matcher-utils": "^29.3.1",
"js-yaml": "4.1.0",
@@ -582,21 +582,6 @@
"dev": true,
"license": "MIT"
},
"node_modules/@redocly/respect-core/node_modules/form-data": {
"version": "4.0.0",
"resolved": "https://registry.npmjs.org/form-data/-/form-data-4.0.0.tgz",
"integrity": "sha512-ETEklSGi5t0QMZuiXoA/Q6vcnxcLQP5vdugSpuAyi6SVGi2clPPp+xgEhuMaHC+zGgn31Kd235W35f7Hykkaww==",
"dev": true,
"license": "MIT",
"dependencies": {
"asynckit": "^0.4.0",
"combined-stream": "^1.0.8",
"mime-types": "^2.1.12"
},
"engines": {
"node": ">= 6"
}
},
"node_modules/@sinclair/typebox": {
"version": "0.27.8",
"resolved": "https://registry.npmjs.org/@sinclair/typebox/-/typebox-0.27.8.tgz",
@@ -1345,9 +1330,9 @@
"license": "MIT"
},
"node_modules/form-data": {
"version": "4.0.3",
"resolved": "https://registry.npmjs.org/form-data/-/form-data-4.0.3.tgz",
"integrity": "sha512-qsITQPfmvMOSAdeyZ+12I1c+CKSstAFAwu+97zrnWAbIr5u8wfsExUzCesVLC8NgHuRUqNN4Zy6UPWUTRGslcA==",
"version": "4.0.4",
"resolved": "https://registry.npmjs.org/form-data/-/form-data-4.0.4.tgz",
"integrity": "sha512-KrGhL9Q4zjj0kiUt5OO4Mr/A/jlI2jDYs5eHBpYHPcBEVSiipAvn2Ko2HnPe20rmcuuvMHNdZFp+4IlGTMF0Ow==",
"dev": true,
"license": "MIT",
"dependencies": {

View File

@@ -2,7 +2,7 @@
"name": "build-tools",
"private": true,
"devDependencies": {
"@redocly/cli": "1.34.4",
"@redocly/cli": "1.34.5",
"@sourcemeta/jsonschema": "10.0.0"
}
}

View File

@@ -133,7 +133,7 @@ RUN case $DEBIAN_VERSION in \
# Install newer version (3.25) from backports.
# libstdc++-10-dev is required for plv8
bullseye) \
echo "deb http://deb.debian.org/debian bullseye-backports main" > /etc/apt/sources.list.d/bullseye-backports.list; \
echo "deb http://archive.debian.org/debian bullseye-backports main" > /etc/apt/sources.list.d/bullseye-backports.list; \
VERSION_INSTALLS="cmake/bullseye-backports cmake-data/bullseye-backports libstdc++-10-dev"; \
;; \
# Version-specific installs for Bookworm (PG17):

View File

@@ -26,7 +26,13 @@ commands:
- name: postgres-exporter
user: nobody
sysvInitAction: respawn
shell: 'DATA_SOURCE_NAME="user=cloud_admin sslmode=disable dbname=postgres application_name=postgres-exporter pgaudit.log=none" /bin/postgres_exporter --config.file=/etc/postgres_exporter.yml'
# Turn off database collector (`--no-collector.database`), we don't use `pg_database_size_bytes` metric anyway, see
# https://github.com/neondatabase/flux-fleet/blob/5e19b3fd897667b70d9a7ad4aa06df0ca22b49ff/apps/base/compute-metrics/scrape-compute-pg-exporter-neon.yaml#L29
# but it's enabled by default and it doesn't filter out invalid databases, see
# https://github.com/prometheus-community/postgres_exporter/blob/06a553c8166512c9d9c5ccf257b0f9bba8751dbc/collector/pg_database.go#L67
# so if it hits one, it starts spamming logs
# ERROR: [NEON_SMGR] [reqid d9700000018] could not read db size of db 705302 from page server at lsn 5/A2457EB0
shell: 'DATA_SOURCE_NAME="user=cloud_admin sslmode=disable dbname=postgres application_name=postgres-exporter pgaudit.log=none" /bin/postgres_exporter --no-collector.database --config.file=/etc/postgres_exporter.yml'
- name: pgbouncer-exporter
user: postgres
sysvInitAction: respawn

View File

@@ -26,7 +26,13 @@ commands:
- name: postgres-exporter
user: nobody
sysvInitAction: respawn
shell: 'DATA_SOURCE_NAME="user=cloud_admin sslmode=disable dbname=postgres application_name=postgres-exporter pgaudit.log=none" /bin/postgres_exporter --config.file=/etc/postgres_exporter.yml'
# Turn off database collector (`--no-collector.database`), we don't use `pg_database_size_bytes` metric anyway, see
# https://github.com/neondatabase/flux-fleet/blob/5e19b3fd897667b70d9a7ad4aa06df0ca22b49ff/apps/base/compute-metrics/scrape-compute-pg-exporter-neon.yaml#L29
# but it's enabled by default and it doesn't filter out invalid databases, see
# https://github.com/prometheus-community/postgres_exporter/blob/06a553c8166512c9d9c5ccf257b0f9bba8751dbc/collector/pg_database.go#L67
# so if it hits one, it starts spamming logs
# ERROR: [NEON_SMGR] [reqid d9700000018] could not read db size of db 705302 from page server at lsn 5/A2457EB0
shell: 'DATA_SOURCE_NAME="user=cloud_admin sslmode=disable dbname=postgres application_name=postgres-exporter pgaudit.log=none" /bin/postgres_exporter --no-collector.database --config.file=/etc/postgres_exporter.yml'
- name: pgbouncer-exporter
user: postgres
sysvInitAction: respawn

View File

@@ -62,6 +62,7 @@ tokio-stream.workspace = true
tonic.workspace = true
tower-otel.workspace = true
tracing.workspace = true
tracing-appender.workspace = true
tracing-opentelemetry.workspace = true
tracing-subscriber.workspace = true
tracing-utils.workspace = true

View File

@@ -52,8 +52,14 @@ stateDiagram-v2
Init --> Running : Started Postgres
Running --> TerminationPendingFast : Requested termination
Running --> TerminationPendingImmediate : Requested termination
Running --> ConfigurationPending : Received a /configure request with spec
Running --> RefreshConfigurationPending : Received a /refresh_configuration request, compute node will pull a new spec and reconfigure
RefreshConfigurationPending --> RefreshConfiguration: Received compute spec and started configuration
RefreshConfiguration --> Running : Compute has been re-configured
RefreshConfiguration --> RefreshConfigurationPending : Configuration failed and to be retried
TerminationPendingFast --> Terminated compute with 30s delay for cplane to inspect status
TerminationPendingImmediate --> Terminated : Terminated compute immediately
Failed --> RefreshConfigurationPending : Received a /refresh_configuration request
Failed --> [*] : Compute exited
Terminated --> [*] : Compute exited
```

View File

@@ -49,9 +49,10 @@ use compute_tools::compute::{
BUILD_TAG, ComputeNode, ComputeNodeParams, forward_termination_signal,
};
use compute_tools::extension_server::get_pg_version_string;
use compute_tools::logger::*;
use compute_tools::params::*;
use compute_tools::pg_isready::get_pg_isready_bin;
use compute_tools::spec::*;
use compute_tools::{hadron_metrics, installed_extensions, logger::*};
use rlimit::{Resource, setrlimit};
use signal_hook::consts::{SIGINT, SIGQUIT, SIGTERM};
use signal_hook::iterator::Signals;
@@ -194,11 +195,19 @@ fn main() -> Result<()> {
.build()?;
let _rt_guard = runtime.enter();
let tracing_provider = init(cli.dev)?;
let mut log_dir = None;
if cli.lakebase_mode {
log_dir = std::env::var("COMPUTE_CTL_LOG_DIRECTORY").ok();
}
let (tracing_provider, _file_logs_guard) = init(cli.dev, log_dir)?;
// enable core dumping for all child processes
setrlimit(Resource::CORE, rlimit::INFINITY, rlimit::INFINITY)?;
installed_extensions::initialize_metrics();
hadron_metrics::initialize_metrics();
let connstr = Url::parse(&cli.connstr).context("cannot parse connstr as a URL")?;
let config = get_config(&cli)?;
@@ -226,7 +235,12 @@ fn main() -> Result<()> {
cli.installed_extensions_collection_interval,
)),
pg_init_timeout: cli.pg_init_timeout.map(Duration::from_secs),
pg_isready_bin: get_pg_isready_bin(&cli.pgbin),
instance_id: std::env::var("INSTANCE_ID").ok(),
lakebase_mode: cli.lakebase_mode,
build_tag: BUILD_TAG.to_string(),
control_plane_uri: cli.control_plane_uri,
config_path_test_only: cli.config,
},
config,
)?;
@@ -238,8 +252,14 @@ fn main() -> Result<()> {
deinit_and_exit(tracing_provider, exit_code);
}
fn init(dev_mode: bool) -> Result<Option<tracing_utils::Provider>> {
let provider = init_tracing_and_logging(DEFAULT_LOG_LEVEL)?;
fn init(
dev_mode: bool,
log_dir: Option<String>,
) -> Result<(
Option<tracing_utils::Provider>,
Option<tracing_appender::non_blocking::WorkerGuard>,
)> {
let (provider, file_logs_guard) = init_tracing_and_logging(DEFAULT_LOG_LEVEL, &log_dir)?;
let mut signals = Signals::new([SIGINT, SIGTERM, SIGQUIT])?;
thread::spawn(move || {
@@ -250,7 +270,7 @@ fn init(dev_mode: bool) -> Result<Option<tracing_utils::Provider>> {
info!("compute build_tag: {}", &BUILD_TAG.to_string());
Ok(provider)
Ok((provider, file_logs_guard))
}
fn get_config(cli: &Cli) -> Result<ComputeConfig> {

View File

@@ -6,8 +6,7 @@ use compute_api::responses::{
LfcPrewarmState, PromoteState, TlsConfig,
};
use compute_api::spec::{
ComputeAudit, ComputeFeature, ComputeMode, ComputeSpec, ExtVersion, PageserverConnectionInfo,
PageserverProtocol, PageserverShardConnectionInfo, PageserverShardInfo, PgIdent,
ComputeAudit, ComputeFeature, ComputeMode, ComputeSpec, ExtVersion, PageserverProtocol, PgIdent,
};
use futures::StreamExt;
use futures::future::join_all;
@@ -22,6 +21,7 @@ use postgres::NoTls;
use postgres::error::SqlState;
use remote_storage::{DownloadError, RemotePath};
use std::collections::{HashMap, HashSet};
use std::ffi::OsString;
use std::os::unix::fs::{PermissionsExt, symlink};
use std::path::Path;
use std::process::{Command, Stdio};
@@ -41,8 +41,9 @@ use utils::shard::{ShardCount, ShardIndex, ShardNumber};
use crate::configurator::launch_configurator;
use crate::disk_quota::set_disk_quota;
use crate::hadron_metrics::COMPUTE_ATTACHED;
use crate::installed_extensions::get_installed_extensions;
use crate::logger::startup_context_from_env;
use crate::logger::{self, startup_context_from_env};
use crate::lsn_lease::launch_lsn_lease_bg_task_for_static;
use crate::metrics::COMPUTE_CTL_UP;
use crate::monitor::launch_monitor;
@@ -114,11 +115,17 @@ pub struct ComputeNodeParams {
/// Interval for installed extensions collection
pub installed_extensions_collection_interval: Arc<AtomicU64>,
/// Hadron instance ID of the compute node.
pub instance_id: Option<String>,
/// Timeout of PG compute startup in the Init state.
pub pg_init_timeout: Option<Duration>,
// Path to the `pg_isready` binary.
pub pg_isready_bin: String,
pub lakebase_mode: bool,
pub build_tag: String,
pub control_plane_uri: Option<String>,
pub config_path_test_only: Option<OsString>,
}
type TaskHandle = Mutex<Option<JoinHandle<()>>>;
@@ -241,7 +248,7 @@ pub struct ParsedSpec {
pub spec: ComputeSpec,
pub tenant_id: TenantId,
pub timeline_id: TimelineId,
pub pageserver_conninfo: PageserverConnectionInfo,
pub pageserver_connstr: String,
pub safekeeper_connstrings: Vec<String>,
pub storage_auth_token: Option<String>,
/// k8s dns name and port
@@ -288,114 +295,26 @@ impl ParsedSpec {
}
}
/// Extract PageserverConnectionInfo from a comma-separated list of libpq connection strings.
///
/// This is used for backwards-compatilibity, to parse the legacye `pageserver_connstr`
/// field in the compute spec, or the 'neon.pageserver_connstring' GUC. Nowadays, the
/// 'pageserver_connection_info' field should be used instead.
fn extract_pageserver_conninfo_from_connstr(
connstr: &str,
stripe_size: Option<u32>,
) -> Result<PageserverConnectionInfo, anyhow::Error> {
let shard_infos: Vec<_> = connstr
.split(',')
.map(|connstr| PageserverShardInfo {
pageservers: vec![PageserverShardConnectionInfo {
id: None,
libpq_url: Some(connstr.to_string()),
grpc_url: None,
}],
})
.collect();
match shard_infos.len() {
0 => anyhow::bail!("empty connection string"),
1 => {
// We assume that if there's only connection string, it means "unsharded",
// rather than a sharded system with just a single shard. The latter is
// possible in principle, but we never do it.
let shard_count = ShardCount::unsharded();
let only_shard = shard_infos.first().unwrap().clone();
let shards = vec![(ShardIndex::unsharded(), only_shard)];
Ok(PageserverConnectionInfo {
shard_count,
stripe_size: None,
shards: shards.into_iter().collect(),
prefer_protocol: PageserverProtocol::Libpq,
})
}
n => {
if stripe_size.is_none() {
anyhow::bail!("{n} shards but no stripe_size");
}
let shard_count = ShardCount(n.try_into()?);
let shards = shard_infos
.into_iter()
.enumerate()
.map(|(idx, shard_info)| {
(
ShardIndex {
shard_count,
shard_number: ShardNumber(
idx.try_into().expect("shard number fits in u8"),
),
},
shard_info,
)
})
.collect();
Ok(PageserverConnectionInfo {
shard_count,
stripe_size,
shards,
prefer_protocol: PageserverProtocol::Libpq,
})
}
}
}
impl TryFrom<ComputeSpec> for ParsedSpec {
type Error = anyhow::Error;
fn try_from(spec: ComputeSpec) -> Result<Self, anyhow::Error> {
type Error = String;
fn try_from(spec: ComputeSpec) -> Result<Self, String> {
// Extract the options from the spec file that are needed to connect to
// the storage system.
//
// In compute specs generated by old control plane versions, the spec file might
// be missing the `pageserver_connection_info` field. In that case, we need to dig
// the pageserver connection info from the `pageserver_connstr` field instead, or
// if that's missing too, from the GUC in the cluster.settings field.
let mut pageserver_conninfo = spec.pageserver_connection_info.clone();
if pageserver_conninfo.is_none() {
if let Some(pageserver_connstr_field) = &spec.pageserver_connstring {
pageserver_conninfo = Some(extract_pageserver_conninfo_from_connstr(
pageserver_connstr_field,
spec.shard_stripe_size,
)?);
}
}
if pageserver_conninfo.is_none() {
if let Some(guc) = spec.cluster.settings.find("neon.pageserver_connstring") {
let stripe_size = if let Some(guc) = spec.cluster.settings.find("neon.stripe_size")
{
Some(u32::from_str(&guc)?)
} else {
None
};
pageserver_conninfo =
Some(extract_pageserver_conninfo_from_connstr(&guc, stripe_size)?);
}
}
let pageserver_conninfo = pageserver_conninfo.ok_or(anyhow::anyhow!(
"pageserver connection information should be provided"
))?;
// Similarly for safekeeper connection strings
// For backwards-compatibility, the top-level fields in the spec file
// may be empty. In that case, we need to dig them from the GUCs in the
// cluster.settings field.
let pageserver_connstr = spec
.pageserver_connstring
.clone()
.or_else(|| spec.cluster.settings.find("neon.pageserver_connstring"))
.ok_or("pageserver connstr should be provided")?;
let safekeeper_connstrings = if spec.safekeeper_connstrings.is_empty() {
if matches!(spec.mode, ComputeMode::Primary) {
spec.cluster
.settings
.find("neon.safekeepers")
.ok_or(anyhow::anyhow!("safekeeper connstrings should be provided"))?
.ok_or("safekeeper connstrings should be provided")?
.split(',')
.map(|str| str.to_string())
.collect()
@@ -410,22 +329,22 @@ impl TryFrom<ComputeSpec> for ParsedSpec {
let tenant_id: TenantId = if let Some(tenant_id) = spec.tenant_id {
tenant_id
} else {
let guc = spec
.cluster
spec.cluster
.settings
.find("neon.tenant_id")
.ok_or(anyhow::anyhow!("tenant id should be provided"))?;
TenantId::from_str(&guc).context("invalid tenant id")?
.ok_or("tenant id should be provided")
.map(|s| TenantId::from_str(&s))?
.or(Err("invalid tenant id"))?
};
let timeline_id: TimelineId = if let Some(timeline_id) = spec.timeline_id {
timeline_id
} else {
let guc = spec
.cluster
spec.cluster
.settings
.find("neon.timeline_id")
.ok_or(anyhow::anyhow!("timeline id should be provided"))?;
TimelineId::from_str(&guc).context(anyhow::anyhow!("invalid timeline id"))?
.ok_or("timeline id should be provided")
.map(|s| TimelineId::from_str(&s))?
.or(Err("invalid timeline id"))?
};
let endpoint_storage_addr: Option<String> = spec
@@ -439,7 +358,7 @@ impl TryFrom<ComputeSpec> for ParsedSpec {
let res = ParsedSpec {
spec,
pageserver_conninfo,
pageserver_connstr,
safekeeper_connstrings,
storage_auth_token,
tenant_id,
@@ -449,7 +368,7 @@ impl TryFrom<ComputeSpec> for ParsedSpec {
};
// Now check validity of the parsed specification
res.validate().map_err(anyhow::Error::msg)?;
res.validate()?;
Ok(res)
}
}
@@ -494,6 +413,52 @@ struct StartVmMonitorResult {
vm_monitor: Option<JoinHandle<Result<()>>>,
}
/// Databricks-specific environment variables to be passed to the `postgres` sub-process.
pub struct DatabricksEnvVars {
/// The Databricks "endpoint ID" of the compute instance. Used by `postgres` to check
/// the token scopes of internal auth tokens.
pub endpoint_id: String,
/// Hostname of the Databricks workspace URL this compute instance belongs to.
/// Used by postgres to verify Databricks PAT tokens.
pub workspace_host: String,
}
impl DatabricksEnvVars {
pub fn new(compute_spec: &ComputeSpec, compute_id: Option<&String>) -> Self {
// compute_id is a string format of "{endpoint_id}/{compute_idx}"
// endpoint_id is a uuid. We only need to pass down endpoint_id to postgres.
// Panics if compute_id is not set or not in the expected format.
let endpoint_id = compute_id.unwrap().split('/').next().unwrap().to_string();
let workspace_host = compute_spec
.databricks_settings
.as_ref()
.map(|s| s.databricks_workspace_host.clone())
.unwrap_or("".to_string());
Self {
endpoint_id,
workspace_host,
}
}
/// Constants for the names of Databricks-specific postgres environment variables.
const DATABRICKS_ENDPOINT_ID_ENVVAR: &'static str = "DATABRICKS_ENDPOINT_ID";
const DATABRICKS_WORKSPACE_HOST_ENVVAR: &'static str = "DATABRICKS_WORKSPACE_HOST";
/// Convert DatabricksEnvVars to a list of string pairs that can be passed as env vars. Consumes `self`.
pub fn to_env_var_list(self) -> Vec<(String, String)> {
vec![
(
Self::DATABRICKS_ENDPOINT_ID_ENVVAR.to_string(),
self.endpoint_id.clone(),
),
(
Self::DATABRICKS_WORKSPACE_HOST_ENVVAR.to_string(),
self.workspace_host.clone(),
),
]
}
}
impl ComputeNode {
pub fn new(params: ComputeNodeParams, config: ComputeConfig) -> Result<Self> {
let connstr = params.connstr.as_str();
@@ -575,6 +540,7 @@ impl ComputeNode {
port: this.params.external_http_port,
config: this.compute_ctl_config.clone(),
compute_id: this.params.compute_id.clone(),
instance_id: this.params.instance_id.clone(),
}
.launch(&this);
@@ -1139,10 +1105,12 @@ impl ComputeNode {
fn try_get_basebackup(&self, compute_state: &ComputeState, lsn: Lsn) -> Result<()> {
let spec = compute_state.pspec.as_ref().expect("spec must be set");
let shard0_connstr = spec.pageserver_connstr.split(',').next().unwrap();
let started = Instant::now();
let (connected, size) = match spec.pageserver_conninfo.prefer_protocol {
PageserverProtocol::Grpc => self.try_get_basebackup_grpc(spec, lsn)?,
let (connected, size) = match PageserverProtocol::from_connstring(shard0_connstr)? {
PageserverProtocol::Libpq => self.try_get_basebackup_libpq(spec, lsn)?,
PageserverProtocol::Grpc => self.try_get_basebackup_grpc(spec, lsn)?,
};
self.fix_zenith_signal_neon_signal()?;
@@ -1180,32 +1148,23 @@ impl ComputeNode {
/// Fetches a basebackup via gRPC. The connstring must use grpc://. Returns the timestamp when
/// the connection was established, and the (compressed) size of the basebackup.
fn try_get_basebackup_grpc(&self, spec: &ParsedSpec, lsn: Lsn) -> Result<(Instant, usize)> {
let shard0_index = ShardIndex {
shard_number: ShardNumber(0),
shard_count: spec.pageserver_conninfo.shard_count,
let shard0_connstr = spec
.pageserver_connstr
.split(',')
.next()
.unwrap()
.to_string();
let shard_index = match spec.pageserver_connstr.split(',').count() as u8 {
0 | 1 => ShardIndex::unsharded(),
count => ShardIndex::new(ShardNumber(0), ShardCount(count)),
};
let shard0 = spec
.pageserver_conninfo
.shards
.get(&shard0_index)
.ok_or_else(|| {
anyhow::anyhow!("shard connection info missing for shard {}", shard0_index)
})?;
let pageserver = shard0
.pageservers
.first()
.expect("must have at least one pageserver");
let shard0_url = pageserver
.grpc_url
.clone()
.expect("no grpc_url for shard 0");
let (reader, connected) = tokio::runtime::Handle::current().block_on(async move {
let mut client = page_api::Client::connect(
shard0_url,
shard0_connstr,
spec.tenant_id,
spec.timeline_id,
shard0_index,
shard_index,
spec.storage_auth_token.clone(),
None, // NB: base backups use payload compression
)
@@ -1237,26 +1196,8 @@ impl ComputeNode {
/// Fetches a basebackup via libpq. The connstring must use postgresql://. Returns the timestamp
/// when the connection was established, and the (compressed) size of the basebackup.
fn try_get_basebackup_libpq(&self, spec: &ParsedSpec, lsn: Lsn) -> Result<(Instant, usize)> {
let shard0_index = ShardIndex {
shard_number: ShardNumber(0),
shard_count: spec.pageserver_conninfo.shard_count,
};
let shard0 = spec
.pageserver_conninfo
.shards
.get(&shard0_index)
.ok_or_else(|| {
anyhow::anyhow!("shard connection info missing for shard {}", shard0_index)
})?;
let pageserver = shard0
.pageservers
.first()
.expect("must have at least one pageserver");
let shard0_connstr = pageserver
.libpq_url
.clone()
.expect("no libpq_url for shard 0");
let mut config = postgres::Config::from_str(&shard0_connstr)?;
let shard0_connstr = spec.pageserver_connstr.split(',').next().unwrap();
let mut config = postgres::Config::from_str(shard0_connstr)?;
// Use the storage auth token from the config file, if given.
// Note: this overrides any password set in the connection string.
@@ -1342,7 +1283,10 @@ impl ComputeNode {
return result;
}
Err(ref e) if attempts < max_attempts => {
warn!("Failed to get basebackup: {e:?} (attempt {attempts}/{max_attempts})");
warn!(
"Failed to get basebackup: {} (attempt {}/{})",
e, attempts, max_attempts
);
std::thread::sleep(std::time::Duration::from_millis(retry_period_ms as u64));
retry_period_ms *= 1.5;
}
@@ -1513,6 +1457,8 @@ impl ComputeNode {
let pgdata_path = Path::new(&self.params.pgdata);
let tls_config = self.tls_config(&pspec.spec);
let databricks_settings = spec.databricks_settings.as_ref();
let postgres_port = self.params.connstr.port();
// Remove/create an empty pgdata directory and put configuration there.
self.create_pgdata()?;
@@ -1520,8 +1466,11 @@ impl ComputeNode {
pgdata_path,
&self.params,
&pspec.spec,
postgres_port,
self.params.internal_http_port,
tls_config,
databricks_settings,
self.params.lakebase_mode,
)?;
// Syncing safekeepers is only safe with primary nodes: if a primary
@@ -1550,11 +1499,39 @@ impl ComputeNode {
}
};
self.get_basebackup(compute_state, lsn)
.with_context(|| format!("failed to get basebackup@{lsn}"))?;
info!(
"getting basebackup@{} from pageserver {}",
lsn, &pspec.pageserver_connstr
);
self.get_basebackup(compute_state, lsn).with_context(|| {
format!(
"failed to get basebackup@{} from pageserver {}",
lsn, &pspec.pageserver_connstr
)
})?;
// Update pg_hba.conf received with basebackup.
update_pg_hba(pgdata_path, None)?;
if let Some(settings) = databricks_settings {
copy_tls_certificates(
&settings.pg_compute_tls_settings.key_file,
&settings.pg_compute_tls_settings.cert_file,
pgdata_path,
)?;
// Update pg_hba.conf received with basebackup including additional databricks settings.
update_pg_hba(pgdata_path, Some(&settings.databricks_pg_hba))?;
update_pg_ident(pgdata_path, Some(&settings.databricks_pg_ident))?;
} else {
// Update pg_hba.conf received with basebackup.
update_pg_hba(pgdata_path, None)?;
}
if let Some(databricks_settings) = spec.databricks_settings.as_ref() {
copy_tls_certificates(
&databricks_settings.pg_compute_tls_settings.key_file,
&databricks_settings.pg_compute_tls_settings.cert_file,
pgdata_path,
)?;
}
// Place pg_dynshmem under /dev/shm. This allows us to use
// 'dynamic_shared_memory_type = mmap' so that the files are placed in
@@ -1667,14 +1644,31 @@ impl ComputeNode {
pub fn start_postgres(&self, storage_auth_token: Option<String>) -> Result<PostgresHandle> {
let pgdata_path = Path::new(&self.params.pgdata);
let env_vars: Vec<(String, String)> = if self.params.lakebase_mode {
let databricks_env_vars = {
let state = self.state.lock().unwrap();
let spec = &state.pspec.as_ref().unwrap().spec;
DatabricksEnvVars::new(spec, Some(&self.params.compute_id))
};
info!(
"Starting Postgres for databricks endpoint id: {}",
&databricks_env_vars.endpoint_id
);
let mut env_vars = databricks_env_vars.to_env_var_list();
env_vars.extend(storage_auth_token.map(|t| ("NEON_AUTH_TOKEN".to_string(), t)));
env_vars
} else if let Some(storage_auth_token) = &storage_auth_token {
vec![("NEON_AUTH_TOKEN".to_owned(), storage_auth_token.to_owned())]
} else {
vec![]
};
// Run postgres as a child process.
let mut pg = maybe_cgexec(&self.params.pgbin)
.args(["-D", &self.params.pgdata])
.envs(if let Some(storage_auth_token) = &storage_auth_token {
vec![("NEON_AUTH_TOKEN", storage_auth_token)]
} else {
vec![]
})
.envs(env_vars)
.stderr(Stdio::piped())
.spawn()
.expect("cannot start postgres process");
@@ -1888,6 +1882,34 @@ impl ComputeNode {
Ok::<(), anyhow::Error>(())
}
// Signal to the configurator to refresh the configuration by pulling a new spec from the HCC.
// Note that this merely triggers a notification on a condition variable the configurator thread
// waits on. The configurator thread (in configurator.rs) pulls the new spec from the HCC and
// applies it.
pub async fn signal_refresh_configuration(&self) -> Result<()> {
let states_allowing_configuration_refresh = [
ComputeStatus::Running,
ComputeStatus::Failed,
ComputeStatus::RefreshConfigurationPending,
];
let mut state = self.state.lock().expect("state lock poisoned");
if states_allowing_configuration_refresh.contains(&state.status) {
state.status = ComputeStatus::RefreshConfigurationPending;
self.state_changed.notify_all();
Ok(())
} else if state.status == ComputeStatus::Init {
// If the compute is in Init state, we can't refresh the configuration immediately,
// but we should be able to do that soon.
Ok(())
} else {
Err(anyhow::anyhow!(
"Cannot refresh compute configuration in state {:?}",
state.status
))
}
}
// Wrapped this around `pg_ctl reload`, but right now we don't use
// `pg_ctl` for start / stop.
#[instrument(skip_all)]
@@ -1949,12 +1971,16 @@ impl ComputeNode {
// Write new config
let pgdata_path = Path::new(&self.params.pgdata);
let postgres_port = self.params.connstr.port();
config::write_postgres_conf(
pgdata_path,
&self.params,
&spec,
postgres_port,
self.params.internal_http_port,
tls_config,
spec.databricks_settings.as_ref(),
self.params.lakebase_mode,
)?;
self.pg_reload_conf()?;
@@ -2060,6 +2086,8 @@ impl ComputeNode {
// wait
ComputeStatus::Init
| ComputeStatus::Configuration
| ComputeStatus::RefreshConfiguration
| ComputeStatus::RefreshConfigurationPending
| ComputeStatus::Empty => {
state = self.state_changed.wait(state).unwrap();
}
@@ -2495,22 +2523,22 @@ LIMIT 100",
/// The operation will time out after a specified duration.
pub fn wait_timeout_while_pageserver_connstr_unchanged(&self, duration: Duration) {
let state = self.state.lock().unwrap();
let old_pageserver_conninfo = state
let old_pageserver_connstr = state
.pspec
.as_ref()
.expect("spec must be set")
.pageserver_conninfo
.pageserver_connstr
.clone();
let mut unchanged = true;
let _ = self
.state_changed
.wait_timeout_while(state, duration, |s| {
let pageserver_conninfo = &s
let pageserver_connstr = &s
.pspec
.as_ref()
.expect("spec must be set")
.pageserver_conninfo;
unchanged = pageserver_conninfo == &old_pageserver_conninfo;
.pageserver_connstr;
unchanged = pageserver_connstr == &old_pageserver_connstr;
unchanged
})
.unwrap();
@@ -2616,6 +2644,34 @@ LIMIT 100",
);
}
}
/// Set the compute spec and update related metrics.
/// This is the central place where pspec is updated.
pub fn set_spec(params: &ComputeNodeParams, state: &mut ComputeState, pspec: ParsedSpec) {
state.pspec = Some(pspec);
ComputeNode::update_attached_metric(params, state);
let _ = logger::update_ids(&params.instance_id, &Some(params.compute_id.clone()));
}
pub fn update_attached_metric(params: &ComputeNodeParams, state: &mut ComputeState) {
// Update the pg_cctl_attached gauge when all identifiers are available.
if let Some(instance_id) = &params.instance_id {
if let Some(pspec) = &state.pspec {
// Clear all values in the metric
COMPUTE_ATTACHED.reset();
// Set new metric value
COMPUTE_ATTACHED
.with_label_values(&[
&params.compute_id,
instance_id,
&pspec.tenant_id.to_string(),
&pspec.timeline_id.to_string(),
])
.set(1);
}
}
}
}
pub async fn installed_extensions(conf: tokio_postgres::Config) -> Result<()> {
@@ -2740,10 +2796,7 @@ mod tests {
match ParsedSpec::try_from(spec.clone()) {
Ok(_p) => panic!("Failed to detect duplicate entry"),
Err(e) => assert!(
e.to_string()
.starts_with("duplicate entry in safekeeper_connstrings:")
),
Err(e) => assert!(e.starts_with("duplicate entry in safekeeper_connstrings:")),
};
}
}

View File

@@ -90,6 +90,7 @@ impl ComputeNode {
}
/// If there is a prewarm request ongoing, return `false`, `true` otherwise.
/// Has a failpoint "compute-prewarm"
pub fn prewarm_lfc(self: &Arc<Self>, from_endpoint: Option<String>) -> bool {
{
let state = &mut self.state.lock().unwrap().lfc_prewarm_state;
@@ -112,9 +113,8 @@ impl ComputeNode {
Err(err) => {
crate::metrics::LFC_PREWARM_ERRORS.inc();
error!(%err, "could not prewarm LFC");
LfcPrewarmState::Failed {
error: err.to_string(),
error: format!("{err:#}"),
}
}
};
@@ -135,16 +135,20 @@ impl ComputeNode {
async fn prewarm_impl(&self, from_endpoint: Option<String>) -> Result<bool> {
let EndpointStoragePair { url, token } = self.endpoint_storage_pair(from_endpoint)?;
#[cfg(feature = "testing")]
fail::fail_point!("compute-prewarm", |_| {
bail!("prewarm configured to fail because of a failpoint")
});
info!(%url, "requesting LFC state from endpoint storage");
let request = Client::new().get(&url).bearer_auth(token);
let res = request.send().await.context("querying endpoint storage")?;
let status = res.status();
match status {
match res.status() {
StatusCode::OK => (),
StatusCode::NOT_FOUND => {
return Ok(false);
}
_ => bail!("{status} querying endpoint storage"),
status => bail!("{status} querying endpoint storage"),
}
let mut uncompressed = Vec::new();
@@ -205,7 +209,7 @@ impl ComputeNode {
crate::metrics::LFC_OFFLOAD_ERRORS.inc();
error!(%err, "could not offload LFC state to endpoint storage");
self.state.lock().unwrap().lfc_offload_state = LfcOffloadState::Failed {
error: err.to_string(),
error: format!("{err:#}"),
};
}
@@ -213,16 +217,22 @@ impl ComputeNode {
let EndpointStoragePair { url, token } = self.endpoint_storage_pair(None)?;
info!(%url, "requesting LFC state from Postgres");
let mut compressed = Vec::new();
ComputeNode::get_maintenance_client(&self.tokio_conn_conf)
let row = ComputeNode::get_maintenance_client(&self.tokio_conn_conf)
.await
.context("connecting to postgres")?
.query_one("select neon.get_local_cache_state()", &[])
.await
.context("querying LFC state")?
.try_get::<usize, &[u8]>(0)
.context("deserializing LFC state")
.map(ZstdEncoder::new)?
.context("querying LFC state")?;
let state = row
.try_get::<usize, Option<&[u8]>>(0)
.context("deserializing LFC state")?;
let Some(state) = state else {
info!(%url, "empty LFC state, not exporting");
return Ok(());
};
let mut compressed = Vec::new();
ZstdEncoder::new(state)
.read_to_end(&mut compressed)
.await
.context("compressing LFC state")?;

View File

@@ -1,11 +1,12 @@
use crate::compute::ComputeNode;
use anyhow::{Context, Result, bail};
use compute_api::{
responses::{LfcPrewarmState, PromoteState, SafekeepersLsn},
spec::ComputeMode,
};
use compute_api::responses::{LfcPrewarmState, PromoteConfig, PromoteState};
use compute_api::spec::ComputeMode;
use itertools::Itertools;
use std::collections::HashMap;
use std::{sync::Arc, time::Duration};
use tokio::time::sleep;
use tracing::info;
use utils::lsn::Lsn;
impl ComputeNode {
@@ -13,21 +14,22 @@ impl ComputeNode {
/// and http client disconnects, this does not stop promotion, and subsequent
/// calls block until promote finishes.
/// Called by control plane on secondary after primary endpoint is terminated
pub async fn promote(self: &Arc<Self>, safekeepers_lsn: SafekeepersLsn) -> PromoteState {
/// Has a failpoint "compute-promotion"
pub async fn promote(self: &Arc<Self>, cfg: PromoteConfig) -> PromoteState {
let cloned = self.clone();
let promote_fn = async move || {
let Err(err) = cloned.promote_impl(cfg).await else {
return PromoteState::Completed;
};
tracing::error!(%err, "promoting");
PromoteState::Failed {
error: format!("{err:#}"),
}
};
let start_promotion = || {
let (tx, rx) = tokio::sync::watch::channel(PromoteState::NotPromoted);
tokio::spawn(async move {
tx.send(match cloned.promote_impl(safekeepers_lsn).await {
Ok(_) => PromoteState::Completed,
Err(err) => {
tracing::error!(%err, "promoting");
PromoteState::Failed {
error: err.to_string(),
}
}
})
});
tokio::spawn(async move { tx.send(promote_fn().await) });
rx
};
@@ -47,9 +49,7 @@ impl ComputeNode {
task.borrow().clone()
}
// Why do we have to supply safekeepers?
// For secondary we use primary_connection_conninfo so safekeepers field is empty
async fn promote_impl(&self, safekeepers_lsn: SafekeepersLsn) -> Result<()> {
async fn promote_impl(&self, mut cfg: PromoteConfig) -> Result<()> {
{
let state = self.state.lock().unwrap();
let mode = &state.pspec.as_ref().unwrap().spec.mode;
@@ -73,7 +73,7 @@ impl ComputeNode {
.await
.context("connecting to postgres")?;
let primary_lsn = safekeepers_lsn.wal_flush_lsn;
let primary_lsn = cfg.wal_flush_lsn;
let mut last_wal_replay_lsn: Lsn = Lsn::INVALID;
const RETRIES: i32 = 20;
for i in 0..=RETRIES {
@@ -86,7 +86,7 @@ impl ComputeNode {
if last_wal_replay_lsn >= primary_lsn {
break;
}
tracing::info!("Try {i}, replica lsn {last_wal_replay_lsn}, primary lsn {primary_lsn}");
info!("Try {i}, replica lsn {last_wal_replay_lsn}, primary lsn {primary_lsn}");
sleep(Duration::from_secs(1)).await;
}
if last_wal_replay_lsn < primary_lsn {
@@ -96,7 +96,7 @@ impl ComputeNode {
// using $1 doesn't work with ALTER SYSTEM SET
let safekeepers_sql = format!(
"ALTER SYSTEM SET neon.safekeepers='{}'",
safekeepers_lsn.safekeepers
cfg.spec.safekeeper_connstrings.join(",")
);
client
.query(&safekeepers_sql, &[])
@@ -106,6 +106,12 @@ impl ComputeNode {
.query("SELECT pg_reload_conf()", &[])
.await
.context("reloading postgres config")?;
#[cfg(feature = "testing")]
fail::fail_point!("compute-promotion", |_| {
bail!("promotion configured to fail because of a failpoint")
});
let row = client
.query_one("SELECT * FROM pg_promote()", &[])
.await
@@ -125,8 +131,36 @@ impl ComputeNode {
bail!("replica in read only mode after promotion");
}
let mut state = self.state.lock().unwrap();
state.pspec.as_mut().unwrap().spec.mode = ComputeMode::Primary;
Ok(())
{
let mut state = self.state.lock().unwrap();
let spec = &mut state.pspec.as_mut().unwrap().spec;
spec.mode = ComputeMode::Primary;
let new_conf = cfg.spec.cluster.postgresql_conf.as_mut().unwrap();
let existing_conf = spec.cluster.postgresql_conf.as_ref().unwrap();
Self::merge_spec(new_conf, existing_conf);
}
info!("applied new spec, reconfiguring as primary");
self.reconfigure()
}
/// Merge old and new Postgres conf specs to apply on secondary.
/// Change new spec's port and safekeepers since they are supplied
/// differenly
fn merge_spec(new_conf: &mut String, existing_conf: &str) {
let mut new_conf_set: HashMap<&str, &str> = new_conf
.split_terminator('\n')
.map(|e| e.split_once("=").expect("invalid item"))
.collect();
new_conf_set.remove("neon.safekeepers");
let existing_conf_set: HashMap<&str, &str> = existing_conf
.split_terminator('\n')
.map(|e| e.split_once("=").expect("invalid item"))
.collect();
new_conf_set.insert("port", existing_conf_set["port"]);
*new_conf = new_conf_set
.iter()
.map(|(k, v)| format!("{k}={v}"))
.join("\n");
}
}

View File

@@ -7,16 +7,17 @@ use std::io::prelude::*;
use std::path::Path;
use compute_api::responses::TlsConfig;
use compute_api::spec::{ComputeAudit, ComputeMode, ComputeSpec, GenericOption};
use compute_api::spec::{
ComputeAudit, ComputeMode, ComputeSpec, DatabricksSettings, GenericOption,
};
use crate::compute::ComputeNodeParams;
use crate::pg_helpers::{
GenericOptionExt, GenericOptionsSearch, PgOptionsSerialize, escape_conf_value,
DatabricksSettingsExt as _, GenericOptionExt, GenericOptionsSearch, PgOptionsSerialize,
escape_conf_value,
};
use crate::tls::{self, SERVER_CRT, SERVER_KEY};
use utils::shard::{ShardIndex, ShardNumber};
/// Check that `line` is inside a text file and put it there if it is not.
/// Create file if it doesn't exist.
pub fn line_in_file(path: &Path, line: &str) -> Result<bool> {
@@ -42,12 +43,16 @@ pub fn line_in_file(path: &Path, line: &str) -> Result<bool> {
}
/// Create or completely rewrite configuration file specified by `path`
#[allow(clippy::too_many_arguments)]
pub fn write_postgres_conf(
pgdata_path: &Path,
params: &ComputeNodeParams,
spec: &ComputeSpec,
postgres_port: Option<u16>,
extension_server_port: u16,
tls_config: &Option<TlsConfig>,
databricks_settings: Option<&DatabricksSettings>,
lakebase_mode: bool,
) -> Result<()> {
let path = pgdata_path.join("postgresql.conf");
// File::create() destroys the file content if it exists.
@@ -58,101 +63,15 @@ pub fn write_postgres_conf(
writeln!(file, "{conf}")?;
}
// Stripe size GUC should be defined prior to connection string
if let Some(stripe_size) = spec.shard_stripe_size {
writeln!(file, "neon.stripe_size={stripe_size}")?;
}
// Add options for connecting to storage
writeln!(file, "# Neon storage settings")?;
writeln!(file)?;
if let Some(conninfo) = &spec.pageserver_connection_info {
// Stripe size GUC should be defined prior to connection string
if let Some(stripe_size) = conninfo.stripe_size {
writeln!(
file,
"# from compute spec's pageserver_conninfo.stripe_size field"
)?;
writeln!(file, "neon.stripe_size={stripe_size}")?;
}
let mut libpq_urls: Option<Vec<String>> = Some(Vec::new());
let mut grpc_urls: Option<Vec<String>> = Some(Vec::new());
let num_shards = if conninfo.shard_count.0 == 0 {
1 // unsharded, treat it as a single shard
} else {
conninfo.shard_count.0
};
for shard_number in 0..num_shards {
let shard_index = ShardIndex {
shard_number: ShardNumber(shard_number),
shard_count: conninfo.shard_count,
};
let info = conninfo.shards.get(&shard_index).ok_or_else(|| {
anyhow::anyhow!(
"shard {shard_index} missing from pageserver_connection_info shard map"
)
})?;
let first_pageserver = info
.pageservers
.first()
.expect("must have at least one pageserver");
// Add the libpq URL to the array, or if the URL is missing, reset the array
// forgetting any previous entries. All servers must have a libpq URL, or none
// at all.
if let Some(url) = &first_pageserver.libpq_url {
if let Some(ref mut urls) = libpq_urls {
urls.push(url.clone());
}
} else {
libpq_urls = None
}
// Similarly for gRPC URLs
if let Some(url) = &first_pageserver.grpc_url {
if let Some(ref mut urls) = grpc_urls {
urls.push(url.clone());
}
} else {
grpc_urls = None
}
}
if let Some(libpq_urls) = libpq_urls {
writeln!(
file,
"# derived from compute spec's pageserver_conninfo field"
)?;
writeln!(
file,
"neon.pageserver_connstring={}",
escape_conf_value(&libpq_urls.join(","))
)?;
} else {
writeln!(file, "# no neon.pageserver_connstring")?;
}
if let Some(grpc_urls) = grpc_urls {
writeln!(
file,
"# derived from compute spec's pageserver_conninfo field"
)?;
writeln!(
file,
"neon.pageserver_grpc_urls={}",
escape_conf_value(&grpc_urls.join(","))
)?;
} else {
writeln!(file, "# no neon.pageserver_grpc_urls")?;
}
} else {
// Stripe size GUC should be defined prior to connection string
if let Some(stripe_size) = spec.shard_stripe_size {
writeln!(file, "# from compute spec's shard_stripe_size field")?;
writeln!(file, "neon.stripe_size={stripe_size}")?;
}
if let Some(s) = &spec.pageserver_connstring {
writeln!(file, "# from compute spec's pageserver_connstring field")?;
writeln!(file, "neon.pageserver_connstring={}", escape_conf_value(s))?;
}
if let Some(s) = &spec.pageserver_connstring {
writeln!(file, "neon.pageserver_connstring={}", escape_conf_value(s))?;
}
if !spec.safekeeper_connstrings.is_empty() {
let mut neon_safekeepers_value = String::new();
tracing::info!(
@@ -373,6 +292,24 @@ pub fn write_postgres_conf(
writeln!(file, "log_destination='stderr,syslog'")?;
}
if lakebase_mode {
// Explicitly set the port based on the connstr, overriding any previous port setting.
// Note: It is important that we don't specify a different port again after this.
let port = postgres_port.expect("port must be present in connstr");
writeln!(file, "port = {port}")?;
// This is databricks specific settings.
// This should be at the end of the file but before `compute_ctl_temp_override.conf` below
// so that it can override any settings above.
// `compute_ctl_temp_override.conf` is intended to override any settings above during specific operations.
// To prevent potential breakage in the future, we keep it above `compute_ctl_temp_override.conf`.
writeln!(file, "# Databricks settings start")?;
if let Some(settings) = databricks_settings {
writeln!(file, "{}", settings.as_pg_settings())?;
}
writeln!(file, "# Databricks settings end")?;
}
// This is essential to keep this line at the end of the file,
// because it is intended to override any settings above.
writeln!(file, "include_if_exists = 'compute_ctl_temp_override.conf'")?;

View File

@@ -1,23 +1,40 @@
use std::sync::Arc;
use std::fs::File;
use std::thread;
use std::{path::Path, sync::Arc};
use compute_api::responses::ComputeStatus;
use anyhow::Result;
use compute_api::responses::{ComputeConfig, ComputeStatus};
use tracing::{error, info, instrument};
use crate::compute::ComputeNode;
use crate::compute::{ComputeNode, ParsedSpec};
use crate::spec::get_config_from_control_plane;
#[instrument(skip_all)]
fn configurator_main_loop(compute: &Arc<ComputeNode>) {
info!("waiting for reconfiguration requests");
loop {
let mut state = compute.state.lock().unwrap();
/* BEGIN_HADRON */
// RefreshConfiguration should only be used inside the loop
assert_ne!(state.status, ComputeStatus::RefreshConfiguration);
/* END_HADRON */
// We have to re-check the status after re-acquiring the lock because it could be that
// the status has changed while we were waiting for the lock, and we might not need to
// wait on the condition variable. Otherwise, we might end up in some soft-/deadlock, i.e.
// we are waiting for a condition variable that will never be signaled.
if state.status != ComputeStatus::ConfigurationPending {
state = compute.state_changed.wait(state).unwrap();
if compute.params.lakebase_mode {
while state.status != ComputeStatus::ConfigurationPending
&& state.status != ComputeStatus::RefreshConfigurationPending
&& state.status != ComputeStatus::Failed
{
info!("configurator: compute status: {:?}, sleeping", state.status);
state = compute.state_changed.wait(state).unwrap();
}
} else {
// We have to re-check the status after re-acquiring the lock because it could be that
// the status has changed while we were waiting for the lock, and we might not need to
// wait on the condition variable. Otherwise, we might end up in some soft-/deadlock, i.e.
// we are waiting for a condition variable that will never be signaled.
if state.status != ComputeStatus::ConfigurationPending {
state = compute.state_changed.wait(state).unwrap();
}
}
// Re-check the status after waking up
@@ -37,6 +54,133 @@ fn configurator_main_loop(compute: &Arc<ComputeNode>) {
// XXX: used to test that API is blocking
// std::thread::sleep(std::time::Duration::from_millis(10000));
compute.set_status(new_status);
} else if state.status == ComputeStatus::RefreshConfigurationPending {
info!(
"compute node suspects its configuration is out of date, now refreshing configuration"
);
state.set_status(ComputeStatus::RefreshConfiguration, &compute.state_changed);
// Drop the lock guard here to avoid holding the lock while downloading config from the control plane / HCC.
// This is the only thread that can move compute_ctl out of the `RefreshConfiguration` state, so it
// is safe to drop the lock like this.
drop(state);
let get_config_result: anyhow::Result<ComputeConfig> =
if let Some(config_path) = &compute.params.config_path_test_only {
// This path is only to make testing easier. In production we always get the config from the HCC.
info!(
"reloading config.json from path: {}",
config_path.to_string_lossy()
);
let path = Path::new(config_path);
if let Ok(file) = File::open(path) {
match serde_json::from_reader::<File, ComputeConfig>(file) {
Ok(config) => Ok(config),
Err(e) => {
error!("could not parse config file: {}", e);
Err(anyhow::anyhow!("could not parse config file: {}", e))
}
}
} else {
error!(
"could not open config file at path: {:?}",
config_path.to_string_lossy()
);
Err(anyhow::anyhow!(
"could not open config file at path: {}",
config_path.to_string_lossy()
))
}
} else if let Some(control_plane_uri) = &compute.params.control_plane_uri {
get_config_from_control_plane(control_plane_uri, &compute.params.compute_id)
} else {
Err(anyhow::anyhow!("config_path_test_only is not set"))
};
// Parse any received ComputeSpec and transpose the result into a Result<Option<ParsedSpec>>.
let parsed_spec_result: Result<Option<ParsedSpec>> =
get_config_result.and_then(|config| {
if let Some(spec) = config.spec {
if let Ok(pspec) = ParsedSpec::try_from(spec) {
Ok(Some(pspec))
} else {
Err(anyhow::anyhow!("could not parse spec"))
}
} else {
Ok(None)
}
});
let new_status: ComputeStatus;
match parsed_spec_result {
// Control plane (HCM) returned a spec and we were able to parse it.
Ok(Some(pspec)) => {
{
let mut state = compute.state.lock().unwrap();
// Defensive programming to make sure this thread is indeed the only one that can move the compute
// node out of the `RefreshConfiguration` state. Would be nice if we can encode this invariant
// into the type system.
assert_eq!(state.status, ComputeStatus::RefreshConfiguration);
if state.pspec.as_ref().map(|ps| ps.pageserver_connstr.clone())
== Some(pspec.pageserver_connstr.clone())
{
info!(
"Refresh configuration: Retrieved spec is the same as the current spec. Waiting for control plane to update the spec before attempting reconfiguration."
);
state.status = ComputeStatus::Running;
compute.state_changed.notify_all();
drop(state);
std::thread::sleep(std::time::Duration::from_secs(5));
continue;
}
// state.pspec is consumed by compute.reconfigure() below. Note that compute.reconfigure() will acquire
// the compute.state lock again so we need to have the lock guard go out of scope here. We could add a
// "locked" variant of compute.reconfigure() that takes the lock guard as an argument to make this cleaner,
// but it's not worth forking the codebase too much for this minor point alone right now.
state.pspec = Some(pspec);
}
match compute.reconfigure() {
Ok(_) => {
info!("Refresh configuration: compute node configured");
new_status = ComputeStatus::Running;
}
Err(e) => {
error!(
"Refresh configuration: could not configure compute node: {}",
e
);
// Set the compute node back to the `RefreshConfigurationPending` state if the configuration
// was not successful. It should be okay to treat this situation the same as if the loop
// hasn't executed yet as long as the detection side keeps notifying.
new_status = ComputeStatus::RefreshConfigurationPending;
}
}
}
// Control plane (HCM)'s response does not contain a spec. This is the "Empty" attachment case.
Ok(None) => {
info!(
"Compute Manager signaled that this compute is no longer attached to any storage. Exiting."
);
// We just immediately terminate the whole compute_ctl in this case. It's not necessary to attempt a
// clean shutdown as Postgres is probably not responding anyway (which is why we are in this refresh
// configuration state).
std::process::exit(1);
}
// Various error cases:
// - The request to the control plane (HCM) either failed or returned a malformed spec.
// - compute_ctl itself is configured incorrectly (e.g., compute_id is not set).
Err(e) => {
error!(
"Refresh configuration: error getting a parsed spec: {:?}",
e
);
new_status = ComputeStatus::RefreshConfigurationPending;
// We may be dealing with an overloaded HCM if we end up in this path. Backoff 5 seconds before
// retrying to avoid hammering the HCM.
std::thread::sleep(std::time::Duration::from_secs(5));
}
}
compute.set_status(new_status);
} else if state.status == ComputeStatus::Failed {
info!("compute node is now in Failed state, exiting");

View File

@@ -0,0 +1,60 @@
use metrics::{
IntCounter, IntGaugeVec, core::Collector, proto::MetricFamily, register_int_counter,
register_int_gauge_vec,
};
use once_cell::sync::Lazy;
// Counter keeping track of the number of PageStream request errors reported by Postgres.
// An error is registered every time Postgres calls compute_ctl's /refresh_configuration API.
// Postgres will invoke this API if it detected trouble with PageStream requests (get_page@lsn,
// get_base_backup, etc.) it sends to any pageserver. An increase in this counter value typically
// indicates Postgres downtime, as PageStream requests are critical for Postgres to function.
pub static POSTGRES_PAGESTREAM_REQUEST_ERRORS: Lazy<IntCounter> = Lazy::new(|| {
register_int_counter!(
"pg_cctl_pagestream_request_errors_total",
"Number of PageStream request errors reported by the postgres process"
)
.expect("failed to define a metric")
});
// Counter keeping track of the number of compute configuration errors due to Postgres statement
// timeouts. An error is registered every time `ComputeNode::reconfigure()` fails due to Postgres
// error code 57014 (query cancelled). This statement timeout typically occurs when postgres is
// stuck in a problematic retry loop when the PS is reject its connection requests (usually due
// to PG pointing at the wrong PS). We should investigate the root cause when this counter value
// increases by checking PG and PS logs.
pub static COMPUTE_CONFIGURE_STATEMENT_TIMEOUT_ERRORS: Lazy<IntCounter> = Lazy::new(|| {
register_int_counter!(
"pg_cctl_configure_statement_timeout_errors_total",
"Number of compute configuration errors due to Postgres statement timeouts."
)
.expect("failed to define a metric")
});
pub static COMPUTE_ATTACHED: Lazy<IntGaugeVec> = Lazy::new(|| {
register_int_gauge_vec!(
"pg_cctl_attached",
"Compute node attached status (1 if attached)",
&[
"pg_compute_id",
"pg_instance_id",
"tenant_id",
"timeline_id"
]
)
.expect("failed to define a metric")
});
pub fn collect() -> Vec<MetricFamily> {
let mut metrics = Vec::new();
metrics.extend(POSTGRES_PAGESTREAM_REQUEST_ERRORS.collect());
metrics.extend(COMPUTE_CONFIGURE_STATEMENT_TIMEOUT_ERRORS.collect());
metrics.extend(COMPUTE_ATTACHED.collect());
metrics
}
pub fn initialize_metrics() {
Lazy::force(&POSTGRES_PAGESTREAM_REQUEST_ERRORS);
Lazy::force(&COMPUTE_CONFIGURE_STATEMENT_TIMEOUT_ERRORS);
Lazy::force(&COMPUTE_ATTACHED);
}

View File

@@ -16,13 +16,29 @@ use crate::http::JsonResponse;
#[derive(Clone, Debug)]
pub(in crate::http) struct Authorize {
compute_id: String,
// BEGIN HADRON
// Hadron instance ID. Only set if it's a Lakebase V1 a.k.a. Hadron instance.
instance_id: Option<String>,
// END HADRON
jwks: JwkSet,
validation: Validation,
}
impl Authorize {
pub fn new(compute_id: String, jwks: JwkSet) -> Self {
pub fn new(compute_id: String, instance_id: Option<String>, jwks: JwkSet) -> Self {
let mut validation = Validation::new(Algorithm::EdDSA);
// BEGIN HADRON
let use_rsa = jwks.keys.iter().any(|jwk| {
jwk.common
.key_algorithm
.is_some_and(|alg| alg == jsonwebtoken::jwk::KeyAlgorithm::RS256)
});
if use_rsa {
validation = Validation::new(Algorithm::RS256);
}
// END HADRON
validation.validate_exp = true;
// Unused by the control plane
validation.validate_nbf = false;
@@ -34,6 +50,7 @@ impl Authorize {
Self {
compute_id,
instance_id,
jwks,
validation,
}
@@ -47,10 +64,20 @@ impl AsyncAuthorizeRequest<Body> for Authorize {
fn authorize(&mut self, mut request: Request<Body>) -> Self::Future {
let compute_id = self.compute_id.clone();
let is_hadron_instance = self.instance_id.is_some();
let jwks = self.jwks.clone();
let validation = self.validation.clone();
Box::pin(async move {
// BEGIN HADRON
// In Hadron deployments the "external" HTTP endpoint on compute_ctl can only be
// accessed by trusted components (enforced by dblet network policy), so we can bypass
// all auth here.
if is_hadron_instance {
return Ok(request);
}
// END HADRON
let TypedHeader(Authorization(bearer)) = request
.extract_parts::<TypedHeader<Authorization<Bearer>>>()
.await

View File

@@ -96,7 +96,7 @@ paths:
content:
application/json:
schema:
$ref: "#/components/schemas/SafekeepersLsn"
$ref: "#/components/schemas/ComputeSchemaWithLsn"
responses:
200:
description: Promote succeeded or wasn't started
@@ -297,14 +297,7 @@ paths:
content:
application/json:
schema:
type: object
required:
- spec
properties:
spec:
# XXX: I don't want to explain current spec in the OpenAPI format,
# as it could be changed really soon. Consider doing it later.
type: object
$ref: "#/components/schemas/ComputeSchema"
responses:
200:
description: Compute configuration finished.
@@ -591,18 +584,25 @@ components:
type: string
example: "1.0.0"
SafekeepersLsn:
ComputeSchema:
type: object
required:
- safekeepers
- spec
properties:
spec:
type: object
ComputeSchemaWithLsn:
type: object
required:
- spec
- wal_flush_lsn
properties:
safekeepers:
description: Primary replica safekeepers
type: string
spec:
$ref: "#/components/schemas/ComputeState"
wal_flush_lsn:
description: Primary last WAL flush LSN
type: string
description: "last WAL flush LSN"
example: "0/028F10D8"
LfcPrewarmState:
type: object

View File

@@ -43,7 +43,12 @@ pub(in crate::http) async fn configure(
// configure request for tracing purposes.
state.startup_span = Some(tracing::Span::current());
state.pspec = Some(pspec);
if compute.params.lakebase_mode {
ComputeNode::set_spec(&compute.params, &mut state, pspec);
} else {
state.pspec = Some(pspec);
}
state.set_status(ComputeStatus::ConfigurationPending, &compute.state_changed);
drop(state);
}

View File

@@ -0,0 +1,34 @@
use crate::pg_isready::pg_isready;
use crate::{compute::ComputeNode, http::JsonResponse};
use axum::{extract::State, http::StatusCode, response::Response};
use std::sync::Arc;
/// NOTE: NOT ENABLED YET
/// Detect if the compute is alive.
/// Called by the liveness probe of the compute container.
pub(in crate::http) async fn hadron_liveness_probe(
State(compute): State<Arc<ComputeNode>>,
) -> Response {
let port = match compute.params.connstr.port() {
Some(port) => port,
None => {
return JsonResponse::error(
StatusCode::INTERNAL_SERVER_ERROR,
"Failed to get the port from the connection string",
);
}
};
match pg_isready(&compute.params.pg_isready_bin, port) {
Ok(_) => {
// The connection is successful, so the compute is alive.
// Return a 200 OK response.
JsonResponse::success(StatusCode::OK, "ok")
}
Err(e) => {
tracing::error!("Hadron liveness probe failed: {}", e);
// The connection failed, so the compute is not alive.
// Return a 500 Internal Server Error response.
JsonResponse::error(StatusCode::INTERNAL_SERVER_ERROR, e)
}
}
}

View File

@@ -13,6 +13,7 @@ use metrics::{Encoder, TextEncoder};
use crate::communicator_socket_client::connect_communicator_socket;
use crate::compute::ComputeNode;
use crate::hadron_metrics;
use crate::http::JsonResponse;
use crate::metrics::collect;
@@ -21,11 +22,18 @@ pub(in crate::http) async fn get_metrics() -> Response {
// When we call TextEncoder::encode() below, it will immediately return an
// error if a metric family has no metrics, so we need to preemptively
// filter out metric families with no metrics.
let metrics = collect()
let mut metrics = collect()
.into_iter()
.filter(|m| !m.get_metric().is_empty())
.collect::<Vec<MetricFamily>>();
// Add Hadron metrics.
let hadron_metrics: Vec<MetricFamily> = hadron_metrics::collect()
.into_iter()
.filter(|m| !m.get_metric().is_empty())
.collect();
metrics.extend(hadron_metrics);
let encoder = TextEncoder::new();
let mut buffer = vec![];

View File

@@ -10,11 +10,13 @@ pub(in crate::http) mod extension_server;
pub(in crate::http) mod extensions;
pub(in crate::http) mod failpoints;
pub(in crate::http) mod grants;
pub(in crate::http) mod hadron_liveness_probe;
pub(in crate::http) mod insights;
pub(in crate::http) mod lfc;
pub(in crate::http) mod metrics;
pub(in crate::http) mod metrics_json;
pub(in crate::http) mod promote;
pub(in crate::http) mod refresh_configuration;
pub(in crate::http) mod status;
pub(in crate::http) mod terminate;

View File

@@ -1,14 +1,14 @@
use crate::http::JsonResponse;
use axum::Form;
use axum::extract::Json;
use http::StatusCode;
pub(in crate::http) async fn promote(
compute: axum::extract::State<std::sync::Arc<crate::compute::ComputeNode>>,
Form(safekeepers_lsn): Form<compute_api::responses::SafekeepersLsn>,
Json(cfg): Json<compute_api::responses::PromoteConfig>,
) -> axum::response::Response {
let state = compute.promote(safekeepers_lsn).await;
if let compute_api::responses::PromoteState::Failed { error } = state {
return JsonResponse::error(StatusCode::INTERNAL_SERVER_ERROR, error);
let state = compute.promote(cfg).await;
if let compute_api::responses::PromoteState::Failed { error: _ } = state {
return JsonResponse::create_response(StatusCode::INTERNAL_SERVER_ERROR, state);
}
JsonResponse::success(StatusCode::OK, state)
}

View File

@@ -0,0 +1,29 @@
// This file is added by Hadron
use std::sync::Arc;
use axum::{
extract::State,
response::{IntoResponse, Response},
};
use http::StatusCode;
use crate::compute::ComputeNode;
use crate::hadron_metrics::POSTGRES_PAGESTREAM_REQUEST_ERRORS;
use crate::http::JsonResponse;
/// The /refresh_configuration POST method is used to nudge compute_ctl to pull a new spec
/// from the HCC and attempt to reconfigure Postgres with the new spec. The method does not wait
/// for the reconfiguration to complete. Rather, it simply delivers a signal that will cause
/// configuration to be reloaded in a best effort manner. Invocation of this method does not
/// guarantee that a reconfiguration will occur. The caller should consider keep sending this
/// request while it believes that the compute configuration is out of date.
pub(in crate::http) async fn refresh_configuration(
State(compute): State<Arc<ComputeNode>>,
) -> Response {
POSTGRES_PAGESTREAM_REQUEST_ERRORS.inc();
match compute.signal_refresh_configuration().await {
Ok(_) => StatusCode::OK.into_response(),
Err(e) => JsonResponse::error(StatusCode::INTERNAL_SERVER_ERROR, e),
}
}

View File

@@ -1,7 +1,7 @@
use crate::compute::{ComputeNode, forward_termination_signal};
use crate::http::JsonResponse;
use axum::extract::State;
use axum::response::Response;
use axum::response::{IntoResponse, Response};
use axum_extra::extract::OptionalQuery;
use compute_api::responses::{ComputeStatus, TerminateMode, TerminateResponse};
use http::StatusCode;
@@ -33,7 +33,29 @@ pub(in crate::http) async fn terminate(
if !matches!(state.status, ComputeStatus::Empty | ComputeStatus::Running) {
return JsonResponse::invalid_status(state.status);
}
// If compute is Empty, there's no Postgres to terminate. The regular compute_ctl termination path
// assumes Postgres to be configured and running, so we just special-handle this case by exiting
// the process directly.
if compute.params.lakebase_mode && state.status == ComputeStatus::Empty {
drop(state);
info!("terminating empty compute - will exit process");
// Queue a task to exit the process after 5 seconds. The 5-second delay aims to
// give enough time for the HTTP response to be sent so that HCM doesn't get an abrupt
// connection termination.
tokio::spawn(async {
tokio::time::sleep(tokio::time::Duration::from_secs(5)).await;
info!("exiting process after terminating empty compute");
std::process::exit(0);
});
return StatusCode::OK.into_response();
}
// For Running status, proceed with normal termination
state.set_status(mode.into(), &compute.state_changed);
drop(state);
}
forward_termination_signal(false);

View File

@@ -23,7 +23,8 @@ use super::{
middleware::authorize::Authorize,
routes::{
check_writability, configure, database_schema, dbs_and_roles, extension_server, extensions,
grants, insights, lfc, metrics, metrics_json, promote, status, terminate,
grants, hadron_liveness_probe, insights, lfc, metrics, metrics_json, promote,
refresh_configuration, status, terminate,
},
};
use crate::compute::ComputeNode;
@@ -43,6 +44,7 @@ pub enum Server {
port: u16,
config: ComputeCtlConfig,
compute_id: String,
instance_id: Option<String>,
},
}
@@ -67,7 +69,12 @@ impl From<&Server> for Router<Arc<ComputeNode>> {
post(extension_server::download_extension),
)
.route("/extensions", post(extensions::install_extension))
.route("/grants", post(grants::add_grant));
.route("/grants", post(grants::add_grant))
// Hadron: Compute-initiated configuration refresh
.route(
"/refresh_configuration",
post(refresh_configuration::refresh_configuration),
);
// Add in any testing support
if cfg!(feature = "testing") {
@@ -79,7 +86,10 @@ impl From<&Server> for Router<Arc<ComputeNode>> {
router
}
Server::External {
config, compute_id, ..
config,
compute_id,
instance_id,
..
} => {
let unauthenticated_router = Router::<Arc<ComputeNode>>::new()
.route("/metrics", get(metrics::get_metrics))
@@ -100,8 +110,13 @@ impl From<&Server> for Router<Arc<ComputeNode>> {
.route("/metrics.json", get(metrics_json::get_metrics))
.route("/status", get(status::get_status))
.route("/terminate", post(terminate::terminate))
.route(
"/hadron_liveness_probe",
get(hadron_liveness_probe::hadron_liveness_probe),
)
.layer(AsyncRequireAuthorizationLayer::new(Authorize::new(
compute_id.clone(),
instance_id.clone(),
config.jwks.clone(),
)));

View File

@@ -2,6 +2,7 @@ use std::collections::HashMap;
use anyhow::Result;
use compute_api::responses::{InstalledExtension, InstalledExtensions};
use once_cell::sync::Lazy;
use tokio_postgres::error::Error as PostgresError;
use tokio_postgres::{Client, Config, NoTls};
@@ -119,3 +120,7 @@ pub async fn get_installed_extensions(
extensions: extensions_map.into_values().collect(),
})
}
pub fn initialize_metrics() {
Lazy::force(&INSTALLED_EXTENSIONS);
}

View File

@@ -16,6 +16,7 @@ pub mod compute_prewarm;
pub mod compute_promote;
pub mod disk_quota;
pub mod extension_server;
pub mod hadron_metrics;
pub mod installed_extensions;
pub mod local_proxy;
pub mod lsn_lease;
@@ -24,6 +25,7 @@ mod migration;
pub mod monitor;
pub mod params;
pub mod pg_helpers;
pub mod pg_isready;
pub mod pgbouncer;
pub mod rsyslog;
pub mod spec;

View File

@@ -1,7 +1,10 @@
use std::collections::HashMap;
use std::sync::{LazyLock, RwLock};
use tracing::Subscriber;
use tracing::info;
use tracing_subscriber::layer::SubscriberExt;
use tracing_appender;
use tracing_subscriber::prelude::*;
use tracing_subscriber::{fmt, layer::SubscriberExt, registry::LookupSpan};
/// Initialize logging to stderr, and OpenTelemetry tracing and exporter.
///
@@ -15,16 +18,44 @@ use tracing_subscriber::prelude::*;
///
pub fn init_tracing_and_logging(
default_log_level: &str,
) -> anyhow::Result<Option<tracing_utils::Provider>> {
log_dir_opt: &Option<String>,
) -> anyhow::Result<(
Option<tracing_utils::Provider>,
Option<tracing_appender::non_blocking::WorkerGuard>,
)> {
// Initialize Logging
let env_filter = tracing_subscriber::EnvFilter::try_from_default_env()
.unwrap_or_else(|_| tracing_subscriber::EnvFilter::new(default_log_level));
// Standard output streams
let fmt_layer = tracing_subscriber::fmt::layer()
.with_ansi(false)
.with_target(false)
.with_writer(std::io::stderr);
// Logs with file rotation. Files in `$log_dir/pgcctl.yyyy-MM-dd`
let (json_to_file_layer, _file_logs_guard) = if let Some(log_dir) = log_dir_opt {
std::fs::create_dir_all(log_dir)?;
let file_logs_appender = tracing_appender::rolling::RollingFileAppender::builder()
.rotation(tracing_appender::rolling::Rotation::DAILY)
.filename_prefix("pgcctl")
// Lib appends to existing files, so we will keep files for up to 2 days even on restart loops.
// At minimum, log-daemon will have 1 day to detect and upload a file (if created right before midnight).
.max_log_files(2)
.build(log_dir)
.expect("Initializing rolling file appender should succeed");
let (file_logs_writer, _file_logs_guard) =
tracing_appender::non_blocking(file_logs_appender);
let json_to_file_layer = tracing_subscriber::fmt::layer()
.with_ansi(false)
.with_target(false)
.event_format(PgJsonLogShapeFormatter)
.with_writer(file_logs_writer);
(Some(json_to_file_layer), Some(_file_logs_guard))
} else {
(None, None)
};
// Initialize OpenTelemetry
let provider =
tracing_utils::init_tracing("compute_ctl", tracing_utils::ExportConfig::default());
@@ -35,12 +66,13 @@ pub fn init_tracing_and_logging(
.with(env_filter)
.with(otlp_layer)
.with(fmt_layer)
.with(json_to_file_layer)
.init();
tracing::info!("logging and tracing started");
utils::logging::replace_panic_hook_with_tracing_panic_hook().forget();
Ok(provider)
Ok((provider, _file_logs_guard))
}
/// Replace all newline characters with a special character to make it
@@ -95,3 +127,157 @@ pub fn startup_context_from_env() -> Option<opentelemetry::Context> {
None
}
}
/// Track relevant id's
const UNKNOWN_IDS: &str = r#""pg_instance_id": "", "pg_compute_id": """#;
static IDS: LazyLock<RwLock<String>> = LazyLock::new(|| RwLock::new(UNKNOWN_IDS.to_string()));
pub fn update_ids(instance_id: &Option<String>, compute_id: &Option<String>) -> anyhow::Result<()> {
let ids = format!(
r#""pg_instance_id": "{}", "pg_compute_id": "{}""#,
instance_id.as_ref().map(|s| s.as_str()).unwrap_or_default(),
compute_id.as_ref().map(|s| s.as_str()).unwrap_or_default()
);
let mut guard = IDS
.write()
.map_err(|e| anyhow::anyhow!("Log set id's rwlock poisoned: {}", e))?;
*guard = ids;
Ok(())
}
/// Massage compute_ctl logs into PG json log shape so we can use the same Lumberjack setup.
struct PgJsonLogShapeFormatter;
impl<S, N> fmt::format::FormatEvent<S, N> for PgJsonLogShapeFormatter
where
S: Subscriber + for<'a> LookupSpan<'a>,
N: for<'a> fmt::format::FormatFields<'a> + 'static,
{
fn format_event(
&self,
ctx: &fmt::FmtContext<'_, S, N>,
mut writer: fmt::format::Writer<'_>,
event: &tracing::Event<'_>,
) -> std::fmt::Result {
// Format values from the event's metadata, and open message string
let metadata = event.metadata();
{
let ids_guard = IDS.read();
let ids = ids_guard
.as_ref()
.map(|guard| guard.as_str())
// Surpress so that we don't lose all uploaded/ file logs if something goes super wrong. We would notice the missing id's.
.unwrap_or(UNKNOWN_IDS);
write!(
&mut writer,
r#"{{"timestamp": "{}", "error_severity": "{}", "file_name": "{}", "backend_type": "compute_ctl_self", {}, "message": "#,
chrono::Utc::now().format("%Y-%m-%d %H:%M:%S%.3f GMT"),
metadata.level(),
metadata.target(),
ids
)?;
}
let mut message = String::new();
let message_writer = fmt::format::Writer::new(&mut message);
// Gather the message
ctx.field_format().format_fields(message_writer, event)?;
// TODO: any better options than to copy-paste this OSS span formatter?
// impl<S, N, T> FormatEvent<S, N> for Format<Full, T>
// https://docs.rs/tracing-subscriber/latest/tracing_subscriber/fmt/trait.FormatEvent.html#impl-FormatEvent%3CS,+N%3E-for-Format%3CFull,+T%3E
// write message, close bracket, and new line
writeln!(writer, "{}}}", serde_json::to_string(&message).unwrap())
}
}
#[cfg(feature = "testing")]
#[cfg(test)]
mod test {
use super::*;
use std::{cell::RefCell, io};
// Use thread_local! instead of Mutex for test isolation
thread_local! {
static WRITER_OUTPUT: RefCell<String> = const { RefCell::new(String::new()) };
}
#[derive(Clone, Default)]
struct StaticStringWriter;
impl io::Write for StaticStringWriter {
fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
let output = String::from_utf8(buf.to_vec()).expect("Invalid UTF-8 in test output");
WRITER_OUTPUT.with(|s| s.borrow_mut().push_str(&output));
Ok(buf.len())
}
fn flush(&mut self) -> io::Result<()> {
Ok(())
}
}
impl fmt::MakeWriter<'_> for StaticStringWriter {
type Writer = Self;
fn make_writer(&self) -> Self::Writer {
Self
}
}
#[test]
fn test_log_pg_json_shape_formatter() {
// Use a scoped subscriber to prevent global state pollution
let subscriber = tracing_subscriber::registry().with(
tracing_subscriber::fmt::layer()
.with_ansi(false)
.with_target(false)
.event_format(PgJsonLogShapeFormatter)
.with_writer(StaticStringWriter),
);
let _ = update_ids(&Some("000".to_string()), &Some("111".to_string()));
// Clear any previous test state
WRITER_OUTPUT.with(|s| s.borrow_mut().clear());
let messages = [
"test message",
r#"json escape check: name="BatchSpanProcessor.Flush.ExportError" reason="Other(reqwest::Error { kind: Request, url: \"http://localhost:4318/v1/traces\", source: hyper_
util::client::legacy::Error(Connect, ConnectError(\"tcp connect error\", Os { code: 111, kind: ConnectionRefused, message: \"Connection refused\" })) })" Failed during the export process"#,
];
tracing::subscriber::with_default(subscriber, || {
for message in messages {
tracing::info!(message);
}
});
tracing::info!("not test message");
// Get captured output
let output = WRITER_OUTPUT.with(|s| s.borrow().clone());
let json_strings: Vec<&str> = output.lines().collect();
assert_eq!(
json_strings.len(),
messages.len(),
"Log didn't have the expected number of json strings."
);
let json_string_shape_regex = regex::Regex::new(
r#"\{"timestamp": "\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} GMT", "error_severity": "INFO", "file_name": ".+", "backend_type": "compute_ctl_self", "pg_instance_id": "000", "pg_compute_id": "111", "message": ".+"\}"#
).unwrap();
for (i, expected_message) in messages.iter().enumerate() {
let json_string = json_strings[i];
assert!(
json_string_shape_regex.is_match(json_string),
"Json log didn't match expected pattern:\n{json_string}",
);
let parsed_json: serde_json::Value = serde_json::from_str(json_string).unwrap();
let actual_message = parsed_json["message"].as_str().unwrap();
assert_eq!(*expected_message, actual_message);
}
}
}

View File

@@ -4,13 +4,14 @@ use std::thread;
use std::time::{Duration, SystemTime};
use anyhow::{Result, bail};
use compute_api::spec::{ComputeMode, PageserverConnectionInfo, PageserverProtocol};
use compute_api::spec::{ComputeMode, PageserverProtocol};
use itertools::Itertools as _;
use pageserver_page_api as page_api;
use postgres::{NoTls, SimpleQueryMessage};
use tracing::{info, warn};
use utils::id::{TenantId, TimelineId};
use utils::lsn::Lsn;
use utils::shard::TenantShardId;
use utils::shard::{ShardCount, ShardNumber, TenantShardId};
use crate::compute::ComputeNode;
@@ -77,16 +78,17 @@ fn acquire_lsn_lease_with_retry(
loop {
// Note: List of pageservers is dynamic, need to re-read configs before each attempt.
let (conninfo, auth) = {
let (connstrings, auth) = {
let state = compute.state.lock().unwrap();
let spec = state.pspec.as_ref().expect("spec must be set");
(
spec.pageserver_conninfo.clone(),
spec.pageserver_connstr.clone(),
spec.storage_auth_token.clone(),
)
};
let result = try_acquire_lsn_lease(conninfo, auth.as_deref(), tenant_id, timeline_id, lsn);
let result =
try_acquire_lsn_lease(&connstrings, auth.as_deref(), tenant_id, timeline_id, lsn);
match result {
Ok(Some(res)) => {
return Ok(res);
@@ -110,44 +112,35 @@ fn acquire_lsn_lease_with_retry(
/// Tries to acquire LSN leases on all Pageserver shards.
fn try_acquire_lsn_lease(
conninfo: PageserverConnectionInfo,
connstrings: &str,
auth: Option<&str>,
tenant_id: TenantId,
timeline_id: TimelineId,
lsn: Lsn,
) -> Result<Option<SystemTime>> {
let connstrings = connstrings.split(',').collect_vec();
let shard_count = connstrings.len();
let mut leases = Vec::new();
for (shard_index, shard) in conninfo.shards.into_iter() {
let tenant_shard_id = TenantShardId {
tenant_id,
shard_number: shard_index.shard_number,
shard_count: shard_index.shard_count,
for (shard_number, &connstring) in connstrings.iter().enumerate() {
let tenant_shard_id = match shard_count {
0 | 1 => TenantShardId::unsharded(tenant_id),
shard_count => TenantShardId {
tenant_id,
shard_number: ShardNumber(shard_number as u8),
shard_count: ShardCount::new(shard_count as u8),
},
};
// XXX: If there are more than pageserver for the one shard, do we need to get a
// leas on all of them? Currently, that's what we assume, but this is hypothetical
// as of this writing, as we never pass the info for more than one pageserver per
// shard.
for pageserver in shard.pageservers {
let lease = match conninfo.prefer_protocol {
PageserverProtocol::Grpc => acquire_lsn_lease_grpc(
&pageserver.grpc_url.unwrap(),
auth,
tenant_shard_id,
timeline_id,
lsn,
)?,
PageserverProtocol::Libpq => acquire_lsn_lease_libpq(
&pageserver.libpq_url.unwrap(),
auth,
tenant_shard_id,
timeline_id,
lsn,
)?,
};
leases.push(lease);
}
let lease = match PageserverProtocol::from_connstring(connstring)? {
PageserverProtocol::Libpq => {
acquire_lsn_lease_libpq(connstring, auth, tenant_shard_id, timeline_id, lsn)?
}
PageserverProtocol::Grpc => {
acquire_lsn_lease_grpc(connstring, auth, tenant_shard_id, timeline_id, lsn)?
}
};
leases.push(lease);
}
Ok(leases.into_iter().min().flatten())

View File

@@ -0,0 +1,30 @@
use anyhow::{Context, anyhow};
// Run `/usr/local/bin/pg_isready -p {port}`
// Check the connectivity of PG
// Success means PG is listening on the port and accepting connections
// Note that PG does not need to authenticate the connection, nor reserve a connection quota for it.
// See https://www.postgresql.org/docs/current/app-pg-isready.html
pub fn pg_isready(bin: &str, port: u16) -> anyhow::Result<()> {
let child_result = std::process::Command::new(bin)
.arg("-p")
.arg(port.to_string())
.spawn();
child_result
.context("spawn() failed")
.and_then(|mut child| child.wait().context("wait() failed"))
.and_then(|status| match status.success() {
true => Ok(()),
false => Err(anyhow!("process exited with {status}")),
})
// wrap any prior error with the overall context that we couldn't run the command
.with_context(|| format!("could not run `{bin} --port {port}`"))
}
// It's safe to assume pg_isready is under the same directory with postgres,
// because it is a PG util bin installed along with postgres
pub fn get_pg_isready_bin(pgbin: &str) -> String {
let split = pgbin.split("/").collect::<Vec<&str>>();
split[0..split.len() - 1].join("/") + "/pg_isready"
}

View File

@@ -142,7 +142,7 @@ pub fn update_pg_hba(pgdata_path: &Path, databricks_pg_hba: Option<&String>) ->
// Update pg_hba to contains databricks specfic settings before adding neon settings
// PG uses the first record that matches to perform authentication, so we need to have
// our rules before the default ones from neon.
// See https://www.postgresql.org/docs/16/auth-pg-hba-conf.html
// See https://www.postgresql.org/docs/current/auth-pg-hba-conf.html
if let Some(databricks_pg_hba) = databricks_pg_hba {
if config::line_in_file(
&pghba_path,

View File

@@ -16,14 +16,9 @@ use std::time::Duration;
use anyhow::{Context, Result, anyhow, bail};
use clap::Parser;
use compute_api::requests::ComputeClaimsScope;
use compute_api::spec::{
ComputeMode, PageserverConnectionInfo, PageserverProtocol, PageserverShardInfo,
};
use compute_api::spec::{ComputeMode, PageserverProtocol};
use control_plane::broker::StorageBroker;
use control_plane::endpoint::{ComputeControlPlane, EndpointTerminateMode};
use control_plane::endpoint::{
pageserver_conf_to_shard_conn_info, tenant_locate_response_to_conn_info,
};
use control_plane::endpoint_storage::{ENDPOINT_STORAGE_DEFAULT_ADDR, EndpointStorage};
use control_plane::local_env;
use control_plane::local_env::{
@@ -49,6 +44,7 @@ use pageserver_api::models::{
};
use pageserver_api::shard::{DEFAULT_STRIPE_SIZE, ShardCount, ShardStripeSize, TenantShardId};
use postgres_backend::AuthType;
use postgres_connection::parse_host_port;
use safekeeper_api::membership::{SafekeeperGeneration, SafekeeperId};
use safekeeper_api::{
DEFAULT_HTTP_LISTEN_PORT as DEFAULT_SAFEKEEPER_HTTP_PORT,
@@ -56,11 +52,11 @@ use safekeeper_api::{
};
use storage_broker::DEFAULT_LISTEN_ADDR as DEFAULT_BROKER_ADDR;
use tokio::task::JoinSet;
use url::Host;
use utils::auth::{Claims, Scope};
use utils::id::{NodeId, TenantId, TenantTimelineId, TimelineId};
use utils::lsn::Lsn;
use utils::project_git_version;
use utils::shard::ShardIndex;
// Default id of a safekeeper node, if not specified on the command line.
const DEFAULT_SAFEKEEPER_ID: NodeId = NodeId(1);
@@ -564,7 +560,9 @@ enum EndpointCmd {
Create(EndpointCreateCmdArgs),
Start(EndpointStartCmdArgs),
Reconfigure(EndpointReconfigureCmdArgs),
RefreshConfiguration(EndpointRefreshConfigurationArgs),
Stop(EndpointStopCmdArgs),
UpdatePageservers(EndpointUpdatePageserversCmdArgs),
GenerateJwt(EndpointGenerateJwtCmdArgs),
}
@@ -725,6 +723,13 @@ struct EndpointReconfigureCmdArgs {
safekeepers: Option<String>,
}
#[derive(clap::Args)]
#[clap(about = "Refresh the endpoint's configuration by forcing it reload it's spec")]
struct EndpointRefreshConfigurationArgs {
#[clap(help = "Postgres endpoint id")]
endpoint_id: String,
}
#[derive(clap::Args)]
#[clap(about = "Stop an endpoint")]
struct EndpointStopCmdArgs {
@@ -742,6 +747,16 @@ struct EndpointStopCmdArgs {
mode: EndpointTerminateMode,
}
#[derive(clap::Args)]
#[clap(about = "Update the pageservers in the spec file of the compute endpoint")]
struct EndpointUpdatePageserversCmdArgs {
#[clap(help = "Postgres endpoint id")]
endpoint_id: String,
#[clap(short = 'p', long, help = "Specified pageserver id")]
pageserver_id: Option<NodeId>,
}
#[derive(clap::Args)]
#[clap(about = "Generate a JWT for an endpoint")]
struct EndpointGenerateJwtCmdArgs {
@@ -1521,7 +1536,7 @@ async fn handle_endpoint(subcmd: &EndpointCmd, env: &local_env::LocalEnv) -> Res
let endpoint = cplane
.endpoints
.get(endpoint_id.as_str())
.ok_or_else(|| anyhow::anyhow!("endpoint {endpoint_id} not found"))?;
.ok_or_else(|| anyhow!("endpoint {endpoint_id} not found"))?;
if !args.allow_multiple {
cplane.check_conflicting_endpoints(
@@ -1531,56 +1546,62 @@ async fn handle_endpoint(subcmd: &EndpointCmd, env: &local_env::LocalEnv) -> Res
)?;
}
let prefer_protocol = if endpoint.grpc {
PageserverProtocol::Grpc
} else {
PageserverProtocol::Libpq
};
let mut pageserver_conninfo = if let Some(ps_id) = pageserver_id {
let conf = env.get_pageserver_conf(ps_id).unwrap();
let ps_conninfo = pageserver_conf_to_shard_conn_info(conf)?;
let shard_info = PageserverShardInfo {
pageservers: vec![ps_conninfo],
let (pageservers, stripe_size) = if let Some(pageserver_id) = pageserver_id {
let conf = env.get_pageserver_conf(pageserver_id).unwrap();
// Use gRPC if requested.
let pageserver = if endpoint.grpc {
let grpc_addr = conf.listen_grpc_addr.as_ref().expect("bad config");
let (host, port) = parse_host_port(grpc_addr)?;
let port = port.unwrap_or(DEFAULT_PAGESERVER_GRPC_PORT);
(PageserverProtocol::Grpc, host, port)
} else {
let (host, port) = parse_host_port(&conf.listen_pg_addr)?;
let port = port.unwrap_or(5432);
(PageserverProtocol::Libpq, host, port)
};
// If caller is telling us what pageserver to use, this is not a tenant which is
// fully managed by storage controller, therefore not sharded.
let shards: HashMap<_, _> = vec![(ShardIndex::unsharded(), shard_info)]
.into_iter()
.collect();
PageserverConnectionInfo {
shard_count: ShardCount(0),
stripe_size: None,
shards,
prefer_protocol,
}
(vec![pageserver], DEFAULT_STRIPE_SIZE)
} else {
// Look up the currently attached location of the tenant, and its striping metadata,
// to pass these on to postgres.
let storage_controller = StorageController::from_env(env);
let locate_result = storage_controller.tenant_locate(endpoint.tenant_id).await?;
assert!(!locate_result.shards.is_empty());
// Initialize LSN leases for static computes.
if let ComputeMode::Static(lsn) = endpoint.mode {
futures::future::try_join_all(locate_result.shards.iter().map(
|shard| async move {
let pageservers = futures::future::try_join_all(
locate_result.shards.into_iter().map(|shard| async move {
if let ComputeMode::Static(lsn) = endpoint.mode {
// Initialize LSN leases for static computes.
let conf = env.get_pageserver_conf(shard.node_id).unwrap();
let pageserver = PageServerNode::from_env(env, conf);
pageserver
.http_client
.timeline_init_lsn_lease(shard.shard_id, endpoint.timeline_id, lsn)
.await
},
))
.await?;
}
.await?;
}
tenant_locate_response_to_conn_info(&locate_result)?
let pageserver = if endpoint.grpc {
(
PageserverProtocol::Grpc,
Host::parse(&shard.listen_grpc_addr.expect("no gRPC address"))?,
shard.listen_grpc_port.expect("no gRPC port"),
)
} else {
(
PageserverProtocol::Libpq,
Host::parse(&shard.listen_pg_addr)?,
shard.listen_pg_port,
)
};
anyhow::Ok(pageserver)
}),
)
.await?;
let stripe_size = locate_result.shard_params.stripe_size;
(pageservers, stripe_size)
};
pageserver_conninfo.prefer_protocol = prefer_protocol;
assert!(!pageservers.is_empty());
let ps_conf = env.get_pageserver_conf(DEFAULT_PAGESERVER_ID)?;
let auth_token = if matches!(ps_conf.pg_auth_type, AuthType::NeonJWT) {
@@ -1610,8 +1631,9 @@ async fn handle_endpoint(subcmd: &EndpointCmd, env: &local_env::LocalEnv) -> Res
endpoint_storage_addr,
safekeepers_generation,
safekeepers,
pageserver_conninfo,
pageservers,
remote_ext_base_url: remote_ext_base_url.clone(),
shard_stripe_size: stripe_size.0 as usize,
create_test_user: args.create_test_user,
start_timeout: args.start_timeout,
autoprewarm: args.autoprewarm,
@@ -1622,53 +1644,105 @@ async fn handle_endpoint(subcmd: &EndpointCmd, env: &local_env::LocalEnv) -> Res
println!("Starting existing endpoint {endpoint_id}...");
endpoint.start(args).await?;
}
EndpointCmd::UpdatePageservers(args) => {
let endpoint_id = &args.endpoint_id;
let endpoint = cplane
.endpoints
.get(endpoint_id.as_str())
.with_context(|| format!("postgres endpoint {endpoint_id} is not found"))?;
let pageservers = match args.pageserver_id {
Some(pageserver_id) => {
let pageserver =
PageServerNode::from_env(env, env.get_pageserver_conf(pageserver_id)?);
vec![(
PageserverProtocol::Libpq,
pageserver.pg_connection_config.host().clone(),
pageserver.pg_connection_config.port(),
)]
}
None => {
let storage_controller = StorageController::from_env(env);
storage_controller
.tenant_locate(endpoint.tenant_id)
.await?
.shards
.into_iter()
.map(|shard| {
(
PageserverProtocol::Libpq,
Host::parse(&shard.listen_pg_addr)
.expect("Storage controller reported malformed host"),
shard.listen_pg_port,
)
})
.collect::<Vec<_>>()
}
};
endpoint.update_pageservers_in_config(pageservers).await?;
}
EndpointCmd::Reconfigure(args) => {
let endpoint_id = &args.endpoint_id;
let endpoint = cplane
.endpoints
.get(endpoint_id.as_str())
.with_context(|| format!("postgres endpoint {endpoint_id} is not found"))?;
let prefer_protocol = if endpoint.grpc {
PageserverProtocol::Grpc
} else {
PageserverProtocol::Libpq
};
let mut pageserver_conninfo = if let Some(ps_id) = args.endpoint_pageserver_id {
let pageservers = if let Some(ps_id) = args.endpoint_pageserver_id {
let conf = env.get_pageserver_conf(ps_id)?;
let ps_conninfo = pageserver_conf_to_shard_conn_info(conf)?;
let shard_info = PageserverShardInfo {
pageservers: vec![ps_conninfo],
// Use gRPC if requested.
let pageserver = if endpoint.grpc {
let grpc_addr = conf.listen_grpc_addr.as_ref().expect("bad config");
let (host, port) = parse_host_port(grpc_addr)?;
let port = port.unwrap_or(DEFAULT_PAGESERVER_GRPC_PORT);
(PageserverProtocol::Grpc, host, port)
} else {
let (host, port) = parse_host_port(&conf.listen_pg_addr)?;
let port = port.unwrap_or(5432);
(PageserverProtocol::Libpq, host, port)
};
// If caller is telling us what pageserver to use, this is not a tenant which is
// fully managed by storage controller, therefore not sharded.
let shards: HashMap<_, _> = vec![(ShardIndex::unsharded(), shard_info)]
.into_iter()
.collect();
PageserverConnectionInfo {
shard_count: ShardCount::unsharded(),
stripe_size: None,
shards,
prefer_protocol,
}
vec![pageserver]
} else {
// Look up the currently attached location of the tenant, and its striping metadata,
// to pass these on to postgres.
let storage_controller = StorageController::from_env(env);
let locate_result = storage_controller.tenant_locate(endpoint.tenant_id).await?;
tenant_locate_response_to_conn_info(&locate_result)?
storage_controller
.tenant_locate(endpoint.tenant_id)
.await?
.shards
.into_iter()
.map(|shard| {
// Use gRPC if requested.
if endpoint.grpc {
(
PageserverProtocol::Grpc,
Host::parse(&shard.listen_grpc_addr.expect("no gRPC address"))
.expect("bad hostname"),
shard.listen_grpc_port.expect("no gRPC port"),
)
} else {
(
PageserverProtocol::Libpq,
Host::parse(&shard.listen_pg_addr).expect("bad hostname"),
shard.listen_pg_port,
)
}
})
.collect::<Vec<_>>()
};
pageserver_conninfo.prefer_protocol = prefer_protocol;
// If --safekeepers argument is given, use only the listed
// safekeeper nodes; otherwise all from the env.
let safekeepers = parse_safekeepers(&args.safekeepers)?;
endpoint
.reconfigure(Some(&pageserver_conninfo), safekeepers, None)
.reconfigure(Some(pageservers), None, safekeepers, None)
.await?;
}
EndpointCmd::RefreshConfiguration(args) => {
let endpoint_id = &args.endpoint_id;
let endpoint = cplane
.endpoints
.get(endpoint_id.as_str())
.with_context(|| format!("postgres endpoint {endpoint_id} is not found"))?;
endpoint.refresh_configuration().await?;
}
EndpointCmd::Stop(args) => {
let endpoint_id = &args.endpoint_id;
let endpoint = cplane

View File

@@ -37,7 +37,7 @@
//! <other PostgreSQL files>
//! ```
//!
use std::collections::{BTreeMap, HashMap};
use std::collections::BTreeMap;
use std::fmt::Display;
use std::net::{IpAddr, Ipv4Addr, SocketAddr, TcpStream};
use std::path::PathBuf;
@@ -58,12 +58,8 @@ use compute_api::responses::{
};
use compute_api::spec::{
Cluster, ComputeAudit, ComputeFeature, ComputeMode, ComputeSpec, Database, PageserverProtocol,
PageserverShardInfo, PgIdent, RemoteExtSpec, Role,
PgIdent, RemoteExtSpec, Role,
};
// re-export these, because they're used in the reconfigure() function
pub use compute_api::spec::{PageserverConnectionInfo, PageserverShardConnectionInfo};
use jsonwebtoken::jwk::{
AlgorithmParameters, CommonParameters, EllipticCurve, Jwk, JwkSet, KeyAlgorithm, KeyOperations,
OctetKeyPairParameters, OctetKeyPairType, PublicKeyUse,
@@ -78,11 +74,9 @@ use sha2::{Digest, Sha256};
use spki::der::Decode;
use spki::{SubjectPublicKeyInfo, SubjectPublicKeyInfoRef};
use tracing::debug;
use url::Host;
use utils::id::{NodeId, TenantId, TimelineId};
use utils::shard::{ShardIndex, ShardNumber};
use pageserver_api::config::DEFAULT_GRPC_LISTEN_PORT as DEFAULT_PAGESERVER_GRPC_PORT;
use postgres_connection::parse_host_port;
use utils::shard::ShardStripeSize;
use crate::local_env::LocalEnv;
use crate::postgresql_conf::PostgresConf;
@@ -393,8 +387,9 @@ pub struct EndpointStartArgs {
pub endpoint_storage_addr: String,
pub safekeepers_generation: Option<SafekeeperGeneration>,
pub safekeepers: Vec<NodeId>,
pub pageserver_conninfo: PageserverConnectionInfo,
pub pageservers: Vec<(PageserverProtocol, Host, u16)>,
pub remote_ext_base_url: Option<String>,
pub shard_stripe_size: usize,
pub create_test_user: bool,
pub start_timeout: Duration,
pub autoprewarm: bool,
@@ -667,6 +662,14 @@ impl Endpoint {
}
}
fn build_pageserver_connstr(pageservers: &[(PageserverProtocol, Host, u16)]) -> String {
pageservers
.iter()
.map(|(scheme, host, port)| format!("{scheme}://no_user@{host}:{port}"))
.collect::<Vec<_>>()
.join(",")
}
/// Map safekeepers ids to the actual connection strings.
fn build_safekeepers_connstrs(&self, sk_ids: Vec<NodeId>) -> Result<Vec<String>> {
let mut safekeeper_connstrings = Vec::new();
@@ -712,6 +715,9 @@ impl Endpoint {
std::fs::remove_dir_all(self.pgdata())?;
}
let pageserver_connstring = Self::build_pageserver_connstr(&args.pageservers);
assert!(!pageserver_connstring.is_empty());
let safekeeper_connstrings = self.build_safekeepers_connstrs(args.safekeepers)?;
// check for file remote_extensions_spec.json
@@ -726,45 +732,6 @@ impl Endpoint {
remote_extensions = None;
};
// For the sake of backwards-compatibility, also fill in 'pageserver_connstring'
//
// Use a closure so that we can conviniently return None in the middle of the
// loop.
let pageserver_connstring: Option<String> = (|| {
let num_shards = if args.pageserver_conninfo.shard_count.is_unsharded() {
1
} else {
args.pageserver_conninfo.shard_count.0
};
let mut connstrings = Vec::new();
for shard_no in 0..num_shards {
let shard_index = ShardIndex {
shard_count: args.pageserver_conninfo.shard_count,
shard_number: ShardNumber(shard_no),
};
let shard = args
.pageserver_conninfo
.shards
.get(&shard_index)
.ok_or_else(|| {
anyhow!(
"shard {} not found in pageserver_connection_info",
shard_index
)
})?;
let pageserver = shard
.pageservers
.first()
.ok_or(anyhow!("must have at least one pageserver"))?;
if let Some(libpq_url) = &pageserver.libpq_url {
connstrings.push(libpq_url.clone());
} else {
return Ok::<_, anyhow::Error>(None);
}
}
Ok(Some(connstrings.join(",")))
})()?;
// Create config file
let config = {
let mut spec = ComputeSpec {
@@ -809,14 +776,13 @@ impl Endpoint {
branch_id: None,
endpoint_id: Some(self.endpoint_id.clone()),
mode: self.mode,
pageserver_connection_info: Some(args.pageserver_conninfo.clone()),
pageserver_connstring,
pageserver_connstring: Some(pageserver_connstring),
safekeepers_generation: args.safekeepers_generation.map(|g| g.into_inner()),
safekeeper_connstrings,
storage_auth_token: args.auth_token.clone(),
remote_extensions,
pgbouncer_settings: None,
shard_stripe_size: args.pageserver_conninfo.stripe_size, // redundant with pageserver_connection_info.stripe_size
shard_stripe_size: Some(args.shard_stripe_size),
local_proxy_config: None,
reconfigure_concurrency: self.reconfigure_concurrency,
drop_subscriptions_before_start: self.drop_subscriptions_before_start,
@@ -827,6 +793,7 @@ impl Endpoint {
autoprewarm: args.autoprewarm,
offload_lfc_interval_seconds: args.offload_lfc_interval_seconds,
suspend_timeout_seconds: -1, // Only used in neon_local.
databricks_settings: None,
};
// this strange code is needed to support respec() in tests
@@ -971,7 +938,9 @@ impl Endpoint {
| ComputeStatus::Configuration
| ComputeStatus::TerminationPendingFast
| ComputeStatus::TerminationPendingImmediate
| ComputeStatus::Terminated => {
| ComputeStatus::Terminated
| ComputeStatus::RefreshConfigurationPending
| ComputeStatus::RefreshConfiguration => {
bail!("unexpected compute status: {:?}", state.status)
}
}
@@ -994,6 +963,29 @@ impl Endpoint {
Ok(())
}
// Update the pageservers in the spec file of the endpoint. This is useful to test the spec refresh scenario.
pub async fn update_pageservers_in_config(
&self,
pageservers: Vec<(PageserverProtocol, Host, u16)>,
) -> Result<()> {
let config_path = self.endpoint_path().join("config.json");
let mut config: ComputeConfig = {
let file = std::fs::File::open(&config_path)?;
serde_json::from_reader(file)?
};
let pageserver_connstring = Self::build_pageserver_connstr(&pageservers);
assert!(!pageserver_connstring.is_empty());
let mut spec = config.spec.unwrap();
spec.pageserver_connstring = Some(pageserver_connstring);
config.spec = Some(spec);
let file = std::fs::File::create(&config_path)?;
serde_json::to_writer_pretty(file, &config)?;
Ok(())
}
// Call the /status HTTP API
pub async fn get_status(&self) -> Result<ComputeStatusResponse> {
let client = reqwest::Client::new();
@@ -1028,7 +1020,8 @@ impl Endpoint {
pub async fn reconfigure(
&self,
pageserver_conninfo: Option<&PageserverConnectionInfo>,
pageservers: Option<Vec<(PageserverProtocol, Host, u16)>>,
stripe_size: Option<ShardStripeSize>,
safekeepers: Option<Vec<NodeId>>,
safekeeper_generation: Option<SafekeeperGeneration>,
) -> Result<()> {
@@ -1043,15 +1036,15 @@ impl Endpoint {
let postgresql_conf = self.read_postgresql_conf()?;
spec.cluster.postgresql_conf = Some(postgresql_conf);
if let Some(pageserver_conninfo) = pageserver_conninfo {
// If pageservers are provided, we need to ensure that they are not empty.
// This is a requirement for the compute_ctl configuration.
anyhow::ensure!(
!pageserver_conninfo.shards.is_empty(),
"no pageservers provided"
);
spec.pageserver_connection_info = Some(pageserver_conninfo.clone());
spec.shard_stripe_size = pageserver_conninfo.stripe_size;
// If pageservers are not specified, don't change them.
if let Some(pageservers) = pageservers {
anyhow::ensure!(!pageservers.is_empty(), "no pageservers provided");
let pageserver_connstr = Self::build_pageserver_connstr(&pageservers);
spec.pageserver_connstring = Some(pageserver_connstr);
if stripe_size.is_some() {
spec.shard_stripe_size = stripe_size.map(|s| s.0 as usize);
}
}
// If safekeepers are not specified, don't change them.
@@ -1100,9 +1093,11 @@ impl Endpoint {
pub async fn reconfigure_pageservers(
&self,
pageservers: &PageserverConnectionInfo,
pageservers: Vec<(PageserverProtocol, Host, u16)>,
stripe_size: Option<ShardStripeSize>,
) -> Result<()> {
self.reconfigure(Some(pageservers), None, None).await
self.reconfigure(Some(pageservers), stripe_size, None, None)
.await
}
pub async fn reconfigure_safekeepers(
@@ -1110,7 +1105,7 @@ impl Endpoint {
safekeepers: Vec<NodeId>,
generation: SafekeeperGeneration,
) -> Result<()> {
self.reconfigure(None, Some(safekeepers), Some(generation))
self.reconfigure(None, None, Some(safekeepers), Some(generation))
.await
}
@@ -1156,6 +1151,33 @@ impl Endpoint {
Ok(response)
}
pub async fn refresh_configuration(&self) -> Result<()> {
let client = reqwest::Client::builder()
.timeout(Duration::from_secs(30))
.build()
.unwrap();
let response = client
.post(format!(
"http://{}:{}/refresh_configuration",
self.internal_http_address.ip(),
self.internal_http_address.port()
))
.send()
.await?;
let status = response.status();
if !(status.is_client_error() || status.is_server_error()) {
Ok(())
} else {
let url = response.url().to_owned();
let msg = match response.text().await {
Ok(err_body) => format!("Error: {err_body}"),
Err(_) => format!("Http error ({}) at {}.", status.as_u16(), url),
};
Err(anyhow::anyhow!(msg))
}
}
pub fn connstr(&self, user: &str, db_name: &str) -> String {
format!(
"postgresql://{}@{}:{}/{}",
@@ -1166,68 +1188,3 @@ impl Endpoint {
)
}
}
pub fn pageserver_conf_to_shard_conn_info(
conf: &crate::local_env::PageServerConf,
) -> Result<PageserverShardConnectionInfo> {
let libpq_url = {
let (host, port) = parse_host_port(&conf.listen_pg_addr)?;
let port = port.unwrap_or(5432);
Some(format!("postgres://no_user@{host}:{port}"))
};
let grpc_url = if let Some(grpc_addr) = &conf.listen_grpc_addr {
let (host, port) = parse_host_port(grpc_addr)?;
let port = port.unwrap_or(DEFAULT_PAGESERVER_GRPC_PORT);
Some(format!("grpc://no_user@{host}:{port}"))
} else {
None
};
Ok(PageserverShardConnectionInfo {
id: Some(conf.id.to_string()),
libpq_url,
grpc_url,
})
}
pub fn tenant_locate_response_to_conn_info(
response: &pageserver_api::controller_api::TenantLocateResponse,
) -> Result<PageserverConnectionInfo> {
let mut shards = HashMap::new();
for shard in response.shards.iter() {
tracing::info!("parsing {}", shard.listen_pg_addr);
let libpq_url = {
let host = &shard.listen_pg_addr;
let port = shard.listen_pg_port;
Some(format!("postgres://no_user@{host}:{port}"))
};
let grpc_url = if let Some(grpc_addr) = &shard.listen_grpc_addr {
let host = grpc_addr;
let port = shard.listen_grpc_port.expect("no gRPC port");
Some(format!("grpc://no_user@{host}:{port}"))
} else {
None
};
let shard_info = PageserverShardInfo {
pageservers: vec![PageserverShardConnectionInfo {
id: Some(shard.node_id.to_string()),
libpq_url,
grpc_url,
}],
};
shards.insert(shard.shard_id.to_index(), shard_info);
}
let stripe_size = if response.shard_params.count.is_unsharded() {
None
} else {
Some(response.shard_params.stripe_size.0)
};
Ok(PageserverConnectionInfo {
shard_count: response.shard_params.count,
stripe_size,
shards,
prefer_protocol: PageserverProtocol::default(),
})
}

View File

@@ -108,11 +108,10 @@ pub enum PromoteState {
Failed { error: String },
}
#[derive(Deserialize, Serialize, Default, Debug, Clone)]
#[derive(Deserialize, Default, Debug)]
#[serde(rename_all = "snake_case")]
/// Result of /safekeepers_lsn
pub struct SafekeepersLsn {
pub safekeepers: String,
pub struct PromoteConfig {
pub spec: ComputeSpec,
pub wal_flush_lsn: utils::lsn::Lsn,
}
@@ -173,6 +172,11 @@ pub enum ComputeStatus {
TerminationPendingImmediate,
// Terminated Postgres
Terminated,
// A spec refresh is being requested
RefreshConfigurationPending,
// A spec refresh is being applied. We cannot refresh configuration again until the current
// refresh is done, i.e., signal_refresh_configuration() will return 500 error.
RefreshConfiguration,
}
#[derive(Deserialize, Serialize)]
@@ -185,6 +189,10 @@ impl Display for ComputeStatus {
match self {
ComputeStatus::Empty => f.write_str("empty"),
ComputeStatus::ConfigurationPending => f.write_str("configuration-pending"),
ComputeStatus::RefreshConfiguration => f.write_str("refresh-configuration"),
ComputeStatus::RefreshConfigurationPending => {
f.write_str("refresh-configuration-pending")
}
ComputeStatus::Init => f.write_str("init"),
ComputeStatus::Running => f.write_str("running"),
ComputeStatus::Configuration => f.write_str("configuration"),

View File

@@ -14,7 +14,6 @@ use serde::{Deserialize, Serialize};
use url::Url;
use utils::id::{TenantId, TimelineId};
use utils::lsn::Lsn;
use utils::shard::{ShardCount, ShardIndex};
use crate::responses::TlsConfig;
@@ -106,17 +105,6 @@ pub struct ComputeSpec {
// updated to fill these fields, we can make these non optional.
pub tenant_id: Option<TenantId>,
pub timeline_id: Option<TimelineId>,
/// Pageserver information can be passed in three different ways:
/// 1. Here in `pageserver_connection_info`
/// 2. In the `pageserver_connstring` field.
/// 3. in `cluster.settings`.
///
/// The goal is to use method 1. everywhere. But for backwards-compatibility with old
/// versions of the control plane, `compute_ctl` will check 2. and 3. if the
/// `pageserver_connection_info` field is missing.
pub pageserver_connection_info: Option<PageserverConnectionInfo>,
pub pageserver_connstring: Option<String>,
// More neon ids that we expose to the compute_ctl
@@ -153,7 +141,7 @@ pub struct ComputeSpec {
// Stripe size for pageserver sharding, in pages
#[serde(default)]
pub shard_stripe_size: Option<u32>,
pub shard_stripe_size: Option<usize>,
/// Local Proxy configuration used for JWT authentication
#[serde(default)]
@@ -205,6 +193,9 @@ pub struct ComputeSpec {
///
/// We use this value to derive other values, such as the installed extensions metric.
pub suspend_timeout_seconds: i64,
// Databricks specific options for compute instance.
pub databricks_settings: Option<DatabricksSettings>,
}
/// Feature flag to signal `compute_ctl` to enable certain experimental functionality.
@@ -226,32 +217,6 @@ pub enum ComputeFeature {
UnknownFeature,
}
#[derive(Clone, Debug, Deserialize, Serialize, Eq, PartialEq)]
pub struct PageserverConnectionInfo {
/// NB: 0 for unsharded tenants, 1 for sharded tenants with 1 shard, following storage
pub shard_count: ShardCount,
/// INVARIANT: null if shard_count is 0, otherwise non-null and immutable
pub stripe_size: Option<u32>,
pub shards: HashMap<ShardIndex, PageserverShardInfo>,
#[serde(default)]
pub prefer_protocol: PageserverProtocol,
}
#[derive(Clone, Debug, Deserialize, Serialize, Eq, PartialEq)]
pub struct PageserverShardInfo {
pub pageservers: Vec<PageserverShardConnectionInfo>,
}
#[derive(Clone, Debug, Deserialize, Serialize, Eq, PartialEq)]
pub struct PageserverShardConnectionInfo {
pub id: Option<String>,
pub libpq_url: Option<String>,
pub grpc_url: Option<String>,
}
#[derive(Clone, Debug, Default, Deserialize, Serialize)]
pub struct RemoteExtSpec {
pub public_extensions: Option<Vec<String>>,
@@ -369,12 +334,6 @@ impl ComputeMode {
}
}
impl Display for ComputeMode {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.write_str(self.to_type_str())
}
}
/// Log level for audit logging
#[derive(Clone, Debug, Default, Eq, PartialEq, Deserialize, Serialize)]
pub enum ComputeAudit {
@@ -511,15 +470,13 @@ pub struct JwksSettings {
pub jwt_audience: Option<String>,
}
/// Protocol used to connect to a Pageserver.
#[derive(Clone, Copy, Debug, Default, Deserialize, Serialize, PartialEq, Eq)]
/// Protocol used to connect to a Pageserver. Parsed from the connstring scheme.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub enum PageserverProtocol {
/// The original protocol based on libpq and COPY. Uses postgresql:// or postgres:// scheme.
#[default]
#[serde(rename = "libpq")]
Libpq,
/// A newer, gRPC-based protocol. Uses grpc:// scheme.
#[serde(rename = "grpc")]
Grpc,
}

View File

@@ -558,11 +558,11 @@ async fn add_request_id_header_to_response(
mut res: Response<Body>,
req_info: RequestInfo,
) -> Result<Response<Body>, ApiError> {
if let Some(request_id) = req_info.context::<RequestId>() {
if let Ok(request_header_value) = HeaderValue::from_str(&request_id.0) {
res.headers_mut()
.insert(&X_REQUEST_ID_HEADER, request_header_value);
};
if let Some(request_id) = req_info.context::<RequestId>()
&& let Ok(request_header_value) = HeaderValue::from_str(&request_id.0)
{
res.headers_mut()
.insert(&X_REQUEST_ID_HEADER, request_header_value);
};
Ok(res)

View File

@@ -72,10 +72,10 @@ impl Server {
if err.is_incomplete_message() || err.is_closed() || err.is_timeout() {
return true;
}
if let Some(inner) = err.source() {
if let Some(io) = inner.downcast_ref::<std::io::Error>() {
return suppress_io_error(io);
}
if let Some(inner) = err.source()
&& let Some(io) = inner.downcast_ref::<std::io::Error>()
{
return suppress_io_error(io);
}
false
}

View File

@@ -129,6 +129,12 @@ impl<L: LabelGroup> InfoMetric<L> {
}
}
impl<L: LabelGroup + Default> Default for InfoMetric<L, GaugeState> {
fn default() -> Self {
InfoMetric::new(L::default())
}
}
impl<L: LabelGroup, M: MetricType<Metadata = ()>> InfoMetric<L, M> {
pub fn with_metric(label: L, metric: M) -> Self {
Self {

View File

@@ -6,26 +6,15 @@ license.workspace = true
[dependencies]
thiserror.workspace = true
nix.workspace = true
nix.workspace=true
workspace_hack = { version = "0.1", path = "../../workspace_hack" }
libc.workspace = true
lock_api.workspace = true
rustc-hash.workspace = true
[dev-dependencies]
criterion = { workspace = true, features = ["html_reports"] }
rand = "0.9"
rand_distr = "0.5.1"
xxhash-rust = { version = "0.8.15", features = ["xxh3"] }
ahash.workspace = true
twox-hash = { version = "2.1.1" }
seahash = "4.1.0"
hashbrown = { git = "https://github.com/quantumish/hashbrown.git", rev = "6610e6d" }
[target.'cfg(target_os = "macos")'.dependencies]
tempfile = "3.14.0"
[[bench]]
name = "hmap_resize"
harness = false
[dev-dependencies]
rand.workspace = true
rand_distr = "0.5.1"

View File

@@ -1,330 +0,0 @@
use criterion::{BatchSize, BenchmarkId, Criterion, criterion_group, criterion_main};
use neon_shmem::hash::HashMapAccess;
use neon_shmem::hash::HashMapInit;
use neon_shmem::hash::entry::Entry;
use rand::distr::{Distribution, StandardUniform};
use rand::prelude::*;
use std::default::Default;
use std::hash::BuildHasher;
// Taken from bindings to C code
#[derive(Clone, Debug, Hash, Eq, PartialEq)]
#[repr(C)]
pub struct FileCacheKey {
pub _spc_id: u32,
pub _db_id: u32,
pub _rel_number: u32,
pub _fork_num: u32,
pub _block_num: u32,
}
impl Distribution<FileCacheKey> for StandardUniform {
// questionable, but doesn't need to be good randomness
fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> FileCacheKey {
FileCacheKey {
_spc_id: rng.random(),
_db_id: rng.random(),
_rel_number: rng.random(),
_fork_num: rng.random(),
_block_num: rng.random(),
}
}
}
#[derive(Clone, Debug)]
#[repr(C)]
pub struct FileCacheEntry {
pub _offset: u32,
pub _access_count: u32,
pub _prev: *mut FileCacheEntry,
pub _next: *mut FileCacheEntry,
pub _state: [u32; 8],
}
impl FileCacheEntry {
fn dummy() -> Self {
Self {
_offset: 0,
_access_count: 0,
_prev: std::ptr::null_mut(),
_next: std::ptr::null_mut(),
_state: [0; 8],
}
}
}
// Utilities for applying operations.
#[derive(Clone, Debug)]
struct TestOp<K, V>(K, Option<V>);
fn apply_op<K: Clone + std::hash::Hash + Eq, V, S: std::hash::BuildHasher>(
op: TestOp<K, V>,
map: &mut HashMapAccess<K, V, S>,
) {
let entry = map.entry(op.0);
match op.1 {
Some(new) => match entry {
Entry::Occupied(mut e) => Some(e.insert(new)),
Entry::Vacant(e) => {
_ = e.insert(new).unwrap();
None
}
},
None => match entry {
Entry::Occupied(e) => Some(e.remove()),
Entry::Vacant(_) => None,
},
};
}
// Hash utilities
struct SeaRandomState {
k1: u64,
k2: u64,
k3: u64,
k4: u64,
}
impl std::hash::BuildHasher for SeaRandomState {
type Hasher = seahash::SeaHasher;
fn build_hasher(&self) -> Self::Hasher {
seahash::SeaHasher::with_seeds(self.k1, self.k2, self.k3, self.k4)
}
}
impl SeaRandomState {
fn new() -> Self {
let mut rng = rand::rng();
Self {
k1: rng.random(),
k2: rng.random(),
k3: rng.random(),
k4: rng.random(),
}
}
}
fn small_benchs(c: &mut Criterion) {
let mut group = c.benchmark_group("Small maps");
group.sample_size(10);
group.bench_function("small_rehash", |b| {
let ideal_filled = 4_000_000;
let size = 5_000_000;
let mut writer = HashMapInit::new_resizeable(size, size * 2).attach_writer();
let mut rng = rand::rng();
while writer.get_num_buckets_in_use() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
apply_op(TestOp(key, Some(val)), &mut writer);
}
b.iter(|| writer.shuffle());
});
group.bench_function("small_rehash_xxhash", |b| {
let ideal_filled = 4_000_000;
let size = 5_000_000;
let mut writer = HashMapInit::new_resizeable(size, size * 2)
.with_hasher(twox_hash::xxhash64::RandomState::default())
.attach_writer();
let mut rng = rand::rng();
while writer.get_num_buckets_in_use() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
apply_op(TestOp(key, Some(val)), &mut writer);
}
b.iter(|| writer.shuffle());
});
group.bench_function("small_rehash_ahash", |b| {
let ideal_filled = 4_000_000;
let size = 5_000_000;
let mut writer = HashMapInit::new_resizeable(size, size * 2)
.with_hasher(ahash::RandomState::default())
.attach_writer();
let mut rng = rand::rng();
while writer.get_num_buckets_in_use() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
apply_op(TestOp(key, Some(val)), &mut writer);
}
b.iter(|| writer.shuffle());
});
group.bench_function("small_rehash_seahash", |b| {
let ideal_filled = 4_000_000;
let size = 5_000_000;
let mut writer = HashMapInit::new_resizeable(size, size * 2)
.with_hasher(SeaRandomState::new())
.attach_writer();
let mut rng = rand::rng();
while writer.get_num_buckets_in_use() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
apply_op(TestOp(key, Some(val)), &mut writer);
}
b.iter(|| writer.shuffle());
});
group.finish();
}
fn real_benchs(c: &mut Criterion) {
let mut group = c.benchmark_group("Realistic workloads");
group.sample_size(10);
group.bench_function("real_bulk_insert", |b| {
let size = 125_000_000;
let ideal_filled = 100_000_000;
let mut rng = rand::rng();
b.iter_batched(
|| HashMapInit::new_resizeable(size, size * 2).attach_writer(),
|writer| {
for _ in 0..ideal_filled {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
let entry = writer.entry(key);
match entry {
Entry::Occupied(mut e) => {
std::hint::black_box(e.insert(val));
}
Entry::Vacant(e) => {
let _ = std::hint::black_box(e.insert(val).unwrap());
}
}
}
},
BatchSize::SmallInput,
)
});
group.bench_function("real_rehash", |b| {
let size = 125_000_000;
let ideal_filled = 100_000_000;
let mut writer = HashMapInit::new_resizeable(size, size).attach_writer();
let mut rng = rand::rng();
while writer.get_num_buckets_in_use() < ideal_filled {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
apply_op(TestOp(key, Some(val)), &mut writer);
}
b.iter(|| writer.shuffle());
});
group.bench_function("real_rehash_hashbrown", |b| {
let size = 125_000_000;
let ideal_filled = 100_000_000;
let mut writer = hashbrown::raw::RawTable::new();
let mut rng = rand::rng();
let hasher = rustc_hash::FxBuildHasher;
unsafe {
writer
.resize(
size,
|(k, _)| hasher.hash_one(k),
hashbrown::raw::Fallibility::Infallible,
)
.unwrap();
}
while writer.len() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
writer.insert(hasher.hash_one(&key), (key, val), |(k, _)| {
hasher.hash_one(k)
});
}
b.iter(|| unsafe {
writer.table.rehash_in_place(
&|table, index| {
hasher.hash_one(
&table
.bucket::<(FileCacheKey, FileCacheEntry)>(index)
.as_ref()
.0,
)
},
std::mem::size_of::<(FileCacheKey, FileCacheEntry)>(),
if std::mem::needs_drop::<(FileCacheKey, FileCacheEntry)>() {
Some(|ptr| std::ptr::drop_in_place(ptr as *mut (FileCacheKey, FileCacheEntry)))
} else {
None
},
)
});
});
for elems in [2, 4, 8, 16, 32, 64, 96, 112] {
group.bench_with_input(
BenchmarkId::new("real_rehash_varied", elems),
&elems,
|b, &size| {
let ideal_filled = size * 1_000_000;
let size = 125_000_000;
let mut writer = HashMapInit::new_resizeable(size, size).attach_writer();
let mut rng = rand::rng();
while writer.get_num_buckets_in_use() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
apply_op(TestOp(key, Some(val)), &mut writer);
}
b.iter(|| writer.shuffle());
},
);
group.bench_with_input(
BenchmarkId::new("real_rehash_varied_hashbrown", elems),
&elems,
|b, &size| {
let ideal_filled = size * 1_000_000;
let size = 125_000_000;
let mut writer = hashbrown::raw::RawTable::new();
let mut rng = rand::rng();
let hasher = rustc_hash::FxBuildHasher;
unsafe {
writer
.resize(
size,
|(k, _)| hasher.hash_one(k),
hashbrown::raw::Fallibility::Infallible,
)
.unwrap();
}
while writer.len() < ideal_filled as usize {
let key: FileCacheKey = rng.random();
let val = FileCacheEntry::dummy();
writer.insert(hasher.hash_one(&key), (key, val), |(k, _)| {
hasher.hash_one(k)
});
}
b.iter(|| unsafe {
writer.table.rehash_in_place(
&|table, index| {
hasher.hash_one(
&table
.bucket::<(FileCacheKey, FileCacheEntry)>(index)
.as_ref()
.0,
)
},
std::mem::size_of::<(FileCacheKey, FileCacheEntry)>(),
if std::mem::needs_drop::<(FileCacheKey, FileCacheEntry)>() {
Some(|ptr| {
std::ptr::drop_in_place(ptr as *mut (FileCacheKey, FileCacheEntry))
})
} else {
None
},
)
});
},
);
}
group.finish();
}
criterion_group!(benches, small_benchs, real_benchs);
criterion_main!(benches);

View File

@@ -16,7 +16,6 @@
//!
//! Concurrency is managed very simply: the entire map is guarded by one shared-memory RwLock.
use std::fmt::Debug;
use std::hash::{BuildHasher, Hash};
use std::mem::MaybeUninit;
@@ -57,22 +56,6 @@ pub struct HashMapInit<'a, K, V, S = rustc_hash::FxBuildHasher> {
num_buckets: u32,
}
impl<'a, K, V, S> Debug for HashMapInit<'a, K, V, S>
where
K: Debug,
V: Debug,
{
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("HashMapInit")
.field("shmem_handle", &self.shmem_handle)
.field("shared_ptr", &self.shared_ptr)
.field("shared_size", &self.shared_size)
// .field("hasher", &self.hasher)
.field("num_buckets", &self.num_buckets)
.finish()
}
}
/// This is a per-process handle to a hash table that (possibly) lives in shared memory.
/// If a child process is launched with fork(), the child process should
/// get its own HashMapAccess by calling HashMapInit::attach_writer/reader().
@@ -88,20 +71,6 @@ pub struct HashMapAccess<'a, K, V, S = rustc_hash::FxBuildHasher> {
unsafe impl<K: Sync, V: Sync, S> Sync for HashMapAccess<'_, K, V, S> {}
unsafe impl<K: Send, V: Send, S> Send for HashMapAccess<'_, K, V, S> {}
impl<'a, K, V, S> Debug for HashMapAccess<'a, K, V, S>
where
K: Debug,
V: Debug,
{
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("HashMapAccess")
.field("shmem_handle", &self.shmem_handle)
.field("shared_ptr", &self.shared_ptr)
// .field("hasher", &self.hasher)
.finish()
}
}
impl<'a, K: Clone + Hash + Eq, V, S> HashMapInit<'a, K, V, S> {
/// Change the 'hasher' used by the hash table.
///
@@ -329,7 +298,7 @@ where
/// Get a reference to the entry containing a key.
///
/// NB: This takes a write lock as there's no way to distinguish whether the intention
/// NB: THis takes a write lock as there's no way to distinguish whether the intention
/// is to use the entry for reading or for writing in advance.
pub fn entry(&self, key: K) -> Entry<'a, '_, K, V> {
let hash = self.get_hash_value(&key);
@@ -394,7 +363,7 @@ where
// TODO: An Iterator might be nicer. The communicator's clock algorithm needs to
// _slowly_ iterate through all buckets with its clock hand, without holding a lock.
// If we switch to an Iterator, it must not hold the lock.
pub fn get_at_bucket(&self, pos: usize) -> Option<ValueReadGuard<(K, V)>> {
pub fn get_at_bucket(&self, pos: usize) -> Option<ValueReadGuard<'_, (K, V)>> {
let map = unsafe { self.shared_ptr.as_ref() }.unwrap().read();
if pos >= map.buckets.len() {
return None;

View File

@@ -1,6 +1,5 @@
//! Simple hash table with chaining.
use std::fmt::Debug;
use std::hash::Hash;
use std::mem::MaybeUninit;
@@ -18,19 +17,6 @@ pub(crate) struct Bucket<K, V> {
pub(crate) inner: Option<(K, V)>,
}
impl<K, V> Debug for Bucket<K, V>
where
K: Debug,
V: Debug,
{
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("Bucket")
.field("next", &self.next)
.field("inner", &self.inner)
.finish()
}
}
/// Core hash table implementation.
pub(crate) struct CoreHashMap<'a, K, V> {
/// Dictionary used to map hashes to bucket indices.
@@ -45,22 +31,6 @@ pub(crate) struct CoreHashMap<'a, K, V> {
pub(crate) buckets_in_use: u32,
}
impl<'a, K, V> Debug for CoreHashMap<'a, K, V>
where
K: Debug,
V: Debug,
{
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("CoreHashMap")
.field("dictionary", &self.dictionary)
.field("buckets", &self.buckets)
.field("free_head", &self.free_head)
.field("alloc_limit", &self.alloc_limit)
.field("buckets_in_use", &self.buckets_in_use)
.finish()
}
}
/// Error for when there are no empty buckets left but one is needed.
#[derive(Debug, PartialEq)]
pub struct FullError;

View File

@@ -61,10 +61,6 @@ impl<K, V> OccupiedEntry<'_, '_, K, V> {
///
/// This may result in multiple bucket accesses if the entry was obtained by index as the
/// previous chain entry needs to be discovered in this case.
///
/// # Panics
/// Panics if the `prev_pos` field is equal to [`PrevPos::Unknown`]. In practice, this means
/// the entry was obtained via calling something like [`super::HashMapAccess::entry_at_bucket`].
pub fn remove(mut self) -> V {
// If this bucket was queried by index, go ahead and follow its chain from the start.
let prev = if let PrevPos::Unknown(hash) = self.prev_pos {

View File

@@ -21,7 +21,6 @@ use nix::unistd::ftruncate as nix_ftruncate;
/// the underlying file is resized. Do not access the area beyond the current size. Currently, that
/// will cause the file to be expanded, but we might use `mprotect()` etc. to enforce that in the
/// future.
#[derive(Debug)]
pub struct ShmemHandle {
/// memfd file descriptor
fd: OwnedFd,
@@ -36,7 +35,6 @@ pub struct ShmemHandle {
}
/// This is stored at the beginning in the shared memory area.
#[derive(Debug)]
struct SharedStruct {
max_size: usize,

View File

@@ -1500,6 +1500,7 @@ pub struct TimelineArchivalConfigRequest {
#[derive(Serialize, Deserialize, PartialEq, Eq, Clone)]
pub struct TimelinePatchIndexPartRequest {
pub rel_size_migration: Option<RelSizeMigration>,
pub rel_size_migrated_at: Option<Lsn>,
pub gc_compaction_last_completed_lsn: Option<Lsn>,
pub applied_gc_cutoff_lsn: Option<Lsn>,
#[serde(default)]
@@ -1533,10 +1534,10 @@ pub enum RelSizeMigration {
/// `None` is the same as `Some(RelSizeMigration::Legacy)`.
Legacy,
/// The tenant is migrating to the new rel_size format. Both old and new rel_size format are
/// persisted in the index part. The read path will read both formats and merge them.
/// persisted in the storage. The read path will read both formats and validate them.
Migrating,
/// The tenant has migrated to the new rel_size format. Only the new rel_size format is persisted
/// in the index part, and the read path will not read the old format.
/// in the storage, and the read path will not read the old format.
Migrated,
}
@@ -1619,6 +1620,7 @@ pub struct TimelineInfo {
/// The status of the rel_size migration.
pub rel_size_migration: Option<RelSizeMigration>,
pub rel_size_migrated_at: Option<Lsn>,
/// Whether the timeline is invisible in synthetic size calculations.
pub is_invisible: Option<bool>,

View File

@@ -15,6 +15,7 @@ use tokio::sync::mpsc;
use crate::cancel_token::RawCancelToken;
use crate::codec::{BackendMessages, FrontendMessage, RecordNotices};
use crate::config::{Host, SslMode};
use crate::connection::gc_bytesmut;
use crate::query::RowStream;
use crate::simple_query::SimpleQueryStream;
use crate::types::{Oid, Type};
@@ -95,20 +96,13 @@ impl InnerClient {
Ok(PartialQuery(Some(self)))
}
// pub fn send_with_sync<F>(&mut self, f: F) -> Result<&mut Responses, Error>
// where
// F: FnOnce(&mut BytesMut) -> Result<(), Error>,
// {
// self.start()?.send_with_sync(f)
// }
pub fn send_simple_query(&mut self, query: &str) -> Result<&mut Responses, Error> {
self.responses.waiting += 1;
self.buffer.clear();
// simple queries do not need sync.
frontend::query(query, &mut self.buffer).map_err(Error::encode)?;
let buf = self.buffer.split().freeze();
let buf = self.buffer.split();
self.send_message(FrontendMessage::Raw(buf))
}
@@ -125,7 +119,7 @@ impl Drop for PartialQuery<'_> {
if let Some(client) = self.0.take() {
client.buffer.clear();
frontend::sync(&mut client.buffer);
let buf = client.buffer.split().freeze();
let buf = client.buffer.split();
let _ = client.send_message(FrontendMessage::Raw(buf));
}
}
@@ -141,7 +135,7 @@ impl<'a> PartialQuery<'a> {
client.buffer.clear();
f(&mut client.buffer)?;
frontend::flush(&mut client.buffer);
let buf = client.buffer.split().freeze();
let buf = client.buffer.split();
client.send_message(FrontendMessage::Raw(buf))
}
@@ -154,7 +148,7 @@ impl<'a> PartialQuery<'a> {
client.buffer.clear();
f(&mut client.buffer)?;
frontend::sync(&mut client.buffer);
let buf = client.buffer.split().freeze();
let buf = client.buffer.split();
let _ = client.send_message(FrontendMessage::Raw(buf));
Ok(&mut self.0.take().unwrap().responses)
@@ -191,6 +185,7 @@ impl Client {
ssl_mode: SslMode,
process_id: i32,
secret_key: i32,
write_buf: BytesMut,
) -> Client {
Client {
inner: InnerClient {
@@ -201,7 +196,7 @@ impl Client {
waiting: 0,
received: 0,
},
buffer: Default::default(),
buffer: write_buf,
},
cached_typeinfo: Default::default(),
@@ -292,8 +287,35 @@ impl Client {
simple_query::batch_execute(self.inner_mut(), query).await
}
pub async fn discard_all(&mut self) -> Result<ReadyForQueryStatus, Error> {
self.batch_execute("discard all").await
/// Similar to `discard_all`, but it does not clear any query plans
///
/// This runs in the background, so it can be executed without `await`ing.
pub fn reset_session_background(&mut self) -> Result<(), Error> {
// "CLOSE ALL": closes any cursors
// "SET SESSION AUTHORIZATION DEFAULT": resets the current_user back to the session_user
// "RESET ALL": resets any GUCs back to their session defaults.
// "DEALLOCATE ALL": deallocates any prepared statements
// "UNLISTEN *": stops listening on all channels
// "SELECT pg_advisory_unlock_all();": unlocks all advisory locks
// "DISCARD TEMP;": drops all temporary tables
// "DISCARD SEQUENCES;": deallocates all cached sequence state
let _responses = self.inner_mut().send_simple_query(
"ROLLBACK;
CLOSE ALL;
SET SESSION AUTHORIZATION DEFAULT;
RESET ALL;
DEALLOCATE ALL;
UNLISTEN *;
SELECT pg_advisory_unlock_all();
DISCARD TEMP;
DISCARD SEQUENCES;",
)?;
// Clean up memory usage.
gc_bytesmut(&mut self.inner_mut().buffer);
Ok(())
}
/// Begins a new database transaction.

View File

@@ -1,13 +1,13 @@
use std::io;
use bytes::{Bytes, BytesMut};
use bytes::BytesMut;
use fallible_iterator::FallibleIterator;
use postgres_protocol2::message::backend;
use tokio::sync::mpsc::UnboundedSender;
use tokio_util::codec::{Decoder, Encoder};
pub enum FrontendMessage {
Raw(Bytes),
Raw(BytesMut),
RecordNotices(RecordNotices),
}
@@ -17,7 +17,10 @@ pub struct RecordNotices {
}
pub enum BackendMessage {
Normal { messages: BackendMessages },
Normal {
messages: BackendMessages,
ready: bool,
},
Async(backend::Message),
}
@@ -40,11 +43,11 @@ impl FallibleIterator for BackendMessages {
pub struct PostgresCodec;
impl Encoder<Bytes> for PostgresCodec {
impl Encoder<BytesMut> for PostgresCodec {
type Error = io::Error;
fn encode(&mut self, item: Bytes, dst: &mut BytesMut) -> io::Result<()> {
dst.extend_from_slice(&item);
fn encode(&mut self, item: BytesMut, dst: &mut BytesMut) -> io::Result<()> {
dst.unsplit(item);
Ok(())
}
}
@@ -56,6 +59,7 @@ impl Decoder for PostgresCodec {
fn decode(&mut self, src: &mut BytesMut) -> Result<Option<BackendMessage>, io::Error> {
let mut idx = 0;
let mut ready = false;
while let Some(header) = backend::Header::parse(&src[idx..])? {
let len = header.len() as usize + 1;
if src[idx..].len() < len {
@@ -79,6 +83,7 @@ impl Decoder for PostgresCodec {
idx += len;
if header.tag() == backend::READY_FOR_QUERY_TAG {
ready = true;
break;
}
}
@@ -88,6 +93,7 @@ impl Decoder for PostgresCodec {
} else {
Ok(Some(BackendMessage::Normal {
messages: BackendMessages(src.split_to(idx)),
ready,
}))
}
}

View File

@@ -11,9 +11,8 @@ use tokio::io::{AsyncRead, AsyncWrite};
use tokio::net::TcpStream;
use crate::connect::connect;
use crate::connect_raw::{RawConnection, connect_raw};
use crate::connect_raw::{self, StartupStream};
use crate::connect_tls::connect_tls;
use crate::maybe_tls_stream::MaybeTlsStream;
use crate::tls::{MakeTlsConnect, TlsConnect, TlsStream};
use crate::{Client, Connection, Error};
@@ -244,24 +243,27 @@ impl Config {
&self,
stream: S,
tls: T,
) -> Result<RawConnection<S, T::Stream>, Error>
) -> Result<StartupStream<S, T::Stream>, Error>
where
S: AsyncRead + AsyncWrite + Unpin,
T: TlsConnect<S>,
{
let stream = connect_tls(stream, self.ssl_mode, tls).await?;
connect_raw(stream, self).await
let mut stream = StartupStream::new(stream);
connect_raw::authenticate(&mut stream, self).await?;
Ok(stream)
}
pub async fn authenticate<S, T>(
pub fn authenticate<S, T>(
&self,
stream: MaybeTlsStream<S, T>,
) -> Result<RawConnection<S, T>, Error>
stream: &mut StartupStream<S, T>,
) -> impl Future<Output = Result<(), Error>>
where
S: AsyncRead + AsyncWrite + Unpin,
T: TlsStream + Unpin,
{
connect_raw(stream, self).await
connect_raw::authenticate(stream, self)
}
}

View File

@@ -1,15 +1,17 @@
use std::net::IpAddr;
use futures_util::TryStreamExt;
use postgres_protocol2::message::backend::Message;
use tokio::io::{AsyncRead, AsyncWrite};
use tokio::net::TcpStream;
use tokio::sync::mpsc;
use crate::client::SocketConfig;
use crate::config::Host;
use crate::connect_raw::connect_raw;
use crate::config::{Host, SslMode};
use crate::connect_raw::StartupStream;
use crate::connect_socket::connect_socket;
use crate::connect_tls::connect_tls;
use crate::tls::{MakeTlsConnect, TlsConnect};
use crate::{Client, Config, Connection, Error, RawConnection};
use crate::{Client, Config, Connection, Error};
pub async fn connect<T>(
tls: &T,
@@ -43,34 +45,78 @@ where
T: TlsConnect<TcpStream>,
{
let socket = connect_socket(host_addr, host, port, config.connect_timeout).await?;
let stream = connect_tls(socket, config.ssl_mode, tls).await?;
let RawConnection {
let stream = config.tls_and_authenticate(socket, tls).await?;
managed(
stream,
parameters: _,
delayed_notice: _,
process_id,
secret_key,
} = connect_raw(stream, config).await?;
host_addr,
host.clone(),
port,
config.ssl_mode,
config.connect_timeout,
)
.await
}
pub async fn managed<TlsStream>(
mut stream: StartupStream<TcpStream, TlsStream>,
host_addr: Option<IpAddr>,
host: Host,
port: u16,
ssl_mode: SslMode,
connect_timeout: Option<std::time::Duration>,
) -> Result<(Client, Connection<TcpStream, TlsStream>), Error>
where
TlsStream: AsyncRead + AsyncWrite + Unpin,
{
let (process_id, secret_key) = wait_until_ready(&mut stream).await?;
let socket_config = SocketConfig {
host_addr,
host: host.clone(),
host,
port,
connect_timeout: config.connect_timeout,
connect_timeout,
};
let mut stream = stream.into_framed();
let write_buf = std::mem::take(stream.write_buffer_mut());
let (client_tx, conn_rx) = mpsc::unbounded_channel();
let (conn_tx, client_rx) = mpsc::channel(4);
let client = Client::new(
client_tx,
client_rx,
socket_config,
config.ssl_mode,
ssl_mode,
process_id,
secret_key,
write_buf,
);
let connection = Connection::new(stream, conn_tx, conn_rx);
Ok((client, connection))
}
async fn wait_until_ready<S, T>(stream: &mut StartupStream<S, T>) -> Result<(i32, i32), Error>
where
S: AsyncRead + AsyncWrite + Unpin,
T: AsyncRead + AsyncWrite + Unpin,
{
let mut process_id = 0;
let mut secret_key = 0;
loop {
match stream.try_next().await.map_err(Error::io)? {
Some(Message::BackendKeyData(body)) => {
process_id = body.process_id();
secret_key = body.secret_key();
}
// These values are currently not used by `Client`/`Connection`. Ignore them.
Some(Message::ParameterStatus(_)) | Some(Message::NoticeResponse(_)) => {}
Some(Message::ReadyForQuery(_)) => return Ok((process_id, secret_key)),
Some(Message::ErrorResponse(body)) => return Err(Error::db(body)),
Some(_) => return Err(Error::unexpected_message()),
None => return Err(Error::closed()),
}
}
}

View File

@@ -1,52 +1,27 @@
use std::collections::HashMap;
use std::io;
use std::pin::Pin;
use std::task::{Context, Poll};
use std::task::{Context, Poll, ready};
use bytes::{Bytes, BytesMut};
use bytes::BytesMut;
use fallible_iterator::FallibleIterator;
use futures_util::{Sink, SinkExt, Stream, TryStreamExt, ready};
use futures_util::{SinkExt, Stream, TryStreamExt};
use postgres_protocol2::authentication::sasl;
use postgres_protocol2::authentication::sasl::ScramSha256;
use postgres_protocol2::message::backend::{AuthenticationSaslBody, Message, NoticeResponseBody};
use postgres_protocol2::message::backend::{AuthenticationSaslBody, Message};
use postgres_protocol2::message::frontend;
use tokio::io::{AsyncRead, AsyncWrite};
use tokio_util::codec::Framed;
use tokio::io::{AsyncRead, AsyncWrite, ReadBuf};
use tokio_util::codec::{Framed, FramedParts};
use crate::Error;
use crate::codec::{BackendMessage, BackendMessages, PostgresCodec};
use crate::codec::PostgresCodec;
use crate::config::{self, AuthKeys, Config};
use crate::connection::{GC_THRESHOLD, INITIAL_CAPACITY};
use crate::maybe_tls_stream::MaybeTlsStream;
use crate::tls::TlsStream;
pub struct StartupStream<S, T> {
inner: Framed<MaybeTlsStream<S, T>, PostgresCodec>,
buf: BackendMessages,
delayed_notice: Vec<NoticeResponseBody>,
}
impl<S, T> Sink<Bytes> for StartupStream<S, T>
where
S: AsyncRead + AsyncWrite + Unpin,
T: AsyncRead + AsyncWrite + Unpin,
{
type Error = io::Error;
fn poll_ready(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
Pin::new(&mut self.inner).poll_ready(cx)
}
fn start_send(mut self: Pin<&mut Self>, item: Bytes) -> io::Result<()> {
Pin::new(&mut self.inner).start_send(item)
}
fn poll_flush(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
Pin::new(&mut self.inner).poll_flush(cx)
}
fn poll_close(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
Pin::new(&mut self.inner).poll_close(cx)
}
read_buf: BytesMut,
}
impl<S, T> Stream for StartupStream<S, T>
@@ -56,78 +31,109 @@ where
{
type Item = io::Result<Message>;
fn poll_next(
mut self: Pin<&mut Self>,
cx: &mut Context<'_>,
) -> Poll<Option<io::Result<Message>>> {
loop {
match self.buf.next() {
Ok(Some(message)) => return Poll::Ready(Some(Ok(message))),
Ok(None) => {}
Err(e) => return Poll::Ready(Some(Err(e))),
}
fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
// We don't use `self.inner.poll_next()` as that might over-read into the read buffer.
match ready!(Pin::new(&mut self.inner).poll_next(cx)) {
Some(Ok(BackendMessage::Normal { messages, .. })) => self.buf = messages,
Some(Ok(BackendMessage::Async(message))) => return Poll::Ready(Some(Ok(message))),
Some(Err(e)) => return Poll::Ready(Some(Err(e))),
None => return Poll::Ready(None),
}
// read 1 byte tag, 4 bytes length.
let header = ready!(self.as_mut().poll_fill_buf_exact(cx, 5)?);
let len = u32::from_be_bytes(header[1..5].try_into().unwrap());
if len < 4 {
return Poll::Ready(Some(Err(std::io::Error::other(
"postgres message too small",
))));
}
if len >= 65536 {
return Poll::Ready(Some(Err(std::io::Error::other(
"postgres message too large",
))));
}
// the tag is an additional byte.
let _message = ready!(self.as_mut().poll_fill_buf_exact(cx, len as usize + 1)?);
// Message::parse will remove the all the bytes from the buffer.
Poll::Ready(Message::parse(&mut self.read_buf).transpose())
}
}
pub struct RawConnection<S, T> {
pub stream: Framed<MaybeTlsStream<S, T>, PostgresCodec>,
pub parameters: HashMap<String, String>,
pub delayed_notice: Vec<NoticeResponseBody>,
pub process_id: i32,
pub secret_key: i32,
}
pub async fn connect_raw<S, T>(
stream: MaybeTlsStream<S, T>,
config: &Config,
) -> Result<RawConnection<S, T>, Error>
where
S: AsyncRead + AsyncWrite + Unpin,
T: TlsStream + Unpin,
{
let mut stream = StartupStream {
inner: Framed::new(stream, PostgresCodec),
buf: BackendMessages::empty(),
delayed_notice: Vec::new(),
};
startup(&mut stream, config).await?;
authenticate(&mut stream, config).await?;
let (process_id, secret_key, parameters) = read_info(&mut stream).await?;
Ok(RawConnection {
stream: stream.inner,
parameters,
delayed_notice: stream.delayed_notice,
process_id,
secret_key,
})
}
async fn startup<S, T>(stream: &mut StartupStream<S, T>, config: &Config) -> Result<(), Error>
impl<S, T> StartupStream<S, T>
where
S: AsyncRead + AsyncWrite + Unpin,
T: AsyncRead + AsyncWrite + Unpin,
{
let mut buf = BytesMut::new();
frontend::startup_message(&config.server_params, &mut buf).map_err(Error::encode)?;
/// Fill the buffer until it's the exact length provided. No additional data will be read from the socket.
///
/// If the current buffer length is greater, nothing happens.
fn poll_fill_buf_exact(
self: Pin<&mut Self>,
cx: &mut Context<'_>,
len: usize,
) -> Poll<Result<&[u8], std::io::Error>> {
let this = self.get_mut();
let mut stream = Pin::new(this.inner.get_mut());
stream.send(buf.freeze()).await.map_err(Error::io)
let mut n = this.read_buf.len();
while n < len {
this.read_buf.resize(len, 0);
let mut buf = ReadBuf::new(&mut this.read_buf[..]);
buf.set_filled(n);
if stream.as_mut().poll_read(cx, &mut buf)?.is_pending() {
this.read_buf.truncate(n);
return Poll::Pending;
}
if buf.filled().len() == n {
return Poll::Ready(Err(std::io::Error::new(
std::io::ErrorKind::UnexpectedEof,
"early eof",
)));
}
n = buf.filled().len();
this.read_buf.truncate(n);
}
Poll::Ready(Ok(&this.read_buf[..len]))
}
pub fn into_framed(mut self) -> Framed<MaybeTlsStream<S, T>, PostgresCodec> {
*self.inner.read_buffer_mut() = self.read_buf;
self.inner
}
pub fn new(io: MaybeTlsStream<S, T>) -> Self {
let mut parts = FramedParts::new(io, PostgresCodec);
parts.write_buf = BytesMut::with_capacity(INITIAL_CAPACITY);
let mut inner = Framed::from_parts(parts);
// This is the default already, but nice to be explicit.
// We divide by two because writes will overshoot the boundary.
// We don't want constant overshoots to cause us to constantly re-shrink the buffer.
inner.set_backpressure_boundary(GC_THRESHOLD / 2);
Self {
inner,
read_buf: BytesMut::with_capacity(INITIAL_CAPACITY),
}
}
}
async fn authenticate<S, T>(stream: &mut StartupStream<S, T>, config: &Config) -> Result<(), Error>
pub(crate) async fn authenticate<S, T>(
stream: &mut StartupStream<S, T>,
config: &Config,
) -> Result<(), Error>
where
S: AsyncRead + AsyncWrite + Unpin,
T: TlsStream + Unpin,
{
frontend::startup_message(&config.server_params, stream.inner.write_buffer_mut())
.map_err(Error::encode)?;
stream.inner.flush().await.map_err(Error::io)?;
match stream.try_next().await.map_err(Error::io)? {
Some(Message::AuthenticationOk) => {
can_skip_channel_binding(config)?;
@@ -141,7 +147,8 @@ where
.as_ref()
.ok_or_else(|| Error::config("password missing".into()))?;
authenticate_password(stream, pass).await?;
frontend::password_message(pass, stream.inner.write_buffer_mut())
.map_err(Error::encode)?;
}
Some(Message::AuthenticationSasl(body)) => {
authenticate_sasl(stream, body, config).await?;
@@ -160,6 +167,7 @@ where
None => return Err(Error::closed()),
}
stream.inner.flush().await.map_err(Error::io)?;
match stream.try_next().await.map_err(Error::io)? {
Some(Message::AuthenticationOk) => Ok(()),
Some(Message::ErrorResponse(body)) => Err(Error::db(body)),
@@ -177,20 +185,6 @@ fn can_skip_channel_binding(config: &Config) -> Result<(), Error> {
}
}
async fn authenticate_password<S, T>(
stream: &mut StartupStream<S, T>,
password: &[u8],
) -> Result<(), Error>
where
S: AsyncRead + AsyncWrite + Unpin,
T: AsyncRead + AsyncWrite + Unpin,
{
let mut buf = BytesMut::new();
frontend::password_message(password, &mut buf).map_err(Error::encode)?;
stream.send(buf.freeze()).await.map_err(Error::io)
}
async fn authenticate_sasl<S, T>(
stream: &mut StartupStream<S, T>,
body: AuthenticationSaslBody,
@@ -245,10 +239,10 @@ where
return Err(Error::config("password or auth keys missing".into()));
};
let mut buf = BytesMut::new();
frontend::sasl_initial_response(mechanism, scram.message(), &mut buf).map_err(Error::encode)?;
stream.send(buf.freeze()).await.map_err(Error::io)?;
frontend::sasl_initial_response(mechanism, scram.message(), stream.inner.write_buffer_mut())
.map_err(Error::encode)?;
stream.inner.flush().await.map_err(Error::io)?;
let body = match stream.try_next().await.map_err(Error::io)? {
Some(Message::AuthenticationSaslContinue(body)) => body,
Some(Message::ErrorResponse(body)) => return Err(Error::db(body)),
@@ -261,10 +255,10 @@ where
.await
.map_err(|e| Error::authentication(e.into()))?;
let mut buf = BytesMut::new();
frontend::sasl_response(scram.message(), &mut buf).map_err(Error::encode)?;
stream.send(buf.freeze()).await.map_err(Error::io)?;
frontend::sasl_response(scram.message(), stream.inner.write_buffer_mut())
.map_err(Error::encode)?;
stream.inner.flush().await.map_err(Error::io)?;
let body = match stream.try_next().await.map_err(Error::io)? {
Some(Message::AuthenticationSaslFinal(body)) => body,
Some(Message::ErrorResponse(body)) => return Err(Error::db(body)),
@@ -278,35 +272,3 @@ where
Ok(())
}
async fn read_info<S, T>(
stream: &mut StartupStream<S, T>,
) -> Result<(i32, i32, HashMap<String, String>), Error>
where
S: AsyncRead + AsyncWrite + Unpin,
T: AsyncRead + AsyncWrite + Unpin,
{
let mut process_id = 0;
let mut secret_key = 0;
let mut parameters = HashMap::new();
loop {
match stream.try_next().await.map_err(Error::io)? {
Some(Message::BackendKeyData(body)) => {
process_id = body.process_id();
secret_key = body.secret_key();
}
Some(Message::ParameterStatus(body)) => {
parameters.insert(
body.name().map_err(Error::parse)?.to_string(),
body.value().map_err(Error::parse)?.to_string(),
);
}
Some(Message::NoticeResponse(body)) => stream.delayed_notice.push(body),
Some(Message::ReadyForQuery(_)) => return Ok((process_id, secret_key, parameters)),
Some(Message::ErrorResponse(body)) => return Err(Error::db(body)),
Some(_) => return Err(Error::unexpected_message()),
None => return Err(Error::closed()),
}
}
}

View File

@@ -44,6 +44,27 @@ pub struct Connection<S, T> {
state: State,
}
pub const INITIAL_CAPACITY: usize = 2 * 1024;
pub const GC_THRESHOLD: usize = 16 * 1024;
/// Gargabe collect the [`BytesMut`] if it has too much spare capacity.
pub fn gc_bytesmut(buf: &mut BytesMut) {
// We use a different mode to shrink the buf when above the threshold.
// When above the threshold, we only re-allocate when the buf has 2x spare capacity.
let reclaim = GC_THRESHOLD.checked_sub(buf.len()).unwrap_or(buf.len());
// `try_reclaim` tries to get the capacity from any shared `BytesMut`s,
// before then comparing the length against the capacity.
if buf.try_reclaim(reclaim) {
let capacity = usize::max(buf.len(), INITIAL_CAPACITY);
// Allocate a new `BytesMut` so that we deallocate the old version.
let mut new = BytesMut::with_capacity(capacity);
new.extend_from_slice(buf);
*buf = new;
}
}
pub enum Never {}
impl<S, T> Connection<S, T>
@@ -86,7 +107,14 @@ where
continue;
}
BackendMessage::Async(_) => continue,
BackendMessage::Normal { messages } => messages,
BackendMessage::Normal { messages, ready } => {
// if we read a ReadyForQuery from postgres, let's try GC the read buffer.
if ready {
gc_bytesmut(self.stream.read_buffer_mut());
}
messages
}
}
}
};
@@ -177,12 +205,7 @@ where
// Send a terminate message to postgres
Poll::Ready(None) => {
trace!("poll_write: at eof, terminating");
let mut request = BytesMut::new();
frontend::terminate(&mut request);
Pin::new(&mut self.stream)
.start_send(request.freeze())
.map_err(Error::io)?;
frontend::terminate(self.stream.write_buffer_mut());
trace!("poll_write: sent eof, closing");
trace!("poll_write: done");
@@ -205,6 +228,13 @@ where
{
Poll::Ready(()) => {
trace!("poll_flush: flushed");
// Since our codec prefers to share the buffer with the `Client`,
// if we don't release our share, then the `Client` would have to re-alloc
// the buffer when they next use it.
debug_assert!(self.stream.write_buffer().is_empty());
*self.stream.write_buffer_mut() = BytesMut::new();
Poll::Ready(Ok(()))
}
Poll::Pending => {

View File

@@ -452,16 +452,16 @@ impl Error {
Error(Box::new(ErrorInner { kind, cause }))
}
pub(crate) fn closed() -> Error {
pub fn closed() -> Error {
Error::new(Kind::Closed, None)
}
pub(crate) fn unexpected_message() -> Error {
pub fn unexpected_message() -> Error {
Error::new(Kind::UnexpectedMessage, None)
}
#[allow(clippy::needless_pass_by_value)]
pub(crate) fn db(error: ErrorResponseBody) -> Error {
pub fn db(error: ErrorResponseBody) -> Error {
match DbError::parse(&mut error.fields()) {
Ok(e) => Error::new(Kind::Db, Some(Box::new(e))),
Err(e) => Error::new(Kind::Parse, Some(Box::new(e))),
@@ -493,7 +493,7 @@ impl Error {
Error::new(Kind::Tls, Some(e))
}
pub(crate) fn io(e: io::Error) -> Error {
pub fn io(e: io::Error) -> Error {
Error::new(Kind::Io, Some(Box::new(e)))
}

View File

@@ -6,7 +6,6 @@ use postgres_protocol2::message::backend::ReadyForQueryBody;
pub use crate::cancel_token::{CancelToken, RawCancelToken};
pub use crate::client::{Client, SocketConfig};
pub use crate::config::Config;
pub use crate::connect_raw::RawConnection;
pub use crate::connection::Connection;
pub use crate::error::Error;
pub use crate::generic_client::GenericClient;
@@ -49,8 +48,8 @@ mod cancel_token;
mod client;
mod codec;
pub mod config;
mod connect;
mod connect_raw;
pub mod connect;
pub mod connect_raw;
mod connect_socket;
mod connect_tls;
mod connection;

View File

@@ -301,7 +301,12 @@ pub struct PullTimelineRequest {
pub tenant_id: TenantId,
pub timeline_id: TimelineId,
pub http_hosts: Vec<String>,
pub ignore_tombstone: Option<bool>,
/// Membership configuration to switch to after pull.
/// It guarantees that if pull_timeline returns successfully, the timeline will
/// not be deleted by request with an older generation.
/// Storage controller always sets this field.
/// None is only allowed for manual pull_timeline requests.
pub mconf: Option<Configuration>,
}
#[derive(Debug, Serialize, Deserialize)]

View File

@@ -49,7 +49,7 @@ impl PerfSpan {
}
}
pub fn enter(&self) -> PerfSpanEntered {
pub fn enter(&self) -> PerfSpanEntered<'_> {
if let Some(ref id) = self.inner.id() {
self.dispatch.enter(id);
}

View File

@@ -59,10 +59,6 @@ impl ShardCount {
pub const MAX: Self = Self(u8::MAX);
pub const MIN: Self = Self(0);
pub fn unsharded() -> Self {
ShardCount(0)
}
/// The internal value of a ShardCount may be zero, which means "1 shard, but use
/// legacy format for TenantShardId that excludes the shard suffix", also known
/// as [`TenantShardId::unsharded`].

View File

@@ -429,9 +429,11 @@ pub fn empty_shmem() -> crate::bindings::WalproposerShmemState {
};
let empty_wal_rate_limiter = crate::bindings::WalRateLimiter {
effective_max_wal_bytes_per_second: crate::bindings::pg_atomic_uint32 { value: 0 },
should_limit: crate::bindings::pg_atomic_uint32 { value: 0 },
sent_bytes: 0,
last_recorded_time_us: crate::bindings::pg_atomic_uint64 { value: 0 },
batch_start_time_us: crate::bindings::pg_atomic_uint64 { value: 0 },
batch_end_time_us: crate::bindings::pg_atomic_uint64 { value: 0 },
};
crate::bindings::WalproposerShmemState {

View File

@@ -54,7 +54,6 @@ pageserver_api.workspace = true
pageserver_client.workspace = true # for ResponseErrorMessageExt TOOD refactor that
pageserver_compaction.workspace = true
pageserver_page_api.workspace = true
peekable.workspace = true
pem.workspace = true
pin-project-lite.workspace = true
postgres_backend.workspace = true
@@ -67,7 +66,6 @@ postgres-types.workspace = true
posthog_client_lite.workspace = true
pprof.workspace = true
pq_proto.workspace = true
prost.workspace = true
rand.workspace = true
range-set-blaze = { version = "0.1.16", features = ["alloc"] }
regex.workspace = true

View File

@@ -14,9 +14,9 @@ use utils::logging::warn_slow;
use crate::pool::{ChannelPool, ClientGuard, ClientPool, StreamGuard, StreamPool};
use crate::retry::Retry;
use crate::split::GetPageSplitter;
use compute_api::spec::PageserverProtocol;
use pageserver_page_api as page_api;
use pageserver_page_api::GetPageSplitter;
use utils::id::{TenantId, TimelineId};
use utils::shard::{ShardCount, ShardIndex, ShardNumber, ShardStripeSize};

View File

@@ -1,6 +1,6 @@
mod client;
mod pool;
mod retry;
mod split;
pub use client::{PageserverClient, ShardSpec};
pub use pageserver_api::shard::ShardStripeSize; // used in ShardSpec

View File

@@ -3,18 +3,18 @@ use std::collections::HashMap;
use anyhow::anyhow;
use bytes::Bytes;
use crate::model::*;
use pageserver_api::key::rel_block_to_key;
use pageserver_api::shard::key_to_shard_number;
use pageserver_page_api as page_api;
use utils::shard::{ShardCount, ShardIndex, ShardStripeSize};
/// Splits GetPageRequests that straddle shard boundaries and assembles the responses.
/// TODO: add tests for this.
pub struct GetPageSplitter {
/// Split requests by shard index.
requests: HashMap<ShardIndex, GetPageRequest>,
requests: HashMap<ShardIndex, page_api::GetPageRequest>,
/// The response being assembled. Preallocated with empty pages, to be filled in.
response: GetPageResponse,
response: page_api::GetPageResponse,
/// Maps the offset in `request.block_numbers` and `response.pages` to the owning shard. Used
/// to assemble the response pages in the same order as the original request.
block_shards: Vec<ShardIndex>,
@@ -24,7 +24,7 @@ impl GetPageSplitter {
/// Checks if the given request only touches a single shard, and returns the shard ID. This is
/// the common case, so we check first in order to avoid unnecessary allocations and overhead.
pub fn for_single_shard(
req: &GetPageRequest,
req: &page_api::GetPageRequest,
count: ShardCount,
stripe_size: Option<ShardStripeSize>,
) -> anyhow::Result<Option<ShardIndex>> {
@@ -57,7 +57,7 @@ impl GetPageSplitter {
/// Splits the given request.
pub fn split(
req: GetPageRequest,
req: page_api::GetPageRequest,
count: ShardCount,
stripe_size: Option<ShardStripeSize>,
) -> anyhow::Result<Self> {
@@ -84,7 +84,7 @@ impl GetPageSplitter {
requests
.entry(shard_id)
.or_insert_with(|| GetPageRequest {
.or_insert_with(|| page_api::GetPageRequest {
request_id: req.request_id,
request_class: req.request_class,
rel: req.rel,
@@ -98,16 +98,16 @@ impl GetPageSplitter {
// Construct a response to be populated by shard responses. Preallocate empty page slots
// with the expected block numbers.
let response = GetPageResponse {
let response = page_api::GetPageResponse {
request_id: req.request_id,
status_code: GetPageStatusCode::Ok,
status_code: page_api::GetPageStatusCode::Ok,
reason: None,
rel: req.rel,
pages: req
.block_numbers
.into_iter()
.map(|block_number| {
Page {
page_api::Page {
block_number,
image: Bytes::new(), // empty page slot to be filled in
}
@@ -123,7 +123,9 @@ impl GetPageSplitter {
}
/// Drains the per-shard requests, moving them out of the splitter to avoid extra allocations.
pub fn drain_requests(&mut self) -> impl Iterator<Item = (ShardIndex, GetPageRequest)> {
pub fn drain_requests(
&mut self,
) -> impl Iterator<Item = (ShardIndex, page_api::GetPageRequest)> {
self.requests.drain()
}
@@ -133,10 +135,10 @@ impl GetPageSplitter {
pub fn add_response(
&mut self,
shard_id: ShardIndex,
response: GetPageResponse,
response: page_api::GetPageResponse,
) -> anyhow::Result<()> {
// The caller should already have converted status codes into tonic::Status.
if response.status_code != GetPageStatusCode::Ok {
if response.status_code != page_api::GetPageStatusCode::Ok {
return Err(anyhow!(
"unexpected non-OK response for shard {shard_id}: {} {}",
response.status_code,
@@ -207,7 +209,7 @@ impl GetPageSplitter {
/// Fetches the final, assembled response.
#[allow(clippy::result_large_err)]
pub fn get_response(self) -> anyhow::Result<GetPageResponse> {
pub fn get_response(self) -> anyhow::Result<page_api::GetPageResponse> {
// Check that the response is complete.
for (i, page) in self.response.pages.iter().enumerate() {
if page.image.is_empty() {

View File

@@ -19,9 +19,7 @@ pub mod proto {
}
mod client;
mod model;
mod split;
pub use client::Client;
mod model;
pub use model::*;
pub use split::GetPageSplitter;

View File

@@ -33,8 +33,6 @@ pub enum ProtocolError {
Invalid(&'static str, String),
#[error("required field '{0}' is missing")]
Missing(&'static str),
#[error("invalid combination of not_modified_lsn '{0}' and request_lsn '{1}'")]
InvalidLsns(Lsn, Lsn),
}
impl ProtocolError {
@@ -87,9 +85,9 @@ impl TryFrom<proto::ReadLsn> for ReadLsn {
return Err(ProtocolError::invalid("request_lsn", pb.request_lsn));
}
if pb.not_modified_since_lsn > pb.request_lsn {
return Err(ProtocolError::InvalidLsns(
Lsn(pb.not_modified_since_lsn),
Lsn(pb.request_lsn),
return Err(ProtocolError::invalid(
"not_modified_since_lsn",
pb.not_modified_since_lsn,
));
}
Ok(Self {

View File

@@ -25,9 +25,6 @@ tracing.workspace = true
tokio.workspace = true
tokio-stream.workspace = true
tokio-util.workspace = true
axum.workspace = true
http.workspace = true
metrics.workspace = true
tonic.workspace = true
url.workspace = true

View File

@@ -34,10 +34,6 @@ use crate::util::{request_stats, tokio_thread_local_stats};
/// GetPage@LatestLSN, uniformly distributed across the compute-accessible keyspace.
#[derive(clap::Parser)]
pub(crate) struct Args {
#[clap(long, default_value = "false")]
grpc: bool,
#[clap(long, default_value = "false")]
grpc_stream: bool,
#[clap(long, default_value = "http://localhost:9898")]
mgmt_api_endpoint: String,
/// Pageserver connection string. Supports postgresql:// and grpc:// protocols.
@@ -82,9 +78,6 @@ pub(crate) struct Args {
#[clap(long)]
set_io_mode: Option<pageserver_api::models::virtual_file::IoMode>,
#[clap(long)]
only_relnode: Option<u32>,
/// Queue depth generated in each client.
#[clap(long, default_value = "1")]
queue_depth: NonZeroUsize,
@@ -99,31 +92,10 @@ pub(crate) struct Args {
#[clap(long, default_value = "1")]
batch_size: NonZeroUsize,
#[clap(long)]
only_relnode: Option<u32>,
targets: Option<Vec<TenantTimelineId>>,
#[clap(long, default_value = "100")]
pool_max_consumers: NonZeroUsize,
#[clap(long, default_value = "5")]
pool_error_threshold: NonZeroUsize,
#[clap(long, default_value = "5000")]
pool_connect_timeout: NonZeroUsize,
#[clap(long, default_value = "1000")]
pool_connect_backoff: NonZeroUsize,
#[clap(long, default_value = "60000")]
pool_max_idle_duration: NonZeroUsize,
#[clap(long, default_value = "0")]
max_delay_ms: usize,
#[clap(long, default_value = "0")]
percent_drops: usize,
#[clap(long, default_value = "0")]
percent_hangs: usize,
}
/// State shared by all clients
@@ -180,6 +152,7 @@ pub(crate) fn main(args: Args) -> anyhow::Result<()> {
main_impl(args, thread_local_stats)
})
}
async fn main_impl(
args: Args,
all_thread_local_stats: AllThreadLocalStats<request_stats::Stats>,
@@ -344,7 +317,6 @@ async fn main_impl(
let rps_period = args
.per_client_rate
.map(|rps_limit| Duration::from_secs_f64(1.0 / (rps_limit as f64)));
let make_worker: &dyn Fn(WorkerId) -> Pin<Box<dyn Send + Future<Output = ()>>> = &|worker_id| {
let ss = shared_state.clone();
let cancel = cancel.clone();

View File

@@ -11,6 +11,7 @@
//! from data stored in object storage.
//!
use std::fmt::Write as FmtWrite;
use std::sync::Arc;
use std::time::{Instant, SystemTime};
use anyhow::{Context, anyhow};
@@ -420,12 +421,16 @@ where
}
let mut min_restart_lsn: Lsn = Lsn::MAX;
let mut dbdir_cnt = 0;
let mut rel_cnt = 0;
// Create tablespace directories
for ((spcnode, dbnode), has_relmap_file) in
self.timeline.list_dbdirs(self.lsn, self.ctx).await?
{
self.add_dbdir(spcnode, dbnode, has_relmap_file).await?;
dbdir_cnt += 1;
// If full backup is requested, include all relation files.
// Otherwise only include init forks of unlogged relations.
let rels = self
@@ -433,6 +438,7 @@ where
.list_rels(spcnode, dbnode, Version::at(self.lsn), self.ctx)
.await?;
for &rel in rels.iter() {
rel_cnt += 1;
// Send init fork as main fork to provide well formed empty
// contents of UNLOGGED relations. Postgres copies it in
// `reinit.c` during recovery.
@@ -455,6 +461,10 @@ where
}
}
self.timeline
.db_rel_count
.store(Some(Arc::new((dbdir_cnt, rel_cnt))));
let start_time = Instant::now();
let aux_files = self
.timeline

View File

@@ -715,7 +715,7 @@ fn start_pageserver(
disk_usage_eviction_state,
deletion_queue.new_client(),
secondary_controller,
feature_resolver,
feature_resolver.clone(),
)
.context("Failed to initialize router state")?,
);
@@ -841,14 +841,14 @@ fn start_pageserver(
} else {
None
},
feature_resolver.clone(),
);
// Spawn a Pageserver gRPC server task. It will spawn separate tasks for
// each stream/request.
// Spawn a Pageserver gRPC server task. It will spawn separate tasks for each request/stream.
// It uses a separate compute request Tokio runtime (COMPUTE_REQUEST_RUNTIME).
//
// TODO: this uses a separate Tokio runtime for the page service. If we want
// other gRPC services, they will need their own port and runtime. Is this
// necessary?
// NB: this port is exposed to computes. It should only provide services that we're okay with
// computes accessing. Internal services should use a separate port.
let mut page_service_grpc = None;
if let Some(grpc_listener) = grpc_listener {
page_service_grpc = Some(GrpcPageServiceHandler::spawn(

View File

@@ -156,6 +156,8 @@ impl FeatureResolver {
let tenant_properties = PerTenantProperties {
remote_size_mb: Some(rand::rng().random_range(100.0..1000000.00)),
db_count_max: Some(rand::rng().random_range(1..1000)),
rel_count_max: Some(rand::rng().random_range(1..1000)),
}
.into_posthog_properties();
@@ -344,6 +346,8 @@ impl FeatureResolver {
struct PerTenantProperties {
pub remote_size_mb: Option<f64>,
pub db_count_max: Option<usize>,
pub rel_count_max: Option<usize>,
}
impl PerTenantProperties {
@@ -355,6 +359,18 @@ impl PerTenantProperties {
PostHogFlagFilterPropertyValue::Number(remote_size_mb),
);
}
if let Some(db_count) = self.db_count_max {
properties.insert(
"tenant_db_count_max".to_string(),
PostHogFlagFilterPropertyValue::Number(db_count as f64),
);
}
if let Some(rel_count) = self.rel_count_max {
properties.insert(
"tenant_rel_count_max".to_string(),
PostHogFlagFilterPropertyValue::Number(rel_count as f64),
);
}
properties
}
}
@@ -409,7 +425,11 @@ impl TenantFeatureResolver {
/// Refresh the cached properties and flags on the critical path.
pub fn refresh_properties_and_flags(&self, tenant_shard: &TenantShard) {
// Any of the remote size is none => this property is none.
let mut remote_size_mb = Some(0.0);
// Any of the db or rel count is available => this property is available.
let mut db_count_max = None;
let mut rel_count_max = None;
for timeline in tenant_shard.list_timelines() {
let size = timeline.metrics.resident_physical_size_get();
if size == 0 {
@@ -419,9 +439,25 @@ impl TenantFeatureResolver {
if let Some(ref mut remote_size_mb) = remote_size_mb {
*remote_size_mb += size as f64 / 1024.0 / 1024.0;
}
if let Some(data) = timeline.db_rel_count.load_full() {
let (db_count, rel_count) = *data.as_ref();
if db_count_max.is_none() {
db_count_max = Some(db_count);
}
if rel_count_max.is_none() {
rel_count_max = Some(rel_count);
}
db_count_max = db_count_max.map(|max| max.max(db_count));
rel_count_max = rel_count_max.map(|max| max.max(rel_count));
}
}
self.cached_tenant_properties.store(Arc::new(
PerTenantProperties { remote_size_mb }.into_posthog_properties(),
PerTenantProperties {
remote_size_mb,
db_count_max,
rel_count_max,
}
.into_posthog_properties(),
));
// BEGIN: Update the feature flag on the critical path.

View File

@@ -484,6 +484,8 @@ async fn build_timeline_info_common(
*timeline.get_applied_gc_cutoff_lsn(),
);
let (rel_size_migration, rel_size_migrated_at) = timeline.get_rel_size_v2_status();
let info = TimelineInfo {
tenant_id: timeline.tenant_shard_id,
timeline_id: timeline.timeline_id,
@@ -515,7 +517,8 @@ async fn build_timeline_info_common(
state,
is_archived: Some(is_archived),
rel_size_migration: Some(timeline.get_rel_size_v2_status()),
rel_size_migration: Some(rel_size_migration),
rel_size_migrated_at,
is_invisible: Some(is_invisible),
walreceiver_status,
@@ -930,9 +933,16 @@ async fn timeline_patch_index_part_handler(
active_timeline_of_active_tenant(&state.tenant_manager, tenant_shard_id, timeline_id)
.await?;
if request_data.rel_size_migration.is_none() && request_data.rel_size_migrated_at.is_some()
{
return Err(ApiError::BadRequest(anyhow!(
"updating rel_size_migrated_at without rel_size_migration is not allowed"
)));
}
if let Some(rel_size_migration) = request_data.rel_size_migration {
timeline
.update_rel_size_v2_status(rel_size_migration)
.update_rel_size_v2_status(rel_size_migration, request_data.rel_size_migrated_at)
.map_err(ApiError::InternalServerError)?;
}
@@ -1995,6 +2005,10 @@ async fn put_tenant_location_config_handler(
let state = get_state(&request);
let conf = state.conf;
fail::fail_point!("put-location-conf-handler", |_| {
Err(ApiError::ResourceUnavailable("failpoint".into()))
});
// The `Detached` state is special, it doesn't upsert a tenant, it removes
// its local disk content and drops it from memory.
if let LocationConfigMode::Detached = request_data.config.mode {

View File

@@ -57,7 +57,7 @@ pub async fn import_timeline_from_postgres_datadir(
// TODO this shoud be start_lsn, which is not necessarily equal to end_lsn (aka lsn)
// Then fishing out pg_control would be unnecessary
let mut modification = tline.begin_modification(pgdata_lsn);
let mut modification = tline.begin_modification_for_import(pgdata_lsn);
modification.init_empty()?;
// Import all but pg_wal
@@ -309,7 +309,7 @@ async fn import_wal(
waldecoder.feed_bytes(&buf);
let mut nrecords = 0;
let mut modification = tline.begin_modification(last_lsn);
let mut modification = tline.begin_modification_for_import(last_lsn);
while last_lsn <= endpoint {
if let Some((lsn, recdata)) = waldecoder.poll_decode()? {
let interpreted = InterpretedWalRecord::from_bytes_filtered(
@@ -357,7 +357,7 @@ pub async fn import_basebackup_from_tar(
ctx: &RequestContext,
) -> Result<()> {
info!("importing base at {base_lsn}");
let mut modification = tline.begin_modification(base_lsn);
let mut modification = tline.begin_modification_for_import(base_lsn);
modification.init_empty()?;
let mut pg_control: Option<ControlFileData> = None;
@@ -457,7 +457,7 @@ pub async fn import_wal_from_tar(
waldecoder.feed_bytes(&bytes[offset..]);
let mut modification = tline.begin_modification(last_lsn);
let mut modification = tline.begin_modification_for_import(last_lsn);
while last_lsn <= end_lsn {
if let Some((lsn, recdata)) = waldecoder.poll_decode()? {
let interpreted = InterpretedWalRecord::from_bytes_filtered(

View File

@@ -16,8 +16,7 @@ use anyhow::{Context as _, bail};
use bytes::{Buf as _, BufMut as _, BytesMut};
use chrono::Utc;
use futures::future::BoxFuture;
use futures::stream::FuturesUnordered;
use futures::{FutureExt, Stream, StreamExt as _};
use futures::{FutureExt, Stream};
use itertools::Itertools;
use jsonwebtoken::TokenData;
use once_cell::sync::OnceCell;
@@ -36,8 +35,8 @@ use pageserver_api::pagestream_api::{
};
use pageserver_api::reltag::SlruKind;
use pageserver_api::shard::TenantShardId;
use pageserver_page_api as page_api;
use pageserver_page_api::proto;
use pageserver_page_api::{self as page_api, GetPageSplitter};
use postgres_backend::{
AuthType, PostgresBackend, PostgresBackendReader, QueryError, is_expected_io_error,
};
@@ -69,6 +68,7 @@ use crate::config::PageServerConf;
use crate::context::{
DownloadBehavior, PerfInstrumentFutureExt, RequestContext, RequestContextBuilder,
};
use crate::feature_resolver::FeatureResolver;
use crate::metrics::{
self, COMPUTE_COMMANDS_COUNTERS, ComputeCommandKind, GetPageBatchBreakReason, LIVE_CONNECTIONS,
MISROUTED_PAGESTREAM_REQUESTS, PAGESTREAM_HANDLER_RESULTS_TOTAL, SmgrOpTimer, TimelineMetrics,
@@ -140,6 +140,7 @@ pub fn spawn(
perf_trace_dispatch: Option<Dispatch>,
tcp_listener: tokio::net::TcpListener,
tls_config: Option<Arc<rustls::ServerConfig>>,
feature_resolver: FeatureResolver,
) -> Listener {
let cancel = CancellationToken::new();
let libpq_ctx = RequestContext::todo_child(
@@ -161,6 +162,7 @@ pub fn spawn(
conf.pg_auth_type,
tls_config,
conf.page_service_pipelining.clone(),
feature_resolver,
libpq_ctx,
cancel.clone(),
)
@@ -219,6 +221,7 @@ pub async fn libpq_listener_main(
auth_type: AuthType,
tls_config: Option<Arc<rustls::ServerConfig>>,
pipelining_config: PageServicePipeliningConfig,
feature_resolver: FeatureResolver,
listener_ctx: RequestContext,
listener_cancel: CancellationToken,
) -> Connections {
@@ -262,6 +265,7 @@ pub async fn libpq_listener_main(
auth_type,
tls_config.clone(),
pipelining_config.clone(),
feature_resolver.clone(),
connection_ctx,
connections_cancel.child_token(),
gate_guard,
@@ -304,6 +308,7 @@ async fn page_service_conn_main(
auth_type: AuthType,
tls_config: Option<Arc<rustls::ServerConfig>>,
pipelining_config: PageServicePipeliningConfig,
feature_resolver: FeatureResolver,
connection_ctx: RequestContext,
cancel: CancellationToken,
gate_guard: GateGuard,
@@ -371,6 +376,7 @@ async fn page_service_conn_main(
perf_span_fields,
connection_ctx,
cancel.clone(),
feature_resolver.clone(),
gate_guard,
);
let pgbackend =
@@ -422,6 +428,8 @@ struct PageServerHandler {
pipelining_config: PageServicePipeliningConfig,
get_vectored_concurrent_io: GetVectoredConcurrentIo,
feature_resolver: FeatureResolver,
gate_guard: GateGuard,
}
@@ -444,7 +452,6 @@ impl TimelineHandles {
handles: Default::default(),
}
}
async fn get(
&mut self,
tenant_id: TenantId,
@@ -471,13 +478,6 @@ impl TimelineHandles {
fn tenant_id(&self) -> Option<TenantId> {
self.wrapper.tenant_id.get().copied()
}
/// Returns whether a child shard exists locally for the given shard.
fn has_child_shard(&self, tenant_id: TenantId, shard_index: ShardIndex) -> bool {
self.wrapper
.tenant_manager
.has_child_shard(tenant_id, shard_index)
}
}
pub(crate) struct TenantManagerWrapper {
@@ -544,6 +544,7 @@ impl timeline::handle::TenantManager<TenantManagerTypes> for TenantManagerWrappe
match resolved {
ShardResolveResult::Found(tenant_shard) => break tenant_shard,
ShardResolveResult::NotFound => {
MISROUTED_PAGESTREAM_REQUESTS.inc();
return Err(GetActiveTimelineError::Tenant(
GetActiveTenantError::NotFound(GetTenantError::NotFound(*tenant_id)),
));
@@ -595,6 +596,15 @@ impl timeline::handle::TenantManager<TenantManagerTypes> for TenantManagerWrappe
}
}
/// Whether to hold the applied GC cutoff guard when processing GetPage requests.
/// This is determined once at the start of pagestream subprotocol handling based on
/// feature flags, configuration, and test conditions.
#[derive(Debug, Clone, Copy)]
enum HoldAppliedGcCutoffGuard {
Yes,
No,
}
#[derive(thiserror::Error, Debug)]
enum PageStreamError {
/// We encountered an error that should prompt the client to reconnect:
@@ -738,6 +748,7 @@ enum BatchedFeMessage {
GetPage {
span: Span,
shard: WeakHandle<TenantManagerTypes>,
applied_gc_cutoff_guard: Option<RcuReadGuard<Lsn>>,
pages: SmallVec<[BatchedGetPageRequest; 1]>,
batch_break_reason: GetPageBatchBreakReason,
},
@@ -917,6 +928,7 @@ impl PageServerHandler {
perf_span_fields: ConnectionPerfSpanFields,
connection_ctx: RequestContext,
cancel: CancellationToken,
feature_resolver: FeatureResolver,
gate_guard: GateGuard,
) -> Self {
PageServerHandler {
@@ -928,6 +940,7 @@ impl PageServerHandler {
cancel,
pipelining_config,
get_vectored_concurrent_io,
feature_resolver,
gate_guard,
}
}
@@ -967,6 +980,7 @@ impl PageServerHandler {
ctx: &RequestContext,
protocol_version: PagestreamProtocolVersion,
parent_span: Span,
hold_gc_cutoff_guard: HoldAppliedGcCutoffGuard,
) -> Result<Option<BatchedFeMessage>, QueryError>
where
IO: AsyncRead + AsyncWrite + Send + Sync + Unpin + 'static,
@@ -1204,19 +1218,27 @@ impl PageServerHandler {
})
.await?;
let applied_gc_cutoff_guard = shard.get_applied_gc_cutoff_lsn(); // hold guard
// We're holding the Handle
let effective_lsn = match Self::effective_request_lsn(
&shard,
shard.get_last_record_lsn(),
req.hdr.request_lsn,
req.hdr.not_modified_since,
&shard.get_applied_gc_cutoff_lsn(),
&applied_gc_cutoff_guard,
) {
Ok(lsn) => lsn,
Err(e) => {
return respond_error!(span, e);
}
};
let applied_gc_cutoff_guard = match hold_gc_cutoff_guard {
HoldAppliedGcCutoffGuard::Yes => Some(applied_gc_cutoff_guard),
HoldAppliedGcCutoffGuard::No => {
drop(applied_gc_cutoff_guard);
None
}
};
let batch_wait_ctx = if ctx.has_perf_span() {
Some(
@@ -1237,6 +1259,7 @@ impl PageServerHandler {
BatchedFeMessage::GetPage {
span,
shard: shard.downgrade(),
applied_gc_cutoff_guard,
pages: smallvec![BatchedGetPageRequest {
req,
timer,
@@ -1337,13 +1360,28 @@ impl PageServerHandler {
match (eligible_batch, this_msg) {
(
BatchedFeMessage::GetPage {
pages: accum_pages, ..
pages: accum_pages,
applied_gc_cutoff_guard: accum_applied_gc_cutoff_guard,
..
},
BatchedFeMessage::GetPage {
pages: this_pages, ..
pages: this_pages,
applied_gc_cutoff_guard: this_applied_gc_cutoff_guard,
..
},
) => {
accum_pages.extend(this_pages);
// the minimum of the two guards will keep data for both alive
match (&accum_applied_gc_cutoff_guard, this_applied_gc_cutoff_guard) {
(None, None) => (),
(None, Some(this)) => *accum_applied_gc_cutoff_guard = Some(this),
(Some(_), None) => (),
(Some(accum), Some(this)) => {
if **accum > *this {
*accum_applied_gc_cutoff_guard = Some(this);
}
}
};
Ok(())
}
#[cfg(feature = "testing")]
@@ -1658,6 +1696,7 @@ impl PageServerHandler {
BatchedFeMessage::GetPage {
span,
shard,
applied_gc_cutoff_guard,
pages,
batch_break_reason,
} => {
@@ -1677,6 +1716,7 @@ impl PageServerHandler {
.instrument(span.clone())
.await;
assert_eq!(res.len(), npages);
drop(applied_gc_cutoff_guard);
res
},
span,
@@ -1758,7 +1798,7 @@ impl PageServerHandler {
/// Coding discipline within this function: all interaction with the `pgb` connection
/// needs to be sensitive to connection shutdown, currently signalled via [`Self::cancel`].
/// This is so that we can shutdown page_service quickly.
#[instrument(skip_all)]
#[instrument(skip_all, fields(hold_gc_cutoff_guard))]
async fn handle_pagerequests<IO>(
&mut self,
pgb: &mut PostgresBackend<IO>,
@@ -1804,6 +1844,30 @@ impl PageServerHandler {
.take()
.expect("implementation error: timeline_handles should not be locked");
// Evaluate the expensive feature resolver check once per pagestream subprotocol handling
// instead of once per GetPage request. This is shared between pipelined and serial paths.
let hold_gc_cutoff_guard = if cfg!(test) || cfg!(feature = "testing") {
HoldAppliedGcCutoffGuard::Yes
} else {
// Use the global feature resolver with the tenant ID directly, avoiding the need
// to get a timeline/shard which might not be available on this pageserver node.
let empty_properties = std::collections::HashMap::new();
match self.feature_resolver.evaluate_boolean(
"page-service-getpage-hold-applied-gc-cutoff-guard",
tenant_id,
&empty_properties,
) {
Ok(()) => HoldAppliedGcCutoffGuard::Yes,
Err(_) => HoldAppliedGcCutoffGuard::No,
}
};
// record it in the span of handle_pagerequests so that both the request_span
// and the pipeline implementation spans contains the field.
Span::current().record(
"hold_gc_cutoff_guard",
tracing::field::debug(&hold_gc_cutoff_guard),
);
let request_span = info_span!("request");
let ((pgb_reader, timeline_handles), result) = match self.pipelining_config.clone() {
PageServicePipeliningConfig::Pipelined(pipelining_config) => {
@@ -1817,6 +1881,7 @@ impl PageServerHandler {
pipelining_config,
protocol_version,
io_concurrency,
hold_gc_cutoff_guard,
&ctx,
)
.await
@@ -1831,6 +1896,7 @@ impl PageServerHandler {
request_span,
protocol_version,
io_concurrency,
hold_gc_cutoff_guard,
&ctx,
)
.await
@@ -1859,6 +1925,7 @@ impl PageServerHandler {
request_span: Span,
protocol_version: PagestreamProtocolVersion,
io_concurrency: IoConcurrency,
hold_gc_cutoff_guard: HoldAppliedGcCutoffGuard,
ctx: &RequestContext,
) -> (
(PostgresBackendReader<IO>, TimelineHandles),
@@ -1880,6 +1947,7 @@ impl PageServerHandler {
ctx,
protocol_version,
request_span.clone(),
hold_gc_cutoff_guard,
)
.await;
let msg = match msg {
@@ -1927,6 +1995,7 @@ impl PageServerHandler {
pipelining_config: PageServicePipeliningConfigPipelined,
protocol_version: PagestreamProtocolVersion,
io_concurrency: IoConcurrency,
hold_gc_cutoff_guard: HoldAppliedGcCutoffGuard,
ctx: &RequestContext,
) -> (
(PostgresBackendReader<IO>, TimelineHandles),
@@ -2030,6 +2099,7 @@ impl PageServerHandler {
&ctx,
protocol_version,
request_span.clone(),
hold_gc_cutoff_guard,
)
.await;
let Some(read_res) = read_res.transpose() else {
@@ -2076,6 +2146,7 @@ impl PageServerHandler {
pages,
span: _,
shard: _,
applied_gc_cutoff_guard: _,
batch_break_reason: _,
} = &mut batch
{
@@ -3387,9 +3458,17 @@ impl GrpcPageServiceHandler {
}
}
/// Acquires a timeline handle for the given request. The shard index must match a local shard.
/// Acquires a timeline handle for the given request.
///
/// NB: this will fail during shard splits, see comment on [`Self::maybe_split_get_page`].
/// TODO: during shard splits, the compute may still be sending requests to the parent shard
/// until the entire split is committed and the compute is notified. Consider installing a
/// temporary shard router from the parent to the children while the split is in progress.
///
/// TODO: consider moving this to a middleware layer; all requests need it. Needs to manage
/// the TimelineHandles lifecycle.
///
/// TODO: untangle acquisition from TenantManagerWrapper::resolve() and Cache::get(), to avoid
/// the unnecessary overhead.
async fn get_request_timeline(
&self,
req: &tonic::Request<impl Any>,
@@ -3398,62 +3477,11 @@ impl GrpcPageServiceHandler {
let shard_index = *extract::<ShardIndex>(req);
let shard_selector = ShardSelector::Known(shard_index);
// TODO: untangle acquisition from TenantManagerWrapper::resolve() and Cache::get(), to
// avoid the unnecessary overhead.
TimelineHandles::new(self.tenant_manager.clone())
.get(ttid.tenant_id, ttid.timeline_id, shard_selector)
.await
}
/// Acquires a timeline handle for the given request, which must be for shard zero.
///
/// NB: during an ongoing shard split, the compute will keep talking to the parent shard until
/// the split is committed, but the parent shard may have been removed in the meanwhile. In that
/// case, we reroute the request to the new child shard. See [`Self::maybe_split_get_page`].
///
/// TODO: revamp the split protocol to avoid this child routing.
async fn get_shard_zero_request_timeline(
&self,
req: &tonic::Request<impl Any>,
) -> Result<Handle<TenantManagerTypes>, tonic::Status> {
let ttid = *extract::<TenantTimelineId>(req);
let shard_index = *extract::<ShardIndex>(req);
if shard_index.shard_number.0 != 0 {
return Err(tonic::Status::invalid_argument(format!(
"request must use shard zero (requested shard {shard_index})",
)));
}
// TODO: untangle acquisition from TenantManagerWrapper::resolve() and Cache::get(), to
// avoid the unnecessary overhead.
//
// TODO: this does internal retries, which will delay requests during shard splits (we won't
// look for the child until the parent's retries are exhausted). Don't do that.
let mut handles = TimelineHandles::new(self.tenant_manager.clone());
match handles
.get(
ttid.tenant_id,
ttid.timeline_id,
ShardSelector::Known(shard_index),
)
.await
{
Ok(timeline) => Ok(timeline),
Err(err) => {
// We may be in the middle of a shard split. Try to find a child shard 0.
if let Ok(timeline) = handles
.get(ttid.tenant_id, ttid.timeline_id, ShardSelector::Zero)
.await
&& timeline.get_shard_index().shard_count > shard_index.shard_count
{
return Ok(timeline);
}
Err(err.into())
}
}
}
/// Starts a SmgrOpTimer at received_at, throttles the request, and records execution start.
/// Only errors if the timeline is shutting down.
///
@@ -3480,27 +3508,44 @@ impl GrpcPageServiceHandler {
/// NB: errors returned from here are intercepted in get_pages(), and may be converted to a
/// GetPageResponse with an appropriate status code to avoid terminating the stream.
///
/// TODO: verify that the requested pages belong to this shard.
///
/// TODO: get_vectored() currently enforces a batch limit of 32. Postgres will typically send
/// batches up to effective_io_concurrency = 100. Either we have to accept large batches, or
/// split them up in the client or server.
#[instrument(skip_all, fields(
req_id = %req.request_id,
rel = %req.rel,
blkno = %req.block_numbers[0],
blks = %req.block_numbers.len(),
lsn = %req.read_lsn,
))]
#[instrument(skip_all, fields(req_id, rel, blkno, blks, req_lsn, mod_lsn))]
async fn get_page(
ctx: &RequestContext,
timeline: Handle<TenantManagerTypes>,
req: page_api::GetPageRequest,
timeline: &WeakHandle<TenantManagerTypes>,
req: proto::GetPageRequest,
io_concurrency: IoConcurrency,
received_at: Instant,
) -> Result<page_api::GetPageResponse, tonic::Status> {
) -> Result<proto::GetPageResponse, tonic::Status> {
let received_at = Instant::now();
let timeline = timeline.upgrade()?;
let ctx = ctx.with_scope_page_service_pagestream(&timeline);
// Validate the request, decorate the span, and convert it to a Pagestream request.
let req = page_api::GetPageRequest::try_from(req)?;
span_record!(
req_id = %req.request_id,
rel = %req.rel,
blkno = %req.block_numbers[0],
blks = %req.block_numbers.len(),
lsn = %req.read_lsn,
);
for &blkno in &req.block_numbers {
let shard = timeline.get_shard_identity();
let key = rel_block_to_key(req.rel, blkno);
if !shard.is_key_local(&key) {
return Err(tonic::Status::invalid_argument(format!(
"block {blkno} of relation {} requested on wrong shard {} (is on {})",
req.rel,
timeline.get_shard_index(),
ShardIndex::new(shard.get_shard_number(&key), shard.count),
)));
}
}
let latest_gc_cutoff_lsn = timeline.get_applied_gc_cutoff_lsn(); // hold guard
let effective_lsn = PageServerHandler::effective_request_lsn(
&timeline,
@@ -3575,96 +3620,7 @@ impl GrpcPageServiceHandler {
};
}
Ok(resp)
}
/// Processes a GetPage request when there is a potential shard split in progress. We have to
/// reroute the request any local child shards, and split batch requests that straddle multiple
/// child shards.
///
/// Parent shards are split and removed incrementally, but the compute is only notified once the
/// entire split commits, which can take several minutes. In the meanwhile, the compute will be
/// sending requests to the parent shard.
///
/// TODO: add test infrastructure to provoke this situation frequently and for long periods of
/// time, to properly exercise it.
///
/// TODO: revamp the split protocol to avoid this, e.g.:
/// * Keep the parent shard until the split commits and the compute is notified.
/// * Notify the compute about each subsplit.
/// * Return an error that updates the compute's shard map.
#[instrument(skip_all)]
async fn maybe_split_get_page(
ctx: &RequestContext,
handles: &mut TimelineHandles,
ttid: TenantTimelineId,
parent: ShardIndex,
req: page_api::GetPageRequest,
io_concurrency: IoConcurrency,
received_at: Instant,
) -> Result<page_api::GetPageResponse, tonic::Status> {
// Check the first page to see if we have any child shards at all. Otherwise, the compute is
// just talking to the wrong Pageserver. If the parent has been split, the shard now owning
// the page must have a higher shard count.
let timeline = handles
.get(
ttid.tenant_id,
ttid.timeline_id,
ShardSelector::Page(rel_block_to_key(req.rel, req.block_numbers[0])),
)
.await?;
let shard_id = timeline.get_shard_identity();
if shard_id.count <= parent.shard_count {
return Err(HandleUpgradeError::ShutDown.into()); // emulate original error
}
// Fast path: the request fits in a single shard.
if let Some(shard_index) =
GetPageSplitter::for_single_shard(&req, shard_id.count, Some(shard_id.stripe_size))
.map_err(|err| tonic::Status::internal(err.to_string()))?
{
// We got the shard ID from the first page, so these must be equal.
assert_eq!(shard_index.shard_number, shard_id.number);
assert_eq!(shard_index.shard_count, shard_id.count);
return Self::get_page(ctx, timeline, req, io_concurrency, received_at).await;
}
// The request spans multiple shards; split it and dispatch parallel requests. All pages
// were originally in the parent shard, and during a split all children are local, so we
// expect to find local shards for all pages.
let mut splitter = GetPageSplitter::split(req, shard_id.count, Some(shard_id.stripe_size))
.map_err(|err| tonic::Status::internal(err.to_string()))?;
let mut shard_requests = FuturesUnordered::new();
for (shard_index, shard_req) in splitter.drain_requests() {
let timeline = handles
.get(
ttid.tenant_id,
ttid.timeline_id,
ShardSelector::Known(shard_index),
)
.await?;
let future = Self::get_page(
ctx,
timeline,
shard_req,
io_concurrency.clone(),
received_at,
)
.map(move |result| result.map(|resp| (shard_index, resp)));
shard_requests.push(future);
}
while let Some((shard_index, shard_response)) = shard_requests.next().await.transpose()? {
splitter
.add_response(shard_index, shard_response)
.map_err(|err| tonic::Status::internal(err.to_string()))?;
}
splitter
.get_response()
.map_err(|err| tonic::Status::internal(err.to_string()))
Ok(resp.into())
}
}
@@ -3693,10 +3649,11 @@ impl proto::PageService for GrpcPageServiceHandler {
// to be the sweet spot where throughput is saturated.
const CHUNK_SIZE: usize = 256 * 1024;
let timeline = self.get_shard_zero_request_timeline(&req).await?;
let timeline = self.get_request_timeline(&req).await?;
let ctx = self.ctx.with_scope_timeline(&timeline);
// Validate the request and decorate the span.
Self::ensure_shard_zero(&timeline)?;
if timeline.is_archived() == Some(true) {
return Err(tonic::Status::failed_precondition("timeline is archived"));
}
@@ -3812,10 +3769,11 @@ impl proto::PageService for GrpcPageServiceHandler {
req: tonic::Request<proto::GetDbSizeRequest>,
) -> Result<tonic::Response<proto::GetDbSizeResponse>, tonic::Status> {
let received_at = extract::<ReceivedAt>(&req).0;
let timeline = self.get_shard_zero_request_timeline(&req).await?;
let timeline = self.get_request_timeline(&req).await?;
let ctx = self.ctx.with_scope_page_service_pagestream(&timeline);
// Validate the request, decorate the span, and convert it to a Pagestream request.
Self::ensure_shard_zero(&timeline)?;
let req: page_api::GetDbSizeRequest = req.into_inner().try_into()?;
span_record!(db_oid=%req.db_oid, lsn=%req.read_lsn);
@@ -3844,33 +3802,14 @@ impl proto::PageService for GrpcPageServiceHandler {
req: tonic::Request<tonic::Streaming<proto::GetPageRequest>>,
) -> Result<tonic::Response<Self::GetPagesStream>, tonic::Status> {
// Extract the timeline from the request and check that it exists.
//
// NB: during shard splits, the compute may still send requests to the parent shard. We'll
// reroute requests to the child shards below, but we also detect the common cases here
// where either the shard exists or no shards exist at all. If we have a child shard, we
// can't acquire a weak handle because we don't know which child shard to use yet.
//
// TODO: TimelineHandles.get() does internal retries, which will delay requests during shard
// splits. It shouldn't.
let ttid = *extract::<TenantTimelineId>(&req);
let shard_index = *extract::<ShardIndex>(&req);
let shard_selector = ShardSelector::Known(shard_index);
let mut handles = TimelineHandles::new(self.tenant_manager.clone());
let timeline = match handles
.get(
ttid.tenant_id,
ttid.timeline_id,
ShardSelector::Known(shard_index),
)
.await
{
// The timeline shard exists. Keep a weak handle to reuse for each request.
Ok(timeline) => Some(timeline.downgrade()),
// The shard doesn't exist, but a child shard does. We'll reroute requests later.
Err(_) if handles.has_child_shard(ttid.tenant_id, shard_index) => None,
// Failed to fetch the timeline, and no child shard exists. Error out.
Err(err) => return Err(err.into()),
};
handles
.get(ttid.tenant_id, ttid.timeline_id, shard_selector)
.await?;
// Spawn an IoConcurrency sidecar, if enabled.
let gate_guard = self
@@ -3887,9 +3826,11 @@ impl proto::PageService for GrpcPageServiceHandler {
let mut reqs = req.into_inner();
let resps = async_stream::try_stream! {
let timeline = handles
.get(ttid.tenant_id, ttid.timeline_id, shard_selector)
.await?
.downgrade();
loop {
// Wait for the next client request.
//
// NB: Tonic considers the entire stream to be an in-flight request and will wait
// for it to complete before shutting down. React to cancellation between requests.
let req = tokio::select! {
@@ -3902,43 +3843,16 @@ impl proto::PageService for GrpcPageServiceHandler {
Err(err) => Err(err),
},
}?;
let received_at = Instant::now();
let req_id = req.request_id.map(page_api::RequestID::from).unwrap_or_default();
// Process the request, using a closure to capture errors.
let process_request = async || {
let req = page_api::GetPageRequest::try_from(req)?;
// Fast path: use the pre-acquired timeline handle.
if let Some(Ok(timeline)) = timeline.as_ref().map(|t| t.upgrade()) {
return Self::get_page(&ctx, timeline, req, io_concurrency.clone(), received_at)
.instrument(span.clone()) // propagate request span
.await
}
// The timeline handle is stale. During shard splits, the compute may still be
// sending requests to the parent shard. Try to re-route requests to the child
// shards, and split any batch requests that straddle multiple child shards.
Self::maybe_split_get_page(
&ctx,
&mut handles,
ttid,
shard_index,
req,
io_concurrency.clone(),
received_at,
)
let result = Self::get_page(&ctx, &timeline, req, io_concurrency.clone())
.instrument(span.clone()) // propagate request span
.await
};
// Return the response. Convert per-request errors to GetPageResponses if
// appropriate, or terminate the stream with a tonic::Status.
yield match process_request().await {
Ok(resp) => resp.into(),
.await;
yield match result {
Ok(resp) => resp,
// Convert per-request errors to GetPageResponses as appropriate, or terminate
// the stream with a tonic::Status. Log the error regardless, since
// ObservabilityLayer can't automatically log stream errors.
Err(status) => {
// Log the error, since ObservabilityLayer won't see stream errors.
// TODO: it would be nice if we could propagate the get_page() fields here.
span.in_scope(|| {
warn!("request failed with {:?}: {}", status.code(), status.message());
@@ -3958,10 +3872,11 @@ impl proto::PageService for GrpcPageServiceHandler {
req: tonic::Request<proto::GetRelSizeRequest>,
) -> Result<tonic::Response<proto::GetRelSizeResponse>, tonic::Status> {
let received_at = extract::<ReceivedAt>(&req).0;
let timeline = self.get_shard_zero_request_timeline(&req).await?;
let timeline = self.get_request_timeline(&req).await?;
let ctx = self.ctx.with_scope_page_service_pagestream(&timeline);
// Validate the request, decorate the span, and convert it to a Pagestream request.
Self::ensure_shard_zero(&timeline)?;
let req: page_api::GetRelSizeRequest = req.into_inner().try_into()?;
let allow_missing = req.allow_missing;
@@ -3994,7 +3909,7 @@ impl proto::PageService for GrpcPageServiceHandler {
req: tonic::Request<proto::GetSlruSegmentRequest>,
) -> Result<tonic::Response<proto::GetSlruSegmentResponse>, tonic::Status> {
let received_at = extract::<ReceivedAt>(&req).0;
let timeline = self.get_shard_zero_request_timeline(&req).await?;
let timeline = self.get_request_timeline(&req).await?;
let ctx = self.ctx.with_scope_page_service_pagestream(&timeline);
// Validate the request, decorate the span, and convert it to a Pagestream request.
@@ -4028,10 +3943,6 @@ impl proto::PageService for GrpcPageServiceHandler {
&self,
req: tonic::Request<proto::LeaseLsnRequest>,
) -> Result<tonic::Response<proto::LeaseLsnResponse>, tonic::Status> {
// TODO: this won't work during shard splits, as the request is directed at a specific shard
// but the parent shard is removed before the split commits and the compute is notified
// (which can take several minutes for large tenants). That's also the case for the libpq
// implementation, so we keep the behavior for now.
let timeline = self.get_request_timeline(&req).await?;
let ctx = self.ctx.with_scope_timeline(&timeline);

View File

@@ -6,8 +6,9 @@
//! walingest.rs handles a few things like implicit relation creation and extension.
//! Clarify that)
//!
use std::collections::{HashMap, HashSet, hash_map};
use std::collections::{BTreeSet, HashMap, HashSet, hash_map};
use std::ops::{ControlFlow, Range};
use std::sync::Arc;
use crate::walingest::{WalIngestError, WalIngestErrorKind};
use crate::{PERF_TRACE_TARGET, ensure_walingest};
@@ -226,6 +227,25 @@ impl Timeline {
pending_nblocks: 0,
pending_directory_entries: Vec::new(),
pending_metadata_bytes: 0,
is_importing_pgdata: false,
lsn,
}
}
pub fn begin_modification_for_import(&self, lsn: Lsn) -> DatadirModification
where
Self: Sized,
{
DatadirModification {
tline: self,
pending_lsns: Vec::new(),
pending_metadata_pages: HashMap::new(),
pending_data_batch: None,
pending_deletions: Vec::new(),
pending_nblocks: 0,
pending_directory_entries: Vec::new(),
pending_metadata_bytes: 0,
is_importing_pgdata: true,
lsn,
}
}
@@ -595,6 +615,50 @@ impl Timeline {
self.get_rel_exists_in_reldir(tag, version, None, ctx).await
}
async fn get_rel_exists_in_reldir_v1(
&self,
tag: RelTag,
version: Version<'_>,
deserialized_reldir_v1: Option<(Key, &RelDirectory)>,
ctx: &RequestContext,
) -> Result<bool, PageReconstructError> {
let key = rel_dir_to_key(tag.spcnode, tag.dbnode);
if let Some((cached_key, dir)) = deserialized_reldir_v1 {
if cached_key == key {
return Ok(dir.rels.contains(&(tag.relnode, tag.forknum)));
} else if cfg!(test) || cfg!(feature = "testing") {
panic!("cached reldir key mismatch: {cached_key} != {key}");
} else {
warn!("cached reldir key mismatch: {cached_key} != {key}");
}
// Fallback to reading the directory from the datadir.
}
let buf = version.get(self, key, ctx).await?;
let dir = RelDirectory::des(&buf)?;
Ok(dir.rels.contains(&(tag.relnode, tag.forknum)))
}
async fn get_rel_exists_in_reldir_v2(
&self,
tag: RelTag,
version: Version<'_>,
ctx: &RequestContext,
) -> Result<bool, PageReconstructError> {
let key = rel_tag_sparse_key(tag.spcnode, tag.dbnode, tag.relnode, tag.forknum);
let buf = RelDirExists::decode_option(version.sparse_get(self, key, ctx).await?).map_err(
|_| {
PageReconstructError::Other(anyhow::anyhow!(
"invalid reldir key: decode failed, {}",
key
))
},
)?;
let exists_v2 = buf == RelDirExists::Exists;
Ok(exists_v2)
}
/// Does the relation exist? With a cached deserialized `RelDirectory`.
///
/// There are some cases where the caller loops across all relations. In that specific case,
@@ -626,45 +690,134 @@ impl Timeline {
return Ok(false);
}
// Read path: first read the new reldir keyspace. Early return if the relation exists.
// Otherwise, read the old reldir keyspace.
// TODO: if IndexPart::rel_size_migration is `Migrated`, we only need to read from v2.
let (v2_status, migrated_lsn) = self.get_rel_size_v2_status();
if let RelSizeMigration::Migrated | RelSizeMigration::Migrating =
self.get_rel_size_v2_status()
{
// fetch directory listing (new)
let key = rel_tag_sparse_key(tag.spcnode, tag.dbnode, tag.relnode, tag.forknum);
let buf = RelDirExists::decode_option(version.sparse_get(self, key, ctx).await?)
.map_err(|_| PageReconstructError::Other(anyhow::anyhow!("invalid reldir key")))?;
let exists_v2 = buf == RelDirExists::Exists;
// Fast path: if the relation exists in the new format, return true.
// TODO: we should have a verification mode that checks both keyspaces
// to ensure the relation only exists in one of them.
if exists_v2 {
return Ok(true);
match v2_status {
RelSizeMigration::Legacy => {
let v1_exists = self
.get_rel_exists_in_reldir_v1(tag, version, deserialized_reldir_v1, ctx)
.await?;
Ok(v1_exists)
}
RelSizeMigration::Migrating | RelSizeMigration::Migrated
if version.get_lsn() < migrated_lsn.unwrap_or(Lsn(0)) =>
{
// For requests below the migrated LSN, we still use the v1 read path.
let v1_exists = self
.get_rel_exists_in_reldir_v1(tag, version, deserialized_reldir_v1, ctx)
.await?;
Ok(v1_exists)
}
RelSizeMigration::Migrating => {
let v1_exists = self
.get_rel_exists_in_reldir_v1(tag, version, deserialized_reldir_v1, ctx)
.await?;
let v2_exists_res = self.get_rel_exists_in_reldir_v2(tag, version, ctx).await;
match v2_exists_res {
Ok(v2_exists) if v1_exists == v2_exists => {}
Ok(v2_exists) => {
tracing::warn!(
"inconsistent v1/v2 reldir keyspace for rel {}: v1_exists={}, v2_exists={}",
tag,
v1_exists,
v2_exists
);
}
Err(e) => {
tracing::warn!("failed to get rel exists in v2: {e}");
}
}
Ok(v1_exists)
}
RelSizeMigration::Migrated => {
let v2_exists = self.get_rel_exists_in_reldir_v2(tag, version, ctx).await?;
Ok(v2_exists)
}
}
}
// fetch directory listing (old)
let key = rel_dir_to_key(tag.spcnode, tag.dbnode);
if let Some((cached_key, dir)) = deserialized_reldir_v1 {
if cached_key == key {
return Ok(dir.rels.contains(&(tag.relnode, tag.forknum)));
} else if cfg!(test) || cfg!(feature = "testing") {
panic!("cached reldir key mismatch: {cached_key} != {key}");
} else {
warn!("cached reldir key mismatch: {cached_key} != {key}");
}
// Fallback to reading the directory from the datadir.
}
async fn list_rels_v1(
&self,
spcnode: Oid,
dbnode: Oid,
version: Version<'_>,
ctx: &RequestContext,
) -> Result<HashSet<RelTag>, PageReconstructError> {
let key = rel_dir_to_key(spcnode, dbnode);
let buf = version.get(self, key, ctx).await?;
let dir = RelDirectory::des(&buf)?;
let exists_v1 = dir.rels.contains(&(tag.relnode, tag.forknum));
Ok(exists_v1)
let rels_v1: HashSet<RelTag> =
HashSet::from_iter(dir.rels.iter().map(|(relnode, forknum)| RelTag {
spcnode,
dbnode,
relnode: *relnode,
forknum: *forknum,
}));
Ok(rels_v1)
}
async fn list_rels_v2(
&self,
spcnode: Oid,
dbnode: Oid,
version: Version<'_>,
ctx: &RequestContext,
) -> Result<HashSet<RelTag>, PageReconstructError> {
let key_range = rel_tag_sparse_key_range(spcnode, dbnode);
let io_concurrency = IoConcurrency::spawn_from_conf(
self.conf.get_vectored_concurrent_io,
self.gate
.enter()
.map_err(|_| PageReconstructError::Cancelled)?,
);
let results = self
.scan(
KeySpace::single(key_range),
version.get_lsn(),
ctx,
io_concurrency,
)
.await?;
let mut rels = HashSet::new();
for (key, val) in results {
let val = RelDirExists::decode(&val?).map_err(|_| {
PageReconstructError::Other(anyhow::anyhow!(
"invalid reldir key: decode failed, {}",
key
))
})?;
if key.field6 != 1 {
return Err(PageReconstructError::Other(anyhow::anyhow!(
"invalid reldir key: field6 != 1, {}",
key
)));
}
if key.field2 != spcnode {
return Err(PageReconstructError::Other(anyhow::anyhow!(
"invalid reldir key: field2 != spcnode, {}",
key
)));
}
if key.field3 != dbnode {
return Err(PageReconstructError::Other(anyhow::anyhow!(
"invalid reldir key: field3 != dbnode, {}",
key
)));
}
let tag = RelTag {
spcnode,
dbnode,
relnode: key.field4,
forknum: key.field5,
};
if val == RelDirExists::Removed {
debug_assert!(!rels.contains(&tag), "removed reltag in v2");
continue;
}
let did_not_contain = rels.insert(tag);
debug_assert!(did_not_contain, "duplicate reltag in v2");
}
Ok(rels)
}
/// Get a list of all existing relations in given tablespace and database.
@@ -682,60 +835,45 @@ impl Timeline {
version: Version<'_>,
ctx: &RequestContext,
) -> Result<HashSet<RelTag>, PageReconstructError> {
// fetch directory listing (old)
let key = rel_dir_to_key(spcnode, dbnode);
let buf = version.get(self, key, ctx).await?;
let (v2_status, migrated_lsn) = self.get_rel_size_v2_status();
let dir = RelDirectory::des(&buf)?;
let rels_v1: HashSet<RelTag> =
HashSet::from_iter(dir.rels.iter().map(|(relnode, forknum)| RelTag {
spcnode,
dbnode,
relnode: *relnode,
forknum: *forknum,
}));
if let RelSizeMigration::Legacy = self.get_rel_size_v2_status() {
return Ok(rels_v1);
}
// scan directory listing (new), merge with the old results
let key_range = rel_tag_sparse_key_range(spcnode, dbnode);
let io_concurrency = IoConcurrency::spawn_from_conf(
self.conf.get_vectored_concurrent_io,
self.gate
.enter()
.map_err(|_| PageReconstructError::Cancelled)?,
);
let results = self
.scan(
KeySpace::single(key_range),
version.get_lsn(),
ctx,
io_concurrency,
)
.await?;
let mut rels = rels_v1;
for (key, val) in results {
let val = RelDirExists::decode(&val?)
.map_err(|_| PageReconstructError::Other(anyhow::anyhow!("invalid reldir key")))?;
assert_eq!(key.field6, 1);
assert_eq!(key.field2, spcnode);
assert_eq!(key.field3, dbnode);
let tag = RelTag {
spcnode,
dbnode,
relnode: key.field4,
forknum: key.field5,
};
if val == RelDirExists::Removed {
debug_assert!(!rels.contains(&tag), "removed reltag in v2");
continue;
match v2_status {
RelSizeMigration::Legacy => {
let rels_v1 = self.list_rels_v1(spcnode, dbnode, version, ctx).await?;
Ok(rels_v1)
}
RelSizeMigration::Migrating | RelSizeMigration::Migrated
if version.get_lsn() < migrated_lsn.unwrap_or(Lsn(0)) =>
{
// For requests below the migrated LSN, we still use the v1 read path.
let rels_v1 = self.list_rels_v1(spcnode, dbnode, version, ctx).await?;
Ok(rels_v1)
}
RelSizeMigration::Migrating => {
let rels_v1 = self.list_rels_v1(spcnode, dbnode, version, ctx).await?;
let rels_v2_res = self.list_rels_v2(spcnode, dbnode, version, ctx).await;
match rels_v2_res {
Ok(rels_v2) if rels_v1 == rels_v2 => {}
Ok(rels_v2) => {
tracing::warn!(
"inconsistent v1/v2 reldir keyspace for db {} {}: v1_rels.len()={}, v2_rels.len()={}",
spcnode,
dbnode,
rels_v1.len(),
rels_v2.len()
);
}
Err(e) => {
tracing::warn!("failed to list rels in v2: {e}");
}
}
Ok(rels_v1)
}
RelSizeMigration::Migrated => {
let rels_v2 = self.list_rels_v2(spcnode, dbnode, version, ctx).await?;
Ok(rels_v2)
}
let did_not_contain = rels.insert(tag);
debug_assert!(did_not_contain, "duplicate reltag in v2");
}
Ok(rels)
}
/// Get the whole SLRU segment
@@ -1254,11 +1392,16 @@ impl Timeline {
let dbdir = DbDirectory::des(&buf)?;
let mut total_size: u64 = 0;
for (spcnode, dbnode) in dbdir.dbdirs.keys() {
let mut dbdir_cnt = 0;
let mut rel_cnt = 0;
for &(spcnode, dbnode) in dbdir.dbdirs.keys() {
dbdir_cnt += 1;
for rel in self
.list_rels(*spcnode, *dbnode, Version::at(lsn), ctx)
.list_rels(spcnode, dbnode, Version::at(lsn), ctx)
.await?
{
rel_cnt += 1;
if self.cancel.is_cancelled() {
return Err(CalculateLogicalSizeError::Cancelled);
}
@@ -1269,6 +1412,10 @@ impl Timeline {
total_size += relsize as u64;
}
}
self.db_rel_count
.store(Some(Arc::new((dbdir_cnt, rel_cnt))));
Ok(total_size * BLCKSZ as u64)
}
@@ -1556,6 +1703,9 @@ pub struct DatadirModification<'a> {
/// An **approximation** of how many metadata bytes will be written to the EphemeralFile.
pending_metadata_bytes: usize,
/// Whether we are importing a pgdata directory.
is_importing_pgdata: bool,
}
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
@@ -1568,6 +1718,14 @@ pub enum MetricsUpdate {
Sub(u64),
}
/// Controls the behavior of the reldir keyspace.
pub struct RelDirMode {
// Whether we can read the v2 keyspace or not.
current_status: RelSizeMigration,
// Whether we should initialize the v2 keyspace or not.
initialize: bool,
}
impl DatadirModification<'_> {
// When a DatadirModification is committed, we do a monolithic serialization of all its contents. WAL records can
// contain multiple pages, so the pageserver's record-based batch size isn't sufficient to bound this allocation: we
@@ -1923,30 +2081,49 @@ impl DatadirModification<'_> {
}
/// Returns `true` if the rel_size_v2 write path is enabled. If it is the first time that
/// we enable it, we also need to persist it in `index_part.json`.
pub fn maybe_enable_rel_size_v2(&mut self) -> anyhow::Result<bool> {
let status = self.tline.get_rel_size_v2_status();
/// we enable it, we also need to persist it in `index_part.json` (initialize is true).
///
/// As this function is only used on the write path, we do not need to read the migrated_at
/// field.
pub fn maybe_enable_rel_size_v2(&mut self, is_create: bool) -> anyhow::Result<RelDirMode> {
// TODO: define the behavior of the tenant-level config flag and use feature flag to enable this feature
let (status, _) = self.tline.get_rel_size_v2_status();
let config = self.tline.get_rel_size_v2_enabled();
match (config, status) {
(false, RelSizeMigration::Legacy) => {
// tenant config didn't enable it and we didn't write any reldir_v2 key yet
Ok(false)
Ok(RelDirMode {
current_status: RelSizeMigration::Legacy,
initialize: false,
})
}
(false, RelSizeMigration::Migrating | RelSizeMigration::Migrated) => {
(false, status @ RelSizeMigration::Migrating | status @ RelSizeMigration::Migrated) => {
// index_part already persisted that the timeline has enabled rel_size_v2
Ok(true)
Ok(RelDirMode {
current_status: status,
initialize: false,
})
}
(true, RelSizeMigration::Legacy) => {
// The first time we enable it, we need to persist it in `index_part.json`
self.tline
.update_rel_size_v2_status(RelSizeMigration::Migrating)?;
tracing::info!("enabled rel_size_v2");
Ok(true)
// The caller should update the reldir status once the initialization is done.
//
// Only initialize the v2 keyspace on new relation creation. No initialization
// during `timeline_create` (TODO: fix this, we should allow, but currently it
// hits consistency issues).
Ok(RelDirMode {
current_status: RelSizeMigration::Legacy,
initialize: is_create && !self.is_importing_pgdata,
})
}
(true, RelSizeMigration::Migrating | RelSizeMigration::Migrated) => {
(true, status @ RelSizeMigration::Migrating | status @ RelSizeMigration::Migrated) => {
// index_part already persisted that the timeline has enabled rel_size_v2
// and we don't need to do anything
Ok(true)
Ok(RelDirMode {
current_status: status,
initialize: false,
})
}
}
}
@@ -1959,8 +2136,8 @@ impl DatadirModification<'_> {
img: Bytes,
ctx: &RequestContext,
) -> Result<(), WalIngestError> {
let v2_enabled = self
.maybe_enable_rel_size_v2()
let v2_mode = self
.maybe_enable_rel_size_v2(false)
.map_err(WalIngestErrorKind::MaybeRelSizeV2Error)?;
// Add it to the directory (if it doesn't exist already)
@@ -1976,17 +2153,19 @@ impl DatadirModification<'_> {
self.put(DBDIR_KEY, Value::Image(buf.into()));
}
if r.is_none() {
// Create RelDirectory
// TODO: if we have fully migrated to v2, no need to create this directory
if v2_mode.current_status != RelSizeMigration::Legacy {
self.pending_directory_entries
.push((DirectoryKind::RelV2, MetricsUpdate::Set(0)));
}
// Create RelDirectory in v1 keyspace. TODO: if we have fully migrated to v2, no need to create this directory.
// Some code path relies on this directory to be present. We should remove it once we starts to set tenants to
// `RelSizeMigration::Migrated` state (currently we don't, all tenants will have `RelSizeMigration::Migrating`).
let buf = RelDirectory::ser(&RelDirectory {
rels: HashSet::new(),
})?;
self.pending_directory_entries
.push((DirectoryKind::Rel, MetricsUpdate::Set(0)));
if v2_enabled {
self.pending_directory_entries
.push((DirectoryKind::RelV2, MetricsUpdate::Set(0)));
}
self.put(
rel_dir_to_key(spcnode, dbnode),
Value::Image(Bytes::from(buf)),
@@ -2093,6 +2272,109 @@ impl DatadirModification<'_> {
Ok(())
}
async fn initialize_rel_size_v2_keyspace(
&mut self,
ctx: &RequestContext,
dbdir: &DbDirectory,
) -> Result<(), WalIngestError> {
// Copy everything from relv1 to relv2; TODO: check if there's any key in the v2 keyspace, if so, abort.
tracing::info!("initializing rel_size_v2 keyspace");
let mut rel_cnt = 0;
// relmap_exists (the value of dbdirs hashmap) does not affect the migration: we need to copy things over anyways
for &(spcnode, dbnode) in dbdir.dbdirs.keys() {
let rel_dir_key = rel_dir_to_key(spcnode, dbnode);
let rel_dir = RelDirectory::des(&self.get(rel_dir_key, ctx).await?)?;
for (relnode, forknum) in rel_dir.rels {
let sparse_rel_dir_key = rel_tag_sparse_key(spcnode, dbnode, relnode, forknum);
self.put(
sparse_rel_dir_key,
Value::Image(RelDirExists::Exists.encode()),
);
tracing::info!(
"migrated rel_size_v2: {}",
RelTag {
spcnode,
dbnode,
relnode,
forknum
}
);
rel_cnt += 1;
}
}
tracing::info!(
"initialized rel_size_v2 keyspace at lsn {}: migrated {} relations",
self.lsn,
rel_cnt
);
self.tline
.update_rel_size_v2_status(RelSizeMigration::Migrating, Some(self.lsn))
.map_err(WalIngestErrorKind::MaybeRelSizeV2Error)?;
Ok::<_, WalIngestError>(())
}
async fn put_rel_creation_v1(
&mut self,
rel: RelTag,
dbdir_exists: bool,
ctx: &RequestContext,
) -> Result<(), WalIngestError> {
// Reldir v1 write path
let rel_dir_key = rel_dir_to_key(rel.spcnode, rel.dbnode);
let mut rel_dir = if !dbdir_exists {
// Create the RelDirectory
RelDirectory::default()
} else {
// reldir already exists, fetch it
RelDirectory::des(&self.get(rel_dir_key, ctx).await?)?
};
// Add the new relation to the rel directory entry, and write it back
if !rel_dir.rels.insert((rel.relnode, rel.forknum)) {
Err(WalIngestErrorKind::RelationAlreadyExists(rel))?;
}
if !dbdir_exists {
self.pending_directory_entries
.push((DirectoryKind::Rel, MetricsUpdate::Set(0)))
}
self.pending_directory_entries
.push((DirectoryKind::Rel, MetricsUpdate::Add(1)));
self.put(
rel_dir_key,
Value::Image(Bytes::from(RelDirectory::ser(&rel_dir)?)),
);
Ok(())
}
async fn put_rel_creation_v2(
&mut self,
rel: RelTag,
dbdir_exists: bool,
ctx: &RequestContext,
) -> Result<(), WalIngestError> {
// Reldir v2 write path
let sparse_rel_dir_key =
rel_tag_sparse_key(rel.spcnode, rel.dbnode, rel.relnode, rel.forknum);
// check if the rel_dir_key exists in v2
let val = self.sparse_get(sparse_rel_dir_key, ctx).await?;
let val = RelDirExists::decode_option(val)
.map_err(|_| WalIngestErrorKind::InvalidRelDirKey(sparse_rel_dir_key))?;
if val == RelDirExists::Exists {
Err(WalIngestErrorKind::RelationAlreadyExists(rel))?;
}
self.put(
sparse_rel_dir_key,
Value::Image(RelDirExists::Exists.encode()),
);
if !dbdir_exists {
self.pending_directory_entries
.push((DirectoryKind::RelV2, MetricsUpdate::Set(0)));
}
self.pending_directory_entries
.push((DirectoryKind::RelV2, MetricsUpdate::Add(1)));
Ok(())
}
/// Create a relation fork.
///
/// 'nblocks' is the initial size.
@@ -2126,66 +2408,31 @@ impl DatadirModification<'_> {
true
};
let rel_dir_key = rel_dir_to_key(rel.spcnode, rel.dbnode);
let mut rel_dir = if !dbdir_exists {
// Create the RelDirectory
RelDirectory::default()
} else {
// reldir already exists, fetch it
RelDirectory::des(&self.get(rel_dir_key, ctx).await?)?
};
let v2_enabled = self
.maybe_enable_rel_size_v2()
let mut v2_mode = self
.maybe_enable_rel_size_v2(true)
.map_err(WalIngestErrorKind::MaybeRelSizeV2Error)?;
if v2_enabled {
if rel_dir.rels.contains(&(rel.relnode, rel.forknum)) {
Err(WalIngestErrorKind::RelationAlreadyExists(rel))?;
if v2_mode.initialize {
if let Err(e) = self.initialize_rel_size_v2_keyspace(ctx, &dbdir).await {
tracing::warn!("error initializing rel_size_v2 keyspace: {}", e);
// TODO: circuit breaker so that it won't retry forever
} else {
v2_mode.current_status = RelSizeMigration::Migrating;
}
let sparse_rel_dir_key =
rel_tag_sparse_key(rel.spcnode, rel.dbnode, rel.relnode, rel.forknum);
// check if the rel_dir_key exists in v2
let val = self.sparse_get(sparse_rel_dir_key, ctx).await?;
let val = RelDirExists::decode_option(val)
.map_err(|_| WalIngestErrorKind::InvalidRelDirKey(sparse_rel_dir_key))?;
if val == RelDirExists::Exists {
Err(WalIngestErrorKind::RelationAlreadyExists(rel))?;
}
if v2_mode.current_status != RelSizeMigration::Migrated {
self.put_rel_creation_v1(rel, dbdir_exists, ctx).await?;
}
if v2_mode.current_status != RelSizeMigration::Legacy {
let write_v2_res = self.put_rel_creation_v2(rel, dbdir_exists, ctx).await;
if let Err(e) = write_v2_res {
if v2_mode.current_status == RelSizeMigration::Migrated {
return Err(e);
}
tracing::warn!("error writing rel_size_v2 keyspace: {}", e);
}
self.put(
sparse_rel_dir_key,
Value::Image(RelDirExists::Exists.encode()),
);
if !dbdir_exists {
self.pending_directory_entries
.push((DirectoryKind::Rel, MetricsUpdate::Set(0)));
self.pending_directory_entries
.push((DirectoryKind::RelV2, MetricsUpdate::Set(0)));
// We don't write `rel_dir_key -> rel_dir.rels` back to the storage in the v2 path unless it's the initial creation.
// TODO: if we have fully migrated to v2, no need to create this directory. Otherwise, there
// will be key not found errors if we don't create an empty one for rel_size_v2.
self.put(
rel_dir_key,
Value::Image(Bytes::from(RelDirectory::ser(&RelDirectory::default())?)),
);
}
self.pending_directory_entries
.push((DirectoryKind::RelV2, MetricsUpdate::Add(1)));
} else {
// Add the new relation to the rel directory entry, and write it back
if !rel_dir.rels.insert((rel.relnode, rel.forknum)) {
Err(WalIngestErrorKind::RelationAlreadyExists(rel))?;
}
if !dbdir_exists {
self.pending_directory_entries
.push((DirectoryKind::Rel, MetricsUpdate::Set(0)))
}
self.pending_directory_entries
.push((DirectoryKind::Rel, MetricsUpdate::Add(1)));
self.put(
rel_dir_key,
Value::Image(Bytes::from(RelDirectory::ser(&rel_dir)?)),
);
}
// Put size
@@ -2260,15 +2507,12 @@ impl DatadirModification<'_> {
Ok(())
}
/// Drop some relations
pub(crate) async fn put_rel_drops(
async fn put_rel_drop_v1(
&mut self,
drop_relations: HashMap<(u32, u32), Vec<RelTag>>,
ctx: &RequestContext,
) -> Result<(), WalIngestError> {
let v2_enabled = self
.maybe_enable_rel_size_v2()
.map_err(WalIngestErrorKind::MaybeRelSizeV2Error)?;
) -> Result<BTreeSet<RelTag>, WalIngestError> {
let mut dropped_rels = BTreeSet::new();
for ((spc_node, db_node), rel_tags) in drop_relations {
let dir_key = rel_dir_to_key(spc_node, db_node);
let buf = self.get(dir_key, ctx).await?;
@@ -2280,25 +2524,8 @@ impl DatadirModification<'_> {
self.pending_directory_entries
.push((DirectoryKind::Rel, MetricsUpdate::Sub(1)));
dirty = true;
dropped_rels.insert(rel_tag);
true
} else if v2_enabled {
// The rel is not found in the old reldir key, so we need to check the new sparse keyspace.
// Note that a relation can only exist in one of the two keyspaces (guaranteed by the ingestion
// logic).
let key =
rel_tag_sparse_key(spc_node, db_node, rel_tag.relnode, rel_tag.forknum);
let val = RelDirExists::decode_option(self.sparse_get(key, ctx).await?)
.map_err(|_| WalIngestErrorKind::InvalidKey(key, self.lsn))?;
if val == RelDirExists::Exists {
self.pending_directory_entries
.push((DirectoryKind::RelV2, MetricsUpdate::Sub(1)));
// put tombstone
self.put(key, Value::Image(RelDirExists::Removed.encode()));
// no need to set dirty to true
true
} else {
false
}
} else {
false
};
@@ -2321,7 +2548,67 @@ impl DatadirModification<'_> {
self.put(dir_key, Value::Image(Bytes::from(RelDirectory::ser(&dir)?)));
}
}
Ok(dropped_rels)
}
async fn put_rel_drop_v2(
&mut self,
drop_relations: HashMap<(u32, u32), Vec<RelTag>>,
ctx: &RequestContext,
) -> Result<BTreeSet<RelTag>, WalIngestError> {
let mut dropped_rels = BTreeSet::new();
for ((spc_node, db_node), rel_tags) in drop_relations {
for rel_tag in rel_tags {
let key = rel_tag_sparse_key(spc_node, db_node, rel_tag.relnode, rel_tag.forknum);
let val = RelDirExists::decode_option(self.sparse_get(key, ctx).await?)
.map_err(|_| WalIngestErrorKind::InvalidKey(key, self.lsn))?;
if val == RelDirExists::Exists {
dropped_rels.insert(rel_tag);
self.pending_directory_entries
.push((DirectoryKind::RelV2, MetricsUpdate::Sub(1)));
// put tombstone
self.put(key, Value::Image(RelDirExists::Removed.encode()));
}
}
}
Ok(dropped_rels)
}
/// Drop some relations
pub(crate) async fn put_rel_drops(
&mut self,
drop_relations: HashMap<(u32, u32), Vec<RelTag>>,
ctx: &RequestContext,
) -> Result<(), WalIngestError> {
let v2_mode = self
.maybe_enable_rel_size_v2(false)
.map_err(WalIngestErrorKind::MaybeRelSizeV2Error)?;
match v2_mode.current_status {
RelSizeMigration::Legacy => {
self.put_rel_drop_v1(drop_relations, ctx).await?;
}
RelSizeMigration::Migrating => {
let dropped_rels_v1 = self.put_rel_drop_v1(drop_relations.clone(), ctx).await?;
let dropped_rels_v2_res = self.put_rel_drop_v2(drop_relations, ctx).await;
match dropped_rels_v2_res {
Ok(dropped_rels_v2) => {
if dropped_rels_v1 != dropped_rels_v2 {
tracing::warn!(
"inconsistent v1/v2 rel drop: dropped_rels_v1.len()={}, dropped_rels_v2.len()={}",
dropped_rels_v1.len(),
dropped_rels_v2.len()
);
}
}
Err(e) => {
tracing::warn!("error dropping rels: {}", e);
}
}
}
RelSizeMigration::Migrated => {
self.put_rel_drop_v2(drop_relations, ctx).await?;
}
}
Ok(())
}

View File

@@ -1205,6 +1205,7 @@ impl TenantShard {
idempotency.clone(),
index_part.gc_compaction.clone(),
index_part.rel_size_migration.clone(),
index_part.rel_size_migrated_at,
ctx,
)?;
let disk_consistent_lsn = timeline.get_disk_consistent_lsn();
@@ -2584,6 +2585,7 @@ impl TenantShard {
initdb_lsn,
None,
None,
None,
ctx,
)
.await
@@ -2913,6 +2915,7 @@ impl TenantShard {
initdb_lsn,
None,
None,
None,
ctx,
)
.await
@@ -4342,6 +4345,7 @@ impl TenantShard {
create_idempotency: CreateTimelineIdempotency,
gc_compaction_state: Option<GcCompactionState>,
rel_size_v2_status: Option<RelSizeMigration>,
rel_size_migrated_at: Option<Lsn>,
ctx: &RequestContext,
) -> anyhow::Result<(Arc<Timeline>, RequestContext)> {
let state = match cause {
@@ -4376,6 +4380,7 @@ impl TenantShard {
create_idempotency,
gc_compaction_state,
rel_size_v2_status,
rel_size_migrated_at,
self.cancel.child_token(),
);
@@ -5085,6 +5090,7 @@ impl TenantShard {
src_timeline.pg_version,
);
let (rel_size_v2_status, rel_size_migrated_at) = src_timeline.get_rel_size_v2_status();
let (uninitialized_timeline, _timeline_ctx) = self
.prepare_new_timeline(
dst_id,
@@ -5092,7 +5098,8 @@ impl TenantShard {
timeline_create_guard,
start_lsn + 1,
Some(Arc::clone(src_timeline)),
Some(src_timeline.get_rel_size_v2_status()),
Some(rel_size_v2_status),
rel_size_migrated_at,
ctx,
)
.await?;
@@ -5379,6 +5386,7 @@ impl TenantShard {
pgdata_lsn,
None,
None,
None,
ctx,
)
.await?;
@@ -5462,14 +5470,17 @@ impl TenantShard {
start_lsn: Lsn,
ancestor: Option<Arc<Timeline>>,
rel_size_v2_status: Option<RelSizeMigration>,
rel_size_migrated_at: Option<Lsn>,
ctx: &RequestContext,
) -> anyhow::Result<(UninitializedTimeline<'a>, RequestContext)> {
let tenant_shard_id = self.tenant_shard_id;
let resources = self.build_timeline_resources(new_timeline_id);
resources
.remote_client
.init_upload_queue_for_empty_remote(new_metadata, rel_size_v2_status.clone())?;
resources.remote_client.init_upload_queue_for_empty_remote(
new_metadata,
rel_size_v2_status.clone(),
rel_size_migrated_at,
)?;
let (timeline_struct, timeline_ctx) = self
.create_timeline_struct(
@@ -5482,6 +5493,7 @@ impl TenantShard {
create_guard.idempotency.clone(),
None,
rel_size_v2_status,
rel_size_migrated_at,
ctx,
)
.context("Failed to create timeline data structure")?;

View File

@@ -826,18 +826,6 @@ impl TenantManager {
peek_slot.is_some()
}
/// Returns whether a local slot exists for a child shard of the given tenant and shard count.
/// Note that this just checks for a shard with a larger shard count, and it may not be a
/// direct child of the given shard.
pub(crate) fn has_child_shard(&self, tenant_id: TenantId, shard_index: ShardIndex) -> bool {
match &*self.tenants.read().unwrap() {
TenantsMap::Initializing => false,
TenantsMap::Open(slots) | TenantsMap::ShuttingDown(slots) => slots
.range(TenantShardId::tenant_range(tenant_id))
.any(|(tsid, _)| tsid.shard_count > shard_index.shard_count),
}
}
#[instrument(skip_all, fields(tenant_id=%tenant_shard_id.tenant_id, shard_id=%tenant_shard_id.shard_slug()))]
pub(crate) async fn upsert_location(
&self,

View File

@@ -443,7 +443,8 @@ impl RemoteTimelineClient {
pub fn init_upload_queue_for_empty_remote(
&self,
local_metadata: &TimelineMetadata,
rel_size_v2_status: Option<RelSizeMigration>,
rel_size_v2_migration: Option<RelSizeMigration>,
rel_size_migrated_at: Option<Lsn>,
) -> anyhow::Result<()> {
// Set the maximum number of inprogress tasks to the remote storage concurrency. There's
// certainly no point in starting more upload tasks than this.
@@ -455,7 +456,8 @@ impl RemoteTimelineClient {
let mut upload_queue = self.upload_queue.lock().unwrap();
let initialized_queue =
upload_queue.initialize_empty_remote(local_metadata, inprogress_limit)?;
initialized_queue.dirty.rel_size_migration = rel_size_v2_status;
initialized_queue.dirty.rel_size_migration = rel_size_v2_migration;
initialized_queue.dirty.rel_size_migrated_at = rel_size_migrated_at;
self.update_remote_physical_size_gauge(None);
info!("initialized upload queue as empty");
Ok(())
@@ -994,10 +996,12 @@ impl RemoteTimelineClient {
pub(crate) fn schedule_index_upload_for_rel_size_v2_status_update(
self: &Arc<Self>,
rel_size_v2_status: RelSizeMigration,
rel_size_migrated_at: Option<Lsn>,
) -> anyhow::Result<()> {
let mut guard = self.upload_queue.lock().unwrap();
let upload_queue = guard.initialized_mut()?;
upload_queue.dirty.rel_size_migration = Some(rel_size_v2_status);
upload_queue.dirty.rel_size_migrated_at = rel_size_migrated_at;
// TODO: allow this operation to bypass the validation check because we might upload the index part
// with no layers but the flag updated. For now, we just modify the index part in memory and the next
// upload will include the flag.

View File

@@ -114,6 +114,11 @@ pub struct IndexPart {
/// The timestamp when the timeline was marked invisible in synthetic size calculations.
#[serde(skip_serializing_if = "Option::is_none", default)]
pub(crate) marked_invisible_at: Option<NaiveDateTime>,
/// The LSN at which we started the rel size migration. Accesses below this LSN should be
/// processed with the v1 read path. Usually this LSN should be set together with `rel_size_migration`.
#[serde(skip_serializing_if = "Option::is_none", default)]
pub(crate) rel_size_migrated_at: Option<Lsn>,
}
#[derive(Debug, PartialEq, Eq, Clone, Serialize, Deserialize)]
@@ -142,10 +147,12 @@ impl IndexPart {
/// - 12: +l2_lsn
/// - 13: +gc_compaction
/// - 14: +marked_invisible_at
const LATEST_VERSION: usize = 14;
/// - 15: +rel_size_migrated_at
const LATEST_VERSION: usize = 15;
// Versions we may see when reading from a bucket.
pub const KNOWN_VERSIONS: &'static [usize] = &[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14];
pub const KNOWN_VERSIONS: &'static [usize] =
&[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];
pub const FILE_NAME: &'static str = "index_part.json";
@@ -165,6 +172,7 @@ impl IndexPart {
l2_lsn: None,
gc_compaction: None,
marked_invisible_at: None,
rel_size_migrated_at: None,
}
}
@@ -475,6 +483,7 @@ mod tests {
l2_lsn: None,
gc_compaction: None,
marked_invisible_at: None,
rel_size_migrated_at: None,
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();
@@ -524,6 +533,7 @@ mod tests {
l2_lsn: None,
gc_compaction: None,
marked_invisible_at: None,
rel_size_migrated_at: None,
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();
@@ -574,6 +584,7 @@ mod tests {
l2_lsn: None,
gc_compaction: None,
marked_invisible_at: None,
rel_size_migrated_at: None,
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();
@@ -627,6 +638,7 @@ mod tests {
l2_lsn: None,
gc_compaction: None,
marked_invisible_at: None,
rel_size_migrated_at: None,
};
let empty_layers_parsed = IndexPart::from_json_bytes(empty_layers_json.as_bytes()).unwrap();
@@ -675,6 +687,7 @@ mod tests {
l2_lsn: None,
gc_compaction: None,
marked_invisible_at: None,
rel_size_migrated_at: None,
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();
@@ -726,6 +739,7 @@ mod tests {
l2_lsn: None,
gc_compaction: None,
marked_invisible_at: None,
rel_size_migrated_at: None,
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();
@@ -782,6 +796,7 @@ mod tests {
l2_lsn: None,
gc_compaction: None,
marked_invisible_at: None,
rel_size_migrated_at: None,
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();
@@ -843,6 +858,7 @@ mod tests {
l2_lsn: None,
gc_compaction: None,
marked_invisible_at: None,
rel_size_migrated_at: None,
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();
@@ -905,6 +921,7 @@ mod tests {
l2_lsn: None,
gc_compaction: None,
marked_invisible_at: None,
rel_size_migrated_at: None,
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();
@@ -972,6 +989,7 @@ mod tests {
l2_lsn: None,
gc_compaction: None,
marked_invisible_at: None,
rel_size_migrated_at: None,
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();
@@ -1052,6 +1070,7 @@ mod tests {
l2_lsn: None,
gc_compaction: None,
marked_invisible_at: None,
rel_size_migrated_at: None,
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();
@@ -1133,6 +1152,7 @@ mod tests {
l2_lsn: None,
gc_compaction: None,
marked_invisible_at: None,
rel_size_migrated_at: None,
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();
@@ -1220,6 +1240,7 @@ mod tests {
last_completed_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
}),
marked_invisible_at: None,
rel_size_migrated_at: None,
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();
@@ -1308,6 +1329,97 @@ mod tests {
last_completed_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
}),
marked_invisible_at: Some(parse_naive_datetime("2023-07-31T09:00:00.123000000")),
rel_size_migrated_at: None,
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();
assert_eq!(part, expected);
}
#[test]
fn v15_rel_size_migrated_at_is_parsed() {
let example = r#"{
"version": 15,
"layer_metadata":{
"000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9": { "file_size": 25600000 },
"000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51": { "file_size": 9007199254741001 }
},
"disk_consistent_lsn":"0/16960E8",
"metadata": {
"disk_consistent_lsn": "0/16960E8",
"prev_record_lsn": "0/1696070",
"ancestor_timeline": "e45a7f37d3ee2ff17dc14bf4f4e3f52e",
"ancestor_lsn": "0/0",
"latest_gc_cutoff_lsn": "0/1696070",
"initdb_lsn": "0/1696070",
"pg_version": 14
},
"gc_blocking": {
"started_at": "2024-07-19T09:00:00.123",
"reasons": ["DetachAncestor"]
},
"import_pgdata": {
"V1": {
"Done": {
"idempotency_key": "specified-by-client-218a5213-5044-4562-a28d-d024c5f057f5",
"started_at": "2024-11-13T09:23:42.123",
"finished_at": "2024-11-13T09:42:23.123"
}
}
},
"rel_size_migration": "legacy",
"l2_lsn": "0/16960E8",
"gc_compaction": {
"last_completed_lsn": "0/16960E8"
},
"marked_invisible_at": "2023-07-31T09:00:00.123",
"rel_size_migrated_at": "0/16960E8"
}"#;
let expected = IndexPart {
version: 15,
layer_metadata: HashMap::from([
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9".parse().unwrap(), LayerFileMetadata {
file_size: 25600000,
generation: Generation::none(),
shard: ShardIndex::unsharded()
}),
("000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016B59D8-00000000016B5A51".parse().unwrap(), LayerFileMetadata {
file_size: 9007199254741001,
generation: Generation::none(),
shard: ShardIndex::unsharded()
})
]),
disk_consistent_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
metadata: TimelineMetadata::new(
Lsn::from_str("0/16960E8").unwrap(),
Some(Lsn::from_str("0/1696070").unwrap()),
Some(TimelineId::from_str("e45a7f37d3ee2ff17dc14bf4f4e3f52e").unwrap()),
Lsn::INVALID,
Lsn::from_str("0/1696070").unwrap(),
Lsn::from_str("0/1696070").unwrap(),
PgMajorVersion::PG14,
).with_recalculated_checksum().unwrap(),
deleted_at: None,
lineage: Default::default(),
gc_blocking: Some(GcBlocking {
started_at: parse_naive_datetime("2024-07-19T09:00:00.123000000"),
reasons: enumset::EnumSet::from_iter([GcBlockingReason::DetachAncestor]),
}),
last_aux_file_policy: Default::default(),
archived_at: None,
import_pgdata: Some(import_pgdata::index_part_format::Root::V1(import_pgdata::index_part_format::V1::Done(import_pgdata::index_part_format::Done{
started_at: parse_naive_datetime("2024-11-13T09:23:42.123000000"),
finished_at: parse_naive_datetime("2024-11-13T09:42:23.123000000"),
idempotency_key: import_pgdata::index_part_format::IdempotencyKey::new("specified-by-client-218a5213-5044-4562-a28d-d024c5f057f5".to_string()),
}))),
rel_size_migration: Some(RelSizeMigration::Legacy),
l2_lsn: Some("0/16960E8".parse::<Lsn>().unwrap()),
gc_compaction: Some(GcCompactionState {
last_completed_lsn: "0/16960E8".parse::<Lsn>().unwrap(),
}),
marked_invisible_at: Some(parse_naive_datetime("2023-07-31T09:00:00.123000000")),
rel_size_migrated_at: Some("0/16960E8".parse::<Lsn>().unwrap()),
};
let part = IndexPart::from_json_bytes(example.as_bytes()).unwrap();

View File

@@ -70,7 +70,7 @@ use tracing::*;
use utils::generation::Generation;
use utils::guard_arc_swap::GuardArcSwap;
use utils::id::TimelineId;
use utils::logging::{MonitorSlowFutureCallback, monitor_slow_future};
use utils::logging::{MonitorSlowFutureCallback, log_slow, monitor_slow_future};
use utils::lsn::{AtomicLsn, Lsn, RecordLsn};
use utils::postgres_client::PostgresClientProtocol;
use utils::rate_limit::RateLimit;
@@ -287,7 +287,7 @@ pub struct Timeline {
ancestor_lsn: Lsn,
// The LSN of gc-compaction that was last applied to this timeline.
gc_compaction_state: ArcSwap<Option<GcCompactionState>>,
gc_compaction_state: ArcSwapOption<GcCompactionState>,
pub(crate) metrics: Arc<TimelineMetrics>,
@@ -441,7 +441,7 @@ pub struct Timeline {
/// heatmap on demand.
heatmap_layers_downloader: Mutex<Option<heatmap_layers_downloader::HeatmapLayersDownloader>>,
pub(crate) rel_size_v2_status: ArcSwapOption<RelSizeMigration>,
pub(crate) rel_size_v2_status: ArcSwap<(Option<RelSizeMigration>, Option<Lsn>)>,
wait_lsn_log_slow: tokio::sync::Semaphore,
@@ -450,6 +450,9 @@ pub struct Timeline {
#[expect(dead_code)]
feature_resolver: Arc<TenantFeatureResolver>,
/// Basebackup will collect the count and store it here. Used for reldirv2 rollout.
pub(crate) db_rel_count: ArcSwapOption<(usize, usize)>,
}
pub(crate) enum PreviousHeatmap {
@@ -2891,12 +2894,9 @@ impl Timeline {
.unwrap_or(self.conf.default_tenant_conf.rel_size_v2_enabled)
}
pub(crate) fn get_rel_size_v2_status(&self) -> RelSizeMigration {
self.rel_size_v2_status
.load()
.as_ref()
.map(|s| s.as_ref().clone())
.unwrap_or(RelSizeMigration::Legacy)
pub(crate) fn get_rel_size_v2_status(&self) -> (RelSizeMigration, Option<Lsn>) {
let (status, migrated_at) = self.rel_size_v2_status.load().as_ref().clone();
(status.unwrap_or(RelSizeMigration::Legacy), migrated_at)
}
fn get_compaction_upper_limit(&self) -> usize {
@@ -3171,6 +3171,7 @@ impl Timeline {
create_idempotency: crate::tenant::CreateTimelineIdempotency,
gc_compaction_state: Option<GcCompactionState>,
rel_size_v2_status: Option<RelSizeMigration>,
rel_size_migrated_at: Option<Lsn>,
cancel: CancellationToken,
) -> Arc<Self> {
let disk_consistent_lsn = metadata.disk_consistent_lsn();
@@ -3237,7 +3238,7 @@ impl Timeline {
}),
disk_consistent_lsn: AtomicLsn::new(disk_consistent_lsn.0),
gc_compaction_state: ArcSwap::new(Arc::new(gc_compaction_state)),
gc_compaction_state: ArcSwapOption::from_pointee(gc_compaction_state),
last_freeze_at: AtomicLsn::new(disk_consistent_lsn.0),
last_freeze_ts: RwLock::new(Instant::now()),
@@ -3335,13 +3336,18 @@ impl Timeline {
heatmap_layers_downloader: Mutex::new(None),
rel_size_v2_status: ArcSwapOption::from_pointee(rel_size_v2_status),
rel_size_v2_status: ArcSwap::from_pointee((
rel_size_v2_status,
rel_size_migrated_at,
)),
wait_lsn_log_slow: tokio::sync::Semaphore::new(1),
basebackup_cache: resources.basebackup_cache,
feature_resolver: resources.feature_resolver.clone(),
db_rel_count: ArcSwapOption::from_pointee(None),
};
result.repartition_threshold =
@@ -3413,7 +3419,7 @@ impl Timeline {
gc_compaction_state: GcCompactionState,
) -> anyhow::Result<()> {
self.gc_compaction_state
.store(Arc::new(Some(gc_compaction_state.clone())));
.store(Some(Arc::new(gc_compaction_state.clone())));
self.remote_client
.schedule_index_upload_for_gc_compaction_state_update(gc_compaction_state)
}
@@ -3421,15 +3427,24 @@ impl Timeline {
pub(crate) fn update_rel_size_v2_status(
&self,
rel_size_v2_status: RelSizeMigration,
rel_size_migrated_at: Option<Lsn>,
) -> anyhow::Result<()> {
self.rel_size_v2_status
.store(Some(Arc::new(rel_size_v2_status.clone())));
self.rel_size_v2_status.store(Arc::new((
Some(rel_size_v2_status.clone()),
rel_size_migrated_at,
)));
self.remote_client
.schedule_index_upload_for_rel_size_v2_status_update(rel_size_v2_status)
.schedule_index_upload_for_rel_size_v2_status_update(
rel_size_v2_status,
rel_size_migrated_at,
)
}
pub(crate) fn get_gc_compaction_state(&self) -> Option<GcCompactionState> {
self.gc_compaction_state.load_full().as_ref().clone()
self.gc_compaction_state
.load()
.as_ref()
.map(|x| x.as_ref().clone())
}
/// Creates and starts the wal receiver.
@@ -6883,7 +6898,13 @@ impl Timeline {
write_guard.store_and_unlock(new_gc_cutoff)
};
waitlist.wait().await;
let waitlist_wait_fut = std::pin::pin!(waitlist.wait());
log_slow(
"applied_gc_cutoff waitlist wait",
Duration::from_secs(30),
waitlist_wait_fut,
)
.await;
info!("GC starting");

View File

@@ -332,6 +332,7 @@ impl DeleteTimelineFlow {
crate::tenant::CreateTimelineIdempotency::FailWithConflict, // doesn't matter what we put here
None, // doesn't matter what we put here
None, // doesn't matter what we put here
None, // doesn't matter what we put here
ctx,
)
.context("create_timeline_struct")?;

View File

@@ -362,7 +362,7 @@ impl<T: Types> Cache<T> {
tokio::time::sleep(RETRY_BACKOFF).await;
continue;
} else {
tracing::warn!(
tracing::info!(
"Failed to resolve tenant shard after {} attempts: {:?}",
GET_MAX_RETRIES,
e

View File

@@ -52,7 +52,7 @@ pub(crate) fn regenerate(
};
// Express a static value for how many shards we may schedule on one node
const MAX_SHARDS: u32 = 5000;
const MAX_SHARDS: u32 = 2500;
let mut doc = PageserverUtilization {
disk_usage_bytes: used,

View File

@@ -5,12 +5,10 @@ MODULE_big = neon
OBJS = \
$(WIN32RES) \
communicator.o \
communicator_new.o \
communicator_process.o \
extension_server.o \
file_cache.o \
hll.o \
lfc_prewarm.o \
libpagestore.o \
logical_replication_monitor.o \
neon.o \
@@ -35,6 +33,10 @@ SHLIB_LINK = -lcurl
UNAME_S := $(shell uname -s)
ifeq ($(UNAME_S), Darwin)
SHLIB_LINK += -framework Security -framework CoreFoundation -framework SystemConfiguration
# Link against object files for the current macOS version, to avoid spurious linker warnings.
MACOSX_DEPLOYMENT_TARGET := $(shell xcrun --sdk macosx --show-sdk-version)
export MACOSX_DEPLOYMENT_TARGET
endif
EXTENSION = neon
@@ -65,7 +67,6 @@ WALPROP_OBJS = \
# libcommunicator.a is built by cargo from the Rust sources under communicator/
# subdirectory. `cargo build` also generates communicator_bindings.h.
communicator_new.o: communicator/communicator_bindings.h
communicator_process.o: communicator/communicator_bindings.h
file_cache.o: communicator/communicator_bindings.h

View File

@@ -79,10 +79,6 @@
#include "access/xlogrecovery.h"
#endif
#if PG_VERSION_NUM < 160000
typedef PGAlignedBlock PGIOAlignedBlock;
#endif
#define NEON_PANIC_CONNECTION_STATE(shard_no, elvl, message, ...) \
neon_shard_log(shard_no, elvl, "Broken connection state: " message, \
##__VA_ARGS__)

View File

@@ -1,372 +0,0 @@
# This file is automatically @generated by Cargo.
# It is not intended for manual editing.
version = 4
[[package]]
name = "addr2line"
version = "0.24.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "dfbe277e56a376000877090da837660b4427aad530e3028d44e0bffe4f89a1c1"
dependencies = [
"gimli",
]
[[package]]
name = "adler2"
version = "2.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "512761e0bb2578dd7380c6baaa0f4ce03e84f95e960231d1dec8bf4d7d6e2627"
[[package]]
name = "backtrace"
version = "0.3.74"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8d82cb332cdfaed17ae235a638438ac4d4839913cc2af585c3c6746e8f8bee1a"
dependencies = [
"addr2line",
"cfg-if",
"libc",
"miniz_oxide",
"object",
"rustc-demangle",
"windows-targets",
]
[[package]]
name = "base64"
version = "0.22.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "72b3254f16251a8381aa12e40e3c4d2f0199f8c6508fbecb9d91f575e0fbb8c6"
[[package]]
name = "bytes"
version = "1.10.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d71b6127be86fdcfddb610f7182ac57211d4b18a3e9c82eb2d17662f2227ad6a"
[[package]]
name = "cfg-if"
version = "1.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "baf1de4339761588bc0619e3cbc0120ee582ebb74b53b4efbf79117bd2da40fd"
[[package]]
name = "communicator"
version = "0.1.0"
dependencies = [
"tonic",
]
[[package]]
name = "fnv"
version = "1.0.7"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3f9eec918d3f24069decb9af1554cad7c880e2da24a9afd88aca000531ab82c1"
[[package]]
name = "futures-core"
version = "0.3.31"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "05f29059c0c2090612e8d742178b0580d2dc940c837851ad723096f87af6663e"
[[package]]
name = "gimli"
version = "0.31.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "07e28edb80900c19c28f1072f2e8aeca7fa06b23cd4169cefe1af5aa3260783f"
[[package]]
name = "http"
version = "1.3.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f4a85d31aea989eead29a3aaf9e1115a180df8282431156e533de47660892565"
dependencies = [
"bytes",
"fnv",
"itoa",
]
[[package]]
name = "http-body"
version = "1.0.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1efedce1fb8e6913f23e0c92de8e62cd5b772a67e7b3946df930a62566c93184"
dependencies = [
"bytes",
"http",
]
[[package]]
name = "http-body-util"
version = "0.1.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b021d93e26becf5dc7e1b75b1bed1fd93124b374ceb73f43d4d4eafec896a64a"
dependencies = [
"bytes",
"futures-core",
"http",
"http-body",
"pin-project-lite",
]
[[package]]
name = "itoa"
version = "1.0.15"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "4a5f13b858c8d314ee3e8f639011f7ccefe71f97f96e50151fb991f267928e2c"
[[package]]
name = "libc"
version = "0.2.171"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c19937216e9d3aa9956d9bb8dfc0b0c8beb6058fc4f7a4dc4d850edf86a237d6"
[[package]]
name = "memchr"
version = "2.7.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "78ca9ab1a0babb1e7d5695e3530886289c18cf2f87ec19a575a0abdce112e3a3"
[[package]]
name = "miniz_oxide"
version = "0.8.7"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ff70ce3e48ae43fa075863cef62e8b43b71a4f2382229920e0df362592919430"
dependencies = [
"adler2",
]
[[package]]
name = "object"
version = "0.36.7"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "62948e14d923ea95ea2c7c86c71013138b66525b86bdc08d2dcc262bdb497b87"
dependencies = [
"memchr",
]
[[package]]
name = "once_cell"
version = "1.21.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "42f5e15c9953c5e4ccceeb2e7382a716482c34515315f7b03532b8b4e8393d2d"
[[package]]
name = "percent-encoding"
version = "2.3.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e3148f5046208a5d56bcfc03053e3ca6334e51da8dfb19b6cdc8b306fae3283e"
[[package]]
name = "pin-project"
version = "1.1.10"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "677f1add503faace112b9f1373e43e9e054bfdd22ff1a63c1bc485eaec6a6a8a"
dependencies = [
"pin-project-internal",
]
[[package]]
name = "pin-project-internal"
version = "1.1.10"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6e918e4ff8c4549eb882f14b3a4bc8c8bc93de829416eacf579f1207a8fbf861"
dependencies = [
"proc-macro2",
"quote",
"syn",
]
[[package]]
name = "pin-project-lite"
version = "0.2.16"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3b3cff922bd51709b605d9ead9aa71031d81447142d828eb4a6eba76fe619f9b"
[[package]]
name = "proc-macro2"
version = "1.0.94"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a31971752e70b8b2686d7e46ec17fb38dad4051d94024c88df49b667caea9c84"
dependencies = [
"unicode-ident",
]
[[package]]
name = "quote"
version = "1.0.40"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1885c039570dc00dcb4ff087a89e185fd56bae234ddc7f056a945bf36467248d"
dependencies = [
"proc-macro2",
]
[[package]]
name = "rustc-demangle"
version = "0.1.24"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "719b953e2095829ee67db738b3bfa9fa368c94900df327b3f07fe6e794d2fe1f"
[[package]]
name = "syn"
version = "2.0.100"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b09a44accad81e1ba1cd74a32461ba89dee89095ba17b32f5d03683b1b1fc2a0"
dependencies = [
"proc-macro2",
"quote",
"unicode-ident",
]
[[package]]
name = "tokio"
version = "1.44.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e6b88822cbe49de4185e3a4cbf8321dd487cf5fe0c5c65695fef6346371e9c48"
dependencies = [
"backtrace",
"pin-project-lite",
]
[[package]]
name = "tokio-stream"
version = "0.1.17"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "eca58d7bba4a75707817a2c44174253f9236b2d5fbd055602e9d5c07c139a047"
dependencies = [
"futures-core",
"pin-project-lite",
"tokio",
]
[[package]]
name = "tonic"
version = "0.13.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "85839f0b32fd242bb3209262371d07feda6d780d16ee9d2bc88581b89da1549b"
dependencies = [
"base64",
"bytes",
"http",
"http-body",
"http-body-util",
"percent-encoding",
"pin-project",
"tokio-stream",
"tower-layer",
"tower-service",
"tracing",
]
[[package]]
name = "tower-layer"
version = "0.3.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "121c2a6cda46980bb0fcd1647ffaf6cd3fc79a013de288782836f6df9c48780e"
[[package]]
name = "tower-service"
version = "0.3.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8df9b6e13f2d32c91b9bd719c00d1958837bc7dec474d94952798cc8e69eeec3"
[[package]]
name = "tracing"
version = "0.1.41"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "784e0ac535deb450455cbfa28a6f0df145ea1bb7ae51b821cf5e7927fdcfbdd0"
dependencies = [
"pin-project-lite",
"tracing-attributes",
"tracing-core",
]
[[package]]
name = "tracing-attributes"
version = "0.1.28"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "395ae124c09f9e6918a2310af6038fba074bcf474ac352496d5910dd59a2226d"
dependencies = [
"proc-macro2",
"quote",
"syn",
]
[[package]]
name = "tracing-core"
version = "0.1.33"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e672c95779cf947c5311f83787af4fa8fffd12fb27e4993211a84bdfd9610f9c"
dependencies = [
"once_cell",
]
[[package]]
name = "unicode-ident"
version = "1.0.18"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5a5f39404a5da50712a4c1eecf25e90dd62b613502b7e925fd4e4d19b5c96512"
[[package]]
name = "windows-targets"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9b724f72796e036ab90c1021d4780d4d3d648aca59e491e6b98e725b84e99973"
dependencies = [
"windows_aarch64_gnullvm",
"windows_aarch64_msvc",
"windows_i686_gnu",
"windows_i686_gnullvm",
"windows_i686_msvc",
"windows_x86_64_gnu",
"windows_x86_64_gnullvm",
"windows_x86_64_msvc",
]
[[package]]
name = "windows_aarch64_gnullvm"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "32a4622180e7a0ec044bb555404c800bc9fd9ec262ec147edd5989ccd0c02cd3"
[[package]]
name = "windows_aarch64_msvc"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "09ec2a7bb152e2252b53fa7803150007879548bc709c039df7627cabbd05d469"
[[package]]
name = "windows_i686_gnu"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8e9b5ad5ab802e97eb8e295ac6720e509ee4c243f69d781394014ebfe8bbfa0b"
[[package]]
name = "windows_i686_gnullvm"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0eee52d38c090b3caa76c563b86c3a4bd71ef1a819287c19d586d7334ae8ed66"
[[package]]
name = "windows_i686_msvc"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "240948bc05c5e7c6dabba28bf89d89ffce3e303022809e73deaefe4f6ec56c66"
[[package]]
name = "windows_x86_64_gnu"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "147a5c80aabfbf0c7d901cb5895d1de30ef2907eb21fbbab29ca94c5b08b1a78"
[[package]]
name = "windows_x86_64_gnullvm"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "24d5b23dc417412679681396f2b49f3de8c1473deb516bd34410872eff51ed0d"
[[package]]
name = "windows_x86_64_msvc"
version = "0.52.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "589f6da84c646204747d1270a2a5661ea66ed1cced2631d546fdfb155959f9ec"

View File

@@ -1,8 +1,12 @@
[package]
name = "communicator"
version = "0.1.0"
license.workspace = true
edition.workspace = true
[lib]
crate-type = ["staticlib"]
[features]
# 'testing' feature is currently unused in the communicator, but we accept it for convenience of
# calling build scripts, so that you can pass the same feature to all packages.
@@ -11,36 +15,14 @@ testing = []
# calling build scripts, so that you can pass the same feature to all packages.
rest_broker = []
[lib]
crate-type = ["staticlib"]
[dependencies]
axum.workspace = true
bytes.workspace = true
clashmap.workspace = true
http.workspace = true
libc.workspace = true
nix.workspace = true
atomic_enum = "0.3.0"
measured.workspace = true
prometheus.workspace = true
prost.workspace = true
strum_macros.workspace = true
thiserror.workspace = true
tonic = { workspace = true, default-features = false, features=["codegen", "prost", "transport"] }
tokio = { workspace = true, features = ["macros", "net", "io-util", "rt", "rt-multi-thread"] }
tokio-pipe = { version = "0.2.12" }
tracing.workspace = true
tracing-subscriber.workspace = true
metrics.workspace = true
uring-common = { workspace = true, features = ["bytes"] }
pageserver_client_grpc.workspace = true
pageserver_api.workspace = true
pageserver_page_api.workspace = true
neon-shmem.workspace = true
measured.workspace = true
utils.workspace = true
workspace_hack = { version = "0.1", path = "../../../workspace_hack" }

View File

@@ -3,18 +3,9 @@
This package provides the so-called "compute-pageserver communicator",
or just "communicator" in short. The communicator is a separate
background worker process that runs in the PostgreSQL server. It's
part of the neon extension.
The commuicator handles the communication with the pageservers, and
also provides an HTTP endpoint for metrics over a local Unix Domain
socket (aka. the "communicator control socket"). On the PostgreSQL
side, the glue code in pgxn/neon/ uses the communicator to implement
the PostgreSQL Storage Manager (SMGR) interface.
## Design criteria
- Low latency
- Saturate a 10 Gbit / s network interface without becoming a bottleneck
part of the neon extension. Currently, it only provides an HTTP
endpoint for metrics, but in the future it will evolve to handle all
communications with the pageservers.
## Source code view
@@ -23,122 +14,10 @@ pgxn/neon/communicator_process.c
the glue that interacts with PostgreSQL code and the Rust
code in the communicator process.
pgxn/neon/communicator_new.c
Contains the backend code that interacts with the communicator
process.
pgxn/neon/communicator/src/backend_interface.rs
The entry point for calls from each backend.
pgxn/neon/communicator/src/init.rs
Initialization at server startup
pgxn/neon/communicator/src/worker_process/
Worker process main loop and glue code
At compilation time, pgxn/neon/communicator/ produces a static
library, libcommunicator.a. It is linked to the neon.so extension
library.
The real networking code, which is independent of PostgreSQL, is in
the pageserver/client_grpc crate.
## Process view
The communicator runs in a dedicated background worker process, the
"communicator process". The communicator uses a multi-threaded Tokio
runtime to execute the IO requests. So the communicator process has
multiple threads running. That's unusual for Postgres processes and
care must be taken to make that work.
### Backend <-> worker communication
Each backend has a number of I/O request slots in shared memory. The
slots are statically allocated for each backend, and must not be
accessed by other backends. The worker process reads requests from the
shared memory slots, and writes responses back to the slots.
Here's an example snapshot of the system, when two requests from two
different backends are in progress:
```
Backends Request slots Communicator process
--------- ------------- --------------------
Backend 1 1: Idle
2: Idle
3: Processing tokio task handling request 3
Backend 2 4: Completed
5: Processing tokio task handling request 5
6: Idle
... ...
```
To submit an IO request, the backend first picks one of its Idle
slots, writes the IO request in the slot, and updates it to
'Submitted' state. That transfers the ownership of the slot to the
worker process, until the worker process marks the request as
Completed. The worker process spawns a separate Tokio task for each
request.
To inform the worker process that a request slot has a pending IO
request, there's a pipe shared by the worker process and all backend
processes. The backend writes the index of the request slot to the
pipe after changing the slot's state to Submitted. This wakes up the
worker process.
(Note that the pipe is just used for wakeups, but the worker process
is free to pick up Submitted IO requests even without receiving the
wakeup. As of this writing, it doesn't do that, but it might be useful
in the future to reduce latency even further, for example.)
When the worker process has completed processing the request, it
writes the result back in the request slot. A GetPage request can also
contain a pointer to buffer in the shared buffer cache. In that case,
the worker process writes the resulting page contents directly to the
buffer, and just a result code in the request slot. It then updates
the 'state' field to Completed, which passes the owner ship back to
the originating backend. Finally, it signals the process Latch of the
originating backend, waking it up.
### Differences between PostgreSQL v16, v17 and v18
PostgreSQL v18 introduced the new AIO mechanism. The PostgreSQL AIO
mechanism uses a very similar mechanism as described in the previous
section, for the communication between AIO worker processes and
backends. With our communicator, the AIO worker processes are not
used, but we use the same PgAioHandle request slots as in upstream.
For Neon-specific IO requests like GetDbSize, a neon request slot is
used. But for the actual IO requests, the request slot merely contains
a pointer to the PgAioHandle slot. The worker process updates the
status of that, calls the IO callbacks upon completionetc, just like
the upstream AIO worker processes do.
## Sequence diagram
neon
PostgreSQL extension backend_interface.rs worker_process.rs processor tonic
| . . . .
| smgr_read() . . . .
+-------------> + . . .
. | . . .
. | rcommunicator_ . . .
. | get_page_at_lsn . . .
. +------------------> + . .
| . .
| write request to . . .
| slot . .
| . .
| . .
| submit_request() . .
+-----------------> + .
| | .
| | db_size_request . .
+---------------->.
. TODO
### Compute <-> pageserver protocol
The protocol between Compute and the pageserver is based on gRPC. See `protos/`.

View File

@@ -1,224 +0,0 @@
//! This module implements a request/response "slot" for submitting
//! requests from backends to the communicator process.
//!
//! NB: The "backend" side of this code runs in Postgres backend processes,
//! which means that it is not safe to use the 'tracing' crate for logging, nor
//! to launch threads or use tokio tasks!
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicI32, Ordering};
use crate::neon_request::{NeonIORequest, NeonIOResult};
use atomic_enum::atomic_enum;
/// One request/response slot. Each backend has its own set of slots that it
/// uses.
///
/// This is the moral equivalent of PgAioHandle for Postgres AIO requests
/// Like PgAioHandle, try to keep this small.
///
/// There is an array of these in shared memory. Therefore, this must be Sized.
///
/// ## Lifecycle of a request
///
/// A slot is always owned by either the backend process or the communicator
/// process, depending on the 'state'. Only the owning process is allowed to
/// read or modify the slot, except for reading the 'state' itself to check who
/// owns it.
///
/// A slot begins in the Idle state, where it is owned by the backend process.
/// To submit a request, the backend process fills the slot with the request
/// data, and changes it to the Submitted state. After changing the state, the
/// slot is owned by the communicator process, and the backend is not allowed
/// to access it until the communicator process marks it as Completed.
///
/// When the communicator process sees that the slot is in Submitted state, it
/// starts to process the request. After processing the request, it stores the
/// result in the slot, and changes the state to Completed. It is now owned by
/// the backend process again, which may now read the result, and reuse the
/// slot for a new request.
///
/// For correctness of the above protocol, we really only need two states:
/// "owned by backend" and "owned by communicator process". But to help with
/// debugging and better assertions, there are a few more states. When the
/// backend starts to fill in the request details in the slot, it first sets the
/// state from Idle to Filling, and when it's done with that, from Filling to
/// Submitted. In the Filling state, the slot is still owned by the
/// backend. Similarly, when the communicator process starts to process a
/// request, it sets it to Processing state first, but the slot is still owned
/// by the communicator process.
///
/// This struct doesn't handle waking up the communicator process when a request
/// has been submitted or when a response is ready. The 'owner_procno' is used
/// for waking up the backend on completion, but that happens elsewhere.
pub struct NeonIORequestSlot {
/// similar to PgAioHandleState
state: AtomicNeonIORequestSlotState,
/// The owning process's ProcNumber. The worker process uses this to set the
/// process's latch on completion.
///
/// (This could be calculated from num_neon_request_slots_per_backend and
/// the index of this slot in the overall 'neon_requst_slots array'. But we
/// prefer the communicator process to not know how the request slots are
/// divided between the backends.)
owner_procno: AtomicI32,
/// SAFETY: This is modified by submit_request(), after it has established
/// ownership of the slot by setting state from Idle to Filling
request: UnsafeCell<NeonIORequest>,
/// Valid when state is Completed
///
/// SAFETY: This is modified by RequestProcessingGuard::complete(). There
/// can be only one RequestProcessingGuard outstanding for a slot at a time,
/// because it is returned by start_processing_request() which checks the
/// state, so RequestProcessingGuard has exclusive access to the slot.
result: UnsafeCell<NeonIOResult>,
}
// The protocol described in the "Lifecycle of a request" section above ensures
// the safe access to the fields
unsafe impl Send for NeonIORequestSlot {}
unsafe impl Sync for NeonIORequestSlot {}
impl Default for NeonIORequestSlot {
fn default() -> NeonIORequestSlot {
NeonIORequestSlot {
owner_procno: AtomicI32::new(-1),
request: UnsafeCell::new(NeonIORequest::Empty),
result: UnsafeCell::new(NeonIOResult::Empty),
state: AtomicNeonIORequestSlotState::new(NeonIORequestSlotState::Idle),
}
}
}
#[atomic_enum]
#[derive(Eq, PartialEq)]
pub enum NeonIORequestSlotState {
Idle,
/// Backend is filling in the request
Filling,
/// Backend has submitted the request to the communicator, but the
/// communicator process has not yet started processing it.
Submitted,
/// Communicator is processing the request
Processing,
/// Communicator has completed the request, and the 'result' field is now
/// valid, but the backend has not read the result yet.
Completed,
}
impl NeonIORequestSlot {
/// Write a request to the slot, and mark it as Submitted.
///
/// Note: This does not wake up the worker process to actually process
/// the request. It's the caller's responsibility to do that.
pub fn submit_request(&self, request: &NeonIORequest, proc_number: i32) {
// Verify that the slot is in Idle state previously, and put it in
// Filling state.
//
// XXX: This step isn't strictly necessary. Assuming the caller didn't
// screw up and try to use a slot that's already in use, we could fill
// the slot and switch it directly from Idle to Submitted state.
if let Err(s) = self.state.compare_exchange(
NeonIORequestSlotState::Idle,
NeonIORequestSlotState::Filling,
Ordering::Relaxed,
Ordering::Relaxed,
) {
panic!("unexpected state in request slot: {s:?}");
}
// Fill in the request details
self.owner_procno.store(proc_number, Ordering::Relaxed);
unsafe { *self.request.get() = *request }
// This synchronizes-with store/swap in [`start_processing_request`].
// Note that this ensures that the previous non-atomic writes visible
// to other threads too.
self.state
.store(NeonIORequestSlotState::Submitted, Ordering::Release);
}
pub fn get_state(&self) -> NeonIORequestSlotState {
self.state.load(Ordering::Relaxed)
}
pub fn try_get_result(&self) -> Option<NeonIOResult> {
// This synchronizes-with the store/swap in [`RequestProcessingGuard::completed`]
let state = self.state.load(Ordering::Acquire);
if state == NeonIORequestSlotState::Completed {
let result = unsafe { *self.result.get() };
self.state
.store(NeonIORequestSlotState::Idle, Ordering::Relaxed);
Some(result)
} else {
None
}
}
/// Read the IO request from the slot indicated in the wakeup
pub fn start_processing_request<'a>(&'a self) -> Option<RequestProcessingGuard<'a>> {
// XXX: using atomic load rather than compare_exchange would be
// sufficient here, as long as the communicator process has _some_ means
// of tracking which requests it's already processing. That could be a
// flag somewhere in communicator's private memory, for example.
//
// This synchronizes-with the store in [`submit_request`].
if let Err(s) = self.state.compare_exchange(
NeonIORequestSlotState::Submitted,
NeonIORequestSlotState::Processing,
Ordering::Acquire,
Ordering::Relaxed,
) {
// FIXME surprising state. This is unexpected at the moment, but if we
// started to process requests more aggressively, without waiting for the
// read from the pipe, then this could happen
panic!("unexpected state in request slot: {s:?}");
}
Some(RequestProcessingGuard(self))
}
}
/// [`NeonIORequestSlot::start_processing_request`] returns this guard object to
/// indicate that the the caller now "owns" the slot, until it calls
/// [`RequestProcessingGuard::completed`].
///
/// TODO: implement Drop on this, to mark the request as Aborted or Errored
/// if [`RequestProcessingGuard::completed`] is not called.
pub struct RequestProcessingGuard<'a>(&'a NeonIORequestSlot);
unsafe impl<'a> Send for RequestProcessingGuard<'a> {}
unsafe impl<'a> Sync for RequestProcessingGuard<'a> {}
impl<'a> RequestProcessingGuard<'a> {
pub fn get_request(&self) -> &NeonIORequest {
unsafe { &*self.0.request.get() }
}
pub fn get_owner_procno(&self) -> i32 {
self.0.owner_procno.load(Ordering::Relaxed)
}
pub fn completed(self, result: NeonIOResult) {
// Store the result to the slot.
unsafe {
*self.0.result.get() = result;
};
// Mark the request as completed. After that, we no longer have
// ownership of the slot, and must not modify it.
let old_state = self
.0
.state
.swap(NeonIORequestSlotState::Completed, Ordering::Release);
assert!(old_state == NeonIORequestSlotState::Processing);
}
}

View File

@@ -1,283 +0,0 @@
//! This code runs in each backend process. That means that launching Rust threads, panicking
//! etc. is forbidden!
use std::os::fd::OwnedFd;
use crate::backend_comms::NeonIORequestSlot;
use crate::init::CommunicatorInitStruct;
use crate::integrated_cache::{BackendCacheReadOp, IntegratedCacheReadAccess};
use crate::neon_request::{CCachedGetPageVResult, COid};
use crate::neon_request::{NeonIORequest, NeonIOResult};
pub struct CommunicatorBackendStruct<'t> {
my_proc_number: i32,
neon_request_slots: &'t [NeonIORequestSlot],
submission_pipe_write_fd: OwnedFd,
pending_cache_read_op: Option<BackendCacheReadOp<'t>>,
integrated_cache: &'t IntegratedCacheReadAccess<'t>,
}
#[unsafe(no_mangle)]
pub extern "C" fn rcommunicator_backend_init(
cis: Box<CommunicatorInitStruct>,
my_proc_number: i32,
) -> &'static mut CommunicatorBackendStruct<'static> {
if my_proc_number < 0 {
panic!("cannot attach to communicator shared memory with procnumber {my_proc_number}");
}
let integrated_cache = Box::leak(Box::new(cis.integrated_cache_init_struct.backend_init()));
let bs: &'static mut CommunicatorBackendStruct =
Box::leak(Box::new(CommunicatorBackendStruct {
my_proc_number,
neon_request_slots: cis.neon_request_slots,
submission_pipe_write_fd: cis.submission_pipe_write_fd,
pending_cache_read_op: None,
integrated_cache,
}));
bs
}
/// Start a request. You can poll for its completion and get the result by
/// calling bcomm_poll_dbsize_request_completion(). The communicator will wake
/// us up by setting our process latch, so to wait for the completion, wait on
/// the latch and call bcomm_poll_dbsize_request_completion() every time the
/// latch is set.
///
/// Safety: The C caller must ensure that the references are valid.
/// The requested slot must be free, or this panics.
#[unsafe(no_mangle)]
pub extern "C" fn bcomm_start_io_request(
bs: &'_ mut CommunicatorBackendStruct,
slot_idx: i32,
request: &NeonIORequest,
immediate_result_ptr: &mut NeonIOResult,
) -> i32 {
assert!(bs.pending_cache_read_op.is_none());
// Check if the request can be satisfied from the cache first
if let NeonIORequest::RelSize(req) = request {
if let Some(nblocks) = bs.integrated_cache.get_rel_size(&req.reltag()) {
*immediate_result_ptr = NeonIOResult::RelSize(nblocks);
return -1;
}
}
// Create neon request and submit it
bs.start_neon_io_request(slot_idx, request);
slot_idx
}
#[unsafe(no_mangle)]
pub extern "C" fn bcomm_start_get_page_v_request(
bs: &mut CommunicatorBackendStruct,
slot_idx: i32,
request: &NeonIORequest,
immediate_result_ptr: &mut CCachedGetPageVResult,
) -> i32 {
let NeonIORequest::GetPageV(get_pagev_request) = request else {
panic!("invalid request passed to bcomm_start_get_page_v_request()");
};
assert!(matches!(request, NeonIORequest::GetPageV(_)));
assert!(bs.pending_cache_read_op.is_none());
// Check if the request can be satisfied from the cache first
let mut all_cached = true;
let mut read_op = bs.integrated_cache.start_read_op();
for i in 0..get_pagev_request.nblocks {
if let Some(cache_block) = read_op.get_page(
&get_pagev_request.reltag(),
get_pagev_request.block_number + i as u32,
) {
immediate_result_ptr.cache_block_numbers[i as usize] = cache_block;
} else {
// not found in cache
all_cached = false;
break;
}
}
if all_cached {
bs.pending_cache_read_op = Some(read_op);
return -1;
}
// Create neon request and submit it
bs.start_neon_io_request(slot_idx, request);
slot_idx
}
/// Check if a request has completed. Returns:
///
/// -1 if the request is still being processed
/// 0 on success
#[unsafe(no_mangle)]
pub extern "C" fn bcomm_poll_request_completion(
bs: &mut CommunicatorBackendStruct,
request_slot_idx: u32,
result_p: &mut NeonIOResult,
) -> i32 {
match bs.neon_request_slots[request_slot_idx as usize].try_get_result() {
None => -1, // still processing
Some(result) => {
*result_p = result;
0
}
}
}
/// Check if a request has completed. Returns:
///
/// 'false' if the slot is Idle. The backend process has ownership.
/// 'true' if the slot is busy, and should be polled for result.
#[unsafe(no_mangle)]
pub extern "C" fn bcomm_get_request_slot_status(
bs: &mut CommunicatorBackendStruct,
request_slot_idx: u32,
) -> bool {
use crate::backend_comms::NeonIORequestSlotState;
match bs.neon_request_slots[request_slot_idx as usize].get_state() {
NeonIORequestSlotState::Idle => false,
NeonIORequestSlotState::Filling => {
// 'false' would be the right result here. However, this
// is a very transient state. The C code should never
// leave a slot in this state, so if it sees that,
// something's gone wrong and it's not clear what to do
// with it.
panic!("unexpected Filling state in request slot {request_slot_idx}");
}
NeonIORequestSlotState::Submitted => true,
NeonIORequestSlotState::Processing => true,
NeonIORequestSlotState::Completed => true,
}
}
// LFC functions
/// Finish a local file cache read
///
//
#[unsafe(no_mangle)]
pub extern "C" fn bcomm_finish_cache_read(bs: &mut CommunicatorBackendStruct) -> bool {
if let Some(op) = bs.pending_cache_read_op.take() {
op.finish()
} else {
panic!("bcomm_finish_cache_read() called with no cached read pending");
}
}
/// Check if the local file cache contians the given block
#[unsafe(no_mangle)]
pub extern "C" fn bcomm_cache_contains(
bs: &mut CommunicatorBackendStruct,
spc_oid: COid,
db_oid: COid,
rel_number: u32,
fork_number: u8,
block_number: u32,
) -> bool {
bs.integrated_cache.cache_contains_page(
&pageserver_page_api::RelTag {
spcnode: spc_oid,
dbnode: db_oid,
relnode: rel_number,
forknum: fork_number,
},
block_number,
)
}
#[repr(C)]
#[derive(Clone, Debug)]
pub struct FileCacheIterator {
next_bucket: u64,
pub spc_oid: COid,
pub db_oid: COid,
pub rel_number: u32,
pub fork_number: u8,
pub block_number: u32,
}
/// Iterate over LFC contents
#[allow(clippy::missing_safety_doc)]
#[unsafe(no_mangle)]
pub unsafe extern "C" fn bcomm_cache_iterate_begin(
_bs: &mut CommunicatorBackendStruct,
iter: *mut FileCacheIterator,
) {
unsafe { (*iter).next_bucket = 0 };
}
#[allow(clippy::missing_safety_doc)]
#[unsafe(no_mangle)]
pub unsafe extern "C" fn bcomm_cache_iterate_next(
bs: &mut CommunicatorBackendStruct,
iter: *mut FileCacheIterator,
) -> bool {
use crate::integrated_cache::GetBucketResult;
loop {
let next_bucket = unsafe { (*iter).next_bucket } as usize;
match bs.integrated_cache.get_bucket(next_bucket) {
GetBucketResult::Occupied(rel, blk) => {
unsafe {
(*iter).spc_oid = rel.spcnode;
(*iter).db_oid = rel.dbnode;
(*iter).rel_number = rel.relnode;
(*iter).fork_number = rel.forknum;
(*iter).block_number = blk;
(*iter).next_bucket += 1;
}
break true;
}
GetBucketResult::Vacant => {
unsafe {
(*iter).next_bucket += 1;
}
continue;
}
GetBucketResult::OutOfBounds => {
break false;
}
}
}
}
impl<'t> CommunicatorBackendStruct<'t> {
/// The slot must be free, or this panics.
pub(crate) fn start_neon_io_request(&mut self, request_slot_idx: i32, request: &NeonIORequest) {
let my_proc_number = self.my_proc_number;
self.neon_request_slots[request_slot_idx as usize].submit_request(request, my_proc_number);
// Tell the communicator about it
self.notify_about_request(request_slot_idx);
}
/// Send a wakeup to the communicator process
fn notify_about_request(self: &CommunicatorBackendStruct<'t>, request_slot_idx: i32) {
// wake up communicator by writing the idx to the submission pipe
//
// This can block, if the pipe is full. That should be very rare,
// because the communicator tries hard to drain the pipe to prevent
// that. Also, there's a natural upper bound on how many wakeups can be
// queued up: there is only a limited number of request slots for each
// backend.
//
// If it does block very briefly, that's not too serious.
let idxbuf = request_slot_idx.to_ne_bytes();
let _res = nix::unistd::write(&self.submission_pipe_write_fd, &idxbuf);
// FIXME: check result, return any errors
}
}

View File

@@ -1,162 +0,0 @@
//! Implement the "low-level" parts of the file cache.
//!
//! This module just deals with reading and writing the file, and keeping track
//! which blocks in the cache file are in use and which are free. The "high
//! level" parts of tracking which block in the cache file corresponds to which
//! relation block is handled in 'integrated_cache' instead.
//!
//! This module is only used to access the file from the communicator
//! process. The backend processes *also* read the file (and sometimes also
//! write it? ), but the backends use direct C library calls for that.
use std::fs::File;
use std::os::unix::fs::FileExt;
use std::path::Path;
use std::sync::Arc;
use std::sync::Mutex;
use crate::BLCKSZ;
use tokio::task::spawn_blocking;
pub type CacheBlock = u64;
pub const INVALID_CACHE_BLOCK: CacheBlock = u64::MAX;
#[derive(Debug)]
pub struct FileCache {
file: Arc<File>,
free_list: Mutex<FreeList>,
// metrics
max_blocks_gauge: metrics::IntGauge,
num_free_blocks_gauge: metrics::IntGauge,
}
// TODO: We keep track of all free blocks in this vec. That doesn't really scale.
// Idea: when free_blocks fills up with more than 1024 entries, write them all to
// one block on disk.
#[derive(Debug)]
struct FreeList {
next_free_block: CacheBlock,
max_blocks: u64,
free_blocks: Vec<CacheBlock>,
}
impl FileCache {
pub fn new(file_cache_path: &Path, mut initial_size: u64) -> Result<FileCache, std::io::Error> {
if initial_size < 100 {
tracing::warn!(
"min size for file cache is 100 blocks, {} requested",
initial_size
);
initial_size = 100;
}
let file = std::fs::OpenOptions::new()
.read(true)
.write(true)
.truncate(true)
.create(true)
.open(file_cache_path)?;
let max_blocks_gauge = metrics::IntGauge::new(
"file_cache_max_blocks",
"Local File Cache size in 8KiB blocks",
)
.unwrap();
let num_free_blocks_gauge = metrics::IntGauge::new(
"file_cache_num_free_blocks",
"Number of free 8KiB blocks in Local File Cache",
)
.unwrap();
tracing::info!("initialized file cache with {} blocks", initial_size);
Ok(FileCache {
file: Arc::new(file),
free_list: Mutex::new(FreeList {
next_free_block: 0,
max_blocks: initial_size,
free_blocks: Vec::new(),
}),
max_blocks_gauge,
num_free_blocks_gauge,
})
}
// File cache management
pub async fn read_block(
&self,
cache_block: CacheBlock,
mut dst: impl uring_common::buf::IoBufMut + Send + Sync,
) -> Result<(), std::io::Error> {
assert!(dst.bytes_total() == BLCKSZ);
let file = self.file.clone();
let dst_ref = unsafe { std::slice::from_raw_parts_mut(dst.stable_mut_ptr(), BLCKSZ) };
spawn_blocking(move || file.read_exact_at(dst_ref, cache_block * BLCKSZ as u64)).await??;
Ok(())
}
pub async fn write_block(
&self,
cache_block: CacheBlock,
src: impl uring_common::buf::IoBuf + Send + Sync,
) -> Result<(), std::io::Error> {
assert!(src.bytes_init() == BLCKSZ);
let file = self.file.clone();
let src_ref = unsafe { std::slice::from_raw_parts(src.stable_ptr(), BLCKSZ) };
spawn_blocking(move || file.write_all_at(src_ref, cache_block * BLCKSZ as u64)).await??;
Ok(())
}
pub fn alloc_block(&self) -> Option<CacheBlock> {
let mut free_list = self.free_list.lock().unwrap();
if let Some(x) = free_list.free_blocks.pop() {
return Some(x);
}
if free_list.next_free_block < free_list.max_blocks {
let result = free_list.next_free_block;
free_list.next_free_block += 1;
return Some(result);
}
None
}
pub fn dealloc_block(&self, cache_block: CacheBlock) {
let mut free_list = self.free_list.lock().unwrap();
free_list.free_blocks.push(cache_block);
}
}
impl metrics::core::Collector for FileCache {
fn desc(&self) -> Vec<&metrics::core::Desc> {
let mut descs = Vec::new();
descs.append(&mut self.max_blocks_gauge.desc());
descs.append(&mut self.num_free_blocks_gauge.desc());
descs
}
fn collect(&self) -> Vec<metrics::proto::MetricFamily> {
// Update the gauges with fresh values first
{
let free_list = self.free_list.lock().unwrap();
self.max_blocks_gauge.set(free_list.max_blocks as i64);
let total_free_blocks: i64 = free_list.free_blocks.len() as i64
+ (free_list.max_blocks as i64 - free_list.next_free_block as i64);
self.num_free_blocks_gauge.set(total_free_blocks);
}
let mut values = Vec::new();
values.append(&mut self.max_blocks_gauge.collect());
values.append(&mut self.num_free_blocks_gauge.collect());
values
}
}

View File

@@ -1,109 +0,0 @@
//! Global allocator, for tracking memory usage of the Rust parts
//!
//! Postgres is designed to handle allocation failure (ie. malloc() returning NULL) gracefully. It
//! rolls backs the transaction and gives the user an "ERROR: out of memory" error. Rust code
//! however panics if an allocation fails. We don't want that to ever happen, because an unhandled
//! panic leads to Postgres crash and restart. Our strategy is to pre-allocate a large enough chunk
//! of memory for use by the Rust code, so that the allocations never fail.
//!
//! To pick the size for the pre-allocated chunk, we have a metric to track the high watermark
//! memory usage of all the Rust allocations in total.
//!
//! TODO:
//!
//! - Currently we just export the metrics. Actual allocations are still just passed through to
//! the system allocator.
//! - Take padding etc. overhead into account
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use metrics::IntGauge;
struct MyAllocator {
allocations: AtomicU64,
deallocations: AtomicU64,
allocated: AtomicUsize,
high: AtomicUsize,
}
unsafe impl GlobalAlloc for MyAllocator {
unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
self.allocations.fetch_add(1, Ordering::Relaxed);
let mut allocated = self.allocated.fetch_add(layout.size(), Ordering::Relaxed);
allocated += layout.size();
self.high.fetch_max(allocated, Ordering::Relaxed);
unsafe { System.alloc(layout) }
}
unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
self.deallocations.fetch_add(1, Ordering::Relaxed);
self.allocated.fetch_sub(layout.size(), Ordering::Relaxed);
unsafe { System.dealloc(ptr, layout) }
}
}
#[global_allocator]
static GLOBAL: MyAllocator = MyAllocator {
allocations: AtomicU64::new(0),
deallocations: AtomicU64::new(0),
allocated: AtomicUsize::new(0),
high: AtomicUsize::new(0),
};
pub struct MyAllocatorCollector {
allocations: IntGauge,
deallocations: IntGauge,
allocated: IntGauge,
high: IntGauge,
}
impl MyAllocatorCollector {
pub fn new() -> MyAllocatorCollector {
MyAllocatorCollector {
allocations: IntGauge::new("allocations_total", "Number of allocations in Rust code")
.unwrap(),
deallocations: IntGauge::new(
"deallocations_total",
"Number of deallocations in Rust code",
)
.unwrap(),
allocated: IntGauge::new("allocated_total", "Bytes currently allocated").unwrap(),
high: IntGauge::new("allocated_high", "High watermark of allocated bytes").unwrap(),
}
}
}
impl metrics::core::Collector for MyAllocatorCollector {
fn desc(&self) -> Vec<&metrics::core::Desc> {
let mut descs = Vec::new();
descs.append(&mut self.allocations.desc());
descs.append(&mut self.deallocations.desc());
descs.append(&mut self.allocated.desc());
descs.append(&mut self.high.desc());
descs
}
fn collect(&self) -> Vec<metrics::proto::MetricFamily> {
let mut values = Vec::new();
// update the gauges
self.allocations
.set(GLOBAL.allocations.load(Ordering::Relaxed) as i64);
self.deallocations
.set(GLOBAL.allocations.load(Ordering::Relaxed) as i64);
self.allocated
.set(GLOBAL.allocated.load(Ordering::Relaxed) as i64);
self.high.set(GLOBAL.high.load(Ordering::Relaxed) as i64);
values.append(&mut self.allocations.collect());
values.append(&mut self.deallocations.collect());
values.append(&mut self.allocated.collect());
values.append(&mut self.high.collect());
values
}
}

Some files were not shown because too many files have changed in this diff Show More