Commit Graph

80 Commits

Author SHA1 Message Date
Heikki Linnakangas
48535798ba Merge remote-tracking branch 'origin/main' into communicator-rewrite 2025-07-23 00:00:10 +03:00
Folke Behrens
9c0efba91e Bump rand crate to 0.9 (#12674) 2025-07-22 09:31:39 +00:00
Erik Grinaker
367d96e25b Merge branch 'main' into communicator-rewrite 2025-07-14 18:47:23 +02:00
Erik Grinaker
30b877074c pagebench: add CPU profiling support (#12478)
## Problem

The new communicator gRPC client has significantly worse Pagebench
performance than a basic gRPC client. We need to find out why.

## Summary of changes

Add a `pagebench --profile` flag which takes a client CPU profile of the
benchmark and writes a flamegraph to `profile.svg`.
2025-07-14 11:44:53 +00:00
Erik Grinaker
fecb707b19 pagebench: add idle-streams (#12583)
## Problem

For the communicator scheduling policy, we need to understand the
server-side cost of idle gRPC streams.

Touches #11735.

## Summary of changes

Add an `idle-streams` benchmark to `pagebench` which opens a large
number of idle gRPC GetPage streams.
2025-07-14 09:41:58 +00:00
Erik Grinaker
8cd5370c00 Merge branch 'main' into communicator-rewrite 2025-07-11 10:39:26 +02:00
Erik Grinaker
8aa9540a05 pageserver/page_api: include block number and rel in gRPC GetPageResponse (#12542)
## Problem

With gRPC `GetPageRequest` batches, we'll have non-trivial
fragmentation/reassembly logic in several places of the stack
(concurrent reads, shard splits, LFC hits, etc). If we included the
block numbers with the pages in `GetPageResponse` we could have better
verification and observability that the final responses are correct.

Touches #11735.
Requires #12480.

## Summary of changes

Add a `Page` struct with`block_number` for `GetPageResponse`, along with
the `RelTag` for completeness, and verify them in the rich gRPC client.
2025-07-10 22:35:14 +00:00
Erik Grinaker
44ea17b7b2 pageserver/page_api: add attempt to GetPage request ID (#12536)
## Problem

`GetPageRequest::request_id` is supposed to be a unique ID for a
request. It's not, because we may retry the request using the same ID.
This causes assertion failures and confusion.

Touches #11735.
Requires #12480.

## Summary of changes

Extend the request ID with a retry attempt, and handle it in the gRPC
client and server.
2025-07-10 20:39:42 +00:00
Erik Grinaker
dcdfe80bf0 pagebench: add support for rich gRPC client (#12477)
## Problem

We need to benchmark the rich gRPC client
`client_grpc::PageserverClient` against the basic, no-frills
`page_api::Client` to determine how much overhead it adds.

Touches #11735.
Requires #12476.

## Summary of changes

Add a `pagebench --rich-client` parameter to use
`client_grpc::PageserverClient`. Also adds a compression parameter to
the client.
2025-07-10 17:30:09 +00:00
Erik Grinaker
81e7218c27 pageserver: tighten up gRPC page_api::Client (#12396)
This patch tightens up `page_api::Client`. It's mostly superficial
changes, but also adds a new constructor that takes an existing gRPC
channel, for use with the communicator connection pool.
2025-07-08 18:15:13 +00:00
Erik Grinaker
bf01145ae4 Remove some old code 2025-07-02 11:46:54 +02:00
Erik Grinaker
958c2577f5 pageserver: tighten up page_api::Client 2025-07-01 17:54:41 +02:00
Erik Grinaker
c3cb1ab98d Merge branch 'main' into communicator-rewrite 2025-06-30 21:07:01 +02:00
Erik Grinaker
67b04f8ab3 Fix a bunch of linter warnings 2025-06-30 11:10:02 +02:00
Heikki Linnakangas
f3ba201800 Run cargo fmt 2025-06-29 21:21:07 +03:00
Erik Grinaker
e50b914a8e compute_tools: support gRPC base backups in compute_ctl (#12244)
## Problem

`compute_ctl` should support gRPC base backups.

Requires #12111.
Requires #12243.
Touches #11926.

## Summary of changes

Support `grpc://` connstrings for `compute_ctl` base backups.
2025-06-27 16:39:00 +00:00
Erik Grinaker
f755979102 pageserver: payload compression for gRPC base backups (#12346)
## Problem

gRPC base backups use gRPC compression. However, this has two problems:

* Base backup caching will cache compressed base backups (making gRPC
compression pointless).
* Tonic does not support varying the compression level, and zstd default
level is 10% slower than gzip fastest level.

Touches https://github.com/neondatabase/neon/issues/11728.
Touches https://github.com/neondatabase/cloud/issues/29353.

## Summary of changes

This patch adds a gRPC parameter `BaseBackupRequest::compression`
specifying the compression algorithm. It also moves compression into
`send_basebackup_tarball` to reduce code duplication.

A follow-up PR will integrate the base backup cache with gRPC.
2025-06-25 18:16:23 +00:00
Arpad Müller
552249607d apply clippy fixes for 1.88.0 beta (#12331)
The 1.88.0 stable release is near (this Thursday). We'd like to fix most
warnings beforehand so that the compiler upgrade doesn't require
approval from too many teams.

This is therefore a preparation PR (like similar PRs before it).

There is a lot of changes for this release, mostly because the
`uninlined_format_args` lint has been added to the `style` lint group.
One can read more about the lint
[here](https://rust-lang.github.io/rust-clippy/master/#/uninlined_format_args).

The PR is the result of `cargo +beta clippy --fix` and `cargo fmt`. One
remaining warning is left for the proxy team.

---------

Co-authored-by: Conrad Ludgate <conrad@neon.tech>
2025-06-24 10:12:42 +00:00
Heikki Linnakangas
e90be06d46 silence a few compiler warnings
about unnecessary 'mut's and 'use's
2025-06-23 18:16:54 +03:00
Heikki Linnakangas
356ba67607 Merge remote-tracking branch 'origin/main' into HEAD
I also included build script changes from
https://github.com/neondatabase/neon/pull/12266, which is not yet
merged but will be soon.
2025-06-23 17:46:30 +03:00
Erik Grinaker
15d079cd41 pagebench: improve getpage-latest-lsn gRPC support (#12293)
This improves `pagebench getpage-latest-lsn` gRPC support by:

* Using `page_api::Client`.
* Removing `--protocol`, and using the `page-server-connstring` scheme
instead.
* Adding `--compression` to enable zstd compression.
2025-06-20 08:31:40 +00:00
Erik Grinaker
dc1625cd8e pagebench: add basebackup gRPC support (#12250)
## Problem

Pagebench does not support gRPC for `basebackup` benchmarks.

Requires #12243.
Touches #11728.

## Summary of changes

Add gRPC support via gRPC connstrings, e.g. `pagebench basebackup
--page-service-connstring grpc://localhost:51051`.

Also change `--gzip-probability` to `--no-compression`, since this must
be specified per-client for gRPC.
2025-06-19 15:40:57 +00:00
Heikki Linnakangas
5a045e7d52 Move pagestream_api to separate module (#12272)
For general readability.
2025-06-18 12:03:14 +00:00
Erik Grinaker
d0b3629412 Tweak base backups 2025-06-13 13:47:26 -07:00
Elizabeth Murray
68f18ccacf Request Tracker Prototype
Does not include splitting requests across shards.
2025-06-05 13:32:18 -07:00
Erik Grinaker
2fb6164bf8 Misc build fixes 2025-06-05 17:22:11 +02:00
Erik Grinaker
37c58522a2 Merge branch 'main' into communicator-rewrite 2025-06-05 15:08:05 +02:00
Erik Grinaker
6a43f23eca pagebench: add batch support (#12133)
## Problem

The new gRPC page service protocol supports client-side batches. The
current libpq protocol only does best-effort server-side batching.

To compare these approaches, Pagebench should support submitting
contiguous page batches, similar to how Postgres will submit them (e.g.
with prefetches or vectored reads).

## Summary of changes

Add a `--batch-size` parameter specifying the size of contiguous page
batches. One batch counts as 1 RPS and 1 queue depth.

For the libpq protocol, a batch is submitted as individual requests and
we rely on the server to batch them for us. This will give a realistic
comparison of how these would be processed in the wild (e.g. when
Postgres sends 100 prefetch requests).

This patch also adds some basic validation of responses.
2025-06-05 11:52:52 +00:00
Erik Grinaker
745b750f33 Merge branch 'main' into communicator-rewrite 2025-06-03 13:29:45 +02:00
Erik Grinaker
a21c1174ed pagebench: add gRPC support for get-page-latest-lsn (#12077)
## Problem

We need gRPC support in Pagebench to benchmark the new gRPC Pageserver
implementation.

Touches #11728.

## Summary of changes

Adds a `Client` trait to make the client transport swappable, and a gRPC
client via a `--protocol grpc` parameter. This must also specify the
connstring with the gRPC port:

```
pagebench get-page-latest-lsn --protocol grpc --page-service-connstring grpc://localhost:51051
```

The client is implemented using the raw Tonic-generated gRPC client, to
minimize client overhead.
2025-06-02 14:50:49 +00:00
Heikki Linnakangas
f06bb2bbd8 Implement growing the hash table. Fix unit tests. 2025-05-29 15:54:55 +03:00
Elizabeth Murray
7c9bd542a6 Fix compile warnings, minor cleanup. 2025-05-26 06:30:48 -07:00
Elizabeth Murray
af9379ccf6 Use a sempahore to gate access to connections. Add metrics for testing. 2025-05-26 05:28:50 -07:00
Heikki Linnakangas
bb28109ffa Merge remote-tracking branch 'origin/main' into communicator-rewrite-with-integrated-cache
There were conflicts because of the differences in the page_api
protocol that was merged to main vs what was on the branch. I adapted
the code for the protocol in main.
2025-05-26 11:52:32 +03:00
Elizabeth Murray
0dddb1e373 Add back whitespace that was removed. 2025-05-19 06:34:52 -07:00
Elizabeth Murray
3acb263e62 Add first iteration of simulating a flakey network with a custom TCP. 2025-05-19 06:33:30 -07:00
Christian Schwarz
a7ce323949 benchmarking: extend test_page_service_batching.py to cover concurrent IO + batching under random reads (#10466)
This PR commits the benchmarks I ran to qualify concurrent IO before we
released it.

Changes:
- Add `l0stack` fixture; a reusable abstraction for creating a stack of
L0 deltas
  each of which has 1 Value::Delta per page.
- Such a stack of L0 deltas is a good and understandable demo for
concurrent IO
because to reconstruct any page, $layer_stack_height` Values need to be
read.
  Before concurrent IO, the reads were sequential.
  With concurrent IO, they are executed concurrently.
- So, switch `test_latency` to use the l0stack.
- Teach `pagebench`, which is used by `test_latency`, to limit itself to
the blocks of the relation created by the l0stack abstraction.
- Additional parametrization of `test_latency` over dimensions
`ps_io_concurrency,l0_stack_height,queue_depth`
- Use better names for the tests to reflect what they do, leave
interpretation of the (now quite high-dimensional) results to the reader
  - `test_{throughput => postgres_seqscan}`
  - `test_{latency => random_reads}`
- Cut down on permutations to those we use in production. Runtime is
about 2min.

Refs
- concurrent IO epic https://github.com/neondatabase/neon/issues/9378 
- batching task: fixes https://github.com/neondatabase/neon/issues/9837

---------

Co-authored-by: Peter Bendel <peterbendel@neon.tech>
2025-05-15 17:48:13 +00:00
Elizabeth Murray
be8ed81532 Connection pool: update error accounting, sweep idle connections, add config options. 2025-05-14 07:31:52 -07:00
Erik Grinaker
d785100c02 page_api: add GetPageRequest::class 2025-05-02 10:48:32 +02:00
Erik Grinaker
2c0d930e3d page_api: add GetPageResponse::status 2025-04-30 16:48:45 +02:00
Erik Grinaker
66171a117b page_api: add GetPageRequestBatch 2025-04-30 15:31:11 +02:00
Erik Grinaker
df2806e7a0 page_api: add GetPageRequest::id 2025-04-30 15:00:16 +02:00
Erik Grinaker
4c77397943 Add neon-shard-id header 2025-04-30 11:18:06 +02:00
Erik Grinaker
0f520d79ab pageserver: rename data_api to page_api 2025-04-29 15:58:52 +02:00
Heikki Linnakangas
93eb7bb6b8 include lots of changes that went missing by accident 2025-04-29 15:32:27 +03:00
Heikki Linnakangas
e58d0fece1 New communicator, with "integrated" cache accessible from all processes 2025-04-29 11:52:44 +03:00
Heikki Linnakangas
4d0c1e8b78 refactor: Extract some code in pagebench getpage command to function (#11563)
This makes it easier to add a different client implementation alongside
the current one. I started working on a new gRPC-based protocol to
replace the libpq protocol, which will introduce a new function like
`client_libpq`, but for the new protocol.

It's a little more readable with less indentation anyway.
2025-04-19 08:38:03 +00:00
Dmitrii Kovalkov
0f367cb665 storcon: reuse reqwest http client (#11327)
## Problem

- Part of https://github.com/neondatabase/neon/issues/11113
- Building a new `reqwest::Client` for every request is expensive
because it parses CA certs under the hood. It's noticeable in storcon's
flamegraph.

## Summary of changes
- Reuse one `reqwest::Client` for all API calls to avoid parsing CA
certificates every time.
2025-03-21 11:48:22 +00:00
Dmitrii Kovalkov
63b22d3fb1 pageserver: https for management API (#11025)
## Problem

Storage controller uses unencrypted HTTP requests for pageserver
management API.

Closes: https://github.com/neondatabase/cloud/issues/24283


## Summary of changes
- Implement `http_utils::server::Server` with TLS support.
- Replace `hyper0::server::Server` with `http_utils::server::Server` in
pageserver.
- Add HTTPS handler for pageserver management API.
- Generate local SSL certificates in neon local.
2025-03-10 15:07:59 +00:00
Arpad Müller
a22be5af72 Migrate the last crates to edition 2024 (#10998)
Migrates the remaining crates to edition 2024. We like to stay on the
latest edition if possible. There is no functional changes, however some
code changes had to be done to accommodate the edition's breaking
changes.

Like the previous migration PRs, this is comprised of three commits:

* the first does the edition update and makes `cargo check`/`cargo
clippy` pass. we had to update bindgen to make its output [satisfy the
requirements of edition
2024](https://doc.rust-lang.org/edition-guide/rust-2024/unsafe-extern.html)
* the second commit does a `cargo fmt` for the new style edition.
* the third commit reorders imports as a one-off change. As before, it
is entirely optional.

Part of #10918
2025-02-27 09:40:40 +00:00