Commit Graph

4661 Commits

Author SHA1 Message Date
Conrad Ludgate
686b3c79c8 http2 alpn (#6815)
## Problem

Proxy already supported HTTP2, but I expect no one is using it because
we don't advertise it in the TLS handshake.

## Summary of changes

#6335 without the websocket changes.
2024-02-20 10:44:46 +00:00
John Spray
02a8b7fbe0 storage controller: issue timeline create/delete calls concurrently (#6827)
## Problem

Timeline creation is meant to be very fast: it should only take
approximately on S3 PUT latency. When we have many shards in a tenant,
we should preserve that responsiveness.

## Summary of changes

- Issue create/delete pageserver API calls concurrently across all >0
shards
- During tenant deletion, delete shard zero last, separately, to avoid
confusing anything using GETs on the timeline.
- Return 201 instead of 200 on creations to make cloud control plane
happy

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2024-02-20 10:13:21 +00:00
Alexander Bayandin
feb359b459 CI: Update deprecated GitHub Actions (#6822)
## Problem

We use a bunch of deprecated actions.
See https://github.com/neondatabase/neon/actions/runs/7958569728
(Annotations section)

```
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/checkout@v3, actions/setup-java@v3, actions/cache@v3, actions/github-script@v6. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
```

## Summary of changes
- `actions/cache@v3` -> `actions/cache@v4`
- `actions/checkout@v3` -> `actions/checkout@v4`
- `actions/github-script@v6` -> `actions/github-script@v7`
- `actions/setup-java@v3` -> `actions/setup-java@v4`
- `actions/upload-artifact@v3` -> `actions/upload-artifact@v4`
2024-02-19 21:46:22 +00:00
John Spray
0c105ef352 storage controller: debug observability endpoints and self-test (#6820)
This PR stacks on https://github.com/neondatabase/neon/pull/6814

Observability:
- Because we only persist a subset of our state, and our external API is
pretty high level, it can be hard to get at the detail of what's going
on internally (e.g. the IntentState of a shard).
- Add debug endpoints for getting a full dump of all TenantState and
SchedulerNode objects
- Enrich the /control/v1/node listing endpoint to include full in-memory
detail of `Node` rather than just the `NodePersistence` subset

Consistency checks:
- The storage controller maintains separate in-memory and on-disk
states, by design. To catch subtle bugs, it is useful to occasionally
cross-check these.
- The Scheduler maintains reference counts for shard->node
relationships, which could drift if there was a bug in IntentState:
exhausively cross check them in tests.
2024-02-19 20:29:23 +00:00
John Spray
4f7704af24 storage controller: fix spurious reconciles after pageserver restarts (#6814)
## Problem

When investigating test failures
(https://github.com/neondatabase/neon/issues/6813) I noticed we were
doing a bunch of Reconciler runs right after splitting a tenant.

It's because the splitting test does a pageserver restart, and there was
a bug in /re-attach handling, where we would update the generation
correctly in the database and intent state, but not observed state,
thereby triggering a reconciliation on the next call to maybe_reconcile.
This didn't break anything profound (underlying rules about generations
were respected), but caused the storage controller to do an un-needed
extra round of bumping the generation and reconciling.

## Summary of changes

- Start adding metrics to the storage controller
- Assert on the number of reconciles done in test_sharding_split_smoke
- Fix /re-attach to update `observed` such that we don't spuriously
re-reconcile tenants.
2024-02-19 17:44:20 +00:00
Arpad Müller
e0c12faabd Allow initdb preservation for broken tenants (#6790)
Often times the tenants we want to (WAL) DR are the ones which the
pageserver marks as broken. Therefore, we should allow initdb
preservation also for broken tenants.

Fixes #6781.
2024-02-19 17:27:02 +01:00
John Spray
2f8a2681b8 pageserver: ensure we never try to save empty delta layer (#6805)
## Problem

Sharded tenants could panic during compaction when they try to generate
an L1 delta layer for a region that contains no keys on a particular
shard.

This is a variant of https://github.com/neondatabase/neon/issues/6755,
where we attempt to save a delta layer with no keys. It is harder to
reproduce than the case of image layers fixed in
https://github.com/neondatabase/neon/pull/6776.

It will become even less likely once
https://github.com/neondatabase/neon/pull/6778 tweaks keyspace
generation, but even then, we should not rely on keyspace partitioning
to guarantee at least one stored key in each partition.

## Summary of changes

- Move construction of `writer` in `compact_level0_phase1`, so that we
never leave a writer constructed but without any keys.
2024-02-19 15:07:07 +00:00
John Spray
7e4280955e control_plane/attachment_service: improve Scheduler (#6633)
## Problem

One of the major shortcuts in the initial version of this code was to
construct a fresh `Scheduler` each time we need it, which is an O(N^2)
cost as the tenant count increases.

## Summary of changes

- Keep `Scheduler` alive through the lifetime of ServiceState
- Use `IntentState` as a reference tracking helper, updating Scheduler
refcounts as nodes are added/removed from the intent.

There is an automated test that checks things don't get pathologically
slow with thousands of shards, but it's not included in this PR because
tests that implicitly test the runner node performance take some thought
to stabilize/land in CI.
2024-02-19 14:12:20 +00:00
John Spray
349b375010 pageserver: remove heatmap file during tenant delete (#6806)
## Problem

Secondary mode locations keep a local copy of the heatmap, which needs
cleaning up during deletion.

Closes: https://github.com/neondatabase/neon/issues/6802

## Summary of changes

- Extend test_live_migration to reproduce the issue
- Remove heatmap-v1.json during tenant deletion
2024-02-19 14:01:36 +00:00
Conrad Ludgate
d0d4871682 proxy: use postgres_protocol scram/sasl code (#4748)
1) `scram::password` was used in tests only. can be replaced with
`postgres_protocol::password`.
2) `postgres_protocol::authentication::sasl` provides a client impl of
SASL which improves our ability to test
2024-02-19 12:54:17 +00:00
Vlad Lazar
587cb705b8 pageserver: roll open layer in timeline writer (#6661)
## Problem
One WAL record can actually produce an arbitrary amount of key value pairs.
This is problematic since it might cause our frozen layers to bloat past the 
max allowed size of S3 single shot uploads.

[#6639](https://github.com/neondatabase/neon/pull/6639) introduced a "should roll"
check after every batch of `ingest_batch_size` (100 WAL records by default). This helps,
but the original problem still exists.

## Summary of changes
This patch moves the responsibility of rolling the currently open layer
to the `TimelineWriter`. Previously, this was done ad-hoc via calls
to `check_checkpoint_distance`. The advantages of this approach are:
* ability to split one batch over multiple open layers
* less layer map locking
* remove ad-hoc check_checkpoint_distance calls

More specifically, we track the current size of the open layer in the
writer. On each `put` check whether the current layer should be closed
and a new one opened. Keeping track of the currently open layer results
in less contention on the layer map lock. It only needs to be acquired
on the first write and on writes that require a roll afterwards.

Rolling the open layer can be triggered by:
1. The distance from the last LSN we rolled at. This bounds the amount
of WAL that the safekeepers need to store.
2. The size of the currently open layer.
3. The time since the last roll. It helps safekeepers to regard
pageserver as caught up and suspend activity.

Closes #6624
2024-02-19 12:34:27 +00:00
Alexander Bayandin
4d2bf55e6c CI: temporary disable coverage report for regression tests (#6798)
## Problem

The merging coverage data step recently started to be too flaky.
This failure blocks staging deployment and along with the flakiness of
regression tests might require 4-5-6 manual restarts of a CI job.

Refs:
- https://github.com/neondatabase/neon/issues/4540
- https://github.com/neondatabase/neon/issues/6485
- https://neondb.slack.com/archives/C059ZC138NR/p1704131143740669

## Summary of changes
- Disable code coverage report for functional tests
2024-02-19 11:07:27 +00:00
John Spray
5667372c61 pageserver: during shard split, wait for child to activate (#6789)
## Problem

test_sharding_split_unsharded was flaky with log errors from tenants not
being active. This was happening when the split function enters
wait_lsn() while the child shard might still be activating. It's flaky
rather than an outright failure because activation is usually very fast.

This is also a real bug fix, because in realistic scenarios we could
proceed to detach the parent shard before the children are ready,
leading to an availability gap for clients.

## Summary of changes

- Do a short wait_to_become_active on the child shards before proceeding
to wait for their LSNs to advance

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2024-02-18 15:55:19 +00:00
Alexander Bayandin
61f99d703d test_create_snapshot: do not try to copy pg_dynshmem dir (#6796)
## Problem
`test_create_snapshot` is flaky[0] on CI and fails constantly on macOS,
but with a slightly different error:
```
shutil.Error: [('/Users/bayandin/work/neon/test_output/test_create_snapshot[release-pg15-1-100]/repo/endpoints/ep-1/pgdata/pg_dynshmem', '/Users/bayandin/work/neon/test_output/compatibility_snapshot_pgv15/repo/endpoints/ep-1/pgdata/pg_dynshmem', "[Errno 2] No such file or directory: '/Users/bayandin/work/neon/test_output/test_create_snapshot[release-pg15-1-100]/repo/endpoints/ep-1/pgdata/pg_dynshmem'")]
```
Also (on macOS) `repo/endpoints/ep-1/pgdata/pg_dynshmem` is a symlink
to `/dev/shm/`.

- [0] https://github.com/neondatabase/neon/issues/6784

## Summary of changes
Ignore `pg_dynshmem` directory while copying a snapshot
2024-02-18 12:16:07 +00:00
John Spray
24014d8383 pageserver: fix sharding emitting empty image layers during compaction (#6776)
## Problem

Sharded tenants would sometimes try to write empty image layers during
compaction: this was more noticeable on larger databases.
- https://github.com/neondatabase/neon/issues/6755

**Note to reviewers: the last commit is a refactor that de-intents a
whole block, I recommend reviewing the earlier commits one by one to see
the real changes**

## Summary of changes

- Fix a case where when we drop a key during compaction, we might fail
to write out keys (this was broken when vectored get was added)
- If an image layer is empty, then do not try and write it out, but
leave `start` where it is so that if the subsequent key range meets
criteria for writing an image layer, we will extend its key range to
cover the empty area.
- Add a compaction test that configures small layers and compaction
thresholds, and asserts that we really successfully did image layer
generation. This fails before the fix.
2024-02-18 08:51:12 +00:00
Konstantin Knizhnik
e3ded64d1b Support pg-ivm extension (#6793)
## Problem

See https://github.com/neondatabase/cloud/issues/10268

## Summary of changes

Add pg_ivm extension

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2024-02-17 22:13:25 +02:00
dependabot[bot]
9b714c8572 build(deps): bump cryptography from 42.0.0 to 42.0.2 (#6792) 2024-02-17 19:15:21 +00:00
Alex Chi Z
29fb675432 Revert "fix superuser permission check for extensions (#6733)" (#6791)
This reverts commit 9ad940086c.

This pull request reverts #6733 to avoid incompatibility with pgvector
and I will push further fixes later. Note that after reverting this pull
request, the postgres submodule will point to some detached branches.
2024-02-16 20:50:09 +00:00
Christian Schwarz
ca07fa5f8b per-TenantShard read throttling (#6706) 2024-02-16 21:26:59 +01:00
John Spray
5d039c6e9b libs: add 'generations_api' auth scope (#6783)
## Problem

Even if you're not enforcing auth, the JwtAuth middleware barfs on
scopes it doesn't know about.

Add `generations_api` scope, which was invented in the cloud control
plane for the pageserver's /re-attach and /validate upcalls: this will
be enforced in storage controller's implementation of these in a later
PR.

Unfortunately the scope's naming doesn't match the other scope's naming
styles, so needs a manual serde decorator to give it an underscore.

## Summary of changes

- Add `Scope::GenerationsApi` variant
- Update pageserver + safekeeper auth code to print appropriate message
if they see it.
2024-02-16 15:53:09 +00:00
Calin Anca
36e1100949 bench_walredo: use tokio multi-threaded runtime (#6743)
fixes https://github.com/neondatabase/neon/issues/6648

Co-authored-by: Christian Schwarz <christian@neon.tech>
2024-02-16 16:31:54 +01:00
Alexander Bayandin
59c5b374de test_pageserver_max_throughput_getpage_at_latest_lsn: disable on CI (#6785)
## Problem
`test_pageserver_max_throughput_getpage_at_latest_lsn` is flaky which
makes CI status red pretty frequently. `benchmarks` is not a blocking
job (doesn't block `deploy`), so having it red might hide failures in
other jobs

Ref: https://github.com/neondatabase/neon/issues/6724

## Summary of changes
- Disable `test_pageserver_max_throughput_getpage_at_latest_lsn` on CI
until it fixed
2024-02-16 15:30:04 +00:00
Arpad Müller
0f3b87d023 Add test for pageserver_directory_entries_count metric (#6767)
Adds a simple test to ensure the metric works.

The test creates a bunch of relations to activate the metric.

Follow-up of #6736
2024-02-16 14:53:36 +00:00
Konstantin Knizhnik
c19625a29c Support sharding for compute_ctl (#6787)
## Problem

See https://github.com/neondatabase/neon/issues/6786

## Summary of changes

Split connection string in compute.rs when requesting basebackup
2024-02-16 14:50:09 +00:00
John Spray
f2e5212fed storage controller: background reconcile, graceful shutdown, better logging (#6709)
## Problem

Now that the storage controller is working end to end, we start burning
down the robustness aspects.

## Summary of changes

- Add a background task that periodically calls `reconcile_all`. This
ensures that if earlier operations couldn't succeed (e.g. because a node
was unavailable), we will eventually retry. This is a naive initial
implementation can start an unlimited number of reconcile tasks:
limiting reconcile concurrency is a later item in #6342
- Add a number of tracing spans in key locations: each background task,
each reconciler task.
- Add a top level CancellationToken and Gate, and use these to implement
a graceful shutdown that waits for tasks to shut down. This is not
bulletproof yet, because within these tasks we have remote HTTP calls
that aren't wrapped in cancellation/timeouts, but it creates the
structure, and if we don't shutdown promptly then k8s will kill us.
- To protect shard splits from background reconciliation, expose the `SplitState`
in memory and use it to guard any APIs that require an attached tenant.
2024-02-16 13:00:53 +00:00
Christian Schwarz
568bc1fde3 fix(build): production flamegraphs are useless (#6764) 2024-02-16 10:12:34 +00:00
Christian Schwarz
45e929c069 stop reading local metadata file (#6777) 2024-02-16 09:35:11 +00:00
John Spray
6b980f38da libs: refactor ShardCount.0 to private (#6690)
## Problem

The ShardCount type has a magic '0' value that represents a legacy
single-sharded tenant, whose TenantShardId is formatted without a
`-0001` suffix (i.e. formatted as a traditional TenantId).

This was error-prone in code locations that wanted the actual number of
shards: they had to handle the 0 case specially.

## Summary of changes

- Make the internal value of ShardCount private, and expose `count()`
and `literal()` getters so that callers have to explicitly say whether
they want the literal value (e.g. for storing in a TenantShardId), or
the actual number of shards in the tenant.


---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2024-02-15 21:59:39 +00:00
MMeent
f0d8bd7855 Update Makefile (#6779)
This fixes issues where `neon-pg-ext-clean-vYY` is used as target and
resolves using the `neon-pg-ext-%` template with `$*` resolving as `clean-vYY`, for
older versions of GNU Make, rather than `neon-pg-ext-clean-%` using `$*` = `vYY`

## Problem

```
$ make clean
...
rm -f pg_config_paths.h

Compiling neon clean-v14

mkdir -p /Users/<user>/neon-build//pg_install//build/neon-clean-v14

/Applications/Xcode.app/Contents/Developer/usr/bin/make PG_CONFIG=/Users/<user>/neon-build//pg_install//clean-v14/bin/pg_config CFLAGS='-O0 -g3  ' \

        -C /Users/<user>/neon-build//pg_install//build/neon-clean-v14 \

        -f /Users/<user>/neon-build//pgxn/neon/Makefile install

make[1]: /Users/<user>/neon-build//pg_install//clean-v14/bin/pg_config: Command not found

make[1]: *** No rule to make target `install'.  Stop.

make: *** [neon-pg-ext-clean-v14] Error 2
```
2024-02-15 19:48:50 +00:00
Joonas Koivunen
046d9c69e6 fix: require wider jwt for changing the io engine (#6770)
io-engine should not be changeable with any JWT token, for example the
tenant_id scoped token which computes have.
2024-02-15 16:58:26 +00:00
Alexander Bayandin
c72cb44213 test_runner/performance: parametrize benchmarks (#6744)
## Problem
Currently, we don't store `PLATFORM` for Nightly Benchmarks. It
causes them to be merged as reruns in Allure report (because they have
the same test name).

## Summary of changes
- Parametrize benchmarks by 
  - Postgres Version (14/15/16)
  - Build Type (debug/release/remote)
  - PLATFORM (neon-staging/github-actions-selfhosted/...)

---------

Co-authored-by: Bodobolero <peterbendel@neon.tech>
2024-02-15 15:53:58 +00:00
Arpad Müller
cd3e4ac18d Rename TEST_IMG function to test_img (#6762)
Latter follows the canonical way to naming functions in Rust.
2024-02-15 15:14:51 +00:00
Alex Chi Z
9ad940086c fix superuser permission check for extensions (#6733)
close https://github.com/neondatabase/neon/issues/6236

This pull request bumps neon postgres dependencies. The corresponding
postgres commits fix the checks for superuser permission when creating
an extension. Also, for creating native functinos, it now allows
neon_superuser only in the extension creation process.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-02-15 14:59:13 +00:00
Joonas Koivunen
936f2ee2a5 fix: accidential wide span in tests (#6772)
introduced in a PR without other #[tracing::instrument] changes.
2024-02-15 13:48:44 +00:00
Heikki Linnakangas
1af047dd3e Fix typo in CI message (#6749) 2024-02-15 14:34:19 +02:00
John Spray
5fa747e493 pageserver: shard splitting refinements (parent deletion, hard linking) (#6725)
## Problem

- We weren't deleting parent shard contents once the split was done
- Re-downloading layers into child shards is wasteful

## Summary of changes

- Hard-link layers into child chart local storage during split
- Delete parent shards content at the end

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2024-02-15 10:21:53 +02:00
Joonas Koivunen
80854b98ff move timeouts and cancellation handling to remote_storage (#6697)
Cancellation and timeouts are handled at remote_storage callsites, if
they are. However they should always be handled, because we've had
transient problems with remote storage connections.

- Add cancellation token to the `trait RemoteStorage` methods
- For `download*`, `list*` methods there is
`DownloadError::{Cancelled,Timeout}`
- For the rest now using `anyhow::Error`, it will have root cause
`remote_storage::TimeoutOrCancel::{Cancel,Timeout}`
- Both types have `::is_permanent` equivalent which should be passed to
`backoff::retry`
- New generic RemoteStorageConfig option `timeout`, defaults to 120s
- Start counting timeouts only after acquiring concurrency limiter
permit
- Cancellable permit acquiring
- Download stream timeout or cancellation is communicated via an
`std::io::Error`
- Exit backoff::retry by marking cancellation errors permanent

Fixes: #6096
Closes: #4781

Co-authored-by: arpad-m <arpad-m@users.noreply.github.com>
2024-02-14 23:24:07 +00:00
Christian Schwarz
024372a3db Revert "refactor(VirtualFile::crashsafe_overwrite): avoid Handle::block_on in callers" (#6765)
Reverts neondatabase/neon#6731

On high tenant count Pageservers in staging, memory and CPU usage shoots
to 100% with this change. (NB: staging currently has tokio-epoll-uring
enabled)

Will analyze tomorrow.


https://neondb.slack.com/archives/C03H1K0PGKH/p1707933875639379?thread_ts=1707929541.125329&cid=C03H1K0PGKH
2024-02-14 19:17:12 +00:00
Shayan Hosseini
fff2468aa2 Add resource consume test funcs (#6747)
## Problem

Building on #5875 to add handy test functions for autoscaling.

Resolves #5609

## Summary of changes

This PR makes the following changes to #5875:
- Enable `neon_test_utils` extension in the compute node docker image,
so we could use it in the e2e tests (as discussed with @kelvich).
- Removed test functions related to disk as we don't use them for
autoscaling.
- Fix the warning with printf-ing unsigned long variables.

---------

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-02-14 18:45:05 +00:00
Anna Khanova
c7538a2c20 Proxy: remove fail fast logic to connect to compute (#6759)
## Problem

Flaky tests

## Summary of changes

Remove failfast logic
2024-02-14 18:43:52 +00:00
Arpad Müller
a2d0d44b42 Remove unused allow's (#6760)
These allow's became redundant some time ago so remove them, or address
them if addressing is very simple.
2024-02-14 18:16:05 +00:00
Christian Schwarz
7d3cdc05d4 fix(pageserver): pagebench doesn't work with released artifacts (#6757)
The canonical release artifact of neon.git is the Docker image with all
the binaries in them:

```
docker pull neondatabase/neon:release-4854
docker create --name extract neondatabase/neon:release-4854
docker cp extract:/usr/local/bin/pageserver ./pageserver.release-4854
chmod +x pageserver.release-4854
cp -a pageserver.release-4854 ./target/release/pageserver
```

Before this PR, these artifacts didn't expose the `keyspace` API,
thereby preventing `pagebench get-page-latest-lsn` from working.

Having working pagebench is useful, e.g., for experiments in staging.
So, expose the API, but don't document it, as it's not part of the
interface with control plane.
2024-02-14 17:01:15 +00:00
John Spray
840abe3954 pageserver: store aux files as deltas (#6742)
## Problem

Aux files were stored with an O(N^2) cost, since on each modification
the entire map is re-written as a page image.

This addresses one axis of the inefficiency in logical replication's use
of storage (https://github.com/neondatabase/neon/issues/6626). It will
still be writing a large amount of duplicative data if writing the same
slot's state every 15 seconds, but the impact will be O(N) instead of
O(N^2).

## Summary of changes

- Introduce `NeonWalRecord::AuxFile`
- In `DatadirModification`, if the AUX_FILES_KEY has already been set,
then write a delta instead of an image
2024-02-14 15:01:16 +00:00
Christian Schwarz
774a6e7475 refactor(virtual_file) make write_all_at take owned buffers (#6673)
context: https://github.com/neondatabase/neon/issues/6663

Building atop #6664, this PR switches `write_all_at` to take owned
buffers.

The main challenge here is the `EphemeralFile::mutable_tail`, for which
I'm picking the ugly solution of an `Option` that is `None` while the IO
is in flight.

After this, we will be able to switch `write_at` to take owned buffers
and call tokio-epoll-uring's `write` function with that owned buffer.
That'll be done in #6378.
2024-02-14 15:59:06 +01:00
Christian Schwarz
df5d588f63 refactor(VirtualFile::crashsafe_overwrite): avoid Handle::block_on in callers (#6731)
Some callers of `VirtualFile::crashsafe_overwrite` call it on the
executor thread, thereby potentially stalling it.

Others are more diligent and wrap it in `spawn_blocking(...,
Handle::block_on, ... )` to avoid stalling the executor thread.

However, because `crashsafe_overwrite` uses
VirtualFile::open_with_options internally, we spawn a new thread-local
`tokio-epoll-uring::System` in the blocking pool thread that's used for
the `spawn_blocking` call.

This PR refactors the situation such that we do the `spawn_blocking`
inside `VirtualFile::crashsafe_overwrite`. This unifies the situation
for the better:

1. Callers who didn't wrap in `spawn_blocking(..., Handle::block_on,
...)` before no longer stall the executor.
2. Callers who did it before now can avoid the `block_on`, resolving the
problem with the short-lived `tokio-epoll-uring::System`s in the
blocking pool threads.

A future PR will build on top of this and divert to tokio-epoll-uring if
it's configures as the IO engine.

Changes
-------

- Convert implementation to std::fs and move it into `crashsafe.rs`
- Yes, I know, Safekeepers (cc @arssher ) added `durable_rename` and
`fsync_async_opt` recently. However, `crashsafe_overwrite` is different
in the sense that it's higher level, i.e., it's more like
`std::fs::write` and the Safekeeper team's code is more building block
style.
- The consequence is that we don't use the VirtualFile file descriptor
cache anymore.
- I don't think it's a big deal because we have plenty of slack wrt
production file descriptor limit rlimit (see [this
dashboard](https://neonprod.grafana.net/d/e4a40325-9acf-4aa0-8fd9-f6322b3f30bd/pageserver-open-file-descriptors?orgId=1))

- Use `tokio::task::spawn_blocking` in
`VirtualFile::crashsafe_overwrite` to call the new
`crashsafe::overwrite` API.
- Inspect all callers to remove any double-`spawn_blocking`
- spawn_blocking requires the captures data to be 'static + Send. So,
refactor the callers. We'll need this for future tokio-epoll-uring
support anyway, because tokio-epoll-uring requires owned buffers.

Related Issues
--------------

- overall epic to enable write path to tokio-epoll-uring: #6663
- this is also kind of relevant to the tokio-epoll-uring System creation
failures that we encountered in staging, investigation being tracked in
#6667
- why is it relevant? Because this PR removes two uses of
`spawn_blocking+Handle::block_on`
2024-02-14 14:22:41 +00:00
John Spray
f39b0fce9b Revert #6666 "tests: try to make restored-datadir comparison tests not flaky" (#6751)
The #6666  change appears to have made the test fail more often.

PR https://github.com/neondatabase/neon/pull/6712 should re-instate this
change, along with its change to make the overall flow more reliable.

This reverts commit 568f91420a.
2024-02-14 10:57:01 +00:00
Conrad Ludgate
a9ec4eb4fc hold cancel session (#6750)
## Problem

In a recent refactor, we accidentally dropped the cancel session early

## Summary of changes

Hold the cancel session during proxy passthrough
2024-02-14 10:26:32 +00:00
Heikki Linnakangas
a97b54e3b9 Cherry-pick Postgres bugfix to 'mmap' DSM implementation
Cherry-pick Upstream commit fbf9a7ac4d to neon stable branches. We'll
get it in the next PostgreSQL minor release anyway, but we need it
now, if we want to start using the 'mmap' implementation.

See https://github.com/neondatabase/autoscaling/issues/800 for the
plans on doing that.
2024-02-14 11:37:52 +02:00
Heikki Linnakangas
a5114a99b2 Create a symlink from pg_dynshmem to /dev/shm
See included comment and issue
https://github.com/neondatabase/autoscaling/issues/800 for details.

This has no effect, unless you set "dynamic_shared_memory_type = mmap"
in postgresql.conf.
2024-02-14 11:37:52 +02:00
Arpad Müller
ee7bbdda0e Create new metric for directory counts (#6736)
There is O(n^2) issues due to how we store these directories (#6626), so
it's good to keep an eye on them and ensure the numbers stay low.

The new per-timeline metric `pageserver_directory_entries_count`
isn't perfect, namely we don't calculate it every time we attach
the timeline, but only if there is an actual change.
Also, it is a collective metric over multiple scalars. Lastly,
we only emit the metric if it is above a certain threshold.

However, the metric still give a feel for the general size of the timeline.
We care less for small values as the metric is mainly there to
detect and track tenants with large directory counts.

We also expose the directory counts in `TimelineInfo` so that one can
get the detailed size distribution directly via the pageserver's API.

Related: #6642 , https://github.com/neondatabase/cloud/issues/10273
2024-02-14 02:12:00 +01:00