Commit Graph

1714 Commits

Author SHA1 Message Date
John Spray
af844571e6 pageserver: only upload initdb from shard 0 2024-01-07 20:59:07 +00:00
John Spray
0a2e90ab65 pageserver: move EvictionPolicy into models/
This makes the config type more convenient to work with: it
was previously carrying a serde_json::Value to force the
contained types to be opaque, which doesn't make a lot of
sense for an external interface.
2024-01-07 20:59:07 +00:00
John Spray
3c560d27a8 pageserver: implement secondary-mode downloads (#6123)
Follows on from #6050 , in which we upload heatmaps. Secondary locations
will now poll those heatmaps and download layers mentioned in the
heatmap.

TODO:
- [X] ~Unify/reconcile stats for behind-schedule execution with
warn_when_period_overrun
(https://github.com/neondatabase/neon/pull/6050#discussion_r1426560695)~
- [x] Give downloads their own concurrency config independent of uploads

Deferred optimizations:
- https://github.com/neondatabase/neon/issues/6199
- https://github.com/neondatabase/neon/issues/6200

Eviction will be the next PR:
- #5342
2024-01-05 12:29:20 +00:00
John Spray
18e9208158 pageserver: improved error handling for shard routing error, timeline not found (#6262)
## Problem

- When a client requests a key that isn't found in any shard on the node
(edge case that only happens if a compute's config is out of date), we
should prompt them to reconnect (as this includes a backoff), since they
will not be able to complete the request until they eventually get a
correct pageserver connection string.
- QueryError::Other is used excessively: this contains a type-ambiguous
anyhow::Error and is logged very verbosely (including backtrace).

## Summary of changes

- Introduce PageStreamError to replace use of anyhow::Error in request
handlers for getpage, etc.
- Introduce Reconnect and NotFound variants to QueryError
- Map the "shard routing error" case to PageStreamError::Reconnect ->
QueryError::Reconnect
- Update type conversions for LSN timeouts and tenant/timeline not found
errors to use PageStreamError::NotFound->QueryError::NotFound
2024-01-04 10:40:03 +00:00
John Spray
c119af8ddd pageserver: run at least 2 background task threads
Otherwise an assertion in CONCURRENT_BACKGROUND_TASKS will
trip if you try to run the pageserver on a single core.
2024-01-03 14:22:40 +00:00
John Spray
a2e083ebe0 pageserver: make walredo shard-aware
This does not have a functional impact, but enables all
the logging in this code to include the shard_id
label.
2024-01-03 14:22:40 +00:00
John Spray
73a944205b pageserver: log details on shard routing error 2024-01-03 14:22:40 +00:00
John Spray
34ebfbdd6f pageserver: fix handling getpage with multiple shards on one node
Previously, we would wait for the LSN to be visible on whichever
timeline we happened to load at the start of the connection, then
proceed to look up the correct timeline for the key and do the read.

If the timeline holding the key was behind the timeline we used
for the LSN wait, then we might serve an apparently-successful read result
that actually contains data from behind the requested lsn.
2024-01-03 14:22:40 +00:00
John Spray
ef7c9c2ccc pageserver: fix active tenant lookup hitting secondaries with sharding
If there is some secondary shard for a tenant on the same
node as an attached shard, the secondary shard could trip up
this code and cause page_service to incorrectly
get an error instead of finding the attached shard.
2024-01-03 14:22:40 +00:00
John Spray
6c79e12630 pageserver: drop unwanted keys during compaction after split 2024-01-03 14:22:40 +00:00
John Spray
753d97bd77 pageserver: don't delete ancestor shard layers 2024-01-03 14:22:40 +00:00
Cuong Nguyen
fb518aea0d Add batch ingestion mechanism to avoid high contention (#5886)
## Problem
For context, this problem was observed in a research project where we
try to make neon run in multiple regions and I was asked by @hlinnaka to
make this PR.

In our project, we use the pageserver in a non-conventional way such
that we would send a larger number of requests to the pageserver than
normal (imagine postgres without the buffer pool). I measured the time
from the moment a WAL record left the safekeeper to when it reached the
pageserver
([code](e593db1f5a/pageserver/src/tenant/timeline/walreceiver/walreceiver_connection.rs (L282-L287)))
and observed that when the number of get_page_at_lsn requests was high,
the wal receiving time increased significantly (see the left side of the
graphs below).

Upon further investigation, I found that the delay was caused by this
line


d2ca410919/pageserver/src/tenant/timeline.rs (L2348)

The `get_layer_for_write` method is called for every value during WAL
ingestion and it tries to acquire layers write lock every time, thus
this results in high contention when read lock is acquired more
frequently.


![Untitled](https://github.com/neondatabase/neon/assets/6244849/85460f4d-ead1-4532-bc64-736d0bfd7f16)

![Untitled2](https://github.com/neondatabase/neon/assets/6244849/84199ab7-5f0e-413b-a42b-f728f2225218)

## Summary of changes

It is unnecessary to call `get_layer_for_write` repeatedly for all
values in a WAL message since they would end up in the same memory layer
anyway, so I created the batched versions of `InMemoryLayer::put_value`,
`InMemoryLayer ::put_tombstone`, `Timeline::put_value`, and
`Timeline::put_tombstone`, that acquire the locks once for a batch of
values.

Additionally, `DatadirModification` is changed to store multiple
versions of uncommitted values, and `WalIngest::ingest_record()` can now
ingest records without immediately committing them.

With these new APIs, the new ingestion loop can be changed to commit for
every `ingest_batch_size` records. The `ingest_batch_size` variable is
exposed as a config. If it is set to 1 then we get the same behavior
before this change. I found that setting this value to 100 seems to work
the best, and you can see its effect on the right side of the above
graphs.

---------

Co-authored-by: John Spray <john@neon.tech>
2024-01-03 10:41:58 +00:00
Arseny Sher
dbd36e40dc Move failpoint support code to utils.
To enable them in safekeeper as well.
2024-01-02 10:50:20 +04:00
John Spray
e68ae2888a pageserver: expedite tenant activation on delete (#6190)
## Problem

During startup, a tenant delete request might have to retry for many
minutes waiting for a tenant to enter Active state.

## Summary of changes

- Refactor delete_tenant into TenantManager: this is not a functional
change, but will avoid merge conflicts with
https://github.com/neondatabase/neon/pull/6105 later
- Add 412 responses to the swagger definition of this endpoint.
- Use Tenant::wait_to_become_active in `TenantManager::delete_tenant`

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-12-22 10:22:22 +00:00
Arpad Müller
7d6fc3c826 Use pre-generated initdb.tar.zst in test_ingest_real_wal (#6206)
This implements the TODO mentioned in the test added by #5892.
2023-12-21 14:23:09 +00:00
Christian Schwarz
5385791ca6 add pageserver component-level benchmark (pagebench) (#6174)
This PR adds a component-level benchmarking utility for pageserver.
Its name is `pagebench`.

The problem solved by `pagebench` is that we want to put Pageserver
under high load.

This isn't easily achieved with `pgbench` because it needs to go through
a compute, which has signficant performance overhead compared to
accessing Pageserver directly.

Further, compute has its own performance optimizations (most
importantly: caches). Instead of designing a compute-facing workload
that defeats those internal optimizations, `pagebench` simply bypasses
them by accessing pageserver directly.

Supported benchmarks:

* getpage@latest_lsn
* basebackup
* triggering logical size calculation

This code has no automated users yet.
A performance regression test for getpage@latest_lsn will be added in a
later PR.

part of https://github.com/neondatabase/neon/issues/5771
2023-12-21 13:07:23 +01:00
Arpad Müller
48890d206e Simplify inject_index_part test function (#6207)
Instead of manually constructing the directory's path, we can just use
the `parent()` function.

This is a drive-by improvement from #6206
2023-12-21 12:52:38 +01:00
Joonas Koivunen
48f156b8a2 feat: relative last activity based eviction (#6136)
Adds a new disk usage based eviction option, EvictionOrder, which
selects whether to use the current `AbsoluteAccessed` or this new
proposed but not yet tested `RelativeAccessed`. Additionally a fudge
factor was noticed while implementing this, which might help sparing
smaller tenants at the expense of targeting larger tenants.

Cc: #5304

Co-authored-by: Arpad Müller <arpad@neon.tech>
2023-12-20 18:44:19 +00:00
John Spray
f260f1565e pageserver: fixes + test updates for sharding (#6186)
This is a precursor to:
- https://github.com/neondatabase/neon/pull/6185

While that PR contains big changes to neon_local and attachment_service,
this PR contains a few unrelated standalone changes generated while
working on that branch:
- Fix restarting a pageserver when it contains multiple shards for the
same tenant
- When using location_config api to attach a tenant, create its
timelines dir
- Update test paths where generations were previously optional to make
them always-on: this avoids tests having to spuriously assert that
attachment_service is not None in order to make the linter happy.
- Add a TenantShardId python implementation for subsequent use in test
helpers that will be made shard-aware
- Teach scrubber to read across shards when checking for layer
existence: this is a refactor to track the list of existent layers at
tenant-level rather than locally to each timeline. This is a precursor
to testing shard splitting.
2023-12-20 12:26:20 +00:00
Joonas Koivunen
c29df80634 fix(layer): move backoff to spawned task (#5746)
Move the backoff to spawned task as it can still be useful; make the
sleep cancellable.
2023-12-20 10:26:06 +02:00
Christian Schwarz
82809d2ec2 fix metric pageserver_initial_logical_size_start_calculation (#6191)
It wasn't being incremented.

Fixup of

    commit 1c88824ed0
    Author: Christian Schwarz <christian@neon.tech>
    Date:   Fri Dec 1 12:52:59 2023 +0100

        initial logical size calculation: add a bunch of metrics (#5995)
2023-12-19 17:44:49 +01:00
Christian Schwarz
e6bf6952b8 higher resolution histograms for getpage@lsn (#6177)
part of https://github.com/neondatabase/cloud/issues/7811
2023-12-19 14:46:17 +01:00
John Spray
d89af4cf8e pageserver: downgrade 'connection reset' WAL errors (#6181)
This squashes a particularly noisy warn-level log that occurs when
safekeepers are restarted.

Unfortunately the error type from `tonic` doesn't provide a neat way of
matching this, so we use a string comparison
2023-12-19 10:38:00 +00:00
Christian Schwarz
6ffbbb2e02 include timeline ids in tenant details response (#6166)
Part of getpage@lsn benchmark epic:
https://github.com/neondatabase/neon/issues/5771

This allows getting the list of tenants and timelines without triggering
initial logical size calculation by requesting the timeline details API
response, which would skew our results.
2023-12-19 10:32:51 +00:00
Arpad Müller
a89d6dc76e Always send a json response for timeline_get_lsn_by_timestamp (#6178)
As part of the transition laid out in
[this](https://github.com/neondatabase/cloud/pull/7553#discussion_r1370473911)
comment, don't read the `version` query parameter in
`timeline_get_lsn_by_timestamp`, but always return the structured json
response.

Follow-up of https://github.com/neondatabase/neon/pull/5608
2023-12-19 11:29:16 +01:00
John Khvatov
33cb9a68f7 pageserver: Reduce tracing overhead in timeline::get (#6115)
## Problem

Compaction process (specifically the image layer reconstructions part)
is lagging behind wal ingest (at speed ~10-15MB/s) for medium-sized
tenants (30-50GB). CPU profile shows that significant amount of time
(see flamegraph) is being spent in `tracing::span::Span::new`.

mainline (commit: 0ba4cae491):

![reconstruct-mainline-0ba4cae491c2](https://github.com/neondatabase/neon/assets/289788/ebfd262e-5c97-4858-80c7-664a1dbcc59d)

## Summary of changes

By lowering the tracing level in get_value_reconstruct_data and
get_or_maybe_download from info to debug, we can reduce the overhead of
span creation in prod environments. On my system, this sped up the image
reconstruction process by 60% (from 14500 to 23160 page reconstruction
per sec)

pr:

![reconstruct-opt-2](https://github.com/neondatabase/neon/assets/289788/563a159b-8f2f-4300-b0a1-6cd66e7df769)


`create_image_layers()` (it's 1 CPU bound here) mainline vs pr:

![image](https://github.com/neondatabase/neon/assets/289788/a981e3cb-6df9-4882-8a94-95e99c35aa83)
2023-12-18 13:33:23 +00:00
John Spray
dbdb1d21f2 pageserver: on-demand activation cleanups (#6157)
## Problem

#6112 added some logs and metrics: clean these up a bit:
- Avoid counting startup completions for tenants launched after startup
- exclude no-op cases from timing histograms 
- remove a rogue log messages
2023-12-18 10:29:19 +00:00
Christian Schwarz
47873470db pageserver: add method to dump keyspace in mgmt api client (#6145)
Part of getpage@lsn benchmark epic:
https://github.com/neondatabase/neon/issues/5771
2023-12-16 10:52:48 +00:00
John Spray
d066dad84b pageserver: prioritize activation of tenants with client requests (#6112)
## Problem

During startup, a client request might have to wait a long time while
the system is busy initializing all the attached tenants, even though
most of the attached tenants probably don't have any client requests to
service, and could wait a bit.

## Summary of changes

- Add a semaphore to limit how many Tenant::spawn()s may concurrently do
I/O to attach their tenant (i.e. read indices from remote storage, scan
local layer files, etc).
- Add Tenant::activate_now, a hook for kicking a tenant in its spawn()
method to skip waiting for the warmup semaphore
- For tenants that attached via warmup semaphore units, wait for logical
size calculation to complete before dropping the warmup units
- Set Tenant::activate_now in `get_active_tenant_with_timeout` (the page
service's path for getting a reference to a tenant).
- Wait for tenant activation in HTTP handlers for timeline creation and
deletion: like page service requests, these require an active tenant and
should prioritize activation if called.
2023-12-15 20:37:47 +00:00
John Spray
56f7d55ba7 pageserver: basic cancel/timeout for remote storage operations (#6097)
## Problem

Various places in remote storage were not subject to a timeout (thereby
stuck TCP connections could hold things up), and did not respect a
cancellation token (so things like timeline deletion or tenant detach
would have to wait arbitrarily long).



## Summary of changes

- Add download_cancellable and upload_cancellable helpers, and use them
in all the places we wait for remote storage operations (with the
exception of initdb downloads, where it would not have been safe).
- Add a cancellation token arg to `download_retry`.
- Use cancellation token args in various places that were missing one
per #5066

Closes: #5066 

Why is this only "basic" handling?
- Doesn't express difference between shutdown and errors in return
types, to avoid refactoring all the places that use an anyhow::Error
(these should all eventually return a more structured error type)
- Implements timeouts on top of remote storage, rather than within it:
this means that operations hitting their timeout will lose their
semaphore permit and thereby go to the back of the queue for their
retry.
- Doing a nicer job is tracked in
https://github.com/neondatabase/neon/issues/6096
2023-12-15 17:43:02 +00:00
Arpad Müller
215cdd18c4 Make initdb upload retries cancellable and seek to beginning (#6147)
* initdb uploads had no cancellation token, which means that when we
were stuck in upload retries, we wouldn't be able to delete the
timeline. in general, the combination of retrying forever and not having
cancellation tokens is quite dangerous.
* initdb uploads wouldn't rewind the file. this wasn't discovered in the
purposefully unreliable test-s3 in pytest because those fail on the
first byte always, not somewhere during the connection. we'd be getting
errors from the AWS sdk that the file was at an unexpected end.

slack thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1702632247784079
2023-12-15 12:11:25 +00:00
Joonas Koivunen
0fd80484a9 fix: Timeline deletion during busy startup (#6133)
Compaction was holding back timeline deletion because the compaction
lock had been acquired, but the semaphore was waited on. Timeline
deletion was waiting on the same lock for 1500s.

This replaces the
`pageserver::tenant::tasks::concurrent_background_tasks_rate_limit`
(which looks correct) with a simpler `..._permit` which is just an
infallible acquire, which is easier to spot "aah this needs to be raced
with cancellation tokens".

Ref: https://neondb.slack.com/archives/C03F5SM1N02/p1702496912904719
Ref: https://neondb.slack.com/archives/C03F5SM1N02/p1702578093497779
2023-12-15 11:59:24 +00:00
Joonas Koivunen
07508fb110 fix: better Json parsing errors (#6135)
Before any json parsing from the http api only returned errors were per
field errors. Now they are done using `serde_path_to_error`, which at
least helped greatly with the `disk_usage_eviction_run` used for
testing. I don't think this can conflict with anything added in #5310.
2023-12-15 12:18:22 +02:00
John Spray
f1cd1a2122 pageserver: improved handling of concurrent timeline creations on the same ID (#6139)
## Problem

Historically, the pageserver used an "uninit mark" file on disk for two
purposes:
- Track which timeline dirs are incomplete for handling on restart
- Avoid trying to create the same timeline twice at the same time.

The original purpose of handling restarts is now defunct, as we use
remote storage as the source of truth and clean up any trash timeline
dirs on startup. Using the file to mutually exclude creation operations
is error prone compared with just doing it in memory, and the existing
checks happened some way into the creation operation, and could expose
errors as 500s (anyhow::Errors) rather than something clean.

## Summary of changes

- Creations are now mutually excluded in memory (using
`Tenant::timelines_creating`), rather than relying on a file on disk for
coordination.
- Acquiring unique access to the timeline ID now happens earlier in the
request.
- Creating the same timeline which already exists is now a 201: this
simplifies retry handling for clients.
- 409 is still returned if a timeline with the same ID is still being
created: if this happens it is probably because the client timed out an
earlier request and has retried.
- Colliding timeline creation requests should no longer return 500
errors

This paves the way to entirely removing uninit markers in a subsequent
change.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-12-15 08:51:23 +00:00
Joonas Koivunen
f010479107 feat(layer): pageserver_layer_redownloaded_after histogram (#6132)
this is aimed at replacing the current mtime only based trashing
alerting later.

Cc: #5331
2023-12-14 21:32:54 +02:00
Conrad Ludgate
cc633585dc gauge guards (#6138)
## Problem

The websockets gauge for active db connections seems to be growing more
than the gauge for client connections over websockets, which does not
make sense.

## Summary of changes

refactor how our counter-pair gauges are represented. not sure if this
will improve the problem, but it should be harder to mess-up the
counters. The API is much nicer though now and doesn't require
scopeguard::defer hacks
2023-12-14 17:21:39 +00:00
John Spray
c4e0ef507f pageserver: heatmap uploads (#6050)
Dependency (commits inline):
https://github.com/neondatabase/neon/pull/5842

## Problem

Secondary mode tenants need a manifest of what to download. Ultimately
this will be some kind of heat-scored set of layers, but as a robust
first step we will simply use the set of resident layers: secondary
tenant locations will aim to match the on-disk content of the attached
location.

## Summary of changes

- Add heatmap types representing the remote structure
- Add hooks to Tenant/Timeline for generating these heatmaps
- Create a new `HeatmapUploader` type that is external to `Tenant`, and
responsible for walking the list of attached tenants and scheduling
heatmap uploads.

Notes to reviewers:
- Putting the logic for uploads (and later, secondary mode downloads)
outside of `Tenant` is an opinionated choice, motivated by:
- Enable future smarter scheduling of operations, e.g. uploading the
stalest tenant first, rather than having all tenants compete for a fair
semaphore on a first-come-first-served basis. Similarly for downloads,
we may wish to schedule the tenants with the hottest un-downloaded
layers first.
- Enable accessing upload-related state without synchronization (it
belongs to HeatmapUploader, rather than being some Mutex<>'d part of
Tenant)
- Avoid further expanding the scope of Tenant/Timeline types, which are
already among the largest in the codebase
- You might reasonably wonder how much of the uploader code could be a
generic job manager thing. Probably some of it: but let's defer pulling
that out until we have at least two users (perhaps secondary downloads
will be the second one) to highlight which bits are really generic.

Compromises:
- Later, instead of using digests of heatmaps to decide whether anything
changed, I would prefer to avoid walking the layers in tenants that
don't have changes: tracking that will be a bit invasive, as it needs
input from both remote_timeline_client and Layer.
2023-12-14 13:09:24 +00:00
Joonas Koivunen
a919b863d1 refactor: remove eviction batching (#6060)
We no longer have `layer_removal_cs` since #5108, we no longer need
batching.
2023-12-13 18:05:33 +02:00
Joonas Koivunen
2d22661061 refactor: calculate_synthetic_size_worker, remove PRE::NeedsDownload (#6111)
Changes I wanted to make on #6106 but decided to leave out to keep that
commit clean for including in the #6090. Finally remove
`PageReconstructionError::NeedsDownload`.
2023-12-13 14:23:19 +00:00
Konstantin Knizhnik
aec1acdbac Do not inherite replication slots in branch (#5898)
## Problem
 
See 
https://github.com/neondatabase/company_projects/issues/111
https://neondb.slack.com/archives/C03H1K0PGKH/p1700166126954079


## Summary of changes

Do not search for AUX_FILES_KEY in parent timelines

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
2023-12-12 14:24:21 +02:00
Konstantin Knizhnik
8bb4a13192 Do not materialize null images in PS (#5979)
## Problem

PG16 is writing null images during relation extension.
And page server implements optimisation which replace WAL record with
FPI with page image.
So instead of WAL record ~30 bytes we store 8kb null page image.
Ans this image is almost useless, because most likely it will be shortly
rewritten with actual page content.

## Summary of changes

Do not materialize wal records with null page FPI.
 
## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2023-12-12 14:23:45 +02:00
John Spray
fead836f26 swagger: remove 'format: hex' from tenant IDs (#6099)
## Problem

TenantId is changing to TenantShardId in many APIs. The swagger had
`format: hex` attributes on some of these IDs. That isn't formally
defined anywhere, but a reasonable person might think it means "hex
digits only", which will no longer be the case once we start using
shard-aware IDs (they're like `<tenant_id>-0001`).



## Summary of changes

- Remove these `format` attributes from all `tenant_id` fields in the
swagger definition
2023-12-12 10:39:34 +00:00
John Spray
20e9cf7d31 pageserver: tweaks to slow/hung task logging (#6098)
## Problem

- `shutdown_tasks` would log when a particular task was taking a long
time to shut down, but not when it eventually completed. That left one
uncertain as to whether the slow task was the source of a hang, or just
a precursor.

## Summary of changes

- Add a log line after a slow task shutdown
- Add an equivalent in Gate's `warn_if_stuck`, in case we ever need it.
This isn't related to the original issue but was noticed when checking
through these logging paths.
2023-12-12 07:19:59 +00:00
Joonas Koivunen
3b04f3a749 fix: accidential return Ok (#6106)
Error indicating request cancellation OR timeline shutdown was deemed as
a reason to exit the background worker that calculated synthetic size.
Fix it to only be considered for avoiding logging such of such errors.
2023-12-11 21:27:53 +00:00
Arpad Müller
c49fd69bd6 Add initdb_lsn to TimelineInfo (#6104)
This way, we can query it.

Background: I want to do statistics for how reproducible `initdb_lsn`
really is, see https://github.com/neondatabase/cloud/issues/8284 and
https://neondb.slack.com/archives/C036U0GRMRB/p1701895218280269
2023-12-11 21:08:14 +00:00
John Spray
f1fc1fd639 pageserver: further refactoring from TenantId to TenantShardId (#6059)
## Problem

In https://github.com/neondatabase/neon/pull/5957, the most essential
types were updated to use TenantShardId rather than TenantId. That
unblocked other work, but didn't fully enable running multiple shards
from the same tenant on the same pageserver.

## Summary of changes

- Use TenantShardId in page cache key for materialized pages
- Update mgr.rs get_tenant() and list_tenants() functions to use a shard
id, and update all callers.
- Eliminate the exactly_one_or_none helper in mgr.rs and all code that
used it
- Convert timeline HTTP routes to use tenant_shard_id

Note on page cache:
```
struct MaterializedPageHashKey {
    /// Why is this TenantShardId rather than TenantId?
    ///
    /// Usually, the materialized value of a page@lsn is identical on any shard in the same tenant.  However, this
    /// this not the case for certain internally-generated pages (e.g. relation sizes).  In future, we may make this
    /// key smaller by omitting the shard, if we ensure that reads to such pages always skip the cache, or are
    /// special-cased in some other way.
    tenant_shard_id: TenantShardId,
    timeline_id: TimelineId,
    key: Key,
}
```
2023-12-11 15:52:33 +00:00
Christian Schwarz
cf024de202 virtual_file metrics: expose max size of the fd cache (#6078)
And also leave a comment on how to determine current size.

Kind of follow-up to #6066

refs https://github.com/neondatabase/cloud/issues/8351
refs https://github.com/neondatabase/neon/issues/5479
2023-12-08 17:23:50 +00:00
Conrad Ludgate
e1a564ace2 proxy simplify cancellation (#5916)
## Problem

The cancellation code was confusing and error prone (as seen before in
our memory leaks).

## Summary of changes

* Use the new `TaskTracker` primitve instead of JoinSet to gracefully
wait for tasks to shutdown.
* Updated libs/utils/completion to use `TaskTracker`
* Remove `tokio::select` in favour of `futures::future::select` in a
specialised `run_until_cancelled()` helper function
2023-12-08 16:21:17 +00:00
Christian Schwarz
f5b9af6ac7 page cache: improve eviction-related metrics (#6077)
These changes help with identifying thrashing.

The existing `pageserver_page_cache_find_victim_iters_total` is already
useful, but, it doesn't tell us how many individual find_victim() calls
are happening, only how many clock-LRU steps happened in the entire
system,
without info about whether we needed to actually evict other data vs
just scan for a long time, e.g., because the cache is large.

The changes in this PR allows us to
1. count each possible outcome separately, esp evictions
2. compute mean iterations/outcome

I don't think anyone except me was paying close attention to
`pageserver_page_cache_find_victim_iters_total` before, so,
I think the slight behavior change of also counting iterations
for the 'iters exceeded' case is fine.

refs https://github.com/neondatabase/cloud/issues/8351
refs https://github.com/neondatabase/neon/issues/5479
2023-12-08 15:27:21 +00:00
John Spray
2c544343e0 pageserver: filtered WAL ingest for sharding (#6024)
## Problem

Currently, if one creates many shards they will all ingest all the data:
not much use! We want them to ingest a proportional share of the data
each.

Closes: #6025

## Summary of changes

- WalIngest object gets a copy of the ShardIdentity for the Tenant it
was created by.
- While iterating the `blocks` part of a decoded record, blocks that do
not match the current shard are ignored, apart from on shard zero where
they are used to update relation sizes in `observe_decoded_block` (but
not stored).
- Before committing a `DataDirModificiation` from a WAL record, we check
if it's empty, and drop the record if so. This check is necessary
(rather than just looking at the `blocks` part) because certain record
types may modify blocks in non-obvious ways (e.g.
`ingest_heapam_record`).
- Add WAL ingest metrics to record the total received, total committed,
and total filtered out
- Behaviour for unsharded tenants is unchanged: they will continue to
ingest all blocks, and will take the fast path through `is_key_local`
that doesn't bother calculating any hashes.

After this change, shards store a subset of the tenant's total data, and
accurate relation sizes are only maintained on shard zero.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2023-12-08 10:12:37 +00:00