Commit Graph

5893 Commits

Author SHA1 Message Date
Christian Schwarz
ef57e73fbf task_mgr::spawn: require a TenantId (#8462)
… to dis-incentivize global tasks via task_mgr in the future

(As of https://github.com/neondatabase/neon/pull/8339 all remaining
task_mgr usage is tenant or timeline scoped.)
2024-08-20 08:26:44 +00:00
Arseny Sher
4c5a0fdc75 safekeeper: check for non-consecutive writes in safekeeper.rs
wal_storage.rs already checks this, but since this is a quite legit scenario
check it at safekeeper.rs (consensus level) as well.

ref https://github.com/neondatabase/neon/issues/8212
2024-08-20 07:12:56 +03:00
Arpad Müller
4b26783c94 scrubber: remove _generic postfix and two unused functions (#8761)
Removes the `_generic` postfix from the `GenericRemoteStorage` using
APIs, as `remote_storage` is the "default" now, and add a `_s3` postfix
to the remaining APIs using the S3 SDK (only in tenant snapshot). Also,
remove two unused functions: `list_objects_with_retries` and
`stream_tenants functions`.

Part of https://github.com/neondatabase/neon/issues/7547
2024-08-19 23:58:47 +02:00
Arpad Müller
6949b45e17 Update aws -> infra for repo rename (#8755)
See slack thread:
https://neondb.slack.com/archives/C039YKBRZB4/p1722501766006179
2024-08-19 17:44:10 +02:00
Arpad Müller
3b8ca477ab Migrate physical GC and scan_metadata to remote_storage (#8673)
Migrates most of the remaining parts of the scrubber to remote_storage:

* `pageserver_physical_gc`
* `scan_metadata` for pageservers (safekeepers were done in #8595)
* `download()` in `tenant_snapshot`. The main `tenant_snapshot` is not
migrated as it uses version history to be able to work in the face of
ongoing changes.
 
Part of #7547
2024-08-19 16:39:44 +02:00
Christian Schwarz
eb7241c798 l0_flush: remove support for mode page-cached (#8739)
It's been rolled out everywhere, no configs are referencing it.

All code that's made dead by the removal of the config option is removed
as part of this PR.

The `page_caching::PreWarmingWriter` in `::No` mode is equivalent to a
`size_tracking_writer`, so, use that.

part of https://github.com/neondatabase/neon/issues/7418
2024-08-19 16:35:34 +02:00
Folke Behrens
f246aa3ca7 proxy: Fix some warnings by extended clippy checks (#8748)
* Missing blank lifetimes which is now deprecated.
* Matching off unqualified enum variants that could act like variable.
* Missing semicolons.
2024-08-19 10:33:46 +02:00
Arpad Müller
188bde7f07 Default image compression to zstd at level 1 (#8677)
After the rollout has succeeded, we now set the default image
compression to be enabled.

We also remove its explicit mention from `neon_fixtures.py` added in
#8368 as it is now the default (and we switch to `zstd(1)` which is a
bit nicer on CPU time).

Part of https://github.com/neondatabase/neon/issues/5431
2024-08-18 18:32:10 +01:00
Yuchen Liang
7131ac4730 refactor(scrubber): add unified command suitable for cron job (#8635)
Part of #8128.

## Description

This PR creates a unified command to run both physical gc and metadata
health check as a cron job. This also enables us to add additional tasks
to the cron job in the future.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-08-17 16:48:53 +01:00
Konstantin Knizhnik
2be69af6c3 Track holes to be able to reuse them once LFC limit is increased (#8575)
## Problem

Multiple increase/decrease LFC limit may cause unlimited growth of LFC
file because punched holes while LFC shrinking are not reused when LFC
is extended.

## Summary of changes

Keep track of holes and reused them when LFC size is increased.

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-08-16 22:19:44 +03:00
Sasha Krassovsky
c6b6b7700a Fix superuser check in test_snap_files (#8749)
## Problem
Current superuser check always passes because it returns a tuple like
`(False,)`, and then the `if not superuser` passes.

## Summary of changes
Fixes the issue by unwrapping the tuple. Verified that it works against
a project where I don't have superuser.
2024-08-16 19:13:18 +01:00
John Spray
e2d89f7991 pageserver: prioritize secondary downloads to get most recent layers first, except l0s (#8729)
## Problem

When a secondary location is trying to catch up while a tenant is
receiving new writes, it can become quite wasteful:
- Downloading L0s which are soon destroyed by compaction to L1s
- Downloading older layer files which are soon made irrelevant when
covered by image layers.

## Summary of changes

Sort the layer files in the heatmap:
- L0 layers are the lowest priority
- Other layers are sorted to download the highest LSNs first.
2024-08-16 14:35:02 +02:00
Arseny Sher
25e7d321f4 safekeeper: cross check divergence point in ProposerElected handling.
Previously, we protected from multiple ProposerElected messages from the same
walproposer with the following condition:

msg.term == self.get_last_log_term() && self.flush_lsn() >
msg.start_streaming_at

It is not exhaustive, i.e. we could still proceed to truncating WAL even though
safekeeper inserted something since the divergence point has been
calculated. While it was most likely safe because walproposer can't use
safekeeper position to commit WAL until last_log_term reaches the current
walproposer term, let's be more careful and properly calculate the divergence
point like walproposer does.
2024-08-16 15:22:46 +03:00
Vlad Lazar
3f91ea28d9 tests: add infra and test for storcon leadership transfer (#8587)
## Problem
https://github.com/neondatabase/neon/pull/8588 implemented the mechanism
for storage controller
leadership transfers. However, there's no tests that exercise the
behaviour.

## Summary of changes
1. Teach `neon_local` how to handle multiple storage controller
instances. Each storage controller
instance gets its own subdirectory (`storage_controller_1, ...`).
`storage_controller start|stop` subcommands
have also been extended to optionally accept an instance id.
2. Add a storage controller proxy test fixture. It's a basic HTTP server
that forwards requests from pageserver
and test env to the currently configured storage controller.
3. Add a test which exercises storage controller leadership transfer.
4. Finally fix a couple bugs that the test surfaced
2024-08-16 13:05:04 +01:00
Heikki Linnakangas
7fdc3ea162 Add retroactive RFC about physical replication (#8546)
We've had physical replication support for a long time, but we never
created an RFC for the feature. This RFC does that after the fact. Even
though we've already implemented the feature, let's have a design
discussion as if it hadn't done that. It can still be quite insightful.

This is written from a pretty compute-centric viewpoint, not much
on how it works in the control plane.
2024-08-16 11:30:53 +01:00
Joonas Koivunen
4763a960d1 chore: log if we have an open layer or any frozen on shutdown (#8740)
Some benchmarks are failing with a "long" flushing, which might be
because there is a queue of in-memory layers, or something else. Add
logging to narrow it down.

Private slack DM ref:
https://neondb.slack.com/archives/D049K7HJ9JM/p1723727305238099
2024-08-16 06:10:05 +01:00
Sasha Krassovsky
df086cd139 Add logical replication test to exercise snapfiles (#8364) 2024-08-15 15:34:45 -07:00
Alexander Bayandin
69cb1ee479 CI(replication-tests): store test results & change notification channel (#8687)
## Problem

We want to store Nightly Replication test results in the database and
notify the relevant Slack channel about failures

## Summary of changes
- Store test results in the database
- Notify `on-call-compute-staging-stream` about failures
2024-08-15 22:41:58 +01:00
Alexander Bayandin
4e58fd9321 CI(label-for-external-users): use CI_ACCESS_TOKEN (#8738)
## Problem

`secrets.GITHUB_TOKEN` (with any permissions) is not enough to get 
a user's membership info if they decide to hide it.

## Summary of changes
- Use `secrets.CI_ACCESS_TOKEN` for `gh api` call
- Use `pull_request_target` instead of `pull_request` event to get
access to secrets
2024-08-15 18:37:15 +01:00
Konstantin Knizhnik
f087423a01 Handle reload config file request in LR monitor (#8732)
## Problem

Logical replication BGW checking replication lag is not reloading config

## Summary of changes

Add handling of reload config request

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-08-15 16:28:25 +03:00
Joonas Koivunen
24d347f50b storcon: use tracing for logging panics (#8734)
this gives spans for panics, and does not globber loki output by writing
to stderr while all of the other logging is to stdout.

See: #3475
2024-08-15 16:27:07 +03:00
Joonas Koivunen
52641eb853 storcon: add spans to drain/fill ops (#8735)
this way we do not need to repeat the %node_id everywhere, and we get no
stray messages in logs from within the op.
2024-08-15 15:30:04 +03:00
Joonas Koivunen
d9a57aeed9 storcon: deny external node configuration if an operation is ongoing (#8727)
Per #8674, disallow node configuration while drain/fill are ongoing.
Implement it by adding a only-http wrapper
`Service::external_node_configure` which checks for operation existing
before configuring.

Additionally:
- allow cancelling drain/fill after a pageserver has restarted and
transitioned to WarmingUp

Fixes: #8674
2024-08-15 10:54:05 +01:00
Alexander Bayandin
a9c28be7d0 fix(pageserver): allow unused_imports in download.rs on macOS (#8733)
## Problem

On macOS, clippy fails with the following error:

```
error: unused import: `crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt`
  --> pageserver/src/tenant/remote_timeline_client/download.rs:26:5
   |
26 | use crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt;
   |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
   = note: `-D unused-imports` implied by `-D warnings`
   = help: to override `-D warnings` add `#[allow(unused_imports)]`
```

Introduced in https://github.com/neondatabase/neon/pull/8717

## Summary of changes
- allow `unused_imports` for
`crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt` on macOS
in download.rs
2024-08-15 10:06:28 +01:00
Vlad Lazar
fef77b0cc9 safekeeper: consider partial uploads when pulling timeline (#8628)
## Problem
The control file contains the id of the safekeeper that uploaded it.
Previously, when sending a snapshot of the control file to another sk,
it would eventually be gc-ed by the receiving sk. This is incorrect
because the original sk might still need it later.

## Summary of Changes
When sending a snapshot and the control file contains an uploaded
segment:
* Create a copy of the segment in s3 with the destination sk in the
  object name
* Tweak the streamed control file to point to the object create in the
  previous step

Note that the snapshot endpoint now has to know the id of the requestor,
so the api has been extended to include the node if of the destination
sk.

Closes https://github.com/neondatabase/neon/issues/8542
2024-08-15 09:02:33 +01:00
Christian Schwarz
168913bdf0 refactor(write path): newtype to enforce use of fully initialized slices (#8717)
The `tokio_epoll_uring::Slice` / `tokio_uring::Slice` type is weird.
The new `FullSlice` newtype is better. See the doc comment for details.

The naming is not ideal, but we'll clean that up in a future refactoring
where we move the `FullSlice` into `tokio_epoll_uring`. Then, we'll do
the following:
* tokio_epoll_uring::Slice is removed
* `FullSlice` becomes `tokio_epoll_uring::IoBufView`
* new type `tokio_epoll_uring::IoBufMutView` for the current
`tokio_epoll_uring::Slice<IoBufMut>`

Context
-------

I did this work in preparation for
https://github.com/neondatabase/neon/pull/8537.
There, I'm changing the type that the `inmemory_layer.rs` passes to
`DeltaLayerWriter::put_value_bytes` and thus it seemed like a good
opportunity to make this cleanup first.
2024-08-14 21:57:17 +02:00
Alexander Bayandin
aa2e16f307 CI: misc cleanup & fixes (#8559)
## Problem
A bunch of small fixes and improvements for CI, that are too small to
have a separate PR for them

## Summary of changes
- CI(build-and-test): fix parenthesis
- CI(actionlint): fix path to workflow file
- CI: remove default args from actions/checkout
- CI: remove `gen3` label, using a couple `self-hosted` +
`small{,-arm64}`/`large{,-arm64}` is enough
- CI: prettify Slack messages, hide links behind text messages
- C(build-and-test): add more dependencies to `conclusion` job
2024-08-14 17:56:59 +01:00
Alexander Bayandin
70b18ff481 CI(neon-image): add ARM-specific RUSTFLAGS (#8566)
## Problem

It's recommended that a couple of additional RUSTFLAGS be set up to
improve the performance of Rust applications on AWS Graviton.

See
57dc813626/rust.md

Note: Apple Silicon is compatible with neoverse-n1:
```
$ clang --version
Apple clang version 15.0.0 (clang-1500.3.9.4)
Target: arm64-apple-darwin23.6.0
Thread model: posix
InstalledDir: /Applications/Xcode_15.4.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
$
$ clang --print-supported-cpus 2>&1 | grep neoverse-
	neoverse-512tvb
	neoverse-e1
	neoverse-n1
	neoverse-n2
	neoverse-v1
	neoverse-v2
```

## Summary of changes
- Add `-Ctarget-feature=+lse -Ctarget-cpu=neoverse-n1` to RUSTFLAGS for
ARM images
2024-08-14 17:03:21 +01:00
Joonas Koivunen
60fc1e8cc8 chore: even more responsive compaction cancellation (#8725)
Some benchmarks and tests might still fail because of #8655 (tracked in
#8708) because we are not fast enough to shut down ([one evidence]).
Partially this is explained by the current validation mode of streaming
k-merge, but otherwise because that is where we use a lot of time in
compaction. Outside of L0 => L1 compaction, the image layer generation
is already guarded by vectored reads doing cancellation checks.

32768 is a wild guess based on looking how many keys we put in each
layer in a bench (1-2 million), but I assume it will be good enough
divisor. Doing checks more often will start showing up as contention
which we cannot currently measure. Doing checks less often might be
reasonable.

[one evidence]:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/10384136483/index.html#suites/9681106e61a1222669b9d22ab136d07b/96e6d53af234924/

Earlier PR: #8706.
2024-08-14 14:48:15 +01:00
Alexander Bayandin
36c1719a07 CI(build-neon): fix accidental neon rebuild on cargo test (#8721)
## Problem

During `Run rust tests` step (for debug builds), we accidentally rebuild
neon twice (by `cargo test --doc` and by `cargo nextest run`).
It happens because we don't set `cov_prefix` for the `cargo test --doc`
command, which triggers rebuilding with different build flags, and one
more rebuild by `cargo nextest run`.

## Summary of changes
- Set `cov_prefix` for `cargo test --doc` to prevent unneeded rebuilds
2024-08-14 13:38:25 +01:00
John Spray
abb53ba36d storcon_cli: don't clobber heatmap interval when setting eviction (#8722)
## Problem

This command is kind of a hack, used when we're migrating large tenants
and want to get their resident size down. It sets the tenant config to a
fixed value, which omitted heatmap_period, so caused secondaries to get
out of date.

## Summary of changes

- Set heatmap period to the same 300s default that we use elsewhere when
updating eviction settings

This is not as elegant as some general purpose partial modification of
the config, but it practically makes the command safer to use.
2024-08-14 13:37:03 +01:00
Conrad Ludgate
a7028d92b7 proxy: start of jwk cache (#8690)
basic JWT implementation that caches JWKs and verifies signatures.

this code is currently not reachable from proxy, I just wanted to get
something merged in.
2024-08-14 13:35:29 +01:00
Joonas Koivunen
6c9e3c9551 refactor: error/anyhow::Error wrapping (#8697)
We can get CompactionError::Other(Cancelled) via the error handling with
a few ways.
[evidence](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8655/10301613380/index.html#suites/cae012a1e6acdd9fdd8b81541972b6ce/653a33de17802bb1/).
Hopefully fix it by:

1. replace the `map_err` which hid the
`GetReadyAncestorError::Cancelled` with `From<GetReadyAncestorError> for
GetVectoredError` conversion
2. simplifying the code in pgdatadir_mapping to eliminate the token
anyhow wrapping for deserialization errors
3. stop wrapping GetVectoredError as anyhow errors
4. stop wrapping PageReconstructError as anyhow errors

Additionally, produce warnings if we treat any other error (as was legal
before this PR) as missing key.

Cc: #8708.
2024-08-14 12:45:56 +01:00
Alexander Bayandin
fc3d372f3a CI(label-for-external-users): check membership using GitHub API (#8724)
## Problem

`author_association` doesn't properly work if a GitHub user decides not
to show affiliation with the org in their profile (i.e. if it's private)

## Summary of changes
- Call
`/orgs/ORG/members/USERNAME` API to check whether 
a PR/issue author is a member of the org
2024-08-14 12:27:52 +01:00
John Spray
19d69d515c pageserver: evict covered layers earlier (#8679)
## Problem

When pageservers do compaction, they frequently create image layers that
make earlier layers un-needed for reads, but then keep those earlier
layers around for 24 hours waiting for time-based eviction to expire
them.

Now that we track layer visibility, we can use it as an input to
eviction, and avoid the 24 hour "disk bump" that happens around
pageserver restarts.

## Summary of changes

- During time-based eviction, if a layer is marked Covered, use the
eviction period as the threshold: i.e. these layers get to remain
resident for at least one iteration of the eviction loop, but then get
evicted. With current settings this means they get evicted after 1h
instead of 24h.
- During disk usage eviction, prioritized evicting covered layers above
all other layers.


Caveats:
- Using the period as the threshold for time based eviction in this case
is a bit of a hack, but it avoids adding yet another configuration
property, and in any case the value of a new property would be somewhat
arbitrary: there's no "right" length of time to keep covered layers
around just in case.
- We had previously planned on removing time-based eviction: this change
would motivate us to keep it around, but we can still simplify the code
later to just do the eviction of covered layers, rather than applying a
TTL policy to all layers.
2024-08-14 12:10:15 +01:00
Joonas Koivunen
485d76ac62 timeline_detach_ancestor: adjust error handling (#8528)
With additional phases from #8430 the `detach_ancestor::Error` became
untenable. Split it up into phases, and introduce laundering for
remaining `anyhow::Error` to propagate them as most often
`Error::ShuttingDown`.

Additionally, complete FIXMEs.

Cc: #6994
2024-08-14 10:16:18 +01:00
John Spray
4049d2b7e1 scrubber: fix spurious "Missed some shards" errors (#8661)
## Problem

The storage scrubber was reporting warnings for lots of timelines like:
```
WARN Missed some shards at count ShardCount(0) tenant_id=25eb7a83d9a2f90ac0b765b6ca84cf4c
```

These were spurious: these tenants are fine. There was a bug in
accumulating the ShardIndex for each tenant, whereby multiple timelines
would lead us to add the same ShardIndex more than one.

Closes: #8646 

## Summary of changes

- Accumulate ShardIndex in a BTreeSet instead of a Vec
- Extend the test to reproduce the issue
2024-08-14 09:29:06 +01:00
Konstantin Knizhnik
7a1736ddcf Preserve HEAP_COMBOCID when restoring t_cid from WAL (#8503)
## Problem

See https://github.com/neondatabase/neon/issues/8499

## Summary of changes

Save HEAP_COMBOCID flag in WAL and do not clear it in redo handlers.

Related Postgres PRs:
https://github.com/neondatabase/postgres/pull/457
https://github.com/neondatabase/postgres/pull/458
https://github.com/neondatabase/postgres/pull/459


## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-08-14 08:13:20 +03:00
Tristan Partin
c624317b0e Decode the database name in SQL/HTTP connections
A url::Url does not hand you back a URL decoded value for path values,
so we must decode them ourselves.

Link: https://docs.rs/url/2.5.2/url/struct.Url.html#method.path
Link: https://docs.rs/url/2.5.2/url/struct.Url.html#method.path_segments
Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-08-13 16:32:58 -05:00
Tristan Partin
0f43b7c51b Loosen type on PgProtocol::safe_psql(queries:)
Using Iterable allows us to also use tuples, among other things.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-08-13 16:32:58 -05:00
Joonas Koivunen
6d6e2c6a39 feat(detach_ancestor): better retries with persistent gc blocking (#8430)
With the persistent gc blocking, we can now retry reparenting timelines
which had failed for whatever reason on the previous attempt(s).
Restructure the detach_ancestor into three phases:

- prepare (insert persistent gc blocking, copy lsn prefix, layers)
- detach and reparent
    - reparenting can fail, so we might need to retry this portion
- complete (remove persistent gc blocking)

Cc: #6994
2024-08-13 18:51:51 +01:00
Joonas Koivunen
87a5d7db9e test: do better job of shutting everything down (#8714)
After #8655 we've had a few issues (mostly tracked on #8708) with the
graceful shutdown. In order to shutdown more of the processes and catch
more errors, for example, from all pageservers, do an immediate shutdown
for those nodes which fail the initial (possibly graceful) shutdown.

Cc: #6485
2024-08-13 18:49:50 +01:00
Peter Bendel
9d2276323d Benchmarking tests: automatically restore Neon reuse databases, too and migrate to pg16 (#8707)
## Problem

We use a set of **Neon** reuse databases in benchmarking.yml which are
still using pg14.
Because we want to compare apples to apples and have migrated the AWS
reuse clusters to pg16 we should also use pg16 for Neon.

## Summary of changes

- Automatically restore the test databases for Neon project
2024-08-13 19:36:39 +02:00
Joonas Koivunen
ae6e27274c refactor(test): unify how we clear shared buffers (#8634)
so that we can easily plug in LFC clearing as well.

Private discussion reference:
<https://neondb.slack.com/archives/C033A2WE6BZ/p1722942856987979>
2024-08-13 20:14:42 +03:00
Joonas Koivunen
8f170c5105 fix: make compaction more sensitive to cancellation (#8706)
A few of the benchmarks have started failing after #8655 where they are
waiting for compactor task. Reads done by image layer creation should
already be cancellation sensitive because vectored get does a check each
time, but try sprinkling additional cancellation points to:

- each partition
- after each vectored read batch
2024-08-13 18:00:54 +01:00
Joonas Koivunen
e0946e334a bench: stop immediatedly in some benches (#8713)
It seems that some benchmarks are failing because they are simply not
stopping to ingest wal on shutdown. It might mean that the tests were
never ran on a stable pageserver situation and WAL has always been left
to be ingested on safekeepers, but let's see if this silences the
failures and "stops the bleeding".

Cc: https://github.com/neondatabase/neon/issues/8712
2024-08-13 17:07:51 +01:00
Alexander Bayandin
852a6a7a5a CI: mark PRs and issues create by external users (#8694)
## Problem

We want to mark new PRs and issues created by external users

## Summary of changes
- Add a new workflow which adds `external` label for issues and PRs
created by external users
2024-08-13 15:28:26 +01:00
John Spray
ecb01834d6 pageserver: implement utilization score (#8703)
## Problem

When the utilization API was added, it was just a stub with disk space
information.

Disk space information isn't a very good metric for assigning tenants to
pageservers, because pageservers making full use of their disks would
always just have 85% utilization, irrespective of how much pressure they
had for disk space.

## Summary of changes

- Use the new layer visibiilty metric to calculate a "wanted size" per
tenant, and sum these to get a total local disk space wanted per
pageserver. This acts as the primary signal for utilization.
- Also use the shard count to calculate a utilization score, and take
the max of this and the disk-driven utilization. The shard count limit
is currently set as a constant 20,000, which matches contemporary
operational practices when loading pageservers.

The shard count limit means that for tiny/empty tenants, on a machine
with 3.84TB disk, each tiny tenant influences the utilization score as
if it had size 160MB.
2024-08-13 15:15:55 +01:00
Konstantin Knizhnik
afb68b0e7e Report search_path to make it possible to use it in pgbouncer track_extra_parameters (#8303)
## Problem

When pooled connections are used, session semantic its not preserved,
including GUC settings.
Many customers have particular problem with setting search_path.
But pgbouncer 1.20 has `track_extra_parameters` settings which allows to
track parameters included in startup package which are reported by
Postgres. Postgres has [an official list of parameters that it reports
to the
client](https://www.postgresql.org/docs/15/protocol-flow.html#PROTOCOL-ASYNC).
This PR makes Postgres also report `search_path` and so allows to
include it in `track_extra_parameters`.



## Summary of changes

Set GUC_REPORT flag  for `search_path`.

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-08-13 15:07:24 +03:00
Vlad Lazar
b9d2c7bdd5 pageserver: remove vectored get related configs (#8695)
## Problem
Pageserver exposes some vectored get related configs which are not in
use.

## Summary of changes
Remove the following pageserver configs: get_impl, get_vectored_impl,
and `validate_get_vectored`.
They are not used in the pageserver since
https://github.com/neondatabase/neon/pull/8601.
Manual overrides have been removed from the aws repo in
https://github.com/neondatabase/aws/pull/1664.
2024-08-13 12:45:54 +01:00