Commit Graph

4301 Commits

Author SHA1 Message Date
John Spray
82ccd449d7 pageserver: fix active tenant lookup hitting secondaries with sharding 2023-12-22 13:58:00 +00:00
John Spray
d4c4ee6a14 f DNM demo 2023-12-22 13:40:22 +00:00
John Spray
758bb24445 DNM dirty hacks 2023-12-22 13:40:04 +00:00
John Spray
ed11fc0ec8 DNM: hack small buffer size into compute 2023-12-22 13:39:48 +00:00
John Spray
61044aa7f0 DNM demo script 2023-12-22 11:54:15 +00:00
John Spray
35de775ed5 pageserver: log details on shard routing error 2023-12-22 11:54:15 +00:00
John Spray
95505e5ac1 neon_local: add shard split command 2023-12-22 11:54:15 +00:00
John Spray
dda046bbcd pageserver: don't delete ancestor shard layers 2023-12-22 11:54:15 +00:00
John Spray
7b2019fdfe fixup! compute_tools: enable passing through stripe size 2023-12-22 11:54:15 +00:00
John Spray
b57848236f pgxn: fix stripe calculation 2023-12-22 11:54:15 +00:00
Konstantin Knizhnik
218a8a7461 Support sharding at compute side
refer #5508
2023-12-21 17:30:06 +00:00
John Spray
e745391c72 f neon_local reconfigure 2023-12-21 17:21:10 +00:00
John Spray
6e56f88b79 control_plane: improve debug of pageserver_connstr 2023-12-21 17:21:10 +00:00
John Spray
3ccf5abc9d control plane: improve handling of stripe size 2023-12-21 17:21:10 +00:00
John Spray
bc90272e47 tests: support initial stream size + migration 2023-12-21 17:21:10 +00:00
John Spray
aa1252d687 compute_tools: enable passing through stripe size 2023-12-21 17:21:10 +00:00
John Spray
727eef05b3 tests: make more fixtures/helpers shard-aware 2023-12-21 17:21:10 +00:00
John Spray
a5813e2516 tests: add test_sharding_smoke 2023-12-21 17:21:10 +00:00
John Spray
3c1d8e7239 tests: enable using timeout with CLIs 2023-12-21 17:21:10 +00:00
John Spray
3835a51429 control_plane: rebase fixes 2023-12-21 17:21:10 +00:00
John Spray
495c3d70f3 clippy 2023-12-21 17:19:01 +00:00
John Spray
d1af9d480e tests: enable s3 scrubber in pg_regress tests 2023-12-21 17:19:01 +00:00
John Spray
14b0acbda7 neon_local: improved timeline creation and 'branch' 2023-12-21 17:19:01 +00:00
John Spray
583375e6f6 tests: use sharding in test_pageserver_chaos 2023-12-21 17:19:01 +00:00
John Spray
1e542b3187 neon_local: always get endpoint pageserver from attachment service 2023-12-21 17:19:01 +00:00
John Spray
37db221a60 tests: enable sharding for tests in test_pg_regress.py 2023-12-21 17:19:01 +00:00
John Spray
9055985d72 tests: adapt helpers for sharding 2023-12-21 17:19:01 +00:00
John Spray
fc2f9fa3fe pageserver: implement shard splitting 2023-12-21 17:19:01 +00:00
John Spray
9cd72caabf neon_local: add tenant status command 2023-12-21 17:18:58 +00:00
John Spray
d1a0a0941a neon_local: add stripe size arg 2023-12-21 17:03:37 +00:00
John Spray
347bd012b3 neon_local: attachment service status, refactors 2023-12-21 17:03:37 +00:00
John Spray
58f64339f3 neon_local: implement Locate API for attachment service 2023-12-21 17:03:37 +00:00
John Spray
78e673fbb3 neon_local: use attachment service to locate pageservers for endpoints 2023-12-21 17:03:37 +00:00
John Spray
3d573be816 neon_local: use attachment service for tenant creation 2023-12-21 17:03:37 +00:00
John Spray
bdc4a7512b pageserver: refactor creation API (add ShardParams) 2023-12-21 17:03:37 +00:00
John Spray
24d0395f66 neon_local: update various TenantId uses to TenantShardId 2023-12-21 17:03:37 +00:00
John Spray
7bbfc160aa DNM: script for sharding demo 2023-12-21 17:03:14 +00:00
Arpad Müller
a21b719770 Use neon-github-ci-tests S3 bucket for remote_storage tests (#6216)
This bucket is already used by the pytests. The current bucket
github-public-dev is more meant for longer living artifacts.

slack thread:
https://neondb.slack.com/archives/C039YKBRZB4/p1703124944669009

Part of https://github.com/neondatabase/cloud/issues/8233 / #6155
2023-12-21 17:28:28 +01:00
Alexander Bayandin
1dff98be84 CI: fix build-tools image tag for PRs (#6217)
## Problem

Fix build-tools image tag calculation for PRs.
Broken in https://github.com/neondatabase/neon/pull/6195

## Summary of changes
- Use the `pinned` tag instead of `$GITHUB_RUN_ID` if there are no changes in
the dockerfile (and we don't build such an image)
2023-12-21 14:55:24 +00:00
Arpad Müller
7d6fc3c826 Use pre-generated initdb.tar.zst in test_ingest_real_wal (#6206)
This implements the TODO mentioned in the test added by #5892.
2023-12-21 14:23:09 +00:00
Abhijeet Patil
61b6c4cf30 Build dockerfile from neon repo (#6195)
## Fixing GitHub workflow issue related to build and push images

## Summary of changes
Follow-up of PR#608[move docker file from build repo to neon to solve
some issues

The build started failing because it missed a validation in logic that
determines changes in the docker file
Also, all the dependent jobs were skipped because of the build and push
of the image job.
To address the above issue, the following changes were made:

- we are adding validation to generate image tag even if it's a merge to
repo.
- All the dependent jobs won't skip even if the build and push image job
is skipped.
- We have moved the logic to generate a tag in the sub-workflow. As the
tag name was necessary to be passed to the sub-workflow it made sense to
abstract that away where it was needed and then store it as an output
variable so that downward dependent jobs could access the value.
- This made the dependency logic easy and we don't need complex
expressions to check the condition on which it will run
- An earlier PR was closed that tried solving a similar problem that has
some feedback and context before creating this PR
https://github.com/neondatabase/neon/pull/6175

## Checklist before requesting a review

- [x] Move the tag generation logic from the main workflow to the
sub-workflow of build and push the image
- [x] Add a condition to generate an image tag for a non-PR-related run 
- [x] Remove the complex condition from the job `if` conditions

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Abhijeet Patil <abhijeet@neon.tech>
2023-12-21 12:46:51 +00:00
Bodobolero
f93d15f781 add comment to run vacuum for clickbench (#6212)
## Problem

This is a comment only change.
To ensure that our benchmarking results are fair we need to have correct
stats in catalog. Otherwise optimizer chooses seq scan instead of index
only scan for some queries. Added comment to run vacuum after data prep.
2023-12-21 13:34:31 +01:00
Christian Schwarz
5385791ca6 add pageserver component-level benchmark (pagebench) (#6174)
This PR adds a component-level benchmarking utility for pageserver.
Its name is `pagebench`.

The problem solved by `pagebench` is that we want to put Pageserver
under high load.

This isn't easily achieved with `pgbench` because it needs to go through
a compute, which has significant performance overhead compared to
accessing Pageserver directly.

Further, compute has its own performance optimizations (most
importantly: caches). Instead of designing a compute-facing workload
that defeats those internal optimizations, `pagebench` simply bypasses
them by accessing pageserver directly.

Supported benchmarks:

* getpage@latest_lsn
* basebackup
* triggering logical size calculation

This code has no automated users yet.
A performance regression test for getpage@latest_lsn will be added in a
later PR.

part of https://github.com/neondatabase/neon/issues/5771
2023-12-21 13:07:23 +01:00
Conrad Ludgate
2df3602a4b Add GC to http connection pool (#6196)
## Problem

HTTP connection pool will grow without being pruned

## Summary of changes

Remove connection clients from pools once idle, or once they exit.
Periodically clear pool shards.

GC Logic:

Each shard contains a hashmap of `Arc<EndpointPool>`s.
Each connection stores a `Weak<EndpointPool>`.

During a GC sweep, we take a random shard write lock, and check whether
any of the `Arc<EndpointPool>`s are unique (using `Arc::get_mut`).
- If they are unique, then we check that the endpoint-pool is empty, and
sweep if it is.
- If they are not unique, then the endpoint-pool is in active use and we
don't sweep.
- Idle connections will self-clear from the endpoint-pool after 5
minutes.

Technically, the uniqueness of the endpoint-pool should be enough to
consider it empty, but the connection count check is done for
completeness' sake.
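
The sweep rule described above can be sketched roughly as follows. This is an illustration only, not the code from the PR: `EndpointPool` here is a hypothetical stand-in with a plain `idle_conns` list, `Shard` and `gc_shard` are made-up names, and the random shard selection and write lock are left out. The one detail taken from the standard library is that `Arc::get_mut` returns `Some` only when no other `Arc` or `Weak` handle exists, so a live connection holding a `Weak<EndpointPool>` keeps its pool from being swept.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for the per-endpoint pool; only the list of
/// idle connections matters for this sketch.
#[derive(Default)]
pub struct EndpointPool {
    pub idle_conns: Vec<u32>,
}

/// One pool shard: endpoint name -> shared pool handle (assumed layout).
pub type Shard = HashMap<String, Arc<EndpointPool>>;

/// Sweep a single shard and return how many entries were removed.
pub fn gc_shard(shard: &mut Shard) -> usize {
    let before = shard.len();
    shard.retain(|_endpoint, pool| match Arc::get_mut(pool) {
        // Unique handle (no other Arc or Weak): keep it only if it
        // still holds idle connections, otherwise sweep it.
        Some(unique) => !unique.idle_conns.is_empty(),
        // Some connection still references the pool: it is in active
        // use, so it is never swept.
        None => true,
    });
    before - shard.len()
}
```

In this sketch the emptiness check and the uniqueness check are both applied, mirroring the "for completeness" note in the message.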
2023-12-21 12:00:10 +00:00
Arpad Müller
48890d206e Simplify inject_index_part test function (#6207)
Instead of manually constructing the directory's path, we can just use
the `parent()` function.
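
For readers unfamiliar with it, `parent()` here presumably refers to `std::path::Path::parent`. A minimal sketch of the idea, where `containing_dir` is a hypothetical helper rather than the test function touched by the PR:

```rust
use std::path::{Path, PathBuf};

/// Return the directory containing `file`, or `None` when the path has
/// no parent (for example the filesystem root).
pub fn containing_dir(file: &Path) -> Option<PathBuf> {
    // `Path::parent` strips the final component, so the directory does
    // not need to be rebuilt from its individual parts.
    file.parent().map(Path::to_path_buf)
}
```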

This is a drive-by improvement from #6206
2023-12-21 12:52:38 +01:00
Arpad Müller
baa1323b4a Use ProfileFileCredentialsProvider for AWS SDK configuration (#6202)
Allows usage via `aws sso login --profile=<p>; AWS_PROFILE=<p>`. Now
there is no need to manually configure things any more via
`SSO_ACCOUNT_ID` and others. Now one can run the tests locally (given
Neon employee access to aws):

```
aws sso login --profile dev
export ENABLE_REAL_S3_REMOTE_STORAGE=nonempty REMOTE_STORAGE_S3_REGION=eu-central-1 REMOTE_STORAGE_S3_BUCKET=neon-github-public-dev AWS_PROFILE=dev
cargo test -p remote_storage -j 1 s3 -- --nocapture
```

Also makes the scrubber use the same region for auth that it does its
operations in (not touching the hard coded role name and start_url
values here, they are not ideal though).
2023-12-20 22:38:58 +00:00
Joonas Koivunen
48f156b8a2 feat: relative last activity based eviction (#6136)
Adds a new disk usage based eviction option, EvictionOrder, which
selects whether to use the current `AbsoluteAccessed` or this new
proposed but not yet tested `RelativeAccessed`. Additionally a fudge
factor was noticed while implementing this, which might help sparing
smaller tenants at the expense of targeting larger tenants.

Cc: #5304

Co-authored-by: Arpad Müller <arpad@neon.tech>
2023-12-20 18:44:19 +00:00
John Spray
ac38d3a88c remote_storage: don't count 404s as errors (#6201)
## Problem

Currently a chart of S3 error rate is misleading: it can show errors any
time we are attaching a tenant (probing for index_part generation,
checking for remote delete marker).

Considering 404 successful isn't perfectly elegant, but it enables the
error rate to be used as a more meaningful alert signal: it would
indicate if we were having auth issues, sending bad requests, getting
throttled, etc.

## Summary of changes

Track 404 requests in the AttemptOutcome::Ok bucket instead of the
AttemptOutcome::Err bucket.
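
A simplified sketch of this bucketing, assuming classification by HTTP status code alone. The `AttemptOutcome` variant names come from the message above; the `classify` function and the status-code matching are hypothetical and do not reflect how the real metrics code receives its results:

```rust
/// Outcome buckets, named after the ones mentioned in the commit message.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum AttemptOutcome {
    Ok,
    Err,
}

/// Hypothetical classifier: map an HTTP status code to a metrics bucket.
pub fn classify(status: u16) -> AttemptOutcome {
    match status {
        // Ordinary successes.
        200..=299 => AttemptOutcome::Ok,
        // "Not found" is an expected answer when probing for objects,
        // so it is counted as a success rather than an error.
        404 => AttemptOutcome::Ok,
        // Auth failures, bad requests, throttling and server errors
        // remain in the error bucket.
        _ => AttemptOutcome::Err,
    }
}
```

With this split, a rise in the error bucket points at things like 403, 429 or 5xx responses rather than routine existence probes.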
2023-12-20 17:00:29 +00:00
Arthur Petukhovsky
0f56104a61 Make sk_collect_dumps also possible with teleport (#4739)
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
2023-12-20 15:06:55 +00:00
John Spray
f260f1565e pageserver: fixes + test updates for sharding (#6186)
This is a precursor to:
- https://github.com/neondatabase/neon/pull/6185

While that PR contains big changes to neon_local and attachment_service,
this PR contains a few unrelated standalone changes generated while
working on that branch:
- Fix restarting a pageserver when it contains multiple shards for the
same tenant
- When using location_config api to attach a tenant, create its
timelines dir
- Update test paths where generations were previously optional to make
them always-on: this avoids tests having to spuriously assert that
attachment_service is not None in order to make the linter happy.
- Add a TenantShardId python implementation for subsequent use in test
helpers that will be made shard-aware
- Teach scrubber to read across shards when checking for layer
existence: this is a refactor to track the list of existent layers at
tenant-level rather than locally to each timeline. This is a precursor
to testing shard splitting.
2023-12-20 12:26:20 +00:00