Commit Graph

6493 Commits

Author SHA1 Message Date
BodoBolero
8456737945 improve pg settings for maintenance 2024-11-11 07:27:47 +01:00
BodoBolero
172b3805c3 fix convert_to_seconds double counting for milliseconds as minutes 2024-11-09 17:36:57 +01:00
BodoBolero
533908eab1 remove leading 0s in convert_to_seconds 2024-11-09 17:14:07 +01:00
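The two `convert_to_seconds` commits above describe a unit-parsing pitfall: a millisecond suffix being read as minutes. The original helper is not shown here (a nearby commit mentions awk), so the following is only a Python sketch of the idea. The input format (`1h 02m 03s 500ms`), the unit table, and the regular expression are assumptions made for illustration.

```python
import re

# Hypothetical unit table. "ms" is listed before "m" in the regex alternation
# so that "500ms" is never read as 500 minutes.
_UNIT_SECONDS = {"ms": 0.001, "h": 3600.0, "m": 60.0, "s": 1.0}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)\s*(ms|h|m|s)")


def convert_to_seconds(text: str) -> float:
    """Sum every '<number><unit>' token found in text."""
    total = 0.0
    for number, unit in _TOKEN.findall(text):
        # float() accepts leading zeros such as "02", so no stripping is needed here.
        total += float(number) * _UNIT_SECONDS[unit]
    return total
```

With the alternation written as `m|ms` instead, the same input would match `m` first and count milliseconds as minutes, which is the kind of double counting the commit subject describes.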
BodoBolero
8394ab2f9e actionline warnings 2024-11-08 18:08:14 +01:00
BodoBolero
e2f05de1c0 we really want to use dbname ludicrous for the large tenant 2024-11-08 17:56:01 +01:00
BodoBolero
6465f4d97d now run the real bench ingest in two variants 2024-11-08 17:47:30 +01:00
BodoBolero
84042486b2 CREATE EXTENSION IF NOT EXISTS neon; CREATE EXTENSION IF NOT EXISTS neon_utils; 2024-11-08 17:43:02 +01:00
BodoBolero
bdcc2bea50 drop database cannot run in transaction block 2024-11-08 17:39:08 +01:00
BodoBolero
2dd86150de old and new project in matrix strategy 2024-11-08 17:26:26 +01:00
BodoBolero
eb8a9a12da better detection of duration 2024-11-08 15:46:21 +01:00
BodoBolero
9710169ed4 fix perf results insertion 2024-11-08 15:35:36 +01:00
BodoBolero
74a747f3ec fix log parser 2024-11-08 15:32:58 +01:00
BodoBolero
3c0efd42dc bc is not installed in build-tools, use awk instead 2024-11-08 15:24:26 +01:00
BodoBolero
0397307711 fix git rev-parse error 2024-11-08 15:18:27 +01:00
BodoBolero
5490f427ce test parser for pgcopydb log output 2024-11-08 15:08:04 +01:00
BodoBolero
3bf4d1cb64 include all necessary tables in filter 2024-11-08 14:14:53 +01:00
BodoBolero
4d83012c11 correct here doc 2024-11-08 14:12:10 +01:00
BodoBolero
cd3a8f537d bind BENCHMARK_INGEST_SOURCE_CONNSTR 2024-11-08 13:55:12 +01:00
BodoBolero
9e46a8b87b echo new project ID 2024-11-08 12:29:02 +01:00
BodoBolero
3856d42c2d first draft of actually invoking pgcopydb to clone a project 2024-11-08 12:22:51 +01:00
BodoBolero
d2f4ef6ee6 new attempt with shared library load paths 2024-11-08 12:06:41 +01:00
BodoBolero
77aec7c063 fix shared library path 2024-11-08 11:53:44 +01:00
BodoBolero
a6756c45fa test project creation and deletion 2024-11-08 11:44:06 +01:00
BodoBolero
55157b3506 test pgcopydb with new build-tools image 2024-11-07 16:06:15 +01:00
BodoBolero
31d87720d7 undo changes to Dockerfile - these changes are in separate PR 2024-11-07 16:01:43 +01:00
BodoBolero
d33619f2b6 use newer image with pgcopydb 0.17-1 and verify it can be invoked 2024-11-06 10:47:47 +01:00
BodoBolero
0799fd6fca need higher version of pgcopydb 2024-11-06 10:35:26 +01:00
BodoBolero
0e524c969b fix image build errors 2024-11-06 09:20:27 +01:00
BodoBolero
57752a0188 pgcopydb package only available for bookworm 2024-11-06 09:12:43 +01:00
BodoBolero
6d675196bf try to add pgcopydb to build-tools 2024-11-05 18:19:54 +01:00
BodoBolero
8e87dcc3d6 reduce token lifetime 2024-11-05 17:57:07 +01:00
BodoBolero
6d724178d3 need to download our own actions to use them 2024-11-05 17:52:42 +01:00
BodoBolero
4f9c6043ea need permissions to use AWS credentials 2024-11-05 17:47:54 +01:00
BodoBolero
f1573962b8 don't need prepare databases for ingest bench 2024-11-05 17:44:23 +01:00
BodoBolero
8ac210247c need AWS credentials to download Neon 2024-11-05 17:41:14 +01:00
BodoBolero
31c1af47b7 preparations for ingest benchmarking workflow - install pgcopydb into github runner 2024-11-05 17:37:17 +01:00
Christian Schwarz
06113e94e6 fix(test_regress): always use storcon virtual pageserver API to set tenant config (#9622)
Problem
-------

Tests that directly call the Pageserver Management API to set tenant
config are flaky if the Pageserver is managed by Storcon because Storcon
is the source of truth and may (theoretically) reconcile a tenant at any
time.

Solution
--------

Switch all users of
`set_tenant_config`/`patch_tenant_config_client_side`
to use the `env.storage_controller.pageserver_api()`

Future Work
-----------

Prevent regressions from creeping in.

And generally clean up tenant configuration.
Maybe we can avoid the Pageserver having a default tenant config at all
and put the default into Storcon instead?

* => https://github.com/neondatabase/neon/issues/9621

Refs
----

fixes https://github.com/neondatabase/neon/issues/9522
2024-11-04 17:42:08 +01:00
Erik Grinaker
0d5a512825 safekeeper: add walreceiver metrics (#9450)
## Problem

We don't have any observability for Safekeeper WAL receiver queues.

## Summary of changes

Adds a few WAL receiver metrics:

* `safekeeper_wal_receivers`: gauge of currently connected WAL
receivers.
* `safekeeper_wal_receiver_queue_depth`: histogram of queue depths per
receiver, sampled every 5 seconds.
* `safekeeper_wal_receiver_queue_depth_total`: gauge of total queued
messages across all receivers.
* `safekeeper_wal_receiver_queue_size_total`: gauge of total queued
message sizes across all receivers.

There are already metrics for ingested WAL volume: `written_wal_bytes`
counter per timeline, and `safekeeper_write_wal_bytes` per-request
histogram.
2024-11-04 15:22:46 +00:00
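As a rough illustration of how the four metrics above relate to each other, here is a plain-Python sketch of one sampling pass. It is not the Safekeeper implementation: the bucket bounds, the class name, and the representation of a receiver queue as a list of byte strings are all assumptions, and the histogram keeps simple per-bucket counts rather than Prometheus-style cumulative buckets.

```python
from bisect import bisect_left
from dataclasses import dataclass, field

# Hypothetical bucket upper bounds; the source does not state the real ones.
BUCKETS = [0, 1, 4, 16, 64, 256]


@dataclass
class WalReceiverMetrics:
    receivers: int = 0          # safekeeper_wal_receivers
    queue_depth_total: int = 0  # safekeeper_wal_receiver_queue_depth_total
    queue_size_total: int = 0   # safekeeper_wal_receiver_queue_size_total
    # safekeeper_wal_receiver_queue_depth: one counter per bucket plus overflow.
    depth_histogram: list = field(
        default_factory=lambda: [0] * (len(BUCKETS) + 1)
    )

    def sample(self, queues: list) -> None:
        """Take one sample; queues holds the pending messages of each receiver."""
        self.receivers = len(queues)
        self.queue_depth_total = sum(len(q) for q in queues)
        self.queue_size_total = sum(len(m) for q in queues for m in q)
        for q in queues:
            # Index of the first bucket whose upper bound is >= the queue depth.
            self.depth_histogram[bisect_left(BUCKETS, len(q))] += 1
```

In a real service `sample` would be driven by a timer (every 5 seconds according to the commit message); the gauges are overwritten on each pass while the histogram accumulates.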
Conrad Ludgate
8ad1dbce72 [proxy]: parse proxy protocol TLVs with aws/azure support (#9610)
AWS/Azure private link shares extra information in the "TLV" values of
the proxy protocol v2 header. This code doesn't act on it, but it
parses it as appropriate.
2024-11-04 14:04:56 +00:00
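For readers unfamiliar with the TLV layout, the sketch below walks the TLV region of a PROXY protocol v2 header: each entry is a 1-byte type, a 2-byte big-endian length, then the value. This is a Python illustration, not the proxy's Rust code. The vendor type and subtype constants and the little-endian Azure link ID reflect my reading of the vendors' documentation and should be treated as assumptions; the function names are hypothetical.

```python
import struct

# Assumed vendor-specific TLV types; verify against vendor docs before use.
PP2_TYPE_AWS = 0xEA
PP2_TYPE_AZURE = 0xEE


def parse_tlvs(buf: bytes) -> list:
    """Split a TLV region into (type, value) pairs."""
    out = []
    pos = 0
    while pos < len(buf):
        if len(buf) - pos < 3:
            raise ValueError("truncated TLV header")
        kind, length = struct.unpack_from("!BH", buf, pos)
        pos += 3
        if len(buf) - pos < length:
            raise ValueError("truncated TLV value")
        out.append((kind, buf[pos:pos + length]))
        pos += length
    return out


def describe(kind: int, value: bytes) -> tuple:
    """Interpret the vendor TLVs; the first value byte is assumed to be a subtype."""
    if kind == PP2_TYPE_AWS and value[:1] == b"\x01":
        return ("aws_vpce_id", value[1:].decode("ascii"))
    if kind == PP2_TYPE_AZURE and value[:1] == b"\x01" and len(value) == 5:
        return ("azure_link_id", struct.unpack("<I", value[1:])[0])
    return ("unknown", value)
```

Unknown types are returned untouched, which mirrors the commit's stated intent of parsing the values without acting on them.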
Conrad Ludgate
3dcdbcc34d remove aws-lc-rs dep and fix storage_broker tls (#9613)
It seems the ecosystem is not so keen on moving to aws-lc-rs as its
build setup is more complicated than ring (requiring cmake).

Eventually I expect the ecosystem should pivot to
https://github.com/ctz/graviola/tree/main/rustls-graviola as it
stabilises (it has a very simple build step and license), but for now
let's try not to have a headache of juggling two crypto libs.

I also noticed that tonic will just fail with tls without a default
provider, so I added some defensive code for that.
2024-11-04 13:29:13 +00:00
Matthias van de Meent
d5de63c6b8 Fix a time zone issue in a PG17 test case (#9618)
The commit was cherry-picked and thus shouldn't cause issues once we
merge the release tag for PostgreSQL 17.1
2024-11-04 12:10:32 +00:00
John Spray
4534f5cdc6 pageserver: make local timeline deletion infallible (#9594)
## Problem

In https://github.com/neondatabase/neon/pull/9589, timeline offload code
is modified to return an explicit error type rather than propagating
anyhow::Error. One of the 'Other' cases there is I/O errors from local
timeline deletion, which shouldn't need to exist, because our policy is
not to try and continue running if the local disk gives us errors.

## Summary of changes

- Make `delete_local_timeline_directory` infallible and use `.fatal_err()` on I/O
errors

---------

Co-authored-by: Erik Grinaker <erik@neon.tech>
2024-11-04 09:11:52 +00:00
Erik Grinaker
0058eb09df test_runner/performance: add sharded ingest benchmark (#9591)
Adds a Python benchmark for sharded ingestion. This ingests 7 GB of WAL
(100M rows) into a Safekeeper and fans out to 10 shards running on 10
different pageservers. The ingest volume and duration are recorded.
2024-11-02 16:42:10 +00:00
Konstantin Knizhnik
8ac523d2ee Do not assign page LSN to new (uninitialized) page in ClearVisibilityMapFlags redo handler (#9287)
## Problem

https://neondb.slack.com/archives/C04DGM6SMTM/p1727872045252899

See https://github.com/neondatabase/neon/issues/9240

## Summary of changes

Add `!page_is_new` check before assigning page lsn.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-11-01 20:31:29 +02:00
John Spray
3c16bd6e0b storcon: skip non-active projects in chaos injection (#9606)
## Problem

We may sometimes use scheduling modes like `Pause` to pin a tenant in
its current location for operational reasons. It is undesirable for the
chaos task to make any changes to such projects.

## Summary of changes

- Add a check for scheduling mode
- Add a log line when we do choose to do a chaos action for a tenant:
this will help us understand which operations originate from the chaos
task.
2024-11-01 16:47:20 +00:00
Erik Grinaker
123816e99a safekeeper: log slow WalAcceptor sends (#9564)
## Problem

We don't have any observability into full WalAcceptor queues per
timeline.

## Summary of changes

Logs a message when a WalAcceptor send has blocked for 5 seconds, and
another message when the send completes. This implies that the log
frequency is at most once every 5 seconds per timeline, so we don't need
further throttling.
2024-11-01 13:47:03 +01:00
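The logging pattern described above (one message when a send has blocked for the threshold, one more when it finally completes) can be sketched in Python with `asyncio`. This is not the Safekeeper's Rust code; the function name, logger name, and use of `asyncio.Queue` as a stand-in for the WalAcceptor channel are assumptions.

```python
import asyncio
import logging
import time

log = logging.getLogger("wal_acceptor_sketch")


async def send_with_slow_log(queue, msg, threshold: float = 5.0) -> bool:
    """Put msg on queue; log if the put blocks longer than threshold seconds.

    Returns True if the slow path was taken.
    """
    started = time.monotonic()
    put = asyncio.ensure_future(queue.put(msg))
    # asyncio.wait does not cancel the pending put when the timeout expires.
    done, _ = await asyncio.wait({put}, timeout=threshold)
    if done:
        put.result()
        return False
    log.warning("send blocked for %.1fs, still waiting", threshold)
    await put
    log.warning("send completed after %.1fs", time.monotonic() - started)
    return True
```

Because the first message is only emitted after a full threshold of blocking and the second only once per send, a single sender cannot log the blocked message more often than once per threshold, which matches the commit's reasoning for skipping further throttling.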
Peter Bendel
8b3bcf71ee revert higher token expiration (#9605)
## Problem

The IAM role associated with our GitHub Actions runner supports a max
token expiration which is lower than the value we tried.

## Summary of changes

Since we believe we have understood the performance regression (by
ensuring availability zone affinity of compute and pageserver), the job
should again run in under 5 hours, so we revert this change instead
of increasing the max session token expiration in the IAM role, which
would reduce our security.
2024-11-01 12:46:02 +01:00
Erik Grinaker
4c2c8d6708 test_runner: fix tenant_get_shards with one pageserver (#9603)
## Problem

`tenant_get_shards()` does not work with a sharded tenant on 1
pageserver, as it assumes an unsharded tenant in this case. This special
case appears to have been added to handle e.g. `test_emergency_mode`,
where the storage controller is stopped. This breaks e.g. the sharded
ingest benchmark in #9591 when run with a single shard.

## Summary of changes

Correctly look up shards even with a single pageserver, but add a
special case that assumes an unsharded tenant if the storage controller
is stopped and the caller provides an explicit pageserver, in order to
accommodate `test_emergency_mode`.
2024-11-01 11:25:04 +00:00
Conrad Ludgate
2d1366c8ee fix pre-commit hook with python stubs (#9602)
fix #9601
2024-11-01 11:22:38 +00:00
Vlad Lazar
e589c2e5ec storage_controller: allow deployment infra to use infra token (#9596)
## Problem

We wish for the deployment orchestrator to use infra scoped tokens,
but storcon endpoints it's using require admin scoped tokens.

## Summary of Changes

Switch over all endpoints that are used by the deployment orchestrator
to use an infra scoped token. This causes no breakage during mixed
version scenarios because admin scoped tokens allow access to all
endpoints. The deployment orchestrator can cut over to the infra token
after this commit touches down in prod.

Once this commit is released we should also update the test code to use
infra scoped tokens where appropriate. Currently it would fail on the
[compat tests](9761b6a64e/test_runner/regress/test_storage_controller.py (L69-L71)).
2024-10-31 18:29:16 +00:00
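The compatibility argument in the commit above rests on admin-scoped tokens being accepted everywhere. A minimal Python sketch of such a check follows; the `Scope` enum and `is_allowed` function are hypothetical names for illustration and do not reflect the storage controller's actual code.

```python
from enum import Enum


class Scope(Enum):
    ADMIN = "admin"
    INFRA = "infra"


def is_allowed(token_scope: Scope, required: Scope) -> bool:
    """Admin tokens pass every check; other scopes must match exactly."""
    return token_scope is Scope.ADMIN or token_scope is required
```

Under this rule, lowering an endpoint's requirement from admin to infra keeps existing admin-token callers working, so the caller can switch tokens after the server change is deployed.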