Commit Graph

4083 Commits

Author SHA1 Message Date
Christian Schwarz
d5bb030ea1 WIP 2023-12-13 12:02:58 +00:00
Christian Schwarz
b3c811731a getpage bench: mode to rate limit per timeline 2023-12-13 10:50:51 +00:00
Christian Schwarz
b6cb717a76 WIP: add tracing_flame stuff, many more spans, looked at spans with Joonas 2023-12-07 18:07:45 +00:00
Christian Schwarz
9e128b58b4 add debug span on Timeline::get 2023-12-07 08:48:25 +00:00
Christian Schwarz
28eb4da171 add a debug span for blob_io 2023-12-07 08:17:37 +00:00
Christian Schwarz
c4f7cab042 basebackup bench: debug-log basebackup size 2023-12-06 18:07:24 +00:00
Christian Schwarz
cfff331da3 WIP: implement tracing_chrome support for utils::logging 2023-12-06 18:07:24 +00:00
Christian Schwarz
658c20bea4 jwt support; debug spans in basebackup 2023-12-06 17:20:04 +00:00
Christian Schwarz
8a555f1cf3 basebackup bench: fixup copy-pasta of wip 2023-12-05 23:55:49 +00:00
Christian Schwarz
4f79b6d140 pagebench: fixup some accidental WIP thing from last week 2023-12-05 23:55:49 +00:00
Christian Schwarz
d6b7bc2abc implement a basebackup benchmark 2023-12-05 19:59:51 +00:00
Christian Schwarz
4fc3596677 client & getpage bench: distinguish between page_service client and client in pagestream mode 2023-12-05 19:59:51 +00:00
Christian Schwarz
60cc3a3397 pagebench: restructure dir a bit 2023-12-05 19:59:51 +00:00
Christian Schwarz
cb3dcb06cf cargo fmt 2023-12-05 19:59:40 +00:00
Christian Schwarz
d75470280f fixup: scale factors in the python benchmark 2023-11-24 18:16:58 +00:00
Christian Schwarz
687678c4ff a mode where one task picks which work to do & dispatches it to per-timeline clients 2023-11-24 18:01:55 +00:00
Christian Schwarz
59c8a29569 WIP: failed attempt to have fixed number of clients going over all the key ranges of all tenants
The problem is that the connections are stateful, need to implement a
client pool => sucks
2023-11-24 17:12:42 +00:00
Christian Schwarz
044e96ce50 fixup: few more percentiles 2023-11-24 16:00:59 +00:00
Christian Schwarz
12a60cd914 parameters for i3en.3xlarge (need to add more modes to the benchmark, e.g., time based) 2023-11-24 15:40:29 +00:00
Christian Schwarz
9f36d19383 few more percentiles for the benchmark 2023-11-24 15:39:02 +00:00
Christian Schwarz
a0909a2b80 make the benchmarking script work again 2023-11-24 14:58:21 +00:00
Christian Schwarz
bd06672cdd have one HdrHistogram per thread instead of one per task 2023-11-24 14:27:52 +00:00
Christian Schwarz
f1a714e465 Revert "WIP: figure out overhead of linear histogram"
This reverts commit dc914ef368.
2023-11-24 14:01:07 +00:00
Christian Schwarz
dc914ef368 WIP: figure out overhead of linear histogram 2023-11-24 14:00:54 +00:00
Christian Schwarz
568f6ae332 per-task & global mean + percentiles using hdrhistogram
known problem is: one hdrhistogram per task => too much memory usage
2023-11-24 12:35:22 +00:00
Christian Schwarz
857150dcee CLI structure 2023-11-24 11:19:21 +00:00
Christian Schwarz
9d13d0015f perftest: use new binary name 2023-11-24 11:06:03 +00:00
Christian Schwarz
281f05398e further break up 2023-11-24 11:05:55 +00:00
Christian Schwarz
0bd5e3aedc remove unnecessary return impl Future 2023-11-24 10:56:52 +00:00
Christian Schwarz
4f1197311e break up client into library & cli 2023-11-24 10:55:54 +00:00
Christian Schwarz
dd5792e488 WIP use results 2023-11-24 10:18:05 +00:00
Christian Schwarz
135e37e5b2 implement the performance test in the Python test suite 2023-11-24 10:17:49 +00:00
Christian Schwarz
ccb9fe9b33 find a way to duplicate a tenant in local_fs
Use the script like so, against the tenant to duplicate:

    poetry run python3 ./test_runner/duplicate_tenant.py 7ea51af32d42bfe7fb93bf5f28114d09  200 8

backup of pageserver.toml

    d =1
    pg_distrib_dir ='/home/admin/neon-main/pg_install'
    http_auth_type ='Trust'
    pg_auth_type ='Trust'
    listen_http_addr ='127.0.0.1:9898'
    listen_pg_addr ='127.0.0.1:64000'
    broker_endpoint ='http://127.0.0.1:50051/'
    #control_plane_api ='http://127.0.0.1:1234/'

    # Initial configuration file created by 'pageserver --init'
    #listen_pg_addr = '127.0.0.1:64000'
    #listen_http_addr = '127.0.0.1:9898'

    #wait_lsn_timeout = '60 s'
    #wal_redo_timeout = '60 s'

    #max_file_descriptors = 10000
    #page_cache_size = 160000

    # initial superuser role name to use when creating a new tenant
    #initial_superuser_name = 'cloud_admin'

    #broker_endpoint = 'http://127.0.0.1:50051'

    #log_format = 'plain'

    #concurrent_tenant_size_logical_size_queries = '1'

    #metric_collection_interval = '10 min'
    #cached_metric_collection_interval = '0s'
    #synthetic_size_calculation_interval = '10 min'

    #disk_usage_based_eviction = { max_usage_pct = .., min_avail_bytes = .., period = "10s"}

    #background_task_maximum_delay = '10s'

    [tenant_config]
    #checkpoint_distance = 268435456 # in bytes
    #checkpoint_timeout = 10 m
    #compaction_target_size = 134217728 # in bytes
    #compaction_period = '20 s'
    #compaction_threshold = 10

    #gc_period = '1 hr'
    #gc_horizon = 67108864
    #image_creation_threshold = 3
    #pitr_interval = '7 days'

    #min_resident_size_override = .. # in bytes
    #evictions_low_residence_duration_metric_threshold = '24 hour'
    #gc_feedback = false

    # make it deterministic
    gc_period = '0s'
    checkpoint_timeout = '3650 day'
    compaction_period = '20 s'
    compaction_threshold = 10
    compaction_target_size = 134217728
    checkpoint_distance = 268435456
    image_creation_threshold = 3

    [remote_storage]
    local_path = '/home/admin/neon-main/bench_repo_dir/repo/remote_storage_local_fs'

remove http handler

switch to generalized rewrite_summary  & impl page_ctl subcommand to use it

WIP: change duplicate_tenant.py script to use the pagectl command

The script works but at restart, we detach the created tenants because
they're not known to the attachment service:

  Detaching tenant, control plane omitted it in re-attach response tenant_id=1e399d390e3aee6b11c701cbc716bb6c

=> figure out how to further integrate this
2023-11-24 10:17:49 +00:00
Christian Schwarz
1b81640290 random getpage benchmark 2023-11-24 10:17:49 +00:00
Anastasia Lubennikova
2a12e9c46b Add documentation for our sample pre-commit hook (#5868) 2023-11-22 12:04:36 +00:00
Christian Schwarz
9e3c07611c logging: support output to stderr (#5896)
(part of the getpage benchmarking epic #5771)

The plan is to make the benchmarking tool log on stderr and emit results
as JSON on stdout. That way, the test suite can simply capture stdout
and json.loads() it, while interactive users of the benchmarking
tool have a reasonable experience as well.

Existing logging users continue to print to stdout, so, this change
should be a no-op functionally and performance-wise.
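
A minimal sketch of the stderr/stdout split described above, from the test suite's side. The child script, its field names (`requests`, `p99_ms`), and the helper `run_and_parse` are hypothetical stand-ins, not the actual benchmarking tool or its output schema:

```python
import json
import subprocess
import sys

# Hypothetical stand-in for the benchmarking tool: human-readable logs go
# to stderr, the machine-readable result goes to stdout as one JSON document.
CHILD = r'''
import json, sys
print("starting benchmark", file=sys.stderr)
print(json.dumps({"requests": 100, "p99_ms": 1.5}))
'''


def run_and_parse(cmd):
    """Run cmd and return (parsed stdout JSON, captured stderr text)."""
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(proc.stdout), proc.stderr


result, log = run_and_parse([sys.executable, "-c", CHILD])
```

Because the logs never touch stdout, the parser does not need to filter log lines out of the JSON.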
2023-11-22 11:08:35 +00:00
Christian Schwarz
d353fa1998 refer to our rust-postgres.git fork by branch name (#5894)
This way, `cargo update -p tokio-postgres` just works. The `Cargo.toml`
communicates more clearly that we're referring to the `main` branch. And
the git revision is still pinned in `Cargo.lock`.
2023-11-22 10:58:27 +00:00
Joonas Koivunen
0d10992e46 Cleanup compact_level0_phase1 fsyncing (#5852)
While reviewing code noticed a scary `layer_paths.pop().unwrap()` then
realized this should be further asyncified, something I forgot to do
when I switched the `compact_level0_phase1` back to async in #4938.

This keeps the double-fsync for new deltas as #4749 is still unsolved.
2023-11-21 15:30:40 +02:00
Arpad Müller
3e131bb3d7 Update Rust to 1.74.0 (#5873)
[Release notes](https://github.com/rust-lang/rust/releases/tag/1.74.0).
2023-11-21 11:41:41 +01:00
Sasha Krassovsky
81b2cefe10 Disallow CREATE DATABASE WITH OWNER neon_superuser (#5887)
## Problem
Currently, control plane doesn't know about neon_superuser, so if a user
creates a database with owner neon_superuser it causes an exception when
it tries to forward it. It is also currently possible to ALTER ROLE
neon_superuser.

## Summary of changes
Disallow creating database with owner neon_superuser. This is probably
fine, since I don't think you can create a database with owner normal
superuser. Also forbids altering neon_superuser.
2023-11-20 22:39:47 +00:00
Christian Schwarz
d2ca410919 build: back to opt-level=0 in debug builds, for faster compile times (#5751)
This change brings down incremental compilation for me
from > 1min to 10s (and this is a pretty old Ryzen 1700X).

More details: "incremental compilation" here means to change one character
in the `failed to read value from offset` string in `image_layer.rs`.
The command for incremental compilation is `cargo build_testing`.
The system on which I got these numbers uses `mold` via
`~/.cargo/config.toml`.

As a bonus, `rust-gdb` is now at least a little fun again.

Some tests are timing out in debug builds due to these changes.
This PR makes them skip for debug builds.
We run the tests with both debug and release builds, so the loss of
coverage is marginal.

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2023-11-20 15:41:37 +01:00
Joonas Koivunen
d98ac04136 chore(background_tasks): missed allowed_error change, logging change (#5883)
- I am always confused by the log for the error wait time, now it will
be `2s` or `2.0s` not `2.0`
- fix missed string change introduced in #5881 [evidence]

[evidence]:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/6921062837/index.html#suites/f9eba3cfdb71aa6e2b54f6466222829b/87897fe1ddee3825
2023-11-20 07:33:17 +00:00
Joonas Koivunen
ac08072d2e fix(layer): VirtualFile opening and read errors can be caused by contention (#5880)
A very low number of layer loads have been marked wrongly as permanent,
as I did not remember that `VirtualFile::open` or reading could fail
transiently for contention. Return separate errors for transient and
persistent errors from `{Delta,Image}LayerInner::load`.

Includes drive-by comment changes.

The implementation looks quite ugly because the same type serves as both
the inner (operation error) and outer (critical error), but I did not
find a better way among the alternatives I tried.
2023-11-19 14:57:39 +00:00
John Spray
d22dce2e31 pageserver: shut down idle walredo processes (#5877)
The longer a pageserver runs, the more walredo processes it accumulates
from tenants that are touched intermittently (e.g. by availability
checks). This can lead to getting OOM killed.

Changes:
- Add an Instant recording the last use of the walredo process for a
tenant
- After compaction iteration in the background task, check for idleness
and stop the walredo process if idle for more than 10x compaction
period.
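
The idleness check in the list above could be sketched roughly as follows. This is an illustrative model only: the class `WalRedoManager`, its methods, and the injectable clock are hypothetical names, and `time.monotonic` stands in for the `Instant` mentioned in the commit message:

```python
import time


class WalRedoManager:
    """Hypothetical model of a per-tenant walredo process handle."""

    def __init__(self, now=time.monotonic):
        self._now = now              # injectable clock, for testing
        self.process_running = False
        self.last_used = None

    def request_redo(self):
        # Launch lazily on first use and record when it was last needed.
        self.process_running = True
        self.last_used = self._now()

    def maybe_quiesce(self, compaction_period, factor=10):
        """Stop the process if idle for more than factor * compaction_period.

        Meant to be called after each compaction iteration.
        Returns True if the process was stopped by this call.
        """
        if not self.process_running:
            return False
        if self._now() - self.last_used > factor * compaction_period:
            self.process_running = False
            return True
        return False
```

Tying the threshold to the compaction period means the check needs no timer of its own; it piggybacks on a loop that already runs per tenant.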

Cc: #3620

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Shany Pozin <shany@neon.tech>
2023-11-19 14:21:16 +00:00
Joonas Koivunen
3b3f040be3 fix(background_tasks): first backoff, compaction error stacktraces (#5881)
First compaction/gc error backoff starts from 0, which is less than the 2s
it was before #5672. This is now fixed to be the intended 2**n.
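
As a rough illustration of the intended 2**n schedule: the sketch below assumes n is the number of consecutive errors, so that the first error waits 2s. The function name and the cap `max_seconds` are hypothetical and not taken from the pageserver code:

```python
def backoff_seconds(consecutive_errors, max_seconds=300.0):
    """Return the wait before the next retry: 2**n seconds, capped.

    consecutive_errors is n, counted from 1 for the first failure
    (an assumption made for this sketch).
    """
    if consecutive_errors <= 0:
        return 0.0  # no error yet, no backoff
    return min(2.0 ** consecutive_errors, max_seconds)
```

Starting the exponent at 1 rather than 0 is what gives the first failure a 2s wait.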

Additionally noticed that `compaction_iteration` creating an
`anyhow::Error` via `into()` always captures a stacktrace, even if we had
a stacktraceful anyhow error within the CompactionError, because there is
no stable api for querying that.
2023-11-19 14:16:31 +00:00
Em Sharnoff
cad0dca4b8 compute_ctl: Remove deprecated flag --file-cache-on-disk (#5622)
See neondatabase/cloud#7516 for more.
2023-11-18 12:43:54 +01:00
Em Sharnoff
5d13a2e426 Improve error message when neon.max_cluster_size reached (#4173)
Changes the error message encountered when the `neon.max_cluster_size`
limit is reached. Reasoning is that this is user-visible, and so should
*probably* use language that's closer to what users are familiar with.
2023-11-16 21:51:26 +00:00
khanova
0c243faf96 Proxy log pid hack (#5869)
## Problem

Improve observability for the compute node.

## Summary of changes

Log pid from the compute node. Doesn't work with pgbouncer.
2023-11-16 20:46:23 +00:00
Em Sharnoff
d0a842a509 Update vm-builder to v0.19.0 and move its customization here (#5783)
ref neondatabase/autoscaling#600 for more
2023-11-16 18:17:42 +01:00
khanova
6b82f22ada Collect number of connections by sni type (#5867)
## Problem

We don't know the number of users for each kind of
authentication: ["sni", "endpoint in options" (A and B from
[here](https://neon.tech/docs/connect/connection-errors)),
"password_hack"]

## Summary of changes

Collect metrics by sni kind.
2023-11-16 12:19:13 +00:00