- add pgbouncer_settings section to compute spec;
- add pgbouncer-connstr option to compute_ctl;
- add pgbouncer-ini-path option to compute_ctl (default: /etc/pgbouncer/pgbouncer.ini).
Apply the pgbouncer config on compute start and on respec to override the default spec.
Save pgbouncer config updates to pgbouncer.ini so that they are preserved across pgbouncer restarts.
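For illustration only, a minimal sketch of how compute_ctl could model and persist such a section (names and types here are assumptions, not the actual spec format):

```rust
use std::collections::HashMap;

/// Hypothetical shape of the `pgbouncer_settings` section of the compute spec:
/// a flat map of pgbouncer.ini keys to values, e.g. {"max_client_conn": "1000"}.
#[derive(Debug, Default)]
pub struct PgbouncerSettings(pub HashMap<String, String>);

/// Render the settings as pgbouncer.ini content so that updates applied at
/// compute start / respec survive pgbouncer restarts.
pub fn render_ini(settings: &PgbouncerSettings) -> String {
    let mut out = String::from("[pgbouncer]\n");
    for (key, value) in &settings.0 {
        out.push_str(&format!("{key} = {value}\n"));
    }
    out
}
```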
Remove confirm_wal_streamed; we already apply both the write and flush positions of
the slot to commit_lsn, which is fine because 1) we need to wake up waiters and 2)
committed WAL can now be fetched from safekeepers by neon_walreader.
walproposer -> safekeeper communication now uses neon_walreader, which fetches missing
WAL on demand from safekeepers, so this is no longer needed. Also, cap WAL download by
max_slot_wal_keep_size so that the compute can start even if the lag is too high.
It is similar to XLogReader, but when either the requested segment is missing
locally or the requested LSN is before basebackup_lsn, NeonWALReader asynchronously
fetches the WAL from one of the safekeepers.
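Roughly, the reader's source selection boils down to the following check; this sketch is in Rust purely for illustration (the real reader lives in the C extension, and these names are assumptions):

```rust
/// Where the next chunk of WAL should be read from.
#[derive(Debug, PartialEq)]
enum WalSource {
    /// Read from a WAL segment file in local pg_wal.
    Local,
    /// Fetch asynchronously from one of the safekeepers.
    Remote,
}

/// Decide how to serve a read starting at `req_lsn`. `basebackup_lsn` is the
/// LSN the compute was started from; WAL older than it never exists locally.
/// `segment_exists_locally` stands in for an actual check of pg_wal.
fn choose_source(req_lsn: u64, basebackup_lsn: u64, segment_exists_locally: bool) -> WalSource {
    if req_lsn < basebackup_lsn || !segment_exists_locally {
        WalSource::Remote
    } else {
        WalSource::Local
    }
}
```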
The patch includes switching walproposer to NeonWALReader; splitting this out wouldn't
make much sense, as it is hard to test otherwise. This finally removes the risk of pg_wal
explosion (as well as slow start times) when one safekeeper is lagging, while still
allowing it to recover.
In the future the reader should also be used by the logical walsender for similar
reasons (currently we download the WAL tail synchronously on compute start).
The main test is test_lagging_sk. However, I also ran it manually a lot, varying
MAX_SEND_SIZE on both sides (safekeeper and walproposer) to test various
fragmentations (one side having a small buffer, the other, or both), which brought up
https://github.com/neondatabase/neon/issues/6055
closes https://github.com/neondatabase/neon/issues/1012
## Problem
See https://neondb.slack.com/archives/C026T7K2YP9/p1702813041997959
## Summary of changes
Do not take invalidated slots into account when calculating the restart_lsn
position for basebackup at the pageserver.
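Conceptually the fix is to skip invalidated slots when taking the minimum; a minimal sketch, with illustrative names rather than the actual code:

```rust
/// Minimal sketch: pick the restart position for basebackup as the minimum
/// restart_lsn over replication slots, ignoring slots that were invalidated.
/// Field and type names here are illustrative, not the actual code.
struct SlotState {
    restart_lsn: Option<u64>,
    invalidated: bool,
}

fn min_restart_lsn(slots: &[SlotState]) -> Option<u64> {
    slots
        .iter()
        .filter(|slot| !slot.invalidated)     // skip invalidated slots
        .filter_map(|slot| slot.restart_lsn)  // skip slots without a restart_lsn
        .min()
}
```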
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above
checklist
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Store the content of the `last-modified` and `etag` HTTP headers in
`Download`.
This serves both as the first step towards #6199 and as preparation for the
tests in #6155.
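A minimal sketch of what carrying those headers on the download result could look like (field names and types are assumptions, not necessarily the actual `remote_storage` definitions):

```rust
use std::time::SystemTime;

/// Sketch of a download result that carries the HTTP caching metadata
/// returned by the storage backend alongside the body.
pub struct Download {
    /// Contents of the `etag` response header, if the backend returned one.
    pub etag: Option<String>,
    /// Contents of the `last-modified` response header, parsed to a timestamp.
    pub last_modified: Option<SystemTime>,
    // ... the download byte stream itself would live here as well.
}
```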
## Problem
During startup, a tenant delete request might have to retry for many
minutes waiting for a tenant to enter Active state.
## Summary of changes
- Refactor delete_tenant into TenantManager: this is not a functional
change, but will avoid merge conflicts with
https://github.com/neondatabase/neon/pull/6105 later
- Add 412 responses to the swagger definition of this endpoint.
- Use Tenant::wait_to_become_active in `TenantManager::delete_tenant`
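Conceptually, the delete path now waits for activation before proceeding, roughly as in the sketch below (the types and signatures are simplified stand-ins, not the actual `TenantManager` API):

```rust
use std::time::Duration;

// Illustrative stubs; the real types live in the pageserver's TenantManager.
struct Tenant;
enum DeleteError {
    PreconditionFailed, // surfaced to the client as HTTP 412
}

impl Tenant {
    async fn wait_to_become_active(&self, _timeout: Duration) -> Result<(), ()> {
        Ok(()) // stub: the real call waits on the tenant state machine
    }
}

async fn run_tenant_deletion(_tenant: &Tenant) -> Result<(), DeleteError> {
    Ok(()) // stub for the existing deletion logic
}

/// Sketch: wait for activation first (mapping a timeout to a 412-style error)
/// instead of making the client retry for many minutes during startup.
async fn delete_tenant_sketch(tenant: &Tenant) -> Result<(), DeleteError> {
    tenant
        .wait_to_become_active(Duration::from_secs(30))
        .await
        .map_err(|_| DeleteError::PreconditionFailed)?;
    run_tenant_deletion(tenant).await
}
```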
---------
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
## Problem
Fix build-tools image tag calculation for PRs.
Broken in https://github.com/neondatabase/neon/pull/6195
## Summary of changes
- Use the `pinned` tag instead of `$GITHUB_RUN_ID` if there are no changes in
the Dockerfile (and we don't build such an image)
## Fixing a GitHub workflow issue related to building and pushing images
## Summary of changes
Follow-up of PR#608 (move docker file from the build repo to neon) to solve
some issues.
The build started failing because it was missing a validation in the logic that
determines changes in the Dockerfile.
Also, all the dependent jobs were skipped because the build-and-push image job
was skipped.
To address the above issues, the following changes were made:
- Add validation to generate an image tag even if it's a merge to the
repo.
- The dependent jobs no longer skip even if the build-and-push image job
is skipped.
- The logic to generate a tag was moved into the sub-workflow. As the
tag name needed to be passed to the sub-workflow anyway, it made sense to
move it to where it is needed and store it as an output
variable so that downstream dependent jobs can access the value.
- This simplifies the dependency logic, so we don't need complex
expressions to check the conditions under which a job will run.
- An earlier PR that tried to solve a similar problem was closed; it has
some feedback and context from before this PR was created:
https://github.com/neondatabase/neon/pull/6175
## Checklist before requesting a review
- [x] Move the tag generation logic from the main workflow to the
build-and-push-image sub-workflow
- [x] Add a condition to generate an image tag for a non-PR-related run
- [x] Remove complex expressions from the job `if` conditions
---------
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Abhijeet Patil <abhijeet@neon.tech>
## Problem
This is a comment-only change.
To ensure that our benchmarking results are fair, we need correct stats in the
catalog; otherwise the optimizer chooses a seq scan instead of an index-only
scan for some queries. Added a comment to run VACUUM after data prep.
This PR adds a component-level benchmarking utility for pageserver.
Its name is `pagebench`.
The problem solved by `pagebench` is that we want to put Pageserver
under high load.
This isn't easily achieved with `pgbench` because it needs to go through
a compute, which has significant performance overhead compared to
accessing Pageserver directly.
Further, compute has its own performance optimizations (most
importantly: caches). Instead of designing a compute-facing workload
that defeats those internal optimizations, `pagebench` simply bypasses
them by accessing pageserver directly.
Supported benchmarks:
* getpage@latest_lsn
* basebackup
* triggering logical size calculation
This code has no automated users yet.
A performance regression test for getpage@latest_lsn will be added in a
later PR.
part of https://github.com/neondatabase/neon/issues/5771
## Problem
The HTTP connection pool will grow without being pruned.
## Summary of changes
Remove connection clients from pools once idle, or once they exit.
Periodically clear pool shards.
GC Logic:
Each shard contains a hashmap of `Arc<EndpointPool>`s.
Each connection stores a `Weak<EndpointPool>`.
During a GC sweep, we take a random shard's write lock and check whether
any of the `Arc<EndpointPool>`s are unique (using `Arc::get_mut`).
- If they are unique, then we check that the endpoint-pool is empty, and
sweep if it is.
- If they are not unique, then the endpoint-pool is in active use and we
don't sweep.
- Idle connections will self-clear from the endpoint-pool after 5
minutes.
Technically, the uniqueness of the endpoint-pool should be enough to
consider it empty, but the connection count check is done for
completeness' sake.
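A simplified sketch of a sweep over one shard, with stand-in types (the proxy's real code differs):

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Illustrative stand-in for the proxy's per-endpoint pool; live connections
/// hold a `Weak<EndpointPool>` back to it.
struct EndpointPool {
    idle_conns: usize,
}

type Shard = RwLock<HashMap<String, Arc<EndpointPool>>>;

/// One GC sweep over a single shard: remove endpoint pools whose `Arc` is
/// unique (no connection holds a strong or weak reference any more) and that
/// hold no idle connections; keep everything that is still in active use.
fn gc_sweep(shard: &Shard) {
    let mut map = shard.write().unwrap();
    map.retain(|_endpoint, pool| match Arc::get_mut(pool) {
        // Unique: nothing else references this pool; keep it only if it still
        // has idle connections (the completeness check mentioned above).
        Some(pool) => pool.idle_conns > 0,
        // Shared: some connection still references the pool, so keep it.
        None => true,
    });
}
```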
Allows usage via `aws sso login --profile=<p>; AWS_PROFILE=<p>`. There is no
longer a need to manually configure things via `SSO_ACCOUNT_ID` and others.
One can now run the tests locally (given Neon employee access to AWS):
```
aws sso login --profile dev
export ENABLE_REAL_S3_REMOTE_STORAGE=nonempty REMOTE_STORAGE_S3_REGION=eu-central-1 REMOTE_STORAGE_S3_BUCKET=neon-github-public-dev AWS_PROFILE=dev
cargo test -p remote_storage -j 1 s3 -- --nocapture
```
Also makes the scrubber use the same region for auth that it performs its
operations in (not touching the hard-coded role name and start_url
values here, though they are not ideal).
Adds a new disk-usage-based eviction option, EvictionOrder, which selects
whether to use the current `AbsoluteAccessed` order or the newly proposed but
not yet tested `RelativeAccessed` order. Additionally, a fudge factor was
noticed while implementing this, which might help spare smaller tenants at the
expense of targeting larger tenants.
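A sketch of the option's shape; the variant names come from this description, everything else is illustrative:

```rust
/// Sketch of the new option controlling which order disk-usage-based
/// eviction considers layers in.
#[derive(Debug, Clone, Copy)]
pub enum EvictionOrder {
    /// Current behaviour: rank layers by absolute last-accessed time across tenants.
    AbsoluteAccessed,
    /// Proposed, not yet tested: rank by access time relative to the tenant's own
    /// layers, which may spare smaller tenants at the expense of larger ones.
    RelativeAccessed,
}
```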
Cc: #5304
Co-authored-by: Arpad Müller <arpad@neon.tech>
## Problem
Currently a chart of S3 error rate is misleading: it can show errors any
time we are attaching a tenant (probing for index_part generation,
checking for remote delete marker).
Considering 404 successful isn't perfectly elegant, but it enables the
error rate to be used as a more meaningful alert signal: it would
indicate if we were having auth issues, sending bad requests, getting
throttled, etc.
## Summary of changes
Track 404 requests in the AttemptOutcome::Ok bucket instead of the
AttemptOutcome::Err bucket.
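A minimal sketch of the classification change, with assumed names (the actual pageserver metric types may differ):

```rust
/// Sketch: classify an S3 response for the request-outcome metric.
/// 404 is expected while probing for index_part generations or delete markers
/// during attach, so it is counted as Ok rather than polluting the error rate.
#[derive(Debug, PartialEq)]
enum AttemptOutcome {
    Ok,
    Err,
}

fn classify_status(http_status: u16) -> AttemptOutcome {
    match http_status {
        200..=299 | 404 => AttemptOutcome::Ok,
        _ => AttemptOutcome::Err,
    }
}
```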
This is a precursor to:
- https://github.com/neondatabase/neon/pull/6185
While that PR contains big changes to neon_local and attachment_service,
this PR contains a few unrelated standalone changes generated while
working on that branch:
- Fix restarting a pageserver when it contains multiple shards for the
same tenant
- When using location_config api to attach a tenant, create its
timelines dir
- Update test paths where generations were previously optional to make
them always-on: this avoids tests having to spuriously assert that
attachment_service is not None in order to make the linter happy.
- Add a TenantShardId python implementation for subsequent use in test
helpers that will be made shard-aware
- Teach scrubber to read across shards when checking for layer
existence: this is a refactor to track the list of existent layers at
tenant-level rather than locally to each timeline. This is a precursor
to testing shard splitting.
Old methods are unchanged for backwards compatibility. Added
`SafekeeperDiscoveryRequest` and `SafekeeperDiscoveryResponse` types to
serve as an example, and also as a prerequisite for
https://github.com/neondatabase/neon/issues/5471
It wasn't being incremented.
Fixup of
commit 1c88824ed0
Author: Christian Schwarz <christian@neon.tech>
Date: Fri Dec 1 12:52:59 2023 +0100
initial logical size calculation: add a bunch of metrics (#5995)
There is still the default 'postgres' database, which may contain objects
owned by the role or some ACLs. We need to reassign objects in this
database too.
## Problem
If a customer deleted all databases and then tries to delete a role that
has some non-standard ACLs, the
`apply_config` operation will get stuck because of the failing role deletion.
## Problem
GCC produces a warning that the read result is not checked. It doesn't affect
program logic, but it is better to live without warnings.
## Summary of changes
Check read result.
## Problem
To understand differences in performance between Neon, Aurora and RDS, we
want to collect EXPLAIN ANALYZE plans and pg_stat_statements for
selected benchmarking runs.
## Summary of changes
Add workflow input options to collect EXPLAIN plans and pg_stat_statements for
the benchmarking workflow.
Co-authored-by: BodoBolero <bodobolero@gmail.com>
This squashes a particularly noisy warn-level log that occurs when
safekeepers are restarted.
Unfortunately the error type from `tonic` doesn't provide a neat way of
matching this, so we use a string comparison.
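In effect the filter is a substring check on the rendered error message, roughly like this sketch (the matched substrings and the helper name are assumptions, not the exact production strings):

```rust
/// Sketch: decide whether a broker/tonic error is just a safekeeper restart, in
/// which case it is logged quietly instead of at warn level. The substrings
/// matched here are illustrative placeholders, not the exact production strings.
fn is_expected_disconnect(rendered_error: &str) -> bool {
    rendered_error.contains("connection reset") || rendered_error.contains("broken pipe")
}
```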
Part of getpage@lsn benchmark epic:
https://github.com/neondatabase/neon/issues/5771
This allows getting the list of tenants and timelines without requesting the
timeline details API, which would trigger initial logical size calculation and
skew our results.
The PR does two things:
* move the util functions present in the remote_storage Azure and S3
test files into a shared one, deduplicating them.
* add an `s3_upload_download_works` test as a copy of the Azure test
The goal is mainly to fight duplication and make the code a little bit
more generic (like removing mentions of s3 and azure from function
names).
This is a first step towards #6146.
## Problem
If the user resets their password, the cache could receive this information only
after `ttl` minutes.
## Summary of changes
Invalidate password on auth failure.
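Conceptually this is just an extra eviction on the failure path; a minimal sketch with assumed names (the real proxy cache is more involved):

```rust
use std::collections::HashMap;

/// Illustrative cache of per-endpoint auth secrets (the real proxy cache is
/// more involved and keyed differently).
struct AuthCache {
    secrets: HashMap<String, String>,
}

impl AuthCache {
    /// On authentication failure, drop the cached secret immediately instead of
    /// waiting up to `ttl` minutes, so a freshly reset password takes effect.
    fn invalidate_on_auth_failure(&mut self, endpoint: &str) {
        self.secrets.remove(endpoint);
    }
}
```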
- No need to include c.h, port.h or pg_config.h, they are included in
postgres.h
- No need to include postgres.h in header files. Instead, the
assumption in PostgreSQL is that all .c files include postgres.h.
- Reorder includes to alphabetical order, and system headers before
pgsql headers
- Remove a bunch of other unnecessary includes that got copy-pasted from
one source file to another