Commit Graph

4468 Commits

Author SHA1 Message Date
dependabot[bot]
c70bf9150f build(deps): bump aiohttp from 3.9.0 to 3.9.2 (#6518) 2024-01-30 10:46:49 +00:00
Alexander Bayandin
8e4da52069 Compute: pgvector 0.6.0 (#6517)
Update pgvector extension from 0.5.1 to 0.6.0
2024-01-30 09:29:45 +00:00
Arthur Petukhovsky
2ff1a5cecd Patch safekeeper control file on HTTP request (#6455)
Closes #6397
2024-01-29 18:20:57 +00:00
Conrad Ludgate
ec8dcc2231 flatten proxy flow (#6447)
## Problem

Taking my ideas from https://github.com/neondatabase/neon/pull/6283 and
making somewhat less radical changes, in smaller commits.

Proxy flow was quite deeply nested, which makes adding more interesting
error handling quite tricky.

## Summary of changes

I recommend reviewing commit by commit.

1. move handshake logic into a separate file
2. move passthrough logic into a separate file
3. no longer accept a closure in CancelMap session logic
4. Remove connect_to_db, copy logic into handle_client
5. flatten auth_and_wake_compute in authenticate
6. record info for link auth
2024-01-29 17:38:03 +00:00
Arpad Müller
b844c6f0c7 Do pagination in list_object_versions call (#6500)
## Problem

The tenants we want to recover might have tens of thousands of keys, or
more. At that point, the AWS API returns a paginated response.

## Summary of changes

Support paginated responses for `list_object_versions` requests.

Follow-up of #6155, part of https://github.com/neondatabase/cloud/issues/8233
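A minimal sketch of following the pagination markers until the listing is complete. It does not use the AWS SDK. `VersionLister`, `Page` and `FakeLister` are hypothetical stand-ins, with field names modelled on the `IsTruncated`, `NextKeyMarker` and `NextVersionIdMarker` fields of the `ListObjectVersions` response.

```rust
/// One page of a listing, modelled loosely on a ListObjectVersions response.
struct Page {
    versions: Vec<String>,
    is_truncated: bool,
    next_key_marker: Option<String>,
    next_version_id_marker: Option<String>,
}

/// Hypothetical listing API: returns the page that starts after the given markers.
trait VersionLister {
    fn list_page(&self, key_marker: Option<&str>, version_id_marker: Option<&str>) -> Page;
}

/// Keep requesting pages, feeding back the "next" markers, until a page
/// reports that it is not truncated.
fn list_all_versions<L: VersionLister>(lister: &L) -> Vec<String> {
    let mut all = Vec::new();
    let mut key_marker: Option<String> = None;
    let mut version_id_marker: Option<String> = None;
    loop {
        let page = lister.list_page(key_marker.as_deref(), version_id_marker.as_deref());
        all.extend(page.versions);
        if !page.is_truncated {
            break;
        }
        key_marker = page.next_key_marker;
        version_id_marker = page.next_version_id_marker;
        // Defensive: a truncated page without markers would otherwise loop forever.
        if key_marker.is_none() && version_id_marker.is_none() {
            break;
        }
    }
    all
}

/// In-memory lister serving `items` in pages of `page_size`. For simplicity it
/// uses the index of the next item (as a string) as its key marker.
struct FakeLister {
    items: Vec<String>,
    page_size: usize,
}

impl VersionLister for FakeLister {
    fn list_page(&self, key_marker: Option<&str>, _version_id_marker: Option<&str>) -> Page {
        let start: usize = key_marker.map(|m| m.parse().unwrap()).unwrap_or(0);
        let end = (start + self.page_size).min(self.items.len());
        let is_truncated = end < self.items.len();
        Page {
            versions: self.items[start..end].to_vec(),
            is_truncated,
            next_key_marker: is_truncated.then(|| end.to_string()),
            next_version_id_marker: None,
        }
    }
}
```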
2024-01-29 17:59:26 +01:00
Alexander Bayandin
6a85a06e1b Compute: build rdkit without freetype support (#6495)
## Problem
The `rdkit` extension is built with `RDK_BUILD_FREETYPE_SUPPORT=ON` (the
default), which requires a number of additional dependencies, but FreeType
font support isn't needed for Postgres.


With `RDK_BUILD_FREETYPE_SUPPORT=ON`:
```
ldd /usr/local/pgsql/lib/rdkit.so
	linux-vdso.so.1 (0x0000ffff82ea8000)
	libfreetype.so.6 => /usr/lib/aarch64-linux-gnu/libfreetype.so.6 (0x0000ffff825e5000)
	libboost_serialization.so.1.74.0 => /usr/lib/aarch64-linux-gnu/libboost_serialization.so.1.74.0 (0x0000ffff82590000)
	libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000ffff8255f000)
	libstdc++.so.6 => /usr/lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000ffff82387000)
	libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffff822dc000)
	libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ffff822b8000)
	libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffff82144000)
	libpng16.so.16 => /usr/lib/aarch64-linux-gnu/libpng16.so.16 (0x0000ffff820fd000)
	libz.so.1 => /lib/aarch64-linux-gnu/libz.so.1 (0x0000ffff820d3000)
	libbrotlidec.so.1 => /usr/lib/aarch64-linux-gnu/libbrotlidec.so.1 (0x0000ffff820b8000)
	/lib/ld-linux-aarch64.so.1 (0x0000ffff82e78000)
	libbrotlicommon.so.1 => /usr/lib/aarch64-linux-gnu/libbrotlicommon.so.1 (0x0000ffff82087000)
```

With `RDK_BUILD_FREETYPE_SUPPORT=OFF`:
```
ldd /usr/local/pgsql/lib/rdkit.so
	linux-vdso.so.1 (0x0000ffffbba75000)
	libboost_serialization.so.1.74.0 => /usr/lib/aarch64-linux-gnu/libboost_serialization.so.1.74.0 (0x0000ffffbb259000)
	libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000ffffbb228000)
	libstdc++.so.6 => /usr/lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000ffffbb050000)
	libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffffbafa5000)
	libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ffffbaf81000)
	libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffffbae0d000)
	/lib/ld-linux-aarch64.so.1 (0x0000ffffbba45000)
```

## Summary of changes
- Build `rdkit` with `RDK_BUILD_FREETYPE_SUPPORT=OFF`
- Remove extra dependencies from the Compute image
2024-01-29 16:16:37 +00:00
John Spray
b04a6acd6c docker: add attachment_service binary (#6506)
## Problem

Creating sharded tenants will require an instance of the sharding
service -- the initial goal is to deploy one of these in a staging
region (https://github.com/neondatabase/cloud/issues/9718). It will run
as a kubernetes container, similar to the storage broker, so it needs to be
built into the container image.

## Summary of changes

Add `attachment_service` binary to container image
2024-01-29 13:31:56 +00:00
Vlad Lazar
0c7b89235c pageserver: add range layer map search implementation (#6469)
## Problem
There's no efficient way of querying the layer map for a range.

## Summary of changes
Introduce a range query for the layer map (`LayerMap::range_search`).
There are two broad steps to it:
1. Find all coverage changes for layers that intersect the queried range
(see `LayerCoverage::range_overlaps`).
The slightly tricky part is dealing with the start of the range: it may
or may not be aligned with a layer, and the two cases need to be
treated differently.
2. Iterate over the coverage changes and collect the result. For this we
use a two-pointer approach: the trailing pointer tracks the start of the
current range (the current location in the key space) and the forward
pointer tracks the next coverage change.
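The two steps can be illustrated with a simplified model. It assumes coverage is stored as a sorted map from each key where coverage changes to the layer (if any) covering keys from there on. `Coverage` and the `u32` layer IDs are hypothetical. This sketch also folds the aligned and unaligned start cases into a single "last change at or before the start" lookup rather than reproducing `LayerCoverage::range_overlaps`.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

/// Simplified coverage map: an entry `k -> layer` means that from key `k` up to
/// the next entry, the topmost covering layer is `layer` (or none).
struct Coverage {
    changes: BTreeMap<u64, Option<u32>>,
}

impl Coverage {
    /// Return the covered sub-ranges of `query`, each paired with its layer.
    fn range_search(&self, query: Range<u64>) -> Vec<(Range<u64>, u32)> {
        let mut result = Vec::new();
        if query.start >= query.end {
            return result;
        }
        // Step 1: coverage in effect at the start comes from the last change at
        // or before it, whether or not the start is aligned with a change.
        let mut current = self
            .changes
            .range(..=query.start)
            .next_back()
            .and_then(|(_, layer)| *layer);
        // Trailing pointer: start of the sub-range being built.
        let mut trailing = query.start;
        // Step 2, forward pointer: walk the changes strictly inside the query.
        for (&key, &layer) in self.changes.range(query.start + 1..query.end) {
            if let Some(l) = current {
                result.push((trailing..key, l));
            }
            trailing = key;
            current = layer;
        }
        // Close the last sub-range at the end of the query.
        if let Some(l) = current {
            result.push((trailing..query.end, l));
        }
        result
    }
}
```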

Plugging the range search into the read path is deferred to a future PR.

## Performance
I adapted the layer map benchmarks on a local branch. Range searches are
between 2x and 2.5x slower than point searches. That's in line with what I
expected, since we query the layer map twice.

Since `Timeline::get` will proxy to `Timeline::get_vectored`, we can
special-case the single-element layer map range search
at that point.
2024-01-29 09:47:12 +00:00
Joonas Koivunen
1e9a50bca8 disk_usage_eviction_task: cleanup summaries (#6490)
This is the "partial revert" of #6384. The summaries turned out to be
expensive due to naive vec usage, but also inconclusive because of the
additional context required. In addition to removing summary traces,
some small refactoring is done.
2024-01-29 10:38:40 +02:00
Conrad Ludgate
511e730cc0 hll experiment (#6312)
## Problem

Measuring cardinality using logs is expensive and slow.

## Summary of changes

Implement a pre-aggregated HyperLogLog-based cardinality estimate.
HyperLogLog estimates the cardinality of a set using the fact that the
probability of a uniformly distributed hash ending in a run of `n` 0s is
`1/2^n`; therefore, having observed a run of `n` 0s suggests we have
seen about `2^n` distinct values. By using multiple shards and combining
them with the harmonic mean, we get a more accurate estimate.

We record this into a Prometheus time-series. HyperLogLog counts can be
merged by taking the `max` of each shard. We can apply `max_over_time`
in order to estimate the cardinality of distinct values over time.
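Here is a minimal, self-contained HyperLogLog sketch of the idea above. The low bits of the hash choose a shard (register), each register keeps the longest run of trailing zeros seen (plus one), shards are merged with `max`, and the estimate uses a harmonic mean. It assumes std's `DefaultHasher` as the hash function and adds the standard small-range correction from the original HyperLogLog paper. Neither detail is taken from the proxy's implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Number of index bits: 2^P = 1024 shards (registers).
const P: u32 = 10;
const M: usize = 1 << P;

struct HyperLogLog {
    registers: [u8; M],
}

impl HyperLogLog {
    fn new() -> Self {
        HyperLogLog { registers: [0; M] }
    }

    fn insert<T: Hash>(&mut self, value: &T) {
        let mut hasher = DefaultHasher::new();
        value.hash(&mut hasher);
        let hash = hasher.finish();
        // The low P bits choose the shard.
        let index = (hash as usize) & (M - 1);
        // Run of trailing zeros in the remaining 64 - P bits, plus one.
        let rest = hash >> P;
        let rank = (rest.trailing_zeros().min(64 - P) + 1) as u8;
        self.registers[index] = self.registers[index].max(rank);
    }

    /// Merge another sketch by taking the max of each shard.
    fn merge(&mut self, other: &HyperLogLog) {
        for (mine, theirs) in self.registers.iter_mut().zip(other.registers.iter()) {
            *mine = (*mine).max(*theirs);
        }
    }

    /// Harmonic-mean estimate, with linear counting for small cardinalities.
    fn estimate(&self) -> f64 {
        let m = M as f64;
        let alpha = 0.7213 / (1.0 + 1.079 / m);
        let sum: f64 = self.registers.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
        let raw = alpha * m * m / sum;
        let zeros = self.registers.iter().filter(|&&r| r == 0).count();
        if raw <= 2.5 * m && zeros > 0 {
            m * (m / zeros as f64).ln()
        } else {
            raw
        }
    }
}
```

With 1024 shards the relative standard error is roughly 1.04/sqrt(1024), about 3%.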
2024-01-29 07:26:20 +00:00
Konstantin Knizhnik
c1148dc9ac Fix calculation of maximal multixact in ingest_multixact_create_record (#6502)
## Problem

See https://neondb.slack.com/archives/C06F5UJH601/p1706373716661439

## Summary of changes

Use None instead of 0 as initial accumulator value for calculating
maximal multixact XID.
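The commit doesn't spell out why `0` was a bad starting value. The following is one plausible illustration, assuming IDs are compared circularly in the style of PostgreSQL's `MultiXactIdPrecedes`. Under such a comparison `0` is not a neutral element, so a fold that starts from it can ignore IDs in the upper half of the 32-bit space. A fold that starts from `None` cannot. Function names here are hypothetical.

```rust
/// Circular "a precedes b" for 32-bit IDs: the difference is read as signed.
fn precedes(a: u32, b: u32) -> bool {
    (a.wrapping_sub(b) as i32) < 0
}

/// Starting the fold from 0: under circular comparison, 0 can appear to
/// "follow" large IDs, so they never replace the accumulator.
fn max_multixact_from_zero(ids: &[u32]) -> u32 {
    ids.iter().fold(0, |acc, &id| if precedes(acc, id) { id } else { acc })
}

/// Starting from `None`: the first ID always becomes the accumulator.
fn max_multixact(ids: &[u32]) -> Option<u32> {
    ids.iter().fold(None, |acc, &id| match acc {
        Some(cur) if !precedes(cur, id) => Some(cur),
        _ => Some(id),
    })
}
```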

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-01-29 07:39:16 +02:00
Anna Khanova
8253cf1931 proxy: Relax endpoint check (#6503)
## Problem

SQL-over-HTTP allows the host to be in the format `api.aws...`; however,
that is not the case for the websocket flow.

## Summary of changes

Relax endpoint check for the ws serverless connections.
2024-01-28 21:27:14 +00:00
Christian Schwarz
3a82430432 fixup(#6492): also switch the benchmarks that runs on merge-to-main back to std-fs (#6501) 2024-01-28 00:15:11 +01:00
Arpad Müller
734755eaca Enable nextest retries for the arm build (#6496)
Also make the NEXTEST_RETRIES declaration more local.

Requested in https://github.com/neondatabase/neon/pull/6493#issuecomment-1912110202
2024-01-27 05:16:11 +01:00
Christian Schwarz
e34166a28f CI: switch back to std-fs io engine for soak time before next release (#6492)
PR #5824 introduced the concept of io engines in pageserver and
implemented `tokio-epoll-uring` in addition to our current method,
`std-fs`.

We used `tokio-epoll-uring` in CI for a day to get more exposure to
the code. Now it's time to switch CI back so that we test with `std-fs`
as well, because that's what we're (still) using in production.
2024-01-26 22:48:34 +01:00
Christian Schwarz
3a36a0a227 fix(test suite): some tests leak child processes (#6497) 2024-01-26 18:23:53 +00:00
John Spray
58f6cb649e control_plane: database persistence for attachment_service (#6468)
## Problem

Spun off from https://github.com/neondatabase/neon/pull/6394: this PR
contains just the persistence parts and the changes that enable them to
work nicely.


## Summary of changes

- Revert #6444 and #6450
- In neon_local, start a vanilla postgres instance for the attachment
service to use.
- Adopt `diesel` crate for database access in attachment service. This
uses raw SQL migrations as the source of truth for the schema, so it's a
soft dependency: we can switch libraries pretty easily.
- Rewrite persistence.rs to use postgres (via diesel) instead of JSON.
- Preserve JSON read+write at startup and shutdown: this enables using
the JSON format in compatibility tests, so that we don't have to commit
to our DB schema yet.
- In neon_local, run database creation + migrations before starting
attachment service
- Run the initial reconciliation in Service::spawn in the background, so
that the pageserver + attachment service don't get stuck waiting for
each other to start, when restarting both together in a test.
2024-01-26 17:20:44 +00:00
Arpad Müller
dcc7610ad6 Do backoff::retry in s3 timetravel test (#6493)
The top level retries weren't enough, probably because we do so many
network requests. Fine grained retries ensure that there is higher
potential for the entire test to succeed.

To demonstrate this, consider the following example: let's assume that
each request has a 5% chance of failing and we do 10 requests. Then the
chance of success without any retries is 0.95^10 = 0.6. With 3 top
level retries it is 1-0.4^3 = 0.936. With 3 fine grained retries it is
(1-0.05^3)^10 = 0.9988 (values rounded). So the chance of failure is
6.4% for the top level retry vs 0.12% for the fine grained retry.
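The arithmetic above can be checked with a few lines, assuming independent failures. The function names are illustrative.

```rust
/// Success probability when each of `requests` requests fails independently
/// with probability `p_fail`, and the whole sequence gets up to `attempts` tries.
fn top_level_success(p_fail: f64, requests: i32, attempts: i32) -> f64 {
    let one_run = (1.0 - p_fail).powi(requests);
    1.0 - (1.0 - one_run).powi(attempts)
}

/// Success probability when each individual request gets up to `attempts` tries.
fn fine_grained_success(p_fail: f64, requests: i32, attempts: i32) -> f64 {
    (1.0 - p_fail.powi(attempts)).powi(requests)
}
```

Unrounded, these give about 0.599 (no retries), 0.935 (top level) and 0.99875 (fine grained), which round to the figures quoted above.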

Follow-up of #6155
2024-01-26 16:43:56 +00:00
Alexander Bayandin
4c245b0f5a update_build_tools_image.yml: Push build-tools image to Docker Hub (#6481)
## Problem

- `docker.io/neondatabase/build-tools:pinned` image is frequently
outdated on Docker Hub because there's no automated way to update it.
- `update_build_tools_image.yml` workflow contains legacy roll-back
logic, which is not required anymore because it updates only a single
image.

## Summary of changes
- Make `update_build_tools_image.yml` workflow push images to both ECR
and Docker Hub
- Remove unneeded roll-back logic
2024-01-26 16:12:49 +00:00
John Spray
55b7cde665 tests: add basic coverage for sharding (#6380)
## Problem

The support for sharding in the pageserver was written before
https://github.com/neondatabase/neon/pull/6205 landed, so when it landed
we couldn't directly test sharding.

## Summary of changes

- Add `test_sharding_smoke` which tests the basics of creating a
sharded tenant, creating a timeline within it, and checking that data
within it is distributed.
- Add modes to pg_regress tests for running with 4 shards as well as
with 1.
2024-01-26 14:40:47 +00:00
Vlad Lazar
5b34d5f561 pageserver: add vectored get latency histogram (#6461)
This patch introduces a new set of grafana metrics for a histogram:
pageserver_get_vectored_seconds_bucket{task_kind="Compaction|PageRequestHandler"}.

While it has a `task_kind` label, only compaction and SLRU fetches are
tracked. This reduces the increase in cardinality to 24.

The metric should allow us to isolate performance regressions while the
vectored get is being implemented. Once the implementation is
complete, it'll also allow us to quantify the improvements.
2024-01-26 13:40:03 +00:00
Alexander Bayandin
26c55b0255 Compute: fix rdkit extension build (#6488)
## Problem

The `rdkit` extension build started to fail because of a changed checksum
of the Comic Neue font:

```
Downloading https://fonts.google.com/download?family=Comic%20Neue...
CMake Error at Code/cmake/Modules/RDKitUtils.cmake:257 (MESSAGE):
  The md5 checksum for /rdkit-src/Code/GraphMol/MolDraw2D/Comic_Neue.zip is
  incorrect; expected: 850b0df852f1cda4970887b540f8f333, found:
  b7fd0df73ad4637504432d72a0accb8f
```

https://github.com/neondatabase/neon/actions/runs/7666530536/job/20895534826

Ref https://neondb.slack.com/archives/C059ZC138NR/p1706265392422469

## Summary of changes
- Disable comic fonts for `rdkit` extension
2024-01-26 12:39:20 +00:00
Vadim Kharitonov
12e9b2a909 Update plv8 (#6465) 2024-01-26 09:56:11 +00:00
Christian Schwarz
918b03b3b0 integrate tokio-epoll-uring as alternative VirtualFile IO engine (#5824) 2024-01-26 09:25:07 +01:00
Alexander Bayandin
d36623ad74 CI: cancel old e2e-tests on new commits (#6463)
## Problem

A triggered `e2e-tests` job is not cancelled along with the other jobs in a
PR when the PR gets new commits. We can improve the situation by setting
`concurrency_group` for the remote workflow
(https://github.com/neondatabase/cloud/pull/9622 adds a
`concurrency_group` input to the remote workflow).

Ref https://neondb.slack.com/archives/C059ZC138NR/p1706087124297569

Cloud's part added in https://github.com/neondatabase/cloud/pull/9622

## Summary of changes
- Set `concurrency_group` parameter when triggering `e2e-tests`
- At the beginning of a CI pipeline, trigger Cloud's
`cancel-previous-in-concurrency-group.yml` workflow which cancels
previously triggered e2e-tests
2024-01-25 19:25:29 +00:00
Christian Schwarz
689ad72e92 fix(neon_local): leaks child process if it fails to start & pass checks (#6474)
refs https://github.com/neondatabase/neon/issues/6473

Before this PR, if process_started() didn't return Ok(true) until we
ran out of retries, we'd return an error but leave the process running.

Try it by adding a 20s sleep to the pageserver `main()`, e.g., right
before we claim the pidfile.

Without this PR, the output looks like this:

```
(.venv) cs@devvm-mbp:[~/src/neon-work-2]: ./target/debug/neon_local start
Starting neon broker at 127.0.0.1:50051.
storage_broker started, pid: 2710939
.
attachment_service started, pid: 2710949
Starting pageserver node 1 at '127.0.0.1:64000' in ".neon/pageserver_1".....
pageserver has not started yet, continuing to wait.....
pageserver 1 start failed: pageserver did not start in 10 seconds
No process is holding the pidfile. The process must have already exited. Leave in place to avoid race conditions: ".neon/pageserver_1/pageserver.pid"
No process is holding the pidfile. The process must have already exited. Leave in place to avoid race conditions: ".neon/safekeepers/sk1/safekeeper.pid"
Stopping storage_broker with pid 2710939 immediately.......
storage_broker has not stopped yet, continuing to wait.....
neon broker stop failed: storage_broker with pid 2710939 did not stop in 10 seconds
Stopping attachment_service with pid 2710949 immediately.......
attachment_service has not stopped yet, continuing to wait.....
attachment service stop failed: attachment_service with pid 2710949 did not stop in 10 seconds
```

and the pageserver process is leaked:

```
(.venv) cs@devvm-mbp:[~/src/neon-work-2]: ps aux | grep pageserver
cs       2710959  0.0  0.2 2377960 47616 pts/4   Sl   14:36   0:00 /home/cs/src/neon-work-2/target/debug/pageserver -D .neon/pageserver_1 -c id=1 -c pg_distrib_dir='/home/cs/src/neon-work-2/pg_install' -c http_auth_type='Trust' -c pg_auth_type='Trust' -c listen_http_addr='127.0.0.1:9898' -c listen_pg_addr='127.0.0.1:64000' -c broker_endpoint='http://127.0.0.1:50051/' -c control_plane_api='http://127.0.0.1:1234/' -c remote_storage={local_path='../local_fs_remote_storage/pageserver'}
```

After this PR, there is no leaked process.
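The idea behind the fix can be sketched as follows; this is not the neon_local code itself. If the readiness check never succeeds within the retry budget, kill and reap the child before returning the error. `spawn_until_ready` and its `is_ready` callback (standing in for `process_started()`) are hypothetical names.

```rust
use std::process::{Child, Command};
use std::thread::sleep;
use std::time::Duration;

/// Spawn `cmd`, then poll `is_ready` up to `retries` times. If the process never
/// becomes ready, kill and reap it before returning the error, so it isn't leaked.
fn spawn_until_ready<F>(
    cmd: &mut Command,
    mut is_ready: F,
    retries: u32,
    delay: Duration,
) -> Result<Child, String>
where
    F: FnMut(&mut Child) -> bool,
{
    let mut child = cmd.spawn().map_err(|e| format!("spawn failed: {e}"))?;
    for _ in 0..retries {
        if is_ready(&mut child) {
            return Ok(child);
        }
        sleep(delay);
    }
    // Out of retries: don't leave the process running behind our back.
    let _ = child.kill();
    let _ = child.wait();
    Err(format!("process {} did not start after {retries} attempts", child.id()))
}
```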
2024-01-25 19:20:02 +01:00
Christian Schwarz
fd4cce9417 test_pageserver_max_throughput_getpage_at_latest_lsn: remove n_tenants=100 combination (#6477)
Need to fix the neon_local timeouts first
(https://github.com/neondatabase/neon/issues/6473)
and also not run them on every merge, but only nightly:
https://github.com/neondatabase/neon/issues/6476
2024-01-25 18:17:53 +00:00
Arpad Müller
d52b81340f S3 based recovery (#6155)
Adds a new `time_travel_recover` function to the `RemoteStorage` trait
that allows time-travel-like functionality for S3 buckets, regardless of
their content (it is not even pageserver related). It takes a different
approach from the more complicated one in [this
post](https://aws.amazon.com/blogs/storage/point-in-time-restore-for-amazon-s3-buckets/).

It takes as input a prefix, a target timestamp, and a limit timestamp:

* executes [`ListObjectVersions`](https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectVersions.html)
* obtains the latest version that comes before the target timestamp
* copies that latest version to the same prefix
* if there are versions newer than the limit timestamp, it doesn't do
anything for the file

The limit timestamp is meant to be some timestamp before the start of
the recovery operation and after any changes that one wants to revert.
For example, it might be the time point after a tenant was detached from
all involved pageservers. The limiting mechanism ensures that the
operation is idempotent and can be retried without causing additional
writes/copies.
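The per-key decision described above can be sketched as a pure function, assuming each object's versions are already listed with their modification times. `Version`, `Action` and `plan_key` are hypothetical, and delete markers are ignored. Leaving a key alone when it has no version before the target is an assumption of this sketch; the description above does not specify that case.

```rust
/// One listed version of an object; timestamps are plain seconds here.
#[derive(Clone, Debug, PartialEq)]
struct Version {
    version_id: String,
    last_modified: u64,
}

#[derive(Debug, PartialEq)]
enum Action {
    /// Copy this version over the key, making it the latest again.
    Copy(String),
    /// Leave the key alone.
    Skip,
}

/// Decide what to do for one key, given all of its versions.
fn plan_key(versions: &[Version], target: u64, limit: u64) -> Action {
    // Any version newer than the limit means the key was already handled
    // (for example by an earlier run of this recovery): do nothing.
    if versions.iter().any(|v| v.last_modified > limit) {
        return Action::Skip;
    }
    // Otherwise restore the latest version from before the target timestamp.
    match versions
        .iter()
        .filter(|v| v.last_modified < target)
        .max_by_key(|v| v.last_modified)
    {
        Some(v) => Action::Copy(v.version_id.clone()),
        None => Action::Skip, // assumption of this sketch
    }
}
```

Because the limit lies before the start of the recovery, a copy made by a first run is itself newer than the limit, so a retried run skips that key.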

The approach fulfills all the requirements laid out in cloud issue 8233,
and is a recoverable operation. Nothing is deleted permanently; only new
entries are added to the version log.

I also enable [nextest retries](https://nexte.st/book/retries.html) to
help with some general S3 flakiness (on top of low level retries).

Part of https://github.com/neondatabase/cloud/issues/8233
2024-01-25 18:23:18 +01:00
Joonas Koivunen
8dee9908f8 fix(compaction_task): wrong log levels (#6442)
Filter what we log in the compaction task. Per discussion in the last
triage call, fix these by introducing and inspecting the root cause within
anyhow::Error instead of rolling out proper conversions.

Fixes: #6365
Fixes: #6367
2024-01-25 18:45:17 +02:00
Konstantin Knizhnik
19ed230708 Add support for PS sharding in compute (#6205)
refer #5508

replaces #5837

## Problem

This PR implements sharding support on the compute side. Relations are
split into stripes, and `get_page` requests are redirected to the
particular shard where the stripe is located. All other requests (i.e. get
relation or database size) are always sent to shard 0.

## Summary of changes

Sharding support on the compute side includes three things:
1. Make it possible to specify, and change at runtime, connections to more
than one pageserver
2. Send `get_page` requests to the particular shard (determined by a hash
of the page key)
3. Support multiple servers in prefetch ring requests
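A sketch of the routing rule, assuming a fixed stripe size in blocks and a hash of (relation, stripe number) to pick the shard. `RelTag`, `ShardMap` and the use of std's `DefaultHasher` are illustrative assumptions. The compute code's actual key layout and hash function are not described here and likely differ.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical relation identifier.
#[derive(Hash, Clone, Copy)]
struct RelTag {
    spc_oid: u32,
    db_oid: u32,
    rel_number: u32,
}

struct ShardMap {
    shard_count: u32,
    stripe_size: u32, // blocks per stripe; must be non-zero
}

impl ShardMap {
    /// Shard for a `get_page` request: all blocks of one stripe share a shard.
    fn shard_for_page(&self, rel: RelTag, block_no: u32) -> u32 {
        if self.shard_count <= 1 {
            return 0;
        }
        let stripe = block_no / self.stripe_size;
        let mut h = DefaultHasher::new();
        rel.hash(&mut h);
        stripe.hash(&mut h);
        (h.finish() % self.shard_count as u64) as u32
    }

    /// Other requests (relation size, database size, ...) go to shard 0.
    fn shard_for_metadata(&self) -> u32 {
        0
    }
}
```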

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: John Spray <john@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-01-25 15:53:31 +02:00
Joonas Koivunen
463b6a26b5 test: show relative order eviction with "fast growing tenant" (#6377)
Refactor out test_disk_usage_eviction tenant creation and add a custom
case with 4 tenants, 3 made with pgbench scale=1 and 1 made with pgbench
scale=4.

Because the tenants are created in order of scales [1, 1, 1, 4], this is
simple enough to demonstrate the problem with using absolute access
times: on a disk usage based eviction run we will
disproportionately target the *first* scale=1 tenant(s), and the later
larger tenant does not lose anything.

This test is not enough to show the difference between `relative_equal`
and `relative_spare` (the fudge factor); much larger scale will be
needed for "the large tenant", but that will make debug mode tests
slower.

Cc: #5304
2024-01-25 15:38:28 +02:00
John Spray
c9b1657e4c pageserver: fixes for creation operations overlapping with shutdown/startup (#6436)
## Problem

For #6423, creating a reproducer turned out to be very easy, as an
extension to test_ondemand_activation.

However, before I had diagnosed the issue, I started with a more
brute-force approach of running creation API calls in the background
while restarting a pageserver, and that turned up a bunch of other
interesting issues.

In this PR:
- Add the reproducer for #6423 by extending `test_ondemand_activation`
(confirmed that this test fails if I revert the fix from
https://github.com/neondatabase/neon/pull/6430)
- In timeline creation, return 503 responses when we get an error and
the tenant's cancellation token is set: this covers the cases where we
get an anyhow::Error from something during timeline creation as a result
of shutdown.
- While waiting for tenants to become active during creation, don't
.map_err() the result to a 500: instead let the `From` impl map the
result to something appropriate (this includes mapping shutdown to 503)
- During tenant creation, we were calling `Tenant::load_local` because
no Preload object is provided. This is usually harmless because the
tenant dir is empty, but if there are some half-created timelines in
there, bad things can happen. Propagate the SpawnMode into
Tenant::attach, so that it can properly skip _any_ attempt to load
timelines if creating.
- When we call upsert_location, there's a SpawnMode that tells us
whether to load from remote storage or not. But if the operation is a
retry and we already have the tenant, it is not correct to skip loading
from remote storage: there might be a timeline there. This isn't
strictly a correctness issue as long as the caller behaves correctly
(does not assume that any timelines are persistent until the creation is
acked), but it's a more defensive position.
- If we shut down while the task in Tenant::attach is running, it can
end up spawning rogue tasks. Fix this by holding a GateGuard through
here, and in upsert_location shutting down a tenant after calling
tenant_spawn if we can't insert it into tenants_map. This restores the
expected behavior that, after shutdown_all_tenants returns, no tenant
tasks are running.
- Add `test_create_churn_during_restart`, which runs tenant & timeline
creations across pageserver restarts.
- Update a couple of tests that covered cancellation, to reflect the
cleaner errors we now return.
2024-01-25 12:35:52 +00:00
Arpad Müller
b92be77e19 Make RemoteStorage not use async_trait (#6464)
Makes the `RemoteStorage` trait not be based on `async_trait` any more.

To avoid recursion in async (not supported by Rust), we made
`GenericRemoteStorage` generic on the "Unreliable" variant. That allows
us to have the unreliable wrapper never contain/call itself.

related earlier work: #6305
2024-01-24 21:27:54 +01:00
Arthur Petukhovsky
8cb8c8d7b5 Allow remove_wal.rs to run on inactive timelines (#6462)
Temporarily enable it on staging to help with
https://github.com/neondatabase/neon/issues/6403.
It can also be deployed to prod if it works well on staging.
2024-01-24 16:48:56 +00:00
Conrad Ludgate
210700d0d9 proxy: add newtype wrappers for string based IDs (#6445)
## Problem

Too many string-based IDs; it's easy to mix up ID types.

## Summary of changes

Add a bunch of `SmolStr` wrappers that keep the convenience methods but
are type-safe.
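A sketch of the newtype pattern, generating one wrapper type per ID kind with a small macro. It wraps `String` rather than `SmolStr` to stay dependency-free, and the `EndpointId`/`RoleName` names are illustrative. The point is that mixing up ID kinds becomes a compile-time error.

```rust
/// Define a string-backed ID newtype with a few convenience methods.
macro_rules! string_id {
    ($name:ident) => {
        #[derive(Clone, Debug, PartialEq, Eq, Hash)]
        pub struct $name(String);

        impl $name {
            pub fn as_str(&self) -> &str {
                &self.0
            }
        }

        impl From<&str> for $name {
            fn from(s: &str) -> Self {
                $name(s.to_owned())
            }
        }

        impl std::fmt::Display for $name {
            fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
                f.write_str(&self.0)
            }
        }
    };
}

// Hypothetical ID kinds.
string_id!(EndpointId);
string_id!(RoleName);

/// Only accepts an endpoint ID: passing a `RoleName` does not compile.
fn describe_endpoint(ep: &EndpointId) -> String {
    format!("endpoint {ep}")
}
```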
2024-01-24 16:38:10 +00:00
Joonas Koivunen
a0a3ba85e7 fix(page_service): walredo logging problem (#6460)
Fixes: #6459 by logging the full chain of causes of an error, while
keeping only the top-level message for the end user.

Changes the user-visible error detail from:

```
-DETAIL:  page server returned error: Read error: Failed to reconstruct a page image:
+DETAIL:  page server returned error: Read error
```

However, in the pageserver logs:

```
-ERROR page_service_conn_main{...}: error reading relation or page version: Read error: Failed to reconstruct a page image:
+ERROR page_service_conn_main{...}: error reading relation or page version: Read error: reconstruct a page image: launch walredo process: spawn process: Permission denied (os error 13)
```
2024-01-24 15:47:17 +00:00
Arpad Müller
d820aa1d08 Disable initdb cancellation (#6451)
## Problem

The initdb cancellation added in #5921 is not sufficient to reliably
abort the entire initdb process. Initdb also spawns children. The tests
added by #6310 (#6385) and #6436 now do initdb cancellations on a more
regular basis.

In #6385, I attempted to issue `killpg` (after giving it a new process
group ID) to kill not just the initdb but all its spawned subprocesses,
but this didn't work. Initdb doesn't take *that* long in the end either,
so we just wait until it concludes.

## Summary of changes

* revert initdb cancellation support added in #5921
* still return `Err(Cancelled)` upon cancellation, but this is just to
not have to remove the cancellation infrastructure
* fixes to the `test_tenant_delete_races_timeline_creation` test to make
it reliably pass

Fixes #6385
2024-01-24 13:06:05 +01:00
Christian Schwarz
996abc9563 pagebench-based GetPage@LSN performance test (#6214) 2024-01-24 12:51:53 +01:00
John Spray
a72af29d12 control_plane/attachment_service: implement PlacementPolicy::Detached (#6458)
## Problem

The API for detaching tenants wasn't implemented yet, but one could hit
this case indirectly from tests when using attach-hook, and find tenants
unexpectedly attached again because their policy remained Single.

## Summary of changes

Add PlacementPolicy::Detached, and:
- add the behavior for it in schedule()
- in tenant_migrate, refuse if the policy is detached
- automatically set this policy in attach-hook if the caller has
specified pageserver=null.
2024-01-24 12:49:30 +01:00
Sasha Krassovsky
4f51824820 Fix creating publications for all tables 2024-01-23 22:41:00 -08:00
Christian Schwarz
743f6dfb9b fix(attachment_service): corrupted attachments.json when parallel requests (#6450)
The pagebench integration PR (#6214) issues attachment requests in
parallel. We observed corrupted attachments.json from time to time,
especially in the test cases with high tenant counts.

The atomic overwrite added in #6444 exposed the root cause cleanly:
the `.commit()` calls of two request handlers could interleave or
be reordered. See also:
https://github.com/neondatabase/neon/pull/6444#issuecomment-1906392259

This PR makes changes to the `persistence` module to fix above race:
- mpsc queue for PendingWrites
- one writer task performs the writes in mpsc queue order
- request handlers that need to do writes do it using the
  new `mutating_transaction` function.

`mutating_transaction`, while holding the lock, does the modifications,
serializes the post-modification state, and pushes that as a
`PendingWrite` into the mpsc queue. It then releases the lock and
`await`s the completion of the write. The writer task executes the
`PendingWrites` in queue order. Once a write has been executed, it wakes
the waiting tokio task.
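Here is a thread-based sketch of the scheme. It uses std threads and channels in place of the tokio tasks and async mpsc queue described above. Modifications and enqueueing happen under one lock, so queue order matches modification order, and a single writer applies the writes in that order. `Persistence`, `Inner` and `write_fn` are hypothetical, with a comma-joined `Vec<String>` standing in for the serialized state.

```rust
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::thread;

/// A serialized snapshot to persist, plus a channel used to signal completion.
struct PendingWrite {
    bytes: Vec<u8>,
    done: Sender<()>,
}

/// State and queue sender share one lock, so enqueue order = modification order.
struct Inner {
    state: Vec<String>,
    queue: Sender<PendingWrite>,
}

struct Persistence {
    inner: Mutex<Inner>,
}

impl Persistence {
    /// Spawn the single writer thread. `write_fn` stands in for writing
    /// attachments.json and is only called from that thread, in queue order.
    fn new<W: FnMut(&[u8]) + Send + 'static>(mut write_fn: W) -> Arc<Self> {
        let (tx, rx) = channel::<PendingWrite>();
        thread::spawn(move || {
            for pending in rx {
                write_fn(&pending.bytes);
                // Wake the waiting caller; ignore it if it has gone away.
                let _ = pending.done.send(());
            }
        });
        Arc::new(Persistence {
            inner: Mutex::new(Inner { state: Vec::new(), queue: tx }),
        })
    }

    /// Modify, serialize and enqueue while holding the lock; then release the
    /// lock and wait for the writer to finish this particular write.
    fn mutating_transaction(&self, f: impl FnOnce(&mut Vec<String>)) {
        let (done_tx, done_rx) = channel();
        {
            let mut inner = self.inner.lock().unwrap();
            f(&mut inner.state);
            let bytes = inner.state.join(",").into_bytes();
            inner
                .queue
                .send(PendingWrite { bytes, done: done_tx })
                .expect("writer thread exited");
        } // lock released here
        done_rx.recv().expect("writer dropped the completion channel");
    }
}
```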
2024-01-23 19:14:32 +00:00
Arpad Müller
faf275d4a2 Remove initdb on timeline delete (#6387)
This PR:

* makes `initdb.tar.zst` be deleted by default on timeline deletion
(#6226), mirroring the safekeeper:
https://github.com/neondatabase/neon/pull/6381
* adds a new `preserve_initdb_archive` endpoint for a timeline, to be
used during the disaster recovery process, see reasoning
[here](https://github.com/neondatabase/neon/issues/6226#issuecomment-1894574778)
* makes the creation code look for `initdb-preserved.tar.zst` in
addition to `initdb.tar.zst`.
* makes the tests use the new endpoint

fixes #6226
2024-01-23 18:22:59 +00:00
Vlad Lazar
001f0d6db7 pageserver: fix import failure caused by merge race (#6448)
PR #6406 raced with #6372 and broke main.
2024-01-23 18:07:01 +01:00
Christian Schwarz
42c17a6fc6 attachment_service: use atomic overwrite to persist attachments.json (#6444)
The pagebench integration PR (#6214) is the first to SIGQUIT & then
restart attachment_service.

With many tenants (100), we have found frequent failures on restart in
the CI[^1].

[^1]:
[Allure](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6214/7615750160/index.html#suites/e26265675583c610f99af77084ae58f1/851ff709578c4452/)

```
2024-01-22T19:07:57.932021Z  INFO request{method=POST path=/attach-hook request_id=2697503c-7b3e-4529-b8c1-d12ef912d3eb}: Request handled, status: 200 OK
2024-01-22T19:07:58.898213Z  INFO Got SIGQUIT. Terminating
2024-01-22T19:08:02.176588Z  INFO version: git-env:d56f31639356ed8e8ce832097f132f27ee19ac8a, launch_timestamp: 2024-01-22 19:08:02.174634554 UTC, build_tag build_tag-env:7615750160, state at /tmp/test_output/test_pageserver_max_throughput_getpage_at_latest_lsn[10-13-30]/repo/attachments.json, listening on 127.0.0.1:15048
thread 'main' panicked at /__w/neon/neon/control_plane/attachment_service/src/persistence.rs:95:17:
Failed to load state from '/tmp/test_output/test_pageserver_max_throughput_getpage_at_latest_lsn[10-13-30]/repo/attachments.json': trailing characters at line 1 column 8957 (maybe your .neon/ dir was written by an older version?)
stack backtrace:
   0: rust_begin_unwind
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14
   2: attachment_service::persistence::PersistentState::load_or_new::{{closure}}
             at ./control_plane/attachment_service/src/persistence.rs:95:17
   3: attachment_service::persistence::Persistence::new::{{closure}}
             at ./control_plane/attachment_service/src/persistence.rs:103:56
   4: attachment_service::main::{{closure}}
             at ./control_plane/attachment_service/src/main.rs:69:61
   5: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/park.rs:282:63
   6: tokio::runtime::coop::with_budget
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:107:5
   7: tokio::runtime::coop::budget
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/coop.rs:73:5
   8: tokio::runtime::park::CachedParkThread::block_on
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/park.rs:282:31
   9: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/blocking.rs:66:9
  10: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/multi_thread/mod.rs:87:13
  11: tokio::runtime::context::runtime::enter_runtime
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/context/runtime.rs:65:16
  12: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/scheduler/multi_thread/mod.rs:86:9
  13: tokio::runtime::runtime::Runtime::block_on
             at ./.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.34.0/src/runtime/runtime.rs:350:50
  14: attachment_service::main
             at ./control_plane/attachment_service/src/main.rs:99:5
  15: core::ops::function::FnOnce::call_once
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```

The attachment_service handles SIGQUIT by just exiting the process.
In theory, the SIGQUIT could come in while we're writing out the
`attachments.json`.

Now, in the above log output, there's a 1 second gap between the last
request completing and the SIGQUIT coming in. So there must be some
other issue.

But let's make this change anyway; maybe it helps uncover the real
cause of the test failure.
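This excerpt doesn't show the diff itself. A common way to keep a state file like `attachments.json` intact when the process may exit mid-write is to write a temporary file and then rename it over the original. The sketch below shows that technique using only `std`; the `write_atomic` name and the `.tmp` suffix are assumptions for illustration, not the attachment_service code.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Replace the file at `path` with `contents` so that a crash or signal
/// mid-write leaves either the old file or the new one, never a partial file.
fn write_atomic(path: &Path, contents: &[u8]) -> io::Result<()> {
    // Hypothetical temp-file naming: "<name>.tmp" next to the target, so the
    // rename stays within one filesystem.
    let mut tmp_name = path.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp = PathBuf::from(tmp_name);
    {
        let mut f = File::create(&tmp)?;
        f.write_all(contents)?;
        f.sync_all()?; // make the new contents durable before publishing them
    }
    // On POSIX filesystems, rename atomically replaces the destination.
    fs::rename(&tmp, path)?;
    // Persist the rename itself by syncing the parent directory (Unix).
    if let Some(dir) = path.parent().filter(|d| !d.as_os_str().is_empty()) {
        File::open(dir)?.sync_all()?;
    }
    Ok(())
}
```

If the process is killed before the rename, the old `attachments.json` is untouched and only a stray `.tmp` file remains.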
2024-01-23 17:21:06 +01:00
Vlad Lazar
37638fce79 pageserver: introduce vectored Timeline::get interface (#6372)
1. Introduce a naive `Timeline::get_vectored` implementation

The return type is intended to be flexible enough for various types of
callers. We return the pages in a map keyed by `Key` so that the
caller doesn't have to map back to the key if it needs to know it. Some
callers can ignore errors for specific pages, so we return a separate
`Result<Bytes, PageReconstructError>` for each page and an overarching
`GetVectoredError` for API misuse. The overhead of the mapping will be
small and bounded, since we enforce a maximum key count for the
operation.

2. Use the `get_vectored` API for SLRU segment reconstruction and image
layer creation.
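A naive sketch of the return shape described above: an outer `Result` for API misuse (here, too many keys) wrapping a `BTreeMap` keyed by `Key` with a per-page `Result`. All types are simplified stand-ins, and `MAX_GET_VECTORED_KEYS` and the `get_one` lookup closure are hypothetical; the real `Timeline::get_vectored` signature is not shown in this log.

```rust
use std::collections::BTreeMap;

// Simplified stand-ins for the pageserver's types (hypothetical).
type Key = u64;
type Bytes = Vec<u8>;

#[derive(Debug, PartialEq)]
enum PageReconstructError {
    Missing(Key),
}

#[derive(Debug, PartialEq)]
enum GetVectoredError {
    /// API misuse: the caller asked for more keys than allowed.
    Oversized(usize),
}

/// Hypothetical upper bound that keeps the mapping overhead bounded.
const MAX_GET_VECTORED_KEYS: usize = 32;

/// Naive vectored get: one single-key lookup per key, results keyed by `Key`.
fn get_vectored<F>(
    keys: &[Key],
    get_one: F,
) -> Result<BTreeMap<Key, Result<Bytes, PageReconstructError>>, GetVectoredError>
where
    F: Fn(Key) -> Result<Bytes, PageReconstructError>,
{
    if keys.len() > MAX_GET_VECTORED_KEYS {
        return Err(GetVectoredError::Oversized(keys.len()));
    }
    let mut pages = BTreeMap::new();
    for &key in keys {
        // A per-page error is recorded, not propagated, so callers can skip it.
        pages.insert(key, get_one(key));
    }
    Ok(pages)
}
```

A caller that tolerates missing pages can iterate the map and skip `Err` entries, while a caller that needs every page can bail out on the first one.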
2024-01-23 14:23:53 +00:00
Christian Schwarz
50288c16b1 fix(pagebench): avoid CopyFail error in success case (#6443)
PR #6392 fixed CopyFail in the case where we get cancelled.
But we also want to call `client.shutdown()` when we don't get cancelled.
2024-01-23 15:11:32 +01:00
Conrad Ludgate
e03f8abba9 eager parsing of ip addr (#6446)
## Problem

Parsing the IP address at check time is a little wasteful.

## Summary of changes

Parse the IP when we get it from cplane. Add a `None` variant so that
malformed patterns are still allowed.
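A minimal sketch of parsing allowed-IP entries once, up front, assuming entries are either a single address or a CIDR subnet. `IpPattern` and its methods are hypothetical names; only the idea of a `None` variant for malformed entries comes from the commit. Here a `None` entry is kept in the list but matches no address, which is one plausible reading of "still allow".

```rust
use std::net::IpAddr;

/// Hypothetical pre-parsed form of an allowed-IP entry.
#[derive(Debug, Clone, PartialEq)]
enum IpPattern {
    Single(IpAddr),
    Subnet { base: IpAddr, prefix: u8 },
    /// Entry that failed to parse: kept so the list still loads, matches nothing.
    None,
}

fn max_prefix(ip: &IpAddr) -> u8 {
    if ip.is_ipv4() { 32 } else { 128 }
}

fn to_u128(ip: &IpAddr) -> u128 {
    match ip {
        IpAddr::V4(v4) => u128::from(u32::from(*v4)),
        IpAddr::V6(v6) => u128::from(*v6),
    }
}

impl IpPattern {
    /// Parse once, when the list arrives from cplane, not on every check.
    fn parse(s: &str) -> IpPattern {
        let s = s.trim();
        match s.split_once('/') {
            Some((addr, len)) => match (addr.parse::<IpAddr>(), len.parse::<u8>()) {
                (Ok(base), Ok(prefix)) if prefix <= max_prefix(&base) => {
                    IpPattern::Subnet { base, prefix }
                }
                _ => IpPattern::None,
            },
            // `None` here is `Option::None` from `split_once`: no '/' present.
            None => match s.parse::<IpAddr>() {
                Ok(ip) => IpPattern::Single(ip),
                Err(_) => IpPattern::None,
            },
        }
    }

    fn matches(&self, ip: &IpAddr) -> bool {
        match self {
            IpPattern::Single(p) => p == ip,
            IpPattern::Subnet { base, prefix } => {
                if base.is_ipv4() != ip.is_ipv4() {
                    return false;
                }
                // Drop the host bits and compare the network parts; a shift
                // of the full width (prefix 0) makes every address match.
                let shift = u32::from(max_prefix(base).saturating_sub(*prefix));
                let net = |a: &IpAddr| to_u128(a).checked_shr(shift).unwrap_or(0);
                net(base) == net(ip)
            }
            IpPattern::None => false,
        }
    }
}
```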
2024-01-23 13:25:01 +00:00
Anna Khanova
1905f0bced proxy: store role not found in cache (#6439)
## Problem

There are a lot of responses with a 404 "role not found" error, and
these are not cached in proxy.

## Summary of changes

If cplane returns an empty secret together with a project_id, store it
in the cache.
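A minimal sketch of caching the negative "role not found" result, assuming a cache keyed by (endpoint, role) and a cplane response that carries an optional `project_id` and an optional secret. `CplaneResponse`, `CacheEntry`, and `RoleSecretCache` are hypothetical names, not the proxy's actual types, and skipping responses without a `project_id` is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical shape of a cplane role-secret lookup response.
struct CplaneResponse {
    project_id: Option<String>,
    secret: Option<String>,
}

#[derive(Debug, Clone, PartialEq)]
enum CacheEntry {
    Secret(String),
    /// Negative entry: cplane answered, but the role does not exist.
    RoleNotFound,
}

#[derive(Default)]
struct RoleSecretCache {
    // Keyed by (endpoint, role); the value also keeps the project_id.
    map: HashMap<(String, String), (String, CacheEntry)>,
}

impl RoleSecretCache {
    fn store(&mut self, endpoint: &str, role: &str, resp: CplaneResponse) {
        // Assumption: responses without a project_id are not cached.
        let Some(project_id) = resp.project_id else { return };
        let entry = match resp.secret {
            Some(secret) => CacheEntry::Secret(secret),
            // Empty secret + project_id: remember that the role is missing,
            // so repeated attempts don't hit cplane again.
            None => CacheEntry::RoleNotFound,
        };
        self.map
            .insert((endpoint.to_string(), role.to_string()), (project_id, entry));
    }

    fn get(&self, endpoint: &str, role: &str) -> Option<&CacheEntry> {
        self.map
            .get(&(endpoint.to_string(), role.to_string()))
            .map(|(_, entry)| entry)
    }
}
```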
2024-01-23 13:15:05 +01:00
Conrad Ludgate
72de1cb511 remove some duped deps (#6422)
## Problem

Duplicated deps.

## Summary of changes

A little bit of fiddling with deps to reduce duplicates.

needs consideration:
https://github.com/notify-rs/notify/blob/main/CHANGELOG.md#notify-600-2023-05-17
2024-01-23 11:17:15 +00:00
Konstantin Knizhnik
00d9bf5b61 Implement lockless update of pageserver_connstring GUC in shared memory (#6314)
## Problem

There is a `neon.pageserver_connstring` GUC with the PGC_SIGHUP option,
which allows changing it with `pg_reload_conf()`. The control plane uses
it to update the pageserver connection string if the pageserver crashes,
is relocated, or new shards are added. It is copied to shared memory
because the config cannot be reloaded during query execution, and we
need to reestablish the connection to the pageserver.

## Summary of changes

Copying the connection string to shared memory is done by the
postmaster. Other backends check an update counter to determine whether
the connection URL has changed and the connection needs to be
reestablished. We cannot use standard Postgres LWLocks, because the
postmaster has no PGPROC entry and so cannot wait on this primitive.
This is why a lockless access algorithm is implemented, using two atomic
counters to ensure a consistent read of the connection string from
shared memory.
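A minimal Rust sketch of a two-counter lockless scheme like the one described, assuming a single writer (the postmaster) that bumps a `begin` counter before copying and an `end` counter after, and readers that retry until both counters agree. The struct, field names, and `MAX_CONNSTR_LEN` are hypothetical; the actual change is C code in the Postgres extension. Bytes are stored as `AtomicU8` so concurrent reads stay free of data races under Rust's memory model.

```rust
use std::sync::atomic::{fence, AtomicU64, AtomicU8, AtomicUsize, Ordering};

/// Hypothetical capacity of the shared-memory buffer.
const MAX_CONNSTR_LEN: usize = 256;

/// Connection string in "shared memory", guarded by two update counters.
struct SharedConnstr {
    begin: AtomicU64, // bumped before the writer starts copying
    end: AtomicU64,   // bumped after the writer finishes copying
    len: AtomicUsize,
    bytes: [AtomicU8; MAX_CONNSTR_LEN],
}

impl SharedConnstr {
    fn new() -> Self {
        SharedConnstr {
            begin: AtomicU64::new(0),
            end: AtomicU64::new(0),
            len: AtomicUsize::new(0),
            bytes: std::array::from_fn(|_| AtomicU8::new(0)),
        }
    }

    /// Publish a new value. Assumes a single writer (the postmaster).
    fn write(&self, connstr: &str) {
        let src = connstr.as_bytes();
        assert!(src.len() <= MAX_CONNSTR_LEN, "connection string too long");
        self.begin.fetch_add(1, Ordering::Relaxed);
        // Order the `begin` bump before all data stores below.
        fence(Ordering::Release);
        self.len.store(src.len(), Ordering::Relaxed);
        for (slot, &b) in self.bytes.iter().zip(src) {
            slot.store(b, Ordering::Relaxed);
        }
        // Release: a reader that observes this `end` also sees the data.
        self.end.fetch_add(1, Ordering::Release);
    }

    /// Return (update counter, value), retrying while a write is in progress.
    /// A backend compares the counter with the last one it saw to decide
    /// whether it must reconnect.
    fn read(&self) -> (u64, String) {
        loop {
            let end = self.end.load(Ordering::Acquire);
            let len = self.len.load(Ordering::Relaxed).min(MAX_CONNSTR_LEN);
            let buf: Vec<u8> = self.bytes[..len]
                .iter()
                .map(|b| b.load(Ordering::Relaxed))
                .collect();
            // If any load above saw data from a newer write, this fence makes
            // that write's `begin` bump visible to the load below.
            fence(Ordering::Acquire);
            let begin = self.begin.load(Ordering::Relaxed);
            if begin == end {
                // No newer write interfered, so `buf` is exactly write #`end`.
                let s = String::from_utf8(buf).expect("consistent snapshot is valid UTF-8");
                return (end, s);
            }
            std::hint::spin_loop();
        }
    }
}
```

Splitting the counter in two lets a reader detect a torn copy without ever blocking the writer, which matters here because the writer cannot wait on an LWLock.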


---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-01-23 07:55:05 +02:00