Commit Graph

3766 Commits

Author SHA1 Message Date
Christian Schwarz
dc1c6b28db move lsn visibility related stuff into separate module 2023-09-14 14:42:53 +02:00
Christian Schwarz
1a92a107f6 move DeletionList stuff into separate module 2023-09-14 14:42:53 +02:00
Christian Schwarz
ef9e081866 Revert "unimpl the parts that support !generation.is_none()"
This reverts commit 641130a959d05aaf1708a3fa3a107341474ace4d.
2023-09-14 14:42:53 +02:00
Christian Schwarz
d62723ea57 unimpl the parts that support !generation.is_none() 2023-09-14 14:42:51 +02:00
John Spray
0ba442f1e0 Update pageserver/src/lib.rs
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-09-12 16:54:35 +01:00
John Spray
7d4dff0738 Get rid of a couple of spurious mut 2023-09-12 16:36:39 +01:00
John Spray
20a3a9be70 refactor Arc<dyn> to generics for control plane client mocking 2023-09-12 16:02:58 +01:00
John Spray
db9a49ed91 Refactor remote_consistent_lsn updates to use an atomic instead of a
channel
2023-09-12 15:50:26 +01:00
John Spray
28d7f6b643 reinstate log line per layer in deletion scheduling 2023-09-12 15:35:20 +01:00
John Spray
7b05ec2825 pageserver: refactor http deletion_queue_flush 2023-09-12 15:27:27 +01:00
John Spray
b76f08e863 deletion queue: wrap workers in an opaque struct 2023-09-12 15:03:09 +01:00
John Spray
fad9c45f11 pageserver: fix executor flush 2023-09-12 09:46:05 +01:00
John Spray
ee1fba8729 pageserver: fix deletion queue flush on shutdown 2023-09-12 09:18:18 +01:00
John Spray
a5aa6652c6 clippy deletion queue 2023-09-11 18:10:37 +01:00
John Spray
34007e12a1 libs: add Generation::next 2023-09-11 18:06:40 +01:00
John Spray
278eb70522 tests: add test_pageserver_generations 2023-09-11 18:06:40 +01:00
John Spray
da96d34fa2 tests: optional delimiter to list_prefix 2023-09-11 18:06:40 +01:00
John Spray
8b2160793a tests: update existing deletion tests 2023-09-11 18:06:40 +01:00
John Spray
9a4f9a1b7c pageserver: add a const constructor for Key, for use in test consts 2023-09-11 18:06:40 +01:00
John Spray
412819ac20 control_plane: fix attach_hook in attachment_service 2023-09-11 18:06:40 +01:00
John Spray
1f43fed305 pageserver: add flush admin API 2023-09-11 18:06:40 +01:00
John Spray
9ccad00474 libs: add ApiError::ShuttingDown 2023-09-11 18:06:40 +01:00
John Spray
960dd9a206 pageserver: use deferred updates to remote_consistent_lsn 2023-09-11 18:06:40 +01:00
John Spray
37f4972291 pageserver: cut over to using deletion queue 2023-09-11 18:06:40 +01:00
John Spray
38b41e5c34 pageserver: wire deletion queue through to tenant 2023-09-11 18:06:40 +01:00
John Spray
eb464d5322 pageserver: instantiate deletion queue 2023-09-11 18:06:40 +01:00
John Spray
60241567ce pageserver: add deletion queue 2023-09-11 18:06:40 +01:00
John Spray
b6183a9e65 pageserver: refactor ControlPlaneClient into a mockable trait 2023-09-11 17:59:30 +01:00
John Spray
6e0b977bc8 libs: add RemotePath::strip_prefix 2023-09-11 17:59:30 +01:00
John Spray
3e09cabb6a libs: implement Generation Into<u32> 2023-09-11 17:59:30 +01:00
John Spray
145685201a pageserver: add validate to control plane client 2023-09-11 17:59:30 +01:00
John Spray
d545e3f03b pageserver: add deletion queue metrics 2023-09-11 17:59:30 +01:00
John Spray
6a0cc9e526 pageserver: add deletion path definitions to config 2023-09-11 17:59:30 +01:00
John Spray
1fed35a481 pageserver/tenant: remote_layer_path take Generation instead of layer metadata 2023-09-11 17:59:30 +01:00
John Spray
3d6c5c8d37 pageserver: update unit tests to keep TenantHarness alive
This controls the lifetime of the MockDeletionQueue.
2023-09-11 17:59:30 +01:00
John Spray
d5c9bfa75e pageserver: enable disabling control_plane_api with an override
This is just for testing.  Eventually we'll remove this after
everything is upgraded.
2023-09-11 17:59:30 +01:00
John Spray
8d5d36ed12 remote_storage: expose MAX_KEYS_PER_DELETE constant 2023-09-11 17:59:30 +01:00
John Spray
9c64d95467 remote_storage: implement Serialize/Deserialize for RemotePath 2023-09-11 17:59:30 +01:00
Arpad Müller
a18d6d9ae3 Make File opening in VirtualFile async-compatible (#5280)
## Problem

Previously, we were using `observe_closure_duration` in `VirtualFile`
file opening code, but this doesn't support async open operations, which
we want to use as part of #4743.

## Summary of changes

* Move the duration measurement from the `with_file` macro into an
`observe_duration` macro.
* Some smaller drive-by fixes to replace the old strings with the new
variant names introduced by #5273
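
A minimal sketch of why a macro helps here, assuming only the standard library. The name `observe_duration_sketch!` and the `Vec<Duration>` sink are illustrative stand-ins and not the actual `observe_duration` macro or the `STORAGE_IO_TIME` metric. Because the macro expands inline, the measured expression may contain `.await`, which a plain closure-based helper cannot do.

```rust
use std::time::Duration;

/// Illustrative macro: times `$body` and pushes the elapsed time into `$sink`.
/// The body is expanded in place, so it can be an `.await` expression when the
/// macro is used inside an `async fn`.
macro_rules! observe_duration_sketch {
    ($sink:expr, $body:expr) => {{
        let start = std::time::Instant::now();
        let result = $body;
        $sink.push(start.elapsed());
        result
    }};
}

/// Hypothetical async caller: the awaited future is measured inline.
pub async fn open_async_sketch(sink: &mut Vec<Duration>) -> u32 {
    observe_duration_sketch!(sink, std::future::ready(7).await)
}

/// Synchronous caller using the same macro.
pub fn open_sync_sketch(sink: &mut Vec<Duration>) -> u32 {
    observe_duration_sketch!(sink, 2 + 3)
}
```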

Part of #4743, follow-up of #5247.
2023-09-11 18:41:08 +02:00
Arpad Müller
76cc87398c Use tokio locks in VirtualFile and turn with_file into macro (#5247)
## Problem

For #4743, we want to convert everything up to the actual I/O operations
of `VirtualFile` to `async fn`.

## Summary of changes

This PR is the last change in a series of changes to `VirtualFile`:
#5189, #5190, #5195, #5203, and #5224.

It does the last preparations before the I/O operations are actually
made async. We are doing the following things:

* First, we change the locks for the file descriptor cache to tokio's
locks that support Send. This is important when one wants to hold locks
across await points (which we want to do), otherwise the Future won't be
Send. Also, one shouldn't generally block in async code as executors
don't like that.
* Due to the lock change, we now take an approach for the `VirtualFile`
destructors similar to the one proposed by #5122 for the page cache, to
use `try_write`. Similarly to the situation in the linked PR, one can
make an argument that if we are in the destructor and the slot has not
been reused yet, we are the only user accessing the slot due to owning
the lock mutably. It is still possible that we are not obtaining the
lock, but the only cause for that is the clock algorithm touching the
slot, which should be quite an unlikely occurrence. If `try_write` does
fail, we spawn an async task to destroy the lock. As just
argued however, most of the time the code path where we spawn the task
should not be visited.
* Lastly, we split `with_file` into a macro part, and a function part
that contains most of the logic. The function part returns a lock
object, that the macro uses. The macro exists to perform the operation
in a more compact fashion, saving code from putting the lock into a
variable and then doing the operation while measuring the time to run
it. We take the locks approach because Rust has no support for async
closures. One can make normal closures return a future, but that
approach gets into lifetime issues the moment you want to pass data to
these closures via parameters that have a lifetime (captures work). For
details, see
[this](https://smallcultfollowing.com/babysteps/blog/2023/03/29/thoughts-on-async-closures/)
and
[this](https://users.rust-lang.org/t/function-that-takes-an-async-closure/61663)
link. In #5224, we ran into a similar problem with the `test_files`
function, and we ended up passing the path and the `OpenOptions`
by-value instead of by-ref, at the expense of a few extra copies. This
can be done as the data is cheaply copyable, and we are in test code.
But here, we are not, and while `File::try_clone` exists, it [issues
system calls
internally](1e746d7741/library/std/src/os/fd/owned.rs (L94-L111)).
Also, it would allocate an entirely new file descriptor, something that
the fd cache was built to prevent.
* We change the `STORAGE_IO_TIME` metrics to support async.
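
The destructor approach from the second bullet can be sketched roughly as follows. This is an analogy, not the PR's code: it assumes `std::sync::RwLock` and a plain thread in place of tokio's lock and a spawned async task, and `Handle` and its `slot` field are hypothetical names.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

/// Hypothetical handle owning one slot of a descriptor cache.
/// The `String` stands in for an open file.
pub struct Handle {
    pub slot: Arc<RwLock<Option<String>>>,
}

impl Drop for Handle {
    fn drop(&mut self) {
        match self.slot.try_write() {
            // Common case: nobody else holds the lock, so clear the slot now.
            Ok(mut guard) => {
                guard.take();
            }
            // Rare case: the lock is busy. A destructor must not block, so
            // hand the cleanup to a background worker instead.
            Err(_) => {
                let slot = Arc::clone(&self.slot);
                thread::spawn(move || {
                    if let Ok(mut guard) = slot.write() {
                        guard.take();
                    }
                });
            }
        }
    }
}
```

In the uncontended case the slot is already empty when `drop` returns; only the contended path defers the cleanup.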

Part of #4743.
2023-09-11 17:35:05 +02:00
bojanserafimov
c0ed362790 Measure pageserver wal recovery time and fix flush() method (#5240) 2023-09-11 09:46:06 -04:00
duguorong009
d7fa2dba2d fix(pageserver): update the STORAGE_IO_TIME metrics to avoid expensive operations (#5273)
Introduce the `StorageIoOperation` enum, `StorageIoTime` struct, and
`STORAGE_IO_TIME_METRIC` static which provides lockless access to
histograms consumed by `VirtualFile`.
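
One way to get lockless per-operation access is to index a fixed array by an enum, so the hot path never does a label lookup. The sketch below assumes this layout purely for illustration: `IoOp`, its variants, and `IoTimeSketch` are made-up names, and plain atomic counters stand in for real histograms.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Illustrative operation kinds; the discriminant doubles as an array index.
#[derive(Clone, Copy)]
pub enum IoOp {
    Open = 0,
    Read = 1,
    Write = 2,
}

const OP_COUNT: usize = 3;

/// Stand-in for a set of per-operation histograms: one count and one running
/// total per operation, updated with atomics only.
pub struct IoTimeSketch {
    count: [AtomicU64; OP_COUNT],
    total_micros: [AtomicU64; OP_COUNT],
}

impl IoTimeSketch {
    pub const fn new() -> Self {
        Self {
            count: [AtomicU64::new(0), AtomicU64::new(0), AtomicU64::new(0)],
            total_micros: [AtomicU64::new(0), AtomicU64::new(0), AtomicU64::new(0)],
        }
    }

    /// Records one observation without taking any lock.
    pub fn observe(&self, op: IoOp, micros: u64) {
        let i = op as usize;
        self.count[i].fetch_add(1, Ordering::Relaxed);
        self.total_micros[i].fetch_add(micros, Ordering::Relaxed);
    }

    /// Returns `(count, total_micros)` for one operation.
    pub fn snapshot(&self, op: IoOp) -> (u64, u64) {
        let i = op as usize;
        (
            self.count[i].load(Ordering::Relaxed),
            self.total_micros[i].load(Ordering::Relaxed),
        )
    }
}

/// A process-wide instance, analogous in spirit to a metrics static.
pub static IO_TIME_SKETCH: IoTimeSketch = IoTimeSketch::new();
```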

Closes #5131

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-09-11 14:58:15 +03:00
Joonas Koivunen
a55a78a453 Misc test flakiness fixes (#5233)
Assorted flakiness fixes from #5198; the tests might not be flaky on `main`.

Migrate some tests from neon_simple_env to neon_env_builder with
initial_tenant, to make the flakiness easier to understand. (Did not
understand the flakiness of
`test_timeline_create_break_after_uninit_mark`.)

`test_download_remote_layers_api` is flaky because we have no atomic
"wait for WAL, checkpoint, wait for upload and do not receive any more
WAL".

`test_tenant_size` fixes are just boilerplate which should have always
existed; we should wait for the tenant to be active. Similarly for
`test_timeline_delete`.

`test_timeline_size_post_checkpoint` often fails for me by reading
zero from metrics. Give it a few attempts.
2023-09-11 11:42:49 +03:00
Rahul Modpur
999fe668e7 Ack tenant detach before local files are deleted (#5211)
## Problem

Detaching a tenant can involve many thousands of local filesystem
metadata writes, but the control plane would benefit from us not
blocking detach/delete responses on these.

## Summary of changes

After renaming the local tenant directory, ack the tenant detach and
delete the tenant directory in the background.
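
The rename-then-delete ordering can be sketched as below, assuming only the standard library. `detach_then_delete` and the `.deleting` suffix are hypothetical, and a plain thread stands in for whatever background task the pageserver actually uses.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::thread::{self, JoinHandle};

/// Hypothetical helper, not the pageserver's code: moves `tenant_dir` out of
/// the way with a single rename, then deletes the renamed directory on a
/// background thread. The caller can acknowledge the detach as soon as this
/// function returns, without waiting for the per-file deletions.
pub fn detach_then_delete(tenant_dir: &Path) -> io::Result<JoinHandle<io::Result<()>>> {
    let mut tmp_name = tenant_dir
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "path has no final component"))?
        .to_os_string();
    tmp_name.push(".deleting"); // illustrative suffix, an assumption
    let tmp_dir = tenant_dir.with_file_name(tmp_name);

    // One metadata operation: after this the tenant directory is gone from
    // its original path.
    fs::rename(tenant_dir, &tmp_dir)?;

    // The expensive recursive delete runs off the request path.
    Ok(thread::spawn(move || fs::remove_dir_all(&tmp_dir)))
}
```

Returning the join handle lets a caller (or a test) wait for the background deletion if it needs to.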

#5183 

---------

Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
2023-09-10 22:59:51 +03:00
Alexander Bayandin
d33e1b1b24 approved-for-ci-run.yml: use token to checkout the repo (#5266)
## Problem

Another thing I overlooked regarding `approved-for-ci-run`:
- When we create a PR, the action is associated with @vipvap and this
triggers the pipeline — this is good.
- When we update the PR by force-pushing to the branch, the action is
associated with @github-actions, which doesn't trigger a pipeline — this
is bad.

Initially spotted in #5239 / #5211
([link](https://github.com/neondatabase/neon/actions/runs/6122249456/job/16633919558?pr=5239))
— `check-permissions` should not fail.


## Summary of changes
- Use `CI_ACCESS_TOKEN` to check out the repo (I expect this token will
be reused in the following `git push`)
2023-09-10 20:12:38 +01:00
Alexander Bayandin
15fd188fd6 Fix GitHub Autocomment for ci-run/prs (#5268)
## Problem

When a `ci-run/pr-*` PR is created, the GitHub Autocomment with test
results is supposed to be posted to the original PR. Currently, this
doesn't work.

I created this PR from a personal fork to debug and fix the issue. 

## Summary of changes
- `scripts/comment-test-report.js`: use `pull_request.head` instead of
`pull_request.base` 🤦
2023-09-10 20:06:10 +01:00
Alexander Bayandin
34e39645c4 GitHub Workflows: add actionlint (#5265)
## Problem

Add a CI pipeline that checks GitHub Workflows with
https://github.com/rhysd/actionlint (it uses `shellcheck` for shell
scripts in steps)

To run it locally: `SHELLCHECK_OPTS=--exclude=SC2046,SC2086 actionlint`

## Summary of changes
- Add `.github/workflows/actionlint.yml`
- Fix actionlint warnings
2023-09-10 20:05:07 +01:00
Em Sharnoff
1cac923af8 vm-monitor: Rate-limit upscale requests (#5263)
Some VMs, when already scaled up as much as possible, end up spamming
the autoscaler-agent with upscale requests that will never be fulfilled.
If postgres is using memory greater than the cgroup's memory.high, it
can emit new memory.high events 1000 times per second, which... just
means unnecessary load on the rest of the system.

This changes the vm-monitor so that we skip sending upscale requests if
we already sent one within the last second, to avoid spamming the
autoscaler-agent. This matches the previous behavior that the
vm-informant had.
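
A rough sketch of such a limiter, assuming only the standard library. `UpscaleRateLimiter` and `try_send` are hypothetical names and not the vm-monitor's actual types; the caller passes in the current time so the logic is easy to test.

```rust
use std::time::{Duration, Instant};

/// Hypothetical limiter: permits at most one request per `min_interval`.
pub struct UpscaleRateLimiter {
    min_interval: Duration,
    last_sent: Option<Instant>,
}

impl UpscaleRateLimiter {
    pub fn new(min_interval: Duration) -> Self {
        Self { min_interval, last_sent: None }
    }

    /// Returns `true` and records `now` if a request may be sent; returns
    /// `false` if one was already sent less than `min_interval` ago.
    pub fn try_send(&mut self, now: Instant) -> bool {
        match self.last_sent {
            Some(prev) if now.saturating_duration_since(prev) < self.min_interval => false,
            _ => {
                self.last_sent = Some(now);
                true
            }
        }
    }
}
```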
2023-09-10 20:33:53 +03:00
Em Sharnoff
853552dcb4 vm-monitor: Don't include Args in top-level span (#5264)
It makes the logs too verbose.

ref https://neondb.slack.com/archives/C03F5SM1N02/p1694281232874719?thread_ts=1694272777.207109&cid=C03F5SM1N02
2023-09-10 20:15:53 +03:00
Alexander Bayandin
1ea93af56c Create GitHub release from release tag (#5246)
## Problem

This PR creates a GitHub release from a release tag with an
autogenerated changelog: https://github.com/neondatabase/neon/releases

## Summary of changes
- Call GitHub API to create a release
2023-09-09 22:02:28 +01:00