rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-08 06:30:37 +00:00

Author	SHA1	Message	Date
Vlad Lazar	013e44f528	wip: remove gratuitous log	2024-04-23 11:55:42 +01:00
Vlad Lazar	c719f902a1	wip: fix new stats	2024-04-23 10:37:28 +01:00
Vlad Lazar	bb2215ef0c	wip: even more stats	2024-04-23 10:24:49 +01:00
Vlad Lazar	294a2b2d6f	wip: reset stats	2024-04-19 16:52:37 +01:00
Vlad Lazar	7f8a6702fc	wip: reset stats	2024-04-19 16:51:33 +01:00
Vlad Lazar	76502b96bd	wip: more data	2024-04-19 15:41:28 +01:00
Vlad Lazar	a67f496b09	wip: track read path stats in context	2024-04-19 14:26:13 +01:00
Vlad Lazar	761f211f55	sq: error parity	2024-04-19 12:16:18 +01:00
Vlad Lazar	dc2c6367af	sq: error parity	2024-04-19 10:50:02 +01:00
Vlad Lazar	a65f067799	sq: improve error parity	2024-04-19 10:50:02 +01:00
Vlad Lazar	f36561746e	tests: handle not found errors from both read paths in lsn mapping test	2024-04-19 10:50:02 +01:00
Vlad Lazar	c121d3a3c7	sq validate	2024-04-19 10:50:02 +01:00
Vlad Lazar	76abb4e7a4	pageserver: cargo fmt	2024-04-19 10:50:02 +01:00
Vlad Lazar	d2c806fa40	pageserver: skip unnecessary fringe update on vectored read path	2024-04-19 10:50:02 +01:00
Vlad Lazar	def9b17b15	pageserver: achieve parity between read paths for invalid lsns	2024-04-19 10:50:02 +01:00
Vlad Lazar	0e8f9b82aa	tests: unset new config for forward compat test	2024-04-19 10:50:02 +01:00
Vlad Lazar	2cd9b3cab5	pageserver: improve error parity between read paths	2024-04-19 10:50:02 +01:00
Vlad Lazar	e489fd7a1d	test: use vectored impl for singular gets in tests	2024-04-19 10:50:01 +01:00
Vlad Lazar	d45eca992f	pageserver: validate singular vectored get results	2024-04-19 10:50:01 +01:00
Vlad Lazar	02f82ebd79	pageserver: use vectored get for singular gets when configured	2024-04-19 10:50:01 +01:00
Vlad Lazar	5a1dce5e3d	pageserver: fix vectored read path cache handling	2024-04-19 10:50:01 +01:00
Vlad Lazar	7fa1915229	pageserver/config: add config for singular get impl	2024-04-19 10:50:01 +01:00
Vlad Lazar	fc8714215c	pageserver/timeline: replicate metrics on vectored get path	2024-04-19 10:50:01 +01:00
Vlad Lazar	632613273f	pageserver/metrics: add get kind label for get reconstruct data time	2024-04-19 10:50:01 +01:00
Vlad Lazar	96eb1631d3	pageserver/metrics: add get kind label for reconstruct time	2024-04-19 10:50:01 +01:00
Vlad Lazar	7f674c83e1	pageserver: fix aux key retrieval for vectored get Problem Vectored get would descend into ancestor timelines for aux files. This is not the behaviour of the legacy read path and blocks cutting over to the vectored read path. Summary of Changes Treat non inherited keys specially in vectored get. At the point when we want to descend into the ancestor mark all pending non inherited keys as errored out at the key level. Note that this diverges from the standard vectored get behaviour for missing keys which is a top level error. This divergence is required to avoid blocking compaction in case such an error is encountered when compaction aux files keys. I'm pretty sure the bug I just described predates the vectored get implementation, but it's still worth fixing.	2024-04-19 10:50:01 +01:00
Vlad Lazar	e5411ee556	pageserver_api: return removed overlapping keyspace	2024-04-19 10:50:01 +01:00
Vlad Lazar	6eb946e2de	pageserver: fix cont lsn jump on vectored read path (#7412 ) ## Problem Vectored read path may return an image that's newer than the request lsn under certain circumstances. ``` LSN ^ \| \| 500 \| ------------------------- -> branch point 400 \| X 300 \| X 200 \| ------------------------------------> requested lsn 100 \| X \|---------------------------------> Key Legend: * X - page images ``` The vectored read path inspects each ancestor timeline one by one starting from the current one. When moving into the ancestor timeline, the current code resets the current search lsn (called `cont_lsn` in code) to the lsn of the ancestor timeline ([here](`d5708e7435/pageserver/src/tenant/timeline.rs (L2971)`)). For instance, if the request lsn was 200, we would: 1. Look into the current timeline and find nothing for the key 2. Descend into the ancestor timeline and set `cont_lsn=500` 3. Return the page image at LSN 400 Myself and Christian find it very unlikely for this to have happened in prod since the vectored read path is always used at the last record lsn. This issue was found by a regress test during the work to migrate get page handling to use the vectored implementation. I've applied my fix to that wip branch and it fixed the issue. ## Summary of changes The fix is to set the current search lsn to the min between the requested LSN and the ancestor lsn. Hence, at step 2 above we would set the current search lsn to 200 and ignore the images above that. A test illustrating the bug is also included. Fails without the patch and passes with it.	2024-04-18 18:40:30 +01:00
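The fix described in the commit above (clamping the search LSN when descending into the ancestor) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pageserver code: `Lsn`, `cont_lsn_for_ancestor`, and `newest_visible_image` are hypothetical names, and treating `cont_lsn` as an inclusive upper bound is a simplification.

```rust
/// Hypothetical stand-in for the pageserver's LSN type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Lsn(pub u64);

/// Search LSN to continue with after descending into the ancestor timeline.
/// Taking the minimum keeps the search from jumping above the requested LSN.
pub fn cont_lsn_for_ancestor(request_lsn: Lsn, ancestor_lsn: Lsn) -> Lsn {
    std::cmp::min(request_lsn, ancestor_lsn)
}

/// Newest page image at or below `cont_lsn`, if any.
/// (Assumption: the bound is inclusive; the real read path may differ.)
pub fn newest_visible_image(image_lsns: &[Lsn], cont_lsn: Lsn) -> Option<Lsn> {
    image_lsns
        .iter()
        .copied()
        .filter(|lsn| *lsn <= cont_lsn)
        .max()
}
```

With the images from the diagram (LSNs 100, 300, 400), a request at LSN 200, and a branch point at 500, the clamped search LSN is 200, so only the image at 100 is eligible. Resetting the search LSN to 500 instead would make the image at 400 eligible, which is the bug the commit describes.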
dependabot[bot]	681a04d287	build(deps): bump aiohttp from 3.9.2 to 3.9.4 (#7429 )	2024-04-18 16:47:34 +00:00
Joonas Koivunen	3df67bf4d7	fix(Layer): metric regression with too many canceled evictions (#7363 ) #7030 introduced an annoying papercut, deeming a failure to acquire a strong reference to `LayerInner` from `DownloadedLayer::drop` as a canceled eviction. Most of the time, it wasn't that, but just timeline deletion or tenant detach with the layer not wanting to be deleted or evicted. When a Layer is dropped as part of a normal shutdown, the `Layer` is dropped first, and the `DownloadedLayer` the second. Because of this, we cannot detect eviction being canceled from the `DownloadedLayer::drop`. We can detect it from `LayerInner::drop`, which this PR adds. Test case is added which before had 1 started eviction, 2 canceled. Now it accurately finds 1 started, 1 canceled.	2024-04-18 15:27:58 +00:00
John Spray	0d8e68003a	Add a docs page for storage controller (#7392 ) ## Problem External contributors need information on how to use the storage controller. ## Summary of changes - Background content on what the storage controller is. - Deployment information on how to use it. This is not super-detailed, but should be enough for a well motivated third party to get started, with an occasional peek at the code.	2024-04-18 13:45:25 +00:00
John Spray	637ad4a638	pageserver: fix secondary download scheduling (#7396 ) ## Problem Some tenants were observed to stop doing downloads after some time ## Summary of changes - Fix a rogue `<` that was incorrectly scheduling work when `now` was _before_ the scheduling target, rather than after. This usually resulted in too-frequent execution, but could also result in never executing, if the current time has advanced ahead of `next_download` at the time we call `schedule()`. - Fix in-memory list of timelines not being amended after timeline deletion: this resulted in repeated harmless logs about the timeline being removed, and redundant calls to remove_dir_all for the timeline path. - Add a log at startup to make it easier to see a particular tenant starting in secondary mode (this is for parity with the logging that exists when spawning an attached tenant). Previously searching on tenant ID didn't provide a clear signal as to how the tenant was started during pageserver start. - Add a test that exercises secondary downloads using the background scheduling, whereas existing tests were using the API hook to invoke download directly.	2024-04-18 13:16:03 +01:00
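The direction of the comparison fixed in the commit above is easy to get wrong, so a small sketch may help. `is_due` is a hypothetical helper, not a function from the pageserver, and whether the real check is `>=` or `>` is an assumption here.

```rust
use std::time::Instant;

/// Returns true once `now` has reached (or passed) the scheduling target.
/// The bug described above amounts to the opposite comparison, which fires
/// while `now` is still before the target and never fires once it is after.
pub fn is_due(now: Instant, next_download: Instant) -> bool {
    now >= next_download
}
```

A job whose target lies in the future is not due, while a job whose target has already passed stays due until it is rescheduled.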
Joonas Koivunen	8d0f701767	feat: copy delta layer prefix or "truncate" (#7228 ) For "timeline ancestor merge" or "timeline detach," we need to "cut" delta layers at particular LSN. The name "truncate" is not used as it would imply that a layer file changes, instead of what happens: we copy keys with Lsn less than a "cut point". Cc: #6994 Add the "copy delta layer prefix" operation to DeltaLayerInner, re-using some of the vectored read internals. The code is `cfg(test)` until it will be used later with a more complete integration test.	2024-04-18 10:43:04 +03:00
Anna Khanova	5191f6ef0e	proxy: Record only valid rejected events (#7415 ) ## Problem Sometimes rejected metric might record invalid events. ## Summary of changes * Only record it if `rejected` was explicitly set. * Change order in logs. * Report metrics if not under high-load.	2024-04-18 06:09:12 +01:00
Conrad Ludgate	a54ea8fb1c	proxy: move endpoint rate limiter (#7413 ) ## Problem ## Summary of changes Rate limit for wake_compute calls	2024-04-18 06:00:33 +01:00
Anna Khanova	d5708e7435	proxy: Record role to span (#7407 ) ## Problem ## Summary of changes Add dbrole to span.	2024-04-17 14:16:11 +02:00
Anna Khanova	fd49005cb3	proxy: Improve logging (#7405 ) ## Problem It's unclear from logs what's going on with the regional redis. ## Summary of changes Make logs better.	2024-04-17 11:33:31 +00:00
Vlad Lazar	3023de156e	pageserver: demote range end fallback log (#7403 ) ## Problem This trace is emitted whenever a vectored read touches the end of a delta layer file. It's a perfectly normal case, but I expected it to be more rare when implementing the code. ## Summary of changes Demote log to debug.	2024-04-17 11:32:07 +01:00
Jure Bajic	e49e931bc4	Add for `add-help-for-timeline-arg` for `timeline` command (#7361 ) ## Problem When calling `./neon_local timeline` a confusing error message pops up: `command failed: no tenant subcommand provided` ## Summary of changes Add `add-help-for-timeline-arg` for timeline commands so when no argument for the timeline is provided help is printed.	2024-04-17 10:23:55 +01:00
Anna Khanova	13b9135d4e	proxy: Cleanup unused rate limiter (#7400 ) ## Problem There is an unused dead code. ## Summary of changes Let's remove it. In case we would need it in the future, we can always return it back. Also removed cli arguments. They shouldn't be used by anyone but us.	2024-04-17 11:11:49 +02:00
Alexander Bayandin	41bb1e42b8	CI(check-build-tools-image): fix getting build-tools image tag (#7402 ) ## Problem For PRs, by default, we check out a phantom merge commit (merge a branch into the main), but using a real branches head when finding `build-tools` image tag. ## Summary of changes - Change `COMMIT_SHA` to use `${{ github.sha }}` instead of `${{ github.event.pull_request.head.sha }}` for PRs ## Checklist before requesting a review - [x] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist	2024-04-17 09:50:58 +01:00
Alex Chi Z	cb4b40f9c1	chore(compute_ctl): add error context to apply_spec (#7374 ) Make it faster to identify which part of apply spec goes wrong by adding an error context. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-04-17 09:11:04 +03:00
Alex Chi Z	9e567d9814	feat(neon_local): support listen addr for safekeeper (#7328 ) Leftover from my LFC benchmarks. Safekeepers only listen on `127.0.0.1` for `neon_local`. This pull request adds support for listening on other address. To specify a custom address, modify `.neon/config`. ``` [[safekeepers]] listen_addr = "192.168.?.?" ``` Endpoints created by neon_local still use 127.0.0.1 and I will fix them later. I didn't fix it in the same pull request because my benchmark setting does not use neon_local to create compute nodes so I don't know how to fix it yet -- maybe replacing a few `127.0.0.1`s. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-04-17 09:10:01 +03:00
Vlad Lazar	1c012958c7	pageserver/http: remove status code boilerplate from swagger spec (#7385 ) ## Problem We specify a bunch of possible error codes in the pageserver api swagger spec. This is error prone and annoying to work with. https://github.com/neondatabase/cloud/pull/11907 introduced generic error handling on the control plane side, so we can now clean up the spec. ## Summary of changes * Remove generic error codes from swagger spec * Update a couple route handlers which would previously return an error without a `msg` field in the response body. Tested via https://github.com/neondatabase/cloud/pull/12340 Related https://github.com/neondatabase/cloud/issues/7238	2024-04-16 16:24:09 +01:00
Conrad Ludgate	e5c50bb12b	proxy: rate limit authentication by masked IPv6. (#7316 ) ## Problem Many users have access to ipv6 subnets (eg a /64). That gives them 2^64 addresses to play with ## Summary of changes Truncate the address to /64 to reduce the attack surface. Todo: ~~Will NAT64 be an issue here? AFAIU they put the IPv4 address at the end of the IPv6 address. By truncating we will lose all that detail.~~ It's the same problem as a host sharing IPv6 addresses between clients. I don't think it's up to us to solve. If a customer is getting DDoSed, then they likely need to arrange a dedicated IP with us.	2024-04-16 14:16:34 +00:00
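Truncating an IPv6 address to its /64 prefix, as the commit above describes, can be sketched with the standard library alone. `rate_limit_key` is a hypothetical name, and leaving IPv4 addresses unchanged is an assumption of this sketch rather than something the commit states.

```rust
use std::net::{IpAddr, Ipv6Addr};

/// Maps a client address to the key used for rate limiting.
/// IPv6 addresses are truncated to their /64 prefix so that every address
/// in the same subnet shares one bucket; IPv4 addresses pass through as-is.
pub fn rate_limit_key(addr: IpAddr) -> IpAddr {
    match addr {
        IpAddr::V4(_) => addr,
        IpAddr::V6(v6) => {
            // Keep the upper 64 bits (network prefix), zero the lower 64.
            let masked = u128::from(v6) & (u128::MAX << 64);
            IpAddr::V6(Ipv6Addr::from(masked))
        }
    }
}
```

Two clients inside the same /64 then map to the same key, so rotating through the 2^64 host addresses of one subnet no longer yields fresh rate-limit buckets.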
John Spray	926662eb7c	storage_controller: suppress misleading log (#7395 ) ## Problem - https://github.com/neondatabase/neon/issues/7355 The optimize_secondary function calls schedule_shard to check for improvements, but if there are exactly the same number of nodes as there are replicas of the shard, it emits some scary looking logs about no nodes being eligible. Closes https://github.com/neondatabase/neon/issues/7355 ## Summary of changes - Add a mode to SchedulingContext that controls logging: this should be useful in future any time we add a log to the scheduling path, to avoid it becoming a source of spam when the scheduler is called during optimization.	2024-04-16 12:41:48 +00:00
John Spray	3366cd34ba	pageserver: return ACCEPTED when deletion already in flight (#7384 ) ## Problem test_sharding_smoke recently got an added section that checks deletion of a sharded tenant. The storage controller does a retry loop for deletion, waiting for a 404 response. When deletion is a bit slow (debug builds), the retry of deletion was getting a 500 response -- this caused the test to become flaky (example failure: https://neon-github-public-dev.s3.amazonaws.com/reports/release-proxy/8659801445/index.html#testresult/b4cbf5b58190f60e/retries) There was a false comment in the code: ``` match tenant.current_state() { TenantState::Broken { .. } | TenantState::Stopping { .. } => { - // If a tenant is broken or stopping, DeleteTenantFlow can - // handle it: broken tenants proceed to delete, stopping tenants - // are checked for deletion already in progress. ``` If the tenant is stopping, DeleteTenantFlow does not in fact handle it, but returns a 500-yielding error. ## Summary of changes Before calling into DeleteTenantFlow, if the tenant is in stopping|broken state then return 202 if a deletion is in progress. This makes the API friendlier for retries. The historic AlreadyInProgress (409) response still exists for if we enter DeleteTenantFlow and unexpectedly see the tenant stopping. That should go away when we implement #5080 . For the moment, callers that handle 409s should continue to do so.	2024-04-16 09:39:18 +01:00
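From the caller's side, the retry behaviour described in the commit above might look like the sketch below. `DeleteOutcome` and `classify_delete_response` are hypothetical names; only the meanings of 404, 202, and 409 come from the commit message, and how other status codes should be handled is left open.

```rust
/// What a retrying caller should do with one response to a delete request.
#[derive(Debug, PartialEq, Eq)]
pub enum DeleteOutcome {
    /// 404: the tenant is gone, so the retry loop can stop.
    Done,
    /// 202 or 409: a deletion is already in progress, so poll again later.
    RetryLater,
    /// Any other status: left to the caller (not covered by the commit).
    Unexpected(u16),
}

/// Classifies an HTTP status code returned by a tenant delete request.
pub fn classify_delete_response(status: u16) -> DeleteOutcome {
    match status {
        404 => DeleteOutcome::Done,
        202 | 409 => DeleteOutcome::RetryLater,
        other => DeleteOutcome::Unexpected(other),
    }
}
```

Treating 202 and the historic 409 the same way keeps the retry loop going while a deletion is in flight, without having to special-case which code path on the pageserver produced the response.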
Christian Schwarz	2d5a8462c8	add `async` walredo mode (disabled-by-default, opt-in via config) (#6548 ) Before this PR, the `nix::poll::poll` call would stall the executor. This PR refactors the `walredo::process` module to allow for different implementations, and adds a new `async` implementation which uses `tokio::process::ChildStd{in,out}` for IPC. The `sync` variant remains the default for now; we'll do more testing in staging and gradual rollout to prod using the config variable. Performance ----------- I updated `bench_walredo.rs`, demonstrating that a single `async`-based walredo manager used by N=1...128 tokio tasks has lower latency and higher throughput. I further did manual less-micro-benchmarking in the real pageserver binary. Methodology & results are published here: https://neondatabase.notion.site/2024-04-08-async-walredo-benchmarking-8c0ed3cc8d364a44937c4cb50b6d7019?pvs=4 tl;dr: - use pagebench against a pageserver patched to answer getpage request & small-enough working set to fit into PS PageCache / kernel page cache. - compare knee in the latency/throughput curve - N tenants, each 1 pagebench clients - sync better throughput at N < 30, async better at higher N - async generally noticeable but not much worse p99.X tail latencies - eyeballing CPU efficiency in htop, `async` seems significantly more CPU efficient at ca N=[0.5ncpus, 1.5ncpus], worse than `sync` outside of that band Mental Model For Walredo & Scheduler Interactions ------------------------------------------------- Walredo is CPU-/DRAM-only work. This means that as soon as the Pageserver writes to the pipe, the walredo process becomes runnable. To the Linux kernel scheduler, the `$ncpus` executor threads and the walredo process thread are just `struct task_struct`, and it will divide CPU time fairly among them. 
In `sync` mode, there are always `$ncpus` runnable `struct task_struct` because the executor thread blocks while `walredo` runs, and the executor thread becomes runnable when the `walredo` process is done handling the request. In `async` mode, the executor threads remain runnable unless there are no more runnable tokio tasks, which is unlikely in a production pageserver. The above means that in `sync` mode, there is an implicit concurrency limit on concurrent walredo requests (`$num_runtimes * $num_executor_threads_per_runtime`). And executor threads do not compete in the Linux kernel scheduler for CPU time, due to the blocked-runnable-ping-pong. In `async` mode, there is no concurrency limit, and the walredo tasks compete with the executor threads for CPU time in the kernel scheduler. If we're not CPU-bound, `async` has a pipelining and hence throughput advantage over `sync` because one executor thread can continue processing requests while a walredo request is in flight. If we're CPU-bound, under a fair CPU scheduler, the fixed number of executor threads has to share CPU time with the aggregate of walredo processes. It's trivial to reason about this in `sync` mode due to the blocked-runnable-ping-pong. In `async` mode, at 100% CPU, the system arrives at some (potentially sub-optimal) equilibrium where the executor threads get just enough CPU time to fill up the remaining CPU time with runnable walredo process. Why `async` mode Doesn't Limit Walredo Concurrency -------------------------------------------------- To control that equilibrium in `async` mode, one may add a tokio semaphore to limit the number of in-flight walredo requests. However, the placement of such a semaphore is non-trivial because it means that tasks queuing up behind it hold on to their request-scoped allocations. In the case of walredo, that might be the entire reconstruct data. We don't limit the number of total inflight Timeline::get (we only throttle admission). 
So, that queue might lead to an OOM. The alternative is to acquire the semaphore permit before collecting reconstruct data. However, what if we need to on-demand download? A combination of semaphores might help: one for reconstruct data, one for walredo. The reconstruct data semaphore permit is dropped after acquiring the walredo semaphore permit. This scheme effectively enables both a limit on in-flight reconstruct data and walredo concurrency. However, sizing the amount of permits for the semaphores is tricky: - Reconstruct data retrieval is a mix of disk IO and CPU work. - If we need to do on-demand downloads, it's network IO + disk IO + CPU work. - At this time, we have no good data on how the wall clock time is distributed. It turns out that, in my benchmarking, the system worked fine without a semaphore. So, we're shipping async walredo without one for now. Future Work ----------- We will do more testing of `async` mode and gradual rollout to prod using the config flag. Once that is done, we'll remove `sync` mode to avoid the temporary code duplication introduced by this PR. The flag will be removed. The `wait()` for the child process to exit is still synchronous; the comment [here]( `655d3b6468/pageserver/src/walredo.rs (L294-L306)`) is still a valid argument in favor of that. The `sync` mode had another implicit advantage: from tokio's perspective, the calling task was using up coop budget. But with `async` mode, that's no longer the case -- to tokio, the writes to the child process pipe look like IO. We could/should inform tokio about the CPU time budget consumed by the task to achieve fairness similar to `sync`. However, the [runtime function for this is `tokio_unstable`](`https://docs.rs/tokio/latest/tokio/task/fn.consume_budget.html). Refs ---- refs #6628 refs https://github.com/neondatabase/neon/issues/2975	2024-04-15 22:14:42 +02:00
Anna Khanova	110282ee7e	proxy: Exclude private ip errors from recorded metrics (#7389 ) ## Problem Right now we record errors from internal VPC. ## Summary of changes * Exclude it from the metrics. * Simplify pg-sni-router	2024-04-15 20:21:50 +02:00
Christian Schwarz	f752c40f58	storage release: stop using no-op deployProxy / deployPgSniRouter (#7382 ) As of https://github.com/neondatabase/aws/pull/1264 these options are no-ops. This PR unblocks removal of the variables in https://github.com/neondatabase/aws/pull/1263	2024-04-15 15:05:44 +02:00

5056 Commits