rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-06 13:40:37 +00:00

Author	SHA1	Message	Date
Joonas Koivunen	b4cf3b18fa	test: assert that allowed error appears in log	2023-05-16 15:45:36 +03:00
Joonas Koivunen	6b1adf6e5e	test: test_concurrent_timeline_delete_if_first_stuck_at_index_upload docs	2023-05-16 15:45:36 +03:00
Joonas Koivunen	4591698cd4	fixup: refactor: stop using todo! for a panic!	2023-05-16 15:45:36 +03:00
Joonas Koivunen	81e8c10069	fixup: move doc test as test case and hide _spawn variants	2023-05-16 15:45:36 +03:00
Joonas Koivunen	6bd3951141	fixup: nicer ascii art	2023-05-16 15:45:36 +03:00
Joonas Koivunen	511978be6f	doc: remove FIXME simplify uploading there's no need to simplify the uploading; the uploading is not on within this one future which will be executed so it's not safe to just remove it or revert it back. also I am hesitant to add new changes at this point, the diff is large enough.	2023-05-16 15:45:36 +03:00
Joonas Koivunen	1e94bbf249	test: less racy test_delete_timeline_client_hangup with the SharedRetried the first request creates a task which will continue to the pause point regardless of the first request, so we need to accept a 404 as success.	2023-05-16 15:45:36 +03:00
Joonas Koivunen	a0f63b8264	test: fix test_concurrent_timeline_delete_if_first_stuck_at_index_upload manually cherry-picked `245a2a0592`	2023-05-16 15:45:36 +03:00
Joonas Koivunen	c948ee3975	fixup: one missed logging opportunity	2023-05-16 15:45:36 +03:00
Joonas Koivunen	3784166180	doc: more cleanup	2023-05-16 15:45:36 +03:00
Joonas Koivunen	8c1962d40c	refactor: simplify, racy removes no longer intended	2023-05-16 15:45:36 +03:00
Joonas Koivunen	d3ce8eae4e	fixup: doc: align strong usage, fix broken references	2023-05-16 15:45:36 +03:00
Joonas Koivunen	3b1141b344	fixup: doc: explain convoluted return type	2023-05-16 15:45:36 +03:00
Joonas Koivunen	17c92c885c	fixup: doc: many futures instead of tasks as in, you could run many of these futures with tokio::select within one task or `runtime.block_on`; it does not matter.	2023-05-16 15:45:36 +03:00
Joonas Koivunen	43b397da10	fixup: clippy	2023-05-16 15:45:36 +03:00
Joonas Koivunen	00673cf900	fix: coalesce requests to delete_timeline	2023-05-16 15:45:36 +03:00
Joonas Koivunen	ba7c97b61c	feat: utils::shared_retryable::SharedRetryable documented in the module.	2023-05-16 15:45:34 +03:00
Dmitry Rodionov	a0b34e8c49	add create tenant metric to storage operations (#4231) Add a metric to track time spent in create tenant requests. Originated from https://github.com/neondatabase/neon/pull/4204	2023-05-16 15:15:29 +03:00
bojanserafimov	fdc1c12fb0	Simplify github PR template (#4241)	2023-05-16 08:13:54 -04:00
Alexander Bayandin	0322e2720f	Nightly Benchmarks: add neonvm to pgbench-compare (#4225)	2023-05-16 12:46:28 +01:00
Vadim Kharitonov	4f64be4a98	Add endpoint to connection string	2023-05-15 23:45:04 +02:00
Tristan Partin	e7514cc15e	Wrap naked PQerrorMessage calls in libpagestore with pchomp (#4242)	2023-05-15 15:36:53 -05:00
Tristan Partin	6415dc791c	Fix use-after-free issue in libpagestore (#4239) `pageserver_disconnect()` calls `PQfinish()`, which deallocates resources on the connection structure. `PQerrorMessage()` hands back a pointer to an allocated resource. Duplicate the error message prior to calling `pageserver_disconnect()`. Fixes https://github.com/neondatabase/neon/issues/4214	2023-05-15 13:38:18 -05:00
Alexander Bayandin	a5615bd8ea	Fix Allure reports for different benchmark jobs (#4229 ) - Fix Allure report generation failure for Nightly Benchmarks - Fix GitHub Autocomment for `run-benchmarks` label (`build_and_test.yml::benchmarks` job)	2023-05-15 13:04:03 +01:00
Joonas Koivunen	4a76f2b8d6	upload new timeline index part json before 201 or on retry (#4204) Wait for the upload to complete before returning 201 Created on `branch_timeline` or when `bootstrap_timeline` happens. Should either of those waits fail, then on the retried request wait for the uploads again. This should work as expected assuming control-plane does not start to use timeline creation as a wait_for_upload mechanism. Fixes #3865, started from https://github.com/neondatabase/neon/pull/3857/files#r1144468177 Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2023-05-15 14:16:43 +03:00
Shany Pozin	9cd6f2ceeb	Remove duplicated logic in creating TenantConfOpt (#4230) Remove duplicated logic in creating TenantConfOpt in both TryFrom of TenantConfigRequest and TenantCreateRequest	2023-05-15 10:08:44 +03:00
Heikki Linnakangas	2855c73990	Fix race condition after attaching tenant with branches. (#4170 ) After tenant attach, there is a window where the child timeline is loaded and accepts GetPage requests, but its parent is not. If a GetPage request needs to traverse to the parent, it needs to wait for the parent timeline to become active, or it might miss some records on the parent timeline. It's also possible that the parent timeline is active, but it hasn't yet received all the WAL up to the branch point from the safekeeper. This happens if a pageserver crashes soon after creating a timeline, so that the WAL leading to the branch point has not yet been uploaded to remote storage. After restart, the WAL will be re-streamed and ingested from the safekeeper, but that takes a while. Because of that, it's not enough to check that the parent timeline is active, we also need to wait for the WAL to arrive on the parent timeline, just like at the beginning of GetPage handling. We probably should change the behavior at create_timeline so that a timeline can only be created after all the WAL up to the branch point has been uploaded to remote storage, but that's not currently the case and out of scope for this PR (see github issue #4218). @NanoBjorn encountered this while working on tenant migration. After migrating a tenant with a parent and child branch, connecting to the child branch failed with an error like: ``` FATAL: "base/16385" is not a valid data directory DETAIL: File "base/16385/PG_VERSION" is missing. ``` This commit adds two tests that reproduce the bug, with slightly different symptoms.	2023-05-13 10:44:11 +03:00
Christian Schwarz	edcf4d61a4	distinguish imitated from real size::gather_input calls in metrics (#4224 ) Before this PR, the gather_inputs() calls made to imitate synthetic size calculation accesses were accounted towards the real logical size calculation metric. This PR forces all callers to declare the cause for making logical size calculations, making the decision which cause counts towards which metric explicit. This is follow-up to ``` commit `1d266a6365` Author: Christian Schwarz <christian@neon.tech> Date: Thu May 11 16:09:29 2023 +0200 logical size calculation metrics: differentiate regular vs imitated (#4197) ``` After merging this patch, I hope to be able to explain why we have ca 30x more "logical size" ops in prod than "imitate logical size" for any given observation interval. refs https://github.com/neondatabase/neon/issues/4154	2023-05-12 17:57:33 +00:00
Christian Schwarz	a2a9c598be	add counter metric that increases whenever a background loop overruns its period (#4223) We already have the warn!() log line for this condition. This PR adds a corresponding metric on which we can have a dedicated alert. This is cheaper and more reliable than alerting on the logs, because we run into log rate limits from time to time these days. refs https://github.com/neondatabase/neon/issues/4222	2023-05-12 19:00:06 +03:00
Alexander Bayandin	bb06d281ea	Run regression tests on both Postgres 14 and 15 (#4192) This PR adds test runs on Postgres 15 and creates a unified Allure report with results for all tests. - Split `.github/actions/allure-report` into `.github/actions/allure-report-store` and `.github/actions/allure-report-generate` - Add debug or release pytest parameter for all tests (depending on `BUILD_TYPE` env variable) - Add Postgres version as a pytest parameter for all tests (depending on `DEFAULT_PG_VERSION` env variable) - Fix `test_wal_restore` and `restore_from_wal.sh` to support paths with `[`/`]` in them (fixed by applying shellcheck to the script and fixing all warnings); `restore_from_wal_archive.sh` is deleted as unused. - All known failures on Postgres 15 are marked with xfail	2023-05-12 15:28:51 +01:00
Christian Schwarz	5869234290	logical size calculation: spawn with in_current_span (#4196 ) While investigating https://github.com/neondatabase/neon/issues/4154 I found that the `Calculating logical size for timeline` tracing events created from within the logical size computation code are not always attributable to the background task that caused it. My goal is to be able to distinguish in the logs whether a `Calculating logical size for timeline` was logged as part of a real synthetic size calculation VS an imitation by the eviction task. I want this distinction so I can prove my assumption that the disk IO peaks which we see every 24h on prod are due to eviction's imitate synthetic size calculations. The alternative here, which I would have preferred, but is more work: link RequestContext's into a child->parent list and dump this list when we log `Calculating logical size for timeline`. I would have preferred that over what we have in this PR because, technically, the ondemand logical size computation can outlive the caller that spawned it. This is against the idea of correctly nested spans. I guess in OpenTelemetry land, the correct modelling would be a link between the caller's span and the task_mgr task's span. Anyways, I think the case where we hang up on the spawned ondemand logical size calculation is quite rare. So, I'm willing to tolerate incorrectly nested spans for these edge-cases. refs https://github.com/neondatabase/neon/issues/4154	2023-05-12 15:36:30 +02:00
Rahul Modpur	ecfe4757d3	fix bogus at character context in log messages Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>	2023-05-11 23:31:42 +01:00
Christian Schwarz	845e296562	eviction: add global histogram for iteration durations (#4212) I would like to know whether and by how much the eviction iterations spike in the $period-sized window that happens every $threshold, when all the timelines do the imitate accesses. refs https://github.com/neondatabase/neon/issues/4154	2023-05-11 18:02:19 +03:00
Heikki Linnakangas	1988cc5527	Fix `failpoint_sleep_millis_async` without `use std::time::Duration` (#4195 ) I tried to use failpoint_sleep_millis_async(...) in a source file that didn't do `use std::time::Duration`, and got a compiler error: ``` error[E0433]: failed to resolve: use of undeclared type `Duration` --> pageserver/src/walingest.rs:316:17 \| 316 \| utils::failpoint_sleep_millis_async!("wal-ingest-logical-message-sleep"); \| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ not found in this scope \| = note: this error originates in the macro `utils::failpoint_sleep_millis_async` (in Nightly builds, run with -Z macro-backtrace for more info) help: consider importing one of these items \| 24 \| use chrono::Duration; \| 24 \| use core::time::Duration; \| 24 \| use humantime::Duration; \| 24 \| use serde_with::__private__::Duration; \| and 2 other candidates ```	2023-05-11 17:53:42 +03:00
Christian Schwarz	1d266a6365	logical size calculation metrics: differentiate regular vs imitated (#4197) I want this distinction so I can prove my assumption that the disk IO peaks which we see every 24h on prod are due to eviction's imitate synthetic size calculations. refs https://github.com/neondatabase/neon/issues/4154	2023-05-11 17:09:29 +03:00
Christian Schwarz	80522a1b9d	replace has_in_progress_downloads with new attachment_status field (#4168) Control Plane currently [^1] polls for `has_in_progress_downloads == false` after /attach to determine that an attach operation succeeded. As pointed out in the OpenAPI spec as of neon#4151, polling for `has_in_progress_downloads` is incorrect. This patch changes the situation by - removing `has_in_progress_downloads` - adding a new field `attachment_status` - changing instructions for `/attach` to poll for `attachment_status == attached`. This makes the instructions in `/attach` actionable for Control Plane. NB that we don't expose the TenantState in the OpenAPI docs, even though we expose it in the endpoint. That is with good reason because we don't want to commit to a fixed set of tenant states forever. Hence, the separate `attachment_status` field that exposes the bare minimum required to make /attach + subsequent polling 100% safe wrt split brain. It would have been nice to report failures explicitly, but the problem is that we lose that state when we restart. So, we return `attached` upon attach failure. The tenant is Broken in that case, causing Control Plane's subsequent health check to fail. Control Plane can roll back the relocation operation then. NB: the reliance on the subsequent health check is no change to what we had before this patch! NB: we can always add additional TenantAttachmentStatus'es in the future to communicate failure. This PR also moves the attach-marker file's creation to the API handler's synchronous part. That was done to avoid the need to distinguish * `Attaching but marker not yet written => AttachmentStatus::Maybe` from * `Attaching, marker written, but attach failed for other reason => AttachmentStatus::Attached` Coincidentally, this also adds more transactionality to the /attach API because we only return 202 once we've written the marker file.
But, in the end, it doesn't affect how the control plane interacts with us or how it needs to do retries. So, we don't mention any of this in the API docs. [^1]: The one-click tenant relocation PR cloud#4740, currently WIP, is the first real user.	2023-05-11 16:53:46 +03:00
Joonas Koivunen	ecced13d90	try: higher page_service timeouts to isolate an issue (#4206) See #4205.	2023-05-11 16:14:42 +03:00
Alexander Bayandin	59510f6449	scripts/flaky_tests.py: use retriesStatusChange from Allure	2023-05-10 16:59:03 +01:00
Alexander Bayandin	7fc778d251	GitHub Autocomment: fix flaky test notifications	2023-05-10 16:59:03 +01:00
Alexander Bayandin	1d490b2311	Make benchmark_fixture less noisy	2023-05-10 16:59:03 +01:00
Dmitry Rodionov	eb3a8be933	keep track of timeline deletion status in IndexPart to prevent timeline resurrection (#3919) Before this patch, the following sequence would lead to the resurrection of a deleted timeline: - create timeline - wait for its index part to reach s3 - delete timeline - wait an arbitrary amount of time, including 0 seconds - detach tenant - attach tenant - the timeline is there and Active again This happens because we only kept track of the deletion in the tenant dir (by deleting the timeline dir) but not in S3. The solution is to turn the deleted timeline's IndexPart into a tombstone. The deletion status of the timeline is expressed in the `deleted_at: Option<NaiveDateTime>` field of IndexPart. It's `None` while the timeline is alive and `Some(deletion time stamp)` if it is deleted. We change the timeline deletion handler to upload this tombstoned IndexPart. The handler does not return success if the upload fails. Coincidentally, this fixes the long-standing TODO about the `std::fs::remove_dir_all` not being atomic. It need not be atomic anymore because we set the `deleted_at=Some()` before starting the `remove_dir_all`. The tombstone is in the IndexPart only, not in the `metadata`. So, we only have the tombstone and the `remove_dir_all` benefits mentioned above if remote storage is configured. This was a conscious trade-off because there's no good format evolution story for the current metadata file format. The introduction of this additional step into `delete_timeline` was painful because delete_timeline needs to be 1. cancel-safe 2. idempotent 3. safe to call concurrently These are mostly self-inflicted limitations that can be avoided by using request-coalescing. PR https://github.com/neondatabase/neon/pull/4159 will do that.
fixes https://github.com/neondatabase/neon/issues/3560 refs https://github.com/neondatabase/neon/issues/3889 (part of tenant relocation) Co-authored-by: Joonas Koivunen <joonas@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-05-10 10:27:12 +02:00
Christian Schwarz	3ec52088dd	eviction_task: tracing::instrument the imitate-access calls (#4180) Currently, if we unexpectedly download from the eviction task, the log lines look like what we have in https://github.com/neondatabase/neon/issues/4154 ``` 2023-05-04T14:42:57.586772Z WARN eviction_task{tenant_id=$TENANT timeline_id=$TIMELINE}:eviction_iteration{policy_kind="LayerAccessThreshold"}: unexpectedly on-demand downloading remote layer remote $TIMELINE/000000067F000032AC0000400C00FFFFFFFF-000000067F000032AC000040140000000008__0000000001696070-0000000003DC76E9 for task kind Eviction ``` We know these are caused by the imitate accesses. But we don't know which one (my bet is on update_gc_info). I didn't want to pollute the other tasks' logs with the additional spans, so I'm using `.instrument()` when we call non-eviction-task code. refs https://github.com/neondatabase/neon/issues/4154	2023-05-09 18:16:22 +02:00
Heikki Linnakangas	66b06e416a	Pass tracing context in env variables instead of the spec file. (#4174) If compute_ctl is launched without a spec file, it fetches it from the control plane with an HTTP request. We cannot get the startup tracing context from the compute spec in that case, because we don't have it available on start. We could still read the tracing context from the compute spec after we have fetched it, but that would leave the fetch itself out of the context. Pass the tracing context in environment variables instead.	2023-05-09 17:08:02 +03:00
Arthur Petukhovsky	d62315327a	Allow parallel backup in safekeepers (#4177) Add `wal_backup_parallel_jobs` cmdline argument to specify the max count of parallel segment uploads. New default value is 5, meaning that safekeepers will try to upload 5 segments concurrently if they are available. Setting this value to 1 will be equivalent to the sequential upload that we had before. Part of https://github.com/neondatabase/neon/issues/3957	2023-05-09 12:20:35 +03:00
Anastasia Lubennikova	4bd7b1daf2	Bump vendor/postgres: Fix entering hot standby mode for Neon postgres v15	2023-05-08 21:25:47 +01:00
Sergey Melnikov	0d3d022eb1	Remove deploy workflows (#4157) Removing deploy workflows (moving to aws repo)	2023-05-08 17:30:16 +02:00
Raouf Chebri	e85cbddd2e	Update neondatabase banner in README.md (#4176)	2023-05-08 17:12:42 +02:00
Anton Chaporgin	51ff9f9359	pg-sni-router nlb is internal (#4164)	2023-05-08 18:03:50 +03:00
Vadim Kharitonov	0f8b2d8f0a	Compile kq_imcx extension (#3568) Compiles the kq_imcx extension. At this moment, there are some issues with the extension: 1. I'm cloning it directly from the master branch. It's better to fetch a tag/archive. 2. PG14: ``` postgres=# CREATE EXTENSION IF NOT EXISTS kq_imcx; postgres=# select * from kq_calendar_cache_info(); 2023-02-08 13:55:22.853 UTC [412] ERROR: relation "ketteq.slice_type" does not exist at character 34 2023-02-08 13:55:22.853 UTC [412] QUERY: select min(s.id), max(s.id) from ketteq.slice_type s 2023-02-08 13:55:22.853 UTC [412] STATEMENT: select * from kq_calendar_cache_info(); ERROR: relation "ketteq.slice_type" does not exist LINE 1: select min(s.id), max(s.id) from ketteq.slice_type s ``` 3. PG15: `cannot request additional shared memory outside shmem_request_hook` Note: I don't think we need to publish info about this extension in the docs. Issue: neondatabase/cloud#3387	2023-05-08 15:56:08 +02:00
Gleb Novikov	9860d59aa2	Public docker image repository by default	2023-05-08 15:51:54 +04:00

3185 Commits