## Problem
Let's keep 500 for unusual stuff that is not considered normal. This came
up during one of the discussions around console logs, which now show these 500s.
## Summary of changes
- Return 409 Conflict instead of 500
- Remove the 200 OK status because it is not used anymore
This PR concludes the "async `Layer::get_value_reconstruct_data`"
project.
The problem we're solving is that, before this patch, we'd execute
`Layer::get_value_reconstruct_data` on the tokio executor threads.
This function is IO- and/or CPU-intensive.
The IO is using VirtualFile / std::fs; hence it's blocking.
This results in unfairness towards other tokio tasks, especially under
(disk) load.
Some context can be found at
https://github.com/neondatabase/neon/issues/4154
where I suspect (but can't prove) that load spikes from logical size
calculation cause heavy eviction skew.
Sadly we don't have tokio runtime/scheduler metrics to quantify the
unfairness.
But generally, we know blocking the executor threads on std::fs IO is
bad.
So, let's have this change and watch out for severe perf regressions in
staging & during rollout.
## Changes
* rename `Layer::get_value_reconstruct_data` to
`Layer::get_value_reconstruct_data_blocking`
* add a new blanket impl'd `Layer::get_value_reconstruct_data`
`async_trait` method that runs `get_value_reconstruct_data_blocking`
inside `spawn_blocking`.
* `spawn_blocking` requires a `'static` lifetime for the captured
variables; hence I had to change the data flow to _move_ the
`ValueReconstructState` into and back out of `get_value_reconstruct_data`
instead of passing a reference. It's a small struct, so I don't expect a
big performance penalty. A sketch of the pattern follows after this list.
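For reference, here is a minimal sketch of that pattern, with simplified
placeholder types (the real `Layer` trait and `ValueReconstructState` have more
to them) and assuming the `async_trait`, `tokio`, and `anyhow` crates; the
`Arc<Self>` receiver is just one way to satisfy the `'static` requirement and
may differ from the actual PR:

```rust
use std::sync::Arc;

use async_trait::async_trait;

#[derive(Default)]
struct ValueReconstructState {
    // ... img, records, etc. in the real struct ...
}

#[async_trait]
trait Layer: Send + Sync + 'static {
    /// The existing synchronous implementation, renamed.
    fn get_value_reconstruct_data_blocking(
        &self,
        state: ValueReconstructState,
    ) -> anyhow::Result<ValueReconstructState>;

    /// Blanket-provided async method that offloads the blocking work.
    async fn get_value_reconstruct_data(
        self: Arc<Self>,
        state: ValueReconstructState,
    ) -> anyhow::Result<ValueReconstructState> {
        // spawn_blocking requires 'static captures, hence the owned receiver
        // and the moved-in / moved-out state instead of references.
        tokio::task::spawn_blocking(move || self.get_value_reconstruct_data_blocking(state))
            .await?
    }
}
```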
## Performance
Fundamentally, the code changes cause the following performance-relevant
changes:
* Latency & allocations: each `get_value_reconstruct_data` call now
makes a short-lived allocation because `async_trait` is just sugar for
boxed futures under the hood
* Latency: `spawn_blocking` adds some latency because it needs to move
the work to a thread pool
* using `spawn_blocking` plus the existing synchronous code inside is
probably more efficient than switching all the synchronous code to
tokio::fs, because _each_ tokio::fs call does `spawn_blocking` under
the hood.
* Throughput: the `spawn_blocking` thread pool is much larger than the
async executor thread pool. Hence, as long as the disks can keep up,
which they should according to AWS specs, we will be able to deliver
higher `get_value_reconstruct_data` throughput.
* Disk IOPS utilization: we will see higher disk utilization if we get
more throughput. Not a problem because the disks in prod are currently
under-utilized, according to node_exporter metrics & the AWS specs.
* CPU utilization: at higher throughput, CPU utilization will be higher.
Slightly higher latency under regular load is acceptable given the
throughput gains and expected better fairness during disk load peaks,
such as logical size calculation peaks uncovered in #4154.
## Full Stack Of Preliminary PRs
This PR builds on top of the following preliminary PRs
1. Clean-ups
* https://github.com/neondatabase/neon/pull/4316
* https://github.com/neondatabase/neon/pull/4317
* https://github.com/neondatabase/neon/pull/4318
* https://github.com/neondatabase/neon/pull/4319
* https://github.com/neondatabase/neon/pull/4321
* Note: these were mostly to find an alternative to #4291, which I
thought we'd need in my original plan, where we would have had to convert
`Tenant::timelines` into an async locking primitive (#4333). In reviews,
we walked away from that, but these cleanups were still quite useful.
2. https://github.com/neondatabase/neon/pull/4364
3. https://github.com/neondatabase/neon/pull/4472
4. https://github.com/neondatabase/neon/pull/4476
5. https://github.com/neondatabase/neon/pull/4477
6. https://github.com/neondatabase/neon/pull/4485
7. https://github.com/neondatabase/neon/pull/4441
The stats for `compact_level0_phase` that I added in #4527 show the
following breakdown (24h data from prod, only looking at compactions
with > 1 L1 produced):
* 10%ish of wall-clock time spent between the two read locks
* I learned that the `DeltaLayer::iter()` and `DeltaLayer::key_iter()`
calls actually do IO, even before we call `.next()`. I suspect that is
why they take so much time between the locks.
* 80+% of wall-clock time spent writing layer files
* Lock acquisition time is irrelevant (low double-digit microseconds at
most)
* The generation of the holes holds the read lock for a relatively long
time, and the time is proportional to the number of keys / the IO required
to iterate over them (max: 110ms in prod; staging (nightly benchmarks):
multiple seconds).
Find below screenshots from my ad-hoc spreadsheet + some graphs.
<img width="1182" alt="image"
src="https://github.com/neondatabase/neon/assets/956573/81398b3f-6fa1-40dd-9887-46a4715d9194">
<img width="901" alt="image"
src="https://github.com/neondatabase/neon/assets/956573/e4ac0393-f2c1-4187-a5e5-39a8b0c394c9">
<img width="210" alt="image"
src="https://github.com/neondatabase/neon/assets/956573/7977ade7-6aa5-4773-a0a2-f9729aecee0d">
## Changes In This PR
This PR makes the following changes:
* rearrange the `compact_level0_phase1` code such that we build the
`all_keys_iter` and `all_values_iter` later than before
* only grab the `Timeline::layers` lock once, and hold it until we've
computed the holes
* run compact_level0_phase1 in spawn_blocking, pre-grabbing the
`Timeline::layers` lock in the async code and passing it in as an
`OwnedRwLockReadGuard`.
* the code inside spawn_blocking drops this guard after computing the
holes
* the `OwnedRwLockReadGuard` requires `Timeline::layers` to be
wrapped in an `Arc`. I think that's OK: the locking for the RwLock is
more heavyweight than an additional pointer indirection. See the sketch
below for the overall pattern.
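Here's a minimal sketch of that locking pattern, with a placeholder layer map
type standing in for the real contents of `Timeline::layers`:

```rust
use std::sync::Arc;
use tokio::sync::{OwnedRwLockReadGuard, RwLock};

type LayerMap = Vec<String>; // stand-in for the real layer map

async fn compact_level0_phase1_sketch(layers: Arc<RwLock<LayerMap>>) {
    // `read_owned` needs the Arc wrapper; the resulting guard is 'static and
    // can therefore be moved into spawn_blocking.
    let guard: OwnedRwLockReadGuard<LayerMap> = layers.read_owned().await;
    tokio::task::spawn_blocking(move || {
        // ... compute the holes while holding the read lock ...
        let _num_layers = guard.len();
        // Release the lock before the long, IO-heavy part (writing layer files).
        drop(guard);
        // ... write new layer files without blocking writers ...
    })
    .await
    .expect("spawn_blocking task panicked");
}
```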
## Alternatives Considered
The naive alternative is to throw the entire function into
`spawn_blocking`, and use `blocking_read` for `Timeline::layers` access.
What I've done in this PR is better because, with this alternative:
1. while we `blocking_read()`, we'd waste one slot in the spawn_blocking
pool
2. there's deadlock risk because the spawn_blocking pool is a finite
resource

## Metadata
Fixes https://github.com/neondatabase/neon/issues/4492
## Problem
Currently, if a user creates a role, it won't have any grants applied to
it by default. If the compute restarts, the grants get applied. This
gives a very strange UX: at first you can drop roles and don't have access
to anything, and then, once something triggers a config application,
grants are suddenly applied. This PR removes these grants.
This is a follow-up to
```
commit 2252c5c282
Author: Alex Chi Z <iskyzh@gmail.com>
Date: Wed Jun 14 17:12:34 2023 -0400
metrics: convert some metrics to pageserver-level (#4490)
```
The consumption metrics synthetic size worker does logical size
calculation. Logical size calculation currently does synchronous disk
IO. This blocks the MGMT_REQUEST_RUNTIME's executor threads, starving
other futures.
While there's work on the way to move the synchronous disk IO into
spawn_blocking, the quickfix here is to use the BACKGROUND_RUNTIME
instead of MGMT_REQUEST_RUNTIME.
Actually it's not just a quickfix. We simply shouldn't be blocking
MGMT_REQUEST_RUNTIME executor threads on CPU or sync disk IO.
That work isn't done yet, as many of the mgmt tasks still _do_ disk IO.
But it's not as intensive as the logical size calculations that we're
fixing here.
While we're at it, fix disk-usage-based eviction in a similar way. It
wasn't the culprit here, according to prod logs, but it can
theoretically be a little CPU-intensive.
More context, including graphs from Prod:
https://neondb.slack.com/archives/C03F5SM1N02/p1687541681336949
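A minimal sketch of the idea, assuming runtimes defined with `once_cell` (the
actual runtime definitions live in the pageserver's task management code and
may differ in detail):

```rust
use once_cell::sync::Lazy;
use tokio::runtime::Runtime;

static MGMT_REQUEST_RUNTIME: Lazy<Runtime> = Lazy::new(|| {
    tokio::runtime::Builder::new_multi_thread()
        .thread_name("mgmt request worker")
        .enable_all()
        .build()
        .unwrap()
});

static BACKGROUND_RUNTIME: Lazy<Runtime> = Lazy::new(|| {
    tokio::runtime::Builder::new_multi_thread()
        .thread_name("background op worker")
        .enable_all()
        .build()
        .unwrap()
});

fn spawn_synthetic_size_worker() -> tokio::task::JoinHandle<()> {
    // Logical size calculation currently does synchronous disk IO, so run the
    // worker on the background runtime instead of the management-request
    // runtime, where it would starve other futures.
    BACKGROUND_RUNTIME.spawn(async {
        // ... consumption metrics / synthetic size calculation loop ...
    })
}
```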
The docs say it should be added to `shared_preload_libraries`, but in
practice it's not required:
```
postgres=# create extension pg_uuidv7;
CREATE EXTENSION
postgres=# SELECT uuid_generate_v7();
uuid_generate_v7
--------------------------------------
0188e823-3f8f-796c-a92c-833b0b2d1746
(1 row)
```
The histogram distinguishes by ok/err.
I took the liberty of creating a small abstraction for such use cases.
It helps keep the label values inside `metrics.rs`, right next
to the place where the metric and its labels are declared.
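A minimal sketch of the kind of abstraction meant here (not the actual code in
`metrics.rs`), using the `prometheus` crate directly and a hypothetical metric
name: the allowed label values live in an enum right next to the metric
declaration, so call sites can't pass arbitrary strings.

```rust
use once_cell::sync::Lazy;
use prometheus::{register_histogram_vec, HistogramVec};

/// The only label values this metric accepts.
#[derive(Clone, Copy)]
pub enum Outcome {
    Ok,
    Err,
}

impl Outcome {
    fn as_str(self) -> &'static str {
        match self {
            Outcome::Ok => "ok",
            Outcome::Err => "err",
        }
    }
}

static OPERATION_SECONDS: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "example_operation_seconds", // hypothetical metric name
        "Time spent in the operation, split by outcome",
        &["result"]
    )
    .unwrap()
});

pub fn observe_outcome(outcome: Outcome, seconds: f64) {
    OPERATION_SECONDS
        .with_label_values(&[outcome.as_str()])
        .observe(seconds);
}
```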
## Problem
A git tag for a release has an extra `release-` prefix (it looks like
`release-release-3439`).
## Summary of changes
- Do not add the `release-` prefix when creating a git tag
## Problem
In the test environment, vacuum duration fluctuates from ~1h to ~5h;
together with the other two 1h benchmarks (`select-only` and
`simple-update`), it can take up to 7h, which is longer than the 6h timeout.
## Summary of changes
- Increase timeout for pgbench-compare job to 8h
- Remove the 6h timeouts from Nightly Benchmarks (6h is the default value)
* `compaction_threshold` should be an integer, not a string.
* uncomment `[section]` so that if a user needs to modify the config,
they can simply uncomment the corresponding line. Otherwise it's easy
to forget to uncomment the `[section]` when uncommenting the config item
we want to configure.
Signed-off-by: Alex Chi <iskyzh@gmail.com>
Commit
```
commit 472cc17b7a
Author: Dmitry Rodionov <dmitry@neon.tech>
Date: Thu Jun 15 17:30:12 2023 +0300
propagate lock guard to background deletion task (#4495)
```
did a drive-by fix, but the drive-by had a typo:
```
gc_loop{tenant_id=2e2f2bff091b258ac22a4c4dd39bd25d}:update_gc_info{timline_id=837c688fd37c903639b9aa0a6dd3f1f1}:download_remote_layer{layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000024DA0D1-000000000443FB51}:panic{thread=background op worker location=pageserver/src/tenant/timeline.rs:4843:25}: missing extractors: ["TimelineId"]
Stack backtrace:
0: utils::logging::tracing_panic_hook
at /libs/utils/src/logging.rs:166:21
1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/alloc/src/boxed.rs:2002:9
2: std::panicking::rust_panic_with_hook
at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:692:13
3: std::panicking::begin_panic_handler::{{closure}}
at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:579:13
4: std::sys_common::backtrace::__rust_end_short_backtrace
at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/sys_common/backtrace.rs:137:18
5: rust_begin_unwind
at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
6: core::panicking::panic_fmt
at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
7: pageserver::tenant::timeline::debug_assert_current_span_has_tenant_and_timeline_id
at /pageserver/src/tenant/timeline.rs:4843:25
8: <pageserver::tenant::timeline::Timeline>::download_remote_layer::{closure#0}::{closure#0}
at /pageserver/src/tenant/timeline.rs:4368:9
9: <tracing::instrument::Instrumented<<pageserver::tenant::timeline::Timeline>::download_remote_layer::{closure#0}::{closure#0}> as core::future::future::Future>::poll
at /.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-0.1.37/src/instrument.rs:272:9
10: <pageserver::tenant::timeline::Timeline>::download_remote_layer::{closure#0}
at /pageserver/src/tenant/timeline.rs:4363:5
11: <pageserver::tenant::timeline::Timeline>::get_reconstruct_data::{closure#0}
at /pageserver/src/tenant/timeline.rs:2618:69
12: <pageserver::tenant::timeline::Timeline>::get::{closure#0}
at /pageserver/src/tenant/timeline.rs:565:13
13: <pageserver::tenant::timeline::Timeline>::list_slru_segments::{closure#0}
at /pageserver/src/pgdatadir_mapping.rs:427:42
14: <pageserver::tenant::timeline::Timeline>::is_latest_commit_timestamp_ge_than::{closure#0}
at /pageserver/src/pgdatadir_mapping.rs:390:13
15: <pageserver::tenant::timeline::Timeline>::find_lsn_for_timestamp::{closure#0}
at /pageserver/src/pgdatadir_mapping.rs:338:17
16: <pageserver::tenant::timeline::Timeline>::update_gc_info::{closure#0}::{closure#0}
at /pageserver/src/tenant/timeline.rs:3967:71
17: <tracing::instrument::Instrumented<<pageserver::tenant::timeline::Timeline>::update_gc_info::{closure#0}::{closure#0}> as core::future::future::Future>::poll
at /.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-0.1.37/src/instrument.rs:272:9
18: <pageserver::tenant::timeline::Timeline>::update_gc_info::{closure#0}
at /pageserver/src/tenant/timeline.rs:3948:5
19: <pageserver::tenant::Tenant>::refresh_gc_info_internal::{closure#0}
at /pageserver/src/tenant.rs:2687:21
20: <pageserver::tenant::Tenant>::gc_iteration_internal::{closure#0}
at /pageserver/src/tenant.rs:2551:13
21: <pageserver::tenant::Tenant>::gc_iteration::{closure#0}
at /pageserver/src/tenant.rs:1490:13
22: pageserver::tenant::tasks::gc_loop::{closure#0}::{closure#0}
at /pageserver/src/tenant/tasks.rs:187:21
23: pageserver::tenant::tasks::gc_loop::{closure#0}
at /pageserver/src/tenant/tasks.rs:208:5
```
## Problem
During timeline creation we create a special mark file whose presence
indicates that initialization didn't complete successfully. In case of a
crash restart we can remove such a half-initialized timeline, and a
subsequent retry from the control plane side should perform another attempt.
So in case of a possible crash restart during initial loading we have the
following picture:
```
timelines
| - <timeline_id>___uninit
| - <timeline_id>
| - | <timeline files>
```
We call `std::fs::read_dir` to walk the files in the `timelines` directory
one by one. If we see an uninit mark file, we delete both the timeline
directory and the uninit mark. If we see a timeline directory, we check
whether an uninit mark exists and do the same cleanup.
But in fact it's possible for both branches to be taken for the same
timeline: the result of readdir doesn't reflect subsequent modifications
of the directory, so you can still get a "valid" entry on a later iteration
of the loop even though it was deleted in one of the previous iterations.
To see this, you can apply the following patch (it disables uninit mark
cleanup on successful timeline creation):
```diff
diff --git a/pageserver/src/tenant.rs b/pageserver/src/tenant.rs
index 4beb2664..b3cdad8f 100644
--- a/pageserver/src/tenant.rs
+++ b/pageserver/src/tenant.rs
@@ -224,11 +224,6 @@ impl UninitializedTimeline<'_> {
)
})?;
}
- uninit_mark.remove_uninit_mark().with_context(|| {
- format!(
- "Failed to remove uninit mark file for timeline {tenant_id}/{timeline_id}"
- )
- })?;
v.insert(Arc::clone(&new_timeline));
new_timeline.maybe_spawn_flush_loop();
```
And perform the following steps:
```bash
neon_local init
neon_local start
neon_local tenant create
neon_local stop
neon_local start
```
The error is:
```log
INFO load{tenant_id=X}:blocking: Found an uninit mark file .neon/tenants/X/timelines/Y.___uninit, removing the timeline and its uninit mark
2023-06-09T18:43:41.664247Z ERROR load{tenant_id=X}: load failed, setting tenant state to Broken: failed to load metadata
Caused by:
0: Failed to read metadata bytes from path .neon/tenants/X/timelines/Y/metadata
1: No such file or directory (os error 2)
```
So the uninit mark got deleted together with the timeline directory, but we
still got a directory entry for it and tried to load it.
The bug prevented the tenant from being loaded successfully.
## Summary of changes
Ideally, I think we shouldn't place uninit marks in the same directory as the
timeline directories, but rather move them to a separate directory and
gather them as an input to the actual listing. However, that would be sort
of an on-disk format change, so instead just check whether entries are still
valid before operating on them (see the sketch below).
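A minimal sketch of that check (hypothetical helper name, simplified paths):
skip any entry that no longer exists on disk, since earlier iterations of the
loop may have deleted it.

```rust
use std::fs;
use std::path::Path;

fn load_timelines(timelines_dir: &Path) -> std::io::Result<()> {
    for entry in fs::read_dir(timelines_dir)? {
        let path = entry?.path();
        // An earlier iteration may have removed this entry (e.g. a timeline
        // directory deleted together with its uninit mark), so verify it
        // still exists before operating on it.
        if !path.exists() {
            continue;
        }
        // ... handle uninit marks / load the timeline as before ...
    }
    Ok(())
}
```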
The data will help decide whether it's OK
to keep holding Timeline::layers in shared mode until
after we've calculated the holes.
The other timings are there to understand the general breakdown
of time spent in that function.
Context: https://github.com/neondatabase/neon/issues/4492
I observe sporadic reconnections with ~10k idle computes. It looks like a
separate issue (probably the walreceiver runtime gets blocked somewhere), but
in any case 2-3 seconds is too small.
## Problem
`pytest-timeout` and `pytest-rerunfailures` are incompatible (or rather
not fully compatible). Timeouts aren't set for reruns.
Ref https://github.com/pytest-dev/pytest-rerunfailures/issues/99
## Summary of changes
- Dynamically make timeouts `func_only` for tests that we're going to
retry. This applies timeouts to reruns as well.
## Problem
1. During the rollout we got a panic: "timeline that we were deleting
was concurrently removed from 'timelines' map". It was caused by the lock
guard not being propagated to the background part of the deletion.
The existing test didn't catch it because the failpoint used for
verification was placed before the background task was spawned.
2. When looking at the surrounding code, one more bug was detected: we
removed the timeline from the map before the deletion finished. This breaks
client retry logic, because the client sees a 404 before the actual deletion
is completed, which can lead to it stopping its retry poll too early.
## Summary of changes
1. Carry the lock guard over to the background deletion (see the sketch
after this list). Ensure the existing test case fails without the patch
applied (the second deletion becomes stuck without it, which eventually
leads to a test failure).
2. Move the delete_all call earlier so that removing the timeline from the
map is the last thing done during deletion.
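Here's a minimal sketch of the guard-propagation idea from item 1, with
placeholder types (the real code guards the timelines map and per-timeline
deletion state):

```rust
use std::sync::Arc;
use tokio::sync::{Mutex, OwnedMutexGuard};

/// Placeholder for whatever the deletion lock protects.
type DeletionState = ();

async fn delete_timeline(lock: Arc<Mutex<DeletionState>>) {
    // Acquire the owned guard in the request handler ...
    let guard: OwnedMutexGuard<DeletionState> = lock.lock_owned().await;
    // ... and move it into the background task instead of dropping it here,
    // so nothing can concurrently remove the timeline while deletion runs.
    let _deletion_task = tokio::spawn(async move {
        // ... perform the actual deletion work ...
        drop(guard); // released only once the background work is done
    });
}
```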
Additionally, I've added timeline_id to the `update_gc_info` span,
because `debug_assert_current_span_has_tenant_and_timeline_id` in
`download_remote_layer` was firing when `update_gc_info` led to
on-demand downloads via `find_lsn_for_timestamp` (caught by @problame).
This is not directly related to the PR but fixes possible flakiness.
Another smaller set of changes involves the deletion wrapper used in Python
tests. There is now a simpler wrapper, `timeline_delete_wait_completed`,
that waits for deletions to complete. Most of the test_delete_timeline.py
tests are negative tests, i.e., "does ps_http.timeline_delete() fail in this
and that scenario"; these can be left alone. In the other places, where we
actually perform the deletions, we need to use the helper that polls for
completion.
Discussion:
https://neondb.slack.com/archives/C03F5SM1N02/p1686668007396639
Resolves #4496
---------
Co-authored-by: Christian Schwarz <christian@neon.tech>