rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-15 17:32:56 +00:00

Author	SHA1	Message	Date
Vadim Kharitonov	a6fe5ea1ac	Merge pull request #4571 from neondatabase/releases/2023-06-27 Release 2023-06-27 release-3465	2023-06-27 12:55:33 +02:00
Vadim Kharitonov	05b0aed0c1	Merge branch 'release' into releases/2023-06-27	2023-06-27 12:22:12 +02:00
Vadim Kharitonov	148f0f9b21	Compile `pg_roaringbitmap` extension (#4564 ) ## Problem ``` postgres=# create extension roaringbitmap; CREATE EXTENSION postgres=# select roaringbitmap('{1,100,10}'); roaringbitmap ------------------------------------------------ \x3a30000001000000000002001000000001000a006400 (1 row) ```	2023-06-27 10:55:03 +01:00
Shany Pozin	a7f3f5f356	Revert "run `Layer::get_value_reconstruct_data` in `spawn_blocking`#4498" (#4569 ) This reverts commit `1faf69a698`.	2023-06-27 10:57:28 +03:00
Felix Prasanna	00d1cfa503	bump VM_BUILDER_VERSION to 0.11.0 (#4566 ) Routine bump of autoscaling version `0.8.0` -> `0.11.0`	2023-06-26 14:10:27 -04:00
Christian Schwarz	1faf69a698	run `Layer::get_value_reconstruct_data` in `spawn_blocking` (#4498 ) This PR concludes the "async `Layer::get_value_reconstruct_data`" project. The problem we're solving is that, before this patch, we'd execute `Layer::get_value_reconstruct_data` on the tokio executor threads. This function is IO- and/or CPU-intensive. The IO is using VirtualFile / std::fs; hence it's blocking. This results in unfairness towards other tokio tasks, especially under (disk) load. Some context can be found at https://github.com/neondatabase/neon/issues/4154 where I suspect (but can't prove) load spikes of logical size calculation to cause heavy eviction skew. Sadly we don't have tokio runtime/scheduler metrics to quantify the unfairness. But generally, we know blocking the executor threads on std::fs IO is bad. So, let's have this change and watch out for severe perf regressions in staging & during rollout. ## Changes * rename `Layer::get_value_reconstruct_data` to `Layer::get_value_reconstruct_data_blocking` * add a new blanket impl'd `Layer::get_value_reconstruct_data` `async_trait` method that runs `get_value_reconstruct_data_blocking` inside `spawn_blocking`. * The `spawn_blocking` requires `'static` lifetime of the captured variables; hence I had to change the data flow to _move_ the `ValueReconstructState` into and back out of get_value_reconstruct_data instead of passing a reference. It's a small struct, so I don't expect a big performance penalty. 
## Performance Fundamentally, the code changes have the following performance-relevant effects: * Latency & allocations: each `get_value_reconstruct_data` call now makes a short-lived allocation because `async_trait` is just sugar for boxed futures under the hood * Latency: `spawn_blocking` adds some latency because it needs to move the work to a thread pool * using `spawn_blocking` plus the existing synchronous code inside is probably more efficient than switching all the synchronous code to tokio::fs because _each_ tokio::fs call does `spawn_blocking` under the hood. * Throughput: the `spawn_blocking` thread pool is much larger than the async executor thread pool. Hence, as long as the disks can keep up, which they should according to AWS specs, we will be able to deliver higher `get_value_reconstruct_data` throughput. * Disk IOPS utilization: we will see higher disk utilization if we get more throughput. Not a problem because the disks in prod are currently under-utilized, according to node_exporter metrics & the AWS specs. * CPU utilization: at higher throughput, CPU utilization will be higher. Slightly higher latency under regular load is acceptable given the throughput gains and expected better fairness during disk load peaks, such as logical size calculation peaks uncovered in #4154. ## Full Stack Of Preliminary PRs This PR builds on top of the following preliminary PRs 1. Clean-ups * https://github.com/neondatabase/neon/pull/4316 * https://github.com/neondatabase/neon/pull/4317 * https://github.com/neondatabase/neon/pull/4318 * https://github.com/neondatabase/neon/pull/4319 * https://github.com/neondatabase/neon/pull/4321 * Note: these were mostly to find an alternative to #4291, which I thought we'd need in my original plan where we would need to convert `Tenant::timelines` into an async locking primitive (#4333). In reviews, we walked away from that, but these cleanups were still quite useful. 2. https://github.com/neondatabase/neon/pull/4364 3. 
https://github.com/neondatabase/neon/pull/4472 4. https://github.com/neondatabase/neon/pull/4476 5. https://github.com/neondatabase/neon/pull/4477 6. https://github.com/neondatabase/neon/pull/4485 7. https://github.com/neondatabase/neon/pull/4441	2023-06-26 11:43:11 +02:00
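The move-in/move-out data flow described in the commit above can be illustrated with a minimal sketch. It is not the pageserver code: `std::thread::spawn` stands in for tokio's `spawn_blocking` (both require a `'static` closure), and `ReconstructState`, `reconstruct_blocking`, and `reconstruct_off_thread` are hypothetical names invented for this illustration.

```rust
use std::thread;

/// Hypothetical stand-in for the pageserver's `ValueReconstructState`.
#[derive(Debug, Default)]
pub struct ReconstructState {
    pub records: Vec<u64>,
}

/// Synchronous work that mutates the state in place. Plays the role of
/// `get_value_reconstruct_data_blocking` in this sketch.
fn reconstruct_blocking(key: u64, state: &mut ReconstructState) -> Result<(), String> {
    if key == 0 {
        return Err("key 0 is not valid in this sketch".to_string());
    }
    state.records.push(key);
    Ok(())
}

/// Runs the blocking work on another thread. The closure must be `'static`,
/// so it cannot borrow `state` from the caller. Instead, `state` is moved
/// into the closure and handed back together with the result.
pub fn reconstruct_off_thread(
    key: u64,
    mut state: ReconstructState,
) -> (ReconstructState, Result<(), String>) {
    let handle = thread::spawn(move || {
        let res = reconstruct_blocking(key, &mut state);
        (state, res)
    });
    handle.join().expect("worker thread panicked")
}
```

The caller gets the state back even when the work fails, so it can keep using whatever was accumulated before the error.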
Christian Schwarz	44a441080d	bring back spawn_blocking for `compact_level0_phase1` (#4537 ) The stats for `compact_level0_phase1` that I added in #4527 show the following breakdown (24h data from prod, only looking at compactions with > 1 L1 produced): * 10%ish of wall-clock time spent between the two read locks * I learned that the `DeltaLayer::iter()` and `DeltaLayer::key_iter()` calls actually do IO, even before we call `.next()`. I suspect that is why they take so much time between the locks. * 80+% of wall-clock time spent writing layer files * Lock acquisition time is irrelevant (low double-digit microseconds at most) * The generation of the holes holds the read lock for a relatively long time and it's proportional to the amount of keys / IO required to iterate over them (max: 110ms in prod; staging (nightly benchmarks): multiple seconds). Find below screenshots from my ad-hoc spreadsheet + some graphs. <img width="1182" alt="image" src="https://github.com/neondatabase/neon/assets/956573/81398b3f-6fa1-40dd-9887-46a4715d9194"> <img width="901" alt="image" src="https://github.com/neondatabase/neon/assets/956573/e4ac0393-f2c1-4187-a5e5-39a8b0c394c9"> <img width="210" alt="image" src="https://github.com/neondatabase/neon/assets/956573/7977ade7-6aa5-4773-a0a2-f9729aecee0d"> ## Changes In This PR This PR makes the following changes: * rearrange the `compact_level0_phase1` code such that we build the `all_keys_iter` and `all_values_iter` later than before * only grab the `Timeline::layers` lock once, and hold it until we've computed the holes * run compact_level0_phase1 in spawn_blocking, pre-grabbing the `Timeline::layers` lock in the async code and passing it in as an `OwnedRwLockReadGuard`. * the code inside spawn_blocking drops this guard after computing the holes * the `OwnedRwLockReadGuard` requires the `Timeline::layers` to be wrapped in an `Arc`. I think that's OK: the locking for the RwLock is more heavy-weight than an additional pointer indirection. 
## Alternatives Considered The naive alternative is to throw the entire function into `spawn_blocking`, and use `blocking_read` for `Timeline::layers` access. What I've done in this PR is better because, with this alternative, 1. while we `blocking_read()`, we'd waste one slot in the spawn_blocking pool 2. there's deadlock risk because the spawn_blocking pool is a finite resource ![image](https://github.com/neondatabase/neon/assets/956573/46c419f1-6695-467e-b315-9d1fc0949058) ## Metadata Fixes https://github.com/neondatabase/neon/issues/4492	2023-06-26 11:42:17 +02:00
Sasha Krassovsky	c215389f1c	quote_ident identifiers when creating neon_superuser (#4562 ) ## Problem	2023-06-24 10:34:15 +03:00
Sasha Krassovsky	b1477b4448	Create neon_superuser role, grant it to roles created from control plane (#4425 ) ## Problem Currently, if a user creates a role, it won't by default have any grants applied to it. If the compute restarts, the grants get applied. This gives a very strange UX of being able to drop roles/not have any access to anything at first, and then once something triggers a config application, suddenly grants are applied. This removes these grants.	2023-06-24 01:38:27 +03:00
Christian Schwarz	a500bb06fb	use preinitialize_metrics to initialize page cache metrics (#4557 ) This is follow-up to ``` commit `2252c5c282` Author: Alex Chi Z <iskyzh@gmail.com> Date: Wed Jun 14 17:12:34 2023 -0400 metrics: convert some metrics to pageserver-level (#4490) ```	2023-06-23 16:40:50 -04:00
Christian Schwarz	15456625c2	don't use MGMT_REQUEST_RUNTIME for consumption metrics synthetic size worker (#4560 ) The consumption metrics synthetic size worker does logical size calculation. Logical size calculation currently does synchronous disk IO. This blocks the MGMT_REQUEST_RUNTIME's executor threads, starving other futures. While there's work on the way to move the synchronous disk IO into spawn_blocking, the quickfix here is to use the BACKGROUND_RUNTIME instead of MGMT_REQUEST_RUNTIME. Actually it's not just a quickfix. We simply shouldn't be blocking MGMT_REQUEST_RUNTIME executor threads on CPU or sync disk IO. That work isn't done yet, as many of the mgmt tasks still _do_ disk IO. But it's not as intensive as the logical size calculations that we're fixing here. While we're at it, fix disk-usage-based eviction in a similar way. It wasn't the culprit here, according to prod logs, but it can theoretically be a little CPU-intensive. More context, including graphs from Prod: https://neondb.slack.com/archives/C03F5SM1N02/p1687541681336949	2023-06-23 15:40:36 -04:00
Alex Chi Z	cd1705357d	Merge pull request #4561 from neondatabase/releases/2023-06-23-hotfix Release 2023-06-23 (pageserver-only) release-3441	2023-06-23 15:38:50 -04:00
Christian Schwarz	6bc7561290	don't use MGMT_REQUEST_RUNTIME for consumption metrics synthetic size worker The consumption metrics synthetic size worker does logical size calculation. Logical size calculation currently does synchronous disk IO. This blocks the MGMT_REQUEST_RUNTIME's executor threads, starving other futures. While there's work on the way to move the synchronous disk IO into spawn_blocking, the quickfix here is to use the BACKGROUND_RUNTIME instead of MGMT_REQUEST_RUNTIME. Actually it's not just a quickfix. We simply shouldn't be blocking MGMT_REQUEST_RUNTIME executor threads on CPU or sync disk IO. That work isn't done yet, as many of the mgmt tasks still _do_ disk IO. But it's not as intensive as the logical size calculations that we're fixing here. While we're at it, fix disk-usage-based eviction in a similar way. It wasn't the culprit here, according to prod logs, but it can theoretically be a little CPU-intensive. More context, including graphs from Prod: https://neondb.slack.com/archives/C03F5SM1N02/p1687541681336949 (cherry picked from commit `d6e35222ea`)	2023-06-23 20:54:07 +02:00
Vadim Kharitonov	a3f0dd2d30	Compile `pg_uuidv7` (#4558 ) Doc says that it should be added into `shared_preload_libraries`, but, practically, it's not required. ``` postgres=# create extension pg_uuidv7; CREATE EXTENSION postgres=# SELECT uuid_generate_v7(); uuid_generate_v7 -------------------------------------- 0188e823-3f8f-796c-a92c-833b0b2d1746 (1 row) ```	2023-06-23 15:56:49 +01:00
Christian Schwarz	76718472be	add pageserver-global histogram for basebackup latency (#4559 ) The histogram distinguishes by ok/err. I took the liberty to create a small abstraction for such use cases. It helps keep the label values inside `metrics.rs`, right next to the place where the metric and its labels are declared.	2023-06-23 16:42:12 +02:00
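One way to read the "small abstraction" mentioned in the commit above is an enum that owns the label values, so call sites never spell out the `ok`/`err` strings. The sketch below is an assumption about the general shape, not the code from `metrics.rs`: `Outcome` and `OutcomeHistogram` are hypothetical names, and a plain `HashMap` of observations stands in for a real Prometheus histogram.

```rust
use std::collections::HashMap;

/// Label values for a hypothetical `result` label, declared in one place.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum Outcome {
    Ok,
    Err,
}

impl Outcome {
    /// The string that would be used as the metric label value.
    pub fn as_str(self) -> &'static str {
        match self {
            Outcome::Ok => "ok",
            Outcome::Err => "err",
        }
    }

    /// Classifies any `Result` without looking at its payload.
    pub fn of<T, E>(res: &Result<T, E>) -> Self {
        if res.is_ok() {
            Outcome::Ok
        } else {
            Outcome::Err
        }
    }
}

/// Toy stand-in for a labelled histogram: raw observations per outcome.
#[derive(Default)]
pub struct OutcomeHistogram {
    observations: HashMap<Outcome, Vec<f64>>,
}

impl OutcomeHistogram {
    /// Records a latency (in seconds) under the label derived from `res`.
    pub fn observe<T, E>(&mut self, res: &Result<T, E>, seconds: f64) {
        self.observations
            .entry(Outcome::of(res))
            .or_default()
            .push(seconds);
    }

    /// Number of observations recorded for the given outcome.
    pub fn count(&self, outcome: Outcome) -> usize {
        self.observations.get(&outcome).map_or(0, Vec::len)
    }
}
```

A call site would then pass the `Result` of the basebackup and the elapsed time to `observe`, leaving the label strings private to the metrics module.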
Alexander Bayandin	c07b6ffbdc	Fix git tag name for release (#4545 ) ## Problem A git tag for a release has an extra `release-` prefix (it looks like `release-release-3439`). ## Summary of changes - Do not add the `release-` prefix when creating the git tag	2023-06-23 12:52:17 +01:00
Alexander Bayandin	6c3605fc24	Nightly Benchmarks: Increase timeout for pgbench-compare job (#4551 ) ## Problem In the test environment, vacuum duration fluctuates from ~1h to ~5h; together with the two other 1h benchmarks (`select-only` and `simple-update`), the job can take up to 7h, which is longer than the 6h timeout. ## Summary of changes - Increase timeout for pgbench-compare job to 8h - Remove 6h timeouts from Nightly Benchmarks (this is a default value)	2023-06-23 12:47:37 +01:00
Vadim Kharitonov	d96d51a3b7	Update rust to 1.70.0 (#4550 )	2023-06-23 13:09:04 +02:00
Alex Chi Z	a010b2108a	pgserver: better template config file (#4554 ) * `compaction_threshold` should be an integer, not a string. * uncomment `[section]` so that if a user needs to modify the config, they can simply uncomment the corresponding line. Otherwise it's easy for us to forget uncommenting the `[section]` when uncommenting the config item we want to configure. Signed-off-by: Alex Chi <iskyzh@gmail.com>	2023-06-23 10:18:06 +03:00
Anastasia Lubennikova	2f618f46be	Use BUILD_TAG in compute_ctl binary. (#4541 ) Pass BUILD_TAG to compute_ctl binary. We need it to access versioned extension storage.	2023-06-22 17:06:16 +03:00
Alexander Bayandin	d3aa8a48ea	Update client libs for test_runner/pg_clients to their latest versions (#4547 ) Resolves https://github.com/neondatabase/neon/security/dependabot/27	2023-06-21 16:20:35 +01:00
Christian Schwarz	e4da76f021	update_gc_info: fix typo in timeline_id tracing field (#4546 ) Commit ``` commit `472cc17b7a` Author: Dmitry Rodionov <dmitry@neon.tech> Date: Thu Jun 15 17:30:12 2023 +0300 propagate lock guard to background deletion task (#4495) ``` did a drive-by fix, but, the drive-by had a typo. ``` gc_loop{tenant_id=2e2f2bff091b258ac22a4c4dd39bd25d}:update_gc_info{timline_id=837c688fd37c903639b9aa0a6dd3f1f1}:download_remote_layer{layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000024DA0D1-000000000443FB51}:panic{thread=background op worker location=pageserver/src/tenant/timeline.rs:4843:25}: missing extractors: ["TimelineId"] Stack backtrace: 0: utils::logging::tracing_panic_hook at /libs/utils/src/logging.rs:166:21 1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/alloc/src/boxed.rs:2002:9 2: std::panicking::rust_panic_with_hook at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:692:13 3: std::panicking::begin_panic_handler::{{closure}} at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:579:13 4: std::sys_common::backtrace::__rust_end_short_backtrace at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/sys_common/backtrace.rs:137:18 5: rust_begin_unwind at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5 6: core::panicking::panic_fmt at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14 7: pageserver::tenant::timeline::debug_assert_current_span_has_tenant_and_timeline_id at /pageserver/src/tenant/timeline.rs:4843:25 8: <pageserver::tenant::timeline::Timeline>::download_remote_layer::{closure#0}::{closure#0} at /pageserver/src/tenant/timeline.rs:4368:9 9: <tracing::instrument::Instrumented<<pageserver::tenant::timeline::Timeline>::download_remote_layer::{closure#0}::{closure#0}> as 
core::future::future::Future>::poll at /.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-0.1.37/src/instrument.rs:272:9 10: <pageserver::tenant::timeline::Timeline>::download_remote_layer::{closure#0} at /pageserver/src/tenant/timeline.rs:4363:5 11: <pageserver::tenant::timeline::Timeline>::get_reconstruct_data::{closure#0} at /pageserver/src/tenant/timeline.rs:2618:69 12: <pageserver::tenant::timeline::Timeline>::get::{closure#0} at /pageserver/src/tenant/timeline.rs:565:13 13: <pageserver::tenant::timeline::Timeline>::list_slru_segments::{closure#0} at /pageserver/src/pgdatadir_mapping.rs:427:42 14: <pageserver::tenant::timeline::Timeline>::is_latest_commit_timestamp_ge_than::{closure#0} at /pageserver/src/pgdatadir_mapping.rs:390:13 15: <pageserver::tenant::timeline::Timeline>::find_lsn_for_timestamp::{closure#0} at /pageserver/src/pgdatadir_mapping.rs:338:17 16: <pageserver::tenant::timeline::Timeline>::update_gc_info::{closure#0}::{closure#0} at /pageserver/src/tenant/timeline.rs:3967:71 17: <tracing::instrument::Instrumented<<pageserver::tenant::timeline::Timeline>::update_gc_info::{closure#0}::{closure#0}> as core::future::future::Future>::poll at /.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-0.1.37/src/instrument.rs:272:9 18: <pageserver::tenant::timeline::Timeline>::update_gc_info::{closure#0} at /pageserver/src/tenant/timeline.rs:3948:5 19: <pageserver::tenant::Tenant>::refresh_gc_info_internal::{closure#0} at /pageserver/src/tenant.rs:2687:21 20: <pageserver::tenant::Tenant>::gc_iteration_internal::{closure#0} at /pageserver/src/tenant.rs:2551:13 21: <pageserver::tenant::Tenant>::gc_iteration::{closure#0} at /pageserver/src/tenant.rs:1490:13 22: pageserver::tenant::tasks::gc_loop::{closure#0}::{closure#0} at /pageserver/src/tenant/tasks.rs:187:21 23: pageserver::tenant::tasks::gc_loop::{closure#0} at /pageserver/src/tenant/tasks.rs:208:5 ```	2023-06-21 18:00:14 +03:00
Christian Schwarz	fbd3ac14b5	Merge pull request #4544 from neondatabase/releases/2023-06-21-hotfix Release 2023-06-21 (fixup for post-merge failed 2023-06-20) release-3439	2023-06-21 16:54:34 +03:00
Christian Schwarz	e437787c8f	cargo update -p openssl (#4542 ) To unblock release https://github.com/neondatabase/neon/pull/4536#issuecomment-1600678054 Context: https://rustsec.org/advisories/RUSTSEC-2023-0044	2023-06-21 15:52:56 +03:00
Christian Schwarz	870740c949	cargo update -p openssl (#4542 ) To unblock release https://github.com/neondatabase/neon/pull/4536#issuecomment-1600678054 Context: https://rustsec.org/advisories/RUSTSEC-2023-0044	2023-06-21 15:50:52 +03:00
Dmitry Rodionov	75d583c04a	Tenant::load: fix uninit timeline marker processing (#4458 ) ## Problem During timeline creation we create a special mark file whose presence indicates that initialization didn't complete successfully. In case of a crash restart we can remove such a half-initialized timeline, and the following retry from the control plane side should perform another attempt. So in case of a possible crash restart during initial loading we have the following picture: ``` timelines | - <timeline_id>___uninit | - <timeline_id> | - | <timeline files> ``` We call `std::fs::read_dir` to walk files in the `timelines` directory one by one. If we see an uninit file, we proceed with deletion of both the timeline directory and the uninit file. If we see a timeline, we check if an uninit file exists and do the same cleanup. But in fact it's possible for both branches to be true at the same time. The result of readdir doesn't reflect subsequent directory state modifications. So you can still get a "valid" entry on the next iteration of the loop despite the fact that it was deleted in one of the previous iterations of the loop. 
To see that, you can apply the following patch (it disables uninit mark cleanup on successful timeline creation): ```diff diff --git a/pageserver/src/tenant.rs b/pageserver/src/tenant.rs index 4beb2664..b3cdad8f 100644 --- a/pageserver/src/tenant.rs +++ b/pageserver/src/tenant.rs @@ -224,11 +224,6 @@ impl UninitializedTimeline<'_> { ) })?; } - uninit_mark.remove_uninit_mark().with_context(|| { - format!( - "Failed to remove uninit mark file for timeline {tenant_id}/{timeline_id}" - ) - })?; v.insert(Arc::clone(&new_timeline)); new_timeline.maybe_spawn_flush_loop(); ``` And perform the following steps: ```bash neon_local init neon_local start neon_local tenant create neon_local stop neon_local start ``` The error is: ```log INFO load{tenant_id=X}:blocking: Found an uninit mark file .neon/tenants/X/timelines/Y.___uninit, removing the timeline and its uninit mark 2023-06-09T18:43:41.664247Z ERROR load{tenant_id=X}: load failed, setting tenant state to Broken: failed to load metadata Caused by: 0: Failed to read metadata bytes from path .neon/tenants/X/timelines/Y/metadata 1: No such file or directory (os error 2) ``` So the uninit mark got deleted together with the timeline directory, but we still got a directory entry for it and tried to load it. The bug prevented the tenant from being successfully loaded. ## Summary of changes Ideally I think we shouldn't place uninit marks in the same directory as timeline directories but move them to a separate directory and gather them as an input to the actual listing, but that would be sort of an on-disk format change, so just check whether entries are still valid before operating on them.	2023-06-21 14:25:58 +03:00
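The fix summarized above ("check whether entries are still valid before operating on them") can be sketched as follows. This is an illustration under assumptions, not the pageserver implementation: the function name `load_timelines` is hypothetical, and the `.___uninit` suffix is taken from the log line in the commit message.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Mark-file suffix, as it appears in the log line quoted in the commit.
const UNINIT_SUFFIX: &str = ".___uninit";

/// Walks `timelines_dir`, removes half-initialized timelines together with
/// their mark files, and returns the timeline directories that remain.
pub fn load_timelines(timelines_dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut loaded = Vec::new();
    for entry in fs::read_dir(timelines_dir)? {
        let path = entry?.path();
        // The listing can be stale: an earlier iteration may already have
        // deleted this path, so re-check before acting on it.
        if !path.exists() {
            continue;
        }
        let name = path
            .file_name()
            .and_then(|n| n.to_str())
            .unwrap_or("")
            .to_string();
        if let Some(timeline_name) = name.strip_suffix(UNINIT_SUFFIX) {
            // Mark file: remove the timeline directory (if any), then the mark.
            let timeline_dir = timelines_dir.join(timeline_name);
            if timeline_dir.exists() {
                fs::remove_dir_all(&timeline_dir)?;
            }
            fs::remove_file(&path)?;
        } else {
            // Timeline directory: clean it up if its mark file still exists.
            let mark = timelines_dir.join(format!("{}{}", name, UNINIT_SUFFIX));
            if mark.exists() {
                fs::remove_dir_all(&path)?;
                fs::remove_file(&mark)?;
            } else {
                loaded.push(path);
            }
        }
    }
    Ok(loaded)
}
```

Whichever of the two entries the listing yields first, the second one is skipped by the `exists()` check instead of being loaded or cleaned up twice.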
Christian Schwarz	3460dbf90b	Merge pull request #4536 from neondatabase/releases/2023-06-20 Release 2023-06-20 (actually 2023-06-21)	2023-06-21 14:19:14 +03:00
Alek Westover	b4c5beff9f	`list_files` function in `remote_storage` (#4522 )	2023-06-20 15:36:28 -04:00
bojanserafimov	90e1f629e8	Add test for `skip_pg_catalog_updates` (#4530 )	2023-06-20 11:38:59 -04:00
Alek Westover	2023e22ed3	Add `RelationError` error type to pageserver rather than string parsing error messages (#4508 )	2023-06-19 13:14:20 -04:00
Christian Schwarz	036fda392f	log timings for compact_level0_phase1 (#4527 ) The data will help decide whether it's ok to keep holding Timeline::layers in shared mode until after we've calculated the holes. Other timings are to understand the general breakdown of timings in that function. Context: https://github.com/neondatabase/neon/issues/4492	2023-06-19 17:25:57 +03:00
Arseny Sher	557abc18f3	Fix test_s3_wal_replay assertion flakiness. Supposedly fixes https://github.com/neondatabase/neon/issues/4277	2023-06-19 16:08:20 +04:00
Arseny Sher	3b06a5bc54	Raise pageserver walreceiver timeouts. I observe sporadic reconnections with ~10k idle computes. It looks like a separate issue, probably walreceiver runtime gets blocked somewhere, but in any case 2-3 seconds is too small.	2023-06-19 15:59:38 +04:00
Alexander Bayandin	1b947fc8af	test_runner: workaround rerunfailures and timeout incompatibility (#4469 ) ## Problem `pytest-timeout` and `pytest-rerunfailures` are incompatible (or rather not fully compatible). Timeouts aren't set for reruns. Ref https://github.com/pytest-dev/pytest-rerunfailures/issues/99 ## Summary of changes - Dynamically make timeouts `func_only` for tests that we're going to retry. It applies timeouts for reruns as well.	2023-06-16 18:08:11 +01:00
Christian Schwarz	78082d0b9f	create_delta_layer: avoid needless `stat` (#4489 ) We already do it inside `frozen_layer.write_to_disk()`. Context: https://github.com/neondatabase/neon/pull/4441#discussion_r1228083959	2023-06-16 16:54:41 +02:00
Alexander Bayandin	190c3ba610	Add tags for releases (#4524 ) ## Problem It's not a trivial task to find corresponding changes for a particular release (for example, for 3371 — 🤷) Ref: https://neondb.slack.com/archives/C04BLQ4LW7K/p1686761537607649?thread_ts=1686736854.174559&cid=C04BLQ4LW7K ## Summary of changes - Tag releases - Add a manual trigger for the release workflow	2023-06-16 14:17:37 +01:00
Christian Schwarz	14d495ae14	create_delta_layer: improve misleading TODO comment (#4488 ) Context: https://github.com/neondatabase/neon/pull/4441#discussion_r1228086608	2023-06-16 14:23:55 +03:00
Vadim Kharitonov	6b89d99677	Merge pull request #4521 from neondatabase/release_2023-06-15 Release 2023 06 15	2023-06-15 17:40:01 +02:00
Vadim Kharitonov	6cc8ea86e4	Merge branch 'main' into release_2023-06-15	2023-06-15 16:50:44 +02:00
Dmitry Rodionov	472cc17b7a	propagate lock guard to background deletion task (#4495 ) ## Problem 1. During the rollout we got a panic: "timeline that we were deleting was concurrently removed from 'timelines' map" that was caused by the lock guard not being propagated to the background part of the deletion. The existing test didn't catch it because the failpoint that was used for verification was placed earlier, prior to background task spawning. 2. When looking at the surrounding code, one more bug was detected. We removed the timeline from the map before deletion finished, which breaks client retry logic, because it will return 404 before the actual deletion is completed, which can lead to the client stopping its retry poll early. ## Summary of changes 1. Carry the lock guard over to background deletion. Ensure the existing test case fails without the patch applied (second deletion becomes stuck without it, which eventually leads to a test failure). 2. Move the delete_all call earlier so that removing the timeline from the map is the last thing done during deletion. Additionally I've added timeline_id to the `update_gc_info` span, because `debug_assert_current_span_has_tenant_and_timeline_id` in `download_remote_layer` was firing when `update_gc_info` led to on-demand downloads via `find_lsn_for_timestamp` (caught by @problame). This is not directly related to the PR but fixes possible flakiness. Another smaller set of changes involves the deletion wrapper used in python tests. Now there is a simpler wrapper that waits for deletions to complete: `timeline_delete_wait_completed`. Most of the test_delete_timeline.py tests are negative tests, i.e., "does ps_http.timeline_delete() fail in this and that scenario". These can be left alone. In other places where we actually do the deletions, we need to use the helper that polls for completion. 
Discussion https://neondb.slack.com/archives/C03F5SM1N02/p1686668007396639 resolves #4496 --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-06-15 17:30:12 +03:00
Arthur Petukhovsky	76413a0fb8	Revert reconnect_timeout to improve performance (#4512 ) The default value for `wal_acceptor_reconnect_timeout` was changed in https://github.com/neondatabase/neon/pull/4428 and it degraded performance by up to 20% in some cases. Revert to the previous value.	2023-06-15 15:26:59 +03:00
Alexander Bayandin	e60b70b475	Fix data ingestion scripts (#4515 ) ## Problem When I switched `psycopg2.connect` from context manager to a regular function call in https://github.com/neondatabase/neon/pull/4382 I embarrassingly forgot about commit, so it doesn't really put data into DB 😞 ## Summary of changes - Enable autocommit for data ingestion scripts	2023-06-15 15:01:06 +03:00
Alex Chi Z	2252c5c282	metrics: convert some metrics to pageserver-level (#4490 ) ## Problem Some metrics are better to be observed at page-server level. Otherwise, as we have a lot of tenants in production, we cannot do a sum b/c Prometheus has limit on how many time series we can aggregate. This also helps reduce metrics scraping size. ## Summary of changes Some integration tests are likely not to pass as it will check the existence of some metrics. Waiting for CI complete and fix them. Metrics downgraded: page cache hit (where we are likely to have a page-server level page cache in the future instead of per-tenant), and reconstruct time (this would better be tenant-level, as we have one pg replayer for each tenant, but now we make it page-server level as we do not need that fine-grained data). --------- Signed-off-by: Alex Chi <iskyzh@gmail.com>	2023-06-14 17:12:34 -04:00
Alexander Bayandin	94f315d490	Remove neon-image-depot job (#4506 ) ## Problem `neon-image-depot` is an experimental job we use to compare with the main `neon-image` job. But it's not stable and right now we don't have the capacity to properly fix and evaluate it. We can come back to this later. ## Summary of changes Remove `neon-image-depot` job	2023-06-14 19:03:09 +01:00
Christian Schwarz	cd3faa8c0c	test_basic_eviction: avoid some sources of flakiness (#4504 ) We've seen the download_layer() call return 304 in prod because of a spurious on-demand download caused by a GetPage request from compute. Avoid these and some other sources of on-demand downloads by shutting down compute, SKs, and by disabling background loops. CF https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4498/5258914461/index.html#suites/2599693fa27db8427603ba822bcf2a20/357808fd552fede3	2023-06-14 19:04:22 +02:00
Arthur Petukhovsky	a7a0c3cd27	Invalidate proxy cache in http-over-sql (#4500 ) HTTP queries failed with errors `error connecting to server: failed to lookup address information: Name or service not known\n\nCaused by:\n failed to lookup address information: Name or service not known` The fix reuses the proxy's cache invalidation logic from regular postgres connections for HTTP-over-SQL queries. It also removes the timeout for HTTP requests, because it almost never worked on staging (50s+ just to start the compute), and we can have a similar case in production. This should be OK, since we have limits for the requests and responses.	2023-06-14 19:24:46 +03:00
Dmitry Rodionov	ee9a5bae43	Filter only active timelines for compaction (#4487 ) Previously we may've included Stopping/Broken timelines here, which leads to errors in logs -> causes tests to sporadically fail resolves #4467	2023-06-14 19:07:42 +03:00
Alexander Bayandin	9484b96d7c	GitHub Autocomment: do not fail the job (#4478 ) ## Problem If the script fails to generate a test summary, the step also fails the job/workflow (even though this could be a non-fatal problem). ## Summary of changes - Split JSON parsing and summarisation into separate functions - Wrap the function calls in a try..catch block, add an error message to the GitHub comment and do not fail the step - Make `scripts/comment-test-report.js` a CLI script that can be run locally (mock GitHub calls) to make it easier to debug issues locally	2023-06-14 15:07:30 +01:00
Shany Pozin	ebee8247b5	Move s3 delete_objects to use chunks of 1000 OIDs (#4463 ) See https://github.com/neondatabase/neon/pull/4461#pullrequestreview-1474240712	2023-06-14 15:38:01 +03:00
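The batching idea in the commit above can be sketched with `slice::chunks`. This is a minimal illustration, assuming the S3 limit of 1000 keys per `DeleteObjects` request; `delete_in_chunks` and the `delete_batch` callback are hypothetical names, with the callback standing in for the real S3 client call.

```rust
/// Maximum number of keys accepted by one S3 `DeleteObjects` request.
pub const MAX_KEYS_PER_DELETE: usize = 1000;

/// Splits `keys` into batches of at most `MAX_KEYS_PER_DELETE` and passes
/// each batch to `delete_batch`. Stops at the first error and returns the
/// number of requests that were issued successfully otherwise.
pub fn delete_in_chunks<F, E>(keys: &[String], mut delete_batch: F) -> Result<usize, E>
where
    F: FnMut(&[String]) -> Result<(), E>,
{
    let mut requests = 0;
    for batch in keys.chunks(MAX_KEYS_PER_DELETE) {
        delete_batch(batch)?;
        requests += 1;
    }
    Ok(requests)
}
```

With 2500 keys, the callback would be invoked three times, with batches of 1000, 1000, and 500 keys.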
bojanserafimov	3164ad7052	compute_ctl: Spec parser forward compatibility test (#4494 )	2023-06-13 21:48:09 -04:00

3465 Commits