rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-16 12:40:36 +00:00

Author	SHA1	Message	Date
Christian Schwarz	f91ad65fb3	Merge branch 'problame/async-timeline-get/dont-hold-timelines-lock-inside-tenant-state-send-modify' into problame/async-timeline-get/refactor-timeline-initialization-to-avoid-holding-tenants-timelines-lock	2023-05-26 18:24:23 +02:00
Christian Schwarz	9a4789ec73	demote warn line to info-level, as the log line in set_stopping() is also info!() This should fix the failed regress tests that barked on allowed_errors	2023-05-26 18:22:41 +02:00
Christian Schwarz	72159ee686	Merge remote-tracking branch 'origin/main' into problame/async-timeline-get/dont-hold-timelines-lock-inside-tenant-state-send-modify	2023-05-26 18:03:35 +02:00
Christian Schwarz	be74662d05	re-introduce the check for 0 layers, based on cause	2023-05-26 17:52:26 +02:00
Christian Schwarz	e7c4ef9f4f	don't hold TENANTS lock while waiting for set_stopping()	2023-05-26 17:46:09 +02:00
Christian Schwarz	13d3f4c29f	set_stopping(): report in result if not transitioning to Stopping	2023-05-26 17:46:09 +02:00
Christian Schwarz	67258af8a2	Revert "test_broken_timelines: regex needs changing due to changes in this PR" This reverts commit `17ba307004`.	2023-05-26 17:40:37 +02:00
Christian Schwarz	17ba307004	test_broken_timelines: regex needs changing due to changes in this PR The regex is different because tenant2 is not broken anymore with this PR, because we allow empty timeline dirs to load	2023-05-26 17:40:34 +02:00
Christian Schwarz	e1486444d6	Revert "test_broken_timelines: wait for tenants to load" This reverts commit `c6f9b8f318`.	2023-05-26 17:40:27 +02:00
Christian Schwarz	c6f9b8f318	test_broken_timelines: wait for tenants to load Without this, we rely on the basebackup request to wait for the tenant to load. It works, but, would be nice to rule it out, no?	2023-05-26 17:39:14 +02:00
Joonas Koivunen	be177f82dc	Revert "Allow for higher s3 concurrency (#4292 )" (#4356 ) This reverts commit `024109fbeb` because it failed to speed anything up and instead ran into more errors. See: #3698.	2023-05-26 18:37:17 +03:00
Christian Schwarz	ba3e3bdddf	clippy	2023-05-26 17:07:15 +02:00
Christian Schwarz	71f9bbef0d	fix the test timeline creation functions	2023-05-26 16:01:45 +02:00
Alexander Bayandin	339a3e3146	GitHub Autocomment: comment commits for branches (#4335 ) ## Problem GitHub Autocomment script posts a comment only for PRs. It's harder to debug failed tests on main or release branches. ## Summary of changes - Change the GitHub Autocomment script to be able to post a comment to either a PR or a commit of a branch	2023-05-26 14:49:42 +01:00
Christian Schwarz	4680f8c60b	finish WIP: keep the real timeline from create_empty_timeline outside of timelines map until it has finished filling	2023-05-26 15:29:19 +02:00
Heikki Linnakangas	a560b28829	Make new tenant/timeline IDs mandatory in create APIs. (#4304 ) We used to generate the ID, if the caller didn't specify it. That's bad practice, however, because network is never fully reliable, so it's possible we create a new tenant but the caller doesn't know about it, and because it doesn't know the tenant ID, it has no way of retrying or checking if it succeeded. To discourage that, make it mandatory. The web control plane has not relied on the auto-generation for a long time.	2023-05-26 16:19:36 +03:00
Christian Schwarz	3c1fc2617c	WIP	2023-05-26 14:24:23 +02:00
Joonas Koivunen	024109fbeb	Allow for higher s3 concurrency (#4292 ) We currently have a semaphore based rate limiter which we hope will keep us under S3 limits. However, the semaphore does not consider time, so I've been hesitant to raise the concurrency limit of 100. See #3698. The PR Introduces a leaky-bucket based rate limiter instead of the `tokio::sync::Semaphore` which will allow us to raise the limit later on. The configuration changes are not contained here.	2023-05-26 13:35:50 +03:00
Christian Schwarz	60cc197ce3	fix test_timeline_create_break_after_uninit_mark (the refactoring added .context())	2023-05-26 10:19:24 +02:00
Christian Schwarz	609a929968	instrument shutdown_all_tenants code path, include timeline_id in logs if failed to flush This can be extracted into an independent commit.	2023-05-26 10:12:33 +02:00
Christian Schwarz	f2abc4c933	independent fix: test_pageserver_metrics_removed_after_detach didn't wait for uploads This resulted in unexpectedly absent metrics `pageserver_remote_timeline_client_bytes_finished` tripping the assert quoted below. Not sure why this PR (#4350) exposed this problem though. Are we detaching faster? If so, why? AssertionError: assert {'pageserver_...s_count', ...} == {'pageserver_...s_count', ...} Extra items in the right set: 'pageserver_remote_timeline_client_bytes_started_total' 'pageserver_remote_timeline_client_bytes_finished_total' Full diff: { 'pageserver_created_persistent_files_total', 'pageserver_current_logical_size', 'pageserver_evictions_total', 'pageserver_evictions_with_low_residence_duration_total', 'pageserver_getpage_reconstruct_seconds_bucket', 'pageserver_getpage_reconstruct_seconds_count', 'pageserver_getpage_reconstruct_seconds_sum', 'pageserver_io_operations_bytes_total', 'pageserver_io_operations_seconds_bucket', 'pageserver_io_operations_seconds_count', 'pageserver_io_operations_seconds_sum', 'pageserver_last_record_lsn', 'pageserver_materialized_cache_hits_total', 'pageserver_remote_operation_seconds_bucket', 'pageserver_remote_operation_seconds_count', 'pageserver_remote_operation_seconds_sum', 'pageserver_remote_physical_size', - 'pageserver_remote_timeline_client_bytes_finished_total', - 'pageserver_remote_timeline_client_bytes_started_total', 'pageserver_remote_timeline_client_calls_started_bucket', 'pageserver_remote_timeline_client_calls_started_count', 'pageserver_remote_timeline_client_calls_started_sum', 'pageserver_remote_timeline_client_calls_unfinished', 'pageserver_resident_physical_size', 'pageserver_smgr_query_seconds_bucket', 'pageserver_smgr_query_seconds_count', 'pageserver_smgr_query_seconds_sum', 'pageserver_storage_operations_seconds_count_total', 'pageserver_storage_operations_seconds_sum_total', 'pageserver_tenant_states_count', 'pageserver_wait_lsn_seconds_bucket', 
'pageserver_wait_lsn_seconds_count', 'pageserver_wait_lsn_seconds_sum', 'pageserver_written_persistent_bytes_total', }	2023-05-26 09:54:30 +02:00
Christian Schwarz	b09beaa4fe	log while waiting for tenant to finish activation	2023-05-26 09:34:12 +02:00
Christian Schwarz	1367e2b0ee	improve TenantState doc comments, repeating what's in the Mermaid diagram	2023-05-26 09:31:44 +02:00
Christian Schwarz	122e23071b	fix the tests (commenting out too-conservative "Timeline has no ancestor and no layer files" assert)	2023-05-26 09:23:26 +02:00
Christian Schwarz	696c6ed6ff	fix cfg(test) code to the extent that clippy passes	2023-05-26 08:49:42 +02:00
Alexander Bayandin	2b25f0dfa0	Fix flakiness of test_metric_collection (#4346 ) ## Problem Test `test_metric_collection` became flaky: ``` AssertionError: assert not ['2023-05-25T14:03:41.644042Z ERROR metrics_collection: failed to send metrics: reqwest::Error { kind: Request, url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("localhost")), port: Some(18022), path: "/billing/api/v1/usage_events", query: None, fragment: None }, source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 99, kind: AddrNotAvailable, message: "Cannot assign requested address" })) }', ...] ``` I suspect it is caused by having 2 places where we define `httpserver_listen_address` fixture (which is internally used by `pytest-httpserver` plugin) ## Summary of changes - Remove the definition of `httpserver_listen_address` from `test_runner/regress/test_ddl_forwarding.py` and keep one in `test_runner/fixtures/neon_fixtures.py` - Also remove unused `httpserver_listen_address` parameter from `test_proxy_metric_collection`	2023-05-26 00:05:11 +03:00
Christian Schwarz	0874e27023	refactor timeline initialization High-level ideas: - placeholder Timeline object in timelines map during a timeline creation - the timeline creations (branch, bootstrap, import_from_basebackup) prepare durable state (on-disk & remote) state, if necessary using _another_ _temporary_ Timeline object - once the timeline creations have prepared the durable state, they use the normal load routine (load_local_timeline) that is also used during pageserver startup - Once the loading is done, we replace the placeholder timeline object with the real one	2023-05-25 23:01:40 +02:00
Christian Schwarz	6fe39ecbf7	add ability to have fake metrics (needed in next patch so we can have two Timeline objects with the same id in memory)	2023-05-25 23:01:40 +02:00
Christian Schwarz	a0c2a85505	timeline_init_and_sync: don't hold Tenant::timelines while load_layer_map This patch inlines `initialize_with_lock` and then reorganizes the code such that we can `load_layer_map` without holding the `Tenant::timelines` lock. As a nice aside, we can get rid of the dummy() uninit mark, which has always been a terrible hack.	2023-05-25 23:01:40 +02:00
Christian Schwarz	dd0f5c4ef3	Merge remote-tracking branch 'origin/main' into problame/async-timeline-get/dont-hold-timelines-lock-inside-tenant-state-send-modify	2023-05-25 22:20:52 +02:00
Christian Schwarz	057cceb559	refactor: make timeline activation infallible (#4319 ) Timeline::activate() was only fallible because `launch_wal_receiver` was. `launch_wal_receiver` was fallible only because of some preliminary checks in `WalReceiver::start`. Turns out these checks can be shifted to the type system by delaying creation of the `WalReceiver` struct to the point where we activate the timeline. The changes in this PR were enabled by my previous refactoring that funneled the broker_client from pageserver startup to the activate() call sites. Patch series: - #4316 - #4317 - #4318 - #4319	2023-05-25 20:26:43 +02:00
sharnoff	ae805b985d	Bump vm-builder v0.7.3-alpha3 -> v0.8.0 (#4339 ) Routine `vm-builder` version bump, from autoscaling repo release. You can find the release notes here: https://github.com/neondatabase/autoscaling/releases/tag/v0.8.0 The changes are from v0.7.2; most of them were already included in v0.7.3-alpha3. Of particular note: This (finally) fixes the cgroup issues, so we should now be able to scale up when we're about to run out of memory. NB: This has the effect of limiting the DB's memory usage in a way it wasn't limited before. We may run into issues because of that. There is currently no way to disable that behavior, other than switching the endpoint back to the k8s-pod provisioner.	2023-05-25 09:33:18 -07:00
Joonas Koivunen	85e76090ea	test: fix ancestor is stopping flakiness (#4234 ) Flakiness most likely introduced in #4170, detected in https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4232/4980691289/index.html#suites/542b1248464b42cc5a4560f408115965/18e623585e47af33. Opted to allow it globally because it can happen in other tests as well, basically whenever compaction is enabled and we stop pageserver gracefully.	2023-05-25 16:22:58 +00:00
Alexander Bayandin	08e7d2407b	Storage: use Postgres 15 as default (#2809 )	2023-05-25 15:55:46 +01:00
Alex Chi Z	ab2757f64a	bump dependencies version (#4336 ) proceeding https://github.com/neondatabase/neon/pull/4237, this PR bumps AWS dependencies along with all other dependencies to the latest compatible semver. Signed-off-by: Alex Chi <iskyzh@gmail.com>	2023-05-25 10:21:15 -04:00
Christian Schwarz	e5617021a7	refactor: eliminate global storage_broker client state (#4318 ) (This is prep work to make `Timeline::activate` infallible.) This patch removes the global storage_broker client instance from the pageserver codebase. Instead, pageserver startup instantiates it and passes it down to the `Timeline::activate` function, which in turn passes it to the WalReceiver, which is the entity that actually uses it. Patch series: - #4316 - #4317 - #4318 - #4319	2023-05-25 16:47:42 +03:00
Christian Schwarz	de780d2e0f	make TenantState::{Loading,Attaching,Activating} owned by spawn_load / spawn_attach See the Mermaid diagram in the doc comment for the now-possible state transitions. The two core insights / changes are: - spawn_load and spawn_attach own the tenant state until they're done - once load()/attach() calls are done - if they failed, transition them to Broken directly (we know that there's no background activity because we didn't call activate yet) - if they succeed, call activate. We can make it infallible. How? Later. - set_broken() and set_stopping() are changed to wait for spawn_load() / spawn_attach() to finish. This sounds scary because it might hinder detach or shutdown, but actually, concurrent attach+detach, or attach+shutdown, or load+shutdown, or attach+shutdown were just racy. With this change, they're not anymore. We can add a CancellationToken stored in Tenant for load/attach and cancel it from set_stopping() or set_broken() if necessary in the future. So, why can activate() be infallible now: because we declare that spawn_load and spawn_attach own the tenant state until they're done. And we enforce that ownership using the wait_for at the start of set_stopping and set_broken.	2023-05-25 15:02:43 +02:00
Christian Schwarz	f18d9f555b	Revert "Revert "use tokio::sync::watch::Receiver::wait_for"" This reverts commit `eaf270c648`.	2023-05-25 14:58:49 +02:00
Christian Schwarz	05a2fe08d1	Merge branch 'problame/infallible-timeline-activate/4-make-infallible' into problame/async-timeline-get/dont-hold-timelines-lock-inside-tenant-state-send-modify	2023-05-25 14:58:19 +02:00
Christian Schwarz	eaf270c648	Revert "use tokio::sync::watch::Receiver::wait_for" This reverts commit `fe4ef121b6`.	2023-05-25 14:57:41 +02:00
Christian Schwarz	ddad0928c5	Merge branch 'problame/infallible-timeline-activate/3-funnel-storage-broker-client' into problame/infallible-timeline-activate/4-make-infallible	2023-05-25 14:53:32 +02:00
Christian Schwarz	96c550222b	apply heikki's comment suggestion	2023-05-25 14:53:20 +02:00
Christian Schwarz	cf8ff7edad	explainer comment on storage_broker::connect async weirdness	2023-05-25 14:51:48 +02:00
Christian Schwarz	83ba02b431	tenant_status: don't InternalServerError if tenant not found (#4337 ) Note this also changes the status code to the (correct) 404. Not sure if that's relevant to Console. Context: https://neondb.slack.com/archives/C04PSBP2SAF/p1684746238831449?thread_ts=1684742106.169859&cid=C04PSBP2SAF Atop #4300 because it cleans up the mgr::get_tenant() error type and I want eyes on that PR.	2023-05-25 11:38:04 +02:00
Christian Schwarz	37ecebe45b	mgr::get_tenant: distinguished error type (#4300 ) Before this patch, it would use error type `TenantStateError` which has many more error variants than can actually happen with `mgr::get_tenant`. Along the way, I also introduced `SetNewTenantConfigError` because it uses `mgr::get_tenant` and also can only fail in much fewer ways than `TenantStateError` suggests. The new `page_service.rs`'s `GetActiveTimelineError` and `GetActiveTenantError` types were necessary to avoid an `Other` variant on the `GetTenantError`. This patch is a by-product of reading code that subscribes to `Tenant::state` changes. Can't really connect it to any given project.	2023-05-25 11:37:12 +02:00
Sasha Krassovsky	6052ecee07	Add connector extension to send Role/Database updates to console (#3891 ) ## Describe your changes ## Issue ticket number and link ## Checklist before requesting a review - [x] I have performed a self-review of my code. - [x] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.	2023-05-25 12:36:57 +03:00
Christian Schwarz	da6573f551	Merge branch 'problame/infallible-timeline-activate/3-funnel-storage-broker-client' into problame/infallible-timeline-activate/4-make-infallible	2023-05-25 10:54:30 +02:00
Christian Schwarz	2fee8c884f	Merge remote-tracking branch 'origin/main' into problame/infallible-timeline-activate/3-funnel-storage-broker-client	2023-05-25 10:54:03 +02:00
Christian Schwarz	e11ba24ec5	tenant loops: operate on the Arc<Tenant> directly (#4298 ) (Instead of going through mgr every iteration.) The `wait_for_active_tenant` function's `wait` argument could be removed because it was only used for the loop that waits for the tenant to show up in the tenants map. Since we're passing the tenant in, we no longer need to get it from the tenants map. NB that there's no guarantee that the tenant object is in the tenants map at the time the background loop function starts running. But the tenant mgr guarantees that it will be quite soon. See `tenant_map_insert` way upwards in the call hierarchy for details. This is prep work to eliminate `subscribe_for_state_updates` (PR #4299 ) Fixes: #3501	2023-05-25 10:49:09 +02:00
Christian Schwarz	fe4ef121b6	use tokio::sync::watch::Receiver::wait_for	2023-05-25 10:44:26 +02:00
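
Commit de780d2e0f describes set_stopping() and set_broken() waiting until spawn_load / spawn_attach have finished before they touch the tenant state, and 13d3f4c29f has set_stopping() report whether it actually transitioned. The sketch below illustrates that ownership rule only; it is not the pageserver code. It assumes a reduced, hypothetical `TenantState` enum and a hypothetical `StateCell` type, and it swaps the tokio watch channel's `wait_for` for a blocking `std::sync::Condvar` so that it needs nothing beyond the standard library.

```rust
use std::sync::{Condvar, Mutex};

/// Reduced, hypothetical state set; the real TenantState has more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TenantState {
    Loading,
    Active,
    Broken,
    Stopping,
}

/// Hypothetical holder for the tenant state plus a change notification.
pub struct StateCell {
    state: Mutex<TenantState>,
    changed: Condvar,
}

impl StateCell {
    /// A new tenant starts out owned by the loader.
    pub fn new() -> Self {
        StateCell {
            state: Mutex::new(TenantState::Loading),
            changed: Condvar::new(),
        }
    }

    pub fn current(&self) -> TenantState {
        *self.state.lock().unwrap()
    }

    /// Called by the loader when it is done: it alone decides whether the
    /// tenant leaves Loading as Active or as Broken.
    pub fn finish_load(&self, ok: bool) {
        let mut s = self.state.lock().unwrap();
        if *s == TenantState::Loading {
            *s = if ok { TenantState::Active } else { TenantState::Broken };
        }
        drop(s);
        self.changed.notify_all();
    }

    /// Blocks while the loader still owns the state, then tries to move to
    /// Stopping. Returns false if no transition to Stopping happened
    /// (the tenant was already Broken or Stopping).
    pub fn set_stopping(&self) -> bool {
        let guard = self.state.lock().unwrap();
        let mut s = self
            .changed
            .wait_while(guard, |s| *s == TenantState::Loading)
            .unwrap();
        let transitioned = *s == TenantState::Active;
        if transitioned {
            *s = TenantState::Stopping;
        }
        drop(s);
        self.changed.notify_all();
        transitioned
    }
}
```

A thread calling `set_stopping()` while the state is still `Loading` simply parks until `finish_load()` runs, so a shutdown can never observe or overwrite a half-loaded tenant. The cost, as the commit message notes, is that shutdown or detach may be delayed by a slow load.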

3283 Commits
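
Commit 024109fbeb (reverted by be177f82dc) contrasts a semaphore, which caps only the number of requests in flight, with a leaky-bucket limiter, which also accounts for time. The following is a minimal sketch of the leaky-bucket idea under stated assumptions, not the code from that PR: `LeakyBucket`, its fields, and `try_acquire` are hypothetical names, and the clock is passed in explicitly so the behavior is deterministic.

```rust
use std::time::Instant;

/// Hypothetical leaky-bucket limiter: each admitted request adds one unit to
/// the bucket, and the bucket drains at a fixed rate over time.
pub struct LeakyBucket {
    capacity: f64,     // maximum burst the bucket can absorb
    leak_per_sec: f64, // sustained request rate
    level: f64,        // current fill level
    last: Instant,     // time of the last level update
}

impl LeakyBucket {
    pub fn new(capacity: u32, leak_per_sec: f64, now: Instant) -> Self {
        LeakyBucket {
            capacity: capacity as f64,
            leak_per_sec,
            level: 0.0,
            last: now,
        }
    }

    /// Try to admit one request at time `now`; returns true if admitted.
    pub fn try_acquire(&mut self, now: Instant) -> bool {
        // Drain the bucket for the time elapsed since the last update.
        let elapsed = now.saturating_duration_since(self.last).as_secs_f64();
        self.level = (self.level - elapsed * self.leak_per_sec).max(0.0);
        if now > self.last {
            self.last = now;
        }
        // Admit only if one more unit still fits.
        if self.level + 1.0 <= self.capacity {
            self.level += 1.0;
            true
        } else {
            false
        }
    }
}
```

With a capacity of 2 and a leak rate of 1 per second, two requests at the same instant are admitted, a third is rejected, and one more is admitted a second later. A semaphore with 2 permits would instead admit new requests as fast as earlier ones complete, regardless of elapsed time.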