Commit Graph

3283 Commits

Author SHA1 Message Date
Christian Schwarz
f91ad65fb3 Merge branch 'problame/async-timeline-get/dont-hold-timelines-lock-inside-tenant-state-send-modify' into problame/async-timeline-get/refactor-timeline-initialization-to-avoid-holding-tenants-timelines-lock 2023-05-26 18:24:23 +02:00
Christian Schwarz
9a4789ec73 demote warn line to info-level, as the log line in set_stopping() is also info!()
This should fix the failed regress tests that barked on allowed_errors
2023-05-26 18:22:41 +02:00
Christian Schwarz
72159ee686 Merge remote-tracking branch 'origin/main' into problame/async-timeline-get/dont-hold-timelines-lock-inside-tenant-state-send-modify 2023-05-26 18:03:35 +02:00
Christian Schwarz
be74662d05 re-introduce the check for 0 layers, based on cause 2023-05-26 17:52:26 +02:00
Christian Schwarz
e7c4ef9f4f don't hold TENANTS lock while waiting for set_stopping() 2023-05-26 17:46:09 +02:00
Christian Schwarz
13d3f4c29f set_stopping(): report in result if not transitioning to Stopping 2023-05-26 17:46:09 +02:00
Christian Schwarz
67258af8a2 Revert "test_broken_timelines: regex needs changing due to changes in this PR"
This reverts commit 17ba307004.
2023-05-26 17:40:37 +02:00
Christian Schwarz
17ba307004 test_broken_timelines: regex needs changing due to changes in this PR
The regex is different because tenant2 is not broken anymore with this
PR, because we allow empty timeline dirs to load
2023-05-26 17:40:34 +02:00
Christian Schwarz
e1486444d6 Revert "test_broken_timelines: wait for tenants to load"
This reverts commit c6f9b8f318.
2023-05-26 17:40:27 +02:00
Christian Schwarz
c6f9b8f318 test_broken_timelines: wait for tenants to load
Without this, we rely on the basebackup request to wait for the tenant to load.

It works, but, would be nice to rule it out, no?
2023-05-26 17:39:14 +02:00
Joonas Koivunen
be177f82dc Revert "Allow for higher s3 concurrency (#4292)" (#4356)
This reverts commit 024109fbeb because it
failed to speed anything up and instead ran into more errors.

See: #3698.
2023-05-26 18:37:17 +03:00
Christian Schwarz
ba3e3bdddf clippy 2023-05-26 17:07:15 +02:00
Christian Schwarz
71f9bbef0d fix the test timeline creation functions 2023-05-26 16:01:45 +02:00
Alexander Bayandin
339a3e3146 GitHub Autocomment: comment commits for branches (#4335)
## Problem

GitHub Autocomment script posts a comment only for PRs. It's harder
to debug failed tests on main or release branches.

## Summary of changes

- Change the GitHub Autocomment script to be able to post a comment to
either a PR or a commit of a branch
2023-05-26 14:49:42 +01:00
Christian Schwarz
4680f8c60b finish WIP: keep the real timeline from create_empty_timeline outside of timelines map until it has finished filling 2023-05-26 15:29:19 +02:00
Heikki Linnakangas
a560b28829 Make new tenant/timeline IDs mandatory in create APIs. (#4304)
We used to generate the ID, if the caller didn't specify it. That's bad
practice, however, because network is never fully reliable, so it's
possible we create a new tenant but the caller doesn't know about it,
and because it doesn't know the tenant ID, it has no way of retrying or
checking if it succeeded. To discourage that, make it mandatory. The web
control plane has not relied on the auto-generation for a long time.
2023-05-26 16:19:36 +03:00
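The retry argument in a560b28829 can be illustrated with a minimal sketch. This is not the pageserver API: `TenantId`, `TenantRegistry`, `CreateOutcome`, and `create_with_retry` are hypothetical names invented here. The point is only that a caller-supplied ID lets the caller retry safely after a lost response.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a tenant ID (the real type is not shown in the commit).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct TenantId(pub u128);

#[derive(Debug, PartialEq, Eq)]
pub enum CreateOutcome {
    Created,
    AlreadyExists,
}

/// Hypothetical in-memory registry standing in for the server side.
#[derive(Default)]
pub struct TenantRegistry {
    tenants: HashMap<TenantId, String>,
}

impl TenantRegistry {
    /// The ID is a required argument: the server never invents one.
    pub fn create_tenant(&mut self, id: TenantId, name: &str) -> CreateOutcome {
        if self.tenants.contains_key(&id) {
            return CreateOutcome::AlreadyExists;
        }
        self.tenants.insert(id, name.to_string());
        CreateOutcome::Created
    }

    pub fn count(&self) -> usize {
        self.tenants.len()
    }
}

/// Simulates a caller whose first response was lost and who therefore retries.
pub fn create_with_retry(reg: &mut TenantRegistry, id: TenantId) -> (CreateOutcome, CreateOutcome) {
    let first = reg.create_tenant(id, "demo"); // pretend this response never arrived
    let second = reg.create_tenant(id, "demo"); // retry with the same, known ID
    (first, second)
}
```

Because the caller knows the ID up front, the retry is answered with `AlreadyExists` instead of silently creating a second tenant.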
Christian Schwarz
3c1fc2617c WIP 2023-05-26 14:24:23 +02:00
Joonas Koivunen
024109fbeb Allow for higher s3 concurrency (#4292)
We currently have a semaphore based rate limiter which we hope will keep
us under S3 limits. However, the semaphore does not consider time, so
I've been hesitant to raise the concurrency limit of 100.

See #3698.

The PR introduces a leaky-bucket based rate limiter instead of the
`tokio::sync::Semaphore` which will allow us to raise the limit later
on. The configuration changes are not contained here.
2023-05-26 13:35:50 +03:00
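For readers unfamiliar with the distinction drawn in 024109fbeb, here is a minimal leaky-bucket sketch. It is not the limiter from the PR: the `LeakyBucket` type, its fields, and the explicit `now` parameter are assumptions made for illustration. Unlike a plain semaphore, the bucket drains as time passes, so it bounds the request rate rather than only the number of requests in flight.

```rust
use std::time::Instant;

/// Illustrative leaky-bucket limiter (hypothetical, not the PR's code).
pub struct LeakyBucket {
    capacity: f64,     // maximum burst size
    leak_per_sec: f64, // sustained requests per second
    level: f64,        // current fill level
    last: Instant,     // when `level` was last updated
}

impl LeakyBucket {
    pub fn new(capacity: u32, leak_per_sec: f64, now: Instant) -> Self {
        LeakyBucket {
            capacity: capacity as f64,
            leak_per_sec,
            level: 0.0,
            last: now,
        }
    }

    /// Returns true if one request may proceed at time `now`.
    pub fn try_acquire(&mut self, now: Instant) -> bool {
        // Drain the bucket according to the time elapsed since the last call.
        let elapsed = now.saturating_duration_since(self.last).as_secs_f64();
        self.level = (self.level - elapsed * self.leak_per_sec).max(0.0);
        self.last = now;
        if self.level + 1.0 <= self.capacity {
            self.level += 1.0;
            true
        } else {
            false
        }
    }
}
```

Passing `now` in explicitly keeps the sketch deterministic; a real limiter would more likely read the clock itself and wait instead of returning `false`.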
Christian Schwarz
60cc197ce3 fix test_timeline_create_break_after_uninit_mark (the refactoring added .context()) 2023-05-26 10:19:24 +02:00
Christian Schwarz
609a929968 instrument shutdown_all_tenants code path, include timeline_id in logs if failed to flush
This can be extracted into an independent commit.
2023-05-26 10:12:33 +02:00
Christian Schwarz
f2abc4c933 independent fix: test_pageserver_metrics_removed_after_detach didn't wait for uploads
This resulted in unexpectedly absent metrics `pageserver_remote_timeline_client_bytes_finished`
tripping the assert quoted below.

Not sure why this PR (#4350) exposed this problem though.
Are we detaching faster? If so, why?

AssertionError: assert {'pageserver_...s_count', ...} == {'pageserver_...s_count', ...}
  Extra items in the right set:
  'pageserver_remote_timeline_client_bytes_started_total'
  'pageserver_remote_timeline_client_bytes_finished_total'
  Full diff:
    {
     'pageserver_created_persistent_files_total',
     'pageserver_current_logical_size',
     'pageserver_evictions_total',
     'pageserver_evictions_with_low_residence_duration_total',
     'pageserver_getpage_reconstruct_seconds_bucket',
     'pageserver_getpage_reconstruct_seconds_count',
     'pageserver_getpage_reconstruct_seconds_sum',
     'pageserver_io_operations_bytes_total',
     'pageserver_io_operations_seconds_bucket',
     'pageserver_io_operations_seconds_count',
     'pageserver_io_operations_seconds_sum',
     'pageserver_last_record_lsn',
     'pageserver_materialized_cache_hits_total',
     'pageserver_remote_operation_seconds_bucket',
     'pageserver_remote_operation_seconds_count',
     'pageserver_remote_operation_seconds_sum',
     'pageserver_remote_physical_size',
  -  'pageserver_remote_timeline_client_bytes_finished_total',
  -  'pageserver_remote_timeline_client_bytes_started_total',
     'pageserver_remote_timeline_client_calls_started_bucket',
     'pageserver_remote_timeline_client_calls_started_count',
     'pageserver_remote_timeline_client_calls_started_sum',
     'pageserver_remote_timeline_client_calls_unfinished',
     'pageserver_resident_physical_size',
     'pageserver_smgr_query_seconds_bucket',
     'pageserver_smgr_query_seconds_count',
     'pageserver_smgr_query_seconds_sum',
     'pageserver_storage_operations_seconds_count_total',
     'pageserver_storage_operations_seconds_sum_total',
     'pageserver_tenant_states_count',
     'pageserver_wait_lsn_seconds_bucket',
     'pageserver_wait_lsn_seconds_count',
     'pageserver_wait_lsn_seconds_sum',
     'pageserver_written_persistent_bytes_total',
    }
2023-05-26 09:54:30 +02:00
Christian Schwarz
b09beaa4fe log while waiting for tenant to finish activation 2023-05-26 09:34:12 +02:00
Christian Schwarz
1367e2b0ee improve TenantState doc comments, repeating what's in the Mermaid diagram 2023-05-26 09:31:44 +02:00
Christian Schwarz
122e23071b fix the tests (commenting out too-conservative "Timeline has no ancestor and no layer files" assert) 2023-05-26 09:23:26 +02:00
Christian Schwarz
696c6ed6ff fix cfg(test) code to the extent that clippy passes 2023-05-26 08:49:42 +02:00
Alexander Bayandin
2b25f0dfa0 Fix flakiness of test_metric_collection (#4346)
## Problem

Test `test_metric_collection` became flaky:

```
AssertionError: assert not ['2023-05-25T14:03:41.644042Z ERROR metrics_collection: failed to send metrics: reqwest::Error { kind: Request, url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("localhost")), port: Some(18022), path: "/billing/api/v1/usage_events", query: None, fragment: None }, source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 99, kind: AddrNotAvailable, message: "Cannot assign requested address" })) }',
                            ...]
```
I suspect it is caused by having 2 places where we define
`httpserver_listen_address` fixture (which is internally used by
`pytest-httpserver` plugin)

## Summary of changes
- Remove the definition of `httpserver_listen_address` from
`test_runner/regress/test_ddl_forwarding.py` and keep one in
`test_runner/fixtures/neon_fixtures.py`
- Also remove unused `httpserver_listen_address` parameter from
`test_proxy_metric_collection`
2023-05-26 00:05:11 +03:00
Christian Schwarz
0874e27023 refactor timeline initialization
High-level ideas:
- placeholder Timeline object in timelines map during a timeline creation
- the timeline creations (branch, bootstrap, import_from_basebackup)
  prepare durable (on-disk & remote) state, if necessary using
  _another_ _temporary_ Timeline object
- once the timeline creations have prepared the durable state, they
  use the normal load routine (load_local_timeline) that is also used
  during pageserver startup
- Once the loading is done, we replace the placeholder timeline object
  with the real one
2023-05-25 23:01:40 +02:00
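The high-level ideas in 0874e27023 can be sketched roughly as follows. This is a simplified model, not the pageserver code: `Slot`, `Tenant`, `Timeline`, and `load_timeline` are hypothetical stand-ins, and durable-state preparation is reduced to a stub. It shows a placeholder entry reserving the ID, the load happening without the map lock held, and the placeholder then being swapped for the real object.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical timeline object; `layers` stands in for real loaded state.
pub struct Timeline {
    pub layers: usize,
}

/// Entry in the timelines map.
pub enum Slot {
    /// Placeholder reserved while a creation is in progress.
    Creating,
    /// The real, fully loaded timeline.
    Loaded(Arc<Timeline>),
}

pub struct Tenant {
    timelines: Mutex<HashMap<u64, Slot>>,
}

/// Stub for "prepare durable state, then run the normal load routine".
fn load_timeline(_id: u64) -> Timeline {
    Timeline { layers: 0 }
}

impl Tenant {
    pub fn new() -> Self {
        Tenant { timelines: Mutex::new(HashMap::new()) }
    }

    pub fn create_timeline(&self, id: u64) -> Result<Arc<Timeline>, String> {
        // Step 1: reserve the ID with a placeholder; the lock is dropped at the end of this block.
        {
            let mut map = self.timelines.lock().unwrap();
            if map.contains_key(&id) {
                return Err(format!("timeline {} exists or is being created", id));
            }
            map.insert(id, Slot::Creating);
        }
        // Step 2: do the slow work without holding the map lock.
        let loaded = Arc::new(load_timeline(id));
        // Step 3: replace the placeholder with the real object.
        self.timelines
            .lock()
            .unwrap()
            .insert(id, Slot::Loaded(Arc::clone(&loaded)));
        Ok(loaded)
    }

    pub fn is_loaded(&self, id: u64) -> bool {
        let map = self.timelines.lock().unwrap();
        let loaded = matches!(map.get(&id), Some(Slot::Loaded(_)));
        loaded
    }
}
```

A concurrent creation of the same ID hits the placeholder in step 1 and fails fast instead of racing with the load. Error handling that removes the placeholder on failure is omitted here.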
Christian Schwarz
6fe39ecbf7 add ability to have fake metrics (needed in next patch so we can have two Timeline objects with the same id in memory) 2023-05-25 23:01:40 +02:00
Christian Schwarz
a0c2a85505 timeline_init_and_sync: don't hold Tenant::timelines while load_layer_map
This patch inlines `initialize_with_lock` and then reorganizes the code
such that we can `load_layer_map` without holding the
`Tenant::timelines` lock.

As a nice aside, we can get rid of the dummy() uninit mark, which has
always been a terrible hack.
2023-05-25 23:01:40 +02:00
Christian Schwarz
dd0f5c4ef3 Merge remote-tracking branch 'origin/main' into problame/async-timeline-get/dont-hold-timelines-lock-inside-tenant-state-send-modify 2023-05-25 22:20:52 +02:00
Christian Schwarz
057cceb559 refactor: make timeline activation infallible (#4319)
Timeline::activate() was only fallible because `launch_wal_receiver`
was.

`launch_wal_receiver` was fallible only because of some preliminary
checks in `WalReceiver::start`.

Turns out these checks can be shifted to the type system by delaying
creation of the `WalReceiver` struct to the point where we activate the
timeline.

The changes in this PR were enabled by my previous refactoring that
funneled the broker_client from pageserver startup to the activate()
call sites.

Patch series:

- #4316
- #4317
- #4318
- #4319
2023-05-25 20:26:43 +02:00
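A rough sketch of what "shifting the checks to the type system" in 057cceb559 can look like. The types below are hypothetical simplifications that borrow names from the commit message and do not reproduce the real signatures. If constructing the receiver is the same act as starting it, there is no "not yet started" or "started twice" state left to check at runtime, so activation has nothing to fail on.

```rust
/// Hypothetical stand-in for the storage broker client.
#[derive(Clone)]
pub struct BrokerClient;

/// Constructing a `WalReceiver` *is* starting it, so a value of this type
/// always represents a running receiver.
pub struct WalReceiver {
    _client: BrokerClient,
}

impl WalReceiver {
    pub fn start(client: BrokerClient) -> Self {
        WalReceiver { _client: client }
    }
}

#[derive(Default)]
pub struct Timeline {
    // `None` until the timeline is activated.
    wal_receiver: Option<WalReceiver>,
}

impl Timeline {
    /// Infallible: returns `()`, not `Result`.
    pub fn activate(&mut self, client: BrokerClient) {
        self.wal_receiver = Some(WalReceiver::start(client));
    }

    pub fn is_active(&self) -> bool {
        self.wal_receiver.is_some()
    }
}
```

The design choice is that the precondition ("a client is available, the receiver is not running yet") is expressed by which values exist rather than by a check that returns an error.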
sharnoff
ae805b985d Bump vm-builder v0.7.3-alpha3 -> v0.8.0 (#4339)
Routine `vm-builder` version bump, from autoscaling repo release. You
can find the release notes here:
https://github.com/neondatabase/autoscaling/releases/tag/v0.8.0
The changes are from v0.7.2 — most of them were already included in
v0.7.3-alpha3.

Of particular note: This (finally) fixes the cgroup issues, so we should
now be able to scale up when we're about to run out of memory.

**NB:** This has the effect of limiting the DB's memory usage in a way it
wasn't limited before. We may run into issues because of that. There is
currently no way to disable that behavior, other than switching the
endpoint back to the k8s-pod provisioner.
2023-05-25 09:33:18 -07:00
Joonas Koivunen
85e76090ea test: fix ancestor is stopping flakiness (#4234)
Flakiness most likely introduced in #4170, detected in
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4232/4980691289/index.html#suites/542b1248464b42cc5a4560f408115965/18e623585e47af33.

Opted to allow it globally because it can happen in other tests as well,
basically whenever compaction is enabled and we stop pageserver
gracefully.
2023-05-25 16:22:58 +00:00
Alexander Bayandin
08e7d2407b Storage: use Postgres 15 as default (#2809) 2023-05-25 15:55:46 +01:00
Alex Chi Z
ab2757f64a bump dependencies version (#4336)
Following https://github.com/neondatabase/neon/pull/4237, this PR bumps
AWS dependencies along with all other dependencies to the latest
compatible semver.

Signed-off-by: Alex Chi <iskyzh@gmail.com>
2023-05-25 10:21:15 -04:00
Christian Schwarz
e5617021a7 refactor: eliminate global storage_broker client state (#4318)
(This is prep work to make `Timeline::activate` infallible.)

This patch removes the global storage_broker client instance from the
pageserver codebase.

Instead, pageserver startup instantiates it and passes it down to the
`Timeline::activate` function, which in turn passes it to the
WalReceiver, which is the entity that actually uses it.

Patch series:

- #4316
- #4317
- #4318
- #4319
2023-05-25 16:47:42 +03:00
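The dependency funnelling described in e5617021a7 could look something like the sketch below. All names and fields are assumptions for illustration (in particular `startup` and the `endpoint` field), and the real code is async. The client is built once at startup and handed down through `activate` to the receiver, instead of being read from a global.

```rust
use std::sync::Arc;

/// Hypothetical broker client; `endpoint` exists only for illustration.
pub struct BrokerClient {
    pub endpoint: String,
}

/// The entity that actually uses the client.
pub struct WalReceiver {
    client: Arc<BrokerClient>,
}

pub struct Timeline {
    receiver: Option<WalReceiver>,
}

impl Timeline {
    pub fn new() -> Self {
        Timeline { receiver: None }
    }

    /// The client arrives as an argument; nothing here reads global state.
    pub fn activate(&mut self, client: Arc<BrokerClient>) {
        self.receiver = Some(WalReceiver { client });
    }

    pub fn broker_endpoint(&self) -> Option<&str> {
        self.receiver.as_ref().map(|r| r.client.endpoint.as_str())
    }
}

/// Startup instantiates the client once and passes it down to every timeline.
pub fn startup(endpoint: &str, n: usize) -> (Arc<BrokerClient>, Vec<Timeline>) {
    let client = Arc::new(BrokerClient { endpoint: endpoint.to_string() });
    let mut timelines = Vec::new();
    for _ in 0..n {
        let mut tl = Timeline::new();
        tl.activate(Arc::clone(&client));
        timelines.push(tl);
    }
    (client, timelines)
}
```

Passing the client explicitly also makes it straightforward to substitute a different client in tests.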
Christian Schwarz
de780d2e0f make TenantState::{Loading,Attaching,Activating} owned by spawn_load / spawn_attach
See the Mermaid diagram in the doc comment for the now-possible state transitions.

The two core insights / changes are:
- spawn_load and spawn_attach own the tenant state until they're done
- once load()/attach() calls are done
    - if they failed, transition them to Broken directly (we know
      that there's no background activity because we didn't call activate yet)
    - if they succeed, call activate. We can make it infallible. How? Later.

- set_broken() and set_stopping() are changed to wait for spawn_load() /
  spawn_attach() to finish. This sounds scary because it might hinder
  detach or shutdown, but actually, concurrent attach+detach, or
  attach+shutdown, or load+shutdown were just racy.
  With this change, they're not anymore.
  We can add a CancellationToken stored in Tenant for load/attach and cancel
  it from set_stopping() or set_broken() if necessary in the future.

So, why can activate() be infallible now: because we declare that
spawn_load and spawn_attach own the tenant state until they're done.
And we enforce that ownership using the wait_for at the start of
set_stopping and set_broken.
2023-05-25 15:02:43 +02:00
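The ownership rule from de780d2e0f can be modelled in a few lines. This sketch deliberately swaps in `std::sync::Condvar` for the tokio watch channel and `wait_for` mentioned in the neighbouring commits, collapses Loading/Attaching/Activating into a single `Loading` state, and uses invented helper names (`spawn_load` as a free function, `state`). It only shows that `set_stopping` waits until the load task has given up ownership of the state.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TenantState {
    Loading,
    Active,
    Broken,
    Stopping,
}

pub struct Tenant {
    state: Mutex<TenantState>,
    changed: Condvar,
}

impl Tenant {
    pub fn new() -> Arc<Self> {
        Arc::new(Tenant {
            state: Mutex::new(TenantState::Loading),
            changed: Condvar::new(),
        })
    }

    fn set(&self, new: TenantState) {
        *self.state.lock().unwrap() = new;
        self.changed.notify_all();
    }

    pub fn state(&self) -> TenantState {
        *self.state.lock().unwrap()
    }

    /// Waits until the load task is done, then transitions to Stopping.
    /// Returns the state observed before the transition.
    pub fn set_stopping(&self) -> TenantState {
        let guard = self.state.lock().unwrap();
        let mut guard = self
            .changed
            .wait_while(guard, |s| *s == TenantState::Loading)
            .unwrap();
        let previous = *guard;
        *guard = TenantState::Stopping;
        self.changed.notify_all();
        previous
    }
}

/// The load task owns the state for as long as it is `Loading`.
pub fn spawn_load(tenant: &Arc<Tenant>, load_ok: bool) -> thread::JoinHandle<()> {
    let tenant = Arc::clone(tenant);
    thread::spawn(move || {
        // ... loading from disk would happen here ...
        tenant.set(if load_ok { TenantState::Active } else { TenantState::Broken });
    })
}
```

Because nobody else may change the state while it is `Loading`, the load task can move it to `Active` or `Broken` without checking for a concurrent shutdown.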
Christian Schwarz
f18d9f555b Revert "Revert "use tokio::sync::Receiver::wait_for""
This reverts commit eaf270c648.
2023-05-25 14:58:49 +02:00
Christian Schwarz
05a2fe08d1 Merge branch 'problame/infallible-timeline-activate/4-make-infallible' into problame/async-timeline-get/dont-hold-timelines-lock-inside-tenant-state-send-modify 2023-05-25 14:58:19 +02:00
Christian Schwarz
eaf270c648 Revert "use tokio::sync::Receiver::wait_for"
This reverts commit fe4ef121b6.
2023-05-25 14:57:41 +02:00
Christian Schwarz
ddad0928c5 Merge branch 'problame/infallible-timeline-activate/3-funnel-storage-broker-client' into problame/infallible-timeline-activate/4-make-infallible 2023-05-25 14:53:32 +02:00
Christian Schwarz
96c550222b apply heikki's comment suggestion 2023-05-25 14:53:20 +02:00
Christian Schwarz
cf8ff7edad explainer comment on storage_broker::connect async weirdness 2023-05-25 14:51:48 +02:00
Christian Schwarz
83ba02b431 tenant_status: don't InternalServerError if tenant not found (#4337)
Note this also changes the status code to the (correct) 404. Not sure if
that's relevant to Console.

Context:
https://neondb.slack.com/archives/C04PSBP2SAF/p1684746238831449?thread_ts=1684742106.169859&cid=C04PSBP2SAF

Atop #4300 because it cleans up the mgr::get_tenant() error type and I want eyes on that PR.
2023-05-25 11:38:04 +02:00
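As an illustration of the status-code change in 83ba02b431, a handler can map a "not found" error to 404 rather than a blanket 500. The enum variants, `status_code`, and the `tenant_status` signature below are hypothetical and are not the pageserver's actual HTTP layer.

```rust
/// Hypothetical error type; the variant names are illustrative.
#[derive(Debug)]
pub enum GetTenantError {
    NotFound(u64),
    Internal(String),
}

/// Maps each error to the HTTP status a client should see.
pub fn status_code(err: &GetTenantError) -> u16 {
    match err {
        GetTenantError::NotFound(_) => 404,
        GetTenantError::Internal(_) => 500,
    }
}

/// Toy lookup: `known` stands in for the tenants map.
pub fn tenant_status(known: &[u64], id: u64) -> Result<&'static str, GetTenantError> {
    if known.contains(&id) {
        Ok("Active")
    } else {
        Err(GetTenantError::NotFound(id))
    }
}
```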
Christian Schwarz
37ecebe45b mgr::get_tenant: distinguished error type (#4300)
Before this patch, it would use error type `TenantStateError` which has
many more error variants than can actually happen with
`mgr::get_tenant`.

Along the way, I also introduced `SetNewTenantConfigError` because it
uses `mgr::get_tenant` and also can only fail in far fewer ways than
`TenantStateError` suggests.

The new `page_service.rs`'s `GetActiveTimelineError` and
`GetActiveTenantError` types were necessary to avoid an `Other` variant
on the `GetTenantError`.

This patch is a by-product of reading code that subscribes to
`Tenant::state` changes.
Can't really connect it to any given project.
2023-05-25 11:37:12 +02:00
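The idea behind 37ecebe45b, sketched under assumptions: the variants of both enums and the helper functions are invented for illustration, and only the type names come from the commit message. A lookup returns a narrow error type listing exactly the ways it can fail, and callers that need the broad type widen it through `From`.

```rust
/// Broad error type with variants a plain lookup can never produce.
#[derive(Debug, PartialEq, Eq)]
pub enum TenantStateError {
    NotFound(u64),
    NotActive(u64),
    IsStopping(u64),
    Other(String),
}

/// Narrow error type: exactly the ways a lookup can fail.
#[derive(Debug, PartialEq, Eq)]
pub enum GetTenantError {
    NotFound(u64),
    NotActive(u64),
}

impl From<GetTenantError> for TenantStateError {
    fn from(e: GetTenantError) -> Self {
        match e {
            GetTenantError::NotFound(id) => TenantStateError::NotFound(id),
            GetTenantError::NotActive(id) => TenantStateError::NotActive(id),
        }
    }
}

/// Toy lookup over `(tenant_id, is_active)` pairs.
pub fn get_tenant(known: &[(u64, bool)], id: u64) -> Result<u64, GetTenantError> {
    match known.iter().find(|(tid, _)| *tid == id) {
        None => Err(GetTenantError::NotFound(id)),
        Some((_, false)) => Err(GetTenantError::NotActive(id)),
        Some((tid, true)) => Ok(*tid),
    }
}

/// A caller that still reports the broad type; `?` widens via `From`.
pub fn detach(known: &[(u64, bool)], id: u64) -> Result<(), TenantStateError> {
    let _tenant = get_tenant(known, id)?;
    Ok(())
}
```

Callers that match on `GetTenantError` no longer need arms for cases that cannot occur.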
Sasha Krassovsky
6052ecee07 Add connector extension to send Role/Database updates to console (#3891)
## Describe your changes

## Issue ticket number and link

## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [x] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
2023-05-25 12:36:57 +03:00
Christian Schwarz
da6573f551 Merge branch 'problame/infallible-timeline-activate/3-funnel-storage-broker-client' into problame/infallible-timeline-activate/4-make-infallible 2023-05-25 10:54:30 +02:00
Christian Schwarz
2fee8c884f Merge remote-tracking branch 'origin/main' into problame/infallible-timeline-activate/3-funnel-storage-broker-client 2023-05-25 10:54:03 +02:00
Christian Schwarz
e11ba24ec5 tenant loops: operate on the Arc<Tenant> directly (#4298)
(Instead of going through mgr every iteration.)

The `wait_for_active_tenant` function's `wait` argument could be removed
because it was only used for the loop that waits for the tenant to show
up in the tenants map. Since we're passing the tenant in, we no longer
need to get it from the tenants map.

NB that there's no guarantee that the tenant object is in the tenants
map at the time the background loop function starts running. But the
tenant mgr guarantees that it will be quite soon. See
`tenant_map_insert` way upwards in the call hierarchy for details.

This is prep work to eliminate `subscribe_for_state_updates` (PR #4299)

Fixes: #3501
2023-05-25 10:49:09 +02:00
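A minimal sketch of the pattern in e11ba24ec5, with hypothetical names (`compaction_loop`, `spawn_background`, and the `compactions` counter are assumptions) and a thread standing in for the real async task. The background loop is handed the `Arc<Tenant>` itself, so it never looks the tenant up in the tenants map on each iteration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

pub struct Tenant {
    pub id: u64,
    pub compactions: AtomicUsize,
}

/// The loop receives the tenant directly and keeps it alive through the `Arc`.
pub fn compaction_loop(tenant: Arc<Tenant>, iterations: usize) {
    for _ in 0..iterations {
        // One unit of background work per iteration.
        tenant.compactions.fetch_add(1, Ordering::Relaxed);
    }
}

pub fn spawn_background(tenant: &Arc<Tenant>, iterations: usize) -> thread::JoinHandle<()> {
    let tenant = Arc::clone(tenant);
    thread::spawn(move || compaction_loop(tenant, iterations))
}
```

Since the loop owns a reference to the tenant, it works even if the tenant has not yet been inserted into the map when the loop starts.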
Christian Schwarz
fe4ef121b6 use tokio::sync::Receiver::wait_for 2023-05-25 10:44:26 +02:00