Commit Graph

58 Commits

Author SHA1 Message Date
Christian Schwarz
df52587bef attach-time tenant config (#4255)
This PR adds support for supplying the tenant config upon /attach.

Before this change, when relocating a tenant using `/detach` and
`/attach`, the tenant config after `/attach` would be the default config
from `pageserver.toml`.
That is undesirable for settings such as the PITR-interval: if the
tenant's config on the source was `30 days` and the default config on
the attach-side is `7 days`, then the first GC run would eradicate 23
days worth of PITR capability.

The API change is backwards-compatible: if the body is empty, we
continue to use the default config.
We'll remove that capability as soon as the cloud.git code is updated to
use attach-time tenant config
(https://github.com/neondatabase/neon/issues/4282 keeps track of this).

unblocks https://github.com/neondatabase/cloud/issues/5092 
fixes https://github.com/neondatabase/neon/issues/1555
part of https://github.com/neondatabase/neon/issues/886 (Tenant
Relocation)

Implementation
==============

The preliminary PRs for this work were (most-recent to least-recent)

* https://github.com/neondatabase/neon/pull/4279
* https://github.com/neondatabase/neon/pull/4267
* https://github.com/neondatabase/neon/pull/4252
* https://github.com/neondatabase/neon/pull/4235
2023-05-24 17:46:30 +03:00
Christian Schwarz
b391c94440 tenant create / update-config: reject unknown fields (#4267)
This PR enforces that the tenant create / update-config APIs reject
requests with unknown fields.

This is a desirable property because some tenant config settings control
the lifetime of user data (e.g., GC horizon or PITR interval).

Suppose we inadvertently rename the `pitr_interval` field in the Rust
code.
Then, right now, a client that still uses the old name will send a
tenant config request to configure a new PITR interval.
Before this PR, we would accept such a request, ignore the old name
field, and use the pageserver.toml default value for what the new PITR
interval is.
With this PR, we will instead reject such a request.

One might argue that the client could simply check whether the config it
sent has been applied, using the `/v1/tenant/.../config` endpoint.
That is correct for tenant create and update-config.

But, attach will soon [^1] grow the ability to have attach-time config
as well.
If we ignore unknown fields and fall back to global defaults in that
case, we risk data loss.
Example:
1. Default PITR in pageservers is 7 days.
2. Create a tenant and set its PITR to 30 days.
3. For 30 days, fill the tenant continuously with data.
4. Detach the tenant.
5. Attach tenant.

Attach must use the 30-day PITR setting in this scenario.
If it were to fall back to the 7-day default value, we would lose 23
days of PITR capability for the tenant.

So, the PR that adds attach-time tenant config will build on the
(clunky) infrastructure added in this PR

[^1]: https://github.com/neondatabase/neon/pull/4255

Implementation Notes
====================

This could have been a simple `#[serde(deny_unknown_fields)]` but sadly,
that is documented- but silent-at-compile-time-incompatible with
`#[serde(flatten)]`. But we are still using this by adding on outer struct and use unit tests to ensure it is correct.

`neon_local tenant config` now uses the `.remove()` pattern + bail if
there are leftover config args. That's in line with what
`neon_local tenant create` does. We should dedupe that logic in a future
PR.

---------

Signed-off-by: Alex Chi <iskyzh@gmail.com>
Co-authored-by: Alex Chi <iskyzh@gmail.com>
2023-05-18 21:16:09 -04:00
Christian Schwarz
89307822b0 mgmt api: share a single tenant config model struct in Rust and OpenAPI (#4252)
This is prep for https://github.com/neondatabase/neon/pull/4255

[1/X] OpenAPI: share a single definition of TenantConfig

DRYs up the pageserver OpenAPI YAML's representation of
tenant config.

All the fields of tenant config are now located in a model schema
called TenantConfig.

The tenant create & config-change endpoints have separate schemas,
TenantCreateInfo and TenantConfigureArg, respectively.
These schemas inherit from TenantConfig, using allOf 1.

The tenant config-GET handler's response was previously named
TenantConfig.
It's now named TenantConfigResponse.

None of these changes affect how the request looks on the wire.

The generated Go code will change for Console because the OpenAPI code
generator maps `allOf` to a Go struct embedding.
Luckily, usage of tenant config in Console is still very lightweigt,
but that will change in the near future.
So, this is a good chance to set things straight.

The console changes are tracked in
 https://github.com/neondatabase/cloud/pull/5046

[2/x]: extract the tenant config parts of create & config requests

[3/x]: code movement: move TenantConfigRequestConfig next to
TenantCreateRequestConfig

[4/x] type-alias TenantConfigRequestConfig = TenantCreateRequestConfig;
They are exactly the same.

[5/x] switch to qualified use for tenant create/config request api
models

[6/x] rename models::TenantConfig{RequestConfig,} and remove the alias

[7/x] OpenAPI: sync tenant create & configure body names from Rust code

[8/x]: dedupe the two TryFrom<...> for TenantConfOpt impls
The only difference is that the TenantConfigRequest impl does

```
        tenant_conf.max_lsn_wal_lag = request_data.max_lsn_wal_lag;
        tenant_conf.trace_read_requests = request_data.trace_read_requests;
```

and the TenantCreateRequest impl does

```
        if let Some(max_lsn_wal_lag) = request_data.max_lsn_wal_lag {
            tenant_conf.max_lsn_wal_lag = Some(max_lsn_wal_lag);
        }
        if let Some(trace_read_requests) = request_data.trace_read_requests {
            tenant_conf.trace_read_requests = Some(trace_read_requests);
        }
```

As far as I can tell, these are identical.
2023-05-17 12:31:17 +02:00
Christian Schwarz
80522a1b9d replace has_in_progress_downloads with new attachment_status field (#4168)
Control Plane currently [^1] polls for `has_in_progress_downloads ==
false` after /attach to determine that an attach operation succeeded.

As pointed out in the OpenAPI spec as of neon#4151, polling for
`has_in_progress_downloads` is incorrect.

This patch changes the situation by
- removing `has_in_progress_downloads`
- adding a new field `attachment_status.`
- changing instructions for `/attach` to poll for `attachment_status ==
attached`.

This makes the instructions in `/attach` actionable for Control Plane.
NB that we don't expose the TenantState in the OpenAPI docs, even though
we expose it in the endpoint. That is with good reason because we don't
want to commit to a fixed set of tenant states forever. Hence, the
separate `attachment_status` field that exposes the bare minimum
required to make /attach + subsequent polling 100% safe wrt split brain.

It would have been nice to report failures explicitly, but the problem
is that we lose that state when we restart. So, we return `attached`
upon attach failure. The tenant is Broken in that case, causing Control
Plane's subsequent health check will fail. Control Plane can roll back
the relocation operation then.
NB: the reliance on the subsequent health check is no change to what we
had before this patch!
NB: we can always add additional TenantAttachmentStatus'es in the future
to communicate failure.

This PR also moves the attach-marker file's creation to the API
handler's synchronous part. That was done to avoid the need to
distinguish
* `Attaching but marker not yet written => AttachmentStatus::Maybe` from
* `Attaching, marker written, but attach failed for other reason =>
AttachmentStatus::Attached`

Coincidentally, this also adds more transactionality to the /attach API
because we only return 202 once we've written the marker file. But, in
the end, it doesn't affect how the control plane interacts with us or
how it needs to do retries. So, we don't mention any of this in the API
docs.

[^1]: The one-click tenant relocation PR cloud#4740, currently WIP, is
      the first real user.
2023-05-11 16:53:46 +03:00
Christian Schwarz
411c71b486 document current tenant attach API semantics (#4151)
We currently return 202 as soon as the tenant is allocated in memory
before we've written out the marker file. So, the /attach API currently
does not have a transactional character. For example, it can happen that
we respond with a 202 and then crash before writing out the marker file.
In such a case, it is important that the client

1. observes the lost attach (by polling tenant status and observing 404)
2. and consequently retries the attach.

It has to do it in this loop until it observes the tenant as "Active" in
the tenant status. If the client doesn't follow this protocol and
instead goes to another pageserver to attach the tenant, we risk a
split-brain situation where both the first and second pageserver write
to the tenant's S3 state.

The improved description highlights the consequences of this behavior
for clients that use the /attach endpoint.

The tenant relocation that is currently being implemented in cloud#4740
implements retries of Attach and it does poll afterwards, but, it polls
`has_in_progress_downloads`.
That is incorrect, as described in the patch body.

The motivation for this write-up is that, in a future PR, we'll extend
the /attach endpoint with an option to provide the tenant config. If we
decide to leave the non-transactional behavior of /attach unmodified, we
will be able to avoid persisting the tenant config. Conversely, if we
decide that the /attach API should become transactional, we'll need to
persist the tenant config in the attach-marker-file before acknowledging
receipt of the /attach operation.

refs https://github.com/neondatabase/cloud/pull/4740
refs https://github.com/neondatabase/neon/issues/2238
refs https://github.com/neondatabase/neon/issues/1555
2023-05-05 19:32:41 +03:00
Eduard Dyckman
afbbc61036 Adding synthetic size to pageserver swagger (#4049)
## Describe your changes

I added synthetic size response to the console swagger. Now I am syncing
it back to neon
2023-04-24 16:19:25 +03:00
Dmitry Rodionov
15d1f85552 Add reason to TenantState::Broken (#3954)
Reason and backtrace are added to the Broken state. Backtrace is automatically collected when tenant entered the broken state. The format for API, CLI and metrics is changed and unified to return tenant state name in camel case. Previously snake case was used for metrics and camel case was used for everything else. Now tenant state field in TenantInfo swagger spec is changed to contain state name in "slug" field and other fields (currently only reason and backtrace for Broken variant in "data" field). To allow for this breaking change state was removed from TenantInfo swagger spec because it was not used anywhere.

Please note that the tenant's broken reason is not persisted on disk so the reason is lost when pageserver is restarted.

Requires changes to grafana dashboard that monitors tenant states.

Closes #3001

---------

Co-authored-by: theirix <theirix@gmail.com>
2023-04-13 12:11:43 +03:00
Christian Schwarz
a64dd3ecb5 disk-usage-based layer eviction (#3809)
This patch adds a pageserver-global background loop that evicts layers
in response to a shortage of available bytes in the $repo/tenants
directory's filesystem.

The loop runs periodically at a configurable `period`.

Each loop iteration uses `statvfs` to determine filesystem-level space
usage. It compares the returned usage data against two different types
of thresholds. The iteration tries to evict layers until app-internal
accounting says we should be below the thresholds. We cross-check this
internal accounting with the real world by making another `statvfs` at
the end of the iteration. We're good if that second statvfs shows that
we're _actually_ below the configured thresholds. If we're still above
one or more thresholds, we emit a warning log message, leaving it to the
operator to investigate further.

There are two thresholds:
- `max_usage_pct` is the relative available space, expressed in percent
of the total filesystem space. If the actual usage is higher, the
threshold is exceeded.
- `min_avail_bytes` is the absolute available space in bytes. If the
actual usage is lower, the threshold is exceeded.

The iteration evicts layers in LRU fashion with a reservation of up to
`tenant_min_resident_size` bytes of the most recent layers per tenant.
The layers not part of the per-tenant reservation are evicted
least-recently-used first until we're below all thresholds. The
`tenant_min_resident_size` can be overridden per tenant as
`min_resident_size_override` (bytes).

In addition to the loop, there is also an HTTP endpoint to perform one
loop iteration synchronous to the request. The endpoint takes an
absolute number of bytes that the iteration needs to evict before
pressure is relieved. The tests use this endpoint, which is a great
simplification over setting up loopback-mounts in the tests, which would
be required to test the statvfs part of the implementation. We will rely
on manual testing in staging to test the statvfs parts.

The HTTP endpoint is also handy in emergencies where an operator wants
the pageserver to evict a given amount of space _now. Hence, it's
arguments documented in openapi_spec.yml. The response type isn't
documented though because we don't consider it stable. The endpoint
should _not_ be used by Console but it could be used by on-call.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2023-03-31 14:47:57 +03:00
Gleb Novikov
b26c837ed6 Fixed pageserver openapi spec properties reference (#3904)
## Describe your changes

In [this linter
run](https://github.com/neondatabase/cloud/actions/runs/4553032319/jobs/8029101300?pr=4391)
accidentally found out that spec is invalid. Reference other schemas in
properties should be done the way I changed.

Could not find documentation specifically for schemas embedding in
`components.schemas`, but it seems like the approach is inherited from
json schema:
https://json-schema.org/understanding-json-schema/structuring.html#ref

## Issue ticket number and link
-

## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] ~If it is a core feature, I have added thorough tests.~
- [ ] ~Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?~
- [ ] ~If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.~
2023-03-29 19:18:44 +04:00
Dmitry Rodionov
6efea43449 Use precondition failed code in delete_timeline when tenant is missing (#3884)
This allows client to differentiate between missing tenant and missing
timeline cases
2023-03-27 21:01:46 +03:00
Dmitry Rodionov
870ba43a1f return proper http codes in timeline delete endpoint (#3876)
return proper http codes in timeline delete endpoint
+ fix openapi spec for detach to include 404 responses
2023-03-24 19:25:39 +02:00
Shany Pozin
0f7de84785 Allow calling detach on ignored tenant (#3834)
## Describe your changes
Added a query param to detach API
Allow to remove local state of a tenant even if its not in the memory
(following ignore API)
## Issue ticket number and link
#3828
## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

---------

Co-authored-by: Kirill Bulatov <kirill@neon.tech>
2023-03-22 07:17:00 +00:00
Sasha Krassovsky
02b8e0e5af Add OpenAPI spec for do_gc (#3756)
## Describe your changes
Adds a field to the OpenAPI spec for the page server which describes the
`do_gc` command.
## Issue ticket number and link
#3669
## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
2023-03-07 09:08:46 -08:00
Heikki Linnakangas
ddbdcdddd7 Tenant size calculation: refactor, rewrite, and add SVG (#2817)
Refactor the tenant_size_model code. Segment now contains just the
minimum amount of information needed to calculate the size. Other
information that is useful for building up the segment tree, and for
display purposes, is now kept elsewhere. The code in 'main.rs' has a new
ScenarioBuilder struct for that.

Calculating which Segments are "needed" is now the responsibility of the
caller of tenant_size_mode, not part of the calculation itself. So it's
up to the caller to make all the decisions with retention periods for
each branch.

The output of the sizing calculation is now a Vec of SizeResults, rather
than a tree. It uses a tree representation internally, when doing the
calculation, but it's not exposed to the caller anymore.

Refactor the way the recursive calculation is performed.

Rewrite the code in size.rs that builds the Segment model. Get rid of
the intermediate representation with Update structs. Build the Segments
directly, with some local HashMaps and Vecs to track branch points to
help with that.

retention_period is now an input to gather_inputs(), rather than an
output.

Update pageserver http API: rename /size endpoint to /synthetic_size
with following parameters:
    - /synthetic_size?inputs_only to get debug info;
- /synthetic_size?retention_period=0 to override cutoff that is used to
calculate the size;
pass header -H "Accept: text/html" to get HTML output, otherwise JSON is
returned

Update python tests and openapi spec.

---------

Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-02-16 10:53:46 +02:00
Kirill Bulatov
ec3a3aed37 Dump current tenant config (#3534)
The PR adds an endpoint to show tenant's current config: `GET
/v1/tenant/:tenant_id/config`

Tenant's config consists of two parts: tenant overrides (could be
changed via other management API requests) and the default part,
substituting all missing overrides (constant, hardcoded in pageserver).
The API returns the custom overrides and the final tenant config, after
applying all the defaults.

Along the way, it had to fix two things in the config:

* allow to shorten the json version and omit all `null`'s (same as toml
serializer behaves by default), and to understand such shortened format
when deserialized. A unit test is added
* fix a bug, when `PUT /v1/tenant/config` endpoint rewritten the local
file with what had came in the request, but updating (not rewriting the
old values) the in-memory state instead.
That got uncovered during adjusting the e2e test and fixed to do the
replacement everywhere, otherwise there's no way to revert existing
overrides. Fixes #3471 (commit
dc688affe8)
* fixes https://github.com/neondatabase/neon/issues/3472 by reordering
the config saving operations
2023-02-04 01:32:29 +02:00
Joonas Koivunen
f74080cbad feat(http): support ?inputs_only=true for tenant_size (#3419)
this makes debugging problematic cases in the future easier, as we can
just request the model inputs, use them locally to reproduce the issue
with the model.
2023-01-24 13:57:13 +02:00
Christian Schwarz
f637f6e77e stop exposing non-incremental sizes in API spec
Console doesn't use them, so, don't expose them.

refs https://github.com/neondatabase/cloud/pull/3358
refs https://github.com/neondatabase/cloud/pull/3366
2022-12-21 15:37:29 +01:00
Christian Schwarz
c785a516aa remove TimelineInfo.{Remote,Local} along with their types
follow-up of https://github.com/neondatabase/neon/pull/2615
which is neon.git: 538876650a

must be deployed after cloud.git change
https://github.com/neondatabase/cloud/issues/3232

fixes https://github.com/neondatabase/neon/issues/3041
2022-12-19 14:37:40 +01:00
Heikki Linnakangas
b513619503 Remove obsolete 'awaits_download' field.
It used to be a separate piece of state, but after 9a6c0be823 it's just
an alias for the Tenant being in Attaching state. It was only used in
one assertion in a test, but that check doesn't make sense anymore, so
just remove it.

Fixes https://github.com/neondatabase/neon/issues/2930
2022-12-07 13:13:54 +02:00
Kirill Bulatov
d6bfe955c6 Add commands to unload and load the tenant in memory (#2977)
Closes https://github.com/neondatabase/neon/issues/2537

Follow-up of https://github.com/neondatabase/neon/pull/2950
With the new model that prevents attaching without the remote storage,
it has started to be even more odd to add attach-with-files
functionality (in addition to the issues raised previously).

Adds two separate commands:
* `POST {tenant_id}/ignore` that places a mark file to skip such tenant
on every start and removes it from memory
* `POST {tenant_id}/schedule_load` that tries to load a tenant from
local FS similar to what pageserver does now on startup, but without
directory removals
2022-12-06 15:30:02 +00:00
Heikki Linnakangas
9a6c0be823 storage_sync2
The code in this change was extracted from PR #2595, i.e., Heikki’s draft
PR for on-demand download.

High-Level Changes

- storage_sync module rewrite
- Changes to Tenant Loading
- Changes to Timeline States
- Crash-safe & Resumable Tenant Attach

There are several follow-up work items planned.
Refer to the Epic issue on GitHub:
https://github.com/neondatabase/neon/issues/2029

Metadata:

closes https://github.com/neondatabase/neon/pull/2785

unsquashed history of this patch: archive/pr-2785-storage-sync2/pre-squash

Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>

===============================================================================

storage_sync module rewrite
===========================

The storage_sync code is rewritten. New module name is storage_sync2, mostly to
make a more reasonable git diff.

The updated block comment in storage_sync2.rs describes the changes quite well,
so, we will not reproduce that comment here. TL;DR:
- Global sync queue and RemoteIndex are replaced with per-timeline
  `RemoteTimelineClient` structure that contains a queue for UploadOperations
  to ensure proper ordering and necessary metadata.
- Before deleting local layer files, wait for ongoing UploadOps to finish
  (wait_completion()).
- Download operations are not queued and executed immediately.

Changes to Tenant Loading
=========================

Initial sync part was rewritten as well and represents the other major change
that serves as a foundation for on-demand downloads. Routines for attaching and
loading shifted directly to Tenant struct and now are asynchronous and spawned
into the background.

Since this patch doesn’t introduce on-demand download of layers we fully
synchronize with the remote during pageserver startup. See details in
`Timeline::reconcile_with_remote` and `Timeline::download_missing`.

Changes to Tenant States
========================

The “Active” state has lost its “background_jobs_running: bool” member. That
variable indicated whether the GC & Compaction background loops are spawned or
not. With this patch, they are now always spawned. Unit tests (#[test]) use the
TenantConf::{gc_period,compaction_period} to disable their effect (15db566).

This patch introduces a new tenant state, “Attaching”. A tenant that is being
attached starts in this state and transitions to “Active” once it finishes
download.

The `GET /tenant` endpoints returns `TenantInfo::has_in_progress_downloads`. We
derive the value for that field from the tenant state now, to remain
backwards-compatible with cloud.git. We will remove that field when we switch
to on-demand downloads.

Changes to Timeline States
==========================

The TimelineInfo::awaits_download field is now equivalent to the tenant being
in Attaching state.  Previously, download progress was tracked per timeline.
With this change, it’s only tracked per tenant. When on-demand downloads
arrive, the field will be completely obsolete.  Deprecation is tracked in
isuse #2930.

Crash-safe & Resumable Tenant Attach
====================================

Previously, the attach operation was not persistent. I.e., when tenant attach
was interrupted by a crash, the pageserver would not continue attaching after
pageserver restart. In fact, the half-finished tenant directory on disk would
simply be skipped by tenant_mgr because it lacked the metadata file (it’s
written last). This patch introduces an “attaching” marker file inside that is
present inside the tenant directory while the tenant is attaching. During
pageserver startup, tenant_mgr will resume attach if that file is present. If
not, it assumes that the local tenant state is consistent and tries to load the
tenant. If that fails, the tenant transitions into Broken state.
2022-11-29 18:55:20 +01:00
Alexey Kondratov
4bf3087aed [pageserver] list latest_gc_cutoff_lsn in the OpenAPI spec (#2894)
It seems that it's present in the API response for quite a while. It's
just not listed in the spec, fix it.
2022-11-22 21:10:49 +01:00
Joonas Koivunen
cf68963b18 Add initial tenant sizing model and a http route to query it (#2714)
Tenant size information is gathered by using existing parts of
`Tenant::gc_iteration` which are now separated as
`Tenant::refresh_gc_info`. `Tenant::refresh_gc_info` collects branch
points, and invokes `Timeline::update_gc_info`; nothing was supposed to
be changed there. The gathered branch points (through Timeline's
`GcInfo::retain_lsns`), `GcInfo::horizon_cutoff`, and
`GcInfo::pitr_cutoff` are used to build up a Vec of updates fed into the
`libs/tenant_size_model` to calculate the history size.

The gathered information is now exposed using `GET
/v1/tenant/{tenant_id}/size`, which which will respond with the actual
calculated size. Initially the idea was to have this delivered as tenant
background task and exported via metric, but it might be too
computationally expensive to run it periodically as we don't yet know if
the returned values are any good.

Adds one new metric:
- pageserver_storage_operations_seconds with label `logical_size`
    - separating from original `init_logical_size`

Adds a pageserver wide configuration variable:
- `concurrent_tenant_size_logical_size_queries` with default 1

This leaves a lot of TODO's, tracked on issue #2748.
2022-11-03 12:39:19 +00:00
Kirill Bulatov
5928cb33c5 Introduce timeline state (#2651)
Similar to https://github.com/neondatabase/neon/pull/2395, introduces a state field in Timeline, that's possible to subscribe to.

Adjusts

* walreceiver to not to have any connections if timeline is not Active
* remote storage sync to not to schedule uploads if timeline is Broken
* not to create timelines if a tenant/timeline is broken
* automatically switches timelines' states based on tenant state

Does not adjust timeline's gc, checkpointing and layer flush behaviour much, since it's not safe to cancel these processes abruptly and there's task_mgr::shutdown_tasks that does similar thing.
2022-10-21 15:51:48 +00:00
Heikki Linnakangas
9c24de254f Add description and license fields to OpenAPI spec.
These were added earlier to the control plane's copy of this file.
This is the master version of this file, so let's keep it in sync.
2022-10-14 18:37:58 +03:00
Heikki Linnakangas
538876650a Merge 'local' and 'remote' parts of TimelineInfo into one struct.
The 'local' part was always filled in, so that was easy to merge into
into the TimelineInfo itself. 'remote' only contained two fields,
'remote_consistent_lsn' and 'awaits_download'. I made
'remote_consistent_lsn' an optional field, and 'awaits_download' is now
false if the timeline is not present remotely.

However, I kept stub versions of the 'local' and 'remote' structs for
backwards-compatibility, with a few fields that are actively used by
the control plane. They just duplicate the fields from TimelineInfo
now. They can be removed later, once the control plane has been
updated to use the new fields.
2022-10-14 18:37:14 +03:00
Andrés
47bae68a2e Make get_lsn_by_timestamp available in mgmt API (#2536) (#2560)
Co-authored-by: andres <andres.rodriguez@outlook.es>
2022-10-06 12:42:50 +03:00
Anastasia Lubennikova
862902f9e5 Update readme and openapi spec 2022-09-22 14:15:13 +03:00
Dmitry Rodionov
1d53173e62 update openapi spec (tenant state has changed) 2022-09-13 21:53:59 +03:00
Kirill Bulatov
1a8c8b04d7 Merge Repository and Tenant entities, rework tenant background jobs 2022-09-13 15:39:39 +03:00
Heikki Linnakangas
65b592d4bd Remove deprecated management API for timeline detach.
It is no longer used anywhere.
2022-09-06 18:54:04 +03:00
Arseny Sher
e593cbaaba Add pageserver checkpoint_timeout option.
To flush inmemory layer eventually when no new data arrives, which helps
safekeepers to suspend activity (stop pushing to the broker). Default 10m should
be ok.
2022-08-11 22:54:09 +03:00
Heikki Linnakangas
d0494c391a Remove wal_receiver mgmt API endpoint
Move all the fields that were returned by the wal_receiver endpoint into
timeline_detail. Internally, move those fields from the separate global
WAL_RECEIVERS hash into the LayeredTimeline struct. That way, all the
information about a timeline is kept in one place.

In the passing, I noted that the 'thread_id' field was removed from
WalReceiverEntry in commit e5cb727572, but it forgot to update
openapi_spec.yml. This commit removes that too.
2022-07-29 20:51:37 +03:00
Heikki Linnakangas
d903dd61bd Rename 'wal_producer_connstr' to 'wal_source_connstr'.
What the WAL receiver really connects to is the safekeeper. The
"producer" term is a bit misleading, as the safekeeper doesn't produce
the WAL, the compute node does.

This change also applies to the name of the field used in the mgmt API
in in the response of the
'/v1/tenant/:tenant_id/timeline/:timeline_id/wal_receiver' endpoint.
AFAICS that's not used anywhere else than one python test, so it
should be OK to change it.
2022-07-29 09:09:22 +03:00
Thang Pham
417d9e9db2 Add current physical size to tenant status endpoint (#2173)
Ref #1902
2022-07-28 13:59:20 -04:00
Thang Pham
6a664629fa Add timeline physical size tracking (#2126)
Ref #1902.

- Track the layered timeline's `physical_size` using `pageserver_current_physical_size` metric when updating the layer map.
- Report the local timeline's `physical_size` in timeline GET APIs.
- Add `include-non-incremental-physical-size` URL flag to also report the local timeline's `physical_size_non_incremental` (similar to `logical_size_non_incremental`)
- Add a `UIntGaugeVec` and `UIntGauge` to represent `u64` prometheus metrics

Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
2022-07-27 12:36:46 -04:00
Dmitry Rodionov
520ffb341b fix pageserver openapi spec 2022-07-07 21:20:04 +03:00
Dmitry Rodionov
9f2b40645d review cleanup, point timeline/detach to timeline/delete 2022-07-07 21:20:04 +03:00
Dmitry Rodionov
168214e0b6 use tenant status endpoint to check whether timelines were downloaded or not 2022-07-07 21:20:04 +03:00
Dmitry Rodionov
e1e24336b7 review adjustments, bring back timeline_detach and rename it to timeline_delete 2022-07-07 21:20:04 +03:00
Dmitry Rodionov
4c54e4b37d switch to per-tenant attach/detach
download operations of all timelines for one tenant are now grouped
together so when attach is invoked pageserver downloads all of them
and registers them in a single apply_sync_status_update call so
branches can be used safely with attach/detach
2022-07-07 21:20:04 +03:00
Thang Pham
e4a70faa08 Add more information to timeline-related APIs (#1673)
Resolves #1488.

- implemented `GET tenant/:tenant_id/timeline/:timeline_id/wal_receiver` endpoint
- returned `thread_id` in `thread_mgr::spawn` 
- added `latest_gc_cutoff_lsn` field to `LocalTimelineInfo` struct
2022-05-16 11:05:43 -04:00
Konstantin Knizhnik
5f83c9290b Make it possible to specify per-tenant configuration parameters
Add tenant config API and 'zenith tenant config' CLI command.
Add 'show' query to pageserver protocol for tenantspecific config parameters

Refactoring: move tenant_config code to a separate module.
Save tenant conf file to tenant's directory, when tenant is created to recover it on pageserver restart.
Ignore error during tenant config loading, while it is not supported by console

Define PiTR interval for GC.

refer #1320
2022-04-22 11:24:29 +03:00
Kirill Bulatov
3e6087a12f Remove S3 archiving 2022-04-19 23:13:52 +03:00
Alexey Kondratov
d0c246ac3c Update pageserver OpenAPI spec with missing attach/detach methods (#1463)
We have these methods for some time in the API, so mentioning them in the
spec could be useful for console (see zenithdb/console#867), as we generate
pageserver HTTP API golang client there.
2022-04-05 20:01:57 +03:00
Kirill Bulatov
bd6bef468c Provide single list timelines HTTP API handle 2022-03-21 13:42:21 +02:00
Kirill Bulatov
093ad8ab59 Send 409 HTTP responses on timeline and tenant creation for existing entity 2022-03-10 19:38:58 +02:00
Kirill Bulatov
fe6fccfdae Allow already existing repo when creating a tenant 2022-03-10 19:38:58 +02:00
Kirill Bulatov
dd74c66ef0 Do not create timeline along with tenant 2022-03-10 19:38:58 +02:00
Kirill Bulatov
a5e10c4f64 Tidy up pageserver's endpoints 2022-03-10 19:38:58 +02:00