Commit Graph

48 Commits

Author SHA1 Message Date
Alex Chi Z.
6b93230270 fix(pageserver): receive body error now 500 (#9052)
close https://github.com/neondatabase/neon/issues/8903

In https://github.com/neondatabase/neon/issues/8903 we observed JSON
decoding error to have the following error message in the log:

```
Error processing HTTP request: Resource temporarily unavailable: 3956 (pageserver-6.ap-southeast-1.aws.neon.tech) error receiving body: error decoding response body
```

This is hard to understand. In this patch, we make the error message
more reasonable.

## Summary of changes

* receive body error is now an internal server error, passthrough the
`reqwest::Error` (only decoding error) as `anyhow::Error`.
* instead of formatting the error using `to_string`, we use the
alternative `anyhow::Error` formatting, so that it prints out the cause
of the error (i.e., what exactly cannot serde decode).

I would expect seeing something like `error receiving body: error
decoding response body: XXX field not found` after this patch, though I
didn't set up a testing environment to observe the exact behavior.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-09-20 10:37:28 -04:00
Conrad Ludgate
411a130675 Fix nightly warnings 2024 june (#8151)
## Problem

new clippy warnings on nightly.

## Summary of changes

broken up each commit by warning type.
1. Remove some unnecessary refs.
2. In edition 2024, inference will default to `!` and not `()`.
3. Clippy complains about doc comment indentation
4. Fix `Trait + ?Sized` where `Trait: Sized`.
5. diesel_derives triggering `non_local_defintions`
2024-07-12 13:58:04 +01:00
Christian Schwarz
1a49f1c15c pageserver: move page_service's import basebackup / import wal to mgmt API (#8292)
I want to fix bugs in `page_service`
([issue](https://github.com/neondatabase/neon/issues/7427)) and the
`import basebackup` / `import wal` stand in the way / make the
refactoring more complicated.

We don't use these methods anyway in practice, but, there have been some
objections to removing the functionality completely.

So, this PR preserves the existing functionality but moves it into the
HTTP management API.

Note that I don't try to fix existing bugs in the code, specifically not
fixing
* it only ever worked correctly for unsharded tenants
* it doesn't clean up on error

All errors are mapped to `ApiError::InternalServerError`.
2024-07-09 23:17:42 +02:00
John Spray
07f21dd6b6 pageserver: remove attach/detach apis (#8134)
## Problem

These APIs have been deprecated for some time, but were still used from
test code.

Closes: https://github.com/neondatabase/neon/issues/4282

## Summary of changes

- It is still convenient to do a "tenant_attach" from a test without
having to write out a location_conf body, so those test methods have
been retained with implementations that call through to their
location_conf equivalent.
2024-06-25 17:38:06 +01:00
Yuchen Liang
630cfbe420 refactor(pageserver): designated api error type for cancelled request (#7949)
Closes #7406.

## Problem

When a `get_lsn_by_timestamp` request is cancelled, an anyhow error is
exposed to handle that case, which verbosely logs the error. However, we
don't benefit from having the full backtrace provided by anyhow in this
case.

## Summary of changes

This PR introduces a new `ApiError` type to handle errors caused by
cancelled request more robustly.
-  A new enum variant `ApiError::Cancelled`
- Currently the cancelled request is mapped to status code 500.
- Need to handle this error in proxy's `http_util` as well.
- Added a failpoint test to simulate cancelled `get_lsn_by_timestamp`
request.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-06-06 14:00:14 +00:00
Vlad Lazar
c75b584430 storage_controller: add metrics (#7178)
## Problem
Storage controller had basically no metrics.

## Summary of changes
1. Migrate the existing metrics to use Conrad's
[`measured`](https://docs.rs/measured/0.0.14/measured/) crate.
2. Add metrics for incoming http requests
3. Add metrics for outgoing http requests to the pageserver
4. Add metrics for outgoing pass through requests to the pageserver
5. Add metrics for database queries

Note that the metrics response for the attachment service does not use
chunked encoding like the rest of the metrics endpoints. Conrad has
kindly extended the crate such that it can now be done. Let's leave it
for a follow-up since the payload shouldn't be that big at this point.

Fixes https://github.com/neondatabase/neon/issues/6875
2024-03-21 12:00:20 +00:00
Arpad Müller
82853cc1d1 Fix warnings and compile errors on nightly (#6886)
Nightly has added a bunch of compiler and linter warnings. There is also
two dependencies that fail compilation on latest nightly due to using
the old `stdsimd` feature name. This PR fixes them.
2024-03-01 17:14:19 +01:00
Joonas Koivunen
5ab10d051d metrics: record more details of the responding (#6979)
On eu-west-1 during benchmarks we sometimes lose samples. Add more time
measurements.
2024-03-01 14:04:39 +00:00
John Spray
c9b1657e4c pageserver: fixes for creation operations overlapping with shutdown/startup (#6436)
## Problem

For #6423, creating a reproducer turned out to be very easy, as an
extension to test_ondemand_activation.

However, before I had diagnosed the issue, I was starting with a more
brute force approach of running creation API calls in the background
while restarting a pageserver, and that shows up a bunch of other
interesting issues.

In this PR:
- Add the reproducer for #6423 by extending `test_ondemand_activation`
(confirmed that this test fails if I revert the fix from
https://github.com/neondatabase/neon/pull/6430)
- In timeline creation, return 503 responses when we get an error and
the tenant's cancellation token is set: this covers the cases where we
get an anyhow::Error from something during timeline creation as a result
of shutdown.
- While waiting for tenants to become active during creation, don't
.map_err() the result to a 500: instead let the `From` impl map the
result to something appropriate (this includes mapping shutdown to 503)
- During tenant creation, we were calling `Tenant::load_local` because
no Preload object is provided. This is usually harmless because the
tenant dir is empty, but if there are some half-created timelines in
there, bad things can happen. Propagate the SpawnMode into
Tenant::attach, so that it can properly skip _any_ attempt to load
timelines if creating.
- When we call upsert_location, there's a SpawnMode that tells us
whether to load from remote storage or not. But if the operation is a
retry and we already have the tenant, it is not correct to skip loading
from remote storage: there might be a timeline there. This isn't
strictly a correctness issue as long as the caller behaves correctly
(does not assume that any timelines are persistent until the creation is
acked), but it's a more defensive position.
- If we shut down while the task in Tenant::attach is running, it can
end up spawning rogue tasks. Fix this by holding a GateGuard through
here, and in upsert_location shutting down a tenant after calling
tenant_spawn if we can't insert it into tenants_map. This fixes the
expected behavior that after shutdown_all_tenants returns, no tenant
tasks are running.
- Add `test_create_churn_during_restart`, which runs tenant & timeline
creations across pageserver restarts.
- Update a couple of tests that covered cancellation, to reflect the
cleaner errors we now return.
2024-01-25 12:35:52 +00:00
John Spray
18e9208158 pageserver: improved error handling for shard routing error, timeline not found (#6262)
## Problem

- When a client requests a key that isn't found in any shard on the node
(edge case that only happens if a compute's config is out of date), we
should prompt them to reconnect (as this includes a backoff), since they
will not be able to complete the request until they eventually get a
correct pageserver connection string.
- QueryError::Other is used excessively: this contains a type-ambiguous
anyhow::Error and is logged very verbosely (including backtrace).

## Summary of changes

- Introduce PageStreamError to replace use of anyhow::Error in request
handlers for getpage, etc.
- Introduce Reconnect and NotFound variants to QueryError
- Map the "shard routing error" case to PageStreamError::Reconnect ->
QueryError::Reconnect
- Update type conversions for LSN timeouts and tenant/timeline not found
errors to use PageStreamError::NotFound->QueryError::NotFound
2024-01-04 10:40:03 +00:00
Joonas Koivunen
07508fb110 fix: better Json parsing errors (#6135)
Before any json parsing from the http api only returned errors were per
field errors. Now they are done using `serde_path_to_error`, which at
least helped greatly with the `disk_usage_eviction_run` used for
testing. I don't think this can conflict with anything added in #5310.
2023-12-15 12:18:22 +02:00
Arpad Müller
ea118a238a JWT logging improvements (#5823)
* lower level on auth success from info to debug (fixes #5820)
* don't log stacktraces on auth errors (as requested on slack). we do this by introducing an `AuthError` type instead of using `anyhow` and `bail`.
* return errors that have been censored for improved security.
2023-11-08 16:56:53 +00:00
Arpad Müller
e310533ed3 Support JWT key reload in pageserver (#5594)
## Problem

For quickly rotating JWT secrets, we want to be able to reload the JWT
public key file in the pageserver, and also support multiple JWT keys.

See #4897.

## Summary of changes

* Allow directories for the `auth_validation_public_key_path` config
param instead of just files. for the safekeepers, all of their config options
also support multiple JWT keys.
* For the pageservers, make the JWT public keys easily globally swappable
by using the `arc-swap` crate.
* Add an endpoint to the pageserver, triggered by a POST to
`/v1/reload_auth_validation_keys`, that reloads the JWT public keys from
the pre-configured path (for security reasons, you cannot upload any
keys yourself).

Fixes #4897

---------

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-11-07 15:43:29 +01:00
duguorong009
09b5954526 refactor: use streaming in safekeeper /v1/debug_dump http response (#5731)
- Update the handler for `/v1/debug_dump` http response in safekeeper
- Update the `debug_dump::build()` to use the streaming in JSON build
process
2023-11-05 10:16:54 +00:00
Shany Pozin
010b4d0d5c Move ApiError 404 to info level (#5501)
## Problem
Moving ApiError 404 to info level logging (see
https://github.com/neondatabase/neon/pull/5489#issuecomment-1750211212)
2023-10-09 13:54:46 +03:00
Joonas Koivunen
a15f9b3baa pageserver: Tune 503 Resource unavailable (#5489)
503 Resource Unavailable appears as error in logs, but is not really an
error which should ever fail a test on, or even log an error in prod,
[evidence].

Changes:
- log 503 as `info!` level
- use `Cow<'static, str>` instead of `String`
- add an additional `wait_until_tenant_active` in
`test_actually_duplicate_l1`
 
We ought to have in tests "wait for tenants to complete loading" but
this is easier to implement for now.

[evidence]:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5485/6423110295/index.html#/testresult/182de66203864fc0
2023-10-06 09:59:14 +01:00
John Spray
ace0c775fc pageserver: prefer 503 to 500 for transient unavailability (#5439)
## Problem

The 500 status code should only be used for bugs or unrecoverable
failures: situations we did not expect. Currently, the pageserver is
misusing this response code for some situations that are totally normal,
like requests targeting tenants that are in the process of activating.

The 503 response is a convenient catch-all for "I can't right now, but I
will be able to".

## Summary of changes

- Change some transient availability error conditions to return 503
instead of 500
- Update the HTTP client configuration in integration tests to retry on
503

After these changes, things like creating a tenant and then trying to
create a timeline within it will no longer require carefully checking
its status first, or retrying on 500s. Instead, a client which is
properly configured to retry on 503 can quietly handle such situations.
2023-10-03 17:00:55 +01:00
John Spray
ba92668e37 pageserver: deletion queue & generation validation for deletions (#5207)
## Problem

Pageservers must not delete objects or advertise updates to
remote_consistent_lsn without checking that they hold the latest
generation for the tenant in question (see [the RFC](
https://github.com/neondatabase/neon/blob/main/docs/rfcs/025-generation-numbers.md))

In this PR:
- A new "deletion queue" subsystem is introduced, through which
deletions flow
- `RemoteTimelineClient` is modified to send deletions through the
deletion queue:
- For GC & compaction, deletions flow through the full generation
verifying process
- For timeline deletions, deletions take a fast path that bypasses
generation verification
- The `last_uploaded_consistent_lsn` value in `UploadQueue` is replaced
with a mechanism that maintains a "projected" lsn (equivalent to the
previous property), and a "visible" LSN (which is the one that we may
share with safekeepers).
- Until `control_plane_api` is set, all deletions skip generation
validation
- Tests are introduced for the new functionality in
`test_pageserver_generations.py`

Once this lands, if a pageserver is configured with the
`control_plane_api` configuration added in
https://github.com/neondatabase/neon/pull/5163, it becomes safe to
attach a tenant to multiple pageservers concurrently.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-09-26 16:11:55 +01:00
Joonas Koivunen
25d2f4b669 metrics: chunked responses (#4768)
Metrics can get really large in the order of hundreds of megabytes,
which we used to buffer completly (after a few rounds of growing the
buffer).
2023-07-21 15:10:55 +00:00
arpad-m
982fce1e72 Fix rustdoc warnings and test cargo doc in CI (#4711)
## Problem

`cargo +nightly doc` is giving a lot of warnings: broken links, naked
URLs, etc.

## Summary of changes

* update the `proc-macro2` dependency so that it can compile on latest
Rust nightly, see https://github.com/dtolnay/proc-macro2/pull/391 and
https://github.com/dtolnay/proc-macro2/issues/398
* allow the `private_intra_doc_links` lint, as linking to something
that's private is always more useful than just mentioning it without a
link: if the link breaks in the future, at least there is a warning due
to that. Also, one might enable
[`--document-private-items`](https://doc.rust-lang.org/cargo/commands/cargo-doc.html#documentation-options)
in the future and make these links work in general.
* fix all the remaining warnings given by `cargo +nightly doc`
* make it possible to run `cargo doc` on stable Rust by updating
`opentelemetry` and associated crates to version 0.19, pulling in a fix
that previously broke `cargo doc` on stable:
https://github.com/open-telemetry/opentelemetry-rust/pull/904
* Add `cargo doc` to CI to ensure that it won't get broken in the
future.

Fixes #2557

## Future work
* Potentially, it might make sense, for development purposes, to publish
the generated rustdocs somewhere, like for example [how the rust
compiler does
it](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_driver/index.html).
I will file an issue for discussion.
2023-07-15 05:11:25 +03:00
Dmitry Rodionov
75d583c04a Tenant::load: fix uninit timeline marker processing (#4458)
## Problem

During timeline creation we create special mark file which presense
indicates that initialization didnt complete successfully. In case of a
crash restart we can remove such half-initialized timeline and following
retry from control plane side should perform another attempt.

So in case of a possible crash restart during initial loading we have
following picture:

```
timelines
| - <timeline_id>___uninit
| - <timeline_id>
| - | <timeline files>
```

We call `std::fs::read_dir` to walk files in `timelines` directory one
by one. If we see uninit file
we proceed with deletion of both, timeline directory and uninit file. If
we see timeline we check if uninit file exists and do the same cleanup.
But in fact its possible to get both branches to be true at the same
time. Result of readdir doesnt reflect following directory state
modifications. So you can still get "valid" entry on the next iteration
of the loop despite the fact that it was deleted in one of the previous
iterations of the loop.

To see that you can apply the following patch (it disables uninit mark
cleanup on successful timeline creation):
```diff
diff --git a/pageserver/src/tenant.rs b/pageserver/src/tenant.rs
index 4beb2664..b3cdad8f 100644
--- a/pageserver/src/tenant.rs
+++ b/pageserver/src/tenant.rs
@@ -224,11 +224,6 @@ impl UninitializedTimeline<'_> {
                             )
                         })?;
                 }
-                uninit_mark.remove_uninit_mark().with_context(|| {
-                    format!(
-                        "Failed to remove uninit mark file for timeline {tenant_id}/{timeline_id}"
-                    )
-                })?;
                 v.insert(Arc::clone(&new_timeline));
 
                 new_timeline.maybe_spawn_flush_loop();
```
And perform the following steps:

```bash
neon_local init
neon_local start
neon_local tenant create
neon_local stop
neon_local start
```

The error is:
```log
INFO load{tenant_id=X}:blocking: Found an uninit mark file .neon/tenants/X/timelines/Y.___uninit, removing the timeline and its uninit mark
2023-06-09T18:43:41.664247Z ERROR load{tenant_id=X}: load failed, setting tenant state to Broken: failed to load metadata

Caused by:
    0: Failed to read metadata bytes from path .neon/tenants/X/timelines/Y/metadata
    1: No such file or directory (os error 2)
```

So uninit mark got deleted together with timeline directory but we still
got directory entry for it and tried to load it.

The bug prevented tenant from being successfully loaded.

## Summary of changes

Ideally I think we shouldnt place uninit marks in the same directory as timeline directories but move them to separate directory and
gather them as an input to actual listing, but that would be sort of an
on-disk format change, so just check whether entries are still valid
before operating on them.
2023-06-21 14:25:58 +03:00
Arseny Sher
227271ccad Switch safekeepers to async.
This is a full switch, fs io operations are also tokio ones, working through
thread pool. Similar to pageserver, we have multiple runtimes for easier `top`
usage and isolation.

Notable points:
- Now that guts of safekeeper.rs are full of .await's, we need to be very
  careful not to drop task at random point, leaving timeline in unclear
  state. Currently the only writer is walreceiver and we don't have top
  level cancellation there, so we are good. But to be safe probably we should
  add a fuse panicking if task is being dropped while operation on a timeline
  is in progress.
- Timeline lock is Tokio one now, as we do disk IO under it.
- Collecting metrics got a crutch: since prometheus Collector is
  synchronous, it spawns a thread with current thread runtime collecting data.
- Anything involving closures becomes significantly more complicated, as
  async fns are already kinda closures + 'async closures are unstable'.
- Main thread now tracks other main tasks, which got much easier.
- The only sync place left is initial data loading, as otherwise clippy
  complains on timeline map lock being held across await points -- which is
  not bad here as it happens only in single threaded runtime of main thread.
  But having it sync doesn't hurt either.

I'm concerned about performance of thread pool io offloading, async traits and
many await points; but we can try and see how it goes.

fixes https://github.com/neondatabase/neon/issues/3036
fixes https://github.com/neondatabase/neon/issues/3966
2023-06-11 22:53:08 +04:00
Dmitry Rodionov
d53f9ab3eb delete timelines from s3 (#4384)
Delete data from s3 when timeline deletion is requested

## Summary of changes

UploadQueue is altered to support scheduling of delete operations in
stopped state. This looks weird, and I'm thinking whether there are
better options/refactorings for upload client to make it look better.

Probably can be part of https://github.com/neondatabase/neon/issues/4378

Deletion is implemented directly in existing endpoint because changes are not
that significant. If we want more safety we can separate those or create
feature flag for new behavior.

resolves [#4193](https://github.com/neondatabase/neon/issues/4193)

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-06-08 15:01:22 +03:00
Heikki Linnakangas
9787227c35 Shield HTTP request handlers from async cancellations. (#4314)
We now spawn a new task for every HTTP request, and wait on the
JoinHandle. If Hyper drops the Future, the spawned task will keep
running. This protects the rest of the pageserver code from unexpected
async cancellations.

This creates a CancellationToken for each request and passes it to the
handler function. If the HTTP request is dropped by the client, the
CancellationToken is signaled. None of the handler functions make use
for the CancellationToken currently, but they now they could.

The CancellationToken arguments also work like documentation. When
you're looking at a function signature and you see that it takes a
CancellationToken as argument, it's a nice hint that the function might
run for a long time, and won't be async cancelled. The default
assumption in the pageserver is now that async functions are not
cancellation-safe anyway, unless explictly marked as such, but this is a
nice extra reminder.

Spawning a task for each request is OK from a performance point of view
because spawning is very cheap in Tokio, and none of our HTTP requests
are very performance critical anyway.

Fixes issue #3478
2023-06-02 08:28:13 -04:00
Heikki Linnakangas
2cdf07f12c Refactor RequestSpan into a function.
Previously, you used it like this:

    |r| RequestSpan(my_handler).handle(r)

But I don't see the point of the RequestSpan struct. It's just a
wrapper around the handler function. With this commit, the call
becomes:

    |r| request_span(r, my_handler)

Which seems a little simpler.

At first I thought that the RequestSpan struct would allow "chaining"
other kinds of decorators like RequestSpan, so that you could do
something like this:

    |r| CheckPermissions(RequestSpan(my_handler)).handle(r)

But it doesn't work like that. If each of those structs wrap a handler
*function*, it would actually look like this:

    |r| CheckPermissions(|r| RequestSpan(my_handler).handle(r))).handle(r)

This commit doesn't make that kind of chaining any easier, but seems a
little more straightforward anyway.
2023-05-27 11:47:22 +03:00
Heikki Linnakangas
200a520e6c Minor refactoring in RequestSpan
Require the error type to be ApiError. It implicitly required that
anyway, because the function used error::handler, which downcasted the
error to an ApiError. If the error was in fact anything else than
ApiError, it would just panic. Better to check it at compilation time.

Also make the last-resort error handler more forgiving, so that it
returns an 500 Internal Server error response, instead of panicking,
if a request handler returns some other error than an ApiError.
2023-05-27 11:47:22 +03:00
Christian Schwarz
df52587bef attach-time tenant config (#4255)
This PR adds support for supplying the tenant config upon /attach.

Before this change, when relocating a tenant using `/detach` and
`/attach`, the tenant config after `/attach` would be the default config
from `pageserver.toml`.
That is undesirable for settings such as the PITR-interval: if the
tenant's config on the source was `30 days` and the default config on
the attach-side is `7 days`, then the first GC run would eradicate 23
days worth of PITR capability.

The API change is backwards-compatible: if the body is empty, we
continue to use the default config.
We'll remove that capability as soon as the cloud.git code is updated to
use attach-time tenant config
(https://github.com/neondatabase/neon/issues/4282 keeps track of this).

unblocks https://github.com/neondatabase/cloud/issues/5092 
fixes https://github.com/neondatabase/neon/issues/1555
part of https://github.com/neondatabase/neon/issues/886 (Tenant
Relocation)

Implementation
==============

The preliminary PRs for this work were (most-recent to least-recent)

* https://github.com/neondatabase/neon/pull/4279
* https://github.com/neondatabase/neon/pull/4267
* https://github.com/neondatabase/neon/pull/4252
* https://github.com/neondatabase/neon/pull/4235
2023-05-24 17:46:30 +03:00
Joonas Koivunen
474f69c1c0 fix: omit cancellation logging when panicking (#4125)
noticed while describing `RequestSpan`, this fix will omit the otherwise
logged message about request being cancelled when panicking in the
request handler. this was missed on #4064.
2023-05-03 10:56:49 +03:00
Arthur Petukhovsky
b03143dfc8 Use serde_as DisplayFromStr everywhere (#4103)
We used `display_serialize` previously, but it works only for Serialize.
`DisplayFromStr` does the same, but also works for Deserialize.
2023-04-28 13:55:07 +03:00
Joonas Koivunen
4911d7ce6f feat: warn when requests get cancelled (#4064)
Add a simple disarmable dropguard to log if request is cancelled before
it is completed. We currently don't have this, and it makes for
difficult to know when the request was dropped.
2023-04-25 15:22:23 +03:00
Dmitry Rodionov
6efea43449 Use precondition failed code in delete_timeline when tenant is missing (#3884)
This allows client to differentiate between missing tenant and missing
timeline cases
2023-03-27 21:01:46 +03:00
Kirill Bulatov
03a2ce9d13 Add tracing spans with request_id into pageserver management API handlers (#3755)
Adds a newtype that creates a span with request_id from
https://github.com/neondatabase/neon/pull/3708 for every HTTP request
served.

Moves request logging and error handlers under the new wrapper, so every request-related event now is logged under the request span.
For compatibility reasons, error handler is left on the general router, since not every service uses the new handler wrappers yet.
2023-03-09 09:24:01 +02:00
Arthur Petukhovsky
b23742e09c Create /v1/debug_dump safekeepers endpoint (#3710)
Add HTTP endpoint to get full safekeeper state of all existing timelines
(all in-memory values and info about all files stored on disk).

Example:
https://gist.github.com/petuhovskiy/3cbb8f870401e9f486731d145161c286
2023-03-03 14:01:05 +03:00
Shany Pozin
d19c5248c9 Add UUID header to mgmt API (#3708)
## Describe your changes

## Issue ticket number and link
#3479
## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
2023-03-01 18:09:08 +02:00
Dmitry Rodionov
eb403da814 Use debug level for successful GET http requests (#3681)
We started rather frequently scrap some apis for metadata. This includes
layer eviction tester, I believe console does that too.

It should eliminate these logs:
https://neonprod.grafana.net/goto/rr_ace1Vz?orgId=1 (Note the rate
around 2k messages per minute)
2023-02-22 22:19:05 +03:00
Christian Schwarz
87cd2bae77 introduce LaunchTimestamp to identify process restarts
This patch adds a LaunchTimestamp type to the `metrics` crate,
along with a `libmetric_` Prometheus metric.

The initial user is pageserver.
In addition to exposing the Prometheus metric, it also reproduces
the launch timestamp as a header in the API responses.

The motivation for this is that we plan to scrape the pageserver's
/v1/tenant/:tenant_id/timeline/:timeline_id/layer
HTTP endpoint over time. It will soon expose access metrics (#3496)
which reset upon process restart. We will use the pageserver's launch
ID to identify a restart between two scrape points.

However, there are other potential uses. For example, we could use
the Prometheus metric to annotate Grafana plots whenever the launch
timestamp changes.
2023-02-03 18:12:17 +01:00
Christian Schwarz
590695e845 improve query param parsing
- add parse_query_param()
- use Cow<> where possible
- move param parsing code to utils::http::request

This was originally PR https://github.com/neondatabase/neon/pull/3502
which targeted a different branch.

closes  #3510
2023-02-01 14:11:12 +01:00
Kirill Bulatov
9fbef1159f Tone down http error printing (#3434)
Only print backtraces for internal server error variants of the API
error.
2023-01-25 10:36:30 +00:00
Egor Suvorov
2ce5d8137d Separate permission checks for Pageserver and Safekeeper
There will be different scopes for those two, so authorization code should be different.

The `check_permission` function is now not in the shared library. Its implementation
is very similar to the one which will be added for Safekeeper. In fact, we may reuse
the same existing root-like 'PageServerApi' scope, but I would prefer to have separate
root-like scopes for services.

Also, generate_management_token in tests is generate_pageserver_token now.
2022-11-25 04:17:42 +03:00
Arthur Petukhovsky
f25dd75be9 Fix deadlock in safekeeper metrics (#2566)
We had a problem where almost all of the threads were waiting on a futex syscall. More specifically:
- `/metrics` handler was inside `TimelineCollector::collect()`, waiting on a mutex for a single Timeline
- This exact timeline was inside `control_file::FileStorage::persist()`, waiting on a mutex for Lazy initialization of `PERSIST_CONTROL_FILE_SECONDS`
- `PERSIST_CONTROL_FILE_SECONDS: Lazy<Histogram>` was blocked on `prometheus::register`
- `prometheus::register` calls `DEFAULT_REGISTRY.write().register()` to take a write lock on Registry and add a new metric
- `DEFAULT_REGISTRY` lock was already taken inside `DEFAULT_REGISTRY.gather()`, which was called by `/metrics` handler to collect all metrics

This commit creates another Registry with a separate lock, to avoid deadlock in a case where `TimelineCollector` triggers registration of new metrics inside default registry.
2022-10-06 01:07:02 +03:00
sharnoff
6f949e1556 Improve pageserver/safekeepeer HTTP API errors (#2461)
Part of the general work on improving pageserver logs.

Brief summary of changes:

* Remove `ApiError::from_err`
* Remove `impl From<anyhow::Error> for ApiError`
* Convert `ApiError::{BadRequest, NotFound}` to use `anyhow::Error`
  * Note: `NotFound` has more verbose formatting because it's more
    likely to have useful information for the receiving "user"
* Explicitly convert from `tokio::task::JoinError`s into
  `InternalServerError`s where appropriate

Also note: many of the places where errors were implicitly converted to
500s have now been updated to return a more appropriate error. Some
places where it's not yet possible to distinguish the error types have
been left as 500s.
2022-09-20 17:02:10 -07:00
Kirill Bulatov
b8eb908a3d Rename old project name references 2022-09-14 08:14:05 +03:00
Kirill Bulatov
648e8bbefe Fix 1.63 clippy lints (#2282) 2022-08-16 18:49:22 +03:00
Ankur Srivastava
84d1bc06a9 refactor: replace lazy-static with once-cell (#2195)
- Replacing all the occurrences of lazy-static with `once-cell::sync::Lazy`
- fixes #1147

Signed-off-by: Ankur Srivastava <best.ankur@gmail.com>
2022-08-05 19:34:04 +02:00
Arthur Petukhovsky
98da0aa159 Add _total suffix to metrics name (#1741) 2022-05-18 15:17:04 +03:00
Arthur Petukhovsky
134eeeb096 Add more common storage metrics (#1722)
- Enabled process exporter for storage services
- Changed zenith_proxy prefix to just proxy
- Removed old `monitoring` directory
- Removed common prefix for metrics, now our common metrics have `libmetrics_` prefix, for example `libmetrics_serve_metrics_count`
- Added `test_metrics_normal_work`
2022-05-17 19:29:01 +03:00
Egor Suvorov
22d997049c libs/utils/http/request: add ensure_no_body 2022-05-13 15:43:52 +02:00
Kirill Bulatov
81cad6277a Move and library crates into a dedicated directory and rename them 2022-04-21 13:30:33 +03:00