mirror of
https://github.com/neondatabase/neon.git
synced 2025-12-24 22:59:59 +00:00
We already had strong support for this many months ago on Slack: https://neondb.slack.com/archives/C0277TKAJCA/p1673453329770429
248 lines
9.2 KiB
Markdown
248 lines
9.2 KiB
Markdown
# Error handling and logging
|
|
|
|
## Logging errors
|
|
|
|
The principle is that errors are logged when they are handled. If you
|
|
just propagate an error to the caller in a function, you don't need to
|
|
log it; the caller will. But if you consume an error in a function,
|
|
you *must* log it (if it needs to be logged at all).
|
|
|
|
For example:
|
|
|
|
```rust
|
|
fn read_motd_file() -> std::io::Result<String> {
|
|
let mut f = File::open("/etc/motd")?;
|
|
let mut result = String::new();
|
|
f.read_to_string(&mut result)?;
|
|
result
|
|
}
|
|
```
|
|
|
|
Opening or reading the file could fail, but there is no need to log
|
|
the error here. The function merely propagates the error to the
|
|
caller, and it is up to the caller to log the error or propagate it
|
|
further, if the failure is not expected. But if, for example, it is
|
|
normal that the "/etc/motd" file doesn't exist, the caller can choose
|
|
to silently ignore the error, or log it as an INFO or DEBUG level
|
|
message:
|
|
|
|
```rust
|
|
fn get_message_of_the_day() -> String {
|
|
// Get the motd from /etc/motd, or return the default proverb
|
|
match read_motd_file() {
|
|
Ok(motd) => motd,
|
|
Err(err) => {
|
|
// It's normal that /etc/motd doesn't exist, but if we fail to
|
|
// read it for some other reason, that's unexpected. The message
|
|
// of the day isn't very important though, so we just WARN and
|
|
// continue with the default in any case.
|
|
if err.kind() != std::io::ErrorKind::NotFound {
|
|
tracing::warn!("could not read \"/etc/motd\": {err:?}");
|
|
}
|
|
"An old error is always more popular than a new truth. - German proverb"
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
## Error types
|
|
|
|
We use the `anyhow` crate widely. It contains many convenient macros
|
|
like `bail!` and `ensure!` to construct and return errors, and to
|
|
propagate many kinds of low-level errors, wrapped in `anyhow::Error`.
|
|
|
|
A downside of `anyhow::Error` is that the caller cannot distinguish
|
|
between different error cases. Most errors are propagated all the way
|
|
to the mgmt API handler function, or the main loop that handles a
|
|
connection with the compute node, and they are all handled the same
|
|
way: the error is logged and returned to the client as an HTTP or
|
|
libpq error.
|
|
|
|
But in some cases, we need to distinguish between errors and handle
|
|
them differently. For example, attaching a tenant to the pageserver
|
|
could fail either because the tenant has already been attached, or
|
|
because we could not load its metadata from cloud storage. The first
|
|
case is more or less expected. The console sends the Attach request to
|
|
the pageserver, and the pageserver completes the operation, but the
|
|
network connection might be lost before the console receives the
|
|
response. The console will retry the operation in that case, but the
|
|
tenant has already been attached. It is important that the pagserver
|
|
responds with the HTTP 403 Already Exists error in that case, rather
|
|
than a generic HTTP 500 Internal Server Error.
|
|
|
|
If you need to distinguish between different kinds of errors, create a
|
|
new `Error` type. The `thiserror` crate is useful for that. But in
|
|
most cases `anyhow::Error` is good enough.
|
|
|
|
## Panics
|
|
|
|
Depending on where a panic happens, it can cause the whole pageserver
|
|
or safekeeper to restart, or just a single tenant. In either case,
|
|
that is pretty bad and causes an outage. Avoid panics. Never use
|
|
`unwrap()` or other calls that might panic, to verify inputs from the
|
|
network or from disk.
|
|
|
|
It is acceptable to use functions that might panic, like `unwrap()`, if
|
|
it is obvious that it cannot panic. For example, if you have just
|
|
checked that a variable is not None, it is OK to call `unwrap()` on it,
|
|
but it is still preferable to use `expect("reason")` instead to explain
|
|
why the function cannot fail.
|
|
|
|
`assert!` and `panic!` are reserved for checking clear invariants and
|
|
very obvious "can't happen" cases. When in doubt, use anyhow `ensure!`
|
|
or `bail!` instead.
|
|
|
|
## Error levels
|
|
|
|
`tracing::Level` doesn't provide very clear guidelines on what the
|
|
different levels mean, or when to use which level. Here is how we use
|
|
them:
|
|
|
|
### Error
|
|
|
|
Examples:
|
|
- could not open file "foobar"
|
|
- invalid tenant id
|
|
|
|
Errors are not expected to happen during normal operation. Incorrect
|
|
inputs from client can cause ERRORs. For example, if a client tries to
|
|
call a mgmt API that doesn't exist, or if a compute node sends passes
|
|
an LSN that has already been garbage collected away.
|
|
|
|
These should *not* happen during normal operations. "Normal
|
|
operations" is not a very precise concept. But for example, disk
|
|
errors are not expected to happen when the system is working, so those
|
|
count as Errors. However, if a TCP connection to a compute node is
|
|
lost, that is not considered an Error, because it doesn't affect the
|
|
pageserver's or safekeeper's operation in any way, and happens fairly
|
|
frequently when compute nodes are shut down, or are killed abruptly
|
|
because of errors in the compute.
|
|
|
|
**Errors are monitored, and always need human investigation to determine
|
|
the cause.**
|
|
|
|
Whether something should be logged at ERROR, WARNING or INFO level can
|
|
depend on the callers and clients. For example, it might be unexpected
|
|
and a sign of a serious issue if the console calls the
|
|
"timeline_detail" mgmt API for a timeline that doesn't exist. ERROR
|
|
would be appropriate in that case. But if the console routinely calls
|
|
the API after deleting a timeline, to check if the deletion has
|
|
completed, then it would be totally normal and an INFO or DEBUG level
|
|
message would be more appropriate. If a message is logged as an ERROR,
|
|
but it in fact happens frequently in production and never requires any
|
|
action, it should probably be demoted to an INFO level message.
|
|
|
|
### Warn
|
|
|
|
Examples:
|
|
- could not remove temporary file "foobar.temp"
|
|
- unrecognized file "foobar" in timeline directory
|
|
|
|
Warnings are similar to Errors, in that they should not happen
|
|
when the system is operating normally. The difference between Error and
|
|
Warning is that an Error means that the operation failed, whereas Warning
|
|
means that something unexpected happened, but the operation continued anyway.
|
|
For example, if deleting a file fails because the file already didn't exist,
|
|
it should be logged as Warning.
|
|
|
|
> **Note:** The python regression tests, under `test_regress`, check the
|
|
> pageserver log after each test for any ERROR and WARN lines. If there are
|
|
> any ERRORs or WARNs that have not been explicitly listed in the test as
|
|
> allowed, the test is marked a failed. This is to catch unexpected errors
|
|
> e.g. in background operations, that don't cause immediate misbehaviour in
|
|
> the tested functionality.
|
|
|
|
### Info
|
|
|
|
Info level is used to log useful information when the system is
|
|
operating normally. Info level is appropriate e.g. for logging state
|
|
changes, background operations, and network connections.
|
|
|
|
Examples:
|
|
- "system is shutting down"
|
|
- "tenant was created"
|
|
- "retrying S3 upload"
|
|
|
|
### Debug & Trace
|
|
|
|
Debug and Trace level messages are not printed to the log in our normal
|
|
production configuration, but could be enabled for a specific server or
|
|
tenant, to aid debugging. (Although we don't actually have that
|
|
capability as of this writing).
|
|
|
|
## Context
|
|
|
|
We use logging "spans" to hold context information about the current
|
|
operation. Almost every operation happens on a particular tenant and
|
|
timeline, so we enter a span with the "tenant_id" and "timeline_id"
|
|
very early when processing an incoming API request, for example. All
|
|
background operations should also run in a span containing at least
|
|
those two fields, and any other parameters or information that might
|
|
be useful when debugging an error that might happen when performing
|
|
the operation.
|
|
|
|
TODO: Spans are not captured in the Error when it is created, but when
|
|
the error is logged. It would be more useful to capture them at Error
|
|
creation. We should consider using `tracing_error::SpanTrace` to do
|
|
that.
|
|
|
|
## Error message style
|
|
|
|
### PostgreSQL extensions
|
|
|
|
PostgreSQL has a style guide for writing error messages:
|
|
|
|
https://www.postgresql.org/docs/current/error-style-guide.html
|
|
|
|
Follow that guide when writing error messages in the PostgreSQL
|
|
extensions.
|
|
|
|
### Neon Rust code
|
|
|
|
#### Anyhow Context
|
|
|
|
When adding anyhow `context()`, use form `present-tense-verb+action`.
|
|
|
|
Example:
|
|
- Bad: `file.metadata().context("could not get file metadata")?;`
|
|
- Good: `file.metadata().context("get file metadata")?;`
|
|
|
|
#### Logging Errors
|
|
|
|
When logging any error `e`, use `could not {e:#}` or `failed to {e:#}`.
|
|
|
|
If `e` is an `anyhow` error and you want to log the backtrace that it contains,
|
|
use `{e:?}` instead of `{e:#}`.
|
|
|
|
#### Rationale
|
|
|
|
The `{:#}` ("alternate Display") of an `anyhow` error chain is concatenation fo the contexts, using `: `.
|
|
|
|
For example, the following Rust code will result in output
|
|
```
|
|
ERROR failed to list users: load users from server: parse response: invalid json
|
|
```
|
|
|
|
This is more concise / less noisy than what happens if you do `.context("could not ...")?` at each level, i.e.:
|
|
|
|
```
|
|
ERROR could not list users: could not load users from server: could not parse response: invalid json
|
|
```
|
|
|
|
|
|
```rust
|
|
fn main() {
|
|
match list_users().context("list users") else {
|
|
Ok(_) => ...,
|
|
Err(e) => tracing::error!("failed to {e:#}"),
|
|
}
|
|
}
|
|
fn list_users() {
|
|
http_get_users().context("load users from server")?;
|
|
}
|
|
fn http_get_users() {
|
|
let response = client....?;
|
|
response.parse().context("parse response")?; // fails with serde error "invalid json"
|
|
}
|
|
```
|