diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md index faf2b2336f..b275349168 100644 --- a/docs/SUMMARY.md +++ b/docs/SUMMARY.md @@ -37,7 +37,7 @@ - [Source view](./sourcetree.md) - [docker.md](./docker.md) — Docker images and building pipeline. - - [Error handling and logging]() + - [Error handling and logging](./error-handling.md) - [Testing]() - [Unit testing]() - [Integration testing]() diff --git a/docs/error-handling.md b/docs/error-handling.md new file mode 100644 index 0000000000..c2231274a1 --- /dev/null +++ b/docs/error-handling.md @@ -0,0 +1,198 @@ +# Error handling and logging + +## Logging errors + +The principle is that errors are logged when they are handled. If you +just propagate an error to the caller in a function, you don't need to +log it; the caller will. But if you consume an error in a function, +you *must* log it (if it needs to be logged at all). + +For example: + +```rust +fn read_motd_file() -> std::io::Result { + let mut f = File::open("/etc/motd")?; + let mut result = String::new(); + f.read_to_string(&mut result)?; + result +} +``` + +Opening or reading the file could fail, but there is no need to log +the error here. The function merely propagates the error to the +caller, and it is up to the caller to log the error or propagate it +further, if the failure is not expected. But if, for example, it is +normal that the "/etc/motd" file doesn't exist, the caller can choose +to silently ignore the error, or log it as an INFO or DEBUG level +message: + +```rust +fn get_message_of_the_day() -> String { + // Get the motd from /etc/motd, or return the default proverb + match read_motd_file() { + Ok(motd) => motd, + Err(err) => { + // It's normal that /etc/motd doesn't exist, but if we fail to + // read it for some other reason, that's unexpected. The message + // of the day isn't very important though, so we just WARN and + // continue with the default in any case. + if err.kind() != std::io::ErrorKind::NotFound { + tracing::warn!("could not read \"/etc/motd\": {err:?}"); + } + "An old error is always more popular than a new truth. - German proverb" + } + } +} +``` + +## Error types + +We use the `anyhow` crate widely. It contains many convenient macros +like `bail!` and `ensure!` to construct and return errors, and to +propagate many kinds of low-level errors, wrapped in `anyhow::Error`. + +A downside of `anyhow::Error` is that the caller cannot distinguish +between different error cases. Most errors are propagated all the way +to the mgmt API handler function, or the main loop that handles a +connection with the compute node, and they are all handled the same +way: the error is logged and returned to the client as an HTTP or +libpq error. + +But in some cases, we need to distinguish between errors and handle +them differently. For example, attaching a tenant to the pageserver +could fail either because the tenant has already been attached, or +because we could not load its metadata from cloud storage. The first +case is more or less expected. The console sends the Attach request to +the pageserver, and the pageserver completes the operation, but the +network connection might be lost before the console receives the +response. The console will retry the operation in that case, but the +tenant has already been attached. It is important that the pagserver +responds with the HTTP 403 Already Exists error in that case, rather +than a generic HTTP 500 Internal Server Error. + +If you need to distinguish between different kinds of errors, create a +new `Error` type. The `thiserror` crate is useful for that. But in +most cases `anyhow::Error` is good enough. + +## Panics + +Depending on where a panic happens, it can cause the whole pageserver +or safekeeper to restart, or just a single tenant. In either case, +that is pretty bad and causes an outage. Avoid panics. Never use +`unwrap()` or other calls that might panic, to verify inputs from the +network or from disk. + +It is acceptable to use functions that might panic, like `unwrap()`, if +it is obvious that it cannot panic. For example, if you have just +checked that a variable is not None, it is OK to call `unwrap()` on it, +but it is still preferable to use `expect("reason")` instead to explain +why the function cannot fail. + +`assert!` and `panic!` are reserved for checking clear invariants and +very obvious "can't happen" cases. When in doubt, use anyhow `ensure!` +or `bail!` instead. + +## Error levels + +`tracing::Level` doesn't provide very clear guidelines on what the +different levels mean, or when to use which level. Here is how we use +them: + +### Error + +Examples: +- could not open file "foobar" +- invalid tenant id + +Errors are not expected to happen during normal operation. Incorrect +inputs from client can cause ERRORs. For example, if a client tries to +call a mgmt API that doesn't exist, or if a compute node sends passes +an LSN that has already been garbage collected away. + +These should *not* happen during normal operations. "Normal +operations" is not a very precise concept. But for example, disk +errors are not expected to happen when the system is working, so those +count as Errors. However, if a TCP connection to a compute node is +lost, that is not considered an Error, because it doesn't affect the +pageserver's or safekeeper's operation in any way, and happens fairly +frequently when compute nodes are shut down, or are killed abruptly +because of errors in the compute. + +**Errors are monitored, and always need human investigation to determine +the cause.** + +Whether something should be logged at ERROR, WARNING or INFO level can +depend on the callers and clients. For example, it might be unexpected +and a sign of a serious issue if the console calls the +"timeline_detail" mgmt API for a timeline that doesn't exist. ERROR +would be appropriate in that case. But if the console routinely calls +the API after deleting a timeline, to check if the deletion has +completed, then it would be totally normal and an INFO or DEBUG level +message would be more appropriate. If a message is logged as an ERROR, +but it in fact happens frequently in production and never requires any +action, it should probably be demoted to an INFO level message. + +### Warn + +Examples: +- could not remove temporary file "foobar.temp" +- unrecognized file "foobar" in timeline directory + +Warnings are similar to Errors, in that they should not happen +when the system is operating normally. The difference between Error and +Warning is that an Error means that the operation failed, whereas Warning +means that something unexpected happened, but the operation continued anyway. +For example, if deleting a file fails because the file already didn't exist, +it should be logged as Warning. + +> **Note:** The python regression tests, under `test_regress`, check the +> pageserver log after each test for any ERROR and WARN lines. If there are +> any ERRORs or WARNs that have not been explicitly listed in the test as +> allowed, the test is marked a failed. This is to catch unexpected errors +> e.g. in background operations, that don't cause immediate misbehaviour in +> the tested functionality. + +### Info + +Info level is used to log useful information when the system is +operating normally. Info level is appropriate e.g. for logging state +changes, background operations, and network connections. + +Examples: +- "system is shutting down" +- "tenant was created" +- "retrying S3 upload" + +### Debug & Trace + +Debug and Trace level messages are not printed to the log in our normal +production configuration, but could be enabled for a specific server or +tenant, to aid debugging. (Although we don't actually have that +capability as of this writing). + +## Context + +We use logging "spans" to hold context information about the current +operation. Almost every operation happens on a particular tenant and +timeline, so we enter a span with the "tenant_id" and "timeline_id" +very early when processing an incoming API request, for example. All +background operations should also run in a span containing at least +those two fields, and any other parameters or information that might +be useful when debugging an error that might happen when performing +the operation. + +TODO: Spans are not captured in the Error when it is created, but when +the error is logged. It would be more useful to capture them at Error +creation. We should consider using `tracing_error::SpanTrace` to do +that. + +## Error message style + +PostgreSQL has a style guide for writing error messages: + +https://www.postgresql.org/docs/current/error-style-guide.html + +Follow that guide when writing error messages in the PostgreSQL +extension. We don't follow it strictly in the pageserver and +safekeeper, but the advice in the PostgreSQL style guide is generally +good, and you can't go wrong by following it.