- Enabled process exporter for storage services
- Changed zenith_proxy prefix to just proxy
- Removed old `monitoring` directory
- Removed common prefix for metrics, now our common metrics have `libmetrics_` prefix, for example `libmetrics_serve_metrics_count`
- Added `test_metrics_normal_work`
The SyncQueue consisted of a tokio mpsc channel, and an atomic counter
to keep track of how many items there are in the channel. Updating the
atomic counter was racy, and sometimes the consumer would decrement
the counter before the producer had incremented it, leading to integer
wraparound to usize::MAX. Calling Vec::with_capacity(usize::MAX) leads
to a panic.
To fix, replace the channel with a VecDeque protected by a Mutex, and
a condition variable for signaling. Now that the queue is now
protected by standard blocking Mutex and Condvar, refactor the
functions touching it to be sync, not async.
A theoretical downside of this is that the calls to push items to the
queue and the storage sync thread that drains the queue might now need
to wait, if another thread is busy manipulating the queue. I believe
that's OK; the lock isn't held for very long, and these operations are
made in background threads, not in the hot GetPage@LSN path, so
they're not very latency-sensitive.
Fixes#1719. Also add a test case.
Resolves#1488.
- implemented `GET tenant/:tenant_id/timeline/:timeline_id/wal_receiver` endpoint
- returned `thread_id` in `thread_mgr::spawn`
- added `latest_gc_cutoff_lsn` field to `LocalTimelineInfo` struct
It's very confusing, and because you don't get a stack trace and error
message in the logs, makes debugging very hard. However, the
'test_pageserver_recovery' test relied on that behavior. To support that,
add a new "exit" action to the pageserver 'failpoints' command, so that
you can explicitly request to exit the process when a failpoint is hit.
Use timestamp->LSN mapping instead of file modification time.
Fix 'latest_gc_cutoff_lsn' - set it to the minimum of pitr_cutoff and gc_cutoff.
Add new test: test_pitr_gc
Try to follow Prometheus style-guide https://prometheus.io/docs/practices/naming/ for metrics names. More specifically:
- Use `pageserver_` prefix for all pagserver metrics
- Specify `_seconds` unit in time metrics
- Use unit as a suffix in other cases, such as `_hits`, `_bytes`, `_records`
- Use `_total` suffix for accumulating counters (note that Histograms append that suffix internally)
* Do not apply records with LSN smaller than LSN of cached image in delta layer
* Do not apply records with LSN smaller than LSN of cached image in delta layer
* Do not set LSN for new FPI page
refer #1656
* Add page_is_new, page_get_lsn, page_set_lsn functions
* Fix page_is_new implementation
* Add comment from XLogReadBufferForRedoExtended
Now princeple is following: acceptor threads (libpq and http) error will
bring the pageserver down, but all per-tenant thread failures will be treated
as an error.
A new `get_lsn_by_timestamp` command is added to the libpq page service
API.
An extra timestamp field is now stored in an extra field after each
Clog page. It is the timestamp of the latest commit, among all the
transactions on the Clog page. To find the overall latest commit, we
need to scan all Clog pages, but this isn't a very frequent operation
so that's not too bad.
To find the LSN that corresponds to a timestamp, we perform a binary
search. The binary search starts with min = last LSN when GC ran, and
max = latest LSN on the timeline. On each iteration of the search we
check if there are any commits with a higher-than-requested timestamp
at that LSN.
Implements github issue 1361.
* Traverse frozen layer in get_reconstruct_data in reverse order
* Fix comments on frozen layers.
Note explicitly the order that the layers are in the queue.
* Add fail point to reproduce failpoint iteration error
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Previously we've used table interface, but there was no easy way to pass
it as an override to pageserver through cli. Use the same strategy as
for remote storage config parsing