mirror of
https://github.com/neondatabase/neon.git
synced 2026-05-20 06:30:43 +00:00
This clarifies - I hope - the abstractions between Repository and ObjectRepository. The ObjectTag struct was a mix of objects that could be accessed directly through the public Timeline interface, and objects that were created and used internally by the ObjectRepository implementation and were not supposed to be accessed directly by callers. With RelishTag separate from ObjectTag, the distinction is clearer: RelishTag is used in the public interface, and ObjectTag is used internally between object_repository.rs and object_store.rs, and it contains the internal metadata object types.

One awkward thing with the ObjectTag struct was that the Repository implementation had to distinguish between ObjectTags for relations, and track the size of the relation, while others were used to store "blobs". With RelishTags, some relishes are considered "blocky", and the Repository implementation is expected to track their sizes, while the others ("non-blocky") are stored as blobs. I'm not 100% happy with how RelishTag captures that either: it just knows that some relish kinds are blocky and some non-blocky, and there's an is_block() function to check that. But this does enable size-tracking for SLRUs, allowing us to treat them more like relations.

This changes the way SLRUs are stored in the repository. Each SLRU segment, e.g. "pg_clog/0000" or "pg_clog/0001", is now handled as a separate relish. This removes the need for the SLRU-specific put_slru_truncate() function in the Timeline trait. SLRU truncation is now handled by calling put_unlink() on the segment. This is more in line with how PostgreSQL stores SLRUs and handles their truncation.

The SLRUs are "blocky", so they are accessed one 8k page at a time, and the repository tracks their size. I considered an alternative design where we would treat each SLRU segment as non-blocky, and just store the whole file as one blob. Each SLRU segment is up to 256 kB in size, which isn't that large, so that might've worked fine, too. One reason I didn't do that is that it seems better to have the WAL redo routines be as close as possible to the PostgreSQL routines. It doesn't matter much in the repository, though; we have to track the size for relations anyway, so there's not much difference in whether we also do it for SLRUs.

While working on this, I noticed that the CLOG and MultiXact redo code did not handle wraparound correctly. We need to fix that, but for now, I just commented them out with a FIXME comment.
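To make the segment arithmetic above concrete, here is a minimal sketch of how an SLRU page number could map onto a per-segment relish and a block within it. It assumes the 8 kB page size and 256 kB segment size mentioned in the message (32 pages per segment) and four-hex-digit segment names like "pg_clog/0001". The function names `slru_segment_location` and `slru_segment_name` are hypothetical and are not taken from the repository code.

```rust
// Sizes taken from the commit message: 8k pages, segments of up to 256 kB.
const BLCKSZ: u32 = 8192;
const SLRU_PAGES_PER_SEGMENT: u32 = 32;

/// Hypothetical helper: split an SLRU page number into
/// (segment number, block number within that segment).
fn slru_segment_location(pageno: u32) -> (u32, u32) {
    (pageno / SLRU_PAGES_PER_SEGMENT, pageno % SLRU_PAGES_PER_SEGMENT)
}

/// Hypothetical helper: format a segment file name such as "pg_clog/0001".
fn slru_segment_name(dir: &str, segno: u32) -> String {
    format!("{}/{:04X}", dir, segno)
}

fn main() {
    // 32 pages of 8 kB each give the 256 kB segment size.
    assert_eq!(SLRU_PAGES_PER_SEGMENT * BLCKSZ, 256 * 1024);

    // Page 33 lives in the second segment, at block 1.
    let (segno, blkno) = slru_segment_location(33);
    println!("{} block {}", slru_segment_name("pg_clog", segno), blkno);
}
```

Under this layout, truncating an SLRU amounts to unlinking whole segments, which matches the description of calling put_unlink() on a segment.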
94 lines
3.3 KiB
Rust
//! Low-level key-value storage abstraction.
//!
use crate::object_key::*;
use crate::relish::*;
use crate::ZTimelineId;
use anyhow::Result;
use std::collections::HashSet;
use std::iter::Iterator;
use zenith_utils::lsn::Lsn;

///
/// Low-level storage abstraction.
///
/// All the data in the repository is stored in a key-value store. This trait
/// abstracts the details of the key-value store.
///
/// A simple key-value store would support just GET and PUT operations with
/// a key, but the upper layer needs slightly more complicated read operations.
///
/// The most frequently used function is 'object_versions'. It is used
/// to look up a page version. It is LSN aware, in that the caller
/// specifies an LSN, and the function returns all values for that
/// block with the same or older LSN.
///
pub trait ObjectStore: Send + Sync {
    ///
    /// Store a value with the given key.
    ///
    fn put(&self, key: &ObjectKey, lsn: Lsn, value: &[u8]) -> Result<()>;

    /// Read the entry with the exact given key.
    ///
    /// This is used for retrieving metadata with a special key that doesn't
    /// correspond to any real relation.
    fn get(&self, key: &ObjectKey, lsn: Lsn) -> Result<Vec<u8>>;

    /// Return the first key greater than or equal to the specified key, if any.
    fn get_next_key(&self, key: &ObjectKey) -> Result<Option<ObjectKey>>;

    /// Iterate through all page versions of one object.
    ///
    /// Returns all page versions in descending LSN order, along with the LSN
    /// of each page version.
    fn object_versions<'a>(
        &'a self,
        key: &ObjectKey,
        lsn: Lsn,
    ) -> Result<Box<dyn Iterator<Item = (Lsn, Vec<u8>)> + 'a>>;

    /// Iterate through versions of all objects in a timeline.
    ///
    /// Returns objects in increasing key-version order.
    /// Returns all versions up to and including the specified LSN.
    fn objects<'a>(
        &'a self,
        timeline: ZTimelineId,
        lsn: Lsn,
    ) -> Result<Box<dyn Iterator<Item = Result<(ObjectTag, Lsn, Vec<u8>)>> + 'a>>;

    /// List all relations with the given tablespace and database ID, and LSN <= 'lsn'.
    /// Both dbnode and spcnode can be InvalidId (0), which means get all
    /// relations in the tablespace/cluster.
    ///
    /// This is used to implement 'create database'.
    fn list_rels(
        &self,
        timelineid: ZTimelineId,
        spcnode: u32,
        dbnode: u32,
        lsn: Lsn,
    ) -> Result<HashSet<RelTag>>;

    /// List all non-rel relishes in a timeline, with LSN <= 'lsn'.
    ///
    /// This is used to prepare the tarball for new node startup.
    /// The result is a set, so it has no particular order.
    fn list_nonrels<'a>(&'a self, timelineid: ZTimelineId, lsn: Lsn) -> Result<HashSet<RelishTag>>;

    /// Iterate through object tags.
    ///
    /// This is used to implement GC and to prepare the tarball for new node startup.
    /// Returns objects in increasing key-version order.
    fn list_objects<'a>(
        &'a self,
        timelineid: ZTimelineId,
        lsn: Lsn,
    ) -> Result<Box<dyn Iterator<Item = ObjectTag> + 'a>>;

    /// Unlink an object (used by GC). This method may actually delete the
    /// object, or just mark it for deletion.
    fn unlink(&self, key: &ObjectKey, lsn: Lsn) -> Result<()>;

    /// Compact storage and remove versions marked for deletion.
    fn compact(&self);
}
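The LSN-aware lookup described in the trait documentation can be sketched with an ordered map keyed by (key, LSN). This is only an illustration under stated assumptions, not the store used by the repository: `MemStore`, `Key`, and `latest_at` are hypothetical names, `Lsn` is simplified to a plain `u64`, and error handling is left out.

```rust
use std::collections::BTreeMap;

// Simplified stand-ins for ObjectKey and zenith_utils::lsn::Lsn (assumptions).
type Key = String;
type Lsn = u64;

/// Hypothetical in-memory store: every (key, lsn) pair holds one version.
#[derive(Default)]
struct MemStore {
    data: BTreeMap<(Key, Lsn), Vec<u8>>,
}

impl MemStore {
    fn put(&mut self, key: &str, lsn: Lsn, value: &[u8]) {
        self.data.insert((key.to_string(), lsn), value.to_vec());
    }

    /// All versions of `key` with LSN <= `lsn`, in descending LSN order.
    fn object_versions<'a>(
        &'a self,
        key: &str,
        lsn: Lsn,
    ) -> impl Iterator<Item = (Lsn, Vec<u8>)> + 'a {
        let lo = (key.to_string(), 0);
        let hi = (key.to_string(), lsn);
        // The map is sorted by (key, lsn), so one range scan covers exactly
        // this key's versions; reversing it yields the newest version first.
        self.data
            .range(lo..=hi)
            .rev()
            .map(|((_, l), v)| (*l, v.clone()))
    }

    /// Hypothetical helper: the newest version visible at `lsn`, if any.
    fn latest_at(&self, key: &str, lsn: Lsn) -> Option<Vec<u8>> {
        self.object_versions(key, lsn).next().map(|(_, v)| v)
    }
}

fn main() {
    let mut store = MemStore::default();
    store.put("page", 10, b"v1");
    store.put("page", 20, b"v2");
    store.put("page", 30, b"v3");
    store.put("other", 15, b"x");

    // Reading at LSN 25 sees the versions written at LSN 20 and LSN 10.
    let lsns: Vec<Lsn> = store.object_versions("page", 25).map(|(l, _)| l).collect();
    println!("{:?}", lsns);
    println!("{:?}", store.latest_at("page", 25));
}
```

Keeping the LSN as the second component of the map key is what lets a single ordered scan return the versions of one object, which is the property the trait's 'object_versions' relies on from the underlying key-value store.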