## Refs

- Epic: https://github.com/neondatabase/neon/issues/9378

Co-authored-by: Vlad Lazar <vlad@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>

## Problem

The read path does its IOs sequentially. This means that if N values need to be read to reconstruct a page, we will do N IOs, and getpage latency is `O(N*IoLatency)`.

## Solution

With this PR we gain the ability to issue IOs concurrently within one layer visit **and** to move on to the next layer without waiting for IOs from the previous visit to complete.

This is an evolved version of the work done at the Lisbon hackathon, cf https://github.com/neondatabase/neon/pull/9002.

## Design

### `will_init` now sourced from disk btree index keys

On the algorithmic level, the only change is that `get_values_reconstruct_data` now sources `will_init` from the disk btree index key (which is PS-page_cache'd), instead of from the `Value`, which is only available after the IO completes.

### Concurrent IOs, Submission & Completion

To separate IO submission from waiting for its completion, while simultaneously feature-gating the change, we introduce the notion of an `IoConcurrency` struct through which IO futures are "spawned".

An IO is an opaque future, and waiting for completions is handled through `tokio::sync::oneshot` channels. The oneshot Receivers take the place of the `img` and `records` fields inside `VectoredValueReconstructState`.

When we're done visiting all the layers and submitting all the IOs along the way, we concurrently `collect_pending_ios` for each value, which means that for each value there is a future that awaits all the oneshot receivers and then calls into walredo to reconstruct the page image. Walredo is now invoked concurrently for each value instead of sequentially. Walredo itself remains unchanged.

The spawned IO futures are driven to completion by a sidecar tokio task that is separate from the task that performs all the layer visiting and spawning of IOs. That task receives the IO futures via an unbounded mpsc channel and drives them to completion inside a `FuturesUnordered`. (A minimal sketch of this submission/completion pattern is included at the end of this description.)

(The behavior from before this PR is available through `IoConcurrency::Sequential`, which awaits the IO futures in place, without "spawning" or "submitting" them anywhere.)

#### Alternatives Explored

A few words on the rationale behind having a sidecar *task* and what alternatives were considered.

One option is to queue up all IO futures in a `FuturesUnordered` that is polled the first time when we `collect_pending_ios`.

Firstly, the IO futures are opaque, compiler-generated futures that need to be polled at least once to submit their IO. "At least once" because tokio-epoll-uring may not be able to submit the IO to the kernel on first poll right away.

Second, there are deadlocks if we don't drive the IO futures to completion independently of the spawning task. The reason is that both the IO futures and the spawning task may hold some _and_ try to acquire _more_ shared, limited resources. For example, both the spawning task and an IO future may try to acquire

* a VirtualFile file descriptor cache slot async mutex (observed during impl)
* a tokio-epoll-uring submission slot (observed during impl)
* a PageCache slot (currently this is not the case, but we may move more code into the IO futures in the future)

Another option is to spawn a short-lived `tokio::task` for each IO future. We implemented and benchmarked it during development, but found little throughput improvement and moderate mean & tail latency degradation.
Concerns about pressure on the tokio scheduler made us discard this variant.

The sidecar task could be obsoleted if the IOs were not arbitrary code but a well-defined struct. However,

1. the opaque-futures approach taken in this PR allows leaving the existing code unchanged, which
2. allows us to implement the `IoConcurrency::Sequential` mode for feature-gating the change.

Once the new sidecar-task implementation is rolled out everywhere, and `::Sequential` is removed, we can think about a descriptive submission & completion interface. The problems around deadlocks pointed out earlier will need to be solved then. For example, we could eliminate the VirtualFile file descriptor cache and tokio-epoll-uring slots. The latter has been drafted in https://github.com/neondatabase/tokio-epoll-uring/pull/63.

See the lengthy doc comment on `spawn_io()` for more details.

### Error handling

There are two error classes during reconstruct data retrieval:

* traversal errors: index lookup, move to next layer, and the like
* value read IO errors

A traversal error fails the entire get_vectored request, as before this PR. A value read error only fails that value.

In any case, we preserve the existing behavior that once `get_vectored` returns, all IOs are done. Panics and failing to poll `get_vectored` to completion will leave the IOs dangling, which is safe but shouldn't happen, and so a rate-limited log statement will be emitted at warning level. There is a doc comment on `collect_pending_ios` giving more code-level details and rationale.

### Feature Gating

The new behavior is opt-in via pageserver config. The `Sequential` mode is the default. The only significant change in `Sequential` mode compared to before this PR is the buffering of results in the `oneshot`s.

## Code-Level Changes

Prep work:

* Make `GateGuard` clonable.

Core Feature:

* Traversal code: track `will_init` in `BlobMeta` and source it from the Delta/Image/InMemory layer index, instead of determining `will_init` after we've read the value. This avoids having to read the value to determine whether traversal can stop.
* Introduce `IoConcurrency` & its sidecar task.
  * `IoConcurrency` is the clonable handle.
  * It connects to the sidecar task via an `mpsc`.
* Plumb through `IoConcurrency` from high-level code to the individual layer implementations' `get_values_reconstruct_data`. We piggy-back on the `ValuesReconstructState` for this.
  * The sidecar task should be long-lived, so `IoConcurrency` needs to be rooted up "high" in the call stack.
  * Roots as of this PR:
    * `page_service`: outside of pagestream loop
    * `create_image_layers`: when it is called
    * `basebackup` (only aux files + replorigin + SLRU segments)
  * Code with no roots that uses `IoConcurrency::sequential`:
    * any `Timeline::get` call
      * `collect_keyspace` is a good example
      * follow-up: https://github.com/neondatabase/neon/issues/10460
    * `TimelineAdaptor` code used by the compaction simulator, unused in practice
    * `ingest_xlog_dbase_create`
* Transform Delta/Image/InMemoryLayer to
  * do their values IO in a distinct `async {}` block
  * extend the residence of the Delta/Image layer until the IO is done
  * buffer their results in a `oneshot` channel instead of straight in `ValuesReconstructState`
  * the `oneshot` channel is wrapped in `OnDiskValueIo` / `OnDiskValueIoWaiter` types that aid in expressiveness and are used to keep track of in-flight IOs so we can print warnings if we leave them dangling.
* Change `ValuesReconstructState` to hold the receiving end of the `oneshot` channel, aka `OnDiskValueIoWaiter`.
* Change `get_vectored_impl` to `collect_pending_ios` and issue walredo concurrently, in a `FuturesUnordered`.

Testing / Benchmarking:

* Support queue depth in pagebench for manual benchmarking.
* Add test suite support for setting the concurrency-mode pageserver config field via a) an env var and b) via NeonEnvBuilder.
* Hacky helper to have sidecar-based IoConcurrency in tests. This will be cleaned up later.

More benchmarking will happen post-merge in nightly benchmarks, plus in staging/pre-prod. Some intermediate helpers for manual benchmarking have been preserved in https://github.com/neondatabase/neon/pull/10466 and will be landed in later PRs. (L0 layer stack generator!)

Drive-By:

* The test suite actually didn't enable batching by default because `config.compatibility_neon_binpath` is always truthy in our CI environment => https://neondb.slack.com/archives/C059ZC138NR/p1737490501941309
* Initial logical size calculation wasn't always polled to completion, which was surfaced through the added WARN logs emitted when dropping a `ValuesReconstructState` that still has in-flight IOs.
* Remove the timing histograms `pageserver_getpage_get_reconstruct_data_seconds` and `pageserver_getpage_reconstruct_seconds`, because with planning, value read IO, and walredo happening concurrently, one can no longer attribute latency to any one of them; we'll revisit this when Vlad's work on tracing/sampling through RequestContext lands.
* Remove code related to `get_cached_lsn()`. The logic around this has been dead at runtime for a long time, ever since the removal of the materialized page cache in #8105.

## Testing

Unit tests use the sidecar task by default and run both modes in CI. Python regression tests and benchmarks also use the sidecar task by default. We'll test more in staging and possibly pre-prod.

## Future Work

Please refer to the parent epic for the full plan.

The next step will be to fold the plumbing of IoConcurrency into RequestContext so that the function signatures get cleaned up. Once `Sequential` isn't used anymore, we can take the next big leap, which is replacing the opaque IOs with structs that have well-defined semantics.

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
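## Appendix: Sketch of the submission/completion pattern

To make the submission/completion split described above concrete, here is a minimal, self-contained sketch of the sidecar-task pattern. This is **not** the pageserver code: the `IoConcurrency`/`spawn_io` names are reused only for illustration, the "value read" and "walredo" steps are stand-ins, and the sketch assumes only the `tokio` and `futures` crates.

```rust
use futures::stream::{FuturesUnordered, StreamExt};
use std::future::Future;
use std::pin::Pin;
use tokio::sync::{mpsc, oneshot};

// An IO is an opaque, boxed future; completion is observed via the oneshot it resolves.
type IoFuture = Pin<Box<dyn Future<Output = ()> + Send>>;

/// Clonable handle through which IO futures are "spawned" onto the sidecar task.
#[derive(Clone)]
struct IoConcurrency {
    tx: mpsc::UnboundedSender<IoFuture>,
}

impl IoConcurrency {
    /// Start the sidecar task that drives submitted IOs inside a FuturesUnordered.
    fn spawn() -> Self {
        let (tx, mut rx) = mpsc::unbounded_channel::<IoFuture>();
        tokio::spawn(async move {
            let mut in_flight = FuturesUnordered::new();
            loop {
                tokio::select! {
                    maybe_io = rx.recv() => match maybe_io {
                        Some(io) => in_flight.push(io),
                        None => break, // all senders dropped
                    },
                    Some(()) = in_flight.next(), if !in_flight.is_empty() => {}
                }
            }
            // Drain whatever is still in flight so no IO is left dangling.
            while in_flight.next().await.is_some() {}
        });
        Self { tx }
    }

    /// Submit an IO future to the sidecar task without waiting for it.
    fn spawn_io(&self, io: IoFuture) {
        let _ = self.tx.send(io);
    }
}

#[tokio::main]
async fn main() {
    let ioc = IoConcurrency::spawn();

    // "Layer visiting": submit one stand-in read per value, buffering the result in a
    // oneshot whose receiver plays the role of the img/records fields of the
    // reconstruct state.
    let mut waiters = Vec::new();
    for key in 0u32..4 {
        let (tx, rx) = oneshot::channel::<Vec<u8>>();
        ioc.spawn_io(Box::pin(async move {
            // stand-in for an actual value read IO
            let _ = tx.send(format!("value-{key}").into_bytes());
        }));
        waiters.push((key, rx));
    }

    // collect_pending_ios-style step: await all completions concurrently, and run the
    // (stand-in) per-value reconstruction inside a FuturesUnordered.
    let mut reconstruct = FuturesUnordered::new();
    for (key, rx) in waiters {
        reconstruct.push(async move {
            let bytes = rx.await.expect("IO dropped without completing");
            (key, bytes.len()) // walredo would run here, per value, concurrently
        });
    }
    while let Some((key, len)) = reconstruct.next().await {
        println!("key {key}: reconstructed {len} bytes");
    }
}
```

Compared to this sketch, the real implementation additionally ties the sidecar task's lifetime to a `GateGuard`, feature-gates the behavior via `IoConcurrency::Sequential`, and turns a failed value read into a per-value error rather than a panic.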
//!
//! Generate a tarball with files needed to bootstrap ComputeNode.
//!
//! TODO: this module has nothing to do with PostgreSQL pg_basebackup.
//! It could use a better name.
//!
//! A stateless Postgres compute node is launched by sending a tarball
//! which contains non-relational data (multixacts, clog, filenodemaps, twophase files),
//! generated pg_control and a dummy segment of WAL.
//! This module is responsible for creating such a tarball
//! from data stored in object storage.
//!
use anyhow::{anyhow, Context};
use bytes::{BufMut, Bytes, BytesMut};
use fail::fail_point;
use pageserver_api::key::Key;
use postgres_ffi::pg_constants;
use std::fmt::Write as FmtWrite;
use std::time::{Instant, SystemTime};
use tokio::io;
use tokio::io::AsyncWrite;
use tracing::*;

use tokio_tar::{Builder, EntryType, Header};

use crate::context::RequestContext;
use crate::pgdatadir_mapping::Version;
use crate::tenant::storage_layer::IoConcurrency;
use crate::tenant::Timeline;
use pageserver_api::reltag::{RelTag, SlruKind};

use postgres_ffi::dispatch_pgversion;
use postgres_ffi::pg_constants::{DEFAULTTABLESPACE_OID, GLOBALTABLESPACE_OID};
use postgres_ffi::pg_constants::{PGDATA_SPECIAL_FILES, PG_HBA};
use postgres_ffi::relfile_utils::{INIT_FORKNUM, MAIN_FORKNUM};
use postgres_ffi::XLogFileName;
use postgres_ffi::PG_TLI;
use postgres_ffi::{BLCKSZ, RELSEG_SIZE, WAL_SEGMENT_SIZE};
use utils::lsn::Lsn;

#[derive(Debug, thiserror::Error)]
pub enum BasebackupError {
    #[error("basebackup pageserver error {0:#}")]
    Server(#[from] anyhow::Error),
    #[error("basebackup client error {0:#}")]
    Client(#[source] io::Error),
}

/// Create basebackup with non-rel data in it.
/// Only include relational data if 'full_backup' is true.
///
/// Currently we use empty 'req_lsn' in two cases:
/// * During the basebackup right after timeline creation
/// * When working without safekeepers. In this situation it is important to match the lsn
///   we are taking basebackup on with the lsn that is used in pageserver's walreceiver
///   to start the replication.
pub async fn send_basebackup_tarball<'a, W>(
    write: &'a mut W,
    timeline: &'a Timeline,
    req_lsn: Option<Lsn>,
    prev_lsn: Option<Lsn>,
    full_backup: bool,
    replica: bool,
    ctx: &'a RequestContext,
) -> Result<(), BasebackupError>
where
    W: AsyncWrite + Send + Sync + Unpin,
{
    // Compute postgres doesn't have any previous WAL files, but the first
    // record that it's going to write needs to include the LSN of the
    // previous record (xl_prev). We include prev_record_lsn in the
    // "zenith.signal" file, so that postgres can read it during startup.
    //
    // We don't keep full history of record boundaries in the page server,
    // however, only the predecessor of the latest record on each
    // timeline. So we can only provide prev_record_lsn when you take a
    // base backup at the end of the timeline, i.e. at last_record_lsn.
    // Even at the end of the timeline, we sometimes don't have a valid
    // prev_lsn value; that happens if the timeline was just branched from
    // an old LSN and it doesn't have any WAL of its own yet. We will set
    // prev_lsn to Lsn(0) if we cannot provide the correct value.
    let (backup_prev, backup_lsn) = if let Some(req_lsn) = req_lsn {
        // Backup was requested at a particular LSN. The caller should've
        // already checked that it's a valid LSN.

        // If the requested point is the end of the timeline, we can
        // provide prev_lsn. (get_last_record_rlsn() might return it as
        // zero, though, if no WAL has been generated on this timeline
        // yet.)
        let end_of_timeline = timeline.get_last_record_rlsn();
        if req_lsn == end_of_timeline.last {
            (end_of_timeline.prev, req_lsn)
        } else {
            (Lsn(0), req_lsn)
        }
    } else {
        // Backup was requested at end of the timeline.
        let end_of_timeline = timeline.get_last_record_rlsn();
        (end_of_timeline.prev, end_of_timeline.last)
    };

    // Consolidate the derived and the provided prev_lsn values
    let prev_lsn = if let Some(provided_prev_lsn) = prev_lsn {
        if backup_prev != Lsn(0) && backup_prev != provided_prev_lsn {
            return Err(BasebackupError::Server(anyhow!(
                "backup_prev {backup_prev} != provided_prev_lsn {provided_prev_lsn}"
            )));
        }
        provided_prev_lsn
    } else {
        backup_prev
    };

    info!(
        "taking basebackup lsn={}, prev_lsn={} (full_backup={}, replica={})",
        backup_lsn, prev_lsn, full_backup, replica
    );

    let basebackup = Basebackup {
        ar: Builder::new_non_terminated(write),
        timeline,
        lsn: backup_lsn,
        prev_record_lsn: prev_lsn,
        full_backup,
        replica,
        ctx,
        io_concurrency: IoConcurrency::spawn_from_conf(
            timeline.conf,
            timeline
                .gate
                .enter()
                .map_err(|e| BasebackupError::Server(e.into()))?,
        ),
    };
    basebackup
        .send_tarball()
        .instrument(info_span!("send_tarball", backup_lsn=%backup_lsn))
        .await
}

/// This is a short-lived object, existing only for the duration of tarball creation.
/// It exists mostly to avoid passing a lot of parameters between the various functions
/// used for constructing the tarball.
struct Basebackup<'a, W>
where
    W: AsyncWrite + Send + Sync + Unpin,
{
    ar: Builder<&'a mut W>,
    timeline: &'a Timeline,
    lsn: Lsn,
    prev_record_lsn: Lsn,
    full_backup: bool,
    replica: bool,
    ctx: &'a RequestContext,
    io_concurrency: IoConcurrency,
}

/// A sink that accepts SLRU blocks ordered by key and forwards
/// full segments to the archive.
struct SlruSegmentsBuilder<'a, 'b, W>
where
    W: AsyncWrite + Send + Sync + Unpin,
{
    ar: &'a mut Builder<&'b mut W>,
    buf: Vec<u8>,
    current_segment: Option<(SlruKind, u32)>,
    total_blocks: usize,
}

impl<'a, 'b, W> SlruSegmentsBuilder<'a, 'b, W>
where
    W: AsyncWrite + Send + Sync + Unpin,
{
    fn new(ar: &'a mut Builder<&'b mut W>) -> Self {
        Self {
            ar,
            buf: Vec::new(),
            current_segment: None,
            total_blocks: 0,
        }
    }

    async fn add_block(&mut self, key: &Key, block: Bytes) -> Result<(), BasebackupError> {
        let (kind, segno, _) = key.to_slru_block()?;

        match kind {
            SlruKind::Clog => {
                if !(block.len() == BLCKSZ as usize || block.len() == BLCKSZ as usize + 8) {
                    return Err(BasebackupError::Server(anyhow!(
                        "invalid SlruKind::Clog record: block.len()={}",
                        block.len()
                    )));
                }
            }
            SlruKind::MultiXactMembers | SlruKind::MultiXactOffsets => {
                if block.len() != BLCKSZ as usize {
                    return Err(BasebackupError::Server(anyhow!(
                        "invalid {:?} record: block.len()={}",
                        kind,
                        block.len()
                    )));
                }
            }
        }

        let segment = (kind, segno);
        match self.current_segment {
            None => {
                self.current_segment = Some(segment);
                self.buf
                    .extend_from_slice(block.slice(..BLCKSZ as usize).as_ref());
            }
            Some(current_seg) if current_seg == segment => {
                self.buf
                    .extend_from_slice(block.slice(..BLCKSZ as usize).as_ref());
            }
            Some(_) => {
                self.flush().await?;

                self.current_segment = Some(segment);
                self.buf
                    .extend_from_slice(block.slice(..BLCKSZ as usize).as_ref());
            }
        }

        Ok(())
    }

    async fn flush(&mut self) -> Result<(), BasebackupError> {
        let nblocks = self.buf.len() / BLCKSZ as usize;
        let (kind, segno) = self.current_segment.take().unwrap();
        let segname = format!("{}/{:>04X}", kind.to_str(), segno);
        let header = new_tar_header(&segname, self.buf.len() as u64)?;
        self.ar
            .append(&header, self.buf.as_slice())
            .await
            .map_err(BasebackupError::Client)?;

        self.total_blocks += nblocks;
        debug!("Added to basebackup slru {} relsize {}", segname, nblocks);

        self.buf.clear();

        Ok(())
    }

    async fn finish(mut self) -> Result<(), BasebackupError> {
        let res = if self.current_segment.is_none() || self.buf.is_empty() {
            Ok(())
        } else {
            self.flush().await
        };

        info!("Collected {} SLRU blocks", self.total_blocks);

        res
    }
}

impl<W> Basebackup<'_, W>
where
    W: AsyncWrite + Send + Sync + Unpin,
{
    async fn send_tarball(mut self) -> Result<(), BasebackupError> {
        // TODO include checksum

        let lazy_slru_download = self.timeline.get_lazy_slru_download() && !self.full_backup;

        let pgversion = self.timeline.pg_version;
        let subdirs = dispatch_pgversion!(pgversion, &pgv::bindings::PGDATA_SUBDIRS[..]);

        // Create pgdata subdirs structure
        for dir in subdirs.iter() {
            let header = new_tar_header_dir(dir)?;
            self.ar
                .append(&header, &mut io::empty())
                .await
                .context("could not add directory to basebackup tarball")?;
        }

        // Send config files.
        for filepath in PGDATA_SPECIAL_FILES.iter() {
            if *filepath == "pg_hba.conf" {
                let data = PG_HBA.as_bytes();
                let header = new_tar_header(filepath, data.len() as u64)?;
                self.ar
                    .append(&header, data)
                    .await
                    .context("could not add config file to basebackup tarball")?;
            } else {
                let header = new_tar_header(filepath, 0)?;
                self.ar
                    .append(&header, &mut io::empty())
                    .await
                    .context("could not add config file to basebackup tarball")?;
            }
        }
        if !lazy_slru_download {
            // Gather non-relational files from object storage pages.
            let slru_partitions = self
                .timeline
                .get_slru_keyspace(Version::Lsn(self.lsn), self.ctx)
                .await
                .map_err(|e| BasebackupError::Server(e.into()))?
                .partition(
                    self.timeline.get_shard_identity(),
                    Timeline::MAX_GET_VECTORED_KEYS * BLCKSZ as u64,
                );

            let mut slru_builder = SlruSegmentsBuilder::new(&mut self.ar);

            for part in slru_partitions.parts {
                let blocks = self
                    .timeline
                    .get_vectored(part, self.lsn, self.io_concurrency.clone(), self.ctx)
                    .await
                    .map_err(|e| BasebackupError::Server(e.into()))?;

                for (key, block) in blocks {
                    let block = block.map_err(|e| BasebackupError::Server(e.into()))?;
                    slru_builder.add_block(&key, block).await?;
                }
            }
            slru_builder.finish().await?;
        }

        let mut min_restart_lsn: Lsn = Lsn::MAX;
        // Create tablespace directories
        for ((spcnode, dbnode), has_relmap_file) in self
            .timeline
            .list_dbdirs(self.lsn, self.ctx)
            .await
            .map_err(|e| BasebackupError::Server(e.into()))?
        {
            self.add_dbdir(spcnode, dbnode, has_relmap_file).await?;

            // If full backup is requested, include all relation files.
            // Otherwise only include init forks of unlogged relations.
            let rels = self
                .timeline
                .list_rels(spcnode, dbnode, Version::Lsn(self.lsn), self.ctx)
                .await
                .map_err(|e| BasebackupError::Server(e.into()))?;
            for &rel in rels.iter() {
                // Send init fork as main fork to provide well formed empty
                // contents of UNLOGGED relations. Postgres copies it in
                // `reinit.c` during recovery.
                if rel.forknum == INIT_FORKNUM {
                    // I doubt we need the _init fork itself, but having it at least
                    // serves as a marker that the relation is unlogged.
                    self.add_rel(rel, rel).await?;
                    self.add_rel(rel, rel.with_forknum(MAIN_FORKNUM)).await?;
                    continue;
                }

                if self.full_backup {
                    if rel.forknum == MAIN_FORKNUM && rels.contains(&rel.with_forknum(INIT_FORKNUM))
                    {
                        // skip this, will include it when we reach the init fork
                        continue;
                    }
                    self.add_rel(rel, rel).await?;
                }
            }
        }

        let start_time = Instant::now();
        let aux_files = self
            .timeline
            .list_aux_files(self.lsn, self.ctx, self.io_concurrency.clone())
            .await
            .map_err(|e| BasebackupError::Server(e.into()))?;
        let aux_scan_time = start_time.elapsed();
        let aux_estimated_size = aux_files
            .values()
            .map(|content| content.len())
            .sum::<usize>();
        info!(
            "Scanned {} aux files in {}ms, aux file content size = {}",
            aux_files.len(),
            aux_scan_time.as_millis(),
            aux_estimated_size
        );

        for (path, content) in aux_files {
            if path.starts_with("pg_replslot") {
                // Do not create LR slots at standby because they are not used but prevent WAL truncation
                if self.replica {
                    continue;
                }
                let offs = pg_constants::REPL_SLOT_ON_DISK_OFFSETOF_RESTART_LSN;
                let restart_lsn = Lsn(u64::from_le_bytes(
                    content[offs..offs + 8].try_into().unwrap(),
                ));
                info!("Replication slot {} restart LSN={}", path, restart_lsn);
                min_restart_lsn = Lsn::min(min_restart_lsn, restart_lsn);
            } else if path == "pg_logical/replorigin_checkpoint" {
                // replorigin_checkpoint is written only on compute shutdown, so it contains
                // stale values. We generate our own version of this file for the particular LSN
                // based on information about replorigins extracted from transaction commit records.
                // In the future we will not generate an AUX record for "pg_logical/replorigin_checkpoint" at all,
                // but for now we should handle (skip) it for backward compatibility.
                continue;
            }
            let header = new_tar_header(&path, content.len() as u64)?;
            self.ar
                .append(&header, &*content)
                .await
                .context("could not add aux file to basebackup tarball")?;
        }

        if min_restart_lsn != Lsn::MAX {
            info!(
                "Min restart LSN for logical replication is {}",
                min_restart_lsn
            );
            let data = min_restart_lsn.0.to_le_bytes();
            let header = new_tar_header("restart.lsn", data.len() as u64)?;
            self.ar
                .append(&header, &data[..])
                .await
                .context("could not add restart.lsn file to basebackup tarball")?;
        }
        for xid in self
            .timeline
            .list_twophase_files(self.lsn, self.ctx)
            .await
            .map_err(|e| BasebackupError::Server(e.into()))?
        {
            self.add_twophase_file(xid).await?;
        }
        let repl_origins = self
            .timeline
            .get_replorigins(self.lsn, self.ctx, self.io_concurrency.clone())
            .await
            .map_err(|e| BasebackupError::Server(e.into()))?;
        let n_origins = repl_origins.len();
        if n_origins != 0 {
            //
            // Construct "pg_logical/replorigin_checkpoint" file based on information about replication origins
            // extracted from transaction commit records. We are using this file to pass information about replication
            // origins to compute to allow logical replication to restart from the proper point.
            //
            let mut content = Vec::with_capacity(n_origins * 16 + 8);
            content.extend_from_slice(&pg_constants::REPLICATION_STATE_MAGIC.to_le_bytes());
            for (origin_id, origin_lsn) in repl_origins {
                content.extend_from_slice(&origin_id.to_le_bytes());
                content.extend_from_slice(&[0u8; 6]); // align to 8 bytes
                content.extend_from_slice(&origin_lsn.0.to_le_bytes());
            }
            let crc32 = crc32c::crc32c(&content);
            content.extend_from_slice(&crc32.to_le_bytes());
            let header = new_tar_header("pg_logical/replorigin_checkpoint", content.len() as u64)?;
            self.ar.append(&header, &*content).await.context(
                "could not add pg_logical/replorigin_checkpoint file to basebackup tarball",
            )?;
        }

        fail_point!("basebackup-before-control-file", |_| {
            Err(BasebackupError::Server(anyhow!(
                "failpoint basebackup-before-control-file"
            )))
        });

        // Generate pg_control and bootstrap WAL segment.
        self.add_pgcontrol_file().await?;
        self.ar.finish().await.map_err(BasebackupError::Client)?;
        debug!("all tarred up!");
        Ok(())
    }

    /// Add contents of relfilenode `src`, naming it as `dst`.
    async fn add_rel(&mut self, src: RelTag, dst: RelTag) -> Result<(), BasebackupError> {
        let nblocks = self
            .timeline
            .get_rel_size(src, Version::Lsn(self.lsn), self.ctx)
            .await
            .map_err(|e| BasebackupError::Server(e.into()))?;

        // If the relation is empty, create an empty file
        if nblocks == 0 {
            let file_name = dst.to_segfile_name(0);
            let header = new_tar_header(&file_name, 0)?;
            self.ar
                .append(&header, &mut io::empty())
                .await
                .map_err(BasebackupError::Client)?;
            return Ok(());
        }

        // Add a file for each chunk of blocks (aka segment)
        let mut startblk = 0;
        let mut seg = 0;
        while startblk < nblocks {
            let endblk = std::cmp::min(startblk + RELSEG_SIZE, nblocks);

            let mut segment_data: Vec<u8> = vec![];
            for blknum in startblk..endblk {
                let img = self
                    .timeline
                    .get_rel_page_at_lsn(
                        src,
                        blknum,
                        Version::Lsn(self.lsn),
                        self.ctx,
                        self.io_concurrency.clone(),
                    )
                    .await
                    .map_err(|e| BasebackupError::Server(e.into()))?;
                segment_data.extend_from_slice(&img[..]);
            }

            let file_name = dst.to_segfile_name(seg as u32);
            let header = new_tar_header(&file_name, segment_data.len() as u64)?;
            self.ar
                .append(&header, segment_data.as_slice())
                .await
                .map_err(BasebackupError::Client)?;

            seg += 1;
            startblk = endblk;
        }

        Ok(())
    }

    //
    // Include database/tablespace directories.
    //
    // Each directory contains a PG_VERSION file, and the default database
    // directories also contain pg_filenode.map files.
    //
    async fn add_dbdir(
        &mut self,
        spcnode: u32,
        dbnode: u32,
        has_relmap_file: bool,
    ) -> Result<(), BasebackupError> {
        let relmap_img = if has_relmap_file {
            let img = self
                .timeline
                .get_relmap_file(spcnode, dbnode, Version::Lsn(self.lsn), self.ctx)
                .await
                .map_err(|e| BasebackupError::Server(e.into()))?;

            if img.len()
                != dispatch_pgversion!(self.timeline.pg_version, pgv::bindings::SIZEOF_RELMAPFILE)
            {
                return Err(BasebackupError::Server(anyhow!(
                    "img.len() != SIZEOF_RELMAPFILE, img.len()={}",
                    img.len(),
                )));
            }

            Some(img)
        } else {
            None
        };

        if spcnode == GLOBALTABLESPACE_OID {
            let pg_version_str = match self.timeline.pg_version {
                14 | 15 => self.timeline.pg_version.to_string(),
                ver => format!("{ver}\x0A"),
            };
            let header = new_tar_header("PG_VERSION", pg_version_str.len() as u64)?;
            self.ar
                .append(&header, pg_version_str.as_bytes())
                .await
                .map_err(BasebackupError::Client)?;

            info!("timeline.pg_version {}", self.timeline.pg_version);

            if let Some(img) = relmap_img {
                // filenode map for global tablespace
                let header = new_tar_header("global/pg_filenode.map", img.len() as u64)?;
                self.ar
                    .append(&header, &img[..])
                    .await
                    .map_err(BasebackupError::Client)?;
            } else {
                warn!("global/pg_filenode.map is missing");
            }
        } else {
            // User defined tablespaces are not supported. However, as
            // a special case, if a tablespace/db directory is
            // completely empty, we can leave it out altogether. This
            // makes taking a base backup after the 'tablespace'
            // regression test pass, because the test drops the
            // created tablespaces after the tests.
            //
            // FIXME: this wouldn't be necessary, if we handled
            // XLOG_TBLSPC_DROP records. But we probably should just
            // throw an error on CREATE TABLESPACE in the first place.
            if !has_relmap_file
                && self
                    .timeline
                    .list_rels(spcnode, dbnode, Version::Lsn(self.lsn), self.ctx)
                    .await
                    .map_err(|e| BasebackupError::Server(e.into()))?
                    .is_empty()
            {
                return Ok(());
            }
            // User defined tablespaces are not supported
            if spcnode != DEFAULTTABLESPACE_OID {
                return Err(BasebackupError::Server(anyhow!(
                    "spcnode != DEFAULTTABLESPACE_OID, spcnode={spcnode}"
                )));
            }

            // Append dir path for each database
            let path = format!("base/{}", dbnode);
            let header = new_tar_header_dir(&path)?;
            self.ar
                .append(&header, &mut io::empty())
                .await
                .map_err(BasebackupError::Client)?;

            if let Some(img) = relmap_img {
                let dst_path = format!("base/{}/PG_VERSION", dbnode);

                let pg_version_str = match self.timeline.pg_version {
                    14 | 15 => self.timeline.pg_version.to_string(),
                    ver => format!("{ver}\x0A"),
                };
                let header = new_tar_header(&dst_path, pg_version_str.len() as u64)?;
                self.ar
                    .append(&header, pg_version_str.as_bytes())
                    .await
                    .map_err(BasebackupError::Client)?;

                let relmap_path = format!("base/{}/pg_filenode.map", dbnode);
                let header = new_tar_header(&relmap_path, img.len() as u64)?;
                self.ar
                    .append(&header, &img[..])
                    .await
                    .map_err(BasebackupError::Client)?;
            }
        };
        Ok(())
    }

    //
    // Extract twophase state files
    //
    async fn add_twophase_file(&mut self, xid: u64) -> Result<(), BasebackupError> {
        let img = self
            .timeline
            .get_twophase_file(xid, self.lsn, self.ctx)
            .await
            .map_err(|e| BasebackupError::Server(e.into()))?;

        let mut buf = BytesMut::new();
        buf.extend_from_slice(&img[..]);
        let crc = crc32c::crc32c(&img[..]);
        buf.put_u32_le(crc);
        let path = if self.timeline.pg_version < 17 {
            format!("pg_twophase/{:>08X}", xid)
        } else {
            format!("pg_twophase/{:>016X}", xid)
        };
        let header = new_tar_header(&path, buf.len() as u64)?;
        self.ar
            .append(&header, &buf[..])
            .await
            .map_err(BasebackupError::Client)?;

        Ok(())
    }

    //
    // Add generated pg_control file and bootstrap WAL segment.
    // Also send zenith.signal file with extra bootstrap data.
    //
    async fn add_pgcontrol_file(&mut self) -> Result<(), BasebackupError> {
        // Add zenith.signal file. "PREV LSN: none" means the backup LSN still lies on the
        // ancestor timeline, so this timeline has no previous record of its own;
        // "PREV LSN: invalid" means we could not determine the previous record's LSN.
        let mut zenith_signal = String::new();
        if self.prev_record_lsn == Lsn(0) {
            if self.timeline.is_ancestor_lsn(self.lsn) {
                write!(zenith_signal, "PREV LSN: none")
                    .map_err(|e| BasebackupError::Server(e.into()))?;
            } else {
                write!(zenith_signal, "PREV LSN: invalid")
                    .map_err(|e| BasebackupError::Server(e.into()))?;
            }
        } else {
            write!(zenith_signal, "PREV LSN: {}", self.prev_record_lsn)
                .map_err(|e| BasebackupError::Server(e.into()))?;
        }
        self.ar
            .append(
                &new_tar_header("zenith.signal", zenith_signal.len() as u64)?,
                zenith_signal.as_bytes(),
            )
            .await
            .map_err(BasebackupError::Client)?;

        let checkpoint_bytes = self
            .timeline
            .get_checkpoint(self.lsn, self.ctx)
            .await
            .context("failed to get checkpoint bytes")?;
        let pg_control_bytes = self
            .timeline
            .get_control_file(self.lsn, self.ctx)
            .await
            .context("failed to get control bytes")?;

        let (pg_control_bytes, system_identifier) = postgres_ffi::generate_pg_control(
            &pg_control_bytes,
            &checkpoint_bytes,
            self.lsn,
            self.timeline.pg_version,
        )?;

        // send pg_control
        let header = new_tar_header("global/pg_control", pg_control_bytes.len() as u64)?;
        self.ar
            .append(&header, &pg_control_bytes[..])
            .await
            .map_err(BasebackupError::Client)?;

        // send wal segment
        let segno = self.lsn.segment_number(WAL_SEGMENT_SIZE);
        let wal_file_name = XLogFileName(PG_TLI, segno, WAL_SEGMENT_SIZE);
        let wal_file_path = format!("pg_wal/{}", wal_file_name);
        let header = new_tar_header(&wal_file_path, WAL_SEGMENT_SIZE as u64)?;

        let wal_seg = postgres_ffi::generate_wal_segment(
            segno,
            system_identifier,
            self.timeline.pg_version,
            self.lsn,
        )
        .map_err(|e| anyhow!(e).context("Failed generating wal segment"))?;
        if wal_seg.len() != WAL_SEGMENT_SIZE {
            return Err(BasebackupError::Server(anyhow!(
                "wal_seg.len() != WAL_SEGMENT_SIZE, wal_seg.len()={}",
                wal_seg.len()
            )));
        }
        self.ar
            .append(&header, &wal_seg[..])
            .await
            .map_err(BasebackupError::Client)?;
        Ok(())
    }
}

//
// Create new tarball entry header
//
fn new_tar_header(path: &str, size: u64) -> anyhow::Result<Header> {
    let mut header = Header::new_gnu();
    header.set_size(size);
    header.set_path(path)?;
    header.set_mode(0b110000000); // -rw-------
    header.set_mtime(
        // use current time as last modified time
        SystemTime::now()
            .duration_since(SystemTime::UNIX_EPOCH)
            .unwrap()
            .as_secs(),
    );
    header.set_cksum();
    Ok(header)
}

fn new_tar_header_dir(path: &str) -> anyhow::Result<Header> {
    let mut header = Header::new_gnu();
    header.set_size(0);
    header.set_path(path)?;
    header.set_mode(0o755); // drwxr-xr-x
    header.set_entry_type(EntryType::dir());
    header.set_mtime(
        // use current time as last modified time
        SystemTime::now()
            .duration_since(SystemTime::UNIX_EPOCH)
            .unwrap()
            .as_secs(),
    );
    header.set_cksum();
    Ok(header)
}