From e13bdd77fe97e0c081218639ca55668aac23aeaa Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Mon, 21 Mar 2022 14:42:24 +0400 Subject: [PATCH 001/296] add safekeepers gossip and storage messaging rfcs they were in prs during rfc repo import in addition to just import I've added sequence diagrams to storage messaging rfc --- docs/rfcs/014-safekeepers-gossip.md | 69 +++++++ docs/rfcs/015-storage-messaging.md | 295 ++++++++++++++++++++++++++++ 2 files changed, 364 insertions(+) create mode 100644 docs/rfcs/014-safekeepers-gossip.md create mode 100644 docs/rfcs/015-storage-messaging.md diff --git a/docs/rfcs/014-safekeepers-gossip.md b/docs/rfcs/014-safekeepers-gossip.md new file mode 100644 index 0000000000..3d6cc04b94 --- /dev/null +++ b/docs/rfcs/014-safekeepers-gossip.md @@ -0,0 +1,69 @@ +# Safekeeper gossip + +Extracted from this [PR](https://github.com/zenithdb/rfcs/pull/13) + +## Motivation + +In some situations, a safekeeper (SK) needs to coordinate with the other SKs that serve the same tenant: + +1. WAL deletion. SK needs to know what WAL was already safely replicated to delete it. Currently, we keep WAL indefinitely. +2. Deciding on who is sending WAL to the pageserver. Currently, a crash of the sending SK may lead to a livelock where nobody sends WAL to the pageserver. +3. To enable SK to SK direct recovery without involving the compute + +## Summary + +Compute node has connection strings to each safekeeper. During each compute->safekeeper connection establishment, the compute node should pass all of those connection strings to each safekeeper. With that info, safekeepers may establish Postgres connections to each other and periodically send ping messages with LSN payload. + +## Components + +safekeeper, compute, compute<->safekeeper protocol, possibly console (group SK addresses) + +## Proposed implementation + +Each safekeeper can periodically ping all its peers and share connectivity and liveness info. 
If a ping has not been received for, let's say, four ping periods, we may consider the sending safekeeper dead. That would mean one of the alive safekeepers should connect to the pageserver. One way to decide which one exactly: `make_connection = my_node_id == min(alive_nodes)` + +Since safekeepers are multi-tenant, we may establish either per-tenant physical connections or per-safekeeper ones. It makes sense to group "logical" connections between corresponding tenants on different nodes into a single physical connection. That means that we should implement an interconnect thread that maintains physical connections and periodically broadcasts info about all tenants. + +Right now console may assign any 3 SK addresses to a given compute node. That may lead to a high number of gossip connections between SKs. Instead, we can assign safekeeper triples to the compute node. But if we want to "break"/"change" a group by an ad-hoc action, we can do it. + +### Corner cases + +- Current safekeeper may be alive but may not have connectivity to the pageserver + + To address that, we need to gossip visibility info. Based on that info, we may define SK as alive only when it can connect to the pageserver. + +- Current safekeeper may be alive but may not have connectivity with the compute node. + + We may broadcast last_received_lsn and presence of compute connection and decide who is alive based on that. + +- It is tricky to decide when to shut down gossip connections because we need to be sure that the pageserver got all the committed (in the distributed sense, so local SK info is not enough) records, and that it will never lose them. It is not a strict requirement since the `--sync-safekeepers` run that happens before the compute starts will allow the pageserver to consume missing WAL, but it is better to do that in the background. 
So the condition may look like this: `majority_max(flush_lsn) == pageserver_s3_lsn` Here we rely on two facts: + - that `--sync-safekeepers` happened after the compute shutdown, and it advanced the local commit_lsn's, allowing the pageserver to consume that WAL. + + - we wait for the `pageserver_s3_lsn` advancement to avoid the pageserver's last_received_lsn/disk_consistent_lsn going backward due to a disk/hardware failure and subsequent S3 recovery + + If those conditions are not met, we will have some gossip activity (but that may be okay). + +## Pros/cons + +Pros: + +- distributed, does not introduce new services (like etcd), does not add console as a storage dependency +- lays the foundation for gossip-based recovery + +Cons: + +- Only the compute knows the set of safekeepers, but they should be able to communicate even without a compute node. If the safekeepers restart, we lose that info and can't gossip anymore. Hence we can't trim some WAL tail until the compute node starts. Also, it is ugly. + +- If the console assigns a random set of safekeepers to each Postgres, we may end up in a situation where each safekeeper needs to have a connection with all other safekeepers. We can group safekeepers into isolated triples in the console to avoid that. Then "mixing" would happen only if we do rebalancing. + +## Alternative implementation + +We can have a selected node (e.g., console) with everybody reporting to it. + +## Security implications + +We don't increase the attack surface here. Communication can happen in a private network that is not exposed to users. + +## Scalability implications + +The only thing that may grow as we grow the number of computes is the number of gossip connections. But if we group safekeepers and assign each compute node to a random SK triple, the number of connections would be constant. 
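A minimal sketch of the liveness rule from the proposed implementation above, assuming a one-second ping period (the RFC does not fix one) and a hypothetical `PeerView` helper that is not part of the safekeeper code. `majority_max` is only one possible reading of the shutdown condition: the highest LSN that a majority of safekeepers have flushed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

PING_PERIOD_SECS = 1.0        # assumption: the RFC does not fix a ping period
MISSED_PINGS_BEFORE_DEAD = 4  # "four ping periods" from the text above


@dataclass
class PeerView:
    """Hypothetical per-safekeeper record of when each peer last pinged us."""

    my_node_id: int
    last_ping_at: Dict[int, float] = field(default_factory=dict)

    def record_ping(self, peer_id: int, now: float) -> None:
        self.last_ping_at[peer_id] = now

    def alive_nodes(self, now: float) -> Set[int]:
        deadline = PING_PERIOD_SECS * MISSED_PINGS_BEFORE_DEAD
        alive = {p for p, t in self.last_ping_at.items() if now - t <= deadline}
        alive.add(self.my_node_id)  # a node always counts itself as alive
        return alive

    def should_connect_to_pageserver(self, now: float) -> bool:
        # make_connection = my_node_id == min(alive_nodes)
        return self.my_node_id == min(self.alive_nodes(now))


def majority_max(flush_lsns: List[int]) -> int:
    """Highest LSN flushed by at least a majority of the given safekeepers."""
    quorum = len(flush_lsns) // 2 + 1
    return sorted(flush_lsns, reverse=True)[quorum - 1]
```

With three safekeepers at flush LSNs 100, 150 and 200, `majority_max` gives 150, since two of the three have flushed at least that far.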
diff --git a/docs/rfcs/015-storage-messaging.md b/docs/rfcs/015-storage-messaging.md new file mode 100644 index 0000000000..47bc9eb89c --- /dev/null +++ b/docs/rfcs/015-storage-messaging.md @@ -0,0 +1,295 @@ +# Storage messaging + +Created on 19.01.22 + +Initially created [here](https://github.com/zenithdb/rfcs/pull/16) by @kelvich. + +This is an alternative to [014-safekeeper-gossip](014-safekeepers-gossip.md) + +## Motivation + +As in 014-safekeeper-gossip we need to solve the following problems: + +* Trim WAL on safekeepers +* Decide on which SK should push WAL to the S3 +* Decide on which SK should forward WAL to the pageserver +* Decide on when to shut down SK<->pageserver connection + +This RFC suggests a more generic and hopefully more manageable way to address those problems. However, unlike 014-safekeeper-gossip, it does not bring us any closer to safekeeper-to-safekeeper recovery but rather unties two sets of different issues we previously wanted to solve with gossip. + +Also, with this approach, we would not need "call me maybe" anymore, and the pageserver will have all the data required to understand that it needs to reconnect to another safekeeper. + +## Summary + +Instead of p2p gossip, let's have a centralized broker where all the storage nodes report per-timeline state. Each storage node should have a `--broker-url=1.2.3.4` CLI param. + +Here I propose two ways to do that. After a lot of arguing with myself, I'm leaning towards the etcd approach. My arguments for it are in the pros/cons section. Both options require adding a gRPC client in our codebase either directly or as an etcd dependency. + +## Non-goals + +This RFC does *not* suggest moving the compute-to-pageserver and compute-to-safekeeper mappings out of the console. The console is still the only place in the cluster responsible for the persistence of that info. So I'm implying that each pageserver and safekeeper knows exactly which timelines it serves, as it currently is. 
We need some mechanism for a new pageserver to discover mapping info, but that is out of the scope of this RFC. + +## Impacted components + +pageserver, safekeeper +adds either etcd or console as a storage dependency + +## Possible implementation: custom message broker in the console + +We've decided to go with an etcd approach instead of the message broker. + +
+Original suggestion +
+We can add a gRPC service in the console that acts as a message broker since the console knows the addresses of all the components. The broker can ignore the payload and only redirect messages. So, for example, each safekeeper may send a message to the peering safekeepers or to the pageserver responsible for a given timeline. + +Message format could be `{sender, destination, payload}`. + +The destination is either: +1. `sk_#{tenant}_#{timeline}` -- to be broadcast to all safekeepers responsible for that timeline, or +2. `pserver_#{tenant}_#{timeline}` -- to be broadcast to all pageservers responsible for that timeline + +Sender is either: +1. `sk_#{sk_id}`, or +2. `pserver_#{pserver_id}` + +I can think of the following behavior to address our original problems: + +* WAL trimming + Each safekeeper periodically broadcasts `(write_lsn, commit_lsn)` to all peering (peering == responsible for that timeline) safekeepers + +* Decide on which SK should push WAL to the S3 + + Each safekeeper periodically broadcasts an `i_am_alive_#{current_timestamp}` message to all peering safekeepers. That way, safekeepers may maintain the vector of alive peers (a loose one, with false negatives). The alive safekeeper with the minimal id pushes data to S3. + +* Decide on which SK should forward WAL to the pageserver + + Each safekeeper periodically sends `(write_lsn, commit_lsn, compute_connected)` to the relevant pageservers. With that info, the pageserver can maintain a view of the safekeepers' state, connect to a random one, and detect the moments (e.g., one of the safekeepers is not making progress or is down) when it needs to reconnect to another safekeeper. Pageserver should resolve exact IP addresses through the console, e.g., exchange `sk_#{sk_id}` for `4.5.6.7:6400`. + + Pageserver connection to the safekeeper is triggered by the state change `compute_connected: false -> true`. With that, we don't need "call me maybe" anymore. 
+ + Also, we don't have a "peer address amnesia" problem as in the gossip approach (with gossip, after a simultaneous reboot, safekeepers wouldn't know each other's addresses until the next compute connection). + +* Decide on when to shut down sk<->pageserver connection + + Again, pageserver would have all the info to understand when to shut down the safekeeper connection. + +### Scalability + +One node is enough (c) No, seriously, it is enough. + +### High Availability + +Broker lives in the console, so we can rely on k8s keeping the console app alive. + +If the console is down, we won't be able to trim WAL or reconnect the pageserver to another safekeeper. But, at the same time, if the console is down, we already can't accept new compute connections or start stopped computes, so we are making things a bit worse, but not dramatically. + +### Interactions + +``` + .________________. +sk_1 <-> | | <-> pserver_1 +... | Console broker | ... +sk_n <-> |________________| <-> pserver_m +``` +
+ + +## Implementation: etcd state store + +Alternatively, we can set up `etcd` and maintain the following data structure in it: + +```ruby +"compute_#{tenant}_#{timeline}" => { + "safekeepers" => { + "sk_#{sk_id}" => { + write_lsn: "0/AEDF130", + commit_lsn: "0/AEDF100", + compute_connected: true, + last_updated: 1642621138, + }, + } +} +``` + +As etcd doesn't support field updates in nested objects, that translates to the following set of keys: + +```ruby +"compute_#{tenant}_#{timeline}/safekeepers/sk_#{sk_id}/write_lsn", +"compute_#{tenant}_#{timeline}/safekeepers/sk_#{sk_id}/commit_lsn", +... +``` + +Each storage node can subscribe to the relevant sets of keys and maintain a local view of that structure. So in terms of the data flow, everything is the same as in the previous approach. Still, we can avoid implementing the message broker and avoid a runtime storage dependency on the console. + +### Safekeeper address discovery + +During startup, a safekeeper should publish the address it is listening on as part of `{"sk_#{sk_id}" => ip_address}`. Then the pageserver can resolve `sk_#{sk_id}` to the actual address. This way it would work both locally and in the cloud setup. Safekeeper should have an `--advertised-address` CLI option so that we can listen on e.g. 0.0.0.0 but advertise something more useful. + +### Safekeeper behavior + +For each timeline, a safekeeper periodically broadcasts the `compute_#{tenant}_#{timeline}/safekeepers/sk_#{sk_id}/*` fields. It subscribes to changes of `compute_#{tenant}_#{timeline}` -- that way the safekeeper will have information about peering safekeepers. +That amount of information is enough to properly trim WAL. To decide on who is pushing the data to S3, a safekeeper may use etcd leases or broadcast a timestamp and hence track who is alive. + +### Pageserver behavior + +Pageserver subscribes to `compute_#{tenant}_#{timeline}` for each tenant it owns. 
With that info, the pageserver can maintain a view of the safekeepers' state, connect to a random one, and detect the moments (e.g., one of the safekeepers is not making progress or is down) when it needs to reconnect to another safekeeper. Pageserver should resolve exact IP addresses through the console, e.g., exchange `sk_#{sk_id}` for `4.5.6.7:6400`. + +Pageserver connection to the safekeeper can be triggered by the state change `compute_connected: false -> true`. With that, we don't need "call me maybe" anymore. + +As an alternative to compute_connected, we can track the timestamp of the latest message that arrived at the safekeeper from the compute. Usually compute broadcasts KeepAlive to all safekeepers every second, so it'll be updated every second when the connection is ok. Then the connection can be considered down when this timestamp isn't updated for several seconds. + +This will help to detect issues with a safekeeper faster (and switch to another) in the following cases: + + 1. when compute failed but the TCP connection stays alive until timeout (usually about a minute) + 2. when safekeeper failed and didn't set compute_connected to false + +Another way to deal with [2] is to process (write_lsn, commit_lsn, compute_connected) as a KeepAlive on the pageserver side and detect issues when sk_id doesn't send anything for some time. This approach is fully compatible with this RFC. + +Also, we don't have a "peer address amnesia" problem as in the gossip approach (with gossip, after a simultaneous reboot, safekeepers wouldn't know each other's addresses until the next compute connection). + +### Interactions + +``` + .________________. +sk_1 <-> | | <-> pserver_1 +... | etcd | ... 
+sk_n <-> |________________| <-> pserver_m +``` + +### Sequence diagrams for different workflows + +#### Cluster startup + +```mermaid +sequenceDiagram + autonumber + participant C as Compute + participant SK1 + participant SK2 + participant SK3 + participant PS1 + participant PS2 + participant O as Orchestrator + participant M as Metadata Service + + PS1->>M: subscribe to updates to state of timeline N + C->>+SK1: WAL push + loop constantly update current lsns + SK1->>-M: I'm at lsn A + end + C->>+SK2: WAL push + loop constantly update current lsns + SK2->>-M: I'm at lsn B + end + C->>+SK3: WAL push + loop constantly update current lsns + SK3->>-M: I'm at lsn C + end + loop request pages + C->>+PS1: get_page@lsn + PS1->>-C: page image + end + M->>PS1: New compute appeared for timeline N. SK1 at A, SK2 at B, SK3 at C + note over PS1: Say SK1 at A=200, SK2 at B=150 SK3 at C=100
 so connect to SK1 because it is the most up to date one + PS1->>SK1: start replication +``` + +#### Behaviour of services during typical operations + +```mermaid +sequenceDiagram + autonumber + participant C as Compute + participant SK1 + participant SK2 + participant SK3 + participant PS1 + participant PS2 + participant O as Orchestrator + participant M as Metadata Service + + note over C,M: Scenario 1: Pageserver checkpoint + note over PS1: Upload data to S3 + PS1->>M: Update remote consistent lsn + M->>SK1: propagate remote consistent lsn update + note over SK1: truncate WAL up to remote consistent lsn + M->>SK2: propagate remote consistent lsn update + note over SK2: truncate WAL up to remote consistent lsn + M->>SK3: propagate remote consistent lsn update + note over SK3: truncate WAL up to remote consistent lsn + note over C,M: Scenario 2: SK1 finds itself lagging behind MAX(150 (SK3), 200 (SK2)) - 100 (SK1) > THRESHOLD + SK1->>SK2: Fetch WAL delta between 100 (SK1) and 200 (SK2) + note over C,M: Scenario 3: PS1 detects that SK1 is lagging behind: Connection from SK1 is broken or there have been no messages from it for 30 seconds. + note over PS1: e.g. 
 SK2 is at 150, SK3 is at 100, choose SK2 as a new replication source + PS1->>SK2: start replication +``` + +#### Behaviour during timeline relocation + +```mermaid +sequenceDiagram + autonumber + participant C as Compute + participant SK1 + participant SK2 + participant SK3 + participant PS1 + participant PS2 + participant O as Orchestrator + participant M as Metadata Service + + note over C,M: Timeline is being relocated from PS1 to PS2 + O->>+PS2: Attach timeline + PS2->>-O: 202 Accepted if timeline exists in S3 + note over PS2: Download timeline from S3 + note over O: Poll for timeline download (or subscribe to metadata service) + loop wait for attach to complete + O->>PS2: timeline detail should answer that timeline is ready + end + PS2->>M: Register downloaded timeline + PS2->>M: Get safekeepers for timeline, subscribe to changes + PS2->>SK1: Start replication to catch up + note over O: PS2 caught up, time to switch compute + O->>C: Restart compute with new pageserver url in config + note over C: WAL push is restarted + loop request pages + C->>+PS2: get_page@lsn + PS2->>-C: page image + end + O->>PS1: detach timeline + note over C,M: Scenario 1: Attach call failed + O--xPS2: Attach timeline + note over O: The operation can be safely retried, 
if we hit some threshold we can try another pageserver + note over C,M: Scenario 2: Attach succeeded but pageserver failed to download the data or start replication + loop wait for attach to complete + O--xPS2: timeline detail should answer that timeline is ready + end + note over O: Can wait for a timeout, and then try another pageserver
 there should be a limit on the number of different pageservers to try + note over C,M: Scenario 3: Detach fails + O--xPS1: Detach timeline + note over O: can be retried, if it continues to fail it might lead to data duplication in S3 +``` + +# Pros/cons + +## Console broker/etcd vs gossip: + +Gossip pros: +* gossip allows running storage without the console or etcd + +Console broker/etcd pros: +* simpler +* solves "call me maybe" as well +* avoids possible N-to-N connection issues with gossip without grouping safekeepers in pre-defined triples + +## Console broker vs. etcd: + +Initially, I wanted to avoid etcd as a dependency mostly because I've seen how painful the ZooKeeper dependency was for ClickHouse: in each chat, at each conference, people were complaining about configuration and maintenance barriers with ZooKeeper. It was so bad that ClickHouse re-implemented ZooKeeper to embed it: https://clickhouse.com/docs/en/operations/clickhouse-keeper/. + +But with etcd we are in a somewhat different situation: + +1. We don't need persistence and strong consistency guarantees for the data we store in etcd +2. etcd uses gRPC as a protocol, and messages are pretty simple + +So it looks like implementing an in-memory store with an etcd interface is a straightforward thing _if we want that in the future_. At the same time, we can avoid implementing it right now, and we will be able to run a local zenith installation with etcd running somewhere in the background (as opposed to building and running the console, which in turn requires Postgres). 
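The etcd key layout and the pageserver's choice of safekeeper might be sketched like this. All function names are hypothetical, the 30-second staleness limit is borrowed from the sequence diagram, and picking the most advanced `commit_lsn` follows the diagram rather than the "random one" wording in the text. No etcd client is involved; the view is a plain dictionary standing in for the subscribed keys.

```python
from typing import Dict, Optional

STALE_AFTER_SECS = 30  # assumption: taken from the "30 seconds" in the diagram


def flatten_sk_state(tenant: str, timeline: str, sk_id: int,
                     state: Dict[str, object]) -> Dict[str, str]:
    """Turn one safekeeper's nested state into flat etcd-style keys."""
    prefix = f"compute_{tenant}_{timeline}/safekeepers/sk_{sk_id}"
    return {f"{prefix}/{name}": str(value) for name, value in state.items()}


def parse_lsn(text: str) -> int:
    """Parse a Postgres-style LSN such as "0/AEDF130" into an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def pick_safekeeper(view: Dict[int, Dict[str, object]], now: int) -> Optional[int]:
    """Return the id of the non-stale safekeeper with the highest commit_lsn."""
    fresh = {
        sk: s for sk, s in view.items()
        if now - int(s["last_updated"]) <= STALE_AFTER_SECS
    }
    if not fresh:
        return None  # nobody reported recently, so there is nothing to connect to
    return max(fresh, key=lambda sk: parse_lsn(str(fresh[sk]["commit_lsn"])))
```

A pageserver would call something like `pick_safekeeper` again whenever its local view changes, which is also where it would notice that its current replication source has gone stale.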
From a4d0d78e9ec82b3cc848f8b467b865b0507fcdad Mon Sep 17 00:00:00 2001 From: Andrey Taranik Date: Wed, 23 Mar 2022 13:39:55 +0300 Subject: [PATCH 002/296] s3 settings for pageserver (#1388) --- .circleci/ansible/deploy.yaml | 14 ++++++++++++++ .circleci/ansible/production.hosts | 2 +- .circleci/ansible/staging.hosts | 2 +- 3 files changed, 16 insertions(+), 2 deletions(-) diff --git a/.circleci/ansible/deploy.yaml b/.circleci/ansible/deploy.yaml index 2dd109f99a..2379ef8510 100644 --- a/.circleci/ansible/deploy.yaml +++ b/.circleci/ansible/deploy.yaml @@ -91,6 +91,20 @@ tags: - pageserver + - name: update config + when: current_version > remote_version or force_deploy + lineinfile: + path: /storage/pageserver/data/pageserver.toml + line: "{{ item }}" + loop: + - "[remote_storage]" + - "bucket_name = '{{ bucket_name }}'" + - "bucket_region = '{{ bucket_region }}'" + - "prefix_in_bucket = '{{ inventory_hostname }}'" + become: true + tags: + - pageserver + - name: upload systemd service definition when: current_version > remote_version or force_deploy ansible.builtin.template: diff --git a/.circleci/ansible/production.hosts b/.circleci/ansible/production.hosts index c5b4f664a6..3a0543f39a 100644 --- a/.circleci/ansible/production.hosts +++ b/.circleci/ansible/production.hosts @@ -1,5 +1,5 @@ [pageservers] -zenith-1-ps-1 +zenith-1-ps-1 bucket_name=zenith-storage-oregon bucket_region=us-west-2 [safekeepers] zenith-1-sk-1 diff --git a/.circleci/ansible/staging.hosts b/.circleci/ansible/staging.hosts index e625120bf3..2987e2c6fa 100644 --- a/.circleci/ansible/staging.hosts +++ b/.circleci/ansible/staging.hosts @@ -1,5 +1,5 @@ [pageservers] -zenith-us-stage-ps-1 +zenith-us-stage-ps-1 bucket_name=zenith-staging-storage-us-east-1 bucket_region=us-east-1 [safekeepers] zenith-us-stage-sk-1 From 15434ba7e0f870683abe83d3e9994f00e5599f3f Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Tue, 22 Mar 2022 13:05:14 +0200 Subject: [PATCH 003/296] Show cachepot build stats --- 
Dockerfile | 2 ++ 1 file changed, 2 insertions(+) diff --git a/Dockerfile b/Dockerfile index 9ee6abaa8a..3bc1039129 100644 --- a/Dockerfile +++ b/Dockerfile @@ -31,6 +31,8 @@ COPY --from=pg-build /pg/tmp_install/include/postgresql/server tmp_install/inclu COPY . . RUN cargo build --release +# Show build caching stats to check if it was used +RUN /usr/local/cargo/bin/cachepot -s # Build final image # From 123fcd5d0dbeb6712d51fbd574e0dc16a7cb853d Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Wed, 23 Mar 2022 09:08:56 +0200 Subject: [PATCH 004/296] Revert accidental bump of vendor/postgres submodule I accidentally bumped it in commit 3b069f5aef. It didn't seem to cause any harm, but it was not intentional. --- vendor/postgres | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vendor/postgres b/vendor/postgres index 5e9bc37322..093aa160e5 160000 --- a/vendor/postgres +++ b/vendor/postgres @@ -1 +1 @@ -Subproject commit 5e9bc3732266c072151df20d6772b47ca51e233f +Subproject commit 093aa160e5df19814ff19b995d36dd5ee03c7f8b From e80ae4306aa009ce8154bf12269c49275551a582 Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Wed, 23 Mar 2022 16:47:05 +0400 Subject: [PATCH 005/296] change log level from info to debug for timeline gc messages --- pageserver/src/layered_repository.rs | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index c17df84689..64ac00ab56 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -1734,7 +1734,7 @@ impl LayeredTimeline { // 1. Is it newer than cutoff point? 
if l.get_end_lsn() > cutoff { - info!( + debug!( "keeping {} {}-{} because it's newer than cutoff {}", seg, l.get_start_lsn(), @@ -1757,7 +1757,7 @@ impl LayeredTimeline { for retain_lsn in &retain_lsns { // start_lsn is inclusive if &l.get_start_lsn() <= retain_lsn { - info!( + debug!( "keeping {} {}-{} because it's still might be referenced by child branch forked at {} is_dropped: {} is_incremental: {}", seg, l.get_start_lsn(), @@ -1783,7 +1783,7 @@ impl LayeredTimeline { disk_consistent_lsn, ) { - info!( + debug!( "keeping {} {}-{} because it is the latest layer", seg, l.get_start_lsn(), @@ -1806,7 +1806,7 @@ impl LayeredTimeline { // because LayerMap of this timeline is already locked. let mut is_tombstone = layers.layer_exists_at_lsn(l.get_seg_tag(), prior_lsn)?; if is_tombstone { - info!( + debug!( "earlier layer exists at {} in {}", prior_lsn, self.timelineid ); @@ -1819,7 +1819,7 @@ impl LayeredTimeline { { let prior_lsn = ancestor.get_last_record_lsn(); if seg.rel.is_blocky() { - info!( + debug!( "check blocky relish size {} at {} in {} for layer {}-{}", seg, prior_lsn, @@ -1831,7 +1831,7 @@ impl LayeredTimeline { Some(size) => { let (last_live_seg, _rel_blknum) = SegmentTag::from_blknum(seg.rel, size - 1); - info!( + debug!( "blocky rel size is {} last_live_seg.segno {} seg.segno {}", size, last_live_seg.segno, seg.segno ); @@ -1840,11 +1840,11 @@ impl LayeredTimeline { } } _ => { - info!("blocky rel doesn't exist"); + debug!("blocky rel doesn't exist"); } } } else { - info!( + debug!( "check non-blocky relish existence {} at {} in {} for layer {}-{}", seg, prior_lsn, @@ -1857,7 +1857,7 @@ impl LayeredTimeline { } if is_tombstone { - info!( + debug!( "keeping {} {}-{} because this layer serves as a tombstone for older layer", seg, l.get_start_lsn(), @@ -1874,7 +1874,7 @@ impl LayeredTimeline { } // We didn't find any reason to keep this file, so remove it. 
- info!( + debug!( "garbage collecting {} {}-{} is_dropped: {} is_incremental: {}", l.get_seg_tag(), l.get_start_lsn(), From 0be7ed0cb5c1ee0e52c67d28a2ebb3113b7d3c54 Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Wed, 23 Mar 2022 17:13:01 +0400 Subject: [PATCH 006/296] decrease log message severity for timeline checkpoint internals --- pageserver/src/layered_repository.rs | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 64ac00ab56..2c4393481d 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -1529,7 +1529,7 @@ impl LayeredTimeline { && oldest_lsn >= freeze_end_lsn // this layer intersects with evicted layer and so also need to be evicted { - info!( + debug!( "the oldest layer is now {} which is {} bytes behind last_record_lsn", oldest_layer.filename().display(), distance From 8a86276a6ef6a8f79e11a264087e6f22790d67c5 Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Wed, 23 Mar 2022 17:40:29 +0400 Subject: [PATCH 007/296] add more context to error --- pageserver/src/remote_storage/storage_sync/upload.rs | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/pageserver/src/remote_storage/storage_sync/upload.rs b/pageserver/src/remote_storage/storage_sync/upload.rs index 8fdd91dd18..431b5ec484 100644 --- a/pageserver/src/remote_storage/storage_sync/upload.rs +++ b/pageserver/src/remote_storage/storage_sync/upload.rs @@ -182,7 +182,13 @@ async fn try_upload_checkpoint< } }) .collect::>(); - ensure!(!files_to_upload.is_empty(), "No files to upload"); + + ensure!( + !files_to_upload.is_empty(), + "No files to upload. 
Upload request was: {:?}, already uploaded files: {:?}", + new_checkpoint.layers, + files_to_skip, + ); compression::archive_files_as_stream( &timeline_dir, From 8b8d78a3a01fddcd0ba3e6ad5af782f4a147e26f Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Wed, 23 Mar 2022 19:13:44 +0400 Subject: [PATCH 008/296] use main branch of our bookfile crate --- Cargo.lock | 2 +- pageserver/Cargo.toml | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/Cargo.lock b/Cargo.lock index a9de71420b..923f14e06e 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -246,7 +246,7 @@ dependencies = [ [[package]] name = "bookfile" version = "0.3.0" -source = "git+https://github.com/zenithdb/bookfile.git?branch=generic-readext#d51a99c7a0be48c3d9cc7cb85c9b7fb05ce1100c" +source = "git+https://github.com/zenithdb/bookfile.git?rev=bf6e43825dfb6e749ae9b80e8372c8fea76cec2f#bf6e43825dfb6e749ae9b80e8372c8fea76cec2f" dependencies = [ "aversion", "byteorder", diff --git a/pageserver/Cargo.toml b/pageserver/Cargo.toml index efd2fa4a38..46e6e2a8f1 100644 --- a/pageserver/Cargo.toml +++ b/pageserver/Cargo.toml @@ -4,7 +4,7 @@ version = "0.1.0" edition = "2021" [dependencies] -bookfile = { git = "https://github.com/zenithdb/bookfile.git", branch="generic-readext" } +bookfile = { git = "https://github.com/zenithdb/bookfile.git", rev="bf6e43825dfb6e749ae9b80e8372c8fea76cec2f" } chrono = "0.4.19" rand = "0.8.3" regex = "1.4.5" From 8437fc056e9c95c3a925df4dd4317f4454b8198c Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Wed, 23 Mar 2022 22:03:12 +0400 Subject: [PATCH 009/296] some follow ups after s3 integration was enabled on staging * do not error out when upload file list is empty * ignore ephemeral files during sync initialization --- pageserver/src/layered_repository.rs | 2 +- pageserver/src/remote_storage.rs | 8 ++++- .../src/remote_storage/storage_sync/upload.rs | 29 ++++++++++--------- 3 files changed, 24 insertions(+), 15 deletions(-) diff --git a/pageserver/src/layered_repository.rs 
b/pageserver/src/layered_repository.rs index 2c4393481d..9cb0a17e66 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -54,7 +54,7 @@ use zenith_utils::lsn::{AtomicLsn, Lsn, RecordLsn}; use zenith_utils::seqwait::SeqWait; mod delta_layer; -mod ephemeral_file; +pub(crate) mod ephemeral_file; mod filename; mod global_layer_map; mod image_layer; diff --git a/pageserver/src/remote_storage.rs b/pageserver/src/remote_storage.rs index 08fb16a679..6eb7bd910b 100644 --- a/pageserver/src/remote_storage.rs +++ b/pageserver/src/remote_storage.rs @@ -94,12 +94,13 @@ use std::{ use anyhow::{bail, Context}; use tokio::{io, sync::RwLock}; -use tracing::{error, info}; +use tracing::{debug, error, info}; use zenith_utils::zid::{ZTenantId, ZTenantTimelineId, ZTimelineId}; pub use self::storage_sync::index::{RemoteTimelineIndex, TimelineIndexEntry}; pub use self::storage_sync::{schedule_timeline_checkpoint_upload, schedule_timeline_download}; use self::{local_fs::LocalFs, rust_s3::S3}; +use crate::layered_repository::ephemeral_file::is_ephemeral_file; use crate::{ config::{PageServerConf, RemoteStorageKind}, layered_repository::metadata::{TimelineMetadata, METADATA_FILE_NAME}, @@ -261,6 +262,8 @@ fn collect_timelines_for_tenant( Ok(timelines) } +// discover timeline files and extract timeline metadata +// NOTE: ephemeral files are excluded from the list fn collect_timeline_files( timeline_dir: &Path, ) -> anyhow::Result<(ZTimelineId, TimelineMetadata, Vec)> { @@ -280,6 +283,9 @@ fn collect_timeline_files( if entry_path.is_file() { if entry_path.file_name().and_then(ffi::OsStr::to_str) == Some(METADATA_FILE_NAME) { timeline_metadata_path = Some(entry_path); + } else if is_ephemeral_file(&entry_path.file_name().unwrap().to_string_lossy()) { + debug!("skipping ephemeral file {}", entry_path.display()); + continue; } else { timeline_files.push(entry_path); } diff --git a/pageserver/src/remote_storage/storage_sync/upload.rs 
b/pageserver/src/remote_storage/storage_sync/upload.rs index 431b5ec484..dfc4433694 100644 --- a/pageserver/src/remote_storage/storage_sync/upload.rs +++ b/pageserver/src/remote_storage/storage_sync/upload.rs @@ -2,7 +2,6 @@ use std::{borrow::Cow, collections::BTreeSet, path::PathBuf, sync::Arc}; -use anyhow::ensure; use tokio::sync::RwLock; use tracing::{debug, error, warn}; @@ -95,7 +94,7 @@ pub(super) async fn upload_timeline_checkpoint< ) .await { - Ok((archive_header, header_size)) => { + Some(Ok((archive_header, header_size))) => { let mut index_write = index.write().await; match index_write .timeline_entry_mut(&sync_id) @@ -136,7 +135,7 @@ pub(super) async fn upload_timeline_checkpoint< debug!("Checkpoint uploaded successfully"); Some(true) } - Err(e) => { + Some(Err(e)) => { error!( "Failed to upload checkpoint: {:?}, requeueing the upload", e @@ -148,6 +147,7 @@ pub(super) async fn upload_timeline_checkpoint< )); Some(false) } + None => Some(true), } } @@ -160,7 +160,7 @@ async fn try_upload_checkpoint< sync_id: ZTenantTimelineId, new_checkpoint: &NewCheckpoint, files_to_skip: BTreeSet, -) -> anyhow::Result<(ArchiveHeader, u64)> { +) -> Option> { let ZTenantTimelineId { tenant_id, timeline_id, @@ -172,7 +172,7 @@ async fn try_upload_checkpoint< .iter() .filter(|&path_to_upload| { if files_to_skip.contains(path_to_upload) { - error!( + warn!( "Skipping file upload '{}', since it was already uploaded", path_to_upload.display() ); @@ -183,14 +183,15 @@ async fn try_upload_checkpoint< }) .collect::>(); - ensure!( - !files_to_upload.is_empty(), - "No files to upload. Upload request was: {:?}, already uploaded files: {:?}", - new_checkpoint.layers, - files_to_skip, - ); + if files_to_upload.is_empty() { + warn!( + "No files to upload. 
Upload request was: {:?}, already uploaded files: {:?}", + new_checkpoint.layers, files_to_skip + ); + return None; + } - compression::archive_files_as_stream( + let upload_result = compression::archive_files_as_stream( &timeline_dir, files_to_upload.into_iter(), &new_checkpoint.metadata, @@ -206,7 +207,9 @@ async fn try_upload_checkpoint< }, ) .await - .map(|(header, header_size, _)| (header, header_size)) + .map(|(header, header_size, _)| (header, header_size)); + + Some(upload_result) } #[cfg(test)] From c7188705173e41ac742dd9738b5a99699552a8eb Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Thu, 24 Mar 2022 09:46:07 +0200 Subject: [PATCH 010/296] Tiny refactoring of page_cache::init function. The init function only needs the 'page_cache_size' from the config, so seems slightly nicer to pass just that. --- pageserver/src/bin/pageserver.rs | 3 +-- pageserver/src/page_cache.rs | 9 +++------ 2 files changed, 4 insertions(+), 8 deletions(-) diff --git a/pageserver/src/bin/pageserver.rs b/pageserver/src/bin/pageserver.rs index 05fb14daca..a2564d51d7 100644 --- a/pageserver/src/bin/pageserver.rs +++ b/pageserver/src/bin/pageserver.rs @@ -163,8 +163,7 @@ fn main() -> Result<()> { // Basic initialization of things that don't change after startup virtual_file::init(conf.max_file_descriptors); - - page_cache::init(conf); + page_cache::init(conf.page_cache_size); // Create repo and exit if init was requested if init { diff --git a/pageserver/src/page_cache.rs b/pageserver/src/page_cache.rs index b0c8d3a5d7..2992d9477b 100644 --- a/pageserver/src/page_cache.rs +++ b/pageserver/src/page_cache.rs @@ -53,7 +53,7 @@ use zenith_utils::{ }; use crate::layered_repository::writeback_ephemeral_file; -use crate::{config::PageServerConf, relish::RelTag}; +use crate::relish::RelTag; static PAGE_CACHE: OnceCell = OnceCell::new(); const TEST_PAGE_CACHE_SIZE: usize = 10; @@ -61,11 +61,8 @@ const TEST_PAGE_CACHE_SIZE: usize = 10; /// /// Initialize the page cache. 
This must be called once at page server startup. /// -pub fn init(conf: &'static PageServerConf) { - if PAGE_CACHE - .set(PageCache::new(conf.page_cache_size)) - .is_err() - { +pub fn init(size: usize) { + if PAGE_CACHE.set(PageCache::new(size)).is_err() { panic!("page cache already initialized"); } } From d3a9cb44a659b11d0df7f7e2fbded9e388fbe917 Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Thu, 24 Mar 2022 02:05:35 +0400 Subject: [PATCH 011/296] tweak timeouts for tenant relocation test --- test_runner/batch_others/test_tenant_relocation.py | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/test_runner/batch_others/test_tenant_relocation.py b/test_runner/batch_others/test_tenant_relocation.py index 32fbc8f872..8213d2526b 100644 --- a/test_runner/batch_others/test_tenant_relocation.py +++ b/test_runner/batch_others/test_tenant_relocation.py @@ -3,10 +3,8 @@ import os import pathlib import subprocess import threading -from typing import Dict from uuid import UUID from fixtures.log_helper import log -import time import signal import pytest @@ -15,7 +13,6 @@ from fixtures.utils import lsn_from_hex def assert_abs_margin_ratio(a: float, b: float, margin_ratio: float): - print("!" 
* 100, abs(a - b) / a) assert abs(a - b) / a < margin_ratio, abs(a - b) / a @@ -235,10 +232,10 @@ def test_tenant_relocation(zenith_env_builder: ZenithEnvBuilder, assert cur.fetchone() == (2001000, ) if with_load == 'with_load': - assert load_ok_event.wait(1) + assert load_ok_event.wait(3) log.info('stopping load thread') load_stop_event.set() - load_thread.join() + load_thread.join(timeout=10) log.info('load thread stopped') # bring old pageserver back for clean shutdown via zenith cli From b9a1a75b0d21fee7818777f91d2f297273d9d631 Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Thu, 24 Mar 2022 11:48:50 +0400 Subject: [PATCH 012/296] clean up unused imports in python tests --- test_runner/batch_others/test_gc_aggressive.py | 7 ++----- test_runner/batch_others/test_next_xid.py | 3 --- test_runner/batch_others/test_old_request_lsn.py | 2 -- test_runner/batch_others/test_pageserver_api.py | 2 +- test_runner/batch_others/test_pageserver_catchup.py | 7 ------- test_runner/batch_others/test_pageserver_restart.py | 6 ------ test_runner/batch_others/test_remote_storage.py | 2 +- test_runner/batch_others/test_snapfiles_gc.py | 1 - test_runner/batch_others/test_timeline_size.py | 1 - test_runner/batch_others/test_zenith_cli.py | 2 -- 10 files changed, 4 insertions(+), 29 deletions(-) diff --git a/test_runner/batch_others/test_gc_aggressive.py b/test_runner/batch_others/test_gc_aggressive.py index 9de6ba9f59..e4e4aa9f4a 100644 --- a/test_runner/batch_others/test_gc_aggressive.py +++ b/test_runner/batch_others/test_gc_aggressive.py @@ -1,10 +1,7 @@ -from contextlib import closing - import asyncio -import asyncpg import random -from fixtures.zenith_fixtures import ZenithEnv, Postgres, Safekeeper +from fixtures.zenith_fixtures import ZenithEnv, Postgres from fixtures.log_helper import log # Test configuration @@ -76,5 +73,5 @@ def test_gc_aggressive(zenith_simple_env: ZenithEnv): asyncio.run(update_and_gc(env, pg, timeline)) - row = cur.execute('SELECT COUNT(*), 
SUM(counter) FROM foo') + cur.execute('SELECT COUNT(*), SUM(counter) FROM foo') assert cur.fetchone() == (num_rows, updates_to_perform) diff --git a/test_runner/batch_others/test_next_xid.py b/test_runner/batch_others/test_next_xid.py index fd0f761409..03c27bcd70 100644 --- a/test_runner/batch_others/test_next_xid.py +++ b/test_runner/batch_others/test_next_xid.py @@ -1,9 +1,6 @@ -import pytest -import random import time from fixtures.zenith_fixtures import ZenithEnvBuilder -from fixtures.log_helper import log # Test restarting page server, while safekeeper and compute node keep diff --git a/test_runner/batch_others/test_old_request_lsn.py b/test_runner/batch_others/test_old_request_lsn.py index d09fb24913..e7400cff96 100644 --- a/test_runner/batch_others/test_old_request_lsn.py +++ b/test_runner/batch_others/test_old_request_lsn.py @@ -1,5 +1,3 @@ -from contextlib import closing - from fixtures.zenith_fixtures import ZenithEnv from fixtures.log_helper import log diff --git a/test_runner/batch_others/test_pageserver_api.py b/test_runner/batch_others/test_pageserver_api.py index 965ba9bcc3..13f6ef358e 100644 --- a/test_runner/batch_others/test_pageserver_api.py +++ b/test_runner/batch_others/test_pageserver_api.py @@ -1,6 +1,6 @@ from uuid import uuid4, UUID import pytest -from fixtures.zenith_fixtures import ZenithEnv, ZenithEnvBuilder, ZenithPageserverHttpClient, zenith_binpath +from fixtures.zenith_fixtures import ZenithEnv, ZenithEnvBuilder, ZenithPageserverHttpClient # test that we cannot override node id diff --git a/test_runner/batch_others/test_pageserver_catchup.py b/test_runner/batch_others/test_pageserver_catchup.py index 7093a1bdb3..3c4b7f9569 100644 --- a/test_runner/batch_others/test_pageserver_catchup.py +++ b/test_runner/batch_others/test_pageserver_catchup.py @@ -1,11 +1,4 @@ -import pytest -import random -import time - -from contextlib import closing -from multiprocessing import Process, Value from fixtures.zenith_fixtures import ZenithEnvBuilder 
-from fixtures.log_helper import log # Test safekeeper sync and pageserver catch up diff --git a/test_runner/batch_others/test_pageserver_restart.py b/test_runner/batch_others/test_pageserver_restart.py index 57f9db8f96..20e6f4467e 100644 --- a/test_runner/batch_others/test_pageserver_restart.py +++ b/test_runner/batch_others/test_pageserver_restart.py @@ -1,9 +1,3 @@ -import pytest -import random -import time - -from contextlib import closing -from multiprocessing import Process, Value from fixtures.zenith_fixtures import ZenithEnvBuilder from fixtures.log_helper import log diff --git a/test_runner/batch_others/test_remote_storage.py b/test_runner/batch_others/test_remote_storage.py index 07a122ede9..e762f8589a 100644 --- a/test_runner/batch_others/test_remote_storage.py +++ b/test_runner/batch_others/test_remote_storage.py @@ -1,7 +1,7 @@ # It's possible to run any regular test with the local fs remote storage via # env ZENITH_PAGESERVER_OVERRIDES="remote_storage={local_path='/tmp/zenith_zzz/'}" poetry ...... 
-import time, shutil, os +import shutil, os from contextlib import closing from pathlib import Path from uuid import UUID diff --git a/test_runner/batch_others/test_snapfiles_gc.py b/test_runner/batch_others/test_snapfiles_gc.py index c6d4512bc9..d00af53864 100644 --- a/test_runner/batch_others/test_snapfiles_gc.py +++ b/test_runner/batch_others/test_snapfiles_gc.py @@ -1,6 +1,5 @@ from contextlib import closing import psycopg2.extras -import time from fixtures.utils import print_gc_result from fixtures.zenith_fixtures import ZenithEnv from fixtures.log_helper import log diff --git a/test_runner/batch_others/test_timeline_size.py b/test_runner/batch_others/test_timeline_size.py index 0b341746ee..db33493d61 100644 --- a/test_runner/batch_others/test_timeline_size.py +++ b/test_runner/batch_others/test_timeline_size.py @@ -1,5 +1,4 @@ from contextlib import closing -from uuid import UUID import psycopg2.extras import psycopg2.errors from fixtures.zenith_fixtures import ZenithEnv, ZenithEnvBuilder, Postgres, assert_local diff --git a/test_runner/batch_others/test_zenith_cli.py b/test_runner/batch_others/test_zenith_cli.py index 4a62a1430a..091d9ac8ba 100644 --- a/test_runner/batch_others/test_zenith_cli.py +++ b/test_runner/batch_others/test_zenith_cli.py @@ -1,8 +1,6 @@ -import json import uuid import requests -from psycopg2.extensions import cursor as PgCursor from fixtures.zenith_fixtures import ZenithEnv, ZenithEnvBuilder, ZenithPageserverHttpClient from typing import cast From 825d3631707016717f05ae5bcb7c112af9feba8f Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Thu, 24 Mar 2022 12:17:56 +0200 Subject: [PATCH 013/296] Remove some unnecessary Ord etc. trait implementations. It doesn't make much sense to compare TimelineMetadata structs with < or >. But we depended on that in the remote storage upload code, so replace BTreeSets with Vecs there. 
--- pageserver/src/layered_repository/metadata.rs | 2 +- pageserver/src/remote_storage/storage_sync.rs | 18 +++++++++--------- 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/pageserver/src/layered_repository/metadata.rs b/pageserver/src/layered_repository/metadata.rs index 960a1b7fe3..99d786c4cd 100644 --- a/pageserver/src/layered_repository/metadata.rs +++ b/pageserver/src/layered_repository/metadata.rs @@ -28,7 +28,7 @@ pub const METADATA_FILE_NAME: &str = "metadata"; /// Metadata stored on disk for each timeline /// /// The fields correspond to the values we hold in memory, in LayeredTimeline. -#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)] +#[derive(Debug, Clone, PartialEq, Eq)] pub struct TimelineMetadata { disk_consistent_lsn: Lsn, // This is only set if we know it. We track it in memory when the page diff --git a/pageserver/src/remote_storage/storage_sync.rs b/pageserver/src/remote_storage/storage_sync.rs index f1483375cb..4ad28e6f8f 100644 --- a/pageserver/src/remote_storage/storage_sync.rs +++ b/pageserver/src/remote_storage/storage_sync.rs @@ -142,7 +142,7 @@ lazy_static! { /// mpsc approach was picked to allow blocking the sync loop if no tasks are present, to avoid meaningless spinning. mod sync_queue { use std::{ - collections::{BTreeSet, HashMap}, + collections::HashMap, sync::atomic::{AtomicUsize, Ordering}, }; @@ -205,9 +205,9 @@ mod sync_queue { pub async fn next_task_batch( receiver: &mut UnboundedReceiver, mut max_batch_size: usize, - ) -> BTreeSet { + ) -> Vec { if max_batch_size == 0 { - return BTreeSet::new(); + return Vec::new(); } let mut tasks = HashMap::with_capacity(max_batch_size); @@ -244,7 +244,7 @@ mod sync_queue { /// A task to run in the async download/upload loop. /// Limited by the number of retries, after certain threshold the failing task gets evicted and the timeline disabled. 
-#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone)] +#[derive(Debug, Clone)] pub struct SyncTask { sync_id: ZTenantTimelineId, retries: u32, @@ -261,7 +261,7 @@ impl SyncTask { } } -#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone)] +#[derive(Debug, Clone)] enum SyncKind { /// A certain amount of images (archive files) to download. Download(TimelineDownload), @@ -281,7 +281,7 @@ impl SyncKind { /// Local timeline files for upload, appeared after the new checkpoint. /// Current checkpoint design assumes new files are added only, no deletions or amendment happens. -#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone)] +#[derive(Debug, Clone)] pub struct NewCheckpoint { /// Relish file paths in the pageserver workdir, that were added for the corresponding checkpoint. layers: Vec, @@ -289,7 +289,7 @@ pub struct NewCheckpoint { } /// Info about the remote image files. -#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone)] +#[derive(Debug, Clone)] struct TimelineDownload { files_to_skip: Arc>, archives_to_skip: BTreeSet, @@ -485,11 +485,11 @@ async fn loop_step< max_sync_errors: NonZeroU32, ) -> HashMap> { let max_concurrent_sync = max_concurrent_sync.get(); - let mut next_tasks = BTreeSet::new(); + let mut next_tasks = Vec::new(); // request the first task in blocking fashion to do less meaningless work if let Some(first_task) = sync_queue::next_task(receiver).await { - next_tasks.insert(first_task); + next_tasks.push(first_task); } else { debug!("Shutdown requested, stopping"); return HashMap::new(); From a201d33edceacf8c1687f4dce9e94230f25be064 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Thu, 24 Mar 2022 13:27:14 +0200 Subject: [PATCH 014/296] Properly print cachepot stats --- Dockerfile | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/Dockerfile b/Dockerfile index 3bc1039129..5e55cd834f 100644 --- a/Dockerfile +++ b/Dockerfile @@ -30,9 +30,9 @@ ENV RUSTC_WRAPPER /usr/local/cargo/bin/cachepot COPY --from=pg-build 
/pg/tmp_install/include/postgresql/server tmp_install/include/postgresql/server COPY . . -RUN cargo build --release -# Show build caching stats to check if it was used -RUN /usr/local/cargo/bin/cachepot -s +# Show build caching stats to check if it was used in the end. +# Has to be part of the same RUN since the cachepot daemon is killed at the end of this RUN, losing the compilation stats. +RUN cargo build --release && /usr/local/cargo/bin/cachepot -s # Build final image # From edc7bebcb5a452ad84c5c3cfd46b727c6e6f1c48 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Thu, 17 Mar 2022 18:52:27 +0200 Subject: [PATCH 015/296] Remove obvious panic sources --- pageserver/src/basebackup.rs | 21 +++++----- pageserver/src/bin/pageserver.rs | 8 ++-- pageserver/src/import_datadir.rs | 21 +++++----- pageserver/src/layered_repository.rs | 21 ++++++---- .../src/layered_repository/inmemory_layer.rs | 10 ++--- pageserver/src/page_cache.rs | 7 ++-- pageserver/src/page_service.rs | 1 - pageserver/src/tenant_threads.rs | 2 +- pageserver/src/thread_mgr.rs | 2 +- pageserver/src/timelines.rs | 6 +-- pageserver/src/virtual_file.rs | 3 +- pageserver/src/walingest.rs | 2 +- pageserver/src/walredo.rs | 42 ++++++++++++------- 13 files changed, 84 insertions(+), 62 deletions(-) diff --git a/pageserver/src/basebackup.rs b/pageserver/src/basebackup.rs index 1ee48eb2fc..c316fc43d1 100644 --- a/pageserver/src/basebackup.rs +++ b/pageserver/src/basebackup.rs @@ -145,16 +145,17 @@ impl<'a> Basebackup<'a> { .timeline .get_relish_size(RelishTag::Slru { slru, segno }, self.lsn)?; - if seg_size == None { - trace!( - "SLRU segment {}/{:>04X} was truncated", - slru.to_str(), - segno - ); - return Ok(()); - } - - let nblocks = seg_size.unwrap(); + let nblocks = match seg_size { + Some(seg_size) => seg_size, + None => { + trace!( + "SLRU segment {}/{:>04X} was truncated", + slru.to_str(), + segno + ); + return Ok(()); + } + }; let mut slru_buf: Vec = Vec::with_capacity(nblocks as usize *
pg_constants::BLCKSZ as usize); diff --git a/pageserver/src/bin/pageserver.rs b/pageserver/src/bin/pageserver.rs index a2564d51d7..5a1b5e5e2c 100644 --- a/pageserver/src/bin/pageserver.rs +++ b/pageserver/src/bin/pageserver.rs @@ -30,7 +30,7 @@ use zenith_utils::postgres_backend; use zenith_utils::shutdown::exit_now; use zenith_utils::signals::{self, Signal}; -fn main() -> Result<()> { +fn main() -> anyhow::Result<()> { zenith_metrics::set_common_metrics_prefix("pageserver"); let arg_matches = App::new("Zenith page server") .about("Materializes WAL stream to pages and serves them to the postgres") @@ -116,7 +116,7 @@ fn main() -> Result<()> { // We're initializing the repo, so there's no config file yet DEFAULT_CONFIG_FILE .parse::() - .expect("could not parse built-in config file") + .context("could not parse built-in config file")? } else { // Supplement the CLI arguments with the config file let cfg_file_contents = std::fs::read_to_string(&cfg_file_path) @@ -209,7 +209,9 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<() // There shouldn't be any logging to stdin/stdout. Redirect it to the main log so // that we will see any accidental manual fprintf's or backtraces. 
- let stdout = log_file.try_clone().unwrap(); + let stdout = log_file + .try_clone() + .with_context(|| format!("Failed to clone log file '{:?}'", log_file))?; let stderr = log_file; let daemonize = Daemonize::new() diff --git a/pageserver/src/import_datadir.rs b/pageserver/src/import_datadir.rs index e317118bb5..1e691fb2fe 100644 --- a/pageserver/src/import_datadir.rs +++ b/pageserver/src/import_datadir.rs @@ -70,11 +70,11 @@ pub fn import_timeline_from_postgres_datadir( let direntry = direntry?; //skip all temporary files - if direntry.file_name().to_str().unwrap() == "pgsql_tmp" { + if direntry.file_name().to_string_lossy() == "pgsql_tmp" { continue; } - let dboid = direntry.file_name().to_str().unwrap().parse::()?; + let dboid = direntry.file_name().to_string_lossy().parse::()?; for direntry in fs::read_dir(direntry.path())? { let direntry = direntry?; @@ -117,7 +117,7 @@ pub fn import_timeline_from_postgres_datadir( } for entry in fs::read_dir(path.join("pg_twophase"))? { let entry = entry?; - let xid = u32::from_str_radix(entry.path().to_str().unwrap(), 16)?; + let xid = u32::from_str_radix(&entry.path().to_string_lossy(), 16)?; import_nonrel_file(writer, lsn, RelishTag::TwoPhase { xid }, &entry.path())?; } // TODO: Scan pg_tblspc @@ -156,16 +156,15 @@ fn import_relfile( lsn: Lsn, spcoid: Oid, dboid: Oid, -) -> Result<()> { +) -> anyhow::Result<()> { // Does it look like a relation file? 
trace!("importing rel file {}", path.display()); - let p = parse_relfilename(path.file_name().unwrap().to_str().unwrap()); - if let Err(e) = p { - warn!("unrecognized file in postgres datadir: {:?} ({})", path, e); - return Err(e.into()); - } - let (relnode, forknum, segno) = p.unwrap(); + let (relnode, forknum, segno) = parse_relfilename(&path.file_name().unwrap().to_string_lossy()) + .map_err(|e| { + warn!("unrecognized file in postgres datadir: {:?} ({})", path, e); + e + })?; let mut file = File::open(path)?; let mut buf: [u8; 8192] = [0u8; 8192]; @@ -271,7 +270,7 @@ fn import_slru_file( // Does it look like an SLRU file? let mut file = File::open(path)?; let mut buf: [u8; 8192] = [0u8; 8192]; - let segno = u32::from_str_radix(path.file_name().unwrap().to_str().unwrap(), 16)?; + let segno = u32::from_str_radix(&path.file_name().unwrap().to_string_lossy(), 16)?; trace!("importing slru file {}", path.display()); diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 9cb0a17e66..4d8d0ada24 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -11,7 +11,7 @@ //! parent timeline, and the last LSN that has been written to disk. //! -use anyhow::{bail, ensure, Context, Result}; +use anyhow::{anyhow, bail, ensure, Context, Result}; use bookfile::Book; use bytes::Bytes; use lazy_static::lazy_static; @@ -1157,9 +1157,9 @@ impl LayeredTimeline { for direntry in fs::read_dir(timeline_path)? { let direntry = direntry?; let fname = direntry.file_name(); - let fname = fname.to_str().unwrap(); + let fname = fname.to_string_lossy(); - if let Some(imgfilename) = ImageFileName::parse_str(fname) { + if let Some(imgfilename) = ImageFileName::parse_str(&fname) { // create an ImageLayer struct for each image file. 
if imgfilename.lsn > disk_consistent_lsn { warn!( @@ -1177,7 +1177,7 @@ impl LayeredTimeline { trace!("found layer {}", layer.filename().display()); layers.insert_historic(Arc::new(layer)); num_layers += 1; - } else if let Some(deltafilename) = DeltaFileName::parse_str(fname) { + } else if let Some(deltafilename) = DeltaFileName::parse_str(&fname) { // Create a DeltaLayer struct for each delta file. ensure!(deltafilename.start_lsn < deltafilename.end_lsn); // The end-LSN is exclusive, while disk_consistent_lsn is @@ -1203,7 +1203,7 @@ impl LayeredTimeline { num_layers += 1; } else if fname == METADATA_FILE_NAME || fname.ends_with(".old") { // ignore these - } else if is_ephemeral_file(fname) { + } else if is_ephemeral_file(&fname) { // Delete any old ephemeral files trace!("deleting old ephemeral file in timeline dir: {}", fname); fs::remove_file(direntry.path())?; @@ -1938,7 +1938,7 @@ impl LayeredTimeline { seg_blknum: SegmentBlk, lsn: Lsn, layer: &dyn Layer, - ) -> Result { + ) -> anyhow::Result { // Check the page cache. We will get back the most recent page with lsn <= `lsn`. // The cached image can be returned directly if there is no WAL between the cached image // and requested LSN. 
The cached image can also be used to reduce the amount of WAL needed @@ -1950,7 +1950,9 @@ impl LayeredTimeline { match cached_lsn.cmp(&lsn) { cmp::Ordering::Less => {} // there might be WAL between cached_lsn and lsn, we need to check cmp::Ordering::Equal => return Ok(cached_img), // exact LSN match, return the image - cmp::Ordering::Greater => panic!(), // the returned lsn should never be after the requested lsn + cmp::Ordering::Greater => { + bail!("the returned lsn should never be after the requested lsn") + } } Some((cached_lsn, cached_img)) } @@ -2341,7 +2343,10 @@ pub fn dump_layerfile_from_path(path: &Path) -> Result<()> { /// Add a suffix to a layer file's name: .{num}.old /// Uses the first available num (starts at 0) fn rename_to_backup(path: PathBuf) -> anyhow::Result<()> { - let filename = path.file_name().unwrap().to_str().unwrap(); + let filename = path + .file_name() + .ok_or_else(|| anyhow!("Path {} doesn't have a file name", path.display()))? + .to_string_lossy(); let mut new_path = path.clone(); for i in 0u32.. { diff --git a/pageserver/src/layered_repository/inmemory_layer.rs b/pageserver/src/layered_repository/inmemory_layer.rs index 6e24bf6022..239fb341a5 100644 --- a/pageserver/src/layered_repository/inmemory_layer.rs +++ b/pageserver/src/layered_repository/inmemory_layer.rs @@ -17,7 +17,7 @@ use crate::layered_repository::LayeredTimeline; use crate::layered_repository::ZERO_PAGE; use crate::repository::ZenithWalRecord; use crate::{ZTenantId, ZTimelineId}; -use anyhow::{ensure, Result}; +use anyhow::{ensure, Result, bail}; use bytes::Bytes; use log::*; use std::collections::HashMap; @@ -150,9 +150,9 @@ impl InMemoryLayerInner { let pos = self.file.stream_position()?; // make room for the 'length' field by writing zeros as a placeholder.
- self.file.seek(std::io::SeekFrom::Start(pos + 4)).unwrap(); + self.file.seek(std::io::SeekFrom::Start(pos + 4))?; - pv.ser_into(&mut self.file).unwrap(); + pv.ser_into(&mut self.file)?; // write the 'length' field. let len = self.file.stream_position()? - pos - 4; @@ -315,7 +315,7 @@ impl Layer for InMemoryLayer { return Ok(false); } } else { - panic!("dropped in-memory layer with no end LSN"); + bail!("dropped in-memory layer with no end LSN"); } } @@ -333,7 +333,7 @@ impl Layer for InMemoryLayer { /// Nothing to do here. When you drop the last reference to the layer, it will /// be deallocated. fn delete(&self) -> Result<()> { - panic!("can't delete an InMemoryLayer") + bail!("can't delete an InMemoryLayer") } fn is_incremental(&self) -> bool { diff --git a/pageserver/src/page_cache.rs b/pageserver/src/page_cache.rs index 2992d9477b..ef802ba0e2 100644 --- a/pageserver/src/page_cache.rs +++ b/pageserver/src/page_cache.rs @@ -732,9 +732,10 @@ impl PageCache { CacheKey::MaterializedPage { hash_key: _, lsn: _, - } => { - panic!("unexpected dirty materialized page"); - } + } => Err(std::io::Error::new( + std::io::ErrorKind::Other, + "unexpected dirty materialized page", + )), CacheKey::EphemeralPage { file_id, blkno } => { writeback_ephemeral_file(*file_id, *blkno, buf) } diff --git a/pageserver/src/page_service.rs b/pageserver/src/page_service.rs index 6e6b6415f3..6acdc8e93d 100644 --- a/pageserver/src/page_service.rs +++ b/pageserver/src/page_service.rs @@ -574,7 +574,6 @@ impl postgres_backend::Handler for PageServerHandler { let data = self .auth .as_ref() - .as_ref() .unwrap() .decode(str::from_utf8(jwt_response)?)?; diff --git a/pageserver/src/tenant_threads.rs b/pageserver/src/tenant_threads.rs index 062af9f1ad..c370eb61c8 100644 --- a/pageserver/src/tenant_threads.rs +++ b/pageserver/src/tenant_threads.rs @@ -49,7 +49,7 @@ pub fn gc_loop(tenantid: ZTenantId, conf: &'static PageServerConf) -> Result<()> // Garbage collect old files that are not needed for 
PITR anymore if conf.gc_horizon > 0 { let repo = tenant_mgr::get_repository_for_tenant(tenantid)?; - repo.gc_iteration(None, conf.gc_horizon, false).unwrap(); + repo.gc_iteration(None, conf.gc_horizon, false)?; } // TODO Write it in more adequate way using diff --git a/pageserver/src/thread_mgr.rs b/pageserver/src/thread_mgr.rs index a51f0909ca..d24d6bf016 100644 --- a/pageserver/src/thread_mgr.rs +++ b/pageserver/src/thread_mgr.rs @@ -250,7 +250,7 @@ pub fn shutdown_threads( let _ = join_handle.join(); } else { // The thread had not even fully started yet. Or it was shut down - // concurrently and alrady exited + // concurrently and already exited } } } diff --git a/pageserver/src/timelines.rs b/pageserver/src/timelines.rs index 00dd0f8f9c..8c018ce70f 100644 --- a/pageserver/src/timelines.rs +++ b/pageserver/src/timelines.rs @@ -250,7 +250,7 @@ fn run_initdb(conf: &'static PageServerConf, initdbpath: &Path) -> Result<()> { let initdb_path = conf.pg_bin_dir().join("initdb"); let initdb_output = Command::new(initdb_path) - .args(&["-D", initdbpath.to_str().unwrap()]) + .args(&["-D", &initdbpath.to_string_lossy()]) .args(&["-U", &conf.superuser]) .args(&["-E", "utf8"]) .arg("--no-instructions") @@ -258,8 +258,8 @@ fn run_initdb(conf: &'static PageServerConf, initdbpath: &Path) -> Result<()> { // so no need to fsync it .arg("--no-sync") .env_clear() - .env("LD_LIBRARY_PATH", conf.pg_lib_dir().to_str().unwrap()) - .env("DYLD_LIBRARY_PATH", conf.pg_lib_dir().to_str().unwrap()) + .env("LD_LIBRARY_PATH", conf.pg_lib_dir()) + .env("DYLD_LIBRARY_PATH", conf.pg_lib_dir()) .stdout(Stdio::null()) .output() .context("failed to execute initdb")?; diff --git a/pageserver/src/virtual_file.rs b/pageserver/src/virtual_file.rs index 73671dcf4e..858cff29cb 100644 --- a/pageserver/src/virtual_file.rs +++ b/pageserver/src/virtual_file.rs @@ -226,7 +226,8 @@ impl VirtualFile { path: &Path, open_options: &OpenOptions, ) -> Result { - let parts = 
path.to_str().unwrap().split('/').collect::>(); + let path_str = path.to_string_lossy(); + let parts = path_str.split('/').collect::>(); let tenantid; let timelineid; if parts.len() > 5 && parts[parts.len() - 5] == "tenants" { diff --git a/pageserver/src/walingest.rs b/pageserver/src/walingest.rs index 1962c9bbd3..506890476f 100644 --- a/pageserver/src/walingest.rs +++ b/pageserver/src/walingest.rs @@ -249,7 +249,7 @@ impl WalIngest { { let mut checkpoint_bytes = [0u8; SIZEOF_CHECKPOINT]; buf.copy_to_slice(&mut checkpoint_bytes); - let xlog_checkpoint = CheckPoint::decode(&checkpoint_bytes).unwrap(); + let xlog_checkpoint = CheckPoint::decode(&checkpoint_bytes)?; trace!( "xlog_checkpoint.oldestXid={}, checkpoint.oldestXid={}", xlog_checkpoint.oldestXid, diff --git a/pageserver/src/walredo.rs b/pageserver/src/walredo.rs index 877b81b8d5..704b8f2583 100644 --- a/pageserver/src/walredo.rs +++ b/pageserver/src/walredo.rs @@ -375,7 +375,10 @@ impl PostgresRedoManager { ZenithWalRecord::Postgres { will_init: _, rec: _, - } => panic!("tried to pass postgres wal record to zenith WAL redo"), + } => { + error!("tried to pass postgres wal record to zenith WAL redo"); + return Err(WalRedoError::InvalidRequest); + } ZenithWalRecord::ClearVisibilityMapFlags { new_heap_blkno, old_heap_blkno, @@ -541,20 +544,23 @@ impl PostgresRedoProcess { } info!("running initdb in {:?}", datadir.display()); let initdb = Command::new(conf.pg_bin_dir().join("initdb")) - .args(&["-D", datadir.to_str().unwrap()]) + .args(&["-D", &datadir.to_string_lossy()]) .arg("-N") .env_clear() - .env("LD_LIBRARY_PATH", conf.pg_lib_dir().to_str().unwrap()) - .env("DYLD_LIBRARY_PATH", conf.pg_lib_dir().to_str().unwrap()) + .env("LD_LIBRARY_PATH", conf.pg_lib_dir()) + .env("DYLD_LIBRARY_PATH", conf.pg_lib_dir()) .output() - .expect("failed to execute initdb"); + .map_err(|e| Error::new(e.kind(), format!("failed to execute initdb: {}", e)))?; if !initdb.status.success() { - panic!( - "initdb failed: 
{}\nstderr:\n{}", - std::str::from_utf8(&initdb.stdout).unwrap(), - std::str::from_utf8(&initdb.stderr).unwrap() - ); + return Err(Error::new( + ErrorKind::Other, + format!( + "initdb failed\nstdout: {}\nstderr:\n{}", + String::from_utf8_lossy(&initdb.stdout), + String::from_utf8_lossy(&initdb.stderr) + ), + )); } else { // Limit shared cache for wal-redo-postres let mut config = OpenOptions::new() @@ -572,11 +578,16 @@ impl PostgresRedoProcess { .stderr(Stdio::piped()) .stdout(Stdio::piped()) .env_clear() - .env("LD_LIBRARY_PATH", conf.pg_lib_dir().to_str().unwrap()) - .env("DYLD_LIBRARY_PATH", conf.pg_lib_dir().to_str().unwrap()) + .env("LD_LIBRARY_PATH", conf.pg_lib_dir()) + .env("DYLD_LIBRARY_PATH", conf.pg_lib_dir()) .env("PGDATA", &datadir) .spawn() - .expect("postgres --wal-redo command failed to start"); + .map_err(|e| { + Error::new( + e.kind(), + format!("postgres --wal-redo command failed to start: {}", e), + ) + })?; info!( "launched WAL redo postgres process on {:?}", @@ -636,7 +647,10 @@ impl PostgresRedoProcess { { build_apply_record_msg(*lsn, postgres_rec, &mut writebuf); } else { - panic!("tried to pass zenith wal record to postgres WAL redo"); + return Err(Error::new( + ErrorKind::Other, + "tried to pass zenith wal record to postgres WAL redo", + )); } } build_get_page_msg(tag, &mut writebuf); From f6b1d76c3097c61b89b47849a52fb714b1f45cbf Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Fri, 18 Mar 2022 20:59:55 +0200 Subject: [PATCH 016/296] Replace assert! with ensure! 
for anyhow::Result functions --- pageserver/src/basebackup.rs | 10 ++++---- pageserver/src/layered_repository.rs | 16 ++++++------ .../src/layered_repository/delta_layer.rs | 12 ++++----- .../src/layered_repository/image_layer.rs | 20 +++++++-------- .../src/layered_repository/inmemory_layer.rs | 25 +++++++++++-------- pageserver/src/layered_repository/metadata.rs | 4 +-- pageserver/src/walreceiver.rs | 4 +-- 7 files changed, 48 insertions(+), 43 deletions(-) diff --git a/pageserver/src/basebackup.rs b/pageserver/src/basebackup.rs index c316fc43d1..5711f1807d 100644 --- a/pageserver/src/basebackup.rs +++ b/pageserver/src/basebackup.rs @@ -10,7 +10,7 @@ //! This module is responsible for creation of such tarball //! from data stored in object storage. //! -use anyhow::{Context, Result}; +use anyhow::{ensure, Context, Result}; use bytes::{BufMut, BytesMut}; use log::*; use std::fmt::Write as FmtWrite; @@ -163,7 +163,7 @@ impl<'a> Basebackup<'a> { let img = self.timeline .get_page_at_lsn(RelishTag::Slru { slru, segno }, blknum, self.lsn)?; - assert!(img.len() == pg_constants::BLCKSZ as usize); + ensure!(img.len() == pg_constants::BLCKSZ as usize); slru_buf.extend_from_slice(&img); } @@ -197,7 +197,7 @@ impl<'a> Basebackup<'a> { String::from("global/pg_filenode.map") // filenode map for global tablespace } else { // User defined tablespaces are not supported - assert!(spcnode == pg_constants::DEFAULTTABLESPACE_OID); + ensure!(spcnode == pg_constants::DEFAULTTABLESPACE_OID); // Append dir path for each database let path = format!("base/{}", dbnode); @@ -211,7 +211,7 @@ impl<'a> Basebackup<'a> { format!("base/{}/pg_filenode.map", dbnode) }; - assert!(img.len() == 512); + ensure!(img.len() == 512); let header = new_tar_header(&path, img.len() as u64)?; self.ar.append(&header, &img[..])?; Ok(()) @@ -292,7 +292,7 @@ impl<'a> Basebackup<'a> { let wal_file_path = format!("pg_wal/{}", wal_file_name); let header = new_tar_header(&wal_file_path, pg_constants::WAL_SEGMENT_SIZE as 
u64)?; let wal_seg = generate_wal_segment(segno, pg_control.system_identifier); - assert!(wal_seg.len() == pg_constants::WAL_SEGMENT_SIZE); + ensure!(wal_seg.len() == pg_constants::WAL_SEGMENT_SIZE); self.ar.append(&header, &wal_seg[..])?; Ok(()) } diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 4d8d0ada24..7ec11add9c 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -791,10 +791,10 @@ impl Timeline for LayeredTimeline { } /// Wait until WAL has been received up to the given LSN. - fn wait_lsn(&self, lsn: Lsn) -> Result<()> { + fn wait_lsn(&self, lsn: Lsn) -> anyhow::Result<()> { // This should never be called from the WAL receiver thread, because that could lead // to a deadlock. - assert!( + ensure!( !IS_WAL_RECEIVER.with(|c| c.get()), "wait_lsn called by WAL receiver thread" ); @@ -1262,7 +1262,7 @@ impl LayeredTimeline { seg: SegmentTag, lsn: Lsn, self_layers: &MutexGuard, - ) -> Result, Lsn)>> { + ) -> anyhow::Result, Lsn)>> { trace!("get_layer_for_read called for {} at {}", seg, lsn); // If you requested a page at an older LSN, before the branch point, dig into @@ -1310,7 +1310,7 @@ impl LayeredTimeline { layer.get_end_lsn() ); - assert!(layer.get_start_lsn() <= lsn); + ensure!(layer.get_start_lsn() <= lsn); if layer.is_dropped() && layer.get_end_lsn() <= lsn { return Ok(None); @@ -1338,13 +1338,13 @@ impl LayeredTimeline { /// /// Get a handle to the latest layer for appending. 
/// - fn get_layer_for_write(&self, seg: SegmentTag, lsn: Lsn) -> Result> { + fn get_layer_for_write(&self, seg: SegmentTag, lsn: Lsn) -> anyhow::Result> { let mut layers = self.layers.lock().unwrap(); - assert!(lsn.is_aligned()); + ensure!(lsn.is_aligned()); let last_record_lsn = self.get_last_record_lsn(); - assert!( + ensure!( lsn > last_record_lsn, "cannot modify relation after advancing last_record_lsn (incoming_lsn={}, last_record_lsn={})", lsn, @@ -1360,7 +1360,7 @@ impl LayeredTimeline { // Open layer exists, but it is dropped, so create a new one. if open_layer.is_dropped() { - assert!(!open_layer.is_writeable()); + ensure!(!open_layer.is_writeable()); // Layer that is created after dropped one represents a new relish segment. trace!( "creating layer for write for new relish segment after dropped layer {} at {}/{}", diff --git a/pageserver/src/layered_repository/delta_layer.rs b/pageserver/src/layered_repository/delta_layer.rs index 7434b8de11..f6e5510339 100644 --- a/pageserver/src/layered_repository/delta_layer.rs +++ b/pageserver/src/layered_repository/delta_layer.rs @@ -209,10 +209,10 @@ impl Layer for DeltaLayer { blknum: SegmentBlk, lsn: Lsn, reconstruct_data: &mut PageReconstructData, - ) -> Result { + ) -> anyhow::Result { let mut need_image = true; - assert!((0..RELISH_SEG_SIZE).contains(&blknum)); + ensure!((0..RELISH_SEG_SIZE).contains(&blknum)); match &reconstruct_data.page_img { Some((cached_lsn, _)) if &self.end_lsn <= cached_lsn => { @@ -289,8 +289,8 @@ impl Layer for DeltaLayer { } /// Get size of the relation at given LSN - fn get_seg_size(&self, lsn: Lsn) -> Result { - assert!(lsn >= self.start_lsn); + fn get_seg_size(&self, lsn: Lsn) -> anyhow::Result { + ensure!(lsn >= self.start_lsn); ensure!( self.seg.rel.is_blocky(), "get_seg_size() called on a non-blocky rel" @@ -641,7 +641,7 @@ impl DeltaLayerWriter { /// /// 'seg_sizes' is a list of size changes to store with the actual data. 
/// - pub fn finish(self, seg_sizes: VecMap) -> Result { + pub fn finish(self, seg_sizes: VecMap) -> anyhow::Result { // Close the page-versions chapter let book = self.page_version_writer.close()?; @@ -652,7 +652,7 @@ impl DeltaLayerWriter { let book = chapter.close()?; if self.seg.rel.is_blocky() { - assert!(!seg_sizes.is_empty()); + ensure!(!seg_sizes.is_empty()); } // and seg_sizes to separate chapter diff --git a/pageserver/src/layered_repository/image_layer.rs b/pageserver/src/layered_repository/image_layer.rs index 24445ff7e9..c706f58e39 100644 --- a/pageserver/src/layered_repository/image_layer.rs +++ b/pageserver/src/layered_repository/image_layer.rs @@ -146,9 +146,9 @@ impl Layer for ImageLayer { blknum: SegmentBlk, lsn: Lsn, reconstruct_data: &mut PageReconstructData, - ) -> Result { - assert!((0..RELISH_SEG_SIZE).contains(&blknum)); - assert!(lsn >= self.lsn); + ) -> anyhow::Result { + ensure!((0..RELISH_SEG_SIZE).contains(&blknum)); + ensure!(lsn >= self.lsn); match reconstruct_data.page_img { Some((cached_lsn, _)) if self.lsn <= cached_lsn => { @@ -432,7 +432,7 @@ impl ImageLayerWriter { seg: SegmentTag, lsn: Lsn, num_blocks: SegmentBlk, - ) -> Result { + ) -> anyhow::Result { // Create the file // // Note: This overwrites any existing file. There shouldn't be any. @@ -452,7 +452,7 @@ impl ImageLayerWriter { let chapter = if seg.rel.is_blocky() { book.new_chapter(BLOCKY_IMAGES_CHAPTER) } else { - assert_eq!(num_blocks, 1); + ensure!(num_blocks == 1); book.new_chapter(NONBLOCKY_IMAGE_CHAPTER) }; @@ -475,19 +475,19 @@ impl ImageLayerWriter { /// /// The page versions must be appended in blknum order. 
/// - pub fn put_page_image(&mut self, block_bytes: &[u8]) -> Result<()> { - assert!(self.num_blocks_written < self.num_blocks); + pub fn put_page_image(&mut self, block_bytes: &[u8]) -> anyhow::Result<()> { + ensure!(self.num_blocks_written < self.num_blocks); if self.seg.rel.is_blocky() { - assert_eq!(block_bytes.len(), BLOCK_SIZE); + ensure!(block_bytes.len() == BLOCK_SIZE); } self.page_image_writer.write_all(block_bytes)?; self.num_blocks_written += 1; Ok(()) } - pub fn finish(self) -> Result { + pub fn finish(self) -> anyhow::Result { // Check that the `put_page_image' was called for every block. - assert!(self.num_blocks_written == self.num_blocks); + ensure!(self.num_blocks_written == self.num_blocks); // Close the page-images chapter let book = self.page_image_writer.close()?; diff --git a/pageserver/src/layered_repository/inmemory_layer.rs b/pageserver/src/layered_repository/inmemory_layer.rs index 239fb341a5..fed1fb6469 100644 --- a/pageserver/src/layered_repository/inmemory_layer.rs +++ b/pageserver/src/layered_repository/inmemory_layer.rs @@ -17,7 +17,7 @@ use crate::layered_repository::LayeredTimeline; use crate::layered_repository::ZERO_PAGE; use crate::repository::ZenithWalRecord; use crate::{ZTenantId, ZTimelineId}; -use anyhow::{ensure, Result, bail}; +use anyhow::{bail, ensure, Result}; use bytes::Bytes; use log::*; use std::collections::HashMap; @@ -224,10 +224,10 @@ impl Layer for InMemoryLayer { blknum: SegmentBlk, lsn: Lsn, reconstruct_data: &mut PageReconstructData, - ) -> Result { + ) -> anyhow::Result { let mut need_image = true; - assert!((0..RELISH_SEG_SIZE).contains(&blknum)); + ensure!((0..RELISH_SEG_SIZE).contains(&blknum)); { let inner = self.inner.read().unwrap(); @@ -288,8 +288,8 @@ impl Layer for InMemoryLayer { } /// Get size of the relation at given LSN - fn get_seg_size(&self, lsn: Lsn) -> Result { - assert!(lsn >= self.start_lsn); + fn get_seg_size(&self, lsn: Lsn) -> anyhow::Result { + ensure!(lsn >= self.start_lsn); ensure!( 
self.seg.rel.is_blocky(), "get_seg_size() called on a non-blocky rel" @@ -300,13 +300,13 @@ impl Layer for InMemoryLayer { } /// Does this segment exist at given LSN? - fn get_seg_exists(&self, lsn: Lsn) -> Result { + fn get_seg_exists(&self, lsn: Lsn) -> anyhow::Result { let inner = self.inner.read().unwrap(); // If the segment created after requested LSN, // it doesn't exist in the layer. But we shouldn't // have requested it in the first place. - assert!(lsn >= self.start_lsn); + ensure!(lsn >= self.start_lsn); // Is the requested LSN after the segment was dropped? if inner.dropped { @@ -466,8 +466,13 @@ impl InMemoryLayer { /// Common subroutine of the public put_wal_record() and put_page_image() functions. /// Adds the page version to the in-memory tree - pub fn put_page_version(&self, blknum: SegmentBlk, lsn: Lsn, pv: PageVersion) -> Result { - assert!((0..RELISH_SEG_SIZE).contains(&blknum)); + pub fn put_page_version( + &self, + blknum: SegmentBlk, + lsn: Lsn, + pv: PageVersion, + ) -> anyhow::Result { + ensure!((0..RELISH_SEG_SIZE).contains(&blknum)); trace!( "put_page_version blk {} of {} at {}/{}", @@ -479,7 +484,7 @@ impl InMemoryLayer { let mut inner = self.inner.write().unwrap(); inner.assert_writeable(); - assert!(lsn >= inner.latest_lsn); + ensure!(lsn >= inner.latest_lsn); inner.latest_lsn = lsn; // Write the page version to the file, and remember its offset in 'page_versions' diff --git a/pageserver/src/layered_repository/metadata.rs b/pageserver/src/layered_repository/metadata.rs index 99d786c4cd..17e0485093 100644 --- a/pageserver/src/layered_repository/metadata.rs +++ b/pageserver/src/layered_repository/metadata.rs @@ -96,7 +96,7 @@ impl TimelineMetadata { ); let data = TimelineMetadata::from(serialize::DeTimelineMetadata::des_prefix(data)?); - assert!(data.disk_consistent_lsn.is_aligned()); + ensure!(data.disk_consistent_lsn.is_aligned()); Ok(data) } @@ -104,7 +104,7 @@ impl TimelineMetadata { pub fn to_bytes(&self) -> anyhow::Result> { let 
serializeable_metadata = serialize::SeTimelineMetadata::from(self); let mut metadata_bytes = serialize::SeTimelineMetadata::ser(&serializeable_metadata)?; - assert!(metadata_bytes.len() <= METADATA_MAX_DATA_SIZE); + ensure!(metadata_bytes.len() <= METADATA_MAX_DATA_SIZE); metadata_bytes.resize(METADATA_MAX_SAFE_SIZE, 0u8); let checksum = crc32c::crc32c(&metadata_bytes[..METADATA_MAX_DATA_SIZE]); diff --git a/pageserver/src/walreceiver.rs b/pageserver/src/walreceiver.rs index 305dd4b3a2..43fb7db4b0 100644 --- a/pageserver/src/walreceiver.rs +++ b/pageserver/src/walreceiver.rs @@ -146,7 +146,7 @@ fn walreceiver_main( tenant_id: ZTenantId, timeline_id: ZTimelineId, wal_producer_connstr: &str, -) -> Result<(), Error> { +) -> anyhow::Result<(), Error> { // Connect to the database in replication mode. info!("connecting to {:?}", wal_producer_connstr); let connect_cfg = format!( @@ -255,7 +255,7 @@ fn walreceiver_main( // It is important to deal with the aligned records as lsn in getPage@LSN is // aligned and can be several bytes bigger. Without this alignment we are // at risk of hittind a deadlock. 
- assert!(lsn.is_aligned()); + anyhow::ensure!(lsn.is_aligned()); let writer = timeline.writer(); walingest.ingest_record(writer.as_ref(), recdata, lsn)?; From 6244fd9e7eb78cd056cc92e67ca2fc6bf67eca22 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Wed, 23 Mar 2022 00:57:20 +0200 Subject: [PATCH 017/296] Better error messages on zenith cli subcommand invocations --- control_plane/src/storage.rs | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/control_plane/src/storage.rs b/control_plane/src/storage.rs index 835c93bf1d..c49d5743a9 100644 --- a/control_plane/src/storage.rs +++ b/control_plane/src/storage.rs @@ -148,12 +148,20 @@ impl PageServerNode { let initial_timeline_id_string = initial_timeline_id.to_string(); args.extend(["--initial-timeline-id", &initial_timeline_id_string]); - let init_output = fill_rust_env_vars(cmd.args(args)) + let cmd_with_args = cmd.args(args); + let init_output = fill_rust_env_vars(cmd_with_args) .output() - .context("pageserver init failed")?; + .with_context(|| { + format!("failed to init pageserver with command {:?}", cmd_with_args) + })?; if !init_output.status.success() { - bail!("pageserver init failed"); + bail!( + "init invocation failed, {}\nStdout: {}\nStderr: {}", + init_output.status, + String::from_utf8_lossy(&init_output.stdout), + String::from_utf8_lossy(&init_output.stderr) + ); } Ok(initial_timeline_id) From 28bc8e3f5c961532f4177fb3e803b73f6a2adb5a Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Wed, 23 Mar 2022 19:33:06 +0200 Subject: [PATCH 018/296] Log pageserver threads better and shut down on errors in them --- pageserver/src/bin/pageserver.rs | 33 +----------------------- pageserver/src/layered_repository.rs | 2 +- pageserver/src/lib.rs | 38 +++++++++++++++++++++++++++- pageserver/src/thread_mgr.rs | 38 +++++++++++++++++++++------- 4 files changed, 68 insertions(+), 43 deletions(-) diff --git a/pageserver/src/bin/pageserver.rs b/pageserver/src/bin/pageserver.rs index 
5a1b5e5e2c..14249963de 100644 --- a/pageserver/src/bin/pageserver.rs +++ b/pageserver/src/bin/pageserver.rs @@ -26,7 +26,6 @@ use pageserver::{ timelines, virtual_file, LOG_FILE_NAME, }; use zenith_utils::http::endpoint; -use zenith_utils::postgres_backend; use zenith_utils::shutdown::exit_now; use zenith_utils::signals::{self, Signal}; @@ -322,38 +321,8 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<() "Got {}. Terminating gracefully in fast shutdown mode", signal.name() ); - shutdown_pageserver(); + pageserver::shutdown_pageserver(); unreachable!() } }) } - -fn shutdown_pageserver() { - // Shut down the libpq endpoint thread. This prevents new connections from - // being accepted. - thread_mgr::shutdown_threads(Some(ThreadKind::LibpqEndpointListener), None, None); - - // Shut down any page service threads. - postgres_backend::set_pgbackend_shutdown_requested(); - thread_mgr::shutdown_threads(Some(ThreadKind::PageRequestHandler), None, None); - - // Shut down all the tenants. This flushes everything to disk and kills - // the checkpoint and GC threads. - tenant_mgr::shutdown_all_tenants(); - - // Stop syncing with remote storage. - // - // FIXME: Does this wait for the sync thread to finish syncing what's queued up? - // Should it? - thread_mgr::shutdown_threads(Some(ThreadKind::StorageSync), None, None); - - // Shut down the HTTP endpoint last, so that you can still check the server's - // status while it's shutting down. 
- thread_mgr::shutdown_threads(Some(ThreadKind::HttpEndpointListener), None, None); - - // There should be nothing left, but let's be sure - thread_mgr::shutdown_threads(None, None, None); - - info!("Shut down successfully completed"); - std::process::exit(0); -} diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 7ec11add9c..ac0afcb275 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -976,7 +976,7 @@ impl Timeline for LayeredTimeline { /// Public entry point for checkpoint(). All the logic is in the private /// checkpoint_internal function, this public facade just wraps it for /// metrics collection. - fn checkpoint(&self, cconf: CheckpointConfig) -> Result<()> { + fn checkpoint(&self, cconf: CheckpointConfig) -> anyhow::Result<()> { match cconf { CheckpointConfig::Flush => self .flush_checkpoint_time_histo diff --git a/pageserver/src/lib.rs b/pageserver/src/lib.rs index 3d66192c80..060fa54b23 100644 --- a/pageserver/src/lib.rs +++ b/pageserver/src/lib.rs @@ -19,8 +19,14 @@ pub mod walrecord; pub mod walredo; use lazy_static::lazy_static; +use tracing::info; use zenith_metrics::{register_int_gauge_vec, IntGaugeVec}; -use zenith_utils::zid::{ZTenantId, ZTimelineId}; +use zenith_utils::{ + postgres_backend, + zid::{ZTenantId, ZTimelineId}, +}; + +use crate::thread_mgr::ThreadKind; lazy_static! { static ref LIVE_CONNECTIONS_COUNT: IntGaugeVec = register_int_gauge_vec!( @@ -43,3 +49,33 @@ pub enum CheckpointConfig { // Flush all in-memory data and reconstruct all page images Forced, } + +pub fn shutdown_pageserver() { + // Shut down the libpq endpoint thread. This prevents new connections from + // being accepted. + thread_mgr::shutdown_threads(Some(ThreadKind::LibpqEndpointListener), None, None); + + // Shut down any page service threads. 
+ postgres_backend::set_pgbackend_shutdown_requested(); + thread_mgr::shutdown_threads(Some(ThreadKind::PageRequestHandler), None, None); + + // Shut down all the tenants. This flushes everything to disk and kills + // the checkpoint and GC threads. + tenant_mgr::shutdown_all_tenants(); + + // Stop syncing with remote storage. + // + // FIXME: Does this wait for the sync thread to finish syncing what's queued up? + // Should it? + thread_mgr::shutdown_threads(Some(ThreadKind::StorageSync), None, None); + + // Shut down the HTTP endpoint last, so that you can still check the server's + // status while it's shutting down. + thread_mgr::shutdown_threads(Some(ThreadKind::HttpEndpointListener), None, None); + + // There should be nothing left, but let's be sure + thread_mgr::shutdown_threads(None, None, None); + + info!("Shut down successfully completed"); + std::process::exit(0); +} diff --git a/pageserver/src/thread_mgr.rs b/pageserver/src/thread_mgr.rs index d24d6bf016..c4202e80be 100644 --- a/pageserver/src/thread_mgr.rs +++ b/pageserver/src/thread_mgr.rs @@ -43,12 +43,14 @@ use std::thread::JoinHandle; use tokio::sync::watch; -use tracing::{info, warn}; +use tracing::{error, info, warn}; use lazy_static::lazy_static; use zenith_utils::zid::{ZTenantId, ZTimelineId}; +use crate::shutdown_pageserver; + lazy_static! { /// Each thread that we track is associated with a "thread ID". 
It's just /// an increasing number that we assign, not related to any system thread @@ -125,7 +127,7 @@ struct PageServerThread { } /// Launch a new thread -pub fn spawn( +pub fn spawn( kind: ThreadKind, tenant_id: Option, timeline_id: Option, @@ -133,7 +135,7 @@ pub fn spawn( f: F, ) -> std::io::Result<()> where - F: FnOnce() -> Result<(), E> + Send + 'static, + F: FnOnce() -> anyhow::Result<()> + Send + 'static, { let (shutdown_tx, shutdown_rx) = watch::channel(()); let thread_id = NEXT_THREAD_ID.fetch_add(1, Ordering::Relaxed); @@ -160,12 +162,14 @@ where .insert(thread_id, Arc::clone(&thread_rc)); let thread_rc2 = Arc::clone(&thread_rc); + let thread_name = name.to_string(); let join_handle = match thread::Builder::new() .name(name.to_string()) - .spawn(move || thread_wrapper(thread_id, thread_rc2, shutdown_rx, f)) + .spawn(move || thread_wrapper(thread_name, thread_id, thread_rc2, shutdown_rx, f)) { Ok(handle) => handle, Err(err) => { + error!("Failed to spawn thread '{}': {}", name, err); // Could not spawn the thread. Remove the entry THREADS.lock().unwrap().remove(&thread_id); return Err(err); @@ -180,13 +184,14 @@ where /// This wrapper function runs in a newly-spawned thread. It initializes the /// thread-local variables and calls the payload function -fn thread_wrapper( +fn thread_wrapper( + thread_name: String, thread_id: u64, thread: Arc, shutdown_rx: watch::Receiver<()>, f: F, ) where - F: FnOnce() -> Result<(), E> + Send + 'static, + F: FnOnce() -> anyhow::Result<()> + Send + 'static, { SHUTDOWN_RX.with(|rx| { *rx.borrow_mut() = Some(shutdown_rx); @@ -195,6 +200,8 @@ fn thread_wrapper( *ct.borrow_mut() = Some(thread); }); + info!("Starting thread '{}'", thread_name); + // We use AssertUnwindSafe here so that the payload function // doesn't need to be UnwindSafe. We don't do anything after the // unwinding that would expose us to unwind-unsafe behavior. @@ -203,9 +210,22 @@ fn thread_wrapper( // Remove our entry from the global hashmap. 
THREADS.lock().unwrap().remove(&thread_id); - // If the thread payload panic'd, exit with the panic. - if let Err(err) = result { - panic::resume_unwind(err); + match result { + Ok(Ok(())) => info!("Thread '{}' exited normally", thread_name), + Ok(Err(err)) => { + error!( + "Shutting down: thread '{}' exited with error: {:?}", + thread_name, err + ); + shutdown_pageserver(); + } + Err(err) => { + error!( + "Shutting down: thread '{}' panicked: {:?}", + thread_name, err + ); + shutdown_pageserver(); + } } } From b39d1b17177eb6fe9509b87cb8908f8128ab78bc Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Thu, 24 Mar 2022 14:05:15 +0200 Subject: [PATCH 019/296] Exit only on important thread failures --- pageserver/src/bin/pageserver.rs | 2 ++ pageserver/src/page_service.rs | 1 + pageserver/src/remote_storage/storage_sync.rs | 8 ++--- pageserver/src/tenant_mgr.rs | 35 ++++++++++++------- pageserver/src/thread_mgr.rs | 34 ++++++++++++------ pageserver/src/walreceiver.rs | 11 +++--- 6 files changed, 57 insertions(+), 34 deletions(-) diff --git a/pageserver/src/bin/pageserver.rs b/pageserver/src/bin/pageserver.rs index 14249963de..e217806147 100644 --- a/pageserver/src/bin/pageserver.rs +++ b/pageserver/src/bin/pageserver.rs @@ -291,6 +291,7 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<() None, None, "http_endpoint_thread", + false, move || { let router = http::make_router(conf, auth_cloned, remote_index); endpoint::serve_thread_main(router, http_listener, thread_mgr::shutdown_watcher()) @@ -304,6 +305,7 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<() None, None, "libpq endpoint thread", + false, move || page_service::thread_main(conf, auth, pageserver_listener, conf.auth_type), )?; diff --git a/pageserver/src/page_service.rs b/pageserver/src/page_service.rs index 6acdc8e93d..4744f0fe52 100644 --- a/pageserver/src/page_service.rs +++ b/pageserver/src/page_service.rs @@ -228,6 +228,7 @@ pub fn 
thread_main( None, None, "serving Page Service thread", + false, move || page_service_conn_main(conf, local_auth, socket, auth_type), ) { // Thread creation failed. Log the error and continue. diff --git a/pageserver/src/remote_storage/storage_sync.rs b/pageserver/src/remote_storage/storage_sync.rs index 4ad28e6f8f..b01b152e0a 100644 --- a/pageserver/src/remote_storage/storage_sync.rs +++ b/pageserver/src/remote_storage/storage_sync.rs @@ -404,6 +404,7 @@ pub(super) fn spawn_storage_sync_thread< None, None, "Remote storage sync thread", + false, move || { storage_sync_loop( runtime, @@ -413,7 +414,8 @@ pub(super) fn spawn_storage_sync_thread< storage, max_concurrent_sync, max_sync_errors, - ) + ); + Ok(()) }, ) .context("Failed to spawn remote storage sync thread")?; @@ -440,7 +442,7 @@ fn storage_sync_loop< storage: S, max_concurrent_sync: NonZeroUsize, max_sync_errors: NonZeroU32, -) -> anyhow::Result<()> { +) { let remote_assets = Arc::new((storage, Arc::clone(&index))); loop { let index = Arc::clone(&index); @@ -470,8 +472,6 @@ fn storage_sync_loop< } } } - - Ok(()) } async fn loop_step< diff --git a/pageserver/src/tenant_mgr.rs b/pageserver/src/tenant_mgr.rs index 4d6dfd7488..0bc18231c9 100644 --- a/pageserver/src/tenant_mgr.rs +++ b/pageserver/src/tenant_mgr.rs @@ -206,13 +206,13 @@ pub fn get_tenant_state(tenantid: ZTenantId) -> Option { /// Change the state of a tenant to Active and launch its checkpointer and GC /// threads. If the tenant was already in Active state or Stopping, does nothing. 
/// -pub fn activate_tenant(conf: &'static PageServerConf, tenantid: ZTenantId) -> Result<()> { +pub fn activate_tenant(conf: &'static PageServerConf, tenant_id: ZTenantId) -> Result<()> { let mut m = access_tenants(); let tenant = m - .get_mut(&tenantid) - .with_context(|| format!("Tenant not found for id {}", tenantid))?; + .get_mut(&tenant_id) + .with_context(|| format!("Tenant not found for id {}", tenant_id))?; - info!("activating tenant {}", tenantid); + info!("activating tenant {}", tenant_id); match tenant.state { // If the tenant is already active, nothing to do. @@ -222,22 +222,31 @@ pub fn activate_tenant(conf: &'static PageServerConf, tenantid: ZTenantId) -> Re TenantState::Idle => { thread_mgr::spawn( ThreadKind::Checkpointer, - Some(tenantid), + Some(tenant_id), None, "Checkpointer thread", - move || crate::tenant_threads::checkpoint_loop(tenantid, conf), + true, + move || crate::tenant_threads::checkpoint_loop(tenant_id, conf), )?; - // FIXME: if we fail to launch the GC thread, but already launched the - // checkpointer, we're in a strange state. 
- - thread_mgr::spawn( + let gc_spawn_result = thread_mgr::spawn( ThreadKind::GarbageCollector, - Some(tenantid), + Some(tenant_id), None, "GC thread", - move || crate::tenant_threads::gc_loop(tenantid, conf), - )?; + true, + move || crate::tenant_threads::gc_loop(tenant_id, conf), + ) + .with_context(|| format!("Failed to launch GC thread for tenant {}", tenant_id)); + + if let Err(e) = &gc_spawn_result { + error!( + "Failed to start GC thread for tenant {}, stopping its checkpointer thread: {:?}", + tenant_id, e + ); + thread_mgr::shutdown_threads(Some(ThreadKind::Checkpointer), Some(tenant_id), None); + return gc_spawn_result; + } tenant.state = TenantState::Active; } diff --git a/pageserver/src/thread_mgr.rs b/pageserver/src/thread_mgr.rs index c4202e80be..cafdc5e700 100644 --- a/pageserver/src/thread_mgr.rs +++ b/pageserver/src/thread_mgr.rs @@ -43,7 +43,7 @@ use std::thread::JoinHandle; use tokio::sync::watch; -use tracing::{error, info, warn}; +use tracing::{debug, error, info, warn}; use lazy_static::lazy_static; @@ -132,6 +132,7 @@ pub fn spawn( tenant_id: Option, timeline_id: Option, name: &str, + fail_on_error: bool, f: F, ) -> std::io::Result<()> where @@ -165,8 +166,16 @@ where let thread_name = name.to_string(); let join_handle = match thread::Builder::new() .name(name.to_string()) - .spawn(move || thread_wrapper(thread_name, thread_id, thread_rc2, shutdown_rx, f)) - { + .spawn(move || { + thread_wrapper( + thread_name, + thread_id, + thread_rc2, + shutdown_rx, + fail_on_error, + f, + ) + }) { Ok(handle) => handle, Err(err) => { error!("Failed to spawn thread '{}': {}", name, err); @@ -189,6 +198,7 @@ fn thread_wrapper( thread_id: u64, thread: Arc, shutdown_rx: watch::Receiver<()>, + fail_on_error: bool, f: F, ) where F: FnOnce() -> anyhow::Result<()> + Send + 'static, @@ -200,7 +210,7 @@ fn thread_wrapper( *ct.borrow_mut() = Some(thread); }); - info!("Starting thread '{}'", thread_name); + debug!("Starting thread '{}'", thread_name); // We use 
AssertUnwindSafe here so that the payload function // doesn't need to be UnwindSafe. We don't do anything after the @@ -211,13 +221,17 @@ fn thread_wrapper( THREADS.lock().unwrap().remove(&thread_id); match result { - Ok(Ok(())) => info!("Thread '{}' exited normally", thread_name), + Ok(Ok(())) => debug!("Thread '{}' exited normally", thread_name), Ok(Err(err)) => { - error!( - "Shutting down: thread '{}' exited with error: {:?}", - thread_name, err - ); - shutdown_pageserver(); + if fail_on_error { + error!( + "Shutting down: thread '{}' exited with error: {:?}", + thread_name, err + ); + shutdown_pageserver(); + } else { + error!("Thread '{}' exited with error: {:?}", thread_name, err); + } } Err(err) => { error!( diff --git a/pageserver/src/walreceiver.rs b/pageserver/src/walreceiver.rs index 43fb7db4b0..2c10ad315b 100644 --- a/pageserver/src/walreceiver.rs +++ b/pageserver/src/walreceiver.rs @@ -78,9 +78,11 @@ pub fn launch_wal_receiver( Some(tenantid), Some(timelineid), "WAL receiver thread", + false, move || { IS_WAL_RECEIVER.with(|c| c.set(true)); - thread_main(conf, tenantid, timelineid) + thread_main(conf, tenantid, timelineid); + Ok(()) }, )?; @@ -110,11 +112,7 @@ fn get_wal_producer_connstr(tenantid: ZTenantId, timelineid: ZTimelineId) -> Str // // This is the entry point for the WAL receiver thread. 
// -fn thread_main( - conf: &'static PageServerConf, - tenant_id: ZTenantId, - timeline_id: ZTimelineId, -) -> Result<()> { +fn thread_main(conf: &'static PageServerConf, tenant_id: ZTenantId, timeline_id: ZTimelineId) { let _enter = info_span!("WAL receiver", timeline = %timeline_id, tenant = %tenant_id).entered(); info!("WAL receiver thread started"); @@ -138,7 +136,6 @@ fn thread_main( // Drop it from list of active WAL_RECEIVERS // so that next callmemaybe request launched a new thread drop_wal_receiver(tenant_id, timeline_id); - Ok(()) } fn walreceiver_main( From e3fa00972e4987f2a3653ab7d547c357a94129fc Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Fri, 25 Mar 2022 15:34:38 +0200 Subject: [PATCH 020/296] Use RwLocks in image and delta layers for more concurrency. With a Mutex, only one thread could read from the layer at a time. I did some ad hoc profiling with pgbench and saw that a fair amount of time was spent blocked on these Mutexes. --- .../src/layered_repository/delta_layer.rs | 51 ++++++++++++++----- .../src/layered_repository/image_layer.rs | 46 ++++++++++++----- 2 files changed, 72 insertions(+), 25 deletions(-) diff --git a/pageserver/src/layered_repository/delta_layer.rs b/pageserver/src/layered_repository/delta_layer.rs index f6e5510339..1a6e941fbe 100644 --- a/pageserver/src/layered_repository/delta_layer.rs +++ b/pageserver/src/layered_repository/delta_layer.rs @@ -58,7 +58,7 @@ use std::io::{BufWriter, Write}; use std::ops::Bound::Included; use std::os::unix::fs::FileExt; use std::path::{Path, PathBuf}; -use std::sync::{Mutex, MutexGuard}; +use std::sync::{RwLock, RwLockReadGuard, RwLockWriteGuard, TryLockError}; use bookfile::{Book, BookWriter, BoundedReader, ChapterWriter}; @@ -142,7 +142,7 @@ pub struct DeltaLayer { dropped: bool, - inner: Mutex<DeltaLayerInner>, + inner: RwLock<DeltaLayerInner>, } pub struct DeltaLayerInner { @@ -316,7 +316,11 @@ impl Layer for DeltaLayer { /// it will need to be loaded back.
/// fn unload(&self) -> Result<()> { - let mut inner = self.inner.lock().unwrap(); + let mut inner = match self.inner.try_write() { + Ok(inner) => inner, + Err(TryLockError::WouldBlock) => return Ok(()), + Err(TryLockError::Poisoned(_)) => panic!("DeltaLayer lock was poisoned"), + }; inner.page_version_metas = VecMap::default(); inner.seg_sizes = VecMap::default(); inner.loaded = false; @@ -406,16 +410,37 @@ impl DeltaLayer { } /// - /// Load the contents of the file into memory + /// Open the underlying file and read the metadata into memory, if it's + /// not loaded already. /// - fn load(&self) -> Result> { - // quick exit if already loaded - let mut inner = self.inner.lock().unwrap(); + fn load(&self) -> Result> { + loop { + // Quick exit if already loaded + let inner = self.inner.read().unwrap(); + if inner.loaded { + return Ok(inner); + } - if inner.loaded { - return Ok(inner); + // Need to open the file and load the metadata. Upgrade our lock to + // a write lock. (Or rather, release and re-lock in write mode.) + drop(inner); + let inner = self.inner.write().unwrap(); + if !inner.loaded { + self.load_inner(inner)?; + } else { + // Another thread loaded it while we were not holding the lock. + } + + // We now have the file open and loaded. There's no function to do + // that in the std library RwLock, so we have to release and re-lock + // in read mode. (To be precise, the lock guard was moved in the + // above call to `load_inner`, so it's already been released). And + // while we do that, another thread could unload again, so we have + // to re-check and retry if that happens. } + } + fn load_inner(&self, mut inner: RwLockWriteGuard) -> Result<()> { let path = self.path(); // Open the file if it's not open already. @@ -462,7 +487,7 @@ impl DeltaLayer { inner.seg_sizes = seg_sizes; inner.loaded = true; - Ok(inner) + Ok(()) } /// Create a DeltaLayer struct representing an existing file on disk. 
@@ -480,7 +505,7 @@ impl DeltaLayer { start_lsn: filename.start_lsn, end_lsn: filename.end_lsn, dropped: filename.dropped, - inner: Mutex::new(DeltaLayerInner { + inner: RwLock::new(DeltaLayerInner { loaded: false, book: None, page_version_metas: VecMap::default(), @@ -507,7 +532,7 @@ impl DeltaLayer { start_lsn: summary.start_lsn, end_lsn: summary.end_lsn, dropped: summary.dropped, - inner: Mutex::new(DeltaLayerInner { + inner: RwLock::new(DeltaLayerInner { loaded: false, book: None, page_version_metas: VecMap::default(), @@ -689,7 +714,7 @@ impl DeltaLayerWriter { start_lsn: self.start_lsn, end_lsn: self.end_lsn, dropped: self.dropped, - inner: Mutex::new(DeltaLayerInner { + inner: RwLock::new(DeltaLayerInner { loaded: false, book: None, page_version_metas: VecMap::default(), diff --git a/pageserver/src/layered_repository/image_layer.rs b/pageserver/src/layered_repository/image_layer.rs index c706f58e39..5b8ec46452 100644 --- a/pageserver/src/layered_repository/image_layer.rs +++ b/pageserver/src/layered_repository/image_layer.rs @@ -37,7 +37,7 @@ use std::convert::TryInto; use std::fs; use std::io::{BufWriter, Write}; use std::path::{Path, PathBuf}; -use std::sync::{Mutex, MutexGuard}; +use std::sync::{RwLock, RwLockReadGuard}; use bookfile::{Book, BookWriter, ChapterWriter}; @@ -93,7 +93,7 @@ pub struct ImageLayer { // This entry contains an image of all pages as of this LSN pub lsn: Lsn, - inner: Mutex, + inner: RwLock, } #[derive(Clone)] @@ -273,16 +273,38 @@ impl ImageLayer { } /// - /// Load the contents of the file into memory + /// Open the underlying file and read the metadata into memory, if it's + /// not loaded already. 
/// - fn load(&self) -> Result> { - // quick exit if already loaded - let mut inner = self.inner.lock().unwrap(); + fn load(&self) -> Result> { + loop { + // Quick exit if already loaded + let inner = self.inner.read().unwrap(); + if inner.book.is_some() { + return Ok(inner); + } - if inner.book.is_some() { - return Ok(inner); + // Need to open the file and load the metadata. Upgrade our lock to + // a write lock. (Or rather, release and re-lock in write mode.) + drop(inner); + let mut inner = self.inner.write().unwrap(); + if inner.book.is_none() { + self.load_inner(&mut inner)?; + } else { + // Another thread loaded it while we were not holding the lock. + } + + // We now have the file open and loaded. There's no function to do + // that in the std library RwLock, so we have to release and re-lock + // in read mode. (To be precise, the lock guard was moved in the + // above call to `load_inner`, so it's already been released). And + // while we do that, another thread could unload again, so we have + // to re-check and retry if that happens. 
+ drop(inner); } + } + fn load_inner(&self, inner: &mut ImageLayerInner) -> Result<()> { let path = self.path(); let file = VirtualFile::open(&path) .with_context(|| format!("Failed to open virtual file '{}'", path.display()))?; @@ -336,7 +358,7 @@ impl ImageLayer { image_type, }; - Ok(inner) + Ok(()) } /// Create an ImageLayer struct representing an existing file on disk @@ -352,7 +374,7 @@ impl ImageLayer { tenantid, seg: filename.seg, lsn: filename.lsn, - inner: Mutex::new(ImageLayerInner { + inner: RwLock::new(ImageLayerInner { book: None, image_type: ImageType::Blocky { num_blocks: 0 }, }), @@ -375,7 +397,7 @@ impl ImageLayer { tenantid: summary.tenantid, seg: summary.seg, lsn: summary.lsn, - inner: Mutex::new(ImageLayerInner { + inner: RwLock::new(ImageLayerInner { book: None, image_type: ImageType::Blocky { num_blocks: 0 }, }), @@ -522,7 +544,7 @@ impl ImageLayerWriter { tenantid: self.tenantid, seg: self.seg, lsn: self.lsn, - inner: Mutex::new(ImageLayerInner { + inner: RwLock::new(ImageLayerInner { book: None, image_type, }), From b8cba059a59f1c5e74cd8160af6aee4658c9744e Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Fri, 25 Mar 2022 20:52:58 +0200 Subject: [PATCH 021/296] temporary disable s3 integration on staging until LSM storge rewrite lands --- .circleci/ansible/deploy.yaml | 27 ++++++++++++++------------- 1 file changed, 14 insertions(+), 13 deletions(-) diff --git a/.circleci/ansible/deploy.yaml b/.circleci/ansible/deploy.yaml index 2379ef8510..1f43adf950 100644 --- a/.circleci/ansible/deploy.yaml +++ b/.circleci/ansible/deploy.yaml @@ -91,19 +91,20 @@ tags: - pageserver - - name: update config - when: current_version > remote_version or force_deploy - lineinfile: - path: /storage/pageserver/data/pageserver.toml - line: "{{ item }}" - loop: - - "[remote_storage]" - - "bucket_name = '{{ bucket_name }}'" - - "bucket_region = '{{ bucket_region }}'" - - "prefix_in_bucket = '{{ inventory_hostname }}'" - become: true - tags: - - pageserver + # 
Temporarily disabled until LSM storage rewrite lands + # - name: update config + # when: current_version > remote_version or force_deploy + # lineinfile: + # path: /storage/pageserver/data/pageserver.toml + # line: "{{ item }}" + # loop: + # - "[remote_storage]" + # - "bucket_name = '{{ bucket_name }}'" + # - "bucket_region = '{{ bucket_region }}'" + # - "prefix_in_bucket = '{{ inventory_hostname }}'" + # become: true + # tags: + # - pageserver - name: upload systemd service definition when: current_version > remote_version or force_deploy From 5e04dad3604ddc6da58558425f44c9e6b3f05def Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Fri, 25 Mar 2022 23:42:13 +0200 Subject: [PATCH 022/296] Add more variants of the sequential scan performance tests. More rows, and test with serial and parallel plans. But fewer iterations, so that the tests run in < 1 minute, and we don't need to mark them as "slow". --- ...est_small_seqscans.py => test_seqscans.py} | 24 ++++++++++++------- 1 file changed, 15 insertions(+), 9 deletions(-) rename test_runner/performance/{test_small_seqscans.py => test_seqscans.py} (65%) diff --git a/test_runner/performance/test_small_seqscans.py b/test_runner/performance/test_seqscans.py similarity index 65% rename from test_runner/performance/test_small_seqscans.py rename to test_runner/performance/test_seqscans.py index b98018ad97..85d0a24510 100644 --- a/test_runner/performance/test_small_seqscans.py +++ b/test_runner/performance/test_seqscans.py @@ -1,8 +1,5 @@ # Test sequential scan speed # -# The test table is large enough (3-4 MB) that it doesn't fit in the compute node -# cache, so the seqscans go to the page server. But small enough that it fits -# into memory in the page server.
from contextlib import closing from dataclasses import dataclass from fixtures.zenith_fixtures import ZenithEnv @@ -12,11 +9,18 @@ from fixtures.compare_fixtures import PgCompare import pytest -@pytest.mark.parametrize('rows', [ - pytest.param(100000), - pytest.param(1000000, marks=pytest.mark.slow), -]) -def test_small_seqscans(zenith_with_baseline: PgCompare, rows: int): +@pytest.mark.parametrize( + 'rows,iters,workers', + [ + # The test table is large enough (3-4 MB) that it doesn't fit in the compute node + # cache, so the seqscans go to the page server. But small enough that it fits + # into memory in the page server. + pytest.param(100000, 100, 0), + # Also test with a larger table, with and without parallelism + pytest.param(10000000, 1, 0), + pytest.param(10000000, 1, 4) + ]) +def test_seqscans(zenith_with_baseline: PgCompare, rows: int, iters: int, workers: int): env = zenith_with_baseline with closing(env.pg.connect()) as conn: @@ -36,6 +40,8 @@ def test_small_seqscans(zenith_with_baseline: PgCompare, rows: int): assert int(shared_buffers) < int(table_size) env.zenbenchmark.record("table_size", table_size, 'bytes', MetricReport.TEST_PARAM) + cur.execute(f"set max_parallel_workers_per_gather = {workers}") + with env.record_duration('run'): - for i in range(1000): + for i in range(iters): cur.execute('select count(*) from t;') From 18dfc769d814f9753eb611a85d1ebeb81de0dafe Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Fri, 25 Mar 2022 11:27:21 +0200 Subject: [PATCH 023/296] Use cachepot to cache more rustc builds --- .circleci/config.yml | 15 +++++++++++++-- Dockerfile | 1 - Dockerfile.compute-tools | 9 +++++++-- 3 files changed, 20 insertions(+), 5 deletions(-) diff --git a/.circleci/config.yml b/.circleci/config.yml index d342e7c9f4..f05ad3e816 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -111,7 +111,12 @@ jobs: fi export CARGO_INCREMENTAL=0 + export CACHEPOT_BUCKET=zenith-rust-cachepot + export RUSTC_WRAPPER=cachepot + export 
AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}" + export AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}" "${cov_prefix[@]}" mold -run cargo build $CARGO_FLAGS --bins --tests + cachepot -s - save_cache: name: Save rust cache @@ -464,7 +469,10 @@ jobs: name: Build and push compute-tools Docker image command: | echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin - docker build -t zenithdb/compute-tools:latest -f Dockerfile.compute-tools . + docker build \ + --build-arg AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}" \ + --build-arg AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}" \ + --tag zenithdb/compute-tools:latest -f Dockerfile.compute-tools . docker push zenithdb/compute-tools:latest - run: name: Init postgres submodule @@ -518,7 +526,10 @@ jobs: name: Build and push compute-tools Docker image command: | echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin - docker build -t zenithdb/compute-tools:release -f Dockerfile.compute-tools . + docker build \ + --build-arg AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}" \ + --build-arg AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}" \ + --tag zenithdb/compute-tools:release -f Dockerfile.compute-tools . 
docker push zenithdb/compute-tools:release - run: name: Init postgres submodule diff --git a/Dockerfile b/Dockerfile index 5e55cd834f..babc3b8e1d 100644 --- a/Dockerfile +++ b/Dockerfile @@ -24,7 +24,6 @@ ARG GIT_VERSION=local ARG CACHEPOT_BUCKET=zenith-rust-cachepot ARG AWS_ACCESS_KEY_ID ARG AWS_SECRET_ACCESS_KEY -#ENV RUSTC_WRAPPER cachepot ENV RUSTC_WRAPPER /usr/local/cargo/bin/cachepot COPY --from=pg-build /pg/tmp_install/include/postgresql/server tmp_install/include/postgresql/server diff --git a/Dockerfile.compute-tools b/Dockerfile.compute-tools index a1f7582ee4..f7672251e6 100644 --- a/Dockerfile.compute-tools +++ b/Dockerfile.compute-tools @@ -1,12 +1,17 @@ # First transient image to build compute_tools binaries # NB: keep in sync with rust image version in .circle/config.yml -FROM rust:1.56.1-slim-buster AS rust-build +FROM zenithdb/build:buster-20220309 AS rust-build WORKDIR /zenith +ARG CACHEPOT_BUCKET=zenith-rust-cachepot +ARG AWS_ACCESS_KEY_ID +ARG AWS_SECRET_ACCESS_KEY +ENV RUSTC_WRAPPER /usr/local/cargo/bin/cachepot + COPY . . 
-RUN cargo build -p compute_tools --release +RUN cargo build -p compute_tools --release && /usr/local/cargo/bin/cachepot -s # Final image that only has one binary FROM debian:buster-slim From d56a0ee19aeec715f9c839a9bcdc91c650000f1e Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Fri, 25 Mar 2022 11:48:30 +0200 Subject: [PATCH 024/296] Avoid recompiling tests for release profile --- .circleci/config.yml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/.circleci/config.yml b/.circleci/config.yml index f05ad3e816..513d305b5d 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -146,11 +146,13 @@ jobs: command: | if [[ $BUILD_TYPE == "debug" ]]; then cov_prefix=(scripts/coverage "--profraw-prefix=$CIRCLE_JOB" --dir=/tmp/zenith/coverage run) + CARGO_FLAGS= elif [[ $BUILD_TYPE == "release" ]]; then cov_prefix=() + CARGO_FLAGS=--release fi - "${cov_prefix[@]}" cargo test + "${cov_prefix[@]}" cargo test $CARGO_FLAGS # Install the rust binaries, for use by test jobs - run: From 55de0b88f5b02fe4a77d7b78640b51ca9f236baa Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Fri, 25 Mar 2022 23:53:37 +0200 Subject: [PATCH 025/296] Hide remote timeline index access details --- pageserver/src/http/routes.rs | 30 ++++++---- pageserver/src/layered_repository.rs | 10 ++-- pageserver/src/remote_storage.rs | 9 ++- pageserver/src/remote_storage/storage_sync.rs | 58 ++++++++++--------- .../remote_storage/storage_sync/download.rs | 30 +++++----- .../src/remote_storage/storage_sync/index.rs | 34 +++++++++-- .../src/remote_storage/storage_sync/upload.rs | 49 +++++++--------- pageserver/src/repository.rs | 6 +- pageserver/src/tenant_mgr.rs | 10 ++-- pageserver/src/timelines.rs | 25 ++------ 10 files changed, 134 insertions(+), 127 deletions(-) diff --git a/pageserver/src/http/routes.rs b/pageserver/src/http/routes.rs index 3ca8b6334a..13e79f8f55 100644 --- a/pageserver/src/http/routes.rs +++ b/pageserver/src/http/routes.rs @@ -3,7 +3,6 @@ use 
std::sync::Arc; use anyhow::Result; use hyper::StatusCode; use hyper::{Body, Request, Response, Uri}; -use tokio::sync::RwLock; use tracing::*; use zenith_utils::auth::JwtAuth; use zenith_utils::http::endpoint::attach_openapi_ui; @@ -22,17 +21,14 @@ use zenith_utils::zid::{ZTenantTimelineId, ZTimelineId}; use super::models::{ StatusResponse, TenantCreateRequest, TenantCreateResponse, TimelineCreateRequest, }; -use crate::remote_storage::{schedule_timeline_download, RemoteTimelineIndex}; -use crate::timelines::{ - extract_remote_timeline_info, LocalTimelineInfo, RemoteTimelineInfo, TimelineInfo, -}; +use crate::remote_storage::{schedule_timeline_download, RemoteIndex}; +use crate::timelines::{LocalTimelineInfo, RemoteTimelineInfo, TimelineInfo}; use crate::{config::PageServerConf, tenant_mgr, timelines, ZTenantId}; -#[derive(Debug)] struct State { conf: &'static PageServerConf, auth: Option>, - remote_index: Arc>, + remote_index: RemoteIndex, allowlist_routes: Vec, } @@ -40,7 +36,7 @@ impl State { fn new( conf: &'static PageServerConf, auth: Option>, - remote_index: Arc>, + remote_index: RemoteIndex, ) -> Self { let allowlist_routes = ["/v1/status", "/v1/doc", "/swagger.yml"] .iter() @@ -113,14 +109,24 @@ async fn timeline_list_handler(request: Request) -> Result, .await .map_err(ApiError::from_err)??; - let remote_index = get_state(&request).remote_index.read().await; let mut response_data = Vec::with_capacity(local_timeline_infos.len()); for (timeline_id, local_timeline_info) in local_timeline_infos { response_data.push(TimelineInfo { tenant_id, timeline_id, local: Some(local_timeline_info), - remote: extract_remote_timeline_info(tenant_id, timeline_id, &remote_index), + remote: get_state(&request) + .remote_index + .read() + .await + .timeline_entry(&ZTenantTimelineId { + tenant_id, + timeline_id, + }) + .map(|remote_entry| RemoteTimelineInfo { + remote_consistent_lsn: remote_entry.disk_consistent_lsn(), + awaits_download: remote_entry.get_awaits_download(), + 
}), }) } @@ -277,7 +283,7 @@ async fn tenant_create_handler(mut request: Request) -> Result) -> Result, ApiError> { pub fn make_router( conf: &'static PageServerConf, auth: Option>, - remote_index: Arc>, + remote_index: RemoteIndex, ) -> RouterBuilder { let spec = include_bytes!("openapi_spec.yml"); let mut router = attach_openapi_ui(endpoint::make_router(), spec, "/swagger.yml", "/v1/doc"); diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index ac0afcb275..bf5f52b18d 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -35,7 +35,7 @@ use self::metadata::{metadata_path, TimelineMetadata, METADATA_FILE_NAME}; use crate::config::PageServerConf; use crate::page_cache; use crate::relish::*; -use crate::remote_storage::{schedule_timeline_checkpoint_upload, RemoteTimelineIndex}; +use crate::remote_storage::{schedule_timeline_checkpoint_upload, RemoteIndex}; use crate::repository::{ BlockNumber, GcResult, Repository, RepositoryTimeline, Timeline, TimelineSyncStatusUpdate, TimelineWriter, ZenithWalRecord, @@ -132,7 +132,7 @@ pub struct LayeredRepository { // provides access to timeline data sitting in the remote storage // supposed to be used for retrieval of remote consistent lsn in walreceiver - remote_index: Arc>, + remote_index: RemoteIndex, /// Makes every timeline to backup their files to remote storage. 
upload_relishes: bool, @@ -355,8 +355,8 @@ impl Repository for LayeredRepository { Ok(()) } - fn get_remote_index(&self) -> &tokio::sync::RwLock { - self.remote_index.as_ref() + fn get_remote_index(&self) -> &RemoteIndex { + &self.remote_index } } @@ -511,7 +511,7 @@ impl LayeredRepository { conf: &'static PageServerConf, walredo_mgr: Arc, tenantid: ZTenantId, - remote_index: Arc>, + remote_index: RemoteIndex, upload_relishes: bool, ) -> LayeredRepository { LayeredRepository { diff --git a/pageserver/src/remote_storage.rs b/pageserver/src/remote_storage.rs index 6eb7bd910b..bdd6086b94 100644 --- a/pageserver/src/remote_storage.rs +++ b/pageserver/src/remote_storage.rs @@ -89,15 +89,14 @@ use std::{ collections::HashMap, ffi, fs, path::{Path, PathBuf}, - sync::Arc, }; use anyhow::{bail, Context}; -use tokio::{io, sync::RwLock}; +use tokio::io; use tracing::{debug, error, info}; use zenith_utils::zid::{ZTenantId, ZTenantTimelineId, ZTimelineId}; -pub use self::storage_sync::index::{RemoteTimelineIndex, TimelineIndexEntry}; +pub use self::storage_sync::index::{RemoteIndex, TimelineIndexEntry}; pub use self::storage_sync::{schedule_timeline_checkpoint_upload, schedule_timeline_download}; use self::{local_fs::LocalFs, rust_s3::S3}; use crate::layered_repository::ephemeral_file::is_ephemeral_file; @@ -120,7 +119,7 @@ type LocalTimelineInitStatuses = HashMap>, + pub remote_index: RemoteIndex, pub local_timeline_init_statuses: LocalTimelineInitStatuses, } @@ -172,7 +171,7 @@ pub fn start_local_timeline_sync( } Ok(SyncStartupData { local_timeline_init_statuses, - remote_index: Arc::new(RwLock::new(RemoteTimelineIndex::empty())), + remote_index: RemoteIndex::empty(), }) } } diff --git a/pageserver/src/remote_storage/storage_sync.rs b/pageserver/src/remote_storage/storage_sync.rs index b01b152e0a..9fe2ab2847 100644 --- a/pageserver/src/remote_storage/storage_sync.rs +++ b/pageserver/src/remote_storage/storage_sync.rs @@ -25,6 +25,7 @@ //! 
* all never local state gets scheduled for upload, such timelines are "local" and fully operational //! * the rest of the remote timelines are reported to pageserver, but not downloaded before they are actually accessed in pageserver, //! it may schedule the download on such occasions. +//! Then, the index is shared across pageserver under [`RemoteIndex`] guard to ensure proper synchronization. //! //! The synchronization unit is an archive: a set of timeline files (or relishes) and a special metadata file, all compressed into a blob. //! Currently, there's no way to process an archive partially, if the archive processing fails, it has to be started from zero next time again. @@ -80,10 +81,7 @@ use futures::stream::{FuturesUnordered, StreamExt}; use lazy_static::lazy_static; use tokio::{ runtime::Runtime, - sync::{ - mpsc::{self, UnboundedReceiver}, - RwLock, - }, + sync::mpsc::{self, UnboundedReceiver}, time::{Duration, Instant}, }; use tracing::*; @@ -92,8 +90,8 @@ use self::{ compression::ArchiveHeader, download::{download_timeline, DownloadedTimeline}, index::{ - ArchiveDescription, ArchiveId, RemoteTimeline, RemoteTimelineIndex, TimelineIndexEntry, - TimelineIndexEntryInner, + ArchiveDescription, ArchiveId, RemoteIndex, RemoteTimeline, RemoteTimelineIndex, + TimelineIndexEntry, TimelineIndexEntryInner, }, upload::upload_timeline_checkpoint, }; @@ -392,13 +390,14 @@ pub(super) fn spawn_storage_sync_thread< None } }); - let mut remote_index = - RemoteTimelineIndex::try_parse_descriptions_from_paths(conf, download_paths); + let remote_index = RemoteIndex::try_parse_descriptions_from_paths(conf, download_paths); - let local_timeline_init_statuses = - schedule_first_sync_tasks(&mut remote_index, local_timeline_files); - let remote_index = Arc::new(RwLock::new(remote_index)); - let remote_index_cloned = Arc::clone(&remote_index); + let local_timeline_init_statuses = schedule_first_sync_tasks( + &mut runtime.block_on(remote_index.write()), + local_timeline_files, + 
); + + let loop_index = remote_index.clone(); thread_mgr::spawn( ThreadKind::StorageSync, None, @@ -410,7 +409,7 @@ pub(super) fn spawn_storage_sync_thread< runtime, conf, receiver, - remote_index_cloned, + loop_index, storage, max_concurrent_sync, max_sync_errors, @@ -438,14 +437,14 @@ fn storage_sync_loop< runtime: Runtime, conf: &'static PageServerConf, mut receiver: UnboundedReceiver, - index: Arc>, + index: RemoteIndex, storage: S, max_concurrent_sync: NonZeroUsize, max_sync_errors: NonZeroU32, ) { - let remote_assets = Arc::new((storage, Arc::clone(&index))); + let remote_assets = Arc::new((storage, index.clone())); loop { - let index = Arc::clone(&index); + let index = index.clone(); let loop_step = runtime.block_on(async { tokio::select! { new_timeline_states = loop_step( @@ -480,7 +479,7 @@ async fn loop_step< >( conf: &'static PageServerConf, receiver: &mut UnboundedReceiver, - remote_assets: Arc<(S, Arc>)>, + remote_assets: Arc<(S, RemoteIndex)>, max_concurrent_sync: NonZeroUsize, max_sync_errors: NonZeroU32, ) -> HashMap> { @@ -560,7 +559,7 @@ async fn process_task< S: RemoteStorage + Send + Sync + 'static, >( conf: &'static PageServerConf, - remote_assets: Arc<(S, Arc>)>, + remote_assets: Arc<(S, RemoteIndex)>, task: SyncTask, max_sync_errors: NonZeroU32, ) -> Option { @@ -584,7 +583,7 @@ async fn process_task< tokio::time::sleep(Duration::from_secs_f64(seconds_to_wait)).await; } - let remote_index = Arc::clone(&remote_assets.1); + let remote_index = &remote_assets.1; let sync_start = Instant::now(); let sync_name = task.kind.sync_name(); @@ -592,7 +591,7 @@ async fn process_task< SyncKind::Download(download_data) => { let download_result = download_timeline( conf, - remote_assets, + remote_assets.clone(), task.sync_id, download_data, task.retries + 1, @@ -772,7 +771,7 @@ async fn fetch_full_index< P: Send + Sync + 'static, S: RemoteStorage + Send + Sync + 'static, >( - (storage, index): &(S, Arc>), + (storage, index): &(S, RemoteIndex), timeline_dir: 
&Path, id: ZTenantTimelineId, ) -> anyhow::Result { @@ -808,8 +807,9 @@ async fn fetch_full_index< } }; drop(index_read); // tokio rw lock is not upgradeable - let mut index_write = index.write().await; - index_write + index + .write() + .await .upgrade_timeline_entry(&id, full_index.clone()) .context("cannot upgrade timeline entry in remote index")?; Ok(full_index) @@ -855,7 +855,7 @@ mod test_utils { #[track_caller] pub async fn ensure_correct_timeline_upload( harness: &RepoHarness, - remote_assets: Arc<(LocalFs, Arc>)>, + remote_assets: Arc<(LocalFs, RemoteIndex)>, timeline_id: ZTimelineId, new_upload: NewCheckpoint, ) { @@ -872,7 +872,7 @@ mod test_utils { let (storage, index) = remote_assets.as_ref(); assert_index_descriptions( index, - RemoteTimelineIndex::try_parse_descriptions_from_paths( + &RemoteIndex::try_parse_descriptions_from_paths( harness.conf, remote_assets .0 @@ -914,7 +914,7 @@ mod test_utils { } pub async fn expect_timeline( - index: &Arc>, + index: &RemoteIndex, sync_id: ZTenantTimelineId, ) -> RemoteTimeline { if let Some(TimelineIndexEntryInner::Full(remote_timeline)) = index @@ -934,9 +934,11 @@ mod test_utils { #[track_caller] pub async fn assert_index_descriptions( - index: &Arc>, - expected_index_with_descriptions: RemoteTimelineIndex, + index: &RemoteIndex, + expected_index_with_descriptions: &RemoteIndex, ) { + let expected_index_with_descriptions = expected_index_with_descriptions.read().await; + let index_read = index.read().await; let actual_sync_ids = index_read.all_sync_ids().collect::>(); let expected_sync_ids = expected_index_with_descriptions diff --git a/pageserver/src/remote_storage/storage_sync/download.rs b/pageserver/src/remote_storage/storage_sync/download.rs index e5362b2973..32549c8650 100644 --- a/pageserver/src/remote_storage/storage_sync/download.rs +++ b/pageserver/src/remote_storage/storage_sync/download.rs @@ -3,7 +3,7 @@ use std::{borrow::Cow, collections::BTreeSet, path::PathBuf, sync::Arc}; use anyhow::{ensure, 
Context}; -use tokio::{fs, sync::RwLock}; +use tokio::fs; use tracing::{debug, error, trace, warn}; use zenith_utils::zid::ZTenantId; @@ -20,8 +20,8 @@ use crate::{ }; use super::{ - index::{ArchiveId, RemoteTimeline, RemoteTimelineIndex}, - TimelineDownload, + index::{ArchiveId, RemoteTimeline}, + RemoteIndex, TimelineDownload, }; /// Timeline download result, with extra data, needed for downloading. @@ -47,7 +47,7 @@ pub(super) async fn download_timeline< S: RemoteStorage + Send + Sync + 'static, >( conf: &'static PageServerConf, - remote_assets: Arc<(S, Arc>)>, + remote_assets: Arc<(S, RemoteIndex)>, sync_id: ZTenantTimelineId, mut download: TimelineDownload, retries: u32, @@ -167,7 +167,7 @@ async fn try_download_archive< tenant_id, timeline_id, }: ZTenantTimelineId, - remote_assets: Arc<(S, Arc>)>, + remote_assets: Arc<(S, RemoteIndex)>, remote_timeline: &RemoteTimeline, archive_id: ArchiveId, files_to_skip: Arc>, @@ -255,16 +255,14 @@ mod tests { let repo_harness = RepoHarness::create("test_download_timeline")?; let sync_id = ZTenantTimelineId::new(repo_harness.tenant_id, TIMELINE_ID); let storage = LocalFs::new(tempdir()?.path().to_owned(), &repo_harness.conf.workdir)?; - let index = Arc::new(RwLock::new( - RemoteTimelineIndex::try_parse_descriptions_from_paths( - repo_harness.conf, - storage - .list() - .await? - .into_iter() - .map(|storage_path| storage.local_path(&storage_path).unwrap()), - ), - )); + let index = RemoteIndex::try_parse_descriptions_from_paths( + repo_harness.conf, + storage + .list() + .await? 
+ .into_iter() + .map(|storage_path| storage.local_path(&storage_path).unwrap()), + ); let remote_assets = Arc::new((storage, index)); let storage = &remote_assets.0; let index = &remote_assets.1; @@ -314,7 +312,7 @@ mod tests { .await; assert_index_descriptions( index, - RemoteTimelineIndex::try_parse_descriptions_from_paths( + &RemoteIndex::try_parse_descriptions_from_paths( repo_harness.conf, remote_assets .0 diff --git a/pageserver/src/remote_storage/storage_sync/index.rs b/pageserver/src/remote_storage/storage_sync/index.rs index 7d6b4881f7..d7bd1f1657 100644 --- a/pageserver/src/remote_storage/storage_sync/index.rs +++ b/pageserver/src/remote_storage/storage_sync/index.rs @@ -7,10 +7,12 @@ use std::{ collections::{BTreeMap, BTreeSet, HashMap}, path::{Path, PathBuf}, + sync::Arc, }; use anyhow::{bail, ensure, Context}; use serde::{Deserialize, Serialize}; +use tokio::sync::RwLock; use tracing::*; use zenith_utils::{ lsn::Lsn, @@ -55,11 +57,14 @@ pub struct RemoteTimelineIndex { timeline_entries: HashMap, } -impl RemoteTimelineIndex { +/// A wrapper to synchronize access to the index; it should be created and used before dealing with any [`RemoteTimelineIndex`].
+pub struct RemoteIndex(Arc>); + +impl RemoteIndex { pub fn empty() -> Self { - Self { + Self(Arc::new(RwLock::new(RemoteTimelineIndex { timeline_entries: HashMap::new(), - } + }))) } /// Attempts to parse file paths (not checking the file contents) and find files @@ -69,7 +74,9 @@ impl RemoteTimelineIndex { conf: &'static PageServerConf, paths: impl Iterator, ) -> Self { - let mut index = Self::empty(); + let mut index = RemoteTimelineIndex { + timeline_entries: HashMap::new(), + }; for path in paths { if let Err(e) = try_parse_index_entry(&mut index, conf, path.as_ref()) { debug!( @@ -79,9 +86,26 @@ impl RemoteTimelineIndex { ); } } - index + + Self(Arc::new(RwLock::new(index))) } + pub async fn read(&self) -> tokio::sync::RwLockReadGuard<'_, RemoteTimelineIndex> { + self.0.read().await + } + + pub async fn write(&self) -> tokio::sync::RwLockWriteGuard<'_, RemoteTimelineIndex> { + self.0.write().await + } +} + +impl Clone for RemoteIndex { + fn clone(&self) -> Self { + Self(Arc::clone(&self.0)) + } +} + +impl RemoteTimelineIndex { pub fn timeline_entry(&self, id: &ZTenantTimelineId) -> Option<&TimelineIndexEntry> { self.timeline_entries.get(id) } diff --git a/pageserver/src/remote_storage/storage_sync/upload.rs b/pageserver/src/remote_storage/storage_sync/upload.rs index dfc4433694..76e92c2781 100644 --- a/pageserver/src/remote_storage/storage_sync/upload.rs +++ b/pageserver/src/remote_storage/storage_sync/upload.rs @@ -2,7 +2,6 @@ use std::{borrow::Cow, collections::BTreeSet, path::PathBuf, sync::Arc}; -use tokio::sync::RwLock; use tracing::{debug, error, warn}; use crate::{ @@ -17,7 +16,7 @@ use crate::{ }, }; -use super::{compression::ArchiveHeader, index::RemoteTimelineIndex, NewCheckpoint}; +use super::{compression::ArchiveHeader, NewCheckpoint, RemoteIndex}; /// Attempts to compress and upload given checkpoint files. /// No extra checks for overlapping files is made: download takes care of that, ensuring no non-metadata local timeline files are overwritten. 
@@ -29,7 +28,7 @@ pub(super) async fn upload_timeline_checkpoint< S: RemoteStorage + Send + Sync + 'static, >( config: &'static PageServerConf, - remote_assets: Arc<(S, Arc>)>, + remote_assets: Arc<(S, RemoteIndex)>, sync_id: ZTenantTimelineId, new_checkpoint: NewCheckpoint, retries: u32, @@ -156,7 +155,7 @@ async fn try_upload_checkpoint< S: RemoteStorage + Send + Sync + 'static, >( config: &'static PageServerConf, - remote_assets: Arc<(S, Arc>)>, + remote_assets: Arc<(S, RemoteIndex)>, sync_id: ZTenantTimelineId, new_checkpoint: &NewCheckpoint, files_to_skip: BTreeSet, @@ -238,16 +237,14 @@ mod tests { let repo_harness = RepoHarness::create("reupload_timeline")?; let sync_id = ZTenantTimelineId::new(repo_harness.tenant_id, TIMELINE_ID); let storage = LocalFs::new(tempdir()?.path().to_owned(), &repo_harness.conf.workdir)?; - let index = Arc::new(RwLock::new( - RemoteTimelineIndex::try_parse_descriptions_from_paths( - repo_harness.conf, - storage - .list() - .await? - .into_iter() - .map(|storage_path| storage.local_path(&storage_path).unwrap()), - ), - )); + let index = RemoteIndex::try_parse_descriptions_from_paths( + repo_harness.conf, + storage + .list() + .await? + .into_iter() + .map(|storage_path| storage.local_path(&storage_path).unwrap()), + ); let remote_assets = Arc::new((storage, index)); let index = &remote_assets.1; @@ -436,16 +433,14 @@ mod tests { let repo_harness = RepoHarness::create("reupload_timeline_rejected")?; let sync_id = ZTenantTimelineId::new(repo_harness.tenant_id, TIMELINE_ID); let storage = LocalFs::new(tempdir()?.path().to_owned(), &repo_harness.conf.workdir)?; - let index = Arc::new(RwLock::new( - RemoteTimelineIndex::try_parse_descriptions_from_paths( - repo_harness.conf, - storage - .list() - .await? - .into_iter() - .map(|storage_path| storage.local_path(&storage_path).unwrap()), - ), - )); + let index = RemoteIndex::try_parse_descriptions_from_paths( + repo_harness.conf, + storage + .list() + .await? 
+ .into_iter() + .map(|storage_path| storage.local_path(&storage_path).unwrap()), + ); let remote_assets = Arc::new((storage, index)); let storage = &remote_assets.0; let index = &remote_assets.1; @@ -464,7 +459,7 @@ mod tests { first_checkpoint, ) .await; - let after_first_uploads = RemoteTimelineIndex::try_parse_descriptions_from_paths( + let after_first_uploads = RemoteIndex::try_parse_descriptions_from_paths( repo_harness.conf, remote_assets .0 @@ -495,7 +490,7 @@ mod tests { 0, ) .await; - assert_index_descriptions(index, after_first_uploads.clone()).await; + assert_index_descriptions(index, &after_first_uploads).await; let checkpoint_with_uploaded_lsn = create_local_timeline( &repo_harness, @@ -511,7 +506,7 @@ mod tests { 0, ) .await; - assert_index_descriptions(index, after_first_uploads.clone()).await; + assert_index_descriptions(index, &after_first_uploads).await; Ok(()) } diff --git a/pageserver/src/repository.rs b/pageserver/src/repository.rs index 074bdf4d01..36273e6d6c 100644 --- a/pageserver/src/repository.rs +++ b/pageserver/src/repository.rs @@ -1,6 +1,6 @@ use crate::layered_repository::metadata::TimelineMetadata; use crate::relish::*; -use crate::remote_storage::RemoteTimelineIndex; +use crate::remote_storage::RemoteIndex; use crate::walrecord::MultiXactMember; use crate::CheckpointConfig; use anyhow::Result; @@ -91,7 +91,7 @@ pub trait Repository: Send + Sync { fn detach_timeline(&self, timeline_id: ZTimelineId) -> Result<()>; // Allows to retrieve remote timeline index from the repo. Used in walreceiver to grab remote consistent lsn. - fn get_remote_index(&self) -> &tokio::sync::RwLock; + fn get_remote_index(&self) -> &RemoteIndex; } /// A timeline, that belongs to the current repository. 
@@ -407,7 +407,7 @@ pub mod repo_harness { self.conf, walredo_mgr, self.tenant_id, - Arc::new(tokio::sync::RwLock::new(RemoteTimelineIndex::empty())), + RemoteIndex::empty(), false, )); // populate repo with locally available timelines diff --git a/pageserver/src/tenant_mgr.rs b/pageserver/src/tenant_mgr.rs index 0bc18231c9..e7cc4ecbaf 100644 --- a/pageserver/src/tenant_mgr.rs +++ b/pageserver/src/tenant_mgr.rs @@ -3,7 +3,7 @@ use crate::config::PageServerConf; use crate::layered_repository::LayeredRepository; -use crate::remote_storage::RemoteTimelineIndex; +use crate::remote_storage::RemoteIndex; use crate::repository::{Repository, Timeline, TimelineSyncStatusUpdate}; use crate::thread_mgr; use crate::thread_mgr::ThreadKind; @@ -66,7 +66,7 @@ fn access_tenants() -> MutexGuard<'static, HashMap> { pub fn load_local_repo( conf: &'static PageServerConf, tenant_id: ZTenantId, - remote_index: &Arc>, + remote_index: &RemoteIndex, ) -> Arc { let mut m = access_tenants(); let tenant = m.entry(tenant_id).or_insert_with(|| { @@ -78,7 +78,7 @@ pub fn load_local_repo( conf, Arc::new(walredo_mgr), tenant_id, - Arc::clone(remote_index), + remote_index.clone(), conf.remote_storage_config.is_some(), )); Tenant { @@ -92,7 +92,7 @@ pub fn load_local_repo( /// Updates tenants' repositories, changing their timelines state in memory. 
pub fn apply_timeline_sync_status_updates( conf: &'static PageServerConf, - remote_index: Arc>, + remote_index: RemoteIndex, sync_status_updates: HashMap>, ) { if sync_status_updates.is_empty() { @@ -172,7 +172,7 @@ pub fn shutdown_all_tenants() { pub fn create_tenant_repository( conf: &'static PageServerConf, tenantid: ZTenantId, - remote_index: Arc>, + remote_index: RemoteIndex, ) -> Result> { match access_tenants().entry(tenantid) { Entry::Occupied(_) => { diff --git a/pageserver/src/timelines.rs b/pageserver/src/timelines.rs index 8c018ce70f..53c4124701 100644 --- a/pageserver/src/timelines.rs +++ b/pageserver/src/timelines.rs @@ -15,13 +15,13 @@ use std::{ use tracing::*; use zenith_utils::lsn::Lsn; -use zenith_utils::zid::{ZTenantId, ZTenantTimelineId, ZTimelineId}; +use zenith_utils::zid::{ZTenantId, ZTimelineId}; use zenith_utils::{crashsafe_dir, logging}; use crate::{ config::PageServerConf, layered_repository::metadata::TimelineMetadata, - remote_storage::RemoteTimelineIndex, + remote_storage::RemoteIndex, repository::{LocalTimelineState, Repository}, }; use crate::{import_datadir, LOG_FILE_NAME}; @@ -127,22 +127,6 @@ pub struct TimelineInfo { pub remote: Option, } -pub fn extract_remote_timeline_info( - tenant_id: ZTenantId, - timeline_id: ZTimelineId, - remote_index: &RemoteTimelineIndex, -) -> Option { - remote_index - .timeline_entry(&ZTenantTimelineId { - tenant_id, - timeline_id, - }) - .map(|remote_entry| RemoteTimelineInfo { - remote_consistent_lsn: remote_entry.disk_consistent_lsn(), - awaits_download: remote_entry.get_awaits_download(), - }) -} - #[derive(Debug, Clone, Copy)] pub struct PointInTime { pub timeline_id: ZTimelineId, @@ -179,7 +163,7 @@ pub fn init_pageserver( pub enum CreateRepo { Real { wal_redo_manager: Arc, - remote_index: Arc>, + remote_index: RemoteIndex, }, Dummy, } @@ -207,8 +191,7 @@ pub fn create_repo( // anymore, but I think that could still happen. 
let wal_redo_manager = Arc::new(crate::walredo::DummyRedoManager {}); - let remote_index = Arc::new(tokio::sync::RwLock::new(RemoteTimelineIndex::empty())); - (wal_redo_manager as _, remote_index) + (wal_redo_manager as _, RemoteIndex::empty()) } }; From 07342f751902b06b253847065f24ddca735e00b3 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Mon, 28 Mar 2022 13:03:46 +0300 Subject: [PATCH 026/296] Major storage format rewrite. This is a backwards-incompatible change. The new pageserver cannot read repositories created with an old pageserver binary, or vice versa. Simplify Repository to a value-store ------------------------------------ Move the responsibility of tracking relation metadata, like which relations exist and what are their sizes, from Repository to a new module, pgdatadir_mapping.rs. The interface to Repository is now a simple set of key-value PUT/GET operations. It's still not any old key-value store though. A Repository is still responsible for handling branching, and every GET operation comes with an LSN. Mapping from Postgres data directory to keys/values --------------------------------------------------- All the data is now stored in the key-value store. The 'pgdatadir_mapping.rs' module handles mapping from PostgreSQL objects like relation pages and SLRUs, to key-value pairs. The key to the Repository key-value store is a Key struct, which consists of a few integer fields. It's wide enough to store a full RelFileNode, fork and block number, and to distinguish those from metadata keys. 'pgdatadir_mapping.rs' is also responsible for maintaining a "partitioning" of the keyspace. Partitioning means splitting the keyspace so that each partition holds a roughly equal number of keys. The partitioning is used when new image layer files are created, so that each image layer file is roughly the same size. The partitioning is also responsible for reclaiming space used by deleted keys.
The Repository implementation doesn't have any explicit support for deleting keys. Instead, the deleted keys are simply omitted from the partitioning, and when a new image layer is created, the omitted keys are not copied over to the new image layer. We might want to implement tombstone keys in the future, to reclaim space faster, but this will work for now. Changes to low-level layer file code ------------------------------------ The concept of a "segment" is gone. Each layer file can now store an arbitrary range of Keys. Checkpointing, compaction ------------------------- The background tasks are somewhat different now. Whenever checkpoint_distance is reached, the WAL receiver thread "freezes" the current in-memory layer, and creates a new one. This is a quick operation and doesn't perform any I/O yet. It then launches a background "layer flushing thread" to write the frozen layer to disk, as a new L0 delta layer. This mechanism takes care of durability. It replaces the checkpointing thread. Compaction is a new background operation that takes a bunch of L0 delta layers, and reshuffles the data in them. It runs in a separate compaction thread. Deployment ---------- This also contains changes to the ansible scripts that enable having multiple different pageservers running at the same time in the staging environment. We will use that to keep an old version of the pageserver running, for clusters created with the old version, at the same time with a new pageserver with the new binary. 
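Illustration of the value-store interface
-----------------------------------------

As a rough illustration of the "key-value store where every GET comes with an LSN" idea described above, here is a minimal sketch. It is not the code in this patch: `ToyStore`, its methods, and the use of plain `u64` for keys and LSNs are all hypothetical stand-ins, and branching is not modeled at all.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins for the real Key and Lsn types (assumption:
// plain integers are enough to show the lookup rule).
type Key = u64;
type Lsn = u64;

/// Toy versioned store: every PUT is tagged with an LSN, and every GET
/// asks for the value as of a given LSN.
#[derive(Default)]
struct ToyStore {
    // Ordered by (key, lsn), so all versions of one key are adjacent.
    versions: BTreeMap<(Key, Lsn), Vec<u8>>,
}

impl ToyStore {
    /// Record a new version of `key` at `lsn`.
    fn put(&mut self, key: Key, lsn: Lsn, value: &[u8]) {
        self.versions.insert((key, lsn), value.to_vec());
    }

    /// Return the newest version of `key` written at or before `lsn`,
    /// or None if the key had no version yet at that point.
    fn get(&self, key: Key, lsn: Lsn) -> Option<&[u8]> {
        self.versions
            .range((key, 0)..=(key, lsn))
            .next_back()
            .map(|(_, value)| value.as_slice())
    }
}
```

With versions of one key stored at LSN 10 and LSN 20, a GET at LSN 15 sees the LSN 10 version, and a GET at LSN 5 sees nothing. In this toy model there is no delete operation either, which mirrors the note above that deleted keys are simply left out of the partitioning rather than removed explicitly.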
Author: Heikki Linnakangas Author: Konstantin Knizhnik Author: Andrey Taranik Reviewed-by: Matthias Van De Meent Reviewed-by: Bojan Serafimov Reviewed-by: Konstantin Knizhnik Reviewed-by: Anton Shyrabokau Reviewed-by: Dhammika Pathirana Reviewed-by: Kirill Bulatov Reviewed-by: Anastasia Lubennikova Reviewed-by: Alexey Kondratov --- .circleci/ansible/.gitignore | 2 + .circleci/ansible/deploy.yaml | 71 +- .circleci/ansible/production.hosts | 17 +- .circleci/ansible/scripts/init_pageserver.sh | 30 + .circleci/ansible/staging.hosts | 18 +- .circleci/config.yml | 2 +- Cargo.lock | 1 + docs/glossary.md | 55 +- docs/rfcs/014-storage-lsm.md | 145 ++ docs/settings.md | 8 +- pageserver/Cargo.toml | 1 + pageserver/src/basebackup.rs | 143 +- pageserver/src/bin/pageserver.rs | 2 +- pageserver/src/config.rs | 43 +- pageserver/src/http/routes.rs | 4 + pageserver/src/import_datadir.rs | 210 +- pageserver/src/keyspace.rs | 134 + pageserver/src/layered_repository.rs | 2242 ++++++++--------- pageserver/src/layered_repository/README.md | 188 +- .../src/layered_repository/delta_layer.rs | 615 +++-- pageserver/src/layered_repository/filename.rs | 300 +-- .../layered_repository/global_layer_map.rs | 142 -- .../src/layered_repository/image_layer.rs | 370 ++- .../src/layered_repository/inmemory_layer.rs | 747 ++---- .../src/layered_repository/interval_tree.rs | 468 ---- .../src/layered_repository/layer_map.rs | 711 +++--- pageserver/src/layered_repository/metadata.rs | 183 +- .../src/layered_repository/storage_layer.rs | 183 +- pageserver/src/lib.rs | 24 +- pageserver/src/page_cache.rs | 17 +- pageserver/src/page_service.rs | 122 +- pageserver/src/pgdatadir_mapping.rs | 1350 ++++++++++ pageserver/src/relish.rs | 226 -- pageserver/src/reltag.rs | 105 + pageserver/src/remote_storage/README.md | 2 +- pageserver/src/remote_storage/local_fs.rs | 2 +- pageserver/src/remote_storage/storage_sync.rs | 6 +- .../storage_sync/compression.rs | 2 +- .../src/remote_storage/storage_sync/index.rs | 2 +- 
pageserver/src/repository.rs | 1042 +++----- pageserver/src/tenant_mgr.rs | 55 +- pageserver/src/tenant_threads.rs | 28 +- pageserver/src/thread_mgr.rs | 9 +- pageserver/src/timelines.rs | 72 +- pageserver/src/walingest.rs | 965 +++++-- pageserver/src/walreceiver.rs | 24 +- pageserver/src/walrecord.rs | 64 +- pageserver/src/walredo.rs | 170 +- postgres_ffi/src/pg_constants.rs | 4 +- test_runner/batch_others/test_snapfiles_gc.py | 130 - test_runner/fixtures/utils.py | 5 +- vendor/postgres | 2 +- 52 files changed, 5878 insertions(+), 5585 deletions(-) create mode 100644 .circleci/ansible/.gitignore create mode 100644 .circleci/ansible/scripts/init_pageserver.sh create mode 100644 docs/rfcs/014-storage-lsm.md create mode 100644 pageserver/src/keyspace.rs delete mode 100644 pageserver/src/layered_repository/global_layer_map.rs delete mode 100644 pageserver/src/layered_repository/interval_tree.rs create mode 100644 pageserver/src/pgdatadir_mapping.rs delete mode 100644 pageserver/src/relish.rs create mode 100644 pageserver/src/reltag.rs delete mode 100644 test_runner/batch_others/test_snapfiles_gc.py diff --git a/.circleci/ansible/.gitignore b/.circleci/ansible/.gitignore new file mode 100644 index 0000000000..14a1c155ae --- /dev/null +++ b/.circleci/ansible/.gitignore @@ -0,0 +1,2 @@ +zenith_install.tar.gz +.zenith_current_version diff --git a/.circleci/ansible/deploy.yaml b/.circleci/ansible/deploy.yaml index 1f43adf950..020a852a00 100644 --- a/.circleci/ansible/deploy.yaml +++ b/.circleci/ansible/deploy.yaml @@ -1,14 +1,11 @@ - name: Upload Zenith binaries - hosts: pageservers:safekeepers + hosts: storage gather_facts: False remote_user: admin - vars: - force_deploy: false tasks: - name: get latest version of Zenith binaries - ignore_errors: true register: current_version_file set_fact: current_version: "{{ lookup('file', '.zenith_current_version') | trim }}" @@ -16,48 +13,13 @@ - pageserver - safekeeper - - name: set zero value for current_version - when: 
current_version_file is failed - set_fact: - current_version: "0" - tags: - - pageserver - - safekeeper - - - name: get deployed version from content of remote file - ignore_errors: true - ansible.builtin.slurp: - src: /usr/local/.zenith_current_version - register: remote_version_file - tags: - - pageserver - - safekeeper - - - name: decode remote file content - when: remote_version_file is succeeded - set_fact: - remote_version: "{{ remote_version_file['content'] | b64decode | trim }}" - tags: - - pageserver - - safekeeper - - - name: set zero value for remote_version - when: remote_version_file is failed - set_fact: - remote_version: "0" - tags: - - pageserver - - safekeeper - - name: inform about versions - debug: msg="Version to deploy - {{ current_version }}, version on storage node - {{ remote_version }}" + debug: msg="Version to deploy - {{ current_version }}" tags: - pageserver - safekeeper - - name: upload and extract Zenith binaries to /usr/local - when: current_version > remote_version or force_deploy ansible.builtin.unarchive: owner: root group: root @@ -74,14 +36,24 @@ hosts: pageservers gather_facts: False remote_user: admin - vars: - force_deploy: false tasks: + + - name: upload init script + when: console_mgmt_base_url is defined + ansible.builtin.template: + src: scripts/init_pageserver.sh + dest: /tmp/init_pageserver.sh + owner: root + group: root + mode: '0755' + become: true + tags: + - pageserver + - name: init pageserver - when: current_version > remote_version or force_deploy shell: - cmd: sudo -u pageserver /usr/local/bin/pageserver -c "pg_distrib_dir='/usr/local'" --init -D /storage/pageserver/data + cmd: /tmp/init_pageserver.sh args: creates: "/storage/pageserver/data/tenants" environment: @@ -107,7 +79,6 @@ # - pageserver - name: upload systemd service definition - when: current_version > remote_version or force_deploy ansible.builtin.template: src: systemd/pageserver.service dest: /etc/systemd/system/pageserver.service @@ -119,7 +90,6 @@ 
- pageserver - name: start systemd service - when: current_version > remote_version or force_deploy ansible.builtin.systemd: daemon_reload: yes name: pageserver @@ -130,7 +100,7 @@ - pageserver - name: post version to console - when: (current_version > remote_version or force_deploy) and console_mgmt_base_url is defined + when: console_mgmt_base_url is defined shell: cmd: | INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id) @@ -142,22 +112,18 @@ hosts: safekeepers gather_facts: False remote_user: admin - vars: - force_deploy: false tasks: # in the future safekeepers should discover pageservers byself # but currently use first pageserver that was discovered - name: set first pageserver var for safekeepers - when: current_version > remote_version or force_deploy set_fact: first_pageserver: "{{ hostvars[groups['pageservers'][0]]['inventory_hostname'] }}" tags: - safekeeper - name: upload systemd service definition - when: current_version > remote_version or force_deploy ansible.builtin.template: src: systemd/safekeeper.service dest: /etc/systemd/system/safekeeper.service @@ -169,7 +135,6 @@ - safekeeper - name: start systemd service - when: current_version > remote_version or force_deploy ansible.builtin.systemd: daemon_reload: yes name: safekeeper @@ -180,7 +145,7 @@ - safekeeper - name: post version to console - when: (current_version > remote_version or force_deploy) and console_mgmt_base_url is defined + when: console_mgmt_base_url is defined shell: cmd: | INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id) diff --git a/.circleci/ansible/production.hosts b/.circleci/ansible/production.hosts index 3a0543f39a..13224b7cf5 100644 --- a/.circleci/ansible/production.hosts +++ b/.circleci/ansible/production.hosts @@ -1,7 +1,16 @@ [pageservers] -zenith-1-ps-1 bucket_name=zenith-storage-oregon bucket_region=us-west-2 +zenith-1-ps-1 console_region_id=1 [safekeepers] -zenith-1-sk-1 -zenith-1-sk-2 -zenith-1-sk-3 +zenith-1-sk-1 
console_region_id=1 +zenith-1-sk-2 console_region_id=1 +zenith-1-sk-3 console_region_id=1 + +[storage:children] +pageservers +safekeepers + +[storage:vars] +console_mgmt_base_url = http://console-release.local +bucket_name = zenith-storage-oregon +bucket_region = us-west-2 diff --git a/.circleci/ansible/scripts/init_pageserver.sh b/.circleci/ansible/scripts/init_pageserver.sh new file mode 100644 index 0000000000..1cbdd0db94 --- /dev/null +++ b/.circleci/ansible/scripts/init_pageserver.sh @@ -0,0 +1,30 @@ +#!/bin/sh + +# get instance id from meta-data service +INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id) + +# store fqdn hostname in var +HOST=$(hostname -f) + + +cat < Page ID + + ++---+ +| | Layer file ++---+ +``` + + +# Memtable + +When new WAL arrives, it is first put into the Memtable. Despite the +name, the Memtable is not a purely in-memory data structure. It can +spill to a temporary file on disk if the system is low on memory, and +is accessed through a buffer cache. + +If the page server crashes, the Memtable is lost. It is rebuilt by +processing again the WAL that's newer than the latest layer in L0. + +The size of the Memtable is configured by the "checkpoint distance" +setting. Because anything that hasn't been flushed to disk and +uploaded to S3 yet needs to be kept in the safekeeper, the "checkpoint +distance" also determines the amount of WAL that needs to be kept in the +safekeeper. + +# L0 + +When the Memtable fills up, it is written out to a new file in L0. The +files are immutable; when a file is created, it is never +modified. Each file in L0 is roughly 1 GB in size (*). Like the +Memtable, each file in L0 covers the whole key range. + +When enough files have been accumulated in L0, compaction +starts. Compaction processes all the files in L0 and reshuffles the +data to create a new set of files in L1.
+ + +(*) except in corner cases like if we want to shut down the page +server and want to flush out the memtable to disk even though it's not +full yet. + + +# L1 + +L1 consists of ~ 1 GB files like L0. But each file covers only part of +the overall key space, and a larger range of LSNs. This speeds up +searches. When you're looking for a given page, you need to check all +the files in L0, to see if they contain a page version for the requested +page. But in L1, you only need to check the files whose key range covers +the requested page. This is particularly important at cold start, when +checking a file means downloading it from S3. + +Partitioning by key range also helps with garbage collection. If only a +part of the database is updated, we will accumulate more files for +the hot part in L1, and old files can be removed without affecting the +cold part. + + +# Image layers + +So far, we've only talked about delta layers. In addition to the delta +layers, we create image layers, when "enough" WAL has been accumulated +for some part of the database. Each image layer covers a 1 GB range of +key space. It contains images of the pages at a single LSN, a snapshot +if you will. + +The exact heuristic for what "enough" means is not clear yet. Maybe +create a new image layer when 10 GB of WAL has been accumulated for a +1 GB segment. + +The image layers limit the number of layers that a search needs to +check. That puts a cap on read latency, and it also allows garbage +collecting layers that are older than the GC horizon. + + +# Partitioning scheme + +When compaction happens and creates a new set of files in L1, how do +we partition the data into the files? + +- Goal is that each file is ~ 1 GB in size +- Try to match partition boundaries at relation boundaries. (See [1] + for how PebblesDB does this, and for why that's important) +- Greedy algorithm + +# Additional Reading + +[1] Paper on PebblesDB and how it does partitioning.
+https://www.cs.utexas.edu/~rak/papers/sosp17-pebblesdb.pdf diff --git a/docs/settings.md b/docs/settings.md index 571cfba8df..69aadc602f 100644 --- a/docs/settings.md +++ b/docs/settings.md @@ -68,11 +68,11 @@ S3. The unit is # of bytes. -#### checkpoint_period +#### compaction_period -The pageserver checks whether `checkpoint_distance` has been reached -every `checkpoint_period` seconds. Default is 1 s, which should be -fine. +Every `compaction_period` seconds, the page server checks if +maintenance operations, like compaction, are needed on the layer +files. Default is 1 s, which should be fine. #### gc_horizon diff --git a/pageserver/Cargo.toml b/pageserver/Cargo.toml index 46e6e2a8f1..de22d0dd77 100644 --- a/pageserver/Cargo.toml +++ b/pageserver/Cargo.toml @@ -12,6 +12,7 @@ bytes = { version = "1.0.1", features = ['serde'] } byteorder = "1.4.3" futures = "0.3.13" hyper = "0.14" +itertools = "0.10.3" lazy_static = "1.4.0" log = "0.4.14" clap = "3.0" diff --git a/pageserver/src/basebackup.rs b/pageserver/src/basebackup.rs index 5711f1807d..e2a56f17d6 100644 --- a/pageserver/src/basebackup.rs +++ b/pageserver/src/basebackup.rs @@ -20,8 +20,9 @@ use std::sync::Arc; use std::time::SystemTime; use tar::{Builder, EntryType, Header}; -use crate::relish::*; +use crate::reltag::SlruKind; use crate::repository::Timeline; +use crate::DatadirTimelineImpl; use postgres_ffi::xlog_utils::*; use postgres_ffi::*; use zenith_utils::lsn::Lsn; @@ -31,7 +32,7 @@ use zenith_utils::lsn::Lsn; /// used for constructing tarball. 
pub struct Basebackup<'a> { ar: Builder<&'a mut dyn Write>, - timeline: &'a Arc, + timeline: &'a Arc, pub lsn: Lsn, prev_record_lsn: Lsn, } @@ -46,7 +47,7 @@ pub struct Basebackup<'a> { impl<'a> Basebackup<'a> { pub fn new( write: &'a mut dyn Write, - timeline: &'a Arc, + timeline: &'a Arc, req_lsn: Option, ) -> Result> { // Compute postgres doesn't have any previous WAL files, but the first @@ -64,13 +65,13 @@ impl<'a> Basebackup<'a> { // prev_lsn to Lsn(0) if we cannot provide the correct value. let (backup_prev, backup_lsn) = if let Some(req_lsn) = req_lsn { // Backup was requested at a particular LSN. Wait for it to arrive. - timeline.wait_lsn(req_lsn)?; + timeline.tline.wait_lsn(req_lsn)?; // If the requested point is the end of the timeline, we can // provide prev_lsn. (get_last_record_rlsn() might return it as // zero, though, if no WAL has been generated on this timeline // yet.) - let end_of_timeline = timeline.get_last_record_rlsn(); + let end_of_timeline = timeline.tline.get_last_record_rlsn(); if req_lsn == end_of_timeline.last { (end_of_timeline.prev, req_lsn) } else { @@ -78,7 +79,7 @@ impl<'a> Basebackup<'a> { } } else { // Backup was requested at end of the timeline. - let end_of_timeline = timeline.get_last_record_rlsn(); + let end_of_timeline = timeline.tline.get_last_record_rlsn(); (end_of_timeline.prev, end_of_timeline.last) }; @@ -115,21 +116,24 @@ impl<'a> Basebackup<'a> { } // Gather non-relational files from object storage pages. - for obj in self.timeline.list_nonrels(self.lsn)? { - match obj { - RelishTag::Slru { slru, segno } => { - self.add_slru_segment(slru, segno)?; - } - RelishTag::FileNodeMap { spcnode, dbnode } => { - self.add_relmap_file(spcnode, dbnode)?; - } - RelishTag::TwoPhase { xid } => { - self.add_twophase_file(xid)?; - } - _ => {} + for kind in [ + SlruKind::Clog, + SlruKind::MultiXactOffsets, + SlruKind::MultiXactMembers, + ] { + for segno in self.timeline.list_slru_segments(kind, self.lsn)? 
{ + self.add_slru_segment(kind, segno)?; } } + // Create tablespace directories + for ((spcnode, dbnode), has_relmap_file) in self.timeline.list_dbdirs(self.lsn)? { + self.add_dbdir(spcnode, dbnode, has_relmap_file)?; + } + for xid in self.timeline.list_twophase_files(self.lsn)? { + self.add_twophase_file(xid)?; + } + // Generate pg_control and bootstrap WAL segment. self.add_pgcontrol_file()?; self.ar.finish()?; @@ -141,28 +145,14 @@ impl<'a> Basebackup<'a> { // Generate SLRU segment files from repository. // fn add_slru_segment(&mut self, slru: SlruKind, segno: u32) -> anyhow::Result<()> { - let seg_size = self - .timeline - .get_relish_size(RelishTag::Slru { slru, segno }, self.lsn)?; - - let nblocks = match seg_size { - Some(seg_size) => seg_size, - None => { - trace!( - "SLRU segment {}/{:>04X} was truncated", - slru.to_str(), - segno - ); - return Ok(()); - } - }; + let nblocks = self.timeline.get_slru_segment_size(slru, segno, self.lsn)?; let mut slru_buf: Vec = Vec::with_capacity(nblocks as usize * pg_constants::BLCKSZ as usize); for blknum in 0..nblocks { - let img = - self.timeline - .get_page_at_lsn(RelishTag::Slru { slru, segno }, blknum, self.lsn)?; + let img = self + .timeline + .get_slru_page_at_lsn(slru, segno, blknum, self.lsn)?; ensure!(img.len() == pg_constants::BLCKSZ as usize); slru_buf.extend_from_slice(&img); @@ -177,16 +167,26 @@ impl<'a> Basebackup<'a> { } // - // Extract pg_filenode.map files from repository - // Along with them also send PG_VERSION for each database. + // Include database/tablespace directories. // - fn add_relmap_file(&mut self, spcnode: u32, dbnode: u32) -> anyhow::Result<()> { - let img = self.timeline.get_page_at_lsn( - RelishTag::FileNodeMap { spcnode, dbnode }, - 0, - self.lsn, - )?; - let path = if spcnode == pg_constants::GLOBALTABLESPACE_OID { + // Each directory contains a PG_VERSION file, and the default database + // directories also contain pg_filenode.map files. 
+ // + fn add_dbdir( + &mut self, + spcnode: u32, + dbnode: u32, + has_relmap_file: bool, + ) -> anyhow::Result<()> { + let relmap_img = if has_relmap_file { + let img = self.timeline.get_relmap_file(spcnode, dbnode, self.lsn)?; + ensure!(img.len() == 512); + Some(img) + } else { + None + }; + + if spcnode == pg_constants::GLOBALTABLESPACE_OID { let version_bytes = pg_constants::PG_MAJORVERSION.as_bytes(); let header = new_tar_header("PG_VERSION", version_bytes.len() as u64)?; self.ar.append(&header, version_bytes)?; @@ -194,8 +194,32 @@ impl<'a> Basebackup<'a> { let header = new_tar_header("global/PG_VERSION", version_bytes.len() as u64)?; self.ar.append(&header, version_bytes)?; - String::from("global/pg_filenode.map") // filenode map for global tablespace + if let Some(img) = relmap_img { + // filenode map for global tablespace + let header = new_tar_header("global/pg_filenode.map", img.len() as u64)?; + self.ar.append(&header, &img[..])?; + } else { + warn!("global/pg_filenode.map is missing"); + } } else { + // User defined tablespaces are not supported. However, as + // a special case, if a tablespace/db directory is + // completely empty, we can leave it out altogether. This + // makes taking a base backup after the 'tablespace' + // regression test pass, because the test drops the + // created tablespaces after the tests. + // + // FIXME: this wouldn't be necessary, if we handled + // XLOG_TBLSPC_DROP records. But we probably should just + // throw an error on CREATE TABLESPACE in the first place. + if !has_relmap_file + && self + .timeline + .list_rels(spcnode, dbnode, self.lsn)? 
+ .is_empty() + { + return Ok(()); + } // User defined tablespaces are not supported ensure!(spcnode == pg_constants::DEFAULTTABLESPACE_OID); @@ -204,16 +228,17 @@ impl<'a> Basebackup<'a> { let header = new_tar_header_dir(&path)?; self.ar.append(&header, &mut io::empty())?; - let dst_path = format!("base/{}/PG_VERSION", dbnode); - let version_bytes = pg_constants::PG_MAJORVERSION.as_bytes(); - let header = new_tar_header(&dst_path, version_bytes.len() as u64)?; - self.ar.append(&header, version_bytes)?; + if let Some(img) = relmap_img { + let dst_path = format!("base/{}/PG_VERSION", dbnode); + let version_bytes = pg_constants::PG_MAJORVERSION.as_bytes(); + let header = new_tar_header(&dst_path, version_bytes.len() as u64)?; + self.ar.append(&header, version_bytes)?; - format!("base/{}/pg_filenode.map", dbnode) + let relmap_path = format!("base/{}/pg_filenode.map", dbnode); + let header = new_tar_header(&relmap_path, img.len() as u64)?; + self.ar.append(&header, &img[..])?; + } }; - ensure!(img.len() == 512); - let header = new_tar_header(&path, img.len() as u64)?; - self.ar.append(&header, &img[..])?; Ok(()) } @@ -221,9 +246,7 @@ impl<'a> Basebackup<'a> { // Extract twophase state files // fn add_twophase_file(&mut self, xid: TransactionId) -> anyhow::Result<()> { - let img = self - .timeline - .get_page_at_lsn(RelishTag::TwoPhase { xid }, 0, self.lsn)?; + let img = self.timeline.get_twophase_file(xid, self.lsn)?; let mut buf = BytesMut::new(); buf.extend_from_slice(&img[..]); @@ -243,11 +266,11 @@ impl<'a> Basebackup<'a> { fn add_pgcontrol_file(&mut self) -> anyhow::Result<()> { let checkpoint_bytes = self .timeline - .get_page_at_lsn(RelishTag::Checkpoint, 0, self.lsn) + .get_checkpoint(self.lsn) .context("failed to get checkpoint bytes")?; let pg_control_bytes = self .timeline - .get_page_at_lsn(RelishTag::ControlFile, 0, self.lsn) + .get_control_file(self.lsn) .context("failed get control bytes")?; let mut pg_control = 
ControlFileData::decode(&pg_control_bytes)?; let mut checkpoint = CheckPoint::decode(&checkpoint_bytes)?; @@ -268,7 +291,7 @@ impl<'a> Basebackup<'a> { // add zenith.signal file let mut zenith_signal = String::new(); if self.prev_record_lsn == Lsn(0) { - if self.lsn == self.timeline.get_ancestor_lsn() { + if self.lsn == self.timeline.tline.get_ancestor_lsn() { write!(zenith_signal, "PREV LSN: none")?; } else { write!(zenith_signal, "PREV LSN: invalid")?; diff --git a/pageserver/src/bin/pageserver.rs b/pageserver/src/bin/pageserver.rs index e217806147..0af96cff66 100644 --- a/pageserver/src/bin/pageserver.rs +++ b/pageserver/src/bin/pageserver.rs @@ -20,7 +20,7 @@ use pageserver::{ config::{defaults::*, PageServerConf}, http, page_cache, page_service, remote_storage::{self, SyncStartupData}, - repository::TimelineSyncStatusUpdate, + repository::{Repository, TimelineSyncStatusUpdate}, tenant_mgr, thread_mgr, thread_mgr::ThreadKind, timelines, virtual_file, LOG_FILE_NAME, diff --git a/pageserver/src/config.rs b/pageserver/src/config.rs index dc85c83c17..0fdfb4ceed 100644 --- a/pageserver/src/config.rs +++ b/pageserver/src/config.rs @@ -31,7 +31,8 @@ pub mod defaults { // would be more appropriate. But a low value forces the code to be exercised more, // which is good for now to trigger bugs. 
pub const DEFAULT_CHECKPOINT_DISTANCE: u64 = 256 * 1024 * 1024; - pub const DEFAULT_CHECKPOINT_PERIOD: &str = "1 s"; + + pub const DEFAULT_COMPACTION_PERIOD: &str = "1 s"; pub const DEFAULT_GC_HORIZON: u64 = 64 * 1024 * 1024; pub const DEFAULT_GC_PERIOD: &str = "100 s"; @@ -57,7 +58,7 @@ pub mod defaults { #listen_http_addr = '{DEFAULT_HTTP_LISTEN_ADDR}' #checkpoint_distance = {DEFAULT_CHECKPOINT_DISTANCE} # in bytes -#checkpoint_period = '{DEFAULT_CHECKPOINT_PERIOD}' +#compaction_period = '{DEFAULT_COMPACTION_PERIOD}' #gc_period = '{DEFAULT_GC_PERIOD}' #gc_horizon = {DEFAULT_GC_HORIZON} @@ -91,7 +92,9 @@ pub struct PageServerConf { // This puts a backstop on how much WAL needs to be re-digested if the // page server crashes. pub checkpoint_distance: u64, - pub checkpoint_period: Duration, + + // How often to check if there's compaction work to be done. + pub compaction_period: Duration, pub gc_horizon: u64, pub gc_period: Duration, @@ -145,7 +148,8 @@ struct PageServerConfigBuilder { listen_http_addr: BuilderValue, checkpoint_distance: BuilderValue, - checkpoint_period: BuilderValue, + + compaction_period: BuilderValue, gc_horizon: BuilderValue, gc_period: BuilderValue, @@ -179,8 +183,8 @@ impl Default for PageServerConfigBuilder { listen_pg_addr: Set(DEFAULT_PG_LISTEN_ADDR.to_string()), listen_http_addr: Set(DEFAULT_HTTP_LISTEN_ADDR.to_string()), checkpoint_distance: Set(DEFAULT_CHECKPOINT_DISTANCE), - checkpoint_period: Set(humantime::parse_duration(DEFAULT_CHECKPOINT_PERIOD) - .expect("cannot parse default checkpoint period")), + compaction_period: Set(humantime::parse_duration(DEFAULT_COMPACTION_PERIOD) + .expect("cannot parse default compaction period")), gc_horizon: Set(DEFAULT_GC_HORIZON), gc_period: Set(humantime::parse_duration(DEFAULT_GC_PERIOD) .expect("cannot parse default gc period")), @@ -216,8 +220,8 @@ impl PageServerConfigBuilder { self.checkpoint_distance = BuilderValue::Set(checkpoint_distance) } - pub fn checkpoint_period(&mut self, 
checkpoint_period: Duration) { - self.checkpoint_period = BuilderValue::Set(checkpoint_period) + pub fn compaction_period(&mut self, compaction_period: Duration) { + self.compaction_period = BuilderValue::Set(compaction_period) } pub fn gc_horizon(&mut self, gc_horizon: u64) { @@ -286,9 +290,9 @@ impl PageServerConfigBuilder { checkpoint_distance: self .checkpoint_distance .ok_or(anyhow::anyhow!("missing checkpoint_distance"))?, - checkpoint_period: self - .checkpoint_period - .ok_or(anyhow::anyhow!("missing checkpoint_period"))?, + compaction_period: self + .compaction_period + .ok_or(anyhow::anyhow!("missing compaction_period"))?, gc_horizon: self .gc_horizon .ok_or(anyhow::anyhow!("missing gc_horizon"))?, @@ -337,10 +341,10 @@ pub struct RemoteStorageConfig { #[derive(Debug, Clone, PartialEq, Eq)] pub enum RemoteStorageKind { /// Storage based on local file system. - /// Specify a root folder to place all stored relish data into. + /// Specify a root folder to place all stored files into. LocalFs(PathBuf), - /// AWS S3 based storage, storing all relishes into the root - /// of the S3 bucket from the config. 
+ /// AWS S3 based storage, storing all files in the S3 bucket + /// specified by the config AwsS3(S3Config), } @@ -425,7 +429,7 @@ impl PageServerConf { "listen_pg_addr" => builder.listen_pg_addr(parse_toml_string(key, item)?), "listen_http_addr" => builder.listen_http_addr(parse_toml_string(key, item)?), "checkpoint_distance" => builder.checkpoint_distance(parse_toml_u64(key, item)?), - "checkpoint_period" => builder.checkpoint_period(parse_toml_duration(key, item)?), + "compaction_period" => builder.compaction_period(parse_toml_duration(key, item)?), "gc_horizon" => builder.gc_horizon(parse_toml_u64(key, item)?), "gc_period" => builder.gc_period(parse_toml_duration(key, item)?), "wait_lsn_timeout" => builder.wait_lsn_timeout(parse_toml_duration(key, item)?), @@ -561,7 +565,7 @@ impl PageServerConf { PageServerConf { id: ZNodeId(0), checkpoint_distance: defaults::DEFAULT_CHECKPOINT_DISTANCE, - checkpoint_period: Duration::from_secs(10), + compaction_period: Duration::from_secs(10), gc_horizon: defaults::DEFAULT_GC_HORIZON, gc_period: Duration::from_secs(10), wait_lsn_timeout: Duration::from_secs(60), @@ -631,7 +635,8 @@ listen_pg_addr = '127.0.0.1:64000' listen_http_addr = '127.0.0.1:9898' checkpoint_distance = 111 # in bytes -checkpoint_period = '111 s' + +compaction_period = '111 s' gc_period = '222 s' gc_horizon = 222 @@ -668,7 +673,7 @@ id = 10 listen_pg_addr: defaults::DEFAULT_PG_LISTEN_ADDR.to_string(), listen_http_addr: defaults::DEFAULT_HTTP_LISTEN_ADDR.to_string(), checkpoint_distance: defaults::DEFAULT_CHECKPOINT_DISTANCE, - checkpoint_period: humantime::parse_duration(defaults::DEFAULT_CHECKPOINT_PERIOD)?, + compaction_period: humantime::parse_duration(defaults::DEFAULT_COMPACTION_PERIOD)?, gc_horizon: defaults::DEFAULT_GC_HORIZON, gc_period: humantime::parse_duration(defaults::DEFAULT_GC_PERIOD)?, wait_lsn_timeout: humantime::parse_duration(defaults::DEFAULT_WAIT_LSN_TIMEOUT)?, @@ -712,7 +717,7 @@ id = 10 listen_pg_addr: "127.0.0.1:64000".to_string(), 
listen_http_addr: "127.0.0.1:9898".to_string(), checkpoint_distance: 111, - checkpoint_period: Duration::from_secs(111), + compaction_period: Duration::from_secs(111), gc_horizon: 222, gc_period: Duration::from_secs(222), wait_lsn_timeout: Duration::from_secs(111), diff --git a/pageserver/src/http/routes.rs b/pageserver/src/http/routes.rs index 13e79f8f55..82e818a47b 100644 --- a/pageserver/src/http/routes.rs +++ b/pageserver/src/http/routes.rs @@ -22,6 +22,7 @@ use super::models::{ StatusResponse, TenantCreateRequest, TenantCreateResponse, TimelineCreateRequest, }; use crate::remote_storage::{schedule_timeline_download, RemoteIndex}; +use crate::repository::Repository; use crate::timelines::{LocalTimelineInfo, RemoteTimelineInfo, TimelineInfo}; use crate::{config::PageServerConf, tenant_mgr, timelines, ZTenantId}; @@ -162,8 +163,11 @@ async fn timeline_detail_handler(request: Request) -> Result( path: &Path, - writer: &dyn TimelineWriter, + tline: &mut DatadirTimeline, lsn: Lsn, ) -> Result<()> { let mut pg_control: Option = None; + let mut modification = tline.begin_modification(lsn); + modification.init_empty()?; + // Scan 'global' + let mut relfiles: Vec = Vec::new(); for direntry in fs::read_dir(path.join("global"))? 
{ let direntry = direntry?; match direntry.file_name().to_str() { None => continue, Some("pg_control") => { - pg_control = Some(import_control_file(writer, lsn, &direntry.path())?); + pg_control = Some(import_control_file(&mut modification, &direntry.path())?); + } + Some("pg_filenode.map") => { + import_relmap_file( + &mut modification, + pg_constants::GLOBALTABLESPACE_OID, + 0, + &direntry.path(), + )?; } - Some("pg_filenode.map") => import_nonrel_file( - writer, - lsn, - RelishTag::FileNodeMap { - spcnode: pg_constants::GLOBALTABLESPACE_OID, - dbnode: 0, - }, - &direntry.path(), - )?, - // Load any relation files into the page server - _ => import_relfile( - &direntry.path(), - writer, - lsn, - pg_constants::GLOBALTABLESPACE_OID, - 0, - )?, + // Load any relation files into the page server (but only after the other files) + _ => relfiles.push(direntry.path()), } } + for relfile in relfiles { + import_relfile( + &mut modification, + &relfile, + pg_constants::GLOBALTABLESPACE_OID, + 0, + )?; + } // Scan 'base'. It contains database dirs, the database OID is the filename. // E.g. 'base/12345', where 12345 is the database OID. @@ -76,54 +82,56 @@ pub fn import_timeline_from_postgres_datadir( let dboid = direntry.file_name().to_string_lossy().parse::()?; + let mut relfiles: Vec = Vec::new(); for direntry in fs::read_dir(direntry.path())? 
{ let direntry = direntry?; match direntry.file_name().to_str() { None => continue, - Some("PG_VERSION") => continue, - Some("pg_filenode.map") => import_nonrel_file( - writer, - lsn, - RelishTag::FileNodeMap { - spcnode: pg_constants::DEFAULTTABLESPACE_OID, - dbnode: dboid, - }, + Some("PG_VERSION") => { + //modification.put_dbdir_creation(pg_constants::DEFAULTTABLESPACE_OID, dboid)?; + } + Some("pg_filenode.map") => import_relmap_file( + &mut modification, + pg_constants::DEFAULTTABLESPACE_OID, + dboid, &direntry.path(), )?, // Load any relation files into the page server - _ => import_relfile( - &direntry.path(), - writer, - lsn, - pg_constants::DEFAULTTABLESPACE_OID, - dboid, - )?, + _ => relfiles.push(direntry.path()), } } + for relfile in relfiles { + import_relfile( + &mut modification, + &relfile, + pg_constants::DEFAULTTABLESPACE_OID, + dboid, + )?; + } } for entry in fs::read_dir(path.join("pg_xact"))? { let entry = entry?; - import_slru_file(writer, lsn, SlruKind::Clog, &entry.path())?; + import_slru_file(&mut modification, SlruKind::Clog, &entry.path())?; } for entry in fs::read_dir(path.join("pg_multixact").join("members"))? { let entry = entry?; - import_slru_file(writer, lsn, SlruKind::MultiXactMembers, &entry.path())?; + import_slru_file(&mut modification, SlruKind::MultiXactMembers, &entry.path())?; } for entry in fs::read_dir(path.join("pg_multixact").join("offsets"))? { let entry = entry?; - import_slru_file(writer, lsn, SlruKind::MultiXactOffsets, &entry.path())?; + import_slru_file(&mut modification, SlruKind::MultiXactOffsets, &entry.path())?; } for entry in fs::read_dir(path.join("pg_twophase"))? { let entry = entry?; let xid = u32::from_str_radix(&entry.path().to_string_lossy(), 16)?; - import_nonrel_file(writer, lsn, RelishTag::TwoPhase { xid }, &entry.path())?; + import_twophase_file(&mut modification, xid, &entry.path())?; } // TODO: Scan pg_tblspc // We're done importing all the data files. 
- writer.advance_last_record_lsn(lsn); + modification.commit()?; // We expect the Postgres server to be shut down cleanly. let pg_control = pg_control.context("pg_control file not found")?; @@ -141,7 +149,7 @@ pub fn import_timeline_from_postgres_datadir( // *after* the checkpoint record. And crucially, it initializes the 'prev_lsn'. import_wal( &path.join("pg_wal"), - writer, + tline, Lsn(pg_control.checkPointCopy.redo), lsn, )?; @@ -150,10 +158,9 @@ pub fn import_timeline_from_postgres_datadir( } // subroutine of import_timeline_from_postgres_datadir(), to load one relation file. -fn import_relfile( +fn import_relfile( + modification: &mut DatadirModification, path: &Path, - timeline: &dyn TimelineWriter, - lsn: Lsn, spcoid: Oid, dboid: Oid, ) -> anyhow::Result<()> { @@ -169,26 +176,35 @@ fn import_relfile( let mut file = File::open(path)?; let mut buf: [u8; 8192] = [0u8; 8192]; + let len = file.metadata().unwrap().len(); + ensure!(len % pg_constants::BLCKSZ as u64 == 0); + let nblocks = len / pg_constants::BLCKSZ as u64; + + if segno != 0 { + todo!(); + } + + let rel = RelTag { + spcnode: spcoid, + dbnode: dboid, + relnode, + forknum, + }; + modification.put_rel_creation(rel, nblocks as u32)?; + let mut blknum: u32 = segno * (1024 * 1024 * 1024 / pg_constants::BLCKSZ as u32); loop { let r = file.read_exact(&mut buf); match r { Ok(_) => { - let rel = RelTag { - spcnode: spcoid, - dbnode: dboid, - relnode, - forknum, - }; - let tag = RelishTag::Relation(rel); - timeline.put_page_image(tag, blknum, lsn, Bytes::copy_from_slice(&buf))?; + modification.put_rel_page_image(rel, blknum, Bytes::copy_from_slice(&buf))?; } // TODO: UnexpectedEof is expected Err(err) => match err.kind() { std::io::ErrorKind::UnexpectedEof => { // reached EOF. That's expected. - // FIXME: maybe check that we read the full length of the file? 
+ ensure!(blknum == nblocks as u32, "unexpected EOF"); break; } _ => { @@ -202,16 +218,28 @@ fn import_relfile( Ok(()) } -/// -/// Import a "non-blocky" file into the repository -/// -/// This is used for small files like the control file, twophase files etc. that -/// are just slurped into the repository as one blob. -/// -fn import_nonrel_file( - timeline: &dyn TimelineWriter, - lsn: Lsn, - tag: RelishTag, +/// Import a relmapper (pg_filenode.map) file into the repository +fn import_relmap_file( + modification: &mut DatadirModification, + spcnode: Oid, + dbnode: Oid, + path: &Path, +) -> Result<()> { + let mut file = File::open(path)?; + let mut buffer = Vec::new(); + // read the whole file + file.read_to_end(&mut buffer)?; + + trace!("importing relmap file {}", path.display()); + + modification.put_relmap_file(spcnode, dbnode, Bytes::copy_from_slice(&buffer[..]))?; + Ok(()) +} + +/// Import a twophase state file (pg_twophase/) into the repository +fn import_twophase_file( + modification: &mut DatadirModification, + xid: TransactionId, path: &Path, ) -> Result<()> { let mut file = File::open(path)?; @@ -221,7 +249,7 @@ fn import_nonrel_file( trace!("importing non-rel file {}", path.display()); - timeline.put_page_image(tag, 0, lsn, Bytes::copy_from_slice(&buffer[..]))?; + modification.put_twophase_file(xid, Bytes::copy_from_slice(&buffer[..]))?; Ok(()) } @@ -230,9 +258,8 @@ fn import_nonrel_file( /// /// The control file is imported as is, but we also extract the checkpoint record /// from it and store it separated. 
-fn import_control_file( - timeline: &dyn TimelineWriter, - lsn: Lsn, +fn import_control_file( + modification: &mut DatadirModification, path: &Path, ) -> Result { let mut file = File::open(path)?; @@ -243,17 +270,12 @@ fn import_control_file( trace!("importing control file {}", path.display()); // Import it as ControlFile - timeline.put_page_image( - RelishTag::ControlFile, - 0, - lsn, - Bytes::copy_from_slice(&buffer[..]), - )?; + modification.put_control_file(Bytes::copy_from_slice(&buffer[..]))?; // Extract the checkpoint record and import it separately. let pg_control = ControlFileData::decode(&buffer)?; let checkpoint_bytes = pg_control.checkPointCopy.encode(); - timeline.put_page_image(RelishTag::Checkpoint, 0, lsn, checkpoint_bytes)?; + modification.put_checkpoint(checkpoint_bytes)?; Ok(pg_control) } @@ -261,28 +283,34 @@ fn import_control_file( /// /// Import an SLRU segment file /// -fn import_slru_file( - timeline: &dyn TimelineWriter, - lsn: Lsn, +fn import_slru_file( + modification: &mut DatadirModification, slru: SlruKind, path: &Path, ) -> Result<()> { - // Does it look like an SLRU file? 
+ trace!("importing slru file {}", path.display()); + let mut file = File::open(path)?; let mut buf: [u8; 8192] = [0u8; 8192]; let segno = u32::from_str_radix(&path.file_name().unwrap().to_string_lossy(), 16)?; - trace!("importing slru file {}", path.display()); + let len = file.metadata().unwrap().len(); + ensure!(len % pg_constants::BLCKSZ as u64 == 0); // we assume SLRU block size is the same as BLCKSZ + let nblocks = len / pg_constants::BLCKSZ as u64; + + ensure!(nblocks <= pg_constants::SLRU_PAGES_PER_SEGMENT as u64); + + modification.put_slru_segment_creation(slru, segno, nblocks as u32)?; let mut rpageno = 0; loop { let r = file.read_exact(&mut buf); match r { Ok(_) => { - timeline.put_page_image( - RelishTag::Slru { slru, segno }, + modification.put_slru_page_image( + slru, + segno, rpageno, - lsn, Bytes::copy_from_slice(&buf), )?; } @@ -291,7 +319,7 @@ fn import_slru_file( Err(err) => match err.kind() { std::io::ErrorKind::UnexpectedEof => { // reached EOF. That's expected. - // FIXME: maybe check that we read the full length of the file? + ensure!(rpageno == nblocks as u32, "unexpected EOF"); break; } _ => { @@ -300,8 +328,6 @@ fn import_slru_file( }, }; rpageno += 1; - - // TODO: Check that the file isn't unexpectedly large, not larger than SLRU_PAGES_PER_SEGMENT pages } Ok(()) @@ -309,9 +335,9 @@ fn import_slru_file( /// Scan PostgreSQL WAL files in given directory and load all records between /// 'startpoint' and 'endpoint' into the repository. 
-fn import_wal( +fn import_wal( walpath: &Path, - writer: &dyn TimelineWriter, + tline: &mut DatadirTimeline, startpoint: Lsn, endpoint: Lsn, ) -> Result<()> { @@ -321,7 +347,7 @@ fn import_wal( let mut offset = startpoint.segment_offset(pg_constants::WAL_SEGMENT_SIZE); let mut last_lsn = startpoint; - let mut walingest = WalIngest::new(writer.deref(), startpoint)?; + let mut walingest = WalIngest::new(tline, startpoint)?; while last_lsn <= endpoint { // FIXME: assume postgresql tli 1 for now @@ -354,7 +380,7 @@ fn import_wal( let mut nrecords = 0; while last_lsn <= endpoint { if let Some((lsn, recdata)) = waldecoder.poll_decode()? { - walingest.ingest_record(writer, recdata, lsn)?; + walingest.ingest_record(tline, recdata, lsn)?; last_lsn = lsn; nrecords += 1; diff --git a/pageserver/src/keyspace.rs b/pageserver/src/keyspace.rs new file mode 100644 index 0000000000..9973568b07 --- /dev/null +++ b/pageserver/src/keyspace.rs @@ -0,0 +1,134 @@ +use crate::repository::{key_range_size, singleton_range, Key}; +use postgres_ffi::pg_constants; +use std::ops::Range; + +// Target file size, when creating image and delta layers +pub const TARGET_FILE_SIZE_BYTES: u64 = 128 * 1024 * 1024; // 128 MB + +/// +/// Represents a set of Keys, in a compact form. +/// +#[derive(Clone, Debug)] +pub struct KeySpace { + /// Contiguous ranges of keys that belong to the key space. In key order, + /// and with no overlap. + pub ranges: Vec>, +} + +impl KeySpace { + /// + /// Partition a key space into chunks of roughly 'target_size' bytes + /// in each partition. + /// + pub fn partition(&self, target_size: u64) -> KeyPartitioning { + // Assume that each value is 8k in size. 
+ let target_nblocks = (target_size / pg_constants::BLCKSZ as u64) as usize; + + let mut parts = Vec::new(); + let mut current_part = Vec::new(); + let mut current_part_size: usize = 0; + for range in &self.ranges { + // If appending the next contiguous range in the keyspace to the current + // partition would cause it to be too large, start a new partition. + let this_size = key_range_size(range) as usize; + if current_part_size + this_size > target_nblocks && !current_part.is_empty() { + parts.push(KeySpace { + ranges: current_part, + }); + current_part = Vec::new(); + current_part_size = 0; + } + + // If the next range is larger than 'target_size', split it into + // 'target_size' chunks. + let mut remain_size = this_size; + let mut start = range.start; + while remain_size > target_nblocks { + let next = start.add(target_nblocks as u32); + parts.push(KeySpace { + ranges: vec![start..next], + }); + start = next; + remain_size -= target_nblocks + } + current_part.push(start..range.end); + current_part_size += remain_size; + } + + // add last partition that wasn't full yet. + if !current_part.is_empty() { + parts.push(KeySpace { + ranges: current_part, + }); + } + + KeyPartitioning { parts } + } +} + +/// +/// Represents a partitioning of the key space. +/// +/// The only kind of partitioning we do is to partition the key space into +/// partitions that are roughly equal in physical size (see KeySpace::partition). +/// But this data structure could represent any partitioning. +/// +#[derive(Clone, Debug, Default)] +pub struct KeyPartitioning { + pub parts: Vec, +} + +impl KeyPartitioning { + pub fn new() -> Self { + KeyPartitioning { parts: Vec::new() } + } +} + +/// +/// A helper object, to collect a set of keys and key ranges into a KeySpace +/// object. This takes care of merging adjacent keys and key ranges into +/// contiguous ranges. 
+/// +#[derive(Clone, Debug, Default)] +pub struct KeySpaceAccum { + accum: Option>, + + ranges: Vec>, +} + +impl KeySpaceAccum { + pub fn new() -> Self { + Self { + accum: None, + ranges: Vec::new(), + } + } + + pub fn add_key(&mut self, key: Key) { + self.add_range(singleton_range(key)) + } + + pub fn add_range(&mut self, range: Range) { + match self.accum.as_mut() { + Some(accum) => { + if range.start == accum.end { + accum.end = range.end; + } else { + assert!(range.start > accum.end); + self.ranges.push(accum.clone()); + *accum = range; + } + } + None => self.accum = Some(range), + } + } + + pub fn to_keyspace(mut self) -> KeySpace { + if let Some(accum) = self.accum.take() { + self.ranges.push(accum); + } + KeySpace { + ranges: self.ranges, + } + } +} diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index bf5f52b18d..837298a10e 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -14,32 +14,33 @@ use anyhow::{anyhow, bail, ensure, Context, Result}; use bookfile::Book; use bytes::Bytes; +use fail::fail_point; +use itertools::Itertools; use lazy_static::lazy_static; -use postgres_ffi::pg_constants::BLCKSZ; use tracing::*; -use std::cmp; +use std::cmp::{max, min, Ordering}; use std::collections::hash_map::Entry; +use std::collections::BTreeSet; use std::collections::HashMap; -use std::collections::{BTreeSet, HashSet}; use std::fs; use std::fs::{File, OpenOptions}; use std::io::Write; -use std::ops::{Bound::Included, Deref}; +use std::ops::{Bound::Included, Deref, Range}; use std::path::{Path, PathBuf}; -use std::sync::atomic::{self, AtomicBool, AtomicUsize}; -use std::sync::{Arc, Mutex, MutexGuard, RwLock, RwLockReadGuard}; +use std::sync::atomic::{self, AtomicBool}; +use std::sync::{Arc, Mutex, MutexGuard, RwLock, RwLockReadGuard, TryLockError}; use std::time::Instant; use self::metadata::{metadata_path, TimelineMetadata, METADATA_FILE_NAME}; use crate::config::PageServerConf; 
+use crate::keyspace::{KeyPartitioning, KeySpace}; use crate::page_cache; -use crate::relish::*; use crate::remote_storage::{schedule_timeline_checkpoint_upload, RemoteIndex}; use crate::repository::{ - BlockNumber, GcResult, Repository, RepositoryTimeline, Timeline, TimelineSyncStatusUpdate, - TimelineWriter, ZenithWalRecord, + GcResult, Repository, RepositoryTimeline, Timeline, TimelineSyncStatusUpdate, TimelineWriter, }; +use crate::repository::{Key, Value}; use crate::thread_mgr; use crate::virtual_file::VirtualFile; use crate::walreceiver::IS_WAL_RECEIVER; @@ -48,7 +49,6 @@ use crate::CheckpointConfig; use crate::{ZTenantId, ZTimelineId}; use zenith_metrics::{register_histogram_vec, Histogram, HistogramVec}; -use zenith_metrics::{register_int_gauge_vec, IntGauge, IntGaugeVec}; use zenith_utils::crashsafe_dir; use zenith_utils::lsn::{AtomicLsn, Lsn, RecordLsn}; use zenith_utils::seqwait::SeqWait; @@ -56,30 +56,25 @@ use zenith_utils::seqwait::SeqWait; mod delta_layer; pub(crate) mod ephemeral_file; mod filename; -mod global_layer_map; mod image_layer; mod inmemory_layer; -mod interval_tree; mod layer_map; pub mod metadata; mod par_fsync; mod storage_layer; -use delta_layer::DeltaLayer; +use delta_layer::{DeltaLayer, DeltaLayerWriter}; use ephemeral_file::is_ephemeral_file; use filename::{DeltaFileName, ImageFileName}; -use image_layer::ImageLayer; +use image_layer::{ImageLayer, ImageLayerWriter}; use inmemory_layer::InMemoryLayer; use layer_map::LayerMap; -use storage_layer::{ - Layer, PageReconstructData, PageReconstructResult, SegmentBlk, SegmentTag, RELISH_SEG_SIZE, -}; +use layer_map::SearchResult; +use storage_layer::{Layer, ValueReconstructResult, ValueReconstructState}; // re-export this function so that page_cache.rs can use it. pub use crate::layered_repository::ephemeral_file::writeback as writeback_ephemeral_file; -static ZERO_PAGE: Bytes = Bytes::from_static(&[0u8; 8192]); - // Metrics collected on operations on the storage repository. lazy_static! 
{ static ref STORAGE_TIME: HistogramVec = register_histogram_vec!( @@ -100,17 +95,6 @@ lazy_static! { .expect("failed to define a metric"); } -lazy_static! { - // NOTE: can be zero if pageserver was restarted and there hasn't been any - // activity yet. - static ref LOGICAL_TIMELINE_SIZE: IntGaugeVec = register_int_gauge_vec!( - "pageserver_logical_timeline_size", - "Logical timeline size (bytes)", - &["tenant_id", "timeline_id"] - ) - .expect("failed to define a metric"); -} - /// Parts of the `.zenith/tenants//timelines/` directory prefix. pub const TIMELINES_SEGMENT_NAME: &str = "timelines"; @@ -118,7 +102,7 @@ pub const TIMELINES_SEGMENT_NAME: &str = "timelines"; /// Repository consists of multiple timelines. Keep them in a hash table. /// pub struct LayeredRepository { - conf: &'static PageServerConf, + pub conf: &'static PageServerConf, tenantid: ZTenantId, timelines: Mutex>, // This mutex prevents creation of new timelines during GC. @@ -135,21 +119,23 @@ pub struct LayeredRepository { remote_index: RemoteIndex, /// Makes every timeline to backup their files to remote storage. - upload_relishes: bool, + upload_layers: bool, } /// Public interface impl Repository for LayeredRepository { - fn get_timeline(&self, timelineid: ZTimelineId) -> Option { + type Timeline = LayeredTimeline; + + fn get_timeline(&self, timelineid: ZTimelineId) -> Option> { let timelines = self.timelines.lock().unwrap(); self.get_timeline_internal(timelineid, &timelines) .map(RepositoryTimeline::from) } - fn get_timeline_load(&self, timelineid: ZTimelineId) -> Result> { + fn get_timeline_load(&self, timelineid: ZTimelineId) -> Result> { let mut timelines = self.timelines.lock().unwrap(); match self.get_timeline_load_internal(timelineid, &mut timelines)? 
{ - Some(local_loaded_timeline) => Ok(local_loaded_timeline as _), + Some(local_loaded_timeline) => Ok(local_loaded_timeline), None => anyhow::bail!( "cannot get local timeline: unknown timeline id: {}", timelineid @@ -157,7 +143,7 @@ impl Repository for LayeredRepository { } } - fn list_timelines(&self) -> Vec<(ZTimelineId, RepositoryTimeline)> { + fn list_timelines(&self) -> Vec<(ZTimelineId, RepositoryTimeline)> { self.timelines .lock() .unwrap() @@ -175,7 +161,7 @@ impl Repository for LayeredRepository { &self, timelineid: ZTimelineId, initdb_lsn: Lsn, - ) -> Result> { + ) -> Result> { let mut timelines = self.timelines.lock().unwrap(); // Create the timeline directory, and write initial metadata to file. @@ -191,9 +177,9 @@ impl Repository for LayeredRepository { timelineid, self.tenantid, Arc::clone(&self.walredo_mgr), - 0, - self.upload_relishes, + self.upload_layers, ); + timeline.layers.lock().unwrap().next_open_layer_at = Some(initdb_lsn); let timeline = Arc::new(timeline); let r = timelines.insert( @@ -282,13 +268,46 @@ impl Repository for LayeredRepository { }) } - fn checkpoint_iteration(&self, cconf: CheckpointConfig) -> Result<()> { + fn compaction_iteration(&self) -> Result<()> { + // Scan through the hashmap and collect a list of all the timelines, + // while holding the lock. Then drop the lock and actually perform the + // compactions. We don't want to block everything else while the + // compaction runs. + let timelines = self.timelines.lock().unwrap(); + let timelines_to_compact = timelines + .iter() + .map(|(timelineid, timeline)| (*timelineid, timeline.clone())) + .collect::>(); + drop(timelines); + + for (timelineid, timeline) in &timelines_to_compact { + let _entered = + info_span!("compact", timeline = %timelineid, tenant = %self.tenantid).entered(); + match timeline { + LayeredTimelineEntry::Loaded(timeline) => { + timeline.compact()?; + } + LayeredTimelineEntry::Unloaded { .. 
} => { + debug!("Cannot compact remote timeline {}", timelineid) + } + } + } + + Ok(()) + } + + /// + /// Flush all in-memory data to disk. + /// + /// Used at shutdown. + /// + fn checkpoint(&self) -> Result<()> { // Scan through the hashmap and collect a list of all the timelines, // while holding the lock. Then drop the lock and actually perform the // checkpoints. We don't want to block everything else while the // checkpoint runs. let timelines = self.timelines.lock().unwrap(); - let timelines_to_checkpoint = timelines + let timelines_to_compact = timelines .iter() // filter to get only loaded timelines .filter_map(|(timelineid, entry)| match entry { @@ -302,10 +321,10 @@ impl Repository for LayeredRepository { .collect::>(); drop(timelines); - for (timelineid, timeline) in &timelines_to_checkpoint { + for (timelineid, timeline) in &timelines_to_compact { let _entered = info_span!("checkpoint", timeline = %timelineid, tenant = %self.tenantid).entered(); - timeline.checkpoint(cconf)?; + timeline.checkpoint(CheckpointConfig::Flush)?; } Ok(()) @@ -403,7 +422,7 @@ impl LayeredTimelineEntry { } } -impl From for RepositoryTimeline { +impl From for RepositoryTimeline { fn from(entry: LayeredTimelineEntry) -> Self { match entry { LayeredTimelineEntry::Loaded(timeline) => RepositoryTimeline::Loaded(timeline as _), @@ -489,20 +508,18 @@ impl LayeredRepository { let _enter = info_span!("loading timeline", timeline = %timelineid, tenant = %self.tenantid) .entered(); - let mut timeline = LayeredTimeline::new( + let timeline = LayeredTimeline::new( self.conf, metadata, ancestor, timelineid, self.tenantid, Arc::clone(&self.walredo_mgr), - 0, // init with 0 and update after layers are loaded, - self.upload_relishes, + self.upload_layers, ); timeline .load_layer_map(disk_consistent_lsn) .context("failed to load layermap")?; - timeline.init_current_logical_size()?; Ok(Arc::new(timeline)) } @@ -512,7 +529,7 @@ impl LayeredRepository { walredo_mgr: Arc, tenantid: ZTenantId, 
remote_index: RemoteIndex, - upload_relishes: bool, + upload_layers: bool, ) -> LayeredRepository { LayeredRepository { tenantid, @@ -521,7 +538,7 @@ impl LayeredRepository { gc_cs: Mutex::new(()), walredo_mgr, remote_index, - upload_relishes, + upload_layers, } } @@ -673,7 +690,8 @@ impl LayeredRepository { timeline.checkpoint(CheckpointConfig::Forced)?; info!("timeline {} checkpoint_before_gc done", timelineid); } - let result = timeline.gc_timeline(branchpoints, cutoff)?; + timeline.update_gc_info(branchpoints, cutoff); + let result = timeline.gc()?; totals += result; timelines = self.timelines.lock().unwrap(); @@ -693,6 +711,8 @@ pub struct LayeredTimeline { layers: Mutex, + last_freeze_at: AtomicLsn, + // WAL redo manager walredo_mgr: Arc, @@ -725,33 +745,14 @@ pub struct LayeredTimeline { ancestor_timeline: Option, ancestor_lsn: Lsn, - // this variable indicates how much space is used from user's point of view, - // e.g. we do not account here for multiple versions of data and so on. - // this is counted incrementally based on physical relishes (excluding FileNodeMap) - // current_logical_size is not stored no disk and initialized on timeline creation using - // get_current_logical_size_non_incremental in init_current_logical_size - // this is needed because when we save it in metadata it can become out of sync - // because current_logical_size is consistent on last_record_lsn, not ondisk_consistent_lsn - // NOTE: current_logical_size also includes size of the ancestor - current_logical_size: AtomicUsize, // bytes - - // To avoid calling .with_label_values and formatting the tenant and timeline IDs to strings - // every time the logical size is updated, keep a direct reference to the Gauge here. 
- // unfortunately it doesnt forward atomic methods like .fetch_add - // so use two fields: actual size and metric - // see https://github.com/zenithdb/zenith/issues/622 for discussion - // TODO: it is possible to combine these two fields into single one using custom metric which uses SeqCst - // ordering for its operations, but involves private modules, and macro trickery - current_logical_size_gauge: IntGauge, - // Metrics histograms reconstruct_time_histo: Histogram, - checkpoint_time_histo: Histogram, - flush_checkpoint_time_histo: Histogram, - forced_checkpoint_time_histo: Histogram, + flush_time_histo: Histogram, + compact_time_histo: Histogram, + create_images_time_histo: Histogram, /// If `true`, will backup its files that appear after each checkpointing to the remote storage. - upload_relishes: AtomicBool, + upload_layers: AtomicBool, /// Ensures layers aren't frozen by checkpointer between /// [`LayeredTimeline::get_layer_for_write`] and layer reads. @@ -760,15 +761,24 @@ pub struct LayeredTimeline { /// to avoid deadlock. write_lock: Mutex<()>, - // Prevent concurrent checkpoints. - // Checkpoints are normally performed by one thread. But checkpoint can also be manually requested by admin - // (that's used in tests), and shutdown also forces a checkpoint. These forced checkpoints run in a different thread - // and could be triggered at the same time as a normal checkpoint. - checkpoint_cs: Mutex<()>, + /// Used to ensure that there is only one thread + layer_flush_lock: Mutex<()>, + + // Prevent concurrent compactions. + // Compactions are normally performed by one thread. But compaction can also be manually + // requested by admin (that's used in tests). These forced compactions run in a different + // thread and could be triggered at the same time as a normal, timed compaction. 
+ compaction_cs: Mutex<()>, // Needed to ensure that we can't create a branch at a point that was already garbage collected latest_gc_cutoff_lsn: RwLock, + // List of child timelines and their branch points. This is needed to avoid + // garbage collecting data that is still needed by the child timelines. + gc_info: RwLock, + + partitioning: RwLock>, + // It may change across major versions so for simplicity // keep it after running initdb for a timeline. // It is needed in checks when we want to error on some operations @@ -778,6 +788,28 @@ pub struct LayeredTimeline { initdb_lsn: Lsn, } +/// +/// Information about how much history needs to be retained, needed by +/// Garbage Collection. +/// +struct GcInfo { + /// Specific LSNs that are needed. + /// + /// Currently, this includes all points where child branches have + /// been forked off from. In the future, could also include + /// explicit user-defined snapshot points. + retain_lsns: Vec, + + /// In addition to 'retain_lsns', keep everything newer than this + /// point. + /// + /// This is calculated by subtracting 'gc_horizon' setting from + /// last-record LSN + /// + /// FIXME: is this inclusive or exclusive? + cutoff: Lsn, +} + /// Public interface functions impl Timeline for LayeredTimeline { fn get_ancestor_lsn(&self) -> Lsn { @@ -815,162 +847,35 @@ impl Timeline for LayeredTimeline { self.latest_gc_cutoff_lsn.read().unwrap() } - /// Look up given page version. - fn get_page_at_lsn(&self, rel: RelishTag, rel_blknum: BlockNumber, lsn: Lsn) -> Result { - if !rel.is_blocky() && rel_blknum != 0 { - bail!( - "invalid request for block {} for non-blocky relish {}", - rel_blknum, - rel - ); - } - debug_assert!(lsn <= self.get_last_record_lsn()); - let (seg, seg_blknum) = SegmentTag::from_blknum(rel, rel_blknum); - - if let Some((layer, lsn)) = self.get_layer_for_read(seg, lsn)? 
{ - self.materialize_page(seg, seg_blknum, lsn, &*layer) - } else { - // FIXME: This can happen if PostgreSQL extends a relation but never writes - // the page. See https://github.com/zenithdb/zenith/issues/841 - // - // Would be nice to detect that situation better. - if seg.segno > 0 && self.get_rel_exists(rel, lsn)? { - warn!("Page {} blk {} at {} not found", rel, rel_blknum, lsn); - return Ok(ZERO_PAGE.clone()); - } - - bail!("segment {} not found at {}", rel, lsn); - } - } - - fn get_relish_size(&self, rel: RelishTag, lsn: Lsn) -> Result> { - if !rel.is_blocky() { - bail!( - "invalid get_relish_size request for non-blocky relish {}", - rel - ); - } + /// Look up the value with the given key + fn get(&self, key: Key, lsn: Lsn) -> Result { debug_assert!(lsn <= self.get_last_record_lsn()); - let mut segno = 0; - loop { - let seg = SegmentTag { rel, segno }; - - let segsize; - if let Some((layer, lsn)) = self.get_layer_for_read(seg, lsn)? { - segsize = layer.get_seg_size(lsn)?; - trace!("get_seg_size: {} at {} -> {}", seg, lsn, segsize); - } else { - if segno == 0 { - return Ok(None); + // Check the page cache. We will get back the most recent page with lsn <= `lsn`. + // The cached image can be returned directly if there is no WAL between the cached image + // and requested LSN. The cached image can also be used to reduce the amount of WAL needed + // for redo. 
+ let cached_page_img = match self.lookup_cached_page(&key, lsn) { + Some((cached_lsn, cached_img)) => { + match cached_lsn.cmp(&lsn) { + Ordering::Less => {} // there might be WAL between cached_lsn and lsn, we need to check + Ordering::Equal => return Ok(cached_img), // exact LSN match, return the image + Ordering::Greater => panic!(), // the returned lsn should never be after the requested lsn } - segsize = 0; + Some((cached_lsn, cached_img)) } - - if segsize != RELISH_SEG_SIZE { - let result = segno * RELISH_SEG_SIZE + segsize; - return Ok(Some(result)); - } - segno += 1; - } - } - - fn get_rel_exists(&self, rel: RelishTag, lsn: Lsn) -> Result { - debug_assert!(lsn <= self.get_last_record_lsn()); - - let seg = SegmentTag { rel, segno: 0 }; - - let result = if let Some((layer, lsn)) = self.get_layer_for_read(seg, lsn)? { - layer.get_seg_exists(lsn)? - } else { - false + None => None, }; - trace!("get_rel_exists: {} at {} -> {}", rel, lsn, result); - Ok(result) - } - - fn list_rels(&self, spcnode: u32, dbnode: u32, lsn: Lsn) -> Result> { - let request_tag = RelTag { - spcnode, - dbnode, - relnode: 0, - forknum: 0, + let mut reconstruct_state = ValueReconstructState { + records: Vec::new(), + img: cached_page_img, }; - self.list_relishes(Some(request_tag), lsn) - } + self.get_reconstruct_data(key, lsn, &mut reconstruct_state)?; - fn list_nonrels(&self, lsn: Lsn) -> Result> { - info!("list_nonrels called at {}", lsn); - - self.list_relishes(None, lsn) - } - - fn list_relishes(&self, tag: Option, lsn: Lsn) -> Result> { - trace!("list_relishes called at {}", lsn); - debug_assert!(lsn <= self.get_last_record_lsn()); - - // List of all relishes along with a flag that marks if they exist at the given lsn. - let mut all_relishes_map: HashMap = HashMap::new(); - let mut result = HashSet::new(); - let mut timeline = self; - - // Iterate through layers back in time and find the most - // recent state of the relish. 
Don't add relish to the list - // if newer version is already there. - // - // This most recent version can represent dropped or existing relish. - // We will filter dropped relishes below. - // - loop { - let rels = timeline.layers.lock().unwrap().list_relishes(tag, lsn)?; - - for (&new_relish, &new_relish_exists) in rels.iter() { - match all_relishes_map.entry(new_relish) { - Entry::Occupied(o) => { - trace!( - "Newer version of the object {} is already found: exists {}", - new_relish, - o.get(), - ); - } - Entry::Vacant(v) => { - v.insert(new_relish_exists); - trace!( - "Newer version of the object {} NOT found. Insert NEW: exists {}", - new_relish, - new_relish_exists - ); - } - } - } - - match &timeline.ancestor_timeline { - None => break, - Some(ancestor_entry) => { - timeline = ancestor_entry.ensure_loaded().with_context( - || format!( - "cannot list relishes for timeline {} tenant {} due to its ancestor {} being either unloaded", - self.timelineid, self.tenantid, ancestor_entry.timeline_id(), - ) - )?; - continue; - } - } - } - - // Filter out dropped relishes - for (&new_relish, &new_relish_exists) in all_relishes_map.iter() { - if new_relish_exists { - result.insert(new_relish); - trace!("List object {}", new_relish); - } else { - trace!("Filtered out dropped object {}", new_relish); - } - } - - Ok(result) + self.reconstruct_time_histo + .observe_closure_duration(|| self.reconstruct_value(key, lsn, reconstruct_state)) } /// Public entry point for checkpoint(). All the logic is in the private @@ -978,15 +883,15 @@ impl Timeline for LayeredTimeline { /// metrics collection. 
fn checkpoint(&self, cconf: CheckpointConfig) -> anyhow::Result<()> { match cconf { - CheckpointConfig::Flush => self - .flush_checkpoint_time_histo - .observe_closure_duration(|| self.checkpoint_internal(0, false)), - CheckpointConfig::Forced => self - .forced_checkpoint_time_histo - .observe_closure_duration(|| self.checkpoint_internal(0, true)), - CheckpointConfig::Distance(distance) => self - .checkpoint_time_histo - .observe_closure_duration(|| self.checkpoint_internal(distance, true)), + CheckpointConfig::Flush => { + self.freeze_inmem_layer(false); + self.flush_frozen_layers(true) + } + CheckpointConfig::Forced => { + self.freeze_inmem_layer(false); + self.flush_frozen_layers(true)?; + self.compact() + } } } @@ -1019,51 +924,24 @@ impl Timeline for LayeredTimeline { self.last_record_lsn.load() } - fn get_current_logical_size(&self) -> usize { - self.current_logical_size.load(atomic::Ordering::Acquire) as usize - } - - fn get_current_logical_size_non_incremental(&self, lsn: Lsn) -> Result { - let mut total_blocks: usize = 0; - - let _enter = info_span!("calc logical size", %lsn).entered(); - - // list of all relations in this timeline, including ancestor timelines - let all_rels = self.list_rels(0, 0, lsn)?; - - for rel in all_rels { - if let Some(size) = self.get_relish_size(rel, lsn)? { - total_blocks += size as usize; - } - } - - let non_rels = self.list_nonrels(lsn)?; - for non_rel in non_rels { - // TODO support TwoPhase - if matches!(non_rel, RelishTag::Slru { slru: _, segno: _ }) { - if let Some(size) = self.get_relish_size(non_rel, lsn)? 
{ - total_blocks += size as usize; - } - } - } - - Ok(total_blocks * BLCKSZ as usize) - } - fn get_disk_consistent_lsn(&self) -> Lsn { self.disk_consistent_lsn.load() } + fn hint_partitioning(&self, partitioning: KeyPartitioning, lsn: Lsn) -> Result<()> { + self.partitioning + .write() + .unwrap() + .replace((partitioning, lsn)); + Ok(()) + } + fn writer<'a>(&'a self) -> Box { Box::new(LayeredTimelineWriter { tl: self, _write_guard: self.write_lock.lock().unwrap(), }) } - - fn upgrade_to_layered_timeline(&self) -> &crate::layered_repository::LayeredTimeline { - self - } } impl LayeredTimeline { @@ -1078,32 +956,28 @@ impl LayeredTimeline { timelineid: ZTimelineId, tenantid: ZTenantId, walredo_mgr: Arc, - current_logical_size: usize, - upload_relishes: bool, + upload_layers: bool, ) -> LayeredTimeline { - let current_logical_size_gauge = LOGICAL_TIMELINE_SIZE - .get_metric_with_label_values(&[&tenantid.to_string(), &timelineid.to_string()]) - .unwrap(); let reconstruct_time_histo = RECONSTRUCT_TIME .get_metric_with_label_values(&[&tenantid.to_string(), &timelineid.to_string()]) .unwrap(); - let checkpoint_time_histo = STORAGE_TIME + let flush_time_histo = STORAGE_TIME .get_metric_with_label_values(&[ - "checkpoint", + "layer flush", &tenantid.to_string(), &timelineid.to_string(), ]) .unwrap(); - let flush_checkpoint_time_histo = STORAGE_TIME + let compact_time_histo = STORAGE_TIME .get_metric_with_label_values(&[ - "flush checkpoint", + "compact", &tenantid.to_string(), &timelineid.to_string(), ]) .unwrap(); - let forced_checkpoint_time_histo = STORAGE_TIME + let create_images_time_histo = STORAGE_TIME .get_metric_with_label_values(&[ - "forced checkpoint", + "create images", &tenantid.to_string(), &timelineid.to_string(), ]) @@ -1124,18 +998,27 @@ impl LayeredTimeline { }), disk_consistent_lsn: AtomicLsn::new(metadata.disk_consistent_lsn().0), + last_freeze_at: AtomicLsn::new(0), + ancestor_timeline: ancestor, ancestor_lsn: metadata.ancestor_lsn(), - 
current_logical_size: AtomicUsize::new(current_logical_size), - current_logical_size_gauge, + reconstruct_time_histo, - checkpoint_time_histo, - flush_checkpoint_time_histo, - forced_checkpoint_time_histo, - upload_relishes: AtomicBool::new(upload_relishes), + flush_time_histo, + compact_time_histo, + create_images_time_histo, + + upload_layers: AtomicBool::new(upload_layers), write_lock: Mutex::new(()), - checkpoint_cs: Mutex::new(()), + layer_flush_lock: Mutex::new(()), + compaction_cs: Mutex::new(()), + + gc_info: RwLock::new(GcInfo { + retain_lsns: Vec::new(), + cutoff: Lsn(0), + }), + partitioning: RwLock::new(None), latest_gc_cutoff_lsn: RwLock::new(metadata.latest_gc_cutoff_lsn()), initdb_lsn: metadata.initdb_lsn(), @@ -1179,13 +1062,12 @@ impl LayeredTimeline { num_layers += 1; } else if let Some(deltafilename) = DeltaFileName::parse_str(&fname) { // Create a DeltaLayer struct for each delta file. - ensure!(deltafilename.start_lsn < deltafilename.end_lsn); // The end-LSN is exclusive, while disk_consistent_lsn is // inclusive. For example, if disk_consistent_lsn is 100, it is // OK for a delta layer to have end LSN 101, but if the end LSN // is 102, then it might not have been fully flushed to disk // before crash. 
- if deltafilename.end_lsn > disk_consistent_lsn + 1 { + if deltafilename.lsn_range.end > disk_consistent_lsn + 1 { warn!( "found future delta layer {} on timeline {} disk_consistent_lsn is {}", deltafilename, self.timelineid, disk_consistent_lsn @@ -1212,41 +1094,14 @@ impl LayeredTimeline { } } - info!("loaded layer map with {} layers", num_layers); + layers.next_open_layer_at = Some(Lsn(disk_consistent_lsn.0) + 1); - Ok(()) - } - - /// - /// Used to init current logical size on startup - /// - fn init_current_logical_size(&mut self) -> Result<()> { - if self.current_logical_size.load(atomic::Ordering::Relaxed) != 0 { - bail!("cannot init already initialized current logical size") - }; - let lsn = self.get_last_record_lsn(); - self.current_logical_size = - AtomicUsize::new(self.get_current_logical_size_non_incremental(lsn)?); - trace!( - "current_logical_size initialized to {}", - self.current_logical_size.load(atomic::Ordering::Relaxed) + info!( + "loaded layer map with {} layers at {}", + num_layers, disk_consistent_lsn ); - Ok(()) - } - /// - /// Get a handle to a Layer for reading. - /// - /// The returned Layer might be from an ancestor timeline, if the - /// segment hasn't been updated on this timeline yet. - /// - fn get_layer_for_read( - &self, - seg: SegmentTag, - lsn: Lsn, - ) -> Result, Lsn)>> { - let self_layers = self.layers.lock().unwrap(); - self.get_layer_for_read_locked(seg, lsn, &self_layers) + Ok(()) } /// @@ -1257,88 +1112,160 @@ impl LayeredTimeline { /// /// This function takes the current timeline's locked LayerMap as an argument, /// so callers can avoid potential race conditions. - fn get_layer_for_read_locked( + fn get_reconstruct_data( &self, - seg: SegmentTag, - lsn: Lsn, - self_layers: &MutexGuard, - ) -> anyhow::Result, Lsn)>> { - trace!("get_layer_for_read called for {} at {}", seg, lsn); - - // If you requested a page at an older LSN, before the branch point, dig into - // the right ancestor timeline. 
This can only happen if you launch a read-only - // node with an old LSN, a primary always uses a recent LSN in its requests. + key: Key, + request_lsn: Lsn, + reconstruct_state: &mut ValueReconstructState, + ) -> anyhow::Result<()> { + // Start from the current timeline. + let mut timeline_owned; let mut timeline = self; - let mut lsn = lsn; - while lsn < timeline.ancestor_lsn { - trace!("going into ancestor {} ", timeline.ancestor_lsn); - timeline = timeline - .ancestor_timeline - .as_ref() - .expect("there should be an ancestor") - .ensure_loaded() - .with_context(|| format!( - "Cannot get the whole layer for read locked: timeline {} is not present locally", - self.get_ancestor_timeline_id().unwrap()) - )?; - } + let mut path: Vec<(ValueReconstructResult, Lsn, Arc)> = Vec::new(); - // Now we have the right starting timeline for our search. - loop { - let layers_owned: MutexGuard; - let layers = if self as *const LayeredTimeline != timeline as *const LayeredTimeline { - layers_owned = timeline.layers.lock().unwrap(); - &layers_owned - } else { - self_layers - }; + // 'prev_lsn' tracks the last LSN that we were at in our search. It's used + // to check that each iteration makes some progress, to break infinite + // looping if something goes wrong. + let mut prev_lsn = Lsn(u64::MAX); - // - // FIXME: If the relation has been dropped, does this return the right - // thing? The compute node should not normally request dropped relations, - // but if OID wraparound happens the same relfilenode might get reused - // for an unrelated relation. - // + let mut result = ValueReconstructResult::Continue; + let mut cont_lsn = Lsn(request_lsn.0 + 1); - // Do we have a layer on this timeline?
- if let Some(layer) = layers.get(&seg, lsn) { - trace!( - "found layer in cache: {} {}-{}", - timeline.timelineid, - layer.get_start_lsn(), - layer.get_end_lsn() - ); + 'outer: loop { + // The function should have updated 'state' + //info!("CALLED for {} at {}: {:?} with {} records", reconstruct_state.key, reconstruct_state.lsn, result, reconstruct_state.records.len()); + match result { + ValueReconstructResult::Complete => return Ok(()), + ValueReconstructResult::Continue => { + if prev_lsn <= cont_lsn { + // Didn't make any progress in last iteration. Error out to avoid + // getting stuck in the loop. - ensure!(layer.get_start_lsn() <= lsn); - - if layer.is_dropped() && layer.get_end_lsn() <= lsn { - return Ok(None); + // For debugging purposes, print the path of layers that we traversed + // through. + for (r, c, l) in path { + error!( + "PATH: result {:?}, cont_lsn {}, layer: {}", + r, + c, + l.filename().display() + ); + } + bail!("could not find layer with more data for key {} at LSN {}, request LSN {}, ancestor {}", + key, + Lsn(cont_lsn.0 - 1), + request_lsn, + timeline.ancestor_lsn) + } + prev_lsn = cont_lsn; + } + ValueReconstructResult::Missing => { + bail!( + "could not find data for key {} at LSN {}, for request at LSN {}", + key, + cont_lsn, + request_lsn + ) } - - return Ok(Some((layer.clone(), lsn))); } - // If not, check if there's a layer on the ancestor timeline - match &timeline.ancestor_timeline { - Some(ancestor_entry) => { - let ancestor = ancestor_entry - .ensure_loaded() - .context("cannot get a layer for read from ancestor because it is either remote or unloaded")?; - lsn = timeline.ancestor_lsn; - timeline = ancestor; - trace!("recursing into ancestor at {}/{}", timeline.timelineid, lsn); + // Recurse into ancestor if needed + if Lsn(cont_lsn.0 - 1) <= timeline.ancestor_lsn { + trace!( + "going into ancestor {}, cont_lsn is {}", + timeline.ancestor_lsn, + cont_lsn + ); + let ancestor = timeline.get_ancestor_timeline()?; + timeline_owned 
= ancestor; + timeline = &*timeline_owned; + prev_lsn = Lsn(u64::MAX); + continue; + } + + let layers = timeline.layers.lock().unwrap(); + + // Check the open and frozen in-memory layers first + if let Some(open_layer) = &layers.open_layer { + let start_lsn = open_layer.get_lsn_range().start; + if cont_lsn > start_lsn { + //info!("CHECKING for {} at {} on open layer {}", key, cont_lsn, open_layer.filename().display()); + result = open_layer.get_value_reconstruct_data( + key, + open_layer.get_lsn_range().start..cont_lsn, + reconstruct_state, + )?; + cont_lsn = start_lsn; + path.push((result, cont_lsn, open_layer.clone())); continue; } - None => return Ok(None), + } + for frozen_layer in layers.frozen_layers.iter() { + let start_lsn = frozen_layer.get_lsn_range().start; + if cont_lsn > start_lsn { + //info!("CHECKING for {} at {} on frozen layer {}", key, cont_lsn, frozen_layer.filename().display()); + result = frozen_layer.get_value_reconstruct_data( + key, + frozen_layer.get_lsn_range().start..cont_lsn, + reconstruct_state, + )?; + cont_lsn = start_lsn; + path.push((result, cont_lsn, frozen_layer.clone())); + continue 'outer; + } + } + + if let Some(SearchResult { lsn_floor, layer }) = layers.search(key, cont_lsn)? { + //info!("CHECKING for {} at {} on historic layer {}", key, cont_lsn, layer.filename().display()); + + result = layer.get_value_reconstruct_data( + key, + lsn_floor..cont_lsn, + reconstruct_state, + )?; + cont_lsn = lsn_floor; + path.push((result, cont_lsn, layer)); + } else if self.ancestor_timeline.is_some() { + // Nothing on this timeline. Traverse to parent + result = ValueReconstructResult::Continue; + cont_lsn = Lsn(self.ancestor_lsn.0 + 1); + } else { + // Nothing found + result = ValueReconstructResult::Missing; } } } + fn lookup_cached_page(&self, key: &Key, lsn: Lsn) -> Option<(Lsn, Bytes)> { + let cache = page_cache::get(); + + // FIXME: It's pointless to check the cache for things that are not 8kB pages. 
+ // We should look at the key to determine if it's a cacheable object + let (lsn, read_guard) = + cache.lookup_materialized_page(self.tenantid, self.timelineid, key, lsn)?; + let img = Bytes::from(read_guard.to_vec()); + Some((lsn, img)) + } + + fn get_ancestor_timeline(&self) -> Result> { + let ancestor = self + .ancestor_timeline + .as_ref() + .expect("there should be an ancestor") + .ensure_loaded() + .with_context(|| { + format!( + "Cannot get the whole layer for read locked: timeline {} is not present locally", + self.get_ancestor_timeline_id().unwrap()) + })?; + Ok(Arc::clone(ancestor)) + } + /// /// Get a handle to the latest layer for appending. /// - fn get_layer_for_write(&self, seg: SegmentTag, lsn: Lsn) -> anyhow::Result> { + fn get_layer_for_write(&self, lsn: Lsn) -> anyhow::Result> { let mut layers = self.layers.lock().unwrap(); ensure!(lsn.is_aligned()); @@ -1353,235 +1280,191 @@ impl LayeredTimeline { // Do we have a layer open for writing already? let layer; - if let Some(open_layer) = layers.get_open(&seg) { - if open_layer.get_start_lsn() > lsn { + if let Some(open_layer) = &layers.open_layer { + if open_layer.get_lsn_range().start > lsn { bail!("unexpected open layer in the future"); } - // Open layer exists, but it is dropped, so create a new one. - if open_layer.is_dropped() { - ensure!(!open_layer.is_writeable()); - // Layer that is created after dropped one represents a new relish segment. - trace!( - "creating layer for write for new relish segment after dropped layer {} at {}/{}", - seg, - self.timelineid, - lsn - ); - - layer = InMemoryLayer::create( - self.conf, - self.timelineid, - self.tenantid, - seg, - lsn, - last_record_lsn, - )?; - } else { - return Ok(open_layer); - } - } - // No writeable layer for this relation. Create one. - // - // Is this a completely new relation? Or the first modification after branching? - // - else if let Some((prev_layer, _prev_lsn)) = - self.get_layer_for_read_locked(seg, lsn, &layers)? 
- { - // Create new entry after the previous one. - let start_lsn; - if prev_layer.get_timeline_id() != self.timelineid { - // First modification on this timeline - start_lsn = self.ancestor_lsn + 1; - trace!( - "creating layer for write for {} at branch point {}", - seg, - start_lsn - ); - } else { - start_lsn = prev_layer.get_end_lsn(); - trace!( - "creating layer for write for {} after previous layer {}", - seg, - start_lsn - ); - } - trace!( - "prev layer is at {}/{} - {}", - prev_layer.get_timeline_id(), - prev_layer.get_start_lsn(), - prev_layer.get_end_lsn() - ); - layer = InMemoryLayer::create_successor_layer( - self.conf, - prev_layer, - self.timelineid, - self.tenantid, - start_lsn, - last_record_lsn, - )?; + layer = Arc::clone(open_layer); } else { - // New relation. + // No writeable layer yet. Create one. + let start_lsn = layers.next_open_layer_at.unwrap(); + trace!( - "creating layer for write for new rel {} at {}/{}", - seg, + "creating layer for write at {}/{} for record at {}", self.timelineid, + start_lsn, lsn ); + let new_layer = + InMemoryLayer::create(self.conf, self.timelineid, self.tenantid, start_lsn)?; + let layer_rc = Arc::new(new_layer); - layer = InMemoryLayer::create( - self.conf, - self.timelineid, - self.tenantid, - seg, - lsn, - last_record_lsn, - )?; + layers.open_layer = Some(Arc::clone(&layer_rc)); + layers.next_open_layer_at = None; + + layer = layer_rc; } + Ok(layer) + } - let layer_rc: Arc = Arc::new(layer); - layers.insert_open(Arc::clone(&layer_rc)); + fn put_value(&self, key: Key, lsn: Lsn, val: Value) -> Result<()> { + //info!("PUT: key {} at {}", key, lsn); + let layer = self.get_layer_for_write(lsn)?; + layer.put_value(key, lsn, val)?; + Ok(()) + } - Ok(layer_rc) + fn put_tombstone(&self, key_range: Range, lsn: Lsn) -> Result<()> { + let layer = self.get_layer_for_write(lsn)?; + layer.put_tombstone(key_range, lsn)?; + + Ok(()) + } + + fn finish_write(&self, new_lsn: Lsn) { + assert!(new_lsn.is_aligned()); + + 
self.last_record_lsn.advance(new_lsn); + } + + fn freeze_inmem_layer(&self, write_lock_held: bool) { + // Freeze the current open in-memory layer. It will be written to disk on next + // iteration. + let _write_guard = if write_lock_held { + None + } else { + Some(self.write_lock.lock().unwrap()) + }; + let mut layers = self.layers.lock().unwrap(); + if let Some(open_layer) = &layers.open_layer { + let open_layer_rc = Arc::clone(open_layer); + // Does this layer need freezing? + let end_lsn = Lsn(self.get_last_record_lsn().0 + 1); + open_layer.freeze(end_lsn); + + // The layer is no longer open, update the layer map to reflect this. + // We will replace it with on-disk historics below. + layers.frozen_layers.push_back(open_layer_rc); + layers.open_layer = None; + layers.next_open_layer_at = Some(end_lsn); + self.last_freeze_at.store(end_lsn); + } + drop(layers); } /// - /// Flush to disk all data that was written with the put_* functions + /// Check if more than 'checkpoint_distance' of WAL has been accumulated + /// in the in-memory layer, and initiate flushing it if so. /// - /// NOTE: This has nothing to do with checkpoint in PostgreSQL. 
- fn checkpoint_internal(&self, checkpoint_distance: u64, reconstruct_pages: bool) -> Result<()> { - // Prevent concurrent checkpoints - let _checkpoint_cs = self.checkpoint_cs.lock().unwrap(); - let write_guard = self.write_lock.lock().unwrap(); - let mut layers = self.layers.lock().unwrap(); + pub fn check_checkpoint_distance(self: &Arc) -> Result<()> { + let last_lsn = self.get_last_record_lsn(); - // Bump the generation number in the layer map, so that we can distinguish - // entries inserted after the checkpoint started - let current_generation = layers.increment_generation(); + let distance = last_lsn.widening_sub(self.last_freeze_at.load()); + if distance >= self.conf.checkpoint_distance.into() { + self.freeze_inmem_layer(true); + self.last_freeze_at.store(last_lsn); + } + if let Ok(guard) = self.layer_flush_lock.try_lock() { + drop(guard); + let self_clone = Arc::clone(self); + thread_mgr::spawn( + thread_mgr::ThreadKind::LayerFlushThread, + Some(self.tenantid), + Some(self.timelineid), + "layer flush thread", + false, + move || self_clone.flush_frozen_layers(false), + )?; + } + Ok(()) + } - let RecordLsn { - last: last_record_lsn, - prev: prev_record_lsn, - } = self.last_record_lsn.load(); + /// Flush all frozen layers to disk. + /// + /// Only one thread at a time can be doing layer-flushing for a + /// given timeline. If 'wait' is true, and another thread is + /// currently doing the flushing, this function will wait for it + /// to finish. If 'wait' is false, this function will return + /// immediately instead. 
+ fn flush_frozen_layers(&self, wait: bool) -> Result<()> { + let flush_lock_guard = if wait { + self.layer_flush_lock.lock().unwrap() + } else { + match self.layer_flush_lock.try_lock() { + Ok(guard) => guard, + Err(TryLockError::WouldBlock) => return Ok(()), + Err(TryLockError::Poisoned(err)) => panic!("{:?}", err), + } + }; - trace!("checkpoint starting at {}", last_record_lsn); + let timer = self.flush_time_histo.start_timer(); - // Take the in-memory layer with the oldest WAL record. If it's older - // than the threshold, write it out to disk as a new image and delta file. - // Repeat until all remaining in-memory layers are within the threshold. - // - // That's necessary to limit the amount of WAL that needs to be kept - // in the safekeepers, and that needs to be reprocessed on page server - // crash. TODO: It's not a great policy for keeping memory usage in - // check, though. We should also aim at flushing layers that consume - // a lot of memory and/or aren't receiving much updates anymore. - let mut disk_consistent_lsn = last_record_lsn; - - let mut layer_paths = Vec::new(); - let mut freeze_end_lsn = Lsn(0); - let mut evicted_layers = Vec::new(); - - // - // Determine which layers we need to evict and calculate max(latest_lsn) - // among those layers. - // - while let Some((oldest_layer_id, oldest_layer, oldest_generation)) = - layers.peek_oldest_open() - { - let oldest_lsn = oldest_layer.get_oldest_lsn(); - // Does this layer need freezing? - // - // Write out all in-memory layers that contain WAL older than CHECKPOINT_DISTANCE. - // If we reach a layer with the same - // generation number, we know that we have cycled through all layers that were open - // when we started. We don't want to process layers inserted after we started, to - // avoid getting into an infinite loop trying to process again entries that we - // inserted ourselves. 
- // - // Once we have decided to write out at least one layer, we must also write out - // any other layers that contain WAL older than the end LSN of the layers we have - // already decided to write out. In other words, we must write out all layers - // whose [oldest_lsn, latest_lsn) range overlaps with any of the other layers - // that we are writing out. Otherwise, when we advance 'disk_consistent_lsn', it's - // ambiguous whether those layers are already durable on disk or not. For example, - // imagine that there are two layers in memory that contain page versions in the - // following LSN ranges: - // - // A: 100-150 - // B: 110-200 - // - // If we flush layer A, we must also flush layer B, because they overlap. If we - // flushed only A, and advanced 'disk_consistent_lsn' to 150, we would break the - // rule that all WAL older than 'disk_consistent_lsn' are durable on disk, because - // B contains some WAL older than 150. On the other hand, if we flushed out A and - // advanced 'disk_consistent_lsn' only up to 110, after crash and restart we would - // delete the first layer because its end LSN is larger than 110. If we changed - // the deletion logic to not delete it, then we would start streaming at 110, and - // process again the WAL records in the range 110-150 that are already in layer A, - // and the WAL processing code does not cope with that. We solve that dilemma by - // insisting that if we write out the first layer, we also write out the second - // layer, and advance disk_consistent_lsn all the way up to 200. 
- // - let distance = last_record_lsn.widening_sub(oldest_lsn); - if (distance < 0 - || distance < checkpoint_distance.into() - || oldest_generation == current_generation) - && oldest_lsn >= freeze_end_lsn - // this layer intersects with evicted layer and so also need to be evicted - { - debug!( - "the oldest layer is now {} which is {} bytes behind last_record_lsn", - oldest_layer.filename().display(), - distance - ); - disk_consistent_lsn = oldest_lsn; + loop { + let layers = self.layers.lock().unwrap(); + if let Some(frozen_layer) = layers.frozen_layers.front() { + let frozen_layer = Arc::clone(frozen_layer); + drop(layers); // to allow concurrent reads and writes + self.flush_frozen_layer(frozen_layer)?; + } else { + // Drop the 'layer_flush_lock' *before* 'layers'. That + // way, if you freeze a layer, and then call + // flush_frozen_layers(false), it is guaranteed that + // if another thread was busy flushing layers and the + // call therefore returns immediately, the other + // thread will have seen the newly-frozen layer and + // will flush that too (assuming no errors). + drop(flush_lock_guard); + drop(layers); break; } - let latest_lsn = oldest_layer.get_latest_lsn(); - if latest_lsn > freeze_end_lsn { - freeze_end_lsn = latest_lsn; // calculate max of latest_lsn of the layers we're about to evict - } - layers.remove_open(oldest_layer_id); - evicted_layers.push((oldest_layer_id, oldest_layer)); } - // Freeze evicted layers - for (_evicted_layer_id, evicted_layer) in evicted_layers.iter() { - // Mark the layer as no longer accepting writes and record the end_lsn. - // This happens in-place, no new layers are created now. - evicted_layer.freeze(freeze_end_lsn); - layers.insert_historic(evicted_layer.clone()); + timer.stop_and_record(); + + Ok(()) + } + + /// Flush one frozen in-memory layer to disk, as a new delta layer. 
+ fn flush_frozen_layer(&self, frozen_layer: Arc) -> Result<()> { + let new_delta = frozen_layer.write_to_disk()?; + let new_delta_path = new_delta.path(); + + // Sync the new layer to disk. + // + // We must also fsync the timeline dir to ensure the directory entries for + // new layer files are durable + // + // TODO: If we're running inside 'flush_frozen_layers' and there are multiple + // files to flush, it might be better to first write them all, and then fsync + // them all in parallel. + par_fsync::par_fsync(&[ + new_delta_path.clone(), + self.conf.timeline_path(&self.timelineid, &self.tenantid), + ])?; + + // Finally, replace the frozen in-memory layer with the new on-disk layers + { + let mut layers = self.layers.lock().unwrap(); + let l = layers.frozen_layers.pop_front(); + + // Only one thread may call this function at a time (for this + // timeline). If two threads tried to flush the same frozen + // layer to disk at the same time, that would not work. + assert!(Arc::ptr_eq(&l.unwrap(), &frozen_layer)); + + // Add the new delta layer to the LayerMap + layers.insert_historic(Arc::new(new_delta)); + + // release lock on 'layers' } - // Call unload() on all frozen layers, to release memory. - // This shouldn't be much memory, as only metadata is slurped - // into memory. 
- for layer in layers.iter_historic_layers() { - layer.unload()?; - } - - drop(layers); - drop(write_guard); - - // Create delta/image layers for evicted layers - for (_evicted_layer_id, evicted_layer) in evicted_layers.iter() { - let mut this_layer_paths = - self.evict_layer(evicted_layer.clone(), reconstruct_pages)?; - layer_paths.append(&mut this_layer_paths); - } - - // Sync layers - if !layer_paths.is_empty() { - // We must fsync the timeline dir to ensure the directory entries for - // new layer files are durable - layer_paths.push(self.conf.timeline_path(&self.timelineid, &self.tenantid)); - - // Fsync all the layer files and directory using multiple threads to - // minimize latency. - par_fsync::par_fsync(&layer_paths)?; - - layer_paths.pop().unwrap(); - } + // Update the metadata file, with new 'disk_consistent_lsn' + // + // TODO: This perhaps should be done in 'flush_frozen_layers', after flushing + // *all* the layers, to avoid fsyncing the file multiple times. + let disk_consistent_lsn; + disk_consistent_lsn = Lsn(frozen_layer.get_lsn_range().end.0 - 1); // If we were able to advance 'disk_consistent_lsn', save it the metadata file. // After crash, we will restart WAL streaming and processing from that point. @@ -1595,6 +1478,10 @@ impl LayeredTimeline { // don't remember what the correct value that corresponds to some old // LSN is. But if we flush everything, then the value corresponding // current 'last_record_lsn' is correct and we can store it on disk. 
+ let RecordLsn { + last: last_record_lsn, + prev: prev_record_lsn, + } = self.last_record_lsn.load(); let ondisk_prev_record_lsn = if disk_consistent_lsn == last_record_lsn { Some(prev_record_lsn) } else { @@ -1615,6 +1502,11 @@ impl LayeredTimeline { self.initdb_lsn, ); + fail_point!("checkpoint-before-saving-metadata", |x| bail!( + "{}", + x.unwrap() + )); + LayeredRepository::save_metadata( self.conf, self.timelineid, @@ -1622,11 +1514,11 @@ impl LayeredTimeline { &metadata, false, )?; - if self.upload_relishes.load(atomic::Ordering::Relaxed) { + if self.upload_layers.load(atomic::Ordering::Relaxed) { schedule_timeline_checkpoint_upload( self.tenantid, self.timelineid, - layer_paths, + vec![new_delta_path], metadata, ); } @@ -1638,34 +1530,273 @@ impl LayeredTimeline { Ok(()) } - fn evict_layer( - &self, - layer: Arc, - reconstruct_pages: bool, - ) -> Result> { - let new_historics = layer.write_to_disk(self, reconstruct_pages)?; + pub fn compact(&self) -> Result<()> { + // + // High level strategy for compaction / image creation: + // + // 1. First, calculate the desired "partitioning" of the + // currently in-use key space. The goal is to partition the + // key space into roughly fixed-size chunks, but also take into + // account any existing image layers, and try to align the + // chunk boundaries with the existing image layers to avoid + // too much churn. Also try to align chunk boundaries with + // relation boundaries. In principle, we don't know about + // relation boundaries here, we just deal with key-value + // pairs, and the code in pgdatadir_mapping.rs knows how to + // map relations into key-value pairs. But in practice we know + // that 'field6' is the block number, and the fields 1-5 + // identify a relation. This is just an optimization, + // though. + // + // 2. Once we know the partitioning, for each partition, + // decide if it's time to create a new image layer. 
The + // criteria is: there has been too much "churn" since the last + // image layer? The "churn" is fuzzy concept, it's a + // combination of too many delta files, or too much WAL in + // total in the delta file. Or perhaps: if creating an image + // file would allow to delete some older files. + // + // 3. After that, we compact all level0 delta files if there + // are too many of them. While compacting, we also garbage + // collect any page versions that are no longer needed because + // of the new image layers we created in step 2. + // + // TODO: This hight level strategy hasn't been implemented yet. + // Below are functions compact_level0() and create_image_layers() + // but they are a bit ad hoc and don't quite work like it's explained + // above. Rewrite it. + let _compaction_cs = self.compaction_cs.lock().unwrap(); - let mut layer_paths = Vec::new(); - let _write_guard = self.write_lock.lock().unwrap(); - let mut layers = self.layers.lock().unwrap(); + let target_file_size = self.conf.checkpoint_distance; - // Finally, replace the frozen in-memory layer with the new on-disk layers - layers.remove_historic(layer); + // 1. The partitioning was already done by the code in + // pgdatadir_mapping.rs. We just use it here. + let partitioning_guard = self.partitioning.read().unwrap(); + if let Some((partitioning, lsn)) = partitioning_guard.as_ref() { + let timer = self.create_images_time_histo.start_timer(); + // Make a copy of the partitioning, so that we can release + // the lock. Otherwise we could block the WAL receiver. + let lsn = *lsn; + let parts = partitioning.parts.clone(); + drop(partitioning_guard); - // Add the historics to the LayerMap - for delta_layer in new_historics.delta_layers { - layer_paths.push(delta_layer.path()); - layers.insert_historic(Arc::new(delta_layer)); + // 2. Create new image layers for partitions that have been modified + // "enough". + for part in parts.iter() { + if self.time_for_new_image_layer(part, lsn, 3)? 
{ + self.create_image_layer(part, lsn)?; + } + } + timer.stop_and_record(); + + // 3. Compact + let timer = self.compact_time_histo.start_timer(); + self.compact_level0(target_file_size)?; + timer.stop_and_record(); + } else { + info!("Could not compact because no partitioning specified yet"); } - for image_layer in new_historics.image_layers { - layer_paths.push(image_layer.path()); - layers.insert_historic(Arc::new(image_layer)); + + // Call unload() on all frozen layers, to release memory. + // This shouldn't be much memory, as only metadata is slurped + // into memory. + let layers = self.layers.lock().unwrap(); + for layer in layers.iter_historic_layers() { + layer.unload()?; } - Ok(layer_paths) + drop(layers); + + Ok(()) } + // Is it time to create a new image layer for the given partition? + fn time_for_new_image_layer( + &self, + partition: &KeySpace, + lsn: Lsn, + threshold: usize, + ) -> Result { + let layers = self.layers.lock().unwrap(); + + for part_range in &partition.ranges { + let image_coverage = layers.image_coverage(part_range, lsn)?; + for (img_range, last_img) in image_coverage { + let img_lsn = if let Some(ref last_img) = last_img { + last_img.get_lsn_range().end + } else { + Lsn(0) + }; + + let num_deltas = layers.count_deltas(&img_range, &(img_lsn..lsn))?; + + info!( + "range {}-{}, has {} deltas on this timeline", + img_range.start, img_range.end, num_deltas + ); + if num_deltas >= threshold { + return Ok(true); + } + } + } + + Ok(false) + } + + fn create_image_layer(&self, partition: &KeySpace, lsn: Lsn) -> Result<()> { + let img_range = + partition.ranges.first().unwrap().start..partition.ranges.last().unwrap().end; + let mut image_layer_writer = + ImageLayerWriter::new(self.conf, self.timelineid, self.tenantid, &img_range, lsn)?; + + for range in &partition.ranges { + let mut key = range.start; + while key < range.end { + let img = self.get(key, lsn)?; + image_layer_writer.put_image(key, &img)?; + key = key.next(); + } + } + let 
image_layer = image_layer_writer.finish()?; + + // Sync the new layer to disk before adding it to the layer map, to make sure + // we don't garbage collect something based on the new layer, before it has + // reached the disk. + // + // We must also fsync the timeline dir to ensure the directory entries for + // new layer files are durable + // + // Compaction creates multiple image layers. It would be better to create them all + // and fsync them all in parallel. + par_fsync::par_fsync(&[ + image_layer.path(), + self.conf.timeline_path(&self.timelineid, &self.tenantid), + ])?; + + // FIXME: Do we need to do something to upload it to remote storage here? + + let mut layers = self.layers.lock().unwrap(); + layers.insert_historic(Arc::new(image_layer)); + drop(layers); + + Ok(()) + } + + fn compact_level0(&self, target_file_size: u64) -> Result<()> { + let layers = self.layers.lock().unwrap(); + + // We compact or "shuffle" the level-0 delta layers when 10 have + // accumulated. + static COMPACT_THRESHOLD: usize = 10; + + let level0_deltas = layers.get_level0_deltas()?; + + if level0_deltas.len() < COMPACT_THRESHOLD { + return Ok(()); + } + drop(layers); + + // FIXME: this function probably won't work correctly if there's overlap + // in the deltas. + let lsn_range = level0_deltas + .iter() + .map(|l| l.get_lsn_range()) + .reduce(|a, b| min(a.start, b.start)..max(a.end, b.end)) + .unwrap(); + + let all_values_iter = level0_deltas.iter().map(|l| l.iter()).kmerge_by(|a, b| { + if let Ok((a_key, a_lsn, _)) = a { + if let Ok((b_key, b_lsn, _)) = b { + match a_key.cmp(b_key) { + Ordering::Less => true, + Ordering::Equal => a_lsn <= b_lsn, + Ordering::Greater => false, + } + } else { + false + } + } else { + true + } + }); + + // Merge the contents of all the input delta layers into a new set + // of delta layers, based on the current partitioning. + // + // TODO: this actually divides the layers into fixed-size chunks, not + // based on the partitioning. 
+ // + // TODO: we should also opportunistically materialize and + // garbage collect what we can. + let mut new_layers = Vec::new(); + let mut prev_key: Option = None; + let mut writer: Option = None; + for x in all_values_iter { + let (key, lsn, value) = x?; + + if let Some(prev_key) = prev_key { + if key != prev_key && writer.is_some() { + let size = writer.as_mut().unwrap().size(); + if size > target_file_size { + new_layers.push(writer.take().unwrap().finish(prev_key.next())?); + writer = None; + } + } + } + + if writer.is_none() { + writer = Some(DeltaLayerWriter::new( + self.conf, + self.timelineid, + self.tenantid, + key, + lsn_range.clone(), + )?); + } + + writer.as_mut().unwrap().put_value(key, lsn, value)?; + prev_key = Some(key); + } + if let Some(writer) = writer { + new_layers.push(writer.finish(prev_key.unwrap().next())?); + } + + // Sync layers + if !new_layers.is_empty() { + let mut layer_paths: Vec = new_layers.iter().map(|l| l.path()).collect(); + + // also sync the directory + layer_paths.push(self.conf.timeline_path(&self.timelineid, &self.tenantid)); + + // Fsync all the layer files and directory using multiple threads to + // minimize latency. + par_fsync::par_fsync(&layer_paths)?; + + layer_paths.pop().unwrap(); + } + + let mut layers = self.layers.lock().unwrap(); + for l in new_layers { + layers.insert_historic(Arc::new(l)); + } + + // Now that we have reshuffled the data to set of new delta layers, we can + // delete the old ones + for l in level0_deltas { + l.delete()?; + layers.remove_historic(l.clone()); + } + drop(layers); + + Ok(()) + } + + /// Update information about which layer files need to be retained on + /// garbage collection. This is separate from actually performing the GC, + /// and is updated more frequently, so that compaction can remove obsolete + /// page versions more aggressively. /// - /// Garbage collect layer files on a timeline that are no longer needed. 
+ /// TODO: that's wishful thinking, compaction doesn't actually do that + /// currently. /// /// The caller specifies how much history is needed with the two arguments: /// @@ -1682,15 +1813,29 @@ impl LayeredTimeline { /// the latest LSN subtracted by a constant, and doesn't do anything smart /// to figure out what read-only nodes might actually need.) /// + fn update_gc_info(&self, retain_lsns: Vec, cutoff: Lsn) { + let mut gc_info = self.gc_info.write().unwrap(); + gc_info.retain_lsns = retain_lsns; + gc_info.cutoff = cutoff; + } + + /// + /// Garbage collect layer files on a timeline that are no longer needed. + /// /// Currently, we don't make any attempt at removing unneeded page versions /// within a layer file. We can only remove the whole file if it's fully /// obsolete. /// - pub fn gc_timeline(&self, retain_lsns: Vec, cutoff: Lsn) -> Result { + fn gc(&self) -> Result { let now = Instant::now(); let mut result: GcResult = Default::default(); let disk_consistent_lsn = self.get_disk_consistent_lsn(); - let _checkpoint_cs = self.checkpoint_cs.lock().unwrap(); + + let _compaction_cs = self.compaction_cs.lock().unwrap(); + + let gc_info = self.gc_info.read().unwrap(); + let retain_lsns = &gc_info.retain_lsns; + let cutoff = gc_info.cutoff; let _enter = info_span!("garbage collection", timeline = %self.timelineid, tenant = %self.tenantid, cutoff = %cutoff).entered(); @@ -1709,8 +1854,7 @@ impl LayeredTimeline { // Garbage collect the layer if all conditions are satisfied: // 1. it is older than cutoff LSN; // 2. it doesn't need to be retained for 'retain_lsns'; - // 3. newer on-disk layer exists (only for non-dropped segments); - // 4. this layer doesn't serve as a tombstone for some older layer; + // 3. 
newer on-disk image layers cover the layer's whole key range // let mut layers = self.layers.lock().unwrap(); 'outer: for l in layers.iter_historic_layers() { @@ -1724,28 +1868,16 @@ impl LayeredTimeline { continue; } - let seg = l.get_seg_tag(); - - if seg.rel.is_relation() { - result.ondisk_relfiles_total += 1; - } else { - result.ondisk_nonrelfiles_total += 1; - } + result.layers_total += 1; // 1. Is it newer than cutoff point? - if l.get_end_lsn() > cutoff { + if l.get_lsn_range().end > cutoff { debug!( - "keeping {} {}-{} because it's newer than cutoff {}", - seg, - l.get_start_lsn(), - l.get_end_lsn(), + "keeping {} because it's newer than cutoff {}", + l.filename().display(), cutoff ); - if seg.rel.is_relation() { - result.ondisk_relfiles_needed_by_cutoff += 1; - } else { - result.ondisk_nonrelfiles_needed_by_cutoff += 1; - } + result.layers_needed_by_cutoff += 1; continue 'outer; } @@ -1754,135 +1886,49 @@ impl LayeredTimeline { // might be referenced by child branches forever. // We can track this in child timeline GC and delete parent layers when // they are no longer needed. This might be complicated with long inheritance chains. - for retain_lsn in &retain_lsns { + for retain_lsn in retain_lsns { // start_lsn is inclusive - if &l.get_start_lsn() <= retain_lsn { + if &l.get_lsn_range().start <= retain_lsn { debug!( - "keeping {} {}-{} because it's still might be referenced by child branch forked at {} is_dropped: {} is_incremental: {}", - seg, - l.get_start_lsn(), - l.get_end_lsn(), + "keeping {} because it's still might be referenced by child branch forked at {} is_dropped: xx is_incremental: {}", + l.filename().display(), retain_lsn, - l.is_dropped(), l.is_incremental(), ); - if seg.rel.is_relation() { - result.ondisk_relfiles_needed_by_branches += 1; - } else { - result.ondisk_nonrelfiles_needed_by_branches += 1; - } + result.layers_needed_by_branches += 1; continue 'outer; } } // 3. Is there a later on-disk layer for this relation? 
- if !l.is_dropped() - && !layers.newer_image_layer_exists( - l.get_seg_tag(), - l.get_end_lsn(), - disk_consistent_lsn, - ) - { + // + // The end-LSN is exclusive, while disk_consistent_lsn is + // inclusive. For example, if disk_consistent_lsn is 100, it is + // OK for a delta layer to have end LSN 101, but if the end LSN + // is 102, then it might not have been fully flushed to disk + // before crash. + // + // FIXME: This logic is wrong. See https://github.com/zenithdb/zenith/issues/707 + if !layers.newer_image_layer_exists( + &l.get_key_range(), + l.get_lsn_range().end, + disk_consistent_lsn + 1, + )? { debug!( - "keeping {} {}-{} because it is the latest layer", - seg, - l.get_start_lsn(), - l.get_end_lsn() + "keeping {} because it is the latest layer", + l.filename().display() ); - if seg.rel.is_relation() { - result.ondisk_relfiles_not_updated += 1; - } else { - result.ondisk_nonrelfiles_not_updated += 1; - } + result.layers_not_updated += 1; continue 'outer; } - // 4. Does this layer serve as a tombstone for some older layer? - if l.is_dropped() { - let prior_lsn = l.get_start_lsn().checked_sub(1u64).unwrap(); - - // Check if this layer serves as a tombstone for this timeline - // We have to do this separately from timeline check below, - // because LayerMap of this timeline is already locked. 
- let mut is_tombstone = layers.layer_exists_at_lsn(l.get_seg_tag(), prior_lsn)?; - if is_tombstone { - debug!( - "earlier layer exists at {} in {}", - prior_lsn, self.timelineid - ); - } - // Now check ancestor timelines, if any are present locally - else if let Some(ancestor) = self - .ancestor_timeline - .as_ref() - .and_then(|timeline_entry| timeline_entry.ensure_loaded().ok()) - { - let prior_lsn = ancestor.get_last_record_lsn(); - if seg.rel.is_blocky() { - debug!( - "check blocky relish size {} at {} in {} for layer {}-{}", - seg, - prior_lsn, - ancestor.timelineid, - l.get_start_lsn(), - l.get_end_lsn() - ); - match ancestor.get_relish_size(seg.rel, prior_lsn).unwrap() { - Some(size) => { - let (last_live_seg, _rel_blknum) = - SegmentTag::from_blknum(seg.rel, size - 1); - debug!( - "blocky rel size is {} last_live_seg.segno {} seg.segno {}", - size, last_live_seg.segno, seg.segno - ); - if last_live_seg.segno >= seg.segno { - is_tombstone = true; - } - } - _ => { - debug!("blocky rel doesn't exist"); - } - } - } else { - debug!( - "check non-blocky relish existence {} at {} in {} for layer {}-{}", - seg, - prior_lsn, - ancestor.timelineid, - l.get_start_lsn(), - l.get_end_lsn() - ); - is_tombstone = ancestor.get_rel_exists(seg.rel, prior_lsn).unwrap_or(false); - } - } - - if is_tombstone { - debug!( - "keeping {} {}-{} because this layer serves as a tombstone for older layer", - seg, - l.get_start_lsn(), - l.get_end_lsn() - ); - - if seg.rel.is_relation() { - result.ondisk_relfiles_needed_as_tombstone += 1; - } else { - result.ondisk_nonrelfiles_needed_as_tombstone += 1; - } - continue 'outer; - } - } - // We didn't find any reason to keep this file, so remove it. 
debug!( - "garbage collecting {} {}-{} is_dropped: {} is_incremental: {}", - l.get_seg_tag(), - l.get_start_lsn(), - l.get_end_lsn(), - l.is_dropped(), + "garbage collecting {} is_dropped: xx is_incremental: {}", + l.filename().display(), l.is_incremental(), ); - layers_to_remove.push(Arc::clone(&l)); + layers_to_remove.push(Arc::clone(l)); } // Actually delete the layers from disk and remove them from the map. @@ -1892,222 +1938,75 @@ impl LayeredTimeline { doomed_layer.delete()?; layers.remove_historic(doomed_layer.clone()); - match ( - doomed_layer.is_dropped(), - doomed_layer.get_seg_tag().rel.is_relation(), - ) { - (true, true) => result.ondisk_relfiles_dropped += 1, - (true, false) => result.ondisk_nonrelfiles_dropped += 1, - (false, true) => result.ondisk_relfiles_removed += 1, - (false, false) => result.ondisk_nonrelfiles_removed += 1, - } + result.layers_removed += 1; } result.elapsed = now.elapsed(); Ok(result) } - fn lookup_cached_page( + /// + /// Reconstruct a value, using the given base image and WAL records in 'data'. + /// + fn reconstruct_value( &self, - rel: &RelishTag, - rel_blknum: BlockNumber, - lsn: Lsn, - ) -> Option<(Lsn, Bytes)> { - let cache = page_cache::get(); - if let RelishTag::Relation(rel_tag) = &rel { - let (lsn, read_guard) = cache.lookup_materialized_page( - self.tenantid, - self.timelineid, - *rel_tag, - rel_blknum, - lsn, - )?; - let img = Bytes::from(read_guard.to_vec()); - Some((lsn, img)) - } else { - None - } - } - - /// - /// Reconstruct a page version from given Layer - /// - fn materialize_page( - &self, - seg: SegmentTag, - seg_blknum: SegmentBlk, - lsn: Lsn, - layer: &dyn Layer, - ) -> anyhow::Result { - // Check the page cache. We will get back the most recent page with lsn <= `lsn`. - // The cached image can be returned directly if there is no WAL between the cached image - // and requested LSN. The cached image can also be used to reduce the amount of WAL needed - // for redo. 
- let rel = seg.rel; - let rel_blknum = seg.segno * RELISH_SEG_SIZE + seg_blknum; - let cached_page_img = match self.lookup_cached_page(&rel, rel_blknum, lsn) { - Some((cached_lsn, cached_img)) => { - match cached_lsn.cmp(&lsn) { - cmp::Ordering::Less => {} // there might be WAL between cached_lsn and lsn, we need to check - cmp::Ordering::Equal => return Ok(cached_img), // exact LSN match, return the image - cmp::Ordering::Greater => { - bail!("the returned lsn should never be after the requested lsn") - } - } - Some((cached_lsn, cached_img)) - } - None => None, - }; - - let mut data = PageReconstructData { - records: Vec::new(), - page_img: cached_page_img, - }; - - // Holds an Arc reference to 'layer_ref' when iterating in the loop below. - let mut layer_arc: Arc; - - // Call the layer's get_page_reconstruct_data function to get the base image - // and WAL records needed to materialize the page. If it returns 'Continue', - // call it again on the predecessor layer until we have all the required data. - let mut layer_ref = layer; - let mut curr_lsn = lsn; - loop { - let result = self.reconstruct_time_histo.observe_closure_duration(|| { - layer_ref - .get_page_reconstruct_data(seg_blknum, curr_lsn, &mut data) - .with_context(|| { - format!( - "Failed to get reconstruct data {} {:?} {} {}", - layer_ref.get_seg_tag(), - layer_ref.filename(), - seg_blknum, - curr_lsn, - ) - }) - })?; - match result { - PageReconstructResult::Complete => break, - PageReconstructResult::Continue(cont_lsn) => { - // Fetch base image / more WAL from the returned predecessor layer - if let Some((cont_layer, cont_lsn)) = self.get_layer_for_read(seg, cont_lsn)? { - if cont_lsn == curr_lsn { - // We landed on the same layer again. Shouldn't happen, but if it does, - // don't get stuck in an infinite loop. 
- bail!( - "could not find predecessor of layer {} at {}, layer returned its own LSN", - layer_ref.filename().display(), - cont_lsn - ); - } - layer_arc = cont_layer; - layer_ref = &*layer_arc; - curr_lsn = cont_lsn; - continue; - } else { - bail!( - "could not find predecessor of layer {} at {}", - layer_ref.filename().display(), - cont_lsn - ); - } - } - PageReconstructResult::Missing(lsn) => { - // Oops, we could not reconstruct the page. - if data.records.is_empty() { - // no records, and no base image. This can happen if PostgreSQL extends a relation - // but never writes the page. - // - // Would be nice to detect that situation better. - warn!("Page {} blk {} at {} not found", rel, rel_blknum, lsn); - return Ok(ZERO_PAGE.clone()); - } - bail!( - "No base image found for page {} blk {} at {}/{}", - rel, - rel_blknum, - self.timelineid, - lsn, - ); - } - } - } - - self.reconstruct_page(rel, rel_blknum, lsn, data) - } - - /// - /// Reconstruct a page version, using the given base image and WAL records in 'data'. - /// - fn reconstruct_page( - &self, - rel: RelishTag, - rel_blknum: BlockNumber, + key: Key, request_lsn: Lsn, - mut data: PageReconstructData, + mut data: ValueReconstructState, ) -> Result { // Perform WAL redo if needed data.records.reverse(); // If we have a page image, and no WAL, we're all set if data.records.is_empty() { - if let Some((img_lsn, img)) = &data.page_img { + if let Some((img_lsn, img)) = &data.img { trace!( - "found page image for blk {} in {} at {}, no WAL redo required", - rel_blknum, - rel, + "found page image for key {} at {}, no WAL redo required", + key, img_lsn ); Ok(img.clone()) } else { - // FIXME: this ought to be an error? - warn!( - "Page {} blk {} at {} not found", - rel, rel_blknum, request_lsn - ); - Ok(ZERO_PAGE.clone()) + bail!("base image for {} at {} not found", key, request_lsn); } } else { // We need to do WAL redo. 
// // If we don't have a base image, then the oldest WAL record better initialize // the page - if data.page_img.is_none() && !data.records.first().unwrap().1.will_init() { - // FIXME: this ought to be an error? - warn!( - "Base image for page {}/{} at {} not found, but got {} WAL records", - rel, - rel_blknum, + if data.img.is_none() && !data.records.first().unwrap().1.will_init() { + bail!( + "Base image for {} at {} not found, but got {} WAL records", + key, request_lsn, data.records.len() ); - Ok(ZERO_PAGE.clone()) } else { - let base_img = if let Some((_lsn, img)) = data.page_img { - trace!("found {} WAL records and a base image for blk {} in {} at {}, performing WAL redo", data.records.len(), rel_blknum, rel, request_lsn); + let base_img = if let Some((_lsn, img)) = data.img { + trace!( + "found {} WAL records and a base image for {} at {}, performing WAL redo", + data.records.len(), + key, + request_lsn + ); Some(img) } else { - trace!("found {} WAL records that will init the page for blk {} in {} at {}, performing WAL redo", data.records.len(), rel_blknum, rel, request_lsn); + trace!("found {} WAL records that will init the page for {} at {}, performing WAL redo", data.records.len(), key, request_lsn); None }; let last_rec_lsn = data.records.last().unwrap().0; - let img = self.walredo_mgr.request_redo( - rel, - rel_blknum, - request_lsn, - base_img, - data.records, - )?; + let img = + self.walredo_mgr + .request_redo(key, request_lsn, base_img, data.records)?; - if let RelishTag::Relation(rel_tag) = &rel { + if img.len() == page_cache::PAGE_SZ { let cache = page_cache::get(); cache.memorize_materialized_page( self.tenantid, self.timelineid, - *rel_tag, - rel_blknum, + key, last_rec_lsn, &img, ); @@ -2117,40 +2016,6 @@ impl LayeredTimeline { } } } - - /// - /// This is a helper function to increase current_total_relation_size - /// - fn increase_current_logical_size(&self, diff: u32) { - let val = self - .current_logical_size - .fetch_add(diff as usize, 
atomic::Ordering::SeqCst); - trace!( - "increase_current_logical_size: {} + {} = {}", - val, - diff, - val + diff as usize, - ); - self.current_logical_size_gauge - .set(val as i64 + diff as i64); - } - - /// - /// This is a helper function to decrease current_total_relation_size - /// - fn decrease_current_logical_size(&self, diff: u32) { - let val = self - .current_logical_size - .fetch_sub(diff as usize, atomic::Ordering::SeqCst); - trace!( - "decrease_current_logical_size: {} - {} = {}", - val, - diff, - val - diff as usize, - ); - self.current_logical_size_gauge - .set(val as i64 - diff as i64); - } } struct LayeredTimelineWriter<'a> { @@ -2166,159 +2031,20 @@ impl Deref for LayeredTimelineWriter<'_> { } } -impl<'a> TimelineWriter for LayeredTimelineWriter<'a> { - fn put_wal_record( - &self, - lsn: Lsn, - rel: RelishTag, - rel_blknum: u32, - rec: ZenithWalRecord, - ) -> Result<()> { - if !rel.is_blocky() && rel_blknum != 0 { - bail!( - "invalid request for block {} for non-blocky relish {}", - rel_blknum, - rel - ); - } - ensure!(lsn.is_aligned(), "unaligned record LSN"); - - let (seg, seg_blknum) = SegmentTag::from_blknum(rel, rel_blknum); - let layer = self.tl.get_layer_for_write(seg, lsn)?; - let delta_size = layer.put_wal_record(lsn, seg_blknum, rec)?; - self.tl - .increase_current_logical_size(delta_size * BLCKSZ as u32); - Ok(()) +impl<'a> TimelineWriter<'_> for LayeredTimelineWriter<'a> { + fn put(&self, key: Key, lsn: Lsn, value: Value) -> Result<()> { + self.tl.put_value(key, lsn, value) } - fn put_page_image( - &self, - rel: RelishTag, - rel_blknum: BlockNumber, - lsn: Lsn, - img: Bytes, - ) -> Result<()> { - if !rel.is_blocky() && rel_blknum != 0 { - bail!( - "invalid request for block {} for non-blocky relish {}", - rel_blknum, - rel - ); - } - ensure!(lsn.is_aligned(), "unaligned record LSN"); - - let (seg, seg_blknum) = SegmentTag::from_blknum(rel, rel_blknum); - - let layer = self.tl.get_layer_for_write(seg, lsn)?; - let delta_size = 
layer.put_page_image(seg_blknum, lsn, img)?; - - self.tl - .increase_current_logical_size(delta_size * BLCKSZ as u32); - Ok(()) - } - - fn put_truncation(&self, rel: RelishTag, lsn: Lsn, relsize: BlockNumber) -> Result<()> { - if !rel.is_blocky() { - bail!("invalid truncation for non-blocky relish {}", rel); - } - ensure!(lsn.is_aligned(), "unaligned record LSN"); - - debug!("put_truncation: {} to {} blocks at {}", rel, relsize, lsn); - - let oldsize = self - .tl - .get_relish_size(rel, self.tl.get_last_record_lsn())? - .with_context(|| { - format!( - "attempted to truncate non-existent relish {} at {}", - rel, lsn - ) - })?; - - if oldsize <= relsize { - return Ok(()); - } - let old_last_seg = (oldsize - 1) / RELISH_SEG_SIZE; - - let last_remain_seg = if relsize == 0 { - 0 - } else { - (relsize - 1) / RELISH_SEG_SIZE - }; - - // Drop segments beyond the last remaining segment. - for remove_segno in (last_remain_seg + 1)..=old_last_seg { - let seg = SegmentTag { - rel, - segno: remove_segno, - }; - - let layer = self.tl.get_layer_for_write(seg, lsn)?; - layer.drop_segment(lsn); - } - - // Truncate the last remaining segment to the specified size - if relsize == 0 || relsize % RELISH_SEG_SIZE != 0 { - let seg = SegmentTag { - rel, - segno: last_remain_seg, - }; - let layer = self.tl.get_layer_for_write(seg, lsn)?; - layer.put_truncation(lsn, relsize % RELISH_SEG_SIZE) - } - self.tl - .decrease_current_logical_size((oldsize - relsize) * BLCKSZ as u32); - Ok(()) - } - - fn drop_relish(&self, rel: RelishTag, lsn: Lsn) -> Result<()> { - trace!("drop_segment: {} at {}", rel, lsn); - - if rel.is_blocky() { - if let Some(oldsize) = self - .tl - .get_relish_size(rel, self.tl.get_last_record_lsn())? 
- { - let old_last_seg = if oldsize == 0 { - 0 - } else { - (oldsize - 1) / RELISH_SEG_SIZE - }; - - // Drop all segments of the relish - for remove_segno in 0..=old_last_seg { - let seg = SegmentTag { - rel, - segno: remove_segno, - }; - let layer = self.tl.get_layer_for_write(seg, lsn)?; - layer.drop_segment(lsn); - } - self.tl - .decrease_current_logical_size(oldsize * BLCKSZ as u32); - } else { - warn!( - "drop_segment called on non-existent relish {} at {}", - rel, lsn - ); - } - } else { - // TODO handle TwoPhase relishes - let (seg, _seg_blknum) = SegmentTag::from_blknum(rel, 0); - let layer = self.tl.get_layer_for_write(seg, lsn)?; - layer.drop_segment(lsn); - } - - Ok(()) + fn delete(&self, key_range: Range, lsn: Lsn) -> Result<()> { + self.tl.put_tombstone(key_range, lsn) } /// /// Remember the (end of) last valid WAL record remembered in the timeline. /// - fn advance_last_record_lsn(&self, new_lsn: Lsn) { - assert!(new_lsn.is_aligned()); - - self.tl.last_record_lsn.advance(new_lsn); + fn finish_write(&self, new_lsn: Lsn) { + self.tl.finish_write(new_lsn); } } @@ -2328,10 +2054,10 @@ pub fn dump_layerfile_from_path(path: &Path) -> Result<()> { let book = Book::new(file)?; match book.magic() { - delta_layer::DELTA_FILE_MAGIC => { + crate::DELTA_FILE_MAGIC => { DeltaLayer::new_for_path(path, &book)?.dump()?; } - image_layer::IMAGE_FILE_MAGIC => { + crate::IMAGE_FILE_MAGIC => { ImageLayer::new_for_path(path, &book)?.dump()?; } magic => bail!("unrecognized magic identifier: {:?}", magic), @@ -2368,9 +2094,11 @@ fn rename_to_backup(path: PathBuf) -> anyhow::Result<()> { /// file format and directory layout. The test here are more low level. 
/// #[cfg(test)] -mod tests { +pub mod tests { use super::*; + use crate::keyspace::KeySpaceAccum; use crate::repository::repo_harness::*; + use rand::{thread_rng, Rng}; #[test] fn corrupt_metadata() -> Result<()> { @@ -2387,7 +2115,7 @@ mod tests { let mut metadata_bytes = std::fs::read(&metadata_path)?; assert_eq!(metadata_bytes.len(), 512); - metadata_bytes[512 - 4 - 2] ^= 1; + metadata_bytes[8] ^= 1; std::fs::write(metadata_path, metadata_bytes)?; let err = harness.try_load().err().expect("should fail"); @@ -2400,113 +2128,259 @@ mod tests { Ok(()) } - /// - /// Test the logic in 'load_layer_map' that removes layer files that are - /// newer than 'disk_consistent_lsn'. - /// + // Target file size in the unit tests. In production, the target + // file size is much larger, maybe 1 GB. But a small size makes it + // much faster to exercise all the logic for creating the files, + // garbage collection, compaction etc. + pub const TEST_FILE_SIZE: u64 = 4 * 1024 * 1024; + #[test] - fn future_layerfiles() -> Result<()> { - const TEST_NAME: &str = "future_layerfiles"; - let harness = RepoHarness::create(TEST_NAME)?; - let repo = harness.load(); + fn test_images() -> Result<()> { + let repo = RepoHarness::create("test_images")?.load(); + let tline = repo.create_empty_timeline(TIMELINE_ID, Lsn(0))?; + + #[allow(non_snake_case)] + let TEST_KEY: Key = Key::from_hex("112222222233333333444444445500000001").unwrap(); - // Create a timeline with disk_consistent_lsn = 8000 - let tline = repo.create_empty_timeline(TIMELINE_ID, Lsn(0x8000))?; let writer = tline.writer(); - writer.advance_last_record_lsn(Lsn(0x8000)); + writer.put(TEST_KEY, Lsn(0x10), Value::Image(TEST_IMG("foo at 0x10")))?; + writer.finish_write(Lsn(0x10)); drop(writer); - repo.checkpoint_iteration(CheckpointConfig::Forced)?; - drop(repo); - let timeline_path = harness.timeline_path(&TIMELINE_ID); + tline.checkpoint(CheckpointConfig::Forced)?; + tline.compact()?; - let make_empty_file = |filename: &str| -> 
std::io::Result<()> { - let path = timeline_path.join(filename); + let writer = tline.writer(); + writer.put(TEST_KEY, Lsn(0x20), Value::Image(TEST_IMG("foo at 0x20")))?; + writer.finish_write(Lsn(0x20)); + drop(writer); - assert!(!path.exists()); - std::fs::write(&path, &[])?; + tline.checkpoint(CheckpointConfig::Forced)?; + tline.compact()?; - Ok(()) - }; + let writer = tline.writer(); + writer.put(TEST_KEY, Lsn(0x30), Value::Image(TEST_IMG("foo at 0x30")))?; + writer.finish_write(Lsn(0x30)); + drop(writer); - // Helper function to check that a relation file exists, and a corresponding - // .0.old file does not. - let assert_exists = |filename: &str| { - let path = timeline_path.join(filename); - assert!(path.exists(), "file {} was removed", filename); + tline.checkpoint(CheckpointConfig::Forced)?; + tline.compact()?; - // Check that there is no .old file - let backup_path = timeline_path.join(format!("{}.0.old", filename)); - assert!( - !backup_path.exists(), - "unexpected backup file {}", - backup_path.display() - ); - }; + let writer = tline.writer(); + writer.put(TEST_KEY, Lsn(0x40), Value::Image(TEST_IMG("foo at 0x40")))?; + writer.finish_write(Lsn(0x40)); + drop(writer); - // Helper function to check that a relation file does *not* exists, and a corresponding - // ..old file does. 
- let assert_is_renamed = |filename: &str, num: u32| { - let path = timeline_path.join(filename); - assert!( - !path.exists(), - "file {} was not removed as expected", - filename - ); + tline.checkpoint(CheckpointConfig::Forced)?; + tline.compact()?; - let backup_path = timeline_path.join(format!("{}.{}.old", filename, num)); - assert!( - backup_path.exists(), - "backup file {} was not created", - backup_path.display() - ); - }; + assert_eq!(tline.get(TEST_KEY, Lsn(0x10))?, TEST_IMG("foo at 0x10")); + assert_eq!(tline.get(TEST_KEY, Lsn(0x1f))?, TEST_IMG("foo at 0x10")); + assert_eq!(tline.get(TEST_KEY, Lsn(0x20))?, TEST_IMG("foo at 0x20")); + assert_eq!(tline.get(TEST_KEY, Lsn(0x30))?, TEST_IMG("foo at 0x30")); + assert_eq!(tline.get(TEST_KEY, Lsn(0x40))?, TEST_IMG("foo at 0x40")); - // These files are considered to be in the future and will be renamed out - // of the way - let future_filenames = vec![ - format!("pg_control_0_{:016X}", 0x8001), - format!("pg_control_0_{:016X}_{:016X}", 0x8001, 0x8008), - ]; - // But these are not: - let past_filenames = vec![ - format!("pg_control_0_{:016X}", 0x8000), - format!("pg_control_0_{:016X}_{:016X}", 0x7000, 0x8001), - ]; + Ok(()) + } - for filename in future_filenames.iter().chain(past_filenames.iter()) { - make_empty_file(filename)?; + // + // Insert 1000 key-value pairs with increasing keys, checkpoint, + // repeat 50 times. 
+ // + #[test] + fn test_bulk_insert() -> Result<()> { + let repo = RepoHarness::create("test_bulk_insert")?.load(); + let tline = repo.create_empty_timeline(TIMELINE_ID, Lsn(0))?; + + let mut lsn = Lsn(0x10); + + let mut keyspace = KeySpaceAccum::new(); + + let mut test_key = Key::from_hex("012222222233333333444444445500000000").unwrap(); + let mut blknum = 0; + for _ in 0..50 { + for _ in 0..1000 { + test_key.field6 = blknum; + let writer = tline.writer(); + writer.put( + test_key, + lsn, + Value::Image(TEST_IMG(&format!("{} at {}", blknum, lsn))), + )?; + writer.finish_write(lsn); + drop(writer); + + keyspace.add_key(test_key); + + lsn = Lsn(lsn.0 + 0x10); + blknum += 1; + } + + let cutoff = tline.get_last_record_lsn(); + let parts = keyspace + .clone() + .to_keyspace() + .partition(TEST_FILE_SIZE as u64); + tline.hint_partitioning(parts.clone(), lsn)?; + + tline.update_gc_info(Vec::new(), cutoff); + tline.checkpoint(CheckpointConfig::Forced)?; + tline.compact()?; + tline.gc()?; } - // Load the timeline. This will cause the files in the "future" to be renamed - // away. - let new_repo = harness.load(); - new_repo.get_timeline_load(TIMELINE_ID).unwrap(); - drop(new_repo); + Ok(()) + } - for filename in future_filenames.iter() { - assert_is_renamed(filename, 0); - } - for filename in past_filenames.iter() { - assert_exists(filename); + #[test] + fn test_random_updates() -> Result<()> { + let repo = RepoHarness::create("test_random_updates")?.load(); + let tline = repo.create_empty_timeline(TIMELINE_ID, Lsn(0))?; + + const NUM_KEYS: usize = 1000; + + let mut test_key = Key::from_hex("012222222233333333444444445500000000").unwrap(); + + let mut keyspace = KeySpaceAccum::new(); + + // Track when each page was last modified. Used to assert that + // a read sees the latest page version. 
+ let mut updated = [Lsn(0); NUM_KEYS]; + + let mut lsn = Lsn(0); + #[allow(clippy::needless_range_loop)] + for blknum in 0..NUM_KEYS { + lsn = Lsn(lsn.0 + 0x10); + test_key.field6 = blknum as u32; + let writer = tline.writer(); + writer.put( + test_key, + lsn, + Value::Image(TEST_IMG(&format!("{} at {}", blknum, lsn))), + )?; + writer.finish_write(lsn); + updated[blknum] = lsn; + drop(writer); + + keyspace.add_key(test_key); } - // Create the future files again, and load again. They should be renamed to - // *.1.old this time. - for filename in future_filenames.iter() { - make_empty_file(filename)?; + let parts = keyspace.to_keyspace().partition(TEST_FILE_SIZE as u64); + tline.hint_partitioning(parts, lsn)?; + + for _ in 0..50 { + for _ in 0..NUM_KEYS { + lsn = Lsn(lsn.0 + 0x10); + let blknum = thread_rng().gen_range(0..NUM_KEYS); + test_key.field6 = blknum as u32; + let writer = tline.writer(); + writer.put( + test_key, + lsn, + Value::Image(TEST_IMG(&format!("{} at {}", blknum, lsn))), + )?; + println!("updating {} at {}", blknum, lsn); + writer.finish_write(lsn); + drop(writer); + updated[blknum] = lsn; + } + + // Read all the blocks + for (blknum, last_lsn) in updated.iter().enumerate() { + test_key.field6 = blknum as u32; + assert_eq!( + tline.get(test_key, lsn)?, + TEST_IMG(&format!("{} at {}", blknum, last_lsn)) + ); + } + + // Perform a cycle of checkpoint, compaction, and GC + println!("checkpointing {}", lsn); + let cutoff = tline.get_last_record_lsn(); + tline.update_gc_info(Vec::new(), cutoff); + tline.checkpoint(CheckpointConfig::Forced)?; + tline.compact()?; + tline.gc()?; } - let new_repo = harness.load(); - new_repo.get_timeline_load(TIMELINE_ID).unwrap(); - drop(new_repo); + Ok(()) + } - for filename in future_filenames.iter() { - assert_is_renamed(filename, 0); - assert_is_renamed(filename, 1); + #[test] + fn test_traverse_branches() -> Result<()> { + let repo = RepoHarness::create("test_traverse_branches")?.load(); + let mut tline = 
repo.create_empty_timeline(TIMELINE_ID, Lsn(0))?; + + const NUM_KEYS: usize = 1000; + + let mut test_key = Key::from_hex("012222222233333333444444445500000000").unwrap(); + + let mut keyspace = KeySpaceAccum::new(); + + // Track when each page was last modified. Used to assert that + // a read sees the latest page version. + let mut updated = [Lsn(0); NUM_KEYS]; + + let mut lsn = Lsn(0); + #[allow(clippy::needless_range_loop)] + for blknum in 0..NUM_KEYS { + lsn = Lsn(lsn.0 + 0x10); + test_key.field6 = blknum as u32; + let writer = tline.writer(); + writer.put( + test_key, + lsn, + Value::Image(TEST_IMG(&format!("{} at {}", blknum, lsn))), + )?; + writer.finish_write(lsn); + updated[blknum] = lsn; + drop(writer); + + keyspace.add_key(test_key); } - for filename in past_filenames.iter() { - assert_exists(filename); + + let parts = keyspace.to_keyspace().partition(TEST_FILE_SIZE as u64); + tline.hint_partitioning(parts, lsn)?; + + let mut tline_id = TIMELINE_ID; + for _ in 0..50 { + let new_tline_id = ZTimelineId::generate(); + repo.branch_timeline(tline_id, new_tline_id, lsn)?; + tline = repo.get_timeline_load(new_tline_id)?; + tline_id = new_tline_id; + + for _ in 0..NUM_KEYS { + lsn = Lsn(lsn.0 + 0x10); + let blknum = thread_rng().gen_range(0..NUM_KEYS); + test_key.field6 = blknum as u32; + let writer = tline.writer(); + writer.put( + test_key, + lsn, + Value::Image(TEST_IMG(&format!("{} at {}", blknum, lsn))), + )?; + println!("updating {} at {}", blknum, lsn); + writer.finish_write(lsn); + drop(writer); + updated[blknum] = lsn; + } + + // Read all the blocks + for (blknum, last_lsn) in updated.iter().enumerate() { + test_key.field6 = blknum as u32; + assert_eq!( + tline.get(test_key, lsn)?, + TEST_IMG(&format!("{} at {}", blknum, last_lsn)) + ); + } + + // Perform a cycle of checkpoint, compaction, and GC + println!("checkpointing {}", lsn); + let cutoff = tline.get_last_record_lsn(); + tline.update_gc_info(Vec::new(), cutoff); + 
tline.checkpoint(CheckpointConfig::Forced)?; + tline.compact()?; + tline.gc()?; } Ok(()) diff --git a/pageserver/src/layered_repository/README.md b/pageserver/src/layered_repository/README.md index 20f89ddc70..519478e417 100644 --- a/pageserver/src/layered_repository/README.md +++ b/pageserver/src/layered_repository/README.md @@ -1,40 +1,42 @@ # Overview -The on-disk format is based on immutable files. The page server receives a -stream of incoming WAL, parses the WAL records to determine which pages they -apply to, and accumulates the incoming changes in memory. Every now and then, -the accumulated changes are written out to new immutable files. This process is -called checkpointing. Old versions of on-disk files that are not needed by any -timeline are removed by GC process. - The main responsibility of the Page Server is to process the incoming WAL, and reprocess it into a format that allows reasonably quick access to any page -version. +version. The page server slices the incoming WAL per relation and page, and +packages the sliced WAL into suitably-sized "layer files". The layer files +contain all the history of the database, back to some reasonable retention +period. This system replaces the base backups and the WAL archive used in a +traditional PostgreSQL installation. The layer files are immutable, they are not +modified in-place after creation. New layer files are created for new incoming +WAL, and old layer files are removed when they are no longer needed. + +The on-disk format is based on immutable files. The page server receives a +stream of incoming WAL, parses the WAL records to determine which pages they +apply to, and accumulates the incoming changes in memory. Whenever enough WAL +has been accumulated in memory, it is written out to a new immutable file. That +process accumulates "L0 delta files" on disk. 
When enough L0 files have been +accumulated, they are merged and re-partitioned into L1 files, and old files +that are no longer needed are removed by Garbage Collection (GC). The incoming WAL contains updates to arbitrary pages in the system. The distribution depends on the workload: the updates could be totally random, or there could be a long stream of updates to a single relation when data is bulk -loaded, for example, or something in between. The page server slices the -incoming WAL per relation and page, and packages the sliced WAL into -suitably-sized "layer files". The layer files contain all the history of the -database, back to some reasonable retention period. This system replaces the -base backups and the WAL archive used in a traditional PostgreSQL -installation. The layer files are immutable, they are not modified in-place -after creation. New layer files are created for new incoming WAL, and old layer -files are removed when they are no longer needed. We could also replace layer -files with new files that contain the same information, merging small files for -example, but that hasn't been implemented yet. +loaded, for example, or something in between. +Cloud Storage Page Server Safekeeper + L1 L0 Memory WAL -Cloud Storage Page Server Safekeeper - Local disk Memory WAL - -|AAAA| |AAAA|AAAA| |AA -|BBBB| |BBBB|BBBB| | -|CCCC|CCCC| <---- |CCCC|CCCC|CCCC| <--- |CC <---- ADEBAABED -|DDDD|DDDD| |DDDD|DDDD| |DDD -|EEEE| |EEEE|EEEE|EEEE| |E - ++----+ +----+----+ +|AAAA| |AAAA|AAAA| +---+-----+ | ++----+ +----+----+ | | | |AA +|BBBB| |BBBB|BBBB| |BB | AA | |BB ++----+----+ +----+----+ |C | BB | |CC +|CCCC|CCCC| <---- |CCCC|CCCC| <--- |D | CC | <--- |DDD <---- ADEBAABED ++----+----+ +----+----+ | | DDD | |E +|DDDD|DDDD| |DDDD|DDDD| |E | | | ++----+----+ +----+----+ | | | +|EEEE| |EEEE|EEEE| +---+-----+ ++----+ +----+----+ In this illustration, WAL is received as a stream from the Safekeeper, from the right. 
It is immediately captured by the page server and stored quickly in @@ -42,39 +44,29 @@ memory. The page server memory can be thought of as a quick "reorder buffer", used to hold the incoming WAL and reorder it so that we keep the WAL records for the same page and relation close to each other. -From the page server memory, whenever enough WAL has been accumulated for one -relation segment, it is moved to local disk, as a new layer file, and the memory -is released. +From the page server memory, whenever enough WAL has been accumulated, it is flushed +to disk into a new L0 layer file, and the memory is released. + +When enough L0 files have been accumulated, they are merged together and sliced +per key-space, producing a new set of files where each file contains a +narrower key range, but a larger LSN range. From the local disk, the layers are further copied to Cloud Storage, for long-term archival. After a layer has been copied to Cloud Storage, it can be removed from local disk, although we currently keep everything locally for fast access. If a layer is needed that isn't found locally, it is fetched from Cloud -Storage and stored in local disk. - -# Terms used in layered repository - -- Relish - one PostgreSQL relation or similarly treated file. -- Segment - one slice of a Relish that is stored in a LayeredTimeline. -- Layer - specific version of a relish Segment in a range of LSNs. +Storage and stored in local disk. L0 and L1 files are both uploaded to Cloud +Storage. # Layer map -The LayerMap tracks what layers exist for all the relishes in a timeline. - -LayerMap consists of two data structures: -- segs - All the layers keyed by segment tag -- open_layers - data structure that hold all open layers ordered by oldest_pending_lsn for quick access during checkpointing. oldest_pending_lsn is the LSN of the oldest page version stored in this layer. - -All operations that update InMemory Layers should update both structures to keep them up-to-date.
- -- LayeredTimeline - implements Timeline interface. - -All methods of LayeredTimeline are aware of its ancestors and return data taking them into account. -TODO: Are there any exceptions to this? -For example, timeline.list_rels(lsn) will return all segments that are visible in this timeline at the LSN, -including ones that were not modified in this timeline and thus don't have a layer in the timeline's LayerMap. +The LayerMap tracks what layers exist in a timeline. +Currently, the layer map is just a resizeable array (Vec). On a GetPage@LSN or +other read request, the layer map scans through the array to find the right layer +that contains the data for the requested page. The read-code in LayeredTimeline +is aware of the ancestor, and returns data from the ancestor timeline if it's +not found on the current timeline. # Different kinds of layers @@ -92,11 +84,11 @@ To avoid OOM errors, InMemory layers can be spilled to disk into ephemeral file. TODO: Clarify the difference between Closed, Historic and Frozen. There are two kinds of OnDisk layers: -- ImageLayer represents an image or a snapshot of a 10 MB relish segment, at one particular LSN. -- DeltaLayer represents a collection of WAL records or page images in a range of LSNs, for one - relish segment. - -Dropped segments are always represented on disk by DeltaLayer. +- ImageLayer represents a snapshot of all the keys in a particular range, at one + particular LSN. Any keys that are not present in the ImageLayer are known not + to exist at that LSN. +- DeltaLayer represents a collection of WAL records or page images in a range of + LSNs, for a range of keys. # Layer life cycle @@ -109,71 +101,71 @@ layer or a delta layer, it is a valid end bound. An image layer represents snapshot at one LSN, so end_lsn is always the snapshot LSN + 1 Every layer starts its life as an Open In-Memory layer. 
When the page server -receives the first WAL record for a segment, it creates a new In-Memory layer -for it, and puts it to the layer map. Later, the layer is old enough, its -contents are written to disk, as On-Disk layers. This process is called -"evicting" a layer. +receives the first WAL record for a timeline, it creates a new In-Memory layer +for it, and puts it in the layer map. Later, when the layer becomes full, its +contents are written to disk, as an on-disk layer. -Layer eviction is a two-step process: First, the layer is marked as closed, so -that it no longer accepts new WAL records, and the layer map is updated -accordingly. If a new WAL record for that segment arrives after this step, a new -Open layer is created to hold it. After this first step, the layer is a Closed +Flushing a layer is a two-step process: First, the layer is marked as closed, so +that it no longer accepts new WAL records, and a new in-memory layer is created +to hold any WAL after that point. After this first step, the layer is in the Closed InMemory state. This first step is called "freezing" the layer. -In the second step, new Delta and Image layers are created, containing all the -data in the Frozen InMemory layer. When the new layers are ready, the original -frozen layer is replaced with the new layers in the layer map, and the original -frozen layer is dropped, releasing the memory. +In the second step, a new Delta layer is created, containing all the data from +the Frozen InMemory layer. When it has been created and flushed to disk, the +original frozen layer is replaced with the new layer in the layer map, and the +original frozen layer is dropped, releasing the memory. # Layer files (On-disk layers) -The files are called "layer files". Each layer file corresponds -to one RELISH_SEG_SIZE slice of a PostgreSQL relation fork or -non-rel file in a range of LSNs.
The layer files -for each timeline are stored in the timeline's subdirectory under +The files are called "layer files". Each layer file covers a range of keys, and +a range of LSNs (or a single LSN, in the case of image layers). You can think of it +as a rectangle in the two-dimensional key-LSN space. The layer files for each +timeline are stored in the timeline's subdirectory under .zenith/tenants//timelines. -There are two kind of layer file: base images, and deltas. A base -image file contains a layer of a segment as it was at one LSN, -whereas a delta file contains modifications to a segment - mostly in -the form of WAL records - in a range of LSN +There are two kinds of layer files: images, and delta layers. An image file +contains a snapshot of all keys at a particular LSN, whereas a delta file +contains modifications to a range of keys - mostly in the form of WAL records - in a +range of LSNs. -base image file: +image file: - rel______ + 000000067F000032BE0000400000000070B6-000000067F000032BE0000400000000080B6__00000000346BC568 + start key end key LSN + +The first parts define the key range that the layer covers. See +pgdatadir_mapping.rs for how the key space is used. The last part is the LSN. delta file: - rel_______ +Delta files are named similarly, but they cover a range of LSNs: -For example: + 000000067F000032BE0000400000000020B6-000000067F000032BE0000400000000030B6__000000578C6B29-0000000057A50051 + start key end key start LSN end LSN - rel_1663_13990_2609_0_10_000000000169C348 - rel_1663_13990_2609_0_10_000000000169C348_0000000001702000 +A delta file contains all the key-values in the key-range that were updated in +the LSN range. If a key has not been modified, there is no trace of it in the +delta layer. -In addition to the relations, with "rel_*" prefix, we use the same -format for storing various smaller files from the PostgreSQL data -directory. They will use different suffixes and the naming scheme up -to the LSNs vary.
The Zenith source code uses the term "relish" to -mean "a relation, or other file that's treated like a relation in the -storage" For example, a base image of a CLOG segment would be named -like this: - pg_xact_0000_0_00000000198B06B0 +A delta layer file can cover a part of the overall key space, as in the previous +example, or the whole key range like this: -There is no difference in how the relation and non-relation files are -managed, except that the first part of file names is different. -Internally, the relations and non-relation files that are managed in -the versioned store are together called "relishes". + 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000578C6B29-0000000057A50051 -If a file has been dropped, the last layer file for it is created -with the _DROPPED suffix, e.g. - - rel_1663_13990_2609_0_10_000000000169C348_0000000001702000_DROPPED +A file that covers the whole key range is called an L0 file (Level 0), while a +file that covers only part of the key range is called an L1 file. The "level" of +a file is not explicitly stored anywhere; you can only distinguish them by +looking at the key range that a file covers. The read-path doesn't need to +treat L0 and L1 files any differently. ## Notation used in this document +FIXME: This is somewhat obsolete: the layer files cover a key-range rather than +a particular relation nowadays. However, the description of how you find a page +version, and of how branching and GC work, is still valid. + The full path of a delta file looks like this: .zenith/tenants/941ddc8604413b88b3d208bddf90396c/timelines/4af489b06af8eed9e27a841775616962/rel_1663_13990_2609_0_10_000000000169C348_0000000001702000 diff --git a/pageserver/src/layered_repository/delta_layer.rs b/pageserver/src/layered_repository/delta_layer.rs index 1a6e941fbe..bb5fa02be1 100644 --- a/pageserver/src/layered_repository/delta_layer.rs +++ b/pageserver/src/layered_repository/delta_layer.rs @@ -1,6 +1,5 @@ -//! //!
A DeltaLayer represents a collection of WAL records or page images in a range of -//! LSNs, for one segment. It is stored on a file on disk. +//! LSNs, and in a range of Keys. It is stored on a file on disk. //! //! Usually a delta layer only contains differences - in the form of WAL records against //! a base LSN. However, if a segment is newly created, by creating a new relation or @@ -11,84 +10,74 @@ //! can happen when you create a new branch in the middle of a delta layer, and the WAL //! records on the new branch are put in a new delta layer. //! -//! When a delta file needs to be accessed, we slurp the metadata and segsize chapters +//! When a delta file needs to be accessed, we slurp the 'index' metadata //! into memory, into the DeltaLayerInner struct. See load() and unload() functions. -//! To access a page/WAL record, we search `page_version_metas` for the block # and LSN. -//! The byte ranges in the metadata can be used to find the page/WAL record in -//! PAGE_VERSIONS_CHAPTER. +//! To access a particular value, we search `index` for the given key. +//! The byte offset in the index can be used to find the value in +//! VALUES_CHAPTER. //! //! On disk, the delta files are stored in timelines/ directory. //! Currently, there are no subdirectories, and each delta file is named like this: //! -//! ______ +//! 
-__- page/WAL record +/// byte ranges in VALUES_CHAPTER +static INDEX_CHAPTER: u64 = 1; -/// Mapping from (block #, lsn) -> page/WAL record -/// byte ranges in PAGE_VERSIONS_CHAPTER -static PAGE_VERSION_METAS_CHAPTER: u64 = 1; /// Page/WAL bytes - cannot be interpreted -/// without PAGE_VERSION_METAS_CHAPTER -static PAGE_VERSIONS_CHAPTER: u64 = 2; -static SEG_SIZES_CHAPTER: u64 = 3; +/// without the page versions from the INDEX_CHAPTER +static VALUES_CHAPTER: u64 = 2; /// Contains the [`Summary`] struct -static SUMMARY_CHAPTER: u64 = 4; +static SUMMARY_CHAPTER: u64 = 3; #[derive(Debug, Serialize, Deserialize, PartialEq, Eq)] struct Summary { tenantid: ZTenantId, timelineid: ZTimelineId, - seg: SegmentTag, - - start_lsn: Lsn, - end_lsn: Lsn, - - dropped: bool, + key_range: Range, + lsn_range: Range, } impl From<&DeltaLayer> for Summary { @@ -96,33 +85,17 @@ impl From<&DeltaLayer> for Summary { Self { tenantid: layer.tenantid, timelineid: layer.timelineid, - seg: layer.seg, - - start_lsn: layer.start_lsn, - end_lsn: layer.end_lsn, - - dropped: layer.dropped, + key_range: layer.key_range.clone(), + lsn_range: layer.lsn_range.clone(), } } } -#[derive(Serialize, Deserialize)] -struct BlobRange { - offset: u64, - size: usize, -} - -fn read_blob(reader: &BoundedReader<&'_ F>, range: &BlobRange) -> Result> { - let mut buf = vec![0u8; range.size]; - reader.read_exact_at(&mut buf, range.offset)?; - Ok(buf) -} - /// /// DeltaLayer is the in-memory data structure associated with an /// on-disk delta file. We keep a DeltaLayer in memory for each /// file, in the LayerMap. If a layer is in "loaded" state, we have a -/// copy of the file in memory, in 'inner'. Otherwise the struct is +/// copy of the index in memory, in 'inner'. Otherwise the struct is /// just a placeholder for a file that exists on disk, and it needs to /// be loaded before using it in queries. 
/// @@ -131,47 +104,24 @@ pub struct DeltaLayer { pub tenantid: ZTenantId, pub timelineid: ZTimelineId, - pub seg: SegmentTag, - - // - // This entry contains all the changes from 'start_lsn' to 'end_lsn'. The - // start is inclusive, and end is exclusive. - // - pub start_lsn: Lsn, - pub end_lsn: Lsn, - - dropped: bool, + pub key_range: Range, + pub lsn_range: Range, inner: RwLock, } pub struct DeltaLayerInner { - /// If false, the 'page_version_metas' and 'seg_sizes' have not been - /// loaded into memory yet. + /// If false, the 'index' has not been loaded into memory yet. loaded: bool, + /// + /// All versions of all pages in the layer are kept here. + /// Indexed by block number and LSN. The value is an offset into the + /// chapter where the page version is stored. + /// + index: HashMap>, + book: Option>, - - /// All versions of all pages in the file are are kept here. - /// Indexed by block number and LSN. - page_version_metas: VecMap<(SegmentBlk, Lsn), BlobRange>, - - /// `seg_sizes` tracks the size of the segment at different points in time. - seg_sizes: VecMap, -} - -impl DeltaLayerInner { - fn get_seg_size(&self, lsn: Lsn) -> Result { - // Scan the VecMap backwards, starting from the given entry. 
- let slice = self - .seg_sizes - .slice_range((Included(&Lsn(0)), Included(&lsn))); - if let Some((_entry_lsn, entry)) = slice.last() { - Ok(*entry) - } else { - bail!("could not find seg size in delta layer") - } - } } impl Layer for DeltaLayer { @@ -183,132 +133,93 @@ impl Layer for DeltaLayer { self.timelineid } - fn get_seg_tag(&self) -> SegmentTag { - self.seg + fn get_key_range(&self) -> Range { + self.key_range.clone() } - fn is_dropped(&self) -> bool { - self.dropped - } - - fn get_start_lsn(&self) -> Lsn { - self.start_lsn - } - - fn get_end_lsn(&self) -> Lsn { - self.end_lsn + fn get_lsn_range(&self) -> Range { + self.lsn_range.clone() } fn filename(&self) -> PathBuf { PathBuf::from(self.layer_name().to_string()) } - /// Look up given page in the cache. - fn get_page_reconstruct_data( + fn get_value_reconstruct_data( &self, - blknum: SegmentBlk, - lsn: Lsn, - reconstruct_data: &mut PageReconstructData, - ) -> anyhow::Result { + key: Key, + lsn_range: Range, + reconstruct_state: &mut ValueReconstructState, + ) -> anyhow::Result { let mut need_image = true; - ensure!((0..RELISH_SEG_SIZE).contains(&blknum)); - - match &reconstruct_data.page_img { - Some((cached_lsn, _)) if &self.end_lsn <= cached_lsn => { - return Ok(PageReconstructResult::Complete) - } - _ => {} - } + ensure!(self.key_range.contains(&key)); { // Open the file and lock the metadata in memory let inner = self.load()?; - let page_version_reader = inner + let values_reader = inner .book .as_ref() .expect("should be loaded in load call above") - .chapter_reader(PAGE_VERSIONS_CHAPTER)?; + .chapter_reader(VALUES_CHAPTER)?; - // Scan the metadata VecMap backwards, starting from the given entry. 
- let minkey = (blknum, Lsn(0)); - let maxkey = (blknum, lsn); - let iter = inner - .page_version_metas - .slice_range((Included(&minkey), Included(&maxkey))) - .iter() - .rev(); - for ((_blknum, pv_lsn), blob_range) in iter { - match &reconstruct_data.page_img { - Some((cached_lsn, _)) if pv_lsn <= cached_lsn => { - return Ok(PageReconstructResult::Complete) - } - _ => {} - } - - let pv = PageVersion::des(&read_blob(&page_version_reader, blob_range)?)?; - - match pv { - PageVersion::Page(img) => { - // Found a page image, return it - reconstruct_data.page_img = Some((*pv_lsn, img)); - need_image = false; + // Scan the page versions backwards, starting from `lsn`. + if let Some(vec_map) = inner.index.get(&key) { + let slice = vec_map.slice_range(lsn_range); + let mut size = 0usize; + let mut first_pos = 0u64; + for (_entry_lsn, blob_ref) in slice.iter().rev() { + size += blob_ref.size(); + first_pos = blob_ref.pos(); + if blob_ref.will_init() { break; } - PageVersion::Wal(rec) => { - let will_init = rec.will_init(); - reconstruct_data.records.push((*pv_lsn, rec)); - if will_init { - // This WAL record initializes the page, so no need to go further back - need_image = false; - break; + } + if size != 0 { + let mut buf = vec![0u8; size]; + values_reader.read_exact_at(&mut buf, first_pos)?; + for (entry_lsn, blob_ref) in slice.iter().rev() { + let offs = (blob_ref.pos() - first_pos) as usize; + let val = Value::des(&buf[offs..offs + blob_ref.size()])?; + match val { + Value::Image(img) => { + reconstruct_state.img = Some((*entry_lsn, img)); + need_image = false; + break; + } + Value::WalRecord(rec) => { + let will_init = rec.will_init(); + reconstruct_state.records.push((*entry_lsn, rec)); + if will_init { + // This WAL record initializes the page, so no need to go further back + need_image = false; + break; + } + } } } } } - - // If we didn't find any records for this, check if the request is beyond EOF - if need_image - && reconstruct_data.records.is_empty() - && 
self.seg.rel.is_blocky() - && blknum >= inner.get_seg_size(lsn)? - { - return Ok(PageReconstructResult::Missing(self.start_lsn)); - } - // release metadata lock and close the file } // If an older page image is needed to reconstruct the page, let the // caller know. if need_image { - Ok(PageReconstructResult::Continue(Lsn(self.start_lsn.0 - 1))) + Ok(ValueReconstructResult::Continue) } else { - Ok(PageReconstructResult::Complete) + Ok(ValueReconstructResult::Complete) } } - /// Get size of the relation at given LSN - fn get_seg_size(&self, lsn: Lsn) -> anyhow::Result { - ensure!(lsn >= self.start_lsn); - ensure!( - self.seg.rel.is_blocky(), - "get_seg_size() called on a non-blocky rel" - ); + fn iter(&self) -> Box> + '_> { + let inner = self.load().unwrap(); - let inner = self.load()?; - inner.get_seg_size(lsn) - } - - /// Does this segment exist at given LSN? - fn get_seg_exists(&self, lsn: Lsn) -> Result { - // Is the requested LSN after the rel was dropped? - if self.dropped && lsn >= self.end_lsn { - return Ok(false); + match DeltaValueIter::new(inner) { + Ok(iter) => Box::new(iter), + Err(err) => Box::new(std::iter::once(Err(err))), } - - // Otherwise, it exists. - Ok(true) } /// @@ -316,13 +227,22 @@ impl Layer for DeltaLayer { /// it will need to be loaded back. /// fn unload(&self) -> Result<()> { + // FIXME: In debug mode, loading and unloading the index slows + // things down so much that you get timeout errors. At least + // with the test_parallel_copy test. So as an even more ad hoc + // stopgap fix for that, only unload every on average 10 + // checkpoint cycles. 
+ use rand::RngCore; + if rand::thread_rng().next_u32() > (u32::MAX / 10) { + return Ok(()); + } + let mut inner = match self.inner.try_write() { Ok(inner) => inner, Err(TryLockError::WouldBlock) => return Ok(()), Err(TryLockError::Poisoned(_)) => panic!("DeltaLayer lock was poisoned"), }; - inner.page_version_metas = VecMap::default(); - inner.seg_sizes = VecMap::default(); + inner.index = HashMap::default(); inner.loaded = false; // Note: we keep the Book open. Is that a good idea? The virtual file @@ -349,45 +269,52 @@ impl Layer for DeltaLayer { /// debugging function to print out the contents of the layer fn dump(&self) -> Result<()> { println!( - "----- delta layer for ten {} tli {} seg {} {}-{} ----", - self.tenantid, self.timelineid, self.seg, self.start_lsn, self.end_lsn + "----- delta layer for ten {} tli {} keys {}-{} lsn {}-{} ----", + self.tenantid, + self.timelineid, + self.key_range.start, + self.key_range.end, + self.lsn_range.start, + self.lsn_range.end ); - println!("--- seg sizes ---"); let inner = self.load()?; - for (k, v) in inner.seg_sizes.as_slice() { - println!(" {}: {}", k, v); - } - println!("--- page versions ---"); let path = self.path(); let file = std::fs::File::open(&path)?; let book = Book::new(file)?; + let chapter = book.chapter_reader(VALUES_CHAPTER)?; - let chapter = book.chapter_reader(PAGE_VERSIONS_CHAPTER)?; - for ((blk, lsn), blob_range) in inner.page_version_metas.as_slice() { - let mut desc = String::new(); + let mut values: Vec<(&Key, &VecMap)> = inner.index.iter().collect(); + values.sort_by_key(|k| k.0); - let buf = read_blob(&chapter, blob_range)?; - let pv = PageVersion::des(&buf)?; + for (key, versions) in values { + for (lsn, blob_ref) in versions.as_slice() { + let mut desc = String::new(); + let mut buf = vec![0u8; blob_ref.size()]; + chapter.read_exact_at(&mut buf, blob_ref.pos())?; + let val = Value::des(&buf); - match pv { - PageVersion::Page(img) => { - write!(&mut desc, " img {} bytes", img.len())?; - } - 
PageVersion::Wal(rec) => { - let wal_desc = walrecord::describe_wal_record(&rec); - write!( - &mut desc, - " rec {} bytes will_init: {} {}", - blob_range.size, - rec.will_init(), - wal_desc - )?; + match val { + Ok(Value::Image(img)) => { + write!(&mut desc, " img {} bytes", img.len())?; + } + Ok(Value::WalRecord(rec)) => { + let wal_desc = walrecord::describe_wal_record(&rec); + write!( + &mut desc, + " rec {} bytes will_init: {} {}", + buf.len(), + rec.will_init(), + wal_desc + )?; + } + Err(err) => { + write!(&mut desc, " DESERIALIZATION ERROR: {}", err)?; + } } + println!(" key {} at {}: {}", key, lsn, desc); } - - println!(" blk {} at {}: {}", blk, lsn, desc); } Ok(()) @@ -475,18 +402,13 @@ impl DeltaLayer { } } - let chapter = book.read_chapter(PAGE_VERSION_METAS_CHAPTER)?; - let page_version_metas = VecMap::des(&chapter)?; - - let chapter = book.read_chapter(SEG_SIZES_CHAPTER)?; - let seg_sizes = VecMap::des(&chapter)?; + let chapter = book.read_chapter(INDEX_CHAPTER)?; + let index = HashMap::des(&chapter)?; debug!("loaded from {}", &path.display()); - inner.page_version_metas = page_version_metas; - inner.seg_sizes = seg_sizes; + inner.index = index; inner.loaded = true; - Ok(()) } @@ -501,15 +423,12 @@ impl DeltaLayer { path_or_conf: PathOrConf::Conf(conf), timelineid, tenantid, - seg: filename.seg, - start_lsn: filename.start_lsn, - end_lsn: filename.end_lsn, - dropped: filename.dropped, + key_range: filename.key_range.clone(), + lsn_range: filename.lsn_range.clone(), inner: RwLock::new(DeltaLayerInner { loaded: false, book: None, - page_version_metas: VecMap::default(), - seg_sizes: VecMap::default(), + index: HashMap::default(), }), } } @@ -519,7 +438,7 @@ impl DeltaLayer { /// This variant is only used for debugging purposes, by the 'dump_layerfile' binary. 
pub fn new_for_path(path: &Path, book: &Book) -> Result where - F: std::os::unix::prelude::FileExt, + F: FileExt, { let chapter = book.read_chapter(SUMMARY_CHAPTER)?; let summary = Summary::des(&chapter)?; @@ -528,25 +447,20 @@ impl DeltaLayer { path_or_conf: PathOrConf::Path(path.to_path_buf()), timelineid: summary.timelineid, tenantid: summary.tenantid, - seg: summary.seg, - start_lsn: summary.start_lsn, - end_lsn: summary.end_lsn, - dropped: summary.dropped, + key_range: summary.key_range, + lsn_range: summary.lsn_range, inner: RwLock::new(DeltaLayerInner { loaded: false, book: None, - page_version_metas: VecMap::default(), - seg_sizes: VecMap::default(), + index: HashMap::default(), }), }) } fn layer_name(&self) -> DeltaFileName { DeltaFileName { - seg: self.seg, - start_lsn: self.start_lsn, - end_lsn: self.end_lsn, - dropped: self.dropped, + key_range: self.key_range.clone(), + lsn_range: self.lsn_range.clone(), } } @@ -567,24 +481,24 @@ impl DeltaLayer { /// /// 1. Create the DeltaLayerWriter by calling DeltaLayerWriter::new(...) /// -/// 2. Write the contents by calling `put_page_version` for every page +/// 2. Write the contents by calling `put_value` for every page /// version to store in the layer. /// /// 3. Call `finish`. 
/// pub struct DeltaLayerWriter { conf: &'static PageServerConf, + path: PathBuf, timelineid: ZTimelineId, tenantid: ZTenantId, - seg: SegmentTag, - start_lsn: Lsn, - end_lsn: Lsn, - dropped: bool, - page_version_writer: ChapterWriter>, - pv_offset: u64, + key_start: Key, + lsn_range: Range, - page_version_metas: VecMap<(SegmentBlk, Lsn), BlobRange>, + index: HashMap>, + + values_writer: ChapterWriter>, + end_offset: u64, } impl DeltaLayerWriter { @@ -595,94 +509,86 @@ impl DeltaLayerWriter { conf: &'static PageServerConf, timelineid: ZTimelineId, tenantid: ZTenantId, - seg: SegmentTag, - start_lsn: Lsn, - end_lsn: Lsn, - dropped: bool, + key_start: Key, + lsn_range: Range, ) -> Result { - // Create the file + // Create the file initially with a temporary filename. We don't know + // the end key yet, so we cannot form the final filename yet. We will + // rename it when we're done. // // Note: This overwrites any existing file. There shouldn't be any. // FIXME: throw an error instead? - let path = DeltaLayer::path_for( - &PathOrConf::Conf(conf), - timelineid, - tenantid, - &DeltaFileName { - seg, - start_lsn, - end_lsn, - dropped, - }, - ); + let path = conf.timeline_path(&timelineid, &tenantid).join(format!( + "{}-XXX__{:016X}-{:016X}.temp", + key_start, + u64::from(lsn_range.start), + u64::from(lsn_range.end) + )); let file = VirtualFile::create(&path)?; let buf_writer = BufWriter::new(file); let book = BookWriter::new(buf_writer, DELTA_FILE_MAGIC)?; // Open the page-versions chapter for writing. The calls to - // `put_page_version` will use this to write the contents. - let page_version_writer = book.new_chapter(PAGE_VERSIONS_CHAPTER); + // `put_value` will use this to write the contents. 
+ let values_writer = book.new_chapter(VALUES_CHAPTER); Ok(DeltaLayerWriter { conf, + path, timelineid, tenantid, - seg, - start_lsn, - end_lsn, - dropped, - page_version_writer, - page_version_metas: VecMap::default(), - pv_offset: 0, + key_start, + lsn_range, + index: HashMap::new(), + values_writer, + end_offset: 0, }) } /// - /// Append a page version to the file. + /// Append a key-value pair to the file. /// - /// 'buf' is a serialized PageVersion. - /// The page versions must be appended in blknum, lsn order. + /// The values must be appended in key, lsn order. /// - pub fn put_page_version(&mut self, blknum: SegmentBlk, lsn: Lsn, buf: &[u8]) -> Result<()> { + pub fn put_value(&mut self, key: Key, lsn: Lsn, val: Value) -> Result<()> { + //info!("DELTA: key {} at {} on {}", key, lsn, self.path.display()); + assert!(self.lsn_range.start <= lsn); // Remember the offset and size metadata. The metadata is written // to a separate chapter, in `finish`. - let blob_range = BlobRange { - offset: self.pv_offset, - size: buf.len(), - }; - self.page_version_metas - .append((blknum, lsn), blob_range) - .unwrap(); - - // write the page version - self.page_version_writer.write_all(buf)?; - self.pv_offset += buf.len() as u64; + let off = self.end_offset; + let buf = Value::ser(&val)?; + let len = buf.len(); + self.values_writer.write_all(&buf)?; + self.end_offset += len as u64; + let vec_map = self.index.entry(key).or_default(); + let blob_ref = BlobRef::new(off, len, val.will_init()); + let old = vec_map.append_or_update_last(lsn, blob_ref).unwrap().0; + if old.is_some() { + // We already had an entry for this LSN. That's odd.. + bail!( + "Value for {} at {} already exists in delta layer being built", + key, + lsn + ); + } Ok(()) } + pub fn size(&self) -> u64 { + self.end_offset + } + /// /// Finish writing the delta layer. /// - /// 'seg_sizes' is a list of size changes to store with the actual data. 
- /// - pub fn finish(self, seg_sizes: VecMap) -> anyhow::Result { - // Close the page-versions chapter - let book = self.page_version_writer.close()?; + pub fn finish(self, key_end: Key) -> anyhow::Result { + // Close the values chapter + let book = self.values_writer.close()?; - // Write out page versions metadata - let mut chapter = book.new_chapter(PAGE_VERSION_METAS_CHAPTER); - let buf = VecMap::ser(&self.page_version_metas)?; - chapter.write_all(&buf)?; - let book = chapter.close()?; - - if self.seg.rel.is_blocky() { - ensure!(!seg_sizes.is_empty()); - } - - // and seg_sizes to separate chapter - let mut chapter = book.new_chapter(SEG_SIZES_CHAPTER); - let buf = VecMap::ser(&seg_sizes)?; + // Write out the index + let mut chapter = book.new_chapter(INDEX_CHAPTER); + let buf = HashMap::ser(&self.index)?; chapter.write_all(&buf)?; let book = chapter.close()?; @@ -690,12 +596,8 @@ impl DeltaLayerWriter { let summary = Summary { tenantid: self.tenantid, timelineid: self.timelineid, - seg: self.seg, - - start_lsn: self.start_lsn, - end_lsn: self.end_lsn, - - dropped: self.dropped, + key_range: self.key_start..key_end, + lsn_range: self.lsn_range.clone(), }; Summary::ser_into(&summary, &mut chapter)?; let book = chapter.close()?; @@ -710,20 +612,111 @@ impl DeltaLayerWriter { path_or_conf: PathOrConf::Conf(self.conf), tenantid: self.tenantid, timelineid: self.timelineid, - seg: self.seg, - start_lsn: self.start_lsn, - end_lsn: self.end_lsn, - dropped: self.dropped, + key_range: self.key_start..key_end, + lsn_range: self.lsn_range.clone(), inner: RwLock::new(DeltaLayerInner { loaded: false, + index: HashMap::new(), book: None, - page_version_metas: VecMap::default(), - seg_sizes: VecMap::default(), }), }; - trace!("created delta layer {}", &layer.path().display()); + // Rename the file to its final name + // + // Note: This overwrites any existing file. There shouldn't be any. + // FIXME: throw an error instead? 
+ let final_path = DeltaLayer::path_for( + &PathOrConf::Conf(self.conf), + self.timelineid, + self.tenantid, + &DeltaFileName { + key_range: self.key_start..key_end, + lsn_range: self.lsn_range, + }, + ); + std::fs::rename(self.path, &final_path)?; + + trace!("created delta layer {}", final_path.display()); Ok(layer) } + + pub fn abort(self) { + match self.values_writer.close() { + Ok(book) => { + if let Err(err) = book.close() { + error!("error while closing delta layer file: {}", err); + } + } + Err(err) => { + error!("error while closing chapter writer: {}", err); + } + } + if let Err(err) = std::fs::remove_file(self.path) { + error!("error removing unfinished delta layer file: {}", err); + } + } +} + +/// +/// Iterator over all key-value pairs stored in a delta layer +/// +/// FIXME: This creates a Vector to hold the offsets of all key value pairs. +/// That takes up quite a lot of memory. Should do this in a more streaming +/// fashion. +/// +struct DeltaValueIter { + all_offsets: Vec<(Key, Lsn, BlobRef)>, + next_idx: usize, + data: Vec, +} + +impl Iterator for DeltaValueIter { + type Item = Result<(Key, Lsn, Value)>; + + fn next(&mut self) -> Option { + self.next_res().transpose() + } +} + +impl DeltaValueIter { + fn new(inner: RwLockReadGuard) -> Result { + let mut index: Vec<(&Key, &VecMap)> = inner.index.iter().collect(); + index.sort_by_key(|x| x.0); + + let mut all_offsets: Vec<(Key, Lsn, BlobRef)> = Vec::new(); + for (key, vec_map) in index.iter() { + for (lsn, blob_ref) in vec_map.as_slice().iter() { + all_offsets.push((**key, *lsn, *blob_ref)); + } + } + + let values_reader = inner + .book + .as_ref() + .expect("should be loaded in load call above") + .chapter_reader(VALUES_CHAPTER)?; + let file_size = values_reader.len() as usize; + let mut layer = DeltaValueIter { + all_offsets, + next_idx: 0, + data: vec![0u8; file_size], + }; + values_reader.read_exact_at(&mut layer.data, 0)?; + + Ok(layer) + } + + fn next_res(&mut self) -> Result> { + if
self.next_idx < self.all_offsets.len() { + let (key, lsn, blob_ref) = self.all_offsets[self.next_idx]; + let offs = blob_ref.pos() as usize; + let size = blob_ref.size(); + let val = Value::des(&self.data[offs..offs + size])?; + self.next_idx += 1; + Ok(Some((key, lsn, val))) + } else { + Ok(None) + } + } } diff --git a/pageserver/src/layered_repository/filename.rs b/pageserver/src/layered_repository/filename.rs index df23700dfd..cd63f014c4 100644 --- a/pageserver/src/layered_repository/filename.rs +++ b/pageserver/src/layered_repository/filename.rs @@ -2,29 +2,52 @@ //! Helper functions for dealing with filenames of the image and delta layer files. //! use crate::config::PageServerConf; -use crate::layered_repository::storage_layer::SegmentTag; -use crate::relish::*; +use crate::repository::Key; +use std::cmp::Ordering; use std::fmt; +use std::ops::Range; use std::path::PathBuf; use zenith_utils::lsn::Lsn; // Note: LayeredTimeline::load_layer_map() relies on this sort order -#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone)] +#[derive(Debug, PartialEq, Eq, Clone)] pub struct DeltaFileName { - pub seg: SegmentTag, - pub start_lsn: Lsn, - pub end_lsn: Lsn, - pub dropped: bool, + pub key_range: Range, + pub lsn_range: Range, +} + +impl PartialOrd for DeltaFileName { + fn partial_cmp(&self, other: &Self) -> Option { + Some(self.cmp(other)) + } +} + +impl Ord for DeltaFileName { + fn cmp(&self, other: &Self) -> Ordering { + let mut cmp; + + cmp = self.key_range.start.cmp(&other.key_range.start); + if cmp != Ordering::Equal { + return cmp; + } + cmp = self.key_range.end.cmp(&other.key_range.end); + if cmp != Ordering::Equal { + return cmp; + } + cmp = self.lsn_range.start.cmp(&other.lsn_range.start); + if cmp != Ordering::Equal { + return cmp; + } + cmp = self.lsn_range.end.cmp(&other.lsn_range.end); + + cmp + } } /// Represents the filename of a DeltaLayer /// -/// ______ -/// -/// or if it was dropped: -/// -/// _______DROPPED +/// -__- /// impl DeltaFileName { 
/// @@ -32,234 +55,123 @@ impl DeltaFileName { /// match the expected pattern. /// pub fn parse_str(fname: &str) -> Option { - let rel; - let mut parts; - if let Some(rest) = fname.strip_prefix("rel_") { - parts = rest.split('_'); - rel = RelishTag::Relation(RelTag { - spcnode: parts.next()?.parse::().ok()?, - dbnode: parts.next()?.parse::().ok()?, - relnode: parts.next()?.parse::().ok()?, - forknum: parts.next()?.parse::().ok()?, - }); - } else if let Some(rest) = fname.strip_prefix("pg_xact_") { - parts = rest.split('_'); - rel = RelishTag::Slru { - slru: SlruKind::Clog, - segno: u32::from_str_radix(parts.next()?, 16).ok()?, - }; - } else if let Some(rest) = fname.strip_prefix("pg_multixact_members_") { - parts = rest.split('_'); - rel = RelishTag::Slru { - slru: SlruKind::MultiXactMembers, - segno: u32::from_str_radix(parts.next()?, 16).ok()?, - }; - } else if let Some(rest) = fname.strip_prefix("pg_multixact_offsets_") { - parts = rest.split('_'); - rel = RelishTag::Slru { - slru: SlruKind::MultiXactOffsets, - segno: u32::from_str_radix(parts.next()?, 16).ok()?, - }; - } else if let Some(rest) = fname.strip_prefix("pg_filenodemap_") { - parts = rest.split('_'); - rel = RelishTag::FileNodeMap { - spcnode: parts.next()?.parse::().ok()?, - dbnode: parts.next()?.parse::().ok()?, - }; - } else if let Some(rest) = fname.strip_prefix("pg_twophase_") { - parts = rest.split('_'); - rel = RelishTag::TwoPhase { - xid: parts.next()?.parse::().ok()?, - }; - } else if let Some(rest) = fname.strip_prefix("pg_control_checkpoint_") { - parts = rest.split('_'); - rel = RelishTag::Checkpoint; - } else if let Some(rest) = fname.strip_prefix("pg_control_") { - parts = rest.split('_'); - rel = RelishTag::ControlFile; - } else { + let mut parts = fname.split("__"); + let mut key_parts = parts.next()?.split('-'); + let mut lsn_parts = parts.next()?.split('-'); + + let key_start_str = key_parts.next()?; + let key_end_str = key_parts.next()?; + let lsn_start_str = lsn_parts.next()?; + 
let lsn_end_str = lsn_parts.next()?; + if parts.next().is_some() || key_parts.next().is_some() || lsn_parts.next().is_some() { return None; } - let segno = parts.next()?.parse::().ok()?; + let key_start = Key::from_hex(key_start_str).ok()?; + let key_end = Key::from_hex(key_end_str).ok()?; - let seg = SegmentTag { rel, segno }; + let start_lsn = Lsn::from_hex(lsn_start_str).ok()?; + let end_lsn = Lsn::from_hex(lsn_end_str).ok()?; - let start_lsn = Lsn::from_hex(parts.next()?).ok()?; - let end_lsn = Lsn::from_hex(parts.next()?).ok()?; - - let mut dropped = false; - if let Some(suffix) = parts.next() { - if suffix == "DROPPED" { - dropped = true; - } else { - return None; - } - } - if parts.next().is_some() { + if start_lsn >= end_lsn { return None; + // or panic? + } + + if key_start >= key_end { + return None; + // or panic? } Some(DeltaFileName { - seg, - start_lsn, - end_lsn, - dropped, + key_range: key_start..key_end, + lsn_range: start_lsn..end_lsn, }) } } impl fmt::Display for DeltaFileName { fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { - let basename = match self.seg.rel { - RelishTag::Relation(reltag) => format!( - "rel_{}_{}_{}_{}", - reltag.spcnode, reltag.dbnode, reltag.relnode, reltag.forknum - ), - RelishTag::Slru { - slru: SlruKind::Clog, - segno, - } => format!("pg_xact_{:04X}", segno), - RelishTag::Slru { - slru: SlruKind::MultiXactMembers, - segno, - } => format!("pg_multixact_members_{:04X}", segno), - RelishTag::Slru { - slru: SlruKind::MultiXactOffsets, - segno, - } => format!("pg_multixact_offsets_{:04X}", segno), - RelishTag::FileNodeMap { spcnode, dbnode } => { - format!("pg_filenodemap_{}_{}", spcnode, dbnode) - } - RelishTag::TwoPhase { xid } => format!("pg_twophase_{}", xid), - RelishTag::Checkpoint => "pg_control_checkpoint".to_string(), - RelishTag::ControlFile => "pg_control".to_string(), - }; - write!( f, - "{}_{}_{:016X}_{:016X}{}", - basename, - self.seg.segno, - u64::from(self.start_lsn), - u64::from(self.end_lsn), - if
self.dropped { "_DROPPED" } else { "" } + "{}-{}__{:016X}-{:016X}", + self.key_range.start, + self.key_range.end, + u64::from(self.lsn_range.start), + u64::from(self.lsn_range.end), ) } } -#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone)] +#[derive(Debug, PartialEq, Eq, Clone)] pub struct ImageFileName { - pub seg: SegmentTag, + pub key_range: Range, pub lsn: Lsn, } +impl PartialOrd for ImageFileName { + fn partial_cmp(&self, other: &Self) -> Option { + Some(self.cmp(other)) + } +} + +impl Ord for ImageFileName { + fn cmp(&self, other: &Self) -> Ordering { + let mut cmp; + + cmp = self.key_range.start.cmp(&other.key_range.start); + if cmp != Ordering::Equal { + return cmp; + } + cmp = self.key_range.end.cmp(&other.key_range.end); + if cmp != Ordering::Equal { + return cmp; + } + cmp = self.lsn.cmp(&other.lsn); + + cmp + } +} + /// /// Represents the filename of an ImageLayer /// -/// _____ -/// +/// -__ impl ImageFileName { /// /// Parse a string as an image file name. Returns None if the filename does not /// match the expected pattern. 
/// pub fn parse_str(fname: &str) -> Option { - let rel; - let mut parts; - if let Some(rest) = fname.strip_prefix("rel_") { - parts = rest.split('_'); - rel = RelishTag::Relation(RelTag { - spcnode: parts.next()?.parse::().ok()?, - dbnode: parts.next()?.parse::().ok()?, - relnode: parts.next()?.parse::().ok()?, - forknum: parts.next()?.parse::().ok()?, - }); - } else if let Some(rest) = fname.strip_prefix("pg_xact_") { - parts = rest.split('_'); - rel = RelishTag::Slru { - slru: SlruKind::Clog, - segno: u32::from_str_radix(parts.next()?, 16).ok()?, - }; - } else if let Some(rest) = fname.strip_prefix("pg_multixact_members_") { - parts = rest.split('_'); - rel = RelishTag::Slru { - slru: SlruKind::MultiXactMembers, - segno: u32::from_str_radix(parts.next()?, 16).ok()?, - }; - } else if let Some(rest) = fname.strip_prefix("pg_multixact_offsets_") { - parts = rest.split('_'); - rel = RelishTag::Slru { - slru: SlruKind::MultiXactOffsets, - segno: u32::from_str_radix(parts.next()?, 16).ok()?, - }; - } else if let Some(rest) = fname.strip_prefix("pg_filenodemap_") { - parts = rest.split('_'); - rel = RelishTag::FileNodeMap { - spcnode: parts.next()?.parse::().ok()?, - dbnode: parts.next()?.parse::().ok()?, - }; - } else if let Some(rest) = fname.strip_prefix("pg_twophase_") { - parts = rest.split('_'); - rel = RelishTag::TwoPhase { - xid: parts.next()?.parse::().ok()?, - }; - } else if let Some(rest) = fname.strip_prefix("pg_control_checkpoint_") { - parts = rest.split('_'); - rel = RelishTag::Checkpoint; - } else if let Some(rest) = fname.strip_prefix("pg_control_") { - parts = rest.split('_'); - rel = RelishTag::ControlFile; - } else { + let mut parts = fname.split("__"); + let mut key_parts = parts.next()?.split('-'); + + let key_start_str = key_parts.next()?; + let key_end_str = key_parts.next()?; + let lsn_str = parts.next()?; + if parts.next().is_some() || key_parts.next().is_some() { return None; } - let segno = parts.next()?.parse::().ok()?; + let key_start = 
Key::from_hex(key_start_str).ok()?; + let key_end = Key::from_hex(key_end_str).ok()?; - let seg = SegmentTag { rel, segno }; + let lsn = Lsn::from_hex(lsn_str).ok()?; - let lsn = Lsn::from_hex(parts.next()?).ok()?; - - if parts.next().is_some() { - return None; - } - - Some(ImageFileName { seg, lsn }) + Some(ImageFileName { + key_range: key_start..key_end, + lsn, + }) } } impl fmt::Display for ImageFileName { fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { - let basename = match self.seg.rel { - RelishTag::Relation(reltag) => format!( - "rel_{}_{}_{}_{}", - reltag.spcnode, reltag.dbnode, reltag.relnode, reltag.forknum - ), - RelishTag::Slru { - slru: SlruKind::Clog, - segno, - } => format!("pg_xact_{:04X}", segno), - RelishTag::Slru { - slru: SlruKind::MultiXactMembers, - segno, - } => format!("pg_multixact_members_{:04X}", segno), - RelishTag::Slru { - slru: SlruKind::MultiXactOffsets, - segno, - } => format!("pg_multixact_offsets_{:04X}", segno), - RelishTag::FileNodeMap { spcnode, dbnode } => { - format!("pg_filenodemap_{}_{}", spcnode, dbnode) - } - RelishTag::TwoPhase { xid } => format!("pg_twophase_{}", xid), - RelishTag::Checkpoint => "pg_control_checkpoint".to_string(), - RelishTag::ControlFile => "pg_control".to_string(), - }; - write!( f, - "{}_{}_{:016X}", - basename, - self.seg.segno, + "{}-{}__{:016X}", + self.key_range.start, + self.key_range.end, u64::from(self.lsn), ) } diff --git a/pageserver/src/layered_repository/global_layer_map.rs b/pageserver/src/layered_repository/global_layer_map.rs deleted file mode 100644 index 169a89650a..0000000000 --- a/pageserver/src/layered_repository/global_layer_map.rs +++ /dev/null @@ -1,142 +0,0 @@ -//! -//! Global registry of open layers. -//! -//! Whenever a new in-memory layer is created to hold incoming WAL, it is registered -//! in [`GLOBAL_LAYER_MAP`], so that we can keep track of the total number of -//! in-memory layers in the system, and know when we need to evict some to release -//! memory. 
-//! -//! Each layer is assigned a unique ID when it's registered in the global registry. -//! The ID can be used to relocate the layer later, without having to hold locks. -//! - -use std::sync::atomic::{AtomicU8, Ordering}; -use std::sync::{Arc, RwLock}; - -use super::inmemory_layer::InMemoryLayer; - -use lazy_static::lazy_static; - -const MAX_USAGE_COUNT: u8 = 5; - -lazy_static! { - pub static ref GLOBAL_LAYER_MAP: RwLock = - RwLock::new(InMemoryLayers::default()); -} - -// TODO these types can probably be smaller -#[derive(PartialEq, Eq, Clone, Copy)] -pub struct LayerId { - index: usize, - tag: u64, // to avoid ABA problem -} - -enum SlotData { - Occupied(Arc), - /// Vacant slots form a linked list, the value is the index - /// of the next vacant slot in the list. - Vacant(Option), -} - -struct Slot { - tag: u64, - data: SlotData, - usage_count: AtomicU8, // for clock algorithm -} - -#[derive(Default)] -pub struct InMemoryLayers { - slots: Vec, - num_occupied: usize, - - // Head of free-slot list. 
- next_empty_slot_idx: Option, -} - -impl InMemoryLayers { - pub fn insert(&mut self, layer: Arc) -> LayerId { - let slot_idx = match self.next_empty_slot_idx { - Some(slot_idx) => slot_idx, - None => { - let idx = self.slots.len(); - self.slots.push(Slot { - tag: 0, - data: SlotData::Vacant(None), - usage_count: AtomicU8::new(0), - }); - idx - } - }; - let slots_len = self.slots.len(); - - let slot = &mut self.slots[slot_idx]; - - match slot.data { - SlotData::Occupied(_) => { - panic!("an occupied slot was in the free list"); - } - SlotData::Vacant(next_empty_slot_idx) => { - self.next_empty_slot_idx = next_empty_slot_idx; - } - } - - slot.data = SlotData::Occupied(layer); - slot.usage_count.store(1, Ordering::Relaxed); - - self.num_occupied += 1; - assert!(self.num_occupied <= slots_len); - - LayerId { - index: slot_idx, - tag: slot.tag, - } - } - - pub fn get(&self, layer_id: &LayerId) -> Option> { - let slot = self.slots.get(layer_id.index)?; // TODO should out of bounds indexes just panic? 
- if slot.tag != layer_id.tag { - return None; - } - - if let SlotData::Occupied(layer) = &slot.data { - let _ = slot.usage_count.fetch_update( - Ordering::Relaxed, - Ordering::Relaxed, - |old_usage_count| { - if old_usage_count < MAX_USAGE_COUNT { - Some(old_usage_count + 1) - } else { - None - } - }, - ); - Some(Arc::clone(layer)) - } else { - None - } - } - - // TODO this won't be a public API in the future - pub fn remove(&mut self, layer_id: &LayerId) { - let slot = &mut self.slots[layer_id.index]; - - if slot.tag != layer_id.tag { - return; - } - - match &slot.data { - SlotData::Occupied(_layer) => { - // TODO evict the layer - } - SlotData::Vacant(_) => unimplemented!(), - } - - slot.data = SlotData::Vacant(self.next_empty_slot_idx); - self.next_empty_slot_idx = Some(layer_id.index); - - assert!(self.num_occupied > 0); - self.num_occupied -= 1; - - slot.tag = slot.tag.wrapping_add(1); - } -} diff --git a/pageserver/src/layered_repository/image_layer.rs b/pageserver/src/layered_repository/image_layer.rs index 5b8ec46452..ab51c36cae 100644 --- a/pageserver/src/layered_repository/image_layer.rs +++ b/pageserver/src/layered_repository/image_layer.rs @@ -1,55 +1,54 @@ -//! An ImageLayer represents an image or a snapshot of a segment at one particular LSN. -//! It is stored in a file on disk. +//! An ImageLayer represents an image or a snapshot of a key-range at +//! one particular LSN. It contains an image of all key-value pairs +//! in its key-range. Any key that falls into the image layer's range +//! but does not exist in the layer, does not exist. //! -//! On disk, the image files are stored in timelines/ directory. -//! Currently, there are no subdirectories, and each image layer file is named like this: +//! An image layer is stored in a file on disk. The file is stored in +//! timelines/ directory. Currently, there are no +//! subdirectories, and each image layer file is named like this: //! -//! Note that segno is -//! _____ +//! -__ //! //! 
For example: //! -//! 1663_13990_2609_0_5_000000000169C348 +//! 000000067F000032BE0000400000000070B6-000000067F000032BE0000400000000080B6__00000000346BC568 //! //! An image file is constructed using the 'bookfile' crate. //! //! Only metadata is loaded into memory by the load function. //! When images are needed, they are read directly from disk. //! -//! For blocky relishes, the images are stored in BLOCKY_IMAGES_CHAPTER. -//! All the images are required to be BLOCK_SIZE, which allows for random access. -//! -//! For non-blocky relishes, the image can be found in NONBLOCKY_IMAGE_CHAPTER. -//! use crate::config::PageServerConf; use crate::layered_repository::filename::{ImageFileName, PathOrConf}; use crate::layered_repository::storage_layer::{ - Layer, PageReconstructData, PageReconstructResult, SegmentBlk, SegmentTag, + BlobRef, Layer, ValueReconstructResult, ValueReconstructState, }; -use crate::layered_repository::RELISH_SEG_SIZE; +use crate::repository::{Key, Value}; use crate::virtual_file::VirtualFile; +use crate::IMAGE_FILE_MAGIC; use crate::{ZTenantId, ZTimelineId}; -use anyhow::{anyhow, bail, ensure, Context, Result}; +use anyhow::{bail, ensure, Context, Result}; use bytes::Bytes; use log::*; use serde::{Deserialize, Serialize}; -use std::convert::TryInto; +use std::collections::HashMap; use std::fs; use std::io::{BufWriter, Write}; +use std::ops::Range; use std::path::{Path, PathBuf}; -use std::sync::{RwLock, RwLockReadGuard}; +use std::sync::{RwLock, RwLockReadGuard, TryLockError}; use bookfile::{Book, BookWriter, ChapterWriter}; use zenith_utils::bin_ser::BeSer; use zenith_utils::lsn::Lsn; -// Magic constant to identify a Zenith segment image file -pub const IMAGE_FILE_MAGIC: u32 = 0x5A616E01 + 1; +/// Mapping from (key, lsn) -> page/WAL record +/// byte ranges in VALUES_CHAPTER +static INDEX_CHAPTER: u64 = 1; /// Contains each block in block # order -const BLOCKY_IMAGES_CHAPTER: u64 = 1; -const NONBLOCKY_IMAGE_CHAPTER: u64 = 2; +const VALUES_CHAPTER: 
u64 = 2; /// Contains the [`Summary`] struct const SUMMARY_CHAPTER: u64 = 3; @@ -58,7 +57,7 @@ const SUMMARY_CHAPTER: u64 = 3; struct Summary { tenantid: ZTenantId, timelineid: ZTimelineId, - seg: SegmentTag, + key_range: Range, lsn: Lsn, } @@ -68,19 +67,17 @@ impl From<&ImageLayer> for Summary { Self { tenantid: layer.tenantid, timelineid: layer.timelineid, - seg: layer.seg, + key_range: layer.key_range.clone(), lsn: layer.lsn, } } } -const BLOCK_SIZE: usize = 8192; - /// /// ImageLayer is the in-memory data structure associated with an on-disk image /// file. We keep an ImageLayer in memory for each file, in the LayerMap. If a -/// layer is in "loaded" state, we have a copy of the file in memory, in 'inner'. +/// layer is in "loaded" state, we have a copy of the index in memory, in 'inner'. /// Otherwise the struct is just a placeholder for a file that exists on disk, /// and it needs to be loaded before using it in queries. /// @@ -88,7 +85,7 @@ pub struct ImageLayer { path_or_conf: PathOrConf, pub tenantid: ZTenantId, pub timelineid: ZTimelineId, - pub seg: SegmentTag, + pub key_range: Range, // This entry contains an image of all pages as of this LSN pub lsn: Lsn, @@ -96,18 +93,16 @@ pub struct ImageLayer { inner: RwLock, } -#[derive(Clone)] -enum ImageType { - Blocky { num_blocks: SegmentBlk }, - NonBlocky, -} - pub struct ImageLayerInner { - /// If None, the 'image_type' has not been loaded into memory yet. + /// If false, the 'index' has not been loaded into memory yet. + loaded: bool, + + /// The underlying (virtual) file handle. None if the layer hasn't been loaded + /// yet. 
book: Option>, - /// Derived from filename and bookfile chapter metadata - image_type: ImageType, + /// offset of each value + index: HashMap, } impl Layer for ImageLayer { @@ -123,98 +118,82 @@ impl Layer for ImageLayer { self.timelineid } - fn get_seg_tag(&self) -> SegmentTag { - self.seg + fn get_key_range(&self) -> Range { + self.key_range.clone() } - fn is_dropped(&self) -> bool { - false - } - - fn get_start_lsn(&self) -> Lsn { - self.lsn - } - - fn get_end_lsn(&self) -> Lsn { + fn get_lsn_range(&self) -> Range { // End-bound is exclusive - self.lsn + 1 + self.lsn..(self.lsn + 1) } /// Look up given page in the file - fn get_page_reconstruct_data( + fn get_value_reconstruct_data( &self, - blknum: SegmentBlk, - lsn: Lsn, - reconstruct_data: &mut PageReconstructData, - ) -> anyhow::Result { - ensure!((0..RELISH_SEG_SIZE).contains(&blknum)); - ensure!(lsn >= self.lsn); - - match reconstruct_data.page_img { - Some((cached_lsn, _)) if self.lsn <= cached_lsn => { - return Ok(PageReconstructResult::Complete) - } - _ => {} - } + key: Key, + lsn_range: Range, + reconstruct_state: &mut ValueReconstructState, + ) -> anyhow::Result { + assert!(self.key_range.contains(&key)); + assert!(lsn_range.end >= self.lsn); let inner = self.load()?; - let buf = match &inner.image_type { - ImageType::Blocky { num_blocks } => { - // Check if the request is beyond EOF - if blknum >= *num_blocks { - return Ok(PageReconstructResult::Missing(lsn)); - } + if let Some(blob_ref) = inner.index.get(&key) { + let chapter = inner + .book + .as_ref() + .unwrap() + .chapter_reader(VALUES_CHAPTER)?; - let mut buf = vec![0u8; BLOCK_SIZE]; - let offset = BLOCK_SIZE as u64 * blknum as u64; - - let chapter = inner - .book - .as_ref() - .unwrap() - .chapter_reader(BLOCKY_IMAGES_CHAPTER)?; - - chapter.read_exact_at(&mut buf, offset).with_context(|| { + let mut blob = vec![0; blob_ref.size()]; + chapter + .read_exact_at(&mut blob, blob_ref.pos()) + .with_context(|| { format!( - "failed to read page from 
data file {} at offset {}", + "failed to read {} bytes from data file {} at offset {}", + blob_ref.size(), self.filename().display(), - offset + blob_ref.pos() ) })?; + let value = Bytes::from(blob); - buf - } - ImageType::NonBlocky => { - ensure!(blknum == 0); - inner - .book - .as_ref() - .unwrap() - .read_chapter(NONBLOCKY_IMAGE_CHAPTER)? - .into_vec() - } - }; - - reconstruct_data.page_img = Some((self.lsn, Bytes::from(buf))); - Ok(PageReconstructResult::Complete) - } - - /// Get size of the segment - fn get_seg_size(&self, _lsn: Lsn) -> Result { - let inner = self.load()?; - match inner.image_type { - ImageType::Blocky { num_blocks } => Ok(num_blocks), - ImageType::NonBlocky => Err(anyhow!("get_seg_size called for non-blocky segment")), + reconstruct_state.img = Some((self.lsn, value)); + Ok(ValueReconstructResult::Complete) + } else { + Ok(ValueReconstructResult::Missing) } } - /// Does this segment exist at given LSN? - fn get_seg_exists(&self, _lsn: Lsn) -> Result { - Ok(true) + fn iter(&self) -> Box>> { + todo!(); } fn unload(&self) -> Result<()> { + // Unload the index. + // + // TODO: we should access the index directly from pages on the disk, + // using the buffer cache. This load/unload mechanism is really ad hoc. + + // FIXME: In debug mode, loading and unloading the index slows + // things down so much that you get timeout errors. At least + // with the test_parallel_copy test. So as an even more ad hoc + // stopgap fix for that, only unload every on average 10 + // checkpoint cycles. 
+ use rand::RngCore; + if rand::thread_rng().next_u32() > (u32::MAX / 10) { + return Ok(()); + } + + let mut inner = match self.inner.try_write() { + Ok(inner) => inner, + Err(TryLockError::WouldBlock) => return Ok(()), + Err(TryLockError::Poisoned(_)) => panic!("ImageLayer lock was poisoned"), + }; + inner.index = HashMap::default(); + inner.loaded = false; + Ok(()) } @@ -235,22 +214,22 @@ impl Layer for ImageLayer { /// debugging function to print out the contents of the layer fn dump(&self) -> Result<()> { println!( - "----- image layer for ten {} tli {} seg {} at {} ----", - self.tenantid, self.timelineid, self.seg, self.lsn + "----- image layer for ten {} tli {} key {}-{} at {} ----", + self.tenantid, self.timelineid, self.key_range.start, self.key_range.end, self.lsn ); let inner = self.load()?; - match inner.image_type { - ImageType::Blocky { num_blocks } => println!("({}) blocks ", num_blocks), - ImageType::NonBlocky => { - let chapter = inner - .book - .as_ref() - .unwrap() - .read_chapter(NONBLOCKY_IMAGE_CHAPTER)?; - println!("non-blocky ({} bytes)", chapter.len()); - } + let mut index_vec: Vec<(&Key, &BlobRef)> = inner.index.iter().collect(); + index_vec.sort_by_key(|x| x.1.pos()); + + for (key, blob_ref) in index_vec { + println!( + "key: {} size {} offset {}", + key, + blob_ref.size(), + blob_ref.pos() + ); } Ok(()) @@ -280,7 +259,7 @@ impl ImageLayer { loop { // Quick exit if already loaded let inner = self.inner.read().unwrap(); - if inner.book.is_some() { + if inner.loaded { return Ok(inner); } @@ -306,14 +285,16 @@ impl ImageLayer { fn load_inner(&self, inner: &mut ImageLayerInner) -> Result<()> { let path = self.path(); - let file = VirtualFile::open(&path) - .with_context(|| format!("Failed to open virtual file '{}'", path.display()))?; - let book = Book::new(file).with_context(|| { - format!( - "Failed to open virtual file '{}' as a bookfile", - path.display() - ) - })?; + + // Open the file if it's not open already. 
+ if inner.book.is_none() { + let file = VirtualFile::open(&path) + .with_context(|| format!("Failed to open file '{}'", path.display()))?; + inner.book = Some(Book::new(file).with_context(|| { + format!("Failed to open file '{}' as a bookfile", path.display()) + })?); + } + let book = inner.book.as_ref().unwrap(); match &self.path_or_conf { PathOrConf::Conf(_) => { @@ -340,23 +321,13 @@ impl ImageLayer { } } - let image_type = if self.seg.rel.is_blocky() { - let chapter = book.chapter_reader(BLOCKY_IMAGES_CHAPTER)?; - let images_len = chapter.len(); - ensure!(images_len % BLOCK_SIZE as u64 == 0); - let num_blocks: SegmentBlk = (images_len / BLOCK_SIZE as u64).try_into()?; - ImageType::Blocky { num_blocks } - } else { - let _chapter = book.chapter_reader(NONBLOCKY_IMAGE_CHAPTER)?; - ImageType::NonBlocky - }; + let chapter = book.read_chapter(INDEX_CHAPTER)?; + let index = HashMap::des(&chapter)?; - debug!("loaded from {}", &path.display()); + info!("loaded from {}", &path.display()); - *inner = ImageLayerInner { - book: Some(book), - image_type, - }; + inner.index = index; + inner.loaded = true; Ok(()) } @@ -372,11 +343,12 @@ impl ImageLayer { path_or_conf: PathOrConf::Conf(conf), timelineid, tenantid, - seg: filename.seg, + key_range: filename.key_range.clone(), lsn: filename.lsn, inner: RwLock::new(ImageLayerInner { book: None, - image_type: ImageType::Blocky { num_blocks: 0 }, + index: HashMap::new(), + loaded: false, }), } } @@ -395,18 +367,19 @@ impl ImageLayer { path_or_conf: PathOrConf::Path(path.to_path_buf()), timelineid: summary.timelineid, tenantid: summary.tenantid, - seg: summary.seg, + key_range: summary.key_range, lsn: summary.lsn, inner: RwLock::new(ImageLayerInner { book: None, - image_type: ImageType::Blocky { num_blocks: 0 }, + index: HashMap::new(), + loaded: false, }), }) } fn layer_name(&self) -> ImageFileName { ImageFileName { - seg: self.seg, + key_range: self.key_range.clone(), lsn: self.lsn, } } @@ -435,15 +408,18 @@ impl ImageLayer { /// 
pub struct ImageLayerWriter { conf: &'static PageServerConf, + path: PathBuf, timelineid: ZTimelineId, tenantid: ZTenantId, - seg: SegmentTag, + key_range: Range, lsn: Lsn, - num_blocks: SegmentBlk, + values_writer: Option>>, + end_offset: u64, - page_image_writer: ChapterWriter>, - num_blocks_written: SegmentBlk, + index: HashMap, + + finished: bool, } impl ImageLayerWriter { @@ -451,9 +427,8 @@ impl ImageLayerWriter { conf: &'static PageServerConf, timelineid: ZTimelineId, tenantid: ZTenantId, - seg: SegmentTag, + key_range: &Range, lsn: Lsn, - num_blocks: SegmentBlk, ) -> anyhow::Result { // Create the file // @@ -463,70 +438,75 @@ impl ImageLayerWriter { &PathOrConf::Conf(conf), timelineid, tenantid, - &ImageFileName { seg, lsn }, + &ImageFileName { + key_range: key_range.clone(), + lsn, + }, ); + info!("new image layer {}", path.display()); let file = VirtualFile::create(&path)?; let buf_writer = BufWriter::new(file); let book = BookWriter::new(buf_writer, IMAGE_FILE_MAGIC)?; // Open the page-images chapter for writing. The calls to - // `put_page_image` will use this to write the contents. - let chapter = if seg.rel.is_blocky() { - book.new_chapter(BLOCKY_IMAGES_CHAPTER) - } else { - ensure!(num_blocks == 1); - book.new_chapter(NONBLOCKY_IMAGE_CHAPTER) - }; + // `put_image` will use this to write the contents. + let chapter = book.new_chapter(VALUES_CHAPTER); let writer = ImageLayerWriter { conf, + path, timelineid, tenantid, - seg, + key_range: key_range.clone(), lsn, - num_blocks, - page_image_writer: chapter, - num_blocks_written: 0, + values_writer: Some(chapter), + index: HashMap::new(), + end_offset: 0, + finished: false, }; Ok(writer) } /// - /// Write next page image to the file. + /// Write next value to the file. /// /// The page versions must be appended in blknum order. 
/// - pub fn put_page_image(&mut self, block_bytes: &[u8]) -> anyhow::Result<()> { - ensure!(self.num_blocks_written < self.num_blocks); - if self.seg.rel.is_blocky() { - ensure!(block_bytes.len() == BLOCK_SIZE); + pub fn put_image(&mut self, key: Key, img: &[u8]) -> Result<()> { + ensure!(self.key_range.contains(&key)); + let off = self.end_offset; + + if let Some(writer) = &mut self.values_writer { + let len = img.len(); + writer.write_all(img)?; + self.end_offset += len as u64; + + let old = self.index.insert(key, BlobRef::new(off, len, true)); + assert!(old.is_none()); + } else { + panic!() } - self.page_image_writer.write_all(block_bytes)?; - self.num_blocks_written += 1; + Ok(()) } - pub fn finish(self) -> anyhow::Result { - // Check that the `put_page_image' was called for every block. - ensure!(self.num_blocks_written == self.num_blocks); + pub fn finish(&mut self) -> anyhow::Result { + // Close the values chapter + let book = self.values_writer.take().unwrap().close()?; - // Close the page-images chapter - let book = self.page_image_writer.close()?; + // Write out the index + let mut chapter = book.new_chapter(INDEX_CHAPTER); + let buf = HashMap::ser(&self.index)?; + chapter.write_all(&buf)?; + let book = chapter.close()?; // Write out the summary chapter - let image_type = if self.seg.rel.is_blocky() { - ImageType::Blocky { - num_blocks: self.num_blocks, - } - } else { - ImageType::NonBlocky - }; let mut chapter = book.new_chapter(SUMMARY_CHAPTER); let summary = Summary { tenantid: self.tenantid, timelineid: self.timelineid, - seg: self.seg, + key_range: self.key_range.clone(), lsn: self.lsn, }; Summary::ser_into(&summary, &mut chapter)?; @@ -542,15 +522,31 @@ impl ImageLayerWriter { path_or_conf: PathOrConf::Conf(self.conf), timelineid: self.timelineid, tenantid: self.tenantid, - seg: self.seg, + key_range: self.key_range.clone(), lsn: self.lsn, inner: RwLock::new(ImageLayerInner { book: None, - image_type, + loaded: false, + index: HashMap::new(), }), 
}; trace!("created image layer {}", layer.path().display()); + self.finished = true; + Ok(layer) } } + +impl Drop for ImageLayerWriter { + fn drop(&mut self) { + if let Some(page_image_writer) = self.values_writer.take() { + if let Ok(book) = page_image_writer.close() { + let _ = book.close(); + } + } + if !self.finished { + let _ = fs::remove_file(&self.path); + } + } +} diff --git a/pageserver/src/layered_repository/inmemory_layer.rs b/pageserver/src/layered_repository/inmemory_layer.rs index fed1fb6469..b5d98a4ca3 100644 --- a/pageserver/src/layered_repository/inmemory_layer.rs +++ b/pageserver/src/layered_repository/inmemory_layer.rs @@ -1,30 +1,29 @@ -//! An in-memory layer stores recently received PageVersions. -//! The page versions are held in a BTreeMap. To avoid OOM errors, the map size is limited -//! and layers can be spilled to disk into ephemeral files. +//! An in-memory layer stores recently received key-value pairs. //! -//! And there's another BTreeMap to track the size of the relation. +//! The "in-memory" part of the name is a bit misleading: the actual page versions are +//! held in an ephemeral file, not in memory. The metadata for each page version, i.e. +//! its position in the file, is kept in memory, though. //! 
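The module comment above says the values live in an ephemeral file while only their positions are kept in memory. A minimal sketch of that layout follows. It assumes a plain `Vec<u8>` in place of `EphemeralFile`, a hypothetical `BlobPos` struct in place of `BlobRef`, and bare `u64` keys and LSNs; none of these are the pageserver's real types.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical stand-in for `BlobRef`: where one serialized value lives.
#[derive(Clone, Copy, Debug, PartialEq)]
struct BlobPos {
    off: u64,
    len: usize,
}

/// Illustrative layer: values are appended to a byte buffer (standing in for
/// the ephemeral file); only their positions are indexed in memory.
#[derive(Default)]
struct SketchLayer {
    file: Vec<u8>,
    end_offset: u64,
    // key -> (lsn -> position of the value in `file`)
    index: HashMap<u64, BTreeMap<u64, BlobPos>>,
}

impl SketchLayer {
    /// Append `val` and remember its position under (key, lsn).
    /// Returns the previous position if that (key, lsn) already existed.
    fn put_value(&mut self, key: u64, lsn: u64, val: &[u8]) -> Option<BlobPos> {
        let pos = BlobPos { off: self.end_offset, len: val.len() };
        self.file.extend_from_slice(val);
        self.end_offset += val.len() as u64;
        self.index.entry(key).or_default().insert(lsn, pos)
    }

    /// Read back the newest version of `key` with an LSN strictly below `end_lsn`.
    fn get_latest(&self, key: u64, end_lsn: u64) -> Option<(u64, &[u8])> {
        let versions = self.index.get(&key)?;
        let (lsn, pos) = versions.range(..end_lsn).next_back()?;
        let start = pos.off as usize;
        Some((*lsn, &self.file[start..start + pos.len]))
    }
}
```

The sketch keeps a running `end_offset` the same way the patch does, so a value's position is known before it is written and no seek is needed.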
use crate::config::PageServerConf; use crate::layered_repository::delta_layer::{DeltaLayer, DeltaLayerWriter}; use crate::layered_repository::ephemeral_file::EphemeralFile; -use crate::layered_repository::filename::DeltaFileName; -use crate::layered_repository::image_layer::{ImageLayer, ImageLayerWriter}; use crate::layered_repository::storage_layer::{ - Layer, PageReconstructData, PageReconstructResult, PageVersion, SegmentBlk, SegmentTag, - RELISH_SEG_SIZE, + BlobRef, Layer, ValueReconstructResult, ValueReconstructState, }; -use crate::layered_repository::LayeredTimeline; -use crate::layered_repository::ZERO_PAGE; -use crate::repository::ZenithWalRecord; +use crate::repository::{Key, Value}; +use crate::walrecord; use crate::{ZTenantId, ZTimelineId}; use anyhow::{bail, ensure, Result}; -use bytes::Bytes; use log::*; use std::collections::HashMap; -use std::io::Seek; +// avoid binding to Write (conflicts with std::io::Write) +// while being able to use std::fmt::Write's methods +use std::fmt::Write as _; +use std::io::Write; +use std::ops::Range; use std::os::unix::fs::FileExt; use std::path::PathBuf; -use std::sync::{Arc, RwLock}; +use std::sync::RwLock; use zenith_utils::bin_ser::BeSer; use zenith_utils::lsn::Lsn; use zenith_utils::vec_map::VecMap; @@ -33,7 +32,6 @@ pub struct InMemoryLayer { conf: &'static PageServerConf, tenantid: ZTenantId, timelineid: ZTimelineId, - seg: SegmentTag, /// /// This layer contains all the changes from 'start_lsn'. The @@ -41,27 +39,9 @@ pub struct InMemoryLayer { /// start_lsn: Lsn, - /// - /// LSN of the oldest page version stored in this layer. - /// - /// This is different from 'start_lsn' in that we enforce that the 'start_lsn' - /// of a layer always matches the 'end_lsn' of its predecessor, even if there - /// are no page versions until at a later LSN. That way you can detect any - /// missing layer files more easily. 'oldest_lsn' is the first page version - /// actually stored in this layer. 
In the range between 'start_lsn' and - /// 'oldest_lsn', there are no changes to the segment. - /// 'oldest_lsn' is used to adjust 'disk_consistent_lsn' and that is why it should - /// point to the beginning of WAL record. This is the other difference with 'start_lsn' - /// which points to end of WAL record. This is why 'oldest_lsn' can be smaller than 'start_lsn'. - /// - oldest_lsn: Lsn, - /// The above fields never change. The parts that do change are in 'inner', /// and protected by mutex. inner: RwLock, - - /// Predecessor layer might be needed? - incremental: bool, } pub struct InMemoryLayerInner { @@ -69,98 +49,25 @@ pub struct InMemoryLayerInner { /// Writes are only allowed when this is None end_lsn: Option, - /// If this relation was dropped, remember when that happened. - /// The drop LSN is recorded in [`end_lsn`]. - dropped: bool, + /// + /// All versions of all pages in the layer are kept here. Indexed + /// by block number and LSN. The value is an offset into the + /// ephemeral file where the page version is stored. + /// + index: HashMap>, - /// The PageVersion structs are stored in a serialized format in this file. - /// Each serialized PageVersion is preceded by a 'u32' length field. - /// 'page_versions' map stores offsets into this file. + /// The values are stored in a serialized format in this file. + /// Each serialized Value is preceded by a 'u32' length field. + /// PerSeg::page_versions map stores offsets into this file. file: EphemeralFile, - /// Metadata about all versions of all pages in the layer is kept - /// here. Indexed by block number and LSN. The value is an offset - /// into the ephemeral file where the page version is stored. - page_versions: HashMap>, - - /// - /// `seg_sizes` tracks the size of the segment at different points in time. - /// - /// For a blocky rel, there is always one entry, at the layer's start_lsn, - /// so that determining the size never depends on the predecessor layer. 
For - /// a non-blocky rel, 'seg_sizes' is not used and is always empty. - /// - seg_sizes: VecMap, - - /// - /// LSN of the newest page version stored in this layer. - /// - /// The difference between 'end_lsn' and 'latest_lsn' is the same as between - /// 'start_lsn' and 'oldest_lsn'. See comments in 'oldest_lsn'. - /// - latest_lsn: Lsn, + end_offset: u64, } impl InMemoryLayerInner { fn assert_writeable(&self) { assert!(self.end_lsn.is_none()); } - - fn get_seg_size(&self, lsn: Lsn) -> SegmentBlk { - // Scan the BTreeMap backwards, starting from the given entry. - let slice = self.seg_sizes.slice_range(..=lsn); - - // We make sure there is always at least one entry - if let Some((_entry_lsn, entry)) = slice.last() { - *entry - } else { - panic!("could not find seg size in in-memory layer"); - } - } - - /// - /// Read a page version from the ephemeral file. - /// - fn read_pv(&self, off: u64) -> Result { - let mut buf = Vec::new(); - self.read_pv_bytes(off, &mut buf)?; - Ok(PageVersion::des(&buf)?) - } - - /// - /// Read a page version from the ephemeral file, as raw bytes, at - /// the given offset. The bytes are read into 'buf', which is - /// expanded if necessary. Returns the size of the page version. - /// - fn read_pv_bytes(&self, off: u64, buf: &mut Vec) -> Result { - // read length - let mut lenbuf = [0u8; 4]; - self.file.read_exact_at(&mut lenbuf, off)?; - let len = u32::from_ne_bytes(lenbuf) as usize; - - if buf.len() < len { - buf.resize(len, 0); - } - self.file.read_exact_at(&mut buf[0..len], off + 4)?; - Ok(len) - } - - fn write_pv(&mut self, pv: &PageVersion) -> Result { - // remember starting position - let pos = self.file.stream_position()?; - - // make room for the 'length' field by writing zeros as a placeholder. - self.file.seek(std::io::SeekFrom::Start(pos + 4))?; - - pv.ser_into(&mut self.file)?; - - // write the 'length' field. - let len = self.file.stream_position()? 
- pos - 4; - let lenbuf = u32::to_ne_bytes(len as u32); - self.file.write_all_at(&lenbuf, pos)?; - - Ok(pos) - } } impl Layer for InMemoryLayer { @@ -170,21 +77,12 @@ impl Layer for InMemoryLayer { fn filename(&self) -> PathBuf { let inner = self.inner.read().unwrap(); - let end_lsn = if let Some(drop_lsn) = inner.end_lsn { - drop_lsn - } else { - Lsn(u64::MAX) - }; + let end_lsn = inner.end_lsn.unwrap_or(Lsn(u64::MAX)); - let delta_filename = DeltaFileName { - seg: self.seg, - start_lsn: self.start_lsn, - end_lsn, - dropped: inner.dropped, - } - .to_string(); - - PathBuf::from(format!("inmem-{}", delta_filename)) + PathBuf::from(format!( + "inmem-{:016X}-{:016X}", + self.start_lsn.0, end_lsn.0 + )) } fn get_tenant_id(&self) -> ZTenantId { @@ -195,132 +93,78 @@ impl Layer for InMemoryLayer { self.timelineid } - fn get_seg_tag(&self) -> SegmentTag { - self.seg + fn get_key_range(&self) -> Range { + Key::MIN..Key::MAX } - fn get_start_lsn(&self) -> Lsn { - self.start_lsn - } - - fn get_end_lsn(&self) -> Lsn { + fn get_lsn_range(&self) -> Range { let inner = self.inner.read().unwrap(); - if let Some(end_lsn) = inner.end_lsn { + let end_lsn = if let Some(end_lsn) = inner.end_lsn { end_lsn } else { Lsn(u64::MAX) - } + }; + self.start_lsn..end_lsn } - fn is_dropped(&self) -> bool { - let inner = self.inner.read().unwrap(); - inner.dropped - } - - /// Look up given page in the cache. - fn get_page_reconstruct_data( + /// Look up given value in the layer. + fn get_value_reconstruct_data( &self, - blknum: SegmentBlk, - lsn: Lsn, - reconstruct_data: &mut PageReconstructData, - ) -> anyhow::Result { + key: Key, + lsn_range: Range, + reconstruct_state: &mut ValueReconstructState, + ) -> anyhow::Result { + ensure!(lsn_range.start <= self.start_lsn); let mut need_image = true; - ensure!((0..RELISH_SEG_SIZE).contains(&blknum)); + let inner = self.inner.read().unwrap(); - { - let inner = self.inner.read().unwrap(); - - // Scan the page versions backwards, starting from `lsn`. 
- if let Some(vec_map) = inner.page_versions.get(&blknum) { - let slice = vec_map.slice_range(..=lsn); - for (entry_lsn, pos) in slice.iter().rev() { - match &reconstruct_data.page_img { - Some((cached_lsn, _)) if entry_lsn <= cached_lsn => { - return Ok(PageReconstructResult::Complete) - } - _ => {} + // Scan the page versions backwards, starting from `lsn`. + if let Some(vec_map) = inner.index.get(&key) { + let slice = vec_map.slice_range(lsn_range); + for (entry_lsn, blob_ref) in slice.iter().rev() { + match &reconstruct_state.img { + Some((cached_lsn, _)) if entry_lsn <= cached_lsn => { + return Ok(ValueReconstructResult::Complete) } + _ => {} + } - let pv = inner.read_pv(*pos)?; - match pv { - PageVersion::Page(img) => { - reconstruct_data.page_img = Some((*entry_lsn, img)); + let mut buf = vec![0u8; blob_ref.size()]; + inner.file.read_exact_at(&mut buf, blob_ref.pos())?; + let value = Value::des(&buf)?; + match value { + Value::Image(img) => { + reconstruct_state.img = Some((*entry_lsn, img)); + return Ok(ValueReconstructResult::Complete); + } + Value::WalRecord(rec) => { + let will_init = rec.will_init(); + reconstruct_state.records.push((*entry_lsn, rec)); + if will_init { + // This WAL record initializes the page, so no need to go further back need_image = false; break; } - PageVersion::Wal(rec) => { - reconstruct_data.records.push((*entry_lsn, rec.clone())); - if rec.will_init() { - // This WAL record initializes the page, so no need to go further back - need_image = false; - break; - } - } } } } - - // If we didn't find any records for this, check if the request is beyond EOF - if need_image - && reconstruct_data.records.is_empty() - && self.seg.rel.is_blocky() - && blknum >= self.get_seg_size(lsn)? - { - return Ok(PageReconstructResult::Missing(self.start_lsn)); - } - - // release lock on 'inner' } + // release lock on 'inner' + // If an older page image is needed to reconstruct the page, let the - // caller know + // caller know. 
if need_image { - if self.incremental { - Ok(PageReconstructResult::Continue(Lsn(self.start_lsn.0 - 1))) - } else { - Ok(PageReconstructResult::Missing(self.start_lsn)) - } + Ok(ValueReconstructResult::Continue) } else { - Ok(PageReconstructResult::Complete) + Ok(ValueReconstructResult::Complete) } } - /// Get size of the relation at given LSN - fn get_seg_size(&self, lsn: Lsn) -> anyhow::Result { - ensure!(lsn >= self.start_lsn); - ensure!( - self.seg.rel.is_blocky(), - "get_seg_size() called on a non-blocky rel" - ); - - let inner = self.inner.read().unwrap(); - Ok(inner.get_seg_size(lsn)) - } - - /// Does this segment exist at given LSN? - fn get_seg_exists(&self, lsn: Lsn) -> anyhow::Result { - let inner = self.inner.read().unwrap(); - - // If the segment created after requested LSN, - // it doesn't exist in the layer. But we shouldn't - // have requested it in the first place. - ensure!(lsn >= self.start_lsn); - - // Is the requested LSN after the segment was dropped? - if inner.dropped { - if let Some(end_lsn) = inner.end_lsn { - if lsn >= end_lsn { - return Ok(false); - } - } else { - bail!("dropped in-memory layer with no end LSN"); - } - } - - // Otherwise, it exists - Ok(true) + fn iter(&self) -> Box>> { + todo!(); } /// Cannot unload anything in an in-memory layer, since there's no backing @@ -337,7 +181,8 @@ impl Layer for InMemoryLayer { } fn is_incremental(&self) -> bool { - self.incremental + // in-memory layer is always considered incremental. 
+ true } fn is_in_memory(&self) -> bool { @@ -355,29 +200,36 @@ impl Layer for InMemoryLayer { .unwrap_or_default(); println!( - "----- in-memory layer for tli {} seg {} {}-{} {} ----", - self.timelineid, self.seg, self.start_lsn, end_str, inner.dropped, + "----- in-memory layer for tli {} LSNs {}-{} ----", + self.timelineid, self.start_lsn, end_str, ); - for (k, v) in inner.seg_sizes.as_slice() { - println!("seg_sizes {}: {}", k, v); - } - - // List the blocks in order - let mut page_versions: Vec<(&SegmentBlk, &VecMap)> = - inner.page_versions.iter().collect(); - page_versions.sort_by_key(|k| k.0); - - for (blknum, versions) in page_versions { - for (lsn, off) in versions.as_slice() { - let pv = inner.read_pv(*off); - let pv_description = match pv { - Ok(PageVersion::Page(_img)) => "page", - Ok(PageVersion::Wal(_rec)) => "wal", - Err(_err) => "INVALID", - }; - - println!("blk {} at {}: {}\n", blknum, lsn, pv_description); + let mut buf = Vec::new(); + for (key, vec_map) in inner.index.iter() { + for (lsn, blob_ref) in vec_map.as_slice() { + let mut desc = String::new(); + buf.resize(blob_ref.size(), 0); + inner.file.read_exact_at(&mut buf, blob_ref.pos())?; + let val = Value::des(&buf); + match val { + Ok(Value::Image(img)) => { + write!(&mut desc, " img {} bytes", img.len())?; + } + Ok(Value::WalRecord(rec)) => { + let wal_desc = walrecord::describe_wal_record(&rec); + write!( + &mut desc, + " rec {} bytes will_init: {} {}", + buf.len(), + rec.will_init(), + wal_desc + )?; + } + Err(err) => { + write!(&mut desc, " DESERIALIZATION ERROR: {}", err)?; + } + } + println!(" key {} at {}: {}", key, lsn, desc); } } @@ -385,23 +237,7 @@ impl Layer for InMemoryLayer { } } -/// A result of an inmemory layer data being written to disk. 
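The `get_value_reconstruct_data` lookup in this file walks a key's versions from newest to oldest and stops at the first full image or `will_init` record. A simplified sketch of that scan follows. The `Value`, `Outcome` and `ReconstructState` types here are hypothetical stand-ins for illustration, and the versions are passed as an in-memory slice rather than read from the ephemeral file.

```rust
/// Hypothetical, simplified stand-in for the repository's `Value`.
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Image(Vec<u8>),
    WalRecord { will_init: bool, payload: Vec<u8> },
}

/// Hypothetical stand-in for `ValueReconstructResult`.
#[derive(Debug, PartialEq)]
enum Outcome {
    Complete, // enough data collected to rebuild the value
    Continue, // caller must look in an older layer
}

#[derive(Default)]
struct ReconstructState {
    img: Option<(u64, Vec<u8>)>,
    records: Vec<(u64, Value)>,
}

/// `versions` must be sorted by LSN ascending; `end_lsn` is exclusive.
fn collect_versions(
    versions: &[(u64, Value)],
    end_lsn: u64,
    state: &mut ReconstructState,
) -> Outcome {
    for (lsn, value) in versions.iter().rev().filter(|(lsn, _)| *lsn < end_lsn) {
        // A cached image at or after this LSN makes older entries unnecessary.
        if let Some((cached_lsn, _)) = &state.img {
            if lsn <= cached_lsn {
                return Outcome::Complete;
            }
        }
        match value {
            Value::Image(img) => {
                state.img = Some((*lsn, img.clone()));
                return Outcome::Complete;
            }
            Value::WalRecord { will_init, .. } => {
                state.records.push((*lsn, value.clone()));
                if *will_init {
                    // This record initializes the page; no need to go further back.
                    return Outcome::Complete;
                }
            }
        }
    }
    Outcome::Continue
}
```

Records are pushed newest first, so a consumer of `state.records` would replay them in reverse on top of the image.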
-pub struct LayersOnDisk { - pub delta_layers: Vec, - pub image_layers: Vec, -} - impl InMemoryLayer { - /// Return the oldest page version that's stored in this layer - pub fn get_oldest_lsn(&self) -> Lsn { - self.oldest_lsn - } - - pub fn get_latest_lsn(&self) -> Lsn { - let inner = self.inner.read().unwrap(); - inner.latest_lsn - } - /// /// Create a new, empty, in-memory layer /// @@ -409,291 +245,83 @@ impl InMemoryLayer { conf: &'static PageServerConf, timelineid: ZTimelineId, tenantid: ZTenantId, - seg: SegmentTag, start_lsn: Lsn, - oldest_lsn: Lsn, ) -> Result { trace!( - "initializing new empty InMemoryLayer for writing {} on timeline {} at {}", - seg, + "initializing new empty InMemoryLayer for writing on timeline {} at {}", timelineid, start_lsn ); - // The segment is initially empty, so initialize 'seg_sizes' with 0. - let mut seg_sizes = VecMap::default(); - if seg.rel.is_blocky() { - seg_sizes.append(start_lsn, 0).unwrap(); - } - let file = EphemeralFile::create(conf, tenantid, timelineid)?; Ok(InMemoryLayer { conf, timelineid, tenantid, - seg, start_lsn, - oldest_lsn, - incremental: false, inner: RwLock::new(InMemoryLayerInner { end_lsn: None, - dropped: false, + index: HashMap::new(), file, - page_versions: HashMap::new(), - seg_sizes, - latest_lsn: oldest_lsn, + end_offset: 0, }), }) } // Write operations - /// Remember new page version, as a WAL record over previous version - pub fn put_wal_record( - &self, - lsn: Lsn, - blknum: SegmentBlk, - rec: ZenithWalRecord, - ) -> Result { - self.put_page_version(blknum, lsn, PageVersion::Wal(rec)) - } - - /// Remember new page version, as a full page image - pub fn put_page_image(&self, blknum: SegmentBlk, lsn: Lsn, img: Bytes) -> Result { - self.put_page_version(blknum, lsn, PageVersion::Page(img)) - } - /// Common subroutine of the public put_wal_record() and put_page_image() functions. 
/// Adds the page version to the in-memory tree - pub fn put_page_version( - &self, - blknum: SegmentBlk, - lsn: Lsn, - pv: PageVersion, - ) -> anyhow::Result { - ensure!((0..RELISH_SEG_SIZE).contains(&blknum)); - - trace!( - "put_page_version blk {} of {} at {}/{}", - blknum, - self.seg.rel, - self.timelineid, - lsn - ); + pub fn put_value(&self, key: Key, lsn: Lsn, val: Value) -> Result<()> { + trace!("put_value key {} at {}/{}", key, self.timelineid, lsn); let mut inner = self.inner.write().unwrap(); inner.assert_writeable(); - ensure!(lsn >= inner.latest_lsn); - inner.latest_lsn = lsn; - // Write the page version to the file, and remember its offset in 'page_versions' - { - let off = inner.write_pv(&pv)?; - let vec_map = inner.page_versions.entry(blknum).or_default(); - let old = vec_map.append_or_update_last(lsn, off).unwrap().0; - if old.is_some() { - // We already had an entry for this LSN. That's odd.. - warn!( - "Page version of rel {} blk {} at {} already exists", - self.seg.rel, blknum, lsn - ); - } - } - - // Also update the relation size, if this extended the relation. - if self.seg.rel.is_blocky() { - let newsize = blknum + 1; - - // use inner get_seg_size, since calling self.get_seg_size will try to acquire the lock, - // which we've just acquired above - let oldsize = inner.get_seg_size(lsn); - if newsize > oldsize { - trace!( - "enlarging segment {} from {} to {} blocks at {}", - self.seg, - oldsize, - newsize, - lsn - ); - - // If we are extending the relation by more than one page, initialize the "gap" - // with zeros - // - // XXX: What if the caller initializes the gap with subsequent call with same LSN? - // I don't think that can happen currently, but that is highly dependent on how - // PostgreSQL writes its WAL records and there's no guarantee of it. If it does - // happen, we would hit the "page version already exists" warning above on the - // subsequent call to initialize the gap page. 
- for gapblknum in oldsize..blknum { - let zeropv = PageVersion::Page(ZERO_PAGE.clone()); - trace!( - "filling gap blk {} with zeros for write of {}", - gapblknum, - blknum - ); - - // Write the page version to the file, and remember its offset in - // 'page_versions' - { - let off = inner.write_pv(&zeropv)?; - let vec_map = inner.page_versions.entry(gapblknum).or_default(); - let old = vec_map.append_or_update_last(lsn, off).unwrap().0; - if old.is_some() { - warn!( - "Page version of seg {} blk {} at {} already exists", - self.seg, gapblknum, lsn - ); - } - } - } - - inner.seg_sizes.append_or_update_last(lsn, newsize).unwrap(); - return Ok(newsize - oldsize); - } - } - - Ok(0) - } - - /// Remember that the relation was truncated at given LSN - pub fn put_truncation(&self, lsn: Lsn, new_size: SegmentBlk) { - assert!( - self.seg.rel.is_blocky(), - "put_truncation() called on a non-blocky rel" - ); - - let mut inner = self.inner.write().unwrap(); - inner.assert_writeable(); - - // check that this we truncate to a smaller size than segment was before the truncation - let old_size = inner.get_seg_size(lsn); - assert!(new_size < old_size); - - let (old, _delta_size) = inner - .seg_sizes - .append_or_update_last(lsn, new_size) - .unwrap(); + let off = inner.end_offset; + let buf = Value::ser(&val)?; + let len = buf.len(); + inner.file.write_all(&buf)?; + inner.end_offset += len as u64; + let vec_map = inner.index.entry(key).or_default(); + let blob_ref = BlobRef::new(off, len, val.will_init()); + let old = vec_map.append_or_update_last(lsn, blob_ref).unwrap().0; if old.is_some() { // We already had an entry for this LSN. That's odd.. 
- warn!("Inserting truncation, but had an entry for the LSN already"); - } - } - - /// Remember that the segment was dropped at given LSN - pub fn drop_segment(&self, lsn: Lsn) { - let mut inner = self.inner.write().unwrap(); - - assert!(inner.end_lsn.is_none()); - assert!(!inner.dropped); - inner.dropped = true; - assert!(self.start_lsn < lsn); - inner.end_lsn = Some(lsn); - - trace!("dropped segment {} at {}", self.seg, lsn); - } - - /// - /// Initialize a new InMemoryLayer for, by copying the state at the given - /// point in time from given existing layer. - /// - pub fn create_successor_layer( - conf: &'static PageServerConf, - src: Arc, - timelineid: ZTimelineId, - tenantid: ZTenantId, - start_lsn: Lsn, - oldest_lsn: Lsn, - ) -> Result { - let seg = src.get_seg_tag(); - - assert!(oldest_lsn.is_aligned()); - - trace!( - "initializing new InMemoryLayer for writing {} on timeline {} at {}", - seg, - timelineid, - start_lsn, - ); - - // Copy the segment size at the start LSN from the predecessor layer. - let mut seg_sizes = VecMap::default(); - if seg.rel.is_blocky() { - let size = src.get_seg_size(start_lsn)?; - seg_sizes.append(start_lsn, size).unwrap(); + warn!("Key {} at {} already exists", key, lsn); } - let file = EphemeralFile::create(conf, tenantid, timelineid)?; - - Ok(InMemoryLayer { - conf, - timelineid, - tenantid, - seg, - start_lsn, - oldest_lsn, - incremental: true, - inner: RwLock::new(InMemoryLayerInner { - end_lsn: None, - dropped: false, - file, - page_versions: HashMap::new(), - seg_sizes, - latest_lsn: oldest_lsn, - }), - }) + Ok(()) } - pub fn is_writeable(&self) -> bool { - let inner = self.inner.read().unwrap(); - inner.end_lsn.is_none() + pub fn put_tombstone(&self, _key_range: Range, _lsn: Lsn) -> Result<()> { + // TODO: Currently, we just leak the storage for any deleted keys + + Ok(()) } /// Make the layer non-writeable. Only call once. /// Records the end_lsn for non-dropped layers. 
- /// `end_lsn` is inclusive + /// `end_lsn` is exclusive pub fn freeze(&self, end_lsn: Lsn) { let mut inner = self.inner.write().unwrap(); - if inner.end_lsn.is_some() { - assert!(inner.dropped); - } else { - assert!(!inner.dropped); - assert!(self.start_lsn < end_lsn + 1); - inner.end_lsn = Some(Lsn(end_lsn.0 + 1)); + assert!(self.start_lsn < end_lsn); + inner.end_lsn = Some(end_lsn); - if let Some((lsn, _)) = inner.seg_sizes.as_slice().last() { - assert!(lsn <= &end_lsn, "{:?} {:?}", lsn, end_lsn); - } - - for (_blk, vec_map) in inner.page_versions.iter() { - for (lsn, _pos) in vec_map.as_slice() { - assert!(*lsn <= end_lsn); - } + for vec_map in inner.index.values() { + for (lsn, _pos) in vec_map.as_slice() { + assert!(*lsn < end_lsn); } } } - /// Write the this frozen in-memory layer to disk. + /// Write this frozen in-memory layer to disk. /// - /// Returns new layers that replace this one. - /// If not dropped and reconstruct_pages is true, returns a new image layer containing the page versions - /// at the `end_lsn`. Can also return a DeltaLayer that includes all the - /// WAL records between start and end LSN. (The delta layer is not needed - /// when a new relish is created with a single LSN, so that the start and - /// end LSN are the same.) - pub fn write_to_disk( - &self, - timeline: &LayeredTimeline, - reconstruct_pages: bool, - ) -> Result { - trace!( - "write_to_disk {} get_end_lsn is {}", - self.filename().display(), - self.get_end_lsn() - ); - + /// Returns a new delta layer with all the same data as this in-memory layer + pub fn write_to_disk(&self) -> Result { // Grab the lock in read-mode. We hold it over the I/O, but because this // layer is not writeable anymore, no one should be trying to acquire the // write lock on it, so we shouldn't block anyone. There's one exception @@ -705,105 +333,32 @@ impl InMemoryLayer { // rare though, so we just accept the potential latency hit for now. 
let inner = self.inner.read().unwrap(); - // Since `end_lsn` is exclusive, subtract 1 to calculate the last LSN - // that is included. - let end_lsn_exclusive = inner.end_lsn.unwrap(); - let end_lsn_inclusive = Lsn(end_lsn_exclusive.0 - 1); + let mut delta_layer_writer = DeltaLayerWriter::new( + self.conf, + self.timelineid, + self.tenantid, + Key::MIN, + self.start_lsn..inner.end_lsn.unwrap(), + )?; - // Figure out if we should create a delta layer, image layer, or both. - let image_lsn: Option; - let delta_end_lsn: Option; - if self.is_dropped() || !reconstruct_pages { - // The segment was dropped. Create just a delta layer containing all the - // changes up to and including the drop. - delta_end_lsn = Some(end_lsn_exclusive); - image_lsn = None; - } else if self.start_lsn == end_lsn_inclusive { - // The layer contains exactly one LSN. It's enough to write an image - // layer at that LSN. - delta_end_lsn = None; - image_lsn = Some(end_lsn_inclusive); - } else { - // Create a delta layer with all the changes up to the end LSN, - // and an image layer at the end LSN. - // - // Note that we the delta layer does *not* include the page versions - // at the end LSN. They are included in the image layer, and there's - // no need to store them twice. 
- delta_end_lsn = Some(end_lsn_inclusive); - image_lsn = Some(end_lsn_inclusive); - } - - let mut delta_layers = Vec::new(); - let mut image_layers = Vec::new(); - - if let Some(delta_end_lsn) = delta_end_lsn { - let mut delta_layer_writer = DeltaLayerWriter::new( - self.conf, - self.timelineid, - self.tenantid, - self.seg, - self.start_lsn, - delta_end_lsn, - self.is_dropped(), - )?; - - // Write all page versions, in block + LSN order - let mut buf: Vec = Vec::new(); - - let pv_iter = inner.page_versions.iter(); - let mut pages: Vec<(&SegmentBlk, &VecMap)> = pv_iter.collect(); - pages.sort_by_key(|(blknum, _vec_map)| *blknum); - for (blknum, vec_map) in pages { - for (lsn, pos) in vec_map.as_slice() { - if *lsn < delta_end_lsn { - let len = inner.read_pv_bytes(*pos, &mut buf)?; - delta_layer_writer.put_page_version(*blknum, *lsn, &buf[..len])?; - } + let mut do_steps = || -> Result<()> { + for (key, vec_map) in inner.index.iter() { + // Write all page versions + for (lsn, blob_ref) in vec_map.as_slice() { + let mut buf = vec![0u8; blob_ref.size()]; + inner.file.read_exact_at(&mut buf, blob_ref.pos())?; + let val = Value::des(&buf)?; + delta_layer_writer.put_value(*key, *lsn, val)?; } } - - // Create seg_sizes - let seg_sizes = if delta_end_lsn == end_lsn_exclusive { - inner.seg_sizes.clone() - } else { - inner.seg_sizes.split_at(&end_lsn_exclusive).0 - }; - - let delta_layer = delta_layer_writer.finish(seg_sizes)?; - delta_layers.push(delta_layer); + Ok(()) + }; + if let Err(err) = do_steps() { + delta_layer_writer.abort(); + return Err(err); } - drop(inner); - - // Write a new base image layer at the cutoff point - if let Some(image_lsn) = image_lsn { - let size = if self.seg.rel.is_blocky() { - self.get_seg_size(image_lsn)? 
- } else { - 1 - }; - let mut image_layer_writer = ImageLayerWriter::new( - self.conf, - self.timelineid, - self.tenantid, - self.seg, - image_lsn, - size, - )?; - - for blknum in 0..size { - let img = timeline.materialize_page(self.seg, blknum, image_lsn, &*self)?; - - image_layer_writer.put_page_image(&img)?; - } - let image_layer = image_layer_writer.finish()?; - image_layers.push(image_layer); - } - - Ok(LayersOnDisk { - delta_layers, - image_layers, - }) + let delta_layer = delta_layer_writer.finish(Key::MAX)?; + Ok(delta_layer) } } diff --git a/pageserver/src/layered_repository/interval_tree.rs b/pageserver/src/layered_repository/interval_tree.rs deleted file mode 100644 index 978ecd837e..0000000000 --- a/pageserver/src/layered_repository/interval_tree.rs +++ /dev/null @@ -1,468 +0,0 @@ -/// -/// IntervalTree is data structure for holding intervals. It is generic -/// to make unit testing possible, but the only real user of it is the layer map, -/// -/// It's inspired by the "segment tree" or a "statistic tree" as described in -/// https://en.wikipedia.org/wiki/Segment_tree. However, we use a B-tree to hold -/// the points instead of a binary tree. This is called an "interval tree" instead -/// of "segment tree" because the term "segment" is already using Zenith to mean -/// something else. To add to the confusion, there is another data structure known -/// as "interval tree" out there (see https://en.wikipedia.org/wiki/Interval_tree), -/// for storing intervals, but this isn't that. -/// -/// The basic idea is to have a B-tree of "interesting Points". At each Point, -/// there is a list of intervals that contain the point. The Points are formed -/// from the start bounds of each interval; there is a Point for each distinct -/// start bound. -/// -/// Operations: -/// -/// To find intervals that contain a given point, you search the b-tree to find -/// the nearest Point <= search key. Then you just return the list of intervals. 
-/// -/// To insert an interval, find the Point with start key equal to the inserted item. -/// If the Point doesn't exist yet, create it, by copying all the items from the -/// previous Point that cover the new Point. Then walk right, inserting the new -/// interval to all the Points that are contained by the new interval (including the -/// newly created Point). -/// -/// To remove an interval, you scan the tree for all the Points that are contained by -/// the removed interval, and remove it from the list in each Point. -/// -/// Requirements and assumptions: -/// -/// - Can store overlapping items -/// - But there are not many overlapping items -/// - The interval bounds don't change after it is added to the tree -/// - Intervals are uniquely identified by pointer equality. You must not be insert the -/// same interval object twice, and `remove` uses pointer equality to remove the right -/// interval. It is OK to have two intervals with the same bounds, however. -/// -use std::collections::BTreeMap; -use std::fmt::Debug; -use std::ops::Range; -use std::sync::Arc; - -pub struct IntervalTree -where - I: IntervalItem, -{ - points: BTreeMap>, -} - -struct Point { - /// All intervals that contain this point, in no particular order. - /// - /// We assume that there aren't a lot of overlappingg intervals, so that this vector - /// never grows very large. If that assumption doesn't hold, we could keep this ordered - /// by the end bound, to speed up `search`. But as long as there are only a few elements, - /// a linear search is OK. - elements: Vec>, -} - -/// Abstraction for an interval that can be stored in the tree -/// -/// The start bound is inclusive and the end bound is exclusive. End must be greater -/// than start. 
-pub trait IntervalItem { - type Key: Ord + Copy + Debug + Sized; - - fn start_key(&self) -> Self::Key; - fn end_key(&self) -> Self::Key; - - fn bounds(&self) -> Range { - self.start_key()..self.end_key() - } -} - -impl IntervalTree -where - I: IntervalItem, -{ - /// Return an element that contains 'key', or precedes it. - /// - /// If there are multiple candidates, returns the one with the highest 'end' key. - pub fn search(&self, key: I::Key) -> Option> { - // Find the greatest point that precedes or is equal to the search key. If there is - // none, returns None. - let (_, p) = self.points.range(..=key).next_back()?; - - // Find the element with the highest end key at this point - let highest_item = p - .elements - .iter() - .reduce(|a, b| { - // starting with Rust 1.53, could use `std::cmp::min_by_key` here - if a.end_key() > b.end_key() { - a - } else { - b - } - }) - .unwrap(); - Some(Arc::clone(highest_item)) - } - - /// Iterate over all items with start bound >= 'key' - pub fn iter_newer(&self, key: I::Key) -> IntervalIter { - IntervalIter { - point_iter: self.points.range(key..), - elem_iter: None, - } - } - - /// Iterate over all items - pub fn iter(&self) -> IntervalIter { - IntervalIter { - point_iter: self.points.range(..), - elem_iter: None, - } - } - - pub fn insert(&mut self, item: Arc) { - let start_key = item.start_key(); - let end_key = item.end_key(); - assert!(start_key < end_key); - let bounds = start_key..end_key; - - // Find the starting point and walk forward from there - let mut found_start_point = false; - let iter = self.points.range_mut(bounds); - for (point_key, point) in iter { - if *point_key == start_key { - found_start_point = true; - // It is an error to insert the same item to the tree twice. 
- assert!( - !point.elements.iter().any(|x| Arc::ptr_eq(x, &item)), - "interval is already in the tree" - ); - } - point.elements.push(Arc::clone(&item)); - } - if !found_start_point { - // Create a new Point for the starting point - - // Look at the previous point, and copy over elements that overlap with this - // new point - let mut new_elements: Vec> = Vec::new(); - if let Some((_, prev_point)) = self.points.range(..start_key).next_back() { - let overlapping_prev_elements = prev_point - .elements - .iter() - .filter(|x| x.bounds().contains(&start_key)) - .cloned(); - - new_elements.extend(overlapping_prev_elements); - } - new_elements.push(item); - - let new_point = Point { - elements: new_elements, - }; - self.points.insert(start_key, new_point); - } - } - - pub fn remove(&mut self, item: &Arc) { - // range search points - let start_key = item.start_key(); - let end_key = item.end_key(); - let bounds = start_key..end_key; - - let mut points_to_remove: Vec = Vec::new(); - let mut found_start_point = false; - for (point_key, point) in self.points.range_mut(bounds) { - if *point_key == start_key { - found_start_point = true; - } - let len_before = point.elements.len(); - point.elements.retain(|other| !Arc::ptr_eq(other, item)); - let len_after = point.elements.len(); - assert_eq!(len_after + 1, len_before); - if len_after == 0 { - points_to_remove.push(*point_key); - } - } - assert!(found_start_point); - - for k in points_to_remove { - self.points.remove(&k).unwrap(); - } - } -} - -pub struct IntervalIter<'a, I: ?Sized> -where - I: IntervalItem, -{ - point_iter: std::collections::btree_map::Range<'a, I::Key, Point>, - elem_iter: Option<(I::Key, std::slice::Iter<'a, Arc>)>, -} - -impl<'a, I> Iterator for IntervalIter<'a, I> -where - I: IntervalItem + ?Sized, -{ - type Item = Arc; - - fn next(&mut self) -> Option { - // Iterate over all elements in all the points in 'point_iter'. 
To avoid - // returning the same element twice, we only return each element at its - // starting point. - loop { - // Return next remaining element from the current point - if let Some((point_key, elem_iter)) = &mut self.elem_iter { - for elem in elem_iter { - if elem.start_key() == *point_key { - return Some(Arc::clone(elem)); - } - } - } - // No more elements at this point. Move to next point. - if let Some((point_key, point)) = self.point_iter.next() { - self.elem_iter = Some((*point_key, point.elements.iter())); - continue; - } else { - // No more points, all done - return None; - } - } - } -} - -impl Default for IntervalTree -where - I: IntervalItem, -{ - fn default() -> Self { - IntervalTree { - points: BTreeMap::new(), - } - } -} - -#[cfg(test)] -mod tests { - use super::*; - use std::fmt; - - #[derive(Debug)] - struct MockItem { - start_key: u32, - end_key: u32, - val: String, - } - impl IntervalItem for MockItem { - type Key = u32; - - fn start_key(&self) -> u32 { - self.start_key - } - fn end_key(&self) -> u32 { - self.end_key - } - } - impl MockItem { - fn new(start_key: u32, end_key: u32) -> Self { - MockItem { - start_key, - end_key, - val: format!("{}-{}", start_key, end_key), - } - } - fn new_str(start_key: u32, end_key: u32, val: &str) -> Self { - MockItem { - start_key, - end_key, - val: format!("{}-{}: {}", start_key, end_key, val), - } - } - } - impl fmt::Display for MockItem { - fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { - write!(f, "{}", self.val) - } - } - #[rustfmt::skip] - fn assert_search( - tree: &IntervalTree, - key: u32, - expected: &[&str], - ) -> Option> { - if let Some(v) = tree.search(key) { - let vstr = v.to_string(); - - assert!(!expected.is_empty(), "search with {} returned {}, expected None", key, v); - assert!( - expected.contains(&vstr.as_str()), - "search with {} returned {}, expected one of: {:?}", - key, v, expected, - ); - - Some(v) - } else { - assert!( - expected.is_empty(), - "search with {} returned 
None, expected one of {:?}", - key, expected - ); - None - } - } - - fn assert_contents(tree: &IntervalTree, expected: &[&str]) { - let mut contents: Vec = tree.iter().map(|e| e.to_string()).collect(); - contents.sort(); - assert_eq!(contents, expected); - } - - fn dump_tree(tree: &IntervalTree) { - for (point_key, point) in tree.points.iter() { - print!("{}:", point_key); - for e in point.elements.iter() { - print!(" {}", e); - } - println!(); - } - } - - #[test] - fn test_interval_tree_simple() { - let mut tree: IntervalTree = IntervalTree::default(); - - // Simple, non-overlapping ranges. - tree.insert(Arc::new(MockItem::new(10, 11))); - tree.insert(Arc::new(MockItem::new(11, 12))); - tree.insert(Arc::new(MockItem::new(12, 13))); - tree.insert(Arc::new(MockItem::new(18, 19))); - tree.insert(Arc::new(MockItem::new(17, 18))); - tree.insert(Arc::new(MockItem::new(15, 16))); - - assert_search(&tree, 9, &[]); - assert_search(&tree, 10, &["10-11"]); - assert_search(&tree, 11, &["11-12"]); - assert_search(&tree, 12, &["12-13"]); - assert_search(&tree, 13, &["12-13"]); - assert_search(&tree, 14, &["12-13"]); - assert_search(&tree, 15, &["15-16"]); - assert_search(&tree, 16, &["15-16"]); - assert_search(&tree, 17, &["17-18"]); - assert_search(&tree, 18, &["18-19"]); - assert_search(&tree, 19, &["18-19"]); - assert_search(&tree, 20, &["18-19"]); - - // remove a few entries and search around them again - tree.remove(&assert_search(&tree, 10, &["10-11"]).unwrap()); // first entry - tree.remove(&assert_search(&tree, 12, &["12-13"]).unwrap()); // entry in the middle - tree.remove(&assert_search(&tree, 18, &["18-19"]).unwrap()); // last entry - assert_search(&tree, 9, &[]); - assert_search(&tree, 10, &[]); - assert_search(&tree, 11, &["11-12"]); - assert_search(&tree, 12, &["11-12"]); - assert_search(&tree, 14, &["11-12"]); - assert_search(&tree, 15, &["15-16"]); - assert_search(&tree, 17, &["17-18"]); - assert_search(&tree, 18, &["17-18"]); - } - - #[test] - fn 
test_interval_tree_overlap() { - let mut tree: IntervalTree = IntervalTree::default(); - - // Overlapping items - tree.insert(Arc::new(MockItem::new(22, 24))); - tree.insert(Arc::new(MockItem::new(23, 25))); - let x24_26 = Arc::new(MockItem::new(24, 26)); - tree.insert(Arc::clone(&x24_26)); - let x26_28 = Arc::new(MockItem::new(26, 28)); - tree.insert(Arc::clone(&x26_28)); - tree.insert(Arc::new(MockItem::new(25, 27))); - - assert_search(&tree, 22, &["22-24"]); - assert_search(&tree, 23, &["22-24", "23-25"]); - assert_search(&tree, 24, &["23-25", "24-26"]); - assert_search(&tree, 25, &["24-26", "25-27"]); - assert_search(&tree, 26, &["25-27", "26-28"]); - assert_search(&tree, 27, &["26-28"]); - assert_search(&tree, 28, &["26-28"]); - assert_search(&tree, 29, &["26-28"]); - - tree.remove(&x24_26); - tree.remove(&x26_28); - assert_search(&tree, 23, &["22-24", "23-25"]); - assert_search(&tree, 24, &["23-25"]); - assert_search(&tree, 25, &["25-27"]); - assert_search(&tree, 26, &["25-27"]); - assert_search(&tree, 27, &["25-27"]); - assert_search(&tree, 28, &["25-27"]); - assert_search(&tree, 29, &["25-27"]); - } - - #[test] - fn test_interval_tree_nested() { - let mut tree: IntervalTree = IntervalTree::default(); - - // Items containing other items - tree.insert(Arc::new(MockItem::new(31, 39))); - tree.insert(Arc::new(MockItem::new(32, 34))); - tree.insert(Arc::new(MockItem::new(33, 35))); - tree.insert(Arc::new(MockItem::new(30, 40))); - - assert_search(&tree, 30, &["30-40"]); - assert_search(&tree, 31, &["30-40", "31-39"]); - assert_search(&tree, 32, &["30-40", "32-34", "31-39"]); - assert_search(&tree, 33, &["30-40", "32-34", "33-35", "31-39"]); - assert_search(&tree, 34, &["30-40", "33-35", "31-39"]); - assert_search(&tree, 35, &["30-40", "31-39"]); - assert_search(&tree, 36, &["30-40", "31-39"]); - assert_search(&tree, 37, &["30-40", "31-39"]); - assert_search(&tree, 38, &["30-40", "31-39"]); - assert_search(&tree, 39, &["30-40"]); - assert_search(&tree, 40, 
&["30-40"]); - assert_search(&tree, 41, &["30-40"]); - } - - #[test] - fn test_interval_tree_duplicates() { - let mut tree: IntervalTree = IntervalTree::default(); - - // Duplicate keys - let item_a = Arc::new(MockItem::new_str(55, 56, "a")); - tree.insert(Arc::clone(&item_a)); - let item_b = Arc::new(MockItem::new_str(55, 56, "b")); - tree.insert(Arc::clone(&item_b)); - let item_c = Arc::new(MockItem::new_str(55, 56, "c")); - tree.insert(Arc::clone(&item_c)); - let item_d = Arc::new(MockItem::new_str(54, 56, "d")); - tree.insert(Arc::clone(&item_d)); - let item_e = Arc::new(MockItem::new_str(55, 57, "e")); - tree.insert(Arc::clone(&item_e)); - - dump_tree(&tree); - - assert_search( - &tree, - 55, - &["55-56: a", "55-56: b", "55-56: c", "54-56: d", "55-57: e"], - ); - tree.remove(&item_b); - dump_tree(&tree); - - assert_contents(&tree, &["54-56: d", "55-56: a", "55-56: c", "55-57: e"]); - - tree.remove(&item_d); - dump_tree(&tree); - assert_contents(&tree, &["55-56: a", "55-56: c", "55-57: e"]); - } - - #[test] - #[should_panic] - fn test_interval_tree_insert_twice() { - let mut tree: IntervalTree = IntervalTree::default(); - - // Inserting the same item twice is not cool - let item = Arc::new(MockItem::new(1, 2)); - tree.insert(Arc::clone(&item)); - tree.insert(Arc::clone(&item)); // fails assertion - } -} diff --git a/pageserver/src/layered_repository/layer_map.rs b/pageserver/src/layered_repository/layer_map.rs index fe82fd491c..c4929a6173 100644 --- a/pageserver/src/layered_repository/layer_map.rs +++ b/pageserver/src/layered_repository/layer_map.rs @@ -1,32 +1,29 @@ //! -//! The layer map tracks what layers exist for all the relishes in a timeline. +//! The layer map tracks what layers exist in a timeline. //! //! When the timeline is first accessed, the server lists of all layer files //! in the timelines/ directory, and populates this map with -//! ImageLayer and DeltaLayer structs corresponding to each file. When new WAL -//! 
is received, we create InMemoryLayers to hold the incoming records. Now and -//! then, in the checkpoint() function, the in-memory layers are frozen, forming -//! new image and delta layers and corresponding files are written to disk. +//! ImageLayer and DeltaLayer structs corresponding to each file. When the first +//! new WAL record is received, we create an InMemoryLayer to hold the incoming +//! records. Now and then, in the checkpoint() function, the in-memory layer is +//! frozen, and it is split up into new image and delta layers and the +//! corresponding files are written to disk. //! -use crate::layered_repository::interval_tree::{IntervalItem, IntervalIter, IntervalTree}; -use crate::layered_repository::storage_layer::{Layer, SegmentTag}; +use crate::layered_repository::storage_layer::Layer; +use crate::layered_repository::storage_layer::{range_eq, range_overlaps}; use crate::layered_repository::InMemoryLayer; -use crate::relish::*; +use crate::repository::Key; use anyhow::Result; use lazy_static::lazy_static; -use std::cmp::Ordering; -use std::collections::{BinaryHeap, HashMap}; +use std::collections::VecDeque; +use std::ops::Range; use std::sync::Arc; +use tracing::*; use zenith_metrics::{register_int_gauge, IntGauge}; use zenith_utils::lsn::Lsn; -use super::global_layer_map::{LayerId, GLOBAL_LAYER_MAP}; - lazy_static! { - static ref NUM_INMEMORY_LAYERS: IntGauge = - register_int_gauge!("pageserver_inmemory_layers", "Number of layers in memory") - .expect("failed to define a metric"); static ref NUM_ONDISK_LAYERS: IntGauge = register_int_gauge!("pageserver_ondisk_layers", "Number of layers on-disk") .expect("failed to define a metric"); @@ -37,98 +34,147 @@ lazy_static! { /// #[derive(Default)] pub struct LayerMap { - /// All the layers keyed by segment tag - segs: HashMap<SegmentTag, SegEntry>, + // + // 'open_layer' holds the current InMemoryLayer that is accepting new + // records.
If it is None, 'next_open_layer_at' will be set instead, indicating + // the start LSN of the next InMemoryLayer that is to be created. + // + pub open_layer: Option<Arc<InMemoryLayer>>, + pub next_open_layer_at: Option<Lsn>, - /// All in-memory layers, ordered by 'oldest_lsn' and generation - /// of each layer. This allows easy access to the in-memory layer that - /// contains the oldest WAL record. - open_layers: BinaryHeap<OpenLayerEntry>, + /// + /// The frozen layer, if any, contains WAL older than the current 'open_layer' + /// or 'next_open_layer_at', but newer than any historic layer. The frozen + /// layer exists during checkpointing, when an InMemoryLayer is being written out + /// to disk. + /// + pub frozen_layers: VecDeque<Arc<InMemoryLayer>>, - /// Generation number, used to distinguish newly inserted entries in the - /// binary heap from older entries during checkpoint. - current_generation: u64, + /// All the historic layers are kept here + + /// TODO: This is a placeholder implementation of a data structure + /// to hold information about all the layer files on disk and in + /// S3. Currently, it's just a vector and all operations perform a + /// linear scan over it. That obviously becomes slow as the + /// number of layers grows. I'm imagining that an R-tree or some + /// other 2D data structure would be the long-term solution here. + historic_layers: Vec<Arc<dyn Layer>>, +} + +/// Return value of LayerMap::search +pub struct SearchResult { + pub layer: Arc<dyn Layer>, + pub lsn_floor: Lsn, } impl LayerMap { /// - /// Look up a layer using the given segment tag and LSN. This differs from a - /// plain key-value lookup in that if there is any layer that covers the - /// given LSN, or precedes the given LSN, it is returned. In other words, - /// you don't need to know the exact start LSN of the layer. + /// Find the latest layer that covers the given 'key', with lsn < + /// 'end_lsn'.
/// - pub fn get(&self, tag: &SegmentTag, lsn: Lsn) -> Option> { - let segentry = self.segs.get(tag)?; - - segentry.get(lsn) - } - + /// Returns the layer, if any, and an 'lsn_floor' value that + /// indicates which portion of the layer the caller should + /// check. 'lsn_floor' is normally the start-LSN of the layer, but + /// can be greater if there is an overlapping layer that might + /// contain the version, even if it's missing from the returned + /// layer. /// - /// Get the open layer for given segment for writing. Or None if no open - /// layer exists. - /// - pub fn get_open(&self, tag: &SegmentTag) -> Option> { - let segentry = self.segs.get(tag)?; + pub fn search(&self, key: Key, end_lsn: Lsn) -> Result> { + // linear search + // Find the latest image layer that covers the given key + let mut latest_img: Option> = None; + let mut latest_img_lsn: Option = None; + for l in self.historic_layers.iter() { + if l.is_incremental() { + continue; + } + if !l.get_key_range().contains(&key) { + continue; + } + let img_lsn = l.get_lsn_range().start; - segentry - .open_layer_id - .and_then(|layer_id| GLOBAL_LAYER_MAP.read().unwrap().get(&layer_id)) - } + if img_lsn >= end_lsn { + // too new + continue; + } + if Lsn(img_lsn.0 + 1) == end_lsn { + // found exact match + return Ok(Some(SearchResult { + layer: Arc::clone(l), + lsn_floor: img_lsn, + })); + } + if img_lsn > latest_img_lsn.unwrap_or(Lsn(0)) { + latest_img = Some(Arc::clone(l)); + latest_img_lsn = Some(img_lsn); + } + } - /// - /// Insert an open in-memory layer - /// - pub fn insert_open(&mut self, layer: Arc) { - let segentry = self.segs.entry(layer.get_seg_tag()).or_default(); - - let layer_id = segentry.update_open(Arc::clone(&layer)); - - let oldest_lsn = layer.get_oldest_lsn(); - - // After a crash and restart, 'oldest_lsn' of the oldest in-memory - // layer becomes the WAL streaming starting point, so it better not point - // in the middle of a WAL record. 
- assert!(oldest_lsn.is_aligned()); - - // Also add it to the binary heap - let open_layer_entry = OpenLayerEntry { - oldest_lsn: layer.get_oldest_lsn(), - layer_id, - generation: self.current_generation, - }; - self.open_layers.push(open_layer_entry); - - NUM_INMEMORY_LAYERS.inc(); - } - - /// Remove an open in-memory layer - pub fn remove_open(&mut self, layer_id: LayerId) { - // Note: we don't try to remove the entry from the binary heap. - // It will be removed lazily by peek_oldest_open() when it's made it to - // the top of the heap. - - let layer_opt = { - let mut global_map = GLOBAL_LAYER_MAP.write().unwrap(); - let layer_opt = global_map.get(&layer_id); - global_map.remove(&layer_id); - // TODO it's bad that a ref can still exist after being evicted from cache - layer_opt - }; - - if let Some(layer) = layer_opt { - let mut segentry = self.segs.get_mut(&layer.get_seg_tag()).unwrap(); - - if segentry.open_layer_id == Some(layer_id) { - // Also remove it from the SegEntry of this segment - segentry.open_layer_id = None; - } else { - // We could have already updated segentry.open for - // dropped (non-writeable) layer. This is fine. - assert!(!layer.is_writeable()); - assert!(layer.is_dropped()); + // Search the delta layers + let mut latest_delta: Option> = None; + for l in self.historic_layers.iter() { + if !l.is_incremental() { + continue; + } + if !l.get_key_range().contains(&key) { + continue; } - NUM_INMEMORY_LAYERS.dec(); + if l.get_lsn_range().start >= end_lsn { + // too new + continue; + } + + if l.get_lsn_range().end >= end_lsn { + // this layer contains the requested point in the key/lsn space. + // No need to search any further + trace!( + "found layer {} for request on {} at {}", + l.filename().display(), + key, + end_lsn + ); + latest_delta.replace(Arc::clone(l)); + break; + } + // this layer's end LSN is smaller than the requested point. If there's + // nothing newer, this is what we need to return. Remember this. 
+ if let Some(ref old_candidate) = latest_delta { + if l.get_lsn_range().end > old_candidate.get_lsn_range().end { + latest_delta.replace(Arc::clone(l)); + } + } else { + latest_delta.replace(Arc::clone(l)); + } + } + if let Some(l) = latest_delta { + trace!( + "found (old) layer {} for request on {} at {}", + l.filename().display(), + key, + end_lsn + ); + let lsn_floor = std::cmp::max( + Lsn(latest_img_lsn.unwrap_or(Lsn(0)).0 + 1), + l.get_lsn_range().start, + ); + Ok(Some(SearchResult { + lsn_floor, + layer: l, + })) + } else if let Some(l) = latest_img { + trace!( + "found img layer and no deltas for request on {} at {}", + key, + end_lsn + ); + Ok(Some(SearchResult { + lsn_floor: latest_img_lsn.unwrap(), + layer: l, + })) + } else { + trace!("no layer found for request on {} at {}", key, end_lsn); + Ok(None) } } @@ -136,9 +182,7 @@ impl LayerMap { /// Insert an on-disk layer /// pub fn insert_historic(&mut self, layer: Arc) { - let segentry = self.segs.entry(layer.get_seg_tag()).or_default(); - segentry.insert_historic(layer); - + self.historic_layers.push(layer); NUM_ONDISK_LAYERS.inc(); } @@ -147,61 +191,62 @@ impl LayerMap { /// /// This should be called when the corresponding file on disk has been deleted. /// + #[allow(dead_code)] pub fn remove_historic(&mut self, layer: Arc) { - let tag = layer.get_seg_tag(); + let len_before = self.historic_layers.len(); - if let Some(segentry) = self.segs.get_mut(&tag) { - segentry.historic.remove(&layer); - } + // FIXME: ptr_eq might fail to return true for 'dyn' + // references. Clippy complains about this. In practice it + // seems to work, the assertion below would be triggered + // otherwise but this ought to be fixed. + #[allow(clippy::vtable_address_comparisons)] + self.historic_layers + .retain(|other| !Arc::ptr_eq(other, &layer)); + + assert_eq!(self.historic_layers.len(), len_before - 1); NUM_ONDISK_LAYERS.dec(); } - // List relations along with a flag that marks if they exist at the given lsn. 
- // spcnode 0 and dbnode 0 have special meanings and mean all tabespaces/databases. - // Pass Tag if we're only interested in some relations. - pub fn list_relishes(&self, tag: Option, lsn: Lsn) -> Result> { - let mut rels: HashMap = HashMap::new(); - - for (seg, segentry) in self.segs.iter() { - match seg.rel { - RelishTag::Relation(reltag) => { - if let Some(request_rel) = tag { - if (request_rel.spcnode == 0 || reltag.spcnode == request_rel.spcnode) - && (request_rel.dbnode == 0 || reltag.dbnode == request_rel.dbnode) - { - if let Some(exists) = segentry.exists_at_lsn(lsn)? { - rels.insert(seg.rel, exists); - } - } - } - } - _ => { - if tag == None { - if let Some(exists) = segentry.exists_at_lsn(lsn)? { - rels.insert(seg.rel, exists); - } - } - } - } - } - Ok(rels) - } - /// Is there a newer image layer for given segment? /// /// This is used for garbage collection, to determine if an old layer can /// be deleted. /// We ignore segments newer than disk_consistent_lsn because they will be removed at restart + /// We also only look at historic layers + //#[allow(dead_code)] pub fn newer_image_layer_exists( &self, - seg: SegmentTag, + key_range: &Range, lsn: Lsn, disk_consistent_lsn: Lsn, - ) -> bool { - if let Some(segentry) = self.segs.get(&seg) { - segentry.newer_image_layer_exists(lsn, disk_consistent_lsn) - } else { - false + ) -> Result { + let mut range_remain = key_range.clone(); + + loop { + let mut made_progress = false; + for l in self.historic_layers.iter() { + if l.is_incremental() { + continue; + } + let img_lsn = l.get_lsn_range().start; + if !l.is_incremental() + && l.get_key_range().contains(&range_remain.start) + && img_lsn > lsn + && img_lsn < disk_consistent_lsn + { + made_progress = true; + let img_key_end = l.get_key_range().end; + + if img_key_end >= range_remain.end { + return Ok(true); + } + range_remain.start = img_key_end; + } + } + + if !made_progress { + return Ok(false); + } } } @@ -211,284 +256,148 @@ impl LayerMap { /// used for 
garbage collection, to determine if some alive layer /// exists at the lsn. If so, we shouldn't delete a newer dropped layer /// to avoid incorrectly making it visible. - pub fn layer_exists_at_lsn(&self, seg: SegmentTag, lsn: Lsn) -> Result { - Ok(if let Some(segentry) = self.segs.get(&seg) { - segentry.exists_at_lsn(lsn)?.unwrap_or(false) - } else { - false - }) + /* + pub fn layer_exists_at_lsn(&self, seg: SegmentTag, lsn: Lsn) -> Result { + Ok(if let Some(segentry) = self.historic_layers.get(&seg) { + segentry.exists_at_lsn(seg, lsn)?.unwrap_or(false) + } else { + false + }) + } + */ + + pub fn iter_historic_layers(&self) -> std::slice::Iter> { + self.historic_layers.iter() } - /// Return the oldest in-memory layer, along with its generation number. - pub fn peek_oldest_open(&mut self) -> Option<(LayerId, Arc, u64)> { - let global_map = GLOBAL_LAYER_MAP.read().unwrap(); + /// Find the last image layer that covers 'key', ignoring any image layers + /// newer than 'lsn'. + fn find_latest_image(&self, key: Key, lsn: Lsn) -> Option> { + let mut candidate_lsn = Lsn(0); + let mut candidate = None; + for l in self.historic_layers.iter() { + if l.is_incremental() { + continue; + } - while let Some(oldest_entry) = self.open_layers.peek() { - if let Some(layer) = global_map.get(&oldest_entry.layer_id) { - return Some((oldest_entry.layer_id, layer, oldest_entry.generation)); - } else { - self.open_layers.pop(); + if !l.get_key_range().contains(&key) { + continue; + } + + let this_lsn = l.get_lsn_range().start; + if this_lsn > lsn { + continue; + } + if this_lsn < candidate_lsn { + // our previous candidate was better + continue; + } + candidate_lsn = this_lsn; + candidate = Some(Arc::clone(l)); + } + + candidate + } + + /// + /// Divide the whole given range of keys into sub-ranges based on the latest + /// image layer that covers each range. (This is used when creating new + /// image layers) + /// + // FIXME: clippy complains that the result type is very complex. 
She's probably + // right... + #[allow(clippy::type_complexity)] + pub fn image_coverage( + &self, + key_range: &Range, + lsn: Lsn, + ) -> Result, Option>)>> { + let mut points: Vec; + + points = vec![key_range.start]; + for l in self.historic_layers.iter() { + if l.get_lsn_range().start > lsn { + continue; + } + let range = l.get_key_range(); + if key_range.contains(&range.start) { + points.push(l.get_key_range().start); + } + if key_range.contains(&range.end) { + points.push(l.get_key_range().end); } } - None - } + points.push(key_range.end); - /// Increment the generation number used to stamp open in-memory layers. Layers - /// added with `insert_open` after this call will be associated with the new - /// generation. Returns the new generation number. - pub fn increment_generation(&mut self) -> u64 { - self.current_generation += 1; - self.current_generation - } + points.sort(); + points.dedup(); - pub fn iter_historic_layers(&self) -> HistoricLayerIter { - HistoricLayerIter { - seg_iter: self.segs.iter(), - iter: None, + // Ok, we now have a list of "interesting" points in the key space + + // For each range between the points, find the latest image + let mut start = *points.first().unwrap(); + let mut ranges = Vec::new(); + for end in points[1..].iter() { + let img = self.find_latest_image(start, lsn); + + ranges.push((start..*end, img)); + + start = *end; } + Ok(ranges) + } + + /// Count how many L1 delta layers there are that overlap with the + /// given key and LSN range. + pub fn count_deltas(&self, key_range: &Range, lsn_range: &Range) -> Result { + let mut result = 0; + for l in self.historic_layers.iter() { + if !l.is_incremental() { + continue; + } + if !range_overlaps(&l.get_lsn_range(), lsn_range) { + continue; + } + if !range_overlaps(&l.get_key_range(), key_range) { + continue; + } + + // We ignore level0 delta layers. 
Unless the whole keyspace fits + // into one partition + if !range_eq(key_range, &(Key::MIN..Key::MAX)) + && range_eq(&l.get_key_range(), &(Key::MIN..Key::MAX)) + { + continue; + } + + result += 1; + } + Ok(result) + } + + /// Return all L0 delta layers + pub fn get_level0_deltas(&self) -> Result>> { + let mut deltas = Vec::new(); + for l in self.historic_layers.iter() { + if !l.is_incremental() { + continue; + } + if l.get_key_range() != (Key::MIN..Key::MAX) { + continue; + } + deltas.push(Arc::clone(l)); + } + Ok(deltas) } /// debugging function to print out the contents of the layer map #[allow(unused)] pub fn dump(&self) -> Result<()> { println!("Begin dump LayerMap"); - for (seg, segentry) in self.segs.iter() { - if let Some(open) = &segentry.open_layer_id { - if let Some(layer) = GLOBAL_LAYER_MAP.read().unwrap().get(open) { - layer.dump()?; - } else { - println!("layer not found in global map"); - } - } - - for layer in segentry.historic.iter() { - layer.dump()?; - } + for layer in self.historic_layers.iter() { + layer.dump()?; } println!("End dump LayerMap"); Ok(()) } } - -impl IntervalItem for dyn Layer { - type Key = Lsn; - - fn start_key(&self) -> Lsn { - self.get_start_lsn() - } - fn end_key(&self) -> Lsn { - self.get_end_lsn() - } -} - -/// -/// Per-segment entry in the LayerMap::segs hash map. Holds all the layers -/// associated with the segment. -/// -/// The last layer that is open for writes is always an InMemoryLayer, -/// and is kept in a separate field, because there can be only one for -/// each segment. The older layers, stored on disk, are kept in an -/// IntervalTree. -#[derive(Default)] -struct SegEntry { - open_layer_id: Option, - historic: IntervalTree, -} - -impl SegEntry { - /// Does the segment exist at given LSN? - /// Return None if object is not found in this SegEntry. 
- fn exists_at_lsn(&self, lsn: Lsn) -> Result> { - if let Some(layer) = self.get(lsn) { - Ok(Some(layer.get_seg_exists(lsn)?)) - } else { - Ok(None) - } - } - - pub fn get(&self, lsn: Lsn) -> Option> { - if let Some(open_layer_id) = &self.open_layer_id { - let open_layer = GLOBAL_LAYER_MAP.read().unwrap().get(open_layer_id)?; - if open_layer.get_start_lsn() <= lsn { - return Some(open_layer); - } - } - - self.historic.search(lsn) - } - - pub fn newer_image_layer_exists(&self, lsn: Lsn, disk_consistent_lsn: Lsn) -> bool { - // We only check on-disk layers, because - // in-memory layers are not durable - - // The end-LSN is exclusive, while disk_consistent_lsn is - // inclusive. For example, if disk_consistent_lsn is 100, it is - // OK for a delta layer to have end LSN 101, but if the end LSN - // is 102, then it might not have been fully flushed to disk - // before crash. - self.historic - .iter_newer(lsn) - .any(|layer| !layer.is_incremental() && layer.get_end_lsn() <= disk_consistent_lsn + 1) - } - - // Set new open layer for a SegEntry. - // It's ok to rewrite previous open layer, - // but only if it is not writeable anymore. - pub fn update_open(&mut self, layer: Arc) -> LayerId { - if let Some(prev_open_layer_id) = &self.open_layer_id { - if let Some(prev_open_layer) = GLOBAL_LAYER_MAP.read().unwrap().get(prev_open_layer_id) - { - assert!(!prev_open_layer.is_writeable()); - } - } - let open_layer_id = GLOBAL_LAYER_MAP.write().unwrap().insert(layer); - self.open_layer_id = Some(open_layer_id); - open_layer_id - } - - pub fn insert_historic(&mut self, layer: Arc) { - self.historic.insert(layer); - } -} - -/// Entry held in LayerMap::open_layers, with boilerplate comparison routines -/// to implement a min-heap ordered by 'oldest_lsn' and 'generation' -/// -/// The generation number associated with each entry can be used to distinguish -/// recently-added entries (i.e after last call to increment_generation()) from older -/// entries with the same 'oldest_lsn'. 
-struct OpenLayerEntry { - oldest_lsn: Lsn, // copy of layer.get_oldest_lsn() - generation: u64, - layer_id: LayerId, -} -impl Ord for OpenLayerEntry { - fn cmp(&self, other: &Self) -> Ordering { - // BinaryHeap is a max-heap, and we want a min-heap. Reverse the ordering here - // to get that. Entries with identical oldest_lsn are ordered by generation - other - .oldest_lsn - .cmp(&self.oldest_lsn) - .then_with(|| other.generation.cmp(&self.generation)) - } -} -impl PartialOrd for OpenLayerEntry { - fn partial_cmp(&self, other: &Self) -> Option { - Some(self.cmp(other)) - } -} -impl PartialEq for OpenLayerEntry { - fn eq(&self, other: &Self) -> bool { - self.cmp(other) == Ordering::Equal - } -} -impl Eq for OpenLayerEntry {} - -/// Iterator returned by LayerMap::iter_historic_layers() -pub struct HistoricLayerIter<'a> { - seg_iter: std::collections::hash_map::Iter<'a, SegmentTag, SegEntry>, - iter: Option>, -} - -impl<'a> Iterator for HistoricLayerIter<'a> { - type Item = Arc; - - fn next(&mut self) -> std::option::Option<::Item> { - loop { - if let Some(x) = &mut self.iter { - if let Some(x) = x.next() { - return Some(Arc::clone(&x)); - } - } - if let Some((_tag, segentry)) = self.seg_iter.next() { - self.iter = Some(segentry.historic.iter()); - continue; - } else { - return None; - } - } - } -} - -#[cfg(test)] -mod tests { - use super::*; - use crate::config::PageServerConf; - use std::str::FromStr; - use zenith_utils::zid::{ZTenantId, ZTimelineId}; - - /// Arbitrary relation tag, for testing. - const TESTREL_A: RelishTag = RelishTag::Relation(RelTag { - spcnode: 0, - dbnode: 111, - relnode: 1000, - forknum: 0, - }); - - lazy_static! 
{ - static ref DUMMY_TIMELINEID: ZTimelineId = - ZTimelineId::from_str("00000000000000000000000000000000").unwrap(); - static ref DUMMY_TENANTID: ZTenantId = - ZTenantId::from_str("00000000000000000000000000000000").unwrap(); - } - - /// Construct a dummy InMemoryLayer for testing - fn dummy_inmem_layer( - conf: &'static PageServerConf, - segno: u32, - start_lsn: Lsn, - oldest_lsn: Lsn, - ) -> Arc { - Arc::new( - InMemoryLayer::create( - conf, - *DUMMY_TIMELINEID, - *DUMMY_TENANTID, - SegmentTag { - rel: TESTREL_A, - segno, - }, - start_lsn, - oldest_lsn, - ) - .unwrap(), - ) - } - - #[test] - fn test_open_layers() -> Result<()> { - let conf = PageServerConf::dummy_conf(PageServerConf::test_repo_dir("dummy_inmem_layer")); - let conf = Box::leak(Box::new(conf)); - std::fs::create_dir_all(conf.timeline_path(&DUMMY_TIMELINEID, &DUMMY_TENANTID))?; - - let mut layers = LayerMap::default(); - - let gen1 = layers.increment_generation(); - layers.insert_open(dummy_inmem_layer(conf, 0, Lsn(0x100), Lsn(0x100))); - layers.insert_open(dummy_inmem_layer(conf, 1, Lsn(0x100), Lsn(0x200))); - layers.insert_open(dummy_inmem_layer(conf, 2, Lsn(0x100), Lsn(0x120))); - layers.insert_open(dummy_inmem_layer(conf, 3, Lsn(0x100), Lsn(0x110))); - - let gen2 = layers.increment_generation(); - layers.insert_open(dummy_inmem_layer(conf, 4, Lsn(0x100), Lsn(0x110))); - layers.insert_open(dummy_inmem_layer(conf, 5, Lsn(0x100), Lsn(0x100))); - - // A helper function (closure) to pop the next oldest open entry from the layer map, - // and assert that it is what we'd expect - let mut assert_pop_layer = |expected_segno: u32, expected_generation: u64| { - let (layer_id, l, generation) = layers.peek_oldest_open().unwrap(); - assert!(l.get_seg_tag().segno == expected_segno); - assert!(generation == expected_generation); - layers.remove_open(layer_id); - }; - - assert_pop_layer(0, gen1); // 0x100 - assert_pop_layer(5, gen2); // 0x100 - assert_pop_layer(3, gen1); // 0x110 - assert_pop_layer(4, gen2); // 
0x110 - assert_pop_layer(2, gen1); // 0x120 - assert_pop_layer(1, gen1); // 0x200 - - Ok(()) - } -} diff --git a/pageserver/src/layered_repository/metadata.rs b/pageserver/src/layered_repository/metadata.rs index 17e0485093..7daf899ba2 100644 --- a/pageserver/src/layered_repository/metadata.rs +++ b/pageserver/src/layered_repository/metadata.rs @@ -6,9 +6,10 @@ //! //! The module contains all structs and related helper methods related to timeline metadata. -use std::{convert::TryInto, path::PathBuf}; +use std::path::PathBuf; use anyhow::ensure; +use serde::{Deserialize, Serialize}; use zenith_utils::{ bin_ser::BeSer, lsn::Lsn, @@ -16,11 +17,13 @@ use zenith_utils::{ }; use crate::config::PageServerConf; +use crate::STORAGE_FORMAT_VERSION; -// Taken from PG_CONTROL_MAX_SAFE_SIZE -const METADATA_MAX_SAFE_SIZE: usize = 512; -const METADATA_CHECKSUM_SIZE: usize = std::mem::size_of::<u32>(); -const METADATA_MAX_DATA_SIZE: usize = METADATA_MAX_SAFE_SIZE - METADATA_CHECKSUM_SIZE; +/// We assume that a write of up to METADATA_MAX_SIZE bytes is atomic. +/// +/// This is the same assumption that PostgreSQL makes with the control file, +/// see PG_CONTROL_MAX_SAFE_SIZE +const METADATA_MAX_SIZE: usize = 512; /// The name of the metadata file pageserver creates per timeline. pub const METADATA_FILE_NAME: &str = "metadata"; @@ -30,6 +33,20 @@ pub const METADATA_FILE_NAME: &str = "metadata"; /// The fields correspond to the values we hold in memory, in LayeredTimeline.
#[derive(Debug, Clone, PartialEq, Eq)] pub struct TimelineMetadata { + hdr: TimelineMetadataHeader, + body: TimelineMetadataBody, +} + +#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)] +struct TimelineMetadataHeader { + checksum: u32, // CRC of serialized metadata body + size: u16, // size of serialized metadata + format_version: u16, // storage format version (used for compatibility checks) +} +const METADATA_HDR_SIZE: usize = std::mem::size_of::<TimelineMetadataHeader>(); + +#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)] +struct TimelineMetadataBody { disk_consistent_lsn: Lsn, // This is only set if we know it. We track it in memory when the page // server is running, but we only track the value corresponding to @@ -69,130 +86,90 @@ impl TimelineMetadata { initdb_lsn: Lsn, ) -> Self { Self { - disk_consistent_lsn, - prev_record_lsn, - ancestor_timeline, - ancestor_lsn, - latest_gc_cutoff_lsn, - initdb_lsn, + hdr: TimelineMetadataHeader { + checksum: 0, + size: 0, + format_version: STORAGE_FORMAT_VERSION, + }, + body: TimelineMetadataBody { + disk_consistent_lsn, + prev_record_lsn, + ancestor_timeline, + ancestor_lsn, + latest_gc_cutoff_lsn, + initdb_lsn, + }, } } pub fn from_bytes(metadata_bytes: &[u8]) -> anyhow::Result<Self> { ensure!( - metadata_bytes.len() == METADATA_MAX_SAFE_SIZE, + metadata_bytes.len() == METADATA_MAX_SIZE, "metadata bytes size is wrong" ); - - let data = &metadata_bytes[..METADATA_MAX_DATA_SIZE]; - let calculated_checksum = crc32c::crc32c(data); - - let checksum_bytes: &[u8; METADATA_CHECKSUM_SIZE] = - metadata_bytes[METADATA_MAX_DATA_SIZE..].try_into()?; - let expected_checksum = u32::from_le_bytes(*checksum_bytes); + let hdr = TimelineMetadataHeader::des(&metadata_bytes[0..METADATA_HDR_SIZE])?; ensure!( - calculated_checksum == expected_checksum, + hdr.format_version == STORAGE_FORMAT_VERSION, + "format version mismatch" + ); + let metadata_size = hdr.size as usize; + ensure!( + metadata_size <= METADATA_MAX_SIZE, + "corrupted metadata
file" + ); + let calculated_checksum = crc32c::crc32c(&metadata_bytes[METADATA_HDR_SIZE..metadata_size]); + ensure!( + hdr.checksum == calculated_checksum, "metadata checksum mismatch" ); + let body = TimelineMetadataBody::des(&metadata_bytes[METADATA_HDR_SIZE..metadata_size])?; + ensure!( + body.disk_consistent_lsn.is_aligned(), + "disk_consistent_lsn is not aligned" + ); - let data = TimelineMetadata::from(serialize::DeTimelineMetadata::des_prefix(data)?); - ensure!(data.disk_consistent_lsn.is_aligned()); - - Ok(data) + Ok(TimelineMetadata { hdr, body }) } pub fn to_bytes(&self) -> anyhow::Result<Vec<u8>> { - let serializeable_metadata = serialize::SeTimelineMetadata::from(self); - let mut metadata_bytes = serialize::SeTimelineMetadata::ser(&serializeable_metadata)?; - ensure!(metadata_bytes.len() <= METADATA_MAX_DATA_SIZE); - metadata_bytes.resize(METADATA_MAX_SAFE_SIZE, 0u8); - - let checksum = crc32c::crc32c(&metadata_bytes[..METADATA_MAX_DATA_SIZE]); - metadata_bytes[METADATA_MAX_DATA_SIZE..].copy_from_slice(&u32::to_le_bytes(checksum)); + let body_bytes = self.body.ser()?; + let metadata_size = METADATA_HDR_SIZE + body_bytes.len(); + let hdr = TimelineMetadataHeader { + size: metadata_size as u16, + format_version: STORAGE_FORMAT_VERSION, + checksum: crc32c::crc32c(&body_bytes), + }; + let hdr_bytes = hdr.ser()?; + let mut metadata_bytes = vec![0u8; METADATA_MAX_SIZE]; + metadata_bytes[0..METADATA_HDR_SIZE].copy_from_slice(&hdr_bytes); + metadata_bytes[METADATA_HDR_SIZE..metadata_size].copy_from_slice(&body_bytes); Ok(metadata_bytes) } /// [`Lsn`] that corresponds to the corresponding timeline directory /// contents, stored locally in the pageserver workdir.
pub fn disk_consistent_lsn(&self) -> Lsn { - self.disk_consistent_lsn + self.body.disk_consistent_lsn } pub fn prev_record_lsn(&self) -> Option<Lsn> { - self.prev_record_lsn + self.body.prev_record_lsn } pub fn ancestor_timeline(&self) -> Option<ZTimelineId> { - self.ancestor_timeline + self.body.ancestor_timeline } pub fn ancestor_lsn(&self) -> Lsn { - self.ancestor_lsn + self.body.ancestor_lsn } pub fn latest_gc_cutoff_lsn(&self) -> Lsn { - self.latest_gc_cutoff_lsn + self.body.latest_gc_cutoff_lsn } pub fn initdb_lsn(&self) -> Lsn { - self.initdb_lsn - } -} - -/// This module is for direct conversion of metadata to bytes and back. -/// For a certain metadata, besides the conversion a few verification steps has to -/// be done, so all serde derives are hidden from the user, to avoid accidental -/// verification-less metadata creation. -mod serialize { - use serde::{Deserialize, Serialize}; - use zenith_utils::{lsn::Lsn, zid::ZTimelineId}; - - use super::TimelineMetadata; - - #[derive(Serialize)] - pub(super) struct SeTimelineMetadata<'a> { - disk_consistent_lsn: &'a Lsn, - prev_record_lsn: &'a Option<Lsn>, - ancestor_timeline: &'a Option<ZTimelineId>, - ancestor_lsn: &'a Lsn, - latest_gc_cutoff_lsn: &'a Lsn, - initdb_lsn: &'a Lsn, - } - - impl<'a> From<&'a TimelineMetadata> for SeTimelineMetadata<'a> { - fn from(other: &'a TimelineMetadata) -> Self { - Self { - disk_consistent_lsn: &other.disk_consistent_lsn, - prev_record_lsn: &other.prev_record_lsn, - ancestor_timeline: &other.ancestor_timeline, - ancestor_lsn: &other.ancestor_lsn, - latest_gc_cutoff_lsn: &other.latest_gc_cutoff_lsn, - initdb_lsn: &other.initdb_lsn, - } - } - } - - #[derive(Deserialize)] - pub(super) struct DeTimelineMetadata { - disk_consistent_lsn: Lsn, - prev_record_lsn: Option<Lsn>, - ancestor_timeline: Option<ZTimelineId>, - ancestor_lsn: Lsn, - latest_gc_cutoff_lsn: Lsn, - initdb_lsn: Lsn, - } - - impl From<DeTimelineMetadata> for TimelineMetadata { - fn from(other: DeTimelineMetadata) -> Self { - Self { - disk_consistent_lsn: other.disk_consistent_lsn, -
prev_record_lsn: other.prev_record_lsn, - ancestor_timeline: other.ancestor_timeline, - ancestor_lsn: other.ancestor_lsn, - latest_gc_cutoff_lsn: other.latest_gc_cutoff_lsn, - initdb_lsn: other.initdb_lsn, - } - } + self.body.initdb_lsn } } @@ -204,14 +181,14 @@ mod tests { #[test] fn metadata_serializes_correctly() { - let original_metadata = TimelineMetadata { - disk_consistent_lsn: Lsn(0x200), - prev_record_lsn: Some(Lsn(0x100)), - ancestor_timeline: Some(TIMELINE_ID), - ancestor_lsn: Lsn(0), - latest_gc_cutoff_lsn: Lsn(0), - initdb_lsn: Lsn(0), - }; + let original_metadata = TimelineMetadata::new( + Lsn(0x200), + Some(Lsn(0x100)), + Some(TIMELINE_ID), + Lsn(0), + Lsn(0), + Lsn(0), + ); let metadata_bytes = original_metadata .to_bytes() @@ -221,7 +198,7 @@ mod tests { .expect("Should deserialize its own bytes"); assert_eq!( - deserialized_metadata, original_metadata, + deserialized_metadata.body, original_metadata.body, "Metadata that was serialized to bytes and deserialized back should not change" ); } diff --git a/pageserver/src/layered_repository/storage_layer.rs b/pageserver/src/layered_repository/storage_layer.rs index 8976491fc0..de34545980 100644 --- a/pageserver/src/layered_repository/storage_layer.rs +++ b/pageserver/src/layered_repository/storage_layer.rs @@ -2,139 +2,102 @@ //! Common traits and structs for layers //! -use crate::relish::RelishTag; -use crate::repository::{BlockNumber, ZenithWalRecord}; +use crate::repository::{Key, Value}; +use crate::walrecord::ZenithWalRecord; use crate::{ZTenantId, ZTimelineId}; use anyhow::Result; use bytes::Bytes; use serde::{Deserialize, Serialize}; -use std::fmt; +use std::ops::Range; use std::path::PathBuf; use zenith_utils::lsn::Lsn; -// Size of one segment in pages (10 MB) -pub const RELISH_SEG_SIZE: u32 = 10 * 1024 * 1024 / 8192; - -/// -/// Each relish stored in the repository is divided into fixed-sized "segments", -/// with 10 MB of key-space, or 1280 8k pages each. 
-/// -#[derive(Debug, PartialEq, Eq, PartialOrd, Hash, Ord, Clone, Copy, Serialize, Deserialize)] -pub struct SegmentTag { - pub rel: RelishTag, - pub segno: u32, -} - -/// SegmentBlk represents a block number within a segment, or the size of segment. -/// -/// This is separate from BlockNumber, which is used for block number within the -/// whole relish. Since this is just a type alias, the compiler will let you mix -/// them freely, but we use the type alias as documentation to make it clear -/// which one we're dealing with. -/// -/// (We could turn this into "struct SegmentBlk(u32)" to forbid accidentally -/// assigning a BlockNumber to SegmentBlk or vice versa, but that makes -/// operations more verbose). -pub type SegmentBlk = u32; - -impl fmt::Display for SegmentTag { - fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { - write!(f, "{}.{}", self.rel, self.segno) +pub fn range_overlaps<T>(a: &Range<T>, b: &Range<T>) -> bool +where + T: PartialOrd, +{ + if a.start < b.start { + a.end > b.start + } else { + b.end > a.start } } -impl SegmentTag { - /// Given a relish and block number, calculate the corresponding segment and - /// block number within the segment. - pub const fn from_blknum(rel: RelishTag, blknum: BlockNumber) -> (SegmentTag, SegmentBlk) { - ( - SegmentTag { - rel, - segno: blknum / RELISH_SEG_SIZE, - }, - blknum % RELISH_SEG_SIZE, - ) - } +pub fn range_eq<T>(a: &Range<T>, b: &Range<T>) -> bool +where + T: PartialEq, +{ + a.start == b.start && a.end == b.end } +/// Struct used to communicate across calls to 'get_value_reconstruct_data'. /// -/// Represents a version of a page at a specific LSN. The LSN is the key of the -/// entry in the 'page_versions' hash, it is not duplicated here. +/// Before first call, you can fill in 'page_img' if you have an older cached +/// version of the page available.
That can save work in +/// 'get_value_reconstruct_data', as it can stop searching for page versions +/// when all the WAL records going back to the cached image have been collected. /// -/// A page version can be stored as a full page image, or as WAL record that needs -/// to be applied over the previous page version to reconstruct this version. -#[derive(Debug, Clone, Serialize, Deserialize)] -pub enum PageVersion { - Page(Bytes), - Wal(ZenithWalRecord), -} - -/// -/// Struct used to communicate across calls to 'get_page_reconstruct_data'. -/// -/// Before first call to get_page_reconstruct_data, you can fill in 'page_img' -/// if you have an older cached version of the page available. That can save -/// work in 'get_page_reconstruct_data', as it can stop searching for page -/// versions when all the WAL records going back to the cached image have been -/// collected. -/// -/// When get_page_reconstruct_data returns Complete, 'page_img' is set to an -/// image of the page, or the oldest WAL record in 'records' is a will_init-type +/// When get_value_reconstruct_data returns Complete, 'img' is set to an image +/// of the page, or the oldest WAL record in 'records' is a will_init-type /// record that initializes the page without requiring a previous image. /// /// If 'get_page_reconstruct_data' returns Continue, some 'records' may have /// been collected, but there are more records outside the current layer. Pass -/// the same PageReconstructData struct in the next 'get_page_reconstruct_data' +/// the same ValueReconstructState struct in the next 'get_value_reconstruct_data' /// call, to collect more records. 
/// -pub struct PageReconstructData { +#[derive(Debug)] +pub struct ValueReconstructState { pub records: Vec<(Lsn, ZenithWalRecord)>, - pub page_img: Option<(Lsn, Bytes)>, + pub img: Option<(Lsn, Bytes)>, } /// Return value from Layer::get_page_reconstruct_data -pub enum PageReconstructResult { +#[derive(Clone, Copy, Debug)] +pub enum ValueReconstructResult { /// Got all the data needed to reconstruct the requested page Complete, /// This layer didn't contain all the required data, the caller should look up /// the predecessor layer at the returned LSN and collect more data from there. - Continue(Lsn), + Continue, + /// This layer didn't contain data needed to reconstruct the page version at /// the returned LSN. This is usually considered an error, but might be OK /// in some circumstances. - Missing(Lsn), + Missing, } +/// A Layer contains all data in a "rectangle" consisting of a range of keys and +/// range of LSNs. /// -/// A Layer corresponds to one RELISH_SEG_SIZE slice of a relish in a range of LSNs. /// There are two kinds of layers, in-memory and on-disk layers. In-memory -/// layers are used to ingest incoming WAL, and provide fast access -/// to the recent page versions. On-disk layers are stored as files on disk, and -/// are immutable. This trait presents the common functionality of -/// in-memory and on-disk layers. +/// layers are used to ingest incoming WAL, and provide fast access to the +/// recent page versions. On-disk layers are stored as files on disk, and are +/// immutable. This trait presents the common functionality of in-memory and +/// on-disk layers. +/// +/// Furthermore, there are two kinds of on-disk layers: delta and image layers. +/// A delta layer contains all modifications within a range of LSNs and keys. 
+/// An image layer is a snapshot of all the data in a key-range, at a single +/// LSN /// pub trait Layer: Send + Sync { fn get_tenant_id(&self) -> ZTenantId; - /// Identify the timeline this relish belongs to + /// Identify the timeline this layer belongs to fn get_timeline_id(&self) -> ZTimelineId; - /// Identify the relish segment - fn get_seg_tag(&self) -> SegmentTag; + /// Range of segments that this layer covers + fn get_key_range(&self) -> Range; /// Inclusive start bound of the LSN range that this layer holds - fn get_start_lsn(&self) -> Lsn; - /// Exclusive end bound of the LSN range that this layer holds. /// /// - For an open in-memory layer, this is MAX_LSN. /// - For a frozen in-memory layer or a delta layer, this is a valid end bound. /// - An image layer represents snapshot at one LSN, so end_lsn is always the snapshot LSN + 1 - fn get_end_lsn(&self) -> Lsn; - - /// Is the segment represented by this layer dropped by PostgreSQL? - fn is_dropped(&self) -> bool; + fn get_lsn_range(&self) -> Range; /// Filename used to store this layer on disk. (Even in-memory layers /// implement this, to print a handy unique identifier for the layer for @@ -153,18 +116,12 @@ pub trait Layer: Send + Sync { /// is available. If this returns PageReconstructResult::Continue, look up /// the predecessor layer and call again with the same 'reconstruct_data' to /// collect more data. - fn get_page_reconstruct_data( + fn get_value_reconstruct_data( &self, - blknum: SegmentBlk, - lsn: Lsn, - reconstruct_data: &mut PageReconstructData, - ) -> Result; - - /// Return size of the segment at given LSN. (Only for blocky relations.) - fn get_seg_size(&self, lsn: Lsn) -> Result; - - /// Does the segment exist at given LSN? Or was it dropped before it. 
- fn get_seg_exists(&self, lsn: Lsn) -> Result; + key: Key, + lsn_range: Range, + reconstruct_data: &mut ValueReconstructState, + ) -> Result; /// Does this layer only contain some data for the segment (incremental), /// or does it contain a version of every page? This is important to know @@ -175,6 +132,9 @@ pub trait Layer: Send + Sync { /// Returns true for layers that are represented in memory. fn is_in_memory(&self) -> bool; + /// Iterate through all keys and values stored in the layer + fn iter(&self) -> Box> + '_>; + /// Release memory used by this layer. There is no corresponding 'load' /// function, that's done implicitly when you call one of the get-functions. fn unload(&self) -> Result<()>; @@ -185,3 +145,36 @@ pub trait Layer: Send + Sync { /// Dump summary of the contents of the layer to stdout fn dump(&self) -> Result<()>; } + +// Flag indicating that this version initialize the page +const WILL_INIT: u64 = 1; + +/// +/// Struct representing reference to BLOB in layers. Reference contains BLOB offset and size. +/// For WAL records (delta layer) it also contains `will_init` flag which helps to determine range of records +/// which needs to be applied without reading/deserializing records themselves. 
+/// +#[derive(Debug, Serialize, Deserialize, Copy, Clone)] +pub struct BlobRef(u64); + +impl BlobRef { + pub fn will_init(&self) -> bool { + (self.0 & WILL_INIT) != 0 + } + + pub fn pos(&self) -> u64 { + self.0 >> 32 + } + + pub fn size(&self) -> usize { + ((self.0 & 0xFFFFFFFF) >> 1) as usize + } + + pub fn new(pos: u64, size: usize, will_init: bool) -> BlobRef { + let mut blob_ref = (pos << 32) | ((size as u64) << 1); + if will_init { + blob_ref |= WILL_INIT; + } + BlobRef(blob_ref) + } +} diff --git a/pageserver/src/lib.rs b/pageserver/src/lib.rs index 060fa54b23..4790ab6652 100644 --- a/pageserver/src/lib.rs +++ b/pageserver/src/lib.rs @@ -2,10 +2,12 @@ pub mod basebackup; pub mod config; pub mod http; pub mod import_datadir; +pub mod keyspace; pub mod layered_repository; pub mod page_cache; pub mod page_service; -pub mod relish; +pub mod pgdatadir_mapping; +pub mod reltag; pub mod remote_storage; pub mod repository; pub mod tenant_mgr; @@ -28,6 +30,20 @@ use zenith_utils::{ use crate::thread_mgr::ThreadKind; +use layered_repository::LayeredRepository; +use pgdatadir_mapping::DatadirTimeline; + +/// Current storage format version +/// +/// This is embedded in the metadata file, and also in the header of all the +/// layer files. If you make any backwards-incompatible changes to the storage +/// format, bump this! +pub const STORAGE_FORMAT_VERSION: u16 = 1; + +// Magic constants used to identify different kinds of files +pub const IMAGE_FILE_MAGIC: u32 = 0x5A60_0000 | STORAGE_FORMAT_VERSION as u32; +pub const DELTA_FILE_MAGIC: u32 = 0x5A61_0000 | STORAGE_FORMAT_VERSION as u32; + lazy_static! 
{ static ref LIVE_CONNECTIONS_COUNT: IntGaugeVec = register_int_gauge_vec!( "pageserver_live_connections_count", @@ -42,14 +58,16 @@ pub const LOG_FILE_NAME: &str = "pageserver.log"; /// Config for the Repository checkpointer #[derive(Debug, Clone, Copy)] pub enum CheckpointConfig { - // Flush in-memory data that is older than this - Distance(u64), // Flush all in-memory data Flush, // Flush all in-memory data and reconstruct all page images Forced, } +pub type RepositoryImpl = LayeredRepository; + +pub type DatadirTimelineImpl = DatadirTimeline; + pub fn shutdown_pageserver() { // Shut down the libpq endpoint thread. This prevents new connections from // being accepted. diff --git a/pageserver/src/page_cache.rs b/pageserver/src/page_cache.rs index ef802ba0e2..299575f792 100644 --- a/pageserver/src/page_cache.rs +++ b/pageserver/src/page_cache.rs @@ -53,7 +53,7 @@ use zenith_utils::{ }; use crate::layered_repository::writeback_ephemeral_file; -use crate::relish::RelTag; +use crate::repository::Key; static PAGE_CACHE: OnceCell = OnceCell::new(); const TEST_PAGE_CACHE_SIZE: usize = 10; @@ -105,8 +105,7 @@ enum CacheKey { struct MaterializedPageHashKey { tenant_id: ZTenantId, timeline_id: ZTimelineId, - rel_tag: RelTag, - blknum: u32, + key: Key, } #[derive(Clone)] @@ -291,16 +290,14 @@ impl PageCache { &self, tenant_id: ZTenantId, timeline_id: ZTimelineId, - rel_tag: RelTag, - blknum: u32, + key: &Key, lsn: Lsn, ) -> Option<(Lsn, PageReadGuard)> { let mut cache_key = CacheKey::MaterializedPage { hash_key: MaterializedPageHashKey { tenant_id, timeline_id, - rel_tag, - blknum, + key: *key, }, lsn, }; @@ -323,8 +320,7 @@ impl PageCache { &self, tenant_id: ZTenantId, timeline_id: ZTimelineId, - rel_tag: RelTag, - blknum: u32, + key: Key, lsn: Lsn, img: &[u8], ) { @@ -332,8 +328,7 @@ impl PageCache { hash_key: MaterializedPageHashKey { tenant_id, timeline_id, - rel_tag, - blknum, + key, }, lsn, }; diff --git a/pageserver/src/page_service.rs 
b/pageserver/src/page_service.rs index 4744f0fe52..43e1ec275d 100644 --- a/pageserver/src/page_service.rs +++ b/pageserver/src/page_service.rs @@ -32,7 +32,9 @@ use zenith_utils::zid::{ZTenantId, ZTimelineId}; use crate::basebackup; use crate::config::PageServerConf; -use crate::relish::*; +use crate::pgdatadir_mapping::DatadirTimeline; +use crate::reltag::RelTag; +use crate::repository::Repository; use crate::repository::Timeline; use crate::tenant_mgr; use crate::thread_mgr; @@ -398,8 +400,8 @@ impl PageServerHandler { /// In either case, if the page server hasn't received the WAL up to the /// requested LSN yet, we will wait for it to arrive. The return value is /// the LSN that should be used to look up the page versions. - fn wait_or_get_last_lsn( - timeline: &dyn Timeline, + fn wait_or_get_last_lsn( + timeline: &DatadirTimeline, mut lsn: Lsn, latest: bool, latest_gc_cutoff_lsn: &RwLockReadGuard, @@ -426,7 +428,7 @@ impl PageServerHandler { if lsn <= last_record_lsn { lsn = last_record_lsn; } else { - timeline.wait_lsn(lsn)?; + timeline.tline.wait_lsn(lsn)?; // Since we waited for 'lsn' to arrive, that is now the last // record LSN. 
(Or close enough for our purposes; the // last-record LSN can advance immediately after we return @@ -436,7 +438,7 @@ impl PageServerHandler { if lsn == Lsn(0) { bail!("invalid LSN(0) in request"); } - timeline.wait_lsn(lsn)?; + timeline.tline.wait_lsn(lsn)?; } ensure!( lsn >= **latest_gc_cutoff_lsn, @@ -446,54 +448,47 @@ impl PageServerHandler { Ok(lsn) } - fn handle_get_rel_exists_request( + fn handle_get_rel_exists_request( &self, - timeline: &dyn Timeline, + timeline: &DatadirTimeline, req: &PagestreamExistsRequest, ) -> Result { let _enter = info_span!("get_rel_exists", rel = %req.rel, req_lsn = %req.lsn).entered(); - let tag = RelishTag::Relation(req.rel); - let latest_gc_cutoff_lsn = timeline.get_latest_gc_cutoff_lsn(); + let latest_gc_cutoff_lsn = timeline.tline.get_latest_gc_cutoff_lsn(); let lsn = Self::wait_or_get_last_lsn(timeline, req.lsn, req.latest, &latest_gc_cutoff_lsn)?; - let exists = timeline.get_rel_exists(tag, lsn)?; + let exists = timeline.get_rel_exists(req.rel, lsn)?; Ok(PagestreamBeMessage::Exists(PagestreamExistsResponse { exists, })) } - fn handle_get_nblocks_request( + fn handle_get_nblocks_request( &self, - timeline: &dyn Timeline, + timeline: &DatadirTimeline, req: &PagestreamNblocksRequest, ) -> Result { let _enter = info_span!("get_nblocks", rel = %req.rel, req_lsn = %req.lsn).entered(); - let tag = RelishTag::Relation(req.rel); - let latest_gc_cutoff_lsn = timeline.get_latest_gc_cutoff_lsn(); + let latest_gc_cutoff_lsn = timeline.tline.get_latest_gc_cutoff_lsn(); let lsn = Self::wait_or_get_last_lsn(timeline, req.lsn, req.latest, &latest_gc_cutoff_lsn)?; - let n_blocks = timeline.get_relish_size(tag, lsn)?; - - // Return 0 if relation is not found. - // This is what postgres smgr expects. 
- let n_blocks = n_blocks.unwrap_or(0); + let n_blocks = timeline.get_rel_size(req.rel, lsn)?; Ok(PagestreamBeMessage::Nblocks(PagestreamNblocksResponse { n_blocks, })) } - fn handle_get_page_at_lsn_request( + fn handle_get_page_at_lsn_request( &self, - timeline: &dyn Timeline, + timeline: &DatadirTimeline, req: &PagestreamGetPageRequest, ) -> Result { let _enter = info_span!("get_page", rel = %req.rel, blkno = &req.blkno, req_lsn = %req.lsn) .entered(); - let tag = RelishTag::Relation(req.rel); - let latest_gc_cutoff_lsn = timeline.get_latest_gc_cutoff_lsn(); + let latest_gc_cutoff_lsn = timeline.tline.get_latest_gc_cutoff_lsn(); let lsn = Self::wait_or_get_last_lsn(timeline, req.lsn, req.latest, &latest_gc_cutoff_lsn)?; /* // Add a 1s delay to some requests. The delayed causes the requests to @@ -503,7 +498,7 @@ impl PageServerHandler { std::thread::sleep(std::time::Duration::from_millis(1000)); } */ - let page = timeline.get_page_at_lsn(tag, req.blkno, lsn)?; + let page = timeline.get_rel_page_at_lsn(req.rel, req.blkno, lsn)?; Ok(PagestreamBeMessage::GetPage(PagestreamGetPageResponse { page, @@ -523,7 +518,7 @@ impl PageServerHandler { // check that the timeline exists let timeline = tenant_mgr::get_timeline_for_tenant_load(tenantid, timelineid) .context("Cannot load local timeline")?; - let latest_gc_cutoff_lsn = timeline.get_latest_gc_cutoff_lsn(); + let latest_gc_cutoff_lsn = timeline.tline.get_latest_gc_cutoff_lsn(); if let Some(lsn) = lsn { timeline .check_lsn_is_in_scope(lsn, &latest_gc_cutoff_lsn) @@ -701,67 +696,19 @@ impl postgres_backend::Handler for PageServerHandler { let repo = tenant_mgr::get_repository_for_tenant(tenantid)?; let result = repo.gc_iteration(Some(timelineid), gc_horizon, true)?; pgb.write_message_noflush(&BeMessage::RowDescription(&[ - RowDescriptor::int8_col(b"layer_relfiles_total"), - RowDescriptor::int8_col(b"layer_relfiles_needed_by_cutoff"), - RowDescriptor::int8_col(b"layer_relfiles_needed_by_branches"), - 
RowDescriptor::int8_col(b"layer_relfiles_not_updated"), - RowDescriptor::int8_col(b"layer_relfiles_needed_as_tombstone"), - RowDescriptor::int8_col(b"layer_relfiles_removed"), - RowDescriptor::int8_col(b"layer_relfiles_dropped"), - RowDescriptor::int8_col(b"layer_nonrelfiles_total"), - RowDescriptor::int8_col(b"layer_nonrelfiles_needed_by_cutoff"), - RowDescriptor::int8_col(b"layer_nonrelfiles_needed_by_branches"), - RowDescriptor::int8_col(b"layer_nonrelfiles_not_updated"), - RowDescriptor::int8_col(b"layer_nonrelfiles_needed_as_tombstone"), - RowDescriptor::int8_col(b"layer_nonrelfiles_removed"), - RowDescriptor::int8_col(b"layer_nonrelfiles_dropped"), + RowDescriptor::int8_col(b"layers_total"), + RowDescriptor::int8_col(b"layers_needed_by_cutoff"), + RowDescriptor::int8_col(b"layers_needed_by_branches"), + RowDescriptor::int8_col(b"layers_not_updated"), + RowDescriptor::int8_col(b"layers_removed"), RowDescriptor::int8_col(b"elapsed"), ]))? .write_message_noflush(&BeMessage::DataRow(&[ - Some(result.ondisk_relfiles_total.to_string().as_bytes()), - Some( - result - .ondisk_relfiles_needed_by_cutoff - .to_string() - .as_bytes(), - ), - Some( - result - .ondisk_relfiles_needed_by_branches - .to_string() - .as_bytes(), - ), - Some(result.ondisk_relfiles_not_updated.to_string().as_bytes()), - Some( - result - .ondisk_relfiles_needed_as_tombstone - .to_string() - .as_bytes(), - ), - Some(result.ondisk_relfiles_removed.to_string().as_bytes()), - Some(result.ondisk_relfiles_dropped.to_string().as_bytes()), - Some(result.ondisk_nonrelfiles_total.to_string().as_bytes()), - Some( - result - .ondisk_nonrelfiles_needed_by_cutoff - .to_string() - .as_bytes(), - ), - Some( - result - .ondisk_nonrelfiles_needed_by_branches - .to_string() - .as_bytes(), - ), - Some(result.ondisk_nonrelfiles_not_updated.to_string().as_bytes()), - Some( - result - .ondisk_nonrelfiles_needed_as_tombstone - .to_string() - .as_bytes(), - ), - 
Some(result.ondisk_nonrelfiles_removed.to_string().as_bytes()), - Some(result.ondisk_nonrelfiles_dropped.to_string().as_bytes()), + Some(result.layers_total.to_string().as_bytes()), + Some(result.layers_needed_by_cutoff.to_string().as_bytes()), + Some(result.layers_needed_by_branches.to_string().as_bytes()), + Some(result.layers_not_updated.to_string().as_bytes()), + Some(result.layers_removed.to_string().as_bytes()), Some(result.elapsed.as_millis().to_string().as_bytes()), ]))? .write_message(&BeMessage::CommandComplete(b"SELECT 1"))?; @@ -781,7 +728,14 @@ impl postgres_backend::Handler for PageServerHandler { let timeline = tenant_mgr::get_timeline_for_tenant_load(tenantid, timelineid) .context("Cannot load local timeline")?; - timeline.checkpoint(CheckpointConfig::Forced)?; + timeline.tline.checkpoint(CheckpointConfig::Forced)?; + + // Also compact it. + // + // FIXME: This probably shouldn't be part of a "checkpoint" command, but a + // separate operation. Update the tests if you change this. + timeline.tline.compact()?; + pgb.write_message_noflush(&SINGLE_COL_ROWDESC)? .write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?; } else { diff --git a/pageserver/src/pgdatadir_mapping.rs b/pageserver/src/pgdatadir_mapping.rs new file mode 100644 index 0000000000..7b0fc606de --- /dev/null +++ b/pageserver/src/pgdatadir_mapping.rs @@ -0,0 +1,1350 @@ +//! +//! This provides an abstraction to store PostgreSQL relations and other files +//! in the key-value store that implements the Repository interface. +//! +//! (TODO: The line between PUT-functions here and walingest.rs is a bit blurry, as +//! walingest.rs handles a few things like implicit relation creation and extension. +//! Clarify that) +//! 
+use crate::keyspace::{KeySpace, KeySpaceAccum, TARGET_FILE_SIZE_BYTES}; +use crate::reltag::{RelTag, SlruKind}; +use crate::repository::*; +use crate::repository::{Repository, Timeline}; +use crate::walrecord::ZenithWalRecord; +use anyhow::{bail, ensure, Result}; +use bytes::{Buf, Bytes}; +use postgres_ffi::{pg_constants, Oid, TransactionId}; +use serde::{Deserialize, Serialize}; +use std::collections::{HashMap, HashSet}; +use std::ops::Range; +use std::sync::atomic::{AtomicIsize, Ordering}; +use std::sync::{Arc, RwLockReadGuard}; +use tracing::{debug, error, trace, warn}; +use zenith_utils::bin_ser::BeSer; +use zenith_utils::lsn::AtomicLsn; +use zenith_utils::lsn::Lsn; + +/// Block number within a relation or SLRU. This matches PostgreSQL's BlockNumber type. +pub type BlockNumber = u32; + +pub struct DatadirTimeline +where + R: Repository, +{ + /// The underlying key-value store. Callers should not read or modify the + /// data in the underlying store directly. However, it is exposed to have + /// access to information like last-LSN, ancestor, and operations like + /// compaction. + pub tline: Arc, + + /// When did we last calculate the partitioning? + last_partitioning: AtomicLsn, + + /// Configuration: how often should the partitioning be recalculated. + repartition_threshold: u64, + + /// Current logical size of the "datadir", at the last LSN. + current_logical_size: AtomicIsize, +} + +impl DatadirTimeline { + pub fn new(tline: Arc, repartition_threshold: u64) -> Self { + DatadirTimeline { + tline, + last_partitioning: AtomicLsn::new(0), + current_logical_size: AtomicIsize::new(0), + repartition_threshold, + } + } + + /// (Re-)calculate the logical size of the database at the latest LSN. + /// + /// This can be a slow operation. + pub fn init_logical_size(&self) -> Result<()> { + let last_lsn = self.tline.get_last_record_lsn(); + self.current_logical_size.store( + self.get_current_logical_size_non_incremental(last_lsn)? 
as isize, + Ordering::SeqCst, + ); + Ok(()) + } + + /// Start ingesting a WAL record, or other atomic modification of + /// the timeline. + /// + /// This provides a transaction-like interface to perform a bunch + /// of modifications atomically, all stamped with one LSN. + /// + /// To ingest a WAL record, call begin_modification(lsn) to get a + /// DatadirModification object. Use the functions in the object to + /// modify the repository state, updating all the pages and metadata + /// that the WAL record affects. When you're done, call commit() to + /// commit the changes. + /// + /// Note that any pending modifications you make through the + /// modification object won't be visible to calls to the 'get' and list + /// functions of the timeline until you finish! And if you update the + /// same page twice, the last update wins. + /// + pub fn begin_modification(&self, lsn: Lsn) -> DatadirModification { + DatadirModification { + tline: self, + lsn, + pending_updates: HashMap::new(), + pending_deletions: Vec::new(), + pending_nblocks: 0, + } + } + + //------------------------------------------------------------------------------ + // Public GET functions + //------------------------------------------------------------------------------ + + /// Look up given page version. 
+ pub fn get_rel_page_at_lsn(&self, tag: RelTag, blknum: BlockNumber, lsn: Lsn) -> Result { + ensure!(tag.relnode != 0, "invalid relnode"); + + let nblocks = self.get_rel_size(tag, lsn)?; + if blknum >= nblocks { + debug!( + "read beyond EOF at {} blk {} at {}, size is {}: returning all-zeros page", + tag, blknum, lsn, nblocks + ); + return Ok(ZERO_PAGE.clone()); + } + + let key = rel_block_to_key(tag, blknum); + self.tline.get(key, lsn) + } + + /// Get size of a relation file + pub fn get_rel_size(&self, tag: RelTag, lsn: Lsn) -> Result { + ensure!(tag.relnode != 0, "invalid relnode"); + + if (tag.forknum == pg_constants::FSM_FORKNUM + || tag.forknum == pg_constants::VISIBILITYMAP_FORKNUM) + && !self.get_rel_exists(tag, lsn)? + { + // FIXME: Postgres sometimes calls smgrcreate() to create + // FSM, and smgrnblocks() on it immediately afterwards, + // without extending it. Tolerate that by claiming that + // any non-existent FSM fork has size 0. + return Ok(0); + } + + let key = rel_size_to_key(tag); + let mut buf = self.tline.get(key, lsn)?; + Ok(buf.get_u32_le()) + } + + /// Does relation exist? + pub fn get_rel_exists(&self, tag: RelTag, lsn: Lsn) -> Result { + ensure!(tag.relnode != 0, "invalid relnode"); + + // fetch directory listing + let key = rel_dir_to_key(tag.spcnode, tag.dbnode); + let buf = self.tline.get(key, lsn)?; + let dir = RelDirectory::des(&buf)?; + + let exists = dir.rels.get(&(tag.relnode, tag.forknum)).is_some(); + + Ok(exists) + } + + /// Get a list of all existing relations in given tablespace and database. 
+ pub fn list_rels(&self, spcnode: Oid, dbnode: Oid, lsn: Lsn) -> Result> { + // fetch directory listing + let key = rel_dir_to_key(spcnode, dbnode); + let buf = self.tline.get(key, lsn)?; + let dir = RelDirectory::des(&buf)?; + + let rels: HashSet = + HashSet::from_iter(dir.rels.iter().map(|(relnode, forknum)| RelTag { + spcnode, + dbnode, + relnode: *relnode, + forknum: *forknum, + })); + + Ok(rels) + } + + /// Look up given SLRU page version. + pub fn get_slru_page_at_lsn( + &self, + kind: SlruKind, + segno: u32, + blknum: BlockNumber, + lsn: Lsn, + ) -> Result { + let key = slru_block_to_key(kind, segno, blknum); + self.tline.get(key, lsn) + } + + /// Get size of an SLRU segment + pub fn get_slru_segment_size( + &self, + kind: SlruKind, + segno: u32, + lsn: Lsn, + ) -> Result { + let key = slru_segment_size_to_key(kind, segno); + let mut buf = self.tline.get(key, lsn)?; + Ok(buf.get_u32_le()) + } + + /// Does the SLRU segment exist? + pub fn get_slru_segment_exists(&self, kind: SlruKind, segno: u32, lsn: Lsn) -> Result { + // fetch directory listing + let key = slru_dir_to_key(kind); + let buf = self.tline.get(key, lsn)?; + let dir = SlruSegmentDirectory::des(&buf)?; + + let exists = dir.segments.get(&segno).is_some(); + Ok(exists) + } + + /// Get a list of SLRU segments + pub fn list_slru_segments(&self, kind: SlruKind, lsn: Lsn) -> Result> { + // fetch directory entry + let key = slru_dir_to_key(kind); + + let buf = self.tline.get(key, lsn)?; + let dir = SlruSegmentDirectory::des(&buf)?; + + Ok(dir.segments) + } + + pub fn get_relmap_file(&self, spcnode: Oid, dbnode: Oid, lsn: Lsn) -> Result { + let key = relmap_file_key(spcnode, dbnode); + + let buf = self.tline.get(key, lsn)?; + Ok(buf) + } + + pub fn list_dbdirs(&self, lsn: Lsn) -> Result> { + // fetch directory entry + let buf = self.tline.get(DBDIR_KEY, lsn)?; + let dir = DbDirectory::des(&buf)?; + + Ok(dir.dbdirs) + } + + pub fn get_twophase_file(&self, xid: TransactionId, lsn: Lsn) -> Result { + let key
= twophase_file_key(xid); + let buf = self.tline.get(key, lsn)?; + Ok(buf) + } + + pub fn list_twophase_files(&self, lsn: Lsn) -> Result> { + // fetch directory entry + let buf = self.tline.get(TWOPHASEDIR_KEY, lsn)?; + let dir = TwoPhaseDirectory::des(&buf)?; + + Ok(dir.xids) + } + + pub fn get_control_file(&self, lsn: Lsn) -> Result { + self.tline.get(CONTROLFILE_KEY, lsn) + } + + pub fn get_checkpoint(&self, lsn: Lsn) -> Result { + self.tline.get(CHECKPOINT_KEY, lsn) + } + + /// Get the LSN of the last ingested WAL record. + /// + /// This is just a convenience wrapper that calls through to the underlying + /// repository. + pub fn get_last_record_lsn(&self) -> Lsn { + self.tline.get_last_record_lsn() + } + + /// Check that it is valid to request operations with that lsn. + /// + /// This is just a convenience wrapper that calls through to the underlying + /// repository. + pub fn check_lsn_is_in_scope( + &self, + lsn: Lsn, + latest_gc_cutoff_lsn: &RwLockReadGuard, + ) -> Result<()> { + self.tline.check_lsn_is_in_scope(lsn, latest_gc_cutoff_lsn) + } + + /// Retrieve current logical size of the timeline + /// + /// NOTE: counted incrementally, includes ancestors, + pub fn get_current_logical_size(&self) -> usize { + let current_logical_size = self.current_logical_size.load(Ordering::Acquire); + match usize::try_from(current_logical_size) { + Ok(sz) => sz, + Err(_) => { + error!( + "current_logical_size is out of range: {}", + current_logical_size + ); + 0 + } + } + } + + /// Does the same as get_current_logical_size but counted on demand. + /// Used to initialize the logical size tracking on startup. + /// + /// Only relation blocks are counted currently. That excludes metadata, + /// SLRUs, twophase files etc. 
+ pub fn get_current_logical_size_non_incremental(&self, lsn: Lsn) -> Result { + // Fetch list of database dirs and iterate them + let buf = self.tline.get(DBDIR_KEY, lsn)?; + let dbdir = DbDirectory::des(&buf)?; + + let mut total_size: usize = 0; + for (spcnode, dbnode) in dbdir.dbdirs.keys() { + for rel in self.list_rels(*spcnode, *dbnode, lsn)? { + let relsize_key = rel_size_to_key(rel); + let mut buf = self.tline.get(relsize_key, lsn)?; + let relsize = buf.get_u32_le(); + + total_size += relsize as usize; + } + } + Ok(total_size * pg_constants::BLCKSZ as usize) + } + + /// + /// Get a KeySpace that covers all the Keys that are in use at the given LSN. + /// Anything that's not listed may be removed from the underlying storage (from + /// that LSN forwards). + fn collect_keyspace(&self, lsn: Lsn) -> Result { + // Iterate through key ranges, greedily packing them into partitions + let mut result = KeySpaceAccum::new(); + + // The dbdir metadata always exists + result.add_key(DBDIR_KEY); + + // Fetch list of database dirs and iterate them + let buf = self.tline.get(DBDIR_KEY, lsn)?; + let dbdir = DbDirectory::des(&buf)?; + + let mut dbs: Vec<(Oid, Oid)> = dbdir.dbdirs.keys().cloned().collect(); + dbs.sort_unstable(); + for (spcnode, dbnode) in dbs { + result.add_key(relmap_file_key(spcnode, dbnode)); + result.add_key(rel_dir_to_key(spcnode, dbnode)); + + let mut rels: Vec = self + .list_rels(spcnode, dbnode, lsn)?
+ .iter() + .cloned() + .collect(); + rels.sort_unstable(); + for rel in rels { + let relsize_key = rel_size_to_key(rel); + let mut buf = self.tline.get(relsize_key, lsn)?; + let relsize = buf.get_u32_le(); + + result.add_range(rel_block_to_key(rel, 0)..rel_block_to_key(rel, relsize)); + result.add_key(relsize_key); + } + } + + // Iterate SLRUs next + for kind in [ + SlruKind::Clog, + SlruKind::MultiXactMembers, + SlruKind::MultiXactOffsets, + ] { + let slrudir_key = slru_dir_to_key(kind); + result.add_key(slrudir_key); + let buf = self.tline.get(slrudir_key, lsn)?; + let dir = SlruSegmentDirectory::des(&buf)?; + let mut segments: Vec = dir.segments.iter().cloned().collect(); + segments.sort_unstable(); + for segno in segments { + let segsize_key = slru_segment_size_to_key(kind, segno); + let mut buf = self.tline.get(segsize_key, lsn)?; + let segsize = buf.get_u32_le(); + + result.add_range( + slru_block_to_key(kind, segno, 0)..slru_block_to_key(kind, segno, segsize), + ); + result.add_key(segsize_key); + } + } + + // Then pg_twophase + result.add_key(TWOPHASEDIR_KEY); + let buf = self.tline.get(TWOPHASEDIR_KEY, lsn)?; + let twophase_dir = TwoPhaseDirectory::des(&buf)?; + let mut xids: Vec = twophase_dir.xids.iter().cloned().collect(); + xids.sort_unstable(); + for xid in xids { + result.add_key(twophase_file_key(xid)); + } + + result.add_key(CONTROLFILE_KEY); + result.add_key(CHECKPOINT_KEY); + + Ok(result.to_keyspace()) + } +} + +/// DatadirModification represents an operation to ingest an atomic set of +/// updates to the repository. It is created by the 'begin_modification' +/// function. It is called for each WAL record, so that all the modifications +/// by one WAL record appear atomic. +pub struct DatadirModification<'a, R: Repository> { + /// The timeline this modification applies to. You can access this to + /// read the state, but note that any pending updates are *not* reflected + /// in the state in 'tline' yet.
+ pub tline: &'a DatadirTimeline, + + lsn: Lsn, + + // The modifications are not applied directly to the underlying key-value store. + // The put-functions add the modifications here, and they are flushed to the + // underlying key-value store by the 'commit' function. + pending_updates: HashMap, + pending_deletions: Vec>, + pending_nblocks: isize, +} + +impl<'a, R: Repository> DatadirModification<'a, R> { + /// Initialize a completely new repository. + /// + /// This inserts the directory metadata entries that are assumed to + /// always exist. + pub fn init_empty(&mut self) -> Result<()> { + let buf = DbDirectory::ser(&DbDirectory { + dbdirs: HashMap::new(), + })?; + self.put(DBDIR_KEY, Value::Image(buf.into())); + + let buf = TwoPhaseDirectory::ser(&TwoPhaseDirectory { + xids: HashSet::new(), + })?; + self.put(TWOPHASEDIR_KEY, Value::Image(buf.into())); + + let buf: Bytes = SlruSegmentDirectory::ser(&SlruSegmentDirectory::default())?.into(); + let empty_dir = Value::Image(buf); + self.put(slru_dir_to_key(SlruKind::Clog), empty_dir.clone()); + self.put( + slru_dir_to_key(SlruKind::MultiXactMembers), + empty_dir.clone(), + ); + self.put(slru_dir_to_key(SlruKind::MultiXactOffsets), empty_dir); + + Ok(()) + } + + /// Put a new page version that can be constructed from a WAL record + /// + /// NOTE: this will *not* implicitly extend the relation, if the page is beyond the + /// current end-of-file. It's up to the caller to check that the relation size + /// matches the blocks inserted! + pub fn put_rel_wal_record( + &mut self, + rel: RelTag, + blknum: BlockNumber, + rec: ZenithWalRecord, + ) -> Result<()> { + ensure!(rel.relnode != 0, "invalid relnode"); + self.put(rel_block_to_key(rel, blknum), Value::WalRecord(rec)); + Ok(()) + } + + // Same, but for an SLRU.
+ pub fn put_slru_wal_record( + &mut self, + kind: SlruKind, + segno: u32, + blknum: BlockNumber, + rec: ZenithWalRecord, + ) -> Result<()> { + self.put( + slru_block_to_key(kind, segno, blknum), + Value::WalRecord(rec), + ); + Ok(()) + } + + /// Like put_rel_wal_record, but with a ready-made image of the page. + pub fn put_rel_page_image( + &mut self, + rel: RelTag, + blknum: BlockNumber, + img: Bytes, + ) -> Result<()> { + ensure!(rel.relnode != 0, "invalid relnode"); + self.put(rel_block_to_key(rel, blknum), Value::Image(img)); + Ok(()) + } + + pub fn put_slru_page_image( + &mut self, + kind: SlruKind, + segno: u32, + blknum: BlockNumber, + img: Bytes, + ) -> Result<()> { + self.put(slru_block_to_key(kind, segno, blknum), Value::Image(img)); + Ok(()) + } + + /// Store a relmapper file (pg_filenode.map) in the repository + pub fn put_relmap_file(&mut self, spcnode: Oid, dbnode: Oid, img: Bytes) -> Result<()> { + // Add it to the directory (if it doesn't exist already) + let buf = self.get(DBDIR_KEY)?; + let mut dbdir = DbDirectory::des(&buf)?; + + let r = dbdir.dbdirs.insert((spcnode, dbnode), true); + if r == None || r == Some(false) { + // The dbdir entry didn't exist, or it contained a + // 'false'. The 'insert' call already updated it with + // 'true', now write the updated 'dbdirs' map back.
+ let buf = DbDirectory::ser(&dbdir)?; + self.put(DBDIR_KEY, Value::Image(buf.into())); + } + if r == None { + // Create RelDirectory + let buf = RelDirectory::ser(&RelDirectory { + rels: HashSet::new(), + })?; + self.put( + rel_dir_to_key(spcnode, dbnode), + Value::Image(Bytes::from(buf)), + ); + } + + self.put(relmap_file_key(spcnode, dbnode), Value::Image(img)); + Ok(()) + } + + pub fn put_twophase_file(&mut self, xid: TransactionId, img: Bytes) -> Result<()> { + // Add it to the directory entry + let buf = self.get(TWOPHASEDIR_KEY)?; + let mut dir = TwoPhaseDirectory::des(&buf)?; + if !dir.xids.insert(xid) { + bail!("twophase file for xid {} already exists", xid); + } + self.put( + TWOPHASEDIR_KEY, + Value::Image(Bytes::from(TwoPhaseDirectory::ser(&dir)?)), + ); + + self.put(twophase_file_key(xid), Value::Image(img)); + Ok(()) + } + + pub fn put_control_file(&mut self, img: Bytes) -> Result<()> { + self.put(CONTROLFILE_KEY, Value::Image(img)); + Ok(()) + } + + pub fn put_checkpoint(&mut self, img: Bytes) -> Result<()> { + self.put(CHECKPOINT_KEY, Value::Image(img)); + Ok(()) + } + + pub fn drop_dbdir(&mut self, spcnode: Oid, dbnode: Oid) -> Result<()> { + // Remove entry from dbdir + let buf = self.get(DBDIR_KEY)?; + let mut dir = DbDirectory::des(&buf)?; + if dir.dbdirs.remove(&(spcnode, dbnode)).is_some() { + let buf = DbDirectory::ser(&dir)?; + self.put(DBDIR_KEY, Value::Image(buf.into())); + } else { + warn!( + "dropped dbdir for spcnode {} dbnode {} did not exist in db directory", + spcnode, dbnode + ); + } + + // FIXME: update pending_nblocks + + // Delete all relations and metadata files for the spcnode/dbnode + self.delete(dbdir_key_range(spcnode, dbnode)); + Ok(()) + } + + /// Create a relation fork. + /// + /// 'nblocks' is the initial size.
+ pub fn put_rel_creation(&mut self, rel: RelTag, nblocks: BlockNumber) -> Result<()> { + ensure!(rel.relnode != 0, "invalid relnode"); + // It's possible that this is the first rel for this db in this + // tablespace. Create the reldir entry for it if so. + let mut dbdir = DbDirectory::des(&self.get(DBDIR_KEY)?)?; + let rel_dir_key = rel_dir_to_key(rel.spcnode, rel.dbnode); + let mut rel_dir = if dbdir.dbdirs.get(&(rel.spcnode, rel.dbnode)).is_none() { + // Didn't exist. Update dbdir + dbdir.dbdirs.insert((rel.spcnode, rel.dbnode), false); + let buf = DbDirectory::ser(&dbdir)?; + self.put(DBDIR_KEY, Value::Image(buf.into())); + + // and create the RelDirectory + RelDirectory::default() + } else { + // reldir already exists, fetch it + RelDirectory::des(&self.get(rel_dir_key)?)? + }; + + // Add the new relation to the rel directory entry, and write it back + if !rel_dir.rels.insert((rel.relnode, rel.forknum)) { + bail!("rel {} already exists", rel); + } + self.put( + rel_dir_key, + Value::Image(Bytes::from(RelDirectory::ser(&rel_dir)?)), + ); + + // Put size + let size_key = rel_size_to_key(rel); + let buf = nblocks.to_le_bytes(); + self.put(size_key, Value::Image(Bytes::from(buf.to_vec()))); + + self.pending_nblocks += nblocks as isize; + + // Even if nblocks > 0, we don't insert any actual blocks here. That's up to the + // caller. + + Ok(()) + } + + /// Truncate relation + pub fn put_rel_truncation(&mut self, rel: RelTag, nblocks: BlockNumber) -> Result<()> { + ensure!(rel.relnode != 0, "invalid relnode"); + let size_key = rel_size_to_key(rel); + + // Fetch the old size first + let old_size = self.get(size_key)?.get_u32_le(); + + // Update the entry with the new size. + let buf = nblocks.to_le_bytes(); + self.put(size_key, Value::Image(Bytes::from(buf.to_vec()))); + + // Update logical database size. 
+ self.pending_nblocks -= old_size as isize - nblocks as isize; + Ok(()) + } + + /// Extend relation + pub fn put_rel_extend(&mut self, rel: RelTag, nblocks: BlockNumber) -> Result<()> { + ensure!(rel.relnode != 0, "invalid relnode"); + + // Put size + let size_key = rel_size_to_key(rel); + let old_size = self.get(size_key)?.get_u32_le(); + + let buf = nblocks.to_le_bytes(); + self.put(size_key, Value::Image(Bytes::from(buf.to_vec()))); + + self.pending_nblocks += nblocks as isize - old_size as isize; + Ok(()) + } + + /// Drop a relation. + pub fn put_rel_drop(&mut self, rel: RelTag) -> Result<()> { + ensure!(rel.relnode != 0, "invalid relnode"); + + // Remove it from the directory entry + let dir_key = rel_dir_to_key(rel.spcnode, rel.dbnode); + let buf = self.get(dir_key)?; + let mut dir = RelDirectory::des(&buf)?; + + if dir.rels.remove(&(rel.relnode, rel.forknum)) { + self.put(dir_key, Value::Image(Bytes::from(RelDirectory::ser(&dir)?))); + } else { + warn!("dropped rel {} did not exist in rel directory", rel); + } + + // update logical size + let size_key = rel_size_to_key(rel); + let old_size = self.get(size_key)?.get_u32_le(); + self.pending_nblocks -= old_size as isize; + + // Delete size entry, as well as all blocks + self.delete(rel_key_range(rel)); + + Ok(()) + } + + pub fn put_slru_segment_creation( + &mut self, + kind: SlruKind, + segno: u32, + nblocks: BlockNumber, + ) -> Result<()> { + // Add it to the directory entry + let dir_key = slru_dir_to_key(kind); + let buf = self.get(dir_key)?; + let mut dir = SlruSegmentDirectory::des(&buf)?; + + if !dir.segments.insert(segno) { + bail!("slru segment {:?}/{} already exists", kind, segno); + } + self.put( + dir_key, + Value::Image(Bytes::from(SlruSegmentDirectory::ser(&dir)?)), + ); + + // Put size + let size_key = slru_segment_size_to_key(kind, segno); + let buf = nblocks.to_le_bytes(); + self.put(size_key, Value::Image(Bytes::from(buf.to_vec()))); + + // even if nblocks > 0, we don't insert any actual 
blocks here + + Ok(()) + } + + /// Extend SLRU segment + pub fn put_slru_extend( + &mut self, + kind: SlruKind, + segno: u32, + nblocks: BlockNumber, + ) -> Result<()> { + // Put size + let size_key = slru_segment_size_to_key(kind, segno); + let buf = nblocks.to_le_bytes(); + self.put(size_key, Value::Image(Bytes::from(buf.to_vec()))); + Ok(()) + } + + /// This method is used for marking truncated SLRU files + pub fn drop_slru_segment(&mut self, kind: SlruKind, segno: u32) -> Result<()> { + // Remove it from the directory entry + let dir_key = slru_dir_to_key(kind); + let buf = self.get(dir_key)?; + let mut dir = SlruSegmentDirectory::des(&buf)?; + + if !dir.segments.remove(&segno) { + warn!("slru segment {:?}/{} does not exist", kind, segno); + } + self.put( + dir_key, + Value::Image(Bytes::from(SlruSegmentDirectory::ser(&dir)?)), + ); + + // Delete size entry, as well as all blocks + self.delete(slru_segment_key_range(kind, segno)); + + Ok(()) + } + + /// Drop a relmapper file (pg_filenode.map) + pub fn drop_relmap_file(&mut self, _spcnode: Oid, _dbnode: Oid) -> Result<()> { + // TODO + Ok(()) + } + + /// Drop a twophase file + pub fn drop_twophase_file(&mut self, xid: TransactionId) -> Result<()> { + // Remove it from the directory entry + let buf = self.get(TWOPHASEDIR_KEY)?; + let mut dir = TwoPhaseDirectory::des(&buf)?; + + if !dir.xids.remove(&xid) { + warn!("twophase file for xid {} does not exist", xid); + } + self.put( + TWOPHASEDIR_KEY, + Value::Image(Bytes::from(TwoPhaseDirectory::ser(&dir)?)), + ); + + // Delete it + self.delete(twophase_key_range(xid)); + + Ok(()) + } + + /// + /// Finish this atomic update, writing all the updated keys to the + /// underlying timeline.
+ /// + pub fn commit(self) -> Result<()> { + let writer = self.tline.tline.writer(); + + let last_partitioning = self.tline.last_partitioning.load(); + let pending_nblocks = self.pending_nblocks; + + for (key, value) in self.pending_updates { + writer.put(key, self.lsn, value)?; + } + for key_range in self.pending_deletions { + writer.delete(key_range.clone(), self.lsn)?; + } + + writer.finish_write(self.lsn); + + if last_partitioning == Lsn(0) + || self.lsn.0 - last_partitioning.0 > self.tline.repartition_threshold + { + let keyspace = self.tline.collect_keyspace(self.lsn)?; + let partitioning = keyspace.partition(TARGET_FILE_SIZE_BYTES); + self.tline.tline.hint_partitioning(partitioning, self.lsn)?; + self.tline.last_partitioning.store(self.lsn); + } + + if pending_nblocks != 0 { + self.tline.current_logical_size.fetch_add( + pending_nblocks * pg_constants::BLCKSZ as isize, + Ordering::SeqCst, + ); + } + + Ok(()) + } + + // Internal helper functions to batch the modifications + + fn get(&self, key: Key) -> Result { + // Have we already updated the same key? Read the pending updated + // version in that case. + // + // Note: we don't check pending_deletions. It is an error to request a + // value that has been removed, deletion only avoids leaking storage. + if let Some(value) = self.pending_updates.get(&key) { + if let Value::Image(img) = value { + Ok(img.clone()) + } else { + // Currently, we never need to read back a WAL record that we + // inserted in the same "transaction". All the metadata updates + // work directly with Images, and we never need to read actual + // data pages. We could handle this if we had to, by calling + // the walredo manager, but let's keep it simple for now. 
+ bail!("unexpected pending WAL record"); + } + } else { + let last_lsn = self.tline.get_last_record_lsn(); + self.tline.tline.get(key, last_lsn) + } + } + + fn put(&mut self, key: Key, val: Value) { + self.pending_updates.insert(key, val); + } + + fn delete(&mut self, key_range: Range) { + trace!("DELETE {}-{}", key_range.start, key_range.end); + self.pending_deletions.push(key_range); + } +} + +//--- Metadata structs stored in key-value pairs in the repository. + +#[derive(Debug, Serialize, Deserialize)] +struct DbDirectory { + // (spcnode, dbnode) -> (do relmapper and PG_VERSION files exist) + dbdirs: HashMap<(Oid, Oid), bool>, +} + +#[derive(Debug, Serialize, Deserialize)] +struct TwoPhaseDirectory { + xids: HashSet, +} + +#[derive(Debug, Serialize, Deserialize, Default)] +struct RelDirectory { + // Set of relations that exist. (relfilenode, forknum) + // + // TODO: Store it as a btree or radix tree or something else that spans multiple + // key-value pairs, if you have a lot of relations + rels: HashSet<(Oid, u8)>, +} + +#[derive(Debug, Serialize, Deserialize)] +struct RelSizeEntry { + nblocks: u32, +} + +#[derive(Debug, Serialize, Deserialize, Default)] +struct SlruSegmentDirectory { + // Set of SLRU segments that exist. + segments: HashSet, +} + +static ZERO_PAGE: Bytes = Bytes::from_static(&[0u8; pg_constants::BLCKSZ as usize]); + +// Layout of the Key address space +// +// The Key struct, used to address the underlying key-value store, consists of +// 18 bytes, split into six fields. See 'Key' in repository.rs. We need to map +// all the data and metadata keys into those 18 bytes. +// +// Principles for the mapping: +// +// - Things that are often accessed or modified together, should be close to +// each other in the key space. For example, if a relation is extended by one +// block, we create a new key-value pair for the block data, and update the +// relation size entry. 
Because of that, the RelSize key comes after all the +// RelBlocks of a relation: the RelSize and the last RelBlock are always next +// to each other. +// +// The key space is divided into four major sections, identified by the first +// byte, and they form a hierarchy: +// +// 00 Relation data and metadata +// +// DbDir () -> (spcnode, dbnode) +// Filenodemap +// RelDir -> relnode forknum +// RelBlocks +// RelSize +// +// 01 SLRUs +// +// SlruDir kind +// SlruSegBlocks segno +// SlruSegSize +// +// 02 pg_twophase +// +// 03 misc +// controlfile +// checkpoint +// +// Below is a full list of the keyspace allocation: +// +// DbDir: +// 00 00000000 00000000 00000000 00 00000000 +// +// Filenodemap: +// 00 SPCNODE DBNODE 00000000 00 00000000 +// +// RelDir: +// 00 SPCNODE DBNODE 00000000 00 00000001 (Postgres never uses relfilenode 0) +// +// RelBlock: +// 00 SPCNODE DBNODE RELNODE FORK BLKNUM +// +// RelSize: +// 00 SPCNODE DBNODE RELNODE FORK FFFFFFFF +// +// SlruDir: +// 01 kind 00000000 00000000 00 00000000 +// +// SlruSegBlock: +// 01 kind 00000001 SEGNO 00 BLKNUM +// +// SlruSegSize: +// 01 kind 00000001 SEGNO 00 FFFFFFFF +// +// TwoPhaseDir: +// 02 00000000 00000000 00000000 00 00000000 +// +// TwoPhaseFile: +// 02 00000000 00000000 00000000 00 XID +// +// ControlFile: +// 03 00000000 00000000 00000000 00 00000000 +// +// Checkpoint: +// 03 00000000 00000000 00000000 00 00000001 + +//-- Section 01: relation data and metadata + +const DBDIR_KEY: Key = Key { + field1: 0x00, + field2: 0, + field3: 0, + field4: 0, + field5: 0, + field6: 0, +}; + +fn dbdir_key_range(spcnode: Oid, dbnode: Oid) -> Range { + Key { + field1: 0x00, + field2: spcnode, + field3: dbnode, + field4: 0, + field5: 0, + field6: 0, + }..Key { + field1: 0x00, + field2: spcnode, + field3: dbnode, + field4: 0xffffffff, + field5: 0xff, + field6: 0xffffffff, + } +} + +fn relmap_file_key(spcnode: Oid, dbnode: Oid) -> Key { + Key { + field1: 0x00, + field2: spcnode, + field3: dbnode, + field4: 0, +
field5: 0, + field6: 0, + } +} + +fn rel_dir_to_key(spcnode: Oid, dbnode: Oid) -> Key { + Key { + field1: 0x00, + field2: spcnode, + field3: dbnode, + field4: 0, + field5: 0, + field6: 1, + } +} + +fn rel_block_to_key(rel: RelTag, blknum: BlockNumber) -> Key { + Key { + field1: 0x00, + field2: rel.spcnode, + field3: rel.dbnode, + field4: rel.relnode, + field5: rel.forknum, + field6: blknum, + } +} + +fn rel_size_to_key(rel: RelTag) -> Key { + Key { + field1: 0x00, + field2: rel.spcnode, + field3: rel.dbnode, + field4: rel.relnode, + field5: rel.forknum, + field6: 0xffffffff, + } +} + +fn rel_key_range(rel: RelTag) -> Range { + Key { + field1: 0x00, + field2: rel.spcnode, + field3: rel.dbnode, + field4: rel.relnode, + field5: rel.forknum, + field6: 0, + }..Key { + field1: 0x00, + field2: rel.spcnode, + field3: rel.dbnode, + field4: rel.relnode, + field5: rel.forknum + 1, + field6: 0, + } +} + +//-- Section 02: SLRUs + +fn slru_dir_to_key(kind: SlruKind) -> Key { + Key { + field1: 0x01, + field2: match kind { + SlruKind::Clog => 0x00, + SlruKind::MultiXactMembers => 0x01, + SlruKind::MultiXactOffsets => 0x02, + }, + field3: 0, + field4: 0, + field5: 0, + field6: 0, + } +} + +fn slru_block_to_key(kind: SlruKind, segno: u32, blknum: BlockNumber) -> Key { + Key { + field1: 0x01, + field2: match kind { + SlruKind::Clog => 0x00, + SlruKind::MultiXactMembers => 0x01, + SlruKind::MultiXactOffsets => 0x02, + }, + field3: 1, + field4: segno, + field5: 0, + field6: blknum, + } +} + +fn slru_segment_size_to_key(kind: SlruKind, segno: u32) -> Key { + Key { + field1: 0x01, + field2: match kind { + SlruKind::Clog => 0x00, + SlruKind::MultiXactMembers => 0x01, + SlruKind::MultiXactOffsets => 0x02, + }, + field3: 1, + field4: segno, + field5: 0, + field6: 0xffffffff, + } +} + +fn slru_segment_key_range(kind: SlruKind, segno: u32) -> Range { + let field2 = match kind { + SlruKind::Clog => 0x00, + SlruKind::MultiXactMembers => 0x01, + SlruKind::MultiXactOffsets => 0x02, + }; + + Key { 
+ field1: 0x01, + field2, + field3: 1, + field4: segno, + field5: 0, + field6: 0, + }..Key { + field1: 0x01, + field2, + field3: 1, + field4: segno, + field5: 1, + field6: 0, + } +} + +//-- Section 03: pg_twophase + +const TWOPHASEDIR_KEY: Key = Key { + field1: 0x02, + field2: 0, + field3: 0, + field4: 0, + field5: 0, + field6: 0, +}; + +fn twophase_file_key(xid: TransactionId) -> Key { + Key { + field1: 0x02, + field2: 0, + field3: 0, + field4: 0, + field5: 0, + field6: xid, + } +} + +fn twophase_key_range(xid: TransactionId) -> Range { + let (next_xid, overflowed) = xid.overflowing_add(1); + + Key { + field1: 0x02, + field2: 0, + field3: 0, + field4: 0, + field5: 0, + field6: xid, + }..Key { + field1: 0x02, + field2: 0, + field3: 0, + field4: 0, + field5: if overflowed { 1 } else { 0 }, + field6: next_xid, + } +} + +//-- Section 03: Control file +const CONTROLFILE_KEY: Key = Key { + field1: 0x03, + field2: 0, + field3: 0, + field4: 0, + field5: 0, + field6: 0, +}; + +const CHECKPOINT_KEY: Key = Key { + field1: 0x03, + field2: 0, + field3: 0, + field4: 0, + field5: 0, + field6: 1, +}; + +// Reverse mappings for a few Keys. +// These are needed by WAL redo manager.
+ +pub fn key_to_rel_block(key: Key) -> Result<(RelTag, BlockNumber)> { + Ok(match key.field1 { + 0x00 => ( + RelTag { + spcnode: key.field2, + dbnode: key.field3, + relnode: key.field4, + forknum: key.field5, + }, + key.field6, + ), + _ => bail!("unexpected value kind 0x{:02x}", key.field1), + }) +} + +pub fn key_to_slru_block(key: Key) -> Result<(SlruKind, u32, BlockNumber)> { + Ok(match key.field1 { + 0x01 => { + let kind = match key.field2 { + 0x00 => SlruKind::Clog, + 0x01 => SlruKind::MultiXactMembers, + 0x02 => SlruKind::MultiXactOffsets, + _ => bail!("unrecognized slru kind 0x{:02x}", key.field2), + }; + let segno = key.field4; + let blknum = key.field6; + + (kind, segno, blknum) + } + _ => bail!("unexpected value kind 0x{:02x}", key.field1), + }) +} + +// +//-- Tests that should work the same with any Repository/Timeline implementation. +// + +#[cfg(test)] +pub fn create_test_timeline( + repo: R, + timeline_id: zenith_utils::zid::ZTimelineId, +) -> Result>> { + let tline = repo.create_empty_timeline(timeline_id, Lsn(8))?; + let tline = DatadirTimeline::new(tline, crate::layered_repository::tests::TEST_FILE_SIZE / 10); + let mut m = tline.begin_modification(Lsn(8)); + m.init_empty()?; + m.commit()?; + Ok(Arc::new(tline)) +} + +#[allow(clippy::bool_assert_comparison)] +#[cfg(test)] +mod tests { + //use super::repo_harness::*; + //use super::*; + + /* + fn assert_current_logical_size(timeline: &DatadirTimeline, lsn: Lsn) { + let incremental = timeline.get_current_logical_size(); + let non_incremental = timeline + .get_current_logical_size_non_incremental(lsn) + .unwrap(); + assert_eq!(incremental, non_incremental); + } + */ + + /* + /// + /// Test list_rels() function, with branches and dropped relations + /// + #[test] + fn test_list_rels_drop() -> Result<()> { + let repo = RepoHarness::create("test_list_rels_drop")?.load(); + let tline = create_empty_timeline(repo, TIMELINE_ID)?; + const TESTDB: u32 = 111; + + // Import initial dummy checkpoint record, 
otherwise the get_timeline() call + // after branching fails below + let mut writer = tline.begin_record(Lsn(0x10)); + writer.put_checkpoint(ZERO_CHECKPOINT.clone())?; + writer.finish()?; + + // Create a relation on the timeline + let mut writer = tline.begin_record(Lsn(0x20)); + writer.put_rel_page_image(TESTREL_A, 0, TEST_IMG("foo blk 0 at 2"))?; + writer.finish()?; + + let writer = tline.begin_record(Lsn(0x00)); + writer.finish()?; + + // Check that list_rels() lists it after LSN 2, but not before it + assert!(!tline.list_rels(0, TESTDB, Lsn(0x10))?.contains(&TESTREL_A)); + assert!(tline.list_rels(0, TESTDB, Lsn(0x20))?.contains(&TESTREL_A)); + assert!(tline.list_rels(0, TESTDB, Lsn(0x30))?.contains(&TESTREL_A)); + + // Create a branch, check that the relation is visible there + repo.branch_timeline(TIMELINE_ID, NEW_TIMELINE_ID, Lsn(0x30))?; + let newtline = match repo.get_timeline(NEW_TIMELINE_ID)?.local_timeline() { + Some(timeline) => timeline, + None => panic!("Should have a local timeline"), + }; + let newtline = DatadirTimelineImpl::new(newtline); + assert!(newtline + .list_rels(0, TESTDB, Lsn(0x30))? + .contains(&TESTREL_A)); + + // Drop it on the branch + let mut new_writer = newtline.begin_record(Lsn(0x40)); + new_writer.drop_relation(TESTREL_A)?; + new_writer.finish()?; + + // Check that it's no longer listed on the branch after the point where it was dropped + assert!(newtline + .list_rels(0, TESTDB, Lsn(0x30))? + .contains(&TESTREL_A)); + assert!(!newtline + .list_rels(0, TESTDB, Lsn(0x40))? + .contains(&TESTREL_A)); + + // Run checkpoint and garbage collection and check that it's still not visible + newtline.tline.checkpoint(CheckpointConfig::Forced)?; + repo.gc_iteration(Some(NEW_TIMELINE_ID), 0, true)?; + + assert!(!newtline + .list_rels(0, TESTDB, Lsn(0x40))? 
+ .contains(&TESTREL_A)); + + Ok(()) + } + */ + + /* + #[test] + fn test_read_beyond_eof() -> Result<()> { + let repo = RepoHarness::create("test_read_beyond_eof")?.load(); + let tline = create_test_timeline(repo, TIMELINE_ID)?; + + make_some_layers(&tline, Lsn(0x20))?; + let mut writer = tline.begin_record(Lsn(0x60)); + walingest.put_rel_page_image( + &mut writer, + TESTREL_A, + 0, + TEST_IMG(&format!("foo blk 0 at {}", Lsn(0x60))), + )?; + writer.finish()?; + + // Test read before rel creation. Should error out. + assert!(tline.get_rel_page_at_lsn(TESTREL_A, 1, Lsn(0x10)).is_err()); + + // Read block beyond end of relation at different points in time. + // These reads should fall into different delta, image, and in-memory layers. + assert_eq!(tline.get_rel_page_at_lsn(TESTREL_A, 1, Lsn(0x20))?, ZERO_PAGE); + assert_eq!(tline.get_rel_page_at_lsn(TESTREL_A, 1, Lsn(0x25))?, ZERO_PAGE); + assert_eq!(tline.get_rel_page_at_lsn(TESTREL_A, 1, Lsn(0x30))?, ZERO_PAGE); + assert_eq!(tline.get_rel_page_at_lsn(TESTREL_A, 1, Lsn(0x35))?, ZERO_PAGE); + assert_eq!(tline.get_rel_page_at_lsn(TESTREL_A, 1, Lsn(0x40))?, ZERO_PAGE); + assert_eq!(tline.get_rel_page_at_lsn(TESTREL_A, 1, Lsn(0x45))?, ZERO_PAGE); + assert_eq!(tline.get_rel_page_at_lsn(TESTREL_A, 1, Lsn(0x50))?, ZERO_PAGE); + assert_eq!(tline.get_rel_page_at_lsn(TESTREL_A, 1, Lsn(0x55))?, ZERO_PAGE); + assert_eq!(tline.get_rel_page_at_lsn(TESTREL_A, 1, Lsn(0x60))?, ZERO_PAGE); + + // Test on an in-memory layer with no preceding layer + let mut writer = tline.begin_record(Lsn(0x70)); + walingest.put_rel_page_image( + &mut writer, + TESTREL_B, + 0, + TEST_IMG(&format!("foo blk 0 at {}", Lsn(0x70))), + )?; + writer.finish()?; + + assert_eq!(tline.get_rel_page_at_lsn(TESTREL_B, 1, Lsn(0x70))?, ZERO_PAGE); + + Ok(()) + } + */ +} diff --git a/pageserver/src/relish.rs b/pageserver/src/relish.rs deleted file mode 100644 index 9228829aef..0000000000 --- a/pageserver/src/relish.rs +++ /dev/null @@ -1,226 +0,0 @@ -//! -//! 
Zenith stores PostgreSQL relations, and some other files, in the -//! repository. The relations (i.e. tables and indexes) take up most -//! of the space in a typical installation, while the other files are -//! small. We call each relation and other file that is stored in the -//! repository a "relish". It comes from "rel"-ish, as in "kind of a -//! rel", because it covers relations as well as other things that are -//! not relations, but are treated similarly for the purposes of the -//! storage layer. -//! -//! This source file contains the definition of the RelishTag struct, -//! which uniquely identifies a relish. -//! -//! Relishes come in two flavors: blocky and non-blocky. Relations and -//! SLRUs are blocky, that is, they are divided into 8k blocks, and -//! the repository tracks their size. Other relishes are non-blocky: -//! the content of the whole relish is stored as one blob. Block -//! number must be passed as 0 for all operations on a non-blocky -//! relish. The one "block" that you store in a non-blocky relish can -//! have arbitrary size, but they are expected to be small, or you -//! will have performance issues. -//! -//! All relishes are versioned by LSN in the repository. -//! - -use serde::{Deserialize, Serialize}; -use std::fmt; - -use postgres_ffi::relfile_utils::forknumber_to_name; -use postgres_ffi::{Oid, TransactionId}; - -/// -/// RelishTag identifies one relish. -/// -#[derive(Debug, Clone, Copy, Hash, Serialize, Deserialize, PartialEq, Eq, PartialOrd, Ord)] -pub enum RelishTag { - // Relations correspond to PostgreSQL relation forks. Each - // PostgreSQL relation fork is considered a separate relish. - Relation(RelTag), - - // SLRUs include pg_clog, pg_multixact/members, and - // pg_multixact/offsets. There are other SLRUs in PostgreSQL, but - // they don't need to be stored permanently (e.g. pg_subtrans), - // or we do not support them in zenith yet (pg_commit_ts). 
- // - // These are currently never requested directly by the compute - // nodes, although in principle that would be possible. However, - // when a new compute node is created, these are included in the - // tarball that we send to the compute node to initialize the - // PostgreSQL data directory. - // - // Each SLRU segment in PostgreSQL is considered a separate - // relish. For example, pg_clog/0000, pg_clog/0001, and so forth. - // - // SLRU segments are divided into blocks, like relations. - Slru { slru: SlruKind, segno: u32 }, - - // Miscellaneous other files that need to be included in the - // tarball at compute node creation. These are non-blocky, and are - // expected to be small. - - // - // FileNodeMap represents PostgreSQL's 'pg_filenode.map' - // files. They are needed to map catalog table OIDs to filenode - // numbers. Usually the mapping is done by looking up a relation's - // 'relfilenode' field in the 'pg_class' system table, but that - // doesn't work for 'pg_class' itself and a few other such system - // relations. See PostgreSQL relmapper.c for details. - // - // Each database has a map file for its local mapped catalogs, - // and there is a separate map file for shared catalogs. - // - // These files are always 512 bytes long (although we don't check - // or care about that in the page server). - // - FileNodeMap { spcnode: Oid, dbnode: Oid }, - - // - // State files for prepared transactions (e.g pg_twophase/1234) - // - TwoPhase { xid: TransactionId }, - - // The control file, stored in global/pg_control - ControlFile, - - // Special entry that represents PostgreSQL checkpoint. It doesn't - // correspond to to any physical file in PostgreSQL, but we use it - // to track fields needed to restore the checkpoint data in the - // control file, when a compute node is created. 
- Checkpoint, -} - -impl RelishTag { - pub const fn is_blocky(&self) -> bool { - match self { - // These relishes work with blocks - RelishTag::Relation(_) | RelishTag::Slru { slru: _, segno: _ } => true, - - // and these don't - RelishTag::FileNodeMap { - spcnode: _, - dbnode: _, - } - | RelishTag::TwoPhase { xid: _ } - | RelishTag::ControlFile - | RelishTag::Checkpoint => false, - } - } - - // Physical relishes represent files and use - // RelationSizeEntry to track existing and dropped files. - // They can be both blocky and non-blocky. - pub const fn is_physical(&self) -> bool { - match self { - // These relishes represent physical files - RelishTag::Relation(_) - | RelishTag::Slru { .. } - | RelishTag::FileNodeMap { .. } - | RelishTag::TwoPhase { .. } => true, - - // and these don't - RelishTag::ControlFile | RelishTag::Checkpoint => false, - } - } - - // convenience function to check if this relish is a normal relation. - pub const fn is_relation(&self) -> bool { - matches!(self, RelishTag::Relation(_)) - } -} - -/// -/// Relation data file segment id throughout the Postgres cluster. -/// -/// Every data file in Postgres is uniquely identified by 4 numbers: -/// - relation id / node (`relnode`) -/// - database id (`dbnode`) -/// - tablespace id (`spcnode`), in short this is a unique id of a separate -/// directory to store data files. -/// - forknumber (`forknum`) is used to split different kinds of data of the same relation -/// between some set of files (`relnode`, `relnode_fsm`, `relnode_vm`). -/// -/// In native Postgres code `RelFileNode` structure and individual `ForkNumber` value -/// are used for the same purpose. -/// [See more related comments here](https:///github.com/postgres/postgres/blob/99c5852e20a0987eca1c38ba0c09329d4076b6a0/src/include/storage/relfilenode.h#L57). 
-/// -#[derive(Debug, PartialEq, Eq, PartialOrd, Hash, Ord, Clone, Copy, Serialize, Deserialize)] -pub struct RelTag { - pub forknum: u8, - pub spcnode: Oid, - pub dbnode: Oid, - pub relnode: Oid, -} - -/// Display RelTag in the same format that's used in most PostgreSQL debug messages: -/// -/// //[_fsm|_vm|_init] -/// -impl fmt::Display for RelTag { - fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { - if let Some(forkname) = forknumber_to_name(self.forknum) { - write!( - f, - "{}/{}/{}_{}", - self.spcnode, self.dbnode, self.relnode, forkname - ) - } else { - write!(f, "{}/{}/{}", self.spcnode, self.dbnode, self.relnode) - } - } -} - -/// Display RelTag in the same format that's used in most PostgreSQL debug messages: -/// -/// //[_fsm|_vm|_init] -/// -impl fmt::Display for RelishTag { - fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { - match self { - RelishTag::Relation(rel) => rel.fmt(f), - RelishTag::Slru { slru, segno } => { - // e.g. pg_clog/0001 - write!(f, "{}/{:04X}", slru.to_str(), segno) - } - RelishTag::FileNodeMap { spcnode, dbnode } => { - write!(f, "relmapper file for spc {} db {}", spcnode, dbnode) - } - RelishTag::TwoPhase { xid } => { - write!(f, "pg_twophase/{:08X}", xid) - } - RelishTag::ControlFile => { - write!(f, "control file") - } - RelishTag::Checkpoint => { - write!(f, "checkpoint") - } - } - } -} - -/// -/// Non-relation transaction status files (clog (a.k.a. pg_xact) and -/// pg_multixact) in Postgres are handled by SLRU (Simple LRU) buffer, -/// hence the name. -/// -/// These files are global for a postgres instance. -/// -/// These files are divided into segments, which are divided into -/// pages of the same BLCKSZ as used for relation files. 
-/// -#[derive(Debug, Clone, Copy, Hash, Serialize, Deserialize, PartialEq, Eq, PartialOrd, Ord)] -pub enum SlruKind { - Clog, - MultiXactMembers, - MultiXactOffsets, -} - -impl SlruKind { - pub fn to_str(&self) -> &'static str { - match self { - Self::Clog => "pg_xact", - Self::MultiXactMembers => "pg_multixact/members", - Self::MultiXactOffsets => "pg_multixact/offsets", - } - } -} diff --git a/pageserver/src/reltag.rs b/pageserver/src/reltag.rs new file mode 100644 index 0000000000..46ff468f2f --- /dev/null +++ b/pageserver/src/reltag.rs @@ -0,0 +1,105 @@ +use serde::{Deserialize, Serialize}; +use std::cmp::Ordering; +use std::fmt; + +use postgres_ffi::relfile_utils::forknumber_to_name; +use postgres_ffi::Oid; + +/// +/// Relation data file segment id throughout the Postgres cluster. +/// +/// Every data file in Postgres is uniquely identified by 4 numbers: +/// - relation id / node (`relnode`) +/// - database id (`dbnode`) +/// - tablespace id (`spcnode`), in short this is a unique id of a separate +/// directory to store data files. +/// - forknumber (`forknum`) is used to split different kinds of data of the same relation +/// between some set of files (`relnode`, `relnode_fsm`, `relnode_vm`). +/// +/// In native Postgres code `RelFileNode` structure and individual `ForkNumber` value +/// are used for the same purpose. +/// [See more related comments here](https://github.com/postgres/postgres/blob/99c5852e20a0987eca1c38ba0c09329d4076b6a0/src/include/storage/relfilenode.h#L57). +/// +// FIXME: should move 'forknum' as last field to keep this consistent with Postgres. +// Then we could replace the custom Ord and PartialOrd implementations below with +// deriving them. 
+#[derive(Debug, PartialEq, Eq, Hash, Clone, Copy, Serialize, Deserialize)] +pub struct RelTag { + pub forknum: u8, + pub spcnode: Oid, + pub dbnode: Oid, + pub relnode: Oid, +} + +impl PartialOrd for RelTag { + fn partial_cmp(&self, other: &Self) -> Option<Ordering> { + Some(self.cmp(other)) + } +} + +impl Ord for RelTag { + fn cmp(&self, other: &Self) -> Ordering { + let mut cmp; + + cmp = self.spcnode.cmp(&other.spcnode); + if cmp != Ordering::Equal { + return cmp; + } + cmp = self.dbnode.cmp(&other.dbnode); + if cmp != Ordering::Equal { + return cmp; + } + cmp = self.relnode.cmp(&other.relnode); + if cmp != Ordering::Equal { + return cmp; + } + cmp = self.forknum.cmp(&other.forknum); + + cmp + } +} + +/// Display RelTag in the same format that's used in most PostgreSQL debug messages: +/// +/// <spcnode>/<dbnode>/<relnode>[_fsm|_vm|_init] +/// +impl fmt::Display for RelTag { + fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { + if let Some(forkname) = forknumber_to_name(self.forknum) { + write!( + f, + "{}/{}/{}_{}", + self.spcnode, self.dbnode, self.relnode, forkname + ) + } else { + write!(f, "{}/{}/{}", self.spcnode, self.dbnode, self.relnode) + } + } +} + +/// +/// Non-relation transaction status files (clog (a.k.a. pg_xact) and +/// pg_multixact) in Postgres are handled by SLRU (Simple LRU) buffer, +/// hence the name. +/// +/// These files are global for a postgres instance. +/// +/// These files are divided into segments, which are divided into +/// pages of the same BLCKSZ as used for relation files. 
+/// +#[derive(Debug, Clone, Copy, Hash, Serialize, Deserialize, PartialEq, Eq, PartialOrd, Ord)] +pub enum SlruKind { + Clog, + MultiXactMembers, + MultiXactOffsets, +} + +impl SlruKind { + pub fn to_str(&self) -> &'static str { + match self { + Self::Clog => "pg_xact", + Self::MultiXactMembers => "pg_multixact/members", + Self::MultiXactOffsets => "pg_multixact/offsets", + } + } +} diff --git a/pageserver/src/remote_storage/README.md b/pageserver/src/remote_storage/README.md index 3c77275da8..339ddce866 100644 --- a/pageserver/src/remote_storage/README.md +++ b/pageserver/src/remote_storage/README.md @@ -17,7 +17,7 @@ This way, the backups are managed in background, not affecting directly other pa Current implementation * provides remote storage wrappers for AWS S3 and local FS * synchronizes the differences with local timelines and remote states as fast as possible -* uploads new relishes, frozen by pageserver checkpoint thread +* uploads new layer files * downloads and registers timelines, found on the remote storage, but missing locally, if those are requested somehow via pageserver (e.g. http api, gc) * uses compression when deals with files, for better S3 usage * maintains an index of what's stored remotely diff --git a/pageserver/src/remote_storage/local_fs.rs b/pageserver/src/remote_storage/local_fs.rs index 6cce127a7c..bac693c8d0 100644 --- a/pageserver/src/remote_storage/local_fs.rs +++ b/pageserver/src/remote_storage/local_fs.rs @@ -662,7 +662,7 @@ mod fs_tests { } async fn upload_dummy_file( - harness: &RepoHarness, + harness: &RepoHarness<'_>, storage: &LocalFs, name: &str, ) -> anyhow::Result { diff --git a/pageserver/src/remote_storage/storage_sync.rs b/pageserver/src/remote_storage/storage_sync.rs index 9fe2ab2847..ddd47ea981 100644 --- a/pageserver/src/remote_storage/storage_sync.rs +++ b/pageserver/src/remote_storage/storage_sync.rs @@ -27,7 +27,7 @@ //! it may schedule the download on such occasions. //! 
Then, the index is shared across pageserver under [`RemoteIndex`] guard to ensure proper synchronization. //! -//! The synchronization unit is an archive: a set of timeline files (or relishes) and a special metadata file, all compressed into a blob. +//! The synchronization unit is an archive: a set of layer files and a special metadata file, all compressed into a blob. //! Currently, there's no way to process an archive partially, if the archive processing fails, it has to be started from zero next time again. //! An archive contains set of files of a certain timeline, added during checkpoint(s) and the timeline metadata at that moment. //! The archive contains that metadata's `disk_consistent_lsn` in its name, to be able to restore partial index information from just a remote storage file list. @@ -281,7 +281,7 @@ impl SyncKind { /// Current checkpoint design assumes new files are added only, no deletions or amendment happens. #[derive(Debug, Clone)] pub struct NewCheckpoint { - /// Relish file paths in the pageserver workdir, that were added for the corresponding checkpoint. + /// layer file paths in the pageserver workdir, that were added for the corresponding checkpoint. layers: Vec, metadata: TimelineMetadata, } @@ -854,7 +854,7 @@ mod test_utils { #[track_caller] pub async fn ensure_correct_timeline_upload( - harness: &RepoHarness, + harness: &RepoHarness<'_>, remote_assets: Arc<(LocalFs, RemoteIndex)>, timeline_id: ZTimelineId, new_upload: NewCheckpoint, diff --git a/pageserver/src/remote_storage/storage_sync/compression.rs b/pageserver/src/remote_storage/storage_sync/compression.rs index ca245359bf..c5b041349a 100644 --- a/pageserver/src/remote_storage/storage_sync/compression.rs +++ b/pageserver/src/remote_storage/storage_sync/compression.rs @@ -10,7 +10,7 @@ //! Archiving is almost agnostic to timeline file types, with an exception of the metadata file, that's currently distinguished in the [un]compression code. //! 
The metadata file is treated separately when [de]compression is involved, to reduce the risk of corrupting the metadata file. //! When compressed, the metadata file is always required and stored as the last file in the archive stream. -//! When uncompressed, the metadata file gets naturally uncompressed last, to ensure that all other relishes are decompressed successfully first. +//! When uncompressed, the metadata file gets naturally uncompressed last, to ensure that all other layer files are decompressed successfully first. //! //! Archive structure: //! +----------------------------------------+ diff --git a/pageserver/src/remote_storage/storage_sync/index.rs b/pageserver/src/remote_storage/storage_sync/index.rs index d7bd1f1657..861b78fa3b 100644 --- a/pageserver/src/remote_storage/storage_sync/index.rs +++ b/pageserver/src/remote_storage/storage_sync/index.rs @@ -277,7 +277,7 @@ impl RemoteTimeline { .map(CheckpointArchive::disk_consistent_lsn) } - /// Lists all relish files in the given remote timeline. Omits the metadata file. + /// Lists all layer files in the given remote timeline. Omits the metadata file. 
pub fn stored_files(&self, timeline_dir: &Path) -> BTreeSet { self.timeline_files .values() diff --git a/pageserver/src/repository.rs b/pageserver/src/repository.rs index 36273e6d6c..b960e037be 100644 --- a/pageserver/src/repository.rs +++ b/pageserver/src/repository.rs @@ -1,22 +1,173 @@ +use crate::keyspace::KeyPartitioning; use crate::layered_repository::metadata::TimelineMetadata; -use crate::relish::*; use crate::remote_storage::RemoteIndex; -use crate::walrecord::MultiXactMember; +use crate::walrecord::ZenithWalRecord; use crate::CheckpointConfig; -use anyhow::Result; +use anyhow::{bail, Result}; use bytes::Bytes; -use postgres_ffi::{MultiXactId, MultiXactOffset, TransactionId}; use serde::{Deserialize, Serialize}; -use std::collections::HashSet; +use std::fmt; use std::fmt::Display; -use std::ops::{AddAssign, Deref}; +use std::ops::{AddAssign, Range}; use std::sync::{Arc, RwLockReadGuard}; use std::time::Duration; use zenith_utils::lsn::{Lsn, RecordLsn}; use zenith_utils::zid::ZTimelineId; -/// Block number within a relish. This matches PostgreSQL's BlockNumber type. -pub type BlockNumber = u32; +#[derive(Debug, Clone, Copy, Hash, PartialEq, Eq, Ord, PartialOrd, Serialize, Deserialize)] +/// Key used in the Repository kv-store. +/// +/// The Repository treats this as an opaque struct, but see the code in pgdatadir_mapping.rs +/// for what we actually store in these fields. 
+pub struct Key { + pub field1: u8, + pub field2: u32, + pub field3: u32, + pub field4: u32, + pub field5: u8, + pub field6: u32, +} + +impl Key { + pub fn next(&self) -> Key { + self.add(1) + } + + pub fn add(&self, x: u32) -> Key { + let mut key = *self; + + let r = key.field6.overflowing_add(x); + key.field6 = r.0; + if r.1 { + let r = key.field5.overflowing_add(1); + key.field5 = r.0; + if r.1 { + let r = key.field4.overflowing_add(1); + key.field4 = r.0; + if r.1 { + let r = key.field3.overflowing_add(1); + key.field3 = r.0; + if r.1 { + let r = key.field2.overflowing_add(1); + key.field2 = r.0; + if r.1 { + let r = key.field1.overflowing_add(1); + key.field1 = r.0; + assert!(!r.1); + } + } + } + } + } + key + } + + pub fn from_array(b: [u8; 18]) -> Self { + Key { + field1: b[0], + field2: u32::from_be_bytes(b[1..5].try_into().unwrap()), + field3: u32::from_be_bytes(b[5..9].try_into().unwrap()), + field4: u32::from_be_bytes(b[9..13].try_into().unwrap()), + field5: b[13], + field6: u32::from_be_bytes(b[14..18].try_into().unwrap()), + } + } +} + +pub fn key_range_size(key_range: &Range) -> u32 { + let start = key_range.start; + let end = key_range.end; + + if end.field1 != start.field1 + || end.field2 != start.field2 + || end.field3 != start.field3 + || end.field4 != start.field4 + { + return u32::MAX; + } + + let start = (start.field5 as u64) << 32 | start.field6 as u64; + let end = (end.field5 as u64) << 32 | end.field6 as u64; + + let diff = end - start; + if diff > u32::MAX as u64 { + u32::MAX + } else { + diff as u32 + } +} + +pub fn singleton_range(key: Key) -> Range { + key..key.next() +} + +impl fmt::Display for Key { + fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { + write!( + f, + "{:02X}{:08X}{:08X}{:08X}{:02X}{:08X}", + self.field1, self.field2, self.field3, self.field4, self.field5, self.field6 + ) + } +} + +impl Key { + pub const MIN: Key = Key { + field1: u8::MIN, + field2: u32::MIN, + field3: u32::MIN, + field4: u32::MIN, + field5: 
u8::MIN, + field6: u32::MIN, + }; + pub const MAX: Key = Key { + field1: u8::MAX, + field2: u32::MAX, + field3: u32::MAX, + field4: u32::MAX, + field5: u8::MAX, + field6: u32::MAX, + }; + + pub fn from_hex(s: &str) -> Result { + if s.len() != 36 { + bail!("parse error"); + } + Ok(Key { + field1: u8::from_str_radix(&s[0..2], 16)?, + field2: u32::from_str_radix(&s[2..10], 16)?, + field3: u32::from_str_radix(&s[10..18], 16)?, + field4: u32::from_str_radix(&s[18..26], 16)?, + field5: u8::from_str_radix(&s[26..28], 16)?, + field6: u32::from_str_radix(&s[28..36], 16)?, + }) + } +} + +/// A 'value' stored for one Key. +#[derive(Debug, Clone, Serialize, Deserialize)] +pub enum Value { + /// An Image value contains a full copy of the value + Image(Bytes), + /// A WalRecord value contains a WAL record that needs to be + /// replayed to get the full value. Replaying the WAL record + /// might need a previous version of the value (if will_init() + /// returns false), or it may be replayed stand-alone (true). + WalRecord(ZenithWalRecord), +} + +impl Value { + pub fn is_image(&self) -> bool { + matches!(self, Value::Image(_)) + } + + pub fn will_init(&self) -> bool { + match self { + Value::Image(_) => true, + Value::WalRecord(rec) => rec.will_init(), + } + } +} #[derive(Clone, Copy, Debug)] pub enum TimelineSyncStatusUpdate { @@ -37,6 +188,8 @@ impl Display for TimelineSyncStatusUpdate { /// A repository corresponds to one .zenith directory. One repository holds multiple /// timelines, forked off from the same initial call to 'initdb'. pub trait Repository: Send + Sync { + type Timeline: Timeline; + /// Updates timeline based on the `TimelineSyncStatusUpdate`, received from the remote storage synchronization. /// See [`crate::remote_storage`] for more details about the synchronization. fn apply_timeline_remote_sync_status_update( @@ -47,14 +200,14 @@ pub trait Repository: Send + Sync { /// Get Timeline handle for given zenith timeline ID. /// This function is idempotent. 
It doesnt change internal state in any way. - fn get_timeline(&self, timelineid: ZTimelineId) -> Option; + fn get_timeline(&self, timelineid: ZTimelineId) -> Option>; /// Get Timeline handle for locally available timeline. Load it into memory if it is not loaded. - fn get_timeline_load(&self, timelineid: ZTimelineId) -> Result>; + fn get_timeline_load(&self, timelineid: ZTimelineId) -> Result>; /// Lists timelines the repository contains. /// Up to repository's implementation to omit certain timelines that ar not considered ready for use. - fn list_timelines(&self) -> Vec<(ZTimelineId, RepositoryTimeline)>; + fn list_timelines(&self) -> Vec<(ZTimelineId, RepositoryTimeline)>; /// Create a new, empty timeline. The caller is responsible for loading data into it /// Initdb lsn is provided for timeline impl to be able to perform checks for some operations against it. @@ -62,11 +215,16 @@ pub trait Repository: Send + Sync { &self, timelineid: ZTimelineId, initdb_lsn: Lsn, - ) -> Result>; + ) -> Result>; /// Branch a timeline fn branch_timeline(&self, src: ZTimelineId, dst: ZTimelineId, start_lsn: Lsn) -> Result<()>; + /// Flush all data to disk. + /// + /// this is used at graceful shutdown. + fn checkpoint(&self) -> Result<()>; + /// perform one garbage collection iteration, removing old data files from disk. /// this function is periodically called by gc thread. /// also it can be explicitly requested through page server api 'do_gc' command. @@ -83,9 +241,9 @@ pub trait Repository: Send + Sync { checkpoint_before_gc: bool, ) -> Result; - /// perform one checkpoint iteration, flushing in-memory data on disk. - /// this function is periodically called by checkponter thread. - fn checkpoint_iteration(&self, cconf: CheckpointConfig) -> Result<()>; + /// perform one compaction iteration. + /// this function is periodically called by compactor thread. 
+ fn compaction_iteration(&self) -> Result<()>; /// detaches locally available timeline by stopping all threads and removing all the data. fn detach_timeline(&self, timeline_id: ZTimelineId) -> Result<()>; @@ -95,10 +253,10 @@ pub trait Repository: Send + Sync { } /// A timeline, that belongs to the current repository. -pub enum RepositoryTimeline { +pub enum RepositoryTimeline { /// Timeline, with its files present locally in pageserver's working directory. /// Loaded into pageserver's memory and ready to be used. - Loaded(Arc), + Loaded(Arc), /// All the data is available locally, but not loaded into memory, so loading have to be done before actually using the timeline Unloaded { @@ -118,8 +276,8 @@ pub enum LocalTimelineState { Unloaded, } -impl<'a> From<&'a RepositoryTimeline> for LocalTimelineState { - fn from(local_timeline_entry: &'a RepositoryTimeline) -> Self { +impl<'a, T> From<&'a RepositoryTimeline> for LocalTimelineState { + fn from(local_timeline_entry: &'a RepositoryTimeline) -> Self { match local_timeline_entry { RepositoryTimeline::Loaded(_) => LocalTimelineState::Loaded, RepositoryTimeline::Unloaded { .. } => LocalTimelineState::Unloaded, @@ -132,42 +290,22 @@ impl<'a> From<&'a RepositoryTimeline> for LocalTimelineState { /// #[derive(Default)] pub struct GcResult { - pub ondisk_relfiles_total: u64, - pub ondisk_relfiles_needed_by_cutoff: u64, - pub ondisk_relfiles_needed_by_branches: u64, - pub ondisk_relfiles_not_updated: u64, - pub ondisk_relfiles_needed_as_tombstone: u64, - pub ondisk_relfiles_removed: u64, // # of layer files removed because they have been made obsolete by newer ondisk files. 
- pub ondisk_relfiles_dropped: u64, // # of layer files removed because the relation was dropped - - pub ondisk_nonrelfiles_total: u64, - pub ondisk_nonrelfiles_needed_by_cutoff: u64, - pub ondisk_nonrelfiles_needed_by_branches: u64, - pub ondisk_nonrelfiles_not_updated: u64, - pub ondisk_nonrelfiles_needed_as_tombstone: u64, - pub ondisk_nonrelfiles_removed: u64, // # of layer files removed because they have been made obsolete by newer ondisk files. - pub ondisk_nonrelfiles_dropped: u64, // # of layer files removed because the relation was dropped + pub layers_total: u64, + pub layers_needed_by_cutoff: u64, + pub layers_needed_by_branches: u64, + pub layers_not_updated: u64, + pub layers_removed: u64, // # of layer files removed because they have been made obsolete by newer ondisk files. pub elapsed: Duration, } impl AddAssign for GcResult { fn add_assign(&mut self, other: Self) { - self.ondisk_relfiles_total += other.ondisk_relfiles_total; - self.ondisk_relfiles_needed_by_cutoff += other.ondisk_relfiles_needed_by_cutoff; - self.ondisk_relfiles_needed_by_branches += other.ondisk_relfiles_needed_by_branches; - self.ondisk_relfiles_not_updated += other.ondisk_relfiles_not_updated; - self.ondisk_relfiles_needed_as_tombstone += other.ondisk_relfiles_needed_as_tombstone; - self.ondisk_relfiles_removed += other.ondisk_relfiles_removed; - self.ondisk_relfiles_dropped += other.ondisk_relfiles_dropped; - - self.ondisk_nonrelfiles_total += other.ondisk_nonrelfiles_total; - self.ondisk_nonrelfiles_needed_by_cutoff += other.ondisk_nonrelfiles_needed_by_cutoff; - self.ondisk_nonrelfiles_needed_by_branches += other.ondisk_nonrelfiles_needed_by_branches; - self.ondisk_nonrelfiles_not_updated += other.ondisk_nonrelfiles_not_updated; - self.ondisk_nonrelfiles_needed_as_tombstone += other.ondisk_nonrelfiles_needed_as_tombstone; - self.ondisk_nonrelfiles_removed += other.ondisk_nonrelfiles_removed; - self.ondisk_nonrelfiles_dropped += other.ondisk_nonrelfiles_dropped; + 
self.layers_total += other.layers_total; + self.layers_needed_by_cutoff += other.layers_needed_by_cutoff; + self.layers_needed_by_branches += other.layers_needed_by_branches; + self.layers_not_updated += other.layers_not_updated; + self.layers_removed += other.layers_removed; self.elapsed += other.elapsed; } @@ -190,23 +328,14 @@ pub trait Timeline: Send + Sync { fn get_latest_gc_cutoff_lsn(&self) -> RwLockReadGuard; /// Look up given page version. - fn get_page_at_lsn(&self, tag: RelishTag, blknum: BlockNumber, lsn: Lsn) -> Result; - - /// Get size of a relish - fn get_relish_size(&self, tag: RelishTag, lsn: Lsn) -> Result>; - - /// Does relation exist? - fn get_rel_exists(&self, tag: RelishTag, lsn: Lsn) -> Result; - - /// Get a list of all existing relations - /// Pass RelTag to get relation objects or None to get nonrels. - fn list_relishes(&self, tag: Option, lsn: Lsn) -> Result>; - - /// Get a list of all existing relations in given tablespace and database. - fn list_rels(&self, spcnode: u32, dbnode: u32, lsn: Lsn) -> Result>; - - /// Get a list of all existing non-relational objects - fn list_nonrels(&self, lsn: Lsn) -> Result>; + /// + /// NOTE: It is considered an error to 'get' a key that doesn't exist. The abstraction + /// above this needs to store suitable metadata to track what data exists with + /// what keys, in separate metadata entries. If a non-existent key is requested, + /// the Repository implementation may incorrectly return a value from an ancestor + /// branch, for example, or waste a lot of cycles chasing the non-existing key. + /// + fn get(&self, key: Key, lsn: Lsn) -> Result; /// Get the ancestor's timeline id fn get_ancestor_timeline_id(&self) -> Option; @@ -219,7 +348,6 @@ pub trait Timeline: Send + Sync { // // These are called by the WAL receiver to digest WAL records. //------------------------------------------------------------------------------ - /// Atomically get both last and prev. 
fn get_last_record_rlsn(&self) -> RecordLsn; @@ -231,6 +359,10 @@ pub trait Timeline: Send + Sync { fn get_disk_consistent_lsn(&self) -> Lsn; /// Mutate the timeline with a [`TimelineWriter`]. + /// + /// FIXME: This ought to return &'a TimelineWriter, where TimelineWriter + /// is a generic type in this trait. But that doesn't currently work in + /// Rust: https://rust-lang.github.io/rfcs/1598-generic_associated_types.html fn writer<'a>(&'a self) -> Box; /// @@ -240,6 +372,19 @@ pub trait Timeline: Send + Sync { /// know anything about them here in the repository. fn checkpoint(&self, cconf: CheckpointConfig) -> Result<()>; + /// + /// Tell the implementation how the keyspace should be partitioned. + /// + /// FIXME: This is quite a hack. The code in pgdatadir_mapping.rs knows + /// which keys exist and what is the logical grouping of them. That's why + /// the code there (and in keyspace.rs) decides the partitioning, not the + /// layered_repository.rs implementation. That's a layering violation: + /// the Repository implementation ought to be responsible for the physical + /// layout, but currently it's more convenient to do it in pgdatadir_mapping.rs + /// rather than in layered_repository.rs. + /// + fn hint_partitioning(&self, partitioning: KeyPartitioning, lsn: Lsn) -> Result<()>; + /// /// Check that it is valid to request operations with that lsn. fn check_lsn_is_in_scope( @@ -247,107 +392,39 @@ pub trait Timeline: Send + Sync { lsn: Lsn, latest_gc_cutoff_lsn: &RwLockReadGuard, ) -> Result<()>; - - /// Retrieve current logical size of the timeline - /// - /// NOTE: counted incrementally, includes ancestors, - /// doesnt support TwoPhase relishes yet - fn get_current_logical_size(&self) -> usize; - - /// Does the same as get_current_logical_size but counted on demand. - /// Used in tests to ensure that incremental and non incremental variants match. 
- fn get_current_logical_size_non_incremental(&self, lsn: Lsn) -> Result; - - /// An escape hatch to allow "casting" a generic Timeline to LayeredTimeline. - fn upgrade_to_layered_timeline(&self) -> &crate::layered_repository::LayeredTimeline; } /// Various functions to mutate the timeline. // TODO Currently, Deref is used to allow easy access to read methods from this trait. // This is probably considered a bad practice in Rust and should be fixed eventually, // but will cause large code changes. -pub trait TimelineWriter: Deref { +pub trait TimelineWriter<'a> { /// Put a new page version that can be constructed from a WAL record /// /// This will implicitly extend the relation, if the page is beyond the /// current end-of-file. - fn put_wal_record( - &self, - lsn: Lsn, - tag: RelishTag, - blknum: BlockNumber, - rec: ZenithWalRecord, - ) -> Result<()>; + fn put(&self, key: Key, lsn: Lsn, value: Value) -> Result<()>; - /// Like put_wal_record, but with ready-made image of the page. - fn put_page_image( - &self, - tag: RelishTag, - blknum: BlockNumber, - lsn: Lsn, - img: Bytes, - ) -> Result<()>; + fn delete(&self, key_range: Range, lsn: Lsn) -> Result<()>; - /// Truncate relation - fn put_truncation(&self, rel: RelishTag, lsn: Lsn, nblocks: BlockNumber) -> Result<()>; - - /// This method is used for marking dropped relations and truncated SLRU files and aborted two phase records - fn drop_relish(&self, tag: RelishTag, lsn: Lsn) -> Result<()>; - - /// Track end of the latest digested WAL record. + /// Track the end of the latest digested WAL record. /// - /// Advance requires aligned LSN as an argument and would wake wait_lsn() callers. - /// Previous last record LSN is stored alongside the latest and can be read. - fn advance_last_record_lsn(&self, lsn: Lsn); -} - -/// Each update to a page is represented by a ZenithWalRecord. It can be a wrapper -/// around a PostgreSQL WAL record, or a custom zenith-specific "record". 
-#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)] -pub enum ZenithWalRecord { - /// Native PostgreSQL WAL record - Postgres { will_init: bool, rec: Bytes }, - - /// Clear bits in heap visibility map. ('flags' is bitmap of bits to clear) - ClearVisibilityMapFlags { - new_heap_blkno: Option, - old_heap_blkno: Option, - flags: u8, - }, - /// Mark transaction IDs as committed on a CLOG page - ClogSetCommitted { xids: Vec }, - /// Mark transaction IDs as aborted on a CLOG page - ClogSetAborted { xids: Vec }, - /// Extend multixact offsets SLRU - MultixactOffsetCreate { - mid: MultiXactId, - moff: MultiXactOffset, - }, - /// Extend multixact members SLRU. - MultixactMembersCreate { - moff: MultiXactOffset, - members: Vec, - }, -} - -impl ZenithWalRecord { - /// Does replaying this WAL record initialize the page from scratch, or does - /// it need to be applied over the previous image of the page? - pub fn will_init(&self) -> bool { - match self { - ZenithWalRecord::Postgres { will_init, rec: _ } => *will_init, - - // None of the special zenith record types currently initialize the page - _ => false, - } - } + /// Call this after you have finished writing all the WAL up to 'lsn'. + /// + /// 'lsn' must be aligned. This wakes up any wait_lsn() callers waiting for + /// the 'lsn' or anything older. The previous last record LSN is stored alongside + /// the latest and can be read. 
+ fn finish_write(&self, lsn: Lsn); } #[cfg(test)] pub mod repo_harness { use bytes::BytesMut; + use lazy_static::lazy_static; + use std::sync::{Arc, RwLock, RwLockReadGuard, RwLockWriteGuard}; use std::{fs, path::PathBuf}; + use crate::RepositoryImpl; use crate::{ config::PageServerConf, layered_repository::LayeredRepository, @@ -368,18 +445,39 @@ pub mod repo_harness { pub fn TEST_IMG(s: &str) -> Bytes { let mut buf = BytesMut::new(); buf.extend_from_slice(s.as_bytes()); - buf.resize(8192, 0); + buf.resize(64, 0); buf.freeze() } - pub struct RepoHarness { - pub conf: &'static PageServerConf, - pub tenant_id: ZTenantId, + lazy_static! { + static ref LOCK: RwLock<()> = RwLock::new(()); } - impl RepoHarness { + pub struct RepoHarness<'a> { + pub conf: &'static PageServerConf, + pub tenant_id: ZTenantId, + + pub lock_guard: ( + Option>, + Option>, + ), + } + + impl<'a> RepoHarness<'a> { pub fn create(test_name: &'static str) -> Result { + Self::create_internal(test_name, false) + } + pub fn create_exclusive(test_name: &'static str) -> Result { + Self::create_internal(test_name, true) + } + fn create_internal(test_name: &'static str, exclusive: bool) -> Result { + let lock_guard = if exclusive { + (None, Some(LOCK.write().unwrap())) + } else { + (Some(LOCK.read().unwrap()), None) + }; + let repo_dir = PageServerConf::test_repo_dir(test_name); let _ = fs::remove_dir_all(&repo_dir); fs::create_dir_all(&repo_dir)?; @@ -393,23 +491,27 @@ pub mod repo_harness { fs::create_dir_all(conf.tenant_path(&tenant_id))?; fs::create_dir_all(conf.timelines_path(&tenant_id))?; - Ok(Self { conf, tenant_id }) + Ok(Self { + conf, + tenant_id, + lock_guard, + }) } - pub fn load(&self) -> Box { + pub fn load(&self) -> RepositoryImpl { self.try_load().expect("failed to load test repo") } - pub fn try_load(&self) -> Result> { + pub fn try_load(&self) -> Result { let walredo_mgr = Arc::new(TestRedoManager); - let repo = Box::new(LayeredRepository::new( + let repo = LayeredRepository::new( 
self.conf, walredo_mgr, self.tenant_id, RemoteIndex::empty(), false, - )); + ); // populate repo with locally available timelines for timeline_dir_entry in fs::read_dir(self.conf.timelines_path(&self.tenant_id)) .expect("should be able to read timelines dir") @@ -438,21 +540,19 @@ pub mod repo_harness { } // Mock WAL redo manager that doesn't do much - struct TestRedoManager; + pub struct TestRedoManager; impl WalRedoManager for TestRedoManager { fn request_redo( &self, - rel: RelishTag, - blknum: BlockNumber, + key: Key, lsn: Lsn, base_img: Option, records: Vec<(Lsn, ZenithWalRecord)>, ) -> Result { let s = format!( - "redo for {} blk {} to get to {}, with {} and {} records", - rel, - blknum, + "redo for {} to get to {}, with {} and {} records", + key, lsn, if base_img.is_some() { "base image" @@ -462,6 +562,7 @@ pub mod repo_harness { records.len() ); println!("{}", s); + Ok(TEST_IMG(&s)) } } @@ -475,411 +576,43 @@ pub mod repo_harness { mod tests { use super::repo_harness::*; use super::*; - use postgres_ffi::{pg_constants, xlog_utils::SIZEOF_CHECKPOINT}; - use std::fs; + //use postgres_ffi::{pg_constants, xlog_utils::SIZEOF_CHECKPOINT}; + //use std::sync::Arc; + use bytes::BytesMut; + use hex_literal::hex; + use lazy_static::lazy_static; - /// Arbitrary relation tag, for testing. - const TESTREL_A_REL_TAG: RelTag = RelTag { - spcnode: 0, - dbnode: 111, - relnode: 1000, - forknum: 0, - }; - const TESTREL_A: RelishTag = RelishTag::Relation(TESTREL_A_REL_TAG); - const TESTREL_B: RelishTag = RelishTag::Relation(RelTag { - spcnode: 0, - dbnode: 111, - relnode: 1001, - forknum: 0, - }); - - fn assert_current_logical_size(timeline: &Arc, lsn: Lsn) { - let incremental = timeline.get_current_logical_size(); - let non_incremental = timeline - .get_current_logical_size_non_incremental(lsn) - .unwrap(); - assert_eq!(incremental, non_incremental); + lazy_static! 
{ + static ref TEST_KEY: Key = Key::from_array(hex!("112222222233333333444444445500000001")); } - static ZERO_PAGE: Bytes = Bytes::from_static(&[0u8; 8192]); - static ZERO_CHECKPOINT: Bytes = Bytes::from_static(&[0u8; SIZEOF_CHECKPOINT]); - #[test] - fn test_relsize() -> Result<()> { - let repo = RepoHarness::create("test_relsize")?.load(); - // get_timeline() with non-existent timeline id should fail - //repo.get_timeline("11223344556677881122334455667788"); - - // Create timeline to work on + fn test_basic() -> Result<()> { + let repo = RepoHarness::create("test_basic")?.load(); let tline = repo.create_empty_timeline(TIMELINE_ID, Lsn(0))?; + let writer = tline.writer(); + writer.put(*TEST_KEY, Lsn(0x10), Value::Image(TEST_IMG("foo at 0x10")))?; + writer.finish_write(Lsn(0x10)); + drop(writer); - writer.put_page_image(TESTREL_A, 0, Lsn(0x20), TEST_IMG("foo blk 0 at 2"))?; - writer.put_page_image(TESTREL_A, 0, Lsn(0x20), TEST_IMG("foo blk 0 at 2"))?; - writer.put_page_image(TESTREL_A, 0, Lsn(0x30), TEST_IMG("foo blk 0 at 3"))?; - writer.put_page_image(TESTREL_A, 1, Lsn(0x40), TEST_IMG("foo blk 1 at 4"))?; - writer.put_page_image(TESTREL_A, 2, Lsn(0x50), TEST_IMG("foo blk 2 at 5"))?; + let writer = tline.writer(); + writer.put(*TEST_KEY, Lsn(0x20), Value::Image(TEST_IMG("foo at 0x20")))?; + writer.finish_write(Lsn(0x20)); + drop(writer); - writer.advance_last_record_lsn(Lsn(0x50)); - - assert_current_logical_size(&tline, Lsn(0x50)); - - // The relation was created at LSN 2, not visible at LSN 1 yet. 
- assert_eq!(tline.get_rel_exists(TESTREL_A, Lsn(0x10))?, false); - assert!(tline.get_relish_size(TESTREL_A, Lsn(0x10))?.is_none()); - - assert_eq!(tline.get_rel_exists(TESTREL_A, Lsn(0x20))?, true); - assert_eq!(tline.get_relish_size(TESTREL_A, Lsn(0x20))?.unwrap(), 1); - assert_eq!(tline.get_relish_size(TESTREL_A, Lsn(0x50))?.unwrap(), 3); - - // Check page contents at each LSN - assert_eq!( - tline.get_page_at_lsn(TESTREL_A, 0, Lsn(0x20))?, - TEST_IMG("foo blk 0 at 2") - ); - - assert_eq!( - tline.get_page_at_lsn(TESTREL_A, 0, Lsn(0x30))?, - TEST_IMG("foo blk 0 at 3") - ); - - assert_eq!( - tline.get_page_at_lsn(TESTREL_A, 0, Lsn(0x40))?, - TEST_IMG("foo blk 0 at 3") - ); - assert_eq!( - tline.get_page_at_lsn(TESTREL_A, 1, Lsn(0x40))?, - TEST_IMG("foo blk 1 at 4") - ); - - assert_eq!( - tline.get_page_at_lsn(TESTREL_A, 0, Lsn(0x50))?, - TEST_IMG("foo blk 0 at 3") - ); - assert_eq!( - tline.get_page_at_lsn(TESTREL_A, 1, Lsn(0x50))?, - TEST_IMG("foo blk 1 at 4") - ); - assert_eq!( - tline.get_page_at_lsn(TESTREL_A, 2, Lsn(0x50))?, - TEST_IMG("foo blk 2 at 5") - ); - - // Truncate last block - writer.put_truncation(TESTREL_A, Lsn(0x60), 2)?; - writer.advance_last_record_lsn(Lsn(0x60)); - assert_current_logical_size(&tline, Lsn(0x60)); - - // Check reported size and contents after truncation - assert_eq!(tline.get_relish_size(TESTREL_A, Lsn(0x60))?.unwrap(), 2); - assert_eq!( - tline.get_page_at_lsn(TESTREL_A, 0, Lsn(0x60))?, - TEST_IMG("foo blk 0 at 3") - ); - assert_eq!( - tline.get_page_at_lsn(TESTREL_A, 1, Lsn(0x60))?, - TEST_IMG("foo blk 1 at 4") - ); - - // should still see the truncated block with older LSN - assert_eq!(tline.get_relish_size(TESTREL_A, Lsn(0x50))?.unwrap(), 3); - assert_eq!( - tline.get_page_at_lsn(TESTREL_A, 2, Lsn(0x50))?, - TEST_IMG("foo blk 2 at 5") - ); - - // Truncate to zero length - writer.put_truncation(TESTREL_A, Lsn(0x68), 0)?; - writer.advance_last_record_lsn(Lsn(0x68)); - assert_eq!(tline.get_relish_size(TESTREL_A, 
Lsn(0x68))?.unwrap(), 0); - - // Extend from 0 to 2 blocks, leaving a gap - writer.put_page_image(TESTREL_A, 1, Lsn(0x70), TEST_IMG("foo blk 1"))?; - writer.advance_last_record_lsn(Lsn(0x70)); - assert_eq!(tline.get_relish_size(TESTREL_A, Lsn(0x70))?.unwrap(), 2); - assert_eq!(tline.get_page_at_lsn(TESTREL_A, 0, Lsn(0x70))?, ZERO_PAGE); - assert_eq!( - tline.get_page_at_lsn(TESTREL_A, 1, Lsn(0x70))?, - TEST_IMG("foo blk 1") - ); - - // Extend a lot more, leaving a big gap that spans across segments - // FIXME: This is currently broken, see https://github.com/zenithdb/zenith/issues/500 - /* - tline.put_page_image(TESTREL_A, 1500, Lsn(0x80), TEST_IMG("foo blk 1500"))?; - tline.advance_last_record_lsn(Lsn(0x80)); - assert_eq!(tline.get_relish_size(TESTREL_A, Lsn(0x80))?.unwrap(), 1501); - for blk in 2..1500 { - assert_eq!( - tline.get_page_at_lsn(TESTREL_A, blk, Lsn(0x80))?, - ZERO_PAGE); - } - assert_eq!( - tline.get_page_at_lsn(TESTREL_A, 1500, Lsn(0x80))?, - TEST_IMG("foo blk 1500")); - */ + assert_eq!(tline.get(*TEST_KEY, Lsn(0x10))?, TEST_IMG("foo at 0x10")); + assert_eq!(tline.get(*TEST_KEY, Lsn(0x1f))?, TEST_IMG("foo at 0x10")); + assert_eq!(tline.get(*TEST_KEY, Lsn(0x20))?, TEST_IMG("foo at 0x20")); Ok(()) } - // Test what happens if we dropped a relation - // and then created it again within the same layer. 
- #[test] - fn test_drop_extend() -> Result<()> { - let repo = RepoHarness::create("test_drop_extend")?.load(); - - // Create timeline to work on - let tline = repo.create_empty_timeline(TIMELINE_ID, Lsn(0))?; - let writer = tline.writer(); - - writer.put_page_image(TESTREL_A, 0, Lsn(0x20), TEST_IMG("foo blk 0 at 2"))?; - writer.advance_last_record_lsn(Lsn(0x20)); - - // Check that rel exists and size is correct - assert_eq!(tline.get_rel_exists(TESTREL_A, Lsn(0x20))?, true); - assert_eq!(tline.get_relish_size(TESTREL_A, Lsn(0x20))?.unwrap(), 1); - - // Drop relish - writer.drop_relish(TESTREL_A, Lsn(0x30))?; - writer.advance_last_record_lsn(Lsn(0x30)); - - // Check that rel is not visible anymore - assert_eq!(tline.get_rel_exists(TESTREL_A, Lsn(0x30))?, false); - assert!(tline.get_relish_size(TESTREL_A, Lsn(0x30))?.is_none()); - - // Extend it again - writer.put_page_image(TESTREL_A, 0, Lsn(0x40), TEST_IMG("foo blk 0 at 4"))?; - writer.advance_last_record_lsn(Lsn(0x40)); - - // Check that rel exists and size is correct - assert_eq!(tline.get_rel_exists(TESTREL_A, Lsn(0x40))?, true); - assert_eq!(tline.get_relish_size(TESTREL_A, Lsn(0x40))?.unwrap(), 1); - - Ok(()) - } - - // Test what happens if we truncated a relation - // so that one of its segments was dropped - // and then extended it again within the same layer. 
- #[test] - fn test_truncate_extend() -> Result<()> { - let repo = RepoHarness::create("test_truncate_extend")?.load(); - - // Create timeline to work on - let tline = repo.create_empty_timeline(TIMELINE_ID, Lsn(0))?; - let writer = tline.writer(); - - //from storage_layer.rs - const RELISH_SEG_SIZE: u32 = 10 * 1024 * 1024 / 8192; - let relsize = RELISH_SEG_SIZE * 2; - - // Create relation with relsize blocks - for blkno in 0..relsize { - let lsn = Lsn(0x20); - let data = format!("foo blk {} at {}", blkno, lsn); - writer.put_page_image(TESTREL_A, blkno, lsn, TEST_IMG(&data))?; - } - - writer.advance_last_record_lsn(Lsn(0x20)); - - // The relation was created at LSN 2, not visible at LSN 1 yet. - assert_eq!(tline.get_rel_exists(TESTREL_A, Lsn(0x10))?, false); - assert!(tline.get_relish_size(TESTREL_A, Lsn(0x10))?.is_none()); - - assert_eq!(tline.get_rel_exists(TESTREL_A, Lsn(0x20))?, true); - assert_eq!( - tline.get_relish_size(TESTREL_A, Lsn(0x20))?.unwrap(), - relsize - ); - - // Check relation content - for blkno in 0..relsize { - let lsn = Lsn(0x20); - let data = format!("foo blk {} at {}", blkno, lsn); - assert_eq!( - tline.get_page_at_lsn(TESTREL_A, blkno, lsn)?, - TEST_IMG(&data) - ); - } - - // Truncate relation so that second segment was dropped - // - only leave one page - writer.put_truncation(TESTREL_A, Lsn(0x60), 1)?; - writer.advance_last_record_lsn(Lsn(0x60)); - - // Check reported size and contents after truncation - assert_eq!(tline.get_relish_size(TESTREL_A, Lsn(0x60))?.unwrap(), 1); - - for blkno in 0..1 { - let lsn = Lsn(0x20); - let data = format!("foo blk {} at {}", blkno, lsn); - assert_eq!( - tline.get_page_at_lsn(TESTREL_A, blkno, Lsn(0x60))?, - TEST_IMG(&data) - ); - } - - // should still see all blocks with older LSN - assert_eq!( - tline.get_relish_size(TESTREL_A, Lsn(0x50))?.unwrap(), - relsize - ); - for blkno in 0..relsize { - let lsn = Lsn(0x20); - let data = format!("foo blk {} at {}", blkno, lsn); - assert_eq!( - 
tline.get_page_at_lsn(TESTREL_A, blkno, Lsn(0x50))?, - TEST_IMG(&data) - ); - } - - // Extend relation again. - // Add enough blocks to create second segment - for blkno in 0..relsize { - let lsn = Lsn(0x80); - let data = format!("foo blk {} at {}", blkno, lsn); - writer.put_page_image(TESTREL_A, blkno, lsn, TEST_IMG(&data))?; - } - writer.advance_last_record_lsn(Lsn(0x80)); - - assert_eq!(tline.get_rel_exists(TESTREL_A, Lsn(0x80))?, true); - assert_eq!( - tline.get_relish_size(TESTREL_A, Lsn(0x80))?.unwrap(), - relsize - ); - // Check relation content - for blkno in 0..relsize { - let lsn = Lsn(0x80); - let data = format!("foo blk {} at {}", blkno, lsn); - assert_eq!( - tline.get_page_at_lsn(TESTREL_A, blkno, Lsn(0x80))?, - TEST_IMG(&data) - ); - } - - Ok(()) - } - - /// Test get_relsize() and truncation with a file larger than 1 GB, so that it's - /// split into multiple 1 GB segments in Postgres. - #[test] - fn test_large_rel() -> Result<()> { - let repo = RepoHarness::create("test_large_rel")?.load(); - let tline = repo.create_empty_timeline(TIMELINE_ID, Lsn(0))?; - let writer = tline.writer(); - - let mut lsn = 0x10; - for blknum in 0..pg_constants::RELSEG_SIZE + 1 { - lsn += 0x10; - let img = TEST_IMG(&format!("foo blk {} at {}", blknum, Lsn(lsn))); - writer.put_page_image(TESTREL_A, blknum as BlockNumber, Lsn(lsn), img)?; - } - writer.advance_last_record_lsn(Lsn(lsn)); - - assert_current_logical_size(&tline, Lsn(lsn)); - - assert_eq!( - tline.get_relish_size(TESTREL_A, Lsn(lsn))?.unwrap(), - pg_constants::RELSEG_SIZE + 1 - ); - - // Truncate one block - lsn += 0x10; - writer.put_truncation(TESTREL_A, Lsn(lsn), pg_constants::RELSEG_SIZE)?; - writer.advance_last_record_lsn(Lsn(lsn)); - assert_eq!( - tline.get_relish_size(TESTREL_A, Lsn(lsn))?.unwrap(), - pg_constants::RELSEG_SIZE - ); - assert_current_logical_size(&tline, Lsn(lsn)); - - // Truncate another block - lsn += 0x10; - writer.put_truncation(TESTREL_A, Lsn(lsn), pg_constants::RELSEG_SIZE - 1)?; - 
writer.advance_last_record_lsn(Lsn(lsn)); - assert_eq!( - tline.get_relish_size(TESTREL_A, Lsn(lsn))?.unwrap(), - pg_constants::RELSEG_SIZE - 1 - ); - assert_current_logical_size(&tline, Lsn(lsn)); - - // Truncate to 1500, and then truncate all the way down to 0, one block at a time - // This tests the behavior at segment boundaries - let mut size: i32 = 3000; - while size >= 0 { - lsn += 0x10; - writer.put_truncation(TESTREL_A, Lsn(lsn), size as BlockNumber)?; - writer.advance_last_record_lsn(Lsn(lsn)); - assert_eq!( - tline.get_relish_size(TESTREL_A, Lsn(lsn))?.unwrap(), - size as BlockNumber - ); - - size -= 1; - } - assert_current_logical_size(&tline, Lsn(lsn)); - - Ok(()) - } - - /// - /// Test list_rels() function, with branches and dropped relations - /// - #[test] - fn test_list_rels_drop() -> Result<()> { - let repo = RepoHarness::create("test_list_rels_drop")?.load(); - let tline = repo.create_empty_timeline(TIMELINE_ID, Lsn(0))?; - let writer = tline.writer(); - const TESTDB: u32 = 111; - - // Import initial dummy checkpoint record, otherwise the get_timeline() call - // after branching fails below - writer.put_page_image(RelishTag::Checkpoint, 0, Lsn(0x10), ZERO_CHECKPOINT.clone())?; - - // Create a relation on the timeline - writer.put_page_image(TESTREL_A, 0, Lsn(0x20), TEST_IMG("foo blk 0 at 2"))?; - - writer.advance_last_record_lsn(Lsn(0x30)); - - // Check that list_rels() lists it after LSN 2, but no before it - assert!(!tline.list_rels(0, TESTDB, Lsn(0x10))?.contains(&TESTREL_A)); - assert!(tline.list_rels(0, TESTDB, Lsn(0x20))?.contains(&TESTREL_A)); - assert!(tline.list_rels(0, TESTDB, Lsn(0x30))?.contains(&TESTREL_A)); - - // Create a branch, check that the relation is visible there - repo.branch_timeline(TIMELINE_ID, NEW_TIMELINE_ID, Lsn(0x30))?; - let newtline = repo - .get_timeline_load(NEW_TIMELINE_ID) - .expect("Should have a local timeline"); - let new_writer = newtline.writer(); - - assert!(newtline - .list_rels(0, TESTDB, Lsn(0x30))? 
- .contains(&TESTREL_A)); - - // Drop it on the branch - new_writer.drop_relish(TESTREL_A, Lsn(0x40))?; - new_writer.advance_last_record_lsn(Lsn(0x40)); - - drop(new_writer); - - // Check that it's no longer listed on the branch after the point where it was dropped - assert!(newtline - .list_rels(0, TESTDB, Lsn(0x30))? - .contains(&TESTREL_A)); - assert!(!newtline - .list_rels(0, TESTDB, Lsn(0x40))? - .contains(&TESTREL_A)); - - // Run checkpoint and garbage collection and check that it's still not visible - newtline.checkpoint(CheckpointConfig::Forced)?; - repo.gc_iteration(Some(NEW_TIMELINE_ID), 0, true)?; - - assert!(!newtline - .list_rels(0, TESTDB, Lsn(0x40))? - .contains(&TESTREL_A)); - - Ok(()) + /// Convenience function to create a page image with given string as the only content + pub fn test_value(s: &str) -> Value { + let mut buf = BytesMut::new(); + buf.extend_from_slice(s.as_bytes()); + Value::Image(buf.freeze()) } /// @@ -890,21 +623,24 @@ mod tests { let repo = RepoHarness::create("test_branch")?.load(); let tline = repo.create_empty_timeline(TIMELINE_ID, Lsn(0))?; let writer = tline.writer(); + use std::str::from_utf8; - // Import initial dummy checkpoint record, otherwise the get_timeline() call - // after branching fails below - writer.put_page_image(RelishTag::Checkpoint, 0, Lsn(0x10), ZERO_CHECKPOINT.clone())?; + #[allow(non_snake_case)] + let TEST_KEY_A: Key = Key::from_hex("112222222233333333444444445500000001").unwrap(); + #[allow(non_snake_case)] + let TEST_KEY_B: Key = Key::from_hex("112222222233333333444444445500000002").unwrap(); - // Create a relation on the timeline - writer.put_page_image(TESTREL_A, 0, Lsn(0x20), TEST_IMG("foo blk 0 at 2"))?; - writer.put_page_image(TESTREL_A, 0, Lsn(0x30), TEST_IMG("foo blk 0 at 3"))?; - writer.put_page_image(TESTREL_A, 0, Lsn(0x40), TEST_IMG("foo blk 0 at 4"))?; + // Insert a value on the timeline + writer.put(TEST_KEY_A, Lsn(0x20), test_value("foo at 0x20"))?; + writer.put(TEST_KEY_B, Lsn(0x20), 
test_value("foobar at 0x20"))?; + writer.finish_write(Lsn(0x20)); - // Create another relation - writer.put_page_image(TESTREL_B, 0, Lsn(0x20), TEST_IMG("foobar blk 0 at 2"))?; + writer.put(TEST_KEY_A, Lsn(0x30), test_value("foo at 0x30"))?; + writer.finish_write(Lsn(0x30)); + writer.put(TEST_KEY_A, Lsn(0x40), test_value("foo at 0x40"))?; + writer.finish_write(Lsn(0x40)); - writer.advance_last_record_lsn(Lsn(0x40)); - assert_current_logical_size(&tline, Lsn(0x40)); + //assert_current_logical_size(&tline, Lsn(0x40)); // Branch the history, modify relation differently on the new timeline repo.branch_timeline(TIMELINE_ID, NEW_TIMELINE_ID, Lsn(0x30))?; @@ -912,71 +648,65 @@ mod tests { .get_timeline_load(NEW_TIMELINE_ID) .expect("Should have a local timeline"); let new_writer = newtline.writer(); - - new_writer.put_page_image(TESTREL_A, 0, Lsn(0x40), TEST_IMG("bar blk 0 at 4"))?; - new_writer.advance_last_record_lsn(Lsn(0x40)); + new_writer.put(TEST_KEY_A, Lsn(0x40), test_value("bar at 0x40"))?; + new_writer.finish_write(Lsn(0x40)); // Check page contents on both branches assert_eq!( - tline.get_page_at_lsn(TESTREL_A, 0, Lsn(0x40))?, - TEST_IMG("foo blk 0 at 4") + from_utf8(&tline.get(TEST_KEY_A, Lsn(0x40))?)?, + "foo at 0x40" ); - assert_eq!( - newtline.get_page_at_lsn(TESTREL_A, 0, Lsn(0x40))?, - TEST_IMG("bar blk 0 at 4") + from_utf8(&newtline.get(TEST_KEY_A, Lsn(0x40))?)?, + "bar at 0x40" ); - assert_eq!( - newtline.get_page_at_lsn(TESTREL_B, 0, Lsn(0x40))?, - TEST_IMG("foobar blk 0 at 2") + from_utf8(&newtline.get(TEST_KEY_B, Lsn(0x40))?)?, + "foobar at 0x20" ); - assert_eq!(newtline.get_relish_size(TESTREL_B, Lsn(0x40))?.unwrap(), 1); - - assert_current_logical_size(&tline, Lsn(0x40)); + //assert_current_logical_size(&tline, Lsn(0x40)); Ok(()) } - fn make_some_layers(tline: &Arc, start_lsn: Lsn) -> Result<()> { + fn make_some_layers(tline: &T, start_lsn: Lsn) -> Result<()> { let mut lsn = start_lsn; + #[allow(non_snake_case)] { let writer = tline.writer(); // 
Create a relation on the timeline - writer.put_page_image( - TESTREL_A, - 0, + writer.put( + *TEST_KEY, lsn, - TEST_IMG(&format!("foo blk 0 at {}", lsn)), + Value::Image(TEST_IMG(&format!("foo at {}", lsn))), )?; + writer.finish_write(lsn); lsn += 0x10; - writer.put_page_image( - TESTREL_A, - 0, + writer.put( + *TEST_KEY, lsn, - TEST_IMG(&format!("foo blk 0 at {}", lsn)), + Value::Image(TEST_IMG(&format!("foo at {}", lsn))), )?; - writer.advance_last_record_lsn(lsn); + writer.finish_write(lsn); + lsn += 0x10; } tline.checkpoint(CheckpointConfig::Forced)?; { let writer = tline.writer(); - lsn += 0x10; - writer.put_page_image( - TESTREL_A, - 0, + writer.put( + *TEST_KEY, lsn, - TEST_IMG(&format!("foo blk 0 at {}", lsn)), + Value::Image(TEST_IMG(&format!("foo at {}", lsn))), )?; + writer.finish_write(lsn); lsn += 0x10; - writer.put_page_image( - TESTREL_A, - 0, + writer.put( + *TEST_KEY, lsn, - TEST_IMG(&format!("foo blk 0 at {}", lsn)), + Value::Image(TEST_IMG(&format!("foo at {}", lsn))), )?; - writer.advance_last_record_lsn(lsn); + writer.finish_write(lsn); } tline.checkpoint(CheckpointConfig::Forced) } @@ -985,11 +715,13 @@ mod tests { fn test_prohibit_branch_creation_on_garbage_collected_data() -> Result<()> { let repo = RepoHarness::create("test_prohibit_branch_creation_on_garbage_collected_data")?.load(); - let tline = repo.create_empty_timeline(TIMELINE_ID, Lsn(0))?; - make_some_layers(&tline, Lsn(0x20))?; + make_some_layers(tline.as_ref(), Lsn(0x20))?; // this removes layers before lsn 40 (50 minus 10), so there are two remaining layers, image and delta for 31-50 + // FIXME: this doesn't actually remove any layer currently, given how the checkpointing + // and compaction works. But it does set the 'cutoff' point so that the cross check + // below should fail. 
repo.gc_iteration(Some(TIMELINE_ID), 0x10, false)?; // try to branch at lsn 25, should fail because we already garbage collected the data @@ -1029,32 +761,35 @@ mod tests { Ok(()) } + /* + // FIXME: This currently fails to error out. Calling GC doesn't currently + // remove the old value, we'd need to work a little harder #[test] - fn test_prohibit_get_page_at_lsn_for_garbage_collected_pages() -> Result<()> { + fn test_prohibit_get_for_garbage_collected_data() -> Result<()> { let repo = - RepoHarness::create("test_prohibit_get_page_at_lsn_for_garbage_collected_pages")? - .load(); + RepoHarness::create("test_prohibit_get_for_garbage_collected_data")? + .load(); let tline = repo.create_empty_timeline(TIMELINE_ID, Lsn(0))?; - make_some_layers(&tline, Lsn(0x20))?; + make_some_layers(tline.as_ref(), Lsn(0x20))?; repo.gc_iteration(Some(TIMELINE_ID), 0x10, false)?; let latest_gc_cutoff_lsn = tline.get_latest_gc_cutoff_lsn(); assert!(*latest_gc_cutoff_lsn > Lsn(0x25)); - match tline.get_page_at_lsn(TESTREL_A, 0, Lsn(0x25)) { + match tline.get(*TEST_KEY, Lsn(0x25)) { Ok(_) => panic!("request for page should have failed"), Err(err) => assert!(err.to_string().contains("not found at")), } Ok(()) } + */ #[test] fn test_retain_data_in_parent_which_is_needed_for_child() -> Result<()> { let repo = RepoHarness::create("test_retain_data_in_parent_which_is_needed_for_child")?.load(); let tline = repo.create_empty_timeline(TIMELINE_ID, Lsn(0))?; - - make_some_layers(&tline, Lsn(0x20))?; + make_some_layers(tline.as_ref(), Lsn(0x20))?; repo.branch_timeline(TIMELINE_ID, NEW_TIMELINE_ID, Lsn(0x40))?; let newtline = repo @@ -1062,92 +797,31 @@ mod tests { .expect("Should have a local timeline"); // this removes layers before lsn 40 (50 minus 10), so there are two remaining layers, image and delta for 31-50 repo.gc_iteration(Some(TIMELINE_ID), 0x10, false)?; - assert!(newtline.get_page_at_lsn(TESTREL_A, 0, Lsn(0x25)).is_ok()); + assert!(newtline.get(*TEST_KEY, Lsn(0x25)).is_ok()); Ok(()) } 
- #[test] fn test_parent_keeps_data_forever_after_branching() -> Result<()> { - let harness = RepoHarness::create("test_parent_keeps_data_forever_after_branching")?; - let repo = harness.load(); + let repo = RepoHarness::create("test_parent_keeps_data_forever_after_branching")?.load(); let tline = repo.create_empty_timeline(TIMELINE_ID, Lsn(0))?; - - make_some_layers(&tline, Lsn(0x20))?; + make_some_layers(tline.as_ref(), Lsn(0x20))?; repo.branch_timeline(TIMELINE_ID, NEW_TIMELINE_ID, Lsn(0x40))?; let newtline = repo .get_timeline_load(NEW_TIMELINE_ID) .expect("Should have a local timeline"); - make_some_layers(&newtline, Lsn(0x60))?; + make_some_layers(newtline.as_ref(), Lsn(0x60))?; // run gc on parent repo.gc_iteration(Some(TIMELINE_ID), 0x10, false)?; - // check that the layer in parent before the branching point is still there - let tline_dir = harness.conf.timeline_path(&TIMELINE_ID, &harness.tenant_id); - - let expected_image_layer_path = tline_dir.join(format!( - "rel_{}_{}_{}_{}_{}_{:016X}_{:016X}", - TESTREL_A_REL_TAG.spcnode, - TESTREL_A_REL_TAG.dbnode, - TESTREL_A_REL_TAG.relnode, - TESTREL_A_REL_TAG.forknum, - 0, // seg is 0 - 0x20, - 0x30, - )); - assert!(fs::metadata(&expected_image_layer_path).is_ok()); - - Ok(()) - } - - #[test] - fn test_read_beyond_eof() -> Result<()> { - let harness = RepoHarness::create("test_read_beyond_eof")?; - let repo = harness.load(); - let tline = repo.create_empty_timeline(TIMELINE_ID, Lsn(0))?; - - make_some_layers(&tline, Lsn(0x20))?; - { - let writer = tline.writer(); - writer.put_page_image( - TESTREL_A, - 0, - Lsn(0x60), - TEST_IMG(&format!("foo blk 0 at {}", Lsn(0x50))), - )?; - writer.advance_last_record_lsn(Lsn(0x60)); - } - - // Test read before rel creation. Should error out. - assert!(tline.get_page_at_lsn(TESTREL_A, 1, Lsn(0x10)).is_err()); - - // Read block beyond end of relation at different points in time. - // These reads should fall into different delta, image, and in-memory layers. 
- assert_eq!(tline.get_page_at_lsn(TESTREL_A, 1, Lsn(0x20))?, ZERO_PAGE); - assert_eq!(tline.get_page_at_lsn(TESTREL_A, 1, Lsn(0x25))?, ZERO_PAGE); - assert_eq!(tline.get_page_at_lsn(TESTREL_A, 1, Lsn(0x30))?, ZERO_PAGE); - assert_eq!(tline.get_page_at_lsn(TESTREL_A, 1, Lsn(0x35))?, ZERO_PAGE); - assert_eq!(tline.get_page_at_lsn(TESTREL_A, 1, Lsn(0x40))?, ZERO_PAGE); - assert_eq!(tline.get_page_at_lsn(TESTREL_A, 1, Lsn(0x45))?, ZERO_PAGE); - assert_eq!(tline.get_page_at_lsn(TESTREL_A, 1, Lsn(0x50))?, ZERO_PAGE); - assert_eq!(tline.get_page_at_lsn(TESTREL_A, 1, Lsn(0x55))?, ZERO_PAGE); - assert_eq!(tline.get_page_at_lsn(TESTREL_A, 1, Lsn(0x60))?, ZERO_PAGE); - - // Test on an in-memory layer with no preceding layer - { - let writer = tline.writer(); - writer.put_page_image( - TESTREL_B, - 0, - Lsn(0x70), - TEST_IMG(&format!("foo blk 0 at {}", Lsn(0x70))), - )?; - writer.advance_last_record_lsn(Lsn(0x70)); - } - assert_eq!(tline.get_page_at_lsn(TESTREL_B, 1, Lsn(0x70))?, ZERO_PAGE); + // Check that the data is still accessible on the branch. 
+ assert_eq!( + newtline.get(*TEST_KEY, Lsn(0x50))?, + TEST_IMG(&format!("foo at {}", Lsn(0x40))) + ); Ok(()) } @@ -1159,7 +833,7 @@ mod tests { { let repo = harness.load(); let tline = repo.create_empty_timeline(TIMELINE_ID, Lsn(0x8000))?; - make_some_layers(&tline, Lsn(0x8000))?; + make_some_layers(tline.as_ref(), Lsn(0x8000))?; tline.checkpoint(CheckpointConfig::Forced)?; } @@ -1188,7 +862,7 @@ mod tests { let repo = harness.load(); let tline = repo.create_empty_timeline(TIMELINE_ID, Lsn(0))?; - make_some_layers(&tline, Lsn(0x20))?; + make_some_layers(tline.as_ref(), Lsn(0x20))?; tline.checkpoint(CheckpointConfig::Forced)?; repo.branch_timeline(TIMELINE_ID, NEW_TIMELINE_ID, Lsn(0x40))?; @@ -1197,7 +871,7 @@ mod tests { .get_timeline_load(NEW_TIMELINE_ID) .expect("Should have a local timeline"); - make_some_layers(&newtline, Lsn(0x60))?; + make_some_layers(newtline.as_ref(), Lsn(0x60))?; tline.checkpoint(CheckpointConfig::Forced)?; } diff --git a/pageserver/src/tenant_mgr.rs b/pageserver/src/tenant_mgr.rs index e7cc4ecbaf..aeff718803 100644 --- a/pageserver/src/tenant_mgr.rs +++ b/pageserver/src/tenant_mgr.rs @@ -4,13 +4,13 @@ use crate::config::PageServerConf; use crate::layered_repository::LayeredRepository; use crate::remote_storage::RemoteIndex; -use crate::repository::{Repository, Timeline, TimelineSyncStatusUpdate}; +use crate::repository::{Repository, TimelineSyncStatusUpdate}; use crate::thread_mgr; use crate::thread_mgr::ThreadKind; use crate::timelines; use crate::timelines::CreateRepo; use crate::walredo::PostgresRedoManager; -use crate::CheckpointConfig; +use crate::{DatadirTimelineImpl, RepositoryImpl}; use anyhow::{Context, Result}; use lazy_static::lazy_static; use log::*; @@ -28,7 +28,9 @@ lazy_static! 
{ struct Tenant { state: TenantState, - repo: Arc, + repo: Arc, + + timelines: HashMap>, } #[derive(Debug, Serialize, Deserialize, Clone, Copy, PartialEq, Eq)] @@ -67,14 +69,14 @@ pub fn load_local_repo( conf: &'static PageServerConf, tenant_id: ZTenantId, remote_index: &RemoteIndex, -) -> Arc { +) -> Arc { let mut m = access_tenants(); let tenant = m.entry(tenant_id).or_insert_with(|| { // Set up a WAL redo manager, for applying WAL records. let walredo_mgr = PostgresRedoManager::new(conf, tenant_id); // Set up an object repository, for actual data storage. - let repo: Arc = Arc::new(LayeredRepository::new( + let repo: Arc = Arc::new(LayeredRepository::new( conf, Arc::new(walredo_mgr), tenant_id, @@ -84,6 +86,7 @@ pub fn load_local_repo( Tenant { state: TenantState::Idle, repo, + timelines: HashMap::new(), } }); Arc::clone(&tenant.repo) @@ -138,7 +141,7 @@ pub fn shutdown_all_tenants() { thread_mgr::shutdown_threads(Some(ThreadKind::WalReceiver), None, None); thread_mgr::shutdown_threads(Some(ThreadKind::GarbageCollector), None, None); - thread_mgr::shutdown_threads(Some(ThreadKind::Checkpointer), None, None); + thread_mgr::shutdown_threads(Some(ThreadKind::Compactor), None, None); // Ok, no background threads running anymore. Flush any remaining data in // memory to disk. 
@@ -152,7 +155,7 @@ pub fn shutdown_all_tenants() { debug!("shutdown tenant {}", tenantid); match get_repository_for_tenant(tenantid) { Ok(repo) => { - if let Err(err) = repo.checkpoint_iteration(CheckpointConfig::Flush) { + if let Err(err) = repo.checkpoint() { error!( "Could not checkpoint tenant {} during shutdown: {:?}", tenantid, err @@ -192,6 +195,7 @@ pub fn create_tenant_repository( v.insert(Tenant { state: TenantState::Idle, repo, + timelines: HashMap::new(), }); Ok(Some(tenantid)) } @@ -203,7 +207,7 @@ pub fn get_tenant_state(tenantid: ZTenantId) -> Option { } /// -/// Change the state of a tenant to Active and launch its checkpointer and GC +/// Change the state of a tenant to Active and launch its compactor and GC /// threads. If the tenant was already in Active state or Stopping, does nothing. /// pub fn activate_tenant(conf: &'static PageServerConf, tenant_id: ZTenantId) -> Result<()> { @@ -218,15 +222,15 @@ pub fn activate_tenant(conf: &'static PageServerConf, tenant_id: ZTenantId) -> R // If the tenant is already active, nothing to do. 
TenantState::Active => {} - // If it's Idle, launch the checkpointer and GC threads + // If it's Idle, launch the compactor and GC threads TenantState::Idle => { thread_mgr::spawn( - ThreadKind::Checkpointer, + ThreadKind::Compactor, Some(tenant_id), None, - "Checkpointer thread", + "Compactor thread", true, - move || crate::tenant_threads::checkpoint_loop(tenant_id, conf), + move || crate::tenant_threads::compact_loop(tenant_id, conf), )?; let gc_spawn_result = thread_mgr::spawn( @@ -244,7 +248,7 @@ pub fn activate_tenant(conf: &'static PageServerConf, tenant_id: ZTenantId) -> R "Failed to start GC thread for tenant {}, stopping its checkpointer thread: {:?}", tenant_id, e ); - thread_mgr::shutdown_threads(Some(ThreadKind::Checkpointer), Some(tenant_id), None); + thread_mgr::shutdown_threads(Some(ThreadKind::Compactor), Some(tenant_id), None); return gc_spawn_result; } @@ -258,7 +262,7 @@ pub fn activate_tenant(conf: &'static PageServerConf, tenant_id: ZTenantId) -> R Ok(()) } -pub fn get_repository_for_tenant(tenantid: ZTenantId) -> Result> { +pub fn get_repository_for_tenant(tenantid: ZTenantId) -> Result> { let m = access_tenants(); let tenant = m .get(&tenantid) @@ -271,10 +275,27 @@ pub fn get_repository_for_tenant(tenantid: ZTenantId) -> Result Result> { - get_repository_for_tenant(tenantid)? +) -> Result> { + let mut m = access_tenants(); + let tenant = m + .get_mut(&tenantid) + .with_context(|| format!("Tenant {} not found", tenantid))?; + + if let Some(page_tline) = tenant.timelines.get(&timelineid) { + return Ok(Arc::clone(page_tline)); + } + // First access to this timeline. 
Create a DatadirTimeline wrapper for it + let tline = tenant + .repo .get_timeline_load(timelineid) - .with_context(|| format!("Timeline {} not found for tenant {}", timelineid, tenantid)) + .with_context(|| format!("Timeline {} not found for tenant {}", timelineid, tenantid))?; + + let repartition_distance = tenant.repo.conf.checkpoint_distance / 10; + + let page_tline = Arc::new(DatadirTimelineImpl::new(tline, repartition_distance)); + page_tline.init_logical_size()?; + tenant.timelines.insert(timelineid, Arc::clone(&page_tline)); + Ok(page_tline) } #[serde_as] diff --git a/pageserver/src/tenant_threads.rs b/pageserver/src/tenant_threads.rs index c370eb61c8..0d9a94cc5b 100644 --- a/pageserver/src/tenant_threads.rs +++ b/pageserver/src/tenant_threads.rs @@ -1,34 +1,42 @@ //! This module contains functions to serve per-tenant background processes, -//! such as checkpointer and GC +//! such as compaction and GC use crate::config::PageServerConf; +use crate::repository::Repository; use crate::tenant_mgr; use crate::tenant_mgr::TenantState; -use crate::CheckpointConfig; use anyhow::Result; use std::time::Duration; use tracing::*; use zenith_utils::zid::ZTenantId; /// -/// Checkpointer thread's main loop +/// Compaction thread's main loop /// -pub fn checkpoint_loop(tenantid: ZTenantId, conf: &'static PageServerConf) -> Result<()> { +pub fn compact_loop(tenantid: ZTenantId, conf: &'static PageServerConf) -> Result<()> { + if let Err(err) = compact_loop_ext(tenantid, conf) { + error!("compact loop terminated with error: {:?}", err); + Err(err) + } else { + Ok(()) + } +} + +fn compact_loop_ext(tenantid: ZTenantId, conf: &'static PageServerConf) -> Result<()> { loop { if tenant_mgr::get_tenant_state(tenantid) != Some(TenantState::Active) { break; } - std::thread::sleep(conf.checkpoint_period); - trace!("checkpointer thread for tenant {} waking up", tenantid); + std::thread::sleep(conf.compaction_period); + trace!("compaction thread for tenant {} waking up", tenantid); - 
// checkpoint timelines that have accumulated more than CHECKPOINT_DISTANCE - // bytes of WAL since last checkpoint. + // Compact timelines let repo = tenant_mgr::get_repository_for_tenant(tenantid)?; - repo.checkpoint_iteration(CheckpointConfig::Distance(conf.checkpoint_distance))?; + repo.compaction_iteration()?; } trace!( - "checkpointer thread stopped for tenant {} state is {:?}", + "compaction thread stopped for tenant {} state is {:?}", tenantid, tenant_mgr::get_tenant_state(tenantid) ); diff --git a/pageserver/src/thread_mgr.rs b/pageserver/src/thread_mgr.rs index cafdc5e700..4484bb1db1 100644 --- a/pageserver/src/thread_mgr.rs +++ b/pageserver/src/thread_mgr.rs @@ -94,13 +94,16 @@ pub enum ThreadKind { // Thread that connects to a safekeeper to fetch WAL for one timeline. WalReceiver, - // Thread that handles checkpointing of all timelines for a tenant. - Checkpointer, + // Thread that handles compaction of all timelines for a tenant. + Compactor, // Thread that handles GC of a tenant GarbageCollector, - // Thread for synchronizing pageserver relish data with the remote storage. + // Thread that flushes frozen in-memory layers to disk + LayerFlushThread, + + // Thread for synchronizing pageserver layer files with the remote storage. // Shared by all tenants. 
StorageSync, } diff --git a/pageserver/src/timelines.rs b/pageserver/src/timelines.rs index 53c4124701..105c3c869f 100644 --- a/pageserver/src/timelines.rs +++ b/pageserver/src/timelines.rs @@ -23,6 +23,7 @@ use crate::{ layered_repository::metadata::TimelineMetadata, remote_storage::RemoteIndex, repository::{LocalTimelineState, Repository}, + DatadirTimeline, RepositoryImpl, }; use crate::{import_datadir, LOG_FILE_NAME}; use crate::{layered_repository::LayeredRepository, walredo::WalRedoManager}; @@ -48,26 +49,26 @@ pub struct LocalTimelineInfo { } impl LocalTimelineInfo { - pub fn from_loaded_timeline( - timeline: &dyn Timeline, + pub fn from_loaded_timeline( + datadir_tline: &DatadirTimeline, include_non_incremental_logical_size: bool, ) -> anyhow::Result { - let last_record_lsn = timeline.get_last_record_lsn(); + let last_record_lsn = datadir_tline.tline.get_last_record_lsn(); let info = LocalTimelineInfo { - ancestor_timeline_id: timeline.get_ancestor_timeline_id(), + ancestor_timeline_id: datadir_tline.tline.get_ancestor_timeline_id(), ancestor_lsn: { - match timeline.get_ancestor_lsn() { + match datadir_tline.tline.get_ancestor_lsn() { Lsn(0) => None, lsn @ Lsn(_) => Some(lsn), } }, - disk_consistent_lsn: timeline.get_disk_consistent_lsn(), + disk_consistent_lsn: datadir_tline.tline.get_disk_consistent_lsn(), last_record_lsn, - prev_record_lsn: Some(timeline.get_prev_record_lsn()), + prev_record_lsn: Some(datadir_tline.tline.get_prev_record_lsn()), timeline_state: LocalTimelineState::Loaded, - current_logical_size: Some(timeline.get_current_logical_size()), + current_logical_size: Some(datadir_tline.get_current_logical_size()), current_logical_size_non_incremental: if include_non_incremental_logical_size { - Some(timeline.get_current_logical_size_non_incremental(last_record_lsn)?) + Some(datadir_tline.get_current_logical_size_non_incremental(last_record_lsn)?) 
} else { None }, @@ -93,17 +94,19 @@ impl LocalTimelineInfo { } } - pub fn from_repo_timeline( - repo_timeline: RepositoryTimeline, + pub fn from_repo_timeline( + tenant_id: ZTenantId, + timeline_id: ZTimelineId, + repo_timeline: &RepositoryTimeline, include_non_incremental_logical_size: bool, ) -> anyhow::Result { match repo_timeline { - RepositoryTimeline::Loaded(timeline) => { - Self::from_loaded_timeline(timeline.as_ref(), include_non_incremental_logical_size) - } - RepositoryTimeline::Unloaded { metadata } => { - Ok(Self::from_unloaded_timeline(&metadata)) + RepositoryTimeline::Loaded(_) => { + let datadir_tline = + tenant_mgr::get_timeline_for_tenant_load(tenant_id, timeline_id)?; + Self::from_loaded_timeline(&datadir_tline, include_non_incremental_logical_size) } + RepositoryTimeline::Unloaded { metadata } => Ok(Self::from_unloaded_timeline(metadata)), } } } @@ -172,7 +175,7 @@ pub fn create_repo( conf: &'static PageServerConf, tenant_id: ZTenantId, create_repo: CreateRepo, -) -> Result> { +) -> Result> { let (wal_redo_manager, remote_index) = match create_repo { CreateRepo::Real { wal_redo_manager, @@ -260,12 +263,12 @@ fn run_initdb(conf: &'static PageServerConf, initdbpath: &Path) -> Result<()> { // - run initdb to init temporary instance and get bootstrap data // - after initialization complete, remove the temp dir. // -fn bootstrap_timeline( +fn bootstrap_timeline( conf: &'static PageServerConf, tenantid: ZTenantId, tli: ZTimelineId, - repo: &dyn Repository, -) -> Result> { + repo: &R, +) -> Result<()> { let _enter = info_span!("bootstrapping", timeline = %tli, tenant = %tenantid).entered(); let initdb_path = conf.tenant_path(&tenantid).join("tmp"); @@ -281,23 +284,20 @@ fn bootstrap_timeline( // Initdb lsn will be equal to last_record_lsn which will be set after import. // Because we know it upfront avoid having an option or dummy zero value by passing it to create_empty_timeline. 
let timeline = repo.create_empty_timeline(tli, lsn)?; - import_datadir::import_timeline_from_postgres_datadir( - &pgdata_path, - timeline.writer().as_ref(), - lsn, - )?; - timeline.checkpoint(CheckpointConfig::Forced)?; + let mut page_tline: DatadirTimeline = DatadirTimeline::new(timeline, u64::MAX); + import_datadir::import_timeline_from_postgres_datadir(&pgdata_path, &mut page_tline, lsn)?; + page_tline.tline.checkpoint(CheckpointConfig::Forced)?; println!( "created initial timeline {} timeline.lsn {}", tli, - timeline.get_last_record_lsn() + page_tline.tline.get_last_record_lsn() ); // Remove temp dir. We don't need it anymore fs::remove_dir_all(pgdata_path)?; - Ok(timeline) + Ok(()) } pub(crate) fn get_local_timelines( @@ -313,7 +313,9 @@ pub(crate) fn get_local_timelines( local_timeline_info.push(( timeline_id, LocalTimelineInfo::from_repo_timeline( - repository_timeline, + tenant_id, + timeline_id, + &repository_timeline, include_non_incremental_logical_size, )?, )) @@ -372,13 +374,17 @@ pub(crate) fn create_timeline( } repo.branch_timeline(ancestor_timeline_id, new_timeline_id, start_lsn)?; // load the timeline into memory - let loaded_timeline = repo.get_timeline_load(new_timeline_id)?; - LocalTimelineInfo::from_loaded_timeline(loaded_timeline.as_ref(), false) + let loaded_timeline = + tenant_mgr::get_timeline_for_tenant_load(tenant_id, new_timeline_id)?; + LocalTimelineInfo::from_loaded_timeline(&loaded_timeline, false) .context("cannot fill timeline info")? } None => { - let new_timeline = bootstrap_timeline(conf, tenant_id, new_timeline_id, repo.as_ref())?; - LocalTimelineInfo::from_loaded_timeline(new_timeline.as_ref(), false) + bootstrap_timeline(conf, tenant_id, new_timeline_id, repo.as_ref())?; + // load the timeline into memory + let new_timeline = + tenant_mgr::get_timeline_for_tenant_load(tenant_id, new_timeline_id)?; + LocalTimelineInfo::from_loaded_timeline(&new_timeline, false) .context("cannot fill timeline info")? 
} }; diff --git a/pageserver/src/walingest.rs b/pageserver/src/walingest.rs index 506890476f..c6c6e89854 100644 --- a/pageserver/src/walingest.rs +++ b/pageserver/src/walingest.rs @@ -23,14 +23,16 @@ use postgres_ffi::nonrelfile_utils::clogpage_precedes; use postgres_ffi::nonrelfile_utils::slru_may_delete_clogsegment; -use std::cmp::min; use anyhow::Result; use bytes::{Buf, Bytes, BytesMut}; use tracing::*; -use crate::relish::*; -use crate::repository::*; +use std::collections::HashMap; + +use crate::pgdatadir_mapping::*; +use crate::reltag::{RelTag, SlruKind}; +use crate::repository::Repository; use crate::walrecord::*; use postgres_ffi::nonrelfile_utils::mx_offset_to_member_segment; use postgres_ffi::xlog_utils::*; @@ -40,22 +42,28 @@ use zenith_utils::lsn::Lsn; static ZERO_PAGE: Bytes = Bytes::from_static(&[0u8; 8192]); -pub struct WalIngest { +pub struct WalIngest<'a, R: Repository> { + timeline: &'a DatadirTimeline, + checkpoint: CheckPoint, checkpoint_modified: bool, + + relsize_cache: HashMap, } -impl WalIngest { - pub fn new(timeline: &dyn Timeline, startpoint: Lsn) -> Result { +impl<'a, R: Repository> WalIngest<'a, R> { + pub fn new(timeline: &DatadirTimeline, startpoint: Lsn) -> Result> { // Fetch the latest checkpoint into memory, so that we can compare with it // quickly in `ingest_record` and update it when it changes. 
- let checkpoint_bytes = timeline.get_page_at_lsn(RelishTag::Checkpoint, 0, startpoint)?; + let checkpoint_bytes = timeline.get_checkpoint(startpoint)?; let checkpoint = CheckPoint::decode(&checkpoint_bytes)?; trace!("CheckPoint.nextXid = {}", checkpoint.nextXid.value); Ok(WalIngest { + timeline, checkpoint, checkpoint_modified: false, + relsize_cache: HashMap::new(), }) } @@ -68,10 +76,12 @@ impl WalIngest { /// pub fn ingest_record( &mut self, - timeline: &dyn TimelineWriter, + timeline: &DatadirTimeline, recdata: Bytes, lsn: Lsn, ) -> Result<()> { + let mut modification = timeline.begin_modification(lsn); + let mut decoded = decode_wal_record(recdata); let mut buf = decoded.record.clone(); buf.advance(decoded.main_data_offset); @@ -86,48 +96,34 @@ impl WalIngest { if decoded.xl_rmid == pg_constants::RM_HEAP_ID || decoded.xl_rmid == pg_constants::RM_HEAP2_ID { - self.ingest_heapam_record(&mut buf, timeline, lsn, &mut decoded)?; + self.ingest_heapam_record(&mut buf, &mut modification, &mut decoded)?; } // Handle other special record types if decoded.xl_rmid == pg_constants::RM_SMGR_ID + && (decoded.xl_info & pg_constants::XLR_RMGR_INFO_MASK) + == pg_constants::XLOG_SMGR_CREATE + { + let create = XlSmgrCreate::decode(&mut buf); + self.ingest_xlog_smgr_create(&mut modification, &create)?; + } else if decoded.xl_rmid == pg_constants::RM_SMGR_ID && (decoded.xl_info & pg_constants::XLR_RMGR_INFO_MASK) == pg_constants::XLOG_SMGR_TRUNCATE { let truncate = XlSmgrTruncate::decode(&mut buf); - self.ingest_xlog_smgr_truncate(timeline, lsn, &truncate)?; + self.ingest_xlog_smgr_truncate(&mut modification, &truncate)?; } else if decoded.xl_rmid == pg_constants::RM_DBASE_ID { if (decoded.xl_info & pg_constants::XLR_RMGR_INFO_MASK) == pg_constants::XLOG_DBASE_CREATE { let createdb = XlCreateDatabase::decode(&mut buf); - self.ingest_xlog_dbase_create(timeline, lsn, &createdb)?; + self.ingest_xlog_dbase_create(&mut modification, &createdb)?; } else if (decoded.xl_info & 
pg_constants::XLR_RMGR_INFO_MASK) == pg_constants::XLOG_DBASE_DROP { let dropdb = XlDropDatabase::decode(&mut buf); - - // To drop the database, we need to drop all the relations in it. Like in - // ingest_xlog_dbase_create(), use the previous record's LSN in the list_rels() call - let req_lsn = min(timeline.get_last_record_lsn(), lsn); - for tablespace_id in dropdb.tablespace_ids { - let rels = timeline.list_rels(tablespace_id, dropdb.db_id, req_lsn)?; - for rel in rels { - timeline.drop_relish(rel, lsn)?; - } - trace!( - "Drop FileNodeMap {}, {} at lsn {}", - tablespace_id, - dropdb.db_id, - lsn - ); - timeline.drop_relish( - RelishTag::FileNodeMap { - spcnode: tablespace_id, - dbnode: dropdb.db_id, - }, - lsn, - )?; + trace!("Drop db {}, {}", tablespace_id, dropdb.db_id); + modification.drop_dbdir(tablespace_id, dropdb.db_id)?; } } } else if decoded.xl_rmid == pg_constants::RM_TBLSPC_ID { @@ -138,19 +134,17 @@ impl WalIngest { let pageno = buf.get_u32_le(); let segno = pageno / pg_constants::SLRU_PAGES_PER_SEGMENT; let rpageno = pageno % pg_constants::SLRU_PAGES_PER_SEGMENT; - timeline.put_page_image( - RelishTag::Slru { - slru: SlruKind::Clog, - segno, - }, + self.put_slru_page_image( + &mut modification, + SlruKind::Clog, + segno, rpageno, - lsn, ZERO_PAGE.clone(), )?; } else { assert!(info == pg_constants::CLOG_TRUNCATE); let xlrec = XlClogTruncate::decode(&mut buf); - self.ingest_clog_truncate_record(timeline, lsn, &xlrec)?; + self.ingest_clog_truncate_record(&mut modification, &xlrec)?; } } else if decoded.xl_rmid == pg_constants::RM_XACT_ID { let info = decoded.xl_info & pg_constants::XLOG_XACT_OPMASK; @@ -158,8 +152,7 @@ impl WalIngest { let parsed_xact = XlXactParsedRecord::decode(&mut buf, decoded.xl_xid, decoded.xl_info); self.ingest_xact_record( - timeline, - lsn, + &mut modification, &parsed_xact, info == pg_constants::XLOG_XACT_COMMIT, )?; @@ -169,8 +162,7 @@ impl WalIngest { let parsed_xact = XlXactParsedRecord::decode(&mut buf, decoded.xl_xid, 
decoded.xl_info); self.ingest_xact_record( - timeline, - lsn, + &mut modification, &parsed_xact, info == pg_constants::XLOG_XACT_COMMIT_PREPARED, )?; @@ -179,23 +171,11 @@ impl WalIngest { "Drop twophaseFile for xid {} parsed_xact.xid {} here at {}", decoded.xl_xid, parsed_xact.xid, - lsn + lsn, ); - timeline.drop_relish( - RelishTag::TwoPhase { - xid: parsed_xact.xid, - }, - lsn, - )?; + modification.drop_twophase_file(parsed_xact.xid)?; } else if info == pg_constants::XLOG_XACT_PREPARE { - timeline.put_page_image( - RelishTag::TwoPhase { - xid: decoded.xl_xid, - }, - 0, - lsn, - Bytes::copy_from_slice(&buf[..]), - )?; + modification.put_twophase_file(decoded.xl_xid, Bytes::copy_from_slice(&buf[..]))?; } } else if decoded.xl_rmid == pg_constants::RM_MULTIXACT_ID { let info = decoded.xl_info & pg_constants::XLR_RMGR_INFO_MASK; @@ -204,38 +184,34 @@ impl WalIngest { let pageno = buf.get_u32_le(); let segno = pageno / pg_constants::SLRU_PAGES_PER_SEGMENT; let rpageno = pageno % pg_constants::SLRU_PAGES_PER_SEGMENT; - timeline.put_page_image( - RelishTag::Slru { - slru: SlruKind::MultiXactOffsets, - segno, - }, + self.put_slru_page_image( + &mut modification, + SlruKind::MultiXactOffsets, + segno, rpageno, - lsn, ZERO_PAGE.clone(), )?; } else if info == pg_constants::XLOG_MULTIXACT_ZERO_MEM_PAGE { let pageno = buf.get_u32_le(); let segno = pageno / pg_constants::SLRU_PAGES_PER_SEGMENT; let rpageno = pageno % pg_constants::SLRU_PAGES_PER_SEGMENT; - timeline.put_page_image( - RelishTag::Slru { - slru: SlruKind::MultiXactMembers, - segno, - }, + self.put_slru_page_image( + &mut modification, + SlruKind::MultiXactMembers, + segno, rpageno, - lsn, ZERO_PAGE.clone(), )?; } else if info == pg_constants::XLOG_MULTIXACT_CREATE_ID { let xlrec = XlMultiXactCreate::decode(&mut buf); - self.ingest_multixact_create_record(timeline, lsn, &xlrec)?; + self.ingest_multixact_create_record(&mut modification, &xlrec)?; } else if info == pg_constants::XLOG_MULTIXACT_TRUNCATE_ID { let xlrec 
= XlMultiXactTruncate::decode(&mut buf); - self.ingest_multixact_truncate_record(timeline, lsn, &xlrec)?; + self.ingest_multixact_truncate_record(&mut modification, &xlrec)?; } } else if decoded.xl_rmid == pg_constants::RM_RELMAP_ID { let xlrec = XlRelmapUpdate::decode(&mut buf); - self.ingest_relmap_page(timeline, lsn, &xlrec, &decoded)?; + self.ingest_relmap_page(&mut modification, &xlrec, &decoded)?; } else if decoded.xl_rmid == pg_constants::RM_XLOG_ID { let info = decoded.xl_info & pg_constants::XLR_RMGR_INFO_MASK; if info == pg_constants::XLOG_NEXTOID { @@ -270,37 +246,37 @@ impl WalIngest { // Iterate through all the blocks that the record modifies, and // "put" a separate copy of the record for each block. for blk in decoded.blocks.iter() { - self.ingest_decoded_block(timeline, lsn, &decoded, blk)?; + self.ingest_decoded_block(&mut modification, lsn, &decoded, blk)?; } // If checkpoint data was updated, store the new version in the repository if self.checkpoint_modified { let new_checkpoint_bytes = self.checkpoint.encode(); - timeline.put_page_image(RelishTag::Checkpoint, 0, lsn, new_checkpoint_bytes)?; + modification.put_checkpoint(new_checkpoint_bytes)?; self.checkpoint_modified = false; } // Now that this record has been fully handled, including updating the // checkpoint data, let the repository know that it is up-to-date to this LSN - timeline.advance_last_record_lsn(lsn); + modification.commit()?; Ok(()) } fn ingest_decoded_block( &mut self, - timeline: &dyn TimelineWriter, + modification: &mut DatadirModification, lsn: Lsn, decoded: &DecodedWALRecord, blk: &DecodedBkpBlock, ) -> Result<()> { - let tag = RelishTag::Relation(RelTag { + let rel = RelTag { spcnode: blk.rnode_spcnode, dbnode: blk.rnode_dbnode, relnode: blk.rnode_relnode, forknum: blk.forknum as u8, - }); + }; // // Instead of storing full-page-image WAL record, @@ -330,13 +306,13 @@ impl WalIngest { image[0..4].copy_from_slice(&((lsn.0 >> 32) as u32).to_le_bytes()); 
image[4..8].copy_from_slice(&(lsn.0 as u32).to_le_bytes()); assert_eq!(image.len(), pg_constants::BLCKSZ as usize); - timeline.put_page_image(tag, blk.blkno, lsn, image.freeze())?; + self.put_rel_page_image(modification, rel, blk.blkno, image.freeze())?; } else { let rec = ZenithWalRecord::Postgres { will_init: blk.will_init || blk.apply_image, rec: decoded.record.clone(), }; - timeline.put_wal_record(lsn, tag, blk.blkno, rec)?; + self.put_rel_wal_record(modification, rel, blk.blkno, rec)?; } Ok(()) } @@ -344,8 +320,7 @@ impl WalIngest { fn ingest_heapam_record( &mut self, buf: &mut Bytes, - timeline: &dyn TimelineWriter, - lsn: Lsn, + modification: &mut DatadirModification, decoded: &mut DecodedWALRecord, ) -> Result<()> { // Handle VM bit updates that are implicitly part of heap records. @@ -409,54 +384,76 @@ impl WalIngest { // Clear the VM bits if required. if new_heap_blkno.is_some() || old_heap_blkno.is_some() { - let vm_relish = RelishTag::Relation(RelTag { + let vm_rel = RelTag { forknum: pg_constants::VISIBILITYMAP_FORKNUM, spcnode: decoded.blocks[0].rnode_spcnode, dbnode: decoded.blocks[0].rnode_dbnode, relnode: decoded.blocks[0].rnode_relnode, - }); + }; - let new_vm_blk = new_heap_blkno.map(pg_constants::HEAPBLK_TO_MAPBLOCK); - let old_vm_blk = old_heap_blkno.map(pg_constants::HEAPBLK_TO_MAPBLOCK); - if new_vm_blk == old_vm_blk { - // An UPDATE record that needs to clear the bits for both old and the - // new page, both of which reside on the same VM page. - timeline.put_wal_record( - lsn, - vm_relish, - new_vm_blk.unwrap(), - ZenithWalRecord::ClearVisibilityMapFlags { - new_heap_blkno, - old_heap_blkno, - flags: pg_constants::VISIBILITYMAP_VALID_BITS, - }, - )?; - } else { - // Clear VM bits for one heap page, or for two pages that reside on - // different VM pages. 
- if let Some(new_vm_blk) = new_vm_blk { - timeline.put_wal_record( - lsn, - vm_relish, - new_vm_blk, + let mut new_vm_blk = new_heap_blkno.map(pg_constants::HEAPBLK_TO_MAPBLOCK); + let mut old_vm_blk = old_heap_blkno.map(pg_constants::HEAPBLK_TO_MAPBLOCK); + + // Sometimes, Postgres seems to create heap WAL records with the + // ALL_VISIBLE_CLEARED flag set, even though the bit in the VM page is + // not set. In fact, it's possible that the VM page does not exist at all. + // In that case, we don't want to store a record to clear the VM bit; + // replaying it would fail to find the previous image of the page, because + // it doesn't exist. So check if the VM page(s) exist, and skip the WAL + // record if it doesn't. + let vm_size = self.get_relsize(vm_rel)?; + if let Some(blknum) = new_vm_blk { + if blknum >= vm_size { + new_vm_blk = None; + } + } + if let Some(blknum) = old_vm_blk { + if blknum >= vm_size { + old_vm_blk = None; + } + } + + if new_vm_blk.is_some() || old_vm_blk.is_some() { + if new_vm_blk == old_vm_blk { + // An UPDATE record that needs to clear the bits for both old and the + // new page, both of which reside on the same VM page. + self.put_rel_wal_record( + modification, + vm_rel, + new_vm_blk.unwrap(), ZenithWalRecord::ClearVisibilityMapFlags { new_heap_blkno, - old_heap_blkno: None, - flags: pg_constants::VISIBILITYMAP_VALID_BITS, - }, - )?; - } - if let Some(old_vm_blk) = old_vm_blk { - timeline.put_wal_record( - lsn, - vm_relish, - old_vm_blk, - ZenithWalRecord::ClearVisibilityMapFlags { - new_heap_blkno: None, old_heap_blkno, flags: pg_constants::VISIBILITYMAP_VALID_BITS, }, )?; + } else { + // Clear VM bits for one heap page, or for two pages that reside on + // different VM pages. 
+ if let Some(new_vm_blk) = new_vm_blk { + self.put_rel_wal_record( + modification, + vm_rel, + new_vm_blk, + ZenithWalRecord::ClearVisibilityMapFlags { + new_heap_blkno, + old_heap_blkno: None, + flags: pg_constants::VISIBILITYMAP_VALID_BITS, + }, + )?; + } + if let Some(old_vm_blk) = old_vm_blk { + self.put_rel_wal_record( + modification, + vm_rel, + old_vm_blk, + ZenithWalRecord::ClearVisibilityMapFlags { + new_heap_blkno: None, + old_heap_blkno, + flags: pg_constants::VISIBILITYMAP_VALID_BITS, + }, + )?; + } } } } @@ -467,8 +464,7 @@ impl WalIngest { /// Subroutine of ingest_record(), to handle an XLOG_DBASE_CREATE record. fn ingest_xlog_dbase_create( &mut self, - timeline: &dyn TimelineWriter, - lsn: Lsn, + modification: &mut DatadirModification, rec: &XlCreateDatabase, ) -> Result<()> { let db_id = rec.db_id; @@ -481,76 +477,79 @@ impl WalIngest { // cannot pass 'lsn' to the Timeline.get_* functions, or they will block waiting for // the last valid LSN to advance up to it. So we use the previous record's LSN in the // get calls instead. 
- let req_lsn = min(timeline.get_last_record_lsn(), lsn); + let req_lsn = modification.tline.get_last_record_lsn(); - let rels = timeline.list_rels(src_tablespace_id, src_db_id, req_lsn)?; + let rels = modification + .tline + .list_rels(src_tablespace_id, src_db_id, req_lsn)?; - trace!("ingest_xlog_dbase_create: {} rels", rels.len()); + debug!("ingest_xlog_dbase_create: {} rels", rels.len()); + + // Copy relfilemap + let filemap = modification + .tline + .get_relmap_file(src_tablespace_id, src_db_id, req_lsn)?; + modification.put_relmap_file(tablespace_id, db_id, filemap)?; let mut num_rels_copied = 0; let mut num_blocks_copied = 0; - for rel in rels { - if let RelishTag::Relation(src_rel) = rel { - assert_eq!(src_rel.spcnode, src_tablespace_id); - assert_eq!(src_rel.dbnode, src_db_id); + for src_rel in rels { + assert_eq!(src_rel.spcnode, src_tablespace_id); + assert_eq!(src_rel.dbnode, src_db_id); - let nblocks = timeline.get_relish_size(rel, req_lsn)?.unwrap_or(0); - let dst_rel = RelTag { - spcnode: tablespace_id, - dbnode: db_id, - relnode: src_rel.relnode, - forknum: src_rel.forknum, - }; + let nblocks = modification.tline.get_rel_size(src_rel, req_lsn)?; + let dst_rel = RelTag { + spcnode: tablespace_id, + dbnode: db_id, + relnode: src_rel.relnode, + forknum: src_rel.forknum, + }; - // Copy content - for blknum in 0..nblocks { - let content = timeline.get_page_at_lsn(rel, blknum, req_lsn)?; + modification.put_rel_creation(dst_rel, nblocks)?; - debug!("copying block {} from {} to {}", blknum, src_rel, dst_rel); + // Copy content + debug!("copying rel {} to {}, {} blocks", src_rel, dst_rel, nblocks); + for blknum in 0..nblocks { + debug!("copying block {} from {} to {}", blknum, src_rel, dst_rel); - timeline.put_page_image(RelishTag::Relation(dst_rel), blknum, lsn, content)?; - num_blocks_copied += 1; - } - - if nblocks == 0 { - // make sure we have some trace of the relation, even if it's empty - timeline.put_truncation(RelishTag::Relation(dst_rel), lsn, 0)?; 
- } - - num_rels_copied += 1; + let content = modification + .tline + .get_rel_page_at_lsn(src_rel, blknum, req_lsn)?; + modification.put_rel_page_image(dst_rel, blknum, content)?; + num_blocks_copied += 1; } + + num_rels_copied += 1; } - // Copy relfilemap - // TODO This implementation is very inefficient - - // it scans all non-rels only to find FileNodeMaps - for tag in timeline.list_nonrels(req_lsn)? { - if let RelishTag::FileNodeMap { spcnode, dbnode } = tag { - if spcnode == src_tablespace_id && dbnode == src_db_id { - let img = timeline.get_page_at_lsn(tag, 0, req_lsn)?; - let new_tag = RelishTag::FileNodeMap { - spcnode: tablespace_id, - dbnode: db_id, - }; - timeline.put_page_image(new_tag, 0, lsn, img)?; - break; - } - } - } info!( - "Created database {}/{}, copied {} blocks in {} rels at {}", - tablespace_id, db_id, num_blocks_copied, num_rels_copied, lsn + "Created database {}/{}, copied {} blocks in {} rels", + tablespace_id, db_id, num_blocks_copied, num_rels_copied ); Ok(()) } + fn ingest_xlog_smgr_create( + &mut self, + modification: &mut DatadirModification, + rec: &XlSmgrCreate, + ) -> Result<()> { + let rel = RelTag { + spcnode: rec.rnode.spcnode, + dbnode: rec.rnode.dbnode, + relnode: rec.rnode.relnode, + forknum: rec.forknum, + }; + self.put_rel_creation(modification, rel)?; + Ok(()) + } + /// Subroutine of ingest_record(), to handle an XLOG_SMGR_TRUNCATE record. /// /// This is the same logic as in PostgreSQL's smgr_redo() function. 
fn ingest_xlog_smgr_truncate( &mut self, - timeline: &dyn TimelineWriter, - lsn: Lsn, + modification: &mut DatadirModification, rec: &XlSmgrTruncate, ) -> Result<()> { let spcnode = rec.rnode.spcnode; @@ -564,7 +563,7 @@ impl WalIngest { relnode, forknum: pg_constants::MAIN_FORKNUM, }; - timeline.put_truncation(RelishTag::Relation(rel), lsn, rec.blkno)?; + self.put_rel_truncation(modification, rel, rec.blkno)?; } if (rec.flags & pg_constants::SMGR_TRUNCATE_FSM) != 0 { let rel = RelTag { @@ -587,7 +586,7 @@ impl WalIngest { info!("Partial truncation of FSM is not supported"); } let num_fsm_blocks = 0; - timeline.put_truncation(RelishTag::Relation(rel), lsn, num_fsm_blocks)?; + self.put_rel_truncation(modification, rel, num_fsm_blocks)?; } if (rec.flags & pg_constants::SMGR_TRUNCATE_VM) != 0 { let rel = RelTag { @@ -606,7 +605,7 @@ impl WalIngest { info!("Partial truncation of VM is not supported"); } let num_vm_blocks = 0; - timeline.put_truncation(RelishTag::Relation(rel), lsn, num_vm_blocks)?; + self.put_rel_truncation(modification, rel, num_vm_blocks)?; } Ok(()) } @@ -615,8 +614,7 @@ impl WalIngest { /// fn ingest_xact_record( &mut self, - timeline: &dyn TimelineWriter, - lsn: Lsn, + modification: &mut DatadirModification, parsed: &XlXactParsedRecord, is_commit: bool, ) -> Result<()> { @@ -632,12 +630,9 @@ impl WalIngest { // This subxact goes to different page. Write the record // for all the XIDs on the previous page, and continue // accumulating XIDs on this new page. 
- timeline.put_wal_record( - lsn, - RelishTag::Slru { - slru: SlruKind::Clog, - segno, - }, + modification.put_slru_wal_record( + SlruKind::Clog, + segno, rpageno, if is_commit { ZenithWalRecord::ClogSetCommitted { xids: page_xids } @@ -652,12 +647,9 @@ impl WalIngest { rpageno = pageno % pg_constants::SLRU_PAGES_PER_SEGMENT; page_xids.push(*subxact); } - timeline.put_wal_record( - lsn, - RelishTag::Slru { - slru: SlruKind::Clog, - segno, - }, + modification.put_slru_wal_record( + SlruKind::Clog, + segno, rpageno, if is_commit { ZenithWalRecord::ClogSetCommitted { xids: page_xids } @@ -674,7 +666,10 @@ impl WalIngest { dbnode: xnode.dbnode, relnode: xnode.relnode, }; - timeline.drop_relish(RelishTag::Relation(rel), lsn)?; + let last_lsn = self.timeline.get_last_record_lsn(); + if modification.tline.get_rel_exists(rel, last_lsn)? { + self.put_rel_drop(modification, rel)?; + } } } Ok(()) @@ -682,13 +677,12 @@ impl WalIngest { fn ingest_clog_truncate_record( &mut self, - timeline: &dyn TimelineWriter, - lsn: Lsn, + modification: &mut DatadirModification, xlrec: &XlClogTruncate, ) -> Result<()> { info!( - "RM_CLOG_ID truncate pageno {} oldestXid {} oldestXidDB {} lsn {}", - xlrec.pageno, xlrec.oldest_xid, xlrec.oldest_xid_db, lsn + "RM_CLOG_ID truncate pageno {} oldestXid {} oldestXidDB {}", + xlrec.pageno, xlrec.oldest_xid, xlrec.oldest_xid_db ); // Here we treat oldestXid and oldestXidDB @@ -719,23 +713,20 @@ impl WalIngest { } // Iterate via SLRU CLOG segments and drop segments that we're ready to truncate - // TODO This implementation is very inefficient - - // it scans all non-rels only to find Clog // // We cannot pass 'lsn' to the Timeline.list_nonrels(), or it // will block waiting for the last valid LSN to advance up to // it. So we use the previous record's LSN in the get calls // instead. - let req_lsn = min(timeline.get_last_record_lsn(), lsn); - for obj in timeline.list_nonrels(req_lsn)? 
{ - if let RelishTag::Slru { slru, segno } = obj { - if slru == SlruKind::Clog { - let segpage = segno * pg_constants::SLRU_PAGES_PER_SEGMENT; - if slru_may_delete_clogsegment(segpage, xlrec.pageno) { - timeline.drop_relish(RelishTag::Slru { slru, segno }, lsn)?; - trace!("Drop CLOG segment {:>04X} at lsn {}", segno, lsn); - } - } + let req_lsn = modification.tline.get_last_record_lsn(); + for segno in modification + .tline + .list_slru_segments(SlruKind::Clog, req_lsn)? + { + let segpage = segno * pg_constants::SLRU_PAGES_PER_SEGMENT; + if slru_may_delete_clogsegment(segpage, xlrec.pageno) { + modification.drop_slru_segment(SlruKind::Clog, segno)?; + trace!("Drop CLOG segment {:>04X}", segno); } } @@ -744,8 +735,7 @@ impl WalIngest { fn ingest_multixact_create_record( &mut self, - timeline: &dyn TimelineWriter, - lsn: Lsn, + modification: &mut DatadirModification, xlrec: &XlMultiXactCreate, ) -> Result<()> { // Create WAL record for updating the multixact-offsets page @@ -753,12 +743,9 @@ impl WalIngest { let segno = pageno / pg_constants::SLRU_PAGES_PER_SEGMENT; let rpageno = pageno % pg_constants::SLRU_PAGES_PER_SEGMENT; - timeline.put_wal_record( - lsn, - RelishTag::Slru { - slru: SlruKind::MultiXactOffsets, - segno, - }, + modification.put_slru_wal_record( + SlruKind::MultiXactOffsets, + segno, rpageno, ZenithWalRecord::MultixactOffsetCreate { mid: xlrec.mid, @@ -790,12 +777,9 @@ impl WalIngest { } let n_this_page = this_page_members.len(); - timeline.put_wal_record( - lsn, - RelishTag::Slru { - slru: SlruKind::MultiXactMembers, - segno: pageno / pg_constants::SLRU_PAGES_PER_SEGMENT, - }, + modification.put_slru_wal_record( + SlruKind::MultiXactMembers, + pageno / pg_constants::SLRU_PAGES_PER_SEGMENT, pageno % pg_constants::SLRU_PAGES_PER_SEGMENT, ZenithWalRecord::MultixactMembersCreate { moff: offset, @@ -830,8 +814,7 @@ impl WalIngest { fn ingest_multixact_truncate_record( &mut self, - timeline: &dyn TimelineWriter, - lsn: Lsn, + modification: &mut 
DatadirModification, xlrec: &XlMultiXactTruncate, ) -> Result<()> { self.checkpoint.oldestMulti = xlrec.end_trunc_off; @@ -847,13 +830,7 @@ impl WalIngest { // Delete all the segments except the last one. The last segment can still // contain, possibly partially, valid data. while segment != endsegment { - timeline.drop_relish( - RelishTag::Slru { - slru: SlruKind::MultiXactMembers, - segno: segment as u32, - }, - lsn, - )?; + modification.drop_slru_segment(SlruKind::MultiXactMembers, segment as u32)?; /* move to next segment, handling wraparound correctly */ if segment == maxsegment { @@ -871,22 +848,538 @@ impl WalIngest { fn ingest_relmap_page( &mut self, - timeline: &dyn TimelineWriter, - lsn: Lsn, + modification: &mut DatadirModification, xlrec: &XlRelmapUpdate, decoded: &DecodedWALRecord, ) -> Result<()> { - let tag = RelishTag::FileNodeMap { - spcnode: xlrec.tsid, - dbnode: xlrec.dbid, - }; - let mut buf = decoded.record.clone(); buf.advance(decoded.main_data_offset); // skip xl_relmap_update buf.advance(12); - timeline.put_page_image(tag, 0, lsn, Bytes::copy_from_slice(&buf[..]))?; + modification.put_relmap_file(xlrec.tsid, xlrec.dbid, Bytes::copy_from_slice(&buf[..]))?; + + Ok(()) + } + + fn put_rel_creation( + &mut self, + modification: &mut DatadirModification, + rel: RelTag, + ) -> Result<()> { + self.relsize_cache.insert(rel, 0); + modification.put_rel_creation(rel, 0)?; + Ok(()) + } + + fn put_rel_page_image( + &mut self, + modification: &mut DatadirModification, + rel: RelTag, + blknum: BlockNumber, + img: Bytes, + ) -> Result<()> { + self.handle_rel_extend(modification, rel, blknum)?; + modification.put_rel_page_image(rel, blknum, img)?; + Ok(()) + } + + fn put_rel_wal_record( + &mut self, + modification: &mut DatadirModification, + rel: RelTag, + blknum: BlockNumber, + rec: ZenithWalRecord, + ) -> Result<()> { + self.handle_rel_extend(modification, rel, blknum)?; + modification.put_rel_wal_record(rel, blknum, rec)?; + Ok(()) + } + + fn 
put_rel_truncation( + &mut self, + modification: &mut DatadirModification, + rel: RelTag, + nblocks: BlockNumber, + ) -> Result<()> { + modification.put_rel_truncation(rel, nblocks)?; + self.relsize_cache.insert(rel, nblocks); + Ok(()) + } + + fn put_rel_drop( + &mut self, + modification: &mut DatadirModification, + rel: RelTag, + ) -> Result<()> { + modification.put_rel_drop(rel)?; + self.relsize_cache.remove(&rel); + Ok(()) + } + + fn get_relsize(&mut self, rel: RelTag) -> Result { + if let Some(nblocks) = self.relsize_cache.get(&rel) { + Ok(*nblocks) + } else { + let last_lsn = self.timeline.get_last_record_lsn(); + let nblocks = if !self.timeline.get_rel_exists(rel, last_lsn)? { + 0 + } else { + self.timeline.get_rel_size(rel, last_lsn)? + }; + self.relsize_cache.insert(rel, nblocks); + Ok(nblocks) + } + } + + fn handle_rel_extend( + &mut self, + modification: &mut DatadirModification, + rel: RelTag, + blknum: BlockNumber, + ) -> Result<()> { + let new_nblocks = blknum + 1; + let old_nblocks = if let Some(nblocks) = self.relsize_cache.get(&rel) { + *nblocks + } else { + // Check if the relation exists. We implicitly create relations on first + // record. + // TODO: would be nice to be more explicit about it + let last_lsn = self.timeline.get_last_record_lsn(); + let nblocks = if !self.timeline.get_rel_exists(rel, last_lsn)? { + // create it with 0 size initially, the logic below will extend it + modification.put_rel_creation(rel, 0)?; + 0 + } else { + self.timeline.get_rel_size(rel, last_lsn)?
+ }; + self.relsize_cache.insert(rel, nblocks); + nblocks + }; + + if new_nblocks > old_nblocks { + //info!("extending {} {} to {}", rel, old_nblocks, new_nblocks); + modification.put_rel_extend(rel, new_nblocks)?; + + // fill the gap with zeros + for gap_blknum in old_nblocks..blknum { + modification.put_rel_page_image(rel, gap_blknum, ZERO_PAGE.clone())?; + } + self.relsize_cache.insert(rel, new_nblocks); + } + Ok(()) + } + + fn put_slru_page_image( + &mut self, + modification: &mut DatadirModification, + kind: SlruKind, + segno: u32, + blknum: BlockNumber, + img: Bytes, + ) -> Result<()> { + self.handle_slru_extend(modification, kind, segno, blknum)?; + modification.put_slru_page_image(kind, segno, blknum, img)?; + Ok(()) + } + + fn handle_slru_extend( + &mut self, + modification: &mut DatadirModification, + kind: SlruKind, + segno: u32, + blknum: BlockNumber, + ) -> Result<()> { + // We don't use a cache for this like we do for relations. SLRUs are explicitly + // extended with ZEROPAGE records, not with commit records, so it happens + // a lot less frequently. + + let new_nblocks = blknum + 1; + // Check if the SLRU segment exists. We implicitly create segments on first + // record. + // TODO: would be nice to be more explicit about it + let last_lsn = self.timeline.get_last_record_lsn(); + let old_nblocks = if !self + .timeline + .get_slru_segment_exists(kind, segno, last_lsn)? + { + // create it with 0 size initially, the logic below will extend it + modification.put_slru_segment_creation(kind, segno, 0)?; + 0 + } else { + self.timeline.get_slru_segment_size(kind, segno, last_lsn)?
+ }; + + if new_nblocks > old_nblocks { + trace!( + "extending SLRU {:?} seg {} from {} to {} blocks", + kind, + segno, + old_nblocks, + new_nblocks + ); + modification.put_slru_extend(kind, segno, new_nblocks)?; + + // fill the gap with zeros + for gap_blknum in old_nblocks..blknum { + modification.put_slru_page_image(kind, segno, gap_blknum, ZERO_PAGE.clone())?; + } + } + Ok(()) + } +} + +/// +/// Tests that should work the same with any Repository/Timeline implementation. +/// +#[allow(clippy::bool_assert_comparison)] +#[cfg(test)] +mod tests { + use super::*; + use crate::pgdatadir_mapping::create_test_timeline; + use crate::repository::repo_harness::*; + use postgres_ffi::pg_constants; + + /// Arbitrary relation tag, for testing. + const TESTREL_A: RelTag = RelTag { + spcnode: 0, + dbnode: 111, + relnode: 1000, + forknum: 0, + }; + + fn assert_current_logical_size(_timeline: &DatadirTimeline, _lsn: Lsn) { + // TODO + } + + static ZERO_CHECKPOINT: Bytes = Bytes::from_static(&[0u8; SIZEOF_CHECKPOINT]); + + fn init_walingest_test(tline: &DatadirTimeline) -> Result> { + let mut m = tline.begin_modification(Lsn(0x10)); + m.put_checkpoint(ZERO_CHECKPOINT.clone())?; + m.put_relmap_file(0, 111, Bytes::from(""))?; // dummy relmapper file + m.commit()?; + let walingest = WalIngest::new(tline, Lsn(0x10))?; + + Ok(walingest) + } + + #[test] + fn test_relsize() -> Result<()> { + let repo = RepoHarness::create("test_relsize")?.load(); + let tline = create_test_timeline(repo, TIMELINE_ID)?; + let mut walingest = init_walingest_test(&tline)?; + + let mut m = tline.begin_modification(Lsn(0x20)); + walingest.put_rel_creation(&mut m, TESTREL_A)?; + walingest.put_rel_page_image(&mut m, TESTREL_A, 0, TEST_IMG("foo blk 0 at 2"))?; + m.commit()?; + let mut m = tline.begin_modification(Lsn(0x30)); + walingest.put_rel_page_image(&mut m, TESTREL_A, 0, TEST_IMG("foo blk 0 at 3"))?; + m.commit()?; + let mut m = tline.begin_modification(Lsn(0x40)); + walingest.put_rel_page_image(&mut m, 
TESTREL_A, 1, TEST_IMG("foo blk 1 at 4"))?; + m.commit()?; + let mut m = tline.begin_modification(Lsn(0x50)); + walingest.put_rel_page_image(&mut m, TESTREL_A, 2, TEST_IMG("foo blk 2 at 5"))?; + m.commit()?; + + assert_current_logical_size(&tline, Lsn(0x50)); + + // The relation was created at LSN 2, not visible at LSN 1 yet. + assert_eq!(tline.get_rel_exists(TESTREL_A, Lsn(0x10))?, false); + assert!(tline.get_rel_size(TESTREL_A, Lsn(0x10)).is_err()); + + assert_eq!(tline.get_rel_exists(TESTREL_A, Lsn(0x20))?, true); + assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x20))?, 1); + assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x50))?, 3); + + // Check page contents at each LSN + assert_eq!( + tline.get_rel_page_at_lsn(TESTREL_A, 0, Lsn(0x20))?, + TEST_IMG("foo blk 0 at 2") + ); + + assert_eq!( + tline.get_rel_page_at_lsn(TESTREL_A, 0, Lsn(0x30))?, + TEST_IMG("foo blk 0 at 3") + ); + + assert_eq!( + tline.get_rel_page_at_lsn(TESTREL_A, 0, Lsn(0x40))?, + TEST_IMG("foo blk 0 at 3") + ); + assert_eq!( + tline.get_rel_page_at_lsn(TESTREL_A, 1, Lsn(0x40))?, + TEST_IMG("foo blk 1 at 4") + ); + + assert_eq!( + tline.get_rel_page_at_lsn(TESTREL_A, 0, Lsn(0x50))?, + TEST_IMG("foo blk 0 at 3") + ); + assert_eq!( + tline.get_rel_page_at_lsn(TESTREL_A, 1, Lsn(0x50))?, + TEST_IMG("foo blk 1 at 4") + ); + assert_eq!( + tline.get_rel_page_at_lsn(TESTREL_A, 2, Lsn(0x50))?, + TEST_IMG("foo blk 2 at 5") + ); + + // Truncate last block + let mut m = tline.begin_modification(Lsn(0x60)); + walingest.put_rel_truncation(&mut m, TESTREL_A, 2)?; + m.commit()?; + assert_current_logical_size(&tline, Lsn(0x60)); + + // Check reported size and contents after truncation + assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x60))?, 2); + assert_eq!( + tline.get_rel_page_at_lsn(TESTREL_A, 0, Lsn(0x60))?, + TEST_IMG("foo blk 0 at 3") + ); + assert_eq!( + tline.get_rel_page_at_lsn(TESTREL_A, 1, Lsn(0x60))?, + TEST_IMG("foo blk 1 at 4") + ); + + // should still see the truncated block with older LSN + 
assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x50))?, 3); + assert_eq!( + tline.get_rel_page_at_lsn(TESTREL_A, 2, Lsn(0x50))?, + TEST_IMG("foo blk 2 at 5") + ); + + // Truncate to zero length + let mut m = tline.begin_modification(Lsn(0x68)); + walingest.put_rel_truncation(&mut m, TESTREL_A, 0)?; + m.commit()?; + assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x68))?, 0); + + // Extend from 0 to 2 blocks, leaving a gap + let mut m = tline.begin_modification(Lsn(0x70)); + walingest.put_rel_page_image(&mut m, TESTREL_A, 1, TEST_IMG("foo blk 1"))?; + m.commit()?; + assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x70))?, 2); + assert_eq!( + tline.get_rel_page_at_lsn(TESTREL_A, 0, Lsn(0x70))?, + ZERO_PAGE + ); + assert_eq!( + tline.get_rel_page_at_lsn(TESTREL_A, 1, Lsn(0x70))?, + TEST_IMG("foo blk 1") + ); + + // Extend a lot more, leaving a big gap that spans across segments + let mut m = tline.begin_modification(Lsn(0x80)); + walingest.put_rel_page_image(&mut m, TESTREL_A, 1500, TEST_IMG("foo blk 1500"))?; + m.commit()?; + assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x80))?, 1501); + for blk in 2..1500 { + assert_eq!( + tline.get_rel_page_at_lsn(TESTREL_A, blk, Lsn(0x80))?, + ZERO_PAGE + ); + } + assert_eq!( + tline.get_rel_page_at_lsn(TESTREL_A, 1500, Lsn(0x80))?, + TEST_IMG("foo blk 1500") + ); + + Ok(()) + } + + // Test what happens if we dropped a relation + // and then created it again within the same layer. 
+ #[test] + fn test_drop_extend() -> Result<()> { + let repo = RepoHarness::create("test_drop_extend")?.load(); + let tline = create_test_timeline(repo, TIMELINE_ID)?; + let mut walingest = init_walingest_test(&tline)?; + + let mut m = tline.begin_modification(Lsn(0x20)); + walingest.put_rel_page_image(&mut m, TESTREL_A, 0, TEST_IMG("foo blk 0 at 2"))?; + m.commit()?; + + // Check that rel exists and size is correct + assert_eq!(tline.get_rel_exists(TESTREL_A, Lsn(0x20))?, true); + assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x20))?, 1); + + // Drop rel + let mut m = tline.begin_modification(Lsn(0x30)); + walingest.put_rel_drop(&mut m, TESTREL_A)?; + m.commit()?; + + // Check that rel is not visible anymore + assert_eq!(tline.get_rel_exists(TESTREL_A, Lsn(0x30))?, false); + + // FIXME: should fail + //assert!(tline.get_rel_size(TESTREL_A, Lsn(0x30))?.is_none()); + + // Re-create it + let mut m = tline.begin_modification(Lsn(0x40)); + walingest.put_rel_page_image(&mut m, TESTREL_A, 0, TEST_IMG("foo blk 0 at 4"))?; + m.commit()?; + + // Check that rel exists and size is correct + assert_eq!(tline.get_rel_exists(TESTREL_A, Lsn(0x40))?, true); + assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x40))?, 1); + + Ok(()) + } + + // Test what happens if we truncated a relation + // so that one of its segments was dropped + // and then extended it again within the same layer. 
+ #[test] + fn test_truncate_extend() -> Result<()> { + let repo = RepoHarness::create("test_truncate_extend")?.load(); + let tline = create_test_timeline(repo, TIMELINE_ID)?; + let mut walingest = init_walingest_test(&tline)?; + + // Create a 20 MB relation (the size is arbitrary) + let relsize = 20 * 1024 * 1024 / 8192; + let mut m = tline.begin_modification(Lsn(0x20)); + for blkno in 0..relsize { + let data = format!("foo blk {} at {}", blkno, Lsn(0x20)); + walingest.put_rel_page_image(&mut m, TESTREL_A, blkno, TEST_IMG(&data))?; + } + m.commit()?; + + // The relation was created at LSN 20, not visible at LSN 1 yet. + assert_eq!(tline.get_rel_exists(TESTREL_A, Lsn(0x10))?, false); + assert!(tline.get_rel_size(TESTREL_A, Lsn(0x10)).is_err()); + + assert_eq!(tline.get_rel_exists(TESTREL_A, Lsn(0x20))?, true); + assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x20))?, relsize); + + // Check relation content + for blkno in 0..relsize { + let lsn = Lsn(0x20); + let data = format!("foo blk {} at {}", blkno, lsn); + assert_eq!( + tline.get_rel_page_at_lsn(TESTREL_A, blkno, lsn)?, + TEST_IMG(&data) + ); + } + + // Truncate relation so that second segment was dropped + // - only leave one page + let mut m = tline.begin_modification(Lsn(0x60)); + walingest.put_rel_truncation(&mut m, TESTREL_A, 1)?; + m.commit()?; + + // Check reported size and contents after truncation + assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x60))?, 1); + + for blkno in 0..1 { + let lsn = Lsn(0x20); + let data = format!("foo blk {} at {}", blkno, lsn); + assert_eq!( + tline.get_rel_page_at_lsn(TESTREL_A, blkno, Lsn(0x60))?, + TEST_IMG(&data) + ); + } + + // should still see all blocks with older LSN + assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x50))?, relsize); + for blkno in 0..relsize { + let lsn = Lsn(0x20); + let data = format!("foo blk {} at {}", blkno, lsn); + assert_eq!( + tline.get_rel_page_at_lsn(TESTREL_A, blkno, Lsn(0x50))?, + TEST_IMG(&data) + ); + } + + // Extend relation again. 
+ // Add enough blocks to create second segment + let lsn = Lsn(0x80); + let mut m = tline.begin_modification(lsn); + for blkno in 0..relsize { + let data = format!("foo blk {} at {}", blkno, lsn); + walingest.put_rel_page_image(&mut m, TESTREL_A, blkno, TEST_IMG(&data))?; + } + m.commit()?; + + assert_eq!(tline.get_rel_exists(TESTREL_A, Lsn(0x80))?, true); + assert_eq!(tline.get_rel_size(TESTREL_A, Lsn(0x80))?, relsize); + // Check relation content + for blkno in 0..relsize { + let lsn = Lsn(0x80); + let data = format!("foo blk {} at {}", blkno, lsn); + assert_eq!( + tline.get_rel_page_at_lsn(TESTREL_A, blkno, Lsn(0x80))?, + TEST_IMG(&data) + ); + } + + Ok(()) + } + + /// Test get_relsize() and truncation with a file larger than 1 GB, so that it's + /// split into multiple 1 GB segments in Postgres. + #[test] + fn test_large_rel() -> Result<()> { + let repo = RepoHarness::create("test_large_rel")?.load(); + let tline = create_test_timeline(repo, TIMELINE_ID)?; + let mut walingest = init_walingest_test(&tline)?; + + let mut lsn = 0x10; + for blknum in 0..pg_constants::RELSEG_SIZE + 1 { + lsn += 0x10; + let mut m = tline.begin_modification(Lsn(lsn)); + let img = TEST_IMG(&format!("foo blk {} at {}", blknum, Lsn(lsn))); + walingest.put_rel_page_image(&mut m, TESTREL_A, blknum as BlockNumber, img)?; + m.commit()?; + } + + assert_current_logical_size(&tline, Lsn(lsn)); + + assert_eq!( + tline.get_rel_size(TESTREL_A, Lsn(lsn))?, + pg_constants::RELSEG_SIZE + 1 + ); + + // Truncate one block + lsn += 0x10; + let mut m = tline.begin_modification(Lsn(lsn)); + walingest.put_rel_truncation(&mut m, TESTREL_A, pg_constants::RELSEG_SIZE)?; + m.commit()?; + assert_eq!( + tline.get_rel_size(TESTREL_A, Lsn(lsn))?, + pg_constants::RELSEG_SIZE + ); + assert_current_logical_size(&tline, Lsn(lsn)); + + // Truncate another block + lsn += 0x10; + let mut m = tline.begin_modification(Lsn(lsn)); + walingest.put_rel_truncation(&mut m, TESTREL_A, pg_constants::RELSEG_SIZE - 1)?; + 
m.commit()?; + assert_eq!( + tline.get_rel_size(TESTREL_A, Lsn(lsn))?, + pg_constants::RELSEG_SIZE - 1 + ); + assert_current_logical_size(&tline, Lsn(lsn)); + + // Truncate to 3000, and then truncate all the way down to 0, one block at a time + // This tests the behavior at segment boundaries + let mut size: i32 = 3000; + while size >= 0 { + lsn += 0x10; + let mut m = tline.begin_modification(Lsn(lsn)); + walingest.put_rel_truncation(&mut m, TESTREL_A, size as BlockNumber)?; + m.commit()?; + assert_eq!( + tline.get_rel_size(TESTREL_A, Lsn(lsn))?, + size as BlockNumber + ); + + size -= 1; + } + assert_current_logical_size(&tline, Lsn(lsn)); Ok(()) } diff --git a/pageserver/src/walreceiver.rs b/pageserver/src/walreceiver.rs index 2c10ad315b..e382475627 100644 --- a/pageserver/src/walreceiver.rs +++ b/pageserver/src/walreceiver.rs @@ -6,6 +6,7 @@ //! We keep one WAL receiver active per timeline. use crate::config::PageServerConf; +use crate::repository::{Repository, Timeline}; use crate::tenant_mgr; use crate::thread_mgr; use crate::thread_mgr::ThreadKind; @@ -182,13 +183,13 @@ fn walreceiver_main( let repo = tenant_mgr::get_repository_for_tenant(tenant_id) .with_context(|| format!("no repository found for tenant {}", tenant_id))?; - let timeline = repo.get_timeline_load(timeline_id).with_context(|| { - format!( - "local timeline {} not found for tenant {}", - timeline_id, tenant_id - ) - })?; - + let timeline = + tenant_mgr::get_timeline_for_tenant_load(tenant_id, timeline_id).with_context(|| { + format!( + "local timeline {} not found for tenant {}", + timeline_id, tenant_id + ) + })?; let remote_index = repo.get_remote_index(); // @@ -251,11 +252,10 @@ fn walreceiver_main( // It is important to deal with the aligned records as lsn in getPage@LSN is // aligned and can be several bytes bigger. Without this alignment we are - // at risk of hittind a deadlock. + // at risk of hitting a deadlock.
anyhow::ensure!(lsn.is_aligned()); - let writer = timeline.writer(); - walingest.ingest_record(writer.as_ref(), recdata, lsn)?; + walingest.ingest_record(&timeline, recdata, lsn)?; fail_point!("walreceiver-after-ingest"); @@ -267,6 +267,8 @@ fn walreceiver_main( caught_up = true; } + timeline.tline.check_checkpoint_distance()?; + Some(endlsn) } @@ -310,7 +312,7 @@ fn walreceiver_main( // The last LSN we processed. It is not guaranteed to survive pageserver crash. let write_lsn = u64::from(last_lsn); // `disk_consistent_lsn` is the LSN at which page server guarantees local persistence of all received data - let flush_lsn = u64::from(timeline.get_disk_consistent_lsn()); + let flush_lsn = u64::from(timeline.tline.get_disk_consistent_lsn()); // The last LSN that is synced to remote storage and is guaranteed to survive pageserver crash // Used by safekeepers to remove WAL preceding `remote_consistent_lsn`. let apply_lsn = u64::from(timeline_remote_consistent_lsn); diff --git a/pageserver/src/walrecord.rs b/pageserver/src/walrecord.rs index ca9107cdbf..5947a0c147 100644 --- a/pageserver/src/walrecord.rs +++ b/pageserver/src/walrecord.rs @@ -10,7 +10,47 @@ use postgres_ffi::{MultiXactId, MultiXactOffset, MultiXactStatus, Oid, Transacti use serde::{Deserialize, Serialize}; use tracing::*; -use crate::repository::ZenithWalRecord; +/// Each update to a page is represented by a ZenithWalRecord. It can be a wrapper +/// around a PostgreSQL WAL record, or a custom zenith-specific "record". +#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)] +pub enum ZenithWalRecord { + /// Native PostgreSQL WAL record + Postgres { will_init: bool, rec: Bytes }, + + /// Clear bits in heap visibility map. 
('flags' is bitmap of bits to clear) + ClearVisibilityMapFlags { + new_heap_blkno: Option, + old_heap_blkno: Option, + flags: u8, + }, + /// Mark transaction IDs as committed on a CLOG page + ClogSetCommitted { xids: Vec }, + /// Mark transaction IDs as aborted on a CLOG page + ClogSetAborted { xids: Vec }, + /// Extend multixact offsets SLRU + MultixactOffsetCreate { + mid: MultiXactId, + moff: MultiXactOffset, + }, + /// Extend multixact members SLRU. + MultixactMembersCreate { + moff: MultiXactOffset, + members: Vec, + }, +} + +impl ZenithWalRecord { + /// Does replaying this WAL record initialize the page from scratch, or does + /// it need to be applied over the previous image of the page? + pub fn will_init(&self) -> bool { + match self { + ZenithWalRecord::Postgres { will_init, rec: _ } => *will_init, + + // None of the special zenith record types currently initialize the page + _ => false, + } + } +} /// DecodedBkpBlock represents per-page data contained in a WAL record. #[derive(Default)] @@ -87,6 +127,28 @@ impl XlRelmapUpdate { } } +#[repr(C)] +#[derive(Debug)] +pub struct XlSmgrCreate { + pub rnode: RelFileNode, + // FIXME: This is ForkNumber in storage_xlog.h. That's an enum. Does it have + // well-defined size? 
+ pub forknum: u8, +} + +impl XlSmgrCreate { + pub fn decode(buf: &mut Bytes) -> XlSmgrCreate { + XlSmgrCreate { + rnode: RelFileNode { + spcnode: buf.get_u32_le(), /* tablespace */ + dbnode: buf.get_u32_le(), /* database */ + relnode: buf.get_u32_le(), /* relation */ + }, + forknum: buf.get_u32_le() as u8, + } + } +} + #[repr(C)] #[derive(Debug)] pub struct XlSmgrTruncate { diff --git a/pageserver/src/walredo.rs b/pageserver/src/walredo.rs index 704b8f2583..ae22f1eead 100644 --- a/pageserver/src/walredo.rs +++ b/pageserver/src/walredo.rs @@ -42,8 +42,10 @@ use zenith_utils::nonblock::set_nonblock; use zenith_utils::zid::ZTenantId; use crate::config::PageServerConf; -use crate::relish::*; -use crate::repository::ZenithWalRecord; +use crate::pgdatadir_mapping::{key_to_rel_block, key_to_slru_block}; +use crate::reltag::{RelTag, SlruKind}; +use crate::repository::Key; +use crate::walrecord::ZenithWalRecord; use postgres_ffi::nonrelfile_utils::mx_offset_to_flags_bitshift; use postgres_ffi::nonrelfile_utils::mx_offset_to_flags_offset; use postgres_ffi::nonrelfile_utils::mx_offset_to_member_offset; @@ -75,8 +77,7 @@ pub trait WalRedoManager: Send + Sync { /// the reords. 
fn request_redo( &self, - rel: RelishTag, - blknum: u32, + key: Key, lsn: Lsn, base_img: Option, records: Vec<(Lsn, ZenithWalRecord)>, @@ -92,8 +93,7 @@ pub struct DummyRedoManager {} impl crate::walredo::WalRedoManager for DummyRedoManager { fn request_redo( &self, - _rel: RelishTag, - _blknum: u32, + _key: Key, _lsn: Lsn, _base_img: Option, _records: Vec<(Lsn, ZenithWalRecord)>, @@ -152,28 +152,6 @@ fn can_apply_in_zenith(rec: &ZenithWalRecord) -> bool { } } -fn check_forknum(rel: &RelishTag, expected_forknum: u8) -> bool { - if let RelishTag::Relation(RelTag { - forknum, - spcnode: _, - dbnode: _, - relnode: _, - }) = rel - { - *forknum == expected_forknum - } else { - false - } -} - -fn check_slru_segno(rel: &RelishTag, expected_slru: SlruKind, expected_segno: u32) -> bool { - if let RelishTag::Slru { slru, segno } = rel { - *slru == expected_slru && *segno == expected_segno - } else { - false - } -} - /// An error happened in WAL redo #[derive(Debug, thiserror::Error)] pub enum WalRedoError { @@ -184,6 +162,8 @@ pub enum WalRedoError { InvalidState, #[error("cannot perform WAL redo for this request")] InvalidRequest, + #[error("cannot perform WAL redo for this record")] + InvalidRecord, } /// @@ -198,8 +178,7 @@ impl WalRedoManager for PostgresRedoManager { /// fn request_redo( &self, - rel: RelishTag, - blknum: u32, + key: Key, lsn: Lsn, base_img: Option, records: Vec<(Lsn, ZenithWalRecord)>, @@ -217,11 +196,10 @@ impl WalRedoManager for PostgresRedoManager { if rec_zenith != batch_zenith { let result = if batch_zenith { - self.apply_batch_zenith(rel, blknum, lsn, img, &records[batch_start..i]) + self.apply_batch_zenith(key, lsn, img, &records[batch_start..i]) } else { self.apply_batch_postgres( - rel, - blknum, + key, lsn, img, &records[batch_start..i], @@ -236,11 +214,10 @@ impl WalRedoManager for PostgresRedoManager { } // last batch if batch_zenith { - self.apply_batch_zenith(rel, blknum, lsn, img, &records[batch_start..]) + self.apply_batch_zenith(key, 
lsn, img, &records[batch_start..]) } else { self.apply_batch_postgres( - rel, - blknum, + key, lsn, img, &records[batch_start..], @@ -268,16 +245,15 @@ impl PostgresRedoManager { /// fn apply_batch_postgres( &self, - rel: RelishTag, - blknum: u32, + key: Key, lsn: Lsn, base_img: Option, records: &[(Lsn, ZenithWalRecord)], wal_redo_timeout: Duration, ) -> Result { - let start_time = Instant::now(); + let (rel, blknum) = key_to_rel_block(key).or(Err(WalRedoError::InvalidRecord))?; - let apply_result: Result; + let start_time = Instant::now(); let mut process_guard = self.process.lock().unwrap(); let lock_time = Instant::now(); @@ -291,16 +267,11 @@ impl PostgresRedoManager { WAL_REDO_WAIT_TIME.observe(lock_time.duration_since(start_time).as_secs_f64()); - let result = if let RelishTag::Relation(rel) = rel { - // Relational WAL records are applied using wal-redo-postgres - let buf_tag = BufferTag { rel, blknum }; - apply_result = process.apply_wal_records(buf_tag, base_img, records, wal_redo_timeout); - - apply_result.map_err(WalRedoError::IoError) - } else { - error!("unexpected non-relation relish: {:?}", rel); - Err(WalRedoError::InvalidRequest) - }; + // Relational WAL records are applied using wal-redo-postgres + let buf_tag = BufferTag { rel, blknum }; + let result = process + .apply_wal_records(buf_tag, base_img, records, wal_redo_timeout) + .map_err(WalRedoError::IoError); let end_time = Instant::now(); let duration = end_time.duration_since(lock_time); @@ -326,8 +297,7 @@ impl PostgresRedoManager { /// fn apply_batch_zenith( &self, - rel: RelishTag, - blknum: u32, + key: Key, lsn: Lsn, base_img: Option, records: &[(Lsn, ZenithWalRecord)], @@ -346,7 +316,7 @@ impl PostgresRedoManager { // Apply all the WAL records in the batch for (record_lsn, record) in records.iter() { - self.apply_record_zenith(rel, blknum, &mut page, *record_lsn, record)?; + self.apply_record_zenith(key, &mut page, *record_lsn, record)?; } // Success! 
let end_time = Instant::now(); @@ -365,8 +335,7 @@ impl PostgresRedoManager { fn apply_record_zenith( &self, - rel: RelishTag, - blknum: u32, + key: Key, page: &mut BytesMut, _record_lsn: Lsn, record: &ZenithWalRecord, @@ -384,10 +353,11 @@ impl PostgresRedoManager { old_heap_blkno, flags, } => { - // sanity check that this is modifying the correct relish + // sanity check that this is modifying the correct relation + let (rel, blknum) = key_to_rel_block(key).or(Err(WalRedoError::InvalidRecord))?; assert!( - check_forknum(&rel, pg_constants::VISIBILITYMAP_FORKNUM), - "ClearVisibilityMapFlags record on unexpected rel {:?}", + rel.forknum == pg_constants::VISIBILITYMAP_FORKNUM, + "ClearVisibilityMapFlags record on unexpected rel {}", rel ); if let Some(heap_blkno) = *new_heap_blkno { @@ -421,6 +391,14 @@ impl PostgresRedoManager { // Non-relational WAL records are handled here, with custom code that has the // same effects as the corresponding Postgres WAL redo function. ZenithWalRecord::ClogSetCommitted { xids } => { + let (slru_kind, segno, blknum) = + key_to_slru_block(key).or(Err(WalRedoError::InvalidRecord))?; + assert_eq!( + slru_kind, + SlruKind::Clog, + "ClogSetCommitted record with unexpected key {}", + key + ); for &xid in xids { let pageno = xid as u32 / pg_constants::CLOG_XACTS_PER_PAGE; let expected_segno = pageno / pg_constants::SLRU_PAGES_PER_SEGMENT; @@ -428,12 +406,17 @@ impl PostgresRedoManager { // Check that we're modifying the correct CLOG block. 
assert!( - check_slru_segno(&rel, SlruKind::Clog, expected_segno), - "ClogSetCommitted record for XID {} with unexpected rel {:?}", + segno == expected_segno, + "ClogSetCommitted record for XID {} with unexpected key {}", xid, - rel + key + ); + assert!( + blknum == expected_blknum, + "ClogSetCommitted record for XID {} with unexpected key {}", + xid, + key ); - assert!(blknum == expected_blknum); transaction_id_set_status( xid, @@ -443,6 +426,14 @@ impl PostgresRedoManager { } } ZenithWalRecord::ClogSetAborted { xids } => { + let (slru_kind, segno, blknum) = + key_to_slru_block(key).or(Err(WalRedoError::InvalidRecord))?; + assert_eq!( + slru_kind, + SlruKind::Clog, + "ClogSetAborted record with unexpected key {}", + key + ); for &xid in xids { let pageno = xid as u32 / pg_constants::CLOG_XACTS_PER_PAGE; let expected_segno = pageno / pg_constants::SLRU_PAGES_PER_SEGMENT; @@ -450,17 +441,30 @@ impl PostgresRedoManager { // Check that we're modifying the correct CLOG block. assert!( - check_slru_segno(&rel, SlruKind::Clog, expected_segno), - "ClogSetCommitted record for XID {} with unexpected rel {:?}", + segno == expected_segno, + "ClogSetAborted record for XID {} with unexpected key {}", xid, - rel + key + ); + assert!( + blknum == expected_blknum, + "ClogSetAborted record for XID {} with unexpected key {}", + xid, + key ); - assert!(blknum == expected_blknum); transaction_id_set_status(xid, pg_constants::TRANSACTION_STATUS_ABORTED, page); } } ZenithWalRecord::MultixactOffsetCreate { mid, moff } => { + let (slru_kind, segno, blknum) = + key_to_slru_block(key).or(Err(WalRedoError::InvalidRecord))?; + assert_eq!( + slru_kind, + SlruKind::MultiXactOffsets, + "MultixactOffsetCreate record with unexpected key {}", + key + ); // Compute the block and offset to modify. // See RecordNewMultiXact in PostgreSQL sources. 
let pageno = mid / pg_constants::MULTIXACT_OFFSETS_PER_PAGE as u32; @@ -471,16 +475,29 @@ impl PostgresRedoManager { let expected_segno = pageno / pg_constants::SLRU_PAGES_PER_SEGMENT; let expected_blknum = pageno % pg_constants::SLRU_PAGES_PER_SEGMENT; assert!( - check_slru_segno(&rel, SlruKind::MultiXactOffsets, expected_segno), - "MultiXactOffsetsCreate record for multi-xid {} with unexpected rel {:?}", + segno == expected_segno, + "MultiXactOffsetsCreate record for multi-xid {} with unexpected key {}", mid, - rel + key + ); + assert!( + blknum == expected_blknum, + "MultiXactOffsetsCreate record for multi-xid {} with unexpected key {}", + mid, + key ); - assert!(blknum == expected_blknum); LittleEndian::write_u32(&mut page[offset..offset + 4], *moff); } ZenithWalRecord::MultixactMembersCreate { moff, members } => { + let (slru_kind, segno, blknum) = + key_to_slru_block(key).or(Err(WalRedoError::InvalidRecord))?; + assert_eq!( + slru_kind, + SlruKind::MultiXactMembers, + "MultixactMembersCreate record with unexpected key {}", + key + ); for (i, member) in members.iter().enumerate() { let offset = moff + i as u32; @@ -495,12 +512,17 @@ impl PostgresRedoManager { let expected_segno = pageno / pg_constants::SLRU_PAGES_PER_SEGMENT; let expected_blknum = pageno % pg_constants::SLRU_PAGES_PER_SEGMENT; assert!( - check_slru_segno(&rel, SlruKind::MultiXactMembers, expected_segno), - "MultiXactMembersCreate record at offset {} with unexpected rel {:?}", + segno == expected_segno, + "MultiXactMembersCreate record for offset {} with unexpected key {}", moff, - rel + key + ); + assert!( + blknum == expected_blknum, + "MultiXactMembersCreate record for offset {} with unexpected key {}", + moff, + key ); - assert!(blknum == expected_blknum); let mut flagsval = LittleEndian::read_u32(&page[flagsoff..flagsoff + 4]); flagsval &= !(((1 << pg_constants::MXACT_MEMBER_BITS_PER_XACT) - 1) << bshift); diff --git a/postgres_ffi/src/pg_constants.rs b/postgres_ffi/src/pg_constants.rs 
index 76f837cefc..7230b841f5 100644 --- a/postgres_ffi/src/pg_constants.rs +++ b/postgres_ffi/src/pg_constants.rs @@ -24,6 +24,9 @@ pub const VISIBILITYMAP_FORKNUM: u8 = 2; pub const INIT_FORKNUM: u8 = 3; // From storage_xlog.h +pub const XLOG_SMGR_CREATE: u8 = 0x10; +pub const XLOG_SMGR_TRUNCATE: u8 = 0x20; + pub const SMGR_TRUNCATE_HEAP: u32 = 0x0001; pub const SMGR_TRUNCATE_VM: u32 = 0x0002; pub const SMGR_TRUNCATE_FSM: u32 = 0x0004; @@ -113,7 +116,6 @@ pub const XACT_XINFO_HAS_TWOPHASE: u32 = 1u32 << 4; // From pg_control.h and rmgrlist.h pub const XLOG_NEXTOID: u8 = 0x30; pub const XLOG_SWITCH: u8 = 0x40; -pub const XLOG_SMGR_TRUNCATE: u8 = 0x20; pub const XLOG_FPI_FOR_HINT: u8 = 0xA0; pub const XLOG_FPI: u8 = 0xB0; pub const DB_SHUTDOWNED: u32 = 1; diff --git a/test_runner/batch_others/test_snapfiles_gc.py b/test_runner/batch_others/test_snapfiles_gc.py deleted file mode 100644 index d00af53864..0000000000 --- a/test_runner/batch_others/test_snapfiles_gc.py +++ /dev/null @@ -1,130 +0,0 @@ -from contextlib import closing -import psycopg2.extras -from fixtures.utils import print_gc_result -from fixtures.zenith_fixtures import ZenithEnv -from fixtures.log_helper import log - - -# -# Test Garbage Collection of old layer files -# -# This test is pretty tightly coupled with the current implementation of layered -# storage, in layered_repository.rs. -# -def test_layerfiles_gc(zenith_simple_env: ZenithEnv): - env = zenith_simple_env - env.zenith_cli.create_branch("test_layerfiles_gc", "empty") - pg = env.postgres.create_start('test_layerfiles_gc') - - with closing(pg.connect()) as conn: - with conn.cursor() as cur: - with closing(env.pageserver.connect()) as psconn: - with psconn.cursor(cursor_factory=psycopg2.extras.DictCursor) as pscur: - - # Get the timeline ID of our branch. 
We need it for the 'do_gc' command - cur.execute("SHOW zenith.zenith_timeline") - timeline = cur.fetchone()[0] - - # Create a test table - cur.execute("CREATE TABLE foo(x integer)") - cur.execute("INSERT INTO foo VALUES (1)") - - cur.execute("select relfilenode from pg_class where oid = 'foo'::regclass") - row = cur.fetchone() - log.info(f"relfilenode is {row[0]}") - - # Run GC, to clear out any garbage left behind in the catalogs by - # the CREATE TABLE command. We want to have a clean slate with no garbage - # before running the actual tests below, otherwise the counts won't match - # what we expect. - # - # Also run vacuum first to make it less likely that autovacuum or pruning - # kicks in and confuses our numbers. - cur.execute("VACUUM") - - # delete the row, to update the Visibility Map. We don't want the VM - # update to confuse our numbers either. - cur.execute("DELETE FROM foo") - - log.info("Running GC before test") - pscur.execute(f"do_gc {env.initial_tenant.hex} {timeline} 0") - row = pscur.fetchone() - print_gc_result(row) - # remember the number of files - layer_relfiles_remain = (row['layer_relfiles_total'] - - row['layer_relfiles_removed']) - assert layer_relfiles_remain > 0 - - # Insert a row and run GC. Checkpoint should freeze the layer - # so that there is only the most recent image layer left for the rel, - # removing the old image and delta layer. - log.info("Inserting one row and running GC") - cur.execute("INSERT INTO foo VALUES (1)") - pscur.execute(f"do_gc {env.initial_tenant.hex} {timeline} 0") - row = pscur.fetchone() - print_gc_result(row) - assert row['layer_relfiles_total'] == layer_relfiles_remain + 2 - assert row['layer_relfiles_removed'] == 2 - assert row['layer_relfiles_dropped'] == 0 - - # Insert two more rows and run GC. - # This should create new image and delta layer file with the new contents, and - # then remove the old one image and the just-created delta layer. 
- log.info("Inserting two more rows and running GC") - cur.execute("INSERT INTO foo VALUES (2)") - cur.execute("INSERT INTO foo VALUES (3)") - - pscur.execute(f"do_gc {env.initial_tenant.hex} {timeline} 0") - row = pscur.fetchone() - print_gc_result(row) - assert row['layer_relfiles_total'] == layer_relfiles_remain + 2 - assert row['layer_relfiles_removed'] == 2 - assert row['layer_relfiles_dropped'] == 0 - - # Do it again. Should again create two new layer files and remove old ones. - log.info("Inserting two more rows and running GC") - cur.execute("INSERT INTO foo VALUES (2)") - cur.execute("INSERT INTO foo VALUES (3)") - - pscur.execute(f"do_gc {env.initial_tenant.hex} {timeline} 0") - row = pscur.fetchone() - print_gc_result(row) - assert row['layer_relfiles_total'] == layer_relfiles_remain + 2 - assert row['layer_relfiles_removed'] == 2 - assert row['layer_relfiles_dropped'] == 0 - - # Run GC again, with no changes in the database. Should not remove anything. - log.info("Run GC again, with nothing to do") - pscur.execute(f"do_gc {env.initial_tenant.hex} {timeline} 0") - row = pscur.fetchone() - print_gc_result(row) - assert row['layer_relfiles_total'] == layer_relfiles_remain - assert row['layer_relfiles_removed'] == 0 - assert row['layer_relfiles_dropped'] == 0 - - # - # Test DROP TABLE checks that relation data and metadata was deleted by GC from object storage - # - log.info("Drop table and run GC again") - cur.execute("DROP TABLE foo") - - pscur.execute(f"do_gc {env.initial_tenant.hex} {timeline} 0") - row = pscur.fetchone() - print_gc_result(row) - - # We still cannot remove the latest layers - # because they serve as tombstones for earlier layers. - assert row['layer_relfiles_dropped'] == 0 - # Each relation fork is counted separately, hence 3. 
- assert row['layer_relfiles_needed_as_tombstone'] == 3 - - # The catalog updates also create new layer files of the catalogs, which - # are counted as 'removed' - assert row['layer_relfiles_removed'] > 0 - - # TODO Change the test to check actual CG of dropped layers. - # Each relation fork is counted separately, hence 3. - #assert row['layer_relfiles_dropped'] == 3 - - # TODO: perhaps we should count catalog and user relations separately, - # to make this kind of testing more robust diff --git a/test_runner/fixtures/utils.py b/test_runner/fixtures/utils.py index 236c225bfb..58f7294eb5 100644 --- a/test_runner/fixtures/utils.py +++ b/test_runner/fixtures/utils.py @@ -74,8 +74,5 @@ def lsn_from_hex(lsn_hex: str) -> int: def print_gc_result(row): log.info("GC duration {elapsed} ms".format_map(row)) log.info( - " REL total: {layer_relfiles_total}, needed_by_cutoff {layer_relfiles_needed_by_cutoff}, needed_by_branches: {layer_relfiles_needed_by_branches}, not_updated: {layer_relfiles_not_updated}, needed_as_tombstone {layer_relfiles_needed_as_tombstone}, removed: {layer_relfiles_removed}, dropped: {layer_relfiles_dropped}" - .format_map(row)) - log.info( - " NONREL total: {layer_nonrelfiles_total}, needed_by_cutoff {layer_nonrelfiles_needed_by_cutoff}, needed_by_branches: {layer_nonrelfiles_needed_by_branches}, not_updated: {layer_nonrelfiles_not_updated}, needed_as_tombstone {layer_nonrelfiles_needed_as_tombstone}, removed: {layer_nonrelfiles_removed}, dropped: {layer_nonrelfiles_dropped}" + " total: {layers_total}, needed_by_cutoff {layers_needed_by_cutoff}, needed_by_branches: {layers_needed_by_branches}, not_updated: {layers_not_updated}, removed: {layers_removed}" .format_map(row)) diff --git a/vendor/postgres b/vendor/postgres index 093aa160e5..756a01aade 160000 --- a/vendor/postgres +++ b/vendor/postgres @@ -1 +1 @@ -Subproject commit 093aa160e5df19814ff19b995d36dd5ee03c7f8b +Subproject commit 756a01aade765d1d2ac115e7e189865ff697222b From 
75002adc14b93a0c80b124f3677c04ae072dd739 Mon Sep 17 00:00:00 2001 From: Arseny Sher Date: Mon, 28 Mar 2022 18:27:28 +0400 Subject: [PATCH 027/296] Make shared_buffers large in test_pageserver_catchup. We intentionally write while pageserver is down, so we shouldn't query it. Noticed by @petuhovskiy at https://github.com/zenithdb/postgres/pull/141#issuecomment-1080261700 --- test_runner/batch_others/test_pageserver_catchup.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/test_runner/batch_others/test_pageserver_catchup.py b/test_runner/batch_others/test_pageserver_catchup.py index 3c4b7f9569..758b018046 100644 --- a/test_runner/batch_others/test_pageserver_catchup.py +++ b/test_runner/batch_others/test_pageserver_catchup.py @@ -10,7 +10,9 @@ def test_pageserver_catchup_while_compute_down(zenith_env_builder: ZenithEnvBuil env = zenith_env_builder.init_start() env.zenith_cli.create_branch('test_pageserver_catchup_while_compute_down') - pg = env.postgres.create_start('test_pageserver_catchup_while_compute_down') + # Make shared_buffers large to ensure we won't query pageserver while it is down. + pg = env.postgres.create_start('test_pageserver_catchup_while_compute_down', + config_lines=['shared_buffers=512MB']) pg_conn = pg.connect() cur = pg_conn.cursor() From 780b46ad270c66960f3f4de8468891b4b030507e Mon Sep 17 00:00:00 2001 From: Arseny Sher Date: Mon, 28 Mar 2022 18:11:48 +0400 Subject: [PATCH 028/296] Bump vendor/postgres to fix commit_lsn going backwards. 
--- vendor/postgres | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vendor/postgres b/vendor/postgres index 756a01aade..19164aeacf 160000 --- a/vendor/postgres +++ b/vendor/postgres @@ -1 +1 @@ -Subproject commit 756a01aade765d1d2ac115e7e189865ff697222b +Subproject commit 19164aeacfd877ef75d67e70a71647f5d4c0cd2f From a8832024953d3bb6da5da76f8dd2007433119b87 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 28 Mar 2022 18:56:36 +0300 Subject: [PATCH 029/296] Enable S3 for pageserver on staging Follow-up for #1417. Previously we had a problem uploading to S3 due to a huge amount of existing not yet uploaded data. Now we have a fresh pageserver with LSM storage on staging, so we can try enabling it once again. --- .circleci/ansible/deploy.yaml | 27 +++++++++++++-------------- 1 file changed, 13 insertions(+), 14 deletions(-) diff --git a/.circleci/ansible/deploy.yaml b/.circleci/ansible/deploy.yaml index 020a852a00..09aca8539e 100644 --- a/.circleci/ansible/deploy.yaml +++ b/.circleci/ansible/deploy.yaml @@ -63,20 +63,19 @@ tags: - pageserver - # Temporary disabled until LSM storage rewrite lands - # - name: update config - # when: current_version > remote_version or force_deploy - # lineinfile: - # path: /storage/pageserver/data/pageserver.toml - # line: "{{ item }}" - # loop: - # - "[remote_storage]" - # - "bucket_name = '{{ bucket_name }}'" - # - "bucket_region = '{{ bucket_region }}'" - # - "prefix_in_bucket = '{{ inventory_hostname }}'" - # become: true - # tags: - # - pageserver + - name: update remote storage (s3) config + when: current_version > remote_version or force_deploy + lineinfile: + path: /storage/pageserver/data/pageserver.toml + line: "{{ item }}" + loop: + - "[remote_storage]" + - "bucket_name = '{{ bucket_name }}'" + - "bucket_region = '{{ bucket_region }}'" + - "prefix_in_bucket = '{{ inventory_hostname }}'" + become: true + tags: + - pageserver - name: upload systemd service definition ansible.builtin.template: From
8a901de52a270b8bf8a97a256527037fb0031276 Mon Sep 17 00:00:00 2001 From: Arthur Petukhovsky Date: Sat, 12 Mar 2022 20:28:44 +0000 Subject: [PATCH 030/296] Refactor control file update at safekeeper. Record global_commit_lsn, have common routine for control file update, add SafekeeperMemstate. --- walkeeper/src/safekeeper.rs | 133 +++++++++++++++++++++++------------- walkeeper/src/timeline.rs | 4 +- 2 files changed, 87 insertions(+), 50 deletions(-) diff --git a/walkeeper/src/safekeeper.rs b/walkeeper/src/safekeeper.rs index 53fd6f5588..8300b32b42 100644 --- a/walkeeper/src/safekeeper.rs +++ b/walkeeper/src/safekeeper.rs @@ -202,6 +202,14 @@ pub struct SafeKeeperState { pub peers: Peers, } +#[derive(Debug, Clone)] +// In memory safekeeper state. Fields mirror ones in `SafeKeeperState`; they are +// not flushed yet. +pub struct SafekeeperMemState { + pub commit_lsn: Lsn, + pub peer_horizon_lsn: Lsn, +} + impl SafeKeeperState { pub fn new(zttid: &ZTenantTimelineId, peers: Vec) -> SafeKeeperState { SafeKeeperState { @@ -470,14 +478,12 @@ struct SafeKeeperMetrics { } impl SafeKeeperMetrics { - fn new(tenant_id: ZTenantId, timeline_id: ZTimelineId, commit_lsn: Lsn) -> Self { + fn new(tenant_id: ZTenantId, timeline_id: ZTimelineId) -> Self { let tenant_id = tenant_id.to_string(); let timeline_id = timeline_id.to_string(); - let m = Self { + Self { commit_lsn: COMMIT_LSN_GAUGE.with_label_values(&[&tenant_id, &timeline_id]), - }; - m.commit_lsn.set(u64::from(commit_lsn) as f64); - m + } } } @@ -487,9 +493,14 @@ pub struct SafeKeeper { // Cached metrics so we don't have to recompute labels on each update. metrics: SafeKeeperMetrics, - /// not-yet-flushed pairs of same named fields in s.* - pub commit_lsn: Lsn, - pub peer_horizon_lsn: Lsn, + /// Maximum commit_lsn between all nodes, can be ahead of local flush_lsn. + global_commit_lsn: Lsn, + /// LSN since the proposer safekeeper currently talking to appends WAL; + /// determines epoch switch point. 
+ epoch_start_lsn: Lsn, + + pub inmem: SafekeeperMemState, // in memory part + pub s: SafeKeeperState, // persistent part pub control_store: CTRL, @@ -513,9 +524,13 @@ where } SafeKeeper { - metrics: SafeKeeperMetrics::new(state.tenant_id, ztli, state.commit_lsn), - commit_lsn: state.commit_lsn, - peer_horizon_lsn: state.peer_horizon_lsn, + metrics: SafeKeeperMetrics::new(state.tenant_id, ztli), + global_commit_lsn: state.commit_lsn, + epoch_start_lsn: Lsn(0), + inmem: SafekeeperMemState { + commit_lsn: state.commit_lsn, + peer_horizon_lsn: state.peer_horizon_lsn, + }, s: state, control_store, wal_store, @@ -602,9 +617,6 @@ where // pass wal_seg_size to read WAL and find flush_lsn self.wal_store.init_storage(&self.s)?; - // update tenant_id/timeline_id in metrics - self.metrics = SafeKeeperMetrics::new(msg.tenant_id, msg.ztli, self.commit_lsn); - info!( "processed greeting from proposer {:?}, sending term {:?}", msg.proposer_id, self.s.acceptor_state.term @@ -684,12 +696,49 @@ where Ok(None) } + /// Advance commit_lsn taking into account what we have locally + fn update_commit_lsn(&mut self) -> Result<()> { + let commit_lsn = min(self.global_commit_lsn, self.wal_store.flush_lsn()); + assert!(commit_lsn >= self.inmem.commit_lsn); + + self.inmem.commit_lsn = commit_lsn; + self.metrics.commit_lsn.set(self.inmem.commit_lsn.0 as f64); + + // If new commit_lsn reached epoch switch, force sync of control + // file: walproposer in sync mode is very interested when this + // happens. Note: this is for sync-safekeepers mode only, as + // otherwise commit_lsn might jump over epoch_start_lsn. + // Also note that commit_lsn can reach epoch_start_lsn earlier + // that we receive new epoch_start_lsn, and we still need to sync + // control file in this case. + if commit_lsn == self.epoch_start_lsn && self.s.commit_lsn != commit_lsn { + self.persist_control_file()?; + } + + // We got our first commit_lsn, which means we should sync + // everything to disk, to initialize the state. 
+ if self.s.commit_lsn == Lsn(0) && commit_lsn > Lsn(0) { + self.wal_store.flush_wal()?; + self.persist_control_file()?; + } + + Ok(()) + } + + /// Persist in-memory state to the disk. + fn persist_control_file(&mut self) -> Result<()> { + self.s.commit_lsn = self.inmem.commit_lsn; + self.s.peer_horizon_lsn = self.inmem.peer_horizon_lsn; + + self.control_store.persist(&self.s) + } + /// Handle request to append WAL. #[allow(clippy::comparison_chain)] fn handle_append_request( &mut self, msg: &AppendRequest, - mut require_flush: bool, + require_flush: bool, ) -> Result> { if self.s.acceptor_state.term < msg.h.term { bail!("got AppendRequest before ProposerElected"); @@ -701,25 +750,22 @@ where return Ok(Some(AcceptorProposerMessage::AppendResponse(resp))); } - // After ProposerElected, which performs truncation, we should get only - // indeed append requests (but flush_lsn is advanced only on record - // boundary, so might be less). - assert!(self.wal_store.flush_lsn() <= msg.h.begin_lsn); + // Now we know that we are in the same term as the proposer, + // processing the message. + self.epoch_start_lsn = msg.h.epoch_start_lsn; + // TODO: don't update state without persisting to disk self.s.proposer_uuid = msg.h.proposer_uuid; - let mut sync_control_file = false; // do the job if !msg.wal_data.is_empty() { self.wal_store.write_wal(msg.h.begin_lsn, &msg.wal_data)?; - // If this was the first record we ever receieved, initialize + // If this was the first record we ever received, initialize // commit_lsn to help find_end_of_wal skip the hole in the // beginning. - if self.s.commit_lsn == Lsn(0) { - self.s.commit_lsn = msg.h.begin_lsn; - sync_control_file = true; - require_flush = true; + if self.global_commit_lsn == Lsn(0) { + self.global_commit_lsn = msg.h.begin_lsn; } } @@ -728,35 +774,22 @@ where self.wal_store.flush_wal()?; } - // Advance commit_lsn taking into account what we have locally. 
- // commit_lsn can be 0, being unknown to new walproposer while he hasn't - // collected majority of its epoch acks yet, ignore it in this case. + // Update global_commit_lsn, verifying that it cannot decrease. if msg.h.commit_lsn != Lsn(0) { - let commit_lsn = min(msg.h.commit_lsn, self.wal_store.flush_lsn()); - // If new commit_lsn reached epoch switch, force sync of control - // file: walproposer in sync mode is very interested when this - // happens. Note: this is for sync-safekeepers mode only, as - // otherwise commit_lsn might jump over epoch_start_lsn. - sync_control_file |= commit_lsn == msg.h.epoch_start_lsn; - self.commit_lsn = commit_lsn; - self.metrics - .commit_lsn - .set(u64::from(self.commit_lsn) as f64); + assert!(msg.h.commit_lsn >= self.global_commit_lsn); + self.global_commit_lsn = msg.h.commit_lsn; } - self.peer_horizon_lsn = msg.h.truncate_lsn; + self.inmem.peer_horizon_lsn = msg.h.truncate_lsn; + self.update_commit_lsn()?; + // Update truncate and commit LSN in control file. // To avoid negative impact on performance of extra fsync, do it only // when truncate_lsn delta exceeds WAL segment size. - sync_control_file |= - self.s.peer_horizon_lsn + (self.s.server.wal_seg_size as u64) < self.peer_horizon_lsn; - if sync_control_file { - self.s.commit_lsn = self.commit_lsn; - self.s.peer_horizon_lsn = self.peer_horizon_lsn; - } - - if sync_control_file { - self.control_store.persist(&self.s)?; + if self.s.peer_horizon_lsn + (self.s.server.wal_seg_size as u64) + < self.inmem.peer_horizon_lsn + { + self.persist_control_file()?; } trace!( @@ -780,6 +813,10 @@ where /// Flush WAL to disk. Return AppendResponse with latest LSNs. fn handle_flush(&mut self) -> Result> { self.wal_store.flush_wal()?; + + // commit_lsn can be updated because we have new flushed data locally. 
+ self.update_commit_lsn()?; + Ok(Some(AcceptorProposerMessage::AppendResponse( self.append_response(), ))) diff --git a/walkeeper/src/timeline.rs b/walkeeper/src/timeline.rs index ea8308b95e..b53f2e086b 100644 --- a/walkeeper/src/timeline.rs +++ b/walkeeper/src/timeline.rs @@ -340,7 +340,7 @@ impl Timeline { let replica_state = shared_state.replicas[replica_id].unwrap(); let deactivate = shared_state.notified_commit_lsn == Lsn(0) || // no data at all yet (replica_state.last_received_lsn != Lsn::MAX && // Lsn::MAX means that we don't know the latest LSN yet. - replica_state.last_received_lsn >= shared_state.sk.commit_lsn); + replica_state.last_received_lsn >= shared_state.sk.inmem.commit_lsn); if deactivate { shared_state.deactivate(&self.zttid, callmemaybe_tx)?; return Ok(true); @@ -394,7 +394,7 @@ impl Timeline { rmsg = shared_state.sk.process_msg(msg)?; // locally available commit lsn. flush_lsn can be smaller than // commit_lsn if we are catching up safekeeper. - commit_lsn = shared_state.sk.commit_lsn; + commit_lsn = shared_state.sk.inmem.commit_lsn; // if this is AppendResponse, fill in proper hot standby feedback and disk consistent lsn if let Some(AcceptorProposerMessage::AppendResponse(ref mut resp)) = rmsg { From d88f8b4a7e0b8251db36b7ed1dad4888765e3b83 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 28 Mar 2022 20:47:55 +0300 Subject: [PATCH 031/296] Fix storage deploy condition in ansible playbook --- .circleci/ansible/deploy.yaml | 1 - 1 file changed, 1 deletion(-) diff --git a/.circleci/ansible/deploy.yaml b/.circleci/ansible/deploy.yaml index 09aca8539e..3540f01fcb 100644 --- a/.circleci/ansible/deploy.yaml +++ b/.circleci/ansible/deploy.yaml @@ -64,7 +64,6 @@ - pageserver - name: update remote storage (s3) config - when: current_version > remote_version or force_deploy lineinfile: path: /storage/pageserver/data/pageserver.toml line: "{{ item }}" From 9a4f0930c02906bdce0806db6dceed44c48e0c66 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov 
Date: Mon, 28 Mar 2022 22:10:15 +0300 Subject: [PATCH 032/296] Turn off S3 for pageserver on staging --- .circleci/ansible/deploy.yaml | 27 +++++++++++++++------------ 1 file changed, 15 insertions(+), 12 deletions(-) diff --git a/.circleci/ansible/deploy.yaml b/.circleci/ansible/deploy.yaml index 3540f01fcb..b7ffd075a0 100644 --- a/.circleci/ansible/deploy.yaml +++ b/.circleci/ansible/deploy.yaml @@ -63,18 +63,21 @@ tags: - pageserver - - name: update remote storage (s3) config - lineinfile: - path: /storage/pageserver/data/pageserver.toml - line: "{{ item }}" - loop: - - "[remote_storage]" - - "bucket_name = '{{ bucket_name }}'" - - "bucket_region = '{{ bucket_region }}'" - - "prefix_in_bucket = '{{ inventory_hostname }}'" - become: true - tags: - - pageserver + # It seems that currently S3 integration does not play well + # even with fresh pageserver without a burden of old data. + # TODO: turn this back on once the issue is solved. + # - name: update remote storage (s3) config + # lineinfile: + # path: /storage/pageserver/data/pageserver.toml + # line: "{{ item }}" + # loop: + # - "[remote_storage]" + # - "bucket_name = '{{ bucket_name }}'" + # - "bucket_region = '{{ bucket_region }}'" + # - "prefix_in_bucket = '{{ inventory_hostname }}'" + # become: true + # tags: + # - pageserver - name: upload systemd service definition ansible.builtin.template: From 1aa57fc262bebb52b78dfa4054bdf9e8bd9cb48c Mon Sep 17 00:00:00 2001 From: Dhammika Pathirana Date: Mon, 28 Mar 2022 12:07:23 -0700 Subject: [PATCH 033/296] Fix tone down compact log chatter Signed-off-by: Dhammika Pathirana --- pageserver/src/layered_repository.rs | 3 +++ 1 file changed, 3 insertions(+) diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 837298a10e..a0f1f2d830 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -1628,6 +1628,9 @@ impl LayeredTimeline { }; let num_deltas = layers.count_deltas(&img_range, 
&(img_lsn..lsn))?; + if num_deltas == 0 { + continue; + } info!( "range {}-{}, has {} deltas on this timeline", From 0e44887929daa9851fb0c6239d1011c41cde04b8 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Mon, 28 Mar 2022 22:33:05 +0300 Subject: [PATCH 034/296] Show more S3 logs and less verbose WAL logs --- pageserver/src/config.rs | 2 +- pageserver/src/layered_repository.rs | 2 +- pageserver/src/remote_storage/storage_sync.rs | 47 ++++++++++++------- pageserver/src/walreceiver.rs | 2 +- 4 files changed, 33 insertions(+), 20 deletions(-) diff --git a/pageserver/src/config.rs b/pageserver/src/config.rs index 0fdfb4ceed..9f7cd34a7a 100644 --- a/pageserver/src/config.rs +++ b/pageserver/src/config.rs @@ -41,7 +41,7 @@ pub mod defaults { pub const DEFAULT_WAL_REDO_TIMEOUT: &str = "60 s"; pub const DEFAULT_SUPERUSER: &str = "zenith_admin"; - pub const DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNC: usize = 100; + pub const DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNC: usize = 10; pub const DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS: u32 = 10; pub const DEFAULT_PAGE_CACHE_SIZE: usize = 8192; diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index a0f1f2d830..56d14fd4e9 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -1594,7 +1594,7 @@ impl LayeredTimeline { self.compact_level0(target_file_size)?; timer.stop_and_record(); } else { - info!("Could not compact because no partitioning specified yet"); + debug!("Could not compact because no partitioning specified yet"); } // Call unload() on all frozen layers, to release memory.
diff --git a/pageserver/src/remote_storage/storage_sync.rs b/pageserver/src/remote_storage/storage_sync.rs index ddd47ea981..cd6c40b46f 100644 --- a/pageserver/src/remote_storage/storage_sync.rs +++ b/pageserver/src/remote_storage/storage_sync.rs @@ -443,30 +443,38 @@ fn storage_sync_loop< max_sync_errors: NonZeroU32, ) { let remote_assets = Arc::new((storage, index.clone())); + info!("Starting remote storage sync loop"); loop { let index = index.clone(); let loop_step = runtime.block_on(async { tokio::select! { - new_timeline_states = loop_step( + step = loop_step( conf, &mut receiver, Arc::clone(&remote_assets), max_concurrent_sync, max_sync_errors, ) - .instrument(debug_span!("storage_sync_loop_step")) => LoopStep::SyncStatusUpdates(new_timeline_states), + .instrument(debug_span!("storage_sync_loop_step")) => step, _ = thread_mgr::shutdown_watcher() => LoopStep::Shutdown, } }); match loop_step { LoopStep::SyncStatusUpdates(new_timeline_states) => { - // Batch timeline download registration to ensure that the external registration code won't block any running tasks before. - apply_timeline_sync_status_updates(conf, index, new_timeline_states); - debug!("Sync loop step completed"); + if new_timeline_states.is_empty() { + debug!("Sync loop step completed, no new timeline states"); + } else { + info!( + "Sync loop step completed, {} new timeline state update(s)", + new_timeline_states.len() + ); + // Batch timeline download registration to ensure that the external registration code won't block any running tasks before. 
+ apply_timeline_sync_status_updates(conf, index, new_timeline_states); + } } LoopStep::Shutdown => { - debug!("Shutdown requested, stopping"); + info!("Shutdown requested, stopping"); break; } } @@ -482,7 +490,7 @@ async fn loop_step< remote_assets: Arc<(S, RemoteIndex)>, max_concurrent_sync: NonZeroUsize, max_sync_errors: NonZeroU32, -) -> HashMap> { +) -> LoopStep { let max_concurrent_sync = max_concurrent_sync.get(); let mut next_tasks = Vec::new(); @@ -490,8 +498,7 @@ async fn loop_step< if let Some(first_task) = sync_queue::next_task(receiver).await { next_tasks.push(first_task); } else { - debug!("Shutdown requested, stopping"); - return HashMap::new(); + return LoopStep::Shutdown; }; next_tasks.extend( sync_queue::next_task_batch(receiver, max_concurrent_sync - 1) @@ -500,12 +507,17 @@ async fn loop_step< ); let remaining_queue_length = sync_queue::len(); - debug!( - "Processing {} tasks in batch, more tasks left to process: {}", - next_tasks.len(), - remaining_queue_length - ); REMAINING_SYNC_ITEMS.set(remaining_queue_length as i64); + if remaining_queue_length > 0 || !next_tasks.is_empty() { + info!( + "Processing {} tasks in batch, more tasks left to process: {}", + next_tasks.len(), + remaining_queue_length + ); + } else { + debug!("No tasks to process"); + return LoopStep::SyncStatusUpdates(HashMap::new()); + } let mut task_batch = next_tasks .into_iter() @@ -515,8 +527,9 @@ async fn loop_step< let sync_name = task.kind.sync_name(); let extra_step = match tokio::spawn( - process_task(conf, Arc::clone(&remote_assets), task, max_sync_errors) - .instrument(debug_span!("", sync_id = %sync_id, attempt, sync_name)), + process_task(conf, Arc::clone(&remote_assets), task, max_sync_errors).instrument( + debug_span!("process_sync_task", sync_id = %sync_id, attempt, sync_name), + ), ) .await { @@ -551,7 +564,7 @@ async fn loop_step< } } - new_timeline_states + LoopStep::SyncStatusUpdates(new_timeline_states) } async fn process_task< diff --git 
a/pageserver/src/walreceiver.rs b/pageserver/src/walreceiver.rs index e382475627..6de0b87478 100644 --- a/pageserver/src/walreceiver.rs +++ b/pageserver/src/walreceiver.rs @@ -70,7 +70,7 @@ pub fn launch_wal_receiver( match receivers.get_mut(&(tenantid, timelineid)) { Some(receiver) => { - info!("wal receiver already running, updating connection string"); + debug!("wal receiver already running, updating connection string"); receiver.wal_producer_connstr = wal_producer_connstr.into(); } None => { From be6a6958e26b2eae54fe00fd282772222d44b728 Mon Sep 17 00:00:00 2001 From: Anton Shyrabokau <97127717+antons-antons@users.noreply.github.com> Date: Mon, 28 Mar 2022 18:19:20 -0700 Subject: [PATCH 035/296] CI: rebuild postgres when Makefile changes (#1429) --- .circleci/config.yml | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/.circleci/config.yml b/.circleci/config.yml index 8faa69d64e..4a03cbf3b5 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -34,10 +34,13 @@ jobs: - checkout # Grab the postgres git revision to build a cache key. + # Append makefile as it could change the way postgres is built. # Note this works even though the submodule hasn't been checkout out yet. - run: name: Get postgres cache key - command: git rev-parse HEAD:vendor/postgres > /tmp/cache-key-postgres + command: | + git rev-parse HEAD:vendor/postgres > /tmp/cache-key-postgres + cat Makefile >> /tmp/cache-key-postgres - restore_cache: name: Restore postgres cache @@ -78,11 +81,14 @@ jobs: - checkout # Grab the postgres git revision to build a cache key. + # Append makefile as it could change the way postgres is built. # Note this works even though the submodule hasn't been checkout out yet. 
- run: name: Get postgres cache key command: | git rev-parse HEAD:vendor/postgres > /tmp/cache-key-postgres + cat Makefile >> /tmp/cache-key-postgres + - restore_cache: name: Restore postgres cache From fd78110c2bd22fa2fdb4a3191df542b697858528 Mon Sep 17 00:00:00 2001 From: Arthur Petukhovsky Date: Tue, 29 Mar 2022 09:57:00 +0300 Subject: [PATCH 036/296] Add default statement_timeout for tests (#1423) --- test_runner/fixtures/zenith_fixtures.py | 36 +++++++++++++++---------- 1 file changed, 22 insertions(+), 14 deletions(-) diff --git a/test_runner/fixtures/zenith_fixtures.py b/test_runner/fixtures/zenith_fixtures.py index 08ac09ee4c..2da021a49c 100644 --- a/test_runner/fixtures/zenith_fixtures.py +++ b/test_runner/fixtures/zenith_fixtures.py @@ -257,7 +257,8 @@ class PgProtocol: dbname: Optional[str] = None, schema: Optional[str] = None, username: Optional[str] = None, - password: Optional[str] = None) -> str: + password: Optional[str] = None, + statement_timeout_ms: Optional[int] = None) -> str: """ Build a libpq connection string for the Postgres instance. """ @@ -277,16 +278,23 @@ class PgProtocol: if schema: res = f"{res} options='-c search_path={schema}'" + if statement_timeout_ms: + res = f"{res} options='-c statement_timeout={statement_timeout_ms}'" + return res # autocommit=True here by default because that's what we need most of the time - def connect(self, - *, - autocommit=True, - dbname: Optional[str] = None, - schema: Optional[str] = None, - username: Optional[str] = None, - password: Optional[str] = None) -> PgConnection: + def connect( + self, + *, + autocommit=True, + dbname: Optional[str] = None, + schema: Optional[str] = None, + username: Optional[str] = None, + password: Optional[str] = None, + # individual statement timeout in seconds, 2 minutes should be enough for our tests + statement_timeout: Optional[int] = 120 + ) -> PgConnection: """ Connect to the node. Returns psycopg2's connection object. 
@@ -294,12 +302,12 @@ class PgProtocol: """ conn = psycopg2.connect( - self.connstr( - dbname=dbname, - schema=schema, - username=username, - password=password, - )) + self.connstr(dbname=dbname, + schema=schema, + username=username, + password=password, + statement_timeout_ms=statement_timeout * + 1000 if statement_timeout else None)) # WARNING: this setting affects *all* tests! conn.autocommit = autocommit return conn From eee0f51e0c3ea2d52269741124b68b8dac0e051c Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Mon, 28 Mar 2022 11:39:15 +0300 Subject: [PATCH 037/296] use cargo-hakari to manage workspace_hack crate workspace_hack is needed to avoid recompilation when different crates inside the workspace depend on the same packages but with different features being enabled. Problem occurs when you build crates separately one by one. So this is irrelevant to our CI setup because there we build all binaries at once, but it may be relevant for local development. this also changes cargo's resolver version to 2 --- .config/hakari.toml | 24 ++++++++++++++++ Cargo.lock | 15 ++++++++++ Cargo.toml | 1 + compute_tools/Cargo.toml | 1 + control_plane/Cargo.toml | 2 +- docs/sourcetree.md | 2 ++ pageserver/Cargo.toml | 2 +- postgres_ffi/Cargo.toml | 2 +- proxy/Cargo.toml | 1 + walkeeper/Cargo.toml | 2 +- workspace_hack/Cargo.toml | 60 ++++++++++++++++++++++++++++----------- workspace_hack/src/lib.rs | 24 +--------------- zenith/Cargo.toml | 2 +- zenith_metrics/Cargo.toml | 1 + zenith_utils/Cargo.toml | 2 +- 15 files changed, 96 insertions(+), 45 deletions(-) create mode 100644 .config/hakari.toml diff --git a/.config/hakari.toml b/.config/hakari.toml new file mode 100644 index 0000000000..7bccc6c4a3 --- /dev/null +++ b/.config/hakari.toml @@ -0,0 +1,24 @@ +# This file contains settings for `cargo hakari`. +# See https://docs.rs/cargo-hakari/latest/cargo_hakari/config for a full list of options. 
+ +hakari-package = "workspace_hack" + +# Format for `workspace-hack = ...` lines in other Cargo.tomls. Requires cargo-hakari 0.9.8 or above. +dep-format-version = "2" + +# Setting workspace.resolver = "2" in the root Cargo.toml is HIGHLY recommended. +# Hakari works much better with the new feature resolver. +# For more about the new feature resolver, see: +# https://blog.rust-lang.org/2021/03/25/Rust-1.51.0.html#cargos-new-feature-resolver +resolver = "2" + +# Add triples corresponding to platforms commonly used by developers here. +# https://doc.rust-lang.org/rustc/platform-support.html +platforms = [ + # "x86_64-unknown-linux-gnu", + # "x86_64-apple-darwin", + # "x86_64-pc-windows-msvc", +] + +# Write out exact versions rather than a semver range. (Defaults to false.) +# exact-versions = true diff --git a/Cargo.lock b/Cargo.lock index 290d715f2c..40f4358d98 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -407,6 +407,7 @@ dependencies = [ "serde_json", "tar", "tokio", + "workspace_hack", ] [[package]] @@ -1803,6 +1804,7 @@ dependencies = [ "tokio-postgres 0.7.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", "tokio-postgres-rustls", "tokio-rustls 0.22.0", + "workspace_hack", "zenith_metrics", "zenith_utils", ] @@ -3041,7 +3043,14 @@ dependencies = [ name = "workspace_hack" version = "0.1.0" dependencies = [ + "anyhow", + "bytes", + "cc", + "clap 2.34.0", + "either", + "hashbrown 0.11.2", "libc", + "log", "memchr", "num-integer", "num-traits", @@ -3049,8 +3058,13 @@ dependencies = [ "quote", "regex", "regex-syntax", + "reqwest", + "scopeguard", "serde", "syn", + "tokio", + "tracing", + "tracing-core", ] [[package]] @@ -3101,6 +3115,7 @@ dependencies = [ "libc", "once_cell", "prometheus", + "workspace_hack", ] [[package]] diff --git a/Cargo.toml b/Cargo.toml index b20e64a06f..f3ac36dcb2 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -11,6 +11,7 @@ members = [ "zenith_metrics", "zenith_utils", ] +resolver = "2" 
[profile.release] # This is useful for profiling and, to some extent, debug. diff --git a/compute_tools/Cargo.toml b/compute_tools/Cargo.toml index 3adf762dcb..4ecf7f6499 100644 --- a/compute_tools/Cargo.toml +++ b/compute_tools/Cargo.toml @@ -17,3 +17,4 @@ serde = { version = "1.0", features = ["derive"] } serde_json = "1" tar = "0.4" tokio = { version = "1", features = ["macros", "rt", "rt-multi-thread"] } +workspace_hack = { version = "0.1", path = "../workspace_hack" } diff --git a/control_plane/Cargo.toml b/control_plane/Cargo.toml index b52c7ad5a9..e118ea4793 100644 --- a/control_plane/Cargo.toml +++ b/control_plane/Cargo.toml @@ -20,4 +20,4 @@ reqwest = { version = "0.11", default-features = false, features = ["blocking", pageserver = { path = "../pageserver" } walkeeper = { path = "../walkeeper" } zenith_utils = { path = "../zenith_utils" } -workspace_hack = { path = "../workspace_hack" } +workspace_hack = { version = "0.1", path = "../workspace_hack" } diff --git a/docs/sourcetree.md b/docs/sourcetree.md index 8d35d35f2f..89b07de8d2 100644 --- a/docs/sourcetree.md +++ b/docs/sourcetree.md @@ -67,6 +67,8 @@ For more detailed info, see `/walkeeper/README` `/workspace_hack`: The workspace_hack crate exists only to pin down some dependencies. +We use [cargo-hakari](https://crates.io/crates/cargo-hakari) for automation. + `/zenith` Main entry point for the 'zenith' CLI utility. 
diff --git a/pageserver/Cargo.toml b/pageserver/Cargo.toml index de22d0dd77..14eae31da8 100644 --- a/pageserver/Cargo.toml +++ b/pageserver/Cargo.toml @@ -51,7 +51,7 @@ async-compression = {version = "0.3", features = ["zstd", "tokio"]} postgres_ffi = { path = "../postgres_ffi" } zenith_metrics = { path = "../zenith_metrics" } zenith_utils = { path = "../zenith_utils" } -workspace_hack = { path = "../workspace_hack" } +workspace_hack = { version = "0.1", path = "../workspace_hack" } [dev-dependencies] hex-literal = "0.3" diff --git a/postgres_ffi/Cargo.toml b/postgres_ffi/Cargo.toml index 17f1ecd666..e8d471cb12 100644 --- a/postgres_ffi/Cargo.toml +++ b/postgres_ffi/Cargo.toml @@ -17,8 +17,8 @@ log = "0.4.14" memoffset = "0.6.2" thiserror = "1.0" serde = { version = "1.0", features = ["derive"] } -workspace_hack = { path = "../workspace_hack" } zenith_utils = { path = "../zenith_utils" } +workspace_hack = { version = "0.1", path = "../workspace_hack" } [build-dependencies] bindgen = "0.59.1" diff --git a/proxy/Cargo.toml b/proxy/Cargo.toml index dda018a1d8..72c394dad4 100644 --- a/proxy/Cargo.toml +++ b/proxy/Cargo.toml @@ -29,6 +29,7 @@ tokio-rustls = "0.22.0" zenith_utils = { path = "../zenith_utils" } zenith_metrics = { path = "../zenith_metrics" } +workspace_hack = { version = "0.1", path = "../workspace_hack" } [dev-dependencies] tokio-postgres-rustls = "0.8.0" diff --git a/walkeeper/Cargo.toml b/walkeeper/Cargo.toml index 193fc4acf6..f59c24816d 100644 --- a/walkeeper/Cargo.toml +++ b/walkeeper/Cargo.toml @@ -29,9 +29,9 @@ const_format = "0.2.21" tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } postgres_ffi = { path = "../postgres_ffi" } -workspace_hack = { path = "../workspace_hack" } zenith_metrics = { path = "../zenith_metrics" } zenith_utils = { path = "../zenith_utils" } +workspace_hack = { version = "0.1", path = "../workspace_hack" } [dev-dependencies] tempfile = "3.2" diff --git 
a/workspace_hack/Cargo.toml b/workspace_hack/Cargo.toml index 48d81bbc07..6e6a0e09d7 100644 --- a/workspace_hack/Cargo.toml +++ b/workspace_hack/Cargo.toml @@ -1,22 +1,50 @@ +# This file is generated by `cargo hakari`. +# To regenerate, run: +# cargo hakari generate + [package] name = "workspace_hack" version = "0.1.0" -edition = "2021" +description = "workspace-hack package, managed by hakari" +# You can choose to publish this crate: see https://docs.rs/cargo-hakari/latest/cargo_hakari/publishing. +publish = false -[target.'cfg(all())'.dependencies] -libc = { version = "0.2", features = ["default", "extra_traits", "std"] } -memchr = { version = "2", features = ["default", "std", "use_std"] } +# The parts of the file between the BEGIN HAKARI SECTION and END HAKARI SECTION comments +# are managed by hakari. + +### BEGIN HAKARI SECTION +[dependencies] +anyhow = { version = "1", features = ["backtrace", "std"] } +bytes = { version = "1", features = ["serde", "std"] } +clap = { version = "2", features = ["ansi_term", "atty", "color", "strsim", "suggestions", "vec_map"] } +either = { version = "1", features = ["use_std"] } +hashbrown = { version = "0.11", features = ["ahash", "inline-more", "raw"] } +libc = { version = "0.2", features = ["extra_traits", "std"] } +log = { version = "0.4", default-features = false, features = ["serde", "std"] } +memchr = { version = "2", features = ["std", "use_std"] } num-integer = { version = "0.1", default-features = false, features = ["std"] } -num-traits = { version = "0.2", default-features = false, features = ["std"] } -regex = { version = "1", features = ["aho-corasick", "default", "memchr", "perf", "perf-cache", "perf-dfa", "perf-inline", "perf-literal", "std", "unicode", "unicode-age", "unicode-bool", "unicode-case", "unicode-gencat", "unicode-perl", "unicode-script", "unicode-segment"] } -regex-syntax = { version = "0.6", features = ["default", "unicode", "unicode-age", "unicode-bool", "unicode-case", "unicode-gencat", 
"unicode-perl", "unicode-script", "unicode-segment"] } -serde = { version = "1", features = ["default", "derive", "serde_derive", "std"] } +num-traits = { version = "0.2", features = ["std"] } +regex = { version = "1", features = ["aho-corasick", "memchr", "perf", "perf-cache", "perf-dfa", "perf-inline", "perf-literal", "std", "unicode", "unicode-age", "unicode-bool", "unicode-case", "unicode-gencat", "unicode-perl", "unicode-script", "unicode-segment"] } +regex-syntax = { version = "0.6", features = ["unicode", "unicode-age", "unicode-bool", "unicode-case", "unicode-gencat", "unicode-perl", "unicode-script", "unicode-segment"] } +reqwest = { version = "0.11", default-features = false, features = ["__rustls", "__tls", "blocking", "hyper-rustls", "json", "rustls", "rustls-pemfile", "rustls-tls", "rustls-tls-webpki-roots", "serde_json", "stream", "tokio-rustls", "tokio-util", "webpki-roots"] } +scopeguard = { version = "1", features = ["use_std"] } +serde = { version = "1", features = ["alloc", "derive", "serde_derive", "std"] } +tokio = { version = "1", features = ["bytes", "fs", "io-util", "libc", "macros", "memchr", "mio", "net", "num_cpus", "once_cell", "process", "rt", "rt-multi-thread", "signal-hook-registry", "sync", "time", "tokio-macros"] } +tracing = { version = "0.1", features = ["attributes", "std", "tracing-attributes"] } +tracing-core = { version = "0.1", features = ["lazy_static", "std"] } -[target.'cfg(all())'.build-dependencies] -libc = { version = "0.2", features = ["default", "extra_traits", "std"] } -memchr = { version = "2", features = ["default", "std", "use_std"] } -proc-macro2 = { version = "1", features = ["default", "proc-macro"] } -quote = { version = "1", features = ["default", "proc-macro"] } -regex = { version = "1", features = ["aho-corasick", "default", "memchr", "perf", "perf-cache", "perf-dfa", "perf-inline", "perf-literal", "std", "unicode", "unicode-age", "unicode-bool", "unicode-case", "unicode-gencat", "unicode-perl", 
"unicode-script", "unicode-segment"] } -regex-syntax = { version = "0.6", features = ["default", "unicode", "unicode-age", "unicode-bool", "unicode-case", "unicode-gencat", "unicode-perl", "unicode-script", "unicode-segment"] } -syn = { version = "1", features = ["clone-impls", "default", "derive", "full", "parsing", "printing", "proc-macro", "quote", "visit", "visit-mut"] } +[build-dependencies] +cc = { version = "1", default-features = false, features = ["jobserver", "parallel"] } +clap = { version = "2", features = ["ansi_term", "atty", "color", "strsim", "suggestions", "vec_map"] } +either = { version = "1", features = ["use_std"] } +libc = { version = "0.2", features = ["extra_traits", "std"] } +log = { version = "0.4", default-features = false, features = ["serde", "std"] } +memchr = { version = "2", features = ["std", "use_std"] } +proc-macro2 = { version = "1", features = ["proc-macro"] } +quote = { version = "1", features = ["proc-macro"] } +regex = { version = "1", features = ["aho-corasick", "memchr", "perf", "perf-cache", "perf-dfa", "perf-inline", "perf-literal", "std", "unicode", "unicode-age", "unicode-bool", "unicode-case", "unicode-gencat", "unicode-perl", "unicode-script", "unicode-segment"] } +regex-syntax = { version = "0.6", features = ["unicode", "unicode-age", "unicode-bool", "unicode-case", "unicode-gencat", "unicode-perl", "unicode-script", "unicode-segment"] } +serde = { version = "1", features = ["alloc", "derive", "serde_derive", "std"] } +syn = { version = "1", features = ["clone-impls", "derive", "extra-traits", "full", "parsing", "printing", "proc-macro", "quote", "visit", "visit-mut"] } + +### END HAKARI SECTION diff --git a/workspace_hack/src/lib.rs b/workspace_hack/src/lib.rs index ceba3d145d..22489f632b 100644 --- a/workspace_hack/src/lib.rs +++ b/workspace_hack/src/lib.rs @@ -1,23 +1 @@ -//! This crate contains no code. -//! -//! The workspace_hack crate exists only to pin down some dependencies, -//! 
so that those dependencies always build with the same features, -//! under a few different cases that can be problematic: -//! - Running `cargo check` or `cargo build` from a crate sub-directory -//! instead of the workspace root. -//! - Running `cargo install`, which can only be done per-crate -//! -//! The dependency lists in Cargo.toml were automatically generated by -//! a tool called -//! [Hakari](https://github.com/facebookincubator/cargo-guppy/tree/main/tools/hakari). -//! -//! Hakari doesn't have a CLI yet; in the meantime the example code in -//! their `README` file is enough to regenerate the dependencies. -//! Hakari's output was pasted into Cargo.toml, except for the -//! following manual edits: -//! - `winapi` dependency was removed. This is probably just due to the -//! fact that Hakari's target analysis is incomplete. -//! -//! There isn't any penalty to this data falling out of date; it just -//! means that under the conditions above Cargo will rebuild more -//! packages than strictly necessary. +// This is a stub lib.rs. 
diff --git a/zenith/Cargo.toml b/zenith/Cargo.toml index 8adbda0723..74aeffb51c 100644 --- a/zenith/Cargo.toml +++ b/zenith/Cargo.toml @@ -15,4 +15,4 @@ control_plane = { path = "../control_plane" } walkeeper = { path = "../walkeeper" } postgres_ffi = { path = "../postgres_ffi" } zenith_utils = { path = "../zenith_utils" } -workspace_hack = { path = "../workspace_hack" } +workspace_hack = { version = "0.1", path = "../workspace_hack" } diff --git a/zenith_metrics/Cargo.toml b/zenith_metrics/Cargo.toml index 0c921ede0b..906c5a1d64 100644 --- a/zenith_metrics/Cargo.toml +++ b/zenith_metrics/Cargo.toml @@ -8,3 +8,4 @@ prometheus = {version = "0.13", default_features=false} # removes protobuf depen libc = "0.2" lazy_static = "1.4" once_cell = "1.8.0" +workspace_hack = { version = "0.1", path = "../workspace_hack" } diff --git a/zenith_utils/Cargo.toml b/zenith_utils/Cargo.toml index 8e7f5f233c..e8ad0e627f 100644 --- a/zenith_utils/Cargo.toml +++ b/zenith_utils/Cargo.toml @@ -30,7 +30,7 @@ git-version = "0.3.5" serde_with = "1.12.0" zenith_metrics = { path = "../zenith_metrics" } -workspace_hack = { path = "../workspace_hack" } +workspace_hack = { version = "0.1", path = "../workspace_hack" } [dev-dependencies] byteorder = "1.4.3" From 9594362f74c2ea66a495da8d50c3cb25de67d62c Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Mon, 28 Mar 2022 17:34:13 +0300 Subject: [PATCH 038/296] change python cache version to 2 (fixes python cache in circle CI) --- .circleci/config.yml | 8 ++++---- scripts/pysync | 8 +++++++- 2 files changed, 11 insertions(+), 5 deletions(-) diff --git a/.circleci/config.yml b/.circleci/config.yml index 4a03cbf3b5..e96964558b 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -228,12 +228,12 @@ jobs: - checkout - restore_cache: keys: - - v1-python-deps-{{ checksum "poetry.lock" }} + - v2-python-deps-{{ checksum "poetry.lock" }} - run: name: Install deps command: ./scripts/pysync - save_cache: - key: v1-python-deps-{{ checksum 
"poetry.lock" }} + key: v2-python-deps-{{ checksum "poetry.lock" }} paths: - /home/circleci/.cache/pypoetry/virtualenvs - run: @@ -287,12 +287,12 @@ jobs: - run: git submodule update --init --depth 1 - restore_cache: keys: - - v1-python-deps-{{ checksum "poetry.lock" }} + - v2-python-deps-{{ checksum "poetry.lock" }} - run: name: Install deps command: ./scripts/pysync - save_cache: - key: v1-python-deps-{{ checksum "poetry.lock" }} + key: v2-python-deps-{{ checksum "poetry.lock" }} paths: - /home/circleci/.cache/pypoetry/virtualenvs - run: diff --git a/scripts/pysync b/scripts/pysync index e548973dea..12fa08beca 100755 --- a/scripts/pysync +++ b/scripts/pysync @@ -4,4 +4,10 @@ # It is intended to be a primary endpoint for all the people who want to # just setup test environment without going into details of python package management -poetry install --no-root # this installs dev dependencies by default +poetry config --list + +if [ -z "${CI}" ]; then + poetry install --no-root --no-interaction --ansi +else + poetry install --no-root +fi From ec3bc741653d8c14f99a27c58ff74f4046ba7969 Mon Sep 17 00:00:00 2001 From: Arseny Sher Date: Thu, 17 Mar 2022 15:14:16 +0300 Subject: [PATCH 039/296] Add safekeeper information exchange through etcd. Safekeepers now publish to and pull from etcd per-timeline data. Immediate goal is WAL truncation, for which every safekeeper must know remote_consistent_lsn; the next would be callmemaybe replacement. Adds corresponding '--broker-endpoints' argument to safekeeper and ability to run etcd in tests. Adds test checking remote_consistent_lsn is indeed communicated.
--- Cargo.lock | 252 +++++++++++++++++- control_plane/src/local_env.rs | 4 + control_plane/src/safekeeper.rs | 6 + test_runner/README.md | 2 + test_runner/batch_others/test_wal_acceptor.py | 46 +++- test_runner/fixtures/utils.py | 6 + test_runner/fixtures/zenith_fixtures.py | 75 +++++- walkeeper/Cargo.toml | 3 + walkeeper/src/bin/safekeeper.rs | 27 +- walkeeper/src/broker.rs | 211 +++++++++++++++ walkeeper/src/handler.rs | 9 +- walkeeper/src/http/routes.rs | 17 +- walkeeper/src/json_ctrl.rs | 6 +- walkeeper/src/lib.rs | 4 + walkeeper/src/safekeeper.rs | 20 +- walkeeper/src/send_wal.rs | 2 +- walkeeper/src/timeline.rs | 76 +++++- 17 files changed, 726 insertions(+), 40 deletions(-) create mode 100644 walkeeper/src/broker.rs diff --git a/Cargo.lock b/Cargo.lock index 40f4358d98..c770f576c9 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -75,6 +75,27 @@ dependencies = [ "zstd-safe", ] +[[package]] +name = "async-stream" +version = "0.3.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "171374e7e3b2504e0e5236e3b59260560f9fe94bfe9ac39ba5e4e929c5590625" +dependencies = [ + "async-stream-impl", + "futures-core", +] + +[[package]] +name = "async-stream-impl" +version = "0.3.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "648ed8c8d2ce5409ccd57453d9d1b214b342a0d69376a6feda1fd6cae3299308" +dependencies = [ + "proc-macro2", + "quote", + "syn", +] + [[package]] name = "async-trait" version = "0.1.52" @@ -703,6 +724,21 @@ dependencies = [ "termcolor", ] +[[package]] +name = "etcd-client" +version = "0.8.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "585de5039d1ecce74773db49ba4e8107e42be7c2cd0b1a9e7fce27181db7b118" +dependencies = [ + "http", + "prost", + "tokio", + "tokio-stream", + "tonic", + "tonic-build", + "tower-service", +] + [[package]] name = "fail" version = "0.5.0" @@ -741,6 +777,12 @@ dependencies = [ "winapi", ] +[[package]] +name = "fixedbitset" +version = "0.4.1" +source 
= "registry+https://github.com/rust-lang/crates.io-index" +checksum = "279fb028e20b3c4c320317955b77c5e0c9701f05a1d309905d6fc702cdc5053e" + [[package]] name = "fnv" version = "1.0.7" @@ -926,7 +968,7 @@ dependencies = [ "indexmap", "slab", "tokio", - "tokio-util", + "tokio-util 0.6.9", "tracing", ] @@ -954,6 +996,15 @@ dependencies = [ "ahash 0.7.6", ] +[[package]] +name = "heck" +version = "0.3.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6d621efb26863f0e9924c6ac577e8275e5e6b77455db64ffa6c65c904e9e132c" +dependencies = [ + "unicode-segmentation", +] + [[package]] name = "hermit-abi" version = "0.1.19" @@ -1075,6 +1126,18 @@ dependencies = [ "tokio-rustls 0.23.2", ] +[[package]] +name = "hyper-timeout" +version = "0.4.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "bbb958482e8c7be4bc3cf272a766a2b0bf1a6755e7a6ae777f017a31d11b13b1" +dependencies = [ + "hyper", + "pin-project-lite", + "tokio", + "tokio-io-timeout", +] + [[package]] name = "ident_case" version = "1.0.1" @@ -1308,9 +1371,9 @@ dependencies = [ [[package]] name = "mio" -version = "0.7.14" +version = "0.8.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8067b404fe97c70829f082dec8bcf4f71225d7eaea1d8645349cb76fa06205cc" +checksum = "ba272f85fa0b41fc91872be579b3bbe0f56b792aa361a380eb669469f68dafb2" dependencies = [ "libc", "log", @@ -1328,6 +1391,12 @@ dependencies = [ "winapi", ] +[[package]] +name = "multimap" +version = "0.8.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e5ce46fe64a9d73be07dcbe690a38ce1b293be448fd8ce1e6c1b8062c9f72c6a" + [[package]] name = "nix" version = "0.23.1" @@ -1557,6 +1626,16 @@ version = "2.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "d4fd5641d01c8f18a23da7b6fe29298ff4b55afcccdf78973b24cf3175fee32e" +[[package]] +name = "petgraph" +version = "0.6.0" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "4a13a2fa9d0b63e5f22328828741e523766fff0ee9e779316902290dff3f824f" +dependencies = [ + "fixedbitset", + "indexmap", +] + [[package]] name = "phf" version = "0.8.0" @@ -1776,6 +1855,59 @@ dependencies = [ "thiserror", ] +[[package]] +name = "prost" +version = "0.9.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "444879275cb4fd84958b1a1d5420d15e6fcf7c235fe47f053c9c2a80aceb6001" +dependencies = [ + "bytes", + "prost-derive", +] + +[[package]] +name = "prost-build" +version = "0.9.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "62941722fb675d463659e49c4f3fe1fe792ff24fe5bbaa9c08cd3b98a1c354f5" +dependencies = [ + "bytes", + "heck", + "itertools", + "lazy_static", + "log", + "multimap", + "petgraph", + "prost", + "prost-types", + "regex", + "tempfile", + "which", +] + +[[package]] +name = "prost-derive" +version = "0.9.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f9cc1a3263e07e0bf68e96268f37665207b49560d98739662cdfaae215c720fe" +dependencies = [ + "anyhow", + "itertools", + "proc-macro2", + "quote", + "syn", +] + +[[package]] +name = "prost-types" +version = "0.9.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "534b7a0e836e3c482d2693070f982e39e7611da9695d4d1f5a4b186b51faef0a" +dependencies = [ + "bytes", + "prost", +] + [[package]] name = "proxy" version = "0.1.0" @@ -1979,7 +2111,7 @@ dependencies = [ "serde_urlencoded", "tokio", "tokio-rustls 0.23.2", - "tokio-util", + "tokio-util 0.6.9", "url", "wasm-bindgen", "wasm-bindgen-futures", @@ -2508,9 +2640,9 @@ checksum = "cda74da7e1a664f795bb1f8a87ec406fb89a02522cf6e50620d016add6dbbf5c" [[package]] name = "tokio" -version = "1.16.1" +version = "1.17.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0c27a64b625de6d309e8c57716ba93021dccf1b3b5c97edd6d3dd2d2135afc0a" +checksum = 
"2af73ac49756f3f7c01172e34a23e5d0216f6c32333757c2c61feb2bbff5a5ee" dependencies = [ "bytes", "libc", @@ -2520,10 +2652,21 @@ dependencies = [ "once_cell", "pin-project-lite", "signal-hook-registry", + "socket2", "tokio-macros", "winapi", ] +[[package]] +name = "tokio-io-timeout" +version = "1.2.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "30b74022ada614a1b4834de765f9bb43877f910cc8ce4be40e89042c9223a8bf" +dependencies = [ + "pin-project-lite", + "tokio", +] + [[package]] name = "tokio-macros" version = "1.7.0" @@ -2554,7 +2697,7 @@ dependencies = [ "postgres-types 0.2.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", "socket2", "tokio", - "tokio-util", + "tokio-util 0.6.9", ] [[package]] @@ -2576,7 +2719,7 @@ dependencies = [ "postgres-types 0.2.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858)", "socket2", "tokio", - "tokio-util", + "tokio-util 0.6.9", ] [[package]] @@ -2641,6 +2784,20 @@ dependencies = [ "tokio", ] +[[package]] +name = "tokio-util" +version = "0.7.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "64910e1b9c1901aaf5375561e35b9c057d95ff41a44ede043a03e09279eabaf1" +dependencies = [ + "bytes", + "futures-core", + "futures-sink", + "log", + "pin-project-lite", + "tokio", +] + [[package]] name = "toml" version = "0.5.8" @@ -2663,6 +2820,75 @@ dependencies = [ "serde", ] +[[package]] +name = "tonic" +version = "0.6.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ff08f4649d10a70ffa3522ca559031285d8e421d727ac85c60825761818f5d0a" +dependencies = [ + "async-stream", + "async-trait", + "base64 0.13.0", + "bytes", + "futures-core", + "futures-util", + "h2", + "http", + "http-body", + "hyper", + "hyper-timeout", + "percent-encoding", + "pin-project", + "prost", + "prost-derive", + "tokio", + "tokio-stream", + "tokio-util 0.6.9", + "tower", + "tower-layer", + 
"tower-service", + "tracing", + "tracing-futures", +] + +[[package]] +name = "tonic-build" +version = "0.6.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9403f1bafde247186684b230dc6f38b5cd514584e8bec1dd32514be4745fa757" +dependencies = [ + "proc-macro2", + "prost-build", + "quote", + "syn", +] + +[[package]] +name = "tower" +version = "0.4.12" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9a89fd63ad6adf737582df5db40d286574513c69a11dac5214dc3b5603d6713e" +dependencies = [ + "futures-core", + "futures-util", + "indexmap", + "pin-project", + "pin-project-lite", + "rand", + "slab", + "tokio", + "tokio-util 0.7.0", + "tower-layer", + "tower-service", + "tracing", +] + +[[package]] +name = "tower-layer" +version = "0.3.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "343bc9466d3fe6b0f960ef45960509f84480bf4fd96f92901afe7ff3df9d3a62" + [[package]] name = "tower-service" version = "0.3.1" @@ -2676,6 +2902,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "2d8d93354fe2a8e50d5953f5ae2e47a3fc2ef03292e7ea46e3cc38f549525fb9" dependencies = [ "cfg-if", + "log", "pin-project-lite", "tracing-attributes", "tracing-core", @@ -2768,6 +2995,12 @@ dependencies = [ "tinyvec", ] +[[package]] +name = "unicode-segmentation" +version = "1.9.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7e8820f5d777f6224dc4be3632222971ac30164d4a258d595640799554ebfd99" + [[package]] name = "unicode-width" version = "0.1.9" @@ -2838,6 +3071,7 @@ dependencies = [ "const_format", "crc32c", "daemonize", + "etcd-client", "fs2", "hex", "humantime", @@ -2850,11 +3084,13 @@ dependencies = [ "rust-s3", "serde", "serde_json", + "serde_with", "signal-hook", "tempfile", "tokio", "tokio-postgres 0.7.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", "tracing", + "url", "walkdir", "workspace_hack", 
"zenith_metrics", diff --git a/control_plane/src/local_env.rs b/control_plane/src/local_env.rs index 00ace431e6..2bdc76e876 100644 --- a/control_plane/src/local_env.rs +++ b/control_plane/src/local_env.rs @@ -57,6 +57,10 @@ pub struct LocalEnv { #[serde(default)] pub private_key_path: PathBuf, + // A comma separated broker (etcd) endpoints for storage nodes coordination, e.g. 'http://127.0.0.1:2379'. + #[serde(default)] + pub broker_endpoints: Option, + pub pageserver: PageServerConf, #[serde(default)] diff --git a/control_plane/src/safekeeper.rs b/control_plane/src/safekeeper.rs index 969e2cd531..89ab0a31ee 100644 --- a/control_plane/src/safekeeper.rs +++ b/control_plane/src/safekeeper.rs @@ -73,6 +73,8 @@ pub struct SafekeeperNode { pub http_base_url: String, pub pageserver: Arc, + + broker_endpoints: Option, } impl SafekeeperNode { @@ -89,6 +91,7 @@ impl SafekeeperNode { http_client: Client::new(), http_base_url: format!("http://127.0.0.1:{}/v1", conf.http_port), pageserver, + broker_endpoints: env.broker_endpoints.clone(), } } @@ -135,6 +138,9 @@ impl SafekeeperNode { if !self.conf.sync { cmd.arg("--no-sync"); } + if let Some(ref ep) = self.broker_endpoints { + cmd.args(&["--broker-endpoints", ep]); + } if !cmd.status()?.success() { bail!( diff --git a/test_runner/README.md b/test_runner/README.md index a56c2df2c0..ee171ae6a0 100644 --- a/test_runner/README.md +++ b/test_runner/README.md @@ -10,6 +10,8 @@ Prerequisites: below to run from other directories. - The zenith git repo, including the postgres submodule (for some tests, e.g. `pg_regress`) +- Some tests (involving storage nodes coordination) require etcd installed. Follow + [`the guide`](https://etcd.io/docs/v3.5/install/) to obtain it. 
### Test Organization diff --git a/test_runner/batch_others/test_wal_acceptor.py b/test_runner/batch_others/test_wal_acceptor.py index 37ce1a8bca..bdc526a125 100644 --- a/test_runner/batch_others/test_wal_acceptor.py +++ b/test_runner/batch_others/test_wal_acceptor.py @@ -13,7 +13,7 @@ from dataclasses import dataclass, field from multiprocessing import Process, Value from pathlib import Path from fixtures.zenith_fixtures import PgBin, Postgres, Safekeeper, ZenithEnv, ZenithEnvBuilder, PortDistributor, SafekeeperPort, zenith_binpath, PgProtocol -from fixtures.utils import lsn_to_hex, mkdir_if_needed, lsn_from_hex +from fixtures.utils import etcd_path, lsn_to_hex, mkdir_if_needed, lsn_from_hex from fixtures.log_helper import log from typing import List, Optional, Any @@ -22,6 +22,7 @@ from typing import List, Optional, Any # succeed and data is written def test_normal_work(zenith_env_builder: ZenithEnvBuilder): zenith_env_builder.num_safekeepers = 3 + zenith_env_builder.broker = True env = zenith_env_builder.init_start() env.zenith_cli.create_branch('test_wal_acceptors_normal_work') @@ -326,6 +327,49 @@ def test_race_conditions(zenith_env_builder: ZenithEnvBuilder, stop_value): proc.join() +# Test that safekeepers push their info to the broker and learn peer status from it +@pytest.mark.skipif(etcd_path() is None, reason="requires etcd which is not present in PATH") +def test_broker(zenith_env_builder: ZenithEnvBuilder): + zenith_env_builder.num_safekeepers = 3 + zenith_env_builder.broker = True + zenith_env_builder.enable_local_fs_remote_storage() + env = zenith_env_builder.init_start() + + env.zenith_cli.create_branch("test_broker", "main") + pg = env.postgres.create_start('test_broker') + pg.safe_psql("CREATE TABLE t(key int primary key, value text)") + + # learn zenith timeline from compute + tenant_id = pg.safe_psql("show zenith.zenith_tenant")[0][0] + timeline_id = pg.safe_psql("show zenith.zenith_timeline")[0][0] + + # wait until remote_consistent_lsn gets 
advanced on all safekeepers + clients = [sk.http_client() for sk in env.safekeepers] + stat_before = [cli.timeline_status(tenant_id, timeline_id) for cli in clients] + log.info(f"statuses is {stat_before}") + + pg.safe_psql("INSERT INTO t SELECT generate_series(1,100), 'payload'") + # force checkpoint to advance remote_consistent_lsn + with closing(env.pageserver.connect()) as psconn: + with psconn.cursor() as pscur: + pscur.execute(f"checkpoint {tenant_id} {timeline_id}") + # and wait till remote_consistent_lsn propagates to all safekeepers + started_at = time.time() + while True: + stat_after = [cli.timeline_status(tenant_id, timeline_id) for cli in clients] + if all( + lsn_from_hex(s_after.remote_consistent_lsn) > lsn_from_hex( + s_before.remote_consistent_lsn) for s_after, + s_before in zip(stat_after, stat_before)): + break + elapsed = time.time() - started_at + if elapsed > 20: + raise RuntimeError( + f"timed out waiting {elapsed:.0f}s for remote_consistent_lsn propagation: status before {stat_before}, status current {stat_after}" + ) + time.sleep(0.5) + + class ProposerPostgres(PgProtocol): """Object for running postgres without ZenithEnv""" def __init__(self, diff --git a/test_runner/fixtures/utils.py b/test_runner/fixtures/utils.py index 58f7294eb5..f16fe1d9cf 100644 --- a/test_runner/fixtures/utils.py +++ b/test_runner/fixtures/utils.py @@ -1,4 +1,5 @@ import os +import shutil import subprocess from typing import Any, List @@ -76,3 +77,8 @@ def print_gc_result(row): log.info( " total: {layers_total}, needed_by_cutoff {layers_needed_by_cutoff}, needed_by_branches: {layers_needed_by_branches}, not_updated: {layers_not_updated}, removed: {layers_removed}" .format_map(row)) + + +# path to etcd binary or None if not present. 
+def etcd_path(): + return shutil.which("etcd") diff --git a/test_runner/fixtures/zenith_fixtures.py b/test_runner/fixtures/zenith_fixtures.py index 2da021a49c..a95809687a 100644 --- a/test_runner/fixtures/zenith_fixtures.py +++ b/test_runner/fixtures/zenith_fixtures.py @@ -33,7 +33,7 @@ from typing_extensions import Literal import requests import backoff # type: ignore -from .utils import (get_self_dir, lsn_from_hex, mkdir_if_needed, subprocess_capture) +from .utils import (etcd_path, get_self_dir, mkdir_if_needed, subprocess_capture, lsn_from_hex) from fixtures.log_helper import log """ This file contains pytest fixtures. A fixture is a test resource that can be @@ -433,7 +433,8 @@ class ZenithEnvBuilder: num_safekeepers: int = 0, pageserver_auth_enabled: bool = False, rust_log_override: Optional[str] = None, - default_branch_name=DEFAULT_BRANCH_NAME): + default_branch_name=DEFAULT_BRANCH_NAME, + broker: bool = False): self.repo_dir = repo_dir self.rust_log_override = rust_log_override self.port_distributor = port_distributor @@ -442,6 +443,7 @@ class ZenithEnvBuilder: self.num_safekeepers = num_safekeepers self.pageserver_auth_enabled = pageserver_auth_enabled self.default_branch_name = default_branch_name + self.broker = broker self.env: Optional[ZenithEnv] = None self.s3_mock_server: Optional[MockS3Server] = None @@ -517,6 +519,8 @@ class ZenithEnvBuilder: self.env.pageserver.stop(immediate=True) if self.s3_mock_server: self.s3_mock_server.kill() + if self.env.broker is not None: + self.env.broker.stop() class ZenithEnv: @@ -569,6 +573,16 @@ class ZenithEnv: default_tenant_id = '{self.initial_tenant.hex}' """) + self.broker = None + if config.broker: + # keep etcd datadir inside 'repo' + self.broker = Etcd(datadir=os.path.join(self.repo_dir, "etcd"), + port=self.port_distributor.get_port(), + peer_port=self.port_distributor.get_port()) + toml += textwrap.dedent(f""" + broker_endpoints = 'http://127.0.0.1:{self.broker.port}' + """) + # Create config for 
pageserver pageserver_port = PageserverPort( pg=self.port_distributor.get_port(), @@ -611,12 +625,15 @@ class ZenithEnv: self.zenith_cli.init(toml) def start(self): - # Start up the page server and all the safekeepers + # Start up the page server, all the safekeepers and the broker self.pageserver.start() for safekeeper in self.safekeepers: safekeeper.start() + if self.broker is not None: + self.broker.start() + def get_safekeeper_connstrs(self) -> str: """ Get list of safekeeper endpoints suitable for wal_acceptors GUC """ return ','.join([f'localhost:{wa.port.pg}' for wa in self.safekeepers]) @@ -1674,6 +1691,7 @@ class Safekeeper: class SafekeeperTimelineStatus: acceptor_epoch: int flush_lsn: str + remote_consistent_lsn: str @dataclass @@ -1697,7 +1715,8 @@ class SafekeeperHttpClient(requests.Session): res.raise_for_status() resj = res.json() return SafekeeperTimelineStatus(acceptor_epoch=resj['acceptor_state']['epoch'], - flush_lsn=resj['flush_lsn']) + flush_lsn=resj['flush_lsn'], + remote_consistent_lsn=resj['remote_consistent_lsn']) def get_metrics(self) -> SafekeeperMetrics: request_result = self.get(f"http://localhost:{self.port}/metrics") @@ -1718,6 +1737,54 @@ class SafekeeperHttpClient(requests.Session): return metrics +@dataclass +class Etcd: + """ An object managing etcd instance """ + datadir: str + port: int + peer_port: int + handle: Optional[subprocess.Popen[Any]] = None # handle of running daemon + + def check_status(self): + s = requests.Session() + s.mount('http://', requests.adapters.HTTPAdapter(max_retries=1)) # do not retry + s.get(f"http://localhost:{self.port}/health").raise_for_status() + + def start(self): + pathlib.Path(self.datadir).mkdir(exist_ok=True) + etcd_full_path = etcd_path() + if etcd_full_path is None: + raise Exception('etcd not found') + + with open(os.path.join(self.datadir, "etcd.log"), "wb") as log_file: + args = [ + etcd_full_path, + f"--data-dir={self.datadir}", + f"--listen-client-urls=http://localhost:{self.port}", + 
f"--advertise-client-urls=http://localhost:{self.port}", + f"--listen-peer-urls=http://localhost:{self.peer_port}" + ] + self.handle = subprocess.Popen(args, stdout=log_file, stderr=log_file) + + # wait for start + started_at = time.time() + while True: + try: + self.check_status() + except Exception as e: + elapsed = time.time() - started_at + if elapsed > 5: + raise RuntimeError(f"timed out waiting {elapsed:.0f}s for etcd start: {e}") + time.sleep(0.5) + else: + break # success + + def stop(self): + if self.handle is not None: + self.handle.terminate() + self.handle.wait() + + def get_test_output_dir(request: Any) -> str: """ Compute the working directory for an individual test. """ test_name = request.node.name diff --git a/walkeeper/Cargo.toml b/walkeeper/Cargo.toml index f59c24816d..e8523d27d1 100644 --- a/walkeeper/Cargo.toml +++ b/walkeeper/Cargo.toml @@ -22,11 +22,14 @@ anyhow = "1.0" crc32c = "0.6.0" humantime = "2.1.0" walkdir = "2" +url = "2.2.2" signal-hook = "0.3.10" serde = { version = "1.0", features = ["derive"] } +serde_with = {version = "1.12.0"} hex = "0.4.3" const_format = "0.2.21" tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } +etcd-client = "0.8.3" postgres_ffi = { path = "../postgres_ffi" } zenith_metrics = { path = "../zenith_metrics" } diff --git a/walkeeper/src/bin/safekeeper.rs b/walkeeper/src/bin/safekeeper.rs index 6c45115e5f..b3087a1004 100644 --- a/walkeeper/src/bin/safekeeper.rs +++ b/walkeeper/src/bin/safekeeper.rs @@ -11,18 +11,19 @@ use std::io::{ErrorKind, Write}; use std::path::{Path, PathBuf}; use std::thread; use tracing::*; +use url::{ParseError, Url}; use walkeeper::control_file::{self}; use zenith_utils::http::endpoint; use zenith_utils::zid::ZNodeId; use zenith_utils::{logging, tcp_listener, GIT_VERSION}; use tokio::sync::mpsc; -use walkeeper::callmemaybe; use walkeeper::defaults::{DEFAULT_HTTP_LISTEN_ADDR, DEFAULT_PG_LISTEN_ADDR}; use 
walkeeper::http; use walkeeper::s3_offload; use walkeeper::wal_service; use walkeeper::SafeKeeperConf; +use walkeeper::{broker, callmemaybe}; use zenith_utils::shutdown::exit_now; use zenith_utils::signals; @@ -104,6 +105,11 @@ fn main() -> Result<()> { ) .arg( Arg::new("id").long("id").takes_value(true).help("safekeeper node id: integer") + ).arg( + Arg::new("broker-endpoints") + .long("broker-endpoints") + .takes_value(true) + .help("a comma separated broker (etcd) endpoints for storage nodes coordination, e.g. 'http://127.0.0.1:2379'"), ) .get_matches(); @@ -154,6 +160,11 @@ fn main() -> Result<()> { )); } + if let Some(addr) = arg_matches.value_of("broker-endpoints") { + let collected_ep: Result, ParseError> = addr.split(',').map(Url::parse).collect(); + conf.broker_endpoints = Some(collected_ep?); + } + start_safekeeper(conf, given_id, arg_matches.is_present("init")) } @@ -259,11 +270,12 @@ fn start_safekeeper(mut conf: SafeKeeperConf, given_id: Option, init: b threads.push(wal_acceptor_thread); + let conf_cloned = conf.clone(); let callmemaybe_thread = thread::Builder::new() .name("callmemaybe thread".into()) .spawn(|| { // thread code - let thread_result = callmemaybe::thread_main(conf, rx); + let thread_result = callmemaybe::thread_main(conf_cloned, rx); if let Err(e) = thread_result { error!("callmemaybe thread terminated: {}", e); } @@ -271,6 +283,17 @@ fn start_safekeeper(mut conf: SafeKeeperConf, given_id: Option, init: b .unwrap(); threads.push(callmemaybe_thread); + if conf.broker_endpoints.is_some() { + let conf_ = conf.clone(); + threads.push( + thread::Builder::new() + .name("broker thread".into()) + .spawn(|| { + broker::thread_main(conf_); + })?, + ); + } + // TODO: put more thoughts into handling of failed threads // We probably should restart them. diff --git a/walkeeper/src/broker.rs b/walkeeper/src/broker.rs new file mode 100644 index 0000000000..147497d673 --- /dev/null +++ b/walkeeper/src/broker.rs @@ -0,0 +1,211 @@ +//! 
Communication with etcd, providing safekeeper peers and pageserver coordination. + +use anyhow::bail; +use anyhow::Context; +use anyhow::Error; +use anyhow::Result; +use etcd_client::Client; +use etcd_client::EventType; +use etcd_client::PutOptions; +use etcd_client::WatchOptions; +use lazy_static::lazy_static; +use regex::Regex; +use serde::{Deserialize, Serialize}; +use serde_with::{serde_as, DisplayFromStr}; +use std::str::FromStr; +use std::time::Duration; +use tokio::task::JoinHandle; +use tokio::{runtime, time::sleep}; +use tracing::*; +use zenith_utils::zid::ZTenantId; +use zenith_utils::zid::ZTimelineId; +use zenith_utils::{ + lsn::Lsn, + zid::{ZNodeId, ZTenantTimelineId}, +}; + +use crate::{safekeeper::Term, timeline::GlobalTimelines, SafeKeeperConf}; + +const RETRY_INTERVAL_MSEC: u64 = 1000; +const PUSH_INTERVAL_MSEC: u64 = 1000; +const LEASE_TTL_SEC: i64 = 5; +// TODO: add global zenith installation ID. +const ZENITH_PREFIX: &str = "zenith"; + +/// Published data about safekeeper. Fields made optional for easy migrations. +#[serde_as] +#[derive(Deserialize, Serialize)] +pub struct SafekeeperInfo { + /// Term of the last entry. + pub last_log_term: Option, + /// LSN of the last record. + #[serde_as(as = "Option")] + pub flush_lsn: Option, + /// Up to which LSN safekeeper regards its WAL as committed. + #[serde_as(as = "Option")] + pub commit_lsn: Option, + /// LSN up to which safekeeper offloaded WAL to s3. + #[serde_as(as = "Option")] + pub s3_wal_lsn: Option, + /// LSN of last checkpoint uploaded by pageserver. 
+ #[serde_as(as = "Option")] + pub remote_consistent_lsn: Option, + #[serde_as(as = "Option")] + pub peer_horizon_lsn: Option, +} + +pub fn thread_main(conf: SafeKeeperConf) { + let runtime = runtime::Builder::new_current_thread() + .enable_all() + .build() + .unwrap(); + + let _enter = info_span!("broker").entered(); + info!("started, broker endpoints {:?}", conf.broker_endpoints); + + runtime.block_on(async { + main_loop(conf).await; + }); +} + +/// Prefix to timeline related data. +fn timeline_path(zttid: &ZTenantTimelineId) -> String { + format!( + "{}/{}/{}", + ZENITH_PREFIX, zttid.tenant_id, zttid.timeline_id + ) +} + +/// Key to per timeline per safekeeper data. +fn timeline_safekeeper_path(zttid: &ZTenantTimelineId, sk_id: ZNodeId) -> String { + format!("{}/safekeeper/{}", timeline_path(zttid), sk_id) +} + +/// Push once in a while data about all active timelines to the broker. +async fn push_loop(conf: SafeKeeperConf) -> Result<()> { + let mut client = Client::connect(conf.broker_endpoints.as_ref().unwrap(), None).await?; + + // Get and maintain lease to automatically delete obsolete data + let lease = client.lease_grant(LEASE_TTL_SEC, None).await?; + let (mut keeper, mut ka_stream) = client.lease_keep_alive(lease.id()).await?; + + let push_interval = Duration::from_millis(PUSH_INTERVAL_MSEC); + loop { + // Note: we lock runtime here and in timeline methods as GlobalTimelines + // is under plain mutex. That's ok, all this code is not performance + // sensitive and there is no risk of deadlock as we don't await while + // lock is held. 
+ let active_tlis = GlobalTimelines::get_active_timelines(); + for zttid in &active_tlis { + if let Ok(tli) = GlobalTimelines::get(&conf, *zttid, false) { + let sk_info = tli.get_public_info(); + let put_opts = PutOptions::new().with_lease(lease.id()); + client + .put( + timeline_safekeeper_path(zttid, conf.my_id), + serde_json::to_string(&sk_info)?, + Some(put_opts), + ) + .await + .context("failed to push safekeeper info")?; + } + } + // revive the lease + keeper + .keep_alive() + .await + .context("failed to send LeaseKeepAliveRequest")?; + ka_stream + .message() + .await + .context("failed to receive LeaseKeepAliveResponse")?; + sleep(push_interval).await; + } +} + +/// Subscribe and fetch all the interesting data from the broker. +async fn pull_loop(conf: SafeKeeperConf) -> Result<()> { + lazy_static! { + static ref TIMELINE_SAFEKEEPER_RE: Regex = + Regex::new(r"^zenith/([[:xdigit:]]+)/([[:xdigit:]]+)/safekeeper/([[:digit:]])$") + .unwrap(); + } + let mut client = Client::connect(conf.broker_endpoints.as_ref().unwrap(), None).await?; + loop { + let wo = WatchOptions::new().with_prefix(); + // TODO: subscribe only to my timelines + let (_, mut stream) = client.watch(ZENITH_PREFIX, Some(wo)).await?; + while let Some(resp) = stream.message().await? { + if resp.canceled() { + bail!("watch canceled"); + } + + for event in resp.events() { + if EventType::Put == event.event_type() { + if let Some(kv) = event.kv() { + if let Some(caps) = TIMELINE_SAFEKEEPER_RE.captures(kv.key_str()?) 
{ + let tenant_id = ZTenantId::from_str(caps.get(1).unwrap().as_str())?; + let timeline_id = ZTimelineId::from_str(caps.get(2).unwrap().as_str())?; + let zttid = ZTenantTimelineId::new(tenant_id, timeline_id); + let safekeeper_id = ZNodeId(caps.get(3).unwrap().as_str().parse()?); + let value_str = kv.value_str()?; + match serde_json::from_str::(value_str) { + Ok(safekeeper_info) => { + if let Ok(tli) = GlobalTimelines::get(&conf, zttid, false) { + tli.record_safekeeper_info(&safekeeper_info, safekeeper_id)? + } + } + Err(err) => warn!( + "failed to deserialize safekeeper info {}: {}", + value_str, err + ), + } + } + } + } + } + } + } +} + +async fn main_loop(conf: SafeKeeperConf) { + let mut ticker = tokio::time::interval(Duration::from_millis(RETRY_INTERVAL_MSEC)); + let mut push_handle: Option>> = None; + let mut pull_handle: Option>> = None; + // Selecting on JoinHandles requires some squats; is there a better way to + // reap tasks individually? + + // Handling failures in task itself won't catch panic and in Tokio, task's + // panic doesn't kill the whole executor, so it is better to do reaping + // here. + loop { + tokio::select! { + res = async { push_handle.as_mut().unwrap().await }, if push_handle.is_some() => { + // was it panic or normal error? + let err = match res { + Ok(res_internal) => res_internal.unwrap_err(), + Err(err_outer) => err_outer.into(), + }; + warn!("push task failed: {:?}", err); + push_handle = None; + }, + res = async { pull_handle.as_mut().unwrap().await }, if pull_handle.is_some() => { + // was it panic or normal error? 
+ let err = match res { + Ok(res_internal) => res_internal.unwrap_err(), + Err(err_outer) => err_outer.into(), + }; + warn!("pull task failed: {:?}", err); + pull_handle = None; + }, + _ = ticker.tick() => { + if push_handle.is_none() { + push_handle = Some(tokio::spawn(push_loop(conf.clone()))); + } + if pull_handle.is_none() { + pull_handle = Some(tokio::spawn(pull_loop(conf.clone()))); + } + } + } + } +} diff --git a/walkeeper/src/handler.rs b/walkeeper/src/handler.rs index ead6fab9fb..00d177da56 100644 --- a/walkeeper/src/handler.rs +++ b/walkeeper/src/handler.rs @@ -168,7 +168,14 @@ impl SafekeeperPostgresHandler { fn handle_identify_system(&mut self, pgb: &mut PostgresBackend) -> Result<()> { let start_pos = self.timeline.get().get_end_of_wal(); let lsn = start_pos.to_string(); - let sysid = self.timeline.get().get_info().server.system_id.to_string(); + let sysid = self + .timeline + .get() + .get_state() + .1 + .server + .system_id + .to_string(); let lsn_bytes = lsn.as_bytes(); let tli = PG_TLI.to_string(); let tli_bytes = tli.as_bytes(); diff --git a/walkeeper/src/http/routes.rs b/walkeeper/src/http/routes.rs index 74f7f4a735..06a0682c37 100644 --- a/walkeeper/src/http/routes.rs +++ b/walkeeper/src/http/routes.rs @@ -86,23 +86,24 @@ async fn timeline_status_handler(request: Request) -> Result Result<()> { fn send_proposer_elected(spg: &mut SafekeeperPostgresHandler, term: Term, lsn: Lsn) -> Result<()> { // add new term to existing history - let history = spg.timeline.get().get_info().acceptor_state.term_history; + let history = spg.timeline.get().get_state().1.acceptor_state.term_history; let history = history.up_to(lsn.checked_sub(1u64).unwrap()); let mut history_entries = history.0; history_entries.push(TermSwitchEntry { term, lsn }); @@ -142,7 +142,7 @@ fn append_logical_message( msg: &AppendLogicalMessage, ) -> Result { let wal_data = encode_logical_message(&msg.lm_prefix, &msg.lm_message); - let sk_state = spg.timeline.get().get_info(); + let sk_state 
= spg.timeline.get().get_state().1; let begin_lsn = msg.begin_lsn; let end_lsn = begin_lsn + wal_data.len() as u64; diff --git a/walkeeper/src/lib.rs b/walkeeper/src/lib.rs index dfd71e4de2..69423d42d8 100644 --- a/walkeeper/src/lib.rs +++ b/walkeeper/src/lib.rs @@ -1,9 +1,11 @@ // use std::path::PathBuf; use std::time::Duration; +use url::Url; use zenith_utils::zid::{ZNodeId, ZTenantTimelineId}; +pub mod broker; pub mod callmemaybe; pub mod control_file; pub mod control_file_upgrade; @@ -47,6 +49,7 @@ pub struct SafeKeeperConf { pub ttl: Option, pub recall_period: Duration, pub my_id: ZNodeId, + pub broker_endpoints: Option>, } impl SafeKeeperConf { @@ -71,6 +74,7 @@ impl Default for SafeKeeperConf { ttl: None, recall_period: defaults::DEFAULT_RECALL_PERIOD, my_id: ZNodeId(0), + broker_endpoints: None, } } } diff --git a/walkeeper/src/safekeeper.rs b/walkeeper/src/safekeeper.rs index 8300b32b42..307a67e5f3 100644 --- a/walkeeper/src/safekeeper.rs +++ b/walkeeper/src/safekeeper.rs @@ -193,7 +193,7 @@ pub struct SafeKeeperState { pub peer_horizon_lsn: Lsn, /// LSN of the oldest known checkpoint made by pageserver and successfully /// pushed to s3. We don't remove WAL beyond it. Persisted only for - /// informational purposes, we receive it from pageserver. + /// informational purposes, we receive it from pageserver (or broker). pub remote_consistent_lsn: Lsn, // Peers and their state as we remember it. Knowing peers themselves is // fundamental; but state is saved here only for informational purposes and @@ -203,11 +203,13 @@ pub struct SafeKeeperState { } #[derive(Debug, Clone)] -// In memory safekeeper state. Fields mirror ones in `SafeKeeperState`; they are -// not flushed yet. +// In memory safekeeper state. Fields mirror ones in `SafeKeeperState`; values +// are not flushed yet. 
pub struct SafekeeperMemState { pub commit_lsn: Lsn, + pub s3_wal_lsn: Lsn, // TODO: keep only persistent version pub peer_horizon_lsn: Lsn, + pub remote_consistent_lsn: Lsn, } impl SafeKeeperState { @@ -494,14 +496,13 @@ pub struct SafeKeeper { metrics: SafeKeeperMetrics, /// Maximum commit_lsn between all nodes, can be ahead of local flush_lsn. - global_commit_lsn: Lsn, + pub global_commit_lsn: Lsn, /// LSN since the proposer safekeeper currently talking to appends WAL; /// determines epoch switch point. epoch_start_lsn: Lsn, pub inmem: SafekeeperMemState, // in memory part - - pub s: SafeKeeperState, // persistent part + pub s: SafeKeeperState, // persistent part pub control_store: CTRL, pub wal_store: WAL, @@ -529,7 +530,9 @@ where epoch_start_lsn: Lsn(0), inmem: SafekeeperMemState { commit_lsn: state.commit_lsn, + s3_wal_lsn: state.s3_wal_lsn, peer_horizon_lsn: state.peer_horizon_lsn, + remote_consistent_lsn: state.remote_consistent_lsn, }, s: state, control_store, @@ -545,8 +548,7 @@ where .up_to(self.wal_store.flush_lsn()) } - #[cfg(test)] - fn get_epoch(&self) -> Term { + pub fn get_epoch(&self) -> Term { self.s.acceptor_state.get_epoch(self.wal_store.flush_lsn()) } @@ -697,7 +699,7 @@ where } /// Advance commit_lsn taking into account what we have locally - fn update_commit_lsn(&mut self) -> Result<()> { + pub fn update_commit_lsn(&mut self) -> Result<()> { let commit_lsn = min(self.global_commit_lsn, self.wal_store.flush_lsn()); assert!(commit_lsn >= self.inmem.commit_lsn); diff --git a/walkeeper/src/send_wal.rs b/walkeeper/src/send_wal.rs index 1febd71842..f12fb5cb4a 100644 --- a/walkeeper/src/send_wal.rs +++ b/walkeeper/src/send_wal.rs @@ -230,7 +230,7 @@ impl ReplicationConn { let mut wal_seg_size: usize; loop { - wal_seg_size = spg.timeline.get().get_info().server.wal_seg_size as usize; + wal_seg_size = spg.timeline.get().get_state().1.server.wal_seg_size as usize; if wal_seg_size == 0 { error!("Cannot start replication before connecting to 
wal_proposer"); sleep(Duration::from_secs(1)); diff --git a/walkeeper/src/timeline.rs b/walkeeper/src/timeline.rs index b53f2e086b..b10ab97cc1 100644 --- a/walkeeper/src/timeline.rs +++ b/walkeeper/src/timeline.rs @@ -17,12 +17,14 @@ use tracing::*; use zenith_utils::lsn::Lsn; use zenith_utils::zid::{ZNodeId, ZTenantTimelineId}; +use crate::broker::SafekeeperInfo; use crate::callmemaybe::{CallmeEvent, SubscriptionStateKey}; use crate::control_file; use crate::control_file::Storage as cf_storage; use crate::safekeeper::{ AcceptorProposerMessage, ProposerAcceptorMessage, SafeKeeper, SafeKeeperState, + SafekeeperMemState, }; use crate::send_wal::HotStandbyFeedback; use crate::wal_storage; @@ -349,6 +351,11 @@ impl Timeline { Ok(false) } + fn is_active(&self) -> bool { + let shared_state = self.mutex.lock().unwrap(); + shared_state.active + } + /// Timed wait for an LSN to be committed. /// /// Returns the last committed LSN, which will be at least @@ -410,8 +417,61 @@ impl Timeline { Ok(rmsg) } - pub fn get_info(&self) -> SafeKeeperState { - self.mutex.lock().unwrap().sk.s.clone() + pub fn get_state(&self) -> (SafekeeperMemState, SafeKeeperState) { + let shared_state = self.mutex.lock().unwrap(); + (shared_state.sk.inmem.clone(), shared_state.sk.s.clone()) + } + + /// Prepare public safekeeper info for reporting. 
+ pub fn get_public_info(&self) -> SafekeeperInfo { + let shared_state = self.mutex.lock().unwrap(); + SafekeeperInfo { + last_log_term: Some(shared_state.sk.get_epoch()), + flush_lsn: Some(shared_state.sk.wal_store.flush_lsn()), + // note: this value is not flushed to control file yet and can be lost + commit_lsn: Some(shared_state.sk.inmem.commit_lsn), + s3_wal_lsn: Some(shared_state.sk.inmem.s3_wal_lsn), + // TODO: rework feedbacks to avoid max here + remote_consistent_lsn: Some(max( + shared_state.get_replicas_state().remote_consistent_lsn, + shared_state.sk.inmem.remote_consistent_lsn, + )), + peer_horizon_lsn: Some(shared_state.sk.inmem.peer_horizon_lsn), + } + } + + /// Update timeline state with peer safekeeper data. + pub fn record_safekeeper_info(&self, sk_info: &SafekeeperInfo, _sk_id: ZNodeId) -> Result<()> { + let mut shared_state = self.mutex.lock().unwrap(); + // Note: the check is too restrictive, generally we can update local + // commit_lsn if our history matches (is part of) history of advanced + // commit_lsn provider. 
+ if let (Some(commit_lsn), Some(last_log_term)) = (sk_info.commit_lsn, sk_info.last_log_term) + { + if last_log_term == shared_state.sk.get_epoch() { + shared_state.sk.global_commit_lsn = + max(commit_lsn, shared_state.sk.global_commit_lsn); + shared_state.sk.update_commit_lsn()?; + let local_commit_lsn = min(commit_lsn, shared_state.sk.wal_store.flush_lsn()); + shared_state.sk.inmem.commit_lsn = + max(local_commit_lsn, shared_state.sk.inmem.commit_lsn); + } + } + if let Some(s3_wal_lsn) = sk_info.s3_wal_lsn { + shared_state.sk.inmem.s3_wal_lsn = max(s3_wal_lsn, shared_state.sk.inmem.s3_wal_lsn); + } + if let Some(remote_consistent_lsn) = sk_info.remote_consistent_lsn { + shared_state.sk.inmem.remote_consistent_lsn = max( + remote_consistent_lsn, + shared_state.sk.inmem.remote_consistent_lsn, + ); + } + if let Some(peer_horizon_lsn) = sk_info.peer_horizon_lsn { + shared_state.sk.inmem.peer_horizon_lsn = + max(peer_horizon_lsn, shared_state.sk.inmem.peer_horizon_lsn); + } + // TODO: sync control file + Ok(()) } pub fn add_replica(&self, state: ReplicaState) -> usize { @@ -495,7 +555,7 @@ impl GlobalTimelines { } /// Get a timeline with control file loaded from the global TIMELINES map. - /// If control file doesn't exist, bails out. + /// If control file doesn't exist and create=false, bails out. pub fn get( conf: &SafeKeeperConf, zttid: ZTenantTimelineId, @@ -537,4 +597,14 @@ impl GlobalTimelines { } } } + + /// Get ZTenantTimelineIDs of all active timelines. 
+ pub fn get_active_timelines() -> Vec { + let timelines = TIMELINES.lock().unwrap(); + timelines + .iter() + .filter(|&(_, tli)| tli.is_active()) + .map(|(zttid, _)| *zttid) + .collect() + } } From ce0243bc12db72dba8b196dbee71af2434d28ead Mon Sep 17 00:00:00 2001 From: Arthur Petukhovsky Date: Tue, 29 Mar 2022 18:54:24 +0300 Subject: [PATCH 040/296] Add metric for last_record_lsn (#1430) --- pageserver/src/layered_repository.rs | 21 +++++++++++++++++++-- 1 file changed, 19 insertions(+), 2 deletions(-) diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 56d14fd4e9..33f5694879 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -48,7 +48,9 @@ use crate::walredo::WalRedoManager; use crate::CheckpointConfig; use crate::{ZTenantId, ZTimelineId}; -use zenith_metrics::{register_histogram_vec, Histogram, HistogramVec}; +use zenith_metrics::{ + register_histogram_vec, register_int_gauge_vec, Histogram, HistogramVec, IntGauge, IntGaugeVec, +}; use zenith_utils::crashsafe_dir; use zenith_utils::lsn::{AtomicLsn, Lsn, RecordLsn}; use zenith_utils::seqwait::SeqWait; @@ -95,6 +97,15 @@ lazy_static! { .expect("failed to define a metric"); } +lazy_static! { + static ref LAST_RECORD_LSN: IntGaugeVec = register_int_gauge_vec!( + "pageserver_last_record_lsn", + "Last record LSN grouped by timeline", + &["tenant_id", "timeline_id"] + ) + .expect("failed to define a metric"); +} + /// Parts of the `.zenith/tenants//timelines/` directory prefix. pub const TIMELINES_SEGMENT_NAME: &str = "timelines"; @@ -745,11 +756,12 @@ pub struct LayeredTimeline { ancestor_timeline: Option, ancestor_lsn: Lsn, - // Metrics histograms + // Metrics reconstruct_time_histo: Histogram, flush_time_histo: Histogram, compact_time_histo: Histogram, create_images_time_histo: Histogram, + last_record_gauge: IntGauge, /// If `true`, will backup its files that appear after each checkpointing to the remote storage. 
upload_layers: AtomicBool, @@ -982,6 +994,9 @@ impl LayeredTimeline { &timelineid.to_string(), ]) .unwrap(); + let last_record_gauge = LAST_RECORD_LSN + .get_metric_with_label_values(&[&tenantid.to_string(), &timelineid.to_string()]) + .unwrap(); LayeredTimeline { conf, @@ -1007,6 +1022,7 @@ impl LayeredTimeline { flush_time_histo, compact_time_histo, create_images_time_histo, + last_record_gauge, upload_layers: AtomicBool::new(upload_layers), @@ -1325,6 +1341,7 @@ impl LayeredTimeline { fn finish_write(&self, new_lsn: Lsn) { assert!(new_lsn.is_aligned()); + self.last_record_gauge.set(new_lsn.0 as i64); self.last_record_lsn.advance(new_lsn); } From 277e41f4b73d91bfb96383eab1f42c4e5f7a0ad9 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Tue, 29 Mar 2022 13:48:26 +0300 Subject: [PATCH 041/296] Show s3 spans in logs and improve the log messages --- pageserver/src/remote_storage/storage_sync.rs | 8 ++++---- zenith_utils/src/http/endpoint.rs | 2 +- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/pageserver/src/remote_storage/storage_sync.rs b/pageserver/src/remote_storage/storage_sync.rs index cd6c40b46f..50a260491b 100644 --- a/pageserver/src/remote_storage/storage_sync.rs +++ b/pageserver/src/remote_storage/storage_sync.rs @@ -321,8 +321,8 @@ pub fn schedule_timeline_checkpoint_upload( tenant_id, timeline_id ) } else { - warn!( - "Could not send an upload task for tenant {}, timeline {}: the sync queue is not initialized", + debug!( + "Upload task for tenant {}, timeline {} sent", tenant_id, timeline_id ) } @@ -455,7 +455,7 @@ fn storage_sync_loop< max_concurrent_sync, max_sync_errors, ) - .instrument(debug_span!("storage_sync_loop_step")) => step, + .instrument(info_span!("storage_sync_loop_step")) => step, _ = thread_mgr::shutdown_watcher() => LoopStep::Shutdown, } }); @@ -528,7 +528,7 @@ async fn loop_step< let extra_step = match tokio::spawn( process_task(conf, Arc::clone(&remote_assets), task, max_sync_errors).instrument( - 
debug_span!("process_sync_task", sync_id = %sync_id, attempt, sync_name), + info_span!("process_sync_task", sync_id = %sync_id, attempt, sync_name), ), ) .await diff --git a/zenith_utils/src/http/endpoint.rs b/zenith_utils/src/http/endpoint.rs index 0be08f45e1..7669f18cd2 100644 --- a/zenith_utils/src/http/endpoint.rs +++ b/zenith_utils/src/http/endpoint.rs @@ -160,7 +160,7 @@ pub fn serve_thread_main( where S: Future + Send + Sync, { - info!("Starting a http endpoint at {}", listener.local_addr()?); + info!("Starting an HTTP endpoint at {}", listener.local_addr()?); // Create a Service from the router above to handle incoming requests. let service = RouterService::new(router_builder.build().map_err(|err| anyhow!(err))?).unwrap(); From 5c5629910f33bead0150821217c115db5ece5495 Mon Sep 17 00:00:00 2001 From: Anton Shyrabokau <97127717+antons-antons@users.noreply.github.com> Date: Tue, 29 Mar 2022 22:13:06 -0700 Subject: [PATCH 042/296] Add a test case for reading historic page versions (#1314) * Add a test case for reading historic page versions Test read_page_at_lsn returns correct results when compared to page inspect. Validate possibility of reading pages from dropped relation. Ensure functions read latest version when null lsn supplied. Check that functions do not poison buffer cache with stale page versions. 
--- Makefile | 5 + .../batch_others/test_read_validation.py | 183 ++++++++++++++++++ vendor/postgres | 2 +- 3 files changed, 189 insertions(+), 1 deletion(-) create mode 100644 test_runner/batch_others/test_read_validation.py diff --git a/Makefile b/Makefile index ef26ceee2d..d2a79661f2 100644 --- a/Makefile +++ b/Makefile @@ -78,6 +78,11 @@ postgres: postgres-configure \ $(MAKE) -C tmp_install/build/contrib/zenith install +@echo "Compiling contrib/zenith_test_utils" $(MAKE) -C tmp_install/build/contrib/zenith_test_utils install + +@echo "Compiling pg_buffercache" + $(MAKE) -C tmp_install/build/contrib/pg_buffercache install + +@echo "Compiling pageinspect" + $(MAKE) -C tmp_install/build/contrib/pageinspect install + .PHONY: postgres-clean postgres-clean: diff --git a/test_runner/batch_others/test_read_validation.py b/test_runner/batch_others/test_read_validation.py new file mode 100644 index 0000000000..ee41e6511c --- /dev/null +++ b/test_runner/batch_others/test_read_validation.py @@ -0,0 +1,183 @@ +from contextlib import closing + +from fixtures.zenith_fixtures import ZenithEnv +from fixtures.log_helper import log + +from psycopg2.errors import UndefinedTable +from psycopg2.errors import IoError + +pytest_plugins = ("fixtures.zenith_fixtures") + +extensions = ["pageinspect", "zenith_test_utils", "pg_buffercache"] + + +# +# Validation of reading different page versions +# +def test_read_validation(zenith_simple_env: ZenithEnv): + env = zenith_simple_env + env.zenith_cli.create_branch("test_read_validation", "empty") + + pg = env.postgres.create_start("test_read_validation") + log.info("postgres is running on 'test_read_validation' branch") + + with closing(pg.connect()) as con: + with con.cursor() as c: + + for e in extensions: + c.execute("create extension if not exists {};".format(e)) + + c.execute("create table foo (c int) with (autovacuum_enabled = false)") + c.execute("insert into foo values (1)") + + c.execute("select lsn, lower, upper from 
page_header(get_raw_page('foo', 'main', 0));") + first = c.fetchone() + + c.execute("select relfilenode from pg_class where relname = 'foo'") + relfilenode = c.fetchone()[0] + + c.execute("insert into foo values (2);") + c.execute("select lsn, lower, upper from page_header(get_raw_page('foo', 'main', 0));") + second = c.fetchone() + + assert first != second, "Failed to update page" + + log.info("Test table is populated, validating buffer cache") + + c.execute( + "select count(*) from pg_buffercache where relfilenode = {}".format(relfilenode)) + assert c.fetchone()[0] > 0, "No buffers cached for the test relation" + + c.execute( + "select reltablespace, reldatabase, relfilenode from pg_buffercache where relfilenode = {}" + .format(relfilenode)) + reln = c.fetchone() + + log.info("Clear buffer cache to ensure no stale pages are brought into the cache") + + c.execute("select clear_buffer_cache()") + + c.execute( + "select count(*) from pg_buffercache where relfilenode = {}".format(relfilenode)) + assert c.fetchone()[0] == 0, "Failed to clear buffer cache" + + log.info("Cache is clear, reading stale page version") + + c.execute( + "select lsn, lower, upper from page_header(get_raw_page_at_lsn('foo', 'main', 0, '{}'))" + .format(first[0])) + direct_first = c.fetchone() + assert first == direct_first, "Failed fetch page at historic lsn" + + c.execute( + "select count(*) from pg_buffercache where relfilenode = {}".format(relfilenode)) + assert c.fetchone()[0] == 0, "relation buffers detected after invalidation" + + log.info("Cache is clear, reading latest page version without cache") + + c.execute( + "select lsn, lower, upper from page_header(get_raw_page_at_lsn('foo', 'main', 0, NULL))" + ) + direct_latest = c.fetchone() + assert second == direct_latest, "Failed fetch page at latest lsn" + + c.execute( + "select count(*) from pg_buffercache where relfilenode = {}".format(relfilenode)) + assert c.fetchone()[0] == 0, "relation buffers detected after invalidation" + + 
log.info( + "Cache is clear, reading stale page version without cache using relation identifiers" + ) + + c.execute( + "select lsn, lower, upper from page_header(get_raw_page_at_lsn( {}, {}, {}, 0, 0, '{}' ))" + .format(reln[0], reln[1], reln[2], first[0])) + direct_first = c.fetchone() + assert first == direct_first, "Failed fetch page at historic lsn using oid" + + log.info( + "Cache is clear, reading latest page version without cache using relation identifiers" + ) + + c.execute( + "select lsn, lower, upper from page_header(get_raw_page_at_lsn( {}, {}, {}, 0, 0, NULL ))" + .format(reln[0], reln[1], reln[2])) + direct_latest = c.fetchone() + assert second == direct_latest, "Failed fetch page at latest lsn" + + c.execute('drop table foo;') + + log.info( + "Relation dropped, attempting reading stale page version without cache using relation identifiers" + ) + + c.execute( + "select lsn, lower, upper from page_header(get_raw_page_at_lsn( {}, {}, {}, 0, 0, '{}' ))" + .format(reln[0], reln[1], reln[2], first[0])) + direct_first = c.fetchone() + assert first == direct_first, "Failed fetch page at historic lsn using oid" + + log.info("Validation page inspect won't allow reading pages of dropped relations") + try: + c.execute("select * from page_header(get_raw_page('foo', 'main', 0));") + assert False, "query should have failed" + except UndefinedTable as e: + log.info("Caught an expected failure: {}".format(e)) + + +def test_read_validation_neg(zenith_simple_env: ZenithEnv): + env = zenith_simple_env + env.zenith_cli.create_branch("test_read_validation_neg", "empty") + + pg = env.postgres.create_start("test_read_validation_neg") + log.info("postgres is running on 'test_read_validation_neg' branch") + + with closing(pg.connect()) as con: + with con.cursor() as c: + + for e in extensions: + c.execute("create extension if not exists {};".format(e)) + + log.info("read a page of a missing relation") + try: + c.execute( + "select lsn, lower, upper from 
page_header(get_raw_page_at_lsn('Unknown', 'main', 0, '0/0'))" + ) + assert False, "query should have failed" + except UndefinedTable as e: + log.info("Caught an expected failure: {}".format(e)) + + c.execute("create table foo (c int) with (autovacuum_enabled = false)") + c.execute("insert into foo values (1)") + + log.info("read a page at lsn 0") + try: + c.execute( + "select lsn, lower, upper from page_header(get_raw_page_at_lsn('foo', 'main', 0, '0/0'))" + ) + assert False, "query should have failed" + except IoError as e: + log.info("Caught an expected failure: {}".format(e)) + + log.info("Pass NULL as an input") + expected = (None, None, None) + c.execute( + "select lsn, lower, upper from page_header(get_raw_page_at_lsn(NULL, 'main', 0, '0/0'))" + ) + assert c.fetchone() == expected, "Expected null output" + + c.execute( + "select lsn, lower, upper from page_header(get_raw_page_at_lsn('foo', NULL, 0, '0/0'))" + ) + assert c.fetchone() == expected, "Expected null output" + + c.execute( + "select lsn, lower, upper from page_header(get_raw_page_at_lsn('foo', 'main', NULL, '0/0'))" + ) + assert c.fetchone() == expected, "Expected null output" + + # This check is currently failing, reading beyond EOF is returning a 0-page + log.info("Read beyond EOF") + c.execute( + "select lsn, lower, upper from page_header(get_raw_page_at_lsn('foo', 'main', 1, NULL))" + ) diff --git a/vendor/postgres b/vendor/postgres index 19164aeacf..5c278ed0ac 160000 --- a/vendor/postgres +++ b/vendor/postgres @@ -1 +1 @@ -Subproject commit 19164aeacfd877ef75d67e70a71647f5d4c0cd2f +Subproject commit 5c278ed0aca5dea9340d9af4ad5f004d905ff1b7 From 860923420468a3882b71929f2dbe59673484ddca Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Tue, 29 Mar 2022 22:44:33 +0300 Subject: [PATCH 043/296] decrease the log level to debug because it is too noisy --- pageserver/src/layered_repository.rs | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/pageserver/src/layered_repository.rs 
b/pageserver/src/layered_repository.rs index 33f5694879..202a2ea756 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -1645,11 +1645,8 @@ impl LayeredTimeline { }; let num_deltas = layers.count_deltas(&img_range, &(img_lsn..lsn))?; - if num_deltas == 0 { - continue; - } - info!( + debug!( "range {}-{}, has {} deltas on this timeline", img_range.start, img_range.end, num_deltas ); From 649f324fe3b7dc5ff8b95cfaabf584753d53af16 Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Wed, 30 Mar 2022 13:46:18 +0300 Subject: [PATCH 044/296] make logging in basebackup more consistent --- pageserver/src/basebackup.rs | 1 + pageserver/src/page_service.rs | 3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/pageserver/src/basebackup.rs b/pageserver/src/basebackup.rs index e2a56f17d6..3caf27b9b3 100644 --- a/pageserver/src/basebackup.rs +++ b/pageserver/src/basebackup.rs @@ -65,6 +65,7 @@ impl<'a> Basebackup<'a> { // prev_lsn to Lsn(0) if we cannot provide the correct value. let (backup_prev, backup_lsn) = if let Some(req_lsn) = req_lsn { // Backup was requested at a particular LSN. Wait for it to arrive. 
+ info!("waiting for {}", req_lsn); timeline.tline.wait_lsn(req_lsn)?; // If the requested point is the end of the timeline, we can diff --git a/pageserver/src/page_service.rs b/pageserver/src/page_service.rs index 43e1ec275d..e7a4117b3e 100644 --- a/pageserver/src/page_service.rs +++ b/pageserver/src/page_service.rs @@ -514,6 +514,7 @@ impl PageServerHandler { ) -> anyhow::Result<()> { let span = info_span!("basebackup", timeline = %timelineid, tenant = %tenantid, lsn = field::Empty); let _enter = span.enter(); + info!("starting"); // check that the timeline exists let timeline = tenant_mgr::get_timeline_for_tenant_load(tenantid, timelineid) @@ -536,7 +537,7 @@ impl PageServerHandler { basebackup.send_tarball()?; } pgb.write_message(&BeMessage::CopyDone)?; - debug!("CopyDone sent!"); + info!("done"); Ok(()) } From 1aa8fe43cf9b769ec728b126a6a5c20b6f9d388f Mon Sep 17 00:00:00 2001 From: Konstantin Knizhnik Date: Thu, 31 Mar 2022 15:47:59 +0300 Subject: [PATCH 045/296] Fix race condition in image layer (#1440) * Fix race condition in image layer refer #1439 * Add explicit drop(inner) in layer load method * Add explicit drop(inner) in layer load method --- pageserver/src/layered_repository/image_layer.rs | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pageserver/src/layered_repository/image_layer.rs b/pageserver/src/layered_repository/image_layer.rs index ab51c36cae..ed9be913b9 100644 --- a/pageserver/src/layered_repository/image_layer.rs +++ b/pageserver/src/layered_repository/image_layer.rs @@ -267,7 +267,7 @@ impl ImageLayer { // a write lock. (Or rather, release and re-lock in write mode.) drop(inner); let mut inner = self.inner.write().unwrap(); - if inner.book.is_none() { + if !inner.loaded { self.load_inner(&mut inner)?; } else { // Another thread loaded it while we were not holding the lock. 
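Editorial note on the image layer fix in PATCH 045: the hunk relies on a re-check after trading a read lock for a write lock, since `std::sync::RwLock` cannot upgrade a guard in place. The sketch below is a minimal, self-contained illustration of that pattern, not the pageserver code. `LazyLayer`, `Inner`, `load_count`, and `demo_concurrent_loads` are hypothetical names introduced here; only the `loaded` flag and the drop-then-relock sequence mirror what the diff shows.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, RwLock};
use std::thread;

/// Hypothetical stand-in for a layer's lazily loaded state.
#[derive(Default)]
struct Inner {
    loaded: bool,
    index: Vec<u32>,
}

/// Hypothetical layer type; only the locking pattern mirrors the patch.
#[derive(Default)]
struct LazyLayer {
    inner: RwLock<Inner>,
    /// Counts how many times the expensive load actually ran.
    load_count: AtomicUsize,
}

impl LazyLayer {
    /// Returns the number of index entries, loading them on first use.
    fn load(&self) -> usize {
        // Fast path: a shared read lock is enough once the data is loaded.
        {
            let inner = self.inner.read().unwrap();
            if inner.loaded {
                return inner.index.len();
            }
        } // The read guard is dropped here; RwLock cannot upgrade in place.

        // Slow path: take the write lock, then re-check the flag, because
        // another thread may have loaded the data while no lock was held.
        let mut inner = self.inner.write().unwrap();
        if !inner.loaded {
            inner.index = vec![1, 2, 3];
            inner.loaded = true;
            self.load_count.fetch_add(1, Ordering::SeqCst);
        }
        inner.index.len()
    }
}

/// Races several threads on `load` and returns how many real loads ran.
fn demo_concurrent_loads() -> usize {
    let layer = Arc::new(LazyLayer::default());
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let layer = Arc::clone(&layer);
            thread::spawn(move || layer.load())
        })
        .collect();
    for handle in handles {
        assert_eq!(handle.join().unwrap(), 3);
    }
    layer.load_count.load(Ordering::SeqCst)
}
```

Calling `demo_concurrent_loads()` races eight threads on one layer. Because the flag is checked again under the write lock, the load body should run once regardless of which thread wins. The patch keys that second check on a dedicated `loaded` flag rather than on `book.is_none()`; the reason is not stated in the commit message beyond the reference to #1439, so any reading of why the flag is safer is an assumption.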
From a40b7cd516672a58d63de8015d848cd40ce33f08 Mon Sep 17 00:00:00 2001 From: Arthur Petukhovsky Date: Thu, 31 Mar 2022 17:00:09 +0300 Subject: [PATCH 046/296] Fix timeouts in test_restarts_under_load (#1436) * Enable backpressure in test_restarts_under_load * Remove hacks because #644 is fixed now * Adjust config in test_restarts_under_load --- .../batch_others/test_wal_acceptor_async.py | 30 +++++++++++-------- 1 file changed, 17 insertions(+), 13 deletions(-) diff --git a/test_runner/batch_others/test_wal_acceptor_async.py b/test_runner/batch_others/test_wal_acceptor_async.py index 31ace7eab3..aadafc76cf 100644 --- a/test_runner/batch_others/test_wal_acceptor_async.py +++ b/test_runner/batch_others/test_wal_acceptor_async.py @@ -1,9 +1,10 @@ import asyncio +import uuid import asyncpg import random import time -from fixtures.zenith_fixtures import ZenithEnvBuilder, Postgres, Safekeeper +from fixtures.zenith_fixtures import ZenithEnv, ZenithEnvBuilder, Postgres, Safekeeper from fixtures.log_helper import getLogger from fixtures.utils import lsn_from_hex, lsn_to_hex from typing import List @@ -30,10 +31,6 @@ class BankClient(object): await self.conn.execute('DROP TABLE IF EXISTS bank_log') await self.conn.execute('CREATE TABLE bank_log(from_uid int, to_uid int, amount int)') - # TODO: Remove when https://github.com/zenithdb/zenith/issues/644 is fixed - await self.conn.execute('ALTER TABLE bank_accs SET (autovacuum_enabled = false)') - await self.conn.execute('ALTER TABLE bank_log SET (autovacuum_enabled = false)') - async def check_invariant(self): row = await self.conn.fetchrow('SELECT sum(amount) AS sum FROM bank_accs') assert row['sum'] == self.n_accounts * self.init_amount @@ -139,12 +136,15 @@ async def wait_for_lsn(safekeeper: Safekeeper, # On each iteration 1 acceptor is stopped, and 2 others should allow # background workers execute transactions. In the end, state should remain # consistent. 
-async def run_restarts_under_load(pg: Postgres, acceptors: List[Safekeeper], n_workers=10): +async def run_restarts_under_load(env: ZenithEnv, + pg: Postgres, + acceptors: List[Safekeeper], + n_workers=10): n_accounts = 100 init_amount = 100000 max_transfer = 100 - period_time = 10 - iterations = 6 + period_time = 4 + iterations = 10 # Set timeout for this test at 5 minutes. It should be enough for test to complete # and less than CircleCI's no_output_timeout, taking into account that this timeout @@ -176,6 +176,11 @@ async def run_restarts_under_load(pg: Postgres, acceptors: List[Safekeeper], n_w flush_lsn = lsn_to_hex(flush_lsn) log.info(f'Postgres flush_lsn {flush_lsn}') + pageserver_lsn = env.pageserver.http_client().timeline_detail( + uuid.UUID(tenant_id), uuid.UUID((timeline_id)))["local"]["last_record_lsn"] + sk_ps_lag = lsn_from_hex(flush_lsn) - lsn_from_hex(pageserver_lsn) + log.info(f'Pageserver last_record_lsn={pageserver_lsn} lag={sk_ps_lag / 1024}kb') + # Wait until alive safekeepers catch up with postgres for idx, safekeeper in enumerate(acceptors): if idx != victim_idx: @@ -203,9 +208,8 @@ def test_restarts_under_load(zenith_env_builder: ZenithEnvBuilder): env = zenith_env_builder.init_start() env.zenith_cli.create_branch('test_wal_acceptors_restarts_under_load') - pg = env.postgres.create_start('test_wal_acceptors_restarts_under_load') + # Enable backpressure with 1MB maximal lag, because we don't want to block on `wait_for_lsn()` for too long + pg = env.postgres.create_start('test_wal_acceptors_restarts_under_load', + config_lines=['max_replication_write_lag=1MB']) - asyncio.run(run_restarts_under_load(pg, env.safekeepers)) - - # TODO: Remove when https://github.com/zenithdb/zenith/issues/644 is fixed - pg.stop() + asyncio.run(run_restarts_under_load(env, pg, env.safekeepers)) From 8745b022a985f6b758f9bddb9aae8038608df677 Mon Sep 17 00:00:00 2001 From: Anastasia Lubennikova Date: Thu, 31 Mar 2022 12:29:13 +0300 Subject: [PATCH 047/296] Extend 
LayerMap dump() function to also print open_layers and frozen_layers. Add verbose option to choose if we need to print all layer's keys or not. --- pageserver/src/bin/dump_layerfile.rs | 2 +- pageserver/src/layered_repository.rs | 8 ++++---- pageserver/src/layered_repository/delta_layer.rs | 6 +++++- pageserver/src/layered_repository/image_layer.rs | 6 +++++- .../src/layered_repository/inmemory_layer.rs | 6 +++++- pageserver/src/layered_repository/layer_map.rs | 16 ++++++++++++++-- .../src/layered_repository/storage_layer.rs | 2 +- 7 files changed, 35 insertions(+), 11 deletions(-) diff --git a/pageserver/src/bin/dump_layerfile.rs b/pageserver/src/bin/dump_layerfile.rs index b954ad5a15..27d41d50d9 100644 --- a/pageserver/src/bin/dump_layerfile.rs +++ b/pageserver/src/bin/dump_layerfile.rs @@ -25,7 +25,7 @@ fn main() -> Result<()> { // Basic initialization of things that don't change after startup virtual_file::init(10); - dump_layerfile_from_path(&path)?; + dump_layerfile_from_path(&path, true)?; Ok(()) } diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 202a2ea756..4a9d1c480d 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -2066,16 +2066,16 @@ impl<'a> TimelineWriter<'_> for LayeredTimelineWriter<'a> { } /// Dump contents of a layer file to stdout.
-pub fn dump_layerfile_from_path(path: &Path) -> Result<()> { +pub fn dump_layerfile_from_path(path: &Path, verbose: bool) -> Result<()> { let file = File::open(path)?; let book = Book::new(file)?; match book.magic() { crate::DELTA_FILE_MAGIC => { - DeltaLayer::new_for_path(path, &book)?.dump()?; + DeltaLayer::new_for_path(path, &book)?.dump(verbose)?; } crate::IMAGE_FILE_MAGIC => { - ImageLayer::new_for_path(path, &book)?.dump()?; + ImageLayer::new_for_path(path, &book)?.dump(verbose)?; } magic => bail!("unrecognized magic identifier: {:?}", magic), } @@ -2216,7 +2216,7 @@ pub mod tests { let mut test_key = Key::from_hex("012222222233333333444444445500000000").unwrap(); let mut blknum = 0; for _ in 0..50 { - for _ in 0..1000 { + for _ in 0..10000 { test_key.field6 = blknum; let writer = tline.writer(); writer.put( diff --git a/pageserver/src/layered_repository/delta_layer.rs b/pageserver/src/layered_repository/delta_layer.rs index bb5fa02be1..0e59eb7a3c 100644 --- a/pageserver/src/layered_repository/delta_layer.rs +++ b/pageserver/src/layered_repository/delta_layer.rs @@ -267,7 +267,7 @@ impl Layer for DeltaLayer { } /// debugging function to print out the contents of the layer - fn dump(&self) -> Result<()> { + fn dump(&self, verbose: bool) -> Result<()> { println!( "----- delta layer for ten {} tli {} keys {}-{} lsn {}-{} ----", self.tenantid, @@ -278,6 +278,10 @@ impl Layer for DeltaLayer { self.lsn_range.end ); + if !verbose { + return Ok(()); + } + let inner = self.load()?; let path = self.path(); diff --git a/pageserver/src/layered_repository/image_layer.rs b/pageserver/src/layered_repository/image_layer.rs index ed9be913b9..2b9bf4a717 100644 --- a/pageserver/src/layered_repository/image_layer.rs +++ b/pageserver/src/layered_repository/image_layer.rs @@ -212,12 +212,16 @@ impl Layer for ImageLayer { } /// debugging function to print out the contents of the layer - fn dump(&self) -> Result<()> { + fn dump(&self, verbose: bool) -> Result<()> { println!( "----- 
image layer for ten {} tli {} key {}-{} at {} ----", self.tenantid, self.timelineid, self.key_range.start, self.key_range.end, self.lsn ); + if !verbose { + return Ok(()); + } + let inner = self.load()?; let mut index_vec: Vec<(&Key, &BlobRef)> = inner.index.iter().collect(); diff --git a/pageserver/src/layered_repository/inmemory_layer.rs b/pageserver/src/layered_repository/inmemory_layer.rs index b5d98a4ca3..8670442a2c 100644 --- a/pageserver/src/layered_repository/inmemory_layer.rs +++ b/pageserver/src/layered_repository/inmemory_layer.rs @@ -190,7 +190,7 @@ impl Layer for InMemoryLayer { } /// debugging function to print out the contents of the layer - fn dump(&self) -> Result<()> { + fn dump(&self, verbose: bool) -> Result<()> { let inner = self.inner.read().unwrap(); let end_str = inner @@ -204,6 +204,10 @@ impl Layer for InMemoryLayer { self.timelineid, self.start_lsn, end_str, ); + if !verbose { + return Ok(()); + } + let mut buf = Vec::new(); for (key, vec_map) in inner.index.iter() { for (lsn, blob_ref) in vec_map.as_slice() { diff --git a/pageserver/src/layered_repository/layer_map.rs b/pageserver/src/layered_repository/layer_map.rs index c4929a6173..b6a3bd82aa 100644 --- a/pageserver/src/layered_repository/layer_map.rs +++ b/pageserver/src/layered_repository/layer_map.rs @@ -392,10 +392,22 @@ impl LayerMap { /// debugging function to print out the contents of the layer map #[allow(unused)] - pub fn dump(&self) -> Result<()> { + pub fn dump(&self, verbose: bool) -> Result<()> { println!("Begin dump LayerMap"); + + println!("open_layer:"); + if let Some(open_layer) = &self.open_layer { + open_layer.dump(verbose)?; + } + + println!("frozen_layers:"); + for frozen_layer in self.frozen_layers.iter() { + frozen_layer.dump(verbose)?; + } + + println!("historic_layers:"); for layer in self.historic_layers.iter() { - layer.dump()?; + layer.dump(verbose)?; } println!("End dump LayerMap"); Ok(()) diff --git a/pageserver/src/layered_repository/storage_layer.rs 
b/pageserver/src/layered_repository/storage_layer.rs index de34545980..dcf5b63908 100644 --- a/pageserver/src/layered_repository/storage_layer.rs +++ b/pageserver/src/layered_repository/storage_layer.rs @@ -143,7 +143,7 @@ pub trait Layer: Send + Sync { fn delete(&self) -> Result<()>; /// Dump summary of the contents of the layer to stdout - fn dump(&self) -> Result<()>; + fn dump(&self, verbose: bool) -> Result<()>; } // Flag indicating that this version initialize the page From f5da6523882e2be24a5e4252be7c5f963fbc4c7c Mon Sep 17 00:00:00 2001 From: Dmitry Ivanov Date: Thu, 31 Mar 2022 20:44:57 +0300 Subject: [PATCH 048/296] [proxy] Enable keepalives for all tcp connections (#1448) --- Cargo.lock | 16 ++++++++++++---- compute_tools/Cargo.toml | 2 +- pageserver/Cargo.toml | 2 +- proxy/Cargo.toml | 3 ++- proxy/src/compute.rs | 1 + proxy/src/proxy.rs | 24 ++++++++++++++++++++++++ walkeeper/Cargo.toml | 2 +- zenith_utils/Cargo.toml | 2 +- 8 files changed, 43 insertions(+), 9 deletions(-) diff --git a/Cargo.lock b/Cargo.lock index c770f576c9..bb27df7012 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -916,7 +916,7 @@ checksum = "418d37c8b1d42553c93648be529cb70f920d3baf8ef469b74b9638df426e0b4c" dependencies = [ "cfg-if", "libc", - "wasi", + "wasi 0.10.0+wasi-snapshot-preview1", ] [[package]] @@ -1371,14 +1371,15 @@ dependencies = [ [[package]] name = "mio" -version = "0.8.0" +version = "0.8.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ba272f85fa0b41fc91872be579b3bbe0f56b792aa361a380eb669469f68dafb2" +checksum = "52da4364ffb0e4fe33a9841a98a3f3014fb964045ce4f7a45a398243c8d6b0c9" dependencies = [ "libc", "log", "miow", "ntapi", + "wasi 0.11.0+wasi-snapshot-preview1", "winapi", ] @@ -1931,6 +1932,7 @@ dependencies = [ "scopeguard", "serde", "serde_json", + "socket2", "thiserror", "tokio", "tokio-postgres 0.7.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", @@ -2609,7 +2611,7 @@ source = 
"registry+https://github.com/rust-lang/crates.io-index" checksum = "6db9e6914ab8b1ae1c260a4ae7a49b6c5611b40328a735b21862567685e73255" dependencies = [ "libc", - "wasi", + "wasi 0.10.0+wasi-snapshot-preview1", "winapi", ] @@ -3113,6 +3115,12 @@ version = "0.10.0+wasi-snapshot-preview1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "1a143597ca7c7793eff794def352d41792a93c481eb1042423ff7ff72ba2c31f" +[[package]] +name = "wasi" +version = "0.11.0+wasi-snapshot-preview1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9c8d87e72b64a3b4db28d11ce29237c246188f4f51057d65a7eab63b7987e423" + [[package]] name = "wasm-bindgen" version = "0.2.79" diff --git a/compute_tools/Cargo.toml b/compute_tools/Cargo.toml index 4ecf7f6499..56047093f1 100644 --- a/compute_tools/Cargo.toml +++ b/compute_tools/Cargo.toml @@ -16,5 +16,5 @@ regex = "1" serde = { version = "1.0", features = ["derive"] } serde_json = "1" tar = "0.4" -tokio = { version = "1", features = ["macros", "rt", "rt-multi-thread"] } +tokio = { version = "1.17", features = ["macros", "rt", "rt-multi-thread"] } workspace_hack = { version = "0.1", path = "../workspace_hack" } diff --git a/pageserver/Cargo.toml b/pageserver/Cargo.toml index 14eae31da8..6a77af1691 100644 --- a/pageserver/Cargo.toml +++ b/pageserver/Cargo.toml @@ -17,7 +17,7 @@ lazy_static = "1.4.0" log = "0.4.14" clap = "3.0" daemonize = "0.4.1" -tokio = { version = "1.11", features = ["process", "sync", "macros", "fs", "rt", "io-util", "time"] } +tokio = { version = "1.17", features = ["process", "sync", "macros", "fs", "rt", "io-util", "time"] } postgres-types = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } postgres-protocol = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" 
} diff --git a/proxy/Cargo.toml b/proxy/Cargo.toml index 72c394dad4..dc20695884 100644 --- a/proxy/Cargo.toml +++ b/proxy/Cargo.toml @@ -22,8 +22,9 @@ rustls = "0.19.1" scopeguard = "1.1.0" serde = "1" serde_json = "1" +socket2 = "0.4.4" thiserror = "1.0" -tokio = { version = "1.11", features = ["macros"] } +tokio = { version = "1.17", features = ["macros"] } tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } tokio-rustls = "0.22.0" diff --git a/proxy/src/compute.rs b/proxy/src/compute.rs index 64ce5d0a5a..7c0ab965a0 100644 --- a/proxy/src/compute.rs +++ b/proxy/src/compute.rs @@ -41,6 +41,7 @@ impl DatabaseInfo { let host_port = format!("{}:{}", self.host, self.port); let socket = TcpStream::connect(host_port).await?; let socket_addr = socket.peer_addr()?; + socket2::SockRef::from(&socket).set_keepalive(true)?; Ok((socket_addr, socket)) } diff --git a/proxy/src/proxy.rs b/proxy/src/proxy.rs index 3c7f59bc26..81581b5cf1 100644 --- a/proxy/src/proxy.rs +++ b/proxy/src/proxy.rs @@ -50,6 +50,10 @@ pub async fn thread_main( println!("proxy has shut down"); } + // When set for the server socket, the keepalive setting + // will be inherited by all accepted client sockets. 
+ socket2::SockRef::from(&listener).set_keepalive(true)?; + let cancel_map = Arc::new(CancelMap::default()); loop { let (socket, peer_addr) = listener.accept().await?; @@ -367,4 +371,24 @@ mod tests { Ok(()) } + + #[tokio::test] + async fn keepalive_is_inherited() -> anyhow::Result<()> { + use tokio::net::{TcpListener, TcpStream}; + + let listener = TcpListener::bind("127.0.0.1:0").await?; + let port = listener.local_addr()?.port(); + socket2::SockRef::from(&listener).set_keepalive(true)?; + + let t = tokio::spawn(async move { + let (client, _) = listener.accept().await?; + let keepalive = socket2::SockRef::from(&client).keepalive()?; + anyhow::Ok(keepalive) + }); + + let _ = TcpStream::connect(("127.0.0.1", port)).await?; + assert!(t.await??, "keepalive should be inherited"); + + Ok(()) + } } diff --git a/walkeeper/Cargo.toml b/walkeeper/Cargo.toml index e8523d27d1..ddce78e737 100644 --- a/walkeeper/Cargo.toml +++ b/walkeeper/Cargo.toml @@ -15,7 +15,7 @@ tracing = "0.1.27" clap = "3.0" daemonize = "0.4.1" rust-s3 = { version = "0.28", default-features = false, features = ["no-verify-ssl", "tokio-rustls-tls"] } -tokio = { version = "1.11", features = ["macros"] } +tokio = { version = "1.17", features = ["macros"] } postgres-protocol = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } anyhow = "1.0" diff --git a/zenith_utils/Cargo.toml b/zenith_utils/Cargo.toml index e8ad0e627f..cf864b3a54 100644 --- a/zenith_utils/Cargo.toml +++ b/zenith_utils/Cargo.toml @@ -16,7 +16,7 @@ routerify = "3" serde = { version = "1.0", features = ["derive"] } serde_json = "1" thiserror = "1.0" -tokio = { version = "1.11", features = ["macros"]} +tokio = { version = "1.17", features = ["macros"]} tracing = "0.1" tracing-subscriber = { version = "0.3", features = ["env-filter"] } nix = "0.23.0" From 
af712798e75589a5186fe3c78fa683b901fe2566 Mon Sep 17 00:00:00 2001 From: bojanserafimov Date: Fri, 1 Apr 2022 15:47:23 -0400 Subject: [PATCH 049/296] Fix pageserver readme formatting I put the diagram in a fixed-width block, since it wasn't rendering correctly on github. --- pageserver/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/pageserver/README.md b/pageserver/README.md index 69080a16cc..1fd627785c 100644 --- a/pageserver/README.md +++ b/pageserver/README.md @@ -13,7 +13,7 @@ keeps track of WAL records which are not synced to S3 yet. The Page Server consists of multiple threads that operate on a shared repository of page versions: - +``` | WAL V +--------------+ @@ -46,7 +46,7 @@ Legend: ---> Data flow <--- - +``` Page Service ------------ From 43c16c514556bb0ccbeb3b0458f46d39866005aa Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Fri, 1 Apr 2022 20:48:03 +0300 Subject: [PATCH 050/296] Don't log ZIds in the timeline load span --- pageserver/src/layered_repository.rs | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 4a9d1c480d..a352f31169 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -468,18 +468,20 @@ impl LayeredRepository { match timelines.get(&timelineid) { Some(entry) => match entry { LayeredTimelineEntry::Loaded(local_timeline) => { - trace!("timeline {} found loaded", &timelineid); + debug!("timeline {} found loaded into memory", &timelineid); return Ok(Some(Arc::clone(local_timeline))); } - LayeredTimelineEntry::Unloaded { .. } => { - trace!("timeline {} found unloaded", &timelineid) - } + LayeredTimelineEntry::Unloaded { .. 
} => {} }, None => { - trace!("timeline {} not found", &timelineid); + debug!("timeline {} not found", &timelineid); return Ok(None); } }; + debug!( + "timeline {} found on a local disk, but not loaded into the memory, loading", + &timelineid + ); let timeline = self.load_local_timeline(timelineid, timelines)?; let was_loaded = timelines.insert( timelineid, @@ -516,9 +518,7 @@ impl LayeredRepository { .context("cannot load ancestor timeline")? .flatten() .map(LayeredTimelineEntry::Loaded); - let _enter = - info_span!("loading timeline", timeline = %timelineid, tenant = %self.tenantid) - .entered(); + let _enter = info_span!("loading local timeline").entered(); let timeline = LayeredTimeline::new( self.conf, metadata, From 9e5423c86724cdd90cefd81791214870138b6983 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Fri, 1 Apr 2022 21:46:54 +0300 Subject: [PATCH 051/296] Assert in a more informative way --- postgres_ffi/src/xlog_utils.rs | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/postgres_ffi/src/xlog_utils.rs b/postgres_ffi/src/xlog_utils.rs index d2b2b5c122..89fdbbf7ac 100644 --- a/postgres_ffi/src/xlog_utils.rs +++ b/postgres_ffi/src/xlog_utils.rs @@ -495,7 +495,13 @@ mod tests { .env("DYLD_LIBRARY_PATH", &lib_path) .output() .unwrap(); - assert!(initdb_output.status.success()); + assert!( + initdb_output.status.success(), + "initdb failed. Status: '{}', stdout: '{}', stderr: '{}'", + initdb_output.status, + String::from_utf8_lossy(&initdb_output.stdout), + String::from_utf8_lossy(&initdb_output.stderr), + ); // 2. 
Pick WAL generated by initdb let wal_dir = data_dir.join("pg_wal"); From 4c9447589a837266fb943cc0f32124191891cd9a Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Fri, 1 Apr 2022 23:23:13 +0300 Subject: [PATCH 052/296] Place an info span into gc loop step --- pageserver/src/layered_repository.rs | 2 ++ 1 file changed, 2 insertions(+) diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index a352f31169..f07a2639d3 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -630,6 +630,8 @@ impl LayeredRepository { horizon: u64, checkpoint_before_gc: bool, ) -> Result { + let _span_guard = + info_span!("gc iteration", tenant = %self.tenantid, timeline = ?target_timelineid); let mut totals: GcResult = Default::default(); let now = Instant::now(); From 1f0b406b633aa624f89d1632affabd03ab622171 Mon Sep 17 00:00:00 2001 From: Konstantin Knizhnik Date: Thu, 31 Mar 2022 16:28:07 +0300 Subject: [PATCH 053/296] Perform repartitioning in compaction thread refer #1441 --- pageserver/src/layered_repository.rs | 5 +++++ pageserver/src/pgdatadir_mapping.rs | 21 +++++++++++---------- pageserver/src/timelines.rs | 2 +- 3 files changed, 17 insertions(+), 11 deletions(-) diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index f07a2639d3..a63f157552 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -41,6 +41,7 @@ use crate::repository::{ GcResult, Repository, RepositoryTimeline, Timeline, TimelineSyncStatusUpdate, TimelineWriter, }; use crate::repository::{Key, Value}; +use crate::tenant_mgr; use crate::thread_mgr; use crate::virtual_file::VirtualFile; use crate::walreceiver::IS_WAL_RECEIVER; @@ -1588,6 +1589,10 @@ impl LayeredTimeline { let target_file_size = self.conf.checkpoint_distance; + // Define partitioning schema if needed + tenant_mgr::get_timeline_for_tenant_load(self.tenantid, self.timelineid)? 
+ .repartition(self.get_last_record_lsn())?; + // 1. The partitioning was already done by the code in // pgdatadir_mapping.rs. We just use it here. let partitioning_guard = self.partitioning.read().unwrap(); diff --git a/pageserver/src/pgdatadir_mapping.rs b/pageserver/src/pgdatadir_mapping.rs index 7b0fc606de..75ace4ecee 100644 --- a/pageserver/src/pgdatadir_mapping.rs +++ b/pageserver/src/pgdatadir_mapping.rs @@ -388,6 +388,17 @@ impl DatadirTimeline { Ok(result.to_keyspace()) } + + pub fn repartition(&self, lsn: Lsn) -> Result<()> { + let last_partitioning = self.last_partitioning.load(); + if last_partitioning == Lsn(0) || lsn.0 - last_partitioning.0 > self.repartition_threshold { + let keyspace = self.collect_keyspace(lsn)?; + let partitioning = keyspace.partition(TARGET_FILE_SIZE_BYTES); + self.tline.hint_partitioning(partitioning, lsn)?; + self.last_partitioning.store(lsn); + } + Ok(()) + } } /// DatadirModification represents an operation to ingest an atomic set of @@ -767,7 +778,6 @@ impl<'a, R: Repository> DatadirModification<'a, R> { pub fn commit(self) -> Result<()> { let writer = self.tline.tline.writer(); - let last_partitioning = self.tline.last_partitioning.load(); let pending_nblocks = self.pending_nblocks; for (key, value) in self.pending_updates { @@ -779,15 +789,6 @@ impl<'a, R: Repository> DatadirModification<'a, R> { writer.finish_write(self.lsn); - if last_partitioning == Lsn(0) - || self.lsn.0 - last_partitioning.0 > self.tline.repartition_threshold - { - let keyspace = self.tline.collect_keyspace(self.lsn)?; - let partitioning = keyspace.partition(TARGET_FILE_SIZE_BYTES); - self.tline.tline.hint_partitioning(partitioning, self.lsn)?; - self.tline.last_partitioning.store(self.lsn); - } - if pending_nblocks != 0 { self.tline.current_logical_size.fetch_add( pending_nblocks * pg_constants::BLCKSZ as isize, diff --git a/pageserver/src/timelines.rs b/pageserver/src/timelines.rs index 105c3c869f..ae713c260c 100644 --- a/pageserver/src/timelines.rs 
+++ b/pageserver/src/timelines.rs
@@ -286,7 +286,7 @@ fn bootstrap_timeline(
     let timeline = repo.create_empty_timeline(tli, lsn)?;
     let mut page_tline: DatadirTimeline = DatadirTimeline::new(timeline, u64::MAX);
     import_datadir::import_timeline_from_postgres_datadir(&pgdata_path, &mut page_tline, lsn)?;
-    page_tline.tline.checkpoint(CheckpointConfig::Forced)?;
+    page_tline.tline.checkpoint(CheckpointConfig::Flush)?;
 
     println!(
         "created initial timeline {} timeline.lsn {}",
From 92031d376af9c8d80e77ee33afdb9b7868281f9c Mon Sep 17 00:00:00 2001
From: Konstantin Knizhnik
Date: Thu, 31 Mar 2022 16:44:01 +0300
Subject: [PATCH 054/296] Fix unit tests

---
 pageserver/src/layered_repository.rs | 6 ++++--
 pageserver/src/timelines.rs          | 2 +-
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs
index a63f157552..eb4f49ddd1 100644
--- a/pageserver/src/layered_repository.rs
+++ b/pageserver/src/layered_repository.rs
@@ -1590,8 +1590,10 @@ impl LayeredTimeline {
         let target_file_size = self.conf.checkpoint_distance;
 
         // Define partitioning schema if needed
-        tenant_mgr::get_timeline_for_tenant_load(self.tenantid, self.timelineid)?
-            .repartition(self.get_last_record_lsn())?;
+        if let Ok(pgdir) = tenant_mgr::get_timeline_for_tenant_load(self.tenantid, self.timelineid)
+        {
+            pgdir.repartition(self.get_last_record_lsn())?;
+        }
 
         // 1. The partitioning was already done by the code in
         // pgdatadir_mapping.rs. We just use it here.
diff --git a/pageserver/src/timelines.rs b/pageserver/src/timelines.rs
index ae713c260c..105c3c869f 100644
--- a/pageserver/src/timelines.rs
+++ b/pageserver/src/timelines.rs
@@ -286,7 +286,7 @@ fn bootstrap_timeline(
     let timeline = repo.create_empty_timeline(tli, lsn)?;
     let mut page_tline: DatadirTimeline = DatadirTimeline::new(timeline, u64::MAX);
     import_datadir::import_timeline_from_postgres_datadir(&pgdata_path, &mut page_tline, lsn)?;
-    page_tline.tline.checkpoint(CheckpointConfig::Flush)?;
+    page_tline.tline.checkpoint(CheckpointConfig::Forced)?;
 
     println!(
         "created initial timeline {} timeline.lsn {}",
From 232fe14297c6f12b6ad83b723ab6dcba09febc5e Mon Sep 17 00:00:00 2001
From: Konstantin Knizhnik
Date: Thu, 31 Mar 2022 20:23:56 +0300
Subject: [PATCH 055/296] Refactor partitioning

---
 pageserver/src/layered_repository.rs | 29 +++-------------------------
 pageserver/src/pgdatadir_mapping.rs  | 25 +++++++++++++-----------
 pageserver/src/repository.rs         | 14 --------------
 3 files changed, 17 insertions(+), 51 deletions(-)

diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs
index eb4f49ddd1..5ab6097960 100644
--- a/pageserver/src/layered_repository.rs
+++ b/pageserver/src/layered_repository.rs
@@ -34,7 +34,7 @@ use std::time::Instant;
 
 use self::metadata::{metadata_path, TimelineMetadata, METADATA_FILE_NAME};
 use crate::config::PageServerConf;
-use crate::keyspace::{KeyPartitioning, KeySpace};
+use crate::keyspace::KeySpace;
 use crate::page_cache;
 use crate::remote_storage::{schedule_timeline_checkpoint_upload, RemoteIndex};
 use crate::repository::{
@@ -792,8 +792,6 @@ pub struct LayeredTimeline {
     // garbage collecting data that is still needed by the child timelines.
     gc_info: RwLock,
 
-    partitioning: RwLock>,
-
     // It may change across major versions so for simplicity
     // keep it after running initdb for a timeline.
     // It is needed in checks when we want to error on some operations
@@ -943,14 +941,6 @@ impl Timeline for LayeredTimeline {
         self.disk_consistent_lsn.load()
     }
 
-    fn hint_partitioning(&self, partitioning: KeyPartitioning, lsn: Lsn) -> Result<()> {
-        self.partitioning
-            .write()
-            .unwrap()
-            .replace((partitioning, lsn));
-        Ok(())
-    }
-
     fn writer<'a>(&'a self) -> Box {
         Box::new(LayeredTimelineWriter {
             tl: self,
@@ -1037,7 +1027,6 @@ impl LayeredTimeline {
                 retain_lsns: Vec::new(),
                 cutoff: Lsn(0),
             }),
-            partitioning: RwLock::new(None),
 
             latest_gc_cutoff_lsn: RwLock::new(metadata.latest_gc_cutoff_lsn()),
             initdb_lsn: metadata.initdb_lsn(),
@@ -1592,23 +1581,11 @@ impl LayeredTimeline {
         // Define partitioning schema if needed
         if let Ok(pgdir) = tenant_mgr::get_timeline_for_tenant_load(self.tenantid, self.timelineid)
         {
-            pgdir.repartition(self.get_last_record_lsn())?;
-        }
-
-        // 1. The partitioning was already done by the code in
-        // pgdatadir_mapping.rs. We just use it here.
-        let partitioning_guard = self.partitioning.read().unwrap();
-        if let Some((partitioning, lsn)) = partitioning_guard.as_ref() {
+            let (partitioning, lsn) = pgdir.repartition(self.get_last_record_lsn())?;
             let timer = self.create_images_time_histo.start_timer();
-            // Make a copy of the partitioning, so that we can release
-            // the lock. Otherwise we could block the WAL receiver.
-            let lsn = *lsn;
-            let parts = partitioning.parts.clone();
-            drop(partitioning_guard);
-
             // 2. Create new image layers for partitions that have been modified
             // "enough".
-            for part in parts.iter() {
+            for part in partitioning.parts.iter() {
                 if self.time_for_new_image_layer(part, lsn, 3)? {
                     self.create_image_layer(part, lsn)?;
                 }
diff --git a/pageserver/src/pgdatadir_mapping.rs b/pageserver/src/pgdatadir_mapping.rs
index 75ace4ecee..fbd1b56180 100644
--- a/pageserver/src/pgdatadir_mapping.rs
+++ b/pageserver/src/pgdatadir_mapping.rs
@@ -6,7 +6,7 @@
 //!
walingest.rs handles a few things like implicit relation creation and extension.
 //! Clarify that)
 //!
-use crate::keyspace::{KeySpace, KeySpaceAccum, TARGET_FILE_SIZE_BYTES};
+use crate::keyspace::{KeyPartitioning, KeySpace, KeySpaceAccum, TARGET_FILE_SIZE_BYTES};
 use crate::reltag::{RelTag, SlruKind};
 use crate::repository::*;
 use crate::repository::{Repository, Timeline};
@@ -18,10 +18,9 @@ use serde::{Deserialize, Serialize};
 use std::collections::{HashMap, HashSet};
 use std::ops::Range;
 use std::sync::atomic::{AtomicIsize, Ordering};
-use std::sync::{Arc, RwLockReadGuard};
+use std::sync::{Arc, RwLock, RwLockReadGuard};
 use tracing::{debug, error, trace, warn};
 use zenith_utils::bin_ser::BeSer;
-use zenith_utils::lsn::AtomicLsn;
 use zenith_utils::lsn::Lsn;
 
 /// Block number within a relation or SLRU. This matches PostgreSQL's BlockNumber type.
@@ -38,7 +37,7 @@ where
     pub tline: Arc,
 
     /// When did we last calculate the partitioning?
-    last_partitioning: AtomicLsn,
+    partitioning: RwLock<(KeyPartitioning, Lsn)>,
 
     /// Configuration: how often should the partitioning be recalculated.
     repartition_threshold: u64,
@@ -51,7 +50,7 @@ impl DatadirTimeline {
     pub fn new(tline: Arc, repartition_threshold: u64) -> Self {
         DatadirTimeline {
             tline,
-            last_partitioning: AtomicLsn::new(0),
+            partitioning: RwLock::new((KeyPartitioning::new(), Lsn(0))),
             current_logical_size: AtomicIsize::new(0),
             repartition_threshold,
         }
@@ -389,15 +388,19 @@ impl DatadirTimeline {
         Ok(result.to_keyspace())
     }
 
-    pub fn repartition(&self, lsn: Lsn) -> Result<()> {
-        let last_partitioning = self.last_partitioning.load();
-        if last_partitioning == Lsn(0) || lsn.0 - last_partitioning.0 > self.repartition_threshold {
+    pub fn repartition(&self, lsn: Lsn) -> Result<(KeyPartitioning, Lsn)> {
+        let partitioning_guard = self.partitioning.read().unwrap();
+        if partitioning_guard.1 == Lsn(0)
+            || lsn.0 - partitioning_guard.1 .0 > self.repartition_threshold
+        {
             let keyspace = self.collect_keyspace(lsn)?;
+            drop(partitioning_guard);
+            let mut partitioning_guard = self.partitioning.write().unwrap();
             let partitioning = keyspace.partition(TARGET_FILE_SIZE_BYTES);
-            self.tline.hint_partitioning(partitioning, lsn)?;
-            self.last_partitioning.store(lsn);
+            *partitioning_guard = (partitioning, lsn);
+            return Ok((partitioning_guard.0.clone(), lsn));
         }
-        Ok(())
+        Ok((partitioning_guard.0.clone(), partitioning_guard.1))
     }
 }
 
diff --git a/pageserver/src/repository.rs b/pageserver/src/repository.rs
index b960e037be..7e998b0ebe 100644
--- a/pageserver/src/repository.rs
+++ b/pageserver/src/repository.rs
@@ -1,4 +1,3 @@
-use crate::keyspace::KeyPartitioning;
 use crate::layered_repository::metadata::TimelineMetadata;
 use crate::remote_storage::RemoteIndex;
 use crate::walrecord::ZenithWalRecord;
@@ -372,19 +371,6 @@ pub trait Timeline: Send + Sync {
     /// know anything about them here in the repository.
     fn checkpoint(&self, cconf: CheckpointConfig) -> Result<()>;
 
-    ///
-    /// Tell the implementation how the keyspace should be partitioned.
-    ///
-    /// FIXME: This is quite a hack.
The code in pgdatadir_mapping.rs knows
-    /// which keys exist and what is the logical grouping of them. That's why
-    /// the code there (and in keyspace.rs) decides the partitioning, not the
-    /// layered_repository.rs implementation. That's a layering violation:
-    /// the Repository implementation ought to be responsible for the physical
-    /// layout, but currently it's more convenient to do it in pgdatadir_mapping.rs
-    /// rather than in layered_repository.rs.
-    ///
-    fn hint_partitioning(&self, partitioning: KeyPartitioning, lsn: Lsn) -> Result<()>;
-
     ///
     /// Check that it is valid to request operations with that lsn.
     fn check_lsn_is_in_scope(
From bef9b837f1171b9040dc959189796d835c1f8f9c Mon Sep 17 00:00:00 2001
From: Konstantin Knizhnik
Date: Fri, 1 Apr 2022 12:09:35 +0300
Subject: [PATCH 056/296] Replace rwlock with mutex in repartition

---
 pageserver/src/layered_repository.rs | 12 ------------
 pageserver/src/pgdatadir_mapping.rs  | 10 ++++------
 2 files changed, 4 insertions(+), 18 deletions(-)

diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs
index 5ab6097960..60b0e921ce 100644
--- a/pageserver/src/layered_repository.rs
+++ b/pageserver/src/layered_repository.rs
@@ -2220,12 +2220,6 @@ pub mod tests {
         }
 
         let cutoff = tline.get_last_record_lsn();
-        let parts = keyspace
-            .clone()
-            .to_keyspace()
-            .partition(TEST_FILE_SIZE as u64);
-        tline.hint_partitioning(parts.clone(), lsn)?;
-
         tline.update_gc_info(Vec::new(), cutoff);
         tline.checkpoint(CheckpointConfig::Forced)?;
         tline.compact()?;
@@ -2268,9 +2262,6 @@ pub mod tests {
             keyspace.add_key(test_key);
         }
 
-        let parts = keyspace.to_keyspace().partition(TEST_FILE_SIZE as u64);
-        tline.hint_partitioning(parts, lsn)?;
-
         for _ in 0..50 {
             for _ in 0..NUM_KEYS {
                 lsn = Lsn(lsn.0 + 0x10);
@@ -2342,9 +2333,6 @@ pub mod tests {
             keyspace.add_key(test_key);
         }
 
-        let parts = keyspace.to_keyspace().partition(TEST_FILE_SIZE as u64);
-        tline.hint_partitioning(parts, lsn)?;
-
         let mut tline_id =
TIMELINE_ID;
         for _ in 0..50 {
             let new_tline_id = ZTimelineId::generate();
diff --git a/pageserver/src/pgdatadir_mapping.rs b/pageserver/src/pgdatadir_mapping.rs
index fbd1b56180..2e0040f0c0 100644
--- a/pageserver/src/pgdatadir_mapping.rs
+++ b/pageserver/src/pgdatadir_mapping.rs
@@ -18,7 +18,7 @@ use serde::{Deserialize, Serialize};
 use std::collections::{HashMap, HashSet};
 use std::ops::Range;
 use std::sync::atomic::{AtomicIsize, Ordering};
-use std::sync::{Arc, RwLock, RwLockReadGuard};
+use std::sync::{Arc, Mutex, RwLockReadGuard};
 use tracing::{debug, error, trace, warn};
 use zenith_utils::bin_ser::BeSer;
 use zenith_utils::lsn::Lsn;
@@ -37,7 +37,7 @@ where
     pub tline: Arc,
 
     /// When did we last calculate the partitioning?
-    partitioning: RwLock<(KeyPartitioning, Lsn)>,
+    partitioning: Mutex<(KeyPartitioning, Lsn)>,
 
     /// Configuration: how often should the partitioning be recalculated.
     repartition_threshold: u64,
@@ -50,7 +50,7 @@ impl DatadirTimeline {
     pub fn new(tline: Arc, repartition_threshold: u64) -> Self {
         DatadirTimeline {
             tline,
-            partitioning: RwLock::new((KeyPartitioning::new(), Lsn(0))),
+            partitioning: Mutex::new((KeyPartitioning::new(), Lsn(0))),
             current_logical_size: AtomicIsize::new(0),
             repartition_threshold,
         }
@@ -389,13 +389,11 @@ impl DatadirTimeline {
     }
 
     pub fn repartition(&self, lsn: Lsn) -> Result<(KeyPartitioning, Lsn)> {
-        let partitioning_guard = self.partitioning.read().unwrap();
+        let mut partitioning_guard = self.partitioning.lock().unwrap();
         if partitioning_guard.1 == Lsn(0)
             || lsn.0 - partitioning_guard.1 .0 > self.repartition_threshold
         {
             let keyspace = self.collect_keyspace(lsn)?;
-            drop(partitioning_guard);
-            let mut partitioning_guard = self.partitioning.write().unwrap();
             let partitioning = keyspace.partition(TARGET_FILE_SIZE_BYTES);
             *partitioning_guard = (partitioning, lsn);
             return Ok((partitioning_guard.0.clone(), lsn));
From 572b3f48cf1fb1217efc8067fde2597f38dfa447 Mon Sep 17 00:00:00 2001
From: Konstantin Knizhnik
Date: Fri, 1
Apr 2022 19:40:39 +0300
Subject: [PATCH 057/296] Add compaction_target_size parameter

---
 pageserver/src/config.rs             | 27 +++++++++++++++++++++++++++
 pageserver/src/keyspace.rs           | 3 ---
 pageserver/src/layered_repository.rs | 3 ++-
 pageserver/src/pgdatadir_mapping.rs  | 8 ++++----
 4 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/pageserver/src/config.rs b/pageserver/src/config.rs
index 9f7cd34a7a..0d5cac8b4f 100644
--- a/pageserver/src/config.rs
+++ b/pageserver/src/config.rs
@@ -30,8 +30,13 @@ pub mod defaults {
     // FIXME: This current value is very low. I would imagine something like 1 GB or 10 GB
     // would be more appropriate. But a low value forces the code to be exercised more,
     // which is good for now to trigger bugs.
+    // This parameter actually determines L0 layer file size.
     pub const DEFAULT_CHECKPOINT_DISTANCE: u64 = 256 * 1024 * 1024;
 
+    // Target file size, when creating image and delta layers.
+    // This parameter determines L1 layer file size.
+    pub const DEFAULT_COMPACTION_TARGET_SIZE: u64 = 128 * 1024 * 1024;
+
     pub const DEFAULT_COMPACTION_PERIOD: &str = "1 s";
 
     pub const DEFAULT_GC_HORIZON: u64 = 64 * 1024 * 1024;
@@ -58,6 +63,7 @@ pub mod defaults {
 #listen_http_addr = '{DEFAULT_HTTP_LISTEN_ADDR}'
 
 #checkpoint_distance = {DEFAULT_CHECKPOINT_DISTANCE} # in bytes
+#compaction_target_size = {DEFAULT_COMPACTION_TARGET_SIZE} # in bytes
 #compaction_period = '{DEFAULT_COMPACTION_PERIOD}'
 
 #gc_period = '{DEFAULT_GC_PERIOD}'
@@ -91,8 +97,13 @@ pub struct PageServerConf {
     // Flush out an inmemory layer, if it's holding WAL older than this
     // This puts a backstop on how much WAL needs to be re-digested if the
     // page server crashes.
+    // This parameter actually determines L0 layer file size.
     pub checkpoint_distance: u64,
 
+    // Target file size, when creating image and delta layers.
+    // This parameter determines L1 layer file size.
+    pub compaction_target_size: u64,
+
     // How often to check if there's compaction work to be done.
     pub compaction_period: Duration,
 
@@ -149,6 +160,7 @@ struct PageServerConfigBuilder {
 
     checkpoint_distance: BuilderValue,
 
+    compaction_target_size: BuilderValue,
     compaction_period: BuilderValue,
 
     gc_horizon: BuilderValue,
@@ -183,6 +195,7 @@ impl Default for PageServerConfigBuilder {
             listen_pg_addr: Set(DEFAULT_PG_LISTEN_ADDR.to_string()),
             listen_http_addr: Set(DEFAULT_HTTP_LISTEN_ADDR.to_string()),
             checkpoint_distance: Set(DEFAULT_CHECKPOINT_DISTANCE),
+            compaction_target_size: Set(DEFAULT_COMPACTION_TARGET_SIZE),
             compaction_period: Set(humantime::parse_duration(DEFAULT_COMPACTION_PERIOD)
                 .expect("cannot parse default compaction period")),
             gc_horizon: Set(DEFAULT_GC_HORIZON),
@@ -220,6 +233,10 @@ impl PageServerConfigBuilder {
         self.checkpoint_distance = BuilderValue::Set(checkpoint_distance)
     }
 
+    pub fn compaction_target_size(&mut self, compaction_target_size: u64) {
+        self.compaction_target_size = BuilderValue::Set(compaction_target_size)
+    }
+
     pub fn compaction_period(&mut self, compaction_period: Duration) {
         self.compaction_period = BuilderValue::Set(compaction_period)
     }
@@ -290,6 +307,9 @@ impl PageServerConfigBuilder {
             checkpoint_distance: self
                 .checkpoint_distance
                 .ok_or(anyhow::anyhow!("missing checkpoint_distance"))?,
+            compaction_target_size: self
+                .compaction_target_size
+                .ok_or(anyhow::anyhow!("missing compaction_target_size"))?,
             compaction_period: self
                 .compaction_period
                 .ok_or(anyhow::anyhow!("missing compaction_period"))?,
@@ -429,6 +449,9 @@ impl PageServerConf {
                 "listen_pg_addr" => builder.listen_pg_addr(parse_toml_string(key, item)?),
                 "listen_http_addr" => builder.listen_http_addr(parse_toml_string(key, item)?),
                 "checkpoint_distance" => builder.checkpoint_distance(parse_toml_u64(key, item)?),
+                "compaction_target_size" => {
+                    builder.compaction_target_size(parse_toml_u64(key, item)?)
+                }
                 "compaction_period" => builder.compaction_period(parse_toml_duration(key, item)?),
                 "gc_horizon" => builder.gc_horizon(parse_toml_u64(key, item)?),
                 "gc_period" => builder.gc_period(parse_toml_duration(key, item)?),
@@ -565,6 +588,7 @@ impl PageServerConf {
         PageServerConf {
             id: ZNodeId(0),
             checkpoint_distance: defaults::DEFAULT_CHECKPOINT_DISTANCE,
+            compaction_target_size: 4 * 1024 * 1024,
             compaction_period: Duration::from_secs(10),
             gc_horizon: defaults::DEFAULT_GC_HORIZON,
             gc_period: Duration::from_secs(10),
@@ -636,6 +660,7 @@ listen_http_addr = '127.0.0.1:9898'
 
 checkpoint_distance = 111 # in bytes
 
+compaction_target_size = 111 # in bytes
 compaction_period = '111 s'
 
 gc_period = '222 s'
@@ -673,6 +698,7 @@ id = 10
                 listen_pg_addr: defaults::DEFAULT_PG_LISTEN_ADDR.to_string(),
                 listen_http_addr: defaults::DEFAULT_HTTP_LISTEN_ADDR.to_string(),
                 checkpoint_distance: defaults::DEFAULT_CHECKPOINT_DISTANCE,
+                compaction_target_size: defaults::DEFAULT_COMPACTION_TARGET_SIZE,
                 compaction_period: humantime::parse_duration(defaults::DEFAULT_COMPACTION_PERIOD)?,
                 gc_horizon: defaults::DEFAULT_GC_HORIZON,
                 gc_period: humantime::parse_duration(defaults::DEFAULT_GC_PERIOD)?,
@@ -717,6 +743,7 @@ id = 10
                 listen_pg_addr: "127.0.0.1:64000".to_string(),
                 listen_http_addr: "127.0.0.1:9898".to_string(),
                 checkpoint_distance: 111,
+                compaction_target_size: 111,
                 compaction_period: Duration::from_secs(111),
                 gc_horizon: 222,
                 gc_period: Duration::from_secs(222),
diff --git a/pageserver/src/keyspace.rs b/pageserver/src/keyspace.rs
index 9973568b07..f6f0d7b7cf 100644
--- a/pageserver/src/keyspace.rs
+++ b/pageserver/src/keyspace.rs
@@ -2,9 +2,6 @@ use crate::repository::{key_range_size, singleton_range, Key};
 use postgres_ffi::pg_constants;
 use std::ops::Range;
 
-// Target file size, when creating image and delta layers
-pub const TARGET_FILE_SIZE_BYTES: u64 = 128 * 1024 * 1024; // 128 MB
-
 ///
 /// Represents a set of Keys, in a compact form.
 ///
diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs
index 60b0e921ce..2d9b680624 100644
--- a/pageserver/src/layered_repository.rs
+++ b/pageserver/src/layered_repository.rs
@@ -1581,7 +1581,8 @@ impl LayeredTimeline {
         // Define partitioning schema if needed
         if let Ok(pgdir) = tenant_mgr::get_timeline_for_tenant_load(self.tenantid, self.timelineid)
         {
-            let (partitioning, lsn) = pgdir.repartition(self.get_last_record_lsn())?;
+            let (partitioning, lsn) =
+                pgdir.repartition(self.get_last_record_lsn(), self.conf.compaction_target_size)?;
             let timer = self.create_images_time_histo.start_timer();
             // 2. Create new image layers for partitions that have been modified
             // "enough".
diff --git a/pageserver/src/pgdatadir_mapping.rs b/pageserver/src/pgdatadir_mapping.rs
index 2e0040f0c0..af12084766 100644
--- a/pageserver/src/pgdatadir_mapping.rs
+++ b/pageserver/src/pgdatadir_mapping.rs
@@ -6,7 +6,7 @@
 //! walingest.rs handles a few things like implicit relation creation and extension.
 //! Clarify that)
 //!
-use crate::keyspace::{KeyPartitioning, KeySpace, KeySpaceAccum, TARGET_FILE_SIZE_BYTES};
+use crate::keyspace::{KeyPartitioning, KeySpace, KeySpaceAccum};
 use crate::reltag::{RelTag, SlruKind};
 use crate::repository::*;
 use crate::repository::{Repository, Timeline};
@@ -388,13 +388,13 @@ impl DatadirTimeline {
         Ok(result.to_keyspace())
     }
 
-    pub fn repartition(&self, lsn: Lsn) -> Result<(KeyPartitioning, Lsn)> {
+    pub fn repartition(&self, lsn: Lsn, partition_size: u64) -> Result<(KeyPartitioning, Lsn)> {
         let mut partitioning_guard = self.partitioning.lock().unwrap();
         if partitioning_guard.1 == Lsn(0)
             || lsn.0 - partitioning_guard.1 .0 > self.repartition_threshold
         {
             let keyspace = self.collect_keyspace(lsn)?;
-            let partitioning = keyspace.partition(TARGET_FILE_SIZE_BYTES);
+            let partitioning = keyspace.partition(partition_size);
             *partitioning_guard = (partitioning, lsn);
             return Ok((partitioning_guard.0.clone(), lsn));
         }
@@ -1215,7 +1215,7 @@ pub fn create_test_timeline(
     timeline_id: zenith_utils::zid::ZTimelineId,
 ) -> Result>> {
     let tline = repo.create_empty_timeline(timeline_id, Lsn(8))?;
-    let tline = DatadirTimeline::new(tline, crate::layered_repository::tests::TEST_FILE_SIZE / 10);
+    let tline = DatadirTimeline::new(tline, tline.conf.compaction_target_size / 10);
     let mut m = tline.begin_modification(Lsn(8));
     m.init_empty()?;
     m.commit()?;
From fcf613b6e3e5d4fefa1d53daeb677ccf7c64b5f8 Mon Sep 17 00:00:00 2001
From: Konstantin Knizhnik
Date: Fri, 1 Apr 2022 19:57:51 +0300
Subject: [PATCH 058/296] Fix unit tests build

---
 pageserver/src/pgdatadir_mapping.rs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pageserver/src/pgdatadir_mapping.rs b/pageserver/src/pgdatadir_mapping.rs
index af12084766..0b9ea7c7a7 100644
--- a/pageserver/src/pgdatadir_mapping.rs
+++ b/pageserver/src/pgdatadir_mapping.rs
@@ -1215,7 +1215,7 @@ pub fn create_test_timeline(
     timeline_id: zenith_utils::zid::ZTimelineId,
 ) -> Result>> {
     let tline =
repo.create_empty_timeline(timeline_id, Lsn(8))?;
-    let tline = DatadirTimeline::new(tline, tline.conf.compaction_target_size / 10);
+    let tline = DatadirTimeline::new(tline, 256 * 1024);
     let mut m = tline.begin_modification(Lsn(8));
     m.init_empty()?;
     m.commit()?;
From a5a478c32193fcf6e04b3e9b2fa981d2bc5e82e2 Mon Sep 17 00:00:00 2001
From: Arthur Petukhovsky
Date: Mon, 4 Apr 2022 16:32:30 +0300
Subject: [PATCH 059/296] Bump vendor/postgres to store WAL on disk only (#1342)

Now WAL is no longer held in compute memory
---
 vendor/postgres | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vendor/postgres b/vendor/postgres
index 5c278ed0ac..8481459996 160000
--- a/vendor/postgres
+++ b/vendor/postgres
@@ -1 +1 @@
-Subproject commit 5c278ed0aca5dea9340d9af4ad5f004d905ff1b7
+Subproject commit 848145999653be213141a330569b6f2d9f53dbf2
From 089ba6abfe6c6e291489970b1c82dc5d3d6c0516 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Mon, 4 Apr 2022 20:12:25 +0300
Subject: [PATCH 060/296] Clean up some comments that still referred to 'segments'

---
 .../src/layered_repository/delta_layer.rs   | 13 +++++-------
 .../src/layered_repository/image_layer.rs   | 4 ++--
 .../src/layered_repository/layer_map.rs     | 20 ++-----------------
 .../src/layered_repository/storage_layer.rs | 4 ++--
 4 files changed, 11 insertions(+), 30 deletions(-)

diff --git a/pageserver/src/layered_repository/delta_layer.rs b/pageserver/src/layered_repository/delta_layer.rs
index 0e59eb7a3c..955d4145f3 100644
--- a/pageserver/src/layered_repository/delta_layer.rs
+++ b/pageserver/src/layered_repository/delta_layer.rs
@@ -1,14 +1,11 @@
 //! A DeltaLayer represents a collection of WAL records or page images in a range of
 //! LSNs, and in a range of Keys. It is stored on a file on disk.
 //!
-//! Usually a delta layer only contains differences - in the form of WAL records against
-//! a base LSN. However, if a segment is newly created, by creating a new relation or
-//!
extending an old one, there might be no base image. In that case, all the entries in
-//! the delta layer must be page images or WAL records with the 'will_init' flag set, so
-//! that they can be replayed without referring to an older page version. Also in some
-//! circumstances, the predecessor layer might actually be another delta layer. That
-//! can happen when you create a new branch in the middle of a delta layer, and the WAL
-//! records on the new branch are put in a new delta layer.
+//! Usually a delta layer only contains differences, in the form of WAL records
+//! against a base LSN. However, if a relation extended or a whole new relation
+//! is created, there would be no base for the new pages. The entries for them
+//! must be page images or WAL records with the 'will_init' flag set, so that
+//! they can be replayed without referring to an older page version.
 //!
 //! When a delta file needs to be accessed, we slurp the 'index' metadata
 //! into memory, into the DeltaLayerInner struct. See load() and unload() functions.
diff --git a/pageserver/src/layered_repository/image_layer.rs b/pageserver/src/layered_repository/image_layer.rs
index 2b9bf4a717..68d1cd4a8a 100644
--- a/pageserver/src/layered_repository/image_layer.rs
+++ b/pageserver/src/layered_repository/image_layer.rs
@@ -405,8 +405,8 @@ impl ImageLayer {
 ///
 /// 1. Create the ImageLayerWriter by calling ImageLayerWriter::new(...)
 ///
-/// 2. Write the contents by calling `put_page_image` for every page
-///    in the segment.
+/// 2. Write the contents by calling `put_page_image` for every key-value
+///    pair in the key range.
 ///
 /// 3. Call `finish`.
 ///
diff --git a/pageserver/src/layered_repository/layer_map.rs b/pageserver/src/layered_repository/layer_map.rs
index b6a3bd82aa..8132ec9cc4 100644
--- a/pageserver/src/layered_repository/layer_map.rs
+++ b/pageserver/src/layered_repository/layer_map.rs
@@ -207,11 +207,11 @@ impl LayerMap {
         NUM_ONDISK_LAYERS.dec();
     }
 
-    /// Is there a newer image layer for given segment?
+    /// Is there a newer image layer for given key-range?
     ///
     /// This is used for garbage collection, to determine if an old layer can
     /// be deleted.
-    /// We ignore segments newer than disk_consistent_lsn because they will be removed at restart
+    /// We ignore layers newer than disk_consistent_lsn because they will be removed at restart
     /// We also only look at historic layers
     //#[allow(dead_code)]
     pub fn newer_image_layer_exists(
@@ -250,22 +250,6 @@ impl LayerMap {
         }
     }
 
-    /// Is there any layer for given segment that is alive at the lsn?
-    ///
-    /// This is a public wrapper for SegEntry fucntion,
-    /// used for garbage collection, to determine if some alive layer
-    /// exists at the lsn. If so, we shouldn't delete a newer dropped layer
-    /// to avoid incorrectly making it visible.
-    /*
-    pub fn layer_exists_at_lsn(&self, seg: SegmentTag, lsn: Lsn) -> Result {
-        Ok(if let Some(segentry) = self.historic_layers.get(&seg) {
-            segentry.exists_at_lsn(seg, lsn)?.unwrap_or(false)
-        } else {
-            false
-        })
-    }
-    */
-
     pub fn iter_historic_layers(&self) -> std::slice::Iter> {
         self.historic_layers.iter()
     }
diff --git a/pageserver/src/layered_repository/storage_layer.rs b/pageserver/src/layered_repository/storage_layer.rs
index dcf5b63908..2711640736 100644
--- a/pageserver/src/layered_repository/storage_layer.rs
+++ b/pageserver/src/layered_repository/storage_layer.rs
@@ -88,7 +88,7 @@ pub trait Layer: Send + Sync {
     /// Identify the timeline this layer belongs to
     fn get_timeline_id(&self) -> ZTimelineId;
 
-    /// Range of segments that this layer covers
+    /// Range of keys that this layer covers
     fn get_key_range(&self) -> Range;
 
     /// Inclusive start bound of the LSN range that this layer holds
@@ -123,7 +123,7 @@ pub trait Layer: Send + Sync {
         reconstruct_data: &mut ValueReconstructState,
     ) -> Result;
 
-    /// Does this layer only contain some data for the segment (incremental),
+    /// Does this layer only contain some data for the key-range (incremental),
     /// or does it contain a version of every page? This is important to know
     /// for garbage collecting old layers: an incremental layer depends on
     /// the previous non-incremental layer.
From 222b7233540d93327d26cb0566b1c30379451656 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Mon, 4 Apr 2022 20:12:28 +0300
Subject: [PATCH 061/296] Handle read errors when dumping a delta layer file.

If a file is corrupt, let's not stop on first read error, but continue
dumping.
---
 .../src/layered_repository/delta_layer.rs | 38 +++++++++++--------
 1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/pageserver/src/layered_repository/delta_layer.rs b/pageserver/src/layered_repository/delta_layer.rs
index 955d4145f3..7013c2417c 100644
--- a/pageserver/src/layered_repository/delta_layer.rs
+++ b/pageserver/src/layered_repository/delta_layer.rs
@@ -293,25 +293,31 @@ impl Layer for DeltaLayer {
             for (lsn, blob_ref) in versions.as_slice() {
                 let mut desc = String::new();
                 let mut buf = vec![0u8; blob_ref.size()];
-                chapter.read_exact_at(&mut buf, blob_ref.pos())?;
-                let val = Value::des(&buf);
+                match chapter.read_exact_at(&mut buf, blob_ref.pos()) {
+                    Ok(()) => {
+                        let val = Value::des(&buf);
 
-                match val {
-                    Ok(Value::Image(img)) => {
-                        write!(&mut desc, " img {} bytes", img.len())?;
-                    }
-                    Ok(Value::WalRecord(rec)) => {
-                        let wal_desc = walrecord::describe_wal_record(&rec);
-                        write!(
-                            &mut desc,
-                            " rec {} bytes will_init: {} {}",
-                            buf.len(),
-                            rec.will_init(),
-                            wal_desc
-                        )?;
+                        match val {
+                            Ok(Value::Image(img)) => {
+                                write!(&mut desc, " img {} bytes", img.len())?;
+                            }
+                            Ok(Value::WalRecord(rec)) => {
+                                let wal_desc = walrecord::describe_wal_record(&rec);
+                                write!(
+                                    &mut desc,
+                                    " rec {} bytes will_init: {} {}",
+                                    buf.len(),
+                                    rec.will_init(),
+                                    wal_desc
+                                )?;
+                            }
+                            Err(err) => {
+                                write!(&mut desc, " DESERIALIZATION ERROR: {}", err)?;
+                            }
+                        }
                     }
                     Err(err) => {
-                        write!(&mut desc, " DESERIALIZATION ERROR: {}", err)?;
+                        write!(&mut desc, " READ ERROR: {}", err)?;
                     }
                 }
                 println!(" key {} at {}: {}", key, lsn, desc);
From 2f784144fe335e30811dca0f86c7ff20ec2978dc Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Mon, 4 Apr 2022 20:12:31 +0300
Subject: [PATCH 062/296] Avoid deadlock when locking two buffers.

It happened in unit tests. If a thread tries to read a buffer while already
holding a lock on one buffer, the code to find a victim buffer to evict
could try to evict the buffer that's already locked.
To fix, skip locked buffers.
---
 pageserver/src/page_cache.rs | 29 ++++++++++++++++++++++-------
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/pageserver/src/page_cache.rs b/pageserver/src/page_cache.rs
index 299575f792..c485e46f47 100644
--- a/pageserver/src/page_cache.rs
+++ b/pageserver/src/page_cache.rs
@@ -41,7 +41,7 @@ use std::{
     convert::TryInto,
     sync::{
         atomic::{AtomicU8, AtomicUsize, Ordering},
-        RwLock, RwLockReadGuard, RwLockWriteGuard,
+        RwLock, RwLockReadGuard, RwLockWriteGuard, TryLockError,
     },
 };
 
@@ -683,16 +683,33 @@ impl PageCache {
     ///
     /// On return, the slot is empty and write-locked.
     fn find_victim(&self) -> (usize, RwLockWriteGuard) {
-        let iter_limit = self.slots.len() * 2;
+        let iter_limit = self.slots.len() * 10;
         let mut iters = 0;
         loop {
+            iters += 1;
             let slot_idx = self.next_evict_slot.fetch_add(1, Ordering::Relaxed) % self.slots.len();
 
             let slot = &self.slots[slot_idx];
 
-            if slot.dec_usage_count() == 0 || iters >= iter_limit {
-                let mut inner = slot.inner.write().unwrap();
-
+            if slot.dec_usage_count() == 0 {
+                let mut inner = match slot.inner.try_write() {
+                    Ok(inner) => inner,
+                    Err(TryLockError::Poisoned(err)) => {
+                        panic!("buffer lock was poisoned: {:?}", err)
+                    }
+                    Err(TryLockError::WouldBlock) => {
+                        // If we have looped through the whole buffer pool 10 times
+                        // and still haven't found a victim buffer, something's wrong.
+                        // Maybe all the buffers were in locked. That could happen in
+                        // theory, if you have more threads holding buffers locked than
+                        // there are buffers in the pool. In practice, with a reasonably
+                        // large buffer pool it really shouldn't happen.
+                        if iters > iter_limit {
+                            panic!("could not find a victim buffer to evict");
+                        }
+                        continue;
+                    }
+                };
                 if let Some(old_key) = &inner.key {
                     if inner.dirty {
                         if let Err(err) = Self::writeback(old_key, inner.buf) {
@@ -717,8 +734,6 @@ impl PageCache {
                 }
                 return (slot_idx, inner);
             }
-
-            iters += 1;
         }
     }
 
From d0c246ac3c0101fba6c8607dbb11444d8a0f589c Mon Sep 17 00:00:00 2001
From: Alexey Kondratov
Date: Tue, 5 Apr 2022 20:01:57 +0300
Subject: [PATCH 063/296] Update pageserver OpenAPI spec with missing attach/detach methods (#1463)

We have these methods for some time in the API, so mentioning them in
the spec could be useful for console (see zenithdb/console#867), as we
generate pageserver HTTP API golang client there.
---
 pageserver/src/http/openapi_spec.yml | 121 +++++++++++++++++++++++++--
 pageserver/src/http/routes.rs        | 5 +-
 zenith_utils/src/http/error.rs       | 6 ++
 3 files changed, 125 insertions(+), 7 deletions(-)

diff --git a/pageserver/src/http/openapi_spec.yml b/pageserver/src/http/openapi_spec.yml
index a9101d4bd6..b2760efe85 100644
--- a/pageserver/src/http/openapi_spec.yml
+++ b/pageserver/src/http/openapi_spec.yml
@@ -18,7 +18,7 @@ paths:
               schema:
                 type: object
                 required:
-                - id
+                  - id
                 properties:
                   id:
                     type: integer
@@ -122,6 +122,110 @@ paths:
             application/json:
               schema:
                 $ref: "#/components/schemas/Error"
+
+
+  /v1/tenant/{tenant_id}/timeline/{timeline_id}/attach:
+    parameters:
+      - name: tenant_id
+        in: path
+        required: true
+        schema:
+          type: string
+          format: hex
+      - name: timeline_id
+        in: path
+        required: true
+        schema:
+          type: string
+          format: hex
+    post:
+      description: Attach remote timeline
+      responses:
+        "200":
+          description: Timeline attaching scheduled
+        "400":
+          description: Error when no tenant id found in path or no timeline id
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/Error"
+        "401":
+          description: Unauthorized Error
+          content:
+            application/json:
+              schema:
+                $ref: "#/components/schemas/UnauthorizedError"
+        "403":
+          description:
Forbidden Error + content: + application/json: + schema: + $ref: "#/components/schemas/ForbiddenError" + "404": + description: Timeline not found + content: + application/json: + schema: + $ref: "#/components/schemas/NotFoundError" + "409": + description: Timeline download is already in progress + content: + application/json: + schema: + $ref: "#/components/schemas/ConflictError" + "500": + description: Generic operation error + content: + application/json: + schema: + $ref: "#/components/schemas/Error" + + + /v1/tenant/{tenant_id}/timeline/{timeline_id}/detach: + parameters: + - name: tenant_id + in: path + required: true + schema: + type: string + format: hex + - name: timeline_id + in: path + required: true + schema: + type: string + format: hex + post: + description: Detach local timeline + responses: + "200": + description: Timeline detached + "400": + description: Error when no tenant id found in path or no timeline id + content: + application/json: + schema: + $ref: "#/components/schemas/Error" + "401": + description: Unauthorized Error + content: + application/json: + schema: + $ref: "#/components/schemas/UnauthorizedError" + "403": + description: Forbidden Error + content: + application/json: + schema: + $ref: "#/components/schemas/ForbiddenError" + "500": + description: Generic operation error + content: + application/json: + schema: + $ref: "#/components/schemas/Error" + + /v1/tenant/{tenant_id}/timeline/: parameters: - name: tenant_id @@ -179,7 +283,7 @@ paths: content: application/json: schema: - $ref: "#/components/schemas/AlreadyExistsError" + $ref: "#/components/schemas/ConflictError" "500": description: Generic operation error content: @@ -260,7 +364,7 @@ paths: content: application/json: schema: - $ref: "#/components/schemas/AlreadyExistsError" + $ref: "#/components/schemas/ConflictError" "500": description: Generic operation error content: @@ -354,14 +458,21 @@ components: properties: msg: type: string - AlreadyExistsError: + ForbiddenError: 
type: object required: - msg properties: msg: type: string - ForbiddenError: + NotFoundError: + type: object + required: + - msg + properties: + msg: + type: string + ConflictError: type: object required: - msg diff --git a/pageserver/src/http/routes.rs b/pageserver/src/http/routes.rs index 82e818a47b..207d2420bd 100644 --- a/pageserver/src/http/routes.rs +++ b/pageserver/src/http/routes.rs @@ -220,6 +220,7 @@ async fn timeline_attach_handler(request: Request) -> Result) -> Result { HttpErrorBody::response_from_msg_and_status(self.to_string(), StatusCode::NOT_FOUND) } + ApiError::Conflict(_) => { + HttpErrorBody::response_from_msg_and_status(self.to_string(), StatusCode::CONFLICT) + } ApiError::InternalServerError(err) => HttpErrorBody::response_from_msg_and_status( err.to_string(), StatusCode::INTERNAL_SERVER_ERROR, From 6fe443e239531ca1fef4dbf5258c892b1baac6ef Mon Sep 17 00:00:00 2001 From: bojanserafimov Date: Wed, 6 Apr 2022 18:32:10 -0400 Subject: [PATCH 064/296] Improve random_writes test (#1469) If you want to test with a 3GB database by tweaking some constants you'll hit a query timeout. I fix that by batching the inserts. 
--- test_runner/performance/test_random_writes.py | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/test_runner/performance/test_random_writes.py b/test_runner/performance/test_random_writes.py index b41f2f72a8..ba9eabcd97 100644 --- a/test_runner/performance/test_random_writes.py +++ b/test_runner/performance/test_random_writes.py @@ -49,7 +49,15 @@ def test_random_writes(zenith_with_baseline: PgCompare): count integer default 0 ); """) - cur.execute(f"INSERT INTO Big (pk) values (generate_series(1,{n_rows}))") + + # Insert n_rows in batches to avoid query timeouts + rows_inserted = 0 + while rows_inserted < n_rows: + rows_to_insert = min(1000 * 1000, n_rows - rows_inserted) + low = rows_inserted + 1 + high = rows_inserted + rows_to_insert + cur.execute(f"INSERT INTO Big (pk) values (generate_series({low},{high}))") + rows_inserted += rows_to_insert # Get table size (can't be predicted because padding and alignment) cur.execute("SELECT pg_relation_size('Big');") From 6bc78a0e7729c206d8c4ebfdaed539017130d253 Mon Sep 17 00:00:00 2001 From: Arthur Petukhovsky Date: Thu, 7 Apr 2022 01:44:26 +0300 Subject: [PATCH 065/296] Log more info in test_many_timelines asserts (#1473) It will help to debug #1470 as soon as it happens again --- test_runner/batch_others/test_wal_acceptor.py | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/test_runner/batch_others/test_wal_acceptor.py b/test_runner/batch_others/test_wal_acceptor.py index bdc526a125..8f87ff041f 100644 --- a/test_runner/batch_others/test_wal_acceptor.py +++ b/test_runner/batch_others/test_wal_acceptor.py @@ -108,14 +108,14 @@ def test_many_timelines(zenith_env_builder: ZenithEnvBuilder): for flush_lsn, commit_lsn in zip(m.flush_lsns, m.commit_lsns): # Invariant. May be < when transaction is in progress. 
- assert commit_lsn <= flush_lsn + assert commit_lsn <= flush_lsn, f"timeline_id={timeline_id}, timeline_detail={timeline_detail}, sk_metrics={sk_metrics}" # We only call collect_metrics() after a transaction is confirmed by # the compute node, which only happens after a consensus of safekeepers # has confirmed the transaction. We assume majority consensus here. assert (2 * sum(m.last_record_lsn <= lsn - for lsn in m.flush_lsns) > zenith_env_builder.num_safekeepers) + for lsn in m.flush_lsns) > zenith_env_builder.num_safekeepers), f"timeline_id={timeline_id}, timeline_detail={timeline_detail}, sk_metrics={sk_metrics}" assert (2 * sum(m.last_record_lsn <= lsn - for lsn in m.commit_lsns) > zenith_env_builder.num_safekeepers) + for lsn in m.commit_lsns) > zenith_env_builder.num_safekeepers), f"timeline_id={timeline_id}, timeline_detail={timeline_detail}, sk_metrics={sk_metrics}" timeline_metrics.append(m) log.info(f"{message}: {timeline_metrics}") return timeline_metrics From d5258cdc4df4f5130bb9ceea5dc47128bac6ce48 Mon Sep 17 00:00:00 2001 From: bojanserafimov Date: Wed, 6 Apr 2022 20:05:24 -0400 Subject: [PATCH 066/296] [proxy] Don't print passwords (#1298) --- proxy/src/compute.rs | 12 +++++++++++- proxy/src/mgmt.rs | 2 +- 2 files changed, 12 insertions(+), 2 deletions(-) diff --git a/proxy/src/compute.rs b/proxy/src/compute.rs index 7c0ab965a0..3c0eee29bc 100644 --- a/proxy/src/compute.rs +++ b/proxy/src/compute.rs @@ -24,7 +24,7 @@ pub enum ConnectionError { impl UserFacingError for ConnectionError {} /// Compute node connection params. 
-#[derive(Serialize, Deserialize, Debug, Default)] +#[derive(Serialize, Deserialize, Default)] pub struct DatabaseInfo { pub host: String, pub port: u16, @@ -33,6 +33,16 @@ pub struct DatabaseInfo { pub password: Option, } +// Manually implement debug to omit personal and sensitive info +impl std::fmt::Debug for DatabaseInfo { + fn fmt(&self, fmt: &mut std::fmt::Formatter) -> std::fmt::Result { + fmt.debug_struct("DatabaseInfo") + .field("host", &self.host) + .field("port", &self.port) + .finish() + } +} + /// PostgreSQL version as [`String`]. pub type Version = String; diff --git a/proxy/src/mgmt.rs b/proxy/src/mgmt.rs index e53542dfd2..ab6fdff040 100644 --- a/proxy/src/mgmt.rs +++ b/proxy/src/mgmt.rs @@ -107,7 +107,7 @@ impl postgres_backend::Handler for MgmtHandler { } fn try_process_query(pgb: &mut PostgresBackend, query_string: &str) -> anyhow::Result<()> { - println!("Got mgmt query: '{}'", query_string); + println!("Got mgmt query [redacted]"); // Content contains password, don't print it let resp: PsqlSessionResponse = serde_json::from_str(query_string)?; From 81ba23094e8578ed11cb1aae48cf10b79dc2f3cd Mon Sep 17 00:00:00 2001 From: Arthur Petukhovsky Date: Thu, 7 Apr 2022 20:38:26 +0300 Subject: [PATCH 067/296] Fix scripts to deploy sk4 on staging (#1476) Adjust ansible scripts and inventory for sk4 on staging --- .circleci/ansible/deploy.yaml | 24 ++++++++++++++++ .circleci/ansible/scripts/init_safekeeper.sh | 30 ++++++++++++++++++++ .circleci/ansible/staging.hosts | 1 + 3 files changed, 55 insertions(+) create mode 100644 .circleci/ansible/scripts/init_safekeeper.sh diff --git a/.circleci/ansible/deploy.yaml b/.circleci/ansible/deploy.yaml index b7ffd075a0..2112102aa7 100644 --- a/.circleci/ansible/deploy.yaml +++ b/.circleci/ansible/deploy.yaml @@ -116,6 +116,30 @@ tasks: + - name: upload init script + when: console_mgmt_base_url is defined + ansible.builtin.template: + src: scripts/init_safekeeper.sh + dest: /tmp/init_safekeeper.sh + owner: root + group: 
root + mode: '0755' + become: true + tags: + - safekeeper + + - name: init safekeeper + shell: + cmd: /tmp/init_safekeeper.sh + args: + creates: "/storage/safekeeper/data/safekeeper.id" + environment: + ZENITH_REPO_DIR: "/storage/safekeeper/data" + LD_LIBRARY_PATH: "/usr/local/lib" + become: true + tags: + - safekeeper + # in the future safekeepers should discover pageservers byself # but currently use first pageserver that was discovered - name: set first pageserver var for safekeepers diff --git a/.circleci/ansible/scripts/init_safekeeper.sh b/.circleci/ansible/scripts/init_safekeeper.sh new file mode 100644 index 0000000000..2297788f59 --- /dev/null +++ b/.circleci/ansible/scripts/init_safekeeper.sh @@ -0,0 +1,30 @@ +#!/bin/sh + +# get instance id from meta-data service +INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id) + +# store fqdn hostname in var +HOST=$(hostname -f) + + +cat < Date: Thu, 7 Apr 2022 20:50:08 +0300 Subject: [PATCH 068/296] Refactor the I/O functions. This introduces two new abstraction layers for I/O: - Block I/O, and - Blob I/O. The BlockReader trait abstracts a file or something else that can be read in 8kB pages. It is implemented by EphemeralFiles, and by a new FileBlockReader struct that allows reading arbitrary VirtualFiles in that manner, utilizing the page cache. There is also a new BlockCursor struct that works as a cursor over a BlockReader. When you create a BlockCursor and read the first page using it, it keeps the reference to the page. If you access the same page again, it avoids going to page cache and quickly returns the same page again. That can save a lot of lookups in the page cache if you perform multiple reads. The Blob-oriented API allows reading and writing "blobs" of arbitrary length. It is a layer on top of the block-oriented API. 
When you write a blob with the write_blob() function, it writes a length field followed by the actual data to the underlying block storage, and returns the offset where the blob was stored. The blob can be retrieved later using the offset. Finally, this replaces the I/O code in image-, delta-, and in-memory layers to use the new abstractions. These replace the 'bookfile' crate. This is a backwards-incompatible change to the storage format. --- Cargo.lock | 36 --- pageserver/Cargo.toml | 1 - pageserver/src/bin/dump_layerfile.rs | 2 + pageserver/src/layered_repository.rs | 23 +- pageserver/src/layered_repository/blob_io.rs | 122 ++++++++ pageserver/src/layered_repository/block_io.rs | 176 ++++++++++++ .../src/layered_repository/delta_layer.rs | 272 ++++++++---------- .../src/layered_repository/ephemeral_file.rs | 183 ++++++++---- .../src/layered_repository/image_layer.rs | 195 ++++++------- .../src/layered_repository/inmemory_layer.rs | 61 ++-- .../src/layered_repository/storage_layer.rs | 17 +- pageserver/src/lib.rs | 6 +- pageserver/src/page_cache.rs | 82 +++++- pageserver/src/virtual_file.rs | 3 +- 14 files changed, 774 insertions(+), 405 deletions(-) create mode 100644 pageserver/src/layered_repository/blob_io.rs create mode 100644 pageserver/src/layered_repository/block_io.rs diff --git a/Cargo.lock b/Cargo.lock index bb27df7012..e0b6288f63 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -141,30 +141,6 @@ version = "1.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "d468802bab17cbc0cc575e9b053f41e72aa36bfa6b7f55e3529ffa43161b97fa" -[[package]] -name = "aversion" -version = "0.2.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "41992ab8cfcc3026ef9abceffe0c2b0479c043183fc23825e30d22baab6df334" -dependencies = [ - "aversion-macros", - "byteorder", - "serde", - "serde_cbor", - "thiserror", -] - -[[package]] -name = "aversion-macros" -version = "0.2.1" -source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "5ba5785f953985aa0caca927ba4005880f3b4f53de87f134e810ae3549f744d2" -dependencies = [ - "proc-macro2", - "quote", - "syn", -] - [[package]] name = "aws-creds" version = "0.27.1" @@ -264,17 +240,6 @@ dependencies = [ "generic-array", ] -[[package]] -name = "bookfile" -version = "0.3.0" -source = "git+https://github.com/zenithdb/bookfile.git?rev=bf6e43825dfb6e749ae9b80e8372c8fea76cec2f#bf6e43825dfb6e749ae9b80e8372c8fea76cec2f" -dependencies = [ - "aversion", - "byteorder", - "serde", - "thiserror", -] - [[package]] name = "boxfnonce" version = "0.1.1" @@ -1524,7 +1489,6 @@ dependencies = [ "anyhow", "async-compression", "async-trait", - "bookfile", "byteorder", "bytes", "chrono", diff --git a/pageserver/Cargo.toml b/pageserver/Cargo.toml index 6a77af1691..a5283cb331 100644 --- a/pageserver/Cargo.toml +++ b/pageserver/Cargo.toml @@ -4,7 +4,6 @@ version = "0.1.0" edition = "2021" [dependencies] -bookfile = { git = "https://github.com/zenithdb/bookfile.git", rev="bf6e43825dfb6e749ae9b80e8372c8fea76cec2f" } chrono = "0.4.19" rand = "0.8.3" regex = "1.4.5" diff --git a/pageserver/src/bin/dump_layerfile.rs b/pageserver/src/bin/dump_layerfile.rs index 27d41d50d9..7cf39566ac 100644 --- a/pageserver/src/bin/dump_layerfile.rs +++ b/pageserver/src/bin/dump_layerfile.rs @@ -4,6 +4,7 @@ use anyhow::Result; use clap::{App, Arg}; use pageserver::layered_repository::dump_layerfile_from_path; +use pageserver::page_cache; use pageserver::virtual_file; use std::path::PathBuf; use zenith_utils::GIT_VERSION; @@ -24,6 +25,7 @@ fn main() -> Result<()> { // Basic initialization of things that don't change after startup virtual_file::init(10); + page_cache::init(100); dump_layerfile_from_path(&path, true)?; diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 2d9b680624..5adf4a89ff 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -12,7 
+12,6 @@ //! use anyhow::{anyhow, bail, ensure, Context, Result}; -use bookfile::Book; use bytes::Bytes; use fail::fail_point; use itertools::Itertools; @@ -56,6 +55,8 @@ use zenith_utils::crashsafe_dir; use zenith_utils::lsn::{AtomicLsn, Lsn, RecordLsn}; use zenith_utils::seqwait::SeqWait; +mod blob_io; +pub mod block_io; mod delta_layer; pub(crate) mod ephemeral_file; mod filename; @@ -2054,16 +2055,17 @@ impl<'a> TimelineWriter<'_> for LayeredTimelineWriter<'a> { /// Dump contents of a layer file to stdout. pub fn dump_layerfile_from_path(path: &Path, verbose: bool) -> Result<()> { - let file = File::open(path)?; - let book = Book::new(file)?; + use std::os::unix::fs::FileExt; - match book.magic() { - crate::DELTA_FILE_MAGIC => { - DeltaLayer::new_for_path(path, &book)?.dump(verbose)?; - } - crate::IMAGE_FILE_MAGIC => { - ImageLayer::new_for_path(path, &book)?.dump(verbose)?; - } + // All layer files start with a two-byte "magic" value, to identify the kind of + // file. + let file = File::open(path)?; + let mut header_buf = [0u8; 2]; + file.read_exact_at(&mut header_buf, 0)?; + + match u16::from_be_bytes(header_buf) { + crate::IMAGE_FILE_MAGIC => ImageLayer::new_for_path(path, file)?.dump(verbose)?, + crate::DELTA_FILE_MAGIC => DeltaLayer::new_for_path(path, file)?.dump(verbose)?, magic => bail!("unrecognized magic identifier: {:?}", magic), } @@ -2274,7 +2276,6 @@ pub mod tests { lsn, Value::Image(TEST_IMG(&format!("{} at {}", blknum, lsn))), )?; - println!("updating {} at {}", blknum, lsn); writer.finish_write(lsn); drop(writer); updated[blknum] = lsn; diff --git a/pageserver/src/layered_repository/blob_io.rs b/pageserver/src/layered_repository/blob_io.rs new file mode 100644 index 0000000000..10bfea934d --- /dev/null +++ b/pageserver/src/layered_repository/blob_io.rs @@ -0,0 +1,122 @@ +//! +//! Functions for reading and writing variable-sized "blobs". +//! +//! Each blob begins with a 4-byte length, followed by the actual data. +//! 
+use crate::layered_repository::block_io::{BlockCursor, BlockReader};
+use crate::page_cache::PAGE_SZ;
+use std::cmp::min;
+use std::io::Error;
+
+/// For reading
+pub trait BlobCursor {
+    fn read_blob(&mut self, offset: u64) -> Result<Vec<u8>, std::io::Error> {
+        let mut buf = Vec::new();
+        self.read_blob_into_buf(offset, &mut buf)?;
+        Ok(buf)
+    }
+
+    fn read_blob_into_buf(
+        &mut self,
+        offset: u64,
+        dstbuf: &mut Vec<u8>,
+    ) -> Result<(), std::io::Error>;
+}
+
+impl<'a, R> BlobCursor for BlockCursor<R>
+where
+    R: BlockReader,
+{
+    fn read_blob_into_buf(
+        &mut self,
+        offset: u64,
+        dstbuf: &mut Vec<u8>,
+    ) -> Result<(), std::io::Error> {
+        let mut blknum = (offset / PAGE_SZ as u64) as u32;
+        let mut off = (offset % PAGE_SZ as u64) as usize;
+
+        let mut buf = self.read_blk(blknum)?;
+
+        // read length
+        let mut len_buf = [0u8; 4];
+        let thislen = PAGE_SZ - off;
+        if thislen < 4 {
+            // it is split across two pages
+            len_buf[..thislen].copy_from_slice(&buf[off..PAGE_SZ]);
+            blknum += 1;
+            buf = self.read_blk(blknum)?;
+            len_buf[thislen..].copy_from_slice(&buf[0..4 - thislen]);
+            off = 4 - thislen;
+        } else {
+            len_buf.copy_from_slice(&buf[off..off + 4]);
+            off += 4;
+        }
+        let len = u32::from_ne_bytes(len_buf) as usize;
+
+        dstbuf.clear();
+
+        // Read the payload
+        let mut remain = len;
+        while remain > 0 {
+            let mut page_remain = PAGE_SZ - off;
+            if page_remain == 0 {
+                // continue on next page
+                blknum += 1;
+                buf = self.read_blk(blknum)?;
+                off = 0;
+                page_remain = PAGE_SZ;
+            }
+            let this_blk_len = min(remain, page_remain);
+            dstbuf.extend_from_slice(&buf[off..off + this_blk_len]);
+            remain -= this_blk_len;
+            off += this_blk_len;
+        }
+        Ok(())
+    }
+}
+
+pub trait BlobWriter {
+    fn write_blob(&mut self, srcbuf: &[u8]) -> Result<u64, Error>;
+}
+
+pub struct WriteBlobWriter<W>
+where
+    W: std::io::Write,
+{
+    inner: W,
+    offset: u64,
+}
+
+impl<W> WriteBlobWriter<W>
+where
+    W: std::io::Write,
+{
+    pub fn new(inner: W, start_offset: u64) -> Self {
+        WriteBlobWriter {
+            inner,
+            offset: start_offset,
+        }
+    }
+
+    pub fn size(&self) -> u64 {
+        self.offset
+    }
+
+    pub fn into_inner(self) -> W {
+        self.inner
+    }
+}
+
+impl<W> BlobWriter for WriteBlobWriter<W>
+where
+    W: std::io::Write,
+{
+    fn write_blob(&mut self, srcbuf: &[u8]) -> Result<u64, Error> {
+        let offset = self.offset;
+        self.inner
+            .write_all(&((srcbuf.len()) as u32).to_ne_bytes())?;
+        self.inner.write_all(srcbuf)?;
+        self.offset += 4 + srcbuf.len() as u64;
+        Ok(offset)
+    }
+}
diff --git a/pageserver/src/layered_repository/block_io.rs b/pageserver/src/layered_repository/block_io.rs
new file mode 100644
index 0000000000..2b8e31e1ee
--- /dev/null
+++ b/pageserver/src/layered_repository/block_io.rs
@@ -0,0 +1,176 @@
+//!
+//! Low-level Block-oriented I/O functions
+//!
+//!
+//!
+
+use crate::page_cache;
+use crate::page_cache::{ReadBufResult, PAGE_SZ};
+use lazy_static::lazy_static;
+use std::ops::{Deref, DerefMut};
+use std::os::unix::fs::FileExt;
+use std::sync::atomic::AtomicU64;
+
+/// This is implemented by anything that can read 8 kB (PAGE_SZ)
+/// blocks, using the page cache
+///
+/// There are currently two implementations: EphemeralFile, and FileBlockReader
+/// below.
+pub trait BlockReader {
+    type BlockLease: Deref<Target = [u8; PAGE_SZ]> + 'static;
+
+    ///
+    /// Read a block. Returns a "lease" object that can be used to
+    /// access to the contents of the page. (For the page cache, the
+    /// lease object represents a lock on the buffer.)
+    ///
+    fn read_blk(&self, blknum: u32) -> Result<Self::BlockLease, std::io::Error>;
+
+    ///
+    /// Create a new "cursor" for reading from this reader.
+    ///
+    /// A cursor caches the last accessed page, allowing for faster
+    /// access if the same block is accessed repeatedly.
+    fn block_cursor(&self) -> BlockCursor<&Self>
+    where
+        Self: Sized,
+    {
+        BlockCursor::new(self)
+    }
+}
+
+impl<B> BlockReader for &B
+where
+    B: BlockReader,
+{
+    type BlockLease = B::BlockLease;
+
+    fn read_blk(&self, blknum: u32) -> Result<Self::BlockLease, std::io::Error> {
+        (*self).read_blk(blknum)
+    }
+}
+
+///
+/// A "cursor" for efficiently reading multiple pages from a BlockReader
+///
+/// A cursor caches the last accessed page, allowing for faster access if the
+/// same block is accessed repeatedly.
+///
+/// You can access the last page with `*cursor`. 'read_blk' returns 'self', so
+/// that in many cases you can use a BlockCursor as a drop-in replacement for
+/// the underlying BlockReader. For example:
+///
+/// ```no_run
+/// # use pageserver::layered_repository::block_io::{BlockReader, FileBlockReader};
+/// # let reader: FileBlockReader = todo!();
+/// let cursor = reader.block_cursor();
+/// let buf = cursor.read_blk(1);
+/// // do stuff with 'buf'
+/// let buf = cursor.read_blk(2);
+/// // do stuff with 'buf'
+/// ```
+///
+pub struct BlockCursor<R>
+where
+    R: BlockReader,
+{
+    reader: R,
+    /// last accessed page
+    cache: Option<(u32, R::BlockLease)>,
+}
+
+impl<R> BlockCursor<R>
+where
+    R: BlockReader,
+{
+    pub fn new(reader: R) -> Self {
+        BlockCursor {
+            reader,
+            cache: None,
+        }
+    }
+
+    pub fn read_blk(&mut self, blknum: u32) -> Result<&Self, std::io::Error> {
+        // Fast return if this is the same block as before
+        if let Some((cached_blk, _buf)) = &self.cache {
+            if *cached_blk == blknum {
+                return Ok(self);
+            }
+        }
+
+        // Read the block from the underlying reader, and cache it
+        self.cache = None;
+        let buf = self.reader.read_blk(blknum)?;
+        self.cache = Some((blknum, buf));
+
+        Ok(self)
+    }
+}
+
+impl<R> Deref for BlockCursor<R>
+where
+    R: BlockReader,
+{
+    type Target = [u8; PAGE_SZ];
+
+    fn deref(&self) -> &<Self as Deref>::Target {
+        &self.cache.as_ref().unwrap().1
+    }
+}
+
+lazy_static! {
+    static ref NEXT_ID: AtomicU64 = AtomicU64::new(1);
+}
+
+/// An adapter for reading a (virtual) file using the page cache.
+///
+/// The file is assumed to be immutable. This doesn't provide any functions
+/// for modifying the file, nor for invalidating the cache if it is modified.
+pub struct FileBlockReader<F> {
+    pub file: F,
+
+    /// Unique ID of this file, used as key in the page cache.
+    file_id: u64,
+}
+
+impl<F> FileBlockReader<F>
+where
+    F: FileExt,
+{
+    pub fn new(file: F) -> Self {
+        let file_id = NEXT_ID.fetch_add(1, std::sync::atomic::Ordering::Relaxed);
+
+        FileBlockReader { file_id, file }
+    }
+
+    /// Read a page from the underlying file into given buffer.
+    fn fill_buffer(&self, buf: &mut [u8], blkno: u32) -> Result<(), std::io::Error> {
+        assert!(buf.len() == PAGE_SZ);
+        self.file.read_exact_at(buf, blkno as u64 * PAGE_SZ as u64)
+    }
+}
+
+impl<F> BlockReader for FileBlockReader<F>
+where
+    F: FileExt,
+{
+    type BlockLease = page_cache::PageReadGuard<'static>;
+
+    fn read_blk(&self, blknum: u32) -> Result<Self::BlockLease, std::io::Error> {
+        // Look up the right page
+        let cache = page_cache::get();
+        loop {
+            match cache.read_immutable_buf(self.file_id, blknum) {
+                ReadBufResult::Found(guard) => break Ok(guard),
+                ReadBufResult::NotFound(mut write_guard) => {
+                    // Read the page from disk into the buffer
+                    self.fill_buffer(write_guard.deref_mut(), blknum)?;
+                    write_guard.mark_valid();
+
+                    // Swap for read lock
+                    continue;
+                }
+            };
+        }
+    }
+}
diff --git a/pageserver/src/layered_repository/delta_layer.rs b/pageserver/src/layered_repository/delta_layer.rs
index 7013c2417c..f8828b541f 100644
--- a/pageserver/src/layered_repository/delta_layer.rs
+++ b/pageserver/src/layered_repository/delta_layer.rs
@@ -23,21 +23,27 @@
 //! 000000067F000032BE0000400000000020B6-000000067F000032BE0000400000000030B6__000000578C6B29-0000000057A50051
 //!
 //!
-//! A delta file is constructed using the 'bookfile' crate. Each file consists of three
-//! parts: the 'index', the values, and a short summary header.
They are stored as -//! separate chapters. +//! Every delta file consists of three parts: "summary", "index", and +//! "values". The summary is a fixed size header at the beginning of the file, +//! and it contains basic information about the layer, and offsets to the other +//! parts. The "index" is a serialized HashMap mapping from Key and LSN to an offset in the +//! "values" part. The actual page images and WAL records are stored in the +//! "values" part. //! use crate::config::PageServerConf; +use crate::layered_repository::blob_io::{BlobCursor, BlobWriter, WriteBlobWriter}; +use crate::layered_repository::block_io::{BlockCursor, BlockReader, FileBlockReader}; use crate::layered_repository::filename::{DeltaFileName, PathOrConf}; use crate::layered_repository::storage_layer::{ BlobRef, Layer, ValueReconstructResult, ValueReconstructState, }; +use crate::page_cache::{PageReadGuard, PAGE_SZ}; use crate::repository::{Key, Value}; use crate::virtual_file::VirtualFile; use crate::walrecord; -use crate::DELTA_FILE_MAGIC; use crate::{ZTenantId, ZTimelineId}; -use anyhow::{bail, ensure, Result}; +use crate::{DELTA_FILE_MAGIC, STORAGE_FORMAT_VERSION}; +use anyhow::{bail, ensure, Context, Result}; use log::*; use serde::{Deserialize, Serialize}; use std::collections::HashMap; @@ -46,44 +52,43 @@ use zenith_utils::vec_map::VecMap; // while being able to use std::fmt::Write's methods use std::fmt::Write as _; use std::fs; -use std::io::BufWriter; -use std::io::Write; +use std::io::{BufWriter, Write}; +use std::io::{Seek, SeekFrom}; use std::ops::Range; use std::os::unix::fs::FileExt; use std::path::{Path, PathBuf}; use std::sync::{RwLock, RwLockReadGuard, RwLockWriteGuard, TryLockError}; -use bookfile::{Book, BookWriter, ChapterWriter}; - use zenith_utils::bin_ser::BeSer; use zenith_utils::lsn::Lsn; -/// Mapping from (key, lsn) -> page/WAL record -/// byte ranges in VALUES_CHAPTER -static INDEX_CHAPTER: u64 = 1; - -/// Page/WAL bytes - cannot be interpreted -/// without 
the page versions from the INDEX_CHAPTER -static VALUES_CHAPTER: u64 = 2; - -/// Contains the [`Summary`] struct -static SUMMARY_CHAPTER: u64 = 3; - #[derive(Debug, Serialize, Deserialize, PartialEq, Eq)] struct Summary { + /// Magic value to identify this as a zenith delta file. Always DELTA_FILE_MAGIC. + magic: u16, + format_version: u16, + tenantid: ZTenantId, timelineid: ZTimelineId, key_range: Range, lsn_range: Range, + + /// Block number where the 'index' part of the file begins. + index_start_blk: u32, } impl From<&DeltaLayer> for Summary { fn from(layer: &DeltaLayer) -> Self { Self { + magic: DELTA_FILE_MAGIC, + format_version: STORAGE_FORMAT_VERSION, + tenantid: layer.tenantid, timelineid: layer.timelineid, key_range: layer.key_range.clone(), lsn_range: layer.lsn_range.clone(), + + index_start_blk: 0, } } } @@ -118,7 +123,11 @@ pub struct DeltaLayerInner { /// index: HashMap>, - book: Option>, + // values copied from summary + index_start_blk: u32, + + /// Reader object for reading blocks from the file. (None if not loaded yet) + file: Option>, } impl Layer for DeltaLayer { @@ -155,45 +164,28 @@ impl Layer for DeltaLayer { { // Open the file and lock the metadata in memory let inner = self.load()?; - let values_reader = inner - .book - .as_ref() - .expect("should be loaded in load call above") - .chapter_reader(VALUES_CHAPTER)?; // Scan the page versions backwards, starting from `lsn`. 
if let Some(vec_map) = inner.index.get(&key) { + let mut reader = inner.file.as_ref().unwrap().block_cursor(); let slice = vec_map.slice_range(lsn_range); - let mut size = 0usize; - let mut first_pos = 0u64; - for (_entry_lsn, blob_ref) in slice.iter().rev() { - size += blob_ref.size(); - first_pos = blob_ref.pos(); - if blob_ref.will_init() { - break; - } - } - if size != 0 { - let mut buf = vec![0u8; size]; - values_reader.read_exact_at(&mut buf, first_pos)?; - for (entry_lsn, blob_ref) in slice.iter().rev() { - let offs = (blob_ref.pos() - first_pos) as usize; - let val = Value::des(&buf[offs..offs + blob_ref.size()])?; - match val { - Value::Image(img) => { - reconstruct_state.img = Some((*entry_lsn, img)); + for (entry_lsn, blob_ref) in slice.iter().rev() { + let buf = reader.read_blob(blob_ref.pos())?; + let val = Value::des(&buf)?; + match val { + Value::Image(img) => { + reconstruct_state.img = Some((*entry_lsn, img)); + need_image = false; + break; + } + Value::WalRecord(rec) => { + let will_init = rec.will_init(); + reconstruct_state.records.push((*entry_lsn, rec)); + if will_init { + // This WAL record initializes the page, so no need to go further back need_image = false; break; } - Value::WalRecord(rec) => { - let will_init = rec.will_init(); - reconstruct_state.records.push((*entry_lsn, rec)); - if will_init { - // This WAL record initializes the page, so no need to go further back - need_image = false; - break; - } - } } } } @@ -210,7 +202,7 @@ impl Layer for DeltaLayer { } } - fn iter(&self) -> Box> + '_> { + fn iter<'a>(&'a self) -> Box> + 'a> { let inner = self.load().unwrap(); match DeltaValueIter::new(inner) { @@ -281,20 +273,16 @@ impl Layer for DeltaLayer { let inner = self.load()?; - let path = self.path(); - let file = std::fs::File::open(&path)?; - let book = Book::new(file)?; - let chapter = book.chapter_reader(VALUES_CHAPTER)?; - let mut values: Vec<(&Key, &VecMap)> = inner.index.iter().collect(); values.sort_by_key(|k| k.0); + let mut 
reader = inner.file.as_ref().unwrap().block_cursor(); + for (key, versions) in values { for (lsn, blob_ref) in versions.as_slice() { let mut desc = String::new(); - let mut buf = vec![0u8; blob_ref.size()]; - match chapter.read_exact_at(&mut buf, blob_ref.pos()) { - Ok(()) => { + match reader.read_blob(blob_ref.pos()) { + Ok(buf) => { let val = Value::des(&buf); match val { @@ -378,19 +366,19 @@ impl DeltaLayer { let path = self.path(); // Open the file if it's not open already. - if inner.book.is_none() { - let file = VirtualFile::open(&path)?; - inner.book = Some(Book::new(file)?); + if inner.file.is_none() { + let file = VirtualFile::open(&path) + .with_context(|| format!("Failed to open file '{}'", path.display()))?; + inner.file = Some(FileBlockReader::new(file)); } - let book = inner.book.as_ref().unwrap(); + let file = inner.file.as_mut().unwrap(); + let summary_blk = file.read_blk(0)?; + let actual_summary = Summary::des_prefix(summary_blk.as_ref())?; match &self.path_or_conf { PathOrConf::Conf(_) => { - let chapter = book.read_chapter(SUMMARY_CHAPTER)?; - let actual_summary = Summary::des(&chapter)?; - - let expected_summary = Summary::from(self); - + let mut expected_summary = Summary::from(self); + expected_summary.index_start_blk = actual_summary.index_start_blk; if actual_summary != expected_summary { bail!("in-file summary does not match expected summary. 
actual = {:?} expected = {:?}", actual_summary, expected_summary); } @@ -409,8 +397,13 @@ impl DeltaLayer { } } - let chapter = book.read_chapter(INDEX_CHAPTER)?; - let index = HashMap::des(&chapter)?; + file.file.seek(SeekFrom::Start( + actual_summary.index_start_blk as u64 * PAGE_SZ as u64, + ))?; + let mut buf_reader = std::io::BufReader::new(&mut file.file); + let index = HashMap::des_from(&mut buf_reader)?; + + inner.index_start_blk = actual_summary.index_start_blk; debug!("loaded from {}", &path.display()); @@ -434,8 +427,9 @@ impl DeltaLayer { lsn_range: filename.lsn_range.clone(), inner: RwLock::new(DeltaLayerInner { loaded: false, - book: None, index: HashMap::default(), + file: None, + index_start_blk: 0, }), } } @@ -443,12 +437,14 @@ impl DeltaLayer { /// Create a DeltaLayer struct representing an existing file on disk. /// /// This variant is only used for debugging purposes, by the 'dump_layerfile' binary. - pub fn new_for_path(path: &Path, book: &Book) -> Result + pub fn new_for_path(path: &Path, file: F) -> Result where F: FileExt, { - let chapter = book.read_chapter(SUMMARY_CHAPTER)?; - let summary = Summary::des(&chapter)?; + let mut summary_buf = Vec::new(); + summary_buf.resize(PAGE_SZ, 0); + file.read_exact_at(&mut summary_buf, 0)?; + let summary = Summary::des_prefix(&summary_buf)?; Ok(DeltaLayer { path_or_conf: PathOrConf::Path(path.to_path_buf()), @@ -458,8 +454,9 @@ impl DeltaLayer { lsn_range: summary.lsn_range, inner: RwLock::new(DeltaLayerInner { loaded: false, - book: None, + file: None, index: HashMap::default(), + index_start_blk: 0, }), }) } @@ -504,8 +501,7 @@ pub struct DeltaLayerWriter { index: HashMap>, - values_writer: ChapterWriter>, - end_offset: u64, + blob_writer: WriteBlobWriter>, } impl DeltaLayerWriter { @@ -531,13 +527,10 @@ impl DeltaLayerWriter { u64::from(lsn_range.start), u64::from(lsn_range.end) )); - let file = VirtualFile::create(&path)?; + let mut file = VirtualFile::create(&path)?; + 
file.seek(SeekFrom::Start(PAGE_SZ as u64))?; let buf_writer = BufWriter::new(file); - let book = BookWriter::new(buf_writer, DELTA_FILE_MAGIC)?; - - // Open the page-versions chapter for writing. The calls to - // `put_value` will use this to write the contents. - let values_writer = book.new_chapter(VALUES_CHAPTER); + let blob_writer = WriteBlobWriter::new(buf_writer, PAGE_SZ as u64); Ok(DeltaLayerWriter { conf, @@ -547,8 +540,7 @@ impl DeltaLayerWriter { key_start, lsn_range, index: HashMap::new(), - values_writer, - end_offset: 0, + blob_writer, }) } @@ -558,17 +550,12 @@ impl DeltaLayerWriter { /// The values must be appended in key, lsn order. /// pub fn put_value(&mut self, key: Key, lsn: Lsn, val: Value) -> Result<()> { - //info!("DELTA: key {} at {} on {}", key, lsn, self.path.display()); assert!(self.lsn_range.start <= lsn); - // Remember the offset and size metadata. The metadata is written - // to a separate chapter, in `finish`. - let off = self.end_offset; - let buf = Value::ser(&val)?; - let len = buf.len(); - self.values_writer.write_all(&buf)?; - self.end_offset += len as u64; + + let off = self.blob_writer.write_blob(&Value::ser(&val)?)?; + let vec_map = self.index.entry(key).or_default(); - let blob_ref = BlobRef::new(off, len, val.will_init()); + let blob_ref = BlobRef::new(off, val.will_init()); let old = vec_map.append_or_update_last(lsn, blob_ref).unwrap().0; if old.is_some() { // We already had an entry for this LSN. That's odd.. @@ -583,38 +570,40 @@ impl DeltaLayerWriter { } pub fn size(&self) -> u64 { - self.end_offset + self.blob_writer.size() } /// /// Finish writing the delta layer. 
/// pub fn finish(self, key_end: Key) -> anyhow::Result { - // Close the values chapter - let book = self.values_writer.close()?; + let index_start_blk = + ((self.blob_writer.size() + PAGE_SZ as u64 - 1) / PAGE_SZ as u64) as u32; + + let buf_writer = self.blob_writer.into_inner(); + let mut file = buf_writer.into_inner()?; // Write out the index - let mut chapter = book.new_chapter(INDEX_CHAPTER); let buf = HashMap::ser(&self.index)?; - chapter.write_all(&buf)?; - let book = chapter.close()?; + file.seek(SeekFrom::Start(index_start_blk as u64 * PAGE_SZ as u64))?; + file.write_all(&buf)?; - let mut chapter = book.new_chapter(SUMMARY_CHAPTER); + // Fill in the summary on blk 0 let summary = Summary { + magic: DELTA_FILE_MAGIC, + format_version: STORAGE_FORMAT_VERSION, tenantid: self.tenantid, timelineid: self.timelineid, key_range: self.key_start..key_end, lsn_range: self.lsn_range.clone(), + index_start_blk, }; - Summary::ser_into(&summary, &mut chapter)?; - let book = chapter.close()?; - - // This flushes the underlying 'buf_writer'. - book.close()?; + file.seek(SeekFrom::Start(0))?; + Summary::ser_into(&summary, &mut file)?; // Note: Because we opened the file in write-only mode, we cannot // reuse the same VirtualFile for reading later. That's why we don't - // set inner.book here. The first read will have to re-open it. + // set inner.file here. The first read will have to re-open it. 
let layer = DeltaLayer { path_or_conf: PathOrConf::Conf(self.conf), tenantid: self.tenantid, @@ -624,7 +613,8 @@ impl DeltaLayerWriter { inner: RwLock::new(DeltaLayerInner { loaded: false, index: HashMap::new(), - book: None, + file: None, + index_start_blk, }), }; @@ -647,22 +637,6 @@ impl DeltaLayerWriter { Ok(layer) } - - pub fn abort(self) { - match self.values_writer.close() { - Ok(book) => { - if let Err(err) = book.close() { - error!("error while closing delta layer file: {}", err); - } - } - Err(err) => { - error!("error while closing chapter writer: {}", err); - } - } - if let Err(err) = std::fs::remove_file(self.path) { - error!("error removing unfinished delta layer file: {}", err); - } - } } /// @@ -672,13 +646,23 @@ impl DeltaLayerWriter { /// That takes up quite a lot of memory. Should do this in a more streaming /// fashion. /// -struct DeltaValueIter { +struct DeltaValueIter<'a> { all_offsets: Vec<(Key, Lsn, BlobRef)>, next_idx: usize, - data: Vec, + reader: BlockCursor>, } -impl Iterator for DeltaValueIter { +struct Adapter<'a>(RwLockReadGuard<'a, DeltaLayerInner>); + +impl<'a> BlockReader for Adapter<'a> { + type BlockLease = PageReadGuard<'static>; + + fn read_blk(&self, blknum: u32) -> Result { + self.0.file.as_ref().unwrap().read_blk(blknum) + } +} + +impl<'a> Iterator for DeltaValueIter<'a> { type Item = Result<(Key, Lsn, Value)>; fn next(&mut self) -> Option { @@ -686,8 +670,8 @@ impl Iterator for DeltaValueIter { } } -impl DeltaValueIter { - fn new(inner: RwLockReadGuard) -> Result { +impl<'a> DeltaValueIter<'a> { + fn new(inner: RwLockReadGuard<'a, DeltaLayerInner>) -> Result { let mut index: Vec<(&Key, &VecMap)> = inner.index.iter().collect(); index.sort_by_key(|x| x.0); @@ -698,30 +682,24 @@ impl DeltaValueIter { } } - let values_reader = inner - .book - .as_ref() - .expect("should be loaded in load call above") - .chapter_reader(VALUES_CHAPTER)?; - let file_size = values_reader.len() as usize; - let mut layer = DeltaValueIter { + let 
iter = DeltaValueIter { all_offsets, next_idx: 0, - data: vec![0u8; file_size], + reader: BlockCursor::new(Adapter(inner)), }; - values_reader.read_exact_at(&mut layer.data, 0)?; - Ok(layer) + Ok(iter) } fn next_res(&mut self) -> Result> { if self.next_idx < self.all_offsets.len() { - let (key, lsn, blob_ref) = self.all_offsets[self.next_idx]; - let offs = blob_ref.pos() as usize; - let size = blob_ref.size(); - let val = Value::des(&self.data[offs..offs + size])?; + let (key, lsn, off) = &self.all_offsets[self.next_idx]; + + //let mut reader = BlobReader::new(self.inner.file.as_ref().unwrap()); + let buf = self.reader.read_blob(off.pos())?; + let val = Value::des(&buf)?; self.next_idx += 1; - Ok(Some((key, lsn, val))) + Ok(Some((*key, *lsn, val))) } else { Ok(None) } diff --git a/pageserver/src/layered_repository/ephemeral_file.rs b/pageserver/src/layered_repository/ephemeral_file.rs index 79a72f4563..d509186e6f 100644 --- a/pageserver/src/layered_repository/ephemeral_file.rs +++ b/pageserver/src/layered_repository/ephemeral_file.rs @@ -2,6 +2,8 @@ //! used to keep in-memory layers spilled on disk. 
use crate::config::PageServerConf; +use crate::layered_repository::blob_io::BlobWriter; +use crate::layered_repository::block_io::BlockReader; use crate::page_cache; use crate::page_cache::PAGE_SZ; use crate::page_cache::{ReadBufResult, WriteBufResult}; @@ -10,7 +12,7 @@ use lazy_static::lazy_static; use std::cmp::min; use std::collections::HashMap; use std::fs::OpenOptions; -use std::io::{Error, ErrorKind, Seek, SeekFrom, Write}; +use std::io::{Error, ErrorKind}; use std::ops::DerefMut; use std::path::PathBuf; use std::sync::{Arc, RwLock}; @@ -41,7 +43,7 @@ pub struct EphemeralFile { _timelineid: ZTimelineId, file: Arc, - pos: u64, + size: u64, } impl EphemeralFile { @@ -70,11 +72,11 @@ impl EphemeralFile { _tenantid: tenantid, _timelineid: timelineid, file: file_rc, - pos: 0, + size: 0, }) } - pub fn fill_buffer(&self, buf: &mut [u8], blkno: u32) -> Result<(), Error> { + fn fill_buffer(&self, buf: &mut [u8], blkno: u32) -> Result<(), Error> { let mut off = 0; while off < PAGE_SZ { let n = self @@ -93,6 +95,26 @@ impl EphemeralFile { } Ok(()) } + + fn get_buf_for_write(&self, blkno: u32) -> Result { + // Look up the right page + let cache = page_cache::get(); + let mut write_guard = match cache.write_ephemeral_buf(self.file_id, blkno) { + WriteBufResult::Found(guard) => guard, + WriteBufResult::NotFound(mut guard) => { + // Read the page from disk into the buffer + // TODO: if we're overwriting the whole page, no need to read it in first + self.fill_buffer(guard.deref_mut(), blkno)?; + guard.mark_valid(); + + // And then fall through to modify it. + guard + } + }; + write_guard.mark_dirty(); + + Ok(write_guard) + } } /// Does the given filename look like an ephemeral file? 
@@ -167,48 +189,49 @@ impl FileExt for EphemeralFile { } } -impl Write for EphemeralFile { - fn write(&mut self, buf: &[u8]) -> Result { - let n = self.write_at(buf, self.pos)?; - self.pos += n as u64; - Ok(n) - } +impl BlobWriter for EphemeralFile { + fn write_blob(&mut self, srcbuf: &[u8]) -> Result { + let pos = self.size; - fn flush(&mut self) -> Result<(), std::io::Error> { - // we don't need to flush data: - // * we either write input bytes or not, not keeping any intermediate data buffered - // * rust unix file `flush` impl does not flush things either, returning `Ok(())` - Ok(()) - } -} + let mut blknum = (self.size / PAGE_SZ as u64) as u32; + let mut off = (pos % PAGE_SZ as u64) as usize; -impl Seek for EphemeralFile { - fn seek(&mut self, pos: SeekFrom) -> Result { - match pos { - SeekFrom::Start(offset) => { - self.pos = offset; - } - SeekFrom::End(_offset) => { - return Err(Error::new( - ErrorKind::Other, - "SeekFrom::End not supported by EphemeralFile", - )); - } - SeekFrom::Current(offset) => { - let pos = self.pos as i128 + offset as i128; - if pos < 0 { - return Err(Error::new( - ErrorKind::InvalidInput, - "offset would be negative", - )); - } - if pos > u64::MAX as i128 { - return Err(Error::new(ErrorKind::InvalidInput, "offset overflow")); - } - self.pos = pos as u64; - } + let mut buf = self.get_buf_for_write(blknum)?; + + // Write the length field + let len_buf = u32::to_ne_bytes(srcbuf.len() as u32); + let thislen = PAGE_SZ - off; + if thislen < 4 { + // it needs to be split across pages + buf[off..(off + thislen)].copy_from_slice(&len_buf[..thislen]); + blknum += 1; + buf = self.get_buf_for_write(blknum)?; + buf[0..4 - thislen].copy_from_slice(&len_buf[thislen..]); + off = 4 - thislen; + } else { + buf[off..off + 4].copy_from_slice(&len_buf); + off += 4; } - Ok(self.pos) + + // Write the payload + let mut buf_remain = srcbuf; + while !buf_remain.is_empty() { + let mut page_remain = PAGE_SZ - off; + if page_remain == 0 { + blknum += 1; + buf = 
self.get_buf_for_write(blknum)?; + off = 0; + page_remain = PAGE_SZ; + } + let this_blk_len = min(page_remain, buf_remain.len()); + buf[off..(off + this_blk_len)].copy_from_slice(&buf_remain[..this_blk_len]); + off += this_blk_len; + buf_remain = &buf_remain[this_blk_len..]; + } + drop(buf); + self.size += 4 + srcbuf.len() as u64; + + Ok(pos) } } @@ -239,11 +262,34 @@ pub fn writeback(file_id: u64, blkno: u32, buf: &[u8]) -> Result<(), std::io::Er } } +impl BlockReader for EphemeralFile { + type BlockLease = page_cache::PageReadGuard<'static>; + + fn read_blk(&self, blknum: u32) -> Result { + // Look up the right page + let cache = page_cache::get(); + loop { + match cache.read_ephemeral_buf(self.file_id, blknum) { + ReadBufResult::Found(guard) => return Ok(guard), + ReadBufResult::NotFound(mut write_guard) => { + // Read the page from disk into the buffer + self.fill_buffer(write_guard.deref_mut(), blknum)?; + write_guard.mark_valid(); + + // Swap for read lock + continue; + } + }; + } + } +} + #[cfg(test)] mod tests { use super::*; - use rand::seq::SliceRandom; - use rand::thread_rng; + use crate::layered_repository::blob_io::{BlobCursor, BlobWriter}; + use crate::layered_repository::block_io::BlockCursor; + use rand::{seq::SliceRandom, thread_rng, RngCore}; use std::fs; use std::str::FromStr; @@ -281,19 +327,19 @@ mod tests { fn test_ephemeral_files() -> Result<(), Error> { let (conf, tenantid, timelineid) = repo_harness("ephemeral_files")?; - let mut file_a = EphemeralFile::create(conf, tenantid, timelineid)?; + let file_a = EphemeralFile::create(conf, tenantid, timelineid)?; - file_a.write_all(b"foo")?; + file_a.write_all_at(b"foo", 0)?; assert_eq!("foo", read_string(&file_a, 0, 20)?); - file_a.write_all(b"bar")?; + file_a.write_all_at(b"bar", 3)?; assert_eq!("foobar", read_string(&file_a, 0, 20)?); // Open a lot of files, enough to cause some page evictions. 
let mut efiles = Vec::new(); for fileno in 0..100 { - let mut efile = EphemeralFile::create(conf, tenantid, timelineid)?; - efile.write_all(format!("file {}", fileno).as_bytes())?; + let efile = EphemeralFile::create(conf, tenantid, timelineid)?; + efile.write_all_at(format!("file {}", fileno).as_bytes(), 0)?; assert_eq!(format!("file {}", fileno), read_string(&efile, 0, 10)?); efiles.push((fileno, efile)); } @@ -307,4 +353,41 @@ mod tests { Ok(()) } + + #[test] + fn test_ephemeral_blobs() -> Result<(), Error> { + let (conf, tenantid, timelineid) = repo_harness("ephemeral_blobs")?; + + let mut file = EphemeralFile::create(conf, tenantid, timelineid)?; + + let pos_foo = file.write_blob(b"foo")?; + assert_eq!(b"foo", file.block_cursor().read_blob(pos_foo)?.as_slice()); + let pos_bar = file.write_blob(b"bar")?; + assert_eq!(b"foo", file.block_cursor().read_blob(pos_foo)?.as_slice()); + assert_eq!(b"bar", file.block_cursor().read_blob(pos_bar)?.as_slice()); + + let mut blobs = Vec::new(); + for i in 0..10000 { + let data = Vec::from(format!("blob{}", i).as_bytes()); + let pos = file.write_blob(&data)?; + blobs.push((pos, data)); + } + + let mut cursor = BlockCursor::new(&file); + for (pos, expected) in blobs { + let actual = cursor.read_blob(pos)?; + assert_eq!(actual, expected); + } + drop(cursor); + + // Test a large blob that spans multiple pages + let mut large_data = Vec::new(); + large_data.resize(20000, 0); + thread_rng().fill_bytes(&mut large_data); + let pos_large = file.write_blob(&large_data)?; + let result = file.block_cursor().read_blob(pos_large)?; + assert_eq!(result, large_data); + + Ok(()) + } } diff --git a/pageserver/src/layered_repository/image_layer.rs b/pageserver/src/layered_repository/image_layer.rs index 68d1cd4a8a..a8e5de09f5 100644 --- a/pageserver/src/layered_repository/image_layer.rs +++ b/pageserver/src/layered_repository/image_layer.rs @@ -13,63 +13,70 @@ //! //! 
000000067F000032BE0000400000000070B6-000000067F000032BE0000400000000080B6__00000000346BC568 //! -//! An image file is constructed using the 'bookfile' crate. +//! Every image layer file consists of three parts: "summary", +//! "index", and "values". The summary is a fixed size header at the +//! beginning of the file, and it contains basic information about the +//! layer, and offsets to the other parts. The "index" is a serialized +//! HashMap, mapping from Key to an offset in the "values" part. The +//! actual page images are stored in the "values" part. //! -//! Only metadata is loaded into memory by the load function. +//! Only the "index" is loaded into memory by the load function. //! When images are needed, they are read directly from disk. //! use crate::config::PageServerConf; +use crate::layered_repository::blob_io::{BlobCursor, BlobWriter, WriteBlobWriter}; +use crate::layered_repository::block_io::{BlockReader, FileBlockReader}; use crate::layered_repository::filename::{ImageFileName, PathOrConf}; use crate::layered_repository::storage_layer::{ BlobRef, Layer, ValueReconstructResult, ValueReconstructState, }; +use crate::page_cache::PAGE_SZ; use crate::repository::{Key, Value}; use crate::virtual_file::VirtualFile; -use crate::IMAGE_FILE_MAGIC; use crate::{ZTenantId, ZTimelineId}; +use crate::{IMAGE_FILE_MAGIC, STORAGE_FORMAT_VERSION}; use anyhow::{bail, ensure, Context, Result}; use bytes::Bytes; use log::*; use serde::{Deserialize, Serialize}; use std::collections::HashMap; use std::fs; -use std::io::{BufWriter, Write}; +use std::io::Write; +use std::io::{Seek, SeekFrom}; use std::ops::Range; use std::path::{Path, PathBuf}; use std::sync::{RwLock, RwLockReadGuard, TryLockError}; -use bookfile::{Book, BookWriter, ChapterWriter}; - use zenith_utils::bin_ser::BeSer; use zenith_utils::lsn::Lsn; -/// Mapping from (key, lsn) -> page/WAL record -/// byte ranges in VALUES_CHAPTER -static INDEX_CHAPTER: u64 = 1; - -/// Contains each block in block # order 
-const VALUES_CHAPTER: u64 = 2; - -/// Contains the [`Summary`] struct -const SUMMARY_CHAPTER: u64 = 3; - #[derive(Debug, Serialize, Deserialize, PartialEq, Eq)] struct Summary { + /// Magic value to identify this as a zenith image file. Always IMAGE_FILE_MAGIC. + magic: u16, + format_version: u16, + tenantid: ZTenantId, timelineid: ZTimelineId, key_range: Range, - lsn: Lsn, + + /// Block number where the 'index' part of the file begins. + index_start_blk: u32, } impl From<&ImageLayer> for Summary { fn from(layer: &ImageLayer) -> Self { Self { + magic: IMAGE_FILE_MAGIC, + format_version: STORAGE_FORMAT_VERSION, tenantid: layer.tenantid, timelineid: layer.timelineid, key_range: layer.key_range.clone(), lsn: layer.lsn, + + index_start_blk: 0, } } } @@ -97,12 +104,14 @@ pub struct ImageLayerInner { /// If false, the 'index' has not been loaded into memory yet. loaded: bool, - /// The underlying (virtual) file handle. None if the layer hasn't been loaded - /// yet. - book: Option>, - /// offset of each value index: HashMap, + + // values copied from summary + index_start_blk: u32, + + /// Reader object for reading blocks from the file. 
(None if not loaded yet) + file: Option>, } impl Layer for ImageLayer { @@ -138,26 +147,21 @@ impl Layer for ImageLayer { assert!(lsn_range.end >= self.lsn); let inner = self.load()?; - if let Some(blob_ref) = inner.index.get(&key) { - let chapter = inner - .book + let buf = inner + .file .as_ref() .unwrap() - .chapter_reader(VALUES_CHAPTER)?; - - let mut blob = vec![0; blob_ref.size()]; - chapter - .read_exact_at(&mut blob, blob_ref.pos()) + .block_cursor() + .read_blob(blob_ref.pos()) .with_context(|| { format!( - "failed to read {} bytes from data file {} at offset {}", - blob_ref.size(), + "failed to read blob from data file {} at offset {}", self.filename().display(), blob_ref.pos() ) })?; - let value = Bytes::from(blob); + let value = Bytes::from(buf); reconstruct_state.img = Some((self.lsn, value)); Ok(ValueReconstructResult::Complete) @@ -228,12 +232,7 @@ impl Layer for ImageLayer { index_vec.sort_by_key(|x| x.1.pos()); for (key, blob_ref) in index_vec { - println!( - "key: {} size {} offset {}", - key, - blob_ref.size(), - blob_ref.pos() - ); + println!("key: {} offset {}", key, blob_ref.pos()); } Ok(()) @@ -291,21 +290,19 @@ impl ImageLayer { let path = self.path(); // Open the file if it's not open already. 
- if inner.book.is_none() { + if inner.file.is_none() { let file = VirtualFile::open(&path) .with_context(|| format!("Failed to open file '{}'", path.display()))?; - inner.book = Some(Book::new(file).with_context(|| { - format!("Failed to open file '{}' as a bookfile", path.display()) - })?); + inner.file = Some(FileBlockReader::new(file)); } - let book = inner.book.as_ref().unwrap(); + let file = inner.file.as_mut().unwrap(); + let summary_blk = file.read_blk(0)?; + let actual_summary = Summary::des_prefix(summary_blk.as_ref())?; match &self.path_or_conf { PathOrConf::Conf(_) => { - let chapter = book.read_chapter(SUMMARY_CHAPTER)?; - let actual_summary = Summary::des(&chapter)?; - - let expected_summary = Summary::from(self); + let mut expected_summary = Summary::from(self); + expected_summary.index_start_blk = actual_summary.index_start_blk; if actual_summary != expected_summary { bail!("in-file summary does not match expected summary. actual = {:?} expected = {:?}", actual_summary, expected_summary); @@ -325,14 +322,18 @@ impl ImageLayer { } } - let chapter = book.read_chapter(INDEX_CHAPTER)?; - let index = HashMap::des(&chapter)?; + file.file.seek(SeekFrom::Start( + actual_summary.index_start_blk as u64 * PAGE_SZ as u64, + ))?; + let mut buf_reader = std::io::BufReader::new(&mut file.file); + let index = HashMap::des_from(&mut buf_reader)?; + + inner.index_start_blk = actual_summary.index_start_blk; info!("loaded from {}", &path.display()); inner.index = index; inner.loaded = true; - Ok(()) } @@ -350,9 +351,10 @@ impl ImageLayer { key_range: filename.key_range.clone(), lsn: filename.lsn, inner: RwLock::new(ImageLayerInner { - book: None, index: HashMap::new(), loaded: false, + file: None, + index_start_blk: 0, }), } } @@ -360,12 +362,14 @@ impl ImageLayer { /// Create an ImageLayer struct representing an existing file on disk. /// /// This variant is only used for debugging purposes, by the 'dump_layerfile' binary. 
- pub fn new_for_path(path: &Path, book: &Book) -> Result + pub fn new_for_path(path: &Path, file: F) -> Result where F: std::os::unix::prelude::FileExt, { - let chapter = book.read_chapter(SUMMARY_CHAPTER)?; - let summary = Summary::des(&chapter)?; + let mut summary_buf = Vec::new(); + summary_buf.resize(PAGE_SZ, 0); + file.read_exact_at(&mut summary_buf, 0)?; + let summary = Summary::des_prefix(&summary_buf)?; Ok(ImageLayer { path_or_conf: PathOrConf::Path(path.to_path_buf()), @@ -374,9 +378,10 @@ impl ImageLayer { key_range: summary.key_range, lsn: summary.lsn, inner: RwLock::new(ImageLayerInner { - book: None, + file: None, index: HashMap::new(), loaded: false, + index_start_blk: 0, }), }) } @@ -412,18 +417,15 @@ impl ImageLayer { /// pub struct ImageLayerWriter { conf: &'static PageServerConf, - path: PathBuf, + _path: PathBuf, timelineid: ZTimelineId, tenantid: ZTenantId, key_range: Range, lsn: Lsn, - values_writer: Option>>, - end_offset: u64, - index: HashMap, - finished: bool, + blob_writer: WriteBlobWriter, } impl ImageLayerWriter { @@ -449,24 +451,17 @@ impl ImageLayerWriter { ); info!("new image layer {}", path.display()); let file = VirtualFile::create(&path)?; - let buf_writer = BufWriter::new(file); - let book = BookWriter::new(buf_writer, IMAGE_FILE_MAGIC)?; - - // Open the page-images chapter for writing. The calls to - // `put_image` will use this to write the contents. 
- let chapter = book.new_chapter(VALUES_CHAPTER); + let blob_writer = WriteBlobWriter::new(file, PAGE_SZ as u64); let writer = ImageLayerWriter { conf, - path, + _path: path, timelineid, tenantid, key_range: key_range.clone(), lsn, - values_writer: Some(chapter), index: HashMap::new(), - end_offset: 0, - finished: false, + blob_writer, }; Ok(writer) @@ -479,49 +474,41 @@ impl ImageLayerWriter { /// pub fn put_image(&mut self, key: Key, img: &[u8]) -> Result<()> { ensure!(self.key_range.contains(&key)); - let off = self.end_offset; + let off = self.blob_writer.write_blob(img)?; - if let Some(writer) = &mut self.values_writer { - let len = img.len(); - writer.write_all(img)?; - self.end_offset += len as u64; - - let old = self.index.insert(key, BlobRef::new(off, len, true)); - assert!(old.is_none()); - } else { - panic!() - } + let old = self.index.insert(key, BlobRef::new(off, true)); + assert!(old.is_none()); Ok(()) } - pub fn finish(&mut self) -> anyhow::Result { - // Close the values chapter - let book = self.values_writer.take().unwrap().close()?; + pub fn finish(self) -> anyhow::Result { + let index_start_blk = + ((self.blob_writer.size() + PAGE_SZ as u64 - 1) / PAGE_SZ as u64) as u32; + + let mut file = self.blob_writer.into_inner(); // Write out the index - let mut chapter = book.new_chapter(INDEX_CHAPTER); let buf = HashMap::ser(&self.index)?; - chapter.write_all(&buf)?; - let book = chapter.close()?; + file.seek(SeekFrom::Start(index_start_blk as u64 * PAGE_SZ as u64))?; + file.write_all(&buf)?; - // Write out the summary chapter - let mut chapter = book.new_chapter(SUMMARY_CHAPTER); + // Fill in the summary on blk 0 let summary = Summary { + magic: IMAGE_FILE_MAGIC, + format_version: STORAGE_FORMAT_VERSION, tenantid: self.tenantid, timelineid: self.timelineid, key_range: self.key_range.clone(), lsn: self.lsn, + index_start_blk, }; - Summary::ser_into(&summary, &mut chapter)?; - let book = chapter.close()?; - - // This flushes the underlying 'buf_writer'. 
- book.close()?; + file.seek(SeekFrom::Start(0))?; + Summary::ser_into(&summary, &mut file)?; // Note: Because we open the file in write-only mode, we cannot // reuse the same VirtualFile for reading later. That's why we don't - // set inner.book here. The first read will have to re-open it. + // set inner.file here. The first read will have to re-open it. let layer = ImageLayer { path_or_conf: PathOrConf::Conf(self.conf), timelineid: self.timelineid, @@ -529,28 +516,14 @@ impl ImageLayerWriter { key_range: self.key_range.clone(), lsn: self.lsn, inner: RwLock::new(ImageLayerInner { - book: None, loaded: false, index: HashMap::new(), + file: None, + index_start_blk, }), }; trace!("created image layer {}", layer.path().display()); - self.finished = true; - Ok(layer) } } - -impl Drop for ImageLayerWriter { - fn drop(&mut self) { - if let Some(page_image_writer) = self.values_writer.take() { - if let Ok(book) = page_image_writer.close() { - let _ = book.close(); - } - } - if !self.finished { - let _ = fs::remove_file(&self.path); - } - } -} diff --git a/pageserver/src/layered_repository/inmemory_layer.rs b/pageserver/src/layered_repository/inmemory_layer.rs index 8670442a2c..8a24528732 100644 --- a/pageserver/src/layered_repository/inmemory_layer.rs +++ b/pageserver/src/layered_repository/inmemory_layer.rs @@ -5,10 +5,12 @@ //! its position in the file, is kept in memory, though. //! 
use crate::config::PageServerConf; +use crate::layered_repository::blob_io::{BlobCursor, BlobWriter}; +use crate::layered_repository::block_io::BlockReader; use crate::layered_repository::delta_layer::{DeltaLayer, DeltaLayerWriter}; use crate::layered_repository::ephemeral_file::EphemeralFile; use crate::layered_repository::storage_layer::{ - BlobRef, Layer, ValueReconstructResult, ValueReconstructState, + Layer, ValueReconstructResult, ValueReconstructState, }; use crate::repository::{Key, Value}; use crate::walrecord; @@ -19,9 +21,7 @@ use std::collections::HashMap; // avoid binding to Write (conflicts with std::io::Write) // while being able to use std::fmt::Write's methods use std::fmt::Write as _; -use std::io::Write; use std::ops::Range; -use std::os::unix::fs::FileExt; use std::path::PathBuf; use std::sync::RwLock; use zenith_utils::bin_ser::BeSer; @@ -54,14 +54,12 @@ pub struct InMemoryLayerInner { /// by block number and LSN. The value is an offset into the /// ephemeral file where the page version is stored. /// - index: HashMap>, + index: HashMap>, /// The values are stored in a serialized format in this file. /// Each serialized Value is preceded by a 'u32' length field. /// PerSeg::page_versions map stores offsets into this file. file: EphemeralFile, - - end_offset: u64, } impl InMemoryLayerInner { @@ -120,10 +118,12 @@ impl Layer for InMemoryLayer { let inner = self.inner.read().unwrap(); + let mut reader = inner.file.block_cursor(); + // Scan the page versions backwards, starting from `lsn`. 
if let Some(vec_map) = inner.index.get(&key) { let slice = vec_map.slice_range(lsn_range); - for (entry_lsn, blob_ref) in slice.iter().rev() { + for (entry_lsn, pos) in slice.iter().rev() { match &reconstruct_state.img { Some((cached_lsn, _)) if entry_lsn <= cached_lsn => { return Ok(ValueReconstructResult::Complete) @@ -131,8 +131,7 @@ impl Layer for InMemoryLayer { _ => {} } - let mut buf = vec![0u8; blob_ref.size()]; - inner.file.read_exact_at(&mut buf, blob_ref.pos())?; + let buf = reader.read_blob(*pos)?; let value = Value::des(&buf)?; match value { Value::Image(img) => { @@ -208,12 +207,12 @@ impl Layer for InMemoryLayer { return Ok(()); } + let mut cursor = inner.file.block_cursor(); let mut buf = Vec::new(); for (key, vec_map) in inner.index.iter() { - for (lsn, blob_ref) in vec_map.as_slice() { + for (lsn, pos) in vec_map.as_slice() { let mut desc = String::new(); - buf.resize(blob_ref.size(), 0); - inner.file.read_exact_at(&mut buf, blob_ref.pos())?; + cursor.read_blob_into_buf(*pos, &mut buf)?; let val = Value::des(&buf); match val { Ok(Value::Image(img)) => { @@ -268,7 +267,6 @@ impl InMemoryLayer { end_lsn: None, index: HashMap::new(), file, - end_offset: 0, }), }) } @@ -283,15 +281,10 @@ impl InMemoryLayer { inner.assert_writeable(); - let off = inner.end_offset; - let buf = Value::ser(&val)?; - let len = buf.len(); - inner.file.write_all(&buf)?; - inner.end_offset += len as u64; + let off = inner.file.write_blob(&Value::ser(&val)?)?; let vec_map = inner.index.entry(key).or_default(); - let blob_ref = BlobRef::new(off, len, val.will_init()); - let old = vec_map.append_or_update_last(lsn, blob_ref).unwrap().0; + let old = vec_map.append_or_update_last(lsn, off).unwrap().0; if old.is_some() { // We already had an entry for this LSN. That's odd.. 
warn!("Key {} at {} already exists", key, lsn); @@ -345,21 +338,21 @@ impl InMemoryLayer { self.start_lsn..inner.end_lsn.unwrap(), )?; - let mut do_steps = || -> Result<()> { - for (key, vec_map) in inner.index.iter() { - // Write all page versions - for (lsn, blob_ref) in vec_map.as_slice() { - let mut buf = vec![0u8; blob_ref.size()]; - inner.file.read_exact_at(&mut buf, blob_ref.pos())?; - let val = Value::des(&buf)?; - delta_layer_writer.put_value(*key, *lsn, val)?; - } + let mut buf = Vec::new(); + + let mut cursor = inner.file.block_cursor(); + + let mut keys: Vec<(&Key, &VecMap)> = inner.index.iter().collect(); + keys.sort_by_key(|k| k.0); + + for (key, vec_map) in keys.iter() { + let key = **key; + // Write all page versions + for (lsn, pos) in vec_map.as_slice() { + cursor.read_blob_into_buf(*pos, &mut buf)?; + let val = Value::des(&buf)?; + delta_layer_writer.put_value(key, *lsn, val)?; } - Ok(()) - }; - if let Err(err) = do_steps() { - delta_layer_writer.abort(); - return Err(err); } let delta_layer = delta_layer_writer.finish(Key::MAX)?; diff --git a/pageserver/src/layered_repository/storage_layer.rs b/pageserver/src/layered_repository/storage_layer.rs index 2711640736..b5366da223 100644 --- a/pageserver/src/layered_repository/storage_layer.rs +++ b/pageserver/src/layered_repository/storage_layer.rs @@ -150,9 +150,10 @@ pub trait Layer: Send + Sync { const WILL_INIT: u64 = 1; /// -/// Struct representing reference to BLOB in layers. Reference contains BLOB offset and size. -/// For WAL records (delta layer) it also contains `will_init` flag which helps to determine range of records -/// which needs to be applied without reading/deserializing records themselves. +/// Struct representing reference to BLOB in layers. Reference contains BLOB +/// offset, and for WAL records it also contains `will_init` flag. The flag +/// helps to determine the range of records that needs to be applied, without +/// reading/deserializing records themselves. 
/// #[derive(Debug, Serialize, Deserialize, Copy, Clone)] pub struct BlobRef(u64); @@ -163,15 +164,11 @@ impl BlobRef { } pub fn pos(&self) -> u64 { - self.0 >> 32 + self.0 >> 1 } - pub fn size(&self) -> usize { - ((self.0 & 0xFFFFFFFF) >> 1) as usize - } - - pub fn new(pos: u64, size: usize, will_init: bool) -> BlobRef { - let mut blob_ref = (pos << 32) | ((size as u64) << 1); + pub fn new(pos: u64, will_init: bool) -> BlobRef { + let mut blob_ref = pos << 1; if will_init { blob_ref |= WILL_INIT; } diff --git a/pageserver/src/lib.rs b/pageserver/src/lib.rs index 4790ab6652..6d2631b2b1 100644 --- a/pageserver/src/lib.rs +++ b/pageserver/src/lib.rs @@ -38,11 +38,11 @@ use pgdatadir_mapping::DatadirTimeline; /// This is embedded in the metadata file, and also in the header of all the /// layer files. If you make any backwards-incompatible changes to the storage /// format, bump this! -pub const STORAGE_FORMAT_VERSION: u16 = 1; +pub const STORAGE_FORMAT_VERSION: u16 = 2; // Magic constants used to identify different kinds of files -pub const IMAGE_FILE_MAGIC: u32 = 0x5A60_0000 | STORAGE_FORMAT_VERSION as u32; -pub const DELTA_FILE_MAGIC: u32 = 0x5A61_0000 | STORAGE_FORMAT_VERSION as u32; +pub const IMAGE_FILE_MAGIC: u16 = 0x5A60; +pub const DELTA_FILE_MAGIC: u16 = 0x5A61; lazy_static! { static ref LIVE_CONNECTIONS_COUNT: IntGaugeVec = register_int_gauge_vec!( diff --git a/pageserver/src/page_cache.rs b/pageserver/src/page_cache.rs index c485e46f47..bd44384a44 100644 --- a/pageserver/src/page_cache.rs +++ b/pageserver/src/page_cache.rs @@ -56,7 +56,7 @@ use crate::layered_repository::writeback_ephemeral_file; use crate::repository::Key; static PAGE_CACHE: OnceCell = OnceCell::new(); -const TEST_PAGE_CACHE_SIZE: usize = 10; +const TEST_PAGE_CACHE_SIZE: usize = 50; /// /// Initialize the page cache. This must be called once at page server startup. @@ -90,6 +90,7 @@ const MAX_USAGE_COUNT: u8 = 5; /// CacheKey uniquely identifies a "thing" to cache in the page cache. 
/// #[derive(Debug, PartialEq, Eq, Clone)] +#[allow(clippy::enum_variant_names)] enum CacheKey { MaterializedPage { hash_key: MaterializedPageHashKey, @@ -99,6 +100,10 @@ enum CacheKey { file_id: u64, blkno: u32, }, + ImmutableFilePage { + file_id: u64, + blkno: u32, + }, } #[derive(Debug, PartialEq, Eq, Hash, Clone)] @@ -173,6 +178,8 @@ pub struct PageCache { ephemeral_page_map: RwLock>, + immutable_page_map: RwLock>, + /// The actual buffers with their metadata. slots: Box<[Slot]>, @@ -195,6 +202,12 @@ impl std::ops::Deref for PageReadGuard<'_> { } } +impl AsRef<[u8; PAGE_SZ]> for PageReadGuard<'_> { + fn as_ref(&self) -> &[u8; PAGE_SZ] { + self.0.buf + } +} + /// /// PageWriteGuard is a lease on a buffer for modifying it. The page is kept locked /// until the guard is dropped. @@ -226,6 +239,12 @@ impl std::ops::Deref for PageWriteGuard<'_> { } } +impl AsMut<[u8; PAGE_SZ]> for PageWriteGuard<'_> { + fn as_mut(&mut self) -> &mut [u8; PAGE_SZ] { + self.inner.buf + } +} + impl PageWriteGuard<'_> { /// Mark that the buffer contents are now valid. pub fn mark_valid(&mut self) { @@ -381,6 +400,36 @@ impl PageCache { } } + // Section 1.3: Public interface functions for working with immutable file pages. 
+ + pub fn read_immutable_buf(&self, file_id: u64, blkno: u32) -> ReadBufResult { + let mut cache_key = CacheKey::ImmutableFilePage { file_id, blkno }; + + self.lock_for_read(&mut cache_key) + } + + /// Immediately drop all buffers belonging to given file, without writeback + pub fn drop_buffers_for_immutable(&self, drop_file_id: u64) { + for slot_idx in 0..self.slots.len() { + let slot = &self.slots[slot_idx]; + + let mut inner = slot.inner.write().unwrap(); + if let Some(key) = &inner.key { + match key { + CacheKey::ImmutableFilePage { file_id, blkno: _ } + if *file_id == drop_file_id => + { + // remove mapping for old buffer + self.remove_mapping(key); + inner.key = None; + inner.dirty = false; + } + _ => {} + } + } + } + } + // // Section 2: Internal interface functions for lookup/update. // @@ -578,6 +627,10 @@ impl PageCache { let map = self.ephemeral_page_map.read().unwrap(); Some(*map.get(&(*file_id, *blkno))?) } + CacheKey::ImmutableFilePage { file_id, blkno } => { + let map = self.immutable_page_map.read().unwrap(); + Some(*map.get(&(*file_id, *blkno))?) + } } } @@ -601,6 +654,10 @@ impl PageCache { let map = self.ephemeral_page_map.read().unwrap(); Some(*map.get(&(*file_id, *blkno))?) } + CacheKey::ImmutableFilePage { file_id, blkno } => { + let map = self.immutable_page_map.read().unwrap(); + Some(*map.get(&(*file_id, *blkno))?) 
+ } } } @@ -632,6 +689,11 @@ impl PageCache { map.remove(&(*file_id, *blkno)) .expect("could not find old key in mapping"); } + CacheKey::ImmutableFilePage { file_id, blkno } => { + let mut map = self.immutable_page_map.write().unwrap(); + map.remove(&(*file_id, *blkno)) + .expect("could not find old key in mapping"); + } } } @@ -672,6 +734,16 @@ impl PageCache { } } } + CacheKey::ImmutableFilePage { file_id, blkno } => { + let mut map = self.immutable_page_map.write().unwrap(); + match map.entry((*file_id, *blkno)) { + Entry::Occupied(entry) => Some(*entry.get()), + Entry::Vacant(entry) => { + entry.insert(slot_idx); + None + } + } + } } } @@ -749,6 +821,13 @@ impl PageCache { CacheKey::EphemeralPage { file_id, blkno } => { writeback_ephemeral_file(*file_id, *blkno, buf) } + CacheKey::ImmutableFilePage { + file_id: _, + blkno: _, + } => Err(std::io::Error::new( + std::io::ErrorKind::Other, + "unexpected dirty immutable page", + )), } } @@ -779,6 +858,7 @@ impl PageCache { Self { materialized_page_map: Default::default(), ephemeral_page_map: Default::default(), + immutable_page_map: Default::default(), slots, next_evict_slot: AtomicUsize::new(0), } diff --git a/pageserver/src/virtual_file.rs b/pageserver/src/virtual_file.rs index 858cff29cb..64f9db2338 100644 --- a/pageserver/src/virtual_file.rs +++ b/pageserver/src/virtual_file.rs @@ -65,6 +65,7 @@ lazy_static! { /// currently open, the 'handle' can still point to the slot where it was last kept. The /// 'tag' field is used to detect whether the handle still is valid or not. /// +#[derive(Debug)] pub struct VirtualFile { /// Lazy handle to the global file descriptor cache. 
The slot that this points to
     /// might contain our File, or it may be empty, or it may contain a File that
@@ -88,7 +89,7 @@ pub struct VirtualFile {
     timelineid: String,
 }
 
-#[derive(PartialEq, Clone, Copy)]
+#[derive(Debug, PartialEq, Clone, Copy)]
 struct SlotHandle {
     /// Index into OPEN_FILES.slots
     index: usize,
From c4b57e4b8fb55360bdb77cc9165be8fc31b0b469 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Thu, 7 Apr 2022 20:50:12 +0300
Subject: [PATCH 069/296] Move BlobRef

It's not needed in image layers anymore, so move it into delta_layer.rs
---
 pageserver/src/layered_repository/blob_io.rs | 17 ++++++++++
 pageserver/src/layered_repository/block_io.rs | 2 --
 .../src/layered_repository/delta_layer.rs | 32 ++++++++++++++++++-
 .../src/layered_repository/image_layer.rs | 21 ++++++------
 .../src/layered_repository/storage_layer.rs | 31 ------------------
 5 files changed, 57 insertions(+), 46 deletions(-)

diff --git a/pageserver/src/layered_repository/blob_io.rs b/pageserver/src/layered_repository/blob_io.rs
index 10bfea934d..aa90bbd0cf 100644
--- a/pageserver/src/layered_repository/blob_io.rs
+++ b/pageserver/src/layered_repository/blob_io.rs
@@ -10,12 +10,15 @@ use std::io::Error;
 
 /// For reading
 pub trait BlobCursor {
+    /// Read a blob into a new buffer.
     fn read_blob(&mut self, offset: u64) -> Result<Vec<u8>, std::io::Error> {
         let mut buf = Vec::new();
         self.read_blob_into_buf(offset, &mut buf)?;
         Ok(buf)
     }
 
+    /// Read blob into the given buffer. Any previous contents in the buffer
+    /// are overwritten.
     fn read_blob_into_buf(
         &mut self,
         offset: u64,
@@ -75,10 +78,19 @@ where
     }
 }
 
+///
+/// Abstract trait for a data sink that you can write blobs to.
+///
 pub trait BlobWriter {
+    /// Write a blob of data. Returns the offset that it was written to,
+    /// which can be used to retrieve the data later.
     fn write_blob(&mut self, srcbuf: &[u8]) -> Result<u64, Error>;
 }
 
+///
+/// An implementation of BlobWriter to write blobs to anything that
+/// implements std::io::Write.
+/// pub struct WriteBlobWriter where W: std::io::Write, @@ -102,6 +114,11 @@ where self.offset } + /// Access the underlying Write object. + /// + /// NOTE: WriteBlobWriter keeps track of the current write offset. If + /// you write something directly to the inner Write object, it makes the + /// internally tracked 'offset' to go out of sync. So don't do that. pub fn into_inner(self) -> W { self.inner } diff --git a/pageserver/src/layered_repository/block_io.rs b/pageserver/src/layered_repository/block_io.rs index 2b8e31e1ee..a8992a6cb5 100644 --- a/pageserver/src/layered_repository/block_io.rs +++ b/pageserver/src/layered_repository/block_io.rs @@ -1,8 +1,6 @@ //! //! Low-level Block-oriented I/O functions //! -//! -//! use crate::page_cache; use crate::page_cache::{ReadBufResult, PAGE_SZ}; diff --git a/pageserver/src/layered_repository/delta_layer.rs b/pageserver/src/layered_repository/delta_layer.rs index f8828b541f..43122fd99d 100644 --- a/pageserver/src/layered_repository/delta_layer.rs +++ b/pageserver/src/layered_repository/delta_layer.rs @@ -35,7 +35,7 @@ use crate::layered_repository::blob_io::{BlobCursor, BlobWriter, WriteBlobWriter use crate::layered_repository::block_io::{BlockCursor, BlockReader, FileBlockReader}; use crate::layered_repository::filename::{DeltaFileName, PathOrConf}; use crate::layered_repository::storage_layer::{ - BlobRef, Layer, ValueReconstructResult, ValueReconstructState, + Layer, ValueReconstructResult, ValueReconstructState, }; use crate::page_cache::{PageReadGuard, PAGE_SZ}; use crate::repository::{Key, Value}; @@ -93,6 +93,36 @@ impl From<&DeltaLayer> for Summary { } } +// Flag indicating that this version initialize the page +const WILL_INIT: u64 = 1; + +/// +/// Struct representing reference to BLOB in layers. Reference contains BLOB +/// offset, and for WAL records it also contains `will_init` flag. 
The flag +/// helps to determine the range of records that needs to be applied, without +/// reading/deserializing records themselves. +/// +#[derive(Debug, Serialize, Deserialize, Copy, Clone)] +struct BlobRef(u64); + +impl BlobRef { + pub fn will_init(&self) -> bool { + (self.0 & WILL_INIT) != 0 + } + + pub fn pos(&self) -> u64 { + self.0 >> 1 + } + + pub fn new(pos: u64, will_init: bool) -> BlobRef { + let mut blob_ref = pos << 1; + if will_init { + blob_ref |= WILL_INIT; + } + BlobRef(blob_ref) + } +} + /// /// DeltaLayer is the in-memory data structure associated with an /// on-disk delta file. We keep a DeltaLayer in memory for each diff --git a/pageserver/src/layered_repository/image_layer.rs b/pageserver/src/layered_repository/image_layer.rs index a8e5de09f5..d0afce1549 100644 --- a/pageserver/src/layered_repository/image_layer.rs +++ b/pageserver/src/layered_repository/image_layer.rs @@ -28,7 +28,7 @@ use crate::layered_repository::blob_io::{BlobCursor, BlobWriter, WriteBlobWriter use crate::layered_repository::block_io::{BlockReader, FileBlockReader}; use crate::layered_repository::filename::{ImageFileName, PathOrConf}; use crate::layered_repository::storage_layer::{ - BlobRef, Layer, ValueReconstructResult, ValueReconstructState, + Layer, ValueReconstructResult, ValueReconstructState, }; use crate::page_cache::PAGE_SZ; use crate::repository::{Key, Value}; @@ -105,7 +105,7 @@ pub struct ImageLayerInner { loaded: bool, /// offset of each value - index: HashMap, + index: HashMap, // values copied from summary index_start_blk: u32, @@ -147,18 +147,18 @@ impl Layer for ImageLayer { assert!(lsn_range.end >= self.lsn); let inner = self.load()?; - if let Some(blob_ref) = inner.index.get(&key) { + if let Some(&offset) = inner.index.get(&key) { let buf = inner .file .as_ref() .unwrap() .block_cursor() - .read_blob(blob_ref.pos()) + .read_blob(offset) .with_context(|| { format!( "failed to read blob from data file {} at offset {}", self.filename().display(), - 
blob_ref.pos() + offset ) })?; let value = Bytes::from(buf); @@ -228,11 +228,8 @@ impl Layer for ImageLayer { let inner = self.load()?; - let mut index_vec: Vec<(&Key, &BlobRef)> = inner.index.iter().collect(); - index_vec.sort_by_key(|x| x.1.pos()); - - for (key, blob_ref) in index_vec { - println!("key: {} offset {}", key, blob_ref.pos()); + for (key, offset) in inner.index.iter() { + println!("key: {} offset {}", key, offset); } Ok(()) @@ -423,7 +420,7 @@ pub struct ImageLayerWriter { key_range: Range, lsn: Lsn, - index: HashMap, + index: HashMap, blob_writer: WriteBlobWriter, } @@ -476,7 +473,7 @@ impl ImageLayerWriter { ensure!(self.key_range.contains(&key)); let off = self.blob_writer.write_blob(img)?; - let old = self.index.insert(key, BlobRef::new(off, true)); + let old = self.index.insert(key, off); assert!(old.is_none()); Ok(()) diff --git a/pageserver/src/layered_repository/storage_layer.rs b/pageserver/src/layered_repository/storage_layer.rs index b5366da223..5ad43182f6 100644 --- a/pageserver/src/layered_repository/storage_layer.rs +++ b/pageserver/src/layered_repository/storage_layer.rs @@ -7,7 +7,6 @@ use crate::walrecord::ZenithWalRecord; use crate::{ZTenantId, ZTimelineId}; use anyhow::Result; use bytes::Bytes; -use serde::{Deserialize, Serialize}; use std::ops::Range; use std::path::PathBuf; @@ -145,33 +144,3 @@ pub trait Layer: Send + Sync { /// Dump summary of the contents of the layer to stdout fn dump(&self, verbose: bool) -> Result<()>; } - -// Flag indicating that this version initialize the page -const WILL_INIT: u64 = 1; - -/// -/// Struct representing reference to BLOB in layers. Reference contains BLOB -/// offset, and for WAL records it also contains `will_init` flag. The flag -/// helps to determine the range of records that needs to be applied, without -/// reading/deserializing records themselves. 
-///
-#[derive(Debug, Serialize, Deserialize, Copy, Clone)]
-pub struct BlobRef(u64);
-
-impl BlobRef {
-    pub fn will_init(&self) -> bool {
-        (self.0 & WILL_INIT) != 0
-    }
-
-    pub fn pos(&self) -> u64 {
-        self.0 >> 1
-    }
-
-    pub fn new(pos: u64, will_init: bool) -> BlobRef {
-        let mut blob_ref = pos << 1;
-        if will_init {
-            blob_ref |= WILL_INIT;
-        }
-        BlobRef(blob_ref)
-    }
-}
From 214567bf8fafed56cd867698d9e54fafc7001b45 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Thu, 7 Apr 2022 20:50:16 +0300
Subject: [PATCH 070/296] Use B-tree for the index in image and delta layers.

We now use a page cache for those, instead of slurping the whole index
into memory.

Fixes https://github.com/zenithdb/zenith/issues/1356

This is a backwards-incompatible change to the storage format, so
bump STORAGE_FORMAT_VERSION.
---
 Cargo.lock | 1 +
 pageserver/Cargo.toml | 1 +
 pageserver/src/layered_repository.rs | 10 +-
 pageserver/src/layered_repository/block_io.rs | 45 +
 .../src/layered_repository/delta_layer.rs | 290 ++-
 .../src/layered_repository/disk_btree.rs | 979 ++++++++
 .../disk_btree_test_data.rs | 2013 +++++++++++++++++
 .../src/layered_repository/image_layer.rs | 144 +-
 .../src/layered_repository/inmemory_layer.rs | 7 -
 .../src/layered_repository/storage_layer.rs | 4 -
 pageserver/src/lib.rs | 2 +-
 pageserver/src/repository.rs | 16 +-
 12 files changed, 3287 insertions(+), 225 deletions(-)
 create mode 100644 pageserver/src/layered_repository/disk_btree.rs
 create mode 100644 pageserver/src/layered_repository/disk_btree_test_data.rs

diff --git a/Cargo.lock b/Cargo.lock
index e0b6288f63..19ccd18a10 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -1499,6 +1499,7 @@ dependencies = [
  "daemonize",
  "fail",
  "futures",
+ "hex",
  "hex-literal",
  "humantime",
  "hyper",
diff --git a/pageserver/Cargo.toml b/pageserver/Cargo.toml
index a5283cb331..4d79811bfb 100644
--- a/pageserver/Cargo.toml
+++ b/pageserver/Cargo.toml
@@ -10,6 +10,7 @@ regex = "1.4.5"
 bytes = { version = "1.0.1", features =
['serde'] } byteorder = "1.4.3" futures = "0.3.13" +hex = "0.4.3" hyper = "0.14" itertools = "0.10.3" lazy_static = "1.4.0" diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 5adf4a89ff..d7a250f31e 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -58,6 +58,7 @@ use zenith_utils::seqwait::SeqWait; mod blob_io; pub mod block_io; mod delta_layer; +mod disk_btree; pub(crate) mod ephemeral_file; mod filename; mod image_layer; @@ -1602,15 +1603,6 @@ impl LayeredTimeline { debug!("Could not compact because no partitioning specified yet"); } - // Call unload() on all frozen layers, to release memory. - // This shouldn't be much memory, as only metadata is slurped - // into memory. - let layers = self.layers.lock().unwrap(); - for layer in layers.iter_historic_layers() { - layer.unload()?; - } - drop(layers); - Ok(()) } diff --git a/pageserver/src/layered_repository/block_io.rs b/pageserver/src/layered_repository/block_io.rs index a8992a6cb5..2eba0aa403 100644 --- a/pageserver/src/layered_repository/block_io.rs +++ b/pageserver/src/layered_repository/block_io.rs @@ -4,6 +4,7 @@ use crate::page_cache; use crate::page_cache::{ReadBufResult, PAGE_SZ}; +use bytes::Bytes; use lazy_static::lazy_static; use std::ops::{Deref, DerefMut}; use std::os::unix::fs::FileExt; @@ -172,3 +173,47 @@ where } } } + +/// +/// Trait for block-oriented output +/// +pub trait BlockWriter { + /// + /// Write a page to the underlying storage. + /// + /// 'buf' must be of size PAGE_SZ. Returns the block number the page was + /// written to. + /// + fn write_blk(&mut self, buf: Bytes) -> Result; +} + +/// +/// A simple in-memory buffer of blocks. 
+/// +pub struct BlockBuf { + pub blocks: Vec, +} +impl BlockWriter for BlockBuf { + fn write_blk(&mut self, buf: Bytes) -> Result { + assert!(buf.len() == PAGE_SZ); + let blknum = self.blocks.len(); + self.blocks.push(buf); + tracing::info!("buffered block {}", blknum); + Ok(blknum as u32) + } +} + +impl BlockBuf { + pub fn new() -> Self { + BlockBuf { blocks: Vec::new() } + } + + pub fn size(&self) -> u64 { + (self.blocks.len() * PAGE_SZ) as u64 + } +} +impl Default for BlockBuf { + fn default() -> Self { + Self::new() + } +} diff --git a/pageserver/src/layered_repository/delta_layer.rs b/pageserver/src/layered_repository/delta_layer.rs index 43122fd99d..dd6b5d3afa 100644 --- a/pageserver/src/layered_repository/delta_layer.rs +++ b/pageserver/src/layered_repository/delta_layer.rs @@ -7,14 +7,8 @@ //! must be page images or WAL records with the 'will_init' flag set, so that //! they can be replayed without referring to an older page version. //! -//! When a delta file needs to be accessed, we slurp the 'index' metadata -//! into memory, into the DeltaLayerInner struct. See load() and unload() functions. -//! To access a particular value, we search `index` for the given key. -//! The byte offset in the index can be used to find the value in -//! VALUES_CHAPTER. -//! -//! On disk, the delta files are stored in timelines/ directory. -//! Currently, there are no subdirectories, and each delta file is named like this: +//! The delta files are stored in timelines/ directory. Currently, +//! there are no subdirectories, and each delta file is named like this: //! //! -__- for Summary { @@ -89,6 +89,7 @@ impl From<&DeltaLayer> for Summary { lsn_range: layer.lsn_range.clone(), index_start_blk: 0, + index_root_blk: 0, } } } @@ -123,6 +124,46 @@ impl BlobRef { } } +const DELTA_KEY_SIZE: usize = KEY_SIZE + 8; +struct DeltaKey([u8; DELTA_KEY_SIZE]); + +/// +/// This is the key of the B-tree index stored in the delta layer. 
It consists +/// of the serialized representation of a Key and LSN. +/// +impl DeltaKey { + fn from_slice(buf: &[u8]) -> Self { + let mut bytes: [u8; DELTA_KEY_SIZE] = [0u8; DELTA_KEY_SIZE]; + bytes.copy_from_slice(buf); + DeltaKey(bytes) + } + + fn from_key_lsn(key: &Key, lsn: Lsn) -> Self { + let mut bytes: [u8; DELTA_KEY_SIZE] = [0u8; DELTA_KEY_SIZE]; + key.write_to_byte_slice(&mut bytes[0..KEY_SIZE]); + bytes[KEY_SIZE..].copy_from_slice(&u64::to_be_bytes(lsn.0)); + DeltaKey(bytes) + } + + fn key(&self) -> Key { + Key::from_slice(&self.0) + } + + fn lsn(&self) -> Lsn { + Lsn(u64::from_be_bytes(self.0[KEY_SIZE..].try_into().unwrap())) + } + + fn extract_key_from_buf(buf: &[u8]) -> Key { + Key::from_slice(&buf[..KEY_SIZE]) + } + + fn extract_lsn_from_buf(buf: &[u8]) -> Lsn { + let mut lsn_buf = [0u8; 8]; + lsn_buf.copy_from_slice(&buf[KEY_SIZE..]); + Lsn(u64::from_be_bytes(lsn_buf)) + } +} + /// /// DeltaLayer is the in-memory data structure associated with an /// on-disk delta file. We keep a DeltaLayer in memory for each @@ -143,18 +184,12 @@ pub struct DeltaLayer { } pub struct DeltaLayerInner { - /// If false, the 'index' has not been loaded into memory yet. + /// If false, the fields below have not been loaded into memory yet. loaded: bool, - /// - /// All versions of all pages in the layer are kept here. - /// Indexed by block number and LSN. The value is an offset into the - /// chapter where the page version is stored. - /// - index: HashMap>, - // values copied from summary index_start_blk: u32, + index_root_blk: u32, /// Reader object for reading blocks from the file. (None if not loaded yet) file: Option>, @@ -196,27 +231,46 @@ impl Layer for DeltaLayer { let inner = self.load()?; // Scan the page versions backwards, starting from `lsn`. 
- if let Some(vec_map) = inner.index.get(&key) { - let mut reader = inner.file.as_ref().unwrap().block_cursor(); - let slice = vec_map.slice_range(lsn_range); - for (entry_lsn, blob_ref) in slice.iter().rev() { - let buf = reader.read_blob(blob_ref.pos())?; - let val = Value::des(&buf)?; - match val { - Value::Image(img) => { - reconstruct_state.img = Some((*entry_lsn, img)); + let file = inner.file.as_ref().unwrap(); + let tree_reader = DiskBtreeReader::<_, DELTA_KEY_SIZE>::new( + inner.index_start_blk, + inner.index_root_blk, + file, + ); + let search_key = DeltaKey::from_key_lsn(&key, Lsn(lsn_range.end.0 - 1)); + + let mut offsets: Vec<(Lsn, u64)> = Vec::new(); + + tree_reader.visit(&search_key.0, VisitDirection::Backwards, |key, value| { + let blob_ref = BlobRef(value); + if key[..KEY_SIZE] != search_key.0[..KEY_SIZE] { + return false; + } + let entry_lsn = DeltaKey::extract_lsn_from_buf(key); + offsets.push((entry_lsn, blob_ref.pos())); + + !blob_ref.will_init() + })?; + + // Ok, 'offsets' now contains the offsets of all the entries we need to read + let mut cursor = file.block_cursor(); + for (entry_lsn, pos) in offsets { + let buf = cursor.read_blob(pos)?; + let val = Value::des(&buf)?; + match val { + Value::Image(img) => { + reconstruct_state.img = Some((entry_lsn, img)); + need_image = false; + break; + } + Value::WalRecord(rec) => { + let will_init = rec.will_init(); + reconstruct_state.records.push((entry_lsn, rec)); + if will_init { + // This WAL record initializes the page, so no need to go further back need_image = false; break; } - Value::WalRecord(rec) => { - let will_init = rec.will_init(); - reconstruct_state.records.push((*entry_lsn, rec)); - if will_init { - // This WAL record initializes the page, so no need to go further back - need_image = false; - break; - } - } } } } @@ -241,36 +295,6 @@ impl Layer for DeltaLayer { } } - /// - /// Release most of the memory used by this layer. 
If it's accessed again later, - /// it will need to be loaded back. - /// - fn unload(&self) -> Result<()> { - // FIXME: In debug mode, loading and unloading the index slows - // things down so much that you get timeout errors. At least - // with the test_parallel_copy test. So as an even more ad hoc - // stopgap fix for that, only unload every on average 10 - // checkpoint cycles. - use rand::RngCore; - if rand::thread_rng().next_u32() > (u32::MAX / 10) { - return Ok(()); - } - - let mut inner = match self.inner.try_write() { - Ok(inner) => inner, - Err(TryLockError::WouldBlock) => return Ok(()), - Err(TryLockError::Poisoned(_)) => panic!("DeltaLayer lock was poisoned"), - }; - inner.index = HashMap::default(); - inner.loaded = false; - - // Note: we keep the Book open. Is that a good idea? The virtual file - // machinery has its own rules for closing the file descriptor if it's not - // needed, but the Book struct uses up some memory, too. - - Ok(()) - } - fn delete(&self) -> Result<()> { // delete underlying file fs::remove_file(self.path())?; @@ -303,21 +327,36 @@ impl Layer for DeltaLayer { let inner = self.load()?; - let mut values: Vec<(&Key, &VecMap)> = inner.index.iter().collect(); - values.sort_by_key(|k| k.0); + println!( + "index_start_blk: {}, root {}", + inner.index_start_blk, inner.index_root_blk + ); - let mut reader = inner.file.as_ref().unwrap().block_cursor(); + let file = inner.file.as_ref().unwrap(); + let tree_reader = DiskBtreeReader::<_, DELTA_KEY_SIZE>::new( + inner.index_start_blk, + inner.index_root_blk, + file, + ); + + tree_reader.dump()?; + + let mut cursor = file.block_cursor(); + tree_reader.visit( + &[0u8; DELTA_KEY_SIZE], + VisitDirection::Forwards, + |delta_key, val| { + let blob_ref = BlobRef(val); + let key = DeltaKey::extract_key_from_buf(delta_key); + let lsn = DeltaKey::extract_lsn_from_buf(delta_key); - for (key, versions) in values { - for (lsn, blob_ref) in versions.as_slice() { let mut desc = String::new(); - match 
reader.read_blob(blob_ref.pos()) { + match cursor.read_blob(blob_ref.pos()) { Ok(buf) => { let val = Value::des(&buf); - match val { Ok(Value::Image(img)) => { - write!(&mut desc, " img {} bytes", img.len())?; + write!(&mut desc, " img {} bytes", img.len()).unwrap(); } Ok(Value::WalRecord(rec)) => { let wal_desc = walrecord::describe_wal_record(&rec); @@ -327,20 +366,22 @@ impl Layer for DeltaLayer { buf.len(), rec.will_init(), wal_desc - )?; + ) + .unwrap(); } Err(err) => { - write!(&mut desc, " DESERIALIZATION ERROR: {}", err)?; + write!(&mut desc, " DESERIALIZATION ERROR: {}", err).unwrap(); } } } Err(err) => { - write!(&mut desc, " READ ERROR: {}", err)?; + write!(&mut desc, " READ ERROR: {}", err).unwrap(); } } println!(" key {} at {}: {}", key, lsn, desc); - } - } + true + }, + )?; Ok(()) } @@ -409,6 +450,7 @@ impl DeltaLayer { PathOrConf::Conf(_) => { let mut expected_summary = Summary::from(self); expected_summary.index_start_blk = actual_summary.index_start_blk; + expected_summary.index_root_blk = actual_summary.index_root_blk; if actual_summary != expected_summary { bail!("in-file summary does not match expected summary. 
actual = {:?} expected = {:?}", actual_summary, expected_summary); } @@ -427,17 +469,11 @@ impl DeltaLayer { } } - file.file.seek(SeekFrom::Start( - actual_summary.index_start_blk as u64 * PAGE_SZ as u64, - ))?; - let mut buf_reader = std::io::BufReader::new(&mut file.file); - let index = HashMap::des_from(&mut buf_reader)?; - inner.index_start_blk = actual_summary.index_start_blk; + inner.index_root_blk = actual_summary.index_root_blk; debug!("loaded from {}", &path.display()); - inner.index = index; inner.loaded = true; Ok(()) } @@ -457,9 +493,9 @@ impl DeltaLayer { lsn_range: filename.lsn_range.clone(), inner: RwLock::new(DeltaLayerInner { loaded: false, - index: HashMap::default(), file: None, index_start_blk: 0, + index_root_blk: 0, }), } } @@ -485,8 +521,8 @@ impl DeltaLayer { inner: RwLock::new(DeltaLayerInner { loaded: false, file: None, - index: HashMap::default(), index_start_blk: 0, + index_root_blk: 0, }), }) } @@ -529,7 +565,7 @@ pub struct DeltaLayerWriter { key_start: Key, lsn_range: Range, - index: HashMap>, + tree: DiskBtreeBuilder, blob_writer: WriteBlobWriter>, } @@ -558,10 +594,15 @@ impl DeltaLayerWriter { u64::from(lsn_range.end) )); let mut file = VirtualFile::create(&path)?; + // make room for the header block file.seek(SeekFrom::Start(PAGE_SZ as u64))?; let buf_writer = BufWriter::new(file); let blob_writer = WriteBlobWriter::new(buf_writer, PAGE_SZ as u64); + // Initialize the b-tree index builder + let block_buf = BlockBuf::new(); + let tree_builder = DiskBtreeBuilder::new(block_buf); + Ok(DeltaLayerWriter { conf, path, @@ -569,7 +610,7 @@ impl DeltaLayerWriter { tenantid, key_start, lsn_range, - index: HashMap::new(), + tree: tree_builder, blob_writer, }) } @@ -584,23 +625,16 @@ impl DeltaLayerWriter { let off = self.blob_writer.write_blob(&Value::ser(&val)?)?; - let vec_map = self.index.entry(key).or_default(); let blob_ref = BlobRef::new(off, val.will_init()); - let old = vec_map.append_or_update_last(lsn, blob_ref).unwrap().0; - if 
old.is_some() { - // We already had an entry for this LSN. That's odd.. - bail!( - "Value for {} at {} already exists in delta layer being built", - key, - lsn - ); - } + + let delta_key = DeltaKey::from_key_lsn(&key, lsn); + self.tree.append(&delta_key.0, blob_ref.0)?; Ok(()) } pub fn size(&self) -> u64 { - self.blob_writer.size() + self.blob_writer.size() + self.tree.borrow_writer().size() } /// @@ -614,9 +648,11 @@ impl DeltaLayerWriter { let mut file = buf_writer.into_inner()?; // Write out the index - let buf = HashMap::ser(&self.index)?; + let (index_root_blk, block_buf) = self.tree.finish()?; file.seek(SeekFrom::Start(index_start_blk as u64 * PAGE_SZ as u64))?; - file.write_all(&buf)?; + for buf in block_buf.blocks { + file.write_all(buf.as_ref())?; + } // Fill in the summary on blk 0 let summary = Summary { @@ -627,6 +663,7 @@ impl DeltaLayerWriter { key_range: self.key_start..key_end, lsn_range: self.lsn_range.clone(), index_start_blk, + index_root_blk, }; file.seek(SeekFrom::Start(0))?; Summary::ser_into(&summary, &mut file)?; @@ -642,9 +679,9 @@ impl DeltaLayerWriter { lsn_range: self.lsn_range.clone(), inner: RwLock::new(DeltaLayerInner { loaded: false, - index: HashMap::new(), file: None, index_start_blk, + index_root_blk, }), }; @@ -677,7 +714,7 @@ impl DeltaLayerWriter { /// fashion. 
/// struct DeltaValueIter<'a> { - all_offsets: Vec<(Key, Lsn, BlobRef)>, + all_offsets: Vec<(DeltaKey, BlobRef)>, next_idx: usize, reader: BlockCursor>, } @@ -702,15 +739,22 @@ impl<'a> Iterator for DeltaValueIter<'a> { impl<'a> DeltaValueIter<'a> { fn new(inner: RwLockReadGuard<'a, DeltaLayerInner>) -> Result { - let mut index: Vec<(&Key, &VecMap)> = inner.index.iter().collect(); - index.sort_by_key(|x| x.0); + let file = inner.file.as_ref().unwrap(); + let tree_reader = DiskBtreeReader::<_, DELTA_KEY_SIZE>::new( + inner.index_start_blk, + inner.index_root_blk, + file, + ); - let mut all_offsets: Vec<(Key, Lsn, BlobRef)> = Vec::new(); - for (key, vec_map) in index.iter() { - for (lsn, blob_ref) in vec_map.as_slice().iter() { - all_offsets.push((**key, *lsn, *blob_ref)); - } - } + let mut all_offsets: Vec<(DeltaKey, BlobRef)> = Vec::new(); + tree_reader.visit( + &[0u8; DELTA_KEY_SIZE], + VisitDirection::Forwards, + |key, value| { + all_offsets.push((DeltaKey::from_slice(key), BlobRef(value))); + true + }, + )?; let iter = DeltaValueIter { all_offsets, @@ -723,13 +767,15 @@ impl<'a> DeltaValueIter<'a> { fn next_res(&mut self) -> Result> { if self.next_idx < self.all_offsets.len() { - let (key, lsn, off) = &self.all_offsets[self.next_idx]; + let (delta_key, blob_ref) = &self.all_offsets[self.next_idx]; - //let mut reader = BlobReader::new(self.inner.file.as_ref().unwrap()); - let buf = self.reader.read_blob(off.pos())?; + let key = delta_key.key(); + let lsn = delta_key.lsn(); + + let buf = self.reader.read_blob(blob_ref.pos())?; let val = Value::des(&buf)?; self.next_idx += 1; - Ok(Some((*key, *lsn, val))) + Ok(Some((key, lsn, val))) } else { Ok(None) } diff --git a/pageserver/src/layered_repository/disk_btree.rs b/pageserver/src/layered_repository/disk_btree.rs new file mode 100644 index 0000000000..7a9fe6f2b7 --- /dev/null +++ b/pageserver/src/layered_repository/disk_btree.rs @@ -0,0 +1,979 @@ +//! +//! Simple on-disk B-tree implementation +//! +//! 
This is used as the index structure within image and delta layers
+//!
+//! Features:
+//! - Fixed-width keys
+//! - Fixed-width values (VALUE_SZ)
+//! - The tree is created in a bulk operation. Insert/deletion after creation
+//!   is not supported
+//! - page-oriented
+//!
+//! TODO:
+//! - better errors (e.g. with thiserror?)
+//! - maybe something like an Adaptive Radix Tree would be more efficient?
+//! - the values stored by image and delta layers are offsets into the file,
+//!   and they are in monotonically increasing order. Prefix compression would
+//!   be very useful for them, too.
+//! - An Iterator interface would be more convenient for the callers than the
+//!   'visit' function
+//!
+use anyhow;
+use byteorder::{ReadBytesExt, BE};
+use bytes::{BufMut, Bytes, BytesMut};
+use hex;
+use std::cmp::Ordering;
+
+use crate::layered_repository::block_io::{BlockReader, BlockWriter};
+
+// The maximum size of a value stored in the B-tree. 5 bytes is enough currently.
+pub const VALUE_SZ: usize = 5;
+pub const MAX_VALUE: u64 = 0x007f_ffff_ffff;
+
+#[allow(dead_code)]
+pub const PAGE_SZ: usize = 8192;
+
+#[derive(Clone, Copy, Debug)]
+struct Value([u8; VALUE_SZ]);
+
+impl Value {
+    fn from_slice(slice: &[u8]) -> Value {
+        let mut b = [0u8; VALUE_SZ];
+        b.copy_from_slice(slice);
+        Value(b)
+    }
+
+    fn from_u64(x: u64) -> Value {
+        assert!(x <= 0x007f_ffff_ffff);
+        Value([
+            (x >> 32) as u8,
+            (x >> 24) as u8,
+            (x >> 16) as u8,
+            (x >> 8) as u8,
+            x as u8,
+        ])
+    }
+
+    fn from_blknum(x: u32) -> Value {
+        Value([
+            0x80,
+            (x >> 24) as u8,
+            (x >> 16) as u8,
+            (x >> 8) as u8,
+            x as u8,
+        ])
+    }
+
+    #[allow(dead_code)]
+    fn is_offset(self) -> bool {
+        self.0[0] & 0x80 != 0
+    }
+
+    fn to_u64(self) -> u64 {
+        let b = &self.0;
+        (b[0] as u64) << 32
+            | (b[1] as u64) << 24
+            | (b[2] as u64) << 16
+            | (b[3] as u64) << 8
+            | b[4] as u64
+    }
+
+    fn to_blknum(self) -> u32 {
+        let b = &self.0;
+        assert!(b[0] == 0x80);
+        (b[1] as u32) << 24 | (b[2] as u32) << 16 | (b[3] as u32)
<< 8 | b[4] as u32
+    }
+}
+
+/// This is the on-disk representation.
+struct OnDiskNode<'a, const L: usize> {
+    // Fixed-width fields
+    num_children: u16,
+    level: u8,
+    prefix_len: u8,
+    suffix_len: u8,
+
+    // Variable-length fields. These are stored on-disk after the fixed-width
+    // fields, in this order. In the in-memory representation, these point to
+    // the right parts in the page buffer.
+    prefix: &'a [u8],
+    keys: &'a [u8],
+    values: &'a [u8],
+}
+
+impl<'a, const L: usize> OnDiskNode<'a, L> {
+    ///
+    /// Interpret a PAGE_SZ page as a node.
+    ///
+    fn deparse(buf: &[u8]) -> OnDiskNode<L> {
+        let mut cursor = std::io::Cursor::new(buf);
+        let num_children = cursor.read_u16::<BE>().unwrap();
+        let level = cursor.read_u8().unwrap();
+        let prefix_len = cursor.read_u8().unwrap();
+        let suffix_len = cursor.read_u8().unwrap();
+
+        let mut off = cursor.position();
+        let prefix_off = off as usize;
+        off += prefix_len as u64;
+
+        let keys_off = off as usize;
+        let keys_len = num_children as usize * suffix_len as usize;
+        off += keys_len as u64;
+
+        let values_off = off as usize;
+        let values_len = num_children as usize * VALUE_SZ as usize;
+        //off += values_len as u64;
+
+        let prefix = &buf[prefix_off..prefix_off + prefix_len as usize];
+        let keys = &buf[keys_off..keys_off + keys_len];
+        let values = &buf[values_off..values_off + values_len];
+
+        OnDiskNode {
+            num_children,
+            level,
+            prefix_len,
+            suffix_len,
+            prefix,
+            keys,
+            values,
+        }
+    }
+
+    ///
+    /// Read a value at 'idx'
+    ///
+    fn value(&self, idx: usize) -> Value {
+        let value_off = idx * VALUE_SZ;
+        let value_slice = &self.values[value_off..value_off + VALUE_SZ];
+        Value::from_slice(value_slice)
+    }
+
+    fn binary_search(&self, search_key: &[u8; L], keybuf: &mut [u8]) -> Result<usize, usize> {
+        let mut size = self.num_children as usize;
+        let mut low = 0;
+        let mut high = size;
+        while low < high {
+            let mid = low + size / 2;
+
+            let key_off = mid as usize * self.suffix_len as usize;
+            let suffix =
&self.keys[key_off..key_off + self.suffix_len as usize]; + // Does this match? + keybuf[self.prefix_len as usize..].copy_from_slice(suffix); + + let cmp = keybuf[..].cmp(search_key); + + if cmp == Ordering::Less { + low = mid + 1; + } else if cmp == Ordering::Greater { + high = mid; + } else { + return Ok(mid); + } + size = high - low; + } + Err(low) + } +} + +/// +/// Public reader object, to search the tree. +/// +pub struct DiskBtreeReader<R, const L: usize> +where + R: BlockReader, +{ + start_blk: u32, + root_blk: u32, + reader: R, +} + +#[derive(Clone, Copy, Debug, PartialEq)] +pub enum VisitDirection { + Forwards, + Backwards, +} + +impl<R, const L: usize> DiskBtreeReader<R, L> +where + R: BlockReader, +{ + pub fn new(start_blk: u32, root_blk: u32, reader: R) -> Self { + DiskBtreeReader { + start_blk, + root_blk, + reader, + } + } + + /// + /// Read the value for given key. Returns the value, or None if it doesn't exist. + /// + pub fn get(&self, search_key: &[u8; L]) -> anyhow::Result<Option<u64>> { + let mut result: Option<u64> = None; + self.visit(search_key, VisitDirection::Forwards, |key, value| { + if key == search_key { + result = Some(value); + } + false + })?; + Ok(result) + } + + /// + /// Scan the tree, starting from 'search_key', in the given direction. 'visitor' + /// will be called for every key >= 'search_key' (or <= 'search_key', if scanning + /// backwards) + /// + pub fn visit<V>( + &self, + search_key: &[u8; L], + dir: VisitDirection, + mut visitor: V, + ) -> anyhow::Result<bool> + where + V: FnMut(&[u8], u64) -> bool, + { + self.search_recurse(self.root_blk, search_key, dir, &mut visitor) + } + + fn search_recurse<V>( + &self, + node_blknum: u32, + search_key: &[u8; L], + dir: VisitDirection, + visitor: &mut V, + ) -> anyhow::Result<bool> + where + V: FnMut(&[u8], u64) -> bool, + { + // Locate the node. 
+ let blk = self.reader.read_blk(self.start_blk + node_blknum)?; + + // Search all entries on this node + self.search_node(blk.as_ref(), search_key, dir, visitor) + } + + fn search_node<V>( + &self, + node_buf: &[u8], + search_key: &[u8; L], + dir: VisitDirection, + visitor: &mut V, + ) -> anyhow::Result<bool> + where + V: FnMut(&[u8], u64) -> bool, + { + let node = OnDiskNode::deparse(node_buf); + let prefix_len = node.prefix_len as usize; + let suffix_len = node.suffix_len as usize; + + assert!(node.num_children > 0); + + let mut keybuf = Vec::new(); + keybuf.extend(node.prefix); + keybuf.resize(prefix_len + suffix_len, 0); + + if dir == VisitDirection::Forwards { + // Locate the first match + let mut idx = match node.binary_search(search_key, keybuf.as_mut_slice()) { + Ok(idx) => idx, + Err(idx) => { + if node.level == 0 { + // Imagine that the node contains the following keys: + // + // 1 + // 3 <-- idx + // 5 + // + // If the search key is '2' and there is no exact match, + // the binary search would return the index of key + // '3'. That's cool, '3' is the first key to return. + idx + } else { + // This is an internal page, so each key represents a lower + // bound for what's in the child page. If there is no exact + // match, we have to return the *previous* entry. + // + // 1 <-- return this + // 3 <-- idx + // 5 + idx.saturating_sub(1) + } + } + }; + // idx points to the first match now. Keep going from there + let mut key_off = idx * suffix_len; + while idx < node.num_children as usize { + let suffix = &node.keys[key_off..key_off + suffix_len]; + keybuf[prefix_len..].copy_from_slice(suffix); + let value = node.value(idx as usize); + #[allow(clippy::collapsible_if)] + if node.level == 0 { + // leaf + if !visitor(&keybuf, value.to_u64()) { + return Ok(false); + } + } else { + #[allow(clippy::collapsible_if)] + if !self.search_recurse(value.to_blknum(), search_key, dir, visitor)? 
{ + return Ok(false); + } + } + idx += 1; + key_off += suffix_len; + } + } else { + let mut idx = match node.binary_search(search_key, keybuf.as_mut_slice()) { + Ok(idx) => { + // Exact match. That's the first entry to return, and walk + // backwards from there. (The loop below starts from 'idx - + // 1', so add one here to compensate.) + idx + 1 + } + Err(idx) => { + // No exact match. The binary search returned the index of the + // first key that's > search_key. Back off by one, and walk + // backwards from there. (The loop below starts from idx - 1, + // so we don't need to subtract one here) + idx + } + }; + + // idx points to the first match + 1 now. Keep going from there. + let mut key_off = idx * suffix_len; + while idx > 0 { + idx -= 1; + key_off -= suffix_len; + let suffix = &node.keys[key_off..key_off + suffix_len]; + keybuf[prefix_len..].copy_from_slice(suffix); + let value = node.value(idx as usize); + #[allow(clippy::collapsible_if)] + if node.level == 0 { + // leaf + if !visitor(&keybuf, value.to_u64()) { + return Ok(false); + } + } else { + #[allow(clippy::collapsible_if)] + if !self.search_recurse(value.to_blknum(), search_key, dir, visitor)? 
{ + return Ok(false); + } + } + if idx == 0 { + break; + } + } + } + Ok(true) + } + + #[allow(dead_code)] + pub fn dump(&self) -> anyhow::Result<()> { + self.dump_recurse(self.root_blk, &[], 0) + } + + fn dump_recurse(&self, blknum: u32, path: &[u8], depth: usize) -> anyhow::Result<()> { + let blk = self.reader.read_blk(self.start_blk + blknum)?; + let buf: &[u8] = blk.as_ref(); + + let node = OnDiskNode::<L>::deparse(buf); + + print!("{:indent$}", "", indent = depth * 2); + println!( + "blk #{}: path {}: prefix {}, suffix_len {}", + blknum, + hex::encode(path), + hex::encode(node.prefix), + node.suffix_len + ); + + let mut idx = 0; + let mut key_off = 0; + while idx < node.num_children { + let key = &node.keys[key_off..key_off + node.suffix_len as usize]; + let val = node.value(idx as usize); + print!("{:indent$}", "", indent = depth * 2 + 2); + println!("{}: {}", hex::encode(key), hex::encode(val.0)); + + if node.level > 0 { + let child_path = [path, node.prefix].concat(); + self.dump_recurse(val.to_blknum(), &child_path, depth + 1)?; + } + idx += 1; + key_off += node.suffix_len as usize; + } + Ok(()) + } +} + +/// +/// Public builder object, for creating a new tree. +/// +/// Usage: Create a builder object by calling 'new', load all the data into the +/// tree by calling 'append' for each key-value pair, and then call 'finish' +/// +/// 'L' is the key length in bytes +pub struct DiskBtreeBuilder<W, const L: usize> +where + W: BlockWriter, +{ + writer: W, + + /// + /// stack[0] is the current root page, stack.last() is the leaf. + /// + stack: Vec<BuildNode<L>>, + + /// Last key that was appended to the tree. Used to sanity check that append + /// is called in increasing key order. 
+ last_key: Option<[u8; L]>, +} + +impl<W, const L: usize> DiskBtreeBuilder<W, L> +where + W: BlockWriter, +{ + pub fn new(writer: W) -> Self { + DiskBtreeBuilder { + writer, + last_key: None, + stack: vec![BuildNode::new(0)], + } + } + + pub fn append(&mut self, key: &[u8; L], value: u64) -> Result<(), anyhow::Error> { + assert!(value <= MAX_VALUE); + if let Some(last_key) = &self.last_key { + assert!(key > last_key, "unsorted input"); + } + self.last_key = Some(*key); + + Ok(self.append_internal(key, Value::from_u64(value))?) + } + + fn append_internal(&mut self, key: &[u8; L], value: Value) -> Result<(), std::io::Error> { + // Try to append to the current leaf buffer + let last = self.stack.last_mut().unwrap(); + let level = last.level; + if last.push(key, value) { + return Ok(()); + } + + // It did not fit. Try to compress, and if it succeeds in making some room + // on the node, try appending to it again. + #[allow(clippy::collapsible_if)] + if last.compress() { + if last.push(key, value) { + return Ok(()); + } + } + + // Could not append to the current leaf. Flush it and create a new one. + self.flush_node()?; + + // Replace the node we flushed with an empty one and append the new + // key to it. + let mut last = BuildNode::new(level); + if !last.push(key, value) { + panic!("could not push to new leaf node"); + } + self.stack.push(last); + + Ok(()) + } + + fn flush_node(&mut self) -> Result<(), std::io::Error> { + let last = self.stack.pop().unwrap(); + let buf = last.pack(); + let downlink_key = last.first_key(); + let downlink_ptr = self.writer.write_blk(buf)?; + + // Append the downlink to the parent + if self.stack.is_empty() { + self.stack.push(BuildNode::new(last.level + 1)); + } + self.append_internal(&downlink_key, Value::from_blknum(downlink_ptr))?; + + Ok(()) + } + + /// + /// Flushes everything to disk, and returns the block number of the root page. 
+ /// The caller must store the root block number "out-of-band", and pass it + /// to the DiskBtreeReader::new() when you want to read the tree again. + /// (In the image and delta layers, it is stored in the beginning of the file, + /// in the summary header) + /// + pub fn finish(mut self) -> Result<(u32, W), std::io::Error> { + // flush all levels, except the root. + while self.stack.len() > 1 { + self.flush_node()?; + } + + let root = self.stack.first().unwrap(); + let buf = root.pack(); + let root_blknum = self.writer.write_blk(buf)?; + + Ok((root_blknum, self.writer)) + } + + pub fn borrow_writer(&self) -> &W { + &self.writer + } +} + +/// +/// BuildNode represents an incomplete page that we are appending to. +/// +#[derive(Clone, Debug)] +struct BuildNode<const L: usize> { + num_children: u16, + level: u8, + prefix: Vec<u8>, + suffix_len: usize, + + keys: Vec<u8>, + values: Vec<u8>, + + size: usize, // physical size of this node, if it was written to disk like this +} + +const NODE_SIZE: usize = PAGE_SZ; + +const NODE_HDR_SIZE: usize = 2 + 1 + 1 + 1; + +impl<const L: usize> BuildNode<L> { + fn new(level: u8) -> Self { + BuildNode { + num_children: 0, + level, + prefix: Vec::new(), + suffix_len: 0, + keys: Vec::new(), + values: Vec::new(), + size: NODE_HDR_SIZE, + } + } + + /// Try to append a key-value pair to this node. Returns 'true' on + /// success, 'false' if the page was full or the key was + /// incompatible with the prefix of the existing keys. + fn push(&mut self, key: &[u8; L], value: Value) -> bool { + // If we have already performed prefix-compression on the page, + // check that the incoming key has the same prefix. + if self.num_children > 0 { + // does the prefix allow it? + if !key.starts_with(&self.prefix) { + return false; + } + } else { + self.suffix_len = key.len(); + } + + // Is the node too full? 
+ if self.size + self.suffix_len + VALUE_SZ >= NODE_SIZE { + return false; + } + + // All clear + self.num_children += 1; + self.keys.extend(&key[self.prefix.len()..]); + self.values.extend(value.0); + + assert!(self.keys.len() == self.num_children as usize * self.suffix_len as usize); + assert!(self.values.len() == self.num_children as usize * VALUE_SZ); + + self.size += self.suffix_len + VALUE_SZ; + + true + } + + /// + /// Perform prefix-compression. + /// + /// Returns 'true' on success, 'false' if no compression was possible. + /// + fn compress(&mut self) -> bool { + let first_suffix = self.first_suffix(); + let last_suffix = self.last_suffix(); + + // Find the common prefix among all keys + let mut prefix_len = 0; + while prefix_len < self.suffix_len { + if first_suffix[prefix_len] != last_suffix[prefix_len] { + break; + } + prefix_len += 1; + } + if prefix_len == 0 { + return false; + } + + // Can compress. Rewrite the keys without the common prefix. + self.prefix.extend(&self.keys[..prefix_len]); + + let mut new_keys = Vec::new(); + let mut key_off = 0; + while key_off < self.keys.len() { + let next_key_off = key_off + self.suffix_len; + new_keys.extend(&self.keys[key_off + prefix_len..next_key_off]); + key_off = next_key_off; + } + self.keys = new_keys; + self.suffix_len -= prefix_len; + + self.size -= prefix_len * self.num_children as usize; + self.size += prefix_len; + + assert!(self.keys.len() == self.num_children as usize * self.suffix_len as usize); + assert!(self.values.len() == self.num_children as usize * VALUE_SZ); + + true + } + + /// + /// Serialize the node to on-disk format. 
+ /// + fn pack(&self) -> Bytes { + assert!(self.keys.len() == self.num_children as usize * self.suffix_len as usize); + assert!(self.values.len() == self.num_children as usize * VALUE_SZ); + assert!(self.num_children > 0); + + let mut buf = BytesMut::new(); + + buf.put_u16(self.num_children); + buf.put_u8(self.level); + buf.put_u8(self.prefix.len() as u8); + buf.put_u8(self.suffix_len as u8); + buf.put(&self.prefix[..]); + buf.put(&self.keys[..]); + buf.put(&self.values[..]); + + assert!(buf.len() == self.size); + + assert!(buf.len() <= PAGE_SZ); + buf.resize(PAGE_SZ, 0); + buf.freeze() + } + + fn first_suffix(&self) -> &[u8] { + &self.keys[..self.suffix_len] + } + fn last_suffix(&self) -> &[u8] { + &self.keys[self.keys.len() - self.suffix_len..] + } + + /// Return the full first key of the page, including the prefix + fn first_key(&self) -> [u8; L] { + let mut key = [0u8; L]; + key[..self.prefix.len()].copy_from_slice(&self.prefix); + key[self.prefix.len()..].copy_from_slice(self.first_suffix()); + key + } +} + +#[cfg(test)] +mod tests { + use super::*; + use rand::Rng; + use std::collections::BTreeMap; + use std::sync::atomic::{AtomicUsize, Ordering}; + + #[derive(Clone, Default)] + struct TestDisk { + blocks: Vec<Bytes>, + } + impl TestDisk { + fn new() -> Self { + Self::default() + } + } + impl BlockReader for TestDisk { + type BlockLease = std::rc::Rc<[u8; PAGE_SZ]>; + + fn read_blk(&self, blknum: u32) -> Result<Self::BlockLease, std::io::Error> { + let mut buf = [0u8; PAGE_SZ]; + buf.copy_from_slice(&self.blocks[blknum as usize]); + Ok(std::rc::Rc::new(buf)) + } + } + impl BlockWriter for &mut TestDisk { + fn write_blk(&mut self, buf: Bytes) -> Result<u32, std::io::Error> { + let blknum = self.blocks.len(); + self.blocks.push(buf); + Ok(blknum as u32) + } + } + + #[test] + fn basic() -> anyhow::Result<()> { + let mut disk = TestDisk::new(); + let mut writer = DiskBtreeBuilder::<_, 6>::new(&mut disk); + + let all_keys: Vec<&[u8; 6]> = vec![ + b"xaaaaa", b"xaaaba", b"xaaaca", b"xabaaa", b"xababa", b"xabaca", b"xabada", 
b"xabadb", + ]; + let all_data: Vec<(&[u8; 6], u64)> = all_keys + .iter() + .enumerate() + .map(|(idx, key)| (*key, idx as u64)) + .collect(); + for (key, val) in all_data.iter() { + writer.append(key, *val)?; + } + + let (root_offset, _writer) = writer.finish()?; + + let reader = DiskBtreeReader::new(0, root_offset, disk); + + reader.dump()?; + + // Test the `get` function on all the keys. + for (key, val) in all_data.iter() { + assert_eq!(reader.get(key)?, Some(*val)); + } + // And on some keys that don't exist + assert_eq!(reader.get(b"aaaaaa")?, None); + assert_eq!(reader.get(b"zzzzzz")?, None); + assert_eq!(reader.get(b"xaaabx")?, None); + + // Test search with `visit` function + let search_key = b"xabaaa"; + let expected: Vec<(Vec<u8>, u64)> = all_data + .iter() + .filter(|(key, _value)| key[..] >= search_key[..]) + .map(|(key, value)| (key.to_vec(), *value)) + .collect(); + + let mut data = Vec::new(); + reader.visit(search_key, VisitDirection::Forwards, |key, value| { + data.push((key.to_vec(), value)); + true + })?; + assert_eq!(data, expected); + + // Test a backwards scan + let mut expected: Vec<(Vec<u8>, u64)> = all_data + .iter() + .filter(|(key, _value)| key[..] 
<= search_key[..]) + .map(|(key, value)| (key.to_vec(), *value)) + .collect(); + expected.reverse(); + let mut data = Vec::new(); + reader.visit(search_key, VisitDirection::Backwards, |key, value| { + data.push((key.to_vec(), value)); + true + })?; + assert_eq!(data, expected); + + // Backward scan where nothing matches + reader.visit(b"aaaaaa", VisitDirection::Backwards, |key, value| { + panic!("found unexpected key {}: {}", hex::encode(key), value); + })?; + + // Full scan + let expected: Vec<(Vec<u8>, u64)> = all_data + .iter() + .map(|(key, value)| (key.to_vec(), *value)) + .collect(); + let mut data = Vec::new(); + reader.visit(&[0u8; 6], VisitDirection::Forwards, |key, value| { + data.push((key.to_vec(), value)); + true + })?; + assert_eq!(data, expected); + + Ok(()) + } + + #[test] + fn lots_of_keys() -> anyhow::Result<()> { + let mut disk = TestDisk::new(); + let mut writer = DiskBtreeBuilder::<_, 8>::new(&mut disk); + + const NUM_KEYS: u64 = 1000; + + let mut all_data: BTreeMap<u64, u64> = BTreeMap::new(); + + for idx in 0..NUM_KEYS { + let key_int: u64 = 1 + idx * 2; + let key = u64::to_be_bytes(key_int); + writer.append(&key, idx)?; + + all_data.insert(key_int, idx); + } + + let (root_offset, _writer) = writer.finish()?; + + let reader = DiskBtreeReader::new(0, root_offset, disk); + + reader.dump()?; + + use std::sync::Mutex; + + let result = Mutex::new(Vec::new()); + let limit: AtomicUsize = AtomicUsize::new(10); + let take_ten = |key: &[u8], value: u64| { + let mut keybuf = [0u8; 8]; + keybuf.copy_from_slice(key); + let key_int = u64::from_be_bytes(keybuf); + + let mut result = result.lock().unwrap(); + result.push((key_int, value)); + + // keep going until we have 10 matches + result.len() < limit.load(Ordering::Relaxed) + }; + + for search_key_int in 0..(NUM_KEYS * 2 + 10) { + let search_key = u64::to_be_bytes(search_key_int); + assert_eq!( + reader.get(&search_key)?, + all_data.get(&search_key_int).cloned() + ); + + // Test a forward scan starting with this key + 
result.lock().unwrap().clear(); + reader.visit(&search_key, VisitDirection::Forwards, take_ten)?; + let expected = all_data + .range(search_key_int..) + .take(10) + .map(|(&key, &val)| (key, val)) + .collect::<Vec<(u64, u64)>>(); + assert_eq!(*result.lock().unwrap(), expected); + + // And a backwards scan + result.lock().unwrap().clear(); + reader.visit(&search_key, VisitDirection::Backwards, take_ten)?; + let expected = all_data + .range(..=search_key_int) + .rev() + .take(10) + .map(|(&key, &val)| (key, val)) + .collect::<Vec<(u64, u64)>>(); + assert_eq!(*result.lock().unwrap(), expected); + } + + // full scan + let search_key = u64::to_be_bytes(0); + limit.store(usize::MAX, Ordering::Relaxed); + result.lock().unwrap().clear(); + reader.visit(&search_key, VisitDirection::Forwards, take_ten)?; + let expected = all_data + .iter() + .map(|(&key, &val)| (key, val)) + .collect::<Vec<(u64, u64)>>(); + assert_eq!(*result.lock().unwrap(), expected); + + // full scan + let search_key = u64::to_be_bytes(u64::MAX); + limit.store(usize::MAX, Ordering::Relaxed); + result.lock().unwrap().clear(); + reader.visit(&search_key, VisitDirection::Backwards, take_ten)?; + let expected = all_data + .iter() + .rev() + .map(|(&key, &val)| (key, val)) + .collect::<Vec<(u64, u64)>>(); + assert_eq!(*result.lock().unwrap(), expected); + + Ok(()) + } + + #[test] + fn random_data() -> anyhow::Result<()> { + // Generate random keys with exponential distribution, to + // exercise the prefix compression + const NUM_KEYS: usize = 100000; + let mut all_data: BTreeMap<u128, u64> = BTreeMap::new(); + for idx in 0..NUM_KEYS { + let u: f64 = rand::thread_rng().gen_range(0.0..1.0); + let t = -(f64::ln(u)); + let key_int = (t * 1000000.0) as u128; + + all_data.insert(key_int as u128, idx as u64); + } + + // Build a tree from it + let mut disk = TestDisk::new(); + let mut writer = DiskBtreeBuilder::<_, 16>::new(&mut disk); + + for (&key, &val) in all_data.iter() { + writer.append(&u128::to_be_bytes(key), val)?; + } + let (root_offset, _writer) = writer.finish()?; + + let reader 
= DiskBtreeReader::new(0, root_offset, disk); + + // Test get() operation on all the keys + for (&key, &val) in all_data.iter() { + let search_key = u128::to_be_bytes(key); + assert_eq!(reader.get(&search_key)?, Some(val)); + } + + // Test get() operations on random keys, most of which will not exist + for _ in 0..100000 { + let key_int = rand::thread_rng().gen::<u128>(); + let search_key = u128::to_be_bytes(key_int); + assert!(reader.get(&search_key)? == all_data.get(&key_int).cloned()); + } + + // Test boundary cases + assert!(reader.get(&u128::to_be_bytes(u128::MIN))? == all_data.get(&u128::MIN).cloned()); + assert!(reader.get(&u128::to_be_bytes(u128::MAX))? == all_data.get(&u128::MAX).cloned()); + + Ok(()) + } + + #[test] + #[should_panic(expected = "unsorted input")] + fn unsorted_input() { + let mut disk = TestDisk::new(); + let mut writer = DiskBtreeBuilder::<_, 2>::new(&mut disk); + + let _ = writer.append(b"ba", 1); + let _ = writer.append(b"bb", 2); + let _ = writer.append(b"aa", 3); + } + + /// + /// This test contains a particular data set, see disk_btree_test_data.rs + /// + #[test] + fn particular_data() -> anyhow::Result<()> { + // Build a tree from it + let mut disk = TestDisk::new(); + let mut writer = DiskBtreeBuilder::<_, 26>::new(&mut disk); + + for (key, val) in disk_btree_test_data::TEST_DATA { + writer.append(&key, val)?; + } + let (root_offset, writer) = writer.finish()?; + + println!("SIZE: {} blocks", writer.blocks.len()); + + let reader = DiskBtreeReader::new(0, root_offset, disk); + + // Test get() operation on all the keys + for (key, val) in disk_btree_test_data::TEST_DATA { + assert_eq!(reader.get(&key)?, Some(val)); + } + + // Test full scan + let mut count = 0; + reader.visit(&[0u8; 26], VisitDirection::Forwards, |_key, _value| { + count += 1; + true + })?; + assert_eq!(count, disk_btree_test_data::TEST_DATA.len()); + + reader.dump()?; + + Ok(()) + } +} + +#[cfg(test)] +#[path = "disk_btree_test_data.rs"] +mod disk_btree_test_data; diff 
--git a/pageserver/src/layered_repository/disk_btree_test_data.rs b/pageserver/src/layered_repository/disk_btree_test_data.rs new file mode 100644 index 0000000000..9462573f03 --- /dev/null +++ b/pageserver/src/layered_repository/disk_btree_test_data.rs @@ -0,0 +1,2013 @@ +use hex_literal::hex; + +/// Test data set for the 'particular_data' test in disk_btree.rs +/// +/// This test contains a particular data set, representing all the keys +/// generated by the 'test_random_updates' unit test. I extracted this while +/// trying to debug a failure in that test. The bug turned out to be +/// elsewhere, and I'm not sure if this is still useful, but keeping it for +/// now... Maybe it's a useful data set to show the typical key-values used +/// by a delta layer, for evaluating how well the prefix compression works. +#[rustfmt::skip] +pub static TEST_DATA: [([u8; 26], u64); 2000] = [ + (hex!("0122222222333333334444444455000000000000000000000010"), 0x004001), + (hex!("0122222222333333334444444455000000000000000000007cb0"), 0x0040a1), + (hex!("0122222222333333334444444455000000010000000000000020"), 0x004141), + (hex!("0122222222333333334444444455000000020000000000000030"), 0x0041e1), + (hex!("01222222223333333344444444550000000200000000000051a0"), 0x004281), + (hex!("0122222222333333334444444455000000030000000000000040"), 0x004321), + (hex!("0122222222333333334444444455000000030000000000006cf0"), 0x0043c1), + (hex!("0122222222333333334444444455000000030000000000007140"), 0x004461), + (hex!("0122222222333333334444444455000000040000000000000050"), 0x004501), + (hex!("01222222223333333344444444550000000400000000000047f0"), 0x0045a1), + (hex!("01222222223333333344444444550000000400000000000072b0"), 0x004641), + (hex!("0122222222333333334444444455000000050000000000000060"), 0x0046e1), + (hex!("0122222222333333334444444455000000050000000000005550"), 0x004781), + (hex!("0122222222333333334444444455000000060000000000000070"), 0x004821), + 
(hex!("01222222223333333344444444550000000600000000000044a0"), 0x0048c1), + (hex!("0122222222333333334444444455000000060000000000006870"), 0x004961), + (hex!("0122222222333333334444444455000000070000000000000080"), 0x004a01), + (hex!("0122222222333333334444444455000000080000000000000090"), 0x004aa1), + (hex!("0122222222333333334444444455000000080000000000004150"), 0x004b41), + (hex!("01222222223333333344444444550000000900000000000000a0"), 0x004be1), + (hex!("01222222223333333344444444550000000a00000000000000b0"), 0x004c81), + (hex!("01222222223333333344444444550000000a0000000000006680"), 0x004d21), + (hex!("01222222223333333344444444550000000b00000000000000c0"), 0x004dc1), + (hex!("01222222223333333344444444550000000b0000000000006230"), 0x004e61), + (hex!("01222222223333333344444444550000000c00000000000000d0"), 0x004f01), + (hex!("01222222223333333344444444550000000d00000000000000e0"), 0x004fa1), + (hex!("01222222223333333344444444550000000e00000000000000f0"), 0x005041), + (hex!("01222222223333333344444444550000000e0000000000006000"), 0x0050e1), + (hex!("01222222223333333344444444550000000f0000000000000100"), 0x005181), + (hex!("01222222223333333344444444550000000f00000000000053c0"), 0x005221), + (hex!("01222222223333333344444444550000000f0000000000006580"), 0x0052c1), + (hex!("0122222222333333334444444455000000100000000000000110"), 0x005361), + (hex!("01222222223333333344444444550000001000000000000046c0"), 0x005401), + (hex!("0122222222333333334444444455000000100000000000004e40"), 0x0054a1), + (hex!("0122222222333333334444444455000000110000000000000120"), 0x005541), + (hex!("0122222222333333334444444455000000120000000000000130"), 0x0055e1), + (hex!("01222222223333333344444444550000001200000000000066d0"), 0x005681), + (hex!("0122222222333333334444444455000000130000000000000140"), 0x005721), + (hex!("0122222222333333334444444455000000130000000000007710"), 0x0057c1), + (hex!("0122222222333333334444444455000000140000000000000150"), 0x005861), + 
(hex!("0122222222333333334444444455000000140000000000006c40"), 0x005901), + (hex!("0122222222333333334444444455000000150000000000000160"), 0x0059a1), + (hex!("0122222222333333334444444455000000150000000000005990"), 0x005a41), + (hex!("0122222222333333334444444455000000160000000000000170"), 0x005ae1), + (hex!("0122222222333333334444444455000000160000000000005530"), 0x005b81), + (hex!("0122222222333333334444444455000000170000000000000180"), 0x005c21), + (hex!("0122222222333333334444444455000000170000000000004290"), 0x005cc1), + (hex!("0122222222333333334444444455000000180000000000000190"), 0x005d61), + (hex!("01222222223333333344444444550000001800000000000051c0"), 0x005e01), + (hex!("01222222223333333344444444550000001900000000000001a0"), 0x005ea1), + (hex!("0122222222333333334444444455000000190000000000005420"), 0x005f41), + (hex!("0122222222333333334444444455000000190000000000005770"), 0x005fe1), + (hex!("01222222223333333344444444550000001900000000000079d0"), 0x006081), + (hex!("01222222223333333344444444550000001a00000000000001b0"), 0x006121), + (hex!("01222222223333333344444444550000001a0000000000006f70"), 0x0061c1), + (hex!("01222222223333333344444444550000001a0000000000007150"), 0x006261), + (hex!("01222222223333333344444444550000001b00000000000001c0"), 0x006301), + (hex!("01222222223333333344444444550000001b0000000000005070"), 0x0063a1), + (hex!("01222222223333333344444444550000001c00000000000001d0"), 0x006441), + (hex!("01222222223333333344444444550000001d00000000000001e0"), 0x0064e1), + (hex!("01222222223333333344444444550000001e00000000000001f0"), 0x006581), + (hex!("01222222223333333344444444550000001e0000000000005650"), 0x006621), + (hex!("01222222223333333344444444550000001f0000000000000200"), 0x0066c1), + (hex!("01222222223333333344444444550000001f0000000000006ca0"), 0x006761), + (hex!("0122222222333333334444444455000000200000000000000210"), 0x006801), + (hex!("0122222222333333334444444455000000200000000000005fc0"), 0x0068a1), + 
(hex!("0122222222333333334444444455000000210000000000000220"), 0x006941), + (hex!("0122222222333333334444444455000000210000000000006430"), 0x0069e1), + (hex!("0122222222333333334444444455000000220000000000000230"), 0x006a81), + (hex!("01222222223333333344444444550000002200000000000040e0"), 0x006b21), + (hex!("0122222222333333334444444455000000230000000000000240"), 0x006bc1), + (hex!("01222222223333333344444444550000002300000000000042d0"), 0x006c61), + (hex!("0122222222333333334444444455000000240000000000000250"), 0x006d01), + (hex!("0122222222333333334444444455000000250000000000000260"), 0x006da1), + (hex!("01222222223333333344444444550000002500000000000058c0"), 0x006e41), + (hex!("0122222222333333334444444455000000260000000000000270"), 0x006ee1), + (hex!("0122222222333333334444444455000000260000000000004020"), 0x006f81), + (hex!("0122222222333333334444444455000000270000000000000280"), 0x007021), + (hex!("0122222222333333334444444455000000280000000000000290"), 0x0070c1), + (hex!("0122222222333333334444444455000000280000000000007c00"), 0x007161), + (hex!("01222222223333333344444444550000002900000000000002a0"), 0x007201), + (hex!("01222222223333333344444444550000002a00000000000002b0"), 0x0072a1), + (hex!("01222222223333333344444444550000002b00000000000002c0"), 0x007341), + (hex!("01222222223333333344444444550000002c00000000000002d0"), 0x0073e1), + (hex!("01222222223333333344444444550000002c00000000000041b0"), 0x007481), + (hex!("01222222223333333344444444550000002c0000000000004c30"), 0x007521), + (hex!("01222222223333333344444444550000002d00000000000002e0"), 0x0075c1), + (hex!("01222222223333333344444444550000002d0000000000005e40"), 0x007661), + (hex!("01222222223333333344444444550000002d0000000000006990"), 0x007701), + (hex!("01222222223333333344444444550000002e00000000000002f0"), 0x0077a1), + (hex!("01222222223333333344444444550000002f0000000000000300"), 0x007841), + (hex!("01222222223333333344444444550000002f0000000000004a70"), 0x0078e1), + 
(hex!("01222222223333333344444444550000002f0000000000006b40"), 0x007981), + (hex!("0122222222333333334444444455000000300000000000000310"), 0x007a21), + (hex!("0122222222333333334444444455000000310000000000000320"), 0x007ac1), + (hex!("0122222222333333334444444455000000320000000000000330"), 0x007b61), + (hex!("01222222223333333344444444550000003200000000000041a0"), 0x007c01), + (hex!("0122222222333333334444444455000000320000000000007340"), 0x007ca1), + (hex!("0122222222333333334444444455000000320000000000007730"), 0x007d41), + (hex!("0122222222333333334444444455000000330000000000000340"), 0x007de1), + (hex!("01222222223333333344444444550000003300000000000055a0"), 0x007e81), + (hex!("0122222222333333334444444455000000340000000000000350"), 0x007f21), + (hex!("0122222222333333334444444455000000350000000000000360"), 0x007fc1), + (hex!("01222222223333333344444444550000003500000000000077a0"), 0x008061), + (hex!("0122222222333333334444444455000000360000000000000370"), 0x008101), + (hex!("0122222222333333334444444455000000370000000000000380"), 0x0081a1), + (hex!("0122222222333333334444444455000000380000000000000390"), 0x008241), + (hex!("01222222223333333344444444550000003900000000000003a0"), 0x0082e1), + (hex!("01222222223333333344444444550000003a00000000000003b0"), 0x008381), + (hex!("01222222223333333344444444550000003a00000000000071c0"), 0x008421), + (hex!("01222222223333333344444444550000003b00000000000003c0"), 0x0084c1), + (hex!("01222222223333333344444444550000003c00000000000003d0"), 0x008561), + (hex!("01222222223333333344444444550000003d00000000000003e0"), 0x008601), + (hex!("01222222223333333344444444550000003e00000000000003f0"), 0x0086a1), + (hex!("01222222223333333344444444550000003e00000000000062e0"), 0x008741), + (hex!("01222222223333333344444444550000003f0000000000000400"), 0x0087e1), + (hex!("0122222222333333334444444455000000400000000000000410"), 0x008881), + (hex!("0122222222333333334444444455000000400000000000004460"), 0x008921), + 
(hex!("0122222222333333334444444455000000400000000000005b90"), 0x0089c1), + (hex!("01222222223333333344444444550000004000000000000079b0"), 0x008a61), + (hex!("0122222222333333334444444455000000410000000000000420"), 0x008b01), + (hex!("0122222222333333334444444455000000420000000000000430"), 0x008ba1), + (hex!("0122222222333333334444444455000000420000000000005640"), 0x008c41), + (hex!("0122222222333333334444444455000000430000000000000440"), 0x008ce1), + (hex!("01222222223333333344444444550000004300000000000072a0"), 0x008d81), + (hex!("0122222222333333334444444455000000440000000000000450"), 0x008e21), + (hex!("0122222222333333334444444455000000450000000000000460"), 0x008ec1), + (hex!("0122222222333333334444444455000000450000000000005750"), 0x008f61), + (hex!("01222222223333333344444444550000004500000000000077b0"), 0x009001), + (hex!("0122222222333333334444444455000000460000000000000470"), 0x0090a1), + (hex!("0122222222333333334444444455000000470000000000000480"), 0x009141), + (hex!("0122222222333333334444444455000000480000000000000490"), 0x0091e1), + (hex!("01222222223333333344444444550000004800000000000069e0"), 0x009281), + (hex!("01222222223333333344444444550000004900000000000004a0"), 0x009321), + (hex!("0122222222333333334444444455000000490000000000007370"), 0x0093c1), + (hex!("01222222223333333344444444550000004a00000000000004b0"), 0x009461), + (hex!("01222222223333333344444444550000004a0000000000005cb0"), 0x009501), + (hex!("01222222223333333344444444550000004b00000000000004c0"), 0x0095a1), + (hex!("01222222223333333344444444550000004c00000000000004d0"), 0x009641), + (hex!("01222222223333333344444444550000004c0000000000004880"), 0x0096e1), + (hex!("01222222223333333344444444550000004c0000000000007a40"), 0x009781), + (hex!("01222222223333333344444444550000004d00000000000004e0"), 0x009821), + (hex!("01222222223333333344444444550000004d0000000000006390"), 0x0098c1), + (hex!("01222222223333333344444444550000004e00000000000004f0"), 0x009961), + 
(hex!("01222222223333333344444444550000004e0000000000004db0"), 0x009a01), + (hex!("01222222223333333344444444550000004f0000000000000500"), 0x009aa1), + (hex!("0122222222333333334444444455000000500000000000000510"), 0x009b41), + (hex!("0122222222333333334444444455000000510000000000000520"), 0x009be1), + (hex!("01222222223333333344444444550000005100000000000069c0"), 0x009c81), + (hex!("0122222222333333334444444455000000520000000000000530"), 0x009d21), + (hex!("0122222222333333334444444455000000520000000000006e60"), 0x009dc1), + (hex!("01222222223333333344444444550000005200000000000070c0"), 0x009e61), + (hex!("0122222222333333334444444455000000530000000000000540"), 0x009f01), + (hex!("0122222222333333334444444455000000530000000000005840"), 0x009fa1), + (hex!("0122222222333333334444444455000000540000000000000550"), 0x00a041), + (hex!("01222222223333333344444444550000005400000000000043e0"), 0x00a0e1), + (hex!("01222222223333333344444444550000005400000000000074e0"), 0x00a181), + (hex!("0122222222333333334444444455000000550000000000000560"), 0x00a221), + (hex!("0122222222333333334444444455000000550000000000003ee0"), 0x00a2c1), + (hex!("0122222222333333334444444455000000560000000000000570"), 0x00a361), + (hex!("0122222222333333334444444455000000570000000000000580"), 0x00a401), + (hex!("0122222222333333334444444455000000570000000000007030"), 0x00a4a1), + (hex!("0122222222333333334444444455000000580000000000000590"), 0x00a541), + (hex!("0122222222333333334444444455000000580000000000005340"), 0x00a5e1), + (hex!("01222222223333333344444444550000005800000000000059f0"), 0x00a681), + (hex!("0122222222333333334444444455000000580000000000006930"), 0x00a721), + (hex!("01222222223333333344444444550000005900000000000005a0"), 0x00a7c1), + (hex!("0122222222333333334444444455000000590000000000003f90"), 0x00a861), + (hex!("01222222223333333344444444550000005a00000000000005b0"), 0x00a901), + (hex!("01222222223333333344444444550000005b00000000000005c0"), 0x00a9a1), + 
(hex!("01222222223333333344444444550000005b00000000000062c0"), 0x00aa41), + (hex!("01222222223333333344444444550000005c00000000000005d0"), 0x00aae1), + (hex!("01222222223333333344444444550000005c0000000000005a70"), 0x00ab81), + (hex!("01222222223333333344444444550000005c0000000000005dd0"), 0x00ac21), + (hex!("01222222223333333344444444550000005d00000000000005e0"), 0x00acc1), + (hex!("01222222223333333344444444550000005d0000000000005730"), 0x00ad61), + (hex!("01222222223333333344444444550000005e00000000000005f0"), 0x00ae01), + (hex!("01222222223333333344444444550000005e0000000000004f40"), 0x00aea1), + (hex!("01222222223333333344444444550000005f0000000000000600"), 0x00af41), + (hex!("0122222222333333334444444455000000600000000000000610"), 0x00afe1), + (hex!("0122222222333333334444444455000000600000000000007c40"), 0x00b081), + (hex!("0122222222333333334444444455000000610000000000000620"), 0x00b121), + (hex!("0122222222333333334444444455000000610000000000007860"), 0x00b1c1), + (hex!("0122222222333333334444444455000000620000000000000630"), 0x00b261), + (hex!("0122222222333333334444444455000000620000000000005050"), 0x00b301), + (hex!("0122222222333333334444444455000000630000000000000640"), 0x00b3a1), + (hex!("0122222222333333334444444455000000640000000000000650"), 0x00b441), + (hex!("0122222222333333334444444455000000650000000000000660"), 0x00b4e1), + (hex!("0122222222333333334444444455000000650000000000005330"), 0x00b581), + (hex!("0122222222333333334444444455000000660000000000000670"), 0x00b621), + (hex!("0122222222333333334444444455000000660000000000004e20"), 0x00b6c1), + (hex!("0122222222333333334444444455000000660000000000005ee0"), 0x00b761), + (hex!("0122222222333333334444444455000000660000000000006360"), 0x00b801), + (hex!("0122222222333333334444444455000000670000000000000680"), 0x00b8a1), + (hex!("0122222222333333334444444455000000670000000000004040"), 0x00b941), + (hex!("0122222222333333334444444455000000680000000000000690"), 0x00b9e1), + 
(hex!("0122222222333333334444444455000000680000000000003f80"), 0x00ba81), + (hex!("01222222223333333344444444550000006800000000000041e0"), 0x00bb21), + (hex!("01222222223333333344444444550000006900000000000006a0"), 0x00bbc1), + (hex!("0122222222333333334444444455000000690000000000006080"), 0x00bc61), + (hex!("01222222223333333344444444550000006a00000000000006b0"), 0x00bd01), + (hex!("01222222223333333344444444550000006a00000000000042f0"), 0x00bda1), + (hex!("01222222223333333344444444550000006b00000000000006c0"), 0x00be41), + (hex!("01222222223333333344444444550000006b00000000000052f0"), 0x00bee1), + (hex!("01222222223333333344444444550000006b0000000000005980"), 0x00bf81), + (hex!("01222222223333333344444444550000006b0000000000006170"), 0x00c021), + (hex!("01222222223333333344444444550000006c00000000000006d0"), 0x00c0c1), + (hex!("01222222223333333344444444550000006d00000000000006e0"), 0x00c161), + (hex!("01222222223333333344444444550000006d0000000000006fb0"), 0x00c201), + (hex!("01222222223333333344444444550000006e00000000000006f0"), 0x00c2a1), + (hex!("01222222223333333344444444550000006e00000000000065b0"), 0x00c341), + (hex!("01222222223333333344444444550000006e0000000000007970"), 0x00c3e1), + (hex!("01222222223333333344444444550000006f0000000000000700"), 0x00c481), + (hex!("01222222223333333344444444550000006f0000000000005900"), 0x00c521), + (hex!("01222222223333333344444444550000006f0000000000006d90"), 0x00c5c1), + (hex!("0122222222333333334444444455000000700000000000000710"), 0x00c661), + (hex!("01222222223333333344444444550000007000000000000045c0"), 0x00c701), + (hex!("0122222222333333334444444455000000700000000000004d40"), 0x00c7a1), + (hex!("0122222222333333334444444455000000710000000000000720"), 0x00c841), + (hex!("0122222222333333334444444455000000710000000000004dc0"), 0x00c8e1), + (hex!("0122222222333333334444444455000000710000000000007550"), 0x00c981), + (hex!("0122222222333333334444444455000000720000000000000730"), 0x00ca21), + 
(hex!("0122222222333333334444444455000000720000000000003ec0"), 0x00cac1), + (hex!("01222222223333333344444444550000007200000000000045a0"), 0x00cb61), + (hex!("0122222222333333334444444455000000720000000000006770"), 0x00cc01), + (hex!("0122222222333333334444444455000000720000000000006bc0"), 0x00cca1), + (hex!("0122222222333333334444444455000000730000000000000740"), 0x00cd41), + (hex!("0122222222333333334444444455000000730000000000005250"), 0x00cde1), + (hex!("01222222223333333344444444550000007300000000000075f0"), 0x00ce81), + (hex!("0122222222333333334444444455000000740000000000000750"), 0x00cf21), + (hex!("0122222222333333334444444455000000740000000000003ff0"), 0x00cfc1), + (hex!("01222222223333333344444444550000007400000000000079e0"), 0x00d061), + (hex!("0122222222333333334444444455000000750000000000000760"), 0x00d101), + (hex!("0122222222333333334444444455000000750000000000004310"), 0x00d1a1), + (hex!("0122222222333333334444444455000000760000000000000770"), 0x00d241), + (hex!("0122222222333333334444444455000000770000000000000780"), 0x00d2e1), + (hex!("01222222223333333344444444550000007700000000000062f0"), 0x00d381), + (hex!("0122222222333333334444444455000000770000000000006940"), 0x00d421), + (hex!("0122222222333333334444444455000000780000000000000790"), 0x00d4c1), + (hex!("01222222223333333344444444550000007900000000000007a0"), 0x00d561), + (hex!("0122222222333333334444444455000000790000000000007af0"), 0x00d601), + (hex!("01222222223333333344444444550000007a00000000000007b0"), 0x00d6a1), + (hex!("01222222223333333344444444550000007b00000000000007c0"), 0x00d741), + (hex!("01222222223333333344444444550000007b00000000000067e0"), 0x00d7e1), + (hex!("01222222223333333344444444550000007b0000000000007890"), 0x00d881), + (hex!("01222222223333333344444444550000007c00000000000007d0"), 0x00d921), + (hex!("01222222223333333344444444550000007d00000000000007e0"), 0x00d9c1), + (hex!("01222222223333333344444444550000007e00000000000007f0"), 0x00da61), + 
(hex!("01222222223333333344444444550000007f0000000000000800"), 0x00db01), + (hex!("01222222223333333344444444550000007f0000000000005be0"), 0x00dba1), + (hex!("0122222222333333334444444455000000800000000000000810"), 0x00dc41), + (hex!("0122222222333333334444444455000000810000000000000820"), 0x00dce1), + (hex!("0122222222333333334444444455000000810000000000007190"), 0x00dd81), + (hex!("0122222222333333334444444455000000820000000000000830"), 0x00de21), + (hex!("0122222222333333334444444455000000820000000000004ab0"), 0x00dec1), + (hex!("0122222222333333334444444455000000830000000000000840"), 0x00df61), + (hex!("0122222222333333334444444455000000830000000000006720"), 0x00e001), + (hex!("0122222222333333334444444455000000840000000000000850"), 0x00e0a1), + (hex!("0122222222333333334444444455000000850000000000000860"), 0x00e141), + (hex!("01222222223333333344444444550000008500000000000054f0"), 0x00e1e1), + (hex!("0122222222333333334444444455000000850000000000007920"), 0x00e281), + (hex!("0122222222333333334444444455000000860000000000000870"), 0x00e321), + (hex!("01222222223333333344444444550000008600000000000060e0"), 0x00e3c1), + (hex!("0122222222333333334444444455000000860000000000006be0"), 0x00e461), + (hex!("0122222222333333334444444455000000870000000000000880"), 0x00e501), + (hex!("0122222222333333334444444455000000870000000000006820"), 0x00e5a1), + (hex!("0122222222333333334444444455000000880000000000000890"), 0x00e641), + (hex!("01222222223333333344444444550000008900000000000008a0"), 0x00e6e1), + (hex!("0122222222333333334444444455000000890000000000007c30"), 0x00e781), + (hex!("01222222223333333344444444550000008a00000000000008b0"), 0x00e821), + (hex!("01222222223333333344444444550000008b00000000000008c0"), 0x00e8c1), + (hex!("01222222223333333344444444550000008b0000000000005910"), 0x00e961), + (hex!("01222222223333333344444444550000008b0000000000006fe0"), 0x00ea01), + (hex!("01222222223333333344444444550000008c00000000000008d0"), 0x00eaa1), + 
(hex!("01222222223333333344444444550000008c0000000000006800"), 0x00eb41), + (hex!("01222222223333333344444444550000008d00000000000008e0"), 0x00ebe1), + (hex!("01222222223333333344444444550000008d0000000000005810"), 0x00ec81), + (hex!("01222222223333333344444444550000008d0000000000007c90"), 0x00ed21), + (hex!("01222222223333333344444444550000008e00000000000008f0"), 0x00edc1), + (hex!("01222222223333333344444444550000008e00000000000058f0"), 0x00ee61), + (hex!("01222222223333333344444444550000008f0000000000000900"), 0x00ef01), + (hex!("01222222223333333344444444550000008f0000000000005a30"), 0x00efa1), + (hex!("0122222222333333334444444455000000900000000000000910"), 0x00f041), + (hex!("0122222222333333334444444455000000900000000000006130"), 0x00f0e1), + (hex!("0122222222333333334444444455000000900000000000006550"), 0x00f181), + (hex!("0122222222333333334444444455000000910000000000000920"), 0x00f221), + (hex!("01222222223333333344444444550000009100000000000079f0"), 0x00f2c1), + (hex!("0122222222333333334444444455000000920000000000000930"), 0x00f361), + (hex!("0122222222333333334444444455000000920000000000005620"), 0x00f401), + (hex!("0122222222333333334444444455000000920000000000005e90"), 0x00f4a1), + (hex!("01222222223333333344444444550000009200000000000063d0"), 0x00f541), + (hex!("01222222223333333344444444550000009200000000000076c0"), 0x00f5e1), + (hex!("0122222222333333334444444455000000930000000000000940"), 0x00f681), + (hex!("01222222223333333344444444550000009300000000000044e0"), 0x00f721), + (hex!("0122222222333333334444444455000000940000000000000950"), 0x00f7c1), + (hex!("0122222222333333334444444455000000940000000000007a30"), 0x00f861), + (hex!("0122222222333333334444444455000000950000000000000960"), 0x00f901), + (hex!("0122222222333333334444444455000000950000000000007a70"), 0x00f9a1), + (hex!("0122222222333333334444444455000000960000000000000970"), 0x00fa41), + (hex!("0122222222333333334444444455000000970000000000000980"), 0x00fae1), + 
(hex!("0122222222333333334444444455000000970000000000007330"), 0x00fb81), + (hex!("0122222222333333334444444455000000980000000000000990"), 0x00fc21), + (hex!("0122222222333333334444444455000000980000000000005af0"), 0x00fcc1), + (hex!("0122222222333333334444444455000000980000000000007ae0"), 0x00fd61), + (hex!("01222222223333333344444444550000009900000000000009a0"), 0x00fe01), + (hex!("0122222222333333334444444455000000990000000000005160"), 0x00fea1), + (hex!("0122222222333333334444444455000000990000000000006850"), 0x00ff41), + (hex!("01222222223333333344444444550000009a00000000000009b0"), 0x00ffe1), + (hex!("01222222223333333344444444550000009b00000000000009c0"), 0x010081), + (hex!("01222222223333333344444444550000009b0000000000005010"), 0x010121), + (hex!("01222222223333333344444444550000009c00000000000009d0"), 0x0101c1), + (hex!("01222222223333333344444444550000009c00000000000042e0"), 0x010261), + (hex!("01222222223333333344444444550000009d00000000000009e0"), 0x010301), + (hex!("01222222223333333344444444550000009d00000000000057f0"), 0x0103a1), + (hex!("01222222223333333344444444550000009e00000000000009f0"), 0x010441), + (hex!("01222222223333333344444444550000009e0000000000004ef0"), 0x0104e1), + (hex!("01222222223333333344444444550000009f0000000000000a00"), 0x010581), + (hex!("01222222223333333344444444550000009f0000000000006110"), 0x010621), + (hex!("0122222222333333334444444455000000a00000000000000a10"), 0x0106c1), + (hex!("0122222222333333334444444455000000a10000000000000a20"), 0x010761), + (hex!("0122222222333333334444444455000000a100000000000040d0"), 0x010801), + (hex!("0122222222333333334444444455000000a10000000000007670"), 0x0108a1), + (hex!("0122222222333333334444444455000000a20000000000000a30"), 0x010941), + (hex!("0122222222333333334444444455000000a200000000000074d0"), 0x0109e1), + (hex!("0122222222333333334444444455000000a30000000000000a40"), 0x010a81), + (hex!("0122222222333333334444444455000000a30000000000004c90"), 0x010b21), + 
(hex!("0122222222333333334444444455000000a40000000000000a50"), 0x010bc1), + (hex!("0122222222333333334444444455000000a50000000000000a60"), 0x010c61), + (hex!("0122222222333333334444444455000000a60000000000000a70"), 0x010d01), + (hex!("0122222222333333334444444455000000a60000000000006d80"), 0x010da1), + (hex!("0122222222333333334444444455000000a60000000000007830"), 0x010e41), + (hex!("0122222222333333334444444455000000a70000000000000a80"), 0x010ee1), + (hex!("0122222222333333334444444455000000a700000000000064f0"), 0x010f81), + (hex!("0122222222333333334444444455000000a80000000000000a90"), 0x011021), + (hex!("0122222222333333334444444455000000a90000000000000aa0"), 0x0110c1), + (hex!("0122222222333333334444444455000000a90000000000005e30"), 0x011161), + (hex!("0122222222333333334444444455000000aa0000000000000ab0"), 0x011201), + (hex!("0122222222333333334444444455000000ab0000000000000ac0"), 0x0112a1), + (hex!("0122222222333333334444444455000000ac0000000000000ad0"), 0x011341), + (hex!("0122222222333333334444444455000000ac0000000000006d20"), 0x0113e1), + (hex!("0122222222333333334444444455000000ac0000000000007000"), 0x011481), + (hex!("0122222222333333334444444455000000ad0000000000000ae0"), 0x011521), + (hex!("0122222222333333334444444455000000ae0000000000000af0"), 0x0115c1), + (hex!("0122222222333333334444444455000000ae0000000000004a10"), 0x011661), + (hex!("0122222222333333334444444455000000af0000000000000b00"), 0x011701), + (hex!("0122222222333333334444444455000000af0000000000004e10"), 0x0117a1), + (hex!("0122222222333333334444444455000000b00000000000000b10"), 0x011841), + (hex!("0122222222333333334444444455000000b00000000000004280"), 0x0118e1), + (hex!("0122222222333333334444444455000000b000000000000077e0"), 0x011981), + (hex!("0122222222333333334444444455000000b10000000000000b20"), 0x011a21), + (hex!("0122222222333333334444444455000000b20000000000000b30"), 0x011ac1), + (hex!("0122222222333333334444444455000000b30000000000000b40"), 0x011b61), + 
(hex!("0122222222333333334444444455000000b30000000000004bc0"), 0x011c01), + (hex!("0122222222333333334444444455000000b40000000000000b50"), 0x011ca1), + (hex!("0122222222333333334444444455000000b50000000000000b60"), 0x011d41), + (hex!("0122222222333333334444444455000000b50000000000004fa0"), 0x011de1), + (hex!("0122222222333333334444444455000000b50000000000006a60"), 0x011e81), + (hex!("0122222222333333334444444455000000b60000000000000b70"), 0x011f21), + (hex!("0122222222333333334444444455000000b60000000000005630"), 0x011fc1), + (hex!("0122222222333333334444444455000000b70000000000000b80"), 0x012061), + (hex!("0122222222333333334444444455000000b80000000000000b90"), 0x012101), + (hex!("0122222222333333334444444455000000b80000000000006f80"), 0x0121a1), + (hex!("0122222222333333334444444455000000b90000000000000ba0"), 0x012241), + (hex!("0122222222333333334444444455000000ba0000000000000bb0"), 0x0122e1), + (hex!("0122222222333333334444444455000000bb0000000000000bc0"), 0x012381), + (hex!("0122222222333333334444444455000000bb00000000000047c0"), 0x012421), + (hex!("0122222222333333334444444455000000bb0000000000006060"), 0x0124c1), + (hex!("0122222222333333334444444455000000bc0000000000000bd0"), 0x012561), + (hex!("0122222222333333334444444455000000bd0000000000000be0"), 0x012601), + (hex!("0122222222333333334444444455000000bd0000000000004e80"), 0x0126a1), + (hex!("0122222222333333334444444455000000be0000000000000bf0"), 0x012741), + (hex!("0122222222333333334444444455000000bf0000000000000c00"), 0x0127e1), + (hex!("0122222222333333334444444455000000bf00000000000047a0"), 0x012881), + (hex!("0122222222333333334444444455000000bf0000000000006da0"), 0x012921), + (hex!("0122222222333333334444444455000000c00000000000000c10"), 0x0129c1), + (hex!("0122222222333333334444444455000000c10000000000000c20"), 0x012a61), + (hex!("0122222222333333334444444455000000c20000000000000c30"), 0x012b01), + (hex!("0122222222333333334444444455000000c20000000000004bd0"), 0x012ba1), + 
(hex!("0122222222333333334444444455000000c20000000000006ac0"), 0x012c41), + (hex!("0122222222333333334444444455000000c30000000000000c40"), 0x012ce1), + (hex!("0122222222333333334444444455000000c30000000000004660"), 0x012d81), + (hex!("0122222222333333334444444455000000c40000000000000c50"), 0x012e21), + (hex!("0122222222333333334444444455000000c50000000000000c60"), 0x012ec1), + (hex!("0122222222333333334444444455000000c60000000000000c70"), 0x012f61), + (hex!("0122222222333333334444444455000000c60000000000005880"), 0x013001), + (hex!("0122222222333333334444444455000000c60000000000006b70"), 0x0130a1), + (hex!("0122222222333333334444444455000000c70000000000000c80"), 0x013141), + (hex!("0122222222333333334444444455000000c80000000000000c90"), 0x0131e1), + (hex!("0122222222333333334444444455000000c80000000000005310"), 0x013281), + (hex!("0122222222333333334444444455000000c80000000000005db0"), 0x013321), + (hex!("0122222222333333334444444455000000c80000000000007040"), 0x0133c1), + (hex!("0122222222333333334444444455000000c80000000000007290"), 0x013461), + (hex!("0122222222333333334444444455000000c90000000000000ca0"), 0x013501), + (hex!("0122222222333333334444444455000000c90000000000004fe0"), 0x0135a1), + (hex!("0122222222333333334444444455000000ca0000000000000cb0"), 0x013641), + (hex!("0122222222333333334444444455000000ca0000000000006140"), 0x0136e1), + (hex!("0122222222333333334444444455000000ca0000000000007700"), 0x013781), + (hex!("0122222222333333334444444455000000cb0000000000000cc0"), 0x013821), + (hex!("0122222222333333334444444455000000cc0000000000000cd0"), 0x0138c1), + (hex!("0122222222333333334444444455000000cd0000000000000ce0"), 0x013961), + (hex!("0122222222333333334444444455000000cd0000000000003f20"), 0x013a01), + (hex!("0122222222333333334444444455000000cd00000000000040f0"), 0x013aa1), + (hex!("0122222222333333334444444455000000cd0000000000004ec0"), 0x013b41), + (hex!("0122222222333333334444444455000000ce0000000000000cf0"), 0x013be1), + 
(hex!("0122222222333333334444444455000000ce0000000000007200"), 0x013c81), + (hex!("0122222222333333334444444455000000cf0000000000000d00"), 0x013d21), + (hex!("0122222222333333334444444455000000cf00000000000046a0"), 0x013dc1), + (hex!("0122222222333333334444444455000000cf0000000000005960"), 0x013e61), + (hex!("0122222222333333334444444455000000d00000000000000d10"), 0x013f01), + (hex!("0122222222333333334444444455000000d00000000000005f30"), 0x013fa1), + (hex!("0122222222333333334444444455000000d10000000000000d20"), 0x014041), + (hex!("0122222222333333334444444455000000d10000000000007a00"), 0x0140e1), + (hex!("0122222222333333334444444455000000d20000000000000d30"), 0x014181), + (hex!("0122222222333333334444444455000000d30000000000000d40"), 0x014221), + (hex!("0122222222333333334444444455000000d40000000000000d50"), 0x0142c1), + (hex!("0122222222333333334444444455000000d50000000000000d60"), 0x014361), + (hex!("0122222222333333334444444455000000d50000000000004960"), 0x014401), + (hex!("0122222222333333334444444455000000d500000000000055d0"), 0x0144a1), + (hex!("0122222222333333334444444455000000d500000000000067d0"), 0x014541), + (hex!("0122222222333333334444444455000000d60000000000000d70"), 0x0145e1), + (hex!("0122222222333333334444444455000000d70000000000000d80"), 0x014681), + (hex!("0122222222333333334444444455000000d80000000000000d90"), 0x014721), + (hex!("0122222222333333334444444455000000d800000000000065f0"), 0x0147c1), + (hex!("0122222222333333334444444455000000d90000000000000da0"), 0x014861), + (hex!("0122222222333333334444444455000000d90000000000004980"), 0x014901), + (hex!("0122222222333333334444444455000000da0000000000000db0"), 0x0149a1), + (hex!("0122222222333333334444444455000000da00000000000048c0"), 0x014a41), + (hex!("0122222222333333334444444455000000da00000000000072c0"), 0x014ae1), + (hex!("0122222222333333334444444455000000da00000000000076b0"), 0x014b81), + (hex!("0122222222333333334444444455000000db0000000000000dc0"), 0x014c21), + 
(hex!("0122222222333333334444444455000000dc0000000000000dd0"), 0x014cc1), + (hex!("0122222222333333334444444455000000dc00000000000040a0"), 0x014d61), + (hex!("0122222222333333334444444455000000dc00000000000074c0"), 0x014e01), + (hex!("0122222222333333334444444455000000dd0000000000000de0"), 0x014ea1), + (hex!("0122222222333333334444444455000000dd0000000000004e50"), 0x014f41), + (hex!("0122222222333333334444444455000000dd0000000000007270"), 0x014fe1), + (hex!("0122222222333333334444444455000000de0000000000000df0"), 0x015081), + (hex!("0122222222333333334444444455000000de00000000000078d0"), 0x015121), + (hex!("0122222222333333334444444455000000df0000000000000e00"), 0x0151c1), + (hex!("0122222222333333334444444455000000df0000000000004d30"), 0x015261), + (hex!("0122222222333333334444444455000000df0000000000006c30"), 0x015301), + (hex!("0122222222333333334444444455000000e00000000000000e10"), 0x0153a1), + (hex!("0122222222333333334444444455000000e00000000000005d30"), 0x015441), + (hex!("0122222222333333334444444455000000e10000000000000e20"), 0x0154e1), + (hex!("0122222222333333334444444455000000e10000000000004610"), 0x015581), + (hex!("0122222222333333334444444455000000e100000000000051d0"), 0x015621), + (hex!("0122222222333333334444444455000000e10000000000005f10"), 0x0156c1), + (hex!("0122222222333333334444444455000000e20000000000000e30"), 0x015761), + (hex!("0122222222333333334444444455000000e20000000000007a90"), 0x015801), + (hex!("0122222222333333334444444455000000e30000000000000e40"), 0x0158a1), + (hex!("0122222222333333334444444455000000e30000000000005ae0"), 0x015941), + (hex!("0122222222333333334444444455000000e40000000000000e50"), 0x0159e1), + (hex!("0122222222333333334444444455000000e50000000000000e60"), 0x015a81), + (hex!("0122222222333333334444444455000000e50000000000004700"), 0x015b21), + (hex!("0122222222333333334444444455000000e500000000000065d0"), 0x015bc1), + (hex!("0122222222333333334444444455000000e60000000000000e70"), 0x015c61), + 
(hex!("0122222222333333334444444455000000e60000000000004fd0"), 0x015d01), + (hex!("0122222222333333334444444455000000e70000000000000e80"), 0x015da1), + (hex!("0122222222333333334444444455000000e70000000000005150"), 0x015e41), + (hex!("0122222222333333334444444455000000e70000000000005920"), 0x015ee1), + (hex!("0122222222333333334444444455000000e80000000000000e90"), 0x015f81), + (hex!("0122222222333333334444444455000000e80000000000004320"), 0x016021), + (hex!("0122222222333333334444444455000000e80000000000005ec0"), 0x0160c1), + (hex!("0122222222333333334444444455000000e90000000000000ea0"), 0x016161), + (hex!("0122222222333333334444444455000000e900000000000043b0"), 0x016201), + (hex!("0122222222333333334444444455000000ea0000000000000eb0"), 0x0162a1), + (hex!("0122222222333333334444444455000000ea0000000000003ea0"), 0x016341), + (hex!("0122222222333333334444444455000000ea0000000000004f50"), 0x0163e1), + (hex!("0122222222333333334444444455000000ea0000000000007520"), 0x016481), + (hex!("0122222222333333334444444455000000eb0000000000000ec0"), 0x016521), + (hex!("0122222222333333334444444455000000ec0000000000000ed0"), 0x0165c1), + (hex!("0122222222333333334444444455000000ec0000000000006670"), 0x016661), + (hex!("0122222222333333334444444455000000ed0000000000000ee0"), 0x016701), + (hex!("0122222222333333334444444455000000ee0000000000000ef0"), 0x0167a1), + (hex!("0122222222333333334444444455000000ee0000000000004d10"), 0x016841), + (hex!("0122222222333333334444444455000000ef0000000000000f00"), 0x0168e1), + (hex!("0122222222333333334444444455000000f00000000000000f10"), 0x016981), + (hex!("0122222222333333334444444455000000f00000000000007220"), 0x016a21), + (hex!("0122222222333333334444444455000000f00000000000007540"), 0x016ac1), + (hex!("0122222222333333334444444455000000f10000000000000f20"), 0x016b61), + (hex!("0122222222333333334444444455000000f100000000000066f0"), 0x016c01), + (hex!("0122222222333333334444444455000000f20000000000000f30"), 0x016ca1), + 
(hex!("0122222222333333334444444455000000f20000000000007810"), 0x016d41), + (hex!("0122222222333333334444444455000000f30000000000000f40"), 0x016de1), + (hex!("0122222222333333334444444455000000f30000000000007b70"), 0x016e81), + (hex!("0122222222333333334444444455000000f40000000000000f50"), 0x016f21), + (hex!("0122222222333333334444444455000000f400000000000059c0"), 0x016fc1), + (hex!("0122222222333333334444444455000000f50000000000000f60"), 0x017061), + (hex!("0122222222333333334444444455000000f50000000000003fb0"), 0x017101), + (hex!("0122222222333333334444444455000000f50000000000005740"), 0x0171a1), + (hex!("0122222222333333334444444455000000f500000000000064d0"), 0x017241), + (hex!("0122222222333333334444444455000000f50000000000006960"), 0x0172e1), + (hex!("0122222222333333334444444455000000f60000000000000f70"), 0x017381), + (hex!("0122222222333333334444444455000000f60000000000006d00"), 0x017421), + (hex!("0122222222333333334444444455000000f70000000000000f80"), 0x0174c1), + (hex!("0122222222333333334444444455000000f80000000000000f90"), 0x017561), + (hex!("0122222222333333334444444455000000f90000000000000fa0"), 0x017601), + (hex!("0122222222333333334444444455000000fa0000000000000fb0"), 0x0176a1), + (hex!("0122222222333333334444444455000000fa00000000000067b0"), 0x017741), + (hex!("0122222222333333334444444455000000fb0000000000000fc0"), 0x0177e1), + (hex!("0122222222333333334444444455000000fb0000000000004eb0"), 0x017881), + (hex!("0122222222333333334444444455000000fb0000000000006ef0"), 0x017921), + (hex!("0122222222333333334444444455000000fc0000000000000fd0"), 0x0179c1), + (hex!("0122222222333333334444444455000000fc0000000000004470"), 0x017a61), + (hex!("0122222222333333334444444455000000fc0000000000005940"), 0x017b01), + (hex!("0122222222333333334444444455000000fd0000000000000fe0"), 0x017ba1), + (hex!("0122222222333333334444444455000000fe0000000000000ff0"), 0x017c41), + (hex!("0122222222333333334444444455000000ff0000000000001000"), 0x017ce1), + 
(hex!("0122222222333333334444444455000000ff0000000000005690"), 0x017d81), + (hex!("0122222222333333334444444455000001000000000000001010"), 0x017e21), + (hex!("0122222222333333334444444455000001000000000000005210"), 0x017ec1), + (hex!("01222222223333333344444444550000010000000000000070a0"), 0x017f61), + (hex!("0122222222333333334444444455000001010000000000001020"), 0x018001), + (hex!("0122222222333333334444444455000001010000000000006b80"), 0x0180a1), + (hex!("0122222222333333334444444455000001020000000000001030"), 0x018141), + (hex!("0122222222333333334444444455000001030000000000001040"), 0x0181e1), + (hex!("0122222222333333334444444455000001030000000000004c80"), 0x018281), + (hex!("0122222222333333334444444455000001040000000000001050"), 0x018321), + (hex!("0122222222333333334444444455000001040000000000004850"), 0x0183c1), + (hex!("01222222223333333344444444550000010400000000000057b0"), 0x018461), + (hex!("0122222222333333334444444455000001050000000000001060"), 0x018501), + (hex!("01222222223333333344444444550000010500000000000048d0"), 0x0185a1), + (hex!("0122222222333333334444444455000001050000000000007870"), 0x018641), + (hex!("0122222222333333334444444455000001060000000000001070"), 0x0186e1), + (hex!("0122222222333333334444444455000001060000000000004f90"), 0x018781), + (hex!("0122222222333333334444444455000001060000000000006270"), 0x018821), + (hex!("0122222222333333334444444455000001070000000000001080"), 0x0188c1), + (hex!("01222222223333333344444444550000010700000000000063b0"), 0x018961), + (hex!("0122222222333333334444444455000001080000000000001090"), 0x018a01), + (hex!("01222222223333333344444444550000010900000000000010a0"), 0x018aa1), + (hex!("0122222222333333334444444455000001090000000000006f40"), 0x018b41), + (hex!("01222222223333333344444444550000010a00000000000010b0"), 0x018be1), + (hex!("01222222223333333344444444550000010a0000000000006640"), 0x018c81), + (hex!("01222222223333333344444444550000010b00000000000010c0"), 0x018d21), + 
(hex!("01222222223333333344444444550000010c00000000000010d0"), 0x018dc1), + (hex!("01222222223333333344444444550000010d00000000000010e0"), 0x018e61), + (hex!("01222222223333333344444444550000010e00000000000010f0"), 0x018f01), + (hex!("01222222223333333344444444550000010e0000000000005c40"), 0x018fa1), + (hex!("01222222223333333344444444550000010e0000000000007ba0"), 0x019041), + (hex!("01222222223333333344444444550000010f0000000000001100"), 0x0190e1), + (hex!("01222222223333333344444444550000010f0000000000005c30"), 0x019181), + (hex!("0122222222333333334444444455000001100000000000001110"), 0x019221), + (hex!("0122222222333333334444444455000001100000000000007640"), 0x0192c1), + (hex!("0122222222333333334444444455000001110000000000001120"), 0x019361), + (hex!("01222222223333333344444444550000011100000000000052c0"), 0x019401), + (hex!("0122222222333333334444444455000001110000000000005710"), 0x0194a1), + (hex!("0122222222333333334444444455000001110000000000006a00"), 0x019541), + (hex!("0122222222333333334444444455000001120000000000001130"), 0x0195e1), + (hex!("0122222222333333334444444455000001130000000000001140"), 0x019681), + (hex!("0122222222333333334444444455000001140000000000001150"), 0x019721), + (hex!("0122222222333333334444444455000001140000000000003fa0"), 0x0197c1), + (hex!("01222222223333333344444444550000011400000000000054b0"), 0x019861), + (hex!("0122222222333333334444444455000001140000000000006070"), 0x019901), + (hex!("0122222222333333334444444455000001150000000000001160"), 0x0199a1), + (hex!("0122222222333333334444444455000001150000000000005320"), 0x019a41), + (hex!("0122222222333333334444444455000001150000000000006600"), 0x019ae1), + (hex!("0122222222333333334444444455000001150000000000006df0"), 0x019b81), + (hex!("01222222223333333344444444550000011500000000000079c0"), 0x019c21), + (hex!("0122222222333333334444444455000001160000000000001170"), 0x019cc1), + (hex!("0122222222333333334444444455000001170000000000001180"), 0x019d61), + 
(hex!("0122222222333333334444444455000001170000000000004a60"), 0x019e01), + (hex!("01222222223333333344444444550000011700000000000063c0"), 0x019ea1), + (hex!("0122222222333333334444444455000001180000000000001190"), 0x019f41), + (hex!("0122222222333333334444444455000001180000000000004530"), 0x019fe1), + (hex!("01222222223333333344444444550000011800000000000077c0"), 0x01a081), + (hex!("01222222223333333344444444550000011900000000000011a0"), 0x01a121), + (hex!("01222222223333333344444444550000011a00000000000011b0"), 0x01a1c1), + (hex!("01222222223333333344444444550000011a00000000000041c0"), 0x01a261), + (hex!("01222222223333333344444444550000011a00000000000061e0"), 0x01a301), + (hex!("01222222223333333344444444550000011b00000000000011c0"), 0x01a3a1), + (hex!("01222222223333333344444444550000011c00000000000011d0"), 0x01a441), + (hex!("01222222223333333344444444550000011c0000000000005f90"), 0x01a4e1), + (hex!("01222222223333333344444444550000011d00000000000011e0"), 0x01a581), + (hex!("01222222223333333344444444550000011d0000000000004160"), 0x01a621), + (hex!("01222222223333333344444444550000011e00000000000011f0"), 0x01a6c1), + (hex!("01222222223333333344444444550000011e00000000000056d0"), 0x01a761), + (hex!("01222222223333333344444444550000011f0000000000001200"), 0x01a801), + (hex!("01222222223333333344444444550000011f0000000000004510"), 0x01a8a1), + (hex!("0122222222333333334444444455000001200000000000001210"), 0x01a941), + (hex!("0122222222333333334444444455000001210000000000001220"), 0x01a9e1), + (hex!("0122222222333333334444444455000001210000000000005140"), 0x01aa81), + (hex!("0122222222333333334444444455000001210000000000006710"), 0x01ab21), + (hex!("0122222222333333334444444455000001210000000000006f50"), 0x01abc1), + (hex!("0122222222333333334444444455000001220000000000001230"), 0x01ac61), + (hex!("0122222222333333334444444455000001220000000000005570"), 0x01ad01), + (hex!("0122222222333333334444444455000001220000000000007ac0"), 0x01ada1), + 
(hex!("0122222222333333334444444455000001230000000000001240"), 0x01ae41), + (hex!("0122222222333333334444444455000001240000000000001250"), 0x01aee1), + (hex!("0122222222333333334444444455000001240000000000006cd0"), 0x01af81), + (hex!("0122222222333333334444444455000001250000000000001260"), 0x01b021), + (hex!("01222222223333333344444444550000012500000000000046b0"), 0x01b0c1), + (hex!("0122222222333333334444444455000001250000000000005eb0"), 0x01b161), + (hex!("0122222222333333334444444455000001260000000000001270"), 0x01b201), + (hex!("0122222222333333334444444455000001260000000000004630"), 0x01b2a1), + (hex!("0122222222333333334444444455000001270000000000001280"), 0x01b341), + (hex!("0122222222333333334444444455000001270000000000004ff0"), 0x01b3e1), + (hex!("0122222222333333334444444455000001270000000000006ec0"), 0x01b481), + (hex!("0122222222333333334444444455000001280000000000001290"), 0x01b521), + (hex!("01222222223333333344444444550000012900000000000012a0"), 0x01b5c1), + (hex!("0122222222333333334444444455000001290000000000005f60"), 0x01b661), + (hex!("01222222223333333344444444550000012a00000000000012b0"), 0x01b701), + (hex!("01222222223333333344444444550000012a0000000000005480"), 0x01b7a1), + (hex!("01222222223333333344444444550000012b00000000000012c0"), 0x01b841), + (hex!("01222222223333333344444444550000012b00000000000065a0"), 0x01b8e1), + (hex!("01222222223333333344444444550000012b00000000000066c0"), 0x01b981), + (hex!("01222222223333333344444444550000012c00000000000012d0"), 0x01ba21), + (hex!("01222222223333333344444444550000012c00000000000064b0"), 0x01bac1), + (hex!("01222222223333333344444444550000012d00000000000012e0"), 0x01bb61), + (hex!("01222222223333333344444444550000012d00000000000049c0"), 0x01bc01), + (hex!("01222222223333333344444444550000012d0000000000004bf0"), 0x01bca1), + (hex!("01222222223333333344444444550000012e00000000000012f0"), 0x01bd41), + (hex!("01222222223333333344444444550000012e0000000000005ed0"), 0x01bde1), + 
(hex!("01222222223333333344444444550000012f0000000000001300"), 0x01be81), + (hex!("01222222223333333344444444550000012f00000000000049a0"), 0x01bf21), + (hex!("0122222222333333334444444455000001300000000000001310"), 0x01bfc1), + (hex!("0122222222333333334444444455000001300000000000007840"), 0x01c061), + (hex!("0122222222333333334444444455000001310000000000001320"), 0x01c101), + (hex!("0122222222333333334444444455000001310000000000005f70"), 0x01c1a1), + (hex!("0122222222333333334444444455000001320000000000001330"), 0x01c241), + (hex!("0122222222333333334444444455000001320000000000005a00"), 0x01c2e1), + (hex!("0122222222333333334444444455000001330000000000001340"), 0x01c381), + (hex!("0122222222333333334444444455000001330000000000006c70"), 0x01c421), + (hex!("0122222222333333334444444455000001340000000000001350"), 0x01c4c1), + (hex!("0122222222333333334444444455000001340000000000005c60"), 0x01c561), + (hex!("0122222222333333334444444455000001350000000000001360"), 0x01c601), + (hex!("0122222222333333334444444455000001350000000000004f10"), 0x01c6a1), + (hex!("0122222222333333334444444455000001360000000000001370"), 0x01c741), + (hex!("0122222222333333334444444455000001360000000000004c60"), 0x01c7e1), + (hex!("0122222222333333334444444455000001370000000000001380"), 0x01c881), + (hex!("0122222222333333334444444455000001380000000000001390"), 0x01c921), + (hex!("01222222223333333344444444550000013900000000000013a0"), 0x01c9c1), + (hex!("0122222222333333334444444455000001390000000000004ea0"), 0x01ca61), + (hex!("01222222223333333344444444550000013a00000000000013b0"), 0x01cb01), + (hex!("01222222223333333344444444550000013a0000000000007350"), 0x01cba1), + (hex!("01222222223333333344444444550000013b00000000000013c0"), 0x01cc41), + (hex!("01222222223333333344444444550000013c00000000000013d0"), 0x01cce1), + (hex!("01222222223333333344444444550000013c0000000000007050"), 0x01cd81), + (hex!("01222222223333333344444444550000013d00000000000013e0"), 0x01ce21), + 
(hex!("01222222223333333344444444550000013d0000000000006bd0"), 0x01cec1), + (hex!("01222222223333333344444444550000013e00000000000013f0"), 0x01cf61), + (hex!("01222222223333333344444444550000013e00000000000058e0"), 0x01d001), + (hex!("01222222223333333344444444550000013f0000000000001400"), 0x01d0a1), + (hex!("01222222223333333344444444550000013f0000000000004740"), 0x01d141), + (hex!("0122222222333333334444444455000001400000000000001410"), 0x01d1e1), + (hex!("0122222222333333334444444455000001400000000000003f10"), 0x01d281), + (hex!("0122222222333333334444444455000001400000000000006d40"), 0x01d321), + (hex!("01222222223333333344444444550000014000000000000072d0"), 0x01d3c1), + (hex!("0122222222333333334444444455000001410000000000001420"), 0x01d461), + (hex!("0122222222333333334444444455000001420000000000001430"), 0x01d501), + (hex!("0122222222333333334444444455000001430000000000001440"), 0x01d5a1), + (hex!("0122222222333333334444444455000001440000000000001450"), 0x01d641), + (hex!("0122222222333333334444444455000001450000000000001460"), 0x01d6e1), + (hex!("0122222222333333334444444455000001460000000000001470"), 0x01d781), + (hex!("01222222223333333344444444550000014600000000000055c0"), 0x01d821), + (hex!("0122222222333333334444444455000001470000000000001480"), 0x01d8c1), + (hex!("0122222222333333334444444455000001470000000000004570"), 0x01d961), + (hex!("0122222222333333334444444455000001470000000000004be0"), 0x01da01), + (hex!("0122222222333333334444444455000001480000000000001490"), 0x01daa1), + (hex!("0122222222333333334444444455000001480000000000005360"), 0x01db41), + (hex!("01222222223333333344444444550000014900000000000014a0"), 0x01dbe1), + (hex!("01222222223333333344444444550000014a00000000000014b0"), 0x01dc81), + (hex!("01222222223333333344444444550000014a00000000000053d0"), 0x01dd21), + (hex!("01222222223333333344444444550000014b00000000000014c0"), 0x01ddc1), + (hex!("01222222223333333344444444550000014b0000000000005950"), 0x01de61), + 
(hex!("01222222223333333344444444550000014c00000000000014d0"), 0x01df01), + (hex!("01222222223333333344444444550000014c0000000000004f60"), 0x01dfa1), + (hex!("01222222223333333344444444550000014d00000000000014e0"), 0x01e041), + (hex!("01222222223333333344444444550000014d0000000000004520"), 0x01e0e1), + (hex!("01222222223333333344444444550000014d0000000000005200"), 0x01e181), + (hex!("01222222223333333344444444550000014e00000000000014f0"), 0x01e221), + (hex!("01222222223333333344444444550000014e0000000000005bd0"), 0x01e2c1), + (hex!("01222222223333333344444444550000014f0000000000001500"), 0x01e361), + (hex!("01222222223333333344444444550000014f00000000000060d0"), 0x01e401), + (hex!("0122222222333333334444444455000001500000000000001510"), 0x01e4a1), + (hex!("01222222223333333344444444550000015000000000000075e0"), 0x01e541), + (hex!("0122222222333333334444444455000001510000000000001520"), 0x01e5e1), + (hex!("0122222222333333334444444455000001510000000000005c00"), 0x01e681), + (hex!("0122222222333333334444444455000001510000000000006af0"), 0x01e721), + (hex!("0122222222333333334444444455000001510000000000007b80"), 0x01e7c1), + (hex!("0122222222333333334444444455000001520000000000001530"), 0x01e861), + (hex!("0122222222333333334444444455000001520000000000004c70"), 0x01e901), + (hex!("0122222222333333334444444455000001530000000000001540"), 0x01e9a1), + (hex!("0122222222333333334444444455000001540000000000001550"), 0x01ea41), + (hex!("0122222222333333334444444455000001540000000000007cd0"), 0x01eae1), + (hex!("0122222222333333334444444455000001550000000000001560"), 0x01eb81), + (hex!("0122222222333333334444444455000001550000000000004ae0"), 0x01ec21), + (hex!("01222222223333333344444444550000015500000000000068c0"), 0x01ecc1), + (hex!("0122222222333333334444444455000001560000000000001570"), 0x01ed61), + (hex!("01222222223333333344444444550000015600000000000064a0"), 0x01ee01), + (hex!("0122222222333333334444444455000001570000000000001580"), 0x01eea1), + 
(hex!("0122222222333333334444444455000001580000000000001590"), 0x01ef41), + (hex!("0122222222333333334444444455000001580000000000006d30"), 0x01efe1), + (hex!("01222222223333333344444444550000015800000000000074f0"), 0x01f081), + (hex!("01222222223333333344444444550000015900000000000015a0"), 0x01f121), + (hex!("01222222223333333344444444550000015900000000000053a0"), 0x01f1c1), + (hex!("01222222223333333344444444550000015900000000000055e0"), 0x01f261), + (hex!("0122222222333333334444444455000001590000000000006210"), 0x01f301), + (hex!("01222222223333333344444444550000015900000000000067c0"), 0x01f3a1), + (hex!("01222222223333333344444444550000015a00000000000015b0"), 0x01f441), + (hex!("01222222223333333344444444550000015b00000000000015c0"), 0x01f4e1), + (hex!("01222222223333333344444444550000015c00000000000015d0"), 0x01f581), + (hex!("01222222223333333344444444550000015c0000000000004d80"), 0x01f621), + (hex!("01222222223333333344444444550000015c00000000000073f0"), 0x01f6c1), + (hex!("01222222223333333344444444550000015d00000000000015e0"), 0x01f761), + (hex!("01222222223333333344444444550000015e00000000000015f0"), 0x01f801), + (hex!("01222222223333333344444444550000015e0000000000004120"), 0x01f8a1), + (hex!("01222222223333333344444444550000015e0000000000004350"), 0x01f941), + (hex!("01222222223333333344444444550000015e0000000000007c50"), 0x01f9e1), + (hex!("01222222223333333344444444550000015f0000000000001600"), 0x01fa81), + (hex!("0122222222333333334444444455000001600000000000001610"), 0x01fb21), + (hex!("0122222222333333334444444455000001600000000000004840"), 0x01fbc1), + (hex!("0122222222333333334444444455000001600000000000004b10"), 0x01fc61), + (hex!("0122222222333333334444444455000001600000000000007060"), 0x01fd01), + (hex!("0122222222333333334444444455000001610000000000001620"), 0x01fda1), + (hex!("0122222222333333334444444455000001610000000000005300"), 0x01fe41), + (hex!("0122222222333333334444444455000001620000000000001630"), 0x01fee1), + 
(hex!("0122222222333333334444444455000001620000000000006530"), 0x01ff81), + (hex!("0122222222333333334444444455000001630000000000001640"), 0x020021), + (hex!("0122222222333333334444444455000001640000000000001650"), 0x0200c1), + (hex!("0122222222333333334444444455000001650000000000001660"), 0x020161), + (hex!("0122222222333333334444444455000001660000000000001670"), 0x020201), + (hex!("0122222222333333334444444455000001670000000000001680"), 0x0202a1), + (hex!("0122222222333333334444444455000001670000000000007310"), 0x020341), + (hex!("0122222222333333334444444455000001680000000000001690"), 0x0203e1), + (hex!("0122222222333333334444444455000001680000000000007b50"), 0x020481), + (hex!("01222222223333333344444444550000016900000000000016a0"), 0x020521), + (hex!("01222222223333333344444444550000016900000000000049d0"), 0x0205c1), + (hex!("01222222223333333344444444550000016a00000000000016b0"), 0x020661), + (hex!("01222222223333333344444444550000016a00000000000078b0"), 0x020701), + (hex!("01222222223333333344444444550000016b00000000000016c0"), 0x0207a1), + (hex!("01222222223333333344444444550000016b0000000000004100"), 0x020841), + (hex!("01222222223333333344444444550000016c00000000000016d0"), 0x0208e1), + (hex!("01222222223333333344444444550000016c0000000000006e00"), 0x020981), + (hex!("01222222223333333344444444550000016d00000000000016e0"), 0x020a21), + (hex!("01222222223333333344444444550000016e00000000000016f0"), 0x020ac1), + (hex!("01222222223333333344444444550000016e0000000000004ac0"), 0x020b61), + (hex!("01222222223333333344444444550000016e0000000000007820"), 0x020c01), + (hex!("01222222223333333344444444550000016f0000000000001700"), 0x020ca1), + (hex!("0122222222333333334444444455000001700000000000001710"), 0x020d41), + (hex!("0122222222333333334444444455000001700000000000005830"), 0x020de1), + (hex!("0122222222333333334444444455000001710000000000001720"), 0x020e81), + (hex!("01222222223333333344444444550000017100000000000072f0"), 0x020f21), + 
(hex!("0122222222333333334444444455000001720000000000001730"), 0x020fc1), + (hex!("0122222222333333334444444455000001720000000000004870"), 0x021061), + (hex!("01222222223333333344444444550000017200000000000070b0"), 0x021101), + (hex!("0122222222333333334444444455000001730000000000001740"), 0x0211a1), + (hex!("0122222222333333334444444455000001740000000000001750"), 0x021241), + (hex!("0122222222333333334444444455000001750000000000001760"), 0x0212e1), + (hex!("0122222222333333334444444455000001750000000000005670"), 0x021381), + (hex!("0122222222333333334444444455000001750000000000005870"), 0x021421), + (hex!("0122222222333333334444444455000001760000000000001770"), 0x0214c1), + (hex!("0122222222333333334444444455000001770000000000001780"), 0x021561), + (hex!("0122222222333333334444444455000001770000000000005000"), 0x021601), + (hex!("0122222222333333334444444455000001770000000000007090"), 0x0216a1), + (hex!("0122222222333333334444444455000001780000000000001790"), 0x021741), + (hex!("01222222223333333344444444550000017800000000000048a0"), 0x0217e1), + (hex!("0122222222333333334444444455000001780000000000006bf0"), 0x021881), + (hex!("01222222223333333344444444550000017900000000000017a0"), 0x021921), + (hex!("01222222223333333344444444550000017900000000000057d0"), 0x0219c1), + (hex!("0122222222333333334444444455000001790000000000006660"), 0x021a61), + (hex!("01222222223333333344444444550000017a00000000000017b0"), 0x021b01), + (hex!("01222222223333333344444444550000017a0000000000004970"), 0x021ba1), + (hex!("01222222223333333344444444550000017a0000000000005dc0"), 0x021c41), + (hex!("01222222223333333344444444550000017b00000000000017c0"), 0x021ce1), + (hex!("01222222223333333344444444550000017b0000000000004ee0"), 0x021d81), + (hex!("01222222223333333344444444550000017b00000000000054c0"), 0x021e21), + (hex!("01222222223333333344444444550000017c00000000000017d0"), 0x021ec1), + (hex!("01222222223333333344444444550000017c0000000000003fc0"), 0x021f61), + 
(hex!("01222222223333333344444444550000017c00000000000063e0"), 0x022001), + (hex!("01222222223333333344444444550000017c0000000000006520"), 0x0220a1), + (hex!("01222222223333333344444444550000017d00000000000017e0"), 0x022141), + (hex!("01222222223333333344444444550000017d0000000000006220"), 0x0221e1), + (hex!("01222222223333333344444444550000017d0000000000007120"), 0x022281), + (hex!("01222222223333333344444444550000017e00000000000017f0"), 0x022321), + (hex!("01222222223333333344444444550000017f0000000000001800"), 0x0223c1), + (hex!("0122222222333333334444444455000001800000000000001810"), 0x022461), + (hex!("0122222222333333334444444455000001810000000000001820"), 0x022501), + (hex!("01222222223333333344444444550000018100000000000041f0"), 0x0225a1), + (hex!("0122222222333333334444444455000001810000000000007590"), 0x022641), + (hex!("0122222222333333334444444455000001820000000000001830"), 0x0226e1), + (hex!("0122222222333333334444444455000001820000000000004ce0"), 0x022781), + (hex!("0122222222333333334444444455000001830000000000001840"), 0x022821), + (hex!("01222222223333333344444444550000018300000000000042c0"), 0x0228c1), + (hex!("0122222222333333334444444455000001840000000000001850"), 0x022961), + (hex!("0122222222333333334444444455000001840000000000004f70"), 0x022a01), + (hex!("0122222222333333334444444455000001850000000000001860"), 0x022aa1), + (hex!("0122222222333333334444444455000001850000000000006470"), 0x022b41), + (hex!("0122222222333333334444444455000001850000000000007500"), 0x022be1), + (hex!("0122222222333333334444444455000001860000000000001870"), 0x022c81), + (hex!("0122222222333333334444444455000001860000000000004770"), 0x022d21), + (hex!("0122222222333333334444444455000001870000000000001880"), 0x022dc1), + (hex!("0122222222333333334444444455000001870000000000006a30"), 0x022e61), + (hex!("0122222222333333334444444455000001880000000000001890"), 0x022f01), + (hex!("0122222222333333334444444455000001880000000000007410"), 0x022fa1), + 
(hex!("01222222223333333344444444550000018900000000000018a0"), 0x023041), + (hex!("01222222223333333344444444550000018900000000000044d0"), 0x0230e1), + (hex!("0122222222333333334444444455000001890000000000005ac0"), 0x023181), + (hex!("01222222223333333344444444550000018a00000000000018b0"), 0x023221), + (hex!("01222222223333333344444444550000018a0000000000006260"), 0x0232c1), + (hex!("01222222223333333344444444550000018a0000000000006d70"), 0x023361), + (hex!("01222222223333333344444444550000018b00000000000018c0"), 0x023401), + (hex!("01222222223333333344444444550000018b0000000000004aa0"), 0x0234a1), + (hex!("01222222223333333344444444550000018b0000000000006fd0"), 0x023541), + (hex!("01222222223333333344444444550000018c00000000000018d0"), 0x0235e1), + (hex!("01222222223333333344444444550000018c00000000000051b0"), 0x023681), + (hex!("01222222223333333344444444550000018c0000000000006650"), 0x023721), + (hex!("01222222223333333344444444550000018d00000000000018e0"), 0x0237c1), + (hex!("01222222223333333344444444550000018e00000000000018f0"), 0x023861), + (hex!("01222222223333333344444444550000018e00000000000041d0"), 0x023901), + (hex!("01222222223333333344444444550000018f0000000000001900"), 0x0239a1), + (hex!("01222222223333333344444444550000018f0000000000007600"), 0x023a41), + (hex!("0122222222333333334444444455000001900000000000001910"), 0x023ae1), + (hex!("0122222222333333334444444455000001900000000000005410"), 0x023b81), + (hex!("0122222222333333334444444455000001900000000000006760"), 0x023c21), + (hex!("0122222222333333334444444455000001910000000000001920"), 0x023cc1), + (hex!("0122222222333333334444444455000001920000000000001930"), 0x023d61), + (hex!("0122222222333333334444444455000001920000000000004ca0"), 0x023e01), + (hex!("0122222222333333334444444455000001920000000000005d80"), 0x023ea1), + (hex!("0122222222333333334444444455000001920000000000005fd0"), 0x023f41), + (hex!("01222222223333333344444444550000019200000000000070d0"), 0x023fe1), + 
(hex!("0122222222333333334444444455000001930000000000001940"), 0x024081), + (hex!("0122222222333333334444444455000001930000000000004010"), 0x024121), + (hex!("0122222222333333334444444455000001930000000000007ca0"), 0x0241c1), + (hex!("0122222222333333334444444455000001940000000000001950"), 0x024261), + (hex!("0122222222333333334444444455000001950000000000001960"), 0x024301), + (hex!("0122222222333333334444444455000001950000000000005380"), 0x0243a1), + (hex!("0122222222333333334444444455000001960000000000001970"), 0x024441), + (hex!("0122222222333333334444444455000001960000000000006de0"), 0x0244e1), + (hex!("0122222222333333334444444455000001970000000000001980"), 0x024581), + (hex!("01222222223333333344444444550000019700000000000048f0"), 0x024621), + (hex!("0122222222333333334444444455000001980000000000001990"), 0x0246c1), + (hex!("0122222222333333334444444455000001980000000000006510"), 0x024761), + (hex!("01222222223333333344444444550000019900000000000019a0"), 0x024801), + (hex!("0122222222333333334444444455000001990000000000007570"), 0x0248a1), + (hex!("0122222222333333334444444455000001990000000000007580"), 0x024941), + (hex!("01222222223333333344444444550000019a00000000000019b0"), 0x0249e1), + (hex!("01222222223333333344444444550000019a0000000000004050"), 0x024a81), + (hex!("01222222223333333344444444550000019a0000000000004ba0"), 0x024b21), + (hex!("01222222223333333344444444550000019a0000000000005540"), 0x024bc1), + (hex!("01222222223333333344444444550000019a00000000000061c0"), 0x024c61), + (hex!("01222222223333333344444444550000019a0000000000007c60"), 0x024d01), + (hex!("01222222223333333344444444550000019b00000000000019c0"), 0x024da1), + (hex!("01222222223333333344444444550000019b0000000000006240"), 0x024e41), + (hex!("01222222223333333344444444550000019c00000000000019d0"), 0x024ee1), + (hex!("01222222223333333344444444550000019d00000000000019e0"), 0x024f81), + (hex!("01222222223333333344444444550000019d0000000000004640"), 0x025021), + 
(hex!("01222222223333333344444444550000019d00000000000052a0"), 0x0250c1), + (hex!("01222222223333333344444444550000019d00000000000052b0"), 0x025161), + (hex!("01222222223333333344444444550000019e00000000000019f0"), 0x025201), + (hex!("01222222223333333344444444550000019f0000000000001a00"), 0x0252a1), + (hex!("01222222223333333344444444550000019f0000000000006b20"), 0x025341), + (hex!("0122222222333333334444444455000001a00000000000001a10"), 0x0253e1), + (hex!("0122222222333333334444444455000001a10000000000001a20"), 0x025481), + (hex!("0122222222333333334444444455000001a10000000000005460"), 0x025521), + (hex!("0122222222333333334444444455000001a10000000000005d20"), 0x0255c1), + (hex!("0122222222333333334444444455000001a100000000000068f0"), 0x025661), + (hex!("0122222222333333334444444455000001a20000000000001a30"), 0x025701), + (hex!("0122222222333333334444444455000001a20000000000007170"), 0x0257a1), + (hex!("0122222222333333334444444455000001a30000000000001a40"), 0x025841), + (hex!("0122222222333333334444444455000001a40000000000001a50"), 0x0258e1), + (hex!("0122222222333333334444444455000001a50000000000001a60"), 0x025981), + (hex!("0122222222333333334444444455000001a60000000000001a70"), 0x025a21), + (hex!("0122222222333333334444444455000001a70000000000001a80"), 0x025ac1), + (hex!("0122222222333333334444444455000001a70000000000005a90"), 0x025b61), + (hex!("0122222222333333334444444455000001a70000000000006440"), 0x025c01), + (hex!("0122222222333333334444444455000001a80000000000001a90"), 0x025ca1), + (hex!("0122222222333333334444444455000001a80000000000004800"), 0x025d41), + (hex!("0122222222333333334444444455000001a90000000000001aa0"), 0x025de1), + (hex!("0122222222333333334444444455000001aa0000000000001ab0"), 0x025e81), + (hex!("0122222222333333334444444455000001aa0000000000005b60"), 0x025f21), + (hex!("0122222222333333334444444455000001ab0000000000001ac0"), 0x025fc1), + (hex!("0122222222333333334444444455000001ab0000000000006700"), 0x026061), + 
(hex!("0122222222333333334444444455000001ab00000000000071d0"), 0x026101), + (hex!("0122222222333333334444444455000001ac0000000000001ad0"), 0x0261a1), + (hex!("0122222222333333334444444455000001ac0000000000007380"), 0x026241), + (hex!("0122222222333333334444444455000001ad0000000000001ae0"), 0x0262e1), + (hex!("0122222222333333334444444455000001ad0000000000006350"), 0x026381), + (hex!("0122222222333333334444444455000001ae0000000000001af0"), 0x026421), + (hex!("0122222222333333334444444455000001af0000000000001b00"), 0x0264c1), + (hex!("0122222222333333334444444455000001af0000000000007390"), 0x026561), + (hex!("0122222222333333334444444455000001b00000000000001b10"), 0x026601), + (hex!("0122222222333333334444444455000001b10000000000001b20"), 0x0266a1), + (hex!("0122222222333333334444444455000001b10000000000005cc0"), 0x026741), + (hex!("0122222222333333334444444455000001b20000000000001b30"), 0x0267e1), + (hex!("0122222222333333334444444455000001b20000000000004fb0"), 0x026881), + (hex!("0122222222333333334444444455000001b30000000000001b40"), 0x026921), + (hex!("0122222222333333334444444455000001b40000000000001b50"), 0x0269c1), + (hex!("0122222222333333334444444455000001b50000000000001b60"), 0x026a61), + (hex!("0122222222333333334444444455000001b60000000000001b70"), 0x026b01), + (hex!("0122222222333333334444444455000001b600000000000048e0"), 0x026ba1), + (hex!("0122222222333333334444444455000001b70000000000001b80"), 0x026c41), + (hex!("0122222222333333334444444455000001b70000000000005ca0"), 0x026ce1), + (hex!("0122222222333333334444444455000001b70000000000007900"), 0x026d81), + (hex!("0122222222333333334444444455000001b80000000000001b90"), 0x026e21), + (hex!("0122222222333333334444444455000001b80000000000004d90"), 0x026ec1), + (hex!("0122222222333333334444444455000001b90000000000001ba0"), 0x026f61), + (hex!("0122222222333333334444444455000001b90000000000003f40"), 0x027001), + (hex!("0122222222333333334444444455000001ba0000000000001bb0"), 0x0270a1), + 
(hex!("0122222222333333334444444455000001ba00000000000042a0"), 0x027141), + (hex!("0122222222333333334444444455000001ba00000000000067f0"), 0x0271e1), + (hex!("0122222222333333334444444455000001ba00000000000073a0"), 0x027281), + (hex!("0122222222333333334444444455000001bb0000000000001bc0"), 0x027321), + (hex!("0122222222333333334444444455000001bb0000000000004a00"), 0x0273c1), + (hex!("0122222222333333334444444455000001bb0000000000005e00"), 0x027461), + (hex!("0122222222333333334444444455000001bc0000000000001bd0"), 0x027501), + (hex!("0122222222333333334444444455000001bc0000000000004230"), 0x0275a1), + (hex!("0122222222333333334444444455000001bc0000000000005860"), 0x027641), + (hex!("0122222222333333334444444455000001bd0000000000001be0"), 0x0276e1), + (hex!("0122222222333333334444444455000001bd0000000000007c70"), 0x027781), + (hex!("0122222222333333334444444455000001be0000000000001bf0"), 0x027821), + (hex!("0122222222333333334444444455000001be0000000000007770"), 0x0278c1), + (hex!("0122222222333333334444444455000001be0000000000007cf0"), 0x027961), + (hex!("0122222222333333334444444455000001bf0000000000001c00"), 0x027a01), + (hex!("0122222222333333334444444455000001bf0000000000006490"), 0x027aa1), + (hex!("0122222222333333334444444455000001c00000000000001c10"), 0x027b41), + (hex!("0122222222333333334444444455000001c10000000000001c20"), 0x027be1), + (hex!("0122222222333333334444444455000001c10000000000004600"), 0x027c81), + (hex!("0122222222333333334444444455000001c20000000000001c30"), 0x027d21), + (hex!("0122222222333333334444444455000001c20000000000006e30"), 0x027dc1), + (hex!("0122222222333333334444444455000001c30000000000001c40"), 0x027e61), + (hex!("0122222222333333334444444455000001c40000000000001c50"), 0x027f01), + (hex!("0122222222333333334444444455000001c50000000000001c60"), 0x027fa1), + (hex!("0122222222333333334444444455000001c60000000000001c70"), 0x028041), + (hex!("0122222222333333334444444455000001c60000000000004240"), 0x0280e1), + 
(hex!("0122222222333333334444444455000001c60000000000005bb0"), 0x028181), + (hex!("0122222222333333334444444455000001c70000000000001c80"), 0x028221), + (hex!("0122222222333333334444444455000001c80000000000001c90"), 0x0282c1), + (hex!("0122222222333333334444444455000001c90000000000001ca0"), 0x028361), + (hex!("0122222222333333334444444455000001c90000000000006730"), 0x028401), + (hex!("0122222222333333334444444455000001ca0000000000001cb0"), 0x0284a1), + (hex!("0122222222333333334444444455000001ca00000000000070f0"), 0x028541), + (hex!("0122222222333333334444444455000001cb0000000000001cc0"), 0x0285e1), + (hex!("0122222222333333334444444455000001cb00000000000071a0"), 0x028681), + (hex!("0122222222333333334444444455000001cc0000000000001cd0"), 0x028721), + (hex!("0122222222333333334444444455000001cc0000000000005280"), 0x0287c1), + (hex!("0122222222333333334444444455000001cc0000000000005d90"), 0x028861), + (hex!("0122222222333333334444444455000001cd0000000000001ce0"), 0x028901), + (hex!("0122222222333333334444444455000001cd00000000000069b0"), 0x0289a1), + (hex!("0122222222333333334444444455000001ce0000000000001cf0"), 0x028a41), + (hex!("0122222222333333334444444455000001ce0000000000004540"), 0x028ae1), + (hex!("0122222222333333334444444455000001cf0000000000001d00"), 0x028b81), + (hex!("0122222222333333334444444455000001cf00000000000076a0"), 0x028c21), + (hex!("0122222222333333334444444455000001d00000000000001d10"), 0x028cc1), + (hex!("0122222222333333334444444455000001d000000000000060a0"), 0x028d61), + (hex!("0122222222333333334444444455000001d10000000000001d20"), 0x028e01), + (hex!("0122222222333333334444444455000001d20000000000001d30"), 0x028ea1), + (hex!("0122222222333333334444444455000001d30000000000001d40"), 0x028f41), + (hex!("0122222222333333334444444455000001d30000000000004000"), 0x028fe1), + (hex!("0122222222333333334444444455000001d30000000000004140"), 0x029081), + (hex!("0122222222333333334444444455000001d30000000000006790"), 0x029121), + 
(hex!("0122222222333333334444444455000001d40000000000001d50"), 0x0291c1), + (hex!("0122222222333333334444444455000001d50000000000001d60"), 0x029261), + (hex!("0122222222333333334444444455000001d60000000000001d70"), 0x029301), + (hex!("0122222222333333334444444455000001d60000000000004b50"), 0x0293a1), + (hex!("0122222222333333334444444455000001d60000000000007430"), 0x029441), + (hex!("0122222222333333334444444455000001d70000000000001d80"), 0x0294e1), + (hex!("0122222222333333334444444455000001d70000000000006920"), 0x029581), + (hex!("0122222222333333334444444455000001d80000000000001d90"), 0x029621), + (hex!("0122222222333333334444444455000001d80000000000005b30"), 0x0296c1), + (hex!("0122222222333333334444444455000001d90000000000001da0"), 0x029761), + (hex!("0122222222333333334444444455000001da0000000000001db0"), 0x029801), + (hex!("0122222222333333334444444455000001da0000000000004af0"), 0x0298a1), + (hex!("0122222222333333334444444455000001da0000000000007240"), 0x029941), + (hex!("0122222222333333334444444455000001da0000000000007470"), 0x0299e1), + (hex!("0122222222333333334444444455000001db0000000000001dc0"), 0x029a81), + (hex!("0122222222333333334444444455000001db00000000000045d0"), 0x029b21), + (hex!("0122222222333333334444444455000001dc0000000000001dd0"), 0x029bc1), + (hex!("0122222222333333334444444455000001dd0000000000001de0"), 0x029c61), + (hex!("0122222222333333334444444455000001dd0000000000004bb0"), 0x029d01), + (hex!("0122222222333333334444444455000001dd0000000000004cd0"), 0x029da1), + (hex!("0122222222333333334444444455000001dd0000000000006100"), 0x029e41), + (hex!("0122222222333333334444444455000001dd0000000000007bb0"), 0x029ee1), + (hex!("0122222222333333334444444455000001de0000000000001df0"), 0x029f81), + (hex!("0122222222333333334444444455000001de0000000000004260"), 0x02a021), + (hex!("0122222222333333334444444455000001de0000000000006040"), 0x02a0c1), + (hex!("0122222222333333334444444455000001df0000000000001e00"), 0x02a161), + 
(hex!("0122222222333333334444444455000001df0000000000005fa0"), 0x02a201), + (hex!("0122222222333333334444444455000001df0000000000006a70"), 0x02a2a1), + (hex!("0122222222333333334444444455000001df0000000000006dc0"), 0x02a341), + (hex!("0122222222333333334444444455000001e00000000000001e10"), 0x02a3e1), + (hex!("0122222222333333334444444455000001e00000000000007010"), 0x02a481), + (hex!("0122222222333333334444444455000001e10000000000001e20"), 0x02a521), + (hex!("0122222222333333334444444455000001e10000000000005720"), 0x02a5c1), + (hex!("0122222222333333334444444455000001e10000000000006830"), 0x02a661), + (hex!("0122222222333333334444444455000001e20000000000001e30"), 0x02a701), + (hex!("0122222222333333334444444455000001e20000000000005100"), 0x02a7a1), + (hex!("0122222222333333334444444455000001e30000000000001e40"), 0x02a841), + (hex!("0122222222333333334444444455000001e40000000000001e50"), 0x02a8e1), + (hex!("0122222222333333334444444455000001e40000000000003f30"), 0x02a981), + (hex!("0122222222333333334444444455000001e40000000000005220"), 0x02aa21), + (hex!("0122222222333333334444444455000001e50000000000001e60"), 0x02aac1), + (hex!("0122222222333333334444444455000001e50000000000006f60"), 0x02ab61), + (hex!("0122222222333333334444444455000001e60000000000001e70"), 0x02ac01), + (hex!("0122222222333333334444444455000001e60000000000006c80"), 0x02aca1), + (hex!("0122222222333333334444444455000001e70000000000001e80"), 0x02ad41), + (hex!("0122222222333333334444444455000001e80000000000001e90"), 0x02ade1), + (hex!("0122222222333333334444444455000001e80000000000004e30"), 0x02ae81), + (hex!("0122222222333333334444444455000001e90000000000001ea0"), 0x02af21), + (hex!("0122222222333333334444444455000001e90000000000005470"), 0x02afc1), + (hex!("0122222222333333334444444455000001ea0000000000001eb0"), 0x02b061), + (hex!("0122222222333333334444444455000001ea0000000000007980"), 0x02b101), + (hex!("0122222222333333334444444455000001eb0000000000001ec0"), 0x02b1a1), + 
(hex!("0122222222333333334444444455000001eb0000000000004390"), 0x02b241), + (hex!("0122222222333333334444444455000001eb0000000000005970"), 0x02b2e1), + (hex!("0122222222333333334444444455000001ec0000000000001ed0"), 0x02b381), + (hex!("0122222222333333334444444455000001ec0000000000005d50"), 0x02b421), + (hex!("0122222222333333334444444455000001ec00000000000076e0"), 0x02b4c1), + (hex!("0122222222333333334444444455000001ed0000000000001ee0"), 0x02b561), + (hex!("0122222222333333334444444455000001ed0000000000006190"), 0x02b601), + (hex!("0122222222333333334444444455000001ee0000000000001ef0"), 0x02b6a1), + (hex!("0122222222333333334444444455000001ee0000000000004900"), 0x02b741), + (hex!("0122222222333333334444444455000001ef0000000000001f00"), 0x02b7e1), + (hex!("0122222222333333334444444455000001ef0000000000006c60"), 0x02b881), + (hex!("0122222222333333334444444455000001f00000000000001f10"), 0x02b921), + (hex!("0122222222333333334444444455000001f00000000000006950"), 0x02b9c1), + (hex!("0122222222333333334444444455000001f10000000000001f20"), 0x02ba61), + (hex!("0122222222333333334444444455000001f10000000000006400"), 0x02bb01), + (hex!("0122222222333333334444444455000001f20000000000001f30"), 0x02bba1), + (hex!("0122222222333333334444444455000001f20000000000006f00"), 0x02bc41), + (hex!("0122222222333333334444444455000001f20000000000007b10"), 0x02bce1), + (hex!("0122222222333333334444444455000001f30000000000001f40"), 0x02bd81), + (hex!("0122222222333333334444444455000001f40000000000001f50"), 0x02be21), + (hex!("0122222222333333334444444455000001f50000000000001f60"), 0x02bec1), + (hex!("0122222222333333334444444455000001f500000000000044f0"), 0x02bf61), + (hex!("0122222222333333334444444455000001f60000000000001f70"), 0x02c001), + (hex!("0122222222333333334444444455000001f70000000000001f80"), 0x02c0a1), + (hex!("0122222222333333334444444455000001f70000000000004ad0"), 0x02c141), + (hex!("0122222222333333334444444455000001f80000000000001f90"), 0x02c1e1), + 
(hex!("0122222222333333334444444455000001f90000000000001fa0"), 0x02c281), + (hex!("0122222222333333334444444455000001f90000000000003f60"), 0x02c321), + (hex!("0122222222333333334444444455000001f90000000000004a80"), 0x02c3c1), + (hex!("0122222222333333334444444455000001fa0000000000001fb0"), 0x02c461), + (hex!("0122222222333333334444444455000001fa0000000000006f90"), 0x02c501), + (hex!("0122222222333333334444444455000001fb0000000000001fc0"), 0x02c5a1), + (hex!("0122222222333333334444444455000001fc0000000000001fd0"), 0x02c641), + (hex!("0122222222333333334444444455000001fc0000000000004a90"), 0x02c6e1), + (hex!("0122222222333333334444444455000001fd0000000000001fe0"), 0x02c781), + (hex!("0122222222333333334444444455000001fd0000000000005f50"), 0x02c821), + (hex!("0122222222333333334444444455000001fe0000000000001ff0"), 0x02c8c1), + (hex!("0122222222333333334444444455000001ff0000000000002000"), 0x02c961), + (hex!("0122222222333333334444444455000002000000000000002010"), 0x02ca01), + (hex!("0122222222333333334444444455000002000000000000005f00"), 0x02caa1), + (hex!("0122222222333333334444444455000002000000000000006840"), 0x02cb41), + (hex!("0122222222333333334444444455000002010000000000002020"), 0x02cbe1), + (hex!("0122222222333333334444444455000002020000000000002030"), 0x02cc81), + (hex!("0122222222333333334444444455000002030000000000002040"), 0x02cd21), + (hex!("0122222222333333334444444455000002040000000000002050"), 0x02cdc1), + (hex!("01222222223333333344444444550000020400000000000051f0"), 0x02ce61), + (hex!("0122222222333333334444444455000002050000000000002060"), 0x02cf01), + (hex!("0122222222333333334444444455000002060000000000002070"), 0x02cfa1), + (hex!("0122222222333333334444444455000002060000000000005c80"), 0x02d041), + (hex!("01222222223333333344444444550000020600000000000061d0"), 0x02d0e1), + (hex!("01222222223333333344444444550000020600000000000078c0"), 0x02d181), + (hex!("0122222222333333334444444455000002070000000000002080"), 0x02d221), + 
(hex!("0122222222333333334444444455000002070000000000006ba0"), 0x02d2c1), + (hex!("0122222222333333334444444455000002080000000000002090"), 0x02d361), + (hex!("01222222223333333344444444550000020900000000000020a0"), 0x02d401), + (hex!("01222222223333333344444444550000020900000000000067a0"), 0x02d4a1), + (hex!("01222222223333333344444444550000020a00000000000020b0"), 0x02d541), + (hex!("01222222223333333344444444550000020a0000000000004950"), 0x02d5e1), + (hex!("01222222223333333344444444550000020a0000000000004de0"), 0x02d681), + (hex!("01222222223333333344444444550000020b00000000000020c0"), 0x02d721), + (hex!("01222222223333333344444444550000020b0000000000004b00"), 0x02d7c1), + (hex!("01222222223333333344444444550000020c00000000000020d0"), 0x02d861), + (hex!("01222222223333333344444444550000020d00000000000020e0"), 0x02d901), + (hex!("01222222223333333344444444550000020e00000000000020f0"), 0x02d9a1), + (hex!("01222222223333333344444444550000020f0000000000002100"), 0x02da41), + (hex!("0122222222333333334444444455000002100000000000002110"), 0x02dae1), + (hex!("0122222222333333334444444455000002110000000000002120"), 0x02db81), + (hex!("0122222222333333334444444455000002110000000000004490"), 0x02dc21), + (hex!("0122222222333333334444444455000002120000000000002130"), 0x02dcc1), + (hex!("0122222222333333334444444455000002130000000000002140"), 0x02dd61), + (hex!("01222222223333333344444444550000021300000000000046d0"), 0x02de01), + (hex!("01222222223333333344444444550000021300000000000046e0"), 0x02dea1), + (hex!("0122222222333333334444444455000002130000000000004b70"), 0x02df41), + (hex!("0122222222333333334444444455000002140000000000002150"), 0x02dfe1), + (hex!("0122222222333333334444444455000002140000000000006c50"), 0x02e081), + (hex!("0122222222333333334444444455000002150000000000002160"), 0x02e121), + (hex!("01222222223333333344444444550000021500000000000043c0"), 0x02e1c1), + (hex!("0122222222333333334444444455000002160000000000002170"), 0x02e261), + 
(hex!("01222222223333333344444444550000021600000000000055b0"), 0x02e301), + (hex!("0122222222333333334444444455000002160000000000006150"), 0x02e3a1), + (hex!("0122222222333333334444444455000002170000000000002180"), 0x02e441), + (hex!("01222222223333333344444444550000021700000000000053b0"), 0x02e4e1), + (hex!("0122222222333333334444444455000002170000000000007460"), 0x02e581), + (hex!("0122222222333333334444444455000002180000000000002190"), 0x02e621), + (hex!("01222222223333333344444444550000021900000000000021a0"), 0x02e6c1), + (hex!("01222222223333333344444444550000021a00000000000021b0"), 0x02e761), + (hex!("01222222223333333344444444550000021a0000000000007650"), 0x02e801), + (hex!("01222222223333333344444444550000021b00000000000021c0"), 0x02e8a1), + (hex!("01222222223333333344444444550000021b0000000000004b20"), 0x02e941), + (hex!("01222222223333333344444444550000021c00000000000021d0"), 0x02e9e1), + (hex!("01222222223333333344444444550000021c0000000000007610"), 0x02ea81), + (hex!("01222222223333333344444444550000021d00000000000021e0"), 0x02eb21), + (hex!("01222222223333333344444444550000021d0000000000005f40"), 0x02ebc1), + (hex!("01222222223333333344444444550000021e00000000000021f0"), 0x02ec61), + (hex!("01222222223333333344444444550000021e0000000000005a50"), 0x02ed01), + (hex!("01222222223333333344444444550000021e0000000000005ff0"), 0x02eda1), + (hex!("01222222223333333344444444550000021f0000000000002200"), 0x02ee41), + (hex!("01222222223333333344444444550000021f00000000000043a0"), 0x02eee1), + (hex!("01222222223333333344444444550000021f0000000000004cb0"), 0x02ef81), + (hex!("01222222223333333344444444550000021f0000000000004e00"), 0x02f021), + (hex!("0122222222333333334444444455000002200000000000002210"), 0x02f0c1), + (hex!("0122222222333333334444444455000002210000000000002220"), 0x02f161), + (hex!("0122222222333333334444444455000002210000000000006290"), 0x02f201), + (hex!("0122222222333333334444444455000002210000000000007230"), 0x02f2a1), + 
(hex!("0122222222333333334444444455000002220000000000002230"), 0x02f341), + (hex!("0122222222333333334444444455000002220000000000006ea0"), 0x02f3e1), + (hex!("0122222222333333334444444455000002230000000000002240"), 0x02f481), + (hex!("0122222222333333334444444455000002230000000000004710"), 0x02f521), + (hex!("0122222222333333334444444455000002240000000000002250"), 0x02f5c1), + (hex!("0122222222333333334444444455000002250000000000002260"), 0x02f661), + (hex!("0122222222333333334444444455000002260000000000002270"), 0x02f701), + (hex!("0122222222333333334444444455000002260000000000005b40"), 0x02f7a1), + (hex!("0122222222333333334444444455000002260000000000006300"), 0x02f841), + (hex!("0122222222333333334444444455000002270000000000002280"), 0x02f8e1), + (hex!("0122222222333333334444444455000002270000000000005b80"), 0x02f981), + (hex!("0122222222333333334444444455000002280000000000002290"), 0x02fa21), + (hex!("0122222222333333334444444455000002280000000000003ed0"), 0x02fac1), + (hex!("0122222222333333334444444455000002280000000000004550"), 0x02fb61), + (hex!("01222222223333333344444444550000022800000000000077d0"), 0x02fc01), + (hex!("01222222223333333344444444550000022900000000000022a0"), 0x02fca1), + (hex!("0122222222333333334444444455000002290000000000006480"), 0x02fd41), + (hex!("01222222223333333344444444550000022a00000000000022b0"), 0x02fde1), + (hex!("01222222223333333344444444550000022a0000000000005450"), 0x02fe81), + (hex!("01222222223333333344444444550000022b00000000000022c0"), 0x02ff21), + (hex!("01222222223333333344444444550000022b0000000000006dd0"), 0x02ffc1), + (hex!("01222222223333333344444444550000022c00000000000022d0"), 0x030061), + (hex!("01222222223333333344444444550000022c0000000000006890"), 0x030101), + (hex!("01222222223333333344444444550000022d00000000000022e0"), 0x0301a1), + (hex!("01222222223333333344444444550000022e00000000000022f0"), 0x030241), + (hex!("01222222223333333344444444550000022e0000000000004f20"), 0x0302e1), + 
(hex!("01222222223333333344444444550000022f0000000000002300"), 0x030381), + (hex!("01222222223333333344444444550000022f0000000000005260"), 0x030421), + (hex!("01222222223333333344444444550000022f00000000000053f0"), 0x0304c1), + (hex!("0122222222333333334444444455000002300000000000002310"), 0x030561), + (hex!("01222222223333333344444444550000023000000000000050e0"), 0x030601), + (hex!("0122222222333333334444444455000002310000000000002320"), 0x0306a1), + (hex!("0122222222333333334444444455000002310000000000007800"), 0x030741), + (hex!("0122222222333333334444444455000002320000000000002330"), 0x0307e1), + (hex!("0122222222333333334444444455000002330000000000002340"), 0x030881), + (hex!("0122222222333333334444444455000002330000000000004d70"), 0x030921), + (hex!("0122222222333333334444444455000002330000000000005cf0"), 0x0309c1), + (hex!("0122222222333333334444444455000002340000000000002350"), 0x030a61), + (hex!("0122222222333333334444444455000002350000000000002360"), 0x030b01), + (hex!("0122222222333333334444444455000002350000000000006970"), 0x030ba1), + (hex!("0122222222333333334444444455000002360000000000002370"), 0x030c41), + (hex!("0122222222333333334444444455000002360000000000005270"), 0x030ce1), + (hex!("0122222222333333334444444455000002370000000000002380"), 0x030d81), + (hex!("0122222222333333334444444455000002370000000000005d70"), 0x030e21), + (hex!("0122222222333333334444444455000002380000000000002390"), 0x030ec1), + (hex!("01222222223333333344444444550000023800000000000069a0"), 0x030f61), + (hex!("01222222223333333344444444550000023900000000000023a0"), 0x031001), + (hex!("01222222223333333344444444550000023900000000000052e0"), 0x0310a1), + (hex!("0122222222333333334444444455000002390000000000005a10"), 0x031141), + (hex!("0122222222333333334444444455000002390000000000007440"), 0x0311e1), + (hex!("01222222223333333344444444550000023a00000000000023b0"), 0x031281), + (hex!("01222222223333333344444444550000023a0000000000003f00"), 0x031321), + 
(hex!("01222222223333333344444444550000023a0000000000004430"), 0x0313c1), + (hex!("01222222223333333344444444550000023a0000000000007070"), 0x031461), + (hex!("01222222223333333344444444550000023a00000000000074a0"), 0x031501), + (hex!("01222222223333333344444444550000023b00000000000023c0"), 0x0315a1), + (hex!("01222222223333333344444444550000023b0000000000004730"), 0x031641), + (hex!("01222222223333333344444444550000023b00000000000068b0"), 0x0316e1), + (hex!("01222222223333333344444444550000023c00000000000023d0"), 0x031781), + (hex!("01222222223333333344444444550000023c0000000000004680"), 0x031821), + (hex!("01222222223333333344444444550000023d00000000000023e0"), 0x0318c1), + (hex!("01222222223333333344444444550000023d00000000000059a0"), 0x031961), + (hex!("01222222223333333344444444550000023e00000000000023f0"), 0x031a01), + (hex!("01222222223333333344444444550000023f0000000000002400"), 0x031aa1), + (hex!("0122222222333333334444444455000002400000000000002410"), 0x031b41), + (hex!("0122222222333333334444444455000002400000000000004920"), 0x031be1), + (hex!("01222222223333333344444444550000024000000000000066e0"), 0x031c81), + (hex!("01222222223333333344444444550000024000000000000076f0"), 0x031d21), + (hex!("01222222223333333344444444550000024000000000000078e0"), 0x031dc1), + (hex!("0122222222333333334444444455000002410000000000002420"), 0x031e61), + (hex!("0122222222333333334444444455000002420000000000002430"), 0x031f01), + (hex!("0122222222333333334444444455000002420000000000006590"), 0x031fa1), + (hex!("0122222222333333334444444455000002430000000000002440"), 0x032041), + (hex!("0122222222333333334444444455000002430000000000004d00"), 0x0320e1), + (hex!("0122222222333333334444444455000002440000000000002450"), 0x032181), + (hex!("0122222222333333334444444455000002440000000000005f80"), 0x032221), + (hex!("0122222222333333334444444455000002450000000000002460"), 0x0322c1), + (hex!("0122222222333333334444444455000002450000000000004940"), 0x032361), + 
(hex!("0122222222333333334444444455000002460000000000002470"), 0x032401), + (hex!("0122222222333333334444444455000002470000000000002480"), 0x0324a1), + (hex!("0122222222333333334444444455000002470000000000004dd0"), 0x032541), + (hex!("0122222222333333334444444455000002470000000000005930"), 0x0325e1), + (hex!("01222222223333333344444444550000024700000000000061b0"), 0x032681), + (hex!("0122222222333333334444444455000002470000000000007740"), 0x032721), + (hex!("0122222222333333334444444455000002480000000000002490"), 0x0327c1), + (hex!("0122222222333333334444444455000002480000000000004890"), 0x032861), + (hex!("01222222223333333344444444550000024900000000000024a0"), 0x032901), + (hex!("01222222223333333344444444550000024a00000000000024b0"), 0x0329a1), + (hex!("01222222223333333344444444550000024b00000000000024c0"), 0x032a41), + (hex!("01222222223333333344444444550000024c00000000000024d0"), 0x032ae1), + (hex!("01222222223333333344444444550000024d00000000000024e0"), 0x032b81), + (hex!("01222222223333333344444444550000024d0000000000004070"), 0x032c21), + (hex!("01222222223333333344444444550000024e00000000000024f0"), 0x032cc1), + (hex!("01222222223333333344444444550000024e00000000000066a0"), 0x032d61), + (hex!("01222222223333333344444444550000024e0000000000006ab0"), 0x032e01), + (hex!("01222222223333333344444444550000024f0000000000002500"), 0x032ea1), + (hex!("0122222222333333334444444455000002500000000000002510"), 0x032f41), + (hex!("0122222222333333334444444455000002510000000000002520"), 0x032fe1), + (hex!("0122222222333333334444444455000002510000000000007320"), 0x033081), + (hex!("0122222222333333334444444455000002520000000000002530"), 0x033121), + (hex!("0122222222333333334444444455000002520000000000006410"), 0x0331c1), + (hex!("0122222222333333334444444455000002530000000000002540"), 0x033261), + (hex!("0122222222333333334444444455000002530000000000005110"), 0x033301), + (hex!("0122222222333333334444444455000002540000000000002550"), 0x0333a1), + 
(hex!("01222222223333333344444444550000025400000000000040c0"), 0x033441), + (hex!("0122222222333333334444444455000002540000000000006a40"), 0x0334e1), + (hex!("0122222222333333334444444455000002550000000000002560"), 0x033581), + (hex!("0122222222333333334444444455000002550000000000005190"), 0x033621), + (hex!("0122222222333333334444444455000002560000000000002570"), 0x0336c1), + (hex!("01222222223333333344444444550000025600000000000061f0"), 0x033761), + (hex!("0122222222333333334444444455000002570000000000002580"), 0x033801), + (hex!("0122222222333333334444444455000002580000000000002590"), 0x0338a1), + (hex!("01222222223333333344444444550000025800000000000043d0"), 0x033941), + (hex!("01222222223333333344444444550000025900000000000025a0"), 0x0339e1), + (hex!("0122222222333333334444444455000002590000000000006bb0"), 0x033a81), + (hex!("01222222223333333344444444550000025a00000000000025b0"), 0x033b21), + (hex!("01222222223333333344444444550000025a0000000000005fb0"), 0x033bc1), + (hex!("01222222223333333344444444550000025a00000000000064c0"), 0x033c61), + (hex!("01222222223333333344444444550000025b00000000000025c0"), 0x033d01), + (hex!("01222222223333333344444444550000025b0000000000005c10"), 0x033da1), + (hex!("01222222223333333344444444550000025c00000000000025d0"), 0x033e41), + (hex!("01222222223333333344444444550000025c0000000000007d00"), 0x033ee1), + (hex!("01222222223333333344444444550000025d00000000000025e0"), 0x033f81), + (hex!("01222222223333333344444444550000025e00000000000025f0"), 0x034021), + (hex!("01222222223333333344444444550000025e00000000000045e0"), 0x0340c1), + (hex!("01222222223333333344444444550000025e0000000000006ee0"), 0x034161), + (hex!("01222222223333333344444444550000025f0000000000002600"), 0x034201), + (hex!("01222222223333333344444444550000025f00000000000050b0"), 0x0342a1), + (hex!("01222222223333333344444444550000025f0000000000007690"), 0x034341), + (hex!("0122222222333333334444444455000002600000000000002610"), 0x0343e1), + 
(hex!("0122222222333333334444444455000002600000000000007b60"), 0x034481), + (hex!("0122222222333333334444444455000002610000000000002620"), 0x034521), + (hex!("0122222222333333334444444455000002620000000000002630"), 0x0345c1), + (hex!("0122222222333333334444444455000002630000000000002640"), 0x034661), + (hex!("0122222222333333334444444455000002640000000000002650"), 0x034701), + (hex!("0122222222333333334444444455000002650000000000002660"), 0x0347a1), + (hex!("0122222222333333334444444455000002650000000000006180"), 0x034841), + (hex!("0122222222333333334444444455000002660000000000002670"), 0x0348e1), + (hex!("0122222222333333334444444455000002660000000000005430"), 0x034981), + (hex!("0122222222333333334444444455000002660000000000007a60"), 0x034a21), + (hex!("0122222222333333334444444455000002670000000000002680"), 0x034ac1), + (hex!("01222222223333333344444444550000026700000000000077f0"), 0x034b61), + (hex!("0122222222333333334444444455000002680000000000002690"), 0x034c01), + (hex!("01222222223333333344444444550000026900000000000026a0"), 0x034ca1), + (hex!("01222222223333333344444444550000026a00000000000026b0"), 0x034d41), + (hex!("01222222223333333344444444550000026a0000000000007530"), 0x034de1), + (hex!("01222222223333333344444444550000026b00000000000026c0"), 0x034e81), + (hex!("01222222223333333344444444550000026b00000000000058b0"), 0x034f21), + (hex!("01222222223333333344444444550000026b00000000000066b0"), 0x034fc1), + (hex!("01222222223333333344444444550000026b0000000000006b10"), 0x035061), + (hex!("01222222223333333344444444550000026c00000000000026d0"), 0x035101), + (hex!("01222222223333333344444444550000026d00000000000026e0"), 0x0351a1), + (hex!("01222222223333333344444444550000026d0000000000004210"), 0x035241), + (hex!("01222222223333333344444444550000026d0000000000005490"), 0x0352e1), + (hex!("01222222223333333344444444550000026d0000000000005e60"), 0x035381), + (hex!("01222222223333333344444444550000026d00000000000068e0"), 0x035421), + 
(hex!("01222222223333333344444444550000026d0000000000007020"), 0x0354c1), + (hex!("01222222223333333344444444550000026d0000000000007300"), 0x035561), + (hex!("01222222223333333344444444550000026e00000000000026f0"), 0x035601), + (hex!("01222222223333333344444444550000026f0000000000002700"), 0x0356a1), + (hex!("01222222223333333344444444550000026f0000000000004910"), 0x035741), + (hex!("0122222222333333334444444455000002700000000000002710"), 0x0357e1), + (hex!("0122222222333333334444444455000002710000000000002720"), 0x035881), + (hex!("01222222223333333344444444550000027100000000000050c0"), 0x035921), + (hex!("0122222222333333334444444455000002720000000000002730"), 0x0359c1), + (hex!("0122222222333333334444444455000002730000000000002740"), 0x035a61), + (hex!("0122222222333333334444444455000002740000000000002750"), 0x035b01), + (hex!("0122222222333333334444444455000002740000000000007490"), 0x035ba1), + (hex!("0122222222333333334444444455000002750000000000002760"), 0x035c41), + (hex!("0122222222333333334444444455000002760000000000002770"), 0x035ce1), + (hex!("0122222222333333334444444455000002760000000000004790"), 0x035d81), + (hex!("0122222222333333334444444455000002770000000000002780"), 0x035e21), + (hex!("01222222223333333344444444550000027700000000000050a0"), 0x035ec1), + (hex!("0122222222333333334444444455000002780000000000002790"), 0x035f61), + (hex!("0122222222333333334444444455000002780000000000004330"), 0x036001), + (hex!("0122222222333333334444444455000002780000000000006b00"), 0x0360a1), + (hex!("01222222223333333344444444550000027900000000000027a0"), 0x036141), + (hex!("01222222223333333344444444550000027a00000000000027b0"), 0x0361e1), + (hex!("01222222223333333344444444550000027b00000000000027c0"), 0x036281), + (hex!("01222222223333333344444444550000027b0000000000004930"), 0x036321), + (hex!("01222222223333333344444444550000027b0000000000006250"), 0x0363c1), + (hex!("01222222223333333344444444550000027c00000000000027d0"), 0x036461), + 
(hex!("01222222223333333344444444550000027d00000000000027e0"), 0x036501), + (hex!("01222222223333333344444444550000027d0000000000005ce0"), 0x0365a1), + (hex!("01222222223333333344444444550000027d0000000000005fe0"), 0x036641), + (hex!("01222222223333333344444444550000027e00000000000027f0"), 0x0366e1), + (hex!("01222222223333333344444444550000027f0000000000002800"), 0x036781), + (hex!("01222222223333333344444444550000027f0000000000003e90"), 0x036821), + (hex!("01222222223333333344444444550000027f0000000000007910"), 0x0368c1), + (hex!("0122222222333333334444444455000002800000000000002810"), 0x036961), + (hex!("0122222222333333334444444455000002800000000000004990"), 0x036a01), + (hex!("0122222222333333334444444455000002800000000000006160"), 0x036aa1), + (hex!("0122222222333333334444444455000002800000000000006740"), 0x036b41), + (hex!("0122222222333333334444444455000002810000000000002820"), 0x036be1), + (hex!("0122222222333333334444444455000002820000000000002830"), 0x036c81), + (hex!("0122222222333333334444444455000002820000000000005170"), 0x036d21), + (hex!("0122222222333333334444444455000002830000000000002840"), 0x036dc1), + (hex!("0122222222333333334444444455000002840000000000002850"), 0x036e61), + (hex!("0122222222333333334444444455000002840000000000004810"), 0x036f01), + (hex!("0122222222333333334444444455000002840000000000006aa0"), 0x036fa1), + (hex!("0122222222333333334444444455000002850000000000002860"), 0x037041), + (hex!("0122222222333333334444444455000002860000000000002870"), 0x0370e1), + (hex!("0122222222333333334444444455000002860000000000005080"), 0x037181), + (hex!("0122222222333333334444444455000002870000000000002880"), 0x037221), + (hex!("0122222222333333334444444455000002870000000000004e60"), 0x0372c1), + (hex!("0122222222333333334444444455000002880000000000002890"), 0x037361), + (hex!("0122222222333333334444444455000002880000000000005060"), 0x037401), + (hex!("0122222222333333334444444455000002880000000000006f20"), 0x0374a1), + 
(hex!("01222222223333333344444444550000028900000000000028a0"), 0x037541), + (hex!("01222222223333333344444444550000028900000000000047e0"), 0x0375e1), + (hex!("01222222223333333344444444550000028a00000000000028b0"), 0x037681), + (hex!("01222222223333333344444444550000028a0000000000005ab0"), 0x037721), + (hex!("01222222223333333344444444550000028a0000000000007130"), 0x0377c1), + (hex!("01222222223333333344444444550000028a0000000000007660"), 0x037861), + (hex!("01222222223333333344444444550000028b00000000000028c0"), 0x037901), + (hex!("01222222223333333344444444550000028b00000000000054e0"), 0x0379a1), + (hex!("01222222223333333344444444550000028c00000000000028d0"), 0x037a41), + (hex!("01222222223333333344444444550000028c00000000000046f0"), 0x037ae1), + (hex!("01222222223333333344444444550000028c00000000000061a0"), 0x037b81), + (hex!("01222222223333333344444444550000028d00000000000028e0"), 0x037c21), + (hex!("01222222223333333344444444550000028e00000000000028f0"), 0x037cc1), + (hex!("01222222223333333344444444550000028e0000000000004130"), 0x037d61), + (hex!("01222222223333333344444444550000028f0000000000002900"), 0x037e01), + (hex!("01222222223333333344444444550000028f0000000000007510"), 0x037ea1), + (hex!("0122222222333333334444444455000002900000000000002910"), 0x037f41), + (hex!("0122222222333333334444444455000002900000000000004a40"), 0x037fe1), + (hex!("0122222222333333334444444455000002910000000000002920"), 0x038081), + (hex!("0122222222333333334444444455000002920000000000002930"), 0x038121), + (hex!("0122222222333333334444444455000002920000000000004e90"), 0x0381c1), + (hex!("0122222222333333334444444455000002930000000000002940"), 0x038261), + (hex!("0122222222333333334444444455000002930000000000006880"), 0x038301), + (hex!("0122222222333333334444444455000002940000000000002950"), 0x0383a1), + (hex!("0122222222333333334444444455000002940000000000007bc0"), 0x038441), + (hex!("0122222222333333334444444455000002950000000000002960"), 0x0384e1), + 
(hex!("0122222222333333334444444455000002960000000000002970"), 0x038581), + (hex!("01222222223333333344444444550000029600000000000059d0"), 0x038621), + (hex!("0122222222333333334444444455000002970000000000002980"), 0x0386c1), + (hex!("0122222222333333334444444455000002970000000000004a50"), 0x038761), + (hex!("0122222222333333334444444455000002970000000000005f20"), 0x038801), + (hex!("01222222223333333344444444550000029700000000000068d0"), 0x0388a1), + (hex!("0122222222333333334444444455000002980000000000002990"), 0x038941), + (hex!("0122222222333333334444444455000002980000000000004370"), 0x0389e1), + (hex!("0122222222333333334444444455000002980000000000004420"), 0x038a81), + (hex!("01222222223333333344444444550000029900000000000029a0"), 0x038b21), + (hex!("01222222223333333344444444550000029a00000000000029b0"), 0x038bc1), + (hex!("01222222223333333344444444550000029a0000000000006010"), 0x038c61), + (hex!("01222222223333333344444444550000029a0000000000006980"), 0x038d01), + (hex!("01222222223333333344444444550000029b00000000000029c0"), 0x038da1), + (hex!("01222222223333333344444444550000029c00000000000029d0"), 0x038e41), + (hex!("01222222223333333344444444550000029c0000000000007480"), 0x038ee1), + (hex!("01222222223333333344444444550000029d00000000000029e0"), 0x038f81), + (hex!("01222222223333333344444444550000029d0000000000005030"), 0x039021), + (hex!("01222222223333333344444444550000029d0000000000007780"), 0x0390c1), + (hex!("01222222223333333344444444550000029d0000000000007a50"), 0x039161), + (hex!("01222222223333333344444444550000029e00000000000029f0"), 0x039201), + (hex!("01222222223333333344444444550000029e00000000000074b0"), 0x0392a1), + (hex!("01222222223333333344444444550000029f0000000000002a00"), 0x039341), + (hex!("0122222222333333334444444455000002a00000000000002a10"), 0x0393e1), + (hex!("0122222222333333334444444455000002a10000000000002a20"), 0x039481), + (hex!("0122222222333333334444444455000002a20000000000002a30"), 0x039521), + 
(hex!("0122222222333333334444444455000002a20000000000004c50"), 0x0395c1), + (hex!("0122222222333333334444444455000002a20000000000006f10"), 0x039661), + (hex!("0122222222333333334444444455000002a30000000000002a40"), 0x039701), + (hex!("0122222222333333334444444455000002a40000000000002a50"), 0x0397a1), + (hex!("0122222222333333334444444455000002a40000000000005d60"), 0x039841), + (hex!("0122222222333333334444444455000002a50000000000002a60"), 0x0398e1), + (hex!("0122222222333333334444444455000002a50000000000005440"), 0x039981), + (hex!("0122222222333333334444444455000002a50000000000005890"), 0x039a21), + (hex!("0122222222333333334444444455000002a60000000000002a70"), 0x039ac1), + (hex!("0122222222333333334444444455000002a70000000000002a80"), 0x039b61), + (hex!("0122222222333333334444444455000002a700000000000054a0"), 0x039c01), + (hex!("0122222222333333334444444455000002a70000000000007280"), 0x039ca1), + (hex!("0122222222333333334444444455000002a80000000000002a90"), 0x039d41), + (hex!("0122222222333333334444444455000002a90000000000002aa0"), 0x039de1), + (hex!("0122222222333333334444444455000002aa0000000000002ab0"), 0x039e81), + (hex!("0122222222333333334444444455000002ab0000000000002ac0"), 0x039f21), + (hex!("0122222222333333334444444455000002ab0000000000006c90"), 0x039fc1), + (hex!("0122222222333333334444444455000002ac0000000000002ad0"), 0x03a061), + (hex!("0122222222333333334444444455000002ac0000000000006db0"), 0x03a101), + (hex!("0122222222333333334444444455000002ad0000000000002ae0"), 0x03a1a1), + (hex!("0122222222333333334444444455000002ad00000000000065e0"), 0x03a241), + (hex!("0122222222333333334444444455000002ad0000000000007b40"), 0x03a2e1), + (hex!("0122222222333333334444444455000002ae0000000000002af0"), 0x03a381), + (hex!("0122222222333333334444444455000002ae0000000000004d20"), 0x03a421), + (hex!("0122222222333333334444444455000002ae0000000000006f30"), 0x03a4c1), + (hex!("0122222222333333334444444455000002af0000000000002b00"), 0x03a561), + 
(hex!("0122222222333333334444444455000002b00000000000002b10"), 0x03a601), + (hex!("0122222222333333334444444455000002b00000000000004560"), 0x03a6a1), + (hex!("0122222222333333334444444455000002b00000000000005800"), 0x03a741), + (hex!("0122222222333333334444444455000002b00000000000005a60"), 0x03a7e1), + (hex!("0122222222333333334444444455000002b10000000000002b20"), 0x03a881), + (hex!("0122222222333333334444444455000002b10000000000007b30"), 0x03a921), + (hex!("0122222222333333334444444455000002b20000000000002b30"), 0x03a9c1), + (hex!("0122222222333333334444444455000002b20000000000004440"), 0x03aa61), + (hex!("0122222222333333334444444455000002b20000000000004f80"), 0x03ab01), + (hex!("0122222222333333334444444455000002b20000000000005020"), 0x03aba1), + (hex!("0122222222333333334444444455000002b30000000000002b40"), 0x03ac41), + (hex!("0122222222333333334444444455000002b40000000000002b50"), 0x03ace1), + (hex!("0122222222333333334444444455000002b50000000000002b60"), 0x03ad81), + (hex!("0122222222333333334444444455000002b500000000000059e0"), 0x03ae21), + (hex!("0122222222333333334444444455000002b60000000000002b70"), 0x03aec1), + (hex!("0122222222333333334444444455000002b70000000000002b80"), 0x03af61), + (hex!("0122222222333333334444444455000002b80000000000002b90"), 0x03b001), + (hex!("0122222222333333334444444455000002b80000000000004590"), 0x03b0a1), + (hex!("0122222222333333334444444455000002b800000000000047d0"), 0x03b141), + (hex!("0122222222333333334444444455000002b80000000000006030"), 0x03b1e1), + (hex!("0122222222333333334444444455000002b80000000000006a20"), 0x03b281), + (hex!("0122222222333333334444444455000002b80000000000006a90"), 0x03b321), + (hex!("0122222222333333334444444455000002b90000000000002ba0"), 0x03b3c1), + (hex!("0122222222333333334444444455000002ba0000000000002bb0"), 0x03b461), + (hex!("0122222222333333334444444455000002ba0000000000006e80"), 0x03b501), + (hex!("0122222222333333334444444455000002bb0000000000002bc0"), 0x03b5a1), + 
(hex!("0122222222333333334444444455000002bc0000000000002bd0"), 0x03b641), + (hex!("0122222222333333334444444455000002bc0000000000004b30"), 0x03b6e1), + (hex!("0122222222333333334444444455000002bd0000000000002be0"), 0x03b781), + (hex!("0122222222333333334444444455000002bd0000000000005e10"), 0x03b821), + (hex!("0122222222333333334444444455000002be0000000000002bf0"), 0x03b8c1), + (hex!("0122222222333333334444444455000002bf0000000000002c00"), 0x03b961), + (hex!("0122222222333333334444444455000002c00000000000002c10"), 0x03ba01), + (hex!("0122222222333333334444444455000002c10000000000002c20"), 0x03baa1), + (hex!("0122222222333333334444444455000002c10000000000003ef0"), 0x03bb41), + (hex!("0122222222333333334444444455000002c20000000000002c30"), 0x03bbe1), + (hex!("0122222222333333334444444455000002c200000000000056e0"), 0x03bc81), + (hex!("0122222222333333334444444455000002c30000000000002c40"), 0x03bd21), + (hex!("0122222222333333334444444455000002c30000000000004b60"), 0x03bdc1), + (hex!("0122222222333333334444444455000002c40000000000002c50"), 0x03be61), + (hex!("0122222222333333334444444455000002c400000000000045f0"), 0x03bf01), + (hex!("0122222222333333334444444455000002c40000000000005290"), 0x03bfa1), + (hex!("0122222222333333334444444455000002c50000000000002c60"), 0x03c041), + (hex!("0122222222333333334444444455000002c60000000000002c70"), 0x03c0e1), + (hex!("0122222222333333334444444455000002c60000000000006ae0"), 0x03c181), + (hex!("0122222222333333334444444455000002c70000000000002c80"), 0x03c221), + (hex!("0122222222333333334444444455000002c70000000000005680"), 0x03c2c1), + (hex!("0122222222333333334444444455000002c70000000000006e10"), 0x03c361), + (hex!("0122222222333333334444444455000002c80000000000002c90"), 0x03c401), + (hex!("0122222222333333334444444455000002c90000000000002ca0"), 0x03c4a1), + (hex!("0122222222333333334444444455000002ca0000000000002cb0"), 0x03c541), + (hex!("0122222222333333334444444455000002cb0000000000002cc0"), 0x03c5e1), + 
(hex!("0122222222333333334444444455000002cc0000000000002cd0"), 0x03c681), + (hex!("0122222222333333334444444455000002cc0000000000005b50"), 0x03c721), + (hex!("0122222222333333334444444455000002cd0000000000002ce0"), 0x03c7c1), + (hex!("0122222222333333334444444455000002ce0000000000002cf0"), 0x03c861), + (hex!("0122222222333333334444444455000002ce00000000000043f0"), 0x03c901), + (hex!("0122222222333333334444444455000002ce0000000000006420"), 0x03c9a1), + (hex!("0122222222333333334444444455000002cf0000000000002d00"), 0x03ca41), + (hex!("0122222222333333334444444455000002d00000000000002d10"), 0x03cae1), + (hex!("0122222222333333334444444455000002d10000000000002d20"), 0x03cb81), + (hex!("0122222222333333334444444455000002d10000000000005370"), 0x03cc21), + (hex!("0122222222333333334444444455000002d20000000000002d30"), 0x03ccc1), + (hex!("0122222222333333334444444455000002d20000000000005ef0"), 0x03cd61), + (hex!("0122222222333333334444444455000002d20000000000006570"), 0x03ce01), + (hex!("0122222222333333334444444455000002d30000000000002d40"), 0x03cea1), + (hex!("0122222222333333334444444455000002d30000000000007360"), 0x03cf41), + (hex!("0122222222333333334444444455000002d40000000000002d50"), 0x03cfe1), + (hex!("0122222222333333334444444455000002d400000000000079a0"), 0x03d081), + (hex!("0122222222333333334444444455000002d50000000000002d60"), 0x03d121), + (hex!("0122222222333333334444444455000002d50000000000004250"), 0x03d1c1), + (hex!("0122222222333333334444444455000002d50000000000006050"), 0x03d261), + (hex!("0122222222333333334444444455000002d60000000000002d70"), 0x03d301), + (hex!("0122222222333333334444444455000002d60000000000007080"), 0x03d3a1), + (hex!("0122222222333333334444444455000002d70000000000002d80"), 0x03d441), + (hex!("0122222222333333334444444455000002d80000000000002d90"), 0x03d4e1), + (hex!("0122222222333333334444444455000002d80000000000007110"), 0x03d581), + (hex!("0122222222333333334444444455000002d800000000000073c0"), 0x03d621), + 
(hex!("0122222222333333334444444455000002d800000000000075a0"), 0x03d6c1), + (hex!("0122222222333333334444444455000002d90000000000002da0"), 0x03d761), + (hex!("0122222222333333334444444455000002d90000000000004860"), 0x03d801), + (hex!("0122222222333333334444444455000002d90000000000006b60"), 0x03d8a1), + (hex!("0122222222333333334444444455000002da0000000000002db0"), 0x03d941), + (hex!("0122222222333333334444444455000002da0000000000006630"), 0x03d9e1), + (hex!("0122222222333333334444444455000002db0000000000002dc0"), 0x03da81), + (hex!("0122222222333333334444444455000002dc0000000000002dd0"), 0x03db21), + (hex!("0122222222333333334444444455000002dc0000000000004830"), 0x03dbc1), + (hex!("0122222222333333334444444455000002dd0000000000002de0"), 0x03dc61), + (hex!("0122222222333333334444444455000002de0000000000002df0"), 0x03dd01), + (hex!("0122222222333333334444444455000002de0000000000004f00"), 0x03dda1), + (hex!("0122222222333333334444444455000002df0000000000002e00"), 0x03de41), + (hex!("0122222222333333334444444455000002e00000000000002e10"), 0x03dee1), + (hex!("0122222222333333334444444455000002e10000000000002e20"), 0x03df81), + (hex!("0122222222333333334444444455000002e10000000000006e90"), 0x03e021), + (hex!("0122222222333333334444444455000002e20000000000002e30"), 0x03e0c1), + (hex!("0122222222333333334444444455000002e200000000000053e0"), 0x03e161), + (hex!("0122222222333333334444444455000002e30000000000002e40"), 0x03e201), + (hex!("0122222222333333334444444455000002e30000000000006020"), 0x03e2a1), + (hex!("0122222222333333334444444455000002e30000000000006540"), 0x03e341), + (hex!("0122222222333333334444444455000002e40000000000002e50"), 0x03e3e1), + (hex!("0122222222333333334444444455000002e50000000000002e60"), 0x03e481), + (hex!("0122222222333333334444444455000002e50000000000005180"), 0x03e521), + (hex!("0122222222333333334444444455000002e50000000000007bf0"), 0x03e5c1), + (hex!("0122222222333333334444444455000002e60000000000002e70"), 0x03e661), + 
(hex!("0122222222333333334444444455000002e60000000000005350"), 0x03e701), + (hex!("0122222222333333334444444455000002e60000000000007960"), 0x03e7a1), + (hex!("0122222222333333334444444455000002e70000000000002e80"), 0x03e841), + (hex!("0122222222333333334444444455000002e80000000000002e90"), 0x03e8e1), + (hex!("0122222222333333334444444455000002e90000000000002ea0"), 0x03e981), + (hex!("0122222222333333334444444455000002ea0000000000002eb0"), 0x03ea21), + (hex!("0122222222333333334444444455000002eb0000000000002ec0"), 0x03eac1), + (hex!("0122222222333333334444444455000002ec0000000000002ed0"), 0x03eb61), + (hex!("0122222222333333334444444455000002ec0000000000006c10"), 0x03ec01), + (hex!("0122222222333333334444444455000002ed0000000000002ee0"), 0x03eca1), + (hex!("0122222222333333334444444455000002ed0000000000005590"), 0x03ed41), + (hex!("0122222222333333334444444455000002ed0000000000005cd0"), 0x03ede1), + (hex!("0122222222333333334444444455000002ed0000000000006910"), 0x03ee81), + (hex!("0122222222333333334444444455000002ee0000000000002ef0"), 0x03ef21), + (hex!("0122222222333333334444444455000002ef0000000000002f00"), 0x03efc1), + (hex!("0122222222333333334444444455000002ef0000000000004ed0"), 0x03f061), + (hex!("0122222222333333334444444455000002f00000000000002f10"), 0x03f101), + (hex!("0122222222333333334444444455000002f00000000000004cf0"), 0x03f1a1), + (hex!("0122222222333333334444444455000002f00000000000005d10"), 0x03f241), + (hex!("0122222222333333334444444455000002f00000000000006860"), 0x03f2e1), + (hex!("0122222222333333334444444455000002f00000000000006b50"), 0x03f381), + (hex!("0122222222333333334444444455000002f00000000000007100"), 0x03f421), + (hex!("0122222222333333334444444455000002f00000000000007aa0"), 0x03f4c1), + (hex!("0122222222333333334444444455000002f10000000000002f20"), 0x03f561), + (hex!("0122222222333333334444444455000002f20000000000002f30"), 0x03f601), + (hex!("0122222222333333334444444455000002f200000000000044b0"), 0x03f6a1), + 
(hex!("0122222222333333334444444455000002f30000000000002f40"), 0x03f741), + (hex!("0122222222333333334444444455000002f300000000000075b0"), 0x03f7e1), + (hex!("0122222222333333334444444455000002f40000000000002f50"), 0x03f881), + (hex!("0122222222333333334444444455000002f400000000000060f0"), 0x03f921), + (hex!("0122222222333333334444444455000002f50000000000002f60"), 0x03f9c1), + (hex!("0122222222333333334444444455000002f50000000000007210"), 0x03fa61), + (hex!("0122222222333333334444444455000002f60000000000002f70"), 0x03fb01), + (hex!("0122222222333333334444444455000002f60000000000006610"), 0x03fba1), + (hex!("0122222222333333334444444455000002f70000000000002f80"), 0x03fc41), + (hex!("0122222222333333334444444455000002f70000000000007560"), 0x03fce1), + (hex!("0122222222333333334444444455000002f80000000000002f90"), 0x03fd81), + (hex!("0122222222333333334444444455000002f80000000000006320"), 0x03fe21), + (hex!("0122222222333333334444444455000002f90000000000002fa0"), 0x03fec1), + (hex!("0122222222333333334444444455000002f90000000000006e50"), 0x03ff61), + (hex!("0122222222333333334444444455000002fa0000000000002fb0"), 0x040001), + (hex!("0122222222333333334444444455000002fb0000000000002fc0"), 0x0400a1), + (hex!("0122222222333333334444444455000002fb0000000000004780"), 0x040141), + (hex!("0122222222333333334444444455000002fc0000000000002fd0"), 0x0401e1), + (hex!("0122222222333333334444444455000002fd0000000000002fe0"), 0x040281), + (hex!("0122222222333333334444444455000002fd0000000000005600"), 0x040321), + (hex!("0122222222333333334444444455000002fd0000000000006c00"), 0x0403c1), + (hex!("0122222222333333334444444455000002fe0000000000002ff0"), 0x040461), + (hex!("0122222222333333334444444455000002ff0000000000003000"), 0x040501), + (hex!("0122222222333333334444444455000003000000000000003010"), 0x0405a1), + (hex!("0122222222333333334444444455000003000000000000004080"), 0x040641), + (hex!("0122222222333333334444444455000003010000000000003020"), 0x0406e1), + 
(hex!("0122222222333333334444444455000003010000000000006340"), 0x040781), + (hex!("0122222222333333334444444455000003020000000000003030"), 0x040821), + (hex!("0122222222333333334444444455000003020000000000005b00"), 0x0408c1), + (hex!("0122222222333333334444444455000003020000000000007b20"), 0x040961), + (hex!("0122222222333333334444444455000003030000000000003040"), 0x040a01), + (hex!("01222222223333333344444444550000030300000000000056b0"), 0x040aa1), + (hex!("0122222222333333334444444455000003030000000000006280"), 0x040b41), + (hex!("0122222222333333334444444455000003030000000000007ad0"), 0x040be1), + (hex!("0122222222333333334444444455000003040000000000003050"), 0x040c81), + (hex!("0122222222333333334444444455000003040000000000005c50"), 0x040d21), + (hex!("0122222222333333334444444455000003050000000000003060"), 0x040dc1), + (hex!("01222222223333333344444444550000030500000000000072e0"), 0x040e61), + (hex!("0122222222333333334444444455000003060000000000003070"), 0x040f01), + (hex!("0122222222333333334444444455000003060000000000004360"), 0x040fa1), + (hex!("0122222222333333334444444455000003060000000000004380"), 0x041041), + (hex!("0122222222333333334444444455000003060000000000004820"), 0x0410e1), + (hex!("0122222222333333334444444455000003060000000000006d10"), 0x041181), + (hex!("0122222222333333334444444455000003070000000000003080"), 0x041221), + (hex!("0122222222333333334444444455000003070000000000004450"), 0x0412c1), + (hex!("0122222222333333334444444455000003080000000000003090"), 0x041361), + (hex!("0122222222333333334444444455000003080000000000005ad0"), 0x041401), + (hex!("01222222223333333344444444550000030900000000000030a0"), 0x0414a1), + (hex!("01222222223333333344444444550000030a00000000000030b0"), 0x041541), + (hex!("01222222223333333344444444550000030a0000000000007760"), 0x0415e1), + (hex!("01222222223333333344444444550000030b00000000000030c0"), 0x041681), + (hex!("01222222223333333344444444550000030b0000000000007a80"), 0x041721), + 
(hex!("01222222223333333344444444550000030c00000000000030d0"), 0x0417c1), + (hex!("01222222223333333344444444550000030d00000000000030e0"), 0x041861), + (hex!("01222222223333333344444444550000030d0000000000003eb0"), 0x041901), + (hex!("01222222223333333344444444550000030e00000000000030f0"), 0x0419a1), + (hex!("01222222223333333344444444550000030f0000000000003100"), 0x041a41), + (hex!("01222222223333333344444444550000030f0000000000004690"), 0x041ae1), + (hex!("01222222223333333344444444550000030f0000000000006900"), 0x041b81), + (hex!("0122222222333333334444444455000003100000000000003110"), 0x041c21), + (hex!("01222222223333333344444444550000031000000000000058a0"), 0x041cc1), + (hex!("0122222222333333334444444455000003110000000000003120"), 0x041d61), + (hex!("0122222222333333334444444455000003110000000000004200"), 0x041e01), + (hex!("0122222222333333334444444455000003120000000000003130"), 0x041ea1), + (hex!("0122222222333333334444444455000003130000000000003140"), 0x041f41), + (hex!("0122222222333333334444444455000003130000000000004d50"), 0x041fe1), + (hex!("0122222222333333334444444455000003130000000000005400"), 0x042081), + (hex!("0122222222333333334444444455000003130000000000005520"), 0x042121), + (hex!("0122222222333333334444444455000003140000000000003150"), 0x0421c1), + (hex!("0122222222333333334444444455000003140000000000006450"), 0x042261), + (hex!("0122222222333333334444444455000003150000000000003160"), 0x042301), + (hex!("01222222223333333344444444550000031500000000000062d0"), 0x0423a1), + (hex!("0122222222333333334444444455000003160000000000003170"), 0x042441), + (hex!("0122222222333333334444444455000003160000000000004c40"), 0x0424e1), + (hex!("0122222222333333334444444455000003160000000000007c80"), 0x042581), + (hex!("0122222222333333334444444455000003170000000000003180"), 0x042621), + (hex!("0122222222333333334444444455000003170000000000004400"), 0x0426c1), + (hex!("0122222222333333334444444455000003170000000000005090"), 0x042761), + 
(hex!("0122222222333333334444444455000003170000000000006cb0"), 0x042801), + (hex!("0122222222333333334444444455000003180000000000003190"), 0x0428a1), + (hex!("0122222222333333334444444455000003180000000000006560"), 0x042941), + (hex!("01222222223333333344444444550000031900000000000031a0"), 0x0429e1), + (hex!("01222222223333333344444444550000031900000000000052d0"), 0x042a81), + (hex!("01222222223333333344444444550000031900000000000057e0"), 0x042b21), + (hex!("01222222223333333344444444550000031a00000000000031b0"), 0x042bc1), + (hex!("01222222223333333344444444550000031a00000000000071e0"), 0x042c61), + (hex!("01222222223333333344444444550000031b00000000000031c0"), 0x042d01), + (hex!("01222222223333333344444444550000031c00000000000031d0"), 0x042da1), + (hex!("01222222223333333344444444550000031c0000000000004480"), 0x042e41), + (hex!("01222222223333333344444444550000031c0000000000005790"), 0x042ee1), + (hex!("01222222223333333344444444550000031c0000000000007be0"), 0x042f81), + (hex!("01222222223333333344444444550000031d00000000000031e0"), 0x043021), + (hex!("01222222223333333344444444550000031d0000000000005560"), 0x0430c1), + (hex!("01222222223333333344444444550000031e00000000000031f0"), 0x043161), + (hex!("01222222223333333344444444550000031f0000000000003200"), 0x043201), + (hex!("01222222223333333344444444550000031f0000000000004190"), 0x0432a1), + (hex!("0122222222333333334444444455000003200000000000003210"), 0x043341), + (hex!("0122222222333333334444444455000003210000000000003220"), 0x0433e1), + (hex!("0122222222333333334444444455000003220000000000003230"), 0x043481), + (hex!("0122222222333333334444444455000003230000000000003240"), 0x043521), + (hex!("01222222223333333344444444550000032300000000000069d0"), 0x0435c1), + (hex!("0122222222333333334444444455000003240000000000003250"), 0x043661), + (hex!("0122222222333333334444444455000003250000000000003260"), 0x043701), + (hex!("01222222223333333344444444550000032500000000000042b0"), 0x0437a1), + 
(hex!("01222222223333333344444444550000032500000000000064e0"), 0x043841), + (hex!("0122222222333333334444444455000003260000000000003270"), 0x0438e1), + (hex!("0122222222333333334444444455000003270000000000003280"), 0x043981), + (hex!("0122222222333333334444444455000003270000000000005b20"), 0x043a21), + (hex!("0122222222333333334444444455000003270000000000006330"), 0x043ac1), + (hex!("0122222222333333334444444455000003270000000000006810"), 0x043b61), + (hex!("0122222222333333334444444455000003280000000000003290"), 0x043c01), + (hex!("01222222223333333344444444550000032900000000000032a0"), 0x043ca1), + (hex!("01222222223333333344444444550000032900000000000056f0"), 0x043d41), + (hex!("0122222222333333334444444455000003290000000000005e20"), 0x043de1), + (hex!("0122222222333333334444444455000003290000000000005e70"), 0x043e81), + (hex!("01222222223333333344444444550000032a00000000000032b0"), 0x043f21), + (hex!("01222222223333333344444444550000032b00000000000032c0"), 0x043fc1), + (hex!("01222222223333333344444444550000032b0000000000005500"), 0x044061), + (hex!("01222222223333333344444444550000032b0000000000005a20"), 0x044101), + (hex!("01222222223333333344444444550000032c00000000000032d0"), 0x0441a1), + (hex!("01222222223333333344444444550000032c0000000000004060"), 0x044241), + (hex!("01222222223333333344444444550000032c0000000000004760"), 0x0442e1), + (hex!("01222222223333333344444444550000032d00000000000032e0"), 0x044381), + (hex!("01222222223333333344444444550000032d00000000000068a0"), 0x044421), + (hex!("01222222223333333344444444550000032e00000000000032f0"), 0x0444c1), + (hex!("01222222223333333344444444550000032f0000000000003300"), 0x044561), + (hex!("0122222222333333334444444455000003300000000000003310"), 0x044601), + (hex!("0122222222333333334444444455000003300000000000006e40"), 0x0446a1), + (hex!("0122222222333333334444444455000003310000000000003320"), 0x044741), + (hex!("0122222222333333334444444455000003310000000000004620"), 0x0447e1), + 
(hex!("0122222222333333334444444455000003320000000000003330"), 0x044881), + (hex!("0122222222333333334444444455000003330000000000003340"), 0x044921), + (hex!("0122222222333333334444444455000003330000000000004b80"), 0x0449c1), + (hex!("0122222222333333334444444455000003340000000000003350"), 0x044a61), + (hex!("0122222222333333334444444455000003350000000000003360"), 0x044b01), + (hex!("0122222222333333334444444455000003360000000000003370"), 0x044ba1), + (hex!("0122222222333333334444444455000003370000000000003380"), 0x044c41), + (hex!("0122222222333333334444444455000003380000000000003390"), 0x044ce1), + (hex!("01222222223333333344444444550000033900000000000033a0"), 0x044d81), + (hex!("0122222222333333334444444455000003390000000000006b90"), 0x044e21), + (hex!("01222222223333333344444444550000033a00000000000033b0"), 0x044ec1), + (hex!("01222222223333333344444444550000033a0000000000007420"), 0x044f61), + (hex!("01222222223333333344444444550000033b00000000000033c0"), 0x045001), + (hex!("01222222223333333344444444550000033b0000000000007620"), 0x0450a1), + (hex!("01222222223333333344444444550000033c00000000000033d0"), 0x045141), + (hex!("01222222223333333344444444550000033c0000000000006b30"), 0x0451e1), + (hex!("01222222223333333344444444550000033d00000000000033e0"), 0x045281), + (hex!("01222222223333333344444444550000033e00000000000033f0"), 0x045321), + (hex!("01222222223333333344444444550000033e00000000000048b0"), 0x0453c1), + (hex!("01222222223333333344444444550000033e0000000000004e70"), 0x045461), + (hex!("01222222223333333344444444550000033f0000000000003400"), 0x045501), + (hex!("01222222223333333344444444550000033f0000000000006380"), 0x0455a1), + (hex!("0122222222333333334444444455000003400000000000003410"), 0x045641), + (hex!("0122222222333333334444444455000003410000000000003420"), 0x0456e1), + (hex!("0122222222333333334444444455000003410000000000006090"), 0x045781), + (hex!("0122222222333333334444444455000003420000000000003430"), 0x045821), + 
(hex!("01222222223333333344444444550000034200000000000073d0"), 0x0458c1), + (hex!("0122222222333333334444444455000003430000000000003440"), 0x045961), + (hex!("0122222222333333334444444455000003430000000000006370"), 0x045a01), + (hex!("01222222223333333344444444550000034300000000000075c0"), 0x045aa1), + (hex!("0122222222333333334444444455000003440000000000003450"), 0x045b41), + (hex!("0122222222333333334444444455000003450000000000003460"), 0x045be1), + (hex!("0122222222333333334444444455000003460000000000003470"), 0x045c81), + (hex!("01222222223333333344444444550000034600000000000055f0"), 0x045d21), + (hex!("0122222222333333334444444455000003470000000000003480"), 0x045dc1), + (hex!("0122222222333333334444444455000003470000000000003fe0"), 0x045e61), + (hex!("0122222222333333334444444455000003480000000000003490"), 0x045f01), + (hex!("0122222222333333334444444455000003480000000000007990"), 0x045fa1), + (hex!("01222222223333333344444444550000034900000000000034a0"), 0x046041), + (hex!("0122222222333333334444444455000003490000000000004410"), 0x0460e1), + (hex!("01222222223333333344444444550000034a00000000000034b0"), 0x046181), + (hex!("01222222223333333344444444550000034a00000000000062a0"), 0x046221), + (hex!("01222222223333333344444444550000034a0000000000007260"), 0x0462c1), + (hex!("01222222223333333344444444550000034b00000000000034c0"), 0x046361), + (hex!("01222222223333333344444444550000034b0000000000005760"), 0x046401), + (hex!("01222222223333333344444444550000034b0000000000006200"), 0x0464a1), + (hex!("01222222223333333344444444550000034c00000000000034d0"), 0x046541), + (hex!("01222222223333333344444444550000034d00000000000034e0"), 0x0465e1), + (hex!("01222222223333333344444444550000034e00000000000034f0"), 0x046681), + (hex!("01222222223333333344444444550000034e0000000000007790"), 0x046721), + (hex!("01222222223333333344444444550000034f0000000000003500"), 0x0467c1), + (hex!("0122222222333333334444444455000003500000000000003510"), 0x046861), + 
(hex!("0122222222333333334444444455000003510000000000003520"), 0x046901), + (hex!("0122222222333333334444444455000003520000000000003530"), 0x0469a1), + (hex!("01222222223333333344444444550000035200000000000056a0"), 0x046a41), + (hex!("0122222222333333334444444455000003530000000000003540"), 0x046ae1), + (hex!("0122222222333333334444444455000003540000000000003550"), 0x046b81), + (hex!("01222222223333333344444444550000035400000000000047b0"), 0x046c21), + (hex!("0122222222333333334444444455000003550000000000003560"), 0x046cc1), + (hex!("0122222222333333334444444455000003550000000000004500"), 0x046d61), + (hex!("0122222222333333334444444455000003560000000000003570"), 0x046e01), + (hex!("0122222222333333334444444455000003560000000000004fc0"), 0x046ea1), + (hex!("0122222222333333334444444455000003560000000000007160"), 0x046f41), + (hex!("0122222222333333334444444455000003560000000000007400"), 0x046fe1), + (hex!("0122222222333333334444444455000003570000000000003580"), 0x047081), + (hex!("0122222222333333334444444455000003580000000000003590"), 0x047121), + (hex!("0122222222333333334444444455000003580000000000005a80"), 0x0471c1), + (hex!("01222222223333333344444444550000035900000000000035a0"), 0x047261), + (hex!("01222222223333333344444444550000035900000000000073b0"), 0x047301), + (hex!("01222222223333333344444444550000035a00000000000035b0"), 0x0473a1), + (hex!("01222222223333333344444444550000035a0000000000004c20"), 0x047441), + (hex!("01222222223333333344444444550000035b00000000000035c0"), 0x0474e1), + (hex!("01222222223333333344444444550000035b0000000000005120"), 0x047581), + (hex!("01222222223333333344444444550000035c00000000000035d0"), 0x047621), + (hex!("01222222223333333344444444550000035c0000000000004300"), 0x0476c1), + (hex!("01222222223333333344444444550000035c0000000000005a40"), 0x047761), + (hex!("01222222223333333344444444550000035c0000000000006620"), 0x047801), + (hex!("01222222223333333344444444550000035c0000000000006ed0"), 0x0478a1), + 
(hex!("01222222223333333344444444550000035d00000000000035e0"), 0x047941), + (hex!("01222222223333333344444444550000035d0000000000005df0"), 0x0479e1), + (hex!("01222222223333333344444444550000035e00000000000035f0"), 0x047a81), + (hex!("01222222223333333344444444550000035f0000000000003600"), 0x047b21), + (hex!("01222222223333333344444444550000035f00000000000058d0"), 0x047bc1), + (hex!("0122222222333333334444444455000003600000000000003610"), 0x047c61), + (hex!("0122222222333333334444444455000003600000000000007b90"), 0x047d01), + (hex!("0122222222333333334444444455000003610000000000003620"), 0x047da1), + (hex!("0122222222333333334444444455000003610000000000006ad0"), 0x047e41), + (hex!("0122222222333333334444444455000003620000000000003630"), 0x047ee1), + (hex!("01222222223333333344444444550000036200000000000063a0"), 0x047f81), + (hex!("0122222222333333334444444455000003630000000000003640"), 0x048021), + (hex!("0122222222333333334444444455000003630000000000007250"), 0x0480c1), + (hex!("0122222222333333334444444455000003640000000000003650"), 0x048161), + (hex!("0122222222333333334444444455000003640000000000005510"), 0x048201), + (hex!("0122222222333333334444444455000003640000000000007850"), 0x0482a1), + (hex!("0122222222333333334444444455000003650000000000003660"), 0x048341), + (hex!("0122222222333333334444444455000003660000000000003670"), 0x0483e1), + (hex!("0122222222333333334444444455000003660000000000004650"), 0x048481), + (hex!("01222222223333333344444444550000036600000000000050d0"), 0x048521), + (hex!("0122222222333333334444444455000003660000000000006eb0"), 0x0485c1), + (hex!("0122222222333333334444444455000003670000000000003680"), 0x048661), + (hex!("01222222223333333344444444550000036700000000000071f0"), 0x048701), + (hex!("0122222222333333334444444455000003680000000000003690"), 0x0487a1), + (hex!("01222222223333333344444444550000036900000000000036a0"), 0x048841), + (hex!("0122222222333333334444444455000003690000000000005c70"), 0x0488e1), + 
(hex!("01222222223333333344444444550000036a00000000000036b0"), 0x048981), + (hex!("01222222223333333344444444550000036a00000000000071b0"), 0x048a21), + (hex!("01222222223333333344444444550000036b00000000000036c0"), 0x048ac1), + (hex!("01222222223333333344444444550000036b0000000000004670"), 0x048b61), + (hex!("01222222223333333344444444550000036c00000000000036d0"), 0x048c01), + (hex!("01222222223333333344444444550000036c0000000000004750"), 0x048ca1), + (hex!("01222222223333333344444444550000036c0000000000006fa0"), 0x048d41), + (hex!("01222222223333333344444444550000036d00000000000036e0"), 0x048de1), + (hex!("01222222223333333344444444550000036d0000000000003f70"), 0x048e81), + (hex!("01222222223333333344444444550000036d0000000000004b90"), 0x048f21), + (hex!("01222222223333333344444444550000036d00000000000057a0"), 0x048fc1), + (hex!("01222222223333333344444444550000036e00000000000036f0"), 0x049061), + (hex!("01222222223333333344444444550000036e00000000000075d0"), 0x049101), + (hex!("01222222223333333344444444550000036f0000000000003700"), 0x0491a1), + (hex!("0122222222333333334444444455000003700000000000003710"), 0x049241), + (hex!("0122222222333333334444444455000003700000000000005aa0"), 0x0492e1), + (hex!("0122222222333333334444444455000003710000000000003720"), 0x049381), + (hex!("0122222222333333334444444455000003710000000000005130"), 0x049421), + (hex!("0122222222333333334444444455000003710000000000006fc0"), 0x0494c1), + (hex!("0122222222333333334444444455000003710000000000007b00"), 0x049561), + (hex!("0122222222333333334444444455000003720000000000003730"), 0x049601), + (hex!("01222222223333333344444444550000037200000000000054d0"), 0x0496a1), + (hex!("0122222222333333334444444455000003730000000000003740"), 0x049741), + (hex!("0122222222333333334444444455000003730000000000004220"), 0x0497e1), + (hex!("0122222222333333334444444455000003740000000000003750"), 0x049881), + (hex!("0122222222333333334444444455000003740000000000004720"), 0x049921), + 
(hex!("0122222222333333334444444455000003750000000000003760"), 0x0499c1), + (hex!("0122222222333333334444444455000003750000000000004110"), 0x049a61), + (hex!("0122222222333333334444444455000003760000000000003770"), 0x049b01), + (hex!("0122222222333333334444444455000003770000000000003780"), 0x049ba1), + (hex!("0122222222333333334444444455000003780000000000003790"), 0x049c41), + (hex!("0122222222333333334444444455000003780000000000004b40"), 0x049ce1), + (hex!("0122222222333333334444444455000003780000000000005660"), 0x049d81), + (hex!("0122222222333333334444444455000003780000000000005ea0"), 0x049e21), + (hex!("01222222223333333344444444550000037900000000000037a0"), 0x049ec1), + (hex!("01222222223333333344444444550000037a00000000000037b0"), 0x049f61), + (hex!("01222222223333333344444444550000037b00000000000037c0"), 0x04a001), + (hex!("01222222223333333344444444550000037c00000000000037d0"), 0x04a0a1), + (hex!("01222222223333333344444444550000037c0000000000004340"), 0x04a141), + (hex!("01222222223333333344444444550000037c0000000000005230"), 0x04a1e1), + (hex!("01222222223333333344444444550000037d00000000000037e0"), 0x04a281), + (hex!("01222222223333333344444444550000037d00000000000051e0"), 0x04a321), + (hex!("01222222223333333344444444550000037e00000000000037f0"), 0x04a3c1), + (hex!("01222222223333333344444444550000037e0000000000004090"), 0x04a461), + (hex!("01222222223333333344444444550000037e0000000000005c20"), 0x04a501), + (hex!("01222222223333333344444444550000037f0000000000003800"), 0x04a5a1), + (hex!("0122222222333333334444444455000003800000000000003810"), 0x04a641), + (hex!("0122222222333333334444444455000003800000000000007630"), 0x04a6e1), + (hex!("0122222222333333334444444455000003810000000000003820"), 0x04a781), + (hex!("0122222222333333334444444455000003820000000000003830"), 0x04a821), + (hex!("0122222222333333334444444455000003820000000000004170"), 0x04a8c1), + (hex!("0122222222333333334444444455000003830000000000003840"), 0x04a961), + 
(hex!("0122222222333333334444444455000003840000000000003850"), 0x04aa01), + (hex!("0122222222333333334444444455000003850000000000003860"), 0x04aaa1), + (hex!("0122222222333333334444444455000003850000000000004180"), 0x04ab41), + (hex!("0122222222333333334444444455000003850000000000005c90"), 0x04abe1), + (hex!("0122222222333333334444444455000003850000000000005da0"), 0x04ac81), + (hex!("0122222222333333334444444455000003850000000000006ff0"), 0x04ad21), + (hex!("0122222222333333334444444455000003860000000000003870"), 0x04adc1), + (hex!("01222222223333333344444444550000038600000000000065c0"), 0x04ae61), + (hex!("0122222222333333334444444455000003870000000000003880"), 0x04af01), + (hex!("0122222222333333334444444455000003870000000000007cc0"), 0x04afa1), + (hex!("0122222222333333334444444455000003880000000000003890"), 0x04b041), + (hex!("01222222223333333344444444550000038900000000000038a0"), 0x04b0e1), + (hex!("01222222223333333344444444550000038a00000000000038b0"), 0x04b181), + (hex!("01222222223333333344444444550000038a00000000000073e0"), 0x04b221), + (hex!("01222222223333333344444444550000038b00000000000038c0"), 0x04b2c1), + (hex!("01222222223333333344444444550000038c00000000000038d0"), 0x04b361), + (hex!("01222222223333333344444444550000038d00000000000038e0"), 0x04b401), + (hex!("01222222223333333344444444550000038d00000000000069f0"), 0x04b4a1), + (hex!("01222222223333333344444444550000038d0000000000007680"), 0x04b541), + (hex!("01222222223333333344444444550000038e00000000000038f0"), 0x04b5e1), + (hex!("01222222223333333344444444550000038f0000000000003900"), 0x04b681), + (hex!("01222222223333333344444444550000038f00000000000045b0"), 0x04b721), + (hex!("01222222223333333344444444550000038f0000000000007180"), 0x04b7c1), + (hex!("0122222222333333334444444455000003900000000000003910"), 0x04b861), + (hex!("0122222222333333334444444455000003910000000000003920"), 0x04b901), + (hex!("0122222222333333334444444455000003910000000000004a20"), 0x04b9a1), + 
(hex!("0122222222333333334444444455000003920000000000003930"), 0x04ba41), + (hex!("01222222223333333344444444550000039200000000000059b0"), 0x04bae1), + (hex!("0122222222333333334444444455000003930000000000003940"), 0x04bb81), + (hex!("0122222222333333334444444455000003930000000000006cc0"), 0x04bc21), + (hex!("0122222222333333334444444455000003940000000000003950"), 0x04bcc1), + (hex!("01222222223333333344444444550000039400000000000056c0"), 0x04bd61), + (hex!("0122222222333333334444444455000003950000000000003960"), 0x04be01), + (hex!("0122222222333333334444444455000003950000000000004cc0"), 0x04bea1), + (hex!("0122222222333333334444444455000003950000000000007720"), 0x04bf41), + (hex!("0122222222333333334444444455000003960000000000003970"), 0x04bfe1), + (hex!("0122222222333333334444444455000003960000000000004da0"), 0x04c081), + (hex!("0122222222333333334444444455000003960000000000004df0"), 0x04c121), + (hex!("0122222222333333334444444455000003960000000000004f30"), 0x04c1c1), + (hex!("01222222223333333344444444550000039600000000000050f0"), 0x04c261), + (hex!("0122222222333333334444444455000003960000000000007940"), 0x04c301), + (hex!("0122222222333333334444444455000003970000000000003980"), 0x04c3a1), + (hex!("0122222222333333334444444455000003970000000000005850"), 0x04c441), + (hex!("0122222222333333334444444455000003970000000000007bd0"), 0x04c4e1), + (hex!("0122222222333333334444444455000003980000000000003990"), 0x04c581), + (hex!("0122222222333333334444444455000003980000000000004c00"), 0x04c621), + (hex!("0122222222333333334444444455000003980000000000005580"), 0x04c6c1), + (hex!("01222222223333333344444444550000039900000000000039a0"), 0x04c761), + (hex!("0122222222333333334444444455000003990000000000005820"), 0x04c801), + (hex!("01222222223333333344444444550000039a00000000000039b0"), 0x04c8a1), + (hex!("01222222223333333344444444550000039b00000000000039c0"), 0x04c941), + (hex!("01222222223333333344444444550000039b0000000000004c10"), 0x04c9e1), + 
(hex!("01222222223333333344444444550000039b0000000000006460"), 0x04ca81), + (hex!("01222222223333333344444444550000039c00000000000039d0"), 0x04cb21), + (hex!("01222222223333333344444444550000039d00000000000039e0"), 0x04cbc1), + (hex!("01222222223333333344444444550000039d00000000000044c0"), 0x04cc61), + (hex!("01222222223333333344444444550000039d00000000000049e0"), 0x04cd01), + (hex!("01222222223333333344444444550000039e00000000000039f0"), 0x04cda1), + (hex!("01222222223333333344444444550000039f0000000000003a00"), 0x04ce41), + (hex!("0122222222333333334444444455000003a00000000000003a10"), 0x04cee1), + (hex!("0122222222333333334444444455000003a10000000000003a20"), 0x04cf81), + (hex!("0122222222333333334444444455000003a10000000000006a80"), 0x04d021), + (hex!("0122222222333333334444444455000003a20000000000003a30"), 0x04d0c1), + (hex!("0122222222333333334444444455000003a200000000000062b0"), 0x04d161), + (hex!("0122222222333333334444444455000003a30000000000003a40"), 0x04d201), + (hex!("0122222222333333334444444455000003a30000000000006ce0"), 0x04d2a1), + (hex!("0122222222333333334444444455000003a40000000000003a50"), 0x04d341), + (hex!("0122222222333333334444444455000003a50000000000003a60"), 0x04d3e1), + (hex!("0122222222333333334444444455000003a60000000000003a70"), 0x04d481), + (hex!("0122222222333333334444444455000003a60000000000007750"), 0x04d521), + (hex!("0122222222333333334444444455000003a70000000000003a80"), 0x04d5c1), + (hex!("0122222222333333334444444455000003a70000000000005b10"), 0x04d661), + (hex!("0122222222333333334444444455000003a80000000000003a90"), 0x04d701), + (hex!("0122222222333333334444444455000003a80000000000006c20"), 0x04d7a1), + (hex!("0122222222333333334444444455000003a90000000000003aa0"), 0x04d841), + (hex!("0122222222333333334444444455000003a90000000000005b70"), 0x04d8e1), + (hex!("0122222222333333334444444455000003a900000000000070e0"), 0x04d981), + (hex!("0122222222333333334444444455000003aa0000000000003ab0"), 0x04da21), + 
(hex!("0122222222333333334444444455000003aa00000000000049f0"), 0x04dac1), + (hex!("0122222222333333334444444455000003aa0000000000004d60"), 0x04db61), + (hex!("0122222222333333334444444455000003ab0000000000003ac0"), 0x04dc01), + (hex!("0122222222333333334444444455000003ac0000000000003ad0"), 0x04dca1), + (hex!("0122222222333333334444444455000003ac0000000000004580"), 0x04dd41), + (hex!("0122222222333333334444444455000003ad0000000000003ae0"), 0x04dde1), + (hex!("0122222222333333334444444455000003ae0000000000003af0"), 0x04de81), + (hex!("0122222222333333334444444455000003af0000000000003b00"), 0x04df21), + (hex!("0122222222333333334444444455000003b00000000000003b10"), 0x04dfc1), + (hex!("0122222222333333334444444455000003b10000000000003b20"), 0x04e061), + (hex!("0122222222333333334444444455000003b10000000000003fd0"), 0x04e101), + (hex!("0122222222333333334444444455000003b20000000000003b30"), 0x04e1a1), + (hex!("0122222222333333334444444455000003b30000000000003b40"), 0x04e241), + (hex!("0122222222333333334444444455000003b40000000000003b50"), 0x04e2e1), + (hex!("0122222222333333334444444455000003b40000000000007450"), 0x04e381), + (hex!("0122222222333333334444444455000003b50000000000003b60"), 0x04e421), + (hex!("0122222222333333334444444455000003b60000000000003b70"), 0x04e4c1), + (hex!("0122222222333333334444444455000003b70000000000003b80"), 0x04e561), + (hex!("0122222222333333334444444455000003b70000000000006d50"), 0x04e601), + (hex!("0122222222333333334444444455000003b80000000000003b90"), 0x04e6a1), + (hex!("0122222222333333334444444455000003b800000000000057c0"), 0x04e741), + (hex!("0122222222333333334444444455000003b800000000000078a0"), 0x04e7e1), + (hex!("0122222222333333334444444455000003b90000000000003ba0"), 0x04e881), + (hex!("0122222222333333334444444455000003b90000000000006750"), 0x04e921), + (hex!("0122222222333333334444444455000003ba0000000000003bb0"), 0x04e9c1), + (hex!("0122222222333333334444444455000003ba0000000000007a10"), 0x04ea61), + 
(hex!("0122222222333333334444444455000003ba0000000000007a20"), 0x04eb01), + (hex!("0122222222333333334444444455000003bb0000000000003bc0"), 0x04eba1), + (hex!("0122222222333333334444444455000003bb0000000000005bc0"), 0x04ec41), + (hex!("0122222222333333334444444455000003bc0000000000003bd0"), 0x04ece1), + (hex!("0122222222333333334444444455000003bc0000000000005e80"), 0x04ed81), + (hex!("0122222222333333334444444455000003bc0000000000007ab0"), 0x04ee21), + (hex!("0122222222333333334444444455000003bd0000000000003be0"), 0x04eec1), + (hex!("0122222222333333334444444455000003bd00000000000049b0"), 0x04ef61), + (hex!("0122222222333333334444444455000003be0000000000003bf0"), 0x04f001), + (hex!("0122222222333333334444444455000003be0000000000005780"), 0x04f0a1), + (hex!("0122222222333333334444444455000003be0000000000007930"), 0x04f141), + (hex!("0122222222333333334444444455000003bf0000000000003c00"), 0x04f1e1), + (hex!("0122222222333333334444444455000003bf0000000000005de0"), 0x04f281), + (hex!("0122222222333333334444444455000003bf00000000000060b0"), 0x04f321), + (hex!("0122222222333333334444444455000003bf00000000000060c0"), 0x04f3c1), + (hex!("0122222222333333334444444455000003bf0000000000006a50"), 0x04f461), + (hex!("0122222222333333334444444455000003c00000000000003c10"), 0x04f501), + (hex!("0122222222333333334444444455000003c00000000000004030"), 0x04f5a1), + (hex!("0122222222333333334444444455000003c10000000000003c20"), 0x04f641), + (hex!("0122222222333333334444444455000003c20000000000003c30"), 0x04f6e1), + (hex!("0122222222333333334444444455000003c200000000000040b0"), 0x04f781), + (hex!("0122222222333333334444444455000003c30000000000003c40"), 0x04f821), + (hex!("0122222222333333334444444455000003c40000000000003c50"), 0x04f8c1), + (hex!("0122222222333333334444444455000003c40000000000005ba0"), 0x04f961), + (hex!("0122222222333333334444444455000003c50000000000003c60"), 0x04fa01), + (hex!("0122222222333333334444444455000003c60000000000003c70"), 0x04faa1), + 
(hex!("0122222222333333334444444455000003c70000000000003c80"), 0x04fb41), + (hex!("0122222222333333334444444455000003c70000000000004270"), 0x04fbe1), + (hex!("0122222222333333334444444455000003c80000000000003c90"), 0x04fc81), + (hex!("0122222222333333334444444455000003c80000000000006e70"), 0x04fd21), + (hex!("0122222222333333334444444455000003c90000000000003ca0"), 0x04fdc1), + (hex!("0122222222333333334444444455000003ca0000000000003cb0"), 0x04fe61), + (hex!("0122222222333333334444444455000003ca0000000000006e20"), 0x04ff01), + (hex!("0122222222333333334444444455000003ca0000000000007c20"), 0x04ffa1), + (hex!("0122222222333333334444444455000003cb0000000000003cc0"), 0x050041), + (hex!("0122222222333333334444444455000003cc0000000000003cd0"), 0x0500e1), + (hex!("0122222222333333334444444455000003cc0000000000006120"), 0x050181), + (hex!("0122222222333333334444444455000003cc0000000000007950"), 0x050221), + (hex!("0122222222333333334444444455000003cd0000000000003ce0"), 0x0502c1), + (hex!("0122222222333333334444444455000003ce0000000000003cf0"), 0x050361), + (hex!("0122222222333333334444444455000003cf0000000000003d00"), 0x050401), + (hex!("0122222222333333334444444455000003d00000000000003d10"), 0x0504a1), + (hex!("0122222222333333334444444455000003d10000000000003d20"), 0x050541), + (hex!("0122222222333333334444444455000003d10000000000005e50"), 0x0505e1), + (hex!("0122222222333333334444444455000003d10000000000007880"), 0x050681), + (hex!("0122222222333333334444444455000003d20000000000003d30"), 0x050721), + (hex!("0122222222333333334444444455000003d20000000000005d00"), 0x0507c1), + (hex!("0122222222333333334444444455000003d30000000000003d40"), 0x050861), + (hex!("0122222222333333334444444455000003d30000000000005d40"), 0x050901), + (hex!("0122222222333333334444444455000003d300000000000063f0"), 0x0509a1), + (hex!("0122222222333333334444444455000003d40000000000003d50"), 0x050a41), + (hex!("0122222222333333334444444455000003d40000000000005700"), 0x050ae1), + 
(hex!("0122222222333333334444444455000003d400000000000078f0"), 0x050b81), + (hex!("0122222222333333334444444455000003d50000000000003d60"), 0x050c21), + (hex!("0122222222333333334444444455000003d60000000000003d70"), 0x050cc1), + (hex!("0122222222333333334444444455000003d70000000000003d80"), 0x050d61), + (hex!("0122222222333333334444444455000003d80000000000003d90"), 0x050e01), + (hex!("0122222222333333334444444455000003d80000000000006690"), 0x050ea1), + (hex!("0122222222333333334444444455000003d90000000000003da0"), 0x050f41), + (hex!("0122222222333333334444444455000003d900000000000076d0"), 0x050fe1), + (hex!("0122222222333333334444444455000003da0000000000003db0"), 0x051081), + (hex!("0122222222333333334444444455000003db0000000000003dc0"), 0x051121), + (hex!("0122222222333333334444444455000003db0000000000004a30"), 0x0511c1), + (hex!("0122222222333333334444444455000003db0000000000005390"), 0x051261), + (hex!("0122222222333333334444444455000003dc0000000000003dd0"), 0x051301), + (hex!("0122222222333333334444444455000003dc0000000000006d60"), 0x0513a1), + (hex!("0122222222333333334444444455000003dd0000000000003de0"), 0x051441), + (hex!("0122222222333333334444444455000003de0000000000003df0"), 0x0514e1), + (hex!("0122222222333333334444444455000003df0000000000003e00"), 0x051581), + (hex!("0122222222333333334444444455000003df0000000000005240"), 0x051621), + (hex!("0122222222333333334444444455000003df0000000000005610"), 0x0516c1), + (hex!("0122222222333333334444444455000003e00000000000003e10"), 0x051761), + (hex!("0122222222333333334444444455000003e00000000000006500"), 0x051801), + (hex!("0122222222333333334444444455000003e10000000000003e20"), 0x0518a1), + (hex!("0122222222333333334444444455000003e10000000000006a10"), 0x051941), + (hex!("0122222222333333334444444455000003e10000000000007c10"), 0x0519e1), + (hex!("0122222222333333334444444455000003e20000000000003e30"), 0x051a81), + (hex!("0122222222333333334444444455000003e20000000000006310"), 0x051b21), + 
(hex!("0122222222333333334444444455000003e30000000000003e40"), 0x051bc1), + (hex!("0122222222333333334444444455000003e40000000000003e50"), 0x051c61), + (hex!("0122222222333333334444444455000003e40000000000006780"), 0x051d01), + (hex!("0122222222333333334444444455000003e40000000000007ce0"), 0x051da1), + (hex!("0122222222333333334444444455000003e50000000000003e60"), 0x051e41), + (hex!("0122222222333333334444444455000003e60000000000003e70"), 0x051ee1), + (hex!("0122222222333333334444444455000003e60000000000005040"), 0x051f81), + (hex!("0122222222333333334444444455000003e60000000000005bf0"), 0x052021), + (hex!("0122222222333333334444444455000003e70000000000003e80"), 0x0520c1), + (hex!("0122222222333333334444444455000003e70000000000003f50"), 0x052161), +]; diff --git a/pageserver/src/layered_repository/image_layer.rs b/pageserver/src/layered_repository/image_layer.rs index d0afce1549..08e635f073 100644 --- a/pageserver/src/layered_repository/image_layer.rs +++ b/pageserver/src/layered_repository/image_layer.rs @@ -16,40 +16,43 @@ //! Every image layer file consists of three parts: "summary", //! "index", and "values". The summary is a fixed size header at the //! beginning of the file, and it contains basic information about the -//! layer, and offsets to the other parts. The "index" is a serialized -//! HashMap, mapping from Key to an offset in the "values" part. The +//! layer, and offsets to the other parts. The "index" is a B-tree, +//! mapping from Key to an offset in the "values" part. The //! actual page images are stored in the "values" part. -//! -//! Only the "index" is loaded into memory by the load function. -//! When images are needed, they are read directly from disk. -//! 
use crate::config::PageServerConf; use crate::layered_repository::blob_io::{BlobCursor, BlobWriter, WriteBlobWriter}; -use crate::layered_repository::block_io::{BlockReader, FileBlockReader}; +use crate::layered_repository::block_io::{BlockBuf, BlockReader, FileBlockReader}; +use crate::layered_repository::disk_btree::{DiskBtreeBuilder, DiskBtreeReader, VisitDirection}; use crate::layered_repository::filename::{ImageFileName, PathOrConf}; use crate::layered_repository::storage_layer::{ Layer, ValueReconstructResult, ValueReconstructState, }; use crate::page_cache::PAGE_SZ; -use crate::repository::{Key, Value}; +use crate::repository::{Key, Value, KEY_SIZE}; use crate::virtual_file::VirtualFile; use crate::{ZTenantId, ZTimelineId}; use crate::{IMAGE_FILE_MAGIC, STORAGE_FORMAT_VERSION}; use anyhow::{bail, ensure, Context, Result}; use bytes::Bytes; +use hex; use log::*; use serde::{Deserialize, Serialize}; -use std::collections::HashMap; use std::fs; use std::io::Write; use std::io::{Seek, SeekFrom}; use std::ops::Range; use std::path::{Path, PathBuf}; -use std::sync::{RwLock, RwLockReadGuard, TryLockError}; +use std::sync::{RwLock, RwLockReadGuard}; use zenith_utils::bin_ser::BeSer; use zenith_utils::lsn::Lsn; +/// +/// Header stored in the beginning of the file +/// +/// After this comes the 'values' part, starting on block 1. After that, +/// the 'index' starts at the block indicated by 'index_start_blk' +/// #[derive(Debug, Serialize, Deserialize, PartialEq, Eq)] struct Summary { /// Magic value to identify this as a zenith image file. Always IMAGE_FILE_MAGIC. @@ -63,6 +66,9 @@ struct Summary { /// Block number where the 'index' part of the file begins. index_start_blk: u32, + /// Block within the 'index', where the B-tree root page is stored + index_root_blk: u32, + // the 'values' part starts after the summary header, on block 1. 
} impl From<&ImageLayer> for Summary { @@ -73,10 +79,10 @@ impl From<&ImageLayer> for Summary { tenantid: layer.tenantid, timelineid: layer.timelineid, key_range: layer.key_range.clone(), - lsn: layer.lsn, index_start_blk: 0, + index_root_blk: 0, } } } @@ -104,11 +110,9 @@ pub struct ImageLayerInner { /// If false, the 'index' has not been loaded into memory yet. loaded: bool, - /// offset of each value - index: HashMap<Key, u64>, - // values copied from summary index_start_blk: u32, + index_root_blk: u32, /// Reader object for reading blocks from the file. (None if not loaded yet) file: Option<FileBlockReader<VirtualFile>>, @@ -147,21 +151,21 @@ impl Layer for ImageLayer { assert!(lsn_range.end >= self.lsn); let inner = self.load()?; - if let Some(&offset) = inner.index.get(&key) { - let buf = inner - .file - .as_ref() - .unwrap() - .block_cursor() - .read_blob(offset) - .with_context(|| { - format!( - "failed to read blob from data file {} at offset {}", - self.filename().display(), - offset - ) - })?; - let value = Bytes::from(buf); + + let file = inner.file.as_ref().unwrap(); + let tree_reader = DiskBtreeReader::new(inner.index_start_blk, inner.index_root_blk, file); + + let mut keybuf: [u8; KEY_SIZE] = [0u8; KEY_SIZE]; + key.write_to_byte_slice(&mut keybuf); + if let Some(offset) = tree_reader.get(&keybuf)? { + let blob = file.block_cursor().read_blob(offset).with_context(|| { + format!( + "failed to read value from data file {} at offset {}", + self.filename().display(), + offset + ) + })?; + let value = Bytes::from(blob); reconstruct_state.img = Some((self.lsn, value)); Ok(ValueReconstructResult::Complete) @@ -174,33 +178,6 @@ impl Layer for ImageLayer { todo!(); } - fn unload(&self) -> Result<()> { - // Unload the index. - // - // TODO: we should access the index directly from pages on the disk, - // using the buffer cache. This load/unload mechanism is really ad hoc. - - // FIXME: In debug mode, loading and unloading the index slows - // things down so much that you get timeout errors. 
At least - // with the test_parallel_copy test. So as an even more ad hoc - // stopgap fix for that, only unload every on average 10 - // checkpoint cycles. - use rand::RngCore; - if rand::thread_rng().next_u32() > (u32::MAX / 10) { - return Ok(()); - } - - let mut inner = match self.inner.try_write() { - Ok(inner) => inner, - Err(TryLockError::WouldBlock) => return Ok(()), - Err(TryLockError::Poisoned(_)) => panic!("ImageLayer lock was poisoned"), - }; - inner.index = HashMap::default(); - inner.loaded = false; - - Ok(()) - } - fn delete(&self) -> Result<()> { // delete underlying file fs::remove_file(self.path())?; @@ -227,10 +204,16 @@ impl Layer for ImageLayer { } let inner = self.load()?; + let file = inner.file.as_ref().unwrap(); + let tree_reader = + DiskBtreeReader::<_, KEY_SIZE>::new(inner.index_start_blk, inner.index_root_blk, file); - for (key, offset) in inner.index.iter() { - println!("key: {} offset {}", key, offset); - } + tree_reader.dump()?; + + tree_reader.visit(&[0u8; KEY_SIZE], VisitDirection::Forwards, |key, value| { + println!("key: {} offset {}", hex::encode(key), value); + true + })?; Ok(()) } @@ -300,6 +283,7 @@ impl ImageLayer { PathOrConf::Conf(_) => { let mut expected_summary = Summary::from(self); expected_summary.index_start_blk = actual_summary.index_start_blk; + expected_summary.index_root_blk = actual_summary.index_root_blk; if actual_summary != expected_summary { bail!("in-file summary does not match expected summary. 
actual = {:?} expected = {:?}", actual_summary, expected_summary); @@ -319,17 +303,8 @@ impl ImageLayer { } } - file.file.seek(SeekFrom::Start( - actual_summary.index_start_blk as u64 * PAGE_SZ as u64, - ))?; - let mut buf_reader = std::io::BufReader::new(&mut file.file); - let index = HashMap::des_from(&mut buf_reader)?; - inner.index_start_blk = actual_summary.index_start_blk; - - info!("loaded from {}", &path.display()); - - inner.index = index; + inner.index_root_blk = actual_summary.index_root_blk; inner.loaded = true; Ok(()) } @@ -348,10 +323,10 @@ impl ImageLayer { key_range: filename.key_range.clone(), lsn: filename.lsn, inner: RwLock::new(ImageLayerInner { - index: HashMap::new(), loaded: false, file: None, index_start_blk: 0, + index_root_blk: 0, }), } } @@ -376,9 +351,9 @@ impl ImageLayer { lsn: summary.lsn, inner: RwLock::new(ImageLayerInner { file: None, - index: HashMap::new(), loaded: false, index_start_blk: 0, + index_root_blk: 0, }), }) } @@ -420,9 +395,8 @@ pub struct ImageLayerWriter { key_range: Range<Key>, lsn: Lsn, - index: HashMap<Key, u64>, - blob_writer: WriteBlobWriter<VirtualFile>, + tree: DiskBtreeBuilder<BlockBuf, KEY_SIZE>, } impl ImageLayerWriter { @@ -447,9 +421,15 @@ impl ImageLayerWriter { }, ); info!("new image layer {}", path.display()); - let file = VirtualFile::create(&path)?; + let mut file = VirtualFile::create(&path)?; + // make room for the header block + file.seek(SeekFrom::Start(PAGE_SZ as u64))?; let blob_writer = WriteBlobWriter::new(file, PAGE_SZ as u64); + // Initialize the b-tree index builder + let block_buf = BlockBuf::new(); + let tree_builder = DiskBtreeBuilder::new(block_buf); + let writer = ImageLayerWriter { conf, _path: path, @@ -457,7 +437,7 @@ impl ImageLayerWriter { tenantid, key_range: key_range.clone(), lsn, - index: HashMap::new(), + tree: tree_builder, blob_writer, }; @@ -473,8 +453,9 @@ impl ImageLayerWriter { ensure!(self.key_range.contains(&key)); let off = self.blob_writer.write_blob(img)?; - let old = self.index.insert(key, off); - 
assert!(old.is_none()); + let mut keybuf: [u8; KEY_SIZE] = [0u8; KEY_SIZE]; + key.write_to_byte_slice(&mut keybuf); + self.tree.append(&keybuf, off)?; Ok(()) } @@ -486,9 +467,11 @@ impl ImageLayerWriter { let mut file = self.blob_writer.into_inner(); // Write out the index - let buf = HashMap::ser(&self.index)?; file.seek(SeekFrom::Start(index_start_blk as u64 * PAGE_SZ as u64))?; - file.write_all(&buf)?; + let (index_root_blk, block_buf) = self.tree.finish()?; + for buf in block_buf.blocks { + file.write_all(buf.as_ref())?; + } // Fill in the summary on blk 0 let summary = Summary { @@ -499,6 +482,7 @@ impl ImageLayerWriter { key_range: self.key_range.clone(), lsn: self.lsn, index_start_blk, + index_root_blk, }; file.seek(SeekFrom::Start(0))?; Summary::ser_into(&summary, &mut file)?; @@ -514,9 +498,9 @@ impl ImageLayerWriter { lsn: self.lsn, inner: RwLock::new(ImageLayerInner { loaded: false, - index: HashMap::new(), file: None, index_start_blk, + index_root_blk, }), }; trace!("created image layer {}", layer.path().display()); diff --git a/pageserver/src/layered_repository/inmemory_layer.rs b/pageserver/src/layered_repository/inmemory_layer.rs index 8a24528732..a45af51487 100644 --- a/pageserver/src/layered_repository/inmemory_layer.rs +++ b/pageserver/src/layered_repository/inmemory_layer.rs @@ -166,13 +166,6 @@ impl Layer for InMemoryLayer { todo!(); } - /// Cannot unload anything in an in-memory layer, since there's no backing - /// store. To release memory used by an in-memory layer, use 'freeze' to turn - /// it into an on-disk layer. - fn unload(&self) -> Result<()> { - Ok(()) - } - /// Nothing to do here. When you drop the last reference to the layer, it will /// be deallocated. 
fn delete(&self) -> Result<()> { diff --git a/pageserver/src/layered_repository/storage_layer.rs b/pageserver/src/layered_repository/storage_layer.rs index 5ad43182f6..e413f311c3 100644 --- a/pageserver/src/layered_repository/storage_layer.rs +++ b/pageserver/src/layered_repository/storage_layer.rs @@ -134,10 +134,6 @@ pub trait Layer: Send + Sync { /// Iterate through all keys and values stored in the layer fn iter(&self) -> Box<dyn Iterator<Item = Result<(Key, Lsn, Value)>> + '_>; - /// Release memory used by this layer. There is no corresponding 'load' - /// function, that's done implicitly when you call one of the get-functions. - fn unload(&self) -> Result<()>; - /// Permanently remove this layer from disk. fn delete(&self) -> Result<()>; diff --git a/pageserver/src/lib.rs b/pageserver/src/lib.rs index 6d2631b2b1..6dddef5f27 100644 --- a/pageserver/src/lib.rs +++ b/pageserver/src/lib.rs @@ -38,7 +38,7 @@ use pgdatadir_mapping::DatadirTimeline; /// This is embedded in the metadata file, and also in the header of all the /// layer files. If you make any backwards-incompatible changes to the storage /// format, bump this! 
-pub const STORAGE_FORMAT_VERSION: u16 = 2; +pub const STORAGE_FORMAT_VERSION: u16 = 3; // Magic constants used to identify different kinds of files pub const IMAGE_FILE_MAGIC: u16 = 0x5A60; diff --git a/pageserver/src/repository.rs b/pageserver/src/repository.rs index 7e998b0ebe..02334d3229 100644 --- a/pageserver/src/repository.rs +++ b/pageserver/src/repository.rs @@ -3,6 +3,7 @@ use crate::remote_storage::RemoteIndex; use crate::walrecord::ZenithWalRecord; use crate::CheckpointConfig; use anyhow::{bail, Result}; +use byteorder::{ByteOrder, BE}; use bytes::Bytes; use serde::{Deserialize, Serialize}; use std::fmt; @@ -27,6 +28,8 @@ pub struct Key { pub field6: u32, } +pub const KEY_SIZE: usize = 18; + impl Key { pub fn next(&self) -> Key { self.add(1) @@ -61,7 +64,7 @@ impl Key { key } - pub fn from_array(b: [u8; 18]) -> Self { + pub fn from_slice(b: &[u8]) -> Self { Key { field1: b[0], field2: u32::from_be_bytes(b[1..5].try_into().unwrap()), @@ -71,6 +74,15 @@ impl Key { field6: u32::from_be_bytes(b[14..18].try_into().unwrap()), } } + + pub fn write_to_byte_slice(&self, buf: &mut [u8]) { + buf[0] = self.field1; + BE::write_u32(&mut buf[1..5], self.field2); + BE::write_u32(&mut buf[5..9], self.field3); + BE::write_u32(&mut buf[9..13], self.field4); + buf[13] = self.field5; + BE::write_u32(&mut buf[14..18], self.field6); + } } pub fn key_range_size(key_range: &Range<Key>) -> u32 { @@ -569,7 +581,7 @@ mod tests { use lazy_static::lazy_static; lazy_static! 
{ - static ref TEST_KEY: Key = Key::from_array(hex!("112222222233333333444444445500000001")); + static ref TEST_KEY: Key = Key::from_slice(&hex!("112222222233333333444444445500000001")); } #[test] From 8e2a6661e901562ee72c70436a350b4af81968a2 Mon Sep 17 00:00:00 2001 From: Arthur Petukhovsky Date: Mon, 11 Apr 2022 20:36:26 +0300 Subject: [PATCH 071/296] Make wal_storage initialization eager (#1489) --- walkeeper/src/safekeeper.rs | 18 ++++++++++-------- walkeeper/src/timeline.rs | 4 ++-- 2 files changed, 12 insertions(+), 10 deletions(-) diff --git a/walkeeper/src/safekeeper.rs b/walkeeper/src/safekeeper.rs index 307a67e5f3..1e23d87b34 100644 --- a/walkeeper/src/safekeeper.rs +++ b/walkeeper/src/safekeeper.rs @@ -517,14 +517,16 @@ where pub fn new( ztli: ZTimelineId, control_store: CTRL, - wal_store: WAL, + mut wal_store: WAL, state: SafeKeeperState, - ) -> SafeKeeper<CTRL, WAL> { + ) -> Result<SafeKeeper<CTRL, WAL>> { if state.timeline_id != ZTimelineId::from([0u8; 16]) && ztli != state.timeline_id { - panic!("Calling SafeKeeper::new with inconsistent ztli ({}) and SafeKeeperState.server.timeline_id ({})", ztli, state.timeline_id); + bail!("Calling SafeKeeper::new with inconsistent ztli ({}) and SafeKeeperState.server.timeline_id ({})", ztli, state.timeline_id); } - SafeKeeper { + wal_store.init_storage(&state)?; + + Ok(SafeKeeper { metrics: SafeKeeperMetrics::new(state.tenant_id, ztli), global_commit_lsn: state.commit_lsn, epoch_start_lsn: Lsn(0), @@ -537,7 +539,7 @@ where s: state, control_store, wal_store, - } + }) } /// Get history of term switches for the available WAL @@ -877,7 +879,7 @@ mod tests { }; let wal_store = DummyWalStore { lsn: Lsn(0) }; let ztli = ZTimelineId::from([0u8; 16]); - let mut sk = SafeKeeper::new(ztli, storage, wal_store, SafeKeeperState::empty()); + let mut sk = SafeKeeper::new(ztli, storage, wal_store, SafeKeeperState::empty()).unwrap(); // check voting for 1 is ok let vote_request = ProposerAcceptorMessage::VoteRequest(VoteRequest { term: 1 }); @@ -892,7 +894,7 @@ 
mod tests { let storage = InMemoryState { persisted_state: state.clone(), }; - sk = SafeKeeper::new(ztli, storage, sk.wal_store, state); + sk = SafeKeeper::new(ztli, storage, sk.wal_store, state).unwrap(); // and ensure voting second time for 1 is not ok vote_resp = sk.process_msg(&vote_request); @@ -909,7 +911,7 @@ mod tests { }; let wal_store = DummyWalStore { lsn: Lsn(0) }; let ztli = ZTimelineId::from([0u8; 16]); - let mut sk = SafeKeeper::new(ztli, storage, wal_store, SafeKeeperState::empty()); + let mut sk = SafeKeeper::new(ztli, storage, wal_store, SafeKeeperState::empty()).unwrap(); let mut ar_hdr = AppendRequestHeader { term: 1, diff --git a/walkeeper/src/timeline.rs b/walkeeper/src/timeline.rs index b10ab97cc1..a76ef77615 100644 --- a/walkeeper/src/timeline.rs +++ b/walkeeper/src/timeline.rs @@ -100,7 +100,7 @@ impl SharedState { let state = SafeKeeperState::new(zttid, peer_ids); let control_store = control_file::FileStorage::new(zttid, conf); let wal_store = wal_storage::PhysicalStorage::new(zttid, conf); - let mut sk = SafeKeeper::new(zttid.timeline_id, control_store, wal_store, state); + let mut sk = SafeKeeper::new(zttid.timeline_id, control_store, wal_store, state)?; sk.control_store.persist(&sk.s)?; Ok(Self { @@ -127,7 +127,7 @@ impl SharedState { Ok(Self { notified_commit_lsn: Lsn(0), - sk: SafeKeeper::new(zttid.timeline_id, control_store, wal_store, state), + sk: SafeKeeper::new(zttid.timeline_id, control_store, wal_store, state)?, replicas: Vec::new(), active: false, num_computes: 0, From db63fa64ae863187bb044f569ad8aa63c9f5e58b Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Fri, 29 Oct 2021 23:21:40 +0300 Subject: [PATCH 072/296] Use rusoto lib for S3 relish_storage impl --- Cargo.lock | 3394 ----------------- pageserver/Cargo.toml | 6 +- pageserver/src/remote_storage.rs | 8 +- pageserver/src/remote_storage/README.md | 12 - .../{rust_s3.rs => s3_bucket.rs} | 247 +- 5 files changed, 135 insertions(+), 3532 deletions(-) delete mode 100644 
Cargo.lock rename pageserver/src/remote_storage/{rust_s3.rs => s3_bucket.rs} (68%) diff --git a/Cargo.lock b/Cargo.lock deleted file mode 100644 index 19ccd18a10..0000000000 --- a/Cargo.lock +++ /dev/null @@ -1,3394 +0,0 @@ -# This file is automatically @generated by Cargo. -# It is not intended for manual editing. -version = 3 - -[[package]] -name = "addr2line" -version = "0.17.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b9ecd88a8c8378ca913a680cd98f0f13ac67383d35993f86c90a70e3f137816b" -dependencies = [ - "gimli", -] - -[[package]] -name = "adler" -version = "1.0.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f26201604c87b1e01bd3d98f8d5d9a8fcbb815e8cedb41ffccbeb4bf593a35fe" - -[[package]] -name = "ahash" -version = "0.4.7" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "739f4a8db6605981345c5654f3a85b056ce52f37a39d34da03f25bf2151ea16e" - -[[package]] -name = "ahash" -version = "0.7.6" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "fcb51a0695d8f838b1ee009b3fbf66bda078cd64590202a864a8f3e8c4315c47" -dependencies = [ - "getrandom", - "once_cell", - "version_check", -] - -[[package]] -name = "aho-corasick" -version = "0.7.18" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1e37cfd5e7657ada45f742d6e99ca5788580b5c529dc78faf11ece6dc702656f" -dependencies = [ - "memchr", -] - -[[package]] -name = "ansi_term" -version = "0.12.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d52a9bb7ec0cf484c551830a7ce27bd20d67eac647e1befb56b0be4ee39a55d2" -dependencies = [ - "winapi", -] - -[[package]] -name = "anyhow" -version = "1.0.53" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "94a45b455c14666b85fc40a019e8ab9eb75e3a124e05494f5397122bc9eb06e0" -dependencies = [ - "backtrace", -] - -[[package]] -name = "async-compression" -version = "0.3.12" -source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "f2bf394cfbbe876f0ac67b13b6ca819f9c9f2fb9ec67223cceb1555fbab1c31a" -dependencies = [ - "futures-core", - "memchr", - "pin-project-lite", - "tokio", - "zstd", - "zstd-safe", -] - -[[package]] -name = "async-stream" -version = "0.3.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "171374e7e3b2504e0e5236e3b59260560f9fe94bfe9ac39ba5e4e929c5590625" -dependencies = [ - "async-stream-impl", - "futures-core", -] - -[[package]] -name = "async-stream-impl" -version = "0.3.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "648ed8c8d2ce5409ccd57453d9d1b214b342a0d69376a6feda1fd6cae3299308" -dependencies = [ - "proc-macro2", - "quote", - "syn", -] - -[[package]] -name = "async-trait" -version = "0.1.52" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "061a7acccaa286c011ddc30970520b98fa40e00c9d644633fb26b5fc63a265e3" -dependencies = [ - "proc-macro2", - "quote", - "syn", -] - -[[package]] -name = "attohttpc" -version = "0.18.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e69e13a99a7e6e070bb114f7ff381e58c7ccc188630121fc4c2fe4bcf24cd072" -dependencies = [ - "http", - "log", - "rustls 0.20.2", - "serde", - "serde_json", - "url", - "webpki 0.22.0", - "webpki-roots", - "wildmatch", -] - -[[package]] -name = "atty" -version = "0.2.14" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d9b39be18770d11421cdb1b9947a45dd3f37e93092cbf377614828a319d5fee8" -dependencies = [ - "hermit-abi", - "libc", - "winapi", -] - -[[package]] -name = "autocfg" -version = "1.1.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d468802bab17cbc0cc575e9b053f41e72aa36bfa6b7f55e3529ffa43161b97fa" - -[[package]] -name = "aws-creds" -version = "0.27.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"460a75eac8f3cb7683e0a9a588a83c3ff039331ea7bfbfbfcecf1dacab276e11" -dependencies = [ - "anyhow", - "attohttpc", - "dirs", - "rust-ini", - "serde", - "serde-xml-rs", - "serde_derive", - "url", -] - -[[package]] -name = "aws-region" -version = "0.23.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4e37c2dc2c9047311911ef175e0ffbb3853f17c32b72cf3d562f455e5ff77267" -dependencies = [ - "anyhow", -] - -[[package]] -name = "backtrace" -version = "0.3.64" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5e121dee8023ce33ab248d9ce1493df03c3b38a659b240096fcbd7048ff9c31f" -dependencies = [ - "addr2line", - "cc", - "cfg-if", - "libc", - "miniz_oxide", - "object", - "rustc-demangle", -] - -[[package]] -name = "base64" -version = "0.12.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3441f0f7b02788e948e47f457ca01f1d7e6d92c693bc132c22b087d3141c03ff" - -[[package]] -name = "base64" -version = "0.13.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "904dfeac50f3cdaba28fc6f57fdcddb75f49ed61346676a78c4ffe55877802fd" - -[[package]] -name = "bincode" -version = "1.3.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b1f45e9417d87227c7a56d22e471c6206462cba514c7590c09aff4cf6d1ddcad" -dependencies = [ - "serde", -] - -[[package]] -name = "bindgen" -version = "0.59.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2bd2a9a458e8f4304c52c43ebb0cfbd520289f8379a52e329a38afda99bf8eb8" -dependencies = [ - "bitflags", - "cexpr", - "clang-sys", - "clap 2.34.0", - "env_logger", - "lazy_static", - "lazycell", - "log", - "peeking_take_while", - "proc-macro2", - "quote", - "regex", - "rustc-hash", - "shlex", - "which", -] - -[[package]] -name = "bitflags" -version = "1.3.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bef38d45163c2f1dde094a7dfd33ccf595c92905c8f8f4fdc18d06fb1037718a" - 
-[[package]] -name = "block-buffer" -version = "0.9.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4152116fd6e9dadb291ae18fc1ec3575ed6d84c29642d97890f4b4a3417297e4" -dependencies = [ - "generic-array", -] - -[[package]] -name = "boxfnonce" -version = "0.1.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5988cb1d626264ac94100be357308f29ff7cbdd3b36bda27f450a4ee3f713426" - -[[package]] -name = "bstr" -version = "0.2.17" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ba3569f383e8f1598449f1a423e72e99569137b47740b1da11ef19af3d5c3223" -dependencies = [ - "lazy_static", - "memchr", - "regex-automata", - "serde", -] - -[[package]] -name = "bumpalo" -version = "3.9.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a4a45a46ab1f2412e53d3a0ade76ffad2025804294569aae387231a0cd6e0899" - -[[package]] -name = "byteorder" -version = "1.4.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "14c189c53d098945499cdfa7ecc63567cf3886b3332b312a5b4585d8d3a6a610" - -[[package]] -name = "bytes" -version = "1.1.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "c4872d67bab6358e59559027aa3b9157c53d9358c51423c17554809a8858e0f8" -dependencies = [ - "serde", -] - -[[package]] -name = "cast" -version = "0.2.7" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4c24dab4283a142afa2fdca129b80ad2c6284e073930f964c3a1293c225ee39a" -dependencies = [ - "rustc_version", -] - -[[package]] -name = "cc" -version = "1.0.72" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "22a9137b95ea06864e018375b72adfb7db6e6f68cfc8df5a04d00288050485ee" -dependencies = [ - "jobserver", -] - -[[package]] -name = "cexpr" -version = "0.6.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6fac387a98bb7c37292057cffc56d62ecb629900026402633ae9160df93a8766" 
-dependencies = [ - "nom", -] - -[[package]] -name = "cfg-if" -version = "1.0.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "baf1de4339761588bc0619e3cbc0120ee582ebb74b53b4efbf79117bd2da40fd" - -[[package]] -name = "chrono" -version = "0.4.19" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "670ad68c9088c2a963aaa298cb369688cf3f9465ce5e2d4ca10e6e0098a1ce73" -dependencies = [ - "libc", - "num-integer", - "num-traits", - "time", - "winapi", -] - -[[package]] -name = "clang-sys" -version = "1.3.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4cc00842eed744b858222c4c9faf7243aafc6d33f92f96935263ef4d8a41ce21" -dependencies = [ - "glob", - "libc", - "libloading", -] - -[[package]] -name = "clap" -version = "2.34.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a0610544180c38b88101fecf2dd634b174a62eef6946f84dfc6a7127512b381c" -dependencies = [ - "ansi_term", - "atty", - "bitflags", - "strsim 0.8.0", - "textwrap 0.11.0", - "unicode-width", - "vec_map", -] - -[[package]] -name = "clap" -version = "3.0.14" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b63edc3f163b3c71ec8aa23f9bd6070f77edbf3d1d198b164afa90ff00e4ec62" -dependencies = [ - "atty", - "bitflags", - "indexmap", - "os_str_bytes", - "strsim 0.10.0", - "termcolor", - "textwrap 0.14.2", -] - -[[package]] -name = "combine" -version = "4.6.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "50b727aacc797f9fc28e355d21f34709ac4fc9adecfe470ad07b8f4464f53062" -dependencies = [ - "bytes", - "memchr", -] - -[[package]] -name = "compute_tools" -version = "0.1.0" -dependencies = [ - "anyhow", - "chrono", - "clap 3.0.14", - "env_logger", - "hyper", - "libc", - "log", - "postgres 0.19.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858)", - "regex", - "serde", - "serde_json", - "tar", - "tokio", - 
"workspace_hack", -] - -[[package]] -name = "const_format" -version = "0.2.22" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "22bc6cd49b0ec407b680c3e380182b6ac63b73991cb7602de350352fc309b614" -dependencies = [ - "const_format_proc_macros", -] - -[[package]] -name = "const_format_proc_macros" -version = "0.2.22" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ef196d5d972878a48da7decb7686eded338b4858fbabeed513d63a7c98b2b82d" -dependencies = [ - "proc-macro2", - "quote", - "unicode-xid", -] - -[[package]] -name = "control_plane" -version = "0.1.0" -dependencies = [ - "anyhow", - "lazy_static", - "nix", - "pageserver", - "postgres 0.19.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", - "regex", - "reqwest", - "serde", - "serde_with", - "tar", - "thiserror", - "toml", - "url", - "walkeeper", - "workspace_hack", - "zenith_utils", -] - -[[package]] -name = "cpufeatures" -version = "0.2.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "95059428f66df56b63431fdb4e1947ed2190586af5c5a8a8b71122bdf5a7f469" -dependencies = [ - "libc", -] - -[[package]] -name = "crc32c" -version = "0.6.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ee6b9c9389584bcba988bd0836086789b7f87ad91892d6a83d5291dbb24524b5" -dependencies = [ - "rustc_version", -] - -[[package]] -name = "criterion" -version = "0.3.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1604dafd25fba2fe2d5895a9da139f8dc9b319a5fe5354ca137cbbce4e178d10" -dependencies = [ - "atty", - "cast", - "clap 2.34.0", - "criterion-plot", - "csv", - "itertools", - "lazy_static", - "num-traits", - "oorandom", - "plotters", - "rayon", - "regex", - "serde", - "serde_cbor", - "serde_derive", - "serde_json", - "tinytemplate", - "walkdir", -] - -[[package]] -name = "criterion-plot" -version = "0.4.4" -source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "d00996de9f2f7559f7f4dc286073197f83e92256a59ed395f9aac01fe717da57" -dependencies = [ - "cast", - "itertools", -] - -[[package]] -name = "crossbeam-channel" -version = "0.5.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e54ea8bc3fb1ee042f5aace6e3c6e025d3874866da222930f70ce62aceba0bfa" -dependencies = [ - "cfg-if", - "crossbeam-utils", -] - -[[package]] -name = "crossbeam-deque" -version = "0.8.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6455c0ca19f0d2fbf751b908d5c55c1f5cbc65e03c4225427254b46890bdde1e" -dependencies = [ - "cfg-if", - "crossbeam-epoch", - "crossbeam-utils", -] - -[[package]] -name = "crossbeam-epoch" -version = "0.9.7" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "c00d6d2ea26e8b151d99093005cb442fb9a37aeaca582a03ec70946f49ab5ed9" -dependencies = [ - "cfg-if", - "crossbeam-utils", - "lazy_static", - "memoffset", - "scopeguard", -] - -[[package]] -name = "crossbeam-utils" -version = "0.8.7" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b5e5bed1f1c269533fa816a0a5492b3545209a205ca1a54842be180eb63a16a6" -dependencies = [ - "cfg-if", - "lazy_static", -] - -[[package]] -name = "crypto-mac" -version = "0.10.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bff07008ec701e8028e2ceb8f83f0e4274ee62bd2dbdc4fefff2e9a91824081a" -dependencies = [ - "generic-array", - "subtle", -] - -[[package]] -name = "crypto-mac" -version = "0.11.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b1d1a86f49236c215f271d40892d5fc950490551400b02ef360692c29815c714" -dependencies = [ - "generic-array", - "subtle", -] - -[[package]] -name = "csv" -version = "1.1.6" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "22813a6dc45b335f9bade10bf7271dc477e81113e89eb251a0bc2a8a81c536e1" -dependencies = 
[ - "bstr", - "csv-core", - "itoa 0.4.8", - "ryu", - "serde", -] - -[[package]] -name = "csv-core" -version = "0.1.10" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2b2466559f260f48ad25fe6317b3c8dac77b5bdb5763ac7d9d6103530663bc90" -dependencies = [ - "memchr", -] - -[[package]] -name = "daemonize" -version = "0.4.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "70c24513e34f53b640819f0ac9f705b673fcf4006d7aab8778bee72ebfc89815" -dependencies = [ - "boxfnonce", - "libc", -] - -[[package]] -name = "darling" -version = "0.13.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d0d720b8683f8dd83c65155f0530560cba68cd2bf395f6513a483caee57ff7f4" -dependencies = [ - "darling_core", - "darling_macro", -] - -[[package]] -name = "darling_core" -version = "0.13.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7a340f241d2ceed1deb47ae36c4144b2707ec7dd0b649f894cb39bb595986324" -dependencies = [ - "fnv", - "ident_case", - "proc-macro2", - "quote", - "strsim 0.10.0", - "syn", -] - -[[package]] -name = "darling_macro" -version = "0.13.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "72c41b3b7352feb3211a0d743dc5700a4e3b60f51bd2b368892d1e0f9a95f44b" -dependencies = [ - "darling_core", - "quote", - "syn", -] - -[[package]] -name = "digest" -version = "0.9.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d3dd60d1080a57a05ab032377049e0591415d2b31afd7028356dbf3cc6dcb066" -dependencies = [ - "generic-array", -] - -[[package]] -name = "dirs" -version = "4.0.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ca3aa72a6f96ea37bbc5aa912f6788242832f75369bdfdadcb0e38423f100059" -dependencies = [ - "dirs-sys", -] - -[[package]] -name = "dirs-sys" -version = "0.3.6" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"03d86534ed367a67548dc68113a0f5db55432fdfbb6e6f9d77704397d95d5780" -dependencies = [ - "libc", - "redox_users", - "winapi", -] - -[[package]] -name = "dlv-list" -version = "0.2.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "68df3f2b690c1b86e65ef7830956aededf3cb0a16f898f79b9a6f421a7b6211b" -dependencies = [ - "rand", -] - -[[package]] -name = "either" -version = "1.6.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e78d4f1cc4ae33bbfc157ed5d5a5ef3bc29227303d595861deb238fcec4e9457" - -[[package]] -name = "encoding_rs" -version = "0.8.30" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7896dc8abb250ffdda33912550faa54c88ec8b998dec0b2c55ab224921ce11df" -dependencies = [ - "cfg-if", -] - -[[package]] -name = "env_logger" -version = "0.9.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0b2cf0344971ee6c64c31be0d530793fba457d322dfec2810c453d0ef228f9c3" -dependencies = [ - "atty", - "humantime", - "log", - "regex", - "termcolor", -] - -[[package]] -name = "etcd-client" -version = "0.8.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "585de5039d1ecce74773db49ba4e8107e42be7c2cd0b1a9e7fce27181db7b118" -dependencies = [ - "http", - "prost", - "tokio", - "tokio-stream", - "tonic", - "tonic-build", - "tower-service", -] - -[[package]] -name = "fail" -version = "0.5.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ec3245a0ca564e7f3c797d20d833a6870f57a728ac967d5225b3ffdef4465011" -dependencies = [ - "lazy_static", - "log", - "rand", -] - -[[package]] -name = "fallible-iterator" -version = "0.2.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4443176a9f2c162692bd3d352d745ef9413eec5782a80d8fd6f8a1ac692a07f7" - -[[package]] -name = "fastrand" -version = "1.7.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"c3fcf0cee53519c866c09b5de1f6c56ff9d647101f81c1964fa632e148896cdf" -dependencies = [ - "instant", -] - -[[package]] -name = "filetime" -version = "0.2.15" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "975ccf83d8d9d0d84682850a38c8169027be83368805971cc4f238c2b245bc98" -dependencies = [ - "cfg-if", - "libc", - "redox_syscall", - "winapi", -] - -[[package]] -name = "fixedbitset" -version = "0.4.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "279fb028e20b3c4c320317955b77c5e0c9701f05a1d309905d6fc702cdc5053e" - -[[package]] -name = "fnv" -version = "1.0.7" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3f9eec918d3f24069decb9af1554cad7c880e2da24a9afd88aca000531ab82c1" - -[[package]] -name = "form_urlencoded" -version = "1.0.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5fc25a87fa4fd2094bffb06925852034d90a17f0d1e05197d4956d3555752191" -dependencies = [ - "matches", - "percent-encoding", -] - -[[package]] -name = "fs2" -version = "0.4.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9564fc758e15025b46aa6643b1b77d047d1a56a1aea6e01002ac0c7026876213" -dependencies = [ - "libc", - "winapi", -] - -[[package]] -name = "futures" -version = "0.3.21" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f73fe65f54d1e12b726f517d3e2135ca3125a437b6d998caf1962961f7172d9e" -dependencies = [ - "futures-channel", - "futures-core", - "futures-executor", - "futures-io", - "futures-sink", - "futures-task", - "futures-util", -] - -[[package]] -name = "futures-channel" -version = "0.3.21" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "c3083ce4b914124575708913bca19bfe887522d6e2e6d0952943f5eac4a74010" -dependencies = [ - "futures-core", - "futures-sink", -] - -[[package]] -name = "futures-core" -version = "0.3.21" -source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "0c09fd04b7e4073ac7156a9539b57a484a8ea920f79c7c675d05d289ab6110d3" - -[[package]] -name = "futures-executor" -version = "0.3.21" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9420b90cfa29e327d0429f19be13e7ddb68fa1cccb09d65e5706b8c7a749b8a6" -dependencies = [ - "futures-core", - "futures-task", - "futures-util", -] - -[[package]] -name = "futures-io" -version = "0.3.21" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "fc4045962a5a5e935ee2fdedaa4e08284547402885ab326734432bed5d12966b" - -[[package]] -name = "futures-macro" -version = "0.3.21" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "33c1e13800337f4d4d7a316bf45a567dbcb6ffe087f16424852d97e97a91f512" -dependencies = [ - "proc-macro2", - "quote", - "syn", -] - -[[package]] -name = "futures-sink" -version = "0.3.21" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "21163e139fa306126e6eedaf49ecdb4588f939600f0b1e770f4205ee4b7fa868" - -[[package]] -name = "futures-task" -version = "0.3.21" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "57c66a976bf5909d801bbef33416c41372779507e7a6b3a5e25e4749c58f776a" - -[[package]] -name = "futures-util" -version = "0.3.21" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d8b7abd5d659d9b90c8cba917f6ec750a74e2dc23902ef9cd4cc8c8b22e6036a" -dependencies = [ - "futures-channel", - "futures-core", - "futures-io", - "futures-macro", - "futures-sink", - "futures-task", - "memchr", - "pin-project-lite", - "pin-utils", - "slab", -] - -[[package]] -name = "generic-array" -version = "0.14.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "fd48d33ec7f05fbfa152300fdad764757cbded343c1aa1cff2fbaf4134851803" -dependencies = [ - "typenum", - "version_check", -] - -[[package]] -name = "getrandom" -version = "0.2.4" -source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "418d37c8b1d42553c93648be529cb70f920d3baf8ef469b74b9638df426e0b4c" -dependencies = [ - "cfg-if", - "libc", - "wasi 0.10.0+wasi-snapshot-preview1", -] - -[[package]] -name = "gimli" -version = "0.26.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "78cc372d058dcf6d5ecd98510e7fbc9e5aec4d21de70f65fea8fecebcd881bd4" - -[[package]] -name = "git-version" -version = "0.3.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f6b0decc02f4636b9ccad390dcbe77b722a77efedfa393caf8379a51d5c61899" -dependencies = [ - "git-version-macro", - "proc-macro-hack", -] - -[[package]] -name = "git-version-macro" -version = "0.3.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "fe69f1cbdb6e28af2bac214e943b99ce8a0a06b447d15d3e61161b0423139f3f" -dependencies = [ - "proc-macro-hack", - "proc-macro2", - "quote", - "syn", -] - -[[package]] -name = "glob" -version = "0.3.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9b919933a397b79c37e33b77bb2aa3dc8eb6e165ad809e58ff75bc7db2e34574" - -[[package]] -name = "h2" -version = "0.3.11" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d9f1f717ddc7b2ba36df7e871fd88db79326551d3d6f1fc406fbfd28b582ff8e" -dependencies = [ - "bytes", - "fnv", - "futures-core", - "futures-sink", - "futures-util", - "http", - "indexmap", - "slab", - "tokio", - "tokio-util 0.6.9", - "tracing", -] - -[[package]] -name = "half" -version = "1.8.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "eabb4a44450da02c90444cf74558da904edde8fb4e9035a9a6a4e15445af0bd7" - -[[package]] -name = "hashbrown" -version = "0.9.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d7afe4a420e3fe79967a00898cc1f4db7c8a49a9333a29f8a4bd76a253d5cd04" -dependencies = [ - "ahash 0.4.7", -] - -[[package]] -name = "hashbrown" 
-version = "0.11.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ab5ef0d4909ef3724cc8cce6ccc8572c5c817592e9285f5464f8e86f8bd3726e" -dependencies = [ - "ahash 0.7.6", -] - -[[package]] -name = "heck" -version = "0.3.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6d621efb26863f0e9924c6ac577e8275e5e6b77455db64ffa6c65c904e9e132c" -dependencies = [ - "unicode-segmentation", -] - -[[package]] -name = "hermit-abi" -version = "0.1.19" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "62b467343b94ba476dcb2500d242dadbb39557df889310ac77c5d99100aaac33" -dependencies = [ - "libc", -] - -[[package]] -name = "hex" -version = "0.4.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7f24254aa9a54b5c858eaee2f5bccdb46aaf0e486a595ed5fd8f86ba55232a70" -dependencies = [ - "serde", -] - -[[package]] -name = "hex-literal" -version = "0.3.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7ebdb29d2ea9ed0083cd8cece49bbd968021bd99b0849edb4a9a7ee0fdf6a4e0" - -[[package]] -name = "hmac" -version = "0.10.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "c1441c6b1e930e2817404b5046f1f989899143a12bf92de603b69f4e0aee1e15" -dependencies = [ - "crypto-mac 0.10.1", - "digest", -] - -[[package]] -name = "hmac" -version = "0.11.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2a2a2320eb7ec0ebe8da8f744d7812d9fc4cb4d09344ac01898dbcb6a20ae69b" -dependencies = [ - "crypto-mac 0.11.1", - "digest", -] - -[[package]] -name = "http" -version = "0.2.6" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "31f4c6746584866f0feabcc69893c5b51beef3831656a968ed7ae254cdc4fd03" -dependencies = [ - "bytes", - "fnv", - "itoa 1.0.1", -] - -[[package]] -name = "http-body" -version = "0.4.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"1ff4f84919677303da5f147645dbea6b1881f368d03ac84e1dc09031ebd7b2c6" -dependencies = [ - "bytes", - "http", - "pin-project-lite", -] - -[[package]] -name = "httparse" -version = "1.6.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9100414882e15fb7feccb4897e5f0ff0ff1ca7d1a86a23208ada4d7a18e6c6c4" - -[[package]] -name = "httpdate" -version = "1.0.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "c4a1e36c821dbe04574f602848a19f742f4fb3c98d40449f11bcad18d6b17421" - -[[package]] -name = "humantime" -version = "2.1.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9a3a5bfb195931eeb336b2a7b4d761daec841b97f947d34394601737a7bba5e4" - -[[package]] -name = "hyper" -version = "0.14.16" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b7ec3e62bdc98a2f0393a5048e4c30ef659440ea6e0e572965103e72bd836f55" -dependencies = [ - "bytes", - "futures-channel", - "futures-core", - "futures-util", - "h2", - "http", - "http-body", - "httparse", - "httpdate", - "itoa 0.4.8", - "pin-project-lite", - "socket2", - "tokio", - "tower-service", - "tracing", - "want", -] - -[[package]] -name = "hyper-rustls" -version = "0.23.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d87c48c02e0dc5e3b849a2041db3029fd066650f8f717c07bf8ed78ccb895cac" -dependencies = [ - "http", - "hyper", - "rustls 0.20.2", - "tokio", - "tokio-rustls 0.23.2", -] - -[[package]] -name = "hyper-timeout" -version = "0.4.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bbb958482e8c7be4bc3cf272a766a2b0bf1a6755e7a6ae777f017a31d11b13b1" -dependencies = [ - "hyper", - "pin-project-lite", - "tokio", - "tokio-io-timeout", -] - -[[package]] -name = "ident_case" -version = "1.0.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b9e0384b61958566e926dc50660321d12159025e767c18e043daf26b70104c39" - -[[package]] -name = "idna" 
-version = "0.2.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "418a0a6fab821475f634efe3ccc45c013f742efe03d853e8d3355d5cb850ecf8" -dependencies = [ - "matches", - "unicode-bidi", - "unicode-normalization", -] - -[[package]] -name = "indexmap" -version = "1.8.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "282a6247722caba404c065016bbfa522806e51714c34f5dfc3e4a3a46fcb4223" -dependencies = [ - "autocfg", - "hashbrown 0.11.2", -] - -[[package]] -name = "instant" -version = "0.1.12" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7a5bbe824c507c5da5956355e86a746d82e0e1464f65d862cc5e71da70e94b2c" -dependencies = [ - "cfg-if", -] - -[[package]] -name = "ipnet" -version = "2.3.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "68f2d64f2edebec4ce84ad108148e67e1064789bee435edc5b60ad398714a3a9" - -[[package]] -name = "itertools" -version = "0.10.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a9a9d19fa1e79b6215ff29b9d6880b706147f16e9b1dbb1e4e5947b5b02bc5e3" -dependencies = [ - "either", -] - -[[package]] -name = "itoa" -version = "0.4.8" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b71991ff56294aa922b450139ee08b3bfc70982c6b2c7562771375cf73542dd4" - -[[package]] -name = "itoa" -version = "1.0.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1aab8fc367588b89dcee83ab0fd66b72b50b72fa1904d7095045ace2b0c81c35" - -[[package]] -name = "jobserver" -version = "0.1.24" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "af25a77299a7f711a01975c35a6a424eb6862092cc2d6c72c4ed6cbc56dfc1fa" -dependencies = [ - "libc", -] - -[[package]] -name = "js-sys" -version = "0.3.56" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a38fc24e30fd564ce974c02bf1d337caddff65be6cc4735a1f7eab22a7440f04" -dependencies = [ - 
"wasm-bindgen", -] - -[[package]] -name = "jsonwebtoken" -version = "7.2.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "afabcc15e437a6484fc4f12d0fd63068fe457bf93f1c148d3d9649c60b103f32" -dependencies = [ - "base64 0.12.3", - "pem 0.8.3", - "ring", - "serde", - "serde_json", - "simple_asn1", -] - -[[package]] -name = "kstring" -version = "1.0.6" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8b310ccceade8121d7d77fee406160e457c2f4e7c7982d589da3499bc7ea4526" -dependencies = [ - "serde", -] - -[[package]] -name = "lazy_static" -version = "1.4.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e2abad23fbc42b3700f2f279844dc832adb2b2eb069b2df918f455c4e18cc646" - -[[package]] -name = "lazycell" -version = "1.3.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "830d08ce1d1d941e6b30645f1a0eb5643013d835ce3779a5fc208261dbe10f55" - -[[package]] -name = "libc" -version = "0.2.117" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e74d72e0f9b65b5b4ca49a346af3976df0f9c61d550727f349ecd559f251a26c" - -[[package]] -name = "libloading" -version = "0.7.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "efbc0f03f9a775e9f6aed295c6a1ba2253c5757a9e03d55c6caa46a681abcddd" -dependencies = [ - "cfg-if", - "winapi", -] - -[[package]] -name = "lock_api" -version = "0.4.6" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "88943dd7ef4a2e5a4bfa2753aaab3013e34ce2533d1996fb18ef591e315e2b3b" -dependencies = [ - "scopeguard", -] - -[[package]] -name = "log" -version = "0.4.14" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "51b9bbe6c47d51fc3e1a9b945965946b4c44142ab8792c50835a980d362c2710" -dependencies = [ - "cfg-if", - "serde", -] - -[[package]] -name = "matchers" -version = "0.1.0" -source = "registry+https://github.com/rust-lang/crates.io-index" 
-checksum = "8263075bb86c5a1b1427b5ae862e8889656f126e9f77c484496e8b47cf5c5558" -dependencies = [ - "regex-automata", -] - -[[package]] -name = "matches" -version = "0.1.9" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a3e378b66a060d48947b590737b30a1be76706c8dd7b8ba0f2fe3989c68a853f" - -[[package]] -name = "maybe-async" -version = "0.2.6" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6007f9dad048e0a224f27ca599d669fca8cfa0dac804725aab542b2eb032bce6" -dependencies = [ - "proc-macro2", - "quote", - "syn", -] - -[[package]] -name = "md-5" -version = "0.9.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7b5a279bb9607f9f53c22d496eade00d138d1bdcccd07d74650387cf94942a15" -dependencies = [ - "block-buffer", - "digest", - "opaque-debug", -] - -[[package]] -name = "md5" -version = "0.7.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "490cc448043f947bae3cbee9c203358d62dbee0db12107a74be5c30ccfd09771" - -[[package]] -name = "memchr" -version = "2.4.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "308cc39be01b73d0d18f82a0e7b2a3df85245f84af96fdddc5d202d27e47b86a" - -[[package]] -name = "memoffset" -version = "0.6.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5aa361d4faea93603064a027415f07bd8e1d5c88c9fbf68bf56a285428fd79ce" -dependencies = [ - "autocfg", -] - -[[package]] -name = "mime" -version = "0.3.16" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2a60c7ce501c71e03a9c9c0d35b861413ae925bd979cc7a4e30d060069aaac8d" - -[[package]] -name = "minimal-lexical" -version = "0.2.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "68354c5c6bd36d73ff3feceb05efa59b6acb7626617f4962be322a825e61f79a" - -[[package]] -name = "miniz_oxide" -version = "0.4.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"a92518e98c078586bc6c934028adcca4c92a53d6a958196de835170a01d84e4b" -dependencies = [ - "adler", - "autocfg", -] - -[[package]] -name = "mio" -version = "0.8.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "52da4364ffb0e4fe33a9841a98a3f3014fb964045ce4f7a45a398243c8d6b0c9" -dependencies = [ - "libc", - "log", - "miow", - "ntapi", - "wasi 0.11.0+wasi-snapshot-preview1", - "winapi", -] - -[[package]] -name = "miow" -version = "0.3.7" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b9f1c5b025cda876f66ef43a113f91ebc9f4ccef34843000e0adf6ebbab84e21" -dependencies = [ - "winapi", -] - -[[package]] -name = "multimap" -version = "0.8.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e5ce46fe64a9d73be07dcbe690a38ce1b293be448fd8ce1e6c1b8062c9f72c6a" - -[[package]] -name = "nix" -version = "0.23.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9f866317acbd3a240710c63f065ffb1e4fd466259045ccb504130b7f668f35c6" -dependencies = [ - "bitflags", - "cc", - "cfg-if", - "libc", - "memoffset", -] - -[[package]] -name = "nom" -version = "7.1.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1b1d11e1ef389c76fe5b81bcaf2ea32cf88b62bc494e19f493d0b30e7a930109" -dependencies = [ - "memchr", - "minimal-lexical", - "version_check", -] - -[[package]] -name = "ntapi" -version = "0.3.6" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3f6bb902e437b6d86e03cce10a7e2af662292c5dfef23b65899ea3ac9354ad44" -dependencies = [ - "winapi", -] - -[[package]] -name = "num-bigint" -version = "0.2.6" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "090c7f9998ee0ff65aa5b723e4009f7b217707f1fb5ea551329cc4d6231fb304" -dependencies = [ - "autocfg", - "num-integer", - "num-traits", -] - -[[package]] -name = "num-integer" -version = "0.1.44" -source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "d2cc698a63b549a70bc047073d2949cce27cd1c7b0a4a862d08a8031bc2801db" -dependencies = [ - "autocfg", - "num-traits", -] - -[[package]] -name = "num-traits" -version = "0.2.14" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9a64b1ec5cda2586e284722486d802acf1f7dbdc623e2bfc57e65ca1cd099290" -dependencies = [ - "autocfg", -] - -[[package]] -name = "num_cpus" -version = "1.13.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "19e64526ebdee182341572e50e9ad03965aa510cd94427a4549448f285e957a1" -dependencies = [ - "hermit-abi", - "libc", -] - -[[package]] -name = "object" -version = "0.27.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "67ac1d3f9a1d3616fd9a60c8d74296f22406a238b6a72f5cc1e6f314df4ffbf9" -dependencies = [ - "memchr", -] - -[[package]] -name = "once_cell" -version = "1.9.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "da32515d9f6e6e489d7bc9d84c71b060db7247dc035bbe44eac88cf87486d8d5" - -[[package]] -name = "oorandom" -version = "11.1.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0ab1bc2a289d34bd04a330323ac98a1b4bc82c9d9fcb1e66b63caa84da26b575" - -[[package]] -name = "opaque-debug" -version = "0.3.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "624a8340c38c1b80fd549087862da4ba43e08858af025b236e509b6649fc13d5" - -[[package]] -name = "ordered-multimap" -version = "0.3.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1c672c7ad9ec066e428c00eb917124a06f08db19e2584de982cc34b1f4c12485" -dependencies = [ - "dlv-list", - "hashbrown 0.9.1", -] - -[[package]] -name = "os_str_bytes" -version = "6.0.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8e22443d1643a904602595ba1cd8f7d896afe56d26712531c5ff73a15b2fbf64" -dependencies = [ - "memchr", -] - 
-[[package]] -name = "pageserver" -version = "0.1.0" -dependencies = [ - "anyhow", - "async-compression", - "async-trait", - "byteorder", - "bytes", - "chrono", - "clap 3.0.14", - "const_format", - "crc32c", - "crossbeam-utils", - "daemonize", - "fail", - "futures", - "hex", - "hex-literal", - "humantime", - "hyper", - "itertools", - "lazy_static", - "log", - "nix", - "once_cell", - "postgres 0.19.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", - "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", - "postgres-types 0.2.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", - "postgres_ffi", - "rand", - "regex", - "rust-s3", - "scopeguard", - "serde", - "serde_json", - "serde_with", - "signal-hook", - "tar", - "tempfile", - "thiserror", - "tokio", - "tokio-postgres 0.7.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", - "tokio-stream", - "toml_edit", - "tracing", - "tracing-futures", - "url", - "workspace_hack", - "zenith_metrics", - "zenith_utils", -] - -[[package]] -name = "parking_lot" -version = "0.11.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7d17b78036a60663b797adeaee46f5c9dfebb86948d1255007a1d6be0271ff99" -dependencies = [ - "instant", - "lock_api", - "parking_lot_core", -] - -[[package]] -name = "parking_lot_core" -version = "0.8.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d76e8e1493bcac0d2766c42737f34458f1c8c50c0d23bcb24ea953affb273216" -dependencies = [ - "cfg-if", - "instant", - "libc", - "redox_syscall", - "smallvec", - "winapi", -] - -[[package]] -name = "peeking_take_while" -version = "0.1.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "19b17cddbe7ec3f8bc800887bab5e717348c95ea2ca0b1bf0837fb964dc67099" - -[[package]] -name = 
"pem" -version = "0.8.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "fd56cbd21fea48d0c440b41cd69c589faacade08c992d9a54e471b79d0fd13eb" -dependencies = [ - "base64 0.13.0", - "once_cell", - "regex", -] - -[[package]] -name = "pem" -version = "1.0.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e9a3b09a20e374558580a4914d3b7d89bd61b954a5a5e1dcbea98753addb1947" -dependencies = [ - "base64 0.13.0", -] - -[[package]] -name = "percent-encoding" -version = "2.1.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d4fd5641d01c8f18a23da7b6fe29298ff4b55afcccdf78973b24cf3175fee32e" - -[[package]] -name = "petgraph" -version = "0.6.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4a13a2fa9d0b63e5f22328828741e523766fff0ee9e779316902290dff3f824f" -dependencies = [ - "fixedbitset", - "indexmap", -] - -[[package]] -name = "phf" -version = "0.8.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3dfb61232e34fcb633f43d12c58f83c1df82962dcdfa565a4e866ffc17dafe12" -dependencies = [ - "phf_shared", -] - -[[package]] -name = "phf_shared" -version = "0.8.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "c00cf8b9eafe68dde5e9eaa2cef8ee84a9336a47d566ec55ca16589633b65af7" -dependencies = [ - "siphasher", -] - -[[package]] -name = "pin-project" -version = "1.0.10" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "58ad3879ad3baf4e44784bc6a718a8698867bb991f8ce24d1bcbe2cfb4c3a75e" -dependencies = [ - "pin-project-internal", -] - -[[package]] -name = "pin-project-internal" -version = "1.0.10" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "744b6f092ba29c3650faf274db506afd39944f48420f6c86b17cfe0ee1cb36bb" -dependencies = [ - "proc-macro2", - "quote", - "syn", -] - -[[package]] -name = "pin-project-lite" -version = "0.2.8" -source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "e280fbe77cc62c91527259e9442153f4688736748d24660126286329742b4c6c" - -[[package]] -name = "pin-utils" -version = "0.1.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8b870d8c151b6f2fb93e84a13146138f05d02ed11c7e7c54f8826aaaf7c9f184" - -[[package]] -name = "plotters" -version = "0.3.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "32a3fd9ec30b9749ce28cd91f255d569591cdf937fe280c312143e3c4bad6f2a" -dependencies = [ - "num-traits", - "plotters-backend", - "plotters-svg", - "wasm-bindgen", - "web-sys", -] - -[[package]] -name = "plotters-backend" -version = "0.3.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d88417318da0eaf0fdcdb51a0ee6c3bed624333bff8f946733049380be67ac1c" - -[[package]] -name = "plotters-svg" -version = "0.3.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "521fa9638fa597e1dc53e9412a4f9cefb01187ee1f7413076f9e6749e2885ba9" -dependencies = [ - "plotters-backend", -] - -[[package]] -name = "postgres" -version = "0.19.1" -source = "git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7#2949d98df52587d562986aad155dd4e889e408b7" -dependencies = [ - "bytes", - "fallible-iterator", - "futures", - "log", - "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", - "tokio", - "tokio-postgres 0.7.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", -] - -[[package]] -name = "postgres" -version = "0.19.1" -source = "git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858#9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858" -dependencies = [ - "bytes", - "fallible-iterator", - "futures", - "log", - "postgres-protocol 0.6.1 
(git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858)", - "tokio", - "tokio-postgres 0.7.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858)", -] - -[[package]] -name = "postgres-protocol" -version = "0.6.1" -source = "git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7#2949d98df52587d562986aad155dd4e889e408b7" -dependencies = [ - "base64 0.13.0", - "byteorder", - "bytes", - "fallible-iterator", - "hmac 0.10.1", - "lazy_static", - "md-5", - "memchr", - "rand", - "sha2", - "stringprep", -] - -[[package]] -name = "postgres-protocol" -version = "0.6.1" -source = "git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858#9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858" -dependencies = [ - "base64 0.13.0", - "byteorder", - "bytes", - "fallible-iterator", - "hmac 0.10.1", - "lazy_static", - "md-5", - "memchr", - "rand", - "sha2", - "stringprep", -] - -[[package]] -name = "postgres-types" -version = "0.2.1" -source = "git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7#2949d98df52587d562986aad155dd4e889e408b7" -dependencies = [ - "bytes", - "fallible-iterator", - "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", -] - -[[package]] -name = "postgres-types" -version = "0.2.1" -source = "git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858#9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858" -dependencies = [ - "bytes", - "fallible-iterator", - "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858)", -] - -[[package]] -name = "postgres_ffi" -version = "0.1.0" -dependencies = [ - "anyhow", - "bindgen", - "byteorder", - "bytes", - "chrono", - "crc32c", - "hex", - "lazy_static", - "log", - "memoffset", - 
"rand", - "regex", - "serde", - "thiserror", - "workspace_hack", - "zenith_utils", -] - -[[package]] -name = "ppv-lite86" -version = "0.2.16" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "eb9f9e6e233e5c4a35559a617bf40a4ec447db2e84c20b55a6f83167b7e57872" - -[[package]] -name = "proc-macro-hack" -version = "0.5.19" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "dbf0c48bc1d91375ae5c3cd81e3722dff1abcf81a30960240640d223f59fe0e5" - -[[package]] -name = "proc-macro2" -version = "1.0.36" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "c7342d5883fbccae1cc37a2353b09c87c9b0f3afd73f5fb9bba687a1f733b029" -dependencies = [ - "unicode-xid", -] - -[[package]] -name = "prometheus" -version = "0.13.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b7f64969ffd5dd8f39bd57a68ac53c163a095ed9d0fb707146da1b27025a3504" -dependencies = [ - "cfg-if", - "fnv", - "lazy_static", - "memchr", - "parking_lot", - "thiserror", -] - -[[package]] -name = "prost" -version = "0.9.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "444879275cb4fd84958b1a1d5420d15e6fcf7c235fe47f053c9c2a80aceb6001" -dependencies = [ - "bytes", - "prost-derive", -] - -[[package]] -name = "prost-build" -version = "0.9.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "62941722fb675d463659e49c4f3fe1fe792ff24fe5bbaa9c08cd3b98a1c354f5" -dependencies = [ - "bytes", - "heck", - "itertools", - "lazy_static", - "log", - "multimap", - "petgraph", - "prost", - "prost-types", - "regex", - "tempfile", - "which", -] - -[[package]] -name = "prost-derive" -version = "0.9.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f9cc1a3263e07e0bf68e96268f37665207b49560d98739662cdfaae215c720fe" -dependencies = [ - "anyhow", - "itertools", - "proc-macro2", - "quote", - "syn", -] - -[[package]] -name = "prost-types" -version = 
"0.9.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "534b7a0e836e3c482d2693070f982e39e7611da9695d4d1f5a4b186b51faef0a" -dependencies = [ - "bytes", - "prost", -] - -[[package]] -name = "proxy" -version = "0.1.0" -dependencies = [ - "anyhow", - "bytes", - "clap 3.0.14", - "fail", - "futures", - "hashbrown 0.11.2", - "hex", - "hyper", - "lazy_static", - "md5", - "parking_lot", - "pin-project-lite", - "rand", - "rcgen", - "reqwest", - "rustls 0.19.1", - "scopeguard", - "serde", - "serde_json", - "socket2", - "thiserror", - "tokio", - "tokio-postgres 0.7.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", - "tokio-postgres-rustls", - "tokio-rustls 0.22.0", - "workspace_hack", - "zenith_metrics", - "zenith_utils", -] - -[[package]] -name = "quote" -version = "1.0.15" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "864d3e96a899863136fc6e99f3d7cae289dafe43bf2c5ac19b70df7210c0a145" -dependencies = [ - "proc-macro2", -] - -[[package]] -name = "rand" -version = "0.8.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2e7573632e6454cf6b99d7aac4ccca54be06da05aca2ef7423d22d27d4d4bcd8" -dependencies = [ - "libc", - "rand_chacha", - "rand_core", - "rand_hc", -] - -[[package]] -name = "rand_chacha" -version = "0.3.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e6c10a63a0fa32252be49d21e7709d4d4baf8d231c2dbce1eaa8141b9b127d88" -dependencies = [ - "ppv-lite86", - "rand_core", -] - -[[package]] -name = "rand_core" -version = "0.6.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d34f1408f55294453790c48b2f1ebbb1c5b4b7563eb1f418bcfcfdbb06ebb4e7" -dependencies = [ - "getrandom", -] - -[[package]] -name = "rand_hc" -version = "0.3.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d51e9f596de227fda2ea6c84607f5558e196eeaf43c986b724ba4fb8fdf497e7" 
-dependencies = [ - "rand_core", -] - -[[package]] -name = "rayon" -version = "1.5.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "c06aca804d41dbc8ba42dfd964f0d01334eceb64314b9ecf7c5fad5188a06d90" -dependencies = [ - "autocfg", - "crossbeam-deque", - "either", - "rayon-core", -] - -[[package]] -name = "rayon-core" -version = "1.9.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d78120e2c850279833f1dd3582f730c4ab53ed95aeaaaa862a2a5c71b1656d8e" -dependencies = [ - "crossbeam-channel", - "crossbeam-deque", - "crossbeam-utils", - "lazy_static", - "num_cpus", -] - -[[package]] -name = "rcgen" -version = "0.8.14" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5911d1403f4143c9d56a702069d593e8d0f3fab880a85e103604d0893ea31ba7" -dependencies = [ - "chrono", - "pem 1.0.2", - "ring", - "yasna", -] - -[[package]] -name = "redox_syscall" -version = "0.2.10" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8383f39639269cde97d255a32bdb68c047337295414940c68bdd30c2e13203ff" -dependencies = [ - "bitflags", -] - -[[package]] -name = "redox_users" -version = "0.4.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "528532f3d801c87aec9def2add9ca802fe569e44a544afe633765267840abe64" -dependencies = [ - "getrandom", - "redox_syscall", -] - -[[package]] -name = "regex" -version = "1.5.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d07a8629359eb56f1e2fb1652bb04212c072a87ba68546a04065d525673ac461" -dependencies = [ - "aho-corasick", - "memchr", - "regex-syntax", -] - -[[package]] -name = "regex-automata" -version = "0.1.10" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6c230d73fb8d8c1b9c0b3135c5142a8acee3a0558fb8db5cf1cb65f8d7862132" -dependencies = [ - "regex-syntax", -] - -[[package]] -name = "regex-syntax" -version = "0.6.25" -source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "f497285884f3fcff424ffc933e56d7cbca511def0c9831a7f9b5f6153e3cc89b" - -[[package]] -name = "remove_dir_all" -version = "0.5.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3acd125665422973a33ac9d3dd2df85edad0f4ae9b00dafb1a05e43a9f5ef8e7" -dependencies = [ - "winapi", -] - -[[package]] -name = "reqwest" -version = "0.11.9" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "87f242f1488a539a79bac6dbe7c8609ae43b7914b7736210f239a37cccb32525" -dependencies = [ - "base64 0.13.0", - "bytes", - "encoding_rs", - "futures-core", - "futures-util", - "h2", - "http", - "http-body", - "hyper", - "hyper-rustls", - "ipnet", - "js-sys", - "lazy_static", - "log", - "mime", - "percent-encoding", - "pin-project-lite", - "rustls 0.20.2", - "rustls-pemfile", - "serde", - "serde_json", - "serde_urlencoded", - "tokio", - "tokio-rustls 0.23.2", - "tokio-util 0.6.9", - "url", - "wasm-bindgen", - "wasm-bindgen-futures", - "web-sys", - "webpki-roots", - "winreg", -] - -[[package]] -name = "ring" -version = "0.16.20" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3053cf52e236a3ed746dfc745aa9cacf1b791d846bdaf412f60a8d7d6e17c8fc" -dependencies = [ - "cc", - "libc", - "once_cell", - "spin", - "untrusted", - "web-sys", - "winapi", -] - -[[package]] -name = "routerify" -version = "3.0.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "496c1d3718081c45ba9c31fbfc07417900aa96f4070ff90dc29961836b7a9945" -dependencies = [ - "http", - "hyper", - "lazy_static", - "percent-encoding", - "regex", -] - -[[package]] -name = "rust-ini" -version = "0.17.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "63471c4aa97a1cf8332a5f97709a79a4234698de6a1f5087faf66f2dae810e22" -dependencies = [ - "cfg-if", - "ordered-multimap", -] - -[[package]] -name = "rust-s3" -version = "0.28.1" -source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "2dc0e521d1084d6950e050d4e2595f0fbdaa2b96bb795bab3d90a282288c5e49" -dependencies = [ - "anyhow", - "async-trait", - "aws-creds", - "aws-region", - "base64 0.13.0", - "cfg-if", - "chrono", - "hex", - "hmac 0.11.0", - "http", - "log", - "maybe-async", - "md5", - "percent-encoding", - "reqwest", - "serde", - "serde-xml-rs", - "serde_derive", - "sha2", - "tokio", - "tokio-stream", - "url", -] - -[[package]] -name = "rustc-demangle" -version = "0.1.21" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7ef03e0a2b150c7a90d01faf6254c9c48a41e95fb2a8c2ac1c6f0d2b9aefc342" - -[[package]] -name = "rustc-hash" -version = "1.1.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "08d43f7aa6b08d49f382cde6a7982047c3426db949b1424bc4b7ec9ae12c6ce2" - -[[package]] -name = "rustc_version" -version = "0.4.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bfa0f585226d2e68097d4f95d113b15b83a82e819ab25717ec0590d9584ef366" -dependencies = [ - "semver", -] - -[[package]] -name = "rustls" -version = "0.19.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "35edb675feee39aec9c99fa5ff985081995a06d594114ae14cbe797ad7b7a6d7" -dependencies = [ - "base64 0.13.0", - "log", - "ring", - "sct 0.6.1", - "webpki 0.21.4", -] - -[[package]] -name = "rustls" -version = "0.20.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d37e5e2290f3e040b594b1a9e04377c2c671f1a1cfd9bfdef82106ac1c113f84" -dependencies = [ - "log", - "ring", - "sct 0.7.0", - "webpki 0.22.0", -] - -[[package]] -name = "rustls-pemfile" -version = "0.2.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5eebeaeb360c87bfb72e84abdb3447159c0eaececf1bef2aecd65a8be949d1c9" -dependencies = [ - "base64 0.13.0", -] - -[[package]] -name = "rustls-split" -version = "0.2.2" -source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "7fb079b52cfdb005752b7c3c646048e702003576a8321058e4c8b38227c11aa6" -dependencies = [ - "rustls 0.19.1", -] - -[[package]] -name = "rustversion" -version = "1.0.6" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f2cc38e8fa666e2de3c4aba7edeb5ffc5246c1c2ed0e3d17e560aeeba736b23f" - -[[package]] -name = "ryu" -version = "1.0.9" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "73b4b750c782965c211b42f022f59af1fbceabdd026623714f104152f1ec149f" - -[[package]] -name = "same-file" -version = "1.0.6" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "93fc1dc3aaa9bfed95e02e6eadabb4baf7e3078b0bd1b4d7b6b0b68378900502" -dependencies = [ - "winapi-util", -] - -[[package]] -name = "scopeguard" -version = "1.1.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d29ab0c6d3fc0ee92fe66e2d99f700eab17a8d57d1c1d3b748380fb20baa78cd" - -[[package]] -name = "sct" -version = "0.6.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b362b83898e0e69f38515b82ee15aa80636befe47c3b6d3d89a911e78fc228ce" -dependencies = [ - "ring", - "untrusted", -] - -[[package]] -name = "sct" -version = "0.7.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d53dcdb7c9f8158937a7981b48accfd39a43af418591a5d008c7b22b5e1b7ca4" -dependencies = [ - "ring", - "untrusted", -] - -[[package]] -name = "semver" -version = "1.0.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0486718e92ec9a68fbed73bb5ef687d71103b142595b406835649bebd33f72c7" - -[[package]] -name = "serde" -version = "1.0.136" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ce31e24b01e1e524df96f1c2fdd054405f8d7376249a5110886fb4b658484789" -dependencies = [ - "serde_derive", -] - -[[package]] -name = "serde-xml-rs" -version = "0.5.1" -source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "65162e9059be2f6a3421ebbb4fef3e74b7d9e7c60c50a0e292c6239f19f1edfa" -dependencies = [ - "log", - "serde", - "thiserror", - "xml-rs", -] - -[[package]] -name = "serde_cbor" -version = "0.11.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2bef2ebfde456fb76bbcf9f59315333decc4fda0b2b44b420243c11e0f5ec1f5" -dependencies = [ - "half", - "serde", -] - -[[package]] -name = "serde_derive" -version = "1.0.136" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "08597e7152fcd306f41838ed3e37be9eaeed2b61c42e2117266a554fab4662f9" -dependencies = [ - "proc-macro2", - "quote", - "syn", -] - -[[package]] -name = "serde_json" -version = "1.0.78" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d23c1ba4cf0efd44be32017709280b32d1cea5c3f1275c3b6d9e8bc54f758085" -dependencies = [ - "itoa 1.0.1", - "ryu", - "serde", -] - -[[package]] -name = "serde_urlencoded" -version = "0.7.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d3491c14715ca2294c4d6a88f15e84739788c1d030eed8c110436aafdaa2f3fd" -dependencies = [ - "form_urlencoded", - "itoa 1.0.1", - "ryu", - "serde", -] - -[[package]] -name = "serde_with" -version = "1.12.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ec1e6ec4d8950e5b1e894eac0d360742f3b1407a6078a604a731c4b3f49cefbc" -dependencies = [ - "rustversion", - "serde", - "serde_with_macros", -] - -[[package]] -name = "serde_with_macros" -version = "1.5.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "12e47be9471c72889ebafb5e14d5ff930d89ae7a67bbdb5f8abb564f845a927e" -dependencies = [ - "darling", - "proc-macro2", - "quote", - "syn", -] - -[[package]] -name = "sha2" -version = "0.9.9" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4d58a1e1bf39749807d89cf2d98ac2dfa0ff1cb3faa38fbb64dd88ac8013d800" 
-dependencies = [ - "block-buffer", - "cfg-if", - "cpufeatures", - "digest", - "opaque-debug", -] - -[[package]] -name = "sharded-slab" -version = "0.1.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "900fba806f70c630b0a382d0d825e17a0f19fcd059a2ade1ff237bcddf446b31" -dependencies = [ - "lazy_static", -] - -[[package]] -name = "shlex" -version = "1.1.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "43b2853a4d09f215c24cc5489c992ce46052d359b5109343cbafbf26bc62f8a3" - -[[package]] -name = "signal-hook" -version = "0.3.13" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "647c97df271007dcea485bb74ffdb57f2e683f1306c854f468a0c244badabf2d" -dependencies = [ - "libc", - "signal-hook-registry", -] - -[[package]] -name = "signal-hook-registry" -version = "1.4.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e51e73328dc4ac0c7ccbda3a494dfa03df1de2f46018127f60c693f2648455b0" -dependencies = [ - "libc", -] - -[[package]] -name = "simple_asn1" -version = "0.4.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "692ca13de57ce0613a363c8c2f1de925adebc81b04c923ac60c5488bb44abe4b" -dependencies = [ - "chrono", - "num-bigint", - "num-traits", -] - -[[package]] -name = "siphasher" -version = "0.3.9" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a86232ab60fa71287d7f2ddae4a7073f6b7aac33631c3015abb556f08c6d0a3e" - -[[package]] -name = "slab" -version = "0.4.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9def91fd1e018fe007022791f865d0ccc9b3a0d5001e01aabb8b40e46000afb5" - -[[package]] -name = "smallvec" -version = "1.8.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f2dd574626839106c320a323308629dcb1acfc96e32a8cba364ddc61ac23ee83" - -[[package]] -name = "socket2" -version = "0.4.4" -source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "66d72b759436ae32898a2af0a14218dbf55efde3feeb170eb623637db85ee1e0" -dependencies = [ - "libc", - "winapi", -] - -[[package]] -name = "spin" -version = "0.5.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6e63cff320ae2c57904679ba7cb63280a3dc4613885beafb148ee7bf9aa9042d" - -[[package]] -name = "stringprep" -version = "0.1.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8ee348cb74b87454fff4b551cbf727025810a004f88aeacae7f85b87f4e9a1c1" -dependencies = [ - "unicode-bidi", - "unicode-normalization", -] - -[[package]] -name = "strsim" -version = "0.8.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8ea5119cdb4c55b55d432abb513a0429384878c15dde60cc77b1c99de1a95a6a" - -[[package]] -name = "strsim" -version = "0.10.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "73473c0e59e6d5812c5dfe2a064a6444949f089e20eec9a2e5506596494e4623" - -[[package]] -name = "subtle" -version = "2.4.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6bdef32e8150c2a081110b42772ffe7d7c9032b606bc226c8260fd97e0976601" - -[[package]] -name = "syn" -version = "1.0.86" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8a65b3f4ffa0092e9887669db0eae07941f023991ab58ea44da8fe8e2d511c6b" -dependencies = [ - "proc-macro2", - "quote", - "unicode-xid", -] - -[[package]] -name = "tar" -version = "0.4.38" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4b55807c0344e1e6c04d7c965f5289c39a8d94ae23ed5c0b57aabac549f871c6" -dependencies = [ - "filetime", - "libc", - "xattr", -] - -[[package]] -name = "tempfile" -version = "3.3.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5cdb1ef4eaeeaddc8fbd371e5017057064af0911902ef36b39801f67cc6d79e4" -dependencies = [ - "cfg-if", - "fastrand", - "libc", - 
"redox_syscall", - "remove_dir_all", - "winapi", -] - -[[package]] -name = "termcolor" -version = "1.1.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2dfed899f0eb03f32ee8c6a0aabdb8a7949659e3466561fc0adf54e26d88c5f4" -dependencies = [ - "winapi-util", -] - -[[package]] -name = "textwrap" -version = "0.11.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d326610f408c7a4eb6f51c37c330e496b08506c9457c9d34287ecc38809fb060" -dependencies = [ - "unicode-width", -] - -[[package]] -name = "textwrap" -version = "0.14.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0066c8d12af8b5acd21e00547c3797fde4e8677254a7ee429176ccebbe93dd80" - -[[package]] -name = "thiserror" -version = "1.0.30" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "854babe52e4df1653706b98fcfc05843010039b406875930a70e4d9644e5c417" -dependencies = [ - "thiserror-impl", -] - -[[package]] -name = "thiserror-impl" -version = "1.0.30" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "aa32fd3f627f367fe16f893e2597ae3c05020f8bba2666a4e6ea73d377e5714b" -dependencies = [ - "proc-macro2", - "quote", - "syn", -] - -[[package]] -name = "thread_local" -version = "1.1.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5516c27b78311c50bf42c071425c560ac799b11c30b31f87e3081965fe5e0180" -dependencies = [ - "once_cell", -] - -[[package]] -name = "time" -version = "0.1.44" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6db9e6914ab8b1ae1c260a4ae7a49b6c5611b40328a735b21862567685e73255" -dependencies = [ - "libc", - "wasi 0.10.0+wasi-snapshot-preview1", - "winapi", -] - -[[package]] -name = "tinytemplate" -version = "1.2.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "be4d6b5f19ff7664e8c98d03e2139cb510db9b0a60b55f8e8709b689d939b6bc" -dependencies = [ - "serde", - "serde_json", -] - 
-[[package]] -name = "tinyvec" -version = "1.5.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2c1c1d5a42b6245520c249549ec267180beaffcc0615401ac8e31853d4b6d8d2" -dependencies = [ - "tinyvec_macros", -] - -[[package]] -name = "tinyvec_macros" -version = "0.1.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "cda74da7e1a664f795bb1f8a87ec406fb89a02522cf6e50620d016add6dbbf5c" - -[[package]] -name = "tokio" -version = "1.17.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2af73ac49756f3f7c01172e34a23e5d0216f6c32333757c2c61feb2bbff5a5ee" -dependencies = [ - "bytes", - "libc", - "memchr", - "mio", - "num_cpus", - "once_cell", - "pin-project-lite", - "signal-hook-registry", - "socket2", - "tokio-macros", - "winapi", -] - -[[package]] -name = "tokio-io-timeout" -version = "1.2.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "30b74022ada614a1b4834de765f9bb43877f910cc8ce4be40e89042c9223a8bf" -dependencies = [ - "pin-project-lite", - "tokio", -] - -[[package]] -name = "tokio-macros" -version = "1.7.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b557f72f448c511a979e2564e55d74e6c4432fc96ff4f6241bc6bded342643b7" -dependencies = [ - "proc-macro2", - "quote", - "syn", -] - -[[package]] -name = "tokio-postgres" -version = "0.7.1" -source = "git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7#2949d98df52587d562986aad155dd4e889e408b7" -dependencies = [ - "async-trait", - "byteorder", - "bytes", - "fallible-iterator", - "futures", - "log", - "parking_lot", - "percent-encoding", - "phf", - "pin-project-lite", - "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", - "postgres-types 0.2.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", - "socket2", - "tokio", - 
"tokio-util 0.6.9", -] - -[[package]] -name = "tokio-postgres" -version = "0.7.1" -source = "git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858#9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858" -dependencies = [ - "async-trait", - "byteorder", - "bytes", - "fallible-iterator", - "futures", - "log", - "parking_lot", - "percent-encoding", - "phf", - "pin-project-lite", - "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858)", - "postgres-types 0.2.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858)", - "socket2", - "tokio", - "tokio-util 0.6.9", -] - -[[package]] -name = "tokio-postgres-rustls" -version = "0.8.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7bd8c37d8c23cb6ecdc32fc171bade4e9c7f1be65f693a17afbaad02091a0a19" -dependencies = [ - "futures", - "ring", - "rustls 0.19.1", - "tokio", - "tokio-postgres 0.7.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", - "tokio-rustls 0.22.0", - "webpki 0.21.4", -] - -[[package]] -name = "tokio-rustls" -version = "0.22.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bc6844de72e57df1980054b38be3a9f4702aba4858be64dd700181a8a6d0e1b6" -dependencies = [ - "rustls 0.19.1", - "tokio", - "webpki 0.21.4", -] - -[[package]] -name = "tokio-rustls" -version = "0.23.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a27d5f2b839802bd8267fa19b0530f5a08b9c08cd417976be2a65d130fe1c11b" -dependencies = [ - "rustls 0.20.2", - "tokio", - "webpki 0.22.0", -] - -[[package]] -name = "tokio-stream" -version = "0.1.8" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "50145484efff8818b5ccd256697f36863f587da82cf8b409c53adf1e840798e3" -dependencies = [ - "futures-core", - "pin-project-lite", - "tokio", -] - -[[package]] -name = 
"tokio-util" -version = "0.6.9" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9e99e1983e5d376cd8eb4b66604d2e99e79f5bd988c3055891dcd8c9e2604cc0" -dependencies = [ - "bytes", - "futures-core", - "futures-sink", - "log", - "pin-project-lite", - "tokio", -] - -[[package]] -name = "tokio-util" -version = "0.7.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "64910e1b9c1901aaf5375561e35b9c057d95ff41a44ede043a03e09279eabaf1" -dependencies = [ - "bytes", - "futures-core", - "futures-sink", - "log", - "pin-project-lite", - "tokio", -] - -[[package]] -name = "toml" -version = "0.5.8" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a31142970826733df8241ef35dc040ef98c679ab14d7c3e54d827099b3acecaa" -dependencies = [ - "serde", -] - -[[package]] -name = "toml_edit" -version = "0.13.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "744e9ed5b352340aa47ce033716991b5589e23781acb97cad37d4ea70560f55b" -dependencies = [ - "combine", - "indexmap", - "itertools", - "kstring", - "serde", -] - -[[package]] -name = "tonic" -version = "0.6.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ff08f4649d10a70ffa3522ca559031285d8e421d727ac85c60825761818f5d0a" -dependencies = [ - "async-stream", - "async-trait", - "base64 0.13.0", - "bytes", - "futures-core", - "futures-util", - "h2", - "http", - "http-body", - "hyper", - "hyper-timeout", - "percent-encoding", - "pin-project", - "prost", - "prost-derive", - "tokio", - "tokio-stream", - "tokio-util 0.6.9", - "tower", - "tower-layer", - "tower-service", - "tracing", - "tracing-futures", -] - -[[package]] -name = "tonic-build" -version = "0.6.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9403f1bafde247186684b230dc6f38b5cd514584e8bec1dd32514be4745fa757" -dependencies = [ - "proc-macro2", - "prost-build", - "quote", - "syn", -] - -[[package]] -name = "tower" 
-version = "0.4.12" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9a89fd63ad6adf737582df5db40d286574513c69a11dac5214dc3b5603d6713e" -dependencies = [ - "futures-core", - "futures-util", - "indexmap", - "pin-project", - "pin-project-lite", - "rand", - "slab", - "tokio", - "tokio-util 0.7.0", - "tower-layer", - "tower-service", - "tracing", -] - -[[package]] -name = "tower-layer" -version = "0.3.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "343bc9466d3fe6b0f960ef45960509f84480bf4fd96f92901afe7ff3df9d3a62" - -[[package]] -name = "tower-service" -version = "0.3.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "360dfd1d6d30e05fda32ace2c8c70e9c0a9da713275777f5a4dbb8a1893930c6" - -[[package]] -name = "tracing" -version = "0.1.30" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2d8d93354fe2a8e50d5953f5ae2e47a3fc2ef03292e7ea46e3cc38f549525fb9" -dependencies = [ - "cfg-if", - "log", - "pin-project-lite", - "tracing-attributes", - "tracing-core", -] - -[[package]] -name = "tracing-attributes" -version = "0.1.19" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8276d9a4a3a558d7b7ad5303ad50b53d58264641b82914b7ada36bd762e7a716" -dependencies = [ - "proc-macro2", - "quote", - "syn", -] - -[[package]] -name = "tracing-core" -version = "0.1.22" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "03cfcb51380632a72d3111cb8d3447a8d908e577d31beeac006f836383d29a23" -dependencies = [ - "lazy_static", - "valuable", -] - -[[package]] -name = "tracing-futures" -version = "0.2.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "97d095ae15e245a057c8e8451bab9b3ee1e1f68e9ba2b4fbc18d0ac5237835f2" -dependencies = [ - "pin-project", - "tracing", -] - -[[package]] -name = "tracing-log" -version = "0.1.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"a6923477a48e41c1951f1999ef8bb5a3023eb723ceadafe78ffb65dc366761e3" -dependencies = [ - "lazy_static", - "log", - "tracing-core", -] - -[[package]] -name = "tracing-subscriber" -version = "0.3.8" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "74786ce43333fcf51efe947aed9718fbe46d5c7328ec3f1029e818083966d9aa" -dependencies = [ - "ansi_term", - "lazy_static", - "matchers", - "regex", - "sharded-slab", - "smallvec", - "thread_local", - "tracing", - "tracing-core", - "tracing-log", -] - -[[package]] -name = "try-lock" -version = "0.2.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "59547bce71d9c38b83d9c0e92b6066c4253371f15005def0c30d9657f50c7642" - -[[package]] -name = "typenum" -version = "1.15.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "dcf81ac59edc17cc8697ff311e8f5ef2d99fcbd9817b34cec66f90b6c3dfd987" - -[[package]] -name = "unicode-bidi" -version = "0.3.7" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1a01404663e3db436ed2746d9fefef640d868edae3cceb81c3b8d5732fda678f" - -[[package]] -name = "unicode-normalization" -version = "0.1.19" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d54590932941a9e9266f0832deed84ebe1bf2e4c9e4a3554d393d18f5e854bf9" -dependencies = [ - "tinyvec", -] - -[[package]] -name = "unicode-segmentation" -version = "1.9.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7e8820f5d777f6224dc4be3632222971ac30164d4a258d595640799554ebfd99" - -[[package]] -name = "unicode-width" -version = "0.1.9" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3ed742d4ea2bd1176e236172c8429aaf54486e7ac098db29ffe6529e0ce50973" - -[[package]] -name = "unicode-xid" -version = "0.2.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8ccb82d61f80a663efe1f787a51b16b5a51e3314d6ac365b08639f52387b33f3" - -[[package]] -name = 
"untrusted" -version = "0.7.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a156c684c91ea7d62626509bce3cb4e1d9ed5c4d978f7b4352658f96a4c26b4a" - -[[package]] -name = "url" -version = "2.2.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a507c383b2d33b5fc35d1861e77e6b383d158b2da5e14fe51b83dfedf6fd578c" -dependencies = [ - "form_urlencoded", - "idna", - "matches", - "percent-encoding", -] - -[[package]] -name = "valuable" -version = "0.1.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "830b7e5d4d90034032940e4ace0d9a9a057e7a45cd94e6c007832e39edb82f6d" - -[[package]] -name = "vec_map" -version = "0.8.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f1bddf1187be692e79c5ffeab891132dfb0f236ed36a43c7ed39f1165ee20191" - -[[package]] -name = "version_check" -version = "0.9.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "49874b5167b65d7193b8aba1567f5c7d93d001cafc34600cee003eda787e483f" - -[[package]] -name = "walkdir" -version = "2.3.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "808cf2735cd4b6866113f648b791c6adc5714537bc222d9347bb203386ffda56" -dependencies = [ - "same-file", - "winapi", - "winapi-util", -] - -[[package]] -name = "walkeeper" -version = "0.1.0" -dependencies = [ - "anyhow", - "byteorder", - "bytes", - "clap 3.0.14", - "const_format", - "crc32c", - "daemonize", - "etcd-client", - "fs2", - "hex", - "humantime", - "hyper", - "lazy_static", - "postgres 0.19.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", - "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", - "postgres_ffi", - "regex", - "rust-s3", - "serde", - "serde_json", - "serde_with", - "signal-hook", - "tempfile", - "tokio", - "tokio-postgres 0.7.1 
(git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", - "tracing", - "url", - "walkdir", - "workspace_hack", - "zenith_metrics", - "zenith_utils", -] - -[[package]] -name = "want" -version = "0.3.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1ce8a968cb1cd110d136ff8b819a556d6fb6d919363c61534f6860c7eb172ba0" -dependencies = [ - "log", - "try-lock", -] - -[[package]] -name = "wasi" -version = "0.10.0+wasi-snapshot-preview1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1a143597ca7c7793eff794def352d41792a93c481eb1042423ff7ff72ba2c31f" - -[[package]] -name = "wasi" -version = "0.11.0+wasi-snapshot-preview1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9c8d87e72b64a3b4db28d11ce29237c246188f4f51057d65a7eab63b7987e423" - -[[package]] -name = "wasm-bindgen" -version = "0.2.79" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "25f1af7423d8588a3d840681122e72e6a24ddbcb3f0ec385cac0d12d24256c06" -dependencies = [ - "cfg-if", - "wasm-bindgen-macro", -] - -[[package]] -name = "wasm-bindgen-backend" -version = "0.2.79" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8b21c0df030f5a177f3cba22e9bc4322695ec43e7257d865302900290bcdedca" -dependencies = [ - "bumpalo", - "lazy_static", - "log", - "proc-macro2", - "quote", - "syn", - "wasm-bindgen-shared", -] - -[[package]] -name = "wasm-bindgen-futures" -version = "0.4.29" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2eb6ec270a31b1d3c7e266b999739109abce8b6c87e4b31fcfcd788b65267395" -dependencies = [ - "cfg-if", - "js-sys", - "wasm-bindgen", - "web-sys", -] - -[[package]] -name = "wasm-bindgen-macro" -version = "0.2.79" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2f4203d69e40a52ee523b2529a773d5ffc1dc0071801c87b3d270b471b80ed01" -dependencies = [ - "quote", - 
"wasm-bindgen-macro-support", -] - -[[package]] -name = "wasm-bindgen-macro-support" -version = "0.2.79" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bfa8a30d46208db204854cadbb5d4baf5fcf8071ba5bf48190c3e59937962ebc" -dependencies = [ - "proc-macro2", - "quote", - "syn", - "wasm-bindgen-backend", - "wasm-bindgen-shared", -] - -[[package]] -name = "wasm-bindgen-shared" -version = "0.2.79" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3d958d035c4438e28c70e4321a2911302f10135ce78a9c7834c0cab4123d06a2" - -[[package]] -name = "web-sys" -version = "0.3.56" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "c060b319f29dd25724f09a2ba1418f142f539b2be99fbf4d2d5a8f7330afb8eb" -dependencies = [ - "js-sys", - "wasm-bindgen", -] - -[[package]] -name = "webpki" -version = "0.21.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b8e38c0608262c46d4a56202ebabdeb094cef7e560ca7a226c6bf055188aa4ea" -dependencies = [ - "ring", - "untrusted", -] - -[[package]] -name = "webpki" -version = "0.22.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f095d78192e208183081cc07bc5515ef55216397af48b873e5edcd72637fa1bd" -dependencies = [ - "ring", - "untrusted", -] - -[[package]] -name = "webpki-roots" -version = "0.22.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "552ceb903e957524388c4d3475725ff2c8b7960922063af6ce53c9a43da07449" -dependencies = [ - "webpki 0.22.0", -] - -[[package]] -name = "which" -version = "4.2.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2a5a7e487e921cf220206864a94a89b6c6905bfc19f1057fa26a4cb360e5c1d2" -dependencies = [ - "either", - "lazy_static", - "libc", -] - -[[package]] -name = "wildmatch" -version = "2.1.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"d6c48bd20df7e4ced539c12f570f937c6b4884928a87fee70a479d72f031d4e0" - -[[package]] -name = "winapi" -version = "0.3.9" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5c839a674fcd7a98952e593242ea400abe93992746761e38641405d28b00f419" -dependencies = [ - "winapi-i686-pc-windows-gnu", - "winapi-x86_64-pc-windows-gnu", -] - -[[package]] -name = "winapi-i686-pc-windows-gnu" -version = "0.4.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ac3b87c63620426dd9b991e5ce0329eff545bccbbb34f3be09ff6fb6ab51b7b6" - -[[package]] -name = "winapi-util" -version = "0.1.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "70ec6ce85bb158151cae5e5c87f95a8e97d2c0c4b001223f33a334e3ce5de178" -dependencies = [ - "winapi", -] - -[[package]] -name = "winapi-x86_64-pc-windows-gnu" -version = "0.4.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "712e227841d057c1ee1cd2fb22fa7e5a5461ae8e48fa2ca79ec42cfc1931183f" - -[[package]] -name = "winreg" -version = "0.7.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0120db82e8a1e0b9fb3345a539c478767c0048d842860994d96113d5b667bd69" -dependencies = [ - "winapi", -] - -[[package]] -name = "workspace_hack" -version = "0.1.0" -dependencies = [ - "anyhow", - "bytes", - "cc", - "clap 2.34.0", - "either", - "hashbrown 0.11.2", - "libc", - "log", - "memchr", - "num-integer", - "num-traits", - "proc-macro2", - "quote", - "regex", - "regex-syntax", - "reqwest", - "scopeguard", - "serde", - "syn", - "tokio", - "tracing", - "tracing-core", -] - -[[package]] -name = "xattr" -version = "0.2.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "244c3741f4240ef46274860397c7c74e50eb23624996930e484c16679633a54c" -dependencies = [ - "libc", -] - -[[package]] -name = "xml-rs" -version = "0.8.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"d2d7d3948613f75c98fd9328cfdcc45acc4d360655289d0a7d4ec931392200a3" - -[[package]] -name = "yasna" -version = "0.4.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e262a29d0e61ccf2b6190d7050d4b237535fc76ce4c1210d9caa316f71dffa75" -dependencies = [ - "chrono", -] - -[[package]] -name = "zenith" -version = "0.1.0" -dependencies = [ - "anyhow", - "clap 3.0.14", - "control_plane", - "pageserver", - "postgres 0.19.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", - "postgres_ffi", - "serde_json", - "walkeeper", - "workspace_hack", - "zenith_utils", -] - -[[package]] -name = "zenith_metrics" -version = "0.1.0" -dependencies = [ - "lazy_static", - "libc", - "once_cell", - "prometheus", - "workspace_hack", -] - -[[package]] -name = "zenith_utils" -version = "0.1.0" -dependencies = [ - "anyhow", - "bincode", - "byteorder", - "bytes", - "criterion", - "git-version", - "hex", - "hex-literal", - "hyper", - "jsonwebtoken", - "lazy_static", - "nix", - "pin-project-lite", - "postgres 0.19.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", - "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", - "rand", - "routerify", - "rustls 0.19.1", - "rustls-split", - "serde", - "serde_json", - "serde_with", - "signal-hook", - "tempfile", - "thiserror", - "tokio", - "tracing", - "tracing-subscriber", - "webpki 0.21.4", - "workspace_hack", - "zenith_metrics", -] - -[[package]] -name = "zstd" -version = "0.10.0+zstd.1.5.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3b1365becbe415f3f0fcd024e2f7b45bacfb5bdd055f0dc113571394114e7bdd" -dependencies = [ - "zstd-safe", -] - -[[package]] -name = "zstd-safe" -version = "4.1.4+zstd.1.5.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"2f7cd17c9af1a4d6c24beb1cc54b17e2ef7b593dc92f19e9d9acad8b182bbaee" -dependencies = [ - "libc", - "zstd-sys", -] - -[[package]] -name = "zstd-sys" -version = "1.6.3+zstd.1.5.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "fc49afa5c8d634e75761feda8c592051e7eeb4683ba827211eb0d731d3402ea8" -dependencies = [ - "cc", - "libc", -] diff --git a/pageserver/Cargo.toml b/pageserver/Cargo.toml index 4d79811bfb..dccdca291c 100644 --- a/pageserver/Cargo.toml +++ b/pageserver/Cargo.toml @@ -18,6 +18,7 @@ log = "0.4.14" clap = "3.0" daemonize = "0.4.1" tokio = { version = "1.17", features = ["process", "sync", "macros", "fs", "rt", "io-util", "time"] } +tokio-util = { version = "0.7", features = ["io"] } postgres-types = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } postgres-protocol = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } @@ -34,7 +35,6 @@ serde_with = "1.12.0" toml_edit = { version = "0.13", features = ["easy"] } scopeguard = "1.1.0" -async-trait = "0.1" const_format = "0.2.21" tracing = "0.1.27" tracing-futures = "0.2" @@ -45,7 +45,9 @@ once_cell = "1.8.0" crossbeam-utils = "0.8.5" fail = "0.5.0" -rust-s3 = { version = "0.28", default-features = false, features = ["no-verify-ssl", "tokio-rustls-tls"] } +rusoto_core = "0.47" +rusoto_s3 = "0.47" +async-trait = "0.1" async-compression = {version = "0.3", features = ["zstd", "tokio"]} postgres_ffi = { path = "../postgres_ffi" } diff --git a/pageserver/src/remote_storage.rs b/pageserver/src/remote_storage.rs index bdd6086b94..02d37af5de 100644 --- a/pageserver/src/remote_storage.rs +++ b/pageserver/src/remote_storage.rs @@ -5,7 +5,7 @@ //! There are a few components the storage machinery consists of: //! 
* [`RemoteStorage`] trait a CRUD-like generic abstraction to use for adapting external storages with a few implementations: //! * [`local_fs`] allows to use local file system as an external storage -//! * [`rust_s3`] uses AWS S3 bucket as an external storage +//! * [`s3_bucket`] uses AWS S3 bucket as an external storage //! //! * synchronization logic at [`storage_sync`] module that keeps pageserver state (both runtime one and the workdir files) and storage state in sync. //! Synchronization internals are split into submodules @@ -82,7 +82,7 @@ //! The sync queue processing also happens in batches, so the sync tasks can wait in the queue for some time. mod local_fs; -mod rust_s3; +mod s3_bucket; mod storage_sync; use std::{ @@ -98,7 +98,7 @@ use zenith_utils::zid::{ZTenantId, ZTenantTimelineId, ZTimelineId}; pub use self::storage_sync::index::{RemoteIndex, TimelineIndexEntry}; pub use self::storage_sync::{schedule_timeline_checkpoint_upload, schedule_timeline_download}; -use self::{local_fs::LocalFs, rust_s3::S3}; +use self::{local_fs::LocalFs, s3_bucket::S3Bucket}; use crate::layered_repository::ephemeral_file::is_ephemeral_file; use crate::{ config::{PageServerConf, RemoteStorageKind}, @@ -151,7 +151,7 @@ pub fn start_local_timeline_sync( storage_sync::spawn_storage_sync_thread( config, local_timeline_files, - S3::new(s3_config, &config.workdir)?, + S3Bucket::new(s3_config, &config.workdir)?, storage_config.max_concurrent_sync, storage_config.max_sync_errors, ) diff --git a/pageserver/src/remote_storage/README.md b/pageserver/src/remote_storage/README.md index 339ddce866..43a47e09d8 100644 --- a/pageserver/src/remote_storage/README.md +++ b/pageserver/src/remote_storage/README.md @@ -46,18 +46,6 @@ This could be avoided by a background thread/future storing the serialized index No file checksum assertion is done currently, but should be (AWS S3 returns file checksums during the `list` operation) -* sad rust-s3 api - -rust-s3 is not very pleasant to use: -1. 
it returns `anyhow::Result` and it's hard to distinguish "missing file" cases from "no connection" one, for instance -2. at least one function it its API that we need (`get_object_stream`) has `async` keyword and blocks (!), see details [here](https://github.com/zenithdb/zenith/pull/752#discussion_r728373091) -3. it's a prerelease library with unclear maintenance status -4. noisy on debug level - -But it's already used in the project, so for now it's reused to avoid bloating the dependency tree. -Based on previous evaluation, even `rusoto-s3` could be a better choice over this library, but needs further benchmarking. - - * gc is ignored So far, we don't adjust the remote storage based on GC thread loop results, only checkpointer loop affects the remote storage. diff --git a/pageserver/src/remote_storage/rust_s3.rs b/pageserver/src/remote_storage/s3_bucket.rs similarity index 68% rename from pageserver/src/remote_storage/rust_s3.rs rename to pageserver/src/remote_storage/s3_bucket.rs index 527bdf48ff..92b3b0cce8 100644 --- a/pageserver/src/remote_storage/rust_s3.rs +++ b/pageserver/src/remote_storage/s3_bucket.rs @@ -1,4 +1,4 @@ -//! AWS S3 storage wrapper around `rust_s3` library. +//! AWS S3 storage wrapper around `rusoto` library. //! //! Respects `prefix_in_bucket` property from [`S3Config`], //! 
allowing multiple pageservers to independently work with the same S3 bucket, if @@ -7,9 +7,17 @@ use std::path::{Path, PathBuf}; use anyhow::Context; -use s3::{bucket::Bucket, creds::Credentials, region::Region}; -use tokio::io::{self, AsyncWriteExt}; -use tracing::debug; +use rusoto_core::{ + credential::{InstanceMetadataProvider, StaticProvider}, + HttpClient, Region, +}; +use rusoto_s3::{ + DeleteObjectRequest, GetObjectRequest, ListObjectsV2Request, PutObjectRequest, S3Client, + StreamingBody, S3, +}; +use tokio::io; +use tokio_util::io::ReaderStream; +use tracing::{debug, trace}; use crate::{ config::S3Config, @@ -50,38 +58,50 @@ impl S3ObjectKey { } /// AWS S3 storage. -pub struct S3 { +pub struct S3Bucket { pageserver_workdir: &'static Path, - bucket: Bucket, + client: S3Client, + bucket_name: String, prefix_in_bucket: Option<String>, } -impl S3 { - /// Creates the storage, errors if incorrect AWS S3 configuration provided. +impl S3Bucket { + /// Creates the S3 storage, errors if incorrect AWS S3 configuration provided. pub fn new(aws_config: &S3Config, pageserver_workdir: &'static Path) -> anyhow::Result<Self> { + // TODO kb check this + // Keeping a single client may cause issues due to timeouts. 
+ // https://github.com/rusoto/rusoto/issues/1686 + debug!( - "Creating s3 remote storage around bucket {}", + "Creating s3 remote storage for S3 bucket {}", aws_config.bucket_name ); let region = match aws_config.endpoint.clone() { - Some(endpoint) => Region::Custom { - endpoint, - region: aws_config.bucket_region.clone(), + Some(custom_endpoint) => Region::Custom { + name: aws_config.bucket_region.clone(), + endpoint: custom_endpoint, }, None => aws_config .bucket_region .parse::<Region>() .context("Failed to parse the s3 region from config")?, }; - - let credentials = Credentials::new( - aws_config.access_key_id.as_deref(), - aws_config.secret_access_key.as_deref(), - None, - None, - None, - ) - .context("Failed to create the s3 credentials")?; + let request_dispatcher = HttpClient::new().context("Failed to create S3 http client")?; + let client = if aws_config.access_key_id.is_none() && aws_config.secret_access_key.is_none() + { + trace!("Using IAM-based AWS access"); + S3Client::new_with(request_dispatcher, InstanceMetadataProvider::new(), region) + } else { + trace!("Using credentials-based AWS access"); + S3Client::new_with( + request_dispatcher, + StaticProvider::new_minimal( + aws_config.access_key_id.clone().unwrap_or_default(), + aws_config.secret_access_key.clone().unwrap_or_default(), + ), + region, + ) + }; let prefix_in_bucket = aws_config.prefix_in_bucket.as_deref().map(|prefix| { let mut prefix = prefix; @@ -97,20 +117,16 @@ impl S3 { }); Ok(Self { - bucket: Bucket::new_with_path_style( - aws_config.bucket_name.as_str(), - region, - credentials, - ) - .context("Failed to create the s3 bucket")?, + client, pageserver_workdir, + bucket_name: aws_config.bucket_name.clone(), prefix_in_bucket, }) } } #[async_trait::async_trait] -impl RemoteStorage for S3 { +impl RemoteStorage for S3Bucket { type StoragePath = S3ObjectKey; fn storage_path(&self, local_path: &Path) -> anyhow::Result<Self::StoragePath> { @@ -129,48 +145,50 @@ impl RemoteStorage for S3 { } 
async fn list(&self) -> 
anyhow::Result<Vec<Self::StoragePath>> { - let list_response = self - .bucket - .list(self.prefix_in_bucket.clone().unwrap_or_default(), None) - .await - .context("Failed to list s3 objects")?; + let mut document_keys = Vec::new(); - Ok(list_response - .into_iter() - .flat_map(|response| response.contents) - .map(|s3_object| S3ObjectKey(s3_object.key)) - .collect()) + let mut continuation_token = None; + loop { + let fetch_response = self + .client + .list_objects_v2(ListObjectsV2Request { + bucket: self.bucket_name.clone(), + prefix: self.prefix_in_bucket.clone(), + continuation_token, + ..ListObjectsV2Request::default() + }) + .await?; + document_keys.extend( + fetch_response + .contents + .unwrap_or_default() + .into_iter() + .filter_map(|o| Some(S3ObjectKey(o.key?))), + ); + + match fetch_response.next_continuation_token { + Some(new_token) => continuation_token = Some(new_token), + None => break, + } + } + + Ok(document_keys) } async fn upload( &self, - mut from: impl io::AsyncRead + Unpin + Send + Sync + 'static, + from: impl io::AsyncRead + Unpin + Send + Sync + 'static, to: &Self::StoragePath, ) -> anyhow::Result<()> { - let mut upload_contents = io::BufWriter::new(std::io::Cursor::new(Vec::new())); - io::copy(&mut from, &mut upload_contents) - .await - .context("Failed to read the upload contents")?; - upload_contents - .flush() - .await - .context("Failed to read the upload contents")?; - let upload_contents = upload_contents.into_inner().into_inner(); - - let (_, code) = self - .bucket - .put_object(to.key(), &upload_contents) - .await - .with_context(|| format!("Failed to create s3 object with key {}", to.key()))?; - if code != 200 { - Err(anyhow::format_err!( - "Received non-200 exit code during creating object with key '{}', code: {}", - to.key(), - code - )) - } else { - Ok(()) - } + self.client + .put_object(PutObjectRequest { + body: Some(StreamingBody::new(ReaderStream::new(from))), + bucket: self.bucket_name.clone(), + key: 
to.key().to_owned(), + 
..PutObjectRequest::default() + }) + .await?; + Ok(()) } async fn download( @@ -178,25 +196,21 @@ impl RemoteStorage for S3 { from: &Self::StoragePath, to: &mut (impl io::AsyncWrite + Unpin + Send + Sync), ) -> anyhow::Result<()> { - let (data, code) = self - .bucket - .get_object(from.key()) - .await - .with_context(|| format!("Failed to download s3 object with key {}", from.key()))?; - if code != 200 { - Err(anyhow::format_err!( - "Received non-200 exit code during downloading object, code: {}", - code - )) - } else { - // we don't have to write vector into the destination this way, `to_write_all` would be enough. - // but we want to prepare for migration on `rusoto`, that has a streaming HTTP body instead here, with - // which it makes more sense to use `io::copy`. - io::copy(&mut data.as_slice(), to) - .await - .context("Failed to write downloaded data into the destination buffer")?; - Ok(()) + let object_output = self + .client + .get_object(GetObjectRequest { + bucket: self.bucket_name.clone(), + key: from.key().to_owned(), + ..GetObjectRequest::default() + }) + .await?; + + if let Some(body) = object_output.body { + let mut from = io::BufReader::new(body.into_async_read()); + io::copy(&mut from, to).await?; } + + Ok(()) } async fn download_range( @@ -209,40 +223,37 @@ impl RemoteStorage for S3 { // S3 accepts ranges as https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35 // and needs both ends to be exclusive let end_inclusive = end_exclusive.map(|end| end.saturating_sub(1)); - let (data, code) = self - .bucket - .get_object_range(from.key(), start_inclusive, end_inclusive) - .await - .with_context(|| format!("Failed to download s3 object with key {}", from.key()))?; - if code != 206 { - Err(anyhow::format_err!( - "Received non-206 exit code during downloading object range, code: {}", - code - )) - } else { - // see `download` function above for the comment on why `Vec` buffer is copied this way - io::copy(&mut data.as_slice(), to) - .await - 
.context("Failed to write downloaded range into the destination buffer")?; - Ok(()) + let range = Some(match end_inclusive { + Some(end_inclusive) => format!("bytes={}-{}", start_inclusive, end_inclusive), + None => format!("bytes={}-", start_inclusive), + }); + let object_output = self + .client + .get_object(GetObjectRequest { + bucket: self.bucket_name.clone(), + key: from.key().to_owned(), + range, + ..GetObjectRequest::default() + }) + .await?; + + if let Some(body) = object_output.body { + let mut from = io::BufReader::new(body.into_async_read()); + io::copy(&mut from, to).await?; } + + Ok(()) } async fn delete(&self, path: &Self::StoragePath) -> anyhow::Result<()> { - let (_, code) = self - .bucket - .delete_object(path.key()) - .await - .with_context(|| format!("Failed to delete s3 object with key {}", path.key()))?; - if code != 204 { - Err(anyhow::format_err!( - "Received non-204 exit code during deleting object with key '{}', code: {}", - path.key(), - code - )) - } else { - Ok(()) - } + self.client + .delete_object(DeleteObjectRequest { + bucket: self.bucket_name.clone(), + key: path.key().to_owned(), + ..DeleteObjectRequest::default() + }) + .await?; + Ok(()) } } @@ -314,7 +325,7 @@ mod tests { #[test] fn storage_path_negatives() -> anyhow::Result<()> { #[track_caller] - fn storage_path_error(storage: &S3, mismatching_path: &Path) -> String { + fn storage_path_error(storage: &S3Bucket, mismatching_path: &Path) -> String { match storage.storage_path(mismatching_path) { Ok(wrong_key) => panic!( "Expected path '{}' to error, but got S3 key: {:?}", @@ -412,15 +423,11 @@ mod tests { Ok(()) } - fn dummy_storage(pageserver_workdir: &'static Path) -> S3 { - S3 { + fn dummy_storage(pageserver_workdir: &'static Path) -> S3Bucket { + S3Bucket { pageserver_workdir, - bucket: Bucket::new( - "dummy-bucket", - "us-east-1".parse().unwrap(), - Credentials::anonymous().unwrap(), - ) - .unwrap(), + client: S3Client::new("us-east-1".parse().unwrap()), + bucket_name: 
"dummy-bucket".to_string(), prefix_in_bucket: Some("dummy_prefix/".to_string()), } } From 0e9ee772af7406e943565a1985ef5c9117ad470c Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Mon, 28 Mar 2022 15:18:01 +0300 Subject: [PATCH 073/296] Use rusoto in safekeeper --- Cargo.lock | 3503 +++++++++++++++++++++++++++++++++++ walkeeper/Cargo.toml | 6 +- walkeeper/src/s3_offload.rs | 102 +- 3 files changed, 3573 insertions(+), 38 deletions(-) create mode 100644 Cargo.lock diff --git a/Cargo.lock b/Cargo.lock new file mode 100644 index 0000000000..1a9e261281 --- /dev/null +++ b/Cargo.lock @@ -0,0 +1,3503 @@ +# This file is automatically @generated by Cargo. +# It is not intended for manual editing. +version = 3 + +[[package]] +name = "addr2line" +version = "0.17.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b9ecd88a8c8378ca913a680cd98f0f13ac67383d35993f86c90a70e3f137816b" +dependencies = [ + "gimli", +] + +[[package]] +name = "adler" +version = "1.0.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f26201604c87b1e01bd3d98f8d5d9a8fcbb815e8cedb41ffccbeb4bf593a35fe" + +[[package]] +name = "ahash" +version = "0.7.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "fcb51a0695d8f838b1ee009b3fbf66bda078cd64590202a864a8f3e8c4315c47" +dependencies = [ + "getrandom", + "once_cell", + "version_check", +] + +[[package]] +name = "aho-corasick" +version = "0.7.18" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1e37cfd5e7657ada45f742d6e99ca5788580b5c529dc78faf11ece6dc702656f" +dependencies = [ + "memchr", +] + +[[package]] +name = "ansi_term" +version = "0.12.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d52a9bb7ec0cf484c551830a7ce27bd20d67eac647e1befb56b0be4ee39a55d2" +dependencies = [ + "winapi", +] + +[[package]] +name = "anyhow" +version = "1.0.53" +source = "registry+https://github.com/rust-lang/crates.io-index" 
+checksum = "94a45b455c14666b85fc40a019e8ab9eb75e3a124e05494f5397122bc9eb06e0" +dependencies = [ + "backtrace", +] + +[[package]] +name = "async-compression" +version = "0.3.12" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f2bf394cfbbe876f0ac67b13b6ca819f9c9f2fb9ec67223cceb1555fbab1c31a" +dependencies = [ + "futures-core", + "memchr", + "pin-project-lite", + "tokio", + "zstd", + "zstd-safe", +] + +[[package]] +name = "async-stream" +version = "0.3.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "dad5c83079eae9969be7fadefe640a1c566901f05ff91ab221de4b6f68d9507e" +dependencies = [ + "async-stream-impl", + "futures-core", +] + +[[package]] +name = "async-stream-impl" +version = "0.3.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "10f203db73a71dfa2fb6dd22763990fa26f3d2625a6da2da900d23b87d26be27" +dependencies = [ + "proc-macro2", + "quote", + "syn", +] + +[[package]] +name = "async-trait" +version = "0.1.52" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "061a7acccaa286c011ddc30970520b98fa40e00c9d644633fb26b5fc63a265e3" +dependencies = [ + "proc-macro2", + "quote", + "syn", +] + +[[package]] +name = "atty" +version = "0.2.14" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d9b39be18770d11421cdb1b9947a45dd3f37e93092cbf377614828a319d5fee8" +dependencies = [ + "hermit-abi", + "libc", + "winapi", +] + +[[package]] +name = "autocfg" +version = "1.1.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d468802bab17cbc0cc575e9b053f41e72aa36bfa6b7f55e3529ffa43161b97fa" + +[[package]] +name = "backtrace" +version = "0.3.64" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5e121dee8023ce33ab248d9ce1493df03c3b38a659b240096fcbd7048ff9c31f" +dependencies = [ + "addr2line", + "cc", + "cfg-if", + "libc", + "miniz_oxide", + "object", + "rustc-demangle", +] + 
+[[package]] +name = "base64" +version = "0.12.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "3441f0f7b02788e948e47f457ca01f1d7e6d92c693bc132c22b087d3141c03ff" + +[[package]] +name = "base64" +version = "0.13.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "904dfeac50f3cdaba28fc6f57fdcddb75f49ed61346676a78c4ffe55877802fd" + +[[package]] +name = "bincode" +version = "1.3.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b1f45e9417d87227c7a56d22e471c6206462cba514c7590c09aff4cf6d1ddcad" +dependencies = [ + "serde", +] + +[[package]] +name = "bindgen" +version = "0.59.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2bd2a9a458e8f4304c52c43ebb0cfbd520289f8379a52e329a38afda99bf8eb8" +dependencies = [ + "bitflags", + "cexpr", + "clang-sys", + "clap 2.34.0", + "env_logger", + "lazy_static", + "lazycell", + "log", + "peeking_take_while", + "proc-macro2", + "quote", + "regex", + "rustc-hash", + "shlex", + "which", +] + +[[package]] +name = "bitflags" +version = "1.3.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "bef38d45163c2f1dde094a7dfd33ccf595c92905c8f8f4fdc18d06fb1037718a" + +[[package]] +name = "block-buffer" +version = "0.9.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "4152116fd6e9dadb291ae18fc1ec3575ed6d84c29642d97890f4b4a3417297e4" +dependencies = [ + "generic-array", +] + +[[package]] +name = "boxfnonce" +version = "0.1.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5988cb1d626264ac94100be357308f29ff7cbdd3b36bda27f450a4ee3f713426" + +[[package]] +name = "bstr" +version = "0.2.17" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ba3569f383e8f1598449f1a423e72e99569137b47740b1da11ef19af3d5c3223" +dependencies = [ + "lazy_static", + "memchr", + "regex-automata", + "serde", +] + +[[package]] +name = 
"bumpalo" +version = "3.9.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a4a45a46ab1f2412e53d3a0ade76ffad2025804294569aae387231a0cd6e0899" + +[[package]] +name = "byteorder" +version = "1.4.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "14c189c53d098945499cdfa7ecc63567cf3886b3332b312a5b4585d8d3a6a610" + +[[package]] +name = "bytes" +version = "1.1.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "c4872d67bab6358e59559027aa3b9157c53d9358c51423c17554809a8858e0f8" +dependencies = [ + "serde", +] + +[[package]] +name = "cast" +version = "0.2.7" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "4c24dab4283a142afa2fdca129b80ad2c6284e073930f964c3a1293c225ee39a" +dependencies = [ + "rustc_version", +] + +[[package]] +name = "cc" +version = "1.0.72" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "22a9137b95ea06864e018375b72adfb7db6e6f68cfc8df5a04d00288050485ee" +dependencies = [ + "jobserver", +] + +[[package]] +name = "cexpr" +version = "0.6.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6fac387a98bb7c37292057cffc56d62ecb629900026402633ae9160df93a8766" +dependencies = [ + "nom", +] + +[[package]] +name = "cfg-if" +version = "1.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "baf1de4339761588bc0619e3cbc0120ee582ebb74b53b4efbf79117bd2da40fd" + +[[package]] +name = "chrono" +version = "0.4.19" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "670ad68c9088c2a963aaa298cb369688cf3f9465ce5e2d4ca10e6e0098a1ce73" +dependencies = [ + "libc", + "num-integer", + "num-traits", + "serde", + "time", + "winapi", +] + +[[package]] +name = "clang-sys" +version = "1.3.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "4cc00842eed744b858222c4c9faf7243aafc6d33f92f96935263ef4d8a41ce21" +dependencies = [ + 
"glob", + "libc", + "libloading", +] + +[[package]] +name = "clap" +version = "2.34.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a0610544180c38b88101fecf2dd634b174a62eef6946f84dfc6a7127512b381c" +dependencies = [ + "ansi_term", + "atty", + "bitflags", + "strsim 0.8.0", + "textwrap 0.11.0", + "unicode-width", + "vec_map", +] + +[[package]] +name = "clap" +version = "3.0.14" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b63edc3f163b3c71ec8aa23f9bd6070f77edbf3d1d198b164afa90ff00e4ec62" +dependencies = [ + "atty", + "bitflags", + "indexmap", + "os_str_bytes", + "strsim 0.10.0", + "termcolor", + "textwrap 0.14.2", +] + +[[package]] +name = "combine" +version = "4.6.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "50b727aacc797f9fc28e355d21f34709ac4fc9adecfe470ad07b8f4464f53062" +dependencies = [ + "bytes", + "memchr", +] + +[[package]] +name = "compute_tools" +version = "0.1.0" +dependencies = [ + "anyhow", + "chrono", + "clap 3.0.14", + "env_logger", + "hyper", + "libc", + "log", + "postgres 0.19.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858)", + "regex", + "serde", + "serde_json", + "tar", + "tokio", + "workspace_hack", +] + +[[package]] +name = "const_format" +version = "0.2.22" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "22bc6cd49b0ec407b680c3e380182b6ac63b73991cb7602de350352fc309b614" +dependencies = [ + "const_format_proc_macros", +] + +[[package]] +name = "const_format_proc_macros" +version = "0.2.22" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ef196d5d972878a48da7decb7686eded338b4858fbabeed513d63a7c98b2b82d" +dependencies = [ + "proc-macro2", + "quote", + "unicode-xid", +] + +[[package]] +name = "control_plane" +version = "0.1.0" +dependencies = [ + "anyhow", + "lazy_static", + "nix", + "pageserver", + "postgres 0.19.1 
(git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + "regex", + "reqwest", + "serde", + "serde_with", + "tar", + "thiserror", + "toml", + "url", + "walkeeper", + "workspace_hack", + "zenith_utils", +] + +[[package]] +name = "core-foundation" +version = "0.9.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "194a7a9e6de53fa55116934067c844d9d749312f75c6f6d0980e8c252f8c2146" +dependencies = [ + "core-foundation-sys", + "libc", +] + +[[package]] +name = "core-foundation-sys" +version = "0.8.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5827cebf4670468b8772dd191856768aedcb1b0278a04f989f7766351917b9dc" + +[[package]] +name = "cpufeatures" +version = "0.2.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "95059428f66df56b63431fdb4e1947ed2190586af5c5a8a8b71122bdf5a7f469" +dependencies = [ + "libc", +] + +[[package]] +name = "crc32c" +version = "0.6.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ee6b9c9389584bcba988bd0836086789b7f87ad91892d6a83d5291dbb24524b5" +dependencies = [ + "rustc_version", +] + +[[package]] +name = "crc32fast" +version = "1.3.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b540bd8bc810d3885c6ea91e2018302f68baba2129ab3e88f32389ee9370880d" +dependencies = [ + "cfg-if", +] + +[[package]] +name = "criterion" +version = "0.3.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1604dafd25fba2fe2d5895a9da139f8dc9b319a5fe5354ca137cbbce4e178d10" +dependencies = [ + "atty", + "cast", + "clap 2.34.0", + "criterion-plot", + "csv", + "itertools", + "lazy_static", + "num-traits", + "oorandom", + "plotters", + "rayon", + "regex", + "serde", + "serde_cbor", + "serde_derive", + "serde_json", + "tinytemplate", + "walkdir", +] + +[[package]] +name = "criterion-plot" +version = "0.4.4" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "d00996de9f2f7559f7f4dc286073197f83e92256a59ed395f9aac01fe717da57" +dependencies = [ + "cast", + "itertools", +] + +[[package]] +name = "crossbeam-channel" +version = "0.5.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5aaa7bd5fb665c6864b5f963dd9097905c54125909c7aa94c9e18507cdbe6c53" +dependencies = [ + "cfg-if", + "crossbeam-utils", +] + +[[package]] +name = "crossbeam-deque" +version = "0.8.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6455c0ca19f0d2fbf751b908d5c55c1f5cbc65e03c4225427254b46890bdde1e" +dependencies = [ + "cfg-if", + "crossbeam-epoch", + "crossbeam-utils", +] + +[[package]] +name = "crossbeam-epoch" +version = "0.9.8" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1145cf131a2c6ba0615079ab6a638f7e1973ac9c2634fcbeaaad6114246efe8c" +dependencies = [ + "autocfg", + "cfg-if", + "crossbeam-utils", + "lazy_static", + "memoffset", + "scopeguard", +] + +[[package]] +name = "crossbeam-utils" +version = "0.8.7" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b5e5bed1f1c269533fa816a0a5492b3545209a205ca1a54842be180eb63a16a6" +dependencies = [ + "cfg-if", + "lazy_static", +] + +[[package]] +name = "crypto-mac" +version = "0.10.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "bff07008ec701e8028e2ceb8f83f0e4274ee62bd2dbdc4fefff2e9a91824081a" +dependencies = [ + "generic-array", + "subtle", +] + +[[package]] +name = "crypto-mac" +version = "0.11.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b1d1a86f49236c215f271d40892d5fc950490551400b02ef360692c29815c714" +dependencies = [ + "generic-array", + "subtle", +] + +[[package]] +name = "csv" +version = "1.1.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "22813a6dc45b335f9bade10bf7271dc477e81113e89eb251a0bc2a8a81c536e1" 
+dependencies = [ + "bstr", + "csv-core", + "itoa 0.4.8", + "ryu", + "serde", +] + +[[package]] +name = "csv-core" +version = "0.1.10" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2b2466559f260f48ad25fe6317b3c8dac77b5bdb5763ac7d9d6103530663bc90" +dependencies = [ + "memchr", +] + +[[package]] +name = "daemonize" +version = "0.4.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "70c24513e34f53b640819f0ac9f705b673fcf4006d7aab8778bee72ebfc89815" +dependencies = [ + "boxfnonce", + "libc", +] + +[[package]] +name = "darling" +version = "0.13.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d0d720b8683f8dd83c65155f0530560cba68cd2bf395f6513a483caee57ff7f4" +dependencies = [ + "darling_core", + "darling_macro", +] + +[[package]] +name = "darling_core" +version = "0.13.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7a340f241d2ceed1deb47ae36c4144b2707ec7dd0b649f894cb39bb595986324" +dependencies = [ + "fnv", + "ident_case", + "proc-macro2", + "quote", + "strsim 0.10.0", + "syn", +] + +[[package]] +name = "darling_macro" +version = "0.13.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "72c41b3b7352feb3211a0d743dc5700a4e3b60f51bd2b368892d1e0f9a95f44b" +dependencies = [ + "darling_core", + "quote", + "syn", +] + +[[package]] +name = "digest" +version = "0.9.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d3dd60d1080a57a05ab032377049e0591415d2b31afd7028356dbf3cc6dcb066" +dependencies = [ + "generic-array", +] + +[[package]] +name = "dirs-next" +version = "2.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b98cf8ebf19c3d1b223e151f99a4f9f0690dca41414773390fc824184ac833e1" +dependencies = [ + "cfg-if", + "dirs-sys-next", +] + +[[package]] +name = "dirs-sys-next" +version = "0.1.2" +source = "registry+https://github.com/rust-lang/crates.io-index" 
+checksum = "4ebda144c4fe02d1f7ea1a7d9641b6fc6b580adcfa024ae48797ecdeb6825b4d" +dependencies = [ + "libc", + "redox_users", + "winapi", +] + +[[package]] +name = "either" +version = "1.6.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e78d4f1cc4ae33bbfc157ed5d5a5ef3bc29227303d595861deb238fcec4e9457" + +[[package]] +name = "encoding_rs" +version = "0.8.30" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7896dc8abb250ffdda33912550faa54c88ec8b998dec0b2c55ab224921ce11df" +dependencies = [ + "cfg-if", +] + +[[package]] +name = "env_logger" +version = "0.9.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0b2cf0344971ee6c64c31be0d530793fba457d322dfec2810c453d0ef228f9c3" +dependencies = [ + "atty", + "humantime", + "log", + "regex", + "termcolor", +] + +[[package]] +name = "etcd-client" +version = "0.8.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "585de5039d1ecce74773db49ba4e8107e42be7c2cd0b1a9e7fce27181db7b118" +dependencies = [ + "http", + "prost", + "tokio", + "tokio-stream", + "tonic", + "tonic-build", + "tower-service", +] + +[[package]] +name = "fail" +version = "0.5.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ec3245a0ca564e7f3c797d20d833a6870f57a728ac967d5225b3ffdef4465011" +dependencies = [ + "lazy_static", + "log", + "rand", +] + +[[package]] +name = "fallible-iterator" +version = "0.2.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "4443176a9f2c162692bd3d352d745ef9413eec5782a80d8fd6f8a1ac692a07f7" + +[[package]] +name = "fastrand" +version = "1.7.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "c3fcf0cee53519c866c09b5de1f6c56ff9d647101f81c1964fa632e148896cdf" +dependencies = [ + "instant", +] + +[[package]] +name = "filetime" +version = "0.2.15" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"975ccf83d8d9d0d84682850a38c8169027be83368805971cc4f238c2b245bc98" +dependencies = [ + "cfg-if", + "libc", + "redox_syscall", + "winapi", +] + +[[package]] +name = "fixedbitset" +version = "0.4.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "279fb028e20b3c4c320317955b77c5e0c9701f05a1d309905d6fc702cdc5053e" + +[[package]] +name = "fnv" +version = "1.0.7" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "3f9eec918d3f24069decb9af1554cad7c880e2da24a9afd88aca000531ab82c1" + +[[package]] +name = "foreign-types" +version = "0.3.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f6f339eb8adc052cd2ca78910fda869aefa38d22d5cb648e6485e4d3fc06f3b1" +dependencies = [ + "foreign-types-shared", +] + +[[package]] +name = "foreign-types-shared" +version = "0.1.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "00b0228411908ca8685dba7fc2cdd70ec9990a6e753e89b6ac91a84c40fbaf4b" + +[[package]] +name = "form_urlencoded" +version = "1.0.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5fc25a87fa4fd2094bffb06925852034d90a17f0d1e05197d4956d3555752191" +dependencies = [ + "matches", + "percent-encoding", +] + +[[package]] +name = "fs2" +version = "0.4.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9564fc758e15025b46aa6643b1b77d047d1a56a1aea6e01002ac0c7026876213" +dependencies = [ + "libc", + "winapi", +] + +[[package]] +name = "futures" +version = "0.3.21" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f73fe65f54d1e12b726f517d3e2135ca3125a437b6d998caf1962961f7172d9e" +dependencies = [ + "futures-channel", + "futures-core", + "futures-executor", + "futures-io", + "futures-sink", + "futures-task", + "futures-util", +] + +[[package]] +name = "futures-channel" +version = "0.3.21" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"c3083ce4b914124575708913bca19bfe887522d6e2e6d0952943f5eac4a74010" +dependencies = [ + "futures-core", + "futures-sink", +] + +[[package]] +name = "futures-core" +version = "0.3.21" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0c09fd04b7e4073ac7156a9539b57a484a8ea920f79c7c675d05d289ab6110d3" + +[[package]] +name = "futures-executor" +version = "0.3.21" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9420b90cfa29e327d0429f19be13e7ddb68fa1cccb09d65e5706b8c7a749b8a6" +dependencies = [ + "futures-core", + "futures-task", + "futures-util", +] + +[[package]] +name = "futures-io" +version = "0.3.21" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "fc4045962a5a5e935ee2fdedaa4e08284547402885ab326734432bed5d12966b" + +[[package]] +name = "futures-macro" +version = "0.3.21" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "33c1e13800337f4d4d7a316bf45a567dbcb6ffe087f16424852d97e97a91f512" +dependencies = [ + "proc-macro2", + "quote", + "syn", +] + +[[package]] +name = "futures-sink" +version = "0.3.21" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "21163e139fa306126e6eedaf49ecdb4588f939600f0b1e770f4205ee4b7fa868" + +[[package]] +name = "futures-task" +version = "0.3.21" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "57c66a976bf5909d801bbef33416c41372779507e7a6b3a5e25e4749c58f776a" + +[[package]] +name = "futures-util" +version = "0.3.21" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d8b7abd5d659d9b90c8cba917f6ec750a74e2dc23902ef9cd4cc8c8b22e6036a" +dependencies = [ + "futures-channel", + "futures-core", + "futures-io", + "futures-macro", + "futures-sink", + "futures-task", + "memchr", + "pin-project-lite", + "pin-utils", + "slab", +] + +[[package]] +name = "generic-array" +version = "0.14.5" +source = "registry+https://github.com/rust-lang/crates.io-index" 
+checksum = "fd48d33ec7f05fbfa152300fdad764757cbded343c1aa1cff2fbaf4134851803" +dependencies = [ + "typenum", + "version_check", +] + +[[package]] +name = "getrandom" +version = "0.2.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "418d37c8b1d42553c93648be529cb70f920d3baf8ef469b74b9638df426e0b4c" +dependencies = [ + "cfg-if", + "libc", + "wasi 0.10.0+wasi-snapshot-preview1", +] + +[[package]] +name = "gimli" +version = "0.26.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "78cc372d058dcf6d5ecd98510e7fbc9e5aec4d21de70f65fea8fecebcd881bd4" + +[[package]] +name = "git-version" +version = "0.3.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f6b0decc02f4636b9ccad390dcbe77b722a77efedfa393caf8379a51d5c61899" +dependencies = [ + "git-version-macro", + "proc-macro-hack", +] + +[[package]] +name = "git-version-macro" +version = "0.3.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "fe69f1cbdb6e28af2bac214e943b99ce8a0a06b447d15d3e61161b0423139f3f" +dependencies = [ + "proc-macro-hack", + "proc-macro2", + "quote", + "syn", +] + +[[package]] +name = "glob" +version = "0.3.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9b919933a397b79c37e33b77bb2aa3dc8eb6e165ad809e58ff75bc7db2e34574" + +[[package]] +name = "h2" +version = "0.3.11" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d9f1f717ddc7b2ba36df7e871fd88db79326551d3d6f1fc406fbfd28b582ff8e" +dependencies = [ + "bytes", + "fnv", + "futures-core", + "futures-sink", + "futures-util", + "http", + "indexmap", + "slab", + "tokio", + "tokio-util 0.6.9", + "tracing", +] + +[[package]] +name = "half" +version = "1.8.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "eabb4a44450da02c90444cf74558da904edde8fb4e9035a9a6a4e15445af0bd7" + +[[package]] +name = "hashbrown" +version = "0.11.2" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "ab5ef0d4909ef3724cc8cce6ccc8572c5c817592e9285f5464f8e86f8bd3726e" +dependencies = [ + "ahash", +] + +[[package]] +name = "heck" +version = "0.3.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6d621efb26863f0e9924c6ac577e8275e5e6b77455db64ffa6c65c904e9e132c" +dependencies = [ + "unicode-segmentation", +] + +[[package]] +name = "hermit-abi" +version = "0.1.19" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "62b467343b94ba476dcb2500d242dadbb39557df889310ac77c5d99100aaac33" +dependencies = [ + "libc", +] + +[[package]] +name = "hex" +version = "0.4.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7f24254aa9a54b5c858eaee2f5bccdb46aaf0e486a595ed5fd8f86ba55232a70" +dependencies = [ + "serde", +] + +[[package]] +name = "hex-literal" +version = "0.3.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7ebdb29d2ea9ed0083cd8cece49bbd968021bd99b0849edb4a9a7ee0fdf6a4e0" + +[[package]] +name = "hmac" +version = "0.10.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "c1441c6b1e930e2817404b5046f1f989899143a12bf92de603b69f4e0aee1e15" +dependencies = [ + "crypto-mac 0.10.1", + "digest", +] + +[[package]] +name = "hmac" +version = "0.11.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2a2a2320eb7ec0ebe8da8f744d7812d9fc4cb4d09344ac01898dbcb6a20ae69b" +dependencies = [ + "crypto-mac 0.11.1", + "digest", +] + +[[package]] +name = "http" +version = "0.2.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "31f4c6746584866f0feabcc69893c5b51beef3831656a968ed7ae254cdc4fd03" +dependencies = [ + "bytes", + "fnv", + "itoa 1.0.1", +] + +[[package]] +name = "http-body" +version = "0.4.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"1ff4f84919677303da5f147645dbea6b1881f368d03ac84e1dc09031ebd7b2c6" +dependencies = [ + "bytes", + "http", + "pin-project-lite", +] + +[[package]] +name = "httparse" +version = "1.6.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9100414882e15fb7feccb4897e5f0ff0ff1ca7d1a86a23208ada4d7a18e6c6c4" + +[[package]] +name = "httpdate" +version = "1.0.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "c4a1e36c821dbe04574f602848a19f742f4fb3c98d40449f11bcad18d6b17421" + +[[package]] +name = "humantime" +version = "2.1.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9a3a5bfb195931eeb336b2a7b4d761daec841b97f947d34394601737a7bba5e4" + +[[package]] +name = "hyper" +version = "0.14.17" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "043f0e083e9901b6cc658a77d1eb86f4fc650bbb977a4337dd63192826aa85dd" +dependencies = [ + "bytes", + "futures-channel", + "futures-core", + "futures-util", + "h2", + "http", + "http-body", + "httparse", + "httpdate", + "itoa 1.0.1", + "pin-project-lite", + "socket2", + "tokio", + "tower-service", + "tracing", + "want", +] + +[[package]] +name = "hyper-rustls" +version = "0.23.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d87c48c02e0dc5e3b849a2041db3029fd066650f8f717c07bf8ed78ccb895cac" +dependencies = [ + "http", + "hyper", + "rustls 0.20.2", + "tokio", + "tokio-rustls 0.23.2", +] + +[[package]] +name = "hyper-timeout" +version = "0.4.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "bbb958482e8c7be4bc3cf272a766a2b0bf1a6755e7a6ae777f017a31d11b13b1" +dependencies = [ + "hyper", + "pin-project-lite", + "tokio", + "tokio-io-timeout", +] + +[[package]] +name = "hyper-tls" +version = "0.5.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d6183ddfa99b85da61a140bea0efc93fdf56ceaa041b37d553518030827f9905" +dependencies = [ + "bytes", + 
"hyper", + "native-tls", + "tokio", + "tokio-native-tls", +] + +[[package]] +name = "ident_case" +version = "1.0.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b9e0384b61958566e926dc50660321d12159025e767c18e043daf26b70104c39" + +[[package]] +name = "idna" +version = "0.2.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "418a0a6fab821475f634efe3ccc45c013f742efe03d853e8d3355d5cb850ecf8" +dependencies = [ + "matches", + "unicode-bidi", + "unicode-normalization", +] + +[[package]] +name = "indexmap" +version = "1.8.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "282a6247722caba404c065016bbfa522806e51714c34f5dfc3e4a3a46fcb4223" +dependencies = [ + "autocfg", + "hashbrown", +] + +[[package]] +name = "instant" +version = "0.1.12" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7a5bbe824c507c5da5956355e86a746d82e0e1464f65d862cc5e71da70e94b2c" +dependencies = [ + "cfg-if", +] + +[[package]] +name = "ipnet" +version = "2.3.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "68f2d64f2edebec4ce84ad108148e67e1064789bee435edc5b60ad398714a3a9" + +[[package]] +name = "itertools" +version = "0.10.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a9a9d19fa1e79b6215ff29b9d6880b706147f16e9b1dbb1e4e5947b5b02bc5e3" +dependencies = [ + "either", +] + +[[package]] +name = "itoa" +version = "0.4.8" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b71991ff56294aa922b450139ee08b3bfc70982c6b2c7562771375cf73542dd4" + +[[package]] +name = "itoa" +version = "1.0.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1aab8fc367588b89dcee83ab0fd66b72b50b72fa1904d7095045ace2b0c81c35" + +[[package]] +name = "jobserver" +version = "0.1.24" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"af25a77299a7f711a01975c35a6a424eb6862092cc2d6c72c4ed6cbc56dfc1fa" +dependencies = [ + "libc", +] + +[[package]] +name = "js-sys" +version = "0.3.56" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a38fc24e30fd564ce974c02bf1d337caddff65be6cc4735a1f7eab22a7440f04" +dependencies = [ + "wasm-bindgen", +] + +[[package]] +name = "jsonwebtoken" +version = "7.2.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "afabcc15e437a6484fc4f12d0fd63068fe457bf93f1c148d3d9649c60b103f32" +dependencies = [ + "base64 0.12.3", + "pem 0.8.3", + "ring", + "serde", + "serde_json", + "simple_asn1", +] + +[[package]] +name = "kstring" +version = "1.0.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8b310ccceade8121d7d77fee406160e457c2f4e7c7982d589da3499bc7ea4526" +dependencies = [ + "serde", +] + +[[package]] +name = "lazy_static" +version = "1.4.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e2abad23fbc42b3700f2f279844dc832adb2b2eb069b2df918f455c4e18cc646" + +[[package]] +name = "lazycell" +version = "1.3.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "830d08ce1d1d941e6b30645f1a0eb5643013d835ce3779a5fc208261dbe10f55" + +[[package]] +name = "libc" +version = "0.2.117" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e74d72e0f9b65b5b4ca49a346af3976df0f9c61d550727f349ecd559f251a26c" + +[[package]] +name = "libloading" +version = "0.7.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "efbc0f03f9a775e9f6aed295c6a1ba2253c5757a9e03d55c6caa46a681abcddd" +dependencies = [ + "cfg-if", + "winapi", +] + +[[package]] +name = "lock_api" +version = "0.4.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "88943dd7ef4a2e5a4bfa2753aaab3013e34ce2533d1996fb18ef591e315e2b3b" +dependencies = [ + "scopeguard", +] + +[[package]] +name = "log" +version = "0.4.14" 
+source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "51b9bbe6c47d51fc3e1a9b945965946b4c44142ab8792c50835a980d362c2710" +dependencies = [ + "cfg-if", + "serde", +] + +[[package]] +name = "matchers" +version = "0.1.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8263075bb86c5a1b1427b5ae862e8889656f126e9f77c484496e8b47cf5c5558" +dependencies = [ + "regex-automata", +] + +[[package]] +name = "matches" +version = "0.1.9" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a3e378b66a060d48947b590737b30a1be76706c8dd7b8ba0f2fe3989c68a853f" + +[[package]] +name = "md-5" +version = "0.9.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7b5a279bb9607f9f53c22d496eade00d138d1bdcccd07d74650387cf94942a15" +dependencies = [ + "block-buffer", + "digest", + "opaque-debug", +] + +[[package]] +name = "md5" +version = "0.7.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "490cc448043f947bae3cbee9c203358d62dbee0db12107a74be5c30ccfd09771" + +[[package]] +name = "memchr" +version = "2.4.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "308cc39be01b73d0d18f82a0e7b2a3df85245f84af96fdddc5d202d27e47b86a" + +[[package]] +name = "memoffset" +version = "0.6.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5aa361d4faea93603064a027415f07bd8e1d5c88c9fbf68bf56a285428fd79ce" +dependencies = [ + "autocfg", +] + +[[package]] +name = "mime" +version = "0.3.16" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2a60c7ce501c71e03a9c9c0d35b861413ae925bd979cc7a4e30d060069aaac8d" + +[[package]] +name = "minimal-lexical" +version = "0.2.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "68354c5c6bd36d73ff3feceb05efa59b6acb7626617f4962be322a825e61f79a" + +[[package]] +name = "miniz_oxide" +version = "0.4.4" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "a92518e98c078586bc6c934028adcca4c92a53d6a958196de835170a01d84e4b" +dependencies = [ + "adler", + "autocfg", +] + +[[package]] +name = "mio" +version = "0.8.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "52da4364ffb0e4fe33a9841a98a3f3014fb964045ce4f7a45a398243c8d6b0c9" +dependencies = [ + "libc", + "log", + "miow", + "ntapi", + "wasi 0.11.0+wasi-snapshot-preview1", + "winapi", +] + +[[package]] +name = "miow" +version = "0.3.7" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b9f1c5b025cda876f66ef43a113f91ebc9f4ccef34843000e0adf6ebbab84e21" +dependencies = [ + "winapi", +] + +[[package]] +name = "multimap" +version = "0.8.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e5ce46fe64a9d73be07dcbe690a38ce1b293be448fd8ce1e6c1b8062c9f72c6a" + +[[package]] +name = "native-tls" +version = "0.2.8" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "48ba9f7719b5a0f42f338907614285fb5fd70e53858141f69898a1fb7203b24d" +dependencies = [ + "lazy_static", + "libc", + "log", + "openssl", + "openssl-probe", + "openssl-sys", + "schannel", + "security-framework", + "security-framework-sys", + "tempfile", +] + +[[package]] +name = "nix" +version = "0.23.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9f866317acbd3a240710c63f065ffb1e4fd466259045ccb504130b7f668f35c6" +dependencies = [ + "bitflags", + "cc", + "cfg-if", + "libc", + "memoffset", +] + +[[package]] +name = "nom" +version = "7.1.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1b1d11e1ef389c76fe5b81bcaf2ea32cf88b62bc494e19f493d0b30e7a930109" +dependencies = [ + "memchr", + "minimal-lexical", + "version_check", +] + +[[package]] +name = "ntapi" +version = "0.3.7" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"c28774a7fd2fbb4f0babd8237ce554b73af68021b5f695a3cebd6c59bac0980f" +dependencies = [ + "winapi", +] + +[[package]] +name = "num-bigint" +version = "0.2.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "090c7f9998ee0ff65aa5b723e4009f7b217707f1fb5ea551329cc4d6231fb304" +dependencies = [ + "autocfg", + "num-integer", + "num-traits", +] + +[[package]] +name = "num-integer" +version = "0.1.44" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d2cc698a63b549a70bc047073d2949cce27cd1c7b0a4a862d08a8031bc2801db" +dependencies = [ + "autocfg", + "num-traits", +] + +[[package]] +name = "num-traits" +version = "0.2.14" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9a64b1ec5cda2586e284722486d802acf1f7dbdc623e2bfc57e65ca1cd099290" +dependencies = [ + "autocfg", +] + +[[package]] +name = "num_cpus" +version = "1.13.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "19e64526ebdee182341572e50e9ad03965aa510cd94427a4549448f285e957a1" +dependencies = [ + "hermit-abi", + "libc", +] + +[[package]] +name = "object" +version = "0.27.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "67ac1d3f9a1d3616fd9a60c8d74296f22406a238b6a72f5cc1e6f314df4ffbf9" +dependencies = [ + "memchr", +] + +[[package]] +name = "once_cell" +version = "1.9.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "da32515d9f6e6e489d7bc9d84c71b060db7247dc035bbe44eac88cf87486d8d5" + +[[package]] +name = "oorandom" +version = "11.1.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0ab1bc2a289d34bd04a330323ac98a1b4bc82c9d9fcb1e66b63caa84da26b575" + +[[package]] +name = "opaque-debug" +version = "0.3.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "624a8340c38c1b80fd549087862da4ba43e08858af025b236e509b6649fc13d5" + +[[package]] +name = "openssl" +version = "0.10.38" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "0c7ae222234c30df141154f159066c5093ff73b63204dcda7121eb082fc56a95" +dependencies = [ + "bitflags", + "cfg-if", + "foreign-types", + "libc", + "once_cell", + "openssl-sys", +] + +[[package]] +name = "openssl-probe" +version = "0.1.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ff011a302c396a5197692431fc1948019154afc178baf7d8e37367442a4601cf" + +[[package]] +name = "openssl-sys" +version = "0.9.72" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7e46109c383602735fa0a2e48dd2b7c892b048e1bf69e5c3b1d804b7d9c203cb" +dependencies = [ + "autocfg", + "cc", + "libc", + "pkg-config", + "vcpkg", +] + +[[package]] +name = "os_str_bytes" +version = "6.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8e22443d1643a904602595ba1cd8f7d896afe56d26712531c5ff73a15b2fbf64" +dependencies = [ + "memchr", +] + +[[package]] +name = "pageserver" +version = "0.1.0" +dependencies = [ + "anyhow", + "async-compression", + "async-trait", + "byteorder", + "bytes", + "chrono", + "clap 3.0.14", + "const_format", + "crc32c", + "crossbeam-utils", + "daemonize", + "fail", + "futures", + "hex", + "hex-literal", + "humantime", + "hyper", + "itertools", + "lazy_static", + "log", + "nix", + "once_cell", + "postgres 0.19.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + "postgres-types 0.2.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + "postgres_ffi", + "rand", + "regex", + "rusoto_core", + "rusoto_s3", + "scopeguard", + "serde", + "serde_json", + "serde_with", + "signal-hook", + "tar", + "tempfile", + "thiserror", + "tokio", + "tokio-postgres 0.7.1 
(git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + "tokio-stream", + "tokio-util 0.7.0", + "toml_edit", + "tracing", + "tracing-futures", + "url", + "workspace_hack", + "zenith_metrics", + "zenith_utils", +] + +[[package]] +name = "parking_lot" +version = "0.11.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7d17b78036a60663b797adeaee46f5c9dfebb86948d1255007a1d6be0271ff99" +dependencies = [ + "instant", + "lock_api", + "parking_lot_core", +] + +[[package]] +name = "parking_lot_core" +version = "0.8.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d76e8e1493bcac0d2766c42737f34458f1c8c50c0d23bcb24ea953affb273216" +dependencies = [ + "cfg-if", + "instant", + "libc", + "redox_syscall", + "smallvec", + "winapi", +] + +[[package]] +name = "peeking_take_while" +version = "0.1.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "19b17cddbe7ec3f8bc800887bab5e717348c95ea2ca0b1bf0837fb964dc67099" + +[[package]] +name = "pem" +version = "0.8.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "fd56cbd21fea48d0c440b41cd69c589faacade08c992d9a54e471b79d0fd13eb" +dependencies = [ + "base64 0.13.0", + "once_cell", + "regex", +] + +[[package]] +name = "pem" +version = "1.0.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e9a3b09a20e374558580a4914d3b7d89bd61b954a5a5e1dcbea98753addb1947" +dependencies = [ + "base64 0.13.0", +] + +[[package]] +name = "percent-encoding" +version = "2.1.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d4fd5641d01c8f18a23da7b6fe29298ff4b55afcccdf78973b24cf3175fee32e" + +[[package]] +name = "petgraph" +version = "0.6.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "4a13a2fa9d0b63e5f22328828741e523766fff0ee9e779316902290dff3f824f" +dependencies = [ + "fixedbitset", + "indexmap", +] + 
+[[package]] +name = "phf" +version = "0.8.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "3dfb61232e34fcb633f43d12c58f83c1df82962dcdfa565a4e866ffc17dafe12" +dependencies = [ + "phf_shared", +] + +[[package]] +name = "phf_shared" +version = "0.8.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "c00cf8b9eafe68dde5e9eaa2cef8ee84a9336a47d566ec55ca16589633b65af7" +dependencies = [ + "siphasher", +] + +[[package]] +name = "pin-project" +version = "1.0.10" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "58ad3879ad3baf4e44784bc6a718a8698867bb991f8ce24d1bcbe2cfb4c3a75e" +dependencies = [ + "pin-project-internal", +] + +[[package]] +name = "pin-project-internal" +version = "1.0.10" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "744b6f092ba29c3650faf274db506afd39944f48420f6c86b17cfe0ee1cb36bb" +dependencies = [ + "proc-macro2", + "quote", + "syn", +] + +[[package]] +name = "pin-project-lite" +version = "0.2.8" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e280fbe77cc62c91527259e9442153f4688736748d24660126286329742b4c6c" + +[[package]] +name = "pin-utils" +version = "0.1.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8b870d8c151b6f2fb93e84a13146138f05d02ed11c7e7c54f8826aaaf7c9f184" + +[[package]] +name = "pkg-config" +version = "0.3.24" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "58893f751c9b0412871a09abd62ecd2a00298c6c83befa223ef98c52aef40cbe" + +[[package]] +name = "plotters" +version = "0.3.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "32a3fd9ec30b9749ce28cd91f255d569591cdf937fe280c312143e3c4bad6f2a" +dependencies = [ + "num-traits", + "plotters-backend", + "plotters-svg", + "wasm-bindgen", + "web-sys", +] + +[[package]] +name = "plotters-backend" +version = "0.3.2" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "d88417318da0eaf0fdcdb51a0ee6c3bed624333bff8f946733049380be67ac1c" + +[[package]] +name = "plotters-svg" +version = "0.3.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "521fa9638fa597e1dc53e9412a4f9cefb01187ee1f7413076f9e6749e2885ba9" +dependencies = [ + "plotters-backend", +] + +[[package]] +name = "postgres" +version = "0.19.1" +source = "git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7#2949d98df52587d562986aad155dd4e889e408b7" +dependencies = [ + "bytes", + "fallible-iterator", + "futures", + "log", + "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + "tokio", + "tokio-postgres 0.7.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", +] + +[[package]] +name = "postgres" +version = "0.19.1" +source = "git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858#9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858" +dependencies = [ + "bytes", + "fallible-iterator", + "futures", + "log", + "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858)", + "tokio", + "tokio-postgres 0.7.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858)", +] + +[[package]] +name = "postgres-protocol" +version = "0.6.1" +source = "git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7#2949d98df52587d562986aad155dd4e889e408b7" +dependencies = [ + "base64 0.13.0", + "byteorder", + "bytes", + "fallible-iterator", + "hmac 0.10.1", + "lazy_static", + "md-5", + "memchr", + "rand", + "sha2", + "stringprep", +] + +[[package]] +name = "postgres-protocol" +version = "0.6.1" +source = 
"git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858#9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858" +dependencies = [ + "base64 0.13.0", + "byteorder", + "bytes", + "fallible-iterator", + "hmac 0.10.1", + "lazy_static", + "md-5", + "memchr", + "rand", + "sha2", + "stringprep", +] + +[[package]] +name = "postgres-types" +version = "0.2.1" +source = "git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7#2949d98df52587d562986aad155dd4e889e408b7" +dependencies = [ + "bytes", + "fallible-iterator", + "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", +] + +[[package]] +name = "postgres-types" +version = "0.2.1" +source = "git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858#9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858" +dependencies = [ + "bytes", + "fallible-iterator", + "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858)", +] + +[[package]] +name = "postgres_ffi" +version = "0.1.0" +dependencies = [ + "anyhow", + "bindgen", + "byteorder", + "bytes", + "chrono", + "crc32c", + "hex", + "lazy_static", + "log", + "memoffset", + "rand", + "regex", + "serde", + "thiserror", + "workspace_hack", + "zenith_utils", +] + +[[package]] +name = "ppv-lite86" +version = "0.2.16" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "eb9f9e6e233e5c4a35559a617bf40a4ec447db2e84c20b55a6f83167b7e57872" + +[[package]] +name = "proc-macro-hack" +version = "0.5.19" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "dbf0c48bc1d91375ae5c3cd81e3722dff1abcf81a30960240640d223f59fe0e5" + +[[package]] +name = "proc-macro2" +version = "1.0.36" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "c7342d5883fbccae1cc37a2353b09c87c9b0f3afd73f5fb9bba687a1f733b029" 
+dependencies = [ + "unicode-xid", +] + +[[package]] +name = "prometheus" +version = "0.13.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b7f64969ffd5dd8f39bd57a68ac53c163a095ed9d0fb707146da1b27025a3504" +dependencies = [ + "cfg-if", + "fnv", + "lazy_static", + "memchr", + "parking_lot", + "thiserror", +] + +[[package]] +name = "prost" +version = "0.9.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "444879275cb4fd84958b1a1d5420d15e6fcf7c235fe47f053c9c2a80aceb6001" +dependencies = [ + "bytes", + "prost-derive", +] + +[[package]] +name = "prost-build" +version = "0.9.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "62941722fb675d463659e49c4f3fe1fe792ff24fe5bbaa9c08cd3b98a1c354f5" +dependencies = [ + "bytes", + "heck", + "itertools", + "lazy_static", + "log", + "multimap", + "petgraph", + "prost", + "prost-types", + "regex", + "tempfile", + "which", +] + +[[package]] +name = "prost-derive" +version = "0.9.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f9cc1a3263e07e0bf68e96268f37665207b49560d98739662cdfaae215c720fe" +dependencies = [ + "anyhow", + "itertools", + "proc-macro2", + "quote", + "syn", +] + +[[package]] +name = "prost-types" +version = "0.9.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "534b7a0e836e3c482d2693070f982e39e7611da9695d4d1f5a4b186b51faef0a" +dependencies = [ + "bytes", + "prost", +] + +[[package]] +name = "proxy" +version = "0.1.0" +dependencies = [ + "anyhow", + "bytes", + "clap 3.0.14", + "fail", + "futures", + "hashbrown", + "hex", + "hyper", + "lazy_static", + "md5", + "parking_lot", + "pin-project-lite", + "rand", + "rcgen", + "reqwest", + "rustls 0.19.1", + "scopeguard", + "serde", + "serde_json", + "socket2", + "thiserror", + "tokio", + "tokio-postgres 0.7.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + 
"tokio-postgres-rustls", + "tokio-rustls 0.22.0", + "workspace_hack", + "zenith_metrics", + "zenith_utils", +] + +[[package]] +name = "quote" +version = "1.0.15" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "864d3e96a899863136fc6e99f3d7cae289dafe43bf2c5ac19b70df7210c0a145" +dependencies = [ + "proc-macro2", +] + +[[package]] +name = "rand" +version = "0.8.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2e7573632e6454cf6b99d7aac4ccca54be06da05aca2ef7423d22d27d4d4bcd8" +dependencies = [ + "libc", + "rand_chacha", + "rand_core", + "rand_hc", +] + +[[package]] +name = "rand_chacha" +version = "0.3.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e6c10a63a0fa32252be49d21e7709d4d4baf8d231c2dbce1eaa8141b9b127d88" +dependencies = [ + "ppv-lite86", + "rand_core", +] + +[[package]] +name = "rand_core" +version = "0.6.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d34f1408f55294453790c48b2f1ebbb1c5b4b7563eb1f418bcfcfdbb06ebb4e7" +dependencies = [ + "getrandom", +] + +[[package]] +name = "rand_hc" +version = "0.3.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d51e9f596de227fda2ea6c84607f5558e196eeaf43c986b724ba4fb8fdf497e7" +dependencies = [ + "rand_core", +] + +[[package]] +name = "rayon" +version = "1.5.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "c06aca804d41dbc8ba42dfd964f0d01334eceb64314b9ecf7c5fad5188a06d90" +dependencies = [ + "autocfg", + "crossbeam-deque", + "either", + "rayon-core", +] + +[[package]] +name = "rayon-core" +version = "1.9.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d78120e2c850279833f1dd3582f730c4ab53ed95aeaaaa862a2a5c71b1656d8e" +dependencies = [ + "crossbeam-channel", + "crossbeam-deque", + "crossbeam-utils", + "lazy_static", + "num_cpus", +] + +[[package]] +name = "rcgen" +version = "0.8.14" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "5911d1403f4143c9d56a702069d593e8d0f3fab880a85e103604d0893ea31ba7" +dependencies = [ + "chrono", + "pem 1.0.2", + "ring", + "yasna", +] + +[[package]] +name = "redox_syscall" +version = "0.2.10" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8383f39639269cde97d255a32bdb68c047337295414940c68bdd30c2e13203ff" +dependencies = [ + "bitflags", +] + +[[package]] +name = "redox_users" +version = "0.4.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "528532f3d801c87aec9def2add9ca802fe569e44a544afe633765267840abe64" +dependencies = [ + "getrandom", + "redox_syscall", +] + +[[package]] +name = "regex" +version = "1.5.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d07a8629359eb56f1e2fb1652bb04212c072a87ba68546a04065d525673ac461" +dependencies = [ + "aho-corasick", + "memchr", + "regex-syntax", +] + +[[package]] +name = "regex-automata" +version = "0.1.10" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6c230d73fb8d8c1b9c0b3135c5142a8acee3a0558fb8db5cf1cb65f8d7862132" +dependencies = [ + "regex-syntax", +] + +[[package]] +name = "regex-syntax" +version = "0.6.25" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f497285884f3fcff424ffc933e56d7cbca511def0c9831a7f9b5f6153e3cc89b" + +[[package]] +name = "remove_dir_all" +version = "0.5.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "3acd125665422973a33ac9d3dd2df85edad0f4ae9b00dafb1a05e43a9f5ef8e7" +dependencies = [ + "winapi", +] + +[[package]] +name = "reqwest" +version = "0.11.9" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "87f242f1488a539a79bac6dbe7c8609ae43b7914b7736210f239a37cccb32525" +dependencies = [ + "base64 0.13.0", + "bytes", + "encoding_rs", + "futures-core", + "futures-util", + "h2", + "http", + "http-body", + "hyper", + 
"hyper-rustls", + "ipnet", + "js-sys", + "lazy_static", + "log", + "mime", + "percent-encoding", + "pin-project-lite", + "rustls 0.20.2", + "rustls-pemfile", + "serde", + "serde_json", + "serde_urlencoded", + "tokio", + "tokio-rustls 0.23.2", + "tokio-util 0.6.9", + "url", + "wasm-bindgen", + "wasm-bindgen-futures", + "web-sys", + "webpki-roots", + "winreg", +] + +[[package]] +name = "ring" +version = "0.16.20" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "3053cf52e236a3ed746dfc745aa9cacf1b791d846bdaf412f60a8d7d6e17c8fc" +dependencies = [ + "cc", + "libc", + "once_cell", + "spin", + "untrusted", + "web-sys", + "winapi", +] + +[[package]] +name = "routerify" +version = "3.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "496c1d3718081c45ba9c31fbfc07417900aa96f4070ff90dc29961836b7a9945" +dependencies = [ + "http", + "hyper", + "lazy_static", + "percent-encoding", + "regex", +] + +[[package]] +name = "rusoto_core" +version = "0.47.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5b4f000e8934c1b4f70adde180056812e7ea6b1a247952db8ee98c94cd3116cc" +dependencies = [ + "async-trait", + "base64 0.13.0", + "bytes", + "crc32fast", + "futures", + "http", + "hyper", + "hyper-tls", + "lazy_static", + "log", + "rusoto_credential", + "rusoto_signature", + "rustc_version", + "serde", + "serde_json", + "tokio", + "xml-rs", +] + +[[package]] +name = "rusoto_credential" +version = "0.47.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6a46b67db7bb66f5541e44db22b0a02fed59c9603e146db3a9e633272d3bac2f" +dependencies = [ + "async-trait", + "chrono", + "dirs-next", + "futures", + "hyper", + "serde", + "serde_json", + "shlex", + "tokio", + "zeroize", +] + +[[package]] +name = "rusoto_s3" +version = "0.47.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "048c2fe811a823ad5a9acc976e8bf4f1d910df719dcf44b15c3e96c5b7a51027" 
+dependencies = [ + "async-trait", + "bytes", + "futures", + "rusoto_core", + "xml-rs", +] + +[[package]] +name = "rusoto_signature" +version = "0.47.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6264e93384b90a747758bcc82079711eacf2e755c3a8b5091687b5349d870bcc" +dependencies = [ + "base64 0.13.0", + "bytes", + "chrono", + "digest", + "futures", + "hex", + "hmac 0.11.0", + "http", + "hyper", + "log", + "md-5", + "percent-encoding", + "pin-project-lite", + "rusoto_credential", + "rustc_version", + "serde", + "sha2", + "tokio", +] + +[[package]] +name = "rustc-demangle" +version = "0.1.21" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7ef03e0a2b150c7a90d01faf6254c9c48a41e95fb2a8c2ac1c6f0d2b9aefc342" + +[[package]] +name = "rustc-hash" +version = "1.1.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "08d43f7aa6b08d49f382cde6a7982047c3426db949b1424bc4b7ec9ae12c6ce2" + +[[package]] +name = "rustc_version" +version = "0.4.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "bfa0f585226d2e68097d4f95d113b15b83a82e819ab25717ec0590d9584ef366" +dependencies = [ + "semver", +] + +[[package]] +name = "rustls" +version = "0.19.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "35edb675feee39aec9c99fa5ff985081995a06d594114ae14cbe797ad7b7a6d7" +dependencies = [ + "base64 0.13.0", + "log", + "ring", + "sct 0.6.1", + "webpki 0.21.4", +] + +[[package]] +name = "rustls" +version = "0.20.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d37e5e2290f3e040b594b1a9e04377c2c671f1a1cfd9bfdef82106ac1c113f84" +dependencies = [ + "log", + "ring", + "sct 0.7.0", + "webpki 0.22.0", +] + +[[package]] +name = "rustls-pemfile" +version = "0.2.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5eebeaeb360c87bfb72e84abdb3447159c0eaececf1bef2aecd65a8be949d1c9" +dependencies = [ + 
"base64 0.13.0", +] + +[[package]] +name = "rustls-split" +version = "0.2.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7fb079b52cfdb005752b7c3c646048e702003576a8321058e4c8b38227c11aa6" +dependencies = [ + "rustls 0.19.1", +] + +[[package]] +name = "rustversion" +version = "1.0.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f2cc38e8fa666e2de3c4aba7edeb5ffc5246c1c2ed0e3d17e560aeeba736b23f" + +[[package]] +name = "ryu" +version = "1.0.9" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "73b4b750c782965c211b42f022f59af1fbceabdd026623714f104152f1ec149f" + +[[package]] +name = "same-file" +version = "1.0.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "93fc1dc3aaa9bfed95e02e6eadabb4baf7e3078b0bd1b4d7b6b0b68378900502" +dependencies = [ + "winapi-util", +] + +[[package]] +name = "schannel" +version = "0.1.19" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8f05ba609c234e60bee0d547fe94a4c7e9da733d1c962cf6e59efa4cd9c8bc75" +dependencies = [ + "lazy_static", + "winapi", +] + +[[package]] +name = "scopeguard" +version = "1.1.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d29ab0c6d3fc0ee92fe66e2d99f700eab17a8d57d1c1d3b748380fb20baa78cd" + +[[package]] +name = "sct" +version = "0.6.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b362b83898e0e69f38515b82ee15aa80636befe47c3b6d3d89a911e78fc228ce" +dependencies = [ + "ring", + "untrusted", +] + +[[package]] +name = "sct" +version = "0.7.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d53dcdb7c9f8158937a7981b48accfd39a43af418591a5d008c7b22b5e1b7ca4" +dependencies = [ + "ring", + "untrusted", +] + +[[package]] +name = "security-framework" +version = "2.6.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"2dc14f172faf8a0194a3aded622712b0de276821addc574fa54fc0a1167e10dc" +dependencies = [ + "bitflags", + "core-foundation", + "core-foundation-sys", + "libc", + "security-framework-sys", +] + +[[package]] +name = "security-framework-sys" +version = "2.6.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0160a13a177a45bfb43ce71c01580998474f556ad854dcbca936dd2841a5c556" +dependencies = [ + "core-foundation-sys", + "libc", +] + +[[package]] +name = "semver" +version = "1.0.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0486718e92ec9a68fbed73bb5ef687d71103b142595b406835649bebd33f72c7" + +[[package]] +name = "serde" +version = "1.0.136" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ce31e24b01e1e524df96f1c2fdd054405f8d7376249a5110886fb4b658484789" +dependencies = [ + "serde_derive", +] + +[[package]] +name = "serde_cbor" +version = "0.11.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2bef2ebfde456fb76bbcf9f59315333decc4fda0b2b44b420243c11e0f5ec1f5" +dependencies = [ + "half", + "serde", +] + +[[package]] +name = "serde_derive" +version = "1.0.136" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "08597e7152fcd306f41838ed3e37be9eaeed2b61c42e2117266a554fab4662f9" +dependencies = [ + "proc-macro2", + "quote", + "syn", +] + +[[package]] +name = "serde_json" +version = "1.0.78" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d23c1ba4cf0efd44be32017709280b32d1cea5c3f1275c3b6d9e8bc54f758085" +dependencies = [ + "itoa 1.0.1", + "ryu", + "serde", +] + +[[package]] +name = "serde_urlencoded" +version = "0.7.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d3491c14715ca2294c4d6a88f15e84739788c1d030eed8c110436aafdaa2f3fd" +dependencies = [ + "form_urlencoded", + "itoa 1.0.1", + "ryu", + "serde", +] + +[[package]] +name = "serde_with" +version = "1.12.0" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "ec1e6ec4d8950e5b1e894eac0d360742f3b1407a6078a604a731c4b3f49cefbc" +dependencies = [ + "rustversion", + "serde", + "serde_with_macros", +] + +[[package]] +name = "serde_with_macros" +version = "1.5.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "12e47be9471c72889ebafb5e14d5ff930d89ae7a67bbdb5f8abb564f845a927e" +dependencies = [ + "darling", + "proc-macro2", + "quote", + "syn", +] + +[[package]] +name = "sha2" +version = "0.9.9" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "4d58a1e1bf39749807d89cf2d98ac2dfa0ff1cb3faa38fbb64dd88ac8013d800" +dependencies = [ + "block-buffer", + "cfg-if", + "cpufeatures", + "digest", + "opaque-debug", +] + +[[package]] +name = "sharded-slab" +version = "0.1.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "900fba806f70c630b0a382d0d825e17a0f19fcd059a2ade1ff237bcddf446b31" +dependencies = [ + "lazy_static", +] + +[[package]] +name = "shlex" +version = "1.1.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "43b2853a4d09f215c24cc5489c992ce46052d359b5109343cbafbf26bc62f8a3" + +[[package]] +name = "signal-hook" +version = "0.3.13" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "647c97df271007dcea485bb74ffdb57f2e683f1306c854f468a0c244badabf2d" +dependencies = [ + "libc", + "signal-hook-registry", +] + +[[package]] +name = "signal-hook-registry" +version = "1.4.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e51e73328dc4ac0c7ccbda3a494dfa03df1de2f46018127f60c693f2648455b0" +dependencies = [ + "libc", +] + +[[package]] +name = "simple_asn1" +version = "0.4.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "692ca13de57ce0613a363c8c2f1de925adebc81b04c923ac60c5488bb44abe4b" +dependencies = [ + "chrono", + "num-bigint", + "num-traits", +] + +[[package]] +name 
= "siphasher" +version = "0.3.9" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a86232ab60fa71287d7f2ddae4a7073f6b7aac33631c3015abb556f08c6d0a3e" + +[[package]] +name = "slab" +version = "0.4.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9def91fd1e018fe007022791f865d0ccc9b3a0d5001e01aabb8b40e46000afb5" + +[[package]] +name = "smallvec" +version = "1.8.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f2dd574626839106c320a323308629dcb1acfc96e32a8cba364ddc61ac23ee83" + +[[package]] +name = "socket2" +version = "0.4.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "66d72b759436ae32898a2af0a14218dbf55efde3feeb170eb623637db85ee1e0" +dependencies = [ + "libc", + "winapi", +] + +[[package]] +name = "spin" +version = "0.5.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6e63cff320ae2c57904679ba7cb63280a3dc4613885beafb148ee7bf9aa9042d" + +[[package]] +name = "stringprep" +version = "0.1.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8ee348cb74b87454fff4b551cbf727025810a004f88aeacae7f85b87f4e9a1c1" +dependencies = [ + "unicode-bidi", + "unicode-normalization", +] + +[[package]] +name = "strsim" +version = "0.8.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8ea5119cdb4c55b55d432abb513a0429384878c15dde60cc77b1c99de1a95a6a" + +[[package]] +name = "strsim" +version = "0.10.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "73473c0e59e6d5812c5dfe2a064a6444949f089e20eec9a2e5506596494e4623" + +[[package]] +name = "subtle" +version = "2.4.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6bdef32e8150c2a081110b42772ffe7d7c9032b606bc226c8260fd97e0976601" + +[[package]] +name = "syn" +version = "1.0.86" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"8a65b3f4ffa0092e9887669db0eae07941f023991ab58ea44da8fe8e2d511c6b" +dependencies = [ + "proc-macro2", + "quote", + "unicode-xid", +] + +[[package]] +name = "tar" +version = "0.4.38" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "4b55807c0344e1e6c04d7c965f5289c39a8d94ae23ed5c0b57aabac549f871c6" +dependencies = [ + "filetime", + "libc", + "xattr", +] + +[[package]] +name = "tempfile" +version = "3.3.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5cdb1ef4eaeeaddc8fbd371e5017057064af0911902ef36b39801f67cc6d79e4" +dependencies = [ + "cfg-if", + "fastrand", + "libc", + "redox_syscall", + "remove_dir_all", + "winapi", +] + +[[package]] +name = "termcolor" +version = "1.1.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2dfed899f0eb03f32ee8c6a0aabdb8a7949659e3466561fc0adf54e26d88c5f4" +dependencies = [ + "winapi-util", +] + +[[package]] +name = "textwrap" +version = "0.11.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d326610f408c7a4eb6f51c37c330e496b08506c9457c9d34287ecc38809fb060" +dependencies = [ + "unicode-width", +] + +[[package]] +name = "textwrap" +version = "0.14.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0066c8d12af8b5acd21e00547c3797fde4e8677254a7ee429176ccebbe93dd80" + +[[package]] +name = "thiserror" +version = "1.0.30" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "854babe52e4df1653706b98fcfc05843010039b406875930a70e4d9644e5c417" +dependencies = [ + "thiserror-impl", +] + +[[package]] +name = "thiserror-impl" +version = "1.0.30" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "aa32fd3f627f367fe16f893e2597ae3c05020f8bba2666a4e6ea73d377e5714b" +dependencies = [ + "proc-macro2", + "quote", + "syn", +] + +[[package]] +name = "thread_local" +version = "1.1.4" +source = "registry+https://github.com/rust-lang/crates.io-index" 
+checksum = "5516c27b78311c50bf42c071425c560ac799b11c30b31f87e3081965fe5e0180" +dependencies = [ + "once_cell", +] + +[[package]] +name = "time" +version = "0.1.44" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6db9e6914ab8b1ae1c260a4ae7a49b6c5611b40328a735b21862567685e73255" +dependencies = [ + "libc", + "wasi 0.10.0+wasi-snapshot-preview1", + "winapi", +] + +[[package]] +name = "tinytemplate" +version = "1.2.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "be4d6b5f19ff7664e8c98d03e2139cb510db9b0a60b55f8e8709b689d939b6bc" +dependencies = [ + "serde", + "serde_json", +] + +[[package]] +name = "tinyvec" +version = "1.5.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2c1c1d5a42b6245520c249549ec267180beaffcc0615401ac8e31853d4b6d8d2" +dependencies = [ + "tinyvec_macros", +] + +[[package]] +name = "tinyvec_macros" +version = "0.1.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "cda74da7e1a664f795bb1f8a87ec406fb89a02522cf6e50620d016add6dbbf5c" + +[[package]] +name = "tokio" +version = "1.17.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2af73ac49756f3f7c01172e34a23e5d0216f6c32333757c2c61feb2bbff5a5ee" +dependencies = [ + "bytes", + "libc", + "memchr", + "mio", + "num_cpus", + "once_cell", + "pin-project-lite", + "signal-hook-registry", + "socket2", + "tokio-macros", + "winapi", +] + +[[package]] +name = "tokio-io-timeout" +version = "1.2.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "30b74022ada614a1b4834de765f9bb43877f910cc8ce4be40e89042c9223a8bf" +dependencies = [ + "pin-project-lite", + "tokio", +] + +[[package]] +name = "tokio-macros" +version = "1.7.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b557f72f448c511a979e2564e55d74e6c4432fc96ff4f6241bc6bded342643b7" +dependencies = [ + "proc-macro2", + "quote", + "syn", +] + +[[package]] 
+name = "tokio-native-tls" +version = "0.3.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f7d995660bd2b7f8c1568414c1126076c13fbb725c40112dc0120b78eb9b717b" +dependencies = [ + "native-tls", + "tokio", +] + +[[package]] +name = "tokio-postgres" +version = "0.7.1" +source = "git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7#2949d98df52587d562986aad155dd4e889e408b7" +dependencies = [ + "async-trait", + "byteorder", + "bytes", + "fallible-iterator", + "futures", + "log", + "parking_lot", + "percent-encoding", + "phf", + "pin-project-lite", + "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + "postgres-types 0.2.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + "socket2", + "tokio", + "tokio-util 0.6.9", +] + +[[package]] +name = "tokio-postgres" +version = "0.7.1" +source = "git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858#9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858" +dependencies = [ + "async-trait", + "byteorder", + "bytes", + "fallible-iterator", + "futures", + "log", + "parking_lot", + "percent-encoding", + "phf", + "pin-project-lite", + "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858)", + "postgres-types 0.2.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858)", + "socket2", + "tokio", + "tokio-util 0.6.9", +] + +[[package]] +name = "tokio-postgres-rustls" +version = "0.8.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7bd8c37d8c23cb6ecdc32fc171bade4e9c7f1be65f693a17afbaad02091a0a19" +dependencies = [ + "futures", + "ring", + "rustls 0.19.1", + "tokio", + "tokio-postgres 0.7.1 
(git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + "tokio-rustls 0.22.0", + "webpki 0.21.4", +] + +[[package]] +name = "tokio-rustls" +version = "0.22.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "bc6844de72e57df1980054b38be3a9f4702aba4858be64dd700181a8a6d0e1b6" +dependencies = [ + "rustls 0.19.1", + "tokio", + "webpki 0.21.4", +] + +[[package]] +name = "tokio-rustls" +version = "0.23.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a27d5f2b839802bd8267fa19b0530f5a08b9c08cd417976be2a65d130fe1c11b" +dependencies = [ + "rustls 0.20.2", + "tokio", + "webpki 0.22.0", +] + +[[package]] +name = "tokio-stream" +version = "0.1.8" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "50145484efff8818b5ccd256697f36863f587da82cf8b409c53adf1e840798e3" +dependencies = [ + "futures-core", + "pin-project-lite", + "tokio", +] + +[[package]] +name = "tokio-util" +version = "0.6.9" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9e99e1983e5d376cd8eb4b66604d2e99e79f5bd988c3055891dcd8c9e2604cc0" +dependencies = [ + "bytes", + "futures-core", + "futures-sink", + "log", + "pin-project-lite", + "tokio", +] + +[[package]] +name = "tokio-util" +version = "0.7.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "64910e1b9c1901aaf5375561e35b9c057d95ff41a44ede043a03e09279eabaf1" +dependencies = [ + "bytes", + "futures-core", + "futures-sink", + "log", + "pin-project-lite", + "tokio", +] + +[[package]] +name = "toml" +version = "0.5.8" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a31142970826733df8241ef35dc040ef98c679ab14d7c3e54d827099b3acecaa" +dependencies = [ + "serde", +] + +[[package]] +name = "toml_edit" +version = "0.13.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"744e9ed5b352340aa47ce033716991b5589e23781acb97cad37d4ea70560f55b" +dependencies = [ + "combine", + "indexmap", + "itertools", + "kstring", + "serde", +] + +[[package]] +name = "tonic" +version = "0.6.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ff08f4649d10a70ffa3522ca559031285d8e421d727ac85c60825761818f5d0a" +dependencies = [ + "async-stream", + "async-trait", + "base64 0.13.0", + "bytes", + "futures-core", + "futures-util", + "h2", + "http", + "http-body", + "hyper", + "hyper-timeout", + "percent-encoding", + "pin-project", + "prost", + "prost-derive", + "tokio", + "tokio-stream", + "tokio-util 0.6.9", + "tower", + "tower-layer", + "tower-service", + "tracing", + "tracing-futures", +] + +[[package]] +name = "tonic-build" +version = "0.6.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9403f1bafde247186684b230dc6f38b5cd514584e8bec1dd32514be4745fa757" +dependencies = [ + "proc-macro2", + "prost-build", + "quote", + "syn", +] + +[[package]] +name = "tower" +version = "0.4.12" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9a89fd63ad6adf737582df5db40d286574513c69a11dac5214dc3b5603d6713e" +dependencies = [ + "futures-core", + "futures-util", + "indexmap", + "pin-project", + "pin-project-lite", + "rand", + "slab", + "tokio", + "tokio-util 0.7.0", + "tower-layer", + "tower-service", + "tracing", +] + +[[package]] +name = "tower-layer" +version = "0.3.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "343bc9466d3fe6b0f960ef45960509f84480bf4fd96f92901afe7ff3df9d3a62" + +[[package]] +name = "tower-service" +version = "0.3.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "360dfd1d6d30e05fda32ace2c8c70e9c0a9da713275777f5a4dbb8a1893930c6" + +[[package]] +name = "tracing" +version = "0.1.30" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"2d8d93354fe2a8e50d5953f5ae2e47a3fc2ef03292e7ea46e3cc38f549525fb9" +dependencies = [ + "cfg-if", + "log", + "pin-project-lite", + "tracing-attributes", + "tracing-core", +] + +[[package]] +name = "tracing-attributes" +version = "0.1.19" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8276d9a4a3a558d7b7ad5303ad50b53d58264641b82914b7ada36bd762e7a716" +dependencies = [ + "proc-macro2", + "quote", + "syn", +] + +[[package]] +name = "tracing-core" +version = "0.1.22" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "03cfcb51380632a72d3111cb8d3447a8d908e577d31beeac006f836383d29a23" +dependencies = [ + "lazy_static", + "valuable", +] + +[[package]] +name = "tracing-futures" +version = "0.2.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "97d095ae15e245a057c8e8451bab9b3ee1e1f68e9ba2b4fbc18d0ac5237835f2" +dependencies = [ + "pin-project", + "tracing", +] + +[[package]] +name = "tracing-log" +version = "0.1.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a6923477a48e41c1951f1999ef8bb5a3023eb723ceadafe78ffb65dc366761e3" +dependencies = [ + "lazy_static", + "log", + "tracing-core", +] + +[[package]] +name = "tracing-subscriber" +version = "0.3.8" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "74786ce43333fcf51efe947aed9718fbe46d5c7328ec3f1029e818083966d9aa" +dependencies = [ + "ansi_term", + "lazy_static", + "matchers", + "regex", + "sharded-slab", + "smallvec", + "thread_local", + "tracing", + "tracing-core", + "tracing-log", +] + +[[package]] +name = "try-lock" +version = "0.2.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "59547bce71d9c38b83d9c0e92b6066c4253371f15005def0c30d9657f50c7642" + +[[package]] +name = "typenum" +version = "1.15.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "dcf81ac59edc17cc8697ff311e8f5ef2d99fcbd9817b34cec66f90b6c3dfd987" + 
+[[package]] +name = "unicode-bidi" +version = "0.3.7" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1a01404663e3db436ed2746d9fefef640d868edae3cceb81c3b8d5732fda678f" + +[[package]] +name = "unicode-normalization" +version = "0.1.19" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d54590932941a9e9266f0832deed84ebe1bf2e4c9e4a3554d393d18f5e854bf9" +dependencies = [ + "tinyvec", +] + +[[package]] +name = "unicode-segmentation" +version = "1.9.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7e8820f5d777f6224dc4be3632222971ac30164d4a258d595640799554ebfd99" + +[[package]] +name = "unicode-width" +version = "0.1.9" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "3ed742d4ea2bd1176e236172c8429aaf54486e7ac098db29ffe6529e0ce50973" + +[[package]] +name = "unicode-xid" +version = "0.2.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8ccb82d61f80a663efe1f787a51b16b5a51e3314d6ac365b08639f52387b33f3" + +[[package]] +name = "untrusted" +version = "0.7.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a156c684c91ea7d62626509bce3cb4e1d9ed5c4d978f7b4352658f96a4c26b4a" + +[[package]] +name = "url" +version = "2.2.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a507c383b2d33b5fc35d1861e77e6b383d158b2da5e14fe51b83dfedf6fd578c" +dependencies = [ + "form_urlencoded", + "idna", + "matches", + "percent-encoding", +] + +[[package]] +name = "valuable" +version = "0.1.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "830b7e5d4d90034032940e4ace0d9a9a057e7a45cd94e6c007832e39edb82f6d" + +[[package]] +name = "vcpkg" +version = "0.2.15" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "accd4ea62f7bb7a82fe23066fb0957d48ef677f6eeb8215f372f52e48bb32426" + +[[package]] +name = "vec_map" +version = "0.8.2" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "f1bddf1187be692e79c5ffeab891132dfb0f236ed36a43c7ed39f1165ee20191" + +[[package]] +name = "version_check" +version = "0.9.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "49874b5167b65d7193b8aba1567f5c7d93d001cafc34600cee003eda787e483f" + +[[package]] +name = "walkdir" +version = "2.3.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "808cf2735cd4b6866113f648b791c6adc5714537bc222d9347bb203386ffda56" +dependencies = [ + "same-file", + "winapi", + "winapi-util", +] + +[[package]] +name = "walkeeper" +version = "0.1.0" +dependencies = [ + "anyhow", + "byteorder", + "bytes", + "clap 3.0.14", + "const_format", + "crc32c", + "daemonize", + "etcd-client", + "fs2", + "hex", + "humantime", + "hyper", + "lazy_static", + "postgres 0.19.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + "postgres_ffi", + "regex", + "rusoto_core", + "rusoto_s3", + "serde", + "serde_json", + "serde_with", + "signal-hook", + "tempfile", + "tokio", + "tokio-postgres 0.7.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + "tokio-util 0.7.0", + "tracing", + "url", + "walkdir", + "workspace_hack", + "zenith_metrics", + "zenith_utils", +] + +[[package]] +name = "want" +version = "0.3.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1ce8a968cb1cd110d136ff8b819a556d6fb6d919363c61534f6860c7eb172ba0" +dependencies = [ + "log", + "try-lock", +] + +[[package]] +name = "wasi" +version = "0.10.0+wasi-snapshot-preview1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1a143597ca7c7793eff794def352d41792a93c481eb1042423ff7ff72ba2c31f" + +[[package]] +name = "wasi" +version = 
"0.11.0+wasi-snapshot-preview1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9c8d87e72b64a3b4db28d11ce29237c246188f4f51057d65a7eab63b7987e423" + +[[package]] +name = "wasm-bindgen" +version = "0.2.79" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "25f1af7423d8588a3d840681122e72e6a24ddbcb3f0ec385cac0d12d24256c06" +dependencies = [ + "cfg-if", + "wasm-bindgen-macro", +] + +[[package]] +name = "wasm-bindgen-backend" +version = "0.2.79" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8b21c0df030f5a177f3cba22e9bc4322695ec43e7257d865302900290bcdedca" +dependencies = [ + "bumpalo", + "lazy_static", + "log", + "proc-macro2", + "quote", + "syn", + "wasm-bindgen-shared", +] + +[[package]] +name = "wasm-bindgen-futures" +version = "0.4.29" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2eb6ec270a31b1d3c7e266b999739109abce8b6c87e4b31fcfcd788b65267395" +dependencies = [ + "cfg-if", + "js-sys", + "wasm-bindgen", + "web-sys", +] + +[[package]] +name = "wasm-bindgen-macro" +version = "0.2.79" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2f4203d69e40a52ee523b2529a773d5ffc1dc0071801c87b3d270b471b80ed01" +dependencies = [ + "quote", + "wasm-bindgen-macro-support", +] + +[[package]] +name = "wasm-bindgen-macro-support" +version = "0.2.79" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "bfa8a30d46208db204854cadbb5d4baf5fcf8071ba5bf48190c3e59937962ebc" +dependencies = [ + "proc-macro2", + "quote", + "syn", + "wasm-bindgen-backend", + "wasm-bindgen-shared", +] + +[[package]] +name = "wasm-bindgen-shared" +version = "0.2.79" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "3d958d035c4438e28c70e4321a2911302f10135ce78a9c7834c0cab4123d06a2" + +[[package]] +name = "web-sys" +version = "0.3.56" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"c060b319f29dd25724f09a2ba1418f142f539b2be99fbf4d2d5a8f7330afb8eb" +dependencies = [ + "js-sys", + "wasm-bindgen", +] + +[[package]] +name = "webpki" +version = "0.21.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b8e38c0608262c46d4a56202ebabdeb094cef7e560ca7a226c6bf055188aa4ea" +dependencies = [ + "ring", + "untrusted", +] + +[[package]] +name = "webpki" +version = "0.22.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f095d78192e208183081cc07bc5515ef55216397af48b873e5edcd72637fa1bd" +dependencies = [ + "ring", + "untrusted", +] + +[[package]] +name = "webpki-roots" +version = "0.22.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "552ceb903e957524388c4d3475725ff2c8b7960922063af6ce53c9a43da07449" +dependencies = [ + "webpki 0.22.0", +] + +[[package]] +name = "which" +version = "4.2.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2a5a7e487e921cf220206864a94a89b6c6905bfc19f1057fa26a4cb360e5c1d2" +dependencies = [ + "either", + "lazy_static", + "libc", +] + +[[package]] +name = "winapi" +version = "0.3.9" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5c839a674fcd7a98952e593242ea400abe93992746761e38641405d28b00f419" +dependencies = [ + "winapi-i686-pc-windows-gnu", + "winapi-x86_64-pc-windows-gnu", +] + +[[package]] +name = "winapi-i686-pc-windows-gnu" +version = "0.4.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ac3b87c63620426dd9b991e5ce0329eff545bccbbb34f3be09ff6fb6ab51b7b6" + +[[package]] +name = "winapi-util" +version = "0.1.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "70ec6ce85bb158151cae5e5c87f95a8e97d2c0c4b001223f33a334e3ce5de178" +dependencies = [ + "winapi", +] + +[[package]] +name = "winapi-x86_64-pc-windows-gnu" +version = "0.4.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"712e227841d057c1ee1cd2fb22fa7e5a5461ae8e48fa2ca79ec42cfc1931183f" + +[[package]] +name = "winreg" +version = "0.7.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0120db82e8a1e0b9fb3345a539c478767c0048d842860994d96113d5b667bd69" +dependencies = [ + "winapi", +] + +[[package]] +name = "workspace_hack" +version = "0.1.0" +dependencies = [ + "anyhow", + "bytes", + "cc", + "clap 2.34.0", + "either", + "hashbrown", + "libc", + "log", + "memchr", + "num-integer", + "num-traits", + "proc-macro2", + "quote", + "regex", + "regex-syntax", + "reqwest", + "scopeguard", + "serde", + "syn", + "tokio", + "tracing", + "tracing-core", +] + +[[package]] +name = "xattr" +version = "0.2.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "244c3741f4240ef46274860397c7c74e50eb23624996930e484c16679633a54c" +dependencies = [ + "libc", +] + +[[package]] +name = "xml-rs" +version = "0.8.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d2d7d3948613f75c98fd9328cfdcc45acc4d360655289d0a7d4ec931392200a3" + +[[package]] +name = "yasna" +version = "0.4.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e262a29d0e61ccf2b6190d7050d4b237535fc76ce4c1210d9caa316f71dffa75" +dependencies = [ + "chrono", +] + +[[package]] +name = "zenith" +version = "0.1.0" +dependencies = [ + "anyhow", + "clap 3.0.14", + "control_plane", + "pageserver", + "postgres 0.19.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + "postgres_ffi", + "serde_json", + "walkeeper", + "workspace_hack", + "zenith_utils", +] + +[[package]] +name = "zenith_metrics" +version = "0.1.0" +dependencies = [ + "lazy_static", + "libc", + "once_cell", + "prometheus", + "workspace_hack", +] + +[[package]] +name = "zenith_utils" +version = "0.1.0" +dependencies = [ + "anyhow", + "bincode", + "byteorder", + "bytes", + "criterion", + "git-version", + "hex", + "hex-literal", + 
"hyper", + "jsonwebtoken", + "lazy_static", + "nix", + "pin-project-lite", + "postgres 0.19.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + "rand", + "routerify", + "rustls 0.19.1", + "rustls-split", + "serde", + "serde_json", + "serde_with", + "signal-hook", + "tempfile", + "thiserror", + "tokio", + "tracing", + "tracing-subscriber", + "webpki 0.21.4", + "workspace_hack", + "zenith_metrics", +] + +[[package]] +name = "zeroize" +version = "1.5.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7c88870063c39ee00ec285a2f8d6a966e5b6fb2becc4e8dac77ed0d370ed6006" + +[[package]] +name = "zstd" +version = "0.10.0+zstd.1.5.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "3b1365becbe415f3f0fcd024e2f7b45bacfb5bdd055f0dc113571394114e7bdd" +dependencies = [ + "zstd-safe", +] + +[[package]] +name = "zstd-safe" +version = "4.1.4+zstd.1.5.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2f7cd17c9af1a4d6c24beb1cc54b17e2ef7b593dc92f19e9d9acad8b182bbaee" +dependencies = [ + "libc", + "zstd-sys", +] + +[[package]] +name = "zstd-sys" +version = "1.6.3+zstd.1.5.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "fc49afa5c8d634e75761feda8c592051e7eeb4683ba827211eb0d731d3402ea8" +dependencies = [ + "cc", + "libc", +] diff --git a/walkeeper/Cargo.toml b/walkeeper/Cargo.toml index ddce78e737..86aa56c9ae 100644 --- a/walkeeper/Cargo.toml +++ b/walkeeper/Cargo.toml @@ -14,8 +14,7 @@ serde_json = "1" tracing = "0.1.27" clap = "3.0" daemonize = "0.4.1" -rust-s3 = { version = "0.28", default-features = false, features = ["no-verify-ssl", "tokio-rustls-tls"] } -tokio = { version = "1.17", features = ["macros"] } +tokio = { version = "1.17", features = ["macros", "fs"] } postgres-protocol = { 
git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } anyhow = "1.0" @@ -30,6 +29,9 @@ hex = "0.4.3" const_format = "0.2.21" tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } etcd-client = "0.8.3" +tokio-util = { version = "0.7", features = ["io"] } +rusoto_core = "0.47" +rusoto_s3 = "0.47" postgres_ffi = { path = "../postgres_ffi" } zenith_metrics = { path = "../zenith_metrics" } diff --git a/walkeeper/src/s3_offload.rs b/walkeeper/src/s3_offload.rs index 2b3285e6c6..c796f53615 100644 --- a/walkeeper/src/s3_offload.rs +++ b/walkeeper/src/s3_offload.rs @@ -2,19 +2,19 @@ // Offload old WAL segments to S3 and remove them locally // -use anyhow::Result; +use anyhow::Context; use postgres_ffi::xlog_utils::*; -use s3::bucket::Bucket; -use s3::creds::Credentials; -use s3::region::Region; +use rusoto_core::credential::StaticProvider; +use rusoto_core::{HttpClient, Region}; +use rusoto_s3::{ListObjectsV2Request, PutObjectRequest, S3Client, StreamingBody, S3}; use std::collections::HashSet; use std::env; -use std::fs::{self, File}; -use std::io::prelude::*; use std::path::Path; use std::time::SystemTime; +use tokio::fs::{self, File}; use tokio::runtime; use tokio::time::sleep; +use tokio_util::io::ReaderStream; use tracing::*; use walkdir::WalkDir; @@ -39,11 +39,12 @@ pub fn thread_main(conf: SafeKeeperConf) { } async fn offload_files( - bucket: &Bucket, + client: &S3Client, + bucket_name: &str, listing: &HashSet<String>, dir_path: &Path, conf: &SafeKeeperConf, -) -> Result<u64> { +) -> anyhow::Result<u64> { let horizon = SystemTime::now() - conf.ttl.unwrap(); let mut n: u64 = 0; for entry in WalkDir::new(dir_path) { @@ -57,12 +58,17 @@ async fn offload_files( let relpath = path.strip_prefix(&conf.workdir).unwrap(); let s3path = String::from("walarchive/")
+ relpath.to_str().unwrap(); if !listing.contains(&s3path) { - let mut file = File::open(&path)?; - let mut content = Vec::new(); - file.read_to_end(&mut content)?; - bucket.put_object(s3path, &content).await?; + let file = File::open(&path).await?; + client + .put_object(PutObjectRequest { + body: Some(StreamingBody::new(ReaderStream::new(file))), + bucket: bucket_name.to_string(), + key: s3path, + ..PutObjectRequest::default() + }) + .await?; - fs::remove_file(&path)?; + fs::remove_file(&path).await?; n += 1; } } @@ -70,35 +76,59 @@ async fn offload_files( Ok(n) } -async fn main_loop(conf: &SafeKeeperConf) -> Result<()> { +async fn main_loop(conf: &SafeKeeperConf) -> anyhow::Result<()> { let region = Region::Custom { - region: env::var("S3_REGION").unwrap(), - endpoint: env::var("S3_ENDPOINT").unwrap(), + name: env::var("S3_REGION").context("S3_REGION env var is not set")?, + endpoint: env::var("S3_ENDPOINT").context("S3_ENDPOINT env var is not set")?, }; - let credentials = Credentials::new( - Some(&env::var("S3_ACCESSKEY").unwrap()), - Some(&env::var("S3_SECRET").unwrap()), - None, - None, - None, - ) - .unwrap(); - // Create Bucket in REGION for BUCKET - let bucket = Bucket::new_with_path_style("zenith-testbucket", region, credentials)?; + let client = S3Client::new_with( + HttpClient::new().context("Failed to create S3 http client")?, + StaticProvider::new_minimal( + env::var("S3_ACCESSKEY").context("S3_ACCESSKEY env var is not set")?, + env::var("S3_SECRET").context("S3_SECRET env var is not set")?, + ), + region, + ); + + let bucket_name = "zenith-testbucket"; loop { - // List out contents of directory - let results = bucket - .list("walarchive/".to_string(), Some("".to_string())) - .await?; - let listing = results - .iter() - .flat_map(|b| b.contents.iter().map(|o| o.key.clone())) - .collect(); - - let n = offload_files(&bucket, &listing, &conf.workdir, conf).await?; + let listing = gather_wal_entries(&client, bucket_name).await?; + let n = 
offload_files(&client, bucket_name, &listing, &conf.workdir, conf).await?; info!("Offload {} files to S3", n); sleep(conf.ttl.unwrap()).await; } } + +async fn gather_wal_entries( + client: &S3Client, + bucket_name: &str, +) -> anyhow::Result<HashSet<String>> { + let mut document_keys = HashSet::new(); + + let mut continuation_token = None::<String>; + loop { + let response = client + .list_objects_v2(ListObjectsV2Request { + bucket: bucket_name.to_string(), + prefix: Some("walarchive/".to_string()), + continuation_token, + ..ListObjectsV2Request::default() + }) + .await?; + document_keys.extend( + response + .contents + .unwrap_or_default() + .into_iter() + .filter_map(|o| o.key), + ); + + continuation_token = response.continuation_token; + if continuation_token.is_none() { + break; + } + } + Ok(document_keys) +} From 4f172e7612870909613eb7c8f9c3d3a41a426618 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Sat, 9 Apr 2022 01:15:20 +0300 Subject: [PATCH 074/296] Replicate S3 blob metadata in the remote storage --- pageserver/src/remote_storage.rs | 12 +- pageserver/src/remote_storage/local_fs.rs | 188 +++++++++++++++--- pageserver/src/remote_storage/s3_bucket.rs | 12 +- .../src/remote_storage/storage_sync/upload.rs | 1 + 4 files changed, 179 insertions(+), 34 deletions(-) diff --git a/pageserver/src/remote_storage.rs b/pageserver/src/remote_storage.rs index 02d37af5de..aebd74af5a 100644 --- a/pageserver/src/remote_storage.rs +++ b/pageserver/src/remote_storage.rs @@ -325,27 +325,35 @@ trait RemoteStorage: Send + Sync { &self, from: impl io::AsyncRead + Unpin + Send + Sync + 'static, to: &Self::StoragePath, + metadata: Option<StorageMetadata>, ) -> anyhow::Result<()>; /// Streams the remote storage entry contents into the buffered writer given, returns the filled writer. + /// Returns the metadata, if any was stored with the file previously.
async fn download( &self, from: &Self::StoragePath, to: &mut (impl io::AsyncWrite + Unpin + Send + Sync), - ) -> anyhow::Result<()>; + ) -> anyhow::Result<Option<StorageMetadata>>; /// Streams a given byte range of the remote storage entry contents into the buffered writer given, returns the filled writer. + /// Returns the metadata, if any was stored with the file previously. async fn download_range( &self, from: &Self::StoragePath, start_inclusive: u64, end_exclusive: Option<u64>, to: &mut (impl io::AsyncWrite + Unpin + Send + Sync), - ) -> anyhow::Result<()>; + ) -> anyhow::Result<Option<StorageMetadata>>; async fn delete(&self, path: &Self::StoragePath) -> anyhow::Result<()>; } +/// Extra set of key-value pairs that contain arbitrary metadata about the storage entry. +/// Immutable, cannot be changed once the file is created. +#[derive(Debug, Clone, PartialEq, Eq)] +pub struct StorageMetadata(HashMap<String, String>); + fn strip_path_prefix<'a>(prefix: &'a Path, path: &'a Path) -> anyhow::Result<&'a Path> { if prefix == path { anyhow::bail!( diff --git a/pageserver/src/remote_storage/local_fs.rs b/pageserver/src/remote_storage/local_fs.rs index bac693c8d0..846adf8e9b 100644 --- a/pageserver/src/remote_storage/local_fs.rs +++ b/pageserver/src/remote_storage/local_fs.rs @@ -5,7 +5,6 @@ //! volume is mounted to the local FS.
use std::{ - ffi::OsString, future::Future, path::{Path, PathBuf}, pin::Pin, @@ -18,7 +17,7 @@ use tokio::{ }; use tracing::*; -use super::{strip_path_prefix, RemoteStorage}; +use super::{strip_path_prefix, RemoteStorage, StorageMetadata}; pub struct LocalFs { pageserver_workdir: &'static Path, @@ -54,6 +53,32 @@ impl LocalFs { ) } } + + async fn read_storage_metadata( + &self, + file_path: &Path, + ) -> anyhow::Result<Option<StorageMetadata>> { + let metadata_path = storage_metadata_path(&file_path); + if metadata_path.exists() && metadata_path.is_file() { + let metadata_string = fs::read_to_string(&metadata_path).await.with_context(|| { + format!( + "Failed to read metadata from the local storage at '{}'", + metadata_path.display() + ) + })?; + + serde_json::from_str(&metadata_string) + .with_context(|| { + format!( + "Failed to deserialize metadata from the local storage at '{}'", + metadata_path.display() + ) + }) + .map(|metadata| Some(StorageMetadata(metadata))) + } else { + Ok(None) + } + } } #[async_trait::async_trait] @@ -81,19 +106,14 @@ impl RemoteStorage for LocalFs { &self, mut from: impl io::AsyncRead + Unpin + Send + Sync + 'static, to: &Self::StoragePath, + metadata: Option<StorageMetadata>, ) -> anyhow::Result<()> { let target_file_path = self.resolve_in_storage(to)?; create_target_directory(&target_file_path).await?; // We need this dance with sort of durable rename (without fsyncs) // to prevent partial uploads.
This was really hit when pageserver shutdown // cancelled the upload and partial file was left on the fs - let mut temp_extension = target_file_path - .extension() - .unwrap_or_default() - .to_os_string(); - - temp_extension.push(OsString::from(".temp")); - let temp_file_path = target_file_path.with_extension(temp_extension); + let temp_file_path = path_with_suffix_extension(&target_file_path, ".temp"); let mut destination = io::BufWriter::new( fs::OpenOptions::new() .write(true) @@ -132,6 +152,23 @@ impl RemoteStorage for LocalFs { target_file_path.display() ) })?; + + if let Some(storage_metadata) = metadata { + let storage_metadata_path = storage_metadata_path(&target_file_path); + fs::write( + &storage_metadata_path, + serde_json::to_string(&storage_metadata.0) + .context("Failed to serialize storage metadata as json")?, + ) + .await + .with_context(|| { + format!( + "Failed to write metadata to the local storage at '{}'", + storage_metadata_path.display() + ) + })?; + } + Ok(()) } @@ -139,7 +176,7 @@ impl RemoteStorage for LocalFs { &self, from: &Self::StoragePath, to: &mut (impl io::AsyncWrite + Unpin + Send + Sync), - ) -> anyhow::Result<()> { + ) -> anyhow::Result<Option<StorageMetadata>> { let file_path = self.resolve_in_storage(from)?; if file_path.exists() && file_path.is_file() { @@ -162,7 +199,8 @@ impl RemoteStorage for LocalFs { ) })?; source.flush().await?; - Ok(()) + + self.read_storage_metadata(&file_path).await } else { bail!( "File '{}' either does not exist or is not a file", @@ -177,7 +215,7 @@ impl RemoteStorage for LocalFs { start_inclusive: u64, end_exclusive: Option<u64>, to: &mut (impl io::AsyncWrite + Unpin + Send + Sync), - ) -> anyhow::Result<()> { + ) -> anyhow::Result<Option<StorageMetadata>> { if let Some(end_exclusive) = end_exclusive { ensure!( end_exclusive > start_inclusive, @@ -186,7 +224,7 @@ impl RemoteStorage for LocalFs { end_exclusive ); if start_inclusive == end_exclusive.saturating_sub(1) { - return Ok(()); + return Ok(None); } } let file_path =
self.resolve_in_storage(from)?; @@ -220,7 +258,8 @@ impl RemoteStorage for LocalFs { file_path.display() ) })?; - Ok(()) + + self.read_storage_metadata(&file_path).await } else { bail!( "File '{}' either does not exist or is not a file", @@ -242,6 +281,17 @@ impl RemoteStorage for LocalFs { } } +fn path_with_suffix_extension(original_path: &Path, suffix: &str) -> PathBuf { + let mut extension_with_suffix = original_path.extension().unwrap_or_default().to_os_string(); + extension_with_suffix.push(suffix); + + original_path.with_extension(extension_with_suffix) +} + +fn storage_metadata_path(original_path: &Path) -> PathBuf { + path_with_suffix_extension(original_path, ".metadata") +} + fn get_all_files<'a, P>( directory_path: P, ) -> Pin<Box<dyn Future<Output = anyhow::Result<Vec<PathBuf>>> + Send + Sync + 'a>> @@ -451,7 +501,7 @@ mod fs_tests { use super::*; use crate::repository::repo_harness::{RepoHarness, TIMELINE_ID}; - use std::io::Write; + use std::{collections::HashMap, io::Write}; use tempfile::tempdir; #[tokio::test] @@ -465,7 +515,7 @@ mod fs_tests { ) .await?; let target_path = PathBuf::from("/").join("somewhere").join("else"); - match storage.upload(source, &target_path).await { + match storage.upload(source, &target_path, None).await { Ok(()) => panic!("Should not allow storing files with wrong target path"), Err(e) => { let message = format!("{:?}", e); @@ -475,14 +525,14 @@ mod fs_tests { } assert!(storage.list().await?.is_empty()); - let target_path_1 = upload_dummy_file(&repo_harness, &storage, "upload_1").await?; + let target_path_1 = upload_dummy_file(&repo_harness, &storage, "upload_1", None).await?; assert_eq!( storage.list().await?, vec![target_path_1.clone()], "Should list a single file after first upload" ); - let target_path_2 = upload_dummy_file(&repo_harness, &storage, "upload_2").await?; + let target_path_2 = upload_dummy_file(&repo_harness, &storage, "upload_2", None).await?; assert_eq!( list_files_sorted(&storage).await?, vec![target_path_1.clone(), target_path_2.clone()], @@ -503,12
+553,16 @@ mod fs_tests { let repo_harness = RepoHarness::create("download_file")?; let storage = create_storage()?; let upload_name = "upload_1"; - let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name).await?; + let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name, None).await?; let mut content_bytes = io::BufWriter::new(std::io::Cursor::new(Vec::new())); - storage.download(&upload_target, &mut content_bytes).await?; - content_bytes.flush().await?; + let metadata = storage.download(&upload_target, &mut content_bytes).await?; + assert!( + metadata.is_none(), + "No metadata should be returned for no metadata upload" + ); + content_bytes.flush().await?; let contents = String::from_utf8(content_bytes.into_inner().into_inner())?; assert_eq!( dummy_contents(upload_name), @@ -533,12 +587,16 @@ mod fs_tests { let repo_harness = RepoHarness::create("download_file_range_positive")?; let storage = create_storage()?; let upload_name = "upload_1"; - let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name).await?; + let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name, None).await?; let mut full_range_bytes = io::BufWriter::new(std::io::Cursor::new(Vec::new())); - storage + let metadata = storage .download_range(&upload_target, 0, None, &mut full_range_bytes) .await?; + assert!( + metadata.is_none(), + "No metadata should be returned for no metadata upload" + ); full_range_bytes.flush().await?; assert_eq!( dummy_contents(upload_name), @@ -548,7 +606,7 @@ mod fs_tests { let mut zero_range_bytes = io::BufWriter::new(std::io::Cursor::new(Vec::new())); let same_byte = 1_000_000_000; - storage + let metadata = storage .download_range( &upload_target, same_byte, @@ -556,6 +614,10 @@ mod fs_tests { &mut zero_range_bytes, ) .await?; + assert!( + metadata.is_none(), + "No metadata should be returned for no metadata upload" + ); zero_range_bytes.flush().await?; assert!( 
zero_range_bytes.into_inner().into_inner().is_empty(), @@ -566,7 +628,7 @@ mod fs_tests { let (first_part_local, second_part_local) = uploaded_bytes.split_at(3); let mut first_part_remote = io::BufWriter::new(std::io::Cursor::new(Vec::new())); - storage + let metadata = storage .download_range( &upload_target, 0, @@ -574,6 +636,11 @@ mod fs_tests { &mut first_part_remote, ) .await?; + assert!( + metadata.is_none(), + "No metadata should be returned for no metadata upload" + ); + first_part_remote.flush().await?; let first_part_remote = first_part_remote.into_inner().into_inner(); assert_eq!( @@ -583,7 +650,7 @@ mod fs_tests { ); let mut second_part_remote = io::BufWriter::new(std::io::Cursor::new(Vec::new())); - storage + let metadata = storage .download_range( &upload_target, first_part_local.len() as u64, @@ -591,6 +658,11 @@ mod fs_tests { &mut second_part_remote, ) .await?; + assert!( + metadata.is_none(), + "No metadata should be returned for no metadata upload" + ); + second_part_remote.flush().await?; let second_part_remote = second_part_remote.into_inner().into_inner(); assert_eq!( @@ -607,7 +679,7 @@ mod fs_tests { let repo_harness = RepoHarness::create("download_file_range_negative")?; let storage = create_storage()?; let upload_name = "upload_1"; - let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name).await?; + let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name, None).await?; let start = 10000; let end = 234; @@ -645,7 +717,7 @@ mod fs_tests { let repo_harness = RepoHarness::create("delete_file")?; let storage = create_storage()?; let upload_name = "upload_1"; - let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name).await?; + let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name, None).await?; storage.delete(&upload_target).await?; assert!(storage.list().await?.is_empty()); @@ -661,10 +733,69 @@ mod fs_tests { Ok(()) } + #[tokio::test] + async fn file_with_metadata() 
-> anyhow::Result<()> { + let repo_harness = RepoHarness::create("download_file")?; + let storage = create_storage()?; + let upload_name = "upload_1"; + let metadata = StorageMetadata(HashMap::from([ + ("one".to_string(), "1".to_string()), + ("two".to_string(), "2".to_string()), + ])); + let upload_target = + upload_dummy_file(&repo_harness, &storage, upload_name, Some(metadata.clone())).await?; + + let mut content_bytes = io::BufWriter::new(std::io::Cursor::new(Vec::new())); + let full_download_metadata = storage.download(&upload_target, &mut content_bytes).await?; + + content_bytes.flush().await?; + let contents = String::from_utf8(content_bytes.into_inner().into_inner())?; + assert_eq!( + dummy_contents(upload_name), + contents, + "We should upload and download the same contents" + ); + + assert_eq!( + full_download_metadata.as_ref(), + Some(&metadata), + "We should get the same metadata back for full download" + ); + + let uploaded_bytes = dummy_contents(upload_name).into_bytes(); + let (first_part_local, _) = uploaded_bytes.split_at(3); + + let mut first_part_remote = io::BufWriter::new(std::io::Cursor::new(Vec::new())); + let partial_download_metadata = storage + .download_range( + &upload_target, + 0, + Some(first_part_local.len() as u64), + &mut first_part_remote, + ) + .await?; + first_part_remote.flush().await?; + let first_part_remote = first_part_remote.into_inner().into_inner(); + assert_eq!( + first_part_local, + first_part_remote.as_slice(), + "First part bytes should be returned when requested" + ); + + assert_eq!( + partial_download_metadata.as_ref(), + Some(&metadata), + "We should get the same metadata back for partial download" + ); + + Ok(()) + } + async fn upload_dummy_file( harness: &RepoHarness<'_>, storage: &LocalFs, name: &str, + metadata: Option<StorageMetadata>, ) -> anyhow::Result<PathBuf> { let timeline_path = harness.timeline_path(&TIMELINE_ID); let relative_timeline_path = timeline_path.strip_prefix(&harness.conf.workdir)?; @@ -677,6 +808,7 @@ mod fs_tests {
) .await?, &storage_path, + metadata, ) .await?; Ok(storage_path) diff --git a/pageserver/src/remote_storage/s3_bucket.rs b/pageserver/src/remote_storage/s3_bucket.rs index 92b3b0cce8..bfd28168f4 100644 --- a/pageserver/src/remote_storage/s3_bucket.rs +++ b/pageserver/src/remote_storage/s3_bucket.rs @@ -24,6 +24,8 @@ use crate::{ remote_storage::{strip_path_prefix, RemoteStorage}, }; +use super::StorageMetadata; + const S3_FILE_SEPARATOR: char = '/'; #[derive(Debug, Eq, PartialEq)] @@ -179,12 +181,14 @@ impl RemoteStorage for S3Bucket { &self, from: impl io::AsyncRead + Unpin + Send + Sync + 'static, to: &Self::StoragePath, + metadata: Option<StorageMetadata>, ) -> anyhow::Result<()> { self.client .put_object(PutObjectRequest { body: Some(StreamingBody::new(ReaderStream::new(from))), bucket: self.bucket_name.clone(), key: to.key().to_owned(), + metadata: metadata.map(|m| m.0), ..PutObjectRequest::default() }) .await?; @@ -195,7 +199,7 @@ impl RemoteStorage for S3Bucket { &self, from: &Self::StoragePath, to: &mut (impl io::AsyncWrite + Unpin + Send + Sync), - ) -> anyhow::Result<()> { + ) -> anyhow::Result<Option<StorageMetadata>> { let object_output = self .client .get_object(GetObjectRequest { @@ -210,7 +214,7 @@ impl RemoteStorage for S3Bucket { io::copy(&mut from, to).await?; } - Ok(()) + Ok(object_output.metadata.map(StorageMetadata)) } async fn download_range( @@ -219,7 +223,7 @@ impl RemoteStorage for S3Bucket { start_inclusive: u64, end_exclusive: Option<u64>, to: &mut (impl io::AsyncWrite + Unpin + Send + Sync), - ) -> anyhow::Result<()> { + ) -> anyhow::Result<Option<StorageMetadata>> { // S3 accepts ranges as https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35 // and needs both ends to be exclusive let end_inclusive = end_exclusive.map(|end| end.saturating_sub(1)); @@ -242,7 +246,7 @@ impl RemoteStorage for S3Bucket { io::copy(&mut from, to).await?; } - Ok(()) + Ok(object_output.metadata.map(StorageMetadata)) } async fn delete(&self, path: &Self::StoragePath) -> anyhow::Result<()> { diff --git
a/pageserver/src/remote_storage/storage_sync/upload.rs b/pageserver/src/remote_storage/storage_sync/upload.rs index 76e92c2781..f955e04474 100644 --- a/pageserver/src/remote_storage/storage_sync/upload.rs +++ b/pageserver/src/remote_storage/storage_sync/upload.rs @@ -201,6 +201,7 @@ async fn try_upload_checkpoint< .upload( archive_streamer, &remote_storage.storage_path(&timeline_dir.join(&archive_name))?, + None, ) .await }, From dc7e3ff05af8f0d669ffe9878d5c98b2d7c8e12c Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Sat, 9 Apr 2022 01:19:45 +0300 Subject: [PATCH 075/296] Fix rustc 1.60 clippy warnings --- pageserver/src/http/routes.rs | 15 ++++++--------- pageserver/src/layered_repository.rs | 3 +-- pageserver/src/layered_repository/filename.rs | 8 ++------ pageserver/src/layered_repository/layer_map.rs | 4 +--- pageserver/src/reltag.rs | 4 +--- pageserver/src/remote_storage/local_fs.rs | 2 +- .../remote_storage/storage_sync/compression.rs | 3 +-- .../src/remote_storage/storage_sync/download.rs | 4 ++-- walkeeper/src/http/routes.rs | 6 +++--- zenith_utils/src/http/json.rs | 4 ++-- 10 files changed, 20 insertions(+), 33 deletions(-) diff --git a/pageserver/src/http/routes.rs b/pageserver/src/http/routes.rs index 207d2420bd..a0d6e922a1 100644 --- a/pageserver/src/http/routes.rs +++ b/pageserver/src/http/routes.rs @@ -68,10 +68,7 @@ fn get_config(request: &Request) -> &'static PageServerConf { // healthcheck handler async fn status_handler(request: Request) -> Result, ApiError> { let config = get_config(&request); - Ok(json_response( - StatusCode::OK, - StatusResponse { id: config.id }, - )?) + json_response(StatusCode::OK, StatusResponse { id: config.id }) } async fn timeline_create_handler(mut request: Request) -> Result, ApiError> { @@ -131,7 +128,7 @@ async fn timeline_list_handler(request: Request) -> Result, }) } - Ok(json_response(StatusCode::OK, response_data)?) 
+ json_response(StatusCode::OK, response_data) } // Gate non incremental logical size calculation behind a flag @@ -207,7 +204,7 @@ async fn timeline_detail_handler(request: Request) -> Result) -> Result, ApiError> { @@ -247,7 +244,7 @@ async fn timeline_attach_handler(request: Request) -> Result) -> Result, ApiError> { @@ -266,7 +263,7 @@ async fn timeline_detach_handler(request: Request) -> Result) -> Result, ApiError> { @@ -280,7 +277,7 @@ async fn tenant_list_handler(request: Request) -> Result, A .await .map_err(ApiError::from_err)??; - Ok(json_response(StatusCode::OK, response_data)?) + json_response(StatusCode::OK, response_data) } async fn tenant_create_handler(mut request: Request) -> Result, ApiError> { diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index d7a250f31e..5e93e3389b 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -1474,8 +1474,7 @@ impl LayeredTimeline { // // TODO: This perhaps should be done in 'flush_frozen_layers', after flushing // *all* the layers, to avoid fsyncing the file multiple times. - let disk_consistent_lsn; - disk_consistent_lsn = Lsn(frozen_layer.get_lsn_range().end.0 - 1); + let disk_consistent_lsn = Lsn(frozen_layer.get_lsn_range().end.0 - 1); // If we were able to advance 'disk_consistent_lsn', save it the metadata file. // After crash, we will restart WAL streaming and processing from that point. 
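Reviewer note on the pattern this patch applies: most hunks either drop an `Ok(expr?)` wrapper around an expression that already has the function's return type, or merge a `let x;` declaration with the assignment that follows it. The sketch below shows both rewrites in isolation. It assumes the lints involved are clippy's `needless_question_mark` and `needless_late_init` (the patch itself does not name them), and `parse_port` and `describe_last_lsn` are hypothetical helpers, not code from this repository.

```rust
use std::num::ParseIntError;

// Hypothetical helper, not from the patch.
// Before the cleanup this would have read `Ok(s.trim().parse::<u16>()?)`.
// The inner expression is already a `Result<u16, ParseIntError>`, so the
// `Ok(..?)` round trip adds nothing and can be dropped.
pub fn parse_port(s: &str) -> Result<u16, ParseIntError> {
    s.trim().parse::<u16>()
}

// Hypothetical helper mirroring the `disk_consistent_lsn` hunk.
// Before the cleanup: `let disk_consistent_lsn; disk_consistent_lsn = ...;`
// After: declare and initialize in a single statement.
pub fn describe_last_lsn(range_end: u64) -> String {
    // `saturating_sub` is used here only so the sketch cannot underflow on 0.
    let disk_consistent_lsn = range_end.saturating_sub(1);
    format!("{:X}", disk_consistent_lsn)
}
```

The first rewrite is only equivalent when the error type already matches the return type. If `?` was performing a `From` conversion on the error, the `Ok(..?)` form has to stay or the conversion has to be made explicit with `map_err`.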
diff --git a/pageserver/src/layered_repository/filename.rs b/pageserver/src/layered_repository/filename.rs index cd63f014c4..497912b408 100644 --- a/pageserver/src/layered_repository/filename.rs +++ b/pageserver/src/layered_repository/filename.rs @@ -25,9 +25,7 @@ impl PartialOrd for DeltaFileName { impl Ord for DeltaFileName { fn cmp(&self, other: &Self) -> Ordering { - let mut cmp; - - cmp = self.key_range.start.cmp(&other.key_range.start); + let mut cmp = self.key_range.start.cmp(&other.key_range.start); if cmp != Ordering::Equal { return cmp; } @@ -117,9 +115,7 @@ impl PartialOrd for ImageFileName { impl Ord for ImageFileName { fn cmp(&self, other: &Self) -> Ordering { - let mut cmp; - - cmp = self.key_range.start.cmp(&other.key_range.start); + let mut cmp = self.key_range.start.cmp(&other.key_range.start); if cmp != Ordering::Equal { return cmp; } diff --git a/pageserver/src/layered_repository/layer_map.rs b/pageserver/src/layered_repository/layer_map.rs index 8132ec9cc4..3984ee550f 100644 --- a/pageserver/src/layered_repository/layer_map.rs +++ b/pageserver/src/layered_repository/layer_map.rs @@ -296,9 +296,7 @@ impl LayerMap { key_range: &Range, lsn: Lsn, ) -> Result, Option>)>> { - let mut points: Vec; - - points = vec![key_range.start]; + let mut points = vec![key_range.start]; for l in self.historic_layers.iter() { if l.get_lsn_range().start > lsn { continue; diff --git a/pageserver/src/reltag.rs b/pageserver/src/reltag.rs index 46ff468f2f..18e26cc37a 100644 --- a/pageserver/src/reltag.rs +++ b/pageserver/src/reltag.rs @@ -39,9 +39,7 @@ impl PartialOrd for RelTag { impl Ord for RelTag { fn cmp(&self, other: &Self) -> Ordering { - let mut cmp; - - cmp = self.spcnode.cmp(&other.spcnode); + let mut cmp = self.spcnode.cmp(&other.spcnode); if cmp != Ordering::Equal { return cmp; } diff --git a/pageserver/src/remote_storage/local_fs.rs b/pageserver/src/remote_storage/local_fs.rs index 846adf8e9b..b40089d53c 100644 --- a/pageserver/src/remote_storage/local_fs.rs 
+++ b/pageserver/src/remote_storage/local_fs.rs @@ -58,7 +58,7 @@ impl LocalFs { &self, file_path: &Path, ) -> anyhow::Result> { - let metadata_path = storage_metadata_path(&file_path); + let metadata_path = storage_metadata_path(file_path); if metadata_path.exists() && metadata_path.is_file() { let metadata_string = fs::read_to_string(&metadata_path).await.with_context(|| { format!( diff --git a/pageserver/src/remote_storage/storage_sync/compression.rs b/pageserver/src/remote_storage/storage_sync/compression.rs index c5b041349a..511f79e0cf 100644 --- a/pageserver/src/remote_storage/storage_sync/compression.rs +++ b/pageserver/src/remote_storage/storage_sync/compression.rs @@ -201,8 +201,7 @@ pub async fn read_archive_header( .await .context("Failed to decompress a header from the archive")?; - Ok(ArchiveHeader::des(&header_bytes) - .context("Failed to deserialize a header from the archive")?) + ArchiveHeader::des(&header_bytes).context("Failed to deserialize a header from the archive") } /// Reads the archive metadata out of the archive name: diff --git a/pageserver/src/remote_storage/storage_sync/download.rs b/pageserver/src/remote_storage/storage_sync/download.rs index 32549c8650..773b4a12e5 100644 --- a/pageserver/src/remote_storage/storage_sync/download.rs +++ b/pageserver/src/remote_storage/storage_sync/download.rs @@ -225,8 +225,8 @@ async fn read_local_metadata( let local_metadata_bytes = fs::read(&local_metadata_path) .await .context("Failed to read local metadata file bytes")?; - Ok(TimelineMetadata::from_bytes(&local_metadata_bytes) - .context("Failed to read local metadata files bytes")?) 
+ TimelineMetadata::from_bytes(&local_metadata_bytes) + .context("Failed to read local metadata files bytes") } #[cfg(test)] diff --git a/walkeeper/src/http/routes.rs b/walkeeper/src/http/routes.rs index 06a0682c37..26b23cddcc 100644 --- a/walkeeper/src/http/routes.rs +++ b/walkeeper/src/http/routes.rs @@ -31,7 +31,7 @@ struct SafekeeperStatus { async fn status_handler(request: Request) -> Result, ApiError> { let conf = get_conf(&request); let status = SafekeeperStatus { id: conf.my_id }; - Ok(json_response(StatusCode::OK, status)?) + json_response(StatusCode::OK, status) } fn get_conf(request: &Request) -> &SafeKeeperConf { @@ -106,7 +106,7 @@ async fn timeline_status_handler(request: Request) -> Result) -> Result, ApiError> { @@ -119,7 +119,7 @@ async fn timeline_create_handler(mut request: Request) -> Result Deserialize<'de>>( let whole_body = hyper::body::aggregate(request.body_mut()) .await .map_err(ApiError::from_err)?; - Ok(serde_json::from_reader(whole_body.reader()) - .map_err(|err| ApiError::BadRequest(format!("Failed to parse json request {}", err)))?) 
+ serde_json::from_reader(whole_body.reader()) + .map_err(|err| ApiError::BadRequest(format!("Failed to parse json request {}", err))) } pub fn json_response( From 07a9553700310d6d6c2ba5c7e2e4484aeb98d899 Mon Sep 17 00:00:00 2001 From: Konstantin Knizhnik Date: Mon, 11 Apr 2022 22:30:08 +0300 Subject: [PATCH 076/296] Add test for restore from WAL (#1366) * Add test for restore from WAL * Fix python formatting * Choose unused port in wal restore test * Move recovery tests to zenith_utils/scripts * Set LD_LIBRARY_PATH in wal recovery scripts * Fix python test formatting * Fix mypy warning * Bump postgres version * Bump postgres version --- test_runner/batch_others/test_wal_restore.py | 38 +++++++++++++++++++ vendor/postgres | 2 +- zenith_utils/scripts/restore_from_wal.sh | 20 ++++++++++ .../scripts/restore_from_wal_archive.sh | 20 ++++++++++ 4 files changed, 79 insertions(+), 1 deletion(-) create mode 100644 test_runner/batch_others/test_wal_restore.py create mode 100755 zenith_utils/scripts/restore_from_wal.sh create mode 100755 zenith_utils/scripts/restore_from_wal_archive.sh diff --git a/test_runner/batch_others/test_wal_restore.py b/test_runner/batch_others/test_wal_restore.py new file mode 100644 index 0000000000..a5855f2258 --- /dev/null +++ b/test_runner/batch_others/test_wal_restore.py @@ -0,0 +1,38 @@ +import os +import subprocess + +from fixtures.utils import mkdir_if_needed +from fixtures.zenith_fixtures import (ZenithEnvBuilder, + VanillaPostgres, + PortDistributor, + PgBin, + base_dir, + vanilla_pg, + pg_distrib_dir) +from fixtures.log_helper import log + + +def test_wal_restore(zenith_env_builder: ZenithEnvBuilder, + test_output_dir, + port_distributor: PortDistributor): + zenith_env_builder.num_safekeepers = 1 + env = zenith_env_builder.init_start() + env.zenith_cli.create_branch("test_wal_restore") + pg = env.postgres.create_start('test_wal_restore') + pg.safe_psql("create table t as select generate_series(1,1000000)") + tenant_id = pg.safe_psql("show 
zenith.zenith_tenant")[0][0] + env.zenith_cli.pageserver_stop() + port = port_distributor.get_port() + data_dir = os.path.join(test_output_dir, 'pgsql.restored') + restored = VanillaPostgres(data_dir, PgBin(test_output_dir), port) + subprocess.call([ + 'bash', + os.path.join(base_dir, 'zenith_utils/scripts/restore_from_wal.sh'), + os.path.join(pg_distrib_dir, 'bin'), + os.path.join(test_output_dir, 'repo/safekeepers/sk1/{}/*'.format(tenant_id)), + data_dir, + str(port) + ]) + restored.start() + assert restored.safe_psql('select count(*) from t') == [(1000000, )] + restored.stop() diff --git a/vendor/postgres b/vendor/postgres index 8481459996..61afbf978b 160000 --- a/vendor/postgres +++ b/vendor/postgres @@ -1 +1 @@ -Subproject commit 848145999653be213141a330569b6f2d9f53dbf2 +Subproject commit 61afbf978b17764134ab6f1650bbdcadac147e71 diff --git a/zenith_utils/scripts/restore_from_wal.sh b/zenith_utils/scripts/restore_from_wal.sh new file mode 100755 index 0000000000..ef2171312b --- /dev/null +++ b/zenith_utils/scripts/restore_from_wal.sh @@ -0,0 +1,20 @@ +PG_BIN=$1 +WAL_PATH=$2 +DATA_DIR=$3 +PORT=$4 +SYSID=`od -A n -j 24 -N 8 -t d8 $WAL_PATH/000000010000000000000002* | cut -c 3-` +rm -fr $DATA_DIR +env -i LD_LIBRARY_PATH=$PG_BIN/../lib $PG_BIN/initdb -E utf8 -D $DATA_DIR --sysid=$SYSID +echo port=$PORT >> $DATA_DIR/postgresql.conf +REDO_POS=0x`$PG_BIN/pg_controldata -D $DATA_DIR | fgrep "REDO location"| cut -c 42-` +declare -i WAL_SIZE=$REDO_POS+114 +$PG_BIN/pg_ctl -D $DATA_DIR -l logfile start +$PG_BIN/pg_ctl -D $DATA_DIR -l logfile stop -m immediate +cp $DATA_DIR/pg_wal/000000010000000000000001 . 
+cp $WAL_PATH/* $DATA_DIR/pg_wal/ +if [ -f $DATA_DIR/pg_wal/*.partial ] +then + (cd $DATA_DIR/pg_wal ; for partial in \*.partial ; do mv $partial `basename $partial .partial` ; done) +fi +dd if=000000010000000000000001 of=$DATA_DIR/pg_wal/000000010000000000000001 bs=$WAL_SIZE count=1 conv=notrunc +rm -f 000000010000000000000001 diff --git a/zenith_utils/scripts/restore_from_wal_archive.sh b/zenith_utils/scripts/restore_from_wal_archive.sh new file mode 100755 index 0000000000..07f4fe1e4f --- /dev/null +++ b/zenith_utils/scripts/restore_from_wal_archive.sh @@ -0,0 +1,20 @@ +PG_BIN=$1 +WAL_PATH=$2 +DATA_DIR=$3 +PORT=$4 +SYSID=`od -A n -j 24 -N 8 -t d8 $WAL_PATH/000000010000000000000002* | cut -c 3-` +rm -fr $DATA_DIR /tmp/pg_wals +mkdir /tmp/pg_wals +env -i LD_LIBRARY_PATH=$PG_BIN/../lib $PG_BIN/initdb -E utf8 -U zenith_admin -D $DATA_DIR --sysid=$SYSID +echo port=$PORT >> $DATA_DIR/postgresql.conf +REDO_POS=0x`$PG_BIN/pg_controldata -D $DATA_DIR | fgrep "REDO location"| cut -c 42-` +declare -i WAL_SIZE=$REDO_POS+114 +cp $WAL_PATH/* /tmp/pg_wals +if [ -f $DATA_DIR/pg_wal/*.partial ] +then + (cd /tmp/pg_wals ; for partial in \*.partial ; do mv $partial `basename $partial .partial` ; done) +fi +dd if=$DATA_DIR/pg_wal/000000010000000000000001 of=/tmp/pg_wals/000000010000000000000001 bs=$WAL_SIZE count=1 conv=notrunc +echo > $DATA_DIR/recovery.signal +rm -f $DATA_DIR/pg_wal/* +echo "restore_command = 'cp /tmp/pg_wals/%f %p'" >> $DATA_DIR/postgresql.conf From 0fbe657b2f268351dc5daabee09754a578be3948 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Wed, 13 Apr 2022 00:02:06 +0300 Subject: [PATCH 077/296] Fix remote e2e tests after repository rename (#1434) Also start them after release build instead of debug. It saves 3-5 minutes and we anyway use release mode in Docker images. 
--- .circleci/config.yml | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/.circleci/config.yml b/.circleci/config.yml index e96964558b..9d26d5d558 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -672,7 +672,7 @@ jobs: --data \ "{ \"state\": \"pending\", - \"context\": \"zenith-remote-ci\", + \"context\": \"neon-cloud-e2e\", \"description\": \"[$REMOTE_REPO] Remote CI job is about to start\" }" - run: @@ -688,7 +688,7 @@ jobs: "{ \"ref\": \"main\", \"inputs\": { - \"ci_job_name\": \"zenith-remote-ci\", + \"ci_job_name\": \"neon-cloud-e2e\", \"commit_hash\": \"$CIRCLE_SHA1\", \"remote_repo\": \"$LOCAL_REPO\" } @@ -828,11 +828,11 @@ workflows: - remote-ci-trigger: # Context passes credentials for gh api context: CI_ACCESS_TOKEN - remote_repo: "zenithdb/console" + remote_repo: "neondatabase/cloud" requires: # XXX: Successful build doesn't mean everything is OK, but # the job to be triggered takes so much time to complete (~22 min) # that it's better not to wait for the commented-out steps - - build-zenith-debug + - build-zenith-release # - pg_regress-tests-release # - other-tests-release From 4af87f3d6097661c99cbf5b400c1af6c44819e43 Mon Sep 17 00:00:00 2001 From: Dmitry Ivanov Date: Wed, 13 Apr 2022 03:00:32 +0300 Subject: [PATCH 078/296] [proxy] Add SCRAM auth mechanism implementation (#1050) * [proxy] Add SCRAM auth * [proxy] Implement some tests for SCRAM * Refactoring + test fixes * Hide SCRAM mechanism behind `#[cfg(test)]` Currently we only use it in tests, so we hide all relevant module behind `#[cfg(test)]` to prevent "unused item" warnings. 
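For readers unfamiliar with the gating described above, here is a minimal sketch of how `#[cfg(test)]` on a module keeps test-only code out of normal builds. The module name `scram_sketch`, its contents, and the `"SCRAM-SHA-256"` value are illustrative assumptions and are not taken from the files in this patch.

```rust
// Hypothetical stand-in for a module that only tests use so far.
// With `#[cfg(test)]` the compiler does not see it in normal builds,
// so nothing inside can trigger dead-code ("unused item") warnings.
#[cfg(test)]
mod scram_sketch {
    pub const METHODS: &[&str] = &["SCRAM-SHA-256"];

    pub fn is_supported(method: &str) -> bool {
        METHODS.contains(&method)
    }
}

// Reports whether the gated module exists in the current build:
// `cfg!(test)` is true under `cargo test` and false otherwise.
pub fn scram_sketch_compiled_in() -> bool {
    cfg!(test)
}

#[cfg(test)]
mod tests {
    use super::scram_sketch;

    #[test]
    fn supported_method_is_accepted() {
        assert!(scram_sketch::is_supported("SCRAM-SHA-256"));
        assert!(!scram_sketch::is_supported("PLAIN"));
    }
}
```

The trade-off is that gated code is not type-checked in regular builds, so the gate should be removed as soon as non-test code starts calling into the module.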
--- Cargo.lock | 35 +++- proxy/Cargo.toml | 11 +- proxy/src/auth.rs | 88 +++------- proxy/src/auth/credentials.rs | 70 ++++++++ proxy/src/auth/flow.rs | 102 ++++++++++++ proxy/src/main.rs | 39 +++-- proxy/src/parse.rs | 18 +++ proxy/src/proxy.rs | 229 ++++++++++++++++++++------ proxy/src/sasl.rs | 47 ++++++ proxy/src/sasl/channel_binding.rs | 85 ++++++++++ proxy/src/sasl/messages.rs | 67 ++++++++ proxy/src/sasl/stream.rs | 70 ++++++++ proxy/src/scram.rs | 59 +++++++ proxy/src/scram/exchange.rs | 134 ++++++++++++++++ proxy/src/scram/key.rs | 33 ++++ proxy/src/scram/messages.rs | 232 +++++++++++++++++++++++++++ proxy/src/scram/password.rs | 48 ++++++ proxy/src/scram/secret.rs | 116 ++++++++++++++ proxy/src/scram/signature.rs | 66 ++++++++ zenith_utils/src/postgres_backend.rs | 3 +- zenith_utils/src/pq_proto.rs | 36 ++++- 21 files changed, 1446 insertions(+), 142 deletions(-) create mode 100644 proxy/src/auth/credentials.rs create mode 100644 proxy/src/auth/flow.rs create mode 100644 proxy/src/parse.rs create mode 100644 proxy/src/sasl.rs create mode 100644 proxy/src/sasl/channel_binding.rs create mode 100644 proxy/src/sasl/messages.rs create mode 100644 proxy/src/sasl/stream.rs create mode 100644 proxy/src/scram.rs create mode 100644 proxy/src/scram/exchange.rs create mode 100644 proxy/src/scram/key.rs create mode 100644 proxy/src/scram/messages.rs create mode 100644 proxy/src/scram/password.rs create mode 100644 proxy/src/scram/secret.rs create mode 100644 proxy/src/scram/signature.rs diff --git a/Cargo.lock b/Cargo.lock index 1a9e261281..7df1c4ab7a 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -1907,12 +1907,15 @@ name = "proxy" version = "0.1.0" dependencies = [ "anyhow", + "async-trait", + "base64 0.13.0", "bytes", "clap 3.0.14", "fail", "futures", "hashbrown", "hex", + "hmac 0.10.1", "hyper", "lazy_static", "md5", @@ -1921,16 +1924,20 @@ dependencies = [ "rand", "rcgen", "reqwest", + "routerify 2.2.0", + "rstest", "rustls 0.19.1", "scopeguard", "serde", 
"serde_json", + "sha2", "socket2", "thiserror", "tokio", "tokio-postgres 0.7.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", "tokio-postgres-rustls", "tokio-rustls 0.22.0", + "tokio-stream", "workspace_hack", "zenith_metrics", "zenith_utils", @@ -2130,6 +2137,19 @@ dependencies = [ "winapi", ] +[[package]] +name = "routerify" +version = "2.2.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0c6bb49594c791cadb5ccfa5f36d41b498d40482595c199d10cd318800280bd9" +dependencies = [ + "http", + "hyper", + "lazy_static", + "percent-encoding", + "regex", +] + [[package]] name = "routerify" version = "3.0.0" @@ -2143,6 +2163,19 @@ dependencies = [ "regex", ] +[[package]] +name = "rstest" +version = "0.12.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d912f35156a3f99a66ee3e11ac2e0b3f34ac85a07e05263d05a7e2c8810d616f" +dependencies = [ + "cfg-if", + "proc-macro2", + "quote", + "rustc_version", + "syn", +] + [[package]] name = "rusoto_core" version = "0.47.0" @@ -3450,7 +3483,7 @@ dependencies = [ "postgres 0.19.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", "rand", - "routerify", + "routerify 3.0.0", "rustls 0.19.1", "rustls-split", "serde", diff --git a/proxy/Cargo.toml b/proxy/Cargo.toml index dc20695884..56b6dd7e20 100644 --- a/proxy/Cargo.toml +++ b/proxy/Cargo.toml @@ -5,12 +5,14 @@ edition = "2021" [dependencies] anyhow = "1.0" +base64 = "0.13.0" bytes = { version = "1.0.1", features = ['serde'] } clap = "3.0" fail = "0.5.0" futures = "0.3.13" hashbrown = "0.11.2" hex = "0.4.3" +hmac = "0.10.1" hyper = "0.14" lazy_static = "1.4.0" md5 = "0.7.0" @@ -18,20 +20,25 @@ parking_lot = "0.11.2" pin-project-lite = "0.2.7" rand = "0.8.3" reqwest = { version = "0.11", default-features = false, 
features = ["blocking", "json", "rustls-tls"] } +routerify = "2" rustls = "0.19.1" scopeguard = "1.1.0" serde = "1" serde_json = "1" +sha2 = "0.9.8" socket2 = "0.4.4" -thiserror = "1.0" +thiserror = "1.0.30" tokio = { version = "1.17", features = ["macros"] } tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } tokio-rustls = "0.22.0" +tokio-stream = "0.1.8" zenith_utils = { path = "../zenith_utils" } zenith_metrics = { path = "../zenith_metrics" } workspace_hack = { version = "0.1", path = "../workspace_hack" } [dev-dependencies] -tokio-postgres-rustls = "0.8.0" +async-trait = "0.1" rcgen = "0.8.14" +rstest = "0.12" +tokio-postgres-rustls = "0.8.0" diff --git a/proxy/src/auth.rs b/proxy/src/auth.rs index e8fe65c081..bda14d67a1 100644 --- a/proxy/src/auth.rs +++ b/proxy/src/auth.rs @@ -1,14 +1,24 @@ +mod credentials; + +#[cfg(test)] +mod flow; + use crate::compute::DatabaseInfo; use crate::config::ProxyConfig; use crate::cplane_api::{self, CPlaneApi}; use crate::error::UserFacingError; use crate::stream::PqStream; use crate::waiters; -use std::collections::HashMap; +use std::io; use thiserror::Error; use tokio::io::{AsyncRead, AsyncWrite}; use zenith_utils::pq_proto::{BeMessage as Be, BeParameterStatusMessage}; +pub use credentials::ClientCredentials; + +#[cfg(test)] +pub use flow::*; + /// Common authentication error. #[derive(Debug, Error)] pub enum AuthErrorImpl { @@ -16,13 +26,17 @@ pub enum AuthErrorImpl { #[error(transparent)] Console(#[from] cplane_api::AuthError), + #[cfg(test)] + #[error(transparent)] + Sasl(#[from] crate::sasl::Error), + /// For passwords that couldn't be processed by [`parse_password`]. #[error("Malformed password message")] MalformedPassword, /// Errors produced by [`PqStream`]. 
#[error(transparent)] - Io(#[from] std::io::Error), + Io(#[from] io::Error), } impl AuthErrorImpl { @@ -67,70 +81,6 @@ impl UserFacingError for AuthError { } } -#[derive(Debug, Error)] -pub enum ClientCredsParseError { - #[error("Parameter `{0}` is missing in startup packet")] - MissingKey(&'static str), -} - -impl UserFacingError for ClientCredsParseError {} - -/// Various client credentials which we use for authentication. -#[derive(Debug, PartialEq, Eq)] -pub struct ClientCredentials { - pub user: String, - pub dbname: String, -} - -impl TryFrom> for ClientCredentials { - type Error = ClientCredsParseError; - - fn try_from(mut value: HashMap) -> Result { - let mut get_param = |key| { - value - .remove(key) - .ok_or(ClientCredsParseError::MissingKey(key)) - }; - - let user = get_param("user")?; - let db = get_param("database")?; - - Ok(Self { user, dbname: db }) - } -} - -impl ClientCredentials { - /// Use credentials to authenticate the user. - pub async fn authenticate( - self, - config: &ProxyConfig, - client: &mut PqStream, - ) -> Result { - fail::fail_point!("proxy-authenticate", |_| { - Err(AuthError::auth_failed("failpoint triggered")) - }); - - use crate::config::ClientAuthMethod::*; - use crate::config::RouterConfig::*; - match &config.router_config { - Static { host, port } => handle_static(host.clone(), *port, client, self).await, - Dynamic(Mixed) => { - if self.user.ends_with("@zenith") { - handle_existing_user(config, client, self).await - } else { - handle_new_user(config, client).await - } - } - Dynamic(Password) => handle_existing_user(config, client, self).await, - Dynamic(Link) => handle_new_user(config, client).await, - } - } -} - -fn new_psql_session_id() -> String { - hex::encode(rand::random::<[u8; 8]>()) -} - async fn handle_static( host: String, port: u16, @@ -169,7 +119,7 @@ async fn handle_existing_user( let md5_salt = rand::random(); client - .write_message(&Be::AuthenticationMD5Password(&md5_salt)) + 
.write_message(&Be::AuthenticationMD5Password(md5_salt)) .await?; // Read client's password hash @@ -213,6 +163,10 @@ async fn handle_new_user( Ok(db_info) } +fn new_psql_session_id() -> String { + hex::encode(rand::random::<[u8; 8]>()) +} + fn parse_password(bytes: &[u8]) -> Option<&str> { std::str::from_utf8(bytes).ok()?.strip_suffix('\0') } diff --git a/proxy/src/auth/credentials.rs b/proxy/src/auth/credentials.rs new file mode 100644 index 0000000000..7c8ba28622 --- /dev/null +++ b/proxy/src/auth/credentials.rs @@ -0,0 +1,70 @@ +//! User credentials used in authentication. + +use super::AuthError; +use crate::compute::DatabaseInfo; +use crate::config::ProxyConfig; +use crate::error::UserFacingError; +use crate::stream::PqStream; +use std::collections::HashMap; +use thiserror::Error; +use tokio::io::{AsyncRead, AsyncWrite}; + +#[derive(Debug, Error)] +pub enum ClientCredsParseError { + #[error("Parameter `{0}` is missing in startup packet")] + MissingKey(&'static str), +} + +impl UserFacingError for ClientCredsParseError {} + +/// Various client credentials which we use for authentication. +#[derive(Debug, PartialEq, Eq)] +pub struct ClientCredentials { + pub user: String, + pub dbname: String, +} + +impl TryFrom> for ClientCredentials { + type Error = ClientCredsParseError; + + fn try_from(mut value: HashMap) -> Result { + let mut get_param = |key| { + value + .remove(key) + .ok_or(ClientCredsParseError::MissingKey(key)) + }; + + let user = get_param("user")?; + let db = get_param("database")?; + + Ok(Self { user, dbname: db }) + } +} + +impl ClientCredentials { + /// Use credentials to authenticate the user. 
+ pub async fn authenticate( + self, + config: &ProxyConfig, + client: &mut PqStream, + ) -> Result { + fail::fail_point!("proxy-authenticate", |_| { + Err(AuthError::auth_failed("failpoint triggered")) + }); + + use crate::config::ClientAuthMethod::*; + use crate::config::RouterConfig::*; + match &config.router_config { + Static { host, port } => super::handle_static(host.clone(), *port, client, self).await, + Dynamic(Mixed) => { + if self.user.ends_with("@zenith") { + super::handle_existing_user(config, client, self).await + } else { + super::handle_new_user(config, client).await + } + } + Dynamic(Password) => super::handle_existing_user(config, client, self).await, + Dynamic(Link) => super::handle_new_user(config, client).await, + } + } +} diff --git a/proxy/src/auth/flow.rs b/proxy/src/auth/flow.rs new file mode 100644 index 0000000000..0fafaa2f47 --- /dev/null +++ b/proxy/src/auth/flow.rs @@ -0,0 +1,102 @@ +//! Main authentication flow. + +use super::{AuthError, AuthErrorImpl}; +use crate::stream::PqStream; +use crate::{sasl, scram}; +use std::io; +use tokio::io::{AsyncRead, AsyncWrite}; +use zenith_utils::pq_proto::{BeAuthenticationSaslMessage, BeMessage, BeMessage as Be}; + +/// Every authentication selector is supposed to implement this trait. +pub trait AuthMethod { + /// Any authentication selector should provide initial backend message + /// containing auth method name and parameters, e.g. md5 salt. + fn first_message(&self) -> BeMessage<'_>; +} + +/// Initial state of [`AuthFlow`]. +pub struct Begin; + +/// Use [SCRAM](crate::scram)-based auth in [`AuthFlow`]. +pub struct Scram<'a>(pub &'a scram::ServerSecret); + +impl AuthMethod for Scram<'_> { + #[inline(always)] + fn first_message(&self) -> BeMessage<'_> { + Be::AuthenticationSasl(BeAuthenticationSaslMessage::Methods(scram::METHODS)) + } +} + +/// Use password-based auth in [`AuthFlow`]. +pub struct Md5( + /// Salt for client. 
+ pub [u8; 4], +); + +impl AuthMethod for Md5 { + #[inline(always)] + fn first_message(&self) -> BeMessage<'_> { + Be::AuthenticationMD5Password(self.0) + } +} + +/// This wrapper for [`PqStream`] performs client authentication. +#[must_use] +pub struct AuthFlow<'a, Stream, State> { + /// The underlying stream which implements libpq's protocol. + stream: &'a mut PqStream, + /// State might contain ancillary data (see [`AuthFlow::begin`]). + state: State, +} + +/// Initial state of the stream wrapper. +impl<'a, S: AsyncWrite + Unpin> AuthFlow<'a, S, Begin> { + /// Create a new wrapper for client authentication. + pub fn new(stream: &'a mut PqStream) -> Self { + Self { + stream, + state: Begin, + } + } + + /// Move to the next step by sending auth method's name & params to client. + pub async fn begin(self, method: M) -> io::Result> { + self.stream.write_message(&method.first_message()).await?; + + Ok(AuthFlow { + stream: self.stream, + state: method, + }) + } +} + +/// Stream wrapper for handling simple MD5 password auth. +impl AuthFlow<'_, S, Md5> { + /// Perform user authentication. Raise an error in case authentication failed. + #[allow(unused)] + pub async fn authenticate(self) -> Result<(), AuthError> { + unimplemented!("MD5 auth flow is yet to be implemented"); + } +} + +/// Stream wrapper for handling [SCRAM](crate::scram) auth. +impl AuthFlow<'_, S, Scram<'_>> { + /// Perform user authentication. Raise an error in case authentication failed. + pub async fn authenticate(self) -> Result<(), AuthError> { + // Initial client message contains the chosen auth method's name. + let msg = self.stream.read_password_message().await?; + let sasl = sasl::FirstMessage::parse(&msg).ok_or(AuthErrorImpl::MalformedPassword)?; + + // Currently, the only supported SASL method is SCRAM. 
+ if !scram::METHODS.contains(&sasl.method) { + return Err(AuthErrorImpl::auth_failed("method not supported").into()); + } + + let secret = self.state.0; + sasl::SaslStream::new(self.stream, sasl.message) + .authenticate(scram::Exchange::new(secret, rand::random, None)) + .await?; + + Ok(()) + } +} diff --git a/proxy/src/main.rs b/proxy/src/main.rs index bd99d0a639..862152bb7b 100644 --- a/proxy/src/main.rs +++ b/proxy/src/main.rs @@ -1,19 +1,8 @@ -/// -/// Postgres protocol proxy/router. -/// -/// This service listens psql port and can check auth via external service -/// (control plane API in our case) and can create new databases and accounts -/// in somewhat transparent manner (again via communication with control plane API). -/// -use anyhow::{bail, Context}; -use clap::{App, Arg}; -use config::ProxyConfig; -use futures::FutureExt; -use std::future::Future; -use tokio::{net::TcpListener, task::JoinError}; -use zenith_utils::GIT_VERSION; - -use crate::config::{ClientAuthMethod, RouterConfig}; +//! Postgres protocol proxy/router. +//! +//! This service listens psql port and can check auth via external service +//! (control plane API in our case) and can create new databases and accounts +//! in somewhat transparent manner (again via communication with control plane API). mod auth; mod cancellation; @@ -27,6 +16,24 @@ mod proxy; mod stream; mod waiters; +// Currently SCRAM is only used in tests +#[cfg(test)] +mod parse; +#[cfg(test)] +mod sasl; +#[cfg(test)] +mod scram; + +use anyhow::{bail, Context}; +use clap::{App, Arg}; +use config::ProxyConfig; +use futures::FutureExt; +use std::future::Future; +use tokio::{net::TcpListener, task::JoinError}; +use zenith_utils::GIT_VERSION; + +use crate::config::{ClientAuthMethod, RouterConfig}; + /// Flattens `Result>` into `Result`. 
async fn flatten_err( f: impl Future, JoinError>>, diff --git a/proxy/src/parse.rs b/proxy/src/parse.rs new file mode 100644 index 0000000000..8a05ff9c82 --- /dev/null +++ b/proxy/src/parse.rs @@ -0,0 +1,18 @@ +//! Small parsing helpers. + +use std::convert::TryInto; +use std::ffi::CStr; + +pub fn split_cstr(bytes: &[u8]) -> Option<(&CStr, &[u8])> { + let pos = bytes.iter().position(|&x| x == 0)?; + let (cstr, other) = bytes.split_at(pos + 1); + // SAFETY: we've already checked that there's a terminator + Some((unsafe { CStr::from_bytes_with_nul_unchecked(cstr) }, other)) +} + +pub fn split_at_const(bytes: &[u8]) -> Option<(&[u8; N], &[u8])> { + (bytes.len() >= N).then(|| { + let (head, tail) = bytes.split_at(N); + (head.try_into().unwrap(), tail) + }) +} diff --git a/proxy/src/proxy.rs b/proxy/src/proxy.rs index 81581b5cf1..5b662f4c69 100644 --- a/proxy/src/proxy.rs +++ b/proxy/src/proxy.rs @@ -119,7 +119,6 @@ async fn handshake( // We can't perform TLS handshake without a config let enc = tls.is_some(); stream.write_message(&Be::EncryptionResponse(enc)).await?; - if let Some(tls) = tls.take() { // Upgrade raw stream into a secure TLS-backed stream. // NOTE: We've consumed `tls`; this fact will be used later. @@ -219,32 +218,14 @@ impl Client { #[cfg(test)] mod tests { use super::*; - - use tokio::io::DuplexStream; + use crate::{auth, scram}; + use async_trait::async_trait; + use rstest::rstest; use tokio_postgres::config::SslMode; use tokio_postgres::tls::{MakeTlsConnect, NoTls}; use tokio_postgres_rustls::MakeRustlsConnect; - async fn dummy_proxy( - client: impl AsyncRead + AsyncWrite + Unpin, - tls: Option, - ) -> anyhow::Result<()> { - let cancel_map = CancelMap::default(); - - // TODO: add some infra + tests for credentials - let (mut stream, _creds) = handshake(client, tls, &cancel_map) - .await? - .context("no stream")?; - - stream - .write_message_noflush(&Be::AuthenticationOk)? - .write_message_noflush(&BeParameterStatusMessage::encoding())? 
- .write_message(&BeMessage::ReadyForQuery) - .await?; - - Ok(()) - } - + /// Generate a set of TLS certificates: CA + server. fn generate_certs( hostname: &str, ) -> anyhow::Result<(rustls::Certificate, rustls::Certificate, rustls::PrivateKey)> { @@ -262,19 +243,115 @@ mod tests { )) } + struct ClientConfig<'a> { + config: rustls::ClientConfig, + hostname: &'a str, + } + + impl ClientConfig<'_> { + fn make_tls_connect( + self, + ) -> anyhow::Result> { + let mut mk = MakeRustlsConnect::new(self.config); + let tls = MakeTlsConnect::::make_tls_connect(&mut mk, self.hostname)?; + Ok(tls) + } + } + + /// Generate TLS certificates and build rustls configs for client and server. + fn generate_tls_config( + hostname: &str, + ) -> anyhow::Result<(ClientConfig<'_>, Arc)> { + let (ca, cert, key) = generate_certs(hostname)?; + + let server_config = { + let mut config = rustls::ServerConfig::new(rustls::NoClientAuth::new()); + config.set_single_cert(vec![cert], key)?; + config.into() + }; + + let client_config = { + let mut config = rustls::ClientConfig::new(); + config.root_store.add(&ca)?; + ClientConfig { config, hostname } + }; + + Ok((client_config, server_config)) + } + + #[async_trait] + trait TestAuth: Sized { + async fn authenticate( + self, + _stream: &mut PqStream>, + ) -> anyhow::Result<()> { + Ok(()) + } + } + + struct NoAuth; + impl TestAuth for NoAuth {} + + struct Scram(scram::ServerSecret); + + impl Scram { + fn new(password: &str) -> anyhow::Result { + let salt = rand::random::<[u8; 16]>(); + let secret = scram::ServerSecret::build(password, &salt, 256) + .context("failed to generate scram secret")?; + Ok(Scram(secret)) + } + + fn mock(user: &str) -> Self { + let salt = rand::random::<[u8; 32]>(); + Scram(scram::ServerSecret::mock(user, &salt)) + } + } + + #[async_trait] + impl TestAuth for Scram { + async fn authenticate( + self, + stream: &mut PqStream>, + ) -> anyhow::Result<()> { + auth::AuthFlow::new(stream) + .begin(auth::Scram(&self.0)) + .await? 
+ .authenticate() + .await?; + + Ok(()) + } + } + + /// A dummy proxy impl which performs a handshake and reports auth success. + async fn dummy_proxy( + client: impl AsyncRead + AsyncWrite + Unpin + Send, + tls: Option, + auth: impl TestAuth + Send, + ) -> anyhow::Result<()> { + let cancel_map = CancelMap::default(); + let (mut stream, _creds) = handshake(client, tls, &cancel_map) + .await? + .context("handshake failed")?; + + auth.authenticate(&mut stream).await?; + + stream + .write_message_noflush(&Be::AuthenticationOk)? + .write_message_noflush(&BeParameterStatusMessage::encoding())? + .write_message(&BeMessage::ReadyForQuery) + .await?; + + Ok(()) + } + #[tokio::test] async fn handshake_tls_is_enforced_by_proxy() -> anyhow::Result<()> { let (client, server) = tokio::io::duplex(1024); - let server_config = { - let (_ca, cert, key) = generate_certs("localhost")?; - - let mut config = rustls::ServerConfig::new(rustls::NoClientAuth::new()); - config.set_single_cert(vec![cert], key)?; - config - }; - - let proxy = tokio::spawn(dummy_proxy(client, Some(server_config.into()))); + let (_, server_config) = generate_tls_config("localhost")?; + let proxy = tokio::spawn(dummy_proxy(client, Some(server_config), NoAuth)); let client_err = tokio_postgres::Config::new() .user("john_doe") @@ -301,30 +378,14 @@ mod tests { async fn handshake_tls() -> anyhow::Result<()> { let (client, server) = tokio::io::duplex(1024); - let (ca, cert, key) = generate_certs("localhost")?; - - let server_config = { - let mut config = rustls::ServerConfig::new(rustls::NoClientAuth::new()); - config.set_single_cert(vec![cert], key)?; - config - }; - - let proxy = tokio::spawn(dummy_proxy(client, Some(server_config.into()))); - - let client_config = { - let mut config = rustls::ClientConfig::new(); - config.root_store.add(&ca)?; - config - }; - - let mut mk = MakeRustlsConnect::new(client_config); - let tls = MakeTlsConnect::::make_tls_connect(&mut mk, "localhost")?; + let (client_config, 
server_config) = generate_tls_config("localhost")?; + let proxy = tokio::spawn(dummy_proxy(client, Some(server_config), NoAuth)); let (_client, _conn) = tokio_postgres::Config::new() .user("john_doe") .dbname("earth") .ssl_mode(SslMode::Require) - .connect_raw(server, tls) + .connect_raw(server, client_config.make_tls_connect()?) .await?; proxy.await? @@ -334,7 +395,7 @@ mod tests { async fn handshake_raw() -> anyhow::Result<()> { let (client, server) = tokio::io::duplex(1024); - let proxy = tokio::spawn(dummy_proxy(client, None)); + let proxy = tokio::spawn(dummy_proxy(client, None, NoAuth)); let (_client, _conn) = tokio_postgres::Config::new() .user("john_doe") @@ -350,7 +411,7 @@ mod tests { async fn give_user_an_error_for_bad_creds() -> anyhow::Result<()> { let (client, server) = tokio::io::duplex(1024); - let proxy = tokio::spawn(dummy_proxy(client, None)); + let proxy = tokio::spawn(dummy_proxy(client, None, NoAuth)); let client_err = tokio_postgres::Config::new() .ssl_mode(SslMode::Disable) @@ -391,4 +452,66 @@ mod tests { Ok(()) } + + #[rstest] + #[case("password_foo")] + #[case("pwd-bar")] + #[case("")] + #[tokio::test] + async fn scram_auth_good(#[case] password: &str) -> anyhow::Result<()> { + let (client, server) = tokio::io::duplex(1024); + + let (client_config, server_config) = generate_tls_config("localhost")?; + let proxy = tokio::spawn(dummy_proxy( + client, + Some(server_config), + Scram::new(password)?, + )); + + let (_client, _conn) = tokio_postgres::Config::new() + .user("user") + .dbname("db") + .password(password) + .ssl_mode(SslMode::Require) + .connect_raw(server, client_config.make_tls_connect()?) + .await?; + + proxy.await? 
+ } + + #[tokio::test] + async fn scram_auth_mock() -> anyhow::Result<()> { + let (client, server) = tokio::io::duplex(1024); + + let (client_config, server_config) = generate_tls_config("localhost")?; + let proxy = tokio::spawn(dummy_proxy( + client, + Some(server_config), + Scram::mock("user"), + )); + + use rand::{distributions::Alphanumeric, Rng}; + let password: String = rand::thread_rng() + .sample_iter(&Alphanumeric) + .take(rand::random::() as usize) + .map(char::from) + .collect(); + + let _client_err = tokio_postgres::Config::new() + .user("user") + .dbname("db") + .password(&password) // no password will match the mocked secret + .ssl_mode(SslMode::Require) + .connect_raw(server, client_config.make_tls_connect()?) + .await + .err() // -> Option + .context("client shouldn't be able to connect")?; + + let _server_err = proxy + .await? + .err() // -> Option + .context("server shouldn't accept client")?; + + Ok(()) + } } diff --git a/proxy/src/sasl.rs b/proxy/src/sasl.rs new file mode 100644 index 0000000000..70a4d9946a --- /dev/null +++ b/proxy/src/sasl.rs @@ -0,0 +1,47 @@ +//! Simple Authentication and Security Layer. +//! +//! RFC: . +//! +//! Reference implementation: +//! * +//! * + +mod channel_binding; +mod messages; +mod stream; + +use std::io; +use thiserror::Error; + +pub use channel_binding::ChannelBinding; +pub use messages::FirstMessage; +pub use stream::SaslStream; + +/// Fine-grained auth errors help in writing tests. +#[derive(Error, Debug)] +pub enum Error { + #[error("Failed to authenticate client: {0}")] + AuthenticationFailed(&'static str), + + #[error("Channel binding failed: {0}")] + ChannelBindingFailed(&'static str), + + #[error("Unsupported channel binding method: {0}")] + ChannelBindingBadMethod(Box), + + #[error("Bad client message")] + BadClientMessage, + + #[error(transparent)] + Io(#[from] io::Error), +} + +/// A convenient result type for SASL exchange. +pub type Result = std::result::Result; + +/// Every SASL mechanism (e.g. 
[SCRAM](crate::scram)) is expected to implement this trait. +pub trait Mechanism: Sized { + /// Produce a server challenge to be sent to the client. + /// This is how this method is called in PostgreSQL (`libpq/sasl.h`). + fn exchange(self, input: &str) -> Result<(Option, String)>; +} diff --git a/proxy/src/sasl/channel_binding.rs b/proxy/src/sasl/channel_binding.rs new file mode 100644 index 0000000000..776adabe55 --- /dev/null +++ b/proxy/src/sasl/channel_binding.rs @@ -0,0 +1,85 @@ +//! Definition and parser for channel binding flag (a part of the `GS2` header). + +/// Channel binding flag (possibly with params). +#[derive(Debug, PartialEq, Eq)] +pub enum ChannelBinding { + /// Client doesn't support channel binding. + NotSupportedClient, + /// Client thinks server doesn't support channel binding. + NotSupportedServer, + /// Client wants to use this type of channel binding. + Required(T), +} + +impl ChannelBinding { + pub fn and_then(self, f: impl FnOnce(T) -> Result) -> Result, E> { + use ChannelBinding::*; + Ok(match self { + NotSupportedClient => NotSupportedClient, + NotSupportedServer => NotSupportedServer, + Required(x) => Required(f(x)?), + }) + } +} + +impl<'a> ChannelBinding<&'a str> { + // NB: FromStr doesn't work with lifetimes + pub fn parse(input: &'a str) -> Option { + use ChannelBinding::*; + Some(match input { + "n" => NotSupportedClient, + "y" => NotSupportedServer, + other => Required(other.strip_prefix("p=")?), + }) + } +} + +impl ChannelBinding { + /// Encode channel binding data as base64 for subsequent checks. + pub fn encode( + &self, + get_cbind_data: impl FnOnce(&T) -> Result, + ) -> Result, E> { + use ChannelBinding::*; + Ok(match self { + NotSupportedClient => { + // base64::encode("n,,") + "biws".into() + } + NotSupportedServer => { + // base64::encode("y,,") + "eSws".into() + } + Required(mode) => { + let msg = format!( + "p={mode},,{data}", + mode = mode, + data = get_cbind_data(mode)? 
+ ); + base64::encode(msg).into() + } + }) + } +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn channel_binding_encode() -> anyhow::Result<()> { + use ChannelBinding::*; + + let cases = [ + (NotSupportedClient, base64::encode("n,,")), + (NotSupportedServer, base64::encode("y,,")), + (Required("foo"), base64::encode("p=foo,,bar")), + ]; + + for (cb, input) in cases { + assert_eq!(cb.encode(|_| anyhow::Ok("bar".to_owned()))?, input); + } + + Ok(()) + } +} diff --git a/proxy/src/sasl/messages.rs b/proxy/src/sasl/messages.rs new file mode 100644 index 0000000000..b1ae8cc426 --- /dev/null +++ b/proxy/src/sasl/messages.rs @@ -0,0 +1,67 @@ +//! Definitions for SASL messages. + +use crate::parse::{split_at_const, split_cstr}; +use zenith_utils::pq_proto::{BeAuthenticationSaslMessage, BeMessage}; + +/// SASL-specific payload of [`PasswordMessage`](zenith_utils::pq_proto::FeMessage::PasswordMessage). +#[derive(Debug)] +pub struct FirstMessage<'a> { + /// Authentication method, e.g. `"SCRAM-SHA-256"`. + pub method: &'a str, + /// Initial client message. + pub message: &'a str, +} + +impl<'a> FirstMessage<'a> { + // NB: FromStr doesn't work with lifetimes + pub fn parse(bytes: &'a [u8]) -> Option { + let (method_cstr, tail) = split_cstr(bytes)?; + let method = method_cstr.to_str().ok()?; + + let (len_bytes, bytes) = split_at_const(tail)?; + let len = u32::from_be_bytes(*len_bytes) as usize; + if len != bytes.len() { + return None; + } + + let message = std::str::from_utf8(bytes).ok()?; + Some(Self { method, message }) + } +} + +/// A single SASL message. +/// This struct is deliberately decoupled from lower-level +/// [`BeAuthenticationSaslMessage`](zenith_utils::pq_proto::BeAuthenticationSaslMessage). +#[derive(Debug)] +pub(super) enum ServerMessage { + /// We expect to see more steps. + Continue(T), + /// This is the final step. 
+ Final(T), +} + +impl<'a> ServerMessage<&'a str> { + pub(super) fn to_reply(&self) -> BeMessage<'a> { + use BeAuthenticationSaslMessage::*; + BeMessage::AuthenticationSasl(match self { + ServerMessage::Continue(s) => Continue(s.as_bytes()), + ServerMessage::Final(s) => Final(s.as_bytes()), + }) + } +} +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn parse_sasl_first_message() { + let proto = "SCRAM-SHA-256"; + let sasl = "n,,n=,r=KHQ2Gjc7NptyB8aov5/TnUy4"; + let sasl_len = (sasl.len() as u32).to_be_bytes(); + let bytes = [proto.as_bytes(), &[0], sasl_len.as_ref(), sasl.as_bytes()].concat(); + + let password = FirstMessage::parse(&bytes).unwrap(); + assert_eq!(password.method, proto); + assert_eq!(password.message, sasl); + } +} diff --git a/proxy/src/sasl/stream.rs b/proxy/src/sasl/stream.rs new file mode 100644 index 0000000000..03649b8d11 --- /dev/null +++ b/proxy/src/sasl/stream.rs @@ -0,0 +1,70 @@ +//! Abstraction for the string-oriented SASL protocols. + +use super::{messages::ServerMessage, Mechanism}; +use crate::stream::PqStream; +use std::io; +use tokio::io::{AsyncRead, AsyncWrite}; + +/// Abstracts away all peculiarities of the libpq's protocol. +pub struct SaslStream<'a, S> { + /// The underlying stream. + stream: &'a mut PqStream, + /// Current password message we received from client. + current: bytes::Bytes, + /// First SASL message produced by client. + first: Option<&'a str>, +} + +impl<'a, S> SaslStream<'a, S> { + pub fn new(stream: &'a mut PqStream, first: &'a str) -> Self { + Self { + stream, + current: bytes::Bytes::new(), + first: Some(first), + } + } +} + +impl SaslStream<'_, S> { + // Receive a new SASL message from the client. 
+ async fn recv(&mut self) -> io::Result<&str> { + if let Some(first) = self.first.take() { + return Ok(first); + } + + self.current = self.stream.read_password_message().await?; + let s = std::str::from_utf8(&self.current) + .map_err(|_| io::Error::new(io::ErrorKind::InvalidData, "bad encoding"))?; + + Ok(s) + } +} + +impl SaslStream<'_, S> { + // Send a SASL message to the client. + async fn send(&mut self, msg: &ServerMessage<&str>) -> io::Result<()> { + self.stream.write_message(&msg.to_reply()).await?; + Ok(()) + } +} + +impl SaslStream<'_, S> { + /// Perform SASL message exchange according to the underlying algorithm + /// until user is either authenticated or denied access. + pub async fn authenticate(mut self, mut mechanism: impl Mechanism) -> super::Result<()> { + loop { + let input = self.recv().await?; + let (moved, reply) = mechanism.exchange(input)?; + match moved { + Some(moved) => { + self.send(&ServerMessage::Continue(&reply)).await?; + mechanism = moved; + } + None => { + self.send(&ServerMessage::Final(&reply)).await?; + return Ok(()); + } + } + } + } +} diff --git a/proxy/src/scram.rs b/proxy/src/scram.rs new file mode 100644 index 0000000000..f007f3e0b6 --- /dev/null +++ b/proxy/src/scram.rs @@ -0,0 +1,59 @@ +//! Salted Challenge Response Authentication Mechanism. +//! +//! RFC: . +//! +//! Reference implementation: +//! * +//! * + +mod exchange; +mod key; +mod messages; +mod password; +mod secret; +mod signature; + +pub use secret::*; + +pub use exchange::Exchange; +pub use secret::ServerSecret; + +use hmac::{Hmac, Mac, NewMac}; +use sha2::{Digest, Sha256}; + +// TODO: add SCRAM-SHA-256-PLUS +/// A list of supported SCRAM methods. 
+pub const METHODS: &[&str] = &["SCRAM-SHA-256"]; + +/// Decode base64 into array without any heap allocations +fn base64_decode_array<const N: usize>(input: impl AsRef<[u8]>) -> Option<[u8; N]> { + let mut bytes = [0u8; N]; + + let size = base64::decode_config_slice(input, base64::STANDARD, &mut bytes).ok()?; + if size != N { + return None; + } + + Some(bytes) +} + +/// This function essentially is `Hmac(sha256, key, input)`. +/// Further reading: . +fn hmac_sha256<'a>(key: &[u8], parts: impl IntoIterator<Item = &'a [u8]>) -> [u8; 32] { + let mut mac = Hmac::<Sha256>::new_varkey(key).expect("bad key size"); + parts.into_iter().for_each(|s| mac.update(s)); + + // TODO: maybe newer `hmac` et al already migrated to regular arrays? + let mut result = [0u8; 32]; + result.copy_from_slice(mac.finalize().into_bytes().as_slice()); + result +} + +fn sha256<'a>(parts: impl IntoIterator<Item = &'a [u8]>) -> [u8; 32] { + let mut hasher = Sha256::new(); + parts.into_iter().for_each(|s| hasher.update(s)); + + let mut result = [0u8; 32]; + result.copy_from_slice(hasher.finalize().as_slice()); + result +} diff --git a/proxy/src/scram/exchange.rs b/proxy/src/scram/exchange.rs new file mode 100644 index 0000000000..5a986b965a --- /dev/null +++ b/proxy/src/scram/exchange.rs @@ -0,0 +1,134 @@ +//! Implementation of the SCRAM authentication algorithm. + +use super::messages::{ + ClientFinalMessage, ClientFirstMessage, OwnedServerFirstMessage, SCRAM_RAW_NONCE_LEN, +}; +use super::secret::ServerSecret; +use super::signature::SignatureBuilder; +use crate::sasl::{self, ChannelBinding, Error as SaslError}; + +/// The only channel binding mode we currently support. 
+#[derive(Debug)] +struct TlsServerEndPoint; + +impl std::fmt::Display for TlsServerEndPoint { + fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { + write!(f, "tls-server-end-point") + } +} + +impl std::str::FromStr for TlsServerEndPoint { + type Err = sasl::Error; + + fn from_str(s: &str) -> Result { + match s { + "tls-server-end-point" => Ok(TlsServerEndPoint), + _ => Err(sasl::Error::ChannelBindingBadMethod(s.into())), + } + } +} + +#[derive(Debug)] +enum ExchangeState { + /// Waiting for [`ClientFirstMessage`]. + Initial, + /// Waiting for [`ClientFinalMessage`]. + SaltSent { + cbind_flag: ChannelBinding, + client_first_message_bare: String, + server_first_message: OwnedServerFirstMessage, + }, +} + +/// Server's side of SCRAM auth algorithm. +#[derive(Debug)] +pub struct Exchange<'a> { + state: ExchangeState, + secret: &'a ServerSecret, + nonce: fn() -> [u8; SCRAM_RAW_NONCE_LEN], + cert_digest: Option<&'a [u8]>, +} + +impl<'a> Exchange<'a> { + pub fn new( + secret: &'a ServerSecret, + nonce: fn() -> [u8; SCRAM_RAW_NONCE_LEN], + cert_digest: Option<&'a [u8]>, + ) -> Self { + Self { + state: ExchangeState::Initial, + secret, + nonce, + cert_digest, + } + } +} + +impl sasl::Mechanism for Exchange<'_> { + fn exchange(mut self, input: &str) -> sasl::Result<(Option, String)> { + use ExchangeState::*; + match &self.state { + Initial => { + let client_first_message = + ClientFirstMessage::parse(input).ok_or(SaslError::BadClientMessage)?; + + let server_first_message = client_first_message.build_server_first_message( + &(self.nonce)(), + &self.secret.salt_base64, + self.secret.iterations, + ); + let msg = server_first_message.as_str().to_owned(); + + self.state = SaltSent { + cbind_flag: client_first_message.cbind_flag.and_then(str::parse)?, + client_first_message_bare: client_first_message.bare.to_owned(), + server_first_message, + }; + + Ok((Some(self), msg)) + } + SaltSent { + cbind_flag, + client_first_message_bare, + server_first_message, + } => 
{ + let client_final_message = + ClientFinalMessage::parse(input).ok_or(SaslError::BadClientMessage)?; + + let channel_binding = cbind_flag.encode(|_| { + self.cert_digest + .map(base64::encode) + .ok_or(SaslError::ChannelBindingFailed("no cert digest provided")) + })?; + + // This might've been caused by a MITM attack + if client_final_message.channel_binding != channel_binding { + return Err(SaslError::ChannelBindingFailed("data mismatch")); + } + + if client_final_message.nonce != server_first_message.nonce() { + return Err(SaslError::AuthenticationFailed("bad nonce")); + } + + let signature_builder = SignatureBuilder { + client_first_message_bare, + server_first_message: server_first_message.as_str(), + client_final_message_without_proof: client_final_message.without_proof, + }; + + let client_key = signature_builder + .build(&self.secret.stored_key) + .derive_client_key(&client_final_message.proof); + + if client_key.sha256() != self.secret.stored_key { + return Err(SaslError::AuthenticationFailed("keys don't match")); + } + + let msg = client_final_message + .build_server_final_message(signature_builder, &self.secret.server_key); + + Ok((None, msg)) + } + } + } +} diff --git a/proxy/src/scram/key.rs b/proxy/src/scram/key.rs new file mode 100644 index 0000000000..1c13471bc3 --- /dev/null +++ b/proxy/src/scram/key.rs @@ -0,0 +1,33 @@ +//! Tools for client/server/stored key management. + +/// Faithfully taken from PostgreSQL. +pub const SCRAM_KEY_LEN: usize = 32; + +/// One of the keys derived from the [password](super::password::SaltedPassword). +/// We use the same structure for all keys, i.e. +/// `ClientKey`, `StoredKey`, and `ServerKey`. 
+#[derive(Default, Debug, PartialEq, Eq)] +#[repr(transparent)] +pub struct ScramKey { + bytes: [u8; SCRAM_KEY_LEN], +} + +impl ScramKey { + pub fn sha256(&self) -> Self { + super::sha256([self.as_ref()]).into() + } +} + +impl From<[u8; SCRAM_KEY_LEN]> for ScramKey { + #[inline(always)] + fn from(bytes: [u8; SCRAM_KEY_LEN]) -> Self { + Self { bytes } + } +} + +impl AsRef<[u8]> for ScramKey { + #[inline(always)] + fn as_ref(&self) -> &[u8] { + &self.bytes + } +} diff --git a/proxy/src/scram/messages.rs b/proxy/src/scram/messages.rs new file mode 100644 index 0000000000..f6e6133adf --- /dev/null +++ b/proxy/src/scram/messages.rs @@ -0,0 +1,232 @@ +//! Definitions for SCRAM messages. + +use super::base64_decode_array; +use super::key::{ScramKey, SCRAM_KEY_LEN}; +use super::signature::SignatureBuilder; +use crate::sasl::ChannelBinding; +use std::fmt; +use std::ops::Range; + +/// Faithfully taken from PostgreSQL. +pub const SCRAM_RAW_NONCE_LEN: usize = 18; + +/// Although we ignore all extensions, we still have to validate the message. +fn validate_sasl_extensions<'a>(parts: impl Iterator<Item = &'a str>) -> Option<()> { + for mut chars in parts.map(|s| s.chars()) { + let attr = chars.next()?; + if !('a'..='z').contains(&attr) && !('A'..='Z').contains(&attr) { + return None; + } + let eq = chars.next()?; + if eq != '=' { + return None; + } + } + + Some(()) +} + +#[derive(Debug)] +pub struct ClientFirstMessage<'a> { + /// `client-first-message-bare`. + pub bare: &'a str, + /// Channel binding mode. + pub cbind_flag: ChannelBinding<&'a str>, + /// (Client username)[]. + pub username: &'a str, + /// Client nonce. 
+ pub nonce: &'a str, +} + +impl<'a> ClientFirstMessage<'a> { + // NB: FromStr doesn't work with lifetimes + pub fn parse(input: &'a str) -> Option<Self> { + let mut parts = input.split(','); + + let cbind_flag = ChannelBinding::parse(parts.next()?)?; + + // PG doesn't support authorization identity, + // so we don't bother defining GS2 header type + let authzid = parts.next()?; + if !authzid.is_empty() { + return None; + } + + // Unfortunately, `parts.as_str()` is unstable + let pos = authzid.as_ptr() as usize - input.as_ptr() as usize + 1; + let (_, bare) = input.split_at(pos); + + // In theory, these might be preceded by "reserved-mext" (i.e. "m=") + let username = parts.next()?.strip_prefix("n=")?; + let nonce = parts.next()?.strip_prefix("r=")?; + + // Validate but ignore auth extensions + validate_sasl_extensions(parts)?; + + Some(Self { + bare, + cbind_flag, + username, + nonce, + }) + } + + /// Build a response to [`ClientFirstMessage`]. + pub fn build_server_first_message( + &self, + nonce: &[u8; SCRAM_RAW_NONCE_LEN], + salt_base64: &str, + iterations: u32, + ) -> OwnedServerFirstMessage { + use std::fmt::Write; + + let mut message = String::new(); + write!(&mut message, "r={}", self.nonce).unwrap(); + base64::encode_config_buf(nonce, base64::STANDARD, &mut message); + let combined_nonce = 2..message.len(); + write!(&mut message, ",s={},i={}", salt_base64, iterations).unwrap(); + + // This design guarantees that it's impossible to create a + // server-first-message without receiving a client-first-message + OwnedServerFirstMessage { + message, + nonce: combined_nonce, + } + } +} + +#[derive(Debug)] +pub struct ClientFinalMessage<'a> { + /// `client-final-message-without-proof`. + pub without_proof: &'a str, + /// Channel binding data (base64). + pub channel_binding: &'a str, + /// Combined client & server nonce. + pub nonce: &'a str, + /// Client auth proof. 
+ pub proof: [u8; SCRAM_KEY_LEN], +} + +impl<'a> ClientFinalMessage<'a> { + // NB: FromStr doesn't work with lifetimes + pub fn parse(input: &'a str) -> Option<Self> { + let (without_proof, proof) = input.rsplit_once(',')?; + + let mut parts = without_proof.split(','); + let channel_binding = parts.next()?.strip_prefix("c=")?; + let nonce = parts.next()?.strip_prefix("r=")?; + + // Validate but ignore auth extensions + validate_sasl_extensions(parts)?; + + let proof = base64_decode_array(proof.strip_prefix("p=")?)?; + + Some(Self { + without_proof, + channel_binding, + nonce, + proof, + }) + } + + /// Build a response to [`ClientFinalMessage`]. + pub fn build_server_final_message( + &self, + signature_builder: SignatureBuilder, + server_key: &ScramKey, + ) -> String { + let mut buf = String::from("v="); + base64::encode_config_buf( + signature_builder.build(server_key), + base64::STANDARD, + &mut buf, + ); + + buf + } +} + +/// We need to keep a convenient representation of this +/// message for the next authentication step. +pub struct OwnedServerFirstMessage { + /// Owned `server-first-message`. + message: String, + /// Slice into `message`. + nonce: Range<usize>, +} + +impl OwnedServerFirstMessage { + /// Extract combined nonce from the message. + #[inline(always)] + pub fn nonce(&self) -> &str { + &self.message[self.nonce.clone()] + } + + /// Get reference to a text representation of the message. 
+ #[inline(always)] + pub fn as_str(&self) -> &str { + &self.message + } +} + +impl fmt::Debug for OwnedServerFirstMessage { + fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { + f.debug_struct("ServerFirstMessage") + .field("message", &self.as_str()) + .field("nonce", &self.nonce()) + .finish() + } +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn parse_client_first_message() { + use ChannelBinding::*; + + // (Almost) real strings captured during debug sessions + let cases = [ + (NotSupportedClient, "n,,n=pepe,r=t8JwklwKecDLwSsA72rHmVju"), + (NotSupportedServer, "y,,n=pepe,r=t8JwklwKecDLwSsA72rHmVju"), + ( + Required("tls-server-end-point"), + "p=tls-server-end-point,,n=pepe,r=t8JwklwKecDLwSsA72rHmVju", + ), + ]; + + for (cb, input) in cases { + let msg = ClientFirstMessage::parse(input).unwrap(); + + assert_eq!(msg.bare, "n=pepe,r=t8JwklwKecDLwSsA72rHmVju"); + assert_eq!(msg.username, "pepe"); + assert_eq!(msg.nonce, "t8JwklwKecDLwSsA72rHmVju"); + assert_eq!(msg.cbind_flag, cb); + } + } + + #[test] + fn parse_client_final_message() { + let input = [ + "c=eSws", + "r=iiYEfS3rOgn8S3rtpSdrOsHtPLWvIkdgmHxA0hf3JNOAG4dU", + "p=SRpfsIVS4Gk11w1LqQ4QvCUBZYQmqXNSDEcHqbQ3CHI=", + ] + .join(","); + + let msg = ClientFinalMessage::parse(&input).unwrap(); + assert_eq!( + msg.without_proof, + "c=eSws,r=iiYEfS3rOgn8S3rtpSdrOsHtPLWvIkdgmHxA0hf3JNOAG4dU" + ); + assert_eq!( + msg.nonce, + "iiYEfS3rOgn8S3rtpSdrOsHtPLWvIkdgmHxA0hf3JNOAG4dU" + ); + assert_eq!( + base64::encode(msg.proof), + "SRpfsIVS4Gk11w1LqQ4QvCUBZYQmqXNSDEcHqbQ3CHI=" + ); + } +} diff --git a/proxy/src/scram/password.rs b/proxy/src/scram/password.rs new file mode 100644 index 0000000000..656780d853 --- /dev/null +++ b/proxy/src/scram/password.rs @@ -0,0 +1,48 @@ +//! Password hashing routines. + +use super::key::ScramKey; + +pub const SALTED_PASSWORD_LEN: usize = 32; + +/// Salted hashed password is essential for [key](super::key) derivation. 
+#[repr(transparent)] +pub struct SaltedPassword { + bytes: [u8; SALTED_PASSWORD_LEN], +} + +impl SaltedPassword { + /// See `scram-common.c : scram_SaltedPassword` for details. + /// Further reading: (see `PBKDF2`). + pub fn new(password: &[u8], salt: &[u8], iterations: u32) -> SaltedPassword { + let one = 1_u32.to_be_bytes(); // magic + + let mut current = super::hmac_sha256(password, [salt, &one]); + let mut result = current; + for _ in 1..iterations { + current = super::hmac_sha256(password, [current.as_ref()]); + // TODO: result = current.zip(result).map(|(x, y)| x ^ y), issue #80094 + for (i, x) in current.iter().enumerate() { + result[i] ^= x; + } + } + + result.into() + } + + /// Derive `ClientKey` from a salted hashed password. + pub fn client_key(&self) -> ScramKey { + super::hmac_sha256(&self.bytes, [b"Client Key".as_ref()]).into() + } + + /// Derive `ServerKey` from a salted hashed password. + pub fn server_key(&self) -> ScramKey { + super::hmac_sha256(&self.bytes, [b"Server Key".as_ref()]).into() + } +} + +impl From<[u8; SALTED_PASSWORD_LEN]> for SaltedPassword { + #[inline(always)] + fn from(bytes: [u8; SALTED_PASSWORD_LEN]) -> Self { + Self { bytes } + } +} diff --git a/proxy/src/scram/secret.rs b/proxy/src/scram/secret.rs new file mode 100644 index 0000000000..e8d180bcdd --- /dev/null +++ b/proxy/src/scram/secret.rs @@ -0,0 +1,116 @@ +//! Tools for SCRAM server secret management. + +use super::base64_decode_array; +use super::key::ScramKey; + +/// Server secret is produced from [password](super::password::SaltedPassword) +/// and is used throughout the authentication process. +#[derive(Debug)] +pub struct ServerSecret { + /// Number of iterations for `PBKDF2` function. + pub iterations: u32, + /// Salt used to hash user's password. + pub salt_base64: String, + /// Hashed `ClientKey`. + pub stored_key: ScramKey, + /// Used by client to verify server's signature. 
+ pub server_key: ScramKey, +} + +impl ServerSecret { + pub fn parse(input: &str) -> Option<Self> { + // SCRAM-SHA-256$<iterations>:<salt>$<stored_key>:<server_key> + let s = input.strip_prefix("SCRAM-SHA-256$")?; + let (params, keys) = s.split_once('$')?; + + let ((iterations, salt), (stored_key, server_key)) = + params.split_once(':').zip(keys.split_once(':'))?; + + let secret = ServerSecret { + iterations: iterations.parse().ok()?, + salt_base64: salt.to_owned(), + stored_key: base64_decode_array(stored_key)?.into(), + server_key: base64_decode_array(server_key)?.into(), + }; + + Some(secret) + } + + /// To avoid revealing information to an attacker, we use a + /// mocked server secret even if the user doesn't exist. + /// See `auth-scram.c : mock_scram_secret` for details. + pub fn mock(user: &str, nonce: &[u8; 32]) -> Self { + // Refer to `auth-scram.c : scram_mock_salt`. + let mocked_salt = super::sha256([user.as_bytes(), nonce]); + + Self { + iterations: 4096, + salt_base64: base64::encode(&mocked_salt), + stored_key: ScramKey::default(), + server_key: ScramKey::default(), + } + } + + /// Build a new server secret from the prerequisites. + /// XXX: We only use this function in tests. 
+ #[cfg(test)] + pub fn build(password: &str, salt: &[u8], iterations: u32) -> Option { + // TODO: implement proper password normalization required by the RFC + if !password.is_ascii() { + return None; + } + + let password = super::password::SaltedPassword::new(password.as_bytes(), salt, iterations); + + Some(Self { + iterations, + salt_base64: base64::encode(&salt), + stored_key: password.client_key().sha256(), + server_key: password.server_key(), + }) + } +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn parse_scram_secret() { + let iterations = 4096; + let salt = "+/tQQax7twvwTj64mjBsxQ=="; + let stored_key = "D5h6KTMBlUvDJk2Y8ELfC1Sjtc6k9YHjRyuRZyBNJns="; + let server_key = "Pi3QHbcluX//NDfVkKlFl88GGzlJ5LkyPwcdlN/QBvI="; + + let secret = format!( + "SCRAM-SHA-256${iterations}:{salt}${stored_key}:{server_key}", + iterations = iterations, + salt = salt, + stored_key = stored_key, + server_key = server_key, + ); + + let parsed = ServerSecret::parse(&secret).unwrap(); + assert_eq!(parsed.iterations, iterations); + assert_eq!(parsed.salt_base64, salt); + + assert_eq!(base64::encode(parsed.stored_key), stored_key); + assert_eq!(base64::encode(parsed.server_key), server_key); + } + + #[test] + fn build_scram_secret() { + let salt = b"salt"; + let secret = ServerSecret::build("password", salt, 4096).unwrap(); + assert_eq!(secret.iterations, 4096); + assert_eq!(secret.salt_base64, base64::encode(salt)); + assert_eq!( + base64::encode(secret.stored_key.as_ref()), + "lF4cRm/Jky763CN4HtxdHnjV4Q8AWTNlKvGmEFFU8IQ=" + ); + assert_eq!( + base64::encode(secret.server_key.as_ref()), + "ub8OgRsftnk2ccDMOt7ffHXNcikRkQkq1lh4xaAqrSw=" + ); + } +} diff --git a/proxy/src/scram/signature.rs b/proxy/src/scram/signature.rs new file mode 100644 index 0000000000..1c2811d757 --- /dev/null +++ b/proxy/src/scram/signature.rs @@ -0,0 +1,66 @@ +//! Tools for client/server signature management. 
+ +use super::key::{ScramKey, SCRAM_KEY_LEN}; + +/// A collection of message parts needed to derive the client's signature. +#[derive(Debug)] +pub struct SignatureBuilder<'a> { + pub client_first_message_bare: &'a str, + pub server_first_message: &'a str, + pub client_final_message_without_proof: &'a str, +} + +impl SignatureBuilder<'_> { + pub fn build(&self, key: &ScramKey) -> Signature { + let parts = [ + self.client_first_message_bare.as_bytes(), + b",", + self.server_first_message.as_bytes(), + b",", + self.client_final_message_without_proof.as_bytes(), + ]; + + super::hmac_sha256(key.as_ref(), parts).into() + } +} + +/// A computed value which, when xored with `ClientProof`, +/// produces `ClientKey` that we need for authentication. +#[derive(Debug)] +#[repr(transparent)] +pub struct Signature { + bytes: [u8; SCRAM_KEY_LEN], +} + +impl Signature { + /// Derive `ClientKey` from client's signature and proof. + pub fn derive_client_key(&self, proof: &[u8; SCRAM_KEY_LEN]) -> ScramKey { + // This is how the proof is calculated: + // + // 1. sha256(ClientKey) -> StoredKey + // 2. hmac_sha256(StoredKey, [messages...]) -> ClientSignature + // 3. ClientKey ^ ClientSignature -> ClientProof + // + // Step 3 implies that we can restore ClientKey from the proof + // by xoring the latter with the ClientSignature. Afterwards we + // can check that the presumed ClientKey meets our expectations. 
+ let mut signature = self.bytes; + for (i, x) in proof.iter().enumerate() { + signature[i] ^= x; + } + + signature.into() + } +} + +impl From<[u8; SCRAM_KEY_LEN]> for Signature { + fn from(bytes: [u8; SCRAM_KEY_LEN]) -> Self { + Self { bytes } + } +} + +impl AsRef<[u8]> for Signature { + fn as_ref(&self) -> &[u8] { + &self.bytes + } +} diff --git a/zenith_utils/src/postgres_backend.rs b/zenith_utils/src/postgres_backend.rs index 83792f2aca..f984fb4417 100644 --- a/zenith_utils/src/postgres_backend.rs +++ b/zenith_utils/src/postgres_backend.rs @@ -375,9 +375,8 @@ impl PostgresBackend { } AuthType::MD5 => { rand::thread_rng().fill(&mut self.md5_salt); - let md5_salt = self.md5_salt; self.write_message(&BeMessage::AuthenticationMD5Password( - &md5_salt, + self.md5_salt, ))?; self.state = ProtoState::Authentication; } diff --git a/zenith_utils/src/pq_proto.rs b/zenith_utils/src/pq_proto.rs index cb69418c07..403e176b14 100644 --- a/zenith_utils/src/pq_proto.rs +++ b/zenith_utils/src/pq_proto.rs @@ -401,7 +401,8 @@ fn read_null_terminated(buf: &mut Bytes) -> anyhow::Result { #[derive(Debug)] pub enum BeMessage<'a> { AuthenticationOk, - AuthenticationMD5Password(&'a [u8; 4]), + AuthenticationMD5Password([u8; 4]), + AuthenticationSasl(BeAuthenticationSaslMessage<'a>), AuthenticationCleartextPassword, BackendKeyData(CancelKeyData), BindComplete, @@ -429,6 +430,13 @@ pub enum BeMessage<'a> { KeepAlive(WalSndKeepAlive), } +#[derive(Debug)] +pub enum BeAuthenticationSaslMessage<'a> { + Methods(&'a [&'a str]), + Continue(&'a [u8]), + Final(&'a [u8]), +} + #[derive(Debug)] pub enum BeParameterStatusMessage<'a> { Encoding(&'a str), @@ -611,6 +619,32 @@ impl<'a> BeMessage<'a> { .unwrap(); // write into BytesMut can't fail } + BeMessage::AuthenticationSasl(msg) => { + buf.put_u8(b'R'); + write_body(buf, |buf| { + use BeAuthenticationSaslMessage::*; + match msg { + Methods(methods) => { + buf.put_i32(10); // Specifies that SASL auth method is used. 
+ for method in methods.iter() { + write_cstr(method.as_bytes(), buf)?; + } + buf.put_u8(0); // zero terminator for the list + } + Continue(extra) => { + buf.put_i32(11); // Continue SASL auth. + buf.put_slice(extra); + } + Final(extra) => { + buf.put_i32(12); // Send final SASL message. + buf.put_slice(extra); + } + } + Ok::<_, io::Error>(()) + }) + .unwrap() + } + BeMessage::BackendKeyData(key_data) => { buf.put_u8(b'K'); write_body(buf, |buf| { From 9b7a8e67a4ccd0957afd46d857d81374126fb255 Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Tue, 12 Apr 2022 23:57:33 +0300 Subject: [PATCH 079/296] fix deadlock in upload_timeline_checkpoint It originated from the fact that we were calling fetch_full_index without releasing the read guard, and fetch_full_index tries to acquire the read lock again. For a plain mutex this is already a deadlock; for an RW lock, the deadlock was caused by an attempt to acquire write access later in the code while an active read guard was still held up the stack. This is sort of a band-aid because Kirill plans to change this code during removal of the archiving mechanism --- .../src/remote_storage/storage_sync/upload.rs | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/pageserver/src/remote_storage/storage_sync/upload.rs b/pageserver/src/remote_storage/storage_sync/upload.rs index f955e04474..7b6d58a661 100644 --- a/pageserver/src/remote_storage/storage_sync/upload.rs +++ b/pageserver/src/remote_storage/storage_sync/upload.rs @@ -1,6 +1,6 @@ //! Timeline synchronization logic to compress and upload to the remote storage all new timeline files from the checkpoints. 
-use std::{borrow::Cow, collections::BTreeSet, path::PathBuf, sync::Arc}; +use std::{collections::BTreeSet, path::PathBuf, sync::Arc}; use tracing::{debug, error, warn}; @@ -46,13 +46,21 @@ pub(super) async fn upload_timeline_checkpoint< let index_read = index.read().await; let remote_timeline = match index_read.timeline_entry(&sync_id) { - None => None, + None => { + drop(index_read); + None + } Some(entry) => match entry.inner() { - TimelineIndexEntryInner::Full(remote_timeline) => Some(Cow::Borrowed(remote_timeline)), + TimelineIndexEntryInner::Full(remote_timeline) => { + let r = Some(remote_timeline.clone()); + drop(index_read); + r + } TimelineIndexEntryInner::Description(_) => { + drop(index_read); debug!("Found timeline description for the given ids, downloading the full index"); match fetch_full_index(remote_assets.as_ref(), &timeline_dir, sync_id).await { - Ok(remote_timeline) => Some(Cow::Owned(remote_timeline)), + Ok(remote_timeline) => Some(remote_timeline), Err(e) => { error!("Failed to download full timeline index: {:?}", e); sync_queue::push(SyncTask::new( @@ -82,7 +90,6 @@ pub(super) async fn upload_timeline_checkpoint< let already_uploaded_files = remote_timeline .map(|timeline| timeline.stored_files(&timeline_dir)) .unwrap_or_default(); - drop(index_read); match try_upload_checkpoint( config, From 20414c4b16143e1757816c1cd015c01c5343b28d Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Wed, 13 Apr 2022 00:20:55 +0300 Subject: [PATCH 080/296] defuse possible deadlock in download_timeline too --- .../src/remote_storage/storage_sync/download.rs | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/pageserver/src/remote_storage/storage_sync/download.rs b/pageserver/src/remote_storage/storage_sync/download.rs index 773b4a12e5..e5aa74452b 100644 --- a/pageserver/src/remote_storage/storage_sync/download.rs +++ b/pageserver/src/remote_storage/storage_sync/download.rs @@ -1,6 +1,6 @@ //! 
Timeline synchrnonization logic to put files from archives on remote storage into pageserver's local directory. -use std::{borrow::Cow, collections::BTreeSet, path::PathBuf, sync::Arc}; +use std::{collections::BTreeSet, path::PathBuf, sync::Arc}; use anyhow::{ensure, Context}; use tokio::fs; @@ -64,11 +64,16 @@ pub(super) async fn download_timeline< let remote_timeline = match index_read.timeline_entry(&sync_id) { None => { error!("Cannot download: no timeline is present in the index for given id"); + drop(index_read); return DownloadedTimeline::Abort; } Some(index_entry) => match index_entry.inner() { - TimelineIndexEntryInner::Full(remote_timeline) => Cow::Borrowed(remote_timeline), + TimelineIndexEntryInner::Full(remote_timeline) => { + let cloned = remote_timeline.clone(); + drop(index_read); + cloned + } TimelineIndexEntryInner::Description(_) => { // we do not check here for awaits_download because it is ok // to call this function while the download is in progress @@ -84,7 +89,7 @@ pub(super) async fn download_timeline< ) .await { - Ok(remote_timeline) => Cow::Owned(remote_timeline), + Ok(remote_timeline) => remote_timeline, Err(e) => { error!("Failed to download full timeline index: {:?}", e); From 87020f81265b14db527177b075e78752becb24cc Mon Sep 17 00:00:00 2001 From: Arthur Petukhovsky Date: Wed, 13 Apr 2022 10:59:29 +0300 Subject: [PATCH 081/296] Fix CI staging deploy (#1499) - Remove stopped safekeeper from inventory - Fix github pages address after neon rename --- .circleci/ansible/staging.hosts | 1 - .circleci/config.yml | 10 +++++----- 2 files changed, 5 insertions(+), 6 deletions(-) diff --git a/.circleci/ansible/staging.hosts b/.circleci/ansible/staging.hosts index f6b7bf009f..69f058c2b9 100644 --- a/.circleci/ansible/staging.hosts +++ b/.circleci/ansible/staging.hosts @@ -5,7 +5,6 @@ zenith-us-stage-ps-2 console_region_id=27 [safekeepers] zenith-us-stage-sk-1 console_region_id=27 zenith-us-stage-sk-2 console_region_id=27 -zenith-us-stage-sk-3 
console_region_id=27 zenith-us-stage-sk-4 console_region_id=27 [storage:children] diff --git a/.circleci/config.yml b/.circleci/config.yml index 9d26d5d558..f05e64072a 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -405,7 +405,7 @@ jobs: - run: name: Build coverage report command: | - COMMIT_URL=https://github.com/zenithdb/zenith/commit/$CIRCLE_SHA1 + COMMIT_URL=https://github.com/neondatabase/neon/commit/$CIRCLE_SHA1 scripts/coverage \ --dir=/tmp/zenith/coverage report \ @@ -416,8 +416,8 @@ jobs: name: Upload coverage report command: | LOCAL_REPO=$CIRCLE_PROJECT_USERNAME/$CIRCLE_PROJECT_REPONAME - REPORT_URL=https://zenithdb.github.io/zenith-coverage-data/$CIRCLE_SHA1 - COMMIT_URL=https://github.com/zenithdb/zenith/commit/$CIRCLE_SHA1 + REPORT_URL=https://neondatabase.github.io/zenith-coverage-data/$CIRCLE_SHA1 + COMMIT_URL=https://github.com/neondatabase/neon/commit/$CIRCLE_SHA1 scripts/git-upload \ --repo=https://$VIP_VAP_ACCESS_TOKEN@github.com/zenithdb/zenith-coverage-data.git \ @@ -593,7 +593,7 @@ jobs: name: Setup helm v3 command: | curl -s https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash - helm repo add zenithdb https://zenithdb.github.io/helm-charts + helm repo add zenithdb https://neondatabase.github.io/helm-charts - run: name: Re-deploy proxy command: | @@ -643,7 +643,7 @@ jobs: name: Setup helm v3 command: | curl -s https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash - helm repo add zenithdb https://zenithdb.github.io/helm-charts + helm repo add zenithdb https://neondatabase.github.io/helm-charts - run: name: Re-deploy proxy command: | From 58d5136a615f2c42e26ad78c16eb5fff965335df Mon Sep 17 00:00:00 2001 From: Daniil Date: Wed, 13 Apr 2022 17:16:25 +0300 Subject: [PATCH 082/296] compute_tools: check writability handler (#941) --- Cargo.lock | 1 + compute_tools/Cargo.toml | 1 + compute_tools/src/bin/zenith_ctl.rs | 2 ++ compute_tools/src/checker.rs | 46 +++++++++++++++++++++++++++++ 
compute_tools/src/http_api.rs | 13 ++++++-- compute_tools/src/lib.rs | 1 + 6 files changed, 62 insertions(+), 2 deletions(-) create mode 100644 compute_tools/src/checker.rs diff --git a/Cargo.lock b/Cargo.lock index 7df1c4ab7a..0584b9d6d2 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -346,6 +346,7 @@ dependencies = [ "serde_json", "tar", "tokio", + "tokio-postgres 0.7.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", "workspace_hack", ] diff --git a/compute_tools/Cargo.toml b/compute_tools/Cargo.toml index 56047093f1..fc52ce4e83 100644 --- a/compute_tools/Cargo.toml +++ b/compute_tools/Cargo.toml @@ -17,4 +17,5 @@ serde = { version = "1.0", features = ["derive"] } serde_json = "1" tar = "0.4" tokio = { version = "1.17", features = ["macros", "rt", "rt-multi-thread"] } +tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } workspace_hack = { version = "0.1", path = "../workspace_hack" } diff --git a/compute_tools/src/bin/zenith_ctl.rs b/compute_tools/src/bin/zenith_ctl.rs index 49ba653fa1..372afbc633 100644 --- a/compute_tools/src/bin/zenith_ctl.rs +++ b/compute_tools/src/bin/zenith_ctl.rs @@ -38,6 +38,7 @@ use clap::Arg; use log::info; use postgres::{Client, NoTls}; +use compute_tools::checker::create_writablity_check_data; use compute_tools::config; use compute_tools::http_api::launch_http_server; use compute_tools::logger::*; @@ -128,6 +129,7 @@ fn run_compute(state: &Arc>) -> Result { handle_roles(&read_state.spec, &mut client)?; handle_databases(&read_state.spec, &mut client)?; + create_writablity_check_data(&mut client)?; // 'Close' connection drop(client); diff --git a/compute_tools/src/checker.rs b/compute_tools/src/checker.rs new file mode 100644 index 0000000000..63da6ea23e --- /dev/null +++ b/compute_tools/src/checker.rs @@ -0,0 +1,46 @@ +use std::sync::{Arc, RwLock}; + +use anyhow::{anyhow, Result}; +use log::error; +use postgres::Client; 
+use tokio_postgres::NoTls; + +use crate::zenith::ComputeState; + +pub fn create_writablity_check_data(client: &mut Client) -> Result<()> { + let query = " + CREATE TABLE IF NOT EXISTS health_check ( + id serial primary key, + updated_at timestamptz default now() + ); + INSERT INTO health_check VALUES (1, now()) + ON CONFLICT (id) DO UPDATE + SET updated_at = now();"; + let result = client.simple_query(query)?; + if result.len() < 2 { + return Err(anyhow::format_err!("executed {} queries", result.len())); + } + Ok(()) +} + +pub async fn check_writability(state: &Arc<RwLock<ComputeState>>) -> Result<()> { + let connstr = state.read().unwrap().connstr.clone(); + let (client, connection) = tokio_postgres::connect(&connstr, NoTls).await?; + if client.is_closed() { + return Err(anyhow!("connection to postgres closed")); + } + tokio::spawn(async move { + if let Err(e) = connection.await { + error!("connection error: {}", e); + } + }); + + let result = client + .simple_query("UPDATE health_check SET updated_at = now() WHERE id = 1;") + .await?; + + if result.len() != 1 { + return Err(anyhow!("statement can't be executed")); + } + Ok(()) +} diff --git a/compute_tools/src/http_api.rs b/compute_tools/src/http_api.rs index 02fab08a6e..7e1a876044 100644 --- a/compute_tools/src/http_api.rs +++ b/compute_tools/src/http_api.rs @@ -11,7 +11,7 @@ use log::{error, info}; use crate::zenith::*; // Service function to handle all available routes. -fn routes(req: Request<Body>, state: Arc<RwLock<ComputeState>>) -> Response<Body> { +async fn routes(req: Request<Body>, state: Arc<RwLock<ComputeState>>) -> Response<Body> { match (req.method(), req.uri().path()) { // Timestamp of the last Postgres activity in the plain text. 
(&Method::GET, "/last_activity") => { @@ -29,6 +29,15 @@ fn routes(req: Request, state: Arc>) -> Response { + info!("serving /check_writability GET request"); + let res = crate::checker::check_writability(&state).await; + match res { + Ok(_) => Response::new(Body::from("true")), + Err(e) => Response::new(Body::from(e.to_string())), + } + } + // Return the `404 Not Found` for any other routes. _ => { let mut not_found = Response::new(Body::from("404 Not Found")); @@ -48,7 +57,7 @@ async fn serve(state: Arc<RwLock<ComputeState>>) { async move { Ok::<_, Infallible>(service_fn(move |req: Request<Body>| { let state = state.clone(); - async move { Ok::<_, Infallible>(routes(req, state)) } + async move { Ok::<_, Infallible>(routes(req, state).await) } })) } }); diff --git a/compute_tools/src/lib.rs b/compute_tools/src/lib.rs index 592011d95e..ffb9700a49 100644 --- a/compute_tools/src/lib.rs +++ b/compute_tools/src/lib.rs @@ -2,6 +2,7 @@ //! Various tools and helpers to handle cluster / compute node (Postgres) //! configuration. //! +pub mod checker; pub mod config; pub mod http_api; #[macro_use] From 1fd08107cab279c8fd0a0a042a5a04ec58a4fe0d Mon Sep 17 00:00:00 2001 From: Dhammika Pathirana Date: Mon, 11 Apr 2022 13:59:26 -0700 Subject: [PATCH 083/296] Add ps compaction_threshold config Signed-off-by: Dhammika Pathirana Add ps compaction_threshold knob for (#707) (#1484) --- pageserver/src/config.rs | 22 +++++++++++++++++++++- pageserver/src/layered_repository.rs | 8 +++----- 2 files changed, 24 insertions(+), 6 deletions(-) diff --git a/pageserver/src/config.rs b/pageserver/src/config.rs index 0d5cac8b4f..067073cd9b 100644 --- a/pageserver/src/config.rs +++ b/pageserver/src/config.rs @@ -36,8 +36,8 @@ pub mod defaults { // Target file size, when creating image and delta layers. // This parameter determines L1 layer file size. 
pub const DEFAULT_COMPACTION_TARGET_SIZE: u64 = 128 * 1024 * 1024; - pub const DEFAULT_COMPACTION_PERIOD: &str = "1 s"; + pub const DEFAULT_COMPACTION_THRESHOLD: usize = 10; pub const DEFAULT_GC_HORIZON: u64 = 64 * 1024 * 1024; pub const DEFAULT_GC_PERIOD: &str = "100 s"; @@ -65,6 +65,7 @@ pub mod defaults { #checkpoint_distance = {DEFAULT_CHECKPOINT_DISTANCE} # in bytes #compaction_target_size = {DEFAULT_COMPACTION_TARGET_SIZE} # in bytes #compaction_period = '{DEFAULT_COMPACTION_PERIOD}' +#compaction_threshold = '{DEFAULT_COMPACTION_THRESHOLD}' #gc_period = '{DEFAULT_GC_PERIOD}' #gc_horizon = {DEFAULT_GC_HORIZON} @@ -107,6 +108,9 @@ pub struct PageServerConf { // How often to check if there's compaction work to be done. pub compaction_period: Duration, + // Level0 delta layer threshold for compaction. + pub compaction_threshold: usize, + pub gc_horizon: u64, pub gc_period: Duration, @@ -162,6 +166,7 @@ struct PageServerConfigBuilder { compaction_target_size: BuilderValue, compaction_period: BuilderValue, + compaction_threshold: BuilderValue, gc_horizon: BuilderValue, gc_period: BuilderValue, @@ -198,6 +203,7 @@ impl Default for PageServerConfigBuilder { compaction_target_size: Set(DEFAULT_COMPACTION_TARGET_SIZE), compaction_period: Set(humantime::parse_duration(DEFAULT_COMPACTION_PERIOD) .expect("cannot parse default compaction period")), + compaction_threshold: Set(DEFAULT_COMPACTION_THRESHOLD), gc_horizon: Set(DEFAULT_GC_HORIZON), gc_period: Set(humantime::parse_duration(DEFAULT_GC_PERIOD) .expect("cannot parse default gc period")), @@ -241,6 +247,10 @@ impl PageServerConfigBuilder { self.compaction_period = BuilderValue::Set(compaction_period) } + pub fn compaction_threshold(&mut self, compaction_threshold: usize) { + self.compaction_threshold = BuilderValue::Set(compaction_threshold) + } + pub fn gc_horizon(&mut self, gc_horizon: u64) { self.gc_horizon = BuilderValue::Set(gc_horizon) } @@ -313,6 +323,9 @@ impl PageServerConfigBuilder { compaction_period: self 
.compaction_period .ok_or(anyhow::anyhow!("missing compaction_period"))?, + compaction_threshold: self + .compaction_threshold + .ok_or(anyhow::anyhow!("missing compaction_threshold"))?, gc_horizon: self .gc_horizon .ok_or(anyhow::anyhow!("missing gc_horizon"))?, @@ -453,6 +466,9 @@ impl PageServerConf { builder.compaction_target_size(parse_toml_u64(key, item)?) } "compaction_period" => builder.compaction_period(parse_toml_duration(key, item)?), + "compaction_threshold" => { + builder.compaction_threshold(parse_toml_u64(key, item)? as usize) + } "gc_horizon" => builder.gc_horizon(parse_toml_u64(key, item)?), "gc_period" => builder.gc_period(parse_toml_duration(key, item)?), "wait_lsn_timeout" => builder.wait_lsn_timeout(parse_toml_duration(key, item)?), @@ -590,6 +606,7 @@ impl PageServerConf { checkpoint_distance: defaults::DEFAULT_CHECKPOINT_DISTANCE, compaction_target_size: 4 * 1024 * 1024, compaction_period: Duration::from_secs(10), + compaction_threshold: defaults::DEFAULT_COMPACTION_THRESHOLD, gc_horizon: defaults::DEFAULT_GC_HORIZON, gc_period: Duration::from_secs(10), wait_lsn_timeout: Duration::from_secs(60), @@ -662,6 +679,7 @@ checkpoint_distance = 111 # in bytes compaction_target_size = 111 # in bytes compaction_period = '111 s' +compaction_threshold = 2 gc_period = '222 s' gc_horizon = 222 @@ -700,6 +718,7 @@ id = 10 checkpoint_distance: defaults::DEFAULT_CHECKPOINT_DISTANCE, compaction_target_size: defaults::DEFAULT_COMPACTION_TARGET_SIZE, compaction_period: humantime::parse_duration(defaults::DEFAULT_COMPACTION_PERIOD)?, + compaction_threshold: defaults::DEFAULT_COMPACTION_THRESHOLD, gc_horizon: defaults::DEFAULT_GC_HORIZON, gc_period: humantime::parse_duration(defaults::DEFAULT_GC_PERIOD)?, wait_lsn_timeout: humantime::parse_duration(defaults::DEFAULT_WAIT_LSN_TIMEOUT)?, @@ -745,6 +764,7 @@ id = 10 checkpoint_distance: 111, compaction_target_size: 111, compaction_period: Duration::from_secs(111), + compaction_threshold: 2, gc_horizon: 222, 
gc_period: Duration::from_secs(222), wait_lsn_timeout: Duration::from_secs(111), diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 5e93e3389b..e178ba5222 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -1680,13 +1680,11 @@ impl LayeredTimeline { fn compact_level0(&self, target_file_size: u64) -> Result<()> { let layers = self.layers.lock().unwrap(); - // We compact or "shuffle" the level-0 delta layers when 10 have - // accumulated. - static COMPACT_THRESHOLD: usize = 10; - let level0_deltas = layers.get_level0_deltas()?; - if level0_deltas.len() < COMPACT_THRESHOLD { + // We compact or "shuffle" the level-0 delta layers when they've + // accumulated over the compaction threshold. + if level0_deltas.len() < self.conf.compaction_threshold { return Ok(()); } drop(layers); From 49da76237bd073f3f5857d6476e7a2827115cadb Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Wed, 13 Apr 2022 18:56:27 +0300 Subject: [PATCH 084/296] remove noisy debug log message --- pageserver/src/layered_repository/block_io.rs | 1 - 1 file changed, 1 deletion(-) diff --git a/pageserver/src/layered_repository/block_io.rs b/pageserver/src/layered_repository/block_io.rs index 2eba0aa403..d027b2f0e7 100644 --- a/pageserver/src/layered_repository/block_io.rs +++ b/pageserver/src/layered_repository/block_io.rs @@ -198,7 +198,6 @@ impl BlockWriter for BlockBuf { assert!(buf.len() == PAGE_SZ); let blknum = self.blocks.len(); self.blocks.push(buf); - tracing::info!("buffered block {}", blknum); Ok(blknum as u32) } } From 1d36c5a39e97006daa63b3cb2af0dee3cf1ee3e4 Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Wed, 13 Apr 2022 19:19:44 +0300 Subject: [PATCH 085/296] reenable s3 on staging pageservers by default After the deadlock fix in https://github.com/neondatabase/neon/pull/1496 s3 seems to work normally. There is one more discovered issue, but it is not a blocker, so it can be fixed separately. 
--- .circleci/ansible/deploy.yaml | 27 ++++++++++++--------------- 1 file changed, 12 insertions(+), 15 deletions(-) diff --git a/.circleci/ansible/deploy.yaml b/.circleci/ansible/deploy.yaml index 2112102aa7..508843812a 100644 --- a/.circleci/ansible/deploy.yaml +++ b/.circleci/ansible/deploy.yaml @@ -63,21 +63,18 @@ tags: - pageserver - # It seems that currently S3 integration does not play well - # even with fresh pageserver without a burden of old data. - # TODO: turn this back on once the issue is solved. - # - name: update remote storage (s3) config - # lineinfile: - # path: /storage/pageserver/data/pageserver.toml - # line: "{{ item }}" - # loop: - # - "[remote_storage]" - # - "bucket_name = '{{ bucket_name }}'" - # - "bucket_region = '{{ bucket_region }}'" - # - "prefix_in_bucket = '{{ inventory_hostname }}'" - # become: true - # tags: - # - pageserver + - name: update remote storage (s3) config + lineinfile: + path: /storage/pageserver/data/pageserver.toml + line: "{{ item }}" + loop: + - "[remote_storage]" + - "bucket_name = '{{ bucket_name }}'" + - "bucket_region = '{{ bucket_region }}'" + - "prefix_in_bucket = '{{ inventory_hostname }}'" + become: true + tags: + - pageserver - name: upload systemd service definition ansible.builtin.template: From a0781f229c5574ab4fdae6b63175b7da8846921d Mon Sep 17 00:00:00 2001 From: Dhammika Pathirana Date: Wed, 13 Apr 2022 14:08:42 -0700 Subject: [PATCH 086/296] Add ps compact command Signed-off-by: Dhammika Pathirana Add ps compact command to api (#707) (#1484) --- pageserver/src/page_service.rs | 20 ++++++++++++++++++++ pageserver/src/repository.rs | 6 ++++-- test_runner/fixtures/compare_fixtures.py | 3 +++ 3 files changed, 27 insertions(+), 2 deletions(-) diff --git a/pageserver/src/page_service.rs b/pageserver/src/page_service.rs index e7a4117b3e..c09b032e48 100644 --- a/pageserver/src/page_service.rs +++ b/pageserver/src/page_service.rs @@ -713,6 +713,26 @@ impl postgres_backend::Handler for PageServerHandler { 
Some(result.elapsed.as_millis().to_string().as_bytes()), ]))? .write_message(&BeMessage::CommandComplete(b"SELECT 1"))?; + } else if query_string.starts_with("compact ") { + // Run compaction immediately on given timeline. + // FIXME This is just for tests. Don't expect this to be exposed to + // the users or the api. + + // compact + let re = Regex::new(r"^compact ([[:xdigit:]]+)\s([[:xdigit:]]+)($|\s)?").unwrap(); + + let caps = re + .captures(query_string) + .with_context(|| format!("Invalid compact: '{}'", query_string))?; + + let tenantid = ZTenantId::from_str(caps.get(1).unwrap().as_str())?; + let timelineid = ZTimelineId::from_str(caps.get(2).unwrap().as_str())?; + let timeline = tenant_mgr::get_timeline_for_tenant_load(tenantid, timelineid) + .context("Couldn't load timeline")?; + timeline.tline.compact()?; + + pgb.write_message_noflush(&SINGLE_COL_ROWDESC)? + .write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?; } else if query_string.starts_with("checkpoint ") { // Run checkpoint immediately on given timeline. diff --git a/pageserver/src/repository.rs b/pageserver/src/repository.rs index 02334d3229..eda9a3168d 100644 --- a/pageserver/src/repository.rs +++ b/pageserver/src/repository.rs @@ -252,8 +252,10 @@ pub trait Repository: Send + Sync { checkpoint_before_gc: bool, ) -> Result; - /// perform one compaction iteration. - /// this function is periodically called by compactor thread. + /// Perform one compaction iteration. + /// This function is periodically called by compactor thread. + /// Also it can be explicitly requested per timeline through page server + /// api's 'compact' command. fn compaction_iteration(&self) -> Result<()>; /// detaches locally available timeline by stopping all threads and removing all the data. 
diff --git a/test_runner/fixtures/compare_fixtures.py b/test_runner/fixtures/compare_fixtures.py index 750b02c894..598ee10f8e 100644 --- a/test_runner/fixtures/compare_fixtures.py +++ b/test_runner/fixtures/compare_fixtures.py @@ -87,6 +87,9 @@ class ZenithCompare(PgCompare): def flush(self): self.pscur.execute(f"do_gc {self.env.initial_tenant.hex} {self.timeline} 0") + def compact(self): + self.pscur.execute(f"compact {self.env.initial_tenant.hex} {self.timeline}") + def report_peak_memory_use(self) -> None: self.zenbenchmark.record("peak_mem", self.zenbenchmark.get_peak_mem(self.env.pageserver) / 1024, From cdf04b6a9fb2d5d225d12a2a74fae6c6eec26da6 Mon Sep 17 00:00:00 2001 From: Arthur Petukhovsky Date: Thu, 14 Apr 2022 09:31:35 +0300 Subject: [PATCH 087/296] Fix control file updates in safekeeper (#1452) Now control_file::Storage implements Deref for read-only access to the state. All updates should clone the state before modifying and persisting. --- walkeeper/src/control_file.rs | 57 ++++++++++++--- walkeeper/src/safekeeper.rs | 126 ++++++++++++++++++++-------------- walkeeper/src/timeline.rs | 16 ++--- 3 files changed, 127 insertions(+), 72 deletions(-) diff --git a/walkeeper/src/control_file.rs b/walkeeper/src/control_file.rs index 8b4e618661..7cc53edeb0 100644 --- a/walkeeper/src/control_file.rs +++ b/walkeeper/src/control_file.rs @@ -6,6 +6,7 @@ use lazy_static::lazy_static; use std::fs::{self, File, OpenOptions}; use std::io::{Read, Write}; +use std::ops::Deref; use std::path::{Path, PathBuf}; use tracing::*; @@ -37,8 +38,10 @@ lazy_static! { .expect("Failed to register safekeeper_persist_control_file_seconds histogram vec"); } -pub trait Storage { - /// Persist safekeeper state on disk. +/// Storage should keep actual state inside of it. It should implement Deref +/// trait to access state fields and have persist method for updating that state. +pub trait Storage: Deref { + /// Persist safekeeper state on disk and update internal state. 
fn persist(&mut self, s: &SafeKeeperState) -> Result<()>; } @@ -48,19 +51,47 @@ pub struct FileStorage { timeline_dir: PathBuf, conf: SafeKeeperConf, persist_control_file_seconds: Histogram, + + /// Last state persisted to disk. + state: SafeKeeperState, } impl FileStorage { - pub fn new(zttid: &ZTenantTimelineId, conf: &SafeKeeperConf) -> FileStorage { + pub fn restore_new(zttid: &ZTenantTimelineId, conf: &SafeKeeperConf) -> Result { let timeline_dir = conf.timeline_dir(zttid); let tenant_id = zttid.tenant_id.to_string(); let timeline_id = zttid.timeline_id.to_string(); - FileStorage { + + let state = Self::load_control_file_conf(conf, zttid)?; + + Ok(FileStorage { timeline_dir, conf: conf.clone(), persist_control_file_seconds: PERSIST_CONTROL_FILE_SECONDS .with_label_values(&[&tenant_id, &timeline_id]), - } + state, + }) + } + + pub fn create_new( + zttid: &ZTenantTimelineId, + conf: &SafeKeeperConf, + state: SafeKeeperState, + ) -> Result { + let timeline_dir = conf.timeline_dir(zttid); + let tenant_id = zttid.tenant_id.to_string(); + let timeline_id = zttid.timeline_id.to_string(); + + let mut store = FileStorage { + timeline_dir, + conf: conf.clone(), + persist_control_file_seconds: PERSIST_CONTROL_FILE_SECONDS + .with_label_values(&[&tenant_id, &timeline_id]), + state: state.clone(), + }; + + store.persist(&state)?; + Ok(store) } // Check the magic/version in the on-disk data and deserialize it, if possible. 
@@ -141,6 +172,14 @@ impl FileStorage { } } +impl Deref for FileStorage { + type Target = SafeKeeperState; + + fn deref(&self) -> &Self::Target { + &self.state + } +} + impl Storage for FileStorage { // persists state durably to underlying storage // for description see https://lwn.net/Articles/457667/ @@ -201,6 +240,9 @@ impl Storage for FileStorage { .and_then(|f| f.sync_all()) .context("failed to sync control file directory")?; } + + // update internal state + self.state = s.clone(); Ok(()) } } @@ -228,7 +270,7 @@ mod test { ) -> Result<(FileStorage, SafeKeeperState)> { fs::create_dir_all(&conf.timeline_dir(zttid)).expect("failed to create timeline dir"); Ok(( - FileStorage::new(zttid, conf), + FileStorage::restore_new(zttid, conf)?, FileStorage::load_control_file_conf(conf, zttid)?, )) } @@ -239,8 +281,7 @@ mod test { ) -> Result<(FileStorage, SafeKeeperState)> { fs::create_dir_all(&conf.timeline_dir(zttid)).expect("failed to create timeline dir"); let state = SafeKeeperState::empty(); - let mut storage = FileStorage::new(zttid, conf); - storage.persist(&state)?; + let storage = FileStorage::create_new(zttid, conf, state.clone())?; Ok((storage, state)) } diff --git a/walkeeper/src/safekeeper.rs b/walkeeper/src/safekeeper.rs index 1e23d87b34..22a8481e45 100644 --- a/walkeeper/src/safekeeper.rs +++ b/walkeeper/src/safekeeper.rs @@ -210,6 +210,7 @@ pub struct SafekeeperMemState { pub s3_wal_lsn: Lsn, // TODO: keep only persistent version pub peer_horizon_lsn: Lsn, pub remote_consistent_lsn: Lsn, + pub proposer_uuid: PgUuid, } impl SafeKeeperState { @@ -502,9 +503,8 @@ pub struct SafeKeeper { epoch_start_lsn: Lsn, pub inmem: SafekeeperMemState, // in memory part - pub s: SafeKeeperState, // persistent part + pub state: CTRL, // persistent state storage - pub control_store: CTRL, pub wal_store: WAL, } @@ -516,14 +516,14 @@ where // constructor pub fn new( ztli: ZTimelineId, - control_store: CTRL, + state: CTRL, mut wal_store: WAL, - state: SafeKeeperState, ) -> 
Result> { if state.timeline_id != ZTimelineId::from([0u8; 16]) && ztli != state.timeline_id { bail!("Calling SafeKeeper::new with inconsistent ztli ({}) and SafeKeeperState.server.timeline_id ({})", ztli, state.timeline_id); } + // initialize wal_store, if state is already initialized wal_store.init_storage(&state)?; Ok(SafeKeeper { @@ -535,23 +535,25 @@ where s3_wal_lsn: state.s3_wal_lsn, peer_horizon_lsn: state.peer_horizon_lsn, remote_consistent_lsn: state.remote_consistent_lsn, + proposer_uuid: state.proposer_uuid, }, - s: state, - control_store, + state, wal_store, }) } /// Get history of term switches for the available WAL fn get_term_history(&self) -> TermHistory { - self.s + self.state .acceptor_state .term_history .up_to(self.wal_store.flush_lsn()) } pub fn get_epoch(&self) -> Term { - self.s.acceptor_state.get_epoch(self.wal_store.flush_lsn()) + self.state + .acceptor_state + .get_epoch(self.wal_store.flush_lsn()) } /// Process message from proposer and possibly form reply. Concurrent @@ -587,46 +589,47 @@ where ); } /* Postgres upgrade is not treated as fatal error */ - if msg.pg_version != self.s.server.pg_version - && self.s.server.pg_version != UNKNOWN_SERVER_VERSION + if msg.pg_version != self.state.server.pg_version + && self.state.server.pg_version != UNKNOWN_SERVER_VERSION { info!( "incompatible server version {}, expected {}", - msg.pg_version, self.s.server.pg_version + msg.pg_version, self.state.server.pg_version ); } - if msg.tenant_id != self.s.tenant_id { + if msg.tenant_id != self.state.tenant_id { bail!( "invalid tenant ID, got {}, expected {}", msg.tenant_id, - self.s.tenant_id + self.state.tenant_id ); } - if msg.ztli != self.s.timeline_id { + if msg.ztli != self.state.timeline_id { bail!( "invalid timeline ID, got {}, expected {}", msg.ztli, - self.s.timeline_id + self.state.timeline_id ); } // set basic info about server, if not yet // TODO: verify that is doesn't change after - self.s.server.system_id = msg.system_id; - 
self.s.server.wal_seg_size = msg.wal_seg_size; - self.control_store - .persist(&self.s) - .context("failed to persist shared state")?; + { + let mut state = self.state.clone(); + state.server.system_id = msg.system_id; + state.server.wal_seg_size = msg.wal_seg_size; + self.state.persist(&state)?; + } // pass wal_seg_size to read WAL and find flush_lsn - self.wal_store.init_storage(&self.s)?; + self.wal_store.init_storage(&self.state)?; info!( "processed greeting from proposer {:?}, sending term {:?}", - msg.proposer_id, self.s.acceptor_state.term + msg.proposer_id, self.state.acceptor_state.term ); Ok(Some(AcceptorProposerMessage::Greeting(AcceptorGreeting { - term: self.s.acceptor_state.term, + term: self.state.acceptor_state.term, }))) } @@ -637,17 +640,19 @@ where ) -> Result> { // initialize with refusal let mut resp = VoteResponse { - term: self.s.acceptor_state.term, + term: self.state.acceptor_state.term, vote_given: false as u64, flush_lsn: self.wal_store.flush_lsn(), - truncate_lsn: self.s.peer_horizon_lsn, + truncate_lsn: self.state.peer_horizon_lsn, term_history: self.get_term_history(), }; - if self.s.acceptor_state.term < msg.term { - self.s.acceptor_state.term = msg.term; + if self.state.acceptor_state.term < msg.term { + let mut state = self.state.clone(); + state.acceptor_state.term = msg.term; // persist vote before sending it out - self.control_store.persist(&self.s)?; - resp.term = self.s.acceptor_state.term; + self.state.persist(&state)?; + + resp.term = self.state.acceptor_state.term; resp.vote_given = true as u64; } info!("processed VoteRequest for term {}: {:?}", msg.term, &resp); @@ -656,9 +661,10 @@ where /// Bump our term if received a note from elected proposer with higher one fn bump_if_higher(&mut self, term: Term) -> Result<()> { - if self.s.acceptor_state.term < term { - self.s.acceptor_state.term = term; - self.control_store.persist(&self.s)?; + if self.state.acceptor_state.term < term { + let mut state = self.state.clone(); + 
state.acceptor_state.term = term; + self.state.persist(&state)?; } Ok(()) } @@ -666,9 +672,9 @@ where /// Form AppendResponse from current state. fn append_response(&self) -> AppendResponse { let ar = AppendResponse { - term: self.s.acceptor_state.term, + term: self.state.acceptor_state.term, flush_lsn: self.wal_store.flush_lsn(), - commit_lsn: self.s.commit_lsn, + commit_lsn: self.state.commit_lsn, // will be filled by the upper code to avoid bothering safekeeper hs_feedback: HotStandbyFeedback::empty(), zenith_feedback: ZenithFeedback::empty(), @@ -681,7 +687,7 @@ where info!("received ProposerElected {:?}", msg); self.bump_if_higher(msg.term)?; // If our term is higher, ignore the message (next feedback will inform the compute) - if self.s.acceptor_state.term > msg.term { + if self.state.acceptor_state.term > msg.term { return Ok(None); } @@ -692,8 +698,11 @@ where self.wal_store.truncate_wal(msg.start_streaming_at)?; // and now adopt term history from proposer - self.s.acceptor_state.term_history = msg.term_history.clone(); - self.control_store.persist(&self.s)?; + { + let mut state = self.state.clone(); + state.acceptor_state.term_history = msg.term_history.clone(); + self.state.persist(&state)?; + } info!("start receiving WAL since {:?}", msg.start_streaming_at); @@ -715,13 +724,13 @@ where // Also note that commit_lsn can reach epoch_start_lsn earlier // that we receive new epoch_start_lsn, and we still need to sync // control file in this case. - if commit_lsn == self.epoch_start_lsn && self.s.commit_lsn != commit_lsn { + if commit_lsn == self.epoch_start_lsn && self.state.commit_lsn != commit_lsn { self.persist_control_file()?; } // We got our first commit_lsn, which means we should sync // everything to disk, to initialize the state. 
- if self.s.commit_lsn == Lsn(0) && commit_lsn > Lsn(0) { + if self.state.commit_lsn == Lsn(0) && commit_lsn > Lsn(0) { self.wal_store.flush_wal()?; self.persist_control_file()?; } @@ -731,10 +740,12 @@ where /// Persist in-memory state to the disk. fn persist_control_file(&mut self) -> Result<()> { - self.s.commit_lsn = self.inmem.commit_lsn; - self.s.peer_horizon_lsn = self.inmem.peer_horizon_lsn; + let mut state = self.state.clone(); - self.control_store.persist(&self.s) + state.commit_lsn = self.inmem.commit_lsn; + state.peer_horizon_lsn = self.inmem.peer_horizon_lsn; + state.proposer_uuid = self.inmem.proposer_uuid; + self.state.persist(&state) } /// Handle request to append WAL. @@ -744,13 +755,13 @@ where msg: &AppendRequest, require_flush: bool, ) -> Result> { - if self.s.acceptor_state.term < msg.h.term { + if self.state.acceptor_state.term < msg.h.term { bail!("got AppendRequest before ProposerElected"); } // If our term is higher, immediately refuse the message. - if self.s.acceptor_state.term > msg.h.term { - let resp = AppendResponse::term_only(self.s.acceptor_state.term); + if self.state.acceptor_state.term > msg.h.term { + let resp = AppendResponse::term_only(self.state.acceptor_state.term); return Ok(Some(AcceptorProposerMessage::AppendResponse(resp))); } @@ -758,8 +769,7 @@ where // processing the message. self.epoch_start_lsn = msg.h.epoch_start_lsn; - // TODO: don't update state without persisting to disk - self.s.proposer_uuid = msg.h.proposer_uuid; + self.inmem.proposer_uuid = msg.h.proposer_uuid; // do the job if !msg.wal_data.is_empty() { @@ -790,7 +800,7 @@ where // Update truncate and commit LSN in control file. // To avoid negative impact on performance of extra fsync, do it only // when truncate_lsn delta exceeds WAL segment size. 
- if self.s.peer_horizon_lsn + (self.s.server.wal_seg_size as u64) + if self.state.peer_horizon_lsn + (self.state.server.wal_seg_size as u64) < self.inmem.peer_horizon_lsn { self.persist_control_file()?; @@ -829,6 +839,8 @@ where #[cfg(test)] mod tests { + use std::ops::Deref; + use super::*; use crate::wal_storage::Storage; @@ -844,6 +856,14 @@ mod tests { } } + impl Deref for InMemoryState { + type Target = SafeKeeperState; + + fn deref(&self) -> &Self::Target { + &self.persisted_state + } + } + struct DummyWalStore { lsn: Lsn, } @@ -879,7 +899,7 @@ mod tests { }; let wal_store = DummyWalStore { lsn: Lsn(0) }; let ztli = ZTimelineId::from([0u8; 16]); - let mut sk = SafeKeeper::new(ztli, storage, wal_store, SafeKeeperState::empty()).unwrap(); + let mut sk = SafeKeeper::new(ztli, storage, wal_store).unwrap(); // check voting for 1 is ok let vote_request = ProposerAcceptorMessage::VoteRequest(VoteRequest { term: 1 }); @@ -890,11 +910,11 @@ mod tests { } // reboot... - let state = sk.control_store.persisted_state.clone(); + let state = sk.state.persisted_state.clone(); let storage = InMemoryState { - persisted_state: state.clone(), + persisted_state: state, }; - sk = SafeKeeper::new(ztli, storage, sk.wal_store, state).unwrap(); + sk = SafeKeeper::new(ztli, storage, sk.wal_store).unwrap(); // and ensure voting second time for 1 is not ok vote_resp = sk.process_msg(&vote_request); @@ -911,7 +931,7 @@ mod tests { }; let wal_store = DummyWalStore { lsn: Lsn(0) }; let ztli = ZTimelineId::from([0u8; 16]); - let mut sk = SafeKeeper::new(ztli, storage, wal_store, SafeKeeperState::empty()).unwrap(); + let mut sk = SafeKeeper::new(ztli, storage, wal_store).unwrap(); let mut ar_hdr = AppendRequestHeader { term: 1, diff --git a/walkeeper/src/timeline.rs b/walkeeper/src/timeline.rs index a76ef77615..a2941a9a5c 100644 --- a/walkeeper/src/timeline.rs +++ b/walkeeper/src/timeline.rs @@ -21,7 +21,6 @@ use crate::broker::SafekeeperInfo; use crate::callmemaybe::{CallmeEvent, 
SubscriptionStateKey}; use crate::control_file; -use crate::control_file::Storage as cf_storage; use crate::safekeeper::{ AcceptorProposerMessage, ProposerAcceptorMessage, SafeKeeper, SafeKeeperState, SafekeeperMemState, @@ -98,10 +97,9 @@ impl SharedState { peer_ids: Vec, ) -> Result { let state = SafeKeeperState::new(zttid, peer_ids); - let control_store = control_file::FileStorage::new(zttid, conf); + let control_store = control_file::FileStorage::create_new(zttid, conf, state)?; let wal_store = wal_storage::PhysicalStorage::new(zttid, conf); - let mut sk = SafeKeeper::new(zttid.timeline_id, control_store, wal_store, state)?; - sk.control_store.persist(&sk.s)?; + let sk = SafeKeeper::new(zttid.timeline_id, control_store, wal_store)?; Ok(Self { notified_commit_lsn: Lsn(0), @@ -116,18 +114,14 @@ impl SharedState { /// Restore SharedState from control file. /// If file doesn't exist, bails out. fn restore(conf: &SafeKeeperConf, zttid: &ZTenantTimelineId) -> Result { - let state = control_file::FileStorage::load_control_file_conf(conf, zttid) - .context("failed to load from control file")?; - - let control_store = control_file::FileStorage::new(zttid, conf); - + let control_store = control_file::FileStorage::restore_new(zttid, conf)?; let wal_store = wal_storage::PhysicalStorage::new(zttid, conf); info!("timeline {} restored", zttid.timeline_id); Ok(Self { notified_commit_lsn: Lsn(0), - sk: SafeKeeper::new(zttid.timeline_id, control_store, wal_store, state)?, + sk: SafeKeeper::new(zttid.timeline_id, control_store, wal_store)?, replicas: Vec::new(), active: false, num_computes: 0, @@ -419,7 +413,7 @@ impl Timeline { pub fn get_state(&self) -> (SafekeeperMemState, SafeKeeperState) { let shared_state = self.mutex.lock().unwrap(); - (shared_state.sk.inmem.clone(), shared_state.sk.s.clone()) + (shared_state.sk.inmem.clone(), shared_state.sk.state.clone()) } /// Prepare public safekeeper info for reporting. 
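Note on the control-file change above: the handlers no longer mutate the persistent state in place. They clone the state reached through `Deref`, modify the copy, and pass it to `persist`, which replaces the cached copy only after the write has succeeded. The following is a minimal Rust sketch of that pattern, not the walkeeper code itself: `State`, `MemStorage`, `fail_next`, `persist_calls` and `bump_term` are hypothetical names, and the durable write is only simulated in memory.

```rust
use std::ops::Deref;

// Hypothetical stand-in for the persistent safekeeper state.
#[derive(Clone, Debug, Default, PartialEq)]
struct State {
    term: u64,
    commit_lsn: u64,
}

// Hypothetical storage: owns the last successfully persisted state.
struct MemStorage {
    state: State,
    persist_calls: usize, // number of successful persists
    fail_next: bool,      // when set, the next persist fails (simulated I/O error)
}

impl MemStorage {
    fn new(state: State) -> Self {
        MemStorage { state, persist_calls: 0, fail_next: false }
    }

    // Write the new state, then update the cached copy.
    // On failure the cached copy is left untouched.
    fn persist(&mut self, s: &State) -> Result<(), String> {
        if self.fail_next {
            self.fail_next = false;
            return Err("simulated write failure".to_string());
        }
        // A real implementation would write and fsync a file here.
        self.persist_calls += 1;
        self.state = s.clone();
        Ok(())
    }
}

// Read-only access to the persisted state through the storage handle.
impl Deref for MemStorage {
    type Target = State;

    fn deref(&self) -> &Self::Target {
        &self.state
    }
}

// Clone-modify-persist: readers never see a term that was not written.
fn bump_term(store: &mut MemStorage, term: u64) -> Result<(), String> {
    if store.term < term {
        // `**store` goes through `Deref` to reach the `State`.
        let mut next: State = (**store).clone();
        next.term = term;
        store.persist(&next)?;
    }
    Ok(())
}
```

Because the only way to change what `Deref` returns is a successful `persist`, a failed write leaves the visible state at its previous value, which is what the removed `TODO: don't update state without persisting to disk` comment was asking for.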
From 570db6f1681b80e50dbc2d156d037b99ca742099 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Thu, 14 Apr 2022 11:28:38 +0300 Subject: [PATCH 088/296] Update README for Zenith -> Neon renaming. There's a lot of renaming left to do in the code and docs, but this is a start. Our binaries and many other things are still called "zenith", but I didn't change those in the README, because otherwise the examples won't work. I added a brief note at the top of the README to explain that we're in the process of renaming, until we've renamed everything. --- README.md | 29 ++++++++++++++++------------- 1 file changed, 16 insertions(+), 13 deletions(-) diff --git a/README.md b/README.md index c8acf526b9..f99785e683 100644 --- a/README.md +++ b/README.md @@ -1,19 +1,22 @@ -# Zenith +# Neon -Zenith is a serverless open source alternative to AWS Aurora Postgres. It separates storage and compute and substitutes PostgreSQL storage layer by redistributing data across a cluster of nodes. +Neon is a serverless open source alternative to AWS Aurora Postgres. It separates storage and compute and substitutes PostgreSQL storage layer by redistributing data across a cluster of nodes. + +The project used to be called "Zenith". Many of the commands and code comments +still refer to "zenith", but we are in the process of renaming things. ## Architecture overview -A Zenith installation consists of compute nodes and Zenith storage engine. +A Neon installation consists of compute nodes and Neon storage engine. -Compute nodes are stateless PostgreSQL nodes, backed by Zenith storage engine. +Compute nodes are stateless PostgreSQL nodes, backed by Neon storage engine. -Zenith storage engine consists of two major components: +Neon storage engine consists of two major components: - Pageserver. Scalable storage backend for compute nodes. - WAL service. The service that receives WAL from compute node and ensures that it is stored durably. 
Pageserver consists of: -- Repository - Zenith storage implementation. +- Repository - Neon storage implementation. - WAL receiver - service that receives WAL from WAL service and stores it in the repository. - Page service - service that communicates with compute nodes and responds with pages from the repository. - WAL redo - service that builds pages from base images and WAL records on Page service request. @@ -35,10 +38,10 @@ To run the `psql` client, install the `postgresql-client` package or modify `PAT To run the integration tests or Python scripts (not required to use the code), install Python (3.7 or higher), and install python3 packages using `./scripts/pysync` (requires poetry) in the project directory. -2. Build zenith and patched postgres +2. Build neon and patched postgres ```sh -git clone --recursive https://github.com/zenithdb/zenith.git -cd zenith +git clone --recursive https://github.com/neondatabase/neon.git +cd neon make -j5 ``` @@ -126,7 +129,7 @@ INSERT 0 1 ## Running tests ```sh -git clone --recursive https://github.com/zenithdb/zenith.git +git clone --recursive https://github.com/neondatabase/neon.git make # builds also postgres and installs it to ./tmp_install ./scripts/pytest ``` @@ -141,14 +144,14 @@ To view your `rustdoc` documentation in a browser, try running `cargo doc --no-d ### Postgres-specific terms -Due to Zenith's very close relation with PostgreSQL internals, there are numerous specific terms used. +Due to Neon's very close relation with PostgreSQL internals, there are numerous specific terms used. Same applies to certain spelling: i.e. we use MB to denote 1024 * 1024 bytes, while MiB would be technically more correct, it's inconsistent with what PostgreSQL code and its documentation use. 
To get more familiar with this aspect, refer to: -- [Zenith glossary](/docs/glossary.md) +- [Neon glossary](/docs/glossary.md) - [PostgreSQL glossary](https://www.postgresql.org/docs/13/glossary.html) -- Other PostgreSQL documentation and sources (Zenith fork sources can be found [here](https://github.com/zenithdb/postgres)) +- Other PostgreSQL documentation and sources (Neon fork sources can be found [here](https://github.com/neondatabase/postgres)) ## Join the development From 19954dfd8abe154b0db17d7eb45a04acec35cbaf Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Thu, 14 Apr 2022 13:31:37 +0300 Subject: [PATCH 089/296] Refactor proxy options test to not rely on the 'schema' argument. It was the only test that used the 'schema' argument to the connect() function. I'm about to refactor the option handling and will remove the special 'schema' argument altogether, so rewrite the test to not use it. --- test_runner/batch_others/test_proxy.py | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/test_runner/batch_others/test_proxy.py b/test_runner/batch_others/test_proxy.py index d2039f9758..a6f828f829 100644 --- a/test_runner/batch_others/test_proxy.py +++ b/test_runner/batch_others/test_proxy.py @@ -5,11 +5,14 @@ def test_proxy_select_1(static_proxy): static_proxy.safe_psql("select 1;") -@pytest.mark.xfail # Proxy eats the extra connection options +# Pass extra options to the server. +# +# Currently, proxy eats the extra connection options, so this fails. 
+# See https://github.com/neondatabase/neon/issues/1287 +@pytest.mark.xfail def test_proxy_options(static_proxy): - schema_name = "tmp_schema_1" - with static_proxy.connect(schema=schema_name) as conn: + with static_proxy.connect(options="-cproxytest.option=value") as conn: with conn.cursor() as cur: - cur.execute("SHOW search_path;") - search_path = cur.fetchall()[0][0] - assert schema_name == search_path + cur.execute("SHOW proxytest.option;") + value = cur.fetchall()[0][0] + assert value == 'value' From a009fe912a292c0df4479c98c4bb5d62c91e7b68 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Thu, 14 Apr 2022 13:31:40 +0300 Subject: [PATCH 090/296] Refactor connection option handling in python tests The PgProtocol.connect() function took extra options for username, database, etc. Remove those options, and have a generic way for each subclass of PgProtocol to provide some default options, with the capability to override them in the connect() call. --- test_runner/batch_others/test_createuser.py | 2 +- .../batch_others/test_parallel_copy.py | 5 + test_runner/batch_others/test_wal_acceptor.py | 2 +- .../batch_pg_regress/test_isolation.py | 6 +- .../batch_pg_regress/test_pg_regress.py | 6 +- .../batch_pg_regress/test_zenith_regress.py | 6 +- test_runner/fixtures/zenith_fixtures.py | 128 ++++++++---------- 7 files changed, 69 insertions(+), 86 deletions(-) diff --git a/test_runner/batch_others/test_createuser.py b/test_runner/batch_others/test_createuser.py index efb2af3f07..f4bbbc8a7a 100644 --- a/test_runner/batch_others/test_createuser.py +++ b/test_runner/batch_others/test_createuser.py @@ -28,4 +28,4 @@ def test_createuser(zenith_simple_env: ZenithEnv): pg2 = env.postgres.create_start('test_createuser2') # Test that you can connect to new branch as a new user - assert pg2.safe_psql('select current_user', username='testuser') == [('testuser', )] + assert pg2.safe_psql('select current_user', user='testuser') == [('testuser', )] diff --git
a/test_runner/batch_others/test_parallel_copy.py b/test_runner/batch_others/test_parallel_copy.py index 4b7cc58d42..a44acecf21 100644 --- a/test_runner/batch_others/test_parallel_copy.py +++ b/test_runner/batch_others/test_parallel_copy.py @@ -19,6 +19,11 @@ async def copy_test_data_to_table(pg: Postgres, worker_id: int, table_name: str) copy_input = repeat_bytes(buf.read(), 5000) pg_conn = await pg.connect_async() + + # PgProtocol.connect_async sets statement_timeout to 2 minutes. + # That's not enough for this test, on a slow system in debug mode. + await pg_conn.execute("SET statement_timeout='300s'") + await pg_conn.copy_to_table(table_name, source=copy_input) diff --git a/test_runner/batch_others/test_wal_acceptor.py b/test_runner/batch_others/test_wal_acceptor.py index 8f87ff041f..dffcd7cc61 100644 --- a/test_runner/batch_others/test_wal_acceptor.py +++ b/test_runner/batch_others/test_wal_acceptor.py @@ -379,7 +379,7 @@ class ProposerPostgres(PgProtocol): tenant_id: uuid.UUID, listen_addr: str, port: int): - super().__init__(host=listen_addr, port=port, username='zenith_admin') + super().__init__(host=listen_addr, port=port, user='zenith_admin', dbname='postgres') self.pgdata_dir: str = pgdata_dir self.pg_bin: PgBin = pg_bin diff --git a/test_runner/batch_pg_regress/test_isolation.py b/test_runner/batch_pg_regress/test_isolation.py index ddafc3815b..cde56d9b88 100644 --- a/test_runner/batch_pg_regress/test_isolation.py +++ b/test_runner/batch_pg_regress/test_isolation.py @@ -35,9 +35,9 @@ def test_isolation(zenith_simple_env: ZenithEnv, test_output_dir, pg_bin, capsys ] env_vars = { - 'PGPORT': str(pg.port), - 'PGUSER': pg.username, - 'PGHOST': pg.host, + 'PGPORT': str(pg.default_options['port']), + 'PGUSER': pg.default_options['user'], + 'PGHOST': pg.default_options['host'], } # Run the command. 
diff --git a/test_runner/batch_pg_regress/test_pg_regress.py b/test_runner/batch_pg_regress/test_pg_regress.py index 5199f65216..07d2574f4a 100644 --- a/test_runner/batch_pg_regress/test_pg_regress.py +++ b/test_runner/batch_pg_regress/test_pg_regress.py @@ -35,9 +35,9 @@ def test_pg_regress(zenith_simple_env: ZenithEnv, test_output_dir: str, pg_bin, ] env_vars = { - 'PGPORT': str(pg.port), - 'PGUSER': pg.username, - 'PGHOST': pg.host, + 'PGPORT': str(pg.default_options['port']), + 'PGUSER': pg.default_options['user'], + 'PGHOST': pg.default_options['host'], } # Run the command. diff --git a/test_runner/batch_pg_regress/test_zenith_regress.py b/test_runner/batch_pg_regress/test_zenith_regress.py index 31d5b07093..2b57137d16 100644 --- a/test_runner/batch_pg_regress/test_zenith_regress.py +++ b/test_runner/batch_pg_regress/test_zenith_regress.py @@ -40,9 +40,9 @@ def test_zenith_regress(zenith_simple_env: ZenithEnv, test_output_dir, pg_bin, c log.info(pg_regress_command) env_vars = { - 'PGPORT': str(pg.port), - 'PGUSER': pg.username, - 'PGHOST': pg.host, + 'PGPORT': str(pg.default_options['port']), + 'PGUSER': pg.default_options['user'], + 'PGHOST': pg.default_options['host'], } # Run the command. 
diff --git a/test_runner/fixtures/zenith_fixtures.py b/test_runner/fixtures/zenith_fixtures.py index a95809687a..41d1443880 100644 --- a/test_runner/fixtures/zenith_fixtures.py +++ b/test_runner/fixtures/zenith_fixtures.py @@ -27,6 +27,7 @@ from dataclasses import dataclass # Type-related stuff from psycopg2.extensions import connection as PgConnection +from psycopg2.extensions import make_dsn, parse_dsn from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional, TypeVar, cast, Union, Tuple from typing_extensions import Literal @@ -238,98 +239,69 @@ def port_distributor(worker_base_port): class PgProtocol: """ Reusable connection logic """ - def __init__(self, - host: str, - port: int, - username: Optional[str] = None, - password: Optional[str] = None, - dbname: Optional[str] = None, - schema: Optional[str] = None): - self.host = host - self.port = port - self.username = username - self.password = password - self.dbname = dbname - self.schema = schema + def __init__(self, **kwargs): + self.default_options = kwargs - def connstr(self, - *, - dbname: Optional[str] = None, - schema: Optional[str] = None, - username: Optional[str] = None, - password: Optional[str] = None, - statement_timeout_ms: Optional[int] = None) -> str: + def connstr(self, **kwargs) -> str: """ Build a libpq connection string for the Postgres instance. 
""" + return str(make_dsn(**self.conn_options(**kwargs))) - username = username or self.username - password = password or self.password - dbname = dbname or self.dbname or "postgres" - schema = schema or self.schema - res = f'host={self.host} port={self.port} dbname={dbname}' + def conn_options(self, **kwargs): + conn_options = self.default_options.copy() + if 'dsn' in kwargs: + conn_options.update(parse_dsn(kwargs['dsn'])) + conn_options.update(kwargs) - if username: - res = f'{res} user={username}' - - if password: - res = f'{res} password={password}' - - if schema: - res = f"{res} options='-c search_path={schema}'" - - if statement_timeout_ms: - res = f"{res} options='-c statement_timeout={statement_timeout_ms}'" - - return res + # Individual statement timeout in seconds. 2 minutes should be + # enough for our tests, but if you need a longer, you can + # change it by calling "SET statement_timeout" after + # connecting. + if 'options' in conn_options: + conn_options['options'] = f"-cstatement_timeout=120s " + conn_options['options'] + else: + conn_options['options'] = "-cstatement_timeout=120s" + return conn_options # autocommit=True here by default because that's what we need most of the time - def connect( - self, - *, - autocommit=True, - dbname: Optional[str] = None, - schema: Optional[str] = None, - username: Optional[str] = None, - password: Optional[str] = None, - # individual statement timeout in seconds, 2 minutes should be enough for our tests - statement_timeout: Optional[int] = 120 - ) -> PgConnection: + def connect(self, autocommit=True, **kwargs) -> PgConnection: """ Connect to the node. Returns psycopg2's connection object. This method passes all extra params to connstr. 
""" + conn = psycopg2.connect(**self.conn_options(**kwargs)) - conn = psycopg2.connect( - self.connstr(dbname=dbname, - schema=schema, - username=username, - password=password, - statement_timeout_ms=statement_timeout * - 1000 if statement_timeout else None)) # WARNING: this setting affects *all* tests! conn.autocommit = autocommit return conn - async def connect_async(self, - *, - dbname: str = 'postgres', - username: Optional[str] = None, - password: Optional[str] = None) -> asyncpg.Connection: + async def connect_async(self, **kwargs) -> asyncpg.Connection: """ Connect to the node from async python. Returns asyncpg's connection object. """ - conn = await asyncpg.connect( - host=self.host, - port=self.port, - database=dbname, - user=username or self.username, - password=password, - ) - return conn + # asyncpg takes slightly different options than psycopg2. Try + # to convert the defaults from the psycopg2 format. + + # The psycopg2 option 'dbname' is called 'database' in asyncpg + conn_options = self.conn_options(**kwargs) + if 'dbname' in conn_options: + conn_options['database'] = conn_options.pop('dbname') + + # Convert options='-c=' to server_settings + if 'options' in conn_options: + options = conn_options.pop('options') + for match in re.finditer(r'-c(\w*)=(\w*)', options): + key = match.group(1) + val = match.group(2) + if 'server_settings' in conn_options: + conn_options['server_settings'].update({key: val}) + else: + conn_options['server_settings'] = {key: val} + return await asyncpg.connect(**conn_options) def safe_psql(self, query: str, **kwargs: Any) -> List[Any]: """ @@ -1149,10 +1121,10 @@ class ZenithPageserver(PgProtocol): port: PageserverPort, remote_storage: Optional[RemoteStorage] = None, config_override: Optional[str] = None): - super().__init__(host='localhost', port=port.pg, username='zenith_admin') + super().__init__(host='localhost', port=port.pg, user='zenith_admin') self.env = env self.running = False - self.service_port = port # do not
shadow PgProtocol.port which is just int + self.service_port = port self.remote_storage = remote_storage self.config_override = config_override @@ -1291,7 +1263,7 @@ def pg_bin(test_output_dir: str) -> PgBin: class VanillaPostgres(PgProtocol): def __init__(self, pgdatadir: str, pg_bin: PgBin, port: int): - super().__init__(host='localhost', port=port) + super().__init__(host='localhost', port=port, dbname='postgres') self.pgdatadir = pgdatadir self.pg_bin = pg_bin self.running = False @@ -1335,8 +1307,14 @@ def vanilla_pg(test_output_dir: str) -> Iterator[VanillaPostgres]: class ZenithProxy(PgProtocol): def __init__(self, port: int): - super().__init__(host="127.0.0.1", username="pytest", password="pytest", port=port) + super().__init__(host="127.0.0.1", + user="pytest", + password="pytest", + port=port, + dbname='postgres') self.http_port = 7001 + self.host = "127.0.0.1" + self.port = port self._popen: Optional[subprocess.Popen[bytes]] = None def start_static(self, addr="127.0.0.1:5432") -> None: @@ -1380,13 +1358,13 @@ def static_proxy(vanilla_pg) -> Iterator[ZenithProxy]: class Postgres(PgProtocol): """ An object representing a running postgres daemon. """ def __init__(self, env: ZenithEnv, tenant_id: uuid.UUID, port: int): - super().__init__(host='localhost', port=port, username='zenith_admin') - + super().__init__(host='localhost', port=port, user='zenith_admin', dbname='postgres') self.env = env self.running = False self.node_name: Optional[str] = None # dubious, see asserts below self.pgdata_dir: Optional[str] = None # Path to computenode PGDATA self.tenant_id = tenant_id + self.port = port # path to conf is /pgdatadirs/tenants///postgresql.conf def create( From 4a8c66345267bfb11882a10d0260e2aacec6d112 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Thu, 14 Apr 2022 13:31:42 +0300 Subject: [PATCH 091/296] Refactor pgbench tests. - Remove batch_others/test_pgbench.py. 
It was a quick check that pgbench works, without actually recording any performance numbers, but that doesn't seem very interesting anymore. Remove it to avoid confusing it with the actual pgbench benchmarks. - Run pgbench with "-N" and "-S" options, for two different workloads: simple-update and SELECT-only. Previously, we would only run it with the "default" TPCB-like workload. That's more or less the same as the simple-update (-N) workload, but I think the simple-update workload is more relevant for testing storage performance. The SELECT-only workload is a new thing to measure. - Merge test_perf_pgbench.py and test_perf_pgbench_remote.py. I added a new "remote" implementation of the PgCompare class, which allows running the same tests against an already-running Postgres instance. - Make the PgBenchRunResult.parse_from_output function more flexible. pgbench can print different lines depending on the command-line options, but the parsing function expected a particular set of lines. --- .github/workflows/benchmarking.yml | 13 +- test_runner/batch_others/test_pgbench.py | 14 -- test_runner/fixtures/benchmark_fixture.py | 145 ++++++++---------- test_runner/fixtures/compare_fixtures.py | 49 +++++- test_runner/fixtures/zenith_fixtures.py | 68 ++++++-- test_runner/performance/test_perf_pgbench.py | 116 ++++++++++++-- .../performance/test_perf_pgbench_remote.py | 124 --------------- 7 files changed, 279 insertions(+), 250 deletions(-) delete mode 100644 test_runner/batch_others/test_pgbench.py delete mode 100644 test_runner/performance/test_perf_pgbench_remote.py diff --git a/.github/workflows/benchmarking.yml b/.github/workflows/benchmarking.yml index 36df35297d..72041c9d02 100644 --- a/.github/workflows/benchmarking.yml +++ b/.github/workflows/benchmarking.yml @@ -26,7 +26,7 @@ jobs: runs-on: [self-hosted, zenith-benchmarker] env: - PG_BIN: "/usr/pgsql-13/bin" + POSTGRES_DISTRIB_DIR: "/usr/pgsql-13" steps: - name: Checkout zenith repo @@ -51,7 +51,7 @@ jobs: echo
Poetry poetry --version echo Pgbench - $PG_BIN/pgbench --version + $POSTGRES_DISTRIB_DIR/bin/pgbench --version # FIXME cluster setup is skipped due to various changes in console API # for now pre created cluster is used. When API gain some stability @@ -66,7 +66,7 @@ jobs: echo "Starting cluster" # wake up the cluster - $PG_BIN/psql $BENCHMARK_CONNSTR -c "SELECT 1" + $POSTGRES_DISTRIB_DIR/bin/psql $BENCHMARK_CONNSTR -c "SELECT 1" - name: Run benchmark # pgbench is installed system wide from official repo @@ -83,8 +83,11 @@ jobs: # sudo yum install postgresql13-contrib # actual binaries are located in /usr/pgsql-13/bin/ env: - TEST_PG_BENCH_TRANSACTIONS_MATRIX: "5000,10000,20000" - TEST_PG_BENCH_SCALES_MATRIX: "10,15" + # The pgbench test runs two tests of given duration against each scale. + # So the total runtime with these parameters is 2 * 2 * 300 = 1200, or 20 minutes. + # Plus time needed to initialize the test databases. + TEST_PG_BENCH_DURATIONS_MATRIX: "300" + TEST_PG_BENCH_SCALES_MATRIX: "10,100" PLATFORM: "zenith-staging" BENCHMARK_CONNSTR: "${{ secrets.BENCHMARK_STAGING_CONNSTR }}" REMOTE_ENV: "1" # indicate to test harness that we do not have zenith binaries locally diff --git a/test_runner/batch_others/test_pgbench.py b/test_runner/batch_others/test_pgbench.py deleted file mode 100644 index 09713023bc..0000000000 --- a/test_runner/batch_others/test_pgbench.py +++ /dev/null @@ -1,14 +0,0 @@ -from fixtures.zenith_fixtures import ZenithEnv -from fixtures.log_helper import log - - -def test_pgbench(zenith_simple_env: ZenithEnv, pg_bin): - env = zenith_simple_env - env.zenith_cli.create_branch("test_pgbench", "empty") - pg = env.postgres.create_start('test_pgbench') - log.info("postgres is running on 'test_pgbench' branch") - - connstr = pg.connstr() - - pg_bin.run_capture(['pgbench', '-i', connstr]) - pg_bin.run_capture(['pgbench'] + '-c 10 -T 5 -P 1 -M prepared'.split() + [connstr]) diff --git a/test_runner/fixtures/benchmark_fixture.py 
b/test_runner/fixtures/benchmark_fixture.py index 480eb3f891..a904233e98 100644 --- a/test_runner/fixtures/benchmark_fixture.py +++ b/test_runner/fixtures/benchmark_fixture.py @@ -17,7 +17,7 @@ import warnings from contextlib import contextmanager # Type-related stuff -from typing import Iterator +from typing import Iterator, Optional """ This file contains fixtures for micro-benchmarks. @@ -51,17 +51,12 @@ in the test initialization, or measure disk usage after the test query. @dataclasses.dataclass class PgBenchRunResult: - scale: int number_of_clients: int number_of_threads: int number_of_transactions_actually_processed: int latency_average: float - latency_stddev: float - tps_including_connection_time: float - tps_excluding_connection_time: float - init_duration: float - init_start_timestamp: int - init_end_timestamp: int + latency_stddev: Optional[float] + tps: float run_duration: float run_start_timestamp: int run_end_timestamp: int @@ -69,56 +64,67 @@ class PgBenchRunResult: # TODO progress @classmethod - def parse_from_output( + def parse_from_stdout( cls, - out: 'subprocess.CompletedProcess[str]', - init_duration: float, - init_start_timestamp: int, - init_end_timestamp: int, + stdout: str, run_duration: float, run_start_timestamp: int, run_end_timestamp: int, ): - stdout_lines = out.stdout.splitlines() + stdout_lines = stdout.splitlines() + + latency_stddev = None + # we know significant parts of these values from test input # but to be precise take them from output - # scaling factor: 5 - assert "scaling factor" in stdout_lines[1] - scale = int(stdout_lines[1].split()[-1]) - # number of clients: 1 - assert "number of clients" in stdout_lines[3] - number_of_clients = int(stdout_lines[3].split()[-1]) - # number of threads: 1 - assert "number of threads" in stdout_lines[4] - number_of_threads = int(stdout_lines[4].split()[-1]) - # number of transactions actually processed: 1000/1000 - assert "number of transactions actually processed" in stdout_lines[6] - 
number_of_transactions_actually_processed = int(stdout_lines[6].split("/")[1]) - # latency average = 19.894 ms - assert "latency average" in stdout_lines[7] - latency_average = stdout_lines[7].split()[-2] - # latency stddev = 3.387 ms - assert "latency stddev" in stdout_lines[8] - latency_stddev = stdout_lines[8].split()[-2] - # tps = 50.219689 (including connections establishing) - assert "(including connections establishing)" in stdout_lines[9] - tps_including_connection_time = stdout_lines[9].split()[2] - # tps = 50.264435 (excluding connections establishing) - assert "(excluding connections establishing)" in stdout_lines[10] - tps_excluding_connection_time = stdout_lines[10].split()[2] + for line in stdout.splitlines(): + # scaling factor: 5 + if line.startswith("scaling factor:"): + scale = int(line.split()[-1]) + # number of clients: 1 + if line.startswith("number of clients: "): + number_of_clients = int(line.split()[-1]) + # number of threads: 1 + if line.startswith("number of threads: "): + number_of_threads = int(line.split()[-1]) + # number of transactions actually processed: 1000/1000 + # OR + # number of transactions actually processed: 1000 + if line.startswith("number of transactions actually processed"): + if "/" in line: + number_of_transactions_actually_processed = int(line.split("/")[1]) + else: + number_of_transactions_actually_processed = int(line.split()[-1]) + # latency average = 19.894 ms + if line.startswith("latency average"): + latency_average = float(line.split()[-2]) + # latency stddev = 3.387 ms + # (only printed with some options) + if line.startswith("latency stddev"): + latency_stddev = float(line.split()[-2]) + + # Get the TPS without initial connection time. 
The format + # of the tps lines changed in pgbench v14, but we accept + # either format: + # + # pgbench v13 and below: + # tps = 50.219689 (including connections establishing) + # tps = 50.264435 (excluding connections establishing) + # + # pgbench v14: + # initial connection time = 3.858 ms + # tps = 309.281539 (without initial connection time) + if (line.startswith("tps = ") and ("(excluding connections establishing)" in line + or "(without initial connection time)" in line)): + tps = float(line.split()[2]) return cls( - scale=scale, number_of_clients=number_of_clients, number_of_threads=number_of_threads, number_of_transactions_actually_processed=number_of_transactions_actually_processed, - latency_average=float(latency_average), - latency_stddev=float(latency_stddev), - tps_including_connection_time=float(tps_including_connection_time), - tps_excluding_connection_time=float(tps_excluding_connection_time), - init_duration=init_duration, - init_start_timestamp=init_start_timestamp, - init_end_timestamp=init_end_timestamp, + latency_average=latency_average, + latency_stddev=latency_stddev, + tps=tps, run_duration=run_duration, run_start_timestamp=run_start_timestamp, run_end_timestamp=run_end_timestamp, @@ -187,60 +193,41 @@ class ZenithBenchmarker: report=MetricReport.LOWER_IS_BETTER, ) - def record_pg_bench_result(self, pg_bench_result: PgBenchRunResult): - self.record("scale", pg_bench_result.scale, '', MetricReport.TEST_PARAM) - self.record("number_of_clients", + def record_pg_bench_result(self, prefix: str, pg_bench_result: PgBenchRunResult): + self.record(f"{prefix}.number_of_clients", pg_bench_result.number_of_clients, '', MetricReport.TEST_PARAM) - self.record("number_of_threads", + self.record(f"{prefix}.number_of_threads", pg_bench_result.number_of_threads, '', MetricReport.TEST_PARAM) self.record( - "number_of_transactions_actually_processed", + f"{prefix}.number_of_transactions_actually_processed", pg_bench_result.number_of_transactions_actually_processed, 
'', # thats because this is predefined by test matrix and doesnt change across runs report=MetricReport.TEST_PARAM, ) - self.record("latency_average", + self.record(f"{prefix}.latency_average", pg_bench_result.latency_average, unit="ms", report=MetricReport.LOWER_IS_BETTER) - self.record("latency_stddev", - pg_bench_result.latency_stddev, - unit="ms", - report=MetricReport.LOWER_IS_BETTER) - self.record("tps_including_connection_time", - pg_bench_result.tps_including_connection_time, - '', - report=MetricReport.HIGHER_IS_BETTER) - self.record("tps_excluding_connection_time", - pg_bench_result.tps_excluding_connection_time, - '', - report=MetricReport.HIGHER_IS_BETTER) - self.record("init_duration", - pg_bench_result.init_duration, - unit="s", - report=MetricReport.LOWER_IS_BETTER) - self.record("init_start_timestamp", - pg_bench_result.init_start_timestamp, - '', - MetricReport.TEST_PARAM) - self.record("init_end_timestamp", - pg_bench_result.init_end_timestamp, - '', - MetricReport.TEST_PARAM) - self.record("run_duration", + if pg_bench_result.latency_stddev is not None: + self.record(f"{prefix}.latency_stddev", + pg_bench_result.latency_stddev, + unit="ms", + report=MetricReport.LOWER_IS_BETTER) + self.record(f"{prefix}.tps", pg_bench_result.tps, '', report=MetricReport.HIGHER_IS_BETTER) + self.record(f"{prefix}.run_duration", pg_bench_result.run_duration, unit="s", report=MetricReport.LOWER_IS_BETTER) - self.record("run_start_timestamp", + self.record(f"{prefix}.run_start_timestamp", pg_bench_result.run_start_timestamp, '', MetricReport.TEST_PARAM) - self.record("run_end_timestamp", + self.record(f"{prefix}.run_end_timestamp", pg_bench_result.run_end_timestamp, '', MetricReport.TEST_PARAM) diff --git a/test_runner/fixtures/compare_fixtures.py b/test_runner/fixtures/compare_fixtures.py index 598ee10f8e..3c6a923587 100644 --- a/test_runner/fixtures/compare_fixtures.py +++ b/test_runner/fixtures/compare_fixtures.py @@ -2,7 +2,7 @@ import pytest from contextlib 
import contextmanager from abc import ABC, abstractmethod -from fixtures.zenith_fixtures import PgBin, PgProtocol, VanillaPostgres, ZenithEnv +from fixtures.zenith_fixtures import PgBin, PgProtocol, VanillaPostgres, RemotePostgres, ZenithEnv from fixtures.benchmark_fixture import MetricReport, ZenithBenchmarker # Type-related stuff @@ -162,6 +162,48 @@ class VanillaCompare(PgCompare): return self.zenbenchmark.record_duration(out_name) +class RemoteCompare(PgCompare): + """PgCompare interface for a remote postgres instance.""" + def __init__(self, zenbenchmark, remote_pg: RemotePostgres): + self._pg = remote_pg + self._zenbenchmark = zenbenchmark + + # Long-lived cursor, useful for flushing + self.conn = self.pg.connect() + self.cur = self.conn.cursor() + + @property + def pg(self): + return self._pg + + @property + def zenbenchmark(self): + return self._zenbenchmark + + @property + def pg_bin(self): + return self._pg.pg_bin + + def flush(self): + # TODO: flush the remote pageserver + pass + + def report_peak_memory_use(self) -> None: + # TODO: get memory usage from remote pageserver + pass + + def report_size(self) -> None: + # TODO: get storage size from remote pageserver + pass + + @contextmanager + def record_pageserver_writes(self, out_name): + yield # Do nothing + + def record_duration(self, out_name): + return self.zenbenchmark.record_duration(out_name) + + @pytest.fixture(scope='function') def zenith_compare(request, zenbenchmark, pg_bin, zenith_simple_env) -> ZenithCompare: branch_name = request.node.name @@ -173,6 +215,11 @@ def vanilla_compare(zenbenchmark, vanilla_pg) -> VanillaCompare: return VanillaCompare(zenbenchmark, vanilla_pg) +@pytest.fixture(scope='function') +def remote_compare(zenbenchmark, remote_pg) -> RemoteCompare: + return RemoteCompare(zenbenchmark, remote_pg) + + @pytest.fixture(params=["vanilla_compare", "zenith_compare"], ids=["vanilla", "zenith"]) def zenith_with_baseline(request) -> PgCompare: """Parameterized fixture that helps 
compare zenith against vanilla postgres. diff --git a/test_runner/fixtures/zenith_fixtures.py b/test_runner/fixtures/zenith_fixtures.py index 41d1443880..f8ee39a5a1 100644 --- a/test_runner/fixtures/zenith_fixtures.py +++ b/test_runner/fixtures/zenith_fixtures.py @@ -123,6 +123,22 @@ def pytest_configure(config): top_output_dir = os.path.join(base_dir, DEFAULT_OUTPUT_DIR) mkdir_if_needed(top_output_dir) + # Find the postgres installation. + global pg_distrib_dir + env_postgres_bin = os.environ.get('POSTGRES_DISTRIB_DIR') + if env_postgres_bin: + pg_distrib_dir = env_postgres_bin + else: + pg_distrib_dir = os.path.normpath(os.path.join(base_dir, DEFAULT_POSTGRES_DIR)) + log.info(f'pg_distrib_dir is {pg_distrib_dir}') + if os.getenv("REMOTE_ENV"): + # When testing against a remote server, we only need the client binary. + if not os.path.exists(os.path.join(pg_distrib_dir, 'bin/psql')): + raise Exception('psql not found at "{}"'.format(pg_distrib_dir)) + else: + if not os.path.exists(os.path.join(pg_distrib_dir, 'bin/postgres')): + raise Exception('postgres not found at "{}"'.format(pg_distrib_dir)) + if os.getenv("REMOTE_ENV"): # we are in remote env and do not have zenith binaries locally # this is the case for benchmarks run on self-hosted runner @@ -138,17 +154,6 @@ def pytest_configure(config): if not os.path.exists(os.path.join(zenith_binpath, 'pageserver')): raise Exception('zenith binaries not found at "{}"'.format(zenith_binpath)) - # Find the postgres installation. 
- global pg_distrib_dir - env_postgres_bin = os.environ.get('POSTGRES_DISTRIB_DIR') - if env_postgres_bin: - pg_distrib_dir = env_postgres_bin - else: - pg_distrib_dir = os.path.normpath(os.path.join(base_dir, DEFAULT_POSTGRES_DIR)) - log.info(f'pg_distrib_dir is {pg_distrib_dir}') - if not os.path.exists(os.path.join(pg_distrib_dir, 'bin/postgres')): - raise Exception('postgres not found at "{}"'.format(pg_distrib_dir)) - def zenfixture(func: Fn) -> Fn: """ @@ -1305,6 +1310,47 @@ def vanilla_pg(test_output_dir: str) -> Iterator[VanillaPostgres]: yield vanilla_pg +class RemotePostgres(PgProtocol): + def __init__(self, pg_bin: PgBin, remote_connstr: str): + super().__init__(**parse_dsn(remote_connstr)) + self.pg_bin = pg_bin + # The remote server is assumed to be running already + self.running = True + + def configure(self, options: List[str]): + raise Exception('cannot change configuration of remote Postgres instance') + + def start(self): + raise Exception('cannot start a remote Postgres instance') + + def stop(self): + raise Exception('cannot stop a remote Postgres instance') + + def get_subdir_size(self, subdir) -> int: + # TODO: Could use the server's Generic File Access functions if superuser. 
+ # See https://www.postgresql.org/docs/14/functions-admin.html#FUNCTIONS-ADMIN-GENFILE + raise Exception('cannot get size of a Postgres instance') + + def __enter__(self): + return self + + def __exit__(self, exc_type, exc, tb): + # do nothing + pass + + +@pytest.fixture(scope='function') +def remote_pg(test_output_dir: str) -> Iterator[RemotePostgres]: + pg_bin = PgBin(test_output_dir) + + connstr = os.getenv("BENCHMARK_CONNSTR") + if connstr is None: + raise ValueError("no connstr provided, use BENCHMARK_CONNSTR environment variable") + + with RemotePostgres(pg_bin, connstr) as remote_pg: + yield remote_pg + + class ZenithProxy(PgProtocol): def __init__(self, port: int): super().__init__(host="127.0.0.1", diff --git a/test_runner/performance/test_perf_pgbench.py b/test_runner/performance/test_perf_pgbench.py index 5ffce3c0be..d2de76913a 100644 --- a/test_runner/performance/test_perf_pgbench.py +++ b/test_runner/performance/test_perf_pgbench.py @@ -2,29 +2,113 @@ from contextlib import closing from fixtures.zenith_fixtures import PgBin, VanillaPostgres, ZenithEnv from fixtures.compare_fixtures import PgCompare, VanillaCompare, ZenithCompare -from fixtures.benchmark_fixture import MetricReport, ZenithBenchmarker +from fixtures.benchmark_fixture import PgBenchRunResult, MetricReport, ZenithBenchmarker from fixtures.log_helper import log +from pathlib import Path + +import pytest +from datetime import datetime +import calendar +import os +import timeit + + +def utc_now_timestamp() -> int: + return calendar.timegm(datetime.utcnow().utctimetuple()) + + +def init_pgbench(env: PgCompare, cmdline): + # calculate timestamps and durations separately + # timestamp is intended to be used for linking to grafana and logs + # duration is actually a metric and uses float instead of int for timestamp + init_start_timestamp = utc_now_timestamp() + t0 = timeit.default_timer() + with env.record_pageserver_writes('init.pageserver_writes'): + env.pg_bin.run_capture(cmdline) + 
env.flush() + init_duration = timeit.default_timer() - t0 + init_end_timestamp = utc_now_timestamp() + + env.zenbenchmark.record("init.duration", + init_duration, + unit="s", + report=MetricReport.LOWER_IS_BETTER) + env.zenbenchmark.record("init.start_timestamp", + init_start_timestamp, + '', + MetricReport.TEST_PARAM) + env.zenbenchmark.record("init.end_timestamp", init_end_timestamp, '', MetricReport.TEST_PARAM) + + +def run_pgbench(env: PgCompare, prefix: str, cmdline): + with env.record_pageserver_writes(f'{prefix}.pageserver_writes'): + run_start_timestamp = utc_now_timestamp() + t0 = timeit.default_timer() + out = env.pg_bin.run_capture(cmdline, ) + run_duration = timeit.default_timer() - t0 + run_end_timestamp = utc_now_timestamp() + env.flush() + + stdout = Path(f"{out}.stdout").read_text() + + res = PgBenchRunResult.parse_from_stdout( + stdout=stdout, + run_duration=run_duration, + run_start_timestamp=run_start_timestamp, + run_end_timestamp=run_end_timestamp, + ) + env.zenbenchmark.record_pg_bench_result(prefix, res) + # -# Run a very short pgbench test. +# Initialize a pgbench database, and run pgbench against it. # -# Collects three metrics: +# This runs two different pgbench workloads against the same +# initialized database, and 'duration' is the time of each run. So +# the total runtime is 2 * duration, plus time needed to initialize +# the test database. # -# 1. Time to initialize the pgbench database (pgbench -s5 -i) -# 2. Time to run 5000 pgbench transactions -# 3. 
Disk space used -# -def test_pgbench(zenith_with_baseline: PgCompare): - env = zenith_with_baseline +# Currently, the # of connections is hardcoded at 4 +def run_test_pgbench(env: PgCompare, scale: int, duration: int): - with env.record_pageserver_writes('pageserver_writes'): - with env.record_duration('init'): - env.pg_bin.run_capture(['pgbench', '-s5', '-i', env.pg.connstr()]) - env.flush() + # Record the scale and initialize + env.zenbenchmark.record("scale", scale, '', MetricReport.TEST_PARAM) + init_pgbench(env, ['pgbench', f'-s{scale}', '-i', env.pg.connstr()]) - with env.record_duration('5000_xacts'): - env.pg_bin.run_capture(['pgbench', '-c1', '-t5000', env.pg.connstr()]) - env.flush() + # Run simple-update workload + run_pgbench(env, + "simple-update", + ['pgbench', '-n', '-c4', f'-T{duration}', '-P2', '-Mprepared', env.pg.connstr()]) + + # Run SELECT workload + run_pgbench(env, + "select-only", + ['pgbench', '-S', '-c4', f'-T{duration}', '-P2', '-Mprepared', env.pg.connstr()]) env.report_size() + + +def get_durations_matrix(): + durations = os.getenv("TEST_PG_BENCH_DURATIONS_MATRIX", default="45") + return list(map(int, durations.split(","))) + + +def get_scales_matrix(): + scales = os.getenv("TEST_PG_BENCH_SCALES_MATRIX", default="10") + return list(map(int, scales.split(","))) + + +# Run the pgbench tests against vanilla Postgres and zenith +@pytest.mark.parametrize("scale", get_scales_matrix()) +@pytest.mark.parametrize("duration", get_durations_matrix()) +def test_pgbench(zenith_with_baseline: PgCompare, scale: int, duration: int): + run_test_pgbench(zenith_with_baseline, scale, duration) + + +# Run the pgbench tests against an existing Postgres cluster +@pytest.mark.parametrize("scale", get_scales_matrix()) +@pytest.mark.parametrize("duration", get_durations_matrix()) +@pytest.mark.remote_cluster +def test_pgbench_remote(remote_compare: PgCompare, scale: int, duration: int): + run_test_pgbench(remote_compare, scale, duration) diff --git 
a/test_runner/performance/test_perf_pgbench_remote.py b/test_runner/performance/test_perf_pgbench_remote.py deleted file mode 100644 index 28472a16c8..0000000000 --- a/test_runner/performance/test_perf_pgbench_remote.py +++ /dev/null @@ -1,124 +0,0 @@ -import dataclasses -import os -import subprocess -from typing import List -from fixtures.benchmark_fixture import PgBenchRunResult, ZenithBenchmarker -import pytest -from datetime import datetime -import calendar -import timeit -import os - - -def utc_now_timestamp() -> int: - return calendar.timegm(datetime.utcnow().utctimetuple()) - - -@dataclasses.dataclass -class PgBenchRunner: - connstr: str - scale: int - transactions: int - pgbench_bin_path: str = "pgbench" - - def invoke(self, args: List[str]) -> 'subprocess.CompletedProcess[str]': - res = subprocess.run([self.pgbench_bin_path, *args], text=True, capture_output=True) - - if res.returncode != 0: - raise RuntimeError(f"pgbench failed. stdout: {res.stdout} stderr: {res.stderr}") - return res - - def init(self, vacuum: bool = True) -> 'subprocess.CompletedProcess[str]': - args = [] - if not vacuum: - args.append("--no-vacuum") - args.extend([f"--scale={self.scale}", "--initialize", self.connstr]) - return self.invoke(args) - - def run(self, jobs: int = 1, clients: int = 1): - return self.invoke([ - f"--transactions={self.transactions}", - f"--jobs={jobs}", - f"--client={clients}", - "--progress=2", # print progress every two seconds - self.connstr, - ]) - - -@pytest.fixture -def connstr(): - res = os.getenv("BENCHMARK_CONNSTR") - if res is None: - raise ValueError("no connstr provided, use BENCHMARK_CONNSTR environment variable") - return res - - -def get_transactions_matrix(): - transactions = os.getenv("TEST_PG_BENCH_TRANSACTIONS_MATRIX") - if transactions is None: - return [10**4, 10**5] - return list(map(int, transactions.split(","))) - - -def get_scales_matrix(): - scales = os.getenv("TEST_PG_BENCH_SCALES_MATRIX") - if scales is None: - return [10, 20] - 
return list(map(int, scales.split(","))) - - -@pytest.mark.parametrize("scale", get_scales_matrix()) -@pytest.mark.parametrize("transactions", get_transactions_matrix()) -@pytest.mark.remote_cluster -def test_pg_bench_remote_cluster(zenbenchmark: ZenithBenchmarker, - connstr: str, - scale: int, - transactions: int): - """ - The best way is to run same pack of tests both, for local zenith - and against staging, but currently local tests heavily depend on - things available only locally e.g. zenith binaries, pageserver api, etc. - Also separate test allows to run pgbench workload against vanilla postgres - or other systems that support postgres protocol. - - Also now this is more of a liveness test because it stresses pageserver internals, - so we clearly see what goes wrong in more "real" environment. - """ - pg_bin = os.getenv("PG_BIN") - if pg_bin is not None: - pgbench_bin_path = os.path.join(pg_bin, "pgbench") - else: - pgbench_bin_path = "pgbench" - - runner = PgBenchRunner( - connstr=connstr, - scale=scale, - transactions=transactions, - pgbench_bin_path=pgbench_bin_path, - ) - # calculate timestamps and durations separately - # timestamp is intended to be used for linking to grafana and logs - # duration is actually a metric and uses float instead of int for timestamp - init_start_timestamp = utc_now_timestamp() - t0 = timeit.default_timer() - runner.init() - init_duration = timeit.default_timer() - t0 - init_end_timestamp = utc_now_timestamp() - - run_start_timestamp = utc_now_timestamp() - t0 = timeit.default_timer() - out = runner.run() # TODO handle failures - run_duration = timeit.default_timer() - t0 - run_end_timestamp = utc_now_timestamp() - - res = PgBenchRunResult.parse_from_output( - out=out, - init_duration=init_duration, - init_start_timestamp=init_start_timestamp, - init_end_timestamp=init_end_timestamp, - run_duration=run_duration, - run_start_timestamp=run_start_timestamp, - run_end_timestamp=run_end_timestamp, - ) - - 
zenbenchmark.record_pg_bench_result(res) From 9e4de6bed02e9dc48af5b9d74a7759b0c2702b26 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Tue, 12 Apr 2022 20:29:35 +0300 Subject: [PATCH 092/296] Use RwLock instad of Mutex for layer map lock. For more concurrency --- pageserver/src/layered_repository.rs | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index e178ba5222..95df385cfe 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -193,7 +193,7 @@ impl Repository for LayeredRepository { Arc::clone(&self.walredo_mgr), self.upload_layers, ); - timeline.layers.lock().unwrap().next_open_layer_at = Some(initdb_lsn); + timeline.layers.write().unwrap().next_open_layer_at = Some(initdb_lsn); let timeline = Arc::new(timeline); let r = timelines.insert( @@ -725,7 +725,7 @@ pub struct LayeredTimeline { tenantid: ZTenantId, timelineid: ZTimelineId, - layers: Mutex, + layers: RwLock, last_freeze_at: AtomicLsn, @@ -997,7 +997,7 @@ impl LayeredTimeline { conf, timelineid, tenantid, - layers: Mutex::new(LayerMap::default()), + layers: RwLock::new(LayerMap::default()), walredo_mgr, @@ -1040,7 +1040,7 @@ impl LayeredTimeline { /// Returns all timeline-related files that were found and loaded. /// fn load_layer_map(&self, disk_consistent_lsn: Lsn) -> anyhow::Result<()> { - let mut layers = self.layers.lock().unwrap(); + let mut layers = self.layers.write().unwrap(); let mut num_layers = 0; // Scan timeline directory and create ImageFileName and DeltaFilename @@ -1194,7 +1194,7 @@ impl LayeredTimeline { continue; } - let layers = timeline.layers.lock().unwrap(); + let layers = timeline.layers.read().unwrap(); // Check the open and frozen in-memory layers first if let Some(open_layer) = &layers.open_layer { @@ -1276,7 +1276,7 @@ impl LayeredTimeline { /// Get a handle to the latest layer for appending. 
/// fn get_layer_for_write(&self, lsn: Lsn) -> anyhow::Result> { - let mut layers = self.layers.lock().unwrap(); + let mut layers = self.layers.write().unwrap(); ensure!(lsn.is_aligned()); @@ -1347,7 +1347,7 @@ impl LayeredTimeline { } else { Some(self.write_lock.lock().unwrap()) }; - let mut layers = self.layers.lock().unwrap(); + let mut layers = self.layers.write().unwrap(); if let Some(open_layer) = &layers.open_layer { let open_layer_rc = Arc::clone(open_layer); // Does this layer need freezing? @@ -1412,7 +1412,7 @@ impl LayeredTimeline { let timer = self.flush_time_histo.start_timer(); loop { - let layers = self.layers.lock().unwrap(); + let layers = self.layers.read().unwrap(); if let Some(frozen_layer) = layers.frozen_layers.front() { let frozen_layer = Arc::clone(frozen_layer); drop(layers); // to allow concurrent reads and writes @@ -1456,7 +1456,7 @@ impl LayeredTimeline { // Finally, replace the frozen in-memory layer with the new on-disk layers { - let mut layers = self.layers.lock().unwrap(); + let mut layers = self.layers.write().unwrap(); let l = layers.frozen_layers.pop_front(); // Only one thread may call this function at a time (for this @@ -1612,7 +1612,7 @@ impl LayeredTimeline { lsn: Lsn, threshold: usize, ) -> Result { - let layers = self.layers.lock().unwrap(); + let layers = self.layers.read().unwrap(); for part_range in &partition.ranges { let image_coverage = layers.image_coverage(part_range, lsn)?; @@ -1670,7 +1670,7 @@ impl LayeredTimeline { // FIXME: Do we need to do something to upload it to remote storage here? 
- let mut layers = self.layers.lock().unwrap(); + let mut layers = self.layers.write().unwrap(); layers.insert_historic(Arc::new(image_layer)); drop(layers); @@ -1678,7 +1678,7 @@ impl LayeredTimeline { } fn compact_level0(&self, target_file_size: u64) -> Result<()> { - let layers = self.layers.lock().unwrap(); + let layers = self.layers.read().unwrap(); let level0_deltas = layers.get_level0_deltas()?; @@ -1768,7 +1768,7 @@ impl LayeredTimeline { layer_paths.pop().unwrap(); } - let mut layers = self.layers.lock().unwrap(); + let mut layers = self.layers.write().unwrap(); for l in new_layers { layers.insert_historic(Arc::new(l)); } @@ -1850,7 +1850,7 @@ impl LayeredTimeline { // 2. it doesn't need to be retained for 'retain_lsns'; // 3. newer on-disk image layers cover the layer's whole key range // - let mut layers = self.layers.lock().unwrap(); + let mut layers = self.layers.write().unwrap(); 'outer: for l in layers.iter_historic_layers() { // This layer is in the process of being flushed to disk. 
// It will be swapped out of the layer map, replaced with From d5ae9db997711d770b52511f8bbd2eef8067cedc Mon Sep 17 00:00:00 2001 From: bojanserafimov Date: Thu, 14 Apr 2022 10:09:03 -0400 Subject: [PATCH 093/296] Add s3 cost estimate to tests (#1478) --- pageserver/src/layered_repository.rs | 22 ++++++++++++++++- test_runner/fixtures/benchmark_fixture.py | 30 ++++++++++------------- test_runner/fixtures/compare_fixtures.py | 13 ++++++++++ 3 files changed, 47 insertions(+), 18 deletions(-) diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 95df385cfe..36b081e400 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -49,7 +49,8 @@ use crate::CheckpointConfig; use crate::{ZTenantId, ZTimelineId}; use zenith_metrics::{ - register_histogram_vec, register_int_gauge_vec, Histogram, HistogramVec, IntGauge, IntGaugeVec, + register_histogram_vec, register_int_counter, register_int_gauge_vec, Histogram, HistogramVec, + IntCounter, IntGauge, IntGaugeVec, }; use zenith_utils::crashsafe_dir; use zenith_utils::lsn::{AtomicLsn, Lsn, RecordLsn}; @@ -109,6 +110,21 @@ lazy_static! { .expect("failed to define a metric"); } +// Metrics for cloud upload. These metrics reflect data uploaded to cloud storage, +// or in testing they estimate how much we would upload if we did. +lazy_static! { + static ref NUM_PERSISTENT_FILES_CREATED: IntCounter = register_int_counter!( + "pageserver_num_persistent_files_created", + "Number of files created that are meant to be uploaded to cloud storage", + ) + .expect("failed to define a metric"); + static ref PERSISTENT_BYTES_WRITTEN: IntCounter = register_int_counter!( + "pageserver_persistent_bytes_written", + "Total bytes written that are meant to be uploaded to cloud storage", + ) + .expect("failed to define a metric"); +} + /// Parts of the `.zenith/tenants//timelines/` directory prefix. 
pub const TIMELINES_SEGMENT_NAME: &str = "timelines"; @@ -1524,6 +1540,10 @@ impl LayeredTimeline { &metadata, false, )?; + + NUM_PERSISTENT_FILES_CREATED.inc_by(1); + PERSISTENT_BYTES_WRITTEN.inc_by(new_delta_path.metadata()?.len()); + if self.upload_layers.load(atomic::Ordering::Relaxed) { schedule_timeline_checkpoint_upload( self.tenantid, diff --git a/test_runner/fixtures/benchmark_fixture.py b/test_runner/fixtures/benchmark_fixture.py index a904233e98..0735f16d73 100644 --- a/test_runner/fixtures/benchmark_fixture.py +++ b/test_runner/fixtures/benchmark_fixture.py @@ -236,10 +236,18 @@ class ZenithBenchmarker: """ Fetch the "cumulative # of bytes written" metric from the pageserver """ - # Fetch all the exposed prometheus metrics from page server - all_metrics = pageserver.http_client().get_metrics() - # Use a regular expression to extract the one we're interested in - # + metric_name = r'pageserver_disk_io_bytes{io_operation="write"}' + return self.get_int_counter_value(pageserver, metric_name) + + def get_peak_mem(self, pageserver) -> int: + """ + Fetch the "maxrss" metric from the pageserver + """ + metric_name = r'pageserver_maxrss_kb' + return self.get_int_counter_value(pageserver, metric_name) + + def get_int_counter_value(self, pageserver, metric_name) -> int: + """Fetch the value of given int counter from pageserver metrics.""" # TODO: If we start to collect more of the prometheus metrics in the # performance test suite like this, we should refactor this to load and # parse all the metrics into a more convenient structure in one go. @@ -247,20 +255,8 @@ class ZenithBenchmarker: # The metric should be an integer, as it's a number of bytes. But in general # all prometheus metrics are floats. So to be pedantic, read it as a float # and round to integer. 
- matches = re.search(r'^pageserver_disk_io_bytes{io_operation="write"} (\S+)$', - all_metrics, - re.MULTILINE) - assert matches - return int(round(float(matches.group(1)))) - - def get_peak_mem(self, pageserver) -> int: - """ - Fetch the "maxrss" metric from the pageserver - """ - # Fetch all the exposed prometheus metrics from page server all_metrics = pageserver.http_client().get_metrics() - # See comment in get_io_writes() - matches = re.search(r'^pageserver_maxrss_kb (\S+)$', all_metrics, re.MULTILINE) + matches = re.search(fr'^{metric_name} (\S+)$', all_metrics, re.MULTILINE) assert matches return int(round(float(matches.group(1)))) diff --git a/test_runner/fixtures/compare_fixtures.py b/test_runner/fixtures/compare_fixtures.py index 3c6a923587..93912d2da7 100644 --- a/test_runner/fixtures/compare_fixtures.py +++ b/test_runner/fixtures/compare_fixtures.py @@ -105,6 +105,19 @@ class ZenithCompare(PgCompare): 'MB', report=MetricReport.LOWER_IS_BETTER) + total_files = self.zenbenchmark.get_int_counter_value( + self.env.pageserver, "pageserver_num_persistent_files_created") + total_bytes = self.zenbenchmark.get_int_counter_value( + self.env.pageserver, "pageserver_persistent_bytes_written") + self.zenbenchmark.record("data_uploaded", + total_bytes / (1024 * 1024), + "MB", + report=MetricReport.LOWER_IS_BETTER) + self.zenbenchmark.record("num_files_uploaded", + total_files, + "", + report=MetricReport.LOWER_IS_BETTER) + def record_pageserver_writes(self, out_name): return self.zenbenchmark.record_pageserver_writes(self.env.pageserver, out_name) From 93e0ac2b7ae84747188d0da98061333b4a52a150 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Thu, 14 Apr 2022 16:17:47 +0300 Subject: [PATCH 094/296] Remove a couple of unused dependencies. 
Found by "cargo-udeps" --- Cargo.lock | 2 -- pageserver/Cargo.toml | 1 - proxy/Cargo.toml | 1 - 3 files changed, 4 deletions(-) diff --git a/Cargo.lock b/Cargo.lock index 0584b9d6d2..5027c4bdc7 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -1551,7 +1551,6 @@ dependencies = [ "tokio-util 0.7.0", "toml_edit", "tracing", - "tracing-futures", "url", "workspace_hack", "zenith_metrics", @@ -1938,7 +1937,6 @@ dependencies = [ "tokio-postgres 0.7.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", "tokio-postgres-rustls", "tokio-rustls 0.22.0", - "tokio-stream", "workspace_hack", "zenith_metrics", "zenith_utils", diff --git a/pageserver/Cargo.toml b/pageserver/Cargo.toml index dccdca291c..e92ac0421c 100644 --- a/pageserver/Cargo.toml +++ b/pageserver/Cargo.toml @@ -37,7 +37,6 @@ toml_edit = { version = "0.13", features = ["easy"] } scopeguard = "1.1.0" const_format = "0.2.21" tracing = "0.1.27" -tracing-futures = "0.2" signal-hook = "0.3.10" url = "2" nix = "0.23" diff --git a/proxy/Cargo.toml b/proxy/Cargo.toml index 56b6dd7e20..be03a2d4a9 100644 --- a/proxy/Cargo.toml +++ b/proxy/Cargo.toml @@ -31,7 +31,6 @@ thiserror = "1.0.30" tokio = { version = "1.17", features = ["macros"] } tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } tokio-rustls = "0.22.0" -tokio-stream = "0.1.8" zenith_utils = { path = "../zenith_utils" } zenith_metrics = { path = "../zenith_metrics" } From 2cb39a162431716eeb835656c45ca1cff4eab544 Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Thu, 14 Apr 2022 14:04:45 +0300 Subject: [PATCH 095/296] add missing files, update workspace hack --- Cargo.lock | 8 ++++---- workspace_hack/.gitattributes | 4 ++++ workspace_hack/Cargo.toml | 16 +++++++++++----- workspace_hack/build.rs | 2 ++ 4 files changed, 21 insertions(+), 9 deletions(-) create mode 100644 workspace_hack/.gitattributes create mode 100644 workspace_hack/build.rs diff --git 
a/Cargo.lock b/Cargo.lock index 5027c4bdc7..3a75687b36 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -2112,7 +2112,6 @@ dependencies = [ "serde_urlencoded", "tokio", "tokio-rustls 0.23.2", - "tokio-util 0.6.9", "url", "wasm-bindgen", "wasm-bindgen-futures", @@ -3390,19 +3389,20 @@ dependencies = [ "anyhow", "bytes", "cc", + "chrono", "clap 2.34.0", "either", "hashbrown", + "indexmap", "libc", "log", "memchr", "num-integer", "num-traits", - "proc-macro2", - "quote", + "prost", + "rand", "regex", "regex-syntax", - "reqwest", "scopeguard", "serde", "syn", diff --git a/workspace_hack/.gitattributes b/workspace_hack/.gitattributes new file mode 100644 index 0000000000..3e9dba4b64 --- /dev/null +++ b/workspace_hack/.gitattributes @@ -0,0 +1,4 @@ +# Avoid putting conflict markers in the generated Cargo.toml file, since their presence breaks +# Cargo. +# Also do not check out the file as CRLF on Windows, as that's what hakari needs. +Cargo.toml merge=binary -crlf diff --git a/workspace_hack/Cargo.toml b/workspace_hack/Cargo.toml index 6e6a0e09d7..84244b3363 100644 --- a/workspace_hack/Cargo.toml +++ b/workspace_hack/Cargo.toml @@ -16,32 +16,38 @@ publish = false [dependencies] anyhow = { version = "1", features = ["backtrace", "std"] } bytes = { version = "1", features = ["serde", "std"] } +chrono = { version = "0.4", features = ["clock", "libc", "oldtime", "serde", "std", "time", "winapi"] } clap = { version = "2", features = ["ansi_term", "atty", "color", "strsim", "suggestions", "vec_map"] } either = { version = "1", features = ["use_std"] } hashbrown = { version = "0.11", features = ["ahash", "inline-more", "raw"] } +indexmap = { version = "1", default-features = false, features = ["std"] } libc = { version = "0.2", features = ["extra_traits", "std"] } log = { version = "0.4", default-features = false, features = ["serde", "std"] } memchr = { version = "2", features = ["std", "use_std"] } num-integer = { version = "0.1", default-features = false, features = ["std"] } 
num-traits = { version = "0.2", features = ["std"] } +prost = { version = "0.9", features = ["prost-derive", "std"] } +rand = { version = "0.8", features = ["alloc", "getrandom", "libc", "rand_chacha", "rand_hc", "small_rng", "std", "std_rng"] } regex = { version = "1", features = ["aho-corasick", "memchr", "perf", "perf-cache", "perf-dfa", "perf-inline", "perf-literal", "std", "unicode", "unicode-age", "unicode-bool", "unicode-case", "unicode-gencat", "unicode-perl", "unicode-script", "unicode-segment"] } regex-syntax = { version = "0.6", features = ["unicode", "unicode-age", "unicode-bool", "unicode-case", "unicode-gencat", "unicode-perl", "unicode-script", "unicode-segment"] } -reqwest = { version = "0.11", default-features = false, features = ["__rustls", "__tls", "blocking", "hyper-rustls", "json", "rustls", "rustls-pemfile", "rustls-tls", "rustls-tls-webpki-roots", "serde_json", "stream", "tokio-rustls", "tokio-util", "webpki-roots"] } scopeguard = { version = "1", features = ["use_std"] } serde = { version = "1", features = ["alloc", "derive", "serde_derive", "std"] } -tokio = { version = "1", features = ["bytes", "fs", "io-util", "libc", "macros", "memchr", "mio", "net", "num_cpus", "once_cell", "process", "rt", "rt-multi-thread", "signal-hook-registry", "sync", "time", "tokio-macros"] } -tracing = { version = "0.1", features = ["attributes", "std", "tracing-attributes"] } +tokio = { version = "1", features = ["bytes", "fs", "io-std", "io-util", "libc", "macros", "memchr", "mio", "net", "num_cpus", "once_cell", "process", "rt", "rt-multi-thread", "signal-hook-registry", "socket2", "sync", "time", "tokio-macros"] } +tracing = { version = "0.1", features = ["attributes", "log", "std", "tracing-attributes"] } tracing-core = { version = "0.1", features = ["lazy_static", "std"] } [build-dependencies] +anyhow = { version = "1", features = ["backtrace", "std"] } +bytes = { version = "1", features = ["serde", "std"] } cc = { version = "1", default-features = false, 
features = ["jobserver", "parallel"] } clap = { version = "2", features = ["ansi_term", "atty", "color", "strsim", "suggestions", "vec_map"] } either = { version = "1", features = ["use_std"] } +hashbrown = { version = "0.11", features = ["ahash", "inline-more", "raw"] } +indexmap = { version = "1", default-features = false, features = ["std"] } libc = { version = "0.2", features = ["extra_traits", "std"] } log = { version = "0.4", default-features = false, features = ["serde", "std"] } memchr = { version = "2", features = ["std", "use_std"] } -proc-macro2 = { version = "1", features = ["proc-macro"] } -quote = { version = "1", features = ["proc-macro"] } +prost = { version = "0.9", features = ["prost-derive", "std"] } regex = { version = "1", features = ["aho-corasick", "memchr", "perf", "perf-cache", "perf-dfa", "perf-inline", "perf-literal", "std", "unicode", "unicode-age", "unicode-bool", "unicode-case", "unicode-gencat", "unicode-perl", "unicode-script", "unicode-segment"] } regex-syntax = { version = "0.6", features = ["unicode", "unicode-age", "unicode-bool", "unicode-case", "unicode-gencat", "unicode-perl", "unicode-script", "unicode-segment"] } serde = { version = "1", features = ["alloc", "derive", "serde_derive", "std"] } diff --git a/workspace_hack/build.rs b/workspace_hack/build.rs new file mode 100644 index 0000000000..92518ef04c --- /dev/null +++ b/workspace_hack/build.rs @@ -0,0 +1,2 @@ +// A build script is required for cargo to consider build dependencies. 
+fn main() {} From e97f94cc30b7f08f308ce4086eae2f9497b0e413 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Thu, 14 Apr 2022 19:49:01 +0300 Subject: [PATCH 096/296] Bump rustc version --- .circleci/config.yml | 4 ++-- Dockerfile | 8 ++++---- Dockerfile.build | 2 +- Dockerfile.compute-tools | 2 +- README.md | 2 +- 5 files changed, 9 insertions(+), 9 deletions(-) diff --git a/.circleci/config.yml b/.circleci/config.yml index f05e64072a..5aae143e48 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -5,10 +5,10 @@ executors: resource_class: xlarge docker: # NB: when changed, do not forget to update rust image tag in all Dockerfiles - - image: zimg/rust:1.56 + - image: zimg/rust:1.58 zenith-executor: docker: - - image: zimg/rust:1.56 + - image: zimg/rust:1.58 jobs: check-codestyle-rust: diff --git a/Dockerfile b/Dockerfile index babc3b8e1d..955d26cd0b 100644 --- a/Dockerfile +++ b/Dockerfile @@ -1,7 +1,7 @@ # Build Postgres # -#FROM zimg/rust:1.56 AS pg-build -FROM zenithdb/build:buster-20220309 AS pg-build +#FROM zimg/rust:1.58 AS pg-build +FROM zenithdb/build:buster-20220414 AS pg-build WORKDIR /pg USER root @@ -17,8 +17,8 @@ RUN set -e \ # Build zenith binaries # -#FROM zimg/rust:1.56 AS build -FROM zenithdb/build:buster-20220309 AS build +#FROM zimg/rust:1.58 AS build +FROM zenithdb/build:buster-20220414 AS build ARG GIT_VERSION=local ARG CACHEPOT_BUCKET=zenith-rust-cachepot diff --git a/Dockerfile.build b/Dockerfile.build index 44a2aaafb9..c7d239647f 100644 --- a/Dockerfile.build +++ b/Dockerfile.build @@ -1,4 +1,4 @@ -FROM rust:1.56.1-slim-buster +FROM rust:1.58-slim-buster WORKDIR /home/circleci/project RUN set -e \ diff --git a/Dockerfile.compute-tools b/Dockerfile.compute-tools index f7672251e6..6a35a71bb3 100644 --- a/Dockerfile.compute-tools +++ b/Dockerfile.compute-tools @@ -1,6 +1,6 @@ # First transient image to build compute_tools binaries # NB: keep in sync with rust image version in .circle/config.yml -FROM zenithdb/build:buster-20220309 
AS rust-build +FROM zenithdb/build:buster-20220414 AS rust-build WORKDIR /zenith diff --git a/README.md b/README.md index f99785e683..03f86887a7 100644 --- a/README.md +++ b/README.md @@ -31,7 +31,7 @@ apt install build-essential libtool libreadline-dev zlib1g-dev flex bison libsec libssl-dev clang pkg-config libpq-dev ``` -[Rust] 1.56.1 or later is also required. +[Rust] 1.58 or later is also required. To run the `psql` client, install the `postgresql-client` package or modify `PATH` and `LD_LIBRARY_PATH` to include `tmp_install/bin` and `tmp_install/lib`, respectively. From c9d897f9b6fecce83549aea725fd79cd8bdcdad8 Mon Sep 17 00:00:00 2001 From: Dmitry Ivanov Date: Fri, 15 Apr 2022 12:06:25 +0300 Subject: [PATCH 097/296] [proxy] Update rustls (#1510) --- Cargo.lock | 33 +++++++++++---------------------- proxy/Cargo.toml | 7 ++++--- proxy/src/config.rs | 28 +++++++++++++++++----------- proxy/src/proxy.rs | 18 ++++++++++++++---- 4 files changed, 46 insertions(+), 40 deletions(-) diff --git a/Cargo.lock b/Cargo.lock index 3a75687b36..6409b33055 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -1066,7 +1066,7 @@ dependencies = [ "hyper", "rustls 0.20.2", "tokio", - "tokio-rustls 0.23.2", + "tokio-rustls", ] [[package]] @@ -1926,7 +1926,8 @@ dependencies = [ "reqwest", "routerify 2.2.0", "rstest", - "rustls 0.19.1", + "rustls 0.20.2", + "rustls-pemfile", "scopeguard", "serde", "serde_json", @@ -1936,7 +1937,7 @@ dependencies = [ "tokio", "tokio-postgres 0.7.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", "tokio-postgres-rustls", - "tokio-rustls 0.22.0", + "tokio-rustls", "workspace_hack", "zenith_metrics", "zenith_utils", @@ -2111,7 +2112,7 @@ dependencies = [ "serde_json", "serde_urlencoded", "tokio", - "tokio-rustls 0.23.2", + "tokio-rustls", "url", "wasm-bindgen", "wasm-bindgen-futures", @@ -2823,35 +2824,23 @@ dependencies = [ [[package]] name = "tokio-postgres-rustls" -version = "0.8.0" +version = "0.9.0" source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "7bd8c37d8c23cb6ecdc32fc171bade4e9c7f1be65f693a17afbaad02091a0a19" +checksum = "606f2b73660439474394432239c82249c0d45eb5f23d91f401be1e33590444a7" dependencies = [ "futures", "ring", - "rustls 0.19.1", + "rustls 0.20.2", "tokio", "tokio-postgres 0.7.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", - "tokio-rustls 0.22.0", - "webpki 0.21.4", + "tokio-rustls", ] [[package]] name = "tokio-rustls" -version = "0.22.0" +version = "0.23.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bc6844de72e57df1980054b38be3a9f4702aba4858be64dd700181a8a6d0e1b6" -dependencies = [ - "rustls 0.19.1", - "tokio", - "webpki 0.21.4", -] - -[[package]] -name = "tokio-rustls" -version = "0.23.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a27d5f2b839802bd8267fa19b0530f5a08b9c08cd417976be2a65d130fe1c11b" +checksum = "4151fda0cf2798550ad0b34bcfc9b9dcc2a9d2471c895c68f3a8818e54f2389e" dependencies = [ "rustls 0.20.2", "tokio", diff --git a/proxy/Cargo.toml b/proxy/Cargo.toml index be03a2d4a9..20b459988a 100644 --- a/proxy/Cargo.toml +++ b/proxy/Cargo.toml @@ -21,7 +21,8 @@ pin-project-lite = "0.2.7" rand = "0.8.3" reqwest = { version = "0.11", default-features = false, features = ["blocking", "json", "rustls-tls"] } routerify = "2" -rustls = "0.19.1" +rustls = "0.20.0" +rustls-pemfile = "0.2.1" scopeguard = "1.1.0" serde = "1" serde_json = "1" @@ -30,7 +31,7 @@ socket2 = "0.4.4" thiserror = "1.0.30" tokio = { version = "1.17", features = ["macros"] } tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } -tokio-rustls = "0.22.0" +tokio-rustls = "0.23.0" zenith_utils = { path = "../zenith_utils" } zenith_metrics = { path = "../zenith_metrics" } @@ -40,4 +41,4 @@ workspace_hack = { version = "0.1", path = "../workspace_hack" } async-trait = "0.1" 
rcgen = "0.8.14" rstest = "0.12" -tokio-postgres-rustls = "0.8.0" +tokio-postgres-rustls = "0.9.0" diff --git a/proxy/src/config.rs b/proxy/src/config.rs index 077ff02898..aef079d089 100644 --- a/proxy/src/config.rs +++ b/proxy/src/config.rs @@ -1,10 +1,9 @@ -use anyhow::{anyhow, bail, ensure, Context}; -use rustls::{internal::pemfile, NoClientAuth, ProtocolVersion, ServerConfig}; +use anyhow::{bail, ensure, Context}; use std::net::SocketAddr; use std::str::FromStr; use std::sync::Arc; -pub type TlsConfig = Arc<ServerConfig>; +pub type TlsConfig = Arc<rustls::ServerConfig>; #[non_exhaustive] pub enum ClientAuthMethod { @@ -61,21 +60,28 @@ pub struct ProxyConfig { pub fn configure_ssl(key_path: &str, cert_path: &str) -> anyhow::Result<TlsConfig> { let key = { let key_bytes = std::fs::read(key_path).context("SSL key file")?; - let mut keys = pemfile::pkcs8_private_keys(&mut &key_bytes[..]) - .map_err(|_| anyhow!("couldn't read TLS keys"))?; + let mut keys = rustls_pemfile::pkcs8_private_keys(&mut &key_bytes[..]) + .context("couldn't read TLS keys")?; + ensure!(keys.len() == 1, "keys.len() = {} (should be 1)", keys.len()); - keys.pop().unwrap() + keys.pop().map(rustls::PrivateKey).unwrap() }; let cert_chain = { let cert_chain_bytes = std::fs::read(cert_path).context("SSL cert file")?; - pemfile::certs(&mut &cert_chain_bytes[..]) - .map_err(|_| anyhow!("couldn't read TLS certificates"))? + rustls_pemfile::certs(&mut &cert_chain_bytes[..]) + .context("couldn't read TLS certificate chain")? + .into_iter() + .map(rustls::Certificate) + .collect() }; - let mut config = ServerConfig::new(NoClientAuth::new()); - config.set_single_cert(cert_chain, key)?; - config.versions = vec![ProtocolVersion::TLSv1_3]; + let config = rustls::ServerConfig::builder() + .with_safe_default_cipher_suites() + .with_safe_default_kx_groups() + .with_protocol_versions(&[&rustls::version::TLS13])? 
+ .with_no_client_auth() + .with_single_cert(cert_chain, key)?; Ok(config.into()) } diff --git a/proxy/src/proxy.rs b/proxy/src/proxy.rs index 5b662f4c69..788179252b 100644 --- a/proxy/src/proxy.rs +++ b/proxy/src/proxy.rs @@ -265,14 +265,24 @@ mod tests { let (ca, cert, key) = generate_certs(hostname)?; let server_config = { - let mut config = rustls::ServerConfig::new(rustls::NoClientAuth::new()); - config.set_single_cert(vec![cert], key)?; + let config = rustls::ServerConfig::builder() + .with_safe_defaults() + .with_no_client_auth() + .with_single_cert(vec![cert], key)?; + config.into() }; let client_config = { - let mut config = rustls::ClientConfig::new(); - config.root_store.add(&ca)?; + let config = rustls::ClientConfig::builder() + .with_safe_defaults() + .with_root_certificates({ + let mut store = rustls::RootCertStore::empty(); + store.add(&ca)?; + store + }) + .with_no_client_auth(); + ClientConfig { config, hostname } }; From ab20f2c4918a0031545e2d3d49e0bfd25faa5181 Mon Sep 17 00:00:00 2001 From: Dmitry Ivanov Date: Fri, 15 Apr 2022 18:36:11 +0300 Subject: [PATCH 098/296] Use the same version of `rust-postgres` everywhere. (#1516) Turns out we still had a stale dep in `compute_tools`. 
--- Cargo.lock | 104 ++++++++------------------------------- Cargo.toml | 4 +- compute_tools/Cargo.toml | 2 +- 3 files changed, 23 insertions(+), 87 deletions(-) diff --git a/Cargo.lock b/Cargo.lock index 6409b33055..0cdeb106ec 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -340,13 +340,13 @@ dependencies = [ "hyper", "libc", "log", - "postgres 0.19.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858)", + "postgres", "regex", "serde", "serde_json", "tar", "tokio", - "tokio-postgres 0.7.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + "tokio-postgres", "workspace_hack", ] @@ -378,7 +378,7 @@ dependencies = [ "lazy_static", "nix", "pageserver", - "postgres 0.19.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + "postgres", "regex", "reqwest", "serde", @@ -1529,9 +1529,9 @@ dependencies = [ "log", "nix", "once_cell", - "postgres 0.19.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", - "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", - "postgres-types 0.2.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + "postgres", + "postgres-protocol", + "postgres-types", "postgres_ffi", "rand", "regex", @@ -1546,7 +1546,7 @@ dependencies = [ "tempfile", "thiserror", "tokio", - "tokio-postgres 0.7.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + "tokio-postgres", "tokio-stream", "tokio-util 0.7.0", "toml_edit", @@ -1717,23 +1717,9 @@ dependencies = [ "fallible-iterator", "futures", "log", - "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + "postgres-protocol", "tokio", - "tokio-postgres 0.7.1 
(git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", -] - -[[package]] -name = "postgres" -version = "0.19.1" -source = "git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858#9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858" -dependencies = [ - "bytes", - "fallible-iterator", - "futures", - "log", - "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858)", - "tokio", - "tokio-postgres 0.7.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858)", + "tokio-postgres", ] [[package]] @@ -1754,24 +1740,6 @@ dependencies = [ "stringprep", ] -[[package]] -name = "postgres-protocol" -version = "0.6.1" -source = "git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858#9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858" -dependencies = [ - "base64 0.13.0", - "byteorder", - "bytes", - "fallible-iterator", - "hmac 0.10.1", - "lazy_static", - "md-5", - "memchr", - "rand", - "sha2", - "stringprep", -] - [[package]] name = "postgres-types" version = "0.2.1" @@ -1779,17 +1747,7 @@ source = "git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d5 dependencies = [ "bytes", "fallible-iterator", - "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", -] - -[[package]] -name = "postgres-types" -version = "0.2.1" -source = "git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858#9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858" -dependencies = [ - "bytes", - "fallible-iterator", - "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858)", + "postgres-protocol", ] [[package]] @@ -1935,7 +1893,7 @@ dependencies = [ "socket2", "thiserror", "tokio", - "tokio-postgres 0.7.1 
(git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + "tokio-postgres", "tokio-postgres-rustls", "tokio-rustls", "workspace_hack", @@ -2793,30 +2751,8 @@ dependencies = [ "percent-encoding", "phf", "pin-project-lite", - "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", - "postgres-types 0.2.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", - "socket2", - "tokio", - "tokio-util 0.6.9", -] - -[[package]] -name = "tokio-postgres" -version = "0.7.1" -source = "git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858#9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858" -dependencies = [ - "async-trait", - "byteorder", - "bytes", - "fallible-iterator", - "futures", - "log", - "parking_lot", - "percent-encoding", - "phf", - "pin-project-lite", - "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858)", - "postgres-types 0.2.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858)", + "postgres-protocol", + "postgres-types", "socket2", "tokio", "tokio-util 0.6.9", @@ -2832,7 +2768,7 @@ dependencies = [ "ring", "rustls 0.20.2", "tokio", - "tokio-postgres 0.7.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + "tokio-postgres", "tokio-rustls", ] @@ -3171,8 +3107,8 @@ dependencies = [ "humantime", "hyper", "lazy_static", - "postgres 0.19.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", - "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + "postgres", + "postgres-protocol", "postgres_ffi", "regex", "rusoto_core", @@ -3183,7 +3119,7 @@ dependencies = [ "signal-hook", "tempfile", "tokio", - 
"tokio-postgres 0.7.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + "tokio-postgres", "tokio-util 0.7.0", "tracing", "url", @@ -3432,7 +3368,7 @@ dependencies = [ "clap 3.0.14", "control_plane", "pageserver", - "postgres 0.19.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + "postgres", "postgres_ffi", "serde_json", "walkeeper", @@ -3468,8 +3404,8 @@ dependencies = [ "lazy_static", "nix", "pin-project-lite", - "postgres 0.19.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", - "postgres-protocol 0.6.1 (git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7)", + "postgres", + "postgres-protocol", "rand", "routerify 3.0.0", "rustls 0.19.1", diff --git a/Cargo.toml b/Cargo.toml index f3ac36dcb2..b8283a6112 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -18,7 +18,7 @@ resolver = "2" # Besides, debug info should not affect the performance. debug = true -# This is only needed for proxy's tests -# TODO: we should probably fork tokio-postgres-rustls instead +# This is only needed for proxy's tests. +# TODO: we should probably fork `tokio-postgres-rustls` instead. 
[patch.crates-io] tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } diff --git a/compute_tools/Cargo.toml b/compute_tools/Cargo.toml index fc52ce4e83..856ec45c73 100644 --- a/compute_tools/Cargo.toml +++ b/compute_tools/Cargo.toml @@ -11,7 +11,7 @@ clap = "3.0" env_logger = "0.9" hyper = { version = "0.14", features = ["full"] } log = { version = "0.4", features = ["std", "serde"] } -postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="9eb0dbfbeb6a6c1b79099b9f7ae4a8c021877858" } +postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } regex = "1" serde = { version = "1.0", features = ["derive"] } serde_json = "1" From 9946cd11256fc48c1b765cf62a6510c9a851251b Mon Sep 17 00:00:00 2001 From: Arseny Sher Date: Fri, 15 Apr 2022 18:52:44 +0400 Subject: [PATCH 099/296] Bump vendor/postgres to add safekeeper connection timeout. --- vendor/postgres | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vendor/postgres b/vendor/postgres index 61afbf978b..d7c8426e49 160000 --- a/vendor/postgres +++ b/vendor/postgres @@ -1 +1 @@ -Subproject commit 61afbf978b17764134ab6f1650bbdcadac147e71 +Subproject commit d7c8426e49cff3c791c3f2c4cde95f1fce665573 From 71269799500205ccd574d7820406309b2b1665de Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Fri, 15 Apr 2022 19:09:41 +0300 Subject: [PATCH 100/296] Remove custom neon Docker build image --- Dockerfile | 11 +++-------- Dockerfile.build | 23 ----------------------- Dockerfile.compute-tools | 5 ++--- 3 files changed, 5 insertions(+), 34 deletions(-) delete mode 100644 Dockerfile.build diff --git a/Dockerfile b/Dockerfile index 955d26cd0b..5e579be4e7 100644 --- a/Dockerfile +++ b/Dockerfile @@ -1,7 +1,5 @@ # Build Postgres -# -#FROM zimg/rust:1.58 AS pg-build -FROM zenithdb/build:buster-20220414 AS pg-build +FROM zimg/rust:1.58 AS pg-build WORKDIR /pg USER root @@ 
-16,22 +14,19 @@ RUN set -e \ && tar -C tmp_install -czf /postgres_install.tar.gz . # Build zenith binaries -# -#FROM zimg/rust:1.58 AS build -FROM zenithdb/build:buster-20220414 AS build +FROM zimg/rust:1.58 AS build ARG GIT_VERSION=local ARG CACHEPOT_BUCKET=zenith-rust-cachepot ARG AWS_ACCESS_KEY_ID ARG AWS_SECRET_ACCESS_KEY -ENV RUSTC_WRAPPER /usr/local/cargo/bin/cachepot COPY --from=pg-build /pg/tmp_install/include/postgresql/server tmp_install/include/postgresql/server COPY . . # Show build caching stats to check if it was used in the end. # Has to be the part of the same RUN since cachepot daemon is killed in the end of this RUN, loosing the compilation stats. -RUN cargo build --release && /usr/local/cargo/bin/cachepot -s +RUN cargo build --release && cachepot -s # Build final image # diff --git a/Dockerfile.build b/Dockerfile.build deleted file mode 100644 index c7d239647f..0000000000 --- a/Dockerfile.build +++ /dev/null @@ -1,23 +0,0 @@ -FROM rust:1.58-slim-buster -WORKDIR /home/circleci/project - -RUN set -e \ - && apt-get update \ - && apt-get -yq install \ - automake \ - libtool \ - build-essential \ - bison \ - flex \ - libreadline-dev \ - zlib1g-dev \ - libxml2-dev \ - libseccomp-dev \ - pkg-config \ - libssl-dev \ - clang - -RUN set -e \ - && rustup component add clippy \ - && cargo install cargo-audit \ - && cargo install --git https://github.com/paritytech/cachepot diff --git a/Dockerfile.compute-tools b/Dockerfile.compute-tools index 6a35a71bb3..a0cc21105b 100644 --- a/Dockerfile.compute-tools +++ b/Dockerfile.compute-tools @@ -1,17 +1,16 @@ # First transient image to build compute_tools binaries # NB: keep in sync with rust image version in .circle/config.yml -FROM zenithdb/build:buster-20220414 AS rust-build +FROM zimg/rust:1.58 AS rust-build WORKDIR /zenith ARG CACHEPOT_BUCKET=zenith-rust-cachepot ARG AWS_ACCESS_KEY_ID ARG AWS_SECRET_ACCESS_KEY -ENV RUSTC_WRAPPER /usr/local/cargo/bin/cachepot COPY . . 
-RUN cargo build -p compute_tools --release && /usr/local/cargo/bin/cachepot -s +RUN cargo build -p compute_tools --release && cachepot -s # Final image that only has one binary FROM debian:buster-slim From 3ab090b43ad71643f457108613b89c521346d612 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Fri, 15 Apr 2022 21:32:08 +0300 Subject: [PATCH 101/296] Fix compute tools build --- Dockerfile.compute-tools | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/Dockerfile.compute-tools b/Dockerfile.compute-tools index a0cc21105b..27bfbb5d1b 100644 --- a/Dockerfile.compute-tools +++ b/Dockerfile.compute-tools @@ -2,8 +2,6 @@ # NB: keep in sync with rust image version in .circle/config.yml FROM zimg/rust:1.58 AS rust-build -WORKDIR /zenith - ARG CACHEPOT_BUCKET=zenith-rust-cachepot ARG AWS_ACCESS_KEY_ID ARG AWS_SECRET_ACCESS_KEY @@ -15,4 +13,4 @@ RUN cargo build -p compute_tools --release && cachepot -s # Final image that only has one binary FROM debian:buster-slim -COPY --from=rust-build /zenith/target/release/zenith_ctl /usr/local/bin/zenith_ctl +COPY --from=rust-build /home/circleci/project/target/release/zenith_ctl /usr/local/bin/zenith_ctl From 4bc338babc22b835c377239651c97b5227053217 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Sat, 16 Apr 2022 10:01:42 +0300 Subject: [PATCH 102/296] Revert libc upgrade --- Dockerfile | 10 +++++++--- Dockerfile.build | 23 +++++++++++++++++++++++ Dockerfile.compute-tools | 13 ++++++++++--- 3 files changed, 40 insertions(+), 6 deletions(-) create mode 100644 Dockerfile.build diff --git a/Dockerfile b/Dockerfile index 5e579be4e7..a6ac923187 100644 --- a/Dockerfile +++ b/Dockerfile @@ -1,5 +1,6 @@ # Build Postgres -FROM zimg/rust:1.58 AS pg-build +#FROM zimg/rust:1.58 AS pg-build +FROM zenithdb/build:buster-20220414 AS pg-build WORKDIR /pg USER root @@ -14,7 +15,8 @@ RUN set -e \ && tar -C tmp_install -czf /postgres_install.tar.gz . 
# Build zenith binaries -FROM zimg/rust:1.58 AS build +#FROM zimg/rust:1.58 AS build +FROM zenithdb/build:buster-20220414 AS build ARG GIT_VERSION=local ARG CACHEPOT_BUCKET=zenith-rust-cachepot @@ -26,7 +28,9 @@ COPY . . # Show build caching stats to check if it was used in the end. # Has to be the part of the same RUN since cachepot daemon is killed in the end of this RUN, loosing the compilation stats. -RUN cargo build --release && cachepot -s +#RUN cargo build --release && cachepot -s +ENV RUSTC_WRAPPER /usr/local/cargo/bin/cachepot +RUN cargo build --release && /usr/local/cargo/bin/cachepot -s # Build final image # diff --git a/Dockerfile.build b/Dockerfile.build new file mode 100644 index 0000000000..c7d239647f --- /dev/null +++ b/Dockerfile.build @@ -0,0 +1,23 @@ +FROM rust:1.58-slim-buster +WORKDIR /home/circleci/project + +RUN set -e \ + && apt-get update \ + && apt-get -yq install \ + automake \ + libtool \ + build-essential \ + bison \ + flex \ + libreadline-dev \ + zlib1g-dev \ + libxml2-dev \ + libseccomp-dev \ + pkg-config \ + libssl-dev \ + clang + +RUN set -e \ + && rustup component add clippy \ + && cargo install cargo-audit \ + && cargo install --git https://github.com/paritytech/cachepot diff --git a/Dockerfile.compute-tools b/Dockerfile.compute-tools index 27bfbb5d1b..18ebe61384 100644 --- a/Dockerfile.compute-tools +++ b/Dockerfile.compute-tools @@ -1,6 +1,10 @@ # First transient image to build compute_tools binaries # NB: keep in sync with rust image version in .circle/config.yml -FROM zimg/rust:1.58 AS rust-build + +#FROM zimg/rust:1.58 AS rust-build +FROM zenithdb/build:buster-20220414 AS rust-build + +WORKDIR /zenith ARG CACHEPOT_BUCKET=zenith-rust-cachepot ARG AWS_ACCESS_KEY_ID @@ -8,9 +12,12 @@ ARG AWS_SECRET_ACCESS_KEY COPY . . 
-RUN cargo build -p compute_tools --release && cachepot -s +#RUN cargo build -p compute_tools --release && cachepot -s +ENV RUSTC_WRAPPER /usr/local/cargo/bin/cachepot +RUN cargo build -p compute_tools --release && /usr/local/cargo/bin/cachepot -s # Final image that only has one binary FROM debian:buster-slim -COPY --from=rust-build /home/circleci/project/target/release/zenith_ctl /usr/local/bin/zenith_ctl +#COPY --from=rust-build /home/circleci/project/target/release/zenith_ctl /usr/local/bin/zenith_ctl +COPY --from=rust-build /zenith/target/release/zenith_ctl /usr/local/bin/zenith_ctl From ed5f9acca94532b114b841017fa0492e349b6ef6 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Sat, 16 Apr 2022 13:38:48 +0300 Subject: [PATCH 103/296] Revert "Revert libc upgrade" (#1527) This reverts commit 4bc338babc22b835c377239651c97b5227053217. --- Dockerfile | 10 +++------- Dockerfile.build | 23 ----------------------- Dockerfile.compute-tools | 13 +++---------- 3 files changed, 6 insertions(+), 40 deletions(-) delete mode 100644 Dockerfile.build diff --git a/Dockerfile b/Dockerfile index a6ac923187..5e579be4e7 100644 --- a/Dockerfile +++ b/Dockerfile @@ -1,6 +1,5 @@ # Build Postgres -#FROM zimg/rust:1.58 AS pg-build -FROM zenithdb/build:buster-20220414 AS pg-build +FROM zimg/rust:1.58 AS pg-build WORKDIR /pg USER root @@ -15,8 +14,7 @@ RUN set -e \ && tar -C tmp_install -czf /postgres_install.tar.gz . # Build zenith binaries -#FROM zimg/rust:1.58 AS build -FROM zenithdb/build:buster-20220414 AS build +FROM zimg/rust:1.58 AS build ARG GIT_VERSION=local ARG CACHEPOT_BUCKET=zenith-rust-cachepot @@ -28,9 +26,7 @@ COPY . . # Show build caching stats to check if it was used in the end. # Has to be the part of the same RUN since cachepot daemon is killed in the end of this RUN, loosing the compilation stats. 
-#RUN cargo build --release && cachepot -s -ENV RUSTC_WRAPPER /usr/local/cargo/bin/cachepot -RUN cargo build --release && /usr/local/cargo/bin/cachepot -s +RUN cargo build --release && cachepot -s # Build final image # diff --git a/Dockerfile.build b/Dockerfile.build deleted file mode 100644 index c7d239647f..0000000000 --- a/Dockerfile.build +++ /dev/null @@ -1,23 +0,0 @@ -FROM rust:1.58-slim-buster -WORKDIR /home/circleci/project - -RUN set -e \ - && apt-get update \ - && apt-get -yq install \ - automake \ - libtool \ - build-essential \ - bison \ - flex \ - libreadline-dev \ - zlib1g-dev \ - libxml2-dev \ - libseccomp-dev \ - pkg-config \ - libssl-dev \ - clang - -RUN set -e \ - && rustup component add clippy \ - && cargo install cargo-audit \ - && cargo install --git https://github.com/paritytech/cachepot diff --git a/Dockerfile.compute-tools b/Dockerfile.compute-tools index 18ebe61384..27bfbb5d1b 100644 --- a/Dockerfile.compute-tools +++ b/Dockerfile.compute-tools @@ -1,10 +1,6 @@ # First transient image to build compute_tools binaries # NB: keep in sync with rust image version in .circle/config.yml - -#FROM zimg/rust:1.58 AS rust-build -FROM zenithdb/build:buster-20220414 AS rust-build - -WORKDIR /zenith +FROM zimg/rust:1.58 AS rust-build ARG CACHEPOT_BUCKET=zenith-rust-cachepot ARG AWS_ACCESS_KEY_ID @@ -12,12 +8,9 @@ ARG AWS_SECRET_ACCESS_KEY COPY . . 
-#RUN cargo build -p compute_tools --release && cachepot -s -ENV RUSTC_WRAPPER /usr/local/cargo/bin/cachepot -RUN cargo build -p compute_tools --release && /usr/local/cargo/bin/cachepot -s +RUN cargo build -p compute_tools --release && cachepot -s # Final image that only has one binary FROM debian:buster-slim -#COPY --from=rust-build /home/circleci/project/target/release/zenith_ctl /usr/local/bin/zenith_ctl -COPY --from=rust-build /zenith/target/release/zenith_ctl /usr/local/bin/zenith_ctl +COPY --from=rust-build /home/circleci/project/target/release/zenith_ctl /usr/local/bin/zenith_ctl From 787f0d33f0f15209e1c7d803633280b6064ed11a Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Sat, 16 Apr 2022 23:36:42 +0300 Subject: [PATCH 104/296] Use another cachepot bucket for rust Docker build caches --- Dockerfile | 2 +- Dockerfile.compute-tools | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/Dockerfile b/Dockerfile index 5e579be4e7..b2d4971345 100644 --- a/Dockerfile +++ b/Dockerfile @@ -17,7 +17,7 @@ RUN set -e \ FROM zimg/rust:1.58 AS build ARG GIT_VERSION=local -ARG CACHEPOT_BUCKET=zenith-rust-cachepot +ARG CACHEPOT_BUCKET=zenith-rust-cachepot-docker ARG AWS_ACCESS_KEY_ID ARG AWS_SECRET_ACCESS_KEY diff --git a/Dockerfile.compute-tools b/Dockerfile.compute-tools index 27bfbb5d1b..dc67ae3032 100644 --- a/Dockerfile.compute-tools +++ b/Dockerfile.compute-tools @@ -2,7 +2,7 @@ # NB: keep in sync with rust image version in .circle/config.yml FROM zimg/rust:1.58 AS rust-build -ARG CACHEPOT_BUCKET=zenith-rust-cachepot +ARG CACHEPOT_BUCKET=zenith-rust-cachepot-docker ARG AWS_ACCESS_KEY_ID ARG AWS_SECRET_ACCESS_KEY From 3136a0754a85b20a8f20d623d63add3399b51c13 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Sat, 16 Apr 2022 23:03:13 +0300 Subject: [PATCH 105/296] Use mold in Docker images --- Dockerfile | 4 ++-- Dockerfile.compute-tools | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/Dockerfile b/Dockerfile index 
b2d4971345..3467359ac4 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -9,7 +9,7 @@ COPY Makefile Makefile
 
 ENV BUILD_TYPE release
 RUN set -e \
-    && make -j $(nproc) -s postgres \
+    && mold -run make -j $(nproc) -s postgres \
     && rm -rf tmp_install/build \
     && tar -C tmp_install -czf /postgres_install.tar.gz .
 
@@ -26,7 +26,7 @@ COPY . .
 
 # Show build caching stats to check if it was used in the end.
 # Has to be the part of the same RUN since cachepot daemon is killed in the end of this RUN, loosing the compilation stats.
-RUN cargo build --release && cachepot -s
+RUN mold -run cargo build --release && cachepot -s
 
 # Build final image
 #
diff --git a/Dockerfile.compute-tools b/Dockerfile.compute-tools
index dc67ae3032..c2e33b9d98 100644
--- a/Dockerfile.compute-tools
+++ b/Dockerfile.compute-tools
@@ -8,7 +8,7 @@ ARG AWS_SECRET_ACCESS_KEY
 
 COPY . .
 
-RUN cargo build -p compute_tools --release && cachepot -s
+RUN mold -run cargo build -p compute_tools --release && cachepot -s
 
 # Final image that only has one binary
 FROM debian:buster-slim

From 9b7dcc2bae88d3aedc97541066416c34993e9533 Mon Sep 17 00:00:00 2001
From: Kirill Bulatov
Date: Sun, 17 Apr 2022 15:42:38 +0300
Subject: [PATCH 106/296] Use proper cachepot bucket

---
 .circleci/config.yml     | 2 --
 Dockerfile               | 2 +-
 Dockerfile.compute-tools | 2 +-
 3 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/.circleci/config.yml b/.circleci/config.yml
index 5aae143e48..8752da506d 100644
--- a/.circleci/config.yml
+++ b/.circleci/config.yml
@@ -117,8 +117,6 @@ jobs:
             fi
 
             export CARGO_INCREMENTAL=0
-            export CACHEPOT_BUCKET=zenith-rust-cachepot
-            export RUSTC_WRAPPER=cachepot
             export AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}"
             export AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}"
             "${cov_prefix[@]}" mold -run cargo build $CARGO_FLAGS --bins --tests
diff --git a/Dockerfile b/Dockerfile
index 3467359ac4..ebc8731168 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -17,7 +17,7 @@ RUN set -e \
 FROM zimg/rust:1.58 AS build
 ARG
GIT_VERSION=local
 
-ARG CACHEPOT_BUCKET=zenith-rust-cachepot-docker
+ARG CACHEPOT_BUCKET=zenith-rust-cachepot
 ARG AWS_ACCESS_KEY_ID
 ARG AWS_SECRET_ACCESS_KEY
 
diff --git a/Dockerfile.compute-tools b/Dockerfile.compute-tools
index c2e33b9d98..3fc8702f3f 100644
--- a/Dockerfile.compute-tools
+++ b/Dockerfile.compute-tools
@@ -2,7 +2,7 @@
 # NB: keep in sync with rust image version in .circle/config.yml
 FROM zimg/rust:1.58 AS rust-build
 
-ARG CACHEPOT_BUCKET=zenith-rust-cachepot-docker
+ARG CACHEPOT_BUCKET=zenith-rust-cachepot
 ARG AWS_ACCESS_KEY_ID
 ARG AWS_SECRET_ACCESS_KEY
 

From 0ca2bd929b8753e946fff83cdaa8f2b0062f6ae1 Mon Sep 17 00:00:00 2001
From: Kirill Bulatov
Date: Fri, 15 Apr 2022 22:53:31 +0300
Subject: [PATCH 107/296] Remove log crate from pageserver

---
 Cargo.lock                                          | 1 -
 pageserver/Cargo.toml                               | 1 -
 pageserver/src/basebackup.rs                        | 2 +-
 pageserver/src/layered_repository/delta_layer.rs    | 2 +-
 pageserver/src/layered_repository/image_layer.rs    | 2 +-
 pageserver/src/layered_repository/inmemory_layer.rs | 2 +-
 pageserver/src/tenant_mgr.rs                        | 2 +-
 pageserver/src/walredo.rs                           | 2 +-
 8 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/Cargo.lock b/Cargo.lock
index 0cdeb106ec..e93e73f087 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -1526,7 +1526,6 @@ dependencies = [
  "hyper",
  "itertools",
  "lazy_static",
- "log",
  "nix",
  "once_cell",
  "postgres",
diff --git a/pageserver/Cargo.toml b/pageserver/Cargo.toml
index e92ac0421c..3825795059 100644
--- a/pageserver/Cargo.toml
+++ b/pageserver/Cargo.toml
@@ -14,7 +14,6 @@ hex = "0.4.3"
 hyper = "0.14"
 itertools = "0.10.3"
 lazy_static = "1.4.0"
-log = "0.4.14"
 clap = "3.0"
 daemonize = "0.4.1"
 tokio = { version = "1.17", features = ["process", "sync", "macros", "fs", "rt", "io-util", "time"] }
diff --git a/pageserver/src/basebackup.rs b/pageserver/src/basebackup.rs
index 3caf27b9b3..077e7c9f83 100644
--- a/pageserver/src/basebackup.rs
+++ b/pageserver/src/basebackup.rs
@@ -12,13 +12,13 @@
 //!
 use anyhow::{ensure, Context, Result};
 use bytes::{BufMut, BytesMut};
-use log::*;
 use std::fmt::Write as FmtWrite;
 use std::io;
 use std::io::Write;
 use std::sync::Arc;
 use std::time::SystemTime;
 use tar::{Builder, EntryType, Header};
+use tracing::*;
 
 use crate::reltag::SlruKind;
 use crate::repository::Timeline;
diff --git a/pageserver/src/layered_repository/delta_layer.rs b/pageserver/src/layered_repository/delta_layer.rs
index dd6b5d3afa..6e3d65a94d 100644
--- a/pageserver/src/layered_repository/delta_layer.rs
+++ b/pageserver/src/layered_repository/delta_layer.rs
@@ -38,8 +38,8 @@ use crate::walrecord;
 use crate::{ZTenantId, ZTimelineId};
 use crate::{DELTA_FILE_MAGIC, STORAGE_FORMAT_VERSION};
 use anyhow::{bail, ensure, Context, Result};
-use log::*;
 use serde::{Deserialize, Serialize};
+use tracing::*;
 // avoid binding to Write (conflicts with std::io::Write)
 // while being able to use std::fmt::Write's methods
 use std::fmt::Write as _;
diff --git a/pageserver/src/layered_repository/image_layer.rs b/pageserver/src/layered_repository/image_layer.rs
index 08e635f073..0f334658bf 100644
--- a/pageserver/src/layered_repository/image_layer.rs
+++ b/pageserver/src/layered_repository/image_layer.rs
@@ -35,7 +35,6 @@ use crate::{IMAGE_FILE_MAGIC, STORAGE_FORMAT_VERSION};
 use anyhow::{bail, ensure, Context, Result};
 use bytes::Bytes;
 use hex;
-use log::*;
 use serde::{Deserialize, Serialize};
 use std::fs;
 use std::io::Write;
@@ -43,6 +42,7 @@ use std::io::{Seek, SeekFrom};
 use std::ops::Range;
 use std::path::{Path, PathBuf};
 use std::sync::{RwLock, RwLockReadGuard};
+use tracing::*;
 use zenith_utils::bin_ser::BeSer;
 use zenith_utils::lsn::Lsn;
 
diff --git a/pageserver/src/layered_repository/inmemory_layer.rs b/pageserver/src/layered_repository/inmemory_layer.rs
index a45af51487..ffb5be1dd4 100644
--- a/pageserver/src/layered_repository/inmemory_layer.rs
+++ b/pageserver/src/layered_repository/inmemory_layer.rs
@@ -16,8 +16,8 @@ use crate::repository::{Key, Value};
 use
crate::walrecord;
 use crate::{ZTenantId, ZTimelineId};
 use anyhow::{bail, ensure, Result};
-use log::*;
 use std::collections::HashMap;
+use tracing::*;
 // avoid binding to Write (conflicts with std::io::Write)
 // while being able to use std::fmt::Write's methods
 use std::fmt::Write as _;
diff --git a/pageserver/src/tenant_mgr.rs b/pageserver/src/tenant_mgr.rs
index aeff718803..2765554cf9 100644
--- a/pageserver/src/tenant_mgr.rs
+++ b/pageserver/src/tenant_mgr.rs
@@ -13,13 +13,13 @@ use crate::walredo::PostgresRedoManager;
 use crate::{DatadirTimelineImpl, RepositoryImpl};
 use anyhow::{Context, Result};
 use lazy_static::lazy_static;
-use log::*;
 use serde::{Deserialize, Serialize};
 use serde_with::{serde_as, DisplayFromStr};
 use std::collections::hash_map::Entry;
 use std::collections::HashMap;
 use std::fmt;
 use std::sync::{Arc, Mutex, MutexGuard};
+use tracing::*;
 use zenith_utils::zid::{ZTenantId, ZTimelineId};
 
 lazy_static! {
diff --git a/pageserver/src/walredo.rs b/pageserver/src/walredo.rs
index ae22f1eead..b7c6ecf726 100644
--- a/pageserver/src/walredo.rs
+++ b/pageserver/src/walredo.rs
@@ -21,7 +21,6 @@
 use byteorder::{ByteOrder, LittleEndian};
 use bytes::{BufMut, Bytes, BytesMut};
 use lazy_static::lazy_static;
-use log::*;
 use nix::poll::*;
 use serde::Serialize;
 use std::fs;
@@ -35,6 +34,7 @@ use std::process::{Child, ChildStderr, ChildStdin, ChildStdout, Command};
 use std::sync::Mutex;
 use std::time::Duration;
 use std::time::Instant;
+use tracing::*;
 use zenith_metrics::{register_histogram, register_int_counter, Histogram, IntCounter};
 use zenith_utils::bin_ser::BeSer;
 use zenith_utils::lsn::Lsn;

From 5b297745324f759c4aa16037a165bef251fc8252 Mon Sep 17 00:00:00 2001
From: Arseny Sher
Date: Mon, 4 Apr 2022 15:42:13 +0400
Subject: [PATCH 108/296] Small refactoring after ec3bc741653d.

Move record_safekeeper_info inside safekeeper.rs, fix commit_lsn update,
sync control file.
--- walkeeper/src/safekeeper.rs | 52 ++++++++++++++++++++++++++++++++++--- walkeeper/src/timeline.rs | 46 ++++++-------------------------- 2 files changed, 57 insertions(+), 41 deletions(-) diff --git a/walkeeper/src/safekeeper.rs b/walkeeper/src/safekeeper.rs index 22a8481e45..cf56261ee6 100644 --- a/walkeeper/src/safekeeper.rs +++ b/walkeeper/src/safekeeper.rs @@ -6,6 +6,7 @@ use bytes::{Buf, BufMut, Bytes, BytesMut}; use postgres_ffi::xlog_utils::TimeLineID; use serde::{Deserialize, Serialize}; +use std::cmp::max; use std::cmp::min; use std::fmt; use std::io::Read; @@ -15,6 +16,7 @@ use zenith_utils::zid::ZTenantTimelineId; use lazy_static::lazy_static; +use crate::broker::SafekeeperInfo; use crate::control_file; use crate::send_wal::HotStandbyFeedback; use crate::wal_storage; @@ -497,6 +499,8 @@ pub struct SafeKeeper { metrics: SafeKeeperMetrics, /// Maximum commit_lsn between all nodes, can be ahead of local flush_lsn. + /// Note: be careful to set only if we are sure our WAL (term history) matches + /// committed one. pub global_commit_lsn: Lsn, /// LSN since the proposer safekeeper currently talking to appends WAL; /// determines epoch switch point. @@ -743,7 +747,9 @@ where let mut state = self.state.clone(); state.commit_lsn = self.inmem.commit_lsn; + state.s3_wal_lsn = self.inmem.s3_wal_lsn; state.peer_horizon_lsn = self.inmem.peer_horizon_lsn; + state.remote_consistent_lsn = self.inmem.remote_consistent_lsn; state.proposer_uuid = self.inmem.proposer_uuid; self.state.persist(&state) } @@ -788,10 +794,10 @@ where self.wal_store.flush_wal()?; } - // Update global_commit_lsn, verifying that it cannot decrease. 
+ // Update global_commit_lsn if msg.h.commit_lsn != Lsn(0) { - assert!(msg.h.commit_lsn >= self.global_commit_lsn); - self.global_commit_lsn = msg.h.commit_lsn; + // We also obtain commit lsn from peers, so value arrived here might be stale (less) + self.global_commit_lsn = max(self.global_commit_lsn, msg.h.commit_lsn); } self.inmem.peer_horizon_lsn = msg.h.truncate_lsn; @@ -835,6 +841,46 @@ where self.append_response(), ))) } + + /// Update timeline state with peer safekeeper data. + pub fn record_safekeeper_info(&mut self, sk_info: &SafekeeperInfo) -> Result<()> { + let mut sync_control_file = false; + if let (Some(commit_lsn), Some(last_log_term)) = (sk_info.commit_lsn, sk_info.last_log_term) + { + // Note: the check is too restrictive, generally we can update local + // commit_lsn if our history matches (is part of) history of advanced + // commit_lsn provider. + if last_log_term == self.get_epoch() { + self.global_commit_lsn = max(commit_lsn, self.global_commit_lsn); + self.update_commit_lsn()?; + } + } + if let Some(s3_wal_lsn) = sk_info.s3_wal_lsn { + let new_s3_wal_lsn = max(s3_wal_lsn, self.inmem.s3_wal_lsn); + sync_control_file |= + self.state.s3_wal_lsn + (self.state.server.wal_seg_size as u64) < new_s3_wal_lsn; + self.inmem.s3_wal_lsn = new_s3_wal_lsn; + } + if let Some(remote_consistent_lsn) = sk_info.remote_consistent_lsn { + let new_remote_consistent_lsn = + max(remote_consistent_lsn, self.inmem.remote_consistent_lsn); + sync_control_file |= self.state.remote_consistent_lsn + + (self.state.server.wal_seg_size as u64) + < new_remote_consistent_lsn; + self.inmem.remote_consistent_lsn = new_remote_consistent_lsn; + } + if let Some(peer_horizon_lsn) = sk_info.peer_horizon_lsn { + let new_peer_horizon_lsn = max(peer_horizon_lsn, self.inmem.peer_horizon_lsn); + sync_control_file |= self.state.peer_horizon_lsn + + (self.state.server.wal_seg_size as u64) + < new_peer_horizon_lsn; + self.inmem.peer_horizon_lsn = new_peer_horizon_lsn; + } + if 
sync_control_file { + self.persist_control_file()?; + } + Ok(()) + } } #[cfg(test)] diff --git a/walkeeper/src/timeline.rs b/walkeeper/src/timeline.rs index a2941a9a5c..777db7eb2b 100644 --- a/walkeeper/src/timeline.rs +++ b/walkeeper/src/timeline.rs @@ -375,10 +375,9 @@ impl Timeline { } // Notify caught-up WAL senders about new WAL data received - pub fn notify_wal_senders(&self, commit_lsn: Lsn) { - let mut shared_state = self.mutex.lock().unwrap(); - if shared_state.notified_commit_lsn < commit_lsn { - shared_state.notified_commit_lsn = commit_lsn; + fn notify_wal_senders(&self, shared_state: &mut MutexGuard) { + if shared_state.notified_commit_lsn < shared_state.sk.inmem.commit_lsn { + shared_state.notified_commit_lsn = shared_state.sk.inmem.commit_lsn; self.cond.notify_all(); } } @@ -389,13 +388,9 @@ impl Timeline { msg: &ProposerAcceptorMessage, ) -> Result> { let mut rmsg: Option; - let commit_lsn: Lsn; { let mut shared_state = self.mutex.lock().unwrap(); rmsg = shared_state.sk.process_msg(msg)?; - // locally available commit lsn. flush_lsn can be smaller than - // commit_lsn if we are catching up safekeeper. - commit_lsn = shared_state.sk.inmem.commit_lsn; // if this is AppendResponse, fill in proper hot standby feedback and disk consistent lsn if let Some(AcceptorProposerMessage::AppendResponse(ref mut resp)) = rmsg { @@ -405,9 +400,10 @@ impl Timeline { resp.zenith_feedback = zenith_feedback; } } + + // Ping wal sender that new data might be available. + self.notify_wal_senders(&mut shared_state); } - // Ping wal sender that new data might be available. - self.notify_wal_senders(commit_lsn); Ok(rmsg) } @@ -437,34 +433,8 @@ impl Timeline { /// Update timeline state with peer safekeeper data. 
pub fn record_safekeeper_info(&self, sk_info: &SafekeeperInfo, _sk_id: ZNodeId) -> Result<()> { let mut shared_state = self.mutex.lock().unwrap(); - // Note: the check is too restrictive, generally we can update local - // commit_lsn if our history matches (is part of) history of advanced - // commit_lsn provider. - if let (Some(commit_lsn), Some(last_log_term)) = (sk_info.commit_lsn, sk_info.last_log_term) - { - if last_log_term == shared_state.sk.get_epoch() { - shared_state.sk.global_commit_lsn = - max(commit_lsn, shared_state.sk.global_commit_lsn); - shared_state.sk.update_commit_lsn()?; - let local_commit_lsn = min(commit_lsn, shared_state.sk.wal_store.flush_lsn()); - shared_state.sk.inmem.commit_lsn = - max(local_commit_lsn, shared_state.sk.inmem.commit_lsn); - } - } - if let Some(s3_wal_lsn) = sk_info.s3_wal_lsn { - shared_state.sk.inmem.s3_wal_lsn = max(s3_wal_lsn, shared_state.sk.inmem.s3_wal_lsn); - } - if let Some(remote_consistent_lsn) = sk_info.remote_consistent_lsn { - shared_state.sk.inmem.remote_consistent_lsn = max( - remote_consistent_lsn, - shared_state.sk.inmem.remote_consistent_lsn, - ); - } - if let Some(peer_horizon_lsn) = sk_info.peer_horizon_lsn { - shared_state.sk.inmem.peer_horizon_lsn = - max(peer_horizon_lsn, shared_state.sk.inmem.peer_horizon_lsn); - } - // TODO: sync control file + shared_state.sk.record_safekeeper_info(sk_info)?; + self.notify_wal_senders(&mut shared_state); Ok(()) } From 81879f8137ca91315f57ff415170dc14f411d492 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Mon, 18 Apr 2022 12:15:54 +0300 Subject: [PATCH 109/296] Restore missing cachepot env vars --- .circleci/config.yml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/.circleci/config.yml b/.circleci/config.yml index 8752da506d..5aae143e48 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -117,6 +117,8 @@ jobs: fi export CARGO_INCREMENTAL=0 + export CACHEPOT_BUCKET=zenith-rust-cachepot + export RUSTC_WRAPPER=cachepot export 
AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}" export AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}" "${cov_prefix[@]}" mold -run cargo build $CARGO_FLAGS --bins --tests From 81417788c8e0ed55611065cbc34c1e5366fe4ba1 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Mon, 18 Apr 2022 11:41:05 +0300 Subject: [PATCH 110/296] walkeeper -> safekeeper --- Cargo.lock | 82 +++++++++---------- Cargo.toml | 2 +- control_plane/Cargo.toml | 2 +- control_plane/src/safekeeper.rs | 2 +- docs/README.md | 2 +- docs/glossary.md | 8 +- docs/rfcs/009-snapshot-first-storage-cli.md | 12 +-- docs/sourcetree.md | 4 +- postgres_ffi/src/waldecoder.rs | 2 +- {walkeeper => safekeeper}/Cargo.toml | 2 +- {walkeeper => safekeeper}/README | 0 {walkeeper => safekeeper}/README_PROTO.md | 0 .../spec/ProposerAcceptorConsensus.cfg | 0 .../spec/ProposerAcceptorConsensus.tla | 0 .../src/bin/safekeeper.rs | 14 ++-- {walkeeper => safekeeper}/src/broker.rs | 0 {walkeeper => safekeeper}/src/callmemaybe.rs | 0 {walkeeper => safekeeper}/src/control_file.rs | 0 .../src/control_file_upgrade.rs | 0 {walkeeper => safekeeper}/src/handler.rs | 2 +- {walkeeper => safekeeper}/src/http/mod.rs | 0 {walkeeper => safekeeper}/src/http/models.rs | 0 {walkeeper => safekeeper}/src/http/routes.rs | 0 {walkeeper => safekeeper}/src/json_ctrl.rs | 0 {walkeeper => safekeeper}/src/lib.rs | 0 {walkeeper => safekeeper}/src/receive_wal.rs | 0 {walkeeper => safekeeper}/src/s3_offload.rs | 0 {walkeeper => safekeeper}/src/safekeeper.rs | 0 {walkeeper => safekeeper}/src/send_wal.rs | 0 {walkeeper => safekeeper}/src/timeline.rs | 0 {walkeeper => safekeeper}/src/wal_service.rs | 0 {walkeeper => safekeeper}/src/wal_storage.rs | 0 zenith/Cargo.toml | 2 +- zenith/src/main.rs | 2 +- 34 files changed, 69 insertions(+), 69 deletions(-) rename {walkeeper => safekeeper}/Cargo.toml (98%) rename {walkeeper => safekeeper}/README (100%) rename {walkeeper => safekeeper}/README_PROTO.md (100%) rename {walkeeper => 
safekeeper}/spec/ProposerAcceptorConsensus.cfg (100%) rename {walkeeper => safekeeper}/spec/ProposerAcceptorConsensus.tla (100%) rename {walkeeper => safekeeper}/src/bin/safekeeper.rs (97%) rename {walkeeper => safekeeper}/src/broker.rs (100%) rename {walkeeper => safekeeper}/src/callmemaybe.rs (100%) rename {walkeeper => safekeeper}/src/control_file.rs (100%) rename {walkeeper => safekeeper}/src/control_file_upgrade.rs (100%) rename {walkeeper => safekeeper}/src/handler.rs (98%) rename {walkeeper => safekeeper}/src/http/mod.rs (100%) rename {walkeeper => safekeeper}/src/http/models.rs (100%) rename {walkeeper => safekeeper}/src/http/routes.rs (100%) rename {walkeeper => safekeeper}/src/json_ctrl.rs (100%) rename {walkeeper => safekeeper}/src/lib.rs (100%) rename {walkeeper => safekeeper}/src/receive_wal.rs (100%) rename {walkeeper => safekeeper}/src/s3_offload.rs (100%) rename {walkeeper => safekeeper}/src/safekeeper.rs (100%) rename {walkeeper => safekeeper}/src/send_wal.rs (100%) rename {walkeeper => safekeeper}/src/timeline.rs (100%) rename {walkeeper => safekeeper}/src/wal_service.rs (100%) rename {walkeeper => safekeeper}/src/wal_storage.rs (100%) diff --git a/Cargo.lock b/Cargo.lock index e93e73f087..a933b44356 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -381,13 +381,13 @@ dependencies = [ "postgres", "regex", "reqwest", + "safekeeper", "serde", "serde_with", "tar", "thiserror", "toml", "url", - "walkeeper", "workspace_hack", "zenith_utils", ] @@ -2290,6 +2290,45 @@ version = "1.0.9" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "73b4b750c782965c211b42f022f59af1fbceabdd026623714f104152f1ec149f" +[[package]] +name = "safekeeper" +version = "0.1.0" +dependencies = [ + "anyhow", + "byteorder", + "bytes", + "clap 3.0.14", + "const_format", + "crc32c", + "daemonize", + "etcd-client", + "fs2", + "hex", + "humantime", + "hyper", + "lazy_static", + "postgres", + "postgres-protocol", + "postgres_ffi", + "regex", + "rusoto_core", + 
"rusoto_s3", + "serde", + "serde_json", + "serde_with", + "signal-hook", + "tempfile", + "tokio", + "tokio-postgres", + "tokio-util 0.7.0", + "tracing", + "url", + "walkdir", + "workspace_hack", + "zenith_metrics", + "zenith_utils", +] + [[package]] name = "same-file" version = "1.0.6" @@ -3089,45 +3128,6 @@ dependencies = [ "winapi-util", ] -[[package]] -name = "walkeeper" -version = "0.1.0" -dependencies = [ - "anyhow", - "byteorder", - "bytes", - "clap 3.0.14", - "const_format", - "crc32c", - "daemonize", - "etcd-client", - "fs2", - "hex", - "humantime", - "hyper", - "lazy_static", - "postgres", - "postgres-protocol", - "postgres_ffi", - "regex", - "rusoto_core", - "rusoto_s3", - "serde", - "serde_json", - "serde_with", - "signal-hook", - "tempfile", - "tokio", - "tokio-postgres", - "tokio-util 0.7.0", - "tracing", - "url", - "walkdir", - "workspace_hack", - "zenith_metrics", - "zenith_utils", -] - [[package]] name = "want" version = "0.3.0" @@ -3369,8 +3369,8 @@ dependencies = [ "pageserver", "postgres", "postgres_ffi", + "safekeeper", "serde_json", - "walkeeper", "workspace_hack", "zenith_utils", ] diff --git a/Cargo.toml b/Cargo.toml index b8283a6112..4b3b31e0b7 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -5,7 +5,7 @@ members = [ "pageserver", "postgres_ffi", "proxy", - "walkeeper", + "safekeeper", "workspace_hack", "zenith", "zenith_metrics", diff --git a/control_plane/Cargo.toml b/control_plane/Cargo.toml index e118ea4793..80b6c00dd2 100644 --- a/control_plane/Cargo.toml +++ b/control_plane/Cargo.toml @@ -18,6 +18,6 @@ url = "2.2.2" reqwest = { version = "0.11", default-features = false, features = ["blocking", "json", "rustls-tls"] } pageserver = { path = "../pageserver" } -walkeeper = { path = "../walkeeper" } +safekeeper = { path = "../safekeeper" } zenith_utils = { path = "../zenith_utils" } workspace_hack = { version = "0.1", path = "../workspace_hack" } diff --git a/control_plane/src/safekeeper.rs b/control_plane/src/safekeeper.rs index 
89ab0a31ee..e23138bd3f 100644 --- a/control_plane/src/safekeeper.rs +++ b/control_plane/src/safekeeper.rs @@ -14,7 +14,7 @@ use postgres::Config; use reqwest::blocking::{Client, RequestBuilder, Response}; use reqwest::{IntoUrl, Method}; use thiserror::Error; -use walkeeper::http::models::TimelineCreateRequest; +use safekeeper::http::models::TimelineCreateRequest; use zenith_utils::http::error::HttpErrorBody; use zenith_utils::zid::{ZNodeId, ZTenantId, ZTimelineId}; diff --git a/docs/README.md b/docs/README.md index 0558fa24a8..a3fcd20bd2 100644 --- a/docs/README.md +++ b/docs/README.md @@ -10,5 +10,5 @@ - [pageserver/README](/pageserver/README) — pageserver overview. - [postgres_ffi/README](/postgres_ffi/README) — Postgres FFI overview. - [test_runner/README.md](/test_runner/README.md) — tests infrastructure overview. -- [walkeeper/README](/walkeeper/README) — WAL service overview. +- [safekeeper/README](/safekeeper/README) — WAL service overview. - [core_changes.md](core_changes.md) - Description of Zenith changes in Postgres core diff --git a/docs/glossary.md b/docs/glossary.md index 0f82f2d666..ecc57b9ed1 100644 --- a/docs/glossary.md +++ b/docs/glossary.md @@ -29,7 +29,7 @@ Each Branch lives in a corresponding timeline[] and has an ancestor[]. NOTE: This is an overloaded term. -A checkpoint record in the WAL marks a point in the WAL sequence at which it is guaranteed that all data files have been updated with all information from shared memory modified before that checkpoint; +A checkpoint record in the WAL marks a point in the WAL sequence at which it is guaranteed that all data files have been updated with all information from shared memory modified before that checkpoint; ### Checkpoint (Layered repository) @@ -108,10 +108,10 @@ PostgreSQL LSNs and functions to monitor them: * `pg_current_wal_lsn()` - Returns the current write-ahead log write location. * `pg_current_wal_flush_lsn()` - Returns the current write-ahead log flush location. 
* `pg_last_wal_receive_lsn()` - Returns the last write-ahead log location that has been received and synced to disk by streaming replication. While streaming replication is in progress this will increase monotonically. -* `pg_last_wal_replay_lsn ()` - Returns the last write-ahead log location that has been replayed during recovery. If recovery is still in progress this will increase monotonically. +* `pg_last_wal_replay_lsn ()` - Returns the last write-ahead log location that has been replayed during recovery. If recovery is still in progress this will increase monotonically. [source PostgreSQL documentation](https://www.postgresql.org/docs/devel/functions-admin.html): -Zenith safekeeper LSNs. For more check [walkeeper/README_PROTO.md](/walkeeper/README_PROTO.md) +Zenith safekeeper LSNs. For more check [safekeeper/README_PROTO.md](/safekeeper/README_PROTO.md) * `CommitLSN`: position in WAL confirmed by quorum safekeepers. * `RestartLSN`: position in WAL confirmed by all safekeepers. * `FlushLSN`: part of WAL persisted to the disk by safekeeper. @@ -190,7 +190,7 @@ or we do not support them in zenith yet (pg_commit_ts). Tenant represents a single customer, interacting with Zenith. Wal redo[] activity, timelines[], layers[] are managed for each tenant independently. One pageserver[] can serve multiple tenants at once. -One safekeeper +One safekeeper See `docs/multitenancy.md` for more. diff --git a/docs/rfcs/009-snapshot-first-storage-cli.md b/docs/rfcs/009-snapshot-first-storage-cli.md index 3f5386c165..11ded3a724 100644 --- a/docs/rfcs/009-snapshot-first-storage-cli.md +++ b/docs/rfcs/009-snapshot-first-storage-cli.md @@ -12,7 +12,7 @@ Init empty pageserver using `initdb` in temporary directory. `--storage_dest=FILE_PREFIX | S3_PREFIX |...` option defines object storage type, all other parameters are passed via env variables. Inspired by WAL-G style naming : https://wal-g.readthedocs.io/STORAGES/. -Save`storage_dest` and other parameters in config. 
+Save`storage_dest` and other parameters in config. Push snapshots to `storage_dest` in background. ``` @@ -21,7 +21,7 @@ zenith start ``` #### 2. Restart pageserver (manually or crash-recovery). -Take `storage_dest` from pageserver config, start pageserver from latest snapshot in `storage_dest`. +Take `storage_dest` from pageserver config, start pageserver from latest snapshot in `storage_dest`. Push snapshots to `storage_dest` in background. ``` @@ -32,7 +32,7 @@ zenith start Start pageserver from existing snapshot. Path to snapshot provided via `--snapshot_path=FILE_PREFIX | S3_PREFIX | ...` Do not save `snapshot_path` and `snapshot_format` in config, as it is a one-time operation. -Save`storage_dest` parameters in config. +Save`storage_dest` parameters in config. Push snapshots to `storage_dest` in background. ``` //I.e. we want to start zenith on top of existing $PGDATA and use s3 as a persistent storage. @@ -42,15 +42,15 @@ zenith start How to pass credentials needed for `snapshot_path`? #### 4. Export. -Manually push snapshot to `snapshot_path` which differs from `storage_dest` +Manually push snapshot to `snapshot_path` which differs from `storage_dest` Optionally set `snapshot_format`, which can be plain pgdata format or zenith format. ``` zenith export --snapshot_path=FILE_PREFIX --snapshot_format=pgdata ``` #### Notes and questions -- walkeeper s3_offload should use same (similar) syntax for storage. How to set it in UI? +- safekeeper s3_offload should use same (similar) syntax for storage. How to set it in UI? - Why do we need `zenith init` as a separate command? Can't we init everything at first start? - We can think of better names for all options. - Export to plain postgres format will be useless, if we are not 100% compatible on page level. -I can recall at least one such difference - PD_WAL_LOGGED flag in pages. \ No newline at end of file +I can recall at least one such difference - PD_WAL_LOGGED flag in pages. 
diff --git a/docs/sourcetree.md b/docs/sourcetree.md index 89b07de8d2..b15294d67f 100644 --- a/docs/sourcetree.md +++ b/docs/sourcetree.md @@ -57,12 +57,12 @@ PostgreSQL extension that implements storage manager API and network communicati PostgreSQL extension that contains functions needed for testing and debugging. -`/walkeeper`: +`/safekeeper`: The zenith WAL service that receives WAL from a primary compute nodes and streams it to the pageserver. It acts as a holding area and redistribution center for recently generated WAL. -For more detailed info, see `/walkeeper/README` +For more detailed info, see `/safekeeper/README` `/workspace_hack`: The workspace_hack crate exists only to pin down some dependencies. diff --git a/postgres_ffi/src/waldecoder.rs b/postgres_ffi/src/waldecoder.rs index ac48b1b0f3..ce5aaf722d 100644 --- a/postgres_ffi/src/waldecoder.rs +++ b/postgres_ffi/src/waldecoder.rs @@ -4,7 +4,7 @@ //! This understands the WAL page and record format, enough to figure out where the WAL record //! boundaries are, and to reassemble WAL records that cross page boundaries. //! -//! This functionality is needed by both the pageserver and the walkeepers. The pageserver needs +//! This functionality is needed by both the pageserver and the safekeepers. The pageserver needs //! to look deeper into the WAL records to also understand which blocks they modify, the code //! for that is in pageserver/src/walrecord.rs //! 
diff --git a/walkeeper/Cargo.toml b/safekeeper/Cargo.toml similarity index 98% rename from walkeeper/Cargo.toml rename to safekeeper/Cargo.toml index 86aa56c9ae..ca5e2a6b55 100644 --- a/walkeeper/Cargo.toml +++ b/safekeeper/Cargo.toml @@ -1,5 +1,5 @@ [package] -name = "walkeeper" +name = "safekeeper" version = "0.1.0" edition = "2021" diff --git a/walkeeper/README b/safekeeper/README similarity index 100% rename from walkeeper/README rename to safekeeper/README diff --git a/walkeeper/README_PROTO.md b/safekeeper/README_PROTO.md similarity index 100% rename from walkeeper/README_PROTO.md rename to safekeeper/README_PROTO.md diff --git a/walkeeper/spec/ProposerAcceptorConsensus.cfg b/safekeeper/spec/ProposerAcceptorConsensus.cfg similarity index 100% rename from walkeeper/spec/ProposerAcceptorConsensus.cfg rename to safekeeper/spec/ProposerAcceptorConsensus.cfg diff --git a/walkeeper/spec/ProposerAcceptorConsensus.tla b/safekeeper/spec/ProposerAcceptorConsensus.tla similarity index 100% rename from walkeeper/spec/ProposerAcceptorConsensus.tla rename to safekeeper/spec/ProposerAcceptorConsensus.tla diff --git a/walkeeper/src/bin/safekeeper.rs b/safekeeper/src/bin/safekeeper.rs similarity index 97% rename from walkeeper/src/bin/safekeeper.rs rename to safekeeper/src/bin/safekeeper.rs index b3087a1004..490198231d 100644 --- a/walkeeper/src/bin/safekeeper.rs +++ b/safekeeper/src/bin/safekeeper.rs @@ -12,18 +12,18 @@ use std::path::{Path, PathBuf}; use std::thread; use tracing::*; use url::{ParseError, Url}; -use walkeeper::control_file::{self}; use zenith_utils::http::endpoint; use zenith_utils::zid::ZNodeId; use zenith_utils::{logging, tcp_listener, GIT_VERSION}; +use safekeeper::control_file::{self}; +use safekeeper::defaults::{DEFAULT_HTTP_LISTEN_ADDR, DEFAULT_PG_LISTEN_ADDR}; +use safekeeper::http; +use safekeeper::s3_offload; +use safekeeper::wal_service; +use safekeeper::SafeKeeperConf; +use safekeeper::{broker, callmemaybe}; use tokio::sync::mpsc; -use 
walkeeper::defaults::{DEFAULT_HTTP_LISTEN_ADDR, DEFAULT_PG_LISTEN_ADDR}; -use walkeeper::http; -use walkeeper::s3_offload; -use walkeeper::wal_service; -use walkeeper::SafeKeeperConf; -use walkeeper::{broker, callmemaybe}; use zenith_utils::shutdown::exit_now; use zenith_utils::signals; diff --git a/walkeeper/src/broker.rs b/safekeeper/src/broker.rs similarity index 100% rename from walkeeper/src/broker.rs rename to safekeeper/src/broker.rs diff --git a/walkeeper/src/callmemaybe.rs b/safekeeper/src/callmemaybe.rs similarity index 100% rename from walkeeper/src/callmemaybe.rs rename to safekeeper/src/callmemaybe.rs diff --git a/walkeeper/src/control_file.rs b/safekeeper/src/control_file.rs similarity index 100% rename from walkeeper/src/control_file.rs rename to safekeeper/src/control_file.rs diff --git a/walkeeper/src/control_file_upgrade.rs b/safekeeper/src/control_file_upgrade.rs similarity index 100% rename from walkeeper/src/control_file_upgrade.rs rename to safekeeper/src/control_file_upgrade.rs diff --git a/walkeeper/src/handler.rs b/safekeeper/src/handler.rs similarity index 98% rename from walkeeper/src/handler.rs rename to safekeeper/src/handler.rs index 00d177da56..bb14049787 100644 --- a/walkeeper/src/handler.rs +++ b/safekeeper/src/handler.rs @@ -94,7 +94,7 @@ impl postgres_backend::Handler for SafekeeperPostgresHandler { Ok(()) } else { - bail!("Walkeeper received unexpected initial message: {:?}", sm); + bail!("Safekeeper received unexpected initial message: {:?}", sm); } } diff --git a/walkeeper/src/http/mod.rs b/safekeeper/src/http/mod.rs similarity index 100% rename from walkeeper/src/http/mod.rs rename to safekeeper/src/http/mod.rs diff --git a/walkeeper/src/http/models.rs b/safekeeper/src/http/models.rs similarity index 100% rename from walkeeper/src/http/models.rs rename to safekeeper/src/http/models.rs diff --git a/walkeeper/src/http/routes.rs b/safekeeper/src/http/routes.rs similarity index 100% rename from walkeeper/src/http/routes.rs rename 
to safekeeper/src/http/routes.rs diff --git a/walkeeper/src/json_ctrl.rs b/safekeeper/src/json_ctrl.rs similarity index 100% rename from walkeeper/src/json_ctrl.rs rename to safekeeper/src/json_ctrl.rs diff --git a/walkeeper/src/lib.rs b/safekeeper/src/lib.rs similarity index 100% rename from walkeeper/src/lib.rs rename to safekeeper/src/lib.rs diff --git a/walkeeper/src/receive_wal.rs b/safekeeper/src/receive_wal.rs similarity index 100% rename from walkeeper/src/receive_wal.rs rename to safekeeper/src/receive_wal.rs diff --git a/walkeeper/src/s3_offload.rs b/safekeeper/src/s3_offload.rs similarity index 100% rename from walkeeper/src/s3_offload.rs rename to safekeeper/src/s3_offload.rs diff --git a/walkeeper/src/safekeeper.rs b/safekeeper/src/safekeeper.rs similarity index 100% rename from walkeeper/src/safekeeper.rs rename to safekeeper/src/safekeeper.rs diff --git a/walkeeper/src/send_wal.rs b/safekeeper/src/send_wal.rs similarity index 100% rename from walkeeper/src/send_wal.rs rename to safekeeper/src/send_wal.rs diff --git a/walkeeper/src/timeline.rs b/safekeeper/src/timeline.rs similarity index 100% rename from walkeeper/src/timeline.rs rename to safekeeper/src/timeline.rs diff --git a/walkeeper/src/wal_service.rs b/safekeeper/src/wal_service.rs similarity index 100% rename from walkeeper/src/wal_service.rs rename to safekeeper/src/wal_service.rs diff --git a/walkeeper/src/wal_storage.rs b/safekeeper/src/wal_storage.rs similarity index 100% rename from walkeeper/src/wal_storage.rs rename to safekeeper/src/wal_storage.rs diff --git a/zenith/Cargo.toml b/zenith/Cargo.toml index 74aeffb51c..69283d3763 100644 --- a/zenith/Cargo.toml +++ b/zenith/Cargo.toml @@ -12,7 +12,7 @@ postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98 # FIXME: 'pageserver' is needed for BranchInfo. 
Refactor pageserver = { path = "../pageserver" } control_plane = { path = "../control_plane" } -walkeeper = { path = "../walkeeper" } +safekeeper = { path = "../safekeeper" } postgres_ffi = { path = "../postgres_ffi" } zenith_utils = { path = "../zenith_utils" } workspace_hack = { version = "0.1", path = "../workspace_hack" } diff --git a/zenith/src/main.rs b/zenith/src/main.rs index f5d4184e63..97b07b7b74 100644 --- a/zenith/src/main.rs +++ b/zenith/src/main.rs @@ -12,7 +12,7 @@ use pageserver::config::defaults::{ use std::collections::{BTreeSet, HashMap}; use std::process::exit; use std::str::FromStr; -use walkeeper::defaults::{ +use safekeeper::defaults::{ DEFAULT_HTTP_LISTEN_PORT as DEFAULT_SAFEKEEPER_HTTP_PORT, DEFAULT_PG_LISTEN_PORT as DEFAULT_SAFEKEEPER_PG_PORT, }; From 52e0816fa5a19bb741c7b053a4f6ae88bb4ff9c8 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Mon, 18 Apr 2022 11:49:46 +0300 Subject: [PATCH 111/296] wal_acceptor -> safekeeper --- control_plane/src/compute.rs | 4 +-- control_plane/src/safekeeper.rs | 2 +- safekeeper/src/bin/safekeeper.rs | 8 ++--- test_runner/batch_others/test_auth.py | 8 ++--- .../batch_others/test_restart_compute.py | 6 ++-- test_runner/batch_others/test_tenants.py | 18 +++++------ test_runner/batch_others/test_wal_acceptor.py | 32 +++++++++---------- .../batch_others/test_wal_acceptor_async.py | 6 ++-- test_runner/fixtures/log_helper.py | 2 +- test_runner/fixtures/zenith_fixtures.py | 8 ++--- .../performance/test_bulk_tenant_create.py | 12 +++---- zenith/src/main.rs | 6 ++-- 12 files changed, 56 insertions(+), 56 deletions(-) diff --git a/control_plane/src/compute.rs b/control_plane/src/compute.rs index 64cd46fef6..1c979acbdf 100644 --- a/control_plane/src/compute.rs +++ b/control_plane/src/compute.rs @@ -331,14 +331,14 @@ impl PostgresNode { // Configure the node to connect to the safekeepers conf.append("synchronous_standby_names", "walproposer"); - let wal_acceptors = self + let safekeepers = self .env .safekeepers 
.iter() .map(|sk| format!("localhost:{}", sk.pg_port)) .collect::>() .join(","); - conf.append("wal_acceptors", &wal_acceptors); + conf.append("wal_acceptors", &safekeepers); } else { // We only use setup without safekeepers for tests, // and don't care about data durability on pageserver, diff --git a/control_plane/src/safekeeper.rs b/control_plane/src/safekeeper.rs index e23138bd3f..6f11a4e03d 100644 --- a/control_plane/src/safekeeper.rs +++ b/control_plane/src/safekeeper.rs @@ -13,8 +13,8 @@ use nix::unistd::Pid; use postgres::Config; use reqwest::blocking::{Client, RequestBuilder, Response}; use reqwest::{IntoUrl, Method}; -use thiserror::Error; use safekeeper::http::models::TimelineCreateRequest; +use thiserror::Error; use zenith_utils::http::error::HttpErrorBody; use zenith_utils::zid::{ZNodeId, ZTenantId, ZTimelineId}; diff --git a/safekeeper/src/bin/safekeeper.rs b/safekeeper/src/bin/safekeeper.rs index 490198231d..e191cb52fd 100644 --- a/safekeeper/src/bin/safekeeper.rs +++ b/safekeeper/src/bin/safekeeper.rs @@ -257,18 +257,18 @@ fn start_safekeeper(mut conf: SafeKeeperConf, given_id: Option, init: b let (tx, rx) = mpsc::unbounded_channel(); let conf_cloned = conf.clone(); - let wal_acceptor_thread = thread::Builder::new() - .name("WAL acceptor thread".into()) + let safekeeper_thread = thread::Builder::new() + .name("Safekeeper thread".into()) .spawn(|| { // thread code let thread_result = wal_service::thread_main(conf_cloned, pg_listener, tx); if let Err(e) = thread_result { - info!("wal_service thread terminated: {}", e); + info!("safekeeper thread terminated: {}", e); } }) .unwrap(); - threads.push(wal_acceptor_thread); + threads.push(safekeeper_thread); let conf_cloned = conf.clone(); let callmemaybe_thread = thread::Builder::new() diff --git a/test_runner/batch_others/test_auth.py b/test_runner/batch_others/test_auth.py index bda6349ef9..a8ad384f27 100644 --- a/test_runner/batch_others/test_auth.py +++ b/test_runner/batch_others/test_auth.py @@ -52,14 
+52,14 @@ def test_pageserver_auth(zenith_env_builder: ZenithEnvBuilder): tenant_http_client.tenant_create() -@pytest.mark.parametrize('with_wal_acceptors', [False, True]) -def test_compute_auth_to_pageserver(zenith_env_builder: ZenithEnvBuilder, with_wal_acceptors: bool): +@pytest.mark.parametrize('with_safekeepers', [False, True]) +def test_compute_auth_to_pageserver(zenith_env_builder: ZenithEnvBuilder, with_safekeepers: bool): zenith_env_builder.pageserver_auth_enabled = True - if with_wal_acceptors: + if with_safekeepers: zenith_env_builder.num_safekeepers = 3 env = zenith_env_builder.init_start() - branch = f'test_compute_auth_to_pageserver{with_wal_acceptors}' + branch = f'test_compute_auth_to_pageserver{with_safekeepers}' env.zenith_cli.create_branch(branch) pg = env.postgres.create_start(branch) diff --git a/test_runner/batch_others/test_restart_compute.py b/test_runner/batch_others/test_restart_compute.py index fd06561c00..d6e7fd9e0d 100644 --- a/test_runner/batch_others/test_restart_compute.py +++ b/test_runner/batch_others/test_restart_compute.py @@ -8,10 +8,10 @@ from fixtures.log_helper import log # # Test restarting and recreating a postgres instance # -@pytest.mark.parametrize('with_wal_acceptors', [False, True]) -def test_restart_compute(zenith_env_builder: ZenithEnvBuilder, with_wal_acceptors: bool): +@pytest.mark.parametrize('with_safekeepers', [False, True]) +def test_restart_compute(zenith_env_builder: ZenithEnvBuilder, with_safekeepers: bool): zenith_env_builder.pageserver_auth_enabled = True - if with_wal_acceptors: + if with_safekeepers: zenith_env_builder.num_safekeepers = 3 env = zenith_env_builder.init_start() diff --git a/test_runner/batch_others/test_tenants.py b/test_runner/batch_others/test_tenants.py index e883018628..682af8de49 100644 --- a/test_runner/batch_others/test_tenants.py +++ b/test_runner/batch_others/test_tenants.py @@ -5,9 +5,9 @@ import pytest from fixtures.zenith_fixtures import ZenithEnvBuilder 
-@pytest.mark.parametrize('with_wal_acceptors', [False, True]) -def test_tenants_normal_work(zenith_env_builder: ZenithEnvBuilder, with_wal_acceptors: bool): - if with_wal_acceptors: +@pytest.mark.parametrize('with_safekeepers', [False, True]) +def test_tenants_normal_work(zenith_env_builder: ZenithEnvBuilder, with_safekeepers: bool): + if with_safekeepers: zenith_env_builder.num_safekeepers = 3 env = zenith_env_builder.init_start() @@ -15,17 +15,17 @@ def test_tenants_normal_work(zenith_env_builder: ZenithEnvBuilder, with_wal_acce tenant_1 = env.zenith_cli.create_tenant() tenant_2 = env.zenith_cli.create_tenant() - env.zenith_cli.create_timeline( - f'test_tenants_normal_work_with_wal_acceptors{with_wal_acceptors}', tenant_id=tenant_1) - env.zenith_cli.create_timeline( - f'test_tenants_normal_work_with_wal_acceptors{with_wal_acceptors}', tenant_id=tenant_2) + env.zenith_cli.create_timeline(f'test_tenants_normal_work_with_safekeepers{with_safekeepers}', + tenant_id=tenant_1) + env.zenith_cli.create_timeline(f'test_tenants_normal_work_with_safekeepers{with_safekeepers}', + tenant_id=tenant_2) pg_tenant1 = env.postgres.create_start( - f'test_tenants_normal_work_with_wal_acceptors{with_wal_acceptors}', + f'test_tenants_normal_work_with_safekeepers{with_safekeepers}', tenant_id=tenant_1, ) pg_tenant2 = env.postgres.create_start( - f'test_tenants_normal_work_with_wal_acceptors{with_wal_acceptors}', + f'test_tenants_normal_work_with_safekeepers{with_safekeepers}', tenant_id=tenant_2, ) diff --git a/test_runner/batch_others/test_wal_acceptor.py b/test_runner/batch_others/test_wal_acceptor.py index dffcd7cc61..cc9ec9a275 100644 --- a/test_runner/batch_others/test_wal_acceptor.py +++ b/test_runner/batch_others/test_wal_acceptor.py @@ -25,8 +25,8 @@ def test_normal_work(zenith_env_builder: ZenithEnvBuilder): zenith_env_builder.broker = True env = zenith_env_builder.init_start() - env.zenith_cli.create_branch('test_wal_acceptors_normal_work') - pg = 
env.postgres.create_start('test_wal_acceptors_normal_work') + env.zenith_cli.create_branch('test_safekeepers_normal_work') + pg = env.postgres.create_start('test_safekeepers_normal_work') with closing(pg.connect()) as conn: with conn.cursor() as cur: @@ -56,7 +56,7 @@ def test_many_timelines(zenith_env_builder: ZenithEnvBuilder): n_timelines = 3 branch_names = [ - "test_wal_acceptors_many_timelines_{}".format(tlin) for tlin in range(n_timelines) + "test_safekeepers_many_timelines_{}".format(tlin) for tlin in range(n_timelines) ] # pageserver, safekeeper operate timelines via their ids (can be represented in hex as 'ad50847381e248feaac9876cc71ae418') # that's not really human readable, so the branch names are introduced in Zenith CLI. @@ -196,8 +196,8 @@ def test_restarts(zenith_env_builder: ZenithEnvBuilder): zenith_env_builder.num_safekeepers = n_acceptors env = zenith_env_builder.init_start() - env.zenith_cli.create_branch('test_wal_acceptors_restarts') - pg = env.postgres.create_start('test_wal_acceptors_restarts') + env.zenith_cli.create_branch('test_safekeepers_restarts') + pg = env.postgres.create_start('test_safekeepers_restarts') # we rely upon autocommit after each statement # as waiting for acceptors happens there @@ -223,7 +223,7 @@ def test_restarts(zenith_env_builder: ZenithEnvBuilder): start_delay_sec = 2 -def delayed_wal_acceptor_start(wa): +def delayed_safekeeper_start(wa): time.sleep(start_delay_sec) wa.start() @@ -233,8 +233,8 @@ def test_unavailability(zenith_env_builder: ZenithEnvBuilder): zenith_env_builder.num_safekeepers = 2 env = zenith_env_builder.init_start() - env.zenith_cli.create_branch('test_wal_acceptors_unavailability') - pg = env.postgres.create_start('test_wal_acceptors_unavailability') + env.zenith_cli.create_branch('test_safekeepers_unavailability') + pg = env.postgres.create_start('test_safekeepers_unavailability') # we rely upon autocommit after each statement # as waiting for acceptors happens there @@ -248,7 +248,7 @@ def 
test_unavailability(zenith_env_builder: ZenithEnvBuilder): # shutdown one of two acceptors, that is, majority env.safekeepers[0].stop() - proc = Process(target=delayed_wal_acceptor_start, args=(env.safekeepers[0], )) + proc = Process(target=delayed_safekeeper_start, args=(env.safekeepers[0], )) proc.start() start = time.time() @@ -260,7 +260,7 @@ def test_unavailability(zenith_env_builder: ZenithEnvBuilder): # for the world's balance, do the same with second acceptor env.safekeepers[1].stop() - proc = Process(target=delayed_wal_acceptor_start, args=(env.safekeepers[1], )) + proc = Process(target=delayed_safekeeper_start, args=(env.safekeepers[1], )) proc.start() start = time.time() @@ -304,8 +304,8 @@ def test_race_conditions(zenith_env_builder: ZenithEnvBuilder, stop_value): zenith_env_builder.num_safekeepers = 3 env = zenith_env_builder.init_start() - env.zenith_cli.create_branch('test_wal_acceptors_race_conditions') - pg = env.postgres.create_start('test_wal_acceptors_race_conditions') + env.zenith_cli.create_branch('test_safekeepers_race_conditions') + pg = env.postgres.create_start('test_safekeepers_race_conditions') # we rely upon autocommit after each statement # as waiting for acceptors happens there @@ -396,7 +396,7 @@ class ProposerPostgres(PgProtocol): """ Path to postgresql.conf """ return os.path.join(self.pgdata_dir, 'postgresql.conf') - def create_dir_config(self, wal_acceptors: str): + def create_dir_config(self, safekeepers: str): """ Create dir and config for running --sync-safekeepers """ mkdir_if_needed(self.pg_data_dir_path()) @@ -407,7 +407,7 @@ class ProposerPostgres(PgProtocol): f"zenith.zenith_timeline = '{self.timeline_id.hex}'\n", f"zenith.zenith_tenant = '{self.tenant_id.hex}'\n", f"zenith.page_server_connstring = ''\n", - f"wal_acceptors = '{wal_acceptors}'\n", + f"wal_acceptors = '{safekeepers}'\n", f"listen_addresses = '{self.listen_addr}'\n", f"port = '{self.port}'\n", ] @@ -692,7 +692,7 @@ def 
test_replace_safekeeper(zenith_env_builder: ZenithEnvBuilder): env.safekeepers[3].stop() active_safekeepers = [1, 2, 3] pg = env.postgres.create('test_replace_safekeeper') - pg.adjust_for_wal_acceptors(safekeepers_guc(env, active_safekeepers)) + pg.adjust_for_safekeepers(safekeepers_guc(env, active_safekeepers)) pg.start() # learn zenith timeline from compute @@ -732,7 +732,7 @@ def test_replace_safekeeper(zenith_env_builder: ZenithEnvBuilder): pg.stop_and_destroy().create('test_replace_safekeeper') active_safekeepers = [2, 3, 4] env.safekeepers[3].start() - pg.adjust_for_wal_acceptors(safekeepers_guc(env, active_safekeepers)) + pg.adjust_for_safekeepers(safekeepers_guc(env, active_safekeepers)) pg.start() execute_payload(pg) diff --git a/test_runner/batch_others/test_wal_acceptor_async.py b/test_runner/batch_others/test_wal_acceptor_async.py index aadafc76cf..e3df8ea3eb 100644 --- a/test_runner/batch_others/test_wal_acceptor_async.py +++ b/test_runner/batch_others/test_wal_acceptor_async.py @@ -9,7 +9,7 @@ from fixtures.log_helper import getLogger from fixtures.utils import lsn_from_hex, lsn_to_hex from typing import List -log = getLogger('root.wal_acceptor_async') +log = getLogger('root.safekeeper_async') class BankClient(object): @@ -207,9 +207,9 @@ def test_restarts_under_load(zenith_env_builder: ZenithEnvBuilder): zenith_env_builder.num_safekeepers = 3 env = zenith_env_builder.init_start() - env.zenith_cli.create_branch('test_wal_acceptors_restarts_under_load') + env.zenith_cli.create_branch('test_safekeepers_restarts_under_load') # Enable backpressure with 1MB maximal lag, because we don't want to block on `wait_for_lsn()` for too long - pg = env.postgres.create_start('test_wal_acceptors_restarts_under_load', + pg = env.postgres.create_start('test_safekeepers_restarts_under_load', config_lines=['max_replication_write_lag=1MB']) asyncio.run(run_restarts_under_load(env, pg, env.safekeepers)) diff --git a/test_runner/fixtures/log_helper.py 
b/test_runner/fixtures/log_helper.py index 9aa5f40bf3..7c2d83d4e3 100644 --- a/test_runner/fixtures/log_helper.py +++ b/test_runner/fixtures/log_helper.py @@ -25,7 +25,7 @@ LOGGING = { "root": { "level": "INFO" }, - "root.wal_acceptor_async": { + "root.safekeeper_async": { "level": "INFO" # a lot of logs on DEBUG level } } diff --git a/test_runner/fixtures/zenith_fixtures.py b/test_runner/fixtures/zenith_fixtures.py index f8ee39a5a1..e0f08a3bfb 100644 --- a/test_runner/fixtures/zenith_fixtures.py +++ b/test_runner/fixtures/zenith_fixtures.py @@ -612,7 +612,7 @@ class ZenithEnv: self.broker.start() def get_safekeeper_connstrs(self) -> str: - """ Get list of safekeeper endpoints suitable for wal_acceptors GUC """ + """ Get list of safekeeper endpoints suitable for safekeepers GUC """ return ','.join([f'localhost:{wa.port.pg}' for wa in self.safekeepers]) @cached_property @@ -1484,7 +1484,7 @@ class Postgres(PgProtocol): """ Path to postgresql.conf """ return os.path.join(self.pg_data_dir_path(), 'postgresql.conf') - def adjust_for_wal_acceptors(self, wal_acceptors: str) -> 'Postgres': + def adjust_for_safekeepers(self, safekeepers: str) -> 'Postgres': """ Adjust instance config for working with wal acceptors instead of pageserver (pre-configured by CLI) directly. 
@@ -1499,12 +1499,12 @@ class Postgres(PgProtocol): if ("synchronous_standby_names" in cfg_line or # don't ask pageserver to fetch WAL from compute "callmemaybe_connstring" in cfg_line or - # don't repeat wal_acceptors multiple times + # don't repeat safekeepers/wal_acceptors multiple times "wal_acceptors" in cfg_line): continue f.write(cfg_line) f.write("synchronous_standby_names = 'walproposer'\n") - f.write("wal_acceptors = '{}'\n".format(wal_acceptors)) + f.write("wal_acceptors = '{}'\n".format(safekeepers)) return self def config(self, lines: List[str]) -> 'Postgres': diff --git a/test_runner/performance/test_bulk_tenant_create.py b/test_runner/performance/test_bulk_tenant_create.py index fbef131ffd..f0729d3a07 100644 --- a/test_runner/performance/test_bulk_tenant_create.py +++ b/test_runner/performance/test_bulk_tenant_create.py @@ -13,15 +13,15 @@ from fixtures.zenith_fixtures import ZenithEnvBuilder @pytest.mark.parametrize('tenants_count', [1, 5, 10]) -@pytest.mark.parametrize('use_wal_acceptors', ['with_wa', 'without_wa']) +@pytest.mark.parametrize('use_safekeepers', ['with_wa', 'without_wa']) def test_bulk_tenant_create( zenith_env_builder: ZenithEnvBuilder, - use_wal_acceptors: str, + use_safekeepers: str, tenants_count: int, zenbenchmark, ): """Measure tenant creation time (with and without wal acceptors)""" - if use_wal_acceptors == 'with_wa': + if use_safekeepers == 'with_wa': zenith_env_builder.num_safekeepers = 3 env = zenith_env_builder.init_start() @@ -32,14 +32,14 @@ def test_bulk_tenant_create( tenant = env.zenith_cli.create_tenant() env.zenith_cli.create_timeline( - f'test_bulk_tenant_create_{tenants_count}_{i}_{use_wal_acceptors}', tenant_id=tenant) + f'test_bulk_tenant_create_{tenants_count}_{i}_{use_safekeepers}', tenant_id=tenant) # FIXME: We used to start new safekeepers here. Did that make sense? Should we do it now? 
- #if use_wal_acceptors == 'with_wa': + #if use_safekeepers == 'with_sa': # wa_factory.start_n_new(3) pg_tenant = env.postgres.create_start( - f'test_bulk_tenant_create_{tenants_count}_{i}_{use_wal_acceptors}', tenant_id=tenant) + f'test_bulk_tenant_create_{tenants_count}_{i}_{use_safekeepers}', tenant_id=tenant) end = timeit.default_timer() time_slices.append(end - start) diff --git a/zenith/src/main.rs b/zenith/src/main.rs index 97b07b7b74..18368895a4 100644 --- a/zenith/src/main.rs +++ b/zenith/src/main.rs @@ -9,13 +9,13 @@ use pageserver::config::defaults::{ DEFAULT_HTTP_LISTEN_ADDR as DEFAULT_PAGESERVER_HTTP_ADDR, DEFAULT_PG_LISTEN_ADDR as DEFAULT_PAGESERVER_PG_ADDR, }; -use std::collections::{BTreeSet, HashMap}; -use std::process::exit; -use std::str::FromStr; use safekeeper::defaults::{ DEFAULT_HTTP_LISTEN_PORT as DEFAULT_SAFEKEEPER_HTTP_PORT, DEFAULT_PG_LISTEN_PORT as DEFAULT_SAFEKEEPER_PG_PORT, }; +use std::collections::{BTreeSet, HashMap}; +use std::process::exit; +use std::str::FromStr; use zenith_utils::auth::{Claims, Scope}; use zenith_utils::lsn::Lsn; use zenith_utils::postgres_backend::AuthType; From c15aa04714e82af1542b8ade1b6d8c1453474dee Mon Sep 17 00:00:00 2001 From: Anastasia Lubennikova Date: Thu, 14 Apr 2022 12:56:46 +0300 Subject: [PATCH 112/296] Move Cluster size limit RFC from rfcs repo --- docs/rfcs/cluster-size-limits.md | 79 ++++++++++++++++++++++++++++++++ 1 file changed, 79 insertions(+) create mode 100644 docs/rfcs/cluster-size-limits.md diff --git a/docs/rfcs/cluster-size-limits.md b/docs/rfcs/cluster-size-limits.md new file mode 100644 index 0000000000..4696f2c7f0 --- /dev/null +++ b/docs/rfcs/cluster-size-limits.md @@ -0,0 +1,79 @@ +Cluster size limits +================== + +## Summary + +One of the resource consumption limits for free-tier users is a cluster size limit. + +To enforce it, we need to calculate the timeline size and check if the limit is reached before relation create/extend operations. 
+If the limit is reached, the query must fail with some meaningful error/warning. +We may want to exempt some operations from the quota to allow users free space to fit back into the limit. + +The stateless compute node that performs validation is separate from the storage that calculates the usage, so we need to exchange cluster size information between those components. + +## Motivation + +Limit the maximum size of a PostgreSQL instance to limit free tier users (and other tiers in the future). +First of all, this is needed to control our free tier production costs. +Another reason to limit resources is risk management — we haven't (fully) tested and optimized zenith for big clusters, +so we don't want to give users access to the functionality that we don't think is ready. + +## Components + +* pageserver - calculate the size consumed by a timeline and add it to the feedback message. +* safekeeper - pass feedback message from pageserver to compute. +* compute - receive feedback message, enforce size limit based on GUC `zenith.max_cluster_size`. +* console - set and update `zenith.max_cluster_size` setting + +## Proposed implementation + +First of all, it's necessary to define timeline size. + +The current approach is to count all data, including SLRUs. (not including WAL) +Here we think of it as a physical disk underneath the Postgres cluster. +This is how the `LOGICAL_TIMELINE_SIZE` metric is implemented in the pageserver. + +Alternatively, we could count only relation data. As in pg_database_size(). +This approach is somewhat more user-friendly because it is the data that is really affected by the user. +On the other hand, it puts us in a weaker position than other services, i.e., RDS. +We will need to refactor the timeline_size counter or add another counter to implement it. + +Timeline size is updated during wal digestion. It is not versioned and is valid at the last_received_lsn moment. +Then this size should be reported to compute node. 
+ +`current_timeline_size` value is included in the walreceiver's custom feedback message: `ZenithFeedback`. + +(PR about protocol changes https://github.com/zenithdb/zenith/pull/1037). + +This message is received by the safekeeper and propagated to compute node as a part of `AppendResponse`. + +Finally, when compute node receives the `current_timeline_size` from safekeeper (or from pageserver directly), it updates the global variable. + +And then every zenith_extend() operation checks if limit is reached `(current_timeline_size > zenith.max_cluster_size)` and throws `ERRCODE_DISK_FULL` error if so. +(see Postgres error codes [https://www.postgresql.org/docs/devel/errcodes-appendix.html](https://www.postgresql.org/docs/devel/errcodes-appendix.html)) + +TODO: +We can allow autovacuum processes to bypass this check, simply checking `IsAutoVacuumWorkerProcess()`. +It would be nice to allow manual VACUUM and VACUUM FULL to bypass the check, but it is hard to distinguish these operations at the low level. +See issues https://github.com/neondatabase/neon/issues/1245 +https://github.com/zenithdb/zenith/issues/1445 + +TODO: +We should warn users if the limit is soon to be reached. + +### **Reliability, failure modes and corner cases** + +1. `current_timeline_size` is valid as of the last LSN received and digested by the pageserver. + + If pageserver lags behind compute node, `current_timeline_size` will lag too. This lag can be tuned using backpressure, but it is not expected to be 0 all the time. + + So transactions that happen in this lsn range may cause limit overflow. Especially operations that generate (e.g., CREATE DATABASE) or free (e.g., TRUNCATE) a lot of data pages while generating a small amount of WAL. Are there other operations like this? + + Currently, CREATE DATABASE operations are restricted in the console. So this is not an issue. + + +### **Security implications** + +We treat compute as an untrusted component.
That's why we try to isolate it with secure container runtime or a VM. +Malicious users may change the `zenith.max_cluster_size`, so we need an extra size limit check. +To cover this case, we also monitor the compute node size in the console. From 389bd1faeb91904e1bcd23dce10217abbd45ae53 Mon Sep 17 00:00:00 2001 From: Stas Kelvich Date: Sun, 17 Apr 2022 23:12:04 +0300 Subject: [PATCH 113/296] Support for SCRAM-SHA-256 in compute tools --- compute_tools/src/pg_helpers.rs | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/compute_tools/src/pg_helpers.rs b/compute_tools/src/pg_helpers.rs index 6a22b865fa..1409a81b6b 100644 --- a/compute_tools/src/pg_helpers.rs +++ b/compute_tools/src/pg_helpers.rs @@ -132,7 +132,14 @@ impl Role { let mut params: String = "LOGIN".to_string(); if let Some(pass) = &self.encrypted_password { - params.push_str(&format!(" PASSWORD 'md5{}'", pass)); + // Some time ago we supported only md5 and treated all encrypted_password as md5. + // Now we also support SCRAM-SHA-256 and to preserve compatibility + // we treat all encrypted_password as md5 unless they starts with SCRAM-SHA-256. 
+ if pass.starts_with("SCRAM-SHA-256") { + params.push_str(&format!(" PASSWORD '{}'", pass)); + } else { + params.push_str(&format!(" PASSWORD 'md5{}'", pass)); + } } else { params.push_str(" PASSWORD NULL"); } From a1e34772e56111403501f867e34693c863b95258 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Fri, 15 Apr 2022 18:13:26 +0300 Subject: [PATCH 114/296] Improve compute error logging --- control_plane/src/compute.rs | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/control_plane/src/compute.rs b/control_plane/src/compute.rs index 1c979acbdf..c078c274cf 100644 --- a/control_plane/src/compute.rs +++ b/control_plane/src/compute.rs @@ -420,10 +420,15 @@ impl PostgresNode { if let Some(token) = auth_token { cmd.env("ZENITH_AUTH_TOKEN", token); } - let pg_ctl = cmd.status().context("pg_ctl failed")?; - if !pg_ctl.success() { - anyhow::bail!("pg_ctl failed"); + let pg_ctl = cmd.output().context("pg_ctl failed")?; + if !pg_ctl.status.success() { + anyhow::bail!( + "pg_ctl failed, exit code: {}, stdout: {}, stderr: {}", + pg_ctl.status, + String::from_utf8_lossy(&pg_ctl.stdout), + String::from_utf8_lossy(&pg_ctl.stderr), + ); } Ok(()) } From ef72eb84cf7eebb78d76993c8d1d32ecffd0c12d Mon Sep 17 00:00:00 2001 From: bojanserafimov Date: Tue, 19 Apr 2022 09:46:47 -0400 Subject: [PATCH 115/296] Remove zenfixture (#1534) --- test_runner/fixtures/zenith_fixtures.py | 35 ++++++++++--------------- 1 file changed, 14 insertions(+), 21 deletions(-) diff --git a/test_runner/fixtures/zenith_fixtures.py b/test_runner/fixtures/zenith_fixtures.py index e0f08a3bfb..8dfe219966 100644 --- a/test_runner/fixtures/zenith_fixtures.py +++ b/test_runner/fixtures/zenith_fixtures.py @@ -40,8 +40,8 @@ from fixtures.log_helper import log This file contains pytest fixtures. A fixture is a test resource that can be summoned by placing its name in the test's arguments. 
-A fixture is created with the decorator @zenfixture, which is a wrapper around -the standard pytest.fixture with some extra behavior. +A fixture is created with the @pytest.fixture decorator. +See docs: https://docs.pytest.org/en/6.2.x/fixture.html There are several environment variables that can control the running of tests: ZENITH_BIN, POSTGRES_DISTRIB_DIR, etc. See README.md for more information. @@ -155,25 +155,18 @@ def pytest_configure(config): raise Exception('zenith binaries not found at "{}"'.format(zenith_binpath)) -def zenfixture(func: Fn) -> Fn: +def shareable_scope(fixture_name, config) -> Literal["session", "function"]: + """Return either session or function scope, depending on TEST_SHARED_FIXTURES envvar. + + This function can be used as a scope like this: + @pytest.fixture(scope=shareable_scope) + def myfixture(...) + ... """ - This is a python decorator for fixtures with a flexible scope. - - By default every test function will set up and tear down a new - database. In pytest, this is called fixtures "function" scope. - - If the environment variable TEST_SHARED_FIXTURES is set, then all - tests will share the same database. State, logs, etc. will be - stored in a directory called "shared".
- """ - - scope: Literal['session', 'function'] = \ - 'function' if os.environ.get('TEST_SHARED_FIXTURES') is None else 'session' - - return pytest.fixture(func, scope=scope) + return 'function' if os.environ.get('TEST_SHARED_FIXTURES') is None else 'session' -@zenfixture +@pytest.fixture(scope=shareable_scope) def worker_seq_no(worker_id: str): # worker_id is a pytest-xdist fixture # it can be master or gw @@ -184,7 +177,7 @@ def worker_seq_no(worker_id: str): return int(worker_id[2:]) -@zenfixture +@pytest.fixture(scope=shareable_scope) def worker_base_port(worker_seq_no: int): # so we divide ports in ranges of 100 ports # so workers have disjoint set of ports for services @@ -237,7 +230,7 @@ class PortDistributor: 'port range configured for test is exhausted, consider enlarging the range') -@zenfixture +@pytest.fixture(scope=shareable_scope) def port_distributor(worker_base_port): return PortDistributor(base_port=worker_base_port, port_number=WORKER_PORT_NUM) @@ -622,7 +615,7 @@ class ZenithEnv: return AuthKeys(pub=pub, priv=priv) -@zenfixture +@pytest.fixture(scope=shareable_scope) def _shared_simple_env(request: Any, port_distributor) -> Iterator[ZenithEnv]: """ Internal fixture backing the `zenith_simple_env` fixture. 
If TEST_SHARED_FIXTURES From 44bfc529f668fcb4fe79c521e6970382803d1178 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Tue, 19 Apr 2022 22:06:02 +0300 Subject: [PATCH 116/296] Require specifying the upload size in remote storage --- pageserver/src/remote_storage.rs | 3 ++ pageserver/src/remote_storage/local_fs.rs | 32 ++++++++++------------ pageserver/src/remote_storage/s3_bucket.rs | 6 +++- 3 files changed, 22 insertions(+), 19 deletions(-) diff --git a/pageserver/src/remote_storage.rs b/pageserver/src/remote_storage.rs index aebd74af5a..8167830347 100644 --- a/pageserver/src/remote_storage.rs +++ b/pageserver/src/remote_storage.rs @@ -324,6 +324,9 @@ trait RemoteStorage: Send + Sync { async fn upload( &self, from: impl io::AsyncRead + Unpin + Send + Sync + 'static, + /// S3 PUT request requires the content length to be specified, + /// otherwise it starts to fail with the concurrent connection count increasing. + from_size_kb: usize, to: &Self::StoragePath, metadata: Option, ) -> anyhow::Result<()>; diff --git a/pageserver/src/remote_storage/local_fs.rs b/pageserver/src/remote_storage/local_fs.rs index b40089d53c..15c69beebb 100644 --- a/pageserver/src/remote_storage/local_fs.rs +++ b/pageserver/src/remote_storage/local_fs.rs @@ -104,7 +104,8 @@ impl RemoteStorage for LocalFs { async fn upload( &self, - mut from: impl io::AsyncRead + Unpin + Send + Sync + 'static, + from: impl io::AsyncRead + Unpin + Send + Sync + 'static, + from_size_kb: usize, to: &Self::StoragePath, metadata: Option, ) -> anyhow::Result<()> { @@ -128,7 +129,7 @@ impl RemoteStorage for LocalFs { })?, ); - io::copy(&mut from, &mut destination) + io::copy(&mut from.take(from_size_kb as u64), &mut destination) .await .with_context(|| { format!( @@ -509,13 +510,13 @@ mod fs_tests { let repo_harness = RepoHarness::create("upload_file")?; let storage = create_storage()?; - let source = create_file_for_upload( + let (file, size) = create_file_for_upload( 
&storage.pageserver_workdir.join("whatever"), "whatever_contents", ) .await?; let target_path = PathBuf::from("/").join("somewhere").join("else"); - match storage.upload(source, &target_path, None).await { + match storage.upload(file, size, &target_path, None).await { Ok(()) => panic!("Should not allow storing files with wrong target path"), Err(e) => { let message = format!("{:?}", e); @@ -800,24 +801,17 @@ mod fs_tests { let timeline_path = harness.timeline_path(&TIMELINE_ID); let relative_timeline_path = timeline_path.strip_prefix(&harness.conf.workdir)?; let storage_path = storage.root.join(relative_timeline_path).join(name); - storage - .upload( - create_file_for_upload( - &storage.pageserver_workdir.join(name), - &dummy_contents(name), - ) - .await?, - &storage_path, - metadata, - ) - .await?; + + let from_path = storage.pageserver_workdir.join(name); + let (file, size) = create_file_for_upload(&from_path, &dummy_contents(name)).await?; + storage.upload(file, size, &storage_path, metadata).await?; Ok(storage_path) } async fn create_file_for_upload( path: &Path, contents: &str, - ) -> anyhow::Result> { + ) -> anyhow::Result<(io::BufReader, usize)> { std::fs::create_dir_all(path.parent().unwrap())?; let mut file_for_writing = std::fs::OpenOptions::new() .write(true) @@ -825,8 +819,10 @@ mod fs_tests { .open(path)?; write!(file_for_writing, "{}", contents)?; drop(file_for_writing); - Ok(io::BufReader::new( - fs::OpenOptions::new().read(true).open(&path).await?, + let file_size = path.metadata()?.len() as usize; + Ok(( + io::BufReader::new(fs::OpenOptions::new().read(true).open(&path).await?), + file_size, )) } diff --git a/pageserver/src/remote_storage/s3_bucket.rs b/pageserver/src/remote_storage/s3_bucket.rs index bfd28168f4..b99fa478c4 100644 --- a/pageserver/src/remote_storage/s3_bucket.rs +++ b/pageserver/src/remote_storage/s3_bucket.rs @@ -180,12 +180,16 @@ impl RemoteStorage for S3Bucket { async fn upload( &self, from: impl io::AsyncRead + Unpin + Send + 
Sync + 'static, + from_size_kb: usize, to: &Self::StoragePath, metadata: Option, ) -> anyhow::Result<()> { self.client .put_object(PutObjectRequest { - body: Some(StreamingBody::new(ReaderStream::new(from))), + body: Some(StreamingBody::new_with_size( + ReaderStream::new(from), + from_size_kb, + )), bucket: self.bucket_name.clone(), key: to.key().to_owned(), metadata: metadata.map(|m| m.0), From 3e6087a12f26ebefe6b91ea78be5d927c72b2a48 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Fri, 8 Apr 2022 20:17:37 +0300 Subject: [PATCH 117/296] Remove S3 archiving --- Cargo.lock | 44 - pageserver/Cargo.toml | 1 - pageserver/src/bin/pageserver.rs | 2 +- pageserver/src/bin/pageserver_zst.rs | 334 ---- pageserver/src/http/openapi_spec.yml | 1 + pageserver/src/http/routes.rs | 154 +- pageserver/src/layered_repository.rs | 7 +- pageserver/src/remote_storage.rs | 75 +- pageserver/src/remote_storage/README.md | 52 - pageserver/src/remote_storage/local_fs.rs | 21 +- pageserver/src/remote_storage/s3_bucket.rs | 14 +- pageserver/src/remote_storage/storage_sync.rs | 1766 +++++++++++------ .../storage_sync/compression.rs | 612 ------ .../remote_storage/storage_sync/download.rs | 591 +++--- .../src/remote_storage/storage_sync/index.rs | 657 +++--- .../src/remote_storage/storage_sync/upload.rs | 810 ++++---- pageserver/src/repository.rs | 2 - pageserver/src/tenant_mgr.rs | 4 +- pageserver/src/timelines.rs | 4 +- pageserver/src/walreceiver.rs | 2 +- .../batch_others/test_remote_storage.py | 45 +- zenith/src/main.rs | 4 +- 22 files changed, 2360 insertions(+), 2842 deletions(-) delete mode 100644 pageserver/src/bin/pageserver_zst.rs delete mode 100644 pageserver/src/remote_storage/README.md delete mode 100644 pageserver/src/remote_storage/storage_sync/compression.rs diff --git a/Cargo.lock b/Cargo.lock index a933b44356..3480f120e0 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -55,20 +55,6 @@ dependencies = [ "backtrace", ] -[[package]] -name = "async-compression" -version = "0.3.12" 
-source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f2bf394cfbbe876f0ac67b13b6ca819f9c9f2fb9ec67223cceb1555fbab1c31a" -dependencies = [ - "futures-core", - "memchr", - "pin-project-lite", - "tokio", - "zstd", - "zstd-safe", -] - [[package]] name = "async-stream" version = "0.3.3" @@ -1508,7 +1494,6 @@ name = "pageserver" version = "0.1.0" dependencies = [ "anyhow", - "async-compression", "async-trait", "byteorder", "bytes", @@ -3428,32 +3413,3 @@ name = "zeroize" version = "1.5.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "7c88870063c39ee00ec285a2f8d6a966e5b6fb2becc4e8dac77ed0d370ed6006" - -[[package]] -name = "zstd" -version = "0.10.0+zstd.1.5.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3b1365becbe415f3f0fcd024e2f7b45bacfb5bdd055f0dc113571394114e7bdd" -dependencies = [ - "zstd-safe", -] - -[[package]] -name = "zstd-safe" -version = "4.1.4+zstd.1.5.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2f7cd17c9af1a4d6c24beb1cc54b17e2ef7b593dc92f19e9d9acad8b182bbaee" -dependencies = [ - "libc", - "zstd-sys", -] - -[[package]] -name = "zstd-sys" -version = "1.6.3+zstd.1.5.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "fc49afa5c8d634e75761feda8c592051e7eeb4683ba827211eb0d731d3402ea8" -dependencies = [ - "cc", - "libc", -] diff --git a/pageserver/Cargo.toml b/pageserver/Cargo.toml index 3825795059..1a533af95f 100644 --- a/pageserver/Cargo.toml +++ b/pageserver/Cargo.toml @@ -46,7 +46,6 @@ fail = "0.5.0" rusoto_core = "0.47" rusoto_s3 = "0.47" async-trait = "0.1" -async-compression = {version = "0.3", features = ["zstd", "tokio"]} postgres_ffi = { path = "../postgres_ffi" } zenith_metrics = { path = "../zenith_metrics" } diff --git a/pageserver/src/bin/pageserver.rs b/pageserver/src/bin/pageserver.rs index 0af96cff66..1610a26239 100644 --- a/pageserver/src/bin/pageserver.rs +++ 
b/pageserver/src/bin/pageserver.rs @@ -293,7 +293,7 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<() "http_endpoint_thread", false, move || { - let router = http::make_router(conf, auth_cloned, remote_index); + let router = http::make_router(conf, auth_cloned, remote_index)?; endpoint::serve_thread_main(router, http_listener, thread_mgr::shutdown_watcher()) }, )?; diff --git a/pageserver/src/bin/pageserver_zst.rs b/pageserver/src/bin/pageserver_zst.rs deleted file mode 100644 index 5b8f8cc3c6..0000000000 --- a/pageserver/src/bin/pageserver_zst.rs +++ /dev/null @@ -1,334 +0,0 @@ -//! A CLI helper to deal with remote storage (S3, usually) blobs as archives. -//! See [`compression`] for more details about the archives. - -use std::{collections::BTreeSet, path::Path}; - -use anyhow::{bail, ensure, Context}; -use clap::{App, Arg}; -use pageserver::{ - layered_repository::metadata::{TimelineMetadata, METADATA_FILE_NAME}, - remote_storage::compression, -}; -use tokio::{fs, io}; -use zenith_utils::GIT_VERSION; - -const LIST_SUBCOMMAND: &str = "list"; -const ARCHIVE_ARG_NAME: &str = "archive"; - -const EXTRACT_SUBCOMMAND: &str = "extract"; -const TARGET_DIRECTORY_ARG_NAME: &str = "target_directory"; - -const CREATE_SUBCOMMAND: &str = "create"; -const SOURCE_DIRECTORY_ARG_NAME: &str = "source_directory"; - -#[tokio::main(flavor = "current_thread")] -async fn main() -> anyhow::Result<()> { - let arg_matches = App::new("pageserver zst blob [un]compressor utility") - .version(GIT_VERSION) - .subcommands(vec![ - App::new(LIST_SUBCOMMAND) - .about("List the archive contents") - .arg( - Arg::new(ARCHIVE_ARG_NAME) - .required(true) - .takes_value(true) - .help("An archive to list the contents of"), - ), - App::new(EXTRACT_SUBCOMMAND) - .about("Extracts the archive into the directory") - .arg( - Arg::new(ARCHIVE_ARG_NAME) - .required(true) - .takes_value(true) - .help("An archive to extract"), - ) - .arg( - Arg::new(TARGET_DIRECTORY_ARG_NAME) - 
.required(false) - .takes_value(true) - .help("A directory to extract the archive into. Optional, will use the current directory if not specified"), - ), - App::new(CREATE_SUBCOMMAND) - .about("Creates an archive with the contents of a directory (only the first level files are taken, metadata file has to be present in the same directory)") - .arg( - Arg::new(SOURCE_DIRECTORY_ARG_NAME) - .required(true) - .takes_value(true) - .help("A directory to use for creating the archive"), - ) - .arg( - Arg::new(TARGET_DIRECTORY_ARG_NAME) - .required(false) - .takes_value(true) - .help("A directory to create the archive in. Optional, will use the current directory if not specified"), - ), - ]) - .get_matches(); - - let subcommand_name = match arg_matches.subcommand_name() { - Some(name) => name, - None => bail!("No subcommand specified"), - }; - - let subcommand_matches = match arg_matches.subcommand_matches(subcommand_name) { - Some(matches) => matches, - None => bail!( - "No subcommand arguments were recognized for subcommand '{}'", - subcommand_name - ), - }; - - let target_dir = Path::new( - subcommand_matches - .value_of(TARGET_DIRECTORY_ARG_NAME) - .unwrap_or("./"), - ); - - match subcommand_name { - LIST_SUBCOMMAND => { - let archive = match subcommand_matches.value_of(ARCHIVE_ARG_NAME) { - Some(archive) => Path::new(archive), - None => bail!("No '{}' argument is specified", ARCHIVE_ARG_NAME), - }; - list_archive(archive).await - } - EXTRACT_SUBCOMMAND => { - let archive = match subcommand_matches.value_of(ARCHIVE_ARG_NAME) { - Some(archive) => Path::new(archive), - None => bail!("No '{}' argument is specified", ARCHIVE_ARG_NAME), - }; - extract_archive(archive, target_dir).await - } - CREATE_SUBCOMMAND => { - let source_dir = match subcommand_matches.value_of(SOURCE_DIRECTORY_ARG_NAME) { - Some(source) => Path::new(source), - None => bail!("No '{}' argument is specified", SOURCE_DIRECTORY_ARG_NAME), - }; - create_archive(source_dir, target_dir).await - } - unknown => 
bail!("Unknown subcommand {}", unknown), - } -} - -async fn list_archive(archive: &Path) -> anyhow::Result<()> { - let archive = archive.canonicalize().with_context(|| { - format!( - "Failed to get the absolute path for the archive path '{}'", - archive.display() - ) - })?; - ensure!( - archive.is_file(), - "Path '{}' is not an archive file", - archive.display() - ); - println!("Listing an archive at path '{}'", archive.display()); - let archive_name = match archive.file_name().and_then(|name| name.to_str()) { - Some(name) => name, - None => bail!( - "Failed to get the archive name from the path '{}'", - archive.display() - ), - }; - - let archive_bytes = fs::read(&archive) - .await - .context("Failed to read the archive bytes")?; - - let header = compression::read_archive_header(archive_name, &mut archive_bytes.as_slice()) - .await - .context("Failed to read the archive header")?; - - let empty_path = Path::new(""); - println!("-------------------------------"); - - let longest_path_in_archive = header - .files - .iter() - .filter_map(|file| Some(file.subpath.as_path(empty_path).to_str()?.len())) - .max() - .unwrap_or_default() - .max(METADATA_FILE_NAME.len()); - - for regular_file in &header.files { - println!( - "File: {:width$} uncompressed size: {} bytes", - regular_file.subpath.as_path(empty_path).display(), - regular_file.size, - width = longest_path_in_archive, - ) - } - println!( - "File: {:width$} uncompressed size: {} bytes", - METADATA_FILE_NAME, - header.metadata_file_size, - width = longest_path_in_archive, - ); - println!("-------------------------------"); - - Ok(()) -} - -async fn extract_archive(archive: &Path, target_dir: &Path) -> anyhow::Result<()> { - let archive = archive.canonicalize().with_context(|| { - format!( - "Failed to get the absolute path for the archive path '{}'", - archive.display() - ) - })?; - ensure!( - archive.is_file(), - "Path '{}' is not an archive file", - archive.display() - ); - let archive_name = match 
archive.file_name().and_then(|name| name.to_str()) { - Some(name) => name, - None => bail!( - "Failed to get the archive name from the path '{}'", - archive.display() - ), - }; - - if !target_dir.exists() { - fs::create_dir_all(target_dir).await.with_context(|| { - format!( - "Failed to create the target dir at path '{}'", - target_dir.display() - ) - })?; - } - let target_dir = target_dir.canonicalize().with_context(|| { - format!( - "Failed to get the absolute path for the target dir path '{}'", - target_dir.display() - ) - })?; - ensure!( - target_dir.is_dir(), - "Path '{}' is not a directory", - target_dir.display() - ); - let mut dir_contents = fs::read_dir(&target_dir) - .await - .context("Failed to list the target directory contents")?; - let dir_entry = dir_contents - .next_entry() - .await - .context("Failed to list the target directory contents")?; - ensure!( - dir_entry.is_none(), - "Target directory '{}' is not empty", - target_dir.display() - ); - - println!( - "Extracting an archive at path '{}' into directory '{}'", - archive.display(), - target_dir.display() - ); - - let mut archive_file = fs::File::open(&archive).await.with_context(|| { - format!( - "Failed to get the archive name from the path '{}'", - archive.display() - ) - })?; - let header = compression::read_archive_header(archive_name, &mut archive_file) - .await - .context("Failed to read the archive header")?; - compression::uncompress_with_header(&BTreeSet::new(), &target_dir, header, &mut archive_file) - .await - .context("Failed to extract the archive") -} - -async fn create_archive(source_dir: &Path, target_dir: &Path) -> anyhow::Result<()> { - let source_dir = source_dir.canonicalize().with_context(|| { - format!( - "Failed to get the absolute path for the source dir path '{}'", - source_dir.display() - ) - })?; - ensure!( - source_dir.is_dir(), - "Path '{}' is not a directory", - source_dir.display() - ); - - if !target_dir.exists() { - 
fs::create_dir_all(target_dir).await.with_context(|| { - format!( - "Failed to create the target dir at path '{}'", - target_dir.display() - ) - })?; - } - let target_dir = target_dir.canonicalize().with_context(|| { - format!( - "Failed to get the absolute path for the target dir path '{}'", - target_dir.display() - ) - })?; - ensure!( - target_dir.is_dir(), - "Path '{}' is not a directory", - target_dir.display() - ); - - println!( - "Compressing directory '{}' and creating resulting archive in directory '{}'", - source_dir.display(), - target_dir.display() - ); - - let mut metadata_file_contents = None; - let mut files_co_archive = Vec::new(); - - let mut source_dir_contents = fs::read_dir(&source_dir) - .await - .context("Failed to read the source directory contents")?; - - while let Some(source_dir_entry) = source_dir_contents - .next_entry() - .await - .context("Failed to read a source dir entry")? - { - let entry_path = source_dir_entry.path(); - if entry_path.is_file() { - if entry_path.file_name().and_then(|name| name.to_str()) == Some(METADATA_FILE_NAME) { - let metadata_bytes = fs::read(entry_path) - .await - .context("Failed to read metata file bytes in the source dir")?; - metadata_file_contents = Some( - TimelineMetadata::from_bytes(&metadata_bytes) - .context("Failed to parse metata file contents in the source dir")?, - ); - } else { - files_co_archive.push(entry_path); - } - } - } - - let metadata = match metadata_file_contents { - Some(metadata) => metadata, - None => bail!( - "No metadata file found in the source dir '{}', cannot create the archive", - source_dir.display() - ), - }; - - let _ = compression::archive_files_as_stream( - &source_dir, - files_co_archive.iter(), - &metadata, - move |mut archive_streamer, archive_name| async move { - let archive_target = target_dir.join(&archive_name); - let mut archive_file = fs::File::create(&archive_target).await?; - io::copy(&mut archive_streamer, &mut archive_file).await?; - Ok(archive_target) - }, 
- ) - .await - .context("Failed to create an archive")?; - - Ok(()) -} diff --git a/pageserver/src/http/openapi_spec.yml b/pageserver/src/http/openapi_spec.yml index b2760efe85..c0b07418f3 100644 --- a/pageserver/src/http/openapi_spec.yml +++ b/pageserver/src/http/openapi_spec.yml @@ -409,6 +409,7 @@ components: type: object required: - awaits_download + - remote_consistent_lsn properties: awaits_download: type: boolean diff --git a/pageserver/src/http/routes.rs b/pageserver/src/http/routes.rs index a0d6e922a1..f49b1d7ba3 100644 --- a/pageserver/src/http/routes.rs +++ b/pageserver/src/http/routes.rs @@ -1,6 +1,6 @@ use std::sync::Arc; -use anyhow::Result; +use anyhow::{Context, Result}; use hyper::StatusCode; use hyper::{Body, Request, Response, Uri}; use tracing::*; @@ -21,7 +21,10 @@ use zenith_utils::zid::{ZTenantTimelineId, ZTimelineId}; use super::models::{ StatusResponse, TenantCreateRequest, TenantCreateResponse, TimelineCreateRequest, }; -use crate::remote_storage::{schedule_timeline_download, RemoteIndex}; +use crate::config::RemoteStorageKind; +use crate::remote_storage::{ + download_index_part, schedule_timeline_download, LocalFs, RemoteIndex, RemoteTimeline, S3Bucket, +}; use crate::repository::Repository; use crate::timelines::{LocalTimelineInfo, RemoteTimelineInfo, TimelineInfo}; use crate::{config::PageServerConf, tenant_mgr, timelines, ZTenantId}; @@ -31,6 +34,12 @@ struct State { auth: Option>, remote_index: RemoteIndex, allowlist_routes: Vec, + remote_storage: Option, +} + +enum GenericRemoteStorage { + Local(LocalFs), + S3(S3Bucket), } impl State { @@ -38,17 +47,34 @@ impl State { conf: &'static PageServerConf, auth: Option>, remote_index: RemoteIndex, - ) -> Self { + ) -> anyhow::Result { let allowlist_routes = ["/v1/status", "/v1/doc", "/swagger.yml"] .iter() .map(|v| v.parse().unwrap()) .collect::>(); - Self { + // Note that this remote storage is created separately from the main one in the sync_loop. 
+ // It's fine since it's stateless and some code duplication saves us from bloating the code around with generics. + let remote_storage = conf + .remote_storage_config + .as_ref() + .map(|storage_config| match &storage_config.storage { + RemoteStorageKind::LocalFs(root) => { + LocalFs::new(root.clone(), &conf.workdir).map(GenericRemoteStorage::Local) + } + RemoteStorageKind::AwsS3(s3_config) => { + S3Bucket::new(s3_config, &conf.workdir).map(GenericRemoteStorage::S3) + } + }) + .transpose() + .context("Failed to init generic remote storage")?; + + Ok(Self { conf, auth, allowlist_routes, remote_index, - } + remote_storage, + }) } } @@ -122,8 +148,8 @@ async fn timeline_list_handler(request: Request) -> Result, timeline_id, }) .map(|remote_entry| RemoteTimelineInfo { - remote_consistent_lsn: remote_entry.disk_consistent_lsn(), - awaits_download: remote_entry.get_awaits_download(), + remote_consistent_lsn: remote_entry.metadata.disk_consistent_lsn(), + awaits_download: remote_entry.awaits_download, }), }) } @@ -184,8 +210,8 @@ async fn timeline_detail_handler(request: Request) -> Result) -> Result { + tokio::fs::create_dir_all(state.conf.timeline_path(&timeline_id, &tenant_id)) + .await + .context("Failed to create new timeline directory")?; + new_timeline.awaits_download = true; + new_timeline + } + Ok(None) => return Err(ApiError::NotFound("Unknown remote timeline".to_string())), + Err(e) => { + error!("Failed to retrieve remote timeline data: {:?}", e); + return Err(ApiError::NotFound( + "Failed to retrieve remote timeline".to_string(), + )); + } + }; + let mut index_accessor = remote_index.write().await; + match index_accessor.timeline_entry_mut(&sync_id) { + Some(remote_timeline) => { + if remote_timeline.awaits_download { + return Err(ApiError::Conflict( + "Timeline download is already in progress".to_string(), + )); + } + remote_timeline.awaits_download = true; + } + None => index_accessor.add_timeline_entry(sync_id, new_timeline), + } + 
schedule_timeline_download(tenant_id, timeline_id); json_response(StatusCode::ACCEPTED, ()) } +async fn try_download_shard_data( + state: &State, + sync_id: ZTenantTimelineId, +) -> anyhow::Result> { + let shard = match state.remote_storage.as_ref() { + Some(GenericRemoteStorage::Local(local_storage)) => { + download_index_part(state.conf, local_storage, sync_id).await + } + Some(GenericRemoteStorage::S3(s3_storage)) => { + download_index_part(state.conf, s3_storage, sync_id).await + } + None => return Ok(None), + } + .with_context(|| format!("Failed to download index shard for timeline {}", sync_id))?; + + let timeline_path = state + .conf + .timeline_path(&sync_id.timeline_id, &sync_id.tenant_id); + RemoteTimeline::from_index_part(&timeline_path, shard) + .map(Some) + .with_context(|| { + format!( + "Failed to convert index shard into remote timeline for timeline {}", + sync_id + ) + }) +} + async fn timeline_detach_handler(request: Request) -> Result, ApiError> { let tenant_id: ZTenantId = parse_request_param(&request, "tenant_id")?; check_permission(&request, Some(tenant_id))?; @@ -317,7 +407,7 @@ pub fn make_router( conf: &'static PageServerConf, auth: Option>, remote_index: RemoteIndex, -) -> RouterBuilder { +) -> anyhow::Result> { let spec = include_bytes!("openapi_spec.yml"); let mut router = attach_openapi_ui(endpoint::make_router(), spec, "/swagger.yml", "/v1/doc"); if auth.is_some() { @@ -331,8 +421,10 @@ pub fn make_router( })) } - router - .data(Arc::new(State::new(conf, auth, remote_index))) + Ok(router + .data(Arc::new( + State::new(conf, auth, remote_index).context("Failed to initialize router state")?, + )) .get("/v1/status", status_handler) .get("/v1/tenant", tenant_list_handler) .post("/v1/tenant", tenant_create_handler) @@ -350,5 +442,5 @@ pub fn make_router( "/v1/tenant/:tenant_id/timeline/:timeline_id/detach", timeline_detach_handler, ) - .any(handler_404) + .any(handler_404)) } diff --git a/pageserver/src/layered_repository.rs 
b/pageserver/src/layered_repository.rs index 36b081e400..6769c9cfbc 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -387,8 +387,6 @@ impl Repository for LayeredRepository { timeline_id, timeline_sync_status_update ); match timeline_sync_status_update { - TimelineSyncStatusUpdate::Uploaded => { /* nothing to do, remote consistent lsn is managed by the remote storage */ - } TimelineSyncStatusUpdate::Downloaded => { match self.timelines.lock().unwrap().entry(timeline_id) { Entry::Occupied(_) => bail!("We completed a download for a timeline that already exists in repository. This is a bug."), @@ -650,7 +648,8 @@ impl LayeredRepository { checkpoint_before_gc: bool, ) -> Result { let _span_guard = - info_span!("gc iteration", tenant = %self.tenantid, timeline = ?target_timelineid); + info_span!("gc iteration", tenant = %self.tenantid, timeline = ?target_timelineid) + .entered(); let mut totals: GcResult = Default::default(); let now = Instant::now(); @@ -1548,7 +1547,7 @@ impl LayeredTimeline { schedule_timeline_checkpoint_upload( self.tenantid, self.timelineid, - vec![new_delta_path], + new_delta_path, metadata, ); } diff --git a/pageserver/src/remote_storage.rs b/pageserver/src/remote_storage.rs index 8167830347..effc8dcdf4 100644 --- a/pageserver/src/remote_storage.rs +++ b/pageserver/src/remote_storage.rs @@ -9,7 +9,6 @@ //! //! * synchronization logic at [`storage_sync`] module that keeps pageserver state (both runtime one and the workdir files) and storage state in sync. //! Synchronization internals are split into submodules -//! * [`storage_sync::compression`] for a custom remote storage format used to store timeline files in archives //! * [`storage_sync::index`] to keep track of remote tenant files, the metadata and their mappings to local files //! * [`storage_sync::upload`] and [`storage_sync::download`] to manage archive creation and upload; download and extraction, respectively //! @@ -54,25 +53,32 @@ //! 
The checkpoint uploads are disabled, if no remote storage configuration is provided (no sync loop is started this way either). //! See [`crate::layered_repository`] for the upload calls and the adjacent logic. //! -//! Synchronization logic is able to communicate back with updated timeline sync states, [`TimelineSyncState`], -//! submitted via [`crate::tenant_mgr::set_timeline_states`] function. Tenant manager applies corresponding timeline updates in pageserver's in-memory state. +//! Synchronization logic is able to communicate back with updated timeline sync states, [`crate::repository::TimelineSyncStatusUpdate`], +//! submitted via [`crate::tenant_mgr::apply_timeline_sync_status_updates`] function. Tenant manager applies corresponding timeline updates in pageserver's in-memory state. //! Such submissions happen in two cases: //! * once after the sync loop startup, to signal pageserver which timelines will be synchronized in the near future //! * after every loop step, in case a timeline needs to be reloaded or evicted from pageserver's memory //! -//! When the pageserver terminates, the upload loop finishes a current sync task (if any) and exits. +//! When the pageserver terminates, the sync loop finishes a current sync task (if any) and exits. //! -//! The storage logic considers `image` as a set of local files, fully representing a certain timeline at given moment (identified with `disk_consistent_lsn`). +//! The storage logic considers `image` as a set of local files (layers), fully representing a certain timeline at given moment (identified with `disk_consistent_lsn` from the corresponding `metadata` file). //! Timeline can change its state, by adding more files on disk and advancing its `disk_consistent_lsn`: this happens after pageserver checkpointing and is followed //! by the storage upload, if enabled. -//! Yet timeline cannot alter already existing files, and normally cannot remote those too: only a GC process is capable of removing unused files. +//! 
Yet timeline cannot alter already existing files, and cannot remove those too: only a GC process is capable of removing unused files. //! This way, remote storage synchronization relies on the fact that every checkpoint is incremental and local files are "immutable": //! * when a certain checkpoint gets uploaded, the sync loop remembers the fact, preventing further reuploads of the same state //! * no files are deleted from either local or remote storage, only the missing ones locally/remotely get downloaded/uploaded, local metadata file will be overwritten //! when the newer image is downloaded //! -//! To optimize S3 storage (and access), the sync loop compresses the checkpoint files before placing them to S3, and uncompresses them back, keeping track of timeline files and metadata. -//! Also, the remote file list is queried once only, at startup, to avoid possible extra costs and latency issues. +//! Pageserver maintains similar to the local file structure remotely: all layer files are uploaded with the same names under the same directory structure. +//! Yet instead of keeping the `metadata` file remotely, we wrap it with more data in [`IndexShard`], containing the list of remote files. +//! This file gets read to populate the cache, if the remote timeline data is missing from it and gets updated after every successful download. +//! This way, we optimize S3 storage access by not running the `S3 list` command that could be expensive and slow: knowing both [`ZTenantId`] and [`ZTimelineId`], +//! we can always reconstruct the path to the timeline, use this to get the same path on the remote storage and retrieve its shard contents, if needed, same as any layer files. +//! +//! By default, pageserver reads the remote storage index data only for timelines located locally, to synchronize those, if needed. +//! Bulk index data download happens only initially, on pageserver startup. The rest of the remote storage stays unknown to pageserver and loaded on demand only, +//! 
when a new timeline is scheduled for the download. //! //! NOTES: //! * pageserver assumes it has exclusive write access to the remote storage. If supported, the way multiple pageservers can be separated in the same storage @@ -86,7 +92,7 @@ mod s3_bucket; mod storage_sync; use std::{ - collections::HashMap, + collections::{HashMap, HashSet}, ffi, fs, path::{Path, PathBuf}, }; @@ -94,22 +100,36 @@ use std::{ use anyhow::{bail, Context}; use tokio::io; use tracing::{debug, error, info}; -use zenith_utils::zid::{ZTenantId, ZTenantTimelineId, ZTimelineId}; -pub use self::storage_sync::index::{RemoteIndex, TimelineIndexEntry}; -pub use self::storage_sync::{schedule_timeline_checkpoint_upload, schedule_timeline_download}; -use self::{local_fs::LocalFs, s3_bucket::S3Bucket}; -use crate::layered_repository::ephemeral_file::is_ephemeral_file; +pub use self::{ + local_fs::LocalFs, + s3_bucket::S3Bucket, + storage_sync::{ + download_index_part, + index::{IndexPart, RemoteIndex, RemoteTimeline}, + schedule_timeline_checkpoint_upload, schedule_timeline_download, + }, +}; use crate::{ config::{PageServerConf, RemoteStorageKind}, - layered_repository::metadata::{TimelineMetadata, METADATA_FILE_NAME}, + layered_repository::{ + ephemeral_file::is_ephemeral_file, + metadata::{TimelineMetadata, METADATA_FILE_NAME}, + }, }; +use zenith_utils::zid::{ZTenantId, ZTenantTimelineId, ZTimelineId}; -pub use storage_sync::compression; - +/// A timeline status to share with pageserver's sync counterpart, +/// after comparing local and remote timeline state. #[derive(Clone, Copy, Debug)] pub enum LocalTimelineInitStatus { + /// The timeline has every remote layer present locally. + /// There could be some layers requiring uploading, + /// but this does not block the timeline from any user interaction. LocallyComplete, + /// A timeline has some files remotely, that are not present locally and need downloading. 
+ /// Downloading might update timeline's metadata locally and current pageserver logic deals with local layers only, + /// so the data needs to be downloaded first before the timeline can be used. NeedsSync, } @@ -179,7 +199,7 @@ pub fn start_local_timeline_sync( fn local_tenant_timeline_files( config: &'static PageServerConf, -) -> anyhow::Result)>> { +) -> anyhow::Result)>> { let mut local_tenant_timeline_files = HashMap::new(); let tenants_dir = config.tenants_path(); for tenants_dir_entry in fs::read_dir(&tenants_dir) @@ -214,9 +234,8 @@ fn local_tenant_timeline_files( fn collect_timelines_for_tenant( config: &'static PageServerConf, tenant_path: &Path, -) -> anyhow::Result)>> { - let mut timelines: HashMap)> = - HashMap::new(); +) -> anyhow::Result)>> { + let mut timelines = HashMap::new(); let tenant_id = tenant_path .file_name() .and_then(ffi::OsStr::to_str) @@ -265,8 +284,8 @@ fn collect_timelines_for_tenant( // NOTE: ephemeral files are excluded from the list fn collect_timeline_files( timeline_dir: &Path, -) -> anyhow::Result<(ZTimelineId, TimelineMetadata, Vec)> { - let mut timeline_files = Vec::new(); +) -> anyhow::Result<(ZTimelineId, TimelineMetadata, HashSet)> { + let mut timeline_files = HashSet::new(); let mut timeline_metadata_path = None; let timeline_id = timeline_dir @@ -286,7 +305,7 @@ fn collect_timeline_files( debug!("skipping ephemeral file {}", entry_path.display()); continue; } else { - timeline_files.push(entry_path); + timeline_files.insert(entry_path); } } } @@ -307,7 +326,7 @@ fn collect_timeline_files( /// This storage tries to be unaware of any layered repository context, /// providing basic CRUD operations for storage files. #[async_trait::async_trait] -trait RemoteStorage: Send + Sync { +pub trait RemoteStorage: Send + Sync { /// A way to uniquely reference a file in the remote storage. 
type StoragePath; @@ -324,9 +343,9 @@ trait RemoteStorage: Send + Sync { async fn upload( &self, from: impl io::AsyncRead + Unpin + Send + Sync + 'static, - /// S3 PUT request requires the content length to be specified, - /// otherwise it starts to fail with the concurrent connection count increasing. - from_size_kb: usize, + // S3 PUT request requires the content length to be specified, + // otherwise it starts to fail with the concurrent connection count increasing. + from_size_bytes: usize, to: &Self::StoragePath, metadata: Option, ) -> anyhow::Result<()>; diff --git a/pageserver/src/remote_storage/README.md b/pageserver/src/remote_storage/README.md deleted file mode 100644 index 43a47e09d8..0000000000 --- a/pageserver/src/remote_storage/README.md +++ /dev/null @@ -1,52 +0,0 @@ -# Non-implementation details - -This document describes the current state of the backup system in pageserver, existing limitations and concerns, why some things are done the way they are the future development plans. -Detailed description on how the synchronization works and how it fits into the rest of the pageserver can be found in the [storage module](./../remote_storage.rs) and its submodules. -Ideally, this document should disappear after current implementation concerns are mitigated, with the remaining useful knowledge bits moved into rustdocs. - -## Approach - -Backup functionality is a new component, appeared way after the core DB functionality was implemented. -Pageserver layer functionality is also quite volatile at the moment, there's a risk its local file management changes over time. - -To avoid adding more chaos into that, backup functionality is currently designed as a relatively standalone component, with the majority of its logic placed in a standalone async loop. -This way, the backups are managed in background, not affecting directly other pageserver parts: this way the backup and restoration process may lag behind, but eventually keep up with the reality. 
To track that, a set of prometheus metrics is exposed from pageserver. - -## What's done - -Current implementation -* provides remote storage wrappers for AWS S3 and local FS -* synchronizes the differences with local timelines and remote states as fast as possible -* uploads new layer files -* downloads and registers timelines, found on the remote storage, but missing locally, if those are requested somehow via pageserver (e.g. http api, gc) -* uses compression when deals with files, for better S3 usage -* maintains an index of what's stored remotely -* evicts failing tasks and stops the corresponding timelines - -The tasks are delayed with every retry and the retries are capped, to avoid poisonous tasks. -After any task eviction, or any error at startup checks (e.g. obviously different and wrong local and remote states fot the same timeline), -the timeline has to be stopped from submitting further checkpoint upload tasks, which is done along the corresponding timeline status change. - -No good optimisations or performance testing is done, the feature is disabled by default and gets polished over time. -It's planned to deal with all questions that are currently on and prepare the feature to be enabled by default in cloud environments. - -### Peculiarities - -As mentioned, the backup component is rather new and under development currently, so not all things are done properly from the start. -Here's the list of known compromises with comments: - -* Remote storage file model is currently a custom archive format, that's not possible to deserialize without a particular Rust code of ours (including `serde`). -We also don't optimize the archivation and pack every timeline checkpoint separately, so the resulting blob's size that gets on S3 could be arbitrary. -But, it's a single blob, which is way better than storing ~780 small files separately. - -* Archive index restoration requires reading every blob's head. 
-This could be avoided by a background thread/future storing the serialized index in the remote storage. - -* no proper file comparison - -No file checksum assertion is done currently, but should be (AWS S3 returns file checksums during the `list` operation) - -* gc is ignored - -So far, we don't adjust the remote storage based on GC thread loop results, only checkpointer loop affects the remote storage. -Index module could be used as a base to implement a deferred GC mechanism, a "defragmentation" that repacks archives into new ones after GC is done removing the files from the archives. diff --git a/pageserver/src/remote_storage/local_fs.rs b/pageserver/src/remote_storage/local_fs.rs index 15c69beebb..952b2e69fe 100644 --- a/pageserver/src/remote_storage/local_fs.rs +++ b/pageserver/src/remote_storage/local_fs.rs @@ -105,7 +105,7 @@ impl RemoteStorage for LocalFs { async fn upload( &self, from: impl io::AsyncRead + Unpin + Send + Sync + 'static, - from_size_kb: usize, + from_size_bytes: usize, to: &Self::StoragePath, metadata: Option, ) -> anyhow::Result<()> { @@ -129,7 +129,11 @@ impl RemoteStorage for LocalFs { })?, ); - io::copy(&mut from.take(from_size_kb as u64), &mut destination) + let from_size_bytes = from_size_bytes as u64; + // Read 1 byte more than expected, to check later that the stream and its size match. + let mut buffer_to_read = from.take(from_size_bytes + 1); + + let bytes_read = io::copy(&mut buffer_to_read, &mut destination) .await .with_context(|| { format!( @@ -138,6 +142,19 @@ impl RemoteStorage for LocalFs { ) })?; + ensure!( + bytes_read == from_size_bytes, + "Provided stream has actual size {} that is smaller than the given stream size {}", + bytes_read, + from_size_bytes + ); + + ensure!( + buffer_to_read.read(&mut [0]).await?
== 0, + "Provided stream has bigger size than the given stream size {}", + from_size_bytes + ); + destination.flush().await.with_context(|| { format!( "Failed to upload (flush temp) file to the local storage at '{}'", diff --git a/pageserver/src/remote_storage/s3_bucket.rs b/pageserver/src/remote_storage/s3_bucket.rs index b99fa478c4..b69634a1b6 100644 --- a/pageserver/src/remote_storage/s3_bucket.rs +++ b/pageserver/src/remote_storage/s3_bucket.rs @@ -17,7 +17,7 @@ use rusoto_s3::{ }; use tokio::io; use tokio_util::io::ReaderStream; -use tracing::{debug, trace}; +use tracing::debug; use crate::{ config::S3Config, @@ -70,10 +70,6 @@ pub struct S3Bucket { impl S3Bucket { /// Creates the S3 storage, errors if incorrect AWS S3 configuration provided. pub fn new(aws_config: &S3Config, pageserver_workdir: &'static Path) -> anyhow::Result { - // TODO kb check this - // Keeping a single client may cause issues due to timeouts. - // https://github.com/rusoto/rusoto/issues/1686 - debug!( "Creating s3 remote storage for S3 bucket {}", aws_config.bucket_name @@ -91,10 +87,10 @@ impl S3Bucket { let request_dispatcher = HttpClient::new().context("Failed to create S3 http client")?; let client = if aws_config.access_key_id.is_none() && aws_config.secret_access_key.is_none() { - trace!("Using IAM-based AWS access"); + debug!("Using IAM-based AWS access"); S3Client::new_with(request_dispatcher, InstanceMetadataProvider::new(), region) } else { - trace!("Using credentials-based AWS access"); + debug!("Using credentials-based AWS access"); S3Client::new_with( request_dispatcher, StaticProvider::new_minimal( @@ -180,7 +176,7 @@ impl RemoteStorage for S3Bucket { async fn upload( &self, from: impl io::AsyncRead + Unpin + Send + Sync + 'static, - from_size_kb: usize, + from_size_bytes: usize, to: &Self::StoragePath, metadata: Option, ) -> anyhow::Result<()> { @@ -188,7 +184,7 @@ impl RemoteStorage for S3Bucket { .put_object(PutObjectRequest { body: Some(StreamingBody::new_with_size( 
ReaderStream::new(from), - from_size_kb, + from_size_bytes, )), bucket: self.bucket_name.clone(), key: to.key().to_owned(), diff --git a/pageserver/src/remote_storage/storage_sync.rs b/pageserver/src/remote_storage/storage_sync.rs index 50a260491b..6ba55372c2 100644 --- a/pageserver/src/remote_storage/storage_sync.rs +++ b/pageserver/src/remote_storage/storage_sync.rs @@ -9,34 +9,32 @@ //! The pair's shared buffer of a fixed size serves as an implicit queue, holding [`SyncTask`] for local files upload/download operations. //! //! The queue gets emptied by a single thread with the loop, that polls the tasks in batches of deduplicated tasks (size configurable). -//! Every task in a batch processed concurrently, which is possible due to incremental nature of the timelines: +//! A task from the batch corresponds to a single timeline, with its files to sync merged together. +//! Every batch task and layer file in the task is processed concurrently, which is possible due to incremental nature of the timelines: //! it's not asserted, but assumed that timeline's checkpoints only add the files locally, not removing or amending the existing ones. //! Only GC removes local timeline files, the GC support is not added to sync currently, //! yet downloading extra files is not critically bad at this stage, GC can remove those again. //! -//! During the loop startup, an initial [`RemoteTimelineIndex`] state is constructed via listing the remote storage contents. -//! It's enough to poll the remote state once on startup only, due to agreement that the pageserver has -//! an exclusive write access to the remote storage: new files appear in the storage only after the same -//! pageserver writes them. -//! It's important to do so, since storages like S3 can get slower and more expensive as the number of files grows. +//! During the loop startup, an initial [`RemoteTimelineIndex`] state is constructed via downloading and merging the index data for all timelines, +//! present locally. 
+//! It's enough to poll such timelines' remote state once on startup only, due to an agreement that only one pageserver at a time has exclusive +//! write access to the remote portion of timelines that are attached to the pageserver. //! The index state is used to issue initial sync tasks, if needed: //! * all timelines with local state behind the remote gets download tasks scheduled. -//! Such timelines are considered "remote" before the download succeeds, so a number of operations (gc, checkpoints) on that timeline are unavailable. -//! * all never local state gets scheduled for upload, such timelines are "local" and fully operational -//! * the rest of the remote timelines are reported to pageserver, but not downloaded before they are actually accessed in pageserver, -//! it may schedule the download on such occasions. +//! Such timelines are considered "remote" before the download succeeds, so a number of operations (gc, checkpoints) on that timeline are unavailable +//! before up-to-date layers and metadata file are downloaded locally. +//! * all newer local state gets scheduled for upload, such timelines are "local" and fully operational +//! * remote timelines not present locally are unknown to pageserver, but can be downloaded on a separate request +//! //! Then, the index is shared across pageserver under [`RemoteIndex`] guard to ensure proper synchronization. +//! The remote index gets updated after every remote storage change (after an upload), same as the index part files remotely. //! -//! The synchronization unit is an archive: a set of layer files and a special metadata file, all compressed into a blob. -//! Currently, there's no way to process an archive partially, if the archive processing fails, it has to be started from zero next time again. -//! An archive contains set of files of a certain timeline, added during checkpoint(s) and the timeline metadata at that moment. -//!
The archive contains that metadata's `disk_consistent_lsn` in its name, to be able to restore partial index information from just a remote storage file list. -//! The index is created at startup (possible due to exclusive ownership over the remote storage by the pageserver) and keeps track of which files were stored -//! in what remote archives. -//! Among other tasks, the index is used to prevent invalid uploads and non-existing downloads on demand. -//! Refer to [`compression`] and [`index`] for more details on the archives and index respectively. +//! A remote timeline contains a set of layer files, created during checkpoint(s), and the serialized [`IndexPart`] file with timeline metadata and all remote layer paths inside. +//! Those paths are used instead of the `S3 list` command, which is slow and expensive for a large number of files. +//! If the index part does not contain some file path but it's present remotely, such file is invisible to pageserver and ignored. +//! Among other tasks, the index is used to prevent invalid uploads and non-existing downloads on demand, refer to [`index`] for more details. //! -//! The list construction is currently the only place where the storage sync can return an [`Err`] to the user. +//! Index construction is currently the only place where the storage sync can return an [`Err`] to the user. //! New sync tasks are accepted via [`schedule_timeline_checkpoint_upload`] and [`schedule_timeline_download`] functions, //! disregarding of the corresponding loop startup. //! It's up to the caller to avoid synchronizations if the loop is disabled: otherwise, the sync tasks will be ignored. @@ -44,42 +42,39 @@ //! reschedule the same task, with possibly less files to sync: //! * download tasks currently never replace existing local file with metadata file as an exception //! (but this is a subject to change when checksum checks are implemented: all files could get overwritten on a checksum mismatch) -//!
* download tasks carry the information of skipped acrhives, so resubmissions are not downloading successfully processed archives again +//! * download tasks carry the information of skipped layers, so resubmissions are not downloading successfully processed layers again +//! * downloads do not contain any actual files to download, so that "external", sync pageserver code is able to schedule the timeline download +//! without accessing any extra information about its files. //! -//! Not every upload of the same timeline gets processed: if the checkpoint with the same `disk_consistent_lsn` was already uploaded, no reuploads happen, as checkpoints -//! are considered to be immutable. The order of `lsn` during upload submissions is allowed to be arbitrary and not required to be ascending. +//! Uploads and downloads sync layer files in arbitrary order, but only after all layer files are synced, the local metadata (for download) and remote index part (for upload) are updated, +//! to avoid having a corrupt state without the relevant layer files. //! Refer to [`upload`] and [`download`] for more details. //! -//! Current uploads are per-checkpoint and don't accumulate any data with optimal size for storing on S3. -//! The downloaded archives get processed sequentially, from smaller `disk_consistent_lsn` to larger, with metadata files being added as last. -//! The archive unpacking is designed to unpack metadata as the last file, so the risk of leaving the corrupt timeline due to uncompression error is small (while not eliminated entirely and that should be improved). -//! There's a reschedule threshold that evicts tasks that fail too much and stops the corresponding timeline so it does not diverge from the state on the remote storage. -//! Among other pageserver-specific changes to such evicted timelines, no uploads are expected to come from them to ensure the remote storage state does not get corrupted. -//! -//!
Synchronization never removes any local from pageserver workdir or remote files from the remote storage, yet there could be overwrites of the same files (metadata file updates; future checksum mismatch fixes). +//! Synchronization never removes any local files from pageserver workdir or remote files from the remote storage, yet there could be overwrites of the same files (index part and metadata file updates, future checksum mismatch fixes). //! NOTE: No real contents or checksum check happens right now and is a subject to improve later. //! //! After the whole timeline is downloaded, [`crate::tenant_mgr::apply_timeline_sync_status_updates`] function is used to update pageserver memory stage for the timeline processed. //! //! When pageserver signals shutdown, current sync task gets finished and the loop exists. -/// Expose the module for a binary CLI tool that deals with the corresponding blobs. -pub mod compression; mod download; pub mod index; mod upload; use std::{ - collections::{BTreeSet, HashMap, VecDeque}, + collections::{hash_map, HashMap, HashSet, VecDeque}, + fmt::Debug, num::{NonZeroU32, NonZeroUsize}, + ops::ControlFlow, path::{Path, PathBuf}, sync::Arc, }; -use anyhow::{bail, Context}; +use anyhow::Context; use futures::stream::{FuturesUnordered, StreamExt}; use lazy_static::lazy_static; use tokio::{ + fs, runtime::Runtime, sync::mpsc::{self, UnboundedReceiver}, time::{Duration, Instant}, @@ -87,23 +82,21 @@ use tokio::{ use tracing::*; use self::{ - compression::ArchiveHeader, - download::{download_timeline, DownloadedTimeline}, - index::{ - ArchiveDescription, ArchiveId, RemoteIndex, RemoteTimeline, RemoteTimelineIndex, - TimelineIndexEntry, TimelineIndexEntryInner, - }, - upload::upload_timeline_checkpoint, + download::{download_timeline_layers, DownloadedTimeline}, + index::{IndexPart, RemoteIndex, RemoteTimeline, RemoteTimelineIndex}, + upload::{upload_index_part, upload_timeline_layers, UploadedTimeline}, }; use super::{ LocalTimelineInitStatus, 
LocalTimelineInitStatuses, RemoteStorage, SyncStartupData, ZTenantTimelineId, }; use crate::{ - config::PageServerConf, layered_repository::metadata::TimelineMetadata, - remote_storage::storage_sync::compression::read_archive_header, - repository::TimelineSyncStatusUpdate, tenant_mgr::apply_timeline_sync_status_updates, - thread_mgr, thread_mgr::ThreadKind, + config::PageServerConf, + layered_repository::metadata::{metadata_path, TimelineMetadata}, + repository::TimelineSyncStatusUpdate, + tenant_mgr::apply_timeline_sync_status_updates, + thread_mgr, + thread_mgr::ThreadKind, }; use zenith_metrics::{ @@ -112,6 +105,8 @@ use zenith_metrics::{ }; use zenith_utils::zid::{ZTenantId, ZTimelineId}; +pub use self::download::download_index_part; + lazy_static! { static ref REMAINING_SYNC_ITEMS: IntGauge = register_int_gauge!( "pageserver_remote_storage_remaining_sync_items", @@ -140,7 +135,7 @@ lazy_static! { /// mpsc approach was picked to allow blocking the sync loop if no tasks are present, to avoid meaningless spinning. mod sync_queue { use std::{ - collections::HashMap, + collections::{hash_map, HashMap}, sync::atomic::{AtomicUsize, Ordering}, }; @@ -150,13 +145,14 @@ mod sync_queue { use tracing::{debug, warn}; use super::SyncTask; + use zenith_utils::zid::ZTenantTimelineId; - static SENDER: OnceCell> = OnceCell::new(); + static SENDER: OnceCell> = OnceCell::new(); static LENGTH: AtomicUsize = AtomicUsize::new(0); /// Initializes the queue with the given sender channel that is used to put the tasks into later. /// Errors if called more than once. - pub fn init(sender: UnboundedSender) -> anyhow::Result<()> { + pub fn init(sender: UnboundedSender<(ZTenantTimelineId, SyncTask)>) -> anyhow::Result<()> { SENDER .set(sender) .map_err(|_sender| anyhow!("sync queue was already initialized"))?; @@ -165,9 +161,9 @@ mod sync_queue { /// Adds a new task to the queue, if the queue was initialized, returning `true` on success. 
/// On any error, or if the queue was not initialized, the task gets dropped (not scheduled) and `false` is returned. - pub fn push(new_task: SyncTask) -> bool { + pub fn push(sync_id: ZTenantTimelineId, new_task: SyncTask) -> bool { if let Some(sender) = SENDER.get() { - match sender.send(new_task) { + match sender.send((sync_id, new_task)) { Err(e) => { warn!( "Failed to enqueue a sync task: the receiver is dropped: {}", @@ -189,7 +185,9 @@ mod sync_queue { /// Polls a new task from the queue, using its receiver counterpart. /// Does not block if the queue is empty, returning [`None`] instead. /// Needed to correctly track the queue length. - pub async fn next_task(receiver: &mut UnboundedReceiver) -> Option { + pub async fn next_task( + receiver: &mut UnboundedReceiver<(ZTenantTimelineId, SyncTask)>, + ) -> Option<(ZTenantTimelineId, SyncTask)> { let task = receiver.recv().await; if task.is_some() { LENGTH.fetch_sub(1, Ordering::Relaxed); @@ -199,25 +197,35 @@ mod sync_queue { /// Fetches a task batch, not bigger than the given limit. /// Not blocking, can return fewer tasks if the queue does not contain enough. - /// Duplicate entries are eliminated and not considered in batch size calculations. + /// Batch tasks are split by timelines, with all related tasks merged into one (download/upload) + /// or two (download and upload, if both were found in the queue during batch construction). 
pub async fn next_task_batch( - receiver: &mut UnboundedReceiver<SyncTask>, + receiver: &mut UnboundedReceiver<(ZTenantTimelineId, SyncTask)>, mut max_batch_size: usize, - ) -> Vec<SyncTask> { + ) -> HashMap<ZTenantTimelineId, SyncTask> { if max_batch_size == 0 { - return Vec::new(); + return HashMap::new(); } - let mut tasks = HashMap::with_capacity(max_batch_size); + let mut tasks: HashMap<ZTenantTimelineId, SyncTask> = + HashMap::with_capacity(max_batch_size); loop { match receiver.try_recv() { - Ok(new_task) => { + Ok((sync_id, new_task)) => { LENGTH.fetch_sub(1, Ordering::Relaxed); - if tasks.insert(new_task.sync_id, new_task).is_none() { - max_batch_size -= 1; - if max_batch_size == 0 { - break; + match tasks.entry(sync_id) { + hash_map::Entry::Occupied(o) => { + let current = o.remove(); + tasks.insert(sync_id, current.merge(new_task)); } + hash_map::Entry::Vacant(v) => { + v.insert(new_task); + } + } + + max_batch_size -= 1; + if max_batch_size == 0 { + break; } } Err(TryRecvError::Disconnected) => { @@ -231,7 +239,7 @@ mod sync_queue { } } - tasks.into_values().collect() + tasks } /// Length of the queue, assuming that all receiver counterparts were only called using the queue api. @@ -242,55 +250,162 @@ mod sync_queue { /// A task to run in the async download/upload loop. /// Limited by the number of retries, after certain threshold the failing task gets evicted and the timeline disabled. -#[derive(Debug, Clone)] -pub struct SyncTask { - sync_id: ZTenantTimelineId, - retries: u32, - kind: SyncKind, +#[derive(Debug)] +pub enum SyncTask { + /// A certain amount of image files to download. + Download(SyncData<TimelineDownload>), + /// A checkpoint outcome with possible local file updates that need actualization in the remote storage. + /// Not necessarily fresher than the one already uploaded. + Upload(SyncData<TimelineUpload>), + /// Both upload and download layers need to be synced.
+ DownloadAndUpload(SyncData<TimelineDownload>, SyncData<TimelineUpload>), } -impl SyncTask { - fn new(sync_id: ZTenantTimelineId, retries: u32, kind: SyncKind) -> Self { - Self { - sync_id, - retries, - kind, - } +/// Stores the data to sync and its retries, to evict the tasks failing too frequently. +#[derive(Debug, Clone, PartialEq, Eq)] +pub struct SyncData<T> { + retries: u32, + data: T, +} + +impl<T> SyncData<T> { + fn new(retries: u32, data: T) -> Self { + Self { retries, data } } } -#[derive(Debug, Clone)] -enum SyncKind { - /// A certain amount of images (archive files) to download. - Download(TimelineDownload), - /// A checkpoint outcome with possible local file updates that need actualization in the remote storage. - /// Not necessary more fresh than the one already uploaded. - Upload(NewCheckpoint), -} +impl SyncTask { + fn download(download_task: TimelineDownload) -> Self { + Self::Download(SyncData::new(0, download_task)) + } -impl SyncKind { - fn sync_name(&self) -> &'static str { + fn upload(upload_task: TimelineUpload) -> Self { + Self::Upload(SyncData::new(0, upload_task)) + } + + /// Merges two tasks into one with the following rules: + /// + /// * Download + Download = Download with the retry counter reset and the layers to skip combined + /// * DownloadAndUpload + Download = DownloadAndUpload with Upload unchanged and the Download counterparts united by the same rules + /// * Upload + Upload = Upload with the retry counter reset and the layers to upload and the uploaded layers combined + /// * DownloadAndUpload + Upload = DownloadAndUpload with Download unchanged and the Upload counterparts united by the same rules + /// * Upload + Download = DownloadAndUpload with both tasks unchanged + /// * DownloadAndUpload + DownloadAndUpload = DownloadAndUpload with both parts united by the same rules + fn merge(mut self, other: Self) -> Self { + match (&mut self, other) { + ( + SyncTask::DownloadAndUpload(download_data, _) | SyncTask::Download(download_data), +
SyncTask::Download(new_download_data), + ) + | ( + SyncTask::Download(download_data), + SyncTask::DownloadAndUpload(new_download_data, _), + ) => { + download_data + .data + .layers_to_skip + .extend(new_download_data.data.layers_to_skip.into_iter()); + download_data.retries = 0; + } + (SyncTask::Upload(upload), SyncTask::Download(new_download_data)) => { + self = SyncTask::DownloadAndUpload(new_download_data, upload.clone()); + } + + ( + SyncTask::DownloadAndUpload(_, upload_data) | SyncTask::Upload(upload_data), + SyncTask::Upload(new_upload_data), + ) + | (SyncTask::Upload(upload_data), SyncTask::DownloadAndUpload(_, new_upload_data)) => { + upload_data + .data + .layers_to_upload + .extend(new_upload_data.data.layers_to_upload.into_iter()); + upload_data + .data + .uploaded_layers + .extend(new_upload_data.data.uploaded_layers.into_iter()); + upload_data.retries = 0; + + if new_upload_data.data.metadata.disk_consistent_lsn() + > upload_data.data.metadata.disk_consistent_lsn() + { + upload_data.data.metadata = new_upload_data.data.metadata; + } + } + (SyncTask::Download(download), SyncTask::Upload(new_upload_data)) => { + self = SyncTask::DownloadAndUpload(download.clone(), new_upload_data) + } + + ( + SyncTask::DownloadAndUpload(download_data, upload_data), + SyncTask::DownloadAndUpload(new_download_data, new_upload_data), + ) => { + download_data + .data + .layers_to_skip + .extend(new_download_data.data.layers_to_skip.into_iter()); + download_data.retries = 0; + + upload_data + .data + .layers_to_upload + .extend(new_upload_data.data.layers_to_upload.into_iter()); + upload_data + .data + .uploaded_layers + .extend(new_upload_data.data.uploaded_layers.into_iter()); + upload_data.retries = 0; + + if new_upload_data.data.metadata.disk_consistent_lsn() + > upload_data.data.metadata.disk_consistent_lsn() + { + upload_data.data.metadata = new_upload_data.data.metadata; + } + } + } + + self + } + + fn name(&self) -> &'static str { match self { - Self::Download(_) => 
"download", - Self::Upload(_) => "upload", + SyncTask::Download(_) => "download", + SyncTask::Upload(_) => "upload", + SyncTask::DownloadAndUpload(_, _) => "download and upload", + } + } + + fn retries(&self) -> u32 { + match self { + SyncTask::Download(data) => data.retries, + SyncTask::Upload(data) => data.retries, + SyncTask::DownloadAndUpload(download_data, upload_data) => { + download_data.retries.max(upload_data.retries) + } } } } /// Local timeline files for upload, appeared after the new checkpoint. /// Current checkpoint design assumes new files are added only, no deletions or amendment happens. -#[derive(Debug, Clone)] -pub struct NewCheckpoint { - /// layer file paths in the pageserver workdir, that were added for the corresponding checkpoint. - layers: Vec, +#[derive(Debug, Clone, PartialEq, Eq)] +pub struct TimelineUpload { + /// Layer file path in the pageserver workdir, that were added for the corresponding checkpoint. + layers_to_upload: HashSet, + /// Already uploaded layers. Used to store the data about the uploads between task retries + /// and to record the data into the remote index after the task got completed or evicted. + uploaded_layers: HashSet, metadata: TimelineMetadata, } -/// Info about the remote image files. -#[derive(Debug, Clone)] -struct TimelineDownload { - files_to_skip: Arc>, - archives_to_skip: BTreeSet, +/// A timeline download task. +/// Does not contain the file list to download, to allow other +/// parts of the pageserer code to schedule the task +/// without using the remote index or any other ways to list the remote timleine files. +/// Skips the files that are already downloaded. +#[derive(Debug, Clone, PartialEq, Eq)] +pub struct TimelineDownload { + layers_to_skip: HashSet, } /// Adds the new checkpoint files as an upload sync task to the queue. 
@@ -300,22 +415,20 @@ struct TimelineDownload { pub fn schedule_timeline_checkpoint_upload( tenant_id: ZTenantId, timeline_id: ZTimelineId, - layers: Vec, + new_layer: PathBuf, metadata: TimelineMetadata, ) { - if layers.is_empty() { - debug!("Skipping empty layers upload task"); - return; - } - - if !sync_queue::push(SyncTask::new( + if !sync_queue::push( ZTenantTimelineId { tenant_id, timeline_id, }, - 0, - SyncKind::Upload(NewCheckpoint { layers, metadata }), - )) { + SyncTask::upload(TimelineUpload { + layers_to_upload: HashSet::from([new_layer]), + uploaded_layers: HashSet::new(), + metadata, + }), + ) { warn!( "Could not send an upload task for tenant {}, timeline {}", tenant_id, timeline_id @@ -329,12 +442,10 @@ pub fn schedule_timeline_checkpoint_upload( } /// Requests the download of the entire timeline for a given tenant. -/// No existing local files are currently owerwritten, except the metadata file. -/// The timeline downloads checkpoint archives, from the earliest `disc_consistent_lsn` to the latest, -/// replacing the metadata file as the lasat file in every archive uncompression result. +/// No existing local files are currently overwritten, except the metadata file (if its disk_consistent_lsn is less than the downloaded one). +/// The metadata file is always updated last, to avoid inconsistencies. /// -/// On any failure, the task gets retried, omitting already downloaded archives and files -/// (yet requiring to download the entire archive even if it got partially extracted before the failure). +/// On any failure, the task gets retried, omitting already downloaded layers. /// /// Ensure that the loop is started otherwise the task is never processed. 
pub fn schedule_timeline_download(tenant_id: ZTenantId, timeline_id: ZTimelineId) { @@ -342,31 +453,30 @@ pub fn schedule_timeline_download(tenant_id: ZTenantId, timeline_id: ZTimelineId "Scheduling timeline download for tenant {}, timeline {}", tenant_id, timeline_id ); - sync_queue::push(SyncTask::new( + sync_queue::push( ZTenantTimelineId { tenant_id, timeline_id, }, - 0, - SyncKind::Download(TimelineDownload { - files_to_skip: Arc::new(BTreeSet::new()), - archives_to_skip: BTreeSet::new(), + SyncTask::download(TimelineDownload { + layers_to_skip: HashSet::new(), }), - )); + ); } /// Uses a remote storage given to start the storage sync loop. /// See module docs for loop step description. -pub(super) fn spawn_storage_sync_thread< - P: std::fmt::Debug + Send + Sync + 'static, - S: RemoteStorage + Send + Sync + 'static, ->( +pub(super) fn spawn_storage_sync_thread( conf: &'static PageServerConf, - local_timeline_files: HashMap)>, + local_timeline_files: HashMap)>, storage: S, max_concurrent_sync: NonZeroUsize, max_sync_errors: NonZeroU32, -) -> anyhow::Result { +) -> anyhow::Result +where + P: Debug + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, +{ let (sender, receiver) = mpsc::unbounded_channel(); sync_queue::init(sender)?; @@ -375,22 +485,13 @@ pub(super) fn spawn_storage_sync_thread< .build() .context("Failed to create storage sync runtime")?; - let download_paths = runtime - // TODO could take long time, consider [de]serializing [`RemoteTimelineIndex`] instead - .block_on(storage.list()) - .context("Failed to list remote storage files")? 
- .into_iter() - .filter_map(|remote_path| match storage.local_path(&remote_path) { - Ok(local_path) => Some(local_path), - Err(e) => { - error!( - "Failed to find local path for remote path {:?}: {:?}", - remote_path, e - ); - None - } - }); - let remote_index = RemoteIndex::try_parse_descriptions_from_paths(conf, download_paths); + let applicable_index_parts = runtime.block_on(try_fetch_index_parts( + conf, + &storage, + local_timeline_files.keys().copied().collect(), + )); + + let remote_index = RemoteIndex::from_parts(conf, applicable_index_parts)?; let local_timeline_init_statuses = schedule_first_sync_tasks( &mut runtime.block_on(remote_index.write()), @@ -409,8 +510,8 @@ pub(super) fn spawn_storage_sync_thread< runtime, conf, receiver, + Arc::new(storage), loop_index, - storage, max_concurrent_sync, max_sync_errors, ); @@ -424,44 +525,40 @@ pub(super) fn spawn_storage_sync_thread< }) } -enum LoopStep { - SyncStatusUpdates(HashMap>), - Shutdown, -} - #[allow(clippy::too_many_arguments)] -fn storage_sync_loop< - P: std::fmt::Debug + Send + Sync + 'static, - S: RemoteStorage + Send + Sync + 'static, ->( +fn storage_sync_loop( runtime: Runtime, conf: &'static PageServerConf, - mut receiver: UnboundedReceiver, + mut receiver: UnboundedReceiver<(ZTenantTimelineId, SyncTask)>, + storage: Arc, index: RemoteIndex, - storage: S, max_concurrent_sync: NonZeroUsize, max_sync_errors: NonZeroU32, -) { - let remote_assets = Arc::new((storage, index.clone())); +) where + P: Debug + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, +{ info!("Starting remote storage sync loop"); loop { - let index = index.clone(); + let loop_index = index.clone(); + let storage = Arc::clone(&storage); let loop_step = runtime.block_on(async { tokio::select! 
{ step = loop_step( conf, &mut receiver, - Arc::clone(&remote_assets), + storage, + loop_index, max_concurrent_sync, max_sync_errors, ) .instrument(info_span!("storage_sync_loop_step")) => step, - _ = thread_mgr::shutdown_watcher() => LoopStep::Shutdown, + _ = thread_mgr::shutdown_watcher() => ControlFlow::Break(()), } }); match loop_step { - LoopStep::SyncStatusUpdates(new_timeline_states) => { + ControlFlow::Continue(new_timeline_states) => { if new_timeline_states.is_empty() { debug!("Sync loop step completed, no new timeline states"); } else { @@ -470,10 +567,10 @@ fn storage_sync_loop< new_timeline_states.len() ); // Batch timeline download registration to ensure that the external registration code won't block any running tasks before. - apply_timeline_sync_status_updates(conf, index, new_timeline_states); + apply_timeline_sync_status_updates(conf, &index, new_timeline_states); } } - LoopStep::Shutdown => { + ControlFlow::Break(()) => { info!("Shutdown requested, stopping"); break; } @@ -481,68 +578,64 @@ fn storage_sync_loop< } } -async fn loop_step< - P: std::fmt::Debug + Send + Sync + 'static, - S: RemoteStorage + Send + Sync + 'static, ->( +async fn loop_step( conf: &'static PageServerConf, - receiver: &mut UnboundedReceiver, - remote_assets: Arc<(S, RemoteIndex)>, + receiver: &mut UnboundedReceiver<(ZTenantTimelineId, SyncTask)>, + storage: Arc, + index: RemoteIndex, max_concurrent_sync: NonZeroUsize, max_sync_errors: NonZeroU32, -) -> LoopStep { +) -> ControlFlow<(), HashMap>> +where + P: Debug + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, +{ let max_concurrent_sync = max_concurrent_sync.get(); - let mut next_tasks = Vec::new(); // request the first task in blocking fashion to do less meaningless work - if let Some(first_task) = sync_queue::next_task(receiver).await { - next_tasks.push(first_task); - } else { - return LoopStep::Shutdown; - }; - next_tasks.extend( - sync_queue::next_task_batch(receiver, max_concurrent_sync - 1) - 
.await - .into_iter(), - ); + let (first_sync_id, first_task) = + if let Some(first_task) = sync_queue::next_task(receiver).await { + first_task + } else { + return ControlFlow::Break(()); + }; + + let mut batched_tasks = sync_queue::next_task_batch(receiver, max_concurrent_sync - 1).await; + match batched_tasks.entry(first_sync_id) { + hash_map::Entry::Occupied(o) => { + let current = o.remove(); + batched_tasks.insert(first_sync_id, current.merge(first_task)); + } + hash_map::Entry::Vacant(v) => { + v.insert(first_task); + } + } let remaining_queue_length = sync_queue::len(); REMAINING_SYNC_ITEMS.set(remaining_queue_length as i64); - if remaining_queue_length > 0 || !next_tasks.is_empty() { + if remaining_queue_length > 0 || !batched_tasks.is_empty() { info!( - "Processing {} tasks in batch, more tasks left to process: {}", - next_tasks.len(), + "Processing tasks for {} timelines in batch, more tasks left to process: {}", + batched_tasks.len(), remaining_queue_length ); } else { debug!("No tasks to process"); - return LoopStep::SyncStatusUpdates(HashMap::new()); + return ControlFlow::Continue(HashMap::new()); } - let mut task_batch = next_tasks + let mut sync_results = batched_tasks .into_iter() - .map(|task| async { - let sync_id = task.sync_id; - let attempt = task.retries; - let sync_name = task.kind.sync_name(); - - let extra_step = match tokio::spawn( - process_task(conf, Arc::clone(&remote_assets), task, max_sync_errors).instrument( - info_span!("process_sync_task", sync_id = %sync_id, attempt, sync_name), - ), - ) - .await - { - Ok(extra_step) => extra_step, - Err(e) => { - error!( - "Failed to process storage sync task for tenant {}, timeline {}: {:?}", - sync_id.tenant_id, sync_id.timeline_id, e - ); - None - } - }; - (sync_id, extra_step) + .map(|(sync_id, task)| { + let storage = Arc::clone(&storage); + let index = index.clone(); + async move { + let state_update = + process_sync_task(conf, storage, index, max_sync_errors, sync_id, task) + 
.instrument(info_span!("process_sync_tasks", sync_id = %sync_id)) + .await; + (sync_id, state_update) + } }) .collect::>(); @@ -550,45 +643,86 @@ async fn loop_step< ZTenantId, HashMap, > = HashMap::with_capacity(max_concurrent_sync); - while let Some((sync_id, state_update)) = task_batch.next().await { + while let Some((sync_id, state_update)) = sync_results.next().await { debug!("Finished storage sync task for sync id {}", sync_id); if let Some(state_update) = state_update { - let ZTenantTimelineId { - tenant_id, - timeline_id, - } = sync_id; new_timeline_states - .entry(tenant_id) + .entry(sync_id.tenant_id) .or_default() - .insert(timeline_id, state_update); + .insert(sync_id.timeline_id, state_update); } } - LoopStep::SyncStatusUpdates(new_timeline_states) + ControlFlow::Continue(new_timeline_states) } -async fn process_task< - P: std::fmt::Debug + Send + Sync + 'static, - S: RemoteStorage + Send + Sync + 'static, ->( +async fn process_sync_task( conf: &'static PageServerConf, - remote_assets: Arc<(S, RemoteIndex)>, - task: SyncTask, + storage: Arc, + index: RemoteIndex, max_sync_errors: NonZeroU32, -) -> Option { - if task.retries > max_sync_errors.get() { - error!( - "Evicting task {:?} that failed {} times, exceeding the error threshold", - task.kind, task.retries - ); - FATAL_TASK_FAILURES.inc(); - // FIXME (rodionov) this can potentially leave holes in timeline uploads - // planneed to be fixed as part of https://github.com/zenithdb/zenith/issues/977 - return None; - } + sync_id: ZTenantTimelineId, + task: SyncTask, +) -> Option +where + P: Debug + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, +{ + let sync_start = Instant::now(); + let current_remote_timeline = { index.read().await.timeline_entry(&sync_id).cloned() }; - if task.retries > 0 { - let seconds_to_wait = 2.0_f64.powf(task.retries as f64 - 1.0).min(30.0); + let task = match validate_task_retries(sync_id, task, max_sync_errors) { + ControlFlow::Continue(task) => task, + 
ControlFlow::Break(aborted_task) => { + match aborted_task { + SyncTask::Download(_) => { + index + .write() + .await + .set_awaits_download(&sync_id, false) + .ok(); + } + SyncTask::Upload(failed_upload_data) => { + if let Err(e) = update_remote_data( + conf, + storage.as_ref(), + &index, + sync_id, + &failed_upload_data.data, + true, + ) + .await + { + error!("Failed to update remote timeline {}: {:?}", sync_id, e); + } + } + SyncTask::DownloadAndUpload(_, failed_upload_data) => { + index + .write() + .await + .set_awaits_download(&sync_id, false) + .ok(); + if let Err(e) = update_remote_data( + conf, + storage.as_ref(), + &index, + sync_id, + &failed_upload_data.data, + true, + ) + .await + { + error!("Failed to update remote timeline {}: {:?}", sync_id, e); + } + } + } + return None; + } + }; + + let current_task_attempt = task.retries(); + if current_task_attempt > 0 { + let seconds_to_wait = 2.0_f64.powf(current_task_attempt as f64 - 1.0).min(30.0); debug!( "Waiting {} seconds before starting the task", seconds_to_wait @@ -596,64 +730,372 @@ async fn process_task< tokio::time::sleep(Duration::from_secs_f64(seconds_to_wait)).await; } - let remote_index = &remote_assets.1; - - let sync_start = Instant::now(); - let sync_name = task.kind.sync_name(); - match task.kind { - SyncKind::Download(download_data) => { - let download_result = download_timeline( + let task_name = task.name(); + match task { + SyncTask::Download(new_download_data) => { + download_timeline( conf, - remote_assets.clone(), - task.sync_id, - download_data, - task.retries + 1, + (storage.as_ref(), &index), + current_remote_timeline.as_ref(), + sync_id, + new_download_data, + sync_start, + task_name, ) - .await; - - match download_result { - DownloadedTimeline::Abort => { - register_sync_status(sync_start, sync_name, None); - remote_index - .write() - .await - .set_awaits_download(&task.sync_id, false) - .expect("timeline should be present in remote index"); - None - } - 
DownloadedTimeline::FailedAndRescheduled => { - register_sync_status(sync_start, sync_name, Some(false)); - None - } - DownloadedTimeline::Successful => { - register_sync_status(sync_start, sync_name, Some(true)); - remote_index - .write() - .await - .set_awaits_download(&task.sync_id, false) - .expect("timeline should be present in remote index"); - Some(TimelineSyncStatusUpdate::Downloaded) - } - } + .await } - SyncKind::Upload(layer_upload) => { - let sync_status = upload_timeline_checkpoint( + SyncTask::Upload(new_upload_data) => { + upload_timeline( conf, - remote_assets, - task.sync_id, - layer_upload, - task.retries + 1, + (storage.as_ref(), &index), + current_remote_timeline.as_ref(), + sync_id, + new_upload_data, + sync_start, + task_name, ) .await; - register_sync_status(sync_start, sync_name, sync_status); None } + SyncTask::DownloadAndUpload(new_download_data, new_upload_data) => { + let status_update = download_timeline( + conf, + (storage.as_ref(), &index), + current_remote_timeline.as_ref(), + sync_id, + new_download_data, + sync_start, + task_name, + ) + .await; + + upload_timeline( + conf, + (storage.as_ref(), &index), + current_remote_timeline.as_ref(), + sync_id, + new_upload_data, + sync_start, + task_name, + ) + .await; + + status_update + } } } +async fn download_timeline( + conf: &'static PageServerConf, + (storage, index): (&S, &RemoteIndex), + current_remote_timeline: Option<&RemoteTimeline>, + sync_id: ZTenantTimelineId, + new_download_data: SyncData, + sync_start: Instant, + task_name: &str, +) -> Option +where + P: Debug + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, +{ + match download_timeline_layers(storage, current_remote_timeline, sync_id, new_download_data) + .await + { + DownloadedTimeline::Abort => { + register_sync_status(sync_start, task_name, None); + if let Err(e) = index.write().await.set_awaits_download(&sync_id, false) { + error!( + "Timeline {} was expected to be in the remote index after a download 
attempt, but it's absent: {:?}", + sync_id, e + ); + } + None + } + DownloadedTimeline::FailedAndRescheduled => { + register_sync_status(sync_start, task_name, Some(false)); + None + } + DownloadedTimeline::Successful(mut download_data) => { + match update_local_metadata(conf, sync_id, current_remote_timeline).await { + Ok(()) => match index.write().await.set_awaits_download(&sync_id, false) { + Ok(()) => { + register_sync_status(sync_start, task_name, Some(true)); + Some(TimelineSyncStatusUpdate::Downloaded) + } + Err(e) => { + error!( + "Timeline {} was expected to be in the remote index after a successful download, but it's absent: {:?}", + sync_id, e + ); + None + } + }, + Err(e) => { + error!("Failed to update local timeline metadata: {:?}", e); + download_data.retries += 1; + sync_queue::push(sync_id, SyncTask::Download(download_data)); + register_sync_status(sync_start, task_name, Some(false)); + None + } + } + } + } +} + +async fn update_local_metadata( + conf: &'static PageServerConf, + sync_id: ZTenantTimelineId, + remote_timeline: Option<&RemoteTimeline>, +) -> anyhow::Result<()> { + let remote_metadata = match remote_timeline { + Some(timeline) => &timeline.metadata, + None => { + info!("No remote timeline to update local metadata from, skipping the update"); + return Ok(()); + } + }; + let remote_lsn = remote_metadata.disk_consistent_lsn(); + + let local_metadata_path = metadata_path(conf, sync_id.timeline_id, sync_id.tenant_id); + let local_lsn = if local_metadata_path.exists() { + let local_metadata = read_metadata_file(&local_metadata_path) + .await + .with_context(|| { + format!( + "Failed to load local metadata from path '{}'", + local_metadata_path.display() + ) + })?; + + Some(local_metadata.disk_consistent_lsn()) + } else { + None + }; + + if local_lsn < Some(remote_lsn) { + info!( + "Updating local timeline metadata from remote timeline: local disk_consistent_lsn={:?}, remote disk_consistent_lsn={}", + local_lsn, remote_lsn + ); + + let
remote_metadata_bytes = remote_metadata + .to_bytes() + .context("Failed to serialize remote metadata to bytes")?; + fs::write(&local_metadata_path, &remote_metadata_bytes) + .await + .with_context(|| { + format!( + "Failed to write remote metadata bytes locally to path '{}'", + local_metadata_path.display() + ) + })?; + } else { + info!("Local metadata at path '{}' has later disk consistent Lsn ({:?}) than the remote one ({}), skipping the update", local_metadata_path.display(), local_lsn, remote_lsn); + } + + Ok(()) +} + +async fn read_metadata_file(metadata_path: &Path) -> anyhow::Result { + TimelineMetadata::from_bytes( + &fs::read(metadata_path) + .await + .context("Failed to read local metadata bytes from fs")?, + ) + .context("Failed to parse metadata bytes") +} + +async fn upload_timeline( + conf: &'static PageServerConf, + (storage, index): (&S, &RemoteIndex), + current_remote_timeline: Option<&RemoteTimeline>, + sync_id: ZTenantTimelineId, + new_upload_data: SyncData, + sync_start: Instant, + task_name: &str, +) where + P: Debug + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, +{ + let mut uploaded_data = + match upload_timeline_layers(storage, current_remote_timeline, sync_id, new_upload_data) + .await + { + UploadedTimeline::FailedAndRescheduled => { + register_sync_status(sync_start, task_name, Some(false)); + return; + } + UploadedTimeline::Successful(upload_data) => upload_data, + UploadedTimeline::SuccessfulAfterLocalFsUpdate(mut outdated_upload_data) => { + let local_metadata_path = + metadata_path(conf, sync_id.timeline_id, sync_id.tenant_id); + let local_metadata = match read_metadata_file(&local_metadata_path).await { + Ok(metadata) => metadata, + Err(e) => { + error!( + "Failed to load local metadata from path '{}': {:?}", + local_metadata_path.display(), + e + ); + outdated_upload_data.retries += 1; + sync_queue::push(sync_id, SyncTask::Upload(outdated_upload_data)); + register_sync_status(sync_start, task_name, 
Some(false)); + return; + } + }; + + outdated_upload_data.data.metadata = local_metadata; + outdated_upload_data + } + }; + + match update_remote_data(conf, storage, index, sync_id, &uploaded_data.data, false).await { + Ok(()) => register_sync_status(sync_start, task_name, Some(true)), + Err(e) => { + error!("Failed to update remote timeline {}: {:?}", sync_id, e); + uploaded_data.retries += 1; + sync_queue::push(sync_id, SyncTask::Upload(uploaded_data)); + register_sync_status(sync_start, task_name, Some(false)); + } + } +} + +async fn update_remote_data( + conf: &'static PageServerConf, + storage: &S, + index: &RemoteIndex, + sync_id: ZTenantTimelineId, + uploaded_data: &TimelineUpload, + upload_failed: bool, +) -> anyhow::Result<()> +where + P: Debug + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, +{ + let updated_remote_timeline = { + let mut index_accessor = index.write().await; + + match index_accessor.timeline_entry_mut(&sync_id) { + Some(existing_entry) => { + if existing_entry.metadata.disk_consistent_lsn() + < uploaded_data.metadata.disk_consistent_lsn() + { + existing_entry.metadata = uploaded_data.metadata.clone(); + } + if upload_failed { + existing_entry + .add_upload_failures(uploaded_data.layers_to_upload.iter().cloned()); + } else { + existing_entry + .add_timeline_layers(uploaded_data.uploaded_layers.iter().cloned()); + } + existing_entry.clone() + } + None => { + let mut new_remote_timeline = RemoteTimeline::new(uploaded_data.metadata.clone()); + if upload_failed { + new_remote_timeline + .add_upload_failures(uploaded_data.layers_to_upload.iter().cloned()); + } else { + new_remote_timeline + .add_timeline_layers(uploaded_data.uploaded_layers.iter().cloned()); + } + + index_accessor.add_timeline_entry(sync_id, new_remote_timeline.clone()); + new_remote_timeline + } + } + }; + + let timeline_path = conf.timeline_path(&sync_id.timeline_id, &sync_id.tenant_id); + let new_index_part = + 
IndexPart::from_remote_timeline(&timeline_path, updated_remote_timeline) + .context("Failed to create an index part from the updated remote timeline")?; + + upload_index_part(conf, storage, sync_id, new_index_part) + .await + .context("Failed to upload new index part") +} + +fn validate_task_retries( + sync_id: ZTenantTimelineId, + task: SyncTask, + max_sync_errors: NonZeroU32, +) -> ControlFlow { + let max_sync_errors = max_sync_errors.get(); + let mut skip_upload = false; + let mut skip_download = false; + + match &task { + SyncTask::Download(download_data) | SyncTask::DownloadAndUpload(download_data, _) + if download_data.retries > max_sync_errors => + { + error!( + "Evicting download task for timeline {} that failed {} times, exceeding the error threshold {}", + sync_id, download_data.retries, max_sync_errors + ); + skip_download = true; + } + SyncTask::Upload(upload_data) | SyncTask::DownloadAndUpload(_, upload_data) + if upload_data.retries > max_sync_errors => + { + error!( + "Evicting upload task for timeline {} that failed {} times, exceeding the error threshold {}", + sync_id, upload_data.retries, max_sync_errors + ); + skip_upload = true; + } + _ => {} + } + + match task { + aborted_task @ SyncTask::Download(_) if skip_download => ControlFlow::Break(aborted_task), + aborted_task @ SyncTask::Upload(_) if skip_upload => ControlFlow::Break(aborted_task), + aborted_task @ SyncTask::DownloadAndUpload(_, _) if skip_upload && skip_download => { + ControlFlow::Break(aborted_task) + } + SyncTask::DownloadAndUpload(download_task, _) if skip_upload => { + ControlFlow::Continue(SyncTask::Download(download_task)) + } + SyncTask::DownloadAndUpload(_, upload_task) if skip_download => { + ControlFlow::Continue(SyncTask::Upload(upload_task)) + } + not_skipped => ControlFlow::Continue(not_skipped), + } +} + +async fn try_fetch_index_parts( + conf: &'static PageServerConf, + storage: &S, + keys: HashSet, +) -> HashMap +where + P: Debug + Send + Sync + 'static, + S: 
RemoteStorage + Send + Sync + 'static, +{ + let mut index_parts = HashMap::with_capacity(keys.len()); + + let mut part_downloads = keys + .into_iter() + .map(|id| async move { (id, download_index_part(conf, storage, id).await) }) + .collect::>(); + + while let Some((id, part_upload_result)) = part_downloads.next().await { + match part_upload_result { + Ok(index_part) => { + debug!("Successfully fetched index part for {}", id); + index_parts.insert(id, index_part); + } + Err(e) => warn!("Failed to fetch index part for {}: {:?}", id, e), + } + } + + index_parts +} + fn schedule_first_sync_tasks( index: &mut RemoteTimelineIndex, - local_timeline_files: HashMap)>, + local_timeline_files: HashMap)>, ) -> LocalTimelineInitStatuses { let mut local_timeline_init_statuses = LocalTimelineInitStatuses::new(); @@ -661,71 +1103,66 @@ fn schedule_first_sync_tasks( VecDeque::with_capacity(local_timeline_files.len().max(local_timeline_files.len())); for (sync_id, (local_metadata, local_files)) in local_timeline_files { - let ZTenantTimelineId { - tenant_id, - timeline_id, - } = sync_id; match index.timeline_entry_mut(&sync_id) { - Some(index_entry) => { + Some(remote_timeline) => { let (timeline_status, awaits_download) = compare_local_and_remote_timeline( &mut new_sync_tasks, sync_id, local_metadata, local_files, - index_entry, + remote_timeline, ); let was_there = local_timeline_init_statuses - .entry(tenant_id) + .entry(sync_id.tenant_id) .or_default() - .insert(timeline_id, timeline_status); + .insert(sync_id.timeline_id, timeline_status); if was_there.is_some() { // defensive check warn!( "Overwriting timeline init sync status. Status {:?} Timeline {}", - timeline_status, timeline_id + timeline_status, sync_id.timeline_id ); } - index_entry.set_awaits_download(awaits_download); + remote_timeline.awaits_download = awaits_download; } None => { // TODO (rodionov) does this mean that we've crashed during tenant creation? // is it safe to upload this checkpoint? 
could it be half broken? - new_sync_tasks.push_back(SyncTask::new( + new_sync_tasks.push_back(( sync_id, - 0, - SyncKind::Upload(NewCheckpoint { - layers: local_files, + SyncTask::upload(TimelineUpload { + layers_to_upload: local_files, + uploaded_layers: HashSet::new(), metadata: local_metadata, }), )); local_timeline_init_statuses - .entry(tenant_id) + .entry(sync_id.tenant_id) .or_default() - .insert(timeline_id, LocalTimelineInitStatus::LocallyComplete); + .insert( + sync_id.timeline_id, + LocalTimelineInitStatus::LocallyComplete, + ); } } } - new_sync_tasks.into_iter().for_each(|task| { - sync_queue::push(task); + new_sync_tasks.into_iter().for_each(|(sync_id, task)| { + sync_queue::push(sync_id, task); }); local_timeline_init_statuses } fn compare_local_and_remote_timeline( - new_sync_tasks: &mut VecDeque, + new_sync_tasks: &mut VecDeque<(ZTenantTimelineId, SyncTask)>, sync_id: ZTenantTimelineId, local_metadata: TimelineMetadata, - local_files: Vec, - remote_entry: &TimelineIndexEntry, + local_files: HashSet, + remote_entry: &RemoteTimeline, ) -> (LocalTimelineInitStatus, bool) { - let local_lsn = local_metadata.disk_consistent_lsn(); - let uploads = remote_entry.uploaded_checkpoints(); + let remote_files = remote_entry.stored_files(); - let mut initial_timeline_status = LocalTimelineInitStatus::LocallyComplete; - - let mut awaits_download = false; // TODO probably here we need more sophisticated logic, // if more data is available remotely can we just download whats there? // without trying to upload something. It may be tricky, needs further investigation. @@ -734,38 +1171,37 @@ fn compare_local_and_remote_timeline( // (upload needs to be only for previously unsynced files, not whole timeline dir). 
// If one of the tasks fails they will be reordered in the queue which can lead // to timeline being stuck in evicted state - if !uploads.contains(&local_lsn) { - new_sync_tasks.push_back(SyncTask::new( + let number_of_layers_to_download = remote_files.difference(&local_files).count(); + let (initial_timeline_status, awaits_download) = if number_of_layers_to_download > 0 { + new_sync_tasks.push_back(( sync_id, - 0, - SyncKind::Upload(NewCheckpoint { - layers: local_files.clone(), + SyncTask::download(TimelineDownload { + layers_to_skip: local_files.clone(), + }), + )); + (LocalTimelineInitStatus::NeedsSync, true) + // we do not need to manipulate the remote consistent lsn here + // because it will be updated when the sync is completed + } else { + (LocalTimelineInitStatus::LocallyComplete, false) + }; + + let layers_to_upload = local_files + .difference(remote_files) + .cloned() + .collect::>(); + if !layers_to_upload.is_empty() { + new_sync_tasks.push_back(( + sync_id, + SyncTask::upload(TimelineUpload { + layers_to_upload, + uploaded_layers: HashSet::new(), metadata: local_metadata, }), )); - // Note that status here doesnt change. + // Note that status here doesn't change.
} - let uploads_count = uploads.len(); - let archives_to_skip: BTreeSet = uploads - .into_iter() - .filter(|upload_lsn| upload_lsn <= &local_lsn) - .map(ArchiveId) - .collect(); - if archives_to_skip.len() != uploads_count { - new_sync_tasks.push_back(SyncTask::new( - sync_id, - 0, - SyncKind::Download(TimelineDownload { - files_to_skip: Arc::new(local_files.into_iter().collect()), - archives_to_skip, - }), - )); - initial_timeline_status = LocalTimelineInitStatus::NeedsSync; - awaits_download = true; - // we do not need to manupulate with remote consistent lsn here - // because it will be updated when sync will be completed - } (initial_timeline_status, awaits_download) } @@ -780,322 +1216,44 @@ fn register_sync_status(sync_start: Instant, sync_name: &str, sync_status: Optio .observe(secs_elapsed) } -async fn fetch_full_index< - P: Send + Sync + 'static, - S: RemoteStorage + Send + Sync + 'static, ->( - (storage, index): &(S, RemoteIndex), - timeline_dir: &Path, - id: ZTenantTimelineId, -) -> anyhow::Result { - let index_read = index.read().await; - let full_index = match index_read.timeline_entry(&id).map(|e| e.inner()) { - None => bail!("Timeline not found for sync id {}", id), - Some(TimelineIndexEntryInner::Full(_)) => { - bail!("Index is already populated for sync id {}", id) - } - Some(TimelineIndexEntryInner::Description(description)) => { - let mut archive_header_downloads = FuturesUnordered::new(); - for (archive_id, description) in description { - archive_header_downloads.push(async move { - let header = download_archive_header(storage, timeline_dir, description) - .await - .map_err(|e| (e, archive_id))?; - Ok((archive_id, description.header_size, header)) - }); - } - - let mut full_index = RemoteTimeline::empty(); - while let Some(header_data) = archive_header_downloads.next().await { - match header_data { - Ok((archive_id, header_size, header)) => full_index.update_archive_contents(archive_id.0, header, header_size), - Err((e, archive_id)) => bail!( - 
"Failed to download archive header for tenant {}, timeline {}, archive for Lsn {}: {}", - id.tenant_id, id.timeline_id, archive_id.0, - e - ), - } - } - full_index - } - }; - drop(index_read); // tokio rw lock is not upgradeable - index - .write() - .await - .upgrade_timeline_entry(&id, full_index.clone()) - .context("cannot upgrade timeline entry in remote index")?; - Ok(full_index) -} - -async fn download_archive_header< - P: Send + Sync + 'static, - S: RemoteStorage + Send + Sync + 'static, ->( - storage: &S, - timeline_dir: &Path, - description: &ArchiveDescription, -) -> anyhow::Result { - let mut header_buf = std::io::Cursor::new(Vec::new()); - let remote_path = storage.storage_path(&timeline_dir.join(&description.archive_name))?; - storage - .download_range( - &remote_path, - 0, - Some(description.header_size), - &mut header_buf, - ) - .await?; - let header_buf = header_buf.into_inner(); - let header = read_archive_header(&description.archive_name, &mut header_buf.as_slice()).await?; - Ok(header) -} - #[cfg(test)] mod test_utils { - use std::{ - collections::{BTreeMap, BTreeSet}, - fs, - }; - - use super::*; - use crate::{ - layered_repository::metadata::metadata_path, remote_storage::local_fs::LocalFs, - repository::repo_harness::RepoHarness, - }; use zenith_utils::lsn::Lsn; - #[track_caller] - pub async fn ensure_correct_timeline_upload( + use crate::repository::repo_harness::RepoHarness; + + use super::*; + + pub async fn create_local_timeline( harness: &RepoHarness<'_>, - remote_assets: Arc<(LocalFs, RemoteIndex)>, - timeline_id: ZTimelineId, - new_upload: NewCheckpoint, - ) { - let sync_id = ZTenantTimelineId::new(harness.tenant_id, timeline_id); - upload_timeline_checkpoint( - harness.conf, - Arc::clone(&remote_assets), - sync_id, - new_upload.clone(), - 0, - ) - .await; - - let (storage, index) = remote_assets.as_ref(); - assert_index_descriptions( - index, - &RemoteIndex::try_parse_descriptions_from_paths( - harness.conf, - remote_assets - .0 - 
.list() - .await - .unwrap() - .into_iter() - .map(|storage_path| storage.local_path(&storage_path).unwrap()), - ), - ) - .await; - - let new_remote_timeline = expect_timeline(index, sync_id).await; - let new_remote_lsn = new_remote_timeline - .checkpoints() - .max() - .expect("Remote timeline should have an lsn after reupload"); - let upload_lsn = new_upload.metadata.disk_consistent_lsn(); - assert!( - new_remote_lsn >= upload_lsn, - "Remote timeline after upload should have the biggest Lsn out of all uploads" - ); - assert!( - new_remote_timeline.contains_checkpoint_at(upload_lsn), - "Should contain upload lsn among the remote ones" - ); - - let remote_files_after_upload = new_remote_timeline - .stored_files(&harness.conf.timeline_path(&timeline_id, &harness.tenant_id)); - for new_uploaded_layer in &new_upload.layers { - assert!( - remote_files_after_upload.contains(new_uploaded_layer), - "Remote files do not contain layer that should be uploaded: '{}'", - new_uploaded_layer.display() - ); - } - - assert_timeline_files_match(harness, timeline_id, new_remote_timeline); - } - - pub async fn expect_timeline( - index: &RemoteIndex, - sync_id: ZTenantTimelineId, - ) -> RemoteTimeline { - if let Some(TimelineIndexEntryInner::Full(remote_timeline)) = index - .read() - .await - .timeline_entry(&sync_id) - .map(|e| e.inner()) - { - remote_timeline.clone() - } else { - panic!( - "Expect to have a full remote timeline in the index for sync id {}", - sync_id - ) - } - } - - #[track_caller] - pub async fn assert_index_descriptions( - index: &RemoteIndex, - expected_index_with_descriptions: &RemoteIndex, - ) { - let expected_index_with_descriptions = expected_index_with_descriptions.read().await; - - let index_read = index.read().await; - let actual_sync_ids = index_read.all_sync_ids().collect::>(); - let expected_sync_ids = expected_index_with_descriptions - .all_sync_ids() - .collect::>(); - assert_eq!( - actual_sync_ids, expected_sync_ids, - "Index contains unexpected sync 
ids" - ); - - let mut actual_timeline_entries = BTreeMap::new(); - let mut expected_timeline_entries = BTreeMap::new(); - for sync_id in actual_sync_ids { - actual_timeline_entries.insert( - sync_id, - index_read.timeline_entry(&sync_id).unwrap().clone(), - ); - expected_timeline_entries.insert( - sync_id, - expected_index_with_descriptions - .timeline_entry(&sync_id) - .unwrap() - .clone(), - ); - } - drop(index_read); - - for (sync_id, actual_timeline_entry) in actual_timeline_entries { - let expected_timeline_description = expected_timeline_entries - .remove(&sync_id) - .unwrap_or_else(|| { - panic!( - "Failed to find an expected timeline with id {} in the index", - sync_id - ) - }); - let expected_timeline_description = match expected_timeline_description.inner() { - TimelineIndexEntryInner::Description(description) => description, - TimelineIndexEntryInner::Full(_) => panic!("Expected index entry for sync id {} is a full entry, while a description was expected", sync_id), - }; - - match actual_timeline_entry.inner() { - TimelineIndexEntryInner::Description(description) => { - assert_eq!( - description, expected_timeline_description, - "Index contains unexpected descriptions entry for sync id {}", - sync_id - ) - } - TimelineIndexEntryInner::Full(remote_timeline) => { - let expected_lsns = expected_timeline_description - .values() - .map(|description| description.disk_consistent_lsn) - .collect::>(); - assert_eq!( - remote_timeline.checkpoints().collect::>(), - expected_lsns, - "Timeline {} should have the same checkpoints uploaded", - sync_id, - ) - } - } - } - } - - pub fn assert_timeline_files_match( - harness: &RepoHarness, - remote_timeline_id: ZTimelineId, - remote_timeline: RemoteTimeline, - ) { - let local_timeline_dir = harness.timeline_path(&remote_timeline_id); - let local_paths = fs::read_dir(&local_timeline_dir) - .unwrap() - .map(|dir| dir.unwrap().path()) - .collect::>(); - let mut reported_remote_files = 
remote_timeline.stored_files(&local_timeline_dir); - let local_metadata_path = - metadata_path(harness.conf, remote_timeline_id, harness.tenant_id); - let local_metadata = TimelineMetadata::from_bytes( - &fs::read(&local_metadata_path) - .expect("Failed to read metadata file when comparing remote and local image files"), - ) - .expect( - "Failed to parse metadata file contents when comparing remote and local image files", - ); - assert!( - remote_timeline.contains_checkpoint_at(local_metadata.disk_consistent_lsn()), - "Should contain local lsn among the remote ones after the upload" - ); - reported_remote_files.insert(local_metadata_path); - - assert_eq!( - local_paths, reported_remote_files, - "Remote image files and local image files are different, missing locally: {:?}, missing remotely: {:?}", - reported_remote_files.difference(&local_paths).collect::>(), - local_paths.difference(&reported_remote_files).collect::>(), - ); - - if let Some(remote_file) = reported_remote_files.iter().next() { - let actual_remote_paths = fs::read_dir( - remote_file - .parent() - .expect("Remote files are expected to have their timeline dir as parent"), - ) - .unwrap() - .map(|dir| dir.unwrap().path()) - .collect::>(); - - let unreported_remote_files = actual_remote_paths - .difference(&reported_remote_files) - .collect::>(); - assert!( - unreported_remote_files.is_empty(), - "Unexpected extra remote files that were not listed: {:?}", - unreported_remote_files - ) - } - } - - pub fn create_local_timeline( - harness: &RepoHarness, timeline_id: ZTimelineId, filenames: &[&str], metadata: TimelineMetadata, - ) -> anyhow::Result { + ) -> anyhow::Result { let timeline_path = harness.timeline_path(&timeline_id); - fs::create_dir_all(&timeline_path)?; + fs::create_dir_all(&timeline_path).await?; - let mut layers = Vec::with_capacity(filenames.len()); + let mut layers_to_upload = HashSet::with_capacity(filenames.len()); for &file in filenames { let file_path = timeline_path.join(file); - 
fs::write(&file_path, dummy_contents(file).into_bytes())?; - layers.push(file_path); + fs::write(&file_path, dummy_contents(file).into_bytes()).await?; + layers_to_upload.insert(file_path); } fs::write( metadata_path(harness.conf, timeline_id, harness.tenant_id), metadata.to_bytes()?, - )?; + ) + .await?; - Ok(NewCheckpoint { layers, metadata }) + Ok(TimelineUpload { + layers_to_upload, + uploaded_layers: HashSet::new(), + metadata, + }) } - fn dummy_contents(name: &str) -> String { + pub fn dummy_contents(name: &str) -> String { format!("contents for {}", name) } @@ -1103,3 +1261,367 @@ mod test_utils { TimelineMetadata::new(disk_consistent_lsn, None, None, Lsn(0), Lsn(0), Lsn(0)) } } + +#[cfg(test)] +mod tests { + use std::collections::BTreeSet; + + use super::{test_utils::dummy_metadata, *}; + use zenith_utils::lsn::Lsn; + + #[test] + fn download_sync_tasks_merge() { + let download_1 = SyncTask::Download(SyncData::new( + 2, + TimelineDownload { + layers_to_skip: HashSet::from([PathBuf::from("one")]), + }, + )); + let download_2 = SyncTask::Download(SyncData::new( + 6, + TimelineDownload { + layers_to_skip: HashSet::from([PathBuf::from("two"), PathBuf::from("three")]), + }, + )); + + let merged_download = match download_1.merge(download_2) { + SyncTask::Download(merged_download) => merged_download, + wrong_merge_result => panic!("Unexpected merge result: {:?}", wrong_merge_result), + }; + + assert_eq!( + merged_download.retries, 0, + "Merged task should have its retries counter reset" + ); + + assert_eq!( + merged_download + .data + .layers_to_skip + .into_iter() + .collect::>(), + BTreeSet::from([ + PathBuf::from("one"), + PathBuf::from("two"), + PathBuf::from("three") + ]), + "Merged download tasks should have a combined set of layers to skip" + ); + } + + #[test] + fn upload_sync_tasks_merge() { + let metadata_1 = dummy_metadata(Lsn(1)); + let metadata_2 = dummy_metadata(Lsn(2)); + assert!(metadata_2.disk_consistent_lsn() > metadata_1.disk_consistent_lsn()); + +
let upload_1 = SyncTask::Upload(SyncData::new( + 2, + TimelineUpload { + layers_to_upload: HashSet::from([PathBuf::from("one")]), + uploaded_layers: HashSet::from([PathBuf::from("u_one")]), + metadata: metadata_1, + }, + )); + let upload_2 = SyncTask::Upload(SyncData::new( + 6, + TimelineUpload { + layers_to_upload: HashSet::from([PathBuf::from("two"), PathBuf::from("three")]), + uploaded_layers: HashSet::from([PathBuf::from("u_two")]), + metadata: metadata_2.clone(), + }, + )); + + let merged_upload = match upload_1.merge(upload_2) { + SyncTask::Upload(merged_upload) => merged_upload, + wrong_merge_result => panic!("Unexpected merge result: {:?}", wrong_merge_result), + }; + + assert_eq!( + merged_upload.retries, 0, + "Merged task should have its retries counter reset" + ); + + let upload = merged_upload.data; + assert_eq!( + upload.layers_to_upload.into_iter().collect::>(), + BTreeSet::from([ + PathBuf::from("one"), + PathBuf::from("two"), + PathBuf::from("three") + ]), + "Merged upload tasks should a combined set of layers to upload" + ); + + assert_eq!( + upload.uploaded_layers.into_iter().collect::>(), + BTreeSet::from([PathBuf::from("u_one"), PathBuf::from("u_two"),]), + "Merged upload tasks should a combined set of uploaded layers" + ); + + assert_eq!( + upload.metadata, metadata_2, + "Merged upload tasks should have a metadata with biggest disk_consistent_lsn" + ); + } + + #[test] + fn upload_and_download_sync_tasks_merge() { + let download_data = SyncData::new( + 3, + TimelineDownload { + layers_to_skip: HashSet::from([PathBuf::from("d_one")]), + }, + ); + + let upload_data = SyncData::new( + 2, + TimelineUpload { + layers_to_upload: HashSet::from([PathBuf::from("u_one")]), + uploaded_layers: HashSet::from([PathBuf::from("u_one_2")]), + metadata: dummy_metadata(Lsn(1)), + }, + ); + + let (merged_download, merged_upload) = match SyncTask::Download(download_data.clone()) + .merge(SyncTask::Upload(upload_data.clone())) + { + 
SyncTask::DownloadAndUpload(merged_download, merged_upload) => { + (merged_download, merged_upload) + } + wrong_merge_result => panic!("Unexpected merge result: {:?}", wrong_merge_result), + }; + + assert_eq!( + merged_download, download_data, + "When upload and dowload are merged, both should be unchanged" + ); + assert_eq!( + merged_upload, upload_data, + "When upload and dowload are merged, both should be unchanged" + ); + } + + #[test] + fn uploaddownload_and_upload_sync_tasks_merge() { + let download_data = SyncData::new( + 3, + TimelineDownload { + layers_to_skip: HashSet::from([PathBuf::from("d_one")]), + }, + ); + + let metadata_1 = dummy_metadata(Lsn(5)); + let metadata_2 = dummy_metadata(Lsn(2)); + assert!(metadata_1.disk_consistent_lsn() > metadata_2.disk_consistent_lsn()); + + let upload_download = SyncTask::DownloadAndUpload( + download_data.clone(), + SyncData::new( + 2, + TimelineUpload { + layers_to_upload: HashSet::from([PathBuf::from("one")]), + uploaded_layers: HashSet::from([PathBuf::from("u_one")]), + metadata: metadata_1.clone(), + }, + ), + ); + + let new_upload = SyncTask::Upload(SyncData::new( + 6, + TimelineUpload { + layers_to_upload: HashSet::from([PathBuf::from("two"), PathBuf::from("three")]), + uploaded_layers: HashSet::from([PathBuf::from("u_two")]), + metadata: metadata_2, + }, + )); + + let (merged_download, merged_upload) = match upload_download.merge(new_upload) { + SyncTask::DownloadAndUpload(merged_download, merged_upload) => { + (merged_download, merged_upload) + } + wrong_merge_result => panic!("Unexpected merge result: {:?}", wrong_merge_result), + }; + + assert_eq!( + merged_download, download_data, + "When uploaddowload and upload tasks are merged, download should be unchanged" + ); + + assert_eq!( + merged_upload.retries, 0, + "Merged task should have its retries counter reset" + ); + let upload = merged_upload.data; + assert_eq!( + upload.layers_to_upload.into_iter().collect::>(), + BTreeSet::from([ + 
PathBuf::from("one"), + PathBuf::from("two"), + PathBuf::from("three") + ]), + "Merged upload tasks should a combined set of layers to upload" + ); + + assert_eq!( + upload.uploaded_layers.into_iter().collect::>(), + BTreeSet::from([PathBuf::from("u_one"), PathBuf::from("u_two"),]), + "Merged upload tasks should a combined set of uploaded layers" + ); + + assert_eq!( + upload.metadata, metadata_1, + "Merged upload tasks should have a metadata with biggest disk_consistent_lsn" + ); + } + + #[test] + fn uploaddownload_and_download_sync_tasks_merge() { + let upload_data = SyncData::new( + 22, + TimelineUpload { + layers_to_upload: HashSet::from([PathBuf::from("one")]), + uploaded_layers: HashSet::from([PathBuf::from("u_one")]), + metadata: dummy_metadata(Lsn(22)), + }, + ); + + let upload_download = SyncTask::DownloadAndUpload( + SyncData::new( + 2, + TimelineDownload { + layers_to_skip: HashSet::from([PathBuf::from("one")]), + }, + ), + upload_data.clone(), + ); + + let new_download = SyncTask::Download(SyncData::new( + 6, + TimelineDownload { + layers_to_skip: HashSet::from([PathBuf::from("two"), PathBuf::from("three")]), + }, + )); + + let (merged_download, merged_upload) = match upload_download.merge(new_download) { + SyncTask::DownloadAndUpload(merged_download, merged_upload) => { + (merged_download, merged_upload) + } + wrong_merge_result => panic!("Unexpected merge result: {:?}", wrong_merge_result), + }; + + assert_eq!( + merged_upload, upload_data, + "When uploaddowload and download tasks are merged, upload should be unchanged" + ); + + assert_eq!( + merged_download.retries, 0, + "Merged task should have its retries counter reset" + ); + assert_eq!( + merged_download + .data + .layers_to_skip + .into_iter() + .collect::>(), + BTreeSet::from([ + PathBuf::from("one"), + PathBuf::from("two"), + PathBuf::from("three") + ]), + "Merged download tasks should a combined set of layers to skip" + ); + } + + #[test] + fn uploaddownload_sync_tasks_merge() { + let 
metadata_1 = dummy_metadata(Lsn(1)); + let metadata_2 = dummy_metadata(Lsn(2)); + assert!(metadata_2.disk_consistent_lsn() > metadata_1.disk_consistent_lsn()); + + let upload_download = SyncTask::DownloadAndUpload( + SyncData::new( + 2, + TimelineDownload { + layers_to_skip: HashSet::from([PathBuf::from("one")]), + }, + ), + SyncData::new( + 2, + TimelineUpload { + layers_to_upload: HashSet::from([PathBuf::from("one")]), + uploaded_layers: HashSet::from([PathBuf::from("u_one")]), + metadata: metadata_1, + }, + ), + ); + let new_upload_download = SyncTask::DownloadAndUpload( + SyncData::new( + 6, + TimelineDownload { + layers_to_skip: HashSet::from([PathBuf::from("two"), PathBuf::from("three")]), + }, + ), + SyncData::new( + 6, + TimelineUpload { + layers_to_upload: HashSet::from([PathBuf::from("two"), PathBuf::from("three")]), + uploaded_layers: HashSet::from([PathBuf::from("u_two")]), + metadata: metadata_2.clone(), + }, + ), + ); + + let (merged_download, merged_upload) = match upload_download.merge(new_upload_download) { + SyncTask::DownloadAndUpload(merged_download, merged_upload) => { + (merged_download, merged_upload) + } + wrong_merge_result => panic!("Unexpected merge result: {:?}", wrong_merge_result), + }; + + assert_eq!( + merged_download.retries, 0, + "Merged task should have its retries counter reset" + ); + assert_eq!( + merged_download + .data + .layers_to_skip + .into_iter() + .collect::>(), + BTreeSet::from([ + PathBuf::from("one"), + PathBuf::from("two"), + PathBuf::from("three") + ]), + "Merged download tasks should a combined set of layers to skip" + ); + + assert_eq!( + merged_upload.retries, 0, + "Merged task should have its retries counter reset" + ); + let upload = merged_upload.data; + assert_eq!( + upload.layers_to_upload.into_iter().collect::>(), + BTreeSet::from([ + PathBuf::from("one"), + PathBuf::from("two"), + PathBuf::from("three") + ]), + "Merged upload tasks should a combined set of layers to upload" + ); + + assert_eq!( + 
upload.uploaded_layers.into_iter().collect::>(), + BTreeSet::from([PathBuf::from("u_one"), PathBuf::from("u_two"),]), + "Merged upload tasks should a combined set of uploaded layers" + ); + + assert_eq!( + upload.metadata, metadata_2, + "Merged upload tasks should have a metadata with biggest disk_consistent_lsn" + ); + } +} diff --git a/pageserver/src/remote_storage/storage_sync/compression.rs b/pageserver/src/remote_storage/storage_sync/compression.rs deleted file mode 100644 index 511f79e0cf..0000000000 --- a/pageserver/src/remote_storage/storage_sync/compression.rs +++ /dev/null @@ -1,612 +0,0 @@ -//! A set of structs to represent a compressed part of the timeline, and methods to asynchronously compress and uncompress a stream of data, -//! without holding the entire data in memory. -//! For the latter, both compress and uncompress functions operate buffered streams (currently hardcoded size of [`ARCHIVE_STREAM_BUFFER_SIZE_BYTES`]), -//! not attempting to hold the entire archive in memory. -//! -//! The compression is done with zstd streaming algorithm via the `async-compression` crate. -//! The crate does not contain any knobs to tweak the compression, but otherwise is one of the only ones that's both async and has an API to manage the part of an archive. -//! Zstd was picked as the best algorithm among the ones available in the crate, after testing the initial timeline file compression. -//! -//! Archiving is almost agnostic to timeline file types, with an exception of the metadata file, that's currently distinguished in the [un]compression code. -//! The metadata file is treated separately when [de]compression is involved, to reduce the risk of corrupting the metadata file. -//! When compressed, the metadata file is always required and stored as the last file in the archive stream. -//! When uncompressed, the metadata file gets naturally uncompressed last, to ensure that all other layer files are decompressed successfully first. -//! -//! 
Archive structure: -//! +----------------------------------------+ -//! | header | file_1, ..., file_k, metadata | -//! +----------------------------------------+ -//! -//! The archive consists of two separate zstd archives: -//! * header archive, that contains all files names and their sizes and relative paths in the timeline directory -//! Header is a Rust structure, serialized into bytes and compressed with zstd. -//! * files archive, that has metadata file as the last one, all compressed with zstd into a single binary blob -//! -//! Header offset is stored in the file name, along with the `disk_consistent_lsn` from the metadata file. -//! See [`parse_archive_name`] and [`ARCHIVE_EXTENSION`] for the name details, example: `00000000016B9150-.zst_9732`. -//! This way, the header could be retrieved without reading an entire archive file. - -use std::{ - collections::BTreeSet, - future::Future, - io::Cursor, - path::{Path, PathBuf}, - sync::Arc, -}; - -use anyhow::{bail, ensure, Context}; -use async_compression::tokio::bufread::{ZstdDecoder, ZstdEncoder}; -use serde::{Deserialize, Serialize}; -use tokio::{ - fs, - io::{self, AsyncReadExt, AsyncWriteExt}, -}; -use tracing::*; -use zenith_utils::{bin_ser::BeSer, lsn::Lsn}; - -use crate::layered_repository::metadata::{TimelineMetadata, METADATA_FILE_NAME}; - -use super::index::RelativePath; - -#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)] -pub struct ArchiveHeader { - /// All regular timeline files, excluding the metadata file. - pub files: Vec, - // Metadata file name is known to the system, as its location relative to the timeline dir, - // so no need to store anything but its size in bytes. - pub metadata_file_size: u64, -} - -#[derive(Clone, Debug, Serialize, Deserialize, PartialEq, Eq, Hash)] -pub struct FileEntry { - /// Uncompressed file size, bytes. - pub size: u64, - /// A path, relative to the directory root, used when compressing the directory contents. 
- pub subpath: RelativePath, -} - -const ARCHIVE_EXTENSION: &str = "-.zst_"; -const ARCHIVE_STREAM_BUFFER_SIZE_BYTES: usize = 4 * 1024 * 1024; - -/// Streams an archive of files given into a stream target, defined by the closure. -/// -/// The closure approach is picked for cases like S3, where we would need a name of the file before we can get a stream to write the bytes into. -/// Current idea is to place the header size in the name of the file, to enable the fast partial remote file index restoration without actually reading remote storage file contents. -/// -/// Performs the compression in multiple steps: -/// * prepares an archive header, stripping the `source_dir` prefix from the `files` -/// * generates the name of the archive -/// * prepares archive producer future, knowing the header and the file list -/// An `impl AsyncRead` and `impl AsyncWrite` pair of connected streams is created to implement the partial contents streaming. -/// The writer end gets into the archive producer future, to put the header and a stream of compressed files. -/// * prepares archive consumer future, by executing the provided closure -/// The closure gets the reader end stream and the name of the file to create a future that would stream the file contents elsewhere. -/// * runs and waits for both futures to complete -/// * on a successful completion of both futures, header, its size and the user-defined consumer future return data is returned -/// Due to the design above, the archive name and related data is visible inside the consumer future only, so it's possible to return the data, -/// needed for future processing. 
-pub async fn archive_files_as_stream( - source_dir: &Path, - files: impl Iterator, - metadata: &TimelineMetadata, - create_archive_consumer: Cons, -) -> anyhow::Result<(ArchiveHeader, u64, ConsRet)> -where - Cons: FnOnce(Box, String) -> Fut - + Send - + 'static, - Fut: Future> + Send + 'static, - ConsRet: Send + Sync + 'static, -{ - let metadata_bytes = metadata - .to_bytes() - .context("Failed to create metadata bytes")?; - let (archive_header, compressed_header_bytes) = - prepare_header(source_dir, files, &metadata_bytes) - .await - .context("Failed to prepare file for archivation")?; - - let header_size = compressed_header_bytes.len() as u64; - let (write, read) = io::duplex(ARCHIVE_STREAM_BUFFER_SIZE_BYTES); - let archive_filler = write_archive_contents( - source_dir.to_path_buf(), - archive_header.clone(), - metadata_bytes, - write, - ); - let archive_name = archive_name(metadata.disk_consistent_lsn(), header_size); - let archive_stream = - Cursor::new(compressed_header_bytes).chain(ZstdEncoder::new(io::BufReader::new(read))); - - let (archive_creation_result, archive_upload_result) = tokio::join!( - tokio::spawn(archive_filler), - tokio::spawn(async move { - create_archive_consumer(Box::new(archive_stream), archive_name).await - }) - ); - archive_creation_result - .context("Failed to spawn archive creation future")? - .context("Failed to create an archive")?; - let upload_return_value = archive_upload_result - .context("Failed to spawn archive upload future")? - .context("Failed to upload the archive")?; - - Ok((archive_header, header_size, upload_return_value)) -} - -/// Similar to [`archive_files_as_stream`], creates a pair of streams to uncompress the 2nd part of the archive, -/// that contains files and is located after the header. -/// S3 allows downloading partial file contents for a given file key (i.e. name), to accommodate this retrieval, -/// a closure is used. 
-/// Same concepts with two concurrent futures, user-defined closure, future and return value apply here, but the -/// consumer and the receiver ends are swapped, since the uncompression happens. -pub async fn uncompress_file_stream_with_index( - destination_dir: PathBuf, - files_to_skip: Arc>, - disk_consistent_lsn: Lsn, - header: ArchiveHeader, - header_size: u64, - create_archive_file_part: Prod, -) -> anyhow::Result -where - Prod: FnOnce(Box, String) -> Fut - + Send - + 'static, - Fut: Future> + Send + 'static, - ProdRet: Send + Sync + 'static, -{ - let (write, mut read) = io::duplex(ARCHIVE_STREAM_BUFFER_SIZE_BYTES); - let archive_name = archive_name(disk_consistent_lsn, header_size); - - let (archive_download_result, archive_uncompress_result) = tokio::join!( - tokio::spawn(async move { create_archive_file_part(Box::new(write), archive_name).await }), - tokio::spawn(async move { - uncompress_with_header(&files_to_skip, &destination_dir, header, &mut read).await - }) - ); - - let download_value = archive_download_result - .context("Failed to spawn archive download future")? - .context("Failed to download an archive")?; - archive_uncompress_result - .context("Failed to spawn archive uncompress future")? 
- .context("Failed to uncompress the archive")?; - - Ok(download_value) -} - -/// Reads archive header from the stream given: -/// * parses the file name to get the header size -/// * reads the exact amount of bytes -/// * uncompresses and deserializes those -pub async fn read_archive_header( - archive_name: &str, - from: &mut A, -) -> anyhow::Result { - let (_, header_size) = parse_archive_name(Path::new(archive_name))?; - - let mut compressed_header_bytes = vec![0; header_size as usize]; - from.read_exact(&mut compressed_header_bytes) - .await - .with_context(|| { - format!( - "Failed to read header header from the archive {}", - archive_name - ) - })?; - - let mut header_bytes = Vec::new(); - ZstdDecoder::new(io::BufReader::new(compressed_header_bytes.as_slice())) - .read_to_end(&mut header_bytes) - .await - .context("Failed to decompress a header from the archive")?; - - ArchiveHeader::des(&header_bytes).context("Failed to deserialize a header from the archive") -} - -/// Reads the archive metadata out of the archive name: -/// * `disk_consistent_lsn` of the checkpoint that was archived -/// * size of the archive header -pub fn parse_archive_name(archive_path: &Path) -> anyhow::Result<(Lsn, u64)> { - let archive_name = archive_path - .file_name() - .with_context(|| format!("Archive '{}' has no file name", archive_path.display()))? 
- .to_string_lossy(); - let (lsn_str, header_size_str) = - archive_name - .rsplit_once(ARCHIVE_EXTENSION) - .with_context(|| { - format!( - "Archive '{}' has incorrect extension, expected to contain '{}'", - archive_path.display(), - ARCHIVE_EXTENSION - ) - })?; - let disk_consistent_lsn = Lsn::from_hex(lsn_str).with_context(|| { - format!( - "Archive '{}' has an invalid disk consistent lsn in its extension", - archive_path.display(), - ) - })?; - let header_size = header_size_str.parse::().with_context(|| { - format!( - "Archive '{}' has an invalid a header offset number in its extension", - archive_path.display(), - ) - })?; - Ok((disk_consistent_lsn, header_size)) -} - -fn archive_name(disk_consistent_lsn: Lsn, header_size: u64) -> String { - let archive_name = format!( - "{:016X}{ARCHIVE_EXTENSION}{}", - u64::from(disk_consistent_lsn), - header_size, - ARCHIVE_EXTENSION = ARCHIVE_EXTENSION, - ); - archive_name -} - -pub async fn uncompress_with_header( - files_to_skip: &BTreeSet, - destination_dir: &Path, - header: ArchiveHeader, - archive_after_header: impl io::AsyncRead + Send + Sync + Unpin, -) -> anyhow::Result<()> { - debug!("Uncompressing archive into {}", destination_dir.display()); - let mut archive = ZstdDecoder::new(io::BufReader::new(archive_after_header)); - - if !destination_dir.exists() { - fs::create_dir_all(&destination_dir) - .await - .with_context(|| { - format!( - "Failed to create target directory at {}", - destination_dir.display() - ) - })?; - } else if !destination_dir.is_dir() { - bail!( - "Destination path '{}' is not a valid directory", - destination_dir.display() - ); - } - debug!("Will extract {} files from the archive", header.files.len()); - for entry in header.files { - uncompress_entry( - &mut archive, - &entry.subpath.as_path(destination_dir), - entry.size, - files_to_skip, - ) - .await - .with_context(|| format!("Failed to uncompress archive entry {:?}", entry))?; - } - uncompress_entry( - &mut archive, - 
&destination_dir.join(METADATA_FILE_NAME), - header.metadata_file_size, - files_to_skip, - ) - .await - .context("Failed to uncompress the metadata entry")?; - Ok(()) -} - -async fn uncompress_entry( - archive: &mut ZstdDecoder>, - destination_path: &Path, - entry_size: u64, - files_to_skip: &BTreeSet, -) -> anyhow::Result<()> { - if let Some(parent) = destination_path.parent() { - fs::create_dir_all(parent).await.with_context(|| { - format!( - "Failed to create parent directory for {}", - destination_path.display() - ) - })?; - }; - - if files_to_skip.contains(destination_path) { - debug!("Skipping {}", destination_path.display()); - copy_n_bytes(entry_size, archive, &mut io::sink()) - .await - .context("Failed to skip bytes in the archive")?; - return Ok(()); - } - - let mut destination = - io::BufWriter::new(fs::File::create(&destination_path).await.with_context(|| { - format!( - "Failed to open file {} for extraction", - destination_path.display() - ) - })?); - copy_n_bytes(entry_size, archive, &mut destination) - .await - .with_context(|| { - format!( - "Failed to write extracted archive contents into file {}", - destination_path.display() - ) - })?; - destination - .flush() - .await - .context("Failed to flush the streaming archive bytes")?; - Ok(()) -} - -async fn write_archive_contents( - source_dir: PathBuf, - header: ArchiveHeader, - metadata_bytes: Vec, - mut archive_input: io::DuplexStream, -) -> anyhow::Result<()> { - debug!("Starting writing files into archive"); - for file_entry in header.files { - let path = file_entry.subpath.as_path(&source_dir); - let mut source_file = - io::BufReader::new(fs::File::open(&path).await.with_context(|| { - format!( - "Failed to open file for archiving to path {}", - path.display() - ) - })?); - let bytes_written = io::copy(&mut source_file, &mut archive_input) - .await - .with_context(|| { - format!( - "Failed to open add a file into archive, file path {}", - path.display() - ) - })?; - ensure!( - file_entry.size == 
bytes_written, - "File {} was written to the archive incompletely", - path.display() - ); - trace!( - "Added file '{}' ({} bytes) into the archive", - path.display(), - bytes_written - ); - } - let metadata_bytes_written = io::copy(&mut metadata_bytes.as_slice(), &mut archive_input) - .await - .context("Failed to add metadata into the archive")?; - ensure!( - header.metadata_file_size == metadata_bytes_written, - "Metadata file was written to the archive incompletely", - ); - - archive_input - .shutdown() - .await - .context("Failed to finalize the archive")?; - debug!("Successfully streamed all files into the archive"); - Ok(()) -} - -async fn prepare_header( - source_dir: &Path, - files: impl Iterator, - metadata_bytes: &[u8], -) -> anyhow::Result<(ArchiveHeader, Vec)> { - let mut archive_files = Vec::new(); - for file_path in files { - let file_metadata = fs::metadata(file_path).await.with_context(|| { - format!( - "Failed to read metadata during archive indexing for {}", - file_path.display() - ) - })?; - ensure!( - file_metadata.is_file(), - "Archive indexed path {} is not a file", - file_path.display() - ); - - if file_path.file_name().and_then(|name| name.to_str()) != Some(METADATA_FILE_NAME) { - let entry = FileEntry { - subpath: RelativePath::new(source_dir, file_path).with_context(|| { - format!( - "File '{}' does not belong to pageserver workspace", - file_path.display() - ) - })?, - size: file_metadata.len(), - }; - archive_files.push(entry); - } - } - - let header = ArchiveHeader { - files: archive_files, - metadata_file_size: metadata_bytes.len() as u64, - }; - - debug!("Appending a header for {} files", header.files.len()); - let header_bytes = header.ser().context("Failed to serialize a header")?; - debug!("Header bytes len {}", header_bytes.len()); - let mut compressed_header_bytes = Vec::new(); - ZstdEncoder::new(io::BufReader::new(header_bytes.as_slice())) - .read_to_end(&mut compressed_header_bytes) - .await - .context("Failed to compress header 
bytes")?; - debug!( - "Compressed header bytes len {}", - compressed_header_bytes.len() - ); - Ok((header, compressed_header_bytes)) -} - -async fn copy_n_bytes( - n: u64, - from: &mut (impl io::AsyncRead + Send + Sync + Unpin), - into: &mut (impl io::AsyncWrite + Send + Sync + Unpin), -) -> anyhow::Result<()> { - let bytes_written = io::copy(&mut from.take(n), into).await?; - ensure!( - bytes_written == n, - "Failed to read exactly {} bytes from the input, bytes written: {}", - n, - bytes_written, - ); - Ok(()) -} - -#[cfg(test)] -mod tests { - use tokio::{fs, io::AsyncSeekExt}; - - use crate::repository::repo_harness::{RepoHarness, TIMELINE_ID}; - - use super::*; - - #[tokio::test] - async fn compress_and_uncompress() -> anyhow::Result<()> { - let repo_harness = RepoHarness::create("compress_and_uncompress")?; - let timeline_dir = repo_harness.timeline_path(&TIMELINE_ID); - init_directory( - &timeline_dir, - vec![ - ("first", "first_contents"), - ("second", "second_contents"), - (METADATA_FILE_NAME, "wrong_metadata"), - ], - ) - .await?; - let timeline_files = list_file_paths_with_contents(&timeline_dir).await?; - assert_eq!( - timeline_files, - vec![ - ( - timeline_dir.join("first"), - FileContents::Text("first_contents".to_string()) - ), - ( - timeline_dir.join(METADATA_FILE_NAME), - FileContents::Text("wrong_metadata".to_string()) - ), - ( - timeline_dir.join("second"), - FileContents::Text("second_contents".to_string()) - ), - ], - "Initial timeline contents should contain two normal files and a wrong metadata file" - ); - - let metadata = TimelineMetadata::new(Lsn(0x30), None, None, Lsn(0), Lsn(0), Lsn(0)); - let paths_to_archive = timeline_files - .into_iter() - .map(|(path, _)| path) - .collect::>(); - - let tempdir = tempfile::tempdir()?; - let base_path = tempdir.path().to_path_buf(); - let (header, header_size, archive_target) = archive_files_as_stream( - &timeline_dir, - paths_to_archive.iter(), - &metadata, - move |mut archive_streamer, archive_name| 
async move { - let archive_target = base_path.join(&archive_name); - let mut archive_file = fs::File::create(&archive_target).await?; - io::copy(&mut archive_streamer, &mut archive_file).await?; - Ok(archive_target) - }, - ) - .await?; - - let mut file = fs::File::open(&archive_target).await?; - file.seek(io::SeekFrom::Start(header_size)).await?; - let target_dir = tempdir.path().join("extracted"); - uncompress_with_header(&BTreeSet::new(), &target_dir, header, file).await?; - - let extracted_files = list_file_paths_with_contents(&target_dir).await?; - - assert_eq!( - extracted_files, - vec![ - ( - target_dir.join("first"), - FileContents::Text("first_contents".to_string()) - ), - ( - target_dir.join(METADATA_FILE_NAME), - FileContents::Binary(metadata.to_bytes()?) - ), - ( - target_dir.join("second"), - FileContents::Text("second_contents".to_string()) - ), - ], - "Extracted files should contain all local timeline files besides its metadata, which should be taken from the arguments" - ); - - Ok(()) - } - - async fn init_directory( - root: &Path, - files_with_contents: Vec<(&str, &str)>, - ) -> anyhow::Result<()> { - fs::create_dir_all(root).await?; - for (file_name, contents) in files_with_contents { - fs::File::create(root.join(file_name)) - .await? 
- .write_all(contents.as_bytes()) - .await?; - } - Ok(()) - } - - #[derive(PartialEq, Eq, PartialOrd, Ord)] - enum FileContents { - Text(String), - Binary(Vec), - } - - impl std::fmt::Debug for FileContents { - fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { - match self { - Self::Text(text) => f.debug_tuple("Text").field(text).finish(), - Self::Binary(bytes) => f - .debug_tuple("Binary") - .field(&format!("{} bytes", bytes.len())) - .finish(), - } - } - } - - async fn list_file_paths_with_contents( - root: &Path, - ) -> anyhow::Result> { - let mut file_paths = Vec::new(); - - let mut dir_listings = vec![fs::read_dir(root).await?]; - while let Some(mut dir_listing) = dir_listings.pop() { - while let Some(entry) = dir_listing.next_entry().await? { - let entry_path = entry.path(); - if entry_path.is_file() { - let contents = match String::from_utf8(fs::read(&entry_path).await?) { - Ok(text) => FileContents::Text(text), - Err(e) => FileContents::Binary(e.into_bytes()), - }; - file_paths.push((entry_path, contents)); - } else if entry_path.is_dir() { - dir_listings.push(fs::read_dir(entry_path).await?); - } else { - info!( - "Skipping path '{}' as it's not a file or a directory", - entry_path.display() - ); - } - } - } - - file_paths.sort(); - Ok(file_paths) - } -} diff --git a/pageserver/src/remote_storage/storage_sync/download.rs b/pageserver/src/remote_storage/storage_sync/download.rs index e5aa74452b..81ed649c8a 100644 --- a/pageserver/src/remote_storage/storage_sync/download.rs +++ b/pageserver/src/remote_storage/storage_sync/download.rs @@ -1,30 +1,76 @@ -//! Timeline synchrnonization logic to put files from archives on remote storage into pageserver's local directory. +//! Timeline synchrnonization logic to fetch the layer files from remote storage into pageserver's local directory. 
-use std::{collections::BTreeSet, path::PathBuf, sync::Arc}; +use std::fmt::Debug; -use anyhow::{ensure, Context}; +use anyhow::Context; +use futures::stream::{FuturesUnordered, StreamExt}; use tokio::fs; use tracing::{debug, error, trace, warn}; -use zenith_utils::zid::ZTenantId; use crate::{ config::PageServerConf, - layered_repository::metadata::{metadata_path, TimelineMetadata}, + layered_repository::metadata::metadata_path, remote_storage::{ - storage_sync::{ - compression, fetch_full_index, index::TimelineIndexEntryInner, sync_queue, SyncKind, - SyncTask, - }, + storage_sync::{sync_queue, SyncTask}, RemoteStorage, ZTenantTimelineId, }, }; use super::{ - index::{ArchiveId, RemoteTimeline}, - RemoteIndex, TimelineDownload, + index::{IndexPart, RemoteTimeline}, + SyncData, TimelineDownload, }; +/// Retrieves index data from the remote storage for a given timeline. +pub async fn download_index_part( + conf: &'static PageServerConf, + storage: &S, + sync_id: ZTenantTimelineId, +) -> anyhow::Result +where + P: Debug + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, +{ + let index_part_path = metadata_path(conf, sync_id.timeline_id, sync_id.tenant_id) + .with_file_name(IndexPart::FILE_NAME) + .with_extension(IndexPart::FILE_EXTENSION); + let part_storage_path = storage.storage_path(&index_part_path).with_context(|| { + format!( + "Failed to get the index part storage path for local path '{}'", + index_part_path.display() + ) + })?; + let mut index_part_bytes = Vec::new(); + storage + .download(&part_storage_path, &mut index_part_bytes) + .await + .with_context(|| { + format!( + "Failed to download an index part from storage path '{:?}'", + part_storage_path + ) + })?; + + let index_part: IndexPart = serde_json::from_slice(&index_part_bytes).with_context(|| { + format!( + "Failed to deserialize index part file from storage path '{:?}'", + part_storage_path + ) + })?; + + let missing_files = index_part.missing_files(); + if !missing_files.is_empty() 
{ + warn!( + "Found missing layers in index part for timeline {}: {:?}", + sync_id, missing_files + ); + } + + Ok(index_part) +} + /// Timeline download result, with extra data, needed for downloading. +#[derive(Debug)] pub(super) enum DownloadedTimeline { /// Remote timeline data is either absent or corrupt, no download possible. Abort, @@ -33,222 +79,136 @@ pub(super) enum DownloadedTimeline { FailedAndRescheduled, /// Remote timeline data is found, its latest checkpoint's metadata contents (disk_consistent_lsn) is known. /// Initial download successful. - Successful, + Successful(SyncData), } -/// Attempts to download and uncompress files from all remote archives for the timeline given. +/// Attempts to download all given timeline's layers. /// Timeline files that already exist locally are skipped during the download, but the local metadata file is -/// updated in the end of every checkpoint archive extraction. +/// updated in the end, if the remote one contains a newer disk_consistent_lsn. /// -/// On an error, bumps the retries count and reschedules the download, with updated archive skip list -/// (for any new successful archive downloads and extractions). -pub(super) async fn download_timeline< - P: std::fmt::Debug + Send + Sync + 'static, - S: RemoteStorage + Send + Sync + 'static, ->( - conf: &'static PageServerConf, - remote_assets: Arc<(S, RemoteIndex)>, +/// On an error, bumps the retries count and updates the files to skip with successful downloads, rescheduling the task. 
+pub(super) async fn download_timeline_layers<'a, P, S>( + storage: &'a S, + remote_timeline: Option<&'a RemoteTimeline>, sync_id: ZTenantTimelineId, - mut download: TimelineDownload, - retries: u32, -) -> DownloadedTimeline { - debug!("Downloading layers for sync id {}", sync_id); - - let ZTenantTimelineId { - tenant_id, - timeline_id, - } = sync_id; - let index = &remote_assets.1; - - let index_read = index.read().await; - let remote_timeline = match index_read.timeline_entry(&sync_id) { + mut download_data: SyncData, +) -> DownloadedTimeline +where + P: Debug + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, +{ + let remote_timeline = match remote_timeline { + Some(remote_timeline) => { + if !remote_timeline.awaits_download { + error!("Timeline with sync id {} is not awaiting download", sync_id); + return DownloadedTimeline::Abort; + } + remote_timeline + } None => { - error!("Cannot download: no timeline is present in the index for given id"); - drop(index_read); + error!( + "Timeline with sync id {} is not present in the remote index", + sync_id + ); return DownloadedTimeline::Abort; } - - Some(index_entry) => match index_entry.inner() { - TimelineIndexEntryInner::Full(remote_timeline) => { - let cloned = remote_timeline.clone(); - drop(index_read); - cloned - } - TimelineIndexEntryInner::Description(_) => { - // we do not check here for awaits_download because it is ok - // to call this function while the download is in progress - // so it is not a concurrent download, it is the same one - - let remote_disk_consistent_lsn = index_entry.disk_consistent_lsn(); - drop(index_read); - debug!("Found timeline description for the given ids, downloading the full index"); - match fetch_full_index( - remote_assets.as_ref(), - &conf.timeline_path(&timeline_id, &tenant_id), - sync_id, - ) - .await - { - Ok(remote_timeline) => remote_timeline, - Err(e) => { - error!("Failed to download full timeline index: {:?}", e); - - return match 
remote_disk_consistent_lsn { - Some(_) => { - sync_queue::push(SyncTask::new( - sync_id, - retries, - SyncKind::Download(download), - )); - DownloadedTimeline::FailedAndRescheduled - } - None => { - error!("Cannot download: no disk consistent Lsn is present for the index entry"); - DownloadedTimeline::Abort - } - }; - } - } - } - }, - }; - if remote_timeline.checkpoints().max().is_none() { - debug!("Cannot download: no disk consistent Lsn is present for the remote timeline"); - return DownloadedTimeline::Abort; }; - debug!("Downloading timeline archives"); - let archives_to_download = remote_timeline - .checkpoints() - .map(ArchiveId) - .filter(|remote_archive| !download.archives_to_skip.contains(remote_archive)) + debug!("Downloading timeline layers for sync id {}", sync_id); + let download = &mut download_data.data; + + let layers_to_download = remote_timeline + .stored_files() + .difference(&download.layers_to_skip) + .cloned() .collect::>(); - let archives_total = archives_to_download.len(); - debug!("Downloading {} archives of a timeline", archives_total); - trace!("Archives to download: {:?}", archives_to_download); + trace!("Layers to download: {:?}", layers_to_download); - for (archives_downloaded, archive_id) in archives_to_download.into_iter().enumerate() { - match try_download_archive( - conf, - sync_id, - Arc::clone(&remote_assets), - &remote_timeline, - archive_id, - Arc::clone(&download.files_to_skip), - ) - .await - { - Err(e) => { - let archives_left = archives_total - archives_downloaded; - error!( - "Failed to download archive {:?} (archives downloaded: {}; archives left: {}) for tenant {} timeline {}, requeueing the download: {:?}", - archive_id, archives_downloaded, archives_left, tenant_id, timeline_id, e + let mut download_tasks = layers_to_download + .into_iter() + .map(|layer_desination_path| async move { + if layer_desination_path.exists() { + debug!( + "Layer already exists locally, skipping download: {}", + layer_desination_path.display() 
); - sync_queue::push(SyncTask::new( - sync_id, - retries, - SyncKind::Download(download), - )); - return DownloadedTimeline::FailedAndRescheduled; + } else { + let layer_storage_path = storage + .storage_path(&layer_desination_path) + .with_context(|| { + format!( + "Failed to get the layer storage path for local path '{}'", + layer_desination_path.display() + ) + })?; + + let mut destination_file = fs::File::create(&layer_desination_path) + .await + .with_context(|| { + format!( + "Failed to create a destination file for layer '{}'", + layer_desination_path.display() + ) + })?; + + storage + .download(&layer_storage_path, &mut destination_file) + .await + .with_context(|| { + format!( + "Failed to download a layer from storage path '{:?}'", + layer_storage_path + ) + })?; } - Ok(()) => { - debug!("Successfully downloaded archive {:?}", archive_id); - download.archives_to_skip.insert(archive_id); + Ok::<_, anyhow::Error>(layer_desination_path) + }) + .collect::>(); + + debug!("Downloading {} layers of a timeline", download_tasks.len()); + + let mut errors_happened = false; + while let Some(download_result) = download_tasks.next().await { + match download_result { + Ok(downloaded_path) => { + download.layers_to_skip.insert(downloaded_path); + } + Err(e) => { + errors_happened = true; + error!( + "Failed to download a layer for timeline {}: {:?}", + sync_id, e + ); } } } - debug!("Finished downloading all timeline's archives"); - DownloadedTimeline::Successful -} - -async fn try_download_archive< - P: Send + Sync + 'static, - S: RemoteStorage + Send + Sync + 'static, ->( - conf: &'static PageServerConf, - ZTenantTimelineId { - tenant_id, - timeline_id, - }: ZTenantTimelineId, - remote_assets: Arc<(S, RemoteIndex)>, - remote_timeline: &RemoteTimeline, - archive_id: ArchiveId, - files_to_skip: Arc>, -) -> anyhow::Result<()> { - debug!("Downloading archive {:?}", archive_id); - let archive_to_download = remote_timeline - .archive_data(archive_id) - .with_context(|| 
format!("Archive {:?} not found in remote storage", archive_id))?; - let (archive_header, header_size) = remote_timeline - .restore_header(archive_id) - .context("Failed to restore header when downloading an archive")?; - - match read_local_metadata(conf, timeline_id, tenant_id).await { - Ok(local_metadata) => ensure!( - // need to allow `<=` instead of `<` due to cases when a failed archive can be redownloaded - local_metadata.disk_consistent_lsn() <= archive_to_download.disk_consistent_lsn(), - "Cannot download archive with Lsn {} since it's earlier than local Lsn {}", - archive_to_download.disk_consistent_lsn(), - local_metadata.disk_consistent_lsn() - ), - Err(e) => warn!("Failed to read local metadata file, assuming it's safe to override its with the download. Read: {:#}", e), + if errors_happened { + debug!("Reenqueuing failed download task for timeline {}", sync_id); + download_data.retries += 1; + sync_queue::push(sync_id, SyncTask::Download(download_data)); + DownloadedTimeline::FailedAndRescheduled + } else { + debug!("Finished downloading all timeline's layers"); + DownloadedTimeline::Successful(download_data) } - compression::uncompress_file_stream_with_index( - conf.timeline_path(&timeline_id, &tenant_id), - files_to_skip, - archive_to_download.disk_consistent_lsn(), - archive_header, - header_size, - move |mut archive_target, archive_name| async move { - let archive_local_path = conf - .timeline_path(&timeline_id, &tenant_id) - .join(&archive_name); - let remote_storage = &remote_assets.0; - remote_storage - .download_range( - &remote_storage.storage_path(&archive_local_path)?, - header_size, - None, - &mut archive_target, - ) - .await - }, - ) - .await?; - - Ok(()) -} - -async fn read_local_metadata( - conf: &'static PageServerConf, - timeline_id: zenith_utils::zid::ZTimelineId, - tenant_id: ZTenantId, -) -> anyhow::Result { - let local_metadata_path = metadata_path(conf, timeline_id, tenant_id); - let local_metadata_bytes = 
fs::read(&local_metadata_path) - .await - .context("Failed to read local metadata file bytes")?; - TimelineMetadata::from_bytes(&local_metadata_bytes) - .context("Failed to read local metadata files bytes") } #[cfg(test)] mod tests { - use std::collections::BTreeSet; + use std::collections::{BTreeSet, HashSet}; use tempfile::tempdir; - use tokio::fs; use zenith_utils::lsn::Lsn; use crate::{ remote_storage::{ - local_fs::LocalFs, - storage_sync::test_utils::{ - assert_index_descriptions, assert_timeline_files_match, create_local_timeline, - dummy_metadata, ensure_correct_timeline_upload, expect_timeline, + storage_sync::{ + index::RelativePath, + test_utils::{create_local_timeline, dummy_metadata}, }, + LocalFs, }, repository::repo_harness::{RepoHarness, TIMELINE_ID}, }; @@ -256,80 +216,185 @@ mod tests { use super::*; #[tokio::test] - async fn test_download_timeline() -> anyhow::Result<()> { - let repo_harness = RepoHarness::create("test_download_timeline")?; - let sync_id = ZTenantTimelineId::new(repo_harness.tenant_id, TIMELINE_ID); - let storage = LocalFs::new(tempdir()?.path().to_owned(), &repo_harness.conf.workdir)?; - let index = RemoteIndex::try_parse_descriptions_from_paths( - repo_harness.conf, - storage - .list() - .await? 
- .into_iter() - .map(|storage_path| storage.local_path(&storage_path).unwrap()), + async fn download_timeline() -> anyhow::Result<()> { + let harness = RepoHarness::create("download_timeline")?; + let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID); + let layer_files = ["a", "b", "layer_to_skip", "layer_to_keep_locally"]; + let storage = LocalFs::new(tempdir()?.path().to_path_buf(), &harness.conf.workdir)?; + let current_retries = 3; + let metadata = dummy_metadata(Lsn(0x30)); + let local_timeline_path = harness.timeline_path(&TIMELINE_ID); + let timeline_upload = + create_local_timeline(&harness, TIMELINE_ID, &layer_files, metadata.clone()).await?; + + for local_path in timeline_upload.layers_to_upload { + let remote_path = storage.storage_path(&local_path)?; + let remote_parent_dir = remote_path.parent().unwrap(); + if !remote_parent_dir.exists() { + fs::create_dir_all(&remote_parent_dir).await?; + } + fs::copy(&local_path, &remote_path).await?; + } + let mut read_dir = fs::read_dir(&local_timeline_path).await?; + while let Some(dir_entry) = read_dir.next_entry().await? 
{ + if dir_entry.file_name().to_str() == Some("layer_to_keep_locally") { + continue; + } else { + fs::remove_file(dir_entry.path()).await?; + } + } + + let mut remote_timeline = RemoteTimeline::new(metadata.clone()); + remote_timeline.awaits_download = true; + remote_timeline.add_timeline_layers( + layer_files + .iter() + .map(|layer| local_timeline_path.join(layer)), ); - let remote_assets = Arc::new((storage, index)); - let storage = &remote_assets.0; - let index = &remote_assets.1; - let regular_timeline_path = repo_harness.timeline_path(&TIMELINE_ID); - let regular_timeline = create_local_timeline( - &repo_harness, - TIMELINE_ID, - &["a", "b"], - dummy_metadata(Lsn(0x30)), - )?; - ensure_correct_timeline_upload( - &repo_harness, - Arc::clone(&remote_assets), - TIMELINE_ID, - regular_timeline, - ) - .await; - // upload multiple checkpoints for the same timeline - let regular_timeline = create_local_timeline( - &repo_harness, - TIMELINE_ID, - &["c", "d"], - dummy_metadata(Lsn(0x40)), - )?; - ensure_correct_timeline_upload( - &repo_harness, - Arc::clone(&remote_assets), - TIMELINE_ID, - regular_timeline, - ) - .await; - - fs::remove_dir_all(&regular_timeline_path).await?; - let remote_regular_timeline = expect_timeline(index, sync_id).await; - - download_timeline( - repo_harness.conf, - Arc::clone(&remote_assets), + let download_data = match download_timeline_layers( + &storage, + Some(&remote_timeline), sync_id, - TimelineDownload { - files_to_skip: Arc::new(BTreeSet::new()), - archives_to_skip: BTreeSet::new(), - }, - 0, + SyncData::new( + current_retries, + TimelineDownload { + layers_to_skip: HashSet::from([local_timeline_path.join("layer_to_skip")]), + }, + ), ) - .await; - assert_index_descriptions( - index, - &RemoteIndex::try_parse_descriptions_from_paths( - repo_harness.conf, - remote_assets - .0 - .list() - .await - .unwrap() - .into_iter() - .map(|storage_path| storage.local_path(&storage_path).unwrap()), + .await + { + DownloadedTimeline::Successful(data) 
=> data, + wrong_result => panic!( + "Expected a successful download for timeline, but got: {:?}", + wrong_result + ), + }; + + assert_eq!( + current_retries, download_data.retries, + "On successful download, retries are not expected to change" + ); + assert_eq!( + download_data + .data + .layers_to_skip + .into_iter() + .collect::>(), + layer_files + .iter() + .map(|layer| local_timeline_path.join(layer)) + .collect(), + "On successful download, layers to skip should contain all downloaded files and present layers that were skipped" + ); + + let mut downloaded_files = BTreeSet::new(); + let mut read_dir = fs::read_dir(&local_timeline_path).await?; + while let Some(dir_entry) = read_dir.next_entry().await? { + downloaded_files.insert(dir_entry.path()); + } + + assert_eq!( + downloaded_files, + layer_files + .iter() + .filter(|layer| layer != &&"layer_to_skip") + .map(|layer| local_timeline_path.join(layer)) + .collect(), + "On successful download, all layers that were not skipped, should be downloaded" + ); + + Ok(()) + } + + #[tokio::test] + async fn download_timeline_negatives() -> anyhow::Result<()> { + let harness = RepoHarness::create("download_timeline_negatives")?; + let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID); + let storage = LocalFs::new(tempdir()?.path().to_owned(), &harness.conf.workdir)?; + + let empty_remote_timeline_download = download_timeline_layers( + &storage, + None, + sync_id, + SyncData::new( + 0, + TimelineDownload { + layers_to_skip: HashSet::new(), + }, ), ) .await; - assert_timeline_files_match(&repo_harness, TIMELINE_ID, remote_regular_timeline); + assert!( + matches!(empty_remote_timeline_download, DownloadedTimeline::Abort), + "Should not allow downloading for empty remote timeline" + ); + + let not_expecting_download_remote_timeline = RemoteTimeline::new(dummy_metadata(Lsn(5))); + assert!( + !not_expecting_download_remote_timeline.awaits_download, + "Should not expect download for the timeline" + ); + let 
already_downloading_remote_timeline_download = download_timeline_layers( + &storage, + Some(&not_expecting_download_remote_timeline), + sync_id, + SyncData::new( + 0, + TimelineDownload { + layers_to_skip: HashSet::new(), + }, + ), + ) + .await; + assert!( + matches!( + dbg!(already_downloading_remote_timeline_download), + DownloadedTimeline::Abort, + ), + "Should not allow downloading for remote timeline that does not expect it" + ); + + Ok(()) + } + + #[tokio::test] + async fn test_download_index_part() -> anyhow::Result<()> { + let harness = RepoHarness::create("test_download_index_part")?; + let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID); + + let storage = LocalFs::new(tempdir()?.path().to_path_buf(), &harness.conf.workdir)?; + let metadata = dummy_metadata(Lsn(0x30)); + let local_timeline_path = harness.timeline_path(&TIMELINE_ID); + + let index_part = IndexPart::new( + HashSet::from([ + RelativePath::new(&local_timeline_path, local_timeline_path.join("one"))?, + RelativePath::new(&local_timeline_path, local_timeline_path.join("two"))?, + ]), + HashSet::from([RelativePath::new( + &local_timeline_path, + local_timeline_path.join("three"), + )?]), + metadata.disk_consistent_lsn(), + metadata.to_bytes()?, + ); + + let local_index_part_path = + metadata_path(harness.conf, sync_id.timeline_id, sync_id.tenant_id) + .with_file_name(IndexPart::FILE_NAME) + .with_extension(IndexPart::FILE_EXTENSION); + let storage_path = storage.storage_path(&local_index_part_path)?; + fs::create_dir_all(storage_path.parent().unwrap()).await?; + fs::write(&storage_path, serde_json::to_vec(&index_part)?).await?; + + let downloaded_index_part = download_index_part(harness.conf, &storage, sync_id).await?; + + assert_eq!( + downloaded_index_part, index_part, + "Downloaded index part should be the same as the one in storage" + ); Ok(()) } diff --git a/pageserver/src/remote_storage/storage_sync/index.rs b/pageserver/src/remote_storage/storage_sync/index.rs index 
861b78fa3b..918bda1039 100644 --- a/pageserver/src/remote_storage/storage_sync/index.rs +++ b/pageserver/src/remote_storage/storage_sync/index.rs @@ -1,63 +1,56 @@ -//! In-memory index to track the tenant files on the remote strorage, mitigating the storage format differences between the local and remote files. -//! Able to restore itself from the storage archive data and reconstruct archive indices on demand. -//! -//! The index is intended to be portable, so deliberately does not store any local paths inside. -//! This way in the future, the index could be restored fast from its serialized stored form. +//! In-memory index to track the tenant files on the remote storage. +//! Able to restore itself from the storage index parts, that are located in every timeline's remote directory and contain all data about +//! remote timeline layers and its metadata. use std::{ - collections::{BTreeMap, BTreeSet, HashMap}, + collections::{HashMap, HashSet}, path::{Path, PathBuf}, sync::Arc, }; -use anyhow::{bail, ensure, Context}; +use anyhow::{Context, Ok}; use serde::{Deserialize, Serialize}; +use serde_with::{serde_as, DisplayFromStr}; use tokio::sync::RwLock; -use tracing::*; -use zenith_utils::{ - lsn::Lsn, - zid::{ZTenantId, ZTimelineId}, -}; use crate::{ - config::PageServerConf, - layered_repository::TIMELINES_SEGMENT_NAME, - remote_storage::{ - storage_sync::compression::{parse_archive_name, FileEntry}, - ZTenantTimelineId, - }, + config::PageServerConf, layered_repository::metadata::TimelineMetadata, + remote_storage::ZTenantTimelineId, }; - -use super::compression::ArchiveHeader; +use zenith_utils::lsn::Lsn; /// A part of the filesystem path, that needs a root to become a path again. #[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Hash, Serialize, Deserialize)] +#[serde(transparent)] pub struct RelativePath(String); impl RelativePath { /// Attempts to strip off the base from path, producing a relative path or an error. 
pub fn new>(base: &Path, path: P) -> anyhow::Result { - let relative = path - .as_ref() - .strip_prefix(base) - .context("path is not relative to base")?; + let path = path.as_ref(); + let relative = path.strip_prefix(base).with_context(|| { + format!( + "path '{}' is not relative to base '{}'", + path.display(), + base.display() + ) + })?; Ok(RelativePath(relative.to_string_lossy().to_string())) } /// Joins the relative path with the base path. - pub fn as_path(&self, base: &Path) -> PathBuf { + fn as_path(&self, base: &Path) -> PathBuf { base.join(&self.0) } } /// An index to track tenant files that exist on the remote storage. -/// Currently, timeline archive files are tracked only. #[derive(Debug, Clone)] pub struct RemoteTimelineIndex { - timeline_entries: HashMap, + timeline_entries: HashMap, } -/// A wrapper to synchrnize access to the index, should be created and used before dealing with any [`RemoteTimelineIndex`]. +/// A wrapper to synchronize the access to the index, should be created and used before dealing with any [`RemoteTimelineIndex`]. pub struct RemoteIndex(Arc>); impl RemoteIndex { @@ -67,27 +60,22 @@ impl RemoteIndex { }))) } - /// Attempts to parse file paths (not checking the file contents) and find files - /// that can be tracked wiht the index. - /// On parse falures, logs the error and continues, so empty index can be created from not suitable paths. 
- pub fn try_parse_descriptions_from_paths>( + pub fn from_parts( conf: &'static PageServerConf, - paths: impl Iterator, - ) -> Self { - let mut index = RemoteTimelineIndex { - timeline_entries: HashMap::new(), - }; - for path in paths { - if let Err(e) = try_parse_index_entry(&mut index, conf, path.as_ref()) { - debug!( - "Failed to parse path '{}' as index entry: {:#}", - path.as_ref().display(), - e - ); - } + index_parts: HashMap, + ) -> anyhow::Result { + let mut timeline_entries = HashMap::new(); + + for (sync_id, index_part) in index_parts { + let timeline_path = conf.timeline_path(&sync_id.timeline_id, &sync_id.tenant_id); + let remote_timeline = RemoteTimeline::from_index_part(&timeline_path, index_part) + .context("Failed to restore remote timeline data from index part")?; + timeline_entries.insert(sync_id, remote_timeline); } - Self(Arc::new(RwLock::new(index))) + Ok(Self(Arc::new(RwLock::new(RemoteTimelineIndex { + timeline_entries, + })))) } pub async fn read(&self) -> tokio::sync::RwLockReadGuard<'_, RemoteTimelineIndex> { @@ -106,39 +94,18 @@ impl Clone for RemoteIndex { } impl RemoteTimelineIndex { - pub fn timeline_entry(&self, id: &ZTenantTimelineId) -> Option<&TimelineIndexEntry> { + pub fn timeline_entry(&self, id: &ZTenantTimelineId) -> Option<&RemoteTimeline> { self.timeline_entries.get(id) } - pub fn timeline_entry_mut( - &mut self, - id: &ZTenantTimelineId, - ) -> Option<&mut TimelineIndexEntry> { + pub fn timeline_entry_mut(&mut self, id: &ZTenantTimelineId) -> Option<&mut RemoteTimeline> { self.timeline_entries.get_mut(id) } - pub fn add_timeline_entry(&mut self, id: ZTenantTimelineId, entry: TimelineIndexEntry) { + pub fn add_timeline_entry(&mut self, id: ZTenantTimelineId, entry: RemoteTimeline) { self.timeline_entries.insert(id, entry); } - pub fn upgrade_timeline_entry( - &mut self, - id: &ZTenantTimelineId, - remote_timeline: RemoteTimeline, - ) -> anyhow::Result<()> { - let mut entry = 
self.timeline_entries.get_mut(id).ok_or(anyhow::anyhow!( - "timeline is unexpectedly missing from remote index" - ))?; - - if !matches!(entry.inner, TimelineIndexEntryInner::Description(_)) { - anyhow::bail!("timeline entry is not a description entry") - }; - - entry.inner = TimelineIndexEntryInner::Full(remote_timeline); - - Ok(()) - } - pub fn all_sync_ids(&self) -> impl Iterator + '_ { self.timeline_entries.keys().copied() } @@ -150,351 +117,295 @@ impl RemoteTimelineIndex { ) -> anyhow::Result<()> { self.timeline_entry_mut(id) .ok_or_else(|| anyhow::anyhow!("unknown timeline sync {}", id))? - .set_awaits_download(awaits_download); + .awaits_download = awaits_download; Ok(()) } } -#[derive(Debug, Clone, PartialEq, Eq, Default)] -pub struct DescriptionTimelineIndexEntry { - pub description: BTreeMap, - pub awaits_download: bool, -} - -#[derive(Debug, Clone, PartialEq, Eq)] -pub struct FullTimelineIndexEntry { - pub remote_timeline: RemoteTimeline, - pub awaits_download: bool, -} - -#[derive(Debug, Clone, PartialEq, Eq)] -pub enum TimelineIndexEntryInner { - Description(BTreeMap), - Full(RemoteTimeline), -} - -#[derive(Debug, Clone, PartialEq, Eq)] -pub struct TimelineIndexEntry { - inner: TimelineIndexEntryInner, - awaits_download: bool, -} - -impl TimelineIndexEntry { - pub fn new(inner: TimelineIndexEntryInner, awaits_download: bool) -> Self { - Self { - inner, - awaits_download, - } - } - - pub fn inner(&self) -> &TimelineIndexEntryInner { - &self.inner - } - - pub fn inner_mut(&mut self) -> &mut TimelineIndexEntryInner { - &mut self.inner - } - - pub fn uploaded_checkpoints(&self) -> BTreeSet { - match &self.inner { - TimelineIndexEntryInner::Description(description) => { - description.keys().map(|archive_id| archive_id.0).collect() - } - TimelineIndexEntryInner::Full(remote_timeline) => remote_timeline - .checkpoint_archives - .keys() - .map(|archive_id| archive_id.0) - .collect(), - } - } - - /// Gets latest uploaded checkpoint's disk consisten Lsn for the 
corresponding timeline. - pub fn disk_consistent_lsn(&self) -> Option { - match &self.inner { - TimelineIndexEntryInner::Description(description) => { - description.keys().map(|archive_id| archive_id.0).max() - } - TimelineIndexEntryInner::Full(remote_timeline) => remote_timeline - .checkpoint_archives - .keys() - .map(|archive_id| archive_id.0) - .max(), - } - } - - pub fn get_awaits_download(&self) -> bool { - self.awaits_download - } - - pub fn set_awaits_download(&mut self, awaits_download: bool) { - self.awaits_download = awaits_download; - } -} - -/// Checkpoint archive's id, corresponding to the `disk_consistent_lsn` from the timeline's metadata file during checkpointing. -#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone, Copy)] -pub struct ArchiveId(pub(super) Lsn); - -#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone, Copy)] -struct FileId(ArchiveId, ArchiveEntryNumber); - -type ArchiveEntryNumber = usize; - -/// All archives and files in them, representing a certain timeline. -/// Uses file and archive IDs to reference those without ownership issues. +/// Restored index part data about the timeline, stored in the remote index. #[derive(Debug, PartialEq, Eq, Clone)] pub struct RemoteTimeline { - timeline_files: BTreeMap, - checkpoint_archives: BTreeMap, -} + timeline_layers: HashSet, + missing_layers: HashSet, -/// Archive metadata, enough to restore a header with the timeline data. 
-#[derive(Debug, PartialEq, Eq, Clone)] -pub struct CheckpointArchive { - disk_consistent_lsn: Lsn, - metadata_file_size: u64, - files: BTreeSet, - archive_header_size: u64, -} - -impl CheckpointArchive { - pub fn disk_consistent_lsn(&self) -> Lsn { - self.disk_consistent_lsn - } + pub metadata: TimelineMetadata, + pub awaits_download: bool, } impl RemoteTimeline { - pub fn empty() -> Self { + pub fn new(metadata: TimelineMetadata) -> Self { Self { - timeline_files: BTreeMap::new(), - checkpoint_archives: BTreeMap::new(), + timeline_layers: HashSet::new(), + missing_layers: HashSet::new(), + metadata, + awaits_download: false, } } - pub fn checkpoints(&self) -> impl Iterator + '_ { - self.checkpoint_archives - .values() - .map(CheckpointArchive::disk_consistent_lsn) + pub fn add_timeline_layers(&mut self, new_layers: impl IntoIterator) { + self.timeline_layers.extend(new_layers.into_iter()); + } + + pub fn add_upload_failures(&mut self, upload_failures: impl IntoIterator) { + self.missing_layers.extend(upload_failures.into_iter()); } /// Lists all layer files in the given remote timeline. Omits the metadata file. 
- pub fn stored_files(&self, timeline_dir: &Path) -> BTreeSet { - self.timeline_files - .values() - .map(|file_entry| file_entry.subpath.as_path(timeline_dir)) - .collect() + pub fn stored_files(&self) -> &HashSet { + &self.timeline_layers } - pub fn contains_checkpoint_at(&self, disk_consistent_lsn: Lsn) -> bool { - self.checkpoint_archives - .contains_key(&ArchiveId(disk_consistent_lsn)) + pub fn from_index_part(timeline_path: &Path, index_part: IndexPart) -> anyhow::Result { + let metadata = TimelineMetadata::from_bytes(&index_part.metadata_bytes)?; + Ok(Self { + timeline_layers: to_local_paths(timeline_path, index_part.timeline_layers), + missing_layers: to_local_paths(timeline_path, index_part.missing_layers), + metadata, + awaits_download: false, + }) } +} - pub fn archive_data(&self, archive_id: ArchiveId) -> Option<&CheckpointArchive> { - self.checkpoint_archives.get(&archive_id) - } +/// Part of the remote index, corresponding to a certain timeline. +/// Contains the data about all files in the timeline, present remotely and its metadata. +#[serde_as] +#[derive(Debug, PartialEq, Eq, Clone, Serialize, Deserialize)] +pub struct IndexPart { + timeline_layers: HashSet, + /// Currently is not really used in pageserver, + /// present to manually keep track of the layer files that pageserver might never retrieve. + /// + /// Such "holes" might appear if any upload task was evicted on an error threshold: + /// the this layer will only be rescheduled for upload on pageserver restart. + missing_layers: HashSet, + #[serde_as(as = "DisplayFromStr")] + disk_consistent_lsn: Lsn, + metadata_bytes: Vec, +} - /// Restores a header of a certain remote archive from the memory data. - /// Returns the header and its compressed size in the archive, both can be used to uncompress that archive. 
- pub fn restore_header(&self, archive_id: ArchiveId) -> anyhow::Result<(ArchiveHeader, u64)> { - let archive = self - .checkpoint_archives - .get(&archive_id) - .with_context(|| format!("Archive {:?} not found", archive_id))?; +impl IndexPart { + pub const FILE_NAME: &'static str = "index_part"; + pub const FILE_EXTENSION: &'static str = "json"; - let mut header_files = Vec::with_capacity(archive.files.len()); - for (expected_archive_position, archive_file) in archive.files.iter().enumerate() { - let &FileId(archive_id, archive_position) = archive_file; - ensure!( - expected_archive_position == archive_position, - "Archive header is corrupt, file # {} from archive {:?} header is missing", - expected_archive_position, - archive_id, - ); - - let timeline_file = self.timeline_files.get(archive_file).with_context(|| { - format!( - "File with id {:?} not found for archive {:?}", - archive_file, archive_id - ) - })?; - header_files.push(timeline_file.clone()); - } - - Ok(( - ArchiveHeader { - files: header_files, - metadata_file_size: archive.metadata_file_size, - }, - archive.archive_header_size, - )) - } - - /// Updates (creates, if necessary) the data about certain archive contents. 
- pub fn update_archive_contents( - &mut self, + #[cfg(test)] + pub fn new( + timeline_layers: HashSet, + missing_layers: HashSet, disk_consistent_lsn: Lsn, - header: ArchiveHeader, - header_size: u64, - ) { - let archive_id = ArchiveId(disk_consistent_lsn); - let mut common_archive_files = BTreeSet::new(); - for (file_index, file_entry) in header.files.into_iter().enumerate() { - let file_id = FileId(archive_id, file_index); - self.timeline_files.insert(file_id, file_entry); - common_archive_files.insert(file_id); + metadata_bytes: Vec, + ) -> Self { + Self { + timeline_layers, + missing_layers, + disk_consistent_lsn, + metadata_bytes, } + } - let metadata_file_size = header.metadata_file_size; - self.checkpoint_archives - .entry(archive_id) - .or_insert_with(|| CheckpointArchive { - metadata_file_size, - files: BTreeSet::new(), - archive_header_size: header_size, - disk_consistent_lsn, - }) - .files - .extend(common_archive_files.into_iter()); + pub fn missing_files(&self) -> &HashSet { + &self.missing_layers + } + + pub fn from_remote_timeline( + timeline_path: &Path, + remote_timeline: RemoteTimeline, + ) -> anyhow::Result { + let metadata_bytes = remote_timeline.metadata.to_bytes()?; + Ok(Self { + timeline_layers: to_relative_paths(timeline_path, remote_timeline.timeline_layers) + .context("Failed to convert timeline layers' paths to relative ones")?, + missing_layers: to_relative_paths(timeline_path, remote_timeline.missing_layers) + .context("Failed to convert missing layers' paths to relative ones")?, + disk_consistent_lsn: remote_timeline.metadata.disk_consistent_lsn(), + metadata_bytes, + }) } } -/// Metadata abput timeline checkpoint archive, parsed from its remote storage path. 
-#[derive(Debug, Clone, PartialEq, Eq)] -pub struct ArchiveDescription { - pub header_size: u64, - pub disk_consistent_lsn: Lsn, - pub archive_name: String, +fn to_local_paths( + timeline_path: &Path, + paths: impl IntoIterator, +) -> HashSet { + paths + .into_iter() + .map(|path| path.as_path(timeline_path)) + .collect() } -fn try_parse_index_entry( - index: &mut RemoteTimelineIndex, - conf: &'static PageServerConf, - path: &Path, -) -> anyhow::Result<()> { - let tenants_dir = conf.tenants_path(); - let tenant_id = path - .strip_prefix(&tenants_dir) - .with_context(|| { - format!( - "Path '{}' does not belong to tenants directory '{}'", - path.display(), - tenants_dir.display(), - ) - })? - .iter() - .next() - .with_context(|| format!("Found no tenant id in path '{}'", path.display()))? - .to_string_lossy() - .parse::() - .with_context(|| format!("Failed to parse tenant id from path '{}'", path.display()))?; - - let timelines_path = conf.timelines_path(&tenant_id); - match path.strip_prefix(&timelines_path) { - Ok(timelines_subpath) => { - let mut segments = timelines_subpath.iter(); - let timeline_id = segments - .next() - .with_context(|| { - format!( - "{} directory of tenant {} (path '{}') is not an index entry", - TIMELINES_SEGMENT_NAME, - tenant_id, - path.display() - ) - })? - .to_string_lossy() - .parse::() - .with_context(|| { - format!("Failed to parse timeline id from path '{}'", path.display()) - })?; - - let (disk_consistent_lsn, header_size) = - parse_archive_name(path).with_context(|| { - format!( - "Failed to parse archive name out in path '{}'", - path.display() - ) - })?; - - let archive_name = path - .file_name() - .with_context(|| format!("Archive '{}' has no file name", path.display()))? 
- .to_string_lossy() - .to_string(); - - let sync_id = ZTenantTimelineId { - tenant_id, - timeline_id, - }; - let timeline_index_entry = index.timeline_entries.entry(sync_id).or_insert_with(|| { - TimelineIndexEntry::new( - TimelineIndexEntryInner::Description(BTreeMap::default()), - false, - ) - }); - match timeline_index_entry.inner_mut() { - TimelineIndexEntryInner::Description(description) => { - description.insert( - ArchiveId(disk_consistent_lsn), - ArchiveDescription { - header_size, - disk_consistent_lsn, - archive_name, - }, - ); - } - TimelineIndexEntryInner::Full(_) => { - bail!("Cannot add parsed archive description to its full context in index with sync id {}", sync_id) - } - } - } - Err(timelines_strip_error) => { - bail!( - "Path '{}' is not an archive entry '{}'", - path.display(), - timelines_strip_error, - ) - } - } - Ok(()) +fn to_relative_paths( + timeline_path: &Path, + paths: impl IntoIterator, +) -> anyhow::Result> { + paths + .into_iter() + .map(|path| RelativePath::new(timeline_path, path)) + .collect() } #[cfg(test)] mod tests { + use std::collections::BTreeSet; + use super::*; + use crate::repository::repo_harness::{RepoHarness, TIMELINE_ID}; #[test] - fn header_restoration_preserves_file_order() { - let header = ArchiveHeader { - files: vec![ - FileEntry { - size: 5, - subpath: RelativePath("one".to_string()), - }, - FileEntry { - size: 1, - subpath: RelativePath("two".to_string()), - }, - FileEntry { - size: 222, - subpath: RelativePath("zero".to_string()), - }, - ], - metadata_file_size: 5, + fn index_part_conversion() { + let harness = RepoHarness::create("index_part_conversion").unwrap(); + let timeline_path = harness.timeline_path(&TIMELINE_ID); + let metadata = + TimelineMetadata::new(Lsn(5).align(), Some(Lsn(4)), None, Lsn(3), Lsn(2), Lsn(1)); + let remote_timeline = RemoteTimeline { + timeline_layers: HashSet::from([ + timeline_path.join("layer_1"), + timeline_path.join("layer_2"), + ]), + missing_layers: HashSet::from([ + 
timeline_path.join("missing_1"), + timeline_path.join("missing_2"), + ]), + metadata: metadata.clone(), + awaits_download: false, }; - let lsn = Lsn(1); - let mut remote_timeline = RemoteTimeline::empty(); - remote_timeline.update_archive_contents(lsn, header.clone(), 15); - - let (restored_header, _) = remote_timeline - .restore_header(ArchiveId(lsn)) - .expect("Should be able to restore header from a valid remote timeline"); + let index_part = IndexPart::from_remote_timeline(&timeline_path, remote_timeline.clone()) + .expect("Correct remote timeline should be convertible to index part"); assert_eq!( - header, restored_header, - "Header restoration should preserve file order" + index_part.timeline_layers.iter().collect::>(), + BTreeSet::from([ + &RelativePath("layer_1".to_string()), + &RelativePath("layer_2".to_string()) + ]), + "Index part should have all remote timeline layers after the conversion" + ); + assert_eq!( + index_part.missing_layers.iter().collect::>(), + BTreeSet::from([ + &RelativePath("missing_1".to_string()), + &RelativePath("missing_2".to_string()) + ]), + "Index part should have all missing remote timeline layers after the conversion" + ); + assert_eq!( + index_part.disk_consistent_lsn, + metadata.disk_consistent_lsn(), + "Index part should have disk consistent lsn from the timeline" + ); + assert_eq!( + index_part.metadata_bytes, + metadata + .to_bytes() + .expect("Failed to serialize correct metadata into bytes"), + "Index part should have the metadata bytes of the remote timeline after the conversion" + ); + + let restored_timeline = RemoteTimeline::from_index_part(&timeline_path, index_part) + .expect("Correct index part should be convertible to remote timeline"); + + let original_metadata = &remote_timeline.metadata; + let restored_metadata = &restored_timeline.metadata; + // we have to compare the metadata this way, since its header is different after creation and restoration, + // but that is now considered ok.
+ assert_eq!( + original_metadata.disk_consistent_lsn(), + restored_metadata.disk_consistent_lsn(), + "remote timeline -> index part -> remote timeline conversion should not alter metadata" + ); + assert_eq!( + original_metadata.prev_record_lsn(), + restored_metadata.prev_record_lsn(), + "remote timeline -> index part -> remote timeline conversion should not alter metadata" + ); + assert_eq!( + original_metadata.ancestor_timeline(), + restored_metadata.ancestor_timeline(), + "remote timeline -> index part -> remote timeline conversion should not alter metadata" + ); + assert_eq!( + original_metadata.ancestor_lsn(), + restored_metadata.ancestor_lsn(), + "remote timeline -> index part -> remote timeline conversion should not alter metadata" + ); + assert_eq!( + original_metadata.latest_gc_cutoff_lsn(), + restored_metadata.latest_gc_cutoff_lsn(), + "remote timeline -> index part -> remote timeline conversion should not alter metadata" + ); + assert_eq!( + original_metadata.initdb_lsn(), + restored_metadata.initdb_lsn(), + "remote timeline -> index part -> remote timeline conversion should not alter metadata" + ); + + assert_eq!( + remote_timeline.awaits_download, restored_timeline.awaits_download, + "remote timeline -> index part -> remote timeline conversion should not lose download flag" + ); + + assert_eq!( + remote_timeline + .timeline_layers + .into_iter() + .collect::>(), + restored_timeline + .timeline_layers + .into_iter() + .collect::>(), + "remote timeline -> index part -> remote timeline conversion should not lose layer data" + ); + assert_eq!( + remote_timeline + .missing_layers + .into_iter() + .collect::>(), + restored_timeline + .missing_layers + .into_iter() + .collect::>(), + "remote timeline -> index part -> remote timeline conversion should not lose missing file data" ); } + + #[test] + fn index_part_conversion_negatives() { + let harness = RepoHarness::create("index_part_conversion_negatives").unwrap(); + let timeline_path = 
harness.timeline_path(&TIMELINE_ID); + let metadata = + TimelineMetadata::new(Lsn(5).align(), Some(Lsn(4)), None, Lsn(3), Lsn(2), Lsn(1)); + + let conversion_result = IndexPart::from_remote_timeline( + &timeline_path, + RemoteTimeline { + timeline_layers: HashSet::from([ + PathBuf::from("bad_path"), + timeline_path.join("layer_2"), + ]), + missing_layers: HashSet::from([ + timeline_path.join("missing_1"), + timeline_path.join("missing_2"), + ]), + metadata: metadata.clone(), + awaits_download: false, + }, + ); + assert!(conversion_result.is_err(), "Should not be able to convert metadata with layer paths that are not in the timeline directory"); + + let conversion_result = IndexPart::from_remote_timeline( + &timeline_path, + RemoteTimeline { + timeline_layers: HashSet::from([ + timeline_path.join("layer_1"), + timeline_path.join("layer_2"), + ]), + missing_layers: HashSet::from([ + PathBuf::from("bad_path"), + timeline_path.join("missing_2"), + ]), + metadata, + awaits_download: false, + }, + ); + assert!(conversion_result.is_err(), "Should not be able to convert metadata with missing layer paths that are not in the timeline directory"); + } } diff --git a/pageserver/src/remote_storage/storage_sync/upload.rs b/pageserver/src/remote_storage/storage_sync/upload.rs index 7b6d58a661..81758ce3ef 100644 --- a/pageserver/src/remote_storage/storage_sync/upload.rs +++ b/pageserver/src/remote_storage/storage_sync/upload.rs @@ -1,520 +1,456 @@ //! Timeline synchronization logic to compress and upload to the remote storage all new timeline files from the checkpoints. 
-use std::{collections::BTreeSet, path::PathBuf, sync::Arc}; +use std::{fmt::Debug, path::PathBuf}; -use tracing::{debug, error, warn}; +use anyhow::Context; +use futures::stream::{FuturesUnordered, StreamExt}; +use tokio::fs; +use tracing::{debug, error, trace, warn}; use crate::{ config::PageServerConf, + layered_repository::metadata::metadata_path, remote_storage::{ - storage_sync::{ - compression, fetch_full_index, - index::{RemoteTimeline, TimelineIndexEntry, TimelineIndexEntryInner}, - sync_queue, SyncKind, SyncTask, - }, + storage_sync::{index::RemoteTimeline, sync_queue, SyncTask}, RemoteStorage, ZTenantTimelineId, }, }; -use super::{compression::ArchiveHeader, NewCheckpoint, RemoteIndex}; +use super::{index::IndexPart, SyncData, TimelineUpload}; -/// Attempts to compress and upload given checkpoint files. -/// No extra checks for overlapping files is made: download takes care of that, ensuring no non-metadata local timeline files are overwritten. +/// Serializes and uploads the given index part data to the remote storage. 
+pub(super) async fn upload_index_part( + conf: &'static PageServerConf, + storage: &S, + sync_id: ZTenantTimelineId, + index_part: IndexPart, +) -> anyhow::Result<()> +where + P: Debug + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, +{ + let index_part_bytes = serde_json::to_vec(&index_part) + .context("Failed to serialize index part file into bytes")?; + let index_part_size = index_part_bytes.len(); + let index_part_bytes = tokio::io::BufReader::new(std::io::Cursor::new(index_part_bytes)); + + let index_part_path = metadata_path(conf, sync_id.timeline_id, sync_id.tenant_id) + .with_file_name(IndexPart::FILE_NAME) + .with_extension(IndexPart::FILE_EXTENSION); + let index_part_storage_path = storage.storage_path(&index_part_path).with_context(|| { + format!( + "Failed to get the index part storage path for local path '{}'", + index_part_path.display() + ) + })?; + + storage + .upload( + index_part_bytes, + index_part_size, + &index_part_storage_path, + None, + ) + .await + .with_context(|| { + format!( + "Failed to upload index part to the storage path '{:?}'", + index_part_storage_path + ) + }) +} + +/// Timeline upload result, with extra data, needed for uploading. +#[derive(Debug)] +pub(super) enum UploadedTimeline { + /// Upload failed due to some error, the upload task is rescheduled for another retry. + FailedAndRescheduled, + /// No issues happened during the upload, all task files were put into the remote storage. + Successful(SyncData), + /// No failures happened during the upload, but some files were removed locally before the upload task completed + /// (could happen due to retries, for instance, if GC happens in the interim). + /// Such files are considered "not needed" and ignored, but the task's metadata should be discarded and the new one loaded from the local file. + SuccessfulAfterLocalFsUpdate(SyncData), +} + +/// Attempts to upload given layer files. 
+/// No extra checks for overlapping files is made and any files that are already present remotely will be overwritten, if submitted during the upload. /// /// On an error, bumps the retries count and reschedules the entire task. -/// On success, populates index data with new downloads. -pub(super) async fn upload_timeline_checkpoint< - P: std::fmt::Debug + Send + Sync + 'static, - S: RemoteStorage + Send + Sync + 'static, ->( - config: &'static PageServerConf, - remote_assets: Arc<(S, RemoteIndex)>, +pub(super) async fn upload_timeline_layers<'a, P, S>( + storage: &'a S, + remote_timeline: Option<&'a RemoteTimeline>, sync_id: ZTenantTimelineId, - new_checkpoint: NewCheckpoint, - retries: u32, -) -> Option { - debug!("Uploading checkpoint for sync id {}", sync_id); - let new_upload_lsn = new_checkpoint.metadata.disk_consistent_lsn(); + mut upload_data: SyncData, +) -> UploadedTimeline +where + P: Debug + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, +{ + let upload = &mut upload_data.data; + let new_upload_lsn = upload.metadata.disk_consistent_lsn(); + debug!( + "Uploading timeline layers for sync id {}, new lsn: {}", + sync_id, new_upload_lsn + ); - let index = &remote_assets.1; - - let ZTenantTimelineId { - tenant_id, - timeline_id, - } = sync_id; - let timeline_dir = config.timeline_path(&timeline_id, &tenant_id); - - let index_read = index.read().await; - let remote_timeline = match index_read.timeline_entry(&sync_id) { - None => { - drop(index_read); - None - } - Some(entry) => match entry.inner() { - TimelineIndexEntryInner::Full(remote_timeline) => { - let r = Some(remote_timeline.clone()); - drop(index_read); - r - } - TimelineIndexEntryInner::Description(_) => { - drop(index_read); - debug!("Found timeline description for the given ids, downloading the full index"); - match fetch_full_index(remote_assets.as_ref(), &timeline_dir, sync_id).await { - Ok(remote_timeline) => Some(remote_timeline), - Err(e) => { - error!("Failed to download 
full timeline index: {:?}", e); - sync_queue::push(SyncTask::new( - sync_id, - retries, - SyncKind::Upload(new_checkpoint), - )); - return Some(false); - } - } - } - }, - }; - - let already_contains_upload_lsn = remote_timeline - .as_ref() - .map(|remote_timeline| remote_timeline.contains_checkpoint_at(new_upload_lsn)) - .unwrap_or(false); - if already_contains_upload_lsn { - warn!( - "Received a checkpoint with Lsn {} that's already been uploaded to remote storage, skipping the upload.", - new_upload_lsn - ); - return None; - } - - let already_uploaded_files = remote_timeline - .map(|timeline| timeline.stored_files(&timeline_dir)) + let already_uploaded_layers = remote_timeline + .map(|timeline| timeline.stored_files()) + .cloned() .unwrap_or_default(); - match try_upload_checkpoint( - config, - Arc::clone(&remote_assets), - sync_id, - &new_checkpoint, - already_uploaded_files, - ) - .await - { - Some(Ok((archive_header, header_size))) => { - let mut index_write = index.write().await; - match index_write - .timeline_entry_mut(&sync_id) - .map(|e| e.inner_mut()) - { - None => { - let mut new_timeline = RemoteTimeline::empty(); - new_timeline.update_archive_contents( - new_checkpoint.metadata.disk_consistent_lsn(), - archive_header, - header_size, - ); - index_write.add_timeline_entry( - sync_id, - TimelineIndexEntry::new(TimelineIndexEntryInner::Full(new_timeline), false), + let layers_to_upload = upload + .layers_to_upload + .difference(&already_uploaded_layers) + .cloned() + .collect::>(); + + trace!("Layers to upload: {:?}", layers_to_upload); + + let mut upload_tasks = layers_to_upload + .into_iter() + .map(|source_path| async move { + let storage_path = storage + .storage_path(&source_path) + .with_context(|| { + format!( + "Failed to get the layer storage path for local path '{}'", + source_path.display() ) - } - Some(TimelineIndexEntryInner::Full(remote_timeline)) => { - remote_timeline.update_archive_contents( - 
new_checkpoint.metadata.disk_consistent_lsn(), - archive_header, - header_size, - ); - } - Some(TimelineIndexEntryInner::Description(_)) => { - let mut new_timeline = RemoteTimeline::empty(); - new_timeline.update_archive_contents( - new_checkpoint.metadata.disk_consistent_lsn(), - archive_header, - header_size, - ); - index_write.add_timeline_entry( - sync_id, - TimelineIndexEntry::new(TimelineIndexEntryInner::Full(new_timeline), false), + }) + .map_err(UploadError::Other)?; + + let source_file = match fs::File::open(&source_path).await.with_context(|| { + format!( + "Failed to open a source file for layer '{}'", + source_path.display() + ) + }) { + Ok(file) => file, + Err(e) => return Err(UploadError::MissingLocalFile(source_path, e)), + }; + + let source_size = source_file + .metadata() + .await + .with_context(|| { + format!( + "Failed to get the source file metadata for layer '{}'", + source_path.display() ) - } + }) + .map_err(UploadError::Other)? + .len() as usize; + + match storage + .upload(source_file, source_size, &storage_path, None) + .await + .with_context(|| { + format!( + "Failed to upload a layer from local path '{}'", + source_path.display() + ) + }) { + Ok(()) => Ok(source_path), + Err(e) => Err(UploadError::MissingLocalFile(source_path, e)), } - debug!("Checkpoint uploaded successfully"); - Some(true) + }) + .collect::>(); + + debug!("uploading {} layers of a timeline", upload_tasks.len()); + + let mut errors_happened = false; + let mut local_fs_updated = false; + while let Some(upload_result) = upload_tasks.next().await { + match upload_result { + Ok(uploaded_path) => { + upload.layers_to_upload.remove(&uploaded_path); + upload.uploaded_layers.insert(uploaded_path); + } + Err(e) => match e { + UploadError::Other(e) => { + errors_happened = true; + error!("Failed to upload a layer for timeline {}: {:?}", sync_id, e); + } + UploadError::MissingLocalFile(source_path, e) => { + if source_path.exists() { + errors_happened = true; + error!("Failed to 
upload a layer for timeline {}: {:?}", sync_id, e); + } else { + local_fs_updated = true; + upload.layers_to_upload.remove(&source_path); + warn!("Missing locally a layer file scheduled for upload, skipping"); + } + } + }, } - Some(Err(e)) => { - error!( - "Failed to upload checkpoint: {:?}, requeueing the upload", - e - ); - sync_queue::push(SyncTask::new( - sync_id, - retries, - SyncKind::Upload(new_checkpoint), - )); - Some(false) + } + + if errors_happened { + debug!("Reenqueuing failed upload task for timeline {}", sync_id); + upload_data.retries += 1; + sync_queue::push(sync_id, SyncTask::Upload(upload_data)); + UploadedTimeline::FailedAndRescheduled + } else { + debug!("Finished uploading all timeline's layers"); + if local_fs_updated { + UploadedTimeline::SuccessfulAfterLocalFsUpdate(upload_data) + } else { + UploadedTimeline::Successful(upload_data) } - None => Some(true), } } -async fn try_upload_checkpoint< - P: Send + Sync + 'static, - S: RemoteStorage + Send + Sync + 'static, ->( - config: &'static PageServerConf, - remote_assets: Arc<(S, RemoteIndex)>, - sync_id: ZTenantTimelineId, - new_checkpoint: &NewCheckpoint, - files_to_skip: BTreeSet, -) -> Option> { - let ZTenantTimelineId { - tenant_id, - timeline_id, - } = sync_id; - let timeline_dir = config.timeline_path(&timeline_id, &tenant_id); - - let files_to_upload = new_checkpoint - .layers - .iter() - .filter(|&path_to_upload| { - if files_to_skip.contains(path_to_upload) { - warn!( - "Skipping file upload '{}', since it was already uploaded", - path_to_upload.display() - ); - false - } else { - true - } - }) - .collect::>(); - - if files_to_upload.is_empty() { - warn!( - "No files to upload. 
Upload request was: {:?}, already uploaded files: {:?}", - new_checkpoint.layers, files_to_skip - ); - return None; - } - - let upload_result = compression::archive_files_as_stream( - &timeline_dir, - files_to_upload.into_iter(), - &new_checkpoint.metadata, - move |archive_streamer, archive_name| async move { - let timeline_dir = config.timeline_path(&timeline_id, &tenant_id); - let remote_storage = &remote_assets.0; - remote_storage - .upload( - archive_streamer, - &remote_storage.storage_path(&timeline_dir.join(&archive_name))?, - None, - ) - .await - }, - ) - .await - .map(|(header, header_size, _)| (header, header_size)); - - Some(upload_result) +enum UploadError { + MissingLocalFile(PathBuf, anyhow::Error), + Other(anyhow::Error), } #[cfg(test)] mod tests { + use std::collections::{BTreeSet, HashSet}; + use tempfile::tempdir; use zenith_utils::lsn::Lsn; use crate::{ remote_storage::{ - local_fs::LocalFs, storage_sync::{ - index::ArchiveId, - test_utils::{ - assert_index_descriptions, create_local_timeline, dummy_metadata, - ensure_correct_timeline_upload, expect_timeline, - }, + index::RelativePath, + test_utils::{create_local_timeline, dummy_metadata}, }, + LocalFs, }, repository::repo_harness::{RepoHarness, TIMELINE_ID}, }; - use super::*; + use super::{upload_index_part, *}; #[tokio::test] - async fn reupload_timeline() -> anyhow::Result<()> { - let repo_harness = RepoHarness::create("reupload_timeline")?; - let sync_id = ZTenantTimelineId::new(repo_harness.tenant_id, TIMELINE_ID); - let storage = LocalFs::new(tempdir()?.path().to_owned(), &repo_harness.conf.workdir)?; - let index = RemoteIndex::try_parse_descriptions_from_paths( - repo_harness.conf, - storage - .list() - .await? 
- .into_iter() - .map(|storage_path| storage.local_path(&storage_path).unwrap()), - ); - let remote_assets = Arc::new((storage, index)); - let index = &remote_assets.1; + async fn regular_layer_upload() -> anyhow::Result<()> { + let harness = RepoHarness::create("regular_layer_upload")?; + let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID); - let first_upload_metadata = dummy_metadata(Lsn(0x10)); - let first_checkpoint = create_local_timeline( - &repo_harness, - TIMELINE_ID, - &["a", "b"], - first_upload_metadata.clone(), - )?; - let local_timeline_path = repo_harness.timeline_path(&TIMELINE_ID); - ensure_correct_timeline_upload( - &repo_harness, - Arc::clone(&remote_assets), - TIMELINE_ID, - first_checkpoint, + let layer_files = ["a", "b"]; + let storage = LocalFs::new(tempdir()?.path().to_path_buf(), &harness.conf.workdir)?; + let current_retries = 3; + let metadata = dummy_metadata(Lsn(0x30)); + let local_timeline_path = harness.timeline_path(&TIMELINE_ID); + let timeline_upload = + create_local_timeline(&harness, TIMELINE_ID, &layer_files, metadata.clone()).await?; + assert!( + storage.list().await?.is_empty(), + "Storage should be empty before any uploads are made" + ); + + let upload_result = upload_timeline_layers( + &storage, + None, + sync_id, + SyncData::new(current_retries, timeline_upload.clone()), ) .await; - let uploaded_timeline = expect_timeline(index, sync_id).await; - let uploaded_archives = uploaded_timeline - .checkpoints() - .map(ArchiveId) - .collect::>(); + let upload_data = match upload_result { + UploadedTimeline::Successful(upload_data) => upload_data, + wrong_result => panic!( + "Expected a successful upload for timeline, but got: {:?}", + wrong_result + ), + }; + assert_eq!( - uploaded_archives.len(), - 1, - "Only one archive is expected after a first upload" + current_retries, upload_data.retries, + "On successful upload, retries are not expected to change" ); - let first_uploaded_archive = 
uploaded_archives.first().copied().unwrap(); - assert_eq!( - uploaded_timeline.checkpoints().last(), - Some(first_upload_metadata.disk_consistent_lsn()), - "Metadata that was uploaded, should have its Lsn stored" + let upload = &upload_data.data; + assert!( + upload.layers_to_upload.is_empty(), + "Successful upload should have no layers left to upload" ); assert_eq!( - uploaded_timeline - .archive_data(uploaded_archives.first().copied().unwrap()) - .unwrap() - .disk_consistent_lsn(), - first_upload_metadata.disk_consistent_lsn(), - "Uploaded archive should have corresponding Lsn" - ); - assert_eq!( - uploaded_timeline.stored_files(&local_timeline_path), - vec![local_timeline_path.join("a"), local_timeline_path.join("b")] - .into_iter() + upload + .uploaded_layers + .iter() + .cloned() + .collect::>(), + layer_files + .iter() + .map(|layer_file| local_timeline_path.join(layer_file)) .collect(), - "Should have all files from the first checkpoint" + "Successful upload should have all layers uploaded" + ); + assert_eq!( + upload.metadata, metadata, + "Successful upload should not change its metadata" ); - let second_upload_metadata = dummy_metadata(Lsn(0x40)); - let second_checkpoint = create_local_timeline( - &repo_harness, - TIMELINE_ID, - &["b", "c"], - second_upload_metadata.clone(), - )?; - assert!( - first_upload_metadata.disk_consistent_lsn() - < second_upload_metadata.disk_consistent_lsn() + let storage_files = storage.list().await?; + assert_eq!( + storage_files.len(), + layer_files.len(), + "All layers should be uploaded" ); - ensure_correct_timeline_upload( - &repo_harness, - Arc::clone(&remote_assets), - TIMELINE_ID, - second_checkpoint, + assert_eq!( + storage_files + .into_iter() + .map(|storage_path| storage.local_path(&storage_path)) + .collect::>>()?, + layer_files + .into_iter() + .map(|file| local_timeline_path.join(file)) + .collect(), + "Uploaded files should match with the local ones" + ); + + Ok(()) + } + + // Currently, GC can run between upload 
retries, removing local layers scheduled for upload. Test this scenario. + #[tokio::test] + async fn layer_upload_after_local_fs_update() -> anyhow::Result<()> { + let harness = RepoHarness::create("layer_upload_after_local_fs_update")?; + let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID); + + let layer_files = ["a1", "b1"]; + let storage = LocalFs::new(tempdir()?.path().to_owned(), &harness.conf.workdir)?; + let current_retries = 5; + let metadata = dummy_metadata(Lsn(0x40)); + + let local_timeline_path = harness.timeline_path(&TIMELINE_ID); + let layers_to_upload = { + let mut layers = layer_files.to_vec(); + layers.push("layer_to_remove"); + layers + }; + let timeline_upload = + create_local_timeline(&harness, TIMELINE_ID, &layers_to_upload, metadata.clone()) + .await?; + assert!( + storage.list().await?.is_empty(), + "Storage should be empty before any uploads are made" + ); + + fs::remove_file(local_timeline_path.join("layer_to_remove")).await?; + + let upload_result = upload_timeline_layers( + &storage, + None, + sync_id, + SyncData::new(current_retries, timeline_upload.clone()), ) .await; - let updated_timeline = expect_timeline(index, sync_id).await; - let mut updated_archives = updated_timeline - .checkpoints() - .map(ArchiveId) - .collect::>(); + let upload_data = match upload_result { + UploadedTimeline::SuccessfulAfterLocalFsUpdate(upload_data) => upload_data, + wrong_result => panic!( + "Expected a successful after local fs upload for timeline, but got: {:?}", + wrong_result + ), + }; + assert_eq!( - updated_archives.len(), - 2, - "Two archives are expected after a successful update of the upload" + current_retries, upload_data.retries, + "On successful upload, retries are not expected to change" ); - updated_archives.retain(|archive_id| archive_id != &first_uploaded_archive); + let upload = &upload_data.data; + assert!( + upload.layers_to_upload.is_empty(), + "Successful upload should have no layers left to upload, even those that 
were removed from the local fs" + ); assert_eq!( - updated_archives.len(), - 1, - "Only one new archive is expected among the uploaded" - ); - let second_uploaded_archive = updated_archives.last().copied().unwrap(); - assert_eq!( - updated_timeline.checkpoints().max(), - Some(second_upload_metadata.disk_consistent_lsn()), - "Metadata that was uploaded, should have its Lsn stored" + upload + .uploaded_layers + .iter() + .cloned() + .collect::>(), + layer_files + .iter() + .map(|layer_file| local_timeline_path.join(layer_file)) + .collect(), + "Successful upload should have all layers uploaded" ); assert_eq!( - updated_timeline - .archive_data(second_uploaded_archive) - .unwrap() - .disk_consistent_lsn(), - second_upload_metadata.disk_consistent_lsn(), - "Uploaded archive should have corresponding Lsn" - ); - assert_eq!( - updated_timeline.stored_files(&local_timeline_path), - vec![ - local_timeline_path.join("a"), - local_timeline_path.join("b"), - local_timeline_path.join("c"), - ] - .into_iter() - .collect(), - "Should have all files from both checkpoints without duplicates" + upload.metadata, metadata, + "Successful upload should not change its metadata" ); - let third_upload_metadata = dummy_metadata(Lsn(0x20)); - let third_checkpoint = create_local_timeline( - &repo_harness, - TIMELINE_ID, - &["d"], - third_upload_metadata.clone(), - )?; - assert_ne!( - third_upload_metadata.disk_consistent_lsn(), - first_upload_metadata.disk_consistent_lsn() - ); - assert!( - third_upload_metadata.disk_consistent_lsn() - < second_upload_metadata.disk_consistent_lsn() - ); - ensure_correct_timeline_upload( - &repo_harness, - Arc::clone(&remote_assets), - TIMELINE_ID, - third_checkpoint, - ) - .await; - - let updated_timeline = expect_timeline(index, sync_id).await; - let mut updated_archives = updated_timeline - .checkpoints() - .map(ArchiveId) - .collect::>(); + let storage_files = storage.list().await?; assert_eq!( - updated_archives.len(), - 3, - "Three archives are expected 
after two successful updates of the upload" - ); - updated_archives.retain(|archive_id| { - archive_id != &first_uploaded_archive && archive_id != &second_uploaded_archive - }); - assert_eq!( - updated_archives.len(), - 1, - "Only one new archive is expected among the uploaded" - ); - let third_uploaded_archive = updated_archives.last().copied().unwrap(); - assert!( - updated_timeline.checkpoints().max().unwrap() - > third_upload_metadata.disk_consistent_lsn(), - "Should not influence the last lsn by uploading an older checkpoint" + storage_files.len(), + layer_files.len(), + "All layers should be uploaded" ); assert_eq!( - updated_timeline - .archive_data(third_uploaded_archive) - .unwrap() - .disk_consistent_lsn(), - third_upload_metadata.disk_consistent_lsn(), - "Uploaded archive should have corresponding Lsn" - ); - assert_eq!( - updated_timeline.stored_files(&local_timeline_path), - vec![ - local_timeline_path.join("a"), - local_timeline_path.join("b"), - local_timeline_path.join("c"), - local_timeline_path.join("d"), - ] - .into_iter() - .collect(), - "Should have all files from three checkpoints without duplicates" + storage_files + .into_iter() + .map(|storage_path| storage.local_path(&storage_path)) + .collect::>>()?, + layer_files + .into_iter() + .map(|file| local_timeline_path.join(file)) + .collect(), + "Uploaded files should match with the local ones" ); Ok(()) } #[tokio::test] - async fn reupload_timeline_rejected() -> anyhow::Result<()> { - let repo_harness = RepoHarness::create("reupload_timeline_rejected")?; - let sync_id = ZTenantTimelineId::new(repo_harness.tenant_id, TIMELINE_ID); - let storage = LocalFs::new(tempdir()?.path().to_owned(), &repo_harness.conf.workdir)?; - let index = RemoteIndex::try_parse_descriptions_from_paths( - repo_harness.conf, - storage - .list() - .await? 
- .into_iter() - .map(|storage_path| storage.local_path(&storage_path).unwrap()), - ); - let remote_assets = Arc::new((storage, index)); - let storage = &remote_assets.0; - let index = &remote_assets.1; + async fn test_upload_index_part() -> anyhow::Result<()> { + let harness = RepoHarness::create("test_upload_index_part")?; + let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID); - let first_upload_metadata = dummy_metadata(Lsn(0x10)); - let first_checkpoint = create_local_timeline( - &repo_harness, - TIMELINE_ID, - &["a", "b"], - first_upload_metadata.clone(), - )?; - ensure_correct_timeline_upload( - &repo_harness, - Arc::clone(&remote_assets), - TIMELINE_ID, - first_checkpoint, - ) - .await; - let after_first_uploads = RemoteIndex::try_parse_descriptions_from_paths( - repo_harness.conf, - remote_assets - .0 - .list() - .await - .unwrap() - .into_iter() - .map(|storage_path| storage.local_path(&storage_path).unwrap()), + let storage = LocalFs::new(tempdir()?.path().to_owned(), &harness.conf.workdir)?; + let metadata = dummy_metadata(Lsn(0x40)); + let local_timeline_path = harness.timeline_path(&TIMELINE_ID); + + let index_part = IndexPart::new( + HashSet::from([ + RelativePath::new(&local_timeline_path, local_timeline_path.join("one"))?, + RelativePath::new(&local_timeline_path, local_timeline_path.join("two"))?, + ]), + HashSet::from([RelativePath::new( + &local_timeline_path, + local_timeline_path.join("three"), + )?]), + metadata.disk_consistent_lsn(), + metadata.to_bytes()?, ); - let normal_upload_metadata = dummy_metadata(Lsn(0x20)); - assert_ne!( - normal_upload_metadata.disk_consistent_lsn(), - first_upload_metadata.disk_consistent_lsn() + assert!( + storage.list().await?.is_empty(), + "Storage should be empty before any uploads are made" + ); + upload_index_part(harness.conf, &storage, sync_id, index_part.clone()).await?; + + let storage_files = storage.list().await?; + assert_eq!( + storage_files.len(), + 1, + "Should have only the index 
part file uploaded" ); - let checkpoint_with_no_files = create_local_timeline( - &repo_harness, - TIMELINE_ID, - &[], - normal_upload_metadata.clone(), - )?; - upload_timeline_checkpoint( - repo_harness.conf, - Arc::clone(&remote_assets), - sync_id, - checkpoint_with_no_files, - 0, - ) - .await; - assert_index_descriptions(index, &after_first_uploads).await; + let index_part_path = storage_files.first().unwrap(); + assert_eq!( + index_part_path.file_stem().and_then(|name| name.to_str()), + Some(IndexPart::FILE_NAME), + "Remote index part should have the correct name" + ); + assert_eq!( + index_part_path + .extension() + .and_then(|extension| extension.to_str()), + Some(IndexPart::FILE_EXTENSION), + "Remote index part should have the correct extension" + ); - let checkpoint_with_uploaded_lsn = create_local_timeline( - &repo_harness, - TIMELINE_ID, - &["something", "new"], - first_upload_metadata.clone(), - )?; - upload_timeline_checkpoint( - repo_harness.conf, - Arc::clone(&remote_assets), - sync_id, - checkpoint_with_uploaded_lsn, - 0, - ) - .await; - assert_index_descriptions(index, &after_first_uploads).await; + let remote_index_part: IndexPart = + serde_json::from_slice(&fs::read(&index_part_path).await?)?; + assert_eq!( + index_part, remote_index_part, + "Remote index part should match the local one" + ); Ok(()) } diff --git a/pageserver/src/repository.rs b/pageserver/src/repository.rs index eda9a3168d..d75b4efe71 100644 --- a/pageserver/src/repository.rs +++ b/pageserver/src/repository.rs @@ -182,14 +182,12 @@ impl Value { #[derive(Clone, Copy, Debug)] pub enum TimelineSyncStatusUpdate { - Uploaded, Downloaded, } impl Display for TimelineSyncStatusUpdate { fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { let s = match self { - TimelineSyncStatusUpdate::Uploaded => "Uploaded", TimelineSyncStatusUpdate::Downloaded => "Downloaded", }; f.write_str(s) diff --git a/pageserver/src/tenant_mgr.rs b/pageserver/src/tenant_mgr.rs index 
2765554cf9..71e85c58e6 100644 --- a/pageserver/src/tenant_mgr.rs +++ b/pageserver/src/tenant_mgr.rs @@ -95,7 +95,7 @@ pub fn load_local_repo( /// Updates tenants' repositories, changing their timelines state in memory. pub fn apply_timeline_sync_status_updates( conf: &'static PageServerConf, - remote_index: RemoteIndex, + remote_index: &RemoteIndex, sync_status_updates: HashMap>, ) { if sync_status_updates.is_empty() { @@ -109,7 +109,7 @@ pub fn apply_timeline_sync_status_updates( trace!("Sync status updates: {:?}", sync_status_updates); for (tenant_id, tenant_timelines_sync_status_updates) in sync_status_updates { - let repo = load_local_repo(conf, tenant_id, &remote_index); + let repo = load_local_repo(conf, tenant_id, remote_index); for (timeline_id, timeline_sync_status_update) in tenant_timelines_sync_status_updates { match repo.apply_timeline_remote_sync_status_update(timeline_id, timeline_sync_status_update) diff --git a/pageserver/src/timelines.rs b/pageserver/src/timelines.rs index 105c3c869f..586d27d5b1 100644 --- a/pageserver/src/timelines.rs +++ b/pageserver/src/timelines.rs @@ -114,8 +114,8 @@ impl LocalTimelineInfo { #[serde_as] #[derive(Debug, Serialize, Deserialize, Clone)] pub struct RemoteTimelineInfo { - #[serde_as(as = "Option")] - pub remote_consistent_lsn: Option, + #[serde_as(as = "DisplayFromStr")] + pub remote_consistent_lsn: Lsn, pub awaits_download: bool, } diff --git a/pageserver/src/walreceiver.rs b/pageserver/src/walreceiver.rs index 6de0b87478..e09af09820 100644 --- a/pageserver/src/walreceiver.rs +++ b/pageserver/src/walreceiver.rs @@ -305,7 +305,7 @@ fn walreceiver_main( tenant_id, timeline_id, }) - .and_then(|e| e.disk_consistent_lsn()) + .map(|remote_timeline| remote_timeline.metadata.disk_consistent_lsn()) .unwrap_or(Lsn(0)) // no checkpoint was uploaded }); diff --git a/test_runner/batch_others/test_remote_storage.py b/test_runner/batch_others/test_remote_storage.py index e762f8589a..f2d654423a 100644 --- 
a/test_runner/batch_others/test_remote_storage.py +++ b/test_runner/batch_others/test_remote_storage.py @@ -18,6 +18,7 @@ import pytest # * starts a pageserver with remote storage, stores specific data in its tables # * triggers a checkpoint (which produces a local data scheduled for backup), gets the corresponding timeline id # * polls the timeline status to ensure it's copied remotely +# * inserts more data in the pageserver and repeats the process, to check multiple checkpoints case # * stops the pageserver, clears all local directories # # 2. Second pageserver @@ -50,27 +51,30 @@ def test_remote_storage_backup_and_restore(zenith_env_builder: ZenithEnvBuilder, tenant_id = pg.safe_psql("show zenith.zenith_tenant")[0][0] timeline_id = pg.safe_psql("show zenith.zenith_timeline")[0][0] - with closing(pg.connect()) as conn: - with conn.cursor() as cur: - cur.execute(f''' - CREATE TABLE t1(id int primary key, secret text); - INSERT INTO t1 VALUES ({data_id}, '{data_secret}'); - ''') - cur.execute("SELECT pg_current_wal_flush_lsn()") - current_lsn = lsn_from_hex(cur.fetchone()[0]) + checkpoint_numbers = range(1, 3) - # wait until pageserver receives that data - wait_for_last_record_lsn(client, UUID(tenant_id), UUID(timeline_id), current_lsn) + for checkpoint_number in checkpoint_numbers: + with closing(pg.connect()) as conn: + with conn.cursor() as cur: + cur.execute(f''' + CREATE TABLE t{checkpoint_number}(id int primary key, secret text); + INSERT INTO t{checkpoint_number} VALUES ({data_id}, '{data_secret}|{checkpoint_number}'); + ''') + cur.execute("SELECT pg_current_wal_flush_lsn()") + current_lsn = lsn_from_hex(cur.fetchone()[0]) - # run checkpoint manually to be sure that data landed in remote storage - with closing(env.pageserver.connect()) as psconn: - with psconn.cursor() as pscur: - pscur.execute(f"checkpoint {tenant_id} {timeline_id}") + # wait until pageserver receives that data + wait_for_last_record_lsn(client, UUID(tenant_id), UUID(timeline_id), 
current_lsn) - log.info("waiting for upload") - # wait until pageserver successfully uploaded a checkpoint to remote storage - wait_for_upload(client, UUID(tenant_id), UUID(timeline_id), current_lsn) - log.info("upload is done") + # run checkpoint manually to be sure that data landed in remote storage + with closing(env.pageserver.connect()) as psconn: + with psconn.cursor() as pscur: + pscur.execute(f"checkpoint {tenant_id} {timeline_id}") + + log.info(f'waiting for checkpoint {checkpoint_number} upload') + # wait until pageserver successfully uploaded a checkpoint to remote storage + wait_for_upload(client, UUID(tenant_id), UUID(timeline_id), current_lsn) + log.info(f'upload of checkpoint {checkpoint_number} is done') ##### Stop the first pageserver instance, erase all its data env.postgres.stop_all() @@ -93,5 +97,6 @@ def test_remote_storage_backup_and_restore(zenith_env_builder: ZenithEnvBuilder, pg = env.postgres.create_start('main') with closing(pg.connect()) as conn: with conn.cursor() as cur: - cur.execute(f'SELECT secret FROM t1 WHERE id = {data_id};') - assert cur.fetchone() == (data_secret, ) + for checkpoint_number in checkpoint_numbers: + cur.execute(f'SELECT secret FROM t{checkpoint_number} WHERE id = {data_id};') + assert cur.fetchone() == (f'{data_secret}|{checkpoint_number}', ) diff --git a/zenith/src/main.rs b/zenith/src/main.rs index 18368895a4..f248a5db5b 100644 --- a/zenith/src/main.rs +++ b/zenith/src/main.rs @@ -550,7 +550,7 @@ fn handle_timeline(timeline_match: &ArgMatches, env: &mut local_env::LocalEnv) - let tenant_id = get_tenant_id(create_match, env)?; let new_branch_name = create_match .value_of("branch-name") - .ok_or(anyhow!("No branch name provided"))?; + .ok_or_else(|| anyhow!("No branch name provided"))?; let timeline = pageserver .timeline_create(tenant_id, None, None, None)? 
.ok_or_else(|| anyhow!("Failed to create new timeline for tenant {}", tenant_id))?; @@ -571,7 +571,7 @@ fn handle_timeline(timeline_match: &ArgMatches, env: &mut local_env::LocalEnv) - let tenant_id = get_tenant_id(branch_match, env)?; let new_branch_name = branch_match .value_of("branch-name") - .ok_or(anyhow!("No branch name provided"))?; + .ok_or_else(|| anyhow!("No branch name provided"))?; let ancestor_branch_name = branch_match .value_of("ancestor-branch-name") .unwrap_or(DEFAULT_BRANCH_NAME); From 91fb21225a7a6fda0eed6d916cc6ebc8c0920aab Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Wed, 20 Apr 2022 00:46:29 +0300 Subject: [PATCH 118/296] Show more logs during S3 sync --- pageserver/src/remote_storage/storage_sync.rs | 111 +++++++----------- .../remote_storage/storage_sync/download.rs | 49 +++----- .../src/remote_storage/storage_sync/upload.rs | 51 ++++---- 3 files changed, 83 insertions(+), 128 deletions(-) diff --git a/pageserver/src/remote_storage/storage_sync.rs b/pageserver/src/remote_storage/storage_sync.rs index 6ba55372c2..649e563dbc 100644 --- a/pageserver/src/remote_storage/storage_sync.rs +++ b/pageserver/src/remote_storage/storage_sync.rs @@ -165,10 +165,7 @@ mod sync_queue { if let Some(sender) = SENDER.get() { match sender.send((sync_id, new_task)) { Err(e) => { - warn!( - "Failed to enqueue a sync task: the receiver is dropped: {}", - e - ); + warn!("Failed to enqueue a sync task: the receiver is dropped: {e}"); false } Ok(()) => { @@ -429,15 +426,9 @@ pub fn schedule_timeline_checkpoint_upload( metadata, }), ) { - warn!( - "Could not send an upload task for tenant {}, timeline {}", - tenant_id, timeline_id - ) + warn!("Could not send an upload task for tenant {tenant_id}, timeline {timeline_id}",) } else { - debug!( - "Upload task for tenant {}, timeline {} sent", - tenant_id, timeline_id - ) + debug!("Upload task for tenant {tenant_id}, timeline {timeline_id} sent") } } @@ -449,10 +440,7 @@ pub fn schedule_timeline_checkpoint_upload( 
/// /// Ensure that the loop is started otherwise the task is never processed. pub fn schedule_timeline_download(tenant_id: ZTenantId, timeline_id: ZTimelineId) { - debug!( - "Scheduling timeline download for tenant {}, timeline {}", - tenant_id, timeline_id - ); + debug!("Scheduling timeline download for tenant {tenant_id}, timeline {timeline_id}"); sync_queue::push( ZTenantTimelineId { tenant_id, @@ -614,11 +602,7 @@ where let remaining_queue_length = sync_queue::len(); REMAINING_SYNC_ITEMS.set(remaining_queue_length as i64); if remaining_queue_length > 0 || !batched_tasks.is_empty() { - info!( - "Processing tasks for {} timelines in batch, more tasks left to process: {}", - batched_tasks.len(), - remaining_queue_length - ); + info!("Processing tasks for {} timelines in batch, more tasks left to process: {remaining_queue_length}", batched_tasks.len()); } else { debug!("No tasks to process"); return ControlFlow::Continue(HashMap::new()); @@ -644,7 +628,7 @@ where HashMap, > = HashMap::with_capacity(max_concurrent_sync); while let Some((sync_id, state_update)) = sync_results.next().await { - debug!("Finished storage sync task for sync id {}", sync_id); + debug!("Finished storage sync task for sync id {sync_id}"); if let Some(state_update) = state_update { new_timeline_states .entry(sync_id.tenant_id) @@ -693,7 +677,7 @@ where ) .await { - error!("Failed to update remote timeline {}: {:?}", sync_id, e); + error!("Failed to update remote timeline {sync_id}: {e:?}"); } } SyncTask::DownloadAndUpload(_, failed_upload_data) => { @@ -712,7 +696,7 @@ where ) .await { - error!("Failed to update remote timeline {}: {:?}", sync_id, e); + error!("Failed to update remote timeline {sync_id}: {e:?}"); } } } @@ -720,18 +704,17 @@ where } }; + let task_name = task.name(); let current_task_attempt = task.retries(); + info!("Sync task '{task_name}' processing started, attempt #{current_task_attempt}"); + if current_task_attempt > 0 { let seconds_to_wait = 
2.0_f64.powf(current_task_attempt as f64 - 1.0).min(30.0); - debug!( - "Waiting {} seconds before starting the task", - seconds_to_wait - ); + info!("Waiting {seconds_to_wait} seconds before starting the '{task_name}' task"); tokio::time::sleep(Duration::from_secs_f64(seconds_to_wait)).await; } - let task_name = task.name(); - match task { + let status_update = match task { SyncTask::Download(new_download_data) => { download_timeline( conf, @@ -782,7 +765,11 @@ where status_update } - } + }; + + info!("Finished processing the task"); + + status_update } async fn download_timeline( @@ -804,10 +791,7 @@ where DownloadedTimeline::Abort => { register_sync_status(sync_start, task_name, None); if let Err(e) = index.write().await.set_awaits_download(&sync_id, false) { - error!( - "Timeline {} was expected to be in the remote index after a download attempt, but it's absent: {:?}", - sync_id, e - ); + error!("Timeline {sync_id} was expected to be in the remote index after a download attempt, but it's absent: {e:?}"); } None } @@ -823,15 +807,12 @@ where Some(TimelineSyncStatusUpdate::Downloaded) } Err(e) => { - error!( - "Timeline {} was expected to be in the remote index after a sucessful download, but it's absent: {:?}", - sync_id, e - ); + error!("Timeline {sync_id} was expected to be in the remote index after a sucessful download, but it's absent: {e:?}"); None } }, Err(e) => { - error!("Failed to update local timeline metadata: {:?}", e); + error!("Failed to update local timeline metadata: {e:?}"); download_data.retries += 1; sync_queue::push(sync_id, SyncTask::Download(download_data)); register_sync_status(sync_start, task_name, Some(false)); @@ -873,10 +854,7 @@ async fn update_local_metadata( }; if local_lsn < Some(remote_lsn) { - info!( - "Updating local timeline metadata from remote timeline: local disk_consistent_lsn={:?}, remote disk_consistent_lsn={}", - local_lsn, remote_lsn - ); + info!("Updating local timeline metadata from remote timeline: local 
disk_consistent_lsn={local_lsn:?}, remote disk_consistent_lsn={remote_lsn}"); let remote_metadata_bytes = remote_metadata .to_bytes() @@ -890,7 +868,7 @@ async fn update_local_metadata( ) })?; } else { - info!("Local metadata at path '{}' has later disk consistent Lsn ({:?}) than the remote one ({}), skipping the update", local_metadata_path.display(), local_lsn, remote_lsn); + info!("Local metadata at path '{}' has later disk consistent Lsn ({local_lsn:?}) than the remote one ({remote_lsn}), skipping the update", local_metadata_path.display()); } Ok(()) @@ -933,9 +911,8 @@ async fn upload_timeline( Ok(metadata) => metadata, Err(e) => { error!( - "Failed to load local metadata from path '{}': {:?}", - local_metadata_path.display(), - e + "Failed to load local metadata from path '{}': {e:?}", + local_metadata_path.display() ); outdated_upload_data.retries += 1; sync_queue::push(sync_id, SyncTask::Upload(outdated_upload_data)); @@ -952,7 +929,7 @@ async fn upload_timeline( match update_remote_data(conf, storage, index, sync_id, &uploaded_data.data, false).await { Ok(()) => register_sync_status(sync_start, task_name, Some(true)), Err(e) => { - error!("Failed to update remote timeline {}: {:?}", sync_id, e); + error!("Failed to update remote timeline {sync_id}: {e:?}"); uploaded_data.retries += 1; sync_queue::push(sync_id, SyncTask::Upload(uploaded_data)); register_sync_status(sync_start, task_name, Some(false)); @@ -972,6 +949,7 @@ where P: Debug + Send + Sync + 'static, S: RemoteStorage + Send + Sync + 'static, { + info!("Updating remote index for the timeline"); let updated_remote_timeline = { let mut index_accessor = index.write().await; @@ -1012,6 +990,7 @@ where IndexPart::from_remote_timeline(&timeline_path, updated_remote_timeline) .context("Failed to create an index part from the updated remote timeline")?; + info!("Uploading remote data for the timeline"); upload_index_part(conf, storage, sync_id, new_index_part) .await .context("Failed to upload new index 
part") @@ -1031,8 +1010,8 @@ fn validate_task_retries( if download_data.retries > max_sync_errors => { error!( - "Evicting download task for timeline {} that failed {} times, exceeding the error threshold {}", - sync_id, download_data.retries, max_sync_errors + "Evicting download task for timeline {sync_id} that failed {} times, exceeding the error threshold {max_sync_errors}", + download_data.retries ); skip_download = true; } @@ -1040,9 +1019,9 @@ fn validate_task_retries( if upload_data.retries > max_sync_errors => { error!( - "Evicting upload task for timeline {} that failed {} times, exceeding the error threshold {}", - sync_id, upload_data.retries, max_sync_errors - ); + "Evicting upload task for timeline {sync_id} that failed {} times, exceeding the error threshold {max_sync_errors}", + upload_data.retries, + ); skip_upload = true; } _ => {} @@ -1083,10 +1062,10 @@ where while let Some((id, part_upload_result)) = part_downloads.next().await { match part_upload_result { Ok(index_part) => { - debug!("Successfully fetched index part for {}", id); + debug!("Successfully fetched index part for {id}"); index_parts.insert(id, index_part); } - Err(e) => warn!("Failed to fetch index part for {}: {:?}", id, e), + Err(e) => warn!("Failed to fetch index part for {id}: {e:?}"), } } @@ -1120,8 +1099,8 @@ fn schedule_first_sync_tasks( if was_there.is_some() { // defensive check warn!( - "Overwriting timeline init sync status. Status {:?} Timeline {}", - timeline_status, sync_id.timeline_id + "Overwriting timeline init sync status. 
Status {timeline_status:?}, timeline {}", + sync_id.timeline_id ); } remote_timeline.awaits_download = awaits_download; @@ -1207,7 +1186,7 @@ fn compare_local_and_remote_timeline( fn register_sync_status(sync_start: Instant, sync_name: &str, sync_status: Option) { let secs_elapsed = sync_start.elapsed().as_secs_f64(); - debug!("Processed a sync task in {} seconds", secs_elapsed); + info!("Processed a sync task in {secs_elapsed:.2} seconds"); match sync_status { Some(true) => IMAGE_SYNC_TIME.with_label_values(&[sync_name, "success"]), Some(false) => IMAGE_SYNC_TIME.with_label_values(&[sync_name, "failure"]), @@ -1254,7 +1233,7 @@ mod test_utils { } pub fn dummy_contents(name: &str) -> String { - format!("contents for {}", name) + format!("contents for {name}") } pub fn dummy_metadata(disk_consistent_lsn: Lsn) -> TimelineMetadata { @@ -1286,7 +1265,7 @@ mod tests { let merged_download = match download_1.merge(download_2) { SyncTask::Download(merged_download) => merged_download, - wrong_merge_result => panic!("Unexpected merge result: {:?}", wrong_merge_result), + wrong_merge_result => panic!("Unexpected merge result: {wrong_merge_result:?}"), }; assert_eq!( @@ -1334,7 +1313,7 @@ mod tests { let merged_upload = match upload_1.merge(upload_2) { SyncTask::Upload(merged_upload) => merged_upload, - wrong_merge_result => panic!("Unexpected merge result: {:?}", wrong_merge_result), + wrong_merge_result => panic!("Unexpected merge result: {wrong_merge_result:?}"), }; assert_eq!( @@ -1389,7 +1368,7 @@ mod tests { SyncTask::DownloadAndUpload(merged_download, merged_upload) => { (merged_download, merged_upload) } - wrong_merge_result => panic!("Unexpected merge result: {:?}", wrong_merge_result), + wrong_merge_result => panic!("Unexpected merge result: {wrong_merge_result:?}"), }; assert_eq!( @@ -1440,7 +1419,7 @@ mod tests { SyncTask::DownloadAndUpload(merged_download, merged_upload) => { (merged_download, merged_upload) } - wrong_merge_result => panic!("Unexpected merge 
result: {:?}", wrong_merge_result), + wrong_merge_result => panic!("Unexpected merge result: {wrong_merge_result:?}"), }; assert_eq!( @@ -1507,7 +1486,7 @@ mod tests { SyncTask::DownloadAndUpload(merged_download, merged_upload) => { (merged_download, merged_upload) } - wrong_merge_result => panic!("Unexpected merge result: {:?}", wrong_merge_result), + wrong_merge_result => panic!("Unexpected merge result: {wrong_merge_result:?}"), }; assert_eq!( @@ -1577,7 +1556,7 @@ mod tests { SyncTask::DownloadAndUpload(merged_download, merged_upload) => { (merged_download, merged_upload) } - wrong_merge_result => panic!("Unexpected merge result: {:?}", wrong_merge_result), + wrong_merge_result => panic!("Unexpected merge result: {wrong_merge_result:?}"), }; assert_eq!( diff --git a/pageserver/src/remote_storage/storage_sync/download.rs b/pageserver/src/remote_storage/storage_sync/download.rs index 81ed649c8a..eb805cd0cc 100644 --- a/pageserver/src/remote_storage/storage_sync/download.rs +++ b/pageserver/src/remote_storage/storage_sync/download.rs @@ -5,7 +5,7 @@ use std::fmt::Debug; use anyhow::Context; use futures::stream::{FuturesUnordered, StreamExt}; use tokio::fs; -use tracing::{debug, error, trace, warn}; +use tracing::{debug, error, info, warn}; use crate::{ config::PageServerConf, @@ -45,25 +45,16 @@ where .download(&part_storage_path, &mut index_part_bytes) .await .with_context(|| { - format!( - "Failed to download an index part from storage path '{:?}'", - part_storage_path - ) + format!("Failed to download an index part from storage path '{part_storage_path:?}'") })?; let index_part: IndexPart = serde_json::from_slice(&index_part_bytes).with_context(|| { - format!( - "Failed to deserialize index part file from storage path '{:?}'", - part_storage_path - ) + format!("Failed to deserialize index part file from storage path '{part_storage_path:?}'") })?; let missing_files = index_part.missing_files(); if !missing_files.is_empty() { - warn!( - "Found missing layers in 
index part for timeline {}: {:?}", - sync_id, missing_files - ); + warn!("Found missing layers in index part for timeline {sync_id}: {missing_files:?}"); } Ok(index_part) @@ -100,21 +91,17 @@ where let remote_timeline = match remote_timeline { Some(remote_timeline) => { if !remote_timeline.awaits_download { - error!("Timeline with sync id {} is not awaiting download", sync_id); + error!("Timeline with sync id {sync_id} is not awaiting download"); return DownloadedTimeline::Abort; } remote_timeline } None => { - error!( - "Timeline with sync id {} is not present in the remote index", - sync_id - ); + error!("Timeline with sync id {sync_id} is not present in the remote index"); return DownloadedTimeline::Abort; } }; - debug!("Downloading timeline layers for sync id {}", sync_id); let download = &mut download_data.data; let layers_to_download = remote_timeline @@ -123,7 +110,8 @@ where .cloned() .collect::>(); - trace!("Layers to download: {:?}", layers_to_download); + debug!("Layers to download: {layers_to_download:?}"); + info!("Downloading {} timeline layers", layers_to_download.len()); let mut download_tasks = layers_to_download .into_iter() @@ -157,8 +145,7 @@ where .await .with_context(|| { format!( - "Failed to download a layer from storage path '{:?}'", - layer_storage_path + "Failed to download a layer from storage path '{layer_storage_path:?}'" ) })?; } @@ -166,8 +153,6 @@ where }) .collect::>(); - debug!("Downloading {} layers of a timeline", download_tasks.len()); - let mut errors_happened = false; while let Some(download_result) = download_tasks.next().await { match download_result { @@ -176,21 +161,18 @@ where } Err(e) => { errors_happened = true; - error!( - "Failed to download a layer for timeline {}: {:?}", - sync_id, e - ); + error!("Failed to download a layer for timeline {sync_id}: {e:?}"); } } } if errors_happened { - debug!("Reenqueuing failed download task for timeline {}", sync_id); + debug!("Reenqueuing failed download task for timeline 
{sync_id}"); download_data.retries += 1; sync_queue::push(sync_id, SyncTask::Download(download_data)); DownloadedTimeline::FailedAndRescheduled } else { - debug!("Finished downloading all timeline's layers"); + info!("Successfully downloaded all layers"); DownloadedTimeline::Successful(download_data) } } @@ -266,10 +248,9 @@ mod tests { .await { DownloadedTimeline::Successful(data) => data, - wrong_result => panic!( - "Expected a successful download for timeline, but got: {:?}", - wrong_result - ), + wrong_result => { + panic!("Expected a successful download for timeline, but got: {wrong_result:?}") + } }; assert_eq!( diff --git a/pageserver/src/remote_storage/storage_sync/upload.rs b/pageserver/src/remote_storage/storage_sync/upload.rs index 81758ce3ef..b4a2f6f989 100644 --- a/pageserver/src/remote_storage/storage_sync/upload.rs +++ b/pageserver/src/remote_storage/storage_sync/upload.rs @@ -5,7 +5,7 @@ use std::{fmt::Debug, path::PathBuf}; use anyhow::Context; use futures::stream::{FuturesUnordered, StreamExt}; use tokio::fs; -use tracing::{debug, error, trace, warn}; +use tracing::{debug, error, info, warn}; use crate::{ config::PageServerConf, @@ -53,10 +53,7 @@ where ) .await .with_context(|| { - format!( - "Failed to upload index part to the storage path '{:?}'", - index_part_storage_path - ) + format!("Failed to upload index part to the storage path '{index_part_storage_path:?}'") }) } @@ -89,10 +86,6 @@ where { let upload = &mut upload_data.data; let new_upload_lsn = upload.metadata.disk_consistent_lsn(); - debug!( - "Uploading timeline layers for sync id {}, new lsn: {}", - sync_id, new_upload_lsn - ); let already_uploaded_layers = remote_timeline .map(|timeline| timeline.stored_files()) @@ -105,7 +98,11 @@ where .cloned() .collect::>(); - trace!("Layers to upload: {:?}", layers_to_upload); + debug!("Layers to upload: {layers_to_upload:?}"); + info!( + "Uploading {} timeline layers, new lsn: {new_upload_lsn}", + layers_to_upload.len(), + ); let mut 
upload_tasks = layers_to_upload .into_iter() @@ -157,8 +154,6 @@ where }) .collect::>(); - debug!("uploading {} layers of a timeline", upload_tasks.len()); - let mut errors_happened = false; let mut local_fs_updated = false; while let Some(upload_result) = upload_tasks.next().await { @@ -170,16 +165,19 @@ where Err(e) => match e { UploadError::Other(e) => { errors_happened = true; - error!("Failed to upload a layer for timeline {}: {:?}", sync_id, e); + error!("Failed to upload a layer for timeline {sync_id}: {e:?}"); } UploadError::MissingLocalFile(source_path, e) => { if source_path.exists() { errors_happened = true; - error!("Failed to upload a layer for timeline {}: {:?}", sync_id, e); + error!("Failed to upload a layer for timeline {sync_id}: {e:?}"); } else { local_fs_updated = true; upload.layers_to_upload.remove(&source_path); - warn!("Missing locally a layer file scheduled for upload, skipping"); + warn!( + "Missing locally a layer file {} scheduled for upload, skipping", + source_path.display() + ); } } }, @@ -187,17 +185,16 @@ where } if errors_happened { - debug!("Reenqueuing failed upload task for timeline {}", sync_id); + debug!("Reenqueuing failed upload task for timeline {sync_id}"); upload_data.retries += 1; sync_queue::push(sync_id, SyncTask::Upload(upload_data)); UploadedTimeline::FailedAndRescheduled + } else if local_fs_updated { + info!("Successfully uploaded all layers, some local layers were removed during the upload"); + UploadedTimeline::SuccessfulAfterLocalFsUpdate(upload_data) } else { - debug!("Finished uploading all timeline's layers"); - if local_fs_updated { - UploadedTimeline::SuccessfulAfterLocalFsUpdate(upload_data) - } else { - UploadedTimeline::Successful(upload_data) - } + info!("Successfully uploaded all layers"); + UploadedTimeline::Successful(upload_data) } } @@ -253,10 +250,9 @@ mod tests { let upload_data = match upload_result { UploadedTimeline::Successful(upload_data) => upload_data, - wrong_result => panic!( - "Expected 
a successful upload for timeline, but got: {:?}", - wrong_result - ), + wrong_result => { + panic!("Expected a successful upload for timeline, but got: {wrong_result:?}") + } }; assert_eq!( @@ -344,8 +340,7 @@ mod tests { let upload_data = match upload_result { UploadedTimeline::SuccessfulAfterLocalFsUpdate(upload_data) => upload_data, wrong_result => panic!( - "Expected a successful after local fs upload for timeline, but got: {:?}", - wrong_result + "Expected a successful after local fs upload for timeline, but got: {wrong_result:?}" ), }; From 170badd62604c050f671b1cd65a572f630f17e09 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Wed, 20 Apr 2022 11:11:07 +0300 Subject: [PATCH 119/296] Capture the postgres log in all tests that start a vanilla Postgres. --- test_runner/fixtures/zenith_fixtures.py | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/test_runner/fixtures/zenith_fixtures.py b/test_runner/fixtures/zenith_fixtures.py index 8dfe219966..a9c4c0f395 100644 --- a/test_runner/fixtures/zenith_fixtures.py +++ b/test_runner/fixtures/zenith_fixtures.py @@ -1273,10 +1273,14 @@ class VanillaPostgres(PgProtocol): with open(os.path.join(self.pgdatadir, 'postgresql.conf'), 'a') as conf_file: conf_file.writelines(options) - def start(self): + def start(self, log_path: Optional[str] = None): assert not self.running self.running = True - self.pg_bin.run_capture(['pg_ctl', '-D', self.pgdatadir, 'start']) + + if log_path is None: + log_path = os.path.join(self.pgdatadir, "pg.log") + + self.pg_bin.run_capture(['pg_ctl', '-D', self.pgdatadir, '-l', log_path, 'start']) def stop(self): assert self.running From 5e95338ee9a898ab42e96050ee348720fbe50861 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Wed, 20 Apr 2022 11:16:13 +0300 Subject: [PATCH 120/296] Improve logging in test_wal_restore.py - Capture the output of the restore_from_wal.sh in a log file - Kill "restored" Postgres server on test failure --- 
test_runner/batch_others/test_wal_restore.py | 24 +++++++++----------- zenith_utils/scripts/restore_from_wal.sh | 1 + 2 files changed, 12 insertions(+), 13 deletions(-) diff --git a/test_runner/batch_others/test_wal_restore.py b/test_runner/batch_others/test_wal_restore.py index a5855f2258..8cc27a455c 100644 --- a/test_runner/batch_others/test_wal_restore.py +++ b/test_runner/batch_others/test_wal_restore.py @@ -1,7 +1,6 @@ import os import subprocess -from fixtures.utils import mkdir_if_needed from fixtures.zenith_fixtures import (ZenithEnvBuilder, VanillaPostgres, PortDistributor, @@ -13,6 +12,7 @@ from fixtures.log_helper import log def test_wal_restore(zenith_env_builder: ZenithEnvBuilder, + pg_bin: PgBin, test_output_dir, port_distributor: PortDistributor): zenith_env_builder.num_safekeepers = 1 @@ -24,15 +24,13 @@ def test_wal_restore(zenith_env_builder: ZenithEnvBuilder, env.zenith_cli.pageserver_stop() port = port_distributor.get_port() data_dir = os.path.join(test_output_dir, 'pgsql.restored') - restored = VanillaPostgres(data_dir, PgBin(test_output_dir), port) - subprocess.call([ - 'bash', - os.path.join(base_dir, 'zenith_utils/scripts/restore_from_wal.sh'), - os.path.join(pg_distrib_dir, 'bin'), - os.path.join(test_output_dir, 'repo/safekeepers/sk1/{}/*'.format(tenant_id)), - data_dir, - str(port) - ]) - restored.start() - assert restored.safe_psql('select count(*) from t') == [(1000000, )] - restored.stop() + with VanillaPostgres(data_dir, PgBin(test_output_dir), port) as restored: + pg_bin.run_capture([ + os.path.join(base_dir, 'zenith_utils/scripts/restore_from_wal.sh'), + os.path.join(pg_distrib_dir, 'bin'), + os.path.join(test_output_dir, 'repo/safekeepers/sk1/{}/*'.format(tenant_id)), + data_dir, + str(port) + ]) + restored.start() + assert restored.safe_psql('select count(*) from t') == [(1000000, )] diff --git a/zenith_utils/scripts/restore_from_wal.sh b/zenith_utils/scripts/restore_from_wal.sh index ef2171312b..f05fbc609a 100755 --- 
a/zenith_utils/scripts/restore_from_wal.sh +++ b/zenith_utils/scripts/restore_from_wal.sh @@ -1,3 +1,4 @@ +#!/bin/bash PG_BIN=$1 WAL_PATH=$2 DATA_DIR=$3 From ac52f4f2d66885c25b99befd942825c16fd2759e Mon Sep 17 00:00:00 2001 From: Konstantin Knizhnik Date: Wed, 20 Apr 2022 13:24:38 +0300 Subject: [PATCH 121/296] Set superuser when initializing database for wal recovery (#1544) --- test_runner/batch_others/test_wal_restore.py | 2 +- zenith_utils/scripts/restore_from_wal.sh | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/test_runner/batch_others/test_wal_restore.py b/test_runner/batch_others/test_wal_restore.py index 8cc27a455c..2dbde954fc 100644 --- a/test_runner/batch_others/test_wal_restore.py +++ b/test_runner/batch_others/test_wal_restore.py @@ -33,4 +33,4 @@ def test_wal_restore(zenith_env_builder: ZenithEnvBuilder, str(port) ]) restored.start() - assert restored.safe_psql('select count(*) from t') == [(1000000, )] + assert restored.safe_psql('select count(*) from t', user='zenith_admin') == [(1000000, )] diff --git a/zenith_utils/scripts/restore_from_wal.sh b/zenith_utils/scripts/restore_from_wal.sh index f05fbc609a..4983449f24 100755 --- a/zenith_utils/scripts/restore_from_wal.sh +++ b/zenith_utils/scripts/restore_from_wal.sh @@ -5,7 +5,7 @@ DATA_DIR=$3 PORT=$4 SYSID=`od -A n -j 24 -N 8 -t d8 $WAL_PATH/000000010000000000000002* | cut -c 3-` rm -fr $DATA_DIR -env -i LD_LIBRARY_PATH=$PG_BIN/../lib $PG_BIN/initdb -E utf8 -D $DATA_DIR --sysid=$SYSID +env -i LD_LIBRARY_PATH=$PG_BIN/../lib $PG_BIN/initdb -E utf8 -U zenith_admin -D $DATA_DIR --sysid=$SYSID echo port=$PORT >> $DATA_DIR/postgresql.conf REDO_POS=0x`$PG_BIN/pg_controldata -D $DATA_DIR | fgrep "REDO location"| cut -c 42-` declare -i WAL_SIZE=$REDO_POS+114 From e660e12f797beebc62f17ba230c42ba0afc44315 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Wed, 20 Apr 2022 12:18:24 +0300 Subject: [PATCH 122/296] Update rustls-split and rustls versions. 
All dependencies now use rustls 0.20.2, so we no longer need to build two versions of it. --- Cargo.lock | 48 ++++++++-------------------- zenith_utils/Cargo.toml | 5 +-- zenith_utils/src/postgres_backend.rs | 4 +-- zenith_utils/src/sock_split.rs | 28 +++++++++------- zenith_utils/tests/ssl_test.rs | 37 +++++++++++++-------- 5 files changed, 57 insertions(+), 65 deletions(-) diff --git a/Cargo.lock b/Cargo.lock index 3480f120e0..ef289776e1 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -1050,7 +1050,7 @@ checksum = "d87c48c02e0dc5e3b849a2041db3029fd066650f8f717c07bf8ed78ccb895cac" dependencies = [ "http", "hyper", - "rustls 0.20.2", + "rustls", "tokio", "tokio-rustls", ] @@ -1868,7 +1868,7 @@ dependencies = [ "reqwest", "routerify 2.2.0", "rstest", - "rustls 0.20.2", + "rustls", "rustls-pemfile", "scopeguard", "serde", @@ -2048,7 +2048,7 @@ dependencies = [ "mime", "percent-encoding", "pin-project-lite", - "rustls 0.20.2", + "rustls", "rustls-pemfile", "serde", "serde_json", @@ -2222,26 +2222,13 @@ dependencies = [ [[package]] name = "rustls" -version = "0.19.1" +version = "0.20.4" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "35edb675feee39aec9c99fa5ff985081995a06d594114ae14cbe797ad7b7a6d7" -dependencies = [ - "base64 0.13.0", - "log", - "ring", - "sct 0.6.1", - "webpki 0.21.4", -] - -[[package]] -name = "rustls" -version = "0.20.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d37e5e2290f3e040b594b1a9e04377c2c671f1a1cfd9bfdef82106ac1c113f84" +checksum = "4fbfeb8d0ddb84706bc597a5574ab8912817c52a397f819e5b614e2265206921" dependencies = [ "log", "ring", - "sct 0.7.0", + "sct", "webpki 0.22.0", ] @@ -2256,11 +2243,11 @@ dependencies = [ [[package]] name = "rustls-split" -version = "0.2.2" +version = "0.3.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7fb079b52cfdb005752b7c3c646048e702003576a8321058e4c8b38227c11aa6" +checksum = 
"78802c9612b4689d207acff746f38132ca1b12dadb55d471aa5f10fd580f47d3" dependencies = [ - "rustls 0.19.1", + "rustls", ] [[package]] @@ -2339,16 +2326,6 @@ version = "1.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "d29ab0c6d3fc0ee92fe66e2d99f700eab17a8d57d1c1d3b748380fb20baa78cd" -[[package]] -name = "sct" -version = "0.6.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b362b83898e0e69f38515b82ee15aa80636befe47c3b6d3d89a911e78fc228ce" -dependencies = [ - "ring", - "untrusted", -] - [[package]] name = "sct" version = "0.7.0" @@ -2789,7 +2766,7 @@ checksum = "606f2b73660439474394432239c82249c0d45eb5f23d91f401be1e33590444a7" dependencies = [ "futures", "ring", - "rustls 0.20.2", + "rustls", "tokio", "tokio-postgres", "tokio-rustls", @@ -2801,7 +2778,7 @@ version = "0.23.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "4151fda0cf2798550ad0b34bcfc9b9dcc2a9d2471c895c68f3a8818e54f2389e" dependencies = [ - "rustls 0.20.2", + "rustls", "tokio", "webpki 0.22.0", ] @@ -3392,7 +3369,8 @@ dependencies = [ "postgres-protocol", "rand", "routerify 3.0.0", - "rustls 0.19.1", + "rustls", + "rustls-pemfile", "rustls-split", "serde", "serde_json", diff --git a/zenith_utils/Cargo.toml b/zenith_utils/Cargo.toml index cf864b3a54..2b1caa9be2 100644 --- a/zenith_utils/Cargo.toml +++ b/zenith_utils/Cargo.toml @@ -24,8 +24,8 @@ signal-hook = "0.3.10" rand = "0.8.3" jsonwebtoken = "7" hex = { version = "0.4.3", features = ["serde"] } -rustls = "0.19.1" -rustls-split = "0.2.1" +rustls = "0.20.2" +rustls-split = "0.3.0" git-version = "0.3.5" serde_with = "1.12.0" @@ -39,6 +39,7 @@ hex-literal = "0.3" tempfile = "3.2" webpki = "0.21" criterion = "0.3" +rustls-pemfile = "0.2.1" [[bench]] name = "benchmarks" diff --git a/zenith_utils/src/postgres_backend.rs b/zenith_utils/src/postgres_backend.rs index f984fb4417..fab3c388b1 100644 --- a/zenith_utils/src/postgres_backend.rs +++ 
b/zenith_utils/src/postgres_backend.rs @@ -304,8 +304,8 @@ impl PostgresBackend { pub fn start_tls(&mut self) -> anyhow::Result<()> { match self.stream.take() { Some(Stream::Bidirectional(bidi_stream)) => { - let session = rustls::ServerSession::new(&self.tls_config.clone().unwrap()); - self.stream = Some(Stream::Bidirectional(bidi_stream.start_tls(session)?)); + let conn = rustls::ServerConnection::new(self.tls_config.clone().unwrap())?; + self.stream = Some(Stream::Bidirectional(bidi_stream.start_tls(conn)?)); Ok(()) } stream => { diff --git a/zenith_utils/src/sock_split.rs b/zenith_utils/src/sock_split.rs index c62963e113..5e4598daf1 100644 --- a/zenith_utils/src/sock_split.rs +++ b/zenith_utils/src/sock_split.rs @@ -4,7 +4,7 @@ use std::{ sync::Arc, }; -use rustls::Session; +use rustls::Connection; /// Wrapper supporting reads of a shared TcpStream. pub struct ArcTcpRead(Arc<TcpStream>); @@ -56,7 +56,7 @@ impl BufStream { pub enum ReadStream { Tcp(BufReader<ArcTcpRead>), - Tls(rustls_split::ReadHalf<rustls::ServerSession>), + Tls(rustls_split::ReadHalf), } impl io::Read for ReadStream { @@ -79,7 +79,7 @@ impl ReadStream { pub enum WriteStream { Tcp(Arc<TcpStream>), - Tls(rustls_split::WriteHalf<rustls::ServerSession>), + Tls(rustls_split::WriteHalf), } impl WriteStream { @@ -107,11 +107,11 @@ impl io::Write for WriteStream { } } -type TlsStream<T> = rustls::StreamOwned<rustls::ServerSession, T>; +type TlsStream<T> = rustls::StreamOwned<rustls::ServerConnection, T>; pub enum BidiStream { Tcp(BufStream), - /// This variant is boxed, because [`rustls::ServerSession`] is quite larger than [`BufStream`]. + /// This variant is boxed, because [`rustls::ServerConnection`] is quite larger than [`BufStream`].
Tls(Box<TlsStream<BufStream>>), } @@ -127,7 +127,7 @@ impl BidiStream { if how == Shutdown::Read { tls_boxed.sock.get_ref().shutdown(how) } else { - tls_boxed.sess.send_close_notify(); + tls_boxed.conn.send_close_notify(); let res = tls_boxed.flush(); tls_boxed.sock.get_ref().shutdown(how)?; res @@ -154,19 +154,23 @@ impl BidiStream { // TODO would be nice to avoid the Arc here let socket = Arc::try_unwrap(reader.into_inner().0).unwrap(); - let (read_half, write_half) = - rustls_split::split(socket, tls_boxed.sess, read_buf_cfg, write_buf_cfg); + let (read_half, write_half) = rustls_split::split( + socket, + Connection::Server(tls_boxed.conn), + read_buf_cfg, + write_buf_cfg, + ); (ReadStream::Tls(read_half), WriteStream::Tls(write_half)) } } } - pub fn start_tls(self, mut session: rustls::ServerSession) -> io::Result<Self> { + pub fn start_tls(self, mut conn: rustls::ServerConnection) -> io::Result<Self> { match self { Self::Tcp(mut stream) => { - session.complete_io(&mut stream)?; - assert!(!session.is_handshaking()); - Ok(Self::Tls(Box::new(TlsStream::new(session, stream)))) + conn.complete_io(&mut stream)?; + assert!(!conn.is_handshaking()); + Ok(Self::Tls(Box::new(TlsStream::new(conn, stream)))) } Self::Tls { .. } => Err(io::Error::new( io::ErrorKind::InvalidInput, diff --git a/zenith_utils/tests/ssl_test.rs b/zenith_utils/tests/ssl_test.rs index ef2bf1ed4a..0e330c44f8 100644 --- a/zenith_utils/tests/ssl_test.rs +++ b/zenith_utils/tests/ssl_test.rs @@ -8,7 +8,6 @@ use std::{ use byteorder::{BigEndian, ReadBytesExt, WriteBytesExt}; use bytes::{Buf, BufMut, Bytes, BytesMut}; use lazy_static::lazy_static; -use rustls::Session; use zenith_utils::postgres_backend::{AuthType, Handler, PostgresBackend}; @@ -23,11 +22,11 @@ fn make_tcp_pair() -> (TcpStream, TcpStream) { lazy_static!
{ static ref KEY: rustls::PrivateKey = { let mut cursor = Cursor::new(include_bytes!("key.pem")); - rustls::internal::pemfile::rsa_private_keys(&mut cursor).unwrap()[0].clone() + rustls::PrivateKey(rustls_pemfile::rsa_private_keys(&mut cursor).unwrap()[0].clone()) }; static ref CERT: rustls::Certificate = { let mut cursor = Cursor::new(include_bytes!("cert.pem")); - rustls::internal::pemfile::certs(&mut cursor).unwrap()[0].clone() + rustls::Certificate(rustls_pemfile::certs(&mut cursor).unwrap()[0].clone()) }; } @@ -45,17 +44,23 @@ fn ssl() { let ssl_response = client_sock.read_u8().unwrap(); assert_eq!(b'S', ssl_response); - let mut cfg = rustls::ClientConfig::new(); - cfg.root_store.add(&CERT).unwrap(); + let cfg = rustls::ClientConfig::builder() + .with_safe_defaults() + .with_root_certificates({ + let mut store = rustls::RootCertStore::empty(); + store.add(&CERT).unwrap(); + store + }) + .with_no_client_auth(); let client_config = Arc::new(cfg); - let dns_name = webpki::DNSNameRef::try_from_ascii_str("localhost").unwrap(); - let mut session = rustls::ClientSession::new(&client_config, dns_name); + let dns_name = "localhost".try_into().unwrap(); + let mut conn = rustls::ClientConnection::new(client_config, dns_name).unwrap(); - session.complete_io(&mut client_sock).unwrap(); - assert!(!session.is_handshaking()); + conn.complete_io(&mut client_sock).unwrap(); + assert!(!conn.is_handshaking()); - let mut stream = rustls::Stream::new(&mut session, &mut client_sock); + let mut stream = rustls::Stream::new(&mut conn, &mut client_sock); // StartupMessage stream.write_u32::<BigEndian>(9).unwrap(); @@ -105,8 +110,10 @@ fn ssl() { } let mut handler = TestHandler { got_query: false }; - let mut cfg = rustls::ServerConfig::new(rustls::NoClientAuth::new()); - cfg.set_single_cert(vec![CERT.clone()], KEY.clone()) + let cfg = rustls::ServerConfig::builder() + .with_safe_defaults() + .with_no_client_auth() + .with_single_cert(vec![CERT.clone()], KEY.clone()) .unwrap(); let tls_config =
Some(Arc::new(cfg)); @@ -209,8 +216,10 @@ fn server_forces_ssl() { } let mut handler = TestHandler; - let mut cfg = rustls::ServerConfig::new(rustls::NoClientAuth::new()); - cfg.set_single_cert(vec![CERT.clone()], KEY.clone()) + let cfg = rustls::ServerConfig::builder() + .with_safe_defaults() + .with_no_client_auth() + .with_single_cert(vec![CERT.clone()], KEY.clone()) .unwrap(); let tls_config = Some(Arc::new(cfg)); From 9eaa21317c9f00f549e633f71bb44edc28ab821a Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Wed, 20 Apr 2022 14:27:44 +0300 Subject: [PATCH 123/296] Update jsonwebtoken crate. With this, we no longer need to build two versions of 'pem' and 'base64' crates. Introduces a duplicate version of 'time' crate, though, but it's still progress. --- Cargo.lock | 93 ++++++++++++++++++++++++---------------- zenith_utils/Cargo.toml | 2 +- zenith_utils/src/auth.rs | 22 +++++----- 3 files changed, 68 insertions(+), 49 deletions(-) diff --git a/Cargo.lock b/Cargo.lock index ef289776e1..ac53fc3662 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -119,12 +119,6 @@ dependencies = [ "rustc-demangle", ] -[[package]] -name = "base64" -version = "0.12.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3441f0f7b02788e948e47f457ca01f1d7e6d92c693bc132c22b087d3141c03ff" - [[package]] name = "base64" version = "0.13.0" @@ -260,7 +254,7 @@ dependencies = [ "num-integer", "num-traits", "serde", - "time", + "time 0.1.44", "winapi", ] @@ -1163,12 +1157,12 @@ dependencies = [ [[package]] name = "jsonwebtoken" -version = "7.2.0" +version = "8.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "afabcc15e437a6484fc4f12d0fd63068fe457bf93f1c148d3d9649c60b103f32" +checksum = "cc9051c17f81bae79440afa041b3a278e1de71bfb96d32454b477fd4703ccb6f" dependencies = [ - "base64 0.12.3", - "pem 0.8.3", + "base64", + "pem", "ring", "serde", "serde_json", @@ -1382,9 +1376,9 @@ dependencies = [ [[package]] name = "num-bigint" -version = 
"0.2.6" +version = "0.4.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "090c7f9998ee0ff65aa5b723e4009f7b217707f1fb5ea551329cc4d6231fb304" +checksum = "f93ab6289c7b344a8a9f60f88d80aa20032336fe78da341afc91c8a2341fc75f" dependencies = [ "autocfg", "num-integer", @@ -1420,6 +1414,15 @@ dependencies = [ "libc", ] +[[package]] +name = "num_threads" +version = "0.1.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "aba1801fb138d8e85e11d0fc70baf4fe1cdfffda7c6cd34a854905df588e5ed0" +dependencies = [ + "libc", +] + [[package]] name = "object" version = "0.27.1" @@ -1572,24 +1575,13 @@ version = "0.1.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "19b17cddbe7ec3f8bc800887bab5e717348c95ea2ca0b1bf0837fb964dc67099" -[[package]] -name = "pem" -version = "0.8.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "fd56cbd21fea48d0c440b41cd69c589faacade08c992d9a54e471b79d0fd13eb" -dependencies = [ - "base64 0.13.0", - "once_cell", - "regex", -] - [[package]] name = "pem" version = "1.0.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "e9a3b09a20e374558580a4914d3b7d89bd61b954a5a5e1dcbea98753addb1947" dependencies = [ - "base64 0.13.0", + "base64", ] [[package]] @@ -1711,7 +1703,7 @@ name = "postgres-protocol" version = "0.6.1" source = "git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7#2949d98df52587d562986aad155dd4e889e408b7" dependencies = [ - "base64 0.13.0", + "base64", "byteorder", "bytes", "fallible-iterator", @@ -1850,7 +1842,7 @@ version = "0.1.0" dependencies = [ "anyhow", "async-trait", - "base64 0.13.0", + "base64", "bytes", "clap 3.0.14", "fail", @@ -1885,6 +1877,15 @@ dependencies = [ "zenith_utils", ] +[[package]] +name = "quickcheck" +version = "1.0.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"588f6378e4dd99458b60ec275b4477add41ce4fa9f64dcba6f15adccb19b50d6" +dependencies = [ + "rand", +] + [[package]] name = "quote" version = "1.0.15" @@ -1966,7 +1967,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "5911d1403f4143c9d56a702069d593e8d0f3fab880a85e103604d0893ea31ba7" dependencies = [ "chrono", - "pem 1.0.2", + "pem", "ring", "yasna", ] @@ -2031,7 +2032,7 @@ version = "0.11.9" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "87f242f1488a539a79bac6dbe7c8609ae43b7914b7736210f239a37cccb32525" dependencies = [ - "base64 0.13.0", + "base64", "bytes", "encoding_rs", "futures-core", @@ -2124,7 +2125,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "5b4f000e8934c1b4f70adde180056812e7ea6b1a247952db8ee98c94cd3116cc" dependencies = [ "async-trait", - "base64 0.13.0", + "base64", "bytes", "crc32fast", "futures", @@ -2179,7 +2180,7 @@ version = "0.47.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "6264e93384b90a747758bcc82079711eacf2e755c3a8b5091687b5349d870bcc" dependencies = [ - "base64 0.13.0", + "base64", "bytes", "chrono", "digest", @@ -2238,7 +2239,7 @@ version = "0.2.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "5eebeaeb360c87bfb72e84abdb3447159c0eaececf1bef2aecd65a8be949d1c9" dependencies = [ - "base64 0.13.0", + "base64", ] [[package]] @@ -2490,13 +2491,14 @@ dependencies = [ [[package]] name = "simple_asn1" -version = "0.4.1" +version = "0.6.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "692ca13de57ce0613a363c8c2f1de925adebc81b04c923ac60c5488bb44abe4b" +checksum = "4a762b1c38b9b990c694b9c2f8abe3372ce6a9ceaae6bca39cfc46e054f45745" dependencies = [ - "chrono", "num-bigint", "num-traits", + "thiserror", + "time 0.3.9", ] [[package]] @@ -2661,6 +2663,25 @@ dependencies = [ "winapi", ] +[[package]] +name = "time" +version = "0.3.9" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "c2702e08a7a860f005826c6815dcac101b19b5eb330c27fe4a5928fec1d20ddd" +dependencies = [ + "itoa 1.0.1", + "libc", + "num_threads", + "quickcheck", + "time-macros", +] + +[[package]] +name = "time-macros" +version = "0.2.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "42657b1a6f4d817cda8e7a0ace261fe0cc946cf3a80314390b22cc61ae080792" + [[package]] name = "tinytemplate" version = "1.2.1" @@ -2852,7 +2873,7 @@ checksum = "ff08f4649d10a70ffa3522ca559031285d8e421d727ac85c60825761818f5d0a" dependencies = [ "async-stream", "async-trait", - "base64 0.13.0", + "base64", "bytes", "futures-core", "futures-util", diff --git a/zenith_utils/Cargo.toml b/zenith_utils/Cargo.toml index 2b1caa9be2..ca98c8a2e2 100644 --- a/zenith_utils/Cargo.toml +++ b/zenith_utils/Cargo.toml @@ -22,7 +22,7 @@ tracing-subscriber = { version = "0.3", features = ["env-filter"] } nix = "0.23.0" signal-hook = "0.3.10" rand = "0.8.3" -jsonwebtoken = "7" +jsonwebtoken = "8" hex = { version = "0.4.3", features = ["serde"] } rustls = "0.20.2" rustls-split = "0.3.0" diff --git a/zenith_utils/src/auth.rs b/zenith_utils/src/auth.rs index 8271121c63..3bdabacad4 100644 --- a/zenith_utils/src/auth.rs +++ b/zenith_utils/src/auth.rs @@ -1,8 +1,6 @@ // For details about authentication see docs/authentication.md -// TODO there are two issues for our use case in jsonwebtoken library which will be resolved in next release -// The first one is that there is no way to disable expiration claim, but it can be excluded from validation, so use this as a workaround for now. -// Relevant issue: https://github.com/Keats/jsonwebtoken/issues/190 -// The second one is that we wanted to use ed25519 keys, but they are also not supported until next version. So we go with RSA keys for now. 
+// +// TODO: use ed25519 keys // Relevant issue: https://github.com/Keats/jsonwebtoken/issues/162 use serde; @@ -59,19 +57,19 @@ pub fn check_permission(claims: &Claims, tenantid: Option) -> Result< } pub struct JwtAuth { - decoding_key: DecodingKey<'static>, + decoding_key: DecodingKey, validation: Validation, } impl JwtAuth { - pub fn new(decoding_key: DecodingKey<'_>) -> Self { + pub fn new(decoding_key: DecodingKey) -> Self { + let mut validation = Validation::new(JWT_ALGORITHM); + // The default 'required_spec_claims' is 'exp'. But we don't want to require + // expiration. + validation.required_spec_claims = [].into(); Self { - decoding_key: decoding_key.into_static(), - validation: Validation { - algorithms: vec![JWT_ALGORITHM], - validate_exp: false, - ..Default::default() - }, + decoding_key, + validation, } } From 86bf4301b77332490662f39d52d9271c8c52ecd8 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Wed, 20 Apr 2022 14:36:54 +0300 Subject: [PATCH 124/296] Remove unnecessary dependency on 'webpki' --- Cargo.lock | 17 +++-------------- zenith_utils/Cargo.toml | 1 - 2 files changed, 3 insertions(+), 15 deletions(-) diff --git a/Cargo.lock b/Cargo.lock index ac53fc3662..9775ebe6b6 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -2230,7 +2230,7 @@ dependencies = [ "log", "ring", "sct", - "webpki 0.22.0", + "webpki", ] [[package]] @@ -2801,7 +2801,7 @@ checksum = "4151fda0cf2798550ad0b34bcfc9b9dcc2a9d2471c895c68f3a8818e54f2389e" dependencies = [ "rustls", "tokio", - "webpki 0.22.0", + "webpki", ] [[package]] @@ -3209,16 +3209,6 @@ dependencies = [ "wasm-bindgen", ] -[[package]] -name = "webpki" -version = "0.21.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b8e38c0608262c46d4a56202ebabdeb094cef7e560ca7a226c6bf055188aa4ea" -dependencies = [ - "ring", - "untrusted", -] - [[package]] name = "webpki" version = "0.22.0" @@ -3235,7 +3225,7 @@ version = "0.22.2" source = "registry+https://github.com/rust-lang/crates.io-index" 
checksum = "552ceb903e957524388c4d3475725ff2c8b7960922063af6ce53c9a43da07449" dependencies = [ - "webpki 0.22.0", + "webpki", ] [[package]] @@ -3402,7 +3392,6 @@ dependencies = [ "tokio", "tracing", "tracing-subscriber", - "webpki 0.21.4", "workspace_hack", "zenith_metrics", ] diff --git a/zenith_utils/Cargo.toml b/zenith_utils/Cargo.toml index ca98c8a2e2..dd83fa4a92 100644 --- a/zenith_utils/Cargo.toml +++ b/zenith_utils/Cargo.toml @@ -37,7 +37,6 @@ byteorder = "1.4.3" bytes = "1.0.1" hex-literal = "0.3" tempfile = "3.2" -webpki = "0.21" criterion = "0.3" rustls-pemfile = "0.2.1" From cbdfd8c71989e478ba50a63fda8c0687be8ea458 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Wed, 20 Apr 2022 14:42:05 +0300 Subject: [PATCH 125/296] Update 'routerify' dependency in proxy. routerify version 3 is used in zenith_utils, use the same version in proxy to avoid having to build two versions. --- Cargo.lock | 17 ++--------------- proxy/Cargo.toml | 2 +- 2 files changed, 3 insertions(+), 16 deletions(-) diff --git a/Cargo.lock b/Cargo.lock index 9775ebe6b6..1cf8562787 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -1858,7 +1858,7 @@ dependencies = [ "rand", "rcgen", "reqwest", - "routerify 2.2.0", + "routerify", "rstest", "rustls", "rustls-pemfile", @@ -2079,19 +2079,6 @@ dependencies = [ "winapi", ] -[[package]] -name = "routerify" -version = "2.2.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0c6bb49594c791cadb5ccfa5f36d41b498d40482595c199d10cd318800280bd9" -dependencies = [ - "http", - "hyper", - "lazy_static", - "percent-encoding", - "regex", -] - [[package]] name = "routerify" version = "3.0.0" @@ -3379,7 +3366,7 @@ dependencies = [ "postgres", "postgres-protocol", "rand", - "routerify 3.0.0", + "routerify", "rustls", "rustls-pemfile", "rustls-split", diff --git a/proxy/Cargo.toml b/proxy/Cargo.toml index 20b459988a..a4bd99db38 100644 --- a/proxy/Cargo.toml +++ b/proxy/Cargo.toml @@ -20,7 +20,7 @@ parking_lot = "0.11.2" 
pin-project-lite = "0.2.7" rand = "0.8.3" reqwest = { version = "0.11", default-features = false, features = ["blocking", "json", "rustls-tls"] } -routerify = "2" +routerify = "3" rustls = "0.20.0" rustls-pemfile = "0.2.1" scopeguard = "1.1.0" From e113c6fa8d5d478bbf7e78297a4f63b20474719b Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Wed, 20 Apr 2022 16:23:16 +0300 Subject: [PATCH 126/296] Print a warning if unlinking an ephemeral file fails. Unlink failure isn't serious on its own, we were about to remove the file anyway, but it shouldn't happen and could be a symptom of something more serious. We just saw "No such file or directory" errors happening from ephemeral file writeback in staging, and I suspect if we had this warning in place, we would have seen these warnings too, if the problem was that the ephemeral file was removed before dropping the EphemeralFile struct. Next time it happens, we'll have more information. --- pageserver/src/layered_repository/ephemeral_file.rs | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/pageserver/src/layered_repository/ephemeral_file.rs b/pageserver/src/layered_repository/ephemeral_file.rs index d509186e6f..060d44f810 100644 --- a/pageserver/src/layered_repository/ephemeral_file.rs +++ b/pageserver/src/layered_repository/ephemeral_file.rs @@ -16,6 +16,7 @@ use std::io::{Error, ErrorKind}; use std::ops::DerefMut; use std::path::PathBuf; use std::sync::{Arc, RwLock}; +use tracing::*; use zenith_utils::zid::ZTenantId; use zenith_utils::zid::ZTimelineId; @@ -244,9 +245,15 @@ impl Drop for EphemeralFile { // remove entry from the hash map EPHEMERAL_FILES.write().unwrap().files.remove(&self.file_id); - // unlink file - // FIXME: print error - let _ = std::fs::remove_file(&self.file.path); + // unlink the file + let res = std::fs::remove_file(&self.file.path); + if let Err(e) = res { + warn!( + "could not remove ephemeral file '{}': {}", + self.file.path.display(), + e + ); + } } } From 
e41ad3be0fb72c0e83bca01def6bf68537c7dfac Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Wed, 20 Apr 2022 16:21:43 +0300 Subject: [PATCH 127/296] add more context to writeback error --- pageserver/src/layered_repository/ephemeral_file.rs | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/pageserver/src/layered_repository/ephemeral_file.rs b/pageserver/src/layered_repository/ephemeral_file.rs index 060d44f810..a2f8cda461 100644 --- a/pageserver/src/layered_repository/ephemeral_file.rs +++ b/pageserver/src/layered_repository/ephemeral_file.rs @@ -259,8 +259,17 @@ impl Drop for EphemeralFile { pub fn writeback(file_id: u64, blkno: u32, buf: &[u8]) -> Result<(), std::io::Error> { if let Some(file) = EPHEMERAL_FILES.read().unwrap().files.get(&file_id) { - file.write_all_at(buf, blkno as u64 * PAGE_SZ as u64)?; - Ok(()) + match file.write_all_at(buf, blkno as u64 * PAGE_SZ as u64) { + Ok(_) => Ok(()), + Err(e) => Err(std::io::Error::new( + ErrorKind::Other, + format!( + "failed to write back to ephemeral file at {} error: {}", + file.path.display(), + e + ), + )), + } } else { Err(std::io::Error::new( ErrorKind::Other, From 334a1d6b5dd2c476bff082c297d1f5a725408875 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Wed, 20 Apr 2022 21:25:12 +0300 Subject: [PATCH 128/296] Fix materialized page caching with delta layers. We only checked the cache page version when collecting WAL records in an in-memory layer, not in a delta layer. Refactor the code so that we always stop collecting WAL records when we reach a cached materialized page. Fix the assertion on the LSN range in InMemoryLayer::get_value_reconstruct_data. It was supposed to check that the requested LSN range is within the layer's LSN range, but the inequality was backwards. That went unnoticed before, because the caller always passed the layer's start LSN as the requested LSN range's start LSN, but now we might stop the search earlier, if we have a cached page version. 
Co-authored-by: Konstantin Knizhnik --- pageserver/src/layered_repository.rs | 25 +++++++++++++++---- .../src/layered_repository/delta_layer.rs | 1 + .../src/layered_repository/image_layer.rs | 1 + .../src/layered_repository/inmemory_layer.rs | 9 +------ 4 files changed, 23 insertions(+), 13 deletions(-) diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 6769c9cfbc..c66e4708ff 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -1149,6 +1149,12 @@ impl LayeredTimeline { let mut path: Vec<(ValueReconstructResult, Lsn, Arc)> = Vec::new(); + let cached_lsn = if let Some((cached_lsn, _)) = &reconstruct_state.img { + *cached_lsn + } else { + Lsn(0) + }; + // 'prev_lsn' tracks the last LSN that we were at in our search. It's used // to check that each iteration make some progress, to break infinite // looping if something goes wrong. @@ -1159,10 +1165,14 @@ impl LayeredTimeline { 'outer: loop { // The function should have updated 'state' - //info!("CALLED for {} at {}: {:?} with {} records", reconstruct_state.key, reconstruct_state.lsn, result, reconstruct_state.records.len()); + //info!("CALLED for {} at {}: {:?} with {} records, cached {}", key, cont_lsn, result, reconstruct_state.records.len(), cached_lsn); match result { ValueReconstructResult::Complete => return Ok(()), ValueReconstructResult::Continue => { + // If we reached an earlier cached page image, we're done. + if cont_lsn == cached_lsn + 1 { + return Ok(()); + } if prev_lsn <= cont_lsn { // Didn't make any progress in last iteration. Error out to avoid // getting stuck in the loop. @@ -1216,12 +1226,15 @@ impl LayeredTimeline { let start_lsn = open_layer.get_lsn_range().start; if cont_lsn > start_lsn { //info!("CHECKING for {} at {} on open layer {}", key, cont_lsn, open_layer.filename().display()); + // Get all the data needed to reconstruct the page version from this layer. 
+ // But if we have an older cached page image, no need to go past that. + let lsn_floor = max(cached_lsn + 1, start_lsn); result = open_layer.get_value_reconstruct_data( key, - open_layer.get_lsn_range().start..cont_lsn, + lsn_floor..cont_lsn, reconstruct_state, )?; - cont_lsn = start_lsn; + cont_lsn = lsn_floor; path.push((result, cont_lsn, open_layer.clone())); continue; } @@ -1230,12 +1243,13 @@ impl LayeredTimeline { let start_lsn = frozen_layer.get_lsn_range().start; if cont_lsn > start_lsn { //info!("CHECKING for {} at {} on frozen layer {}", key, cont_lsn, frozen_layer.filename().display()); + let lsn_floor = max(cached_lsn + 1, start_lsn); result = frozen_layer.get_value_reconstruct_data( key, - frozen_layer.get_lsn_range().start..cont_lsn, + lsn_floor..cont_lsn, reconstruct_state, )?; - cont_lsn = start_lsn; + cont_lsn = lsn_floor; path.push((result, cont_lsn, frozen_layer.clone())); continue 'outer; } @@ -1244,6 +1258,7 @@ impl LayeredTimeline { if let Some(SearchResult { lsn_floor, layer }) = layers.search(key, cont_lsn)? 
{ //info!("CHECKING for {} at {} on historic layer {}", key, cont_lsn, layer.filename().display()); + let lsn_floor = max(cached_lsn + 1, lsn_floor); result = layer.get_value_reconstruct_data( key, lsn_floor..cont_lsn, diff --git a/pageserver/src/layered_repository/delta_layer.rs b/pageserver/src/layered_repository/delta_layer.rs index 6e3d65a94d..03b7e453b3 100644 --- a/pageserver/src/layered_repository/delta_layer.rs +++ b/pageserver/src/layered_repository/delta_layer.rs @@ -222,6 +222,7 @@ impl Layer for DeltaLayer { lsn_range: Range<Lsn>, reconstruct_state: &mut ValueReconstructState, ) -> anyhow::Result<ValueReconstructResult> { + ensure!(lsn_range.start >= self.lsn_range.start); let mut need_image = true; ensure!(self.key_range.contains(&key)); diff --git a/pageserver/src/layered_repository/image_layer.rs b/pageserver/src/layered_repository/image_layer.rs index 0f334658bf..fa91198a79 100644 --- a/pageserver/src/layered_repository/image_layer.rs +++ b/pageserver/src/layered_repository/image_layer.rs @@ -148,6 +148,7 @@ impl Layer for ImageLayer { reconstruct_state: &mut ValueReconstructState, ) -> anyhow::Result<ValueReconstructResult> { assert!(self.key_range.contains(&key)); + assert!(lsn_range.start >= self.lsn); assert!(lsn_range.end >= self.lsn); let inner = self.load()?; diff --git a/pageserver/src/layered_repository/inmemory_layer.rs b/pageserver/src/layered_repository/inmemory_layer.rs index ffb5be1dd4..33e1eabd8e 100644 --- a/pageserver/src/layered_repository/inmemory_layer.rs +++ b/pageserver/src/layered_repository/inmemory_layer.rs @@ -113,7 +113,7 @@ impl Layer for InMemoryLayer { lsn_range: Range<Lsn>, reconstruct_state: &mut ValueReconstructState, ) -> anyhow::Result<ValueReconstructResult> { - ensure!(lsn_range.start <= self.start_lsn); + ensure!(lsn_range.start >= self.start_lsn); let mut need_image = true; let inner = self.inner.read().unwrap(); @@ -124,13 +124,6 @@ impl Layer for InMemoryLayer { if let Some(vec_map) = inner.index.get(&key) { let slice = vec_map.slice_range(lsn_range); for (entry_lsn, pos) in
slice.iter().rev() { - match &reconstruct_state.img { - Some((cached_lsn, _)) if entry_lsn <= cached_lsn => { - return Ok(ValueReconstructResult::Complete) - } - _ => {} - } - let buf = reader.read_blob(*pos)?; let value = Value::des(&buf)?; match value { From 9d3779c1247eeda8e99f414bc0af7021a6550f4f Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Wed, 20 Apr 2022 21:25:16 +0300 Subject: [PATCH 129/296] Add a counter for materialized page cache hits. --- pageserver/src/layered_repository.rs | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index c66e4708ff..59a3def1fb 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -49,8 +49,8 @@ use crate::CheckpointConfig; use crate::{ZTenantId, ZTimelineId}; use zenith_metrics::{ - register_histogram_vec, register_int_counter, register_int_gauge_vec, Histogram, HistogramVec, - IntCounter, IntGauge, IntGaugeVec, + register_histogram_vec, register_int_counter, register_int_counter_vec, register_int_gauge_vec, + Histogram, HistogramVec, IntCounter, IntCounterVec, IntGauge, IntGaugeVec, }; use zenith_utils::crashsafe_dir; use zenith_utils::lsn::{AtomicLsn, Lsn, RecordLsn}; @@ -101,6 +101,15 @@ lazy_static! { .expect("failed to define a metric"); } +lazy_static! { + static ref MATERIALIZED_PAGE_CACHE_HIT: IntCounterVec = register_int_counter_vec!( + "materialize_page_cache_hits", + "Number of cache hits from materialized page cache", + &["tenant_id", "timeline_id"] + ) + .expect("failed to define a metric"); +} + lazy_static! 
{ static ref LAST_RECORD_LSN: IntGaugeVec = register_int_gauge_vec!( "pageserver_last_record_lsn", @@ -778,6 +787,7 @@ pub struct LayeredTimeline { // Metrics reconstruct_time_histo: Histogram, + materialized_page_cache_hit_counter: IntCounter, flush_time_histo: Histogram, compact_time_histo: Histogram, create_images_time_histo: Histogram, @@ -983,6 +993,9 @@ impl LayeredTimeline { let reconstruct_time_histo = RECONSTRUCT_TIME .get_metric_with_label_values(&[&tenantid.to_string(), &timelineid.to_string()]) .unwrap(); + let materialized_page_cache_hit_counter = MATERIALIZED_PAGE_CACHE_HIT + .get_metric_with_label_values(&[&tenantid.to_string(), &timelineid.to_string()]) + .unwrap(); let flush_time_histo = STORAGE_TIME .get_metric_with_label_values(&[ "layer flush", @@ -1029,6 +1042,7 @@ impl LayeredTimeline { ancestor_lsn: metadata.ancestor_lsn(), reconstruct_time_histo, + materialized_page_cache_hit_counter, flush_time_histo, compact_time_histo, create_images_time_histo, @@ -1171,6 +1185,7 @@ impl LayeredTimeline { ValueReconstructResult::Continue => { // If we reached an earlier cached page image, we're done. if cont_lsn == cached_lsn + 1 { + self.materialized_page_cache_hit_counter.inc_by(1); return Ok(()); } if prev_lsn <= cont_lsn { From 629688fd6c144482cf73be1c75334dfc376d88b8 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Wed, 20 Apr 2022 16:24:33 +0300 Subject: [PATCH 130/296] Drop redundant resolver setting for 2021 edition --- .config/hakari.toml | 2 ++ Cargo.toml | 1 - pre-commit.py | 2 +- 3 files changed, 3 insertions(+), 2 deletions(-) diff --git a/.config/hakari.toml b/.config/hakari.toml index 7bccc6c4a3..42d184b857 100644 --- a/.config/hakari.toml +++ b/.config/hakari.toml @@ -10,6 +10,8 @@ dep-format-version = "2" # Hakari works much better with the new feature resolver. 
# For more about the new feature resolver, see: # https://blog.rust-lang.org/2021/03/25/Rust-1.51.0.html#cargos-new-feature-resolver +# Have to keep the resolver still here since hakari requires this field, +# despite it's now the default for 2021 edition & cargo. resolver = "2" # Add triples corresponding to platforms commonly used by developers here. diff --git a/Cargo.toml b/Cargo.toml index 4b3b31e0b7..1405f26517 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -11,7 +11,6 @@ members = [ "zenith_metrics", "zenith_utils", ] -resolver = "2" [profile.release] # This is useful for profiling and, to some extent, debug. diff --git a/pre-commit.py b/pre-commit.py index 1e886e403b..ea6a22a7fe 100755 --- a/pre-commit.py +++ b/pre-commit.py @@ -29,7 +29,7 @@ def colorify( def rustfmt(fix_inplace: bool = False, no_color: bool = False) -> str: - cmd = "rustfmt --edition=2018" + cmd = "rustfmt --edition=2021" if not fix_inplace: cmd += " --check" if no_color: From 81cad6277a2666ca47b97848628ffeafd6bf6aba Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Wed, 20 Apr 2022 16:38:33 +0300 Subject: [PATCH 131/296] Move and library crates into a dedicated directory and rename them --- Cargo.lock | 127 ++++++++---------- Cargo.toml | 4 +- compute_tools/src/bin/zenith_ctl.rs | 2 +- control_plane/Cargo.toml | 2 +- control_plane/src/compute.rs | 11 +- control_plane/src/local_env.rs | 8 +- control_plane/src/safekeeper.rs | 8 +- control_plane/src/storage.rs | 12 +- docs/README.md | 2 +- docs/authentication.md | 2 +- docs/sourcetree.md | 22 +-- {zenith_metrics => libs/metrics}/Cargo.toml | 4 +- {zenith_metrics => libs/metrics}/src/lib.rs | 0 .../metrics}/src/wrappers.rs | 8 +- .../postgres_ffi}/Cargo.toml | 4 +- {postgres_ffi => libs/postgres_ffi}/README | 0 {postgres_ffi => libs/postgres_ffi}/build.rs | 4 +- .../postgres_ffi}/pg_control_ffi.h | 0 .../postgres_ffi}/samples/pg_hba.conf | 0 .../postgres_ffi}/src/controlfile_utils.rs | 4 +- .../postgres_ffi}/src/lib.rs | 0 
.../postgres_ffi}/src/nonrelfile_utils.rs | 0 .../postgres_ffi}/src/pg_constants.rs | 0 .../postgres_ffi}/src/relfile_utils.rs | 0 .../postgres_ffi}/src/waldecoder.rs | 2 +- .../postgres_ffi}/src/xlog_utils.rs | 22 +-- {zenith_utils => libs/utils}/Cargo.toml | 6 +- .../utils}/benches/benchmarks.rs | 2 +- {zenith_utils => libs/utils}/build.rs | 0 .../utils}/scripts/restore_from_wal.sh | 0 .../scripts/restore_from_wal_archive.sh | 0 {zenith_utils => libs/utils}/src/accum.rs | 2 +- {zenith_utils => libs/utils}/src/auth.rs | 0 {zenith_utils => libs/utils}/src/bin_ser.rs | 0 .../utils}/src/connstring.rs | 0 .../utils}/src/crashsafe_dir.rs | 0 .../utils}/src/http/endpoint.rs | 5 +- .../utils}/src/http/error.rs | 0 {zenith_utils => libs/utils}/src/http/json.rs | 0 {zenith_utils => libs/utils}/src/http/mod.rs | 0 .../utils}/src/http/request.rs | 0 {zenith_utils => libs/utils}/src/lib.rs | 4 +- {zenith_utils => libs/utils}/src/logging.rs | 0 {zenith_utils => libs/utils}/src/lsn.rs | 0 {zenith_utils => libs/utils}/src/nonblock.rs | 0 .../utils}/src/postgres_backend.rs | 0 {zenith_utils => libs/utils}/src/pq_proto.rs | 2 +- {zenith_utils => libs/utils}/src/seqwait.rs | 0 .../utils}/src/seqwait_async.rs | 0 {zenith_utils => libs/utils}/src/shutdown.rs | 0 {zenith_utils => libs/utils}/src/signals.rs | 0 .../utils}/src/sock_split.rs | 0 {zenith_utils => libs/utils}/src/sync.rs | 2 +- .../utils}/src/tcp_listener.rs | 0 {zenith_utils => libs/utils}/src/vec_map.rs | 0 {zenith_utils => libs/utils}/src/zid.rs | 0 .../utils}/tests/bin_ser_test.rs | 2 +- {zenith_utils => libs/utils}/tests/cert.pem | 0 {zenith_utils => libs/utils}/tests/key.pem | 0 .../utils}/tests/ssl_test.rs | 2 +- pageserver/Cargo.toml | 6 +- pageserver/src/basebackup.rs | 2 +- pageserver/src/bin/dump_layerfile.rs | 2 +- pageserver/src/bin/pageserver.rs | 24 ++-- pageserver/src/bin/update_metadata.rs | 3 +- pageserver/src/config.rs | 6 +- pageserver/src/http/models.rs | 2 +- pageserver/src/http/routes.rs | 26 ++-- 
pageserver/src/import_datadir.rs | 2 +- pageserver/src/layered_repository.rs | 12 +- .../src/layered_repository/delta_layer.rs | 8 +- .../src/layered_repository/ephemeral_file.rs | 3 +- pageserver/src/layered_repository/filename.rs | 2 +- .../src/layered_repository/image_layer.rs | 8 +- .../src/layered_repository/inmemory_layer.rs | 10 +- .../src/layered_repository/layer_map.rs | 4 +- pageserver/src/layered_repository/metadata.rs | 2 +- .../src/layered_repository/storage_layer.rs | 6 +- pageserver/src/lib.rs | 7 +- pageserver/src/page_cache.rs | 2 +- pageserver/src/page_service.rs | 17 ++- pageserver/src/pgdatadir_mapping.rs | 5 +- pageserver/src/remote_storage.rs | 2 +- pageserver/src/remote_storage/storage_sync.rs | 15 +-- .../remote_storage/storage_sync/download.rs | 5 +- .../src/remote_storage/storage_sync/index.rs | 7 +- .../src/remote_storage/storage_sync/upload.rs | 5 +- pageserver/src/repository.rs | 8 +- pageserver/src/tenant_mgr.rs | 2 +- pageserver/src/tenant_threads.rs | 2 +- pageserver/src/thread_mgr.rs | 2 +- pageserver/src/timelines.rs | 8 +- pageserver/src/virtual_file.rs | 4 +- pageserver/src/walingest.rs | 2 +- pageserver/src/walreceiver.rs | 10 +- pageserver/src/walredo.rs | 7 +- proxy/Cargo.toml | 4 +- proxy/src/auth.rs | 2 +- proxy/src/auth/flow.rs | 2 +- proxy/src/cancellation.rs | 2 +- proxy/src/http.rs | 5 +- proxy/src/main.rs | 4 +- proxy/src/mgmt.rs | 2 +- proxy/src/proxy.rs | 4 +- proxy/src/sasl/messages.rs | 6 +- proxy/src/stream.rs | 2 +- safekeeper/Cargo.toml | 6 +- safekeeper/src/bin/safekeeper.rs | 12 +- safekeeper/src/broker.rs | 10 +- safekeeper/src/callmemaybe.rs | 6 +- safekeeper/src/control_file.rs | 10 +- safekeeper/src/control_file_upgrade.rs | 2 +- safekeeper/src/handler.rs | 11 +- safekeeper/src/http/models.rs | 2 +- safekeeper/src/http/routes.rs | 21 +-- safekeeper/src/json_ctrl.rs | 10 +- safekeeper/src/lib.rs | 2 +- safekeeper/src/receive_wal.rs | 8 +- safekeeper/src/safekeeper.rs | 15 +-- safekeeper/src/send_wal.rs | 15 
++- safekeeper/src/timeline.rs | 9 +- safekeeper/src/wal_service.rs | 2 +- safekeeper/src/wal_storage.rs | 5 +- test_runner/batch_others/test_wal_restore.py | 2 +- workspace_hack/Cargo.toml | 5 +- zenith/Cargo.toml | 4 +- zenith/src/main.rs | 12 +- 127 files changed, 355 insertions(+), 360 deletions(-) rename {zenith_metrics => libs/metrics}/Cargo.toml (69%) rename {zenith_metrics => libs/metrics}/src/lib.rs (100%) rename {zenith_metrics => libs/metrics}/src/wrappers.rs (96%) rename {postgres_ffi => libs/postgres_ffi}/Cargo.toml (77%) rename {postgres_ffi => libs/postgres_ffi}/README (100%) rename {postgres_ffi => libs/postgres_ffi}/build.rs (96%) rename {postgres_ffi => libs/postgres_ffi}/pg_control_ffi.h (100%) rename {postgres_ffi => libs/postgres_ffi}/samples/pg_hba.conf (100%) rename {postgres_ffi => libs/postgres_ffi}/src/controlfile_utils.rs (97%) rename {postgres_ffi => libs/postgres_ffi}/src/lib.rs (100%) rename {postgres_ffi => libs/postgres_ffi}/src/nonrelfile_utils.rs (100%) rename {postgres_ffi => libs/postgres_ffi}/src/pg_constants.rs (100%) rename {postgres_ffi => libs/postgres_ffi}/src/relfile_utils.rs (100%) rename {postgres_ffi => libs/postgres_ffi}/src/waldecoder.rs (99%) rename {postgres_ffi => libs/postgres_ffi}/src/xlog_utils.rs (98%) rename {zenith_utils => libs/utils}/Cargo.toml (88%) rename {zenith_utils => libs/utils}/benches/benchmarks.rs (96%) rename {zenith_utils => libs/utils}/build.rs (100%) rename {zenith_utils => libs/utils}/scripts/restore_from_wal.sh (100%) rename {zenith_utils => libs/utils}/scripts/restore_from_wal_archive.sh (100%) rename {zenith_utils => libs/utils}/src/accum.rs (96%) rename {zenith_utils => libs/utils}/src/auth.rs (100%) rename {zenith_utils => libs/utils}/src/bin_ser.rs (100%) rename {zenith_utils => libs/utils}/src/connstring.rs (100%) rename {zenith_utils => libs/utils}/src/crashsafe_dir.rs (100%) rename {zenith_utils => libs/utils}/src/http/endpoint.rs (97%) rename {zenith_utils => 
libs/utils}/src/http/error.rs (100%) rename {zenith_utils => libs/utils}/src/http/json.rs (100%) rename {zenith_utils => libs/utils}/src/http/mod.rs (100%) rename {zenith_utils => libs/utils}/src/http/request.rs (100%) rename {zenith_utils => libs/utils}/src/lib.rs (95%) rename {zenith_utils => libs/utils}/src/logging.rs (100%) rename {zenith_utils => libs/utils}/src/lsn.rs (100%) rename {zenith_utils => libs/utils}/src/nonblock.rs (100%) rename {zenith_utils => libs/utils}/src/postgres_backend.rs (100%) rename {zenith_utils => libs/utils}/src/pq_proto.rs (99%) rename {zenith_utils => libs/utils}/src/seqwait.rs (100%) rename {zenith_utils => libs/utils}/src/seqwait_async.rs (100%) rename {zenith_utils => libs/utils}/src/shutdown.rs (100%) rename {zenith_utils => libs/utils}/src/signals.rs (100%) rename {zenith_utils => libs/utils}/src/sock_split.rs (100%) rename {zenith_utils => libs/utils}/src/sync.rs (99%) rename {zenith_utils => libs/utils}/src/tcp_listener.rs (100%) rename {zenith_utils => libs/utils}/src/vec_map.rs (100%) rename {zenith_utils => libs/utils}/src/zid.rs (100%) rename {zenith_utils => libs/utils}/tests/bin_ser_test.rs (96%) rename {zenith_utils => libs/utils}/tests/cert.pem (100%) rename {zenith_utils => libs/utils}/tests/key.pem (100%) rename {zenith_utils => libs/utils}/tests/ssl_test.rs (98%) diff --git a/Cargo.lock b/Cargo.lock index 1cf8562787..508b56125d 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -225,9 +225,6 @@ name = "cc" version = "1.0.72" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "22a9137b95ea06864e018375b72adfb7db6e6f68cfc8df5a04d00288050485ee" -dependencies = [ - "jobserver", -] [[package]] name = "cexpr" @@ -368,8 +365,8 @@ dependencies = [ "thiserror", "toml", "url", + "utils", "workspace_hack", - "zenith_utils", ] [[package]] @@ -1137,15 +1134,6 @@ version = "1.0.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = 
"1aab8fc367588b89dcee83ab0fd66b72b50b72fa1904d7095045ace2b0c81c35" -[[package]] -name = "jobserver" -version = "0.1.24" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "af25a77299a7f711a01975c35a6a424eb6862092cc2d6c72c4ed6cbc56dfc1fa" -dependencies = [ - "libc", -] - [[package]] name = "js-sys" version = "0.3.56" @@ -1272,6 +1260,17 @@ dependencies = [ "autocfg", ] +[[package]] +name = "metrics" +version = "0.1.0" +dependencies = [ + "lazy_static", + "libc", + "once_cell", + "prometheus", + "workspace_hack", +] + [[package]] name = "mime" version = "0.3.16" @@ -1514,6 +1513,7 @@ dependencies = [ "hyper", "itertools", "lazy_static", + "metrics", "nix", "once_cell", "postgres", @@ -1539,9 +1539,8 @@ dependencies = [ "toml_edit", "tracing", "url", + "utils", "workspace_hack", - "zenith_metrics", - "zenith_utils", ] [[package]] @@ -1744,8 +1743,8 @@ dependencies = [ "regex", "serde", "thiserror", + "utils", "workspace_hack", - "zenith_utils", ] [[package]] @@ -1853,6 +1852,7 @@ dependencies = [ "hyper", "lazy_static", "md5", + "metrics", "parking_lot", "pin-project-lite", "rand", @@ -1872,9 +1872,8 @@ dependencies = [ "tokio-postgres", "tokio-postgres-rustls", "tokio-rustls", + "utils", "workspace_hack", - "zenith_metrics", - "zenith_utils", ] [[package]] @@ -2267,6 +2266,7 @@ dependencies = [ "humantime", "hyper", "lazy_static", + "metrics", "postgres", "postgres-protocol", "postgres_ffi", @@ -2283,10 +2283,9 @@ dependencies = [ "tokio-util 0.7.0", "tracing", "url", + "utils", "walkdir", "workspace_hack", - "zenith_metrics", - "zenith_utils", ] [[package]] @@ -3063,6 +3062,43 @@ dependencies = [ "percent-encoding", ] +[[package]] +name = "utils" +version = "0.1.0" +dependencies = [ + "anyhow", + "bincode", + "byteorder", + "bytes", + "criterion", + "git-version", + "hex", + "hex-literal", + "hyper", + "jsonwebtoken", + "lazy_static", + "metrics", + "nix", + "pin-project-lite", + "postgres", + "postgres-protocol", + "rand", + "routerify", 
+ "rustls", + "rustls-pemfile", + "rustls-split", + "serde", + "serde_json", + "serde_with", + "signal-hook", + "tempfile", + "thiserror", + "tokio", + "tracing", + "tracing-subscriber", + "workspace_hack", +] + [[package]] name = "valuable" version = "0.1.0" @@ -3272,7 +3308,6 @@ version = "0.1.0" dependencies = [ "anyhow", "bytes", - "cc", "chrono", "clap 2.34.0", "either", @@ -3331,56 +3366,8 @@ dependencies = [ "postgres_ffi", "safekeeper", "serde_json", + "utils", "workspace_hack", - "zenith_utils", -] - -[[package]] -name = "zenith_metrics" -version = "0.1.0" -dependencies = [ - "lazy_static", - "libc", - "once_cell", - "prometheus", - "workspace_hack", -] - -[[package]] -name = "zenith_utils" -version = "0.1.0" -dependencies = [ - "anyhow", - "bincode", - "byteorder", - "bytes", - "criterion", - "git-version", - "hex", - "hex-literal", - "hyper", - "jsonwebtoken", - "lazy_static", - "nix", - "pin-project-lite", - "postgres", - "postgres-protocol", - "rand", - "routerify", - "rustls", - "rustls-pemfile", - "rustls-split", - "serde", - "serde_json", - "serde_with", - "signal-hook", - "tempfile", - "thiserror", - "tokio", - "tracing", - "tracing-subscriber", - "workspace_hack", - "zenith_metrics", ] [[package]] diff --git a/Cargo.toml b/Cargo.toml index 1405f26517..35c18ba237 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -3,13 +3,11 @@ members = [ "compute_tools", "control_plane", "pageserver", - "postgres_ffi", "proxy", "safekeeper", "workspace_hack", "zenith", - "zenith_metrics", - "zenith_utils", + "libs/*", ] [profile.release] diff --git a/compute_tools/src/bin/zenith_ctl.rs b/compute_tools/src/bin/zenith_ctl.rs index 372afbc633..a5dfb1c875 100644 --- a/compute_tools/src/bin/zenith_ctl.rs +++ b/compute_tools/src/bin/zenith_ctl.rs @@ -157,7 +157,7 @@ fn run_compute(state: &Arc>) -> Result { } fn main() -> Result<()> { - // TODO: re-use `zenith_utils::logging` later + // TODO: re-use `utils::logging` later init_logger(DEFAULT_LOG_LEVEL)?; // Env variable is set 
by `cargo` diff --git a/control_plane/Cargo.toml b/control_plane/Cargo.toml index 80b6c00dd2..33d01f7556 100644 --- a/control_plane/Cargo.toml +++ b/control_plane/Cargo.toml @@ -19,5 +19,5 @@ reqwest = { version = "0.11", default-features = false, features = ["blocking", pageserver = { path = "../pageserver" } safekeeper = { path = "../safekeeper" } -zenith_utils = { path = "../zenith_utils" } +utils = { path = "../libs/utils" } workspace_hack = { version = "0.1", path = "../workspace_hack" } diff --git a/control_plane/src/compute.rs b/control_plane/src/compute.rs index c078c274cf..2549baca5d 100644 --- a/control_plane/src/compute.rs +++ b/control_plane/src/compute.rs @@ -11,11 +11,12 @@ use std::sync::Arc; use std::time::Duration; use anyhow::{Context, Result}; -use zenith_utils::connstring::connection_host_port; -use zenith_utils::lsn::Lsn; -use zenith_utils::postgres_backend::AuthType; -use zenith_utils::zid::ZTenantId; -use zenith_utils::zid::ZTimelineId; +use utils::{ + connstring::connection_host_port, + lsn::Lsn, + postgres_backend::AuthType, + zid::{ZTenantId, ZTimelineId}, +}; use crate::local_env::LocalEnv; use crate::postgresql_conf::PostgresConf; diff --git a/control_plane/src/local_env.rs b/control_plane/src/local_env.rs index 2bdc76e876..12ee88cdc9 100644 --- a/control_plane/src/local_env.rs +++ b/control_plane/src/local_env.rs @@ -11,9 +11,11 @@ use std::env; use std::fs; use std::path::{Path, PathBuf}; use std::process::{Command, Stdio}; -use zenith_utils::auth::{encode_from_key_file, Claims, Scope}; -use zenith_utils::postgres_backend::AuthType; -use zenith_utils::zid::{ZNodeId, ZTenantId, ZTenantTimelineId, ZTimelineId}; +use utils::{ + auth::{encode_from_key_file, Claims, Scope}, + postgres_backend::AuthType, + zid::{ZNodeId, ZTenantId, ZTenantTimelineId, ZTimelineId}, +}; use crate::safekeeper::SafekeeperNode; diff --git a/control_plane/src/safekeeper.rs b/control_plane/src/safekeeper.rs index 6f11a4e03d..b094016131 100644 --- 
a/control_plane/src/safekeeper.rs +++ b/control_plane/src/safekeeper.rs @@ -15,13 +15,15 @@ use reqwest::blocking::{Client, RequestBuilder, Response}; use reqwest::{IntoUrl, Method}; use safekeeper::http::models::TimelineCreateRequest; use thiserror::Error; -use zenith_utils::http::error::HttpErrorBody; -use zenith_utils::zid::{ZNodeId, ZTenantId, ZTimelineId}; +use utils::{ + connstring::connection_address, + http::error::HttpErrorBody, + zid::{ZNodeId, ZTenantId, ZTimelineId}, +}; use crate::local_env::{LocalEnv, SafekeeperConf}; use crate::storage::PageServerNode; use crate::{fill_rust_env_vars, read_pidfile}; -use zenith_utils::connstring::connection_address; #[derive(Error, Debug)] pub enum SafekeeperHttpError { diff --git a/control_plane/src/storage.rs b/control_plane/src/storage.rs index c49d5743a9..a01ffd30f6 100644 --- a/control_plane/src/storage.rs +++ b/control_plane/src/storage.rs @@ -15,15 +15,17 @@ use postgres::{Config, NoTls}; use reqwest::blocking::{Client, RequestBuilder, Response}; use reqwest::{IntoUrl, Method}; use thiserror::Error; -use zenith_utils::http::error::HttpErrorBody; -use zenith_utils::lsn::Lsn; -use zenith_utils::postgres_backend::AuthType; -use zenith_utils::zid::{ZTenantId, ZTimelineId}; +use utils::{ + connstring::connection_address, + http::error::HttpErrorBody, + lsn::Lsn, + postgres_backend::AuthType, + zid::{ZTenantId, ZTimelineId}, +}; use crate::local_env::LocalEnv; use crate::{fill_rust_env_vars, read_pidfile}; use pageserver::tenant_mgr::TenantInfo; -use zenith_utils::connstring::connection_address; #[derive(Error, Debug)] pub enum PageserverHttpError { diff --git a/docs/README.md b/docs/README.md index a3fcd20bd2..99d635bb33 100644 --- a/docs/README.md +++ b/docs/README.md @@ -8,7 +8,7 @@ - [multitenancy.md](multitenancy.md) — how multitenancy is organized in the pageserver and Zenith CLI. - [sourcetree.md](sourcetree.md) — Overview of the source tree layeout. 
- [pageserver/README](/pageserver/README) — pageserver overview. -- [postgres_ffi/README](/postgres_ffi/README) — Postgres FFI overview. +- [postgres_ffi/README](/libs/postgres_ffi/README) — Postgres FFI overview. - [test_runner/README.md](/test_runner/README.md) — tests infrastructure overview. - [safekeeper/README](/safekeeper/README) — WAL service overview. - [core_changes.md](core_changes.md) - Description of Zenith changes in Postgres core diff --git a/docs/authentication.md b/docs/authentication.md index de408624ae..7200ffc62f 100644 --- a/docs/authentication.md +++ b/docs/authentication.md @@ -27,4 +27,4 @@ management_token = jwt.encode({"scope": "pageserverapi"}, auth_keys.priv, algori tenant_token = jwt.encode({"scope": "tenant", "tenant_id": ps.initial_tenant}, auth_keys.priv, algorithm="RS256") ``` -Utility functions to work with jwts in rust are located in zenith_utils/src/auth.rs +Utility functions to work with jwts in rust are located in libs/utils/src/auth.rs diff --git a/docs/sourcetree.md b/docs/sourcetree.md index b15294d67f..5fd5fe19e5 100644 --- a/docs/sourcetree.md +++ b/docs/sourcetree.md @@ -30,11 +30,6 @@ The pageserver has a few different duties: For more detailed info, see `/pageserver/README` -`/postgres_ffi`: - -Utility functions for interacting with PostgreSQL file formats. -Misc constants, copied from PostgreSQL headers. - `/proxy`: Postgres protocol proxy/router. @@ -74,14 +69,21 @@ We use [cargo-hakari](https://crates.io/crates/cargo-hakari) for automation. Main entry point for the 'zenith' CLI utility. TODO: Doesn't it belong to control_plane? -`/zenith_metrics`: +`/libs`: +Unites granular neon helper crates under the hood. +`/libs/postgres_ffi`: + +Utility functions for interacting with PostgreSQL file formats. +Misc constants, copied from PostgreSQL headers. + +`/libs/utils`: +Generic helpers that are shared between other crates in this repository. +A subject for future modularization. 
+ +`/libs/metrics`: Helpers for exposing Prometheus metrics from the server. -`/zenith_utils`: - -Helpers that are shared between other crates in this repository. - ## Using Python Note that Debian/Ubuntu Python packages are stale, as it commonly happens, so manual installation of dependencies is not recommended. diff --git a/zenith_metrics/Cargo.toml b/libs/metrics/Cargo.toml similarity index 69% rename from zenith_metrics/Cargo.toml rename to libs/metrics/Cargo.toml index 906c5a1d64..3b6ff4691d 100644 --- a/zenith_metrics/Cargo.toml +++ b/libs/metrics/Cargo.toml @@ -1,5 +1,5 @@ [package] -name = "zenith_metrics" +name = "metrics" version = "0.1.0" edition = "2021" @@ -8,4 +8,4 @@ prometheus = {version = "0.13", default_features=false} # removes protobuf depen libc = "0.2" lazy_static = "1.4" once_cell = "1.8.0" -workspace_hack = { version = "0.1", path = "../workspace_hack" } +workspace_hack = { version = "0.1", path = "../../workspace_hack" } diff --git a/zenith_metrics/src/lib.rs b/libs/metrics/src/lib.rs similarity index 100% rename from zenith_metrics/src/lib.rs rename to libs/metrics/src/lib.rs diff --git a/zenith_metrics/src/wrappers.rs b/libs/metrics/src/wrappers.rs similarity index 96% rename from zenith_metrics/src/wrappers.rs rename to libs/metrics/src/wrappers.rs index 48202bc15e..de334add99 100644 --- a/zenith_metrics/src/wrappers.rs +++ b/libs/metrics/src/wrappers.rs @@ -8,8 +8,8 @@ use std::io::{Read, Result, Write}; /// /// ``` /// # use std::io::{Result, Read}; -/// # use zenith_metrics::{register_int_counter, IntCounter}; -/// # use zenith_metrics::CountedReader; +/// # use metrics::{register_int_counter, IntCounter}; +/// # use metrics::CountedReader; /// # /// # lazy_static::lazy_static! 
{ /// # static ref INT_COUNTER: IntCounter = register_int_counter!( @@ -83,8 +83,8 @@ impl Read for CountedReader<'_, T> { /// /// ``` /// # use std::io::{Result, Write}; -/// # use zenith_metrics::{register_int_counter, IntCounter}; -/// # use zenith_metrics::CountedWriter; +/// # use metrics::{register_int_counter, IntCounter}; +/// # use metrics::CountedWriter; /// # /// # lazy_static::lazy_static! { /// # static ref INT_COUNTER: IntCounter = register_int_counter!( diff --git a/postgres_ffi/Cargo.toml b/libs/postgres_ffi/Cargo.toml similarity index 77% rename from postgres_ffi/Cargo.toml rename to libs/postgres_ffi/Cargo.toml index e8d471cb12..7be5ca1b93 100644 --- a/postgres_ffi/Cargo.toml +++ b/libs/postgres_ffi/Cargo.toml @@ -17,8 +17,8 @@ log = "0.4.14" memoffset = "0.6.2" thiserror = "1.0" serde = { version = "1.0", features = ["derive"] } -zenith_utils = { path = "../zenith_utils" } -workspace_hack = { version = "0.1", path = "../workspace_hack" } +utils = { path = "../utils" } +workspace_hack = { version = "0.1", path = "../../workspace_hack" } [build-dependencies] bindgen = "0.59.1" diff --git a/postgres_ffi/README b/libs/postgres_ffi/README similarity index 100% rename from postgres_ffi/README rename to libs/postgres_ffi/README diff --git a/postgres_ffi/build.rs b/libs/postgres_ffi/build.rs similarity index 96% rename from postgres_ffi/build.rs rename to libs/postgres_ffi/build.rs index 3b4b37f9ee..0043b9ab58 100644 --- a/postgres_ffi/build.rs +++ b/libs/postgres_ffi/build.rs @@ -88,8 +88,8 @@ fn main() { // 'pg_config --includedir-server' would perhaps be the more proper way to find it, // but this will do for now. // - .clang_arg("-I../tmp_install/include/server") - .clang_arg("-I../tmp_install/include/postgresql/server") + .clang_arg("-I../../tmp_install/include/server") + .clang_arg("-I../../tmp_install/include/postgresql/server") // // Finish the builder and generate the bindings. 
// diff --git a/postgres_ffi/pg_control_ffi.h b/libs/postgres_ffi/pg_control_ffi.h similarity index 100% rename from postgres_ffi/pg_control_ffi.h rename to libs/postgres_ffi/pg_control_ffi.h diff --git a/postgres_ffi/samples/pg_hba.conf b/libs/postgres_ffi/samples/pg_hba.conf similarity index 100% rename from postgres_ffi/samples/pg_hba.conf rename to libs/postgres_ffi/samples/pg_hba.conf diff --git a/postgres_ffi/src/controlfile_utils.rs b/libs/postgres_ffi/src/controlfile_utils.rs similarity index 97% rename from postgres_ffi/src/controlfile_utils.rs rename to libs/postgres_ffi/src/controlfile_utils.rs index b72c86c71c..4df2342b90 100644 --- a/postgres_ffi/src/controlfile_utils.rs +++ b/libs/postgres_ffi/src/controlfile_utils.rs @@ -43,7 +43,7 @@ impl ControlFileData { /// Interpret a slice of bytes as a Postgres control file. /// pub fn decode(buf: &[u8]) -> Result { - use zenith_utils::bin_ser::LeSer; + use utils::bin_ser::LeSer; // Check that the slice has the expected size. The control file is // padded with zeros up to a 512 byte sector size, so accept a @@ -77,7 +77,7 @@ impl ControlFileData { /// /// The CRC is recomputed to match the contents of the fields. pub fn encode(&self) -> Bytes { - use zenith_utils::bin_ser::LeSer; + use utils::bin_ser::LeSer; // Serialize into a new buffer. 
let b = self.ser().unwrap(); diff --git a/postgres_ffi/src/lib.rs b/libs/postgres_ffi/src/lib.rs similarity index 100% rename from postgres_ffi/src/lib.rs rename to libs/postgres_ffi/src/lib.rs diff --git a/postgres_ffi/src/nonrelfile_utils.rs b/libs/postgres_ffi/src/nonrelfile_utils.rs similarity index 100% rename from postgres_ffi/src/nonrelfile_utils.rs rename to libs/postgres_ffi/src/nonrelfile_utils.rs diff --git a/postgres_ffi/src/pg_constants.rs b/libs/postgres_ffi/src/pg_constants.rs similarity index 100% rename from postgres_ffi/src/pg_constants.rs rename to libs/postgres_ffi/src/pg_constants.rs diff --git a/postgres_ffi/src/relfile_utils.rs b/libs/postgres_ffi/src/relfile_utils.rs similarity index 100% rename from postgres_ffi/src/relfile_utils.rs rename to libs/postgres_ffi/src/relfile_utils.rs diff --git a/postgres_ffi/src/waldecoder.rs b/libs/postgres_ffi/src/waldecoder.rs similarity index 99% rename from postgres_ffi/src/waldecoder.rs rename to libs/postgres_ffi/src/waldecoder.rs index ce5aaf722d..9d1089ed46 100644 --- a/postgres_ffi/src/waldecoder.rs +++ b/libs/postgres_ffi/src/waldecoder.rs @@ -18,7 +18,7 @@ use crc32c::*; use log::*; use std::cmp::min; use thiserror::Error; -use zenith_utils::lsn::Lsn; +use utils::lsn::Lsn; pub struct WalStreamDecoder { lsn: Lsn, diff --git a/postgres_ffi/src/xlog_utils.rs b/libs/postgres_ffi/src/xlog_utils.rs similarity index 98% rename from postgres_ffi/src/xlog_utils.rs rename to libs/postgres_ffi/src/xlog_utils.rs index 89fdbbf7ac..1645c44de5 100644 --- a/postgres_ffi/src/xlog_utils.rs +++ b/libs/postgres_ffi/src/xlog_utils.rs @@ -28,7 +28,7 @@ use std::io::prelude::*; use std::io::SeekFrom; use std::path::{Path, PathBuf}; use std::time::SystemTime; -use zenith_utils::lsn::Lsn; +use utils::lsn::Lsn; pub const XLOG_FNAME_LEN: usize = 24; pub const XLOG_BLCKSZ: usize = 8192; @@ -351,17 +351,17 @@ pub fn main() { impl XLogRecord { pub fn from_slice(buf: &[u8]) -> XLogRecord { - use zenith_utils::bin_ser::LeSer; + 
use utils::bin_ser::LeSer; XLogRecord::des(buf).unwrap() } pub fn from_bytes(buf: &mut B) -> XLogRecord { - use zenith_utils::bin_ser::LeSer; + use utils::bin_ser::LeSer; XLogRecord::des_from(&mut buf.reader()).unwrap() } pub fn encode(&self) -> Bytes { - use zenith_utils::bin_ser::LeSer; + use utils::bin_ser::LeSer; self.ser().unwrap().into() } @@ -373,19 +373,19 @@ impl XLogRecord { impl XLogPageHeaderData { pub fn from_bytes(buf: &mut B) -> XLogPageHeaderData { - use zenith_utils::bin_ser::LeSer; + use utils::bin_ser::LeSer; XLogPageHeaderData::des_from(&mut buf.reader()).unwrap() } } impl XLogLongPageHeaderData { pub fn from_bytes(buf: &mut B) -> XLogLongPageHeaderData { - use zenith_utils::bin_ser::LeSer; + use utils::bin_ser::LeSer; XLogLongPageHeaderData::des_from(&mut buf.reader()).unwrap() } pub fn encode(&self) -> Bytes { - use zenith_utils::bin_ser::LeSer; + use utils::bin_ser::LeSer; self.ser().unwrap().into() } } @@ -394,12 +394,12 @@ pub const SIZEOF_CHECKPOINT: usize = std::mem::size_of::(); impl CheckPoint { pub fn encode(&self) -> Bytes { - use zenith_utils::bin_ser::LeSer; + use utils::bin_ser::LeSer; self.ser().unwrap().into() } pub fn decode(buf: &[u8]) -> Result { - use zenith_utils::bin_ser::LeSer; + use utils::bin_ser::LeSer; Ok(CheckPoint::des(buf)?) } @@ -477,7 +477,9 @@ mod tests { #[test] pub fn test_find_end_of_wal() { // 1. 
Run initdb to generate some WAL - let top_path = PathBuf::from(env!("CARGO_MANIFEST_DIR")).join(".."); + let top_path = PathBuf::from(env!("CARGO_MANIFEST_DIR")) + .join("..") + .join(".."); let data_dir = top_path.join("test_output/test_find_end_of_wal"); let initdb_path = top_path.join("tmp_install/bin/initdb"); let lib_path = top_path.join("tmp_install/lib"); diff --git a/zenith_utils/Cargo.toml b/libs/utils/Cargo.toml similarity index 88% rename from zenith_utils/Cargo.toml rename to libs/utils/Cargo.toml index dd83fa4a92..35eb443809 100644 --- a/zenith_utils/Cargo.toml +++ b/libs/utils/Cargo.toml @@ -1,5 +1,5 @@ [package] -name = "zenith_utils" +name = "utils" version = "0.1.0" edition = "2021" @@ -29,8 +29,8 @@ rustls-split = "0.3.0" git-version = "0.3.5" serde_with = "1.12.0" -zenith_metrics = { path = "../zenith_metrics" } -workspace_hack = { version = "0.1", path = "../workspace_hack" } +metrics = { path = "../metrics" } +workspace_hack = { version = "0.1", path = "../../workspace_hack" } [dev-dependencies] byteorder = "1.4.3" diff --git a/zenith_utils/benches/benchmarks.rs b/libs/utils/benches/benchmarks.rs similarity index 96% rename from zenith_utils/benches/benchmarks.rs rename to libs/utils/benches/benchmarks.rs index c945d5021c..0339939934 100644 --- a/zenith_utils/benches/benchmarks.rs +++ b/libs/utils/benches/benchmarks.rs @@ -1,7 +1,7 @@ #![allow(unused)] use criterion::{criterion_group, criterion_main, Criterion}; -use zenith_utils::zid; +use utils::zid; pub fn bench_zid_stringify(c: &mut Criterion) { // Can only use public methods. 
diff --git a/zenith_utils/build.rs b/libs/utils/build.rs similarity index 100% rename from zenith_utils/build.rs rename to libs/utils/build.rs diff --git a/zenith_utils/scripts/restore_from_wal.sh b/libs/utils/scripts/restore_from_wal.sh similarity index 100% rename from zenith_utils/scripts/restore_from_wal.sh rename to libs/utils/scripts/restore_from_wal.sh diff --git a/zenith_utils/scripts/restore_from_wal_archive.sh b/libs/utils/scripts/restore_from_wal_archive.sh similarity index 100% rename from zenith_utils/scripts/restore_from_wal_archive.sh rename to libs/utils/scripts/restore_from_wal_archive.sh diff --git a/zenith_utils/src/accum.rs b/libs/utils/src/accum.rs similarity index 96% rename from zenith_utils/src/accum.rs rename to libs/utils/src/accum.rs index d3ad61e514..0fb0190a92 100644 --- a/zenith_utils/src/accum.rs +++ b/libs/utils/src/accum.rs @@ -5,7 +5,7 @@ /// For example, to calculate the smallest value among some integers: /// /// ``` -/// use zenith_utils::accum::Accum; +/// use utils::accum::Accum; /// /// let values = [1, 2, 3]; /// diff --git a/zenith_utils/src/auth.rs b/libs/utils/src/auth.rs similarity index 100% rename from zenith_utils/src/auth.rs rename to libs/utils/src/auth.rs diff --git a/zenith_utils/src/bin_ser.rs b/libs/utils/src/bin_ser.rs similarity index 100% rename from zenith_utils/src/bin_ser.rs rename to libs/utils/src/bin_ser.rs diff --git a/zenith_utils/src/connstring.rs b/libs/utils/src/connstring.rs similarity index 100% rename from zenith_utils/src/connstring.rs rename to libs/utils/src/connstring.rs diff --git a/zenith_utils/src/crashsafe_dir.rs b/libs/utils/src/crashsafe_dir.rs similarity index 100% rename from zenith_utils/src/crashsafe_dir.rs rename to libs/utils/src/crashsafe_dir.rs diff --git a/zenith_utils/src/http/endpoint.rs b/libs/utils/src/http/endpoint.rs similarity index 97% rename from zenith_utils/src/http/endpoint.rs rename to libs/utils/src/http/endpoint.rs index 7669f18cd2..77acab496f 100644 --- 
a/zenith_utils/src/http/endpoint.rs +++ b/libs/utils/src/http/endpoint.rs @@ -5,12 +5,11 @@ use anyhow::anyhow; use hyper::header::AUTHORIZATION; use hyper::{header::CONTENT_TYPE, Body, Request, Response, Server}; use lazy_static::lazy_static; +use metrics::{new_common_metric_name, register_int_counter, Encoder, IntCounter, TextEncoder}; use routerify::ext::RequestExt; use routerify::RequestInfo; use routerify::{Middleware, Router, RouterBuilder, RouterService}; use tracing::info; -use zenith_metrics::{new_common_metric_name, register_int_counter, IntCounter}; -use zenith_metrics::{Encoder, TextEncoder}; use std::future::Future; use std::net::TcpListener; @@ -36,7 +35,7 @@ async fn prometheus_metrics_handler(_req: Request) -> Result anyhow::Result<()> { /// # Ok(()) diff --git a/zenith_utils/src/seqwait.rs b/libs/utils/src/seqwait.rs similarity index 100% rename from zenith_utils/src/seqwait.rs rename to libs/utils/src/seqwait.rs diff --git a/zenith_utils/src/seqwait_async.rs b/libs/utils/src/seqwait_async.rs similarity index 100% rename from zenith_utils/src/seqwait_async.rs rename to libs/utils/src/seqwait_async.rs diff --git a/zenith_utils/src/shutdown.rs b/libs/utils/src/shutdown.rs similarity index 100% rename from zenith_utils/src/shutdown.rs rename to libs/utils/src/shutdown.rs diff --git a/zenith_utils/src/signals.rs b/libs/utils/src/signals.rs similarity index 100% rename from zenith_utils/src/signals.rs rename to libs/utils/src/signals.rs diff --git a/zenith_utils/src/sock_split.rs b/libs/utils/src/sock_split.rs similarity index 100% rename from zenith_utils/src/sock_split.rs rename to libs/utils/src/sock_split.rs diff --git a/zenith_utils/src/sync.rs b/libs/utils/src/sync.rs similarity index 99% rename from zenith_utils/src/sync.rs rename to libs/utils/src/sync.rs index 5e61480bc3..48f0ff6384 100644 --- a/zenith_utils/src/sync.rs +++ b/libs/utils/src/sync.rs @@ -29,7 +29,7 @@ impl SyncFuture { /// Example: /// /// ``` - /// # use 
zenith_utils::sync::SyncFuture; + /// # use utils::sync::SyncFuture; /// # use std::future::Future; /// # use tokio::io::AsyncReadExt; /// # diff --git a/zenith_utils/src/tcp_listener.rs b/libs/utils/src/tcp_listener.rs similarity index 100% rename from zenith_utils/src/tcp_listener.rs rename to libs/utils/src/tcp_listener.rs diff --git a/zenith_utils/src/vec_map.rs b/libs/utils/src/vec_map.rs similarity index 100% rename from zenith_utils/src/vec_map.rs rename to libs/utils/src/vec_map.rs diff --git a/zenith_utils/src/zid.rs b/libs/utils/src/zid.rs similarity index 100% rename from zenith_utils/src/zid.rs rename to libs/utils/src/zid.rs diff --git a/zenith_utils/tests/bin_ser_test.rs b/libs/utils/tests/bin_ser_test.rs similarity index 96% rename from zenith_utils/tests/bin_ser_test.rs rename to libs/utils/tests/bin_ser_test.rs index ada43a1189..f357837a55 100644 --- a/zenith_utils/tests/bin_ser_test.rs +++ b/libs/utils/tests/bin_ser_test.rs @@ -2,7 +2,7 @@ use bytes::{Buf, BytesMut}; use hex_literal::hex; use serde::Deserialize; use std::io::Read; -use zenith_utils::bin_ser::LeSer; +use utils::bin_ser::LeSer; #[derive(Debug, PartialEq, Deserialize)] pub struct HeaderData { diff --git a/zenith_utils/tests/cert.pem b/libs/utils/tests/cert.pem similarity index 100% rename from zenith_utils/tests/cert.pem rename to libs/utils/tests/cert.pem diff --git a/zenith_utils/tests/key.pem b/libs/utils/tests/key.pem similarity index 100% rename from zenith_utils/tests/key.pem rename to libs/utils/tests/key.pem diff --git a/zenith_utils/tests/ssl_test.rs b/libs/utils/tests/ssl_test.rs similarity index 98% rename from zenith_utils/tests/ssl_test.rs rename to libs/utils/tests/ssl_test.rs index 0e330c44f8..002361667b 100644 --- a/zenith_utils/tests/ssl_test.rs +++ b/libs/utils/tests/ssl_test.rs @@ -9,7 +9,7 @@ use byteorder::{BigEndian, ReadBytesExt, WriteBytesExt}; use bytes::{Buf, BufMut, Bytes, BytesMut}; use lazy_static::lazy_static; -use 
zenith_utils::postgres_backend::{AuthType, Handler, PostgresBackend};
+use utils::postgres_backend::{AuthType, Handler, PostgresBackend};
 fn make_tcp_pair() -> (TcpStream, TcpStream) {
     let listener = TcpListener::bind("127.0.0.1:0").unwrap();
diff --git a/pageserver/Cargo.toml b/pageserver/Cargo.toml
index 1a533af95f..7b44dafb09 100644
--- a/pageserver/Cargo.toml
+++ b/pageserver/Cargo.toml
@@ -47,9 +47,9 @@ rusoto_core = "0.47"
 rusoto_s3 = "0.47"
 async-trait = "0.1"
-postgres_ffi = { path = "../postgres_ffi" }
-zenith_metrics = { path = "../zenith_metrics" }
-zenith_utils = { path = "../zenith_utils" }
+postgres_ffi = { path = "../libs/postgres_ffi" }
+metrics = { path = "../libs/metrics" }
+utils = { path = "../libs/utils" }
 workspace_hack = { version = "0.1", path = "../workspace_hack" }
 [dev-dependencies]
diff --git a/pageserver/src/basebackup.rs b/pageserver/src/basebackup.rs
index 077e7c9f83..78a27e460f 100644
--- a/pageserver/src/basebackup.rs
+++ b/pageserver/src/basebackup.rs
@@ -25,7 +25,7 @@ use crate::repository::Timeline;
 use crate::DatadirTimelineImpl;
 use postgres_ffi::xlog_utils::*;
 use postgres_ffi::*;
-use zenith_utils::lsn::Lsn;
+use utils::lsn::Lsn;
 /// This is short-living object only for the time of tarball creation,
 /// created mostly to avoid passing a lot of parameters between various functions
diff --git a/pageserver/src/bin/dump_layerfile.rs b/pageserver/src/bin/dump_layerfile.rs
index 7cf39566ac..af73ef6bdb 100644
--- a/pageserver/src/bin/dump_layerfile.rs
+++ b/pageserver/src/bin/dump_layerfile.rs
@@ -7,7 +7,7 @@ use pageserver::layered_repository::dump_layerfile_from_path;
 use pageserver::page_cache;
 use pageserver::virtual_file;
 use std::path::PathBuf;
-use zenith_utils::GIT_VERSION;
+use utils::GIT_VERSION;
 fn main() -> Result<()> {
     let arg_matches = App::new("Zenith dump_layerfile utility")
diff --git a/pageserver/src/bin/pageserver.rs b/pageserver/src/bin/pageserver.rs
index 1610a26239..867bea1b06 100644
---
a/pageserver/src/bin/pageserver.rs
+++ b/pageserver/src/bin/pageserver.rs
@@ -2,14 +2,6 @@ use std::{env, path::Path, str::FromStr};
 use tracing::*;
-use zenith_utils::{
-    auth::JwtAuth,
-    logging,
-    postgres_backend::AuthType,
-    tcp_listener,
-    zid::{ZTenantId, ZTimelineId},
-    GIT_VERSION,
-};
 use anyhow::{bail, Context, Result};
@@ -25,12 +17,20 @@ use pageserver::{
     thread_mgr::ThreadKind, timelines, virtual_file, LOG_FILE_NAME,
 };
-use zenith_utils::http::endpoint;
-use zenith_utils::shutdown::exit_now;
-use zenith_utils::signals::{self, Signal};
+use utils::{
+    auth::JwtAuth,
+    http::endpoint,
+    logging,
+    postgres_backend::AuthType,
+    shutdown::exit_now,
+    signals::{self, Signal},
+    tcp_listener,
+    zid::{ZTenantId, ZTimelineId},
+    GIT_VERSION,
+};
 fn main() -> anyhow::Result<()> {
-    zenith_metrics::set_common_metrics_prefix("pageserver");
+    metrics::set_common_metrics_prefix("pageserver");
     let arg_matches = App::new("Zenith page server")
         .about("Materializes WAL stream to pages and serves them to the postgres")
         .version(GIT_VERSION)
diff --git a/pageserver/src/bin/update_metadata.rs b/pageserver/src/bin/update_metadata.rs
index bfbb6179c5..fae5e5c2e3 100644
--- a/pageserver/src/bin/update_metadata.rs
+++ b/pageserver/src/bin/update_metadata.rs
@@ -6,8 +6,7 @@ use clap::{App, Arg};
 use pageserver::layered_repository::metadata::TimelineMetadata;
 use std::path::PathBuf;
 use std::str::FromStr;
-use zenith_utils::lsn::Lsn;
-use zenith_utils::GIT_VERSION;
+use utils::{lsn::Lsn, GIT_VERSION};
 fn main() -> Result<()> {
     let arg_matches = App::new("Zenith update metadata utility")
diff --git a/pageserver/src/config.rs b/pageserver/src/config.rs
index 067073cd9b..0cba3f48f8 100644
--- a/pageserver/src/config.rs
+++ b/pageserver/src/config.rs
@@ -7,8 +7,10 @@ use anyhow::{bail, ensure, Context, Result};
 use toml_edit;
 use toml_edit::{Document, Item};
-use zenith_utils::postgres_backend::AuthType;
-use zenith_utils::zid::{ZNodeId, ZTenantId, ZTimelineId};
+use utils::{
+
postgres_backend::AuthType,
+    zid::{ZNodeId, ZTenantId, ZTimelineId},
+};
 use std::convert::TryInto;
 use std::env;
diff --git a/pageserver/src/http/models.rs b/pageserver/src/http/models.rs
index d1dfb911ba..9b51e48477 100644
--- a/pageserver/src/http/models.rs
+++ b/pageserver/src/http/models.rs
@@ -1,6 +1,6 @@
 use serde::{Deserialize, Serialize};
 use serde_with::{serde_as, DisplayFromStr};
-use zenith_utils::{
+use utils::{
     lsn::Lsn,
     zid::{ZNodeId, ZTenantId, ZTimelineId},
 };
diff --git a/pageserver/src/http/routes.rs b/pageserver/src/http/routes.rs
index f49b1d7ba3..82ea5d1d09 100644
--- a/pageserver/src/http/routes.rs
+++ b/pageserver/src/http/routes.rs
@@ -4,19 +4,6 @@ use anyhow::{Context, Result};
 use hyper::StatusCode;
 use hyper::{Body, Request, Response, Uri};
 use tracing::*;
-use zenith_utils::auth::JwtAuth;
-use zenith_utils::http::endpoint::attach_openapi_ui;
-use zenith_utils::http::endpoint::auth_middleware;
-use zenith_utils::http::endpoint::check_permission;
-use zenith_utils::http::error::ApiError;
-use zenith_utils::http::{
-    endpoint,
-    error::HttpErrorBody,
-    json::{json_request, json_response},
-    request::parse_request_param,
-};
-use zenith_utils::http::{RequestExt, RouterBuilder};
-use zenith_utils::zid::{ZTenantTimelineId, ZTimelineId};
 use super::models::{
     StatusResponse, TenantCreateRequest, TenantCreateResponse, TimelineCreateRequest,
@@ -27,7 +14,18 @@ use crate::remote_storage::{
 };
 use crate::repository::Repository;
 use crate::timelines::{LocalTimelineInfo, RemoteTimelineInfo, TimelineInfo};
-use crate::{config::PageServerConf, tenant_mgr, timelines, ZTenantId};
+use crate::{config::PageServerConf, tenant_mgr, timelines};
+use utils::{
+    auth::JwtAuth,
+    http::{
+        endpoint::{self, attach_openapi_ui, auth_middleware, check_permission},
+        error::{ApiError, HttpErrorBody},
+        json::{json_request, json_response},
+        request::parse_request_param,
+        RequestExt, RouterBuilder,
+    },
+    zid::{ZTenantId, ZTenantTimelineId, ZTimelineId},
+};
 struct
State {
     conf: &'static PageServerConf,
diff --git a/pageserver/src/import_datadir.rs b/pageserver/src/import_datadir.rs
index 232892973e..8f49903e6c 100644
--- a/pageserver/src/import_datadir.rs
+++ b/pageserver/src/import_datadir.rs
@@ -20,7 +20,7 @@ use postgres_ffi::waldecoder::*;
 use postgres_ffi::xlog_utils::*;
 use postgres_ffi::{pg_constants, ControlFileData, DBState_DB_SHUTDOWNED};
 use postgres_ffi::{Oid, TransactionId};
-use zenith_utils::lsn::Lsn;
+use utils::lsn::Lsn;
 ///
 /// Import all relation data pages from local disk into the repository.
diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs
index 59a3def1fb..7525bdb94e 100644
--- a/pageserver/src/layered_repository.rs
+++ b/pageserver/src/layered_repository.rs
@@ -46,15 +46,17 @@ use crate::virtual_file::VirtualFile;
 use crate::walreceiver::IS_WAL_RECEIVER;
 use crate::walredo::WalRedoManager;
 use crate::CheckpointConfig;
-use crate::{ZTenantId, ZTimelineId};
-use zenith_metrics::{
+use metrics::{
     register_histogram_vec, register_int_counter, register_int_counter_vec, register_int_gauge_vec,
     Histogram, HistogramVec, IntCounter, IntCounterVec, IntGauge, IntGaugeVec,
 };
-use zenith_utils::crashsafe_dir;
-use zenith_utils::lsn::{AtomicLsn, Lsn, RecordLsn};
-use zenith_utils::seqwait::SeqWait;
+use utils::{
+    crashsafe_dir,
+    lsn::{AtomicLsn, Lsn, RecordLsn},
+    seqwait::SeqWait,
+    zid::{ZTenantId, ZTimelineId},
+};
 mod blob_io;
 pub mod block_io;
diff --git a/pageserver/src/layered_repository/delta_layer.rs b/pageserver/src/layered_repository/delta_layer.rs
index 03b7e453b3..c5530a5789 100644
--- a/pageserver/src/layered_repository/delta_layer.rs
+++ b/pageserver/src/layered_repository/delta_layer.rs
@@ -35,7 +35,6 @@ use crate::page_cache::{PageReadGuard, PAGE_SZ};
 use crate::repository::{Key, Value, KEY_SIZE};
 use crate::virtual_file::VirtualFile;
 use crate::walrecord;
-use crate::{ZTenantId, ZTimelineId};
 use crate::{DELTA_FILE_MAGIC, STORAGE_FORMAT_VERSION};
 use
anyhow::{bail, ensure, Context, Result};
 use serde::{Deserialize, Serialize};
@@ -51,8 +50,11 @@ use std::os::unix::fs::FileExt;
 use std::path::{Path, PathBuf};
 use std::sync::{RwLock, RwLockReadGuard, RwLockWriteGuard};
-use zenith_utils::bin_ser::BeSer;
-use zenith_utils::lsn::Lsn;
+use utils::{
+    bin_ser::BeSer,
+    lsn::Lsn,
+    zid::{ZTenantId, ZTimelineId},
+};
 ///
 /// Header stored in the beginning of the file
diff --git a/pageserver/src/layered_repository/ephemeral_file.rs b/pageserver/src/layered_repository/ephemeral_file.rs
index a2f8cda461..9537d3939c 100644
--- a/pageserver/src/layered_repository/ephemeral_file.rs
+++ b/pageserver/src/layered_repository/ephemeral_file.rs
@@ -17,8 +17,7 @@ use std::ops::DerefMut;
 use std::path::PathBuf;
 use std::sync::{Arc, RwLock};
 use tracing::*;
-use zenith_utils::zid::ZTenantId;
-use zenith_utils::zid::ZTimelineId;
+use utils::zid::{ZTenantId, ZTimelineId};
 use std::os::unix::fs::FileExt;
diff --git a/pageserver/src/layered_repository/filename.rs b/pageserver/src/layered_repository/filename.rs
index 497912b408..f088088277 100644
--- a/pageserver/src/layered_repository/filename.rs
+++ b/pageserver/src/layered_repository/filename.rs
@@ -8,7 +8,7 @@ use std::fmt;
 use std::ops::Range;
 use std::path::PathBuf;
-use zenith_utils::lsn::Lsn;
+use utils::lsn::Lsn;
 // Note: LayeredTimeline::load_layer_map() relies on this sort order
 #[derive(Debug, PartialEq, Eq, Clone)]
diff --git a/pageserver/src/layered_repository/image_layer.rs b/pageserver/src/layered_repository/image_layer.rs
index fa91198a79..0e38d46e7a 100644
--- a/pageserver/src/layered_repository/image_layer.rs
+++ b/pageserver/src/layered_repository/image_layer.rs
@@ -30,7 +30,6 @@ use crate::layered_repository::storage_layer::{
 use crate::page_cache::PAGE_SZ;
 use crate::repository::{Key, Value, KEY_SIZE};
 use crate::virtual_file::VirtualFile;
-use crate::{ZTenantId, ZTimelineId};
 use crate::{IMAGE_FILE_MAGIC, STORAGE_FORMAT_VERSION};
 use anyhow::{bail, ensure, Context,
Result};
 use bytes::Bytes;
@@ -44,8 +43,11 @@ use std::path::{Path, PathBuf};
 use std::sync::{RwLock, RwLockReadGuard};
 use tracing::*;
-use zenith_utils::bin_ser::BeSer;
-use zenith_utils::lsn::Lsn;
+use utils::{
+    bin_ser::BeSer,
+    lsn::Lsn,
+    zid::{ZTenantId, ZTimelineId},
+};
 ///
 /// Header stored in the beginning of the file
diff --git a/pageserver/src/layered_repository/inmemory_layer.rs b/pageserver/src/layered_repository/inmemory_layer.rs
index 33e1eabd8e..714a0bc579 100644
--- a/pageserver/src/layered_repository/inmemory_layer.rs
+++ b/pageserver/src/layered_repository/inmemory_layer.rs
@@ -14,19 +14,21 @@ use crate::layered_repository::storage_layer::{
 };
 use crate::repository::{Key, Value};
 use crate::walrecord;
-use crate::{ZTenantId, ZTimelineId};
 use anyhow::{bail, ensure, Result};
 use std::collections::HashMap;
 use tracing::*;
+use utils::{
+    bin_ser::BeSer,
+    lsn::Lsn,
+    vec_map::VecMap,
+    zid::{ZTenantId, ZTimelineId},
+};
 // avoid binding to Write (conflicts with std::io::Write)
 // while being able to use std::fmt::Write's methods
 use std::fmt::Write as _;
 use std::ops::Range;
 use std::path::PathBuf;
 use std::sync::RwLock;
-use zenith_utils::bin_ser::BeSer;
-use zenith_utils::lsn::Lsn;
-use zenith_utils::vec_map::VecMap;
 pub struct InMemoryLayer {
     conf: &'static PageServerConf,
diff --git a/pageserver/src/layered_repository/layer_map.rs b/pageserver/src/layered_repository/layer_map.rs
index 3984ee550f..03ee8b8ef1 100644
--- a/pageserver/src/layered_repository/layer_map.rs
+++ b/pageserver/src/layered_repository/layer_map.rs
@@ -16,12 +16,12 @@ use crate::layered_repository::InMemoryLayer;
 use crate::repository::Key;
 use anyhow::Result;
 use lazy_static::lazy_static;
+use metrics::{register_int_gauge, IntGauge};
 use std::collections::VecDeque;
 use std::ops::Range;
 use std::sync::Arc;
 use tracing::*;
-use zenith_metrics::{register_int_gauge, IntGauge};
-use zenith_utils::lsn::Lsn;
+use utils::lsn::Lsn;
 lazy_static!
{
     static ref NUM_ONDISK_LAYERS: IntGauge =
diff --git a/pageserver/src/layered_repository/metadata.rs b/pageserver/src/layered_repository/metadata.rs
index 7daf899ba2..0b47f8d697 100644
--- a/pageserver/src/layered_repository/metadata.rs
+++ b/pageserver/src/layered_repository/metadata.rs
@@ -10,7 +10,7 @@ use std::path::PathBuf;
 use anyhow::ensure;
 use serde::{Deserialize, Serialize};
-use zenith_utils::{
+use utils::{
     bin_ser::BeSer,
     lsn::Lsn,
     zid::{ZTenantId, ZTimelineId},
diff --git a/pageserver/src/layered_repository/storage_layer.rs b/pageserver/src/layered_repository/storage_layer.rs
index e413f311c3..aad631c5c4 100644
--- a/pageserver/src/layered_repository/storage_layer.rs
+++ b/pageserver/src/layered_repository/storage_layer.rs
@@ -4,13 +4,15 @@ use crate::repository::{Key, Value};
 use crate::walrecord::ZenithWalRecord;
-use crate::{ZTenantId, ZTimelineId};
 use anyhow::Result;
 use bytes::Bytes;
 use std::ops::Range;
 use std::path::PathBuf;
-use zenith_utils::lsn::Lsn;
+use utils::{
+    lsn::Lsn,
+    zid::{ZTenantId, ZTimelineId},
+};
 pub fn range_overlaps(a: &Range, b: &Range) -> bool
 where
diff --git a/pageserver/src/lib.rs b/pageserver/src/lib.rs
index 6dddef5f27..e6ac159ef2 100644
--- a/pageserver/src/lib.rs
+++ b/pageserver/src/lib.rs
@@ -22,13 +22,10 @@ pub mod walredo;
 use lazy_static::lazy_static;
 use tracing::info;
-use zenith_metrics::{register_int_gauge_vec, IntGaugeVec};
-use zenith_utils::{
-    postgres_backend,
-    zid::{ZTenantId, ZTimelineId},
-};
+use utils::postgres_backend;
 use crate::thread_mgr::ThreadKind;
+use metrics::{register_int_gauge_vec, IntGaugeVec};
 use layered_repository::LayeredRepository;
 use pgdatadir_mapping::DatadirTimeline;
diff --git a/pageserver/src/page_cache.rs b/pageserver/src/page_cache.rs
index bd44384a44..0c179b95c5 100644
--- a/pageserver/src/page_cache.rs
+++ b/pageserver/src/page_cache.rs
@@ -47,7 +47,7 @@ use std::{
 use once_cell::sync::OnceCell;
 use tracing::error;
-use zenith_utils::{
+use utils::{
     lsn::Lsn,
     zid::{ZTenantId, ZTimelineId},
 };
diff --git a/pageserver/src/page_service.rs b/pageserver/src/page_service.rs
index c09b032e48..8f5ea2e845 100644
--- a/pageserver/src/page_service.rs
+++ b/pageserver/src/page_service.rs
@@ -20,15 +20,13 @@ use std::str;
 use std::str::FromStr;
 use std::sync::{Arc, RwLockReadGuard};
 use tracing::*;
-use zenith_metrics::{register_histogram_vec, HistogramVec};
-use zenith_utils::auth::{self, JwtAuth};
-use zenith_utils::auth::{Claims, Scope};
-use zenith_utils::lsn::Lsn;
-use zenith_utils::postgres_backend::is_socket_read_timed_out;
-use zenith_utils::postgres_backend::PostgresBackend;
-use zenith_utils::postgres_backend::{self, AuthType};
-use zenith_utils::pq_proto::{BeMessage, FeMessage, RowDescriptor, SINGLE_COL_ROWDESC};
-use zenith_utils::zid::{ZTenantId, ZTimelineId};
+use utils::{
+    auth::{self, Claims, JwtAuth, Scope},
+    lsn::Lsn,
+    postgres_backend::{self, is_socket_read_timed_out, AuthType, PostgresBackend},
+    pq_proto::{BeMessage, FeMessage, RowDescriptor, SINGLE_COL_ROWDESC},
+    zid::{ZTenantId, ZTimelineId},
+};
 use crate::basebackup;
 use crate::config::PageServerConf;
@@ -41,6 +39,7 @@ use crate::thread_mgr;
 use crate::thread_mgr::ThreadKind;
 use crate::walreceiver;
 use crate::CheckpointConfig;
+use metrics::{register_histogram_vec, HistogramVec};
 // Wrapped in libpq CopyData
 enum PagestreamFeMessage {
diff --git a/pageserver/src/pgdatadir_mapping.rs b/pageserver/src/pgdatadir_mapping.rs
index 0b9ea7c7a7..071eccc05d 100644
--- a/pageserver/src/pgdatadir_mapping.rs
+++ b/pageserver/src/pgdatadir_mapping.rs
@@ -20,8 +20,7 @@ use std::ops::Range;
 use std::sync::atomic::{AtomicIsize, Ordering};
 use std::sync::{Arc, Mutex, RwLockReadGuard};
 use tracing::{debug, error, trace, warn};
-use zenith_utils::bin_ser::BeSer;
-use zenith_utils::lsn::Lsn;
+use utils::{bin_ser::BeSer, lsn::Lsn};
 /// Block number within a relation or SLRU. This matches PostgreSQL's BlockNumber type.
 pub type BlockNumber = u32;
@@ -1212,7 +1211,7 @@ pub fn key_to_slru_block(key: Key) -> Result<(SlruKind, u32, BlockNumber)> {
 #[cfg(test)]
 pub fn create_test_timeline(
     repo: R,
-    timeline_id: zenith_utils::zid::ZTimelineId,
+    timeline_id: utils::zid::ZTimelineId,
 ) -> Result>> {
     let tline = repo.create_empty_timeline(timeline_id, Lsn(8))?;
     let tline = DatadirTimeline::new(tline, 256 * 1024);
diff --git a/pageserver/src/remote_storage.rs b/pageserver/src/remote_storage.rs
index effc8dcdf4..8a09f7b9ca 100644
--- a/pageserver/src/remote_storage.rs
+++ b/pageserver/src/remote_storage.rs
@@ -117,7 +117,7 @@ use crate::{
         metadata::{TimelineMetadata, METADATA_FILE_NAME},
     },
 };
-use zenith_utils::zid::{ZTenantId, ZTenantTimelineId, ZTimelineId};
+use utils::zid::{ZTenantId, ZTenantTimelineId, ZTimelineId};
 /// A timeline status to share with pageserver's sync counterpart,
 /// after comparing local and remote timeline state.
diff --git a/pageserver/src/remote_storage/storage_sync.rs b/pageserver/src/remote_storage/storage_sync.rs
index 649e563dbc..4d1ec2e225 100644
--- a/pageserver/src/remote_storage/storage_sync.rs
+++ b/pageserver/src/remote_storage/storage_sync.rs
@@ -86,10 +86,7 @@ use self::{
     index::{IndexPart, RemoteIndex, RemoteTimeline, RemoteTimelineIndex},
     upload::{upload_index_part, upload_timeline_layers, UploadedTimeline},
 };
-use super::{
-    LocalTimelineInitStatus, LocalTimelineInitStatuses, RemoteStorage, SyncStartupData,
-    ZTenantTimelineId,
-};
+use super::{LocalTimelineInitStatus, LocalTimelineInitStatuses, RemoteStorage, SyncStartupData};
 use crate::{
     config::PageServerConf,
     layered_repository::metadata::{metadata_path, TimelineMetadata},
@@ -99,11 +96,11 @@ use crate::{
     thread_mgr::ThreadKind,
 };
-use zenith_metrics::{
+use metrics::{
     register_histogram_vec, register_int_counter, register_int_gauge, HistogramVec, IntCounter,
     IntGauge,
 };
-use zenith_utils::zid::{ZTenantId, ZTimelineId};
+use utils::zid::{ZTenantId, ZTenantTimelineId, ZTimelineId};
 pub
use self::download::download_index_part;
@@ -145,7 +142,7 @@ mod sync_queue {
     use tracing::{debug, warn};
     use super::SyncTask;
-    use zenith_utils::zid::ZTenantTimelineId;
+    use utils::zid::ZTenantTimelineId;
     static SENDER: OnceCell> = OnceCell::new();
     static LENGTH: AtomicUsize = AtomicUsize::new(0);
@@ -1197,7 +1194,7 @@ fn register_sync_status(sync_start: Instant, sync_name: &str, sync_status: Optio
 #[cfg(test)]
 mod test_utils {
-    use zenith_utils::lsn::Lsn;
+    use utils::lsn::Lsn;
     use crate::repository::repo_harness::RepoHarness;
@@ -1246,7 +1243,7 @@ mod tests {
     use std::collections::BTreeSet;
     use super::{test_utils::dummy_metadata, *};
-    use zenith_utils::lsn::Lsn;
+    use utils::lsn::Lsn;
     #[test]
     fn download_sync_tasks_merge() {
diff --git a/pageserver/src/remote_storage/storage_sync/download.rs b/pageserver/src/remote_storage/storage_sync/download.rs
index eb805cd0cc..7fe25ab36e 100644
--- a/pageserver/src/remote_storage/storage_sync/download.rs
+++ b/pageserver/src/remote_storage/storage_sync/download.rs
@@ -12,9 +12,10 @@ use crate::{
     layered_repository::metadata::metadata_path,
     remote_storage::{
         storage_sync::{sync_queue, SyncTask},
-        RemoteStorage, ZTenantTimelineId,
+        RemoteStorage,
     },
 };
+use utils::zid::ZTenantTimelineId;
 use super::{
     index::{IndexPart, RemoteTimeline},
@@ -182,7 +183,7 @@ mod tests {
     use std::collections::{BTreeSet, HashSet};
     use tempfile::tempdir;
-    use zenith_utils::lsn::Lsn;
+    use utils::lsn::Lsn;
     use crate::{
         remote_storage::{
diff --git a/pageserver/src/remote_storage/storage_sync/index.rs b/pageserver/src/remote_storage/storage_sync/index.rs
index 918bda1039..d847e03a24 100644
--- a/pageserver/src/remote_storage/storage_sync/index.rs
+++ b/pageserver/src/remote_storage/storage_sync/index.rs
@@ -13,11 +13,8 @@ use serde::{Deserialize, Serialize};
 use serde_with::{serde_as, DisplayFromStr};
 use tokio::sync::RwLock;
-use crate::{
-    config::PageServerConf, layered_repository::metadata::TimelineMetadata,
-
remote_storage::ZTenantTimelineId,
-};
-use zenith_utils::lsn::Lsn;
+use crate::{config::PageServerConf, layered_repository::metadata::TimelineMetadata};
+use utils::{lsn::Lsn, zid::ZTenantTimelineId};
 /// A part of the filesystem path, that needs a root to become a path again.
 #[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord, Hash, Serialize, Deserialize)]
diff --git a/pageserver/src/remote_storage/storage_sync/upload.rs b/pageserver/src/remote_storage/storage_sync/upload.rs
index b4a2f6f989..d2ff77e92e 100644
--- a/pageserver/src/remote_storage/storage_sync/upload.rs
+++ b/pageserver/src/remote_storage/storage_sync/upload.rs
@@ -12,9 +12,10 @@ use crate::{
     layered_repository::metadata::metadata_path,
     remote_storage::{
         storage_sync::{index::RemoteTimeline, sync_queue, SyncTask},
-        RemoteStorage, ZTenantTimelineId,
+        RemoteStorage,
     },
 };
+use utils::zid::ZTenantTimelineId;
 use super::{index::IndexPart, SyncData, TimelineUpload};
@@ -208,7 +209,7 @@ mod tests {
     use std::collections::{BTreeSet, HashSet};
     use tempfile::tempdir;
-    use zenith_utils::lsn::Lsn;
+    use utils::lsn::Lsn;
     use crate::{
         remote_storage::{
diff --git a/pageserver/src/repository.rs b/pageserver/src/repository.rs
index d75b4efe71..fc438cce9c 100644
--- a/pageserver/src/repository.rs
+++ b/pageserver/src/repository.rs
@@ -11,8 +11,10 @@ use std::fmt::Display;
 use std::ops::{AddAssign, Range};
 use std::sync::{Arc, RwLockReadGuard};
 use std::time::Duration;
-use zenith_utils::lsn::{Lsn, RecordLsn};
-use zenith_utils::zid::ZTimelineId;
+use utils::{
+    lsn::{Lsn, RecordLsn},
+    zid::ZTimelineId,
+};
 #[derive(Debug, Clone, Copy, Hash, PartialEq, Eq, Ord, PartialOrd, Serialize, Deserialize)]
 /// Key used in the Repository kv-store.
@@ -431,7 +433,7 @@ pub mod repo_harness {
     use super::*;
     use hex_literal::hex;
-    use zenith_utils::zid::ZTenantId;
+    use utils::zid::ZTenantId;
     pub const TIMELINE_ID: ZTimelineId =
         ZTimelineId::from_array(hex!("11223344556677881122334455667788"));
diff --git a/pageserver/src/tenant_mgr.rs b/pageserver/src/tenant_mgr.rs
index 71e85c58e6..33bb4dc2e0 100644
--- a/pageserver/src/tenant_mgr.rs
+++ b/pageserver/src/tenant_mgr.rs
@@ -20,7 +20,7 @@ use std::collections::HashMap;
 use std::fmt;
 use std::sync::{Arc, Mutex, MutexGuard};
 use tracing::*;
-use zenith_utils::zid::{ZTenantId, ZTimelineId};
+use utils::zid::{ZTenantId, ZTimelineId};
 lazy_static! {
     static ref TENANTS: Mutex> = Mutex::new(HashMap::new());
diff --git a/pageserver/src/tenant_threads.rs b/pageserver/src/tenant_threads.rs
index 0d9a94cc5b..4dcc15f817 100644
--- a/pageserver/src/tenant_threads.rs
+++ b/pageserver/src/tenant_threads.rs
@@ -7,7 +7,7 @@ use crate::tenant_mgr::TenantState;
 use anyhow::Result;
 use std::time::Duration;
 use tracing::*;
-use zenith_utils::zid::ZTenantId;
+use utils::zid::ZTenantId;
 ///
 /// Compaction thread's main loop
diff --git a/pageserver/src/thread_mgr.rs b/pageserver/src/thread_mgr.rs
index 4484bb1db1..2866c6be44 100644
--- a/pageserver/src/thread_mgr.rs
+++ b/pageserver/src/thread_mgr.rs
@@ -47,7 +47,7 @@ use tracing::{debug, error, info, warn};
 use lazy_static::lazy_static;
-use zenith_utils::zid::{ZTenantId, ZTimelineId};
+use utils::zid::{ZTenantId, ZTimelineId};
 use crate::shutdown_pageserver;
diff --git a/pageserver/src/timelines.rs b/pageserver/src/timelines.rs
index 586d27d5b1..abbabc8b31 100644
--- a/pageserver/src/timelines.rs
+++ b/pageserver/src/timelines.rs
@@ -14,9 +14,11 @@ use std::{
 };
 use tracing::*;
-use zenith_utils::lsn::Lsn;
-use zenith_utils::zid::{ZTenantId, ZTimelineId};
-use zenith_utils::{crashsafe_dir, logging};
+use utils::{
+    crashsafe_dir, logging,
+    lsn::Lsn,
+    zid::{ZTenantId, ZTimelineId},
+};
 use crate::{
     config::PageServerConf,
diff --git
a/pageserver/src/virtual_file.rs b/pageserver/src/virtual_file.rs
index 64f9db2338..4ce245a74f 100644
--- a/pageserver/src/virtual_file.rs
+++ b/pageserver/src/virtual_file.rs
@@ -11,15 +11,15 @@
 //! src/backend/storage/file/fd.c
 //!
 use lazy_static::lazy_static;
+use once_cell::sync::OnceCell;
 use std::fs::{File, OpenOptions};
 use std::io::{Error, ErrorKind, Read, Seek, SeekFrom, Write};
 use std::os::unix::fs::FileExt;
 use std::path::{Path, PathBuf};
 use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
 use std::sync::{RwLock, RwLockWriteGuard};
-use zenith_metrics::{register_histogram_vec, register_int_gauge_vec, HistogramVec, IntGaugeVec};
-use once_cell::sync::OnceCell;
+use metrics::{register_histogram_vec, register_int_gauge_vec, HistogramVec, IntGaugeVec};
 // Metrics collected on disk IO operations
 const STORAGE_IO_TIME_BUCKETS: &[f64] = &[
diff --git a/pageserver/src/walingest.rs b/pageserver/src/walingest.rs
index c6c6e89854..583cdecb1d 100644
--- a/pageserver/src/walingest.rs
+++ b/pageserver/src/walingest.rs
@@ -38,7 +38,7 @@ use postgres_ffi::nonrelfile_utils::mx_offset_to_member_segment;
 use postgres_ffi::xlog_utils::*;
 use postgres_ffi::TransactionId;
 use postgres_ffi::{pg_constants, CheckPoint};
-use zenith_utils::lsn::Lsn;
+use utils::lsn::Lsn;
 static ZERO_PAGE: Bytes = Bytes::from_static(&[0u8; 8192]);
diff --git a/pageserver/src/walreceiver.rs b/pageserver/src/walreceiver.rs
index e09af09820..ce4e4d45fb 100644
--- a/pageserver/src/walreceiver.rs
+++ b/pageserver/src/walreceiver.rs
@@ -29,11 +29,11 @@ use tokio_postgres::replication::ReplicationStream;
 use tokio_postgres::{Client, NoTls, SimpleQueryMessage, SimpleQueryRow};
 use tokio_stream::StreamExt;
 use tracing::*;
-use zenith_utils::lsn::Lsn;
-use zenith_utils::pq_proto::ZenithFeedback;
-use zenith_utils::zid::ZTenantId;
-use zenith_utils::zid::ZTenantTimelineId;
-use zenith_utils::zid::ZTimelineId;
+use utils::{
+    lsn::Lsn,
+    pq_proto::ZenithFeedback,
+    zid::{ZTenantId,
ZTenantTimelineId, ZTimelineId},
+};
 //
 // We keep one WAL Receiver active per timeline.
diff --git a/pageserver/src/walredo.rs b/pageserver/src/walredo.rs
index b7c6ecf726..dcffcda6bb 100644
--- a/pageserver/src/walredo.rs
+++ b/pageserver/src/walredo.rs
@@ -35,17 +35,14 @@ use std::sync::Mutex;
 use std::time::Duration;
 use std::time::Instant;
 use tracing::*;
-use zenith_metrics::{register_histogram, register_int_counter, Histogram, IntCounter};
-use zenith_utils::bin_ser::BeSer;
-use zenith_utils::lsn::Lsn;
-use zenith_utils::nonblock::set_nonblock;
-use zenith_utils::zid::ZTenantId;
+use utils::{bin_ser::BeSer, lsn::Lsn, nonblock::set_nonblock, zid::ZTenantId};
 use crate::config::PageServerConf;
 use crate::pgdatadir_mapping::{key_to_rel_block, key_to_slru_block};
 use crate::reltag::{RelTag, SlruKind};
 use crate::repository::Key;
 use crate::walrecord::ZenithWalRecord;
+use metrics::{register_histogram, register_int_counter, Histogram, IntCounter};
 use postgres_ffi::nonrelfile_utils::mx_offset_to_flags_bitshift;
 use postgres_ffi::nonrelfile_utils::mx_offset_to_flags_offset;
 use postgres_ffi::nonrelfile_utils::mx_offset_to_member_offset;
diff --git a/proxy/Cargo.toml b/proxy/Cargo.toml
index a4bd99db38..81086a0cad 100644
--- a/proxy/Cargo.toml
+++ b/proxy/Cargo.toml
@@ -33,8 +33,8 @@ tokio = { version = "1.17", features = ["macros"] }
 tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" }
 tokio-rustls = "0.23.0"
-zenith_utils = { path = "../zenith_utils" }
-zenith_metrics = { path = "../zenith_metrics" }
+utils = { path = "../libs/utils" }
+metrics = { path = "../libs/metrics" }
 workspace_hack = { version = "0.1", path = "../workspace_hack" }
 [dev-dependencies]
diff --git a/proxy/src/auth.rs b/proxy/src/auth.rs
index bda14d67a1..4c54e2f9eb 100644
--- a/proxy/src/auth.rs
+++ b/proxy/src/auth.rs
@@ -12,7 +12,7 @@ use crate::waiters;
 use std::io;
 use thiserror::Error;
 use tokio::io::{AsyncRead,
AsyncWrite};
-use zenith_utils::pq_proto::{BeMessage as Be, BeParameterStatusMessage};
+use utils::pq_proto::{BeMessage as Be, BeParameterStatusMessage};
 pub use credentials::ClientCredentials;
diff --git a/proxy/src/auth/flow.rs b/proxy/src/auth/flow.rs
index 0fafaa2f47..bcfd94a9ed 100644
--- a/proxy/src/auth/flow.rs
+++ b/proxy/src/auth/flow.rs
@@ -5,7 +5,7 @@ use crate::stream::PqStream;
 use crate::{sasl, scram};
 use std::io;
 use tokio::io::{AsyncRead, AsyncWrite};
-use zenith_utils::pq_proto::{BeAuthenticationSaslMessage, BeMessage, BeMessage as Be};
+use utils::pq_proto::{BeAuthenticationSaslMessage, BeMessage, BeMessage as Be};
 /// Every authentication selector is supposed to implement this trait.
 pub trait AuthMethod {
diff --git a/proxy/src/cancellation.rs b/proxy/src/cancellation.rs
index 07d3bcc71a..a801313635 100644
--- a/proxy/src/cancellation.rs
+++ b/proxy/src/cancellation.rs
@@ -4,7 +4,7 @@ use parking_lot::Mutex;
 use std::net::SocketAddr;
 use tokio::net::TcpStream;
 use tokio_postgres::{CancelToken, NoTls};
-use zenith_utils::pq_proto::CancelKeyData;
+use utils::pq_proto::CancelKeyData;
 /// Enables serving `CancelRequest`s.
 #[derive(Default)]
diff --git a/proxy/src/http.rs b/proxy/src/http.rs
index 33d134678f..5a75718742 100644
--- a/proxy/src/http.rs
+++ b/proxy/src/http.rs
@@ -1,10 +1,7 @@
 use anyhow::anyhow;
 use hyper::{Body, Request, Response, StatusCode};
 use std::net::TcpListener;
-use zenith_utils::http::endpoint;
-use zenith_utils::http::error::ApiError;
-use zenith_utils::http::json::json_response;
-use zenith_utils::http::{RouterBuilder, RouterService};
+use utils::http::{endpoint, error::ApiError, json::json_response, RouterBuilder, RouterService};
 async fn status_handler(_: Request) -> Result, ApiError> {
     json_response(StatusCode::OK, "")
diff --git a/proxy/src/main.rs b/proxy/src/main.rs
index 862152bb7b..8df46619ec 100644
--- a/proxy/src/main.rs
+++ b/proxy/src/main.rs
@@ -30,7 +30,7 @@ use config::ProxyConfig;
 use futures::FutureExt;
 use std::future::Future;
 use tokio::{net::TcpListener, task::JoinError};
-use zenith_utils::GIT_VERSION;
+use utils::GIT_VERSION;
 use crate::config::{ClientAuthMethod, RouterConfig};
@@ -43,7 +43,7 @@ async fn flatten_err(
 #[tokio::main]
 async fn main() -> anyhow::Result<()> {
-    zenith_metrics::set_common_metrics_prefix("zenith_proxy");
+    metrics::set_common_metrics_prefix("zenith_proxy");
     let arg_matches = App::new("Zenith proxy/router")
         .version(GIT_VERSION)
         .arg(
diff --git a/proxy/src/mgmt.rs b/proxy/src/mgmt.rs
index ab6fdff040..23ad8a2013 100644
--- a/proxy/src/mgmt.rs
+++ b/proxy/src/mgmt.rs
@@ -5,7 +5,7 @@ use std::{
     net::{TcpListener, TcpStream},
     thread,
 };
-use zenith_utils::{
+use utils::{
     postgres_backend::{self, AuthType, PostgresBackend},
     pq_proto::{BeMessage, SINGLE_COL_ROWDESC},
 };
diff --git a/proxy/src/proxy.rs b/proxy/src/proxy.rs
index 788179252b..f7de1618df 100644
--- a/proxy/src/proxy.rs
+++ b/proxy/src/proxy.rs
@@ -5,10 +5,10 @@ use crate::stream::{MetricsStream, PqStream, Stream};
 use anyhow::{bail, Context};
 use futures::TryFutureExt;
 use lazy_static::lazy_static;
+use metrics::{new_common_metric_name,
register_int_counter, IntCounter};
 use std::sync::Arc;
 use tokio::io::{AsyncRead, AsyncWrite};
-use zenith_metrics::{new_common_metric_name, register_int_counter, IntCounter};
-use zenith_utils::pq_proto::{BeMessage as Be, *};
+use utils::pq_proto::{BeMessage as Be, *};
 const ERR_INSECURE_CONNECTION: &str = "connection is insecure (try using `sslmode=require`)";
 const ERR_PROTO_VIOLATION: &str = "protocol violation";
diff --git a/proxy/src/sasl/messages.rs b/proxy/src/sasl/messages.rs
index b1ae8cc426..58be6268fe 100644
--- a/proxy/src/sasl/messages.rs
+++ b/proxy/src/sasl/messages.rs
@@ -1,9 +1,9 @@
 //! Definitions for SASL messages.
 use crate::parse::{split_at_const, split_cstr};
-use zenith_utils::pq_proto::{BeAuthenticationSaslMessage, BeMessage};
+use utils::pq_proto::{BeAuthenticationSaslMessage, BeMessage};
-/// SASL-specific payload of [`PasswordMessage`](zenith_utils::pq_proto::FeMessage::PasswordMessage).
+/// SASL-specific payload of [`PasswordMessage`](utils::pq_proto::FeMessage::PasswordMessage).
 #[derive(Debug)]
 pub struct FirstMessage<'a> {
     /// Authentication method, e.g. `"SCRAM-SHA-256"`.
@@ -31,7 +31,7 @@ impl<'a> FirstMessage<'a> {
 /// A single SASL message.
 /// This struct is deliberately decoupled from lower-level
-/// [`BeAuthenticationSaslMessage`](zenith_utils::pq_proto::BeAuthenticationSaslMessage).
+/// [`BeAuthenticationSaslMessage`](utils::pq_proto::BeAuthenticationSaslMessage).
 #[derive(Debug)]
 pub(super) enum ServerMessage {
     /// We expect to see more steps.
diff --git a/proxy/src/stream.rs b/proxy/src/stream.rs
index fb0be84584..42b0185fde 100644
--- a/proxy/src/stream.rs
+++ b/proxy/src/stream.rs
@@ -9,7 +9,7 @@ use std::{io, task};
 use thiserror::Error;
 use tokio::io::{AsyncRead, AsyncWrite, AsyncWriteExt, ReadBuf};
 use tokio_rustls::server::TlsStream;
-use zenith_utils::pq_proto::{BeMessage, FeMessage, FeStartupPacket};
+use utils::pq_proto::{BeMessage, FeMessage, FeStartupPacket};
 pin_project!
{ /// Stream wrapper which implements libpq's protocol. diff --git a/safekeeper/Cargo.toml b/safekeeper/Cargo.toml index ca5e2a6b55..76d40cdc2e 100644 --- a/safekeeper/Cargo.toml +++ b/safekeeper/Cargo.toml @@ -33,9 +33,9 @@ tokio-util = { version = "0.7", features = ["io"] } rusoto_core = "0.47" rusoto_s3 = "0.47" -postgres_ffi = { path = "../postgres_ffi" } -zenith_metrics = { path = "../zenith_metrics" } -zenith_utils = { path = "../zenith_utils" } +postgres_ffi = { path = "../libs/postgres_ffi" } +metrics = { path = "../libs/metrics" } +utils = { path = "../libs/utils" } workspace_hack = { version = "0.1", path = "../workspace_hack" } [dev-dependencies] diff --git a/safekeeper/src/bin/safekeeper.rs b/safekeeper/src/bin/safekeeper.rs index e191cb52fd..7434f921cb 100644 --- a/safekeeper/src/bin/safekeeper.rs +++ b/safekeeper/src/bin/safekeeper.rs @@ -10,11 +10,9 @@ use std::fs::{self, File}; use std::io::{ErrorKind, Write}; use std::path::{Path, PathBuf}; use std::thread; +use tokio::sync::mpsc; use tracing::*; use url::{ParseError, Url}; -use zenith_utils::http::endpoint; -use zenith_utils::zid::ZNodeId; -use zenith_utils::{logging, tcp_listener, GIT_VERSION}; use safekeeper::control_file::{self}; use safekeeper::defaults::{DEFAULT_HTTP_LISTEN_ADDR, DEFAULT_PG_LISTEN_ADDR}; @@ -23,15 +21,15 @@ use safekeeper::s3_offload; use safekeeper::wal_service; use safekeeper::SafeKeeperConf; use safekeeper::{broker, callmemaybe}; -use tokio::sync::mpsc; -use zenith_utils::shutdown::exit_now; -use zenith_utils::signals; +use utils::{ + http::endpoint, logging, shutdown::exit_now, signals, tcp_listener, zid::ZNodeId, GIT_VERSION, +}; const LOCK_FILE_NAME: &str = "safekeeper.lock"; const ID_FILE_NAME: &str = "safekeeper.id"; fn main() -> Result<()> { - zenith_metrics::set_common_metrics_prefix("safekeeper"); + metrics::set_common_metrics_prefix("safekeeper"); let arg_matches = App::new("Zenith safekeeper") .about("Store WAL stream to local file system and push it to WAL 
receivers") .version(GIT_VERSION) diff --git a/safekeeper/src/broker.rs b/safekeeper/src/broker.rs index 147497d673..b84b5cf789 100644 --- a/safekeeper/src/broker.rs +++ b/safekeeper/src/broker.rs @@ -17,14 +17,12 @@ use std::time::Duration; use tokio::task::JoinHandle; use tokio::{runtime, time::sleep}; use tracing::*; -use zenith_utils::zid::ZTenantId; -use zenith_utils::zid::ZTimelineId; -use zenith_utils::{ - lsn::Lsn, - zid::{ZNodeId, ZTenantTimelineId}, -}; use crate::{safekeeper::Term, timeline::GlobalTimelines, SafeKeeperConf}; +use utils::{ + lsn::Lsn, + zid::{ZNodeId, ZTenantId, ZTenantTimelineId, ZTimelineId}, +}; const RETRY_INTERVAL_MSEC: u64 = 1000; const PUSH_INTERVAL_MSEC: u64 = 1000; diff --git a/safekeeper/src/callmemaybe.rs b/safekeeper/src/callmemaybe.rs index 1e52ec927b..8c3fbe26ba 100644 --- a/safekeeper/src/callmemaybe.rs +++ b/safekeeper/src/callmemaybe.rs @@ -16,8 +16,10 @@ use tokio::sync::mpsc::UnboundedReceiver; use tokio::task; use tokio_postgres::NoTls; use tracing::*; -use zenith_utils::connstring::connection_host_port; -use zenith_utils::zid::{ZTenantId, ZTimelineId}; +use utils::{ + connstring::connection_host_port, + zid::{ZTenantId, ZTimelineId}, +}; async fn request_callback( pageserver_connstr: String, diff --git a/safekeeper/src/control_file.rs b/safekeeper/src/control_file.rs index 7cc53edeb0..c49b4c058a 100644 --- a/safekeeper/src/control_file.rs +++ b/safekeeper/src/control_file.rs @@ -10,13 +10,11 @@ use std::ops::Deref; use std::path::{Path, PathBuf}; use tracing::*; -use zenith_metrics::{register_histogram_vec, Histogram, HistogramVec, DISK_WRITE_SECONDS_BUCKETS}; -use zenith_utils::bin_ser::LeSer; - -use zenith_utils::zid::ZTenantTimelineId; use crate::control_file_upgrade::upgrade_control_file; use crate::safekeeper::{SafeKeeperState, SK_FORMAT_VERSION, SK_MAGIC}; +use metrics::{register_histogram_vec, Histogram, HistogramVec, DISK_WRITE_SECONDS_BUCKETS}; +use utils::{bin_ser::LeSer, zid::ZTenantTimelineId}; use 
crate::SafeKeeperConf; @@ -251,10 +249,10 @@ impl Storage for FileStorage { mod test { use super::FileStorage; use super::*; - use crate::{safekeeper::SafeKeeperState, SafeKeeperConf, ZTenantTimelineId}; + use crate::{safekeeper::SafeKeeperState, SafeKeeperConf}; use anyhow::Result; use std::fs; - use zenith_utils::lsn::Lsn; + use utils::{lsn::Lsn, zid::ZTenantTimelineId}; fn stub_conf() -> SafeKeeperConf { let workdir = tempfile::tempdir().unwrap().into_path(); diff --git a/safekeeper/src/control_file_upgrade.rs b/safekeeper/src/control_file_upgrade.rs index 9effe42f8d..0cb14298cb 100644 --- a/safekeeper/src/control_file_upgrade.rs +++ b/safekeeper/src/control_file_upgrade.rs @@ -5,7 +5,7 @@ use crate::safekeeper::{ use anyhow::{bail, Result}; use serde::{Deserialize, Serialize}; use tracing::*; -use zenith_utils::{ +use utils::{ bin_ser::LeSer, lsn::Lsn, pq_proto::SystemId, diff --git a/safekeeper/src/handler.rs b/safekeeper/src/handler.rs index bb14049787..7d86523b0e 100644 --- a/safekeeper/src/handler.rs +++ b/safekeeper/src/handler.rs @@ -14,11 +14,12 @@ use regex::Regex; use std::str::FromStr; use std::sync::Arc; use tracing::info; -use zenith_utils::lsn::Lsn; -use zenith_utils::postgres_backend; -use zenith_utils::postgres_backend::PostgresBackend; -use zenith_utils::pq_proto::{BeMessage, FeStartupPacket, RowDescriptor, INT4_OID, TEXT_OID}; -use zenith_utils::zid::{ZTenantId, ZTenantTimelineId, ZTimelineId}; +use utils::{ + lsn::Lsn, + postgres_backend::{self, PostgresBackend}, + pq_proto::{BeMessage, FeStartupPacket, RowDescriptor, INT4_OID, TEXT_OID}, + zid::{ZTenantId, ZTenantTimelineId, ZTimelineId}, +}; use crate::callmemaybe::CallmeEvent; use tokio::sync::mpsc::UnboundedSender; diff --git a/safekeeper/src/http/models.rs b/safekeeper/src/http/models.rs index 8a6ed7a812..ca18e64096 100644 --- a/safekeeper/src/http/models.rs +++ b/safekeeper/src/http/models.rs @@ -1,5 +1,5 @@ use serde::{Deserialize, Serialize}; -use zenith_utils::zid::{ZNodeId, 
ZTenantId, ZTimelineId}; +use utils::zid::{ZNodeId, ZTenantId, ZTimelineId}; #[derive(Serialize, Deserialize)] pub struct TimelineCreateRequest { diff --git a/safekeeper/src/http/routes.rs b/safekeeper/src/http/routes.rs index 26b23cddcc..2d22332db9 100644 --- a/safekeeper/src/http/routes.rs +++ b/safekeeper/src/http/routes.rs @@ -4,21 +4,22 @@ use serde::Serialize; use serde::Serializer; use std::fmt::Display; use std::sync::Arc; -use zenith_utils::http::json::json_request; -use zenith_utils::http::{RequestExt, RouterBuilder}; -use zenith_utils::lsn::Lsn; -use zenith_utils::zid::ZNodeId; -use zenith_utils::zid::ZTenantTimelineId; use crate::safekeeper::Term; use crate::safekeeper::TermHistory; use crate::timeline::GlobalTimelines; use crate::SafeKeeperConf; -use zenith_utils::http::endpoint; -use zenith_utils::http::error::ApiError; -use zenith_utils::http::json::json_response; -use zenith_utils::http::request::parse_request_param; -use zenith_utils::zid::{ZTenantId, ZTimelineId}; +use utils::{ + http::{ + endpoint, + error::ApiError, + json::{json_request, json_response}, + request::parse_request_param, + RequestExt, RouterBuilder, + }, + lsn::Lsn, + zid::{ZNodeId, ZTenantId, ZTenantTimelineId, ZTimelineId}, +}; use super::models::TimelineCreateRequest; diff --git a/safekeeper/src/json_ctrl.rs b/safekeeper/src/json_ctrl.rs index ad5d790105..407fafd990 100644 --- a/safekeeper/src/json_ctrl.rs +++ b/safekeeper/src/json_ctrl.rs @@ -22,9 +22,11 @@ use crate::timeline::TimelineTools; use postgres_ffi::pg_constants; use postgres_ffi::xlog_utils; use postgres_ffi::{uint32, uint64, Oid, XLogRecord}; -use zenith_utils::lsn::Lsn; -use zenith_utils::postgres_backend::PostgresBackend; -use zenith_utils::pq_proto::{BeMessage, RowDescriptor, TEXT_OID}; +use utils::{ + lsn::Lsn, + postgres_backend::PostgresBackend, + pq_proto::{BeMessage, RowDescriptor, TEXT_OID}, +}; #[derive(Serialize, Deserialize, Debug)] pub struct AppendLogicalMessage { @@ -191,7 +193,7 @@ struct 
XlLogicalMessage { impl XlLogicalMessage { pub fn encode(&self) -> Bytes { - use zenith_utils::bin_ser::LeSer; + use utils::bin_ser::LeSer; self.ser().unwrap().into() } } diff --git a/safekeeper/src/lib.rs b/safekeeper/src/lib.rs index 69423d42d8..8951e8f680 100644 --- a/safekeeper/src/lib.rs +++ b/safekeeper/src/lib.rs @@ -3,7 +3,7 @@ use std::path::PathBuf; use std::time::Duration; use url::Url; -use zenith_utils::zid::{ZNodeId, ZTenantTimelineId}; +use utils::zid::{ZNodeId, ZTenantTimelineId}; pub mod broker; pub mod callmemaybe; diff --git a/safekeeper/src/receive_wal.rs b/safekeeper/src/receive_wal.rs index e6b12a0d81..3ad99ab0df 100644 --- a/safekeeper/src/receive_wal.rs +++ b/safekeeper/src/receive_wal.rs @@ -7,7 +7,6 @@ use anyhow::{anyhow, bail, Result}; use bytes::BytesMut; use tokio::sync::mpsc::UnboundedSender; use tracing::*; -use zenith_utils::sock_split::ReadStream; use crate::timeline::Timeline; @@ -23,8 +22,11 @@ use crate::safekeeper::ProposerAcceptorMessage; use crate::handler::SafekeeperPostgresHandler; use crate::timeline::TimelineTools; -use zenith_utils::postgres_backend::PostgresBackend; -use zenith_utils::pq_proto::{BeMessage, FeMessage}; +use utils::{ + postgres_backend::PostgresBackend, + pq_proto::{BeMessage, FeMessage}, + sock_split::ReadStream, +}; use crate::callmemaybe::CallmeEvent; diff --git a/safekeeper/src/safekeeper.rs b/safekeeper/src/safekeeper.rs index cf56261ee6..59174f34a2 100644 --- a/safekeeper/src/safekeeper.rs +++ b/safekeeper/src/safekeeper.rs @@ -11,8 +11,6 @@ use std::cmp::min; use std::fmt; use std::io::Read; use tracing::*; -use zenith_utils::zid::ZNodeId; -use zenith_utils::zid::ZTenantTimelineId; use lazy_static::lazy_static; @@ -20,13 +18,14 @@ use crate::broker::SafekeeperInfo; use crate::control_file; use crate::send_wal::HotStandbyFeedback; use crate::wal_storage; +use metrics::{register_gauge_vec, Gauge, GaugeVec}; use postgres_ffi::xlog_utils::MAX_SEND_SIZE; -use zenith_metrics::{register_gauge_vec, Gauge, 
GaugeVec}; -use zenith_utils::bin_ser::LeSer; -use zenith_utils::lsn::Lsn; -use zenith_utils::pq_proto::SystemId; -use zenith_utils::pq_proto::ZenithFeedback; -use zenith_utils::zid::{ZTenantId, ZTimelineId}; +use utils::{ + bin_ser::LeSer, + lsn::Lsn, + pq_proto::{SystemId, ZenithFeedback}, + zid::{ZNodeId, ZTenantId, ZTenantTimelineId, ZTimelineId}, +}; pub const SK_MAGIC: u32 = 0xcafeceefu32; pub const SK_FORMAT_VERSION: u32 = 4; diff --git a/safekeeper/src/send_wal.rs b/safekeeper/src/send_wal.rs index f12fb5cb4a..960f70d154 100644 --- a/safekeeper/src/send_wal.rs +++ b/safekeeper/src/send_wal.rs @@ -19,13 +19,14 @@ use std::time::Duration; use std::{str, thread}; use tokio::sync::mpsc::UnboundedSender; use tracing::*; -use zenith_utils::bin_ser::BeSer; -use zenith_utils::lsn::Lsn; -use zenith_utils::postgres_backend::PostgresBackend; -use zenith_utils::pq_proto::{BeMessage, FeMessage, WalSndKeepAlive, XLogDataBody, ZenithFeedback}; -use zenith_utils::sock_split::ReadStream; - -use zenith_utils::zid::{ZTenantId, ZTimelineId}; +use utils::{ + bin_ser::BeSer, + lsn::Lsn, + postgres_backend::PostgresBackend, + pq_proto::{BeMessage, FeMessage, WalSndKeepAlive, XLogDataBody, ZenithFeedback}, + sock_split::ReadStream, + zid::{ZTenantId, ZTimelineId}, +}; // See: https://www.postgresql.org/docs/13/protocol-replication.html const HOT_STANDBY_FEEDBACK_TAG_BYTE: u8 = b'h'; diff --git a/safekeeper/src/timeline.rs b/safekeeper/src/timeline.rs index 777db7eb2b..fbae34251c 100644 --- a/safekeeper/src/timeline.rs +++ b/safekeeper/src/timeline.rs @@ -14,8 +14,11 @@ use std::time::Duration; use tokio::sync::mpsc::UnboundedSender; use tracing::*; -use zenith_utils::lsn::Lsn; -use zenith_utils::zid::{ZNodeId, ZTenantTimelineId}; +use utils::{ + lsn::Lsn, + pq_proto::ZenithFeedback, + zid::{ZNodeId, ZTenantTimelineId}, +}; use crate::broker::SafekeeperInfo; use crate::callmemaybe::{CallmeEvent, SubscriptionStateKey}; @@ -30,8 +33,6 @@ use crate::wal_storage; use 
crate::wal_storage::Storage as wal_storage_iface; use crate::SafeKeeperConf; -use zenith_utils::pq_proto::ZenithFeedback; - const POLL_STATE_TIMEOUT: Duration = Duration::from_secs(1); /// Replica status update + hot standby feedback diff --git a/safekeeper/src/wal_service.rs b/safekeeper/src/wal_service.rs index 305e59bcd3..468ac28526 100644 --- a/safekeeper/src/wal_service.rs +++ b/safekeeper/src/wal_service.rs @@ -12,7 +12,7 @@ use crate::callmemaybe::CallmeEvent; use crate::handler::SafekeeperPostgresHandler; use crate::SafeKeeperConf; use tokio::sync::mpsc::UnboundedSender; -use zenith_utils::postgres_backend::{AuthType, PostgresBackend}; +use utils::postgres_backend::{AuthType, PostgresBackend}; /// Accept incoming TCP connections and spawn them into a background thread. pub fn thread_main( diff --git a/safekeeper/src/wal_storage.rs b/safekeeper/src/wal_storage.rs index 7cef525bee..69a4fb11e1 100644 --- a/safekeeper/src/wal_storage.rs +++ b/safekeeper/src/wal_storage.rs @@ -20,8 +20,7 @@ use std::path::{Path, PathBuf}; use tracing::*; -use zenith_utils::lsn::Lsn; -use zenith_utils::zid::ZTenantTimelineId; +use utils::{lsn::Lsn, zid::ZTenantTimelineId}; use crate::safekeeper::SafeKeeperState; @@ -30,7 +29,7 @@ use postgres_ffi::xlog_utils::{XLogFileName, XLOG_BLCKSZ}; use postgres_ffi::waldecoder::WalStreamDecoder; -use zenith_metrics::{ +use metrics::{ register_gauge_vec, register_histogram_vec, Gauge, GaugeVec, Histogram, HistogramVec, DISK_WRITE_SECONDS_BUCKETS, }; diff --git a/test_runner/batch_others/test_wal_restore.py b/test_runner/batch_others/test_wal_restore.py index 2dbde954fc..49421aa4e8 100644 --- a/test_runner/batch_others/test_wal_restore.py +++ b/test_runner/batch_others/test_wal_restore.py @@ -26,7 +26,7 @@ def test_wal_restore(zenith_env_builder: ZenithEnvBuilder, data_dir = os.path.join(test_output_dir, 'pgsql.restored') with VanillaPostgres(data_dir, PgBin(test_output_dir), port) as restored: pg_bin.run_capture([ - os.path.join(base_dir, 
'zenith_utils/scripts/restore_from_wal.sh'), + os.path.join(base_dir, 'libs/utils/scripts/restore_from_wal.sh'), os.path.join(pg_distrib_dir, 'bin'), os.path.join(test_output_dir, 'repo/safekeepers/sk1/{}/*'.format(tenant_id)), data_dir, diff --git a/workspace_hack/Cargo.toml b/workspace_hack/Cargo.toml index 84244b3363..f178b5b766 100644 --- a/workspace_hack/Cargo.toml +++ b/workspace_hack/Cargo.toml @@ -24,8 +24,8 @@ indexmap = { version = "1", default-features = false, features = ["std"] } libc = { version = "0.2", features = ["extra_traits", "std"] } log = { version = "0.4", default-features = false, features = ["serde", "std"] } memchr = { version = "2", features = ["std", "use_std"] } -num-integer = { version = "0.1", default-features = false, features = ["std"] } -num-traits = { version = "0.2", features = ["std"] } +num-integer = { version = "0.1", default-features = false, features = ["i128"] } +num-traits = { version = "0.2", features = ["i128", "std"] } prost = { version = "0.9", features = ["prost-derive", "std"] } rand = { version = "0.8", features = ["alloc", "getrandom", "libc", "rand_chacha", "rand_hc", "small_rng", "std", "std_rng"] } regex = { version = "1", features = ["aho-corasick", "memchr", "perf", "perf-cache", "perf-dfa", "perf-inline", "perf-literal", "std", "unicode", "unicode-age", "unicode-bool", "unicode-case", "unicode-gencat", "unicode-perl", "unicode-script", "unicode-segment"] } @@ -39,7 +39,6 @@ tracing-core = { version = "0.1", features = ["lazy_static", "std"] } [build-dependencies] anyhow = { version = "1", features = ["backtrace", "std"] } bytes = { version = "1", features = ["serde", "std"] } -cc = { version = "1", default-features = false, features = ["jobserver", "parallel"] } clap = { version = "2", features = ["ansi_term", "atty", "color", "strsim", "suggestions", "vec_map"] } either = { version = "1", features = ["use_std"] } hashbrown = { version = "0.11", features = ["ahash", "inline-more", "raw"] } diff --git 
a/zenith/Cargo.toml b/zenith/Cargo.toml index 69283d3763..9692e97331 100644 --- a/zenith/Cargo.toml +++ b/zenith/Cargo.toml @@ -13,6 +13,6 @@ postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98 pageserver = { path = "../pageserver" } control_plane = { path = "../control_plane" } safekeeper = { path = "../safekeeper" } -postgres_ffi = { path = "../postgres_ffi" } -zenith_utils = { path = "../zenith_utils" } +postgres_ffi = { path = "../libs/postgres_ffi" } +utils = { path = "../libs/utils" } workspace_hack = { version = "0.1", path = "../workspace_hack" } diff --git a/zenith/src/main.rs b/zenith/src/main.rs index f248a5db5b..afbbbe395b 100644 --- a/zenith/src/main.rs +++ b/zenith/src/main.rs @@ -16,11 +16,13 @@ use safekeeper::defaults::{ use std::collections::{BTreeSet, HashMap}; use std::process::exit; use std::str::FromStr; -use zenith_utils::auth::{Claims, Scope}; -use zenith_utils::lsn::Lsn; -use zenith_utils::postgres_backend::AuthType; -use zenith_utils::zid::{ZNodeId, ZTenantId, ZTenantTimelineId, ZTimelineId}; -use zenith_utils::GIT_VERSION; +use utils::{ + auth::{Claims, Scope}, + lsn::Lsn, + postgres_backend::AuthType, + zid::{ZNodeId, ZTenantId, ZTenantTimelineId, ZTimelineId}, + GIT_VERSION, +}; use pageserver::timelines::TimelineInfo; From abcd7a4b1fe62840160e48a8d10d96a571f8592e Mon Sep 17 00:00:00 2001 From: Arseny Sher Date: Thu, 21 Apr 2022 12:22:12 +0400 Subject: [PATCH 132/296] Insert less data in test_wal_restore. Otherwise it sometimes hits 2m statement timeout in CI. 
---
 test_runner/batch_others/test_wal_restore.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/test_runner/batch_others/test_wal_restore.py b/test_runner/batch_others/test_wal_restore.py
index 49421aa4e8..b0f34f4aae 100644
--- a/test_runner/batch_others/test_wal_restore.py
+++ b/test_runner/batch_others/test_wal_restore.py
@@ -19,7 +19,7 @@ def test_wal_restore(zenith_env_builder: ZenithEnvBuilder,
     env = zenith_env_builder.init_start()
     env.zenith_cli.create_branch("test_wal_restore")
     pg = env.postgres.create_start('test_wal_restore')
-    pg.safe_psql("create table t as select generate_series(1,1000000)")
+    pg.safe_psql("create table t as select generate_series(1,300000)")
     tenant_id = pg.safe_psql("show zenith.zenith_tenant")[0][0]
     env.zenith_cli.pageserver_stop()
     port = port_distributor.get_port()
@@ -33,4 +33,4 @@ def test_wal_restore(zenith_env_builder: ZenithEnvBuilder,
             str(port)
         ])
         restored.start()
-        assert restored.safe_psql('select count(*) from t', user='zenith_admin') == [(1000000, )]
+        assert restored.safe_psql('select count(*) from t', user='zenith_admin') == [(300000, )]

From 263d60f12def9e4d2206e7587b0be073ac622755 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Thu, 21 Apr 2022 16:37:32 +0300
Subject: [PATCH 133/296] Add prometheus metric for time spent waiting for WAL
 to arrive

---
 pageserver/src/layered_repository.rs | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs
index 7525bdb94e..ff6498a489 100644
--- a/pageserver/src/layered_repository.rs
+++ b/pageserver/src/layered_repository.rs
@@ -110,6 +110,12 @@ lazy_static! {
         &["tenant_id", "timeline_id"]
     )
     .expect("failed to define a metric");
+    static ref WAIT_LSN_TIME: HistogramVec = register_histogram_vec!(
+        "wait_lsn_time",
+        "Time spent waiting for WAL to arrive",
+        &["tenant_id", "timeline_id"]
+    )
+    .expect("failed to define a metric");
 }

 lazy_static! {
@@ -794,6 +800,7 @@ pub struct LayeredTimeline {
     compact_time_histo: Histogram,
     create_images_time_histo: Histogram,
     last_record_gauge: IntGauge,
+    wait_lsn_time_histo: Histogram,

     /// If `true`, will backup its files that appear after each checkpointing to the remote storage.
     upload_layers: AtomicBool,
@@ -873,14 +880,15 @@ impl Timeline for LayeredTimeline {
             "wait_lsn called by WAL receiver thread"
         );

-        self.last_record_lsn
-            .wait_for_timeout(lsn, self.conf.wait_lsn_timeout)
-            .with_context(|| {
-                format!(
-                    "Timed out while waiting for WAL record at LSN {} to arrive, last_record_lsn {} disk consistent LSN={}",
-                    lsn, self.get_last_record_lsn(), self.get_disk_consistent_lsn()
-                )
-            })?;
+        self.wait_lsn_time_histo.observe_closure_duration(
+            || self.last_record_lsn
+                .wait_for_timeout(lsn, self.conf.wait_lsn_timeout)
+                .with_context(|| {
+                    format!(
+                        "Timed out while waiting for WAL record at LSN {} to arrive, last_record_lsn {} disk consistent LSN={}",
+                        lsn, self.get_last_record_lsn(), self.get_disk_consistent_lsn()
+                    )
+                }))?;

         Ok(())
     }
@@ -1022,6 +1030,9 @@ impl LayeredTimeline {
         let last_record_gauge = LAST_RECORD_LSN
             .get_metric_with_label_values(&[&tenantid.to_string(), &timelineid.to_string()])
             .unwrap();
+        let wait_lsn_time_histo = WAIT_LSN_TIME
+            .get_metric_with_label_values(&[&tenantid.to_string(), &timelineid.to_string()])
+            .unwrap();

         LayeredTimeline {
             conf,
@@ -1049,6 +1060,7 @@ impl LayeredTimeline {
             compact_time_histo,
             create_images_time_histo,
             last_record_gauge,
+            wait_lsn_time_histo,

             upload_layers: AtomicBool::new(upload_layers),


From dafdf9b9524a034f25bb67d5d6f62a375a892862 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Thu, 21 Apr 2022 16:37:36 +0300
Subject: [PATCH 134/296] Handle EINTR

---
 libs/utils/src/pq_proto.rs | 23 +++++++++++++++++++----
 pageserver/src/walredo.rs  |  7 ++++++-
 2 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/libs/utils/src/pq_proto.rs b/libs/utils/src/pq_proto.rs
index 0e4c4907e7..e1677f4311 100644
--- a/libs/utils/src/pq_proto.rs
+++ b/libs/utils/src/pq_proto.rs
@@ -100,6 +100,21 @@ pub struct FeExecuteMessage {
 #[derive(Debug)]
 pub struct FeCloseMessage {}

+/// Retry a read on EINTR
+///
+/// This runs the enclosed expression, and if it returns
+/// Err(io::ErrorKind::Interrupted), retries it.
+macro_rules! retry_read {
+    ( $x:expr ) => {
+        loop {
+            match $x {
+                Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
+                res => break res,
+            }
+        }
+    };
+}
+
 impl FeMessage {
     /// Read one message from the stream.
     /// This function returns `Ok(None)` in case of EOF.
@@ -141,12 +156,12 @@ impl FeMessage {
             // Each libpq message begins with a message type byte, followed by message length
             // If the client closes the connection, return None. But if the client closes the
             // connection in the middle of a message, we will return an error.
-            let tag = match stream.read_u8().await {
+            let tag = match retry_read!(stream.read_u8().await) {
                 Ok(b) => b,
                 Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
                 Err(e) => return Err(e.into()),
             };
-            let len = stream.read_u32().await?;
+            let len = retry_read!(stream.read_u32().await)?;

             // The message length includes itself, so it better be at least 4
             let bodylen = len
@@ -207,7 +222,7 @@ impl FeStartupPacket {
             // reading 4 bytes, to be precise), return None to indicate that the connection
             // was closed. This matches the PostgreSQL server's behavior, which avoids noise
             // in the log if the client opens connection but closes it immediately.
-            let len = match stream.read_u32().await {
+            let len = match retry_read!(stream.read_u32().await) {
                 Ok(len) => len as usize,
                 Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
                 Err(e) => return Err(e.into()),
@@ -217,7 +232,7 @@ impl FeStartupPacket {
                 bail!("invalid message length");
             }

-            let request_code = stream.read_u32().await?;
+            let request_code = retry_read!(stream.read_u32().await)?;

             // the rest of startup packet are params
             let params_len = len - 8;
diff --git a/pageserver/src/walredo.rs b/pageserver/src/walredo.rs
index dcffcda6bb..6338b839ae 100644
--- a/pageserver/src/walredo.rs
+++ b/pageserver/src/walredo.rs
@@ -700,7 +700,12 @@ impl PostgresRedoProcess {
             // If we have more data to write, wake up if 'stdin' becomes writeable or
             // we have data to read. Otherwise only wake up if there's data to read.
             let nfds = if nwrite < writebuf.len() { 3 } else { 2 };
-            let n = nix::poll::poll(&mut pollfds[0..nfds], wal_redo_timeout.as_millis() as i32)?;
+            let n = loop {
+                match nix::poll::poll(&mut pollfds[0..nfds], wal_redo_timeout.as_millis() as i32) {
+                    Err(e) if e == nix::errno::Errno::EINTR => continue,
+                    res => break res,
+                }
+            }?;

             if n == 0 {
                 return Err(Error::new(ErrorKind::Other, "WAL redo timed out"));

From a4700c9bbeb5302d4452ef6f445b514fa3822b85 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Thu, 21 Apr 2022 20:32:48 +0300
Subject: [PATCH 135/296] Use pprof to get flamegraph of get_page and
 get_relsize requests.

This depends on a hacked version of the 'pprof-rs' crate. Because of
that, it's under an optional 'profiling' feature. It is disabled by
default, but enabled for release builds in CircleCI config. It doesn't
currently work on macOS.

The flamegraph is written to 'flamegraph.svg' in the pageserver
workdir when the 'pageserver' process exits.

Add a performance test that runs the perf_pgbench test, with profiling
enabled.
---
 .circleci/config.yml                         |   4 +-
 Cargo.lock                                   | 167 +++++++++++++++++++
 pageserver/Cargo.toml                        |   6 +
 pageserver/src/bin/pageserver.rs             |  13 +-
 pageserver/src/config.rs                     |  42 ++++-
 pageserver/src/lib.rs                        |   1 +
 pageserver/src/page_service.rs               |   9 +-
 pageserver/src/profiling.rs                  |  95 +++++++++++
 test_runner/fixtures/zenith_fixtures.py      |  12 ++
 test_runner/performance/test_perf_pgbench.py |  24 ++-
 10 files changed, 363 insertions(+), 10 deletions(-)
 create mode 100644 pageserver/src/profiling.rs

diff --git a/.circleci/config.yml b/.circleci/config.yml
index 5aae143e48..643c853854 100644
--- a/.circleci/config.yml
+++ b/.circleci/config.yml
@@ -113,7 +113,7 @@ jobs:
               CARGO_FLAGS=
             elif [[ $BUILD_TYPE == "release" ]]; then
               cov_prefix=()
-              CARGO_FLAGS=--release
+              CARGO_FLAGS="--release --features profiling"
             fi

             export CARGO_INCREMENTAL=0
@@ -369,7 +369,7 @@ jobs:
           when: always
           command: |
             du -sh /tmp/test_output/*
-            find /tmp/test_output -type f ! -name "pg.log" ! -name "pageserver.log" ! -name "safekeeper.log" ! -name "regression.diffs" ! -name "junit.xml" ! -name "*.filediff" ! -name "*.stdout" ! -name "*.stderr" -delete
+            find /tmp/test_output -type f ! -name "pg.log" ! -name "pageserver.log" ! -name "safekeeper.log" ! -name "regression.diffs" ! -name "junit.xml" ! -name "*.filediff" ! -name "*.stdout" ! -name "*.stderr" ! -name "flamegraph.svg" -delete
             du -sh /tmp/test_output/*
       - store_artifacts:
           path: /tmp/test_output
diff --git a/Cargo.lock b/Cargo.lock
index 508b56125d..3ca3671207 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -55,6 +55,15 @@ dependencies = [
  "backtrace",
 ]

+[[package]]
+name = "arrayvec"
+version = "0.4.12"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "cd9fd44efafa8690358b7408d253adf110036b88f55672a933f01d616ad9b1b9"
+dependencies = [
+ "nodrop",
+]
+
 [[package]]
 name = "async-stream"
 version = "0.3.3"
@@ -196,6 +205,12 @@ version = "3.9.1"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "a4a45a46ab1f2412e53d3a0ade76ffad2025804294569aae387231a0cd6e0899"

+[[package]]
+name = "bytemuck"
+version = "1.9.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "cdead85bdec19c194affaeeb670c0e41fe23de31459efd1c174d049269cf02cc"
+
 [[package]]
 name = "byteorder"
 version = "1.4.3"
@@ -385,6 +400,15 @@ version = "0.8.3"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "5827cebf4670468b8772dd191856768aedcb1b0278a04f989f7766351917b9dc"

+[[package]]
+name = "cpp_demangle"
+version = "0.3.5"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "eeaa953eaad386a53111e47172c2fedba671e5684c8dd601a5f474f4f118710f"
+dependencies = [
+ "cfg-if",
+]
+
 [[package]]
 name = "cpufeatures"
 version = "0.2.1"
@@ -580,6 +604,15 @@ dependencies = [
  "syn",
 ]

+[[package]]
+name = "debugid"
+version = "0.7.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "d6ee87af31d84ef885378aebca32be3d682b0e0dc119d5b4860a2c5bb5046730"
+dependencies = [
+ "uuid",
+]
+
 [[package]]
 name = "digest"
 version = "0.9.0"
@@ -691,6 +724,18 @@ dependencies = [
  "winapi",
 ]

+[[package]]
+name = "findshlibs"
+version = "0.10.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "40b9e59cd0f7e0806cca4be089683ecb6434e602038df21fe6bf6711b2f07f64"
+dependencies = [
+ "cc",
+ "lazy_static",
+ "libc",
+ "winapi",
+]
+
 [[package]]
 name = "fixedbitset"
 version = "0.4.1"
@@ -1098,6 +1143,24 @@ dependencies = [
  "hashbrown",
 ]

+[[package]]
+name = "inferno"
+version = "0.10.12"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "de3886428c6400486522cf44b8626e7b94ad794c14390290f2a274dcf728a58f"
+dependencies = [
+ "ahash",
+ "atty",
+ "indexmap",
+ "itoa 1.0.1",
+ "lazy_static",
+ "log",
+ "num-format",
+ "quick-xml",
+ "rgb",
+ "str_stack",
+]
+
 [[package]]
 name = "instant"
 version = "0.1.12"
@@ -1251,6 +1314,15 @@ version = "2.4.1"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "308cc39be01b73d0d18f82a0e7b2a3df85245f84af96fdddc5d202d27e47b86a"

+[[package]]
+name = "memmap2"
+version = "0.5.3"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "057a3db23999c867821a7a59feb06a578fcb03685e983dff90daf9e7d24ac08f"
+dependencies = [
+ "libc",
+]
+
 [[package]]
 name = "memoffset"
 version = "0.6.5"
@@ -1353,6 +1425,12 @@ dependencies = [
  "memoffset",
 ]

+[[package]]
+name = "nodrop"
+version = "0.1.14"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "72ef4a56884ca558e5ddb05a1d1e7e1bfd9a68d9ed024c21704cc98872dae1bb"
+
 [[package]]
 name = "nom"
 version = "7.1.0"
@@ -1384,6 +1462,16 @@ dependencies = [
  "num-traits",
 ]

+[[package]]
+name = "num-format"
+version = "0.4.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "bafe4179722c2894288ee77a9f044f02811c86af699344c498b0840c698a2465"
+dependencies = [
+ "arrayvec",
+ "itoa 0.4.8",
+]
+
 [[package]]
 name = "num-integer"
 version = "0.1.44"
@@ -1520,6 +1608,7 @@ dependencies = [
  "postgres-protocol",
  "postgres-types",
  "postgres_ffi",
+ "pprof",
  "rand",
  "regex",
  "rusoto_core",
@@ -1747,6 +1836,25 @@ dependencies = [
  "workspace_hack",
 ]

+[[package]]
+name = "pprof"
+version = "0.6.1"
+source = "git+https://github.com/neondatabase/pprof-rs.git?branch=wallclock-profiling#4e011a87d22fb4d21d15cc38bce81ff1c75e4bc9"
+dependencies = [
+ "backtrace",
+ "cfg-if",
+ "findshlibs",
+ "inferno",
+ "lazy_static",
+ "libc",
+ "log",
+ "nix",
+ "parking_lot",
+ "symbolic-demangle",
+ "tempfile",
+ "thiserror",
+]
+
 [[package]]
 name = "ppv-lite86"
 version = "0.2.16"
@@ -1876,6 +1984,15 @@ dependencies = [
  "workspace_hack",
 ]

+[[package]]
+name = "quick-xml"
+version = "0.22.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "8533f14c8382aaad0d592c812ac3b826162128b65662331e1127b45c3d18536b"
+dependencies = [
+ "memchr",
+]
+
 [[package]]
 name = "quickcheck"
 version = "1.0.3"
@@ -2063,6 +2180,15 @@ dependencies = [
  "winreg",
 ]

+[[package]]
+name = "rgb"
+version = "0.8.32"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "e74fdc210d8f24a7dbfedc13b04ba5764f5232754ccebfdf5fff1bad791ccbc6"
+dependencies = [
+ "bytemuck",
+]
+
 [[package]]
 name = "ring"
 version = "0.16.20"
@@ -2521,6 +2647,18 @@ version = "0.5.2"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "6e63cff320ae2c57904679ba7cb63280a3dc4613885beafb148ee7bf9aa9042d"

+[[package]]
+name = "stable_deref_trait"
+version = "1.2.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "a8f112729512f8e442d81f95a8a7ddf2b7c6b8a1a6f509a95864142b30cab2d3"
+
+[[package]]
+name = "str_stack"
+version = "0.1.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9091b6114800a5f2141aee1d1b9d6ca3592ac062dc5decb3764ec5895a47b4eb"
+
 [[package]]
 name = "stringprep"
 version = "0.1.2"
@@ -2549,6 +2687,29 @@ version = "2.4.1"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "6bdef32e8150c2a081110b42772ffe7d7c9032b606bc226c8260fd97e0976601"

+[[package]]
+name = "symbolic-common"
+version = "8.7.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ac6aac7b803adc9ee75344af7681969f76d4b38e4723c6eaacf3b28f5f1d87ff"
+dependencies = [
+ "debugid",
+ "memmap2",
+ "stable_deref_trait",
+ "uuid",
+]
+
+[[package]]
+name = "symbolic-demangle"
+version = "8.7.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "8143ea5aa546f86c64f9b9aafdd14223ffad4ecd2d58575c63c21335909c99a7"
+dependencies = [
+ "cpp_demangle",
+ "rustc-demangle",
+ "symbolic-common",
+]
+
 [[package]]
 name = "syn"
 version = "1.0.86"
@@ -3099,6 +3260,12 @@ dependencies = [
  "workspace_hack",
 ]

+[[package]]
+name = "uuid"
+version = "0.8.2"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "bc5cf98d8186244414c848017f0e2676b3fcb46807f6668a97dfe67359a3c4b7"
+
 [[package]]
 name = "valuable"
 version = "0.1.0"
diff --git a/pageserver/Cargo.toml b/pageserver/Cargo.toml
index 7b44dafb09..eb58b90ad9 100644
--- a/pageserver/Cargo.toml
+++ b/pageserver/Cargo.toml
@@ -3,6 +3,10 @@ name = "pageserver"
 version = "0.1.0"
 edition = "2021"

+[features]
+default = []
+profiling = ["pprof"]
+
 [dependencies]
 chrono = "0.4.19"
 rand = "0.8.3"
@@ -32,6 +36,8 @@ serde = { version = "1.0", features = ["derive"] }
 serde_json = "1"
 serde_with = "1.12.0"

+pprof = { git = "https://github.com/neondatabase/pprof-rs.git", branch = "wallclock-profiling", features = ["flamegraph"], optional = true }
+
 toml_edit = { version = "0.13", features = ["easy"] }
 scopeguard = "1.1.0"
 const_format = "0.2.21"
diff --git a/pageserver/src/bin/pageserver.rs b/pageserver/src/bin/pageserver.rs
index 867bea1b06..9b944cc2ec 100644
--- a/pageserver/src/bin/pageserver.rs
+++ b/pageserver/src/bin/pageserver.rs
@@ -10,7 +10,7 @@ use daemonize::Daemonize;

 use pageserver::{
     config::{defaults::*, PageServerConf},
-    http, page_cache, page_service,
+    http, page_cache, page_service, profiling,
     remote_storage::{self, SyncStartupData},
     repository::{Repository, TimelineSyncStatusUpdate},
     tenant_mgr, thread_mgr,
@@ -29,11 +29,15 @@ use utils::{
     GIT_VERSION,
 };

+fn version() -> String {
+    format!("{} profiling:{}", GIT_VERSION, cfg!(feature = "profiling"))
+}
+
 fn main() -> anyhow::Result<()> {
     metrics::set_common_metrics_prefix("pageserver");
     let arg_matches = App::new("Zenith page server")
         .about("Materializes WAL stream to pages and serves them to the postgres")
-        .version(GIT_VERSION)
+        .version(&*version())
         .arg(
             Arg::new("daemonize")
                 .short('d')
@@ -283,6 +287,9 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()
     };
     info!("Using auth: {:#?}", conf.auth_type);

+    // start profiler (if enabled)
+    let profiler_guard = profiling::init_profiler(conf);
+
     // Spawn a new thread for the http endpoint
     // bind before launching separate thread so the error reported before startup exits
     let auth_cloned = auth.clone();
@@ -315,6 +322,7 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()
                     "Got {}. Terminating in immediate shutdown mode",
                     signal.name()
                 );
+                profiling::exit_profiler(conf, &profiler_guard);
                 std::process::exit(111);
             }

@@ -323,6 +331,7 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()
                     "Got {}.
Terminating gracefully in fast shutdown mode", signal.name() ); + profiling::exit_profiler(conf, &profiler_guard); pageserver::shutdown_pageserver(); unreachable!() } diff --git a/pageserver/src/config.rs b/pageserver/src/config.rs index 0cba3f48f8..24ab45386d 100644 --- a/pageserver/src/config.rs +++ b/pageserver/src/config.rs @@ -140,6 +140,27 @@ pub struct PageServerConf { pub auth_validation_public_key_path: Option, pub remote_storage_config: Option, + + pub profiling: ProfilingConfig, +} + +#[derive(Debug, Clone, PartialEq, Eq)] +pub enum ProfilingConfig { + Disabled, + PageRequests, +} + +impl FromStr for ProfilingConfig { + type Err = anyhow::Error; + + fn from_str(s: &str) -> Result { + let result = match s { + "disabled" => ProfilingConfig::Disabled, + "page_requests" => ProfilingConfig::PageRequests, + _ => bail!("invalid value \"{}\" for profiling option, valid values are \"disabled\" and \"page_requests\"", s), + }; + Ok(result) + } } // use dedicated enum for builder to better indicate the intention @@ -192,6 +213,8 @@ struct PageServerConfigBuilder { remote_storage_config: BuilderValue>, id: BuilderValue, + + profiling: BuilderValue, } impl Default for PageServerConfigBuilder { @@ -224,6 +247,7 @@ impl Default for PageServerConfigBuilder { auth_validation_public_key_path: Set(None), remote_storage_config: Set(None), id: NotSet, + profiling: Set(ProfilingConfig::Disabled), } } } @@ -308,6 +332,10 @@ impl PageServerConfigBuilder { self.id = BuilderValue::Set(node_id) } + pub fn profiling(&mut self, profiling: ProfilingConfig) { + self.profiling = BuilderValue::Set(profiling) + } + pub fn build(self) -> Result { Ok(PageServerConf { listen_pg_addr: self @@ -357,6 +385,7 @@ impl PageServerConfigBuilder { .remote_storage_config .ok_or(anyhow::anyhow!("missing remote_storage_config"))?, id: self.id.ok_or(anyhow::anyhow!("missing id"))?, + profiling: self.profiling.ok_or(anyhow::anyhow!("missing profiling"))?, }) } } @@ -486,11 +515,12 @@ impl PageServerConf 
{ "auth_validation_public_key_path" => builder.auth_validation_public_key_path(Some( PathBuf::from(parse_toml_string(key, item)?), )), - "auth_type" => builder.auth_type(parse_toml_auth_type(key, item)?), + "auth_type" => builder.auth_type(parse_toml_from_str(key, item)?), "remote_storage" => { builder.remote_storage_config(Some(Self::parse_remote_storage_config(item)?)) } "id" => builder.id(ZNodeId(parse_toml_u64(key, item)?)), + "profiling" => builder.profiling(parse_toml_from_str(key, item)?), _ => bail!("unrecognized pageserver option '{}'", key), } } @@ -623,6 +653,7 @@ impl PageServerConf { auth_type: AuthType::Trust, auth_validation_public_key_path: None, remote_storage_config: None, + profiling: ProfilingConfig::Disabled, } } } @@ -656,11 +687,14 @@ fn parse_toml_duration(name: &str, item: &Item) -> Result { Ok(humantime::parse_duration(s)?) } -fn parse_toml_auth_type(name: &str, item: &Item) -> Result { +fn parse_toml_from_str(name: &str, item: &Item) -> Result +where + T: FromStr, +{ let v = item .as_str() .with_context(|| format!("configure option {} is not a string", name))?; - AuthType::from_str(v) + T::from_str(v) } #[cfg(test)] @@ -733,6 +767,7 @@ id = 10 auth_type: AuthType::Trust, auth_validation_public_key_path: None, remote_storage_config: None, + profiling: ProfilingConfig::Disabled, }, "Correct defaults should be used when no config values are provided" ); @@ -779,6 +814,7 @@ id = 10 auth_type: AuthType::Trust, auth_validation_public_key_path: None, remote_storage_config: None, + profiling: ProfilingConfig::Disabled, }, "Should be able to parse all basic config values correctly" ); diff --git a/pageserver/src/lib.rs b/pageserver/src/lib.rs index e6ac159ef2..a761f0dfe2 100644 --- a/pageserver/src/lib.rs +++ b/pageserver/src/lib.rs @@ -7,6 +7,7 @@ pub mod layered_repository; pub mod page_cache; pub mod page_service; pub mod pgdatadir_mapping; +pub mod profiling; pub mod reltag; pub mod remote_storage; pub mod repository; diff --git 
a/pageserver/src/page_service.rs b/pageserver/src/page_service.rs index 8f5ea2e845..8c90195131 100644 --- a/pageserver/src/page_service.rs +++ b/pageserver/src/page_service.rs @@ -29,8 +29,9 @@ use utils::{ }; use crate::basebackup; -use crate::config::PageServerConf; +use crate::config::{PageServerConf, ProfilingConfig}; use crate::pgdatadir_mapping::DatadirTimeline; +use crate::profiling::profpoint_start; use crate::reltag::RelTag; use crate::repository::Repository; use crate::repository::Timeline; @@ -331,7 +332,10 @@ impl PageServerHandler { pgb.write_message(&BeMessage::CopyBothResponse)?; while !thread_mgr::is_shutdown_requested() { - match pgb.read_message() { + let msg = pgb.read_message(); + + let profiling_guard = profpoint_start(self.conf, ProfilingConfig::PageRequests); + match msg { Ok(message) => { if let Some(message) = message { trace!("query: {:?}", message); @@ -383,6 +387,7 @@ impl PageServerHandler { } } } + drop(profiling_guard); } Ok(()) } diff --git a/pageserver/src/profiling.rs b/pageserver/src/profiling.rs new file mode 100644 index 0000000000..e2c12c9e12 --- /dev/null +++ b/pageserver/src/profiling.rs @@ -0,0 +1,95 @@ +//! +//! Support for profiling +//! +//! This relies on a modified version of the 'pprof-rs' crate. That's not very +//! nice, so to avoid a hard dependency on that, this is an optional feature. +//! +use crate::config::{PageServerConf, ProfilingConfig}; + +/// The actual implementation is in the `profiling_impl` submodule. If the profiling +/// feature is not enabled, it's just a dummy implementation that panics if you +/// try to enabled profiling in the configuration. +pub use profiling_impl::*; + +#[cfg(feature = "profiling")] +mod profiling_impl { + use super::*; + use pprof; + use std::marker::PhantomData; + + /// Start profiling the current thread. Returns a guard object; + /// the profiling continues until the guard is dropped. + /// + /// Note: profiling is not re-entrant. 
If you call 'profpoint_start' while + /// profiling is already started, nothing happens, and the profiling will be + /// stopped when either guard object is dropped. + #[inline] + pub fn profpoint_start( + conf: &crate::config::PageServerConf, + point: ProfilingConfig, + ) -> Option { + if conf.profiling == point { + pprof::start_profiling(); + Some(ProfilingGuard(PhantomData)) + } else { + None + } + } + + /// A hack to remove Send and Sync from the ProfilingGuard. Because the + /// profiling is attached to current thread. + //// + /// See comments in https://github.com/rust-lang/rust/issues/68318 + type PhantomUnsend = std::marker::PhantomData<*mut u8>; + + pub struct ProfilingGuard(PhantomUnsend); + + impl Drop for ProfilingGuard { + fn drop(&mut self) { + pprof::stop_profiling(); + } + } + + /// Initialize the profiler. This must be called before any 'profpoint_start' calls. + pub fn init_profiler(conf: &PageServerConf) -> Option { + if conf.profiling != ProfilingConfig::Disabled { + Some(pprof::ProfilerGuardBuilder::default().build().unwrap()) + } else { + None + } + } + + /// Exit the profiler. Writes the flamegraph to current workdir. 
+ pub fn exit_profiler(_conf: &PageServerConf, profiler_guard: &Option) { + // Write out the flamegraph + if let Some(profiler_guard) = profiler_guard { + if let Ok(report) = profiler_guard.report().build() { + // this gets written under the workdir + let file = std::fs::File::create("flamegraph.svg").unwrap(); + let mut options = pprof::flamegraph::Options::default(); + options.image_width = Some(2500); + report.flamegraph_with_options(file, &mut options).unwrap(); + } + } + } +} + +/// Dummy implementation when compiling without profiling feature +#[cfg(not(feature = "profiling"))] +mod profiling_impl { + use super::*; + + pub fn profpoint_start(_conf: &PageServerConf, _point: ProfilingConfig) -> () { + () + } + + pub fn init_profiler(conf: &PageServerConf) -> () { + if conf.profiling != ProfilingConfig::Disabled { + // shouldn't happen, we don't allow profiling in the config if the support + // for it is disabled. + panic!("profiling enabled but the binary was compiled without profiling support"); + } + } + + pub fn exit_profiler(_conf: &PageServerConf, _guard: &()) {} +} diff --git a/test_runner/fixtures/zenith_fixtures.py b/test_runner/fixtures/zenith_fixtures.py index a9c4c0f395..9a2d6cdc88 100644 --- a/test_runner/fixtures/zenith_fixtures.py +++ b/test_runner/fixtures/zenith_fixtures.py @@ -155,6 +155,18 @@ def pytest_configure(config): raise Exception('zenith binaries not found at "{}"'.format(zenith_binpath)) +def profiling_supported(): + """Return True if the pageserver was compiled with the 'profiling' feature + """ + bin_pageserver = os.path.join(str(zenith_binpath), 'pageserver') + res = subprocess.run([bin_pageserver, '--version'], + check=True, + universal_newlines=True, + stdout=subprocess.PIPE, + stderr=subprocess.PIPE) + return "profiling:true" in res.stdout + + def shareable_scope(fixture_name, config) -> Literal["session", "function"]: """Return either session of function scope, depending on TEST_SHARED_FIXTURES envvar. 
diff --git a/test_runner/performance/test_perf_pgbench.py b/test_runner/performance/test_perf_pgbench.py index d2de76913a..fc10ca4d6c 100644 --- a/test_runner/performance/test_perf_pgbench.py +++ b/test_runner/performance/test_perf_pgbench.py @@ -1,5 +1,5 @@ from contextlib import closing -from fixtures.zenith_fixtures import PgBin, VanillaPostgres, ZenithEnv +from fixtures.zenith_fixtures import PgBin, VanillaPostgres, ZenithEnv, profiling_supported from fixtures.compare_fixtures import PgCompare, VanillaCompare, ZenithCompare from fixtures.benchmark_fixture import PgBenchRunResult, MetricReport, ZenithBenchmarker @@ -106,6 +106,28 @@ def test_pgbench(zenith_with_baseline: PgCompare, scale: int, duration: int): run_test_pgbench(zenith_with_baseline, scale, duration) +# Run the pgbench tests, and generate a flamegraph from it +# This requires that the pageserver was built with the 'profiling' feature. +# +# TODO: If the profiling is cheap enough, there's no need to run the same test +# twice, with and without profiling. But for now, run it separately, so that we +# can see how much overhead the profiling adds. 
+@pytest.mark.parametrize("scale", get_scales_matrix()) +@pytest.mark.parametrize("duration", get_durations_matrix()) +def test_pgbench_flamegraph(zenbenchmark, pg_bin, zenith_env_builder, scale: int, duration: int): + zenith_env_builder.num_safekeepers = 1 + zenith_env_builder.pageserver_config_override = ''' +profiling="page_requests" +''' + if not profiling_supported(): + pytest.skip("pageserver was built without 'profiling' feature") + + env = zenith_env_builder.init_start() + env.zenith_cli.create_branch("empty", "main") + + run_test_pgbench(ZenithCompare(zenbenchmark, env, pg_bin, "pgbench"), scale, duration) + + # Run the pgbench tests against an existing Postgres cluster @pytest.mark.parametrize("scale", get_scales_matrix()) @pytest.mark.parametrize("duration", get_durations_matrix()) From 5f83c9290b482dc90006c400dfc68e85a17af785 Mon Sep 17 00:00:00 2001 From: Konstantin Knizhnik Date: Fri, 25 Feb 2022 19:33:44 +0300 Subject: [PATCH 136/296] Make it possible to specify per-tenant configuration parameters Add tenant config API and 'zenith tenant config' CLI command. Add 'show' query to pageserver protocol for tenant-specific config parameters Refactoring: move tenant_config code to a separate module. Save tenant conf file to tenant's directory, when tenant is created to recover it on pageserver restart. Ignore error during tenant config loading, while it is not supported by console Define PiTR interval for GC.
refer #1320 --- control_plane/src/storage.rs | 53 ++++- pageserver/src/bin/pageserver.rs | 5 +- pageserver/src/config.rs | 205 +++++----------- pageserver/src/http/models.rs | 48 +++- pageserver/src/http/openapi_spec.yml | 88 ++++++- pageserver/src/http/routes.rs | 67 +++++- pageserver/src/layered_repository.rs | 238 +++++++++++++++++-- pageserver/src/lib.rs | 1 + pageserver/src/page_service.rs | 41 +++- pageserver/src/repository.rs | 31 ++- pageserver/src/tenant_config.rs | 162 +++++++++++++ pageserver/src/tenant_mgr.rs | 43 +++- pageserver/src/tenant_threads.rs | 23 +- pageserver/src/timelines.rs | 10 +- pageserver/src/walreceiver.rs | 2 +- test_runner/batch_others/test_tenant_conf.py | 49 ++++ test_runner/fixtures/zenith_fixtures.py | 23 +- zenith/src/main.rs | 34 ++- 18 files changed, 915 insertions(+), 208 deletions(-) create mode 100644 pageserver/src/tenant_config.rs create mode 100644 test_runner/batch_others/test_tenant_conf.py diff --git a/control_plane/src/storage.rs b/control_plane/src/storage.rs index a01ffd30f6..7520ad9304 100644 --- a/control_plane/src/storage.rs +++ b/control_plane/src/storage.rs @@ -1,3 +1,4 @@ +use std::collections::HashMap; use std::io::Write; use std::net::TcpStream; use std::path::PathBuf; @@ -9,7 +10,7 @@ use anyhow::{bail, Context}; use nix::errno::Errno; use nix::sys::signal::{kill, Signal}; use nix::unistd::Pid; -use pageserver::http::models::{TenantCreateRequest, TimelineCreateRequest}; +use pageserver::http::models::{TenantConfigRequest, TenantCreateRequest, TimelineCreateRequest}; use pageserver::timelines::TimelineInfo; use postgres::{Config, NoTls}; use reqwest::blocking::{Client, RequestBuilder, Response}; @@ -344,10 +345,32 @@ impl PageServerNode { pub fn tenant_create( &self, new_tenant_id: Option, + settings: HashMap<&str, &str>, ) -> anyhow::Result> { let tenant_id_string = self .http_request(Method::POST, format!("{}/tenant", self.http_base_url)) - .json(&TenantCreateRequest { new_tenant_id }) + 
.json(&TenantCreateRequest { + new_tenant_id, + checkpoint_distance: settings + .get("checkpoint_distance") + .map(|x| x.parse::()) + .transpose()?, + compaction_target_size: settings + .get("compaction_target_size") + .map(|x| x.parse::()) + .transpose()?, + compaction_period: settings.get("compaction_period").map(|x| x.to_string()), + compaction_threshold: settings + .get("compaction_threshold") + .map(|x| x.parse::()) + .transpose()?, + gc_horizon: settings + .get("gc_horizon") + .map(|x| x.parse::()) + .transpose()?, + gc_period: settings.get("gc_period").map(|x| x.to_string()), + pitr_interval: settings.get("pitr_interval").map(|x| x.to_string()), + }) .send()? .error_from_body()? .json::>()?; @@ -364,6 +387,32 @@ impl PageServerNode { .transpose() } + pub fn tenant_config(&self, tenant_id: ZTenantId, settings: HashMap<&str, &str>) -> Result<()> { + self.http_request(Method::PUT, format!("{}/tenant/config", self.http_base_url)) + .json(&TenantConfigRequest { + tenant_id, + checkpoint_distance: settings + .get("checkpoint_distance") + .map(|x| x.parse::().unwrap()), + compaction_target_size: settings + .get("compaction_target_size") + .map(|x| x.parse::().unwrap()), + compaction_period: settings.get("compaction_period").map(|x| x.to_string()), + compaction_threshold: settings + .get("compaction_threshold") + .map(|x| x.parse::().unwrap()), + gc_horizon: settings + .get("gc_horizon") + .map(|x| x.parse::().unwrap()), + gc_period: settings.get("gc_period").map(|x| x.to_string()), + pitr_interval: settings.get("pitr_interval").map(|x| x.to_string()), + }) + .send()? 
+ .error_from_body()?; + + Ok(()) + } + pub fn timeline_list(&self, tenant_id: &ZTenantId) -> anyhow::Result> { let timeline_infos: Vec = self .http_request( diff --git a/pageserver/src/bin/pageserver.rs b/pageserver/src/bin/pageserver.rs index 9b944cc2ec..5c135e4eb4 100644 --- a/pageserver/src/bin/pageserver.rs +++ b/pageserver/src/bin/pageserver.rs @@ -246,11 +246,12 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<() for (tenant_id, local_timeline_init_statuses) in local_timeline_init_statuses { // initialize local tenant - let repo = tenant_mgr::load_local_repo(conf, tenant_id, &remote_index); + let repo = tenant_mgr::load_local_repo(conf, tenant_id, &remote_index) + .with_context(|| format!("Failed to load repo for tenant {}", tenant_id))?; for (timeline_id, init_status) in local_timeline_init_statuses { match init_status { remote_storage::LocalTimelineInitStatus::LocallyComplete => { - debug!("timeline {} for tenant {} is locally complete, registering it in repository", tenant_id, timeline_id); + debug!("timeline {} for tenant {} is locally complete, registering it in repository", timeline_id, tenant_id); // Lets fail here loudly to be on the safe side. // XXX: It may be a better api to actually distinguish between repository startup // and processing of newly downloaded timelines. diff --git a/pageserver/src/config.rs b/pageserver/src/config.rs index 24ab45386d..b2c4a62796 100644 --- a/pageserver/src/config.rs +++ b/pageserver/src/config.rs @@ -5,6 +5,12 @@ //! See also `settings.md` for better description on every parameter. 
use anyhow::{bail, ensure, Context, Result}; +use std::convert::TryInto; +use std::env; +use std::num::{NonZeroU32, NonZeroUsize}; +use std::path::{Path, PathBuf}; +use std::str::FromStr; +use std::time::Duration; use toml_edit; use toml_edit::{Document, Item}; use utils::{ @@ -12,16 +18,11 @@ use utils::{ zid::{ZNodeId, ZTenantId, ZTimelineId}, }; -use std::convert::TryInto; -use std::env; -use std::num::{NonZeroU32, NonZeroUsize}; -use std::path::{Path, PathBuf}; -use std::str::FromStr; -use std::time::Duration; - use crate::layered_repository::TIMELINES_SEGMENT_NAME; +use crate::tenant_config::{TenantConf, TenantConfOpt}; pub mod defaults { + use crate::tenant_config::defaults::*; use const_format::formatcp; pub const DEFAULT_PG_LISTEN_PORT: u16 = 64000; @@ -29,21 +30,6 @@ pub mod defaults { pub const DEFAULT_HTTP_LISTEN_PORT: u16 = 9898; pub const DEFAULT_HTTP_LISTEN_ADDR: &str = formatcp!("127.0.0.1:{DEFAULT_HTTP_LISTEN_PORT}"); - // FIXME: This current value is very low. I would imagine something like 1 GB or 10 GB - // would be more appropriate. But a low value forces the code to be exercised more, - // which is good for now to trigger bugs. - // This parameter actually determines L0 layer file size. - pub const DEFAULT_CHECKPOINT_DISTANCE: u64 = 256 * 1024 * 1024; - - // Target file size, when creating image and delta layers. - // This parameter determines L1 layer file size. 
- pub const DEFAULT_COMPACTION_TARGET_SIZE: u64 = 128 * 1024 * 1024; - pub const DEFAULT_COMPACTION_PERIOD: &str = "1 s"; - pub const DEFAULT_COMPACTION_THRESHOLD: usize = 10; - - pub const DEFAULT_GC_HORIZON: u64 = 64 * 1024 * 1024; - pub const DEFAULT_GC_PERIOD: &str = "100 s"; - pub const DEFAULT_WAIT_LSN_TIMEOUT: &str = "60 s"; pub const DEFAULT_WAL_REDO_TIMEOUT: &str = "60 s"; @@ -64,14 +50,6 @@ pub mod defaults { #listen_pg_addr = '{DEFAULT_PG_LISTEN_ADDR}' #listen_http_addr = '{DEFAULT_HTTP_LISTEN_ADDR}' -#checkpoint_distance = {DEFAULT_CHECKPOINT_DISTANCE} # in bytes -#compaction_target_size = {DEFAULT_COMPACTION_TARGET_SIZE} # in bytes -#compaction_period = '{DEFAULT_COMPACTION_PERIOD}' -#compaction_threshold = '{DEFAULT_COMPACTION_THRESHOLD}' - -#gc_period = '{DEFAULT_GC_PERIOD}' -#gc_horizon = {DEFAULT_GC_HORIZON} - #wait_lsn_timeout = '{DEFAULT_WAIT_LSN_TIMEOUT}' #wal_redo_timeout = '{DEFAULT_WAL_REDO_TIMEOUT}' @@ -80,6 +58,16 @@ pub mod defaults { # initial superuser role name to use when creating a new tenant #initial_superuser_name = '{DEFAULT_SUPERUSER}' +# [tenant_config] +#checkpoint_distance = {DEFAULT_CHECKPOINT_DISTANCE} # in bytes +#compaction_target_size = {DEFAULT_COMPACTION_TARGET_SIZE} # in bytes +#compaction_period = '{DEFAULT_COMPACTION_PERIOD}' +#compaction_threshold = '{DEFAULT_COMPACTION_THRESHOLD}' + +#gc_period = '{DEFAULT_GC_PERIOD}' +#gc_horizon = {DEFAULT_GC_HORIZON} +#pitr_interval = '{DEFAULT_PITR_INTERVAL}' + # [remote_storage] "### @@ -97,25 +85,6 @@ pub struct PageServerConf { /// Example (default): 127.0.0.1:9898 pub listen_http_addr: String, - // Flush out an inmemory layer, if it's holding WAL older than this - // This puts a backstop on how much WAL needs to be re-digested if the - // page server crashes. - // This parameter actually determines L0 layer file size. - pub checkpoint_distance: u64, - - // Target file size, when creating image and delta layers. - // This parameter determines L1 layer file size. 
- pub compaction_target_size: u64, - - // How often to check if there's compaction work to be done. - pub compaction_period: Duration, - - // Level0 delta layer threshold for compaction. - pub compaction_threshold: usize, - - pub gc_horizon: u64, - pub gc_period: Duration, - // Timeout when waiting for WAL receiver to catch up to an LSN given in a GetPage@LSN call. pub wait_lsn_timeout: Duration, // How long to wait for WAL redo to complete. @@ -142,6 +111,7 @@ pub struct PageServerConf { pub remote_storage_config: Option, pub profiling: ProfilingConfig, + pub default_tenant_conf: TenantConf, } #[derive(Debug, Clone, PartialEq, Eq)] @@ -185,15 +155,6 @@ struct PageServerConfigBuilder { listen_http_addr: BuilderValue, - checkpoint_distance: BuilderValue, - - compaction_target_size: BuilderValue, - compaction_period: BuilderValue, - compaction_threshold: BuilderValue, - - gc_horizon: BuilderValue, - gc_period: BuilderValue, - wait_lsn_timeout: BuilderValue, wal_redo_timeout: BuilderValue, @@ -224,14 +185,6 @@ impl Default for PageServerConfigBuilder { Self { listen_pg_addr: Set(DEFAULT_PG_LISTEN_ADDR.to_string()), listen_http_addr: Set(DEFAULT_HTTP_LISTEN_ADDR.to_string()), - checkpoint_distance: Set(DEFAULT_CHECKPOINT_DISTANCE), - compaction_target_size: Set(DEFAULT_COMPACTION_TARGET_SIZE), - compaction_period: Set(humantime::parse_duration(DEFAULT_COMPACTION_PERIOD) - .expect("cannot parse default compaction period")), - compaction_threshold: Set(DEFAULT_COMPACTION_THRESHOLD), - gc_horizon: Set(DEFAULT_GC_HORIZON), - gc_period: Set(humantime::parse_duration(DEFAULT_GC_PERIOD) - .expect("cannot parse default gc period")), wait_lsn_timeout: Set(humantime::parse_duration(DEFAULT_WAIT_LSN_TIMEOUT) .expect("cannot parse default wait lsn timeout")), wal_redo_timeout: Set(humantime::parse_duration(DEFAULT_WAL_REDO_TIMEOUT) @@ -261,30 +214,6 @@ impl PageServerConfigBuilder { self.listen_http_addr = BuilderValue::Set(listen_http_addr) } - pub fn checkpoint_distance(&mut 
self, checkpoint_distance: u64) { - self.checkpoint_distance = BuilderValue::Set(checkpoint_distance) - } - - pub fn compaction_target_size(&mut self, compaction_target_size: u64) { - self.compaction_target_size = BuilderValue::Set(compaction_target_size) - } - - pub fn compaction_period(&mut self, compaction_period: Duration) { - self.compaction_period = BuilderValue::Set(compaction_period) - } - - pub fn compaction_threshold(&mut self, compaction_threshold: usize) { - self.compaction_threshold = BuilderValue::Set(compaction_threshold) - } - - pub fn gc_horizon(&mut self, gc_horizon: u64) { - self.gc_horizon = BuilderValue::Set(gc_horizon) - } - - pub fn gc_period(&mut self, gc_period: Duration) { - self.gc_period = BuilderValue::Set(gc_period) - } - pub fn wait_lsn_timeout(&mut self, wait_lsn_timeout: Duration) { self.wait_lsn_timeout = BuilderValue::Set(wait_lsn_timeout) } @@ -344,22 +273,6 @@ impl PageServerConfigBuilder { listen_http_addr: self .listen_http_addr .ok_or(anyhow::anyhow!("missing listen_http_addr"))?, - checkpoint_distance: self - .checkpoint_distance - .ok_or(anyhow::anyhow!("missing checkpoint_distance"))?, - compaction_target_size: self - .compaction_target_size - .ok_or(anyhow::anyhow!("missing compaction_target_size"))?, - compaction_period: self - .compaction_period - .ok_or(anyhow::anyhow!("missing compaction_period"))?, - compaction_threshold: self - .compaction_threshold - .ok_or(anyhow::anyhow!("missing compaction_threshold"))?, - gc_horizon: self - .gc_horizon - .ok_or(anyhow::anyhow!("missing gc_horizon"))?, - gc_period: self.gc_period.ok_or(anyhow::anyhow!("missing gc_period"))?, wait_lsn_timeout: self .wait_lsn_timeout .ok_or(anyhow::anyhow!("missing wait_lsn_timeout"))?, @@ -386,6 +299,8 @@ impl PageServerConfigBuilder { .ok_or(anyhow::anyhow!("missing remote_storage_config"))?, id: self.id.ok_or(anyhow::anyhow!("missing id"))?, profiling: self.profiling.ok_or(anyhow::anyhow!("missing profiling"))?, + // TenantConf is handled 
separately + default_tenant_conf: TenantConf::default(), }) } } @@ -488,20 +403,12 @@ impl PageServerConf { let mut builder = PageServerConfigBuilder::default(); builder.workdir(workdir.to_owned()); + let mut t_conf: TenantConfOpt = Default::default(); + for (key, item) in toml.iter() { match key { "listen_pg_addr" => builder.listen_pg_addr(parse_toml_string(key, item)?), "listen_http_addr" => builder.listen_http_addr(parse_toml_string(key, item)?), - "checkpoint_distance" => builder.checkpoint_distance(parse_toml_u64(key, item)?), - "compaction_target_size" => { - builder.compaction_target_size(parse_toml_u64(key, item)?) - } - "compaction_period" => builder.compaction_period(parse_toml_duration(key, item)?), - "compaction_threshold" => { - builder.compaction_threshold(parse_toml_u64(key, item)? as usize) - } - "gc_horizon" => builder.gc_horizon(parse_toml_u64(key, item)?), - "gc_period" => builder.gc_period(parse_toml_duration(key, item)?), "wait_lsn_timeout" => builder.wait_lsn_timeout(parse_toml_duration(key, item)?), "wal_redo_timeout" => builder.wal_redo_timeout(parse_toml_duration(key, item)?), "initial_superuser_name" => builder.superuser(parse_toml_string(key, item)?), @@ -519,6 +426,9 @@ impl PageServerConf { "remote_storage" => { builder.remote_storage_config(Some(Self::parse_remote_storage_config(item)?)) } + "tenant_conf" => { + t_conf = Self::parse_toml_tenant_conf(item)?; + } "id" => builder.id(ZNodeId(parse_toml_u64(key, item)?)), "profiling" => builder.profiling(parse_toml_from_str(key, item)?), _ => bail!("unrecognized pageserver option '{}'", key), @@ -547,9 +457,42 @@ impl PageServerConf { ); } + conf.default_tenant_conf = t_conf.merge(TenantConf::default()); + Ok(conf) } + // subroutine of parse_and_validate to parse `[tenant_conf]` section + + pub fn parse_toml_tenant_conf(item: &toml_edit::Item) -> Result { + let mut t_conf: TenantConfOpt = Default::default(); + for (key, item) in item + .as_table() + .ok_or(anyhow::anyhow!("invalid tenant 
config"))? + .iter() + { + match key { + "checkpoint_distance" => { + t_conf.checkpoint_distance = Some(parse_toml_u64(key, item)?) + } + "compaction_target_size" => { + t_conf.compaction_target_size = Some(parse_toml_u64(key, item)?) + } + "compaction_period" => { + t_conf.compaction_period = Some(parse_toml_duration(key, item)?) + } + "compaction_threshold" => { + t_conf.compaction_threshold = Some(parse_toml_u64(key, item)? as usize) + } + "gc_horizon" => t_conf.gc_horizon = Some(parse_toml_u64(key, item)?), + "gc_period" => t_conf.gc_period = Some(parse_toml_duration(key, item)?), + "pitr_interval" => t_conf.pitr_interval = Some(parse_toml_duration(key, item)?), + _ => bail!("unrecognized tenant config option '{}'", key), + } + } + Ok(t_conf) + } + /// subroutine of parse_config(), to parse the `[remote_storage]` table. fn parse_remote_storage_config(toml: &toml_edit::Item) -> anyhow::Result { let local_path = toml.get("local_path"); @@ -635,12 +578,6 @@ impl PageServerConf { pub fn dummy_conf(repo_dir: PathBuf) -> Self { PageServerConf { id: ZNodeId(0), - checkpoint_distance: defaults::DEFAULT_CHECKPOINT_DISTANCE, - compaction_target_size: 4 * 1024 * 1024, - compaction_period: Duration::from_secs(10), - compaction_threshold: defaults::DEFAULT_COMPACTION_THRESHOLD, - gc_horizon: defaults::DEFAULT_GC_HORIZON, - gc_period: Duration::from_secs(10), wait_lsn_timeout: Duration::from_secs(60), wal_redo_timeout: Duration::from_secs(60), page_cache_size: defaults::DEFAULT_PAGE_CACHE_SIZE, @@ -654,6 +591,7 @@ impl PageServerConf { auth_validation_public_key_path: None, remote_storage_config: None, profiling: ProfilingConfig::Disabled, + default_tenant_conf: TenantConf::dummy_conf(), } } } @@ -711,15 +649,6 @@ mod tests { listen_pg_addr = '127.0.0.1:64000' listen_http_addr = '127.0.0.1:9898' -checkpoint_distance = 111 # in bytes - -compaction_target_size = 111 # in bytes -compaction_period = '111 s' -compaction_threshold = 2 - -gc_period = '222 s' -gc_horizon = 222 - 
wait_lsn_timeout = '111 s' wal_redo_timeout = '111 s' @@ -751,12 +680,6 @@ id = 10 id: ZNodeId(10), listen_pg_addr: defaults::DEFAULT_PG_LISTEN_ADDR.to_string(), listen_http_addr: defaults::DEFAULT_HTTP_LISTEN_ADDR.to_string(), - checkpoint_distance: defaults::DEFAULT_CHECKPOINT_DISTANCE, - compaction_target_size: defaults::DEFAULT_COMPACTION_TARGET_SIZE, - compaction_period: humantime::parse_duration(defaults::DEFAULT_COMPACTION_PERIOD)?, - compaction_threshold: defaults::DEFAULT_COMPACTION_THRESHOLD, - gc_horizon: defaults::DEFAULT_GC_HORIZON, - gc_period: humantime::parse_duration(defaults::DEFAULT_GC_PERIOD)?, wait_lsn_timeout: humantime::parse_duration(defaults::DEFAULT_WAIT_LSN_TIMEOUT)?, wal_redo_timeout: humantime::parse_duration(defaults::DEFAULT_WAL_REDO_TIMEOUT)?, superuser: defaults::DEFAULT_SUPERUSER.to_string(), @@ -768,6 +691,7 @@ id = 10 auth_validation_public_key_path: None, remote_storage_config: None, profiling: ProfilingConfig::Disabled, + default_tenant_conf: TenantConf::default(), }, "Correct defaults should be used when no config values are provided" ); @@ -798,12 +722,6 @@ id = 10 id: ZNodeId(10), listen_pg_addr: "127.0.0.1:64000".to_string(), listen_http_addr: "127.0.0.1:9898".to_string(), - checkpoint_distance: 111, - compaction_target_size: 111, - compaction_period: Duration::from_secs(111), - compaction_threshold: 2, - gc_horizon: 222, - gc_period: Duration::from_secs(222), wait_lsn_timeout: Duration::from_secs(111), wal_redo_timeout: Duration::from_secs(111), superuser: "zzzz".to_string(), @@ -815,6 +733,7 @@ id = 10 auth_validation_public_key_path: None, remote_storage_config: None, profiling: ProfilingConfig::Disabled, + default_tenant_conf: TenantConf::default(), }, "Should be able to parse all basic config values correctly" ); diff --git a/pageserver/src/http/models.rs b/pageserver/src/http/models.rs index 9b51e48477..b24b3dc316 100644 --- a/pageserver/src/http/models.rs +++ b/pageserver/src/http/models.rs @@ -20,11 +20,18 @@ pub 
struct TimelineCreateRequest { } #[serde_as] -#[derive(Serialize, Deserialize)] +#[derive(Serialize, Deserialize, Default)] pub struct TenantCreateRequest { #[serde(default)] #[serde_as(as = "Option")] pub new_tenant_id: Option, + pub checkpoint_distance: Option, + pub compaction_target_size: Option, + pub compaction_period: Option, + pub compaction_threshold: Option, + pub gc_horizon: Option, + pub gc_period: Option, + pub pitr_interval: Option, } #[serde_as] @@ -36,3 +43,42 @@ pub struct TenantCreateResponse(#[serde_as(as = "DisplayFromStr")] pub ZTenantId pub struct StatusResponse { pub id: ZNodeId, } + +impl TenantCreateRequest { + pub fn new(new_tenant_id: Option) -> TenantCreateRequest { + TenantCreateRequest { + new_tenant_id, + ..Default::default() + } + } +} + +#[serde_as] +#[derive(Serialize, Deserialize)] +pub struct TenantConfigRequest { + pub tenant_id: ZTenantId, + #[serde(default)] + #[serde_as(as = "Option")] + pub checkpoint_distance: Option, + pub compaction_target_size: Option, + pub compaction_period: Option, + pub compaction_threshold: Option, + pub gc_horizon: Option, + pub gc_period: Option, + pub pitr_interval: Option, +} + +impl TenantConfigRequest { + pub fn new(tenant_id: ZTenantId) -> TenantConfigRequest { + TenantConfigRequest { + tenant_id, + checkpoint_distance: None, + compaction_target_size: None, + compaction_period: None, + compaction_threshold: None, + gc_horizon: None, + gc_period: None, + pitr_interval: None, + } + } +} diff --git a/pageserver/src/http/openapi_spec.yml b/pageserver/src/http/openapi_spec.yml index c0b07418f3..9932a2d08d 100644 --- a/pageserver/src/http/openapi_spec.yml +++ b/pageserver/src/http/openapi_spec.yml @@ -328,11 +328,7 @@ paths: content: application/json: schema: - type: object - properties: - new_tenant_id: - type: string - format: hex + $ref: "#/components/schemas/TenantCreateInfo" responses: "201": description: New tenant created successfully @@ -371,7 +367,48 @@ paths: application/json: schema: 
$ref: "#/components/schemas/Error" - + /v1/tenant/config: + put: + description: | + Update tenant's config. + requestBody: + content: + application/json: + schema: + $ref: "#/components/schemas/TenantConfigInfo" + responses: + "200": + description: OK + content: + application/json: + schema: + type: array + items: + $ref: "#/components/schemas/TenantInfo" + "400": + description: Malformed tenant config request + content: + application/json: + schema: + $ref: "#/components/schemas/Error" + "401": + description: Unauthorized Error + content: + application/json: + schema: + $ref: "#/components/schemas/UnauthorizedError" + "403": + description: Forbidden Error + content: + application/json: + schema: + $ref: "#/components/schemas/ForbiddenError" + "500": + description: Generic operation error + content: + application/json: + schema: + $ref: "#/components/schemas/Error" components: securitySchemes: JWT: @@ -389,6 +426,45 @@ components: type: string state: type: string + TenantCreateInfo: + type: object + properties: + new_tenant_id: + type: string + format: hex + tenant_id: + type: string + format: hex + gc_period: + type: string + gc_horizon: + type: integer + pitr_interval: + type: string + checkpoint_distance: + type: integer + compaction_period: + type: string + compaction_threshold: + type: string + TenantConfigInfo: + type: object + properties: + tenant_id: + type: string + format: hex + gc_period: + type: string + gc_horizon: + type: integer + pitr_interval: + type: string + checkpoint_distance: + type: integer + compaction_period: + type: string + compaction_threshold: + type: string TimelineInfo: type: object required: diff --git a/pageserver/src/http/routes.rs b/pageserver/src/http/routes.rs index 82ea5d1d09..2db56015ad 100644 --- a/pageserver/src/http/routes.rs +++ b/pageserver/src/http/routes.rs @@ -6,13 +6,15 @@ use hyper::{Body, Request, Response, Uri}; use tracing::*; use super::models::{ - StatusResponse, TenantCreateRequest, TenantCreateResponse, 
TimelineCreateRequest, + StatusResponse, TenantConfigRequest, TenantCreateRequest, TenantCreateResponse, + TimelineCreateRequest, }; use crate::config::RemoteStorageKind; use crate::remote_storage::{ download_index_part, schedule_timeline_download, LocalFs, RemoteIndex, RemoteTimeline, S3Bucket, }; use crate::repository::Repository; +use crate::tenant_config::TenantConfOpt; use crate::timelines::{LocalTimelineInfo, RemoteTimelineInfo, TimelineInfo}; use crate::{config::PageServerConf, tenant_mgr, timelines}; use utils::{ @@ -375,6 +377,27 @@ async fn tenant_create_handler(mut request: Request) -> Result) -> Result) -> Result) -> Result, ApiError> { + let request_data: TenantConfigRequest = json_request(&mut request).await?; + let tenant_id = request_data.tenant_id; + // check for management permission + check_permission(&request, Some(tenant_id))?; + + let mut tenant_conf: TenantConfOpt = Default::default(); + if let Some(gc_period) = request_data.gc_period { + tenant_conf.gc_period = + Some(humantime::parse_duration(&gc_period).map_err(ApiError::from_err)?); + } + tenant_conf.gc_horizon = request_data.gc_horizon; + + if let Some(pitr_interval) = request_data.pitr_interval { + tenant_conf.pitr_interval = + Some(humantime::parse_duration(&pitr_interval).map_err(ApiError::from_err)?); + } + + tenant_conf.checkpoint_distance = request_data.checkpoint_distance; + tenant_conf.compaction_target_size = request_data.compaction_target_size; + tenant_conf.compaction_threshold = request_data.compaction_threshold; + + if let Some(compaction_period) = request_data.compaction_period { + tenant_conf.compaction_period = + Some(humantime::parse_duration(&compaction_period).map_err(ApiError::from_err)?); + } + + tokio::task::spawn_blocking(move || { + let _enter = info_span!("tenant_config", tenant = ?tenant_id).entered(); + + tenant_mgr::update_tenant_config(tenant_conf, tenant_id) + }) + .await + .map_err(ApiError::from_err)??; + + Ok(json_response(StatusCode::OK, ())?) 
+} + async fn handler_404(_: Request) -> Result, ApiError> { json_response( StatusCode::NOT_FOUND, @@ -426,6 +488,7 @@ pub fn make_router( .get("/v1/status", status_handler) .get("/v1/tenant", tenant_list_handler) .post("/v1/tenant", tenant_create_handler) + .put("/v1/tenant/config", tenant_config_handler) .get("/v1/tenant/:tenant_id/timeline", timeline_list_handler) .post("/v1/tenant/:tenant_id/timeline", timeline_create_handler) .get( diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index ff6498a489..3afef51a23 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -29,11 +29,13 @@ use std::ops::{Bound::Included, Deref, Range}; use std::path::{Path, PathBuf}; use std::sync::atomic::{self, AtomicBool}; use std::sync::{Arc, Mutex, MutexGuard, RwLock, RwLockReadGuard, TryLockError}; -use std::time::Instant; +use std::time::{Duration, Instant, SystemTime}; use self::metadata::{metadata_path, TimelineMetadata, METADATA_FILE_NAME}; use crate::config::PageServerConf; use crate::keyspace::KeySpace; +use crate::tenant_config::{TenantConf, TenantConfOpt}; + use crate::page_cache; use crate::remote_storage::{schedule_timeline_checkpoint_upload, RemoteIndex}; use crate::repository::{ @@ -51,6 +53,7 @@ use metrics::{ register_histogram_vec, register_int_counter, register_int_counter_vec, register_int_gauge_vec, Histogram, HistogramVec, IntCounter, IntCounterVec, IntGauge, IntGaugeVec, }; +use toml_edit; use utils::{ crashsafe_dir, lsn::{AtomicLsn, Lsn, RecordLsn}, @@ -149,7 +152,15 @@ pub const TIMELINES_SEGMENT_NAME: &str = "timelines"; /// Repository consists of multiple timelines. Keep them in a hash table. /// pub struct LayeredRepository { + // Global pageserver config parameters pub conf: &'static PageServerConf, + + // Overridden tenant-specific config parameters. + // We keep TenantConfOpt struct here to preserve the information
+ // This is necessary to allow global config updates. + tenant_conf: Arc>, + tenantid: ZTenantId, timelines: Mutex>, // This mutex prevents creation of new timelines during GC. @@ -219,6 +230,7 @@ impl Repository for LayeredRepository { let timeline = LayeredTimeline::new( self.conf, + Arc::clone(&self.tenant_conf), metadata, None, timelineid, @@ -302,6 +314,7 @@ impl Repository for LayeredRepository { &self, target_timelineid: Option, horizon: u64, + pitr: Duration, checkpoint_before_gc: bool, ) -> Result { let timeline_str = target_timelineid @@ -311,7 +324,7 @@ impl Repository for LayeredRepository { STORAGE_TIME .with_label_values(&["gc", &self.tenantid.to_string(), &timeline_str]) .observe_closure_duration(|| { - self.gc_iteration_internal(target_timelineid, horizon, checkpoint_before_gc) + self.gc_iteration_internal(target_timelineid, horizon, pitr, checkpoint_before_gc) }) } @@ -480,6 +493,64 @@ impl From for RepositoryTimeline { /// Private functions impl LayeredRepository { + pub fn get_checkpoint_distance(&self) -> u64 { + let tenant_conf = self.tenant_conf.read().unwrap(); + tenant_conf + .checkpoint_distance + .unwrap_or(self.conf.default_tenant_conf.checkpoint_distance) + } + + pub fn get_compaction_target_size(&self) -> u64 { + let tenant_conf = self.tenant_conf.read().unwrap(); + tenant_conf + .compaction_target_size + .unwrap_or(self.conf.default_tenant_conf.compaction_target_size) + } + + pub fn get_compaction_period(&self) -> Duration { + let tenant_conf = self.tenant_conf.read().unwrap(); + tenant_conf + .compaction_period + .unwrap_or(self.conf.default_tenant_conf.compaction_period) + } + + pub fn get_compaction_threshold(&self) -> usize { + let tenant_conf = self.tenant_conf.read().unwrap(); + tenant_conf + .compaction_threshold + .unwrap_or(self.conf.default_tenant_conf.compaction_threshold) + } + + pub fn get_gc_horizon(&self) -> u64 { + let tenant_conf = self.tenant_conf.read().unwrap(); + tenant_conf + .gc_horizon + 
.unwrap_or(self.conf.default_tenant_conf.gc_horizon) + } + + pub fn get_gc_period(&self) -> Duration { + let tenant_conf = self.tenant_conf.read().unwrap(); + tenant_conf + .gc_period + .unwrap_or(self.conf.default_tenant_conf.gc_period) + } + + pub fn get_pitr_interval(&self) -> Duration { + let tenant_conf = self.tenant_conf.read().unwrap(); + tenant_conf + .pitr_interval + .unwrap_or(self.conf.default_tenant_conf.pitr_interval) + } + + pub fn update_tenant_config(&self, new_tenant_conf: TenantConfOpt) -> Result<()> { + let mut tenant_conf = self.tenant_conf.write().unwrap(); + + tenant_conf.update(&new_tenant_conf); + + LayeredRepository::persist_tenant_config(self.conf, self.tenantid, *tenant_conf)?; + Ok(()) + } + // Implementation of the public `get_timeline` function. // Differences from the public: // * interface in that the caller must already hold the mutex on the 'timelines' hashmap. @@ -553,8 +624,10 @@ impl LayeredRepository { .flatten() .map(LayeredTimelineEntry::Loaded); let _enter = info_span!("loading local timeline").entered(); + let timeline = LayeredTimeline::new( self.conf, + Arc::clone(&self.tenant_conf), metadata, ancestor, timelineid, @@ -571,6 +644,7 @@ impl LayeredRepository { pub fn new( conf: &'static PageServerConf, + tenant_conf: TenantConfOpt, walredo_mgr: Arc, tenantid: ZTenantId, remote_index: RemoteIndex, @@ -579,6 +653,7 @@ impl LayeredRepository { LayeredRepository { tenantid, conf, + tenant_conf: Arc::new(RwLock::new(tenant_conf)), timelines: Mutex::new(HashMap::new()), gc_cs: Mutex::new(()), walredo_mgr, @@ -587,6 +662,71 @@ impl LayeredRepository { } } + /// Locate and load config + pub fn load_tenant_config( + conf: &'static PageServerConf, + tenantid: ZTenantId, + ) -> anyhow::Result { + let target_config_path = TenantConf::path(conf, tenantid); + + info!("load tenantconf from {}", target_config_path.display()); + + // FIXME If the config file is not found, assume that we're attaching + // a detached tenant and config is 
passed via attach command. + // https://github.com/neondatabase/neon/issues/1555 + if !target_config_path.exists() { + info!( + "Zenith tenant config is not found in {}", + target_config_path.display() + ); + return Ok(Default::default()); + } + + // load and parse file + let config = fs::read_to_string(target_config_path)?; + + let toml = config.parse::()?; + + let mut tenant_conf: TenantConfOpt = Default::default(); + for (key, item) in toml.iter() { + match key { + "tenant_conf" => { + tenant_conf = PageServerConf::parse_toml_tenant_conf(item)?; + } + _ => bail!("unrecognized pageserver option '{}'", key), + } + } + + Ok(tenant_conf) + } + + pub fn persist_tenant_config( + conf: &'static PageServerConf, + tenantid: ZTenantId, + tenant_conf: TenantConfOpt, + ) -> anyhow::Result<()> { + let _enter = info_span!("saving tenantconf").entered(); + let target_config_path = TenantConf::path(conf, tenantid); + info!("save tenantconf to {}", target_config_path.display()); + + let mut conf_content = r#"# This file contains a specific per-tenant's config. +# It is read in case of pageserver restart. + +# [tenant_config] +"# + .to_string(); + + // Convert the config to a toml file. 
+ conf_content += &toml_edit::easy::to_string(&tenant_conf)?; + + fs::write(&target_config_path, conf_content).with_context(|| { + format!( + "Failed to write config file into path '{}'", + target_config_path.display() + ) + }) + } + /// Save timeline metadata to file fn save_metadata( conf: &'static PageServerConf, @@ -662,6 +802,7 @@ impl LayeredRepository { &self, target_timelineid: Option, horizon: u64, + pitr: Duration, checkpoint_before_gc: bool, ) -> Result { let _span_guard = @@ -738,7 +879,7 @@ impl LayeredRepository { timeline.checkpoint(CheckpointConfig::Forced)?; info!("timeline {} checkpoint_before_gc done", timelineid); } - timeline.update_gc_info(branchpoints, cutoff); + timeline.update_gc_info(branchpoints, cutoff, pitr); let result = timeline.gc()?; totals += result; @@ -753,6 +894,7 @@ impl LayeredRepository { pub struct LayeredTimeline { conf: &'static PageServerConf, + tenant_conf: Arc>, tenantid: ZTenantId, timelineid: ZTimelineId, @@ -857,6 +999,11 @@ struct GcInfo { /// /// FIXME: is this inclusive or exclusive? cutoff: Lsn, + + /// In addition to 'retain_lsns', keep everything newer than 'SystemTime::now()' + /// minus 'pitr_interval' + /// + pitr: Duration, } /// Public interface functions @@ -987,12 +1134,34 @@ impl Timeline for LayeredTimeline { } impl LayeredTimeline { + fn get_checkpoint_distance(&self) -> u64 { + let tenant_conf = self.tenant_conf.read().unwrap(); + tenant_conf + .checkpoint_distance + .unwrap_or(self.conf.default_tenant_conf.checkpoint_distance) + } + + fn get_compaction_target_size(&self) -> u64 { + let tenant_conf = self.tenant_conf.read().unwrap(); + tenant_conf + .compaction_target_size + .unwrap_or(self.conf.default_tenant_conf.compaction_target_size) + } + + fn get_compaction_threshold(&self) -> usize { + let tenant_conf = self.tenant_conf.read().unwrap(); + tenant_conf + .compaction_threshold + .unwrap_or(self.conf.default_tenant_conf.compaction_threshold) + } + /// Open a Timeline handle. 
/// /// Loads the metadata for the timeline into memory, but not the layer map. #[allow(clippy::too_many_arguments)] fn new( conf: &'static PageServerConf, + tenant_conf: Arc>, metadata: TimelineMetadata, ancestor: Option, timelineid: ZTimelineId, @@ -1036,6 +1205,7 @@ impl LayeredTimeline { LayeredTimeline { conf, + tenant_conf, timelineid, tenantid, layers: RwLock::new(LayerMap::default()), @@ -1071,6 +1241,7 @@ impl LayeredTimeline { gc_info: RwLock::new(GcInfo { retain_lsns: Vec::new(), cutoff: Lsn(0), + pitr: Duration::ZERO, }), latest_gc_cutoff_lsn: RwLock::new(metadata.latest_gc_cutoff_lsn()), @@ -1431,7 +1602,7 @@ impl LayeredTimeline { let last_lsn = self.get_last_record_lsn(); let distance = last_lsn.widening_sub(self.last_freeze_at.load()); - if distance >= self.conf.checkpoint_distance.into() { + if distance >= self.get_checkpoint_distance().into() { self.freeze_inmem_layer(true); self.last_freeze_at.store(last_lsn); } @@ -1640,13 +1811,15 @@ impl LayeredTimeline { // above. Rewrite it. let _compaction_cs = self.compaction_cs.lock().unwrap(); - let target_file_size = self.conf.checkpoint_distance; + let target_file_size = self.get_checkpoint_distance(); // Define partitioning schema if needed if let Ok(pgdir) = tenant_mgr::get_timeline_for_tenant_load(self.tenantid, self.timelineid) { - let (partitioning, lsn) = - pgdir.repartition(self.get_last_record_lsn(), self.conf.compaction_target_size)?; + let (partitioning, lsn) = pgdir.repartition( + self.get_last_record_lsn(), + self.get_compaction_target_size(), + )?; let timer = self.create_images_time_histo.start_timer(); // 2. Create new image layers for partitions that have been modified // "enough". @@ -1747,7 +1920,7 @@ impl LayeredTimeline { // We compact or "shuffle" the level-0 delta layers when they've // accumulated over the compaction threshold. 
- if level0_deltas.len() < self.conf.compaction_threshold { + if level0_deltas.len() < self.get_compaction_threshold() { return Ok(()); } drop(layers); @@ -1870,10 +2043,11 @@ impl LayeredTimeline { /// the latest LSN subtracted by a constant, and doesn't do anything smart /// to figure out what read-only nodes might actually need.) /// - fn update_gc_info(&self, retain_lsns: Vec, cutoff: Lsn) { + fn update_gc_info(&self, retain_lsns: Vec, cutoff: Lsn, pitr: Duration) { let mut gc_info = self.gc_info.write().unwrap(); gc_info.retain_lsns = retain_lsns; gc_info.cutoff = cutoff; + gc_info.pitr = pitr; } /// @@ -1884,7 +2058,7 @@ impl LayeredTimeline { /// obsolete. /// fn gc(&self) -> Result { - let now = Instant::now(); + let now = SystemTime::now(); let mut result: GcResult = Default::default(); let disk_consistent_lsn = self.get_disk_consistent_lsn(); @@ -1893,6 +2067,7 @@ impl LayeredTimeline { let gc_info = self.gc_info.read().unwrap(); let retain_lsns = &gc_info.retain_lsns; let cutoff = gc_info.cutoff; + let pitr = gc_info.pitr; let _enter = info_span!("garbage collection", timeline = %self.timelineid, tenant = %self.tenantid, cutoff = %cutoff).entered(); @@ -1910,8 +2085,9 @@ impl LayeredTimeline { // // Garbage collect the layer if all conditions are satisfied: // 1. it is older than cutoff LSN; - // 2. it doesn't need to be retained for 'retain_lsns'; - // 3. newer on-disk image layers cover the layer's whole key range + // 2. it is older than PITR interval; + // 3. it doesn't need to be retained for 'retain_lsns'; + // 4. newer on-disk image layers cover the layer's whole key range // let mut layers = self.layers.write().unwrap(); 'outer: for l in layers.iter_historic_layers() { @@ -1937,8 +2113,31 @@ impl LayeredTimeline { result.layers_needed_by_cutoff += 1; continue 'outer; } - - // 2. Is it needed by a child branch? + // 2. It is newer than PiTR interval? + // We use modification time of layer file to estimate update time. 
+ // This estimation is not quite precise but maintaining LSN->timestamp map seems to be overkill. + // It is not expected that users will need high precision here. And this estimation + // is conservative: modification time of file is always newer than actual time of version + // creation. So it is safe for users. + // TODO A possible "bloat" issue still persists here. + // If modification time changes because of layer upload/download, we will keep these files + // longer than necessary. + // https://github.com/neondatabase/neon/issues/1554 + // + if let Ok(metadata) = fs::metadata(&l.filename()) { + let last_modified = metadata.modified()?; + if now.duration_since(last_modified)? < pitr { + debug!( + "keeping {} because its modification time {:?} is newer than PITR {:?}", + l.filename().display(), + last_modified, + pitr + ); + result.layers_needed_by_pitr += 1; + continue 'outer; + } + } + // 3. Is it needed by a child branch? // NOTE With that wee would keep data that // might be referenced by child branches forever. // We can track this in child timeline GC and delete parent layers when @@ -1957,7 +2156,7 @@ impl LayeredTimeline { } } - // 3. Is there a later on-disk layer for this relation? + // 4. Is there a later on-disk layer for this relation? // // The end-LSN is exclusive, while disk_consistent_lsn is // inclusive. 
For example, if disk_consistent_lsn is 100, it is @@ -1998,7 +2197,7 @@ impl LayeredTimeline { result.layers_removed += 1; } - result.elapsed = now.elapsed(); + result.elapsed = now.elapsed()?; Ok(result) } @@ -2275,7 +2474,8 @@ pub mod tests { } let cutoff = tline.get_last_record_lsn(); - tline.update_gc_info(Vec::new(), cutoff); + + tline.update_gc_info(Vec::new(), cutoff, Duration::ZERO); tline.checkpoint(CheckpointConfig::Forced)?; tline.compact()?; tline.gc()?; @@ -2345,7 +2545,7 @@ pub mod tests { // Perform a cycle of checkpoint, compaction, and GC println!("checkpointing {}", lsn); let cutoff = tline.get_last_record_lsn(); - tline.update_gc_info(Vec::new(), cutoff); + tline.update_gc_info(Vec::new(), cutoff, Duration::ZERO); tline.checkpoint(CheckpointConfig::Forced)?; tline.compact()?; tline.gc()?; @@ -2422,7 +2622,7 @@ pub mod tests { // Perform a cycle of checkpoint, compaction, and GC println!("checkpointing {}", lsn); let cutoff = tline.get_last_record_lsn(); - tline.update_gc_info(Vec::new(), cutoff); + tline.update_gc_info(Vec::new(), cutoff, Duration::ZERO); tline.checkpoint(CheckpointConfig::Forced)?; tline.compact()?; tline.gc()?; diff --git a/pageserver/src/lib.rs b/pageserver/src/lib.rs index a761f0dfe2..94219c7840 100644 --- a/pageserver/src/lib.rs +++ b/pageserver/src/lib.rs @@ -11,6 +11,7 @@ pub mod profiling; pub mod reltag; pub mod remote_storage; pub mod repository; +pub mod tenant_config; pub mod tenant_mgr; pub mod tenant_threads; pub mod thread_mgr; diff --git a/pageserver/src/page_service.rs b/pageserver/src/page_service.rs index 8c90195131..58d617448a 100644 --- a/pageserver/src/page_service.rs +++ b/pageserver/src/page_service.rs @@ -19,6 +19,7 @@ use std::net::TcpListener; use std::str; use std::str::FromStr; use std::sync::{Arc, RwLockReadGuard}; +use std::time::Duration; use tracing::*; use utils::{ auth::{self, Claims, JwtAuth, Scope}, @@ -676,6 +677,37 @@ impl postgres_backend::Handler for PageServerHandler { } } 
pgb.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?; + } else if query_string.starts_with("show ") { + // show + let (_, params_raw) = query_string.split_at("show ".len()); + let params = params_raw.split(' ').collect::>(); + ensure!(params.len() == 1, "invalid param number for config command"); + let tenantid = ZTenantId::from_str(params[0])?; + let repo = tenant_mgr::get_repository_for_tenant(tenantid)?; + pgb.write_message_noflush(&BeMessage::RowDescription(&[ + RowDescriptor::int8_col(b"checkpoint_distance"), + RowDescriptor::int8_col(b"compaction_target_size"), + RowDescriptor::int8_col(b"compaction_period"), + RowDescriptor::int8_col(b"compaction_threshold"), + RowDescriptor::int8_col(b"gc_horizon"), + RowDescriptor::int8_col(b"gc_period"), + RowDescriptor::int8_col(b"pitr_interval"), + ]))? + .write_message_noflush(&BeMessage::DataRow(&[ + Some(repo.get_checkpoint_distance().to_string().as_bytes()), + Some(repo.get_compaction_target_size().to_string().as_bytes()), + Some( + repo.get_compaction_period() + .as_secs() + .to_string() + .as_bytes(), + ), + Some(repo.get_compaction_threshold().to_string().as_bytes()), + Some(repo.get_gc_horizon().to_string().as_bytes()), + Some(repo.get_gc_period().as_secs().to_string().as_bytes()), + Some(repo.get_pitr_interval().as_secs().to_string().as_bytes()), + ]))? + .write_message(&BeMessage::CommandComplete(b"SELECT 1"))?; } else if query_string.starts_with("do_gc ") { // Run GC immediately on given timeline. // FIXME: This is just for tests. See test_runner/batch_others/test_gc.py. 
@@ -693,16 +725,20 @@ impl postgres_backend::Handler for PageServerHandler { let tenantid = ZTenantId::from_str(caps.get(1).unwrap().as_str())?; let timelineid = ZTimelineId::from_str(caps.get(2).unwrap().as_str())?; + + let repo = tenant_mgr::get_repository_for_tenant(tenantid)?; + let gc_horizon: u64 = caps .get(4) .map(|h| h.as_str().parse()) - .unwrap_or(Ok(self.conf.gc_horizon))?; + .unwrap_or_else(|| Ok(repo.get_gc_horizon()))?; let repo = tenant_mgr::get_repository_for_tenant(tenantid)?; - let result = repo.gc_iteration(Some(timelineid), gc_horizon, true)?; + let result = repo.gc_iteration(Some(timelineid), gc_horizon, Duration::ZERO, true)?; pgb.write_message_noflush(&BeMessage::RowDescription(&[ RowDescriptor::int8_col(b"layers_total"), RowDescriptor::int8_col(b"layers_needed_by_cutoff"), + RowDescriptor::int8_col(b"layers_needed_by_pitr"), RowDescriptor::int8_col(b"layers_needed_by_branches"), RowDescriptor::int8_col(b"layers_not_updated"), RowDescriptor::int8_col(b"layers_removed"), @@ -711,6 +747,7 @@ impl postgres_backend::Handler for PageServerHandler { .write_message_noflush(&BeMessage::DataRow(&[ Some(result.layers_total.to_string().as_bytes()), Some(result.layers_needed_by_cutoff.to_string().as_bytes()), + Some(result.layers_needed_by_pitr.to_string().as_bytes()), Some(result.layers_needed_by_branches.to_string().as_bytes()), Some(result.layers_not_updated.to_string().as_bytes()), Some(result.layers_removed.to_string().as_bytes()), diff --git a/pageserver/src/repository.rs b/pageserver/src/repository.rs index fc438cce9c..f7c2f036a6 100644 --- a/pageserver/src/repository.rs +++ b/pageserver/src/repository.rs @@ -249,6 +249,7 @@ pub trait Repository: Send + Sync { &self, timelineid: Option, horizon: u64, + pitr: Duration, checkpoint_before_gc: bool, ) -> Result; @@ -305,6 +306,7 @@ impl<'a, T> From<&'a RepositoryTimeline> for LocalTimelineState { pub struct GcResult { pub layers_total: u64, pub layers_needed_by_cutoff: u64, + pub 
layers_needed_by_pitr: u64, pub layers_needed_by_branches: u64, pub layers_not_updated: u64, pub layers_removed: u64, // # of layer files removed because they have been made obsolete by newer ondisk files. @@ -315,6 +317,7 @@ pub struct GcResult { impl AddAssign for GcResult { fn add_assign(&mut self, other: Self) { self.layers_total += other.layers_total; + self.layers_needed_by_pitr += other.layers_needed_by_pitr; self.layers_needed_by_cutoff += other.layers_needed_by_cutoff; self.layers_needed_by_branches += other.layers_needed_by_branches; self.layers_not_updated += other.layers_not_updated; @@ -432,6 +435,7 @@ pub mod repo_harness { }; use super::*; + use crate::tenant_config::{TenantConf, TenantConfOpt}; use hex_literal::hex; use utils::zid::ZTenantId; @@ -454,8 +458,23 @@ pub mod repo_harness { static ref LOCK: RwLock<()> = RwLock::new(()); } + impl From for TenantConfOpt { + fn from(tenant_conf: TenantConf) -> Self { + Self { + checkpoint_distance: Some(tenant_conf.checkpoint_distance), + compaction_target_size: Some(tenant_conf.compaction_target_size), + compaction_period: Some(tenant_conf.compaction_period), + compaction_threshold: Some(tenant_conf.compaction_threshold), + gc_horizon: Some(tenant_conf.gc_horizon), + gc_period: Some(tenant_conf.gc_period), + pitr_interval: Some(tenant_conf.pitr_interval), + } + } + } + pub struct RepoHarness<'a> { pub conf: &'static PageServerConf, + pub tenant_conf: TenantConf, pub tenant_id: ZTenantId, pub lock_guard: ( @@ -487,12 +506,15 @@ pub mod repo_harness { // OK in a test. 
let conf: &'static PageServerConf = Box::leak(Box::new(conf)); + let tenant_conf = TenantConf::dummy_conf(); + let tenant_id = ZTenantId::generate(); fs::create_dir_all(conf.tenant_path(&tenant_id))?; fs::create_dir_all(conf.timelines_path(&tenant_id))?; Ok(Self { conf, + tenant_conf, tenant_id, lock_guard, }) @@ -507,6 +529,7 @@ pub mod repo_harness { let repo = LayeredRepository::new( self.conf, + TenantConfOpt::from(self.tenant_conf), walredo_mgr, self.tenant_id, RemoteIndex::empty(), @@ -722,7 +745,7 @@ mod tests { // FIXME: this doesn't actually remove any layer currently, given how the checkpointing // and compaction works. But it does set the 'cutoff' point so that the cross check // below should fail. - repo.gc_iteration(Some(TIMELINE_ID), 0x10, false)?; + repo.gc_iteration(Some(TIMELINE_ID), 0x10, Duration::ZERO, false)?; // try to branch at lsn 25, should fail because we already garbage collected the data match repo.branch_timeline(TIMELINE_ID, NEW_TIMELINE_ID, Lsn(0x25)) { @@ -773,7 +796,7 @@ mod tests { let tline = repo.create_empty_timeline(TIMELINE_ID, Lsn(0))?; make_some_layers(tline.as_ref(), Lsn(0x20))?; - repo.gc_iteration(Some(TIMELINE_ID), 0x10, false)?; + repo.gc_iteration(Some(TIMELINE_ID), 0x10, Duration::ZERO, false)?; let latest_gc_cutoff_lsn = tline.get_latest_gc_cutoff_lsn(); assert!(*latest_gc_cutoff_lsn > Lsn(0x25)); match tline.get(*TEST_KEY, Lsn(0x25)) { @@ -796,7 +819,7 @@ mod tests { .get_timeline_load(NEW_TIMELINE_ID) .expect("Should have a local timeline"); // this removes layers before lsn 40 (50 minus 10), so there are two remaining layers, image and delta for 31-50 - repo.gc_iteration(Some(TIMELINE_ID), 0x10, false)?; + repo.gc_iteration(Some(TIMELINE_ID), 0x10, Duration::ZERO, false)?; assert!(newtline.get(*TEST_KEY, Lsn(0x25)).is_ok()); Ok(()) @@ -815,7 +838,7 @@ mod tests { make_some_layers(newtline.as_ref(), Lsn(0x60))?; // run gc on parent - repo.gc_iteration(Some(TIMELINE_ID), 0x10, false)?; + 
repo.gc_iteration(Some(TIMELINE_ID), 0x10, Duration::ZERO, false)?; // Check that the data is still accessible on the branch. assert_eq!( diff --git a/pageserver/src/tenant_config.rs b/pageserver/src/tenant_config.rs new file mode 100644 index 0000000000..818b6de1b1 --- /dev/null +++ b/pageserver/src/tenant_config.rs @@ -0,0 +1,162 @@ +//! Functions for handling per-tenant configuration options +//! +//! If tenant is created with --config option, +//! the tenant-specific config will be stored in tenant's directory. +//! Otherwise, global pageserver's config is used. +//! +//! If the tenant config file is corrupted, the tenant will be disabled. +//! We cannot use global or default config instead, because wrong settings +//! may lead to a data loss. +//! +use crate::config::PageServerConf; +use serde::{Deserialize, Serialize}; +use std::path::PathBuf; +use std::time::Duration; +use utils::zid::ZTenantId; + +pub const TENANT_CONFIG_NAME: &str = "config"; + +pub mod defaults { + // FIXME: This current value is very low. I would imagine something like 1 GB or 10 GB + // would be more appropriate. But a low value forces the code to be exercised more, + // which is good for now to trigger bugs. + // This parameter actually determines L0 layer file size. + pub const DEFAULT_CHECKPOINT_DISTANCE: u64 = 256 * 1024 * 1024; + + // Target file size, when creating image and delta layers. + // This parameter determines L1 layer file size. 
+ pub const DEFAULT_COMPACTION_TARGET_SIZE: u64 = 128 * 1024 * 1024; + + pub const DEFAULT_COMPACTION_PERIOD: &str = "1 s"; + pub const DEFAULT_COMPACTION_THRESHOLD: usize = 10; + + pub const DEFAULT_GC_HORIZON: u64 = 64 * 1024 * 1024; + pub const DEFAULT_GC_PERIOD: &str = "100 s"; + pub const DEFAULT_PITR_INTERVAL: &str = "30 days"; +} + +/// Per-tenant configuration options +#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)] +pub struct TenantConf { + // Flush out an inmemory layer, if it's holding WAL older than this + // This puts a backstop on how much WAL needs to be re-digested if the + // page server crashes. + // This parameter actually determines L0 layer file size. + pub checkpoint_distance: u64, + // Target file size, when creating image and delta layers. + // This parameter determines L1 layer file size. + pub compaction_target_size: u64, + // How often to check if there's compaction work to be done. + pub compaction_period: Duration, + // Level0 delta layer threshold for compaction. + pub compaction_threshold: usize, + // Determines how much history is retained, to allow + // branching and read replicas at an older point in time. + // The unit is #of bytes of WAL. + // Page versions older than this are garbage collected away. + pub gc_horizon: u64, + // Interval at which garbage collection is triggered. + pub gc_period: Duration, + // Determines how much history is retained, to allow + // branching and read replicas at an older point in time. + // The unit is time. + // Page versions older than this are garbage collected away. + pub pitr_interval: Duration, +} + +/// Same as TenantConf, but this struct preserves the information about +/// which parameters are set and which are not. 
+#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize, Default)] +pub struct TenantConfOpt { + pub checkpoint_distance: Option<u64>, + pub compaction_target_size: Option<u64>, + pub compaction_period: Option<Duration>, + pub compaction_threshold: Option<usize>, + pub gc_horizon: Option<u64>, + pub gc_period: Option<Duration>, + pub pitr_interval: Option<Duration>, +} + +impl TenantConfOpt { + pub fn merge(&self, global_conf: TenantConf) -> TenantConf { + TenantConf { + checkpoint_distance: self + .checkpoint_distance + .unwrap_or(global_conf.checkpoint_distance), + compaction_target_size: self + .compaction_target_size + .unwrap_or(global_conf.compaction_target_size), + compaction_period: self + .compaction_period + .unwrap_or(global_conf.compaction_period), + compaction_threshold: self + .compaction_threshold + .unwrap_or(global_conf.compaction_threshold), + gc_horizon: self.gc_horizon.unwrap_or(global_conf.gc_horizon), + gc_period: self.gc_period.unwrap_or(global_conf.gc_period), + pitr_interval: self.pitr_interval.unwrap_or(global_conf.pitr_interval), + } + } + + pub fn update(&mut self, other: &TenantConfOpt) { + if let Some(checkpoint_distance) = other.checkpoint_distance { + self.checkpoint_distance = Some(checkpoint_distance); + } + if let Some(compaction_target_size) = other.compaction_target_size { + self.compaction_target_size = Some(compaction_target_size); + } + if let Some(compaction_period) = other.compaction_period { + self.compaction_period = Some(compaction_period); + } + if let Some(compaction_threshold) = other.compaction_threshold { + self.compaction_threshold = Some(compaction_threshold); + } + if let Some(gc_horizon) = other.gc_horizon { + self.gc_horizon = Some(gc_horizon); + } + if let Some(gc_period) = other.gc_period { + self.gc_period = Some(gc_period); + } + if let Some(pitr_interval) = other.pitr_interval { + self.pitr_interval = Some(pitr_interval); + } + } +} + +impl TenantConf { + pub fn default() -> TenantConf { + use defaults::*; + + TenantConf { + checkpoint_distance:
DEFAULT_CHECKPOINT_DISTANCE, + compaction_target_size: DEFAULT_COMPACTION_TARGET_SIZE, + compaction_period: humantime::parse_duration(DEFAULT_COMPACTION_PERIOD) + .expect("cannot parse default compaction period"), + compaction_threshold: DEFAULT_COMPACTION_THRESHOLD, + gc_horizon: DEFAULT_GC_HORIZON, + gc_period: humantime::parse_duration(DEFAULT_GC_PERIOD) + .expect("cannot parse default gc period"), + pitr_interval: humantime::parse_duration(DEFAULT_PITR_INTERVAL) + .expect("cannot parse default PITR interval"), + } + } + + /// Points to a place in pageserver's local directory, + /// where certain tenant's tenantconf file should be located. + pub fn path(conf: &'static PageServerConf, tenantid: ZTenantId) -> PathBuf { + conf.tenant_path(&tenantid).join(TENANT_CONFIG_NAME) + } + + #[cfg(test)] + pub fn dummy_conf() -> Self { + TenantConf { + checkpoint_distance: defaults::DEFAULT_CHECKPOINT_DISTANCE, + compaction_target_size: 4 * 1024 * 1024, + compaction_period: Duration::from_secs(10), + compaction_threshold: defaults::DEFAULT_COMPACTION_THRESHOLD, + gc_horizon: defaults::DEFAULT_GC_HORIZON, + gc_period: Duration::from_secs(10), + pitr_interval: Duration::from_secs(60 * 60), + } + } +} diff --git a/pageserver/src/tenant_mgr.rs b/pageserver/src/tenant_mgr.rs index 33bb4dc2e0..8a69062dba 100644 --- a/pageserver/src/tenant_mgr.rs +++ b/pageserver/src/tenant_mgr.rs @@ -5,6 +5,7 @@ use crate::config::PageServerConf; use crate::layered_repository::LayeredRepository; use crate::remote_storage::RemoteIndex; use crate::repository::{Repository, TimelineSyncStatusUpdate}; +use crate::tenant_config::TenantConfOpt; use crate::thread_mgr; use crate::thread_mgr::ThreadKind; use crate::timelines; @@ -63,13 +64,13 @@ fn access_tenants() -> MutexGuard<'static, HashMap> { TENANTS.lock().unwrap() } -// Sets up wal redo manager and repository for tenant. Reduces code duplocation. +// Sets up wal redo manager and repository for tenant. Reduces code duplication. 
// Used during pageserver startup, or when new tenant is attached to pageserver. pub fn load_local_repo( conf: &'static PageServerConf, tenant_id: ZTenantId, remote_index: &RemoteIndex, -) -> Arc { +) -> Result> { let mut m = access_tenants(); let tenant = m.entry(tenant_id).or_insert_with(|| { // Set up a WAL redo manager, for applying WAL records. @@ -78,6 +79,7 @@ pub fn load_local_repo( // Set up an object repository, for actual data storage. let repo: Arc = Arc::new(LayeredRepository::new( conf, + Default::default(), Arc::new(walredo_mgr), tenant_id, remote_index.clone(), @@ -89,7 +91,12 @@ pub fn load_local_repo( timelines: HashMap::new(), } }); - Arc::clone(&tenant.repo) + + // Restore tenant config + let tenant_conf = LayeredRepository::load_tenant_config(conf, tenant_id)?; + tenant.repo.update_tenant_config(tenant_conf)?; + + Ok(Arc::clone(&tenant.repo)) } /// Updates tenants' repositories, changing their timelines state in memory. @@ -109,7 +116,16 @@ pub fn apply_timeline_sync_status_updates( trace!("Sync status updates: {:?}", sync_status_updates); for (tenant_id, tenant_timelines_sync_status_updates) in sync_status_updates { - let repo = load_local_repo(conf, tenant_id, remote_index); + let repo = match load_local_repo(conf, tenant_id, remote_index) { + Ok(repo) => repo, + Err(e) => { + error!( + "Failed to load repo for tenant {} Error: {:#}", + tenant_id, e + ); + continue; + } + }; for (timeline_id, timeline_sync_status_update) in tenant_timelines_sync_status_updates { match repo.apply_timeline_remote_sync_status_update(timeline_id, timeline_sync_status_update) @@ -174,6 +190,7 @@ pub fn shutdown_all_tenants() { pub fn create_tenant_repository( conf: &'static PageServerConf, + tenant_conf: TenantConfOpt, tenantid: ZTenantId, remote_index: RemoteIndex, ) -> Result> { @@ -186,6 +203,7 @@ pub fn create_tenant_repository( let wal_redo_manager = Arc::new(PostgresRedoManager::new(conf, tenantid)); let repo = timelines::create_repo( conf, + tenant_conf, 
tenantid, CreateRepo::Real { wal_redo_manager, @@ -202,6 +220,14 @@ pub fn create_tenant_repository( } } +pub fn update_tenant_config(tenant_conf: TenantConfOpt, tenantid: ZTenantId) -> Result<()> { + info!("configuring tenant {}", tenantid); + let repo = get_repository_for_tenant(tenantid)?; + + repo.update_tenant_config(tenant_conf)?; + Ok(()) +} + pub fn get_tenant_state(tenantid: ZTenantId) -> Option { Some(access_tenants().get(&tenantid)?.state) } @@ -210,7 +236,7 @@ pub fn get_tenant_state(tenantid: ZTenantId) -> Option { /// Change the state of a tenant to Active and launch its compactor and GC /// threads. If the tenant was already in Active state or Stopping, does nothing. /// -pub fn activate_tenant(conf: &'static PageServerConf, tenant_id: ZTenantId) -> Result<()> { +pub fn activate_tenant(tenant_id: ZTenantId) -> Result<()> { let mut m = access_tenants(); let tenant = m .get_mut(&tenant_id) @@ -230,7 +256,7 @@ pub fn activate_tenant(conf: &'static PageServerConf, tenant_id: ZTenantId) -> R None, "Compactor thread", true, - move || crate::tenant_threads::compact_loop(tenant_id, conf), + move || crate::tenant_threads::compact_loop(tenant_id), )?; let gc_spawn_result = thread_mgr::spawn( @@ -239,7 +265,7 @@ pub fn activate_tenant(conf: &'static PageServerConf, tenant_id: ZTenantId) -> R None, "GC thread", true, - move || crate::tenant_threads::gc_loop(tenant_id, conf), + move || crate::tenant_threads::gc_loop(tenant_id), ) .with_context(|| format!("Failed to launch GC thread for tenant {}", tenant_id)); @@ -251,7 +277,6 @@ pub fn activate_tenant(conf: &'static PageServerConf, tenant_id: ZTenantId) -> R thread_mgr::shutdown_threads(Some(ThreadKind::Compactor), Some(tenant_id), None); return gc_spawn_result; } - tenant.state = TenantState::Active; } @@ -290,7 +315,7 @@ pub fn get_timeline_for_tenant_load( .get_timeline_load(timelineid) .with_context(|| format!("Timeline {} not found for tenant {}", timelineid, tenantid))?; - let repartition_distance = 
tenant.repo.conf.checkpoint_distance / 10; + let repartition_distance = tenant.repo.get_checkpoint_distance() / 10; let page_tline = Arc::new(DatadirTimelineImpl::new(tline, repartition_distance)); page_tline.init_logical_size()?; diff --git a/pageserver/src/tenant_threads.rs b/pageserver/src/tenant_threads.rs index 4dcc15f817..b904d9040d 100644 --- a/pageserver/src/tenant_threads.rs +++ b/pageserver/src/tenant_threads.rs @@ -1,6 +1,5 @@ //! This module contains functions to serve per-tenant background processes, //! such as compaction and GC -use crate::config::PageServerConf; use crate::repository::Repository; use crate::tenant_mgr; use crate::tenant_mgr::TenantState; @@ -12,8 +11,8 @@ use utils::zid::ZTenantId; /// /// Compaction thread's main loop /// -pub fn compact_loop(tenantid: ZTenantId, conf: &'static PageServerConf) -> Result<()> { - if let Err(err) = compact_loop_ext(tenantid, conf) { +pub fn compact_loop(tenantid: ZTenantId) -> Result<()> { + if let Err(err) = compact_loop_ext(tenantid) { error!("compact loop terminated with error: {:?}", err); Err(err) } else { @@ -21,13 +20,15 @@ pub fn compact_loop(tenantid: ZTenantId, conf: &'static PageServerConf) -> Resul } } -fn compact_loop_ext(tenantid: ZTenantId, conf: &'static PageServerConf) -> Result<()> { +fn compact_loop_ext(tenantid: ZTenantId) -> Result<()> { loop { if tenant_mgr::get_tenant_state(tenantid) != Some(TenantState::Active) { break; } + let repo = tenant_mgr::get_repository_for_tenant(tenantid)?; + let compaction_period = repo.get_compaction_period(); - std::thread::sleep(conf.compaction_period); + std::thread::sleep(compaction_period); trace!("compaction thread for tenant {} waking up", tenantid); // Compact timelines @@ -46,23 +47,23 @@ fn compact_loop_ext(tenantid: ZTenantId, conf: &'static PageServerConf) -> Resul /// /// GC thread's main loop /// -pub fn gc_loop(tenantid: ZTenantId, conf: &'static PageServerConf) -> Result<()> { +pub fn gc_loop(tenantid: ZTenantId) -> Result<()> { loop 
{ if tenant_mgr::get_tenant_state(tenantid) != Some(TenantState::Active) { break; } trace!("gc thread for tenant {} waking up", tenantid); - + let repo = tenant_mgr::get_repository_for_tenant(tenantid)?; + let gc_horizon = repo.get_gc_horizon(); // Garbage collect old files that are not needed for PITR anymore - if conf.gc_horizon > 0 { - let repo = tenant_mgr::get_repository_for_tenant(tenantid)?; - repo.gc_iteration(None, conf.gc_horizon, false)?; + if gc_horizon > 0 { + repo.gc_iteration(None, gc_horizon, repo.get_pitr_interval(), false)?; } // TODO Write it in more adequate way using // condvar.wait_timeout() or something - let mut sleep_time = conf.gc_period.as_secs(); + let mut sleep_time = repo.get_gc_period().as_secs(); while sleep_time > 0 && tenant_mgr::get_tenant_state(tenantid) == Some(TenantState::Active) { sleep_time -= 1; diff --git a/pageserver/src/timelines.rs b/pageserver/src/timelines.rs index abbabc8b31..adc531e6bb 100644 --- a/pageserver/src/timelines.rs +++ b/pageserver/src/timelines.rs @@ -25,6 +25,7 @@ use crate::{ layered_repository::metadata::TimelineMetadata, remote_storage::RemoteIndex, repository::{LocalTimelineState, Repository}, + tenant_config::TenantConfOpt, DatadirTimeline, RepositoryImpl, }; use crate::{import_datadir, LOG_FILE_NAME}; @@ -151,8 +152,8 @@ pub fn init_pageserver( if let Some(tenant_id) = create_tenant { println!("initializing tenantid {}", tenant_id); - let repo = - create_repo(conf, tenant_id, CreateRepo::Dummy).context("failed to create repo")?; + let repo = create_repo(conf, Default::default(), tenant_id, CreateRepo::Dummy) + .context("failed to create repo")?; let new_timeline_id = initial_timeline_id.unwrap_or_else(ZTimelineId::generate); bootstrap_timeline(conf, tenant_id, new_timeline_id, repo.as_ref()) .context("failed to create initial timeline")?; @@ -175,6 +176,7 @@ pub enum CreateRepo { pub fn create_repo( conf: &'static PageServerConf, + tenant_conf: TenantConfOpt, tenant_id: ZTenantId, create_repo: 
CreateRepo, ) -> Result> { @@ -211,8 +213,12 @@ pub fn create_repo( crashsafe_dir::create_dir(conf.timelines_path(&tenant_id))?; info!("created directory structure in {}", repo_dir.display()); + // Save tenant's config + LayeredRepository::persist_tenant_config(conf, tenant_id, tenant_conf)?; + Ok(Arc::new(LayeredRepository::new( conf, + tenant_conf, wal_redo_manager, tenant_id, remote_index, diff --git a/pageserver/src/walreceiver.rs b/pageserver/src/walreceiver.rs index ce4e4d45fb..357aab7221 100644 --- a/pageserver/src/walreceiver.rs +++ b/pageserver/src/walreceiver.rs @@ -93,7 +93,7 @@ pub fn launch_wal_receiver( receivers.insert((tenantid, timelineid), receiver); // Update tenant state and start tenant threads, if they are not running yet. - tenant_mgr::activate_tenant(conf, tenantid)?; + tenant_mgr::activate_tenant(tenantid)?; } }; Ok(()) diff --git a/test_runner/batch_others/test_tenant_conf.py b/test_runner/batch_others/test_tenant_conf.py new file mode 100644 index 0000000000..f74e6aad1d --- /dev/null +++ b/test_runner/batch_others/test_tenant_conf.py @@ -0,0 +1,49 @@ +from contextlib import closing + +import pytest + +from fixtures.zenith_fixtures import ZenithEnvBuilder + + +def test_tenant_config(zenith_env_builder: ZenithEnvBuilder): + """Test per tenant configuration""" + env = zenith_env_builder.init_start() + tenant = env.zenith_cli.create_tenant( + conf={ + 'checkpoint_distance': '10000', + 'compaction_target_size': '1048576', + 'compaction_period': '60sec', + 'compaction_threshold': '20', + 'gc_horizon': '1024', + 'gc_period': '100sec', + 'pitr_interval': '3600sec', + }) + + env.zenith_cli.create_timeline('test_tenant_conf', tenant_id=tenant) + pg = env.postgres.create_start( + "test_tenant_conf", + "main", + tenant, + ) + + with closing(env.pageserver.connect()) as psconn: + with psconn.cursor() as pscur: + pscur.execute(f"show {tenant.hex}") + assert pscur.fetchone() == (10000, 1048576, 60, 20, 1024, 100, 3600) + + # update the config and
ensure that it has changed + env.zenith_cli.config_tenant(tenant_id=tenant, + conf={ + 'checkpoint_distance': '100000', + 'compaction_target_size': '1048576', + 'compaction_period': '30sec', + 'compaction_threshold': '15', + 'gc_horizon': '256', + 'gc_period': '10sec', + 'pitr_interval': '360sec', + }) + + with closing(env.pageserver.connect()) as psconn: + with psconn.cursor() as pscur: + pscur.execute(f"show {tenant.hex}") + assert pscur.fetchone() == (100000, 1048576, 30, 15, 256, 10, 360) diff --git a/test_runner/fixtures/zenith_fixtures.py b/test_runner/fixtures/zenith_fixtures.py index 9a2d6cdc88..d295a79953 100644 --- a/test_runner/fixtures/zenith_fixtures.py +++ b/test_runner/fixtures/zenith_fixtures.py @@ -835,16 +835,35 @@ class ZenithCli: self.env = env pass - def create_tenant(self, tenant_id: Optional[uuid.UUID] = None) -> uuid.UUID: + def create_tenant(self, + tenant_id: Optional[uuid.UUID] = None, + conf: Optional[Dict[str, str]] = None) -> uuid.UUID: """ Creates a new tenant, returns its id and its initial timeline's id. """ if tenant_id is None: tenant_id = uuid.uuid4() - res = self.raw_cli(['tenant', 'create', '--tenant-id', tenant_id.hex]) + if conf is None: + res = self.raw_cli(['tenant', 'create', '--tenant-id', tenant_id.hex]) + else: + res = self.raw_cli( + ['tenant', 'create', '--tenant-id', tenant_id.hex] + + sum(list(map(lambda kv: (['-c', kv[0] + ':' + kv[1]]), conf.items())), [])) res.check_returncode() return tenant_id + def config_tenant(self, tenant_id: uuid.UUID, conf: Dict[str, str]): + """ + Update tenant config. 
+ """ + if conf is None: + res = self.raw_cli(['tenant', 'config', '--tenant-id', tenant_id.hex]) + else: + res = self.raw_cli( + ['tenant', 'config', '--tenant-id', tenant_id.hex] + + sum(list(map(lambda kv: (['-c', kv[0] + ':' + kv[1]]), conf.items())), [])) + res.check_returncode() + def list_tenants(self) -> 'subprocess.CompletedProcess[str]': res = self.raw_cli(['tenant', 'list']) res.check_returncode() diff --git a/zenith/src/main.rs b/zenith/src/main.rs index afbbbe395b..cd0cf470e8 100644 --- a/zenith/src/main.rs +++ b/zenith/src/main.rs @@ -166,7 +166,12 @@ fn main() -> Result<()> { .subcommand(App::new("create") .arg(tenant_id_arg.clone()) .arg(timeline_id_arg.clone().help("Use a specific timeline id when creating a tenant and its initial timeline")) - ) + .arg(Arg::new("config").short('c').takes_value(true).multiple_occurrences(true).required(false)) + ) + .subcommand(App::new("config") + .arg(tenant_id_arg.clone()) + .arg(Arg::new("config").short('c').takes_value(true).multiple_occurrences(true).required(false)) + ) ) .subcommand( App::new("pageserver") @@ -523,8 +528,12 @@ fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> Re } Some(("create", create_match)) => { let initial_tenant_id = parse_tenant_id(create_match)?; + let tenant_conf: HashMap<_, _> = create_match + .values_of("config") + .map(|vals| vals.flat_map(|c| c.split_once(':')).collect()) + .unwrap_or_default(); let new_tenant_id = pageserver - .tenant_create(initial_tenant_id)? + .tenant_create(initial_tenant_id, tenant_conf)? 
.ok_or_else(|| { anyhow!("Tenant with id {:?} was already created", initial_tenant_id) })?; @@ -533,6 +542,27 @@ fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> Re new_tenant_id ); } + Some(("config", create_match)) => { + let tenant_id = get_tenant_id(create_match, env)?; + let tenant_conf: HashMap<_, _> = create_match + .values_of("config") + .map(|vals| vals.flat_map(|c| c.split_once(':')).collect()) + .unwrap_or_default(); + + pageserver + .tenant_config(tenant_id, tenant_conf) + .map_err(|e| { + anyhow!( + "Tenant config failed for tenant with id {} : {}", + tenant_id, + e + ) + })?; + println!( + "tenant {} successfully configured on the pageserver", + tenant_id + ); + } Some((sub_name, _)) => bail!("Unexpected tenant subcommand '{}'", sub_name), None => bail!("no tenant subcommand provided"), } From d3f356e7a81464b3dcf5a5076d7bb8ef4ca30ff6 Mon Sep 17 00:00:00 2001 From: Dmitry Ivanov Date: Fri, 22 Apr 2022 17:31:58 +0300 Subject: [PATCH 137/296] Update `rust-postgres` project-wide (#1525) * Update `rust-postgres` project-wide This commit points to https://github.com/neondatabase/rust-postgres/commits/neon in order to test our patches on top of the latest version of this crate.
* [proxy] Update `hmac` and `sha2` --- Cargo.lock | 196 ++++++++++++++++++++++++++++++--------- Cargo.toml | 2 +- compute_tools/Cargo.toml | 4 +- control_plane/Cargo.toml | 2 +- libs/utils/Cargo.toml | 4 +- pageserver/Cargo.toml | 8 +- proxy/Cargo.toml | 8 +- proxy/src/scram.rs | 4 +- safekeeper/Cargo.toml | 6 +- zenith/Cargo.toml | 2 +- 10 files changed, 170 insertions(+), 66 deletions(-) diff --git a/Cargo.lock b/Cargo.lock index 3ca3671207..978cd20d12 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -181,6 +181,15 @@ dependencies = [ "generic-array", ] +[[package]] +name = "block-buffer" +version = "0.10.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0bf7fe51849ea569fd452f37822f606a5cabb684dc918707a0193fd4664ff324" +dependencies = [ + "generic-array", +] + [[package]] name = "boxfnonce" version = "0.1.1" @@ -518,13 +527,13 @@ dependencies = [ ] [[package]] -name = "crypto-mac" -version = "0.10.1" +name = "crypto-common" +version = "0.1.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bff07008ec701e8028e2ceb8f83f0e4274ee62bd2dbdc4fefff2e9a91824081a" +checksum = "57952ca27b5e3606ff4dd79b0020231aaf9d6aa76dc05fd30137538c50bd3ce8" dependencies = [ "generic-array", - "subtle", + "typenum", ] [[package]] @@ -622,6 +631,17 @@ dependencies = [ "generic-array", ] +[[package]] +name = "digest" +version = "0.10.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f2fb860ca6fafa5552fb6d0e816a69c8e49f0908bf524e30a90d97c85892d506" +dependencies = [ + "block-buffer 0.10.2", + "crypto-common", + "subtle", +] + [[package]] name = "dirs-next" version = "2.0.0" @@ -994,24 +1014,23 @@ version = "0.3.4" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "7ebdb29d2ea9ed0083cd8cece49bbd968021bd99b0849edb4a9a7ee0fdf6a4e0" -[[package]] -name = "hmac" -version = "0.10.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"c1441c6b1e930e2817404b5046f1f989899143a12bf92de603b69f4e0aee1e15" -dependencies = [ - "crypto-mac 0.10.1", - "digest", -] - [[package]] name = "hmac" version = "0.11.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "2a2a2320eb7ec0ebe8da8f744d7812d9fc4cb4d09344ac01898dbcb6a20ae69b" dependencies = [ - "crypto-mac 0.11.1", - "digest", + "crypto-mac", + "digest 0.9.0", +] + +[[package]] +name = "hmac" +version = "0.12.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6c49c37c09c17a53d937dfbb742eb3a961d65a994e6bcdcf37e7399d0cc8ab5e" +dependencies = [ + "digest 0.10.3", ] [[package]] @@ -1297,11 +1316,20 @@ version = "0.9.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "7b5a279bb9607f9f53c22d496eade00d138d1bdcccd07d74650387cf94942a15" dependencies = [ - "block-buffer", - "digest", + "block-buffer 0.9.0", + "digest 0.9.0", "opaque-debug", ] +[[package]] +name = "md-5" +version = "0.10.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "658646b21e0b72f7866c7038ab086d3d5e1cd6271f060fd37defb241949d0582" +dependencies = [ + "digest 0.10.3", +] + [[package]] name = "md5" version = "0.7.0" @@ -1640,7 +1668,17 @@ checksum = "7d17b78036a60663b797adeaee46f5c9dfebb86948d1255007a1d6be0271ff99" dependencies = [ "instant", "lock_api", - "parking_lot_core", + "parking_lot_core 0.8.5", +] + +[[package]] +name = "parking_lot" +version = "0.12.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "87f5ec2493a61ac0506c0f4199f99070cbe83857b0337006a30f3e6719b8ef58" +dependencies = [ + "lock_api", + "parking_lot_core 0.9.2", ] [[package]] @@ -1657,6 +1695,19 @@ dependencies = [ "winapi", ] +[[package]] +name = "parking_lot_core" +version = "0.9.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "995f667a6c822200b0433ac218e05582f0e2efa1b922a3fd2fbaadc5f87bab37" +dependencies = [ + "cfg-if", + "libc", + 
"redox_syscall", + "smallvec", + "windows-sys", +] + [[package]] name = "peeking_take_while" version = "0.1.2" @@ -1690,18 +1741,18 @@ dependencies = [ [[package]] name = "phf" -version = "0.8.0" +version = "0.10.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3dfb61232e34fcb633f43d12c58f83c1df82962dcdfa565a4e866ffc17dafe12" +checksum = "fabbf1ead8a5bcbc20f5f8b939ee3f5b0f6f281b6ad3468b84656b658b455259" dependencies = [ "phf_shared", ] [[package]] name = "phf_shared" -version = "0.8.0" +version = "0.10.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "c00cf8b9eafe68dde5e9eaa2cef8ee84a9336a47d566ec55ca16589633b65af7" +checksum = "b6796ad771acdc0123d2a88dc428b5e38ef24456743ddb1744ed628f9815c096" dependencies = [ "siphasher", ] @@ -1774,40 +1825,39 @@ dependencies = [ [[package]] name = "postgres" -version = "0.19.1" -source = "git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7#2949d98df52587d562986aad155dd4e889e408b7" +version = "0.19.2" +source = "git+https://github.com/zenithdb/rust-postgres.git?rev=d052ee8b86fff9897c77b0fe89ea9daba0e1fa38#d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" dependencies = [ "bytes", "fallible-iterator", "futures", "log", - "postgres-protocol", "tokio", "tokio-postgres", ] [[package]] name = "postgres-protocol" -version = "0.6.1" -source = "git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7#2949d98df52587d562986aad155dd4e889e408b7" +version = "0.6.4" +source = "git+https://github.com/zenithdb/rust-postgres.git?rev=d052ee8b86fff9897c77b0fe89ea9daba0e1fa38#d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" dependencies = [ "base64", "byteorder", "bytes", "fallible-iterator", - "hmac 0.10.1", + "hmac 0.12.1", "lazy_static", - "md-5", + "md-5 0.10.1", "memchr", "rand", - "sha2", + "sha2 0.10.2", "stringprep", ] [[package]] name = "postgres-types" -version = "0.2.1" -source = 
"git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7#2949d98df52587d562986aad155dd4e889e408b7" +version = "0.2.3" +source = "git+https://github.com/zenithdb/rust-postgres.git?rev=d052ee8b86fff9897c77b0fe89ea9daba0e1fa38#d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" dependencies = [ "bytes", "fallible-iterator", @@ -1886,7 +1936,7 @@ dependencies = [ "fnv", "lazy_static", "memchr", - "parking_lot", + "parking_lot 0.11.2", "thiserror", ] @@ -1956,12 +2006,12 @@ dependencies = [ "futures", "hashbrown", "hex", - "hmac 0.10.1", + "hmac 0.12.1", "hyper", "lazy_static", "md5", "metrics", - "parking_lot", + "parking_lot 0.12.0", "pin-project-lite", "rand", "rcgen", @@ -1973,7 +2023,7 @@ dependencies = [ "scopeguard", "serde", "serde_json", - "sha2", + "sha2 0.10.2", "socket2", "thiserror", "tokio", @@ -2295,20 +2345,20 @@ dependencies = [ "base64", "bytes", "chrono", - "digest", + "digest 0.9.0", "futures", "hex", "hmac 0.11.0", "http", "hyper", "log", - "md-5", + "md-5 0.9.1", "percent-encoding", "pin-project-lite", "rusoto_credential", "rustc_version", "serde", - "sha2", + "sha2 0.9.9", "tokio", ] @@ -2560,13 +2610,24 @@ version = "0.9.9" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "4d58a1e1bf39749807d89cf2d98ac2dfa0ff1cb3faa38fbb64dd88ac8013d800" dependencies = [ - "block-buffer", + "block-buffer 0.9.0", "cfg-if", "cpufeatures", - "digest", + "digest 0.9.0", "opaque-debug", ] +[[package]] +name = "sha2" +version = "0.10.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "55deaec60f81eefe3cce0dc50bda92d6d8e88f2a27df7c5033b42afeb1ed2676" +dependencies = [ + "cfg-if", + "cpufeatures", + "digest 0.10.3", +] + [[package]] name = "sharded-slab" version = "0.1.4" @@ -2906,8 +2967,8 @@ dependencies = [ [[package]] name = "tokio-postgres" -version = "0.7.1" -source = 
"git+https://github.com/zenithdb/rust-postgres.git?rev=2949d98df52587d562986aad155dd4e889e408b7#2949d98df52587d562986aad155dd4e889e408b7" +version = "0.7.6" +source = "git+https://github.com/zenithdb/rust-postgres.git?rev=d052ee8b86fff9897c77b0fe89ea9daba0e1fa38#d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" dependencies = [ "async-trait", "byteorder", @@ -2915,7 +2976,7 @@ dependencies = [ "fallible-iterator", "futures", "log", - "parking_lot", + "parking_lot 0.12.0", "percent-encoding", "phf", "pin-project-lite", @@ -2923,7 +2984,7 @@ dependencies = [ "postgres-types", "socket2", "tokio", - "tokio-util 0.6.9", + "tokio-util 0.7.0", ] [[package]] @@ -3460,6 +3521,49 @@ version = "0.4.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "712e227841d057c1ee1cd2fb22fa7e5a5461ae8e48fa2ca79ec42cfc1931183f" +[[package]] +name = "windows-sys" +version = "0.34.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5acdd78cb4ba54c0045ac14f62d8f94a03d10047904ae2a40afa1e99d8f70825" +dependencies = [ + "windows_aarch64_msvc", + "windows_i686_gnu", + "windows_i686_msvc", + "windows_x86_64_gnu", + "windows_x86_64_msvc", +] + +[[package]] +name = "windows_aarch64_msvc" +version = "0.34.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "17cffbe740121affb56fad0fc0e421804adf0ae00891205213b5cecd30db881d" + +[[package]] +name = "windows_i686_gnu" +version = "0.34.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2564fde759adb79129d9b4f54be42b32c89970c18ebf93124ca8870a498688ed" + +[[package]] +name = "windows_i686_msvc" +version = "0.34.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9cd9d32ba70453522332c14d38814bceeb747d80b3958676007acadd7e166956" + +[[package]] +name = "windows_x86_64_gnu" +version = "0.34.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"cfce6deae227ee8d356d19effc141a509cc503dfd1f850622ec4b0f84428e1f4" + +[[package]] +name = "windows_x86_64_msvc" +version = "0.34.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d19538ccc21819d01deaf88d6a17eae6596a12e9aafdbb97916fb49896d89de9" + [[package]] name = "winreg" version = "0.7.0" diff --git a/Cargo.toml b/Cargo.toml index 35c18ba237..3838637d37 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -18,4 +18,4 @@ debug = true # This is only needed for proxy's tests. # TODO: we should probably fork `tokio-postgres-rustls` instead. [patch.crates-io] -tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } +tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" } diff --git a/compute_tools/Cargo.toml b/compute_tools/Cargo.toml index 856ec45c73..42db763961 100644 --- a/compute_tools/Cargo.toml +++ b/compute_tools/Cargo.toml @@ -11,11 +11,11 @@ clap = "3.0" env_logger = "0.9" hyper = { version = "0.14", features = ["full"] } log = { version = "0.4", features = ["std", "serde"] } -postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } +postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" } regex = "1" serde = { version = "1.0", features = ["derive"] } serde_json = "1" tar = "0.4" tokio = { version = "1.17", features = ["macros", "rt", "rt-multi-thread"] } -tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } +tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" } workspace_hack = { version = "0.1", path = "../workspace_hack" } diff --git a/control_plane/Cargo.toml b/control_plane/Cargo.toml index 33d01f7556..41417aab9a 100644 --- 
a/control_plane/Cargo.toml +++ b/control_plane/Cargo.toml @@ -5,7 +5,7 @@ edition = "2021" [dependencies] tar = "0.4.33" -postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } +postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" } serde = { version = "1.0", features = ["derive"] } serde_with = "1.12.0" toml = "0.5" diff --git a/libs/utils/Cargo.toml b/libs/utils/Cargo.toml index 35eb443809..d83b02d7ae 100644 --- a/libs/utils/Cargo.toml +++ b/libs/utils/Cargo.toml @@ -10,8 +10,8 @@ bytes = "1.0.1" hyper = { version = "0.14.7", features = ["full"] } lazy_static = "1.4.0" pin-project-lite = "0.2.7" -postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } -postgres-protocol = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } +postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" } +postgres-protocol = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" } routerify = "3" serde = { version = "1.0", features = ["derive"] } serde_json = "1" diff --git a/pageserver/Cargo.toml b/pageserver/Cargo.toml index eb58b90ad9..6648d8417a 100644 --- a/pageserver/Cargo.toml +++ b/pageserver/Cargo.toml @@ -22,10 +22,10 @@ clap = "3.0" daemonize = "0.4.1" tokio = { version = "1.17", features = ["process", "sync", "macros", "fs", "rt", "io-util", "time"] } tokio-util = { version = "0.7", features = ["io"] } -postgres-types = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } -postgres-protocol = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } -postgres = { git = "https://github.com/zenithdb/rust-postgres.git", 
rev="2949d98df52587d562986aad155dd4e889e408b7" } -tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } +postgres-types = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" } +postgres-protocol = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" } +postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" } +tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" } tokio-stream = "0.1.8" anyhow = { version = "1.0", features = ["backtrace"] } crc32c = "0.6.0" diff --git a/proxy/Cargo.toml b/proxy/Cargo.toml index 81086a0cad..25aebc03e8 100644 --- a/proxy/Cargo.toml +++ b/proxy/Cargo.toml @@ -12,11 +12,11 @@ fail = "0.5.0" futures = "0.3.13" hashbrown = "0.11.2" hex = "0.4.3" -hmac = "0.10.1" +hmac = "0.12.1" hyper = "0.14" lazy_static = "1.4.0" md5 = "0.7.0" -parking_lot = "0.11.2" +parking_lot = "0.12" pin-project-lite = "0.2.7" rand = "0.8.3" reqwest = { version = "0.11", default-features = false, features = ["blocking", "json", "rustls-tls"] } @@ -26,11 +26,11 @@ rustls-pemfile = "0.2.1" scopeguard = "1.1.0" serde = "1" serde_json = "1" -sha2 = "0.9.8" +sha2 = "0.10.2" socket2 = "0.4.4" thiserror = "1.0.30" tokio = { version = "1.17", features = ["macros"] } -tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } +tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" } tokio-rustls = "0.23.0" utils = { path = "../libs/utils" } diff --git a/proxy/src/scram.rs b/proxy/src/scram.rs index f007f3e0b6..44671084ee 100644 --- a/proxy/src/scram.rs +++ b/proxy/src/scram.rs @@ -18,7 +18,7 @@ pub use secret::*; pub use exchange::Exchange; pub use 
secret::ServerSecret; -use hmac::{Hmac, Mac, NewMac}; +use hmac::{Hmac, Mac}; use sha2::{Digest, Sha256}; // TODO: add SCRAM-SHA-256-PLUS @@ -40,7 +40,7 @@ fn base64_decode_array(input: impl AsRef<[u8]>) -> Option<[u8; N /// This function essentially is `Hmac(sha256, key, input)`. /// Further reading: . fn hmac_sha256<'a>(key: &[u8], parts: impl IntoIterator) -> [u8; 32] { - let mut mac = Hmac::::new_varkey(key).expect("bad key size"); + let mut mac = Hmac::::new_from_slice(key).expect("bad key size"); parts.into_iter().for_each(|s| mac.update(s)); // TODO: maybe newer `hmac` et al already migrated to regular arrays? diff --git a/safekeeper/Cargo.toml b/safekeeper/Cargo.toml index 76d40cdc2e..8a31311b8f 100644 --- a/safekeeper/Cargo.toml +++ b/safekeeper/Cargo.toml @@ -15,8 +15,8 @@ tracing = "0.1.27" clap = "3.0" daemonize = "0.4.1" tokio = { version = "1.17", features = ["macros", "fs"] } -postgres-protocol = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } -postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } +postgres-protocol = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" } +postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" } anyhow = "1.0" crc32c = "0.6.0" humantime = "2.1.0" @@ -27,7 +27,7 @@ serde = { version = "1.0", features = ["derive"] } serde_with = {version = "1.12.0"} hex = "0.4.3" const_format = "0.2.21" -tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } +tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" } etcd-client = "0.8.3" tokio-util = { version = "0.7", features = ["io"] } rusoto_core = "0.47" diff --git a/zenith/Cargo.toml b/zenith/Cargo.toml index 
9692e97331..0f72051f74 100644 --- a/zenith/Cargo.toml +++ b/zenith/Cargo.toml @@ -7,7 +7,7 @@ edition = "2021" clap = "3.0" anyhow = "1.0" serde_json = "1" -postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="2949d98df52587d562986aad155dd4e889e408b7" } +postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" } # FIXME: 'pageserver' is needed for BranchInfo. Refactor pageserver = { path = "../pageserver" } From 867aede71516756ff0ec1dba540fe7fc23bb7113 Mon Sep 17 00:00:00 2001 From: bojanserafimov Date: Fri, 22 Apr 2022 10:45:47 -0400 Subject: [PATCH 138/296] Add idle compute restart time test (#1514) --- test_runner/performance/test_startup.py | 48 +++++++++++++++++++++++++ 1 file changed, 48 insertions(+) create mode 100644 test_runner/performance/test_startup.py diff --git a/test_runner/performance/test_startup.py b/test_runner/performance/test_startup.py new file mode 100644 index 0000000000..e30912ce32 --- /dev/null +++ b/test_runner/performance/test_startup.py @@ -0,0 +1,48 @@ +from contextlib import closing + +from fixtures.zenith_fixtures import ZenithEnvBuilder +from fixtures.benchmark_fixture import ZenithBenchmarker + + +def test_startup(zenith_env_builder: ZenithEnvBuilder, zenbenchmark: ZenithBenchmarker): + zenith_env_builder.num_safekeepers = 3 + env = zenith_env_builder.init_start() + + # Start + env.zenith_cli.create_branch('test_startup') + with zenbenchmark.record_duration("startup_time"): + pg = env.postgres.create_start('test_startup') + pg.safe_psql("select 1;") + + # Restart + pg.stop_and_destroy() + with zenbenchmark.record_duration("restart_time"): + pg.create_start('test_startup') + pg.safe_psql("select 1;") + + # Fill up + num_rows = 1000000 # 30 MB + num_tables = 100 + with closing(pg.connect()) as conn: + with conn.cursor() as cur: + for i in range(num_tables): + cur.execute(f'create table t_{i} (i integer);') + cur.execute(f'insert into t_{i} values 
(generate_series(1,{num_rows}));') + + # Read + with zenbenchmark.record_duration("read_time"): + pg.safe_psql("select * from t_0;") + + # Read again + with zenbenchmark.record_duration("second_read_time"): + pg.safe_psql("select * from t_0;") + + # Restart + pg.stop_and_destroy() + with zenbenchmark.record_duration("restart_with_data"): + pg.create_start('test_startup') + pg.safe_psql("select 1;") + + # Read + with zenbenchmark.record_duration("read_after_restart"): + pg.safe_psql("select * from t_0;") From 1fb3d081854a31f9afd1f4e5161fa4cbf9738299 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Fri, 22 Apr 2022 21:31:27 +0300 Subject: [PATCH 139/296] Use a 1-byte length header for short blobs. Notably, this shaves 3 bytes from each small WAL record stored in ephemeral or delta layers. --- pageserver/src/layered_repository/blob_io.rs | 72 ++++++++++++++----- .../src/layered_repository/ephemeral_file.rs | 42 +++++++---- 2 files changed, 83 insertions(+), 31 deletions(-) diff --git a/pageserver/src/layered_repository/blob_io.rs b/pageserver/src/layered_repository/blob_io.rs index aa90bbd0cf..3aeeb2b2c8 100644 --- a/pageserver/src/layered_repository/blob_io.rs +++ b/pageserver/src/layered_repository/blob_io.rs @@ -1,12 +1,20 @@ //! //! Functions for reading and writing variable-sized "blobs". //! -//! Each blob begins with a 4-byte length, followed by the actual data. +//! Each blob begins with a 1- or 4-byte length field, followed by the +//! actual data. If the length is smaller than 128 bytes, the length +//! is written as a one byte. If it's larger than that, the length +//! is written as a four-byte integer, in big-endian, with the high +//! bit set. This way, we can detect whether it's 1- or 4-byte header +//! by peeking at the first byte. +//! +//! len < 128: 0XXXXXXX +//! len >= 128: 1XXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX //! 
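The header layout described in the doc comment above can be sketched in isolation. This is a minimal illustration, not the code from the patch: the helper names `encode_len` and `decode_len` are hypothetical, the error type is simplified to `String`, and the sketch assumes the whole header is available in one contiguous slice (the real reader also handles a 4-byte header split across two pages).

```rust
/// Encode a blob length as the 1- or 4-byte header described above.
/// (Hypothetical helper for illustration; not part of the patch.)
fn encode_len(len: usize) -> Result<Vec<u8>, String> {
    if len < 0x80 {
        // Short blob: a single byte with the high bit clear.
        Ok(vec![len as u8])
    } else if len > 0x7fff_ffff {
        // The high bit of the first byte is reserved as the "4-byte header"
        // flag, so only 31 bits are left for the length itself.
        Err(format!("blob too large ({} bytes)", len))
    } else {
        // Long blob: big-endian u32 with the high bit of the first byte set.
        let mut buf = (len as u32).to_be_bytes();
        buf[0] |= 0x80;
        Ok(buf.to_vec())
    }
}

/// Decode a length header from the start of `buf`.
/// Returns `(blob_len, header_size)`, or `None` if `buf` is too short.
/// (Hypothetical helper for illustration; not part of the patch.)
fn decode_len(buf: &[u8]) -> Option<(usize, usize)> {
    // Peek at the first byte to tell the two header forms apart.
    let first = *buf.first()?;
    if first < 0x80 {
        Some((first as usize, 1))
    } else {
        let mut len_buf = [0u8; 4];
        len_buf.copy_from_slice(buf.get(..4)?);
        // Clear the flag bit before interpreting the bytes as a length.
        len_buf[0] &= 0x7f;
        Some((u32::from_be_bytes(len_buf) as usize, 4))
    }
}
```

Big-endian order is what makes the peek work: the flag bit has to live in the first byte on disk, which is the most significant byte only in big-endian layout. This is presumably why the patch switches from `to_ne_bytes` to `to_be_bytes`.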
use crate::layered_repository::block_io::{BlockCursor, BlockReader}; use crate::page_cache::PAGE_SZ; use std::cmp::min; -use std::io::Error; +use std::io::{Error, ErrorKind}; /// For reading pub trait BlobCursor { @@ -40,21 +48,30 @@ where let mut buf = self.read_blk(blknum)?; - // read length - let mut len_buf = [0u8; 4]; - let thislen = PAGE_SZ - off; - if thislen < 4 { - // it is split across two pages - len_buf[..thislen].copy_from_slice(&buf[off..PAGE_SZ]); - blknum += 1; - buf = self.read_blk(blknum)?; - len_buf[thislen..].copy_from_slice(&buf[0..4 - thislen]); - off = 4 - thislen; + // peek at the first byte, to determine if it's a 1- or 4-byte length + let first_len_byte = buf[off]; + let len: usize = if first_len_byte < 0x80 { + // 1-byte length header + off += 1; + first_len_byte as usize } else { - len_buf.copy_from_slice(&buf[off..off + 4]); - off += 4; - } - let len = u32::from_ne_bytes(len_buf) as usize; + // 4-byte length header + let mut len_buf = [0u8; 4]; + let thislen = PAGE_SZ - off; + if thislen < 4 { + // it is split across two pages + len_buf[..thislen].copy_from_slice(&buf[off..PAGE_SZ]); + blknum += 1; + buf = self.read_blk(blknum)?; + len_buf[thislen..].copy_from_slice(&buf[0..4 - thislen]); + off = 4 - thislen; + } else { + len_buf.copy_from_slice(&buf[off..off + 4]); + off += 4; + } + len_buf[0] &= 0x7f; + u32::from_be_bytes(len_buf) as usize + }; dstbuf.clear(); @@ -130,10 +147,27 @@ where { fn write_blob(&mut self, srcbuf: &[u8]) -> Result { let offset = self.offset; - self.inner - .write_all(&((srcbuf.len()) as u32).to_ne_bytes())?; + + if srcbuf.len() < 128 { + // Short blob. 
Write a 1-byte length header + let len_buf = srcbuf.len() as u8; + self.inner.write_all(&[len_buf])?; + self.offset += 1; + } else { + // Write a 4-byte length header + if srcbuf.len() > 0x7fff_ffff { + return Err(Error::new( + ErrorKind::Other, + format!("blob too large ({} bytes)", srcbuf.len()), + )); + } + let mut len_buf = ((srcbuf.len()) as u32).to_be_bytes(); + len_buf[0] |= 0x80; + self.inner.write_all(&len_buf)?; + self.offset += 4; + } self.inner.write_all(srcbuf)?; - self.offset += 4 + srcbuf.len() as u64; + self.offset += srcbuf.len() as u64; Ok(offset) } } diff --git a/pageserver/src/layered_repository/ephemeral_file.rs b/pageserver/src/layered_repository/ephemeral_file.rs index 9537d3939c..cdde9d5d13 100644 --- a/pageserver/src/layered_repository/ephemeral_file.rs +++ b/pageserver/src/layered_repository/ephemeral_file.rs @@ -199,18 +199,24 @@ impl BlobWriter for EphemeralFile { let mut buf = self.get_buf_for_write(blknum)?; // Write the length field - let len_buf = u32::to_ne_bytes(srcbuf.len() as u32); - let thislen = PAGE_SZ - off; - if thislen < 4 { - // it needs to be split across pages - buf[off..(off + thislen)].copy_from_slice(&len_buf[..thislen]); - blknum += 1; - buf = self.get_buf_for_write(blknum)?; - buf[0..4 - thislen].copy_from_slice(&len_buf[thislen..]); - off = 4 - thislen; + if srcbuf.len() < 0x80 { + buf[off] = srcbuf.len() as u8; + off += 1; } else { - buf[off..off + 4].copy_from_slice(&len_buf); - off += 4; + let mut len_buf = u32::to_be_bytes(srcbuf.len() as u32); + len_buf[0] |= 0x80; + let thislen = PAGE_SZ - off; + if thislen < 4 { + // it needs to be split across pages + buf[off..(off + thislen)].copy_from_slice(&len_buf[..thislen]); + blknum += 1; + buf = self.get_buf_for_write(blknum)?; + buf[0..4 - thislen].copy_from_slice(&len_buf[thislen..]); + off = 4 - thislen; + } else { + buf[off..off + 4].copy_from_slice(&len_buf); + off += 4; + } } // Write the payload @@ -229,7 +235,13 @@ impl BlobWriter for EphemeralFile { 
buf_remain = &buf_remain[this_blk_len..]; } drop(buf); - self.size += 4 + srcbuf.len() as u64; + + if srcbuf.len() < 0x80 { + self.size += 1; + } else { + self.size += 4; + } + self.size += srcbuf.len() as u64; Ok(pos) } @@ -387,6 +399,12 @@ mod tests { let pos = file.write_blob(&data)?; blobs.push((pos, data)); } + // also test with a large blobs + for i in 0..100 { + let data = format!("blob{}", i).as_bytes().repeat(100); + let pos = file.write_blob(&data)?; + blobs.push((pos, data)); + } let mut cursor = BlockCursor::new(&file); for (pos, expected) in blobs { From 56f6269a8e86f110a4eb78d2019279e1a6cca2f2 Mon Sep 17 00:00:00 2001 From: Andrey Taranik Date: Mon, 25 Apr 2022 11:34:51 +0300 Subject: [PATCH 140/296] rename docker images to neondatabase docker account (#1570) * rename docker images to neondatabase docker account * docker images build fix (permisions for Cargo.lock) --- .circleci/ansible/deploy.yaml | 10 +-- .circleci/ansible/get_binaries.sh | 32 +++---- .circleci/config.yml | 94 ++++++++++----------- .circleci/helm-values/production.proxy.yaml | 3 + .circleci/helm-values/staging.proxy.yaml | 3 + Dockerfile | 4 +- Dockerfile.compute-tools | 4 +- 7 files changed, 80 insertions(+), 70 deletions(-) diff --git a/.circleci/ansible/deploy.yaml b/.circleci/ansible/deploy.yaml index 508843812a..a8154ba3b0 100644 --- a/.circleci/ansible/deploy.yaml +++ b/.circleci/ansible/deploy.yaml @@ -1,14 +1,14 @@ -- name: Upload Zenith binaries +- name: Upload Neon binaries hosts: storage gather_facts: False remote_user: admin tasks: - - name: get latest version of Zenith binaries + - name: get latest version of Neon binaries register: current_version_file set_fact: - current_version: "{{ lookup('file', '.zenith_current_version') | trim }}" + current_version: "{{ lookup('file', '.neon_current_version') | trim }}" tags: - pageserver - safekeeper @@ -19,11 +19,11 @@ - pageserver - safekeeper - - name: upload and extract Zenith binaries to /usr/local + - name: upload and 
extract Neon binaries to /usr/local ansible.builtin.unarchive: owner: root group: root - src: zenith_install.tar.gz + src: neon_install.tar.gz dest: /usr/local become: true tags: diff --git a/.circleci/ansible/get_binaries.sh b/.circleci/ansible/get_binaries.sh index 242a9e87e2..a4b4372d9f 100755 --- a/.circleci/ansible/get_binaries.sh +++ b/.circleci/ansible/get_binaries.sh @@ -4,10 +4,10 @@ set -e RELEASE=${RELEASE:-false} -# look at docker hub for latest tag fo zenith docker image +# look at docker hub for latest tag for neon docker image if [ "${RELEASE}" = "true" ]; then echo "search latest relase tag" - VERSION=$(curl -s https://registry.hub.docker.com/v1/repositories/zenithdb/zenith/tags |jq -r -S '.[].name' | grep release | sed 's/release-//g' | tail -1) + VERSION=$(curl -s https://registry.hub.docker.com/v1/repositories/neondatabase/neon/tags |jq -r -S '.[].name' | grep release | sed 's/release-//g' | tail -1) if [ -z "${VERSION}" ]; then echo "no any docker tags found, exiting..." exit 1 @@ -16,7 +16,7 @@ if [ "${RELEASE}" = "true" ]; then fi else echo "search latest dev tag" - VERSION=$(curl -s https://registry.hub.docker.com/v1/repositories/zenithdb/zenith/tags |jq -r -S '.[].name' | grep -v release | tail -1) + VERSION=$(curl -s https://registry.hub.docker.com/v1/repositories/neondatabase/neon/tags |jq -r -S '.[].name' | grep -v release | tail -1) if [ -z "${VERSION}" ]; then echo "no any docker tags found, exiting..." 
exit 1 @@ -28,25 +28,25 @@ fi echo "found ${VERSION}" # do initial cleanup -rm -rf zenith_install postgres_install.tar.gz zenith_install.tar.gz .zenith_current_version -mkdir zenith_install +rm -rf neon_install postgres_install.tar.gz neon_install.tar.gz .neon_current_version +mkdir neon_install # retrive binaries from docker image echo "getting binaries from docker image" -docker pull --quiet zenithdb/zenith:${TAG} -ID=$(docker create zenithdb/zenith:${TAG}) +docker pull --quiet neondatabase/neon:${TAG} +ID=$(docker create neondatabase/neon:${TAG}) docker cp ${ID}:/data/postgres_install.tar.gz . -tar -xzf postgres_install.tar.gz -C zenith_install -docker cp ${ID}:/usr/local/bin/pageserver zenith_install/bin/ -docker cp ${ID}:/usr/local/bin/safekeeper zenith_install/bin/ -docker cp ${ID}:/usr/local/bin/proxy zenith_install/bin/ -docker cp ${ID}:/usr/local/bin/postgres zenith_install/bin/ +tar -xzf postgres_install.tar.gz -C neon_install +docker cp ${ID}:/usr/local/bin/pageserver neon_install/bin/ +docker cp ${ID}:/usr/local/bin/safekeeper neon_install/bin/ +docker cp ${ID}:/usr/local/bin/proxy neon_install/bin/ +docker cp ${ID}:/usr/local/bin/postgres neon_install/bin/ docker rm -vf ${ID} # store version to file (for ansible playbooks) and create binaries tarball -echo ${VERSION} > zenith_install/.zenith_current_version -echo ${VERSION} > .zenith_current_version -tar -czf zenith_install.tar.gz -C zenith_install . +echo ${VERSION} > neon_install/.neon_current_version +echo ${VERSION} > .neon_current_version +tar -czf neon_install.tar.gz -C neon_install . 
# do final cleaup -rm -rf zenith_install postgres_install.tar.gz +rm -rf neon_install postgres_install.tar.gz diff --git a/.circleci/config.yml b/.circleci/config.yml index 643c853854..471d64a82f 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -1,18 +1,18 @@ version: 2.1 executors: - zenith-xlarge-executor: + neon-xlarge-executor: resource_class: xlarge docker: # NB: when changed, do not forget to update rust image tag in all Dockerfiles - image: zimg/rust:1.58 - zenith-executor: + neon-executor: docker: - image: zimg/rust:1.58 jobs: check-codestyle-rust: - executor: zenith-xlarge-executor + executor: neon-xlarge-executor steps: - checkout - run: @@ -22,7 +22,7 @@ jobs: # A job to build postgres build-postgres: - executor: zenith-xlarge-executor + executor: neon-xlarge-executor parameters: build_type: type: enum @@ -67,9 +67,9 @@ jobs: paths: - tmp_install - # A job to build zenith rust code - build-zenith: - executor: zenith-xlarge-executor + # A job to build Neon rust code + build-neon: + executor: neon-xlarge-executor parameters: build_type: type: enum @@ -223,7 +223,7 @@ jobs: - "*" check-codestyle-python: - executor: zenith-executor + executor: neon-executor steps: - checkout - restore_cache: @@ -246,7 +246,7 @@ jobs: command: poetry run mypy . run-pytest: - executor: zenith-executor + executor: neon-executor parameters: # pytest args to specify the tests to run. 
# @@ -390,7 +390,7 @@ jobs: - "*" coverage-report: - executor: zenith-xlarge-executor + executor: neon-xlarge-executor steps: - attach_workspace: at: /tmp/zenith @@ -420,7 +420,7 @@ jobs: COMMIT_URL=https://github.com/neondatabase/neon/commit/$CIRCLE_SHA1 scripts/git-upload \ - --repo=https://$VIP_VAP_ACCESS_TOKEN@github.com/zenithdb/zenith-coverage-data.git \ + --repo=https://$VIP_VAP_ACCESS_TOKEN@github.com/neondatabase/zenith-coverage-data.git \ --message="Add code coverage for $COMMIT_URL" \ copy /tmp/zenith/coverage/report $CIRCLE_SHA1 # COPY FROM TO_RELATIVE @@ -437,7 +437,7 @@ jobs: \"target_url\": \"$REPORT_URL\" }" - # Build zenithdb/zenith:latest image and push it to Docker hub + # Build neondatabase/neon:latest image and push it to Docker hub docker-image: docker: - image: cimg/base:2021.04 @@ -451,18 +451,18 @@ jobs: - run: name: Build and push Docker image command: | - echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin + echo $NEON_DOCKER_PWD | docker login -u $NEON_DOCKER_LOGIN --password-stdin DOCKER_TAG=$(git log --oneline|wc -l) docker build \ --pull \ --build-arg GIT_VERSION=${CIRCLE_SHA1} \ --build-arg AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}" \ --build-arg AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}" \ - --tag zenithdb/zenith:${DOCKER_TAG} --tag zenithdb/zenith:latest . - docker push zenithdb/zenith:${DOCKER_TAG} - docker push zenithdb/zenith:latest + --tag neondatabase/neon:${DOCKER_TAG} --tag neondatabase/neon:latest . 
+ docker push neondatabase/neon:${DOCKER_TAG} + docker push neondatabase/neon:latest - # Build zenithdb/compute-node:latest image and push it to Docker hub + # Build neondatabase/compute-node:latest image and push it to Docker hub docker-image-compute: docker: - image: cimg/base:2021.04 @@ -470,31 +470,31 @@ jobs: - checkout - setup_remote_docker: docker_layer_caching: true - # Build zenithdb/compute-tools:latest image and push it to Docker hub + # Build neondatabase/compute-tools:latest image and push it to Docker hub # TODO: this should probably also use versioned tag, not just :latest. # XXX: but should it? We build and use it only locally now. - run: name: Build and push compute-tools Docker image command: | - echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin + echo $NEON_DOCKER_PWD | docker login -u $NEON_DOCKER_LOGIN --password-stdin docker build \ --build-arg AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}" \ --build-arg AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}" \ - --tag zenithdb/compute-tools:latest -f Dockerfile.compute-tools . - docker push zenithdb/compute-tools:latest + --tag neondatabase/compute-tools:latest -f Dockerfile.compute-tools . 
+ docker push neondatabase/compute-tools:latest - run: name: Init postgres submodule command: git submodule update --init --depth 1 - run: name: Build and push compute-node Docker image command: | - echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin + echo $NEON_DOCKER_PWD | docker login -u $NEON_DOCKER_LOGIN --password-stdin DOCKER_TAG=$(git log --oneline|wc -l) - docker build --tag zenithdb/compute-node:${DOCKER_TAG} --tag zenithdb/compute-node:latest vendor/postgres - docker push zenithdb/compute-node:${DOCKER_TAG} - docker push zenithdb/compute-node:latest + docker build --tag neondatabase/compute-node:${DOCKER_TAG} --tag neondatabase/compute-node:latest vendor/postgres + docker push neondatabase/compute-node:${DOCKER_TAG} + docker push neondatabase/compute-node:latest - # Build production zenithdb/zenith:release image and push it to Docker hub + # Build production neondatabase/neon:release image and push it to Docker hub docker-image-release: docker: - image: cimg/base:2021.04 @@ -508,18 +508,18 @@ jobs: - run: name: Build and push Docker image command: | - echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin + echo $NEON_DOCKER_PWD | docker login -u $NEON_DOCKER_LOGIN --password-stdin DOCKER_TAG="release-$(git log --oneline|wc -l)" docker build \ --pull \ --build-arg GIT_VERSION=${CIRCLE_SHA1} \ --build-arg AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}" \ --build-arg AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}" \ - --tag zenithdb/zenith:${DOCKER_TAG} --tag zenithdb/zenith:release . - docker push zenithdb/zenith:${DOCKER_TAG} - docker push zenithdb/zenith:release + --tag neondatabase/neon:${DOCKER_TAG} --tag neondatabase/neon:release . 
+ docker push neondatabase/neon:${DOCKER_TAG} + docker push neondatabase/neon:release - # Build production zenithdb/compute-node:release image and push it to Docker hub + # Build production neondatabase/compute-node:release image and push it to Docker hub docker-image-compute-release: docker: - image: cimg/base:2021.04 @@ -527,29 +527,29 @@ jobs: - checkout - setup_remote_docker: docker_layer_caching: true - # Build zenithdb/compute-tools:release image and push it to Docker hub + # Build neondatabase/compute-tools:release image and push it to Docker hub # TODO: this should probably also use versioned tag, not just :latest. # XXX: but should it? We build and use it only locally now. - run: name: Build and push compute-tools Docker image command: | - echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin + echo $NEON_DOCKER_PWD | docker login -u $NEON_DOCKER_LOGIN --password-stdin docker build \ --build-arg AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}" \ --build-arg AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}" \ - --tag zenithdb/compute-tools:release -f Dockerfile.compute-tools . - docker push zenithdb/compute-tools:release + --tag neondatabase/compute-tools:release -f Dockerfile.compute-tools . 
+ docker push neondatabase/compute-tools:release - run: name: Init postgres submodule command: git submodule update --init --depth 1 - run: name: Build and push compute-node Docker image command: | - echo $DOCKER_PWD | docker login -u $DOCKER_LOGIN --password-stdin + echo $NEON_DOCKER_PWD | docker login -u $NEON_DOCKER_LOGIN --password-stdin DOCKER_TAG="release-$(git log --oneline|wc -l)" - docker build --tag zenithdb/compute-node:${DOCKER_TAG} --tag zenithdb/compute-node:release vendor/postgres - docker push zenithdb/compute-node:${DOCKER_TAG} - docker push zenithdb/compute-node:release + docker build --tag neondatabase/compute-node:${DOCKER_TAG} --tag neondatabase/compute-node:release vendor/postgres + docker push neondatabase/compute-node:${DOCKER_TAG} + docker push neondatabase/compute-node:release deploy-staging: docker: @@ -575,7 +575,7 @@ jobs: rm -f ssh-key ssh-key-cert.pub ansible-playbook deploy.yaml -i staging.hosts - rm -f zenith_install.tar.gz .zenith_current_version + rm -f neon_install.tar.gz .neon_current_version deploy-staging-proxy: docker: @@ -625,7 +625,7 @@ jobs: rm -f ssh-key ssh-key-cert.pub ansible-playbook deploy.yaml -i production.hosts - rm -f zenith_install.tar.gz .zenith_current_version + rm -f neon_install.tar.gz .neon_current_version deploy-release-proxy: docker: @@ -704,8 +704,8 @@ workflows: matrix: parameters: build_type: ["debug", "release"] - - build-zenith: - name: build-zenith-<< matrix.build_type >> + - build-neon: + name: build-neon-<< matrix.build_type >> matrix: parameters: build_type: ["debug", "release"] @@ -720,7 +720,7 @@ workflows: test_selection: batch_pg_regress needs_postgres_source: true requires: - - build-zenith-<< matrix.build_type >> + - build-neon-<< matrix.build_type >> - run-pytest: name: other-tests-<< matrix.build_type >> matrix: @@ -728,7 +728,7 @@ workflows: build_type: ["debug", "release"] test_selection: batch_others requires: - - build-zenith-<< matrix.build_type >> + - build-neon-<< matrix.build_type 
>> - run-pytest: name: benchmarks context: PERF_TEST_RESULT_CONNSTR @@ -737,7 +737,7 @@ workflows: run_in_parallel: false save_perf_report: true requires: - - build-zenith-release + - build-neon-release - coverage-report: # Context passes credentials for gh api context: CI_ACCESS_TOKEN @@ -833,6 +833,6 @@ workflows: # XXX: Successful build doesn't mean everything is OK, but # the job to be triggered takes so much time to complete (~22 min) # that it's better not to wait for the commented-out steps - - build-zenith-release + - build-neon-release # - pg_regress-tests-release # - other-tests-release diff --git a/.circleci/helm-values/production.proxy.yaml b/.circleci/helm-values/production.proxy.yaml index 27aa169c79..f2148c1d2c 100644 --- a/.circleci/helm-values/production.proxy.yaml +++ b/.circleci/helm-values/production.proxy.yaml @@ -1,6 +1,9 @@ # Helm chart values for zenith-proxy. # This is a YAML-formatted file. +image: + repository: neondatabase/neon + settings: authEndpoint: "https://console.zenith.tech/authenticate_proxy_request/" uri: "https://console.zenith.tech/psql_session/" diff --git a/.circleci/helm-values/staging.proxy.yaml b/.circleci/helm-values/staging.proxy.yaml index bdce4d80da..f4d9855476 100644 --- a/.circleci/helm-values/staging.proxy.yaml +++ b/.circleci/helm-values/staging.proxy.yaml @@ -1,6 +1,9 @@ # Helm chart values for zenith-proxy. # This is a YAML-formatted file. +image: + repository: neondatabase/neon + settings: authEndpoint: "https://console.stage.zenith.tech/authenticate_proxy_request/" uri: "https://console.stage.zenith.tech/psql_session/" diff --git a/Dockerfile b/Dockerfile index ebc8731168..a7afd1f335 100644 --- a/Dockerfile +++ b/Dockerfile @@ -26,7 +26,9 @@ COPY . . # Show build caching stats to check if it was used in the end. # Has to be the part of the same RUN since cachepot daemon is killed in the end of this RUN, loosing the compilation stats. 
-RUN mold -run cargo build --release && cachepot -s +RUN set -e \ + && sudo -E "PATH=$PATH" mold -run cargo build --release \ + && cachepot -s # Build final image # diff --git a/Dockerfile.compute-tools b/Dockerfile.compute-tools index 3fc8702f3f..bbe0f517ce 100644 --- a/Dockerfile.compute-tools +++ b/Dockerfile.compute-tools @@ -8,7 +8,9 @@ ARG AWS_SECRET_ACCESS_KEY COPY . . -RUN mold -run cargo build -p compute_tools --release && cachepot -s +RUN set -e \ + && sudo -E "PATH=$PATH" mold -run cargo build -p compute_tools --release \ + && cachepot -s # Final image that only has one binary FROM debian:buster-slim From 8f6a16127117d63b96c25cbf8b105ebc75a8e9c0 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Fri, 22 Apr 2022 17:07:09 +0300 Subject: [PATCH 141/296] Show better layer load errors --- pageserver/src/layered_repository/delta_layer.rs | 9 +++++++-- pageserver/src/layered_repository/image_layer.rs | 4 +++- 2 files changed, 10 insertions(+), 3 deletions(-) diff --git a/pageserver/src/layered_repository/delta_layer.rs b/pageserver/src/layered_repository/delta_layer.rs index c5530a5789..ef4c3cccb0 100644 --- a/pageserver/src/layered_repository/delta_layer.rs +++ b/pageserver/src/layered_repository/delta_layer.rs @@ -290,7 +290,10 @@ impl Layer for DeltaLayer { } fn iter<'a>(&'a self) -> Box> + 'a> { - let inner = self.load().unwrap(); + let inner = match self.load() { + Ok(inner) => inner, + Err(e) => panic!("Failed to load a delta layer: {e:?}"), + }; match DeltaValueIter::new(inner) { Ok(iter) => Box::new(iter), @@ -422,7 +425,9 @@ impl DeltaLayer { drop(inner); let inner = self.inner.write().unwrap(); if !inner.loaded { - self.load_inner(inner)?; + self.load_inner(inner).with_context(|| { + format!("Failed to load delta layer {}", self.path().display()) + })?; } else { // Another thread loaded it while we were not holding the lock. 
} diff --git a/pageserver/src/layered_repository/image_layer.rs b/pageserver/src/layered_repository/image_layer.rs index 0e38d46e7a..d7657ecac6 100644 --- a/pageserver/src/layered_repository/image_layer.rs +++ b/pageserver/src/layered_repository/image_layer.rs @@ -254,7 +254,9 @@ impl ImageLayer { drop(inner); let mut inner = self.inner.write().unwrap(); if !inner.loaded { - self.load_inner(&mut inner)?; + self.load_inner(&mut inner).with_context(|| { + format!("Failed to load image layer {}", self.path().display()) + })? } else { // Another thread loaded it while we were not holding the lock. } From 78a6cb247f1c37287bf88687c3309b5be99ee720 Mon Sep 17 00:00:00 2001 From: Anastasia Lubennikova Date: Thu, 7 Apr 2022 20:37:42 +0300 Subject: [PATCH 142/296] allow the users to create extensions: GRANT CREATE ON DATABASE --- compute_tools/src/bin/zenith_ctl.rs | 1 + compute_tools/src/spec.rs | 21 +++++++++++++++++++++ 2 files changed, 22 insertions(+) diff --git a/compute_tools/src/bin/zenith_ctl.rs b/compute_tools/src/bin/zenith_ctl.rs index a5dfb1c875..3685f8e8b4 100644 --- a/compute_tools/src/bin/zenith_ctl.rs +++ b/compute_tools/src/bin/zenith_ctl.rs @@ -129,6 +129,7 @@ fn run_compute(state: &Arc>) -> Result { handle_roles(&read_state.spec, &mut client)?; handle_databases(&read_state.spec, &mut client)?; + handle_grants(&read_state.spec, &mut client)?; create_writablity_check_data(&mut client)?; // 'Close' connection diff --git a/compute_tools/src/spec.rs b/compute_tools/src/spec.rs index 1dd7c0044e..27114b8202 100644 --- a/compute_tools/src/spec.rs +++ b/compute_tools/src/spec.rs @@ -244,3 +244,24 @@ pub fn handle_databases(spec: &ClusterSpec, client: &mut Client) -> Result<()> { Ok(()) } + +// Grant CREATE ON DATABASE to the database owner +// to allow clients create trusted extensions. 
+pub fn handle_grants(spec: &ClusterSpec, client: &mut Client) -> Result<()> { + info!("cluster spec grants:"); + + for db in &spec.cluster.databases { + let dbname = &db.name; + + let query: String = format!( + "GRANT CREATE ON DATABASE {} TO {}", + dbname.quote(), + db.owner.quote() + ); + info!("grant query {}", &query); + + client.execute(query.as_str(), &[])?; + } + + Ok(()) +} From d060a97c548dc2a395be0772f67ee306b3df14a5 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Fri, 22 Apr 2022 21:32:54 +0300 Subject: [PATCH 143/296] Simplify clippy runs --- .circleci/config.yml | 14 -------------- .github/workflows/testing.yml | 17 ++++++----------- run_clippy.sh | 2 +- 3 files changed, 7 insertions(+), 26 deletions(-) diff --git a/.circleci/config.yml b/.circleci/config.yml index 471d64a82f..3397bcc7b7 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -132,20 +132,6 @@ jobs: - ~/.cargo/git - target - # Run style checks - # has to run separately from cargo fmt section - # since needs to run with dependencies - - run: - name: cargo clippy - command: | - if [[ $BUILD_TYPE == "debug" ]]; then - cov_prefix=(scripts/coverage "--profraw-prefix=$CIRCLE_JOB" --dir=/tmp/zenith/coverage run) - elif [[ $BUILD_TYPE == "release" ]]; then - cov_prefix=() - fi - - "${cov_prefix[@]}" ./run_clippy.sh - # Run rust unit tests - run: name: cargo test diff --git a/.github/workflows/testing.yml b/.github/workflows/testing.yml index 83e46ce6be..6d109b9bb5 100644 --- a/.github/workflows/testing.yml +++ b/.github/workflows/testing.yml @@ -36,8 +36,7 @@ jobs: - name: Install macOs postgres dependencies if: matrix.os == 'macos-latest' - run: | - brew install flex bison + run: brew install flex bison - name: Set pg revision for caching id: pg_ver @@ -53,8 +52,7 @@ jobs: - name: Build postgres if: steps.cache_pg.outputs.cache-hit != 'true' - run: | - make postgres + run: make postgres - name: Cache cargo deps id: cache_cargo @@ -64,13 +62,10 @@ jobs: ~/.cargo/registry 
~/.cargo/git target - key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }} + key: ${{ runner.os }}-cargo-${{ hashFiles('./Cargo.lock') }} - # Use `env CARGO_INCREMENTAL=0` to mitigate https://github.com/rust-lang/rust/issues/91696 for rustc 1.57.0 - - name: Run cargo build - run: | - env CARGO_INCREMENTAL=0 cargo build --workspace --bins --examples --tests + - name: Run cargo clippy + run: ./run_clippy.sh - name: Run cargo test - run: | - env CARGO_INCREMENTAL=0 cargo test -- --nocapture --test-threads=1 + run: cargo test --all --all-targets diff --git a/run_clippy.sh b/run_clippy.sh index 4ca944c1f1..f26dbaa0f3 100755 --- a/run_clippy.sh +++ b/run_clippy.sh @@ -12,4 +12,4 @@ # * `-A unknown_lints` – do not warn about unknown lint suppressions # that people with newer toolchains might use # * `-D warnings` - fail on any warnings (`cargo` returns non-zero exit status) -cargo clippy "${@:2}" --all-targets --all-features --all --tests -- -A unknown_lints -D warnings +cargo clippy --all --all-targets --all-features -- -A unknown_lints -D warnings From fec050ce97239a8c63680c70572e043513880acb Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Fri, 22 Apr 2022 22:12:25 +0300 Subject: [PATCH 144/296] Fix macos clippy issues --- pageserver/src/http/routes.rs | 2 +- pageserver/src/profiling.rs | 16 +++++++++++----- run_clippy.sh | 15 +++++++++++---- 3 files changed, 23 insertions(+), 10 deletions(-) diff --git a/pageserver/src/http/routes.rs b/pageserver/src/http/routes.rs index 2db56015ad..05485ef3b6 100644 --- a/pageserver/src/http/routes.rs +++ b/pageserver/src/http/routes.rs @@ -453,7 +453,7 @@ async fn tenant_config_handler(mut request: Request) -> Result) -> Result, ApiError> { diff --git a/pageserver/src/profiling.rs b/pageserver/src/profiling.rs index e2c12c9e12..84132659d6 100644 --- a/pageserver/src/profiling.rs +++ b/pageserver/src/profiling.rs @@ -74,22 +74,28 @@ mod profiling_impl { } } -/// Dummy implementation when compiling without profiling 
feature +/// Dummy implementation when compiling without profiling feature or for non-linux OSes. #[cfg(not(feature = "profiling"))] mod profiling_impl { use super::*; - pub fn profpoint_start(_conf: &PageServerConf, _point: ProfilingConfig) -> () { - () + pub struct DummyProfilerGuard; + + pub fn profpoint_start( + _conf: &PageServerConf, + _point: ProfilingConfig, + ) -> Option { + None } - pub fn init_profiler(conf: &PageServerConf) -> () { + pub fn init_profiler(conf: &PageServerConf) -> Option { if conf.profiling != ProfilingConfig::Disabled { // shouldn't happen, we don't allow profiling in the config if the support // for it is disabled. panic!("profiling enabled but the binary was compiled without profiling support"); } + None } - pub fn exit_profiler(_conf: &PageServerConf, _guard: &()) {} + pub fn exit_profiler(_conf: &PageServerConf, _guard: &Option) {} } diff --git a/run_clippy.sh b/run_clippy.sh index f26dbaa0f3..13af3fd2c5 100755 --- a/run_clippy.sh +++ b/run_clippy.sh @@ -9,7 +9,14 @@ # In vscode, this setting is Rust-analyzer>Check On Save:Command -# * `-A unknown_lints` – do not warn about unknown lint suppressions -# that people with newer toolchains might use -# * `-D warnings` - fail on any warnings (`cargo` returns non-zero exit status) -cargo clippy --all --all-targets --all-features -- -A unknown_lints -D warnings +# Not every feature is supported in macOS builds, e.g. `profiling`, +# avoid running regular linting script that checks every feature. 
+if [[ "$OSTYPE" == "darwin"* ]]; then + # no extra features to test currently, add more here when needed + cargo clippy --all --all-targets -- -A unknown_lints -D warnings +else + # * `-A unknown_lints` – do not warn about unknown lint suppressions + # that people with newer toolchains might use + # * `-D warnings` - fail on any warnings (`cargo` returns non-zero exit status) + cargo clippy --all --all-targets --all-features -- -A unknown_lints -D warnings +fi From eabf6f89e46533a87cbf2fa7d5206fcff6458e63 Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Mon, 25 Apr 2022 23:41:11 +0300 Subject: [PATCH 145/296] Use item.get for tenant config toml parsing Previously we've used table interface, but there was no easy way to pass it as an override to pageserver through cli. Use the same strategy as for remote storage config parsing --- pageserver/src/config.rs | 56 +++++++++++++++++++++++----------------- 1 file changed, 33 insertions(+), 23 deletions(-) diff --git a/pageserver/src/config.rs b/pageserver/src/config.rs index b2c4a62796..df4d9910ee 100644 --- a/pageserver/src/config.rs +++ b/pageserver/src/config.rs @@ -466,30 +466,40 @@ impl PageServerConf { pub fn parse_toml_tenant_conf(item: &toml_edit::Item) -> Result { let mut t_conf: TenantConfOpt = Default::default(); - for (key, item) in item - .as_table() - .ok_or(anyhow::anyhow!("invalid tenant config"))? - .iter() - { - match key { - "checkpoint_distance" => { - t_conf.checkpoint_distance = Some(parse_toml_u64(key, item)?) - } - "compaction_target_size" => { - t_conf.compaction_target_size = Some(parse_toml_u64(key, item)?) - } - "compaction_period" => { - t_conf.compaction_period = Some(parse_toml_duration(key, item)?) - } - "compaction_threshold" => { - t_conf.compaction_threshold = Some(parse_toml_u64(key, item)? 
as usize) - } - "gc_horizon" => t_conf.gc_horizon = Some(parse_toml_u64(key, item)?), - "gc_period" => t_conf.gc_period = Some(parse_toml_duration(key, item)?), - "pitr_interval" => t_conf.pitr_interval = Some(parse_toml_duration(key, item)?), - _ => bail!("unrecognized tenant config option '{}'", key), - } + if let Some(checkpoint_distance) = item.get("checkpoint_distance") { + t_conf.checkpoint_distance = + Some(parse_toml_u64("checkpoint_distance", checkpoint_distance)?); } + + if let Some(compaction_target_size) = item.get("compaction_target_size") { + t_conf.compaction_target_size = Some(parse_toml_u64( + "compaction_target_size", + compaction_target_size, + )?); + } + + if let Some(compaction_period) = item.get("compaction_period") { + t_conf.compaction_period = + Some(parse_toml_duration("compaction_period", compaction_period)?); + } + + if let Some(compaction_threshold) = item.get("compaction_threshold") { + t_conf.compaction_threshold = + Some(parse_toml_u64("compaction_threshold", compaction_threshold)?.try_into()?); + } + + if let Some(gc_horizon) = item.get("gc_horizon") { + t_conf.gc_horizon = Some(parse_toml_u64("gc_horizon", gc_horizon)?); + } + + if let Some(gc_period) = item.get("gc_period") { + t_conf.gc_period = Some(parse_toml_duration("gc_period", gc_period)?); + } + + if let Some(pitr_interval) = item.get("pitr_interval") { + t_conf.pitr_interval = Some(parse_toml_duration("pitr_interval", pitr_interval)?); + } + Ok(t_conf) } From 778744d35ca4ff57237a6ef5b4323084797de9bd Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Mon, 25 Apr 2022 16:29:23 +0300 Subject: [PATCH 146/296] Limit concurrent S3 and IAM interactions --- Cargo.lock | 2 +- docs/settings.md | 7 +- pageserver/src/config.rs | 215 +++++++++--------- pageserver/src/remote_storage.rs | 4 +- pageserver/src/remote_storage/s3_bucket.rs | 33 ++- pageserver/src/remote_storage/storage_sync.rs | 84 +++---- 6 files changed, 195 insertions(+), 150 deletions(-) diff --git a/Cargo.lock 
b/Cargo.lock index 978cd20d12..3797e4e76b 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -1899,7 +1899,7 @@ dependencies = [ "libc", "log", "nix", - "parking_lot", + "parking_lot 0.11.2", "symbolic-demangle", "tempfile", "thiserror", diff --git a/docs/settings.md b/docs/settings.md index 69aadc602f..530876a42a 100644 --- a/docs/settings.md +++ b/docs/settings.md @@ -156,6 +156,9 @@ access_key_id = 'SOMEKEYAAAAASADSAH*#' # Secret access key to connect to the bucket ("password" part of the credentials) secret_access_key = 'SOMEsEcReTsd292v' + +# S3 API query limit to avoid getting errors/throttling from AWS. +concurrency_limit = 100 ``` ###### General remote storage configuration @@ -167,8 +170,8 @@ Besides, there are parameters common for all types of remote storage that can be ```toml [remote_storage] -# Max number of concurrent connections to open for uploading to or downloading from the remote storage. -max_concurrent_sync = 100 +# Max number of timelines that can be synchronized (layers uploaded or downloaded) with the remote storage at the same time. +max_concurrent_timelines_sync = 50 # Max number of errors a single task can have before it's considered failed and not attempted to run anymore. max_sync_errors = 10 diff --git a/pageserver/src/config.rs b/pageserver/src/config.rs index df4d9910ee..8bfe8b57ec 100644 --- a/pageserver/src/config.rs +++ b/pageserver/src/config.rs @@ -4,8 +4,7 @@ //! file, or on the command line. //! See also `settings.md` for better description on every parameter.
-use anyhow::{bail, ensure, Context, Result}; -use std::convert::TryInto; +use anyhow::{anyhow, bail, ensure, Context, Result}; use std::env; use std::num::{NonZeroU32, NonZeroUsize}; use std::path::{Path, PathBuf}; @@ -34,8 +33,18 @@ pub mod defaults { pub const DEFAULT_WAL_REDO_TIMEOUT: &str = "60 s"; pub const DEFAULT_SUPERUSER: &str = "zenith_admin"; - pub const DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNC: usize = 10; + /// How many different timelines can be processed simultaneously when synchronizing layers with the remote storage. + /// During regular work, pageserver produces one layer file per timeline checkpoint, with bursts of concurrency + /// during start (where local and remote timelines are compared and initial sync tasks are scheduled) and timeline attach. + /// Both cases may trigger timeline download, that might download a lot of layers. This concurrency is limited by the clients internally, if needed. + pub const DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_TIMELINES_SYNC: usize = 50; pub const DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS: u32 = 10; + /// Currently, sync happens with AWS S3, that has two limits on requests per second: + /// ~200 RPS for IAM services + /// https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/UsingWithRDS.IAMDBAuth.html + /// ~3500 PUT/COPY/POST/DELETE or 5500 GET/HEAD S3 requests + /// https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/ + pub const DEFAULT_REMOTE_STORAGE_S3_CONCURRENCY_LIMIT: usize = 100; pub const DEFAULT_PAGE_CACHE_SIZE: usize = 8192; pub const DEFAULT_MAX_FILE_DESCRIPTORS: usize = 100; @@ -127,7 +136,7 @@ impl FromStr for ProfilingConfig { let result = match s { "disabled" => ProfilingConfig::Disabled, "page_requests" => ProfilingConfig::PageRequests, - _ => bail!("invalid value \"{}\" for profiling option, valid values are \"disabled\" and \"page_requests\"", s), + _ => bail!("invalid value \"{s}\" for profiling option, valid values are \"disabled\" and 
\"page_requests\""), }; Ok(result) } @@ -269,36 +278,36 @@ impl PageServerConfigBuilder { Ok(PageServerConf { listen_pg_addr: self .listen_pg_addr - .ok_or(anyhow::anyhow!("missing listen_pg_addr"))?, + .ok_or(anyhow!("missing listen_pg_addr"))?, listen_http_addr: self .listen_http_addr - .ok_or(anyhow::anyhow!("missing listen_http_addr"))?, + .ok_or(anyhow!("missing listen_http_addr"))?, wait_lsn_timeout: self .wait_lsn_timeout - .ok_or(anyhow::anyhow!("missing wait_lsn_timeout"))?, + .ok_or(anyhow!("missing wait_lsn_timeout"))?, wal_redo_timeout: self .wal_redo_timeout - .ok_or(anyhow::anyhow!("missing wal_redo_timeout"))?, - superuser: self.superuser.ok_or(anyhow::anyhow!("missing superuser"))?, + .ok_or(anyhow!("missing wal_redo_timeout"))?, + superuser: self.superuser.ok_or(anyhow!("missing superuser"))?, page_cache_size: self .page_cache_size - .ok_or(anyhow::anyhow!("missing page_cache_size"))?, + .ok_or(anyhow!("missing page_cache_size"))?, max_file_descriptors: self .max_file_descriptors - .ok_or(anyhow::anyhow!("missing max_file_descriptors"))?, - workdir: self.workdir.ok_or(anyhow::anyhow!("missing workdir"))?, + .ok_or(anyhow!("missing max_file_descriptors"))?, + workdir: self.workdir.ok_or(anyhow!("missing workdir"))?, pg_distrib_dir: self .pg_distrib_dir - .ok_or(anyhow::anyhow!("missing pg_distrib_dir"))?, - auth_type: self.auth_type.ok_or(anyhow::anyhow!("missing auth_type"))?, + .ok_or(anyhow!("missing pg_distrib_dir"))?, + auth_type: self.auth_type.ok_or(anyhow!("missing auth_type"))?, auth_validation_public_key_path: self .auth_validation_public_key_path - .ok_or(anyhow::anyhow!("missing auth_validation_public_key_path"))?, + .ok_or(anyhow!("missing auth_validation_public_key_path"))?, remote_storage_config: self .remote_storage_config - .ok_or(anyhow::anyhow!("missing remote_storage_config"))?, - id: self.id.ok_or(anyhow::anyhow!("missing id"))?, - profiling: self.profiling.ok_or(anyhow::anyhow!("missing profiling"))?, + .ok_or(anyhow!("missing 
remote_storage_config"))?, + id: self.id.ok_or(anyhow!("missing id"))?, + profiling: self.profiling.ok_or(anyhow!("missing profiling"))?, // TenantConf is handled separately default_tenant_conf: TenantConf::default(), }) @@ -309,7 +318,7 @@ impl PageServerConfigBuilder { #[derive(Debug, Clone, PartialEq, Eq)] pub struct RemoteStorageConfig { /// Max allowed number of concurrent sync operations between pageserver and the remote storage. - pub max_concurrent_sync: NonZeroUsize, + pub max_concurrent_timelines_sync: NonZeroUsize, /// Max allowed errors before the sync task is considered failed and evicted. pub max_sync_errors: NonZeroU32, /// The storage connection configuration. @@ -350,6 +359,9 @@ pub struct S3Config { /// /// Example: `http://127.0.0.1:5000` pub endpoint: Option, + /// AWS S3 has various limits on its API calls, we need not to exceed those. + /// See [`defaults::DEFAULT_REMOTE_STORAGE_S3_CONCURRENCY_LIMIT`] for more details. + pub concurrency_limit: NonZeroUsize, } impl std::fmt::Debug for S3Config { @@ -358,6 +370,7 @@ impl std::fmt::Debug for S3Config { .field("bucket_name", &self.bucket_name) .field("bucket_region", &self.bucket_region) .field("prefix_in_bucket", &self.prefix_in_bucket) + .field("concurrency_limit", &self.concurrency_limit) .finish() } } @@ -431,7 +444,7 @@ impl PageServerConf { } "id" => builder.id(ZNodeId(parse_toml_u64(key, item)?)), "profiling" => builder.profiling(parse_toml_from_str(key, item)?), - _ => bail!("unrecognized pageserver option '{}'", key), + _ => bail!("unrecognized pageserver option '{key}'"), } } @@ -509,32 +522,23 @@ impl PageServerConf { let bucket_name = toml.get("bucket_name"); let bucket_region = toml.get("bucket_region"); - let max_concurrent_sync: NonZeroUsize = if let Some(s) = toml.get("max_concurrent_sync") { - parse_toml_u64("max_concurrent_sync", s) - .and_then(|toml_u64| { - toml_u64.try_into().with_context(|| { - format!("'max_concurrent_sync' value {} is too large", toml_u64) - }) - }) - .ok() 
- .and_then(NonZeroUsize::new) - .context("'max_concurrent_sync' must be a non-zero positive integer")? - } else { - NonZeroUsize::new(defaults::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNC).unwrap() - }; - let max_sync_errors: NonZeroU32 = if let Some(s) = toml.get("max_sync_errors") { - parse_toml_u64("max_sync_errors", s) - .and_then(|toml_u64| { - toml_u64.try_into().with_context(|| { - format!("'max_sync_errors' value {} is too large", toml_u64) - }) - }) - .ok() - .and_then(NonZeroU32::new) - .context("'max_sync_errors' must be a non-zero positive integer")? - } else { - NonZeroU32::new(defaults::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS).unwrap() - }; + let max_concurrent_timelines_sync = NonZeroUsize::new( + parse_optional_integer("max_concurrent_timelines_sync", toml)? + .unwrap_or(defaults::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_TIMELINES_SYNC), + ) + .context("Failed to parse 'max_concurrent_timelines_sync' as a positive integer")?; + + let max_sync_errors = NonZeroU32::new( + parse_optional_integer("max_sync_errors", toml)? + .unwrap_or(defaults::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS), + ) + .context("Failed to parse 'max_sync_errors' as a positive integer")?; + + let concurrency_limit = NonZeroUsize::new( + parse_optional_integer("concurrency_limit", toml)? 
+ .unwrap_or(defaults::DEFAULT_REMOTE_STORAGE_S3_CONCURRENCY_LIMIT), + ) + .context("Failed to parse 'concurrency_limit' as a positive integer")?; let storage = match (local_path, bucket_name, bucket_region) { (None, None, None) => bail!("no 'local_path' nor 'bucket_name' option"), @@ -565,6 +569,7 @@ impl PageServerConf { .get("endpoint") .map(|endpoint| parse_toml_string("endpoint", endpoint)) .transpose()?, + concurrency_limit, }), (Some(local_path), None, None) => RemoteStorageKind::LocalFs(PathBuf::from( parse_toml_string("local_path", local_path)?, @@ -573,7 +578,7 @@ impl PageServerConf { }; Ok(RemoteStorageConfig { - max_concurrent_sync, + max_concurrent_timelines_sync, max_sync_errors, storage, }) @@ -581,7 +586,7 @@ impl PageServerConf { #[cfg(test)] pub fn test_repo_dir(test_name: &str) -> PathBuf { - PathBuf::from(format!("../tmp_check/test_{}", test_name)) + PathBuf::from(format!("../tmp_check/test_{test_name}")) } #[cfg(test)] @@ -611,7 +616,7 @@ impl PageServerConf { fn parse_toml_string(name: &str, item: &Item) -> Result { let s = item .as_str() - .with_context(|| format!("configure option {} is not a string", name))?; + .with_context(|| format!("configure option {name} is not a string"))?; Ok(s.to_string()) } @@ -620,17 +625,34 @@ fn parse_toml_u64(name: &str, item: &Item) -> Result { // for our use, though. 
let i: i64 = item .as_integer() - .with_context(|| format!("configure option {} is not an integer", name))?; + .with_context(|| format!("configure option {name} is not an integer"))?; if i < 0 { - bail!("configure option {} cannot be negative", name); + bail!("configure option {name} cannot be negative"); } Ok(i as u64) } +fn parse_optional_integer(name: &str, item: &toml_edit::Item) -> anyhow::Result> +where + I: TryFrom, + E: std::error::Error + Send + Sync + 'static, +{ + let toml_integer = match item.get(name) { + Some(item) => item + .as_integer() + .with_context(|| format!("configure option {name} is not an integer"))?, + None => return Ok(None), + }; + + I::try_from(toml_integer) + .map(Some) + .with_context(|| format!("configure option {name} is too large")) +} + fn parse_toml_duration(name: &str, item: &Item) -> Result { let s = item .as_str() - .with_context(|| format!("configure option {} is not a string", name))?; + .with_context(|| format!("configure option {name} is not a string"))?; Ok(humantime::parse_duration(s)?) 
} @@ -641,7 +663,7 @@ where { let v = item .as_str() - .with_context(|| format!("configure option {} is not a string", name))?; + .with_context(|| format!("configure option {name} is not a string"))?; T::from_str(v) } @@ -679,10 +701,8 @@ id = 10 let config_string = format!("pg_distrib_dir='{}'\nid=10", pg_distrib_dir.display()); let toml = config_string.parse()?; - let parsed_config = - PageServerConf::parse_and_validate(&toml, &workdir).unwrap_or_else(|e| { - panic!("Failed to parse config '{}', reason: {}", config_string, e) - }); + let parsed_config = PageServerConf::parse_and_validate(&toml, &workdir) + .unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e}")); assert_eq!( parsed_config, @@ -715,16 +735,13 @@ id = 10 let (workdir, pg_distrib_dir) = prepare_fs(&tempdir)?; let config_string = format!( - "{}pg_distrib_dir='{}'", - ALL_BASE_VALUES_TOML, + "{ALL_BASE_VALUES_TOML}pg_distrib_dir='{}'", pg_distrib_dir.display() ); let toml = config_string.parse()?; - let parsed_config = - PageServerConf::parse_and_validate(&toml, &workdir).unwrap_or_else(|e| { - panic!("Failed to parse config '{}', reason: {}", config_string, e) - }); + let parsed_config = PageServerConf::parse_and_validate(&toml, &workdir) + .unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e}")); assert_eq!( parsed_config, @@ -772,37 +789,33 @@ local_path = '{}'"#, for remote_storage_config_str in identical_toml_declarations { let config_string = format!( - r#"{} + r#"{ALL_BASE_VALUES_TOML} pg_distrib_dir='{}' -{}"#, - ALL_BASE_VALUES_TOML, +{remote_storage_config_str}"#, pg_distrib_dir.display(), - remote_storage_config_str, ); let toml = config_string.parse()?; let parsed_remote_storage_config = PageServerConf::parse_and_validate(&toml, &workdir) - .unwrap_or_else(|e| { - panic!("Failed to parse config '{}', reason: {}", config_string, e) - }) + .unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e}")) 
.remote_storage_config .expect("Should have remote storage config for the local FS"); assert_eq!( - parsed_remote_storage_config, - RemoteStorageConfig { - max_concurrent_sync: NonZeroUsize::new( - defaults::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNC - ) - .unwrap(), - max_sync_errors: NonZeroU32::new(defaults::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS) + parsed_remote_storage_config, + RemoteStorageConfig { + max_concurrent_timelines_sync: NonZeroUsize::new( + defaults::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_TIMELINES_SYNC + ) .unwrap(), - storage: RemoteStorageKind::LocalFs(local_storage_path.clone()), - }, - "Remote storage config should correctly parse the local FS config and fill other storage defaults" - ); + max_sync_errors: NonZeroU32::new(defaults::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS) + .unwrap(), + storage: RemoteStorageKind::LocalFs(local_storage_path.clone()), + }, + "Remote storage config should correctly parse the local FS config and fill other storage defaults" + ); } Ok(()) } @@ -818,52 +831,49 @@ pg_distrib_dir='{}' let access_key_id = "SOMEKEYAAAAASADSAH*#".to_string(); let secret_access_key = "SOMEsEcReTsd292v".to_string(); let endpoint = "http://localhost:5000".to_string(); - let max_concurrent_sync = NonZeroUsize::new(111).unwrap(); + let max_concurrent_timelines_sync = NonZeroUsize::new(111).unwrap(); let max_sync_errors = NonZeroU32::new(222).unwrap(); + let s3_concurrency_limit = NonZeroUsize::new(333).unwrap(); let identical_toml_declarations = &[ format!( r#"[remote_storage] -max_concurrent_sync = {} -max_sync_errors = {} -bucket_name = '{}' -bucket_region = '{}' -prefix_in_bucket = '{}' -access_key_id = '{}' -secret_access_key = '{}' -endpoint = '{}'"#, - max_concurrent_sync, max_sync_errors, bucket_name, bucket_region, prefix_in_bucket, access_key_id, secret_access_key, endpoint +max_concurrent_timelines_sync = {max_concurrent_timelines_sync} +max_sync_errors = {max_sync_errors} +bucket_name = '{bucket_name}' +bucket_region = 
'{bucket_region}' +prefix_in_bucket = '{prefix_in_bucket}' +access_key_id = '{access_key_id}' +secret_access_key = '{secret_access_key}' +endpoint = '{endpoint}' +concurrency_limit = {s3_concurrency_limit}"# ), format!( - "remote_storage={{max_concurrent_sync={}, max_sync_errors={}, bucket_name='{}', bucket_region='{}', prefix_in_bucket='{}', access_key_id='{}', secret_access_key='{}', endpoint='{}'}}", - max_concurrent_sync, max_sync_errors, bucket_name, bucket_region, prefix_in_bucket, access_key_id, secret_access_key, endpoint + "remote_storage={{max_concurrent_timelines_sync={max_concurrent_timelines_sync}, max_sync_errors={max_sync_errors}, bucket_name='{bucket_name}',\ + bucket_region='{bucket_region}', prefix_in_bucket='{prefix_in_bucket}', access_key_id='{access_key_id}', secret_access_key='{secret_access_key}', endpoint='{endpoint}', concurrency_limit={s3_concurrency_limit}}}", ), ]; for remote_storage_config_str in identical_toml_declarations { let config_string = format!( - r#"{} + r#"{ALL_BASE_VALUES_TOML} pg_distrib_dir='{}' -{}"#, - ALL_BASE_VALUES_TOML, +{remote_storage_config_str}"#, pg_distrib_dir.display(), - remote_storage_config_str, ); let toml = config_string.parse()?; let parsed_remote_storage_config = PageServerConf::parse_and_validate(&toml, &workdir) - .unwrap_or_else(|e| { - panic!("Failed to parse config '{}', reason: {}", config_string, e) - }) + .unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e}")) .remote_storage_config .expect("Should have remote storage config for S3"); assert_eq!( parsed_remote_storage_config, RemoteStorageConfig { - max_concurrent_sync, + max_concurrent_timelines_sync, max_sync_errors, storage: RemoteStorageKind::AwsS3(S3Config { bucket_name: bucket_name.clone(), @@ -871,7 +881,8 @@ pg_distrib_dir='{}' access_key_id: Some(access_key_id.clone()), secret_access_key: Some(secret_access_key.clone()), prefix_in_bucket: Some(prefix_in_bucket.clone()), - endpoint: Some(endpoint.clone()) + 
endpoint: Some(endpoint.clone()), + concurrency_limit: s3_concurrency_limit, }), }, "Remote storage config should correctly parse the S3 config" diff --git a/pageserver/src/remote_storage.rs b/pageserver/src/remote_storage.rs index 8a09f7b9ca..39595b7167 100644 --- a/pageserver/src/remote_storage.rs +++ b/pageserver/src/remote_storage.rs @@ -161,7 +161,7 @@ pub fn start_local_timeline_sync( config, local_timeline_files, LocalFs::new(root.clone(), &config.workdir)?, - storage_config.max_concurrent_sync, + storage_config.max_concurrent_timelines_sync, storage_config.max_sync_errors, ) }, @@ -172,7 +172,7 @@ pub fn start_local_timeline_sync( config, local_timeline_files, S3Bucket::new(s3_config, &config.workdir)?, - storage_config.max_concurrent_sync, + storage_config.max_concurrent_timelines_sync, storage_config.max_sync_errors, ) }, diff --git a/pageserver/src/remote_storage/s3_bucket.rs b/pageserver/src/remote_storage/s3_bucket.rs index b69634a1b6..73d828d150 100644 --- a/pageserver/src/remote_storage/s3_bucket.rs +++ b/pageserver/src/remote_storage/s3_bucket.rs @@ -15,7 +15,7 @@ use rusoto_s3::{ DeleteObjectRequest, GetObjectRequest, ListObjectsV2Request, PutObjectRequest, S3Client, StreamingBody, S3, }; -use tokio::io; +use tokio::{io, sync::Semaphore}; use tokio_util::io::ReaderStream; use tracing::debug; @@ -65,6 +65,10 @@ pub struct S3Bucket { client: S3Client, bucket_name: String, prefix_in_bucket: Option, + // Every request to S3 can be throttled or cancelled, if a certain number of requests per second is exceeded. + // The same goes for IAM, which is queried before every S3 request, if enabled. IAM has an even lower RPS threshold. + // This limiter helps to ensure we don't exceed the thresholds.
+ concurrency_limiter: Semaphore, } impl S3Bucket { @@ -119,6 +123,7 @@ impl S3Bucket { pageserver_workdir, bucket_name: aws_config.bucket_name.clone(), prefix_in_bucket, + concurrency_limiter: Semaphore::new(aws_config.concurrency_limit.get()), }) } } @@ -147,6 +152,11 @@ impl RemoteStorage for S3Bucket { let mut continuation_token = None; loop { + let _guard = self + .concurrency_limiter + .acquire() + .await + .context("Concurrency limiter semaphore got closed during S3 list")?; let fetch_response = self .client .list_objects_v2(ListObjectsV2Request { @@ -180,6 +190,11 @@ impl RemoteStorage for S3Bucket { to: &Self::StoragePath, metadata: Option, ) -> anyhow::Result<()> { + let _guard = self + .concurrency_limiter + .acquire() + .await + .context("Concurrency limiter semaphore got closed during S3 upload")?; self.client .put_object(PutObjectRequest { body: Some(StreamingBody::new_with_size( @@ -200,6 +215,11 @@ impl RemoteStorage for S3Bucket { from: &Self::StoragePath, to: &mut (impl io::AsyncWrite + Unpin + Send + Sync), ) -> anyhow::Result> { + let _guard = self + .concurrency_limiter + .acquire() + .await + .context("Concurrency limiter semaphore got closed during S3 download")?; let object_output = self .client .get_object(GetObjectRequest { @@ -231,6 +251,11 @@ impl RemoteStorage for S3Bucket { Some(end_inclusive) => format!("bytes={}-{}", start_inclusive, end_inclusive), None => format!("bytes={}-", start_inclusive), }); + let _guard = self + .concurrency_limiter + .acquire() + .await + .context("Concurrency limiter semaphore got closed during S3 range download")?; let object_output = self .client .get_object(GetObjectRequest { @@ -250,6 +275,11 @@ impl RemoteStorage for S3Bucket { } async fn delete(&self, path: &Self::StoragePath) -> anyhow::Result<()> { + let _guard = self + .concurrency_limiter + .acquire() + .await + .context("Concurrency limiter semaphore got closed during S3 delete")?; self.client .delete_object(DeleteObjectRequest { bucket: 
self.bucket_name.clone(), @@ -433,6 +463,7 @@ mod tests { client: S3Client::new("us-east-1".parse().unwrap()), bucket_name: "dummy-bucket".to_string(), prefix_in_bucket: Some("dummy_prefix/".to_string()), + concurrency_limiter: Semaphore::new(1), } } diff --git a/pageserver/src/remote_storage/storage_sync.rs b/pageserver/src/remote_storage/storage_sync.rs index 4d1ec2e225..20012f32d7 100644 --- a/pageserver/src/remote_storage/storage_sync.rs +++ b/pageserver/src/remote_storage/storage_sync.rs @@ -62,7 +62,7 @@ pub mod index; mod upload; use std::{ - collections::{hash_map, HashMap, HashSet, VecDeque}, + collections::{HashMap, HashSet, VecDeque}, fmt::Debug, num::{NonZeroU32, NonZeroUsize}, ops::ControlFlow, @@ -132,7 +132,9 @@ lazy_static! { /// mpsc approach was picked to allow blocking the sync loop if no tasks are present, to avoid meaningless spinning. mod sync_queue { use std::{ - collections::{hash_map, HashMap}, + collections::{hash_map, HashMap, HashSet}, + num::NonZeroUsize, + ops::ControlFlow, sync::atomic::{AtomicUsize, Ordering}, }; @@ -179,7 +181,7 @@ mod sync_queue { /// Polls a new task from the queue, using its receiver counterpart. /// Does not block if the queue is empty, returning [`None`] instead. /// Needed to correctly track the queue length. - pub async fn next_task( + async fn next_task( receiver: &mut UnboundedReceiver<(ZTenantTimelineId, SyncTask)>, ) -> Option<(ZTenantTimelineId, SyncTask)> { let task = receiver.recv().await; @@ -195,15 +197,29 @@ mod sync_queue { /// or two (download and upload, if both were found in the queue during batch construction). 
pub async fn next_task_batch( receiver: &mut UnboundedReceiver<(ZTenantTimelineId, SyncTask)>, - mut max_batch_size: usize, - ) -> HashMap { - if max_batch_size == 0 { - return HashMap::new(); - } - let mut tasks: HashMap = - HashMap::with_capacity(max_batch_size); + max_timelines_to_sync: NonZeroUsize, + ) -> ControlFlow<(), HashMap> { + // request the first task in blocking fashion to do less meaningless work + let (first_sync_id, first_task) = if let Some(first_task) = next_task(receiver).await { + first_task + } else { + debug!("Queue sender part was dropped, aborting"); + return ControlFlow::Break(()); + }; + + let max_timelines_to_sync = max_timelines_to_sync.get(); + let mut batched_timelines = HashSet::with_capacity(max_timelines_to_sync); + batched_timelines.insert(first_sync_id.timeline_id); + + let mut tasks = HashMap::new(); + tasks.insert(first_sync_id, first_task); loop { + if batched_timelines.len() >= max_timelines_to_sync { + debug!("Filled a full task batch with {max_timelines_to_sync} timeline sync operations"); + break; + } + match receiver.try_recv() { Ok((sync_id, new_task)) => { LENGTH.fetch_sub(1, Ordering::Relaxed); @@ -216,24 +232,23 @@ mod sync_queue { v.insert(new_task); } } - - max_batch_size -= 1; - if max_batch_size == 0 { - break; - } + batched_timelines.insert(sync_id.timeline_id); } Err(TryRecvError::Disconnected) => { debug!("Sender disconnected, batch collection aborted"); break; } Err(TryRecvError::Empty) => { - debug!("No more data in the sync queue, task batch is not full"); + debug!( + "No more data in the sync queue, task batch is not full, length: {}, max allowed size: {max_timelines_to_sync}", + batched_timelines.len() + ); break; } } } - tasks + ControlFlow::Continue(tasks) } /// Length of the queue, assuming that all receiver counterparts were only called using the queue api. 
@@ -455,7 +470,7 @@ pub(super) fn spawn_storage_sync_thread( conf: &'static PageServerConf, local_timeline_files: HashMap)>, storage: S, - max_concurrent_sync: NonZeroUsize, + max_concurrent_timelines_sync: NonZeroUsize, max_sync_errors: NonZeroU32, ) -> anyhow::Result where @@ -497,7 +512,7 @@ where receiver, Arc::new(storage), loop_index, - max_concurrent_sync, + max_concurrent_timelines_sync, max_sync_errors, ); Ok(()) @@ -517,7 +532,7 @@ fn storage_sync_loop( mut receiver: UnboundedReceiver<(ZTenantTimelineId, SyncTask)>, storage: Arc, index: RemoteIndex, - max_concurrent_sync: NonZeroUsize, + max_concurrent_timelines_sync: NonZeroUsize, max_sync_errors: NonZeroU32, ) where P: Debug + Send + Sync + 'static, @@ -534,7 +549,7 @@ fn storage_sync_loop( &mut receiver, storage, loop_index, - max_concurrent_sync, + max_concurrent_timelines_sync, max_sync_errors, ) .instrument(info_span!("storage_sync_loop_step")) => step, @@ -568,34 +583,19 @@ async fn loop_step( receiver: &mut UnboundedReceiver<(ZTenantTimelineId, SyncTask)>, storage: Arc, index: RemoteIndex, - max_concurrent_sync: NonZeroUsize, + max_concurrent_timelines_sync: NonZeroUsize, max_sync_errors: NonZeroU32, ) -> ControlFlow<(), HashMap>> where P: Debug + Send + Sync + 'static, S: RemoteStorage + Send + Sync + 'static, { - let max_concurrent_sync = max_concurrent_sync.get(); - - // request the first task in blocking fashion to do less meaningless work - let (first_sync_id, first_task) = - if let Some(first_task) = sync_queue::next_task(receiver).await { - first_task - } else { - return ControlFlow::Break(()); + let batched_tasks = + match sync_queue::next_task_batch(receiver, max_concurrent_timelines_sync).await { + ControlFlow::Continue(batch) => batch, + ControlFlow::Break(()) => return ControlFlow::Break(()), }; - let mut batched_tasks = sync_queue::next_task_batch(receiver, max_concurrent_sync - 1).await; - match batched_tasks.entry(first_sync_id) { - hash_map::Entry::Occupied(o) => { - let current = 
o.remove(); - batched_tasks.insert(first_sync_id, current.merge(first_task)); - } - hash_map::Entry::Vacant(v) => { - v.insert(first_task); - } - } - let remaining_queue_length = sync_queue::len(); REMAINING_SYNC_ITEMS.set(remaining_queue_length as i64); if remaining_queue_length > 0 || !batched_tasks.is_empty() { @@ -623,7 +623,7 @@ where let mut new_timeline_states: HashMap< ZTenantId, HashMap, - > = HashMap::with_capacity(max_concurrent_sync); + > = HashMap::with_capacity(max_concurrent_timelines_sync.get()); while let Some((sync_id, state_update)) = sync_results.next().await { debug!("Finished storage sync task for sync id {sync_id}"); if let Some(state_update) = state_update { From 3fd234da07165401d339c59ce15577a2f0465951 Mon Sep 17 00:00:00 2001 From: Arseny Sher Date: Tue, 26 Apr 2022 13:48:42 +0400 Subject: [PATCH 147/296] Enable etcd for safekeepers in deploy. --- .circleci/ansible/production.hosts | 1 + .circleci/ansible/staging.hosts | 1 + .circleci/ansible/systemd/safekeeper.service | 2 +- 3 files changed, 3 insertions(+), 1 deletion(-) diff --git a/.circleci/ansible/production.hosts b/.circleci/ansible/production.hosts index 13224b7cf5..f32b57154c 100644 --- a/.circleci/ansible/production.hosts +++ b/.circleci/ansible/production.hosts @@ -14,3 +14,4 @@ safekeepers console_mgmt_base_url = http://console-release.local bucket_name = zenith-storage-oregon bucket_region = us-west-2 +etcd_endpoints = etcd-release.local:2379 diff --git a/.circleci/ansible/staging.hosts b/.circleci/ansible/staging.hosts index 69f058c2b9..71166c531e 100644 --- a/.circleci/ansible/staging.hosts +++ b/.circleci/ansible/staging.hosts @@ -15,3 +15,4 @@ safekeepers console_mgmt_base_url = http://console-staging.local bucket_name = zenith-staging-storage-us-east-1 bucket_region = us-east-1 +etcd_endpoints = etcd-staging.local:2379 diff --git a/.circleci/ansible/systemd/safekeeper.service b/.circleci/ansible/systemd/safekeeper.service index e75602b609..cac38d8756 100644 --- 
a/.circleci/ansible/systemd/safekeeper.service +++ b/.circleci/ansible/systemd/safekeeper.service @@ -6,7 +6,7 @@ After=network.target auditd.service Type=simple User=safekeeper Environment=RUST_BACKTRACE=1 ZENITH_REPO_DIR=/storage/safekeeper/data LD_LIBRARY_PATH=/usr/local/lib -ExecStart=/usr/local/bin/safekeeper -l {{ inventory_hostname }}.local:6500 --listen-http {{ inventory_hostname }}.local:7676 -p {{ first_pageserver }}:6400 -D /storage/safekeeper/data +ExecStart=/usr/local/bin/safekeeper -l {{ inventory_hostname }}.local:6500 --listen-http {{ inventory_hostname }}.local:7676 -p {{ first_pageserver }}:6400 -D /storage/safekeeper/data --broker-endpoints={{ etcd_endpoints }} ExecReload=/bin/kill -HUP $MAINPID KillMode=mixed KillSignal=SIGINT From 8b9d523f3cb1a140912bb5c0fdd67e176a10b45c Mon Sep 17 00:00:00 2001 From: Arseny Sher Date: Tue, 26 Apr 2022 19:37:56 +0400 Subject: [PATCH 148/296] Remove old WAL on safekeepers. Remove when it is consumed by all of 1) pageserver (remote_consistent_lsn) 2) safekeeper peers 3) s3 WAL offloading. In test s3 offloading for now is mocked by directly bumping s3_wal_lsn. 
ref #1403 --- safekeeper/src/bin/safekeeper.rs | 13 ++++- safekeeper/src/broker.rs | 7 ++- safekeeper/src/http/routes.rs | 20 ++++++++ safekeeper/src/lib.rs | 1 + safekeeper/src/remove_wal.rs | 25 ++++++++++ safekeeper/src/safekeeper.rs | 24 +++++++++ safekeeper/src/timeline.rs | 24 +++++++++ safekeeper/src/wal_storage.rs | 48 +++++++++++++++++- test_runner/batch_others/test_wal_acceptor.py | 49 +++++++++++++++++++ test_runner/fixtures/zenith_fixtures.py | 9 ++++ 10 files changed, 215 insertions(+), 5 deletions(-) create mode 100644 safekeeper/src/remove_wal.rs diff --git a/safekeeper/src/bin/safekeeper.rs b/safekeeper/src/bin/safekeeper.rs index 7434f921cb..3fea3581a8 100644 --- a/safekeeper/src/bin/safekeeper.rs +++ b/safekeeper/src/bin/safekeeper.rs @@ -16,11 +16,11 @@ use url::{ParseError, Url}; use safekeeper::control_file::{self}; use safekeeper::defaults::{DEFAULT_HTTP_LISTEN_ADDR, DEFAULT_PG_LISTEN_ADDR}; -use safekeeper::http; -use safekeeper::s3_offload; +use safekeeper::remove_wal; use safekeeper::wal_service; use safekeeper::SafeKeeperConf; use safekeeper::{broker, callmemaybe}; +use safekeeper::{http, s3_offload}; use utils::{ http::endpoint, logging, shutdown::exit_now, signals, tcp_listener, zid::ZNodeId, GIT_VERSION, }; @@ -292,6 +292,15 @@ fn start_safekeeper(mut conf: SafeKeeperConf, given_id: Option, init: b ); } + let conf_ = conf.clone(); + threads.push( + thread::Builder::new() + .name("WAL removal thread".into()) + .spawn(|| { + remove_wal::thread_main(conf_); + })?, + ); + // TODO: put more thoughts into handling of failed threads // We probably should restart them. diff --git a/safekeeper/src/broker.rs b/safekeeper/src/broker.rs index b84b5cf789..8ce7bdf0e5 100644 --- a/safekeeper/src/broker.rs +++ b/safekeeper/src/broker.rs @@ -32,23 +32,28 @@ const ZENITH_PREFIX: &str = "zenith"; /// Published data about safekeeper. Fields made optional for easy migrations. 
#[serde_as] -#[derive(Deserialize, Serialize)] +#[derive(Debug, Deserialize, Serialize)] pub struct SafekeeperInfo { /// Term of the last entry. pub last_log_term: Option, /// LSN of the last record. #[serde_as(as = "Option")] + #[serde(default)] pub flush_lsn: Option, /// Up to which LSN safekeeper regards its WAL as committed. #[serde_as(as = "Option")] + #[serde(default)] pub commit_lsn: Option, /// LSN up to which safekeeper offloaded WAL to s3. #[serde_as(as = "Option")] + #[serde(default)] pub s3_wal_lsn: Option, /// LSN of last checkpoint uploaded by pageserver. #[serde_as(as = "Option")] + #[serde(default)] pub remote_consistent_lsn: Option, #[serde_as(as = "Option")] + #[serde(default)] pub peer_horizon_lsn: Option, } diff --git a/safekeeper/src/http/routes.rs b/safekeeper/src/http/routes.rs index 2d22332db9..fab8724430 100644 --- a/safekeeper/src/http/routes.rs +++ b/safekeeper/src/http/routes.rs @@ -5,6 +5,7 @@ use serde::Serializer; use std::fmt::Display; use std::sync::Arc; +use crate::broker::SafekeeperInfo; use crate::safekeeper::Term; use crate::safekeeper::TermHistory; use crate::timeline::GlobalTimelines; @@ -123,6 +124,20 @@ async fn timeline_create_handler(mut request: Request) -> Result) -> Result, ApiError> { + let zttid = ZTenantTimelineId::new( + parse_request_param(&request, "tenant_id")?, + parse_request_param(&request, "timeline_id")?, + ); + let safekeeper_info: SafekeeperInfo = json_request(&mut request).await?; + + let tli = GlobalTimelines::get(get_conf(&request), zttid, false).map_err(ApiError::from_err)?; + tli.record_safekeeper_info(&safekeeper_info, ZNodeId(1))?; + + json_response(StatusCode::OK, ()) +} + /// Safekeeper http router. 
pub fn make_router(conf: SafeKeeperConf) -> RouterBuilder { let router = endpoint::make_router(); @@ -134,4 +149,9 @@ pub fn make_router(conf: SafeKeeperConf) -> RouterBuilder timeline_status_handler, ) .post("/v1/timeline", timeline_create_handler) + // for tests + .post( + "/v1/record_safekeeper_info/:tenant_id/:timeline_id", + record_safekeeper_info, + ) } diff --git a/safekeeper/src/lib.rs b/safekeeper/src/lib.rs index 8951e8f680..6509e8166a 100644 --- a/safekeeper/src/lib.rs +++ b/safekeeper/src/lib.rs @@ -13,6 +13,7 @@ pub mod handler; pub mod http; pub mod json_ctrl; pub mod receive_wal; +pub mod remove_wal; pub mod s3_offload; pub mod safekeeper; pub mod send_wal; diff --git a/safekeeper/src/remove_wal.rs b/safekeeper/src/remove_wal.rs new file mode 100644 index 0000000000..9474f65e5f --- /dev/null +++ b/safekeeper/src/remove_wal.rs @@ -0,0 +1,25 @@ +//! Thread removing old WAL. + +use std::{thread, time::Duration}; + +use tracing::*; + +use crate::{timeline::GlobalTimelines, SafeKeeperConf}; + +pub fn thread_main(conf: SafeKeeperConf) { + let wal_removal_interval = Duration::from_millis(5000); + loop { + let active_tlis = GlobalTimelines::get_active_timelines(); + for zttid in &active_tlis { + if let Ok(tli) = GlobalTimelines::get(&conf, *zttid, false) { + if let Err(e) = tli.remove_old_wal() { + warn!( + "failed to remove WAL for tenant {} timeline {}: {}", + tli.zttid.tenant_id, tli.zttid.timeline_id, e + ); + } + } + } + thread::sleep(wal_removal_interval) + } +} diff --git a/safekeeper/src/safekeeper.rs b/safekeeper/src/safekeeper.rs index 59174f34a2..048753152b 100644 --- a/safekeeper/src/safekeeper.rs +++ b/safekeeper/src/safekeeper.rs @@ -5,6 +5,8 @@ use byteorder::{LittleEndian, ReadBytesExt}; use bytes::{Buf, BufMut, Bytes, BytesMut}; use postgres_ffi::xlog_utils::TimeLineID; + +use postgres_ffi::xlog_utils::XLogSegNo; use serde::{Deserialize, Serialize}; use std::cmp::max; use std::cmp::min; @@ -880,6 +882,24 @@ where } Ok(()) } + + /// Get 
oldest segno we still need to keep. We hold WAL till it is consumed + /// by all of 1) pageserver (remote_consistent_lsn) 2) peers 3) s3 + /// offloading. + /// While it is safe to use inmem values for determining horizon, + /// we use persistent to make possible normal states less surprising. + pub fn get_horizon_segno(&self) -> XLogSegNo { + let horizon_lsn = min( + min( + self.state.remote_consistent_lsn, + self.state.peer_horizon_lsn, + ), + self.state.s3_wal_lsn, + ); + let res = horizon_lsn.segment_number(self.state.server.wal_seg_size as usize); + info!("horizon is {}, res {}", horizon_lsn, res); + res + } } #[cfg(test)] @@ -935,6 +955,10 @@ mod tests { fn flush_wal(&mut self) -> Result<()> { Ok(()) } + + fn remove_up_to(&self) -> Box Result<()>> { + Box::new(move |_segno_up_to: XLogSegNo| Ok(())) + } } #[test] diff --git a/safekeeper/src/timeline.rs b/safekeeper/src/timeline.rs index fbae34251c..4a507015d3 100644 --- a/safekeeper/src/timeline.rs +++ b/safekeeper/src/timeline.rs @@ -4,6 +4,7 @@ use anyhow::{bail, Context, Result}; use lazy_static::lazy_static; +use postgres_ffi::xlog_utils::XLogSegNo; use std::cmp::{max, min}; use std::collections::HashMap; @@ -88,6 +89,7 @@ struct SharedState { active: bool, num_computes: u32, pageserver_connstr: Option, + last_removed_segno: XLogSegNo, } impl SharedState { @@ -109,6 +111,7 @@ impl SharedState { active: false, num_computes: 0, pageserver_connstr: None, + last_removed_segno: 0, }) } @@ -127,6 +130,7 @@ impl SharedState { active: false, num_computes: 0, pageserver_connstr: None, + last_removed_segno: 0, }) } @@ -459,6 +463,26 @@ impl Timeline { let shared_state = self.mutex.lock().unwrap(); shared_state.sk.wal_store.flush_lsn() } + + pub fn remove_old_wal(&self) -> Result<()> { + let horizon_segno: XLogSegNo; + let remover: Box Result<(), anyhow::Error>>; + { + let shared_state = self.mutex.lock().unwrap(); + horizon_segno = shared_state.sk.get_horizon_segno(); + remover = 
shared_state.sk.wal_store.remove_up_to(); + if horizon_segno <= 1 || horizon_segno <= shared_state.last_removed_segno { + return Ok(()); + } + // release the lock before removing + } + let _enter = + info_span!("", tenant = %self.zttid.tenant_id, timeline = %self.zttid.timeline_id) + .entered(); + remover(horizon_segno - 1)?; + self.mutex.lock().unwrap().last_removed_segno = horizon_segno; + Ok(()) + } } // Utilities needed by various Connection-like objects diff --git a/safekeeper/src/wal_storage.rs b/safekeeper/src/wal_storage.rs index 69a4fb11e1..503bd7c543 100644 --- a/safekeeper/src/wal_storage.rs +++ b/safekeeper/src/wal_storage.rs @@ -11,10 +11,12 @@ use anyhow::{anyhow, bail, Context, Result}; use std::io::{Read, Seek, SeekFrom}; use lazy_static::lazy_static; -use postgres_ffi::xlog_utils::{find_end_of_wal, XLogSegNo, PG_TLI}; +use postgres_ffi::xlog_utils::{ + find_end_of_wal, IsPartialXLogFileName, IsXLogFileName, XLogFromFileName, XLogSegNo, PG_TLI, +}; use std::cmp::min; -use std::fs::{self, File, OpenOptions}; +use std::fs::{self, remove_file, File, OpenOptions}; use std::io::Write; use std::path::{Path, PathBuf}; @@ -101,6 +103,10 @@ pub trait Storage { /// Durably store WAL on disk, up to the last written WAL record. fn flush_wal(&mut self) -> Result<()>; + + /// Remove all segments <= given segno. Returns closure as we want to do + /// that without timeline lock. + fn remove_up_to(&self) -> Box Result<()>>; } /// PhysicalStorage is a storage that stores WAL on disk. Writes are separated from flushes @@ -466,6 +472,44 @@ impl Storage for PhysicalStorage { self.update_flush_lsn(); Ok(()) } + + fn remove_up_to(&self) -> Box Result<()>> { + let timeline_dir = self.timeline_dir.clone(); + let wal_seg_size = self.wal_seg_size.unwrap(); + Box::new(move |segno_up_to: XLogSegNo| { + remove_up_to(&timeline_dir, wal_seg_size, segno_up_to) + }) + } +} + +/// Remove all WAL segments in timeline_dir <= given segno. 
+fn remove_up_to(timeline_dir: &Path, wal_seg_size: usize, segno_up_to: XLogSegNo) -> Result<()> { + let mut n_removed = 0; + for entry in fs::read_dir(&timeline_dir)? { + let entry = entry?; + let entry_path = entry.path(); + let fname = entry_path.file_name().unwrap(); + + if let Some(fname_str) = fname.to_str() { + /* Ignore files that are not XLOG segments */ + if !IsXLogFileName(fname_str) && !IsPartialXLogFileName(fname_str) { + continue; + } + let (segno, _) = XLogFromFileName(fname_str, wal_seg_size); + if segno <= segno_up_to { + remove_file(entry_path)?; + n_removed += 1; + } + } + } + let segno_from = segno_up_to - n_removed + 1; + info!( + "removed {} WAL segments [{}; {}]", + n_removed, + XLogFileName(PG_TLI, segno_from, wal_seg_size), + XLogFileName(PG_TLI, segno_up_to, wal_seg_size) + ); + Ok(()) } pub struct WalReader { diff --git a/test_runner/batch_others/test_wal_acceptor.py b/test_runner/batch_others/test_wal_acceptor.py index cc9ec9a275..395084af0e 100644 --- a/test_runner/batch_others/test_wal_acceptor.py +++ b/test_runner/batch_others/test_wal_acceptor.py @@ -370,6 +370,55 @@ def test_broker(zenith_env_builder: ZenithEnvBuilder): time.sleep(0.5) +# Test that old WAL consumed by peers and pageserver is removed from safekeepers. 
+@pytest.mark.skipif(etcd_path() is None, reason="requires etcd which is not present in PATH") +def test_wal_removal(zenith_env_builder: ZenithEnvBuilder): + zenith_env_builder.num_safekeepers = 2 + zenith_env_builder.broker = True + # to advance remote_consistent_lsn + zenith_env_builder.enable_local_fs_remote_storage() + env = zenith_env_builder.init_start() + + env.zenith_cli.create_branch('test_safekeepers_wal_removal') + pg = env.postgres.create_start('test_safekeepers_wal_removal') + + with closing(pg.connect()) as conn: + with conn.cursor() as cur: + # we rely upon autocommit after each statement + # as waiting for acceptors happens there + cur.execute('CREATE TABLE t(key int primary key, value text)') + cur.execute("INSERT INTO t SELECT generate_series(1,100000), 'payload'") + + tenant_id = pg.safe_psql("show zenith.zenith_tenant")[0][0] + timeline_id = pg.safe_psql("show zenith.zenith_timeline")[0][0] + + # force checkpoint to advance remote_consistent_lsn + with closing(env.pageserver.connect()) as psconn: + with psconn.cursor() as pscur: + pscur.execute(f"checkpoint {tenant_id} {timeline_id}") + + # We will wait for first segment removal. Make sure the segments exist to begin with. + first_segments = [ + os.path.join(sk.data_dir(), tenant_id, timeline_id, '000000010000000000000001') + for sk in env.safekeepers + ] + assert all(os.path.exists(p) for p in first_segments) + + http_cli = env.safekeepers[0].http_client() + # Pretend WAL is offloaded to s3. 
+ http_cli.record_safekeeper_info(tenant_id, timeline_id, {'s3_wal_lsn': 'FFFFFFFF/FEFFFFFF'}) + + # wait till first segment is removed on all safekeepers + started_at = time.time() + while True: + if all(not os.path.exists(p) for p in first_segments): + break + elapsed = time.time() - started_at + if elapsed > 20: + raise RuntimeError(f"timed out waiting {elapsed:.0f}s for first segment to get removed") + time.sleep(0.5) + + class ProposerPostgres(PgProtocol): """Object for running postgres without ZenithEnv""" def __init__(self, diff --git a/test_runner/fixtures/zenith_fixtures.py b/test_runner/fixtures/zenith_fixtures.py index d295a79953..e16d1acf2f 100644 --- a/test_runner/fixtures/zenith_fixtures.py +++ b/test_runner/fixtures/zenith_fixtures.py @@ -1738,6 +1738,9 @@ class Safekeeper: def http_client(self) -> SafekeeperHttpClient: return SafekeeperHttpClient(port=self.port.http) + def data_dir(self) -> str: + return os.path.join(self.env.repo_dir, "safekeepers", f"sk{self.id}") + @dataclass class SafekeeperTimelineStatus: @@ -1770,6 +1773,12 @@ class SafekeeperHttpClient(requests.Session): flush_lsn=resj['flush_lsn'], remote_consistent_lsn=resj['remote_consistent_lsn']) + def record_safekeeper_info(self, tenant_id: str, timeline_id: str, body): + res = self.post( + f"http://localhost:{self.port}/v1/record_safekeeper_info/{tenant_id}/{timeline_id}", + json=body) + res.raise_for_status() + def get_metrics(self) -> SafekeeperMetrics: request_result = self.get(f"http://localhost:{self.port}/metrics") request_result.raise_for_status() From b2e35fffa6743aa6a768337a3cd9ffdfa4f255aa Mon Sep 17 00:00:00 2001 From: Dhammika Pathirana Date: Wed, 20 Apr 2022 23:36:33 -0700 Subject: [PATCH 149/296] Fix ancestor layer traversal (#1484) Signed-off-by: Dhammika Pathirana --- pageserver/src/layered_repository.rs | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 
3afef51a23..0dc54385b2 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -1466,10 +1466,10 @@ impl LayeredTimeline { )?; cont_lsn = lsn_floor; path.push((result, cont_lsn, layer)); - } else if self.ancestor_timeline.is_some() { + } else if timeline.ancestor_timeline.is_some() { // Nothing on this timeline. Traverse to parent result = ValueReconstructResult::Continue; - cont_lsn = Lsn(self.ancestor_lsn.0 + 1); + cont_lsn = Lsn(timeline.ancestor_lsn.0 + 1); } else { // Nothing found result = ValueReconstructResult::Missing; From 6391862d8a791ee6d9377c588c1a4de08b13ed5a Mon Sep 17 00:00:00 2001 From: Dhammika Pathirana Date: Thu, 21 Apr 2022 11:50:38 -0700 Subject: [PATCH 150/296] Add branch traversal test Signed-off-by: Dhammika Pathirana --- .../batch_others/test_ancestor_branch.py | 111 ++++++++++++++++++ 1 file changed, 111 insertions(+) create mode 100644 test_runner/batch_others/test_ancestor_branch.py diff --git a/test_runner/batch_others/test_ancestor_branch.py b/test_runner/batch_others/test_ancestor_branch.py new file mode 100644 index 0000000000..fa12f25894 --- /dev/null +++ b/test_runner/batch_others/test_ancestor_branch.py @@ -0,0 +1,111 @@ +import subprocess +import asyncio +from contextlib import closing + +import psycopg2.extras +import pytest +from fixtures.log_helper import log +from fixtures.zenith_fixtures import ZenithEnvBuilder + + +# +# Create ancestor branches off the main branch. +# +def test_ancestor_branch(zenith_env_builder: ZenithEnvBuilder): + + # Use safekeeper in this test to avoid a subtle race condition. + # Without safekeeper, walreceiver reconnection can stuck + # because of IO deadlock. + # + # See https://github.com/zenithdb/zenith/issues/1068 + zenith_env_builder.num_safekeepers = 1 + env = zenith_env_builder.init() + + # Override defaults, 1M gc_horizon and 4M checkpoint_distance. + # Extend compaction_period and gc_period to disable background compaction and gc. 
+ env.pageserver.start(overrides=[ + '--pageserver-config-override="gc_period"="10 m"', + '--pageserver-config-override="gc_horizon"=1048576', + '--pageserver-config-override="checkpoint_distance"=4194304', + '--pageserver-config-override="compaction_period"="10 m"', + '--pageserver-config-override="compaction_threshold"=2' + ]) + env.safekeepers[0].start() + + pg_branch0 = env.postgres.create_start('main') + branch0_cur = pg_branch0.connect().cursor() + branch0_cur.execute("SHOW zenith.zenith_timeline") + branch0_timeline = branch0_cur.fetchone()[0] + log.info(f"b0 timeline {branch0_timeline}") + + # Create table, and insert 100k rows. + branch0_cur.execute('SELECT pg_current_wal_insert_lsn()') + branch0_lsn = branch0_cur.fetchone()[0] + log.info(f"b0 at lsn {branch0_lsn}") + + branch0_cur.execute('CREATE TABLE foo (t text) WITH (autovacuum_enabled = off)') + branch0_cur.execute(''' + INSERT INTO foo + SELECT '00112233445566778899AABBCCDDEEFF' || ':branch0:' || g + FROM generate_series(1, 100000) g + ''') + branch0_cur.execute('SELECT pg_current_wal_insert_lsn()') + lsn_100 = branch0_cur.fetchone()[0] + log.info(f'LSN after 100 rows: {lsn_100}') + + # Create branch1. + env.zenith_cli.create_branch('branch1', 'main', ancestor_start_lsn=lsn_100) + pg_branch1 = env.postgres.create_start('branch1') + log.info("postgres is running on 'branch1' branch") + + branch1_cur = pg_branch1.connect().cursor() + branch1_cur.execute("SHOW zenith.zenith_timeline") + branch1_timeline = branch1_cur.fetchone()[0] + log.info(f"b1 timeline {branch1_timeline}") + + branch1_cur.execute('SELECT pg_current_wal_insert_lsn()') + branch1_lsn = branch1_cur.fetchone()[0] + log.info(f"b1 at lsn {branch1_lsn}") + + # Insert 100k rows. 
+ branch1_cur.execute(''' + INSERT INTO foo + SELECT '00112233445566778899AABBCCDDEEFF' || ':branch1:' || g + FROM generate_series(1, 100000) g + ''') + branch1_cur.execute('SELECT pg_current_wal_insert_lsn()') + lsn_200 = branch1_cur.fetchone()[0] + log.info(f'LSN after 100 rows: {lsn_200}') + + # Create branch2. + env.zenith_cli.create_branch('branch2', 'branch1', ancestor_start_lsn=lsn_200) + pg_branch2 = env.postgres.create_start('branch2') + log.info("postgres is running on 'branch1' branch") + + branch2_cur = pg_branch2.connect().cursor() + branch2_cur.execute("SHOW zenith.zenith_timeline") + branch2_lsn = branch2_cur.fetchone()[0] + log.info(f"b2 timeline {branch1_timeline}") + + branch2_cur.execute('SELECT pg_current_wal_insert_lsn()') + branch2_lsn = branch2_cur.fetchone()[0] + log.info(f"b2 at lsn {branch2_lsn}") + + # Insert 100k rows. + branch2_cur.execute(''' + INSERT INTO foo + SELECT '00112233445566778899AABBCCDDEEFF' || ':branch2:' || g + FROM generate_series(1, 100000) g + ''') + branch2_cur.execute('SELECT pg_current_wal_insert_lsn()') + lsn_300 = branch2_cur.fetchone()[0] + log.info(f'LSN after 300 rows: {lsn_300}') + + branch0_cur.execute('SELECT count(*) FROM foo') + assert branch0_cur.fetchone() == (100000, ) + + branch1_cur.execute('SELECT count(*) FROM foo') + assert branch1_cur.fetchone() == (200000, ) + + branch2_cur.execute('SELECT count(*) FROM foo') + assert branch2_cur.fetchone() == (300000, ) From aeb4f81c3bb74e4b0adc570f760e785bf8463533 Mon Sep 17 00:00:00 2001 From: Dhammika Pathirana Date: Thu, 21 Apr 2022 21:04:00 -0700 Subject: [PATCH 151/296] Add branch traversal unit test Signed-off-by: Dhammika Pathirana --- pageserver/src/layered_repository.rs | 57 ++++++++++++++++++++++++++++ 1 file changed, 57 insertions(+) diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 0dc54385b2..679daa8248 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ 
-2630,4 +2630,61 @@ pub mod tests { Ok(()) } + + #[test] + fn test_traverse_ancestors() -> Result<()> { + let repo = RepoHarness::create("test_traverse_ancestors")?.load(); + let mut tline = repo.create_empty_timeline(TIMELINE_ID, Lsn(0))?; + + const NUM_KEYS: usize = 100; + const NUM_TLINES: usize = 50; + + let mut test_key = Key::from_hex("012222222233333333444444445500000000").unwrap(); + // Track page mutation lsns across different timelines. + let mut updated = [[Lsn(0); NUM_KEYS]; NUM_TLINES]; + + let mut lsn = Lsn(0); + let mut tline_id = TIMELINE_ID; + + #[allow(clippy::needless_range_loop)] + for idx in 0..NUM_TLINES { + let new_tline_id = ZTimelineId::generate(); + repo.branch_timeline(tline_id, new_tline_id, lsn)?; + tline = repo.get_timeline_load(new_tline_id)?; + tline_id = new_tline_id; + + for _ in 0..NUM_KEYS { + lsn = Lsn(lsn.0 + 0x10); + let blknum = thread_rng().gen_range(0..NUM_KEYS); + test_key.field6 = blknum as u32; + let writer = tline.writer(); + writer.put( + test_key, + lsn, + Value::Image(TEST_IMG(&format!("{} {} at {}", idx, blknum, lsn))), + )?; + println!("updating [{}][{}] at {}", idx, blknum, lsn); + writer.finish_write(lsn); + drop(writer); + updated[idx][blknum] = lsn; + } + } + + // Read pages from leaf timeline across all ancestors. + for (idx, lsns) in updated.iter().enumerate() { + for (blknum, lsn) in lsns.iter().enumerate() { + // Skip empty mutations. 
+ if lsn.0 == 0 { + continue; + } + println!("checking [{}][{}] at {}", idx, blknum, lsn); + test_key.field6 = blknum as u32; + assert_eq!( + tline.get(test_key, *lsn)?, + TEST_IMG(&format!("{} {} at {}", idx, blknum, lsn)) + ); + } + } + Ok(()) + } } From 091cefaa92afdecb8a260729ae39270b6a45193f Mon Sep 17 00:00:00 2001 From: Dhammika Pathirana Date: Fri, 22 Apr 2022 17:17:44 -0700 Subject: [PATCH 152/296] Fix add compaction for key partitioning Signed-off-by: Dhammika Pathirana --- .../batch_others/test_ancestor_branch.py | 21 ++++++++++++------- 1 file changed, 13 insertions(+), 8 deletions(-) diff --git a/test_runner/batch_others/test_ancestor_branch.py b/test_runner/batch_others/test_ancestor_branch.py index fa12f25894..1e96369314 100644 --- a/test_runner/batch_others/test_ancestor_branch.py +++ b/test_runner/batch_others/test_ancestor_branch.py @@ -28,7 +28,8 @@ def test_ancestor_branch(zenith_env_builder: ZenithEnvBuilder): '--pageserver-config-override="gc_horizon"=1048576', '--pageserver-config-override="checkpoint_distance"=4194304', '--pageserver-config-override="compaction_period"="10 m"', - '--pageserver-config-override="compaction_threshold"=2' + '--pageserver-config-override="compaction_threshold"=2', + '--pageserver-config-override="compaction_target_size"=4194304' ]) env.safekeepers[0].start() @@ -51,7 +52,7 @@ def test_ancestor_branch(zenith_env_builder: ZenithEnvBuilder): ''') branch0_cur.execute('SELECT pg_current_wal_insert_lsn()') lsn_100 = branch0_cur.fetchone()[0] - log.info(f'LSN after 100 rows: {lsn_100}') + log.info(f'LSN after 100k rows: {lsn_100}') # Create branch1. env.zenith_cli.create_branch('branch1', 'main', ancestor_start_lsn=lsn_100) @@ -75,17 +76,17 @@ def test_ancestor_branch(zenith_env_builder: ZenithEnvBuilder): ''') branch1_cur.execute('SELECT pg_current_wal_insert_lsn()') lsn_200 = branch1_cur.fetchone()[0] - log.info(f'LSN after 100 rows: {lsn_200}') + log.info(f'LSN after 200k rows: {lsn_200}') # Create branch2. 
env.zenith_cli.create_branch('branch2', 'branch1', ancestor_start_lsn=lsn_200) pg_branch2 = env.postgres.create_start('branch2') - log.info("postgres is running on 'branch1' branch") - + log.info("postgres is running on 'branch2' branch") branch2_cur = pg_branch2.connect().cursor() + branch2_cur.execute("SHOW zenith.zenith_timeline") - branch2_lsn = branch2_cur.fetchone()[0] - log.info(f"b2 timeline {branch1_timeline}") + branch2_timeline = branch2_cur.fetchone()[0] + log.info(f"b2 timeline {branch2_timeline}") branch2_cur.execute('SELECT pg_current_wal_insert_lsn()') branch2_lsn = branch2_cur.fetchone()[0] @@ -99,7 +100,11 @@ def test_ancestor_branch(zenith_env_builder: ZenithEnvBuilder): ''') branch2_cur.execute('SELECT pg_current_wal_insert_lsn()') lsn_300 = branch2_cur.fetchone()[0] - log.info(f'LSN after 300 rows: {lsn_300}') + log.info(f'LSN after 300k rows: {lsn_300}') + + # Run compaction on branch1. + psconn = env.pageserver.connect() + psconn.cursor().execute(f'''compact {env.initial_tenant.hex} {branch1_timeline} {lsn_200}''') branch0_cur.execute('SELECT count(*) FROM foo') assert branch0_cur.fetchone() == (100000, ) From 66694e736a2e53bd611198507bc9efdb9770921c Mon Sep 17 00:00:00 2001 From: Dhammika Pathirana Date: Mon, 25 Apr 2022 13:55:00 -0700 Subject: [PATCH 153/296] Fix add ps tenant config Signed-off-by: Dhammika Pathirana --- .../batch_others/test_ancestor_branch.py | 34 ++++++++++--------- 1 file changed, 18 insertions(+), 16 deletions(-) diff --git a/test_runner/batch_others/test_ancestor_branch.py b/test_runner/batch_others/test_ancestor_branch.py index 1e96369314..aeb45348ad 100644 --- a/test_runner/batch_others/test_ancestor_branch.py +++ b/test_runner/batch_others/test_ancestor_branch.py @@ -19,21 +19,22 @@ def test_ancestor_branch(zenith_env_builder: ZenithEnvBuilder): # # See https://github.com/zenithdb/zenith/issues/1068 zenith_env_builder.num_safekeepers = 1 - env = zenith_env_builder.init() + env = zenith_env_builder.init_start() # 
Override defaults, 1M gc_horizon and 4M checkpoint_distance. # Extend compaction_period and gc_period to disable background compaction and gc. - env.pageserver.start(overrides=[ - '--pageserver-config-override="gc_period"="10 m"', - '--pageserver-config-override="gc_horizon"=1048576', - '--pageserver-config-override="checkpoint_distance"=4194304', - '--pageserver-config-override="compaction_period"="10 m"', - '--pageserver-config-override="compaction_threshold"=2', - '--pageserver-config-override="compaction_target_size"=4194304' - ]) - env.safekeepers[0].start() + tenant = env.zenith_cli.create_tenant( + conf={ + 'gc_period': '10 m', + 'gc_horizon': '1048576', + 'checkpoint_distance': '4194304', + 'compaction_period': '10 m', + 'compaction_threshold': '2', + 'compaction_target_size': '4194304', + }) - pg_branch0 = env.postgres.create_start('main') + env.zenith_cli.create_timeline(f'main', tenant_id=tenant) + pg_branch0 = env.postgres.create_start('main', tenant_id=tenant) branch0_cur = pg_branch0.connect().cursor() branch0_cur.execute("SHOW zenith.zenith_timeline") branch0_timeline = branch0_cur.fetchone()[0] @@ -55,8 +56,8 @@ def test_ancestor_branch(zenith_env_builder: ZenithEnvBuilder): log.info(f'LSN after 100k rows: {lsn_100}') # Create branch1. - env.zenith_cli.create_branch('branch1', 'main', ancestor_start_lsn=lsn_100) - pg_branch1 = env.postgres.create_start('branch1') + env.zenith_cli.create_branch('branch1', 'main', tenant_id=tenant, ancestor_start_lsn=lsn_100) + pg_branch1 = env.postgres.create_start('branch1', tenant_id=tenant) log.info("postgres is running on 'branch1' branch") branch1_cur = pg_branch1.connect().cursor() @@ -79,8 +80,8 @@ def test_ancestor_branch(zenith_env_builder: ZenithEnvBuilder): log.info(f'LSN after 200k rows: {lsn_200}') # Create branch2. 
- env.zenith_cli.create_branch('branch2', 'branch1', ancestor_start_lsn=lsn_200) - pg_branch2 = env.postgres.create_start('branch2') + env.zenith_cli.create_branch('branch2', 'branch1', tenant_id=tenant, ancestor_start_lsn=lsn_200) + pg_branch2 = env.postgres.create_start('branch2', tenant_id=tenant) log.info("postgres is running on 'branch2' branch") branch2_cur = pg_branch2.connect().cursor() @@ -104,7 +105,8 @@ def test_ancestor_branch(zenith_env_builder: ZenithEnvBuilder): # Run compaction on branch1. psconn = env.pageserver.connect() - psconn.cursor().execute(f'''compact {env.initial_tenant.hex} {branch1_timeline} {lsn_200}''') + log.info(f'compact {tenant.hex} {branch1_timeline} {lsn_200}') + psconn.cursor().execute(f'''compact {tenant.hex} {branch1_timeline} {lsn_200}''') branch0_cur.execute('SELECT count(*) FROM foo') assert branch0_cur.fetchone() == (100000, ) From 695b5f9d88c33b4c141a9d701b9e43ecb9f49f81 Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Wed, 27 Apr 2022 13:42:48 +0300 Subject: [PATCH 154/296] Remove obsolete failpoint in proxy When failpoint feature is disabled it throws away passed code so code inside is not guaranteed to compile when feature is disabled. In this particular case code is obsolete so removing it. 
--- Cargo.lock | 1 - proxy/Cargo.toml | 1 - proxy/src/auth/credentials.rs | 4 ---- 3 files changed, 6 deletions(-) diff --git a/Cargo.lock b/Cargo.lock index 3797e4e76b..bac5dfb674 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -2002,7 +2002,6 @@ dependencies = [ "base64", "bytes", "clap 3.0.14", - "fail", "futures", "hashbrown", "hex", diff --git a/proxy/Cargo.toml b/proxy/Cargo.toml index 25aebc03e8..f7e872ceb9 100644 --- a/proxy/Cargo.toml +++ b/proxy/Cargo.toml @@ -8,7 +8,6 @@ anyhow = "1.0" base64 = "0.13.0" bytes = { version = "1.0.1", features = ['serde'] } clap = "3.0" -fail = "0.5.0" futures = "0.3.13" hashbrown = "0.11.2" hex = "0.4.3" diff --git a/proxy/src/auth/credentials.rs b/proxy/src/auth/credentials.rs index 7c8ba28622..c3bb6da4f8 100644 --- a/proxy/src/auth/credentials.rs +++ b/proxy/src/auth/credentials.rs @@ -48,10 +48,6 @@ impl ClientCredentials { config: &ProxyConfig, client: &mut PqStream, ) -> Result { - fail::fail_point!("proxy-authenticate", |_| { - Err(AuthError::auth_failed("failpoint triggered")) - }); - use crate::config::ClientAuthMethod::*; use crate::config::RouterConfig::*; match &config.router_config { From 29539b056100c7c0b3574ec13789ef91e9d748d9 Mon Sep 17 00:00:00 2001 From: Arthur Petukhovsky Date: Wed, 27 Apr 2022 19:09:28 +0300 Subject: [PATCH 155/296] Set wal_keep_size to zero (#1507) wal_keep_size is already set to 0 in our cloud setup, but we don't use this value in tests. This commit fixes wal_keep_size in control_plane and adds tests for WAL recycling and lagging safekeepers. 
--- control_plane/src/compute.rs | 7 +-- test_runner/batch_others/test_wal_acceptor.py | 55 ++++++++++++++++++- .../batch_others/test_wal_acceptor_async.py | 37 ++++++++++--- test_runner/fixtures/utils.py | 11 ++++ 4 files changed, 95 insertions(+), 15 deletions(-) diff --git a/control_plane/src/compute.rs b/control_plane/src/compute.rs index 2549baca5d..92d0e080d8 100644 --- a/control_plane/src/compute.rs +++ b/control_plane/src/compute.rs @@ -273,12 +273,7 @@ impl PostgresNode { conf.append("wal_sender_timeout", "5s"); conf.append("listen_addresses", &self.address.ip().to_string()); conf.append("port", &self.address.port().to_string()); - - // Never clean up old WAL. TODO: We should use a replication - // slot or something proper, to prevent the compute node - // from removing WAL that hasn't been streamed to the safekeeper or - // page server yet. (gh issue #349) - conf.append("wal_keep_size", "10TB"); + conf.append("wal_keep_size", "0"); // Configure the node to fetch pages from pageserver let pageserver_connstr = { diff --git a/test_runner/batch_others/test_wal_acceptor.py b/test_runner/batch_others/test_wal_acceptor.py index 395084af0e..94059e2a4c 100644 --- a/test_runner/batch_others/test_wal_acceptor.py +++ b/test_runner/batch_others/test_wal_acceptor.py @@ -13,7 +13,7 @@ from dataclasses import dataclass, field from multiprocessing import Process, Value from pathlib import Path from fixtures.zenith_fixtures import PgBin, Postgres, Safekeeper, ZenithEnv, ZenithEnvBuilder, PortDistributor, SafekeeperPort, zenith_binpath, PgProtocol -from fixtures.utils import etcd_path, lsn_to_hex, mkdir_if_needed, lsn_from_hex +from fixtures.utils import etcd_path, get_dir_size, lsn_to_hex, mkdir_if_needed, lsn_from_hex from fixtures.log_helper import log from typing import List, Optional, Any @@ -791,3 +791,56 @@ def test_replace_safekeeper(zenith_env_builder: ZenithEnvBuilder): env.safekeepers[1].stop(immediate=True) execute_payload(pg) show_statuses(env.safekeepers, 
tenant_id, timeline_id) + + +# We have `wal_keep_size=0`, so postgres should trim WAL once it's broadcasted +# to all safekeepers. This test checks that compute WAL can fit into small number +# of WAL segments. +def test_wal_deleted_after_broadcast(zenith_env_builder: ZenithEnvBuilder): + # used to calculate delta in collect_stats + last_lsn = .0 + + # returns LSN and pg_wal size, all in MB + def collect_stats(pg: Postgres, cur, enable_logs=True): + nonlocal last_lsn + assert pg.pgdata_dir is not None + + log.info('executing INSERT to generate WAL') + cur.execute("select pg_current_wal_lsn()") + current_lsn = lsn_from_hex(cur.fetchone()[0]) / 1024 / 1024 + pg_wal_size = get_dir_size(os.path.join(pg.pgdata_dir, 'pg_wal')) / 1024 / 1024 + if enable_logs: + log.info(f"LSN delta: {current_lsn - last_lsn} MB, current WAL size: {pg_wal_size} MB") + last_lsn = current_lsn + return current_lsn, pg_wal_size + + # generates about ~20MB of WAL, to create at least one new segment + def generate_wal(cur): + cur.execute("INSERT INTO t SELECT generate_series(1,300000), 'payload'") + + zenith_env_builder.num_safekeepers = 3 + env = zenith_env_builder.init_start() + + env.zenith_cli.create_branch('test_wal_deleted_after_broadcast') + # Adjust checkpoint config to prevent keeping old WAL segments + pg = env.postgres.create_start( + 'test_wal_deleted_after_broadcast', + config_lines=['min_wal_size=32MB', 'max_wal_size=32MB', 'log_checkpoints=on']) + + pg_conn = pg.connect() + cur = pg_conn.cursor() + cur.execute('CREATE TABLE t(key int, value text)') + + collect_stats(pg, cur) + + # generate WAL to simulate normal workload + for i in range(5): + generate_wal(cur) + collect_stats(pg, cur) + + log.info('executing checkpoint') + cur.execute('CHECKPOINT') + wal_size_after_checkpoint = collect_stats(pg, cur)[1] + + # there shouldn't be more than 2 WAL segments (but dir may have archive_status files) + assert wal_size_after_checkpoint < 16 * 2.5 diff --git 
a/test_runner/batch_others/test_wal_acceptor_async.py b/test_runner/batch_others/test_wal_acceptor_async.py index e3df8ea3eb..c484b6401c 100644 --- a/test_runner/batch_others/test_wal_acceptor_async.py +++ b/test_runner/batch_others/test_wal_acceptor_async.py @@ -139,13 +139,12 @@ async def wait_for_lsn(safekeeper: Safekeeper, async def run_restarts_under_load(env: ZenithEnv, pg: Postgres, acceptors: List[Safekeeper], - n_workers=10): - n_accounts = 100 - init_amount = 100000 - max_transfer = 100 - period_time = 4 - iterations = 10 - + n_workers=10, + n_accounts=100, + init_amount=100000, + max_transfer=100, + period_time=4, + iterations=10): # Set timeout for this test at 5 minutes. It should be enough for test to complete # and less than CircleCI's no_output_timeout, taking into account that this timeout # is checked only at the beginning of every iteration. @@ -202,7 +201,7 @@ async def run_restarts_under_load(env: ZenithEnv, await pg_conn.close() -# restart acceptors one by one, while executing and validating bank transactions +# Restart acceptors one by one, while executing and validating bank transactions def test_restarts_under_load(zenith_env_builder: ZenithEnvBuilder): zenith_env_builder.num_safekeepers = 3 env = zenith_env_builder.init_start() @@ -213,3 +212,25 @@ def test_restarts_under_load(zenith_env_builder: ZenithEnvBuilder): config_lines=['max_replication_write_lag=1MB']) asyncio.run(run_restarts_under_load(env, pg, env.safekeepers)) + + +# Restart acceptors one by one and test that everything is working as expected +# when checkpoints are triggered frequently by max_wal_size=32MB. Because we have +# wal_keep_size=0, there will be aggressive WAL segments recycling. 
+def test_restarts_frequent_checkpoints(zenith_env_builder: ZenithEnvBuilder): + zenith_env_builder.num_safekeepers = 3 + env = zenith_env_builder.init_start() + + env.zenith_cli.create_branch('test_restarts_frequent_checkpoints') + # Enable backpressure with 1MB maximal lag, because we don't want to block on `wait_for_lsn()` for too long + pg = env.postgres.create_start('test_restarts_frequent_checkpoints', + config_lines=[ + 'max_replication_write_lag=1MB', + 'min_wal_size=32MB', + 'max_wal_size=32MB', + 'log_checkpoints=on' + ]) + + # we try to simulate large (flush_lsn - truncate_lsn) lag, to test that WAL segments + # are not removed before broadcasted to all safekeepers, with the help of replication slot + asyncio.run(run_restarts_under_load(env, pg, env.safekeepers, period_time=15, iterations=5)) diff --git a/test_runner/fixtures/utils.py b/test_runner/fixtures/utils.py index f16fe1d9cf..98af511036 100644 --- a/test_runner/fixtures/utils.py +++ b/test_runner/fixtures/utils.py @@ -82,3 +82,14 @@ def print_gc_result(row): # path to etcd binary or None if not present. def etcd_path(): return shutil.which("etcd") + + +# Traverse directory to get total size. +def get_dir_size(path: str) -> int: + """Return size in bytes.""" + totalbytes = 0 + for root, dirs, files in os.walk(path): + for name in files: + totalbytes += os.path.getsize(os.path.join(root, name)) + + return totalbytes From 5c5c3c64f3153b4b67c0ed4f51d4ab14c8aa1da2 Mon Sep 17 00:00:00 2001 From: Anastasia Lubennikova Date: Tue, 26 Apr 2022 19:35:07 +0300 Subject: [PATCH 156/296] Fix tenant config parsing. 
Add a test --- Cargo.lock | 11 ++++ pageserver/Cargo.toml | 1 + pageserver/src/config.rs | 2 +- pageserver/src/layered_repository.rs | 4 +- pageserver/src/tenant_config.rs | 6 ++ test_runner/batch_others/test_tenant_conf.py | 59 +++++++++++++------- 6 files changed, 61 insertions(+), 22 deletions(-) diff --git a/Cargo.lock b/Cargo.lock index bac5dfb674..58125ca41c 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -1073,6 +1073,16 @@ version = "2.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9a3a5bfb195931eeb336b2a7b4d761daec841b97f947d34394601737a7bba5e4" +[[package]] +name = "humantime-serde" +version = "1.1.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "57a3db5ea5923d99402c94e9feb261dc5ee9b4efa158b0315f788cf549cc200c" +dependencies = [ + "humantime", + "serde", +] + [[package]] name = "hyper" version = "0.14.17" @@ -1626,6 +1636,7 @@ dependencies = [ "hex", "hex-literal", "humantime", + "humantime-serde", "hyper", "itertools", "lazy_static", diff --git a/pageserver/Cargo.toml b/pageserver/Cargo.toml index 6648d8417a..5607baf698 100644 --- a/pageserver/Cargo.toml +++ b/pageserver/Cargo.toml @@ -35,6 +35,7 @@ humantime = "2.1.0" serde = { version = "1.0", features = ["derive"] } serde_json = "1" serde_with = "1.12.0" +humantime-serde = "1.1.1" pprof = { git = "https://github.com/neondatabase/pprof-rs.git", branch = "wallclock-profiling", features = ["flamegraph"], optional = true } diff --git a/pageserver/src/config.rs b/pageserver/src/config.rs index 8bfe8b57ec..aed7eabb76 100644 --- a/pageserver/src/config.rs +++ b/pageserver/src/config.rs @@ -439,7 +439,7 @@ impl PageServerConf { "remote_storage" => { builder.remote_storage_config(Some(Self::parse_remote_storage_config(item)?)) } - "tenant_conf" => { + "tenant_config" => { t_conf = Self::parse_toml_tenant_conf(item)?; } "id" => builder.id(ZNodeId(parse_toml_u64(key, item)?)), diff --git a/pageserver/src/layered_repository.rs 
b/pageserver/src/layered_repository.rs index 679daa8248..d9e1244f2e 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -690,7 +690,7 @@ impl LayeredRepository { let mut tenant_conf: TenantConfOpt = Default::default(); for (key, item) in toml.iter() { match key { - "tenant_conf" => { + "tenant_config" => { tenant_conf = PageServerConf::parse_toml_tenant_conf(item)?; } _ => bail!("unrecognized pageserver option '{}'", key), @@ -712,7 +712,7 @@ impl LayeredRepository { let mut conf_content = r#"# This file contains a specific per-tenant's config. # It is read in case of pageserver restart. -# [tenant_config] +[tenant_config] "# .to_string(); diff --git a/pageserver/src/tenant_config.rs b/pageserver/src/tenant_config.rs index 818b6de1b1..a175f6abbe 100644 --- a/pageserver/src/tenant_config.rs +++ b/pageserver/src/tenant_config.rs @@ -47,6 +47,7 @@ pub struct TenantConf { // This parameter determines L1 layer file size. pub compaction_target_size: u64, // How often to check if there's compaction work to be done. + #[serde(with = "humantime_serde")] pub compaction_period: Duration, // Level0 delta layer threshold for compaction. pub compaction_threshold: usize, @@ -56,11 +57,13 @@ pub struct TenantConf { // Page versions older than this are garbage collected away. pub gc_horizon: u64, // Interval at which garbage collection is triggered. + #[serde(with = "humantime_serde")] pub gc_period: Duration, // Determines how much history is retained, to allow // branching and read replicas at an older point in time. // The unit is time. // Page versions older than this are garbage collected away. 
+ #[serde(with = "humantime_serde")] pub pitr_interval: Duration, } @@ -70,10 +73,13 @@ pub struct TenantConf { pub struct TenantConfOpt { pub checkpoint_distance: Option, pub compaction_target_size: Option, + #[serde(with = "humantime_serde")] pub compaction_period: Option, pub compaction_threshold: Option, pub gc_horizon: Option, + #[serde(with = "humantime_serde")] pub gc_period: Option, + #[serde(with = "humantime_serde")] pub pitr_interval: Option, } diff --git a/test_runner/batch_others/test_tenant_conf.py b/test_runner/batch_others/test_tenant_conf.py index f74e6aad1d..64359a1dc3 100644 --- a/test_runner/batch_others/test_tenant_conf.py +++ b/test_runner/batch_others/test_tenant_conf.py @@ -3,21 +3,22 @@ from contextlib import closing import pytest from fixtures.zenith_fixtures import ZenithEnvBuilder +from fixtures.log_helper import log def test_tenant_config(zenith_env_builder: ZenithEnvBuilder): + # set some non-default global config + zenith_env_builder.pageserver_config_override = ''' +page_cache_size=444; +wait_lsn_timeout='111 s'; +tenant_config={checkpoint_distance = 10000, compaction_target_size = 1048576}''' + env = zenith_env_builder.init_start() """Test per tenant configuration""" - tenant = env.zenith_cli.create_tenant( - conf={ - 'checkpoint_distance': '10000', - 'compaction_target_size': '1048576', - 'compaction_period': '60sec', - 'compaction_threshold': '20', - 'gc_horizon': '1024', - 'gc_period': '100sec', - 'pitr_interval': '3600sec', - }) + tenant = env.zenith_cli.create_tenant(conf={ + 'checkpoint_distance': '20000', + 'gc_period': '30sec', + }) env.zenith_cli.create_timeline(f'test_tenant_conf', tenant_id=tenant) pg = env.postgres.create_start( @@ -26,24 +27,44 @@ def test_tenant_config(zenith_env_builder: ZenithEnvBuilder): tenant, ) + # check the configuration of the default tenant + # it should match global configuration + with closing(env.pageserver.connect()) as psconn: + with psconn.cursor() as pscur: + pscur.execute(f"show 
{env.initial_tenant.hex}") + res = pscur.fetchone() + log.info(f"initial_tenant res: {res}") + assert res == (10000, 1048576, 1, 10, 67108864, 100, 2592000) + + # check the configuration of the new tenant with closing(env.pageserver.connect()) as psconn: with psconn.cursor() as pscur: pscur.execute(f"show {tenant.hex}") - assert pscur.fetchone() == (10000, 1048576, 60, 20, 1024, 100, 3600) + res = pscur.fetchone() + log.info(f"res: {res}") + assert res == (20000, 1048576, 1, 10, 67108864, 30, 2592000) # update the config and ensure that it has changed env.zenith_cli.config_tenant(tenant_id=tenant, conf={ - 'checkpoint_distance': '100000', - 'compaction_target_size': '1048576', - 'compaction_period': '30sec', - 'compaction_threshold': '15', - 'gc_horizon': '256', - 'gc_period': '10sec', - 'pitr_interval': '360sec', + 'checkpoint_distance': '15000', + 'gc_period': '80sec', }) with closing(env.pageserver.connect()) as psconn: with psconn.cursor() as pscur: pscur.execute(f"show {tenant.hex}") - assert pscur.fetchone() == (100000, 1048576, 30, 15, 256, 10, 360) + res = pscur.fetchone() + log.info(f"after config res: {res}") + assert res == (15000, 1048576, 1, 10, 67108864, 80, 2592000) + + # restart the pageserver and ensure that the config is still correct + env.pageserver.stop() + env.pageserver.start() + + with closing(env.pageserver.connect()) as psconn: + with psconn.cursor() as pscur: + pscur.execute(f"show {tenant.hex}") + res = pscur.fetchone() + log.info(f"after restart res: {res}") + assert res == (15000, 1048576, 1, 10, 67108864, 80, 2592000) From 4a46b01caf1ad039c3a0f06f68dae54fe95b7b2c Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Wed, 27 Apr 2022 11:16:44 +0300 Subject: [PATCH 157/296] Properly populate local timeline map --- pageserver/src/bin/pageserver.rs | 51 +---- pageserver/src/http/routes.rs | 6 +- pageserver/src/layered_repository.rs | 2 +- pageserver/src/page_service.rs | 10 +- pageserver/src/tenant_mgr.rs | 299 ++++++++++++++++----------- 
pageserver/src/timelines.rs | 18 +- pageserver/src/walreceiver.rs | 2 +- 7 files changed, 207 insertions(+), 181 deletions(-) diff --git a/pageserver/src/bin/pageserver.rs b/pageserver/src/bin/pageserver.rs index 5c135e4eb4..728dcb53de 100644 --- a/pageserver/src/bin/pageserver.rs +++ b/pageserver/src/bin/pageserver.rs @@ -10,10 +10,7 @@ use daemonize::Daemonize; use pageserver::{ config::{defaults::*, PageServerConf}, - http, page_cache, page_service, profiling, - remote_storage::{self, SyncStartupData}, - repository::{Repository, TimelineSyncStatusUpdate}, - tenant_mgr, thread_mgr, + http, page_cache, page_service, profiling, tenant_mgr, thread_mgr, thread_mgr::ThreadKind, timelines, virtual_file, LOG_FILE_NAME, }; @@ -235,47 +232,8 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<() let signals = signals::install_shutdown_handlers()?; - // Initialize repositories with locally available timelines. - // Timelines that are only partially available locally (remote storage has more data than this pageserver) - // are scheduled for download and added to the repository once download is completed. - let SyncStartupData { - remote_index, - local_timeline_init_statuses, - } = remote_storage::start_local_timeline_sync(conf) - .context("Failed to set up local files sync with external storage")?; - - for (tenant_id, local_timeline_init_statuses) in local_timeline_init_statuses { - // initialize local tenant - let repo = tenant_mgr::load_local_repo(conf, tenant_id, &remote_index) - .with_context(|| format!("Failed to load repo for tenant {}", tenant_id))?; - for (timeline_id, init_status) in local_timeline_init_statuses { - match init_status { - remote_storage::LocalTimelineInitStatus::LocallyComplete => { - debug!("timeline {} for tenant {} is locally complete, registering it in repository", timeline_id, tenant_id); - // Lets fail here loudly to be on the safe side. 
- // XXX: It may be a better api to actually distinguish between repository startup - // and processing of newly downloaded timelines. - repo.apply_timeline_remote_sync_status_update( - timeline_id, - TimelineSyncStatusUpdate::Downloaded, - ) - .with_context(|| { - format!( - "Failed to bootstrap timeline {} for tenant {}", - timeline_id, tenant_id - ) - })? - } - remote_storage::LocalTimelineInitStatus::NeedsSync => { - debug!( - "timeline {} for tenant {} needs sync, \ - so skipped for adding into repository until sync is finished", - tenant_id, timeline_id - ); - } - } - } - } + // start profiler (if enabled) + let profiler_guard = profiling::init_profiler(conf); // initialize authentication for incoming connections let auth = match &conf.auth_type { @@ -288,8 +246,7 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<() }; info!("Using auth: {:#?}", conf.auth_type); - // start profiler (if enabled) - let profiler_guard = profiling::init_profiler(conf); + let remote_index = tenant_mgr::init_tenant_mgr(conf)?; // Spawn a new thread for the http endpoint // bind before launching separate thread so the error reported before startup exits diff --git a/pageserver/src/http/routes.rs b/pageserver/src/http/routes.rs index 05485ef3b6..f1b482cf50 100644 --- a/pageserver/src/http/routes.rs +++ b/pageserver/src/http/routes.rs @@ -244,7 +244,7 @@ async fn timeline_attach_handler(request: Request) -> Result) -> Result, A crate::tenant_mgr::list_tenants() }) .await - .map_err(ApiError::from_err)??; + .map_err(ApiError::from_err)?; json_response(StatusCode::OK, response_data) } @@ -377,7 +377,7 @@ async fn tenant_create_handler(mut request: Request) -> Result> = Mutex::new(HashMap::new()); +mod tenants_state { + use std::{ + collections::HashMap, + sync::{RwLock, RwLockReadGuard, RwLockWriteGuard}, + }; + + use utils::zid::ZTenantId; + + use crate::tenant_mgr::Tenant; + + lazy_static::lazy_static! 
{ + static ref TENANTS: RwLock> = RwLock::new(HashMap::new()); + } + + pub(super) fn read_tenants() -> RwLockReadGuard<'static, HashMap> { + TENANTS + .read() + .expect("Failed to read() tenants lock, it got poisoned") + } + + pub(super) fn write_tenants() -> RwLockWriteGuard<'static, HashMap> { + TENANTS + .write() + .expect("Failed to write() tenants lock, it got poisoned") + } } struct Tenant { state: TenantState, + /// Contains in-memory state, including the timeline that might not yet be flushed to disk or loaded from disk. repo: Arc, - - timelines: HashMap>, + /// Timelines, located locally in the pageserver's datadir. + /// Whatever manipulations happen, local timelines are not removed, only incremented with files. + /// + /// Local timelines have more metadata that's loaded into memory, + /// that is located in the `repo.timelines` field, [`crate::layered_repository::LayeredTimelineEntry`]. + local_timelines: HashMap>, } #[derive(Debug, Serialize, Deserialize, Clone, Copy, PartialEq, Eq)] @@ -60,43 +88,17 @@ impl fmt::Display for TenantState { } } -fn access_tenants() -> MutexGuard<'static, HashMap> { - TENANTS.lock().unwrap() -} - -// Sets up wal redo manager and repository for tenant. Reduces code duplication. -// Used during pageserver startup, or when new tenant is attached to pageserver. -pub fn load_local_repo( - conf: &'static PageServerConf, - tenant_id: ZTenantId, - remote_index: &RemoteIndex, -) -> Result> { - let mut m = access_tenants(); - let tenant = m.entry(tenant_id).or_insert_with(|| { - // Set up a WAL redo manager, for applying WAL records. - let walredo_mgr = PostgresRedoManager::new(conf, tenant_id); - - // Set up an object repository, for actual data storage. 
- let repo: Arc = Arc::new(LayeredRepository::new( - conf, - Default::default(), - Arc::new(walredo_mgr), - tenant_id, - remote_index.clone(), - conf.remote_storage_config.is_some(), - )); - Tenant { - state: TenantState::Idle, - repo, - timelines: HashMap::new(), - } - }); - - // Restore tenant config - let tenant_conf = LayeredRepository::load_tenant_config(conf, tenant_id)?; - tenant.repo.update_tenant_config(tenant_conf)?; - - Ok(Arc::clone(&tenant.repo)) +/// Initialize repositories with locally available timelines. +/// Timelines that are only partially available locally (remote storage has more data than this pageserver) +/// are scheduled for download and added to the repository once download is completed. +pub fn init_tenant_mgr(conf: &'static PageServerConf) -> anyhow::Result { + let SyncStartupData { + remote_index, + local_timeline_init_statuses, + } = remote_storage::start_local_timeline_sync(conf) + .context("Failed to set up local files sync with external storage")?; + init_local_repositories(conf, local_timeline_init_statuses, &remote_index)?; + Ok(remote_index) } /// Updates tenants' repositories, changing their timelines state in memory. 
@@ -113,32 +115,28 @@ pub fn apply_timeline_sync_status_updates( "Applying sync status updates for {} timelines", sync_status_updates.len() ); - trace!("Sync status updates: {:?}", sync_status_updates); + debug!("Sync status updates: {sync_status_updates:?}"); - for (tenant_id, tenant_timelines_sync_status_updates) in sync_status_updates { + for (tenant_id, status_updates) in sync_status_updates { let repo = match load_local_repo(conf, tenant_id, remote_index) { Ok(repo) => repo, Err(e) => { - error!( - "Failed to load repo for tenant {} Error: {:#}", - tenant_id, e - ); + error!("Failed to load repo for tenant {tenant_id} Error: {e:?}",); continue; } }; - for (timeline_id, timeline_sync_status_update) in tenant_timelines_sync_status_updates { - match repo.apply_timeline_remote_sync_status_update(timeline_id, timeline_sync_status_update) + for (timeline_id, status_update) in status_updates { + match repo.apply_timeline_remote_sync_status_update(timeline_id, status_update) { - Ok(_) => debug!( - "successfully applied timeline sync status update: {} -> {}", - timeline_id, timeline_sync_status_update - ), + Ok(()) => debug!("successfully applied timeline sync status update: {timeline_id} -> {status_update}"), Err(e) => error!( - "Failed to apply timeline sync status update for tenant {}. timeline {} update {} Error: {:#}", - tenant_id, timeline_id, timeline_sync_status_update, e + "Failed to apply timeline sync status update for tenant {tenant_id}. timeline {timeline_id} update {status_update} Error: {e:?}" ), } + match status_update { + TimelineSyncStatusUpdate::Downloaded => todo!("TODO kb "), + } } } } @@ -147,7 +145,7 @@ pub fn apply_timeline_sync_status_updates( /// Shut down all tenants. This runs as part of pageserver shutdown. 
/// pub fn shutdown_all_tenants() { - let mut m = access_tenants(); + let mut m = tenants_state::write_tenants(); let mut tenantids = Vec::new(); for (tenantid, tenant) in m.iter_mut() { tenant.state = TenantState::Stopping; @@ -167,22 +165,16 @@ pub fn shutdown_all_tenants() { // should be no more activity in any of the repositories. // // On error, log it but continue with the shutdown for other tenants. - for tenantid in tenantids { - debug!("shutdown tenant {}", tenantid); - match get_repository_for_tenant(tenantid) { + for tenant_id in tenantids { + debug!("shutdown tenant {tenant_id}"); + match get_repository_for_tenant(tenant_id) { Ok(repo) => { if let Err(err) = repo.checkpoint() { - error!( - "Could not checkpoint tenant {} during shutdown: {:?}", - tenantid, err - ); + error!("Could not checkpoint tenant {tenant_id} during shutdown: {err:?}"); } } Err(err) => { - error!( - "Could not get repository for tenant {} during shutdown: {:?}", - tenantid, err - ); + error!("Could not get repository for tenant {tenant_id} during shutdown: {err:?}"); } } } @@ -191,20 +183,20 @@ pub fn shutdown_all_tenants() { pub fn create_tenant_repository( conf: &'static PageServerConf, tenant_conf: TenantConfOpt, - tenantid: ZTenantId, + tenant_id: ZTenantId, remote_index: RemoteIndex, -) -> Result> { - match access_tenants().entry(tenantid) { +) -> anyhow::Result> { + match tenants_state::write_tenants().entry(tenant_id) { Entry::Occupied(_) => { - debug!("tenant {} already exists", tenantid); + debug!("tenant {tenant_id} already exists"); Ok(None) } Entry::Vacant(v) => { - let wal_redo_manager = Arc::new(PostgresRedoManager::new(conf, tenantid)); + let wal_redo_manager = Arc::new(PostgresRedoManager::new(conf, tenant_id)); let repo = timelines::create_repo( conf, tenant_conf, - tenantid, + tenant_id, CreateRepo::Real { wal_redo_manager, remote_index, @@ -213,36 +205,39 @@ pub fn create_tenant_repository( v.insert(Tenant { state: TenantState::Idle, repo, - timelines: 
HashMap::new(), + local_timelines: HashMap::new(), }); - Ok(Some(tenantid)) + Ok(Some(tenant_id)) } } } -pub fn update_tenant_config(tenant_conf: TenantConfOpt, tenantid: ZTenantId) -> Result<()> { - info!("configuring tenant {}", tenantid); - let repo = get_repository_for_tenant(tenantid)?; +pub fn update_tenant_config( + tenant_conf: TenantConfOpt, + tenant_id: ZTenantId, +) -> anyhow::Result<()> { + info!("configuring tenant {tenant_id}"); + let repo = get_repository_for_tenant(tenant_id)?; repo.update_tenant_config(tenant_conf)?; Ok(()) } pub fn get_tenant_state(tenantid: ZTenantId) -> Option { - Some(access_tenants().get(&tenantid)?.state) + Some(tenants_state::read_tenants().get(&tenantid)?.state) } /// /// Change the state of a tenant to Active and launch its compactor and GC /// threads. If the tenant was already in Active state or Stopping, does nothing. /// -pub fn activate_tenant(tenant_id: ZTenantId) -> Result<()> { - let mut m = access_tenants(); +pub fn activate_tenant(tenant_id: ZTenantId) -> anyhow::Result<()> { + let mut m = tenants_state::write_tenants(); let tenant = m .get_mut(&tenant_id) - .with_context(|| format!("Tenant not found for id {}", tenant_id))?; + .with_context(|| format!("Tenant not found for id {tenant_id}"))?; - info!("activating tenant {}", tenant_id); + info!("activating tenant {tenant_id}"); match tenant.state { // If the tenant is already active, nothing to do. 
@@ -267,13 +262,10 @@ pub fn activate_tenant(tenant_id: ZTenantId) -> Result<()> { true, move || crate::tenant_threads::gc_loop(tenant_id), ) - .with_context(|| format!("Failed to launch GC thread for tenant {}", tenant_id)); + .with_context(|| format!("Failed to launch GC thread for tenant {tenant_id}")); if let Err(e) = &gc_spawn_result { - error!( - "Failed to start GC thread for tenant {}, stopping its checkpointer thread: {:?}", - tenant_id, e - ); + error!("Failed to start GC thread for tenant {tenant_id}, stopping its checkpointer thread: {e:?}"); thread_mgr::shutdown_threads(Some(ThreadKind::Compactor), Some(tenant_id), None); return gc_spawn_result; } @@ -287,39 +279,42 @@ pub fn activate_tenant(tenant_id: ZTenantId) -> Result<()> { Ok(()) } -pub fn get_repository_for_tenant(tenantid: ZTenantId) -> Result> { - let m = access_tenants(); +pub fn get_repository_for_tenant(tenant_id: ZTenantId) -> anyhow::Result> { + let m = tenants_state::read_tenants(); let tenant = m - .get(&tenantid) - .with_context(|| format!("Tenant {} not found", tenantid))?; + .get(&tenant_id) + .with_context(|| format!("Tenant {tenant_id} not found"))?; Ok(Arc::clone(&tenant.repo)) } -// Retrieve timeline for tenant. Load it into memory if it is not already loaded -pub fn get_timeline_for_tenant_load( - tenantid: ZTenantId, - timelineid: ZTimelineId, -) -> Result> { - let mut m = access_tenants(); +/// Retrieves local timeline for tenant. +/// Loads it into memory if it is not already loaded. 
+pub fn get_local_timeline_with_load( + tenant_id: ZTenantId, + timeline_id: ZTimelineId, +) -> anyhow::Result> { + let mut m = tenants_state::write_tenants(); let tenant = m - .get_mut(&tenantid) - .with_context(|| format!("Tenant {} not found", tenantid))?; + .get_mut(&tenant_id) + .with_context(|| format!("Tenant {tenant_id} not found"))?; - if let Some(page_tline) = tenant.timelines.get(&timelineid) { + if let Some(page_tline) = tenant.local_timelines.get(&timeline_id) { return Ok(Arc::clone(page_tline)); } // First access to this timeline. Create a DatadirTimeline wrapper for it let tline = tenant .repo - .get_timeline_load(timelineid) - .with_context(|| format!("Timeline {} not found for tenant {}", timelineid, tenantid))?; + .get_timeline_load(timeline_id) + .with_context(|| format!("Timeline {timeline_id} not found for tenant {tenant_id}"))?; let repartition_distance = tenant.repo.get_checkpoint_distance() / 10; let page_tline = Arc::new(DatadirTimelineImpl::new(tline, repartition_distance)); page_tline.init_logical_size()?; - tenant.timelines.insert(timelineid, Arc::clone(&page_tline)); + tenant + .local_timelines + .insert(timeline_id, Arc::clone(&page_tline)); Ok(page_tline) } @@ -331,15 +326,87 @@ pub struct TenantInfo { pub state: TenantState, } -pub fn list_tenants() -> Result> { - access_tenants() +pub fn list_tenants() -> Vec { + tenants_state::read_tenants() .iter() - .map(|v| { - let (id, tenant) = v; - Ok(TenantInfo { - id: *id, - state: tenant.state, - }) + .map(|(id, tenant)| TenantInfo { + id: *id, + state: tenant.state, }) .collect() } + +fn init_local_repositories( + conf: &'static PageServerConf, + local_timeline_init_statuses: HashMap>, + remote_index: &RemoteIndex, +) -> anyhow::Result<(), anyhow::Error> { + for (tenant_id, local_timeline_init_statuses) in local_timeline_init_statuses { + // initialize local tenant + let repo = load_local_repo(conf, tenant_id, remote_index) + .with_context(|| format!("Failed to load repo for tenant {}", 
tenant_id))?; + for (timeline_id, init_status) in local_timeline_init_statuses { + match init_status { + LocalTimelineInitStatus::LocallyComplete => { + debug!("timeline {} for tenant {} is locally complete, registering it in repository", timeline_id, tenant_id); + // Lets fail here loudly to be on the safe side. + // XXX: It may be a better api to actually distinguish between repository startup + // and processing of newly downloaded timelines. + repo.apply_timeline_remote_sync_status_update( + timeline_id, + TimelineSyncStatusUpdate::Downloaded, + ) + .with_context(|| { + format!( + "Failed to bootstrap timeline {} for tenant {}", + timeline_id, tenant_id + ) + })? + } + LocalTimelineInitStatus::NeedsSync => { + debug!( + "timeline {} for tenant {} needs sync, \ + so skipped for adding into repository until sync is finished", + tenant_id, timeline_id + ); + } + } + } + } + Ok(()) +} + +// Sets up wal redo manager and repository for tenant. Reduces code duplication. +// Used during pageserver startup, or when new tenant is attached to pageserver. +fn load_local_repo( + conf: &'static PageServerConf, + tenant_id: ZTenantId, + remote_index: &RemoteIndex, +) -> anyhow::Result> { + let mut m = tenants_state::write_tenants(); + let tenant = m.entry(tenant_id).or_insert_with(|| { + // Set up a WAL redo manager, for applying WAL records. + let walredo_mgr = PostgresRedoManager::new(conf, tenant_id); + + // Set up an object repository, for actual data storage. 
+ let repo: Arc = Arc::new(LayeredRepository::new( + conf, + TenantConfOpt::default(), + Arc::new(walredo_mgr), + tenant_id, + remote_index.clone(), + conf.remote_storage_config.is_some(), + )); + Tenant { + state: TenantState::Idle, + repo, + local_timelines: HashMap::new(), + } + }); + + // Restore tenant config + let tenant_conf = LayeredRepository::load_tenant_config(conf, tenant_id)?; + tenant.repo.update_tenant_config(tenant_conf)?; + + Ok(Arc::clone(&tenant.repo)) +} diff --git a/pageserver/src/timelines.rs b/pageserver/src/timelines.rs index adc531e6bb..acc92bb4a2 100644 --- a/pageserver/src/timelines.rs +++ b/pageserver/src/timelines.rs @@ -2,7 +2,7 @@ //! Timeline management code // -use anyhow::{bail, Context, Result}; +use anyhow::{bail, ensure, Context, Result}; use postgres_ffi::ControlFileData; use serde::{Deserialize, Serialize}; use serde_with::{serde_as, DisplayFromStr}; @@ -106,7 +106,7 @@ impl LocalTimelineInfo { match repo_timeline { RepositoryTimeline::Loaded(_) => { let datadir_tline = - tenant_mgr::get_timeline_for_tenant_load(tenant_id, timeline_id)?; + tenant_mgr::get_local_timeline_with_load(tenant_id, timeline_id)?; Self::from_loaded_timeline(&datadir_tline, include_non_incremental_logical_size) } RepositoryTimeline::Unloaded { metadata } => Ok(Self::from_unloaded_timeline(metadata)), @@ -152,7 +152,7 @@ pub fn init_pageserver( if let Some(tenant_id) = create_tenant { println!("initializing tenantid {}", tenant_id); - let repo = create_repo(conf, Default::default(), tenant_id, CreateRepo::Dummy) + let repo = create_repo(conf, TenantConfOpt::default(), tenant_id, CreateRepo::Dummy) .context("failed to create repo")?; let new_timeline_id = initial_timeline_id.unwrap_or_else(ZTimelineId::generate); bootstrap_timeline(conf, tenant_id, new_timeline_id, repo.as_ref()) @@ -203,9 +203,11 @@ pub fn create_repo( }; let repo_dir = conf.tenant_path(&tenant_id); - if repo_dir.exists() { - bail!("tenant {} directory already exists", tenant_id); - } + 
ensure!( + repo_dir.exists(), + "cannot create new tenant repo: '{}' directory already exists", + tenant_id + ); // top-level dir may exist if we are creating it through CLI crashsafe_dir::create_dir_all(&repo_dir) @@ -383,7 +385,7 @@ pub(crate) fn create_timeline( repo.branch_timeline(ancestor_timeline_id, new_timeline_id, start_lsn)?; // load the timeline into memory let loaded_timeline = - tenant_mgr::get_timeline_for_tenant_load(tenant_id, new_timeline_id)?; + tenant_mgr::get_local_timeline_with_load(tenant_id, new_timeline_id)?; LocalTimelineInfo::from_loaded_timeline(&loaded_timeline, false) .context("cannot fill timeline info")? } @@ -391,7 +393,7 @@ pub(crate) fn create_timeline( bootstrap_timeline(conf, tenant_id, new_timeline_id, repo.as_ref())?; // load the timeline into memory let new_timeline = - tenant_mgr::get_timeline_for_tenant_load(tenant_id, new_timeline_id)?; + tenant_mgr::get_local_timeline_with_load(tenant_id, new_timeline_id)?; LocalTimelineInfo::from_loaded_timeline(&new_timeline, false) .context("cannot fill timeline info")? 
} diff --git a/pageserver/src/walreceiver.rs b/pageserver/src/walreceiver.rs index 357aab7221..b7a33364c9 100644 --- a/pageserver/src/walreceiver.rs +++ b/pageserver/src/walreceiver.rs @@ -184,7 +184,7 @@ fn walreceiver_main( let repo = tenant_mgr::get_repository_for_tenant(tenant_id) .with_context(|| format!("no repository found for tenant {}", tenant_id))?; let timeline = - tenant_mgr::get_timeline_for_tenant_load(tenant_id, timeline_id).with_context(|| { + tenant_mgr::get_local_timeline_with_load(tenant_id, timeline_id).with_context(|| { format!( "local timeline {} not found for tenant {}", timeline_id, tenant_id From 6cca57f95a6aced70c1c932a580edaf621177b8b Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Wed, 27 Apr 2022 15:55:59 +0300 Subject: [PATCH 158/296] Properly remove from the local timeline map --- pageserver/src/http/routes.rs | 3 +- pageserver/src/layered_repository.rs | 55 +++++------ pageserver/src/repository.rs | 2 +- pageserver/src/tenant_mgr.rs | 136 +++++++++++++++++++-------- pageserver/src/timelines.rs | 2 +- 5 files changed, 123 insertions(+), 75 deletions(-) diff --git a/pageserver/src/http/routes.rs b/pageserver/src/http/routes.rs index f1b482cf50..295a1e9f02 100644 --- a/pageserver/src/http/routes.rs +++ b/pageserver/src/http/routes.rs @@ -347,8 +347,7 @@ async fn timeline_detach_handler(request: Request) -> Result>, - tenantid: ZTenantId, + tenant_id: ZTenantId, timelines: Mutex>, // This mutex prevents creation of new timelines during GC. // Adding yet another mutex (in addition to `timelines`) is needed because holding @@ -223,10 +223,10 @@ impl Repository for LayeredRepository { let mut timelines = self.timelines.lock().unwrap(); // Create the timeline directory, and write initial metadata to file. 
- crashsafe_dir::create_dir_all(self.conf.timeline_path(&timelineid, &self.tenantid))?; + crashsafe_dir::create_dir_all(self.conf.timeline_path(&timelineid, &self.tenant_id))?; let metadata = TimelineMetadata::new(Lsn(0), None, None, Lsn(0), initdb_lsn, initdb_lsn); - Self::save_metadata(self.conf, timelineid, self.tenantid, &metadata, true)?; + Self::save_metadata(self.conf, timelineid, self.tenant_id, &metadata, true)?; let timeline = LayeredTimeline::new( self.conf, @@ -234,7 +234,7 @@ impl Repository for LayeredRepository { metadata, None, timelineid, - self.tenantid, + self.tenant_id, Arc::clone(&self.walredo_mgr), self.upload_layers, ); @@ -283,7 +283,7 @@ impl Repository for LayeredRepository { }; // create a new timeline directory - let timelinedir = self.conf.timeline_path(&dst, &self.tenantid); + let timelinedir = self.conf.timeline_path(&dst, &self.tenant_id); crashsafe_dir::create_dir(&timelinedir)?; @@ -298,8 +298,8 @@ impl Repository for LayeredRepository { *src_timeline.latest_gc_cutoff_lsn.read().unwrap(), src_timeline.initdb_lsn, ); - crashsafe_dir::create_dir_all(self.conf.timeline_path(&dst, &self.tenantid))?; - Self::save_metadata(self.conf, dst, self.tenantid, &metadata, true)?; + crashsafe_dir::create_dir_all(self.conf.timeline_path(&dst, &self.tenant_id))?; + Self::save_metadata(self.conf, dst, self.tenant_id, &metadata, true)?; timelines.insert(dst, LayeredTimelineEntry::Unloaded { id: dst, metadata }); info!("branched timeline {} from {} at {}", dst, src, start_lsn); @@ -322,7 +322,7 @@ impl Repository for LayeredRepository { .unwrap_or_else(|| "-".to_string()); STORAGE_TIME - .with_label_values(&["gc", &self.tenantid.to_string(), &timeline_str]) + .with_label_values(&["gc", &self.tenant_id.to_string(), &timeline_str]) .observe_closure_duration(|| { self.gc_iteration_internal(target_timelineid, horizon, pitr, checkpoint_before_gc) }) @@ -342,7 +342,7 @@ impl Repository for LayeredRepository { for (timelineid, timeline) in 
&timelines_to_compact { let _entered = - info_span!("compact", timeline = %timelineid, tenant = %self.tenantid).entered(); + info_span!("compact", timeline = %timelineid, tenant = %self.tenant_id).entered(); match timeline { LayeredTimelineEntry::Loaded(timeline) => { timeline.compact()?; @@ -383,27 +383,16 @@ impl Repository for LayeredRepository { for (timelineid, timeline) in &timelines_to_compact { let _entered = - info_span!("checkpoint", timeline = %timelineid, tenant = %self.tenantid).entered(); + info_span!("checkpoint", timeline = %timelineid, tenant = %self.tenant_id) + .entered(); timeline.checkpoint(CheckpointConfig::Flush)?; } Ok(()) } - // Detaches the timeline from the repository. - fn detach_timeline(&self, timeline_id: ZTimelineId) -> Result<()> { - let mut timelines = self.timelines.lock().unwrap(); - if timelines.remove(&timeline_id).is_none() { - bail!("cannot detach timeline that is not available locally"); - } - - // Release the lock to shutdown and remove the files without holding it - drop(timelines); - // shutdown the timeline (this shuts down the walreceiver) - thread_mgr::shutdown_threads(None, Some(self.tenantid), Some(timeline_id)); - - // remove timeline files (maybe avoid this for ease of debugging if something goes wrong) - fs::remove_dir_all(self.conf.timeline_path(&timeline_id, &self.tenantid))?; + fn detach_timeline(&self, timeline_id: ZTimelineId) -> anyhow::Result<()> { + self.timelines.lock().unwrap().remove(&timeline_id); Ok(()) } @@ -422,7 +411,7 @@ impl Repository for LayeredRepository { Entry::Occupied(_) => bail!("We completed a download for a timeline that already exists in repository. 
This is a bug."), Entry::Vacant(entry) => { // we need to get metadata of a timeline, another option is to pass it along with Downloaded status - let metadata = Self::load_metadata(self.conf, timeline_id, self.tenantid).context("failed to load local metadata")?; + let metadata = Self::load_metadata(self.conf, timeline_id, self.tenant_id).context("failed to load local metadata")?; // finally we make newly downloaded timeline visible to repository entry.insert(LayeredTimelineEntry::Unloaded { id: timeline_id, metadata, }) }, @@ -547,7 +536,7 @@ impl LayeredRepository { tenant_conf.update(&new_tenant_conf); - LayeredRepository::persist_tenant_config(self.conf, self.tenantid, *tenant_conf)?; + LayeredRepository::persist_tenant_config(self.conf, self.tenant_id, *tenant_conf)?; Ok(()) } @@ -605,7 +594,7 @@ impl LayeredRepository { timelineid: ZTimelineId, timelines: &mut HashMap, ) -> anyhow::Result> { - let metadata = Self::load_metadata(self.conf, timelineid, self.tenantid) + let metadata = Self::load_metadata(self.conf, timelineid, self.tenant_id) .context("failed to load metadata")?; let disk_consistent_lsn = metadata.disk_consistent_lsn(); @@ -631,7 +620,7 @@ impl LayeredRepository { metadata, ancestor, timelineid, - self.tenantid, + self.tenant_id, Arc::clone(&self.walredo_mgr), self.upload_layers, ); @@ -646,12 +635,12 @@ impl LayeredRepository { conf: &'static PageServerConf, tenant_conf: TenantConfOpt, walredo_mgr: Arc, - tenantid: ZTenantId, + tenant_id: ZTenantId, remote_index: RemoteIndex, upload_layers: bool, ) -> LayeredRepository { LayeredRepository { - tenantid, + tenant_id, conf, tenant_conf: Arc::new(RwLock::new(tenant_conf)), timelines: Mutex::new(HashMap::new()), @@ -806,7 +795,7 @@ impl LayeredRepository { checkpoint_before_gc: bool, ) -> Result { let _span_guard = - info_span!("gc iteration", tenant = %self.tenantid, timeline = ?target_timelineid) + info_span!("gc iteration", tenant = %self.tenant_id, timeline = ?target_timelineid) .entered(); let 
mut totals: GcResult = Default::default(); let now = Instant::now(); @@ -890,6 +879,10 @@ impl LayeredRepository { totals.elapsed = now.elapsed(); Ok(totals) } + + pub fn tenant_id(&self) -> ZTenantId { + self.tenant_id + } } pub struct LayeredTimeline { diff --git a/pageserver/src/repository.rs b/pageserver/src/repository.rs index f7c2f036a6..6c75f035ca 100644 --- a/pageserver/src/repository.rs +++ b/pageserver/src/repository.rs @@ -259,7 +259,7 @@ pub trait Repository: Send + Sync { /// api's 'compact' command. fn compaction_iteration(&self) -> Result<()>; - /// detaches locally available timeline by stopping all threads and removing all the data. + /// detaches timeline-related in-memory data. fn detach_timeline(&self, timeline_id: ZTimelineId) -> Result<()>; // Allows to retrieve remote timeline index from the repo. Used in walreceiver to grab remote consistent lsn. diff --git a/pageserver/src/tenant_mgr.rs b/pageserver/src/tenant_mgr.rs index 36a4b989b7..ace6938e6d 100644 --- a/pageserver/src/tenant_mgr.rs +++ b/pageserver/src/tenant_mgr.rs @@ -3,6 +3,7 @@ use crate::config::PageServerConf; use crate::layered_repository::LayeredRepository; +use crate::pgdatadir_mapping::DatadirTimeline; use crate::remote_storage::{self, LocalTimelineInitStatus, RemoteIndex, SyncStartupData}; use crate::repository::{Repository, TimelineSyncStatusUpdate}; use crate::tenant_config::TenantConfOpt; @@ -12,7 +13,7 @@ use crate::timelines; use crate::timelines::CreateRepo; use crate::walredo::PostgresRedoManager; use crate::{DatadirTimelineImpl, RepositoryImpl}; -use anyhow::Context; +use anyhow::{bail, Context}; use serde::{Deserialize, Serialize}; use serde_with::{serde_as, DisplayFromStr}; use std::collections::hash_map::Entry; @@ -125,18 +126,11 @@ pub fn apply_timeline_sync_status_updates( continue; } }; - - for (timeline_id, status_update) in status_updates { - match repo.apply_timeline_remote_sync_status_update(timeline_id, status_update) - { - Ok(()) => debug!("successfully 
applied timeline sync status update: {timeline_id} -> {status_update}"), - Err(e) => error!( - "Failed to apply timeline sync status update for tenant {tenant_id}. timeline {timeline_id} update {status_update} Error: {e:?}" - ), - } - match status_update { - TimelineSyncStatusUpdate::Downloaded => todo!("TODO kb "), - } + match register_new_timelines(&repo, status_updates) { + Ok(()) => info!("successfully applied tenant {tenant_id} sync status updates"), + Err(e) => error!( + "Failed to apply timeline sync timeline status updates for tenant {tenant_id}: {e:?}" + ), } } } @@ -302,22 +296,49 @@ pub fn get_local_timeline_with_load( if let Some(page_tline) = tenant.local_timelines.get(&timeline_id) { return Ok(Arc::clone(page_tline)); } - // First access to this timeline. Create a DatadirTimeline wrapper for it - let tline = tenant - .repo - .get_timeline_load(timeline_id) - .with_context(|| format!("Timeline {timeline_id} not found for tenant {tenant_id}"))?; - let repartition_distance = tenant.repo.get_checkpoint_distance() / 10; - - let page_tline = Arc::new(DatadirTimelineImpl::new(tline, repartition_distance)); - page_tline.init_logical_size()?; + let page_tline = new_local_timeline(&tenant.repo, timeline_id) + .with_context(|| format!("Failed to create new local timeline for tenant {tenant_id}"))?; tenant .local_timelines .insert(timeline_id, Arc::clone(&page_tline)); Ok(page_tline) } +pub fn detach_timeline(tenant_id: ZTenantId, timeline_id: ZTimelineId) -> anyhow::Result<()> { + // shutdown the timeline threads (this shuts down the walreceiver) + thread_mgr::shutdown_threads(None, Some(tenant_id), Some(timeline_id)); + + match tenants_state::write_tenants().get_mut(&tenant_id) { + Some(tenant) => { + tenant + .repo + .detach_timeline(timeline_id) + .context("Failed to detach inmem tenant timeline")?; + tenant.local_timelines.remove(&timeline_id); + } + None => bail!("Tenant {tenant_id} not found in local tenant state"), + } + + Ok(()) +} + +fn 
new_local_timeline( + repo: &RepositoryImpl, + timeline_id: ZTimelineId, +) -> anyhow::Result>> { + let inmem_timeline = repo.get_timeline_load(timeline_id).with_context(|| { + format!("Inmem timeline {timeline_id} not found in tenant's repository") + })?; + let repartition_distance = repo.get_checkpoint_distance() / 10; + let page_tline = Arc::new(DatadirTimelineImpl::new( + inmem_timeline, + repartition_distance, + )); + page_tline.init_logical_size()?; + Ok(page_tline) +} + #[serde_as] #[derive(Serialize, Deserialize, Clone)] pub struct TenantInfo { @@ -344,38 +365,73 @@ fn init_local_repositories( for (tenant_id, local_timeline_init_statuses) in local_timeline_init_statuses { // initialize local tenant let repo = load_local_repo(conf, tenant_id, remote_index) - .with_context(|| format!("Failed to load repo for tenant {}", tenant_id))?; + .with_context(|| format!("Failed to load repo for tenant {tenant_id}"))?; + + let mut status_updates = HashMap::with_capacity(local_timeline_init_statuses.len()); for (timeline_id, init_status) in local_timeline_init_statuses { match init_status { LocalTimelineInitStatus::LocallyComplete => { - debug!("timeline {} for tenant {} is locally complete, registering it in repository", timeline_id, tenant_id); - // Lets fail here loudly to be on the safe side. - // XXX: It may be a better api to actually distinguish between repository startup - // and processing of newly downloaded timelines. - repo.apply_timeline_remote_sync_status_update( - timeline_id, - TimelineSyncStatusUpdate::Downloaded, - ) - .with_context(|| { - format!( - "Failed to bootstrap timeline {} for tenant {}", - timeline_id, tenant_id - ) - })? 
+ debug!("timeline {timeline_id} for tenant {tenant_id} is locally complete, registering it in repository"); + status_updates.insert(timeline_id, TimelineSyncStatusUpdate::Downloaded); } LocalTimelineInitStatus::NeedsSync => { debug!( - "timeline {} for tenant {} needs sync, \ - so skipped for adding into repository until sync is finished", - tenant_id, timeline_id + "timeline {tenant_id} for tenant {timeline_id} needs sync, \ + so skipped for adding into repository until sync is finished" ); } } } + + // Lets fail here loudly to be on the safe side. + // XXX: It may be a better api to actually distinguish between repository startup + // and processing of newly downloaded timelines. + register_new_timelines(&repo, status_updates) + .with_context(|| format!("Failed to bootstrap timelines for tenant {tenant_id}"))? } Ok(()) } +fn register_new_timelines( + repo: &LayeredRepository, + status_updates: HashMap, +) -> anyhow::Result<()> { + let mut registration_queue = Vec::with_capacity(status_updates.len()); + + // first need to register the in-mem representations, to avoid missing ancestors during the local disk data registration + for (timeline_id, status_update) in status_updates { + repo.apply_timeline_remote_sync_status_update(timeline_id, status_update) + .with_context(|| { + format!("Failed to load timeline {timeline_id} into in-memory repository") + })?; + match status_update { + TimelineSyncStatusUpdate::Downloaded => registration_queue.push(timeline_id), + } + } + + for timeline_id in registration_queue { + let tenant_id = repo.tenant_id(); + match tenants_state::write_tenants().get_mut(&tenant_id) { + Some(tenant) => match tenant.local_timelines.entry(timeline_id) { + Entry::Occupied(_) => { + bail!("Local timeline {timeline_id} already registered") + } + Entry::Vacant(v) => { + v.insert(new_local_timeline(repo, timeline_id).with_context(|| { + format!("Failed to register new local timeline for tenant {tenant_id}") + })?); + } + }, + None => bail!( + "Tenant 
{} not found in local tenant state", + repo.tenant_id() + ), + } + } + + Ok(()) +} + // Sets up wal redo manager and repository for tenant. Reduces code duplication. // Used during pageserver startup, or when new tenant is attached to pageserver. fn load_local_repo( diff --git a/pageserver/src/timelines.rs b/pageserver/src/timelines.rs index acc92bb4a2..85ad294da9 100644 --- a/pageserver/src/timelines.rs +++ b/pageserver/src/timelines.rs @@ -204,7 +204,7 @@ pub fn create_repo( let repo_dir = conf.tenant_path(&tenant_id); ensure!( - repo_dir.exists(), + !repo_dir.exists(), "cannot create new tenant repo: '{}' directory already exists", tenant_id ); From 2911eb084aefc82791e28b668d2b06383b38c0de Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Thu, 28 Apr 2022 00:49:03 +0300 Subject: [PATCH 159/296] Remove timeline files on detach --- pageserver/src/http/routes.rs | 3 ++- pageserver/src/layered_repository.rs | 6 ++++- .../remote_storage/storage_sync/download.rs | 2 +- pageserver/src/tenant_mgr.rs | 24 ++++++++++++++----- .../batch_others/test_tenant_relocation.py | 9 +++++++ 5 files changed, 35 insertions(+), 9 deletions(-) diff --git a/pageserver/src/http/routes.rs b/pageserver/src/http/routes.rs index 295a1e9f02..311ae5adf4 100644 --- a/pageserver/src/http/routes.rs +++ b/pageserver/src/http/routes.rs @@ -347,7 +347,8 @@ async fn timeline_detach_handler(request: Request) -> Result anyhow::Result<()> { - self.timelines.lock().unwrap().remove(&timeline_id); + let mut timelines = self.timelines.lock().unwrap(); + ensure!( + timelines.remove(&timeline_id).is_some(), + "cannot detach timeline {timeline_id} that is not available locally" + ); Ok(()) } diff --git a/pageserver/src/remote_storage/storage_sync/download.rs b/pageserver/src/remote_storage/storage_sync/download.rs index 7fe25ab36e..c7a2b1fd22 100644 --- a/pageserver/src/remote_storage/storage_sync/download.rs +++ b/pageserver/src/remote_storage/storage_sync/download.rs @@ -332,7 +332,7 @@ mod tests { .await; 
assert!( matches!( - dbg!(already_downloading_remote_timeline_download), + already_downloading_remote_timeline_download, DownloadedTimeline::Abort, ), "Should not allow downloading for remote timeline that does not expect it" diff --git a/pageserver/src/tenant_mgr.rs b/pageserver/src/tenant_mgr.rs index ace6938e6d..3e0a907d00 100644 --- a/pageserver/src/tenant_mgr.rs +++ b/pageserver/src/tenant_mgr.rs @@ -56,7 +56,7 @@ struct Tenant { /// Contains in-memory state, including the timeline that might not yet flushed on disk or loaded form disk. repo: Arc, /// Timelines, located locally in the pageserver's datadir. - /// Whatever manipulations happen, local timelines are not removed, only incremented with files. + /// Timelines can entirely be removed entirely by the `detach` operation only. /// /// Local timelines have more metadata that's loaded into memory, /// that is located in the `repo.timelines` field, [`crate::layered_repository::LayeredTimelineEntry`]. @@ -126,8 +126,8 @@ pub fn apply_timeline_sync_status_updates( continue; } }; - match register_new_timelines(&repo, status_updates) { - Ok(()) => info!("successfully applied tenant {tenant_id} sync status updates"), + match apply_timeline_remote_sync_status_updates(&repo, status_updates) { + Ok(()) => info!("successfully applied sync status updates for tenant {tenant_id}"), Err(e) => error!( "Failed to apply timeline sync timeline status updates for tenant {tenant_id}: {e:?}" ), @@ -305,7 +305,11 @@ pub fn get_local_timeline_with_load( Ok(page_tline) } -pub fn detach_timeline(tenant_id: ZTenantId, timeline_id: ZTimelineId) -> anyhow::Result<()> { +pub fn detach_timeline( + conf: &'static PageServerConf, + tenant_id: ZTenantId, + timeline_id: ZTimelineId, +) -> anyhow::Result<()> { // shutdown the timeline threads (this shuts down the walreceiver) thread_mgr::shutdown_threads(None, Some(tenant_id), Some(timeline_id)); @@ -320,6 +324,14 @@ pub fn detach_timeline(tenant_id: ZTenantId, timeline_id: ZTimelineId) -> 
anyhow None => bail!("Tenant {tenant_id} not found in local tenant state"), } + let local_timeline_directory = conf.timeline_path(&timeline_id, &tenant_id); + std::fs::remove_dir_all(&local_timeline_directory).with_context(|| { + format!( + "Failed to remove local timeline directory '{}'", + local_timeline_directory.display() + ) + })?; + Ok(()) } @@ -386,13 +398,13 @@ fn init_local_repositories( // Lets fail here loudly to be on the safe side. // XXX: It may be a better api to actually distinguish between repository startup // and processing of newly downloaded timelines. - register_new_timelines(&repo, status_updates) + apply_timeline_remote_sync_status_updates(&repo, status_updates) .with_context(|| format!("Failed to bootstrap timelines for tenant {tenant_id}"))? } Ok(()) } -fn register_new_timelines( +fn apply_timeline_remote_sync_status_updates( repo: &LayeredRepository, status_updates: HashMap, ) -> anyhow::Result<()> { diff --git a/test_runner/batch_others/test_tenant_relocation.py b/test_runner/batch_others/test_tenant_relocation.py index 8213d2526b..41907adf1a 100644 --- a/test_runner/batch_others/test_tenant_relocation.py +++ b/test_runner/batch_others/test_tenant_relocation.py @@ -217,6 +217,13 @@ def test_tenant_relocation(zenith_env_builder: ZenithEnvBuilder, tenant_pg.start() + timeline_to_detach_local_path = env.repo_dir / 'tenants' / tenant.hex / 'timelines' / timeline.hex + files_before_detach = os.listdir(timeline_to_detach_local_path) + assert 'metadata' in files_before_detach, f'Regular timeline {timeline_to_detach_local_path} should have the metadata file,\ + but got: {files_before_detach}' + assert len(files_before_detach) > 2, f'Regular timeline {timeline_to_detach_local_path} should have at least one layer file,\ + but got {files_before_detach}' + # detach tenant from old pageserver before we check # that all the data is there to be sure that old pageserver # is no longer involved, and if it is, we will see the errors @@ -238,6 +245,8 @@ 
def test_tenant_relocation(zenith_env_builder: ZenithEnvBuilder, load_thread.join(timeout=10) log.info('load thread stopped') + assert not os.path.exists(timeline_to_detach_local_path), f'After detach, local timeline dir {timeline_to_detach_local_path} should be removed' + # bring old pageserver back for clean shutdown via zenith cli # new pageserver will be shut down by the context manager cli_config_lines = (env.repo_dir / 'config').read_text().splitlines() From 76388abeb6ecda513f50b9b89199e3f575cbe630 Mon Sep 17 00:00:00 2001 From: chaitanya sharma <86035+phoenix24@users.noreply.github.com> Date: Fri, 29 Apr 2022 14:22:46 +0300 Subject: [PATCH 160/296] Rename READMEs with .md extension, and fix links to them. Commit edba2e97 renamed pageserver/README to pageserver/README.md, but forgot to update links to it. Fix. Rename libs/postgres_ffi/README and safekeeper/README files to also have the .md extension, so that github can render them nicely. Quote ascii-diagram in safekeeper/README.md so that it renders correctly. --- docs/README.md | 6 +++--- docs/sourcetree.md | 4 ++-- libs/postgres_ffi/{README => README.md} | 0 safekeeper/{README => README.md} | 6 ++++-- 4 files changed, 9 insertions(+), 7 deletions(-) rename libs/postgres_ffi/{README => README.md} (100%) rename safekeeper/{README => README.md} (99%) diff --git a/docs/README.md b/docs/README.md index 99d635bb33..886363dccc 100644 --- a/docs/README.md +++ b/docs/README.md @@ -7,8 +7,8 @@ - [glossary.md](glossary.md) — Glossary of all the terms used in codebase. - [multitenancy.md](multitenancy.md) — how multitenancy is organized in the pageserver and Zenith CLI. - [sourcetree.md](sourcetree.md) — Overview of the source tree layeout. -- [pageserver/README](/pageserver/README) — pageserver overview. -- [postgres_ffi/README](/libs/postgres_ffi/README) — Postgres FFI overview. +- [pageserver/README.md](/pageserver/README.md) — pageserver overview. 
+- [postgres_ffi/README.md](/libs/postgres_ffi/README.md) — Postgres FFI overview. - [test_runner/README.md](/test_runner/README.md) — tests infrastructure overview. -- [safekeeper/README](/safekeeper/README) — WAL service overview. +- [safekeeper/README.md](/safekeeper/README.md) — WAL service overview. - [core_changes.md](core_changes.md) - Description of Zenith changes in Postgres core diff --git a/docs/sourcetree.md b/docs/sourcetree.md index 5fd5fe19e5..5ddc6208d2 100644 --- a/docs/sourcetree.md +++ b/docs/sourcetree.md @@ -28,7 +28,7 @@ The pageserver has a few different duties: - Receive WAL from the WAL service and decode it. - Replay WAL that's applicable to the chunks that the Page Server maintains -For more detailed info, see `/pageserver/README` +For more detailed info, see [/pageserver/README](/pageserver/README.md) `/proxy`: @@ -57,7 +57,7 @@ PostgreSQL extension that contains functions needed for testing and debugging. The zenith WAL service that receives WAL from a primary compute nodes and streams it to the pageserver. It acts as a holding area and redistribution center for recently generated WAL. -For more detailed info, see `/safekeeper/README` +For more detailed info, see [/safekeeper/README](/safekeeper/README.md) `/workspace_hack`: The workspace_hack crate exists only to pin down some dependencies. diff --git a/libs/postgres_ffi/README b/libs/postgres_ffi/README.md similarity index 100% rename from libs/postgres_ffi/README rename to libs/postgres_ffi/README.md diff --git a/safekeeper/README b/safekeeper/README.md similarity index 99% rename from safekeeper/README rename to safekeeper/README.md index 4407837463..3f097d0c24 100644 --- a/safekeeper/README +++ b/safekeeper/README.md @@ -7,6 +7,7 @@ replica. A replication slot is used in the primary to prevent the primary from discarding WAL that hasn't been streamed to the WAL service yet. 
+``` +--------------+ +------------------+ | | WAL | | | Compute node | ----------> | WAL Service | @@ -23,7 +24,7 @@ service yet. | Pageservers | | | +--------------+ - +``` The WAL service consists of multiple WAL safekeepers that all store a @@ -31,6 +32,7 @@ copy of the WAL. A WAL record is considered durable when the majority of safekeepers have received and stored the WAL to local disk. A consensus algorithm based on Paxos is used to manage the quorum. +``` +-------------------------------------------+ | WAL Service | | | @@ -48,7 +50,7 @@ consensus algorithm based on Paxos is used to manage the quorum. | +------------+ | | | +-------------------------------------------+ - +``` The primary connects to the WAL safekeepers, so it works in a "push" fashion. That's different from how streaming replication usually From 05f8e6a050fb7af35950e69b30a23be2cc40e78a Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Mon, 25 Apr 2022 16:56:19 +0300 Subject: [PATCH 161/296] Use fsync+rename for atomic downloads from remote storage Use failpoint in test_remote_storage to check the behavior --- pageserver/Cargo.toml | 6 +- pageserver/src/bin/pageserver.rs | 7 +- pageserver/src/http/routes.rs | 72 +++++++------- pageserver/src/layered_repository.rs | 2 +- pageserver/src/page_service.rs | 3 + pageserver/src/remote_storage.rs | 18 ++++ pageserver/src/remote_storage/local_fs.rs | 13 +-- pageserver/src/remote_storage/storage_sync.rs | 92 ++++++++++++++--- .../remote_storage/storage_sync/download.rs | 99 +++++++++++++++++-- .../batch_others/test_remote_storage.py | 38 +++++-- 10 files changed, 274 insertions(+), 76 deletions(-) diff --git a/pageserver/Cargo.toml b/pageserver/Cargo.toml index 5607baf698..23c16dd5be 100644 --- a/pageserver/Cargo.toml +++ b/pageserver/Cargo.toml @@ -4,8 +4,12 @@ version = "0.1.0" edition = "2021" [features] -default = [] +# It is simpler infra-wise to have failpoints enabled by default +# It shouldnt affect perf in any way because failpoints +# are 
not placed in hot code paths +default = ["failpoints"] profiling = ["pprof"] +failpoints = ["fail/failpoints"] [dependencies] chrono = "0.4.19" diff --git a/pageserver/src/bin/pageserver.rs b/pageserver/src/bin/pageserver.rs index 728dcb53de..01fcc1224f 100644 --- a/pageserver/src/bin/pageserver.rs +++ b/pageserver/src/bin/pageserver.rs @@ -27,7 +27,12 @@ use utils::{ }; fn version() -> String { - format!("{} profiling:{}", GIT_VERSION, cfg!(feature = "profiling")) + format!( + "{} profiling:{} failpoints:{}", + GIT_VERSION, + cfg!(feature = "profiling"), + fail::has_failpoints() + ) } fn main() -> anyhow::Result<()> { diff --git a/pageserver/src/http/routes.rs b/pageserver/src/http/routes.rs index 311ae5adf4..c589813d69 100644 --- a/pageserver/src/http/routes.rs +++ b/pageserver/src/http/routes.rs @@ -179,43 +179,47 @@ async fn timeline_detail_handler(request: Request) -> Result(local_timeline) + }) + .await + .ok() + .and_then(|r| r.ok()) + .flatten(); - let (local_timeline_info, span) = tokio::task::spawn_blocking(move || { - let entered = span.entered(); - let repo = tenant_mgr::get_repository_for_tenant(tenant_id)?; - let local_timeline = { - repo.get_timeline(timeline_id) - .as_ref() - .map(|timeline| { - LocalTimelineInfo::from_repo_timeline( - tenant_id, - timeline_id, - timeline, - include_non_incremental_logical_size, - ) + let remote_timeline_info = { + let remote_index_read = get_state(&request).remote_index.read().await; + remote_index_read + .timeline_entry(&ZTenantTimelineId { + tenant_id, + timeline_id, + }) + .map(|remote_entry| RemoteTimelineInfo { + remote_consistent_lsn: remote_entry.metadata.disk_consistent_lsn(), + awaits_download: remote_entry.awaits_download, }) - .transpose()? 
}; - Ok::<_, anyhow::Error>((local_timeline, entered.exit())) - }) - .await - .map_err(ApiError::from_err)??; - - let remote_timeline_info = { - let remote_index_read = get_state(&request).remote_index.read().await; - remote_index_read - .timeline_entry(&ZTenantTimelineId { - tenant_id, - timeline_id, - }) - .map(|remote_entry| RemoteTimelineInfo { - remote_consistent_lsn: remote_entry.metadata.disk_consistent_lsn(), - awaits_download: remote_entry.awaits_download, - }) - }; - - let _enter = span.entered(); + (local_timeline_info, remote_timeline_info) + } + .instrument(info_span!("timeline_detail_handler", tenant = %tenant_id, timeline = %timeline_id)) + .await; if local_timeline_info.is_none() && remote_timeline_info.is_none() { return Err(ApiError::NotFound( diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 116fbf03a2..bbeb245f0a 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -721,7 +721,7 @@ impl LayeredRepository { } /// Save timeline metadata to file - fn save_metadata( + pub fn save_metadata( conf: &'static PageServerConf, timelineid: ZTimelineId, tenantid: ZTenantId, diff --git a/pageserver/src/page_service.rs b/pageserver/src/page_service.rs index 8adbdc5d9d..ec08a840b0 100644 --- a/pageserver/src/page_service.rs +++ b/pageserver/src/page_service.rs @@ -667,7 +667,10 @@ impl postgres_backend::Handler for PageServerHandler { // on connect pgb.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?; } else if query_string.starts_with("failpoints ") { + ensure!(fail::has_failpoints(), "Cannot manage failpoints because pageserver was compiled without failpoints support"); + let (_, failpoints) = query_string.split_at("failpoints ".len()); + for failpoint in failpoints.split(';') { if let Some((name, actions)) = failpoint.split_once('=') { info!("cfg failpoint: {} {}", name, actions); diff --git a/pageserver/src/remote_storage.rs 
b/pageserver/src/remote_storage.rs index 39595b7167..cfa09dce14 100644 --- a/pageserver/src/remote_storage.rs +++ b/pageserver/src/remote_storage.rs @@ -101,6 +101,7 @@ use anyhow::{bail, Context}; use tokio::io; use tracing::{debug, error, info}; +use self::storage_sync::TEMP_DOWNLOAD_EXTENSION; pub use self::{ local_fs::LocalFs, s3_bucket::S3Bucket, @@ -304,12 +305,29 @@ fn collect_timeline_files( } else if is_ephemeral_file(&entry_path.file_name().unwrap().to_string_lossy()) { debug!("skipping ephemeral file {}", entry_path.display()); continue; + } else if entry_path.extension().and_then(ffi::OsStr::to_str) + == Some(TEMP_DOWNLOAD_EXTENSION) + { + info!("removing temp download file at {}", entry_path.display()); + fs::remove_file(&entry_path).with_context(|| { + format!( + "failed to remove temp download file at {}", + entry_path.display() + ) + })?; } else { timeline_files.insert(entry_path); } } } + // FIXME (rodionov) if attach call succeeded, and then pageserver is restarted before download is completed + // then attach is lost. There would be no retries for that, + // initial collect will fail because there is no metadata. + // We either need to start download if we see empty dir after restart or attach caller should + // be aware of that and retry attach if awaits_download for timeline switched from true to false + // but timelinne didnt appear locally. + // Check what happens with remote index in that case. 
let timeline_metadata_path = match timeline_metadata_path { Some(path) => path, None => bail!("No metadata file found in the timeline directory"), diff --git a/pageserver/src/remote_storage/local_fs.rs b/pageserver/src/remote_storage/local_fs.rs index 952b2e69fe..6772a4fbd6 100644 --- a/pageserver/src/remote_storage/local_fs.rs +++ b/pageserver/src/remote_storage/local_fs.rs @@ -17,6 +17,8 @@ use tokio::{ }; use tracing::*; +use crate::remote_storage::storage_sync::path_with_suffix_extension; + use super::{strip_path_prefix, RemoteStorage, StorageMetadata}; pub struct LocalFs { @@ -114,7 +116,7 @@ impl RemoteStorage for LocalFs { // We need this dance with sort of durable rename (without fsyncs) // to prevent partial uploads. This was really hit when pageserver shutdown // cancelled the upload and partial file was left on the fs - let temp_file_path = path_with_suffix_extension(&target_file_path, ".temp"); + let temp_file_path = path_with_suffix_extension(&target_file_path, "temp"); let mut destination = io::BufWriter::new( fs::OpenOptions::new() .write(true) @@ -299,15 +301,8 @@ impl RemoteStorage for LocalFs { } } -fn path_with_suffix_extension(original_path: &Path, suffix: &str) -> PathBuf { - let mut extension_with_suffix = original_path.extension().unwrap_or_default().to_os_string(); - extension_with_suffix.push(suffix); - - original_path.with_extension(extension_with_suffix) -} - fn storage_metadata_path(original_path: &Path) -> PathBuf { - path_with_suffix_extension(original_path, ".metadata") + path_with_suffix_extension(original_path, "metadata") } fn get_all_files<'a, P>( diff --git a/pageserver/src/remote_storage/storage_sync.rs b/pageserver/src/remote_storage/storage_sync.rs index 20012f32d7..2d3416cd32 100644 --- a/pageserver/src/remote_storage/storage_sync.rs +++ b/pageserver/src/remote_storage/storage_sync.rs @@ -62,7 +62,9 @@ pub mod index; mod upload; use std::{ + borrow::Cow, collections::{HashMap, HashSet, VecDeque}, + ffi::OsStr, fmt::Debug, 
num::{NonZeroU32, NonZeroUsize}, ops::ControlFlow, @@ -89,7 +91,10 @@ use self::{ use super::{LocalTimelineInitStatus, LocalTimelineInitStatuses, RemoteStorage, SyncStartupData}; use crate::{ config::PageServerConf, - layered_repository::metadata::{metadata_path, TimelineMetadata}, + layered_repository::{ + metadata::{metadata_path, TimelineMetadata}, + LayeredRepository, + }, repository::TimelineSyncStatusUpdate, tenant_mgr::apply_timeline_sync_status_updates, thread_mgr, @@ -103,6 +108,7 @@ use metrics::{ use utils::zid::{ZTenantId, ZTenantTimelineId, ZTimelineId}; pub use self::download::download_index_part; +pub use self::download::TEMP_DOWNLOAD_EXTENSION; lazy_static! { static ref REMAINING_SYNC_ITEMS: IntGauge = register_int_gauge!( @@ -782,8 +788,14 @@ where P: Debug + Send + Sync + 'static, S: RemoteStorage + Send + Sync + 'static, { - match download_timeline_layers(storage, current_remote_timeline, sync_id, new_download_data) - .await + match download_timeline_layers( + conf, + storage, + current_remote_timeline, + sync_id, + new_download_data, + ) + .await { DownloadedTimeline::Abort => { register_sync_status(sync_start, task_name, None); @@ -852,18 +864,28 @@ async fn update_local_metadata( if local_lsn < Some(remote_lsn) { info!("Updating local timeline metadata from remote timeline: local disk_consistent_lsn={local_lsn:?}, remote disk_consistent_lsn={remote_lsn}"); - - let remote_metadata_bytes = remote_metadata - .to_bytes() - .context("Failed to serialize remote metadata to bytes")?; - fs::write(&local_metadata_path, &remote_metadata_bytes) - .await - .with_context(|| { - format!( - "Failed to write remote metadata bytes locally to path '{}'", - local_metadata_path.display() - ) - })?; + // clone because spawn_blocking requires static lifetime + let cloned_metadata = remote_metadata.to_owned(); + let ZTenantTimelineId { + tenant_id, + timeline_id, + } = sync_id; + tokio::task::spawn_blocking(move || { + LayeredRepository::save_metadata(conf, 
timeline_id, tenant_id, &cloned_metadata, true) + }) + .await + .with_context(|| { + format!( + "failed to join save_metadata task for {}", + local_metadata_path.display() + ) + })? + .with_context(|| { + format!( + "Failed to write remote metadata bytes locally to path '{}'", + local_metadata_path.display() + ) + })?; } else { info!("Local metadata at path '{}' has later disk consistent Lsn ({local_lsn:?}) than the remote one ({remote_lsn}), skipping the update", local_metadata_path.display()); } @@ -1062,7 +1084,7 @@ where debug!("Successfully fetched index part for {id}"); index_parts.insert(id, index_part); } - Err(e) => warn!("Failed to fetch index part for {id}: {e:?}"), + Err(e) => warn!("Failed to fetch index part for {id}: {e}"), } } @@ -1192,6 +1214,20 @@ fn register_sync_status(sync_start: Instant, sync_name: &str, sync_status: Optio .observe(secs_elapsed) } +pub fn path_with_suffix_extension(original_path: impl AsRef<Path>, suffix: &str) -> PathBuf { + let new_extension = match original_path + .as_ref() + .extension() + .map(OsStr::to_string_lossy) + { + Some(extension) => Cow::Owned(format!("{extension}.{suffix}")), + None => Cow::Borrowed(suffix), + }; + original_path + .as_ref() + .with_extension(new_extension.as_ref()) +} + #[cfg(test)] mod test_utils { use utils::lsn::Lsn; @@ -1600,4 +1636,28 @@ mod tests { "Merged upload tasks should have a metadata with biggest disk_consistent_lsn" ); } + + #[test] + fn test_path_with_suffix_extension() { + let p = PathBuf::from("/foo/bar"); + assert_eq!( + &path_with_suffix_extension(&p, "temp").to_string_lossy(), + "/foo/bar.temp" + ); + let p = PathBuf::from("/foo/bar"); + assert_eq!( + &path_with_suffix_extension(&p, "temp.temp").to_string_lossy(), + "/foo/bar.temp.temp" + ); + let p = PathBuf::from("/foo/bar.baz"); + assert_eq!( + &path_with_suffix_extension(&p, "temp.temp").to_string_lossy(), + "/foo/bar.baz.temp.temp" + ); + let p = PathBuf::from("/foo/bar.baz"); + assert_eq!( + &path_with_suffix_extension(&p, 
".temp").to_string_lossy(), + "/foo/bar.baz..temp" + ); + } } diff --git a/pageserver/src/remote_storage/storage_sync/download.rs b/pageserver/src/remote_storage/storage_sync/download.rs index c7a2b1fd22..7e2496b796 100644 --- a/pageserver/src/remote_storage/storage_sync/download.rs +++ b/pageserver/src/remote_storage/storage_sync/download.rs @@ -1,17 +1,20 @@ //! Timeline synchrnonization logic to fetch the layer files from remote storage into pageserver's local directory. -use std::fmt::Debug; +use std::{collections::HashSet, fmt::Debug, path::Path}; use anyhow::Context; use futures::stream::{FuturesUnordered, StreamExt}; -use tokio::fs; +use tokio::{ + fs, + io::{self, AsyncWriteExt}, +}; use tracing::{debug, error, info, warn}; use crate::{ config::PageServerConf, layered_repository::metadata::metadata_path, remote_storage::{ - storage_sync::{sync_queue, SyncTask}, + storage_sync::{path_with_suffix_extension, sync_queue, SyncTask}, RemoteStorage, }, }; @@ -22,6 +25,8 @@ use super::{ SyncData, TimelineDownload, }; +pub const TEMP_DOWNLOAD_EXTENSION: &str = "temp_download"; + /// Retrieves index data from the remote storage for a given timeline. pub async fn download_index_part( conf: &'static PageServerConf, @@ -46,7 +51,7 @@ where .download(&part_storage_path, &mut index_part_bytes) .await .with_context(|| { - format!("Failed to download an index part from storage path '{part_storage_path:?}'") + format!("Failed to download an index part from storage path {part_storage_path:?}") })?; let index_part: IndexPart = serde_json::from_slice(&index_part_bytes).with_context(|| { @@ -80,6 +85,7 @@ pub(super) enum DownloadedTimeline { /// /// On an error, bumps the retries count and updates the files to skip with successful downloads, rescheduling the task. 
pub(super) async fn download_timeline_layers<'a, P, S>( + conf: &'static PageServerConf, storage: &'a S, remote_timeline: Option<&'a RemoteTimeline>, sync_id: ZTenantTimelineId, @@ -132,12 +138,24 @@ where ) })?; - let mut destination_file = fs::File::create(&layer_desination_path) - .await - .with_context(|| { + // Perform a rename inspired by durable_rename from file_utils.c. + // The sequence: + // write(tmp) + // fsync(tmp) + // rename(tmp, new) + // fsync(new) + // fsync(parent) + // For more context about durable_rename check this email from postgres mailing list: + // https://www.postgresql.org/message-id/56583BDD.9060302@2ndquadrant.com + // If pageserver crashes the temp file will be deleted on startup and re-downloaded. + let temp_file_path = + path_with_suffix_extension(&layer_desination_path, TEMP_DOWNLOAD_EXTENSION); + + let mut destination_file = + fs::File::create(&temp_file_path).await.with_context(|| { format!( "Failed to create a destination file for layer '{}'", - layer_desination_path.display() + temp_file_path.display() ) })?; @@ -149,15 +167,55 @@ where "Failed to download a layer from storage path '{layer_storage_path:?}'" ) })?; + + // Tokio doc here: https://docs.rs/tokio/1.17.0/tokio/fs/struct.File.html states that: + // A file will not be closed immediately when it goes out of scope if there are any IO operations + // that have not yet completed. To ensure that a file is closed immediately when it is dropped, + // you should call flush before dropping it. + // + // From the tokio code I see that it waits for pending operations to complete. There shouldn't be any because + // we assume that `destination_file` file is fully written. I.e. there are no pending .write(...).await operations. + // But for additional safety let's check/wait for any pending operations. 
+ destination_file.flush().await.with_context(|| { + format!( + "failed to flush source file at {}", + temp_file_path.display() + ) + })?; + + // not using sync_data because it can lose file size update + destination_file.sync_all().await.with_context(|| { + format!( + "failed to fsync source file at {}", + temp_file_path.display() + ) + })?; + drop(destination_file); + + fail::fail_point!("remote-storage-download-pre-rename", |_| { + anyhow::bail!("remote-storage-download-pre-rename failpoint triggered") + }); + + fs::rename(&temp_file_path, &layer_desination_path).await?; + + fsync_path(&layer_desination_path).await.with_context(|| { + format!( + "Cannot fsync layer destination path {}", + layer_desination_path.display(), + ) + })?; } Ok::<_, anyhow::Error>(layer_desination_path) }) .collect::<FuturesUnordered<_>>(); let mut errors_happened = false; + // keep files we've downloaded to remove them from layers_to_skip if directory fsync fails + let mut undo = HashSet::new(); while let Some(download_result) = download_tasks.next().await { match download_result { Ok(downloaded_path) => { + undo.insert(downloaded_path.clone()); download.layers_to_skip.insert(downloaded_path); } Err(e) => { @@ -167,6 +225,24 @@ where } } + // fsync timeline directory which is a parent directory for downloaded files + let ZTenantTimelineId { + tenant_id, + timeline_id, + } = &sync_id; + let timeline_dir = conf.timeline_path(timeline_id, tenant_id); + if let Err(e) = fsync_path(&timeline_dir).await { + error!( + "Cannot fsync parent directory {} error {}", + timeline_dir.display(), + e + ); + for item in undo { + download.layers_to_skip.remove(&item); + } + errors_happened = true; + } + if errors_happened { debug!("Reenqueuing failed download task for timeline {sync_id}"); download_data.retries += 1; @@ -178,6 +254,10 @@ where } } +async fn fsync_path(path: impl AsRef<Path>) -> Result<(), io::Error> { + fs::File::open(path).await?.sync_all().await +} + #[cfg(test)] mod tests { use std::collections::{BTreeSet, 
HashSet}; @@ -236,6 +316,7 @@ mod tests { ); let download_data = match download_timeline_layers( + harness.conf, &storage, Some(&remote_timeline), sync_id, @@ -297,6 +378,7 @@ mod tests { let storage = LocalFs::new(tempdir()?.path().to_owned(), &harness.conf.workdir)?; let empty_remote_timeline_download = download_timeline_layers( + harness.conf, &storage, None, sync_id, @@ -319,6 +401,7 @@ mod tests { "Should not expect download for the timeline" ); let already_downloading_remote_timeline_download = download_timeline_layers( + harness.conf, &storage, Some(&not_expecting_download_remote_timeline), sync_id, diff --git a/test_runner/batch_others/test_remote_storage.py b/test_runner/batch_others/test_remote_storage.py index f2d654423a..59a9cfa378 100644 --- a/test_runner/batch_others/test_remote_storage.py +++ b/test_runner/batch_others/test_remote_storage.py @@ -4,10 +4,11 @@ import shutil, os from contextlib import closing from pathlib import Path +import time from uuid import UUID from fixtures.zenith_fixtures import ZenithEnvBuilder, assert_local, wait_for, wait_for_last_record_lsn, wait_for_upload from fixtures.log_helper import log -from fixtures.utils import lsn_from_hex +from fixtures.utils import lsn_from_hex, lsn_to_hex import pytest @@ -23,14 +24,14 @@ import pytest # # 2. Second pageserver # * starts another pageserver, connected to the same remote storage -# * same timeline id is queried for status, triggering timeline's download +# * timeline_attach is called for the same timeline id # * timeline status is polled until it's downloaded # * queries the specific data, ensuring that it matches the one stored before # # The tests are done for all types of remote storage pageserver supports. 
@pytest.mark.parametrize('storage_type', ['local_fs', 'mock_s3']) def test_remote_storage_backup_and_restore(zenith_env_builder: ZenithEnvBuilder, storage_type: str): - zenith_env_builder.rust_log_override = 'debug' + # zenith_env_builder.rust_log_override = 'debug' zenith_env_builder.num_safekeepers = 1 if storage_type == 'local_fs': zenith_env_builder.enable_local_fs_remote_storage() @@ -67,9 +68,7 @@ def test_remote_storage_backup_and_restore(zenith_env_builder: ZenithEnvBuilder, wait_for_last_record_lsn(client, UUID(tenant_id), UUID(timeline_id), current_lsn) # run checkpoint manually to be sure that data landed in remote storage - with closing(env.pageserver.connect()) as psconn: - with psconn.cursor() as pscur: - pscur.execute(f"checkpoint {tenant_id} {timeline_id}") + env.pageserver.safe_psql(f"checkpoint {tenant_id} {timeline_id}") log.info(f'waiting for checkpoint {checkpoint_number} upload') # wait until pageserver successfully uploaded a checkpoint to remote storage @@ -87,6 +86,27 @@ def test_remote_storage_backup_and_restore(zenith_env_builder: ZenithEnvBuilder, ##### Second start, restore the data and ensure it's the same env.pageserver.start() + # Introduce failpoint in download + env.pageserver.safe_psql(f"failpoints remote-storage-download-pre-rename=return") + + client.timeline_attach(UUID(tenant_id), UUID(timeline_id)) + + # is there a better way to assert that failpoint triggered? 
+ time.sleep(10) + + # assert cannot attach timeline that is scheduled for download + with pytest.raises(Exception, match="Timeline download is already in progress"): + client.timeline_attach(UUID(tenant_id), UUID(timeline_id)) + + detail = client.timeline_detail(UUID(tenant_id), UUID(timeline_id)) + log.info("Timeline detail with active failpoint: %s", detail) + assert detail['local'] is None + assert detail['remote']['awaits_download'] + + # trigger temporary download files removal + env.pageserver.stop() + env.pageserver.start() + client.timeline_attach(UUID(tenant_id), UUID(timeline_id)) log.info("waiting for timeline redownload") @@ -94,6 +114,12 @@ def test_remote_storage_backup_and_restore(zenith_env_builder: ZenithEnvBuilder, interval=1, func=lambda: assert_local(client, UUID(tenant_id), UUID(timeline_id))) + detail = client.timeline_detail(UUID(tenant_id), UUID(timeline_id)) + assert detail['local'] is not None + log.info("Timeline detail after attach completed: %s", detail) + assert lsn_from_hex(detail['local']['last_record_lsn']) == current_lsn + assert not detail['remote']['awaits_download'] + pg = env.postgres.create_start('main') with closing(pg.connect()) as conn: with conn.cursor() as cur: From 67b4e38092066c7633c37ee05e4d64fa9b9a2b01 Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Thu, 28 Apr 2022 00:41:06 +0300 Subject: [PATCH 162/296] remporarily disable test_backpressure_received_lsn_lag --- test_runner/batch_others/test_backpressure.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/test_runner/batch_others/test_backpressure.py b/test_runner/batch_others/test_backpressure.py index ff34121327..6658b337ec 100644 --- a/test_runner/batch_others/test_backpressure.py +++ b/test_runner/batch_others/test_backpressure.py @@ -1,6 +1,7 @@ from contextlib import closing, contextmanager import psycopg2.extras -from fixtures.zenith_fixtures import ZenithEnvBuilder +import pytest +from fixtures.zenith_fixtures import PgProtocol, 
ZenithEnvBuilder from fixtures.log_helper import log import os import time @@ -91,6 +92,7 @@ def check_backpressure(pg: Postgres, stop_event: threading.Event, polling_interv # If backpressure is enabled and tuned properly, insertion will be throttled, but the query will not timeout. +@pytest.mark.skip("See https://github.com/neondatabase/neon/issues/1587") def test_backpressure_received_lsn_lag(zenith_env_builder: ZenithEnvBuilder): zenith_env_builder.num_safekeepers = 1 env = zenith_env_builder.init_start() From aa933d3961f25ff3ebb00f0a04c89dfa4ee5ceb4 Mon Sep 17 00:00:00 2001 From: Andrey Taranik Date: Fri, 29 Apr 2022 20:05:14 +0300 Subject: [PATCH 163/296] proxy settings update for new domain (#1597) --- .circleci/helm-values/production.proxy.yaml | 6 +++--- .circleci/helm-values/staging.proxy.yaml | 6 +++--- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/.circleci/helm-values/production.proxy.yaml b/.circleci/helm-values/production.proxy.yaml index f2148c1d2c..e13968a6a8 100644 --- a/.circleci/helm-values/production.proxy.yaml +++ b/.circleci/helm-values/production.proxy.yaml @@ -5,8 +5,8 @@ image: repository: neondatabase/neon settings: - authEndpoint: "https://console.zenith.tech/authenticate_proxy_request/" - uri: "https://console.zenith.tech/psql_session/" + authEndpoint: "https://console.neon.tech/authenticate_proxy_request/" + uri: "https://console.neon.tech/psql_session/" # -- Additional labels for zenith-proxy pods podLabels: @@ -28,7 +28,7 @@ exposedService: service.beta.kubernetes.io/aws-load-balancer-type: external service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing - external-dns.alpha.kubernetes.io/hostname: start.zenith.tech + external-dns.alpha.kubernetes.io/hostname: start.zenith.tech,connect.neon.tech,pg.neon.tech metrics: enabled: true diff --git a/.circleci/helm-values/staging.proxy.yaml b/.circleci/helm-values/staging.proxy.yaml index 
f4d9855476..34ba972b64 100644 --- a/.circleci/helm-values/staging.proxy.yaml +++ b/.circleci/helm-values/staging.proxy.yaml @@ -5,8 +5,8 @@ image: repository: neondatabase/neon settings: - authEndpoint: "https://console.stage.zenith.tech/authenticate_proxy_request/" - uri: "https://console.stage.zenith.tech/psql_session/" + authEndpoint: "https://console.stage.neon.tech/authenticate_proxy_request/" + uri: "https://console.stage.neon.tech/psql_session/" # -- Additional labels for zenith-proxy pods podLabels: @@ -20,7 +20,7 @@ exposedService: service.beta.kubernetes.io/aws-load-balancer-type: external service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing - external-dns.alpha.kubernetes.io/hostname: start.stage.zenith.tech + external-dns.alpha.kubernetes.io/hostname: connect.stage.neon.tech metrics: enabled: true From 7e1db8c8a1de346d3a6350e1079fd7e6eb30033c Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Fri, 29 Apr 2022 17:08:51 +0300 Subject: [PATCH 164/296] Show which virtual file got the deserialization errors --- pageserver/src/layered_repository/delta_layer.rs | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/pageserver/src/layered_repository/delta_layer.rs b/pageserver/src/layered_repository/delta_layer.rs index ef4c3cccb0..4952f64ccd 100644 --- a/pageserver/src/layered_repository/delta_layer.rs +++ b/pageserver/src/layered_repository/delta_layer.rs @@ -258,8 +258,18 @@ impl Layer for DeltaLayer { // Ok, 'offsets' now contains the offsets of all the entries we need to read let mut cursor = file.block_cursor(); for (entry_lsn, pos) in offsets { - let buf = cursor.read_blob(pos)?; - let val = Value::des(&buf)?; + let buf = cursor.read_blob(pos).with_context(|| { + format!( + "Failed to read blob from virtual file {}", + file.file.path.display() + ) + })?; + let val = Value::des(&buf).with_context(|| { + format!( + "Failed to deserialize file blob 
from virtual file {}", + file.file.path.display() + ) + })?; match val { Value::Image(img) => { reconstruct_state.img = Some((entry_lsn, img)); From 038ea4c1280416dbcee8b1c3e24d84871602c75c Mon Sep 17 00:00:00 2001 From: Andrey Taranik Date: Sat, 30 Apr 2022 22:04:08 +0300 Subject: [PATCH 165/296] proxy notice message update (#1600) --- proxy/src/auth.rs | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/proxy/src/auth.rs b/proxy/src/auth.rs index 4c54e2f9eb..c6d32040dc 100644 --- a/proxy/src/auth.rs +++ b/proxy/src/auth.rs @@ -174,7 +174,7 @@ fn parse_password(bytes: &[u8]) -> Option<&str> { fn hello_message(redirect_uri: &str, session_id: &str) -> String { format!( concat![ - "☀️ Welcome to Zenith!\n", + "☀️ Welcome to Neon!\n", "To proceed with database creation, open the following link:\n\n", " {redirect_uri}{session_id}\n\n", "It needs to be done once and we will send you '.pgpass' file,\n", From f3f12db2cbdcfa994d3a798d609ba16f9ac38baa Mon Sep 17 00:00:00 2001 From: Dhammika Pathirana Date: Fri, 29 Apr 2022 11:48:56 -0700 Subject: [PATCH 166/296] Add gc churn threshold knob (#1594) Signed-off-by: Dhammika Pathirana --- control_plane/src/storage.rs | 7 +++++++ pageserver/src/config.rs | 1 + pageserver/src/http/models.rs | 3 +++ pageserver/src/http/routes.rs | 2 ++ pageserver/src/layered_repository.rs | 25 +++++++++++++++++-------- pageserver/src/page_service.rs | 2 ++ pageserver/src/repository.rs | 1 + pageserver/src/tenant_config.rs | 12 ++++++++++++ 8 files changed, 45 insertions(+), 8 deletions(-) diff --git a/control_plane/src/storage.rs b/control_plane/src/storage.rs index 7520ad9304..3a63bf6960 100644 --- a/control_plane/src/storage.rs +++ b/control_plane/src/storage.rs @@ -369,6 +369,10 @@ impl PageServerNode { .map(|x| x.parse::()) .transpose()?, gc_period: settings.get("gc_period").map(|x| x.to_string()), + image_creation_threshold: settings + .get("image_creation_threshold") + .map(|x| x.parse::()) + .transpose()?, pitr_interval: 
settings.get("pitr_interval").map(|x| x.to_string()), }) .send()? @@ -405,6 +409,9 @@ impl PageServerNode { .get("gc_horizon") .map(|x| x.parse::().unwrap()), gc_period: settings.get("gc_period").map(|x| x.to_string()), + image_creation_threshold: settings + .get("image_creation_threshold") + .map(|x| x.parse::().unwrap()), pitr_interval: settings.get("pitr_interval").map(|x| x.to_string()), }) .send()? diff --git a/pageserver/src/config.rs b/pageserver/src/config.rs index aed7eabb76..14ca976448 100644 --- a/pageserver/src/config.rs +++ b/pageserver/src/config.rs @@ -75,6 +75,7 @@ pub mod defaults { #gc_period = '{DEFAULT_GC_PERIOD}' #gc_horizon = {DEFAULT_GC_HORIZON} +#image_creation_threshold = {DEFAULT_IMAGE_CREATION_THRESHOLD} #pitr_interval = '{DEFAULT_PITR_INTERVAL}' # [remote_storage] diff --git a/pageserver/src/http/models.rs b/pageserver/src/http/models.rs index b24b3dc316..e9aaa72416 100644 --- a/pageserver/src/http/models.rs +++ b/pageserver/src/http/models.rs @@ -31,6 +31,7 @@ pub struct TenantCreateRequest { pub compaction_threshold: Option, pub gc_horizon: Option, pub gc_period: Option, + pub image_creation_threshold: Option, pub pitr_interval: Option, } @@ -65,6 +66,7 @@ pub struct TenantConfigRequest { pub compaction_threshold: Option, pub gc_horizon: Option, pub gc_period: Option, + pub image_creation_threshold: Option, pub pitr_interval: Option, } @@ -78,6 +80,7 @@ impl TenantConfigRequest { compaction_threshold: None, gc_horizon: None, gc_period: None, + image_creation_threshold: None, pitr_interval: None, } } diff --git a/pageserver/src/http/routes.rs b/pageserver/src/http/routes.rs index c589813d69..5903dea372 100644 --- a/pageserver/src/http/routes.rs +++ b/pageserver/src/http/routes.rs @@ -387,6 +387,7 @@ async fn tenant_create_handler(mut request: Request) -> Result) -> Result usize { + let tenant_conf = self.tenant_conf.read().unwrap(); + tenant_conf + .image_creation_threshold + 
.unwrap_or(self.conf.default_tenant_conf.image_creation_threshold) + } + pub fn get_pitr_interval(&self) -> Duration { let tenant_conf = self.tenant_conf.read().unwrap(); tenant_conf @@ -1152,6 +1159,13 @@ impl LayeredTimeline { .unwrap_or(self.conf.default_tenant_conf.compaction_threshold) } + fn get_image_creation_threshold(&self) -> usize { + let tenant_conf = self.tenant_conf.read().unwrap(); + tenant_conf + .image_creation_threshold + .unwrap_or(self.conf.default_tenant_conf.image_creation_threshold) + } + /// Open a Timeline handle. /// /// Loads the metadata for the timeline into memory, but not the layer map. @@ -1821,7 +1835,7 @@ impl LayeredTimeline { // 2. Create new image layers for partitions that have been modified // "enough". for part in partitioning.parts.iter() { - if self.time_for_new_image_layer(part, lsn, 3)? { + if self.time_for_new_image_layer(part, lsn)? { self.create_image_layer(part, lsn)?; } } @@ -1839,12 +1853,7 @@ impl LayeredTimeline { } // Is it time to create a new image layer for the given partition? 
- fn time_for_new_image_layer( - &self, - partition: &KeySpace, - lsn: Lsn, - threshold: usize, - ) -> Result { + fn time_for_new_image_layer(&self, partition: &KeySpace, lsn: Lsn) -> Result { let layers = self.layers.read().unwrap(); for part_range in &partition.ranges { @@ -1862,7 +1871,7 @@ impl LayeredTimeline { "range {}-{}, has {} deltas on this timeline", img_range.start, img_range.end, num_deltas ); - if num_deltas >= threshold { + if num_deltas >= self.get_image_creation_threshold() { return Ok(true); } } diff --git a/pageserver/src/page_service.rs b/pageserver/src/page_service.rs index ec08a840b0..0adafab8ba 100644 --- a/pageserver/src/page_service.rs +++ b/pageserver/src/page_service.rs @@ -694,6 +694,7 @@ impl postgres_backend::Handler for PageServerHandler { RowDescriptor::int8_col(b"compaction_threshold"), RowDescriptor::int8_col(b"gc_horizon"), RowDescriptor::int8_col(b"gc_period"), + RowDescriptor::int8_col(b"image_creation_threshold"), RowDescriptor::int8_col(b"pitr_interval"), ]))? .write_message_noflush(&BeMessage::DataRow(&[ @@ -708,6 +709,7 @@ impl postgres_backend::Handler for PageServerHandler { Some(repo.get_compaction_threshold().to_string().as_bytes()), Some(repo.get_gc_horizon().to_string().as_bytes()), Some(repo.get_gc_period().as_secs().to_string().as_bytes()), + Some(repo.get_image_creation_threshold().to_string().as_bytes()), Some(repo.get_pitr_interval().as_secs().to_string().as_bytes()), ]))? 
.write_message(&BeMessage::CommandComplete(b"SELECT 1"))?; diff --git a/pageserver/src/repository.rs b/pageserver/src/repository.rs index 6c75f035ca..5044f2bfc5 100644 --- a/pageserver/src/repository.rs +++ b/pageserver/src/repository.rs @@ -467,6 +467,7 @@ pub mod repo_harness { compaction_threshold: Some(tenant_conf.compaction_threshold), gc_horizon: Some(tenant_conf.gc_horizon), gc_period: Some(tenant_conf.gc_period), + image_creation_threshold: Some(tenant_conf.image_creation_threshold), pitr_interval: Some(tenant_conf.pitr_interval), } } diff --git a/pageserver/src/tenant_config.rs b/pageserver/src/tenant_config.rs index a175f6abbe..9bf223e59e 100644 --- a/pageserver/src/tenant_config.rs +++ b/pageserver/src/tenant_config.rs @@ -32,6 +32,7 @@ pub mod defaults { pub const DEFAULT_GC_HORIZON: u64 = 64 * 1024 * 1024; pub const DEFAULT_GC_PERIOD: &str = "100 s"; + pub const DEFAULT_IMAGE_CREATION_THRESHOLD: usize = 3; pub const DEFAULT_PITR_INTERVAL: &str = "30 days"; } @@ -59,6 +60,8 @@ pub struct TenantConf { // Interval at which garbage collection is triggered. #[serde(with = "humantime_serde")] pub gc_period: Duration, + // Delta layer churn threshold to create L1 image layers. + pub image_creation_threshold: usize, // Determines how much history is retained, to allow // branching and read replicas at an older point in time. // The unit is time. 
@@ -79,6 +82,7 @@ pub struct TenantConfOpt { pub gc_horizon: Option, #[serde(with = "humantime_serde")] pub gc_period: Option, + pub image_creation_threshold: Option, #[serde(with = "humantime_serde")] pub pitr_interval: Option, } @@ -100,6 +104,9 @@ impl TenantConfOpt { .unwrap_or(global_conf.compaction_threshold), gc_horizon: self.gc_horizon.unwrap_or(global_conf.gc_horizon), gc_period: self.gc_period.unwrap_or(global_conf.gc_period), + image_creation_threshold: self + .image_creation_threshold + .unwrap_or(global_conf.image_creation_threshold), pitr_interval: self.pitr_interval.unwrap_or(global_conf.pitr_interval), } } @@ -123,6 +130,9 @@ impl TenantConfOpt { if let Some(gc_period) = other.gc_period { self.gc_period = Some(gc_period); } + if let Some(image_creation_threshold) = other.image_creation_threshold { + self.image_creation_threshold = Some(image_creation_threshold); + } if let Some(pitr_interval) = other.pitr_interval { self.pitr_interval = Some(pitr_interval); } @@ -142,6 +152,7 @@ impl TenantConf { gc_horizon: DEFAULT_GC_HORIZON, gc_period: humantime::parse_duration(DEFAULT_GC_PERIOD) .expect("cannot parse default gc period"), + image_creation_threshold: DEFAULT_IMAGE_CREATION_THRESHOLD, pitr_interval: humantime::parse_duration(DEFAULT_PITR_INTERVAL) .expect("cannot parse default PITR interval"), } @@ -162,6 +173,7 @@ impl TenantConf { compaction_threshold: defaults::DEFAULT_COMPACTION_THRESHOLD, gc_horizon: defaults::DEFAULT_GC_HORIZON, gc_period: Duration::from_secs(10), + image_creation_threshold: defaults::DEFAULT_IMAGE_CREATION_THRESHOLD, pitr_interval: Duration::from_secs(60 * 60), } } From 3128e8c75ce7eacd4a33113ed78448d6c05b1dce Mon Sep 17 00:00:00 2001 From: Dhammika Pathirana Date: Fri, 29 Apr 2022 12:47:57 -0700 Subject: [PATCH 167/296] Fix tenant conf test Signed-off-by: Dhammika Pathirana --- test_runner/batch_others/test_tenant_conf.py | 59 +++++++++++++++++--- 1 file changed, 50 insertions(+), 9 deletions(-) diff --git 
a/test_runner/batch_others/test_tenant_conf.py b/test_runner/batch_others/test_tenant_conf.py index 64359a1dc3..b85a541f10 100644 --- a/test_runner/batch_others/test_tenant_conf.py +++ b/test_runner/batch_others/test_tenant_conf.py @@ -1,6 +1,7 @@ from contextlib import closing import pytest +import psycopg2.extras from fixtures.zenith_fixtures import ZenithEnvBuilder from fixtures.log_helper import log @@ -30,19 +31,39 @@ tenant_config={checkpoint_distance = 10000, compaction_target_size = 1048576}''' # check the configuration of the default tenant # it should match global configuration with closing(env.pageserver.connect()) as psconn: - with psconn.cursor() as pscur: + with psconn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as pscur: + log.info(f"show {env.initial_tenant.hex}") pscur.execute(f"show {env.initial_tenant.hex}") res = pscur.fetchone() - log.info(f"initial_tenant res: {res}") - assert res == (10000, 1048576, 1, 10, 67108864, 100, 2592000) + assert all( + i in res.items() for i in { + "checkpoint_distance": 10000, + "compaction_target_size": 1048576, + "compaction_period": 1, + "compaction_threshold": 10, + "gc_horizon": 67108864, + "gc_period": 100, + "image_creation_threshold": 3, + "pitr_interval": 2592000 + }.items()) # check the configuration of the new tenant with closing(env.pageserver.connect()) as psconn: - with psconn.cursor() as pscur: + with psconn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as pscur: pscur.execute(f"show {tenant.hex}") res = pscur.fetchone() log.info(f"res: {res}") - assert res == (20000, 1048576, 1, 10, 67108864, 30, 2592000) + assert all( + i in res.items() for i in { + "checkpoint_distance": 20000, + "compaction_target_size": 1048576, + "compaction_period": 1, + "compaction_threshold": 10, + "gc_horizon": 67108864, + "gc_period": 30, + "image_creation_threshold": 3, + "pitr_interval": 2592000 + }.items()) # update the config and ensure that it has changed 
env.zenith_cli.config_tenant(tenant_id=tenant, @@ -52,19 +73,39 @@ tenant_config={checkpoint_distance = 10000, compaction_target_size = 1048576}''' }) with closing(env.pageserver.connect()) as psconn: - with psconn.cursor() as pscur: + with psconn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as pscur: pscur.execute(f"show {tenant.hex}") res = pscur.fetchone() log.info(f"after config res: {res}") - assert res == (15000, 1048576, 1, 10, 67108864, 80, 2592000) + assert all( + i in res.items() for i in { + "checkpoint_distance": 15000, + "compaction_target_size": 1048576, + "compaction_period": 1, + "compaction_threshold": 10, + "gc_horizon": 67108864, + "gc_period": 80, + "image_creation_threshold": 3, + "pitr_interval": 2592000 + }.items()) # restart the pageserver and ensure that the config is still correct env.pageserver.stop() env.pageserver.start() with closing(env.pageserver.connect()) as psconn: - with psconn.cursor() as pscur: + with psconn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as pscur: pscur.execute(f"show {tenant.hex}") res = pscur.fetchone() log.info(f"after restart res: {res}") - assert res == (15000, 1048576, 1, 10, 67108864, 80, 2592000) + assert all( + i in res.items() for i in { + "checkpoint_distance": 15000, + "compaction_target_size": 1048576, + "compaction_period": 1, + "compaction_threshold": 10, + "gc_horizon": 67108864, + "gc_period": 80, + "image_creation_threshold": 3, + "pitr_interval": 2592000 + }.items()) From 992874c916ad77e04d526eb7882706c2495a1426 Mon Sep 17 00:00:00 2001 From: Dhammika Pathirana Date: Sun, 1 May 2022 13:52:08 -0700 Subject: [PATCH 168/296] Fix update ps settings doc Signed-off-by: Dhammika Pathirana --- docs/settings.md | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/docs/settings.md b/docs/settings.md index 530876a42a..b3925528cd 100644 --- a/docs/settings.md +++ b/docs/settings.md @@ -74,6 +74,10 @@ Every `compaction_period` seconds, the page server checks if maintenance 
operations, like compaction, are needed on the layer files. Default is 1 s, which should be fine. +#### compaction_target_size + +File sizes for L0 delta and L1 image layers. Default is 128MB. + #### gc_horizon `gz_horizon` determines how much history is retained, to allow @@ -85,6 +89,14 @@ away. Interval at which garbage collection is triggered. Default is 100 s. +#### image_creation_threshold + +L0 delta layer threshold for L1 image layer creation. Default is 3. + +#### pitr_interval + +WAL retention duration for PITR branching. Default is 30 days. + #### initial_superuser_name Name of the initial superuser role, passed to initdb when a new tenant From 2477d2f9e233b2b9b8a010f0dd6a847d029be23a Mon Sep 17 00:00:00 2001 From: Stas Kelvich Date: Mon, 2 May 2022 04:37:16 +0300 Subject: [PATCH 169/296] Deploy standalone SRAM proxy on staging --- .circleci/config.yml | 1 + .../helm-values/staging.proxy-scram.yaml | 30 +++++++++++++++++++ 2 files changed, 31 insertions(+) create mode 100644 .circleci/helm-values/staging.proxy-scram.yaml diff --git a/.circleci/config.yml b/.circleci/config.yml index 3397bcc7b7..f8787edcfb 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -585,6 +585,7 @@ jobs: command: | DOCKER_TAG=$(git log --oneline|wc -l) helm upgrade zenith-proxy zenithdb/zenith-proxy --install -f .circleci/helm-values/staging.proxy.yaml --set image.tag=${DOCKER_TAG} --wait + helm upgrade zenith-proxy-scram zenithdb/zenith-proxy --install -f .circleci/helm-values/staging.proxy-scram.yaml --set image.tag=${DOCKER_TAG} --wait deploy-release: diff --git a/.circleci/helm-values/staging.proxy-scram.yaml b/.circleci/helm-values/staging.proxy-scram.yaml new file mode 100644 index 0000000000..1a9ab239b4 --- /dev/null +++ b/.circleci/helm-values/staging.proxy-scram.yaml @@ -0,0 +1,30 @@ +# Helm chart values for zenith-proxy. +# This is a YAML-formatted file.
+ +image: + repository: neondatabase/neon + +settings: + authBackend: "console" + authEndpoint: "https://console.stage.neon.tech:9095/management/api/v2" + +# -- Additional labels for zenith-proxy pods +podLabels: + zenith_service: proxy-scram + zenith_env: staging + zenith_region: us-east-1 + zenith_region_slug: virginia + +exposedService: + annotations: + service.beta.kubernetes.io/aws-load-balancer-type: external + service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip + service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing + external-dns.alpha.kubernetes.io/hostname: *.cloud.stage.neon.tech + +metrics: + enabled: true + serviceMonitor: + enabled: true + selector: + release: kube-prometheus-stack From 8f479a712f49eb5baed065fcec00c987493105dd Mon Sep 17 00:00:00 2001 From: Andrey Taranik Date: Mon, 2 May 2022 11:38:25 +0300 Subject: [PATCH 170/296] minor fixes in proxy deployment --- .circleci/ansible/.gitignore | 2 ++ .circleci/config.yml | 7 +++---- .circleci/helm-values/staging.proxy-scram.yaml | 2 +- 3 files changed, 6 insertions(+), 5 deletions(-) diff --git a/.circleci/ansible/.gitignore b/.circleci/ansible/.gitignore index 14a1c155ae..441d9a8b82 100644 --- a/.circleci/ansible/.gitignore +++ b/.circleci/ansible/.gitignore @@ -1,2 +1,4 @@ zenith_install.tar.gz .zenith_current_version +neon_install.tar.gz +.neon_current_version diff --git a/.circleci/config.yml b/.circleci/config.yml index f8787edcfb..2ed079f031 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -579,14 +579,13 @@ jobs: name: Setup helm v3 command: | curl -s https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash - helm repo add zenithdb https://neondatabase.github.io/helm-charts + helm repo add neondatabase https://neondatabase.github.io/helm-charts - run: name: Re-deploy proxy command: | DOCKER_TAG=$(git log --oneline|wc -l) - helm upgrade zenith-proxy zenithdb/zenith-proxy --install -f .circleci/helm-values/staging.proxy.yaml --set 
image.tag=${DOCKER_TAG} --wait - helm upgrade zenith-proxy-scram zenithdb/zenith-proxy --install -f .circleci/helm-values/staging.proxy-scram.yaml --set image.tag=${DOCKER_TAG} --wait - + helm upgrade zenith-proxy neondatabase/neon-proxy --install -f .circleci/helm-values/staging.proxy.yaml --set image.tag=${DOCKER_TAG} --wait + helm upgrade neon-proxy-scram neondatabase/neon-proxy --install -f .circleci/helm-values/staging.proxy-scram.yaml --set image.tag=${DOCKER_TAG} --wait deploy-release: docker: diff --git a/.circleci/helm-values/staging.proxy-scram.yaml b/.circleci/helm-values/staging.proxy-scram.yaml index 1a9ab239b4..8c7bf835bc 100644 --- a/.circleci/helm-values/staging.proxy-scram.yaml +++ b/.circleci/helm-values/staging.proxy-scram.yaml @@ -6,7 +6,7 @@ image: settings: authBackend: "console" - authEndpoint: "https://console.stage.neon.tech:9095/management/api/v2" + authEndpoint: "http://console-staging.local/management/api/v2/healthz" # -- Additional labels for zenith-proxy pods podLabels: From 68ba6a58a0b7a1eeb4102a79e6896b4508e8018e Mon Sep 17 00:00:00 2001 From: Andrey Taranik Date: Mon, 2 May 2022 11:43:55 +0300 Subject: [PATCH 171/296] authEndpoint fix --- .circleci/helm-values/staging.proxy-scram.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.circleci/helm-values/staging.proxy-scram.yaml b/.circleci/helm-values/staging.proxy-scram.yaml index 8c7bf835bc..0391697641 100644 --- a/.circleci/helm-values/staging.proxy-scram.yaml +++ b/.circleci/helm-values/staging.proxy-scram.yaml @@ -6,7 +6,7 @@ image: settings: authBackend: "console" - authEndpoint: "http://console-staging.local/management/api/v2/healthz" + authEndpoint: "http://console-staging.local/management/api/v2" # -- Additional labels for zenith-proxy pods podLabels: From 4b1bd32e4a17fe6ecda43f3d8c67ce0726d37690 Mon Sep 17 00:00:00 2001 From: Dmitry Ivanov Date: Tue, 12 Apr 2022 01:04:02 +0300 Subject: [PATCH 172/296] Drop `Debug` impl for `ScramKey` and `ServerSecret` 
There's a notion that accidental misuse of those implementations might reveal authentication secrets. --- proxy/src/scram/exchange.rs | 3 --- proxy/src/scram/key.rs | 2 +- proxy/src/scram/secret.rs | 1 - 3 files changed, 1 insertion(+), 5 deletions(-) diff --git a/proxy/src/scram/exchange.rs b/proxy/src/scram/exchange.rs index 5a986b965a..802fe61db5 100644 --- a/proxy/src/scram/exchange.rs +++ b/proxy/src/scram/exchange.rs @@ -8,7 +8,6 @@ use super::signature::SignatureBuilder; use crate::sasl::{self, ChannelBinding, Error as SaslError}; /// The only channel binding mode we currently support. -#[derive(Debug)] struct TlsServerEndPoint; impl std::fmt::Display for TlsServerEndPoint { @@ -28,7 +27,6 @@ impl std::str::FromStr for TlsServerEndPoint { } } -#[derive(Debug)] enum ExchangeState { /// Waiting for [`ClientFirstMessage`]. Initial, @@ -41,7 +39,6 @@ enum ExchangeState { } /// Server's side of SCRAM auth algorithm. -#[derive(Debug)] pub struct Exchange<'a> { state: ExchangeState, secret: &'a ServerSecret, diff --git a/proxy/src/scram/key.rs b/proxy/src/scram/key.rs index 1c13471bc3..73dd5e1d5c 100644 --- a/proxy/src/scram/key.rs +++ b/proxy/src/scram/key.rs @@ -6,7 +6,7 @@ pub const SCRAM_KEY_LEN: usize = 32; /// One of the keys derived from the [password](super::password::SaltedPassword). /// We use the same structure for all keys, i.e. /// `ClientKey`, `StoredKey`, and `ServerKey`. -#[derive(Default, Debug, PartialEq, Eq)] +#[derive(Default, PartialEq, Eq)] #[repr(transparent)] pub struct ScramKey { bytes: [u8; SCRAM_KEY_LEN], diff --git a/proxy/src/scram/secret.rs b/proxy/src/scram/secret.rs index e8d180bcdd..bf935d3510 100644 --- a/proxy/src/scram/secret.rs +++ b/proxy/src/scram/secret.rs @@ -5,7 +5,6 @@ use super::key::ScramKey; /// Server secret is produced from [password](super::password::SaltedPassword) /// and is used throughout the authentication process. -#[derive(Debug)] pub struct ServerSecret { /// Number of iterations for `PBKDF2` function. 
pub iterations: u32, From 9df8915b03e03135bf3f8f78fa00435c94aa3ccd Mon Sep 17 00:00:00 2001 From: Dmitry Ivanov Date: Tue, 12 Apr 2022 01:12:07 +0300 Subject: [PATCH 173/296] [proxy] `sasl::Mechanism` may return `Output` during exchange This is needed to forward the `ClientKey` that's required to connect the proxy to a compute. Co-authored-by: bojanserafimov --- proxy/src/sasl.rs | 13 ++++++++++++- proxy/src/sasl/messages.rs | 1 + proxy/src/sasl/stream.rs | 13 +++++++++---- proxy/src/scram.rs | 4 ++-- proxy/src/scram/exchange.rs | 10 ++++++---- 5 files changed, 30 insertions(+), 11 deletions(-) diff --git a/proxy/src/sasl.rs b/proxy/src/sasl.rs index 70a4d9946a..cd9032bfb9 100644 --- a/proxy/src/sasl.rs +++ b/proxy/src/sasl.rs @@ -39,9 +39,20 @@ pub enum Error { /// A convenient result type for SASL exchange. pub type Result = std::result::Result; +/// A result of one SASL exchange. +pub enum Step { + /// We should continue exchanging messages. + Continue(T), + /// The client has been authenticated successfully. + Authenticated(R), +} + /// Every SASL mechanism (e.g. [SCRAM](crate::scram)) is expected to implement this trait. pub trait Mechanism: Sized { + /// What's produced as a result of successful authentication. + type Output; + /// Produce a server challenge to be sent to the client. /// This is how this method is called in PostgreSQL (`libpq/sasl.h`). 
- fn exchange(self, input: &str) -> Result<(Option, String)>; + fn exchange(self, input: &str) -> Result<(Step, String)>; } diff --git a/proxy/src/sasl/messages.rs b/proxy/src/sasl/messages.rs index 58be6268fe..f48aee4f26 100644 --- a/proxy/src/sasl/messages.rs +++ b/proxy/src/sasl/messages.rs @@ -49,6 +49,7 @@ impl<'a> ServerMessage<&'a str> { }) } } + #[cfg(test)] mod tests { use super::*; diff --git a/proxy/src/sasl/stream.rs b/proxy/src/sasl/stream.rs index 03649b8d11..0e782c5f29 100644 --- a/proxy/src/sasl/stream.rs +++ b/proxy/src/sasl/stream.rs @@ -51,18 +51,23 @@ impl SaslStream<'_, S> { impl SaslStream<'_, S> { /// Perform SASL message exchange according to the underlying algorithm /// until user is either authenticated or denied access. - pub async fn authenticate(mut self, mut mechanism: impl Mechanism) -> super::Result<()> { + pub async fn authenticate( + mut self, + mut mechanism: M, + ) -> super::Result { loop { let input = self.recv().await?; let (moved, reply) = mechanism.exchange(input)?; + + use super::Step::*; match moved { - Some(moved) => { + Continue(moved) => { self.send(&ServerMessage::Continue(&reply)).await?; mechanism = moved; } - None => { + Authenticated(result) => { self.send(&ServerMessage::Final(&reply)).await?; - return Ok(()); + return Ok(result); } } } diff --git a/proxy/src/scram.rs b/proxy/src/scram.rs index 44671084ee..22fce7ac7e 100644 --- a/proxy/src/scram.rs +++ b/proxy/src/scram.rs @@ -13,10 +13,10 @@ mod password; mod secret; mod signature; -pub use secret::*; - pub use exchange::Exchange; +pub use key::ScramKey; pub use secret::ServerSecret; +pub use secret::*; use hmac::{Hmac, Mac}; use sha2::{Digest, Sha256}; diff --git a/proxy/src/scram/exchange.rs b/proxy/src/scram/exchange.rs index 802fe61db5..cad77e15f5 100644 --- a/proxy/src/scram/exchange.rs +++ b/proxy/src/scram/exchange.rs @@ -62,8 +62,10 @@ impl<'a> Exchange<'a> { } impl sasl::Mechanism for Exchange<'_> { - fn exchange(mut self, input: &str) -> 
sasl::Result<(Option, String)> { - use ExchangeState::*; + type Output = super::ScramKey; + + fn exchange(mut self, input: &str) -> sasl::Result<(sasl::Step, String)> { + use {sasl::Step::*, ExchangeState::*}; match &self.state { Initial => { let client_first_message = @@ -82,7 +84,7 @@ impl sasl::Mechanism for Exchange<'_> { server_first_message, }; - Ok((Some(self), msg)) + Ok((Continue(self), msg)) } SaltSent { cbind_flag, @@ -124,7 +126,7 @@ impl sasl::Mechanism for Exchange<'_> { let msg = client_final_message .build_server_final_message(signature_builder, &self.secret.server_key); - Ok((None, msg)) + Ok((Authenticated(client_key), msg)) } } } From af0195b60478bc82cbb7c95c1421b5ab4c3e752e Mon Sep 17 00:00:00 2001 From: Dmitry Ivanov Date: Wed, 27 Apr 2022 13:34:59 +0300 Subject: [PATCH 174/296] [proxy] Introduce `cloud::Api` for communication with Neon Cloud * `cloud::legacy` talks to Cloud API V1. * `cloud::api` defines Cloud API v2. * `cloud::local` mocks the Cloud API V2 using a local postgres instance. * It's possible to choose between API versions using the `--api-version` flag. 
--- proxy/Cargo.toml | 2 +- proxy/src/auth.rs | 129 +++++++++++-------- proxy/src/auth/credentials.rs | 30 ++--- proxy/src/auth/flow.rs | 28 +--- proxy/src/cloud.rs | 46 +++++++ proxy/src/cloud/api.rs | 120 +++++++++++++++++ proxy/src/{cplane_api.rs => cloud/legacy.rs} | 65 +++------- proxy/src/cloud/local.rs | 76 +++++++++++ proxy/src/compute.rs | 63 +++------ proxy/src/config.rs | 84 +++++------- proxy/src/main.rs | 108 ++++++++-------- proxy/src/mgmt.rs | 8 +- proxy/src/proxy.rs | 4 +- proxy/src/scram.rs | 4 +- proxy/src/scram/key.rs | 4 + 15 files changed, 471 insertions(+), 300 deletions(-) create mode 100644 proxy/src/cloud.rs create mode 100644 proxy/src/cloud/api.rs rename proxy/src/{cplane_api.rs => cloud/legacy.rs} (81%) create mode 100644 proxy/src/cloud/local.rs diff --git a/proxy/Cargo.toml b/proxy/Cargo.toml index f7e872ceb9..73412609f3 100644 --- a/proxy/Cargo.toml +++ b/proxy/Cargo.toml @@ -5,6 +5,7 @@ edition = "2021" [dependencies] anyhow = "1.0" +async-trait = "0.1" base64 = "0.13.0" bytes = { version = "1.0.1", features = ['serde'] } clap = "3.0" @@ -37,7 +38,6 @@ metrics = { path = "../libs/metrics" } workspace_hack = { version = "0.1", path = "../workspace_hack" } [dev-dependencies] -async-trait = "0.1" rcgen = "0.8.14" rstest = "0.12" tokio-postgres-rustls = "0.9.0" diff --git a/proxy/src/auth.rs b/proxy/src/auth.rs index c6d32040dc..5234dfc2c6 100644 --- a/proxy/src/auth.rs +++ b/proxy/src/auth.rs @@ -1,22 +1,16 @@ mod credentials; - -#[cfg(test)] mod flow; -use crate::compute::DatabaseInfo; -use crate::config::ProxyConfig; -use crate::cplane_api::{self, CPlaneApi}; +use crate::config::{CloudApi, ProxyConfig}; use crate::error::UserFacingError; use crate::stream::PqStream; -use crate::waiters; +use crate::{cloud, compute, waiters}; use std::io; use thiserror::Error; use tokio::io::{AsyncRead, AsyncWrite}; use utils::pq_proto::{BeMessage as Be, BeParameterStatusMessage}; pub use credentials::ClientCredentials; - -#[cfg(test)] pub use flow::*; 
/// Common authentication error. @@ -24,9 +18,14 @@ pub use flow::*; pub enum AuthErrorImpl { /// Authentication error reported by the console. #[error(transparent)] - Console(#[from] cplane_api::AuthError), + Console(#[from] cloud::AuthError), + + #[error(transparent)] + GetAuthInfo(#[from] cloud::api::GetAuthInfoError), + + #[error(transparent)] + WakeCompute(#[from] cloud::api::WakeComputeError), - #[cfg(test)] #[error(transparent)] Sasl(#[from] crate::sasl::Error), @@ -41,19 +40,19 @@ pub enum AuthErrorImpl { impl AuthErrorImpl { pub fn auth_failed(msg: impl Into) -> Self { - AuthErrorImpl::Console(cplane_api::AuthError::auth_failed(msg)) + AuthErrorImpl::Console(cloud::AuthError::auth_failed(msg)) } } impl From for AuthErrorImpl { fn from(e: waiters::RegisterError) -> Self { - AuthErrorImpl::Console(cplane_api::AuthError::from(e)) + AuthErrorImpl::Console(cloud::AuthError::from(e)) } } impl From for AuthErrorImpl { fn from(e: waiters::WaitError) -> Self { - AuthErrorImpl::Console(cplane_api::AuthError::from(e)) + AuthErrorImpl::Console(cloud::AuthError::from(e)) } } @@ -81,40 +80,28 @@ impl UserFacingError for AuthError { } } -async fn handle_static( - host: String, - port: u16, - client: &mut PqStream, - creds: ClientCredentials, -) -> Result { - client - .write_message(&Be::AuthenticationCleartextPassword) - .await?; - - // Read client's password bytes - let msg = client.read_password_message().await?; - let cleartext_password = parse_password(&msg).ok_or(AuthErrorImpl::MalformedPassword)?; - - let db_info = DatabaseInfo { - host, - port, - dbname: creds.dbname.clone(), - user: creds.user.clone(), - password: Some(cleartext_password.into()), - }; - - client - .write_message_noflush(&Be::AuthenticationOk)? 
- .write_message_noflush(&BeParameterStatusMessage::encoding())?; - - Ok(db_info) -} - -async fn handle_existing_user( +async fn handle_user( config: &ProxyConfig, client: &mut PqStream, creds: ClientCredentials, -) -> Result { +) -> Result { + if creds.is_existing_user() { + match &config.cloud_endpoint { + CloudApi::V1(api) => handle_existing_user_v1(api, client, creds).await, + CloudApi::V2(api) => handle_existing_user_v2(api.as_ref(), client, creds).await, + } + } else { + let redirect_uri = config.redirect_uri.as_ref(); + handle_new_user(redirect_uri, client).await + } +} + +/// Authenticate user via a legacy cloud API endpoint. +async fn handle_existing_user_v1( + cloud: &cloud::Legacy, + client: &mut PqStream, + creds: ClientCredentials, +) -> Result { let psql_session_id = new_psql_session_id(); let md5_salt = rand::random(); @@ -126,8 +113,7 @@ async fn handle_existing_user( let msg = client.read_password_message().await?; let md5_response = parse_password(&msg).ok_or(AuthErrorImpl::MalformedPassword)?; - let cplane = CPlaneApi::new(config.auth_endpoint.clone()); - let db_info = cplane + let db_info = cloud .authenticate_proxy_client(creds, md5_response, &md5_salt, &psql_session_id) .await?; @@ -135,17 +121,53 @@ async fn handle_existing_user( .write_message_noflush(&Be::AuthenticationOk)? .write_message_noflush(&BeParameterStatusMessage::encoding())?; - Ok(db_info) + Ok(compute::NodeInfo { + db_info, + scram_keys: None, + }) +} + +/// Authenticate user via a new cloud API endpoint which supports SCRAM. 
+async fn handle_existing_user_v2( + cloud: &(impl cloud::Api + ?Sized), + client: &mut PqStream, + creds: ClientCredentials, +) -> Result { + let auth_info = cloud.get_auth_info(&creds).await?; + + let flow = AuthFlow::new(client); + let scram_keys = match auth_info { + cloud::api::AuthInfo::Md5(_) => { + // TODO: decide if we should support MD5 in api v2 + return Err(AuthErrorImpl::auth_failed("MD5 is not supported").into()); + } + cloud::api::AuthInfo::Scram(secret) => { + let scram = Scram(&secret); + Some(compute::ScramKeys { + client_key: flow.begin(scram).await?.authenticate().await?.as_bytes(), + server_key: secret.server_key.as_bytes(), + }) + } + }; + + client + .write_message_noflush(&Be::AuthenticationOk)? + .write_message_noflush(&BeParameterStatusMessage::encoding())?; + + Ok(compute::NodeInfo { + db_info: cloud.wake_compute(&creds).await?, + scram_keys, + }) } async fn handle_new_user( - config: &ProxyConfig, + redirect_uri: &str, client: &mut PqStream, -) -> Result { +) -> Result { let psql_session_id = new_psql_session_id(); - let greeting = hello_message(&config.redirect_uri, &psql_session_id); + let greeting = hello_message(redirect_uri, &psql_session_id); - let db_info = cplane_api::with_waiter(psql_session_id, |waiter| async { + let db_info = cloud::with_waiter(psql_session_id, |waiter| async { // Give user a URL to spawn a new database client .write_message_noflush(&Be::AuthenticationOk)? @@ -160,7 +182,10 @@ async fn handle_new_user( client.write_message_noflush(&Be::NoticeResponse("Connecting to database."))?; - Ok(db_info) + Ok(compute::NodeInfo { + db_info, + scram_keys: None, + }) } fn new_psql_session_id() -> String { diff --git a/proxy/src/auth/credentials.rs b/proxy/src/auth/credentials.rs index c3bb6da4f8..a3d06b49a2 100644 --- a/proxy/src/auth/credentials.rs +++ b/proxy/src/auth/credentials.rs @@ -1,7 +1,7 @@ //! User credentials used in authentication. 
use super::AuthError; -use crate::compute::DatabaseInfo; +use crate::compute; use crate::config::ProxyConfig; use crate::error::UserFacingError; use crate::stream::PqStream; @@ -18,12 +18,20 @@ pub enum ClientCredsParseError { impl UserFacingError for ClientCredsParseError {} /// Various client credentials which we use for authentication. -#[derive(Debug, PartialEq, Eq)] +/// Note that we don't store any kind of client key or password here. +#[derive(Debug, Clone, PartialEq, Eq)] pub struct ClientCredentials { pub user: String, pub dbname: String, } +impl ClientCredentials { + pub fn is_existing_user(&self) -> bool { + // This logic will likely change in the future. + self.user.ends_with("@zenith") + } +} + impl TryFrom> for ClientCredentials { type Error = ClientCredsParseError; @@ -47,20 +55,8 @@ impl ClientCredentials { self, config: &ProxyConfig, client: &mut PqStream, - ) -> Result { - use crate::config::ClientAuthMethod::*; - use crate::config::RouterConfig::*; - match &config.router_config { - Static { host, port } => super::handle_static(host.clone(), *port, client, self).await, - Dynamic(Mixed) => { - if self.user.ends_with("@zenith") { - super::handle_existing_user(config, client, self).await - } else { - super::handle_new_user(config, client).await - } - } - Dynamic(Password) => super::handle_existing_user(config, client, self).await, - Dynamic(Link) => super::handle_new_user(config, client).await, - } + ) -> Result { + // This method is just a convenient facade for `handle_user` + super::handle_user(config, client, self).await } } diff --git a/proxy/src/auth/flow.rs b/proxy/src/auth/flow.rs index bcfd94a9ed..3eed0f0a23 100644 --- a/proxy/src/auth/flow.rs +++ b/proxy/src/auth/flow.rs @@ -27,19 +27,6 @@ impl AuthMethod for Scram<'_> { } } -/// Use password-based auth in [`AuthFlow`]. -pub struct Md5( - /// Salt for client. 
- pub [u8; 4], -); - -impl AuthMethod for Md5 { - #[inline(always)] - fn first_message(&self) -> BeMessage<'_> { - Be::AuthenticationMD5Password(self.0) - } -} - /// This wrapper for [`PqStream`] performs client authentication. #[must_use] pub struct AuthFlow<'a, Stream, State> { @@ -70,19 +57,10 @@ impl<'a, S: AsyncWrite + Unpin> AuthFlow<'a, S, Begin> { } } -/// Stream wrapper for handling simple MD5 password auth. -impl AuthFlow<'_, S, Md5> { - /// Perform user authentication. Raise an error in case authentication failed. - #[allow(unused)] - pub async fn authenticate(self) -> Result<(), AuthError> { - unimplemented!("MD5 auth flow is yet to be implemented"); - } -} - /// Stream wrapper for handling [SCRAM](crate::scram) auth. impl AuthFlow<'_, S, Scram<'_>> { /// Perform user authentication. Raise an error in case authentication failed. - pub async fn authenticate(self) -> Result<(), AuthError> { + pub async fn authenticate(self) -> Result { // Initial client message contains the chosen auth method's name. let msg = self.stream.read_password_message().await?; let sasl = sasl::FirstMessage::parse(&msg).ok_or(AuthErrorImpl::MalformedPassword)?; @@ -93,10 +71,10 @@ impl AuthFlow<'_, S, Scram<'_>> { } let secret = self.state.0; - sasl::SaslStream::new(self.stream, sasl.message) + let key = sasl::SaslStream::new(self.stream, sasl.message) .authenticate(scram::Exchange::new(secret, rand::random, None)) .await?; - Ok(()) + Ok(key) } } diff --git a/proxy/src/cloud.rs b/proxy/src/cloud.rs new file mode 100644 index 0000000000..679cfb97e1 --- /dev/null +++ b/proxy/src/cloud.rs @@ -0,0 +1,46 @@ +mod local; + +mod legacy; +pub use legacy::{AuthError, AuthErrorImpl, Legacy}; + +pub mod api; +pub use api::{Api, BoxedApi}; + +use crate::mgmt; +use crate::waiters::{self, Waiter, Waiters}; +use lazy_static::lazy_static; + +lazy_static! { + static ref CPLANE_WAITERS: Waiters = Default::default(); +} + +/// Give caller an opportunity to wait for the cloud's reply. 
+pub async fn with_waiter( + psql_session_id: impl Into, + action: impl FnOnce(Waiter<'static, mgmt::ComputeReady>) -> R, +) -> Result +where + R: std::future::Future>, + E: From, +{ + let waiter = CPLANE_WAITERS.register(psql_session_id.into())?; + action(waiter).await +} + +pub fn notify(psql_session_id: &str, msg: mgmt::ComputeReady) -> Result<(), waiters::NotifyError> { + CPLANE_WAITERS.notify(psql_session_id, msg) +} + +/// Construct a new opaque cloud API provider. +pub fn new(url: reqwest::Url) -> anyhow::Result { + Ok(match url.scheme() { + "https" | "http" => { + todo!("build a real cloud wrapper") + } + "postgresql" | "postgres" | "pg" => { + // Just point to a local running postgres instance. + Box::new(local::Local { url }) + } + other => anyhow::bail!("unsupported url scheme: {other}"), + }) +} diff --git a/proxy/src/cloud/api.rs b/proxy/src/cloud/api.rs new file mode 100644 index 0000000000..713140c1e6 --- /dev/null +++ b/proxy/src/cloud/api.rs @@ -0,0 +1,120 @@ +//! Declaration of Cloud API V2. + +use crate::{auth, scram}; +use async_trait::async_trait; +use serde::{Deserialize, Serialize}; +use thiserror::Error; + +#[derive(Debug, Error)] +pub enum GetAuthInfoError { + // We shouldn't include the actual secret here. + #[error("Bad authentication secret")] + BadSecret, + + #[error("Bad client credentials: {0:?}")] + BadCredentials(crate::auth::ClientCredentials), + + #[error(transparent)] + Io(#[from] std::io::Error), +} + +// TODO: convert to an enum and describe possible sub-errors (see above) +#[derive(Debug, Error)] +#[error("Failed to wake up the compute node")] +pub struct WakeComputeError; + +/// Opaque implementation of Cloud API. +pub type BoxedApi = Box; + +/// Cloud API methods required by the proxy. +#[async_trait] +pub trait Api { + /// Get authentication information for the given user. 
+ async fn get_auth_info( + &self, + creds: &auth::ClientCredentials, + ) -> Result; + + /// Wake up the compute node and return the corresponding connection info. + async fn wake_compute( + &self, + creds: &auth::ClientCredentials, + ) -> Result; +} + +/// Auth secret which is managed by the cloud. +pub enum AuthInfo { + /// Md5 hash of user's password. + Md5([u8; 16]), + /// [SCRAM](crate::scram) authentication info. + Scram(scram::ServerSecret), +} + +/// Compute node connection params provided by the cloud. +/// Note how it implements serde traits, since we receive it over the wire. +#[derive(Serialize, Deserialize, Default)] +pub struct DatabaseInfo { + pub host: String, + pub port: u16, + pub dbname: String, + pub user: String, + + /// [Cloud API V1](super::legacy) returns cleartext password, + /// but [Cloud API V2](super::api) implements [SCRAM](crate::scram) + /// authentication, so we can leverage this method and cope without password. + pub password: Option, +} + +// Manually implement debug to omit personal and sensitive info. 
+impl std::fmt::Debug for DatabaseInfo { + fn fmt(&self, fmt: &mut std::fmt::Formatter) -> std::fmt::Result { + fmt.debug_struct("DatabaseInfo") + .field("host", &self.host) + .field("port", &self.port) + .finish() + } +} + +impl From for tokio_postgres::Config { + fn from(db_info: DatabaseInfo) -> Self { + let mut config = tokio_postgres::Config::new(); + + config + .host(&db_info.host) + .port(db_info.port) + .dbname(&db_info.dbname) + .user(&db_info.user); + + if let Some(password) = db_info.password { + config.password(password); + } + + config + } +} + +#[cfg(test)] +mod tests { + use super::*; + use serde_json::json; + + #[test] + fn parse_db_info() -> anyhow::Result<()> { + let _: DatabaseInfo = serde_json::from_value(json!({ + "host": "localhost", + "port": 5432, + "dbname": "postgres", + "user": "john_doe", + "password": "password", + }))?; + + let _: DatabaseInfo = serde_json::from_value(json!({ + "host": "localhost", + "port": 5432, + "dbname": "postgres", + "user": "john_doe", + }))?; + + Ok(()) + } +} diff --git a/proxy/src/cplane_api.rs b/proxy/src/cloud/legacy.rs similarity index 81% rename from proxy/src/cplane_api.rs rename to proxy/src/cloud/legacy.rs index 21fce79df3..7d99995f1a 100644 --- a/proxy/src/cplane_api.rs +++ b/proxy/src/cloud/legacy.rs @@ -1,42 +1,19 @@ +//! Cloud API V1. + +use super::api::DatabaseInfo; use crate::auth::ClientCredentials; -use crate::compute::DatabaseInfo; use crate::error::UserFacingError; -use crate::mgmt; -use crate::waiters::{self, Waiter, Waiters}; -use lazy_static::lazy_static; +use crate::waiters; use serde::{Deserialize, Serialize}; use thiserror::Error; -lazy_static! { - static ref CPLANE_WAITERS: Waiters = Default::default(); -} - -/// Give caller an opportunity to wait for cplane's reply. 
-pub async fn with_waiter( - psql_session_id: impl Into, - action: impl FnOnce(Waiter<'static, mgmt::ComputeReady>) -> R, -) -> Result -where - R: std::future::Future>, - E: From, -{ - let waiter = CPLANE_WAITERS.register(psql_session_id.into())?; - action(waiter).await -} - -pub fn notify( - psql_session_id: &str, - msg: Result, -) -> Result<(), waiters::NotifyError> { - CPLANE_WAITERS.notify(psql_session_id, msg) -} - -/// Zenith console API wrapper. -pub struct CPlaneApi { +/// Neon cloud API provider. +pub struct Legacy { auth_endpoint: reqwest::Url, } -impl CPlaneApi { +impl Legacy { + /// Construct a new legacy cloud API provider. pub fn new(auth_endpoint: reqwest::Url) -> Self { Self { auth_endpoint } } @@ -95,7 +72,17 @@ impl UserFacingError for AuthError { } } -impl CPlaneApi { +// NOTE: the order of constructors is important. +// https://serde.rs/enum-representations.html#untagged +#[derive(Serialize, Deserialize, Debug)] +#[serde(untagged)] +enum ProxyAuthResponse { + Ready { conn_info: DatabaseInfo }, + Error { error: String }, + NotReady { ready: bool }, // TODO: get rid of `ready` +} + +impl Legacy { pub async fn authenticate_proxy_client( &self, creds: ClientCredentials, @@ -111,8 +98,8 @@ impl CPlaneApi { .append_pair("salt", &hex::encode(salt)) .append_pair("psql_session_id", psql_session_id); - with_waiter(psql_session_id, |waiter| async { - println!("cplane request: {}", url); + super::with_waiter(psql_session_id, |waiter| async { + println!("cloud request: {}", url); // TODO: leverage `reqwest::Client` to reuse connections let resp = reqwest::get(url).await?; if !resp.status().is_success() { @@ -135,16 +122,6 @@ impl CPlaneApi { } } -// NOTE: the order of constructors is important. 
-// https://serde.rs/enum-representations.html#untagged -#[derive(Serialize, Deserialize, Debug)] -#[serde(untagged)] -enum ProxyAuthResponse { - Ready { conn_info: DatabaseInfo }, - Error { error: String }, - NotReady { ready: bool }, // TODO: get rid of `ready` -} - #[cfg(test)] mod tests { use super::*; diff --git a/proxy/src/cloud/local.rs b/proxy/src/cloud/local.rs new file mode 100644 index 0000000000..88eda6630c --- /dev/null +++ b/proxy/src/cloud/local.rs @@ -0,0 +1,76 @@ +//! Local mock of Cloud API V2. + +use super::api::{self, Api, AuthInfo, DatabaseInfo}; +use crate::auth::ClientCredentials; +use crate::scram; +use async_trait::async_trait; + +/// Mocked cloud for testing purposes. +pub struct Local { + /// Database url, e.g. `postgres://user:password@localhost:5432/database`. + pub url: reqwest::Url, +} + +#[async_trait] +impl Api for Local { + async fn get_auth_info( + &self, + creds: &ClientCredentials, + ) -> Result { + // We wrap `tokio_postgres::Error` because we don't want to infect the + // method's error type with a detail that's specific to debug mode only. + let io_error = |e| std::io::Error::new(std::io::ErrorKind::Other, e); + + // Perhaps we could persist this connection, but then we'd have to + // write more code for reopening it if it got closed, which doesn't + // seem worth it. + let (client, connection) = + tokio_postgres::connect(self.url.as_str(), tokio_postgres::NoTls) + .await + .map_err(io_error)?; + + tokio::spawn(connection); + let query = "select rolpassword from pg_catalog.pg_authid where rolname = $1"; + let rows = client + .query(query, &[&creds.user]) + .await + .map_err(io_error)?; + + match &rows[..] { + // We can't get a secret if there's no such user. + [] => Err(api::GetAuthInfoError::BadCredentials(creds.to_owned())), + // We shouldn't get more than one row anyway. + [row, ..] 
=> { + let entry = row.try_get(0).map_err(io_error)?; + scram::ServerSecret::parse(entry) + .map(AuthInfo::Scram) + .or_else(|| { + // It could be an md5 hash if it's not a SCRAM secret. + let text = entry.strip_prefix("md5")?; + Some(AuthInfo::Md5({ + let mut bytes = [0u8; 16]; + hex::decode_to_slice(text, &mut bytes).ok()?; + bytes + })) + }) + // Putting the secret into this message is a security hazard! + .ok_or(api::GetAuthInfoError::BadSecret) + } + } + } + + async fn wake_compute( + &self, + creds: &ClientCredentials, + ) -> Result { + // Local setup doesn't have a dedicated compute node, + // so we just return the local database we're pointed at. + Ok(DatabaseInfo { + host: self.url.host_str().unwrap_or("localhost").to_owned(), + port: self.url.port().unwrap_or(5432), + dbname: creds.dbname.to_owned(), + user: creds.user.to_owned(), + password: None, + }) + } +} diff --git a/proxy/src/compute.rs b/proxy/src/compute.rs index 3c0eee29bc..9949e91ea2 100644 --- a/proxy/src/compute.rs +++ b/proxy/src/compute.rs @@ -1,6 +1,6 @@ use crate::cancellation::CancelClosure; +use crate::cloud::api::DatabaseInfo; use crate::error::UserFacingError; -use serde::{Deserialize, Serialize}; use std::io; use std::net::SocketAddr; use thiserror::Error; @@ -23,32 +23,21 @@ pub enum ConnectionError { impl UserFacingError for ConnectionError {} -/// Compute node connection params. -#[derive(Serialize, Deserialize, Default)] -pub struct DatabaseInfo { - pub host: String, - pub port: u16, - pub dbname: String, - pub user: String, - pub password: Option, -} - -// Manually implement debug to omit personal and sensitive info -impl std::fmt::Debug for DatabaseInfo { - fn fmt(&self, fmt: &mut std::fmt::Formatter) -> std::fmt::Result { - fmt.debug_struct("DatabaseInfo") - .field("host", &self.host) - .field("port", &self.port) - .finish() - } -} - /// PostgreSQL version as [`String`]. 
pub type Version = String; -impl DatabaseInfo { +/// A pair of `ClientKey` & `ServerKey` for `SCRAM-SHA-256`. +pub type ScramKeys = tokio_postgres::config::ScramKeys<32>; + +/// Compute node connection params. +pub struct NodeInfo { + pub db_info: DatabaseInfo, + pub scram_keys: Option, +} + +impl NodeInfo { async fn connect_raw(&self) -> io::Result<(SocketAddr, TcpStream)> { - let host_port = format!("{}:{}", self.host, self.port); + let host_port = format!("{}:{}", self.db_info.host, self.db_info.port); let socket = TcpStream::connect(host_port).await?; let socket_addr = socket.peer_addr()?; socket2::SockRef::from(&socket).set_keepalive(true)?; @@ -63,11 +52,13 @@ impl DatabaseInfo { .await .map_err(|_| ConnectionError::FailedToConnectToCompute)?; - // TODO: establish a secure connection to the DB - let (client, conn) = tokio_postgres::Config::from(self) - .connect_raw(&mut socket, NoTls) - .await?; + let mut config = tokio_postgres::Config::from(self.db_info); + if let Some(scram_keys) = self.scram_keys { + config.auth_keys(tokio_postgres::config::AuthKeys::ScramSha256(scram_keys)); + } + // TODO: establish a secure connection to the DB + let (client, conn) = config.connect_raw(&mut socket, NoTls).await?; let version = conn .parameter("server_version") .ok_or(ConnectionError::FailedToFetchPgVersion)? 
@@ -78,21 +69,3 @@ impl DatabaseInfo { Ok((socket, version, cancel_closure)) } } - -impl From for tokio_postgres::Config { - fn from(db_info: DatabaseInfo) -> Self { - let mut config = tokio_postgres::Config::new(); - - config - .host(&db_info.host) - .port(db_info.port) - .dbname(&db_info.dbname) - .user(&db_info.user); - - if let Some(password) = db_info.password { - config.password(password); - } - - config - } -} diff --git a/proxy/src/config.rs b/proxy/src/config.rs index aef079d089..6b30df604d 100644 --- a/proxy/src/config.rs +++ b/proxy/src/config.rs @@ -1,65 +1,43 @@ +use crate::cloud; use anyhow::{bail, ensure, Context}; -use std::net::SocketAddr; -use std::str::FromStr; use std::sync::Arc; -pub type TlsConfig = Arc; - -#[non_exhaustive] -pub enum ClientAuthMethod { - Password, - Link, - - /// Use password auth only if username ends with "@zenith" - Mixed, -} - -pub enum RouterConfig { - Static { host: String, port: u16 }, - Dynamic(ClientAuthMethod), -} - -impl FromStr for ClientAuthMethod { - type Err = anyhow::Error; - - fn from_str(s: &str) -> anyhow::Result { - use ClientAuthMethod::*; - match s { - "password" => Ok(Password), - "link" => Ok(Link), - "mixed" => Ok(Mixed), - _ => bail!("Invalid option for router: `{}`", s), - } - } -} - pub struct ProxyConfig { - /// main entrypoint for users to connect to - pub proxy_address: SocketAddr, + /// Unauthenticated users will be redirected to this URL. + pub redirect_uri: reqwest::Url, - /// method of assigning compute nodes - pub router_config: RouterConfig, - - /// internally used for status and prometheus metrics - pub http_address: SocketAddr, - - /// management endpoint. Upon user account creation control plane - /// will notify us here, so that we can 'unfreeze' user session. - /// TODO It uses postgres protocol over TCP but should be migrated to http. 
- pub mgmt_address: SocketAddr, - - /// send unauthenticated users to this URI - pub redirect_uri: String, - - /// control plane address where we would check auth. - pub auth_endpoint: reqwest::Url, + /// Cloud API endpoint for user authentication. + pub cloud_endpoint: CloudApi, + /// TLS configuration for the proxy. pub tls_config: Option, } -pub fn configure_ssl(key_path: &str, cert_path: &str) -> anyhow::Result { +/// Cloud API configuration. +pub enum CloudApi { + /// We'll drop this one when [`CloudApi::V2`] is stable. + V1(crate::cloud::Legacy), + /// The new version of the cloud API. + V2(crate::cloud::BoxedApi), +} + +impl CloudApi { + /// Configure Cloud API provider. + pub fn new(version: &str, url: reqwest::Url) -> anyhow::Result { + Ok(match version { + "v1" => Self::V1(cloud::Legacy::new(url)), + "v2" => Self::V2(cloud::new(url)?), + _ => bail!("unknown cloud API version: {}", version), + }) + } +} + +pub type TlsConfig = Arc; + +/// Configure TLS for the main endpoint. +pub fn configure_tls(key_path: &str, cert_path: &str) -> anyhow::Result { let key = { - let key_bytes = std::fs::read(key_path).context("SSL key file")?; + let key_bytes = std::fs::read(key_path).context("TLS key file")?; let mut keys = rustls_pemfile::pkcs8_private_keys(&mut &key_bytes[..]) .context("couldn't read TLS keys")?; @@ -68,7 +46,7 @@ pub fn configure_ssl(key_path: &str, cert_path: &str) -> anyhow::Result>` into `Result`. 
async fn flatten_err( f: impl Future, JoinError>>, @@ -44,7 +37,7 @@ async fn flatten_err( #[tokio::main] async fn main() -> anyhow::Result<()> { metrics::set_common_metrics_prefix("zenith_proxy"); - let arg_matches = App::new("Zenith proxy/router") + let arg_matches = App::new("Neon proxy/router") .version(GIT_VERSION) .arg( Arg::new("proxy") @@ -97,77 +90,80 @@ async fn main() -> anyhow::Result<()> { .short('a') .long("auth-endpoint") .takes_value(true) - .help("API endpoint for authenticating users") + .help("cloud API endpoint for authenticating users") .default_value("http://localhost:3000/authenticate_proxy_request/"), ) .arg( - Arg::new("ssl-key") - .short('k') - .long("ssl-key") + Arg::new("api-version") + .long("api-version") .takes_value(true) - .help("path to SSL key for client postgres connections"), + .default_value("v1") + .possible_values(["v1", "v2"]) + .help("cloud API version to be used for authentication"), ) .arg( - Arg::new("ssl-cert") - .short('c') - .long("ssl-cert") + Arg::new("tls-key") + .short('k') + .long("tls-key") + .alias("ssl-key") // backwards compatibility .takes_value(true) - .help("path to SSL cert for client postgres connections"), + .help("path to TLS key for client postgres connections"), + ) + .arg( + Arg::new("tls-cert") + .short('c') + .long("tls-cert") + .alias("ssl-cert") // backwards compatibility + .takes_value(true) + .help("path to TLS cert for client postgres connections"), ) .get_matches(); let tls_config = match ( - arg_matches.value_of("ssl-key"), - arg_matches.value_of("ssl-cert"), + arg_matches.value_of("tls-key"), + arg_matches.value_of("tls-cert"), ) { - (Some(key_path), Some(cert_path)) => Some(config::configure_ssl(key_path, cert_path)?), + (Some(key_path), Some(cert_path)) => Some(config::configure_tls(key_path, cert_path)?), (None, None) => None, - _ => bail!("either both or neither ssl-key and ssl-cert must be specified"), + _ => bail!("either both or neither tls-key and tls-cert must be specified"), }; - 
let auth_method = arg_matches.value_of("auth-method").unwrap().parse()?; - let router_config = match arg_matches.value_of("static-router") { - None => RouterConfig::Dynamic(auth_method), - Some(addr) => { - if let ClientAuthMethod::Password = auth_method { - let (host, port) = addr.split_once(':').unwrap(); - RouterConfig::Static { - host: host.to_string(), - port: port.parse().unwrap(), - } - } else { - bail!("static-router requires --auth-method password") - } - } - }; + let proxy_address: SocketAddr = arg_matches.value_of("proxy").unwrap().parse()?; + let mgmt_address: SocketAddr = arg_matches.value_of("mgmt").unwrap().parse()?; + let http_address: SocketAddr = arg_matches.value_of("http").unwrap().parse()?; + + let cloud_endpoint = config::CloudApi::new( + arg_matches.value_of("api-version").unwrap(), + arg_matches.value_of("auth-endpoint").unwrap().parse()?, + )?; let config: &ProxyConfig = Box::leak(Box::new(ProxyConfig { - router_config, - proxy_address: arg_matches.value_of("proxy").unwrap().parse()?, - mgmt_address: arg_matches.value_of("mgmt").unwrap().parse()?, - http_address: arg_matches.value_of("http").unwrap().parse()?, redirect_uri: arg_matches.value_of("uri").unwrap().parse()?, - auth_endpoint: arg_matches.value_of("auth-endpoint").unwrap().parse()?, + cloud_endpoint, tls_config, })); println!("Version: {}", GIT_VERSION); // Check that we can bind to address before further initialization - println!("Starting http on {}", config.http_address); - let http_listener = TcpListener::bind(config.http_address).await?.into_std()?; + println!("Starting http on {}", http_address); + let http_listener = TcpListener::bind(http_address).await?.into_std()?; - println!("Starting mgmt on {}", config.mgmt_address); - let mgmt_listener = TcpListener::bind(config.mgmt_address).await?.into_std()?; + println!("Starting mgmt on {}", mgmt_address); + let mgmt_listener = TcpListener::bind(mgmt_address).await?.into_std()?; - println!("Starting proxy on {}", 
config.proxy_address); - let proxy_listener = TcpListener::bind(config.proxy_address).await?; + println!("Starting proxy on {}", proxy_address); + let proxy_listener = TcpListener::bind(proxy_address).await?; - let http = tokio::spawn(http::thread_main(http_listener)); - let proxy = tokio::spawn(proxy::thread_main(config, proxy_listener)); - let mgmt = tokio::task::spawn_blocking(move || mgmt::thread_main(mgmt_listener)); + let tasks = [ + tokio::spawn(http::thread_main(http_listener)), + tokio::spawn(proxy::thread_main(config, proxy_listener)), + tokio::task::spawn_blocking(move || mgmt::thread_main(mgmt_listener)), + ] + .map(flatten_err); - let tasks = [flatten_err(http), flatten_err(proxy), flatten_err(mgmt)]; + // This will block until all tasks have completed. + // Furthermore, the first one to fail will cancel the rest. let _: Vec<()> = futures::future::try_join_all(tasks).await?; Ok(()) diff --git a/proxy/src/mgmt.rs b/proxy/src/mgmt.rs index 23ad8a2013..c48df653d3 100644 --- a/proxy/src/mgmt.rs +++ b/proxy/src/mgmt.rs @@ -1,4 +1,4 @@ -use crate::{compute::DatabaseInfo, cplane_api}; +use crate::cloud; use anyhow::Context; use serde::Deserialize; use std::{ @@ -75,12 +75,12 @@ struct PsqlSessionResponse { #[derive(Deserialize)] enum PsqlSessionResult { - Success(DatabaseInfo), + Success(cloud::api::DatabaseInfo), Failure(String), } /// A message received by `mgmt` when a compute node is ready. -pub type ComputeReady = Result; +pub type ComputeReady = Result; impl PsqlSessionResult { fn into_compute_ready(self) -> ComputeReady { @@ -111,7 +111,7 @@ fn try_process_query(pgb: &mut PostgresBackend, query_string: &str) -> anyhow::R let resp: PsqlSessionResponse = serde_json::from_str(query_string)?; - match cplane_api::notify(&resp.session_id, resp.result.into_compute_ready()) { + match cloud::notify(&resp.session_id, resp.result.into_compute_ready()) { Ok(()) => { pgb.write_message_noflush(&SINGLE_COL_ROWDESC)? 
.write_message_noflush(&BeMessage::DataRow(&[Some(b"ok")]))? diff --git a/proxy/src/proxy.rs b/proxy/src/proxy.rs index f7de1618df..4bce2bf40d 100644 --- a/proxy/src/proxy.rs +++ b/proxy/src/proxy.rs @@ -185,10 +185,10 @@ impl Client { // Authenticate and connect to a compute node. let auth = creds.authenticate(config, &mut stream).await; - let db_info = async { auth }.or_else(|e| stream.throw_error(e)).await?; + let node = async { auth }.or_else(|e| stream.throw_error(e)).await?; let (db, version, cancel_closure) = - db_info.connect().or_else(|e| stream.throw_error(e)).await?; + node.connect().or_else(|e| stream.throw_error(e)).await?; let cancel_key_data = session.enable_cancellation(cancel_closure); stream diff --git a/proxy/src/scram.rs b/proxy/src/scram.rs index 22fce7ac7e..7cc4191435 100644 --- a/proxy/src/scram.rs +++ b/proxy/src/scram.rs @@ -9,10 +9,12 @@ mod exchange; mod key; mod messages; -mod password; mod secret; mod signature; +#[cfg(test)] +mod password; + pub use exchange::Exchange; pub use key::ScramKey; pub use secret::ServerSecret; diff --git a/proxy/src/scram/key.rs b/proxy/src/scram/key.rs index 73dd5e1d5c..e9c65fcef3 100644 --- a/proxy/src/scram/key.rs +++ b/proxy/src/scram/key.rs @@ -16,6 +16,10 @@ impl ScramKey { pub fn sha256(&self) -> Self { super::sha256([self.as_ref()]).into() } + + pub fn as_bytes(&self) -> [u8; SCRAM_KEY_LEN] { + self.bytes + } } impl From<[u8; SCRAM_KEY_LEN]> for ScramKey { From 0323bb58701767b8ce5c816637ba316166f6fb41 Mon Sep 17 00:00:00 2001 From: Stas Kelvich Date: Sat, 30 Apr 2022 00:58:57 +0300 Subject: [PATCH 175/296] [proxy] Refactor cplane API and add new console SCRAM auth API Now proxy binary accepts `--auth-backend` CLI option, which determines auth scheme and cluster routing method. 
Following backends are currently implemented: * legacy old method, when username ends with `@zenith` it uses md5 auth dbname as the cluster name; otherwise, it sends a login link and waits for the console to call back * console new SCRAM-based console API; uses SNI info to select the destination cluster * postgres uses postgres to select auth secrets of existing roles. Useful for local testing * link sends login link for all usernames --- .gitignore | 3 + Cargo.lock | 1 + proxy/Cargo.toml | 1 + proxy/README.md | 33 ++++ proxy/src/auth.rs | 159 +++------------ proxy/src/auth/credentials.rs | 12 +- proxy/src/{cloud.rs => auth_backend.rs} | 25 +-- proxy/src/auth_backend/console.rs | 236 +++++++++++++++++++++++ proxy/src/auth_backend/legacy_console.rs | 206 ++++++++++++++++++++ proxy/src/auth_backend/link.rs | 52 +++++ proxy/src/auth_backend/postgres.rs | 93 +++++++++ proxy/src/cloud/api.rs | 120 ------------ proxy/src/cloud/legacy.rs | 160 --------------- proxy/src/cloud/local.rs | 76 -------- proxy/src/compute.rs | 2 +- proxy/src/config.rs | 56 +++--- proxy/src/main.rs | 37 +--- proxy/src/mgmt.rs | 10 +- proxy/src/proxy.rs | 6 +- proxy/src/scram/secret.rs | 1 + test_runner/fixtures/zenith_fixtures.py | 11 +- 21 files changed, 722 insertions(+), 578 deletions(-) create mode 100644 proxy/README.md rename proxy/src/{cloud.rs => auth_backend.rs} (56%) create mode 100644 proxy/src/auth_backend/console.rs create mode 100644 proxy/src/auth_backend/legacy_console.rs create mode 100644 proxy/src/auth_backend/link.rs create mode 100644 proxy/src/auth_backend/postgres.rs delete mode 100644 proxy/src/cloud/api.rs delete mode 100644 proxy/src/cloud/legacy.rs delete mode 100644 proxy/src/cloud/local.rs diff --git a/.gitignore b/.gitignore index 2ecdaa2053..adb1b41503 100644 --- a/.gitignore +++ b/.gitignore @@ -11,3 +11,6 @@ test_output/ # Coverage *.profraw *.profdata + +*.key +*.crt diff --git a/Cargo.lock b/Cargo.lock index 58125ca41c..2c081e8beb 100644 --- a/Cargo.lock +++ 
b/Cargo.lock @@ -2040,6 +2040,7 @@ dependencies = [ "tokio-postgres", "tokio-postgres-rustls", "tokio-rustls", + "url", "utils", "workspace_hack", ] diff --git a/proxy/Cargo.toml b/proxy/Cargo.toml index 73412609f3..43880d645a 100644 --- a/proxy/Cargo.toml +++ b/proxy/Cargo.toml @@ -32,6 +32,7 @@ thiserror = "1.0.30" tokio = { version = "1.17", features = ["macros"] } tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" } tokio-rustls = "0.23.0" +url = "2.2.2" utils = { path = "../libs/utils" } metrics = { path = "../libs/metrics" } diff --git a/proxy/README.md b/proxy/README.md new file mode 100644 index 0000000000..458a7d9bbf --- /dev/null +++ b/proxy/README.md @@ -0,0 +1,33 @@ +# Proxy + +Proxy binary accepts `--auth-backend` CLI option, which determines auth scheme and cluster routing method. Following backends are currently implemented: + +* legacy + old method, when username ends with `@zenith` it uses md5 auth dbname as the cluster name; otherwise, it sends a login link and waits for the console to call back +* console + new SCRAM-based console API; uses SNI info to select the destination cluster +* postgres + uses postgres to select auth secrets of existing roles. Useful for local testing +* link + sends login link for all usernames + +## Using SNI-based routing on localhost + +Now proxy determines cluster name from the subdomain, request to the `my-cluster-42.somedomain.tld` will be routed to the cluster named `my-cluster-42`. Unfortunately `/etc/hosts` does not support domain wildcards, so I usually use `*.localtest.me` which resolves to `127.0.0.1`. 
Now we can create self-signed certificate and play with proxy: + +``` +openssl req -new -x509 -days 365 -nodes -text -out server.crt -keyout server.key -subj "/CN=*.localtest.me" + +``` + +now you can start proxy: + +``` +./target/debug/proxy -c server.crt -k server.key +``` + +and connect to it: + +``` +PGSSLROOTCERT=./server.crt psql 'postgres://my-cluster-42.localtest.me:1234?sslmode=verify-full' +``` diff --git a/proxy/src/auth.rs b/proxy/src/auth.rs index 5234dfc2c6..d4e21d78a0 100644 --- a/proxy/src/auth.rs +++ b/proxy/src/auth.rs @@ -1,14 +1,14 @@ mod credentials; mod flow; -use crate::config::{CloudApi, ProxyConfig}; +use crate::auth_backend::{console, legacy_console, link, postgres}; +use crate::config::{AuthBackendType, ProxyConfig}; use crate::error::UserFacingError; use crate::stream::PqStream; -use crate::{cloud, compute, waiters}; +use crate::{auth_backend, compute, waiters}; use std::io; use thiserror::Error; use tokio::io::{AsyncRead, AsyncWrite}; -use utils::pq_proto::{BeMessage as Be, BeParameterStatusMessage}; pub use credentials::ClientCredentials; pub use flow::*; @@ -18,13 +18,10 @@ pub use flow::*; pub enum AuthErrorImpl { /// Authentication error reported by the console. 
#[error(transparent)] - Console(#[from] cloud::AuthError), + Console(#[from] auth_backend::AuthError), #[error(transparent)] - GetAuthInfo(#[from] cloud::api::GetAuthInfoError), - - #[error(transparent)] - WakeCompute(#[from] cloud::api::WakeComputeError), + GetAuthInfo(#[from] auth_backend::console::ConsoleAuthError), #[error(transparent)] Sasl(#[from] crate::sasl::Error), @@ -40,19 +37,19 @@ pub enum AuthErrorImpl { impl AuthErrorImpl { pub fn auth_failed(msg: impl Into) -> Self { - AuthErrorImpl::Console(cloud::AuthError::auth_failed(msg)) + AuthErrorImpl::Console(auth_backend::AuthError::auth_failed(msg)) } } impl From for AuthErrorImpl { fn from(e: waiters::RegisterError) -> Self { - AuthErrorImpl::Console(cloud::AuthError::from(e)) + AuthErrorImpl::Console(auth_backend::AuthError::from(e)) } } impl From for AuthErrorImpl { fn from(e: waiters::WaitError) -> Self { - AuthErrorImpl::Console(cloud::AuthError::from(e)) + AuthErrorImpl::Console(auth_backend::AuthError::from(e)) } } @@ -82,131 +79,25 @@ impl UserFacingError for AuthError { async fn handle_user( config: &ProxyConfig, - client: &mut PqStream, + client: &mut PqStream, creds: ClientCredentials, ) -> Result { - if creds.is_existing_user() { - match &config.cloud_endpoint { - CloudApi::V1(api) => handle_existing_user_v1(api, client, creds).await, - CloudApi::V2(api) => handle_existing_user_v2(api.as_ref(), client, creds).await, + match config.auth_backend { + AuthBackendType::LegacyConsole => { + legacy_console::handle_user( + &config.auth_endpoint, + &config.auth_link_uri, + client, + &creds, + ) + .await } - } else { - let redirect_uri = config.redirect_uri.as_ref(); - handle_new_user(redirect_uri, client).await + AuthBackendType::Console => { + console::handle_user(config.auth_endpoint.as_ref(), client, &creds).await + } + AuthBackendType::Postgres => { + postgres::handle_user(&config.auth_endpoint, client, &creds).await + } + AuthBackendType::Link => link::handle_user(config.auth_link_uri.as_ref(), 
client).await, } } - -/// Authenticate user via a legacy cloud API endpoint. -async fn handle_existing_user_v1( - cloud: &cloud::Legacy, - client: &mut PqStream, - creds: ClientCredentials, -) -> Result { - let psql_session_id = new_psql_session_id(); - let md5_salt = rand::random(); - - client - .write_message(&Be::AuthenticationMD5Password(md5_salt)) - .await?; - - // Read client's password hash - let msg = client.read_password_message().await?; - let md5_response = parse_password(&msg).ok_or(AuthErrorImpl::MalformedPassword)?; - - let db_info = cloud - .authenticate_proxy_client(creds, md5_response, &md5_salt, &psql_session_id) - .await?; - - client - .write_message_noflush(&Be::AuthenticationOk)? - .write_message_noflush(&BeParameterStatusMessage::encoding())?; - - Ok(compute::NodeInfo { - db_info, - scram_keys: None, - }) -} - -/// Authenticate user via a new cloud API endpoint which supports SCRAM. -async fn handle_existing_user_v2( - cloud: &(impl cloud::Api + ?Sized), - client: &mut PqStream, - creds: ClientCredentials, -) -> Result { - let auth_info = cloud.get_auth_info(&creds).await?; - - let flow = AuthFlow::new(client); - let scram_keys = match auth_info { - cloud::api::AuthInfo::Md5(_) => { - // TODO: decide if we should support MD5 in api v2 - return Err(AuthErrorImpl::auth_failed("MD5 is not supported").into()); - } - cloud::api::AuthInfo::Scram(secret) => { - let scram = Scram(&secret); - Some(compute::ScramKeys { - client_key: flow.begin(scram).await?.authenticate().await?.as_bytes(), - server_key: secret.server_key.as_bytes(), - }) - } - }; - - client - .write_message_noflush(&Be::AuthenticationOk)? 
- .write_message_noflush(&BeParameterStatusMessage::encoding())?; - - Ok(compute::NodeInfo { - db_info: cloud.wake_compute(&creds).await?, - scram_keys, - }) -} - -async fn handle_new_user( - redirect_uri: &str, - client: &mut PqStream, -) -> Result { - let psql_session_id = new_psql_session_id(); - let greeting = hello_message(redirect_uri, &psql_session_id); - - let db_info = cloud::with_waiter(psql_session_id, |waiter| async { - // Give user a URL to spawn a new database - client - .write_message_noflush(&Be::AuthenticationOk)? - .write_message_noflush(&BeParameterStatusMessage::encoding())? - .write_message(&Be::NoticeResponse(&greeting)) - .await?; - - // Wait for web console response (see `mgmt`) - waiter.await?.map_err(AuthErrorImpl::auth_failed) - }) - .await?; - - client.write_message_noflush(&Be::NoticeResponse("Connecting to database."))?; - - Ok(compute::NodeInfo { - db_info, - scram_keys: None, - }) -} - -fn new_psql_session_id() -> String { - hex::encode(rand::random::<[u8; 8]>()) -} - -fn parse_password(bytes: &[u8]) -> Option<&str> { - std::str::from_utf8(bytes).ok()?.strip_suffix('\0') -} - -fn hello_message(redirect_uri: &str, session_id: &str) -> String { - format!( - concat![ - "☀️ Welcome to Neon!\n", - "To proceed with database creation, open the following link:\n\n", - " {redirect_uri}{session_id}\n\n", - "It needs to be done once and we will send you '.pgpass' file,\n", - "which will allow you to access or create ", - "databases without opening your web browser." - ], - redirect_uri = redirect_uri, - session_id = session_id, - ) -} diff --git a/proxy/src/auth/credentials.rs b/proxy/src/auth/credentials.rs index a3d06b49a2..88677de511 100644 --- a/proxy/src/auth/credentials.rs +++ b/proxy/src/auth/credentials.rs @@ -23,6 +23,10 @@ impl UserFacingError for ClientCredsParseError {} pub struct ClientCredentials { pub user: String, pub dbname: String, + + // New console API requires SNI info to determine cluster name. 
+ // Other Auth backends don't need it. + pub sni_cluster: Option, } impl ClientCredentials { @@ -45,7 +49,11 @@ impl TryFrom> for ClientCredentials { let user = get_param("user")?; let db = get_param("database")?; - Ok(Self { user, dbname: db }) + Ok(Self { + user, + dbname: db, + sni_cluster: None, + }) } } @@ -54,7 +62,7 @@ impl ClientCredentials { pub async fn authenticate( self, config: &ProxyConfig, - client: &mut PqStream, + client: &mut PqStream, ) -> Result { // This method is just a convenient facade for `handle_user` super::handle_user(config, client, self).await diff --git a/proxy/src/cloud.rs b/proxy/src/auth_backend.rs similarity index 56% rename from proxy/src/cloud.rs rename to proxy/src/auth_backend.rs index 679cfb97e1..54362bf719 100644 --- a/proxy/src/cloud.rs +++ b/proxy/src/auth_backend.rs @@ -1,10 +1,9 @@ -mod local; +pub mod console; +pub mod legacy_console; +pub mod link; +pub mod postgres; -mod legacy; -pub use legacy::{AuthError, AuthErrorImpl, Legacy}; - -pub mod api; -pub use api::{Api, BoxedApi}; +pub use legacy_console::{AuthError, AuthErrorImpl}; use crate::mgmt; use crate::waiters::{self, Waiter, Waiters}; @@ -30,17 +29,3 @@ where pub fn notify(psql_session_id: &str, msg: mgmt::ComputeReady) -> Result<(), waiters::NotifyError> { CPLANE_WAITERS.notify(psql_session_id, msg) } - -/// Construct a new opaque cloud API provider. -pub fn new(url: reqwest::Url) -> anyhow::Result { - Ok(match url.scheme() { - "https" | "http" => { - todo!("build a real cloud wrapper") - } - "postgresql" | "postgres" | "pg" => { - // Just point to a local running postgres instance. - Box::new(local::Local { url }) - } - other => anyhow::bail!("unsupported url scheme: {other}"), - }) -} diff --git a/proxy/src/auth_backend/console.rs b/proxy/src/auth_backend/console.rs new file mode 100644 index 0000000000..863e929489 --- /dev/null +++ b/proxy/src/auth_backend/console.rs @@ -0,0 +1,236 @@ +//! Declaration of Cloud API V2. 
+ +use crate::{ + auth::{self, AuthFlow}, + compute, scram, +}; +use serde::{Deserialize, Serialize}; +use thiserror::Error; + +use crate::auth::ClientCredentials; +use crate::stream::PqStream; + +use tokio::io::{AsyncRead, AsyncWrite}; +use utils::pq_proto::{BeMessage as Be, BeParameterStatusMessage}; + +#[derive(Debug, Error)] +pub enum ConsoleAuthError { + // We shouldn't include the actual secret here. + #[error("Bad authentication secret")] + BadSecret, + + #[error("Bad client credentials: {0:?}")] + BadCredentials(crate::auth::ClientCredentials), + + /// The client didn't provide SNI, which is required to determine the cluster name. + #[error("Absent SNI information")] + SniMissing, + + #[error(transparent)] + BadUrl(#[from] url::ParseError), + + #[error(transparent)] + Io(#[from] std::io::Error), + + /// HTTP status (other than 200) returned by the console. + #[error("Console responded with an HTTP status: {0}")] + HttpStatus(reqwest::StatusCode), + + #[error(transparent)] + Transport(#[from] reqwest::Error), + + #[error("Console responded with a malformed JSON: '{0}'")] + MalformedResponse(#[from] serde_json::Error), + + #[error("Console responded with a malformed compute address: '{0}'")] + MalformedComputeAddress(String), +} + +#[derive(Serialize, Deserialize, Debug)] +struct GetRoleSecretResponse { + role_secret: String, +} + +#[derive(Serialize, Deserialize, Debug)] +struct GetWakeComputeResponse { + address: String, +} + +/// Auth secret which is managed by the cloud. +pub enum AuthInfo { + /// Md5 hash of user's password. + Md5([u8; 16]), + /// [SCRAM](crate::scram) authentication info. + Scram(scram::ServerSecret), +} + +/// Compute node connection params provided by the cloud. +/// Note how it implements serde traits, since we receive it over the wire. 
+#[derive(Serialize, Deserialize, Default)] +pub struct DatabaseInfo { + pub host: String, + pub port: u16, + pub dbname: String, + pub user: String, + + /// [Cloud API V1](super::legacy) returns cleartext password, + /// but [Cloud API V2](super::api) implements [SCRAM](crate::scram) + /// authentication, so we can leverage this method and cope without password. + pub password: Option, +} + +// Manually implement debug to omit personal and sensitive info. +impl std::fmt::Debug for DatabaseInfo { + fn fmt(&self, fmt: &mut std::fmt::Formatter) -> std::fmt::Result { + fmt.debug_struct("DatabaseInfo") + .field("host", &self.host) + .field("port", &self.port) + .finish() + } +} + +impl From for tokio_postgres::Config { + fn from(db_info: DatabaseInfo) -> Self { + let mut config = tokio_postgres::Config::new(); + + config + .host(&db_info.host) + .port(db_info.port) + .dbname(&db_info.dbname) + .user(&db_info.user); + + if let Some(password) = db_info.password { + config.password(password); + } + + config + } +} + +async fn get_auth_info( + auth_endpoint: &str, + user: &str, + cluster: &str, +) -> Result { + let mut url = reqwest::Url::parse(&format!("{auth_endpoint}/proxy_get_role_secret"))?; + + url.query_pairs_mut() + .append_pair("cluster", cluster) + .append_pair("role", user); + + // TODO: use a proper logger + println!("cplane request: {}", url); + + let resp = reqwest::get(url).await?; + if !resp.status().is_success() { + return Err(ConsoleAuthError::HttpStatus(resp.status())); + } + + let response: GetRoleSecretResponse = serde_json::from_str(resp.text().await?.as_str())?; + + scram::ServerSecret::parse(response.role_secret.as_str()) + .map(AuthInfo::Scram) + .ok_or(ConsoleAuthError::BadSecret) +} + +/// Wake up the compute node and return the corresponding connection info. 
+async fn wake_compute( + auth_endpoint: &str, + cluster: &str, +) -> Result<(String, u16), ConsoleAuthError> { + let mut url = reqwest::Url::parse(&format!("{auth_endpoint}/proxy_wake_compute"))?; + url.query_pairs_mut().append_pair("cluster", cluster); + + // TODO: use a proper logger + println!("cplane request: {}", url); + + let resp = reqwest::get(url).await?; + if !resp.status().is_success() { + return Err(ConsoleAuthError::HttpStatus(resp.status())); + } + + let response: GetWakeComputeResponse = serde_json::from_str(resp.text().await?.as_str())?; + let (host, port) = response + .address + .split_once(':') + .ok_or_else(|| ConsoleAuthError::MalformedComputeAddress(response.address.clone()))?; + let port: u16 = port + .parse() + .map_err(|_| ConsoleAuthError::MalformedComputeAddress(response.address.clone()))?; + + Ok((host.to_string(), port)) +} + +pub async fn handle_user( + auth_endpoint: &str, + client: &mut PqStream, + creds: &ClientCredentials, +) -> Result { + let cluster = creds + .sni_cluster + .as_ref() + .ok_or(ConsoleAuthError::SniMissing)?; + let user = creds.user.as_str(); + + // Step 1: get the auth secret + let auth_info = get_auth_info(auth_endpoint, user, cluster).await?; + + let flow = AuthFlow::new(client); + let scram_keys = match auth_info { + AuthInfo::Md5(_) => { + // TODO: decide if we should support MD5 in api v2 + return Err(crate::auth::AuthErrorImpl::auth_failed("MD5 is not supported").into()); + } + AuthInfo::Scram(secret) => { + let scram = auth::Scram(&secret); + Some(compute::ScramKeys { + client_key: flow.begin(scram).await?.authenticate().await?.as_bytes(), + server_key: secret.server_key.as_bytes(), + }) + } + }; + + client + .write_message_noflush(&Be::AuthenticationOk)? 
+ .write_message_noflush(&BeParameterStatusMessage::encoding())?; + + // Step 2: wake compute + let (host, port) = wake_compute(auth_endpoint, cluster).await?; + + Ok(compute::NodeInfo { + db_info: DatabaseInfo { + host, + port, + dbname: creds.dbname.clone(), + user: creds.user.clone(), + password: None, + }, + scram_keys, + }) +} + +#[cfg(test)] +mod tests { + use super::*; + use serde_json::json; + + #[test] + fn parse_db_info() -> anyhow::Result<()> { + let _: DatabaseInfo = serde_json::from_value(json!({ + "host": "localhost", + "port": 5432, + "dbname": "postgres", + "user": "john_doe", + "password": "password", + }))?; + + let _: DatabaseInfo = serde_json::from_value(json!({ + "host": "localhost", + "port": 5432, + "dbname": "postgres", + "user": "john_doe", + }))?; + + Ok(()) + } +} diff --git a/proxy/src/auth_backend/legacy_console.rs b/proxy/src/auth_backend/legacy_console.rs new file mode 100644 index 0000000000..29997d2389 --- /dev/null +++ b/proxy/src/auth_backend/legacy_console.rs @@ -0,0 +1,206 @@ +//! Cloud API V1. + +use super::console::DatabaseInfo; + +use crate::auth::ClientCredentials; +use crate::stream::PqStream; + +use crate::{compute, waiters}; +use serde::{Deserialize, Serialize}; + +use tokio::io::{AsyncRead, AsyncWrite}; +use utils::pq_proto::{BeMessage as Be, BeParameterStatusMessage}; + +use thiserror::Error; + +use crate::error::UserFacingError; + +#[derive(Debug, Error)] +pub enum AuthErrorImpl { + /// Authentication error reported by the console. + #[error("Authentication failed: {0}")] + AuthFailed(String), + + /// HTTP status (other than 200) returned by the console. 
+ #[error("Console responded with an HTTP status: {0}")] + HttpStatus(reqwest::StatusCode), + + #[error("Console responded with a malformed JSON: {0}")] + MalformedResponse(#[from] serde_json::Error), + + #[error(transparent)] + Transport(#[from] reqwest::Error), + + #[error(transparent)] + WaiterRegister(#[from] waiters::RegisterError), + + #[error(transparent)] + WaiterWait(#[from] waiters::WaitError), +} + +#[derive(Debug, Error)] +#[error(transparent)] +pub struct AuthError(Box); + +impl AuthError { + /// Smart constructor for authentication error reported by `mgmt`. + pub fn auth_failed(msg: impl Into) -> Self { + AuthError(Box::new(AuthErrorImpl::AuthFailed(msg.into()))) + } +} + +impl From for AuthError +where + AuthErrorImpl: From, +{ + fn from(e: T) -> Self { + AuthError(Box::new(e.into())) + } +} + +impl UserFacingError for AuthError { + fn to_string_client(&self) -> String { + use AuthErrorImpl::*; + match self.0.as_ref() { + AuthFailed(_) | HttpStatus(_) => self.to_string(), + _ => "Internal error".to_string(), + } + } +} + +// NOTE: the order of constructors is important. 
+// https://serde.rs/enum-representations.html#untagged +#[derive(Serialize, Deserialize, Debug)] +#[serde(untagged)] +enum ProxyAuthResponse { + Ready { conn_info: DatabaseInfo }, + Error { error: String }, + NotReady { ready: bool }, // TODO: get rid of `ready` +} + +async fn authenticate_proxy_client( + auth_endpoint: &reqwest::Url, + creds: &ClientCredentials, + md5_response: &str, + salt: &[u8; 4], + psql_session_id: &str, +) -> Result { + let mut url = auth_endpoint.clone(); + url.query_pairs_mut() + .append_pair("login", &creds.user) + .append_pair("database", &creds.dbname) + .append_pair("md5response", md5_response) + .append_pair("salt", &hex::encode(salt)) + .append_pair("psql_session_id", psql_session_id); + + super::with_waiter(psql_session_id, |waiter| async { + println!("cloud request: {}", url); + // TODO: leverage `reqwest::Client` to reuse connections + let resp = reqwest::get(url).await?; + if !resp.status().is_success() { + return Err(AuthErrorImpl::HttpStatus(resp.status()).into()); + } + + let auth_info: ProxyAuthResponse = serde_json::from_str(resp.text().await?.as_str())?; + println!("got auth info: #{:?}", auth_info); + + use ProxyAuthResponse::*; + let db_info = match auth_info { + Ready { conn_info } => conn_info, + Error { error } => return Err(AuthErrorImpl::AuthFailed(error).into()), + NotReady { .. 
} => waiter.await?.map_err(AuthErrorImpl::AuthFailed)?, + }; + + Ok(db_info) + }) + .await +} + +async fn handle_existing_user( + auth_endpoint: &reqwest::Url, + client: &mut PqStream, + creds: &ClientCredentials, +) -> Result { + let psql_session_id = super::link::new_psql_session_id(); + let md5_salt = rand::random(); + + client + .write_message(&Be::AuthenticationMD5Password(md5_salt)) + .await?; + + // Read client's password hash + let msg = client.read_password_message().await?; + let md5_response = parse_password(&msg).ok_or(crate::auth::AuthErrorImpl::MalformedPassword)?; + + let db_info = authenticate_proxy_client( + auth_endpoint, + creds, + md5_response, + &md5_salt, + &psql_session_id, + ) + .await?; + + client + .write_message_noflush(&Be::AuthenticationOk)? + .write_message_noflush(&BeParameterStatusMessage::encoding())?; + + Ok(compute::NodeInfo { + db_info, + scram_keys: None, + }) +} + +pub async fn handle_user( + auth_endpoint: &reqwest::Url, + auth_link_uri: &reqwest::Url, + client: &mut PqStream, + creds: &ClientCredentials, +) -> Result { + if creds.is_existing_user() { + handle_existing_user(auth_endpoint, client, creds).await + } else { + super::link::handle_user(auth_link_uri.as_ref(), client).await + } +} + +fn parse_password(bytes: &[u8]) -> Option<&str> { + std::str::from_utf8(bytes).ok()?.strip_suffix('\0') +} + +#[cfg(test)] +mod tests { + use super::*; + use serde_json::json; + + #[test] + fn test_proxy_auth_response() { + // Ready + let auth: ProxyAuthResponse = serde_json::from_value(json!({ + "ready": true, + "conn_info": DatabaseInfo::default(), + })) + .unwrap(); + assert!(matches!( + auth, + ProxyAuthResponse::Ready { + conn_info: DatabaseInfo { .. } + } + )); + + // Error + let auth: ProxyAuthResponse = serde_json::from_value(json!({ + "ready": false, + "error": "too bad, so sad", + })) + .unwrap(); + assert!(matches!(auth, ProxyAuthResponse::Error { .. 
})); + + // NotReady + let auth: ProxyAuthResponse = serde_json::from_value(json!({ + "ready": false, + })) + .unwrap(); + assert!(matches!(auth, ProxyAuthResponse::NotReady { .. })); + } +} diff --git a/proxy/src/auth_backend/link.rs b/proxy/src/auth_backend/link.rs new file mode 100644 index 0000000000..9bdb9e21c4 --- /dev/null +++ b/proxy/src/auth_backend/link.rs @@ -0,0 +1,52 @@ +use crate::{compute, stream::PqStream}; +use tokio::io::{AsyncRead, AsyncWrite}; +use utils::pq_proto::{BeMessage as Be, BeParameterStatusMessage}; + +fn hello_message(redirect_uri: &str, session_id: &str) -> String { + format!( + concat![ + "☀️ Welcome to Neon!\n", + "To proceed with database creation, open the following link:\n\n", + " {redirect_uri}{session_id}\n\n", + "It needs to be done once and we will send you '.pgpass' file,\n", + "which will allow you to access or create ", + "databases without opening your web browser." + ], + redirect_uri = redirect_uri, + session_id = session_id, + ) +} + +pub fn new_psql_session_id() -> String { + hex::encode(rand::random::<[u8; 8]>()) +} + +pub async fn handle_user( + redirect_uri: &str, + client: &mut PqStream, +) -> Result { + let psql_session_id = new_psql_session_id(); + let greeting = hello_message(redirect_uri, &psql_session_id); + + let db_info = crate::auth_backend::with_waiter(psql_session_id, |waiter| async { + // Give user a URL to spawn a new database + client + .write_message_noflush(&Be::AuthenticationOk)? + .write_message_noflush(&BeParameterStatusMessage::encoding())? + .write_message(&Be::NoticeResponse(&greeting)) + .await?; + + // Wait for web console response (see `mgmt`) + waiter + .await? 
+ .map_err(crate::auth::AuthErrorImpl::auth_failed) + }) + .await?; + + client.write_message_noflush(&Be::NoticeResponse("Connecting to database."))?; + + Ok(compute::NodeInfo { + db_info, + scram_keys: None, + }) +} diff --git a/proxy/src/auth_backend/postgres.rs b/proxy/src/auth_backend/postgres.rs new file mode 100644 index 0000000000..148c2a2518 --- /dev/null +++ b/proxy/src/auth_backend/postgres.rs @@ -0,0 +1,93 @@ +//! Local mock of Cloud API V2. + +use super::console::{self, AuthInfo, DatabaseInfo}; +use crate::scram; +use crate::{auth::ClientCredentials, compute}; + +use crate::stream::PqStream; +use tokio::io::{AsyncRead, AsyncWrite}; +use utils::pq_proto::{BeMessage as Be, BeParameterStatusMessage}; + +async fn get_auth_info( + auth_endpoint: &str, + creds: &ClientCredentials, +) -> Result { + // We wrap `tokio_postgres::Error` because we don't want to infect the + // method's error type with a detail that's specific to debug mode only. + let io_error = |e| std::io::Error::new(std::io::ErrorKind::Other, e); + + // Perhaps we could persist this connection, but then we'd have to + // write more code for reopening it if it got closed, which doesn't + // seem worth it. + let (client, connection) = tokio_postgres::connect(auth_endpoint, tokio_postgres::NoTls) + .await + .map_err(io_error)?; + + tokio::spawn(connection); + let query = "select rolpassword from pg_catalog.pg_authid where rolname = $1"; + let rows = client + .query(query, &[&creds.user]) + .await + .map_err(io_error)?; + + match &rows[..] { + // We can't get a secret if there's no such user. + [] => Err(console::ConsoleAuthError::BadCredentials(creds.to_owned())), + // We shouldn't get more than one row anyway. + [row, ..] => { + let entry = row.try_get(0).map_err(io_error)?; + scram::ServerSecret::parse(entry) + .map(AuthInfo::Scram) + .or_else(|| { + // It could be an md5 hash if it's not a SCRAM secret. 
+ let text = entry.strip_prefix("md5")?; + Some(AuthInfo::Md5({ + let mut bytes = [0u8; 16]; + hex::decode_to_slice(text, &mut bytes).ok()?; + bytes + })) + }) + // Putting the secret into this message is a security hazard! + .ok_or(console::ConsoleAuthError::BadSecret) + } + } +} + +pub async fn handle_user( + auth_endpoint: &reqwest::Url, + client: &mut PqStream, + creds: &ClientCredentials, +) -> Result { + let auth_info = get_auth_info(auth_endpoint.as_ref(), creds).await?; + + let flow = crate::auth::AuthFlow::new(client); + let scram_keys = match auth_info { + AuthInfo::Md5(_) => { + // TODO: decide if we should support MD5 in api v2 + return Err(crate::auth::AuthErrorImpl::auth_failed("MD5 is not supported").into()); + } + AuthInfo::Scram(secret) => { + let scram = crate::auth::Scram(&secret); + Some(compute::ScramKeys { + client_key: flow.begin(scram).await?.authenticate().await?.as_bytes(), + server_key: secret.server_key.as_bytes(), + }) + } + }; + + client + .write_message_noflush(&Be::AuthenticationOk)? + .write_message_noflush(&BeParameterStatusMessage::encoding())?; + + Ok(compute::NodeInfo { + db_info: DatabaseInfo { + // TODO: handle that near CLI params parsing + host: auth_endpoint.host_str().unwrap_or("localhost").to_owned(), + port: auth_endpoint.port().unwrap_or(5432), + dbname: creds.dbname.to_owned(), + user: creds.user.to_owned(), + password: None, + }, + scram_keys, + }) +} diff --git a/proxy/src/cloud/api.rs b/proxy/src/cloud/api.rs deleted file mode 100644 index 713140c1e6..0000000000 --- a/proxy/src/cloud/api.rs +++ /dev/null @@ -1,120 +0,0 @@ -//! Declaration of Cloud API V2. - -use crate::{auth, scram}; -use async_trait::async_trait; -use serde::{Deserialize, Serialize}; -use thiserror::Error; - -#[derive(Debug, Error)] -pub enum GetAuthInfoError { - // We shouldn't include the actual secret here. 
- #[error("Bad authentication secret")] - BadSecret, - - #[error("Bad client credentials: {0:?}")] - BadCredentials(crate::auth::ClientCredentials), - - #[error(transparent)] - Io(#[from] std::io::Error), -} - -// TODO: convert to an enum and describe possible sub-errors (see above) -#[derive(Debug, Error)] -#[error("Failed to wake up the compute node")] -pub struct WakeComputeError; - -/// Opaque implementation of Cloud API. -pub type BoxedApi = Box; - -/// Cloud API methods required by the proxy. -#[async_trait] -pub trait Api { - /// Get authentication information for the given user. - async fn get_auth_info( - &self, - creds: &auth::ClientCredentials, - ) -> Result; - - /// Wake up the compute node and return the corresponding connection info. - async fn wake_compute( - &self, - creds: &auth::ClientCredentials, - ) -> Result; -} - -/// Auth secret which is managed by the cloud. -pub enum AuthInfo { - /// Md5 hash of user's password. - Md5([u8; 16]), - /// [SCRAM](crate::scram) authentication info. - Scram(scram::ServerSecret), -} - -/// Compute node connection params provided by the cloud. -/// Note how it implements serde traits, since we receive it over the wire. -#[derive(Serialize, Deserialize, Default)] -pub struct DatabaseInfo { - pub host: String, - pub port: u16, - pub dbname: String, - pub user: String, - - /// [Cloud API V1](super::legacy) returns cleartext password, - /// but [Cloud API V2](super::api) implements [SCRAM](crate::scram) - /// authentication, so we can leverage this method and cope without password. - pub password: Option, -} - -// Manually implement debug to omit personal and sensitive info. 
-impl std::fmt::Debug for DatabaseInfo { - fn fmt(&self, fmt: &mut std::fmt::Formatter) -> std::fmt::Result { - fmt.debug_struct("DatabaseInfo") - .field("host", &self.host) - .field("port", &self.port) - .finish() - } -} - -impl From for tokio_postgres::Config { - fn from(db_info: DatabaseInfo) -> Self { - let mut config = tokio_postgres::Config::new(); - - config - .host(&db_info.host) - .port(db_info.port) - .dbname(&db_info.dbname) - .user(&db_info.user); - - if let Some(password) = db_info.password { - config.password(password); - } - - config - } -} - -#[cfg(test)] -mod tests { - use super::*; - use serde_json::json; - - #[test] - fn parse_db_info() -> anyhow::Result<()> { - let _: DatabaseInfo = serde_json::from_value(json!({ - "host": "localhost", - "port": 5432, - "dbname": "postgres", - "user": "john_doe", - "password": "password", - }))?; - - let _: DatabaseInfo = serde_json::from_value(json!({ - "host": "localhost", - "port": 5432, - "dbname": "postgres", - "user": "john_doe", - }))?; - - Ok(()) - } -} diff --git a/proxy/src/cloud/legacy.rs b/proxy/src/cloud/legacy.rs deleted file mode 100644 index 7d99995f1a..0000000000 --- a/proxy/src/cloud/legacy.rs +++ /dev/null @@ -1,160 +0,0 @@ -//! Cloud API V1. - -use super::api::DatabaseInfo; -use crate::auth::ClientCredentials; -use crate::error::UserFacingError; -use crate::waiters; -use serde::{Deserialize, Serialize}; -use thiserror::Error; - -/// Neon cloud API provider. -pub struct Legacy { - auth_endpoint: reqwest::Url, -} - -impl Legacy { - /// Construct a new legacy cloud API provider. - pub fn new(auth_endpoint: reqwest::Url) -> Self { - Self { auth_endpoint } - } -} - -#[derive(Debug, Error)] -pub enum AuthErrorImpl { - /// Authentication error reported by the console. - #[error("Authentication failed: {0}")] - AuthFailed(String), - - /// HTTP status (other than 200) returned by the console. 
- #[error("Console responded with an HTTP status: {0}")] - HttpStatus(reqwest::StatusCode), - - #[error("Console responded with a malformed JSON: {0}")] - MalformedResponse(#[from] serde_json::Error), - - #[error(transparent)] - Transport(#[from] reqwest::Error), - - #[error(transparent)] - WaiterRegister(#[from] waiters::RegisterError), - - #[error(transparent)] - WaiterWait(#[from] waiters::WaitError), -} - -#[derive(Debug, Error)] -#[error(transparent)] -pub struct AuthError(Box); - -impl AuthError { - /// Smart constructor for authentication error reported by `mgmt`. - pub fn auth_failed(msg: impl Into) -> Self { - AuthError(Box::new(AuthErrorImpl::AuthFailed(msg.into()))) - } -} - -impl From for AuthError -where - AuthErrorImpl: From, -{ - fn from(e: T) -> Self { - AuthError(Box::new(e.into())) - } -} - -impl UserFacingError for AuthError { - fn to_string_client(&self) -> String { - use AuthErrorImpl::*; - match self.0.as_ref() { - AuthFailed(_) | HttpStatus(_) => self.to_string(), - _ => "Internal error".to_string(), - } - } -} - -// NOTE: the order of constructors is important. 
-// https://serde.rs/enum-representations.html#untagged -#[derive(Serialize, Deserialize, Debug)] -#[serde(untagged)] -enum ProxyAuthResponse { - Ready { conn_info: DatabaseInfo }, - Error { error: String }, - NotReady { ready: bool }, // TODO: get rid of `ready` -} - -impl Legacy { - pub async fn authenticate_proxy_client( - &self, - creds: ClientCredentials, - md5_response: &str, - salt: &[u8; 4], - psql_session_id: &str, - ) -> Result { - let mut url = self.auth_endpoint.clone(); - url.query_pairs_mut() - .append_pair("login", &creds.user) - .append_pair("database", &creds.dbname) - .append_pair("md5response", md5_response) - .append_pair("salt", &hex::encode(salt)) - .append_pair("psql_session_id", psql_session_id); - - super::with_waiter(psql_session_id, |waiter| async { - println!("cloud request: {}", url); - // TODO: leverage `reqwest::Client` to reuse connections - let resp = reqwest::get(url).await?; - if !resp.status().is_success() { - return Err(AuthErrorImpl::HttpStatus(resp.status()).into()); - } - - let auth_info: ProxyAuthResponse = serde_json::from_str(resp.text().await?.as_str())?; - println!("got auth info: #{:?}", auth_info); - - use ProxyAuthResponse::*; - let db_info = match auth_info { - Ready { conn_info } => conn_info, - Error { error } => return Err(AuthErrorImpl::AuthFailed(error).into()), - NotReady { .. } => waiter.await?.map_err(AuthErrorImpl::AuthFailed)?, - }; - - Ok(db_info) - }) - .await - } -} - -#[cfg(test)] -mod tests { - use super::*; - use serde_json::json; - - #[test] - fn test_proxy_auth_response() { - // Ready - let auth: ProxyAuthResponse = serde_json::from_value(json!({ - "ready": true, - "conn_info": DatabaseInfo::default(), - })) - .unwrap(); - assert!(matches!( - auth, - ProxyAuthResponse::Ready { - conn_info: DatabaseInfo { .. 
} - } - )); - - // Error - let auth: ProxyAuthResponse = serde_json::from_value(json!({ - "ready": false, - "error": "too bad, so sad", - })) - .unwrap(); - assert!(matches!(auth, ProxyAuthResponse::Error { .. })); - - // NotReady - let auth: ProxyAuthResponse = serde_json::from_value(json!({ - "ready": false, - })) - .unwrap(); - assert!(matches!(auth, ProxyAuthResponse::NotReady { .. })); - } -} diff --git a/proxy/src/cloud/local.rs b/proxy/src/cloud/local.rs deleted file mode 100644 index 88eda6630c..0000000000 --- a/proxy/src/cloud/local.rs +++ /dev/null @@ -1,76 +0,0 @@ -//! Local mock of Cloud API V2. - -use super::api::{self, Api, AuthInfo, DatabaseInfo}; -use crate::auth::ClientCredentials; -use crate::scram; -use async_trait::async_trait; - -/// Mocked cloud for testing purposes. -pub struct Local { - /// Database url, e.g. `postgres://user:password@localhost:5432/database`. - pub url: reqwest::Url, -} - -#[async_trait] -impl Api for Local { - async fn get_auth_info( - &self, - creds: &ClientCredentials, - ) -> Result { - // We wrap `tokio_postgres::Error` because we don't want to infect the - // method's error type with a detail that's specific to debug mode only. - let io_error = |e| std::io::Error::new(std::io::ErrorKind::Other, e); - - // Perhaps we could persist this connection, but then we'd have to - // write more code for reopening it if it got closed, which doesn't - // seem worth it. - let (client, connection) = - tokio_postgres::connect(self.url.as_str(), tokio_postgres::NoTls) - .await - .map_err(io_error)?; - - tokio::spawn(connection); - let query = "select rolpassword from pg_catalog.pg_authid where rolname = $1"; - let rows = client - .query(query, &[&creds.user]) - .await - .map_err(io_error)?; - - match &rows[..] { - // We can't get a secret if there's no such user. - [] => Err(api::GetAuthInfoError::BadCredentials(creds.to_owned())), - // We shouldn't get more than one row anyway. - [row, ..] 
=> { - let entry = row.try_get(0).map_err(io_error)?; - scram::ServerSecret::parse(entry) - .map(AuthInfo::Scram) - .or_else(|| { - // It could be an md5 hash if it's not a SCRAM secret. - let text = entry.strip_prefix("md5")?; - Some(AuthInfo::Md5({ - let mut bytes = [0u8; 16]; - hex::decode_to_slice(text, &mut bytes).ok()?; - bytes - })) - }) - // Putting the secret into this message is a security hazard! - .ok_or(api::GetAuthInfoError::BadSecret) - } - } - } - - async fn wake_compute( - &self, - creds: &ClientCredentials, - ) -> Result { - // Local setup doesn't have a dedicated compute node, - // so we just return the local database we're pointed at. - Ok(DatabaseInfo { - host: self.url.host_str().unwrap_or("localhost").to_owned(), - port: self.url.port().unwrap_or(5432), - dbname: creds.dbname.to_owned(), - user: creds.user.to_owned(), - password: None, - }) - } -} diff --git a/proxy/src/compute.rs b/proxy/src/compute.rs index 9949e91ea2..c3c5ba47fb 100644 --- a/proxy/src/compute.rs +++ b/proxy/src/compute.rs @@ -1,5 +1,5 @@ +use crate::auth_backend::console::DatabaseInfo; use crate::cancellation::CancelClosure; -use crate::cloud::api::DatabaseInfo; use crate::error::UserFacingError; use std::io; use std::net::SocketAddr; diff --git a/proxy/src/config.rs b/proxy/src/config.rs index 6b30df604d..077a07beb9 100644 --- a/proxy/src/config.rs +++ b/proxy/src/config.rs @@ -1,35 +1,39 @@ -use crate::cloud; -use anyhow::{bail, ensure, Context}; -use std::sync::Arc; +use anyhow::{ensure, Context}; +use std::{str::FromStr, sync::Arc}; + +#[non_exhaustive] +pub enum AuthBackendType { + LegacyConsole, + Console, + Postgres, + Link, +} + +impl FromStr for AuthBackendType { + type Err = anyhow::Error; + + fn from_str(s: &str) -> anyhow::Result { + println!("ClientAuthMethod::from_str: '{}'", s); + use AuthBackendType::*; + match s { + "legacy" => Ok(LegacyConsole), + "console" => Ok(Console), + "postgres" => Ok(Postgres), + "link" => Ok(Link), + _ => 
Err(anyhow::anyhow!("Invlid option for auth method")), + } + } +} pub struct ProxyConfig { - /// Unauthenticated users will be redirected to this URL. - pub redirect_uri: reqwest::Url, - - /// Cloud API endpoint for user authentication. - pub cloud_endpoint: CloudApi, - /// TLS configuration for the proxy. pub tls_config: Option, -} -/// Cloud API configuration. -pub enum CloudApi { - /// We'll drop this one when [`CloudApi::V2`] is stable. - V1(crate::cloud::Legacy), - /// The new version of the cloud API. - V2(crate::cloud::BoxedApi), -} + pub auth_backend: AuthBackendType, -impl CloudApi { - /// Configure Cloud API provider. - pub fn new(version: &str, url: reqwest::Url) -> anyhow::Result { - Ok(match version { - "v1" => Self::V1(cloud::Legacy::new(url)), - "v2" => Self::V2(cloud::new(url)?), - _ => bail!("unknown cloud API version: {}", version), - }) - } + pub auth_endpoint: reqwest::Url, + + pub auth_link_uri: reqwest::Url, } pub type TlsConfig = Arc; diff --git a/proxy/src/main.rs b/proxy/src/main.rs index ce9889ce30..fc2a368b85 100644 --- a/proxy/src/main.rs +++ b/proxy/src/main.rs @@ -5,8 +5,8 @@ //! in somewhat transparent manner (again via communication with control plane API). 
mod auth; +mod auth_backend; mod cancellation; -mod cloud; mod compute; mod config; mod error; @@ -48,18 +48,11 @@ async fn main() -> anyhow::Result<()> { .default_value("127.0.0.1:4432"), ) .arg( - Arg::new("auth-method") - .long("auth-method") + Arg::new("auth-backend") + .long("auth-backend") .takes_value(true) - .help("Possible values: password | link | mixed") - .default_value("mixed"), - ) - .arg( - Arg::new("static-router") - .short('s') - .long("static-router") - .takes_value(true) - .help("Route all clients to host:port"), + .help("Possible values: legacy | console | postgres | link") + .default_value("legacy"), ) .arg( Arg::new("mgmt") @@ -82,7 +75,7 @@ async fn main() -> anyhow::Result<()> { .short('u') .long("uri") .takes_value(true) - .help("redirect unauthenticated users to given uri") + .help("redirect unauthenticated users to the given uri in case of link auth") .default_value("http://localhost:3000/psql_session/"), ) .arg( @@ -93,14 +86,6 @@ async fn main() -> anyhow::Result<()> { .help("cloud API endpoint for authenticating users") .default_value("http://localhost:3000/authenticate_proxy_request/"), ) - .arg( - Arg::new("api-version") - .long("api-version") - .takes_value(true) - .default_value("v1") - .possible_values(["v1", "v2"]) - .help("cloud API version to be used for authentication"), - ) .arg( Arg::new("tls-key") .short('k') @@ -132,15 +117,11 @@ async fn main() -> anyhow::Result<()> { let mgmt_address: SocketAddr = arg_matches.value_of("mgmt").unwrap().parse()?; let http_address: SocketAddr = arg_matches.value_of("http").unwrap().parse()?; - let cloud_endpoint = config::CloudApi::new( - arg_matches.value_of("api-version").unwrap(), - arg_matches.value_of("auth-endpoint").unwrap().parse()?, - )?; - let config: &ProxyConfig = Box::leak(Box::new(ProxyConfig { - redirect_uri: arg_matches.value_of("uri").unwrap().parse()?, - cloud_endpoint, tls_config, + auth_backend: arg_matches.value_of("auth-backend").unwrap().parse()?, + auth_endpoint: 
arg_matches.value_of("auth-endpoint").unwrap().parse()?, + auth_link_uri: arg_matches.value_of("uri").unwrap().parse()?, })); println!("Version: {}", GIT_VERSION); diff --git a/proxy/src/mgmt.rs b/proxy/src/mgmt.rs index c48df653d3..93618fff68 100644 --- a/proxy/src/mgmt.rs +++ b/proxy/src/mgmt.rs @@ -1,4 +1,4 @@ -use crate::cloud; +use crate::auth_backend; use anyhow::Context; use serde::Deserialize; use std::{ @@ -10,6 +10,8 @@ use utils::{ pq_proto::{BeMessage, SINGLE_COL_ROWDESC}, }; +/// TODO: move all of that to auth-backend/link.rs when we ditch legacy-console backend + /// /// Main proxy listener loop. /// @@ -75,12 +77,12 @@ struct PsqlSessionResponse { #[derive(Deserialize)] enum PsqlSessionResult { - Success(cloud::api::DatabaseInfo), + Success(auth_backend::console::DatabaseInfo), Failure(String), } /// A message received by `mgmt` when a compute node is ready. -pub type ComputeReady = Result; +pub type ComputeReady = Result; impl PsqlSessionResult { fn into_compute_ready(self) -> ComputeReady { @@ -111,7 +113,7 @@ fn try_process_query(pgb: &mut PostgresBackend, query_string: &str) -> anyhow::R let resp: PsqlSessionResponse = serde_json::from_str(query_string)?; - match cloud::notify(&resp.session_id, resp.result.into_compute_ready()) { + match auth_backend::notify(&resp.session_id, resp.result.into_compute_ready()) { Ok(()) => { pgb.write_message_noflush(&SINGLE_COL_ROWDESC)? .write_message_noflush(&BeMessage::DataRow(&[Some(b"ok")]))? diff --git a/proxy/src/proxy.rs b/proxy/src/proxy.rs index 4bce2bf40d..4bdbac8510 100644 --- a/proxy/src/proxy.rs +++ b/proxy/src/proxy.rs @@ -73,7 +73,7 @@ pub async fn thread_main( async fn handle_client( config: &ProxyConfig, cancel_map: &CancelMap, - stream: impl AsyncRead + AsyncWrite + Unpin, + stream: impl AsyncRead + AsyncWrite + Unpin + Send, ) -> anyhow::Result<()> { // The `closed` counter will increase when this future is destroyed. 
     NUM_CONNECTIONS_ACCEPTED_COUNTER.inc();
@@ -148,6 +148,8 @@ async fn handshake(
                     .or_else(|e| stream.throw_error(e))
                     .await?;

+                // TODO: set creds.cluster here when SNI info is available
+
                 break Ok(Some((stream, creds)));
             }
             CancelRequest(cancel_key_data) => {
@@ -174,7 +176,7 @@ impl Client {
     }
 }

-impl Client {
+impl Client {
     /// Let the client authenticate and connect to the designated compute node.
     async fn connect_to_db(
         self,
diff --git a/proxy/src/scram/secret.rs b/proxy/src/scram/secret.rs
index bf935d3510..765aef4443 100644
--- a/proxy/src/scram/secret.rs
+++ b/proxy/src/scram/secret.rs
@@ -38,6 +38,7 @@ impl ServerSecret {
     /// To avoid revealing information to an attacker, we use a
     /// mocked server secret even if the user doesn't exist.
     /// See `auth-scram.c : mock_scram_secret` for details.
+    #[allow(dead_code)]
     pub fn mock(user: &str, nonce: &[u8; 32]) -> Self {
         // Refer to `auth-scram.c : scram_mock_salt`.
         let mocked_salt = super::sha256([user.as_bytes(), nonce]);
diff --git a/test_runner/fixtures/zenith_fixtures.py b/test_runner/fixtures/zenith_fixtures.py
index e16d1acf2f..5614cea68b 100644
--- a/test_runner/fixtures/zenith_fixtures.py
+++ b/test_runner/fixtures/zenith_fixtures.py
@@ -1382,8 +1382,8 @@ def remote_pg(test_output_dir: str) -> Iterator[RemotePostgres]:
 class ZenithProxy(PgProtocol):
     def __init__(self, port: int):
         super().__init__(host="127.0.0.1",
-                         user="pytest",
-                         password="pytest",
+                         user="proxy_user",
+                         password="pytest2",
                          port=port,
                          dbname='postgres')
         self.http_port = 7001
@@ -1399,8 +1399,8 @@ class ZenithProxy(PgProtocol):
         args = [bin_proxy]
         args.extend(["--http", f"{self.host}:{self.http_port}"])
         args.extend(["--proxy", f"{self.host}:{self.port}"])
-        args.extend(["--auth-method", "password"])
-        args.extend(["--static-router", addr])
+        args.extend(["--auth-backend", "postgres"])
+        args.extend(["--auth-endpoint", "postgres://proxy_auth:pytest1@localhost:5432/postgres"])
         self._popen = subprocess.Popen(args)
         self._wait_until_ready()
@@ -1422,7 +1422,8 @@ class ZenithProxy(PgProtocol):
 def static_proxy(vanilla_pg) -> Iterator[ZenithProxy]:
     """Zenith proxy that routes directly to vanilla postgres."""
     vanilla_pg.start()
-    vanilla_pg.safe_psql("create user pytest with password 'pytest';")
+    vanilla_pg.safe_psql("create user proxy_auth with password 'pytest1' superuser")
+    vanilla_pg.safe_psql("create user proxy_user with password 'pytest2'")

     with ZenithProxy(4432) as proxy:
         proxy.start_static()

From 9a396e1feb9f35e4f2d57d38a2ac07070ecc1b4b Mon Sep 17 00:00:00 2001
From: Stas Kelvich
Date: Mon, 2 May 2022 00:35:15 +0300
Subject: [PATCH 176/296] Support SNI-based routing in proxy

---
 proxy/src/auth.rs                 |  2 ++
 proxy/src/auth/credentials.rs     |  6 +++---
 proxy/src/auth_backend/console.rs | 15 +++++++++++----
 proxy/src/proxy.rs                |  7 +++++--
 4 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/proxy/src/auth.rs b/proxy/src/auth.rs
index d4e21d78a0..2463f31645 100644
--- a/proxy/src/auth.rs
+++ b/proxy/src/auth.rs
@@ -6,6 +6,7 @@ use crate::config::{AuthBackendType, ProxyConfig};
 use crate::error::UserFacingError;
 use crate::stream::PqStream;
 use crate::{auth_backend, compute, waiters};
+use console::ConsoleAuthError::SniMissing;
 use std::io;
 use thiserror::Error;
 use tokio::io::{AsyncRead, AsyncWrite};
@@ -72,6 +73,7 @@ impl UserFacingError for AuthError {
         match self.0.as_ref() {
             Console(e) => e.to_string_client(),
             MalformedPassword => self.to_string(),
+            GetAuthInfo(e) if matches!(e, SniMissing) => e.to_string(),
             _ => "Internal error".to_string(),
         }
     }
diff --git a/proxy/src/auth/credentials.rs b/proxy/src/auth/credentials.rs
index 88677de511..9d2272b5ad 100644
--- a/proxy/src/auth/credentials.rs
+++ b/proxy/src/auth/credentials.rs
@@ -24,9 +24,9 @@ pub struct ClientCredentials {
     pub user: String,
     pub dbname: String,

-    // New console API requires SNI info to determine cluster name.
+    // New console API requires SNI info to determine the cluster name.
     // Other Auth backends don't need it.
-    pub sni_cluster: Option,
+    pub sni_data: Option,
 }

 impl ClientCredentials {
@@ -52,7 +52,7 @@ impl TryFrom> for ClientCredentials {
         Ok(Self {
             user,
             dbname: db,
-            sni_cluster: None,
+            sni_data: None,
         })
     }
 }
diff --git a/proxy/src/auth_backend/console.rs b/proxy/src/auth_backend/console.rs
index 863e929489..55a0889af4 100644
--- a/proxy/src/auth_backend/console.rs
+++ b/proxy/src/auth_backend/console.rs
@@ -22,10 +22,12 @@ pub enum ConsoleAuthError {
     #[error("Bad client credentials: {0:?}")]
     BadCredentials(crate::auth::ClientCredentials),

-    /// For passwords that couldn't be processed by [`parse_password`].
-    #[error("Absend SNI information")]
+    #[error("SNI info is missing, please upgrade the postgres client library")]
     SniMissing,

+    #[error("Unexpected SNI content")]
+    SniWrong,
+
     #[error(transparent)]
     BadUrl(#[from] url::ParseError),

@@ -166,10 +168,15 @@ pub async fn handle_user(
     client: &mut PqStream,
     creds: &ClientCredentials,
 ) -> Result {
+    // Determine cluster name from SNI.
     let cluster = creds
-        .sni_cluster
+        .sni_data
         .as_ref()
-        .ok_or(ConsoleAuthError::SniMissing)?;
+        .ok_or(ConsoleAuthError::SniMissing)?
+        .split_once('.')
+        .ok_or(ConsoleAuthError::SniWrong)?
+ .0; + let user = creds.user.as_str(); // Step 1: get the auth secret diff --git a/proxy/src/proxy.rs b/proxy/src/proxy.rs index 4bdbac8510..821ce377f5 100644 --- a/proxy/src/proxy.rs +++ b/proxy/src/proxy.rs @@ -144,11 +144,14 @@ async fn handshake( } // Here and forth: `or_else` demands that we use a future here - let creds = async { params.try_into() } + let mut creds: auth::ClientCredentials = async { params.try_into() } .or_else(|e| stream.throw_error(e)) .await?; - // TODO: set creds.cluster here when SNI info is available + // Set SNI info when available + if let Stream::Tls { tls } = stream.get_ref() { + creds.sni_data = tls.get_ref().1.sni_hostname().map(|s| s.to_owned()); + } break Ok(Some((stream, creds))); } From ad25736f3a38540965cd86a5feee593a7c1fbdb5 Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Mon, 2 May 2022 18:14:36 +0300 Subject: [PATCH 177/296] Exit pageserver process with correct error code When we shutdown pageserver due to an error (e g one of th important thrads panicked) use 1 exit code so systemd can properly restart it --- pageserver/src/bin/pageserver.rs | 2 +- pageserver/src/lib.rs | 4 ++-- pageserver/src/thread_mgr.rs | 4 ++-- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/pageserver/src/bin/pageserver.rs b/pageserver/src/bin/pageserver.rs index 01fcc1224f..2139bea37e 100644 --- a/pageserver/src/bin/pageserver.rs +++ b/pageserver/src/bin/pageserver.rs @@ -295,7 +295,7 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<() signal.name() ); profiling::exit_profiler(conf, &profiler_guard); - pageserver::shutdown_pageserver(); + pageserver::shutdown_pageserver(0); unreachable!() } }) diff --git a/pageserver/src/lib.rs b/pageserver/src/lib.rs index 94219c7840..0b1c53172c 100644 --- a/pageserver/src/lib.rs +++ b/pageserver/src/lib.rs @@ -67,7 +67,7 @@ pub type RepositoryImpl = LayeredRepository; pub type DatadirTimelineImpl = DatadirTimeline; -pub fn shutdown_pageserver() { +pub fn 
shutdown_pageserver(exit_code: i32) { // Shut down the libpq endpoint thread. This prevents new connections from // being accepted. thread_mgr::shutdown_threads(Some(ThreadKind::LibpqEndpointListener), None, None); @@ -94,5 +94,5 @@ pub fn shutdown_pageserver() { thread_mgr::shutdown_threads(None, None, None); info!("Shut down successfully completed"); - std::process::exit(0); + std::process::exit(exit_code); } diff --git a/pageserver/src/thread_mgr.rs b/pageserver/src/thread_mgr.rs index 2866c6be44..f7f8467ae0 100644 --- a/pageserver/src/thread_mgr.rs +++ b/pageserver/src/thread_mgr.rs @@ -231,7 +231,7 @@ fn thread_wrapper( "Shutting down: thread '{}' exited with error: {:?}", thread_name, err ); - shutdown_pageserver(); + shutdown_pageserver(1); } else { error!("Thread '{}' exited with error: {:?}", thread_name, err); } @@ -241,7 +241,7 @@ fn thread_wrapper( "Shutting down: thread '{}' panicked: {:?}", thread_name, err ); - shutdown_pageserver(); + shutdown_pageserver(1); } } } From 5cb501c2b32697afaf24fea6359f7c90fe14dcd1 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Sun, 1 May 2022 21:57:33 +0300 Subject: [PATCH 178/296] Make remote storage test less flaky --- test_runner/batch_others/test_remote_storage.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/test_runner/batch_others/test_remote_storage.py b/test_runner/batch_others/test_remote_storage.py index 59a9cfa378..e205f79957 100644 --- a/test_runner/batch_others/test_remote_storage.py +++ b/test_runner/batch_others/test_remote_storage.py @@ -117,7 +117,7 @@ def test_remote_storage_backup_and_restore(zenith_env_builder: ZenithEnvBuilder, detail = client.timeline_detail(UUID(tenant_id), UUID(timeline_id)) assert detail['local'] is not None log.info("Timeline detail after attach completed: %s", detail) - assert lsn_from_hex(detail['local']['last_record_lsn']) == current_lsn + assert lsn_from_hex(detail['local']['last_record_lsn']) >= current_lsn, 'current db Lsn should not be 
less than the one stored on remote storage' assert not detail['remote']['awaits_download'] pg = env.postgres.create_start('main') From 801b749e1dd0de501b7fd4dbe4d494f40fc64515 Mon Sep 17 00:00:00 2001 From: Stas Kelvich Date: Mon, 2 May 2022 18:08:30 +0300 Subject: [PATCH 179/296] Set correct authEndpoint for the new proxy --- .circleci/helm-values/staging.proxy-scram.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.circleci/helm-values/staging.proxy-scram.yaml b/.circleci/helm-values/staging.proxy-scram.yaml index 0391697641..d95ae3bfc2 100644 --- a/.circleci/helm-values/staging.proxy-scram.yaml +++ b/.circleci/helm-values/staging.proxy-scram.yaml @@ -6,7 +6,7 @@ image: settings: authBackend: "console" - authEndpoint: "http://console-staging.local/management/api/v2" + authEndpoint: "http://console-staging.local:9095/management/api/v2" # -- Additional labels for zenith-proxy pods podLabels: From 87a6c4d0511c1eac5229c7257256d384e6cb347c Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Mon, 2 May 2022 21:47:54 +0300 Subject: [PATCH 180/296] RFC on connection routing and authentication. This documents how we want this to work. We're not quite there yet. --- docs/rfcs/016-connection-routing.md | 151 ++++++++++++++++++++++++++++ 1 file changed, 151 insertions(+) create mode 100644 docs/rfcs/016-connection-routing.md diff --git a/docs/rfcs/016-connection-routing.md b/docs/rfcs/016-connection-routing.md new file mode 100644 index 0000000000..603a0725d6 --- /dev/null +++ b/docs/rfcs/016-connection-routing.md @@ -0,0 +1,151 @@ +# Dispatching a connection + +For each client connection, Neon service needs to authenticate the +connection, and route it to the right PostgreSQL instance. + +## Authentication + +There are three different ways to authenticate: + +- anonymous; no authentication needed +- PostgreSQL authentication +- github single sign-on using browser + +In anonymous access, the user doesn't need to perform any +authentication at all. 
This can be used e.g. in interactive PostgreSQL +documentation, allowing you to run the examples very quickly. Similar +to sqlfiddle.com. + +PostgreSQL authentication works the same as always. All the different +PostgreSQL authentication options like SCRAM, kerberos, etc. are +available. [1] + +The third option is to authenticate with github single sign-on. When +you open the connection in psql, you get a link that you open with +your browser. Opening the link redirects you to github authentication, +and lets the connection to proceed. This is also known as "Link auth" [2]. + + +## Routing the connection + +When a client starts a connection, it needs to be routed to the +correct PostgreSQL instance. Routing can be done by the proxy, acting +as a man-in-the-middle, or the connection can be routed at the network +level based on the hostname or IP address. + +Either way, Neon needs to identify which PostgreSQL instance the +connection should be routed to. If the instance is not already +running, it needs to be started. Some connections always require a new +PostgreSQL instance to be created, e.g. if you want to run a one-off +query against a particular point-in-time. + +The PostgreSQL instance is identified by: +- Neon account (possibly anonymous) +- cluster (known as tenant in the storage?) +- branch or snapshot name +- timestamp (PITR) +- primary or read-replica +- one-off read replica +- one-off writeable branch + +When you are using regular PostgreSQL authentication or anonymous +access, the connection URL needs to contain all the information needed +for the routing. With github single sign-on, the browser is involved +and some details - the Neon account in particular - can be deduced +from the authentication exchange. 
+ +There are three methods for identifying the PostgreSQL instance: + +- Browser interaction (link auth) +- Options in the connection URL and the domain name +- A pre-defined endpoint, identified by domain name or IP address + +### Link Auth + + postgres://@start.neon.tech/ + +This gives you a link that you open in browser. Clicking the link +performs github authentication, and the Neon account name is +provided to the proxy behind the scenes. The proxy routes the +connection to the primary PostgreSQL instance in cluster called +"main", branch "main". + +Further ideas: +- You could pre-define a different target for link auth + connections in the UI. +- You could have a drop-down in the browser, allowing you to connect + to any cluster you want. Link Auth can be like Teleport. + +### Connection URL + +The connection URL looks like this: + + postgres://@.db.neon.tech/ + +By default, this connects you to the primary PostgreSQL instance +running on the "main" branch in the named cluster [3]. However, you can +change that by specifying options in the connection URL. The following +options are supported: + +| option name | Description | Examples | +| --- | --- | --- | +| cluster | Cluster name | cluster:myproject | +| branch | Branch name | branch:main | +| timestamp | Connect to an instance at given point-in-time. | timestamp:2022-04-08 timestamp:2022-04-08T11:42:16Z | +| lsn | Connect to an instance at given LSN | lsn:0/12FF0420 | +| read-replica | Connect to a read-replica. If the parameter is 'new', a new instance is created for this session. | read-replica read-replica:new | + +For example, to read branch 'testing' as it was on Mar 31, 2022, you could +specify a timestamp in the connection URL [4]: + + postgres://alice@cluster-1234.db.neon.tech/postgres?options=branch:testing,timestamp:2022-03-31 + +Connecting with cluster name and options can be disabled in the UI. If +disabled, you can only connect using a pre-defined endpoint. 
+ +### Pre-defined Endpoint + +Instead of providing the cluster name, branch, and all those options +in the connection URL, you can define a named endpoint with the same +options. + +In the UI, click "create endpoint". Fill in the details: + +- Cluster name +- Branch +- timestamp or LSN +- is this for the primary or for a read replica +- etc. + +When you click Finish, a named endpoint is created. You can now use the endpoint ID to connect: + + postgres://@.endpoint.neon.tech/ + + +An endpoint can be assigned a static or dynamic IP address, so that +you can connect to it with clients that don't support TLS SNI. Maybe +bypass the proxy altogether, but that ought to be invisible to the +user. + +You can limit the range of source IP addresses that are allowed to +connect to an endpoint. An endpoint can also be exposed in an Amazon +VPC, allowing direct connections from applications. + + +# Footnotes + +[1] I'm not sure how feasible it is to set up and configure something like Kerberos +or LDAP in a cloud environment. But in principle I think we should +allow customers to have the full power of PostgreSQL, including all +authentication options. However, it's up to the customer to configure +it correctly. + +[2] Link is a way to both authenticate and route the connection + +[3] This assumes that cluster-ids are globally unique, across all +Neon accounts. + +[4] The syntax accepted in the connection URL is limited by libpq. The +only way to pass arbitrary options to the server (or our proxy) is +with the "options" keyword, and the options must be percent-encoded. I +think the above would work but I haven't tested it From baa59512b8e0f5ca535025d9fc879f31fc18b39f Mon Sep 17 00:00:00 2001 From: Konstantin Knizhnik Date: Tue, 3 May 2022 08:07:14 +0300 Subject: [PATCH 181/296] Traverse frozen layer in get_reconstruct_data in reverse order (#1601) * Traverse frozen layer in get_reconstruct_data in reverse order * Fix comments on frozen layers. 
Note explicitly the order that the layers are in the queue. * Add fail point to reproduce failpoint iteration error Co-authored-by: Heikki Linnakangas --- pageserver/src/layered_repository.rs | 9 ++++++--- pageserver/src/layered_repository/layer_map.rs | 11 +++++++---- test_runner/batch_others/test_ancestor_branch.py | 4 ++++ 3 files changed, 17 insertions(+), 7 deletions(-) diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 080ac2852d..59e73d961d 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -1432,7 +1432,8 @@ impl LayeredTimeline { let layers = timeline.layers.read().unwrap(); - // Check the open and frozen in-memory layers first + // Check the open and frozen in-memory layers first, in order from newest + // to oldest. if let Some(open_layer) = &layers.open_layer { let start_lsn = open_layer.get_lsn_range().start; if cont_lsn > start_lsn { @@ -1450,7 +1451,7 @@ impl LayeredTimeline { continue; } } - for frozen_layer in layers.frozen_layers.iter() { + for frozen_layer in layers.frozen_layers.iter().rev() { let start_lsn = frozen_layer.get_lsn_range().start; if cont_lsn > start_lsn { //info!("CHECKING for {} at {} on frozen layer {}", key, cont_lsn, frozen_layer.filename().display()); @@ -1695,7 +1696,9 @@ impl LayeredTimeline { self.conf.timeline_path(&self.timelineid, &self.tenantid), ])?; - // Finally, replace the frozen in-memory layer with the new on-disk layers + fail_point!("flush-frozen"); + + // Finally, replace the frozen in-memory layer with the new on-disk layer { let mut layers = self.layers.write().unwrap(); let l = layers.frozen_layers.pop_front(); diff --git a/pageserver/src/layered_repository/layer_map.rs b/pageserver/src/layered_repository/layer_map.rs index 03ee8b8ef1..91a900dde0 100644 --- a/pageserver/src/layered_repository/layer_map.rs +++ b/pageserver/src/layered_repository/layer_map.rs @@ -43,10 +43,13 @@ pub struct LayerMap { pub 
next_open_layer_at: Option, /// - /// The frozen layer, if any, contains WAL older than the current 'open_layer' - /// or 'next_open_layer_at', but newer than any historic layer. The frozen - /// layer is during checkpointing, when an InMemoryLayer is being written out - /// to disk. + /// Frozen layers, if any. Frozen layers are in-memory layers that + /// are no longer added to, but haven't been written out to disk + /// yet. They contain WAL older than the current 'open_layer' or + /// 'next_open_layer_at', but newer than any historic layer. + /// The frozen layers are in order from oldest to newest, so that + /// the newest one is in the 'back' of the VecDeque, and the oldest + /// in the 'front'. /// pub frozen_layers: VecDeque>, diff --git a/test_runner/batch_others/test_ancestor_branch.py b/test_runner/batch_others/test_ancestor_branch.py index aeb45348ad..75fe3cde0f 100644 --- a/test_runner/batch_others/test_ancestor_branch.py +++ b/test_runner/batch_others/test_ancestor_branch.py @@ -33,6 +33,10 @@ def test_ancestor_branch(zenith_env_builder: ZenithEnvBuilder): 'compaction_target_size': '4194304', }) + with closing(env.pageserver.connect()) as psconn: + with psconn.cursor(cursor_factory=psycopg2.extras.DictCursor) as pscur: + pscur.execute("failpoints flush-frozen=sleep(10000)") + env.zenith_cli.create_timeline(f'main', tenant_id=tenant) pg_branch0 = env.postgres.create_start('main', tenant_id=tenant) branch0_cur = pg_branch0.connect().cursor() From 62449d60683e93f8f54b5c79fdcb89b74853d695 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Tue, 3 May 2022 09:25:12 +0300 Subject: [PATCH 182/296] Bump vendor/postgres (#1573) This brings us the performance improvements to WAL redo from https://github.com/neondatabase/postgres/pull/144 --- vendor/postgres | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vendor/postgres b/vendor/postgres index d7c8426e49..a13fe64a3e 160000 --- a/vendor/postgres +++ b/vendor/postgres @@ -1 +1 @@ -Subproject 
commit d7c8426e49cff3c791c3f2c4cde95f1fce665573 +Subproject commit a13fe64a3eff1743ff17141a2e6057f5103829f0 From 9ede38b6c4aec5a1d49f0e83278f112f1eb4069e Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Tue, 3 May 2022 09:28:57 +0300 Subject: [PATCH 183/296] Support finding LSN from a commit timestamp. A new `get_lsn_by_timestamp` command is added to the libpq page service API. An extra timestamp field is now stored in an extra field after each Clog page. It is the timestamp of the latest commit, among all the transactions on the Clog page. To find the overall latest commit, we need to scan all Clog pages, but this isn't a very frequent operation so that's not too bad. To find the LSN that corresponds to a timestamp, we perform a binary search. The binary search starts with min = last LSN when GC ran, and max = latest LSN on the timeline. On each iteration of the search we check if there are any commits with a higher-than-requested timestamp at that LSN. Implements github issue 1361. 
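Illustration only, not part of this patch: a minimal Python sketch of the binary search described above. It assumes commits can be modeled as a plain list of (lsn, timestamp) pairs instead of the per-page Clog timestamps the pageserver actually stores, and the function and parameter names are hypothetical.

```python
def find_lsn_for_timestamp(commits, min_lsn, max_lsn, search_ts):
    """Return (kind, lsn), where kind is 'present', 'future' or 'past'.

    `commits` is an assumed list of (lsn, timestamp) pairs; the real code
    reads one timestamp per Clog page instead.
    """
    found_smaller = False
    found_larger = False

    def any_commit_at_or_after(probe_lsn):
        # True if a commit visible at probe_lsn has timestamp >= search_ts.
        nonlocal found_smaller, found_larger
        for lsn, ts in commits:
            if lsn > probe_lsn:
                continue  # not yet visible at this probe LSN
            if ts >= search_ts:
                found_larger = True
                return True
            found_smaller = True
        return False

    # LSNs are treated as 8-byte aligned, so search over lsn / 8.
    low = min_lsn // 8
    high = max_lsn // 8 + 1
    while low < high:
        mid = (low + high) // 2
        if any_commit_at_or_after(mid * 8):
            high = mid
        else:
            low = mid + 1

    if not found_smaller and not found_larger:
        raise ValueError("no commit timestamps found")
    if not found_larger:
        return ("future", max_lsn)  # every commit is older than search_ts
    if not found_smaller:
        return ("past", max_lsn)  # every commit is newer than search_ts
    # low is the first aligned LSN at which a newer commit is visible;
    # step back by one to land just before it.
    return ("present", (low - 1) * 8)
```

With commits at LSNs 16, 32 and 48 carrying timestamps 100, 200 and 300, a search for timestamp 250 settles on LSN 40, where the first two commits are visible and the third is not.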
--- libs/postgres_ffi/src/xlog_utils.rs | 6 +- libs/utils/src/pq_proto.rs | 12 +++ pageserver/src/basebackup.rs | 12 ++- pageserver/src/page_service.rs | 30 +++++- pageserver/src/pgdatadir_mapping.rs | 108 +++++++++++++++++++ pageserver/src/walingest.rs | 10 +- pageserver/src/walrecord.rs | 5 +- pageserver/src/walredo.rs | 22 +++- test_runner/batch_others/test_lsn_mapping.py | 84 +++++++++++++++ test_runner/fixtures/zenith_fixtures.py | 1 + 10 files changed, 282 insertions(+), 8 deletions(-) create mode 100644 test_runner/batch_others/test_lsn_mapping.py diff --git a/libs/postgres_ffi/src/xlog_utils.rs b/libs/postgres_ffi/src/xlog_utils.rs index 1645c44de5..bd4b7df690 100644 --- a/libs/postgres_ffi/src/xlog_utils.rs +++ b/libs/postgres_ffi/src/xlog_utils.rs @@ -118,11 +118,15 @@ pub fn normalize_lsn(lsn: Lsn, seg_sz: usize) -> Lsn { } pub fn get_current_timestamp() -> TimestampTz { + to_pg_timestamp(SystemTime::now()) +} + +pub fn to_pg_timestamp(time: SystemTime) -> TimestampTz { const UNIX_EPOCH_JDATE: u64 = 2440588; /* == date2j(1970, 1, 1) */ const POSTGRES_EPOCH_JDATE: u64 = 2451545; /* == date2j(2000, 1, 1) */ const SECS_PER_DAY: u64 = 86400; const USECS_PER_SEC: u64 = 1000000; - match SystemTime::now().duration_since(SystemTime::UNIX_EPOCH) { + match time.duration_since(SystemTime::UNIX_EPOCH) { Ok(n) => { ((n.as_secs() - ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY)) * USECS_PER_SEC diff --git a/libs/utils/src/pq_proto.rs b/libs/utils/src/pq_proto.rs index e1677f4311..ce86cf8c91 100644 --- a/libs/utils/src/pq_proto.rs +++ b/libs/utils/src/pq_proto.rs @@ -503,6 +503,18 @@ impl RowDescriptor<'_> { formatcode: 0, } } + + pub const fn text_col(name: &[u8]) -> RowDescriptor { + RowDescriptor { + name, + tableoid: 0, + attnum: 0, + typoid: TEXT_OID, + typlen: -1, + typmod: 0, + formatcode: 0, + } + } } #[derive(Debug)] diff --git a/pageserver/src/basebackup.rs b/pageserver/src/basebackup.rs index 78a27e460f..14e6d40759 100644 --- 
a/pageserver/src/basebackup.rs +++ b/pageserver/src/basebackup.rs @@ -154,9 +154,17 @@ impl<'a> Basebackup<'a> { let img = self .timeline .get_slru_page_at_lsn(slru, segno, blknum, self.lsn)?; - ensure!(img.len() == pg_constants::BLCKSZ as usize); - slru_buf.extend_from_slice(&img); + if slru == SlruKind::Clog { + ensure!( + img.len() == pg_constants::BLCKSZ as usize + || img.len() == pg_constants::BLCKSZ as usize + 8 + ); + } else { + ensure!(img.len() == pg_constants::BLCKSZ as usize); + } + + slru_buf.extend_from_slice(&img[..pg_constants::BLCKSZ as usize]); } let segname = format!("{}/{:>04X}", slru.to_str(), segno); diff --git a/pageserver/src/page_service.rs b/pageserver/src/page_service.rs index 0adafab8ba..e584a101cd 100644 --- a/pageserver/src/page_service.rs +++ b/pageserver/src/page_service.rs @@ -31,7 +31,7 @@ use utils::{ use crate::basebackup; use crate::config::{PageServerConf, ProfilingConfig}; -use crate::pgdatadir_mapping::DatadirTimeline; +use crate::pgdatadir_mapping::{DatadirTimeline, LsnForTimestamp}; use crate::profiling::profpoint_start; use crate::reltag::RelTag; use crate::repository::Repository; @@ -42,6 +42,7 @@ use crate::thread_mgr::ThreadKind; use crate::walreceiver; use crate::CheckpointConfig; use metrics::{register_histogram_vec, HistogramVec}; +use postgres_ffi::xlog_utils::to_pg_timestamp; // Wrapped in libpq CopyData enum PagestreamFeMessage { @@ -805,6 +806,33 @@ impl postgres_backend::Handler for PageServerHandler { pgb.write_message_noflush(&SINGLE_COL_ROWDESC)? 
.write_message_noflush(&BeMessage::CommandComplete(b"SELECT 1"))?; + } else if query_string.starts_with("get_lsn_by_timestamp ") { + // Locate LSN of last transaction with timestamp less than or equal to the specified one + // TODO lazy static + let re = Regex::new(r"^get_lsn_by_timestamp ([[:xdigit:]]+) ([[:xdigit:]]+) '(.*)'$") + .unwrap(); + let caps = re + .captures(query_string) + .with_context(|| format!("invalid get_lsn_by_timestamp: '{}'", query_string))?; + + let tenantid = ZTenantId::from_str(caps.get(1).unwrap().as_str())?; + let timelineid = ZTimelineId::from_str(caps.get(2).unwrap().as_str())?; + let timeline = tenant_mgr::get_local_timeline_with_load(tenantid, timelineid) + .context("Cannot load local timeline")?; + + let timestamp = humantime::parse_rfc3339(caps.get(3).unwrap().as_str())?; + let timestamp_pg = to_pg_timestamp(timestamp); + + pgb.write_message_noflush(&BeMessage::RowDescription(&[RowDescriptor::text_col( + b"lsn", + )]))?; + let result = match timeline.find_lsn_for_timestamp(timestamp_pg)? 
{ + LsnForTimestamp::Present(lsn) => format!("{}", lsn), + LsnForTimestamp::Future(_lsn) => "future".into(), + LsnForTimestamp::Past(_lsn) => "past".into(), + }; + pgb.write_message_noflush(&BeMessage::DataRow(&[Some(result.as_bytes())]))?; + pgb.write_message(&BeMessage::CommandComplete(b"SELECT 1"))?; } else { bail!("unknown command"); } diff --git a/pageserver/src/pgdatadir_mapping.rs b/pageserver/src/pgdatadir_mapping.rs index 071eccc05d..c052aa3d69 100644 --- a/pageserver/src/pgdatadir_mapping.rs +++ b/pageserver/src/pgdatadir_mapping.rs @@ -13,6 +13,7 @@ use crate::repository::{Repository, Timeline}; use crate::walrecord::ZenithWalRecord; use anyhow::{bail, ensure, Result}; use bytes::{Buf, Bytes}; +use postgres_ffi::xlog_utils::TimestampTz; use postgres_ffi::{pg_constants, Oid, TransactionId}; use serde::{Deserialize, Serialize}; use std::collections::{HashMap, HashSet}; @@ -45,6 +46,13 @@ where current_logical_size: AtomicIsize, } +#[derive(Debug)] +pub enum LsnForTimestamp { + Present(Lsn), + Future(Lsn), + Past(Lsn), +} + impl DatadirTimeline { pub fn new(tline: Arc, repartition_threshold: u64) -> Self { DatadirTimeline { @@ -202,6 +210,106 @@ impl DatadirTimeline { Ok(exists) } + /// Locate LSN, such that all transactions that committed before + /// 'search_timestamp' are visible, but nothing newer is. + /// + /// This is not exact. Commit timestamps are not guaranteed to be ordered, + /// so it's not well defined which LSN you get if there were multiple commits + /// "in flight" at that point in time. + /// + pub fn find_lsn_for_timestamp(&self, search_timestamp: TimestampTz) -> Result { + let gc_cutoff_lsn_guard = self.tline.get_latest_gc_cutoff_lsn(); + let min_lsn = *gc_cutoff_lsn_guard; + let max_lsn = self.tline.get_last_record_lsn(); + + // LSNs are always 8-byte aligned. low/mid/high represent the + // LSN divided by 8. 
+ let mut low = min_lsn.0 / 8; + let mut high = max_lsn.0 / 8 + 1; + + let mut found_smaller = false; + let mut found_larger = false; + while low < high { + // cannot overflow, high and low are both smaller than u64::MAX / 2 + let mid = (high + low) / 2; + + let cmp = self.is_latest_commit_timestamp_ge_than( + search_timestamp, + Lsn(mid * 8), + &mut found_smaller, + &mut found_larger, + )?; + + if cmp { + high = mid; + } else { + low = mid + 1; + } + } + match (found_smaller, found_larger) { + (false, false) => { + // This can happen if no commit records have been processed yet, e.g. + // just after importing a cluster. + bail!("no commit timestamps found"); + } + (true, false) => { + // Didn't find any commit timestamps larger than the request + Ok(LsnForTimestamp::Future(max_lsn)) + } + (false, true) => { + // Didn't find any commit timestamps smaller than the request + Ok(LsnForTimestamp::Past(max_lsn)) + } + (true, true) => { + // low is the LSN of the first commit record *after* the search_timestamp, + // Back off by one to get to the point just before the commit. + // + // FIXME: it would be better to get the LSN of the previous commit. + // Otherwise, if you restore to the returned LSN, the database will + // include physical changes from later commits that will be marked + // as aborted, and will need to be vacuumed away. + Ok(LsnForTimestamp::Present(Lsn((low - 1) * 8))) + } + } + } + + /// + /// Subroutine of find_lsn_for_timestamp(). Returns true, if there are any + /// commits that committed after 'search_timestamp', at LSN 'probe_lsn'. + /// + /// Additionally, sets 'found_smaller'/'found_Larger, if encounters any commits + /// with a smaller/larger timestamp. + /// + fn is_latest_commit_timestamp_ge_than( + &self, + search_timestamp: TimestampTz, + probe_lsn: Lsn, + found_smaller: &mut bool, + found_larger: &mut bool, + ) -> Result { + for segno in self.list_slru_segments(SlruKind::Clog, probe_lsn)? 
+ let nblocks = self.get_slru_segment_size(SlruKind::Clog, segno, probe_lsn)?; + for blknum in (0..nblocks).rev() { + let clog_page = + self.get_slru_page_at_lsn(SlruKind::Clog, segno, blknum, probe_lsn)?; + + if clog_page.len() == pg_constants::BLCKSZ as usize + 8 { + let mut timestamp_bytes = [0u8; 8]; + timestamp_bytes.copy_from_slice(&clog_page[pg_constants::BLCKSZ as usize..]); + let timestamp = TimestampTz::from_be_bytes(timestamp_bytes); + + if timestamp >= search_timestamp { + *found_larger = true; + return Ok(true); + } else { + *found_smaller = true; + } + } + } + } + Ok(false) + } + /// Get a list of SLRU segments pub fn list_slru_segments(&self, kind: SlruKind, lsn: Lsn) -> Result> { // fetch directory entry diff --git a/pageserver/src/walingest.rs b/pageserver/src/walingest.rs index 583cdecb1d..a929e290ad 100644 --- a/pageserver/src/walingest.rs +++ b/pageserver/src/walingest.rs @@ -635,7 +635,10 @@ impl<'a, R: Repository> WalIngest<'a, R> { segno, rpageno, if is_commit { - ZenithWalRecord::ClogSetCommitted { xids: page_xids } + ZenithWalRecord::ClogSetCommitted { + xids: page_xids, + timestamp: parsed.xact_time, + } } else { ZenithWalRecord::ClogSetAborted { xids: page_xids } }, @@ -652,7 +655,10 @@ impl<'a, R: Repository> WalIngest<'a, R> { segno, rpageno, if is_commit { - ZenithWalRecord::ClogSetCommitted { xids: page_xids } + ZenithWalRecord::ClogSetCommitted { + xids: page_xids, + timestamp: parsed.xact_time, + } } else { ZenithWalRecord::ClogSetAborted { xids: page_xids } }, diff --git a/pageserver/src/walrecord.rs b/pageserver/src/walrecord.rs index 5947a0c147..e8699cfa22 100644 --- a/pageserver/src/walrecord.rs +++ b/pageserver/src/walrecord.rs @@ -24,7 +24,10 @@ pub enum ZenithWalRecord { flags: u8, }, /// Mark transaction IDs as committed on a CLOG page - ClogSetCommitted { xids: Vec }, + ClogSetCommitted { + xids: Vec, + timestamp: TimestampTz, + }, /// Mark transaction IDs as aborted on a CLOG page ClogSetAborted { xids: Vec }, /// Extend 
multixact offsets SLRU diff --git a/pageserver/src/walredo.rs b/pageserver/src/walredo.rs index 6338b839ae..777718b311 100644 --- a/pageserver/src/walredo.rs +++ b/pageserver/src/walredo.rs @@ -283,6 +283,11 @@ impl PostgresRedoManager { // If something went wrong, don't try to reuse the process. Kill it, and // next request will launch a new one. if result.is_err() { + error!( + "error applying {} WAL records to reconstruct page image at LSN {}", + records.len(), + lsn + ); let process = process_guard.take().unwrap(); process.kill(); } @@ -387,7 +392,7 @@ impl PostgresRedoManager { } // Non-relational WAL records are handled here, with custom code that has the // same effects as the corresponding Postgres WAL redo function. - ZenithWalRecord::ClogSetCommitted { xids } => { + ZenithWalRecord::ClogSetCommitted { xids, timestamp } => { let (slru_kind, segno, blknum) = key_to_slru_block(key).or(Err(WalRedoError::InvalidRecord))?; assert_eq!( @@ -421,6 +426,21 @@ impl PostgresRedoManager { page, ); } + + // Append the timestamp + if page.len() == pg_constants::BLCKSZ as usize + 8 { + page.truncate(pg_constants::BLCKSZ as usize); + } + if page.len() == pg_constants::BLCKSZ as usize { + page.extend_from_slice(&timestamp.to_be_bytes()); + } else { + warn!( + "CLOG blk {} in seg {} has invalid size {}", + blknum, + segno, + page.len() + ); + } } ZenithWalRecord::ClogSetAborted { xids } => { let (slru_kind, segno, blknum) = diff --git a/test_runner/batch_others/test_lsn_mapping.py b/test_runner/batch_others/test_lsn_mapping.py new file mode 100644 index 0000000000..37113b46f2 --- /dev/null +++ b/test_runner/batch_others/test_lsn_mapping.py @@ -0,0 +1,84 @@ +from contextlib import closing +from datetime import timedelta, timezone, tzinfo +import math +from uuid import UUID +import psycopg2.extras +import psycopg2.errors +from fixtures.zenith_fixtures import ZenithEnv, ZenithEnvBuilder, Postgres +from fixtures.log_helper import log +import time + + +# +# Test pageserver 
get_lsn_by_timestamp API +# +def test_lsn_mapping(zenith_env_builder: ZenithEnvBuilder): + zenith_env_builder.num_safekeepers = 1 + env = zenith_env_builder.init_start() + + new_timeline_id = env.zenith_cli.create_branch('test_lsn_mapping') + pgmain = env.postgres.create_start("test_lsn_mapping") + log.info("postgres is running on 'test_lsn_mapping' branch") + + ps_conn = env.pageserver.connect() + ps_cur = ps_conn.cursor() + conn = pgmain.connect() + cur = conn.cursor() + + # Create table, and insert rows, each in a separate transaction + # Disable synchronous_commit to make this initialization go faster. + # + # Each row contains current insert LSN and the current timestamp, when + # the row was inserted. + cur.execute("SET synchronous_commit=off") + cur.execute("CREATE TABLE foo (x integer)") + tbl = [] + for i in range(1000): + cur.execute(f"INSERT INTO foo VALUES({i})") + cur.execute(f'SELECT clock_timestamp()') + # Get the timestamp at UTC + after_timestamp = cur.fetchone()[0].replace(tzinfo=None) + tbl.append([i, after_timestamp]) + + # Execute one more transaction with synchronous_commit enabled, to flush + # all the previous transactions + cur.execute("SET synchronous_commit=on") + cur.execute("INSERT INTO foo VALUES (-1)") + + # Check edge cases: timestamp in the future + probe_timestamp = tbl[-1][1] + timedelta(hours=1) + ps_cur.execute( + f"get_lsn_by_timestamp {env.initial_tenant.hex} {new_timeline_id.hex} '{probe_timestamp.isoformat()}Z'" + ) + result = ps_cur.fetchone()[0] + assert result == 'future' + + # timestamp too far in the past + probe_timestamp = tbl[0][1] - timedelta(hours=10) + ps_cur.execute( + f"get_lsn_by_timestamp {env.initial_tenant.hex} {new_timeline_id.hex} '{probe_timestamp.isoformat()}Z'" + ) + result = ps_cur.fetchone()[0] + assert result == 'past' + + # Probe a bunch of timestamps in the valid range + for i in range(1, len(tbl), 100): + probe_timestamp = tbl[i][1] + + # Call get_lsn_by_timestamp to get the LSN + ps_cur.execute( 
+ f"get_lsn_by_timestamp {env.initial_tenant.hex} {new_timeline_id.hex} '{probe_timestamp.isoformat()}Z'" + ) + lsn = ps_cur.fetchone()[0] + + # Launch a new read-only node at that LSN, and check that only the rows + # that were supposed to be committed at that point in time are visible. + pg_here = env.postgres.create_start(branch_name='test_lsn_mapping', + node_name='test_lsn_mapping_read', + lsn=lsn) + with closing(pg_here.connect()) as conn_here: + with conn_here.cursor() as cur_here: + cur_here.execute("SELECT max(x) FROM foo") + assert cur_here.fetchone()[0] == i + + pg_here.stop_and_destroy() diff --git a/test_runner/fixtures/zenith_fixtures.py b/test_runner/fixtures/zenith_fixtures.py index 5614cea68b..5b25b1c457 100644 --- a/test_runner/fixtures/zenith_fixtures.py +++ b/test_runner/fixtures/zenith_fixtures.py @@ -1572,6 +1572,7 @@ class Postgres(PgProtocol): assert self.node_name is not None self.env.zenith_cli.pg_stop(self.node_name, self.tenant_id, True) self.node_name = None + self.running = False return self From ff7e9a86c6f61a9c23f538904f7d378126a6597e Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Tue, 3 May 2022 12:00:42 +0300 Subject: [PATCH 184/296] turn panic into an error with more details --- pageserver/src/layered_repository.rs | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 59e73d961d..1205f8d867 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -1504,12 +1504,20 @@ impl LayeredTimeline { let ancestor = self .ancestor_timeline .as_ref() - .expect("there should be an ancestor") + .with_context(|| { + format!( + "Ancestor is missing. Timeline id: {} Ancestor id {:?}", + self.timelineid, + self.get_ancestor_timeline_id(), + ) + })? 
.ensure_loaded() .with_context(|| { format!( - "Cannot get the whole layer for read locked: timeline {} is not present locally", - self.get_ancestor_timeline_id().unwrap()) + "Ancestor timeline is not loaded. Timeline id: {} Ancestor id {:?}", + self.timelineid, + self.get_ancestor_timeline_id(), + ) })?; Ok(Arc::clone(ancestor)) } From e7cba0b60722af46742094fa43c4def394cc010a Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Mon, 2 May 2022 23:36:15 +0300 Subject: [PATCH 185/296] use thiserror instead of anyhow in disk_btree --- .../src/layered_repository/disk_btree.rs | 105 ++++++++++++------ 1 file changed, 70 insertions(+), 35 deletions(-) diff --git a/pageserver/src/layered_repository/disk_btree.rs b/pageserver/src/layered_repository/disk_btree.rs index 7a9fe6f2b7..e747192d96 100644 --- a/pageserver/src/layered_repository/disk_btree.rs +++ b/pageserver/src/layered_repository/disk_btree.rs @@ -11,7 +11,6 @@ //! - page-oriented //! //! TODO: -//! - better errors (e.g. with thiserror?) //! - maybe something like an Adaptive Radix Tree would be more efficient? //! - the values stored by image and delta layers are offsets into the file, //! and they are in monotonically increasing order. Prefix compression would @@ -19,11 +18,12 @@ //! - An Iterator interface would be more convenient for the callers than the //! 'visit' function //! 
-use anyhow; use byteorder::{ReadBytesExt, BE}; use bytes::{BufMut, Bytes, BytesMut}; use hex; -use std::cmp::Ordering; +use std::{cmp::Ordering, io, result}; +use thiserror::Error; +use tracing::error; use crate::layered_repository::block_io::{BlockReader, BlockWriter}; @@ -86,6 +86,23 @@ impl Value { } } +#[derive(Error, Debug)] +pub enum DiskBtreeError { + #[error("Attempt to append a value that is too large {0} > {}", MAX_VALUE)] + AppendOverflow(u64), + + #[error("Unsorted input: key {key:?} is <= last_key {last_key:?}")] + UnsortedInput { key: Box<[u8]>, last_key: Box<[u8]> }, + + #[error("Could not push to new leaf node")] + FailedToPushToNewLeafNode, + + #[error("IoError: {0}")] + Io(#[from] io::Error), +} + +pub type Result = result::Result; + /// This is the on-disk representation. struct OnDiskNode<'a, const L: usize> { // Fixed-width fields @@ -106,12 +123,12 @@ impl<'a, const L: usize> OnDiskNode<'a, L> { /// /// Interpret a PAGE_SZ page as a node. /// - fn deparse(buf: &[u8]) -> OnDiskNode { + fn deparse(buf: &[u8]) -> Result> { let mut cursor = std::io::Cursor::new(buf); - let num_children = cursor.read_u16::().unwrap(); - let level = cursor.read_u8().unwrap(); - let prefix_len = cursor.read_u8().unwrap(); - let suffix_len = cursor.read_u8().unwrap(); + let num_children = cursor.read_u16::()?; + let level = cursor.read_u8()?; + let prefix_len = cursor.read_u8()?; + let suffix_len = cursor.read_u8()?; let mut off = cursor.position(); let prefix_off = off as usize; @@ -129,7 +146,7 @@ impl<'a, const L: usize> OnDiskNode<'a, L> { let keys = &buf[keys_off..keys_off + keys_len]; let values = &buf[values_off..values_off + values_len]; - OnDiskNode { + Ok(OnDiskNode { num_children, level, prefix_len, @@ -137,7 +154,7 @@ impl<'a, const L: usize> OnDiskNode<'a, L> { prefix, keys, values, - } + }) } /// @@ -149,7 +166,11 @@ impl<'a, const L: usize> OnDiskNode<'a, L> { Value::from_slice(value_slice) } - fn binary_search(&self, search_key: &[u8; L], keybuf: &mut 
[u8]) -> Result { + fn binary_search( + &self, + search_key: &[u8; L], + keybuf: &mut [u8], + ) -> result::Result { let mut size = self.num_children as usize; let mut low = 0; let mut high = size; @@ -209,7 +230,7 @@ where /// /// Read the value for given key. Returns the value, or None if it doesn't exist. /// - pub fn get(&self, search_key: &[u8; L]) -> anyhow::Result> { + pub fn get(&self, search_key: &[u8; L]) -> Result> { let mut result: Option = None; self.visit(search_key, VisitDirection::Forwards, |key, value| { if key == search_key { @@ -230,7 +251,7 @@ where search_key: &[u8; L], dir: VisitDirection, mut visitor: V, - ) -> anyhow::Result + ) -> Result where V: FnMut(&[u8], u64) -> bool, { @@ -243,7 +264,7 @@ where search_key: &[u8; L], dir: VisitDirection, visitor: &mut V, - ) -> anyhow::Result + ) -> Result where V: FnMut(&[u8], u64) -> bool, { @@ -260,11 +281,11 @@ where search_key: &[u8; L], dir: VisitDirection, visitor: &mut V, - ) -> anyhow::Result + ) -> Result where V: FnMut(&[u8], u64) -> bool, { - let node = OnDiskNode::deparse(node_buf); + let node = OnDiskNode::deparse(node_buf)?; let prefix_len = node.prefix_len as usize; let suffix_len = node.suffix_len as usize; @@ -369,15 +390,15 @@ where } #[allow(dead_code)] - pub fn dump(&self) -> anyhow::Result<()> { + pub fn dump(&self) -> Result<()> { self.dump_recurse(self.root_blk, &[], 0) } - fn dump_recurse(&self, blknum: u32, path: &[u8], depth: usize) -> anyhow::Result<()> { + fn dump_recurse(&self, blknum: u32, path: &[u8], depth: usize) -> Result<()> { let blk = self.reader.read_blk(self.start_blk + blknum)?; let buf: &[u8] = blk.as_ref(); - let node = OnDiskNode::::deparse(buf); + let node = OnDiskNode::::deparse(buf)?; print!("{:indent$}", "", indent = depth * 2); println!( @@ -442,17 +463,24 @@ where } } - pub fn append(&mut self, key: &[u8; L], value: u64) -> Result<(), anyhow::Error> { - assert!(value <= MAX_VALUE); + pub fn append(&mut self, key: &[u8; L], value: u64) -> Result<()> { + 
if value > MAX_VALUE { + return Err(DiskBtreeError::AppendOverflow(value)); + } if let Some(last_key) = &self.last_key { - assert!(key > last_key, "unsorted input"); + if key <= last_key { + return Err(DiskBtreeError::UnsortedInput { + key: key.as_slice().into(), + last_key: last_key.as_slice().into(), + }); + } } self.last_key = Some(*key); - Ok(self.append_internal(key, Value::from_u64(value))?) + self.append_internal(key, Value::from_u64(value)) } - fn append_internal(&mut self, key: &[u8; L], value: Value) -> Result<(), std::io::Error> { + fn append_internal(&mut self, key: &[u8; L], value: Value) -> Result<()> { // Try to append to the current leaf buffer let last = self.stack.last_mut().unwrap(); let level = last.level; @@ -476,14 +504,15 @@ where // key to it. let mut last = BuildNode::new(level); if !last.push(key, value) { - panic!("could not push to new leaf node"); + return Err(DiskBtreeError::FailedToPushToNewLeafNode); } + self.stack.push(last); Ok(()) } - fn flush_node(&mut self) -> Result<(), std::io::Error> { + fn flush_node(&mut self) -> Result<()> { let last = self.stack.pop().unwrap(); let buf = last.pack(); let downlink_key = last.first_key(); @@ -505,7 +534,7 @@ where /// (In the image and delta layers, it is stored in the beginning of the file, /// in the summary header) /// - pub fn finish(mut self) -> Result<(u32, W), std::io::Error> { + pub fn finish(mut self) -> Result<(u32, W)> { // flush all levels, except the root. 
while self.stack.len() > 1 { self.flush_node()?; @@ -692,14 +721,14 @@ mod tests { impl BlockReader for TestDisk { type BlockLease = std::rc::Rc<[u8; PAGE_SZ]>; - fn read_blk(&self, blknum: u32) -> Result { + fn read_blk(&self, blknum: u32) -> io::Result { let mut buf = [0u8; PAGE_SZ]; buf.copy_from_slice(&self.blocks[blknum as usize]); Ok(std::rc::Rc::new(buf)) } } impl BlockWriter for &mut TestDisk { - fn write_blk(&mut self, buf: Bytes) -> Result { + fn write_blk(&mut self, buf: Bytes) -> io::Result { let blknum = self.blocks.len(); self.blocks.push(buf); Ok(blknum as u32) @@ -707,7 +736,7 @@ mod tests { } #[test] - fn basic() -> anyhow::Result<()> { + fn basic() -> Result<()> { let mut disk = TestDisk::new(); let mut writer = DiskBtreeBuilder::<_, 6>::new(&mut disk); @@ -788,7 +817,7 @@ mod tests { } #[test] - fn lots_of_keys() -> anyhow::Result<()> { + fn lots_of_keys() -> Result<()> { let mut disk = TestDisk::new(); let mut writer = DiskBtreeBuilder::<_, 8>::new(&mut disk); @@ -882,7 +911,7 @@ mod tests { } #[test] - fn random_data() -> anyhow::Result<()> { + fn random_data() -> Result<()> { // Generate random keys with exponential distribution, to // exercise the prefix compression const NUM_KEYS: usize = 100000; @@ -927,21 +956,27 @@ mod tests { } #[test] - #[should_panic(expected = "unsorted input")] fn unsorted_input() { let mut disk = TestDisk::new(); let mut writer = DiskBtreeBuilder::<_, 2>::new(&mut disk); let _ = writer.append(b"ba", 1); let _ = writer.append(b"bb", 2); - let _ = writer.append(b"aa", 3); + let err = writer.append(b"aa", 3).expect_err("should've failed"); + match err { + DiskBtreeError::UnsortedInput { key, last_key } => { + assert_eq!(key.as_ref(), b"aa".as_slice()); + assert_eq!(last_key.as_ref(), b"bb".as_slice()); + } + _ => panic!("unexpected error variant, expected DiskBtreeError::UnsortedInput"), + } } /// /// This test contains a particular data set, see disk_btree_test_data.rs /// #[test] - fn particular_data() -> 
anyhow::Result<()> { + fn particular_data() -> Result<()> { // Build a tree from it let mut disk = TestDisk::new(); let mut writer = DiskBtreeBuilder::<_, 26>::new(&mut disk); From 2f9b17b9e5b68ae2b469618ecfdbf64d4188f041 Mon Sep 17 00:00:00 2001 From: Anastasia Lubennikova Date: Thu, 28 Apr 2022 11:19:41 +0300 Subject: [PATCH 186/296] Add simple test of pageserver recovery after crash. To cause a crash, use failpoints in checkpointer --- .circleci/config.yml | 2 +- pageserver/src/bin/pageserver.rs | 30 ++++++++++- pageserver/src/layered_repository.rs | 2 + test_runner/batch_others/test_recovery.py | 64 +++++++++++++++++++++++ test_runner/fixtures/zenith_fixtures.py | 13 +++++ 5 files changed, 108 insertions(+), 3 deletions(-) create mode 100644 test_runner/batch_others/test_recovery.py diff --git a/.circleci/config.yml b/.circleci/config.yml index 2ed079f031..864246ad2e 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -121,7 +121,7 @@ jobs: export RUSTC_WRAPPER=cachepot export AWS_ACCESS_KEY_ID="${CACHEPOT_AWS_ACCESS_KEY_ID}" export AWS_SECRET_ACCESS_KEY="${CACHEPOT_AWS_SECRET_ACCESS_KEY}" - "${cov_prefix[@]}" mold -run cargo build $CARGO_FLAGS --bins --tests + "${cov_prefix[@]}" mold -run cargo build $CARGO_FLAGS --features failpoints --bins --tests cachepot -s - save_cache: diff --git a/pageserver/src/bin/pageserver.rs b/pageserver/src/bin/pageserver.rs index 2139bea37e..6a5d4533d0 100644 --- a/pageserver/src/bin/pageserver.rs +++ b/pageserver/src/bin/pageserver.rs @@ -8,6 +8,7 @@ use anyhow::{bail, Context, Result}; use clap::{App, Arg}; use daemonize::Daemonize; +use fail::FailScenario; use pageserver::{ config::{defaults::*, PageServerConf}, http, page_cache, page_service, profiling, tenant_mgr, thread_mgr, @@ -84,8 +85,23 @@ fn main() -> anyhow::Result<()> { .help("Additional configuration overrides of the ones from the toml config file (or new ones to add there). 
Any option has to be a valid toml document, example: `-c=\"foo='hey'\"` `-c=\"foo={value=1}\"`"), ) + .arg( + Arg::new("enabled-features") + .long("enabled-features") + .takes_value(false) + .help("Show enabled compile time features"), + ) .get_matches(); + if arg_matches.is_present("enabled-features") { + let features: &[&str] = &[ + #[cfg(feature = "failpoints")] + "failpoints", + ]; + println!("{{\"features\": {features:?} }}"); + return Ok(()); + } + let workdir = Path::new(arg_matches.value_of("workdir").unwrap_or(".zenith")); let workdir = workdir .canonicalize() @@ -166,6 +182,14 @@ fn main() -> anyhow::Result<()> { // as a ref. let conf: &'static PageServerConf = Box::leak(Box::new(conf)); + // If failpoints are used, terminate the whole pageserver process if they are hit. + let scenario = FailScenario::setup(); + if fail::has_failpoints() { + std::panic::set_hook(Box::new(|_| { + std::process::exit(1); + })); + } + // Basic initialization of things that don't change after startup virtual_file::init(conf.max_file_descriptors); page_cache::init(conf.page_cache_size); @@ -181,10 +205,12 @@ fn main() -> anyhow::Result<()> { cfg_file_path.display() ) })?; - Ok(()) } else { - start_pageserver(conf, daemonize).context("Failed to start pageserver") + start_pageserver(conf, daemonize).context("Failed to start pageserver")?; } + + scenario.teardown(); + Ok(()) } fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<()> { diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 1205f8d867..e678c8f4cb 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -1703,6 +1703,7 @@ impl LayeredTimeline { new_delta_path.clone(), self.conf.timeline_path(&self.timelineid, &self.tenantid), ])?; + fail_point!("checkpoint-before-sync"); fail_point!("flush-frozen"); @@ -1727,6 +1728,7 @@ impl LayeredTimeline { // TODO: This perhaps should be done in 'flush_frozen_layers', after 
flushing // *all* the layers, to avoid fsyncing the file multiple times. let disk_consistent_lsn = Lsn(frozen_layer.get_lsn_range().end.0 - 1); + fail_point!("checkpoint-after-sync"); // If we were able to advance 'disk_consistent_lsn', save it the metadata file. // After crash, we will restart WAL streaming and processing from that point. diff --git a/test_runner/batch_others/test_recovery.py b/test_runner/batch_others/test_recovery.py new file mode 100644 index 0000000000..dbfa943a7a --- /dev/null +++ b/test_runner/batch_others/test_recovery.py @@ -0,0 +1,64 @@ +import os +import time +import psycopg2.extras +import json +from ast import Assert +from contextlib import closing +from fixtures.zenith_fixtures import ZenithEnvBuilder +from fixtures.log_helper import log + + +# +# Test pageserver recovery after crash +# +def test_pageserver_recovery(zenith_env_builder: ZenithEnvBuilder): + zenith_env_builder.num_safekeepers = 1 + # Override default checkpointer settings to run it more often + zenith_env_builder.pageserver_config_override = "tenant_config={checkpoint_distance = 1048576}" + + env = zenith_env_builder.init() + + # Check if failpoints enables. 
Otherwise the test doesn't make sense + f = env.zenith_cli.pageserver_enabled_features() + + assert "failpoints" in f["features"], "Build pageserver with --features=failpoints option to run this test" + zenith_env_builder.start() + + # Create a branch for us + env.zenith_cli.create_branch("test_pageserver_recovery", "main") + + pg = env.postgres.create_start('test_pageserver_recovery') + log.info("postgres is running on 'test_pageserver_recovery' branch") + + connstr = pg.connstr() + + with closing(pg.connect()) as conn: + with conn.cursor() as cur: + with closing(env.pageserver.connect()) as psconn: + with psconn.cursor(cursor_factory=psycopg2.extras.DictCursor) as pscur: + # Create and initialize test table + cur.execute("CREATE TABLE foo(x bigint)") + cur.execute("INSERT INTO foo VALUES (generate_series(1,100000))") + + # Sleep for some time to let checkpoint create image layers + time.sleep(2) + + # Configure failpoints + pscur.execute( + "failpoints checkpoint-before-sync=sleep(2000);checkpoint-after-sync=panic") + + # Do some updates until pageserver is crashed + try: + while True: + cur.execute("update foo set x=x+1") + except Exception as err: + log.info(f"Excepted server crash {err}") + + log.info("Wait before server restart") + env.pageserver.stop() + env.pageserver.start() + + with closing(pg.connect()) as conn: + with conn.cursor() as cur: + cur.execute("select count(*) from foo") + assert cur.fetchone() == (100000, ) diff --git a/test_runner/fixtures/zenith_fixtures.py b/test_runner/fixtures/zenith_fixtures.py index 5b25b1c457..9319a53778 100644 --- a/test_runner/fixtures/zenith_fixtures.py +++ b/test_runner/fixtures/zenith_fixtures.py @@ -980,6 +980,19 @@ class ZenithCli: res.check_returncode() return res + def pageserver_enabled_features(self) -> Any: + bin_pageserver = os.path.join(str(zenith_binpath), 'pageserver') + args = [bin_pageserver, '--enabled-features'] + log.info('Running command "{}"'.format(' '.join(args))) + + res = 
subprocess.run(args, + check=True, + universal_newlines=True, + stdout=subprocess.PIPE, + stderr=subprocess.PIPE) + log.info(f"pageserver_enabled_features success: {res.stdout}") + return json.loads(res.stdout) + def pageserver_start(self, overrides=()) -> 'subprocess.CompletedProcess[str]': start_args = ['pageserver', 'start', *overrides] append_pageserver_param_overrides(start_args, From 2f83f793bc3f5cf4008904a91f34383bd0350439 Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Tue, 3 May 2022 17:14:58 +0300 Subject: [PATCH 187/296] print more details when thread fails --- pageserver/src/thread_mgr.rs | 42 +++++++++++++++++++++++++----------- 1 file changed, 29 insertions(+), 13 deletions(-) diff --git a/pageserver/src/thread_mgr.rs b/pageserver/src/thread_mgr.rs index f7f8467ae0..b908f220ee 100644 --- a/pageserver/src/thread_mgr.rs +++ b/pageserver/src/thread_mgr.rs @@ -130,12 +130,14 @@ struct PageServerThread { } /// Launch a new thread +/// Note: if shutdown_process_on_error is set to true failure +/// of the thread will lead to shutdown of entire process pub fn spawn( kind: ThreadKind, tenant_id: Option, timeline_id: Option, name: &str, - fail_on_error: bool, + shutdown_process_on_error: bool, f: F, ) -> std::io::Result<()> where @@ -175,7 +177,7 @@ where thread_id, thread_rc2, shutdown_rx, - fail_on_error, + shutdown_process_on_error, f, ) }) { @@ -201,7 +203,7 @@ fn thread_wrapper( thread_id: u64, thread: Arc, shutdown_rx: watch::Receiver<()>, - fail_on_error: bool, + shutdown_process_on_error: bool, f: F, ) where F: FnOnce() -> anyhow::Result<()> + Send + 'static, @@ -221,27 +223,41 @@ fn thread_wrapper( let result = panic::catch_unwind(AssertUnwindSafe(f)); // Remove our entry from the global hashmap. 
- THREADS.lock().unwrap().remove(&thread_id); + let thread = THREADS + .lock() + .unwrap() + .remove(&thread_id) + .expect("no thread in registry"); match result { Ok(Ok(())) => debug!("Thread '{}' exited normally", thread_name), Ok(Err(err)) => { - if fail_on_error { + if shutdown_process_on_error { error!( - "Shutting down: thread '{}' exited with error: {:?}", - thread_name, err + "Shutting down: thread '{}' tenant_id: {:?}, timeline_id: {:?} exited with error: {:?}", + thread_name, thread.tenant_id, thread.timeline_id, err ); shutdown_pageserver(1); } else { - error!("Thread '{}' exited with error: {:?}", thread_name, err); + error!( + "Thread '{}' tenant_id: {:?}, timeline_id: {:?} exited with error: {:?}", + thread_name, thread.tenant_id, thread.timeline_id, err + ); } } Err(err) => { - error!( - "Shutting down: thread '{}' panicked: {:?}", - thread_name, err - ); - shutdown_pageserver(1); + if shutdown_process_on_error { + error!( + "Shutting down: thread '{}' tenant_id: {:?}, timeline_id: {:?} panicked: {:?}", + thread_name, thread.tenant_id, thread.timeline_id, err + ); + shutdown_pageserver(1); + } else { + error!( + "Thread '{}' tenant_id: {:?}, timeline_id: {:?} panicked: {:?}", + thread_name, thread.tenant_id, thread.timeline_id, err + ); + } } } } From 5642d0b2b86e967eb2b8f71dcb7540f815c22ed6 Mon Sep 17 00:00:00 2001 From: Stas Kelvich Date: Tue, 3 May 2022 23:57:24 +0300 Subject: [PATCH 188/296] Change shutdown_process_on_error thread spawn settings. The principle is now as follows: an error in an acceptor thread (libpq or http) brings the pageserver down, while per-tenant thread failures are only reported as errors.
--- pageserver/src/bin/pageserver.rs | 4 ++-- pageserver/src/tenant_mgr.rs | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/pageserver/src/bin/pageserver.rs b/pageserver/src/bin/pageserver.rs index 6a5d4533d0..9cb7e6f13d 100644 --- a/pageserver/src/bin/pageserver.rs +++ b/pageserver/src/bin/pageserver.rs @@ -287,7 +287,7 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<() None, None, "http_endpoint_thread", - false, + true, move || { let router = http::make_router(conf, auth_cloned, remote_index)?; endpoint::serve_thread_main(router, http_listener, thread_mgr::shutdown_watcher()) @@ -301,7 +301,7 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<() None, None, "libpq endpoint thread", - false, + true, move || page_service::thread_main(conf, auth, pageserver_listener, conf.auth_type), )?; diff --git a/pageserver/src/tenant_mgr.rs b/pageserver/src/tenant_mgr.rs index 3e0a907d00..507e749e8c 100644 --- a/pageserver/src/tenant_mgr.rs +++ b/pageserver/src/tenant_mgr.rs @@ -244,7 +244,7 @@ pub fn activate_tenant(tenant_id: ZTenantId) -> anyhow::Result<()> { Some(tenant_id), None, "Compactor thread", - true, + false, move || crate::tenant_threads::compact_loop(tenant_id), )?; @@ -253,7 +253,7 @@ pub fn activate_tenant(tenant_id: ZTenantId) -> anyhow::Result<()> { Some(tenant_id), None, "GC thread", - true, + false, move || crate::tenant_threads::gc_loop(tenant_id), ) .with_context(|| format!("Failed to launch GC thread for tenant {tenant_id}")); From 9dfa145c7c7fa826eef12ef36d710db1b40152a3 Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Tue, 3 May 2022 23:51:47 +0300 Subject: [PATCH 189/296] tone down tenant not found error --- libs/utils/src/postgres_backend.rs | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/libs/utils/src/postgres_backend.rs b/libs/utils/src/postgres_backend.rs index fab3c388b1..857df0ec84 100644 --- a/libs/utils/src/postgres_backend.rs +++ 
b/libs/utils/src/postgres_backend.rs @@ -433,7 +433,12 @@ impl PostgresBackend { // full cause of the error, not just the top-level context + its trace. // We don't want to send that in the ErrorResponse though, // because it's not relevant to the compute node logs. - error!("query handler for '{}' failed: {:?}", query_string, e); + if query_string.starts_with("callmemaybe") { + // FIXME avoid printing a backtrace for tenant x not found errors until this is properly fixed + error!("query handler for '{}' failed: {}", query_string, e); + } else { + error!("query handler for '{}' failed: {:?}", query_string, e); + } self.write_message_noflush(&BeMessage::ErrorResponse(&e.to_string()))?; // TODO: untangle convoluted control flow if e.to_string().contains("failed to run") { From 51a0f2683bd299dfe30be511dafdf10dcfcf422d Mon Sep 17 00:00:00 2001 From: Stas Kelvich Date: Wed, 4 May 2022 01:18:08 +0300 Subject: [PATCH 190/296] fix scram-proxy addresses --- .circleci/config.yml | 2 +- .circleci/helm-values/staging.proxy-scram.yaml | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/.circleci/config.yml b/.circleci/config.yml index 864246ad2e..85654b5d45 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -584,7 +584,7 @@ jobs: name: Re-deploy proxy command: | DOCKER_TAG=$(git log --oneline|wc -l) - helm upgrade zenith-proxy neondatabase/neon-proxy --install -f .circleci/helm-values/staging.proxy.yaml --set image.tag=${DOCKER_TAG} --wait + helm upgrade neon-proxy neondatabase/neon-proxy --install -f .circleci/helm-values/staging.proxy.yaml --set image.tag=${DOCKER_TAG} --wait helm upgrade neon-proxy-scram neondatabase/neon-proxy --install -f .circleci/helm-values/staging.proxy-scram.yaml --set image.tag=${DOCKER_TAG} --wait deploy-release: diff --git a/.circleci/helm-values/staging.proxy-scram.yaml b/.circleci/helm-values/staging.proxy-scram.yaml index d95ae3bfc2..f72a9d4557 100644 --- a/.circleci/helm-values/staging.proxy-scram.yaml +++ 
b/.circleci/helm-values/staging.proxy-scram.yaml @@ -20,7 +20,7 @@ exposedService: service.beta.kubernetes.io/aws-load-balancer-type: external service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing - external-dns.alpha.kubernetes.io/hostname: *.cloud.stage.neon.tech + external-dns.alpha.kubernetes.io/hostname: cloud.stage.neon.tech metrics: enabled: true From 748c5a577b5b4ed9f3dee1a0cc85724893883c2e Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Wed, 4 May 2022 10:54:44 +0300 Subject: [PATCH 191/296] Bump vendor/postgres. (#1616) Includes fix for https://github.com/neondatabase/neon/issues/1615 --- vendor/postgres | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vendor/postgres b/vendor/postgres index a13fe64a3e..868e7be7ff 160000 --- a/vendor/postgres +++ b/vendor/postgres @@ -1 +1 @@ -Subproject commit a13fe64a3eff1743ff17141a2e6057f5103829f0 +Subproject commit 868e7be7ff7dd1d026917892b3951f812e9d4a08 From b9fd8a36ad3b7cccc98d71930ef18338c34aa2d7 Mon Sep 17 00:00:00 2001 From: Arseny Sher Date: Sun, 1 May 2022 16:58:34 +0400 Subject: [PATCH 192/296] Remember timeline_start_lsn and local_start_lsn on safekeeper. Make it remember when timeline starts in general and on this safekeeper in particular (the point might be later on new safekeeper replacing failed one). Bumps control file and walproposer protocol versions. While protocol is bumped, also add safekeeper node id to AcceptorProposerGreeting. 
ref #1561 --- safekeeper/src/control_file_upgrade.rs | 43 ++++++++++++++ safekeeper/src/http/routes.rs | 6 ++ safekeeper/src/json_ctrl.rs | 3 +- safekeeper/src/safekeeper.rs | 57 ++++++++++++++++--- safekeeper/src/timeline.rs | 4 +- test_runner/batch_others/test_wal_acceptor.py | 10 +++- test_runner/fixtures/zenith_fixtures.py | 4 +- 7 files changed, 114 insertions(+), 13 deletions(-) diff --git a/safekeeper/src/control_file_upgrade.rs b/safekeeper/src/control_file_upgrade.rs index 0cb14298cb..d11206eff6 100644 --- a/safekeeper/src/control_file_upgrade.rs +++ b/safekeeper/src/control_file_upgrade.rs @@ -103,6 +103,43 @@ pub struct SafeKeeperStateV3 { pub wal_start_lsn: Lsn, } +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct SafeKeeperStateV4 { + #[serde(with = "hex")] + pub tenant_id: ZTenantId, + /// Zenith timelineid + #[serde(with = "hex")] + pub timeline_id: ZTimelineId, + /// persistent acceptor state + pub acceptor_state: AcceptorState, + /// information about server + pub server: ServerInfo, + /// Unique id of the last *elected* proposer we dealed with. Not needed + /// for correctness, exists for monitoring purposes. + #[serde(with = "hex")] + pub proposer_uuid: PgUuid, + /// Part of WAL acknowledged by quorum and available locally. Always points + /// to record boundary. + pub commit_lsn: Lsn, + /// First LSN not yet offloaded to s3. Useful to persist to avoid finding + /// out offloading progress on boot. + pub s3_wal_lsn: Lsn, + /// Minimal LSN which may be needed for recovery of some safekeeper (end_lsn + /// of last record streamed to everyone). Persisting it helps skipping + /// recovery in walproposer, generally we compute it from peers. In + /// walproposer proto called 'truncate_lsn'. + pub peer_horizon_lsn: Lsn, + /// LSN of the oldest known checkpoint made by pageserver and successfully + /// pushed to s3. We don't remove WAL beyond it. Persisted only for + /// informational purposes, we receive it from pageserver (or broker). 
+ pub remote_consistent_lsn: Lsn, + // Peers and their state as we remember it. Knowing peers themselves is + // fundamental; but state is saved here only for informational purposes and + // obviously can be stale. (Currently not saved at all, but let's provision + // place to have less file version upgrades). + pub peers: Peers, +} + pub fn upgrade_control_file(buf: &[u8], version: u32) -> Result { // migrate to storing full term history if version == 1 { @@ -125,6 +162,8 @@ pub fn upgrade_control_file(buf: &[u8], version: u32) -> Result wal_seg_size: oldstate.server.wal_seg_size, }, proposer_uuid: oldstate.proposer_uuid, + timeline_start_lsn: Lsn(0), + local_start_lsn: Lsn(0), commit_lsn: oldstate.commit_lsn, s3_wal_lsn: Lsn(0), peer_horizon_lsn: oldstate.truncate_lsn, @@ -146,6 +185,8 @@ pub fn upgrade_control_file(buf: &[u8], version: u32) -> Result acceptor_state: oldstate.acceptor_state, server, proposer_uuid: oldstate.proposer_uuid, + timeline_start_lsn: Lsn(0), + local_start_lsn: Lsn(0), commit_lsn: oldstate.commit_lsn, s3_wal_lsn: Lsn(0), peer_horizon_lsn: oldstate.truncate_lsn, @@ -167,6 +208,8 @@ pub fn upgrade_control_file(buf: &[u8], version: u32) -> Result acceptor_state: oldstate.acceptor_state, server, proposer_uuid: oldstate.proposer_uuid, + timeline_start_lsn: Lsn(0), + local_start_lsn: Lsn(0), commit_lsn: oldstate.commit_lsn, s3_wal_lsn: Lsn(0), peer_horizon_lsn: oldstate.truncate_lsn, diff --git a/safekeeper/src/http/routes.rs b/safekeeper/src/http/routes.rs index fab8724430..d7cbcb094e 100644 --- a/safekeeper/src/http/routes.rs +++ b/safekeeper/src/http/routes.rs @@ -69,6 +69,10 @@ struct TimelineStatus { timeline_id: ZTimelineId, acceptor_state: AcceptorStateStatus, #[serde(serialize_with = "display_serialize")] + timeline_start_lsn: Lsn, + #[serde(serialize_with = "display_serialize")] + local_start_lsn: Lsn, + #[serde(serialize_with = "display_serialize")] commit_lsn: Lsn, #[serde(serialize_with = "display_serialize")] s3_wal_lsn: Lsn, @@ 
-102,6 +106,8 @@ async fn timeline_status_handler(request: Request) -> Result Result<()> { let greeting_request = ProposerAcceptorMessage::Greeting(ProposerGreeting { - protocol_version: 1, // current protocol + protocol_version: 2, // current protocol pg_version: 0, // unknown proposer_id: [0u8; 16], system_id: 0, @@ -124,6 +124,7 @@ fn send_proposer_elected(spg: &mut SafekeeperPostgresHandler, term: Term, lsn: L term, start_streaming_at: lsn, term_history: history, + timeline_start_lsn: Lsn(0), }); spg.timeline.get().process_msg(&proposer_elected_request)?; diff --git a/safekeeper/src/safekeeper.rs b/safekeeper/src/safekeeper.rs index 048753152b..67d41d0b58 100644 --- a/safekeeper/src/safekeeper.rs +++ b/safekeeper/src/safekeeper.rs @@ -30,8 +30,8 @@ use utils::{ }; pub const SK_MAGIC: u32 = 0xcafeceefu32; -pub const SK_FORMAT_VERSION: u32 = 4; -const SK_PROTOCOL_VERSION: u32 = 1; +pub const SK_FORMAT_VERSION: u32 = 5; +const SK_PROTOCOL_VERSION: u32 = 2; const UNKNOWN_SERVER_VERSION: u32 = 0; /// Consensus logical timestamp. @@ -52,7 +52,7 @@ impl TermHistory { } // Parse TermHistory as n_entries followed by TermSwitchEntry pairs - pub fn from_bytes(mut bytes: Bytes) -> Result { + pub fn from_bytes(bytes: &mut Bytes) -> Result { if bytes.remaining() < 4 { bail!("TermHistory misses len"); } @@ -183,6 +183,13 @@ pub struct SafeKeeperState { /// for correctness, exists for monitoring purposes. #[serde(with = "hex")] pub proposer_uuid: PgUuid, + /// Since which LSN this timeline generally starts. Safekeeper might have + /// joined later. + pub timeline_start_lsn: Lsn, + /// Since which LSN safekeeper has (had) WAL for this timeline. + /// All WAL segments next to one containing local_start_lsn are + /// filled with data from the beginning. + pub local_start_lsn: Lsn, /// Part of WAL acknowledged by quorum and available locally. Always points /// to record boundary. 
pub commit_lsn: Lsn, @@ -231,6 +238,8 @@ impl SafeKeeperState { wal_seg_size: 0, }, proposer_uuid: [0; 16], + timeline_start_lsn: Lsn(0), + local_start_lsn: Lsn(0), commit_lsn: Lsn(0), s3_wal_lsn: Lsn(0), peer_horizon_lsn: Lsn(0), @@ -268,6 +277,7 @@ pub struct ProposerGreeting { #[derive(Debug, Serialize)] pub struct AcceptorGreeting { term: u64, + node_id: ZNodeId, } /// Vote request sent from proposer to safekeepers @@ -286,6 +296,7 @@ pub struct VoteResponse { flush_lsn: Lsn, truncate_lsn: Lsn, term_history: TermHistory, + timeline_start_lsn: Lsn, } /* @@ -297,6 +308,7 @@ pub struct ProposerElected { pub term: Term, pub start_streaming_at: Lsn, pub term_history: TermHistory, + pub timeline_start_lsn: Lsn, } /// Request with WAL message sent from proposer to safekeeper. Along the way it @@ -387,10 +399,15 @@ impl ProposerAcceptorMessage { } let term = msg_bytes.get_u64_le(); let start_streaming_at = msg_bytes.get_u64_le().into(); - let term_history = TermHistory::from_bytes(msg_bytes)?; + let term_history = TermHistory::from_bytes(&mut msg_bytes)?; + if msg_bytes.remaining() < 8 { + bail!("ProposerElected message is not complete"); + } + let timeline_start_lsn = msg_bytes.get_u64_le().into(); let msg = ProposerElected { term, start_streaming_at, + timeline_start_lsn, term_history, }; Ok(ProposerAcceptorMessage::Elected(msg)) @@ -437,6 +454,7 @@ impl AcceptorProposerMessage { AcceptorProposerMessage::Greeting(msg) => { buf.put_u64_le('g' as u64); buf.put_u64_le(msg.term); + buf.put_u64_le(msg.node_id.0); } AcceptorProposerMessage::VoteResponse(msg) => { buf.put_u64_le('v' as u64); @@ -449,6 +467,7 @@ impl AcceptorProposerMessage { buf.put_u64_le(e.term); buf.put_u64_le(e.lsn.into()); } + buf.put_u64_le(msg.timeline_start_lsn.into()); } AcceptorProposerMessage::AppendResponse(msg) => { buf.put_u64_le('a' as u64); @@ -511,6 +530,8 @@ pub struct SafeKeeper { pub state: CTRL, // persistent state storage pub wal_store: WAL, + + node_id: ZNodeId, // safekeeper's node 
id } impl SafeKeeper @@ -523,6 +544,7 @@ where ztli: ZTimelineId, state: CTRL, mut wal_store: WAL, + node_id: ZNodeId, ) -> Result> { if state.timeline_id != ZTimelineId::from([0u8; 16]) && ztli != state.timeline_id { bail!("Calling SafeKeeper::new with inconsistent ztli ({}) and SafeKeeperState.server.timeline_id ({})", ztli, state.timeline_id); @@ -544,6 +566,7 @@ where }, state, wal_store, + node_id, }) } @@ -635,6 +658,7 @@ where ); Ok(Some(AcceptorProposerMessage::Greeting(AcceptorGreeting { term: self.state.acceptor_state.term, + node_id: self.node_id, }))) } @@ -650,6 +674,7 @@ where flush_lsn: self.wal_store.flush_lsn(), truncate_lsn: self.state.peer_horizon_lsn, term_history: self.get_term_history(), + timeline_start_lsn: self.state.timeline_start_lsn, }; if self.state.acceptor_state.term < msg.term { let mut state = self.state.clone(); @@ -705,6 +730,23 @@ where // and now adopt term history from proposer { let mut state = self.state.clone(); + + // Remeber point where WAL begins globally, if not yet. + if state.timeline_start_lsn == Lsn(0) { + state.timeline_start_lsn = msg.timeline_start_lsn; + info!( + "setting timeline_start_lsn to {:?}", + state.timeline_start_lsn + ); + } + + // Remember point where WAL begins locally, if not yet. 
(I doubt the + // second condition is ever possible) + if state.local_start_lsn == Lsn(0) || state.local_start_lsn >= msg.start_streaming_at { + state.local_start_lsn = msg.start_streaming_at; + info!("setting local_start_lsn to {:?}", state.local_start_lsn); + } + state.acceptor_state.term_history = msg.term_history.clone(); self.state.persist(&state)?; } @@ -968,7 +1010,7 @@ mod tests { }; let wal_store = DummyWalStore { lsn: Lsn(0) }; let ztli = ZTimelineId::from([0u8; 16]); - let mut sk = SafeKeeper::new(ztli, storage, wal_store).unwrap(); + let mut sk = SafeKeeper::new(ztli, storage, wal_store, ZNodeId(0)).unwrap(); // check voting for 1 is ok let vote_request = ProposerAcceptorMessage::VoteRequest(VoteRequest { term: 1 }); @@ -983,7 +1025,7 @@ mod tests { let storage = InMemoryState { persisted_state: state, }; - sk = SafeKeeper::new(ztli, storage, sk.wal_store).unwrap(); + sk = SafeKeeper::new(ztli, storage, sk.wal_store, ZNodeId(0)).unwrap(); // and ensure voting second time for 1 is not ok vote_resp = sk.process_msg(&vote_request); @@ -1000,7 +1042,7 @@ mod tests { }; let wal_store = DummyWalStore { lsn: Lsn(0) }; let ztli = ZTimelineId::from([0u8; 16]); - let mut sk = SafeKeeper::new(ztli, storage, wal_store).unwrap(); + let mut sk = SafeKeeper::new(ztli, storage, wal_store, ZNodeId(0)).unwrap(); let mut ar_hdr = AppendRequestHeader { term: 1, @@ -1023,6 +1065,7 @@ mod tests { term: 1, lsn: Lsn(3), }]), + timeline_start_lsn: Lsn(0), }; sk.process_msg(&ProposerAcceptorMessage::Elected(pem)) .unwrap(); diff --git a/safekeeper/src/timeline.rs b/safekeeper/src/timeline.rs index 4a507015d3..745d8e0893 100644 --- a/safekeeper/src/timeline.rs +++ b/safekeeper/src/timeline.rs @@ -102,7 +102,7 @@ impl SharedState { let state = SafeKeeperState::new(zttid, peer_ids); let control_store = control_file::FileStorage::create_new(zttid, conf, state)?; let wal_store = wal_storage::PhysicalStorage::new(zttid, conf); - let sk = SafeKeeper::new(zttid.timeline_id, 
control_store, wal_store)?; + let sk = SafeKeeper::new(zttid.timeline_id, control_store, wal_store, conf.my_id)?; Ok(Self { notified_commit_lsn: Lsn(0), @@ -125,7 +125,7 @@ impl SharedState { Ok(Self { notified_commit_lsn: Lsn(0), - sk: SafeKeeper::new(zttid.timeline_id, control_store, wal_store)?, + sk: SafeKeeper::new(zttid.timeline_id, control_store, wal_store, conf.my_id)?, replicas: Vec::new(), active: false, num_computes: 0, diff --git a/test_runner/batch_others/test_wal_acceptor.py b/test_runner/batch_others/test_wal_acceptor.py index 94059e2a4c..702c27a79b 100644 --- a/test_runner/batch_others/test_wal_acceptor.py +++ b/test_runner/batch_others/test_wal_acceptor.py @@ -573,7 +573,9 @@ def test_timeline_status(zenith_env_builder: ZenithEnvBuilder): timeline_id = pg.safe_psql("show zenith.zenith_timeline")[0][0] # fetch something sensible from status - epoch = wa_http_cli.timeline_status(tenant_id, timeline_id).acceptor_epoch + tli_status = wa_http_cli.timeline_status(tenant_id, timeline_id) + epoch = tli_status.acceptor_epoch + timeline_start_lsn = tli_status.timeline_start_lsn pg.safe_psql("create table t(i int)") @@ -581,9 +583,13 @@ def test_timeline_status(zenith_env_builder: ZenithEnvBuilder): pg.stop().start() pg.safe_psql("insert into t values(10)") - epoch_after_reboot = wa_http_cli.timeline_status(tenant_id, timeline_id).acceptor_epoch + tli_status = wa_http_cli.timeline_status(tenant_id, timeline_id) + epoch_after_reboot = tli_status.acceptor_epoch assert epoch_after_reboot > epoch + # and timeline_start_lsn stays the same + assert tli_status.timeline_start_lsn == timeline_start_lsn + class SafekeeperEnv: def __init__(self, diff --git a/test_runner/fixtures/zenith_fixtures.py b/test_runner/fixtures/zenith_fixtures.py index 9319a53778..d6d07d78d3 100644 --- a/test_runner/fixtures/zenith_fixtures.py +++ b/test_runner/fixtures/zenith_fixtures.py @@ -1762,6 +1762,7 @@ class SafekeeperTimelineStatus: acceptor_epoch: int flush_lsn: str 
remote_consistent_lsn: str + timeline_start_lsn: str @dataclass @@ -1786,7 +1787,8 @@ class SafekeeperHttpClient(requests.Session): resj = res.json() return SafekeeperTimelineStatus(acceptor_epoch=resj['acceptor_state']['epoch'], flush_lsn=resj['flush_lsn'], - remote_consistent_lsn=resj['remote_consistent_lsn']) + remote_consistent_lsn=resj['remote_consistent_lsn'], + timeline_start_lsn=resj['timeline_start_lsn']) def record_safekeeper_info(self, tenant_id: str, timeline_id: str, body): res = self.post( From e58c83870fdb318d866953183192dfe97dcb6db8 Mon Sep 17 00:00:00 2001 From: Arseny Sher Date: Wed, 4 May 2022 13:36:31 +0400 Subject: [PATCH 193/296] Bump vendor/postgres to send timeline_start_lsn. --- vendor/postgres | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vendor/postgres b/vendor/postgres index 868e7be7ff..ce3057955a 160000 --- a/vendor/postgres +++ b/vendor/postgres @@ -1 +1 @@ -Subproject commit 868e7be7ff7dd1d026917892b3951f812e9d4a08 +Subproject commit ce3057955ac962662c6fe0d00d793bfccedf7ca8 From b68e3b03ed851ed582841822a9d603f02d698b42 Mon Sep 17 00:00:00 2001 From: Arseny Sher Date: Wed, 4 May 2022 16:19:21 +0400 Subject: [PATCH 194/296] Fix control file update for b9fd8a36ad3b --- safekeeper/src/control_file_upgrade.rs | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) diff --git a/safekeeper/src/control_file_upgrade.rs b/safekeeper/src/control_file_upgrade.rs index d11206eff6..22716de1a0 100644 --- a/safekeeper/src/control_file_upgrade.rs +++ b/safekeeper/src/control_file_upgrade.rs @@ -216,6 +216,29 @@ pub fn upgrade_control_file(buf: &[u8], version: u32) -> Result remote_consistent_lsn: Lsn(0), peers: Peers(vec![]), }); + // migrate to having timeline_start_lsn + } else if version == 4 { + info!("reading safekeeper control file version {}", version); + let oldstate = SafeKeeperStateV4::des(&buf[..buf.len()])?; + let server = ServerInfo { + pg_version: oldstate.server.pg_version, + system_id:
oldstate.server.system_id, + wal_seg_size: oldstate.server.wal_seg_size, + }; + return Ok(SafeKeeperState { + tenant_id: oldstate.tenant_id, + timeline_id: oldstate.timeline_id, + acceptor_state: oldstate.acceptor_state, + server, + proposer_uuid: oldstate.proposer_uuid, + timeline_start_lsn: Lsn(0), + local_start_lsn: Lsn(0), + commit_lsn: oldstate.commit_lsn, + s3_wal_lsn: Lsn(0), + peer_horizon_lsn: oldstate.peer_horizon_lsn, + remote_consistent_lsn: Lsn(0), + peers: Peers(vec![]), + }); } bail!("unsupported safekeeper control file version {}", version) } From e2cf77441df67a9b9a49cbdb2120096decb0e0da Mon Sep 17 00:00:00 2001 From: Anastasia Lubennikova Date: Tue, 3 May 2022 13:23:18 +0300 Subject: [PATCH 195/296] Implement pg_database_size(). In this implementation dbsize equals sum of all relation sizes, excluding shared ones. --- pageserver/src/page_service.rs | 56 +++++++++++++++++++ test_runner/batch_others/test_createdropdb.py | 11 +++- vendor/postgres | 2 +- 3 files changed, 67 insertions(+), 2 deletions(-) diff --git a/pageserver/src/page_service.rs b/pageserver/src/page_service.rs index e584a101cd..da3dedfc84 100644 --- a/pageserver/src/page_service.rs +++ b/pageserver/src/page_service.rs @@ -44,11 +44,14 @@ use crate::CheckpointConfig; use metrics::{register_histogram_vec, HistogramVec}; use postgres_ffi::xlog_utils::to_pg_timestamp; +use postgres_ffi::pg_constants; + // Wrapped in libpq CopyData enum PagestreamFeMessage { Exists(PagestreamExistsRequest), Nblocks(PagestreamNblocksRequest), GetPage(PagestreamGetPageRequest), + DbSize(PagestreamDbSizeRequest), } // Wrapped in libpq CopyData @@ -57,6 +60,7 @@ enum PagestreamBeMessage { Nblocks(PagestreamNblocksResponse), GetPage(PagestreamGetPageResponse), Error(PagestreamErrorResponse), + DbSize(PagestreamDbSizeResponse), } #[derive(Debug)] @@ -81,6 +85,13 @@ struct PagestreamGetPageRequest { blkno: u32, } +#[derive(Debug)] +struct PagestreamDbSizeRequest { + latest: bool, + lsn: Lsn, + dbnode: u32, +} + 
#[derive(Debug)] struct PagestreamExistsResponse { exists: bool, @@ -101,6 +112,11 @@ struct PagestreamErrorResponse { message: String, } +#[derive(Debug)] +struct PagestreamDbSizeResponse { + db_size: i64, +} + impl PagestreamFeMessage { fn parse(mut body: Bytes) -> anyhow::Result<PagestreamFeMessage> { // TODO these gets can fail @@ -142,6 +158,11 @@ impl PagestreamFeMessage { }, blkno: body.get_u32(), })), + 3 => Ok(PagestreamFeMessage::DbSize(PagestreamDbSizeRequest { + latest: body.get_u8() != 0, + lsn: Lsn::from(body.get_u64()), + dbnode: body.get_u32(), + })), _ => bail!("unknown smgr message tag: {},'{:?}'", msg_tag, body), } } @@ -172,6 +193,10 @@ impl PagestreamBeMessage { bytes.put(resp.message.as_bytes()); bytes.put_u8(0); // null terminator } + Self::DbSize(resp) => { + bytes.put_u8(104); /* tag from pagestore_client.h */ + bytes.put_i64(resp.db_size); + } } bytes.into() @@ -367,6 +392,11 @@ impl PageServerHandler { .observe_closure_duration(|| { self.handle_get_page_at_lsn_request(timeline.as_ref(), &req) }), + PagestreamFeMessage::DbSize(req) => SMGR_QUERY_TIME + .with_label_values(&["get_db_size", &tenant_id, &timeline_id]) + .observe_closure_duration(|| { + self.handle_db_size_request(timeline.as_ref(), &req) + }), }; let response = response.unwrap_or_else(|e| { @@ -487,6 +517,32 @@ impl PageServerHandler { })) } + fn handle_db_size_request( + &self, + timeline: &DatadirTimeline, + req: &PagestreamDbSizeRequest, + ) -> Result<PagestreamBeMessage> { + let _enter = info_span!("get_db_size", dbnode = %req.dbnode, req_lsn = %req.lsn).entered(); + let latest_gc_cutoff_lsn = timeline.tline.get_latest_gc_cutoff_lsn(); + let lsn = Self::wait_or_get_last_lsn(timeline, req.lsn, req.latest, &latest_gc_cutoff_lsn)?; + + let all_rels = timeline.list_rels(pg_constants::DEFAULTTABLESPACE_OID, req.dbnode, lsn)?; + let mut total_blocks: i64 = 0; + + for rel in all_rels { + if rel.forknum == 0 { + let n_blocks = timeline.get_rel_size(rel, lsn).unwrap_or(0); + total_blocks += n_blocks as i64; + } + } + + let
db_size = total_blocks * pg_constants::BLCKSZ as i64; + + Ok(PagestreamBeMessage::DbSize(PagestreamDbSizeResponse { + db_size, + })) + } + fn handle_get_page_at_lsn_request( &self, timeline: &DatadirTimeline, diff --git a/test_runner/batch_others/test_createdropdb.py b/test_runner/batch_others/test_createdropdb.py index 88937fa0dc..24898be70a 100644 --- a/test_runner/batch_others/test_createdropdb.py +++ b/test_runner/batch_others/test_createdropdb.py @@ -32,7 +32,16 @@ def test_createdb(zenith_simple_env: ZenithEnv): # Test that you can connect to the new database on both branches for db in (pg, pg2): - db.connect(dbname='foodb').close() + with closing(db.connect(dbname='foodb')) as conn: + with conn.cursor() as cur: + # Check database size in both branches + cur.execute( + 'select pg_size_pretty(pg_database_size(%s)), pg_size_pretty(sum(pg_relation_size(oid))) from pg_class where relisshared is false;', + ('foodb', )) + res = cur.fetchone() + # check that dbsize equals sum of all relation sizes, excluding shared ones + # This is how we define dbsize in zenith for now + assert res[0] == res[1] # diff --git a/vendor/postgres b/vendor/postgres index ce3057955a..f8c12bb06c 160000 --- a/vendor/postgres +++ b/vendor/postgres @@ -1 +1 @@ -Subproject commit ce3057955ac962662c6fe0d00d793bfccedf7ca8 +Subproject commit f8c12bb06c314e823dbc890229c28016c1f9a0fe From b8880bfaab048576034515ff2b8174b4dc21e260 Mon Sep 17 00:00:00 2001 From: Anastasia Lubennikova Date: Wed, 4 May 2022 17:27:16 +0300 Subject: [PATCH 196/296] Bump vendor/postgres --- vendor/postgres | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vendor/postgres b/vendor/postgres index f8c12bb06c..d35bd7132f 160000 --- a/vendor/postgres +++ b/vendor/postgres @@ -1 +1 @@ -Subproject commit f8c12bb06c314e823dbc890229c28016c1f9a0fe +Subproject commit d35bd7132ff6ed600577934e5389c7657087fbe1 From c4bc604e5f7d08e785cfd48d6a11c60b3555c598 Mon Sep 17 00:00:00 2001 From: Thang Pham Date: Wed, 4 May 2022 
11:23:04 -0400 Subject: [PATCH 197/296] Fix pg list table alignment #1633 Fixes #1628 - add [`comfy_table`](https://github.com/Nukesor/comfy-table/tree/main) and use it to construct table for `pg list` CLI command Comparison - Old: ``` NODE ADDRESS TIMELINE BRANCH NAME LSN STATUS main 127.0.0.1:55432 3823dd05e35d71f6ccf33049de366d70 main 0/16FB140 running migration_check 127.0.0.1:55433 3823dd05e35d71f6ccf33049de366d70 main 0/16FB140 running ``` - New: ``` NODE ADDRESS TIMELINE BRANCH NAME LSN STATUS main 127.0.0.1:55432 3823dd05e35d71f6ccf33049de366d70 main 0/16FB140 running migration_check 127.0.0.1:55433 3823dd05e35d71f6ccf33049de366d70 main 0/16FB140 running ``` --- Cargo.lock | 68 ++++++++++++++++++++++++++++++++++++++++++++++ zenith/Cargo.toml | 1 + zenith/src/main.rs | 29 ++++++++++++++------ 3 files changed, 90 insertions(+), 8 deletions(-) diff --git a/Cargo.lock b/Cargo.lock index 2c081e8beb..e9b24b2f84 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -330,6 +330,18 @@ dependencies = [ "memchr", ] +[[package]] +name = "comfy-table" +version = "5.0.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b103d85ca6e209388771bfb7aa6b68a7aeec4afbf6f0a0264bfbf50360e5212e" +dependencies = [ + "crossterm", + "strum", + "strum_macros", + "unicode-width", +] + [[package]] name = "compute_tools" version = "0.1.0" @@ -526,6 +538,31 @@ dependencies = [ "lazy_static", ] +[[package]] +name = "crossterm" +version = "0.23.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a2102ea4f781910f8a5b98dd061f4c2023f479ce7bb1236330099ceb5a93cf17" +dependencies = [ + "bitflags", + "crossterm_winapi", + "libc", + "mio", + "parking_lot 0.12.0", + "signal-hook", + "signal-hook-mio", + "winapi", +] + +[[package]] +name = "crossterm_winapi" +version = "0.9.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2ae1b35a484aa10e07fe0638d02301c5ad24de82d310ccbd2f3693da5f09bf1c" +dependencies = [ + "winapi", +] + 
[[package]] name = "crypto-common" version = "0.1.3" @@ -2664,6 +2701,17 @@ dependencies = [ "signal-hook-registry", ] +[[package]] +name = "signal-hook-mio" +version = "0.2.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "29ad2e15f37ec9a6cc544097b78a1ec90001e9f71b81338ca39f430adaca99af" +dependencies = [ + "libc", + "mio", + "signal-hook", +] + [[package]] name = "signal-hook-registry" version = "1.4.0" @@ -2753,6 +2801,25 @@ version = "0.10.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "73473c0e59e6d5812c5dfe2a064a6444949f089e20eec9a2e5506596494e4623" +[[package]] +name = "strum" +version = "0.23.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "cae14b91c7d11c9a851d3fbc80a963198998c2a64eec840477fa92d8ce9b70bb" + +[[package]] +name = "strum_macros" +version = "0.23.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5bb0dc7ee9c15cea6199cde9a127fa16a4c5819af85395457ad72d68edc85a38" +dependencies = [ + "heck", + "proc-macro2", + "quote", + "rustversion", + "syn", +] + [[package]] name = "subtle" version = "2.4.1" @@ -3642,6 +3709,7 @@ version = "0.1.0" dependencies = [ "anyhow", "clap 3.0.14", + "comfy-table", "control_plane", "pageserver", "postgres", diff --git a/zenith/Cargo.toml b/zenith/Cargo.toml index 0f72051f74..58f1f5751d 100644 --- a/zenith/Cargo.toml +++ b/zenith/Cargo.toml @@ -7,6 +7,7 @@ edition = "2021" clap = "3.0" anyhow = "1.0" serde_json = "1" +comfy-table = "5.0.1" postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" } # FIXME: 'pageserver' is needed for BranchInfo. 
Refactor diff --git a/zenith/src/main.rs b/zenith/src/main.rs index cd0cf470e8..ff2beec463 100644 --- a/zenith/src/main.rs +++ b/zenith/src/main.rs @@ -665,7 +665,19 @@ fn handle_pg(pg_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> { let timeline_name_mappings = env.timeline_name_mappings(); - println!("NODE\tADDRESS\tTIMELINE\tBRANCH NAME\tLSN\t\tSTATUS"); + let mut table = comfy_table::Table::new(); + + table.load_preset(comfy_table::presets::NOTHING); + + table.set_header(&[ + "NODE", + "ADDRESS", + "TIMELINE", + "BRANCH NAME", + "LSN", + "STATUS", + ]); + for ((_, node_name), node) in cplane .nodes .iter() @@ -684,16 +696,17 @@ fn handle_pg(pg_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> { .map(|name| name.as_str()) .unwrap_or("?"); - println!( - "{}\t{}\t{}\t{}\t{}\t{}", - node_name, - node.address, - node.timeline_id, + table.add_row(&[ + node_name.as_str(), + &node.address.to_string(), + &node.timeline_id.to_string(), branch_name, - lsn_str, + lsn_str.as_str(), node.status(), - ); + ]); } + + println!("{table}"); } "create" => { let branch_name = sub_args From 02e5083695d0ed17f7dbb1ca852f504fca42fdcc Mon Sep 17 00:00:00 2001 From: bojanserafimov Date: Wed, 4 May 2022 12:45:01 -0400 Subject: [PATCH 198/296] Add hot page test (#1479) --- poetry.lock | 30 +++++++++++++++---- pyproject.toml | 1 + test_runner/fixtures/compare_fixtures.py | 5 +++- test_runner/fixtures/zenith_fixtures.py | 2 +- test_runner/performance/test_hot_page.py | 36 +++++++++++++++++++++++ test_runner/performance/test_hot_table.py | 35 ++++++++++++++++++++++ 6 files changed, 101 insertions(+), 8 deletions(-) create mode 100644 test_runner/performance/test_hot_page.py create mode 100644 test_runner/performance/test_hot_table.py diff --git a/poetry.lock b/poetry.lock index fe18ad226c..a7cbe0aa3c 100644 --- a/poetry.lock +++ b/poetry.lock @@ -822,7 +822,7 @@ python-versions = "*" [[package]] name = "moto" -version = "3.0.4" +version = "3.1.7" description = "A 
library that allows your python tests to easily mock out the boto library" category = "main" optional = false @@ -844,6 +844,7 @@ importlib-metadata = {version = "*", markers = "python_version < \"3.8\""} Jinja2 = ">=2.10.1" jsondiff = {version = ">=1.1.2", optional = true, markers = "extra == \"server\""} MarkupSafe = "!=2.0.0a1" +pyparsing = {version = ">=3.0.0", optional = true, markers = "extra == \"server\""} python-dateutil = ">=2.1,<3.0.0" python-jose = {version = ">=3.1.0,<4.0.0", extras = ["cryptography"], optional = true, markers = "extra == \"server\""} pytz = "*" @@ -855,7 +856,7 @@ werkzeug = "*" xmltodict = "*" [package.extras] -all = ["PyYAML (>=5.1)", "python-jose[cryptography] (>=3.1.0,<4.0.0)", "ecdsa (!=0.15)", "docker (>=2.5.1)", "graphql-core", "jsondiff (>=1.1.2)", "aws-xray-sdk (>=0.93,!=0.96)", "idna (>=2.5,<4)", "cfn-lint (>=0.4.0)", "sshpubkeys (>=3.1.0)", "setuptools"] +all = ["PyYAML (>=5.1)", "python-jose[cryptography] (>=3.1.0,<4.0.0)", "ecdsa (!=0.15)", "docker (>=2.5.1)", "graphql-core", "jsondiff (>=1.1.2)", "aws-xray-sdk (>=0.93,!=0.96)", "idna (>=2.5,<4)", "cfn-lint (>=0.4.0)", "sshpubkeys (>=3.1.0)", "pyparsing (>=3.0.0)", "setuptools"] apigateway = ["PyYAML (>=5.1)", "python-jose[cryptography] (>=3.1.0,<4.0.0)", "ecdsa (!=0.15)"] apigatewayv2 = ["PyYAML (>=5.1)"] appsync = ["graphql-core"] @@ -864,14 +865,16 @@ batch = ["docker (>=2.5.1)"] cloudformation = ["docker (>=2.5.1)", "PyYAML (>=5.1)", "cfn-lint (>=0.4.0)"] cognitoidp = ["python-jose[cryptography] (>=3.1.0,<4.0.0)", "ecdsa (!=0.15)"] ds = ["sshpubkeys (>=3.1.0)"] +dynamodb = ["docker (>=2.5.1)"] dynamodb2 = ["docker (>=2.5.1)"] dynamodbstreams = ["docker (>=2.5.1)"] ec2 = ["sshpubkeys (>=3.1.0)"] efs = ["sshpubkeys (>=3.1.0)"] +glue = ["pyparsing (>=3.0.0)"] iotdata = ["jsondiff (>=1.1.2)"] route53resolver = ["sshpubkeys (>=3.1.0)"] s3 = ["PyYAML (>=5.1)"] -server = ["PyYAML (>=5.1)", "python-jose[cryptography] (>=3.1.0,<4.0.0)", "ecdsa (!=0.15)", "docker (>=2.5.1)", 
"graphql-core", "jsondiff (>=1.1.2)", "aws-xray-sdk (>=0.93,!=0.96)", "idna (>=2.5,<4)", "cfn-lint (>=0.4.0)", "sshpubkeys (>=3.1.0)", "setuptools", "flask", "flask-cors"] +server = ["PyYAML (>=5.1)", "python-jose[cryptography] (>=3.1.0,<4.0.0)", "ecdsa (!=0.15)", "docker (>=2.5.1)", "graphql-core", "jsondiff (>=1.1.2)", "aws-xray-sdk (>=0.93,!=0.96)", "idna (>=2.5,<4)", "cfn-lint (>=0.4.0)", "sshpubkeys (>=3.1.0)", "pyparsing (>=3.0.0)", "setuptools", "flask", "flask-cors"] ssm = ["PyYAML (>=5.1)", "dataclasses"] xray = ["aws-xray-sdk (>=0.93,!=0.96)", "setuptools"] @@ -1068,6 +1071,17 @@ python-versions = ">=3.6" py = "*" pytest = ">=3.10" +[[package]] +name = "pytest-lazy-fixture" +version = "0.6.3" +description = "It helps to use fixtures in pytest.mark.parametrize" +category = "main" +optional = false +python-versions = "*" + +[package.dependencies] +pytest = ">=3.2.5" + [[package]] name = "pytest-xdist" version = "2.5.0" @@ -1361,7 +1375,7 @@ testing = ["pytest (>=6)", "pytest-checkdocs (>=2.4)", "pytest-flake8", "pytest- [metadata] lock-version = "1.1" python-versions = "^3.7" -content-hash = "58762accad4122026c650fa43421a900546e89f9908e2268410e7b11cc8c6c4e" +content-hash = "dc63b6e02d0ceccdc4b5616e9362c149a27fdcc6c54fda63a3b115a5b980c42e" [metadata.files] aiopg = [ @@ -1679,8 +1693,8 @@ mccabe = [ {file = "mccabe-0.6.1.tar.gz", hash = "sha256:dd8d182285a0fe56bace7f45b5e7d1a6ebcbf524e8f3bd87eb0f125271b8831f"}, ] moto = [ - {file = "moto-3.0.4-py2.py3-none-any.whl", hash = "sha256:79646213d8438385182f4eea79e28725f94b3d0d3dc9a3eda81db47e0ebef6cc"}, - {file = "moto-3.0.4.tar.gz", hash = "sha256:168b8a3cb4dd8a6df8e51d582761cefa9657b9f45ac7e1eb24dae394ebc9e000"}, + {file = "moto-3.1.7-py3-none-any.whl", hash = "sha256:4ab6fb8dd150343e115d75e3dbdb5a8f850fc7236790819d7cef438c11ee6e89"}, + {file = "moto-3.1.7.tar.gz", hash = "sha256:20607a0fd0cf6530e05ffb623ca84d3f45d50bddbcec2a33705a0cf471e71289"}, ] mypy = [ {file = "mypy-0.910-cp35-cp35m-macosx_10_9_x86_64.whl", 
hash = "sha256:a155d80ea6cee511a3694b108c4494a39f42de11ee4e61e72bc424c490e46457"}, @@ -1855,6 +1869,10 @@ pytest-forked = [ {file = "pytest-forked-1.4.0.tar.gz", hash = "sha256:8b67587c8f98cbbadfdd804539ed5455b6ed03802203485dd2f53c1422d7440e"}, {file = "pytest_forked-1.4.0-py3-none-any.whl", hash = "sha256:bbbb6717efc886b9d64537b41fb1497cfaf3c9601276be8da2cccfea5a3c8ad8"}, ] +pytest-lazy-fixture = [ + {file = "pytest-lazy-fixture-0.6.3.tar.gz", hash = "sha256:0e7d0c7f74ba33e6e80905e9bfd81f9d15ef9a790de97993e34213deb5ad10ac"}, + {file = "pytest_lazy_fixture-0.6.3-py3-none-any.whl", hash = "sha256:e0b379f38299ff27a653f03eaa69b08a6fd4484e46fd1c9907d984b9f9daeda6"}, +] pytest-xdist = [ {file = "pytest-xdist-2.5.0.tar.gz", hash = "sha256:4580deca3ff04ddb2ac53eba39d76cb5dd5edeac050cb6fbc768b0dd712b4edf"}, {file = "pytest_xdist-2.5.0-py3-none-any.whl", hash = "sha256:6fe5c74fec98906deb8f2d2b616b5c782022744978e7bd4695d39c8f42d0ce65"}, diff --git a/pyproject.toml b/pyproject.toml index 7dbdcc0304..335c6d61d8 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -22,6 +22,7 @@ boto3 = "^1.20.40" boto3-stubs = "^1.20.40" moto = {version = "^3.0.0", extras = ["server"]} backoff = "^1.11.1" +pytest-lazy-fixture = "^0.6.3" [tool.poetry.dev-dependencies] yapf = "==0.31.0" diff --git a/test_runner/fixtures/compare_fixtures.py b/test_runner/fixtures/compare_fixtures.py index 93912d2da7..d70f57aa52 100644 --- a/test_runner/fixtures/compare_fixtures.py +++ b/test_runner/fixtures/compare_fixtures.py @@ -130,7 +130,10 @@ class VanillaCompare(PgCompare): def __init__(self, zenbenchmark, vanilla_pg: VanillaPostgres): self._pg = vanilla_pg self._zenbenchmark = zenbenchmark - vanilla_pg.configure(['shared_buffers=1MB']) + vanilla_pg.configure([ + 'shared_buffers=1MB', + 'synchronous_commit=off', + ]) vanilla_pg.start() # Long-lived cursor, useful for flushing diff --git a/test_runner/fixtures/zenith_fixtures.py b/test_runner/fixtures/zenith_fixtures.py index d6d07d78d3..784d2d4b26 100644 --- 
a/test_runner/fixtures/zenith_fixtures.py +++ b/test_runner/fixtures/zenith_fixtures.py @@ -1315,7 +1315,7 @@ class VanillaPostgres(PgProtocol): """Append lines into postgresql.conf file.""" assert not self.running with open(os.path.join(self.pgdatadir, 'postgresql.conf'), 'a') as conf_file: - conf_file.writelines(options) + conf_file.write("\n".join(options)) def start(self, log_path: Optional[str] = None): assert not self.running diff --git a/test_runner/performance/test_hot_page.py b/test_runner/performance/test_hot_page.py new file mode 100644 index 0000000000..2042b0d548 --- /dev/null +++ b/test_runner/performance/test_hot_page.py @@ -0,0 +1,36 @@ +import pytest +from contextlib import closing +from fixtures.compare_fixtures import PgCompare +from pytest_lazyfixture import lazy_fixture # type: ignore + + +@pytest.mark.parametrize( + "env", + [ + # The test is too slow to run in CI, but fast enough to run with remote tests + pytest.param(lazy_fixture("zenith_compare"), id="zenith", marks=pytest.mark.slow), + pytest.param(lazy_fixture("vanilla_compare"), id="vanilla", marks=pytest.mark.slow), + pytest.param(lazy_fixture("remote_compare"), id="remote", marks=pytest.mark.remote_cluster), + ]) +def test_hot_page(env: PgCompare): + # Update the same page many times, then measure read performance + num_writes = 1000000 + + with closing(env.pg.connect()) as conn: + with conn.cursor() as cur: + + # Write many updates to the same row + with env.record_duration('write'): + cur.execute('create table t (i integer);') + cur.execute('insert into t values (0);') + for i in range(num_writes): + cur.execute(f'update t set i = {i};') + + # Write 3-4 MB to evict t from compute cache + cur.execute('create table f (i integer);') + cur.execute(f'insert into f values (generate_series(1,100000));') + + # Read + with env.record_duration('read'): + cur.execute('select * from t;') + cur.fetchall() diff --git a/test_runner/performance/test_hot_table.py 
b/test_runner/performance/test_hot_table.py new file mode 100644 index 0000000000..11e047b8c3 --- /dev/null +++ b/test_runner/performance/test_hot_table.py @@ -0,0 +1,35 @@ +import pytest +from contextlib import closing +from fixtures.compare_fixtures import PgCompare +from pytest_lazyfixture import lazy_fixture # type: ignore + + +@pytest.mark.parametrize( + "env", + [ + # The test is too slow to run in CI, but fast enough to run with remote tests + pytest.param(lazy_fixture("zenith_compare"), id="zenith", marks=pytest.mark.slow), + pytest.param(lazy_fixture("vanilla_compare"), id="vanilla", marks=pytest.mark.slow), + pytest.param(lazy_fixture("remote_compare"), id="remote", marks=pytest.mark.remote_cluster), + ]) +def test_hot_table(env: PgCompare): + # Update a small table many times, then measure read performance + num_rows = 100000 # Slightly larger than shared buffers size TODO validate + num_writes = 1000000 + num_reads = 10 + + with closing(env.pg.connect()) as conn: + with conn.cursor() as cur: + + # Write many updates to a small table + with env.record_duration('write'): + cur.execute('create table t (i integer primary key);') + cur.execute(f'insert into t values (generate_series(1,{num_rows}));') + for i in range(num_writes): + cur.execute(f'update t set i = {i + num_rows} WHERE i = {i};') + + # Read the table + with env.record_duration('read'): + for i in range(num_reads): + cur.execute('select * from t;') + cur.fetchall() From bc569dde51639073cf241369f3fc872121d0c811 Mon Sep 17 00:00:00 2001 From: bojanserafimov Date: Wed, 4 May 2022 17:41:05 -0400 Subject: [PATCH 199/296] Remove some unwraps from waldecoder (#1539) --- libs/postgres_ffi/src/waldecoder.rs | 22 +++++++-- libs/postgres_ffi/src/xlog_utils.rs | 46 ++++++++++--------- pageserver/src/basebackup.rs | 5 +- pageserver/src/import_datadir.rs | 2 +- .../src/layered_repository/delta_layer.rs | 2 +- .../src/layered_repository/inmemory_layer.rs | 2 +- pageserver/src/walingest.rs | 5 +- 
pageserver/src/walrecord.rs | 32 ++++++------- safekeeper/src/json_ctrl.rs | 4 +- 9 files changed, 70 insertions(+), 50 deletions(-) diff --git a/libs/postgres_ffi/src/waldecoder.rs b/libs/postgres_ffi/src/waldecoder.rs index 9d1089ed46..95ea9660e8 100644 --- a/libs/postgres_ffi/src/waldecoder.rs +++ b/libs/postgres_ffi/src/waldecoder.rs @@ -89,7 +89,12 @@ impl WalStreamDecoder { return Ok(None); } - let hdr = XLogLongPageHeaderData::from_bytes(&mut self.inputbuf); + let hdr = XLogLongPageHeaderData::from_bytes(&mut self.inputbuf).map_err(|e| { + WalDecodeError { + msg: format!("long header deserialization failed {}", e), + lsn: self.lsn, + } + })?; if hdr.std.xlp_pageaddr != self.lsn.0 { return Err(WalDecodeError { @@ -106,7 +111,12 @@ impl WalStreamDecoder { return Ok(None); } - let hdr = XLogPageHeaderData::from_bytes(&mut self.inputbuf); + let hdr = XLogPageHeaderData::from_bytes(&mut self.inputbuf).map_err(|e| { + WalDecodeError { + msg: format!("header deserialization failed {}", e), + lsn: self.lsn, + } + })?; if hdr.xlp_pageaddr != self.lsn.0 { return Err(WalDecodeError { @@ -188,7 +198,13 @@ impl WalStreamDecoder { } // We now have a record in the 'recordbuf' local variable. 
- let xlogrec = XLogRecord::from_slice(&recordbuf[0..XLOG_SIZE_OF_XLOG_RECORD]); + let xlogrec = + XLogRecord::from_slice(&recordbuf[0..XLOG_SIZE_OF_XLOG_RECORD]).map_err(|e| { + WalDecodeError { + msg: format!("xlog record deserialization failed {}", e), + lsn: self.lsn, + } + })?; let mut crc = 0; crc = crc32c_append(crc, &recordbuf[XLOG_RECORD_CRC_OFFS + 4..]); diff --git a/libs/postgres_ffi/src/xlog_utils.rs b/libs/postgres_ffi/src/xlog_utils.rs index bd4b7df690..7882058868 100644 --- a/libs/postgres_ffi/src/xlog_utils.rs +++ b/libs/postgres_ffi/src/xlog_utils.rs @@ -15,7 +15,7 @@ use crate::XLogPageHeaderData; use crate::XLogRecord; use crate::XLOG_PAGE_MAGIC; -use anyhow::{bail, Result}; +use anyhow::bail; use byteorder::{ByteOrder, LittleEndian}; use bytes::BytesMut; use bytes::{Buf, Bytes}; @@ -28,6 +28,8 @@ use std::io::prelude::*; use std::io::SeekFrom; use std::path::{Path, PathBuf}; use std::time::SystemTime; +use utils::bin_ser::DeserializeError; +use utils::bin_ser::SerializeError; use utils::lsn::Lsn; pub const XLOG_FNAME_LEN: usize = 24; @@ -144,7 +146,7 @@ fn find_end_of_wal_segment( tli: TimeLineID, wal_seg_size: usize, start_offset: usize, // start reading at this point -) -> Result { +) -> anyhow::Result { // step back to the beginning of the page to read it in... 
let mut offs: usize = start_offset - start_offset % XLOG_BLCKSZ; let mut contlen: usize = 0; @@ -272,7 +274,7 @@ pub fn find_end_of_wal( wal_seg_size: usize, precise: bool, start_lsn: Lsn, // start reading WAL at this point or later -) -> Result<(XLogRecPtr, TimeLineID)> { +) -> anyhow::Result<(XLogRecPtr, TimeLineID)> { let mut high_segno: XLogSegNo = 0; let mut high_tli: TimeLineID = 0; let mut high_ispartial = false; @@ -354,19 +356,19 @@ pub fn main() { } impl XLogRecord { - pub fn from_slice(buf: &[u8]) -> XLogRecord { + pub fn from_slice(buf: &[u8]) -> Result<XLogRecord, DeserializeError> { use utils::bin_ser::LeSer; - XLogRecord::des(buf).unwrap() + XLogRecord::des(buf) } - pub fn from_bytes<B: Buf>(buf: &mut B) -> XLogRecord { + pub fn from_bytes<B: Buf>(buf: &mut B) -> Result<XLogRecord, DeserializeError> { use utils::bin_ser::LeSer; - XLogRecord::des_from(&mut buf.reader()).unwrap() + XLogRecord::des_from(&mut buf.reader()) } - pub fn encode(&self) -> Bytes { + pub fn encode(&self) -> Result<Bytes, SerializeError> { use utils::bin_ser::LeSer; - self.ser().unwrap().into() + Ok(self.ser()?.into()) } // Is this record an XLOG_SWITCH record?
They need some special processing, @@ -376,35 +378,35 @@ impl XLogRecord { } impl XLogPageHeaderData { - pub fn from_bytes<B: Buf>(buf: &mut B) -> XLogPageHeaderData { + pub fn from_bytes<B: Buf>(buf: &mut B) -> Result<XLogPageHeaderData, DeserializeError> { use utils::bin_ser::LeSer; - XLogPageHeaderData::des_from(&mut buf.reader()).unwrap() + XLogPageHeaderData::des_from(&mut buf.reader()) } } impl XLogLongPageHeaderData { - pub fn from_bytes<B: Buf>(buf: &mut B) -> XLogLongPageHeaderData { + pub fn from_bytes<B: Buf>(buf: &mut B) -> Result<XLogLongPageHeaderData, DeserializeError> { use utils::bin_ser::LeSer; - XLogLongPageHeaderData::des_from(&mut buf.reader()).unwrap() + XLogLongPageHeaderData::des_from(&mut buf.reader()) } - pub fn encode(&self) -> Bytes { + pub fn encode(&self) -> Result<Bytes, SerializeError> { use utils::bin_ser::LeSer; - self.ser().unwrap().into() + self.ser().map(|b| b.into()) } } pub const SIZEOF_CHECKPOINT: usize = std::mem::size_of::<CheckPoint>(); impl CheckPoint { - pub fn encode(&self) -> Bytes { + pub fn encode(&self) -> Result<Bytes, SerializeError> { use utils::bin_ser::LeSer; - self.ser().unwrap().into() + Ok(self.ser()?.into()) } - pub fn decode(buf: &[u8]) -> Result { + pub fn decode(buf: &[u8]) -> Result<CheckPoint, DeserializeError> { use utils::bin_ser::LeSer; - Ok(CheckPoint::des(buf)?) + CheckPoint::des(buf) } /// Update next XID based on provided new_xid and stored epoch. @@ -442,7 +444,7 @@ impl CheckPoint { // Generate new, empty WAL segment. // We need this segment to start compute node.
// -pub fn generate_wal_segment(segno: u64, system_id: u64) -> Bytes { +pub fn generate_wal_segment(segno: u64, system_id: u64) -> Result { let mut seg_buf = BytesMut::with_capacity(pg_constants::WAL_SEGMENT_SIZE as usize); let pageaddr = XLogSegNoOffsetToRecPtr(segno, 0, pg_constants::WAL_SEGMENT_SIZE); @@ -462,12 +464,12 @@ pub fn generate_wal_segment(segno: u64, system_id: u64) -> Bytes { xlp_xlog_blcksz: XLOG_BLCKSZ as u32, }; - let hdr_bytes = hdr.encode(); + let hdr_bytes = hdr.encode()?; seg_buf.extend_from_slice(&hdr_bytes); //zero out the rest of the file seg_buf.resize(pg_constants::WAL_SEGMENT_SIZE, 0); - seg_buf.freeze() + Ok(seg_buf.freeze()) } #[cfg(test)] diff --git a/pageserver/src/basebackup.rs b/pageserver/src/basebackup.rs index 14e6d40759..92d35130d8 100644 --- a/pageserver/src/basebackup.rs +++ b/pageserver/src/basebackup.rs @@ -10,7 +10,7 @@ //! This module is responsible for creation of such tarball //! from data stored in object storage. //! -use anyhow::{ensure, Context, Result}; +use anyhow::{anyhow, ensure, Context, Result}; use bytes::{BufMut, BytesMut}; use std::fmt::Write as FmtWrite; use std::io; @@ -323,7 +323,8 @@ impl<'a> Basebackup<'a> { let wal_file_name = XLogFileName(PG_TLI, segno, pg_constants::WAL_SEGMENT_SIZE); let wal_file_path = format!("pg_wal/{}", wal_file_name); let header = new_tar_header(&wal_file_path, pg_constants::WAL_SEGMENT_SIZE as u64)?; - let wal_seg = generate_wal_segment(segno, pg_control.system_identifier); + let wal_seg = generate_wal_segment(segno, pg_control.system_identifier) + .map_err(|e| anyhow!(e).context("Failed generating wal segment"))?; ensure!(wal_seg.len() == pg_constants::WAL_SEGMENT_SIZE); self.ar.append(&header, &wal_seg[..])?; Ok(()) diff --git a/pageserver/src/import_datadir.rs b/pageserver/src/import_datadir.rs index 8f49903e6c..703ee8f1b1 100644 --- a/pageserver/src/import_datadir.rs +++ b/pageserver/src/import_datadir.rs @@ -274,7 +274,7 @@ fn import_control_file( // Extract the 
checkpoint record and import it separately. let pg_control = ControlFileData::decode(&buffer)?; - let checkpoint_bytes = pg_control.checkPointCopy.encode(); + let checkpoint_bytes = pg_control.checkPointCopy.encode()?; modification.put_checkpoint(checkpoint_bytes)?; Ok(pg_control) diff --git a/pageserver/src/layered_repository/delta_layer.rs b/pageserver/src/layered_repository/delta_layer.rs index 4952f64ccd..1e1ec716a6 100644 --- a/pageserver/src/layered_repository/delta_layer.rs +++ b/pageserver/src/layered_repository/delta_layer.rs @@ -375,7 +375,7 @@ impl Layer for DeltaLayer { write!(&mut desc, " img {} bytes", img.len()).unwrap(); } Ok(Value::WalRecord(rec)) => { - let wal_desc = walrecord::describe_wal_record(&rec); + let wal_desc = walrecord::describe_wal_record(&rec).unwrap(); write!( &mut desc, " rec {} bytes will_init: {} {}", diff --git a/pageserver/src/layered_repository/inmemory_layer.rs b/pageserver/src/layered_repository/inmemory_layer.rs index 714a0bc579..856baa2e8a 100644 --- a/pageserver/src/layered_repository/inmemory_layer.rs +++ b/pageserver/src/layered_repository/inmemory_layer.rs @@ -207,7 +207,7 @@ impl Layer for InMemoryLayer { write!(&mut desc, " img {} bytes", img.len())?; } Ok(Value::WalRecord(rec)) => { - let wal_desc = walrecord::describe_wal_record(&rec); + let wal_desc = walrecord::describe_wal_record(&rec).unwrap(); write!( &mut desc, " rec {} bytes will_init: {} {}", diff --git a/pageserver/src/walingest.rs b/pageserver/src/walingest.rs index a929e290ad..fbdb328d2c 100644 --- a/pageserver/src/walingest.rs +++ b/pageserver/src/walingest.rs @@ -21,6 +21,7 @@ //! redo Postgres process, but some records it can handle directly with //! bespoken Rust code. 
+use anyhow::Context; use postgres_ffi::nonrelfile_utils::clogpage_precedes; use postgres_ffi::nonrelfile_utils::slru_may_delete_clogsegment; @@ -82,7 +83,7 @@ impl<'a, R: Repository> WalIngest<'a, R> { ) -> Result<()> { let mut modification = timeline.begin_modification(lsn); - let mut decoded = decode_wal_record(recdata); + let mut decoded = decode_wal_record(recdata).context("failed decoding wal record")?; let mut buf = decoded.record.clone(); buf.advance(decoded.main_data_offset); @@ -251,7 +252,7 @@ impl<'a, R: Repository> WalIngest<'a, R> { // If checkpoint data was updated, store the new version in the repository if self.checkpoint_modified { - let new_checkpoint_bytes = self.checkpoint.encode(); + let new_checkpoint_bytes = self.checkpoint.encode()?; modification.put_checkpoint(new_checkpoint_bytes)?; self.checkpoint_modified = false; diff --git a/pageserver/src/walrecord.rs b/pageserver/src/walrecord.rs index e8699cfa22..5a384360e2 100644 --- a/pageserver/src/walrecord.rs +++ b/pageserver/src/walrecord.rs @@ -1,6 +1,7 @@ //! //! Functions for parsing WAL records. //! +use anyhow::Result; use bytes::{Buf, Bytes}; use postgres_ffi::pg_constants; use postgres_ffi::xlog_utils::{TimestampTz, XLOG_SIZE_OF_XLOG_RECORD}; @@ -9,6 +10,7 @@ use postgres_ffi::{BlockNumber, OffsetNumber}; use postgres_ffi::{MultiXactId, MultiXactOffset, MultiXactStatus, Oid, TransactionId}; use serde::{Deserialize, Serialize}; use tracing::*; +use utils::bin_ser::DeserializeError; /// Each update to a page is represented by a ZenithWalRecord. It can be a wrapper /// around a PostgreSQL WAL record, or a custom zenith-specific "record". @@ -503,7 +505,7 @@ impl XlMultiXactTruncate { // block data // ... 
// main data -pub fn decode_wal_record(record: Bytes) -> DecodedWALRecord { +pub fn decode_wal_record(record: Bytes) -> Result { let mut rnode_spcnode: u32 = 0; let mut rnode_dbnode: u32 = 0; let mut rnode_relnode: u32 = 0; @@ -514,7 +516,7 @@ pub fn decode_wal_record(record: Bytes) -> DecodedWALRecord { // 1. Parse XLogRecord struct // FIXME: assume little-endian here - let xlogrec = XLogRecord::from_bytes(&mut buf); + let xlogrec = XLogRecord::from_bytes(&mut buf)?; trace!( "decode_wal_record xl_rmid = {} xl_info = {}", @@ -742,34 +744,32 @@ pub fn decode_wal_record(record: Bytes) -> DecodedWALRecord { assert_eq!(buf.remaining(), main_data_len as usize); } - DecodedWALRecord { + Ok(DecodedWALRecord { xl_xid: xlogrec.xl_xid, xl_info: xlogrec.xl_info, xl_rmid: xlogrec.xl_rmid, record, blocks, main_data_offset, - } + }) } /// /// Build a human-readable string to describe a WAL record /// /// For debugging purposes -pub fn describe_wal_record(rec: &ZenithWalRecord) -> String { +pub fn describe_wal_record(rec: &ZenithWalRecord) -> Result { match rec { - ZenithWalRecord::Postgres { will_init, rec } => { - format!( - "will_init: {}, {}", - will_init, - describe_postgres_wal_record(rec) - ) - } - _ => format!("{:?}", rec), + ZenithWalRecord::Postgres { will_init, rec } => Ok(format!( + "will_init: {}, {}", + will_init, + describe_postgres_wal_record(rec)? + )), + _ => Ok(format!("{:?}", rec)), } } -fn describe_postgres_wal_record(record: &Bytes) -> String { +fn describe_postgres_wal_record(record: &Bytes) -> Result { // TODO: It would be nice to use the PostgreSQL rmgrdesc infrastructure for this. // Maybe use the postgres wal redo process, the same used for replaying WAL records? // Or could we compile the rmgrdesc routines into the dump_layer_file() binary directly, @@ -782,7 +782,7 @@ fn describe_postgres_wal_record(record: &Bytes) -> String { // 1. 
Parse XLogRecord struct // FIXME: assume little-endian here - let xlogrec = XLogRecord::from_bytes(&mut buf); + let xlogrec = XLogRecord::from_bytes(&mut buf)?; let unknown_str: String; @@ -830,5 +830,5 @@ fn describe_postgres_wal_record(record: &Bytes) -> String { } }; - String::from(result) + Ok(String::from(result)) } diff --git a/safekeeper/src/json_ctrl.rs b/safekeeper/src/json_ctrl.rs index d21d5ad73b..43514997d4 100644 --- a/safekeeper/src/json_ctrl.rs +++ b/safekeeper/src/json_ctrl.rs @@ -239,13 +239,13 @@ fn encode_logical_message(prefix: &str, message: &str) -> Vec { xl_crc: 0, // crc will be calculated later }; - let header_bytes = header.encode(); + let header_bytes = header.encode().expect("failed to encode header"); let crc = crc32c_append(0, &data); let crc = crc32c_append(crc, &header_bytes[0..xlog_utils::XLOG_RECORD_CRC_OFFS]); header.xl_crc = crc; let mut wal: Vec = Vec::new(); - wal.extend_from_slice(&header.encode()); + wal.extend_from_slice(&header.encode().expect("failed to encode header")); wal.extend_from_slice(&data); // WAL start position must be aligned at 8 bytes, From c46fe90010adaee8a241a2241b417ecff2f037d9 Mon Sep 17 00:00:00 2001 From: Arseny Sher Date: Thu, 5 May 2022 07:43:55 +0400 Subject: [PATCH 200/296] Fix division by zero in WAL removal. 
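A minimal sketch of the failure mode, assuming the segment number is
obtained by integer division of the LSN by the WAL segment size. The
names `segment_number` and `horizon_segno` below are illustrative
stand-ins, not the safekeeper's actual functions; the guard mirrors the
early return added in timeline.rs.

```rust
/// Hypothetical stand-in for the LSN-to-segment computation: integer
/// division of the LSN by the WAL segment size. Panics if `seg_size` is 0.
fn segment_number(lsn: u64, seg_size: usize) -> u64 {
    lsn / seg_size as u64
}

/// Hypothetical wrapper: a zero segment size means the size is not
/// initialized yet and no WAL exists, so there is nothing to compute.
fn horizon_segno(horizon_lsn: u64, wal_seg_size: usize) -> Option<u64> {
    if wal_seg_size == 0 {
        return None;
    }
    Some(segment_number(horizon_lsn, wal_seg_size))
}

fn main() {
    let seg_size: usize = 16 * 1024 * 1024; // 16 MiB, chosen for illustration

    // Uninitialized segment size: skipped instead of dividing by zero.
    assert_eq!(horizon_segno(0, 0), None);

    // An LSN 42 bytes into the fourth segment maps to segment number 3.
    assert_eq!(horizon_segno(3 * seg_size as u64 + 42, seg_size), Some(3));
}
```

Checking the size before computing the horizon keeps the division itself
unconditional, which is why the guard sits in the caller.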
--- safekeeper/src/safekeeper.rs | 4 +--- safekeeper/src/timeline.rs | 4 ++++ 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/safekeeper/src/safekeeper.rs b/safekeeper/src/safekeeper.rs index 67d41d0b58..68361fd672 100644 --- a/safekeeper/src/safekeeper.rs +++ b/safekeeper/src/safekeeper.rs @@ -938,9 +938,7 @@ where ), self.state.s3_wal_lsn, ); - let res = horizon_lsn.segment_number(self.state.server.wal_seg_size as usize); - info!("horizon is {}, res {}", horizon_lsn, res); - res + horizon_lsn.segment_number(self.state.server.wal_seg_size as usize) } } diff --git a/safekeeper/src/timeline.rs b/safekeeper/src/timeline.rs index 745d8e0893..47137091da 100644 --- a/safekeeper/src/timeline.rs +++ b/safekeeper/src/timeline.rs @@ -469,6 +469,10 @@ impl Timeline { let remover: Box Result<(), anyhow::Error>>; { let shared_state = self.mutex.lock().unwrap(); + // WAL seg size not initialized yet, no WAL exists. + if shared_state.sk.state.server.wal_seg_size == 0 { + return Ok(()); + } horizon_segno = shared_state.sk.get_horizon_segno(); remover = shared_state.sk.wal_store.remove_up_to(); if horizon_segno <= 1 || horizon_segno <= shared_state.last_removed_segno { From 0f3ec83172b38684c8ec74a33f8db4fb9a79df2f Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Tue, 3 May 2022 17:16:46 +0300 Subject: [PATCH 201/296] avoid detach with alive branches --- pageserver/src/layered_repository.rs | 15 ++++++++++++++- .../batch_others/test_ancestor_branch.py | 18 +++++++++++++++--- .../batch_others/test_tenant_relocation.py | 7 ++++--- 3 files changed, 33 insertions(+), 7 deletions(-) diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index e678c8f4cb..69271467a6 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -393,9 +393,22 @@ impl Repository for LayeredRepository { fn detach_timeline(&self, timeline_id: ZTimelineId) -> anyhow::Result<()> { let mut timelines = 
self.timelines.lock().unwrap(); + // check no child timelines, because detach will remove files, which will break child branches + // FIXME this can still be violated because we do not guarantee + // that all ancestors are downloaded/attached to the same pageserver + let num_children = timelines + .iter() + .filter(|(_, entry)| entry.ancestor_timeline_id() == Some(timeline_id)) + .count(); + + ensure!( + num_children == 0, + "Cannot detach timeline which has child timelines" + ); + ensure!( timelines.remove(&timeline_id).is_some(), - "cannot detach timeline {timeline_id} that is not available locally" + "Cannot detach timeline {timeline_id} that is not available locally" ); Ok(()) } diff --git a/test_runner/batch_others/test_ancestor_branch.py b/test_runner/batch_others/test_ancestor_branch.py index 75fe3cde0f..d6b073492d 100644 --- a/test_runner/batch_others/test_ancestor_branch.py +++ b/test_runner/batch_others/test_ancestor_branch.py @@ -1,11 +1,9 @@ -import subprocess -import asyncio from contextlib import closing import psycopg2.extras import pytest from fixtures.log_helper import log -from fixtures.zenith_fixtures import ZenithEnvBuilder +from fixtures.zenith_fixtures import ZenithEnv, ZenithEnvBuilder, ZenithPageserverApiException # @@ -120,3 +118,17 @@ def test_ancestor_branch(zenith_env_builder: ZenithEnvBuilder): branch2_cur.execute('SELECT count(*) FROM foo') assert branch2_cur.fetchone() == (300000, ) + + +def test_ancestor_branch_detach(zenith_simple_env: ZenithEnv): + env = zenith_simple_env + + parent_timeline_id = env.zenith_cli.create_branch("test_ancestor_branch_detach_parent", "empty") + + env.zenith_cli.create_branch("test_ancestor_branch_detach_branch1", + "test_ancestor_branch_detach_parent") + + ps_http = env.pageserver.http_client() + with pytest.raises(ZenithPageserverApiException, + match="Failed to detach inmem tenant timeline"): + ps_http.timeline_detach(env.initial_tenant, parent_timeline_id) diff --git
a/test_runner/batch_others/test_tenant_relocation.py b/test_runner/batch_others/test_tenant_relocation.py index 41907adf1a..7e71c0a157 100644 --- a/test_runner/batch_others/test_tenant_relocation.py +++ b/test_runner/batch_others/test_tenant_relocation.py @@ -109,10 +109,11 @@ def test_tenant_relocation(zenith_env_builder: ZenithEnvBuilder, tenant = env.zenith_cli.create_tenant(UUID("74ee8b079a0e437eb0afea7d26a07209")) log.info("tenant to relocate %s", tenant) - env.zenith_cli.create_root_branch('main', tenant_id=tenant) - env.zenith_cli.create_branch('test_tenant_relocation', tenant_id=tenant) - tenant_pg = env.postgres.create_start(branch_name='main', + # attach does not download ancestor branches (should it?), just use root branch for now + env.zenith_cli.create_root_branch('test_tenant_relocation', tenant_id=tenant) + + tenant_pg = env.postgres.create_start(branch_name='test_tenant_relocation', node_name='test_tenant_relocation', tenant_id=tenant) From ad5eaa6027166b41e6485c49c7ea496e7c6515f0 Mon Sep 17 00:00:00 2001 From: Thang Pham Date: Thu, 5 May 2022 10:53:10 -0400 Subject: [PATCH 202/296] Use node's LSN for read-only nodes (#1642) Fixes #1410. --- zenith/src/main.rs | 22 +++++++++++++++------- 1 file changed, 15 insertions(+), 7 deletions(-) diff --git a/zenith/src/main.rs b/zenith/src/main.rs index ff2beec463..87bb5f3f60 100644 --- a/zenith/src/main.rs +++ b/zenith/src/main.rs @@ -683,13 +683,21 @@ fn handle_pg(pg_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> { .iter() .filter(|((node_tenant_id, _), _)| node_tenant_id == &tenant_id) { - // FIXME: This shows the LSN at the end of the timeline. It's not the - // right thing to do for read-only nodes that might be anchored at an - // older point in time, or following but lagging behind the primary. 
- let lsn_str = timeline_infos - .get(&node.timeline_id) - .and_then(|bi| bi.local.as_ref().map(|l| l.last_record_lsn.to_string())) - .unwrap_or_else(|| "?".to_string()); + let lsn_str = match node.lsn { + None => { + // -> primary node + // Use the LSN at the end of the timeline. + timeline_infos + .get(&node.timeline_id) + .and_then(|bi| bi.local.as_ref().map(|l| l.last_record_lsn.to_string())) + .unwrap_or_else(|| "?".to_string()) + } + Some(lsn) => { + // -> read-only node + // Use the node's LSN. + lsn.to_string() + } + }; let branch_name = timeline_name_mappings .get(&ZTenantTimelineId::new(tenant_id, node.timeline_id)) From 52a7e3155e3fd6132794f53623698e12e403f711 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Wed, 4 May 2022 14:53:18 +0300 Subject: [PATCH 203/296] Add local path to the Layer trait and historic layers --- pageserver/src/http/routes.rs | 6 +- pageserver/src/layered_repository.rs | 122 ++++++++++++++---- .../src/layered_repository/delta_layer.rs | 4 + .../src/layered_repository/image_layer.rs | 4 + .../src/layered_repository/inmemory_layer.rs | 4 + .../src/layered_repository/layer_map.rs | 2 +- .../src/layered_repository/storage_layer.rs | 3 + pageserver/src/remote_storage.rs | 8 +- pageserver/src/remote_storage/storage_sync.rs | 18 ++- 9 files changed, 131 insertions(+), 40 deletions(-) diff --git a/pageserver/src/http/routes.rs b/pageserver/src/http/routes.rs index 5903dea372..f12e4c4051 100644 --- a/pageserver/src/http/routes.rs +++ b/pageserver/src/http/routes.rs @@ -11,7 +11,7 @@ use super::models::{ }; use crate::config::RemoteStorageKind; use crate::remote_storage::{ - download_index_part, schedule_timeline_download, LocalFs, RemoteIndex, RemoteTimeline, S3Bucket, + download_index_part, schedule_layer_download, LocalFs, RemoteIndex, RemoteTimeline, S3Bucket, }; use crate::repository::Repository; use crate::tenant_config::TenantConfOpt; @@ -273,7 +273,7 @@ async fn timeline_attach_handler(request: Request) -> Result) -> Result 
index_accessor.add_timeline_entry(sync_id, new_timeline), } - schedule_timeline_download(tenant_id, timeline_id); + schedule_layer_download(tenant_id, timeline_id); json_response(StatusCode::ACCEPTED, ()) } diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 69271467a6..6719c22738 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -20,8 +20,8 @@ use tracing::*; use std::cmp::{max, min, Ordering}; use std::collections::hash_map::Entry; -use std::collections::BTreeSet; use std::collections::HashMap; +use std::collections::{BTreeSet, HashSet}; use std::fs; use std::fs::{File, OpenOptions}; use std::io::Write; @@ -37,7 +37,7 @@ use crate::keyspace::KeySpace; use crate::tenant_config::{TenantConf, TenantConfOpt}; use crate::page_cache; -use crate::remote_storage::{schedule_timeline_checkpoint_upload, RemoteIndex}; +use crate::remote_storage::{self, RemoteIndex}; use crate::repository::{ GcResult, Repository, RepositoryTimeline, Timeline, TimelineSyncStatusUpdate, TimelineWriter, }; @@ -428,7 +428,7 @@ impl Repository for LayeredRepository { Entry::Occupied(_) => bail!("We completed a download for a timeline that already exists in repository. 
This is a bug."), Entry::Vacant(entry) => { // we need to get metadata of a timeline, another option is to pass it along with Downloaded status - let metadata = Self::load_metadata(self.conf, timeline_id, self.tenant_id).context("failed to load local metadata")?; + let metadata = load_metadata(self.conf, timeline_id, self.tenant_id).context("failed to load local metadata")?; // finally we make newly downloaded timeline visible to repository entry.insert(LayeredTimelineEntry::Unloaded { id: timeline_id, metadata, }) }, @@ -618,7 +618,7 @@ impl LayeredRepository { timelineid: ZTimelineId, timelines: &mut HashMap, ) -> anyhow::Result> { - let metadata = Self::load_metadata(self.conf, timelineid, self.tenant_id) + let metadata = load_metadata(self.conf, timelineid, self.tenant_id) .context("failed to load metadata")?; let disk_consistent_lsn = metadata.disk_consistent_lsn(); @@ -776,17 +776,6 @@ impl LayeredRepository { Ok(()) } - fn load_metadata( - conf: &'static PageServerConf, - timelineid: ZTimelineId, - tenantid: ZTenantId, - ) -> Result { - let path = metadata_path(conf, timelineid, tenantid); - info!("loading metadata from {}", path.display()); - let metadata_bytes = std::fs::read(&path)?; - TimelineMetadata::from_bytes(&metadata_bytes) - } - // // How garbage collection works: // @@ -1796,10 +1785,10 @@ impl LayeredTimeline { PERSISTENT_BYTES_WRITTEN.inc_by(new_delta_path.metadata()?.len()); if self.upload_layers.load(atomic::Ordering::Relaxed) { - schedule_timeline_checkpoint_upload( + remote_storage::schedule_layer_upload( self.tenantid, self.timelineid, - new_delta_path, + HashSet::from([new_delta_path]), metadata, ); } @@ -1860,11 +1849,23 @@ impl LayeredTimeline { let timer = self.create_images_time_histo.start_timer(); // 2. Create new image layers for partitions that have been modified // "enough". 
+ let mut layer_paths_to_upload = HashSet::with_capacity(partitioning.parts.len()); for part in partitioning.parts.iter() { if self.time_for_new_image_layer(part, lsn)? { - self.create_image_layer(part, lsn)?; + let new_path = self.create_image_layer(part, lsn)?; + layer_paths_to_upload.insert(new_path); } } + if self.upload_layers.load(atomic::Ordering::Relaxed) { + let metadata = load_metadata(self.conf, self.timelineid, self.tenantid) + .context("failed to load local metadata")?; + remote_storage::schedule_layer_upload( + self.tenantid, + self.timelineid, + layer_paths_to_upload, + metadata, + ); + } timer.stop_and_record(); // 3. Compact @@ -1906,7 +1907,7 @@ impl LayeredTimeline { Ok(false) } - fn create_image_layer(&self, partition: &KeySpace, lsn: Lsn) -> Result<()> { + fn create_image_layer(&self, partition: &KeySpace, lsn: Lsn) -> anyhow::Result { let img_range = partition.ranges.first().unwrap().start..partition.ranges.last().unwrap().end; let mut image_layer_writer = @@ -1939,10 +1940,11 @@ impl LayeredTimeline { // FIXME: Do we need to do something to upload it to remote storage here? 
let mut layers = self.layers.write().unwrap(); + let new_path = image_layer.path(); layers.insert_historic(Arc::new(image_layer)); drop(layers); - Ok(()) + Ok(new_path) } fn compact_level0(&self, target_file_size: u64) -> Result<()> { @@ -2037,18 +2039,43 @@ impl LayeredTimeline { } let mut layers = self.layers.write().unwrap(); + let mut new_layer_paths = HashSet::with_capacity(new_layers.len()); for l in new_layers { + new_layer_paths.insert(l.path()); layers.insert_historic(Arc::new(l)); } + if self.upload_layers.load(atomic::Ordering::Relaxed) { + let metadata = load_metadata(self.conf, self.timelineid, self.tenantid) + .context("failed to load local metadata")?; + remote_storage::schedule_layer_upload( + self.tenantid, + self.timelineid, + new_layer_paths, + metadata, + ); + } + // Now that we have reshuffled the data to set of new delta layers, we can // delete the old ones + let mut layer_paths_do_delete = HashSet::with_capacity(level0_deltas.len()); for l in level0_deltas { l.delete()?; - layers.remove_historic(l.clone()); + if let Some(path) = l.local_path() { + layer_paths_do_delete.insert(path); + } + layers.remove_historic(l); } drop(layers); + if self.upload_layers.load(atomic::Ordering::Relaxed) { + remote_storage::schedule_layer_delete( + self.tenantid, + self.timelineid, + layer_paths_do_delete, + ); + } + Ok(()) } @@ -2111,7 +2138,7 @@ impl LayeredTimeline { debug!("retain_lsns: {:?}", retain_lsns); - let mut layers_to_remove: Vec> = Vec::new(); + let mut layers_to_remove = Vec::new(); // Scan all on-disk layers in the timeline. // @@ -2222,13 +2249,24 @@ impl LayeredTimeline { // Actually delete the layers from disk and remove them from the map. // (couldn't do this in the loop above, because you cannot modify a collection // while iterating it. 
BTreeMap::retain() would be another option) + let mut layer_paths_to_delete = HashSet::with_capacity(layers_to_remove.len()); for doomed_layer in layers_to_remove { doomed_layer.delete()?; - layers.remove_historic(doomed_layer.clone()); - + if let Some(path) = doomed_layer.local_path() { + layer_paths_to_delete.insert(path); + } + layers.remove_historic(doomed_layer); result.layers_removed += 1; } + if self.upload_layers.load(atomic::Ordering::Relaxed) { + remote_storage::schedule_layer_delete( + self.tenantid, + self.timelineid, + layer_paths_to_delete, + ); + } + result.elapsed = now.elapsed()?; Ok(result) } @@ -2375,6 +2413,26 @@ fn rename_to_backup(path: PathBuf) -> anyhow::Result<()> { bail!("couldn't find an unused backup number for {:?}", path) } +fn load_metadata( + conf: &'static PageServerConf, + timeline_id: ZTimelineId, + tenant_id: ZTenantId, +) -> anyhow::Result { + let metadata_path = metadata_path(conf, timeline_id, tenant_id); + let metadata_bytes = std::fs::read(&metadata_path).with_context(|| { + format!( + "Failed to read metadata bytes from path {}", + metadata_path.display() + ) + })?; + TimelineMetadata::from_bytes(&metadata_bytes).with_context(|| { + format!( + "Failed to parse metadata bytes from path {}", + metadata_path.display() + ) + }) +} + /// /// Tests that are specific to the layered storage format. 
/// @@ -2409,9 +2467,19 @@ pub mod tests { let err = harness.try_load().err().expect("should fail"); assert_eq!(err.to_string(), "failed to load local metadata"); - assert_eq!( - err.source().unwrap().to_string(), - "metadata checksum mismatch" + + let mut found_error_message = false; + let mut err_source = err.source(); + while let Some(source) = err_source { + if source.to_string() == "metadata checksum mismatch" { + found_error_message = true; + break; + } + err_source = source.source(); + } + assert!( + found_error_message, + "didn't find the corrupted metadata error" ); Ok(()) diff --git a/pageserver/src/layered_repository/delta_layer.rs b/pageserver/src/layered_repository/delta_layer.rs index 1e1ec716a6..e78b05695c 100644 --- a/pageserver/src/layered_repository/delta_layer.rs +++ b/pageserver/src/layered_repository/delta_layer.rs @@ -218,6 +218,10 @@ impl Layer for DeltaLayer { PathBuf::from(self.layer_name().to_string()) } + fn local_path(&self) -> Option { + Some(self.path()) + } + fn get_value_reconstruct_data( &self, key: Key, diff --git a/pageserver/src/layered_repository/image_layer.rs b/pageserver/src/layered_repository/image_layer.rs index d7657ecac6..c0c8e7789a 100644 --- a/pageserver/src/layered_repository/image_layer.rs +++ b/pageserver/src/layered_repository/image_layer.rs @@ -125,6 +125,10 @@ impl Layer for ImageLayer { PathBuf::from(self.layer_name().to_string()) } + fn local_path(&self) -> Option { + Some(self.path()) + } + fn get_tenant_id(&self) -> ZTenantId { self.tenantid } diff --git a/pageserver/src/layered_repository/inmemory_layer.rs b/pageserver/src/layered_repository/inmemory_layer.rs index 856baa2e8a..bffb946f7e 100644 --- a/pageserver/src/layered_repository/inmemory_layer.rs +++ b/pageserver/src/layered_repository/inmemory_layer.rs @@ -85,6 +85,10 @@ impl Layer for InMemoryLayer { )) } + fn local_path(&self) -> Option { + None + } + fn get_tenant_id(&self) -> ZTenantId { self.tenantid } diff --git 
a/pageserver/src/layered_repository/layer_map.rs b/pageserver/src/layered_repository/layer_map.rs index 91a900dde0..7a2d0d5bcd 100644 --- a/pageserver/src/layered_repository/layer_map.rs +++ b/pageserver/src/layered_repository/layer_map.rs @@ -253,7 +253,7 @@ impl LayerMap { } } - pub fn iter_historic_layers(&self) -> std::slice::Iter> { + pub fn iter_historic_layers(&self) -> impl Iterator> { self.historic_layers.iter() } diff --git a/pageserver/src/layered_repository/storage_layer.rs b/pageserver/src/layered_repository/storage_layer.rs index aad631c5c4..9fcc8907d3 100644 --- a/pageserver/src/layered_repository/storage_layer.rs +++ b/pageserver/src/layered_repository/storage_layer.rs @@ -105,6 +105,9 @@ pub trait Layer: Send + Sync { /// log messages, even though they're never not on disk.) fn filename(&self) -> PathBuf; + /// If a layer has a corresponding file on a local filesystem, return its path. + fn local_path(&self) -> Option; + /// /// Return data needed to reconstruct given page at LSN. /// diff --git a/pageserver/src/remote_storage.rs b/pageserver/src/remote_storage.rs index cfa09dce14..4db0f6667d 100644 --- a/pageserver/src/remote_storage.rs +++ b/pageserver/src/remote_storage.rs @@ -14,7 +14,7 @@ //! //! * public API via to interact with the external world: //! * [`start_local_timeline_sync`] to launch a background async loop to handle the synchronization -//! * [`schedule_timeline_checkpoint_upload`] and [`schedule_timeline_download`] to enqueue a new upload and download tasks, +//! * [`schedule_layer_upload`], [`schedule_layer_download`] and [`schedule_layer_delete`] to enqueue a new upload and download tasks, //! to be processed by the async loop //! //! Here's a schematic overview of all interactions backup and the rest of the pageserver perform: @@ -71,10 +71,10 @@ //! when the newer image is downloaded //! //! 
Pageserver maintains similar to the local file structure remotely: all layer files are uploaded with the same names under the same directory structure. -//! Yet instead of keeping the `metadata` file remotely, we wrap it with more data in [`IndexShard`], containing the list of remote files. +//! Yet instead of keeping the `metadata` file remotely, we wrap it with more data in [`IndexPart`], containing the list of remote files. //! This file gets read to populate the cache, if the remote timeline data is missing from it and gets updated after every successful download. //! This way, we optimize S3 storage access by not running the `S3 list` command that could be expencive and slow: knowing both [`ZTenantId`] and [`ZTimelineId`], -//! we can always reconstruct the path to the timeline, use this to get the same path on the remote storage and retrive its shard contents, if needed, same as any layer files. +//! we can always reconstruct the path to the timeline, use this to get the same path on the remote storage and retrieve its part contents, if needed, same as any layer files. //! //! By default, pageserver reads the remote storage index data only for timelines located locally, to synchronize those, if needed. //! Bulk index data download happens only initially, on pageserer startup.
The rest of the remote storage stays unknown to pageserver and loaded on demand only, @@ -108,7 +108,7 @@ pub use self::{ storage_sync::{ download_index_part, index::{IndexPart, RemoteIndex, RemoteTimeline}, - schedule_timeline_checkpoint_upload, schedule_timeline_download, + schedule_layer_delete, schedule_layer_download, schedule_layer_upload, }, }; use crate::{ diff --git a/pageserver/src/remote_storage/storage_sync.rs b/pageserver/src/remote_storage/storage_sync.rs index 2d3416cd32..127655ce87 100644 --- a/pageserver/src/remote_storage/storage_sync.rs +++ b/pageserver/src/remote_storage/storage_sync.rs @@ -427,10 +427,10 @@ pub struct TimelineDownload { /// On task failure, it gets retried again from the start a number of times. /// /// Ensure that the loop is started otherwise the task is never processed. -pub fn schedule_timeline_checkpoint_upload( +pub fn schedule_layer_upload( tenant_id: ZTenantId, timeline_id: ZTimelineId, - new_layer: PathBuf, + layers_to_upload: HashSet, metadata: TimelineMetadata, ) { if !sync_queue::push( @@ -439,7 +439,7 @@ pub fn schedule_timeline_checkpoint_upload( timeline_id, }, SyncTask::upload(TimelineUpload { - layers_to_upload: HashSet::from([new_layer]), + layers_to_upload, uploaded_layers: HashSet::new(), metadata, }), @@ -450,6 +450,14 @@ pub fn schedule_timeline_checkpoint_upload( } } +pub fn schedule_layer_delete( + _tenant_id: ZTenantId, + _timeline_id: ZTimelineId, + _layers_to_delete: HashSet, +) { + // TODO kb implement later +} + /// Requests the download of the entire timeline for a given tenant. /// No existing local files are currently overwritten, except the metadata file (if its disk_consistent_lsn is less than the downloaded one). /// The metadata file is always updated last, to avoid inconsistencies. @@ -457,8 +465,8 @@ pub fn schedule_timeline_checkpoint_upload( /// On any failure, the task gets retried, omitting already downloaded layers. 
/// /// Ensure that the loop is started otherwise the task is never processed. -pub fn schedule_timeline_download(tenant_id: ZTenantId, timeline_id: ZTimelineId) { - debug!("Scheduling timeline download for tenant {tenant_id}, timeline {timeline_id}"); +pub fn schedule_layer_download(tenant_id: ZTenantId, timeline_id: ZTimelineId) { + debug!("Scheduling layer download for tenant {tenant_id}, timeline {timeline_id}"); sync_queue::push( ZTenantTimelineId { tenant_id, From 2ef0e5c6edbec21b52cf27e5ebf3fb6241918319 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Wed, 4 May 2022 22:33:53 +0300 Subject: [PATCH 204/296] Do not require metadata in every upload sync task --- pageserver/src/layered_repository.rs | 23 ++-- .../src/layered_repository/storage_layer.rs | 2 +- pageserver/src/remote_storage/storage_sync.rs | 113 +++++++++++------- .../src/remote_storage/storage_sync/upload.rs | 18 ++- 4 files changed, 91 insertions(+), 65 deletions(-) diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 6719c22738..77c01a7c66 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -1789,7 +1789,7 @@ impl LayeredTimeline { self.tenantid, self.timelineid, HashSet::from([new_delta_path]), - metadata, + Some(metadata), ); } @@ -1857,13 +1857,11 @@ impl LayeredTimeline { } } if self.upload_layers.load(atomic::Ordering::Relaxed) { - let metadata = load_metadata(self.conf, self.timelineid, self.tenantid) - .context("failed to load local metadata")?; remote_storage::schedule_layer_upload( self.tenantid, self.timelineid, layer_paths_to_upload, - metadata, + None, ); } timer.stop_and_record(); @@ -2045,17 +2043,6 @@ impl LayeredTimeline { layers.insert_historic(Arc::new(l)); } - if self.upload_layers.load(atomic::Ordering::Relaxed) { - let metadata = load_metadata(self.conf, self.timelineid, self.tenantid) - .context("failed to load local metadata")?; - remote_storage::schedule_layer_upload( - 
self.tenantid, - self.timelineid, - new_layer_paths, - metadata, - ); - } - // Now that we have reshuffled the data to set of new delta layers, we can // delete the old ones let mut layer_paths_do_delete = HashSet::with_capacity(level0_deltas.len()); @@ -2069,6 +2056,12 @@ impl LayeredTimeline { drop(layers); if self.upload_layers.load(atomic::Ordering::Relaxed) { + remote_storage::schedule_layer_upload( + self.tenantid, + self.timelineid, + new_layer_paths, + None, + ); remote_storage::schedule_layer_delete( self.tenantid, self.timelineid, diff --git a/pageserver/src/layered_repository/storage_layer.rs b/pageserver/src/layered_repository/storage_layer.rs index 9fcc8907d3..aaf765b83d 100644 --- a/pageserver/src/layered_repository/storage_layer.rs +++ b/pageserver/src/layered_repository/storage_layer.rs @@ -105,7 +105,7 @@ pub trait Layer: Send + Sync { /// log messages, even though they're never not on disk.) fn filename(&self) -> PathBuf; - /// If a layer has a corresponding file on a local filesystem, return its path. + /// If a layer has a corresponding file on a local filesystem, return its absolute path. 
fn local_path(&self) -> Option; /// diff --git a/pageserver/src/remote_storage/storage_sync.rs b/pageserver/src/remote_storage/storage_sync.rs index 127655ce87..8a26685a7d 100644 --- a/pageserver/src/remote_storage/storage_sync.rs +++ b/pageserver/src/remote_storage/storage_sync.rs @@ -72,7 +72,7 @@ use std::{ sync::Arc, }; -use anyhow::Context; +use anyhow::{bail, Context}; use futures::stream::{FuturesUnordered, StreamExt}; use lazy_static::lazy_static; use tokio::{ @@ -341,8 +341,16 @@ impl SyncTask { .extend(new_upload_data.data.uploaded_layers.into_iter()); upload_data.retries = 0; - if new_upload_data.data.metadata.disk_consistent_lsn() - > upload_data.data.metadata.disk_consistent_lsn() + if new_upload_data + .data + .metadata + .as_ref() + .map(|meta| meta.disk_consistent_lsn()) + > upload_data + .data + .metadata + .as_ref() + .map(|meta| meta.disk_consistent_lsn()) { upload_data.data.metadata = new_upload_data.data.metadata; } @@ -371,8 +379,16 @@ impl SyncTask { .extend(new_upload_data.data.uploaded_layers.into_iter()); upload_data.retries = 0; - if new_upload_data.data.metadata.disk_consistent_lsn() - > upload_data.data.metadata.disk_consistent_lsn() + if new_upload_data + .data + .metadata + .as_ref() + .map(|meta| meta.disk_consistent_lsn()) + > upload_data + .data + .metadata + .as_ref() + .map(|meta| meta.disk_consistent_lsn()) { upload_data.data.metadata = new_upload_data.data.metadata; } @@ -410,7 +426,7 @@ pub struct TimelineUpload { /// Already uploaded layers. Used to store the data about the uploads between task retries /// and to record the data into the remote index after the task got completed or evicted. uploaded_layers: HashSet, - metadata: TimelineMetadata, + metadata: Option, } /// A timeline download task. 
@@ -431,7 +447,7 @@ pub fn schedule_layer_upload( tenant_id: ZTenantId, timeline_id: ZTimelineId, layers_to_upload: HashSet, - metadata: TimelineMetadata, + metadata: Option, ) { if !sync_queue::push( ZTenantTimelineId { @@ -932,23 +948,24 @@ async fn upload_timeline( } UploadedTimeline::Successful(upload_data) => upload_data, UploadedTimeline::SuccessfulAfterLocalFsUpdate(mut outdated_upload_data) => { - let local_metadata_path = - metadata_path(conf, sync_id.timeline_id, sync_id.tenant_id); - let local_metadata = match read_metadata_file(&local_metadata_path).await { - Ok(metadata) => metadata, - Err(e) => { - error!( - "Failed to load local metadata from path '{}': {e:?}", - local_metadata_path.display() - ); - outdated_upload_data.retries += 1; - sync_queue::push(sync_id, SyncTask::Upload(outdated_upload_data)); - register_sync_status(sync_start, task_name, Some(false)); - return; - } - }; - - outdated_upload_data.data.metadata = local_metadata; + if outdated_upload_data.data.metadata.is_some() { + let local_metadata_path = + metadata_path(conf, sync_id.timeline_id, sync_id.tenant_id); + let local_metadata = match read_metadata_file(&local_metadata_path).await { + Ok(metadata) => metadata, + Err(e) => { + error!( + "Failed to load local metadata from path '{}': {e:?}", + local_metadata_path.display() + ); + outdated_upload_data.retries += 1; + sync_queue::push(sync_id, SyncTask::Upload(outdated_upload_data)); + register_sync_status(sync_start, task_name, Some(false)); + return; + } + }; + outdated_upload_data.data.metadata = Some(local_metadata); + } outdated_upload_data } }; @@ -982,11 +999,14 @@ where match index_accessor.timeline_entry_mut(&sync_id) { Some(existing_entry) => { - if existing_entry.metadata.disk_consistent_lsn() - < uploaded_data.metadata.disk_consistent_lsn() - { - existing_entry.metadata = uploaded_data.metadata.clone(); + if let Some(new_metadata) = uploaded_data.metadata.as_ref() { + if existing_entry.metadata.disk_consistent_lsn() + < 
new_metadata.disk_consistent_lsn() + { + existing_entry.metadata = new_metadata.clone(); + } } + if upload_failed { existing_entry .add_upload_failures(uploaded_data.layers_to_upload.iter().cloned()); @@ -997,7 +1017,11 @@ where existing_entry.clone() } None => { - let mut new_remote_timeline = RemoteTimeline::new(uploaded_data.metadata.clone()); + let new_metadata = match uploaded_data.metadata.as_ref() { + Some(new_metadata) => new_metadata, + None => bail!("For timeline {sync_id} upload, there's no upload metadata and no remote index entry, cannot create a new one"), + }; + let mut new_remote_timeline = RemoteTimeline::new(new_metadata.clone()); if upload_failed { new_remote_timeline .add_upload_failures(uploaded_data.layers_to_upload.iter().cloned()); @@ -1140,7 +1164,7 @@ fn schedule_first_sync_tasks( SyncTask::upload(TimelineUpload { layers_to_upload: local_files, uploaded_layers: HashSet::new(), - metadata: local_metadata, + metadata: Some(local_metadata), }), )); local_timeline_init_statuses @@ -1202,7 +1226,7 @@ fn compare_local_and_remote_timeline( SyncTask::upload(TimelineUpload { layers_to_upload, uploaded_layers: HashSet::new(), - metadata: local_metadata, + metadata: Some(local_metadata), }), )); // Note that status here doesn't change. 
@@ -1269,7 +1293,7 @@ mod test_utils { Ok(TimelineUpload { layers_to_upload, uploaded_layers: HashSet::new(), - metadata, + metadata: Some(metadata), }) } @@ -1340,7 +1364,7 @@ mod tests { TimelineUpload { layers_to_upload: HashSet::from([PathBuf::from("one")]), uploaded_layers: HashSet::from([PathBuf::from("u_one")]), - metadata: metadata_1, + metadata: Some(metadata_1), }, )); let upload_2 = SyncTask::Upload(SyncData::new( @@ -1348,7 +1372,7 @@ mod tests { TimelineUpload { layers_to_upload: HashSet::from([PathBuf::from("two"), PathBuf::from("three")]), uploaded_layers: HashSet::from([PathBuf::from("u_two")]), - metadata: metadata_2.clone(), + metadata: Some(metadata_2.clone()), }, )); @@ -1380,7 +1404,8 @@ mod tests { ); assert_eq!( - upload.metadata, metadata_2, + upload.metadata, + Some(metadata_2), "Merged upload tasks should have a metadata with biggest disk_consistent_lsn" ); } @@ -1399,7 +1424,7 @@ mod tests { TimelineUpload { layers_to_upload: HashSet::from([PathBuf::from("u_one")]), uploaded_layers: HashSet::from([PathBuf::from("u_one_2")]), - metadata: dummy_metadata(Lsn(1)), + metadata: Some(dummy_metadata(Lsn(1))), }, ); @@ -1442,7 +1467,7 @@ mod tests { TimelineUpload { layers_to_upload: HashSet::from([PathBuf::from("one")]), uploaded_layers: HashSet::from([PathBuf::from("u_one")]), - metadata: metadata_1.clone(), + metadata: Some(metadata_1.clone()), }, ), ); @@ -1452,7 +1477,7 @@ mod tests { TimelineUpload { layers_to_upload: HashSet::from([PathBuf::from("two"), PathBuf::from("three")]), uploaded_layers: HashSet::from([PathBuf::from("u_two")]), - metadata: metadata_2, + metadata: Some(metadata_2), }, )); @@ -1490,7 +1515,8 @@ mod tests { ); assert_eq!( - upload.metadata, metadata_1, + upload.metadata, + Some(metadata_1), "Merged upload tasks should have a metadata with biggest disk_consistent_lsn" ); } @@ -1502,7 +1528,7 @@ mod tests { TimelineUpload { layers_to_upload: HashSet::from([PathBuf::from("one")]), uploaded_layers: 
HashSet::from([PathBuf::from("u_one")]), - metadata: dummy_metadata(Lsn(22)), + metadata: Some(dummy_metadata(Lsn(22))), }, ); @@ -1572,7 +1598,7 @@ mod tests { TimelineUpload { layers_to_upload: HashSet::from([PathBuf::from("one")]), uploaded_layers: HashSet::from([PathBuf::from("u_one")]), - metadata: metadata_1, + metadata: Some(metadata_1), }, ), ); @@ -1588,7 +1614,7 @@ mod tests { TimelineUpload { layers_to_upload: HashSet::from([PathBuf::from("two"), PathBuf::from("three")]), uploaded_layers: HashSet::from([PathBuf::from("u_two")]), - metadata: metadata_2.clone(), + metadata: Some(metadata_2.clone()), }, ), ); @@ -1640,7 +1666,8 @@ mod tests { ); assert_eq!( - upload.metadata, metadata_2, + upload.metadata, + Some(metadata_2), "Merged upload tasks should have a metadata with biggest disk_consistent_lsn" ); } diff --git a/pageserver/src/remote_storage/storage_sync/upload.rs b/pageserver/src/remote_storage/storage_sync/upload.rs index d2ff77e92e..91a0a0d6ce 100644 --- a/pageserver/src/remote_storage/storage_sync/upload.rs +++ b/pageserver/src/remote_storage/storage_sync/upload.rs @@ -86,7 +86,10 @@ where S: RemoteStorage + Send + Sync + 'static, { let upload = &mut upload_data.data; - let new_upload_lsn = upload.metadata.disk_consistent_lsn(); + let new_upload_lsn = upload + .metadata + .as_ref() + .map(|meta| meta.disk_consistent_lsn()); let already_uploaded_layers = remote_timeline .map(|timeline| timeline.stored_files()) @@ -101,7 +104,7 @@ where debug!("Layers to upload: {layers_to_upload:?}"); info!( - "Uploading {} timeline layers, new lsn: {new_upload_lsn}", + "Uploading {} timeline layers, new lsn: {new_upload_lsn:?}", layers_to_upload.len(), ); @@ -234,8 +237,10 @@ mod tests { let current_retries = 3; let metadata = dummy_metadata(Lsn(0x30)); let local_timeline_path = harness.timeline_path(&TIMELINE_ID); - let timeline_upload = + let mut timeline_upload = create_local_timeline(&harness, TIMELINE_ID, &layer_files, metadata.clone()).await?; + 
timeline_upload.metadata = None; + assert!( storage.list().await?.is_empty(), "Storage should be empty before any uploads are made" @@ -278,8 +283,8 @@ mod tests { "Successful upload should have all layers uploaded" ); assert_eq!( - upload.metadata, metadata, - "Successful upload should not chage its metadata" + upload.metadata, None, + "Successful upload without metadata should not have it returned either" ); let storage_files = storage.list().await?; @@ -367,7 +372,8 @@ mod tests { "Successful upload should have all layers uploaded" ); assert_eq!( - upload.metadata, metadata, + upload.metadata, + Some(metadata), "Successful upload should not chage its metadata" ); From 4024bfe73605ce5c0ff13f7c337a2543d3ec7158 Mon Sep 17 00:00:00 2001 From: Andrey Taranik Date: Thu, 5 May 2022 22:21:07 +0300 Subject: [PATCH 205/296] get_binaries script fix (#1638) * get_binaries script fix * minor improvment for get_binaries --- .circleci/ansible/get_binaries.sh | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/.circleci/ansible/get_binaries.sh b/.circleci/ansible/get_binaries.sh index a4b4372d9f..c613213a75 100755 --- a/.circleci/ansible/get_binaries.sh +++ b/.circleci/ansible/get_binaries.sh @@ -7,7 +7,7 @@ RELEASE=${RELEASE:-false} # look at docker hub for latest tag for neon docker image if [ "${RELEASE}" = "true" ]; then echo "search latest relase tag" - VERSION=$(curl -s https://registry.hub.docker.com/v1/repositories/neondatabase/neon/tags |jq -r -S '.[].name' | grep release | sed 's/release-//g' | tail -1) + VERSION=$(curl -s https://registry.hub.docker.com/v1/repositories/neondatabase/neon/tags |jq -r -S '.[].name' | grep release | sed 's/release-//g' | grep -E '^[0-9]+$' | sort -n | tail -1) if [ -z "${VERSION}" ]; then echo "no any docker tags found, exiting..." 
exit 1 @@ -16,7 +16,7 @@ if [ "${RELEASE}" = "true" ]; then fi else echo "search latest dev tag" - VERSION=$(curl -s https://registry.hub.docker.com/v1/repositories/neondatabase/neon/tags |jq -r -S '.[].name' | grep -v release | tail -1) + VERSION=$(curl -s https://registry.hub.docker.com/v1/repositories/neondatabase/neon/tags |jq -r -S '.[].name' | grep -E '^[0-9]+$' | sort -n | tail -1) if [ -z "${VERSION}" ]; then echo "no any docker tags found, exiting..." exit 1 From 954859f6c5648aa351b5c4a0b05b3db0f369a0ab Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Thu, 5 May 2022 13:15:53 +0300 Subject: [PATCH 206/296] add readme for performance tests with the current state of things --- test_runner/performance/README.md | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) create mode 100644 test_runner/performance/README.md diff --git a/test_runner/performance/README.md b/test_runner/performance/README.md new file mode 100644 index 0000000000..7812c73f0c --- /dev/null +++ b/test_runner/performance/README.md @@ -0,0 +1,23 @@ +# What performance tests do we have and how we run them + +Performanse tests are build using infrastructure of our usual python integration tests. + +## Tests that are run against local installation + +Most off the performance tests run against local installation. This causes some problems because safekeeper(s) and a pageserver share resources of one single host and one underlyinng disk. + +These tests are run in CI in the same environment as the usual integration tests. So environment may not yield comarable results because this is the machine that CI provider gives us. + +## Remote tests + +There are a few tests that marked with `pytest.mark.remote_cluster`. These tests do not use local installation and onnly need a connection string to run. So they can be used for every postgresql comatible database. Currenntly these tests are run against our staging daily. 
Staging is not an isolated environment, so it adds to possible noise due to activity of other clusters. + +## Noise + +All tests run only once. Usually to obtain more consistent performance numbers test is performed multiple times and then some statistics is applied to results, like min/max/avg/median etc. + +## Results collection + +Local tests results for main branch and results of daily performance tests are stored in neon cluster deployed in production environment and there is a grafana dashboard that visualizes the results. Here is the [dashboard](https://observer.zenith.tech/d/DGKBm9Jnz/perf-test-results?orgId=1). The main problem with it is the unavailability to point at particular commits though the data for that is available in the database. Needs some tweaking from someone who knows Grafana tricks. + +There is also an inconsistency in test naming. Test name should be the same across platforms and results can be differentiated by the platform field. But now platform is sometimes included in test name because of the way how parametrization works in pytest. Ie there is a platform switch in the dashboard with zenith-local-ci and zenith-staging variants. I e some tests under zenith-local-ci value for a platform switch are displayed as `Test test_runner/performance/test_bulk_insert.py::test_bulk_insert[vanilla]` and `Test test_runner/performance/test_bulk_insert.py::test_bulk_insert[zenith]` which is highly confusing. 
From 1ad5658d9cd044b15059bdfb3417b19d5c6c8008 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Thu, 5 May 2022 19:55:08 +0300 Subject: [PATCH 207/296] Fix typos --- test_runner/performance/README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/test_runner/performance/README.md b/test_runner/performance/README.md index 7812c73f0c..c2354a7e5b 100644 --- a/test_runner/performance/README.md +++ b/test_runner/performance/README.md @@ -1,20 +1,20 @@ # What performance tests do we have and how we run them -Performanse tests are build using infrastructure of our usual python integration tests. +Performance tests are built using infrastructure of our usual python integration tests. ## Tests that are run against local installation -Most off the performance tests run against local installation. This causes some problems because safekeeper(s) and a pageserver share resources of one single host and one underlyinng disk. +Most of the performance tests run against a local installation. This causes some problems because safekeeper(s) and the pageserver share resources of one single host and one underlying disk. These tests are run in CI in the same environment as the usual integration tests. So environment may not yield comarable results because this is the machine that CI provider gives us. ## Remote tests -There are a few tests that marked with `pytest.mark.remote_cluster`. These tests do not use local installation and onnly need a connection string to run. So they can be used for every postgresql comatible database. Currenntly these tests are run against our staging daily. Staging is not an isolated environment, so it adds to possible noise due to activity of other clusters. +There are a few tests that marked with `pytest.mark.remote_cluster`. These tests do not use local installation and only need a connection string to run. So they can be used for every postgresql compatible database. Currently these tests are run against our staging daily. 
Staging is not an isolated environment, so it adds to possible noise due to activity of other clusters. ## Noise -All tests run only once. Usually to obtain more consistent performance numbers test is performed multiple times and then some statistics is applied to results, like min/max/avg/median etc. +All tests run only once. Usually to obtain more consistent performance numbers test is performed multiple times and then some statistics is applied to the results, like min/max/avg/median etc. ## Results collection From 30a7598172e085cbe0687746ccc5d0cdbd460554 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Thu, 5 May 2022 20:04:54 +0300 Subject: [PATCH 208/296] Some copy-editing. --- test_runner/performance/README.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/test_runner/performance/README.md b/test_runner/performance/README.md index c2354a7e5b..776565b679 100644 --- a/test_runner/performance/README.md +++ b/test_runner/performance/README.md @@ -1,23 +1,23 @@ # What performance tests do we have and how we run them -Performance tests are built using infrastructure of our usual python integration tests. +Performance tests are built using the same infrastructure as our usual python integration tests. There are some extra fixtures that help to collect performance metrics, and to run tests against both vanilla PostgreSQL and Neon for comparison. ## Tests that are run against local installation -Most of the performance tests run against a local installation. This causes some problems because safekeeper(s) and the pageserver share resources of one single host and one underlying disk. +Most of the performance tests run against a local installation. This is not very representative of a production environment. Firstly, Postgres, safekeeper(s) and the pageserver have to share CPU and I/O resources, which can add noise to the results. Secondly, network overhead is eliminated. 
-These tests are run in CI in the same environment as the usual integration tests. So environment may not yield comarable results because this is the machine that CI provider gives us. +In the CI, the performance tests are run in the same environment as the other integration tests. We don't have control over the host that the CI runs on, so the environment may vary widely from one run to another, which makes the results across different runs noisy to compare. ## Remote tests -There are a few tests that marked with `pytest.mark.remote_cluster`. These tests do not use local installation and only need a connection string to run. So they can be used for every postgresql compatible database. Currently these tests are run against our staging daily. Staging is not an isolated environment, so it adds to possible noise due to activity of other clusters. +There are a few tests that marked with `pytest.mark.remote_cluster`. These tests do not set up a local environment, and instead require a libpq connection string to connect to. So they can be run on any Postgres compatible database. Currently, the CI runs these tests our staging environment daily. Staging is not an isolated environment, so there can be noise in the results due to activity of other clusters. ## Noise -All tests run only once. Usually to obtain more consistent performance numbers test is performed multiple times and then some statistics is applied to the results, like min/max/avg/median etc. +All tests run only once. Usually to obtain more consistent performance numbers, a test should be repeated multiple times and the results be aggregated, for example by taking min, max, avg, or median. ## Results collection -Local tests results for main branch and results of daily performance tests are stored in neon cluster deployed in production environment and there is a grafana dashboard that visualizes the results. Here is the [dashboard](https://observer.zenith.tech/d/DGKBm9Jnz/perf-test-results?orgId=1). 
The main problem with it is the unavailability to point at particular commits though the data for that is available in the database. Needs some tweaking from someone who knows Grafana tricks. +Local test results for main branch, and results of daily performance tests, are stored in a neon project deployed in production environment. There is a Grafana dashboard that visualizes the results. Here is the [dashboard](https://observer.zenith.tech/d/DGKBm9Jnz/perf-test-results?orgId=1). The main problem with it is the unavailability to point at particular commit, though the data for that is available in the database. Needs some tweaking from someone who knows Grafana tricks. -There is also an inconsistency in test naming. Test name should be the same across platforms and results can be differentiated by the platform field. But now platform is sometimes included in test name because of the way how parametrization works in pytest. Ie there is a platform switch in the dashboard with zenith-local-ci and zenith-staging variants. I e some tests under zenith-local-ci value for a platform switch are displayed as `Test test_runner/performance/test_bulk_insert.py::test_bulk_insert[vanilla]` and `Test test_runner/performance/test_bulk_insert.py::test_bulk_insert[zenith]` which is highly confusing. +There is also an inconsistency in test naming. Test name should be the same across platforms, and results can be differentiated by the platform field. But currently, platform is sometimes included in test name because of the way how parametrization works in pytest. I.e. there is a platform switch in the dashboard with zenith-local-ci and zenith-staging variants. I.e. some tests under zenith-local-ci value for a platform switch are displayed as `Test test_runner/performance/test_bulk_insert.py::test_bulk_insert[vanilla]` and `Test test_runner/performance/test_bulk_insert.py::test_bulk_insert[zenith]` which is highly confusing. 
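Note on the "Noise" section of the README above: it says a test should be repeated several times and the results aggregated. Below is a minimal sketch of that idea, not part of the patch and not the test runner's actual fixtures. The names `measure_repeated` and `summarize` are hypothetical and use only the Python standard library.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_repeated(workload: Callable[[], None], repeats: int = 5) -> List[float]:
    """Run `workload` `repeats` times and return each wall-clock duration in seconds.

    Hypothetical helper: the real performance tests currently run each test once.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Aggregate raw durations into the statistics the README lists."""
    return {
        "min": min(durations),
        "max": max(durations),
        "avg": statistics.mean(durations),
        "median": statistics.median(durations),
    }


if __name__ == "__main__":
    # Stand-in workload; a real test would run a query or a bulk insert here.
    samples = measure_repeated(lambda: sum(range(100_000)), repeats=5)
    print(summarize(samples))
```

The median is usually the most robust of these on a shared CI host, because a single slow run caused by a noisy neighbour shifts the average and the maximum but barely moves the median.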
From 11a44eda0ecd4d41757e88df1d5fe3e3ecc73114 Mon Sep 17 00:00:00 2001 From: Sergey Melnikov Date: Thu, 5 May 2022 23:48:16 +0300 Subject: [PATCH 209/296] Add TLS support in scram-proxy (#1643) * Add TLS support in scram-proxy * Fix authEndpoint --- .circleci/helm-values/staging.proxy-scram.yaml | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/.circleci/helm-values/staging.proxy-scram.yaml b/.circleci/helm-values/staging.proxy-scram.yaml index f72a9d4557..91422e754a 100644 --- a/.circleci/helm-values/staging.proxy-scram.yaml +++ b/.circleci/helm-values/staging.proxy-scram.yaml @@ -6,7 +6,8 @@ image: settings: authBackend: "console" - authEndpoint: "http://console-staging.local:9095/management/api/v2" + authEndpoint: "http://console-staging.local/management/api/v2" + domain: "*.cloud.stage.neon.tech" # -- Additional labels for zenith-proxy pods podLabels: From ef40e404cf15cc335fbf6a226879e5358aa628eb Mon Sep 17 00:00:00 2001 From: bojanserafimov Date: Thu, 5 May 2022 19:06:53 -0400 Subject: [PATCH 210/296] Rename zenith crate to neon_local (#1625) --- Cargo.lock | 34 ++++++++++++------------- Cargo.toml | 2 +- README.md | 16 ++++++------ {zenith => neon_local}/Cargo.toml | 2 +- {zenith => neon_local}/src/main.rs | 17 ++++++------- test_runner/fixtures/zenith_fixtures.py | 2 +- 6 files changed, 36 insertions(+), 37 deletions(-) rename {zenith => neon_local}/Cargo.toml (96%) rename {zenith => neon_local}/src/main.rs (98%) diff --git a/Cargo.lock b/Cargo.lock index e9b24b2f84..3c38dc8150 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -1487,6 +1487,23 @@ dependencies = [ "tempfile", ] +[[package]] +name = "neon_local" +version = "0.1.0" +dependencies = [ + "anyhow", + "clap 3.0.14", + "comfy-table", + "control_plane", + "pageserver", + "postgres", + "postgres_ffi", + "safekeeper", + "serde_json", + "utils", + "workspace_hack", +] + [[package]] name = "nix" version = "0.23.1" @@ -3703,23 +3720,6 @@ dependencies = [ "chrono", ] -[[package]] -name = 
"zenith" -version = "0.1.0" -dependencies = [ - "anyhow", - "clap 3.0.14", - "comfy-table", - "control_plane", - "pageserver", - "postgres", - "postgres_ffi", - "safekeeper", - "serde_json", - "utils", - "workspace_hack", -] - [[package]] name = "zeroize" version = "1.5.2" diff --git a/Cargo.toml b/Cargo.toml index 3838637d37..f0934853f0 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -6,7 +6,7 @@ members = [ "proxy", "safekeeper", "workspace_hack", - "zenith", + "neon_local", "libs/*", ] diff --git a/README.md b/README.md index 03f86887a7..8876831265 100644 --- a/README.md +++ b/README.md @@ -49,14 +49,14 @@ make -j5 ```sh # Create repository in .zenith with proper paths to binaries and data # Later that would be responsibility of a package install script -> ./target/debug/zenith init +> ./target/debug/neon_local init initializing tenantid c03ba6b7ad4c5e9cf556f059ade44229 created initial timeline 5b014a9e41b4b63ce1a1febc04503636 timeline.lsn 0/169C3C8 created main branch pageserver init succeeded # start pageserver and safekeeper -> ./target/debug/zenith start +> ./target/debug/neon_local start Starting pageserver at 'localhost:64000' in '.zenith' Pageserver started initializing for single for 7676 @@ -64,7 +64,7 @@ Starting safekeeper at '127.0.0.1:5454' in '.zenith/safekeepers/single' Safekeeper started # start postgres compute node -> ./target/debug/zenith pg start main +> ./target/debug/neon_local pg start main Starting new postgres main on timeline 5b014a9e41b4b63ce1a1febc04503636 ... Extracting base backup to create postgres instance: path=.zenith/pgdatadirs/tenants/c03ba6b7ad4c5e9cf556f059ade44229/main port=55432 Starting postgres node at 'host=127.0.0.1 port=55432 user=zenith_admin dbname=postgres' @@ -72,7 +72,7 @@ waiting for server to start.... 
done server started # check list of running postgres instances -> ./target/debug/zenith pg list +> ./target/debug/neon_local pg list NODE ADDRESS TIMELINES BRANCH NAME LSN STATUS main 127.0.0.1:55432 5b014a9e41b4b63ce1a1febc04503636 main 0/1609610 running ``` @@ -94,16 +94,16 @@ postgres=# select * from t; 5. And create branches and run postgres on them: ```sh # create branch named migration_check -> ./target/debug/zenith timeline branch --branch-name migration_check +> ./target/debug/neon_local timeline branch --branch-name migration_check Created timeline '0e9331cad6efbafe6a88dd73ae21a5c9' at Lsn 0/16F5830 for tenant: c03ba6b7ad4c5e9cf556f059ade44229. Ancestor timeline: 'main' # check branches tree -> ./target/debug/zenith timeline list +> ./target/debug/neon_local timeline list main [5b014a9e41b4b63ce1a1febc04503636] ┗━ @0/1609610: migration_check [0e9331cad6efbafe6a88dd73ae21a5c9] # start postgres on that branch -> ./target/debug/zenith pg start migration_check +> ./target/debug/neon_local pg start migration_check Starting postgres node at 'host=127.0.0.1 port=55433 user=stas' waiting for server to start.... done @@ -123,7 +123,7 @@ INSERT 0 1 6. If you want to run tests afterwards (see below), you have to stop all the running the pageserver, safekeeper and postgres instances you have just started. 
You can stop them all with one command: ```sh -> ./target/debug/zenith stop +> ./target/debug/neon_local stop ``` ## Running tests diff --git a/zenith/Cargo.toml b/neon_local/Cargo.toml similarity index 96% rename from zenith/Cargo.toml rename to neon_local/Cargo.toml index 58f1f5751d..78d339789f 100644 --- a/zenith/Cargo.toml +++ b/neon_local/Cargo.toml @@ -1,5 +1,5 @@ [package] -name = "zenith" +name = "neon_local" version = "0.1.0" edition = "2021" diff --git a/zenith/src/main.rs b/neon_local/src/main.rs similarity index 98% rename from zenith/src/main.rs rename to neon_local/src/main.rs index 87bb5f3f60..158e43f68f 100644 --- a/zenith/src/main.rs +++ b/neon_local/src/main.rs @@ -62,15 +62,15 @@ http_port = {safekeeper_http_port} struct TimelineTreeEl { /// `TimelineInfo` received from the `pageserver` via the `timeline_list` http API call. pub info: TimelineInfo, - /// Name, recovered from zenith config mappings + /// Name, recovered from neon config mappings pub name: Option, /// Holds all direct children of this timeline referenced using `timeline_id`. pub children: BTreeSet, } -// Main entry point for the 'zenith' CLI utility +// Main entry point for the 'neon_local' CLI utility // -// This utility helps to manage zenith installation. That includes following: +// This utility helps to manage neon installation. That includes following: // * Management of local postgres installations running on top of the // pageserver. 
// * Providing CLI api to the pageserver @@ -125,12 +125,12 @@ fn main() -> Result<()> { .takes_value(true) .required(false); - let matches = App::new("Zenith CLI") + let matches = App::new("Neon CLI") .setting(AppSettings::ArgRequiredElseHelp) .version(GIT_VERSION) .subcommand( App::new("init") - .about("Initialize a new Zenith repository") + .about("Initialize a new Neon repository") .arg(pageserver_config_args.clone()) .arg(timeline_id_arg.clone().help("Use a specific timeline id when creating a tenant and its initial timeline")) .arg( @@ -258,7 +258,7 @@ fn main() -> Result<()> { None => bail!("no subcommand provided"), }; - // Check for 'zenith init' command first. + // Check for 'neon init' command first. let subcommand_result = if sub_name == "init" { handle_init(sub_args).map(Some) } else { @@ -481,9 +481,8 @@ fn handle_init(init_match: &ArgMatches) -> Result { }; let mut env = - LocalEnv::create_config(&toml_file).context("Failed to create zenith configuration")?; - env.init() - .context("Failed to initialize zenith repository")?; + LocalEnv::create_config(&toml_file).context("Failed to create neon configuration")?; + env.init().context("Failed to initialize neon repository")?; // default_tenantid was generated by the `env.init()` call above let initial_tenant_id = env.default_tenant_id.unwrap(); diff --git a/test_runner/fixtures/zenith_fixtures.py b/test_runner/fixtures/zenith_fixtures.py index 784d2d4b26..7acf0552df 100644 --- a/test_runner/fixtures/zenith_fixtures.py +++ b/test_runner/fixtures/zenith_fixtures.py @@ -1108,7 +1108,7 @@ class ZenithCli: assert type(arguments) == list - bin_zenith = os.path.join(str(zenith_binpath), 'zenith') + bin_zenith = os.path.join(str(zenith_binpath), 'neon_local') args = [bin_zenith] + arguments log.info('Running command "{}"'.format(' '.join(args))) From dd6dca90726c66da7398d80fc13ebeddf945b5ee Mon Sep 17 00:00:00 2001 From: Arseny Sher Date: Fri, 6 May 2022 13:03:07 +0400 Subject: [PATCH 211/296] Bump 
vendor/postgres to shut down on wrong basebackup. --- vendor/postgres | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vendor/postgres b/vendor/postgres index d35bd7132f..9a9459a7f9 160000 --- a/vendor/postgres +++ b/vendor/postgres @@ -1 +1 @@ -Subproject commit d35bd7132ff6ed600577934e5389c7657087fbe1 +Subproject commit 9a9459a7f9cbcaa0e35ff1f2f34c419238fdec7e From d4e155aaa3b818981717e5b1a1ac6fb7af5cc9cd Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Wed, 4 May 2022 18:28:46 +0300 Subject: [PATCH 212/296] Librarify common etcd timeline logic --- Cargo.lock | 182 ++++++++++++++--- control_plane/src/local_env.rs | 4 + control_plane/src/safekeeper.rs | 5 + libs/etcd_broker/Cargo.toml | 17 ++ libs/etcd_broker/src/lib.rs | 335 +++++++++++++++++++++++++++++++ libs/utils/src/zid.rs | 2 +- neon_local/src/main.rs | 15 +- safekeeper/Cargo.toml | 4 +- safekeeper/src/bin/safekeeper.rs | 11 +- safekeeper/src/broker.rs | 137 ++++--------- safekeeper/src/http/routes.rs | 4 +- safekeeper/src/lib.rs | 3 + safekeeper/src/safekeeper.rs | 4 +- safekeeper/src/timeline.rs | 48 ++++- workspace_hack/Cargo.toml | 9 +- 15 files changed, 633 insertions(+), 147 deletions(-) create mode 100644 libs/etcd_broker/Cargo.toml create mode 100644 libs/etcd_broker/src/lib.rs diff --git a/Cargo.lock b/Cargo.lock index 3c38dc8150..ac40a2931f 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -48,9 +48,9 @@ dependencies = [ [[package]] name = "anyhow" -version = "1.0.53" +version = "1.0.57" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "94a45b455c14666b85fc40a019e8ab9eb75e3a124e05494f5397122bc9eb06e0" +checksum = "08f9b8508dccb7687a1d6c4ce66b2b0ecef467c94667de27d8d7fe1f8d2a9cdc" dependencies = [ "backtrace", ] @@ -113,6 +113,49 @@ version = "1.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "d468802bab17cbc0cc575e9b053f41e72aa36bfa6b7f55e3529ffa43161b97fa" +[[package]] +name = "axum" +version = "0.5.4" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "f4af7447fc1214c1f3a1ace861d0216a6c8bb13965b64bbad9650f375b67689a" +dependencies = [ + "async-trait", + "axum-core", + "bitflags", + "bytes", + "futures-util", + "http", + "http-body", + "hyper", + "itoa 1.0.1", + "matchit", + "memchr", + "mime", + "percent-encoding", + "pin-project-lite", + "serde", + "sync_wrapper", + "tokio", + "tower", + "tower-http", + "tower-layer", + "tower-service", +] + +[[package]] +name = "axum-core" +version = "0.2.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "3bdc19781b16e32f8a7200368a336fa4509d4b72ef15dd4e41df5290855ee1e6" +dependencies = [ + "async-trait", + "bytes", + "futures-util", + "http", + "http-body", + "mime", +] + [[package]] name = "backtrace" version = "0.3.64" @@ -320,6 +363,15 @@ dependencies = [ "textwrap 0.14.2", ] +[[package]] +name = "cmake" +version = "0.1.48" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e8ad8cef104ac57b68b89df3208164d228503abbdce70f6880ffa3d970e7443a" +dependencies = [ + "cc", +] + [[package]] name = "combine" version = "4.6.3" @@ -730,9 +782,9 @@ dependencies = [ [[package]] name = "etcd-client" -version = "0.8.4" +version = "0.9.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "585de5039d1ecce74773db49ba4e8107e42be7c2cd0b1a9e7fce27181db7b118" +checksum = "c434d2800b273a506b82397aad2f20971636f65e47b27c027f77d498530c5954" dependencies = [ "http", "prost", @@ -740,9 +792,26 @@ dependencies = [ "tokio-stream", "tonic", "tonic-build", + "tower", "tower-service", ] +[[package]] +name = "etcd_broker" +version = "0.1.0" +dependencies = [ + "etcd-client", + "regex", + "serde", + "serde_json", + "serde_with", + "thiserror", + "tokio", + "tracing", + "utils", + "workspace_hack", +] + [[package]] name = "fail" version = "0.5.0" @@ -1027,6 +1096,12 @@ dependencies = [ "unicode-segmentation", ] +[[package]] +name = "heck" +version = "0.4.0" 
+source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2540771e65fc8cb83cd6e8a237f70c319bd5c29f78ed1084ba5d50eeac86f7f9" + [[package]] name = "hermit-abi" version = "0.1.19" @@ -1092,6 +1167,12 @@ dependencies = [ "pin-project-lite", ] +[[package]] +name = "http-range-header" +version = "0.3.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0bfe8eed0a9285ef776bb792479ea3834e8b94e13d615c2f66d03dd50a435a29" + [[package]] name = "httparse" version = "1.6.0" @@ -1357,6 +1438,12 @@ version = "0.1.9" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "a3e378b66a060d48947b590737b30a1be76706c8dd7b8ba0f2fe3989c68a853f" +[[package]] +name = "matchit" +version = "0.5.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "73cbba799671b762df5a175adf59ce145165747bb891505c43d09aefbbf38beb" + [[package]] name = "md-5" version = "0.9.1" @@ -1613,9 +1700,9 @@ dependencies = [ [[package]] name = "once_cell" -version = "1.9.0" +version = "1.10.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "da32515d9f6e6e489d7bc9d84c71b060db7247dc035bbe44eac88cf87486d8d5" +checksum = "87f3e037eac156d1775da914196f0f37741a274155e34a0b7e427c35d2a2ecb9" [[package]] name = "oorandom" @@ -1976,6 +2063,16 @@ version = "0.2.16" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "eb9f9e6e233e5c4a35559a617bf40a4ec447db2e84c20b55a6f83167b7e57872" +[[package]] +name = "prettyplease" +version = "0.1.10" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d9e07e3a46d0771a8a06b5f4441527802830b43e679ba12f44960f48dd4c6803" +dependencies = [ + "proc-macro2", + "syn", +] + [[package]] name = "proc-macro-hack" version = "0.5.19" @@ -2007,9 +2104,9 @@ dependencies = [ [[package]] name = "prost" -version = "0.9.0" +version = "0.10.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"444879275cb4fd84958b1a1d5420d15e6fcf7c235fe47f053c9c2a80aceb6001" +checksum = "a07b0857a71a8cb765763950499cae2413c3f9cede1133478c43600d9e146890" dependencies = [ "bytes", "prost-derive", @@ -2017,12 +2114,14 @@ dependencies = [ [[package]] name = "prost-build" -version = "0.9.0" +version = "0.10.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "62941722fb675d463659e49c4f3fe1fe792ff24fe5bbaa9c08cd3b98a1c354f5" +checksum = "120fbe7988713f39d780a58cf1a7ef0d7ef66c6d87e5aa3438940c05357929f4" dependencies = [ "bytes", - "heck", + "cfg-if", + "cmake", + "heck 0.4.0", "itertools", "lazy_static", "log", @@ -2037,9 +2136,9 @@ dependencies = [ [[package]] name = "prost-derive" -version = "0.9.0" +version = "0.10.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f9cc1a3263e07e0bf68e96268f37665207b49560d98739662cdfaae215c720fe" +checksum = "7b670f45da57fb8542ebdbb6105a925fe571b67f9e7ed9f47a06a84e72b4e7cc" dependencies = [ "anyhow", "itertools", @@ -2050,9 +2149,9 @@ dependencies = [ [[package]] name = "prost-types" -version = "0.9.0" +version = "0.10.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "534b7a0e836e3c482d2693070f982e39e7611da9695d4d1f5a4b186b51faef0a" +checksum = "2d0a014229361011dc8e69c8a1ec6c2e8d0f2af7c91e3ea3f5b2170298461e68" dependencies = [ "bytes", "prost", @@ -2224,9 +2323,9 @@ dependencies = [ [[package]] name = "regex" -version = "1.5.4" +version = "1.5.5" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d07a8629359eb56f1e2fb1652bb04212c072a87ba68546a04065d525673ac461" +checksum = "1a11647b6b25ff05a515cb92c365cec08801e83423a235b51e231e1808747286" dependencies = [ "aho-corasick", "memchr", @@ -2501,7 +2600,7 @@ dependencies = [ "const_format", "crc32c", "daemonize", - "etcd-client", + "etcd_broker", "fs2", "hex", "humantime", @@ -2830,7 +2929,7 @@ version = "0.23.1" source = "registry+https://github.com/rust-lang/crates.io-index" 
checksum = "5bb0dc7ee9c15cea6199cde9a127fa16a4c5819af85395457ad72d68edc85a38" dependencies = [ - "heck", + "heck 0.3.3", "proc-macro2", "quote", "rustversion", @@ -2868,15 +2967,21 @@ dependencies = [ [[package]] name = "syn" -version = "1.0.86" +version = "1.0.92" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8a65b3f4ffa0092e9887669db0eae07941f023991ab58ea44da8fe8e2d511c6b" +checksum = "7ff7c592601f11445996a06f8ad0c27f094a58857c2f89e97974ab9235b92c52" dependencies = [ "proc-macro2", "quote", "unicode-xid", ] +[[package]] +name = "sync_wrapper" +version = "0.1.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "20518fe4a4c9acf048008599e464deb21beeae3d3578418951a189c235a7a9a8" + [[package]] name = "tar" version = "0.4.38" @@ -3170,12 +3275,13 @@ dependencies = [ [[package]] name = "tonic" -version = "0.6.2" +version = "0.7.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ff08f4649d10a70ffa3522ca559031285d8e421d727ac85c60825761818f5d0a" +checksum = "30fb54bf1e446f44d870d260d99957e7d11fb9d0a0f5bd1a662ad1411cc103f9" dependencies = [ "async-stream", "async-trait", + "axum", "base64", "bytes", "futures-core", @@ -3191,7 +3297,7 @@ dependencies = [ "prost-derive", "tokio", "tokio-stream", - "tokio-util 0.6.9", + "tokio-util 0.7.0", "tower", "tower-layer", "tower-service", @@ -3201,10 +3307,11 @@ dependencies = [ [[package]] name = "tonic-build" -version = "0.6.2" +version = "0.7.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9403f1bafde247186684b230dc6f38b5cd514584e8bec1dd32514be4745fa757" +checksum = "c03447cdc9eaf8feffb6412dcb27baf2db11669a6c4789f29da799aabfb99547" dependencies = [ + "prettyplease", "proc-macro2", "prost-build", "quote", @@ -3231,6 +3338,25 @@ dependencies = [ "tracing", ] +[[package]] +name = "tower-http" +version = "0.3.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"e980386f06883cf4d0578d6c9178c81f68b45d77d00f2c2c1bc034b3439c2c56" +dependencies = [ + "bitflags", + "bytes", + "futures-core", + "futures-util", + "http", + "http-body", + "http-range-header", + "pin-project-lite", + "tower", + "tower-layer", + "tower-service", +] + [[package]] name = "tower-layer" version = "0.3.1" @@ -3672,13 +3798,16 @@ dependencies = [ name = "workspace_hack" version = "0.1.0" dependencies = [ + "ahash", "anyhow", "bytes", "chrono", "clap 2.34.0", "either", + "fail", "hashbrown", "indexmap", + "itoa 0.4.8", "libc", "log", "memchr", @@ -3692,6 +3821,7 @@ dependencies = [ "serde", "syn", "tokio", + "tokio-util 0.7.0", "tracing", "tracing-core", ] diff --git a/control_plane/src/local_env.rs b/control_plane/src/local_env.rs index 12ee88cdc9..5aeff505b6 100644 --- a/control_plane/src/local_env.rs +++ b/control_plane/src/local_env.rs @@ -63,6 +63,10 @@ pub struct LocalEnv { #[serde(default)] pub broker_endpoints: Option, + /// A prefix to all to any key when pushing/polling etcd from a node. 
+ #[serde(default)] + pub broker_etcd_prefix: Option, + pub pageserver: PageServerConf, #[serde(default)] diff --git a/control_plane/src/safekeeper.rs b/control_plane/src/safekeeper.rs index b094016131..074ee72f69 100644 --- a/control_plane/src/safekeeper.rs +++ b/control_plane/src/safekeeper.rs @@ -77,6 +77,7 @@ pub struct SafekeeperNode { pub pageserver: Arc, broker_endpoints: Option, + broker_etcd_prefix: Option, } impl SafekeeperNode { @@ -94,6 +95,7 @@ impl SafekeeperNode { http_base_url: format!("http://127.0.0.1:{}/v1", conf.http_port), pageserver, broker_endpoints: env.broker_endpoints.clone(), + broker_etcd_prefix: env.broker_etcd_prefix.clone(), } } @@ -143,6 +145,9 @@ impl SafekeeperNode { if let Some(ref ep) = self.broker_endpoints { cmd.args(&["--broker-endpoints", ep]); } + if let Some(prefix) = self.broker_etcd_prefix.as_deref() { + cmd.args(&["--broker-etcd-prefix", prefix]); + } if !cmd.status()?.success() { bail!( diff --git a/libs/etcd_broker/Cargo.toml b/libs/etcd_broker/Cargo.toml new file mode 100644 index 0000000000..65bd406131 --- /dev/null +++ b/libs/etcd_broker/Cargo.toml @@ -0,0 +1,17 @@ +[package] + name = "etcd_broker" + version = "0.1.0" + edition = "2021" + + [dependencies] + etcd-client = "0.9.0" + regex = "1.4.5" + serde = { version = "1.0", features = ["derive"] } + serde_json = "1" + serde_with = "1.12.0" + + utils = { path = "../utils" } + workspace_hack = { version = "0.1", path = "../../workspace_hack" } + tokio = "1" + tracing = "0.1" + thiserror = "1" diff --git a/libs/etcd_broker/src/lib.rs b/libs/etcd_broker/src/lib.rs new file mode 100644 index 0000000000..01cc0cf162 --- /dev/null +++ b/libs/etcd_broker/src/lib.rs @@ -0,0 +1,335 @@ +//! A set of primitives to access a shared data/updates, propagated via etcd broker (not persistent). +//! Intended to connect services to each other, not to store their data. 
+use std::{ + collections::{hash_map, HashMap}, + fmt::Display, + str::FromStr, +}; + +use regex::{Captures, Regex}; +use serde::{Deserialize, Serialize}; +use serde_with::{serde_as, DisplayFromStr}; + +pub use etcd_client::*; + +use tokio::{sync::mpsc, task::JoinHandle}; +use tracing::*; +use utils::{ + lsn::Lsn, + zid::{ZNodeId, ZTenantId, ZTenantTimelineId}, +}; + +#[derive(Debug, Deserialize, Serialize)] +struct SafekeeperTimeline { + safekeeper_id: ZNodeId, + info: SkTimelineInfo, +} + +/// Published data about safekeeper's timeline. Fields made optional for easy migrations. +#[serde_as] +#[derive(Debug, Deserialize, Serialize)] +pub struct SkTimelineInfo { + /// Term of the last entry. + pub last_log_term: Option, + /// LSN of the last record. + #[serde_as(as = "Option")] + #[serde(default)] + pub flush_lsn: Option, + /// Up to which LSN safekeeper regards its WAL as committed. + #[serde_as(as = "Option")] + #[serde(default)] + pub commit_lsn: Option, + /// LSN up to which safekeeper offloaded WAL to s3. + #[serde_as(as = "Option")] + #[serde(default)] + pub s3_wal_lsn: Option, + /// LSN of last checkpoint uploaded by pageserver. + #[serde_as(as = "Option")] + #[serde(default)] + pub remote_consistent_lsn: Option, + #[serde_as(as = "Option")] + #[serde(default)] + pub peer_horizon_lsn: Option, + #[serde(default)] + pub wal_stream_connection_string: Option, +} + +#[derive(Debug, thiserror::Error)] +pub enum BrokerError { + #[error("Etcd client error: {0}. Context: {1}")] + EtcdClient(etcd_client::Error, String), + #[error("Error during parsing etcd data: {0}")] + ParsingError(String), + #[error("Internal error: {0}")] + InternalError(String), +} + +/// A way to control the data retrieval from a certain subscription. 
+pub struct SkTimelineSubscription { + safekeeper_timeline_updates: + mpsc::UnboundedReceiver>>, + kind: SkTimelineSubscriptionKind, + watcher_handle: JoinHandle>, + watcher: Watcher, +} + +impl SkTimelineSubscription { + /// Asynchronously polls for more data from the subscription, suspending the current future if there's no data sent yet. + pub async fn fetch_data( + &mut self, + ) -> Option>> { + self.safekeeper_timeline_updates.recv().await + } + + /// Cancels the subscription, stopping the data poller and waiting for it to shut down. + pub async fn cancel(mut self) -> Result<(), BrokerError> { + self.watcher.cancel().await.map_err(|e| { + BrokerError::EtcdClient( + e, + format!( + "Failed to cancel timeline subscription, kind: {:?}", + self.kind + ), + ) + })?; + self.watcher_handle.await.map_err(|e| { + BrokerError::InternalError(format!( + "Failed to join the timeline updates task, kind: {:?}, error: {e}", + self.kind + )) + })? + } +} + +/// The subscription kind to the timeline updates from safekeeper. 
+#[derive(Debug, Clone, PartialEq, Eq, Hash)] +pub struct SkTimelineSubscriptionKind { + broker_prefix: String, + kind: SubscriptionKind, +} + +impl SkTimelineSubscriptionKind { + pub fn all(broker_prefix: String) -> Self { + Self { + broker_prefix, + kind: SubscriptionKind::All, + } + } + + pub fn tenant(broker_prefix: String, tenant: ZTenantId) -> Self { + Self { + broker_prefix, + kind: SubscriptionKind::Tenant(tenant), + } + } + + pub fn timeline(broker_prefix: String, timeline: ZTenantTimelineId) -> Self { + Self { + broker_prefix, + kind: SubscriptionKind::Timeline(timeline), + } + } + + fn watch_regex(&self) -> Regex { + match self.kind { + SubscriptionKind::All => Regex::new(&format!( + r"^{}/([[:xdigit:]]+)/([[:xdigit:]]+)/safekeeper/([[:digit:]])$", + self.broker_prefix + )) + .expect("wrong regex for 'everything' subscription"), + SubscriptionKind::Tenant(tenant_id) => Regex::new(&format!( + r"^{}/{tenant_id}/([[:xdigit:]]+)/safekeeper/([[:digit:]])$", + self.broker_prefix + )) + .expect("wrong regex for 'tenant' subscription"), + SubscriptionKind::Timeline(ZTenantTimelineId { + tenant_id, + timeline_id, + }) => Regex::new(&format!( + r"^{}/{tenant_id}/{timeline_id}/safekeeper/([[:digit:]])$", + self.broker_prefix + )) + .expect("wrong regex for 'timeline' subscription"), + } + } + + /// Etcd key to use for watching a certain timeline updates from safekeepers. + pub fn watch_key(&self) -> String { + match self.kind { + SubscriptionKind::All => self.broker_prefix.to_string(), + SubscriptionKind::Tenant(tenant_id) => { + format!("{}/{tenant_id}/safekeeper", self.broker_prefix) + } + SubscriptionKind::Timeline(ZTenantTimelineId { + tenant_id, + timeline_id, + }) => format!( + "{}/{tenant_id}/{timeline_id}/safekeeper", + self.broker_prefix + ), + } + } +} + +#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)] +enum SubscriptionKind { + /// Get every timeline update. + All, + /// Get certain tenant timelines' updates. 
+ Tenant(ZTenantId), + /// Get certain timeline updates. + Timeline(ZTenantTimelineId), +} + +/// Creates a background task to poll etcd for timeline updates from safekeepers. +/// Stops and returns `Err` on any error during etcd communication. +/// Watches the key changes until either the watcher is cancelled via etcd or the subscription cancellation handle, +/// exiting normally in such cases. +pub async fn subscribe_to_safekeeper_timeline_updates( + client: &mut Client, + subscription: SkTimelineSubscriptionKind, +) -> Result { + info!("Subscribing to timeline updates, subscription kind: {subscription:?}"); + + let (watcher, mut stream) = client + .watch( + subscription.watch_key(), + Some(WatchOptions::new().with_prefix()), + ) + .await + .map_err(|e| { + BrokerError::EtcdClient( + e, + format!("Failed to init the watch for subscription {subscription:?}"), + ) + })?; + + let (timeline_updates_sender, safekeeper_timeline_updates) = mpsc::unbounded_channel(); + + let subscription_kind = subscription.kind; + let regex = subscription.watch_regex(); + let watcher_handle = tokio::spawn(async move { + while let Some(resp) = stream.message().await.map_err(|e| BrokerError::InternalError(format!( + "Failed to get messages from the subscription stream, kind: {subscription_kind:?}, error: {e}" + )))? 
{ + if resp.canceled() { + info!("Watch for timeline updates subscription was canceled, exiting"); + break; + } + + let mut timeline_updates: HashMap<ZTenantTimelineId, HashMap<ZNodeId, SkTimelineInfo>> = + HashMap::new(); + + let events = resp.events(); + debug!("Processing {} events", events.len()); + + for event in events { + if EventType::Put == event.event_type() { + if let Some(kv) = event.kv() { + match parse_etcd_key_value(subscription_kind, &regex, kv) { + Ok(Some((zttid, timeline))) => { + match timeline_updates + .entry(zttid) + .or_default() + .entry(timeline.safekeeper_id) + { + hash_map::Entry::Occupied(mut o) => { + if o.get().flush_lsn < timeline.info.flush_lsn { + o.insert(timeline.info); + } + } + hash_map::Entry::Vacant(v) => { + v.insert(timeline.info); + } + } + } + Ok(None) => {} + Err(e) => error!("Failed to parse timeline update: {e}"), + }; + } + } + } + + if let Err(e) = timeline_updates_sender.send(timeline_updates) { + info!("Timeline updates sender got dropped, exiting: {e}"); + break; + } + } + + Ok(()) + }); + + Ok(SkTimelineSubscription { + kind: subscription, + safekeeper_timeline_updates, + watcher_handle, + watcher, + }) +} + +fn parse_etcd_key_value( + subscription_kind: SubscriptionKind, + regex: &Regex, + kv: &KeyValue, +) -> Result<Option<(ZTenantTimelineId, SafekeeperTimeline)>, BrokerError> { + let caps = if let Some(caps) = regex.captures(kv.key_str().map_err(|e| { + BrokerError::EtcdClient(e, format!("Failed to represent kv {kv:?} as key str")) + })?) 
{ + caps + } else { + return Ok(None); + }; + + let (zttid, safekeeper_id) = match subscription_kind { + SubscriptionKind::All => ( + ZTenantTimelineId::new( + parse_capture(&caps, 1).map_err(BrokerError::ParsingError)?, + parse_capture(&caps, 2).map_err(BrokerError::ParsingError)?, + ), + ZNodeId(parse_capture(&caps, 3).map_err(BrokerError::ParsingError)?), + ), + SubscriptionKind::Tenant(tenant_id) => ( + ZTenantTimelineId::new( + tenant_id, + parse_capture(&caps, 1).map_err(BrokerError::ParsingError)?, + ), + ZNodeId(parse_capture(&caps, 2).map_err(BrokerError::ParsingError)?), + ), + SubscriptionKind::Timeline(zttid) => ( + zttid, + ZNodeId(parse_capture(&caps, 1).map_err(BrokerError::ParsingError)?), + ), + }; + + let info_str = kv.value_str().map_err(|e| { + BrokerError::EtcdClient(e, format!("Failed to represent kv {kv:?} as value str")) + })?; + Ok(Some(( + zttid, + SafekeeperTimeline { + safekeeper_id, + info: serde_json::from_str(info_str).map_err(|e| { + BrokerError::ParsingError(format!( + "Failed to parse '{info_str}' as safekeeper timeline info: {e}" + )) + })?, + }, + ))) +} + +fn parse_capture(caps: &Captures, index: usize) -> Result +where + T: FromStr, + ::Err: Display, +{ + let capture_match = caps + .get(index) + .ok_or_else(|| format!("Failed to get capture match at index {index}"))? + .as_str(); + capture_match.parse().map_err(|e| { + format!( + "Failed to parse {} from {capture_match}: {e}", + std::any::type_name::() + ) + }) +} diff --git a/libs/utils/src/zid.rs b/libs/utils/src/zid.rs index fce5ed97c1..44d81cda50 100644 --- a/libs/utils/src/zid.rs +++ b/libs/utils/src/zid.rs @@ -224,7 +224,7 @@ impl fmt::Display for ZTenantTimelineId { // Unique ID of a storage node (safekeeper or pageserver). Supposed to be issued // by the console. 
-#[derive(Clone, Copy, Eq, Ord, PartialEq, PartialOrd, Debug, Serialize, Deserialize)] +#[derive(Clone, Copy, Eq, Ord, PartialEq, PartialOrd, Hash, Debug, Serialize, Deserialize)] #[serde(transparent)] pub struct ZNodeId(pub u64); diff --git a/neon_local/src/main.rs b/neon_local/src/main.rs index 158e43f68f..8b54054080 100644 --- a/neon_local/src/main.rs +++ b/neon_local/src/main.rs @@ -517,7 +517,7 @@ fn pageserver_config_overrides(init_match: &ArgMatches) -> Vec<&str> { .collect() } -fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> Result<()> { +fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> anyhow::Result<()> { let pageserver = PageServerNode::from_env(env); match tenant_match.subcommand() { Some(("list", _)) => { @@ -550,17 +550,8 @@ fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> Re pageserver .tenant_config(tenant_id, tenant_conf) - .unwrap_or_else(|e| { - anyhow!( - "Tenant config failed for tenant with id {} : {}", - tenant_id, - e - ); - }); - println!( - "tenant {} successfully configured on the pageserver", - tenant_id - ); + .with_context(|| format!("Tenant config failed for tenant with id {tenant_id}"))?; + println!("tenant {tenant_id} successfully configured on the pageserver"); } Some((sub_name, _)) => bail!("Unexpected tenant subcommand '{}'", sub_name), None => bail!("no tenant subcommand provided"), diff --git a/safekeeper/Cargo.toml b/safekeeper/Cargo.toml index 8a31311b8f..44587dd384 100644 --- a/safekeeper/Cargo.toml +++ b/safekeeper/Cargo.toml @@ -24,11 +24,10 @@ walkdir = "2" url = "2.2.2" signal-hook = "0.3.10" serde = { version = "1.0", features = ["derive"] } -serde_with = {version = "1.12.0"} +serde_with = "1.12.0" hex = "0.4.3" const_format = "0.2.21" tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" } -etcd-client = "0.8.3" tokio-util = { version = "0.7", features = ["io"] } 
rusoto_core = "0.47" rusoto_s3 = "0.47" @@ -36,6 +35,7 @@ rusoto_s3 = "0.47" postgres_ffi = { path = "../libs/postgres_ffi" } metrics = { path = "../libs/metrics" } utils = { path = "../libs/utils" } +etcd_broker = { path = "../libs/etcd_broker" } workspace_hack = { version = "0.1", path = "../workspace_hack" } [dev-dependencies] diff --git a/safekeeper/src/bin/safekeeper.rs b/safekeeper/src/bin/safekeeper.rs index 3fea3581a8..7e979840c2 100644 --- a/safekeeper/src/bin/safekeeper.rs +++ b/safekeeper/src/bin/safekeeper.rs @@ -109,6 +109,12 @@ fn main() -> Result<()> { .takes_value(true) .help("a comma separated broker (etcd) endpoints for storage nodes coordination, e.g. 'http://127.0.0.1:2379'"), ) + .arg( + Arg::new("broker-etcd-prefix") + .long("broker-etcd-prefix") + .takes_value(true) + .help("a prefix to always use when polling/pusing data in etcd from this safekeeper"), + ) .get_matches(); if let Some(addr) = arg_matches.value_of("dump-control-file") { @@ -118,7 +124,7 @@ fn main() -> Result<()> { return Ok(()); } - let mut conf: SafeKeeperConf = Default::default(); + let mut conf = SafeKeeperConf::default(); if let Some(dir) = arg_matches.value_of("datadir") { // change into the data directory. @@ -162,6 +168,9 @@ fn main() -> Result<()> { let collected_ep: Result, ParseError> = addr.split(',').map(Url::parse).collect(); conf.broker_endpoints = Some(collected_ep?); } + if let Some(prefix) = arg_matches.value_of("broker-etcd-prefix") { + conf.broker_etcd_prefix = prefix.to_string(); + } start_safekeeper(conf, given_id, arg_matches.is_present("init")) } diff --git a/safekeeper/src/broker.rs b/safekeeper/src/broker.rs index 8ce7bdf0e5..c9ae1a8d98 100644 --- a/safekeeper/src/broker.rs +++ b/safekeeper/src/broker.rs @@ -1,61 +1,22 @@ //! Communication with etcd, providing safekeeper peers and pageserver coordination. 
-use anyhow::bail; use anyhow::Context; use anyhow::Error; use anyhow::Result; -use etcd_client::Client; -use etcd_client::EventType; -use etcd_client::PutOptions; -use etcd_client::WatchOptions; -use lazy_static::lazy_static; -use regex::Regex; -use serde::{Deserialize, Serialize}; -use serde_with::{serde_as, DisplayFromStr}; -use std::str::FromStr; +use etcd_broker::Client; +use etcd_broker::PutOptions; +use etcd_broker::SkTimelineSubscriptionKind; use std::time::Duration; use tokio::task::JoinHandle; use tokio::{runtime, time::sleep}; use tracing::*; -use crate::{safekeeper::Term, timeline::GlobalTimelines, SafeKeeperConf}; -use utils::{ - lsn::Lsn, - zid::{ZNodeId, ZTenantId, ZTenantTimelineId, ZTimelineId}, -}; +use crate::{timeline::GlobalTimelines, SafeKeeperConf}; +use utils::zid::{ZNodeId, ZTenantTimelineId}; const RETRY_INTERVAL_MSEC: u64 = 1000; const PUSH_INTERVAL_MSEC: u64 = 1000; const LEASE_TTL_SEC: i64 = 5; -// TODO: add global zenith installation ID. -const ZENITH_PREFIX: &str = "zenith"; - -/// Published data about safekeeper. Fields made optional for easy migrations. -#[serde_as] -#[derive(Debug, Deserialize, Serialize)] -pub struct SafekeeperInfo { - /// Term of the last entry. - pub last_log_term: Option, - /// LSN of the last record. - #[serde_as(as = "Option")] - #[serde(default)] - pub flush_lsn: Option, - /// Up to which LSN safekeeper regards its WAL as committed. - #[serde_as(as = "Option")] - #[serde(default)] - pub commit_lsn: Option, - /// LSN up to which safekeeper offloaded WAL to s3. - #[serde_as(as = "Option")] - #[serde(default)] - pub s3_wal_lsn: Option, - /// LSN of last checkpoint uploaded by pageserver. 
- #[serde_as(as = "Option")] - #[serde(default)] - pub remote_consistent_lsn: Option, - #[serde_as(as = "Option")] - #[serde(default)] - pub peer_horizon_lsn: Option, -} pub fn thread_main(conf: SafeKeeperConf) { let runtime = runtime::Builder::new_current_thread() @@ -71,22 +32,21 @@ pub fn thread_main(conf: SafeKeeperConf) { }); } -/// Prefix to timeline related data. -fn timeline_path(zttid: &ZTenantTimelineId) -> String { +/// Key to per timeline per safekeeper data. +fn timeline_safekeeper_path( + broker_prefix: String, + zttid: ZTenantTimelineId, + sk_id: ZNodeId, +) -> String { format!( - "{}/{}/{}", - ZENITH_PREFIX, zttid.tenant_id, zttid.timeline_id + "{}/{sk_id}", + SkTimelineSubscriptionKind::timeline(broker_prefix, zttid).watch_key() ) } -/// Key to per timeline per safekeeper data. -fn timeline_safekeeper_path(zttid: &ZTenantTimelineId, sk_id: ZNodeId) -> String { - format!("{}/safekeeper/{}", timeline_path(zttid), sk_id) -} - /// Push once in a while data about all active timelines to the broker. -async fn push_loop(conf: SafeKeeperConf) -> Result<()> { - let mut client = Client::connect(conf.broker_endpoints.as_ref().unwrap(), None).await?; +async fn push_loop(conf: SafeKeeperConf) -> anyhow::Result<()> { + let mut client = Client::connect(&conf.broker_endpoints.as_ref().unwrap(), None).await?; // Get and maintain lease to automatically delete obsolete data let lease = client.lease_grant(LEASE_TTL_SEC, None).await?; @@ -98,14 +58,17 @@ async fn push_loop(conf: SafeKeeperConf) -> Result<()> { // is under plain mutex. That's ok, all this code is not performance // sensitive and there is no risk of deadlock as we don't await while // lock is held. 
- let active_tlis = GlobalTimelines::get_active_timelines(); - for zttid in &active_tlis { - if let Ok(tli) = GlobalTimelines::get(&conf, *zttid, false) { - let sk_info = tli.get_public_info(); + for zttid in GlobalTimelines::get_active_timelines() { + if let Ok(tli) = GlobalTimelines::get(&conf, zttid, false) { + let sk_info = tli.get_public_info()?; let put_opts = PutOptions::new().with_lease(lease.id()); client .put( - timeline_safekeeper_path(zttid, conf.my_id), + timeline_safekeeper_path( + conf.broker_etcd_prefix.clone(), + zttid, + conf.my_id, + ), serde_json::to_string(&sk_info)?, Some(put_opts), ) @@ -128,45 +91,31 @@ async fn push_loop(conf: SafeKeeperConf) -> Result<()> { /// Subscribe and fetch all the interesting data from the broker. async fn pull_loop(conf: SafeKeeperConf) -> Result<()> { - lazy_static! { - static ref TIMELINE_SAFEKEEPER_RE: Regex = - Regex::new(r"^zenith/([[:xdigit:]]+)/([[:xdigit:]]+)/safekeeper/([[:digit:]])$") - .unwrap(); - } - let mut client = Client::connect(conf.broker_endpoints.as_ref().unwrap(), None).await?; - loop { - let wo = WatchOptions::new().with_prefix(); - // TODO: subscribe only to my timelines - let (_, mut stream) = client.watch(ZENITH_PREFIX, Some(wo)).await?; - while let Some(resp) = stream.message().await? { - if resp.canceled() { - bail!("watch canceled"); - } + let mut client = Client::connect(&conf.broker_endpoints.as_ref().unwrap(), None).await?; - for event in resp.events() { - if EventType::Put == event.event_type() { - if let Some(kv) = event.kv() { - if let Some(caps) = TIMELINE_SAFEKEEPER_RE.captures(kv.key_str()?) 
{ - let tenant_id = ZTenantId::from_str(caps.get(1).unwrap().as_str())?; - let timeline_id = ZTimelineId::from_str(caps.get(2).unwrap().as_str())?; - let zttid = ZTenantTimelineId::new(tenant_id, timeline_id); - let safekeeper_id = ZNodeId(caps.get(3).unwrap().as_str().parse()?); - let value_str = kv.value_str()?; - match serde_json::from_str::(value_str) { - Ok(safekeeper_info) => { - if let Ok(tli) = GlobalTimelines::get(&conf, zttid, false) { - tli.record_safekeeper_info(&safekeeper_info, safekeeper_id)? - } - } - Err(err) => warn!( - "failed to deserialize safekeeper info {}: {}", - value_str, err - ), - } + let mut subscription = etcd_broker::subscribe_to_safekeeper_timeline_updates( + &mut client, + SkTimelineSubscriptionKind::all(conf.broker_etcd_prefix.clone()), + ) + .await + .context("failed to subscribe for safekeeper info")?; + + loop { + match subscription.fetch_data().await { + Some(new_info) => { + for (zttid, sk_info) in new_info { + // note: there are blocking operations below, but it's considered fine for now + if let Ok(tli) = GlobalTimelines::get(&conf, zttid, false) { + for (safekeeper_id, info) in sk_info { + tli.record_safekeeper_info(&info, safekeeper_id)? 
} } } } + None => { + debug!("timeline updates sender closed, aborting the pull loop"); + return Ok(()); + } } } } diff --git a/safekeeper/src/http/routes.rs b/safekeeper/src/http/routes.rs index d7cbcb094e..e731db5617 100644 --- a/safekeeper/src/http/routes.rs +++ b/safekeeper/src/http/routes.rs @@ -1,3 +1,4 @@ +use etcd_broker::SkTimelineInfo; use hyper::{Body, Request, Response, StatusCode}; use serde::Serialize; @@ -5,7 +6,6 @@ use serde::Serializer; use std::fmt::Display; use std::sync::Arc; -use crate::broker::SafekeeperInfo; use crate::safekeeper::Term; use crate::safekeeper::TermHistory; use crate::timeline::GlobalTimelines; @@ -136,7 +136,7 @@ async fn record_safekeeper_info(mut request: Request) -> Result>, + pub broker_etcd_prefix: String, } impl SafeKeeperConf { @@ -76,6 +78,7 @@ impl Default for SafeKeeperConf { recall_period: defaults::DEFAULT_RECALL_PERIOD, my_id: ZNodeId(0), broker_endpoints: None, + broker_etcd_prefix: defaults::DEFAULT_NEON_BROKER_PREFIX.to_string(), } } } diff --git a/safekeeper/src/safekeeper.rs b/safekeeper/src/safekeeper.rs index 68361fd672..b9264565dc 100644 --- a/safekeeper/src/safekeeper.rs +++ b/safekeeper/src/safekeeper.rs @@ -4,6 +4,7 @@ use anyhow::{bail, Context, Result}; use byteorder::{LittleEndian, ReadBytesExt}; use bytes::{Buf, BufMut, Bytes, BytesMut}; +use etcd_broker::SkTimelineInfo; use postgres_ffi::xlog_utils::TimeLineID; use postgres_ffi::xlog_utils::XLogSegNo; @@ -16,7 +17,6 @@ use tracing::*; use lazy_static::lazy_static; -use crate::broker::SafekeeperInfo; use crate::control_file; use crate::send_wal::HotStandbyFeedback; use crate::wal_storage; @@ -886,7 +886,7 @@ where } /// Update timeline state with peer safekeeper data. 
- pub fn record_safekeeper_info(&mut self, sk_info: &SafekeeperInfo) -> Result<()> { + pub fn record_safekeeper_info(&mut self, sk_info: &SkTimelineInfo) -> Result<()> { let mut sync_control_file = false; if let (Some(commit_lsn), Some(last_log_term)) = (sk_info.commit_lsn, sk_info.last_log_term) { diff --git a/safekeeper/src/timeline.rs b/safekeeper/src/timeline.rs index 47137091da..140d6660ac 100644 --- a/safekeeper/src/timeline.rs +++ b/safekeeper/src/timeline.rs @@ -3,6 +3,7 @@ use anyhow::{bail, Context, Result}; +use etcd_broker::SkTimelineInfo; use lazy_static::lazy_static; use postgres_ffi::xlog_utils::XLogSegNo; @@ -21,7 +22,6 @@ use utils::{ zid::{ZNodeId, ZTenantTimelineId}, }; -use crate::broker::SafekeeperInfo; use crate::callmemaybe::{CallmeEvent, SubscriptionStateKey}; use crate::control_file; @@ -89,6 +89,7 @@ struct SharedState { active: bool, num_computes: u32, pageserver_connstr: Option, + listen_pg_addr: String, last_removed_segno: XLogSegNo, } @@ -111,6 +112,7 @@ impl SharedState { active: false, num_computes: 0, pageserver_connstr: None, + listen_pg_addr: conf.listen_pg_addr.clone(), last_removed_segno: 0, }) } @@ -130,6 +132,7 @@ impl SharedState { active: false, num_computes: 0, pageserver_connstr: None, + listen_pg_addr: conf.listen_pg_addr.clone(), last_removed_segno: 0, }) } @@ -418,9 +421,9 @@ impl Timeline { } /// Prepare public safekeeper info for reporting. 
- pub fn get_public_info(&self) -> SafekeeperInfo { + pub fn get_public_info(&self) -> anyhow::Result { let shared_state = self.mutex.lock().unwrap(); - SafekeeperInfo { + Ok(SkTimelineInfo { last_log_term: Some(shared_state.sk.get_epoch()), flush_lsn: Some(shared_state.sk.wal_store.flush_lsn()), // note: this value is not flushed to control file yet and can be lost @@ -432,11 +435,23 @@ impl Timeline { shared_state.sk.inmem.remote_consistent_lsn, )), peer_horizon_lsn: Some(shared_state.sk.inmem.peer_horizon_lsn), - } + wal_stream_connection_string: shared_state + .pageserver_connstr + .as_deref() + .map(|pageserver_connstr| { + wal_stream_connection_string( + self.zttid, + &shared_state.listen_pg_addr, + pageserver_connstr, + ) + }) + .transpose() + .context("Failed to get the pageserver callmemaybe connstr")?, + }) } /// Update timeline state with peer safekeeper data. - pub fn record_safekeeper_info(&self, sk_info: &SafekeeperInfo, _sk_id: ZNodeId) -> Result<()> { + pub fn record_safekeeper_info(&self, sk_info: &SkTimelineInfo, _sk_id: ZNodeId) -> Result<()> { let mut shared_state = self.mutex.lock().unwrap(); shared_state.sk.record_safekeeper_info(sk_info)?; self.notify_wal_senders(&mut shared_state); @@ -489,6 +504,29 @@ impl Timeline { } } +// pageserver connstr is needed to be able to distinguish between different pageservers +// it is required to correctly manage callmemaybe subscriptions when more than one pageserver is involved +// TODO it is better to use some sort of a unique id instead of connection string, see https://github.com/zenithdb/zenith/issues/1105 +fn wal_stream_connection_string( + ZTenantTimelineId { + tenant_id, + timeline_id, + }: ZTenantTimelineId, + listen_pg_addr_str: &str, + pageserver_connstr: &str, +) -> anyhow::Result { + let me_connstr = format!("postgresql://no_user@{}/no_db", listen_pg_addr_str); + let me_conf = me_connstr + .parse::() + .with_context(|| { + format!("Failed to parse pageserver connection string '{me_connstr}' as 
a postgres one") + })?; + let (host, port) = utils::connstring::connection_host_port(&me_conf); + Ok(format!( + "host={host} port={port} options='-c ztimelineid={timeline_id} ztenantid={tenant_id} pageserver_connstr={pageserver_connstr}'", + )) +} + // Utilities needed by various Connection-like objects pub trait TimelineTools { fn set(&mut self, conf: &SafeKeeperConf, zttid: ZTenantTimelineId, create: bool) -> Result<()>; diff --git a/workspace_hack/Cargo.toml b/workspace_hack/Cargo.toml index f178b5b766..2bb22f2d3b 100644 --- a/workspace_hack/Cargo.toml +++ b/workspace_hack/Cargo.toml @@ -14,29 +14,34 @@ publish = false ### BEGIN HAKARI SECTION [dependencies] +ahash = { version = "0.7", features = ["std"] } anyhow = { version = "1", features = ["backtrace", "std"] } bytes = { version = "1", features = ["serde", "std"] } chrono = { version = "0.4", features = ["clock", "libc", "oldtime", "serde", "std", "time", "winapi"] } clap = { version = "2", features = ["ansi_term", "atty", "color", "strsim", "suggestions", "vec_map"] } either = { version = "1", features = ["use_std"] } +fail = { version = "0.5", default-features = false, features = ["failpoints"] } hashbrown = { version = "0.11", features = ["ahash", "inline-more", "raw"] } indexmap = { version = "1", default-features = false, features = ["std"] } +itoa = { version = "0.4", features = ["i128", "std"] } libc = { version = "0.2", features = ["extra_traits", "std"] } log = { version = "0.4", default-features = false, features = ["serde", "std"] } memchr = { version = "2", features = ["std", "use_std"] } num-integer = { version = "0.1", default-features = false, features = ["i128"] } num-traits = { version = "0.2", features = ["i128", "std"] } -prost = { version = "0.9", features = ["prost-derive", "std"] } +prost = { version = "0.10", features = ["prost-derive", "std"] } rand = { version = "0.8", features = ["alloc", "getrandom", "libc", "rand_chacha", "rand_hc", "small_rng", "std", "std_rng"] } regex = { 
version = "1", features = ["aho-corasick", "memchr", "perf", "perf-cache", "perf-dfa", "perf-inline", "perf-literal", "std", "unicode", "unicode-age", "unicode-bool", "unicode-case", "unicode-gencat", "unicode-perl", "unicode-script", "unicode-segment"] } regex-syntax = { version = "0.6", features = ["unicode", "unicode-age", "unicode-bool", "unicode-case", "unicode-gencat", "unicode-perl", "unicode-script", "unicode-segment"] } scopeguard = { version = "1", features = ["use_std"] } serde = { version = "1", features = ["alloc", "derive", "serde_derive", "std"] } tokio = { version = "1", features = ["bytes", "fs", "io-std", "io-util", "libc", "macros", "memchr", "mio", "net", "num_cpus", "once_cell", "process", "rt", "rt-multi-thread", "signal-hook-registry", "socket2", "sync", "time", "tokio-macros"] } +tokio-util = { version = "0.7", features = ["codec", "io"] } tracing = { version = "0.1", features = ["attributes", "log", "std", "tracing-attributes"] } tracing-core = { version = "0.1", features = ["lazy_static", "std"] } [build-dependencies] +ahash = { version = "0.7", features = ["std"] } anyhow = { version = "1", features = ["backtrace", "std"] } bytes = { version = "1", features = ["serde", "std"] } clap = { version = "2", features = ["ansi_term", "atty", "color", "strsim", "suggestions", "vec_map"] } @@ -46,7 +51,7 @@ indexmap = { version = "1", default-features = false, features = ["std"] } libc = { version = "0.2", features = ["extra_traits", "std"] } log = { version = "0.4", default-features = false, features = ["serde", "std"] } memchr = { version = "2", features = ["std", "use_std"] } -prost = { version = "0.9", features = ["prost-derive", "std"] } +prost = { version = "0.10", features = ["prost-derive", "std"] } regex = { version = "1", features = ["aho-corasick", "memchr", "perf", "perf-cache", "perf-dfa", "perf-inline", "perf-literal", "std", "unicode", "unicode-age", "unicode-bool", "unicode-case", "unicode-gencat", "unicode-perl", "unicode-script", 
"unicode-segment"] } regex-syntax = { version = "0.6", features = ["unicode", "unicode-age", "unicode-bool", "unicode-case", "unicode-gencat", "unicode-perl", "unicode-script", "unicode-segment"] } serde = { version = "1", features = ["alloc", "derive", "serde_derive", "std"] } From de37f982dba67eae85b64c48259a0a36dbcc0e09 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Wed, 4 May 2022 17:06:44 +0300 Subject: [PATCH 213/296] Share the remote storage as a crate --- Cargo.lock | 71 +-- control_plane/src/storage.rs | 17 +- docs/settings.md | 20 +- libs/remote_storage/Cargo.toml | 20 + libs/remote_storage/src/lib.rs | 232 ++++++++++ .../remote_storage/src}/local_fs.rs | 186 ++++---- .../remote_storage/src}/s3_bucket.rs | 147 +++---- pageserver/Cargo.toml | 8 +- pageserver/README.md | 6 +- pageserver/src/config.rs | 120 +---- pageserver/src/http/routes.rs | 29 +- pageserver/src/layered_repository.rs | 14 +- pageserver/src/lib.rs | 2 +- pageserver/src/remote_storage.rs | 412 ------------------ pageserver/src/repository.rs | 2 +- .../src/{remote_storage => }/storage_sync.rs | 373 +++++++++++++--- .../storage_sync/download.rs | 54 +-- .../storage_sync/index.rs | 0 .../storage_sync/upload.rs | 53 ++- pageserver/src/tenant_mgr.rs | 5 +- pageserver/src/timelines.rs | 2 +- safekeeper/Cargo.toml | 3 +- safekeeper/src/s3_offload.rs | 107 ++--- test_runner/fixtures/zenith_fixtures.py | 29 +- workspace_hack/Cargo.toml | 6 + 25 files changed, 961 insertions(+), 957 deletions(-) create mode 100644 libs/remote_storage/Cargo.toml create mode 100644 libs/remote_storage/src/lib.rs rename {pageserver/src/remote_storage => libs/remote_storage/src}/local_fs.rs (81%) rename {pageserver/src/remote_storage => libs/remote_storage/src}/s3_bucket.rs (74%) delete mode 100644 pageserver/src/remote_storage.rs rename pageserver/src/{remote_storage => }/storage_sync.rs (77%) rename pageserver/src/{remote_storage => }/storage_sync/download.rs (93%) rename pageserver/src/{remote_storage => 
}/storage_sync/index.rs (100%) rename pageserver/src/{remote_storage => }/storage_sync/upload.rs (93%) diff --git a/Cargo.lock b/Cargo.lock index ac40a2931f..148517a777 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -48,9 +48,9 @@ dependencies = [ [[package]] name = "anyhow" -version = "1.0.57" +version = "1.0.53" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "08f9b8508dccb7687a1d6c4ce66b2b0ecef467c94667de27d8d7fe1f8d2a9cdc" +checksum = "94a45b455c14666b85fc40a019e8ab9eb75e3a124e05494f5397122bc9eb06e0" dependencies = [ "backtrace", ] @@ -1700,9 +1700,9 @@ dependencies = [ [[package]] name = "once_cell" -version = "1.10.0" +version = "1.9.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "87f3e037eac156d1775da914196f0f37741a274155e34a0b7e427c35d2a2ecb9" +checksum = "da32515d9f6e6e489d7bc9d84c71b060db7247dc035bbe44eac88cf87486d8d5" [[package]] name = "oorandom" @@ -1763,7 +1763,6 @@ name = "pageserver" version = "0.1.0" dependencies = [ "anyhow", - "async-trait", "byteorder", "bytes", "chrono", @@ -1791,8 +1790,7 @@ dependencies = [ "pprof", "rand", "regex", - "rusoto_core", - "rusoto_s3", + "remote_storage", "scopeguard", "serde", "serde_json", @@ -1804,7 +1802,6 @@ dependencies = [ "tokio", "tokio-postgres", "tokio-stream", - "tokio-util 0.7.0", "toml_edit", "tracing", "url", @@ -2104,9 +2101,9 @@ dependencies = [ [[package]] name = "prost" -version = "0.10.1" +version = "0.10.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a07b0857a71a8cb765763950499cae2413c3f9cede1133478c43600d9e146890" +checksum = "bc03e116981ff7d8da8e5c220e374587b98d294af7ba7dd7fda761158f00086f" dependencies = [ "bytes", "prost-derive", @@ -2114,9 +2111,9 @@ dependencies = [ [[package]] name = "prost-build" -version = "0.10.1" +version = "0.10.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "120fbe7988713f39d780a58cf1a7ef0d7ef66c6d87e5aa3438940c05357929f4" +checksum = 
"65a1118354442de7feb8a2a76f3d80ef01426bd45542c8c1fdffca41a758f846" dependencies = [ "bytes", "cfg-if", @@ -2347,6 +2344,23 @@ version = "0.6.25" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "f497285884f3fcff424ffc933e56d7cbca511def0c9831a7f9b5f6153e3cc89b" +[[package]] +name = "remote_storage" +version = "0.1.0" +dependencies = [ + "anyhow", + "async-trait", + "rusoto_core", + "rusoto_s3", + "serde", + "serde_json", + "tempfile", + "tokio", + "tokio-util 0.7.0", + "tracing", + "workspace_hack", +] + [[package]] name = "remove_dir_all" version = "0.5.3" @@ -2446,9 +2460,9 @@ dependencies = [ [[package]] name = "rusoto_core" -version = "0.47.0" +version = "0.48.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5b4f000e8934c1b4f70adde180056812e7ea6b1a247952db8ee98c94cd3116cc" +checksum = "1db30db44ea73551326269adcf7a2169428a054f14faf9e1768f2163494f2fa2" dependencies = [ "async-trait", "base64", @@ -2471,9 +2485,9 @@ dependencies = [ [[package]] name = "rusoto_credential" -version = "0.47.0" +version = "0.48.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6a46b67db7bb66f5541e44db22b0a02fed59c9603e146db3a9e633272d3bac2f" +checksum = "ee0a6c13db5aad6047b6a44ef023dbbc21a056b6dab5be3b79ce4283d5c02d05" dependencies = [ "async-trait", "chrono", @@ -2489,9 +2503,9 @@ dependencies = [ [[package]] name = "rusoto_s3" -version = "0.47.0" +version = "0.48.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "048c2fe811a823ad5a9acc976e8bf4f1d910df719dcf44b15c3e96c5b7a51027" +checksum = "7aae4677183411f6b0b412d66194ef5403293917d66e70ab118f07cc24c5b14d" dependencies = [ "async-trait", "bytes", @@ -2502,9 +2516,9 @@ dependencies = [ [[package]] name = "rusoto_signature" -version = "0.47.0" +version = "0.48.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6264e93384b90a747758bcc82079711eacf2e755c3a8b5091687b5349d870bcc" +checksum = 
"a5ae95491c8b4847931e291b151127eccd6ff8ca13f33603eb3d0035ecb05272" dependencies = [ "base64", "bytes", @@ -2611,8 +2625,7 @@ dependencies = [ "postgres-protocol", "postgres_ffi", "regex", - "rusoto_core", - "rusoto_s3", + "remote_storage", "serde", "serde_json", "serde_with", @@ -3275,9 +3288,9 @@ dependencies = [ [[package]] name = "tonic" -version = "0.7.1" +version = "0.7.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "30fb54bf1e446f44d870d260d99957e7d11fb9d0a0f5bd1a662ad1411cc103f9" +checksum = "5be9d60db39854b30b835107500cf0aca0b0d14d6e1c3de124217c23a29c2ddb" dependencies = [ "async-stream", "async-trait", @@ -3307,9 +3320,9 @@ dependencies = [ [[package]] name = "tonic-build" -version = "0.7.1" +version = "0.7.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "c03447cdc9eaf8feffb6412dcb27baf2db11669a6c4789f29da799aabfb99547" +checksum = "d9263bf4c9bfaae7317c1c2faf7f18491d2fe476f70c414b73bf5d445b00ffa1" dependencies = [ "prettyplease", "proc-macro2", @@ -3805,7 +3818,13 @@ dependencies = [ "clap 2.34.0", "either", "fail", + "futures-channel", + "futures-task", + "futures-util", + "generic-array", "hashbrown", + "hex", + "hyper", "indexmap", "itoa 0.4.8", "libc", diff --git a/control_plane/src/storage.rs b/control_plane/src/storage.rs index 3a63bf6960..adb924d430 100644 --- a/control_plane/src/storage.rs +++ b/control_plane/src/storage.rs @@ -186,8 +186,6 @@ impl PageServerNode { ); io::stdout().flush().unwrap(); - let mut cmd = Command::new(self.env.pageserver_bin()?); - let repo_path = self.repo_path(); let mut args = vec!["-D", repo_path.to_str().unwrap()]; @@ -195,9 +193,11 @@ impl PageServerNode { args.extend(["-c", config_override]); } - fill_rust_env_vars(cmd.args(&args).arg("--daemonize")); + let mut cmd = Command::new(self.env.pageserver_bin()?); + let mut filled_cmd = fill_rust_env_vars(cmd.args(&args).arg("--daemonize")); + filled_cmd = fill_aws_secrets_vars(filled_cmd); - if 
!cmd.status()?.success() { + if !filled_cmd.status()?.success() { bail!( "Pageserver failed to start. See '{}' for details.", self.repo_path().join("pageserver.log").display() @@ -457,3 +457,12 @@ impl PageServerNode { Ok(timeline_info_response) } } + +fn fill_aws_secrets_vars(mut cmd: &mut Command) -> &mut Command { + for env_key in ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"] { + if let Ok(value) = std::env::var(env_key) { + cmd = cmd.env(env_key, value); + } + } + cmd +} diff --git a/docs/settings.md b/docs/settings.md index b3925528cd..017d349bb6 100644 --- a/docs/settings.md +++ b/docs/settings.md @@ -6,7 +6,6 @@ If there's no such file during `init` phase of the server, it creates the file i There's a possibility to pass an arbitrary config value to the pageserver binary as an argument: such values override the values in the config file, if any are specified for the same key and get into the final config during init phase. - ### Config example ```toml @@ -35,9 +34,9 @@ Yet, it validates the config values it can (e.g. postgres install dir) and error Note the `[remote_storage]` section: it's a [table](https://toml.io/en/v1.0.0#table) in TOML specification and -* either has to be placed in the config after the table-less values such as `initial_superuser_name = 'zenith_admin'` +- either has to be placed in the config after the table-less values such as `initial_superuser_name = 'zenith_admin'` -* or can be placed anywhere if rewritten in identical form as [inline table](https://toml.io/en/v1.0.0#inline-table): `remote_storage = {foo = 2}` +- or can be placed anywhere if rewritten in identical form as [inline table](https://toml.io/en/v1.0.0#inline-table): `remote_storage = {foo = 2}` ### Config values @@ -57,7 +56,7 @@ but it will trigger a checkpoint operation to get it back below the limit. `checkpoint_distance` also determines how much WAL needs to be kept -durable in the safekeeper. The safekeeper must have capacity to hold +durable in the safekeeper. 
The safekeeper must have capacity to hold this much WAL, with some headroom, otherwise you can get stuck in a situation where the safekeeper is full and stops accepting new WAL, but the pageserver is not flushing out and releasing the space in the @@ -72,7 +71,7 @@ The unit is # of bytes. Every `compaction_period` seconds, the page server checks if maintenance operations, like compaction, are needed on the layer -files. Default is 1 s, which should be fine. +files. Default is 1 s, which should be fine. #### compaction_target_size @@ -163,16 +162,12 @@ bucket_region = 'eu-north-1' # Optional, pageserver uses entire bucket if the prefix is not specified. prefix_in_bucket = '/some/prefix/' -# Access key to connect to the bucket ("login" part of the credentials) -access_key_id = 'SOMEKEYAAAAASADSAH*#' - -# Secret access key to connect to the bucket ("password" part of the credentials) -secret_access_key = 'SOMEsEcReTsd292v' - # S3 API query limit to avoid getting errors/throttling from AWS. concurrency_limit = 100 ``` +If no IAM bucket access is used during the remote storage usage, use the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables to set the access credentials. + ###### General remote storage configuration Pagesever allows only one remote storage configured concurrently and errors if parameters from multiple different remote configurations are used. @@ -183,13 +178,12 @@ Besides, there are parameters common for all types of remote storage that can be ```toml [remote_storage] # Max number of concurrent timeline synchronized (layers uploaded or downloaded) with the remote storage at the same time. -max_concurrent_timelines_sync = 50 +max_concurrent_syncs = 50 # Max number of errors a single task can have before it's considered failed and not attempted to run anymore. 
max_sync_errors = 10 ``` - ## safekeeper TODO diff --git a/libs/remote_storage/Cargo.toml b/libs/remote_storage/Cargo.toml new file mode 100644 index 0000000000..291f6e50ac --- /dev/null +++ b/libs/remote_storage/Cargo.toml @@ -0,0 +1,20 @@ +[package] +name = "remote_storage" +version = "0.1.0" +edition = "2021" + +[dependencies] +anyhow = { version = "1.0", features = ["backtrace"] } +tokio = { version = "1.17", features = ["sync", "macros", "fs", "io-util"] } +tokio-util = { version = "0.7", features = ["io"] } +tracing = "0.1.27" +rusoto_core = "0.48" +rusoto_s3 = "0.48" +serde = { version = "1.0", features = ["derive"] } +serde_json = "1" +async-trait = "0.1" + +workspace_hack = { version = "0.1", path = "../../workspace_hack" } + +[dev-dependencies] +tempfile = "3.2" diff --git a/libs/remote_storage/src/lib.rs b/libs/remote_storage/src/lib.rs new file mode 100644 index 0000000000..9bbb855dd5 --- /dev/null +++ b/libs/remote_storage/src/lib.rs @@ -0,0 +1,232 @@ +//! A set of generic storage abstractions for the page server to use when backing up and restoring its state from the external storage. +//! No other modules from this tree are supposed to be used directly by the external code. +//! +//! [`RemoteStorage`] trait a CRUD-like generic abstraction to use for adapting external storages with a few implementations: +//! * [`local_fs`] allows to use local file system as an external storage +//! * [`s3_bucket`] uses AWS S3 bucket as an external storage +//! +mod local_fs; +mod s3_bucket; + +use std::{ + borrow::Cow, + collections::HashMap, + ffi::OsStr, + num::{NonZeroU32, NonZeroUsize}, + path::{Path, PathBuf}, +}; + +use anyhow::Context; +use tokio::io; +use tracing::info; + +pub use self::{ + local_fs::LocalFs, + s3_bucket::{S3Bucket, S3ObjectKey}, +}; + +/// How many different timelines can be processed simultaneously when synchronizing layers with the remote storage. 
+/// During regular work, pageserver produces one layer file per timeline checkpoint, with bursts of concurrency +/// during start (where local and remote timelines are compared and initial sync tasks are scheduled) and timeline attach. +/// Both cases may trigger timeline download, that might download a lot of layers. This concurrency is limited by the clients internally, if needed. +pub const DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNCS: usize = 50; +pub const DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS: u32 = 10; +/// Currently, sync happens with AWS S3, that has two limits on requests per second: +/// ~200 RPS for IAM services +/// https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/UsingWithRDS.IAMDBAuth.html +/// ~3500 PUT/COPY/POST/DELETE or 5500 GET/HEAD S3 requests +/// https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/ +pub const DEFAULT_REMOTE_STORAGE_S3_CONCURRENCY_LIMIT: usize = 100; + +/// Storage (potentially remote) API to manage its state. +/// This storage tries to be unaware of any layered repository context, +/// providing basic CRUD operations for storage files. +#[async_trait::async_trait] +pub trait RemoteStorage: Send + Sync { + /// A way to uniquely reference a file in the remote storage. + type RemoteObjectId; + + /// Attempts to derive the storage path out of the local path, if the latter is correct. + fn remote_object_id(&self, local_path: &Path) -> anyhow::Result; + + /// Gets the download path of the given storage file. + fn local_path(&self, remote_object_id: &Self::RemoteObjectId) -> anyhow::Result; + + /// Lists all items the storage has right now. + async fn list(&self) -> anyhow::Result>; + + /// Streams the local file contents into remote into the remote storage entry. 
+ async fn upload( + &self, + from: impl io::AsyncRead + Unpin + Send + Sync + 'static, + // S3 PUT request requires the content length to be specified, + // otherwise it starts to fail with the concurrent connection count increasing. + from_size_bytes: usize, + to: &Self::RemoteObjectId, + metadata: Option, + ) -> anyhow::Result<()>; + + /// Streams the remote storage entry contents into the buffered writer given, returns the filled writer. + /// Returns the metadata, if any was stored with the file previously. + async fn download( + &self, + from: &Self::RemoteObjectId, + to: &mut (impl io::AsyncWrite + Unpin + Send + Sync), + ) -> anyhow::Result>; + + /// Streams a given byte range of the remote storage entry contents into the buffered writer given, returns the filled writer. + /// Returns the metadata, if any was stored with the file previously. + async fn download_byte_range( + &self, + from: &Self::RemoteObjectId, + start_inclusive: u64, + end_exclusive: Option, + to: &mut (impl io::AsyncWrite + Unpin + Send + Sync), + ) -> anyhow::Result>; + + async fn delete(&self, path: &Self::RemoteObjectId) -> anyhow::Result<()>; +} + +/// TODO kb +pub enum GenericRemoteStorage { + Local(LocalFs), + S3(S3Bucket), +} + +impl GenericRemoteStorage { + pub fn new( + working_directory: PathBuf, + storage_config: &RemoteStorageConfig, + ) -> anyhow::Result { + match &storage_config.storage { + RemoteStorageKind::LocalFs(root) => { + info!("Using fs root '{}' as a remote storage", root.display()); + LocalFs::new(root.clone(), working_directory).map(GenericRemoteStorage::Local) + } + RemoteStorageKind::AwsS3(s3_config) => { + info!("Using s3 bucket '{}' in region '{}' as a remote storage, prefix in bucket: '{:?}', bucket endpoint: '{:?}'", + s3_config.bucket_name, s3_config.bucket_region, s3_config.prefix_in_bucket, s3_config.endpoint); + S3Bucket::new(s3_config, working_directory).map(GenericRemoteStorage::S3) + } + } + } +} + +/// Extra set of key-value pairs that contain 
arbitrary metadata about the storage entry. +/// Immutable, cannot be changed once the file is created. +#[derive(Debug, Clone, PartialEq, Eq)] +pub struct StorageMetadata(HashMap); + +fn strip_path_prefix<'a>(prefix: &'a Path, path: &'a Path) -> anyhow::Result<&'a Path> { + if prefix == path { + anyhow::bail!( + "Prefix and the path are equal, cannot strip: '{}'", + prefix.display() + ) + } else { + path.strip_prefix(prefix).with_context(|| { + format!( + "Path '{}' is not prefixed with '{}'", + path.display(), + prefix.display(), + ) + }) + } +} + +/// External backup storage configuration, enough for creating a client for that storage. +#[derive(Debug, Clone, PartialEq, Eq)] +pub struct RemoteStorageConfig { + /// Max allowed number of concurrent sync operations between the API user and the remote storage. + pub max_concurrent_syncs: NonZeroUsize, + /// Max allowed errors before the sync task is considered failed and evicted. + pub max_sync_errors: NonZeroU32, + /// The storage connection configuration. + pub storage: RemoteStorageKind, +} + +/// A kind of a remote storage to connect to, with its connection configuration. +#[derive(Debug, Clone, PartialEq, Eq)] +pub enum RemoteStorageKind { + /// Storage based on local file system. + /// Specify a root folder to place all stored files into. + LocalFs(PathBuf), + /// AWS S3 based storage, storing all files in the S3 bucket + /// specified by the config + AwsS3(S3Config), +} + +/// AWS S3 bucket coordinates and access credentials to manage the bucket contents (read and write). +#[derive(Clone, PartialEq, Eq)] +pub struct S3Config { + /// Name of the bucket to connect to. + pub bucket_name: String, + /// The region where the bucket is located at. + pub bucket_region: String, + /// A "subfolder" in the bucket, to use the same bucket separately by multiple remote storage users at once. + pub prefix_in_bucket: Option, + /// A base URL to send S3 requests to. 
+ /// By default, the endpoint is derived from a region name, assuming it's + /// an AWS S3 region name, erroring on wrong region name. + /// Endpoint provides a way to support other S3 flavors and their regions. + /// + /// Example: `http://127.0.0.1:5000` + pub endpoint: Option, + /// AWS S3 has various limits on its API calls, we need not to exceed those. + /// See [`DEFAULT_REMOTE_STORAGE_S3_CONCURRENCY_LIMIT`] for more details. + pub concurrency_limit: NonZeroUsize, +} + +impl std::fmt::Debug for S3Config { + fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { + f.debug_struct("S3Config") + .field("bucket_name", &self.bucket_name) + .field("bucket_region", &self.bucket_region) + .field("prefix_in_bucket", &self.prefix_in_bucket) + .field("concurrency_limit", &self.concurrency_limit) + .finish() + } +} + +pub fn path_with_suffix_extension(original_path: impl AsRef, suffix: &str) -> PathBuf { + let new_extension = match original_path + .as_ref() + .extension() + .map(OsStr::to_string_lossy) + { + Some(extension) => Cow::Owned(format!("{extension}.{suffix}")), + None => Cow::Borrowed(suffix), + }; + original_path + .as_ref() + .with_extension(new_extension.as_ref()) +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn test_path_with_suffix_extension() { + let p = PathBuf::from("/foo/bar"); + assert_eq!( + &path_with_suffix_extension(&p, "temp").to_string_lossy(), + "/foo/bar.temp" + ); + let p = PathBuf::from("/foo/bar"); + assert_eq!( + &path_with_suffix_extension(&p, "temp.temp").to_string_lossy(), + "/foo/bar.temp.temp" + ); + let p = PathBuf::from("/foo/bar.baz"); + assert_eq!( + &path_with_suffix_extension(&p, "temp.temp").to_string_lossy(), + "/foo/bar.baz.temp.temp" + ); + let p = PathBuf::from("/foo/bar.baz"); + assert_eq!( + &path_with_suffix_extension(&p, ".temp").to_string_lossy(), + "/foo/bar.baz..temp" + ); + } +} diff --git a/pageserver/src/remote_storage/local_fs.rs b/libs/remote_storage/src/local_fs.rs similarity index 
81% rename from pageserver/src/remote_storage/local_fs.rs rename to libs/remote_storage/src/local_fs.rs index 6772a4fbd6..50243352ee 100644 --- a/pageserver/src/remote_storage/local_fs.rs +++ b/libs/remote_storage/src/local_fs.rs @@ -1,7 +1,7 @@ //! Local filesystem acting as a remote storage. -//! Multiple pageservers can use the same "storage" of this kind by using different storage roots. +//! Multiple API users can use the same "storage" of this kind by using different storage roots. //! -//! This storage used in pageserver tests, but can also be used in cases when a certain persistent +//! This storage used in tests, but can also be used in cases when a certain persistent //! volume is mounted to the local FS. use std::{ @@ -17,18 +17,18 @@ use tokio::{ }; use tracing::*; -use crate::remote_storage::storage_sync::path_with_suffix_extension; +use crate::path_with_suffix_extension; use super::{strip_path_prefix, RemoteStorage, StorageMetadata}; pub struct LocalFs { - pageserver_workdir: &'static Path, - root: PathBuf, + working_directory: PathBuf, + storage_root: PathBuf, } impl LocalFs { /// Attempts to create local FS storage, along with its root directory. 
- pub fn new(root: PathBuf, pageserver_workdir: &'static Path) -> anyhow::Result { + pub fn new(root: PathBuf, working_directory: PathBuf) -> anyhow::Result { if !root.exists() { std::fs::create_dir_all(&root).with_context(|| { format!( @@ -38,15 +38,15 @@ impl LocalFs { })?; } Ok(Self { - pageserver_workdir, - root, + working_directory, + storage_root: root, }) } fn resolve_in_storage(&self, path: &Path) -> anyhow::Result { if path.is_relative() { - Ok(self.root.join(path)) - } else if path.starts_with(&self.root) { + Ok(self.storage_root.join(path)) + } else if path.starts_with(&self.storage_root) { Ok(path.to_path_buf()) } else { bail!( @@ -85,30 +85,30 @@ impl LocalFs { #[async_trait::async_trait] impl RemoteStorage for LocalFs { - type StoragePath = PathBuf; + type RemoteObjectId = PathBuf; - fn storage_path(&self, local_path: &Path) -> anyhow::Result { - Ok(self.root.join( - strip_path_prefix(self.pageserver_workdir, local_path) + fn remote_object_id(&self, local_path: &Path) -> anyhow::Result { + Ok(self.storage_root.join( + strip_path_prefix(&self.working_directory, local_path) .context("local path does not belong to this storage")?, )) } - fn local_path(&self, storage_path: &Self::StoragePath) -> anyhow::Result { - let relative_path = strip_path_prefix(&self.root, storage_path) + fn local_path(&self, storage_path: &Self::RemoteObjectId) -> anyhow::Result { + let relative_path = strip_path_prefix(&self.storage_root, storage_path) .context("local path does not belong to this storage")?; - Ok(self.pageserver_workdir.join(relative_path)) + Ok(self.working_directory.join(relative_path)) } - async fn list(&self) -> anyhow::Result> { - get_all_files(&self.root).await + async fn list(&self) -> anyhow::Result> { + get_all_files(&self.storage_root).await } async fn upload( &self, from: impl io::AsyncRead + Unpin + Send + Sync + 'static, from_size_bytes: usize, - to: &Self::StoragePath, + to: &Self::RemoteObjectId, metadata: Option, ) -> anyhow::Result<()> { let 
target_file_path = self.resolve_in_storage(to)?; @@ -194,7 +194,7 @@ impl RemoteStorage for LocalFs { async fn download( &self, - from: &Self::StoragePath, + from: &Self::RemoteObjectId, to: &mut (impl io::AsyncWrite + Unpin + Send + Sync), ) -> anyhow::Result> { let file_path = self.resolve_in_storage(from)?; @@ -229,9 +229,9 @@ impl RemoteStorage for LocalFs { } } - async fn download_range( + async fn download_byte_range( &self, - from: &Self::StoragePath, + from: &Self::RemoteObjectId, start_inclusive: u64, end_exclusive: Option, to: &mut (impl io::AsyncWrite + Unpin + Send + Sync), @@ -288,7 +288,7 @@ impl RemoteStorage for LocalFs { } } - async fn delete(&self, path: &Self::StoragePath) -> anyhow::Result<()> { + async fn delete(&self, path: &Self::RemoteObjectId) -> anyhow::Result<()> { let file_path = self.resolve_in_storage(path)?; if file_path.exists() && file_path.is_file() { Ok(fs::remove_file(file_path).await?) @@ -354,29 +354,30 @@ async fn create_target_directory(target_file_path: &Path) -> anyhow::Result<()> #[cfg(test)] mod pure_tests { - use crate::{ - layered_repository::metadata::METADATA_FILE_NAME, - repository::repo_harness::{RepoHarness, TIMELINE_ID}, - }; + use tempfile::tempdir; use super::*; #[test] fn storage_path_positive() -> anyhow::Result<()> { - let repo_harness = RepoHarness::create("storage_path_positive")?; + let workdir = tempdir()?.path().to_owned(); + let storage_root = PathBuf::from("somewhere").join("else"); let storage = LocalFs { - pageserver_workdir: &repo_harness.conf.workdir, - root: storage_root.clone(), + working_directory: workdir.clone(), + storage_root: storage_root.clone(), }; - let local_path = repo_harness.timeline_path(&TIMELINE_ID).join("file_name"); - let expected_path = storage_root.join(local_path.strip_prefix(&repo_harness.conf.workdir)?); + let local_path = workdir + .join("timelines") + .join("some_timeline") + .join("file_name"); + let expected_path = storage_root.join(local_path.strip_prefix(&workdir)?); 
assert_eq!( expected_path, - storage.storage_path(&local_path).expect("Matching path should map to storage path normally"), - "File paths from pageserver workdir should be stored in local fs storage with the same path they have relative to the workdir" + storage.remote_object_id(&local_path).expect("Matching path should map to storage path normally"), + "File paths from workdir should be stored in local fs storage with the same path they have relative to the workdir" ); Ok(()) @@ -386,7 +387,7 @@ mod pure_tests { fn storage_path_negatives() -> anyhow::Result<()> { #[track_caller] fn storage_path_error(storage: &LocalFs, mismatching_path: &Path) -> String { - match storage.storage_path(mismatching_path) { + match storage.remote_object_id(mismatching_path) { Ok(wrong_path) => panic!( "Expected path '{}' to error, but got storage path: {:?}", mismatching_path.display(), @@ -396,16 +397,16 @@ mod pure_tests { } } - let repo_harness = RepoHarness::create("storage_path_negatives")?; + let workdir = tempdir()?.path().to_owned(); let storage_root = PathBuf::from("somewhere").join("else"); let storage = LocalFs { - pageserver_workdir: &repo_harness.conf.workdir, - root: storage_root, + working_directory: workdir.clone(), + storage_root, }; - let error_string = storage_path_error(&storage, &repo_harness.conf.workdir); + let error_string = storage_path_error(&storage, &workdir); assert!(error_string.contains("does not belong to this storage")); - assert!(error_string.contains(repo_harness.conf.workdir.to_str().unwrap())); + assert!(error_string.contains(workdir.to_str().unwrap())); let mismatching_path_str = "/something/else"; let error_message = storage_path_error(&storage, Path::new(mismatching_path_str)); @@ -414,7 +415,7 @@ mod pure_tests { "Error should mention wrong path" ); assert!( - error_message.contains(repo_harness.conf.workdir.to_str().unwrap()), + error_message.contains(workdir.to_str().unwrap()), "Error should mention server workdir" ); 
assert!(error_message.contains("does not belong to this storage")); @@ -424,29 +425,28 @@ mod pure_tests { #[test] fn local_path_positive() -> anyhow::Result<()> { - let repo_harness = RepoHarness::create("local_path_positive")?; + let workdir = tempdir()?.path().to_owned(); let storage_root = PathBuf::from("somewhere").join("else"); let storage = LocalFs { - pageserver_workdir: &repo_harness.conf.workdir, - root: storage_root.clone(), + working_directory: workdir.clone(), + storage_root: storage_root.clone(), }; let name = "not a metadata"; - let local_path = repo_harness.timeline_path(&TIMELINE_ID).join(name); + let local_path = workdir.join("timelines").join("some_timeline").join(name); assert_eq!( local_path, storage - .local_path( - &storage_root.join(local_path.strip_prefix(&repo_harness.conf.workdir)?) - ) + .local_path(&storage_root.join(local_path.strip_prefix(&workdir)?)) .expect("For a valid input, valid local path should be parsed"), "Should be able to parse metadata out of the correctly named remote delta file" ); - let local_metadata_path = repo_harness - .timeline_path(&TIMELINE_ID) - .join(METADATA_FILE_NAME); - let remote_metadata_path = storage.storage_path(&local_metadata_path)?; + let local_metadata_path = workdir + .join("timelines") + .join("some_timeline") + .join("metadata"); + let remote_metadata_path = storage.remote_object_id(&local_metadata_path)?; assert_eq!( local_metadata_path, storage @@ -472,11 +472,10 @@ mod pure_tests { } } - let repo_harness = RepoHarness::create("local_path_negatives")?; let storage_root = PathBuf::from("somewhere").join("else"); let storage = LocalFs { - pageserver_workdir: &repo_harness.conf.workdir, - root: storage_root, + working_directory: tempdir()?.path().to_owned(), + storage_root, }; let totally_wrong_path = "wrong_wrong_wrong"; @@ -488,16 +487,19 @@ mod pure_tests { #[test] fn download_destination_matches_original_path() -> anyhow::Result<()> { - let repo_harness = 
RepoHarness::create("download_destination_matches_original_path")?; - let original_path = repo_harness.timeline_path(&TIMELINE_ID).join("some name"); + let workdir = tempdir()?.path().to_owned(); + let original_path = workdir + .join("timelines") + .join("some_timeline") + .join("some name"); let storage_root = PathBuf::from("somewhere").join("else"); let dummy_storage = LocalFs { - pageserver_workdir: &repo_harness.conf.workdir, - root: storage_root, + working_directory: workdir, + storage_root, }; - let storage_path = dummy_storage.storage_path(&original_path)?; + let storage_path = dummy_storage.remote_object_id(&original_path)?; let download_destination = dummy_storage.local_path(&storage_path)?; assert_eq!( @@ -512,18 +514,17 @@ mod pure_tests { #[cfg(test)] mod fs_tests { use super::*; - use crate::repository::repo_harness::{RepoHarness, TIMELINE_ID}; use std::{collections::HashMap, io::Write}; use tempfile::tempdir; #[tokio::test] async fn upload_file() -> anyhow::Result<()> { - let repo_harness = RepoHarness::create("upload_file")?; + let workdir = tempdir()?.path().to_owned(); let storage = create_storage()?; let (file, size) = create_file_for_upload( - &storage.pageserver_workdir.join("whatever"), + &storage.working_directory.join("whatever"), "whatever_contents", ) .await?; @@ -538,14 +539,14 @@ mod fs_tests { } assert!(storage.list().await?.is_empty()); - let target_path_1 = upload_dummy_file(&repo_harness, &storage, "upload_1", None).await?; + let target_path_1 = upload_dummy_file(&workdir, &storage, "upload_1", None).await?; assert_eq!( storage.list().await?, vec![target_path_1.clone()], "Should list a single file after first upload" ); - let target_path_2 = upload_dummy_file(&repo_harness, &storage, "upload_2", None).await?; + let target_path_2 = upload_dummy_file(&workdir, &storage, "upload_2", None).await?; assert_eq!( list_files_sorted(&storage).await?, vec![target_path_1.clone(), target_path_2.clone()], @@ -556,17 +557,16 @@ mod fs_tests { } fn 
create_storage() -> anyhow::Result<LocalFs> { - let pageserver_workdir = Box::leak(Box::new(tempdir()?.path().to_owned())); - let storage = LocalFs::new(tempdir()?.path().to_owned(), pageserver_workdir)?; - Ok(storage) + LocalFs::new(tempdir()?.path().to_owned(), tempdir()?.path().to_owned()) } #[tokio::test] async fn download_file() -> anyhow::Result<()> { - let repo_harness = RepoHarness::create("download_file")?; + let workdir = tempdir()?.path().to_owned(); + let storage = create_storage()?; let upload_name = "upload_1"; - let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name, None).await?; + let upload_target = upload_dummy_file(&workdir, &storage, upload_name, None).await?; let mut content_bytes = io::BufWriter::new(std::io::Cursor::new(Vec::new())); let metadata = storage.download(&upload_target, &mut content_bytes).await?; @@ -597,14 +597,15 @@ mod fs_tests { #[tokio::test] async fn download_file_range_positive() -> anyhow::Result<()> { - let repo_harness = RepoHarness::create("download_file_range_positive")?; + let workdir = tempdir()?.path().to_owned(); + let storage = create_storage()?; let upload_name = "upload_1"; - let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name, None).await?; + let upload_target = upload_dummy_file(&workdir, &storage, upload_name, None).await?; let mut full_range_bytes = io::BufWriter::new(std::io::Cursor::new(Vec::new())); let metadata = storage - .download_range(&upload_target, 0, None, &mut full_range_bytes) + .download_byte_range(&upload_target, 0, None, &mut full_range_bytes) .await?; assert!( metadata.is_none(), @@ -620,7 +621,7 @@ mod fs_tests { let mut zero_range_bytes = io::BufWriter::new(std::io::Cursor::new(Vec::new())); let same_byte = 1_000_000_000; let metadata = storage - .download_range( + .download_byte_range( &upload_target, same_byte, Some(same_byte + 1), // exclusive end @@ -642,7 +643,7 @@ mod fs_tests { let mut first_part_remote =
io::BufWriter::new(std::io::Cursor::new(Vec::new())); let metadata = storage - .download_range( + .download_byte_range( &upload_target, 0, Some(first_part_local.len() as u64), @@ -664,7 +665,7 @@ mod fs_tests { let mut second_part_remote = io::BufWriter::new(std::io::Cursor::new(Vec::new())); let metadata = storage - .download_range( + .download_byte_range( &upload_target, first_part_local.len() as u64, Some((first_part_local.len() + second_part_local.len()) as u64), @@ -689,16 +690,17 @@ mod fs_tests { #[tokio::test] async fn download_file_range_negative() -> anyhow::Result<()> { - let repo_harness = RepoHarness::create("download_file_range_negative")?; + let workdir = tempdir()?.path().to_owned(); + let storage = create_storage()?; let upload_name = "upload_1"; - let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name, None).await?; + let upload_target = upload_dummy_file(&workdir, &storage, upload_name, None).await?; let start = 10000; let end = 234; assert!(start > end, "Should test an incorrect range"); match storage - .download_range(&upload_target, start, Some(end), &mut io::sink()) + .download_byte_range(&upload_target, start, Some(end), &mut io::sink()) .await { Ok(_) => panic!("Should not allow downloading wrong ranges"), @@ -712,7 +714,7 @@ mod fs_tests { let non_existing_path = PathBuf::from("somewhere").join("else"); match storage - .download_range(&non_existing_path, 1, Some(3), &mut io::sink()) + .download_byte_range(&non_existing_path, 1, Some(3), &mut io::sink()) .await { Ok(_) => panic!("Should not allow downloading non-existing storage file ranges"), @@ -727,10 +729,11 @@ mod fs_tests { #[tokio::test] async fn delete_file() -> anyhow::Result<()> { - let repo_harness = RepoHarness::create("delete_file")?; + let workdir = tempdir()?.path().to_owned(); + let storage = create_storage()?; let upload_name = "upload_1"; - let upload_target = upload_dummy_file(&repo_harness, &storage, upload_name, None).await?; + let upload_target = 
upload_dummy_file(&workdir, &storage, upload_name, None).await?; storage.delete(&upload_target).await?; assert!(storage.list().await?.is_empty()); @@ -748,7 +751,8 @@ mod fs_tests { #[tokio::test] async fn file_with_metadata() -> anyhow::Result<()> { - let repo_harness = RepoHarness::create("download_file")?; + let workdir = tempdir()?.path().to_owned(); + let storage = create_storage()?; let upload_name = "upload_1"; let metadata = StorageMetadata(HashMap::from([ @@ -756,7 +760,7 @@ mod fs_tests { ("two".to_string(), "2".to_string()), ])); let upload_target = - upload_dummy_file(&repo_harness, &storage, upload_name, Some(metadata.clone())).await?; + upload_dummy_file(&workdir, &storage, upload_name, Some(metadata.clone())).await?; let mut content_bytes = io::BufWriter::new(std::io::Cursor::new(Vec::new())); let full_download_metadata = storage.download(&upload_target, &mut content_bytes).await?; @@ -780,7 +784,7 @@ mod fs_tests { let mut first_part_remote = io::BufWriter::new(std::io::Cursor::new(Vec::new())); let partial_download_metadata = storage - .download_range( + .download_byte_range( &upload_target, 0, Some(first_part_local.len() as u64), @@ -805,16 +809,16 @@ mod fs_tests { } async fn upload_dummy_file( - harness: &RepoHarness<'_>, + workdir: &Path, storage: &LocalFs, name: &str, metadata: Option<StorageMetadata>, ) -> anyhow::Result<PathBuf> { - let timeline_path = harness.timeline_path(&TIMELINE_ID); - let relative_timeline_path = timeline_path.strip_prefix(&harness.conf.workdir)?; - let storage_path = storage.root.join(relative_timeline_path).join(name); + let timeline_path = workdir.join("timelines").join("some_timeline"); + let relative_timeline_path = timeline_path.strip_prefix(&workdir)?; + let storage_path = storage.storage_root.join(relative_timeline_path).join(name); - let from_path = storage.pageserver_workdir.join(name); + let from_path = storage.working_directory.join(name); let (file, size) = create_file_for_upload(&from_path, &dummy_contents(name)).await?;
storage.upload(file, size, &storage_path, metadata).await?; Ok(storage_path) diff --git a/pageserver/src/remote_storage/s3_bucket.rs b/libs/remote_storage/src/s3_bucket.rs similarity index 74% rename from pageserver/src/remote_storage/s3_bucket.rs rename to libs/remote_storage/src/s3_bucket.rs index 73d828d150..01aaf7ca7e 100644 --- a/pageserver/src/remote_storage/s3_bucket.rs +++ b/libs/remote_storage/src/s3_bucket.rs @@ -1,7 +1,7 @@ //! AWS S3 storage wrapper around `rusoto` library. //! //! Respects `prefix_in_bucket` property from [`S3Config`], -//! allowing multiple pageservers to independently work with the same S3 bucket, if +//! allowing multiple api users to independently work with the same S3 bucket, if //! their bucket prefixes are both specified and different. use std::path::{Path, PathBuf}; @@ -19,16 +19,13 @@ use tokio::{io, sync::Semaphore}; use tokio_util::io::ReaderStream; use tracing::debug; -use crate::{ - config::S3Config, - remote_storage::{strip_path_prefix, RemoteStorage}, -}; +use crate::{strip_path_prefix, RemoteStorage, S3Config}; use super::StorageMetadata; -const S3_FILE_SEPARATOR: char = '/'; +const S3_PREFIX_SEPARATOR: char = '/'; -#[derive(Debug, Eq, PartialEq)] +#[derive(Debug, Eq, PartialEq, PartialOrd, Ord, Hash)] pub struct S3ObjectKey(String); impl S3ObjectKey { @@ -36,11 +33,7 @@ impl S3ObjectKey { &self.0 } - fn download_destination( - &self, - pageserver_workdir: &Path, - prefix_to_strip: Option<&str>, - ) -> PathBuf { + fn download_destination(&self, workdir: &Path, prefix_to_strip: Option<&str>) -> PathBuf { let path_without_prefix = match prefix_to_strip { Some(prefix) => self.0.strip_prefix(prefix).unwrap_or_else(|| { panic!( @@ -51,9 +44,9 @@ impl S3ObjectKey { None => &self.0, }; - pageserver_workdir.join( + workdir.join( path_without_prefix - .split(S3_FILE_SEPARATOR) + .split(S3_PREFIX_SEPARATOR) .collect::<PathBuf>(), ) } @@ -61,7 +54,7 @@ impl S3ObjectKey { /// AWS S3 storage.
pub struct S3Bucket { - pageserver_workdir: &'static Path, + workdir: PathBuf, client: S3Client, bucket_name: String, prefix_in_bucket: Option<String>, @@ -73,7 +66,7 @@ pub struct S3Bucket { impl S3Bucket { /// Creates the S3 storage, errors if incorrect AWS S3 configuration provided. - pub fn new(aws_config: &S3Config, pageserver_workdir: &'static Path) -> anyhow::Result<Self> { + pub fn new(aws_config: &S3Config, workdir: PathBuf) -> anyhow::Result<Self> { debug!( "Creating s3 remote storage for S3 bucket {}", aws_config.bucket_name @@ -89,8 +82,11 @@ impl S3Bucket { .context("Failed to parse the s3 region from config")?, }; let request_dispatcher = HttpClient::new().context("Failed to create S3 http client")?; - let client = if aws_config.access_key_id.is_none() && aws_config.secret_access_key.is_none() - { + + let access_key_id = std::env::var("AWS_ACCESS_KEY_ID").ok(); + let secret_access_key = std::env::var("AWS_SECRET_ACCESS_KEY").ok(); + + let client = if access_key_id.is_none() && secret_access_key.is_none() { debug!("Using IAM-based AWS access"); S3Client::new_with(request_dispatcher, InstanceMetadataProvider::new(), region) } else { @@ -98,8 +94,8 @@ impl S3Bucket { S3Client::new_with( request_dispatcher, StaticProvider::new_minimal( - aws_config.access_key_id.clone().unwrap_or_default(), - aws_config.secret_access_key.clone().unwrap_or_default(), + access_key_id.unwrap_or_default(), + secret_access_key.unwrap_or_default(), ), region, ) @@ -107,12 +103,12 @@ impl S3Bucket { let prefix_in_bucket = aws_config.prefix_in_bucket.as_deref().map(|prefix| { let mut prefix = prefix; - while prefix.starts_with(S3_FILE_SEPARATOR) { + while prefix.starts_with(S3_PREFIX_SEPARATOR) { prefix = &prefix[1..]
} let mut prefix = prefix.to_string(); - while prefix.ends_with(S3_FILE_SEPARATOR) { + while prefix.ends_with(S3_PREFIX_SEPARATOR) { prefix.pop(); } prefix @@ -120,7 +116,7 @@ impl S3Bucket { Ok(Self { client, - pageserver_workdir, + workdir, bucket_name: aws_config.bucket_name.clone(), prefix_in_bucket, concurrency_limiter: Semaphore::new(aws_config.concurrency_limit.get()), @@ -130,24 +126,23 @@ impl S3Bucket { #[async_trait::async_trait] impl RemoteStorage for S3Bucket { - type StoragePath = S3ObjectKey; + type RemoteObjectId = S3ObjectKey; - fn storage_path(&self, local_path: &Path) -> anyhow::Result<Self::StoragePath> { - let relative_path = strip_path_prefix(self.pageserver_workdir, local_path)?; + fn remote_object_id(&self, local_path: &Path) -> anyhow::Result<Self::RemoteObjectId> { + let relative_path = strip_path_prefix(&self.workdir, local_path)?; let mut key = self.prefix_in_bucket.clone().unwrap_or_default(); for segment in relative_path { - key.push(S3_FILE_SEPARATOR); + key.push(S3_PREFIX_SEPARATOR); key.push_str(&segment.to_string_lossy()); } Ok(S3ObjectKey(key)) } - fn local_path(&self, storage_path: &Self::StoragePath) -> anyhow::Result<PathBuf> { - Ok(storage_path - .download_destination(self.pageserver_workdir, self.prefix_in_bucket.as_deref())) + fn local_path(&self, storage_path: &Self::RemoteObjectId) -> anyhow::Result<PathBuf> { + Ok(storage_path.download_destination(&self.workdir, self.prefix_in_bucket.as_deref())) } - async fn list(&self) -> anyhow::Result<Vec<Self::StoragePath>> { + async fn list(&self) -> anyhow::Result<Vec<Self::RemoteObjectId>> { let mut document_keys = Vec::new(); let mut continuation_token = None; @@ -187,7 +182,7 @@ impl RemoteStorage for S3Bucket { &self, from: impl io::AsyncRead + Unpin + Send + Sync + 'static, from_size_bytes: usize, - to: &Self::StoragePath, + to: &Self::RemoteObjectId, metadata: Option<StorageMetadata>, ) -> anyhow::Result<()> { let _guard = self @@ -212,7 +207,7 @@ impl RemoteStorage for S3Bucket { async fn download( &self, - from: &Self::StoragePath, + from: &Self::RemoteObjectId, to: &mut (impl io::AsyncWrite +
Unpin + Send + Sync), ) -> anyhow::Result<Option<StorageMetadata>> { let _guard = self @@ -237,9 +232,9 @@ impl RemoteStorage for S3Bucket { Ok(object_output.metadata.map(StorageMetadata)) } - async fn download_range( + async fn download_byte_range( &self, - from: &Self::StoragePath, + from: &Self::RemoteObjectId, start_inclusive: u64, end_exclusive: Option<u64>, to: &mut (impl io::AsyncWrite + Unpin + Send + Sync), @@ -274,7 +269,7 @@ impl RemoteStorage for S3Bucket { Ok(object_output.metadata.map(StorageMetadata)) } - async fn delete(&self, path: &Self::StoragePath) -> anyhow::Result<()> { + async fn delete(&self, path: &Self::RemoteObjectId) -> anyhow::Result<()> { let _guard = self .concurrency_limiter .acquire() @@ -293,34 +288,30 @@ impl RemoteStorage for S3Bucket { #[cfg(test)] mod tests { - use crate::{ - layered_repository::metadata::METADATA_FILE_NAME, - repository::repo_harness::{RepoHarness, TIMELINE_ID}, - }; + use tempfile::tempdir; use super::*; #[test] fn download_destination() -> anyhow::Result<()> { - let repo_harness = RepoHarness::create("download_destination")?; - - let local_path = repo_harness.timeline_path(&TIMELINE_ID).join("test_name"); - let relative_path = local_path.strip_prefix(&repo_harness.conf.workdir)?; + let workdir = tempdir()?.path().to_owned(); + let local_path = workdir.join("one").join("two").join("test_name"); + let relative_path = local_path.strip_prefix(&workdir)?; let key = S3ObjectKey(format!( "{}{}", - S3_FILE_SEPARATOR, + S3_PREFIX_SEPARATOR, relative_path .iter() .map(|segment| segment.to_str().unwrap()) .collect::<Vec<_>>() - .join(&S3_FILE_SEPARATOR.to_string()), + .join(&S3_PREFIX_SEPARATOR.to_string()), )); assert_eq!( local_path, - key.download_destination(&repo_harness.conf.workdir, None), - "Download destination should consist of s3 path joined with the pageserver workdir prefix" + key.download_destination(&workdir, None), + "Download destination should consist of s3 path joined with the workdir prefix" ); Ok(()) @@ -328,24 +319,21 @@ mod tests {
#[test] fn storage_path_positive() -> anyhow::Result<()> { - let repo_harness = RepoHarness::create("storage_path_positive")?; + let workdir = tempdir()?.path().to_owned(); let segment_1 = "matching"; let segment_2 = "file"; - let local_path = &repo_harness.conf.workdir.join(segment_1).join(segment_2); + let local_path = &workdir.join(segment_1).join(segment_2); - let storage = dummy_storage(&repo_harness.conf.workdir); + let storage = dummy_storage(workdir); let expected_key = S3ObjectKey(format!( - "{}{SEPARATOR}{}{SEPARATOR}{}", + "{}{S3_PREFIX_SEPARATOR}{segment_1}{S3_PREFIX_SEPARATOR}{segment_2}", storage.prefix_in_bucket.as_deref().unwrap_or_default(), - segment_1, - segment_2, - SEPARATOR = S3_FILE_SEPARATOR, )); let actual_key = storage - .storage_path(local_path) + .remote_object_id(local_path) .expect("Matching path should map to S3 path normally"); assert_eq!( expected_key, @@ -360,7 +348,7 @@ mod tests { fn storage_path_negatives() -> anyhow::Result<()> { #[track_caller] fn storage_path_error(storage: &S3Bucket, mismatching_path: &Path) -> String { - match storage.storage_path(mismatching_path) { + match storage.remote_object_id(mismatching_path) { Ok(wrong_key) => panic!( "Expected path '{}' to error, but got S3 key: {:?}", mismatching_path.display(), @@ -370,10 +358,10 @@ mod tests { } } - let repo_harness = RepoHarness::create("storage_path_negatives")?; - let storage = dummy_storage(&repo_harness.conf.workdir); + let workdir = tempdir()?.path().to_owned(); + let storage = dummy_storage(workdir.clone()); - let error_message = storage_path_error(&storage, &repo_harness.conf.workdir); + let error_message = storage_path_error(&storage, &workdir); assert!( error_message.contains("Prefix and the path are equal"), "Message '{}' does not contain the required string", @@ -387,7 +375,7 @@ mod tests { "Error should mention wrong path" ); assert!( - error_message.contains(repo_harness.conf.workdir.to_str().unwrap()), + 
error_message.contains(workdir.to_str().unwrap()), "Error should mention server workdir" ); assert!( @@ -401,20 +389,17 @@ mod tests { #[test] fn local_path_positive() -> anyhow::Result<()> { - let repo_harness = RepoHarness::create("local_path_positive")?; - let storage = dummy_storage(&repo_harness.conf.workdir); - let timeline_dir = repo_harness.timeline_path(&TIMELINE_ID); - let relative_timeline_path = timeline_dir.strip_prefix(&repo_harness.conf.workdir)?; + let workdir = tempdir()?.path().to_owned(); + let storage = dummy_storage(workdir.clone()); + let timeline_dir = workdir.join("timelines").join("test_timeline"); + let relative_timeline_path = timeline_dir.strip_prefix(&workdir)?; let s3_key = create_s3_key( &relative_timeline_path.join("not a metadata"), storage.prefix_in_bucket.as_deref(), ); assert_eq!( - s3_key.download_destination( - &repo_harness.conf.workdir, - storage.prefix_in_bucket.as_deref() - ), + s3_key.download_destination(&workdir, storage.prefix_in_bucket.as_deref()), storage .local_path(&s3_key) .expect("For a valid input, valid S3 info should be parsed"), @@ -422,14 +407,11 @@ mod tests { ); let s3_key = create_s3_key( - &relative_timeline_path.join(METADATA_FILE_NAME), + &relative_timeline_path.join("metadata"), storage.prefix_in_bucket.as_deref(), ); assert_eq!( - s3_key.download_destination( - &repo_harness.conf.workdir, - storage.prefix_in_bucket.as_deref() - ), + s3_key.download_destination(&workdir, storage.prefix_in_bucket.as_deref()), storage .local_path(&s3_key) .expect("For a valid input, valid S3 info should be parsed"), @@ -441,12 +423,15 @@ mod tests { #[test] fn download_destination_matches_original_path() -> anyhow::Result<()> { - let repo_harness = RepoHarness::create("download_destination_matches_original_path")?; - let original_path = repo_harness.timeline_path(&TIMELINE_ID).join("some name"); + let workdir = tempdir()?.path().to_owned(); + let original_path = workdir + .join("timelines") + .join("some_timeline") + 
.join("some name"); - let dummy_storage = dummy_storage(&repo_harness.conf.workdir); + let dummy_storage = dummy_storage(workdir); - let key = dummy_storage.storage_path(&original_path)?; + let key = dummy_storage.remote_object_id(&original_path)?; let download_destination = dummy_storage.local_path(&key)?; assert_eq!( @@ -457,9 +442,9 @@ mod tests { Ok(()) } - fn dummy_storage(pageserver_workdir: &'static Path) -> S3Bucket { + fn dummy_storage(workdir: PathBuf) -> S3Bucket { S3Bucket { - pageserver_workdir, + workdir, client: S3Client::new("us-east-1".parse().unwrap()), bucket_name: "dummy-bucket".to_string(), prefix_in_bucket: Some("dummy_prefix/".to_string()), @@ -471,7 +456,7 @@ mod tests { S3ObjectKey(relative_file_path.iter().fold( prefix.unwrap_or_default().to_string(), |mut path_string, segment| { - path_string.push(S3_FILE_SEPARATOR); + path_string.push(S3_PREFIX_SEPARATOR); path_string.push_str(segment.to_str().unwrap()); path_string }, diff --git a/pageserver/Cargo.toml b/pageserver/Cargo.toml index 23c16dd5be..d4cceafc61 100644 --- a/pageserver/Cargo.toml +++ b/pageserver/Cargo.toml @@ -5,7 +5,7 @@ edition = "2021" [features] # It is simpler infra-wise to have failpoints enabled by default -# It shouldnt affect perf in any way because failpoints +# It shouldn't affect perf in any way because failpoints # are not placed in hot code paths default = ["failpoints"] profiling = ["pprof"] @@ -25,7 +25,6 @@ lazy_static = "1.4.0" clap = "3.0" daemonize = "0.4.1" tokio = { version = "1.17", features = ["process", "sync", "macros", "fs", "rt", "io-util", "time"] } -tokio-util = { version = "0.7", features = ["io"] } postgres-types = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" } postgres-protocol = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" } postgres = { git = "https://github.com/zenithdb/rust-postgres.git", 
rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" } @@ -54,13 +53,10 @@ once_cell = "1.8.0" crossbeam-utils = "0.8.5" fail = "0.5.0" -rusoto_core = "0.47" -rusoto_s3 = "0.47" -async-trait = "0.1" - postgres_ffi = { path = "../libs/postgres_ffi" } metrics = { path = "../libs/metrics" } utils = { path = "../libs/utils" } +remote_storage = { path = "../libs/remote_storage" } workspace_hack = { version = "0.1", path = "../workspace_hack" } [dev-dependencies] diff --git a/pageserver/README.md b/pageserver/README.md index 1fd627785c..cf841d1e46 100644 --- a/pageserver/README.md +++ b/pageserver/README.md @@ -135,7 +135,7 @@ The backup service is disabled by default and can be enabled to interact with a CLI examples: * Local FS: `${PAGESERVER_BIN} -c "remote_storage={local_path='/some/local/path/'}"` -* AWS S3 : `${PAGESERVER_BIN} -c "remote_storage={bucket_name='some-sample-bucket',bucket_region='eu-north-1', prefix_in_bucket='/test_prefix/',access_key_id='SOMEKEYAAAAASADSAH*#',secret_access_key='SOMEsEcReTsd292v'}"` +* AWS S3 : `env AWS_ACCESS_KEY_ID='SOMEKEYAAAAASADSAH*#' AWS_SECRET_ACCESS_KEY='SOMEsEcReTsd292v' ${PAGESERVER_BIN} -c "remote_storage={bucket_name='some-sample-bucket',bucket_region='eu-north-1', prefix_in_bucket='/test_prefix/'}"` For Amazon AWS S3, a key id and secret access key could be located in `~/.aws/credentials` if awscli was ever configured to work with the desired bucket, on the AWS Settings page for a certain user. Also note, that the bucket names does not contain any protocols when used on AWS. For local S3 installations, refer to the their documentation for name format and credentials. @@ -155,11 +155,9 @@ or bucket_name = 'some-sample-bucket' bucket_region = 'eu-north-1' prefix_in_bucket = '/test_prefix/' -access_key_id = 'SOMEKEYAAAAASADSAH*#' -secret_access_key = 'SOMEsEcReTsd292v' ``` -Also, `AWS_SECRET_ACCESS_KEY` and `AWS_ACCESS_KEY_ID` variables can be used to specify the credentials instead of any of the ways above. 
+`AWS_SECRET_ACCESS_KEY` and `AWS_ACCESS_KEY_ID` env variables can be used to specify the S3 credentials if needed. TODO: Sharding -------------------- diff --git a/pageserver/src/config.rs b/pageserver/src/config.rs index 14ca976448..5257732c5c 100644 --- a/pageserver/src/config.rs +++ b/pageserver/src/config.rs @@ -5,6 +5,7 @@ //! See also `settings.md` for better description on every parameter. use anyhow::{anyhow, bail, ensure, Context, Result}; +use remote_storage::{RemoteStorageConfig, RemoteStorageKind, S3Config}; use std::env; use std::num::{NonZeroU32, NonZeroUsize}; use std::path::{Path, PathBuf}; @@ -33,18 +34,6 @@ pub mod defaults { pub const DEFAULT_WAL_REDO_TIMEOUT: &str = "60 s"; pub const DEFAULT_SUPERUSER: &str = "zenith_admin"; - /// How many different timelines can be processed simultaneously when synchronizing layers with the remote storage. - /// During regular work, pageserver produces one layer file per timeline checkpoint, with bursts of concurrency - /// during start (where local and remote timelines are compared and initial sync tasks are scheduled) and timeline attach. - /// Both cases may trigger timeline download, that might download a lot of layers. This concurrency is limited by the clients internally, if needed. 
- pub const DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_TIMELINES_SYNC: usize = 50; - pub const DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS: u32 = 10; - /// Currently, sync happens with AWS S3, that has two limits on requests per second: - /// ~200 RPS for IAM services - /// https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/UsingWithRDS.IAMDBAuth.html - /// ~3500 PUT/COPY/POST/DELETE or 5500 GET/HEAD S3 requests - /// https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/ - pub const DEFAULT_REMOTE_STORAGE_S3_CONCURRENCY_LIMIT: usize = 100; pub const DEFAULT_PAGE_CACHE_SIZE: usize = 8192; pub const DEFAULT_MAX_FILE_DESCRIPTORS: usize = 100; @@ -315,67 +304,6 @@ impl PageServerConfigBuilder { } } -/// External backup storage configuration, enough for creating a client for that storage. -#[derive(Debug, Clone, PartialEq, Eq)] -pub struct RemoteStorageConfig { - /// Max allowed number of concurrent sync operations between pageserver and the remote storage. - pub max_concurrent_timelines_sync: NonZeroUsize, - /// Max allowed errors before the sync task is considered failed and evicted. - pub max_sync_errors: NonZeroU32, - /// The storage connection configuration. - pub storage: RemoteStorageKind, -} - -/// A kind of a remote storage to connect to, with its connection configuration. -#[derive(Debug, Clone, PartialEq, Eq)] -pub enum RemoteStorageKind { - /// Storage based on local file system. - /// Specify a root folder to place all stored files into. - LocalFs(PathBuf), - /// AWS S3 based storage, storing all files in the S3 bucket - /// specified by the config - AwsS3(S3Config), -} - -/// AWS S3 bucket coordinates and access credentials to manage the bucket contents (read and write). -#[derive(Clone, PartialEq, Eq)] -pub struct S3Config { - /// Name of the bucket to connect to. - pub bucket_name: String, - /// The region where the bucket is located at. 
- pub bucket_region: String, - /// A "subfolder" in the bucket, to use the same bucket separately by multiple pageservers at once. - pub prefix_in_bucket: Option<String>, - /// "Login" to use when connecting to bucket. - /// Can be empty for cases like AWS k8s IAM - /// where we can allow certain pods to connect - /// to the bucket directly without any credentials. - pub access_key_id: Option<String>, - /// "Password" to use when connecting to bucket. - pub secret_access_key: Option<String>, - /// A base URL to send S3 requests to. - /// By default, the endpoint is derived from a region name, assuming it's - /// an AWS S3 region name, erroring on wrong region name. - /// Endpoint provides a way to support other S3 flavors and their regions. - /// - /// Example: `http://127.0.0.1:5000` - pub endpoint: Option<String>, - /// AWS S3 has various limits on its API calls, we need not to exceed those. - /// See [`defaults::DEFAULT_REMOTE_STORAGE_S3_CONCURRENCY_LIMIT`] for more details. - pub concurrency_limit: NonZeroUsize, -} - -impl std::fmt::Debug for S3Config { - fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { - f.debug_struct("S3Config") - .field("bucket_name", &self.bucket_name) - .field("bucket_region", &self.bucket_region) - .field("prefix_in_bucket", &self.prefix_in_bucket) - .field("concurrency_limit", &self.concurrency_limit) - .finish() - } -} - impl PageServerConf { // // Repository paths, relative to workdir. @@ -523,21 +451,21 @@ impl PageServerConf { let bucket_name = toml.get("bucket_name"); let bucket_region = toml.get("bucket_region"); - let max_concurrent_timelines_sync = NonZeroUsize::new( - parse_optional_integer("max_concurrent_timelines_sync", toml)? - .unwrap_or(defaults::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_TIMELINES_SYNC), + let max_concurrent_syncs = NonZeroUsize::new( + parse_optional_integer("max_concurrent_syncs", toml)?
+ .unwrap_or(remote_storage::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNCS), ) - .context("Failed to parse 'max_concurrent_timelines_sync' as a positive integer")?; + .context("Failed to parse 'max_concurrent_syncs' as a positive integer")?; let max_sync_errors = NonZeroU32::new( parse_optional_integer("max_sync_errors", toml)? - .unwrap_or(defaults::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS), + .unwrap_or(remote_storage::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS), ) .context("Failed to parse 'max_sync_errors' as a positive integer")?; let concurrency_limit = NonZeroUsize::new( parse_optional_integer("concurrency_limit", toml)? - .unwrap_or(defaults::DEFAULT_REMOTE_STORAGE_S3_CONCURRENCY_LIMIT), + .unwrap_or(remote_storage::DEFAULT_REMOTE_STORAGE_S3_CONCURRENCY_LIMIT), ) .context("Failed to parse 'concurrency_limit' as a positive integer")?; @@ -552,16 +480,6 @@ impl PageServerConf { (None, Some(bucket_name), Some(bucket_region)) => RemoteStorageKind::AwsS3(S3Config { bucket_name: parse_toml_string("bucket_name", bucket_name)?, bucket_region: parse_toml_string("bucket_region", bucket_region)?, - access_key_id: toml - .get("access_key_id") - .map(|access_key_id| parse_toml_string("access_key_id", access_key_id)) - .transpose()?, - secret_access_key: toml - .get("secret_access_key") - .map(|secret_access_key| { - parse_toml_string("secret_access_key", secret_access_key) - }) - .transpose()?, prefix_in_bucket: toml .get("prefix_in_bucket") .map(|prefix_in_bucket| parse_toml_string("prefix_in_bucket", prefix_in_bucket)) @@ -579,7 +497,7 @@ impl PageServerConf { }; Ok(RemoteStorageConfig { - max_concurrent_timelines_sync, + max_concurrent_syncs, max_sync_errors, storage, }) @@ -807,11 +725,11 @@ pg_distrib_dir='{}' assert_eq!( parsed_remote_storage_config, RemoteStorageConfig { - max_concurrent_timelines_sync: NonZeroUsize::new( - defaults::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_TIMELINES_SYNC + max_concurrent_syncs: NonZeroUsize::new( + 
remote_storage::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNCS ) .unwrap(), - max_sync_errors: NonZeroU32::new(defaults::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS) + max_sync_errors: NonZeroU32::new(remote_storage::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS) .unwrap(), storage: RemoteStorageKind::LocalFs(local_storage_path.clone()), }, @@ -829,29 +747,25 @@ pg_distrib_dir='{}' let bucket_name = "some-sample-bucket".to_string(); let bucket_region = "eu-north-1".to_string(); let prefix_in_bucket = "test_prefix".to_string(); - let access_key_id = "SOMEKEYAAAAASADSAH*#".to_string(); - let secret_access_key = "SOMEsEcReTsd292v".to_string(); let endpoint = "http://localhost:5000".to_string(); - let max_concurrent_timelines_sync = NonZeroUsize::new(111).unwrap(); + let max_concurrent_syncs = NonZeroUsize::new(111).unwrap(); let max_sync_errors = NonZeroU32::new(222).unwrap(); let s3_concurrency_limit = NonZeroUsize::new(333).unwrap(); let identical_toml_declarations = &[ format!( r#"[remote_storage] -max_concurrent_timelines_sync = {max_concurrent_timelines_sync} +max_concurrent_syncs = {max_concurrent_syncs} max_sync_errors = {max_sync_errors} bucket_name = '{bucket_name}' bucket_region = '{bucket_region}' prefix_in_bucket = '{prefix_in_bucket}' -access_key_id = '{access_key_id}' -secret_access_key = '{secret_access_key}' endpoint = '{endpoint}' concurrency_limit = {s3_concurrency_limit}"# ), format!( - "remote_storage={{max_concurrent_timelines_sync={max_concurrent_timelines_sync}, max_sync_errors={max_sync_errors}, bucket_name='{bucket_name}',\ - bucket_region='{bucket_region}', prefix_in_bucket='{prefix_in_bucket}', access_key_id='{access_key_id}', secret_access_key='{secret_access_key}', endpoint='{endpoint}', concurrency_limit={s3_concurrency_limit}}}", + "remote_storage={{max_concurrent_syncs={max_concurrent_syncs}, max_sync_errors={max_sync_errors}, bucket_name='{bucket_name}',\ + bucket_region='{bucket_region}', prefix_in_bucket='{prefix_in_bucket}', endpoint='{endpoint}', 
concurrency_limit={s3_concurrency_limit}}}", ), ]; @@ -874,13 +788,11 @@ pg_distrib_dir='{}' assert_eq!( parsed_remote_storage_config, RemoteStorageConfig { - max_concurrent_timelines_sync, + max_concurrent_syncs, max_sync_errors, storage: RemoteStorageKind::AwsS3(S3Config { bucket_name: bucket_name.clone(), bucket_region: bucket_region.clone(), - access_key_id: Some(access_key_id.clone()), - secret_access_key: Some(secret_access_key.clone()), prefix_in_bucket: Some(prefix_in_bucket.clone()), endpoint: Some(endpoint.clone()), concurrency_limit: s3_concurrency_limit, diff --git a/pageserver/src/http/routes.rs b/pageserver/src/http/routes.rs index f12e4c4051..8940efbda0 100644 --- a/pageserver/src/http/routes.rs +++ b/pageserver/src/http/routes.rs @@ -3,17 +3,16 @@ use std::sync::Arc; use anyhow::{Context, Result}; use hyper::StatusCode; use hyper::{Body, Request, Response, Uri}; +use remote_storage::GenericRemoteStorage; use tracing::*; use super::models::{ StatusResponse, TenantConfigRequest, TenantCreateRequest, TenantCreateResponse, TimelineCreateRequest, }; -use crate::config::RemoteStorageKind; -use crate::remote_storage::{ - download_index_part, schedule_layer_download, LocalFs, RemoteIndex, RemoteTimeline, S3Bucket, -}; use crate::repository::Repository; +use crate::storage_sync; +use crate::storage_sync::index::{RemoteIndex, RemoteTimeline}; use crate::tenant_config::TenantConfOpt; use crate::timelines::{LocalTimelineInfo, RemoteTimelineInfo, TimelineInfo}; use crate::{config::PageServerConf, tenant_mgr, timelines}; @@ -37,11 +36,6 @@ struct State { remote_storage: Option, } -enum GenericRemoteStorage { - Local(LocalFs), - S3(S3Bucket), -} - impl State { fn new( conf: &'static PageServerConf, @@ -57,14 +51,7 @@ impl State { let remote_storage = conf .remote_storage_config .as_ref() - .map(|storage_config| match &storage_config.storage { - RemoteStorageKind::LocalFs(root) => { - LocalFs::new(root.clone(), &conf.workdir).map(GenericRemoteStorage::Local) - } - 
RemoteStorageKind::AwsS3(s3_config) => { - S3Bucket::new(s3_config, &conf.workdir).map(GenericRemoteStorage::S3) - } - }) + .map(|storage_config| GenericRemoteStorage::new(conf.workdir.clone(), storage_config)) .transpose() .context("Failed to init generic remote storage")?; @@ -273,7 +260,7 @@ async fn timeline_attach_handler(request: Request) -> Result) -> Result index_accessor.add_timeline_entry(sync_id, new_timeline), } - schedule_layer_download(tenant_id, timeline_id); + storage_sync::schedule_layer_download(tenant_id, timeline_id); json_response(StatusCode::ACCEPTED, ()) } @@ -319,10 +306,10 @@ async fn try_download_shard_data( ) -> anyhow::Result> { let shard = match state.remote_storage.as_ref() { Some(GenericRemoteStorage::Local(local_storage)) => { - download_index_part(state.conf, local_storage, sync_id).await + storage_sync::download_index_part(state.conf, local_storage, sync_id).await } Some(GenericRemoteStorage::S3(s3_storage)) => { - download_index_part(state.conf, s3_storage, sync_id).await + storage_sync::download_index_part(state.conf, s3_storage, sync_id).await } None => return Ok(None), } diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 77c01a7c66..da2699b15d 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -34,10 +34,9 @@ use std::time::{Duration, Instant, SystemTime}; use self::metadata::{metadata_path, TimelineMetadata, METADATA_FILE_NAME}; use crate::config::PageServerConf; use crate::keyspace::KeySpace; +use crate::storage_sync::index::RemoteIndex; use crate::tenant_config::{TenantConf, TenantConfOpt}; -use crate::page_cache; -use crate::remote_storage::{self, RemoteIndex}; use crate::repository::{ GcResult, Repository, RepositoryTimeline, Timeline, TimelineSyncStatusUpdate, TimelineWriter, }; @@ -48,6 +47,7 @@ use crate::virtual_file::VirtualFile; use crate::walreceiver::IS_WAL_RECEIVER; use crate::walredo::WalRedoManager; use 
crate::CheckpointConfig; +use crate::{page_cache, storage_sync}; use metrics::{ register_histogram_vec, register_int_counter, register_int_counter_vec, register_int_gauge_vec, @@ -1785,7 +1785,7 @@ impl LayeredTimeline { PERSISTENT_BYTES_WRITTEN.inc_by(new_delta_path.metadata()?.len()); if self.upload_layers.load(atomic::Ordering::Relaxed) { - remote_storage::schedule_layer_upload( + storage_sync::schedule_layer_upload( self.tenantid, self.timelineid, HashSet::from([new_delta_path]), @@ -1857,7 +1857,7 @@ impl LayeredTimeline { } } if self.upload_layers.load(atomic::Ordering::Relaxed) { - remote_storage::schedule_layer_upload( + storage_sync::schedule_layer_upload( self.tenantid, self.timelineid, layer_paths_to_upload, @@ -2056,13 +2056,13 @@ impl LayeredTimeline { drop(layers); if self.upload_layers.load(atomic::Ordering::Relaxed) { - remote_storage::schedule_layer_upload( + storage_sync::schedule_layer_upload( self.tenantid, self.timelineid, new_layer_paths, None, ); - remote_storage::schedule_layer_delete( + storage_sync::schedule_layer_delete( self.tenantid, self.timelineid, layer_paths_do_delete, @@ -2253,7 +2253,7 @@ impl LayeredTimeline { } if self.upload_layers.load(atomic::Ordering::Relaxed) { - remote_storage::schedule_layer_delete( + storage_sync::schedule_layer_delete( self.tenantid, self.timelineid, layer_paths_to_delete, diff --git a/pageserver/src/lib.rs b/pageserver/src/lib.rs index 0b1c53172c..83985069ec 100644 --- a/pageserver/src/lib.rs +++ b/pageserver/src/lib.rs @@ -9,8 +9,8 @@ pub mod page_service; pub mod pgdatadir_mapping; pub mod profiling; pub mod reltag; -pub mod remote_storage; pub mod repository; +pub mod storage_sync; pub mod tenant_config; pub mod tenant_mgr; pub mod tenant_threads; diff --git a/pageserver/src/remote_storage.rs b/pageserver/src/remote_storage.rs deleted file mode 100644 index 4db0f6667d..0000000000 --- a/pageserver/src/remote_storage.rs +++ /dev/null @@ -1,412 +0,0 @@ -//! 
A set of generic storage abstractions for the page server to use when backing up and restoring its state from the external storage. -//! This particular module serves as a public API border between pageserver and the internal storage machinery. -//! No other modules from this tree are supposed to be used directly by the external code. -//! -//! There are a few components the storage machinery consists of: -//! * [`RemoteStorage`] trait a CRUD-like generic abstraction to use for adapting external storages with a few implementations: -//! * [`local_fs`] allows to use local file system as an external storage -//! * [`s3_bucket`] uses AWS S3 bucket as an external storage -//! -//! * synchronization logic at [`storage_sync`] module that keeps pageserver state (both runtime one and the workdir files) and storage state in sync. -//! Synchronization internals are split into submodules -//! * [`storage_sync::index`] to keep track of remote tenant files, the metadata and their mappings to local files -//! * [`storage_sync::upload`] and [`storage_sync::download`] to manage archive creation and upload; download and extraction, respectively -//! -//! * public API via to interact with the external world: -//! * [`start_local_timeline_sync`] to launch a background async loop to handle the synchronization -//! * [`schedule_layer_upload`], [`schedule_layer_download`] and [`schedule_layer_delete`] to enqueue a new upload and download tasks, -//! to be processed by the async loop -//! -//! Here's a schematic overview of all interactions backup and the rest of the pageserver perform: -//! -//! +------------------------+ +--------->-------+ -//! | | - - - (init async loop) - - - -> | | -//! | | | | -//! | | -------------------------------> | async | -//! | pageserver | (enqueue timeline sync task) | upload/download | -//! | | | loop | -//! | | <------------------------------- | | -//! | | (apply new timeline sync states) | | -//! +------------------------+ +---------<-------+ -//! 
| -//! | -//! CRUD layer file operations | -//! (upload/download/delete/list, etc.) | -//! V -//! +------------------------+ -//! | | -//! | [`RemoteStorage`] impl | -//! | | -//! | pageserver assumes it | -//! | owns exclusive write | -//! | access to this storage | -//! +------------------------+ -//! -//! First, during startup, the pageserver inits the storage sync thread with the async loop, or leaves the loop uninitialised, if configured so. -//! The loop inits the storage connection and checks the remote files stored. -//! This is done once at startup only, relying on the fact that pageserver uses the storage alone (ergo, nobody else uploads the files to the storage but this server). -//! Based on the remote storage data, the sync logic immediately schedules sync tasks for local timelines and reports about remote only timelines to pageserver, so it can -//! query their downloads later if they are accessed. -//! -//! Some time later, during pageserver checkpoints, in-memory data is flushed onto disk along with its metadata. -//! If the storage sync loop was successfully started before, pageserver schedules the new checkpoint file uploads after every checkpoint. -//! The checkpoint uploads are disabled, if no remote storage configuration is provided (no sync loop is started this way either). -//! See [`crate::layered_repository`] for the upload calls and the adjacent logic. -//! -//! Synchronization logic is able to communicate back with updated timeline sync states, [`crate::repository::TimelineSyncStatusUpdate`], -//! submitted via [`crate::tenant_mgr::apply_timeline_sync_status_updates`] function. Tenant manager applies corresponding timeline updates in pageserver's in-memory state. -//! Such submissions happen in two cases: -//! * once after the sync loop startup, to signal pageserver which timelines will be synchronized in the near future -//! * after every loop step, in case a timeline needs to be reloaded or evicted from pageserver's memory -//! -//! 
When the pageserver terminates, the sync loop finishes a current sync task (if any) and exits. -//! -//! The storage logic considers `image` as a set of local files (layers), fully representing a certain timeline at given moment (identified with `disk_consistent_lsn` from the corresponding `metadata` file). -//! Timeline can change its state, by adding more files on disk and advancing its `disk_consistent_lsn`: this happens after pageserver checkpointing and is followed -//! by the storage upload, if enabled. -//! Yet timeline cannot alter already existing files, and cannot remove those too: only a GC process is capable of removing unused files. -//! This way, remote storage synchronization relies on the fact that every checkpoint is incremental and local files are "immutable": -//! * when a certain checkpoint gets uploaded, the sync loop remembers the fact, preventing further reuploads of the same state -//! * no files are deleted from either local or remote storage, only the missing ones locally/remotely get downloaded/uploaded, local metadata file will be overwritten -//! when the newer image is downloaded -//! -//! Pageserver maintains similar to the local file structure remotely: all layer files are uploaded with the same names under the same directory structure. -//! Yet instead of keeping the `metadata` file remotely, we wrap it with more data in [`IndexPart`], containing the list of remote files. -//! This file gets read to populate the cache, if the remote timeline data is missing from it and gets updated after every successful download. -//! This way, we optimize S3 storage access by not running the `S3 list` command that could be expencive and slow: knowing both [`ZTenantId`] and [`ZTimelineId`], -//! we can always reconstruct the path to the timeline, use this to get the same path on the remote storage and retrive its part contents, if needed, same as any layer files. -//! -//! 
By default, pageserver reads the remote storage index data only for timelines located locally, to synchronize those, if needed. -//! Bulk index data download happens only initially, on pageserer startup. The rest of the remote storage stays unknown to pageserver and loaded on demand only, -//! when a new timeline is scheduled for the download. -//! -//! NOTES: -//! * pageserver assumes it has exclusive write access to the remote storage. If supported, the way multiple pageservers can be separated in the same storage -//! (i.e. using different directories in the local filesystem external storage), but totally up to the storage implementation and not covered with the trait API. -//! -//! * the sync tasks may not processed immediately after the submission: if they error and get re-enqueued, their execution might be backed off to ensure error cap is not exceeded too fast. -//! The sync queue processing also happens in batches, so the sync tasks can wait in the queue for some time. - -mod local_fs; -mod s3_bucket; -mod storage_sync; - -use std::{ - collections::{HashMap, HashSet}, - ffi, fs, - path::{Path, PathBuf}, -}; - -use anyhow::{bail, Context}; -use tokio::io; -use tracing::{debug, error, info}; - -use self::storage_sync::TEMP_DOWNLOAD_EXTENSION; -pub use self::{ - local_fs::LocalFs, - s3_bucket::S3Bucket, - storage_sync::{ - download_index_part, - index::{IndexPart, RemoteIndex, RemoteTimeline}, - schedule_layer_delete, schedule_layer_download, schedule_layer_upload, - }, -}; -use crate::{ - config::{PageServerConf, RemoteStorageKind}, - layered_repository::{ - ephemeral_file::is_ephemeral_file, - metadata::{TimelineMetadata, METADATA_FILE_NAME}, - }, -}; -use utils::zid::{ZTenantId, ZTenantTimelineId, ZTimelineId}; - -/// A timeline status to share with pageserver's sync counterpart, -/// after comparing local and remote timeline state. -#[derive(Clone, Copy, Debug)] -pub enum LocalTimelineInitStatus { - /// The timeline has every remote layer present locally. 
- /// There could be some layers requiring uploading, - /// but this does not block the timeline from any user interaction. - LocallyComplete, - /// A timeline has some files remotely, that are not present locally and need downloading. - /// Downloading might update timeline's metadata locally and current pageserver logic deals with local layers only, - /// so the data needs to be downloaded first before the timeline can be used. - NeedsSync, -} - -type LocalTimelineInitStatuses = HashMap>; - -/// A structure to combine all synchronization data to share with pageserver after a successful sync loop initialization. -/// Successful initialization includes a case when sync loop is not started, in which case the startup data is returned still, -/// to simplify the received code. -pub struct SyncStartupData { - pub remote_index: RemoteIndex, - pub local_timeline_init_statuses: LocalTimelineInitStatuses, -} - -/// Based on the config, initiates the remote storage connection and starts a separate thread -/// that ensures that pageserver and the remote storage are in sync with each other. -/// If no external configuration connection given, no thread or storage initialization is done. -/// Along with that, scans tenant files local and remote (if the sync gets enabled) to check the initial timeline states. 
-pub fn start_local_timeline_sync( - config: &'static PageServerConf, -) -> anyhow::Result { - let local_timeline_files = local_tenant_timeline_files(config) - .context("Failed to collect local tenant timeline files")?; - - match &config.remote_storage_config { - Some(storage_config) => match &storage_config.storage { - RemoteStorageKind::LocalFs(root) => { - info!("Using fs root '{}' as a remote storage", root.display()); - storage_sync::spawn_storage_sync_thread( - config, - local_timeline_files, - LocalFs::new(root.clone(), &config.workdir)?, - storage_config.max_concurrent_timelines_sync, - storage_config.max_sync_errors, - ) - }, - RemoteStorageKind::AwsS3(s3_config) => { - info!("Using s3 bucket '{}' in region '{}' as a remote storage, prefix in bucket: '{:?}', bucket endpoint: '{:?}'", - s3_config.bucket_name, s3_config.bucket_region, s3_config.prefix_in_bucket, s3_config.endpoint); - storage_sync::spawn_storage_sync_thread( - config, - local_timeline_files, - S3Bucket::new(s3_config, &config.workdir)?, - storage_config.max_concurrent_timelines_sync, - storage_config.max_sync_errors, - ) - }, - } - .context("Failed to spawn the storage sync thread"), - None => { - info!("No remote storage configured, skipping storage sync, considering all local timelines with correct metadata files enabled"); - let mut local_timeline_init_statuses = LocalTimelineInitStatuses::new(); - for (ZTenantTimelineId { tenant_id, timeline_id }, _) in - local_timeline_files - { - local_timeline_init_statuses - .entry(tenant_id) - .or_default() - .insert(timeline_id, LocalTimelineInitStatus::LocallyComplete); - } - Ok(SyncStartupData { - local_timeline_init_statuses, - remote_index: RemoteIndex::empty(), - }) - } - } -} - -fn local_tenant_timeline_files( - config: &'static PageServerConf, -) -> anyhow::Result)>> { - let mut local_tenant_timeline_files = HashMap::new(); - let tenants_dir = config.tenants_path(); - for tenants_dir_entry in fs::read_dir(&tenants_dir) - .with_context(|| 
format!("Failed to list tenants dir {}", tenants_dir.display()))? - { - match &tenants_dir_entry { - Ok(tenants_dir_entry) => { - match collect_timelines_for_tenant(config, &tenants_dir_entry.path()) { - Ok(collected_files) => { - local_tenant_timeline_files.extend(collected_files.into_iter()) - } - Err(e) => error!( - "Failed to collect tenant files from dir '{}' for entry {:?}, reason: {:#}", - tenants_dir.display(), - tenants_dir_entry, - e - ), - } - } - Err(e) => error!( - "Failed to list tenants dir entry {:?} in directory {}, reason: {:?}", - tenants_dir_entry, - tenants_dir.display(), - e - ), - } - } - - Ok(local_tenant_timeline_files) -} - -fn collect_timelines_for_tenant( - config: &'static PageServerConf, - tenant_path: &Path, -) -> anyhow::Result)>> { - let mut timelines = HashMap::new(); - let tenant_id = tenant_path - .file_name() - .and_then(ffi::OsStr::to_str) - .unwrap_or_default() - .parse::() - .context("Could not parse tenant id out of the tenant dir name")?; - let timelines_dir = config.timelines_path(&tenant_id); - - for timelines_dir_entry in fs::read_dir(&timelines_dir).with_context(|| { - format!( - "Failed to list timelines dir entry for tenant {}", - tenant_id - ) - })? 
{ - match timelines_dir_entry { - Ok(timelines_dir_entry) => { - let timeline_path = timelines_dir_entry.path(); - match collect_timeline_files(&timeline_path) { - Ok((timeline_id, metadata, timeline_files)) => { - timelines.insert( - ZTenantTimelineId { - tenant_id, - timeline_id, - }, - (metadata, timeline_files), - ); - } - Err(e) => error!( - "Failed to process timeline dir contents at '{}', reason: {:?}", - timeline_path.display(), - e - ), - } - } - Err(e) => error!( - "Failed to list timelines for entry tenant {}, reason: {:?}", - tenant_id, e - ), - } - } - - Ok(timelines) -} - -// discover timeline files and extract timeline metadata -// NOTE: ephemeral files are excluded from the list -fn collect_timeline_files( - timeline_dir: &Path, -) -> anyhow::Result<(ZTimelineId, TimelineMetadata, HashSet)> { - let mut timeline_files = HashSet::new(); - let mut timeline_metadata_path = None; - - let timeline_id = timeline_dir - .file_name() - .and_then(ffi::OsStr::to_str) - .unwrap_or_default() - .parse::() - .context("Could not parse timeline id out of the timeline dir name")?; - let timeline_dir_entries = - fs::read_dir(&timeline_dir).context("Failed to list timeline dir contents")?; - for entry in timeline_dir_entries { - let entry_path = entry.context("Failed to list timeline dir entry")?.path(); - if entry_path.is_file() { - if entry_path.file_name().and_then(ffi::OsStr::to_str) == Some(METADATA_FILE_NAME) { - timeline_metadata_path = Some(entry_path); - } else if is_ephemeral_file(&entry_path.file_name().unwrap().to_string_lossy()) { - debug!("skipping ephemeral file {}", entry_path.display()); - continue; - } else if entry_path.extension().and_then(ffi::OsStr::to_str) - == Some(TEMP_DOWNLOAD_EXTENSION) - { - info!("removing temp download file at {}", entry_path.display()); - fs::remove_file(&entry_path).with_context(|| { - format!( - "failed to remove temp download file at {}", - entry_path.display() - ) - })?; - } else { - timeline_files.insert(entry_path); 
- } - } - } - - // FIXME (rodionov) if attach call succeeded, and then pageserver is restarted before download is completed - // then attach is lost. There would be no retries for that, - // initial collect will fail because there is no metadata. - // We either need to start download if we see empty dir after restart or attach caller should - // be aware of that and retry attach if awaits_download for timeline switched from true to false - // but timelinne didnt appear locally. - // Check what happens with remote index in that case. - let timeline_metadata_path = match timeline_metadata_path { - Some(path) => path, - None => bail!("No metadata file found in the timeline directory"), - }; - let metadata = TimelineMetadata::from_bytes( - &fs::read(&timeline_metadata_path).context("Failed to read timeline metadata file")?, - ) - .context("Failed to parse timeline metadata file bytes")?; - - Ok((timeline_id, metadata, timeline_files)) -} - -/// Storage (potentially remote) API to manage its state. -/// This storage tries to be unaware of any layered repository context, -/// providing basic CRUD operations for storage files. -#[async_trait::async_trait] -pub trait RemoteStorage: Send + Sync { - /// A way to uniquely reference a file in the remote storage. - type StoragePath; - - /// Attempts to derive the storage path out of the local path, if the latter is correct. - fn storage_path(&self, local_path: &Path) -> anyhow::Result; - - /// Gets the download path of the given storage file. - fn local_path(&self, storage_path: &Self::StoragePath) -> anyhow::Result; - - /// Lists all items the storage has right now. - async fn list(&self) -> anyhow::Result>; - - /// Streams the local file contents into remote into the remote storage entry. - async fn upload( - &self, - from: impl io::AsyncRead + Unpin + Send + Sync + 'static, - // S3 PUT request requires the content length to be specified, - // otherwise it starts to fail with the concurrent connection count increasing. 
- from_size_bytes: usize, - to: &Self::StoragePath, - metadata: Option, - ) -> anyhow::Result<()>; - - /// Streams the remote storage entry contents into the buffered writer given, returns the filled writer. - /// Returns the metadata, if any was stored with the file previously. - async fn download( - &self, - from: &Self::StoragePath, - to: &mut (impl io::AsyncWrite + Unpin + Send + Sync), - ) -> anyhow::Result>; - - /// Streams a given byte range of the remote storage entry contents into the buffered writer given, returns the filled writer. - /// Returns the metadata, if any was stored with the file previously. - async fn download_range( - &self, - from: &Self::StoragePath, - start_inclusive: u64, - end_exclusive: Option, - to: &mut (impl io::AsyncWrite + Unpin + Send + Sync), - ) -> anyhow::Result>; - - async fn delete(&self, path: &Self::StoragePath) -> anyhow::Result<()>; -} - -/// Extra set of key-value pairs that contain arbitrary metadata about the storage entry. -/// Immutable, cannot be changed once the file is created. 
-#[derive(Debug, Clone, PartialEq, Eq)] -pub struct StorageMetadata(HashMap); - -fn strip_path_prefix<'a>(prefix: &'a Path, path: &'a Path) -> anyhow::Result<&'a Path> { - if prefix == path { - anyhow::bail!( - "Prefix and the path are equal, cannot strip: '{}'", - prefix.display() - ) - } else { - path.strip_prefix(prefix).with_context(|| { - format!( - "Path '{}' is not prefixed with '{}'", - path.display(), - prefix.display(), - ) - }) - } -} diff --git a/pageserver/src/repository.rs b/pageserver/src/repository.rs index 5044f2bfc5..d25dc8914d 100644 --- a/pageserver/src/repository.rs +++ b/pageserver/src/repository.rs @@ -1,5 +1,5 @@ use crate::layered_repository::metadata::TimelineMetadata; -use crate::remote_storage::RemoteIndex; +use crate::storage_sync::index::RemoteIndex; use crate::walrecord::ZenithWalRecord; use crate::CheckpointConfig; use anyhow::{bail, Result}; diff --git a/pageserver/src/remote_storage/storage_sync.rs b/pageserver/src/storage_sync.rs similarity index 77% rename from pageserver/src/remote_storage/storage_sync.rs rename to pageserver/src/storage_sync.rs index 8a26685a7d..bcc18e8ce4 100644 --- a/pageserver/src/remote_storage/storage_sync.rs +++ b/pageserver/src/storage_sync.rs @@ -1,3 +1,87 @@ +//! There are a few components the storage machinery consists of: +//! +//! * [`RemoteStorage`] that is used to interact with an arbitrary external storage +//! +//! * synchronization logic at [`storage_sync`] module that keeps pageserver state (both runtime one and the workdir files) and storage state in sync. +//! Synchronization internals are split into submodules +//! * [`storage_sync::index`] to keep track of remote tenant files, the metadata and their mappings to local files +//! * [`storage_sync::upload`] and [`storage_sync::download`] to manage archive creation and upload; download and extraction, respectively +//! +//! * public API via to interact with the external world: +//! 
* [`start_local_timeline_sync`] to launch a background async loop to handle the synchronization +//! * [`schedule_timeline_checkpoint_upload`] and [`schedule_timeline_download`] to enqueue a new upload and download tasks, +//! to be processed by the async loop +//! +//! Here's a schematic overview of all interactions backup and the rest of the pageserver perform: +//! +//! +------------------------+ +--------->-------+ +//! | | - - - (init async loop) - - - -> | | +//! | | | | +//! | | -------------------------------> | async | +//! | pageserver | (enqueue timeline sync task) | upload/download | +//! | | | loop | +//! | | <------------------------------- | | +//! | | (apply new timeline sync states) | | +//! +------------------------+ +---------<-------+ +//! | +//! | +//! CRUD layer file operations | +//! (upload/download/delete/list, etc.) | +//! V +//! +------------------------+ +//! | | +//! | [`RemoteStorage`] impl | +//! | | +//! | pageserver assumes it | +//! | owns exclusive write | +//! | access to this storage | +//! +------------------------+ +//! +//! First, during startup, the pageserver inits the storage sync thread with the async loop, or leaves the loop uninitialised, if configured so. +//! The loop inits the storage connection and checks the remote files stored. +//! This is done once at startup only, relying on the fact that pageserver uses the storage alone (ergo, nobody else uploads the files to the storage but this server). +//! Based on the remote storage data, the sync logic immediately schedules sync tasks for local timelines and reports about remote only timelines to pageserver, so it can +//! query their downloads later if they are accessed. +//! +//! Some time later, during pageserver checkpoints, in-memory data is flushed onto disk along with its metadata. +//! If the storage sync loop was successfully started before, pageserver schedules the new checkpoint file uploads after every checkpoint. +//! 
The checkpoint uploads are disabled, if no remote storage configuration is provided (no sync loop is started this way either). +//! See [`crate::layered_repository`] for the upload calls and the adjacent logic. +//! +//! Synchronization logic is able to communicate back with updated timeline sync states, [`crate::repository::TimelineSyncStatusUpdate`], +//! submitted via [`crate::tenant_mgr::apply_timeline_sync_status_updates`] function. Tenant manager applies corresponding timeline updates in pageserver's in-memory state. +//! Such submissions happen in two cases: +//! * once after the sync loop startup, to signal pageserver which timelines will be synchronized in the near future +//! * after every loop step, in case a timeline needs to be reloaded or evicted from pageserver's memory +//! +//! When the pageserver terminates, the sync loop finishes a current sync task (if any) and exits. +//! +//! The storage logic considers `image` as a set of local files (layers), fully representing a certain timeline at given moment (identified with `disk_consistent_lsn` from the corresponding `metadata` file). +//! Timeline can change its state, by adding more files on disk and advancing its `disk_consistent_lsn`: this happens after pageserver checkpointing and is followed +//! by the storage upload, if enabled. +//! Yet timeline cannot alter already existing files, and cannot remove those too: only a GC process is capable of removing unused files. +//! This way, remote storage synchronization relies on the fact that every checkpoint is incremental and local files are "immutable": +//! * when a certain checkpoint gets uploaded, the sync loop remembers the fact, preventing further reuploads of the same state +//! * no files are deleted from either local or remote storage, only the missing ones locally/remotely get downloaded/uploaded, local metadata file will be overwritten +//! when the newer image is downloaded +//! +//! 
Pageserver maintains a remote file structure similar to the local one: all layer files are uploaded with the same names under the same directory structure. +//! Yet instead of keeping the `metadata` file remotely, we wrap it with more data in [`IndexShard`], containing the list of remote files. +//! This file gets read to populate the cache, if the remote timeline data is missing from it and gets updated after every successful download. +//! This way, we optimize S3 storage access by not running the `S3 list` command that could be expensive and slow: knowing both [`ZTenantId`] and [`ZTimelineId`], +//! we can always reconstruct the path to the timeline, use this to get the same path on the remote storage and retrieve its shard contents, if needed, same as any layer files. +//! +//! By default, pageserver reads the remote storage index data only for timelines located locally, to synchronize those, if needed. +//! Bulk index data download happens only initially, on pageserver startup. The rest of the remote storage stays unknown to pageserver and is loaded on demand only, +//! when a new timeline is scheduled for the download. +//! +//! NOTES: +//! * pageserver assumes it has exclusive write access to the remote storage. If supported, multiple pageservers can be separated within the same storage +//! (e.g. using different directories in the local filesystem external storage), but this is entirely up to the storage implementation and is not covered by the trait API. +//! +//! * the sync tasks may not be processed immediately after the submission: if they error and get re-enqueued, their execution might be backed off to ensure error cap is not exceeded too fast. +//! The sync queue processing also happens in batches, so the sync tasks can wait in the queue for some time. +//! //! A synchronization logic for the [`RemoteStorage`] and pageserver in-memory state to ensure correct synchronizations //! between local tenant files and their counterparts from the remote storage. //! 
@@ -62,7 +146,6 @@ pub mod index; mod upload; use std::{ - borrow::Cow, collections::{HashMap, HashSet, VecDeque}, ffi::OsStr, fmt::Debug, @@ -75,6 +158,7 @@ use std::{ use anyhow::{bail, Context}; use futures::stream::{FuturesUnordered, StreamExt}; use lazy_static::lazy_static; +use remote_storage::{GenericRemoteStorage, RemoteStorage}; use tokio::{ fs, runtime::Runtime, @@ -85,17 +169,18 @@ use tracing::*; use self::{ download::{download_timeline_layers, DownloadedTimeline}, - index::{IndexPart, RemoteIndex, RemoteTimeline, RemoteTimelineIndex}, + index::{IndexPart, RemoteTimeline, RemoteTimelineIndex}, upload::{upload_index_part, upload_timeline_layers, UploadedTimeline}, }; -use super::{LocalTimelineInitStatus, LocalTimelineInitStatuses, RemoteStorage, SyncStartupData}; use crate::{ config::PageServerConf, layered_repository::{ - metadata::{metadata_path, TimelineMetadata}, + ephemeral_file::is_ephemeral_file, + metadata::{metadata_path, TimelineMetadata, METADATA_FILE_NAME}, LayeredRepository, }, repository::TimelineSyncStatusUpdate, + storage_sync::{self, index::RemoteIndex}, tenant_mgr::apply_timeline_sync_status_updates, thread_mgr, thread_mgr::ThreadKind, @@ -134,6 +219,232 @@ lazy_static! { .expect("failed to register pageserver image sync time histogram vec"); } +/// A timeline status to share with pageserver's sync counterpart, +/// after comparing local and remote timeline state. +#[derive(Clone, Copy, Debug)] +pub enum LocalTimelineInitStatus { + /// The timeline has every remote layer present locally. + /// There could be some layers requiring uploading, + /// but this does not block the timeline from any user interaction. + LocallyComplete, + /// A timeline has some files remotely, that are not present locally and need downloading. + /// Downloading might update timeline's metadata locally and current pageserver logic deals with local layers only, + /// so the data needs to be downloaded first before the timeline can be used. 
+ NeedsSync, +} + +type LocalTimelineInitStatuses = HashMap<ZTenantId, HashMap<ZTimelineId, LocalTimelineInitStatus>>; + +/// A structure to combine all synchronization data to share with pageserver after a successful sync loop initialization. +/// Successful initialization includes a case when sync loop is not started, in which case the startup data is returned still, +/// to simplify the receiver code. +pub struct SyncStartupData { + pub remote_index: RemoteIndex, + pub local_timeline_init_statuses: LocalTimelineInitStatuses, +} + +/// Based on the config, initiates the remote storage connection and starts a separate thread +/// that ensures that pageserver and the remote storage are in sync with each other. +/// If no remote storage configuration is given, no thread or storage initialization is done. +/// Along with that, scans tenant files local and remote (if the sync gets enabled) to check the initial timeline states. +pub fn start_local_timeline_sync( + config: &'static PageServerConf, +) -> anyhow::Result<SyncStartupData> { + let local_timeline_files = local_tenant_timeline_files(config) + .context("Failed to collect local tenant timeline files")?; + + match config.remote_storage_config.as_ref() { + Some(storage_config) => { + match GenericRemoteStorage::new(config.workdir.clone(), storage_config) + .context("Failed to init the generic remote storage")? 
+ { + GenericRemoteStorage::Local(local_fs_storage) => { + storage_sync::spawn_storage_sync_thread( + config, + local_timeline_files, + local_fs_storage, + storage_config.max_concurrent_syncs, + storage_config.max_sync_errors, + ) + } + GenericRemoteStorage::S3(s3_bucket_storage) => { + storage_sync::spawn_storage_sync_thread( + config, + local_timeline_files, + s3_bucket_storage, + storage_config.max_concurrent_syncs, + storage_config.max_sync_errors, + ) + } + } + .context("Failed to spawn the storage sync thread") + } + None => { + info!("No remote storage configured, skipping storage sync, considering all local timelines with correct metadata files enabled"); + let mut local_timeline_init_statuses = LocalTimelineInitStatuses::new(); + for ( + ZTenantTimelineId { + tenant_id, + timeline_id, + }, + _, + ) in local_timeline_files + { + local_timeline_init_statuses + .entry(tenant_id) + .or_default() + .insert(timeline_id, LocalTimelineInitStatus::LocallyComplete); + } + Ok(SyncStartupData { + local_timeline_init_statuses, + remote_index: RemoteIndex::empty(), + }) + } + } +} + +fn local_tenant_timeline_files( + config: &'static PageServerConf, +) -> anyhow::Result<HashMap<ZTenantTimelineId, (TimelineMetadata, HashSet<PathBuf>)>> { + let mut local_tenant_timeline_files = HashMap::new(); + let tenants_dir = config.tenants_path(); + for tenants_dir_entry in std::fs::read_dir(&tenants_dir) + .with_context(|| format!("Failed to list tenants dir {}", tenants_dir.display()))? 
+ { + match &tenants_dir_entry { + Ok(tenants_dir_entry) => { + match collect_timelines_for_tenant(config, &tenants_dir_entry.path()) { + Ok(collected_files) => { + local_tenant_timeline_files.extend(collected_files.into_iter()) + } + Err(e) => error!( + "Failed to collect tenant files from dir '{}' for entry {:?}, reason: {:#}", + tenants_dir.display(), + tenants_dir_entry, + e + ), + } + } + Err(e) => error!( + "Failed to list tenants dir entry {:?} in directory {}, reason: {:?}", + tenants_dir_entry, + tenants_dir.display(), + e + ), + } + } + + Ok(local_tenant_timeline_files) +} + +fn collect_timelines_for_tenant( + config: &'static PageServerConf, + tenant_path: &Path, +) -> anyhow::Result<HashMap<ZTenantTimelineId, (TimelineMetadata, HashSet<PathBuf>)>> { + let mut timelines = HashMap::new(); + let tenant_id = tenant_path + .file_name() + .and_then(OsStr::to_str) + .unwrap_or_default() + .parse::<ZTenantId>() + .context("Could not parse tenant id out of the tenant dir name")?; + let timelines_dir = config.timelines_path(&tenant_id); + + for timelines_dir_entry in std::fs::read_dir(&timelines_dir).with_context(|| { + format!( + "Failed to list timelines dir entry for tenant {}", + tenant_id + ) + })? 
{ + match timelines_dir_entry { + Ok(timelines_dir_entry) => { + let timeline_path = timelines_dir_entry.path(); + match collect_timeline_files(&timeline_path) { + Ok((timeline_id, metadata, timeline_files)) => { + timelines.insert( + ZTenantTimelineId { + tenant_id, + timeline_id, + }, + (metadata, timeline_files), + ); + } + Err(e) => error!( + "Failed to process timeline dir contents at '{}', reason: {:?}", + timeline_path.display(), + e + ), + } + } + Err(e) => error!( + "Failed to list timelines for entry tenant {}, reason: {:?}", + tenant_id, e + ), + } + } + + Ok(timelines) +} + +// discover timeline files and extract timeline metadata +// NOTE: ephemeral files are excluded from the list +fn collect_timeline_files( + timeline_dir: &Path, +) -> anyhow::Result<(ZTimelineId, TimelineMetadata, HashSet<PathBuf>)> { + let mut timeline_files = HashSet::new(); + let mut timeline_metadata_path = None; + + let timeline_id = timeline_dir + .file_name() + .and_then(OsStr::to_str) + .unwrap_or_default() + .parse::<ZTimelineId>() + .context("Could not parse timeline id out of the timeline dir name")?; + let timeline_dir_entries = + std::fs::read_dir(&timeline_dir).context("Failed to list timeline dir contents")?; + for entry in timeline_dir_entries { + let entry_path = entry.context("Failed to list timeline dir entry")?.path(); + if entry_path.is_file() { + if entry_path.file_name().and_then(OsStr::to_str) == Some(METADATA_FILE_NAME) { + timeline_metadata_path = Some(entry_path); + } else if is_ephemeral_file(&entry_path.file_name().unwrap().to_string_lossy()) { + debug!("skipping ephemeral file {}", entry_path.display()); + continue; + } else if entry_path.extension().and_then(OsStr::to_str) + == Some(TEMP_DOWNLOAD_EXTENSION) + { + info!("removing temp download file at {}", entry_path.display()); + std::fs::remove_file(&entry_path).with_context(|| { + format!( + "failed to remove temp download file at {}", + entry_path.display() + ) + })?; + } else { + timeline_files.insert(entry_path); + } + 
} + } + + // FIXME (rodionov) if attach call succeeded, and then pageserver is restarted before download is completed + // then attach is lost. There would be no retries for that, + // initial collect will fail because there is no metadata. + // We either need to start download if we see empty dir after restart or attach caller should + // be aware of that and retry attach if awaits_download for timeline switched from true to false + // but timeline didn't appear locally. + // Check what happens with remote index in that case. + let timeline_metadata_path = match timeline_metadata_path { + Some(path) => path, + None => bail!("No metadata file found in the timeline directory"), + }; + let metadata = TimelineMetadata::from_bytes( + &std::fs::read(&timeline_metadata_path).context("Failed to read timeline metadata file")?, + ) + .context("Failed to parse timeline metadata file bytes")?; + + Ok((timeline_id, metadata, timeline_files)) +} + /// Wraps mpsc channel bits around into a queue interface. /// mpsc approach was picked to allow blocking the sync loop if no tasks are present, to avoid meaningless spinning. 
mod sync_queue { @@ -505,7 +816,7 @@ pub(super) fn spawn_storage_sync_thread( ) -> anyhow::Result where P: Debug + Send + Sync + 'static, - S: RemoteStorage + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, { let (sender, receiver) = mpsc::unbounded_channel(); sync_queue::init(sender)?; @@ -566,7 +877,7 @@ fn storage_sync_loop( max_sync_errors: NonZeroU32, ) where P: Debug + Send + Sync + 'static, - S: RemoteStorage + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, { info!("Starting remote storage sync loop"); loop { @@ -618,7 +929,7 @@ async fn loop_step( ) -> ControlFlow<(), HashMap>> where P: Debug + Send + Sync + 'static, - S: RemoteStorage + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, { let batched_tasks = match sync_queue::next_task_batch(receiver, max_concurrent_timelines_sync).await { @@ -677,7 +988,7 @@ async fn process_sync_task( ) -> Option where P: Debug + Send + Sync + 'static, - S: RemoteStorage + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, { let sync_start = Instant::now(); let current_remote_timeline = { index.read().await.timeline_entry(&sync_id).cloned() }; @@ -810,7 +1121,7 @@ async fn download_timeline( ) -> Option where P: Debug + Send + Sync + 'static, - S: RemoteStorage + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, { match download_timeline_layers( conf, @@ -936,7 +1247,7 @@ async fn upload_timeline( task_name: &str, ) where P: Debug + Send + Sync + 'static, - S: RemoteStorage + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, { let mut uploaded_data = match upload_timeline_layers(storage, current_remote_timeline, sync_id, new_upload_data) @@ -991,7 +1302,7 @@ async fn update_remote_data( ) -> anyhow::Result<()> where P: Debug + Send + Sync + 'static, - S: RemoteStorage + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, { info!("Updating remote index for the timeline"); let updated_remote_timeline = { 
@@ -1101,7 +1412,7 @@ async fn try_fetch_index_parts( ) -> HashMap where P: Debug + Send + Sync + 'static, - S: RemoteStorage + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, { let mut index_parts = HashMap::with_capacity(keys.len()); @@ -1246,20 +1557,6 @@ fn register_sync_status(sync_start: Instant, sync_name: &str, sync_status: Optio .observe(secs_elapsed) } -pub fn path_with_suffix_extension(original_path: impl AsRef, suffix: &str) -> PathBuf { - let new_extension = match original_path - .as_ref() - .extension() - .map(OsStr::to_string_lossy) - { - Some(extension) => Cow::Owned(format!("{extension}.{suffix}")), - None => Cow::Borrowed(suffix), - }; - original_path - .as_ref() - .with_extension(new_extension.as_ref()) -} - #[cfg(test)] mod test_utils { use utils::lsn::Lsn; @@ -1671,28 +1968,4 @@ mod tests { "Merged upload tasks should have a metadata with biggest disk_consistent_lsn" ); } - - #[test] - fn test_path_with_suffix_extension() { - let p = PathBuf::from("/foo/bar"); - assert_eq!( - &path_with_suffix_extension(&p, "temp").to_string_lossy(), - "/foo/bar.temp" - ); - let p = PathBuf::from("/foo/bar"); - assert_eq!( - &path_with_suffix_extension(&p, "temp.temp").to_string_lossy(), - "/foo/bar.temp.temp" - ); - let p = PathBuf::from("/foo/bar.baz"); - assert_eq!( - &path_with_suffix_extension(&p, "temp.temp").to_string_lossy(), - "/foo/bar.baz.temp.temp" - ); - let p = PathBuf::from("/foo/bar.baz"); - assert_eq!( - &path_with_suffix_extension(&p, ".temp").to_string_lossy(), - "/foo/bar.baz..temp" - ); - } } diff --git a/pageserver/src/remote_storage/storage_sync/download.rs b/pageserver/src/storage_sync/download.rs similarity index 93% rename from pageserver/src/remote_storage/storage_sync/download.rs rename to pageserver/src/storage_sync/download.rs index 7e2496b796..dca08bca5d 100644 --- a/pageserver/src/remote_storage/storage_sync/download.rs +++ b/pageserver/src/storage_sync/download.rs @@ -4,6 +4,7 @@ use 
std::{collections::HashSet, fmt::Debug, path::Path}; use anyhow::Context; use futures::stream::{FuturesUnordered, StreamExt}; +use remote_storage::{path_with_suffix_extension, RemoteStorage}; use tokio::{ fs, io::{self, AsyncWriteExt}, @@ -13,10 +14,7 @@ use tracing::{debug, error, info, warn}; use crate::{ config::PageServerConf, layered_repository::metadata::metadata_path, - remote_storage::{ - storage_sync::{path_with_suffix_extension, sync_queue, SyncTask}, - RemoteStorage, - }, + storage_sync::{sync_queue, SyncTask}, }; use utils::zid::ZTenantTimelineId; @@ -35,17 +33,19 @@ pub async fn download_index_part( ) -> anyhow::Result where P: Debug + Send + Sync + 'static, - S: RemoteStorage + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, { let index_part_path = metadata_path(conf, sync_id.timeline_id, sync_id.tenant_id) .with_file_name(IndexPart::FILE_NAME) .with_extension(IndexPart::FILE_EXTENSION); - let part_storage_path = storage.storage_path(&index_part_path).with_context(|| { - format!( - "Failed to get the index part storage path for local path '{}'", - index_part_path.display() - ) - })?; + let part_storage_path = storage + .remote_object_id(&index_part_path) + .with_context(|| { + format!( + "Failed to get the index part storage path for local path '{}'", + index_part_path.display() + ) + })?; let mut index_part_bytes = Vec::new(); storage .download(&part_storage_path, &mut index_part_bytes) @@ -93,7 +93,7 @@ pub(super) async fn download_timeline_layers<'a, P, S>( ) -> DownloadedTimeline where P: Debug + Send + Sync + 'static, - S: RemoteStorage + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, { let remote_timeline = match remote_timeline { Some(remote_timeline) => { @@ -130,7 +130,7 @@ where ); } else { let layer_storage_path = storage - .storage_path(&layer_desination_path) + .remote_object_id(&layer_desination_path) .with_context(|| { format!( "Failed to get the layer storage path for local path '{}'", @@ -262,18 
+262,16 @@ async fn fsync_path(path: impl AsRef) -> Result<(), io::Error> { mod tests { use std::collections::{BTreeSet, HashSet}; + use remote_storage::{LocalFs, RemoteStorage}; use tempfile::tempdir; use utils::lsn::Lsn; use crate::{ - remote_storage::{ - storage_sync::{ - index::RelativePath, - test_utils::{create_local_timeline, dummy_metadata}, - }, - LocalFs, - }, repository::repo_harness::{RepoHarness, TIMELINE_ID}, + storage_sync::{ + index::RelativePath, + test_utils::{create_local_timeline, dummy_metadata}, + }, }; use super::*; @@ -283,7 +281,10 @@ mod tests { let harness = RepoHarness::create("download_timeline")?; let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID); let layer_files = ["a", "b", "layer_to_skip", "layer_to_keep_locally"]; - let storage = LocalFs::new(tempdir()?.path().to_path_buf(), &harness.conf.workdir)?; + let storage = LocalFs::new( + tempdir()?.path().to_path_buf(), + harness.conf.workdir.clone(), + )?; let current_retries = 3; let metadata = dummy_metadata(Lsn(0x30)); let local_timeline_path = harness.timeline_path(&TIMELINE_ID); @@ -291,7 +292,7 @@ mod tests { create_local_timeline(&harness, TIMELINE_ID, &layer_files, metadata.clone()).await?; for local_path in timeline_upload.layers_to_upload { - let remote_path = storage.storage_path(&local_path)?; + let remote_path = storage.remote_object_id(&local_path)?; let remote_parent_dir = remote_path.parent().unwrap(); if !remote_parent_dir.exists() { fs::create_dir_all(&remote_parent_dir).await?; @@ -375,7 +376,7 @@ mod tests { async fn download_timeline_negatives() -> anyhow::Result<()> { let harness = RepoHarness::create("download_timeline_negatives")?; let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID); - let storage = LocalFs::new(tempdir()?.path().to_owned(), &harness.conf.workdir)?; + let storage = LocalFs::new(tempdir()?.path().to_owned(), harness.conf.workdir.clone())?; let empty_remote_timeline_download = download_timeline_layers( 
harness.conf, @@ -429,7 +430,10 @@ mod tests { let harness = RepoHarness::create("test_download_index_part")?; let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID); - let storage = LocalFs::new(tempdir()?.path().to_path_buf(), &harness.conf.workdir)?; + let storage = LocalFs::new( + tempdir()?.path().to_path_buf(), + harness.conf.workdir.clone(), + )?; let metadata = dummy_metadata(Lsn(0x30)); let local_timeline_path = harness.timeline_path(&TIMELINE_ID); @@ -450,7 +454,7 @@ mod tests { metadata_path(harness.conf, sync_id.timeline_id, sync_id.tenant_id) .with_file_name(IndexPart::FILE_NAME) .with_extension(IndexPart::FILE_EXTENSION); - let storage_path = storage.storage_path(&local_index_part_path)?; + let storage_path = storage.remote_object_id(&local_index_part_path)?; fs::create_dir_all(storage_path.parent().unwrap()).await?; fs::write(&storage_path, serde_json::to_vec(&index_part)?).await?; diff --git a/pageserver/src/remote_storage/storage_sync/index.rs b/pageserver/src/storage_sync/index.rs similarity index 100% rename from pageserver/src/remote_storage/storage_sync/index.rs rename to pageserver/src/storage_sync/index.rs diff --git a/pageserver/src/remote_storage/storage_sync/upload.rs b/pageserver/src/storage_sync/upload.rs similarity index 93% rename from pageserver/src/remote_storage/storage_sync/upload.rs rename to pageserver/src/storage_sync/upload.rs index 91a0a0d6ce..55089df7bc 100644 --- a/pageserver/src/remote_storage/storage_sync/upload.rs +++ b/pageserver/src/storage_sync/upload.rs @@ -4,20 +4,21 @@ use std::{fmt::Debug, path::PathBuf}; use anyhow::Context; use futures::stream::{FuturesUnordered, StreamExt}; +use remote_storage::RemoteStorage; use tokio::fs; use tracing::{debug, error, info, warn}; use crate::{ config::PageServerConf, layered_repository::metadata::metadata_path, - remote_storage::{ - storage_sync::{index::RemoteTimeline, sync_queue, SyncTask}, - RemoteStorage, - }, + storage_sync::{sync_queue, SyncTask}, }; use 
utils::zid::ZTenantTimelineId; -use super::{index::IndexPart, SyncData, TimelineUpload}; +use super::{ + index::{IndexPart, RemoteTimeline}, + SyncData, TimelineUpload, +}; /// Serializes and uploads the given index part data to the remote storage. pub(super) async fn upload_index_part( @@ -28,7 +29,7 @@ pub(super) async fn upload_index_part( ) -> anyhow::Result<()> where P: Debug + Send + Sync + 'static, - S: RemoteStorage + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, { let index_part_bytes = serde_json::to_vec(&index_part) .context("Failed to serialize index part file into bytes")?; @@ -38,12 +39,15 @@ where let index_part_path = metadata_path(conf, sync_id.timeline_id, sync_id.tenant_id) .with_file_name(IndexPart::FILE_NAME) .with_extension(IndexPart::FILE_EXTENSION); - let index_part_storage_path = storage.storage_path(&index_part_path).with_context(|| { - format!( - "Failed to get the index part storage path for local path '{}'", - index_part_path.display() - ) - })?; + let index_part_storage_path = + storage + .remote_object_id(&index_part_path) + .with_context(|| { + format!( + "Failed to get the index part storage path for local path '{}'", + index_part_path.display() + ) + })?; storage .upload( @@ -83,7 +87,7 @@ pub(super) async fn upload_timeline_layers<'a, P, S>( ) -> UploadedTimeline where P: Debug + Send + Sync + 'static, - S: RemoteStorage + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, { let upload = &mut upload_data.data; let new_upload_lsn = upload @@ -112,7 +116,7 @@ where .into_iter() .map(|source_path| async move { let storage_path = storage - .storage_path(&source_path) + .remote_object_id(&source_path) .with_context(|| { format!( "Failed to get the layer storage path for local path '{}'", @@ -211,18 +215,16 @@ enum UploadError { mod tests { use std::collections::{BTreeSet, HashSet}; + use remote_storage::LocalFs; use tempfile::tempdir; use utils::lsn::Lsn; use crate::{ - remote_storage::{ - 
storage_sync::{ - index::RelativePath, - test_utils::{create_local_timeline, dummy_metadata}, - }, - LocalFs, - }, repository::repo_harness::{RepoHarness, TIMELINE_ID}, + storage_sync::{ + index::RelativePath, + test_utils::{create_local_timeline, dummy_metadata}, + }, }; use super::{upload_index_part, *}; @@ -233,7 +235,10 @@ mod tests { let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID); let layer_files = ["a", "b"]; - let storage = LocalFs::new(tempdir()?.path().to_path_buf(), &harness.conf.workdir)?; + let storage = LocalFs::new( + tempdir()?.path().to_path_buf(), + harness.conf.workdir.clone(), + )?; let current_retries = 3; let metadata = dummy_metadata(Lsn(0x30)); let local_timeline_path = harness.timeline_path(&TIMELINE_ID); @@ -315,7 +320,7 @@ mod tests { let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID); let layer_files = ["a1", "b1"]; - let storage = LocalFs::new(tempdir()?.path().to_owned(), &harness.conf.workdir)?; + let storage = LocalFs::new(tempdir()?.path().to_owned(), harness.conf.workdir.clone())?; let current_retries = 5; let metadata = dummy_metadata(Lsn(0x40)); @@ -403,7 +408,7 @@ mod tests { let harness = RepoHarness::create("test_upload_index_part")?; let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID); - let storage = LocalFs::new(tempdir()?.path().to_owned(), &harness.conf.workdir)?; + let storage = LocalFs::new(tempdir()?.path().to_owned(), harness.conf.workdir.clone())?; let metadata = dummy_metadata(Lsn(0x40)); let local_timeline_path = harness.timeline_path(&TIMELINE_ID); diff --git a/pageserver/src/tenant_mgr.rs b/pageserver/src/tenant_mgr.rs index 507e749e8c..20a723b5b5 100644 --- a/pageserver/src/tenant_mgr.rs +++ b/pageserver/src/tenant_mgr.rs @@ -4,8 +4,9 @@ use crate::config::PageServerConf; use crate::layered_repository::LayeredRepository; use crate::pgdatadir_mapping::DatadirTimeline; -use crate::remote_storage::{self, LocalTimelineInitStatus, RemoteIndex, SyncStartupData}; 
use crate::repository::{Repository, TimelineSyncStatusUpdate}; +use crate::storage_sync::index::RemoteIndex; +use crate::storage_sync::{self, LocalTimelineInitStatus, SyncStartupData}; use crate::tenant_config::TenantConfOpt; use crate::thread_mgr; use crate::thread_mgr::ThreadKind; @@ -96,7 +97,7 @@ pub fn init_tenant_mgr(conf: &'static PageServerConf) -> anyhow::Result, + remote_storage: &S3Bucket, + listing: &HashSet, dir_path: &Path, conf: &SafeKeeperConf, ) -> anyhow::Result { @@ -55,17 +57,12 @@ async fn offload_files( && IsXLogFileName(entry.file_name().to_str().unwrap()) && entry.metadata().unwrap().created().unwrap() <= horizon { - let relpath = path.strip_prefix(&conf.workdir).unwrap(); - let s3path = String::from("walarchive/") + relpath.to_str().unwrap(); - if !listing.contains(&s3path) { + let remote_path = remote_storage.remote_object_id(path)?; + if !listing.contains(&remote_path) { let file = File::open(&path).await?; - client - .put_object(PutObjectRequest { - body: Some(StreamingBody::new(ReaderStream::new(file))), - bucket: bucket_name.to_string(), - key: s3path, - ..PutObjectRequest::default() - }) + let file_length = file.metadata().await?.len() as usize; + remote_storage + .upload(BufReader::new(file), file_length, &remote_path, None) .await?; fs::remove_file(&path).await?; @@ -77,58 +74,34 @@ async fn offload_files( } async fn main_loop(conf: &SafeKeeperConf) -> anyhow::Result<()> { - let region = Region::Custom { - name: env::var("S3_REGION").context("S3_REGION env var is not set")?, - endpoint: env::var("S3_ENDPOINT").context("S3_ENDPOINT env var is not set")?, + let remote_storage = match GenericRemoteStorage::new( + conf.workdir.clone(), + &RemoteStorageConfig { + max_concurrent_syncs: NonZeroUsize::new(10).unwrap(), + max_sync_errors: NonZeroU32::new(1).unwrap(), + storage: remote_storage::RemoteStorageKind::AwsS3(S3Config { + bucket_name: "zenith-testbucket".to_string(), + bucket_region: env::var("S3_REGION").context("S3_REGION env var 
is not set")?, + prefix_in_bucket: Some("walarchive/".to_string()), + endpoint: Some(env::var("S3_ENDPOINT").context("S3_ENDPOINT env var is not set")?), + concurrency_limit: NonZeroUsize::new(20).unwrap(), + }), + }, + )? { + GenericRemoteStorage::Local(_) => { + bail!("Unexpected: got local storage for the remote config") + } + GenericRemoteStorage::S3(remote_storage) => remote_storage, }; - let client = S3Client::new_with( - HttpClient::new().context("Failed to create S3 http client")?, - StaticProvider::new_minimal( - env::var("S3_ACCESSKEY").context("S3_ACCESSKEY env var is not set")?, - env::var("S3_SECRET").context("S3_SECRET env var is not set")?, - ), - region, - ); - - let bucket_name = "zenith-testbucket"; - loop { - let listing = gather_wal_entries(&client, bucket_name).await?; - let n = offload_files(&client, bucket_name, &listing, &conf.workdir, conf).await?; - info!("Offload {} files to S3", n); + let listing = remote_storage + .list() + .await? + .into_iter() + .collect::>(); + let n = offload_files(&remote_storage, &listing, &conf.workdir, conf).await?; + info!("Offload {n} files to S3"); sleep(conf.ttl.unwrap()).await; } } - -async fn gather_wal_entries( - client: &S3Client, - bucket_name: &str, -) -> anyhow::Result> { - let mut document_keys = HashSet::new(); - - let mut continuation_token = None::; - loop { - let response = client - .list_objects_v2(ListObjectsV2Request { - bucket: bucket_name.to_string(), - prefix: Some("walarchive/".to_string()), - continuation_token, - ..ListObjectsV2Request::default() - }) - .await?; - document_keys.extend( - response - .contents - .unwrap_or_default() - .into_iter() - .filter_map(|o| o.key), - ); - - continuation_token = response.continuation_token; - if continuation_token.is_none() { - break; - } - } - Ok(document_keys) -} diff --git a/test_runner/fixtures/zenith_fixtures.py b/test_runner/fixtures/zenith_fixtures.py index 7acf0552df..3bb7c606d3 100644 --- a/test_runner/fixtures/zenith_fixtures.py +++ 
b/test_runner/fixtures/zenith_fixtures.py @@ -472,20 +472,16 @@ class ZenithEnvBuilder: mock_endpoint = self.s3_mock_server.endpoint() mock_region = self.s3_mock_server.region() - mock_access_key = self.s3_mock_server.access_key() - mock_secret_key = self.s3_mock_server.secret_key() boto3.client( 's3', endpoint_url=mock_endpoint, region_name=mock_region, - aws_access_key_id=mock_access_key, - aws_secret_access_key=mock_secret_key, + aws_access_key_id=self.s3_mock_server.access_key(), + aws_secret_access_key=self.s3_mock_server.secret_key(), ).create_bucket(Bucket=bucket_name) self.pageserver_remote_storage = S3Storage(bucket=bucket_name, endpoint=mock_endpoint, - region=mock_region, - access_key=mock_access_key, - secret_key=mock_secret_key) + region=mock_region) def __enter__(self): return self @@ -811,8 +807,6 @@ class LocalFsStorage: class S3Storage: bucket: str region: str - access_key: Optional[str] - secret_key: Optional[str] endpoint: Optional[str] @@ -998,7 +992,14 @@ class ZenithCli: append_pageserver_param_overrides(start_args, self.env.pageserver.remote_storage, self.env.pageserver.config_override) - return self.raw_cli(start_args) + + s3_env_vars = None + if self.env.s3_mock_server: + s3_env_vars = { + 'AWS_ACCESS_KEY_ID': self.env.s3_mock_server.access_key(), + 'AWS_SECRET_ACCESS_KEY': self.env.s3_mock_server.secret_key(), + } + return self.raw_cli(start_args, extra_env_vars=s3_env_vars) def pageserver_stop(self, immediate=False) -> 'subprocess.CompletedProcess[str]': cmd = ['pageserver', 'stop'] @@ -1093,6 +1094,7 @@ class ZenithCli: def raw_cli(self, arguments: List[str], + extra_env_vars: Optional[Dict[str, str]] = None, check_return_code=True) -> 'subprocess.CompletedProcess[str]': """ Run "zenith" with the specified arguments. 
@@ -1117,9 +1119,10 @@ class ZenithCli: env_vars = os.environ.copy() env_vars['ZENITH_REPO_DIR'] = str(self.env.repo_dir) env_vars['POSTGRES_DISTRIB_DIR'] = str(pg_distrib_dir) - if self.env.rust_log_override is not None: env_vars['RUST_LOG'] = self.env.rust_log_override + for (extra_env_key, extra_env_value) in (extra_env_vars or {}).items(): + env_vars[extra_env_key] = extra_env_value # Pass coverage settings var = 'LLVM_PROFILE_FILE' @@ -1217,10 +1220,6 @@ def append_pageserver_param_overrides( pageserver_storage_override = f"bucket_name='{pageserver_remote_storage.bucket}',\ bucket_region='{pageserver_remote_storage.region}'" - if pageserver_remote_storage.access_key is not None: - pageserver_storage_override += f",access_key_id='{pageserver_remote_storage.access_key}'" - if pageserver_remote_storage.secret_key is not None: - pageserver_storage_override += f",secret_access_key='{pageserver_remote_storage.secret_key}'" if pageserver_remote_storage.endpoint is not None: pageserver_storage_override += f",endpoint='{pageserver_remote_storage.endpoint}'" diff --git a/workspace_hack/Cargo.toml b/workspace_hack/Cargo.toml index 2bb22f2d3b..92877faef7 100644 --- a/workspace_hack/Cargo.toml +++ b/workspace_hack/Cargo.toml @@ -21,7 +21,13 @@ chrono = { version = "0.4", features = ["clock", "libc", "oldtime", "serde", "st clap = { version = "2", features = ["ansi_term", "atty", "color", "strsim", "suggestions", "vec_map"] } either = { version = "1", features = ["use_std"] } fail = { version = "0.5", default-features = false, features = ["failpoints"] } +futures-channel = { version = "0.3", features = ["alloc", "futures-sink", "sink", "std"] } +futures-task = { version = "0.3", default-features = false, features = ["alloc", "std"] } +futures-util = { version = "0.3", default-features = false, features = ["alloc", "async-await", "async-await-macro", "channel", "futures-channel", "futures-io", "futures-macro", "futures-sink", "io", "memchr", "sink", "slab", "std"] } 
+generic-array = { version = "0.14", default-features = false, features = ["more_lengths"] } hashbrown = { version = "0.11", features = ["ahash", "inline-more", "raw"] } +hex = { version = "0.4", features = ["alloc", "serde", "std"] } +hyper = { version = "0.14", features = ["client", "full", "h2", "http1", "http2", "runtime", "server", "socket2", "stream", "tcp"] } indexmap = { version = "1", default-features = false, features = ["std"] } itoa = { version = "0.4", features = ["i128", "std"] } libc = { version = "0.2", features = ["extra_traits", "std"] } From 10e4da399737f26a3584ab8822e701e382e2dd43 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Mon, 2 May 2022 10:46:13 +0300 Subject: [PATCH 214/296] Rework timeline batching --- pageserver/src/http/routes.rs | 15 +- pageserver/src/layered_repository.rs | 115 ++-- pageserver/src/storage_sync.rs | 884 +++++++-------------------- pageserver/src/storage_sync/index.rs | 4 +- 4 files changed, 292 insertions(+), 726 deletions(-) diff --git a/pageserver/src/http/routes.rs b/pageserver/src/http/routes.rs index 8940efbda0..0104df826e 100644 --- a/pageserver/src/http/routes.rs +++ b/pageserver/src/http/routes.rs @@ -267,7 +267,7 @@ async fn timeline_attach_handler(request: Request) -> Result { tokio::fs::create_dir_all(state.conf.timeline_path(&timeline_id, &tenant_id)) .await @@ -300,11 +300,11 @@ async fn timeline_attach_handler(request: Request) -> Result anyhow::Result> { - let shard = match state.remote_storage.as_ref() { + let index_part = match state.remote_storage.as_ref() { Some(GenericRemoteStorage::Local(local_storage)) => { storage_sync::download_index_part(state.conf, local_storage, sync_id).await } @@ -313,18 +313,15 @@ async fn try_download_shard_data( } None => return Ok(None), } - .with_context(|| format!("Failed to download index shard for timeline {}", sync_id))?; + .with_context(|| format!("Failed to download index part for timeline {sync_id}"))?; let timeline_path = state .conf 
.timeline_path(&sync_id.timeline_id, &sync_id.tenant_id); - RemoteTimeline::from_index_part(&timeline_path, shard) + RemoteTimeline::from_index_part(&timeline_path, index_part) .map(Some) .with_context(|| { - format!( - "Failed to convert index shard into remote timeline for timeline {}", - sync_id - ) + format!("Failed to convert index part into remote timeline for timeline {sync_id}") }) } diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index da2699b15d..039bf8d1ed 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -455,7 +455,7 @@ enum LayeredTimelineEntry { impl LayeredTimelineEntry { fn timeline_id(&self) -> ZTimelineId { match self { - LayeredTimelineEntry::Loaded(timeline) => timeline.timelineid, + LayeredTimelineEntry::Loaded(timeline) => timeline.timeline_id, LayeredTimelineEntry::Unloaded { id, .. } => *id, } } @@ -615,21 +615,17 @@ impl LayeredRepository { fn load_local_timeline( &self, - timelineid: ZTimelineId, + timeline_id: ZTimelineId, timelines: &mut HashMap, ) -> anyhow::Result> { - let metadata = load_metadata(self.conf, timelineid, self.tenant_id) + let metadata = load_metadata(self.conf, timeline_id, self.tenant_id) .context("failed to load metadata")?; let disk_consistent_lsn = metadata.disk_consistent_lsn(); let ancestor = metadata .ancestor_timeline() .map(|ancestor_timeline_id| { - trace!( - "loading {}'s ancestor {}", - timelineid, - &ancestor_timeline_id - ); + trace!("loading {timeline_id}'s ancestor {}", &ancestor_timeline_id); self.get_timeline_load_internal(ancestor_timeline_id, timelines) }) .transpose() @@ -643,7 +639,7 @@ impl LayeredRepository { Arc::clone(&self.tenant_conf), metadata, ancestor, - timelineid, + timeline_id, self.tenant_id, Arc::clone(&self.walredo_mgr), self.upload_layers, @@ -902,8 +898,8 @@ pub struct LayeredTimeline { conf: &'static PageServerConf, tenant_conf: Arc>, - tenantid: ZTenantId, - timelineid: ZTimelineId, + 
tenant_id: ZTenantId, + timeline_id: ZTimelineId, layers: RwLock, @@ -1177,50 +1173,50 @@ impl LayeredTimeline { tenant_conf: Arc>, metadata: TimelineMetadata, ancestor: Option, - timelineid: ZTimelineId, - tenantid: ZTenantId, + timeline_id: ZTimelineId, + tenant_id: ZTenantId, walredo_mgr: Arc, upload_layers: bool, ) -> LayeredTimeline { let reconstruct_time_histo = RECONSTRUCT_TIME - .get_metric_with_label_values(&[&tenantid.to_string(), &timelineid.to_string()]) + .get_metric_with_label_values(&[&tenant_id.to_string(), &timeline_id.to_string()]) .unwrap(); let materialized_page_cache_hit_counter = MATERIALIZED_PAGE_CACHE_HIT - .get_metric_with_label_values(&[&tenantid.to_string(), &timelineid.to_string()]) + .get_metric_with_label_values(&[&tenant_id.to_string(), &timeline_id.to_string()]) .unwrap(); let flush_time_histo = STORAGE_TIME .get_metric_with_label_values(&[ "layer flush", - &tenantid.to_string(), - &timelineid.to_string(), + &tenant_id.to_string(), + &timeline_id.to_string(), ]) .unwrap(); let compact_time_histo = STORAGE_TIME .get_metric_with_label_values(&[ "compact", - &tenantid.to_string(), - &timelineid.to_string(), + &tenant_id.to_string(), + &timeline_id.to_string(), ]) .unwrap(); let create_images_time_histo = STORAGE_TIME .get_metric_with_label_values(&[ "create images", - &tenantid.to_string(), - &timelineid.to_string(), + &tenant_id.to_string(), + &timeline_id.to_string(), ]) .unwrap(); let last_record_gauge = LAST_RECORD_LSN - .get_metric_with_label_values(&[&tenantid.to_string(), &timelineid.to_string()]) + .get_metric_with_label_values(&[&tenant_id.to_string(), &timeline_id.to_string()]) .unwrap(); let wait_lsn_time_histo = WAIT_LSN_TIME - .get_metric_with_label_values(&[&tenantid.to_string(), &timelineid.to_string()]) + .get_metric_with_label_values(&[&tenant_id.to_string(), &timeline_id.to_string()]) .unwrap(); LayeredTimeline { conf, tenant_conf, - timelineid, - tenantid, + timeline_id, + tenant_id, layers: 
RwLock::new(LayerMap::default()), walredo_mgr, @@ -1272,7 +1268,7 @@ impl LayeredTimeline { // Scan timeline directory and create ImageFileName and DeltaFilename // structs representing all files on disk - let timeline_path = self.conf.timeline_path(&self.timelineid, &self.tenantid); + let timeline_path = self.conf.timeline_path(&self.timeline_id, &self.tenant_id); for direntry in fs::read_dir(timeline_path)? { let direntry = direntry?; @@ -1284,7 +1280,7 @@ impl LayeredTimeline { if imgfilename.lsn > disk_consistent_lsn { warn!( "found future image layer {} on timeline {} disk_consistent_lsn is {}", - imgfilename, self.timelineid, disk_consistent_lsn + imgfilename, self.timeline_id, disk_consistent_lsn ); rename_to_backup(direntry.path())?; @@ -1292,7 +1288,7 @@ impl LayeredTimeline { } let layer = - ImageLayer::new(self.conf, self.timelineid, self.tenantid, &imgfilename); + ImageLayer::new(self.conf, self.timeline_id, self.tenant_id, &imgfilename); trace!("found layer {}", layer.filename().display()); layers.insert_historic(Arc::new(layer)); @@ -1307,7 +1303,7 @@ impl LayeredTimeline { if deltafilename.lsn_range.end > disk_consistent_lsn + 1 { warn!( "found future delta layer {} on timeline {} disk_consistent_lsn is {}", - deltafilename, self.timelineid, disk_consistent_lsn + deltafilename, self.timeline_id, disk_consistent_lsn ); rename_to_backup(direntry.path())?; @@ -1315,7 +1311,7 @@ impl LayeredTimeline { } let layer = - DeltaLayer::new(self.conf, self.timelineid, self.tenantid, &deltafilename); + DeltaLayer::new(self.conf, self.timeline_id, self.tenant_id, &deltafilename); trace!("found layer {}", layer.filename().display()); layers.insert_historic(Arc::new(layer)); @@ -1497,7 +1493,7 @@ impl LayeredTimeline { // FIXME: It's pointless to check the cache for things that are not 8kB pages. 
// We should look at the key to determine if it's a cacheable object let (lsn, read_guard) = - cache.lookup_materialized_page(self.tenantid, self.timelineid, key, lsn)?; + cache.lookup_materialized_page(self.tenant_id, self.timeline_id, key, lsn)?; let img = Bytes::from(read_guard.to_vec()); Some((lsn, img)) } @@ -1509,7 +1505,7 @@ impl LayeredTimeline { .with_context(|| { format!( "Ancestor is missing. Timeline id: {} Ancestor id {:?}", - self.timelineid, + self.timeline_id, self.get_ancestor_timeline_id(), ) })? @@ -1517,7 +1513,7 @@ impl LayeredTimeline { .with_context(|| { format!( "Ancestor timeline is not is not loaded. Timeline id: {} Ancestor id {:?}", - self.timelineid, + self.timeline_id, self.get_ancestor_timeline_id(), ) })?; @@ -1554,12 +1550,12 @@ impl LayeredTimeline { trace!( "creating layer for write at {}/{} for record at {}", - self.timelineid, + self.timeline_id, start_lsn, lsn ); let new_layer = - InMemoryLayer::create(self.conf, self.timelineid, self.tenantid, start_lsn)?; + InMemoryLayer::create(self.conf, self.timeline_id, self.tenant_id, start_lsn)?; let layer_rc = Arc::new(new_layer); layers.open_layer = Some(Arc::clone(&layer_rc)); @@ -1633,8 +1629,8 @@ impl LayeredTimeline { let self_clone = Arc::clone(self); thread_mgr::spawn( thread_mgr::ThreadKind::LayerFlushThread, - Some(self.tenantid), - Some(self.timelineid), + Some(self.tenant_id), + Some(self.timeline_id), "layer flush thread", false, move || self_clone.flush_frozen_layers(false), @@ -1703,7 +1699,7 @@ impl LayeredTimeline { // them all in parallel. 
par_fsync::par_fsync(&[ new_delta_path.clone(), - self.conf.timeline_path(&self.timelineid, &self.tenantid), + self.conf.timeline_path(&self.timeline_id, &self.tenant_id), ])?; fail_point!("checkpoint-before-sync"); @@ -1775,8 +1771,8 @@ impl LayeredTimeline { LayeredRepository::save_metadata( self.conf, - self.timelineid, - self.tenantid, + self.timeline_id, + self.tenant_id, &metadata, false, )?; @@ -1786,8 +1782,8 @@ impl LayeredTimeline { if self.upload_layers.load(atomic::Ordering::Relaxed) { storage_sync::schedule_layer_upload( - self.tenantid, - self.timelineid, + self.tenant_id, + self.timeline_id, HashSet::from([new_delta_path]), Some(metadata), ); @@ -1840,7 +1836,8 @@ impl LayeredTimeline { let target_file_size = self.get_checkpoint_distance(); // Define partitioning schema if needed - if let Ok(pgdir) = tenant_mgr::get_local_timeline_with_load(self.tenantid, self.timelineid) + if let Ok(pgdir) = + tenant_mgr::get_local_timeline_with_load(self.tenant_id, self.timeline_id) { let (partitioning, lsn) = pgdir.repartition( self.get_last_record_lsn(), @@ -1858,8 +1855,8 @@ impl LayeredTimeline { } if self.upload_layers.load(atomic::Ordering::Relaxed) { storage_sync::schedule_layer_upload( - self.tenantid, - self.timelineid, + self.tenant_id, + self.timeline_id, layer_paths_to_upload, None, ); @@ -1909,7 +1906,7 @@ impl LayeredTimeline { let img_range = partition.ranges.first().unwrap().start..partition.ranges.last().unwrap().end; let mut image_layer_writer = - ImageLayerWriter::new(self.conf, self.timelineid, self.tenantid, &img_range, lsn)?; + ImageLayerWriter::new(self.conf, self.timeline_id, self.tenant_id, &img_range, lsn)?; for range in &partition.ranges { let mut key = range.start; @@ -1932,7 +1929,7 @@ impl LayeredTimeline { // and fsync them all in parallel. 
par_fsync::par_fsync(&[ image_layer.path(), - self.conf.timeline_path(&self.timelineid, &self.tenantid), + self.conf.timeline_path(&self.timeline_id, &self.tenant_id), ])?; // FIXME: Do we need to do something to upload it to remote storage here? @@ -2008,8 +2005,8 @@ impl LayeredTimeline { if writer.is_none() { writer = Some(DeltaLayerWriter::new( self.conf, - self.timelineid, - self.tenantid, + self.timeline_id, + self.tenant_id, key, lsn_range.clone(), )?); @@ -2027,7 +2024,7 @@ impl LayeredTimeline { let mut layer_paths: Vec = new_layers.iter().map(|l| l.path()).collect(); // also sync the directory - layer_paths.push(self.conf.timeline_path(&self.timelineid, &self.tenantid)); + layer_paths.push(self.conf.timeline_path(&self.timeline_id, &self.tenant_id)); // Fsync all the layer files and directory using multiple threads to // minimize latency. @@ -2057,14 +2054,14 @@ impl LayeredTimeline { if self.upload_layers.load(atomic::Ordering::Relaxed) { storage_sync::schedule_layer_upload( - self.tenantid, - self.timelineid, + self.tenant_id, + self.timeline_id, new_layer_paths, None, ); storage_sync::schedule_layer_delete( - self.tenantid, - self.timelineid, + self.tenant_id, + self.timeline_id, layer_paths_do_delete, ); } @@ -2121,7 +2118,7 @@ impl LayeredTimeline { let cutoff = gc_info.cutoff; let pitr = gc_info.pitr; - let _enter = info_span!("garbage collection", timeline = %self.timelineid, tenant = %self.tenantid, cutoff = %cutoff).entered(); + let _enter = info_span!("garbage collection", timeline = %self.timeline_id, tenant = %self.tenant_id, cutoff = %cutoff).entered(); // We need to ensure that no one branches at a point before latest_gc_cutoff_lsn. // See branch_timeline() for details. 
@@ -2254,8 +2251,8 @@ impl LayeredTimeline { if self.upload_layers.load(atomic::Ordering::Relaxed) { storage_sync::schedule_layer_delete( - self.tenantid, - self.timelineid, + self.tenant_id, + self.timeline_id, layer_paths_to_delete, ); } @@ -2323,8 +2320,8 @@ impl LayeredTimeline { if img.len() == page_cache::PAGE_SZ { let cache = page_cache::get(); cache.memorize_materialized_page( - self.tenantid, - self.timelineid, + self.tenant_id, + self.timeline_id, key, last_rec_lsn, &img, diff --git a/pageserver/src/storage_sync.rs b/pageserver/src/storage_sync.rs index bcc18e8ce4..b6091015b9 100644 --- a/pageserver/src/storage_sync.rs +++ b/pageserver/src/storage_sync.rs @@ -92,12 +92,12 @@ //! A queue is implemented in the [`sync_queue`] module as a pair of sender and receiver channels, to block on zero tasks instead of checking the queue. //! The pair's shared buffer of a fixed size serves as an implicit queue, holding [`SyncTask`] for local files upload/download operations. //! -//! The queue gets emptied by a single thread with the loop, that polls the tasks in batches of deduplicated tasks (size configurable). -//! A task from the batch corresponds to a single timeline, with its files to sync merged together. -//! Every batch task and layer file in the task is processed concurrently, which is possible due to incremental nature of the timelines: -//! it's not asserted, but assumed that timeline's checkpoints only add the files locally, not removing or amending the existing ones. -//! Only GC removes local timeline files, the GC support is not added to sync currently, -//! yet downloading extra files is not critically bad at this stage, GC can remove those again. +//! The queue gets emptied by a single thread with the loop, that polls the tasks in batches of deduplicated tasks. +//! A task from the batch corresponds to a single timeline, with its files to sync merged together: given that only one task sync loop step is active at a time, +//! 
timeline uploads and downloads can happen concurrently, in no particular order due to the incremental nature of the timeline layers. +//! Deletion happens only after a successful upload, otherwise the compaction output might make the timeline inconsistent until both tasks are fully processed without errors. +//! Upload and download update the remote data (in-memory index and S3 JSON index part file) only after every layer is successfully synchronized, while the deletion task +//! does the opposite: it requires the remote data to be updated successfully first, so that the blob files are already invisible to the pageserver. //! //! During the loop startup, an initial [`RemoteTimelineIndex`] state is constructed via downloading and merging the index data for all timelines, //! present locally. @@ -119,7 +119,7 @@ //! Among other tasks, the index is used to prevent invalid uploads and non-existing downloads on demand, refer to [`index`] for more details. //! //! Index construction is currently the only place where the storage sync can return an [`Err`] to the user. -//! New sync tasks are accepted via [`schedule_timeline_checkpoint_upload`] and [`schedule_timeline_download`] functions, +//! New sync tasks are accepted via [`schedule_layer_upload`], [`schedule_layer_download`] and [`schedule_layer_delete`] functions, //! disregarding of the corresponding loop startup. //! It's up to the caller to avoid synchronizations if the loop is disabled: otherwise, the sync tasks will be ignored. //! After the initial state is loaded into memory and the loop starts, any further [`Err`] results do not stop the loop, but rather @@ -449,7 +449,7 @@ fn collect_timeline_files( /// mpsc approach was picked to allow blocking the sync loop if no tasks are present, to avoid meaningless spinning. 
mod sync_queue { use std::{ - collections::{hash_map, HashMap, HashSet}, + collections::{HashMap, HashSet}, num::NonZeroUsize, ops::ControlFlow, sync::atomic::{AtomicUsize, Ordering}, @@ -460,7 +460,7 @@ mod sync_queue { use tokio::sync::mpsc::{error::TryRecvError, UnboundedReceiver, UnboundedSender}; use tracing::{debug, warn}; - use super::SyncTask; + use super::{SyncTask, SyncTaskBatch}; use utils::zid::ZTenantTimelineId; static SENDER: OnceCell> = OnceCell::new(); @@ -512,10 +512,10 @@ mod sync_queue { /// Not blocking, can return fewer tasks if the queue does not contain enough. /// Batch tasks are split by timelines, with all related tasks merged into one (download/upload) /// or two (download and upload, if both were found in the queue during batch construction). - pub async fn next_task_batch( + pub(super) async fn next_task_batch( receiver: &mut UnboundedReceiver<(ZTenantTimelineId, SyncTask)>, max_timelines_to_sync: NonZeroUsize, - ) -> ControlFlow<(), HashMap> { + ) -> ControlFlow<(), HashMap> { // request the first task in blocking fashion to do less meaningless work let (first_sync_id, first_task) = if let Some(first_task) = next_task(receiver).await { first_task @@ -529,26 +529,21 @@ mod sync_queue { batched_timelines.insert(first_sync_id.timeline_id); let mut tasks = HashMap::new(); - tasks.insert(first_sync_id, first_task); + tasks.insert(first_sync_id, SyncTaskBatch::new(first_task)); loop { if batched_timelines.len() >= max_timelines_to_sync { - debug!("Filled a full task batch with {max_timelines_to_sync} timeline sync operations"); + debug!( + "Filled a full task batch with {} timeline sync operations", + batched_timelines.len() + ); break; } match receiver.try_recv() { Ok((sync_id, new_task)) => { LENGTH.fetch_sub(1, Ordering::Relaxed); - match tasks.entry(sync_id) { - hash_map::Entry::Occupied(o) => { - let current = o.remove(); - tasks.insert(sync_id, current.merge(new_task)); - } - hash_map::Entry::Vacant(v) => { - v.insert(new_task); - } - 
} + tasks.entry(sync_id).or_default().add(new_task); batched_timelines.insert(sync_id.timeline_id); } Err(TryRecvError::Disconnected) => { @@ -583,8 +578,8 @@ pub enum SyncTask { Download(SyncData), /// A certain amount of image files to download. Upload(SyncData), - /// Both upload and download layers need to be synced. - DownloadAndUpload(SyncData, SyncData), + /// Delete remote files. + Delete(SyncData>), } /// Stores the data to synd and its retries, to evict the tasks failing to frequently. @@ -609,121 +604,70 @@ impl SyncTask { Self::Upload(SyncData::new(0, upload_task)) } - /// Merges two tasks into one with the following rules: - /// - /// * Download + Download = Download with the retry counter reset and the layers to skip combined - /// * DownloadAndUpload + Download = DownloadAndUpload with Upload unchanged and the Download counterparts united by the same rules - /// * Upload + Upload = Upload with the retry counter reset and the layers to upload and the uploaded layers combined - /// * DownloadAndUpload + Upload = DownloadAndUpload with Download unchanged and the Upload counterparts united by the same rules - /// * Upload + Download = DownloadAndUpload with both tasks unchanged - /// * DownloadAndUpload + DownloadAndUpload = DownloadAndUpload with both parts united by the same rules - fn merge(mut self, other: Self) -> Self { - match (&mut self, other) { - ( - SyncTask::DownloadAndUpload(download_data, _) | SyncTask::Download(download_data), - SyncTask::Download(new_download_data), - ) - | ( - SyncTask::Download(download_data), - SyncTask::DownloadAndUpload(new_download_data, _), - ) => { - download_data - .data - .layers_to_skip - .extend(new_download_data.data.layers_to_skip.into_iter()); - download_data.retries = 0; - } - (SyncTask::Upload(upload), SyncTask::Download(new_download_data)) => { - self = SyncTask::DownloadAndUpload(new_download_data, upload.clone()); - } + fn delete(layers_to_delete: HashSet) -> Self { + Self::Delete(SyncData::new(0, 
layers_to_delete)) + } +} - ( - SyncTask::DownloadAndUpload(_, upload_data) | SyncTask::Upload(upload_data), - SyncTask::Upload(new_upload_data), - ) - | (SyncTask::Upload(upload_data), SyncTask::DownloadAndUpload(_, new_upload_data)) => { - upload_data - .data - .layers_to_upload - .extend(new_upload_data.data.layers_to_upload.into_iter()); - upload_data - .data - .uploaded_layers - .extend(new_upload_data.data.uploaded_layers.into_iter()); - upload_data.retries = 0; +#[derive(Debug, Default)] +struct SyncTaskBatch { + upload: Option>, + download: Option>, + delete: Option>>, +} - if new_upload_data - .data - .metadata - .as_ref() - .map(|meta| meta.disk_consistent_lsn()) - > upload_data +impl SyncTaskBatch { + fn new(task: SyncTask) -> Self { + let mut new_self = Self::default(); + new_self.add(task); + new_self + } + + fn add(&mut self, task: SyncTask) { + match task { + SyncTask::Download(new_download) => match &mut self.download { + Some(batch_download) => { + batch_download.retries = batch_download.retries.min(new_download.retries); + batch_download .data + .layers_to_skip + .extend(new_download.data.layers_to_skip.into_iter()); + } + None => self.download = Some(new_download), + }, + SyncTask::Upload(new_upload) => match &mut self.upload { + Some(batch_upload) => { + batch_upload.retries = batch_upload.retries.min(new_upload.retries); + + let batch_data = &mut batch_upload.data; + let new_data = new_upload.data; + batch_data + .layers_to_upload + .extend(new_data.layers_to_upload.into_iter()); + batch_data + .uploaded_layers + .extend(new_data.uploaded_layers.into_iter()); + if batch_data .metadata .as_ref() .map(|meta| meta.disk_consistent_lsn()) - { - upload_data.data.metadata = new_upload_data.data.metadata; + <= new_data + .metadata + .as_ref() + .map(|meta| meta.disk_consistent_lsn()) + { + batch_data.metadata = new_data.metadata; + } } - } - (SyncTask::Download(download), SyncTask::Upload(new_upload_data)) => { - self = 
SyncTask::DownloadAndUpload(download.clone(), new_upload_data) - } - - ( - SyncTask::DownloadAndUpload(download_data, upload_data), - SyncTask::DownloadAndUpload(new_download_data, new_upload_data), - ) => { - download_data - .data - .layers_to_skip - .extend(new_download_data.data.layers_to_skip.into_iter()); - download_data.retries = 0; - - upload_data - .data - .layers_to_upload - .extend(new_upload_data.data.layers_to_upload.into_iter()); - upload_data - .data - .uploaded_layers - .extend(new_upload_data.data.uploaded_layers.into_iter()); - upload_data.retries = 0; - - if new_upload_data - .data - .metadata - .as_ref() - .map(|meta| meta.disk_consistent_lsn()) - > upload_data - .data - .metadata - .as_ref() - .map(|meta| meta.disk_consistent_lsn()) - { - upload_data.data.metadata = new_upload_data.data.metadata; + None => self.upload = Some(new_upload), + }, + SyncTask::Delete(new_delete) => match &mut self.delete { + Some(batch_delete) => { + batch_delete.retries = batch_delete.retries.min(new_delete.retries); + batch_delete.data.extend(new_delete.data.into_iter()); } - } - } - - self - } - - fn name(&self) -> &'static str { - match self { - SyncTask::Download(_) => "download", - SyncTask::Upload(_) => "upload", - SyncTask::DownloadAndUpload(_, _) => "download and upload", - } - } - - fn retries(&self) -> u32 { - match self { - SyncTask::Download(data) => data.retries, - SyncTask::Upload(data) => data.retries, - SyncTask::DownloadAndUpload(download_data, upload_data) => { - download_data.retries.max(upload_data.retries) - } + None => self.delete = Some(new_delete), + }, } } } @@ -760,6 +704,7 @@ pub fn schedule_layer_upload( layers_to_upload: HashSet, metadata: Option, ) { + debug!("Scheduling layer upload for tenant {tenant_id}, timeline {timeline_id}, to upload: {layers_to_upload:?}"); if !sync_queue::push( ZTenantTimelineId { tenant_id, @@ -771,18 +716,29 @@ pub fn schedule_layer_upload( metadata, }), ) { - warn!("Could not send an upload task for tenant 
{tenant_id}, timeline {timeline_id}",) + warn!("Could not send an upload task for tenant {tenant_id}, timeline {timeline_id}") } else { debug!("Upload task for tenant {tenant_id}, timeline {timeline_id} sent") } } pub fn schedule_layer_delete( - _tenant_id: ZTenantId, - _timeline_id: ZTimelineId, - _layers_to_delete: HashSet, + tenant_id: ZTenantId, + timeline_id: ZTimelineId, + layers_to_delete: HashSet, ) { - // TODO kb implement later + debug!("Scheduling layer deletion for tenant {tenant_id}, timeline {timeline_id}, to delete: {layers_to_delete:?}"); + if !sync_queue::push( + ZTenantTimelineId { + tenant_id, + timeline_id, + }, + SyncTask::delete(layers_to_delete), + ) { + warn!("Could not send deletion task for tenant {tenant_id}, timeline {timeline_id}") + } else { + debug!("Deletion task for tenant {tenant_id}, timeline {timeline_id} sent") + } } /// Requests the download of the entire timeline for a given tenant. @@ -948,13 +904,13 @@ where let mut sync_results = batched_tasks .into_iter() - .map(|(sync_id, task)| { + .map(|(sync_id, batch)| { let storage = Arc::clone(&storage); let index = index.clone(); async move { let state_update = - process_sync_task(conf, storage, index, max_sync_errors, sync_id, task) - .instrument(info_span!("process_sync_tasks", sync_id = %sync_id)) + process_sync_task_batch(conf, storage, index, max_sync_errors, sync_id, batch) + .instrument(info_span!("process_sync_task_batch", sync_id = %sync_id)) .await; (sync_id, state_update) } @@ -978,13 +934,13 @@ where ControlFlow::Continue(new_timeline_states) } -async fn process_sync_task( +async fn process_sync_task_batch( conf: &'static PageServerConf, storage: Arc, index: RemoteIndex, max_sync_errors: NonZeroU32, sync_id: ZTenantTimelineId, - task: SyncTask, + batch: SyncTaskBatch, ) -> Option where P: Debug + Send + Sync + 'static, @@ -993,124 +949,103 @@ where let sync_start = Instant::now(); let current_remote_timeline = { index.read().await.timeline_entry(&sync_id).cloned() }; - 
let task = match validate_task_retries(sync_id, task, max_sync_errors) { - ControlFlow::Continue(task) => task, - ControlFlow::Break(aborted_task) => { - match aborted_task { - SyncTask::Download(_) => { - index - .write() - .await - .set_awaits_download(&sync_id, false) - .ok(); - } - SyncTask::Upload(failed_upload_data) => { - if let Err(e) = update_remote_data( - conf, - storage.as_ref(), - &index, - sync_id, - &failed_upload_data.data, - true, - ) + let upload_data = batch.upload.clone(); + let download_data = batch.download.clone(); + let ((), status_update) = tokio::join!( + async { + if let Some(upload_data) = upload_data { + match validate_task_retries(upload_data, max_sync_errors) + .instrument(info_span!("retries_validation")) .await - { - error!("Failed to update remote timeline {sync_id}: {e:?}"); + { + ControlFlow::Continue(new_upload_data) => { + upload_timeline_data( + conf, + (storage.as_ref(), &index), + current_remote_timeline.as_ref(), + sync_id, + new_upload_data, + sync_start, + "upload", + ) + .await; } - } - SyncTask::DownloadAndUpload(_, failed_upload_data) => { - index - .write() + ControlFlow::Break(failed_upload_data) => { + if let Err(e) = update_remote_data( + conf, + storage.as_ref(), + &index, + sync_id, + &failed_upload_data.data, + true, + ) .await - .set_awaits_download(&sync_id, false) - .ok(); - if let Err(e) = update_remote_data( - conf, - storage.as_ref(), - &index, - sync_id, - &failed_upload_data.data, - true, - ) - .await - { - error!("Failed to update remote timeline {sync_id}: {e:?}"); + { + error!("Failed to update remote timeline {sync_id}: {e:?}"); + } } } } - return None; } - }; - - let task_name = task.name(); - let current_task_attempt = task.retries(); - info!("Sync task '{task_name}' processing started, attempt #{current_task_attempt}"); - - if current_task_attempt > 0 { - let seconds_to_wait = 2.0_f64.powf(current_task_attempt as f64 - 1.0).min(30.0); - info!("Waiting {seconds_to_wait} seconds before starting the 
'{task_name}' task"); - tokio::time::sleep(Duration::from_secs_f64(seconds_to_wait)).await; - } - - let status_update = match task { - SyncTask::Download(new_download_data) => { - download_timeline( - conf, - (storage.as_ref(), &index), - current_remote_timeline.as_ref(), - sync_id, - new_download_data, - sync_start, - task_name, - ) - .await - } - SyncTask::Upload(new_upload_data) => { - upload_timeline( - conf, - (storage.as_ref(), &index), - current_remote_timeline.as_ref(), - sync_id, - new_upload_data, - sync_start, - task_name, - ) - .await; + .instrument(info_span!("upload_timeline_data")), + async { + if let Some(download_data) = download_data { + match validate_task_retries(download_data, max_sync_errors) + .instrument(info_span!("retries_validation")) + .await + { + ControlFlow::Continue(new_download_data) => { + return download_timeline_data( + conf, + (storage.as_ref(), &index), + current_remote_timeline.as_ref(), + sync_id, + new_download_data, + sync_start, + "download", + ) + .await + } + ControlFlow::Break(_) => { + index + .write() + .await + .set_awaits_download(&sync_id, false) + .ok(); + } + } + } None } - SyncTask::DownloadAndUpload(new_download_data, new_upload_data) => { - let status_update = download_timeline( - conf, - (storage.as_ref(), &index), - current_remote_timeline.as_ref(), - sync_id, - new_download_data, - sync_start, - task_name, - ) - .await; + .instrument(info_span!("download_timeline_data")), + ); - upload_timeline( - conf, - (storage.as_ref(), &index), - current_remote_timeline.as_ref(), - sync_id, - new_upload_data, - sync_start, - task_name, - ) - .await; - - status_update + if let Some(delete_data) = batch.delete { + match validate_task_retries(delete_data, max_sync_errors) + .instrument(info_span!("retries_validation")) + .await + { + ControlFlow::Continue(new_delete_data) => { + delete_timeline_data( + conf, + (storage.as_ref(), &index), + current_remote_timeline.as_ref(), + sync_id, + new_delete_data, + sync_start, + 
"delete", + ) + .instrument(info_span!("delete_timeline_data")) + .await; + } + ControlFlow::Break(_) => {} } - }; - - info!("Finished processing the task"); + } status_update } -async fn download_timeline( +async fn download_timeline_data( conf: &'static PageServerConf, (storage, index): (&S, &RemoteIndex), current_remote_timeline: Option<&RemoteTimeline>, @@ -1228,6 +1163,31 @@ async fn update_local_metadata( Ok(()) } +async fn delete_timeline_data( + conf: &PageServerConf, + index: (&S, &RemoteIndex), + as_ref: Option<&RemoteTimeline>, + sync_id: ZTenantTimelineId, + new_delete_data: SyncData>, + sync_start: Instant, + task_name: &str, +) -> Option<()> +where + P: Debug + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, +{ + // match update_remote_data(conf, storage, index, sync_id, &uploaded_data.data, false).await { + // Ok(()) => register_sync_status(sync_start, task_name, Some(true)), + // Err(e) => { + // error!("Failed to update remote timeline {sync_id}: {e:?}"); + // uploaded_data.retries += 1; + // sync_queue::push(sync_id, SyncTask::Upload(uploaded_data)); + // register_sync_status(sync_start, task_name, Some(false)); + // } + // } + todo!("TODO kb") +} + async fn read_metadata_file(metadata_path: &Path) -> anyhow::Result { TimelineMetadata::from_bytes( &fs::read(metadata_path) @@ -1237,7 +1197,7 @@ async fn read_metadata_file(metadata_path: &Path) -> anyhow::Result( +async fn upload_timeline_data( conf: &'static PageServerConf, (storage, index): (&S, &RemoteIndex), current_remote_timeline: Option<&RemoteTimeline>, @@ -1245,7 +1205,8 @@ async fn upload_timeline( new_upload_data: SyncData, sync_start: Instant, task_name: &str, -) where +) -> Option<()> +where P: Debug + Send + Sync + 'static, S: RemoteStorage + Send + Sync + 'static, { @@ -1255,7 +1216,7 @@ async fn upload_timeline( { UploadedTimeline::FailedAndRescheduled => { register_sync_status(sync_start, task_name, Some(false)); - return; + return None; } 
UploadedTimeline::Successful(upload_data) => upload_data, UploadedTimeline::SuccessfulAfterLocalFsUpdate(mut outdated_upload_data) => { @@ -1272,7 +1233,7 @@ async fn upload_timeline( outdated_upload_data.retries += 1; sync_queue::push(sync_id, SyncTask::Upload(outdated_upload_data)); register_sync_status(sync_start, task_name, Some(false)); - return; + return None; } }; outdated_upload_data.data.metadata = Some(local_metadata); @@ -1282,12 +1243,16 @@ async fn upload_timeline( }; match update_remote_data(conf, storage, index, sync_id, &uploaded_data.data, false).await { - Ok(()) => register_sync_status(sync_start, task_name, Some(true)), + Ok(()) => { + register_sync_status(sync_start, task_name, Some(true)); + Some(()) + } Err(e) => { error!("Failed to update remote timeline {sync_id}: {e:?}"); uploaded_data.retries += 1; sync_queue::push(sync_id, SyncTask::Upload(uploaded_data)); register_sync_status(sync_start, task_name, Some(false)); + None } } } @@ -1358,51 +1323,25 @@ where .context("Failed to upload new index part") } -fn validate_task_retries( - sync_id: ZTenantTimelineId, - task: SyncTask, +async fn validate_task_retries( + sync_data: SyncData, max_sync_errors: NonZeroU32, -) -> ControlFlow { +) -> ControlFlow, SyncData> { + let current_attempt = sync_data.retries; let max_sync_errors = max_sync_errors.get(); - let mut skip_upload = false; - let mut skip_download = false; - - match &task { - SyncTask::Download(download_data) | SyncTask::DownloadAndUpload(download_data, _) - if download_data.retries > max_sync_errors => - { - error!( - "Evicting download task for timeline {sync_id} that failed {} times, exceeding the error threshold {max_sync_errors}", - download_data.retries - ); - skip_download = true; - } - SyncTask::Upload(upload_data) | SyncTask::DownloadAndUpload(_, upload_data) - if upload_data.retries > max_sync_errors => - { - error!( - "Evicting upload task for timeline {sync_id} that failed {} times, exceeding the error threshold 
{max_sync_errors}", - upload_data.retries, - ); - skip_upload = true; - } - _ => {} + if current_attempt >= max_sync_errors { + error!( + "Aborting task that failed {current_attempt} times, exceeding retries threshold of {max_sync_errors}", + ); + return ControlFlow::Break(sync_data); } - match task { - aborted_task @ SyncTask::Download(_) if skip_download => ControlFlow::Break(aborted_task), - aborted_task @ SyncTask::Upload(_) if skip_upload => ControlFlow::Break(aborted_task), - aborted_task @ SyncTask::DownloadAndUpload(_, _) if skip_upload && skip_download => { - ControlFlow::Break(aborted_task) - } - SyncTask::DownloadAndUpload(download_task, _) if skip_upload => { - ControlFlow::Continue(SyncTask::Download(download_task)) - } - SyncTask::DownloadAndUpload(_, upload_task) if skip_download => { - ControlFlow::Continue(SyncTask::Upload(upload_task)) - } - not_skipped => ControlFlow::Continue(not_skipped), + if current_attempt > 0 { + let seconds_to_wait = 2.0_f64.powf(current_attempt as f64 - 1.0).min(30.0); + info!("Waiting {seconds_to_wait} seconds before starting the task"); + tokio::time::sleep(Duration::from_secs_f64(seconds_to_wait)).await; } + ControlFlow::Continue(sync_data) } async fn try_fetch_index_parts( @@ -1602,370 +1541,3 @@ mod test_utils { TimelineMetadata::new(disk_consistent_lsn, None, None, Lsn(0), Lsn(0), Lsn(0)) } } - -#[cfg(test)] -mod tests { - use std::collections::BTreeSet; - - use super::{test_utils::dummy_metadata, *}; - use utils::lsn::Lsn; - - #[test] - fn download_sync_tasks_merge() { - let download_1 = SyncTask::Download(SyncData::new( - 2, - TimelineDownload { - layers_to_skip: HashSet::from([PathBuf::from("one")]), - }, - )); - let download_2 = SyncTask::Download(SyncData::new( - 6, - TimelineDownload { - layers_to_skip: HashSet::from([PathBuf::from("two"), PathBuf::from("three")]), - }, - )); - - let merged_download = match download_1.merge(download_2) { - SyncTask::Download(merged_download) => merged_download, - 
wrong_merge_result => panic!("Unexpected merge result: {wrong_merge_result:?}"), - }; - - assert_eq!( - merged_download.retries, 0, - "Merged task should have its retries counter reset" - ); - - assert_eq!( - merged_download - .data - .layers_to_skip - .into_iter() - .collect::>(), - BTreeSet::from([ - PathBuf::from("one"), - PathBuf::from("two"), - PathBuf::from("three") - ]), - "Merged download tasks should a combined set of layers to skip" - ); - } - - #[test] - fn upload_sync_tasks_merge() { - let metadata_1 = dummy_metadata(Lsn(1)); - let metadata_2 = dummy_metadata(Lsn(2)); - assert!(metadata_2.disk_consistent_lsn() > metadata_1.disk_consistent_lsn()); - - let upload_1 = SyncTask::Upload(SyncData::new( - 2, - TimelineUpload { - layers_to_upload: HashSet::from([PathBuf::from("one")]), - uploaded_layers: HashSet::from([PathBuf::from("u_one")]), - metadata: Some(metadata_1), - }, - )); - let upload_2 = SyncTask::Upload(SyncData::new( - 6, - TimelineUpload { - layers_to_upload: HashSet::from([PathBuf::from("two"), PathBuf::from("three")]), - uploaded_layers: HashSet::from([PathBuf::from("u_two")]), - metadata: Some(metadata_2.clone()), - }, - )); - - let merged_upload = match upload_1.merge(upload_2) { - SyncTask::Upload(merged_upload) => merged_upload, - wrong_merge_result => panic!("Unexpected merge result: {wrong_merge_result:?}"), - }; - - assert_eq!( - merged_upload.retries, 0, - "Merged task should have its retries counter reset" - ); - - let upload = merged_upload.data; - assert_eq!( - upload.layers_to_upload.into_iter().collect::>(), - BTreeSet::from([ - PathBuf::from("one"), - PathBuf::from("two"), - PathBuf::from("three") - ]), - "Merged upload tasks should a combined set of layers to upload" - ); - - assert_eq!( - upload.uploaded_layers.into_iter().collect::>(), - BTreeSet::from([PathBuf::from("u_one"), PathBuf::from("u_two"),]), - "Merged upload tasks should a combined set of uploaded layers" - ); - - assert_eq!( - upload.metadata, - Some(metadata_2), 
- "Merged upload tasks should have a metadata with biggest disk_consistent_lsn" - ); - } - - #[test] - fn upload_and_download_sync_tasks_merge() { - let download_data = SyncData::new( - 3, - TimelineDownload { - layers_to_skip: HashSet::from([PathBuf::from("d_one")]), - }, - ); - - let upload_data = SyncData::new( - 2, - TimelineUpload { - layers_to_upload: HashSet::from([PathBuf::from("u_one")]), - uploaded_layers: HashSet::from([PathBuf::from("u_one_2")]), - metadata: Some(dummy_metadata(Lsn(1))), - }, - ); - - let (merged_download, merged_upload) = match SyncTask::Download(download_data.clone()) - .merge(SyncTask::Upload(upload_data.clone())) - { - SyncTask::DownloadAndUpload(merged_download, merged_upload) => { - (merged_download, merged_upload) - } - wrong_merge_result => panic!("Unexpected merge result: {wrong_merge_result:?}"), - }; - - assert_eq!( - merged_download, download_data, - "When upload and dowload are merged, both should be unchanged" - ); - assert_eq!( - merged_upload, upload_data, - "When upload and dowload are merged, both should be unchanged" - ); - } - - #[test] - fn uploaddownload_and_upload_sync_tasks_merge() { - let download_data = SyncData::new( - 3, - TimelineDownload { - layers_to_skip: HashSet::from([PathBuf::from("d_one")]), - }, - ); - - let metadata_1 = dummy_metadata(Lsn(5)); - let metadata_2 = dummy_metadata(Lsn(2)); - assert!(metadata_1.disk_consistent_lsn() > metadata_2.disk_consistent_lsn()); - - let upload_download = SyncTask::DownloadAndUpload( - download_data.clone(), - SyncData::new( - 2, - TimelineUpload { - layers_to_upload: HashSet::from([PathBuf::from("one")]), - uploaded_layers: HashSet::from([PathBuf::from("u_one")]), - metadata: Some(metadata_1.clone()), - }, - ), - ); - - let new_upload = SyncTask::Upload(SyncData::new( - 6, - TimelineUpload { - layers_to_upload: HashSet::from([PathBuf::from("two"), PathBuf::from("three")]), - uploaded_layers: HashSet::from([PathBuf::from("u_two")]), - metadata: Some(metadata_2), - 
}, - )); - - let (merged_download, merged_upload) = match upload_download.merge(new_upload) { - SyncTask::DownloadAndUpload(merged_download, merged_upload) => { - (merged_download, merged_upload) - } - wrong_merge_result => panic!("Unexpected merge result: {wrong_merge_result:?}"), - }; - - assert_eq!( - merged_download, download_data, - "When uploaddowload and upload tasks are merged, download should be unchanged" - ); - - assert_eq!( - merged_upload.retries, 0, - "Merged task should have its retries counter reset" - ); - let upload = merged_upload.data; - assert_eq!( - upload.layers_to_upload.into_iter().collect::>(), - BTreeSet::from([ - PathBuf::from("one"), - PathBuf::from("two"), - PathBuf::from("three") - ]), - "Merged upload tasks should a combined set of layers to upload" - ); - - assert_eq!( - upload.uploaded_layers.into_iter().collect::>(), - BTreeSet::from([PathBuf::from("u_one"), PathBuf::from("u_two"),]), - "Merged upload tasks should a combined set of uploaded layers" - ); - - assert_eq!( - upload.metadata, - Some(metadata_1), - "Merged upload tasks should have a metadata with biggest disk_consistent_lsn" - ); - } - - #[test] - fn uploaddownload_and_download_sync_tasks_merge() { - let upload_data = SyncData::new( - 22, - TimelineUpload { - layers_to_upload: HashSet::from([PathBuf::from("one")]), - uploaded_layers: HashSet::from([PathBuf::from("u_one")]), - metadata: Some(dummy_metadata(Lsn(22))), - }, - ); - - let upload_download = SyncTask::DownloadAndUpload( - SyncData::new( - 2, - TimelineDownload { - layers_to_skip: HashSet::from([PathBuf::from("one")]), - }, - ), - upload_data.clone(), - ); - - let new_download = SyncTask::Download(SyncData::new( - 6, - TimelineDownload { - layers_to_skip: HashSet::from([PathBuf::from("two"), PathBuf::from("three")]), - }, - )); - - let (merged_download, merged_upload) = match upload_download.merge(new_download) { - SyncTask::DownloadAndUpload(merged_download, merged_upload) => { - (merged_download, 
merged_upload) - } - wrong_merge_result => panic!("Unexpected merge result: {wrong_merge_result:?}"), - }; - - assert_eq!( - merged_upload, upload_data, - "When uploaddowload and download tasks are merged, upload should be unchanged" - ); - - assert_eq!( - merged_download.retries, 0, - "Merged task should have its retries counter reset" - ); - assert_eq!( - merged_download - .data - .layers_to_skip - .into_iter() - .collect::>(), - BTreeSet::from([ - PathBuf::from("one"), - PathBuf::from("two"), - PathBuf::from("three") - ]), - "Merged download tasks should a combined set of layers to skip" - ); - } - - #[test] - fn uploaddownload_sync_tasks_merge() { - let metadata_1 = dummy_metadata(Lsn(1)); - let metadata_2 = dummy_metadata(Lsn(2)); - assert!(metadata_2.disk_consistent_lsn() > metadata_1.disk_consistent_lsn()); - - let upload_download = SyncTask::DownloadAndUpload( - SyncData::new( - 2, - TimelineDownload { - layers_to_skip: HashSet::from([PathBuf::from("one")]), - }, - ), - SyncData::new( - 2, - TimelineUpload { - layers_to_upload: HashSet::from([PathBuf::from("one")]), - uploaded_layers: HashSet::from([PathBuf::from("u_one")]), - metadata: Some(metadata_1), - }, - ), - ); - let new_upload_download = SyncTask::DownloadAndUpload( - SyncData::new( - 6, - TimelineDownload { - layers_to_skip: HashSet::from([PathBuf::from("two"), PathBuf::from("three")]), - }, - ), - SyncData::new( - 6, - TimelineUpload { - layers_to_upload: HashSet::from([PathBuf::from("two"), PathBuf::from("three")]), - uploaded_layers: HashSet::from([PathBuf::from("u_two")]), - metadata: Some(metadata_2.clone()), - }, - ), - ); - - let (merged_download, merged_upload) = match upload_download.merge(new_upload_download) { - SyncTask::DownloadAndUpload(merged_download, merged_upload) => { - (merged_download, merged_upload) - } - wrong_merge_result => panic!("Unexpected merge result: {wrong_merge_result:?}"), - }; - - assert_eq!( - merged_download.retries, 0, - "Merged task should have its retries 
counter reset" - ); - assert_eq!( - merged_download - .data - .layers_to_skip - .into_iter() - .collect::>(), - BTreeSet::from([ - PathBuf::from("one"), - PathBuf::from("two"), - PathBuf::from("three") - ]), - "Merged download tasks should a combined set of layers to skip" - ); - - assert_eq!( - merged_upload.retries, 0, - "Merged task should have its retries counter reset" - ); - let upload = merged_upload.data; - assert_eq!( - upload.layers_to_upload.into_iter().collect::>(), - BTreeSet::from([ - PathBuf::from("one"), - PathBuf::from("two"), - PathBuf::from("three") - ]), - "Merged upload tasks should a combined set of layers to upload" - ); - - assert_eq!( - upload.uploaded_layers.into_iter().collect::>(), - BTreeSet::from([PathBuf::from("u_one"), PathBuf::from("u_two"),]), - "Merged upload tasks should a combined set of uploaded layers" - ); - - assert_eq!( - upload.metadata, - Some(metadata_2), - "Merged upload tasks should have a metadata with biggest disk_consistent_lsn" - ); - } -} diff --git a/pageserver/src/storage_sync/index.rs b/pageserver/src/storage_sync/index.rs index d847e03a24..b52ce8c95f 100644 --- a/pageserver/src/storage_sync/index.rs +++ b/pageserver/src/storage_sync/index.rs @@ -8,7 +8,7 @@ use std::{ sync::Arc, }; -use anyhow::{Context, Ok}; +use anyhow::{anyhow, Context, Ok}; use serde::{Deserialize, Serialize}; use serde_with::{serde_as, DisplayFromStr}; use tokio::sync::RwLock; @@ -113,7 +113,7 @@ impl RemoteTimelineIndex { awaits_download: bool, ) -> anyhow::Result<()> { self.timeline_entry_mut(id) - .ok_or_else(|| anyhow::anyhow!("unknown timeline sync {}", id))? + .ok_or_else(|| anyhow!("unknown timeline sync {id}"))? 
.awaits_download = awaits_download; Ok(()) } From 64a602b8f3b743b543f5b36cad7aa39e82491b0c Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Sun, 1 May 2022 12:10:24 +0300 Subject: [PATCH 215/296] Delete timeline layers --- pageserver/src/layered_repository.rs | 2 +- .../src/layered_repository/layer_map.rs | 19 +- pageserver/src/storage_sync.rs | 230 ++++++++++++------ pageserver/src/storage_sync/delete.rs | 1 + pageserver/src/storage_sync/download.rs | 5 + pageserver/src/storage_sync/index.rs | 7 + pageserver/src/storage_sync/upload.rs | 5 + 7 files changed, 184 insertions(+), 85 deletions(-) create mode 100644 pageserver/src/storage_sync/delete.rs diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 039bf8d1ed..01c2b961eb 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -1881,7 +1881,7 @@ impl LayeredTimeline { for part_range in &partition.ranges { let image_coverage = layers.image_coverage(part_range, lsn)?; for (img_range, last_img) in image_coverage { - let img_lsn = if let Some(ref last_img) = last_img { + let img_lsn = if let Some(last_img) = last_img { last_img.get_lsn_range().end } else { Lsn(0) diff --git a/pageserver/src/layered_repository/layer_map.rs b/pageserver/src/layered_repository/layer_map.rs index 7a2d0d5bcd..7491294c03 100644 --- a/pageserver/src/layered_repository/layer_map.rs +++ b/pageserver/src/layered_repository/layer_map.rs @@ -132,17 +132,15 @@ impl LayerMap { // this layer contains the requested point in the key/lsn space. // No need to search any further trace!( - "found layer {} for request on {} at {}", + "found layer {} for request on {key} at {end_lsn}", l.filename().display(), - key, - end_lsn ); latest_delta.replace(Arc::clone(l)); break; } // this layer's end LSN is smaller than the requested point. If there's // nothing newer, this is what we need to return. Remember this. 
- if let Some(ref old_candidate) = latest_delta { + if let Some(old_candidate) = &latest_delta { if l.get_lsn_range().end > old_candidate.get_lsn_range().end { latest_delta.replace(Arc::clone(l)); } @@ -152,10 +150,8 @@ impl LayerMap { } if let Some(l) = latest_delta { trace!( - "found (old) layer {} for request on {} at {}", + "found (old) layer {} for request on {key} at {end_lsn}", l.filename().display(), - key, - end_lsn ); let lsn_floor = std::cmp::max( Lsn(latest_img_lsn.unwrap_or(Lsn(0)).0 + 1), @@ -166,17 +162,13 @@ impl LayerMap { layer: l, })) } else if let Some(l) = latest_img { - trace!( - "found img layer and no deltas for request on {} at {}", - key, - end_lsn - ); + trace!("found img layer and no deltas for request on {key} at {end_lsn}"); Ok(Some(SearchResult { lsn_floor: latest_img_lsn.unwrap(), layer: l, })) } else { - trace!("no layer found for request on {} at {}", key, end_lsn); + trace!("no layer found for request on {key} at {end_lsn}"); Ok(None) } } @@ -194,7 +186,6 @@ impl LayerMap { /// /// This should be called when the corresponding file on disk has been deleted. /// - #[allow(dead_code)] pub fn remove_historic(&mut self, layer: Arc) { let len_before = self.historic_layers.len(); diff --git a/pageserver/src/storage_sync.rs b/pageserver/src/storage_sync.rs index b6091015b9..52e0df3784 100644 --- a/pageserver/src/storage_sync.rs +++ b/pageserver/src/storage_sync.rs @@ -141,6 +141,7 @@ //! //! When pageserver signals shutdown, current sync task gets finished and the loop exists. +mod delete; mod download; pub mod index; mod upload; @@ -168,6 +169,7 @@ use tokio::{ use tracing::*; use self::{ + delete::delete_timeline_layers, download::{download_timeline_layers, DownloadedTimeline}, index::{IndexPart, RemoteTimeline, RemoteTimelineIndex}, upload::{upload_index_part, upload_timeline_layers, UploadedTimeline}, @@ -579,7 +581,7 @@ pub enum SyncTask { /// A certain amount of image files to download. Upload(SyncData), /// Delete remote files. 
- Delete(SyncData>), + Delete(SyncData), } /// Stores the data to synd and its retries, to evict the tasks failing to frequently. @@ -604,8 +606,8 @@ impl SyncTask { Self::Upload(SyncData::new(0, upload_task)) } - fn delete(layers_to_delete: HashSet) -> Self { - Self::Delete(SyncData::new(0, layers_to_delete)) + fn delete(delete_task: TimelineDelete) -> Self { + Self::Delete(SyncData::new(0, delete_task)) } } @@ -613,7 +615,7 @@ impl SyncTask { struct SyncTaskBatch { upload: Option>, download: Option>, - delete: Option>>, + delete: Option>, } impl SyncTaskBatch { @@ -664,7 +666,15 @@ impl SyncTaskBatch { SyncTask::Delete(new_delete) => match &mut self.delete { Some(batch_delete) => { batch_delete.retries = batch_delete.retries.min(new_delete.retries); - batch_delete.data.extend(new_delete.data.into_iter()); + + batch_delete + .data + .layers_to_delete + .extend(new_delete.data.layers_to_delete.into_iter()); + batch_delete + .data + .deleted_layers + .extend(new_delete.data.deleted_layers.into_iter()); } None => self.delete = Some(new_delete), }, @@ -694,6 +704,13 @@ pub struct TimelineDownload { layers_to_skip: HashSet, } +#[derive(Debug, Clone, PartialEq, Eq)] +pub struct TimelineDelete { + layers_to_delete: HashSet, + deleted_layers: HashSet, + deletion_registered: bool, +} + /// Adds the new checkpoint files as an upload sync task to the queue. /// On task failure, it gets retried again from the start a number of times. 
/// @@ -733,7 +750,11 @@ pub fn schedule_layer_delete( tenant_id, timeline_id, }, - SyncTask::delete(layers_to_delete), + SyncTask::delete(TimelineDelete { + layers_to_delete, + deleted_layers: HashSet::new(), + deletion_registered: false, + }), ) { warn!("Could not send deletion task for tenant {tenant_id}, timeline {timeline_id}") } else { @@ -951,7 +972,7 @@ where let upload_data = batch.upload.clone(); let download_data = batch.download.clone(); - let ((), status_update) = tokio::join!( + let (upload_result, status_update) = tokio::join!( async { if let Some(upload_data) = upload_data { match validate_task_retries(upload_data, max_sync_errors) @@ -969,6 +990,7 @@ where "upload", ) .await; + return Some(()); } ControlFlow::Break(failed_upload_data) => { if let Err(e) = update_remote_data( @@ -976,8 +998,10 @@ where storage.as_ref(), &index, sync_id, - &failed_upload_data.data, - true, + RemoteDataUpdate::Upload { + uploaded_data: failed_upload_data.data, + upload_failed: true, + }, ) .await { @@ -986,6 +1010,7 @@ where } } } + None } .instrument(info_span!("upload_timeline_data")), async { @@ -1029,7 +1054,6 @@ where delete_timeline_data( conf, (storage.as_ref(), &index), - current_remote_timeline.as_ref(), sync_id, new_delete_data, sync_start, @@ -1038,7 +1062,19 @@ where .instrument(info_span!("delete_timeline_data")) .await; } - ControlFlow::Break(_) => {} + ControlFlow::Break(failed_delete_data) => { + if let Err(e) = update_remote_data( + conf, + storage.as_ref(), + &index, + sync_id, + RemoteDataUpdate::Delete(&failed_delete_data.data.deleted_layers), + ) + .await + { + error!("Failed to update remote timeline {sync_id}: {e:?}"); + } + } } } @@ -1072,22 +1108,19 @@ where if let Err(e) = index.write().await.set_awaits_download(&sync_id, false) { error!("Timeline {sync_id} was expected to be in the remote index after a download attempt, but it's absent: {e:?}"); } - None } DownloadedTimeline::FailedAndRescheduled => { register_sync_status(sync_start, 
task_name, Some(false)); - None } DownloadedTimeline::Successful(mut download_data) => { match update_local_metadata(conf, sync_id, current_remote_timeline).await { Ok(()) => match index.write().await.set_awaits_download(&sync_id, false) { Ok(()) => { register_sync_status(sync_start, task_name, Some(true)); - Some(TimelineSyncStatusUpdate::Downloaded) + return Some(TimelineSyncStatusUpdate::Downloaded); } Err(e) => { error!("Timeline {sync_id} was expected to be in the remote index after a sucessful download, but it's absent: {e:?}"); - None } }, Err(e) => { @@ -1095,11 +1128,12 @@ where download_data.retries += 1; sync_queue::push(sync_id, SyncTask::Download(download_data)); register_sync_status(sync_start, task_name, Some(false)); - None } } } } + + None } async fn update_local_metadata( @@ -1164,28 +1198,39 @@ async fn update_local_metadata( } async fn delete_timeline_data( - conf: &PageServerConf, - index: (&S, &RemoteIndex), - as_ref: Option<&RemoteTimeline>, + conf: &'static PageServerConf, + (storage, index): (&S, &RemoteIndex), sync_id: ZTenantTimelineId, - new_delete_data: SyncData>, + mut new_delete_data: SyncData, sync_start: Instant, task_name: &str, -) -> Option<()> -where +) where P: Debug + Send + Sync + 'static, S: RemoteStorage + Send + Sync + 'static, { - // match update_remote_data(conf, storage, index, sync_id, &uploaded_data.data, false).await { - // Ok(()) => register_sync_status(sync_start, task_name, Some(true)), - // Err(e) => { - // error!("Failed to update remote timeline {sync_id}: {e:?}"); - // uploaded_data.retries += 1; - // sync_queue::push(sync_id, SyncTask::Upload(uploaded_data)); - // register_sync_status(sync_start, task_name, Some(false)); - // } - // } - todo!("TODO kb") + let timeline_delete = &mut new_delete_data.data; + + if !timeline_delete.deletion_registered { + if let Err(e) = update_remote_data( + conf, + storage, + index, + sync_id, + RemoteDataUpdate::Delete(&timeline_delete.layers_to_delete), + ) + .await + { + 
error!("Failed to update remote timeline {sync_id}: {e:?}"); + new_delete_data.retries += 1; + sync_queue::push(sync_id, SyncTask::Delete(new_delete_data)); + register_sync_status(sync_start, task_name, Some(false)); + return; + } + } + timeline_delete.deletion_registered = true; + + let sync_status = delete_timeline_layers(storage, sync_id, new_delete_data).await; + register_sync_status(sync_start, task_name, Some(sync_status)); } async fn read_metadata_file(metadata_path: &Path) -> anyhow::Result { @@ -1205,8 +1250,7 @@ async fn upload_timeline_data( new_upload_data: SyncData, sync_start: Instant, task_name: &str, -) -> Option<()> -where +) where P: Debug + Send + Sync + 'static, S: RemoteStorage + Send + Sync + 'static, { @@ -1216,7 +1260,7 @@ where { UploadedTimeline::FailedAndRescheduled => { register_sync_status(sync_start, task_name, Some(false)); - return None; + return; } UploadedTimeline::Successful(upload_data) => upload_data, UploadedTimeline::SuccessfulAfterLocalFsUpdate(mut outdated_upload_data) => { @@ -1233,37 +1277,54 @@ where outdated_upload_data.retries += 1; sync_queue::push(sync_id, SyncTask::Upload(outdated_upload_data)); register_sync_status(sync_start, task_name, Some(false)); - return None; + return; } }; + outdated_upload_data.data.metadata = Some(local_metadata); } outdated_upload_data } }; - match update_remote_data(conf, storage, index, sync_id, &uploaded_data.data, false).await { + match update_remote_data( + conf, + storage, + index, + sync_id, + RemoteDataUpdate::Upload { + uploaded_data: uploaded_data.data.clone(), + upload_failed: false, + }, + ) + .await + { Ok(()) => { register_sync_status(sync_start, task_name, Some(true)); - Some(()) } Err(e) => { error!("Failed to update remote timeline {sync_id}: {e:?}"); uploaded_data.retries += 1; sync_queue::push(sync_id, SyncTask::Upload(uploaded_data)); register_sync_status(sync_start, task_name, Some(false)); - None } } } +enum RemoteDataUpdate<'a> { + Upload { + uploaded_data: 
TimelineUpload, + upload_failed: bool, + }, + Delete(&'a HashSet), +} + async fn update_remote_data( conf: &'static PageServerConf, storage: &S, index: &RemoteIndex, sync_id: ZTenantTimelineId, - uploaded_data: &TimelineUpload, - upload_failed: bool, + update: RemoteDataUpdate<'_>, ) -> anyhow::Result<()> where P: Debug + Send + Sync + 'static, @@ -1275,40 +1336,59 @@ where match index_accessor.timeline_entry_mut(&sync_id) { Some(existing_entry) => { - if let Some(new_metadata) = uploaded_data.metadata.as_ref() { - if existing_entry.metadata.disk_consistent_lsn() - < new_metadata.disk_consistent_lsn() - { - existing_entry.metadata = new_metadata.clone(); + match update { + RemoteDataUpdate::Upload { + uploaded_data, + upload_failed, + } => { + if let Some(new_metadata) = uploaded_data.metadata.as_ref() { + if existing_entry.metadata.disk_consistent_lsn() + < new_metadata.disk_consistent_lsn() + { + existing_entry.metadata = new_metadata.clone(); + } + } + if upload_failed { + existing_entry.add_upload_failures( + uploaded_data.layers_to_upload.iter().cloned(), + ); + } else { + existing_entry + .add_timeline_layers(uploaded_data.uploaded_layers.iter().cloned()); + } + } + RemoteDataUpdate::Delete(layers_to_remove) => { + existing_entry.remove_layers(layers_to_remove) } - } - - if upload_failed { - existing_entry - .add_upload_failures(uploaded_data.layers_to_upload.iter().cloned()); - } else { - existing_entry - .add_timeline_layers(uploaded_data.uploaded_layers.iter().cloned()); } existing_entry.clone() } - None => { - let new_metadata = match uploaded_data.metadata.as_ref() { - Some(new_metadata) => new_metadata, - None => bail!("For timeline {sync_id} upload, there's no upload metadata and no remote index entry, cannot create a new one"), - }; - let mut new_remote_timeline = RemoteTimeline::new(new_metadata.clone()); - if upload_failed { - new_remote_timeline - .add_upload_failures(uploaded_data.layers_to_upload.iter().cloned()); - } else { - new_remote_timeline 
- .add_timeline_layers(uploaded_data.uploaded_layers.iter().cloned()); - } + None => match update { + RemoteDataUpdate::Upload { + uploaded_data, + upload_failed, + } => { + let new_metadata = match uploaded_data.metadata.as_ref() { + Some(new_metadata) => new_metadata, + None => bail!("For timeline {sync_id} upload, there's no upload metadata and no remote index entry, cannot create a new one"), + }; + let mut new_remote_timeline = RemoteTimeline::new(new_metadata.clone()); + if upload_failed { + new_remote_timeline + .add_upload_failures(uploaded_data.layers_to_upload.iter().cloned()); + } else { + new_remote_timeline + .add_timeline_layers(uploaded_data.uploaded_layers.iter().cloned()); + } - index_accessor.add_timeline_entry(sync_id, new_remote_timeline.clone()); - new_remote_timeline - } + index_accessor.add_timeline_entry(sync_id, new_remote_timeline.clone()); + new_remote_timeline + } + RemoteDataUpdate::Delete(_) => { + warn!("No remote index entry for timeline {sync_id}, skipping deletion"); + return Ok(()); + } + }, } }; @@ -1541,3 +1621,13 @@ mod test_utils { TimelineMetadata::new(disk_consistent_lsn, None, None, Lsn(0), Lsn(0), Lsn(0)) } } + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn batching_tests() { + todo!("TODO kb") + } +} diff --git a/pageserver/src/storage_sync/delete.rs b/pageserver/src/storage_sync/delete.rs new file mode 100644 index 0000000000..8b13789179 --- /dev/null +++ b/pageserver/src/storage_sync/delete.rs @@ -0,0 +1 @@ + diff --git a/pageserver/src/storage_sync/download.rs b/pageserver/src/storage_sync/download.rs index dca08bca5d..3cd6de57c7 100644 --- a/pageserver/src/storage_sync/download.rs +++ b/pageserver/src/storage_sync/download.rs @@ -120,6 +120,11 @@ where debug!("Layers to download: {layers_to_download:?}"); info!("Downloading {} timeline layers", layers_to_download.len()); + if layers_to_download.is_empty() { + info!("No layers to download after filtering, skipping"); + return 
DownloadedTimeline::Successful(download_data); + } + let mut download_tasks = layers_to_download .into_iter() .map(|layer_desination_path| async move { diff --git a/pageserver/src/storage_sync/index.rs b/pageserver/src/storage_sync/index.rs index b52ce8c95f..7764a810bc 100644 --- a/pageserver/src/storage_sync/index.rs +++ b/pageserver/src/storage_sync/index.rs @@ -147,6 +147,13 @@ impl RemoteTimeline { self.missing_layers.extend(upload_failures.into_iter()); } + pub fn remove_layers(&mut self, layers_to_remove: &HashSet) { + self.timeline_layers + .retain(|layer| !layers_to_remove.contains(layer)); + self.missing_layers + .retain(|layer| !layers_to_remove.contains(layer)); + } + /// Lists all layer files in the given remote timeline. Omits the metadata file. pub fn stored_files(&self) -> &HashSet { &self.timeline_layers diff --git a/pageserver/src/storage_sync/upload.rs b/pageserver/src/storage_sync/upload.rs index 55089df7bc..1e2594ac70 100644 --- a/pageserver/src/storage_sync/upload.rs +++ b/pageserver/src/storage_sync/upload.rs @@ -106,6 +106,11 @@ where .cloned() .collect::>(); + if layers_to_upload.is_empty() { + info!("No layers to upload after filtering, aborting"); + return UploadedTimeline::Successful(upload_data); + } + debug!("Layers to upload: {layers_to_upload:?}"); info!( "Uploading {} timeline layers, new lsn: {new_upload_lsn:?}", From 0a7735a65676737bb97440511ccd742bfdce68dd Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Sun, 1 May 2022 19:07:17 +0300 Subject: [PATCH 216/296] Rework remote storage sync queue, general refactoring --- .../src/remote_storage/storage_sync/delete.rs | 223 ++++++ pageserver/src/storage_sync.rs | 725 ++++++++++++------ pageserver/src/storage_sync/delete.rs | 227 ++++++ pageserver/src/storage_sync/download.rs | 30 +- pageserver/src/storage_sync/upload.rs | 47 +- 5 files changed, 974 insertions(+), 278 deletions(-) create mode 100644 pageserver/src/remote_storage/storage_sync/delete.rs diff --git 
a/pageserver/src/remote_storage/storage_sync/delete.rs b/pageserver/src/remote_storage/storage_sync/delete.rs new file mode 100644 index 0000000000..00e7c85e35 --- /dev/null +++ b/pageserver/src/remote_storage/storage_sync/delete.rs @@ -0,0 +1,223 @@ +//! Timeline synchronization logic to delete a bulk of timeline's remote files from the remote storage. + +use anyhow::Context; +use futures::stream::{FuturesUnordered, StreamExt}; +use tracing::{debug, error, info}; +use utils::zid::ZTenantTimelineId; + +use crate::remote_storage::{ + storage_sync::{SyncQueue, SyncTask}, + RemoteStorage, +}; + +use super::{LayersDeletion, SyncData}; + +/// Attempts to remove the timeline layers from the remote storage. +/// If the task had not adjusted the metadata before, the deletion will fail. +pub(super) async fn delete_timeline_layers<'a, P, S>( + storage: &'a S, + sync_queue: &SyncQueue, + sync_id: ZTenantTimelineId, + mut delete_data: SyncData<LayersDeletion>, +) -> bool +where + P: std::fmt::Debug + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, +{ + if !delete_data.data.deletion_registered { + error!("Cannot delete timeline layers before the deletion metadata is registered, reenqueueing"); + delete_data.retries += 1; + sync_queue.push(sync_id, SyncTask::Delete(delete_data)); + return false; + } + + if delete_data.data.layers_to_delete.is_empty() { + info!("No layers to delete, skipping"); + return true; + } + + let layers_to_delete = delete_data + .data + .layers_to_delete + .drain() + .collect::<Vec<_>>(); + debug!("Layers to delete: {layers_to_delete:?}"); + info!("Deleting {} timeline layers", layers_to_delete.len()); + + let mut delete_tasks = layers_to_delete + .into_iter() + .map(|local_layer_path| async { + let storage_path = match storage.storage_path(&local_layer_path).with_context(|| { + format!( + "Failed to get the layer storage path for local path '{}'", + local_layer_path.display() + ) + }) { + Ok(path) => path, + Err(e) => return Err((e, local_layer_path)), 
+ }; + + match storage.delete(&storage_path).await.with_context(|| { + format!( + "Failed to delete remote layer from storage at '{:?}'", + storage_path + ) + }) { + Ok(()) => Ok(local_layer_path), + Err(e) => Err((e, local_layer_path)), + } + }) + .collect::>(); + + let mut errored = false; + while let Some(deletion_result) = delete_tasks.next().await { + match deletion_result { + Ok(local_layer_path) => { + debug!( + "Successfully deleted layer {} for timeline {sync_id}", + local_layer_path.display() + ); + delete_data.data.deleted_layers.insert(local_layer_path); + } + Err((e, local_layer_path)) => { + errored = true; + error!( + "Failed to delete layer {} for timeline {sync_id}: {e:?}", + local_layer_path.display() + ); + delete_data.data.layers_to_delete.insert(local_layer_path); + } + } + } + + if errored { + debug!("Reenqueuing failed delete task for timeline {sync_id}"); + delete_data.retries += 1; + sync_queue.push(sync_id, SyncTask::Delete(delete_data)); + } + errored +} + +#[cfg(test)] +mod tests { + use std::{collections::HashSet, num::NonZeroUsize}; + + use itertools::Itertools; + use tempfile::tempdir; + use tokio::fs; + use utils::lsn::Lsn; + + use crate::{ + remote_storage::{ + storage_sync::test_utils::{create_local_timeline, dummy_metadata}, + LocalFs, + }, + repository::repo_harness::{RepoHarness, TIMELINE_ID}, + }; + + use super::*; + + #[tokio::test] + async fn delete_timeline_negative() -> anyhow::Result<()> { + let harness = RepoHarness::create("delete_timeline_negative")?; + let (sync_queue, _) = SyncQueue::new(NonZeroUsize::new(100).unwrap()); + let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID); + let storage = LocalFs::new(tempdir()?.path().to_path_buf(), &harness.conf.workdir)?; + + let deleted = delete_timeline_layers( + &storage, + &sync_queue, + sync_id, + SyncData { + retries: 1, + data: LayersDeletion { + deleted_layers: HashSet::new(), + layers_to_delete: HashSet::new(), + deletion_registered: false, + }, + }, + ) 
+ .await; + + assert!( + !deleted, + "Should not start the deletion for task with delete metadata unregistered" + ); + + Ok(()) + } + + #[tokio::test] + async fn delete_timeline() -> anyhow::Result<()> { + let harness = RepoHarness::create("delete_timeline")?; + let (sync_queue, _) = SyncQueue::new(NonZeroUsize::new(100).unwrap()); + + let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID); + let layer_files = ["a", "b", "c", "d"]; + let storage = LocalFs::new(tempdir()?.path().to_path_buf(), &harness.conf.workdir)?; + let current_retries = 3; + let metadata = dummy_metadata(Lsn(0x30)); + let local_timeline_path = harness.timeline_path(&TIMELINE_ID); + let timeline_upload = + create_local_timeline(&harness, TIMELINE_ID, &layer_files, metadata.clone()).await?; + for local_path in timeline_upload.layers_to_upload { + let remote_path = storage.storage_path(&local_path)?; + let remote_parent_dir = remote_path.parent().unwrap(); + if !remote_parent_dir.exists() { + fs::create_dir_all(&remote_parent_dir).await?; + } + fs::copy(&local_path, &remote_path).await?; + } + assert_eq!( + storage + .list() + .await? + .into_iter() + .map(|remote_path| storage.local_path(&remote_path).unwrap()) + .filter_map(|local_path| { Some(local_path.file_name()?.to_str()?.to_owned()) }) + .sorted() + .collect::>(), + layer_files + .iter() + .map(|layer_str| layer_str.to_string()) + .sorted() + .collect::>(), + "Expect to have all layer files remotely before deletion" + ); + + let deleted = delete_timeline_layers( + &storage, + &sync_queue, + sync_id, + SyncData { + retries: current_retries, + data: LayersDeletion { + deleted_layers: HashSet::new(), + layers_to_delete: HashSet::from([ + local_timeline_path.join("a"), + local_timeline_path.join("c"), + local_timeline_path.join("something_different"), + ]), + deletion_registered: true, + }, + }, + ) + .await; + assert!(deleted, "Should be able to delete timeline files"); + + assert_eq!( + storage + .list() + .await? 
+ .into_iter() + .map(|remote_path| storage.local_path(&remote_path).unwrap()) + .filter_map(|local_path| { Some(local_path.file_name()?.to_str()?.to_owned()) }) + .sorted() + .collect::>(), + vec!["b".to_string(), "d".to_string()], + "Expect to have only non-deleted files remotely" + ); + + Ok(()) + } +} diff --git a/pageserver/src/storage_sync.rs b/pageserver/src/storage_sync.rs index 52e0df3784..b8c6f7fdab 100644 --- a/pageserver/src/storage_sync.rs +++ b/pageserver/src/storage_sync.rs @@ -147,23 +147,27 @@ pub mod index; mod upload; use std::{ - collections::{HashMap, HashSet, VecDeque}, + collections::{hash_map, HashMap, HashSet, VecDeque}, ffi::OsStr, fmt::Debug, num::{NonZeroU32, NonZeroUsize}, ops::ControlFlow, path::{Path, PathBuf}, - sync::Arc, + sync::{ + atomic::{AtomicUsize, Ordering}, + Arc, + }, }; -use anyhow::{bail, Context}; +use anyhow::{anyhow, bail, Context}; use futures::stream::{FuturesUnordered, StreamExt}; use lazy_static::lazy_static; +use once_cell::sync::OnceCell; use remote_storage::{GenericRemoteStorage, RemoteStorage}; use tokio::{ fs, runtime::Runtime, - sync::mpsc::{self, UnboundedReceiver}, + sync::mpsc::{self, error::TryRecvError, UnboundedReceiver, UnboundedSender}, time::{Duration, Instant}, }; use tracing::*; @@ -221,6 +225,8 @@ lazy_static! { .expect("failed to register pageserver image sync time histogram vec"); } +static SYNC_QUEUE: OnceCell = OnceCell::new(); + /// A timeline status to share with pageserver's sync counterpart, /// after comparing local and remote timeline state. #[derive(Clone, Copy, Debug)] @@ -449,144 +455,131 @@ fn collect_timeline_files( /// Wraps mpsc channel bits around into a queue interface. /// mpsc approach was picked to allow blocking the sync loop if no tasks are present, to avoid meaningless spinning. 
-mod sync_queue { - use std::{ - collections::{HashMap, HashSet}, - num::NonZeroUsize, - ops::ControlFlow, - sync::atomic::{AtomicUsize, Ordering}, - }; +struct SyncQueue { + len: AtomicUsize, + max_timelines_per_batch: NonZeroUsize, + sender: UnboundedSender<(ZTenantTimelineId, SyncTask)>, +} - use anyhow::anyhow; - use once_cell::sync::OnceCell; - use tokio::sync::mpsc::{error::TryRecvError, UnboundedReceiver, UnboundedSender}; - use tracing::{debug, warn}; - - use super::{SyncTask, SyncTaskBatch}; - use utils::zid::ZTenantTimelineId; - - static SENDER: OnceCell> = OnceCell::new(); - static LENGTH: AtomicUsize = AtomicUsize::new(0); - - /// Initializes the queue with the given sender channel that is used to put the tasks into later. - /// Errors if called more than once. - pub fn init(sender: UnboundedSender<(ZTenantTimelineId, SyncTask)>) -> anyhow::Result<()> { - SENDER - .set(sender) - .map_err(|_sender| anyhow!("sync queue was already initialized"))?; - Ok(()) +impl SyncQueue { + fn new( + max_timelines_per_batch: NonZeroUsize, + ) -> (Self, UnboundedReceiver<(ZTenantTimelineId, SyncTask)>) { + let (sender, receiver) = mpsc::unbounded_channel(); + ( + Self { + len: AtomicUsize::new(0), + max_timelines_per_batch, + sender, + }, + receiver, + ) } - /// Adds a new task to the queue, if the queue was initialized, returning `true` on success. - /// On any error, or if the queue was not initialized, the task gets dropped (not scheduled) and `false` is returned. 
- pub fn push(sync_id: ZTenantTimelineId, new_task: SyncTask) -> bool { - if let Some(sender) = SENDER.get() { - match sender.send((sync_id, new_task)) { - Err(e) => { - warn!("Failed to enqueue a sync task: the receiver is dropped: {e}"); - false - } - Ok(()) => { - LENGTH.fetch_add(1, Ordering::Relaxed); - true - } + fn push(&self, sync_id: ZTenantTimelineId, new_task: SyncTask) { + match self.sender.send((sync_id, new_task)) { + Ok(()) => { + self.len.fetch_add(1, Ordering::Relaxed); + } + Err(e) => { + error!("failed to push sync task to queue: {e}"); } - } else { - warn!("Failed to enqueue a sync task: the sender is not initialized"); - false } } - /// Polls a new task from the queue, using its receiver counterpart. - /// Does not block if the queue is empty, returning [`None`] instead. - /// Needed to correctly track the queue length. - async fn next_task( - receiver: &mut UnboundedReceiver<(ZTenantTimelineId, SyncTask)>, - ) -> Option<(ZTenantTimelineId, SyncTask)> { - let task = receiver.recv().await; - if task.is_some() { - LENGTH.fetch_sub(1, Ordering::Relaxed); - } - task - } - - /// Fetches a task batch, not bigger than the given limit. - /// Not blocking, can return fewer tasks if the queue does not contain enough. - /// Batch tasks are split by timelines, with all related tasks merged into one (download/upload) - /// or two (download and upload, if both were found in the queue during batch construction). - pub(super) async fn next_task_batch( - receiver: &mut UnboundedReceiver<(ZTenantTimelineId, SyncTask)>, - max_timelines_to_sync: NonZeroUsize, - ) -> ControlFlow<(), HashMap> { + /// Fetches a task batch, getting every existing entry from the queue, grouping by timelines and merging the tasks for every timeline. + /// A timeline has to take care not to delete certain layers from the remote storage before the corresponding uploads happen.
+ /// Otherwise, due to "immutable" nature of the layers, the order of their deletion/uploading/downloading does not matter. + /// Hence, we merge the layers together into single task per timeline and run those concurrently (with the deletion happening only after successful uploading). + async fn next_task_batch( + &self, + // The queue is based on two ends of a channel and has to be accessible statically without blocking for submissions from the sync code. + // Its receiver needs &mut, so we cannot place it in the same container with the other end and get both static and non-blocking access. + // Hence toss this around to use it from the sync loop directly as &mut. + sync_queue_receiver: &mut UnboundedReceiver<(ZTenantTimelineId, SyncTask)>, + ) -> HashMap { // request the first task in blocking fashion to do less meaningless work - let (first_sync_id, first_task) = if let Some(first_task) = next_task(receiver).await { + let (first_sync_id, first_task) = if let Some(first_task) = sync_queue_receiver.recv().await + { + self.len.fetch_sub(1, Ordering::Relaxed); first_task } else { - debug!("Queue sender part was dropped, aborting"); - return ControlFlow::Break(()); + info!("Queue sender part was dropped, aborting"); + return HashMap::new(); }; + let mut timelines_left_to_batch = self.max_timelines_per_batch.get() - 1; + let mut tasks_to_process = self.len(); - let max_timelines_to_sync = max_timelines_to_sync.get(); - let mut batched_timelines = HashSet::with_capacity(max_timelines_to_sync); - batched_timelines.insert(first_sync_id.timeline_id); + let mut batches = HashMap::with_capacity(tasks_to_process); + batches.insert(first_sync_id, SyncTaskBatch::new(first_task)); - let mut tasks = HashMap::new(); - tasks.insert(first_sync_id, SyncTaskBatch::new(first_task)); + let mut tasks_to_reenqueue = Vec::with_capacity(tasks_to_process); - loop { - if batched_timelines.len() >= max_timelines_to_sync { - debug!( - "Filled a full task batch with {} timeline sync 
operations", - batched_timelines.len() - ); - break; - } - - match receiver.try_recv() { + // Pull the queue channel until we get all tasks that were there at the beginning of the batch construction. + // Yet do not put all timelines in the batch, but only the first ones that fit the timeline limit. + // Still merge the rest of the pulled tasks and reenqueue those for later. + while tasks_to_process > 0 { + match sync_queue_receiver.try_recv() { Ok((sync_id, new_task)) => { - LENGTH.fetch_sub(1, Ordering::Relaxed); - tasks.entry(sync_id).or_default().add(new_task); - batched_timelines.insert(sync_id.timeline_id); + self.len.fetch_sub(1, Ordering::Relaxed); + tasks_to_process -= 1; + + match batches.entry(sync_id) { + hash_map::Entry::Occupied(mut v) => v.get_mut().add(new_task), + hash_map::Entry::Vacant(v) => { + timelines_left_to_batch = timelines_left_to_batch.saturating_sub(1); + if timelines_left_to_batch == 0 { + tasks_to_reenqueue.push((sync_id, new_task)); + } else { + v.insert(SyncTaskBatch::new(new_task)); + } + } + } } Err(TryRecvError::Disconnected) => { debug!("Sender disconnected, batch collection aborted"); break; } Err(TryRecvError::Empty) => { - debug!( - "No more data in the sync queue, task batch is not full, length: {}, max allowed size: {max_timelines_to_sync}", - batched_timelines.len() - ); + debug!("No more data in the sync queue, task batch is not full"); break; } } } - ControlFlow::Continue(tasks) + debug!( + "Batched {} timelines, reenqueuing {}", + batches.len(), + tasks_to_reenqueue.len() + ); + for (id, task) in tasks_to_reenqueue { + self.push(id, task); + } + + batches } - /// Length of the queue, assuming that all receiver counterparts were only called using the queue api. - pub fn len() -> usize { - LENGTH.load(Ordering::Relaxed) + fn len(&self) -> usize { + self.len.load(Ordering::Relaxed) } } /// A task to run in the async download/upload loop. 
/// Limited by the number of retries, after certain threshold the failing task gets evicted and the timeline disabled. -#[derive(Debug)] -pub enum SyncTask { +#[derive(Debug, Clone)] +enum SyncTask { /// A checkpoint outcome with possible local file updates that need actualization in the remote storage. /// Not necessary more fresh than the one already uploaded. - Download(SyncData), + Download(SyncData), /// A certain amount of image files to download. - Upload(SyncData), + Upload(SyncData), /// Delete remote files. - Delete(SyncData), + Delete(SyncData), } /// Stores the data to synd and its retries, to evict the tasks failing to frequently. #[derive(Debug, Clone, PartialEq, Eq)] -pub struct SyncData { +struct SyncData { retries: u32, data: T, } @@ -598,24 +591,24 @@ impl SyncData { } impl SyncTask { - fn download(download_task: TimelineDownload) -> Self { + fn download(download_task: LayersDownload) -> Self { Self::Download(SyncData::new(0, download_task)) } - fn upload(upload_task: TimelineUpload) -> Self { + fn upload(upload_task: LayersUpload) -> Self { Self::Upload(SyncData::new(0, upload_task)) } - fn delete(delete_task: TimelineDelete) -> Self { + fn delete(delete_task: LayersDeletion) -> Self { Self::Delete(SyncData::new(0, delete_task)) } } -#[derive(Debug, Default)] +#[derive(Debug, Default, PartialEq, Eq)] struct SyncTaskBatch { - upload: Option>, - download: Option>, - delete: Option>, + upload: Option>, + download: Option>, + delete: Option>, } impl SyncTaskBatch { @@ -666,6 +659,31 @@ impl SyncTaskBatch { SyncTask::Delete(new_delete) => match &mut self.delete { Some(batch_delete) => { batch_delete.retries = batch_delete.retries.min(new_delete.retries); + // Need to reregister deletions, but it's ok to register already deleted files once again, they will be skipped. 
+ batch_delete.data.deletion_registered = batch_delete + .data + .deletion_registered + .min(new_delete.data.deletion_registered); + + // Do not download and upload the layers getting removed in the same batch + if let Some(batch_download) = &mut self.download { + batch_download + .data + .layers_to_skip + .extend(new_delete.data.layers_to_delete.iter().cloned()); + batch_download + .data + .layers_to_skip + .extend(new_delete.data.deleted_layers.iter().cloned()); + } + if let Some(batch_upload) = &mut self.upload { + let not_deleted = |layer: &PathBuf| { + !new_delete.data.layers_to_delete.contains(layer) + && !new_delete.data.deleted_layers.contains(layer) + }; + batch_upload.data.layers_to_upload.retain(not_deleted); + batch_upload.data.uploaded_layers.retain(not_deleted); + } batch_delete .data @@ -685,7 +703,7 @@ impl SyncTaskBatch { /// Local timeline files for upload, appeared after the new checkpoint. /// Current checkpoint design assumes new files are added only, no deletions or amendment happens. #[derive(Debug, Clone, PartialEq, Eq)] -pub struct TimelineUpload { +struct LayersUpload { /// Layer file path in the pageserver workdir, that were added for the corresponding checkpoint. layers_to_upload: HashSet, /// Already uploaded layers. Used to store the data about the uploads between task retries @@ -700,14 +718,19 @@ pub struct TimelineUpload { /// without using the remote index or any other ways to list the remote timleine files. /// Skips the files that are already downloaded. #[derive(Debug, Clone, PartialEq, Eq)] -pub struct TimelineDownload { +struct LayersDownload { layers_to_skip: HashSet, } #[derive(Debug, Clone, PartialEq, Eq)] -pub struct TimelineDelete { +struct LayersDeletion { layers_to_delete: HashSet, deleted_layers: HashSet, + /// Pageserver uses [`IndexPart`] as a source of truth for listing the files per timeline. + /// This object gets serialized and placed into the remote storage. 
+ /// So if we manage to update pageserver's [`RemoteIndex`] and update the index part on the remote storage, + /// the corresponding files on S3 won't exist for pageserver albeit being physically present on that remote storage still. + /// Then all that's left is to remove the files from the remote storage, without concerns about consistency. deletion_registered: bool, } @@ -721,45 +744,55 @@ pub fn schedule_layer_upload( layers_to_upload: HashSet, metadata: Option, ) { - debug!("Scheduling layer upload for tenant {tenant_id}, timeline {timeline_id}, to upload: {layers_to_upload:?}"); - if !sync_queue::push( + let sync_queue = match SYNC_QUEUE.get() { + Some(queue) => queue, + None => { + warn!("Could not send an upload task for tenant {tenant_id}, timeline {timeline_id}"); + return; + } + }; + sync_queue.push( ZTenantTimelineId { tenant_id, timeline_id, }, - SyncTask::upload(TimelineUpload { + SyncTask::upload(LayersUpload { layers_to_upload, uploaded_layers: HashSet::new(), metadata, }), - ) { - warn!("Could not send an upload task for tenant {tenant_id}, timeline {timeline_id}") - } else { - debug!("Upload task for tenant {tenant_id}, timeline {timeline_id} sent") - } + ); + debug!("Upload task for tenant {tenant_id}, timeline {timeline_id} sent") } +/// Adds the new files to delete as a deletion task to the queue. +/// On task failure, it gets retried again from the start a number of times. +/// +/// Ensure that the loop is started otherwise the task is never processed. 
pub fn schedule_layer_delete( tenant_id: ZTenantId, timeline_id: ZTimelineId, layers_to_delete: HashSet, ) { - debug!("Scheduling layer deletion for tenant {tenant_id}, timeline {timeline_id}, to delete: {layers_to_delete:?}"); - if !sync_queue::push( + let sync_queue = match SYNC_QUEUE.get() { + Some(queue) => queue, + None => { + warn!("Could not send deletion task for tenant {tenant_id}, timeline {timeline_id}"); + return; + } + }; + sync_queue.push( ZTenantTimelineId { tenant_id, timeline_id, }, - SyncTask::delete(TimelineDelete { + SyncTask::delete(LayersDeletion { layers_to_delete, deleted_layers: HashSet::new(), deletion_registered: false, }), - ) { - warn!("Could not send deletion task for tenant {tenant_id}, timeline {timeline_id}") - } else { - debug!("Deletion task for tenant {tenant_id}, timeline {timeline_id} sent") - } + ); + debug!("Deletion task for tenant {tenant_id}, timeline {timeline_id} sent") } /// Requests the download of the entire timeline for a given tenant. @@ -771,15 +804,23 @@ pub fn schedule_layer_delete( /// Ensure that the loop is started otherwise the task is never processed. pub fn schedule_layer_download(tenant_id: ZTenantId, timeline_id: ZTimelineId) { debug!("Scheduling layer download for tenant {tenant_id}, timeline {timeline_id}"); - sync_queue::push( + let sync_queue = match SYNC_QUEUE.get() { + Some(queue) => queue, + None => { + warn!("Could not send download task for tenant {tenant_id}, timeline {timeline_id}"); + return; + } + }; + sync_queue.push( ZTenantTimelineId { tenant_id, timeline_id, }, - SyncTask::download(TimelineDownload { + SyncTask::download(LayersDownload { layers_to_skip: HashSet::new(), }), ); + debug!("Download task for tenant {tenant_id}, timeline {timeline_id} sent") } /// Uses a remote storage given to start the storage sync loop. 
@@ -795,8 +836,14 @@ where P: Debug + Send + Sync + 'static, S: RemoteStorage + Send + Sync + 'static, { - let (sender, receiver) = mpsc::unbounded_channel(); - sync_queue::init(sender)?; + let (sync_queue, sync_queue_receiver) = SyncQueue::new(max_concurrent_timelines_sync); + SYNC_QUEUE + .set(sync_queue) + .map_err(|_queue| anyhow!("Could not initialize sync queue"))?; + let sync_queue = match SYNC_QUEUE.get() { + Some(queue) => queue, + None => bail!("Could not get sync queue during the sync loop step, aborting"), + }; let runtime = tokio::runtime::Builder::new_current_thread() .enable_all() @@ -813,6 +860,7 @@ where let local_timeline_init_statuses = schedule_first_sync_tasks( &mut runtime.block_on(remote_index.write()), + sync_queue, local_timeline_files, ); @@ -827,10 +875,12 @@ where storage_sync_loop( runtime, conf, - receiver, - Arc::new(storage), - loop_index, - max_concurrent_timelines_sync, + ( + Arc::new(storage), + loop_index, + sync_queue, + sync_queue_receiver, + ), max_sync_errors, ); Ok(()) @@ -843,14 +893,15 @@ where }) } -#[allow(clippy::too_many_arguments)] fn storage_sync_loop( runtime: Runtime, conf: &'static PageServerConf, - mut receiver: UnboundedReceiver<(ZTenantTimelineId, SyncTask)>, - storage: Arc, - index: RemoteIndex, - max_concurrent_timelines_sync: NonZeroUsize, + (storage, index, sync_queue, mut sync_queue_receiver): ( + Arc, + RemoteIndex, + &SyncQueue, + UnboundedReceiver<(ZTenantTimelineId, SyncTask)>, + ), max_sync_errors: NonZeroU32, ) where P: Debug + Send + Sync + 'static, @@ -859,15 +910,12 @@ fn storage_sync_loop( info!("Starting remote storage sync loop"); loop { let loop_index = index.clone(); - let storage = Arc::clone(&storage); + let loop_storage = Arc::clone(&storage); let loop_step = runtime.block_on(async { tokio::select! 
{ step = loop_step( conf, - &mut receiver, - storage, - loop_index, - max_concurrent_timelines_sync, + (loop_storage, loop_index, sync_queue, &mut sync_queue_receiver), max_sync_errors, ) .instrument(info_span!("storage_sync_loop_step")) => step, @@ -898,23 +946,21 @@ fn storage_sync_loop( async fn loop_step( conf: &'static PageServerConf, - receiver: &mut UnboundedReceiver<(ZTenantTimelineId, SyncTask)>, - storage: Arc, - index: RemoteIndex, - max_concurrent_timelines_sync: NonZeroUsize, + (storage, index, sync_queue, sync_queue_receiver): ( + Arc, + RemoteIndex, + &SyncQueue, + &mut UnboundedReceiver<(ZTenantTimelineId, SyncTask)>, + ), max_sync_errors: NonZeroU32, ) -> ControlFlow<(), HashMap>> where P: Debug + Send + Sync + 'static, S: RemoteStorage + Send + Sync + 'static, { - let batched_tasks = - match sync_queue::next_task_batch(receiver, max_concurrent_timelines_sync).await { - ControlFlow::Continue(batch) => batch, - ControlFlow::Break(()) => return ControlFlow::Break(()), - }; + let batched_tasks = sync_queue.next_task_batch(sync_queue_receiver).await; - let remaining_queue_length = sync_queue::len(); + let remaining_queue_length = sync_queue.len(); REMAINING_SYNC_ITEMS.set(remaining_queue_length as i64); if remaining_queue_length > 0 || !batched_tasks.is_empty() { info!("Processing tasks for {} timelines in batch, more tasks left to process: {remaining_queue_length}", batched_tasks.len()); @@ -929,10 +975,15 @@ where let storage = Arc::clone(&storage); let index = index.clone(); async move { - let state_update = - process_sync_task_batch(conf, storage, index, max_sync_errors, sync_id, batch) - .instrument(info_span!("process_sync_task_batch", sync_id = %sync_id)) - .await; + let state_update = process_sync_task_batch( + conf, + (storage, index, sync_queue), + max_sync_errors, + sync_id, + batch, + ) + .instrument(info_span!("process_sync_task_batch", sync_id = %sync_id)) + .await; (sync_id, state_update) } }) @@ -941,7 +992,7 @@ where let mut 
new_timeline_states: HashMap< ZTenantId, HashMap, - > = HashMap::with_capacity(max_concurrent_timelines_sync.get()); + > = HashMap::new(); while let Some((sync_id, state_update)) = sync_results.next().await { debug!("Finished storage sync task for sync id {sync_id}"); if let Some(state_update) = state_update { @@ -957,8 +1008,7 @@ where async fn process_sync_task_batch( conf: &'static PageServerConf, - storage: Arc, - index: RemoteIndex, + (storage, index, sync_queue): (Arc, RemoteIndex, &SyncQueue), max_sync_errors: NonZeroU32, sync_id: ZTenantTimelineId, batch: SyncTaskBatch, @@ -972,6 +1022,13 @@ where let upload_data = batch.upload.clone(); let download_data = batch.download.clone(); + // Run both upload and download tasks concurrently (not in parallel): + // download and upload tasks do not conflict and do not spoil the pageserver state even if they are executed in parallel. + // "Spoiling" here means a potentially inconsistent layer set that misses some of the layers declared present + // in local (implicitly, via Lsn values and related memory state) or remote (explicitly via remote layer file paths) metadata. + // When operating in a system without tasks failing over the error threshold, + // current batching and task processing systems aim to update the layer set and metadata files (remote and local), + // without "losing" such layer files.
let (upload_result, status_update) = tokio::join!( async { if let Some(upload_data) = upload_data { @@ -982,7 +1039,7 @@ where ControlFlow::Continue(new_upload_data) => { upload_timeline_data( conf, - (storage.as_ref(), &index), + (storage.as_ref(), &index, sync_queue), current_remote_timeline.as_ref(), sync_id, new_upload_data, @@ -1022,14 +1079,14 @@ where ControlFlow::Continue(new_download_data) => { return download_timeline_data( conf, - (storage.as_ref(), &index), + (storage.as_ref(), &index, sync_queue), current_remote_timeline.as_ref(), sync_id, new_download_data, sync_start, "download", ) - .await + .await; } ControlFlow::Break(_) => { index @@ -1046,35 +1103,40 @@ where ); if let Some(delete_data) = batch.delete { - match validate_task_retries(delete_data, max_sync_errors) - .instrument(info_span!("retries_validation")) - .await - { - ControlFlow::Continue(new_delete_data) => { - delete_timeline_data( - conf, - (storage.as_ref(), &index), - sync_id, - new_delete_data, - sync_start, - "delete", - ) - .instrument(info_span!("delete_timeline_data")) - .await; - } - ControlFlow::Break(failed_delete_data) => { - if let Err(e) = update_remote_data( - conf, - storage.as_ref(), - &index, - sync_id, - RemoteDataUpdate::Delete(&failed_delete_data.data.deleted_layers), - ) + if upload_result.is_some() { + match validate_task_retries(delete_data, max_sync_errors) + .instrument(info_span!("retries_validation")) .await - { - error!("Failed to update remote timeline {sync_id}: {e:?}"); + { + ControlFlow::Continue(new_delete_data) => { + delete_timeline_data( + conf, + (storage.as_ref(), &index, sync_queue), + sync_id, + new_delete_data, + sync_start, + "delete", + ) + .instrument(info_span!("delete_timeline_data")) + .await; + } + ControlFlow::Break(failed_delete_data) => { + if let Err(e) = update_remote_data( + conf, + storage.as_ref(), + &index, + sync_id, + RemoteDataUpdate::Delete(&failed_delete_data.data.deleted_layers), + ) + .await + { + error!("Failed to update 
remote timeline {sync_id}: {e:?}"); + } } } + } else { + sync_queue.push(sync_id, SyncTask::Delete(delete_data)); + warn!("Skipping delete task due to failed upload tasks, reenqueuing"); } } @@ -1083,10 +1145,10 @@ where async fn download_timeline_data( conf: &'static PageServerConf, - (storage, index): (&S, &RemoteIndex), + (storage, index, sync_queue): (&S, &RemoteIndex, &SyncQueue), current_remote_timeline: Option<&RemoteTimeline>, sync_id: ZTenantTimelineId, - new_download_data: SyncData, + new_download_data: SyncData, sync_start: Instant, task_name: &str, ) -> Option @@ -1097,6 +1159,7 @@ where match download_timeline_layers( conf, storage, + sync_queue, current_remote_timeline, sync_id, new_download_data, @@ -1126,7 +1189,7 @@ where Err(e) => { error!("Failed to update local timeline metadata: {e:?}"); download_data.retries += 1; - sync_queue::push(sync_id, SyncTask::Download(download_data)); + sync_queue.push(sync_id, SyncTask::Download(download_data)); register_sync_status(sync_start, task_name, Some(false)); } } @@ -1199,14 +1262,14 @@ async fn update_local_metadata( async fn delete_timeline_data( conf: &'static PageServerConf, - (storage, index): (&S, &RemoteIndex), + (storage, index, sync_queue): (&S, &RemoteIndex, &SyncQueue), sync_id: ZTenantTimelineId, - mut new_delete_data: SyncData, + mut new_delete_data: SyncData, sync_start: Instant, task_name: &str, ) where P: Debug + Send + Sync + 'static, - S: RemoteStorage + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, { let timeline_delete = &mut new_delete_data.data; @@ -1222,14 +1285,14 @@ async fn delete_timeline_data( { error!("Failed to update remote timeline {sync_id}: {e:?}"); new_delete_data.retries += 1; - sync_queue::push(sync_id, SyncTask::Delete(new_delete_data)); + sync_queue.push(sync_id, SyncTask::Delete(new_delete_data)); register_sync_status(sync_start, task_name, Some(false)); return; } } timeline_delete.deletion_registered = true; - let sync_status = 
delete_timeline_layers(storage, sync_id, new_delete_data).await; + let sync_status = delete_timeline_layers(storage, sync_queue, sync_id, new_delete_data).await; register_sync_status(sync_start, task_name, Some(sync_status)); } @@ -1244,48 +1307,31 @@ async fn read_metadata_file(metadata_path: &Path) -> anyhow::Result( conf: &'static PageServerConf, - (storage, index): (&S, &RemoteIndex), + (storage, index, sync_queue): (&S, &RemoteIndex, &SyncQueue), current_remote_timeline: Option<&RemoteTimeline>, sync_id: ZTenantTimelineId, - new_upload_data: SyncData, + new_upload_data: SyncData, sync_start: Instant, task_name: &str, ) where P: Debug + Send + Sync + 'static, S: RemoteStorage + Send + Sync + 'static, { - let mut uploaded_data = - match upload_timeline_layers(storage, current_remote_timeline, sync_id, new_upload_data) - .await - { - UploadedTimeline::FailedAndRescheduled => { - register_sync_status(sync_start, task_name, Some(false)); - return; - } - UploadedTimeline::Successful(upload_data) => upload_data, - UploadedTimeline::SuccessfulAfterLocalFsUpdate(mut outdated_upload_data) => { - if outdated_upload_data.data.metadata.is_some() { - let local_metadata_path = - metadata_path(conf, sync_id.timeline_id, sync_id.tenant_id); - let local_metadata = match read_metadata_file(&local_metadata_path).await { - Ok(metadata) => metadata, - Err(e) => { - error!( - "Failed to load local metadata from path '{}': {e:?}", - local_metadata_path.display() - ); - outdated_upload_data.retries += 1; - sync_queue::push(sync_id, SyncTask::Upload(outdated_upload_data)); - register_sync_status(sync_start, task_name, Some(false)); - return; - } - }; - - outdated_upload_data.data.metadata = Some(local_metadata); - } - outdated_upload_data - } - }; + let mut uploaded_data = match upload_timeline_layers( + storage, + sync_queue, + current_remote_timeline, + sync_id, + new_upload_data, + ) + .await + { + UploadedTimeline::FailedAndRescheduled => { + register_sync_status(sync_start, 
task_name, Some(false)); + return; + } + UploadedTimeline::Successful(upload_data) => upload_data, + }; match update_remote_data( conf, @@ -1305,7 +1351,7 @@ async fn upload_timeline_data( Err(e) => { error!("Failed to update remote timeline {sync_id}: {e:?}"); uploaded_data.retries += 1; - sync_queue::push(sync_id, SyncTask::Upload(uploaded_data)); + sync_queue.push(sync_id, SyncTask::Upload(uploaded_data)); register_sync_status(sync_start, task_name, Some(false)); } } @@ -1313,7 +1359,7 @@ async fn upload_timeline_data( enum RemoteDataUpdate<'a> { Upload { - uploaded_data: TimelineUpload, + uploaded_data: LayersUpload, upload_failed: bool, }, Delete(&'a HashSet), @@ -1455,6 +1501,7 @@ where fn schedule_first_sync_tasks( index: &mut RemoteTimelineIndex, + sync_queue: &SyncQueue, local_timeline_files: HashMap)>, ) -> LocalTimelineInitStatuses { let mut local_timeline_init_statuses = LocalTimelineInitStatuses::new(); @@ -1491,7 +1538,7 @@ fn schedule_first_sync_tasks( // is it safe to upload this checkpoint? could it be half broken? 
new_sync_tasks.push_back(( sync_id, - SyncTask::upload(TimelineUpload { + SyncTask::upload(LayersUpload { layers_to_upload: local_files, uploaded_layers: HashSet::new(), metadata: Some(local_metadata), @@ -1509,7 +1556,7 @@ fn schedule_first_sync_tasks( } new_sync_tasks.into_iter().for_each(|(sync_id, task)| { - sync_queue::push(sync_id, task); + sync_queue.push(sync_id, task); }); local_timeline_init_statuses } @@ -1535,7 +1582,7 @@ fn compare_local_and_remote_timeline( let (initial_timeline_status, awaits_download) = if number_of_layers_to_download > 0 { new_sync_tasks.push_back(( sync_id, - SyncTask::download(TimelineDownload { + SyncTask::download(LayersDownload { layers_to_skip: local_files.clone(), }), )); @@ -1553,7 +1600,7 @@ fn compare_local_and_remote_timeline( if !layers_to_upload.is_empty() { new_sync_tasks.push_back(( sync_id, - SyncTask::upload(TimelineUpload { + SyncTask::upload(LayersUpload { layers_to_upload, uploaded_layers: HashSet::new(), metadata: Some(local_metadata), @@ -1584,12 +1631,12 @@ mod test_utils { use super::*; - pub async fn create_local_timeline( + pub(super) async fn create_local_timeline( harness: &RepoHarness<'_>, timeline_id: ZTimelineId, filenames: &[&str], metadata: TimelineMetadata, - ) -> anyhow::Result { + ) -> anyhow::Result { let timeline_path = harness.timeline_path(&timeline_id); fs::create_dir_all(&timeline_path).await?; @@ -1606,28 +1653,212 @@ mod test_utils { ) .await?; - Ok(TimelineUpload { + Ok(LayersUpload { layers_to_upload, uploaded_layers: HashSet::new(), metadata: Some(metadata), }) } - pub fn dummy_contents(name: &str) -> String { + pub(super) fn dummy_contents(name: &str) -> String { format!("contents for {name}") } - pub fn dummy_metadata(disk_consistent_lsn: Lsn) -> TimelineMetadata { + pub(super) fn dummy_metadata(disk_consistent_lsn: Lsn) -> TimelineMetadata { TimelineMetadata::new(disk_consistent_lsn, None, None, Lsn(0), Lsn(0), Lsn(0)) } } #[cfg(test)] mod tests { + use 
super::test_utils::dummy_metadata; + use crate::repository::repo_harness::TIMELINE_ID; + use hex_literal::hex; + use utils::lsn::Lsn; + use super::*; - #[test] - fn batching_tests() { - todo!("TODO kb") + const TEST_SYNC_ID: ZTenantTimelineId = ZTenantTimelineId { + tenant_id: ZTenantId::from_array(hex!("11223344556677881122334455667788")), + timeline_id: TIMELINE_ID, + }; + + #[tokio::test] + async fn separate_task_ids_batch() { + let (sync_queue, mut sync_queue_receiver) = SyncQueue::new(NonZeroUsize::new(100).unwrap()); + assert_eq!(sync_queue.len(), 0); + + let sync_id_2 = ZTenantTimelineId { + tenant_id: ZTenantId::from_array(hex!("22223344556677881122334455667788")), + timeline_id: TIMELINE_ID, + }; + let sync_id_3 = ZTenantTimelineId { + tenant_id: ZTenantId::from_array(hex!("33223344556677881122334455667788")), + timeline_id: TIMELINE_ID, + }; + assert!(sync_id_2 != TEST_SYNC_ID); + assert!(sync_id_2 != sync_id_3); + assert!(sync_id_3 != TEST_SYNC_ID); + + let download_task = SyncTask::download(LayersDownload { + layers_to_skip: HashSet::from([PathBuf::from("sk")]), + }); + let upload_task = SyncTask::upload(LayersUpload { + layers_to_upload: HashSet::from([PathBuf::from("up")]), + uploaded_layers: HashSet::from([PathBuf::from("upl")]), + metadata: Some(dummy_metadata(Lsn(2))), + }); + let delete_task = SyncTask::delete(LayersDeletion { + layers_to_delete: HashSet::from([PathBuf::from("de")]), + deleted_layers: HashSet::from([PathBuf::from("del")]), + deletion_registered: false, + }); + + sync_queue.push(TEST_SYNC_ID, download_task.clone()); + sync_queue.push(sync_id_2, upload_task.clone()); + sync_queue.push(sync_id_3, delete_task.clone()); + + let submitted_tasks_count = sync_queue.len(); + assert_eq!(submitted_tasks_count, 3); + let mut batch = sync_queue.next_task_batch(&mut sync_queue_receiver).await; + assert_eq!( + batch.len(), + submitted_tasks_count, + "Batch should consist of all tasks submitted" + ); + + assert_eq!( + 
Some(SyncTaskBatch::new(download_task)), + batch.remove(&TEST_SYNC_ID) + ); + assert_eq!( + Some(SyncTaskBatch::new(upload_task)), + batch.remove(&sync_id_2) + ); + assert_eq!( + Some(SyncTaskBatch::new(delete_task)), + batch.remove(&sync_id_3) + ); + + assert!(batch.is_empty(), "Should check all batch tasks"); + assert_eq!(sync_queue.len(), 0); + } + + #[tokio::test] + async fn same_task_id_separate_tasks_batch() { + let (sync_queue, mut sync_queue_receiver) = SyncQueue::new(NonZeroUsize::new(100).unwrap()); + assert_eq!(sync_queue.len(), 0); + + let download = LayersDownload { + layers_to_skip: HashSet::from([PathBuf::from("sk")]), + }; + let upload = LayersUpload { + layers_to_upload: HashSet::from([PathBuf::from("up")]), + uploaded_layers: HashSet::from([PathBuf::from("upl")]), + metadata: Some(dummy_metadata(Lsn(2))), + }; + let delete = LayersDeletion { + layers_to_delete: HashSet::from([PathBuf::from("de")]), + deleted_layers: HashSet::from([PathBuf::from("del")]), + deletion_registered: false, + }; + + sync_queue.push(TEST_SYNC_ID, SyncTask::download(download.clone())); + sync_queue.push(TEST_SYNC_ID, SyncTask::upload(upload.clone())); + sync_queue.push(TEST_SYNC_ID, SyncTask::delete(delete.clone())); + + let submitted_tasks_count = sync_queue.len(); + assert_eq!(submitted_tasks_count, 3); + let mut batch = sync_queue.next_task_batch(&mut sync_queue_receiver).await; + assert_eq!( + batch.len(), + 1, + "Queue should have one batch merged from 3 sync tasks of the same user" + ); + + assert_eq!( + Some(SyncTaskBatch { + upload: Some(SyncData { + retries: 0, + data: upload + }), + download: Some(SyncData { + retries: 0, + data: download + }), + delete: Some(SyncData { + retries: 0, + data: delete + }), + }), + batch.remove(&TEST_SYNC_ID), + "Should have one batch containing all tasks unchanged" + ); + + assert!(batch.is_empty(), "Should check all batch tasks"); + assert_eq!(sync_queue.len(), 0); + } + + #[tokio::test] + async fn same_task_id_same_tasks_batch() 
{ + let (sync_queue, mut sync_queue_receiver) = SyncQueue::new(NonZeroUsize::new(1).unwrap()); + let download_1 = LayersDownload { + layers_to_skip: HashSet::from([PathBuf::from("sk1")]), + }; + let download_2 = LayersDownload { + layers_to_skip: HashSet::from([PathBuf::from("sk2")]), + }; + let download_3 = LayersDownload { + layers_to_skip: HashSet::from([PathBuf::from("sk3")]), + }; + let download_4 = LayersDownload { + layers_to_skip: HashSet::from([PathBuf::from("sk4")]), + }; + + let sync_id_2 = ZTenantTimelineId { + tenant_id: ZTenantId::from_array(hex!("22223344556677881122334455667788")), + timeline_id: TIMELINE_ID, + }; + assert!(sync_id_2 != TEST_SYNC_ID); + + sync_queue.push(TEST_SYNC_ID, SyncTask::download(download_1.clone())); + sync_queue.push(TEST_SYNC_ID, SyncTask::download(download_2.clone())); + sync_queue.push(sync_id_2, SyncTask::download(download_3.clone())); + sync_queue.push(TEST_SYNC_ID, SyncTask::download(download_4.clone())); + assert_eq!(sync_queue.len(), 4); + + let mut smallest_batch = sync_queue.next_task_batch(&mut sync_queue_receiver).await; + assert_eq!( + smallest_batch.len(), + 1, + "Queue should have one batch merged from all the sync tasks of one tenant, but not the other tenant's task" + ); + assert_eq!( + Some(SyncTaskBatch { + download: Some(SyncData { + retries: 0, + data: LayersDownload { + layers_to_skip: { + let mut set = HashSet::new(); + set.extend(download_1.layers_to_skip.into_iter()); + set.extend(download_2.layers_to_skip.into_iter()); + set.extend(download_4.layers_to_skip.into_iter()); + set + }, + } + }), + upload: None, + delete: None, + }), + smallest_batch.remove(&TEST_SYNC_ID), + "Should have one batch containing all tasks merged for the tenant that first appeared in the batch" + ); + + assert!(smallest_batch.is_empty(), "Should check all batch tasks"); + assert_eq!( + sync_queue.len(), + 1, + "Should have one task left out of the batch" + ); } } diff --git a/pageserver/src/storage_sync/delete.rs 
b/pageserver/src/storage_sync/delete.rs index 8b13789179..047ad6c2be 100644 --- a/pageserver/src/storage_sync/delete.rs +++ b/pageserver/src/storage_sync/delete.rs @@ -1 +1,228 @@ +//! Timeline synchronization logic to delete a batch of the timeline's remote files from the remote storage. +use anyhow::Context; +use futures::stream::{FuturesUnordered, StreamExt}; +use tracing::{debug, error, info}; + +use crate::storage_sync::{SyncQueue, SyncTask}; +use remote_storage::RemoteStorage; +use utils::zid::ZTenantTimelineId; + +use super::{LayersDeletion, SyncData}; + +/// Attempts to remove the timeline layers from the remote storage. +/// If the task had not adjusted the metadata before, the deletion will fail. +pub(super) async fn delete_timeline_layers<'a, P, S>( + storage: &'a S, + sync_queue: &SyncQueue, + sync_id: ZTenantTimelineId, + mut delete_data: SyncData, +) -> bool +where + P: std::fmt::Debug + Send + Sync + 'static, + S: RemoteStorage + Send + Sync + 'static, +{ + if !delete_data.data.deletion_registered { + error!("Cannot delete timeline layers before the deletion metadata is registered, reenqueuing"); + delete_data.retries += 1; + sync_queue.push(sync_id, SyncTask::Delete(delete_data)); + return false; + } + + if delete_data.data.layers_to_delete.is_empty() { + info!("No layers to delete, skipping"); + return true; + } + + let layers_to_delete = delete_data + .data + .layers_to_delete + .drain() + .collect::>(); + debug!("Layers to delete: {layers_to_delete:?}"); + info!("Deleting {} timeline layers", layers_to_delete.len()); + + let mut delete_tasks = layers_to_delete + .into_iter() + .map(|local_layer_path| async { + let storage_path = + match storage + .remote_object_id(&local_layer_path) + .with_context(|| { + format!( + "Failed to get the layer storage path for local path '{}'", + local_layer_path.display() + ) + }) { + Ok(path) => path, + Err(e) => return Err((e, local_layer_path)), + }; + + match 
storage.delete(&storage_path).await.with_context(|| { + format!( + "Failed to delete remote layer from storage at '{:?}'", + storage_path + ) + }) { + Ok(()) => Ok(local_layer_path), + Err(e) => Err((e, local_layer_path)), + } + }) + .collect::>(); + + let mut errored = false; + while let Some(deletion_result) = delete_tasks.next().await { + match deletion_result { + Ok(local_layer_path) => { + debug!( + "Successfully deleted layer {} for timeline {sync_id}", + local_layer_path.display() + ); + delete_data.data.deleted_layers.insert(local_layer_path); + } + Err((e, local_layer_path)) => { + errored = true; + error!( + "Failed to delete layer {} for timeline {sync_id}: {e:?}", + local_layer_path.display() + ); + delete_data.data.layers_to_delete.insert(local_layer_path); + } + } + } + + if errored { + debug!("Reenqueuing failed delete task for timeline {sync_id}"); + delete_data.retries += 1; + sync_queue.push(sync_id, SyncTask::Delete(delete_data)); + } + errored +} + +#[cfg(test)] +mod tests { + use std::{collections::HashSet, num::NonZeroUsize}; + + use itertools::Itertools; + use tempfile::tempdir; + use tokio::fs; + use utils::lsn::Lsn; + + use crate::{ + repository::repo_harness::{RepoHarness, TIMELINE_ID}, + storage_sync::test_utils::{create_local_timeline, dummy_metadata}, + }; + use remote_storage::LocalFs; + + use super::*; + + #[tokio::test] + async fn delete_timeline_negative() -> anyhow::Result<()> { + let harness = RepoHarness::create("delete_timeline_negative")?; + let (sync_queue, _) = SyncQueue::new(NonZeroUsize::new(100).unwrap()); + let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID); + let storage = LocalFs::new( + tempdir()?.path().to_path_buf(), + harness.conf.workdir.clone(), + )?; + + let deleted = delete_timeline_layers( + &storage, + &sync_queue, + sync_id, + SyncData { + retries: 1, + data: LayersDeletion { + deleted_layers: HashSet::new(), + layers_to_delete: HashSet::new(), + deletion_registered: false, + }, + }, + ) + 
.await; + + assert!( + !deleted, + "Should not start the deletion for task with delete metadata unregistered" + ); + + Ok(()) + } + + #[tokio::test] + async fn delete_timeline() -> anyhow::Result<()> { + let harness = RepoHarness::create("delete_timeline")?; + let (sync_queue, _) = SyncQueue::new(NonZeroUsize::new(100).unwrap()); + + let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID); + let layer_files = ["a", "b", "c", "d"]; + let storage = LocalFs::new( + tempdir()?.path().to_path_buf(), + harness.conf.workdir.clone(), + )?; + let current_retries = 3; + let metadata = dummy_metadata(Lsn(0x30)); + let local_timeline_path = harness.timeline_path(&TIMELINE_ID); + let timeline_upload = + create_local_timeline(&harness, TIMELINE_ID, &layer_files, metadata.clone()).await?; + for local_path in timeline_upload.layers_to_upload { + let remote_path = storage.remote_object_id(&local_path)?; + let remote_parent_dir = remote_path.parent().unwrap(); + if !remote_parent_dir.exists() { + fs::create_dir_all(&remote_parent_dir).await?; + } + fs::copy(&local_path, &remote_path).await?; + } + assert_eq!( + storage + .list() + .await? 
+ .into_iter() + .map(|remote_path| storage.local_path(&remote_path).unwrap()) + .filter_map(|local_path| { Some(local_path.file_name()?.to_str()?.to_owned()) }) + .sorted() + .collect::>(), + layer_files + .iter() + .map(|layer_str| layer_str.to_string()) + .sorted() + .collect::>(), + "Expect to have all layer files remotely before deletion" + ); + + let deleted = delete_timeline_layers( + &storage, + &sync_queue, + sync_id, + SyncData { + retries: current_retries, + data: LayersDeletion { + deleted_layers: HashSet::new(), + layers_to_delete: HashSet::from([ + local_timeline_path.join("a"), + local_timeline_path.join("c"), + local_timeline_path.join("something_different"), + ]), + deletion_registered: true, + }, + }, + ) + .await; + assert!(deleted, "Should be able to delete timeline files"); + + assert_eq!( + storage + .list() + .await? + .into_iter() + .map(|remote_path| storage.local_path(&remote_path).unwrap()) + .filter_map(|local_path| { Some(local_path.file_name()?.to_str()?.to_owned()) }) + .sorted() + .collect::>(), + vec!["b".to_string(), "d".to_string()], + "Expect to have only non-deleted files remotely" + ); + + Ok(()) + } +} diff --git a/pageserver/src/storage_sync/download.rs b/pageserver/src/storage_sync/download.rs index 3cd6de57c7..98a0a0e2fc 100644 --- a/pageserver/src/storage_sync/download.rs +++ b/pageserver/src/storage_sync/download.rs @@ -12,15 +12,13 @@ use tokio::{ use tracing::{debug, error, info, warn}; use crate::{ - config::PageServerConf, - layered_repository::metadata::metadata_path, - storage_sync::{sync_queue, SyncTask}, + config::PageServerConf, layered_repository::metadata::metadata_path, storage_sync::SyncTask, }; use utils::zid::ZTenantTimelineId; use super::{ index::{IndexPart, RemoteTimeline}, - SyncData, TimelineDownload, + LayersDownload, SyncData, SyncQueue, }; pub const TEMP_DOWNLOAD_EXTENSION: &str = "temp_download"; @@ -76,7 +74,7 @@ pub(super) enum DownloadedTimeline { FailedAndRescheduled, /// Remote timeline data is 
found, its latest checkpoint's metadata contents (disk_consistent_lsn) is known. /// Initial download successful. - Successful(SyncData), + Successful(SyncData), } /// Attempts to download all given timeline's layers. @@ -87,9 +85,10 @@ pub(super) enum DownloadedTimeline { pub(super) async fn download_timeline_layers<'a, P, S>( conf: &'static PageServerConf, storage: &'a S, + sync_queue: &'a SyncQueue, remote_timeline: Option<&'a RemoteTimeline>, sync_id: ZTenantTimelineId, - mut download_data: SyncData, + mut download_data: SyncData, ) -> DownloadedTimeline where P: Debug + Send + Sync + 'static, @@ -251,7 +250,7 @@ where if errors_happened { debug!("Reenqueuing failed download task for timeline {sync_id}"); download_data.retries += 1; - sync_queue::push(sync_id, SyncTask::Download(download_data)); + sync_queue.push(sync_id, SyncTask::Download(download_data)); DownloadedTimeline::FailedAndRescheduled } else { info!("Successfully downloaded all layers"); @@ -265,7 +264,10 @@ async fn fsync_path(path: impl AsRef) -> Result<(), io::Error> { #[cfg(test)] mod tests { - use std::collections::{BTreeSet, HashSet}; + use std::{ + collections::{BTreeSet, HashSet}, + num::NonZeroUsize, + }; use remote_storage::{LocalFs, RemoteStorage}; use tempfile::tempdir; @@ -284,6 +286,8 @@ mod tests { #[tokio::test] async fn download_timeline() -> anyhow::Result<()> { let harness = RepoHarness::create("download_timeline")?; + let (sync_queue, _) = SyncQueue::new(NonZeroUsize::new(100).unwrap()); + let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID); let layer_files = ["a", "b", "layer_to_skip", "layer_to_keep_locally"]; let storage = LocalFs::new( @@ -324,11 +328,12 @@ mod tests { let download_data = match download_timeline_layers( harness.conf, &storage, + &sync_queue, Some(&remote_timeline), sync_id, SyncData::new( current_retries, - TimelineDownload { + LayersDownload { layers_to_skip: HashSet::from([local_timeline_path.join("layer_to_skip")]), }, ), @@ -380,17 
+385,19 @@ mod tests { #[tokio::test] async fn download_timeline_negatives() -> anyhow::Result<()> { let harness = RepoHarness::create("download_timeline_negatives")?; + let (sync_queue, _) = SyncQueue::new(NonZeroUsize::new(100).unwrap()); let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID); let storage = LocalFs::new(tempdir()?.path().to_owned(), harness.conf.workdir.clone())?; let empty_remote_timeline_download = download_timeline_layers( harness.conf, &storage, + &sync_queue, None, sync_id, SyncData::new( 0, - TimelineDownload { + LayersDownload { layers_to_skip: HashSet::new(), }, ), @@ -409,11 +416,12 @@ mod tests { let already_downloading_remote_timeline_download = download_timeline_layers( harness.conf, &storage, + &sync_queue, Some(&not_expecting_download_remote_timeline), sync_id, SyncData::new( 0, - TimelineDownload { + LayersDownload { layers_to_skip: HashSet::new(), }, ), diff --git a/pageserver/src/storage_sync/upload.rs b/pageserver/src/storage_sync/upload.rs index 1e2594ac70..f9d606f2b8 100644 --- a/pageserver/src/storage_sync/upload.rs +++ b/pageserver/src/storage_sync/upload.rs @@ -8,16 +8,14 @@ use remote_storage::RemoteStorage; use tokio::fs; use tracing::{debug, error, info, warn}; -use crate::{ - config::PageServerConf, - layered_repository::metadata::metadata_path, - storage_sync::{sync_queue, SyncTask}, -}; use utils::zid::ZTenantTimelineId; use super::{ index::{IndexPart, RemoteTimeline}, - SyncData, TimelineUpload, + LayersUpload, SyncData, SyncQueue, +}; +use crate::{ + config::PageServerConf, layered_repository::metadata::metadata_path, storage_sync::SyncTask, }; /// Serializes and uploads the given index part data to the remote storage. @@ -68,11 +66,7 @@ pub(super) enum UploadedTimeline { /// Upload failed due to some error, the upload task is rescheduled for another retry. FailedAndRescheduled, /// No issues happened during the upload, all task files were put into the remote storage. 
- Successful(SyncData), - /// No failures happened during the upload, but some files were removed locally before the upload task completed - /// (could happen due to retries, for instance, if GC happens in the interim). - /// Such files are considered "not needed" and ignored, but the task's metadata should be discarded and the new one loaded from the local file. - SuccessfulAfterLocalFsUpdate(SyncData), + Successful(SyncData), } /// Attempts to upload given layer files. @@ -81,9 +75,10 @@ pub(super) enum UploadedTimeline { /// On an error, bumps the retries count and reschedules the entire task. pub(super) async fn upload_timeline_layers<'a, P, S>( storage: &'a S, + sync_queue: &SyncQueue, remote_timeline: Option<&'a RemoteTimeline>, sync_id: ZTenantTimelineId, - mut upload_data: SyncData, + mut upload_data: SyncData, ) -> UploadedTimeline where P: Debug + Send + Sync + 'static, @@ -168,7 +163,6 @@ where .collect::>(); let mut errors_happened = false; - let mut local_fs_updated = false; while let Some(upload_result) = upload_tasks.next().await { match upload_result { Ok(uploaded_path) => { @@ -185,7 +179,16 @@ where errors_happened = true; error!("Failed to upload a layer for timeline {sync_id}: {e:?}"); } else { - local_fs_updated = true; + // We have run the upload sync task, but the file we wanted to upload is gone. + // This is "fine" due to the asynchronous nature of the sync loop: it only reacts to events and might need to + // retry the upload tasks if S3 or the network is down. During this time, the pageserver might still operate and + // run compaction/gc threads, removing redundant files from disk. + // It's not good to pause GC/compaction because of those and we would rather skip such uploads. + // + // Yet the absence of such files might also mean that the timeline metadata file was updated (GC moves the Lsn forward, for instance). 
+ // We don't try to read a more recent version, since it could contain `disk_consistent_lsn` that does not have its upload finished yet. + // This will create "missing" layers and make data inconsistent. + // Instead, we only update the metadata when it was submitted in an upload task as a checkpoint result. upload.layers_to_upload.remove(&source_path); warn!( "Missing locally a layer file {} scheduled for upload, skipping", @@ -200,11 +203,8 @@ where if errors_happened { debug!("Reenqueuing failed upload task for timeline {sync_id}"); upload_data.retries += 1; - sync_queue::push(sync_id, SyncTask::Upload(upload_data)); + sync_queue.push(sync_id, SyncTask::Upload(upload_data)); UploadedTimeline::FailedAndRescheduled - } else if local_fs_updated { - info!("Successfully uploaded all layers, some local layers were removed during the upload"); - UploadedTimeline::SuccessfulAfterLocalFsUpdate(upload_data) } else { info!("Successfully uploaded all layers"); UploadedTimeline::Successful(upload_data) @@ -218,7 +218,10 @@ enum UploadError { #[cfg(test)] mod tests { - use std::collections::{BTreeSet, HashSet}; + use std::{ + collections::{BTreeSet, HashSet}, + num::NonZeroUsize, + }; use remote_storage::LocalFs; use tempfile::tempdir; @@ -237,6 +240,7 @@ mod tests { #[tokio::test] async fn regular_layer_upload() -> anyhow::Result<()> { let harness = RepoHarness::create("regular_layer_upload")?; + let (sync_queue, _) = SyncQueue::new(NonZeroUsize::new(100).unwrap()); let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID); let layer_files = ["a", "b"]; @@ -258,6 +262,7 @@ mod tests { let upload_result = upload_timeline_layers( &storage, + &sync_queue, None, sync_id, SyncData::new(current_retries, timeline_upload.clone()), @@ -322,6 +327,7 @@ mod tests { #[tokio::test] async fn layer_upload_after_local_fs_update() -> anyhow::Result<()> { let harness = RepoHarness::create("layer_upload_after_local_fs_update")?; + let (sync_queue, _) = 
SyncQueue::new(NonZeroUsize::new(100).unwrap()); let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID); let layer_files = ["a1", "b1"]; @@ -347,6 +353,7 @@ mod tests { let upload_result = upload_timeline_layers( &storage, + &sync_queue, None, sync_id, SyncData::new(current_retries, timeline_upload.clone()), @@ -354,7 +361,7 @@ mod tests { .await; let upload_data = match upload_result { - UploadedTimeline::SuccessfulAfterLocalFsUpdate(upload_data) => upload_data, + UploadedTimeline::Successful(upload_data) => upload_data, wrong_result => panic!( "Expected a successful after local fs upload for timeline, but got: {wrong_result:?}" ), From cf59b515195fbd56e02e5bee11991a1c375d0a69 Mon Sep 17 00:00:00 2001 From: Thang Pham Date: Mon, 9 May 2022 11:11:46 -0400 Subject: [PATCH 217/296] Update README (Running local installation section) (#1649) --- README.md | 49 +++++++++++++++++++++++------------- control_plane/src/storage.rs | 3 +++ 2 files changed, 34 insertions(+), 18 deletions(-) diff --git a/README.md b/README.md index 8876831265..af384d2672 100644 --- a/README.md +++ b/README.md @@ -50,31 +50,29 @@ make -j5 # Create repository in .zenith with proper paths to binaries and data # Later that would be responsibility of a package install script > ./target/debug/neon_local init -initializing tenantid c03ba6b7ad4c5e9cf556f059ade44229 -created initial timeline 5b014a9e41b4b63ce1a1febc04503636 timeline.lsn 0/169C3C8 -created main branch +initializing tenantid 9ef87a5bf0d92544f6fafeeb3239695c +created initial timeline de200bd42b49cc1814412c7e592dd6e9 timeline.lsn 0/16B5A50 +initial timeline de200bd42b49cc1814412c7e592dd6e9 created pageserver init succeeded # start pageserver and safekeeper > ./target/debug/neon_local start -Starting pageserver at 'localhost:64000' in '.zenith' +Starting pageserver at '127.0.0.1:64000' in '.zenith' Pageserver started -initializing for single for 7676 -Starting safekeeper at '127.0.0.1:5454' in '.zenith/safekeepers/single' 
+initializing for sk 1 for 7676 +Starting safekeeper at '127.0.0.1:5454' in '.zenith/safekeepers/sk1' Safekeeper started # start postgres compute node > ./target/debug/neon_local pg start main -Starting new postgres main on timeline 5b014a9e41b4b63ce1a1febc04503636 ... -Extracting base backup to create postgres instance: path=.zenith/pgdatadirs/tenants/c03ba6b7ad4c5e9cf556f059ade44229/main port=55432 +Starting new postgres main on timeline de200bd42b49cc1814412c7e592dd6e9 ... +Extracting base backup to create postgres instance: path=.zenith/pgdatadirs/tenants/9ef87a5bf0d92544f6fafeeb3239695c/main port=55432 Starting postgres node at 'host=127.0.0.1 port=55432 user=zenith_admin dbname=postgres' -waiting for server to start.... done -server started # check list of running postgres instances > ./target/debug/neon_local pg list -NODE ADDRESS TIMELINES BRANCH NAME LSN STATUS -main 127.0.0.1:55432 5b014a9e41b4b63ce1a1febc04503636 main 0/1609610 running + NODE ADDRESS TIMELINE BRANCH NAME LSN STATUS + main 127.0.0.1:55432 de200bd42b49cc1814412c7e592dd6e9 main 0/16B5BA8 running ``` 4. Now it is possible to connect to postgres and run some queries: @@ -95,17 +93,24 @@ postgres=# select * from t; ```sh # create branch named migration_check > ./target/debug/neon_local timeline branch --branch-name migration_check -Created timeline '0e9331cad6efbafe6a88dd73ae21a5c9' at Lsn 0/16F5830 for tenant: c03ba6b7ad4c5e9cf556f059ade44229. Ancestor timeline: 'main' +Created timeline 'b3b863fa45fa9e57e615f9f2d944e601' at Lsn 0/16F9A00 for tenant: 9ef87a5bf0d92544f6fafeeb3239695c. 
Ancestor timeline: 'main' # check branches tree > ./target/debug/neon_local timeline list - main [5b014a9e41b4b63ce1a1febc04503636] - ┗━ @0/1609610: migration_check [0e9331cad6efbafe6a88dd73ae21a5c9] +(L) main [de200bd42b49cc1814412c7e592dd6e9] +(L) ┗━ @0/16F9A00: migration_check [b3b863fa45fa9e57e615f9f2d944e601] # start postgres on that branch -> ./target/debug/neon_local pg start migration_check -Starting postgres node at 'host=127.0.0.1 port=55433 user=stas' -waiting for server to start.... done +> ./target/debug/neon_local pg start migration_check --branch-name migration_check +Starting new postgres migration_check on timeline b3b863fa45fa9e57e615f9f2d944e601 ... +Extracting base backup to create postgres instance: path=.zenith/pgdatadirs/tenants/9ef87a5bf0d92544f6fafeeb3239695c/migration_check port=55433 +Starting postgres node at 'host=127.0.0.1 port=55433 user=zenith_admin dbname=postgres' + +# check the new list of running postgres instances +> ./target/debug/neon_local pg list + NODE ADDRESS TIMELINE BRANCH NAME LSN STATUS + main 127.0.0.1:55432 de200bd42b49cc1814412c7e592dd6e9 main 0/16F9A38 running + migration_check 127.0.0.1:55433 b3b863fa45fa9e57e615f9f2d944e601 migration_check 0/16F9A70 running # this new postgres instance will have all the data from 'main' postgres, # but all modifications would not affect data in original postgres @@ -118,6 +123,14 @@ postgres=# select * from t; postgres=# insert into t values(2,2); INSERT 0 1 + +# check that the new change doesn't affect the 'main' postgres +> psql -p55432 -h 127.0.0.1 -U zenith_admin postgres +postgres=# select * from t; + key | value +-----+------- + 1 | 1 +(1 row) ``` 6. 
If you want to run tests afterwards (see below), you have to stop all the running the pageserver, safekeeper and postgres instances diff --git a/control_plane/src/storage.rs b/control_plane/src/storage.rs index adb924d430..d2e63a22de 100644 --- a/control_plane/src/storage.rs +++ b/control_plane/src/storage.rs @@ -167,6 +167,9 @@ impl PageServerNode { ); } + // echo the captured output of the init command + println!("{}", String::from_utf8_lossy(&init_output.stdout)); + Ok(initial_timeline_id) } From 87dfa997345cc5a825aba4acc581edbf4806b4f7 Mon Sep 17 00:00:00 2001 From: Thang Pham Date: Tue, 10 May 2022 09:55:14 -0400 Subject: [PATCH 218/296] Update layered_repository REAMDE (#1659) --- pageserver/src/layered_repository/README.md | 43 +++++++++++++++++++-- 1 file changed, 40 insertions(+), 3 deletions(-) diff --git a/pageserver/src/layered_repository/README.md b/pageserver/src/layered_repository/README.md index 519478e417..70c571a507 100644 --- a/pageserver/src/layered_repository/README.md +++ b/pageserver/src/layered_repository/README.md @@ -23,6 +23,7 @@ distribution depends on the workload: the updates could be totally random, or there could be a long stream of updates to a single relation when data is bulk loaded, for example, or something in between. +``` Cloud Storage Page Server Safekeeper L1 L0 Memory WAL @@ -37,6 +38,7 @@ Cloud Storage Page Server Safekeeper +----+----+ +----+----+ | | | |EEEE| |EEEE|EEEE| +---+-----+ +----+ +----+----+ +``` In this illustration, WAL is received as a stream from the Safekeeper, from the right. It is immediately captured by the page server and stored quickly in @@ -47,7 +49,7 @@ the same page and relation close to each other. From the page server memory, whenever enough WAL has been accumulated, it is flushed to disk into a new L0 layer file, and the memory is released. 
-When enough L0 files have been accumulated, they are merged together rand sliced +When enough L0 files have been accumulated, they are merged together and sliced per key-space, producing a new set of files where each file contains a more narrow key range, but larger LSN range. @@ -121,7 +123,7 @@ The files are called "layer files". Each layer file covers a range of keys, and a range of LSNs (or a single LSN, in case of image layers). You can think of it as a rectangle in the two-dimensional key-LSN space. The layer files for each timeline are stored in the timeline's subdirectory under -.zenith/tenants//timelines. +`.zenith/tenants//timelines`. There are two kind of layer files: images, and delta layers. An image file contains a snapshot of all keys at a particular LSN, whereas a delta file @@ -130,8 +132,11 @@ range of LSN. image file: +``` 000000067F000032BE0000400000000070B6-000000067F000032BE0000400000000080B6__00000000346BC568 start key end key LSN +``` + The first parts define the key range that the layer covers. See pgdatadir_mapping.rs for how the key space is used. The last part is the LSN. @@ -140,8 +145,10 @@ delta file: Delta files are named similarly, but they cover a range of LSNs: +``` 000000067F000032BE0000400000000020B6-000000067F000032BE0000400000000030B6__000000578C6B29-0000000057A50051 start key end key start LSN end LSN +``` A delta file contains all the key-values in the key-range that were updated in the LSN range. If a key has not been modified, there is no trace of it in the @@ -151,7 +158,9 @@ delta layer. A delta layer file can cover a part of the overall key space, as in the previous example, or the whole key range like this: +``` 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000578C6B29-0000000057A50051 +``` A file that covers the whole key range is called a L0 file (Level 0), while a file that covers only part of the key range is called a L1 file. 
The "level" of @@ -168,7 +177,9 @@ version, and how branching and GC works is still valid. The full path of a delta file looks like this: +``` .zenith/tenants/941ddc8604413b88b3d208bddf90396c/timelines/4af489b06af8eed9e27a841775616962/rel_1663_13990_2609_0_10_000000000169C348_0000000001702000 +``` For simplicity, the examples below use a simplified notation for the paths. The tenant ID is left out, the timeline ID is replaced with @@ -177,8 +188,10 @@ with a human-readable table name. The LSNs are also shorter. For example, a base image file at LSN 100 and a delta file between 100-200 for 'orders' table on 'main' branch is represented like this: +``` main/orders_100 main/orders_100_200 +``` # Creating layer files @@ -188,12 +201,14 @@ branch called 'main' and two tables, 'orders' and 'customers'. The end of WAL is currently at LSN 250. In this starting situation, you would have these files on disk: +``` main/orders_100 main/orders_100_200 main/orders_200 main/customers_100 main/customers_100_200 main/customers_200 +``` In addition to those files, the recent changes between LSN 200 and the end of WAL at 250 are kept in memory. If the page server crashes, the @@ -224,6 +239,7 @@ If the customers table is modified later, a new file is created for it at the next checkpoint. The new file will cover the "gap" from the last layer file, so the LSN ranges are always contiguous: +``` main/orders_100 main/orders_100_200 main/orders_200 @@ -236,6 +252,7 @@ last layer file, so the LSN ranges are always contiguous: main/customers_200 main/customers_200_500 main/customers_500 +``` ## Reading page versions @@ -259,15 +276,18 @@ involves replaying any WAL records applicable to the page between LSNs Imagine that a child branch is created at LSN 250: +``` @250 ----main--+--------------------------> \ +---child--------------> +``` Then, the 'orders' table is updated differently on the 'main' and 'child' branches. 
You now have this situation on disk: +``` main/orders_100 main/orders_100_200 main/orders_200 @@ -282,6 +302,7 @@ Then, the 'orders' table is updated differently on the 'main' and child/orders_300 child/orders_300_400 child/orders_400 +``` Because the 'customers' table hasn't been modified on the child branch, there is no file for it there. If you request a page for it on @@ -294,6 +315,7 @@ is linear, and the request's LSN identifies unambiguously which file you need to look at. For example, the history for the 'orders' table on the 'main' branch consists of these files: +``` main/orders_100 main/orders_100_200 main/orders_200 @@ -301,10 +323,12 @@ on the 'main' branch consists of these files: main/orders_300 main/orders_300_400 main/orders_400 +``` And from the 'child' branch's point of view, it consists of these files: +``` main/orders_100 main/orders_100_200 main/orders_200 @@ -313,6 +337,7 @@ files: child/orders_300 child/orders_300_400 child/orders_400 +``` The branch metadata includes the point where the child branch was created, LSN 250. If a page request comes with LSN 275, we read the @@ -345,6 +370,7 @@ Let's look at the single branch scenario again. 
Imagine that the end of the branch is LSN 525, so that the GC horizon is currently at 525-150 = 375 +``` main/orders_100 main/orders_100_200 main/orders_200 @@ -357,11 +383,13 @@ of the branch is LSN 525, so that the GC horizon is currently at main/customers_100 main/customers_100_200 main/customers_200 +``` We can remove the following files because the end LSNs of those files are older than GC horizon 375, and there are more recent layer files for the table: +``` main/orders_100 DELETE main/orders_100_200 DELETE main/orders_200 DELETE @@ -374,8 +402,9 @@ table: main/customers_100 DELETE main/customers_100_200 DELETE main/customers_200 KEEP, NO NEWER VERSION +``` -'main/customers_100_200' is old enough, but it cannot be +'main/customers_200' is old enough, but it cannot be removed because there is no newer layer file for the table. Things get slightly more complicated with multiple branches. All of @@ -384,6 +413,7 @@ retain older shapshot files that are still needed by child branches. For example, if child branch is created at LSN 150, and the 'customers' table is updated on the branch, you would have these files: +``` main/orders_100 KEEP, NEEDED BY child BRANCH main/orders_100_200 KEEP, NEEDED BY child BRANCH main/orders_200 DELETE @@ -398,6 +428,7 @@ table is updated on the branch, you would have these files: main/customers_200 KEEP, NO NEWER VERSION child/customers_150_300 DELETE child/customers_300 KEEP, NO NEWER VERSION +``` In this situation, 'main/orders_100' and 'main/orders_100_200' cannot be removed, even though they are older than the GC horizon, because @@ -407,6 +438,7 @@ and 'main/orders_200_300' can still be removed. 
If 'orders' is modified later on the 'child' branch, we will create a new base image and delta file for it on the child: +``` main/orders_100 main/orders_100_200 @@ -419,6 +451,7 @@ new base image and delta file for it on the child: child/customers_300 child/orders_150_400 child/orders_400 +``` After this, the 'main/orders_100' and 'main/orders_100_200' file could be removed. It is no longer needed by the child branch, because there @@ -434,6 +467,7 @@ Describe GC and checkpoint interval settings. In principle, each relation can be checkpointed separately, i.e. the LSN ranges of the files don't need to line up. So this would be legal: +``` main/orders_100 main/orders_100_200 main/orders_200 @@ -446,6 +480,7 @@ LSN ranges of the files don't need to line up. So this would be legal: main/customers_250 main/customers_250_500 main/customers_500 +``` However, the code currently always checkpoints all relations together. So that situation doesn't arise in practice. @@ -468,11 +503,13 @@ does that. It could be useful, however, as a transient state when garbage collecting around branch points, or explicit recovery points. For example, if we start with this: +``` main/orders_100 main/orders_100_200 main/orders_200 main/orders_200_300 main/orders_300 +``` And there is a branch or explicit recovery point at LSN 150, we could replace 'main/orders_100_200' with 'main/orders_150' to keep a From 6cb14b4200429bc2eb50b5f9879918188965b156 Mon Sep 17 00:00:00 2001 From: Arseny Sher Date: Tue, 10 May 2022 20:44:56 +0400 Subject: [PATCH 219/296] Optionally remove WAL on safekeepers without s3 offloading. And do that on staging, until offloading is merged. 
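
As a rough sketch of the idea (not the safekeeper's actual code): WAL can be removed up to the smallest LSN that some consumer still needs, and when s3 offloading is disabled the s3 position is replaced by the maximum LSN so that it never holds removal back. The `Lsn` newtype, the `State` struct, and `horizon_segno` below are simplified stand-ins invented for this illustration.

```rust
use std::cmp::min;

/// Simplified stand-in for the safekeeper's LSN type (an assumption of this sketch).
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Lsn(u64);

impl Lsn {
    /// Number of the WAL segment that contains this LSN.
    fn segment_number(self, wal_seg_size: u64) -> u64 {
        self.0 / wal_seg_size
    }
}

/// Hypothetical subset of the safekeeper state that matters for WAL removal.
struct State {
    remote_consistent_lsn: Lsn,
    peer_horizon_lsn: Lsn,
    s3_wal_lsn: Lsn,
    wal_seg_size: u64,
}

/// Segments strictly below the returned segment number are no longer needed.
fn horizon_segno(state: &State, s3_offload_enabled: bool) -> u64 {
    // With offloading disabled, the s3 position must not constrain the horizon.
    let s3_offload_horizon = if s3_offload_enabled {
        state.s3_wal_lsn
    } else {
        Lsn(u64::MAX)
    };
    let horizon_lsn = min(
        min(state.remote_consistent_lsn, state.peer_horizon_lsn),
        s3_offload_horizon,
    );
    horizon_lsn.segment_number(state.wal_seg_size)
}
```

With offloading enabled and nothing uploaded yet, the horizon stays at segment 0 and no WAL is removed; with it disabled, only the pageserver and peer positions matter.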
--- .circleci/ansible/production.hosts | 1 + .circleci/ansible/staging.hosts | 1 + .circleci/ansible/systemd/safekeeper.service | 2 +- safekeeper/src/bin/safekeeper.rs | 15 +++++++++++++++ safekeeper/src/lib.rs | 2 ++ safekeeper/src/remove_wal.rs | 2 +- safekeeper/src/safekeeper.rs | 9 +++++++-- safekeeper/src/timeline.rs | 4 ++-- 8 files changed, 30 insertions(+), 6 deletions(-) diff --git a/.circleci/ansible/production.hosts b/.circleci/ansible/production.hosts index f32b57154c..2ed8f517f7 100644 --- a/.circleci/ansible/production.hosts +++ b/.circleci/ansible/production.hosts @@ -15,3 +15,4 @@ console_mgmt_base_url = http://console-release.local bucket_name = zenith-storage-oregon bucket_region = us-west-2 etcd_endpoints = etcd-release.local:2379 +safekeeper_enable_s3_offload = true diff --git a/.circleci/ansible/staging.hosts b/.circleci/ansible/staging.hosts index 71166c531e..3ea815b907 100644 --- a/.circleci/ansible/staging.hosts +++ b/.circleci/ansible/staging.hosts @@ -16,3 +16,4 @@ console_mgmt_base_url = http://console-staging.local bucket_name = zenith-staging-storage-us-east-1 bucket_region = us-east-1 etcd_endpoints = etcd-staging.local:2379 +safekeeper_enable_s3_offload = false diff --git a/.circleci/ansible/systemd/safekeeper.service b/.circleci/ansible/systemd/safekeeper.service index cac38d8756..55088db859 100644 --- a/.circleci/ansible/systemd/safekeeper.service +++ b/.circleci/ansible/systemd/safekeeper.service @@ -6,7 +6,7 @@ After=network.target auditd.service Type=simple User=safekeeper Environment=RUST_BACKTRACE=1 ZENITH_REPO_DIR=/storage/safekeeper/data LD_LIBRARY_PATH=/usr/local/lib -ExecStart=/usr/local/bin/safekeeper -l {{ inventory_hostname }}.local:6500 --listen-http {{ inventory_hostname }}.local:7676 -p {{ first_pageserver }}:6400 -D /storage/safekeeper/data --broker-endpoints={{ etcd_endpoints }} +ExecStart=/usr/local/bin/safekeeper -l {{ inventory_hostname }}.local:6500 --listen-http {{ inventory_hostname }}.local:7676 -p {{ 
first_pageserver }}:6400 -D /storage/safekeeper/data --broker-endpoints={{ etcd_endpoints }} --enable-s3-offload={{ safekeeper_enable_s3_offload }} ExecReload=/bin/kill -HUP $MAINPID KillMode=mixed KillSignal=SIGINT diff --git a/safekeeper/src/bin/safekeeper.rs b/safekeeper/src/bin/safekeeper.rs index 7e979840c2..d0df7093ff 100644 --- a/safekeeper/src/bin/safekeeper.rs +++ b/safekeeper/src/bin/safekeeper.rs @@ -115,6 +115,14 @@ fn main() -> Result<()> { .takes_value(true) .help("a prefix to always use when polling/pusing data in etcd from this safekeeper"), ) + .arg( + Arg::new("enable-s3-offload") + .long("enable-s3-offload") + .takes_value(true) + .default_value("true") + .default_missing_value("true") + .help("Enable/disable s3 offloading. When disabled, safekeeper removes WAL ignoring s3 WAL horizon."), + ) .get_matches(); if let Some(addr) = arg_matches.value_of("dump-control-file") { @@ -172,6 +180,13 @@ fn main() -> Result<()> { conf.broker_etcd_prefix = prefix.to_string(); } + // Seems like there is no better way to accept bool values explicitly in clap. 
+ conf.s3_offload_enabled = arg_matches + .value_of("enable-s3-offload") + .unwrap() + .parse() + .context("failed to parse bool enable-s3-offload bool")?; + start_safekeeper(conf, given_id, arg_matches.is_present("init")) } diff --git a/safekeeper/src/lib.rs b/safekeeper/src/lib.rs index f74e5be992..c848de9e71 100644 --- a/safekeeper/src/lib.rs +++ b/safekeeper/src/lib.rs @@ -53,6 +53,7 @@ pub struct SafeKeeperConf { pub my_id: ZNodeId, pub broker_endpoints: Option>, pub broker_etcd_prefix: String, + pub s3_offload_enabled: bool, } impl SafeKeeperConf { @@ -79,6 +80,7 @@ impl Default for SafeKeeperConf { my_id: ZNodeId(0), broker_endpoints: None, broker_etcd_prefix: defaults::DEFAULT_NEON_BROKER_PREFIX.to_string(), + s3_offload_enabled: true, } } } diff --git a/safekeeper/src/remove_wal.rs b/safekeeper/src/remove_wal.rs index 9474f65e5f..3278d51bd3 100644 --- a/safekeeper/src/remove_wal.rs +++ b/safekeeper/src/remove_wal.rs @@ -12,7 +12,7 @@ pub fn thread_main(conf: SafeKeeperConf) { let active_tlis = GlobalTimelines::get_active_timelines(); for zttid in &active_tlis { if let Ok(tli) = GlobalTimelines::get(&conf, *zttid, false) { - if let Err(e) = tli.remove_old_wal() { + if let Err(e) = tli.remove_old_wal(conf.s3_offload_enabled) { warn!( "failed to remove WAL for tenant {} timeline {}: {}", tli.zttid.tenant_id, tli.zttid.timeline_id, e diff --git a/safekeeper/src/safekeeper.rs b/safekeeper/src/safekeeper.rs index b9264565dc..fff1c269b6 100644 --- a/safekeeper/src/safekeeper.rs +++ b/safekeeper/src/safekeeper.rs @@ -930,13 +930,18 @@ where /// offloading. /// While it is safe to use inmem values for determining horizon, /// we use persistent to make possible normal states less surprising. 
- pub fn get_horizon_segno(&self) -> XLogSegNo { + pub fn get_horizon_segno(&self, s3_offload_enabled: bool) -> XLogSegNo { + let s3_offload_horizon = if s3_offload_enabled { + self.state.s3_wal_lsn + } else { + Lsn(u64::MAX) + }; let horizon_lsn = min( min( self.state.remote_consistent_lsn, self.state.peer_horizon_lsn, ), - self.state.s3_wal_lsn, + s3_offload_horizon, ); horizon_lsn.segment_number(self.state.server.wal_seg_size as usize) } diff --git a/safekeeper/src/timeline.rs b/safekeeper/src/timeline.rs index 140d6660ac..8b1072a54b 100644 --- a/safekeeper/src/timeline.rs +++ b/safekeeper/src/timeline.rs @@ -479,7 +479,7 @@ impl Timeline { shared_state.sk.wal_store.flush_lsn() } - pub fn remove_old_wal(&self) -> Result<()> { + pub fn remove_old_wal(&self, s3_offload_enabled: bool) -> Result<()> { let horizon_segno: XLogSegNo; let remover: Box Result<(), anyhow::Error>>; { @@ -488,7 +488,7 @@ impl Timeline { if shared_state.sk.state.server.wal_seg_size == 0 { return Ok(()); } - horizon_segno = shared_state.sk.get_horizon_segno(); + horizon_segno = shared_state.sk.get_horizon_segno(s3_offload_enabled); remover = shared_state.sk.wal_store.remove_up_to(); if horizon_segno <= 1 || horizon_segno <= shared_state.last_removed_segno { return Ok(()); From d710dff9756ca006ffb2bc7362f8137f5ca06f48 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Tue, 10 May 2022 16:28:00 +0300 Subject: [PATCH 220/296] Remove unnecessary Serialize/Deserialize traits from VecMap. It's never stored on disk. Let's be tidy. --- libs/utils/src/vec_map.rs | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/libs/utils/src/vec_map.rs b/libs/utils/src/vec_map.rs index 558721c724..9953b447c8 100644 --- a/libs/utils/src/vec_map.rs +++ b/libs/utils/src/vec_map.rs @@ -1,11 +1,9 @@ use std::{alloc::Layout, cmp::Ordering, ops::RangeBounds}; -use serde::{Deserialize, Serialize}; - /// Ordered map datastructure implemented in a Vec. 
/// Append only - can only add keys that are larger than the /// current max key. -#[derive(Clone, Debug, Serialize, Deserialize)] +#[derive(Clone, Debug)] pub struct VecMap(Vec<(K, V)>); impl Default for VecMap { From e6e883eb12503a3a013074c03f06d8a047f44c6c Mon Sep 17 00:00:00 2001 From: Konstantin Knizhnik Date: Wed, 11 May 2022 15:23:17 +0300 Subject: [PATCH 221/296] Do not set LSN for new FPI page (#1657) * Do not set LSN for new FPI page refer #1656 * Add page_is_new, page_get_lsn, page_set_lsn functions * Fix page_is_new implementation * Add comment from XLogReadBufferForRedoExtended --- libs/postgres_ffi/src/lib.rs | 19 +++++++++++++++++++ pageserver/src/walingest.rs | 11 +++++++++-- 2 files changed, 28 insertions(+), 2 deletions(-) diff --git a/libs/postgres_ffi/src/lib.rs b/libs/postgres_ffi/src/lib.rs index 923fbe4d5a..28d9a13dbf 100644 --- a/libs/postgres_ffi/src/lib.rs +++ b/libs/postgres_ffi/src/lib.rs @@ -8,6 +8,7 @@ #![allow(deref_nullptr)] use serde::{Deserialize, Serialize}; +use utils::lsn::Lsn; include!(concat!(env!("OUT_DIR"), "/bindings.rs")); @@ -37,3 +38,21 @@ pub const fn transaction_id_precedes(id1: TransactionId, id2: TransactionId) -> let diff = id1.wrapping_sub(id2) as i32; diff < 0 } + +// Check if page is not yet initialized (port of Postgres PageIsInit() macro) +pub fn page_is_new(pg: &[u8]) -> bool { + pg[14] == 0 && pg[15] == 0 // pg_upper == 0 +} + +// ExtractLSN from page header +pub fn page_get_lsn(pg: &[u8]) -> Lsn { + Lsn( + ((u32::from_le_bytes(pg[0..4].try_into().unwrap()) as u64) << 32) + | u32::from_le_bytes(pg[4..8].try_into().unwrap()) as u64, + ) +} + +pub fn page_set_lsn(pg: &mut [u8], lsn: Lsn) { + pg[0..4].copy_from_slice(&((lsn.0 >> 32) as u32).to_le_bytes()); + pg[4..8].copy_from_slice(&(lsn.0 as u32).to_le_bytes()); +} diff --git a/pageserver/src/walingest.rs b/pageserver/src/walingest.rs index fbdb328d2c..5223125ce6 100644 --- a/pageserver/src/walingest.rs +++ b/pageserver/src/walingest.rs @@ -24,6 +24,7 @@ use 
anyhow::Context; use postgres_ffi::nonrelfile_utils::clogpage_precedes; use postgres_ffi::nonrelfile_utils::slru_may_delete_clogsegment; +use postgres_ffi::{page_is_new, page_set_lsn}; use anyhow::Result; use bytes::{Buf, Bytes, BytesMut}; @@ -304,8 +305,14 @@ impl<'a, R: Repository> WalIngest<'a, R> { image.resize(image.len() + blk.hole_length as usize, 0u8); image.unsplit(tail); } - image[0..4].copy_from_slice(&((lsn.0 >> 32) as u32).to_le_bytes()); - image[4..8].copy_from_slice(&(lsn.0 as u32).to_le_bytes()); + // + // Match the logic of XLogReadBufferForRedoExtended: + // The page may be uninitialized. If so, we can't set the LSN because + // that would corrupt the page. + // + if !page_is_new(&image) { + page_set_lsn(&mut image, lsn) + } assert_eq!(image.len(), pg_constants::BLCKSZ as usize); self.put_rel_page_image(modification, rel, blk.blkno, image.freeze())?; } else { From 5bd879f6418903a62b47758441a90153f9979237 Mon Sep 17 00:00:00 2001 From: Stas Kelvich Date: Wed, 11 May 2022 15:20:48 +0300 Subject: [PATCH 222/296] Proxy: update protocol after cluster->project rename --- proxy/src/auth_backend/console.rs | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/proxy/src/auth_backend/console.rs b/proxy/src/auth_backend/console.rs index 55a0889af4..41a822701f 100644 --- a/proxy/src/auth_backend/console.rs +++ b/proxy/src/auth_backend/console.rs @@ -117,7 +117,7 @@ async fn get_auth_info( let mut url = reqwest::Url::parse(&format!("{auth_endpoint}/proxy_get_role_secret"))?; url.query_pairs_mut() - .append_pair("cluster", cluster) + .append_pair("project", cluster) .append_pair("role", user); // TODO: use a proper logger @@ -141,7 +141,7 @@ async fn wake_compute( cluster: &str, ) -> Result<(String, u16), ConsoleAuthError> { let mut url = reqwest::Url::parse(&format!("{auth_endpoint}/proxy_wake_compute"))?; - url.query_pairs_mut().append_pair("cluster", cluster); + url.query_pairs_mut().append_pair("project", cluster); // TODO: use a proper 
logger println!("cplane request: {}", url); From b338b5dffef46264e3d35887d9698432d2a7cc40 Mon Sep 17 00:00:00 2001 From: Arseny Sher Date: Wed, 11 May 2022 19:39:12 +0400 Subject: [PATCH 223/296] Make callmemaybe less aggressive until we fix it/migrate to bigger machines. --- safekeeper/src/lib.rs | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/safekeeper/src/lib.rs b/safekeeper/src/lib.rs index c848de9e71..03236d4e65 100644 --- a/safekeeper/src/lib.rs +++ b/safekeeper/src/lib.rs @@ -31,7 +31,7 @@ pub mod defaults { pub const DEFAULT_HTTP_LISTEN_PORT: u16 = 7676; pub const DEFAULT_HTTP_LISTEN_ADDR: &str = formatcp!("127.0.0.1:{DEFAULT_HTTP_LISTEN_PORT}"); - pub const DEFAULT_RECALL_PERIOD: Duration = Duration::from_secs(1); + pub const DEFAULT_RECALL_PERIOD: Duration = Duration::from_secs(10); } #[derive(Debug, Clone)] From 20361395bb038659e476fb1566eb8ddff92612c6 Mon Sep 17 00:00:00 2001 From: Anton Shyrabokau <97127717+antons-antons@users.noreply.github.com> Date: Wed, 11 May 2022 11:36:53 -0700 Subject: [PATCH 224/296] Add zenith-us-stage-sk-5 to circleci inventory (#1665) Co-authored-by: Debian --- .circleci/ansible/staging.hosts | 1 + 1 file changed, 1 insertion(+) diff --git a/.circleci/ansible/staging.hosts b/.circleci/ansible/staging.hosts index 3ea815b907..b2bacb89ca 100644 --- a/.circleci/ansible/staging.hosts +++ b/.circleci/ansible/staging.hosts @@ -6,6 +6,7 @@ zenith-us-stage-ps-2 console_region_id=27 zenith-us-stage-sk-1 console_region_id=27 zenith-us-stage-sk-2 console_region_id=27 zenith-us-stage-sk-4 console_region_id=27 +zenith-us-stage-sk-5 console_region_id=27 [storage:children] pageservers From c8640910353a8c226f516d70e337d2eb137dfc88 Mon Sep 17 00:00:00 2001 From: Dhammika Pathirana Date: Wed, 11 May 2022 16:13:26 -0700 Subject: [PATCH 225/296] Fix err msg typo Signed-off-by: Dhammika Pathirana --- pageserver/src/layered_repository.rs | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git
a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 01c2b961eb..6a614e184f 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -1512,7 +1512,7 @@ impl LayeredTimeline { .ensure_loaded() .with_context(|| { format!( - "Ancestor timeline is not is not loaded. Timeline id: {} Ancestor id {:?}", + "Ancestor timeline is not loaded. Timeline id: {} Ancestor id {:?}", self.timeline_id, self.get_ancestor_timeline_id(), ) From 2bde77fced256600295a0a1c09c6335aed679dac Mon Sep 17 00:00:00 2001 From: Konstantin Knizhnik Date: Thu, 12 May 2022 07:56:02 +0300 Subject: [PATCH 226/296] =?UTF-8?q?Do=20not=20apply=20records=20with=20LSN?= =?UTF-8?q?=20smaller=20than=20LSN=20of=20cached=20image=20in=20del?= =?UTF-8?q?=E2=80=A6=20(#1672)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Do not apply records with LSN smaller than LSN of cached image in delta layer * Do not apply records with LSN smaller than LSN of cached image in delta layer --- pageserver/src/layered_repository/delta_layer.rs | 3 +++ 1 file changed, 3 insertions(+) diff --git a/pageserver/src/layered_repository/delta_layer.rs b/pageserver/src/layered_repository/delta_layer.rs index e78b05695c..638df6f42a 100644 --- a/pageserver/src/layered_repository/delta_layer.rs +++ b/pageserver/src/layered_repository/delta_layer.rs @@ -254,6 +254,9 @@ impl Layer for DeltaLayer { return false; } let entry_lsn = DeltaKey::extract_lsn_from_buf(key); + if entry_lsn < lsn_range.start { + return false; + } offsets.push((entry_lsn, blob_ref.pos())); !blob_ref.will_init() From 5da4f3a4df88ac2b28565eea1604bbc8272a845e Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Thu, 12 May 2022 10:31:04 +0300 Subject: [PATCH 227/296] Refactor DeltaLayer::dump() function Put most of the code in a closure that returns Result, so that we can use the ?-operator for error handling. That's simpler. 
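
A minimal sketch of the pattern, independent of the pageserver code: the fallible steps live in a closure that returns `Result`, so each step can use `?`, and the caller turns a failure into a single error string. The `describe_all` function and its string inputs are made up for illustration; the real `dump()` reads and deserializes blobs instead.

```rust
use std::num::ParseIntError;

/// Describe every entry, rendering a failure inline instead of aborting the dump.
fn describe_all(entries: &[&str]) -> Vec<String> {
    // All fallible steps are inside the closure, so `?` replaces nested matches.
    let describe = |raw: &str| -> Result<String, ParseIntError> {
        let value: u32 = raw.trim().parse()?;
        Ok(format!("value {} ({} bytes of input)", value, raw.len()))
    };

    entries
        .iter()
        .map(|&raw| match describe(raw) {
            Ok(desc) => desc,
            // One error arm covers every step that can fail inside the closure.
            Err(err) => format!("ERROR: {}", err),
        })
        .collect()
}
```

The trade-off is that the caller no longer distinguishes which step failed; the error's own message has to carry that information.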
--- .../src/layered_repository/delta_layer.rs | 59 +++++++++---------- 1 file changed, 27 insertions(+), 32 deletions(-) diff --git a/pageserver/src/layered_repository/delta_layer.rs b/pageserver/src/layered_repository/delta_layer.rs index 638df6f42a..1c48f3def5 100644 --- a/pageserver/src/layered_repository/delta_layer.rs +++ b/pageserver/src/layered_repository/delta_layer.rs @@ -38,10 +38,6 @@ use crate::walrecord; use crate::{DELTA_FILE_MAGIC, STORAGE_FORMAT_VERSION}; use anyhow::{bail, ensure, Context, Result}; use serde::{Deserialize, Serialize}; -use tracing::*; -// avoid binding to Write (conflicts with std::io::Write) -// while being able to use std::fmt::Write's methods -use std::fmt::Write as _; use std::fs; use std::io::{BufWriter, Write}; use std::io::{Seek, SeekFrom}; @@ -49,6 +45,7 @@ use std::ops::Range; use std::os::unix::fs::FileExt; use std::path::{Path, PathBuf}; use std::sync::{RwLock, RwLockReadGuard, RwLockWriteGuard}; +use tracing::*; use utils::{ bin_ser::BeSer, @@ -365,6 +362,28 @@ impl Layer for DeltaLayer { tree_reader.dump()?; let mut cursor = file.block_cursor(); + + // A subroutine to dump a single blob + let mut dump_blob = |blob_ref: BlobRef| -> anyhow::Result { + let buf = cursor.read_blob(blob_ref.pos())?; + let val = Value::des(&buf)?; + let desc = match val { + Value::Image(img) => { + format!(" img {} bytes", img.len()) + } + Value::WalRecord(rec) => { + let wal_desc = walrecord::describe_wal_record(&rec)?; + format!( + " rec {} bytes will_init: {} {}", + buf.len(), + rec.will_init(), + wal_desc + ) + } + }; + Ok(desc) + }; + tree_reader.visit( &[0u8; DELTA_KEY_SIZE], VisitDirection::Forwards, @@ -373,34 +392,10 @@ impl Layer for DeltaLayer { let key = DeltaKey::extract_key_from_buf(delta_key); let lsn = DeltaKey::extract_lsn_from_buf(delta_key); - let mut desc = String::new(); - match cursor.read_blob(blob_ref.pos()) { - Ok(buf) => { - let val = Value::des(&buf); - match val { - Ok(Value::Image(img)) => { - write!(&mut desc, " 
img {} bytes", img.len()).unwrap(); - } - Ok(Value::WalRecord(rec)) => { - let wal_desc = walrecord::describe_wal_record(&rec).unwrap(); - write!( - &mut desc, - " rec {} bytes will_init: {} {}", - buf.len(), - rec.will_init(), - wal_desc - ) - .unwrap(); - } - Err(err) => { - write!(&mut desc, " DESERIALIZATION ERROR: {}", err).unwrap(); - } - } - } - Err(err) => { - write!(&mut desc, " READ ERROR: {}", err).unwrap(); - } - } + let desc = match dump_blob(blob_ref) { + Ok(desc) => desc, + Err(err) => format!("ERROR: {}", err), + }; println!(" key {} at {}: {}", key, lsn, desc); true }, From b426775aa0dc3caa5287a91593c976f45fed0314 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Thu, 12 May 2022 12:07:09 +0300 Subject: [PATCH 228/296] Use compute-tools from the new neondatabase Docker Hub repo --- vendor/postgres | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vendor/postgres b/vendor/postgres index 9a9459a7f9..0ea7598329 160000 --- a/vendor/postgres +++ b/vendor/postgres @@ -1 +1 @@ -Subproject commit 9a9459a7f9cbcaa0e35ff1f2f34c419238fdec7e +Subproject commit 0ea7598329a83b818293137cc18bf7d42bf2fe68 From b10ae195b78835ba895d90ccc1573a0a018d8a28 Mon Sep 17 00:00:00 2001 From: Stas Kelvich Date: Thu, 12 May 2022 12:40:55 +0300 Subject: [PATCH 229/296] Set vendor/postgres back to the main branch I accidentally merged postgres PR that was referencing non-main branch. 
--- vendor/postgres | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vendor/postgres b/vendor/postgres index 0ea7598329..d62ec22eff 160000 --- a/vendor/postgres +++ b/vendor/postgres @@ -1 +1 @@ -Subproject commit 0ea7598329a83b818293137cc18bf7d42bf2fe68 +Subproject commit d62ec22effeca7b5794ab2c15a3fd9ee5a4a5b99 From 4538f1e1b839556aab12e5aa7d1c38646253ec97 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Thu, 12 May 2022 14:18:35 +0300 Subject: [PATCH 230/296] Correctly operate etcd safekeeper timeline data --- libs/etcd_broker/src/lib.rs | 21 +++++++++++++------ safekeeper/src/broker.rs | 2 +- safekeeper/src/timeline.rs | 41 ++----------------------------------- 3 files changed, 18 insertions(+), 46 deletions(-) diff --git a/libs/etcd_broker/src/lib.rs b/libs/etcd_broker/src/lib.rs index 01cc0cf162..1b27f99ccf 100644 --- a/libs/etcd_broker/src/lib.rs +++ b/libs/etcd_broker/src/lib.rs @@ -51,7 +51,7 @@ pub struct SkTimelineInfo { #[serde(default)] pub peer_horizon_lsn: Option, #[serde(default)] - pub wal_stream_connection_string: Option, + pub safekeeper_connection_string: Option, } #[derive(Debug, thiserror::Error)] @@ -217,16 +217,22 @@ pub async fn subscribe_to_safekeeper_timeline_updates( break; } - let mut timeline_updates: HashMap> = - HashMap::new(); + let mut timeline_updates: HashMap> = HashMap::new(); + // Keep track that the timeline data updates from etcd arrive in the right order. + // https://etcd.io/docs/v3.5/learning/api_guarantees/#isolation-level-and-consistency-of-replicas + // > etcd does not ensure linearizability for watch operations. Users are expected to verify the revision of watch responses to ensure correct ordering. 
+ let mut timeline_etcd_versions: HashMap = HashMap::new(); + let events = resp.events(); debug!("Processing {} events", events.len()); for event in events { if EventType::Put == event.event_type() { - if let Some(kv) = event.kv() { - match parse_etcd_key_value(subscription_kind, &regex, kv) { + if let Some(new_etcd_kv) = event.kv() { + let new_kv_version = new_etcd_kv.version(); + + match parse_etcd_key_value(subscription_kind, &regex, new_etcd_kv) { Ok(Some((zttid, timeline))) => { match timeline_updates .entry(zttid) @@ -234,12 +240,15 @@ pub async fn subscribe_to_safekeeper_timeline_updates( .entry(timeline.safekeeper_id) { hash_map::Entry::Occupied(mut o) => { - if o.get().flush_lsn < timeline.info.flush_lsn { + let old_etcd_kv_version = timeline_etcd_versions.get(&zttid).copied().unwrap_or(i64::MIN); + if old_etcd_kv_version < new_kv_version { o.insert(timeline.info); + timeline_etcd_versions.insert(zttid,new_kv_version); } } hash_map::Entry::Vacant(v) => { v.insert(timeline.info); + timeline_etcd_versions.insert(zttid,new_kv_version); } } } diff --git a/safekeeper/src/broker.rs b/safekeeper/src/broker.rs index c9ae1a8d98..d9c60c9db0 100644 --- a/safekeeper/src/broker.rs +++ b/safekeeper/src/broker.rs @@ -60,7 +60,7 @@ async fn push_loop(conf: SafeKeeperConf) -> anyhow::Result<()> { // lock is held.
for zttid in GlobalTimelines::get_active_timelines() { if let Ok(tli) = GlobalTimelines::get(&conf, zttid, false) { - let sk_info = tli.get_public_info()?; + let sk_info = tli.get_public_info(&conf)?; let put_opts = PutOptions::new().with_lease(lease.id()); client .put( diff --git a/safekeeper/src/timeline.rs b/safekeeper/src/timeline.rs index 8b1072a54b..a12f628e06 100644 --- a/safekeeper/src/timeline.rs +++ b/safekeeper/src/timeline.rs @@ -89,7 +89,6 @@ struct SharedState { active: bool, num_computes: u32, pageserver_connstr: Option, - listen_pg_addr: String, last_removed_segno: XLogSegNo, } @@ -112,7 +111,6 @@ impl SharedState { active: false, num_computes: 0, pageserver_connstr: None, - listen_pg_addr: conf.listen_pg_addr.clone(), last_removed_segno: 0, }) } @@ -132,7 +130,6 @@ impl SharedState { active: false, num_computes: 0, pageserver_connstr: None, - listen_pg_addr: conf.listen_pg_addr.clone(), last_removed_segno: 0, }) } @@ -421,7 +418,7 @@ impl Timeline { } /// Prepare public safekeeper info for reporting. 
- pub fn get_public_info(&self) -> anyhow::Result { + pub fn get_public_info(&self, conf: &SafeKeeperConf) -> anyhow::Result { let shared_state = self.mutex.lock().unwrap(); Ok(SkTimelineInfo { last_log_term: Some(shared_state.sk.get_epoch()), @@ -435,18 +432,7 @@ impl Timeline { shared_state.sk.inmem.remote_consistent_lsn, )), peer_horizon_lsn: Some(shared_state.sk.inmem.peer_horizon_lsn), - wal_stream_connection_string: shared_state - .pageserver_connstr - .as_deref() - .map(|pageserver_connstr| { - wal_stream_connection_string( - self.zttid, - &shared_state.listen_pg_addr, - pageserver_connstr, - ) - }) - .transpose() - .context("Failed to get the pageserver callmemaybe connstr")?, + safekeeper_connection_string: Some(conf.listen_pg_addr.clone()), }) } @@ -504,29 +490,6 @@ impl Timeline { } } -// pageserver connstr is needed to be able to distinguish between different pageservers -// it is required to correctly manage callmemaybe subscriptions when more than one pageserver is involved -// TODO it is better to use some sort of a unique id instead of connection string, see https://github.com/zenithdb/zenith/issues/1105 -fn wal_stream_connection_string( - ZTenantTimelineId { - tenant_id, - timeline_id, - }: ZTenantTimelineId, - listen_pg_addr_str: &str, - pageserver_connstr: &str, -) -> anyhow::Result { - let me_connstr = format!("postgresql://no_user@{}/no_db", listen_pg_addr_str); - let me_conf = me_connstr - .parse::() - .with_context(|| { - format!("Failed to parse pageserver connection string '{me_connstr}' as a postgres one") - })?; - let (host, port) = utils::connstring::connection_host_port(&me_conf); - Ok(format!( - "host={host} port={port} options='-c ztimelineid={timeline_id} ztenantid={tenant_id} pageserver_connstr={pageserver_connstr}'", - )) -} - // Utilities needed by various Connection-like objects pub trait TimelineTools { fn set(&mut self, conf: &SafeKeeperConf, zttid: ZTenantTimelineId, create: bool) -> Result<()>; From 
ec8861b8cc54f61d509925b67babc1af765c37ef Mon Sep 17 00:00:00 2001 From: Arthur Petukhovsky Date: Thu, 12 May 2022 19:53:07 +0300 Subject: [PATCH 231/296] Fix pageserver metrics names (#1682) Try to follow Prometheus style-guide https://prometheus.io/docs/practices/naming/ for metrics names. More specifically: - Use `pageserver_` prefix for all pageserver metrics - Specify `_seconds` unit in time metrics - Use unit as a suffix in other cases, such as `_hits`, `_bytes`, `_records` - Use `_total` suffix for accumulating counters (note that Histograms append that suffix internally) --- pageserver/src/layered_repository.rs | 14 +++++++------- pageserver/src/lib.rs | 2 +- pageserver/src/page_service.rs | 2 +- pageserver/src/storage_sync.rs | 4 ++-- pageserver/src/virtual_file.rs | 6 +++--- pageserver/src/walredo.rs | 8 ++++---- test_runner/fixtures/compare_fixtures.py | 4 ++-- 7 files changed, 20 insertions(+), 20 deletions(-) diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 6a614e184f..b02ab00a21 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -89,7 +89,7 @@ pub use crate::layered_repository::ephemeral_file::writeback as writeback_epheme // Metrics collected on operations on the storage repository. lazy_static! { static ref STORAGE_TIME: HistogramVec = register_histogram_vec!( - "pageserver_storage_time", + "pageserver_storage_operations_seconds", "Time spent on storage operations", &["operation", "tenant_id", "timeline_id"] ) @@ -99,8 +99,8 @@ lazy_static! { // Metrics collected on operations on the storage repository. lazy_static! { static ref RECONSTRUCT_TIME: HistogramVec = register_histogram_vec!( - "pageserver_getpage_reconstruct_time", - "Time spent on storage operations", + "pageserver_getpage_reconstruct_seconds", + "Time spent in reconstruct_value", &["tenant_id", "timeline_id"] ) .expect("failed to define a metric"); @@ -108,13 +108,13 @@ lazy_static! { lazy_static!
{ static ref MATERIALIZED_PAGE_CACHE_HIT: IntCounterVec = register_int_counter_vec!( - "materialize_page_cache_hits", + "pageserver_materialized_cache_hits_total", "Number of cache hits from materialized page cache", &["tenant_id", "timeline_id"] ) .expect("failed to define a metric"); static ref WAIT_LSN_TIME: HistogramVec = register_histogram_vec!( - "wait_lsn_time", + "pageserver_wait_lsn_seconds", "Time spent waiting for WAL to arrive", &["tenant_id", "timeline_id"] ) @@ -134,12 +134,12 @@ lazy_static! { // or in testing they estimate how much we would upload if we did. lazy_static! { static ref NUM_PERSISTENT_FILES_CREATED: IntCounter = register_int_counter!( - "pageserver_num_persistent_files_created", + "pageserver_created_persistent_files_total", "Number of files created that are meant to be uploaded to cloud storage", ) .expect("failed to define a metric"); static ref PERSISTENT_BYTES_WRITTEN: IntCounter = register_int_counter!( - "pageserver_persistent_bytes_written", + "pageserver_written_persistent_bytes_total", "Total bytes written that are meant to be uploaded to cloud storage", ) .expect("failed to define a metric"); diff --git a/pageserver/src/lib.rs b/pageserver/src/lib.rs index 83985069ec..fdce0e5c5f 100644 --- a/pageserver/src/lib.rs +++ b/pageserver/src/lib.rs @@ -45,7 +45,7 @@ pub const DELTA_FILE_MAGIC: u16 = 0x5A61; lazy_static! { static ref LIVE_CONNECTIONS_COUNT: IntGaugeVec = register_int_gauge_vec!( - "pageserver_live_connections_count", + "pageserver_live_connections", "Number of live network connections", &["pageserver_connection_kind"] ) diff --git a/pageserver/src/page_service.rs b/pageserver/src/page_service.rs index da3dedfc84..88273cfa57 100644 --- a/pageserver/src/page_service.rs +++ b/pageserver/src/page_service.rs @@ -326,7 +326,7 @@ const TIME_BUCKETS: &[f64] = &[ lazy_static! 
{ static ref SMGR_QUERY_TIME: HistogramVec = register_histogram_vec!( - "pageserver_smgr_query_time", + "pageserver_smgr_query_seconds", "Time spent on smgr query handling", &["smgr_query_type", "tenant_id", "timeline_id"], TIME_BUCKETS.into() diff --git a/pageserver/src/storage_sync.rs b/pageserver/src/storage_sync.rs index b8c6f7fdab..7755e67c8d 100644 --- a/pageserver/src/storage_sync.rs +++ b/pageserver/src/storage_sync.rs @@ -208,12 +208,12 @@ lazy_static! { ) .expect("failed to register pageserver remote storage remaining sync items int gauge"); static ref FATAL_TASK_FAILURES: IntCounter = register_int_counter!( - "pageserver_remote_storage_fatal_task_failures", + "pageserver_remote_storage_fatal_task_failures_total", "Number of critically failed tasks" ) .expect("failed to register pageserver remote storage remaining sync items int gauge"); static ref IMAGE_SYNC_TIME: HistogramVec = register_histogram_vec!( - "pageserver_remote_storage_image_sync_time", + "pageserver_remote_storage_image_sync_seconds", "Time took to synchronize (download or upload) a whole pageserver image. \ Grouped by `operation_kind` (upload|download) and `status` (success|failure)", &["operation_kind", "status"], diff --git a/pageserver/src/virtual_file.rs b/pageserver/src/virtual_file.rs index 4ce245a74f..37d70372b5 100644 --- a/pageserver/src/virtual_file.rs +++ b/pageserver/src/virtual_file.rs @@ -34,7 +34,7 @@ const STORAGE_IO_TIME_BUCKETS: &[f64] = &[ lazy_static! { static ref STORAGE_IO_TIME: HistogramVec = register_histogram_vec!( - "pageserver_io_time", + "pageserver_io_operations_seconds", "Time spent in IO operations", &["operation", "tenant_id", "timeline_id"], STORAGE_IO_TIME_BUCKETS.into() @@ -43,8 +43,8 @@ lazy_static! { } lazy_static! 
{ static ref STORAGE_IO_SIZE: IntGaugeVec = register_int_gauge_vec!( - "pageserver_io_size", - "Amount of bytes", + "pageserver_io_operations_bytes_total", + "Total amount of bytes read/written in IO operations", &["operation", "tenant_id", "timeline_id"] ) .expect("failed to define a metric"); diff --git a/pageserver/src/walredo.rs b/pageserver/src/walredo.rs index 777718b311..e556c24548 100644 --- a/pageserver/src/walredo.rs +++ b/pageserver/src/walredo.rs @@ -106,16 +106,16 @@ impl crate::walredo::WalRedoManager for DummyRedoManager { // each tenant. lazy_static! { static ref WAL_REDO_TIME: Histogram = - register_histogram!("pageserver_wal_redo_time", "Time spent on WAL redo") + register_histogram!("pageserver_wal_redo_seconds", "Time spent on WAL redo") .expect("failed to define a metric"); static ref WAL_REDO_WAIT_TIME: Histogram = register_histogram!( - "pageserver_wal_redo_wait_time", + "pageserver_wal_redo_wait_seconds", "Time spent waiting for access to the WAL redo process" ) .expect("failed to define a metric"); static ref WAL_REDO_RECORD_COUNTER: IntCounter = register_int_counter!( - "pageserver_wal_records_replayed", - "Number of WAL records replayed" + "pageserver_replayed_wal_records_total", + "Number of WAL records replayed in WAL redo process" ) .unwrap(); } diff --git a/test_runner/fixtures/compare_fixtures.py b/test_runner/fixtures/compare_fixtures.py index d70f57aa52..d572901ed1 100644 --- a/test_runner/fixtures/compare_fixtures.py +++ b/test_runner/fixtures/compare_fixtures.py @@ -106,9 +106,9 @@ class ZenithCompare(PgCompare): report=MetricReport.LOWER_IS_BETTER) total_files = self.zenbenchmark.get_int_counter_value( - self.env.pageserver, "pageserver_num_persistent_files_created") + self.env.pageserver, "pageserver_created_persistent_files_total") total_bytes = self.zenbenchmark.get_int_counter_value( - self.env.pageserver, "pageserver_persistent_bytes_written") + self.env.pageserver, "pageserver_written_persistent_bytes_total") 
self.zenbenchmark.record("data_uploaded", total_bytes / (1024 * 1024), "MB", From 5812e26b906d8007aed1f3d407e52d0e126c6d18 Mon Sep 17 00:00:00 2001 From: Thang Pham Date: Thu, 12 May 2022 16:33:09 -0400 Subject: [PATCH 232/296] Create an initial timeline on CLI tenant creation (#1689) Resolves #1655 --- neon_local/src/main.rs | 23 +++++++++++++++++++ .../batch_others/test_ancestor_branch.py | 1 - test_runner/batch_others/test_zenith_cli.py | 12 +++++++++- 3 files changed, 34 insertions(+), 2 deletions(-) diff --git a/neon_local/src/main.rs b/neon_local/src/main.rs index 8b54054080..75944fe107 100644 --- a/neon_local/src/main.rs +++ b/neon_local/src/main.rs @@ -540,6 +540,29 @@ fn handle_tenant(tenant_match: &ArgMatches, env: &mut local_env::LocalEnv) -> an "tenant {} successfully created on the pageserver", new_tenant_id ); + + // Create an initial timeline for the new tenant + let new_timeline_id = parse_timeline_id(create_match)?; + let timeline = pageserver + .timeline_create(new_tenant_id, new_timeline_id, None, None)? + .context(format!( + "Failed to create initial timeline for tenant {new_tenant_id}" + ))?; + let new_timeline_id = timeline.timeline_id; + let last_record_lsn = timeline + .local + .context(format!("Failed to get last record LSN: no local timeline info for timeline {new_timeline_id}"))? 
+ .last_record_lsn; + + env.register_branch_mapping( + DEFAULT_BRANCH_NAME.to_string(), + new_tenant_id, + new_timeline_id, + )?; + + println!( + "Created an initial timeline '{new_timeline_id}' at Lsn {last_record_lsn} for tenant: {new_tenant_id}", + ); } Some(("config", create_match)) => { let tenant_id = get_tenant_id(create_match, env)?; diff --git a/test_runner/batch_others/test_ancestor_branch.py b/test_runner/batch_others/test_ancestor_branch.py index d6b073492d..982921084f 100644 --- a/test_runner/batch_others/test_ancestor_branch.py +++ b/test_runner/batch_others/test_ancestor_branch.py @@ -35,7 +35,6 @@ def test_ancestor_branch(zenith_env_builder: ZenithEnvBuilder): with psconn.cursor(cursor_factory=psycopg2.extras.DictCursor) as pscur: pscur.execute("failpoints flush-frozen=sleep(10000)") - env.zenith_cli.create_timeline(f'main', tenant_id=tenant) pg_branch0 = env.postgres.create_start('main', tenant_id=tenant) branch0_cur = pg_branch0.connect().cursor() branch0_cur.execute("SHOW zenith.zenith_timeline") diff --git a/test_runner/batch_others/test_zenith_cli.py b/test_runner/batch_others/test_zenith_cli.py index 091d9ac8ba..81567dba12 100644 --- a/test_runner/batch_others/test_zenith_cli.py +++ b/test_runner/batch_others/test_zenith_cli.py @@ -1,7 +1,7 @@ import uuid import requests -from fixtures.zenith_fixtures import ZenithEnv, ZenithEnvBuilder, ZenithPageserverHttpClient +from fixtures.zenith_fixtures import DEFAULT_BRANCH_NAME, ZenithEnv, ZenithEnvBuilder, ZenithPageserverHttpClient from typing import cast @@ -83,6 +83,16 @@ def test_cli_tenant_list(zenith_simple_env: ZenithEnv): assert tenant2.hex in tenants +def test_cli_tenant_create(zenith_simple_env: ZenithEnv): + env = zenith_simple_env + tenant_id = env.zenith_cli.create_tenant() + timelines = env.zenith_cli.list_timelines(tenant_id) + + # an initial timeline should be created upon tenant creation + assert len(timelines) == 1 + assert timelines[0][0] == DEFAULT_BRANCH_NAME + + def 
test_cli_ipv4_listeners(zenith_env_builder: ZenithEnvBuilder): # Start with single sk zenith_env_builder.num_safekeepers = 1 From ae20751724779986632a6cbc316b50c7568ff2d5 Mon Sep 17 00:00:00 2001 From: Thang Pham Date: Thu, 12 May 2022 17:27:08 -0400 Subject: [PATCH 233/296] update `ZenithCli::create_tenant` return signature (#1692) to include the initial timeline's ID in addition to the new tenant's ID. Context: follow-up of https://github.com/neondatabase/neon/pull/1689 --- .../batch_others/test_ancestor_branch.py | 2 +- test_runner/batch_others/test_tenant_conf.py | 2 +- .../batch_others/test_tenant_relocation.py | 2 +- test_runner/batch_others/test_tenants.py | 4 ++-- test_runner/batch_others/test_zenith_cli.py | 6 +++--- test_runner/fixtures/zenith_fixtures.py | 17 +++++++++++------ .../performance/test_bulk_tenant_create.py | 2 +- 7 files changed, 20 insertions(+), 15 deletions(-) diff --git a/test_runner/batch_others/test_ancestor_branch.py b/test_runner/batch_others/test_ancestor_branch.py index 982921084f..c07b9d6dd1 100644 --- a/test_runner/batch_others/test_ancestor_branch.py +++ b/test_runner/batch_others/test_ancestor_branch.py @@ -21,7 +21,7 @@ def test_ancestor_branch(zenith_env_builder: ZenithEnvBuilder): # Override defaults, 1M gc_horizon and 4M checkpoint_distance. # Extend compaction_period and gc_period to disable background compaction and gc. 
- tenant = env.zenith_cli.create_tenant( + tenant, _ = env.zenith_cli.create_tenant( conf={ 'gc_period': '10 m', 'gc_horizon': '1048576', diff --git a/test_runner/batch_others/test_tenant_conf.py b/test_runner/batch_others/test_tenant_conf.py index b85a541f10..d627d8a6ee 100644 --- a/test_runner/batch_others/test_tenant_conf.py +++ b/test_runner/batch_others/test_tenant_conf.py @@ -16,7 +16,7 @@ tenant_config={checkpoint_distance = 10000, compaction_target_size = 1048576}''' env = zenith_env_builder.init_start() """Test per tenant configuration""" - tenant = env.zenith_cli.create_tenant(conf={ + tenant, _ = env.zenith_cli.create_tenant(conf={ 'checkpoint_distance': '20000', 'gc_period': '30sec', }) diff --git a/test_runner/batch_others/test_tenant_relocation.py b/test_runner/batch_others/test_tenant_relocation.py index 7e71c0a157..20694a240c 100644 --- a/test_runner/batch_others/test_tenant_relocation.py +++ b/test_runner/batch_others/test_tenant_relocation.py @@ -107,7 +107,7 @@ def test_tenant_relocation(zenith_env_builder: ZenithEnvBuilder, # create folder for remote storage mock remote_storage_mock_path = env.repo_dir / 'local_fs_remote_storage' - tenant = env.zenith_cli.create_tenant(UUID("74ee8b079a0e437eb0afea7d26a07209")) + tenant, _ = env.zenith_cli.create_tenant(UUID("74ee8b079a0e437eb0afea7d26a07209")) log.info("tenant to relocate %s", tenant) # attach does not download ancestor branches (should it?), just use root branch for now diff --git a/test_runner/batch_others/test_tenants.py b/test_runner/batch_others/test_tenants.py index 682af8de49..1b593cfee3 100644 --- a/test_runner/batch_others/test_tenants.py +++ b/test_runner/batch_others/test_tenants.py @@ -12,8 +12,8 @@ def test_tenants_normal_work(zenith_env_builder: ZenithEnvBuilder, with_safekeep env = zenith_env_builder.init_start() """Tests tenants with and without wal acceptors""" - tenant_1 = env.zenith_cli.create_tenant() - tenant_2 = env.zenith_cli.create_tenant() + tenant_1, _ = 
env.zenith_cli.create_tenant() + tenant_2, _ = env.zenith_cli.create_tenant() env.zenith_cli.create_timeline(f'test_tenants_normal_work_with_safekeepers{with_safekeepers}', tenant_id=tenant_1) diff --git a/test_runner/batch_others/test_zenith_cli.py b/test_runner/batch_others/test_zenith_cli.py index 81567dba12..bff17fa679 100644 --- a/test_runner/batch_others/test_zenith_cli.py +++ b/test_runner/batch_others/test_zenith_cli.py @@ -64,13 +64,13 @@ def test_cli_tenant_list(zenith_simple_env: ZenithEnv): helper_compare_tenant_list(pageserver_http_client, env) # Create new tenant - tenant1 = env.zenith_cli.create_tenant() + tenant1, _ = env.zenith_cli.create_tenant() # check tenant1 appeared helper_compare_tenant_list(pageserver_http_client, env) # Create new tenant - tenant2 = env.zenith_cli.create_tenant() + tenant2, _ = env.zenith_cli.create_tenant() # check tenant2 appeared helper_compare_tenant_list(pageserver_http_client, env) @@ -85,7 +85,7 @@ def test_cli_tenant_list(zenith_simple_env: ZenithEnv): def test_cli_tenant_create(zenith_simple_env: ZenithEnv): env = zenith_simple_env - tenant_id = env.zenith_cli.create_tenant() + tenant_id, _ = env.zenith_cli.create_tenant() timelines = env.zenith_cli.list_timelines(tenant_id) # an initial timeline should be created upon tenant creation diff --git a/test_runner/fixtures/zenith_fixtures.py b/test_runner/fixtures/zenith_fixtures.py index 3bb7c606d3..fe20f1abbf 100644 --- a/test_runner/fixtures/zenith_fixtures.py +++ b/test_runner/fixtures/zenith_fixtures.py @@ -831,20 +831,25 @@ class ZenithCli: def create_tenant(self, tenant_id: Optional[uuid.UUID] = None, - conf: Optional[Dict[str, str]] = None) -> uuid.UUID: + timeline_id: Optional[uuid.UUID] = None, + conf: Optional[Dict[str, str]] = None) -> Tuple[uuid.UUID, uuid.UUID]: """ Creates a new tenant, returns its id and its initial timeline's id. 
""" if tenant_id is None: tenant_id = uuid.uuid4() + if timeline_id is None: + timeline_id = uuid.uuid4() if conf is None: - res = self.raw_cli(['tenant', 'create', '--tenant-id', tenant_id.hex]) + res = self.raw_cli([ + 'tenant', 'create', '--tenant-id', tenant_id.hex, '--timeline-id', timeline_id.hex + ]) else: - res = self.raw_cli( - ['tenant', 'create', '--tenant-id', tenant_id.hex] + - sum(list(map(lambda kv: (['-c', kv[0] + ':' + kv[1]]), conf.items())), [])) + res = self.raw_cli([ + 'tenant', 'create', '--tenant-id', tenant_id.hex, '--timeline-id', timeline_id.hex + ] + sum(list(map(lambda kv: (['-c', kv[0] + ':' + kv[1]]), conf.items())), [])) res.check_returncode() - return tenant_id + return tenant_id, timeline_id def config_tenant(self, tenant_id: uuid.UUID, conf: Dict[str, str]): """ diff --git a/test_runner/performance/test_bulk_tenant_create.py b/test_runner/performance/test_bulk_tenant_create.py index f0729d3a07..0e16d3e749 100644 --- a/test_runner/performance/test_bulk_tenant_create.py +++ b/test_runner/performance/test_bulk_tenant_create.py @@ -30,7 +30,7 @@ def test_bulk_tenant_create( for i in range(tenants_count): start = timeit.default_timer() - tenant = env.zenith_cli.create_tenant() + tenant, _ = env.zenith_cli.create_tenant() env.zenith_cli.create_timeline( f'test_bulk_tenant_create_{tenants_count}_{i}_{use_safekeepers}', tenant_id=tenant) From 85884a1599895a9875c7f0139854aa7dae21148e Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Fri, 13 May 2022 00:42:13 +0300 Subject: [PATCH 234/296] Disable tenant relocation python test --- test_runner/batch_others/test_tenant_relocation.py | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/test_runner/batch_others/test_tenant_relocation.py b/test_runner/batch_others/test_tenant_relocation.py index 20694a240c..279b3a0a25 100644 --- a/test_runner/batch_others/test_tenant_relocation.py +++ b/test_runner/batch_others/test_tenant_relocation.py @@ -95,6 +95,10 @@ def load(pg: Postgres, stop_event: 
threading.Event, load_ok_event: threading.Eve log.info('load thread stopped') +@pytest.mark.skip( + reason= + "needs to replace callmemaybe call with better idea how to migrate timelines between pageservers" +) @pytest.mark.parametrize('with_load', ['with_load', 'without_load']) def test_tenant_relocation(zenith_env_builder: ZenithEnvBuilder, port_distributor: PortDistributor, From 0030da57a8c6deb9795d8d9789b9996a976ad9c9 Mon Sep 17 00:00:00 2001 From: Stas Kelvich Date: Fri, 13 May 2022 02:24:08 +0300 Subject: [PATCH 235/296] compute-tools: grant rw privileges to all created users --- compute_tools/src/spec.rs | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/compute_tools/src/spec.rs b/compute_tools/src/spec.rs index 27114b8202..334e0a9e05 100644 --- a/compute_tools/src/spec.rs +++ b/compute_tools/src/spec.rs @@ -136,13 +136,20 @@ pub fn handle_roles(spec: &ClusterSpec, client: &mut Client) -> Result<()> { xact.execute(query.as_str(), &[])?; } } else { - info!("role name {}", &name); + info!("role name: '{}'", &name); let mut query: String = format!("CREATE ROLE {} ", name.quote()); - info!("role create query {}", &query); + info!("role create query: '{}'", &query); info_print!(" -> create"); query.push_str(&role.to_pg_options()); xact.execute(query.as_str(), &[])?; + + let grant_query = format!( + "grant pg_read_all_data, pg_write_all_data to {}", + name.quote() + ); + xact.execute(grant_query.as_str(), &[])?; + info!("role grant query: '{}'", &grant_query); } info_print!("\n"); From 51c0f9ab2b394a31358cfd187c7fdeb34372553e Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Fri, 13 May 2022 00:56:15 +0300 Subject: [PATCH 236/296] Force git version to be up to date via decl macro --- Cargo.lock | 4 ++++ libs/utils/build.rs | 3 --- libs/utils/src/lib.rs | 20 ++++++++++++++------ neon_local/Cargo.toml | 1 + neon_local/src/main.rs | 3 ++- pageserver/Cargo.toml | 1 + pageserver/src/bin/dump_layerfile.rs | 4 +++- 
pageserver/src/bin/pageserver.rs | 9 +++++---- pageserver/src/bin/update_metadata.rs | 4 +++- proxy/Cargo.toml | 1 + proxy/src/main.rs | 6 ++++-- safekeeper/Cargo.toml | 1 + safekeeper/src/bin/safekeeper.rs | 6 ++++-- 13 files changed, 43 insertions(+), 20 deletions(-) delete mode 100644 libs/utils/build.rs diff --git a/Cargo.lock b/Cargo.lock index 148517a777..e1e1a0f067 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -1582,6 +1582,7 @@ dependencies = [ "clap 3.0.14", "comfy-table", "control_plane", + "git-version", "pageserver", "postgres", "postgres_ffi", @@ -1773,6 +1774,7 @@ dependencies = [ "daemonize", "fail", "futures", + "git-version", "hex", "hex-literal", "humantime", @@ -2164,6 +2166,7 @@ dependencies = [ "bytes", "clap 3.0.14", "futures", + "git-version", "hashbrown", "hex", "hmac 0.12.1", @@ -2616,6 +2619,7 @@ dependencies = [ "daemonize", "etcd_broker", "fs2", + "git-version", "hex", "humantime", "hyper", diff --git a/libs/utils/build.rs b/libs/utils/build.rs deleted file mode 100644 index ee3346ae66..0000000000 --- a/libs/utils/build.rs +++ /dev/null @@ -1,3 +0,0 @@ -fn main() { - println!("cargo:rerun-if-env-changed=GIT_VERSION"); -} diff --git a/libs/utils/src/lib.rs b/libs/utils/src/lib.rs index de266efe64..0398ce5e15 100644 --- a/libs/utils/src/lib.rs +++ b/libs/utils/src/lib.rs @@ -76,9 +76,17 @@ pub mod signals; // so if we changed the index state git_version will pick that up and rerun the macro. // // Note that with git_version prefix is `git:` and in case of git version from env its `git-env:`. -use git_version::git_version; -pub const GIT_VERSION: &str = git_version!( - prefix = "git:", - fallback = concat!("git-env:", env!("GIT_VERSION")), - args = ["--abbrev=40", "--always", "--dirty=-modified"] // always use full sha -); +#[macro_export] +// TODO kb add identifier into the capture +macro_rules! 
project_git_version { + () => { + const GIT_VERSION: &str = git_version::git_version!( + prefix = "git:", + fallback = concat!( + "git-env:", + env!("GIT_VERSION", "Missing GIT_VERSION envvar") + ), + args = ["--abbrev=40", "--always", "--dirty=-modified"] // always use full sha + ); + }; +} diff --git a/neon_local/Cargo.toml b/neon_local/Cargo.toml index 78d339789f..8ebd7d5c17 100644 --- a/neon_local/Cargo.toml +++ b/neon_local/Cargo.toml @@ -9,6 +9,7 @@ anyhow = "1.0" serde_json = "1" comfy-table = "5.0.1" postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" } +git-version = "0.3.5" # FIXME: 'pageserver' is needed for BranchInfo. Refactor pageserver = { path = "../pageserver" } diff --git a/neon_local/src/main.rs b/neon_local/src/main.rs index 75944fe107..2f470309ff 100644 --- a/neon_local/src/main.rs +++ b/neon_local/src/main.rs @@ -21,7 +21,7 @@ use utils::{ lsn::Lsn, postgres_backend::AuthType, zid::{ZNodeId, ZTenantId, ZTenantTimelineId, ZTimelineId}, - GIT_VERSION, + project_git_version, }; use pageserver::timelines::TimelineInfo; @@ -30,6 +30,7 @@ use pageserver::timelines::TimelineInfo; const DEFAULT_SAFEKEEPER_ID: ZNodeId = ZNodeId(1); const DEFAULT_PAGESERVER_ID: ZNodeId = ZNodeId(1); const DEFAULT_BRANCH_NAME: &str = "main"; +project_git_version!(); fn default_conf() -> String { format!( diff --git a/pageserver/Cargo.toml b/pageserver/Cargo.toml index d4cceafc61..9cc8444531 100644 --- a/pageserver/Cargo.toml +++ b/pageserver/Cargo.toml @@ -52,6 +52,7 @@ nix = "0.23" once_cell = "1.8.0" crossbeam-utils = "0.8.5" fail = "0.5.0" +git-version = "0.3.5" postgres_ffi = { path = "../libs/postgres_ffi" } metrics = { path = "../libs/metrics" } diff --git a/pageserver/src/bin/dump_layerfile.rs b/pageserver/src/bin/dump_layerfile.rs index af73ef6bdb..cb08acadff 100644 --- a/pageserver/src/bin/dump_layerfile.rs +++ b/pageserver/src/bin/dump_layerfile.rs @@ -7,7 +7,9 @@ use 
pageserver::layered_repository::dump_layerfile_from_path; use pageserver::page_cache; use pageserver::virtual_file; use std::path::PathBuf; -use utils::GIT_VERSION; +use utils::project_git_version; + +project_git_version!(); fn main() -> Result<()> { let arg_matches = App::new("Zenith dump_layerfile utility") diff --git a/pageserver/src/bin/pageserver.rs b/pageserver/src/bin/pageserver.rs index 9cb7e6f13d..73ef5c5f4d 100644 --- a/pageserver/src/bin/pageserver.rs +++ b/pageserver/src/bin/pageserver.rs @@ -20,17 +20,18 @@ use utils::{ http::endpoint, logging, postgres_backend::AuthType, + project_git_version, shutdown::exit_now, signals::{self, Signal}, tcp_listener, zid::{ZTenantId, ZTimelineId}, - GIT_VERSION, }; +project_git_version!(); + fn version() -> String { format!( - "{} profiling:{} failpoints:{}", - GIT_VERSION, + "{GIT_VERSION} profiling:{} failpoints:{}", cfg!(feature = "profiling"), fail::has_failpoints() ) @@ -217,7 +218,7 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<() // Initialize logger let log_file = logging::init(LOG_FILE_NAME, daemonize)?; - info!("version: {}", GIT_VERSION); + info!("version: {GIT_VERSION}"); // TODO: Check that it looks like a valid repository before going further diff --git a/pageserver/src/bin/update_metadata.rs b/pageserver/src/bin/update_metadata.rs index fae5e5c2e3..3e69ad5c66 100644 --- a/pageserver/src/bin/update_metadata.rs +++ b/pageserver/src/bin/update_metadata.rs @@ -6,7 +6,9 @@ use clap::{App, Arg}; use pageserver::layered_repository::metadata::TimelineMetadata; use std::path::PathBuf; use std::str::FromStr; -use utils::{lsn::Lsn, GIT_VERSION}; +use utils::{lsn::Lsn, project_git_version}; + +project_git_version!(); fn main() -> Result<()> { let arg_matches = App::new("Zenith update metadata utility") diff --git a/proxy/Cargo.toml b/proxy/Cargo.toml index 43880d645a..4e45698e3e 100644 --- a/proxy/Cargo.toml +++ b/proxy/Cargo.toml @@ -33,6 +33,7 @@ tokio = { version = "1.17", 
features = ["macros"] } tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" } tokio-rustls = "0.23.0" url = "2.2.2" +git-version = "0.3.5" utils = { path = "../libs/utils" } metrics = { path = "../libs/metrics" } diff --git a/proxy/src/main.rs b/proxy/src/main.rs index fc2a368b85..7d5105c88f 100644 --- a/proxy/src/main.rs +++ b/proxy/src/main.rs @@ -25,7 +25,9 @@ use config::ProxyConfig; use futures::FutureExt; use std::{future::Future, net::SocketAddr}; use tokio::{net::TcpListener, task::JoinError}; -use utils::GIT_VERSION; +use utils::project_git_version; + +project_git_version!(); /// Flattens `Result>` into `Result`. async fn flatten_err( @@ -124,7 +126,7 @@ async fn main() -> anyhow::Result<()> { auth_link_uri: arg_matches.value_of("uri").unwrap().parse()?, })); - println!("Version: {}", GIT_VERSION); + println!("Version: {GIT_VERSION}"); // Check that we can bind to address before further initialization println!("Starting http on {}", http_address); diff --git a/safekeeper/Cargo.toml b/safekeeper/Cargo.toml index 5e1ceee02e..417cf58cd5 100644 --- a/safekeeper/Cargo.toml +++ b/safekeeper/Cargo.toml @@ -29,6 +29,7 @@ hex = "0.4.3" const_format = "0.2.21" tokio-postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" } tokio-util = { version = "0.7", features = ["io"] } +git-version = "0.3.5" postgres_ffi = { path = "../libs/postgres_ffi" } metrics = { path = "../libs/metrics" } diff --git a/safekeeper/src/bin/safekeeper.rs b/safekeeper/src/bin/safekeeper.rs index d0df7093ff..06a15a90b0 100644 --- a/safekeeper/src/bin/safekeeper.rs +++ b/safekeeper/src/bin/safekeeper.rs @@ -22,11 +22,13 @@ use safekeeper::SafeKeeperConf; use safekeeper::{broker, callmemaybe}; use safekeeper::{http, s3_offload}; use utils::{ - http::endpoint, logging, shutdown::exit_now, signals, tcp_listener, zid::ZNodeId, GIT_VERSION, + http::endpoint, logging, 
project_git_version, shutdown::exit_now, signals, tcp_listener, + zid::ZNodeId, }; const LOCK_FILE_NAME: &str = "safekeeper.lock"; const ID_FILE_NAME: &str = "safekeeper.id"; +project_git_version!(); fn main() -> Result<()> { metrics::set_common_metrics_prefix("safekeeper"); @@ -193,7 +195,7 @@ fn main() -> Result<()> { fn start_safekeeper(mut conf: SafeKeeperConf, given_id: Option, init: bool) -> Result<()> { let log_file = logging::init("safekeeper.log", conf.daemonize)?; - info!("version: {}", GIT_VERSION); + info!("version: {GIT_VERSION}"); // Prevent running multiple safekeepers on the same directory let lock_file_path = conf.workdir.join(LOCK_FILE_NAME); From b683308791d81f005089aed35981c73d78fbb93c Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Fri, 13 May 2022 01:05:55 +0300 Subject: [PATCH 237/296] Return GIT_VERSION back to storage binaries --- libs/utils/src/lib.rs | 55 +++++++++++++++------------ neon_local/src/main.rs | 4 +- pageserver/src/bin/dump_layerfile.rs | 2 +- pageserver/src/bin/pageserver.rs | 2 +- pageserver/src/bin/update_metadata.rs | 2 +- proxy/src/main.rs | 2 +- safekeeper/src/bin/safekeeper.rs | 2 +- 7 files changed, 37 insertions(+), 32 deletions(-) diff --git a/libs/utils/src/lib.rs b/libs/utils/src/lib.rs index 0398ce5e15..4810909712 100644 --- a/libs/utils/src/lib.rs +++ b/libs/utils/src/lib.rs @@ -54,33 +54,38 @@ pub mod nonblock; // Default signal handling pub mod signals; -// This is a shortcut to embed git sha into binaries and avoid copying the same build script to all packages -// -// we have several cases: -// * building locally from git repo -// * building in CI from git repo -// * building in docker (either in CI or locally) -// -// One thing to note is that .git is not available in docker (and it is bad to include it there). -// So everything becides docker build is covered by git_version crate. -// For docker use environment variable to pass git version, which is then retrieved by buildscript (build.rs). 
-// It takes variable from build process env and puts it to the rustc env. And then we can retrieve it here by using env! macro. -// Git version received from environment variable used as a fallback in git_version invokation. -// And to avoid running buildscript every recompilation, we use rerun-if-env-changed option. -// So the build script will be run only when GIT_VERSION envvar has changed. -// -// Why not to use buildscript to get git commit sha directly without procmacro from different crate? -// Caching and workspaces complicates that. In case `utils` is not -// recompiled due to caching then version may become outdated. -// git_version crate handles that case by introducing a dependency on .git internals via include_bytes! macro, -// so if we changed the index state git_version will pick that up and rerun the macro. -// -// Note that with git_version prefix is `git:` and in case of git version from env its `git-env:`. +/// This is a shortcut to embed git sha into binaries and avoid copying the same build script to all packages +/// +/// we have several cases: +/// * building locally from git repo +/// * building in CI from git repo +/// * building in docker (either in CI or locally) +/// +/// One thing to note is that .git is not available in docker (and it is bad to include it there). +/// So everything becides docker build is covered by git_version crate, and docker uses a `GIT_VERSION` argument to get the value required. +/// It takes variable from build process env and puts it to the rustc env. And then we can retrieve it here by using env! macro. +/// Git version received from environment variable used as a fallback in git_version invokation. +/// And to avoid running buildscript every recompilation, we use rerun-if-env-changed option. +/// So the build script will be run only when GIT_VERSION envvar has changed. +/// +/// Why not to use buildscript to get git commit sha directly without procmacro from different crate? 
+/// Caching and workspaces complicates that. In case `utils` is not +/// recompiled due to caching then version may become outdated. +/// git_version crate handles that case by introducing a dependency on .git internals via include_bytes! macro, +/// so if we changed the index state git_version will pick that up and rerun the macro. +/// +/// Note that with git_version prefix is `git:` and in case of git version from env its `git-env:`. +/// +/// ############################################################################################# +/// TODO this macro is not the way the library is intended to be used, see https://github.com/neondatabase/neon/issues/1565 for details. +/// We use `cachepot` to reduce our current CI build times: https://github.com/neondatabase/cloud/pull/1033#issuecomment-1100935036 +/// Yet, it seems to ignore the GIT_VERSION env variable, passed to Docker build, even with build.rs that contains +/// `println!("cargo:rerun-if-env-changed=GIT_VERSION");` code for cachepot cache invalidation. +/// The problem needs further investigation and regular `const` declaration instead of a macro. #[macro_export] -// TODO kb add identifier into the capture macro_rules! 
project_git_version { - () => { - const GIT_VERSION: &str = git_version::git_version!( + ($const_identifier:ident) => { + const $const_identifier: &str = git_version::git_version!( prefix = "git:", fallback = concat!( "git-env:", diff --git a/neon_local/src/main.rs b/neon_local/src/main.rs index 2f470309ff..6538cdefc4 100644 --- a/neon_local/src/main.rs +++ b/neon_local/src/main.rs @@ -20,8 +20,8 @@ use utils::{ auth::{Claims, Scope}, lsn::Lsn, postgres_backend::AuthType, - zid::{ZNodeId, ZTenantId, ZTenantTimelineId, ZTimelineId}, project_git_version, + zid::{ZNodeId, ZTenantId, ZTenantTimelineId, ZTimelineId}, }; use pageserver::timelines::TimelineInfo; @@ -30,7 +30,7 @@ use pageserver::timelines::TimelineInfo; const DEFAULT_SAFEKEEPER_ID: ZNodeId = ZNodeId(1); const DEFAULT_PAGESERVER_ID: ZNodeId = ZNodeId(1); const DEFAULT_BRANCH_NAME: &str = "main"; -project_git_version!(); +project_git_version!(GIT_VERSION); fn default_conf() -> String { format!( diff --git a/pageserver/src/bin/dump_layerfile.rs b/pageserver/src/bin/dump_layerfile.rs index cb08acadff..87390a1b06 100644 --- a/pageserver/src/bin/dump_layerfile.rs +++ b/pageserver/src/bin/dump_layerfile.rs @@ -9,7 +9,7 @@ use pageserver::virtual_file; use std::path::PathBuf; use utils::project_git_version; -project_git_version!(); +project_git_version!(GIT_VERSION); fn main() -> Result<()> { let arg_matches = App::new("Zenith dump_layerfile utility") diff --git a/pageserver/src/bin/pageserver.rs b/pageserver/src/bin/pageserver.rs index 73ef5c5f4d..190e38e341 100644 --- a/pageserver/src/bin/pageserver.rs +++ b/pageserver/src/bin/pageserver.rs @@ -27,7 +27,7 @@ use utils::{ zid::{ZTenantId, ZTimelineId}, }; -project_git_version!(); +project_git_version!(GIT_VERSION); fn version() -> String { format!( diff --git a/pageserver/src/bin/update_metadata.rs b/pageserver/src/bin/update_metadata.rs index 3e69ad5c66..983fdb8647 100644 --- a/pageserver/src/bin/update_metadata.rs +++ b/pageserver/src/bin/update_metadata.rs @@ 
-8,7 +8,7 @@ use std::path::PathBuf; use std::str::FromStr; use utils::{lsn::Lsn, project_git_version}; -project_git_version!(); +project_git_version!(GIT_VERSION); fn main() -> Result<()> { let arg_matches = App::new("Zenith update metadata utility") diff --git a/proxy/src/main.rs b/proxy/src/main.rs index 7d5105c88f..f46e19e5d6 100644 --- a/proxy/src/main.rs +++ b/proxy/src/main.rs @@ -27,7 +27,7 @@ use std::{future::Future, net::SocketAddr}; use tokio::{net::TcpListener, task::JoinError}; use utils::project_git_version; -project_git_version!(); +project_git_version!(GIT_VERSION); /// Flattens `Result>` into `Result`. async fn flatten_err( diff --git a/safekeeper/src/bin/safekeeper.rs b/safekeeper/src/bin/safekeeper.rs index 06a15a90b0..65e71fcc74 100644 --- a/safekeeper/src/bin/safekeeper.rs +++ b/safekeeper/src/bin/safekeeper.rs @@ -28,7 +28,7 @@ use utils::{ const LOCK_FILE_NAME: &str = "safekeeper.lock"; const ID_FILE_NAME: &str = "safekeeper.id"; -project_git_version!(); +project_git_version!(GIT_VERSION); fn main() -> Result<()> { metrics::set_common_metrics_prefix("safekeeper"); From 22d997049c4cf5415b208a6fb397e1c3174980b8 Mon Sep 17 00:00:00 2001 From: Egor Suvorov Date: Fri, 6 May 2022 20:03:28 +0300 Subject: [PATCH 238/296] libs/utils/http/request: add ensure_no_body --- libs/utils/src/http/request.rs | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/libs/utils/src/http/request.rs b/libs/utils/src/http/request.rs index 3bc8993c26..8e3d357397 100644 --- a/libs/utils/src/http/request.rs +++ b/libs/utils/src/http/request.rs @@ -1,7 +1,7 @@ use std::str::FromStr; use super::error::ApiError; -use hyper::{Body, Request}; +use hyper::{body::HttpBody, Body, Request}; use routerify::ext::RequestExt; pub fn get_request_param<'a>( @@ -31,3 +31,10 @@ pub fn parse_request_param( ))), } } + +pub async fn ensure_no_body(request: &mut Request) -> Result<(), ApiError> { + match request.body_mut().data().await { + Some(_) => 
Err(ApiError::BadRequest("Unexpected request body".into())), + None => Ok(()), + } +} From 07b85e7cfcf7d69c12e528ddde42d51444bbed27 Mon Sep 17 00:00:00 2001 From: Egor Suvorov Date: Thu, 12 May 2022 19:55:01 +0300 Subject: [PATCH 239/296] Safekeeper refactor: move callmemaybe_tx from SafekeeperPostgresBackend to Timeline --- safekeeper/src/bin/safekeeper.rs | 8 +-- safekeeper/src/handler.rs | 8 +-- safekeeper/src/receive_wal.rs | 11 +--- safekeeper/src/send_wal.rs | 6 +-- safekeeper/src/timeline.rs | 90 ++++++++++++++++++-------------- safekeeper/src/wal_service.rs | 19 ++----- 6 files changed, 66 insertions(+), 76 deletions(-) diff --git a/safekeeper/src/bin/safekeeper.rs b/safekeeper/src/bin/safekeeper.rs index 65e71fcc74..6955d2aa5c 100644 --- a/safekeeper/src/bin/safekeeper.rs +++ b/safekeeper/src/bin/safekeeper.rs @@ -17,6 +17,7 @@ use url::{ParseError, Url}; use safekeeper::control_file::{self}; use safekeeper::defaults::{DEFAULT_HTTP_LISTEN_ADDR, DEFAULT_PG_LISTEN_ADDR}; use safekeeper::remove_wal; +use safekeeper::timeline::GlobalTimelines; use safekeeper::wal_service; use safekeeper::SafeKeeperConf; use safekeeper::{broker, callmemaybe}; @@ -251,6 +252,8 @@ fn start_safekeeper(mut conf: SafeKeeperConf, given_id: Option, init: b let signals = signals::install_shutdown_handlers()?; let mut threads = vec![]; + let (callmemaybe_tx, callmemaybe_rx) = mpsc::unbounded_channel(); + GlobalTimelines::set_callmemaybe_tx(callmemaybe_tx); let conf_ = conf.clone(); threads.push( @@ -279,13 +282,12 @@ fn start_safekeeper(mut conf: SafeKeeperConf, given_id: Option, init: b ); } - let (tx, rx) = mpsc::unbounded_channel(); let conf_cloned = conf.clone(); let safekeeper_thread = thread::Builder::new() .name("Safekeeper thread".into()) .spawn(|| { // thread code - let thread_result = wal_service::thread_main(conf_cloned, pg_listener, tx); + let thread_result = wal_service::thread_main(conf_cloned, pg_listener); if let Err(e) = thread_result { info!("safekeeper thread 
terminated: {}", e); } @@ -299,7 +301,7 @@ fn start_safekeeper(mut conf: SafeKeeperConf, given_id: Option, init: b .name("callmemaybe thread".into()) .spawn(|| { // thread code - let thread_result = callmemaybe::thread_main(conf_cloned, rx); + let thread_result = callmemaybe::thread_main(conf_cloned, callmemaybe_rx); if let Err(e) = thread_result { error!("callmemaybe thread terminated: {}", e); } diff --git a/safekeeper/src/handler.rs b/safekeeper/src/handler.rs index 7d86523b0e..9af78661f9 100644 --- a/safekeeper/src/handler.rs +++ b/safekeeper/src/handler.rs @@ -21,9 +21,6 @@ use utils::{ zid::{ZTenantId, ZTenantTimelineId, ZTimelineId}, }; -use crate::callmemaybe::CallmeEvent; -use tokio::sync::mpsc::UnboundedSender; - /// Safekeeper handler of postgres commands pub struct SafekeeperPostgresHandler { pub conf: SafeKeeperConf, @@ -33,8 +30,6 @@ pub struct SafekeeperPostgresHandler { pub ztimelineid: Option, pub timeline: Option>, pageserver_connstr: Option, - //sender to communicate with callmemaybe thread - pub tx: UnboundedSender, } /// Parsed Postgres command. 
@@ -140,7 +135,7 @@ impl postgres_backend::Handler for SafekeeperPostgresHandler { } impl SafekeeperPostgresHandler { - pub fn new(conf: SafeKeeperConf, tx: UnboundedSender) -> Self { + pub fn new(conf: SafeKeeperConf) -> Self { SafekeeperPostgresHandler { conf, appname: None, @@ -148,7 +143,6 @@ impl SafekeeperPostgresHandler { ztimelineid: None, timeline: None, pageserver_connstr: None, - tx, } } diff --git a/safekeeper/src/receive_wal.rs b/safekeeper/src/receive_wal.rs index 3ad99ab0df..0ef335c9ed 100644 --- a/safekeeper/src/receive_wal.rs +++ b/safekeeper/src/receive_wal.rs @@ -5,7 +5,6 @@ use anyhow::{anyhow, bail, Result}; use bytes::BytesMut; -use tokio::sync::mpsc::UnboundedSender; use tracing::*; use crate::timeline::Timeline; @@ -28,8 +27,6 @@ use utils::{ sock_split::ReadStream, }; -use crate::callmemaybe::CallmeEvent; - pub struct ReceiveWalConn<'pg> { /// Postgres connection pg_backend: &'pg mut PostgresBackend, @@ -91,10 +88,9 @@ impl<'pg> ReceiveWalConn<'pg> { // Register the connection and defer unregister. 
spg.timeline .get() - .on_compute_connect(self.pageserver_connstr.as_ref(), &spg.tx)?; + .on_compute_connect(self.pageserver_connstr.as_ref())?; let _guard = ComputeConnectionGuard { timeline: Arc::clone(spg.timeline.get()), - callmemaybe_tx: spg.tx.clone(), }; let mut next_msg = Some(next_msg); @@ -194,13 +190,10 @@ impl ProposerPollStream { struct ComputeConnectionGuard { timeline: Arc, - callmemaybe_tx: UnboundedSender, } impl Drop for ComputeConnectionGuard { fn drop(&mut self) { - self.timeline - .on_compute_disconnect(&self.callmemaybe_tx) - .unwrap(); + self.timeline.on_compute_disconnect().unwrap(); } } diff --git a/safekeeper/src/send_wal.rs b/safekeeper/src/send_wal.rs index 960f70d154..d52dd6ea57 100644 --- a/safekeeper/src/send_wal.rs +++ b/safekeeper/src/send_wal.rs @@ -264,13 +264,13 @@ impl ReplicationConn { } else { let pageserver_connstr = pageserver_connstr.expect("there should be a pageserver connection string since this is not a wal_proposer_recovery"); let zttid = spg.timeline.get().zttid; - let tx_clone = spg.tx.clone(); + let tx_clone = spg.timeline.get().callmemaybe_tx.clone(); let subscription_key = SubscriptionStateKey::new( zttid.tenant_id, zttid.timeline_id, pageserver_connstr.clone(), ); - spg.tx + tx_clone .send(CallmeEvent::Pause(subscription_key)) .unwrap_or_else(|e| { error!("failed to send Pause request to callmemaybe thread {}", e); @@ -315,7 +315,7 @@ impl ReplicationConn { } else { // TODO: also check once in a while whether we are walsender // to right pageserver. - if spg.timeline.get().check_deactivate(replica_id, &spg.tx)? { + if spg.timeline.get().check_deactivate(replica_id)? { // Shut down, timeline is suspended. 
// TODO create proper error type for this bail!("end streaming to {:?}", spg.appname); diff --git a/safekeeper/src/timeline.rs b/safekeeper/src/timeline.rs index a12f628e06..c73d6af4ac 100644 --- a/safekeeper/src/timeline.rs +++ b/safekeeper/src/timeline.rs @@ -275,15 +275,21 @@ impl SharedState { /// Database instance (tenant) pub struct Timeline { pub zttid: ZTenantTimelineId, + pub callmemaybe_tx: UnboundedSender, mutex: Mutex, /// conditional variable used to notify wal senders cond: Condvar, } impl Timeline { - fn new(zttid: ZTenantTimelineId, shared_state: SharedState) -> Timeline { + fn new( + zttid: ZTenantTimelineId, + callmemaybe_tx: UnboundedSender, + shared_state: SharedState, + ) -> Timeline { Timeline { zttid, + callmemaybe_tx, mutex: Mutex::new(shared_state), cond: Condvar::new(), } @@ -292,34 +298,27 @@ impl Timeline { /// Register compute connection, starting timeline-related activity if it is /// not running yet. /// Can fail only if channel to a static thread got closed, which is not normal at all. - pub fn on_compute_connect( - &self, - pageserver_connstr: Option<&String>, - callmemaybe_tx: &UnboundedSender, - ) -> Result<()> { + pub fn on_compute_connect(&self, pageserver_connstr: Option<&String>) -> Result<()> { let mut shared_state = self.mutex.lock().unwrap(); shared_state.num_computes += 1; // FIXME: currently we always adopt latest pageserver connstr, but we // should have kind of generations assigned by compute to distinguish // the latest one or even pass it through consensus to reliably deliver // to all safekeepers. - shared_state.activate(&self.zttid, pageserver_connstr, callmemaybe_tx)?; + shared_state.activate(&self.zttid, pageserver_connstr, &self.callmemaybe_tx)?; Ok(()) } /// De-register compute connection, shutting down timeline activity if /// pageserver doesn't need catchup. /// Can fail only if channel to a static thread got closed, which is not normal at all. 
- pub fn on_compute_disconnect( - &self, - callmemaybe_tx: &UnboundedSender, - ) -> Result<()> { + pub fn on_compute_disconnect(&self) -> Result<()> { let mut shared_state = self.mutex.lock().unwrap(); shared_state.num_computes -= 1; // If there is no pageserver, can suspend right away; otherwise let // walsender do that. if shared_state.num_computes == 0 && shared_state.pageserver_connstr.is_none() { - shared_state.deactivate(&self.zttid, callmemaybe_tx)?; + shared_state.deactivate(&self.zttid, &self.callmemaybe_tx)?; } Ok(()) } @@ -327,11 +326,7 @@ impl Timeline { /// Deactivate tenant if there is no computes and pageserver is caughtup, /// assuming the pageserver status is in replica_id. /// Returns true if deactivated. - pub fn check_deactivate( - &self, - replica_id: usize, - callmemaybe_tx: &UnboundedSender, - ) -> Result { + pub fn check_deactivate(&self, replica_id: usize) -> Result { let mut shared_state = self.mutex.lock().unwrap(); if !shared_state.active { // already suspended @@ -343,7 +338,7 @@ impl Timeline { (replica_state.last_received_lsn != Lsn::MAX && // Lsn::MAX means that we don't know the latest LSN yet. replica_state.last_received_lsn >= shared_state.sk.inmem.commit_lsn); if deactivate { - shared_state.deactivate(&self.zttid, callmemaybe_tx)?; + shared_state.deactivate(&self.zttid, &self.callmemaybe_tx)?; return Ok(true); } } @@ -508,22 +503,35 @@ impl TimelineTools for Option> { } } +struct GlobalTimelinesState { + timelines: HashMap>, + callmemaybe_tx: Option>, +} + lazy_static! { - pub static ref TIMELINES: Mutex>> = - Mutex::new(HashMap::new()); + static ref TIMELINES_STATE: Mutex = Mutex::new(GlobalTimelinesState { + timelines: HashMap::new(), + callmemaybe_tx: None + }); } /// A zero-sized struct used to manage access to the global timelines map. 
pub struct GlobalTimelines; impl GlobalTimelines { + pub fn set_callmemaybe_tx(callmemaybe_tx: UnboundedSender) { + let mut state = TIMELINES_STATE.lock().unwrap(); + assert!(state.callmemaybe_tx.is_none()); + state.callmemaybe_tx = Some(callmemaybe_tx); + } + fn create_internal( - mut timelines: MutexGuard>>, + mut state: MutexGuard, conf: &SafeKeeperConf, zttid: ZTenantTimelineId, peer_ids: Vec, ) -> Result> { - match timelines.get(&zttid) { + match state.timelines.get(&zttid) { Some(_) => bail!("timeline {} already exists", zttid), None => { // TODO: check directory existence @@ -532,8 +540,12 @@ impl GlobalTimelines { let shared_state = SharedState::create(conf, &zttid, peer_ids) .context("failed to create shared state")?; - let new_tli = Arc::new(Timeline::new(zttid, shared_state)); - timelines.insert(zttid, Arc::clone(&new_tli)); + let new_tli = Arc::new(Timeline::new( + zttid, + state.callmemaybe_tx.as_ref().unwrap().clone(), + shared_state, + )); + state.timelines.insert(zttid, Arc::clone(&new_tli)); Ok(new_tli) } } @@ -544,20 +556,20 @@ impl GlobalTimelines { zttid: ZTenantTimelineId, peer_ids: Vec, ) -> Result> { - let timelines = TIMELINES.lock().unwrap(); - GlobalTimelines::create_internal(timelines, conf, zttid, peer_ids) + let state = TIMELINES_STATE.lock().unwrap(); + GlobalTimelines::create_internal(state, conf, zttid, peer_ids) } - /// Get a timeline with control file loaded from the global TIMELINES map. + /// Get a timeline with control file loaded from the global TIMELINES_STATE.timelines map. /// If control file doesn't exist and create=false, bails out. 
pub fn get( conf: &SafeKeeperConf, zttid: ZTenantTimelineId, create: bool, ) -> Result> { - let mut timelines = TIMELINES.lock().unwrap(); + let mut state = TIMELINES_STATE.lock().unwrap(); - match timelines.get(&zttid) { + match state.timelines.get(&zttid) { Some(result) => Ok(Arc::clone(result)), None => { let shared_state = @@ -573,20 +585,19 @@ impl GlobalTimelines { .contains("No such file or directory") && create { - return GlobalTimelines::create_internal( - timelines, - conf, - zttid, - vec![], - ); + return GlobalTimelines::create_internal(state, conf, zttid, vec![]); } else { return Err(error); } } }; - let new_tli = Arc::new(Timeline::new(zttid, shared_state)); - timelines.insert(zttid, Arc::clone(&new_tli)); + let new_tli = Arc::new(Timeline::new( + zttid, + state.callmemaybe_tx.as_ref().unwrap().clone(), + shared_state, + )); + state.timelines.insert(zttid, Arc::clone(&new_tli)); Ok(new_tli) } } @@ -594,8 +605,9 @@ impl GlobalTimelines { /// Get ZTenantTimelineIDs of all active timelines. pub fn get_active_timelines() -> Vec { - let timelines = TIMELINES.lock().unwrap(); - timelines + let state = TIMELINES_STATE.lock().unwrap(); + state + .timelines .iter() .filter(|&(_, tli)| tli.is_active()) .map(|(zttid, _)| *zttid) diff --git a/safekeeper/src/wal_service.rs b/safekeeper/src/wal_service.rs index 468ac28526..5980160788 100644 --- a/safekeeper/src/wal_service.rs +++ b/safekeeper/src/wal_service.rs @@ -8,29 +8,22 @@ use std::net::{TcpListener, TcpStream}; use std::thread; use tracing::*; -use crate::callmemaybe::CallmeEvent; use crate::handler::SafekeeperPostgresHandler; use crate::SafeKeeperConf; -use tokio::sync::mpsc::UnboundedSender; use utils::postgres_backend::{AuthType, PostgresBackend}; /// Accept incoming TCP connections and spawn them into a background thread. 
-pub fn thread_main( - conf: SafeKeeperConf, - listener: TcpListener, - tx: UnboundedSender, -) -> Result<()> { +pub fn thread_main(conf: SafeKeeperConf, listener: TcpListener) -> Result<()> { loop { match listener.accept() { Ok((socket, peer_addr)) => { debug!("accepted connection from {}", peer_addr); let conf = conf.clone(); - let tx_clone = tx.clone(); let _ = thread::Builder::new() .name("WAL service thread".into()) .spawn(move || { - if let Err(err) = handle_socket(socket, conf, tx_clone) { + if let Err(err) = handle_socket(socket, conf) { error!("connection handler exited: {}", err); } }) @@ -51,16 +44,12 @@ fn get_tid() -> u64 { /// This is run by `thread_main` above, inside a background thread. /// -fn handle_socket( - socket: TcpStream, - conf: SafeKeeperConf, - tx: UnboundedSender, -) -> Result<()> { +fn handle_socket(socket: TcpStream, conf: SafeKeeperConf) -> Result<()> { let _enter = info_span!("", tid = ?get_tid()).entered(); socket.set_nodelay(true)?; - let mut conn_handler = SafekeeperPostgresHandler::new(conf, tx); + let mut conn_handler = SafekeeperPostgresHandler::new(conf); let pgbackend = PostgresBackend::new(socket, AuthType::Trust, None, false)?; // libpq replication protocol between safekeeper and replicas/pagers pgbackend.run(&mut conn_handler)?; From bf899a57d9a2b20ba812a4002c0ac3234f064d26 Mon Sep 17 00:00:00 2001 From: Egor Suvorov Date: Thu, 12 May 2022 23:40:29 +0300 Subject: [PATCH 240/296] Safekeeper: add timeline/tenant force delete HTTP endpoints (closes #895) * There is no auth in Safekeeper HTTP at all currently, so simply calling `check_permission` is not enough. * There are no checks of Safekeeper still working with the data, as "still working" is blurry now: a timeline may be "active" while there are no compute nodes and all data is propagated. * Still, callmemaybe is deactivated, and timeline is removed from the internal map. It can easily sneak back in case of race conditions and implicit creations, though. 
--- safekeeper/src/http/routes.rs | 48 +++++++- safekeeper/src/lib.rs | 9 +- safekeeper/src/timeline.rs | 98 ++++++++++++++- test_runner/batch_others/test_wal_acceptor.py | 113 ++++++++++++++++++ test_runner/fixtures/zenith_fixtures.py | 15 +++ 5 files changed, 277 insertions(+), 6 deletions(-) diff --git a/safekeeper/src/http/routes.rs b/safekeeper/src/http/routes.rs index e731db5617..62fbd2ff2f 100644 --- a/safekeeper/src/http/routes.rs +++ b/safekeeper/src/http/routes.rs @@ -3,19 +3,20 @@ use hyper::{Body, Request, Response, StatusCode}; use serde::Serialize; use serde::Serializer; +use std::collections::HashMap; use std::fmt::Display; use std::sync::Arc; use crate::safekeeper::Term; use crate::safekeeper::TermHistory; -use crate::timeline::GlobalTimelines; +use crate::timeline::{GlobalTimelines, TimelineDeleteForceResult}; use crate::SafeKeeperConf; use utils::{ http::{ endpoint, error::ApiError, json::{json_request, json_response}, - request::parse_request_param, + request::{ensure_no_body, parse_request_param}, RequestExt, RouterBuilder, }, lsn::Lsn, @@ -130,6 +131,44 @@ async fn timeline_create_handler(mut request: Request) -> Result, +) -> Result, ApiError> { + let zttid = ZTenantTimelineId::new( + parse_request_param(&request, "tenant_id")?, + parse_request_param(&request, "timeline_id")?, + ); + ensure_no_body(&mut request).await?; + json_response( + StatusCode::OK, + GlobalTimelines::delete_force(get_conf(&request), &zttid).map_err(ApiError::from_err)?, + ) +} + +/// Deactivates all timelines for the tenant and removes its data directory. +/// See `timeline_delete_force_handler`. +async fn tenant_delete_force_handler( + mut request: Request, +) -> Result, ApiError> { + let tenant_id = parse_request_param(&request, "tenant_id")?; + ensure_no_body(&mut request).await?; + json_response( + StatusCode::OK, + GlobalTimelines::delete_force_all_for_tenant(get_conf(&request), &tenant_id) + .map_err(ApiError::from_err)? 
+ .iter() + .map(|(zttid, resp)| (format!("{}", zttid.timeline_id), *resp)) + .collect::>(), + ) +} + /// Used only in tests to hand craft required data. async fn record_safekeeper_info(mut request: Request) -> Result, ApiError> { let zttid = ZTenantTimelineId::new( @@ -155,6 +194,11 @@ pub fn make_router(conf: SafeKeeperConf) -> RouterBuilder timeline_status_handler, ) .post("/v1/timeline", timeline_create_handler) + .delete( + "/v1/tenant/:tenant_id/timeline/:timeline_id", + timeline_delete_force_handler, + ) + .delete("/v1/tenant/:tenant_id", tenant_delete_force_handler) // for tests .post( "/v1/record_safekeeper_info/:tenant_id/:timeline_id", diff --git a/safekeeper/src/lib.rs b/safekeeper/src/lib.rs index 03236d4e65..09b2e68a49 100644 --- a/safekeeper/src/lib.rs +++ b/safekeeper/src/lib.rs @@ -3,7 +3,7 @@ use std::path::PathBuf; use std::time::Duration; use url::Url; -use utils::zid::{ZNodeId, ZTenantTimelineId}; +use utils::zid::{ZNodeId, ZTenantId, ZTenantTimelineId}; pub mod broker; pub mod callmemaybe; @@ -57,9 +57,12 @@ pub struct SafeKeeperConf { } impl SafeKeeperConf { + pub fn tenant_dir(&self, tenant_id: &ZTenantId) -> PathBuf { + self.workdir.join(tenant_id.to_string()) + } + pub fn timeline_dir(&self, zttid: &ZTenantTimelineId) -> PathBuf { - self.workdir - .join(zttid.tenant_id.to_string()) + self.tenant_dir(&zttid.tenant_id) .join(zttid.timeline_id.to_string()) } } diff --git a/safekeeper/src/timeline.rs b/safekeeper/src/timeline.rs index c73d6af4ac..84ad53d72d 100644 --- a/safekeeper/src/timeline.rs +++ b/safekeeper/src/timeline.rs @@ -7,6 +7,8 @@ use etcd_broker::SkTimelineInfo; use lazy_static::lazy_static; use postgres_ffi::xlog_utils::XLogSegNo; +use serde::Serialize; + use std::cmp::{max, min}; use std::collections::HashMap; use std::fs::{self}; @@ -19,7 +21,7 @@ use tracing::*; use utils::{ lsn::Lsn, pq_proto::ZenithFeedback, - zid::{ZNodeId, ZTenantTimelineId}, + zid::{ZNodeId, ZTenantId, ZTenantTimelineId}, }; use 
crate::callmemaybe::{CallmeEvent, SubscriptionStateKey}; @@ -345,6 +347,20 @@ impl Timeline { Ok(false) } + /// Deactivates the timeline, assuming it is being deleted. + /// Returns whether the timeline was already active. + /// + /// The callmemaybe thread is stopped by the deactivation message. We assume all other threads + /// will stop by themselves eventually (possibly with errors, but no panics). There should be no + /// compute threads (as we're deleting the timeline), actually. Some WAL may be left unsent, but + /// we're deleting the timeline anyway. + pub fn deactivate_for_delete(&self) -> Result { + let mut shared_state = self.mutex.lock().unwrap(); + let was_active = shared_state.active; + shared_state.deactivate(&self.zttid, &self.callmemaybe_tx)?; + Ok(was_active) + } + fn is_active(&self) -> bool { let shared_state = self.mutex.lock().unwrap(); shared_state.active @@ -515,6 +531,12 @@ lazy_static! { }); } +#[derive(Clone, Copy, Serialize)] +pub struct TimelineDeleteForceResult { + pub dir_existed: bool, + pub was_active: bool, +} + /// A zero-sized struct used to manage access to the global timelines map. pub struct GlobalTimelines; @@ -613,4 +635,78 @@ impl GlobalTimelines { .map(|(zttid, _)| *zttid) .collect() } + + fn delete_force_internal( + conf: &SafeKeeperConf, + zttid: &ZTenantTimelineId, + was_active: bool, + ) -> Result { + match std::fs::remove_dir_all(conf.timeline_dir(zttid)) { + Ok(_) => Ok(TimelineDeleteForceResult { + dir_existed: true, + was_active, + }), + Err(e) if e.kind() == std::io::ErrorKind::NotFound => Ok(TimelineDeleteForceResult { + dir_existed: false, + was_active, + }), + Err(e) => Err(e.into()), + } + } + + /// Deactivates and deletes the timeline, see `Timeline::deactivate_for_delete()`, then deletes + /// the corresponding data directory. + /// We assume all timeline threads do not care about `GlobalTimelines` not containing the timeline + /// anymore, and they will eventually terminate without panics. 
+ /// + /// There are multiple ways the timeline may be accidentally "re-created" (so we end up with two + /// `Timeline` objects in memory): + /// a) a compute node connects after this method is called, or + /// b) an HTTP GET request about the timeline is made and it's able to restore the current state, or + /// c) an HTTP POST request for timeline creation is made after the timeline is already deleted. + /// TODO: ensure all of the above never happens. + pub fn delete_force( + conf: &SafeKeeperConf, + zttid: &ZTenantTimelineId, + ) -> Result { + info!("deleting timeline {}", zttid); + let was_active = match TIMELINES_STATE.lock().unwrap().timelines.remove(zttid) { + None => false, + Some(tli) => tli.deactivate_for_delete()?, + }; + GlobalTimelines::delete_force_internal(conf, zttid, was_active) + } + + /// Deactivates and deletes all timelines for the tenant, see `delete_force()`. + /// Returns a map from each timeline the tenant had in memory to its deletion result (including whether it was active). + pub fn delete_force_all_for_tenant( + conf: &SafeKeeperConf, + tenant_id: &ZTenantId, + ) -> Result> { + info!("deleting all timelines for tenant {}", tenant_id); + let mut state = TIMELINES_STATE.lock().unwrap(); + let mut deleted = HashMap::new(); + for (zttid, tli) in &state.timelines { + if zttid.tenant_id == *tenant_id { + deleted.insert( + *zttid, + GlobalTimelines::delete_force_internal( + conf, + zttid, + tli.deactivate_for_delete()?, + )?, + ); + } + } + // TODO: test that the exact subset of timelines is removed. 
+ state + .timelines + .retain(|zttid, _| !deleted.contains_key(zttid)); + match std::fs::remove_dir_all(conf.tenant_dir(tenant_id)) { + Ok(_) => (), + Err(e) if e.kind() == std::io::ErrorKind::NotFound => (), + e => e?, + }; + Ok(deleted) + } } diff --git a/test_runner/batch_others/test_wal_acceptor.py b/test_runner/batch_others/test_wal_acceptor.py index 702c27a79b..e297f91f2c 100644 --- a/test_runner/batch_others/test_wal_acceptor.py +++ b/test_runner/batch_others/test_wal_acceptor.py @@ -850,3 +850,116 @@ def test_wal_deleted_after_broadcast(zenith_env_builder: ZenithEnvBuilder): # there shouldn't be more than 2 WAL segments (but dir may have archive_status files) assert wal_size_after_checkpoint < 16 * 2.5 + + +def test_delete_force(zenith_env_builder: ZenithEnvBuilder): + zenith_env_builder.num_safekeepers = 1 + env = zenith_env_builder.init_start() + + # Create two tenants: one will be deleted, the other should be preserved. + tenant_id = env.initial_tenant.hex + timeline_id_1 = env.zenith_cli.create_branch('br1').hex # Active, delete explicitly + timeline_id_2 = env.zenith_cli.create_branch('br2').hex # Inactive, delete explicitly + timeline_id_3 = env.zenith_cli.create_branch('br3').hex # Active, delete with the tenant + timeline_id_4 = env.zenith_cli.create_branch('br4').hex # Inactive, delete with the tenant + + tenant_id_other = env.zenith_cli.create_tenant().hex + timeline_id_other = env.zenith_cli.create_root_branch( + 'br-other', tenant_id=uuid.UUID(hex=tenant_id_other)).hex + + # Populate branches + pg_1 = env.postgres.create_start('br1') + pg_2 = env.postgres.create_start('br2') + pg_3 = env.postgres.create_start('br3') + pg_4 = env.postgres.create_start('br4') + pg_other = env.postgres.create_start('br-other', tenant_id=uuid.UUID(hex=tenant_id_other)) + for pg in [pg_1, pg_2, pg_3, pg_4, pg_other]: + with closing(pg.connect()) as conn: + with conn.cursor() as cur: + cur.execute('CREATE TABLE t(key int primary key)') + sk = env.safekeepers[0] + 
sk_data_dir = Path(sk.data_dir()) + sk_http = sk.http_client() + assert (sk_data_dir / tenant_id / timeline_id_1).is_dir() + assert (sk_data_dir / tenant_id / timeline_id_2).is_dir() + assert (sk_data_dir / tenant_id / timeline_id_3).is_dir() + assert (sk_data_dir / tenant_id / timeline_id_4).is_dir() + assert (sk_data_dir / tenant_id_other / timeline_id_other).is_dir() + + # Stop branches which should be inactive and restart Safekeeper to drop its in-memory state. + pg_2.stop_and_destroy() + pg_4.stop_and_destroy() + sk.stop() + sk.start() + + # Ensure connections to Safekeeper are established + for pg in [pg_1, pg_3, pg_other]: + with closing(pg.connect()) as conn: + with conn.cursor() as cur: + cur.execute('INSERT INTO t (key) VALUES (1)') + + # Remove initial tenant's br1 (active) + assert sk_http.timeline_delete_force(tenant_id, timeline_id_1) == { + "dir_existed": True, + "was_active": True, + } + assert not (sk_data_dir / tenant_id / timeline_id_1).exists() + assert (sk_data_dir / tenant_id / timeline_id_2).is_dir() + assert (sk_data_dir / tenant_id / timeline_id_3).is_dir() + assert (sk_data_dir / tenant_id / timeline_id_4).is_dir() + assert (sk_data_dir / tenant_id_other / timeline_id_other).is_dir() + + # Ensure repeated deletion succeeds + assert sk_http.timeline_delete_force(tenant_id, timeline_id_1) == { + "dir_existed": False, "was_active": False + } + assert not (sk_data_dir / tenant_id / timeline_id_1).exists() + assert (sk_data_dir / tenant_id / timeline_id_2).is_dir() + assert (sk_data_dir / tenant_id / timeline_id_3).is_dir() + assert (sk_data_dir / tenant_id / timeline_id_4).is_dir() + assert (sk_data_dir / tenant_id_other / timeline_id_other).is_dir() + + # Remove initial tenant's br2 (inactive) + assert sk_http.timeline_delete_force(tenant_id, timeline_id_2) == { + "dir_existed": True, + "was_active": False, + } + assert not (sk_data_dir / tenant_id / timeline_id_1).exists() + assert not (sk_data_dir / tenant_id / timeline_id_2).exists() + 
assert (sk_data_dir / tenant_id / timeline_id_3).is_dir() + assert (sk_data_dir / tenant_id / timeline_id_4).is_dir() + assert (sk_data_dir / tenant_id_other / timeline_id_other).is_dir() + + # Remove non-existing branch, should succeed + assert sk_http.timeline_delete_force(tenant_id, '00' * 16) == { + "dir_existed": False, + "was_active": False, + } + assert not (sk_data_dir / tenant_id / timeline_id_1).exists() + assert not (sk_data_dir / tenant_id / timeline_id_2).exists() + assert (sk_data_dir / tenant_id / timeline_id_3).exists() + assert (sk_data_dir / tenant_id / timeline_id_4).is_dir() + assert (sk_data_dir / tenant_id_other / timeline_id_other).is_dir() + + # Remove initial tenant fully (two branches are active) + response = sk_http.tenant_delete_force(tenant_id) + assert response == { + timeline_id_3: { + "dir_existed": True, + "was_active": True, + } + } + assert not (sk_data_dir / tenant_id).exists() + assert (sk_data_dir / tenant_id_other / timeline_id_other).is_dir() + + # Remove initial tenant again. 
+ response = sk_http.tenant_delete_force(tenant_id) + assert response == {} + assert not (sk_data_dir / tenant_id).exists() + assert (sk_data_dir / tenant_id_other / timeline_id_other).is_dir() + + # Ensure the other tenant still works + sk_http.timeline_status(tenant_id_other, timeline_id_other) + with closing(pg_other.connect()) as conn: + with conn.cursor() as cur: + cur.execute('INSERT INTO t (key) VALUES (123)') diff --git a/test_runner/fixtures/zenith_fixtures.py b/test_runner/fixtures/zenith_fixtures.py index fe20f1abbf..357db4c16d 100644 --- a/test_runner/fixtures/zenith_fixtures.py +++ b/test_runner/fixtures/zenith_fixtures.py @@ -1800,6 +1800,21 @@ class SafekeeperHttpClient(requests.Session): json=body) res.raise_for_status() + def timeline_delete_force(self, tenant_id: str, timeline_id: str) -> Dict[Any, Any]: + res = self.delete( + f"http://localhost:{self.port}/v1/tenant/{tenant_id}/timeline/{timeline_id}") + res.raise_for_status() + res_json = res.json() + assert isinstance(res_json, dict) + return res_json + + def tenant_delete_force(self, tenant_id: str) -> Dict[Any, Any]: + res = self.delete(f"http://localhost:{self.port}/v1/tenant/{tenant_id}") + res.raise_for_status() + res_json = res.json() + assert isinstance(res_json, dict) + return res_json + def get_metrics(self) -> SafekeeperMetrics: request_result = self.get(f"http://localhost:{self.port}/metrics") request_result.raise_for_status() From aa7c601eca425d82e616e0fc0468dac8a2a35db2 Mon Sep 17 00:00:00 2001 From: Anastasia Lubennikova Date: Thu, 12 May 2022 20:53:40 +0300 Subject: [PATCH 241/296] Fix pitr_interval check in GC: Use timestamp->LSN mapping instead of file modification time. Fix 'latest_gc_cutoff_lsn' - set it to the minimum of pitr_cutoff and gc_cutoff. 
Add new test: test_pitr_gc --- pageserver/src/layered_repository.rs | 76 +++++++++++++++-------- test_runner/batch_others/test_pitr_gc.py | 77 ++++++++++++++++++++++++ test_runner/fixtures/utils.py | 3 +- 3 files changed, 131 insertions(+), 25 deletions(-) create mode 100644 test_runner/batch_others/test_pitr_gc.py diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index b02ab00a21..24f9bcff37 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -74,6 +74,7 @@ pub mod metadata; mod par_fsync; mod storage_layer; +use crate::pgdatadir_mapping::LsnForTimestamp; use delta_layer::{DeltaLayer, DeltaLayerWriter}; use ephemeral_file::is_ephemeral_file; use filename::{DeltaFileName, ImageFileName}; @@ -81,6 +82,7 @@ use image_layer::{ImageLayer, ImageLayerWriter}; use inmemory_layer::InMemoryLayer; use layer_map::LayerMap; use layer_map::SearchResult; +use postgres_ffi::xlog_utils::to_pg_timestamp; use storage_layer::{Layer, ValueReconstructResult, ValueReconstructState}; // re-export this function so that page_cache.rs can use it. @@ -2118,11 +2120,49 @@ impl LayeredTimeline { let cutoff = gc_info.cutoff; let pitr = gc_info.pitr; + // Calculate pitr cutoff point. + // By default, we don't want to GC anything. + let mut pitr_cutoff_lsn: Lsn = *self.get_latest_gc_cutoff_lsn(); + + if let Ok(timeline) = + tenant_mgr::get_local_timeline_with_load(self.tenant_id, self.timeline_id) + { + // First, calculate pitr_cutoff_timestamp and then convert it to LSN. + // If we don't have enough data to convert to LSN, + // play safe and don't remove any layers. + if let Some(pitr_cutoff_timestamp) = now.checked_sub(pitr) { + let pitr_timestamp = to_pg_timestamp(pitr_cutoff_timestamp); + + match timeline.find_lsn_for_timestamp(pitr_timestamp)? 
{ + LsnForTimestamp::Present(lsn) => pitr_cutoff_lsn = lsn, + LsnForTimestamp::Future(lsn) => { + debug!("future({})", lsn); + } + LsnForTimestamp::Past(lsn) => { + debug!("past({})", lsn); + } + } + debug!("pitr_cutoff_lsn = {:?}", pitr_cutoff_lsn) + } + } else { + // We don't have local timeline in mocked cargo tests. + // So, just ignore pitr_interval setting in this case. + pitr_cutoff_lsn = cutoff; + } + + let new_gc_cutoff = Lsn::min(cutoff, pitr_cutoff_lsn); + + // Nothing to GC. Return early. + if *self.get_latest_gc_cutoff_lsn() == new_gc_cutoff { + result.elapsed = now.elapsed()?; + return Ok(result); + } + let _enter = info_span!("garbage collection", timeline = %self.timeline_id, tenant = %self.tenant_id, cutoff = %cutoff).entered(); // We need to ensure that no one branches at a point before latest_gc_cutoff_lsn. // See branch_timeline() for details. - *self.latest_gc_cutoff_lsn.write().unwrap() = cutoff; + *self.latest_gc_cutoff_lsn.write().unwrap() = new_gc_cutoff; info!("GC starting"); @@ -2162,30 +2202,18 @@ impl LayeredTimeline { result.layers_needed_by_cutoff += 1; continue 'outer; } - // 2. It is newer than PiTR interval? - // We use modification time of layer file to estimate update time. - // This estimation is not quite precise but maintaining LSN->timestamp map seems to be overkill. - // It is not expected that users will need high precision here. And this estimation - // is conservative: modification time of file is always newer than actual time of version - // creation. So it is safe for users. - // TODO A possible "bloat" issue still persists here. - // If modification time changes because of layer upload/download, we will keep these files - // longer than necessary. - // https://github.com/neondatabase/neon/issues/1554 - // - if let Ok(metadata) = fs::metadata(&l.filename()) { - let last_modified = metadata.modified()?; - if now.duration_since(last_modified)? 
< pitr { - debug!( - "keeping {} because it's modification time {:?} is newer than PITR {:?}", - l.filename().display(), - last_modified, - pitr - ); - result.layers_needed_by_pitr += 1; - continue 'outer; - } + + // 2. It is newer than PiTR cutoff point? + if l.get_lsn_range().end > pitr_cutoff_lsn { + debug!( + "keeping {} because it's newer than pitr_cutoff_lsn {}", + l.filename().display(), + pitr_cutoff_lsn + ); + result.layers_needed_by_pitr += 1; + continue 'outer; } + // 3. Is it needed by a child branch? // NOTE With that wee would keep data that // might be referenced by child branches forever. diff --git a/test_runner/batch_others/test_pitr_gc.py b/test_runner/batch_others/test_pitr_gc.py new file mode 100644 index 0000000000..fe9159b4bb --- /dev/null +++ b/test_runner/batch_others/test_pitr_gc.py @@ -0,0 +1,77 @@ +import subprocess +from contextlib import closing + +import psycopg2.extras +import pytest +from fixtures.log_helper import log +from fixtures.utils import print_gc_result +from fixtures.zenith_fixtures import ZenithEnvBuilder + + +# +# Check pitr_interval GC behavior. +# Insert some data, run GC and create a branch in the past. 
+# +def test_pitr_gc(zenith_env_builder: ZenithEnvBuilder): + + zenith_env_builder.num_safekeepers = 1 + # Set pitr interval such that we need to keep the data + zenith_env_builder.pageserver_config_override = "tenant_config={pitr_interval = '1day', gc_horizon = 0}" + + env = zenith_env_builder.init_start() + pgmain = env.postgres.create_start('main') + log.info("postgres is running on 'main' branch") + + main_pg_conn = pgmain.connect() + main_cur = main_pg_conn.cursor() + + main_cur.execute("SHOW zenith.zenith_timeline") + timeline = main_cur.fetchone()[0] + + # Create table + main_cur.execute('CREATE TABLE foo (t text)') + + for i in range(10000): + main_cur.execute(''' + INSERT INTO foo + SELECT 'long string to consume some space'; + ''') + + if i == 99: + # keep some early lsn to test branch creation after GC + main_cur.execute('SELECT pg_current_wal_insert_lsn(), txid_current()') + res = main_cur.fetchone() + lsn_a = res[0] + xid_a = res[1] + log.info(f'LSN after 100 rows: {lsn_a} xid {xid_a}') + + main_cur.execute('SELECT pg_current_wal_insert_lsn(), txid_current()') + res = main_cur.fetchone() + debug_lsn = res[0] + debug_xid = res[1] + log.info(f'LSN after 10000 rows: {debug_lsn} xid {debug_xid}') + + # run GC + with closing(env.pageserver.connect()) as psconn: + with psconn.cursor(cursor_factory=psycopg2.extras.DictCursor) as pscur: + pscur.execute(f"compact {env.initial_tenant.hex} {timeline}") + # perform agressive GC. Data still should be kept because of the PITR setting. 
+ pscur.execute(f"do_gc {env.initial_tenant.hex} {timeline} 0") + row = pscur.fetchone() + print_gc_result(row) + + # Branch at the point where only 100 rows were inserted + # It must have been preserved by PITR setting + env.zenith_cli.create_branch('test_pitr_gc_hundred', 'main', ancestor_start_lsn=lsn_a) + + pg_hundred = env.postgres.create_start('test_pitr_gc_hundred') + + # On the 'hundred' branch, we should see only 100 rows + hundred_pg_conn = pg_hundred.connect() + hundred_cur = hundred_pg_conn.cursor() + hundred_cur.execute('SELECT count(*) FROM foo') + assert hundred_cur.fetchone() == (100, ) + + # All the rows are visible on the main branch + main_cur.execute('SELECT count(*) FROM foo') + assert main_cur.fetchone() == (10000, ) diff --git a/test_runner/fixtures/utils.py b/test_runner/fixtures/utils.py index 98af511036..7b95e729d9 100644 --- a/test_runner/fixtures/utils.py +++ b/test_runner/fixtures/utils.py @@ -75,7 +75,8 @@ def lsn_from_hex(lsn_hex: str) -> int: def print_gc_result(row): log.info("GC duration {elapsed} ms".format_map(row)) log.info( - " total: {layers_total}, needed_by_cutoff {layers_needed_by_cutoff}, needed_by_branches: {layers_needed_by_branches}, not_updated: {layers_not_updated}, removed: {layers_removed}" + " total: {layers_total}, needed_by_cutoff {layers_needed_by_cutoff}, needed_by_pitr {layers_needed_by_pitr}" + " needed_by_branches: {layers_needed_by_branches}, not_updated: {layers_not_updated}, removed: {layers_removed}" .format_map(row)) From a2561f0a78116fc775732cb36c7df992d4d3a07a Mon Sep 17 00:00:00 2001 From: Anastasia Lubennikova Date: Fri, 13 May 2022 16:01:41 +0300 Subject: [PATCH 242/296] Use tenant's pitr_interval instead of hardcoded 0 in the command.
Adjust python tests that use the --- pageserver/src/layered_repository.rs | 11 ++++++++--- pageserver/src/page_service.rs | 5 +++-- test_runner/batch_others/test_branch_behind.py | 2 ++ test_runner/batch_others/test_gc_aggressive.py | 11 +++++++---- .../batch_others/test_old_request_lsn.py | 17 ++++++++++++----- test_runner/batch_others/test_pitr_gc.py | 2 +- test_runner/performance/test_bulk_insert.py | 1 - test_runner/performance/test_random_writes.py | 1 - 8 files changed, 33 insertions(+), 17 deletions(-) diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 24f9bcff37..c7536cc959 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -2121,7 +2121,7 @@ impl LayeredTimeline { let pitr = gc_info.pitr; // Calculate pitr cutoff point. - // By default, we don't want to GC anything. + // If we cannot determine a cutoff LSN, be conservative and don't GC anything. let mut pitr_cutoff_lsn: Lsn = *self.get_latest_gc_cutoff_lsn(); if let Ok(timeline) = @@ -2137,6 +2137,7 @@ impl LayeredTimeline { LsnForTimestamp::Present(lsn) => pitr_cutoff_lsn = lsn, LsnForTimestamp::Future(lsn) => { debug!("future({})", lsn); + pitr_cutoff_lsn = cutoff; } LsnForTimestamp::Past(lsn) => { debug!("past({})", lsn); @@ -2144,7 +2145,7 @@ impl LayeredTimeline { } debug!("pitr_cutoff_lsn = {:?}", pitr_cutoff_lsn) } - } else { + } else if cfg!(test) { // We don't have local timeline in mocked cargo tests. // So, just ignore pitr_interval setting in this case. pitr_cutoff_lsn = cutoff; @@ -2153,7 +2154,11 @@ impl LayeredTimeline { let new_gc_cutoff = Lsn::min(cutoff, pitr_cutoff_lsn); // Nothing to GC. Return early. - if *self.get_latest_gc_cutoff_lsn() == new_gc_cutoff { + if *self.get_latest_gc_cutoff_lsn() >= new_gc_cutoff { + info!( + "Nothing to GC for timeline {}. 
cutoff_lsn {}", + self.timeline_id, new_gc_cutoff + ); result.elapsed = now.elapsed()?; return Ok(result); } diff --git a/pageserver/src/page_service.rs b/pageserver/src/page_service.rs index 88273cfa57..28d6bf2621 100644 --- a/pageserver/src/page_service.rs +++ b/pageserver/src/page_service.rs @@ -19,7 +19,6 @@ use std::net::TcpListener; use std::str; use std::str::FromStr; use std::sync::{Arc, RwLockReadGuard}; -use std::time::Duration; use tracing::*; use utils::{ auth::{self, Claims, JwtAuth, Scope}, @@ -796,7 +795,9 @@ impl postgres_backend::Handler for PageServerHandler { .unwrap_or_else(|| Ok(repo.get_gc_horizon()))?; let repo = tenant_mgr::get_repository_for_tenant(tenantid)?; - let result = repo.gc_iteration(Some(timelineid), gc_horizon, Duration::ZERO, true)?; + // Use tenant's pitr setting + let pitr = repo.get_pitr_interval(); + let result = repo.gc_iteration(Some(timelineid), gc_horizon, pitr, true)?; pgb.write_message_noflush(&BeMessage::RowDescription(&[ RowDescriptor::int8_col(b"layers_total"), RowDescriptor::int8_col(b"layers_needed_by_cutoff"), diff --git a/test_runner/batch_others/test_branch_behind.py b/test_runner/batch_others/test_branch_behind.py index 4e2be352f4..fc84af5283 100644 --- a/test_runner/batch_others/test_branch_behind.py +++ b/test_runner/batch_others/test_branch_behind.py @@ -19,6 +19,8 @@ def test_branch_behind(zenith_env_builder: ZenithEnvBuilder): # # See https://github.com/zenithdb/zenith/issues/1068 zenith_env_builder.num_safekeepers = 1 + # Disable pitr, because here we want to test branch creation after GC + zenith_env_builder.pageserver_config_override = "tenant_config={pitr_interval = '0 sec'}" env = zenith_env_builder.init_start() # Branch at the point where only 100 rows were inserted diff --git a/test_runner/batch_others/test_gc_aggressive.py b/test_runner/batch_others/test_gc_aggressive.py index e4e4aa9f4a..519a6dda1c 100644 --- a/test_runner/batch_others/test_gc_aggressive.py +++ 
b/test_runner/batch_others/test_gc_aggressive.py @@ -1,7 +1,7 @@ import asyncio import random -from fixtures.zenith_fixtures import ZenithEnv, Postgres +from fixtures.zenith_fixtures import ZenithEnv, ZenithEnvBuilder, Postgres from fixtures.log_helper import log # Test configuration @@ -50,9 +50,12 @@ async def update_and_gc(env: ZenithEnv, pg: Postgres, timeline: str): # # (repro for https://github.com/zenithdb/zenith/issues/1047) # -def test_gc_aggressive(zenith_simple_env: ZenithEnv): - env = zenith_simple_env - env.zenith_cli.create_branch("test_gc_aggressive", "empty") +def test_gc_aggressive(zenith_env_builder: ZenithEnvBuilder): + + # Disable pitr, because here we want to test branch creation after GC + zenith_env_builder.pageserver_config_override = "tenant_config={pitr_interval = '0 sec'}" + env = zenith_env_builder.init_start() + env.zenith_cli.create_branch("test_gc_aggressive", "main") pg = env.postgres.create_start('test_gc_aggressive') log.info('postgres is running on test_gc_aggressive branch') diff --git a/test_runner/batch_others/test_old_request_lsn.py b/test_runner/batch_others/test_old_request_lsn.py index e7400cff96..cf7fe09b1e 100644 --- a/test_runner/batch_others/test_old_request_lsn.py +++ b/test_runner/batch_others/test_old_request_lsn.py @@ -1,5 +1,7 @@ -from fixtures.zenith_fixtures import ZenithEnv +from fixtures.zenith_fixtures import ZenithEnvBuilder from fixtures.log_helper import log +from fixtures.utils import print_gc_result +import psycopg2.extras # @@ -12,9 +14,11 @@ from fixtures.log_helper import log # just a hint that the page hasn't been modified since that LSN, and the page # server should return the latest page version regardless of the LSN. 
# -def test_old_request_lsn(zenith_simple_env: ZenithEnv): - env = zenith_simple_env - env.zenith_cli.create_branch("test_old_request_lsn", "empty") +def test_old_request_lsn(zenith_env_builder: ZenithEnvBuilder): + # Disable pitr, because here we want to test branch creation after GC + zenith_env_builder.pageserver_config_override = "tenant_config={pitr_interval = '0 sec'}" + env = zenith_env_builder.init_start() + env.zenith_cli.create_branch("test_old_request_lsn", "main") pg = env.postgres.create_start('test_old_request_lsn') log.info('postgres is running on test_old_request_lsn branch') @@ -26,7 +30,7 @@ def test_old_request_lsn(zenith_simple_env: ZenithEnv): timeline = cur.fetchone()[0] psconn = env.pageserver.connect() - pscur = psconn.cursor() + pscur = psconn.cursor(cursor_factory=psycopg2.extras.DictCursor) # Create table, and insert some rows. Make it big enough that it doesn't fit in # shared_buffers. @@ -53,6 +57,9 @@ def test_old_request_lsn(zenith_simple_env: ZenithEnv): # garbage collections so that the page server will remove old page versions. 
for i in range(10): pscur.execute(f"do_gc {env.initial_tenant.hex} {timeline} 0") + row = pscur.fetchone() + print_gc_result(row) + for j in range(100): cur.execute('UPDATE foo SET val = val + 1 WHERE id = 1;') diff --git a/test_runner/batch_others/test_pitr_gc.py b/test_runner/batch_others/test_pitr_gc.py index fe9159b4bb..ee19bddfe8 100644 --- a/test_runner/batch_others/test_pitr_gc.py +++ b/test_runner/batch_others/test_pitr_gc.py @@ -16,7 +16,7 @@ def test_pitr_gc(zenith_env_builder: ZenithEnvBuilder): zenith_env_builder.num_safekeepers = 1 # Set pitr interval such that we need to keep the data - zenith_env_builder.pageserver_config_override = "tenant_config={pitr_interval = '1day', gc_horizon = 0}" + zenith_env_builder.pageserver_config_override = "tenant_config={pitr_interval = '1 day', gc_horizon = 0}" env = zenith_env_builder.init_start() pgmain = env.postgres.create_start('main') diff --git a/test_runner/performance/test_bulk_insert.py b/test_runner/performance/test_bulk_insert.py index 4e73bedcc0..3b57ac73cc 100644 --- a/test_runner/performance/test_bulk_insert.py +++ b/test_runner/performance/test_bulk_insert.py @@ -18,7 +18,6 @@ from fixtures.compare_fixtures import PgCompare, VanillaCompare, ZenithCompare def test_bulk_insert(zenith_with_baseline: PgCompare): env = zenith_with_baseline - # Get the timeline ID of our branch. 
We need it for the 'do_gc' command with closing(env.pg.connect()) as conn: with conn.cursor() as cur: cur.execute("create table huge (i int, j int);") diff --git a/test_runner/performance/test_random_writes.py b/test_runner/performance/test_random_writes.py index ba9eabcd97..205388bd90 100644 --- a/test_runner/performance/test_random_writes.py +++ b/test_runner/performance/test_random_writes.py @@ -8,7 +8,6 @@ from fixtures.log_helper import log import psycopg2.extras import random import time -from fixtures.utils import print_gc_result # This is a clear-box test that demonstrates the worst case scenario for the From 768c846eeb9f90450e06185ce477ed1a566a0f22 Mon Sep 17 00:00:00 2001 From: Egor Suvorov Date: Fri, 13 May 2022 17:06:25 +0300 Subject: [PATCH 243/296] Fix test_delete_force from #1653 conflicting with #1692 --- test_runner/batch_others/test_wal_acceptor.py | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/test_runner/batch_others/test_wal_acceptor.py b/test_runner/batch_others/test_wal_acceptor.py index e297f91f2c..67c9d6070e 100644 --- a/test_runner/batch_others/test_wal_acceptor.py +++ b/test_runner/batch_others/test_wal_acceptor.py @@ -863,16 +863,16 @@ def test_delete_force(zenith_env_builder: ZenithEnvBuilder): timeline_id_3 = env.zenith_cli.create_branch('br3').hex # Active, delete with the tenant timeline_id_4 = env.zenith_cli.create_branch('br4').hex # Inactive, delete with the tenant - tenant_id_other = env.zenith_cli.create_tenant().hex - timeline_id_other = env.zenith_cli.create_root_branch( - 'br-other', tenant_id=uuid.UUID(hex=tenant_id_other)).hex + tenant_id_other_uuid, timeline_id_other_uuid = env.zenith_cli.create_tenant() + tenant_id_other = tenant_id_other_uuid.hex + timeline_id_other = timeline_id_other_uuid.hex # Populate branches pg_1 = env.postgres.create_start('br1') pg_2 = env.postgres.create_start('br2') pg_3 = env.postgres.create_start('br3') pg_4 = env.postgres.create_start('br4') - pg_other = 
env.postgres.create_start('br-other', tenant_id=uuid.UUID(hex=tenant_id_other)) + pg_other = env.postgres.create_start('main', tenant_id=uuid.UUID(hex=tenant_id_other)) for pg in [pg_1, pg_2, pg_3, pg_4, pg_other]: with closing(pg.connect()) as conn: with conn.cursor() as cur: From cded72a580266d978fee5260be9e0e56abbb42b9 Mon Sep 17 00:00:00 2001 From: Andrey Taranik Date: Fri, 13 May 2022 20:41:54 +0300 Subject: [PATCH 244/296] remove sk-2 from staging inventory list (#1699) --- .circleci/ansible/staging.hosts | 1 - 1 file changed, 1 deletion(-) diff --git a/.circleci/ansible/staging.hosts b/.circleci/ansible/staging.hosts index b2bacb89ca..8e89e843d9 100644 --- a/.circleci/ansible/staging.hosts +++ b/.circleci/ansible/staging.hosts @@ -4,7 +4,6 @@ zenith-us-stage-ps-2 console_region_id=27 [safekeepers] zenith-us-stage-sk-1 console_region_id=27 -zenith-us-stage-sk-2 console_region_id=27 zenith-us-stage-sk-4 console_region_id=27 zenith-us-stage-sk-5 console_region_id=27 From 081d5dac5eba534bac74624e0f935d4c0b28af6b Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Fri, 13 May 2022 21:41:00 +0300 Subject: [PATCH 245/296] Bump vendor/postgres. Includes change to reduce log noise from inmem_smgr. --- vendor/postgres | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vendor/postgres b/vendor/postgres index d62ec22eff..1db115cecb 160000 --- a/vendor/postgres +++ b/vendor/postgres @@ -1 +1 @@ -Subproject commit d62ec22effeca7b5794ab2c15a3fd9ee5a4a5b99 +Subproject commit 1db115cecb3dbc2a74c5efa964fdf3a8a341c4d2 From a10cac980f703bf5ec50e37a14aac5e6d6261525 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Sun, 15 May 2022 00:25:38 +0300 Subject: [PATCH 246/296] Continue with pageserver startup, if loading some tenants fail. 
Fixes https://github.com/neondatabase/neon/issues/1664 --- pageserver/src/tenant_mgr.rs | 83 ++++++++++++------- .../batch_others/test_broken_timeline.py | 80 ++++++++++++++++++ 2 files changed, 135 insertions(+), 28 deletions(-) create mode 100644 test_runner/batch_others/test_broken_timeline.py diff --git a/pageserver/src/tenant_mgr.rs b/pageserver/src/tenant_mgr.rs index 20a723b5b5..9bde9a5c4a 100644 --- a/pageserver/src/tenant_mgr.rs +++ b/pageserver/src/tenant_mgr.rs @@ -78,6 +78,9 @@ pub enum TenantState { // The local disk might have some newer files that don't exist in cloud storage yet. // The tenant cannot be accessed anymore for any reason, but graceful shutdown. Stopping, + + // Something went wrong loading the tenant state + Broken, } impl fmt::Display for TenantState { @@ -86,6 +89,7 @@ impl fmt::Display for TenantState { TenantState::Active => f.write_str("Active"), TenantState::Idle => f.write_str("Idle"), TenantState::Stopping => f.write_str("Stopping"), + TenantState::Broken => f.write_str("Broken"), } } } @@ -99,7 +103,22 @@ pub fn init_tenant_mgr(conf: &'static PageServerConf) -> anyhow::Result { + tenant.state = TenantState::Stopping; + tenantids.push(*tenantid) + } + TenantState::Broken => {} + } } drop(m); @@ -270,6 +294,10 @@ pub fn activate_tenant(tenant_id: ZTenantId) -> anyhow::Result<()> { TenantState::Stopping => { // don't re-activate it if it's being stopped } + + TenantState::Broken => { + // cannot activate + } } Ok(()) } @@ -370,38 +398,37 @@ pub fn list_tenants() -> Vec { .collect() } -fn init_local_repositories( +fn init_local_repository( conf: &'static PageServerConf, - local_timeline_init_statuses: HashMap>, + tenant_id: ZTenantId, + local_timeline_init_statuses: HashMap, remote_index: &RemoteIndex, ) -> anyhow::Result<(), anyhow::Error> { - for (tenant_id, local_timeline_init_statuses) in local_timeline_init_statuses { - // initialize local tenant - let repo = load_local_repo(conf, tenant_id, remote_index) - .with_context(|| 
format!("Failed to load repo for tenant {tenant_id}"))?; + // initialize local tenant + let repo = load_local_repo(conf, tenant_id, remote_index) + .with_context(|| format!("Failed to load repo for tenant {tenant_id}"))?; - let mut status_updates = HashMap::with_capacity(local_timeline_init_statuses.len()); - for (timeline_id, init_status) in local_timeline_init_statuses { - match init_status { - LocalTimelineInitStatus::LocallyComplete => { - debug!("timeline {timeline_id} for tenant {tenant_id} is locally complete, registering it in repository"); - status_updates.insert(timeline_id, TimelineSyncStatusUpdate::Downloaded); - } - LocalTimelineInitStatus::NeedsSync => { - debug!( - "timeline {tenant_id} for tenant {timeline_id} needs sync, \ - so skipped for adding into repository until sync is finished" - ); - } + let mut status_updates = HashMap::with_capacity(local_timeline_init_statuses.len()); + for (timeline_id, init_status) in local_timeline_init_statuses { + match init_status { + LocalTimelineInitStatus::LocallyComplete => { + debug!("timeline {timeline_id} for tenant {tenant_id} is locally complete, registering it in repository"); + status_updates.insert(timeline_id, TimelineSyncStatusUpdate::Downloaded); + } + LocalTimelineInitStatus::NeedsSync => { + debug!( + "timeline {tenant_id} for tenant {timeline_id} needs sync, \ + so skipped for adding into repository until sync is finished" + ); } } - - // Lets fail here loudly to be on the safe side. - // XXX: It may be a better api to actually distinguish between repository startup - // and processing of newly downloaded timelines. - apply_timeline_remote_sync_status_updates(&repo, status_updates) - .with_context(|| format!("Failed to bootstrap timelines for tenant {tenant_id}"))? } + + // Lets fail here loudly to be on the safe side. + // XXX: It may be a better api to actually distinguish between repository startup + // and processing of newly downloaded timelines. 
+ apply_timeline_remote_sync_status_updates(&repo, status_updates) + .with_context(|| format!("Failed to bootstrap timelines for tenant {tenant_id}"))?; Ok(()) } diff --git a/test_runner/batch_others/test_broken_timeline.py b/test_runner/batch_others/test_broken_timeline.py new file mode 100644 index 0000000000..17eadb33b4 --- /dev/null +++ b/test_runner/batch_others/test_broken_timeline.py @@ -0,0 +1,80 @@ +import pytest +from contextlib import closing +from fixtures.zenith_fixtures import ZenithEnvBuilder +from fixtures.log_helper import log +import os + + +# Test restarting page server, while safekeeper and compute node keep +# running. +def test_broken_timeline(zenith_env_builder: ZenithEnvBuilder): + # One safekeeper is enough for this test. + zenith_env_builder.num_safekeepers = 3 + env = zenith_env_builder.init_start() + + tenant_timelines = [] + + for n in range(4): + tenant_id_uuid, timeline_id_uuid = env.zenith_cli.create_tenant() + tenant_id = tenant_id_uuid.hex + timeline_id = timeline_id_uuid.hex + + pg = env.postgres.create_start(f'main', tenant_id=tenant_id_uuid) + with closing(pg.connect()) as conn: + with conn.cursor() as cur: + cur.execute("CREATE TABLE t(key int primary key, value text)") + cur.execute("INSERT INTO t SELECT generate_series(1,100), 'payload'") + + cur.execute("SHOW zenith.zenith_timeline") + timeline_id = cur.fetchone()[0] + pg.stop() + tenant_timelines.append((tenant_id, timeline_id, pg)) + + # Stop the pageserver + env.pageserver.stop() + + # Leave the first timeline alone, but corrupt the others in different ways + (tenant0, timeline0, pg0) = tenant_timelines[0] + + # Corrupt metadata file on timeline 1 + (tenant1, timeline1, pg1) = tenant_timelines[1] + metadata_path = "{}/tenants/{}/timelines/{}/metadata".format(env.repo_dir, tenant1, timeline1) + print(f'overwriting metadata file at {metadata_path}') + f = open(metadata_path, "w") + f.write("overwritten with garbage!") + f.close() + + # Missing layer files file on timeline 
2. (This would actually work + # if we had Cloud Storage enabled in this test.) + (tenant2, timeline2, pg2) = tenant_timelines[2] + timeline_path = "{}/tenants/{}/timelines/{}/".format(env.repo_dir, tenant2, timeline2) + for filename in os.listdir(timeline_path): + if filename.startswith('00000'): + # Looks like a layer file. Remove it + os.remove(f'{timeline_path}/{filename}') + + # Corrupt layer files file on timeline 3 + (tenant3, timeline3, pg3) = tenant_timelines[3] + timeline_path = "{}/tenants/{}/timelines/{}/".format(env.repo_dir, tenant3, timeline3) + for filename in os.listdir(timeline_path): + if filename.startswith('00000'): + # Looks like a layer file. Corrupt it + f = open(f'{timeline_path}/{filename}', "w") + f.write("overwritten with garbage!") + f.close() + + env.pageserver.start() + + # Tenant 0 should still work + pg0.start() + with closing(pg0.connect()) as conn: + with conn.cursor() as cur: + cur.execute("SELECT COUNT(*) FROM t") + assert cur.fetchone()[0] == 100 + + # But all others are broken + for n in range(1, 4): + (tenant, timeline, pg) = tenant_timelines[n] + with pytest.raises(Exception, match="Cannot load local timeline") as err: + pg.start() + log.info(f'compute startup failed as expected: {err}') From 51ea9c3053c9ab5d2be837c2eeb0dd149b038229 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Mon, 16 May 2022 09:58:58 +0300 Subject: [PATCH 247/296] Don't swallow panics when the pageserver is built with failpoints. Swallowing them is very confusing, and because you don't get a stack trace or error message in the logs, it makes debugging very hard. However, the 'test_pageserver_recovery' test relied on that behavior. To support that, add a new "exit" action to the pageserver 'failpoints' command, so that you can explicitly request to exit the process when a failpoint is hit.
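A minimal sketch of the dispatch this message describes, assuming the `failpoints` command takes a `name=actions;name=actions` string and that only the literal action `exit` gets special handling. The names `FailpointAction` and `parse_failpoints` are hypothetical and exist only for this illustration. The sketch classifies the actions and does not call the failpoints crate at all, whereas the patch itself registers a callback for `exit` and hands every other action string to the crate unchanged.

```rust
/// How this sketch classifies a configured failpoint action.
#[derive(Debug, PartialEq, Eq)]
enum FailpointAction {
    /// The extra "exit" action: the process should terminate immediately.
    Exit,
    /// Any other action string, kept verbatim for the failpoints library.
    Native(String),
}

/// Parse a `name=actions;name=actions` string into (name, action) pairs.
/// An entry without '=' is rejected, like the "Invalid failpoints format"
/// error in the handler.
fn parse_failpoints(spec: &str) -> Result<Vec<(String, FailpointAction)>, String> {
    let mut parsed = Vec::new();
    for failpoint in spec.split(';') {
        match failpoint.split_once('=') {
            // Only the exact string "exit" is treated specially.
            Some((name, "exit")) => parsed.push((name.to_string(), FailpointAction::Exit)),
            Some((name, actions)) => {
                parsed.push((name.to_string(), FailpointAction::Native(actions.to_string())))
            }
            None => return Err(format!("Invalid failpoints format: {failpoint}")),
        }
    }
    Ok(parsed)
}
```

Under these assumptions, the string used by 'test_pageserver_recovery', `checkpoint-before-sync=sleep(2000);checkpoint-after-sync=exit`, yields one native action followed by one exit action.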
--- pageserver/src/bin/pageserver.rs | 7 +------ pageserver/src/page_service.rs | 13 ++++++++++++- test_runner/batch_others/test_recovery.py | 4 ++-- 3 files changed, 15 insertions(+), 9 deletions(-) diff --git a/pageserver/src/bin/pageserver.rs b/pageserver/src/bin/pageserver.rs index 190e38e341..c6cb460f8f 100644 --- a/pageserver/src/bin/pageserver.rs +++ b/pageserver/src/bin/pageserver.rs @@ -183,13 +183,8 @@ fn main() -> anyhow::Result<()> { // as a ref. let conf: &'static PageServerConf = Box::leak(Box::new(conf)); - // If failpoints are used, terminate the whole pageserver process if they are hit. + // Initialize up failpoints support let scenario = FailScenario::setup(); - if fail::has_failpoints() { - std::panic::set_hook(Box::new(|_| { - std::process::exit(1); - })); - } // Basic initialization of things that don't change after startup virtual_file::init(conf.max_file_descriptors); diff --git a/pageserver/src/page_service.rs b/pageserver/src/page_service.rs index 28d6bf2621..03264c9782 100644 --- a/pageserver/src/page_service.rs +++ b/pageserver/src/page_service.rs @@ -730,7 +730,18 @@ impl postgres_backend::Handler for PageServerHandler { for failpoint in failpoints.split(';') { if let Some((name, actions)) = failpoint.split_once('=') { info!("cfg failpoint: {} {}", name, actions); - fail::cfg(name, actions).unwrap(); + + // We recognize one extra "action" that's not natively recognized + // by the failpoints crate: exit, to immediately kill the process + if actions == "exit" { + fail::cfg_callback(name, || { + info!("Exit requested by failpoint"); + std::process::exit(1); + }) + .unwrap(); + } else { + fail::cfg(name, actions).unwrap(); + } } else { bail!("Invalid failpoints format"); } diff --git a/test_runner/batch_others/test_recovery.py b/test_runner/batch_others/test_recovery.py index dbfa943a7a..eb1747efa5 100644 --- a/test_runner/batch_others/test_recovery.py +++ b/test_runner/batch_others/test_recovery.py @@ -45,14 +45,14 @@ def 
test_pageserver_recovery(zenith_env_builder: ZenithEnvBuilder): # Configure failpoints pscur.execute( - "failpoints checkpoint-before-sync=sleep(2000);checkpoint-after-sync=panic") + "failpoints checkpoint-before-sync=sleep(2000);checkpoint-after-sync=exit") # Do some updates until pageserver is crashed try: while True: cur.execute("update foo set x=x+1") except Exception as err: - log.info(f"Excepted server crash {err}") + log.info(f"Expected server crash {err}") log.info("Wait before server restart") env.pageserver.stop() From 33cac863d74acb2bafc2f51cf364bf26b2d4d8c4 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Fri, 13 May 2022 17:04:51 +0300 Subject: [PATCH 248/296] Test simple.conf and handle broker_endpoints better --- control_plane/src/local_env.rs | 102 +++++++++++++++++------- control_plane/src/safekeeper.rs | 23 ++++-- libs/remote_storage/src/lib.rs | 3 +- neon_local/src/main.rs | 6 +- test_runner/fixtures/zenith_fixtures.py | 2 +- 5 files changed, 99 insertions(+), 37 deletions(-) diff --git a/control_plane/src/local_env.rs b/control_plane/src/local_env.rs index 5aeff505b6..35167ebabf 100644 --- a/control_plane/src/local_env.rs +++ b/control_plane/src/local_env.rs @@ -4,6 +4,7 @@ //! script which will use local paths. use anyhow::{bail, ensure, Context}; +use reqwest::Url; use serde::{Deserialize, Serialize}; use serde_with::{serde_as, DisplayFromStr}; use std::collections::HashMap; @@ -59,9 +60,10 @@ pub struct LocalEnv { #[serde(default)] pub private_key_path: PathBuf, - // A comma separated broker (etcd) endpoints for storage nodes coordination, e.g. 'http://127.0.0.1:2379'. + // Broker (etcd) endpoints for storage nodes coordination, e.g. 'http://127.0.0.1:2379'. #[serde(default)] - pub broker_endpoints: Option, + #[serde_as(as = "Vec")] + pub broker_endpoints: Vec, /// A prefix to all to any key when pushing/polling etcd from a node. 
#[serde(default)] @@ -184,12 +186,7 @@ impl LocalEnv { if old_timeline_id == &timeline_id { Ok(()) } else { - bail!( - "branch '{}' is already mapped to timeline {}, cannot map to another timeline {}", - branch_name, - old_timeline_id, - timeline_id - ); + bail!("branch '{branch_name}' is already mapped to timeline {old_timeline_id}, cannot map to another timeline {timeline_id}"); } } else { existing_values.push((tenant_id, timeline_id)); @@ -225,7 +222,7 @@ impl LocalEnv { /// /// Unlike 'load_config', this function fills in any defaults that are missing /// from the config file. - pub fn create_config(toml: &str) -> anyhow::Result { + pub fn parse_config(toml: &str) -> anyhow::Result { let mut env: LocalEnv = toml::from_str(toml)?; // Find postgres binaries. @@ -238,25 +235,20 @@ impl LocalEnv { env.pg_distrib_dir = cwd.join("tmp_install") } } - if !env.pg_distrib_dir.join("bin/postgres").exists() { - bail!( - "Can't find postgres binary at {}", - env.pg_distrib_dir.display() - ); - } // Find zenith binaries. if env.zenith_distrib_dir == Path::new("") { - env.zenith_distrib_dir = env::current_exe()?.parent().unwrap().to_owned(); - } - for binary in ["pageserver", "safekeeper"] { - if !env.zenith_distrib_dir.join(binary).exists() { - bail!( - "Can't find binary '{}' in zenith distrib dir '{}'", - binary, - env.zenith_distrib_dir.display() - ); - } + let current_exec_path = + env::current_exe().context("Failed to find current excecutable's path")?; + env.zenith_distrib_dir = current_exec_path + .parent() + .with_context(|| { + format!( + "Failed to find a parent directory for executable {}", + current_exec_path.display(), + ) + })? + .to_owned(); } // If no initial tenant ID was given, generate it. @@ -351,6 +343,20 @@ impl LocalEnv { "directory '{}' already exists. 
Perhaps already initialized?", base_path.display() ); + for binary in ["pageserver", "safekeeper"] { + if !self.zenith_distrib_dir.join(binary).exists() { + bail!( + "Can't find binary '{binary}' in zenith distrib dir '{}'", + self.zenith_distrib_dir.display() + ); + } + } + if !self.pg_distrib_dir.join("bin/postgres").exists() { + bail!( + "Can't find postgres binary at {}", + self.pg_distrib_dir.display() + ); + } fs::create_dir(&base_path)?; @@ -408,7 +414,49 @@ impl LocalEnv { fn base_path() -> PathBuf { match std::env::var_os("ZENITH_REPO_DIR") { - Some(val) => PathBuf::from(val.to_str().unwrap()), - None => ".zenith".into(), + Some(val) => PathBuf::from(val), + None => PathBuf::from(".zenith"), + } +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn simple_conf_parsing() { + let simple_conf_toml = include_str!("../simple.conf"); + let simple_conf_parse_result = LocalEnv::parse_config(simple_conf_toml); + assert!( + simple_conf_parse_result.is_ok(), + "failed to parse simple config {simple_conf_toml}, reason: {simple_conf_parse_result:?}" + ); + + let regular_url_string = "broker_endpoints = ['localhost:1111']"; + let regular_url_toml = simple_conf_toml.replace( + "[pageserver]", + &format!("\n{regular_url_string}\n[pageserver]"), + ); + match LocalEnv::parse_config(®ular_url_toml) { + Ok(regular_url_parsed) => { + assert_eq!( + regular_url_parsed.broker_endpoints, + vec!["localhost:1111".parse().unwrap()], + "Unexpectedly parsed broker endpoint url" + ); + } + Err(e) => panic!("failed to parse simple config {regular_url_toml}, reason: {e}"), + } + + let spoiled_url_string = "broker_endpoints = ['!@$XOXO%^&']"; + let spoiled_url_toml = simple_conf_toml.replace( + "[pageserver]", + &format!("\n{spoiled_url_string}\n[pageserver]"), + ); + let spoiled_url_parse_result = LocalEnv::parse_config(&spoiled_url_toml); + assert!( + spoiled_url_parse_result.is_err(), + "expected toml with invalid Url {spoiled_url_toml} to fail the parsing, but got 
{spoiled_url_parse_result:?}" + ); } } diff --git a/control_plane/src/safekeeper.rs b/control_plane/src/safekeeper.rs index 074ee72f69..aeeb4a50ec 100644 --- a/control_plane/src/safekeeper.rs +++ b/control_plane/src/safekeeper.rs @@ -12,7 +12,7 @@ use nix::sys::signal::{kill, Signal}; use nix::unistd::Pid; use postgres::Config; use reqwest::blocking::{Client, RequestBuilder, Response}; -use reqwest::{IntoUrl, Method}; +use reqwest::{IntoUrl, Method, Url}; use safekeeper::http::models::TimelineCreateRequest; use thiserror::Error; use utils::{ @@ -52,7 +52,7 @@ impl ResponseErrorMessageExt for Response { Err(SafekeeperHttpError::Response( match self.json::() { Ok(err_body) => format!("Error: {}", err_body.msg), - Err(_) => format!("Http error ({}) at {}.", status.as_u16(), url), + Err(_) => format!("Http error ({}) at {url}.", status.as_u16()), }, )) } @@ -76,7 +76,7 @@ pub struct SafekeeperNode { pub pageserver: Arc, - broker_endpoints: Option, + broker_endpoints: Vec, broker_etcd_prefix: Option, } @@ -142,8 +142,21 @@ impl SafekeeperNode { if !self.conf.sync { cmd.arg("--no-sync"); } - if let Some(ref ep) = self.broker_endpoints { - cmd.args(&["--broker-endpoints", ep]); + + if !self.broker_endpoints.is_empty() { + cmd.args(&[ + "--broker-endpoints", + &self.broker_endpoints.iter().map(Url::as_str).fold( + String::new(), + |mut comma_separated_urls, url| { + if !comma_separated_urls.is_empty() { + comma_separated_urls.push(','); + } + comma_separated_urls.push_str(url); + comma_separated_urls + }, + ), + ]); } if let Some(prefix) = self.broker_etcd_prefix.as_deref() { cmd.args(&["--broker-etcd-prefix", prefix]); diff --git a/libs/remote_storage/src/lib.rs b/libs/remote_storage/src/lib.rs index 9bbb855dd5..8092e4fc49 100644 --- a/libs/remote_storage/src/lib.rs +++ b/libs/remote_storage/src/lib.rs @@ -87,7 +87,8 @@ pub trait RemoteStorage: Send + Sync { async fn delete(&self, path: &Self::RemoteObjectId) -> anyhow::Result<()>; } -/// TODO kb +/// Every storage, 
currently supported. +/// Serves as a simple way to pass around the [`RemoteStorage`] without dealing with generics. pub enum GenericRemoteStorage { Local(LocalFs), S3(S3Bucket), diff --git a/neon_local/src/main.rs b/neon_local/src/main.rs index 6538cdefc4..e5ac46d3b1 100644 --- a/neon_local/src/main.rs +++ b/neon_local/src/main.rs @@ -275,7 +275,7 @@ fn main() -> Result<()> { "pageserver" => handle_pageserver(sub_args, &env), "pg" => handle_pg(sub_args, &env), "safekeeper" => handle_safekeeper(sub_args, &env), - _ => bail!("unexpected subcommand {}", sub_name), + _ => bail!("unexpected subcommand {sub_name}"), }; if original_env != env { @@ -289,7 +289,7 @@ fn main() -> Result<()> { Ok(Some(updated_env)) => updated_env.persist_config(&updated_env.base_data_dir)?, Ok(None) => (), Err(e) => { - eprintln!("command failed: {:?}", e); + eprintln!("command failed: {e:?}"); exit(1); } } @@ -482,7 +482,7 @@ fn handle_init(init_match: &ArgMatches) -> Result { }; let mut env = - LocalEnv::create_config(&toml_file).context("Failed to create neon configuration")?; + LocalEnv::parse_config(&toml_file).context("Failed to create neon configuration")?; env.init().context("Failed to initialize neon repository")?; // default_tenantid was generated by the `env.init()` call above diff --git a/test_runner/fixtures/zenith_fixtures.py b/test_runner/fixtures/zenith_fixtures.py index 357db4c16d..50b7ef6dbb 100644 --- a/test_runner/fixtures/zenith_fixtures.py +++ b/test_runner/fixtures/zenith_fixtures.py @@ -558,7 +558,7 @@ class ZenithEnv: port=self.port_distributor.get_port(), peer_port=self.port_distributor.get_port()) toml += textwrap.dedent(f""" - broker_endpoints = 'http://127.0.0.1:{self.broker.port}' + broker_endpoints = ['http://127.0.0.1:{self.broker.port}'] """) # Create config for pageserver From c700032dd2735bfb7c8053be40fc8ffa34a575df Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Mon, 16 May 2022 14:40:49 +0300 Subject: [PATCH 249/296] Run the regression tests in CI 
also for PRs opened from forked repos. --- .github/workflows/testing.yml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/.github/workflows/testing.yml b/.github/workflows/testing.yml index 6d109b9bb5..79b2ba05d0 100644 --- a/.github/workflows/testing.yml +++ b/.github/workflows/testing.yml @@ -1,6 +1,8 @@ name: Build and Test -on: push +on: + pull_request: + push: jobs: regression-check: From c41549f630fa7adbe360f78be9c8f94952cfe4eb Mon Sep 17 00:00:00 2001 From: chaitanya sharma <86035+phoenix24@users.noreply.github.com> Date: Mon, 16 May 2022 20:12:08 +0530 Subject: [PATCH 250/296] Update readme build for osx (#1709) --- README.md | 61 ++++++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 51 insertions(+), 10 deletions(-) diff --git a/README.md b/README.md index af384d2672..39cbd2a222 100644 --- a/README.md +++ b/README.md @@ -23,6 +23,8 @@ Pageserver consists of: ## Running local installation + +#### building on Ubuntu/ Debian (Linux) 1. Install build dependencies and other useful packages On Ubuntu or Debian this set of packages should be sufficient to build the code: @@ -31,21 +33,60 @@ apt install build-essential libtool libreadline-dev zlib1g-dev flex bison libsec libssl-dev clang pkg-config libpq-dev ``` -[Rust] 1.58 or later is also required. +2. [Install Rust](https://www.rust-lang.org/tools/install) +``` +# recommended approach from https://www.rust-lang.org/tools/install +curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh +``` -To run the `psql` client, install the `postgresql-client` package or modify `PATH` and `LD_LIBRARY_PATH` to include `tmp_install/bin` and `tmp_install/lib`, respectively. +3. Install PostgreSQL Client +``` +apt install postgresql-client +``` -To run the integration tests or Python scripts (not required to use the code), install -Python (3.7 or higher), and install python3 packages using `./scripts/pysync` (requires poetry) in the project directory. - -2. 
Build neon and patched postgres +4. Build neon and patched postgres ```sh git clone --recursive https://github.com/neondatabase/neon.git cd neon make -j5 ``` -3. Start pageserver and postgres on top of it (should be called from repo root): + +#### building on OSX (12.3.1) +1. Install XCode +``` +xcode-select --install +``` + +2. [Install Rust](https://www.rust-lang.org/tools/install) +``` +# recommended approach from https://www.rust-lang.org/tools/install +curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh +``` + +3. Install PostgreSQL Client +``` +# from https://stackoverflow.com/questions/44654216/correct-way-to-install-psql-without-full-postgres-on-macos +brew install libpq +brew link --force libpq +``` + +4. Build neon and patched postgres +```sh +git clone --recursive https://github.com/neondatabase/neon.git +cd neon +make -j5 +``` + +#### dependency installation notes +To run the `psql` client, install the `postgresql-client` package or modify `PATH` and `LD_LIBRARY_PATH` to include `tmp_install/bin` and `tmp_install/lib`, respectively. + +To run the integration tests or Python scripts (not required to use the code), install +Python (3.7 or higher), and install python3 packages using `./scripts/pysync` (requires poetry) in the project directory. + + +#### running neon database +1. Start pageserver and postgres on top of it (should be called from repo root): ```sh # Create repository in .zenith with proper paths to binaries and data # Later that would be responsibility of a package install script @@ -75,7 +116,7 @@ Starting postgres node at 'host=127.0.0.1 port=55432 user=zenith_admin dbname=po main 127.0.0.1:55432 de200bd42b49cc1814412c7e592dd6e9 main 0/16B5BA8 running ``` -4. Now it is possible to connect to postgres and run some queries: +2. 
Now it is possible to connect to postgres and run some queries: ```text > psql -p55432 -h 127.0.0.1 -U zenith_admin postgres postgres=# CREATE TABLE t(key int primary key, value text); @@ -89,7 +130,7 @@ postgres=# select * from t; (1 row) ``` -5. And create branches and run postgres on them: +3. And create branches and run postgres on them: ```sh # create branch named migration_check > ./target/debug/neon_local timeline branch --branch-name migration_check @@ -133,7 +174,7 @@ postgres=# select * from t; (1 row) ``` -6. If you want to run tests afterwards (see below), you have to stop all the running the pageserver, safekeeper and postgres instances +4. If you want to run tests afterwards (see below), you have to stop all the running the pageserver, safekeeper and postgres instances you have just started. You can stop them all with one command: ```sh > ./target/debug/neon_local stop From e4a70faa08a480caa648a533c9ca579db8709fad Mon Sep 17 00:00:00 2001 From: Thang Pham Date: Mon, 16 May 2022 11:05:43 -0400 Subject: [PATCH 251/296] Add more information to timeline-related APIs (#1673) Resolves #1488. 
- implemented `GET tenant/:tenant_id/timeline/:timeline_id/wal_receiver` endpoint - returned `thread_id` in `thread_mgr::spawn` - added `latest_gc_cutoff_lsn` field to `LocalTimelineInfo` struct --- pageserver/src/http/openapi_spec.yml | 62 ++++++++++++++++ pageserver/src/http/routes.rs | 28 ++++++++ pageserver/src/tenant_mgr.rs | 1 + pageserver/src/thread_mgr.rs | 4 +- pageserver/src/timelines.rs | 4 ++ pageserver/src/walreceiver.rs | 72 +++++++++++++++---- .../batch_others/test_pageserver_api.py | 41 ++++++++++- test_runner/fixtures/zenith_fixtures.py | 9 +++ 8 files changed, 204 insertions(+), 17 deletions(-) diff --git a/pageserver/src/http/openapi_spec.yml b/pageserver/src/http/openapi_spec.yml index 9932a2d08d..55f7b3c5a7 100644 --- a/pageserver/src/http/openapi_spec.yml +++ b/pageserver/src/http/openapi_spec.yml @@ -123,6 +123,53 @@ paths: schema: $ref: "#/components/schemas/Error" + /v1/tenant/{tenant_id}/timeline/{timeline_id}/wal_receiver: + parameters: + - name: tenant_id + in: path + required: true + schema: + type: string + format: hex + - name: timeline_id + in: path + required: true + schema: + type: string + format: hex + get: + description: Get wal receiver's data attached to the timeline + responses: + "200": + description: WalReceiverEntry + content: + application/json: + schema: + $ref: "#/components/schemas/WalReceiverEntry" + "401": + description: Unauthorized Error + content: + application/json: + schema: + $ref: "#/components/schemas/UnauthorizedError" + "403": + description: Forbidden Error + content: + application/json: + schema: + $ref: "#/components/schemas/ForbiddenError" + "404": + description: Error when no wal receiver is running or found + content: + application/json: + schema: + $ref: "#/components/schemas/NotFoundError" + "500": + description: Generic operation error + content: + application/json: + schema: + $ref: "#/components/schemas/Error" /v1/tenant/{tenant_id}/timeline/{timeline_id}/attach: parameters: @@ -520,6 +567,21 @@ 
components: type: integer current_logical_size_non_incremental: type: integer + WalReceiverEntry: + type: object + required: + - thread_id + - wal_producer_connstr + properties: + thread_id: + type: integer + wal_producer_connstr: + type: string + last_received_msg_lsn: + type: string + format: hex + last_received_msg_ts: + type: integer Error: type: object diff --git a/pageserver/src/http/routes.rs b/pageserver/src/http/routes.rs index 0104df826e..bb650a34ed 100644 --- a/pageserver/src/http/routes.rs +++ b/pageserver/src/http/routes.rs @@ -224,6 +224,30 @@ async fn timeline_detail_handler(request: Request) -> Result) -> Result, ApiError> { + let tenant_id: ZTenantId = parse_request_param(&request, "tenant_id")?; + check_permission(&request, Some(tenant_id))?; + + let timeline_id: ZTimelineId = parse_request_param(&request, "timeline_id")?; + + let wal_receiver = tokio::task::spawn_blocking(move || { + let _enter = + info_span!("wal_receiver_get", tenant = %tenant_id, timeline = %timeline_id).entered(); + + crate::walreceiver::get_wal_receiver_entry(tenant_id, timeline_id) + }) + .await + .map_err(ApiError::from_err)? 
+ .ok_or_else(|| { + ApiError::NotFound(format!( + "WAL receiver not found for tenant {} and timeline {}", + tenant_id, timeline_id + )) + })?; + + json_response(StatusCode::OK, wal_receiver) +} + async fn timeline_attach_handler(request: Request) -> Result, ApiError> { let tenant_id: ZTenantId = parse_request_param(&request, "tenant_id")?; check_permission(&request, Some(tenant_id))?; @@ -485,6 +509,10 @@ pub fn make_router( "/v1/tenant/:tenant_id/timeline/:timeline_id", timeline_detail_handler, ) + .get( + "/v1/tenant/:tenant_id/timeline/:timeline_id/wal_receiver", + wal_receiver_get_handler, + ) .post( "/v1/tenant/:tenant_id/timeline/:timeline_id/attach", timeline_attach_handler, diff --git a/pageserver/src/tenant_mgr.rs b/pageserver/src/tenant_mgr.rs index 9bde9a5c4a..bbe66d7f80 100644 --- a/pageserver/src/tenant_mgr.rs +++ b/pageserver/src/tenant_mgr.rs @@ -281,6 +281,7 @@ pub fn activate_tenant(tenant_id: ZTenantId) -> anyhow::Result<()> { false, move || crate::tenant_threads::gc_loop(tenant_id), ) + .map(|_thread_id| ()) // update the `Result::Ok` type to match the outer function's return signature .with_context(|| format!("Failed to launch GC thread for tenant {tenant_id}")); if let Err(e) = &gc_spawn_result { diff --git a/pageserver/src/thread_mgr.rs b/pageserver/src/thread_mgr.rs index b908f220ee..473cddda58 100644 --- a/pageserver/src/thread_mgr.rs +++ b/pageserver/src/thread_mgr.rs @@ -139,7 +139,7 @@ pub fn spawn( name: &str, shutdown_process_on_error: bool, f: F, -) -> std::io::Result<()> +) -> std::io::Result where F: FnOnce() -> anyhow::Result<()> + Send + 'static, { @@ -193,7 +193,7 @@ where drop(jh_guard); // The thread is now running. Nothing more to do here - Ok(()) + Ok(thread_id) } /// This wrapper function runs in a newly-spawned thread. 
It initializes the diff --git a/pageserver/src/timelines.rs b/pageserver/src/timelines.rs index 7cfd33c40b..eadf5bf4e0 100644 --- a/pageserver/src/timelines.rs +++ b/pageserver/src/timelines.rs @@ -45,6 +45,8 @@ pub struct LocalTimelineInfo { #[serde_as(as = "Option")] pub prev_record_lsn: Option, #[serde_as(as = "DisplayFromStr")] + pub latest_gc_cutoff_lsn: Lsn, + #[serde_as(as = "DisplayFromStr")] pub disk_consistent_lsn: Lsn, pub current_logical_size: Option, // is None when timeline is Unloaded pub current_logical_size_non_incremental: Option, @@ -68,6 +70,7 @@ impl LocalTimelineInfo { disk_consistent_lsn: datadir_tline.tline.get_disk_consistent_lsn(), last_record_lsn, prev_record_lsn: Some(datadir_tline.tline.get_prev_record_lsn()), + latest_gc_cutoff_lsn: *datadir_tline.tline.get_latest_gc_cutoff_lsn(), timeline_state: LocalTimelineState::Loaded, current_logical_size: Some(datadir_tline.get_current_logical_size()), current_logical_size_non_incremental: if include_non_incremental_logical_size { @@ -91,6 +94,7 @@ impl LocalTimelineInfo { disk_consistent_lsn: metadata.disk_consistent_lsn(), last_record_lsn: metadata.disk_consistent_lsn(), prev_record_lsn: metadata.prev_record_lsn(), + latest_gc_cutoff_lsn: metadata.latest_gc_cutoff_lsn(), timeline_state: LocalTimelineState::Unloaded, current_logical_size: None, current_logical_size_non_incremental: None, diff --git a/pageserver/src/walreceiver.rs b/pageserver/src/walreceiver.rs index b7a33364c9..b8f349af8f 100644 --- a/pageserver/src/walreceiver.rs +++ b/pageserver/src/walreceiver.rs @@ -18,6 +18,8 @@ use lazy_static::lazy_static; use postgres_ffi::waldecoder::*; use postgres_protocol::message::backend::ReplicationMessage; use postgres_types::PgLsn; +use serde::{Deserialize, Serialize}; +use serde_with::{serde_as, DisplayFromStr}; use std::cell::Cell; use std::collections::HashMap; use std::str::FromStr; @@ -35,11 +37,19 @@ use utils::{ zid::{ZTenantId, ZTenantTimelineId, ZTimelineId}, }; -// -// We keep one 
WAL Receiver active per timeline. -// -struct WalReceiverEntry { +/// +/// A WAL receiver's data stored inside the global `WAL_RECEIVERS`. +/// We keep one WAL receiver active per timeline. +/// +#[serde_as] +#[derive(Debug, Serialize, Deserialize, Clone)] +pub struct WalReceiverEntry { + thread_id: u64, wal_producer_connstr: String, + #[serde_as(as = "Option")] + last_received_msg_lsn: Option, + /// the timestamp (in microseconds) of the last received message + last_received_msg_ts: Option, } lazy_static! { @@ -74,7 +84,7 @@ pub fn launch_wal_receiver( receiver.wal_producer_connstr = wal_producer_connstr.into(); } None => { - thread_mgr::spawn( + let thread_id = thread_mgr::spawn( ThreadKind::WalReceiver, Some(tenantid), Some(timelineid), @@ -88,7 +98,10 @@ pub fn launch_wal_receiver( )?; let receiver = WalReceiverEntry { + thread_id, wal_producer_connstr: wal_producer_connstr.into(), + last_received_msg_lsn: None, + last_received_msg_ts: None, }; receivers.insert((tenantid, timelineid), receiver); @@ -99,15 +112,13 @@ pub fn launch_wal_receiver( Ok(()) } -// Look up current WAL producer connection string in the hash table -fn get_wal_producer_connstr(tenantid: ZTenantId, timelineid: ZTimelineId) -> String { +/// Look up a WAL receiver's data in the global `WAL_RECEIVERS` +pub fn get_wal_receiver_entry( + tenant_id: ZTenantId, + timeline_id: ZTimelineId, +) -> Option { let receivers = WAL_RECEIVERS.lock().unwrap(); - - receivers - .get(&(tenantid, timelineid)) - .unwrap() - .wal_producer_connstr - .clone() + receivers.get(&(tenant_id, timeline_id)).cloned() } // @@ -118,7 +129,18 @@ fn thread_main(conf: &'static PageServerConf, tenant_id: ZTenantId, timeline_id: info!("WAL receiver thread started"); // Look up the current WAL producer address - let wal_producer_connstr = get_wal_producer_connstr(tenant_id, timeline_id); + let wal_producer_connstr = { + match get_wal_receiver_entry(tenant_id, timeline_id) { + Some(e) => e.wal_producer_connstr, + None => { + info!( 
+ "Unable to create the WAL receiver thread: no WAL receiver entry found for tenant {} and timeline {}", + tenant_id, timeline_id + ); + return; + } + } + }; // Make a connection to the WAL safekeeper, or directly to the primary PostgreSQL server, // and start streaming WAL from it. @@ -318,6 +340,28 @@ fn walreceiver_main( let apply_lsn = u64::from(timeline_remote_consistent_lsn); let ts = SystemTime::now(); + // Update the current WAL receiver's data stored inside the global hash table `WAL_RECEIVERS` + { + let mut receivers = WAL_RECEIVERS.lock().unwrap(); + let entry = match receivers.get_mut(&(tenant_id, timeline_id)) { + Some(e) => e, + None => { + anyhow::bail!( + "no WAL receiver entry found for tenant {} and timeline {}", + tenant_id, + timeline_id + ); + } + }; + + entry.last_received_msg_lsn = Some(last_lsn); + entry.last_received_msg_ts = Some( + ts.duration_since(SystemTime::UNIX_EPOCH) + .expect("Received message time should be after UNIX EPOCH!") + .as_micros(), + ); + } + // Send zenith feedback message. // Regular standby_status_update fields are put into this message. 
let zenith_status_update = ZenithFeedback { diff --git a/test_runner/batch_others/test_pageserver_api.py b/test_runner/batch_others/test_pageserver_api.py index 13f6ef358e..7fe3b4dff5 100644 --- a/test_runner/batch_others/test_pageserver_api.py +++ b/test_runner/batch_others/test_pageserver_api.py @@ -1,6 +1,12 @@ from uuid import uuid4, UUID import pytest -from fixtures.zenith_fixtures import ZenithEnv, ZenithEnvBuilder, ZenithPageserverHttpClient +from fixtures.zenith_fixtures import ( + DEFAULT_BRANCH_NAME, + ZenithEnv, + ZenithEnvBuilder, + ZenithPageserverHttpClient, + ZenithPageserverApiException, +) # test that we cannot override node id @@ -48,6 +54,39 @@ def check_client(client: ZenithPageserverHttpClient, initial_tenant: UUID): assert local_timeline_details['timeline_state'] == 'Loaded' +def test_pageserver_http_get_wal_receiver_not_found(zenith_simple_env: ZenithEnv): + env = zenith_simple_env + client = env.pageserver.http_client() + + tenant_id, timeline_id = env.zenith_cli.create_tenant() + + # no PG compute node is running, so no WAL receiver is running + with pytest.raises(ZenithPageserverApiException) as e: + _ = client.wal_receiver_get(tenant_id, timeline_id) + assert "Not Found" in str(e.value) + + +def test_pageserver_http_get_wal_receiver_success(zenith_simple_env: ZenithEnv): + env = zenith_simple_env + client = env.pageserver.http_client() + + tenant_id, timeline_id = env.zenith_cli.create_tenant() + pg = env.postgres.create_start(DEFAULT_BRANCH_NAME, tenant_id=tenant_id) + + res = client.wal_receiver_get(tenant_id, timeline_id) + assert list(res.keys()) == [ + "thread_id", + "wal_producer_connstr", + "last_received_msg_lsn", + "last_received_msg_ts", + ] + + # make a DB modification then expect getting a new WAL receiver's data + pg.safe_psql("CREATE TABLE t(key int primary key, value text)") + res2 = client.wal_receiver_get(tenant_id, timeline_id) + assert res2["last_received_msg_lsn"] > res["last_received_msg_lsn"] + + def 
test_pageserver_http_api_client(zenith_simple_env: ZenithEnv): env = zenith_simple_env client = env.pageserver.http_client() diff --git a/test_runner/fixtures/zenith_fixtures.py b/test_runner/fixtures/zenith_fixtures.py index 50b7ef6dbb..14eae60248 100644 --- a/test_runner/fixtures/zenith_fixtures.py +++ b/test_runner/fixtures/zenith_fixtures.py @@ -786,6 +786,15 @@ class ZenithPageserverHttpClient(requests.Session): assert isinstance(res_json, dict) return res_json + def wal_receiver_get(self, tenant_id: uuid.UUID, timeline_id: uuid.UUID) -> Dict[Any, Any]: + res = self.get( + f"http://localhost:{self.port}/v1/tenant/{tenant_id.hex}/timeline/{timeline_id.hex}/wal_receiver" + ) + self.verbose_error(res) + res_json = res.json() + assert isinstance(res_json, dict) + return res_json + def get_metrics(self) -> str: res = self.get(f"http://localhost:{self.port}/metrics") self.verbose_error(res) From 85b5c0e98921a0a254021a55c5186aa1ca18813b Mon Sep 17 00:00:00 2001 From: chaitanya sharma <86035+phoenix24@users.noreply.github.com> Date: Fri, 13 May 2022 20:14:20 +0000 Subject: [PATCH 252/296] List profiling as a feature with 'pageserver --enabled-features' Fixes https://github.com/neondatabase/neon/issues/1627 --- pageserver/src/bin/pageserver.rs | 2 ++ 1 file changed, 2 insertions(+) diff --git a/pageserver/src/bin/pageserver.rs b/pageserver/src/bin/pageserver.rs index c6cb460f8f..4cc1dcbc5a 100644 --- a/pageserver/src/bin/pageserver.rs +++ b/pageserver/src/bin/pageserver.rs @@ -98,6 +98,8 @@ fn main() -> anyhow::Result<()> { let features: &[&str] = &[ #[cfg(feature = "failpoints")] "failpoints", + #[cfg(feature = "profiling")] + "profiling", ]; println!("{{\"features\": {features:?} }}"); return Ok(()); From bea84150b2be74db6c2cfc4107de3b582c86c352 Mon Sep 17 00:00:00 2001 From: chaitanya sharma <86035+phoenix24@users.noreply.github.com> Date: Sun, 15 May 2022 04:17:28 +0530 Subject: [PATCH 253/296] Fix the markdown rendering on 004-durability.md RFC --- 
docs/rfcs/004-durability.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/docs/rfcs/004-durability.md b/docs/rfcs/004-durability.md index 4543be3dae..d4716156d1 100644 --- a/docs/rfcs/004-durability.md +++ b/docs/rfcs/004-durability.md @@ -22,7 +22,7 @@ In addition to the WAL safekeeper nodes, the WAL is archived in S3. WAL that has been archived to S3 can be removed from the safekeepers, so the safekeepers don't need a lot of disk space. - +``` +----------------+ +-----> | WAL safekeeper | | +----------------+ @@ -42,23 +42,23 @@ safekeepers, so the safekeepers don't need a lot of disk space. \ \ \ - \ +--------+ - \ | | - +--> | S3 | - | | - +--------+ - + \ +--------+ + \ | | + +------> | S3 | + | | + +--------+ +``` Every WAL safekeeper holds a section of WAL, and a VCL value. The WAL can be divided into three portions: - +``` VCL LSN | | V V .................ccccccccccccccccccccXXXXXXXXXXXXXXXXXXXXXXX Archived WAL Completed WAL In-flight WAL - +``` Note that all this WAL kept in a safekeeper is a contiguous section. 
This is different from Aurora: In Aurora, there can be holes in the From 9a0fed0880dd1d1f482763b8de7c3a2c219fcf43 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Tue, 3 May 2022 14:11:29 +0300 Subject: [PATCH 254/296] Enable at least 1 safekeeper in every test --- .circleci/ansible/systemd/pageserver.service | 2 +- control_plane/src/local_env.rs | 3 + control_plane/src/safekeeper.rs | 1 + control_plane/src/storage.rs | 11 +++ docker-entrypoint.sh | 6 +- pageserver/src/config.rs | 95 ++++++++++++++++--- safekeeper/src/bin/safekeeper.rs | 28 +++--- safekeeper/src/broker.rs | 5 +- safekeeper/src/lib.rs | 4 +- .../batch_others/test_ancestor_branch.py | 7 -- test_runner/batch_others/test_backpressure.py | 1 - test_runner/batch_others/test_next_xid.py | 2 - .../batch_others/test_pageserver_restart.py | 2 - .../batch_others/test_remote_storage.py | 1 - .../batch_others/test_tenant_relocation.py | 10 +- .../batch_others/test_timeline_size.py | 1 - test_runner/batch_others/test_wal_acceptor.py | 13 ++- test_runner/batch_others/test_wal_restore.py | 1 - test_runner/batch_others/test_zenith_cli.py | 4 - test_runner/fixtures/zenith_fixtures.py | 44 +++++---- 20 files changed, 161 insertions(+), 80 deletions(-) diff --git a/.circleci/ansible/systemd/pageserver.service b/.circleci/ansible/systemd/pageserver.service index d346643e58..54a7b1ba0a 100644 --- a/.circleci/ansible/systemd/pageserver.service +++ b/.circleci/ansible/systemd/pageserver.service @@ -6,7 +6,7 @@ After=network.target auditd.service Type=simple User=pageserver Environment=RUST_BACKTRACE=1 ZENITH_REPO_DIR=/storage/pageserver LD_LIBRARY_PATH=/usr/local/lib -ExecStart=/usr/local/bin/pageserver -c "pg_distrib_dir='/usr/local'" -c "listen_pg_addr='0.0.0.0:6400'" -c "listen_http_addr='0.0.0.0:9898'" -D /storage/pageserver/data +ExecStart=/usr/local/bin/pageserver -c "pg_distrib_dir='/usr/local'" -c "listen_pg_addr='0.0.0.0:6400'" -c "listen_http_addr='0.0.0.0:9898'" -c "broker_endpoints=['{{ etcd_endpoints }}']" 
-D /storage/pageserver/data ExecReload=/bin/kill -HUP $MAINPID KillMode=mixed KillSignal=SIGINT diff --git a/control_plane/src/local_env.rs b/control_plane/src/local_env.rs index 35167ebabf..a8636f9073 100644 --- a/control_plane/src/local_env.rs +++ b/control_plane/src/local_env.rs @@ -97,6 +97,7 @@ pub struct PageServerConf { // jwt auth token used for communication with pageserver pub auth_token: String, + pub broker_endpoints: Vec, } impl Default for PageServerConf { @@ -107,6 +108,7 @@ impl Default for PageServerConf { listen_http_addr: String::new(), auth_type: AuthType::Trust, auth_token: String::new(), + broker_endpoints: Vec::new(), } } } @@ -401,6 +403,7 @@ impl LocalEnv { self.pageserver.auth_token = self.generate_auth_token(&Claims::new(None, Scope::PageServerApi))?; + self.pageserver.broker_endpoints = self.broker_endpoints.clone(); fs::create_dir_all(self.pg_data_dirs_path())?; diff --git a/control_plane/src/safekeeper.rs b/control_plane/src/safekeeper.rs index aeeb4a50ec..c5b7f830bf 100644 --- a/control_plane/src/safekeeper.rs +++ b/control_plane/src/safekeeper.rs @@ -137,6 +137,7 @@ impl SafekeeperNode { .args(&["--listen-pg", &listen_pg]) .args(&["--listen-http", &listen_http]) .args(&["--recall", "1 second"]) + .args(&["--broker-endpoints", &self.broker_endpoints.join(",")]) .arg("--daemonize"), ); if !self.conf.sync { diff --git a/control_plane/src/storage.rs b/control_plane/src/storage.rs index d2e63a22de..0b9fddd64a 100644 --- a/control_plane/src/storage.rs +++ b/control_plane/src/storage.rs @@ -121,6 +121,16 @@ impl PageServerNode { ); let listen_pg_addr_param = format!("listen_pg_addr='{}'", self.env.pageserver.listen_pg_addr); + let broker_endpoints_param = format!( + "broker_endpoints=[{}]", + self.env + .pageserver + .broker_endpoints + .iter() + .map(|url| format!("'{url}'")) + .collect::>() + .join(",") + ); let mut args = Vec::with_capacity(20); args.push("--init"); @@ -129,6 +139,7 @@ impl PageServerNode { args.extend(["-c", 
&authg_type_param]); args.extend(["-c", &listen_http_addr_param]); args.extend(["-c", &listen_pg_addr_param]); + args.extend(["-c", &broker_endpoints_param]); args.extend(["-c", &id]); for config_override in config_overrides { diff --git a/docker-entrypoint.sh b/docker-entrypoint.sh index 93bb5f9cd7..0e4cf45f29 100755 --- a/docker-entrypoint.sh +++ b/docker-entrypoint.sh @@ -7,7 +7,11 @@ if [ "$1" = 'pageserver' ]; then pageserver --init -D /data -c "pg_distrib_dir='/usr/local'" -c "id=10" fi echo "Staring pageserver at 0.0.0.0:6400" - pageserver -c "listen_pg_addr='0.0.0.0:6400'" -c "listen_http_addr='0.0.0.0:9898'" -D /data + if [ -z "${BROKER_ENDPOINTS:-}" ]; then + pageserver -c "listen_pg_addr='0.0.0.0:6400'" -c "listen_http_addr='0.0.0.0:9898'" -D /data + else + pageserver -c "listen_pg_addr='0.0.0.0:6400'" -c "listen_http_addr='0.0.0.0:9898'" -c "broker_endpoints=['${BROKER_ENDPOINTS}']" -D /data + fi else "$@" fi diff --git a/pageserver/src/config.rs b/pageserver/src/config.rs index 5257732c5c..8748683f32 100644 --- a/pageserver/src/config.rs +++ b/pageserver/src/config.rs @@ -13,6 +13,7 @@ use std::str::FromStr; use std::time::Duration; use toml_edit; use toml_edit::{Document, Item}; +use url::Url; use utils::{ postgres_backend::AuthType, zid::{ZNodeId, ZTenantId, ZTimelineId}, @@ -111,6 +112,9 @@ pub struct PageServerConf { pub profiling: ProfilingConfig, pub default_tenant_conf: TenantConf, + + /// Etcd broker endpoints to connect to. 
+ pub broker_endpoints: Vec, } #[derive(Debug, Clone, PartialEq, Eq)] @@ -175,6 +179,7 @@ struct PageServerConfigBuilder { id: BuilderValue, profiling: BuilderValue, + broker_endpoints: BuilderValue>, } impl Default for PageServerConfigBuilder { @@ -200,6 +205,7 @@ impl Default for PageServerConfigBuilder { remote_storage_config: Set(None), id: NotSet, profiling: Set(ProfilingConfig::Disabled), + broker_endpoints: NotSet, } } } @@ -256,6 +262,10 @@ impl PageServerConfigBuilder { self.remote_storage_config = BuilderValue::Set(remote_storage_config) } + pub fn broker_endpoints(&mut self, broker_endpoints: Vec) { + self.broker_endpoints = BuilderValue::Set(broker_endpoints) + } + pub fn id(&mut self, node_id: ZNodeId) { self.id = BuilderValue::Set(node_id) } @@ -264,7 +274,15 @@ impl PageServerConfigBuilder { self.profiling = BuilderValue::Set(profiling) } - pub fn build(self) -> Result { + pub fn build(self) -> anyhow::Result { + let broker_endpoints = self + .broker_endpoints + .ok_or(anyhow!("No broker endpoints provided"))?; + ensure!( + !broker_endpoints.is_empty(), + "Empty broker endpoints collection provided" + ); + Ok(PageServerConf { listen_pg_addr: self .listen_pg_addr @@ -300,6 +318,7 @@ impl PageServerConfigBuilder { profiling: self.profiling.ok_or(anyhow!("missing profiling"))?, // TenantConf is handled separately default_tenant_conf: TenantConf::default(), + broker_endpoints, }) } } @@ -341,7 +360,7 @@ impl PageServerConf { /// validating the input and failing on errors. /// /// This leaves any options not present in the file in the built-in defaults. 
- pub fn parse_and_validate(toml: &Document, workdir: &Path) -> Result { + pub fn parse_and_validate(toml: &Document, workdir: &Path) -> anyhow::Result { let mut builder = PageServerConfigBuilder::default(); builder.workdir(workdir.to_owned()); @@ -373,6 +392,16 @@ impl PageServerConf { } "id" => builder.id(ZNodeId(parse_toml_u64(key, item)?)), "profiling" => builder.profiling(parse_toml_from_str(key, item)?), + "broker_endpoints" => builder.broker_endpoints( + parse_toml_array(key, item)? + .into_iter() + .map(|endpoint_str| { + endpoint_str.parse::().with_context(|| { + format!("Array item {endpoint_str} for key {key} is not a valid url endpoint") + }) + }) + .collect::>()?, + ), _ => bail!("unrecognized pageserver option '{key}'"), } } @@ -526,6 +555,7 @@ impl PageServerConf { remote_storage_config: None, profiling: ProfilingConfig::Disabled, default_tenant_conf: TenantConf::dummy_conf(), + broker_endpoints: Vec::new(), } } } @@ -576,14 +606,36 @@ fn parse_toml_duration(name: &str, item: &Item) -> Result { Ok(humantime::parse_duration(s)?) 
} -fn parse_toml_from_str(name: &str, item: &Item) -> Result +fn parse_toml_from_str(name: &str, item: &Item) -> anyhow::Result where - T: FromStr, + T: FromStr, + ::Err: std::fmt::Display, { let v = item .as_str() .with_context(|| format!("configure option {name} is not a string"))?; - T::from_str(v) + T::from_str(v).map_err(|e| { + anyhow!( + "Failed to parse string as {parse_type} for configure option {name}: {e}", + parse_type = stringify!(T) + ) + }) +} + +fn parse_toml_array(name: &str, item: &Item) -> anyhow::Result> { + let array = item + .as_array() + .with_context(|| format!("configure option {name} is not an array"))?; + + array + .iter() + .map(|value| { + value + .as_str() + .map(str::to_string) + .with_context(|| format!("Array item {value:?} for key {name} is not a string")) + }) + .collect() } #[cfg(test)] @@ -616,12 +668,16 @@ id = 10 fn parse_defaults() -> anyhow::Result<()> { let tempdir = tempdir()?; let (workdir, pg_distrib_dir) = prepare_fs(&tempdir)?; - // we have to create dummy pathes to overcome the validation errors - let config_string = format!("pg_distrib_dir='{}'\nid=10", pg_distrib_dir.display()); + let broker_endpoint = "http://127.0.0.1:7777"; + // we have to create dummy values to overcome the validation errors + let config_string = format!( + "pg_distrib_dir='{}'\nid=10\nbroker_endpoints = ['{broker_endpoint}']", + pg_distrib_dir.display() + ); let toml = config_string.parse()?; let parsed_config = PageServerConf::parse_and_validate(&toml, &workdir) - .unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e}")); + .unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e:?}")); assert_eq!( parsed_config, @@ -641,6 +697,9 @@ id = 10 remote_storage_config: None, profiling: ProfilingConfig::Disabled, default_tenant_conf: TenantConf::default(), + broker_endpoints: vec![broker_endpoint + .parse() + .expect("Failed to parse a valid broker endpoint URL")], }, "Correct defaults should be used 
when no config values are provided" ); @@ -652,15 +711,16 @@ id = 10 fn parse_basic_config() -> anyhow::Result<()> { let tempdir = tempdir()?; let (workdir, pg_distrib_dir) = prepare_fs(&tempdir)?; + let broker_endpoint = "http://127.0.0.1:7777"; let config_string = format!( - "{ALL_BASE_VALUES_TOML}pg_distrib_dir='{}'", + "{ALL_BASE_VALUES_TOML}pg_distrib_dir='{}'\nbroker_endpoints = ['{broker_endpoint}']", pg_distrib_dir.display() ); let toml = config_string.parse()?; let parsed_config = PageServerConf::parse_and_validate(&toml, &workdir) - .unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e}")); + .unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e:?}")); assert_eq!( parsed_config, @@ -680,6 +740,9 @@ id = 10 remote_storage_config: None, profiling: ProfilingConfig::Disabled, default_tenant_conf: TenantConf::default(), + broker_endpoints: vec![broker_endpoint + .parse() + .expect("Failed to parse a valid broker endpoint URL")], }, "Should be able to parse all basic config values correctly" ); @@ -691,6 +754,7 @@ id = 10 fn parse_remote_fs_storage_config() -> anyhow::Result<()> { let tempdir = tempdir()?; let (workdir, pg_distrib_dir) = prepare_fs(&tempdir)?; + let broker_endpoint = "http://127.0.0.1:7777"; let local_storage_path = tempdir.path().join("local_remote_storage"); @@ -710,6 +774,7 @@ local_path = '{}'"#, let config_string = format!( r#"{ALL_BASE_VALUES_TOML} pg_distrib_dir='{}' +broker_endpoints = ['{broker_endpoint}'] {remote_storage_config_str}"#, pg_distrib_dir.display(), @@ -718,7 +783,9 @@ pg_distrib_dir='{}' let toml = config_string.parse()?; let parsed_remote_storage_config = PageServerConf::parse_and_validate(&toml, &workdir) - .unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e}")) + .unwrap_or_else(|e| { + panic!("Failed to parse config '{config_string}', reason: {e:?}") + }) .remote_storage_config .expect("Should have remote storage config for the local 
FS"); @@ -751,6 +818,7 @@ pg_distrib_dir='{}' let max_concurrent_syncs = NonZeroUsize::new(111).unwrap(); let max_sync_errors = NonZeroU32::new(222).unwrap(); let s3_concurrency_limit = NonZeroUsize::new(333).unwrap(); + let broker_endpoint = "http://127.0.0.1:7777"; let identical_toml_declarations = &[ format!( @@ -773,6 +841,7 @@ concurrency_limit = {s3_concurrency_limit}"# let config_string = format!( r#"{ALL_BASE_VALUES_TOML} pg_distrib_dir='{}' +broker_endpoints = ['{broker_endpoint}'] {remote_storage_config_str}"#, pg_distrib_dir.display(), @@ -781,7 +850,9 @@ pg_distrib_dir='{}' let toml = config_string.parse()?; let parsed_remote_storage_config = PageServerConf::parse_and_validate(&toml, &workdir) - .unwrap_or_else(|e| panic!("Failed to parse config '{config_string}', reason: {e}")) + .unwrap_or_else(|e| { + panic!("Failed to parse config '{config_string}', reason: {e:?}") + }) .remote_storage_config .expect("Should have remote storage config for S3"); diff --git a/safekeeper/src/bin/safekeeper.rs b/safekeeper/src/bin/safekeeper.rs index 6955d2aa5c..d7875a9069 100644 --- a/safekeeper/src/bin/safekeeper.rs +++ b/safekeeper/src/bin/safekeeper.rs @@ -1,7 +1,7 @@ // // Main entry point for the safekeeper executable // -use anyhow::{bail, Context, Result}; +use anyhow::{bail, ensure, Context, Result}; use clap::{App, Arg}; use const_format::formatcp; use daemonize::Daemonize; @@ -31,7 +31,7 @@ const LOCK_FILE_NAME: &str = "safekeeper.lock"; const ID_FILE_NAME: &str = "safekeeper.id"; project_git_version!(GIT_VERSION); -fn main() -> Result<()> { +fn main() -> anyhow::Result<()> { metrics::set_common_metrics_prefix("safekeeper"); let arg_matches = App::new("Zenith safekeeper") .about("Store WAL stream to local file system and push it to WAL receivers") @@ -177,8 +177,12 @@ fn main() -> Result<()> { if let Some(addr) = arg_matches.value_of("broker-endpoints") { let collected_ep: Result<Vec<Url>, ParseError> = addr.split(',').map(Url::parse).collect(); -
conf.broker_endpoints = Some(collected_ep?); + conf.broker_endpoints = collected_ep.context("Failed to parse broker endpoint urls")?; } + ensure!( + !conf.broker_endpoints.is_empty(), + "No broker endpoints provided" + ); if let Some(prefix) = arg_matches.value_of("broker-etcd-prefix") { conf.broker_etcd_prefix = prefix.to_string(); } @@ -309,16 +313,14 @@ fn start_safekeeper(mut conf: SafeKeeperConf, given_id: Option, init: b .unwrap(); threads.push(callmemaybe_thread); - if conf.broker_endpoints.is_some() { - let conf_ = conf.clone(); - threads.push( - thread::Builder::new() - .name("broker thread".into()) - .spawn(|| { - broker::thread_main(conf_); - })?, - ); - } + let conf_ = conf.clone(); + threads.push( + thread::Builder::new() + .name("broker thread".into()) + .spawn(|| { + broker::thread_main(conf_); + })?, + ); let conf_ = conf.clone(); threads.push( diff --git a/safekeeper/src/broker.rs b/safekeeper/src/broker.rs index d9c60c9db0..c906bc1e74 100644 --- a/safekeeper/src/broker.rs +++ b/safekeeper/src/broker.rs @@ -46,7 +46,7 @@ fn timeline_safekeeper_path( /// Push once in a while data about all active timelines to the broker. async fn push_loop(conf: SafeKeeperConf) -> anyhow::Result<()> { - let mut client = Client::connect(&conf.broker_endpoints.as_ref().unwrap(), None).await?; + let mut client = Client::connect(&conf.broker_endpoints, None).await?; // Get and maintain lease to automatically delete obsolete data let lease = client.lease_grant(LEASE_TTL_SEC, None).await?; @@ -91,7 +91,7 @@ async fn push_loop(conf: SafeKeeperConf) -> anyhow::Result<()> { /// Subscribe and fetch all the interesting data from the broker. 
async fn pull_loop(conf: SafeKeeperConf) -> Result<()> { - let mut client = Client::connect(&conf.broker_endpoints.as_ref().unwrap(), None).await?; + let mut client = Client::connect(&conf.broker_endpoints, None).await?; let mut subscription = etcd_broker::subscribe_to_safekeeper_timeline_updates( &mut client, @@ -99,7 +99,6 @@ async fn pull_loop(conf: SafeKeeperConf) -> Result<()> { ) .await .context("failed to subscribe for safekeeper info")?; - loop { match subscription.fetch_data().await { Some(new_info) => { diff --git a/safekeeper/src/lib.rs b/safekeeper/src/lib.rs index 09b2e68a49..131076fab6 100644 --- a/safekeeper/src/lib.rs +++ b/safekeeper/src/lib.rs @@ -51,7 +51,7 @@ pub struct SafeKeeperConf { pub ttl: Option<Duration>, pub recall_period: Duration, pub my_id: ZNodeId, - pub broker_endpoints: Option<Vec<Url>>, + pub broker_endpoints: Vec<Url>, pub broker_etcd_prefix: String, pub s3_offload_enabled: bool, } @@ -81,7 +81,7 @@ impl Default for SafeKeeperConf { ttl: None, recall_period: defaults::DEFAULT_RECALL_PERIOD, my_id: ZNodeId(0), - broker_endpoints: None, + broker_endpoints: Vec::new(), broker_etcd_prefix: defaults::DEFAULT_NEON_BROKER_PREFIX.to_string(), s3_offload_enabled: true, } diff --git a/test_runner/batch_others/test_ancestor_branch.py b/test_runner/batch_others/test_ancestor_branch.py index c07b9d6dd1..5dbd6d2e26 100644 --- a/test_runner/batch_others/test_ancestor_branch.py +++ b/test_runner/batch_others/test_ancestor_branch.py @@ -10,13 +10,6 @@ from fixtures.zenith_fixtures import ZenithEnv, ZenithEnvBuilder, ZenithPageserv # Create ancestor branches off the main branch. # def test_ancestor_branch(zenith_env_builder: ZenithEnvBuilder): - - # Use safekeeper in this test to avoid a subtle race condition. - # Without safekeeper, walreceiver reconnection can stuck - # because of IO deadlock.
- # - # See https://github.com/zenithdb/zenith/issues/1068 - zenith_env_builder.num_safekeepers = 1 env = zenith_env_builder.init_start() # Override defaults, 1M gc_horizon and 4M checkpoint_distance. diff --git a/test_runner/batch_others/test_backpressure.py b/test_runner/batch_others/test_backpressure.py index 6658b337ec..81f45b749b 100644 --- a/test_runner/batch_others/test_backpressure.py +++ b/test_runner/batch_others/test_backpressure.py @@ -94,7 +94,6 @@ def check_backpressure(pg: Postgres, stop_event: threading.Event, polling_interv @pytest.mark.skip("See https://github.com/neondatabase/neon/issues/1587") def test_backpressure_received_lsn_lag(zenith_env_builder: ZenithEnvBuilder): - zenith_env_builder.num_safekeepers = 1 env = zenith_env_builder.init_start() # Create a branch for us env.zenith_cli.create_branch('test_backpressure') diff --git a/test_runner/batch_others/test_next_xid.py b/test_runner/batch_others/test_next_xid.py index 03c27bcd70..1ab1addad3 100644 --- a/test_runner/batch_others/test_next_xid.py +++ b/test_runner/batch_others/test_next_xid.py @@ -6,8 +6,6 @@ from fixtures.zenith_fixtures import ZenithEnvBuilder # Test restarting page server, while safekeeper and compute node keep # running. def test_next_xid(zenith_env_builder: ZenithEnvBuilder): - # One safekeeper is enough for this test. - zenith_env_builder.num_safekeepers = 1 env = zenith_env_builder.init_start() pg = env.postgres.create_start('main') diff --git a/test_runner/batch_others/test_pageserver_restart.py b/test_runner/batch_others/test_pageserver_restart.py index 20e6f4467e..69f5ea85ce 100644 --- a/test_runner/batch_others/test_pageserver_restart.py +++ b/test_runner/batch_others/test_pageserver_restart.py @@ -5,8 +5,6 @@ from fixtures.log_helper import log # Test restarting page server, while safekeeper and compute node keep # running. def test_pageserver_restart(zenith_env_builder: ZenithEnvBuilder): - # One safekeeper is enough for this test. 
- zenith_env_builder.num_safekeepers = 1 env = zenith_env_builder.init_start() env.zenith_cli.create_branch('test_pageserver_restart') diff --git a/test_runner/batch_others/test_remote_storage.py b/test_runner/batch_others/test_remote_storage.py index e205f79957..3c7bd08996 100644 --- a/test_runner/batch_others/test_remote_storage.py +++ b/test_runner/batch_others/test_remote_storage.py @@ -32,7 +32,6 @@ import pytest @pytest.mark.parametrize('storage_type', ['local_fs', 'mock_s3']) def test_remote_storage_backup_and_restore(zenith_env_builder: ZenithEnvBuilder, storage_type: str): # zenith_env_builder.rust_log_override = 'debug' - zenith_env_builder.num_safekeepers = 1 if storage_type == 'local_fs': zenith_env_builder.enable_local_fs_remote_storage() elif storage_type == 'mock_s3': diff --git a/test_runner/batch_others/test_tenant_relocation.py b/test_runner/batch_others/test_tenant_relocation.py index 279b3a0a25..85a91b9ce1 100644 --- a/test_runner/batch_others/test_tenant_relocation.py +++ b/test_runner/batch_others/test_tenant_relocation.py @@ -8,7 +8,7 @@ from fixtures.log_helper import log import signal import pytest -from fixtures.zenith_fixtures import PgProtocol, PortDistributor, Postgres, ZenithEnvBuilder, ZenithPageserverHttpClient, assert_local, wait_for, wait_for_last_record_lsn, wait_for_upload, zenith_binpath, pg_distrib_dir +from fixtures.zenith_fixtures import PgProtocol, PortDistributor, Postgres, ZenithEnvBuilder, Etcd, ZenithPageserverHttpClient, assert_local, wait_for, wait_for_last_record_lsn, wait_for_upload, zenith_binpath, pg_distrib_dir from fixtures.utils import lsn_from_hex @@ -21,7 +21,8 @@ def new_pageserver_helper(new_pageserver_dir: pathlib.Path, pageserver_bin: pathlib.Path, remote_storage_mock_path: pathlib.Path, pg_port: int, - http_port: int): + http_port: int, + broker: Etcd): """ cannot use ZenithPageserver yet because it depends on zenith cli which currently lacks support for multiple pageservers @@ -36,6 +37,7 @@ def 
new_pageserver_helper(new_pageserver_dir: pathlib.Path, f"-c pg_distrib_dir='{pg_distrib_dir}'", f"-c id=2", f"-c remote_storage={{local_path='{remote_storage_mock_path}'}}", + f"-c broker_endpoints=['{broker.client_url()}']", ] subprocess.check_output(cmd, text=True) @@ -103,7 +105,6 @@ def load(pg: Postgres, stop_event: threading.Event, load_ok_event: threading.Eve def test_tenant_relocation(zenith_env_builder: ZenithEnvBuilder, port_distributor: PortDistributor, with_load: str): - zenith_env_builder.num_safekeepers = 1 zenith_env_builder.enable_local_fs_remote_storage() env = zenith_env_builder.init_start() @@ -180,7 +181,8 @@ def test_tenant_relocation(zenith_env_builder: ZenithEnvBuilder, pageserver_bin, remote_storage_mock_path, new_pageserver_pg_port, - new_pageserver_http_port): + new_pageserver_http_port, + zenith_env_builder.broker): # call to attach timeline to new pageserver new_pageserver_http.timeline_attach(tenant, timeline) diff --git a/test_runner/batch_others/test_timeline_size.py b/test_runner/batch_others/test_timeline_size.py index db33493d61..0b33b56df3 100644 --- a/test_runner/batch_others/test_timeline_size.py +++ b/test_runner/batch_others/test_timeline_size.py @@ -70,7 +70,6 @@ def wait_for_pageserver_catchup(pgmain: Postgres, polling_interval=1, timeout=60 def test_timeline_size_quota(zenith_env_builder: ZenithEnvBuilder): - zenith_env_builder.num_safekeepers = 1 env = zenith_env_builder.init_start() new_timeline_id = env.zenith_cli.create_branch('test_timeline_size_quota') diff --git a/test_runner/batch_others/test_wal_acceptor.py b/test_runner/batch_others/test_wal_acceptor.py index 67c9d6070e..85798156a7 100644 --- a/test_runner/batch_others/test_wal_acceptor.py +++ b/test_runner/batch_others/test_wal_acceptor.py @@ -12,7 +12,7 @@ from contextlib import closing from dataclasses import dataclass, field from multiprocessing import Process, Value from pathlib import Path -from fixtures.zenith_fixtures import PgBin, Postgres, Safekeeper, 
ZenithEnv, ZenithEnvBuilder, PortDistributor, SafekeeperPort, zenith_binpath, PgProtocol +from fixtures.zenith_fixtures import PgBin, Etcd, Postgres, Safekeeper, ZenithEnv, ZenithEnvBuilder, PortDistributor, SafekeeperPort, zenith_binpath, PgProtocol from fixtures.utils import etcd_path, get_dir_size, lsn_to_hex, mkdir_if_needed, lsn_from_hex from fixtures.log_helper import log from typing import List, Optional, Any @@ -22,7 +22,6 @@ from typing import List, Optional, Any # succeed and data is written def test_normal_work(zenith_env_builder: ZenithEnvBuilder): zenith_env_builder.num_safekeepers = 3 - zenith_env_builder.broker = True env = zenith_env_builder.init_start() env.zenith_cli.create_branch('test_safekeepers_normal_work') @@ -331,7 +330,6 @@ def test_race_conditions(zenith_env_builder: ZenithEnvBuilder, stop_value): @pytest.mark.skipif(etcd_path() is None, reason="requires etcd which is not present in PATH") def test_broker(zenith_env_builder: ZenithEnvBuilder): zenith_env_builder.num_safekeepers = 3 - zenith_env_builder.broker = True zenith_env_builder.enable_local_fs_remote_storage() env = zenith_env_builder.init_start() @@ -374,7 +372,6 @@ def test_broker(zenith_env_builder: ZenithEnvBuilder): @pytest.mark.skipif(etcd_path() is None, reason="requires etcd which is not present in PATH") def test_wal_removal(zenith_env_builder: ZenithEnvBuilder): zenith_env_builder.num_safekeepers = 2 - zenith_env_builder.broker = True # to advance remote_consistent_llsn zenith_env_builder.enable_local_fs_remote_storage() env = zenith_env_builder.init_start() @@ -557,8 +554,6 @@ def test_sync_safekeepers(zenith_env_builder: ZenithEnvBuilder, def test_timeline_status(zenith_env_builder: ZenithEnvBuilder): - - zenith_env_builder.num_safekeepers = 1 env = zenith_env_builder.init_start() env.zenith_cli.create_branch('test_timeline_status') @@ -599,6 +594,9 @@ class SafekeeperEnv: num_safekeepers: int = 1): self.repo_dir = repo_dir self.port_distributor = port_distributor + 
self.broker = Etcd(datadir=os.path.join(self.repo_dir, "etcd"), + port=self.port_distributor.get_port(), + peer_port=self.port_distributor.get_port()) self.pg_bin = pg_bin self.num_safekeepers = num_safekeepers self.bin_safekeeper = os.path.join(str(zenith_binpath), 'safekeeper') @@ -645,6 +643,8 @@ class SafekeeperEnv: safekeeper_dir, "--id", str(i), + "--broker-endpoints", + self.broker.client_url(), "--daemonize" ] @@ -698,7 +698,6 @@ def test_safekeeper_without_pageserver(test_output_dir: str, repo_dir, port_distributor, pg_bin, - num_safekeepers=1, ) with env: diff --git a/test_runner/batch_others/test_wal_restore.py b/test_runner/batch_others/test_wal_restore.py index b0f34f4aae..f4aceac5e8 100644 --- a/test_runner/batch_others/test_wal_restore.py +++ b/test_runner/batch_others/test_wal_restore.py @@ -15,7 +15,6 @@ def test_wal_restore(zenith_env_builder: ZenithEnvBuilder, pg_bin: PgBin, test_output_dir, port_distributor: PortDistributor): - zenith_env_builder.num_safekeepers = 1 env = zenith_env_builder.init_start() env.zenith_cli.create_branch("test_wal_restore") pg = env.postgres.create_start('test_wal_restore') diff --git a/test_runner/batch_others/test_zenith_cli.py b/test_runner/batch_others/test_zenith_cli.py index bff17fa679..103d51aae5 100644 --- a/test_runner/batch_others/test_zenith_cli.py +++ b/test_runner/batch_others/test_zenith_cli.py @@ -94,8 +94,6 @@ def test_cli_tenant_create(zenith_simple_env: ZenithEnv): def test_cli_ipv4_listeners(zenith_env_builder: ZenithEnvBuilder): - # Start with single sk - zenith_env_builder.num_safekeepers = 1 env = zenith_env_builder.init_start() # Connect to sk port on v4 loopback @@ -111,8 +109,6 @@ def test_cli_ipv4_listeners(zenith_env_builder: ZenithEnvBuilder): def test_cli_start_stop(zenith_env_builder: ZenithEnvBuilder): - # Start with single sk - zenith_env_builder.num_safekeepers = 1 env = zenith_env_builder.init_start() # Stop default ps/sk diff --git a/test_runner/fixtures/zenith_fixtures.py 
b/test_runner/fixtures/zenith_fixtures.py index 14eae60248..09f7f26588 100644 --- a/test_runner/fixtures/zenith_fixtures.py +++ b/test_runner/fixtures/zenith_fixtures.py @@ -412,11 +412,10 @@ class ZenithEnvBuilder: port_distributor: PortDistributor, pageserver_remote_storage: Optional[RemoteStorage] = None, pageserver_config_override: Optional[str] = None, - num_safekeepers: int = 0, + num_safekeepers: int = 1, pageserver_auth_enabled: bool = False, rust_log_override: Optional[str] = None, - default_branch_name=DEFAULT_BRANCH_NAME, - broker: bool = False): + default_branch_name=DEFAULT_BRANCH_NAME): self.repo_dir = repo_dir self.rust_log_override = rust_log_override self.port_distributor = port_distributor @@ -425,7 +424,10 @@ class ZenithEnvBuilder: self.num_safekeepers = num_safekeepers self.pageserver_auth_enabled = pageserver_auth_enabled self.default_branch_name = default_branch_name - self.broker = broker + # keep etcd datadir inside 'repo' + self.broker = Etcd(datadir=os.path.join(self.repo_dir, "etcd"), + port=self.port_distributor.get_port(), + peer_port=self.port_distributor.get_port()) self.env: Optional[ZenithEnv] = None self.s3_mock_server: Optional[MockS3Server] = None @@ -551,14 +553,9 @@ class ZenithEnv: default_tenant_id = '{self.initial_tenant.hex}' """) - self.broker = None - if config.broker: - # keep etcd datadir inside 'repo' - self.broker = Etcd(datadir=os.path.join(self.repo_dir, "etcd"), - port=self.port_distributor.get_port(), - peer_port=self.port_distributor.get_port()) - toml += textwrap.dedent(f""" - broker_endpoints = ['http://127.0.0.1:{self.broker.port}'] + self.broker = config.broker + toml += textwrap.dedent(f""" + broker_endpoints = ['{self.broker.client_url()}'] """) # Create config for pageserver @@ -1851,24 +1848,29 @@ class Etcd: peer_port: int handle: Optional[subprocess.Popen[Any]] = None # handle of running daemon + def client_url(self): + return f'http://127.0.0.1:{self.port}' + def check_status(self): s = 
requests.Session() s.mount('http://', requests.adapters.HTTPAdapter(max_retries=1)) # do not retry - s.get(f"http://localhost:{self.port}/health").raise_for_status() + s.get(f"{self.client_url()}/health").raise_for_status() def start(self): pathlib.Path(self.datadir).mkdir(exist_ok=True) etcd_full_path = etcd_path() if etcd_full_path is None: - raise Exception('etcd not found') + raise Exception('etcd binary not found locally') + client_url = self.client_url() + log.info(f'Starting etcd to listen incoming connections at "{client_url}"') with open(os.path.join(self.datadir, "etcd.log"), "wb") as log_file: args = [ etcd_full_path, f"--data-dir={self.datadir}", - f"--listen-client-urls=http://localhost:{self.port}", - f"--advertise-client-urls=http://localhost:{self.port}", - f"--listen-peer-urls=http://localhost:{self.peer_port}" + f"--listen-client-urls={client_url}", + f"--advertise-client-urls={client_url}", + f"--listen-peer-urls=http://127.0.0.1:{self.peer_port}" ] self.handle = subprocess.Popen(args, stdout=log_file, stderr=log_file) @@ -1920,7 +1922,13 @@ def test_output_dir(request: Any) -> str: return test_dir -SKIP_DIRS = frozenset(('pg_wal', 'pg_stat', 'pg_stat_tmp', 'pg_subtrans', 'pg_logical')) +SKIP_DIRS = frozenset(('pg_wal', + 'pg_stat', + 'pg_stat_tmp', + 'pg_subtrans', + 'pg_logical', + 'pg_replslot/wal_proposer_slot', + 'pg_xact')) SKIP_FILES = frozenset(('pg_internal.init', 'pg.log', From a884f4cf6bcfae751166ad0f0b5dd6b99a67cba8 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Sun, 8 May 2022 00:32:57 +0300 Subject: [PATCH 255/296] Add etcd to neon_local --- Cargo.lock | 1 + control_plane/simple.conf | 3 + control_plane/src/etcd.rs | 93 +++++++++++++ control_plane/src/lib.rs | 1 + control_plane/src/local_env.rs | 122 ++++++++++++------ control_plane/src/safekeeper.rs | 31 ++--- control_plane/src/storage.rs | 12 +- docker-entrypoint.sh | 15 ++- docs/settings.md | 17 ++- libs/etcd_broker/src/lib.rs | 30 +++-- neon_local/src/main.rs | 64 +++++---- 
pageserver/Cargo.toml | 1 + pageserver/src/config.rs | 25 +++- safekeeper/src/bin/safekeeper.rs | 26 ++-- safekeeper/src/broker.rs | 4 +- safekeeper/src/lib.rs | 3 +- test_runner/batch_others/test_wal_acceptor.py | 4 +- test_runner/fixtures/utils.py | 12 +- test_runner/fixtures/zenith_fixtures.py | 14 +- 19 files changed, 331 insertions(+), 147 deletions(-) create mode 100644 control_plane/src/etcd.rs diff --git a/Cargo.lock b/Cargo.lock index e1e1a0f067..a3974f6776 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -1772,6 +1772,7 @@ dependencies = [ "crc32c", "crossbeam-utils", "daemonize", + "etcd_broker", "fail", "futures", "git-version", diff --git a/control_plane/simple.conf b/control_plane/simple.conf index 2243a0a5f8..925e2f14ee 100644 --- a/control_plane/simple.conf +++ b/control_plane/simple.conf @@ -9,3 +9,6 @@ auth_type = 'Trust' id = 1 pg_port = 5454 http_port = 7676 + +[etcd_broker] +broker_endpoints = ['http://127.0.0.1:2379'] diff --git a/control_plane/src/etcd.rs b/control_plane/src/etcd.rs new file mode 100644 index 0000000000..df657dd1be --- /dev/null +++ b/control_plane/src/etcd.rs @@ -0,0 +1,93 @@ +use std::{ + fs, + path::PathBuf, + process::{Command, Stdio}, +}; + +use anyhow::Context; +use nix::{ + sys::signal::{kill, Signal}, + unistd::Pid, +}; + +use crate::{local_env, read_pidfile}; + +pub fn start_etcd_process(env: &local_env::LocalEnv) -> anyhow::Result<()> { + let etcd_broker = &env.etcd_broker; + println!( + "Starting etcd broker using {}", + etcd_broker.etcd_binary_path.display() + ); + + let etcd_data_dir = env.base_data_dir.join("etcd"); + fs::create_dir_all(&etcd_data_dir).with_context(|| { + format!( + "Failed to create etcd data dir: {}", + etcd_data_dir.display() + ) + })?; + + let etcd_stdout_file = + fs::File::create(etcd_data_dir.join("etcd.stdout.log")).with_context(|| { + format!( + "Failed to create ectd stout file in directory {}", + etcd_data_dir.display() + ) + })?; + let etcd_stderr_file = + 
fs::File::create(etcd_data_dir.join("etcd.stderr.log")).with_context(|| { + format!( + "Failed to create ectd stderr file in directory {}", + etcd_data_dir.display() + ) + })?; + let client_urls = etcd_broker.comma_separated_endpoints(); + + let etcd_process = Command::new(&etcd_broker.etcd_binary_path) + .args(&[ + format!("--data-dir={}", etcd_data_dir.display()), + format!("--listen-client-urls={client_urls}"), + format!("--advertise-client-urls={client_urls}"), + ]) + .stdout(Stdio::from(etcd_stdout_file)) + .stderr(Stdio::from(etcd_stderr_file)) + .spawn() + .context("Failed to spawn etcd subprocess")?; + let pid = etcd_process.id(); + + let etcd_pid_file_path = etcd_pid_file_path(env); + fs::write(&etcd_pid_file_path, pid.to_string()).with_context(|| { + format!( + "Failed to create etcd pid file at {}", + etcd_pid_file_path.display() + ) + })?; + + Ok(()) +} + +pub fn stop_etcd_process(env: &local_env::LocalEnv) -> anyhow::Result<()> { + let etcd_path = &env.etcd_broker.etcd_binary_path; + println!("Stopping etcd broker at {}", etcd_path.display()); + + let etcd_pid_file_path = etcd_pid_file_path(env); + let pid = Pid::from_raw(read_pidfile(&etcd_pid_file_path).with_context(|| { + format!( + "Failed to read etcd pid filea at {}", + etcd_pid_file_path.display() + ) + })?); + + kill(pid, Signal::SIGTERM).with_context(|| { + format!( + "Failed to stop etcd with pid {pid} at {}", + etcd_pid_file_path.display() + ) + })?; + + Ok(()) +} + +fn etcd_pid_file_path(env: &local_env::LocalEnv) -> PathBuf { + env.base_data_dir.join("etcd.pid") +} diff --git a/control_plane/src/lib.rs b/control_plane/src/lib.rs index a2ecdd3d64..c3469c3350 100644 --- a/control_plane/src/lib.rs +++ b/control_plane/src/lib.rs @@ -12,6 +12,7 @@ use std::path::Path; use std::process::Command; pub mod compute; +pub mod etcd; pub mod local_env; pub mod postgresql_conf; pub mod safekeeper; diff --git a/control_plane/src/local_env.rs b/control_plane/src/local_env.rs index a8636f9073..c73af7d338 
100644 --- a/control_plane/src/local_env.rs +++ b/control_plane/src/local_env.rs @@ -60,14 +60,7 @@ pub struct LocalEnv { #[serde(default)] pub private_key_path: PathBuf, - // Broker (etcd) endpoints for storage nodes coordination, e.g. 'http://127.0.0.1:2379'. - #[serde(default)] - #[serde_as(as = "Vec")] - pub broker_endpoints: Vec, - - /// A prefix to all to any key when pushing/polling etcd from a node. - #[serde(default)] - pub broker_etcd_prefix: Option, + pub etcd_broker: EtcdBroker, pub pageserver: PageServerConf, @@ -83,6 +76,62 @@ pub struct LocalEnv { branch_name_mappings: HashMap>, } +/// Etcd broker config for cluster internal communication. +#[serde_as] +#[derive(Serialize, Deserialize, PartialEq, Eq, Clone, Debug)] +pub struct EtcdBroker { + /// A prefix to all to any key when pushing/polling etcd from a node. + #[serde(default)] + pub broker_etcd_prefix: Option, + + /// Broker (etcd) endpoints for storage nodes coordination, e.g. 'http://127.0.0.1:2379'. + #[serde(default)] + #[serde_as(as = "Vec")] + pub broker_endpoints: Vec, + + /// Etcd binary path to use. + #[serde(default)] + pub etcd_binary_path: PathBuf, +} + +impl EtcdBroker { + pub fn locate_etcd() -> anyhow::Result { + let which_output = Command::new("which") + .arg("etcd") + .output() + .context("Failed to run 'which etcd' command")?; + let stdout = String::from_utf8_lossy(&which_output.stdout); + ensure!( + which_output.status.success(), + "'which etcd' invocation failed. 
Status: {}, stdout: {stdout}, stderr: {}", + which_output.status, + String::from_utf8_lossy(&which_output.stderr) + ); + + let etcd_path = PathBuf::from(stdout.trim()); + ensure!( + etcd_path.is_file(), + "'which etcd' invocation was successful, but the path it returned is not a file or does not exist: {}", + etcd_path.display() + ); + + Ok(etcd_path) + } + + pub fn comma_separated_endpoints(&self) -> String { + self.broker_endpoints.iter().map(Url::as_str).fold( + String::new(), + |mut comma_separated_urls, url| { + if !comma_separated_urls.is_empty() { + comma_separated_urls.push(','); + } + comma_separated_urls.push_str(url); + comma_separated_urls + }, + ) + } +} + #[derive(Serialize, Deserialize, PartialEq, Eq, Clone, Debug)] #[serde(default)] pub struct PageServerConf { @@ -97,7 +146,6 @@ pub struct PageServerConf { // jwt auth token used for communication with pageserver pub auth_token: String, - pub broker_endpoints: Vec, } impl Default for PageServerConf { @@ -108,7 +156,6 @@ impl Default for PageServerConf { listen_http_addr: String::new(), auth_type: AuthType::Trust, auth_token: String::new(), - broker_endpoints: Vec::new(), } } } @@ -240,17 +287,7 @@ impl LocalEnv { // Find zenith binaries. if env.zenith_distrib_dir == Path::new("") { - let current_exec_path = - env::current_exe().context("Failed to find current excecutable's path")?; - env.zenith_distrib_dir = current_exec_path - .parent() - .with_context(|| { - format!( - "Failed to find a parent directory for executable {}", - current_exec_path.display(), - ) - })? - .to_owned(); + env.zenith_distrib_dir = env::current_exe()?.parent().unwrap().to_owned(); } // If no initial tenant ID was given, generate it. @@ -345,6 +382,22 @@ impl LocalEnv { "directory '{}' already exists. 
Perhaps already initialized?", base_path.display() ); + if !self.pg_distrib_dir.join("bin/postgres").exists() { + bail!( + "Can't find postgres binary at {}", + self.pg_distrib_dir.display() + ); + } + for binary in ["pageserver", "safekeeper"] { + if !self.zenith_distrib_dir.join(binary).exists() { + bail!( + "Can't find binary '{}' in zenith distrib dir '{}'", + binary, + self.zenith_distrib_dir.display() + ); + } + } + for binary in ["pageserver", "safekeeper"] { if !self.zenith_distrib_dir.join(binary).exists() { bail!( @@ -403,7 +456,6 @@ impl LocalEnv { self.pageserver.auth_token = self.generate_auth_token(&Claims::new(None, Scope::PageServerApi))?; - self.pageserver.broker_endpoints = self.broker_endpoints.clone(); fs::create_dir_all(self.pg_data_dirs_path())?; @@ -435,26 +487,12 @@ mod tests { "failed to parse simple config {simple_conf_toml}, reason: {simple_conf_parse_result:?}" ); - let regular_url_string = "broker_endpoints = ['localhost:1111']"; - let regular_url_toml = simple_conf_toml.replace( - "[pageserver]", - &format!("\n{regular_url_string}\n[pageserver]"), - ); - match LocalEnv::parse_config(&regular_url_toml) { - Ok(regular_url_parsed) => { - assert_eq!( - regular_url_parsed.broker_endpoints, - vec!["localhost:1111".parse().unwrap()], - "Unexpectedly parsed broker endpoint url" - ); - } - Err(e) => panic!("failed to parse simple config {regular_url_toml}, reason: {e}"), - } - - let spoiled_url_string = "broker_endpoints = ['!@$XOXO%^&']"; - let spoiled_url_toml = simple_conf_toml.replace( - "[pageserver]", - &format!("\n{spoiled_url_string}\n[pageserver]"), + let string_to_replace = "broker_endpoints = ['http://127.0.0.1:2379']"; + let spoiled_url_str = "broker_endpoints = ['!@$XOXO%^&']"; + let spoiled_url_toml = simple_conf_toml.replace(string_to_replace, spoiled_url_str); + assert!( + spoiled_url_toml.contains(spoiled_url_str), + "Failed to replace string {string_to_replace} in the toml file {simple_conf_toml}" ); let spoiled_url_parse_result
= LocalEnv::parse_config(&spoiled_url_toml); assert!( diff --git a/control_plane/src/safekeeper.rs b/control_plane/src/safekeeper.rs index c5b7f830bf..407cd05c73 100644 --- a/control_plane/src/safekeeper.rs +++ b/control_plane/src/safekeeper.rs @@ -12,7 +12,7 @@ use nix::sys::signal::{kill, Signal}; use nix::unistd::Pid; use postgres::Config; use reqwest::blocking::{Client, RequestBuilder, Response}; -use reqwest::{IntoUrl, Method, Url}; +use reqwest::{IntoUrl, Method}; use safekeeper::http::models::TimelineCreateRequest; use thiserror::Error; use utils::{ @@ -75,9 +75,6 @@ pub struct SafekeeperNode { pub http_base_url: String, pub pageserver: Arc, - - broker_endpoints: Vec, - broker_etcd_prefix: Option, } impl SafekeeperNode { @@ -94,8 +91,6 @@ impl SafekeeperNode { http_client: Client::new(), http_base_url: format!("http://127.0.0.1:{}/v1", conf.http_port), pageserver, - broker_endpoints: env.broker_endpoints.clone(), - broker_etcd_prefix: env.broker_etcd_prefix.clone(), } } @@ -137,29 +132,21 @@ impl SafekeeperNode { .args(&["--listen-pg", &listen_pg]) .args(&["--listen-http", &listen_http]) .args(&["--recall", "1 second"]) - .args(&["--broker-endpoints", &self.broker_endpoints.join(",")]) + .args(&[ + "--broker-endpoints", + &self.env.etcd_broker.comma_separated_endpoints(), + ]) .arg("--daemonize"), ); if !self.conf.sync { cmd.arg("--no-sync"); } - if !self.broker_endpoints.is_empty() { - cmd.args(&[ - "--broker-endpoints", - &self.broker_endpoints.iter().map(Url::as_str).fold( - String::new(), - |mut comma_separated_urls, url| { - if !comma_separated_urls.is_empty() { - comma_separated_urls.push(','); - } - comma_separated_urls.push_str(url); - comma_separated_urls - }, - ), - ]); + let comma_separated_endpoints = self.env.etcd_broker.comma_separated_endpoints(); + if !comma_separated_endpoints.is_empty() { + cmd.args(&["--broker-endpoints", &comma_separated_endpoints]); } - if let Some(prefix) = self.broker_etcd_prefix.as_deref() { + if let Some(prefix) = 
self.env.etcd_broker.broker_etcd_prefix.as_deref() { cmd.args(&["--broker-etcd-prefix", prefix]); } diff --git a/control_plane/src/storage.rs b/control_plane/src/storage.rs index 0b9fddd64a..7dbc19e145 100644 --- a/control_plane/src/storage.rs +++ b/control_plane/src/storage.rs @@ -124,7 +124,7 @@ impl PageServerNode { let broker_endpoints_param = format!( "broker_endpoints=[{}]", self.env - .pageserver + .etcd_broker .broker_endpoints .iter() .map(|url| format!("'{url}'")) @@ -142,6 +142,16 @@ impl PageServerNode { args.extend(["-c", &broker_endpoints_param]); args.extend(["-c", &id]); + let broker_etcd_prefix_param = self + .env + .etcd_broker + .broker_etcd_prefix + .as_ref() + .map(|prefix| format!("broker_etcd_prefix='{prefix}'")); + if let Some(broker_etcd_prefix_param) = broker_etcd_prefix_param.as_deref() { + args.extend(["-c", broker_etcd_prefix_param]); + } + for config_override in config_overrides { args.extend(["-c", config_override]); } diff --git a/docker-entrypoint.sh b/docker-entrypoint.sh index 0e4cf45f29..6bcbc76551 100755 --- a/docker-entrypoint.sh +++ b/docker-entrypoint.sh @@ -1,17 +1,20 @@ #!/bin/sh set -eux +broker_endpoints_param="${BROKER_ENDPOINT:-absent}" +if [ "$broker_endpoints_param" != "absent" ]; then + broker_endpoints_param="-c broker_endpoints=['$broker_endpoints_param']" +else + broker_endpoints_param='' +fi + if [ "$1" = 'pageserver' ]; then if [ ! 
-d "/data/tenants" ]; then echo "Initializing pageserver data directory" - pageserver --init -D /data -c "pg_distrib_dir='/usr/local'" -c "id=10" + pageserver --init -D /data -c "pg_distrib_dir='/usr/local'" -c "id=10" $broker_endpoints_param fi echo "Staring pageserver at 0.0.0.0:6400" - if [ -z '${BROKER_ENDPOINTS}' ]; then - pageserver -c "listen_pg_addr='0.0.0.0:6400'" -c "listen_http_addr='0.0.0.0:9898'" -D /data - else - pageserver -c "listen_pg_addr='0.0.0.0:6400'" -c "listen_http_addr='0.0.0.0:9898'" -c "broker_endpoints=['${BROKER_ENDPOINTS}']" -D /data - fi + pageserver -c "listen_pg_addr='0.0.0.0:6400'" -c "listen_http_addr='0.0.0.0:9898'" $broker_endpoints_param -D /data else "$@" fi diff --git a/docs/settings.md b/docs/settings.md index 017d349bb6..9564ef626f 100644 --- a/docs/settings.md +++ b/docs/settings.md @@ -25,10 +25,14 @@ max_file_descriptors = '100' # initial superuser role name to use when creating a new tenant initial_superuser_name = 'zenith_admin' +broker_etcd_prefix = 'neon' +broker_endpoints = ['some://etcd'] + # [remote_storage] ``` -The config above shows default values for all basic pageserver settings. +The config above shows default values for all basic pageserver settings, besides `broker_endpoints`: that one has to be set by the user, +see the corresponding section below. Pageserver uses default values for all files that are missing in the config, so it's not a hard error to leave the config blank. Yet, it validates the config values it can (e.g. postgres install dir) and errors if the validation fails, refusing to start. @@ -46,6 +50,17 @@ Example: `${PAGESERVER_BIN} -c "checkpoint_period = '100 s'" -c "remote_storage= Note that TOML distinguishes between strings and integers, the former require single or double quotes around them. +#### broker_endpoints + +A list of endpoints (etcd currently) to connect and pull the information from. 
+Mandatory, with no default: etcd has to be started as a separate process, +and its connection URL has to be specified explicitly. + +#### broker_etcd_prefix + +A prefix added to every etcd key used, to separate one group of related instances from another within the same cluster. +Default is `neon`. + #### checkpoint_distance `checkpoint_distance` is the amount of incoming WAL that is held in diff --git a/libs/etcd_broker/src/lib.rs b/libs/etcd_broker/src/lib.rs index 1b27f99ccf..76181f9ba1 100644 --- a/libs/etcd_broker/src/lib.rs +++ b/libs/etcd_broker/src/lib.rs @@ -19,6 +19,10 @@ use utils::{ zid::{ZNodeId, ZTenantId, ZTenantTimelineId}, }; +/// Default prefix to prepend to all etcd keys. +/// This allows isolating safekeeper/pageserver groups in the same etcd cluster. +pub const DEFAULT_NEON_BROKER_ETCD_PREFIX: &str = "neon"; + #[derive(Debug, Deserialize, Serialize)] struct SafekeeperTimeline { safekeeper_id: ZNodeId, @@ -104,28 +108,28 @@ impl SkTimelineSubscription { /// The subscription kind to the timeline updates from safekeeper. 
#[derive(Debug, Clone, PartialEq, Eq, Hash)] pub struct SkTimelineSubscriptionKind { - broker_prefix: String, + broker_etcd_prefix: String, kind: SubscriptionKind, } impl SkTimelineSubscriptionKind { - pub fn all(broker_prefix: String) -> Self { + pub fn all(broker_etcd_prefix: String) -> Self { Self { - broker_prefix, + broker_etcd_prefix, kind: SubscriptionKind::All, } } - pub fn tenant(broker_prefix: String, tenant: ZTenantId) -> Self { + pub fn tenant(broker_etcd_prefix: String, tenant: ZTenantId) -> Self { Self { - broker_prefix, + broker_etcd_prefix, kind: SubscriptionKind::Tenant(tenant), } } - pub fn timeline(broker_prefix: String, timeline: ZTenantTimelineId) -> Self { + pub fn timeline(broker_etcd_prefix: String, timeline: ZTenantTimelineId) -> Self { Self { - broker_prefix, + broker_etcd_prefix, kind: SubscriptionKind::Timeline(timeline), } } @@ -134,12 +138,12 @@ impl SkTimelineSubscriptionKind { match self.kind { SubscriptionKind::All => Regex::new(&format!( r"^{}/([[:xdigit:]]+)/([[:xdigit:]]+)/safekeeper/([[:digit:]])$", - self.broker_prefix + self.broker_etcd_prefix )) .expect("wrong regex for 'everything' subscription"), SubscriptionKind::Tenant(tenant_id) => Regex::new(&format!( r"^{}/{tenant_id}/([[:xdigit:]]+)/safekeeper/([[:digit:]])$", - self.broker_prefix + self.broker_etcd_prefix )) .expect("wrong regex for 'tenant' subscription"), SubscriptionKind::Timeline(ZTenantTimelineId { @@ -147,7 +151,7 @@ impl SkTimelineSubscriptionKind { timeline_id, }) => Regex::new(&format!( r"^{}/{tenant_id}/{timeline_id}/safekeeper/([[:digit:]])$", - self.broker_prefix + self.broker_etcd_prefix )) .expect("wrong regex for 'timeline' subscription"), } @@ -156,16 +160,16 @@ impl SkTimelineSubscriptionKind { /// Etcd key to use for watching a certain timeline updates from safekeepers. 
pub fn watch_key(&self) -> String { match self.kind { - SubscriptionKind::All => self.broker_prefix.to_string(), + SubscriptionKind::All => self.broker_etcd_prefix.to_string(), SubscriptionKind::Tenant(tenant_id) => { - format!("{}/{tenant_id}/safekeeper", self.broker_prefix) + format!("{}/{tenant_id}/safekeeper", self.broker_etcd_prefix) } SubscriptionKind::Timeline(ZTenantTimelineId { tenant_id, timeline_id, }) => format!( "{}/{tenant_id}/{timeline_id}/safekeeper", - self.broker_prefix + self.broker_etcd_prefix ), } } diff --git a/neon_local/src/main.rs b/neon_local/src/main.rs index e5ac46d3b1..f04af9cfdd 100644 --- a/neon_local/src/main.rs +++ b/neon_local/src/main.rs @@ -1,10 +1,10 @@ use anyhow::{anyhow, bail, Context, Result}; use clap::{App, AppSettings, Arg, ArgMatches}; use control_plane::compute::ComputeControlPlane; -use control_plane::local_env; -use control_plane::local_env::LocalEnv; +use control_plane::local_env::{EtcdBroker, LocalEnv}; use control_plane::safekeeper::SafekeeperNode; use control_plane::storage::PageServerNode; +use control_plane::{etcd, local_env}; use pageserver::config::defaults::{ DEFAULT_HTTP_LISTEN_ADDR as DEFAULT_PAGESERVER_HTTP_ADDR, DEFAULT_PG_LISTEN_ADDR as DEFAULT_PAGESERVER_PG_ADDR, @@ -14,6 +14,7 @@ use safekeeper::defaults::{ DEFAULT_PG_LISTEN_PORT as DEFAULT_SAFEKEEPER_PG_PORT, }; use std::collections::{BTreeSet, HashMap}; +use std::path::Path; use std::process::exit; use std::str::FromStr; use utils::{ @@ -32,28 +33,27 @@ const DEFAULT_PAGESERVER_ID: ZNodeId = ZNodeId(1); const DEFAULT_BRANCH_NAME: &str = "main"; project_git_version!(GIT_VERSION); -fn default_conf() -> String { +fn default_conf(etcd_binary_path: &Path) -> String { format!( r#" # Default built-in configuration, defined in main.rs +[etcd_broker] +broker_endpoints = ['http://localhost:2379'] +etcd_binary_path = '{etcd_binary_path}' + [pageserver] -id = {pageserver_id} -listen_pg_addr = '{pageserver_pg_addr}' -listen_http_addr = '{pageserver_http_addr}' 
+id = {DEFAULT_PAGESERVER_ID} +listen_pg_addr = '{DEFAULT_PAGESERVER_PG_ADDR}' +listen_http_addr = '{DEFAULT_PAGESERVER_HTTP_ADDR}' auth_type = '{pageserver_auth_type}' [[safekeepers]] -id = {safekeeper_id} -pg_port = {safekeeper_pg_port} -http_port = {safekeeper_http_port} +id = {DEFAULT_SAFEKEEPER_ID} +pg_port = {DEFAULT_SAFEKEEPER_PG_PORT} +http_port = {DEFAULT_SAFEKEEPER_HTTP_PORT} "#, - pageserver_id = DEFAULT_PAGESERVER_ID, - pageserver_pg_addr = DEFAULT_PAGESERVER_PG_ADDR, - pageserver_http_addr = DEFAULT_PAGESERVER_HTTP_ADDR, + etcd_binary_path = etcd_binary_path.display(), pageserver_auth_type = AuthType::Trust, - safekeeper_id = DEFAULT_SAFEKEEPER_ID, - safekeeper_pg_port = DEFAULT_SAFEKEEPER_PG_PORT, - safekeeper_http_port = DEFAULT_SAFEKEEPER_HTTP_PORT, ) } @@ -167,12 +167,12 @@ fn main() -> Result<()> { .subcommand(App::new("create") .arg(tenant_id_arg.clone()) .arg(timeline_id_arg.clone().help("Use a specific timeline id when creating a tenant and its initial timeline")) - .arg(Arg::new("config").short('c').takes_value(true).multiple_occurrences(true).required(false)) - ) + .arg(Arg::new("config").short('c').takes_value(true).multiple_occurrences(true).required(false)) + ) .subcommand(App::new("config") .arg(tenant_id_arg.clone()) - .arg(Arg::new("config").short('c').takes_value(true).multiple_occurrences(true).required(false)) - ) + .arg(Arg::new("config").short('c').takes_value(true).multiple_occurrences(true).required(false)) + ) ) .subcommand( App::new("pageserver") @@ -468,17 +468,17 @@ fn parse_timeline_id(sub_match: &ArgMatches) -> anyhow::Result Result { +fn handle_init(init_match: &ArgMatches) -> anyhow::Result { let initial_timeline_id_arg = parse_timeline_id(init_match)?; // Create config file let toml_file: String = if let Some(config_path) = init_match.value_of("config") { // load and parse the file std::fs::read_to_string(std::path::Path::new(config_path)) - .with_context(|| format!("Could not read configuration file \"{}\"", 
config_path))? + .with_context(|| format!("Could not read configuration file '{config_path}'"))? } else { // Built-in default config - default_conf() + default_conf(&EtcdBroker::locate_etcd()?) }; let mut env = @@ -497,7 +497,7 @@ fn handle_init(init_match: &ArgMatches) -> Result { &pageserver_config_overrides(init_match), ) .unwrap_or_else(|e| { - eprintln!("pageserver init failed: {}", e); + eprintln!("pageserver init failed: {e}"); exit(1); }); @@ -920,20 +920,23 @@ fn handle_safekeeper(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Resul Ok(()) } -fn handle_start_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result<()> { +fn handle_start_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> anyhow::Result<()> { + etcd::start_etcd_process(env)?; let pageserver = PageServerNode::from_env(env); // Postgres nodes are not started automatically if let Err(e) = pageserver.start(&pageserver_config_overrides(sub_match)) { - eprintln!("pageserver start failed: {}", e); + eprintln!("pageserver start failed: {e}"); + try_stop_etcd_process(env); exit(1); } for node in env.safekeepers.iter() { let safekeeper = SafekeeperNode::from_env(env, node); if let Err(e) = safekeeper.start() { - eprintln!("safekeeper '{}' start failed: {}", safekeeper.id, e); + eprintln!("safekeeper '{}' start failed: {e}", safekeeper.id); + try_stop_etcd_process(env); exit(1); } } @@ -963,5 +966,14 @@ fn handle_stop_all(sub_match: &ArgMatches, env: &local_env::LocalEnv) -> Result< eprintln!("safekeeper '{}' stop failed: {}", safekeeper.id, e); } } + + try_stop_etcd_process(env); + Ok(()) } + +fn try_stop_etcd_process(env: &local_env::LocalEnv) { + if let Err(e) = etcd::stop_etcd_process(env) { + eprintln!("etcd stop failed: {e}"); + } +} diff --git a/pageserver/Cargo.toml b/pageserver/Cargo.toml index 9cc8444531..290f52e0b2 100644 --- a/pageserver/Cargo.toml +++ b/pageserver/Cargo.toml @@ -55,6 +55,7 @@ fail = "0.5.0" git-version = "0.3.5" postgres_ffi = { path = 
"../libs/postgres_ffi" } +etcd_broker = { path = "../libs/etcd_broker" } metrics = { path = "../libs/metrics" } utils = { path = "../libs/utils" } remote_storage = { path = "../libs/remote_storage" } diff --git a/pageserver/src/config.rs b/pageserver/src/config.rs index 8748683f32..a9215c0701 100644 --- a/pageserver/src/config.rs +++ b/pageserver/src/config.rs @@ -113,6 +113,10 @@ pub struct PageServerConf { pub profiling: ProfilingConfig, pub default_tenant_conf: TenantConf, + /// A prefix to add in etcd brokers before every key. + /// Can be used for isolating different pageserver groups withing the same etcd cluster. + pub broker_etcd_prefix: String, + /// Etcd broker endpoints to connect to. pub broker_endpoints: Vec, } @@ -179,6 +183,7 @@ struct PageServerConfigBuilder { id: BuilderValue, profiling: BuilderValue, + broker_etcd_prefix: BuilderValue, broker_endpoints: BuilderValue>, } @@ -205,7 +210,8 @@ impl Default for PageServerConfigBuilder { remote_storage_config: Set(None), id: NotSet, profiling: Set(ProfilingConfig::Disabled), - broker_endpoints: NotSet, + broker_etcd_prefix: Set(etcd_broker::DEFAULT_NEON_BROKER_ETCD_PREFIX.to_string()), + broker_endpoints: Set(Vec::new()), } } } @@ -266,6 +272,10 @@ impl PageServerConfigBuilder { self.broker_endpoints = BuilderValue::Set(broker_endpoints) } + pub fn broker_etcd_prefix(&mut self, broker_etcd_prefix: String) { + self.broker_etcd_prefix = BuilderValue::Set(broker_etcd_prefix) + } + pub fn id(&mut self, node_id: ZNodeId) { self.id = BuilderValue::Set(node_id) } @@ -278,10 +288,6 @@ impl PageServerConfigBuilder { let broker_endpoints = self .broker_endpoints .ok_or(anyhow!("No broker endpoints provided"))?; - ensure!( - !broker_endpoints.is_empty(), - "Empty broker endpoints collection provided" - ); Ok(PageServerConf { listen_pg_addr: self @@ -319,6 +325,9 @@ impl PageServerConfigBuilder { // TenantConf is handled separately default_tenant_conf: TenantConf::default(), broker_endpoints, + broker_etcd_prefix: 
self + .broker_etcd_prefix + .ok_or(anyhow!("missing broker_etcd_prefix"))?, }) } } @@ -392,6 +401,7 @@ impl PageServerConf { } "id" => builder.id(ZNodeId(parse_toml_u64(key, item)?)), "profiling" => builder.profiling(parse_toml_from_str(key, item)?), + "broker_etcd_prefix" => builder.broker_etcd_prefix(parse_toml_string(key, item)?), "broker_endpoints" => builder.broker_endpoints( parse_toml_array(key, item)? .into_iter() @@ -556,6 +566,7 @@ impl PageServerConf { profiling: ProfilingConfig::Disabled, default_tenant_conf: TenantConf::dummy_conf(), broker_endpoints: Vec::new(), + broker_etcd_prefix: etcd_broker::DEFAULT_NEON_BROKER_ETCD_PREFIX.to_string(), } } } @@ -700,6 +711,7 @@ id = 10 broker_endpoints: vec![broker_endpoint .parse() .expect("Failed to parse a valid broker endpoint URL")], + broker_etcd_prefix: etcd_broker::DEFAULT_NEON_BROKER_ETCD_PREFIX.to_string(), }, "Correct defaults should be used when no config values are provided" ); @@ -743,6 +755,7 @@ id = 10 broker_endpoints: vec![broker_endpoint .parse() .expect("Failed to parse a valid broker endpoint URL")], + broker_etcd_prefix: etcd_broker::DEFAULT_NEON_BROKER_ETCD_PREFIX.to_string(), }, "Should be able to parse all basic config values correctly" ); @@ -795,7 +808,7 @@ broker_endpoints = ['{broker_endpoint}'] max_concurrent_syncs: NonZeroUsize::new( remote_storage::DEFAULT_REMOTE_STORAGE_MAX_CONCURRENT_SYNCS ) - .unwrap(), + .unwrap(), max_sync_errors: NonZeroU32::new(remote_storage::DEFAULT_REMOTE_STORAGE_MAX_SYNC_ERRORS) .unwrap(), storage: RemoteStorageKind::LocalFs(local_storage_path.clone()), diff --git a/safekeeper/src/bin/safekeeper.rs b/safekeeper/src/bin/safekeeper.rs index d7875a9069..2d47710a88 100644 --- a/safekeeper/src/bin/safekeeper.rs +++ b/safekeeper/src/bin/safekeeper.rs @@ -1,7 +1,7 @@ // // Main entry point for the safekeeper executable // -use anyhow::{bail, ensure, Context, Result}; +use anyhow::{bail, Context, Result}; use clap::{App, Arg}; use const_format::formatcp; use 
daemonize::Daemonize; @@ -179,10 +179,6 @@ fn main() -> anyhow::Result<()> { let collected_ep: Result<Vec<Url>, ParseError> = addr.split(',').map(Url::parse).collect(); conf.broker_endpoints = collected_ep.context("Failed to parse broker endpoint urls")?; } - ensure!( - !conf.broker_endpoints.is_empty(), - "No broker endpoints provided" - ); if let Some(prefix) = arg_matches.value_of("broker-etcd-prefix") { conf.broker_etcd_prefix = prefix.to_string(); } @@ -313,14 +309,18 @@ fn start_safekeeper(mut conf: SafeKeeperConf, given_id: Option, init: b .unwrap(); threads.push(callmemaybe_thread); - let conf_ = conf.clone(); - threads.push( - thread::Builder::new() - .name("broker thread".into()) - .spawn(|| { - broker::thread_main(conf_); - })?, - ); + if !conf.broker_endpoints.is_empty() { + let conf_ = conf.clone(); + threads.push( + thread::Builder::new() + .name("broker thread".into()) + .spawn(|| { + broker::thread_main(conf_); + })?, + ); + } else { + warn!("No broker endpoints provided, starting without node sync") + } let conf_ = conf.clone(); threads.push( diff --git a/safekeeper/src/broker.rs b/safekeeper/src/broker.rs index c906bc1e74..d7217be20a 100644 --- a/safekeeper/src/broker.rs +++ b/safekeeper/src/broker.rs @@ -34,13 +34,13 @@ pub fn thread_main(conf: SafeKeeperConf) { /// Key to per timeline per safekeeper data. 
fn timeline_safekeeper_path( - broker_prefix: String, + broker_etcd_prefix: String, zttid: ZTenantTimelineId, sk_id: ZNodeId, ) -> String { format!( "{}/{sk_id}", - SkTimelineSubscriptionKind::timeline(broker_prefix, zttid).watch_key() + SkTimelineSubscriptionKind::timeline(broker_etcd_prefix, zttid).watch_key() ) } diff --git a/safekeeper/src/lib.rs b/safekeeper/src/lib.rs index 131076fab6..a87e5da686 100644 --- a/safekeeper/src/lib.rs +++ b/safekeeper/src/lib.rs @@ -27,7 +27,6 @@ pub mod defaults { pub const DEFAULT_PG_LISTEN_PORT: u16 = 5454; pub const DEFAULT_PG_LISTEN_ADDR: &str = formatcp!("127.0.0.1:{DEFAULT_PG_LISTEN_PORT}"); - pub const DEFAULT_NEON_BROKER_PREFIX: &str = "neon"; pub const DEFAULT_HTTP_LISTEN_PORT: u16 = 7676; pub const DEFAULT_HTTP_LISTEN_ADDR: &str = formatcp!("127.0.0.1:{DEFAULT_HTTP_LISTEN_PORT}"); @@ -82,7 +81,7 @@ impl Default for SafeKeeperConf { recall_period: defaults::DEFAULT_RECALL_PERIOD, my_id: ZNodeId(0), broker_endpoints: Vec::new(), - broker_etcd_prefix: defaults::DEFAULT_NEON_BROKER_PREFIX.to_string(), + broker_etcd_prefix: etcd_broker::DEFAULT_NEON_BROKER_ETCD_PREFIX.to_string(), s3_offload_enabled: true, } } diff --git a/test_runner/batch_others/test_wal_acceptor.py b/test_runner/batch_others/test_wal_acceptor.py index 85798156a7..e1b7bd91ee 100644 --- a/test_runner/batch_others/test_wal_acceptor.py +++ b/test_runner/batch_others/test_wal_acceptor.py @@ -13,7 +13,7 @@ from dataclasses import dataclass, field from multiprocessing import Process, Value from pathlib import Path from fixtures.zenith_fixtures import PgBin, Etcd, Postgres, Safekeeper, ZenithEnv, ZenithEnvBuilder, PortDistributor, SafekeeperPort, zenith_binpath, PgProtocol -from fixtures.utils import etcd_path, get_dir_size, lsn_to_hex, mkdir_if_needed, lsn_from_hex +from fixtures.utils import get_dir_size, lsn_to_hex, mkdir_if_needed, lsn_from_hex from fixtures.log_helper import log from typing import List, Optional, Any @@ -327,7 +327,6 @@ def 
test_race_conditions(zenith_env_builder: ZenithEnvBuilder, stop_value): # Test that safekeepers push their info to the broker and learn peer status from it -@pytest.mark.skipif(etcd_path() is None, reason="requires etcd which is not present in PATH") def test_broker(zenith_env_builder: ZenithEnvBuilder): zenith_env_builder.num_safekeepers = 3 zenith_env_builder.enable_local_fs_remote_storage() @@ -369,7 +368,6 @@ def test_broker(zenith_env_builder: ZenithEnvBuilder): # Test that old WAL consumed by peers and pageserver is removed from safekeepers. -@pytest.mark.skipif(etcd_path() is None, reason="requires etcd which is not present in PATH") def test_wal_removal(zenith_env_builder: ZenithEnvBuilder): zenith_env_builder.num_safekeepers = 2 # to advance remote_consistent_llsn diff --git a/test_runner/fixtures/utils.py b/test_runner/fixtures/utils.py index 7b95e729d9..ba9bc6e113 100644 --- a/test_runner/fixtures/utils.py +++ b/test_runner/fixtures/utils.py @@ -1,8 +1,9 @@ import os import shutil import subprocess +from pathlib import Path -from typing import Any, List +from typing import Any, List, Optional from fixtures.log_helper import log @@ -80,9 +81,12 @@ def print_gc_result(row): .format_map(row)) -# path to etcd binary or None if not present. -def etcd_path(): - return shutil.which("etcd") +def etcd_path() -> Path: + path_output = shutil.which("etcd") + if path_output is None: + raise RuntimeError('etcd not found in PATH') + else: + return Path(path_output) # Traverse directory to get total size. 
diff --git a/test_runner/fixtures/zenith_fixtures.py b/test_runner/fixtures/zenith_fixtures.py index 09f7f26588..78de78144c 100644 --- a/test_runner/fixtures/zenith_fixtures.py +++ b/test_runner/fixtures/zenith_fixtures.py @@ -555,7 +555,9 @@ class ZenithEnv: self.broker = config.broker toml += textwrap.dedent(f""" + [etcd_broker] broker_endpoints = ['{self.broker.client_url()}'] + etcd_binary_path = '{self.broker.binary_path}' """) # Create config for pageserver @@ -1846,6 +1848,7 @@ class Etcd: datadir: str port: int peer_port: int + binary_path: Path = etcd_path() handle: Optional[subprocess.Popen[Any]] = None # handle of running daemon def client_url(self): @@ -1858,15 +1861,15 @@ class Etcd: def start(self): pathlib.Path(self.datadir).mkdir(exist_ok=True) - etcd_full_path = etcd_path() - if etcd_full_path is None: - raise Exception('etcd binary not found locally') + + if not self.binary_path.is_file(): + raise RuntimeError(f"etcd broker binary '{self.binary_path}' is not a file") client_url = self.client_url() log.info(f'Starting etcd to listen incoming connections at "{client_url}"') with open(os.path.join(self.datadir, "etcd.log"), "wb") as log_file: args = [ - etcd_full_path, + self.binary_path, f"--data-dir={self.datadir}", f"--listen-client-urls={client_url}", f"--advertise-client-urls={client_url}", @@ -1927,8 +1930,7 @@ SKIP_DIRS = frozenset(('pg_wal', 'pg_stat_tmp', 'pg_subtrans', 'pg_logical', - 'pg_replslot/wal_proposer_slot', - 'pg_xact')) + 'pg_replslot/wal_proposer_slot')) SKIP_FILES = frozenset(('pg_internal.init', 'pg.log', From f2881bbd8a90bc4b04fb1693ad3a684b260a0f98 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Sat, 14 May 2022 15:03:12 +0300 Subject: [PATCH 256/296] Start and stop single etcd and mock s3 servers globally in python tests --- .circleci/config.yml | 2 +- control_plane/src/safekeeper.rs | 4 - test_runner/README.md | 1 - .../batch_others/test_tenant_relocation.py | 8 +- test_runner/fixtures/zenith_fixtures.py | 151 
++++++++++-------- 5 files changed, 87 insertions(+), 79 deletions(-) diff --git a/.circleci/config.yml b/.circleci/config.yml index 85654b5d45..62ae60eb18 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -355,7 +355,7 @@ jobs: when: always command: | du -sh /tmp/test_output/* - find /tmp/test_output -type f ! -name "pg.log" ! -name "pageserver.log" ! -name "safekeeper.log" ! -name "regression.diffs" ! -name "junit.xml" ! -name "*.filediff" ! -name "*.stdout" ! -name "*.stderr" ! -name "flamegraph.svg" -delete + find /tmp/test_output -type f ! -name "pg.log" ! -name "pageserver.log" ! -name "safekeeper.log" ! -name "etcd.log" ! -name "regression.diffs" ! -name "junit.xml" ! -name "*.filediff" ! -name "*.stdout" ! -name "*.stderr" ! -name "flamegraph.svg" -delete du -sh /tmp/test_output/* - store_artifacts: path: /tmp/test_output diff --git a/control_plane/src/safekeeper.rs b/control_plane/src/safekeeper.rs index 407cd05c73..1ac06cb2d2 100644 --- a/control_plane/src/safekeeper.rs +++ b/control_plane/src/safekeeper.rs @@ -132,10 +132,6 @@ impl SafekeeperNode { .args(&["--listen-pg", &listen_pg]) .args(&["--listen-http", &listen_http]) .args(&["--recall", "1 second"]) - .args(&[ - "--broker-endpoints", - &self.env.etcd_broker.comma_separated_endpoints(), - ]) .arg("--daemonize"), ); if !self.conf.sync { diff --git a/test_runner/README.md b/test_runner/README.md index ee171ae6a0..059bbb83cc 100644 --- a/test_runner/README.md +++ b/test_runner/README.md @@ -51,7 +51,6 @@ Useful environment variables: should go. `TEST_SHARED_FIXTURES`: Try to re-use a single pageserver for all the tests. `ZENITH_PAGESERVER_OVERRIDES`: add a `;`-separated set of configs that will be passed as -`FORCE_MOCK_S3`: inits every test's pageserver with a mock S3 used as a remote storage. 
`--pageserver-config-override=${value}` parameter values when zenith cli is invoked `RUST_LOG`: logging configuration to pass into Zenith CLI diff --git a/test_runner/batch_others/test_tenant_relocation.py b/test_runner/batch_others/test_tenant_relocation.py index 85a91b9ce1..0e5dd6eadf 100644 --- a/test_runner/batch_others/test_tenant_relocation.py +++ b/test_runner/batch_others/test_tenant_relocation.py @@ -3,8 +3,10 @@ import os import pathlib import subprocess import threading +import typing from uuid import UUID from fixtures.log_helper import log +from typing import Optional import signal import pytest @@ -22,7 +24,7 @@ def new_pageserver_helper(new_pageserver_dir: pathlib.Path, remote_storage_mock_path: pathlib.Path, pg_port: int, http_port: int, - broker: Etcd): + broker: Optional[Etcd]): """ cannot use ZenithPageserver yet because it depends on zenith cli which currently lacks support for multiple pageservers @@ -37,9 +39,11 @@ def new_pageserver_helper(new_pageserver_dir: pathlib.Path, f"-c pg_distrib_dir='{pg_distrib_dir}'", f"-c id=2", f"-c remote_storage={{local_path='{remote_storage_mock_path}'}}", - f"-c broker_endpoints=['{broker.client_url()}']", ] + if broker is not None: + cmd.append(f"-c broker_endpoints=['{broker.client_url()}']", ) + subprocess.check_output(cmd, text=True) # actually run new pageserver diff --git a/test_runner/fixtures/zenith_fixtures.py b/test_runner/fixtures/zenith_fixtures.py index 78de78144c..8fca56143e 100644 --- a/test_runner/fixtures/zenith_fixtures.py +++ b/test_runner/fixtures/zenith_fixtures.py @@ -61,7 +61,7 @@ DEFAULT_POSTGRES_DIR = 'tmp_install' DEFAULT_BRANCH_NAME = 'main' BASE_PORT = 15000 -WORKER_PORT_NUM = 100 +WORKER_PORT_NUM = 1000 def pytest_addoption(parser): @@ -178,7 +178,7 @@ def shareable_scope(fixture_name, config) -> Literal["session", "function"]: return 'function' if os.environ.get('TEST_SHARED_FIXTURES') is None else 'session' -@pytest.fixture(scope=shareable_scope) 
+@pytest.fixture(scope='session') def worker_seq_no(worker_id: str): # worker_id is a pytest-xdist fixture # it can be master or gw @@ -189,7 +189,7 @@ def worker_seq_no(worker_id: str): return int(worker_id[2:]) -@pytest.fixture(scope=shareable_scope) +@pytest.fixture(scope='session') def worker_base_port(worker_seq_no: int): # so we divide ports in ranges of 100 ports # so workers have disjoint set of ports for services @@ -242,11 +242,30 @@ class PortDistributor: 'port range configured for test is exhausted, consider enlarging the range') -@pytest.fixture(scope=shareable_scope) +@pytest.fixture(scope='session') def port_distributor(worker_base_port): return PortDistributor(base_port=worker_base_port, port_number=WORKER_PORT_NUM) +@pytest.fixture(scope='session') +def default_broker(request: Any, port_distributor: PortDistributor): + client_port = port_distributor.get_port() + # multiple pytest sessions could get launched in parallel, get them different datadirs + etcd_datadir = os.path.join(get_test_output_dir(request), f"etcd_datadir_{client_port}") + pathlib.Path(etcd_datadir).mkdir(exist_ok=True, parents=True) + + broker = Etcd(datadir=etcd_datadir, port=client_port, peer_port=port_distributor.get_port()) + yield broker + broker.stop() + + +@pytest.fixture(scope='session') +def mock_s3_server(port_distributor: PortDistributor): + mock_s3_server = MockS3Server(port_distributor.get_port()) + yield mock_s3_server + mock_s3_server.kill() + + class PgProtocol: """ Reusable connection logic """ def __init__(self, **kwargs): @@ -410,7 +429,9 @@ class ZenithEnvBuilder: def __init__(self, repo_dir: Path, port_distributor: PortDistributor, - pageserver_remote_storage: Optional[RemoteStorage] = None, + broker: Etcd, + mock_s3_server: MockS3Server, + remote_storage: Optional[RemoteStorage] = None, pageserver_config_override: Optional[str] = None, num_safekeepers: int = 1, pageserver_auth_enabled: bool = False, @@ -419,24 +440,15 @@ class ZenithEnvBuilder: self.repo_dir = 
repo_dir self.rust_log_override = rust_log_override self.port_distributor = port_distributor - self.pageserver_remote_storage = pageserver_remote_storage + self.remote_storage = remote_storage + self.broker = broker + self.mock_s3_server = mock_s3_server self.pageserver_config_override = pageserver_config_override self.num_safekeepers = num_safekeepers self.pageserver_auth_enabled = pageserver_auth_enabled self.default_branch_name = default_branch_name - # keep etcd datadir inside 'repo' - self.broker = Etcd(datadir=os.path.join(self.repo_dir, "etcd"), - port=self.port_distributor.get_port(), - peer_port=self.port_distributor.get_port()) self.env: Optional[ZenithEnv] = None - self.s3_mock_server: Optional[MockS3Server] = None - - if os.getenv('FORCE_MOCK_S3') is not None: - bucket_name = f'{repo_dir.name}_bucket' - log.warning(f'Unconditionally initializing mock S3 server for bucket {bucket_name}') - self.enable_s3_mock_remote_storage(bucket_name) - def init(self) -> ZenithEnv: # Cannot create more than one environment from one builder assert self.env is None, "environment already initialized" @@ -457,9 +469,8 @@ class ZenithEnvBuilder: """ def enable_local_fs_remote_storage(self, force_enable=True): - assert force_enable or self.pageserver_remote_storage is None, "remote storage is enabled already" - self.pageserver_remote_storage = LocalFsStorage( - Path(self.repo_dir / 'local_fs_remote_storage')) + assert force_enable or self.remote_storage is None, "remote storage is enabled already" + self.remote_storage = LocalFsStorage(Path(self.repo_dir / 'local_fs_remote_storage')) """ Sets up the pageserver to use the S3 mock server, creates the bucket, if it's not present already. 
@@ -468,22 +479,19 @@ class ZenithEnvBuilder: """ def enable_s3_mock_remote_storage(self, bucket_name: str, force_enable=True): - assert force_enable or self.pageserver_remote_storage is None, "remote storage is enabled already" - if not self.s3_mock_server: - self.s3_mock_server = MockS3Server(self.port_distributor.get_port()) - - mock_endpoint = self.s3_mock_server.endpoint() - mock_region = self.s3_mock_server.region() + assert force_enable or self.remote_storage is None, "remote storage is enabled already" + mock_endpoint = self.mock_s3_server.endpoint() + mock_region = self.mock_s3_server.region() boto3.client( 's3', endpoint_url=mock_endpoint, region_name=mock_region, - aws_access_key_id=self.s3_mock_server.access_key(), - aws_secret_access_key=self.s3_mock_server.secret_key(), + aws_access_key_id=self.mock_s3_server.access_key(), + aws_secret_access_key=self.mock_s3_server.secret_key(), ).create_bucket(Bucket=bucket_name) - self.pageserver_remote_storage = S3Storage(bucket=bucket_name, - endpoint=mock_endpoint, - region=mock_region) + self.remote_storage = S3Storage(bucket=bucket_name, + endpoint=mock_endpoint, + region=mock_region) def __enter__(self): return self @@ -497,10 +505,6 @@ class ZenithEnvBuilder: for sk in self.env.safekeepers: sk.stop(immediate=True) self.env.pageserver.stop(immediate=True) - if self.s3_mock_server: - self.s3_mock_server.kill() - if self.env.broker is not None: - self.env.broker.stop() class ZenithEnv: @@ -539,10 +543,12 @@ class ZenithEnv: self.repo_dir = config.repo_dir self.rust_log_override = config.rust_log_override self.port_distributor = config.port_distributor - self.s3_mock_server = config.s3_mock_server + self.s3_mock_server = config.mock_s3_server self.zenith_cli = ZenithCli(env=self) self.postgres = PostgresFactory(self) self.safekeepers: List[Safekeeper] = [] + self.broker = config.broker + self.remote_storage = config.remote_storage # generate initial tenant ID here instead of letting 'zenith init' generate it, # 
so that we don't need to dig it out of the config file afterwards. @@ -553,7 +559,6 @@ class ZenithEnv: default_tenant_id = '{self.initial_tenant.hex}' """) - self.broker = config.broker toml += textwrap.dedent(f""" [etcd_broker] broker_endpoints = ['{self.broker.client_url()}'] @@ -578,7 +583,6 @@ class ZenithEnv: # Create a corresponding ZenithPageserver object self.pageserver = ZenithPageserver(self, port=pageserver_port, - remote_storage=config.pageserver_remote_storage, config_override=config.pageserver_config_override) # Create config and a Safekeeper object for each safekeeper @@ -602,15 +606,13 @@ class ZenithEnv: self.zenith_cli.init(toml) def start(self): - # Start up the page server, all the safekeepers and the broker + # Start up broker, pageserver and all safekeepers + self.broker.try_start() self.pageserver.start() for safekeeper in self.safekeepers: safekeeper.start() - if self.broker is not None: - self.broker.start() - def get_safekeeper_connstrs(self) -> str: """ Get list of safekeeper endpoints suitable for safekeepers GUC """ return ','.join([f'localhost:{wa.port.pg}' for wa in self.safekeepers]) @@ -623,7 +625,10 @@ class ZenithEnv: @pytest.fixture(scope=shareable_scope) -def _shared_simple_env(request: Any, port_distributor) -> Iterator[ZenithEnv]: +def _shared_simple_env(request: Any, + port_distributor: PortDistributor, + mock_s3_server: MockS3Server, + default_broker: Etcd) -> Iterator[ZenithEnv]: """ Internal fixture backing the `zenith_simple_env` fixture. If TEST_SHARED_FIXTURES is set, this is shared by all tests using `zenith_simple_env`. 
@@ -637,7 +642,8 @@ def _shared_simple_env(request: Any, port_distributor) -> Iterator[ZenithEnv]: repo_dir = os.path.join(str(top_output_dir), "shared_repo") shutil.rmtree(repo_dir, ignore_errors=True) - with ZenithEnvBuilder(Path(repo_dir), port_distributor) as builder: + with ZenithEnvBuilder(Path(repo_dir), port_distributor, default_broker, + mock_s3_server) as builder: env = builder.init_start() # For convenience in tests, create a branch from the freshly-initialized cluster. @@ -659,12 +665,13 @@ def zenith_simple_env(_shared_simple_env: ZenithEnv) -> Iterator[ZenithEnv]: yield _shared_simple_env _shared_simple_env.postgres.stop_all() - if _shared_simple_env.s3_mock_server: - _shared_simple_env.s3_mock_server.kill() @pytest.fixture(scope='function') -def zenith_env_builder(test_output_dir, port_distributor) -> Iterator[ZenithEnvBuilder]: +def zenith_env_builder(test_output_dir, + port_distributor: PortDistributor, + mock_s3_server: MockS3Server, + default_broker: Etcd) -> Iterator[ZenithEnvBuilder]: """ Fixture to create a Zenith environment for test. 
@@ -682,7 +689,8 @@ def zenith_env_builder(test_output_dir, port_distributor) -> Iterator[ZenithEnvB repo_dir = os.path.join(test_output_dir, "repo") # Return the builder to the caller - with ZenithEnvBuilder(Path(repo_dir), port_distributor) as builder: + with ZenithEnvBuilder(Path(repo_dir), port_distributor, default_broker, + mock_s3_server) as builder: yield builder @@ -979,9 +987,10 @@ class ZenithCli: cmd = ['init', f'--config={tmp.name}'] if initial_timeline_id: cmd.extend(['--timeline-id', initial_timeline_id.hex]) - append_pageserver_param_overrides(cmd, - self.env.pageserver.remote_storage, - self.env.pageserver.config_override) + append_pageserver_param_overrides( + params_to_update=cmd, + remote_storage=self.env.remote_storage, + pageserver_config_override=self.env.pageserver.config_override) res = self.raw_cli(cmd) res.check_returncode() @@ -1002,9 +1011,10 @@ class ZenithCli: def pageserver_start(self, overrides=()) -> 'subprocess.CompletedProcess[str]': start_args = ['pageserver', 'start', *overrides] - append_pageserver_param_overrides(start_args, - self.env.pageserver.remote_storage, - self.env.pageserver.config_override) + append_pageserver_param_overrides( + params_to_update=start_args, + remote_storage=self.env.remote_storage, + pageserver_config_override=self.env.pageserver.config_override) s3_env_vars = None if self.env.s3_mock_server: @@ -1174,16 +1184,11 @@ class ZenithPageserver(PgProtocol): Initializes the repository via `zenith init`. 
""" - def __init__(self, - env: ZenithEnv, - port: PageserverPort, - remote_storage: Optional[RemoteStorage] = None, - config_override: Optional[str] = None): + def __init__(self, env: ZenithEnv, port: PageserverPort, config_override: Optional[str] = None): super().__init__(host='localhost', port=port.pg, user='zenith_admin') self.env = env self.running = False self.service_port = port - self.remote_storage = remote_storage self.config_override = config_override def start(self, overrides=()) -> 'ZenithPageserver': @@ -1223,21 +1228,21 @@ class ZenithPageserver(PgProtocol): def append_pageserver_param_overrides( params_to_update: List[str], - pageserver_remote_storage: Optional[RemoteStorage], + remote_storage: Optional[RemoteStorage], pageserver_config_override: Optional[str] = None, ): - if pageserver_remote_storage is not None: - if isinstance(pageserver_remote_storage, LocalFsStorage): - pageserver_storage_override = f"local_path='{pageserver_remote_storage.root}'" - elif isinstance(pageserver_remote_storage, S3Storage): - pageserver_storage_override = f"bucket_name='{pageserver_remote_storage.bucket}',\ - bucket_region='{pageserver_remote_storage.region}'" + if remote_storage is not None: + if isinstance(remote_storage, LocalFsStorage): + pageserver_storage_override = f"local_path='{remote_storage.root}'" + elif isinstance(remote_storage, S3Storage): + pageserver_storage_override = f"bucket_name='{remote_storage.bucket}',\ + bucket_region='{remote_storage.region}'" - if pageserver_remote_storage.endpoint is not None: - pageserver_storage_override += f",endpoint='{pageserver_remote_storage.endpoint}'" + if remote_storage.endpoint is not None: + pageserver_storage_override += f",endpoint='{remote_storage.endpoint}'" else: - raise Exception(f'Unknown storage configuration {pageserver_remote_storage}') + raise Exception(f'Unknown storage configuration {remote_storage}') params_to_update.append( 
f'--pageserver-config-override=remote_storage={{{pageserver_storage_override}}}') @@ -1859,7 +1864,11 @@ class Etcd: s.mount('http://', requests.adapters.HTTPAdapter(max_retries=1)) # do not retry s.get(f"{self.client_url()}/health").raise_for_status() - def start(self): + def try_start(self): + if self.handle is not None: + log.debug(f'etcd is already running on port {self.port}') + return + pathlib.Path(self.datadir).mkdir(exist_ok=True) if not self.binary_path.is_file(): From 9ccbb8d331c3eef25f01815e7d058d6260c02bf3 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Tue, 17 May 2022 10:31:13 +0300 Subject: [PATCH 257/296] Make "neon_local stop" less verbose. I got annoyed by all the noise in CI test output. Before: $ ./target/release/neon_local stop Stop pageserver gracefully Pageserver still receives connections Pageserver stopped receiving connections Pageserver status is: Reqwest error: error sending request for url (http://127.0.0.1:9898/v1/status): error trying to connect: tcp connect error: Connection refused (os error 111) initializing for sk 1 for 7676 Stop safekeeper gracefully Safekeeper still receives connections Safekeeper stopped receiving connections Safekeeper status is: Reqwest error: error sending request for url (http://127.0.0.1:7676/v1/status): error trying to connect: tcp connect error: Connection refused (os error 111) After: $ ./target/release/neon_local stop Stopping pageserver gracefully...done! Stopping safekeeper 1 gracefully...done! 
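The single-line progress output shown above can be sketched roughly as follows. This is an illustrative Python sketch, not the Rust code from this patch: `wait_for_stop`, the `is_stopped` probe, and the injectable `out` and `sleep` parameters are hypothetical names, and the real implementation probes the TCP port and the HTTP status endpoint instead of calling a callback.

```python
import io
import sys
import time
from typing import Callable, TextIO


def wait_for_stop(name: str,
                  is_stopped: Callable[[], bool],
                  attempts: int = 100,
                  interval: float = 1.0,
                  out: TextIO = sys.stdout,
                  sleep: Callable[[float], None] = time.sleep) -> bool:
    """Print one status line: a prefix, one dot per unsuccessful poll, then 'done!'."""
    out.write(f"Stopping {name} gracefully..")
    out.flush()  # no newline yet, so flush to make the prefix visible immediately
    for _ in range(attempts):
        if is_stopped():
            out.write("done!\n")
            return True
        out.write(".")
        out.flush()
        sleep(interval)
    return False  # the caller decides how to report the timeout


# Hypothetical probe that reports "still running" once, then "stopped".
probe = iter([False, True])
buf = io.StringIO()
stopped = wait_for_stop("pageserver", lambda: next(probe), out=buf, sleep=lambda _: None)
print(buf.getvalue(), end="")
```

Flushing after every partial write matters here because the line is only terminated once the service has actually stopped.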
Also removes the spurious "initializing for sk 1 for 7676" message from "neon_local start" --- control_plane/src/safekeeper.rs | 47 ++++++++++++++++++++------------- control_plane/src/storage.rs | 46 ++++++++++++++++++++------------ 2 files changed, 57 insertions(+), 36 deletions(-) diff --git a/control_plane/src/safekeeper.rs b/control_plane/src/safekeeper.rs index 1ac06cb2d2..d5b6251209 100644 --- a/control_plane/src/safekeeper.rs +++ b/control_plane/src/safekeeper.rs @@ -81,8 +81,6 @@ impl SafekeeperNode { pub fn from_env(env: &LocalEnv, conf: &SafekeeperConf) -> SafekeeperNode { let pageserver = Arc::new(PageServerNode::from_env(env)); - println!("initializing for sk {} for {}", conf.id, conf.http_port); - SafekeeperNode { id: conf.id, conf: conf.clone(), @@ -207,12 +205,13 @@ impl SafekeeperNode { let pid = Pid::from_raw(pid); let sig = if immediate { - println!("Stop safekeeper immediately"); + print!("Stopping safekeeper {} immediately..", self.id); Signal::SIGQUIT } else { - println!("Stop safekeeper gracefully"); + print!("Stopping safekeeper {} gracefully..", self.id); Signal::SIGTERM }; + io::stdout().flush().unwrap(); match kill(pid, sig) { Ok(_) => (), Err(Errno::ESRCH) => { @@ -234,25 +233,35 @@ impl SafekeeperNode { // TODO Remove this "timeout" and handle it on caller side instead. // Shutting down may take a long time, // if safekeeper flushes a lot of data + let mut tcp_stopped = false; for _ in 0..100 { - if let Err(_e) = TcpStream::connect(&address) { - println!("Safekeeper stopped receiving connections"); - - //Now check status - match self.check_status() { - Ok(_) => { - println!("Safekeeper status is OK. 
Wait a bit."); - thread::sleep(Duration::from_secs(1)); - } - Err(err) => { - println!("Safekeeper status is: {}", err); - return Ok(()); + if !tcp_stopped { + if let Err(err) = TcpStream::connect(&address) { + tcp_stopped = true; + if err.kind() != io::ErrorKind::ConnectionRefused { + eprintln!("\nSafekeeper connection failed with error: {err}"); } } - } else { - println!("Safekeeper still receives connections"); - thread::sleep(Duration::from_secs(1)); } + if tcp_stopped { + // Also check status on the HTTP port + match self.check_status() { + Err(SafekeeperHttpError::Transport(err)) if err.is_connect() => { + println!("done!"); + return Ok(()); + } + Err(err) => { + eprintln!("\nSafekeeper status check failed with error: {err}"); + return Ok(()); + } + Ok(()) => { + // keep waiting + } + } + } + print!("."); + io::stdout().flush().unwrap(); + thread::sleep(Duration::from_secs(1)); } bail!("Failed to stop safekeeper with pid {}", pid); diff --git a/control_plane/src/storage.rs b/control_plane/src/storage.rs index 7dbc19e145..355c7c250d 100644 --- a/control_plane/src/storage.rs +++ b/control_plane/src/storage.rs @@ -281,12 +281,13 @@ impl PageServerNode { let pid = Pid::from_raw(read_pidfile(&pid_file)?); let sig = if immediate { - println!("Stop pageserver immediately"); + print!("Stopping pageserver immediately.."); Signal::SIGQUIT } else { - println!("Stop pageserver gracefully"); + print!("Stopping pageserver gracefully.."); Signal::SIGTERM }; + io::stdout().flush().unwrap(); match kill(pid, sig) { Ok(_) => (), Err(Errno::ESRCH) => { @@ -308,25 +309,36 @@ impl PageServerNode { // TODO Remove this "timeout" and handle it on caller side instead. 
// Shutting down may take a long time, // if pageserver checkpoints a lot of data + let mut tcp_stopped = false; for _ in 0..100 { - if let Err(_e) = TcpStream::connect(&address) { - println!("Pageserver stopped receiving connections"); - - //Now check status - match self.check_status() { - Ok(_) => { - println!("Pageserver status is OK. Wait a bit."); - thread::sleep(Duration::from_secs(1)); - } - Err(err) => { - println!("Pageserver status is: {}", err); - return Ok(()); + if !tcp_stopped { + if let Err(err) = TcpStream::connect(&address) { + tcp_stopped = true; + if err.kind() != io::ErrorKind::ConnectionRefused { + eprintln!("\nPageserver connection failed with error: {err}"); } } - } else { - println!("Pageserver still receives connections"); - thread::sleep(Duration::from_secs(1)); } + if tcp_stopped { + // Also check status on the HTTP port + + match self.check_status() { + Err(PageserverHttpError::Transport(err)) if err.is_connect() => { + println!("done!"); + return Ok(()); + } + Err(err) => { + eprintln!("\nPageserver status check failed with error: {err}"); + return Ok(()); + } + Ok(()) => { + // keep waiting + } + } + } + print!("."); + io::stdout().flush().unwrap(); + thread::sleep(Duration::from_secs(1)); } bail!("Failed to stop pageserver with pid {}", pid); From 070c255522f1f1d002db127b5a52c957f9016800 Mon Sep 17 00:00:00 2001 From: Andrey Taranik Date: Tue, 17 May 2022 18:03:01 +0300 Subject: [PATCH 258/296] Neon stress deploy (#1720) * storage and proxy deployment for neon stress environment * neon stress inventory fix --- .circleci/ansible/neon-stress.hosts | 19 ++++++++ .circleci/config.yml | 49 ++++++++++++++++++++ .circleci/helm-values/neon-stress.proxy.yaml | 34 ++++++++++++++ 3 files changed, 102 insertions(+) create mode 100644 .circleci/ansible/neon-stress.hosts create mode 100644 .circleci/helm-values/neon-stress.proxy.yaml diff --git a/.circleci/ansible/neon-stress.hosts b/.circleci/ansible/neon-stress.hosts new file mode 100644 index 
0000000000..283ec0e8b3 --- /dev/null +++ b/.circleci/ansible/neon-stress.hosts @@ -0,0 +1,19 @@ +[pageservers] +neon-stress-ps-1 console_region_id=1 +neon-stress-ps-2 console_region_id=1 + +[safekeepers] +neon-stress-sk-1 console_region_id=1 +neon-stress-sk-2 console_region_id=1 +neon-stress-sk-3 console_region_id=1 + +[storage:children] +pageservers +safekeepers + +[storage:vars] +console_mgmt_base_url = http://neon-stress-console.local +bucket_name = neon-storage-ireland +bucket_region = eu-west-1 +etcd_endpoints = etcd-stress.local:2379 +safekeeper_enable_s3_offload = false diff --git a/.circleci/config.yml b/.circleci/config.yml index 62ae60eb18..fdd3e0cce7 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -587,6 +587,55 @@ jobs: helm upgrade neon-proxy neondatabase/neon-proxy --install -f .circleci/helm-values/staging.proxy.yaml --set image.tag=${DOCKER_TAG} --wait helm upgrade neon-proxy-scram neondatabase/neon-proxy --install -f .circleci/helm-values/staging.proxy-scram.yaml --set image.tag=${DOCKER_TAG} --wait + deploy-neon-stress: + docker: + - image: cimg/python:3.10 + steps: + - checkout + - setup_remote_docker + - run: + name: Setup ansible + command: | + pip install --progress-bar off --user ansible boto3 + - run: + name: Redeploy + command: | + cd "$(pwd)/.circleci/ansible" + + ./get_binaries.sh + + echo "${TELEPORT_SSH_KEY}" | tr -d '\n'| base64 --decode >ssh-key + echo "${TELEPORT_SSH_CERT}" | tr -d '\n'| base64 --decode >ssh-key-cert.pub + chmod 0600 ssh-key + ssh-add ssh-key + rm -f ssh-key ssh-key-cert.pub + + ansible-playbook deploy.yaml -i neon-stress.hosts + rm -f neon_install.tar.gz .neon_current_version + + deploy-neon-stress-proxy: + docker: + - image: cimg/base:2021.04 + environment: + KUBECONFIG: .kubeconfig + steps: + - checkout + - run: + name: Store kubeconfig file + command: | + echo "${NEON_STRESS_KUBECONFIG_DATA}" | base64 --decode > ${KUBECONFIG} + chmod 0600 ${KUBECONFIG} + - run: + name: Setup helm v3 + command: | + 
curl -s https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash + helm repo add neondatabase https://neondatabase.github.io/helm-charts + - run: + name: Re-deploy proxy + command: | + DOCKER_TAG=$(git log --oneline|wc -l) + helm upgrade neon-stress-proxy neondatabase/neon-proxy --install -f .circleci/helm-values/neon-stress.proxy.yaml --set image.tag=${DOCKER_TAG} --wait + deploy-release: docker: - image: cimg/python:3.10 diff --git a/.circleci/helm-values/neon-stress.proxy.yaml b/.circleci/helm-values/neon-stress.proxy.yaml new file mode 100644 index 0000000000..8236f9873a --- /dev/null +++ b/.circleci/helm-values/neon-stress.proxy.yaml @@ -0,0 +1,34 @@ +fullnameOverride: "neon-stress-proxy" + +settings: + authEndpoint: "https://console.dev.neon.tech/authenticate_proxy_request/" + uri: "https://console.dev.neon.tech/psql_session/" + +# -- Additional labels for zenith-proxy pods +podLabels: + zenith_service: proxy + zenith_env: staging + zenith_region: eu-west-1 + zenith_region_slug: ireland + +service: + annotations: + service.beta.kubernetes.io/aws-load-balancer-type: external + service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip + service.beta.kubernetes.io/aws-load-balancer-scheme: internal + external-dns.alpha.kubernetes.io/hostname: neon-stress-proxy.local + type: LoadBalancer + +exposedService: + annotations: + service.beta.kubernetes.io/aws-load-balancer-type: external + service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip + service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing + external-dns.alpha.kubernetes.io/hostname: connect.dev.neon.tech + +metrics: + enabled: true + serviceMonitor: + enabled: true + selector: + release: kube-prometheus-stack From f03779bf1a555f921f63406f7accdf28e427c8f0 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Tue, 17 May 2022 16:21:13 +0300 Subject: [PATCH 259/296] Fix wait_for_last_record_lsn() and wait_for_upload() python functions. 
The contract for wait_for() was not very clear. It waits until the given function returns successfully, without an exception, but the wait_for_last_record_lsn() and wait_for_upload() functions used "a < b" as the condition, i.e. they thought that wait_for() would poll until the function returns true. Inline the logic from wait_for() into those two functions, it's not that complicated, and you get a more specific error message too, if it fails. Also add a comment to wait_for() to make it more clear how it works. Also change remote_consistent_lsn() to return 0 instead of raising an exception, if remote is None. That can happen if nothing has been uploaded to remote storage for the timeline yet. It happened once in the CI, and I was able to reproduce that locally too by adding a sleep to the storage sync thread, to delay the first upload. --- .../batch_others/test_remote_storage.py | 8 ++-- .../batch_others/test_tenant_relocation.py | 4 +- test_runner/fixtures/zenith_fixtures.py | 47 +++++++++++++++---- 3 files changed, 44 insertions(+), 15 deletions(-) diff --git a/test_runner/batch_others/test_remote_storage.py b/test_runner/batch_others/test_remote_storage.py index 3c7bd08996..afbe3c55c7 100644 --- a/test_runner/batch_others/test_remote_storage.py +++ b/test_runner/batch_others/test_remote_storage.py @@ -6,7 +6,7 @@ from contextlib import closing from pathlib import Path import time from uuid import UUID -from fixtures.zenith_fixtures import ZenithEnvBuilder, assert_local, wait_for, wait_for_last_record_lsn, wait_for_upload +from fixtures.zenith_fixtures import ZenithEnvBuilder, assert_local, wait_until, wait_for_last_record_lsn, wait_for_upload from fixtures.log_helper import log from fixtures.utils import lsn_from_hex, lsn_to_hex import pytest @@ -109,9 +109,9 @@ def test_remote_storage_backup_and_restore(zenith_env_builder: ZenithEnvBuilder, client.timeline_attach(UUID(tenant_id), UUID(timeline_id)) log.info("waiting for timeline redownload") - 
wait_for(number_of_iterations=10, - interval=1, - func=lambda: assert_local(client, UUID(tenant_id), UUID(timeline_id))) + wait_until(number_of_iterations=10, + interval=1, + func=lambda: assert_local(client, UUID(tenant_id), UUID(timeline_id))) detail = client.timeline_detail(UUID(tenant_id), UUID(timeline_id)) assert detail['local'] is not None diff --git a/test_runner/batch_others/test_tenant_relocation.py b/test_runner/batch_others/test_tenant_relocation.py index 0e5dd6eadf..91506e120d 100644 --- a/test_runner/batch_others/test_tenant_relocation.py +++ b/test_runner/batch_others/test_tenant_relocation.py @@ -10,7 +10,7 @@ from typing import Optional import signal import pytest -from fixtures.zenith_fixtures import PgProtocol, PortDistributor, Postgres, ZenithEnvBuilder, Etcd, ZenithPageserverHttpClient, assert_local, wait_for, wait_for_last_record_lsn, wait_for_upload, zenith_binpath, pg_distrib_dir +from fixtures.zenith_fixtures import PgProtocol, PortDistributor, Postgres, ZenithEnvBuilder, Etcd, ZenithPageserverHttpClient, assert_local, wait_until, wait_for_last_record_lsn, wait_for_upload, zenith_binpath, pg_distrib_dir from fixtures.utils import lsn_from_hex @@ -191,7 +191,7 @@ def test_tenant_relocation(zenith_env_builder: ZenithEnvBuilder, # call to attach timeline to new pageserver new_pageserver_http.timeline_attach(tenant, timeline) # new pageserver should be in sync (modulo wal tail or vacuum activity) with the old one because there was no new writes since checkpoint - new_timeline_detail = wait_for( + new_timeline_detail = wait_until( number_of_iterations=5, interval=1, func=lambda: assert_local(new_pageserver_http, tenant, timeline)) diff --git a/test_runner/fixtures/zenith_fixtures.py b/test_runner/fixtures/zenith_fixtures.py index 8fca56143e..203e73037f 100644 --- a/test_runner/fixtures/zenith_fixtures.py +++ b/test_runner/fixtures/zenith_fixtures.py @@ -34,7 +34,12 @@ from typing_extensions import Literal import requests import backoff # type: 
ignore -from .utils import (etcd_path, get_self_dir, mkdir_if_needed, subprocess_capture, lsn_from_hex) +from .utils import (etcd_path, + get_self_dir, + mkdir_if_needed, + subprocess_capture, + lsn_from_hex, + lsn_to_hex) from fixtures.log_helper import log """ This file contains pytest fixtures. A fixture is a test resource that can be @@ -2065,7 +2070,11 @@ def check_restored_datadir_content(test_output_dir: str, env: ZenithEnv, pg: Pos assert (mismatch, error) == ([], []) -def wait_for(number_of_iterations: int, interval: int, func): +def wait_until(number_of_iterations: int, interval: int, func): + """ + Wait until 'func' returns successfully, without exception. Returns the last return value + from the the function. + """ last_exception = None for i in range(number_of_iterations): try: @@ -2092,9 +2101,15 @@ def remote_consistent_lsn(pageserver_http_client: ZenithPageserverHttpClient, timeline: uuid.UUID) -> int: detail = pageserver_http_client.timeline_detail(tenant, timeline) - lsn_str = detail['remote']['remote_consistent_lsn'] - assert isinstance(lsn_str, str) - return lsn_from_hex(lsn_str) + if detail['remote'] is None: + # No remote information at all. This happens right after creating + # a timeline, before any part of it it has been uploaded to remote + # storage yet. 
+ return 0 + else: + lsn_str = detail['remote']['remote_consistent_lsn'] + assert isinstance(lsn_str, str) + return lsn_from_hex(lsn_str) def wait_for_upload(pageserver_http_client: ZenithPageserverHttpClient, @@ -2102,8 +2117,15 @@ def wait_for_upload(pageserver_http_client: ZenithPageserverHttpClient, timeline: uuid.UUID, lsn: int): """waits for local timeline upload up to specified lsn""" - - wait_for(10, 1, lambda: remote_consistent_lsn(pageserver_http_client, tenant, timeline) >= lsn) + for i in range(10): + current_lsn = remote_consistent_lsn(pageserver_http_client, tenant, timeline) + if current_lsn >= lsn: + return + log.info("waiting for remote_consistent_lsn to reach {}, now {}, iteration {}".format( + lsn_to_hex(lsn), lsn_to_hex(current_lsn), i + 1)) + time.sleep(1) + raise Exception("timed out while waiting for remote_consistent_lsn to reach {}, was {}".format( + lsn_to_hex(lsn), lsn_to_hex(current_lsn))) def last_record_lsn(pageserver_http_client: ZenithPageserverHttpClient, @@ -2121,5 +2143,12 @@ def wait_for_last_record_lsn(pageserver_http_client: ZenithPageserverHttpClient, timeline: uuid.UUID, lsn: int): """waits for pageserver to catch up to a certain lsn""" - - wait_for(10, 1, lambda: last_record_lsn(pageserver_http_client, tenant, timeline) >= lsn) + for i in range(10): + current_lsn = last_record_lsn(pageserver_http_client, tenant, timeline) + if current_lsn >= lsn: + return + log.info("waiting for last_record_lsn to reach {}, now {}, iteration {}".format( + lsn_to_hex(lsn), lsn_to_hex(current_lsn), i + 1)) + time.sleep(1) + raise Exception("timed out while waiting for last_record_lsn to reach {}, was {}".format( + lsn_to_hex(lsn), lsn_to_hex(current_lsn))) From 55ea3f262edc8d992e01c67c4ed7ef96203ebbbb Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Tue, 17 May 2022 18:14:37 +0300 Subject: [PATCH 260/296] Fix race condition leading to panic in remote storage sync thread. 
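The queue design this patch moves to (a deque guarded by a mutex, plus a condition variable) can be sketched as below. This is a hedged Python analogue of the idea, not the Rust `SyncQueue` from the diff: the class layout, the string task type, and the `timeout` parameter are assumptions made for illustration. The relevant property is that the queue length is always read from the deque while the lock is held, so there is no separate counter that could drift out of step.

```python
import threading
from collections import deque
from typing import Deque, List, Tuple


class SyncQueue:
    """Task queue: a deque protected by the lock of a condition variable."""

    def __init__(self) -> None:
        self._queue: Deque[Tuple[str, str]] = deque()
        self._cond = threading.Condition()  # owns the lock guarding _queue

    def push(self, sync_id: str, task: str) -> None:
        with self._cond:
            self._queue.append((sync_id, task))
            self._cond.notify()  # wake a consumer blocked in next_batch()

    def next_batch(self, timeout: float = 1.0) -> List[Tuple[str, str]]:
        """Block until at least one task is queued, then drain everything."""
        with self._cond:
            if not self._cond.wait_for(lambda: len(self._queue) > 0, timeout):
                return []  # timed out; the caller can check for shutdown here
            batch = list(self._queue)
            self._queue.clear()
            return batch


q = SyncQueue()
producer = threading.Thread(
    target=lambda: [q.push("timeline-1", f"upload-{i}") for i in range(3)])
producer.start()
producer.join()
batch = q.next_batch()
```

Returning an empty batch on timeout gives the consumer loop a regular point at which to notice a shutdown request.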
The SyncQueue consisted of a tokio mpsc channel, and an atomic counter to keep track of how many items there are in the channel. Updating the atomic counter was racy, and sometimes the consumer would decrement the counter before the producer had incremented it, leading to integer wraparound to usize::MAX. Calling Vec::with_capacity(usize::MAX) leads to a panic. To fix, replace the channel with a VecDeque protected by a Mutex, and a condition variable for signaling. Now that the queue is now protected by standard blocking Mutex and Condvar, refactor the functions touching it to be sync, not async. A theoretical downside of this is that the calls to push items to the queue and the storage sync thread that drains the queue might now need to wait, if another thread is busy manipulating the queue. I believe that's OK; the lock isn't held for very long, and these operations are made in background threads, not in the hot GetPage@LSN path, so they're not very latency-sensitive. Fixes #1719. Also add a test case. --- pageserver/src/storage_sync.rs | 240 ++++++++---------- pageserver/src/storage_sync/delete.rs | 4 +- pageserver/src/storage_sync/download.rs | 4 +- pageserver/src/storage_sync/upload.rs | 4 +- .../test_tenants_with_remote_storage.py | 97 +++++++ 5 files changed, 208 insertions(+), 141 deletions(-) create mode 100644 test_runner/batch_others/test_tenants_with_remote_storage.py diff --git a/pageserver/src/storage_sync.rs b/pageserver/src/storage_sync.rs index 7755e67c8d..39459fafc6 100644 --- a/pageserver/src/storage_sync.rs +++ b/pageserver/src/storage_sync.rs @@ -9,7 +9,7 @@ //! //! * public API via to interact with the external world: //! * [`start_local_timeline_sync`] to launch a background async loop to handle the synchronization -//! * [`schedule_timeline_checkpoint_upload`] and [`schedule_timeline_download`] to enqueue a new upload and download tasks, +//! 
* [`schedule_layer_upload`], [`schedule_layer_download`], and[`schedule_layer_delete`] to enqueue a new task //! to be processed by the async loop //! //! Here's a schematic overview of all interactions backup and the rest of the pageserver perform: @@ -44,8 +44,8 @@ //! query their downloads later if they are accessed. //! //! Some time later, during pageserver checkpoints, in-memory data is flushed onto disk along with its metadata. -//! If the storage sync loop was successfully started before, pageserver schedules the new checkpoint file uploads after every checkpoint. -//! The checkpoint uploads are disabled, if no remote storage configuration is provided (no sync loop is started this way either). +//! If the storage sync loop was successfully started before, pageserver schedules the layer files and the updated metadata file for upload, every time a layer is flushed to disk. +//! The uploads are disabled, if no remote storage configuration is provided (no sync loop is started this way either). //! See [`crate::layered_repository`] for the upload calls and the adjacent logic. //! //! Synchronization logic is able to communicate back with updated timeline sync states, [`crate::repository::TimelineSyncStatusUpdate`], @@ -54,7 +54,7 @@ //! * once after the sync loop startup, to signal pageserver which timelines will be synchronized in the near future //! * after every loop step, in case a timeline needs to be reloaded or evicted from pageserver's memory //! -//! When the pageserver terminates, the sync loop finishes a current sync task (if any) and exits. +//! When the pageserver terminates, the sync loop finishes current sync task (if any) and exits. //! //! The storage logic considers `image` as a set of local files (layers), fully representing a certain timeline at given moment (identified with `disk_consistent_lsn` from the corresponding `metadata` file). //! 
Timeline can change its state, by adding more files on disk and advancing its `disk_consistent_lsn`: this happens after pageserver checkpointing and is followed @@ -66,13 +66,13 @@ //! when the newer image is downloaded //! //! Pageserver maintains similar to the local file structure remotely: all layer files are uploaded with the same names under the same directory structure. -//! Yet instead of keeping the `metadata` file remotely, we wrap it with more data in [`IndexShard`], containing the list of remote files. +//! Yet instead of keeping the `metadata` file remotely, we wrap it with more data in [`IndexPart`], containing the list of remote files. //! This file gets read to populate the cache, if the remote timeline data is missing from it and gets updated after every successful download. //! This way, we optimize S3 storage access by not running the `S3 list` command that could be expencive and slow: knowing both [`ZTenantId`] and [`ZTimelineId`], //! we can always reconstruct the path to the timeline, use this to get the same path on the remote storage and retrive its shard contents, if needed, same as any layer files. //! //! By default, pageserver reads the remote storage index data only for timelines located locally, to synchronize those, if needed. -//! Bulk index data download happens only initially, on pageserer startup. The rest of the remote storage stays unknown to pageserver and loaded on demand only, +//! Bulk index data download happens only initially, on pageserver startup. The rest of the remote storage stays unknown to pageserver and loaded on demand only, //! when a new timeline is scheduled for the download. //! //! NOTES: @@ -89,13 +89,12 @@ //! Synchronization is done with the queue being emptied via separate thread asynchronously, //! attempting to fully store pageserver's local data on the remote storage in a custom format, beneficial for storing. //! -//! 
A queue is implemented in the [`sync_queue`] module as a pair of sender and receiver channels, to block on zero tasks instead of checking the queue. -//! The pair's shared buffer of a fixed size serves as an implicit queue, holding [`SyncTask`] for local files upload/download operations. +//! A queue is implemented in the [`sync_queue`] module as a VecDeque to hold the tasks, and a condition variable for blocking when the queue is empty. //! //! The queue gets emptied by a single thread with the loop, that polls the tasks in batches of deduplicated tasks. //! A task from the batch corresponds to a single timeline, with its files to sync merged together: given that only one task sync loop step is active at a time, //! timeline uploads and downloads can happen concurrently, in no particular order due to incremental nature of the timeline layers. -//! Deletion happens only after a successful upload only, otherwise the compation output might make the timeline inconsistent until both tasks are fully processed without errors. +//! Deletion happens only after a successful upload only, otherwise the compaction output might make the timeline inconsistent until both tasks are fully processed without errors. //! Upload and download update the remote data (inmemory index and S3 json index part file) only after every layer is successfully synchronized, while the deletion task //! does otherwise: it requires to have the remote data updated first succesfully: blob files will be invisible to pageserver this way. //! @@ -138,8 +137,6 @@ //! NOTE: No real contents or checksum check happens right now and is a subject to improve later. //! //! After the whole timeline is downloaded, [`crate::tenant_mgr::apply_timeline_sync_status_updates`] function is used to update pageserver memory stage for the timeline processed. -//! -//! When pageserver signals shutdown, current sync task gets finished and the loop exists. 
mod delete; mod download; @@ -153,10 +150,7 @@ use std::{ num::{NonZeroU32, NonZeroUsize}, ops::ControlFlow, path::{Path, PathBuf}, - sync::{ - atomic::{AtomicUsize, Ordering}, - Arc, - }, + sync::{Arc, Condvar, Mutex}, }; use anyhow::{anyhow, bail, Context}; @@ -167,7 +161,6 @@ use remote_storage::{GenericRemoteStorage, RemoteStorage}; use tokio::{ fs, runtime::Runtime, - sync::mpsc::{self, error::TryRecvError, UnboundedReceiver, UnboundedSender}, time::{Duration, Instant}, }; use tracing::*; @@ -453,97 +446,77 @@ fn collect_timeline_files( Ok((timeline_id, metadata, timeline_files)) } -/// Wraps mpsc channel bits around into a queue interface. -/// mpsc approach was picked to allow blocking the sync loop if no tasks are present, to avoid meaningless spinning. +/// Global queue of sync tasks. +/// +/// 'queue' is protected by a mutex, and 'condvar' is used to wait for tasks to arrive. struct SyncQueue { - len: AtomicUsize, max_timelines_per_batch: NonZeroUsize, - sender: UnboundedSender<(ZTenantTimelineId, SyncTask)>, + + queue: Mutex<VecDeque<(ZTenantTimelineId, SyncTask)>>, + condvar: Condvar, } impl SyncQueue { - fn new( - max_timelines_per_batch: NonZeroUsize, - ) -> (Self, UnboundedReceiver<(ZTenantTimelineId, SyncTask)>) { - let (sender, receiver) = mpsc::unbounded_channel(); - ( - Self { - len: AtomicUsize::new(0), - max_timelines_per_batch, - sender, - }, - receiver, - ) + fn new(max_timelines_per_batch: NonZeroUsize) -> Self { + Self { + max_timelines_per_batch, + queue: Mutex::new(VecDeque::new()), + condvar: Condvar::new(), + } } + /// Queue a new task fn push(&self, sync_id: ZTenantTimelineId, new_task: SyncTask) { - match self.sender.send((sync_id, new_task)) { - Ok(()) => { - self.len.fetch_add(1, Ordering::Relaxed); - } - Err(e) => { - error!("failed to push sync task to queue: {e}"); - } + let mut q = self.queue.lock().unwrap(); + + q.push_back((sync_id, new_task)); + if q.len() <= 1 { + self.condvar.notify_one(); } } /// Fetches a task batch, getting every existing entry from the queue,
grouping by timelines and merging the tasks for every timeline. - /// A timeline has to care to not to delete cetain layers from the remote storage before the corresponding uploads happen. - /// Otherwise, due to "immutable" nature of the layers, the order of their deletion/uploading/downloading does not matter. + /// A timeline has to take care not to delete certain layers from the remote storage before the corresponding uploads happen. + /// Other than that, due to the "immutable" nature of the layers, the order of their deletion/uploading/downloading does not matter. /// Hence, we merge the layers together into single task per timeline and run those concurrently (with the deletion happening only after successful uploading). - async fn next_task_batch( - &self, - // The queue is based on two ends of a channel and has to be accessible statically without blocking for submissions from the sync code. - // Its receiver needs &mut, so we cannot place it in the same container with the other end and get both static and non-blocking access. - // Hence toss this around to use it from the sync loop directly as &mut. 
- sync_queue_receiver: &mut UnboundedReceiver<(ZTenantTimelineId, SyncTask)>, - ) -> HashMap { - // request the first task in blocking fashion to do less meaningless work - let (first_sync_id, first_task) = if let Some(first_task) = sync_queue_receiver.recv().await - { - self.len.fetch_sub(1, Ordering::Relaxed); - first_task - } else { - info!("Queue sender part was dropped, aborting"); - return HashMap::new(); - }; + fn next_task_batch(&self) -> (HashMap, usize) { + // Wait for the first task in blocking fashion + let mut q = self.queue.lock().unwrap(); + while q.is_empty() { + q = self + .condvar + .wait_timeout(q, Duration::from_millis(1000)) + .unwrap() + .0; + + if thread_mgr::is_shutdown_requested() { + return (HashMap::new(), q.len()); + } + } + let (first_sync_id, first_task) = q.pop_front().unwrap(); + let mut timelines_left_to_batch = self.max_timelines_per_batch.get() - 1; - let mut tasks_to_process = self.len(); + let tasks_to_process = q.len(); let mut batches = HashMap::with_capacity(tasks_to_process); batches.insert(first_sync_id, SyncTaskBatch::new(first_task)); let mut tasks_to_reenqueue = Vec::with_capacity(tasks_to_process); - // Pull the queue channel until we get all tasks that were there at the beginning of the batch construction. + // Greedily grab as many other tasks that we can. // Yet do not put all timelines in the batch, but only the first ones that fit the timeline limit. - // Still merge the rest of the pulled tasks and reenqueue those for later. 
- while tasks_to_process > 0 { - match sync_queue_receiver.try_recv() { - Ok((sync_id, new_task)) => { - self.len.fetch_sub(1, Ordering::Relaxed); - tasks_to_process -= 1; - - match batches.entry(sync_id) { - hash_map::Entry::Occupied(mut v) => v.get_mut().add(new_task), - hash_map::Entry::Vacant(v) => { - timelines_left_to_batch = timelines_left_to_batch.saturating_sub(1); - if timelines_left_to_batch == 0 { - tasks_to_reenqueue.push((sync_id, new_task)); - } else { - v.insert(SyncTaskBatch::new(new_task)); - } - } + // Re-enqueue the tasks that don't fit in this batch. + while let Some((sync_id, new_task)) = q.pop_front() { + match batches.entry(sync_id) { + hash_map::Entry::Occupied(mut v) => v.get_mut().add(new_task), + hash_map::Entry::Vacant(v) => { + timelines_left_to_batch = timelines_left_to_batch.saturating_sub(1); + if timelines_left_to_batch == 0 { + tasks_to_reenqueue.push((sync_id, new_task)); + } else { + v.insert(SyncTaskBatch::new(new_task)); } } - Err(TryRecvError::Disconnected) => { - debug!("Sender disconnected, batch collection aborted"); - break; - } - Err(TryRecvError::Empty) => { - debug!("No more data in the sync queue, task batch is not full"); - break; - } } } @@ -553,14 +526,15 @@ impl SyncQueue { tasks_to_reenqueue.len() ); for (id, task) in tasks_to_reenqueue { - self.push(id, task); + q.push_back((id, task)); } - batches + (batches, q.len()) } + #[cfg(test)] fn len(&self) -> usize { - self.len.load(Ordering::Relaxed) + self.queue.lock().unwrap().len() } } @@ -823,7 +797,7 @@ pub fn schedule_layer_download(tenant_id: ZTenantId, timeline_id: ZTimelineId) { debug!("Download task for tenant {tenant_id}, timeline {timeline_id} sent") } -/// Uses a remote storage given to start the storage sync loop. +/// Launch a thread to perform remote storage sync tasks. /// See module docs for loop step description. 
pub(super) fn spawn_storage_sync_thread( conf: &'static PageServerConf, @@ -836,7 +810,7 @@ where P: Debug + Send + Sync + 'static, S: RemoteStorage + Send + Sync + 'static, { - let (sync_queue, sync_queue_receiver) = SyncQueue::new(max_concurrent_timelines_sync); + let sync_queue = SyncQueue::new(max_concurrent_timelines_sync); SYNC_QUEUE .set(sync_queue) .map_err(|_queue| anyhow!("Could not initialize sync queue"))?; @@ -864,7 +838,7 @@ where local_timeline_files, ); - let loop_index = remote_index.clone(); + let remote_index_clone = remote_index.clone(); thread_mgr::spawn( ThreadKind::StorageSync, None, @@ -875,12 +849,7 @@ where storage_sync_loop( runtime, conf, - ( - Arc::new(storage), - loop_index, - sync_queue, - sync_queue_receiver, - ), + (Arc::new(storage), remote_index_clone, sync_queue), max_sync_errors, ); Ok(()) @@ -896,12 +865,7 @@ where fn storage_sync_loop( runtime: Runtime, conf: &'static PageServerConf, - (storage, index, sync_queue, mut sync_queue_receiver): ( - Arc, - RemoteIndex, - &SyncQueue, - UnboundedReceiver<(ZTenantTimelineId, SyncTask)>, - ), + (storage, index, sync_queue): (Arc, RemoteIndex, &SyncQueue), max_sync_errors: NonZeroU32, ) where P: Debug + Send + Sync + 'static, @@ -909,16 +873,35 @@ fn storage_sync_loop( { info!("Starting remote storage sync loop"); loop { - let loop_index = index.clone(); let loop_storage = Arc::clone(&storage); + + let (batched_tasks, remaining_queue_length) = sync_queue.next_task_batch(); + + if thread_mgr::is_shutdown_requested() { + info!("Shutdown requested, stopping"); + break; + } + + REMAINING_SYNC_ITEMS.set(remaining_queue_length as i64); + if remaining_queue_length > 0 || !batched_tasks.is_empty() { + info!("Processing tasks for {} timelines in batch, more tasks left to process: {remaining_queue_length}", batched_tasks.len()); + } else { + debug!("No tasks to process"); + continue; + } + + // Concurrently perform all the tasks in the batch let loop_step = runtime.block_on(async { tokio::select! 
{ - step = loop_step( + step = process_batches( conf, - (loop_storage, loop_index, sync_queue, &mut sync_queue_receiver), max_sync_errors, + loop_storage, + &index, + batched_tasks, + sync_queue, ) - .instrument(info_span!("storage_sync_loop_step")) => step, + .instrument(info_span!("storage_sync_loop_step")) => ControlFlow::Continue(step), _ = thread_mgr::shutdown_watcher() => ControlFlow::Break(()), } }); @@ -944,31 +927,18 @@ fn storage_sync_loop( } } -async fn loop_step( +async fn process_batches( conf: &'static PageServerConf, - (storage, index, sync_queue, sync_queue_receiver): ( - Arc, - RemoteIndex, - &SyncQueue, - &mut UnboundedReceiver<(ZTenantTimelineId, SyncTask)>, - ), max_sync_errors: NonZeroU32, -) -> ControlFlow<(), HashMap>> + storage: Arc, + index: &RemoteIndex, + batched_tasks: HashMap, + sync_queue: &SyncQueue, +) -> HashMap> where P: Debug + Send + Sync + 'static, S: RemoteStorage + Send + Sync + 'static, { - let batched_tasks = sync_queue.next_task_batch(sync_queue_receiver).await; - - let remaining_queue_length = sync_queue.len(); - REMAINING_SYNC_ITEMS.set(remaining_queue_length as i64); - if remaining_queue_length > 0 || !batched_tasks.is_empty() { - info!("Processing tasks for {} timelines in batch, more tasks left to process: {remaining_queue_length}", batched_tasks.len()); - } else { - debug!("No tasks to process"); - return ControlFlow::Continue(HashMap::new()); - } - let mut sync_results = batched_tasks .into_iter() .map(|(sync_id, batch)| { @@ -993,6 +963,7 @@ where ZTenantId, HashMap, > = HashMap::new(); + while let Some((sync_id, state_update)) = sync_results.next().await { debug!("Finished storage sync task for sync id {sync_id}"); if let Some(state_update) = state_update { @@ -1003,7 +974,7 @@ where } } - ControlFlow::Continue(new_timeline_states) + new_timeline_states } async fn process_sync_task_batch( @@ -1376,7 +1347,6 @@ where P: Debug + Send + Sync + 'static, S: RemoteStorage + Send + Sync + 'static, { - info!("Updating 
remote index for the timeline"); let updated_remote_timeline = { let mut index_accessor = index.write().await; @@ -1443,7 +1413,7 @@ where IndexPart::from_remote_timeline(&timeline_path, updated_remote_timeline) .context("Failed to create an index part from the updated remote timeline")?; - info!("Uploading remote data for the timeline"); + info!("Uploading remote index for the timeline"); upload_index_part(conf, storage, sync_id, new_index_part) .await .context("Failed to upload new index part") @@ -1685,7 +1655,7 @@ mod tests { #[tokio::test] async fn separate_task_ids_batch() { - let (sync_queue, mut sync_queue_receiver) = SyncQueue::new(NonZeroUsize::new(100).unwrap()); + let sync_queue = SyncQueue::new(NonZeroUsize::new(100).unwrap()); assert_eq!(sync_queue.len(), 0); let sync_id_2 = ZTenantTimelineId { @@ -1720,7 +1690,7 @@ mod tests { let submitted_tasks_count = sync_queue.len(); assert_eq!(submitted_tasks_count, 3); - let mut batch = sync_queue.next_task_batch(&mut sync_queue_receiver).await; + let (mut batch, _) = sync_queue.next_task_batch(); assert_eq!( batch.len(), submitted_tasks_count, @@ -1746,7 +1716,7 @@ mod tests { #[tokio::test] async fn same_task_id_separate_tasks_batch() { - let (sync_queue, mut sync_queue_receiver) = SyncQueue::new(NonZeroUsize::new(100).unwrap()); + let sync_queue = SyncQueue::new(NonZeroUsize::new(100).unwrap()); assert_eq!(sync_queue.len(), 0); let download = LayersDownload { @@ -1769,7 +1739,7 @@ mod tests { let submitted_tasks_count = sync_queue.len(); assert_eq!(submitted_tasks_count, 3); - let mut batch = sync_queue.next_task_batch(&mut sync_queue_receiver).await; + let (mut batch, _) = sync_queue.next_task_batch(); assert_eq!( batch.len(), 1, @@ -1801,7 +1771,7 @@ mod tests { #[tokio::test] async fn same_task_id_same_tasks_batch() { - let (sync_queue, mut sync_queue_receiver) = SyncQueue::new(NonZeroUsize::new(1).unwrap()); + let sync_queue = SyncQueue::new(NonZeroUsize::new(1).unwrap()); let download_1 = 
LayersDownload { layers_to_skip: HashSet::from([PathBuf::from("sk1")]), }; @@ -1823,11 +1793,11 @@ mod tests { sync_queue.push(TEST_SYNC_ID, SyncTask::download(download_1.clone())); sync_queue.push(TEST_SYNC_ID, SyncTask::download(download_2.clone())); - sync_queue.push(sync_id_2, SyncTask::download(download_3.clone())); + sync_queue.push(sync_id_2, SyncTask::download(download_3)); sync_queue.push(TEST_SYNC_ID, SyncTask::download(download_4.clone())); assert_eq!(sync_queue.len(), 4); - let mut smallest_batch = sync_queue.next_task_batch(&mut sync_queue_receiver).await; + let (mut smallest_batch, _) = sync_queue.next_task_batch(); assert_eq!( smallest_batch.len(), 1, diff --git a/pageserver/src/storage_sync/delete.rs b/pageserver/src/storage_sync/delete.rs index 047ad6c2be..91c618d201 100644 --- a/pageserver/src/storage_sync/delete.rs +++ b/pageserver/src/storage_sync/delete.rs @@ -119,7 +119,7 @@ mod tests { #[tokio::test] async fn delete_timeline_negative() -> anyhow::Result<()> { let harness = RepoHarness::create("delete_timeline_negative")?; - let (sync_queue, _) = SyncQueue::new(NonZeroUsize::new(100).unwrap()); + let sync_queue = SyncQueue::new(NonZeroUsize::new(100).unwrap()); let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID); let storage = LocalFs::new( tempdir()?.path().to_path_buf(), @@ -152,7 +152,7 @@ mod tests { #[tokio::test] async fn delete_timeline() -> anyhow::Result<()> { let harness = RepoHarness::create("delete_timeline")?; - let (sync_queue, _) = SyncQueue::new(NonZeroUsize::new(100).unwrap()); + let sync_queue = SyncQueue::new(NonZeroUsize::new(100).unwrap()); let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID); let layer_files = ["a", "b", "c", "d"]; diff --git a/pageserver/src/storage_sync/download.rs b/pageserver/src/storage_sync/download.rs index 98a0a0e2fc..a28867f27e 100644 --- a/pageserver/src/storage_sync/download.rs +++ b/pageserver/src/storage_sync/download.rs @@ -286,7 +286,7 @@ mod tests { 
#[tokio::test] async fn download_timeline() -> anyhow::Result<()> { let harness = RepoHarness::create("download_timeline")?; - let (sync_queue, _) = SyncQueue::new(NonZeroUsize::new(100).unwrap()); + let sync_queue = SyncQueue::new(NonZeroUsize::new(100).unwrap()); let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID); let layer_files = ["a", "b", "layer_to_skip", "layer_to_keep_locally"]; @@ -385,7 +385,7 @@ mod tests { #[tokio::test] async fn download_timeline_negatives() -> anyhow::Result<()> { let harness = RepoHarness::create("download_timeline_negatives")?; - let (sync_queue, _) = SyncQueue::new(NonZeroUsize::new(100).unwrap()); + let sync_queue = SyncQueue::new(NonZeroUsize::new(100).unwrap()); let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID); let storage = LocalFs::new(tempdir()?.path().to_owned(), harness.conf.workdir.clone())?; diff --git a/pageserver/src/storage_sync/upload.rs b/pageserver/src/storage_sync/upload.rs index f9d606f2b8..625ec7aed6 100644 --- a/pageserver/src/storage_sync/upload.rs +++ b/pageserver/src/storage_sync/upload.rs @@ -240,7 +240,7 @@ mod tests { #[tokio::test] async fn regular_layer_upload() -> anyhow::Result<()> { let harness = RepoHarness::create("regular_layer_upload")?; - let (sync_queue, _) = SyncQueue::new(NonZeroUsize::new(100).unwrap()); + let sync_queue = SyncQueue::new(NonZeroUsize::new(100).unwrap()); let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID); let layer_files = ["a", "b"]; @@ -327,7 +327,7 @@ mod tests { #[tokio::test] async fn layer_upload_after_local_fs_update() -> anyhow::Result<()> { let harness = RepoHarness::create("layer_upload_after_local_fs_update")?; - let (sync_queue, _) = SyncQueue::new(NonZeroUsize::new(100).unwrap()); + let sync_queue = SyncQueue::new(NonZeroUsize::new(100).unwrap()); let sync_id = ZTenantTimelineId::new(harness.tenant_id, TIMELINE_ID); let layer_files = ["a1", "b1"]; diff --git 
a/test_runner/batch_others/test_tenants_with_remote_storage.py b/test_runner/batch_others/test_tenants_with_remote_storage.py new file mode 100644 index 0000000000..c00f077fcd --- /dev/null +++ b/test_runner/batch_others/test_tenants_with_remote_storage.py @@ -0,0 +1,97 @@ +# +# Little stress test for the checkpointing and remote storage code. +# +# The test creates several tenants, and runs a simple workload on +# each tenant, in parallel. The test uses remote storage, and a tiny +# checkpoint_distance setting so that a lot of layer files are created. +# + +import asyncio +from contextlib import closing +from uuid import UUID + +import pytest + +from fixtures.zenith_fixtures import ZenithEnvBuilder, ZenithEnv, Postgres, wait_for_last_record_lsn, wait_for_upload +from fixtures.utils import lsn_from_hex + + +async def tenant_workload(env: ZenithEnv, pg: Postgres): + pageserver_conn = await env.pageserver.connect_async() + + pg_conn = await pg.connect_async() + + tenant_id = await pg_conn.fetchval("show zenith.zenith_tenant") + timeline_id = await pg_conn.fetchval("show zenith.zenith_timeline") + + await pg_conn.execute("CREATE TABLE t(key int primary key, value text)") + for i in range(1, 100): + await pg_conn.execute( + f"INSERT INTO t SELECT {i}*1000 + g, 'payload' from generate_series(1,1000) g") + + # we rely upon autocommit after each statement + # as waiting for acceptors happens there + res = await pg_conn.fetchval("SELECT count(*) FROM t") + assert res == i * 1000 + + +async def all_tenants_workload(env: ZenithEnv, tenants_pgs): + workers = [] + for tenant, pg in tenants_pgs: + worker = tenant_workload(env, pg) + workers.append(asyncio.create_task(worker)) + + # await all workers + await asyncio.gather(*workers) + + +@pytest.mark.parametrize('storage_type', ['local_fs', 'mock_s3']) +def test_tenants_many(zenith_env_builder: ZenithEnvBuilder, storage_type: str): + + if storage_type == 'local_fs': + zenith_env_builder.enable_local_fs_remote_storage() + elif 
storage_type == 'mock_s3': + zenith_env_builder.enable_s3_mock_remote_storage('test_remote_storage_backup_and_restore') + else: + raise RuntimeError(f'Unknown storage type: {storage_type}') + + # Remote storage is already configured above, according to storage_type. + + env = zenith_env_builder.init_start() + + tenants_pgs = [] + + for i in range(1, 5): + # Use a tiny checkpoint distance, to create a lot of layers quickly + tenant, _ = env.zenith_cli.create_tenant( + conf={ + 'checkpoint_distance': '5000000', + }) + env.zenith_cli.create_timeline(f'test_tenants_many', tenant_id=tenant) + + pg = env.postgres.create_start( + f'test_tenants_many', + tenant_id=tenant, + ) + tenants_pgs.append((tenant, pg)) + + asyncio.run(all_tenants_workload(env, tenants_pgs)) + + # Wait for the remote storage uploads to finish + pageserver_http = env.pageserver.http_client() + for tenant, pg in tenants_pgs: + with closing(pg.connect()) as conn: + with conn.cursor() as cur: + cur.execute("show zenith.zenith_tenant") + tenant_id = cur.fetchone()[0] + cur.execute("show zenith.zenith_timeline") + timeline_id = cur.fetchone()[0] + cur.execute("SELECT pg_current_wal_flush_lsn()") + current_lsn = lsn_from_hex(cur.fetchone()[0]) + + # wait until pageserver receives all the data + wait_for_last_record_lsn(pageserver_http, UUID(tenant_id), UUID(timeline_id), current_lsn) + + # run final checkpoint manually to flush all the data to remote storage + env.pageserver.safe_psql(f"checkpoint {tenant_id} {timeline_id}") + wait_for_upload(pageserver_http, UUID(tenant_id), UUID(timeline_id), current_lsn) From 134eeeb096de28c44c8fc7de1d771ed5350598c2 Mon Sep 17 00:00:00 2001 From: Arthur Petukhovsky Date: Tue, 17 May 2022 19:29:01 +0300 Subject: [PATCH 261/296] Add more common storage metrics (#1722) - Enabled process exporter for storage services - Changed zenith_proxy prefix to just proxy - Removed old `monitoring` directory - Removed common prefix for metrics, now our common metrics have `libmetrics_` prefix, for example 
`libmetrics_serve_metrics_count` - Added `test_metrics_normal_work` --- .circleci/config.yml | 2 +- Cargo.lock | 39 ++++++++++- libs/metrics/Cargo.toml | 2 +- libs/metrics/src/lib.rs | 38 +---------- libs/utils/src/http/endpoint.rs | 4 +- monitoring/docker-compose.yml | 25 ------- monitoring/grafana.yaml | 12 ---- monitoring/prometheus.yaml | 5 -- pageserver/src/bin/pageserver.rs | 1 - poetry.lock | 30 +++++++-- proxy/src/main.rs | 1 - proxy/src/proxy.rs | 8 +-- pyproject.toml | 1 + safekeeper/src/bin/safekeeper.rs | 1 - test_runner/batch_others/test_tenants.py | 82 ++++++++++++++++++++++- test_runner/fixtures/benchmark_fixture.py | 4 +- test_runner/fixtures/metrics.py | 38 +++++++++++ test_runner/fixtures/zenith_fixtures.py | 7 +- 18 files changed, 198 insertions(+), 102 deletions(-) delete mode 100644 monitoring/docker-compose.yml delete mode 100644 monitoring/grafana.yaml delete mode 100644 monitoring/prometheus.yaml create mode 100644 test_runner/fixtures/metrics.py diff --git a/.circleci/config.yml b/.circleci/config.yml index fdd3e0cce7..1eddb9f220 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -355,7 +355,7 @@ jobs: when: always command: | du -sh /tmp/test_output/* - find /tmp/test_output -type f ! -name "pg.log" ! -name "pageserver.log" ! -name "safekeeper.log" ! -name "etcd.log" ! -name "regression.diffs" ! -name "junit.xml" ! -name "*.filediff" ! -name "*.stdout" ! -name "*.stderr" ! -name "flamegraph.svg" -delete + find /tmp/test_output -type f ! -name "pg.log" ! -name "pageserver.log" ! -name "safekeeper.log" ! -name "etcd.log" ! -name "regression.diffs" ! -name "junit.xml" ! -name "*.filediff" ! -name "*.stdout" ! -name "*.stderr" ! -name "flamegraph.svg" ! 
-name "*.metrics" -delete du -sh /tmp/test_output/* - store_artifacts: path: /tmp/test_output diff --git a/Cargo.lock b/Cargo.lock index a3974f6776..6a320ee274 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -166,7 +166,7 @@ dependencies = [ "cc", "cfg-if", "libc", - "miniz_oxide", + "miniz_oxide 0.4.4", "object", "rustc-demangle", ] @@ -868,6 +868,18 @@ version = "0.4.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "279fb028e20b3c4c320317955b77c5e0c9701f05a1d309905d6fc702cdc5053e" +[[package]] +name = "flate2" +version = "1.0.23" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b39522e96686d38f4bc984b9198e3a0613264abaebaff2c5c918bfa6b6da09af" +dependencies = [ + "cfg-if", + "crc32fast", + "libc", + "miniz_oxide 0.5.1", +] + [[package]] name = "fnv" version = "1.0.7" @@ -1527,6 +1539,15 @@ dependencies = [ "autocfg", ] +[[package]] +name = "miniz_oxide" +version = "0.5.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d2b29bd4bc3f33391105ebee3589c19197c4271e3e5a9ec9bfe8127eeff8f082" +dependencies = [ + "adler", +] + [[package]] name = "mio" version = "0.8.2" @@ -2088,6 +2109,20 @@ dependencies = [ "unicode-xid", ] +[[package]] +name = "procfs" +version = "0.10.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "95e344cafeaeefe487300c361654bcfc85db3ac53619eeccced29f5ea18c4c70" +dependencies = [ + "bitflags", + "byteorder", + "flate2", + "hex", + "lazy_static", + "libc", +] + [[package]] name = "prometheus" version = "0.13.0" @@ -2097,8 +2132,10 @@ dependencies = [ "cfg-if", "fnv", "lazy_static", + "libc", "memchr", "parking_lot 0.11.2", + "procfs", "thiserror", ] diff --git a/libs/metrics/Cargo.toml b/libs/metrics/Cargo.toml index 3b6ff4691d..8ff5d1d421 100644 --- a/libs/metrics/Cargo.toml +++ b/libs/metrics/Cargo.toml @@ -4,7 +4,7 @@ version = "0.1.0" edition = "2021" [dependencies] -prometheus = {version = "0.13", default_features=false} # removes 
protobuf dependency +prometheus = {version = "0.13", default_features=false, features = ["process"]} # removes protobuf dependency libc = "0.2" lazy_static = "1.4" once_cell = "1.8.0" diff --git a/libs/metrics/src/lib.rs b/libs/metrics/src/lib.rs index 8756a078c3..b3c1a6bd55 100644 --- a/libs/metrics/src/lib.rs +++ b/libs/metrics/src/lib.rs @@ -3,7 +3,6 @@ //! Otherwise, we might not see all metrics registered via //! a default registry. use lazy_static::lazy_static; -use once_cell::race::OnceBox; pub use prometheus::{exponential_buckets, linear_buckets}; pub use prometheus::{register_gauge, Gauge}; pub use prometheus::{register_gauge_vec, GaugeVec}; @@ -27,48 +26,15 @@ pub fn gather() -> Vec { prometheus::gather() } -static COMMON_METRICS_PREFIX: OnceBox<&str> = OnceBox::new(); - -/// Sets a prefix which will be used for all common metrics, typically a service -/// name like 'pageserver'. Should be executed exactly once in the beginning of -/// any executable which uses common metrics. -pub fn set_common_metrics_prefix(prefix: &'static str) { - // Not unwrap() because metrics may be initialized after multiple threads have been started. - COMMON_METRICS_PREFIX - .set(prefix.into()) - .unwrap_or_else(|_| { - eprintln!( - "set_common_metrics_prefix() was called second time with '{}', exiting", - prefix - ); - std::process::exit(1); - }); -} - -/// Prepends a prefix to a common metric name so they are distinguished between -/// different services, see -/// A call to set_common_metrics_prefix() is necessary prior to calling this. -pub fn new_common_metric_name(unprefixed_metric_name: &str) -> String { - // Not unwrap() because metrics may be initialized after multiple threads have been started. - format!( - "{}_{}", - COMMON_METRICS_PREFIX.get().unwrap_or_else(|| { - eprintln!("set_common_metrics_prefix() was not called, but metrics are used, exiting"); - std::process::exit(1); - }), - unprefixed_metric_name - ) -} - lazy_static! 
{ static ref DISK_IO_BYTES: IntGaugeVec = register_int_gauge_vec!( - new_common_metric_name("disk_io_bytes"), + "libmetrics_disk_io_bytes", "Bytes written and read from disk, grouped by the operation (read|write)", &["io_operation"] ) .expect("Failed to register disk i/o bytes int gauge vec"); static ref MAXRSS_KB: IntGauge = register_int_gauge!( - new_common_metric_name("maxrss_kb"), + "libmetrics_maxrss_kb", "Memory usage (Maximum Resident Set Size)" ) .expect("Failed to register maxrss_kb int gauge"); diff --git a/libs/utils/src/http/endpoint.rs b/libs/utils/src/http/endpoint.rs index 77acab496f..912404bd7d 100644 --- a/libs/utils/src/http/endpoint.rs +++ b/libs/utils/src/http/endpoint.rs @@ -5,7 +5,7 @@ use anyhow::anyhow; use hyper::header::AUTHORIZATION; use hyper::{header::CONTENT_TYPE, Body, Request, Response, Server}; use lazy_static::lazy_static; -use metrics::{new_common_metric_name, register_int_counter, Encoder, IntCounter, TextEncoder}; +use metrics::{register_int_counter, Encoder, IntCounter, TextEncoder}; use routerify::ext::RequestExt; use routerify::RequestInfo; use routerify::{Middleware, Router, RouterBuilder, RouterService}; @@ -18,7 +18,7 @@ use super::error::ApiError; lazy_static! 
{ static ref SERVE_METRICS_COUNT: IntCounter = register_int_counter!( - new_common_metric_name("serve_metrics_count"), + "libmetrics_serve_metrics_count", "Number of metric requests made" ) .expect("failed to define a metric"); diff --git a/monitoring/docker-compose.yml b/monitoring/docker-compose.yml deleted file mode 100644 index a3fda0b246..0000000000 --- a/monitoring/docker-compose.yml +++ /dev/null @@ -1,25 +0,0 @@ -version: "3" -services: - - prometheus: - container_name: prometheus - image: prom/prometheus:latest - volumes: - - ./prometheus.yaml:/etc/prometheus/prometheus.yml - # ports: - # - "9090:9090" - # TODO: find a proper portable solution - network_mode: "host" - - grafana: - image: grafana/grafana:latest - volumes: - - ./grafana.yaml:/etc/grafana/provisioning/datasources/datasources.yaml - environment: - - GF_AUTH_ANONYMOUS_ENABLED=true - - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin - - GF_AUTH_DISABLE_LOGIN_FORM=true - # ports: - # - "3000:3000" - # TODO: find a proper portable solution - network_mode: "host" diff --git a/monitoring/grafana.yaml b/monitoring/grafana.yaml deleted file mode 100644 index eac8879e6c..0000000000 --- a/monitoring/grafana.yaml +++ /dev/null @@ -1,12 +0,0 @@ -apiVersion: 1 - -datasources: -- name: Prometheus - type: prometheus - access: proxy - orgId: 1 - url: http://localhost:9090 - basicAuth: false - isDefault: false - version: 1 - editable: false diff --git a/monitoring/prometheus.yaml b/monitoring/prometheus.yaml deleted file mode 100644 index ba55d53737..0000000000 --- a/monitoring/prometheus.yaml +++ /dev/null @@ -1,5 +0,0 @@ -scrape_configs: - - job_name: 'default' - scrape_interval: 10s - static_configs: - - targets: ['localhost:9898'] diff --git a/pageserver/src/bin/pageserver.rs b/pageserver/src/bin/pageserver.rs index 4cc1dcbc5a..00864056cb 100644 --- a/pageserver/src/bin/pageserver.rs +++ b/pageserver/src/bin/pageserver.rs @@ -38,7 +38,6 @@ fn version() -> String { } fn main() -> anyhow::Result<()> { - 
metrics::set_common_metrics_prefix("pageserver"); let arg_matches = App::new("Zenith page server") .about("Materializes WAL stream to pages and serves them to the postgres") .version(&*version()) diff --git a/poetry.lock b/poetry.lock index a7cbe0aa3c..aa1e91c606 100644 --- a/poetry.lock +++ b/poetry.lock @@ -822,7 +822,7 @@ python-versions = "*" [[package]] name = "moto" -version = "3.1.7" +version = "3.1.9" description = "A library that allows your python tests to easily mock out the boto library" category = "main" optional = false @@ -868,6 +868,7 @@ ds = ["sshpubkeys (>=3.1.0)"] dynamodb = ["docker (>=2.5.1)"] dynamodb2 = ["docker (>=2.5.1)"] dynamodbstreams = ["docker (>=2.5.1)"] +ebs = ["sshpubkeys (>=3.1.0)"] ec2 = ["sshpubkeys (>=3.1.0)"] efs = ["sshpubkeys (>=3.1.0)"] glue = ["pyparsing (>=3.0.0)"] @@ -953,6 +954,17 @@ importlib-metadata = {version = ">=0.12", markers = "python_version < \"3.8\""} dev = ["pre-commit", "tox"] testing = ["pytest", "pytest-benchmark"] +[[package]] +name = "prometheus-client" +version = "0.14.1" +description = "Python client for the Prometheus monitoring system." 
+category = "main" +optional = false +python-versions = ">=3.6" + +[package.extras] +twisted = ["twisted"] + [[package]] name = "psycopg2-binary" version = "2.9.3" @@ -1003,7 +1015,7 @@ python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*" [[package]] name = "pyjwt" -version = "2.3.0" +version = "2.4.0" description = "JSON Web Token implementation in Python" category = "main" optional = false @@ -1375,7 +1387,7 @@ testing = ["pytest (>=6)", "pytest-checkdocs (>=2.4)", "pytest-flake8", "pytest- [metadata] lock-version = "1.1" python-versions = "^3.7" -content-hash = "dc63b6e02d0ceccdc4b5616e9362c149a27fdcc6c54fda63a3b115a5b980c42e" +content-hash = "d2fcba2af0a32cde3a1d0c8cfdfe5fb26531599b0c8c376bf16e200a74b55553" [metadata.files] aiopg = [ @@ -1693,8 +1705,8 @@ mccabe = [ {file = "mccabe-0.6.1.tar.gz", hash = "sha256:dd8d182285a0fe56bace7f45b5e7d1a6ebcbf524e8f3bd87eb0f125271b8831f"}, ] moto = [ - {file = "moto-3.1.7-py3-none-any.whl", hash = "sha256:4ab6fb8dd150343e115d75e3dbdb5a8f850fc7236790819d7cef438c11ee6e89"}, - {file = "moto-3.1.7.tar.gz", hash = "sha256:20607a0fd0cf6530e05ffb623ca84d3f45d50bddbcec2a33705a0cf471e71289"}, + {file = "moto-3.1.9-py3-none-any.whl", hash = "sha256:8928ec168e5fd88b1127413b2fa570a80d45f25182cdad793edd208d07825269"}, + {file = "moto-3.1.9.tar.gz", hash = "sha256:ba683e70950b6579189bc12d74c1477aa036c090c6ad8b151a22f5896c005113"}, ] mypy = [ {file = "mypy-0.910-cp35-cp35m-macosx_10_9_x86_64.whl", hash = "sha256:a155d80ea6cee511a3694b108c4494a39f42de11ee4e61e72bc424c490e46457"}, @@ -1741,6 +1753,10 @@ pluggy = [ {file = "pluggy-1.0.0-py2.py3-none-any.whl", hash = "sha256:74134bbf457f031a36d68416e1509f34bd5ccc019f0bcc952c7b909d06b37bd3"}, {file = "pluggy-1.0.0.tar.gz", hash = "sha256:4224373bacce55f955a878bf9cfa763c1e360858e330072059e10bad68531159"}, ] +prometheus-client = [ + {file = "prometheus_client-0.14.1-py3-none-any.whl", hash = "sha256:522fded625282822a89e2773452f42df14b5a8e84a86433e3f8a189c1d54dc01"}, + {file = 
"prometheus_client-0.14.1.tar.gz", hash = "sha256:5459c427624961076277fdc6dc50540e2bacb98eebde99886e59ec55ed92093a"}, +] psycopg2-binary = [ {file = "psycopg2-binary-2.9.3.tar.gz", hash = "sha256:761df5313dc15da1502b21453642d7599d26be88bff659382f8f9747c7ebea4e"}, {file = "psycopg2_binary-2.9.3-cp310-cp310-macosx_10_14_x86_64.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl", hash = "sha256:539b28661b71da7c0e428692438efbcd048ca21ea81af618d845e06ebfd29478"}, @@ -1831,8 +1847,8 @@ pyflakes = [ {file = "pyflakes-2.3.1.tar.gz", hash = "sha256:f5bc8ecabc05bb9d291eb5203d6810b49040f6ff446a756326104746cc00c1db"}, ] pyjwt = [ - {file = "PyJWT-2.3.0-py3-none-any.whl", hash = "sha256:e0c4bb8d9f0af0c7f5b1ec4c5036309617d03d56932877f2f7a0beeb5318322f"}, - {file = "PyJWT-2.3.0.tar.gz", hash = "sha256:b888b4d56f06f6dcd777210c334e69c737be74755d3e5e9ee3fe67dc18a0ee41"}, + {file = "PyJWT-2.4.0-py3-none-any.whl", hash = "sha256:72d1d253f32dbd4f5c88eaf1fdc62f3a19f676ccbadb9dbc5d07e951b2b26daf"}, + {file = "PyJWT-2.4.0.tar.gz", hash = "sha256:d42908208c699b3b973cbeb01a969ba6a96c821eefb1c5bfe4c390c01d67abba"}, ] pyparsing = [ {file = "pyparsing-3.0.6-py3-none-any.whl", hash = "sha256:04ff808a5b90911829c55c4e26f75fa5ca8a2f5f36aa3a51f68e27033341d3e4"}, diff --git a/proxy/src/main.rs b/proxy/src/main.rs index f46e19e5d6..b457d46824 100644 --- a/proxy/src/main.rs +++ b/proxy/src/main.rs @@ -38,7 +38,6 @@ async fn flatten_err( #[tokio::main] async fn main() -> anyhow::Result<()> { - metrics::set_common_metrics_prefix("zenith_proxy"); let arg_matches = App::new("Neon proxy/router") .version(GIT_VERSION) .arg( diff --git a/proxy/src/proxy.rs b/proxy/src/proxy.rs index 821ce377f5..f10b273bfd 100644 --- a/proxy/src/proxy.rs +++ b/proxy/src/proxy.rs @@ -5,7 +5,7 @@ use crate::stream::{MetricsStream, PqStream, Stream}; use anyhow::{bail, Context}; use futures::TryFutureExt; use lazy_static::lazy_static; -use metrics::{new_common_metric_name, register_int_counter, 
IntCounter}; +use metrics::{register_int_counter, IntCounter}; use std::sync::Arc; use tokio::io::{AsyncRead, AsyncWrite}; use utils::pq_proto::{BeMessage as Be, *}; @@ -15,17 +15,17 @@ const ERR_PROTO_VIOLATION: &str = "protocol violation"; lazy_static! { static ref NUM_CONNECTIONS_ACCEPTED_COUNTER: IntCounter = register_int_counter!( - new_common_metric_name("num_connections_accepted"), + "proxy_accepted_connections", "Number of TCP client connections accepted." ) .unwrap(); static ref NUM_CONNECTIONS_CLOSED_COUNTER: IntCounter = register_int_counter!( - new_common_metric_name("num_connections_closed"), + "proxy_closed_connections", "Number of TCP client connections closed." ) .unwrap(); static ref NUM_BYTES_PROXIED_COUNTER: IntCounter = register_int_counter!( - new_common_metric_name("num_bytes_proxied"), + "proxy_io_bytes", "Number of bytes sent/received between any client and backend." ) .unwrap(); diff --git a/pyproject.toml b/pyproject.toml index 335c6d61d8..b70eb19009 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -23,6 +23,7 @@ boto3-stubs = "^1.20.40" moto = {version = "^3.0.0", extras = ["server"]} backoff = "^1.11.1" pytest-lazy-fixture = "^0.6.3" +prometheus-client = "^0.14.1" [tool.poetry.dev-dependencies] yapf = "==0.31.0" diff --git a/safekeeper/src/bin/safekeeper.rs b/safekeeper/src/bin/safekeeper.rs index 2d47710a88..61d2f558f2 100644 --- a/safekeeper/src/bin/safekeeper.rs +++ b/safekeeper/src/bin/safekeeper.rs @@ -32,7 +32,6 @@ const ID_FILE_NAME: &str = "safekeeper.id"; project_git_version!(GIT_VERSION); fn main() -> anyhow::Result<()> { - metrics::set_common_metrics_prefix("safekeeper"); let arg_matches = App::new("Zenith safekeeper") .about("Store WAL stream to local file system and push it to WAL receivers") .version(GIT_VERSION) diff --git a/test_runner/batch_others/test_tenants.py b/test_runner/batch_others/test_tenants.py index 1b593cfee3..9ccb8cf196 100644 --- a/test_runner/batch_others/test_tenants.py +++ 
b/test_runner/batch_others/test_tenants.py @@ -1,8 +1,12 @@ from contextlib import closing - +from datetime import datetime +import os import pytest from fixtures.zenith_fixtures import ZenithEnvBuilder +from fixtures.log_helper import log +from fixtures.metrics import parse_metrics +from fixtures.utils import lsn_to_hex @pytest.mark.parametrize('with_safekeepers', [False, True]) @@ -38,3 +42,79 @@ def test_tenants_normal_work(zenith_env_builder: ZenithEnvBuilder, with_safekeep cur.execute("INSERT INTO t SELECT generate_series(1,100000), 'payload'") cur.execute("SELECT sum(key) FROM t") assert cur.fetchone() == (5000050000, ) + + +def test_metrics_normal_work(zenith_env_builder: ZenithEnvBuilder): + zenith_env_builder.num_safekeepers = 3 + + env = zenith_env_builder.init_start() + tenant_1, _ = env.zenith_cli.create_tenant() + tenant_2, _ = env.zenith_cli.create_tenant() + + timeline_1 = env.zenith_cli.create_timeline('test_metrics_normal_work', tenant_id=tenant_1) + timeline_2 = env.zenith_cli.create_timeline('test_metrics_normal_work', tenant_id=tenant_2) + + pg_tenant1 = env.postgres.create_start('test_metrics_normal_work', tenant_id=tenant_1) + pg_tenant2 = env.postgres.create_start('test_metrics_normal_work', tenant_id=tenant_2) + + for pg in [pg_tenant1, pg_tenant2]: + with closing(pg.connect()) as conn: + with conn.cursor() as cur: + cur.execute("CREATE TABLE t(key int primary key, value text)") + cur.execute("INSERT INTO t SELECT generate_series(1,100000), 'payload'") + cur.execute("SELECT sum(key) FROM t") + assert cur.fetchone() == (5000050000, ) + + collected_metrics = { + "pageserver": env.pageserver.http_client().get_metrics(), + } + for sk in env.safekeepers: + collected_metrics[f'safekeeper{sk.id}'] = sk.http_client().get_metrics_str() + + for name in collected_metrics: + basepath = os.path.join(zenith_env_builder.repo_dir, f'{name}.metrics') + + with open(basepath, 'w') as stdout_f: + print(collected_metrics[name], file=stdout_f, flush=True) + + 
all_metrics = [parse_metrics(m, name) for name, m in collected_metrics.items()] + ps_metrics = all_metrics[0] + sk_metrics = all_metrics[1:] + + ttids = [{ + 'tenant_id': tenant_1.hex, 'timeline_id': timeline_1.hex + }, { + 'tenant_id': tenant_2.hex, 'timeline_id': timeline_2.hex + }] + + # Test metrics per timeline + for tt in ttids: + log.info(f"Checking metrics for {tt}") + + ps_lsn = int(ps_metrics.query_one("pageserver_last_record_lsn", filter=tt).value) + sk_lsns = [int(sk.query_one("safekeeper_commit_lsn", filter=tt).value) for sk in sk_metrics] + + log.info(f"ps_lsn: {lsn_to_hex(ps_lsn)}") + log.info(f"sk_lsns: {list(map(lsn_to_hex, sk_lsns))}") + + assert ps_lsn <= max(sk_lsns) + assert ps_lsn > 0 + + # Test common metrics + for metrics in all_metrics: + log.info(f"Checking common metrics for {metrics.name}") + + log.info( + f"process_cpu_seconds_total: {metrics.query_one('process_cpu_seconds_total').value}") + log.info(f"process_threads: {int(metrics.query_one('process_threads').value)}") + log.info( + f"process_resident_memory_bytes (MB): {metrics.query_one('process_resident_memory_bytes').value / 1024 / 1024}" + ) + log.info( + f"process_virtual_memory_bytes (MB): {metrics.query_one('process_virtual_memory_bytes').value / 1024 / 1024}" + ) + log.info(f"process_open_fds: {int(metrics.query_one('process_open_fds').value)}") + log.info(f"process_max_fds: {int(metrics.query_one('process_max_fds').value)}") + log.info( + f"process_start_time_seconds (UTC): {datetime.fromtimestamp(metrics.query_one('process_start_time_seconds').value)}" + ) diff --git a/test_runner/fixtures/benchmark_fixture.py b/test_runner/fixtures/benchmark_fixture.py index 0735f16d73..e296e85cc7 100644 --- a/test_runner/fixtures/benchmark_fixture.py +++ b/test_runner/fixtures/benchmark_fixture.py @@ -236,14 +236,14 @@ class ZenithBenchmarker: """ Fetch the "cumulative # of bytes written" metric from the pageserver """ - metric_name = r'pageserver_disk_io_bytes{io_operation="write"}' + 
metric_name = r'libmetrics_disk_io_bytes{io_operation="write"}' return self.get_int_counter_value(pageserver, metric_name) def get_peak_mem(self, pageserver) -> int: """ Fetch the "maxrss" metric from the pageserver """ - metric_name = r'pageserver_maxrss_kb' + metric_name = r'libmetrics_maxrss_kb' return self.get_int_counter_value(pageserver, metric_name) def get_int_counter_value(self, pageserver, metric_name) -> int: diff --git a/test_runner/fixtures/metrics.py b/test_runner/fixtures/metrics.py new file mode 100644 index 0000000000..6fc62c6ea9 --- /dev/null +++ b/test_runner/fixtures/metrics.py @@ -0,0 +1,38 @@ +from dataclasses import dataclass +from prometheus_client.parser import text_string_to_metric_families +from prometheus_client.samples import Sample +from typing import Dict, List +from collections import defaultdict + +from fixtures.log_helper import log + + +class Metrics: + metrics: Dict[str, List[Sample]] + name: str + + def __init__(self, name: str = ""): + self.metrics = defaultdict(list) + self.name = name + + def query_all(self, name: str, filter: Dict[str, str]) -> List[Sample]: + res = [] + for sample in self.metrics[name]: + if all(sample.labels[k] == v for k, v in filter.items()): + res.append(sample) + return res + + def query_one(self, name: str, filter: Dict[str, str] = {}) -> Sample: + res = self.query_all(name, filter) + assert len(res) == 1, f"expected single sample for {name} {filter}, found {res}" + return res[0] + + +def parse_metrics(text: str, name: str = ""): + metrics = Metrics(name) + gen = text_string_to_metric_families(text) + for family in gen: + for sample in family.samples: + metrics.metrics[sample.name].append(sample) + + return metrics diff --git a/test_runner/fixtures/zenith_fixtures.py b/test_runner/fixtures/zenith_fixtures.py index 203e73037f..17d932c968 100644 --- a/test_runner/fixtures/zenith_fixtures.py +++ b/test_runner/fixtures/zenith_fixtures.py @@ -1833,10 +1833,13 @@ class 
SafekeeperHttpClient(requests.Session): assert isinstance(res_json, dict) return res_json - def get_metrics(self) -> SafekeeperMetrics: + def get_metrics_str(self) -> str: request_result = self.get(f"http://localhost:{self.port}/metrics") request_result.raise_for_status() - all_metrics_text = request_result.text + return request_result.text + + def get_metrics(self) -> SafekeeperMetrics: + all_metrics_text = self.get_metrics_str() metrics = SafekeeperMetrics() for match in re.finditer( From b9f84f4a83ed916919884b4f9f038356e76f113f Mon Sep 17 00:00:00 2001 From: Andrey Taranik Date: Tue, 17 May 2022 23:04:04 +0300 Subject: [PATCH 262/296] turn on storage deployment to neon-stress environment (#1729) --- .circleci/config.yml | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/.circleci/config.yml b/.circleci/config.yml index 1eddb9f220..85ac905f0b 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -820,6 +820,25 @@ workflows: requires: - docker-image + - deploy-neon-stress: + # Context gives an ability to login + context: Docker Hub + # deploy only for commits to main + filters: + branches: + only: + - main + requires: + - docker-image + - deploy-neon-stress-proxy: + # deploy only for commits to main + filters: + branches: + only: + - main + requires: + - docker-image + - docker-image-release: # Context gives an ability to login context: Docker Hub From 772c2fb4ff3e58d328f22a955190dc08545efbdf Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 9 May 2022 19:45:28 +0300 Subject: [PATCH 263/296] Report startup metrics and failure reason from compute_ctl (#1581) + neondatabase/cloud#1103 This adds a couple of control endpoints to simplify compute state discovery for control-plane. For example, now we may figure out that Postgres wasn't able to start or basebackup failed within seconds instead of just blindly polling the compute readiness for a minute or two.
Also we now expose startup metrics (the time of each step: basebackup, sync safekeepers, config, total). Console grabs them after each successful start and reports them as a histogram to Prometheus and Grafana. An OpenAPI spec is added and up to date, but it is not used in the console yet. --- Dockerfile.compute-tools | 2 +- compute_tools/README.md | 18 +- compute_tools/src/bin/compute_ctl.rs | 174 ++++++++++ compute_tools/src/bin/zenith_ctl.rs | 252 -------------- compute_tools/src/checker.rs | 10 +- compute_tools/src/compute.rs | 315 ++++++++++++++++++ compute_tools/src/config.rs | 12 +- .../src/{http_api.rs => http/api.rs} | 47 ++- compute_tools/src/http/mod.rs | 1 + compute_tools/src/http/openapi_spec.yaml | 158 +++++++++ compute_tools/src/lib.rs | 4 +- compute_tools/src/monitor.rs | 16 +- compute_tools/src/pg_helpers.rs | 27 +- compute_tools/src/spec.rs | 47 ++- compute_tools/src/zenith.rs | 109 ------ compute_tools/tests/pg_helpers_tests.rs | 6 +- docs/docker.md | 16 +- vendor/postgres | 2 +- 18 files changed, 787 insertions(+), 429 deletions(-) create mode 100644 compute_tools/src/bin/compute_ctl.rs delete mode 100644 compute_tools/src/bin/zenith_ctl.rs create mode 100644 compute_tools/src/compute.rs rename compute_tools/src/{http_api.rs => http/api.rs} (56%) create mode 100644 compute_tools/src/http/mod.rs create mode 100644 compute_tools/src/http/openapi_spec.yaml delete mode 100644 compute_tools/src/zenith.rs diff --git a/Dockerfile.compute-tools b/Dockerfile.compute-tools index bbe0f517ce..f0c9b9d56a 100644 --- a/Dockerfile.compute-tools +++ b/Dockerfile.compute-tools @@ -15,4 +15,4 @@ RUN set -e \ # Final image that only has one binary FROM debian:buster-slim -COPY --from=rust-build /home/circleci/project/target/release/zenith_ctl /usr/local/bin/zenith_ctl +COPY --from=rust-build /home/circleci/project/target/release/compute_ctl /usr/local/bin/compute_ctl diff --git a/compute_tools/README.md b/compute_tools/README.md index ccae3d2842..15876ed246 100644 ---
a/compute_tools/README.md +++ b/compute_tools/README.md @@ -1,9 +1,9 @@ # Compute node tools -Postgres wrapper (`zenith_ctl`) is intended to be run as a Docker entrypoint or as a `systemd` -`ExecStart` option. It will handle all the `zenith` specifics during compute node +Postgres wrapper (`compute_ctl`) is intended to be run as a Docker entrypoint or as a `systemd` +`ExecStart` option. It will handle all the `Neon` specifics during compute node initialization: -- `zenith_ctl` accepts cluster (compute node) specification as a JSON file. +- `compute_ctl` accepts cluster (compute node) specification as a JSON file. - Every start is a fresh start, so the data directory is removed and initialized again on each run. - Next it will put configuration files into the `PGDATA` directory. @@ -13,18 +13,18 @@ initialization: - Check and alter/drop/create roles and databases. - Hang waiting on the `postmaster` process to exit. -Also `zenith_ctl` spawns two separate service threads: +Also `compute_ctl` spawns two separate service threads: - `compute-monitor` checks the last Postgres activity timestamp and saves it - into the shared `ComputeState`; + into the shared `ComputeNode`; - `http-endpoint` runs a Hyper HTTP API server, which serves readiness and the last activity requests. Usage example: ```sh -zenith_ctl -D /var/db/postgres/compute \ - -C 'postgresql://zenith_admin@localhost/postgres' \ - -S /var/db/postgres/specs/current.json \ - -b /usr/local/bin/postgres +compute_ctl -D /var/db/postgres/compute \ + -C 'postgresql://zenith_admin@localhost/postgres' \ + -S /var/db/postgres/specs/current.json \ + -b /usr/local/bin/postgres ``` ## Tests diff --git a/compute_tools/src/bin/compute_ctl.rs b/compute_tools/src/bin/compute_ctl.rs new file mode 100644 index 0000000000..5c951b7779 --- /dev/null +++ b/compute_tools/src/bin/compute_ctl.rs @@ -0,0 +1,174 @@ +//! +//! Postgres wrapper (`compute_ctl`) is intended to be run as a Docker entrypoint or as a `systemd` +//! 
`ExecStart` option. It will handle all the `Neon` specifics during compute node +//! initialization: +//! - `compute_ctl` accepts cluster (compute node) specification as a JSON file. +//! - Every start is a fresh start, so the data directory is removed and +//! initialized again on each run. +//! - Next it will put configuration files into the `PGDATA` directory. +//! - Sync safekeepers and get commit LSN. +//! - Get `basebackup` from pageserver using the returned on the previous step LSN. +//! - Try to start `postgres` and wait until it is ready to accept connections. +//! - Check and alter/drop/create roles and databases. +//! - Hang waiting on the `postmaster` process to exit. +//! +//! Also `compute_ctl` spawns two separate service threads: +//! - `compute-monitor` checks the last Postgres activity timestamp and saves it +//! into the shared `ComputeNode`; +//! - `http-endpoint` runs a Hyper HTTP API server, which serves readiness and the +//! last activity requests. +//! +//! Usage example: +//! ```sh +//! compute_ctl -D /var/db/postgres/compute \ +//! -C 'postgresql://zenith_admin@localhost/postgres' \ +//! -S /var/db/postgres/specs/current.json \ +//! -b /usr/local/bin/postgres +//! ``` +//! 
+use std::fs::File; +use std::panic; +use std::path::Path; +use std::process::exit; +use std::sync::{Arc, RwLock}; +use std::{thread, time::Duration}; + +use anyhow::Result; +use chrono::Utc; +use clap::Arg; +use log::{error, info}; + +use compute_tools::compute::{ComputeMetrics, ComputeNode, ComputeState, ComputeStatus}; +use compute_tools::http::api::launch_http_server; +use compute_tools::logger::*; +use compute_tools::monitor::launch_monitor; +use compute_tools::params::*; +use compute_tools::pg_helpers::*; +use compute_tools::spec::*; + +fn main() -> Result<()> { + // TODO: re-use `utils::logging` later + init_logger(DEFAULT_LOG_LEVEL)?; + + // Env variable is set by `cargo` + let version: Option<&str> = option_env!("CARGO_PKG_VERSION"); + let matches = clap::App::new("compute_ctl") + .version(version.unwrap_or("unknown")) + .arg( + Arg::new("connstr") + .short('C') + .long("connstr") + .value_name("DATABASE_URL") + .required(true), + ) + .arg( + Arg::new("pgdata") + .short('D') + .long("pgdata") + .value_name("DATADIR") + .required(true), + ) + .arg( + Arg::new("pgbin") + .short('b') + .long("pgbin") + .value_name("POSTGRES_PATH"), + ) + .arg( + Arg::new("spec") + .short('s') + .long("spec") + .value_name("SPEC_JSON"), + ) + .arg( + Arg::new("spec-path") + .short('S') + .long("spec-path") + .value_name("SPEC_PATH"), + ) + .get_matches(); + + let pgdata = matches.value_of("pgdata").expect("PGDATA path is required"); + let connstr = matches + .value_of("connstr") + .expect("Postgres connection string is required"); + let spec = matches.value_of("spec"); + let spec_path = matches.value_of("spec-path"); + + // Try to use just 'postgres' if no path is provided + let pgbin = matches.value_of("pgbin").unwrap_or("postgres"); + + let spec: ComputeSpec = match spec { + // First, try to get cluster spec from the cli argument + Some(json) => serde_json::from_str(json)?, + None => { + // Second, try to read it from the file if path is provided + if let Some(sp) = 
spec_path { + let path = Path::new(sp); + let file = File::open(path)?; + serde_json::from_reader(file)? + } else { + panic!("cluster spec should be provided via --spec or --spec-path argument"); + } + } + }; + + let pageserver_connstr = spec + .cluster + .settings + .find("zenith.page_server_connstring") + .expect("pageserver connstr should be provided"); + let tenant = spec + .cluster + .settings + .find("zenith.zenith_tenant") + .expect("tenant id should be provided"); + let timeline = spec + .cluster + .settings + .find("zenith.zenith_timeline") + .expect("tenant id should be provided"); + + let compute_state = ComputeNode { + start_time: Utc::now(), + connstr: connstr.to_string(), + pgdata: pgdata.to_string(), + pgbin: pgbin.to_string(), + spec, + tenant, + timeline, + pageserver_connstr, + metrics: ComputeMetrics::new(), + state: RwLock::new(ComputeState::new()), + }; + let compute = Arc::new(compute_state); + + // Launch service threads first, so we were able to serve availability + // requests, while configuration is still in progress. + let _http_handle = launch_http_server(&compute).expect("cannot launch http endpoint thread"); + let _monitor_handle = launch_monitor(&compute).expect("cannot launch compute monitor thread"); + + // Run compute (Postgres) and hang waiting on it. + match compute.prepare_and_run() { + Ok(ec) => { + let code = ec.code().unwrap_or(1); + info!("Postgres exited with code {}, shutting down", code); + exit(code) + } + Err(error) => { + error!("could not start the compute node: {}", error); + + let mut state = compute.state.write().unwrap(); + state.error = Some(format!("{:?}", error)); + state.status = ComputeStatus::Failed; + drop(state); + + // Keep serving HTTP requests, so the cloud control plane was able to + // get the actual error. 
+ info!("giving control plane 30s to collect the error before shutdown"); + thread::sleep(Duration::from_secs(30)); + info!("shutting down"); + Err(error) + } + } +} diff --git a/compute_tools/src/bin/zenith_ctl.rs b/compute_tools/src/bin/zenith_ctl.rs deleted file mode 100644 index 3685f8e8b4..0000000000 --- a/compute_tools/src/bin/zenith_ctl.rs +++ /dev/null @@ -1,252 +0,0 @@ -//! -//! Postgres wrapper (`zenith_ctl`) is intended to be run as a Docker entrypoint or as a `systemd` -//! `ExecStart` option. It will handle all the `zenith` specifics during compute node -//! initialization: -//! - `zenith_ctl` accepts cluster (compute node) specification as a JSON file. -//! - Every start is a fresh start, so the data directory is removed and -//! initialized again on each run. -//! - Next it will put configuration files into the `PGDATA` directory. -//! - Sync safekeepers and get commit LSN. -//! - Get `basebackup` from pageserver using the returned on the previous step LSN. -//! - Try to start `postgres` and wait until it is ready to accept connections. -//! - Check and alter/drop/create roles and databases. -//! - Hang waiting on the `postmaster` process to exit. -//! -//! Also `zenith_ctl` spawns two separate service threads: -//! - `compute-monitor` checks the last Postgres activity timestamp and saves it -//! into the shared `ComputeState`; -//! - `http-endpoint` runs a Hyper HTTP API server, which serves readiness and the -//! last activity requests. -//! -//! Usage example: -//! ```sh -//! zenith_ctl -D /var/db/postgres/compute \ -//! -C 'postgresql://zenith_admin@localhost/postgres' \ -//! -S /var/db/postgres/specs/current.json \ -//! -b /usr/local/bin/postgres -//! ``` -//! 
-use std::fs::File; -use std::panic; -use std::path::Path; -use std::process::{exit, Command, ExitStatus}; -use std::sync::{Arc, RwLock}; - -use anyhow::{Context, Result}; -use chrono::Utc; -use clap::Arg; -use log::info; -use postgres::{Client, NoTls}; - -use compute_tools::checker::create_writablity_check_data; -use compute_tools::config; -use compute_tools::http_api::launch_http_server; -use compute_tools::logger::*; -use compute_tools::monitor::launch_monitor; -use compute_tools::params::*; -use compute_tools::pg_helpers::*; -use compute_tools::spec::*; -use compute_tools::zenith::*; - -/// Do all the preparations like PGDATA directory creation, configuration, -/// safekeepers sync, basebackup, etc. -fn prepare_pgdata(state: &Arc<RwLock<ComputeState>>) -> Result<()> { - let state = state.read().unwrap(); - let spec = &state.spec; - let pgdata_path = Path::new(&state.pgdata); - let pageserver_connstr = spec - .cluster - .settings - .find("zenith.page_server_connstring") - .expect("pageserver connstr should be provided"); - let tenant = spec - .cluster - .settings - .find("zenith.zenith_tenant") - .expect("tenant id should be provided"); - let timeline = spec - .cluster - .settings - .find("zenith.zenith_timeline") - .expect("tenant id should be provided"); - - info!( - "starting cluster #{}, operation #{}", - spec.cluster.cluster_id, - spec.operation_uuid.as_ref().unwrap() - ); - - // Remove/create an empty pgdata directory and put configuration there.
- create_pgdata(&state.pgdata)?; - config::write_postgres_conf(&pgdata_path.join("postgresql.conf"), spec)?; - - info!("starting safekeepers syncing"); - let lsn = sync_safekeepers(&state.pgdata, &state.pgbin) - .with_context(|| "failed to sync safekeepers")?; - info!("safekeepers synced at LSN {}", lsn); - - info!( - "getting basebackup@{} from pageserver {}", - lsn, pageserver_connstr - ); - get_basebackup(&state.pgdata, &pageserver_connstr, &tenant, &timeline, &lsn).with_context( - || { - format!( - "failed to get basebackup@{} from pageserver {}", - lsn, pageserver_connstr - ) - }, - )?; - - // Update pg_hba.conf received with basebackup. - update_pg_hba(pgdata_path)?; - - Ok(()) -} - -/// Start Postgres as a child process and manage DBs/roles. -/// After that this will hang waiting on the postmaster process to exit. -fn run_compute(state: &Arc<RwLock<ComputeState>>) -> Result<ExitStatus> { - let read_state = state.read().unwrap(); - let pgdata_path = Path::new(&read_state.pgdata); - - // Run postgres as a child process. - let mut pg = Command::new(&read_state.pgbin) - .args(&["-D", &read_state.pgdata]) - .spawn() - .expect("cannot start postgres process"); - - // Try default Postgres port if it is not provided - let port = read_state - .spec - .cluster - .settings - .find("port") - .unwrap_or_else(|| "5432".to_string()); - wait_for_postgres(&port, pgdata_path)?; - - let mut client = Client::connect(&read_state.connstr, NoTls)?; - - handle_roles(&read_state.spec, &mut client)?; - handle_databases(&read_state.spec, &mut client)?; - handle_grants(&read_state.spec, &mut client)?; - create_writablity_check_data(&mut client)?; - - // 'Close' connection - drop(client); - - info!( - "finished configuration of cluster #{}", - read_state.spec.cluster.cluster_id - ); - - // Release the read lock. - drop(read_state); - - // Get the write lock, update state and release the lock, so HTTP API - // was able to serve requests, while we are blocked waiting on - // Postgres.
- let mut state = state.write().unwrap(); - state.ready = true; - drop(state); - - // Wait for child postgres process basically forever. In this state Ctrl+C - // will be propagated to postgres and it will be shut down as well. - let ecode = pg.wait().expect("failed to wait on postgres"); - - Ok(ecode) -} - -fn main() -> Result<()> { - // TODO: re-use `utils::logging` later - init_logger(DEFAULT_LOG_LEVEL)?; - - // Env variable is set by `cargo` - let version: Option<&str> = option_env!("CARGO_PKG_VERSION"); - let matches = clap::App::new("zenith_ctl") - .version(version.unwrap_or("unknown")) - .arg( - Arg::new("connstr") - .short('C') - .long("connstr") - .value_name("DATABASE_URL") - .required(true), - ) - .arg( - Arg::new("pgdata") - .short('D') - .long("pgdata") - .value_name("DATADIR") - .required(true), - ) - .arg( - Arg::new("pgbin") - .short('b') - .long("pgbin") - .value_name("POSTGRES_PATH"), - ) - .arg( - Arg::new("spec") - .short('s') - .long("spec") - .value_name("SPEC_JSON"), - ) - .arg( - Arg::new("spec-path") - .short('S') - .long("spec-path") - .value_name("SPEC_PATH"), - ) - .get_matches(); - - let pgdata = matches.value_of("pgdata").expect("PGDATA path is required"); - let connstr = matches - .value_of("connstr") - .expect("Postgres connection string is required"); - let spec = matches.value_of("spec"); - let spec_path = matches.value_of("spec-path"); - - // Try to use just 'postgres' if no path is provided - let pgbin = matches.value_of("pgbin").unwrap_or("postgres"); - - let spec: ClusterSpec = match spec { - // First, try to get cluster spec from the cli argument - Some(json) => serde_json::from_str(json)?, - None => { - // Second, try to read it from the file if path is provided - if let Some(sp) = spec_path { - let path = Path::new(sp); - let file = File::open(path)?; - serde_json::from_reader(file)? 
- } else { - panic!("cluster spec should be provided via --spec or --spec-path argument"); - } - } - }; - - let compute_state = ComputeState { - connstr: connstr.to_string(), - pgdata: pgdata.to_string(), - pgbin: pgbin.to_string(), - spec, - ready: false, - last_active: Utc::now(), - }; - let compute_state = Arc::new(RwLock::new(compute_state)); - - // Launch service threads first, so we were able to serve availability - // requests, while configuration is still in progress. - let mut _threads = vec![ - launch_http_server(&compute_state).expect("cannot launch compute monitor thread"), - launch_monitor(&compute_state).expect("cannot launch http endpoint thread"), - ]; - - prepare_pgdata(&compute_state)?; - - // Run compute (Postgres) and hang waiting on it. Panic if any error happens, - // it will help us to trigger unwind and kill postmaster as well. - match run_compute(&compute_state) { - Ok(ec) => exit(ec.success() as i32), - Err(error) => panic!("cannot start compute node, error: {}", error), - } -} diff --git a/compute_tools/src/checker.rs b/compute_tools/src/checker.rs index 63da6ea23e..dbb70a74cf 100644 --- a/compute_tools/src/checker.rs +++ b/compute_tools/src/checker.rs @@ -1,11 +1,11 @@ -use std::sync::{Arc, RwLock}; +use std::sync::Arc; use anyhow::{anyhow, Result}; use log::error; use postgres::Client; use tokio_postgres::NoTls; -use crate::zenith::ComputeState; +use crate::compute::ComputeNode; pub fn create_writablity_check_data(client: &mut Client) -> Result<()> { let query = " @@ -23,9 +23,9 @@ pub fn create_writablity_check_data(client: &mut Client) -> Result<()> { Ok(()) } -pub async fn check_writability(state: &Arc<RwLock<ComputeState>>) -> Result<()> { - let connstr = state.read().unwrap().connstr.clone(); - let (client, connection) = tokio_postgres::connect(&connstr, NoTls).await?; +pub async fn check_writability(compute: &Arc<ComputeNode>) -> Result<()> { + let connstr = &compute.connstr; + let (client, connection) = tokio_postgres::connect(connstr, NoTls).await?; if
client.is_closed() { return Err(anyhow!("connection to postgres closed")); } diff --git a/compute_tools/src/compute.rs b/compute_tools/src/compute.rs new file mode 100644 index 0000000000..a8422fb2b2 --- /dev/null +++ b/compute_tools/src/compute.rs @@ -0,0 +1,315 @@ +// +// XXX: This starts to be scarry similar to the `PostgresNode` from `control_plane`, +// but there are several things that makes `PostgresNode` usage inconvenient in the +// cloud: +// - it inherits from `LocalEnv`, which contains **all-all** the information about +// a complete service running +// - it uses `PageServerNode` with information about http endpoint, which we do not +// need in the cloud again +// - many tiny pieces like, for example, we do not use `pg_ctl` in the cloud +// +// Thus, to use `PostgresNode` in the cloud, we need to 'mock' a bunch of required +// attributes (not required for the cloud). Yet, it is still tempting to unify these +// `PostgresNode` and `ComputeNode` and use one in both places. +// +// TODO: stabilize `ComputeNode` and think about using it in the `control_plane`. +// +use std::fs; +use std::os::unix::fs::PermissionsExt; +use std::path::Path; +use std::process::{Command, ExitStatus, Stdio}; +use std::sync::atomic::{AtomicU64, Ordering}; +use std::sync::RwLock; + +use anyhow::{Context, Result}; +use chrono::{DateTime, Utc}; +use log::info; +use postgres::{Client, NoTls}; +use serde::{Serialize, Serializer}; + +use crate::checker::create_writablity_check_data; +use crate::config; +use crate::pg_helpers::*; +use crate::spec::*; + +/// Compute node info shared across several `compute_ctl` threads. 
+pub struct ComputeNode { + pub start_time: DateTime<Utc>, + pub connstr: String, + pub pgdata: String, + pub pgbin: String, + pub spec: ComputeSpec, + pub tenant: String, + pub timeline: String, + pub pageserver_connstr: String, + pub metrics: ComputeMetrics, + /// Volatile part of the `ComputeNode` so should be used under `RwLock` + /// to allow HTTP API server to serve status requests, while configuration + /// is in progress. + pub state: RwLock<ComputeState>, +} + +fn rfc3339_serialize<S>(x: &DateTime<Utc>, s: S) -> Result<S::Ok, S::Error> +where + S: Serializer, +{ + x.to_rfc3339().serialize(s) +} + +#[derive(Serialize)] +#[serde(rename_all = "snake_case")] +pub struct ComputeState { + pub status: ComputeStatus, + /// Timestamp of the last Postgres activity + #[serde(serialize_with = "rfc3339_serialize")] + pub last_active: DateTime<Utc>, + pub error: Option<String>, +} + +impl ComputeState { + pub fn new() -> Self { + Self { + status: ComputeStatus::Init, + last_active: Utc::now(), + error: None, + } + } +} + +impl Default for ComputeState { + fn default() -> Self { + Self::new() + } +} + +#[derive(Serialize, Clone, Copy, PartialEq, Eq)] +#[serde(rename_all = "snake_case")] +pub enum ComputeStatus { + Init, + Running, + Failed, +} + +#[derive(Serialize)] +pub struct ComputeMetrics { + pub sync_safekeepers_ms: AtomicU64, + pub basebackup_ms: AtomicU64, + pub config_ms: AtomicU64, + pub total_startup_ms: AtomicU64, +} + +impl ComputeMetrics { + pub fn new() -> Self { + Self { + sync_safekeepers_ms: AtomicU64::new(0), + basebackup_ms: AtomicU64::new(0), + config_ms: AtomicU64::new(0), + total_startup_ms: AtomicU64::new(0), + } + } +} + +impl Default for ComputeMetrics { + fn default() -> Self { + Self::new() + } +} + +impl ComputeNode { + pub fn set_status(&self, status: ComputeStatus) { + self.state.write().unwrap().status = status; + } + + pub fn get_status(&self) -> ComputeStatus { + self.state.read().unwrap().status + } + + // Remove `pgdata` directory and create it again with right permissions.
+ fn create_pgdata(&self) -> Result<()> { + // Ignore removal error, likely it is a 'No such file or directory (os error 2)'. + // If it is something different then create_dir() will error out anyway. + let _ok = fs::remove_dir_all(&self.pgdata); + fs::create_dir(&self.pgdata)?; + fs::set_permissions(&self.pgdata, fs::Permissions::from_mode(0o700))?; + + Ok(()) + } + + // Get basebackup from the libpq connection to pageserver using `connstr` and + // unarchive it to `pgdata` directory overriding all its previous content. + fn get_basebackup(&self, lsn: &str) -> Result<()> { + let start_time = Utc::now(); + + let mut client = Client::connect(&self.pageserver_connstr, NoTls)?; + let basebackup_cmd = match lsn { + "0/0" => format!("basebackup {} {}", &self.tenant, &self.timeline), // First start of the compute + _ => format!("basebackup {} {} {}", &self.tenant, &self.timeline, lsn), + }; + let copyreader = client.copy_out(basebackup_cmd.as_str())?; + let mut ar = tar::Archive::new(copyreader); + + ar.unpack(&self.pgdata)?; + + self.metrics.basebackup_ms.store( + Utc::now() + .signed_duration_since(start_time) + .to_std() + .unwrap() + .as_millis() as u64, + Ordering::Relaxed, + ); + + Ok(()) + } + + // Run `postgres` in a special mode with `--sync-safekeepers` argument + // and return the reported LSN back to the caller. + fn sync_safekeepers(&self) -> Result<String> { + let start_time = Utc::now(); + + let sync_handle = Command::new(&self.pgbin) + .args(&["--sync-safekeepers"]) + .env("PGDATA", &self.pgdata) // we cannot use -D in this mode + .stdout(Stdio::piped()) + .spawn() + .expect("postgres --sync-safekeepers failed to start"); + + // `postgres --sync-safekeepers` will print all log output to stderr and + // final LSN to stdout. So we pipe only stdout, while stderr will be automatically + // redirected to the caller output.
+ let sync_output = sync_handle + .wait_with_output() + .expect("postgres --sync-safekeepers failed"); + if !sync_output.status.success() { + anyhow::bail!( + "postgres --sync-safekeepers exited with non-zero status: {}", + sync_output.status, + ); + } + + self.metrics.sync_safekeepers_ms.store( + Utc::now() + .signed_duration_since(start_time) + .to_std() + .unwrap() + .as_millis() as u64, + Ordering::Relaxed, + ); + + let lsn = String::from(String::from_utf8(sync_output.stdout)?.trim()); + + Ok(lsn) + } + + /// Do all the preparations like PGDATA directory creation, configuration, + /// safekeepers sync, basebackup, etc. + pub fn prepare_pgdata(&self) -> Result<()> { + let spec = &self.spec; + let pgdata_path = Path::new(&self.pgdata); + + // Remove/create an empty pgdata directory and put configuration there. + self.create_pgdata()?; + config::write_postgres_conf(&pgdata_path.join("postgresql.conf"), spec)?; + + info!("starting safekeepers syncing"); + let lsn = self + .sync_safekeepers() + .with_context(|| "failed to sync safekeepers")?; + info!("safekeepers synced at LSN {}", lsn); + + info!( + "getting basebackup@{} from pageserver {}", + lsn, &self.pageserver_connstr + ); + self.get_basebackup(&lsn).with_context(|| { + format!( + "failed to get basebackup@{} from pageserver {}", + lsn, &self.pageserver_connstr + ) + })?; + + // Update pg_hba.conf received with basebackup. + update_pg_hba(pgdata_path)?; + + Ok(()) + } + + /// Start Postgres as a child process and manage DBs/roles. + /// After that this will hang waiting on the postmaster process to exit. + pub fn run(&self) -> Result { + let start_time = Utc::now(); + + let pgdata_path = Path::new(&self.pgdata); + + // Run postgres as a child process. 
+ let mut pg = Command::new(&self.pgbin) + .args(&["-D", &self.pgdata]) + .spawn() + .expect("cannot start postgres process"); + + // Try default Postgres port if it is not provided + let port = self + .spec + .cluster + .settings + .find("port") + .unwrap_or_else(|| "5432".to_string()); + wait_for_postgres(&mut pg, &port, pgdata_path)?; + + let mut client = Client::connect(&self.connstr, NoTls)?; + + handle_roles(&self.spec, &mut client)?; + handle_databases(&self.spec, &mut client)?; + handle_grants(&self.spec, &mut client)?; + create_writablity_check_data(&mut client)?; + + // 'Close' connection + drop(client); + let startup_end_time = Utc::now(); + + self.metrics.config_ms.store( + startup_end_time + .signed_duration_since(start_time) + .to_std() + .unwrap() + .as_millis() as u64, + Ordering::Relaxed, + ); + self.metrics.total_startup_ms.store( + startup_end_time + .signed_duration_since(self.start_time) + .to_std() + .unwrap() + .as_millis() as u64, + Ordering::Relaxed, + ); + + self.set_status(ComputeStatus::Running); + + info!( + "finished configuration of compute for project {}", + self.spec.cluster.cluster_id + ); + + // Wait for child Postgres process basically forever. In this state Ctrl+C + // will propagate to Postgres and it will be shut down as well. 
+ let ecode = pg + .wait() + .expect("failed to start waiting on Postgres process"); + + Ok(ecode) + } + + pub fn prepare_and_run(&self) -> Result { + info!( + "starting compute for project {}, operation {}, tenant {}, timeline {}", + self.spec.cluster.cluster_id, + self.spec.operation_uuid.as_ref().unwrap(), + self.tenant, + self.timeline, + ); + + self.prepare_pgdata()?; + self.run() + } +} diff --git a/compute_tools/src/config.rs b/compute_tools/src/config.rs index 22134db0f8..6cbd0e3d4c 100644 --- a/compute_tools/src/config.rs +++ b/compute_tools/src/config.rs @@ -6,7 +6,7 @@ use std::path::Path; use anyhow::Result; use crate::pg_helpers::PgOptionsSerialize; -use crate::zenith::ClusterSpec; +use crate::spec::ComputeSpec; /// Check that `line` is inside a text file and put it there if it is not. /// Create file if it doesn't exist. @@ -32,20 +32,20 @@ pub fn line_in_file(path: &Path, line: &str) -> Result { } /// Create or completely rewrite configuration file specified by `path` -pub fn write_postgres_conf(path: &Path, spec: &ClusterSpec) -> Result<()> { +pub fn write_postgres_conf(path: &Path, spec: &ComputeSpec) -> Result<()> { // File::create() destroys the file content if it exists. 
let mut postgres_conf = File::create(path)?; - write_zenith_managed_block(&mut postgres_conf, &spec.cluster.settings.as_pg_settings())?; + write_auto_managed_block(&mut postgres_conf, &spec.cluster.settings.as_pg_settings())?; Ok(()) } // Write Postgres config block wrapped with generated comment section -fn write_zenith_managed_block(file: &mut File, buf: &str) -> Result<()> { - writeln!(file, "# Managed by Zenith: begin")?; +fn write_auto_managed_block(file: &mut File, buf: &str) -> Result<()> { + writeln!(file, "# Managed by compute_ctl: begin")?; writeln!(file, "{}", buf)?; - writeln!(file, "# Managed by Zenith: end")?; + writeln!(file, "# Managed by compute_ctl: end")?; Ok(()) } diff --git a/compute_tools/src/http_api.rs b/compute_tools/src/http/api.rs similarity index 56% rename from compute_tools/src/http_api.rs rename to compute_tools/src/http/api.rs index 7e1a876044..4c8bbc608b 100644 --- a/compute_tools/src/http_api.rs +++ b/compute_tools/src/http/api.rs @@ -1,37 +1,64 @@ use std::convert::Infallible; use std::net::SocketAddr; -use std::sync::{Arc, RwLock}; +use std::sync::Arc; use std::thread; use anyhow::Result; use hyper::service::{make_service_fn, service_fn}; use hyper::{Body, Method, Request, Response, Server, StatusCode}; use log::{error, info}; +use serde_json; -use crate::zenith::*; +use crate::compute::{ComputeNode, ComputeStatus}; // Service function to handle all available routes. -async fn routes(req: Request, state: Arc>) -> Response { +async fn routes(req: Request, compute: Arc) -> Response { match (req.method(), req.uri().path()) { // Timestamp of the last Postgres activity in the plain text. + // DEPRECATED in favour of /status (&Method::GET, "/last_activity") => { info!("serving /last_active GET request"); - let state = state.read().unwrap(); + let state = compute.state.read().unwrap(); // Use RFC3339 format for consistency. Response::new(Body::from(state.last_active.to_rfc3339())) } - // Has compute setup process finished? 
-> true/false + // Has compute setup process finished? -> true/false. + // DEPRECATED in favour of /status (&Method::GET, "/ready") => { info!("serving /ready GET request"); - let state = state.read().unwrap(); - Response::new(Body::from(format!("{}", state.ready))) + let status = compute.get_status(); + Response::new(Body::from(format!("{}", status == ComputeStatus::Running))) } + // Serialized compute state. + (&Method::GET, "/status") => { + info!("serving /status GET request"); + let state = compute.state.read().unwrap(); + Response::new(Body::from(serde_json::to_string(&*state).unwrap())) + } + + // Startup metrics in JSON format. Keep /metrics reserved for a possible + // future use for Prometheus metrics format. + (&Method::GET, "/metrics.json") => { + info!("serving /metrics.json GET request"); + Response::new(Body::from(serde_json::to_string(&compute.metrics).unwrap())) + } + + // DEPRECATED, use POST instead (&Method::GET, "/check_writability") => { info!("serving /check_writability GET request"); - let res = crate::checker::check_writability(&state).await; + let res = crate::checker::check_writability(&compute).await; + match res { + Ok(_) => Response::new(Body::from("true")), + Err(e) => Response::new(Body::from(e.to_string())), + } + } + + (&Method::POST, "/check_writability") => { + info!("serving /check_writability POST request"); + let res = crate::checker::check_writability(&compute).await; match res { Ok(_) => Response::new(Body::from("true")), Err(e) => Response::new(Body::from(e.to_string())), @@ -49,7 +76,7 @@ async fn routes(req: Request, state: Arc>) -> Respons // Main Hyper HTTP server function that runs it and blocks waiting on it forever. 
#[tokio::main] -async fn serve(state: Arc>) { +async fn serve(state: Arc) { let addr = SocketAddr::from(([0, 0, 0, 0], 3080)); let make_service = make_service_fn(move |_conn| { @@ -73,7 +100,7 @@ async fn serve(state: Arc>) { } /// Launch a separate Hyper HTTP API server thread and return its `JoinHandle`. -pub fn launch_http_server(state: &Arc>) -> Result> { +pub fn launch_http_server(state: &Arc) -> Result> { let state = Arc::clone(state); Ok(thread::Builder::new() diff --git a/compute_tools/src/http/mod.rs b/compute_tools/src/http/mod.rs new file mode 100644 index 0000000000..e5fdf85eed --- /dev/null +++ b/compute_tools/src/http/mod.rs @@ -0,0 +1 @@ +pub mod api; diff --git a/compute_tools/src/http/openapi_spec.yaml b/compute_tools/src/http/openapi_spec.yaml new file mode 100644 index 0000000000..9c0f8e3ccd --- /dev/null +++ b/compute_tools/src/http/openapi_spec.yaml @@ -0,0 +1,158 @@ +openapi: "3.0.2" +info: + title: Compute node control API + version: "1.0" + +servers: + - url: "http://localhost:3080" + +paths: + /status: + get: + tags: + - "info" + summary: Get compute node internal status + description: "" + operationId: getComputeStatus + responses: + "200": + description: ComputeState + content: + application/json: + schema: + $ref: "#/components/schemas/ComputeState" + + /metrics.json: + get: + tags: + - "info" + summary: Get compute node startup metrics in JSON format + description: "" + operationId: getComputeMetricsJSON + responses: + "200": + description: ComputeMetrics + content: + application/json: + schema: + $ref: "#/components/schemas/ComputeMetrics" + + /ready: + get: + deprecated: true + tags: + - "info" + summary: Check whether compute startup process finished successfully + description: "" + operationId: computeIsReady + responses: + "200": + description: Compute is ready ('true') or not ('false') + content: + text/plain: + schema: + type: string + example: "true" + + /last_activity: + get: + deprecated: true + tags: + - "info" + summary: Get 
timestamp of the last compute activity + description: "" + operationId: getLastComputeActivityTS + responses: + "200": + description: Timestamp of the last compute activity + content: + text/plain: + schema: + type: string + example: "2022-10-12T07:20:50.52Z" + + /check_writability: + get: + deprecated: true + tags: + - "check" + summary: Check that we can write new data on this compute + description: "" + operationId: checkComputeWritabilityDeprecated + responses: + "200": + description: Check result + content: + text/plain: + schema: + type: string + description: Error text or 'true' if check passed + example: "true" + + post: + tags: + - "check" + summary: Check that we can write new data on this compute + description: "" + operationId: checkComputeWritability + responses: + "200": + description: Check result + content: + text/plain: + schema: + type: string + description: Error text or 'true' if check passed + example: "true" + +components: + securitySchemes: + JWT: + type: http + scheme: bearer + bearerFormat: JWT + + schemas: + ComputeMetrics: + type: object + description: Compute startup metrics + required: + - sync_safekeepers_ms + - basebackup_ms + - config_ms + - total_startup_ms + properties: + sync_safekeepers_ms: + type: integer + basebackup_ms: + type: integer + config_ms: + type: integer + total_startup_ms: + type: integer + + ComputeState: + type: object + required: + - status + - last_active + properties: + status: + $ref: '#/components/schemas/ComputeStatus' + last_active: + type: string + description: The last detected compute activity timestamp in UTC and RFC3339 format + example: "2022-10-12T07:20:50.52Z" + error: + type: string + description: Text of the error during compute startup, if any + + ComputeStatus: + type: string + enum: + - init + - failed + - running + +security: + - JWT: [] diff --git a/compute_tools/src/lib.rs b/compute_tools/src/lib.rs index ffb9700a49..aee6b53e6a 100644 --- a/compute_tools/src/lib.rs +++ 
b/compute_tools/src/lib.rs @@ -4,11 +4,11 @@ //! pub mod checker; pub mod config; -pub mod http_api; +pub mod http; #[macro_use] pub mod logger; +pub mod compute; pub mod monitor; pub mod params; pub mod pg_helpers; pub mod spec; -pub mod zenith; diff --git a/compute_tools/src/monitor.rs b/compute_tools/src/monitor.rs index 596981b2d2..496a5aae3b 100644 --- a/compute_tools/src/monitor.rs +++ b/compute_tools/src/monitor.rs @@ -1,4 +1,4 @@ -use std::sync::{Arc, RwLock}; +use std::sync::Arc; use std::{thread, time}; use anyhow::Result; @@ -6,16 +6,16 @@ use chrono::{DateTime, Utc}; use log::{debug, info}; use postgres::{Client, NoTls}; -use crate::zenith::ComputeState; +use crate::compute::ComputeNode; const MONITOR_CHECK_INTERVAL: u64 = 500; // milliseconds // Spin in a loop and figure out the last activity time in the Postgres. // Then update it in the shared state. This function never errors out. // XXX: the only expected panic is at `RwLock` unwrap(). -fn watch_compute_activity(state: &Arc>) { +fn watch_compute_activity(compute: &Arc) { // Suppose that `connstr` doesn't change - let connstr = state.read().unwrap().connstr.clone(); + let connstr = compute.connstr.clone(); // Define `client` outside of the loop to reuse existing connection if it's active. let mut client = Client::connect(&connstr, NoTls); let timeout = time::Duration::from_millis(MONITOR_CHECK_INTERVAL); @@ -46,7 +46,7 @@ fn watch_compute_activity(state: &Arc>) { AND usename != 'zenith_admin';", // XXX: find a better way to filter other monitors? &[], ); - let mut last_active = state.read().unwrap().last_active; + let mut last_active = compute.state.read().unwrap().last_active; if let Ok(backs) = backends { let mut idle_backs: Vec> = vec![]; @@ -83,14 +83,14 @@ fn watch_compute_activity(state: &Arc>) { } // Update the last activity in the shared state if we got a more recent one. 
- let mut state = state.write().unwrap(); + let mut state = compute.state.write().unwrap(); if last_active > state.last_active { state.last_active = last_active; debug!("set the last compute activity time to: {}", last_active); } } Err(e) => { - info!("cannot connect to postgres: {}, retrying", e); + debug!("cannot connect to postgres: {}, retrying", e); // Establish a new connection and try again. client = Client::connect(&connstr, NoTls); @@ -100,7 +100,7 @@ fn watch_compute_activity(state: &Arc>) { } /// Launch a separate compute monitor thread and return its `JoinHandle`. -pub fn launch_monitor(state: &Arc>) -> Result> { +pub fn launch_monitor(state: &Arc) -> Result> { let state = Arc::clone(state); Ok(thread::Builder::new() diff --git a/compute_tools/src/pg_helpers.rs b/compute_tools/src/pg_helpers.rs index 1409a81b6b..74856eac63 100644 --- a/compute_tools/src/pg_helpers.rs +++ b/compute_tools/src/pg_helpers.rs @@ -1,7 +1,9 @@ +use std::fs::File; +use std::io::{BufRead, BufReader}; use std::net::{SocketAddr, TcpStream}; use std::os::unix::fs::PermissionsExt; use std::path::Path; -use std::process::Command; +use std::process::Child; use std::str::FromStr; use std::{fs, thread, time}; @@ -220,12 +222,12 @@ pub fn get_existing_dbs(client: &mut Client) -> Result> { /// Wait for Postgres to become ready to accept connections: /// - state should be `ready` in the `pgdata/postmaster.pid` /// - and we should be able to connect to 127.0.0.1:5432 -pub fn wait_for_postgres(port: &str, pgdata: &Path) -> Result<()> { +pub fn wait_for_postgres(pg: &mut Child, port: &str, pgdata: &Path) -> Result<()> { let pid_path = pgdata.join("postmaster.pid"); let mut slept: u64 = 0; // ms let pause = time::Duration::from_millis(100); - let timeout = time::Duration::from_millis(200); + let timeout = time::Duration::from_millis(10); let addr = SocketAddr::from_str(&format!("127.0.0.1:{}", port)).unwrap(); loop { @@ -236,14 +238,19 @@ pub fn wait_for_postgres(port: &str, pgdata: &Path) -> 
Result<()> { bail!("timed out while waiting for Postgres to start"); } + if let Ok(Some(status)) = pg.try_wait() { + // Postgres exited, that is not what we expected, bail out earlier. + let code = status.code().unwrap_or(-1); + bail!("Postgres exited unexpectedly with code {}", code); + } + if pid_path.exists() { - // XXX: dumb and the simplest way to get the last line in a text file - // TODO: better use `.lines().last()` later - let stdout = Command::new("tail") - .args(&["-n1", pid_path.to_str().unwrap()]) - .output()? - .stdout; - let status = String::from_utf8(stdout)?; + let file = BufReader::new(File::open(&pid_path)?); + let status = file + .lines() + .last() + .unwrap() + .unwrap_or_else(|_| "unknown".to_string()); let can_connect = TcpStream::connect_timeout(&addr, timeout).is_ok(); // Now Postgres is ready to accept connections diff --git a/compute_tools/src/spec.rs b/compute_tools/src/spec.rs index 334e0a9e05..e88df56a65 100644 --- a/compute_tools/src/spec.rs +++ b/compute_tools/src/spec.rs @@ -3,16 +3,53 @@ use std::path::Path; use anyhow::Result; use log::{info, log_enabled, warn, Level}; use postgres::Client; +use serde::Deserialize; use crate::config; use crate::params::PG_HBA_ALL_MD5; use crate::pg_helpers::*; -use crate::zenith::ClusterSpec; + +/// Cluster spec or configuration represented as an optional number of +/// delta operations + final cluster state description. +#[derive(Clone, Deserialize)] +pub struct ComputeSpec { + pub format_version: f32, + pub timestamp: String, + pub operation_uuid: Option, + /// Expected cluster state at the end of transition process. + pub cluster: Cluster, + pub delta_operations: Option>, +} + +/// Cluster state seen from the perspective of the external tools +/// like Rails web console. 
+#[derive(Clone, Deserialize)] +pub struct Cluster { + pub cluster_id: String, + pub name: String, + pub state: Option, + pub roles: Vec, + pub databases: Vec, + pub settings: GenericOptions, +} + +/// Single cluster state changing operation that could not be represented as +/// a static `Cluster` structure. For example: +/// - DROP DATABASE +/// - DROP ROLE +/// - ALTER ROLE name RENAME TO new_name +/// - ALTER DATABASE name RENAME TO new_name +#[derive(Clone, Deserialize)] +pub struct DeltaOp { + pub action: String, + pub name: PgIdent, + pub new_name: Option, +} /// It takes cluster specification and does the following: /// - Serialize cluster config and put it into `postgresql.conf` completely rewriting the file. /// - Update `pg_hba.conf` to allow external connections. -pub fn handle_configuration(spec: &ClusterSpec, pgdata_path: &Path) -> Result<()> { +pub fn handle_configuration(spec: &ComputeSpec, pgdata_path: &Path) -> Result<()> { // File `postgresql.conf` is no longer included into `basebackup`, so just // always write all config into it creating new file. config::write_postgres_conf(&pgdata_path.join("postgresql.conf"), spec)?; @@ -39,7 +76,7 @@ pub fn update_pg_hba(pgdata_path: &Path) -> Result<()> { /// Given a cluster spec json and open transaction it handles roles creation, /// deletion and update. -pub fn handle_roles(spec: &ClusterSpec, client: &mut Client) -> Result<()> { +pub fn handle_roles(spec: &ComputeSpec, client: &mut Client) -> Result<()> { let mut xact = client.transaction()?; let existing_roles: Vec = get_existing_roles(&mut xact)?; @@ -165,7 +202,7 @@ pub fn handle_roles(spec: &ClusterSpec, client: &mut Client) -> Result<()> { /// like `CREATE DATABASE` and `DROP DATABASE` do not support it. Statement-level /// atomicity should be enough here due to the order of operations and various checks, /// which together provide us idempotency. 
-pub fn handle_databases(spec: &ClusterSpec, client: &mut Client) -> Result<()> { +pub fn handle_databases(spec: &ComputeSpec, client: &mut Client) -> Result<()> { let existing_dbs: Vec = get_existing_dbs(client)?; // Print a list of existing Postgres databases (only in debug mode) @@ -254,7 +291,7 @@ pub fn handle_databases(spec: &ClusterSpec, client: &mut Client) -> Result<()> { // Grant CREATE ON DATABASE to the database owner // to allow clients create trusted extensions. -pub fn handle_grants(spec: &ClusterSpec, client: &mut Client) -> Result<()> { +pub fn handle_grants(spec: &ComputeSpec, client: &mut Client) -> Result<()> { info!("cluster spec grants:"); for db in &spec.cluster.databases { diff --git a/compute_tools/src/zenith.rs b/compute_tools/src/zenith.rs deleted file mode 100644 index ba7dc20787..0000000000 --- a/compute_tools/src/zenith.rs +++ /dev/null @@ -1,109 +0,0 @@ -use std::process::{Command, Stdio}; - -use anyhow::Result; -use chrono::{DateTime, Utc}; -use postgres::{Client, NoTls}; -use serde::Deserialize; - -use crate::pg_helpers::*; - -/// Compute node state shared across several `zenith_ctl` threads. -/// Should be used under `RwLock` to allow HTTP API server to serve -/// status requests, while configuration is in progress. -pub struct ComputeState { - pub connstr: String, - pub pgdata: String, - pub pgbin: String, - pub spec: ClusterSpec, - /// Compute setup process has finished - pub ready: bool, - /// Timestamp of the last Postgres activity - pub last_active: DateTime, -} - -/// Cluster spec or configuration represented as an optional number of -/// delta operations + final cluster state description. -#[derive(Clone, Deserialize)] -pub struct ClusterSpec { - pub format_version: f32, - pub timestamp: String, - pub operation_uuid: Option, - /// Expected cluster state at the end of transition process. 
- pub cluster: Cluster, - pub delta_operations: Option>, -} - -/// Cluster state seen from the perspective of the external tools -/// like Rails web console. -#[derive(Clone, Deserialize)] -pub struct Cluster { - pub cluster_id: String, - pub name: String, - pub state: Option, - pub roles: Vec, - pub databases: Vec, - pub settings: GenericOptions, -} - -/// Single cluster state changing operation that could not be represented as -/// a static `Cluster` structure. For example: -/// - DROP DATABASE -/// - DROP ROLE -/// - ALTER ROLE name RENAME TO new_name -/// - ALTER DATABASE name RENAME TO new_name -#[derive(Clone, Deserialize)] -pub struct DeltaOp { - pub action: String, - pub name: PgIdent, - pub new_name: Option, -} - -/// Get basebackup from the libpq connection to pageserver using `connstr` and -/// unarchive it to `pgdata` directory overriding all its previous content. -pub fn get_basebackup( - pgdata: &str, - connstr: &str, - tenant: &str, - timeline: &str, - lsn: &str, -) -> Result<()> { - let mut client = Client::connect(connstr, NoTls)?; - let basebackup_cmd = match lsn { - "0/0" => format!("basebackup {} {}", tenant, timeline), // First start of the compute - _ => format!("basebackup {} {} {}", tenant, timeline, lsn), - }; - let copyreader = client.copy_out(basebackup_cmd.as_str())?; - let mut ar = tar::Archive::new(copyreader); - - ar.unpack(&pgdata)?; - - Ok(()) -} - -/// Run `postgres` in a special mode with `--sync-safekeepers` argument -/// and return the reported LSN back to the caller. -pub fn sync_safekeepers(pgdata: &str, pgbin: &str) -> Result { - let sync_handle = Command::new(&pgbin) - .args(&["--sync-safekeepers"]) - .env("PGDATA", &pgdata) // we cannot use -D in this mode - .stdout(Stdio::piped()) - .spawn() - .expect("postgres --sync-safekeepers failed to start"); - - // `postgres --sync-safekeepers` will print all log output to stderr and - // final LSN to stdout. 
So we pipe only stdout, while stderr will be automatically - // redirected to the caller output. - let sync_output = sync_handle - .wait_with_output() - .expect("postgres --sync-safekeepers failed"); - if !sync_output.status.success() { - anyhow::bail!( - "postgres --sync-safekeepers exited with non-zero status: {}", - sync_output.status, - ); - } - - let lsn = String::from(String::from_utf8(sync_output.stdout)?.trim()); - - Ok(lsn) -} diff --git a/compute_tools/tests/pg_helpers_tests.rs b/compute_tools/tests/pg_helpers_tests.rs index 472a49af4b..33f903f0e1 100644 --- a/compute_tools/tests/pg_helpers_tests.rs +++ b/compute_tools/tests/pg_helpers_tests.rs @@ -4,12 +4,12 @@ mod pg_helpers_tests { use std::fs::File; use compute_tools::pg_helpers::*; - use compute_tools::zenith::ClusterSpec; + use compute_tools::spec::ComputeSpec; #[test] fn params_serialize() { let file = File::open("tests/cluster_spec.json").unwrap(); - let spec: ClusterSpec = serde_json::from_reader(file).unwrap(); + let spec: ComputeSpec = serde_json::from_reader(file).unwrap(); assert_eq!( spec.cluster.databases.first().unwrap().to_pg_options(), @@ -24,7 +24,7 @@ mod pg_helpers_tests { #[test] fn settings_serialize() { let file = File::open("tests/cluster_spec.json").unwrap(); - let spec: ClusterSpec = serde_json::from_reader(file).unwrap(); + let spec: ComputeSpec = serde_json::from_reader(file).unwrap(); assert_eq!( spec.cluster.settings.as_pg_settings(), diff --git a/docs/docker.md b/docs/docker.md index cc54d012dd..100cdd248b 100644 --- a/docs/docker.md +++ b/docs/docker.md @@ -1,20 +1,20 @@ -# Docker images of Zenith +# Docker images of Neon ## Images Currently we build two main images: -- [zenithdb/zenith](https://hub.docker.com/repository/docker/zenithdb/zenith) — image with pre-built `pageserver`, `safekeeper` and `proxy` binaries and all the required runtime dependencies. Built from [/Dockerfile](/Dockerfile). 
-- [zenithdb/compute-node](https://hub.docker.com/repository/docker/zenithdb/compute-node) — compute node image with pre-built Postgres binaries from [zenithdb/postgres](https://github.com/zenithdb/postgres). +- [neondatabase/neon](https://hub.docker.com/repository/docker/zenithdb/zenith) — image with pre-built `pageserver`, `safekeeper` and `proxy` binaries and all the required runtime dependencies. Built from [/Dockerfile](/Dockerfile). +- [neondatabase/compute-node](https://hub.docker.com/repository/docker/zenithdb/compute-node) — compute node image with pre-built Postgres binaries from [neondatabase/postgres](https://github.com/neondatabase/postgres). -And additional intermediate images: +And additional intermediate image: -- [zenithdb/compute-tools](https://hub.docker.com/repository/docker/zenithdb/compute-tools) — compute node configuration management tools. +- [neondatabase/compute-tools](https://hub.docker.com/repository/docker/neondatabase/compute-tools) — compute node configuration management tools. ## Building pipeline -1. Image `zenithdb/compute-tools` is re-built automatically. +We build all images after a successful `release` tests run and push automatically to Docker Hub with two parallel CI jobs -2. Image `zenithdb/compute-node` is built independently in the [zenithdb/postgres](https://github.com/zenithdb/postgres) repo. +1. `neondatabase/compute-tools` and `neondatabase/compute-node` -3. Image `zenithdb/zenith` is built in this repo after a successful `release` tests run and pushed to Docker Hub automatically. +2. 
`neondatabase/neon` diff --git a/vendor/postgres b/vendor/postgres index 1db115cecb..79af2faf08 160000 --- a/vendor/postgres +++ b/vendor/postgres @@ -1 +1 @@ -Subproject commit 1db115cecb3dbc2a74c5efa964fdf3a8a341c4d2 +Subproject commit 79af2faf08d9bec1b1664a72936727dcca36d253 From 98da0aa159f028c1ffc0679ee788f44e9f083dfc Mon Sep 17 00:00:00 2001 From: Arthur Petukhovsky Date: Wed, 18 May 2022 15:17:04 +0300 Subject: [PATCH 264/296] Add _total suffix to metrics name (#1741) --- libs/metrics/src/lib.rs | 2 +- libs/utils/src/http/endpoint.rs | 2 +- proxy/src/proxy.rs | 6 +++--- test_runner/fixtures/benchmark_fixture.py | 2 +- 4 files changed, 6 insertions(+), 6 deletions(-) diff --git a/libs/metrics/src/lib.rs b/libs/metrics/src/lib.rs index b3c1a6bd55..9929fc6d45 100644 --- a/libs/metrics/src/lib.rs +++ b/libs/metrics/src/lib.rs @@ -28,7 +28,7 @@ pub fn gather() -> Vec { lazy_static! { static ref DISK_IO_BYTES: IntGaugeVec = register_int_gauge_vec!( - "libmetrics_disk_io_bytes", + "libmetrics_disk_io_bytes_total", "Bytes written and read from disk, grouped by the operation (read|write)", &["io_operation"] ) diff --git a/libs/utils/src/http/endpoint.rs b/libs/utils/src/http/endpoint.rs index 912404bd7d..51bff5f6eb 100644 --- a/libs/utils/src/http/endpoint.rs +++ b/libs/utils/src/http/endpoint.rs @@ -18,7 +18,7 @@ use super::error::ApiError; lazy_static! { static ref SERVE_METRICS_COUNT: IntCounter = register_int_counter!( - "libmetrics_serve_metrics_count", + "libmetrics_metric_handler_requests_total", "Number of metric requests made" ) .expect("failed to define a metric"); diff --git a/proxy/src/proxy.rs b/proxy/src/proxy.rs index f10b273bfd..642e50c2c1 100644 --- a/proxy/src/proxy.rs +++ b/proxy/src/proxy.rs @@ -15,17 +15,17 @@ const ERR_PROTO_VIOLATION: &str = "protocol violation"; lazy_static! 
{ static ref NUM_CONNECTIONS_ACCEPTED_COUNTER: IntCounter = register_int_counter!( - "proxy_accepted_connections", + "proxy_accepted_connections_total", "Number of TCP client connections accepted." ) .unwrap(); static ref NUM_CONNECTIONS_CLOSED_COUNTER: IntCounter = register_int_counter!( - "proxy_closed_connections", + "proxy_closed_connections_total", "Number of TCP client connections closed." ) .unwrap(); static ref NUM_BYTES_PROXIED_COUNTER: IntCounter = register_int_counter!( - "proxy_io_bytes", + "proxy_io_bytes_total", "Number of bytes sent/received between any client and backend." ) .unwrap(); diff --git a/test_runner/fixtures/benchmark_fixture.py b/test_runner/fixtures/benchmark_fixture.py index e296e85cc7..5fc6076f51 100644 --- a/test_runner/fixtures/benchmark_fixture.py +++ b/test_runner/fixtures/benchmark_fixture.py @@ -236,7 +236,7 @@ class ZenithBenchmarker: """ Fetch the "cumulative # of bytes written" metric from the pageserver """ - metric_name = r'libmetrics_disk_io_bytes{io_operation="write"}' + metric_name = r'libmetrics_disk_io_bytes_total{io_operation="write"}' return self.get_int_counter_value(pageserver, metric_name) def get_peak_mem(self, pageserver) -> int: From 432907ff5f130f4fada8dd605e428d1bea822ea0 Mon Sep 17 00:00:00 2001 From: Egor Suvorov Date: Wed, 18 May 2022 22:02:17 +0200 Subject: [PATCH 265/296] Safekeeper: avoid holding mutex when deleting a tenant (#1746) Following discussion with @arssher after #1653 --- safekeeper/src/timeline.rs | 35 +++++++++++++++++++---------------- 1 file changed, 19 insertions(+), 16 deletions(-) diff --git a/safekeeper/src/timeline.rs b/safekeeper/src/timeline.rs index 84ad53d72d..2bb7771aac 100644 --- a/safekeeper/src/timeline.rs +++ b/safekeeper/src/timeline.rs @@ -679,29 +679,32 @@ impl GlobalTimelines { /// Deactivates and deletes all timelines for the tenant, see `delete()`. /// Returns map of all timelines which the tenant had, `true` if a timeline was active. 
+ /// There may be a race if new timelines are created simultaneously. pub fn delete_force_all_for_tenant( conf: &SafeKeeperConf, tenant_id: &ZTenantId, ) -> Result> { info!("deleting all timelines for tenant {}", tenant_id); - let mut state = TIMELINES_STATE.lock().unwrap(); - let mut deleted = HashMap::new(); - for (zttid, tli) in &state.timelines { - if zttid.tenant_id == *tenant_id { - deleted.insert( - *zttid, - GlobalTimelines::delete_force_internal( - conf, - zttid, - tli.deactivate_for_delete()?, - )?, - ); + let mut to_delete = HashMap::new(); + { + // Keep mutex in this scope. + let timelines = &mut TIMELINES_STATE.lock().unwrap().timelines; + for (&zttid, tli) in timelines.iter() { + if zttid.tenant_id == *tenant_id { + to_delete.insert(zttid, tli.deactivate_for_delete()?); + } } + // TODO: test that the correct subset of timelines is removed. It's complicated because they are implicitly created currently. + timelines.retain(|zttid, _| !to_delete.contains_key(zttid)); } - // TODO: test that the exact subset of timelines is removed. - state - .timelines - .retain(|zttid, _| !deleted.contains_key(zttid)); + let mut deleted = HashMap::new(); + for (zttid, was_active) in to_delete { + deleted.insert( + zttid, + GlobalTimelines::delete_force_internal(conf, &zttid, was_active)?, + ); + } + // There may be inactive timelines, so delete the whole tenant dir as well. match std::fs::remove_dir_all(conf.tenant_dir(tenant_id)) { Ok(_) => (), Err(e) if e.kind() == std::io::ErrorKind::NotFound => (), From 4a36d89247723a42b45f8b46da5e6b930a6aaa38 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Wed, 18 May 2022 22:26:17 +0300 Subject: [PATCH 266/296] Avoid spawning a layer-flush thread when there's no work to do. The check_checkpoint_distance() always spawned a new thread, even if there is no frozen layer to flush. That was a thinko, as @knizhnik pointed out. 
--- pageserver/src/layered_repository.rs | 32 +++++++++++++++++----------- 1 file changed, 20 insertions(+), 12 deletions(-) diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index c7536cc959..bad2e32cc2 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -1621,22 +1621,30 @@ impl LayeredTimeline { pub fn check_checkpoint_distance(self: &Arc) -> Result<()> { let last_lsn = self.get_last_record_lsn(); + // Has more than 'checkpoint_distance' of WAL been accumulated? let distance = last_lsn.widening_sub(self.last_freeze_at.load()); if distance >= self.get_checkpoint_distance().into() { + // Yes. Freeze the current in-memory layer. self.freeze_inmem_layer(true); self.last_freeze_at.store(last_lsn); - } - if let Ok(guard) = self.layer_flush_lock.try_lock() { - drop(guard); - let self_clone = Arc::clone(self); - thread_mgr::spawn( - thread_mgr::ThreadKind::LayerFlushThread, - Some(self.tenant_id), - Some(self.timeline_id), - "layer flush thread", - false, - move || self_clone.flush_frozen_layers(false), - )?; + + // Launch a thread to flush the frozen layer to disk, unless + // a thread was already running. 
(If the thread was running + // at the time that we froze the layer, it must've seen the + // the layer we just froze before it exited; see comments + // in flush_frozen_layers()) + if let Ok(guard) = self.layer_flush_lock.try_lock() { + drop(guard); + let self_clone = Arc::clone(self); + thread_mgr::spawn( + thread_mgr::ThreadKind::LayerFlushThread, + Some(self.tenant_id), + Some(self.timeline_id), + "layer flush thread", + false, + move || self_clone.flush_frozen_layers(false), + )?; + } } Ok(()) } From 5914aab78aa54daa889abab9ae41db358158bd71 Mon Sep 17 00:00:00 2001 From: Dmitry Rodionov Date: Wed, 18 May 2022 21:16:14 +0300 Subject: [PATCH 267/296] add comments, use expect instead of unwrap --- .../src/layered_repository/disk_btree.rs | 33 +++++++++++++++---- 1 file changed, 26 insertions(+), 7 deletions(-) diff --git a/pageserver/src/layered_repository/disk_btree.rs b/pageserver/src/layered_repository/disk_btree.rs index e747192d96..0c9ad75048 100644 --- a/pageserver/src/layered_repository/disk_btree.rs +++ b/pageserver/src/layered_repository/disk_btree.rs @@ -444,6 +444,13 @@ where /// /// stack[0] is the current root page, stack.last() is the leaf. /// + /// We maintain the length of the stack to be always greater than zero. + /// Two exceptions are: + /// 1. `Self::flush_node`. The method will push the new node if it extracted the last one. + /// So because other methods cannot see the intermediate state invariant still holds. + /// 2. `Self::finish`. It consumes self and does not return it back, + /// which means that this is where the structure is destroyed. + /// Thus stack of zero length cannot be observed by other methods. stack: Vec>, /// Last key that was appended to the tree. 
Used to sanity check that append @@ -482,7 +489,10 @@ where fn append_internal(&mut self, key: &[u8; L], value: Value) -> Result<()> { // Try to append to the current leaf buffer - let last = self.stack.last_mut().unwrap(); + let last = self + .stack + .last_mut() + .expect("should always have at least one item"); let level = last.level; if last.push(key, value) { return Ok(()); @@ -512,19 +522,25 @@ where Ok(()) } + /// Flush the bottommost node in the stack to disk. Appends a downlink to its parent, + /// and recursively flushes the parent too, if it becomes full. If the root page becomes full, + /// creates a new root page, increasing the height of the tree. fn flush_node(&mut self) -> Result<()> { - let last = self.stack.pop().unwrap(); + // Get the current bottommost node in the stack and flush it to disk. + let last = self + .stack + .pop() + .expect("should always have at least one item"); let buf = last.pack(); let downlink_key = last.first_key(); let downlink_ptr = self.writer.write_blk(buf)?; - // Append the downlink to the parent + // Append the downlink to the parent. If there is no parent, ie. this was the root page, + // create a new root page, increasing the height of the tree. 
if self.stack.is_empty() { self.stack.push(BuildNode::new(last.level + 1)); } - self.append_internal(&downlink_key, Value::from_blknum(downlink_ptr))?; - - Ok(()) + self.append_internal(&downlink_key, Value::from_blknum(downlink_ptr)) } /// @@ -540,7 +556,10 @@ where self.flush_node()?; } - let root = self.stack.first().unwrap(); + let root = self + .stack + .first() + .expect("by the check above we left one item there"); let buf = root.pack(); let root_blknum = self.writer.write_blk(buf)?; From bd2979d02cfafa84180290f1c3986ad5d3eb33de Mon Sep 17 00:00:00 2001 From: Egor Suvorov Date: Tue, 10 May 2022 17:06:03 +0300 Subject: [PATCH 268/296] CirleCI/check-codestyle-python: print versions --- .circleci/config.yml | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/.circleci/config.yml b/.circleci/config.yml index 85ac905f0b..60a1cfea14 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -222,6 +222,12 @@ jobs: key: v2-python-deps-{{ checksum "poetry.lock" }} paths: - /home/circleci/.cache/pypoetry/virtualenvs + - run: + name: Print versions + when: always + command: | + poetry run python --version + poetry show - run: name: Run yapf to ensure code format when: always From 7dd27ecd20c179d176880998db8ce9a1f1f56c61 Mon Sep 17 00:00:00 2001 From: Egor Suvorov Date: Tue, 10 May 2022 17:08:33 +0300 Subject: [PATCH 269/296] Bump minimal supported Python version to 3.9 Most of the CI already run with Python 3.9 since https://github.com/neondatabase/docker-images/pull/1 --- README.md | 3 +-- docs/sourcetree.md | 8 ++++---- pyproject.toml | 2 +- 3 files changed, 6 insertions(+), 7 deletions(-) diff --git a/README.md b/README.md index 39cbd2a222..d5dccb7724 100644 --- a/README.md +++ b/README.md @@ -51,7 +51,6 @@ cd neon make -j5 ``` - #### building on OSX (12.3.1) 1. 
Install XCode ``` @@ -82,7 +81,7 @@ make -j5 To run the `psql` client, install the `postgresql-client` package or modify `PATH` and `LD_LIBRARY_PATH` to include `tmp_install/bin` and `tmp_install/lib`, respectively. To run the integration tests or Python scripts (not required to use the code), install -Python (3.7 or higher), and install python3 packages using `./scripts/pysync` (requires poetry) in the project directory. +Python (3.9 or higher), and install python3 packages using `./scripts/pysync` (requires poetry) in the project directory. #### running neon database diff --git a/docs/sourcetree.md b/docs/sourcetree.md index 5ddc6208d2..81e0f2fe88 100644 --- a/docs/sourcetree.md +++ b/docs/sourcetree.md @@ -91,18 +91,18 @@ so manual installation of dependencies is not recommended. A single virtual environment with all dependencies is described in the single `Pipfile`. ### Prerequisites -- Install Python 3.7 (the minimal supported version) or greater. +- Install Python 3.9 (the minimal supported version) or greater. - Our setup with poetry should work with newer python versions too. So feel free to open an issue with a `c/test-runner` label if something doesnt work as expected. - - If you have some trouble with other version you can resolve it by installing Python 3.7 separately, via pyenv or via system package manager e.g.: + - If you have some trouble with other version you can resolve it by installing Python 3.9 separately, via pyenv or via system package manager e.g.: ```bash # In Ubuntu sudo add-apt-repository ppa:deadsnakes/ppa sudo apt update - sudo apt install python3.7 + sudo apt install python3.9 ``` - Install `poetry` - Exact version of `poetry` is not important, see installation instructions available at poetry's [website](https://python-poetry.org/docs/#installation)`. -- Install dependencies via `./scripts/pysync`. Note that CI uses Python 3.7 so if you have different version some linting tools can yield different result locally vs in the CI. 
+- Install dependencies via `./scripts/pysync`. Note that CI uses Python 3.9 so if you have different version some linting tools can yield different result locally vs in the CI. Run `poetry shell` to activate the virtual environment. Alternatively, use `poetry run` to run a single command in the venv, e.g. `poetry run pytest`. diff --git a/pyproject.toml b/pyproject.toml index b70eb19009..def55f6671 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -5,7 +5,7 @@ description = "" authors = [] [tool.poetry.dependencies] -python = "^3.7" +python = "^3.9" pytest = "^6.2.5" psycopg2-binary = "^2.9.1" typing-extensions = "^3.10.0" From fab104d5f32f3373c29d7764c37830b712f954c3 Mon Sep 17 00:00:00 2001 From: Egor Suvorov Date: Tue, 10 May 2022 17:11:31 +0300 Subject: [PATCH 270/296] docs/sourcetree: add note about exact Python version used and how to choose it --- docs/sourcetree.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/docs/sourcetree.md b/docs/sourcetree.md index 81e0f2fe88..c8d4baff62 100644 --- a/docs/sourcetree.md +++ b/docs/sourcetree.md @@ -93,7 +93,7 @@ A single virtual environment with all dependencies is described in the single `P ### Prerequisites - Install Python 3.9 (the minimal supported version) or greater. - Our setup with poetry should work with newer python versions too. So feel free to open an issue with a `c/test-runner` label if something doesnt work as expected. 
- - If you have some trouble with other version you can resolve it by installing Python 3.9 separately, via pyenv or via system package manager e.g.: + - If you have some trouble with other version you can resolve it by installing Python 3.9 separately, via [pyenv](https://github.com/pyenv/pyenv) or via system package manager e.g.: ```bash # In Ubuntu sudo add-apt-repository ppa:deadsnakes/ppa @@ -102,7 +102,11 @@ A single virtual environment with all dependencies is described in the single `P ``` - Install `poetry` - Exact version of `poetry` is not important, see installation instructions available at poetry's [website](https://python-poetry.org/docs/#installation)`. -- Install dependencies via `./scripts/pysync`. Note that CI uses Python 3.9 so if you have different version some linting tools can yield different result locally vs in the CI. +- Install dependencies via `./scripts/pysync`. + - Note that CI uses specific Python version (look for `PYTHON_VERSION` [here](https://github.com/neondatabase/docker-images/blob/main/rust/Dockerfile)) + so if you have different version some linting tools can yield different result locally vs in the CI. + - You can explicitly specify which Python to use by running `poetry env use /path/to/python`, e.g. `poetry env use python3.9`. + This may also disable the `The currently activated Python version X.Y.Z is not supported by the project` warning. Run `poetry shell` to activate the virtual environment. Alternatively, use `poetry run` to run a single command in the venv, e.g. `poetry run pytest`. 
From c1b365fdf7f56cf05d84c7b095bebc12101a1c12 Mon Sep 17 00:00:00 2001 From: Anastasia Lubennikova Date: Wed, 18 May 2022 14:29:01 +0300 Subject: [PATCH 271/296] Use temp filename while writing ImageLayer file --- .../src/layered_repository/delta_layer.rs | 24 ++++++++--- .../src/layered_repository/image_layer.rs | 42 +++++++++++++++---- 2 files changed, 53 insertions(+), 13 deletions(-) diff --git a/pageserver/src/layered_repository/delta_layer.rs b/pageserver/src/layered_repository/delta_layer.rs index 1c48f3def5..855e2a9172 100644 --- a/pageserver/src/layered_repository/delta_layer.rs +++ b/pageserver/src/layered_repository/delta_layer.rs @@ -420,6 +420,21 @@ impl DeltaLayer { } } + fn temp_path_for( + conf: &PageServerConf, + timelineid: ZTimelineId, + tenantid: ZTenantId, + key_start: Key, + lsn_range: Range, + ) -> PathBuf { + conf.timeline_path(&timelineid, &tenantid).join(format!( + "{}-XXX__{:016X}-{:016X}.temp", + key_start, + u64::from(lsn_range.start), + u64::from(lsn_range.end) + )) + } + /// /// Open the underlying file and read the metadata into memory, if it's /// not loaded already. @@ -607,12 +622,9 @@ impl DeltaLayerWriter { // // Note: This overwrites any existing file. There shouldn't be any. // FIXME: throw an error instead? 
- let path = conf.timeline_path(&timelineid, &tenantid).join(format!( - "{}-XXX__{:016X}-{:016X}.temp", - key_start, - u64::from(lsn_range.start), - u64::from(lsn_range.end) - )); + let path = + DeltaLayer::temp_path_for(conf, timelineid, tenantid, key_start, lsn_range.clone()); + let mut file = VirtualFile::create(&path)?; // make room for the header block file.seek(SeekFrom::Start(PAGE_SZ as u64))?; diff --git a/pageserver/src/layered_repository/image_layer.rs b/pageserver/src/layered_repository/image_layer.rs index c0c8e7789a..0a7cd2cdba 100644 --- a/pageserver/src/layered_repository/image_layer.rs +++ b/pageserver/src/layered_repository/image_layer.rs @@ -241,6 +241,20 @@ impl ImageLayer { } } + fn temp_path_for( + path_or_conf: &PathOrConf, + timelineid: ZTimelineId, + tenantid: ZTenantId, + fname: &ImageFileName, + ) -> PathBuf { + match path_or_conf { + PathOrConf::Path(path) => path.to_path_buf(), + PathOrConf::Conf(conf) => conf + .timeline_path(&timelineid, &tenantid) + .join(format!("{}.temp", fname)), + } + } + /// /// Open the underlying file and read the metadata into memory, if it's /// not loaded already. @@ -398,7 +412,7 @@ impl ImageLayer { /// pub struct ImageLayerWriter { conf: &'static PageServerConf, - _path: PathBuf, + path: PathBuf, timelineid: ZTimelineId, tenantid: ZTenantId, key_range: Range, @@ -416,11 +430,9 @@ impl ImageLayerWriter { key_range: &Range, lsn: Lsn, ) -> anyhow::Result { - // Create the file - // - // Note: This overwrites any existing file. There shouldn't be any. - // FIXME: throw an error instead? - let path = ImageLayer::path_for( + // Create the file initially with a temporary filename. + // We'll atomically rename it to the final name when we're done. 
+ let path = ImageLayer::temp_path_for( &PathOrConf::Conf(conf), timelineid, tenantid, @@ -441,7 +453,7 @@ impl ImageLayerWriter { let writer = ImageLayerWriter { conf, - _path: path, + path, timelineid, tenantid, key_range: key_range.clone(), @@ -512,6 +524,22 @@ impl ImageLayerWriter { index_root_blk, }), }; + + // Rename the file to its final name + // + // Note: This overwrites any existing file. There shouldn't be any. + // FIXME: throw an error instead? + let final_path = ImageLayer::path_for( + &PathOrConf::Conf(self.conf), + self.timelineid, + self.tenantid, + &ImageFileName { + key_range: self.key_range.clone(), + lsn: self.lsn, + }, + ); + std::fs::rename(self.path, &final_path)?; + trace!("created image layer {}", layer.path().display()); Ok(layer) From 3da4b3165ef4056f72e0fb84bd4fd24669526c15 Mon Sep 17 00:00:00 2001 From: Anastasia Lubennikova Date: Wed, 18 May 2022 18:06:33 +0300 Subject: [PATCH 272/296] Fsync layer files before rename --- pageserver/src/layered_repository/delta_layer.rs | 7 ++++--- pageserver/src/layered_repository/image_layer.rs | 3 +++ 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/pageserver/src/layered_repository/delta_layer.rs b/pageserver/src/layered_repository/delta_layer.rs index 855e2a9172..3484e6bd0f 100644 --- a/pageserver/src/layered_repository/delta_layer.rs +++ b/pageserver/src/layered_repository/delta_layer.rs @@ -425,7 +425,7 @@ impl DeltaLayer { timelineid: ZTimelineId, tenantid: ZTenantId, key_start: Key, - lsn_range: Range, + lsn_range: &Range, ) -> PathBuf { conf.timeline_path(&timelineid, &tenantid).join(format!( "{}-XXX__{:016X}-{:016X}.temp", @@ -622,8 +622,7 @@ impl DeltaLayerWriter { // // Note: This overwrites any existing file. There shouldn't be any. // FIXME: throw an error instead? 
- let path = - DeltaLayer::temp_path_for(conf, timelineid, tenantid, key_start, lsn_range.clone()); + let path = DeltaLayer::temp_path_for(conf, timelineid, tenantid, key_start, &lsn_range); let mut file = VirtualFile::create(&path)?; // make room for the header block @@ -717,6 +716,8 @@ impl DeltaLayerWriter { }), }; + // fsync the file + file.sync_all()?; // Rename the file to its final name // // Note: This overwrites any existing file. There shouldn't be any. diff --git a/pageserver/src/layered_repository/image_layer.rs b/pageserver/src/layered_repository/image_layer.rs index 0a7cd2cdba..5e97366da9 100644 --- a/pageserver/src/layered_repository/image_layer.rs +++ b/pageserver/src/layered_repository/image_layer.rs @@ -525,6 +525,9 @@ impl ImageLayerWriter { }), }; + // fsync the file + file.sync_all()?; + // Rename the file to its final name // // Note: This overwrites any existing file. There shouldn't be any. From 4c30ae8ba32f45d90d870dbf926965237ddd3c7f Mon Sep 17 00:00:00 2001 From: Anastasia Lubennikova Date: Wed, 18 May 2022 22:29:13 +0300 Subject: [PATCH 273/296] Add random string as a part of tempfile name --- .../src/layered_repository/delta_layer.rs | 12 ++++++++++-- .../src/layered_repository/image_layer.rs | 19 +++++++++++-------- 2 files changed, 21 insertions(+), 10 deletions(-) diff --git a/pageserver/src/layered_repository/delta_layer.rs b/pageserver/src/layered_repository/delta_layer.rs index 3484e6bd0f..ed342c0cca 100644 --- a/pageserver/src/layered_repository/delta_layer.rs +++ b/pageserver/src/layered_repository/delta_layer.rs @@ -37,6 +37,7 @@ use crate::virtual_file::VirtualFile; use crate::walrecord; use crate::{DELTA_FILE_MAGIC, STORAGE_FORMAT_VERSION}; use anyhow::{bail, ensure, Context, Result}; +use rand::{distributions::Alphanumeric, Rng}; use serde::{Deserialize, Serialize}; use std::fs; use std::io::{BufWriter, Write}; @@ -427,11 +428,18 @@ impl DeltaLayer { key_start: Key, lsn_range: &Range, ) -> PathBuf { + let rand_string: String 
= rand::thread_rng() + .sample_iter(&Alphanumeric) + .take(8) + .map(char::from) + .collect(); + conf.timeline_path(&timelineid, &tenantid).join(format!( - "{}-XXX__{:016X}-{:016X}.temp", + "{}-XXX__{:016X}-{:016X}.{}.temp", key_start, u64::from(lsn_range.start), - u64::from(lsn_range.end) + u64::from(lsn_range.end), + rand_string )) } diff --git a/pageserver/src/layered_repository/image_layer.rs b/pageserver/src/layered_repository/image_layer.rs index 5e97366da9..905023ecf9 100644 --- a/pageserver/src/layered_repository/image_layer.rs +++ b/pageserver/src/layered_repository/image_layer.rs @@ -34,6 +34,7 @@ use crate::{IMAGE_FILE_MAGIC, STORAGE_FORMAT_VERSION}; use anyhow::{bail, ensure, Context, Result}; use bytes::Bytes; use hex; +use rand::{distributions::Alphanumeric, Rng}; use serde::{Deserialize, Serialize}; use std::fs; use std::io::Write; @@ -242,17 +243,19 @@ impl ImageLayer { } fn temp_path_for( - path_or_conf: &PathOrConf, + conf: &PageServerConf, timelineid: ZTimelineId, tenantid: ZTenantId, fname: &ImageFileName, ) -> PathBuf { - match path_or_conf { - PathOrConf::Path(path) => path.to_path_buf(), - PathOrConf::Conf(conf) => conf - .timeline_path(&timelineid, &tenantid) - .join(format!("{}.temp", fname)), - } + let rand_string: String = rand::thread_rng() + .sample_iter(&Alphanumeric) + .take(8) + .map(char::from) + .collect(); + + conf.timeline_path(&timelineid, &tenantid) + .join(format!("{}.{}.temp", fname, rand_string)) } /// @@ -433,7 +436,7 @@ impl ImageLayerWriter { // Create the file initially with a temporary filename. // We'll atomically rename it to the final name when we're done. 
let path = ImageLayer::temp_path_for( - &PathOrConf::Conf(conf), + conf, timelineid, tenantid, &ImageFileName { From cbd00d7ed91e4b4cd95d3e2e40b16a06e73613ff Mon Sep 17 00:00:00 2001 From: Anastasia Lubennikova Date: Wed, 18 May 2022 23:46:38 +0300 Subject: [PATCH 274/296] Remove temp layer files during timeline initialization on pageserver start --- pageserver/src/storage_sync.rs | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/pageserver/src/storage_sync.rs b/pageserver/src/storage_sync.rs index 39459fafc6..bbebcd1f36 100644 --- a/pageserver/src/storage_sync.rs +++ b/pageserver/src/storage_sync.rs @@ -421,6 +421,14 @@ fn collect_timeline_files( entry_path.display() ) })?; + } else if entry_path.extension().and_then(OsStr::to_str) == Some("temp") { + info!("removing temp layer file at {}", entry_path.display()); + std::fs::remove_file(&entry_path).with_context(|| { + format!( + "failed to remove temp layer file at {}", + entry_path.display() + ) + })?; } else { timeline_files.insert(entry_path); } From 0da4046704d5c5f100a81915e68098f7c8e486f7 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Thu, 19 May 2022 00:53:28 +0300 Subject: [PATCH 275/296] Include traversal path in error message. Previously, the path was printed to the log with separate error!() calls. It's better to include the whole path in the error object and have it printed to the log as one message. Also print the path in the ValueReconstructResult::Missing case. 
This is what it looks like now: 2022-05-17T21:53:53.611801Z ERROR pagestream{timeline=5adcb4af3e95f00a31550d266aab7a37 tenant=74d9f9ad3293c030c6a6e196dd91c60f}: error reading relation or page version: could not find data for key 000000067F000032BE000000000000000001 at LSN 0/1698C48, for request at LSN 0/1698CF8 Caused by: 0: layer traversal: result Complete, cont_lsn 0/1698C48, layer: 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001698C48-0000000001698CC1 1: layer traversal: result Continue, cont_lsn 0/1698CC1, layer: inmem-0000000001698CC1-FFFFFFFFFFFFFFFF Stack backtrace: --- pageserver/src/layered_repository.rs | 72 ++++++++++++++++++---------- 1 file changed, 46 insertions(+), 26 deletions(-) diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index bad2e32cc2..79e66e5f17 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -1357,7 +1357,9 @@ impl LayeredTimeline { let mut timeline_owned; let mut timeline = self; - let mut path: Vec<(ValueReconstructResult, Lsn, Arc)> = Vec::new(); + // For debugging purposes, collect the path of layers that we traversed + // through. It's included in the error message if we fail to find the key. + let mut traversal_path: Vec<(ValueReconstructResult, Lsn, Arc)> = Vec::new(); let cached_lsn = if let Some((cached_lsn, _)) = &reconstruct_state.img { *cached_lsn @@ -1387,32 +1389,24 @@ impl LayeredTimeline { if prev_lsn <= cont_lsn { // Didn't make any progress in last iteration. Error out to avoid // getting stuck in the loop. - - // For debugging purposes, print the path of layers that we traversed - // through. 
- for (r, c, l) in path { - error!( - "PATH: result {:?}, cont_lsn {}, layer: {}", - r, - c, - l.filename().display() - ); - } - bail!("could not find layer with more data for key {} at LSN {}, request LSN {}, ancestor {}", - key, - Lsn(cont_lsn.0 - 1), - request_lsn, - timeline.ancestor_lsn) + return layer_traversal_error(format!( + "could not find layer with more data for key {} at LSN {}, request LSN {}, ancestor {}", + key, + Lsn(cont_lsn.0 - 1), + request_lsn, + timeline.ancestor_lsn + ), traversal_path); } prev_lsn = cont_lsn; } ValueReconstructResult::Missing => { - bail!( - "could not find data for key {} at LSN {}, for request at LSN {}", - key, - cont_lsn, - request_lsn - ) + return layer_traversal_error( + format!( + "could not find data for key {} at LSN {}, for request at LSN {}", + key, cont_lsn, request_lsn + ), + traversal_path, + ); } } @@ -1447,7 +1441,7 @@ impl LayeredTimeline { reconstruct_state, )?; cont_lsn = lsn_floor; - path.push((result, cont_lsn, open_layer.clone())); + traversal_path.push((result, cont_lsn, open_layer.clone())); continue; } } @@ -1462,7 +1456,7 @@ impl LayeredTimeline { reconstruct_state, )?; cont_lsn = lsn_floor; - path.push((result, cont_lsn, frozen_layer.clone())); + traversal_path.push((result, cont_lsn, frozen_layer.clone())); continue 'outer; } } @@ -1477,7 +1471,7 @@ impl LayeredTimeline { reconstruct_state, )?; cont_lsn = lsn_floor; - path.push((result, cont_lsn, layer)); + traversal_path.push((result, cont_lsn, layer)); } else if timeline.ancestor_timeline.is_some() { // Nothing on this timeline. Traverse to parent result = ValueReconstructResult::Continue; @@ -2375,6 +2369,32 @@ impl LayeredTimeline { } } +/// Helper function for get_reconstruct_data() to add the path of layers traversed +/// to an error, as anyhow context information. +fn layer_traversal_error( + msg: String, + path: Vec<(ValueReconstructResult, Lsn, Arc)>, +) -> anyhow::Result<()> { + // We want the original 'msg' to be the outermost context. 
The outermost context + // is the most high-level information, which also gets propagated to the client. + let mut msg_iter = path + .iter() + .map(|(r, c, l)| { + format!( + "layer traversal: result {:?}, cont_lsn {}, layer: {}", + r, + c, + l.filename().display() + ) + }) + .chain(std::iter::once(msg)); + // Construct initial message from the first traversed layer + let err = anyhow!(msg_iter.next().unwrap()); + + // Append all subsequent traversals, and the error message 'msg', as contexts. + Err(msg_iter.fold(err, |err, msg| err.context(msg))) +} + struct LayeredTimelineWriter<'a> { tl: &'a LayeredTimeline, _write_guard: MutexGuard<'a, ()>, From ee3bcf108d0ed1c1442c22182dcaaa1a6c518df4 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Thu, 19 May 2022 00:53:33 +0300 Subject: [PATCH 276/296] Fix compact_level0 for delta layers with overlap or gaps We saw a case in staging, where there was a gap in the LSN ranges of level 0 files, like this: 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016960E9-00000000016E4DB9 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016E4DB9-000000000BFCE3E1 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000000BFCE3E1-000000000BFD0FE9 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000060045901-000000007005EAC1 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000007005EAC1-0000000080062E99 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000080062E99-000000009007F481 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000009007F481-00000000A009F7C9 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000A009F7C9-00000000AA284EB9 
000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000AA286471-00000000AA2886B9 Note that gap between 000000000BFD0FE9 and 0000000060045901. I don't know how that happened, but in general the pageserver should be robust if there are gaps like that, or overlapping files etc. In theory they could happen as result of crashes, partial downloads from S3 etc., although it is mystery what caused it in this case. Looking at the compaction code, it was not safe in the face of gaps like that. The compaction routine collected all the level 0 files, and took their min(start)..max(end) as the range of the new files it builds. That's wrong, if the level 0 files don't cover the whole LSN range; the newly created files will miss any records in the gap. Fix that, by only collecting contiguous sequences of level 0 files, so that the end LSN of previous delta file is equal to the start of the next one. Fixes issue #1730 --- pageserver/src/layered_repository.rs | 106 +++++++++++++++++++-------- 1 file changed, 76 insertions(+), 30 deletions(-) diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index 79e66e5f17..fc4ab942f6 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -18,7 +18,7 @@ use itertools::Itertools; use lazy_static::lazy_static; use tracing::*; -use std::cmp::{max, min, Ordering}; +use std::cmp::{max, Ordering}; use std::collections::hash_map::Entry; use std::collections::HashMap; use std::collections::{BTreeSet, HashSet}; @@ -1946,41 +1946,87 @@ impl LayeredTimeline { Ok(new_path) } + /// + /// Collect a bunch of Level 0 layer files, and compact and reshuffle them as + /// as Level 1 files. 
+ /// fn compact_level0(&self, target_file_size: u64) -> Result<()> { let layers = self.layers.read().unwrap(); - - let level0_deltas = layers.get_level0_deltas()?; - - // We compact or "shuffle" the level-0 delta layers when they've - // accumulated over the compaction threshold. - if level0_deltas.len() < self.get_compaction_threshold() { - return Ok(()); - } + let mut level0_deltas = layers.get_level0_deltas()?; drop(layers); - // FIXME: this function probably won't work correctly if there's overlap - // in the deltas. - let lsn_range = level0_deltas - .iter() - .map(|l| l.get_lsn_range()) - .reduce(|a, b| min(a.start, b.start)..max(a.end, b.end)) - .unwrap(); + // Only compact if enough layers have accumulated. + if level0_deltas.is_empty() || level0_deltas.len() < self.get_compaction_threshold() { + return Ok(()); + } - let all_values_iter = level0_deltas.iter().map(|l| l.iter()).kmerge_by(|a, b| { - if let Ok((a_key, a_lsn, _)) = a { - if let Ok((b_key, b_lsn, _)) = b { - match a_key.cmp(b_key) { - Ordering::Less => true, - Ordering::Equal => a_lsn <= b_lsn, - Ordering::Greater => false, + // Gather the files to compact in this iteration. + // + // Start with the oldest Level 0 delta file, and collect any other + // level 0 files that form a contiguous sequence, such that the end + // LSN of previous file matches the start LSN of the next file. + // + // Note that if the files don't form such a sequence, we might + // "compact" just a single file. That's a bit pointless, but it allows + // us to get rid of the level 0 file, and compact the other files on + // the next iteration. This could probably made smarter, but such + // "gaps" in the sequence of level 0 files should only happen in case + // of a crash, partial download from cloud storage, or something like + // that, so it's not a big deal in practice. 
+ level0_deltas.sort_by_key(|l| l.get_lsn_range().start); + let mut level0_deltas_iter = level0_deltas.iter(); + + let first_level0_delta = level0_deltas_iter.next().unwrap(); + let mut prev_lsn_end = first_level0_delta.get_lsn_range().end; + let mut deltas_to_compact = vec![Arc::clone(first_level0_delta)]; + for l in level0_deltas_iter { + let lsn_range = l.get_lsn_range(); + + if lsn_range.start != prev_lsn_end { + break; + } + deltas_to_compact.push(Arc::clone(l)); + prev_lsn_end = lsn_range.end; + } + let lsn_range = Range { + start: deltas_to_compact.first().unwrap().get_lsn_range().start, + end: deltas_to_compact.last().unwrap().get_lsn_range().end, + }; + + info!( + "Starting Level0 compaction in LSN range {}-{} for {} layers ({} deltas in total)", + lsn_range.start, + lsn_range.end, + deltas_to_compact.len(), + level0_deltas.len() + ); + for l in deltas_to_compact.iter() { + info!("compact includes {}", l.filename().display()); + } + // We don't need the original list of layers anymore. Drop it so that + // we don't accidentally use it later in the function. + drop(level0_deltas); + + // This iterator walks through all key-value pairs from all the layers + // we're compacting, in key, LSN order. + let all_values_iter = deltas_to_compact + .iter() + .map(|l| l.iter()) + .kmerge_by(|a, b| { + if let Ok((a_key, a_lsn, _)) = a { + if let Ok((b_key, b_lsn, _)) = b { + match a_key.cmp(b_key) { + Ordering::Less => true, + Ordering::Equal => a_lsn <= b_lsn, + Ordering::Greater => false, + } + } else { + false } } else { - false + true } - } else { - true - } - }); + }); // Merge the contents of all the input delta layers into a new set // of delta layers, based on the current partitioning. 
@@ -2046,8 +2092,8 @@ impl LayeredTimeline { // Now that we have reshuffled the data to set of new delta layers, we can // delete the old ones - let mut layer_paths_do_delete = HashSet::with_capacity(level0_deltas.len()); - for l in level0_deltas { + let mut layer_paths_do_delete = HashSet::with_capacity(deltas_to_compact.len()); + for l in deltas_to_compact { l.delete()?; if let Some(path) = l.local_path() { layer_paths_do_delete.insert(path); From baf7a81dceaa68d634a96b4833bec2fc6999b5ce Mon Sep 17 00:00:00 2001 From: Egor Suvorov Date: Thu, 19 May 2022 13:01:03 +0200 Subject: [PATCH 277/296] git-upload: pass committer to 'git rebase' (fix #1749) (#1750) No committer was specified, which resulted in failing `git rebase` if the branch is not up-to-date. --- scripts/git-upload | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/scripts/git-upload b/scripts/git-upload index 4649f6998d..a53987894a 100755 --- a/scripts/git-upload +++ b/scripts/git-upload @@ -80,12 +80,14 @@ class GitRepo: print('No changes detected, quitting') return - run([ + git_with_user = [ 'git', '-c', 'user.name=vipvap', '-c', 'user.email=vipvap@zenith.tech', + ] + run(git_with_user + [ 'commit', '--author="vipvap "', f'--message={message}', @@ -94,7 +96,7 @@ class GitRepo: for _ in range(5): try: run(['git', 'fetch', 'origin', branch]) - run(['git', 'rebase', f'origin/{branch}']) + run(git_with_user + ['rebase', f'origin/{branch}']) run(['git', 'push', 'origin', branch]) return From ffbb9dd1553288641a59622693eb68bf99205cee Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Thu, 19 May 2022 10:24:50 +0300 Subject: [PATCH 278/296] Add a 5 minute timeout to python tests. The CI times out after 10 minutes of no output. It's annoying if a test hangs and is killed by the CI timeout, because you don't get information about which test was running. Try to avoid that, by adding a slightly smaller timeout in pytest itself. 
You can override it on a per-test basis if needed, but let's try to keep our tests shorter than that. For the Postgres regression tests, use a longer 30 minute timeout. They're not really a single test, but many tests wrapped in a single pytest test. It's OK for them to run longer in aggregate, each Postgres test is still fairly short. --- poetry.lock | 17 ++++++++++++++++- pyproject.toml | 1 + pytest.ini | 1 + test_runner/batch_pg_regress/test_isolation.py | 5 ++++- test_runner/batch_pg_regress/test_pg_regress.py | 5 ++++- test_runner/performance/test_startup.py | 4 +++- 6 files changed, 29 insertions(+), 4 deletions(-) diff --git a/poetry.lock b/poetry.lock index aa1e91c606..a69f482776 100644 --- a/poetry.lock +++ b/poetry.lock @@ -1094,6 +1094,17 @@ python-versions = "*" [package.dependencies] pytest = ">=3.2.5" +[[package]] +name = "pytest-timeout" +version = "2.1.0" +description = "pytest plugin to abort hanging tests" +category = "main" +optional = false +python-versions = ">=3.6" + +[package.dependencies] +pytest = ">=5.0.0" + [[package]] name = "pytest-xdist" version = "2.5.0" @@ -1387,7 +1398,7 @@ testing = ["pytest (>=6)", "pytest-checkdocs (>=2.4)", "pytest-flake8", "pytest- [metadata] lock-version = "1.1" python-versions = "^3.7" -content-hash = "d2fcba2af0a32cde3a1d0c8cfdfe5fb26531599b0c8c376bf16e200a74b55553" +content-hash = "4ee85b435461dec70b406bf7170302fe54e9e247bdf628a9cb6b5fb9eb9afd82" [metadata.files] aiopg = [ @@ -1889,6 +1900,10 @@ pytest-lazy-fixture = [ {file = "pytest-lazy-fixture-0.6.3.tar.gz", hash = "sha256:0e7d0c7f74ba33e6e80905e9bfd81f9d15ef9a790de97993e34213deb5ad10ac"}, {file = "pytest_lazy_fixture-0.6.3-py3-none-any.whl", hash = "sha256:e0b379f38299ff27a653f03eaa69b08a6fd4484e46fd1c9907d984b9f9daeda6"}, ] +pytest-timeout = [ + {file = "pytest-timeout-2.1.0.tar.gz", hash = "sha256:c07ca07404c612f8abbe22294b23c368e2e5104b521c1790195561f37e1ac3d9"}, + {file = "pytest_timeout-2.1.0-py3-none-any.whl", hash = 
"sha256:f6f50101443ce70ad325ceb4473c4255e9d74e3c7cd0ef827309dfa4c0d975c6"}, +] pytest-xdist = [ {file = "pytest-xdist-2.5.0.tar.gz", hash = "sha256:4580deca3ff04ddb2ac53eba39d76cb5dd5edeac050cb6fbc768b0dd712b4edf"}, {file = "pytest_xdist-2.5.0-py3-none-any.whl", hash = "sha256:6fe5c74fec98906deb8f2d2b616b5c782022744978e7bd4695d39c8f42d0ce65"}, diff --git a/pyproject.toml b/pyproject.toml index def55f6671..c965535049 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -24,6 +24,7 @@ moto = {version = "^3.0.0", extras = ["server"]} backoff = "^1.11.1" pytest-lazy-fixture = "^0.6.3" prometheus-client = "^0.14.1" +pytest-timeout = "^2.1.0" [tool.poetry.dev-dependencies] yapf = "==0.31.0" diff --git a/pytest.ini b/pytest.ini index abc69b765b..da9ab8c12f 100644 --- a/pytest.ini +++ b/pytest.ini @@ -9,3 +9,4 @@ minversion = 6.0 log_format = %(asctime)s.%(msecs)-3d %(levelname)s [%(filename)s:%(lineno)d] %(message)s log_date_format = %Y-%m-%d %H:%M:%S log_cli = true +timeout = 300 diff --git a/test_runner/batch_pg_regress/test_isolation.py b/test_runner/batch_pg_regress/test_isolation.py index cde56d9b88..7c99c04fe3 100644 --- a/test_runner/batch_pg_regress/test_isolation.py +++ b/test_runner/batch_pg_regress/test_isolation.py @@ -1,9 +1,12 @@ import os - +import pytest from fixtures.utils import mkdir_if_needed from fixtures.zenith_fixtures import ZenithEnv, base_dir, pg_distrib_dir +# The isolation tests run for a long time, especially in debug mode, +# so use a larger-than-default timeout. 
+@pytest.mark.timeout(1800) def test_isolation(zenith_simple_env: ZenithEnv, test_output_dir, pg_bin, capsys): env = zenith_simple_env diff --git a/test_runner/batch_pg_regress/test_pg_regress.py b/test_runner/batch_pg_regress/test_pg_regress.py index 07d2574f4a..be7776113a 100644 --- a/test_runner/batch_pg_regress/test_pg_regress.py +++ b/test_runner/batch_pg_regress/test_pg_regress.py @@ -1,9 +1,12 @@ import os - +import pytest from fixtures.utils import mkdir_if_needed from fixtures.zenith_fixtures import ZenithEnv, check_restored_datadir_content, base_dir, pg_distrib_dir +# The pg_regress tests run for a long time, especially in debug mode, +# so use a larger-than-default timeout. +@pytest.mark.timeout(1800) def test_pg_regress(zenith_simple_env: ZenithEnv, test_output_dir: str, pg_bin, capsys): env = zenith_simple_env diff --git a/test_runner/performance/test_startup.py b/test_runner/performance/test_startup.py index e30912ce32..53b6a3a4fc 100644 --- a/test_runner/performance/test_startup.py +++ b/test_runner/performance/test_startup.py @@ -1,9 +1,11 @@ +import pytest from contextlib import closing - from fixtures.zenith_fixtures import ZenithEnvBuilder from fixtures.benchmark_fixture import ZenithBenchmarker +# This test sometimes runs for longer than the global 5 minute timeout. 
+@pytest.mark.timeout(600) def test_startup(zenith_env_builder: ZenithEnvBuilder, zenbenchmark: ZenithBenchmarker): zenith_env_builder.num_safekeepers = 3 env = zenith_env_builder.init_start() From a4aef5d8dc9666183e3968031952cb511cf918ec Mon Sep 17 00:00:00 2001 From: bojanserafimov Date: Thu, 19 May 2022 12:25:31 -0400 Subject: [PATCH 279/296] Compile psql with openssl (#1725) --- Makefile | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Makefile b/Makefile index d2a79661f2..329742bf78 100644 --- a/Makefile +++ b/Makefile @@ -12,12 +12,12 @@ endif # BUILD_TYPE ?= debug ifeq ($(BUILD_TYPE),release) - PG_CONFIGURE_OPTS = --enable-debug + PG_CONFIGURE_OPTS = --enable-debug --with-openssl PG_CFLAGS = -O2 -g3 $(CFLAGS) # Unfortunately, `--profile=...` is a nightly feature CARGO_BUILD_FLAGS += --release else ifeq ($(BUILD_TYPE),debug) - PG_CONFIGURE_OPTS = --enable-debug --enable-cassert --enable-depend + PG_CONFIGURE_OPTS = --enable-debug --with-openssl --enable-cassert --enable-depend PG_CFLAGS = -O0 -g3 $(CFLAGS) else $(error Bad build type `$(BUILD_TYPE)', see Makefile for options) From 65cf1a3221a7535e2aece1b99d985f9a4fbfb3cf Mon Sep 17 00:00:00 2001 From: KlimentSerafimov Date: Fri, 20 May 2022 12:02:51 -0400 Subject: [PATCH 280/296] Added paths to openssl includes and libraries for OSX because make complained that it couldn't find them. 
(#1761) --- Makefile | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/Makefile b/Makefile index 329742bf78..5eca7fb094 100644 --- a/Makefile +++ b/Makefile @@ -23,6 +23,12 @@ else $(error Bad build type `$(BUILD_TYPE)', see Makefile for options) endif +# macOS with brew-installed openssl requires explicit paths +UNAME_S := $(shell uname -s) +ifeq ($(UNAME_S),Darwin) + PG_CONFIGURE_OPTS += --with-includes=/usr/local/opt/openssl/include --with-libraries=/usr/local/opt/openssl/lib +endif + # Choose whether we should be silent or verbose CARGO_BUILD_FLAGS += --$(if $(filter s,$(MAKEFLAGS)),quiet,verbose) # Fix for a corner case when make doesn't pass a jobserver From d97617ed3a59e78733752c410025b4e9a1ed614a Mon Sep 17 00:00:00 2001 From: Andrey Taranik Date: Fri, 20 May 2022 23:12:30 +0300 Subject: [PATCH 281/296] updated proxy and proxy scram deployment for prod and stress environments (#1758) --- .circleci/config.yml | 6 +++-- .../helm-values/neon-stress.proxy-scram.yaml | 26 +++++++++++++++++++ .../helm-values/production.proxy-scram.yaml | 24 +++++++++++++++++ .circleci/helm-values/production.proxy.yaml | 8 +----- 4 files changed, 55 insertions(+), 9 deletions(-) create mode 100644 .circleci/helm-values/neon-stress.proxy-scram.yaml create mode 100644 .circleci/helm-values/production.proxy-scram.yaml diff --git a/.circleci/config.yml b/.circleci/config.yml index 60a1cfea14..eb2bf0172b 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -640,7 +640,8 @@ jobs: name: Re-deploy proxy command: | DOCKER_TAG=$(git log --oneline|wc -l) - helm upgrade neon-stress-proxy neondatabase/neon-proxy --install -f .circleci/helm-values/neon-stress.proxy.yaml --set image.tag=${DOCKER_TAG} --wait + helm upgrade neon-stress-proxy neondatabase/neon-proxy --install -f .circleci/helm-values/neon-stress.proxy.yaml --set image.tag=${DOCKER_TAG} --wait + helm upgrade neon-stress-proxy-scram neondatabase/neon-proxy --install -f 
.circleci/helm-values/neon-stress.proxy-scram.yaml --set image.tag=${DOCKER_TAG} --wait deploy-release: docker: @@ -689,7 +690,8 @@ jobs: name: Re-deploy proxy command: | DOCKER_TAG="release-$(git log --oneline|wc -l)" - helm upgrade zenith-proxy zenithdb/zenith-proxy --install -f .circleci/helm-values/production.proxy.yaml --set image.tag=${DOCKER_TAG} --wait + helm upgrade neon-proxy neondatabase/neon-proxy --install -f .circleci/helm-values/production.proxy.yaml --set image.tag=${DOCKER_TAG} --wait + helm upgrade neon-proxy-scram neondatabase/neon-proxy --install -f .circleci/helm-values/production.proxy-scram.yaml --set image.tag=${DOCKER_TAG} --wait # Trigger a new remote CI job remote-ci-trigger: diff --git a/.circleci/helm-values/neon-stress.proxy-scram.yaml b/.circleci/helm-values/neon-stress.proxy-scram.yaml new file mode 100644 index 0000000000..8f55d31c87 --- /dev/null +++ b/.circleci/helm-values/neon-stress.proxy-scram.yaml @@ -0,0 +1,26 @@ +fullnameOverride: "neon-stress-proxy-scram" + +settings: + authBackend: "console" + authEndpoint: "http://neon-stress-console.local/management/api/v2" + domain: "*.stress.neon.tech" + +podLabels: + zenith_service: proxy-scram + zenith_env: staging + zenith_region: eu-west-1 + zenith_region_slug: ireland + +exposedService: + annotations: + service.beta.kubernetes.io/aws-load-balancer-type: external + service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip + service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing + external-dns.alpha.kubernetes.io/hostname: '*.stress.neon.tech' + +metrics: + enabled: true + serviceMonitor: + enabled: true + selector: + release: kube-prometheus-stack diff --git a/.circleci/helm-values/production.proxy-scram.yaml b/.circleci/helm-values/production.proxy-scram.yaml new file mode 100644 index 0000000000..54b0fbcd98 --- /dev/null +++ b/.circleci/helm-values/production.proxy-scram.yaml @@ -0,0 +1,24 @@ +settings: + authBackend: "console" + authEndpoint: 
"http://console-release.local/management/api/v2" + domain: "*.cloud.neon.tech" + +podLabels: + zenith_service: proxy-scram + zenith_env: production + zenith_region: us-west-2 + zenith_region_slug: oregon + +exposedService: + annotations: + service.beta.kubernetes.io/aws-load-balancer-type: external + service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip + service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing + external-dns.alpha.kubernetes.io/hostname: '*.cloud.neon.tech' + +metrics: + enabled: true + serviceMonitor: + enabled: true + selector: + release: kube-prometheus-stack diff --git a/.circleci/helm-values/production.proxy.yaml b/.circleci/helm-values/production.proxy.yaml index e13968a6a8..87c61c90cf 100644 --- a/.circleci/helm-values/production.proxy.yaml +++ b/.circleci/helm-values/production.proxy.yaml @@ -1,9 +1,3 @@ -# Helm chart values for zenith-proxy. -# This is a YAML-formatted file. - -image: - repository: neondatabase/neon - settings: authEndpoint: "https://console.neon.tech/authenticate_proxy_request/" uri: "https://console.neon.tech/psql_session/" @@ -28,7 +22,7 @@ exposedService: service.beta.kubernetes.io/aws-load-balancer-type: external service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing - external-dns.alpha.kubernetes.io/hostname: start.zenith.tech,connect.neon.tech,pg.neon.tech + external-dns.alpha.kubernetes.io/hostname: connect.neon.tech,pg.neon.tech metrics: enabled: true From 3c6890bf1dd72722c646d918b984d2392a010ce2 Mon Sep 17 00:00:00 2001 From: Egor Suvorov Date: Thu, 21 Apr 2022 14:54:22 +0300 Subject: [PATCH 282/296] postgres_ffi: add complex WAL tests for find_end_of_wal * Actual generation logic is in a separate crate `postgres_ffi/wal_generate` * The create also provides a binary for debug purposes akin to `initdb` * Two tests currently fail and are ignored * There is no easy way to test this directly in Safekeeper as it 
starts restoring from commit_lsn. So testing would require disconnecting Safekeeper just after it has received the WAL, but before it is committed. --- Cargo.lock | 15 + libs/postgres_ffi/Cargo.toml | 5 + libs/postgres_ffi/src/xlog_utils.rs | 143 ++++++--- libs/postgres_ffi/wal_generate/Cargo.toml | 14 + .../wal_generate/src/bin/wal_generate.rs | 58 ++++ libs/postgres_ffi/wal_generate/src/lib.rs | 278 ++++++++++++++++++ 6 files changed, 466 insertions(+), 47 deletions(-) create mode 100644 libs/postgres_ffi/wal_generate/Cargo.toml create mode 100644 libs/postgres_ffi/wal_generate/src/bin/wal_generate.rs create mode 100644 libs/postgres_ffi/wal_generate/src/lib.rs diff --git a/Cargo.lock b/Cargo.lock index 6a320ee274..6acad6dac8 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -2047,15 +2047,18 @@ dependencies = [ "bytes", "chrono", "crc32c", + "env_logger", "hex", "lazy_static", "log", "memoffset", + "postgres", "rand", "regex", "serde", "thiserror", "utils", + "wal_generate", "workspace_hack", ] @@ -3627,6 +3630,18 @@ version = "0.9.4" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "49874b5167b65d7193b8aba1567f5c7d93d001cafc34600cee003eda787e483f" +[[package]] +name = "wal_generate" +version = "0.1.0" +dependencies = [ + "anyhow", + "clap 3.0.14", + "env_logger", + "log", + "postgres", + "tempfile", +] + [[package]] name = "walkdir" version = "2.3.2" diff --git a/libs/postgres_ffi/Cargo.toml b/libs/postgres_ffi/Cargo.toml index 7be5ca1b93..129c93cf6d 100644 --- a/libs/postgres_ffi/Cargo.toml +++ b/libs/postgres_ffi/Cargo.toml @@ -20,5 +20,10 @@ serde = { version = "1.0", features = ["derive"] } utils = { path = "../utils" } workspace_hack = { version = "0.1", path = "../../workspace_hack" } +[dev-dependencies] +env_logger = "0.9" +postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" } +wal_generate = { path = "wal_generate" } + [build-dependencies] bindgen = "0.59.1" diff --git 
a/libs/postgres_ffi/src/xlog_utils.rs b/libs/postgres_ffi/src/xlog_utils.rs index 7882058868..3e30f9905e 100644 --- a/libs/postgres_ffi/src/xlog_utils.rs +++ b/libs/postgres_ffi/src/xlog_utils.rs @@ -476,78 +476,127 @@ pub fn generate_wal_segment(segno: u64, system_id: u64) -> Result anyhow::Result, + expected_end_of_wal_non_partial: Lsn, + last_segment: &str, + ) { + use wal_generate::*; + // 1. Generate some WAL let top_path = PathBuf::from(env!("CARGO_MANIFEST_DIR")) .join("..") .join(".."); - let data_dir = top_path.join("test_output/test_find_end_of_wal"); - let initdb_path = top_path.join("tmp_install/bin/initdb"); - let lib_path = top_path.join("tmp_install/lib"); - if data_dir.exists() { - fs::remove_dir_all(&data_dir).unwrap(); + let cfg = Conf { + pg_distrib_dir: top_path.join("tmp_install"), + datadir: top_path.join(format!("test_output/{}", test_name)), + }; + if cfg.datadir.exists() { + fs::remove_dir_all(&cfg.datadir).unwrap(); } - println!("Using initdb from '{}'", initdb_path.display()); - println!("Data directory '{}'", data_dir.display()); - let initdb_output = Command::new(initdb_path) - .args(&["-D", data_dir.to_str().unwrap()]) - .arg("--no-instructions") - .arg("--no-sync") - .env_clear() - .env("LD_LIBRARY_PATH", &lib_path) - .env("DYLD_LIBRARY_PATH", &lib_path) - .output() - .unwrap(); - assert!( - initdb_output.status.success(), - "initdb failed. Status: '{}', stdout: '{}', stderr: '{}'", - initdb_output.status, - String::from_utf8_lossy(&initdb_output.stdout), - String::from_utf8_lossy(&initdb_output.stderr), - ); + cfg.initdb().unwrap(); + let mut srv = cfg.start_server().unwrap(); + let expected_wal_end: Lsn = + u64::from(generate_wal(&mut srv.connect_with_timeout().unwrap()).unwrap()).into(); + srv.kill(); // 2. Pick WAL generated by initdb - let wal_dir = data_dir.join("pg_wal"); + let wal_dir = cfg.datadir.join("pg_wal"); let wal_seg_size = 16 * 1024 * 1024; // 3. 
Check end_of_wal on non-partial WAL segment (we treat it as fully populated) let (wal_end, tli) = find_end_of_wal(&wal_dir, wal_seg_size, true, Lsn(0)).unwrap(); let wal_end = Lsn(wal_end); - println!("wal_end={}, tli={}", wal_end, tli); - assert_eq!(wal_end, "0/2000000".parse::<Lsn>().unwrap()); + info!( + "find_end_of_wal returned (wal_end={}, tli={})", + wal_end, tli + ); + assert_eq!(wal_end, expected_end_of_wal_non_partial); // 4. Get the actual end of WAL by pg_waldump - let waldump_path = top_path.join("tmp_install/bin/pg_waldump"); - let waldump_output = Command::new(waldump_path) - .arg(wal_dir.join("000000010000000000000001")) - .env_clear() - .env("LD_LIBRARY_PATH", &lib_path) - .env("DYLD_LIBRARY_PATH", &lib_path) - .output() - .unwrap(); - let waldump_output = std::str::from_utf8(&waldump_output.stderr).unwrap(); - println!("waldump_output = '{}'", &waldump_output); - let re = Regex::new(r"invalid record length at (.+):").unwrap(); - let caps = re.captures(waldump_output).unwrap(); + let waldump_output = cfg + .pg_waldump("000000010000000000000001", last_segment) + .unwrap() + .stderr; + let waldump_output = std::str::from_utf8(&waldump_output).unwrap(); + let caps = match Regex::new(r"invalid record length at (.+):") + .unwrap() + .captures(waldump_output) + { + Some(caps) => caps, + None => { + error!("Unable to parse pg_waldump's stderr:\n{}", waldump_output); + panic!(); + } + }; let waldump_wal_end = Lsn::from_str(caps.get(1).unwrap().as_str()).unwrap(); + info!( + "waldump erred on {}, expected wal end at {}", + waldump_wal_end, expected_wal_end + ); + assert_eq!(waldump_wal_end, expected_wal_end); // 5.
Rename file to partial to actually find last valid lsn fs::rename( - wal_dir.join("000000010000000000000001"), - wal_dir.join("000000010000000000000001.partial"), + wal_dir.join(last_segment), + wal_dir.join(format!("{}.partial", last_segment)), ) .unwrap(); let (wal_end, tli) = find_end_of_wal(&wal_dir, wal_seg_size, true, Lsn(0)).unwrap(); let wal_end = Lsn(wal_end); - println!("wal_end={}, tli={}", wal_end, tli); + info!( + "find_end_of_wal returned (wal_end={}, tli={})", + wal_end, tli + ); assert_eq!(wal_end, waldump_wal_end); } + #[test] + pub fn test_find_end_of_wal_simple() { + init_logging(); + test_end_of_wal( + "test_find_end_of_wal_simple", + wal_generate::generate_simple, + "0/2000000".parse::<Lsn>().unwrap(), + "000000010000000000000001", + ); + } + + #[test] + #[ignore = "not yet fixed, needs correct skipping of contrecord"] // TODO + pub fn test_find_end_of_wal_crossing_segment_followed_by_small_one() { + init_logging(); + test_end_of_wal( + "test_find_end_of_wal_crossing_segment_followed_by_small_one", + wal_generate::generate_wal_record_crossing_segment_followed_by_small_one, + "0/3000000".parse::<Lsn>().unwrap(), + "000000010000000000000002", + ); + } + + #[test] + #[ignore = "not yet fixed, needs correct parsing of pre-last segments"] // TODO + pub fn test_find_end_of_wal_last_crossing_segment() { + init_logging(); + test_end_of_wal( + "test_find_end_of_wal_last_crossing_segment", + wal_generate::generate_last_wal_record_crossing_segment, + "0/3000000".parse::<Lsn>().unwrap(), + "000000010000000000000002", + ); + } + /// Check the math in update_next_xid /// /// NOTE: These checks are sensitive to the value of XID_CHECKPOINT_INTERVAL, diff --git a/libs/postgres_ffi/wal_generate/Cargo.toml b/libs/postgres_ffi/wal_generate/Cargo.toml new file mode 100644 index 0000000000..a10671dddd --- /dev/null +++ b/libs/postgres_ffi/wal_generate/Cargo.toml @@ -0,0 +1,14 @@ +[package] +name = "wal_generate" +version = "0.1.0" +edition = "2021" + +# See more keys and their
definitions at https://doc.rust-lang.org/cargo/reference/manifest.html + +[dependencies] +anyhow = "1.0" +clap = "3.0" +env_logger = "0.9" +log = "0.4" +postgres = { git = "https://github.com/zenithdb/rust-postgres.git", rev="d052ee8b86fff9897c77b0fe89ea9daba0e1fa38" } +tempfile = "3.2" diff --git a/libs/postgres_ffi/wal_generate/src/bin/wal_generate.rs b/libs/postgres_ffi/wal_generate/src/bin/wal_generate.rs new file mode 100644 index 0000000000..07ceb31c7f --- /dev/null +++ b/libs/postgres_ffi/wal_generate/src/bin/wal_generate.rs @@ -0,0 +1,58 @@ +use anyhow::*; +use clap::{App, Arg}; +use wal_generate::*; + +fn main() -> Result<()> { + env_logger::Builder::from_env( + env_logger::Env::default().default_filter_or("wal_generate=info"), + ) + .init(); + let arg_matches = App::new("Postgres WAL generator") + .about("Generates Postgres databases with specific WAL properties") + .arg( + Arg::new("datadir") + .short('D') + .long("datadir") + .takes_value(true) + .help("Data directory for the Postgres server") + .required(true) + ) + .arg( + Arg::new("pg-distrib-dir") + .long("pg-distrib-dir") + .takes_value(true) + .help("Directory with Postgres distribution (bin and lib directories, e.g. tmp_install)") + .default_value("/usr/local") + ) + .arg( + Arg::new("type") + .long("type") + .takes_value(true) + .help("Type of WAL to generate") + .possible_values(["simple", "last_wal_record_crossing_segment", "wal_record_crossing_segment_followed_by_small_one"]) + .required(true) + ) + .get_matches(); + + let cfg = Conf { + pg_distrib_dir: arg_matches.value_of("pg-distrib-dir").unwrap().into(), + datadir: arg_matches.value_of("datadir").unwrap().into(), + }; + cfg.initdb()?; + let mut srv = cfg.start_server()?; + let lsn = match arg_matches.value_of("type").unwrap() { + "simple" => generate_simple(&mut srv.connect_with_timeout()?)?, + "last_wal_record_crossing_segment" => { + generate_last_wal_record_crossing_segment(&mut srv.connect_with_timeout()?)? 
+ } + "wal_record_crossing_segment_followed_by_small_one" => { + generate_wal_record_crossing_segment_followed_by_small_one( + &mut srv.connect_with_timeout()?, + )? + } + a => panic!("Unknown --type argument: {}", a), + }; + println!("end_of_wal = {}", lsn); + srv.kill(); + Ok(()) +} diff --git a/libs/postgres_ffi/wal_generate/src/lib.rs b/libs/postgres_ffi/wal_generate/src/lib.rs new file mode 100644 index 0000000000..a5cd81d68a --- /dev/null +++ b/libs/postgres_ffi/wal_generate/src/lib.rs @@ -0,0 +1,278 @@ +use anyhow::*; +use core::time::Duration; +use log::*; +use postgres::types::PgLsn; +use postgres::Client; +use std::cmp::Ordering; +use std::path::{Path, PathBuf}; +use std::process::{Command, Stdio}; +use std::time::Instant; +use tempfile::{tempdir, TempDir}; + +#[derive(Debug, Clone, PartialEq, Eq)] +pub struct Conf { + pub pg_distrib_dir: PathBuf, + pub datadir: PathBuf, +} + +pub struct PostgresServer { + process: std::process::Child, + _unix_socket_dir: TempDir, + client_config: postgres::Config, +} + +impl Conf { + fn pg_bin_dir(&self) -> PathBuf { + self.pg_distrib_dir.join("bin") + } + + fn pg_lib_dir(&self) -> PathBuf { + self.pg_distrib_dir.join("lib") + } + + fn new_pg_command(&self, command: impl AsRef<Path>) -> Result<Command> { + let path = self.pg_bin_dir().join(command); + ensure!(path.exists(), "Command {:?} does not exist", path); + let mut cmd = Command::new(path); + cmd.env_clear() + .env("LD_LIBRARY_PATH", self.pg_lib_dir()) + .env("DYLD_LIBRARY_PATH", self.pg_lib_dir()); + Ok(cmd) + } + + pub fn initdb(&self) -> Result<()> { + if let Some(parent) = self.datadir.parent() { + info!("Pre-creating parent directory {:?}", parent); + // Tests may be run concurrently and there may be a race to create `test_output/`. + // std::fs::create_dir_all is guaranteed to have no races with another thread creating directories.
+ std::fs::create_dir_all(parent)?; + } + info!( + "Running initdb in {:?} with user \"postgres\"", + self.datadir + ); + let output = self + .new_pg_command("initdb")? + .arg("-D") + .arg(self.datadir.as_os_str()) + .args(&["-U", "postgres", "--no-instructions", "--no-sync"]) + .output()?; + debug!("initdb output: {:?}", output); + ensure!( + output.status.success(), + "initdb failed, stdout and stderr follow:\n{}{}", + String::from_utf8_lossy(&output.stdout), + String::from_utf8_lossy(&output.stderr), + ); + Ok(()) + } + + pub fn start_server(&self) -> Result<PostgresServer> { + info!("Starting Postgres server in {:?}", self.datadir); + let unix_socket_dir = tempdir()?; // We need a directory with a short name for Unix socket (up to 108 symbols) + let unix_socket_dir_path = unix_socket_dir.path().to_owned(); + let server_process = self + .new_pg_command("postgres")? + .args(&["-c", "listen_addresses="]) + .arg("-k") + .arg(unix_socket_dir_path.as_os_str()) + .arg("-D") + .arg(self.datadir.as_os_str()) + .args(&["-c", "wal_keep_size=50MB"]) // Ensure old WAL is not removed + .args(&["-c", "logging_collector=on"]) // stderr will mess up with tests output + .args(&["-c", "shared_preload_libraries=zenith"]) // can only be loaded at startup + // Disable background processes as much as possible + .args(&["-c", "wal_writer_delay=10s"]) + .args(&["-c", "autovacuum=off"]) + .stderr(Stdio::null()) + .spawn()?; + let server = PostgresServer { + process: server_process, + _unix_socket_dir: unix_socket_dir, + client_config: { + let mut c = postgres::Config::new(); + c.host_path(&unix_socket_dir_path); + c.user("postgres"); + c.connect_timeout(Duration::from_millis(1000)); + c + }, + }; + Ok(server) + } + + pub fn pg_waldump( + &self, + first_segment_name: &str, + last_segment_name: &str, + ) -> Result<std::process::Output> { + let first_segment_file = self.datadir.join(first_segment_name); + let last_segment_file = self.datadir.join(last_segment_name); + info!( + "Running pg_waldump for {} ..
{}", + first_segment_file.display(), + last_segment_file.display() + ); + let output = self + .new_pg_command("pg_waldump")? + .args(&[ + &first_segment_file.as_os_str(), + &last_segment_file.as_os_str(), + ]) + .output()?; + debug!("waldump output: {:?}", output); + Ok(output) + } +} + +impl PostgresServer { + pub fn connect_with_timeout(&self) -> Result<Client> { + let retry_until = Instant::now() + *self.client_config.get_connect_timeout().unwrap(); + while Instant::now() < retry_until { + use std::result::Result::Ok; + if let Ok(client) = self.client_config.connect(postgres::NoTls) { + return Ok(client); + } + std::thread::sleep(Duration::from_millis(100)); + } + bail!("Connection timed out"); + } + + pub fn kill(&mut self) { + self.process.kill().unwrap(); + self.process.wait().unwrap(); + } +} + +impl Drop for PostgresServer { + fn drop(&mut self) { + use std::result::Result::Ok; + match self.process.try_wait() { + Ok(Some(_)) => return, + Ok(None) => { + warn!("Server was not terminated, will be killed"); + } + Err(e) => { + error!("Unable to get status of the server: {}, will be killed", e); + } + } + let _ = self.process.kill(); + } +} + +pub trait PostgresClientExt: postgres::GenericClient { + fn pg_current_wal_insert_lsn(&mut self) -> Result<PgLsn> { + Ok(self + .query_one("SELECT pg_current_wal_insert_lsn()", &[])? + .get(0)) + } + fn pg_current_wal_flush_lsn(&mut self) -> Result<PgLsn> { + Ok(self + .query_one("SELECT pg_current_wal_flush_lsn()", &[])?
+ .get(0)) + } +} + +impl<C: postgres::GenericClient> PostgresClientExt for C {} + +fn generate_internal<C: postgres::GenericClient>( + client: &mut C, + f: impl Fn(&mut C, PgLsn) -> Result<Option<PgLsn>>, +) -> Result<PgLsn> { + client.execute("create extension if not exists zenith_test_utils", &[])?; + + let wal_segment_size = client.query_one( + "select cast(setting as bigint) as setting, unit \ + from pg_settings where name = 'wal_segment_size'", + &[], + )?; + ensure!( + wal_segment_size.get::<_, String>("unit") == "B", + "Unexpected wal_segment_size unit" + ); + ensure!( + wal_segment_size.get::<_, i64>("setting") == 16 * 1024 * 1024, + "Unexpected wal_segment_size in bytes" + ); + + let initial_lsn = client.pg_current_wal_insert_lsn()?; + info!("LSN initial = {}", initial_lsn); + + let last_lsn = match f(client, initial_lsn)? { + None => client.pg_current_wal_insert_lsn()?, + Some(last_lsn) => match last_lsn.cmp(&client.pg_current_wal_insert_lsn()?) { + Ordering::Less => bail!("Some records were inserted after the generated WAL"), + Ordering::Equal => last_lsn, + Ordering::Greater => bail!("Reported LSN is greater than insert_lsn"), + }, + }; + + // Some records may be not flushed, e.g. non-transactional logical messages. + client.execute("select neon_xlogflush(pg_current_wal_insert_lsn())", &[])?; + match last_lsn.cmp(&client.pg_current_wal_flush_lsn()?)
{ + Ordering::Less => bail!("Some records were flushed after the generated WAL"), + Ordering::Equal => {} + Ordering::Greater => bail!("Reported LSN is greater than flush_lsn"), + } + Ok(last_lsn) +} + +pub fn generate_simple(client: &mut impl postgres::GenericClient) -> Result<PgLsn> { + generate_internal(client, |client, _| { + client.execute("CREATE table t(x int)", &[])?; + Ok(None) + }) +} + +fn generate_single_logical_message( + client: &mut impl postgres::GenericClient, + transactional: bool, +) -> Result<PgLsn> { + generate_internal(client, |client, initial_lsn| { + ensure!( + initial_lsn < PgLsn::from(0x0200_0000 - 1024 * 1024), + "Initial LSN is too far in the future" + ); + + let message_lsn: PgLsn = client + .query_one( + "select pg_logical_emit_message($1, 'big-16mb-msg', \ + concat(repeat('abcd', 16 * 256 * 1024), 'end')) as message_lsn", + &[&transactional], + )? + .get("message_lsn"); + ensure!( + message_lsn > PgLsn::from(0x0200_0000 + 4 * 8192), + "Logical message did not cross the segment boundary" + ); + ensure!( + message_lsn < PgLsn::from(0x0400_0000), + "Logical message crossed two segments" + ); + + if transactional { + // Transactional logical messages are part of a transaction, so the one above is + // followed by a small COMMIT record.
+ + let after_message_lsn = client.pg_current_wal_insert_lsn()?; + ensure!( + message_lsn < after_message_lsn, + "No record found after the emitted message" + ); + Ok(Some(after_message_lsn)) + } else { + Ok(Some(message_lsn)) + } + }) +} + +pub fn generate_wal_record_crossing_segment_followed_by_small_one( + client: &mut impl postgres::GenericClient, +) -> Result<PgLsn> { + generate_single_logical_message(client, true) +} + +pub fn generate_last_wal_record_crossing_segment<C: postgres::GenericClient>( + client: &mut C, +) -> Result<PgLsn> { + generate_single_logical_message(client, false) +} From 12b7c793b3f9885d3132d66da149431b4fd7f5b7 Mon Sep 17 00:00:00 2001 From: Egor Suvorov Date: Thu, 21 Apr 2022 22:52:55 +0300 Subject: [PATCH 283/296] postgres_ffi: find_end_of_wal_segment: remove redundant CRC operations Previous invariant: `crc` contains an "unfinalized" CRC32 value, i.e. its ones' complement, as in Postgres before FIN_CRC32C. New invariant: `crc` always contains a "finalized" CRC32 value; this is the semantics of crc32c_append, so we don't need to invert the CRC manually.
--- libs/postgres_ffi/src/xlog_utils.rs | 3 --- 1 file changed, 3 deletions(-) diff --git a/libs/postgres_ffi/src/xlog_utils.rs b/libs/postgres_ffi/src/xlog_utils.rs index 3e30f9905e..ce036bc49a 100644 --- a/libs/postgres_ffi/src/xlog_utils.rs +++ b/libs/postgres_ffi/src/xlog_utils.rs @@ -234,16 +234,13 @@ fn find_end_of_wal_segment( wal_crc = LittleEndian::read_u32(&buf[crc_offs..crc_offs + 4]); crc = crc32c_append(0, &buf[crc_offs + 4..page_offs + n]); } else { - crc ^= 0xFFFFFFFFu32; crc = crc32c_append(crc, &buf[page_offs..page_offs + n]); } - crc = !crc; rec_offs += n; offs += n; contlen -= n; if contlen == 0 { - crc = !crc; crc = crc32c_append(crc, &rec_hdr); offs = (offs + 7) & !7; // pad on 8 bytes boundary */ if crc == wal_crc { From c9efdec8db8115a56bb6044e0d0547aac7583872 Mon Sep 17 00:00:00 2001 From: Egor Suvorov Date: Thu, 21 Apr 2022 23:08:13 +0300 Subject: [PATCH 284/296] postgres_ffi: find_end_of_wal_segment: improve name of wal_crc variable Now it reflects the field it's mirroring. --- libs/postgres_ffi/src/xlog_utils.rs | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/libs/postgres_ffi/src/xlog_utils.rs b/libs/postgres_ffi/src/xlog_utils.rs index ce036bc49a..9fcf78acb1 100644 --- a/libs/postgres_ffi/src/xlog_utils.rs +++ b/libs/postgres_ffi/src/xlog_utils.rs @@ -150,7 +150,7 @@ fn find_end_of_wal_segment( // step back to the beginning of the page to read it in... 
let mut offs: usize = start_offset - start_offset % XLOG_BLCKSZ; let mut contlen: usize = 0; - let mut wal_crc: u32 = 0; + let mut xl_crc: u32 = 0; let mut crc: u32 = 0; let mut rec_offs: usize = 0; let mut buf = [0u8; XLOG_BLCKSZ]; @@ -231,7 +231,7 @@ fn find_end_of_wal_segment( } if rec_offs <= XLOG_RECORD_CRC_OFFS && rec_offs + n >= XLOG_SIZE_OF_XLOG_RECORD { let crc_offs = page_offs - rec_offs + XLOG_RECORD_CRC_OFFS; - wal_crc = LittleEndian::read_u32(&buf[crc_offs..crc_offs + 4]); + xl_crc = LittleEndian::read_u32(&buf[crc_offs..crc_offs + 4]); crc = crc32c_append(0, &buf[crc_offs + 4..page_offs + n]); } else { crc = crc32c_append(crc, &buf[page_offs..page_offs + n]); @@ -243,14 +243,14 @@ fn find_end_of_wal_segment( if contlen == 0 { crc = crc32c_append(crc, &rec_hdr); offs = (offs + 7) & !7; // pad on 8 bytes boundary */ - if crc == wal_crc { + if crc == xl_crc { // record is valid, advance the result to its end (with // alignment to the next record taken into account) last_valid_rec_pos = offs; } else { info!( "CRC mismatch {} vs {} at {}", - crc, wal_crc, last_valid_rec_pos + crc, xl_crc, last_valid_rec_pos ); break; } From c4b77084afd70098ed3ecf56b6778a6cc0dbcfe4 Mon Sep 17 00:00:00 2001 From: Egor Suvorov Date: Thu, 19 May 2022 01:58:51 +0300 Subject: [PATCH 285/296] utils: add const_assert! macro --- libs/utils/src/lib.rs | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/libs/utils/src/lib.rs b/libs/utils/src/lib.rs index 4810909712..15d4c7a81e 100644 --- a/libs/utils/src/lib.rs +++ b/libs/utils/src/lib.rs @@ -95,3 +95,11 @@ macro_rules! project_git_version { ); }; } + +/// Same as `assert!`, but evaluated during compilation and gets optimized out in runtime. +#[macro_export] +macro_rules! 
const_assert { + ($($args:tt)*) => { + const _: () = assert!($($args)*); + }; +} From a124e44866c0b6cd1295d83d445dc7fab9e6e1d5 Mon Sep 17 00:00:00 2001 From: Egor Suvorov Date: Thu, 19 May 2022 03:02:54 +0300 Subject: [PATCH 286/296] postgres_ffi: find_end_of_wal_segment: add lots of trace --- libs/postgres_ffi/src/xlog_utils.rs | 75 ++++++++++++++++++++++++++++- 1 file changed, 73 insertions(+), 2 deletions(-) diff --git a/libs/postgres_ffi/src/xlog_utils.rs b/libs/postgres_ffi/src/xlog_utils.rs index 9fcf78acb1..93b4924110 100644 --- a/libs/postgres_ffi/src/xlog_utils.rs +++ b/libs/postgres_ffi/src/xlog_utils.rs @@ -160,9 +160,11 @@ fn find_end_of_wal_segment( file.seek(SeekFrom::Start(offs as u64))?; let mut rec_hdr = [0u8; XLOG_RECORD_CRC_OFFS]; + trace!("find_end_of_wal_segment(data_dir={}, segno={}, tli={}, wal_seg_size={}, start_offset=0x{:x})", data_dir.display(), segno, tli, wal_seg_size, start_offset); while offs < wal_seg_size { // we are at the beginning of the page; read it in if offs % XLOG_BLCKSZ == 0 { + trace!("offs=0x{:x}: new page", offs); let bytes_read = file.read(&mut buf)?; if bytes_read != buf.len() { bail!( @@ -176,10 +178,16 @@ fn find_end_of_wal_segment( let xlp_magic = LittleEndian::read_u16(&buf[0..2]); let xlp_info = LittleEndian::read_u16(&buf[2..4]); let xlp_rem_len = LittleEndian::read_u32(&buf[XLP_REM_LEN_OFFS..XLP_REM_LEN_OFFS + 4]); + trace!( + " xlp_magic=0x{:x}, xlp_info=0x{:x}, xlp_rem_len={}", + xlp_magic, + xlp_info, + xlp_rem_len + ); // this is expected in current usage when valid WAL starts after page header if xlp_magic != XLOG_PAGE_MAGIC as u16 { trace!( - "invalid WAL file {}.partial magic {} at {:?}", + " invalid WAL file {}.partial magic {} at {:?}", file_name, xlp_magic, Lsn(XLogSegNoOffsetToRecPtr(segno, offs as u32, wal_seg_size)), @@ -194,12 +202,13 @@ fn find_end_of_wal_segment( offs += XLOG_SIZE_OF_XLOG_SHORT_PHD; } // ... 
and step forward again if asked + trace!(" skipped header to 0x{:x}", offs); offs = max(offs, start_offset); - // beginning of the next record } else if contlen == 0 { let page_offs = offs % XLOG_BLCKSZ; let xl_tot_len = LittleEndian::read_u32(&buf[page_offs..page_offs + 4]) as usize; + trace!("offs=0x{:x}: new record, xl_tot_len={}", offs, xl_tot_len); if xl_tot_len == 0 { info!( "find_end_of_wal_segment reached zeros at {:?}, last records ends at {:?}", @@ -212,10 +221,20 @@ fn find_end_of_wal_segment( ); break; // zeros, reached the end } + trace!( + " updating last_valid_rec_pos: 0x{:x} --> 0x{:x}", + last_valid_rec_pos, + offs + ); last_valid_rec_pos = offs; offs += 4; rec_offs = 4; contlen = xl_tot_len - 4; + trace!( + " reading rec_hdr[0..4] <-- [0x{:x}; 0x{:x})", + page_offs, + page_offs + 4 + ); rec_hdr[0..4].copy_from_slice(&buf[page_offs..page_offs + 4]); } else { // we're continuing a record, possibly from previous page. @@ -224,28 +243,79 @@ fn find_end_of_wal_segment( // read the rest of the record, or as much as fits on this page. 
let n = min(contlen, pageleft); + trace!( + "offs=0x{:x}, record continuation, pageleft={}, contlen={}", + offs, + pageleft, + contlen + ); // fill rec_hdr (header up to (but not including) xl_crc field) + trace!( + " rec_offs={}, XLOG_RECORD_CRC_OFFS={}, XLOG_SIZE_OF_XLOG_RECORD={}", + rec_offs, + XLOG_RECORD_CRC_OFFS, + XLOG_SIZE_OF_XLOG_RECORD + ); if rec_offs < XLOG_RECORD_CRC_OFFS { let len = min(XLOG_RECORD_CRC_OFFS - rec_offs, n); + trace!( + " reading rec_hdr[{}..{}] <-- [0x{:x}; 0x{:x})", + rec_offs, + rec_offs + len, + page_offs, + page_offs + len + ); rec_hdr[rec_offs..rec_offs + len].copy_from_slice(&buf[page_offs..page_offs + len]); } if rec_offs <= XLOG_RECORD_CRC_OFFS && rec_offs + n >= XLOG_SIZE_OF_XLOG_RECORD { let crc_offs = page_offs - rec_offs + XLOG_RECORD_CRC_OFFS; xl_crc = LittleEndian::read_u32(&buf[crc_offs..crc_offs + 4]); + trace!( + " reading xl_crc: [0x{:x}; 0x{:x}) = 0x{:x}", + crc_offs, + crc_offs + 4, + xl_crc + ); crc = crc32c_append(0, &buf[crc_offs + 4..page_offs + n]); + trace!( + " initializing crc: [0x{:x}; 0x{:x}); crc = 0x{:x}", + crc_offs + 4, + page_offs + n, + crc + ); } else { + let old_crc = crc; crc = crc32c_append(crc, &buf[page_offs..page_offs + n]); + trace!( + " appending to crc: [0x{:x}; 0x{:x}); 0x{:x} --> 0x{:x}", + page_offs, + page_offs + n, + old_crc, + crc + ); } rec_offs += n; offs += n; contlen -= n; if contlen == 0 { + trace!(" record completed at 0x{:x}", offs); crc = crc32c_append(crc, &rec_hdr); offs = (offs + 7) & !7; // pad on 8 bytes boundary */ + trace!( + " padded offs to 0x{:x}, crc is {:x}, expected crc is {:x}", + offs, + crc, + xl_crc + ); if crc == xl_crc { // record is valid, advance the result to its end (with // alignment to the next record taken into account) + trace!( + " updating last_valid_rec_pos: 0x{:x} --> 0x{:x}", + last_valid_rec_pos, + offs + ); last_valid_rec_pos = offs; } else { info!( @@ -257,6 +327,7 @@ fn find_end_of_wal_segment( } } } + trace!("last_valid_rec_pos=0x{:x}", 
last_valid_rec_pos); Ok(last_valid_rec_pos as u32) } From 967eb38e815a102751bd1658caf91a05f9cecb22 Mon Sep 17 00:00:00 2001 From: Egor Suvorov Date: Thu, 19 May 2022 03:20:06 +0300 Subject: [PATCH 287/296] postgres_ffi: find_end_of_wal_segment: fix contrecord skipping Also enable corresponding test. --- libs/postgres_ffi/src/xlog_utils.rs | 42 +++++++++++++++++++++-------- 1 file changed, 31 insertions(+), 11 deletions(-) diff --git a/libs/postgres_ffi/src/xlog_utils.rs b/libs/postgres_ffi/src/xlog_utils.rs index 93b4924110..ac52e3fb4f 100644 --- a/libs/postgres_ffi/src/xlog_utils.rs +++ b/libs/postgres_ffi/src/xlog_utils.rs @@ -15,7 +15,7 @@ use crate::XLogPageHeaderData; use crate::XLogRecord; use crate::XLOG_PAGE_MAGIC; -use anyhow::bail; +use anyhow::{bail, ensure}; use byteorder::{ByteOrder, LittleEndian}; use bytes::BytesMut; use bytes::{Buf, Bytes}; @@ -149,6 +149,7 @@ fn find_end_of_wal_segment( ) -> anyhow::Result { // step back to the beginning of the page to read it in... let mut offs: usize = start_offset - start_offset % XLOG_BLCKSZ; + let mut skipping_first_contrecord: bool = false; let mut contlen: usize = 0; let mut xl_crc: u32 = 0; let mut crc: u32 = 0; @@ -194,9 +195,21 @@ fn find_end_of_wal_segment( ); } if offs == 0 { - offs = XLOG_SIZE_OF_XLOG_LONG_PHD; + offs += XLOG_SIZE_OF_XLOG_LONG_PHD; if (xlp_info & XLP_FIRST_IS_CONTRECORD) != 0 { - offs += ((xlp_rem_len + 7) & !7) as usize; + trace!(" first record is contrecord"); + skipping_first_contrecord = true; + contlen = xlp_rem_len as usize; + if offs < start_offset { + // Pre-condition failed: the beginning of the segment is unexpectedly corrupted. + ensure!(start_offset - offs >= contlen, + "start_offset is in the middle of the first record (which happens to be a contrecord), \ + expected to be on a record boundary. Is beginning of the segment corrupted?"); + contlen = 0; + // keep skipping_first_contrecord to avoid counting the contrecord as valid, we did not check it. 
+ } + } else { + trace!(" first record is not contrecord"); } } else { offs += XLOG_SIZE_OF_XLOG_SHORT_PHD; @@ -221,12 +234,17 @@ fn find_end_of_wal_segment( ); break; // zeros, reached the end } - trace!( - " updating last_valid_rec_pos: 0x{:x} --> 0x{:x}", - last_valid_rec_pos, - offs - ); - last_valid_rec_pos = offs; + if skipping_first_contrecord { + skipping_first_contrecord = false; + trace!(" first contrecord has been just completed"); + } else { + trace!( + " updating last_valid_rec_pos: 0x{:x} --> 0x{:x}", + last_valid_rec_pos, + offs + ); + last_valid_rec_pos = offs; + } offs += 4; rec_offs = 4; contlen = xl_tot_len - 4; @@ -308,7 +326,10 @@ fn find_end_of_wal_segment( crc, xl_crc ); - if crc == xl_crc { + if skipping_first_contrecord { + // do nothing, the flag will go down on next iteration when we're reading new record + trace!(" first conrecord has been just completed"); + } else if crc == xl_crc { // record is valid, advance the result to its end (with // alignment to the next record taken into account) trace!( @@ -642,7 +663,6 @@ mod tests { } #[test] - #[ignore = "not yet fixed, needs correct skipping of contrecord"] // TODO pub fn test_find_end_of_wal_crossing_segment_followed_by_small_one() { init_logging(); test_end_of_wal( From 73187bfef12852b38b39724df42323e5ab0c60a5 Mon Sep 17 00:00:00 2001 From: Egor Suvorov Date: Sat, 21 May 2022 02:48:43 +0300 Subject: [PATCH 288/296] postgres_ffi: find_end_of_wal_segment: clarify code around xl_crc retrieval It would be better to not update xl_crc/rec_hdr at all when skipping contrecord, but I would prefer to keep PR #1574 small. Better audit of `find_end_of_wal_segment` is coming anyway in #544. 
--- libs/postgres_ffi/src/xlog_utils.rs | 31 +++++++++++++++++++++++++++-- 1 file changed, 29 insertions(+), 2 deletions(-) diff --git a/libs/postgres_ffi/src/xlog_utils.rs b/libs/postgres_ffi/src/xlog_utils.rs index ac52e3fb4f..32a3022c5a 100644 --- a/libs/postgres_ffi/src/xlog_utils.rs +++ b/libs/postgres_ffi/src/xlog_utils.rs @@ -30,6 +30,7 @@ use std::path::{Path, PathBuf}; use std::time::SystemTime; use utils::bin_ser::DeserializeError; use utils::bin_ser::SerializeError; +use utils::const_assert; use utils::lsn::Lsn; pub const XLOG_FNAME_LEN: usize = 24; @@ -159,6 +160,8 @@ fn find_end_of_wal_segment( let mut last_valid_rec_pos: usize = start_offset; // assume at given start_offset begins new record let mut file = File::open(data_dir.join(file_name.clone() + ".partial")).unwrap(); file.seek(SeekFrom::Start(offs as u64))?; + // xl_crc is the last field in XLogRecord, will not be read into rec_hdr + const_assert!(XLOG_RECORD_CRC_OFFS + 4 == XLOG_SIZE_OF_XLOG_RECORD); let mut rec_hdr = [0u8; XLOG_RECORD_CRC_OFFS]; trace!("find_end_of_wal_segment(data_dir={}, segno={}, tli={}, wal_seg_size={}, start_offset=0x{:x})", data_dir.display(), segno, tli, wal_seg_size, start_offset); @@ -267,7 +270,7 @@ fn find_end_of_wal_segment( pageleft, contlen ); - // fill rec_hdr (header up to (but not including) xl_crc field) + // fill rec_hdr header up to (but not including) xl_crc field trace!( " rec_offs={}, XLOG_RECORD_CRC_OFFS={}, XLOG_SIZE_OF_XLOG_RECORD={}", rec_offs, @@ -287,6 +290,14 @@ fn find_end_of_wal_segment( } if rec_offs <= XLOG_RECORD_CRC_OFFS && rec_offs + n >= XLOG_SIZE_OF_XLOG_RECORD { let crc_offs = page_offs - rec_offs + XLOG_RECORD_CRC_OFFS; + // All records are aligned on 8-byte boundary, so their 8-byte frames + // cannot be split between pages. As xl_crc is the last field, + // its content is always on the same page. 
+ const_assert!(XLOG_RECORD_CRC_OFFS % 8 == 4); + // We should always start reading aligned records even in incorrect WALs so if + // the condition is false it is likely a bug. However, it is localized somewhere + // in this function, hence we do not crash and just report failure instead. + ensure!(crc_offs % 8 == 4, "Record is not aligned properly (bug?)"); xl_crc = LittleEndian::read_u32(&buf[crc_offs..crc_offs + 4]); trace!( " reading xl_crc: [0x{:x}; 0x{:x}) = 0x{:x}", @@ -301,7 +312,9 @@ fn find_end_of_wal_segment( page_offs + n, crc ); - } else { + } else if rec_offs > XLOG_RECORD_CRC_OFFS { + // As all records are 8-byte aligned, the header is already fully read and `crc` is initialized in the branch above. + ensure!(rec_offs >= XLOG_SIZE_OF_XLOG_RECORD); let old_crc = crc; crc = crc32c_append(crc, &buf[page_offs..page_offs + n]); trace!( @@ -311,6 +324,20 @@ fn find_end_of_wal_segment( old_crc, crc ); + } else { + // Correct because of the way conditions are written above. + assert!(rec_offs + n < XLOG_SIZE_OF_XLOG_RECORD); + // If `skipping_first_contrecord == true`, we may be reading from a middle of a record + // which started in the previous segment. Hence there is no point in validating the header. + if !skipping_first_contrecord && rec_offs + n > XLOG_RECORD_CRC_OFFS { + info!( + "Curiously corrupted WAL: a record stops inside the header; \ + offs=0x{:x}, record continuation, pageleft={}, contlen={}", + offs, pageleft, contlen + ); + break; + } + // Do nothing: we are still reading the header. It's accounted in CRC in the end of the record. } rec_offs += n; offs += n; From ef7cdb13e28abcbd1a36eea87dda70481ab28191 Mon Sep 17 00:00:00 2001 From: Egor Suvorov Date: Sat, 21 May 2022 03:47:41 +0300 Subject: [PATCH 289/296] Remove unused dependencies from poetry.lock via `poetry lock --no-update` There were a bunch of dependencies for Python <3.9. They are not needed after #1254. 
This commit makes it easier to add/remove dependencies because lock file will be updated like this on any such operation. Do not update dependencies yet to not break anything. --- poetry.lock | 103 +--------------------------------------------------- 1 file changed, 2 insertions(+), 101 deletions(-) diff --git a/poetry.lock b/poetry.lock index a69f482776..6e552d2cd3 100644 --- a/poetry.lock +++ b/poetry.lock @@ -21,9 +21,6 @@ category = "main" optional = false python-versions = ">=3.6" -[package.dependencies] -typing-extensions = {version = ">=3.6.5", markers = "python_version < \"3.8\""} - [[package]] name = "asyncpg" version = "0.24.0" @@ -32,9 +29,6 @@ category = "main" optional = false python-versions = ">=3.6.0" -[package.dependencies] -typing-extensions = {version = ">=3.7.4.3", markers = "python_version < \"3.8\""} - [package.extras] dev = ["Cython (>=0.29.24,<0.30.0)", "pytest (>=6.0)", "Sphinx (>=4.1.2,<4.2.0)", "sphinxcontrib-asyncio (>=0.3.0,<0.4.0)", "sphinx-rtd-theme (>=0.5.2,<0.6.0)", "pycodestyle (>=2.7.0,<2.8.0)", "flake8 (>=3.9.2,<3.10.0)", "uvloop (>=0.15.3)"] docs = ["Sphinx (>=4.1.2,<4.2.0)", "sphinxcontrib-asyncio (>=0.3.0,<0.4.0)", "sphinx-rtd-theme (>=0.5.2,<0.6.0)"] @@ -125,7 +119,6 @@ python-versions = ">=3.6" [package.dependencies] botocore-stubs = "*" -typing-extensions = {version = "*", markers = "python_version < \"3.9\""} [package.extras] accessanalyzer = ["mypy-boto3-accessanalyzer (>=1.20.0)"] @@ -454,9 +447,6 @@ category = "main" optional = false python-versions = ">=3.6" -[package.dependencies] -typing-extensions = {version = "*", markers = "python_version < \"3.9\""} - [[package]] name = "cached-property" version = "1.5.2" @@ -524,7 +514,6 @@ python-versions = ">=3.6" [package.dependencies] colorama = {version = "*", markers = "platform_system == \"Windows\""} -importlib-metadata = {version = "*", markers = "python_version < \"3.8\""} [[package]] name = "colorama" @@ -605,7 +594,6 @@ optional = false python-versions = 
"!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*,>=2.7" [package.dependencies] -importlib-metadata = {version = "*", markers = "python_version < \"3.8\""} mccabe = ">=0.6.0,<0.7.0" pycodestyle = ">=2.7.0,<2.8.0" pyflakes = ">=2.3.0,<2.4.0" @@ -664,23 +652,6 @@ category = "main" optional = false python-versions = ">=3.5" -[[package]] -name = "importlib-metadata" -version = "4.10.1" -description = "Read metadata from Python packages" -category = "main" -optional = false -python-versions = ">=3.7" - -[package.dependencies] -typing-extensions = {version = ">=3.6.4", markers = "python_version < \"3.8\""} -zipp = ">=0.5" - -[package.extras] -docs = ["sphinx", "jaraco.packaging (>=8.2)", "rst.linker (>=1.9)"] -perf = ["ipython"] -testing = ["pytest (>=6)", "pytest-checkdocs (>=2.4)", "pytest-flake8", "pytest-cov", "pytest-enabler (>=1.0.1)", "packaging", "pyfakefs", "flufl.flake8", "pytest-perf (>=0.9.2)", "pytest-black (>=0.3.7)", "pytest-mypy", "importlib-resources (>=1.3)"] - [[package]] name = "iniconfig" version = "1.1.1" @@ -759,9 +730,6 @@ category = "main" optional = false python-versions = ">=2.7" -[package.dependencies] -importlib-metadata = {version = "*", markers = "python_version < \"3.8\""} - [package.extras] docs = ["sphinx", "jaraco.packaging (>=3.2)", "rst.linker (>=1.9)"] testing = ["pytest (>=3.5,!=3.7.3)", "pytest-checkdocs (>=1.2.3)", "pytest-flake8", "pytest-black-multipy", "pytest-cov", "ecdsa", "feedparser", "numpy", "pandas", "pymongo", "scikit-learn", "sqlalchemy", "enum34", "jsonlib"] @@ -785,7 +753,6 @@ python-versions = "*" [package.dependencies] attrs = ">=17.4.0" -importlib-metadata = {version = "*", markers = "python_version < \"3.8\""} pyrsistent = ">=0.14.0" six = ">=1.11.0" @@ -840,7 +807,6 @@ flask = {version = "*", optional = true, markers = "extra == \"server\""} flask-cors = {version = "*", optional = true, markers = "extra == \"server\""} graphql-core = {version = "*", optional = true, markers = "extra == \"server\""} idna = {version = 
">=2.5,<4", optional = true, markers = "extra == \"server\""} -importlib-metadata = {version = "*", markers = "python_version < \"3.8\""} Jinja2 = ">=2.10.1" jsondiff = {version = ">=1.1.2", optional = true, markers = "extra == \"server\""} MarkupSafe = "!=2.0.0a1" @@ -890,7 +856,6 @@ python-versions = ">=3.5" [package.dependencies] mypy-extensions = ">=0.4.3,<0.5.0" toml = "*" -typed-ast = {version = ">=1.4.0,<1.5.0", markers = "python_version < \"3.8\""} typing-extensions = ">=3.7.4" [package.extras] @@ -947,9 +912,6 @@ category = "main" optional = false python-versions = ">=3.6" -[package.dependencies] -importlib-metadata = {version = ">=0.12", markers = "python_version < \"3.8\""} - [package.extras] dev = ["pre-commit", "tox"] testing = ["pytest", "pytest-benchmark"] @@ -1061,7 +1023,6 @@ python-versions = ">=3.6" atomicwrites = {version = ">=1.0", markers = "sys_platform == \"win32\""} attrs = ">=19.2.0" colorama = {version = "*", markers = "sys_platform == \"win32\""} -importlib-metadata = {version = ">=0.12", markers = "python_version < \"3.8\""} iniconfig = "*" packaging = "*" pluggy = ">=0.12,<2.0" @@ -1279,14 +1240,6 @@ category = "main" optional = false python-versions = ">=2.6, !=3.0.*, !=3.1.*, !=3.2.*" -[[package]] -name = "typed-ast" -version = "1.4.3" -description = "a fork of Python 2 and 3 ast modules with type comment support" -category = "dev" -optional = false -python-versions = "*" - [[package]] name = "types-psycopg2" version = "2.9.6" @@ -1383,22 +1336,10 @@ category = "dev" optional = false python-versions = "*" -[[package]] -name = "zipp" -version = "3.7.0" -description = "Backport of pathlib-compatible object wrapper for zip files" -category = "main" -optional = false -python-versions = ">=3.7" - -[package.extras] -docs = ["sphinx", "jaraco.packaging (>=8.2)", "rst.linker (>=1.9)"] -testing = ["pytest (>=6)", "pytest-checkdocs (>=2.4)", "pytest-flake8", "pytest-cov", "pytest-enabler (>=1.0.1)", "jaraco.itertools", "func-timeout", 
"pytest-black (>=0.3.7)", "pytest-mypy"] - [metadata] lock-version = "1.1" -python-versions = "^3.7" -content-hash = "4ee85b435461dec70b406bf7170302fe54e9e247bdf628a9cb6b5fb9eb9afd82" +python-versions = "^3.9" +content-hash = "be9c00bb5081535805824242fea2a03b2f82fa9466856d618e24b3140c7da6a0" [metadata.files] aiopg = [ @@ -1594,10 +1535,6 @@ idna = [ {file = "idna-3.3-py3-none-any.whl", hash = "sha256:84d9dd047ffa80596e0f246e2eab0b391788b0503584e8945f2368256d2735ff"}, {file = "idna-3.3.tar.gz", hash = "sha256:9d643ff0a55b762d5cdb124b8eaa99c66322e2157b69160bc32796e824360e6d"}, ] -importlib-metadata = [ - {file = "importlib_metadata-4.10.1-py3-none-any.whl", hash = "sha256:899e2a40a8c4a1aec681feef45733de8a6c58f3f6a0dbed2eb6574b4387a77b6"}, - {file = "importlib_metadata-4.10.1.tar.gz", hash = "sha256:951f0d8a5b7260e9db5e41d429285b5f451e928479f19d80818878527d36e95e"}, -] iniconfig = [ {file = "iniconfig-1.1.1-py2.py3-none-any.whl", hash = "sha256:011e24c64b7f47f6ebd835bb12a743f2fbe9a26d4cecaa7f53bc4f35ee9da8b3"}, {file = "iniconfig-1.1.1.tar.gz", hash = "sha256:bc3af051d7d14b2ee5ef9969666def0cd1a000e121eaea580d4a313df4b37f32"}, @@ -2001,38 +1938,6 @@ toml = [ {file = "toml-0.10.2-py2.py3-none-any.whl", hash = "sha256:806143ae5bfb6a3c6e736a764057db0e6a0e05e338b5630894a5f779cabb4f9b"}, {file = "toml-0.10.2.tar.gz", hash = "sha256:b3bda1d108d5dd99f4a20d24d9c348e91c4db7ab1b749200bded2f839ccbe68f"}, ] -typed-ast = [ - {file = "typed_ast-1.4.3-cp35-cp35m-manylinux1_i686.whl", hash = "sha256:2068531575a125b87a41802130fa7e29f26c09a2833fea68d9a40cf33902eba6"}, - {file = "typed_ast-1.4.3-cp35-cp35m-manylinux1_x86_64.whl", hash = "sha256:c907f561b1e83e93fad565bac5ba9c22d96a54e7ea0267c708bffe863cbe4075"}, - {file = "typed_ast-1.4.3-cp35-cp35m-manylinux2014_aarch64.whl", hash = "sha256:1b3ead4a96c9101bef08f9f7d1217c096f31667617b58de957f690c92378b528"}, - {file = "typed_ast-1.4.3-cp35-cp35m-win32.whl", hash = "sha256:dde816ca9dac1d9c01dd504ea5967821606f02e510438120091b84e852367428"}, 
- {file = "typed_ast-1.4.3-cp35-cp35m-win_amd64.whl", hash = "sha256:777a26c84bea6cd934422ac2e3b78863a37017618b6e5c08f92ef69853e765d3"}, - {file = "typed_ast-1.4.3-cp36-cp36m-macosx_10_9_x86_64.whl", hash = "sha256:f8afcf15cc511ada719a88e013cec87c11aff7b91f019295eb4530f96fe5ef2f"}, - {file = "typed_ast-1.4.3-cp36-cp36m-manylinux1_i686.whl", hash = "sha256:52b1eb8c83f178ab787f3a4283f68258525f8d70f778a2f6dd54d3b5e5fb4341"}, - {file = "typed_ast-1.4.3-cp36-cp36m-manylinux1_x86_64.whl", hash = "sha256:01ae5f73431d21eead5015997ab41afa53aa1fbe252f9da060be5dad2c730ace"}, - {file = "typed_ast-1.4.3-cp36-cp36m-manylinux2014_aarch64.whl", hash = "sha256:c190f0899e9f9f8b6b7863debfb739abcb21a5c054f911ca3596d12b8a4c4c7f"}, - {file = "typed_ast-1.4.3-cp36-cp36m-win32.whl", hash = "sha256:398e44cd480f4d2b7ee8d98385ca104e35c81525dd98c519acff1b79bdaac363"}, - {file = "typed_ast-1.4.3-cp36-cp36m-win_amd64.whl", hash = "sha256:bff6ad71c81b3bba8fa35f0f1921fb24ff4476235a6e94a26ada2e54370e6da7"}, - {file = "typed_ast-1.4.3-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:0fb71b8c643187d7492c1f8352f2c15b4c4af3f6338f21681d3681b3dc31a266"}, - {file = "typed_ast-1.4.3-cp37-cp37m-manylinux1_i686.whl", hash = "sha256:760ad187b1041a154f0e4d0f6aae3e40fdb51d6de16e5c99aedadd9246450e9e"}, - {file = "typed_ast-1.4.3-cp37-cp37m-manylinux1_x86_64.whl", hash = "sha256:5feca99c17af94057417d744607b82dd0a664fd5e4ca98061480fd8b14b18d04"}, - {file = "typed_ast-1.4.3-cp37-cp37m-manylinux2014_aarch64.whl", hash = "sha256:95431a26309a21874005845c21118c83991c63ea800dd44843e42a916aec5899"}, - {file = "typed_ast-1.4.3-cp37-cp37m-win32.whl", hash = "sha256:aee0c1256be6c07bd3e1263ff920c325b59849dc95392a05f258bb9b259cf39c"}, - {file = "typed_ast-1.4.3-cp37-cp37m-win_amd64.whl", hash = "sha256:9ad2c92ec681e02baf81fdfa056fe0d818645efa9af1f1cd5fd6f1bd2bdfd805"}, - {file = "typed_ast-1.4.3-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:b36b4f3920103a25e1d5d024d155c504080959582b928e91cb608a65c3a49e1a"}, - {file = 
"typed_ast-1.4.3-cp38-cp38-manylinux1_i686.whl", hash = "sha256:067a74454df670dcaa4e59349a2e5c81e567d8d65458d480a5b3dfecec08c5ff"}, - {file = "typed_ast-1.4.3-cp38-cp38-manylinux1_x86_64.whl", hash = "sha256:7538e495704e2ccda9b234b82423a4038f324f3a10c43bc088a1636180f11a41"}, - {file = "typed_ast-1.4.3-cp38-cp38-manylinux2014_aarch64.whl", hash = "sha256:af3d4a73793725138d6b334d9d247ce7e5f084d96284ed23f22ee626a7b88e39"}, - {file = "typed_ast-1.4.3-cp38-cp38-win32.whl", hash = "sha256:f2362f3cb0f3172c42938946dbc5b7843c2a28aec307c49100c8b38764eb6927"}, - {file = "typed_ast-1.4.3-cp38-cp38-win_amd64.whl", hash = "sha256:dd4a21253f42b8d2b48410cb31fe501d32f8b9fbeb1f55063ad102fe9c425e40"}, - {file = "typed_ast-1.4.3-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:f328adcfebed9f11301eaedfa48e15bdece9b519fb27e6a8c01aa52a17ec31b3"}, - {file = "typed_ast-1.4.3-cp39-cp39-manylinux1_i686.whl", hash = "sha256:2c726c276d09fc5c414693a2de063f521052d9ea7c240ce553316f70656c84d4"}, - {file = "typed_ast-1.4.3-cp39-cp39-manylinux1_x86_64.whl", hash = "sha256:cae53c389825d3b46fb37538441f75d6aecc4174f615d048321b716df2757fb0"}, - {file = "typed_ast-1.4.3-cp39-cp39-manylinux2014_aarch64.whl", hash = "sha256:b9574c6f03f685070d859e75c7f9eeca02d6933273b5e69572e5ff9d5e3931c3"}, - {file = "typed_ast-1.4.3-cp39-cp39-win32.whl", hash = "sha256:209596a4ec71d990d71d5e0d312ac935d86930e6eecff6ccc7007fe54d703808"}, - {file = "typed_ast-1.4.3-cp39-cp39-win_amd64.whl", hash = "sha256:9c6d1a54552b5330bc657b7ef0eae25d00ba7ffe85d9ea8ae6540d2197a3788c"}, - {file = "typed_ast-1.4.3.tar.gz", hash = "sha256:fb1bbeac803adea29cedd70781399c99138358c26d05fcbd23c13016b7f5ec65"}, -] types-psycopg2 = [ {file = "types-psycopg2-2.9.6.tar.gz", hash = "sha256:753b50b38da0e61bc8f89d149f2c4420c7e18535a87963d17b72343eb98f7c32"}, {file = "types_psycopg2-2.9.6-py3-none-any.whl", hash = "sha256:2cfd855e1562ebb5da595ee9401da93a308d69121ccd359cb8341f94ba4b6d1c"}, @@ -2123,7 +2028,3 @@ yapf = [ {file = 
"yapf-0.31.0-py2.py3-none-any.whl", hash = "sha256:e3a234ba8455fe201eaa649cdac872d590089a18b661e39bbac7020978dd9c2e"}, {file = "yapf-0.31.0.tar.gz", hash = "sha256:408fb9a2b254c302f49db83c59f9aa0b4b0fd0ec25be3a5c51181327922ff63d"}, ] -zipp = [ - {file = "zipp-3.7.0-py3-none-any.whl", hash = "sha256:b47250dd24f92b7dd6a0a8fc5244da14608f3ca90a5efcd37a3b1642fac9a375"}, - {file = "zipp-3.7.0.tar.gz", hash = "sha256:9f50f446828eb9d45b267433fd3e9da8d801f614129124863f9c51ebceafb87d"}, -] From 89e5659f3f4e163533ddf08bfb71495a8dabe2b7 Mon Sep 17 00:00:00 2001 From: Egor Suvorov Date: Sat, 21 May 2022 03:11:39 +0300 Subject: [PATCH 290/296] Replace COPYRIGHT file from the root with NOTICE file The primary reason: make GitHub detect that we use Apache License 2.0 They do it via https://github.com/licensee/licensee Ruby library (gem). Our COPYRIGHT file contains a part of the Apache License, which should be added to a source file, not the license or copyright information itself, which confuses the library. Instead, the recommended way is to create a NOTICE file which references license of the code and its bundled dependencies. --- COPYRIGHT | 20 -------------------- NOTICE | 5 +++++ 2 files changed, 5 insertions(+), 20 deletions(-) delete mode 100644 COPYRIGHT create mode 100644 NOTICE diff --git a/COPYRIGHT b/COPYRIGHT deleted file mode 100644 index 448363b12f..0000000000 --- a/COPYRIGHT +++ /dev/null @@ -1,20 +0,0 @@ -This software is licensed under the Apache 2.0 License: - ----------------------------------------------------------------------------- -Copyright 2021 Zenith Labs, Inc - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. 
-You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. ----------------------------------------------------------------------------- - -The PostgreSQL submodule in vendor/postgres is licensed under the -PostgreSQL license. See vendor/postgres/COPYRIGHT. diff --git a/NOTICE b/NOTICE new file mode 100644 index 0000000000..47cc4e798f --- /dev/null +++ b/NOTICE @@ -0,0 +1,5 @@ +Neon +Copyright 2022 Neon Inc. + +The PostgreSQL submodule in vendor/postgres is licensed under the +PostgreSQL license. See vendor/postgres/COPYRIGHT. From fbedd535c0c79e06c41b1a8d78e0bb74de74a848 Mon Sep 17 00:00:00 2001 From: chaitanya sharma <86035+phoenix24@users.noreply.github.com> Date: Mon, 23 May 2022 15:46:00 +0530 Subject: [PATCH 291/296] Replace a bunch of zenith references with neon. --- docs/glossary.md | 16 ++++++++-------- safekeeper/README.md | 4 ++-- safekeeper/README_PROTO.md | 2 +- 3 files changed, 11 insertions(+), 11 deletions(-) diff --git a/docs/glossary.md b/docs/glossary.md index ecc57b9ed1..a014446010 100644 --- a/docs/glossary.md +++ b/docs/glossary.md @@ -21,7 +21,7 @@ NOTE:It has nothing to do with PostgreSQL pg_basebackup. ### Branch -We can create branch at certain LSN using `zenith timeline branch` command. +We can create branch at certain LSN using `neon_local timeline branch` command. Each Branch lives in a corresponding timeline[] and has an ancestor[]. @@ -91,7 +91,7 @@ The layer map tracks what layers exist in a timeline. ### Layered repository -Zenith repository implementation that keeps data in layers. +Neon repository implementation that keeps data in layers. 
### LSN The Log Sequence Number (LSN) is a unique identifier of the WAL record[] in the WAL log. @@ -101,7 +101,7 @@ It is printed as two hexadecimal numbers of up to 8 digits each, separated by a Check also [PostgreSQL doc about pg_lsn type](https://www.postgresql.org/docs/devel/datatype-pg-lsn.html) Values can be compared to calculate the volume of WAL data that separates them, so they are used to measure the progress of replication and recovery. -In postgres and Zenith lsns are used to describe certain points in WAL handling. +In Postgres and Neon LSNs are used to describe certain points in WAL handling. PostgreSQL LSNs and functions to monitor them: * `pg_current_wal_insert_lsn()` - Returns the current write-ahead log insert location. @@ -111,13 +111,13 @@ PostgreSQL LSNs and functions to monitor them: * `pg_last_wal_replay_lsn ()` - Returns the last write-ahead log location that has been replayed during recovery. If recovery is still in progress this will increase monotonically. [source PostgreSQL documentation](https://www.postgresql.org/docs/devel/functions-admin.html): -Zenith safekeeper LSNs. For more check [safekeeper/README_PROTO.md](/safekeeper/README_PROTO.md) +Neon safekeeper LSNs. For more check [safekeeper/README_PROTO.md](/safekeeper/README_PROTO.md) * `CommitLSN`: position in WAL confirmed by quorum safekeepers. * `RestartLSN`: position in WAL confirmed by all safekeepers. * `FlushLSN`: part of WAL persisted to the disk by safekeeper. * `VCL`: the largerst LSN for which we can guarantee availablity of all prior records. -Zenith pageserver LSNs: +Neon pageserver LSNs: * `last_record_lsn` - the end of last processed WAL record. * `disk_consistent_lsn` - data is known to be fully flushed and fsync'd to local disk on pageserver up to this LSN. * `remote_consistent_lsn` - The last LSN that is synced to remote storage and is guaranteed to survive pageserver crash. @@ -132,7 +132,7 @@ This is the unit of data exchange between compute node and pageserver. 
### Pageserver -Zenith storage engine: repositories + wal receiver + page service + wal redo. +Neon storage engine: repositories + wal receiver + page service + wal redo. ### Page service @@ -184,10 +184,10 @@ relation exceeds that size, it is split into multiple segments. SLRUs include pg_clog, pg_multixact/members, and pg_multixact/offsets. There are other SLRUs in PostgreSQL, but they don't need to be stored permanently (e.g. pg_subtrans), -or we do not support them in zenith yet (pg_commit_ts). +or we do not support them in neon yet (pg_commit_ts). ### Tenant (Multitenancy) -Tenant represents a single customer, interacting with Zenith. +Tenant represents a single customer, interacting with Neon. Wal redo[] activity, timelines[], layers[] are managed for each tenant independently. One pageserver[] can serve multiple tenants at once. One safekeeper diff --git a/safekeeper/README.md b/safekeeper/README.md index 3f097d0c24..a4bb260932 100644 --- a/safekeeper/README.md +++ b/safekeeper/README.md @@ -1,6 +1,6 @@ # WAL service -The zenith WAL service acts as a holding area and redistribution +The neon WAL service acts as a holding area and redistribution center for recently generated WAL. The primary Postgres server streams the WAL to the WAL safekeeper, and treats it like a (synchronous) replica. A replication slot is used in the primary to prevent the @@ -94,7 +94,7 @@ Q: What if the compute node evicts a page, needs it back, but the page is yet A: If the compute node has evicted a page, changes to it have been WAL-logged (that's why it is called Write Ahead logging; there are some exceptions like index builds, but these are exceptions). These WAL records will eventually - reach the Page Server. The Page Server notes that the compute note requests + reach the Page Server. The Page Server notes that the compute node requests pages with a very recent LSN and will not respond to the compute node until a corresponding WAL is received from WAL safekeepers. 
diff --git a/safekeeper/README_PROTO.md b/safekeeper/README_PROTO.md index 5d79f8c2d3..6b2ae50254 100644 --- a/safekeeper/README_PROTO.md +++ b/safekeeper/README_PROTO.md @@ -151,7 +151,7 @@ It is assumed that in case of loosing local data by some safekeepers, it should * `RestartLSN`: position in WAL confirmed by all safekeepers. * `FlushLSN`: part of WAL persisted to the disk by safekeeper. * `NodeID`: pair (term,UUID) -* `Pager`: Zenith component restoring pages from WAL stream +* `Pager`: Neon component restoring pages from WAL stream * `Replica`: read-only computatio node * `VCL`: the largerst LSN for which we can guarantee availablity of all prior records. From 3ff5caf786e666c988c3d74d65e399d95d1b7ae6 Mon Sep 17 00:00:00 2001 From: KlimentSerafimov Date: Mon, 23 May 2022 13:11:59 -0400 Subject: [PATCH 292/296] Add to readme install protobuf etcd (#1777) * Update installation instructions * Added libprotobuf-dev etcd to apt install Added "brew install protobuf etcd" to OSX installation instructions. Added "sudo apt install libprotobuf-dev etcd" to Linux installation instructions. Without these, cargo build complains. Figured out in collaboration with Bojan. --- README.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index d5dccb7724..8e8bf1a9b2 100644 --- a/README.md +++ b/README.md @@ -30,7 +30,7 @@ Pageserver consists of: On Ubuntu or Debian this set of packages should be sufficient to build the code: ```text apt install build-essential libtool libreadline-dev zlib1g-dev flex bison libseccomp-dev \ -libssl-dev clang pkg-config libpq-dev +libssl-dev clang pkg-config libpq-dev libprotobuf-dev etcd ``` 2. [Install Rust](https://www.rust-lang.org/tools/install) @@ -52,9 +52,10 @@ make -j5 ``` #### building on OSX (12.3.1) -1. Install XCode +1. Install XCode and dependencies ``` xcode-select --install +brew install protobuf etcd ``` 2. 
[Install Rust](https://www.rust-lang.org/tools/install) From 2aceb6a3095bf0ee6cf7ef3ecc1bb182864abccb Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Mon, 23 May 2022 20:58:27 +0300 Subject: [PATCH 293/296] Fix garbage collection to not remove image layers that are still needed. The logic would incorrectly remove an image layer, if a new image layer existed, even though the older image layer was still needed by some delta layers after it. See example given in the comment this adds. Without this fix, I was getting a lot of "could not find data for key 010000000000000000000000000000000000" errors from GC, with the new test case being added in PR #1735. Fixes #707 --- pageserver/src/layered_repository.rs | 24 ++++++++++++------- .../src/layered_repository/layer_map.rs | 13 ++++------ 2 files changed, 20 insertions(+), 17 deletions(-) diff --git a/pageserver/src/layered_repository.rs b/pageserver/src/layered_repository.rs index fc4ab942f6..a83907430e 100644 --- a/pageserver/src/layered_repository.rs +++ b/pageserver/src/layered_repository.rs @@ -18,7 +18,7 @@ use itertools::Itertools; use lazy_static::lazy_static; use tracing::*; -use std::cmp::{max, Ordering}; +use std::cmp::{max, min, Ordering}; use std::collections::hash_map::Entry; use std::collections::HashMap; use std::collections::{BTreeSet, HashSet}; @@ -2165,7 +2165,7 @@ impl LayeredTimeline { let gc_info = self.gc_info.read().unwrap(); let retain_lsns = &gc_info.retain_lsns; - let cutoff = gc_info.cutoff; + let cutoff = min(gc_info.cutoff, disk_consistent_lsn); let pitr = gc_info.pitr; // Calculate pitr cutoff point. @@ -2294,12 +2294,20 @@ impl LayeredTimeline { // is 102, then it might not have been fully flushed to disk // before crash. // - // FIXME: This logic is wrong. See https://github.com/zenithdb/zenith/issues/707 - if !layers.newer_image_layer_exists( - &l.get_key_range(), - l.get_lsn_range().end, - disk_consistent_lsn + 1, - )? 
{ + // For example, imagine that the following layers exist: + // + // 1000 - image (A) + // 1000-2000 - delta (B) + // 2000 - image (C) + // 2000-3000 - delta (D) + // 3000 - image (E) + // + // If GC horizon is at 2500, we can remove layers A and B, but + // we cannot remove C, even though it's older than 2500, because + // the delta layer 2000-3000 depends on it. + if !layers + .image_layer_exists(&l.get_key_range(), &(l.get_lsn_range().end..new_gc_cutoff))? + { debug!( "keeping {} because it is the latest layer", l.filename().display() diff --git a/pageserver/src/layered_repository/layer_map.rs b/pageserver/src/layered_repository/layer_map.rs index 7491294c03..f7f51bf21f 100644 --- a/pageserver/src/layered_repository/layer_map.rs +++ b/pageserver/src/layered_repository/layer_map.rs @@ -201,18 +201,14 @@ impl LayerMap { NUM_ONDISK_LAYERS.dec(); } - /// Is there a newer image layer for given key-range? + /// Is there a newer image layer for given key- and LSN-range? /// /// This is used for garbage collection, to determine if an old layer can /// be deleted. - /// We ignore layers newer than disk_consistent_lsn because they will be removed at restart - /// We also only look at historic layers - //#[allow(dead_code)] - pub fn newer_image_layer_exists( + pub fn image_layer_exists( &self, key_range: &Range, - lsn: Lsn, - disk_consistent_lsn: Lsn, + lsn_range: &Range, ) -> Result { let mut range_remain = key_range.clone(); @@ -225,8 +221,7 @@ impl LayerMap { let img_lsn = l.get_lsn_range().start; if !l.is_incremental() && l.get_key_range().contains(&range_remain.start) - && img_lsn > lsn - && img_lsn < disk_consistent_lsn + && lsn_range.contains(&img_lsn) { made_progress = true; let img_key_end = l.get_key_range().end; From 8346aa3a29daf6088689076d35a9c99df3c9e4ce Mon Sep 17 00:00:00 2001 From: KlimentSerafimov Date: Tue, 24 May 2022 04:55:38 -0400 Subject: [PATCH 294/296] Potential fix to #1626. Fixed typo is Makefile. (#1781) * Potential fix to #1626. 
Fixed typo in Makefile. * Completed fix to #1626. Summary: changed 'error' to 'bail' in start_pageserver and start_safekeeper. --- Makefile | 2 +- pageserver/src/bin/pageserver.rs | 2 +- safekeeper/src/bin/safekeeper.rs | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/Makefile b/Makefile index 5eca7fb094..fdfc64f6fa 100644 --- a/Makefile +++ b/Makefile @@ -20,7 +20,7 @@ else ifeq ($(BUILD_TYPE),debug) PG_CONFIGURE_OPTS = --enable-debug --with-openssl --enable-cassert --enable-depend PG_CFLAGS = -O0 -g3 $(CFLAGS) else -$(error Bad build type `$(BUILD_TYPE)', see Makefile for options) + $(error Bad build type '$(BUILD_TYPE)', see Makefile for options) endif # macOS with brew-installed openssl requires explicit paths diff --git a/pageserver/src/bin/pageserver.rs b/pageserver/src/bin/pageserver.rs index 00864056cb..ac90500b97 100644 --- a/pageserver/src/bin/pageserver.rs +++ b/pageserver/src/bin/pageserver.rs @@ -254,7 +254,7 @@ fn start_pageserver(conf: &'static PageServerConf, daemonize: bool) -> Result<() // Otherwise, the coverage data will be damaged. match daemonize.exit_action(|| exit_now(0)).start() { Ok(_) => info!("Success, daemonized"), - Err(err) => error!(%err, "could not daemonize"), + Err(err) => bail!("{err}. could not daemonize. bailing."), } } diff --git a/safekeeper/src/bin/safekeeper.rs b/safekeeper/src/bin/safekeeper.rs index 61d2f558f2..a5ffc013e2 100644 --- a/safekeeper/src/bin/safekeeper.rs +++ b/safekeeper/src/bin/safekeeper.rs @@ -245,7 +245,7 @@ fn start_safekeeper(mut conf: SafeKeeperConf, given_id: Option, init: b // Otherwise, the coverage data will be damaged. match daemonize.exit_action(|| exit_now(0)).start() { Ok(_) => info!("Success, daemonized"), - Err(e) => error!("Error, {}", e), + Err(err) => bail!("Error: {err}. could not daemonize.
bailing."), } } From 541ec258758309b1ef98c24b5afe79169406d3b9 Mon Sep 17 00:00:00 2001 From: Kirill Bulatov Date: Tue, 24 May 2022 17:56:37 +0300 Subject: [PATCH 295/296] Properly shutdown test mock S3 server --- .circleci/config.yml | 2 +- test_runner/fixtures/zenith_fixtures.py | 7 +++++-- 2 files changed, 6 insertions(+), 3 deletions(-) diff --git a/.circleci/config.yml b/.circleci/config.yml index eb2bf0172b..41f7693726 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -361,7 +361,7 @@ jobs: when: always command: | du -sh /tmp/test_output/* - find /tmp/test_output -type f ! -name "pg.log" ! -name "pageserver.log" ! -name "safekeeper.log" ! -name "etcd.log" ! -name "regression.diffs" ! -name "junit.xml" ! -name "*.filediff" ! -name "*.stdout" ! -name "*.stderr" ! -name "flamegraph.svg" ! -name "*.metrics" -delete + find /tmp/test_output -type f ! -name "*.log" ! -name "regression.diffs" ! -name "junit.xml" ! -name "*.filediff" ! -name "*.stdout" ! -name "*.stderr" ! -name "flamegraph.svg" ! -name "*.metrics" -delete du -sh /tmp/test_output/* - store_artifacts: path: /tmp/test_output diff --git a/test_runner/fixtures/zenith_fixtures.py b/test_runner/fixtures/zenith_fixtures.py index 17d932c968..8f9bf1c11b 100644 --- a/test_runner/fixtures/zenith_fixtures.py +++ b/test_runner/fixtures/zenith_fixtures.py @@ -393,7 +393,10 @@ class MockS3Server: ): self.port = port - self.subprocess = subprocess.Popen([f'poetry run moto_server s3 -p{port}'], shell=True) + # XXX: do not use `shell=True` or add `exec ` to the command here otherwise. + # We use `self.subprocess.kill()` to shut down the server, which would not "just" work in Linux + # if a process is started from the shell process. + self.subprocess = subprocess.Popen(['poetry', 'run', 'moto_server', 's3', f'-p{port}']) error = None try: return_code = self.subprocess.poll() @@ -403,7 +406,7 @@ class MockS3Server: error = f"expected mock s3 server to start but it failed with exception: {e}. 
stdout: '{self.subprocess.stdout}', stderr: '{self.subprocess.stderr}'" if error is not None: log.error(error) - self.subprocess.kill() + self.kill() raise RuntimeError("failed to start s3 mock server") def endpoint(self) -> str: From d32b491a5300d99c9e2d7811944160185e23730c Mon Sep 17 00:00:00 2001 From: Sergey Melnikov Date: Wed, 25 May 2022 11:31:10 +0400 Subject: [PATCH 296/296] Add zenith-us-stage-sk-6 to deploy (#1728) --- .circleci/ansible/staging.hosts | 1 + 1 file changed, 1 insertion(+) diff --git a/.circleci/ansible/staging.hosts b/.circleci/ansible/staging.hosts index 8e89e843d9..d99ffa6dac 100644 --- a/.circleci/ansible/staging.hosts +++ b/.circleci/ansible/staging.hosts @@ -6,6 +6,7 @@ zenith-us-stage-ps-2 console_region_id=27 zenith-us-stage-sk-1 console_region_id=27 zenith-us-stage-sk-4 console_region_id=27 zenith-us-stage-sk-5 console_region_id=27 +zenith-us-stage-sk-6 console_region_id=27 [storage:children] pageservers