Update words in docs for better readability (#6600)

## Problem
Found typos while reading the docs

## Summary of changes
Fixed the typos found
Clarence
2024-02-03 01:59:39 +01:00
committed by GitHub
parent 0ac2606c8a
commit 3d1b08496a
12 changed files with 45 additions and 45 deletions


@@ -78,7 +78,7 @@ with grpc streams and tokio mpsc channels. The implementation description is at
It is just 500 lines of code and core functionality is complete. 1-1 pub sub
gives about 120k received messages per second; having multiple subscribers in
different connections quickly scales to 1 million received messages per second.
I had concerns about many concurrent streams in a single connection, but 2^20
subscribers still work (though they eat memory: with 10 publishers, 20GB are consumed;
in this implementation each publisher holds a full copy of all subscribers). There
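A minimal sketch of the fan-out pattern this hunk describes, using tokio mpsc channels as named above. This is illustrative only, not the broker's actual code; the channel capacity and message/subscriber counts are made up:

```rust
use tokio::sync::mpsc;

#[derive(Debug)]
struct Message(u64);

#[tokio::main]
async fn main() {
    // One channel per subscriber; the publisher fans every message out to all of them.
    let mut senders = Vec::new();
    let mut receivers = Vec::new();
    for _ in 0..4 {
        let (tx, rx) = mpsc::channel::<Message>(1024);
        senders.push(tx);
        receivers.push(rx);
    }

    // Publisher: holds a full copy of all subscriber senders (the memory cost noted above).
    let publisher = tokio::spawn(async move {
        for i in 0..10u64 {
            for tx in &senders {
                // send() backpressures when a subscriber's buffer is full.
                let _ = tx.send(Message(i)).await;
            }
        }
        // senders are dropped here; subscribers see their channels close.
    });

    // Subscribers: drain their channels until the publisher hangs up.
    let mut handles = Vec::new();
    for mut rx in receivers {
        handles.push(tokio::spawn(async move {
            let mut count = 0u64;
            while let Some(_msg) = rx.recv().await {
                count += 1;
            }
            count
        }));
    }

    publisher.await.unwrap();
    for h in handles {
        assert_eq!(h.await.unwrap(), 10);
    }
}
```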
@@ -95,12 +95,12 @@ other members, with best-effort this is simple.
### Security implications
Communication happens in a private network that is not exposed to users;
additionally we can add auth to the broker.
## Alternative: get existing pub-sub
We could take an existing pub-sub solution, e.g. RabbitMQ or Redis. But in this
case IMV the simplicity of our own outweighs the external dependency costs (RabbitMQ is
much more complicated and needs a VM; Redis Rust client maintenance is not
ideal...). Also note that projects like CockroachDB and TiDB are based on gRPC
as well.


@@ -74,7 +74,7 @@ TenantMaintenanceGuard: Like ActiveTenantGuard, but can be held even when the
tenant is not in Active state. Used for operations like attach/detach. Perhaps
allow only one such guard on a Tenant at a time.
Similarly for Timelines. We don't currently have a "state" on Timeline, but I think
we need at least two states: Active and Stopping. The Stopping state is used at
deletion, to prevent new TimelineActiveGuards from appearing, while you wait for
existing TimelineActiveGuards to die out.
@@ -85,7 +85,7 @@ have a TenantActiveGuard, and the tenant's state changes from Active to
Stopping, the is_shutdown_requested() function should return true, and the
shutdown_watcher() future should return.
This signaling doesn't necessarily need to cover all cases. For example, if you
have a block of code in spawn_blocking(), it might be acceptable if
is_shutdown_requested() doesn't return true even though the tenant is in
Stopping state, as long as the code finishes reasonably fast.
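A minimal sketch of this signaling, assuming a tokio watch channel carries the tenant state; the guard type here is a hypothetical stand-in, not the real pageserver code:

```rust
use tokio::sync::watch;

#[derive(Clone, Copy, PartialEq, Debug)]
enum TenantState {
    Active,
    Stopping,
}

// Hypothetical guard shape; the real TenantActiveGuard may differ.
struct TenantActiveGuard {
    state: watch::Receiver<TenantState>,
}

impl TenantActiveGuard {
    fn is_shutdown_requested(&self) -> bool {
        *self.state.borrow() == TenantState::Stopping
    }

    // Resolves once the tenant leaves Active; long-running tasks can select! on this.
    async fn shutdown_watcher(&mut self) {
        while *self.state.borrow() != TenantState::Stopping {
            if self.state.changed().await.is_err() {
                return; // sender dropped: tenant is gone, treat as shutdown
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = watch::channel(TenantState::Active);
    let mut guard = TenantActiveGuard { state: rx };
    assert!(!guard.is_shutdown_requested());

    // Flipping the state to Stopping is the shutdown signal.
    tx.send(TenantState::Stopping).unwrap();
    guard.shutdown_watcher().await;
    assert!(guard.is_shutdown_requested());
}
```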


@@ -37,7 +37,7 @@ sequenceDiagram
```
At this point it is not possible to restore from the index: it contains L2, which
is no longer available in s3, and doesn't contain L3, added by compaction by the
first pageserver. So if any of the pageservers restart, initial sync will fail
(or in the on-demand world it will fail a bit later, during a page request from a
missing layer)
@@ -74,7 +74,7 @@ One possible solution for relocation case is to orchestrate background jobs
from outside. The oracle who runs the migration can turn off background jobs on
PS1 before migration, then run the migration -> enable them on PS2. The problem
comes if the migration fails. In this case, in order to resume background jobs,
the oracle needs to guarantee that PS2 doesn't run background jobs, and if it doesn't
respond, then PS1 is stuck, unable to run compaction/gc. This cannot be solved
without a human ensuring that no upload from PS2 can happen. In order to be able
to resolve this automatically, CAS is required on the S3 side so the pageserver can
@@ -128,7 +128,7 @@ During discussion it seems that we converged on the approach consisting of:
whether we need to apply a change to the index state or not.
- Responsibility for running background jobs is assigned externally. The pageserver
keeps a locally persistent flag for each tenant that indicates whether this
pageserver is considered the primary one or not. TODO: what happens if we
crash and cannot start for some extended period of time? The control plane can
assign ownership to some other pageserver. The pageserver needs some way to check
if it's still the blessed one. Maybe by explicit request to the control plane on
@@ -138,7 +138,7 @@ Requirement for deterministic layer generation was considered overly strict
for two reasons:
- It can limit possible optimizations, e.g. when a pageserver wants to reshuffle
some data locally and doesn't want to coordinate this
- The deterministic algorithm itself can change, so during deployments for some
time there will be two different versions running at the same time, which can
cause non-determinism
@@ -164,7 +164,7 @@ sequenceDiagram
CP->>PS1: Yes
deactivate CP
PS1->>S3: Fetch PS1 index.
note over PS1: Continue operations, start background jobs
note over PS1,PS2: PS1 starts up and is not the leader anymore
PS1->>CP: Am I still the leader for Tenant X?
CP->>PS1: No
@@ -203,7 +203,7 @@ sequenceDiagram
### Eviction
When two pageservers operate on a tenant for an extended period of time, the follower
doesn't perform write operations in s3. When a layer is evicted, the follower relies
on updates from the primary to get info about the layers it needs to cover the range of the
evicted layer.


@@ -4,7 +4,7 @@ Created on 08.03.23
## Motivation
Currently we don't delete the pageserver part of the data from s3 when a project is deleted. (The same is true for safekeepers, but that is outside the scope of this RFC.)
This RFC aims to spin up a discussion to come to a robust deletion solution that won't put us into a corner for features like postponed deletion (when we keep data so that the user is able to restore a project if it was deleted by accident)
@@ -75,9 +75,9 @@ Remote one is needed for cases when pageserver is lost during deletion so other
Why is the local mark file needed?
If we don't have one, we have two choices: delete local data before deleting the remote part, or do that after.
If we delete local data before remote, then during restart the pageserver won't pick up the remote tenant at all, because nothing is available locally (the pageserver looks for remote counterparts of locally available tenants).
If we delete local data after remote, then at the end of the sequence, when the remote mark file is deleted, a pageserver restart leaves the state identical to the situation where the pageserver is just missing data on remote, without knowing that this data is intended to be deleted. In this case the current behavior is to upload everything local-only to remote.
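To make the ordering concrete, here is a hedged sketch of the deletion sequence implied above; all helper names are hypothetical stand-ins, not the real pageserver API:

```rust
fn delete_tenant() -> std::io::Result<()> {
    // Local mark first: after a restart we can tell "deletion in progress"
    // apart from "remote data simply missing", so nothing gets re-uploaded.
    put_local_deletion_mark()?;
    // Remote mark second: it survives loss of this pageserver, so another
    // pageserver can finish the deletion.
    put_remote_deletion_mark()?;
    delete_remote_data()?;
    delete_local_data()?;
    // After this point a status poll returns 404, which the control plane
    // treats as success.
    delete_remote_deletion_mark()?;
    Ok(())
}

// Stubs so the sketch compiles; real implementations would hit disk/S3.
fn put_local_deletion_mark() -> std::io::Result<()> { Ok(()) }
fn put_remote_deletion_mark() -> std::io::Result<()> { Ok(()) }
fn delete_remote_data() -> std::io::Result<()> { Ok(()) }
fn delete_local_data() -> std::io::Result<()> { Ok(()) }
fn delete_remote_deletion_mark() -> std::io::Result<()> { Ok(()) }

fn main() {
    delete_tenant().unwrap();
}
```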
@@ -145,7 +145,7 @@ sequenceDiagram
CP->>PS: Retry delete tenant
PS->>CP: Not modified
else Mark is missing
note over PS: Continue to operate the tenant as if deletion didn't happen
note over CP: Eventually console should <br> retry delete request
@@ -168,7 +168,7 @@ sequenceDiagram
PS->>CP: True
```
A similar sequence applies when both local and remote marks were persisted but the Control Plane still didn't receive a response.
If the pageserver crashes after both mark files were deleted, then it will reply to the control plane status poll request with 404, which should be treated by the control plane as success.
@@ -187,7 +187,7 @@ If pageseserver is lost then the deleted tenant should be attached to different
##### Restrictions for a tenant that is in the process of being deleted
I propose to add another state to tenant/timeline - PendingDelete. This state shouldn't allow executing any operations aside from polling the deletion status.
#### Summary
@@ -237,7 +237,7 @@ New branch gets created
PS1 starts up (is it possible, or do we just recycle it?)
PS1 is unaware of the new branch. It can either fall back to s3 ls, or ask the control plane.
So here comes the dependency of storage on the control plane. During restart, storage needs to know which timelines are valid for operation. If there is nothing on s3 that can answer that question, storage needs to ask the control plane.
### Summary
@@ -250,7 +250,7 @@ Cons:
Pros:
- Easier to reason about if you don't have to account for pageserver restarts
### Extra notes
@@ -262,7 +262,7 @@ Delayed deletion can be done with both approaches. As discussed with Anna (@step
After discussion in the comments I see that we settled on two options (though a bit different from the ones described in the RFC). The first one is the same - the pageserver owns as much as possible. The second option is that the pageserver owns the markers, but actual deletion happens in the control plane by repeatedly calling ls + delete.
To my mind the only benefit of the latter approach is possible code reuse between safekeepers and pageservers. Otherwise, integrating an s3 library into the control plane and configuring shared knowledge about paths in s3 are downsides. Another downside of relying on the control plane is the testing process. The control plane resides in a different repository, so it is quite hard to test pageserver-related changes there. The e2e test suite there doesn't support shutting down pageservers, which are separate docker containers there instead of just processes.
With the pageserver owning everything we still give the retry logic to the control plane, but it's easier to duplicate if needed compared to sharing the inner s3 workings. We will have the needed tests for the retry logic in the neon repo.


@@ -75,7 +75,7 @@ sequenceDiagram
```
At this point it is not possible to restore the state from the index: it contains L2, which
is no longer available in s3, and doesn't contain L3, added by compaction by the
first pageserver. So if any of the pageservers restart, initial sync will fail
(or in the on-demand world it will fail a bit later, during a page request from a
missing layer)
@@ -171,7 +171,7 @@ sequenceDiagram
Another problem is the possibility of concurrent branch creation calls.
I.e., during migration, create_branch can be called on the old pageserver, and the newly created branch won't be seen on the new pageserver. Prior art includes prototyping an approach of trying to mirror such branches, but it has since lost its importance, because attach is now fast (we don't need to download all data), and additionally, to the best of my knowledge of control plane internals (cc @ololobus to confirm), operations on one project are executed sequentially, so it is not possible to have such a case. So the branch create operation will be executed only when relocation is completed. As a safety measure we can forbid branch creation for tenants that are in the readonly remote state.
## Simplistic approach


@@ -55,7 +55,7 @@ When PostgreSQL requests a file, `compute_ctl` downloads it.
PostgreSQL requests files in the following cases (sketched below):
- When loading a preload library set in `local_preload_libraries`
- When explicitly loading a library with `LOAD`
- When creating an extension with `CREATE EXTENSION` (download sql scripts, (optional) extension data files, and (optional) library files)
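A hedged sketch of the three cases and what each one pulls down; the types and file names are hypothetical, not `compute_ctl`'s real API:

```rust
#[derive(Debug)]
enum DownloadTrigger {
    PreloadLibrary,  // local_preload_libraries
    ExplicitLoad,    // LOAD
    CreateExtension, // CREATE EXTENSION
}

fn files_to_fetch(trigger: &DownloadTrigger, name: &str) -> Vec<String> {
    match trigger {
        // Library loads only need the shared object itself.
        DownloadTrigger::PreloadLibrary | DownloadTrigger::ExplicitLoad => {
            vec![format!("{name}.so")]
        }
        // CREATE EXTENSION needs sql scripts plus optional data and library files.
        DownloadTrigger::CreateExtension => vec![
            format!("{name}.control"),
            format!("{name}--1.0.sql"), // illustrative version string
            format!("{name}.so"),       // optional, only if the extension has one
        ],
    }
}

fn main() {
    println!("{:?}", files_to_fetch(&DownloadTrigger::CreateExtension, "pg_example"));
}
```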
#### Summary


@@ -26,7 +26,7 @@ plane guarantee prevents robust response to failures, as if a pageserver is unre
we may not detach from it. The mechanism in this RFC fixes this, by making it safe to
attach to a new, different pageserver even if an unresponsive pageserver may be running.
Further, lack of safety during split-brain conditions blocks two important features where occasional
split-brain conditions are part of the design assumptions:
- seamless tenant migration ([RFC PR](https://github.com/neondatabase/neon/pull/5029))
@@ -490,11 +490,11 @@ The above makes it safe for control plane to change the assignment of
tenant to pageserver in the control plane while a timeline creation is ongoing.
The reason is that the creation request against the newly assigned pageserver
uses a new generation number. However, care must be taken by the control plane
to ensure that a "timeline creation successful" response from some pageserver
is checked for the pageserver's generation for that timeline's tenant still being the latest.
If it is not the latest, the response does not constitute a successful timeline creation.
It is acceptable to discard such responses; the scrubber will clean up the S3 state.
It is better to issue a timeline deletion request to the stale attachment.
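A hedged sketch of that control-plane check; the types are hypothetical, not the control plane's real code:

```rust
#[derive(PartialEq, Eq, Clone, Copy, Debug)]
struct Generation(u32);

struct CreationResponse {
    pageserver_generation: Generation,
}

fn creation_succeeded(resp: &CreationResponse, latest: Generation) -> bool {
    // A stale generation means the response must be discarded; the scrubber
    // (or an explicit timeline deletion to the stale attachment) cleans up S3.
    resp.pageserver_generation == latest
}

fn main() {
    let latest = Generation(5);
    assert!(creation_succeeded(
        &CreationResponse { pageserver_generation: Generation(5) },
        latest
    ));
    assert!(!creation_succeeded(
        &CreationResponse { pageserver_generation: Generation(4) },
        latest
    ));
}
```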
#### Timeline Deletion
@@ -633,7 +633,7 @@ As outlined in the Part 1 on correctness, it is critical that deletions are only
executed once the key is not referenced anywhere in S3.
This property is obviously upheld by the scheme above.
#### We Accept Object Leakage In Acceptable Circumstances
If we crash in the flow above between (2) and (3), we lose track of the unreferenced object.
Further, enqueuing a single entry to the persistent queue may not be durable immediately, to amortize the cost of flushing to disk.


@@ -162,7 +162,7 @@ struct Tenant {
...
txns: HashMap<TxnId, Transaction>,
// the most recently started txn's id; only the most recently started can win
next_winner_txn: Option<TxnId>,
}
struct Transaction {
@@ -186,7 +186,7 @@ A transaction T in state Committed has subsequent transactions that may or may n
So, for garbage collection, we need to assess transactions in state Committed and RejectAcknowledged (sketched below):
- Committed: delete objects on the deadlist.
- We don't need a LIST request here; the deadlist is sufficient. So, it's really cheap.
- This is **not true MVCC garbage collection**; by deleting the objects on Committed transaction T's deadlist, we might delete data referenced by other transactions that were concurrent with T, i.e., they started while T was still open. However, the fact that T is committed means that the other transactions are RejectPending or RejectAcknowledged, so they don't matter. Pageservers executing these doomed RejectPending transactions must handle 404 for GETs gracefully, e.g., by trying to commit the txn so they observe the rejection they're destined to get anyway. 404s for RejectAcknowledged are handled below.
- RejectAcknowledged: delete all objects created in that txn, and discard deadlists.
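A hedged sketch of this GC decision, with hypothetical types mirroring the Transaction states in this RFC:

```rust
enum TxnState {
    Open,
    Committed,
    RejectPending,
    RejectAcknowledged,
}

struct Txn {
    state: TxnState,
    deadlist: Vec<String>,        // objects made unreachable by this txn
    created_objects: Vec<String>, // objects PUT within this txn
}

fn gc_deletions(txn: &Txn) -> Vec<String> {
    match txn.state {
        // Committed: the deadlist alone says what to delete; no LIST request needed.
        TxnState::Committed => txn.deadlist.clone(),
        // RejectAcknowledged: the txn's own writes are garbage; its deadlist is discarded.
        TxnState::RejectAcknowledged => txn.created_objects.clone(),
        // Open / RejectPending: nothing can be reclaimed yet.
        TxnState::Open | TxnState::RejectPending => Vec::new(),
    }
}

fn main() {
    let txn = Txn {
        state: TxnState::Committed,
        deadlist: vec!["old-layer".into()],
        created_objects: vec!["new-layer".into()],
    };
    // Only the deadlist is deleted; objects the txn created stay referenced.
    assert_eq!(gc_deletions(&txn), ["old-layer"]);
}
```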
@@ -242,15 +242,15 @@ If a pageserver is unresponsive from Control Planes / Computes perspective
At this point, availability is restored and user pain relieved.
What's left is to somehow close the doomed transaction of the unresponsive pageserver, so that it becomes RejectAcknowledged, and GC can make progress. Since S3 is cheap, we can afford to wait a really long time here, especially if we put a soft bound on the amount of data a transaction may produce before it must commit. Procedure:
1. Ensure the unresponsive pageserver is taken out of rotation for new attachments. That probably should happen as part of the routine above.
2. Make a human operator investigate and decide what to do (next morning, NO ONCALL ALERT):
1. Inspect the instance, investigate logs, understand the root cause.
2. Try to re-establish connectivity between pageserver and Control Plane so that the pageserver can retry commits, get rejected, ack the rejection ⇒ enable GC.
3. Use the procedure below to decommission the pageserver.
### Decommissioning A Pageserver (Dead or Alive-but-Unresponsive)
The solution, enabled by this proposal:
@@ -310,7 +310,7 @@ Issues that we discussed:
1. In abstract terms, this proposal provides a linearized history for a given S3 prefix.
2. In concrete terms, this proposal provides a linearized history per tenant.
3. There can be multiple writers at a given time, but only one of them will win to become part of the linearized history.
4. **Alternative ideas mentioned during meetings that should be turned into a written proposal like this one:**
1. @Dmitry Rodionov : having linearized storage of index_part.json in some database that allows serializable transactions / atomic compare-and-swap PUT
2. @Dmitry Rodionov :
3. @Stas : something like this scheme, but somehow find a way to equate attachment duration with transaction duration, without losing work if the pageserver dies months after attachment.


@@ -54,7 +54,7 @@ If the compaction algorithm doesn't change between the two compaction runs, is d
*However*:
1. the file size of the overwritten L1s may not be identical, and
2. the bit pattern of the overwritten L1s may not be identical, and
3. in the future, we may want to make the compaction code non-deterministic, influenced by past access patterns, or otherwise change it, resulting in L1 overwrites with a different set of delta records than before the overwrite
The items above are a problem for the [split-brain protection RFC](https://github.com/neondatabase/neon/pull/4919) because it assumes that layer files in S3 are only ever deleted, but never replaced (overPUTted).
@@ -63,7 +63,7 @@ But node B based its world view on the version of node A's `index_part.json` fro
That earlier `index_part.json` contained the file size of the pre-overwrite L1.
If the overwritten L1 has a different file size, node B will refuse to read data from the overwritten L1.
Effectively, the data in the L1 has become inaccessible to node B.
If node B already uploaded an index part itself, all subsequent attachments will use node B's index part, and run into the same problem.
If we ever introduce checksums instead of checking just the file size, then a mismatching bit pattern (2) will cause similar problems.
@@ -121,7 +121,7 @@ Multi-object changes that previously created and removed files in timeline dir a
* atomic `index_part.json` update in S3, as per the guarantee that S3 PUT is atomic
* local timeline dir state:
* irrelevant for layer map content => irrelevant for atomic updates / crash consistency
* if we crash after the index part PUT, local layer files will be used, so no on-demand downloads are needed for them
* if we crash before the index part PUT, local layer files will be deleted (see the sketch below)
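A hedged sketch of why the single `index_part.json` PUT is the commit point for a multi-object change; the helpers are hypothetical, not the real upload queue:

```rust
fn apply_layer_map_change(new_layers: &[&str], index_part_json: &str) -> std::io::Result<()> {
    for layer in new_layers {
        // New layers are unreferenced until the index is updated; if we crash
        // here, they merely leak and can be cleaned up later.
        s3_put(layer)?;
    }
    // The commit point: one atomic PUT swaps in the new layer map.
    s3_put(index_part_json)
}

// Stub so the sketch compiles; the real call uploads to S3.
fn s3_put(_key: &str) -> std::io::Result<()> { Ok(()) }

fn main() {
    apply_layer_map_change(&["L1-a", "L1-b"], "index_part.json").unwrap();
}
```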
## Trade-Offs
@@ -140,7 +140,7 @@ Assuming upload queue allows for unlimited queue depth (that's what it does toda
* wal ingest: currently unbounded
* L0 => L1 compaction: CPU time proportional to `O(sum(L0 size))` and upload work proportional to `O()`
* Compaction threshold is 10 L0s and each L0 can be up to 256M in size. Target size for L1 is 128M.
* In practice, most L0s are tiny due to the 10-minute `DEFAULT_CHECKPOINT_TIMEOUT`.
* image layer generation: CPU time `O(sum(input data))` + upload work `O(sum(new image layer size))`
* I have no intuition how expensive / long-running it is in reality.
* gc: `update_gc_info` work (not substantial, AFAIK)
@@ -158,7 +158,7 @@ Pageserver crashes are very rare ; it would likely be acceptable to re-do the lo
However, regular pageserver restarts happen frequently, e.g., during weekly deploys.
In general, pageserver restart faces the problem of tenants that "take too long" to shut down.
They are a problem because other tenants that shut down quickly are unavailable while we wait for the slow tenants to shut down.
We currently allot 10 seconds for graceful shutdown until we SIGKILL the pageserver process (as per the `pageserver.service` unit file).
A longer budget would expose tenants that are done early to a longer downtime.
A short budget would risk throwing away more work that'd have to be re-done after restart.
@@ -236,7 +236,7 @@ tenants/$tenant/timelines/$timeline/$key_and_lsn_range
tenants/$tenant/timelines/$timeline/$layer_file_id-$key_and_lsn_range
```
To guarantee uniqueness, the unique number is a sequence number, stored in `index_part.json`.
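A hedged illustration of this alternative: a monotonically increasing sequence number (stored in `index_part.json`) prefixes the layer file name, so a later compaction run can never overwrite an earlier file in place. The range string is a made-up placeholder:

```rust
fn layer_file_name(seq: u64, key_and_lsn_range: &str) -> String {
    format!("{seq:08}-{key_and_lsn_range}")
}

fn main() {
    let a = layer_file_name(41, "KEYRANGE__LSNRANGE");
    let b = layer_file_name(42, "KEYRANGE__LSNRANGE");
    // Same key/LSN range, distinct names: no overwrite.
    assert_ne!(a, b);
}
```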
This alternative does not solve atomic layer map updates.
In our crash-during-compaction scenario above, the compaction run after the crash will not overwrite the L1s, but write/PUT new files with new sequence numbers.
@@ -246,11 +246,11 @@ We'd need to write a deduplication pass that checks if perfectly overlapping lay
However, this alternative is appealing because it systematically prevents overwrites at a lower level than this RFC.
So, this alternative is sufficient for the needs of the split-brain safety RFC (immutable layer files locally and in S3).
But it doesn't solve the problems with crash-during-compaction outlined earlier in this RFC, and in fact, makes them much more acute.
The proposed design in this RFC addresses both.
So, if this alternative sounds appealing, we should implement the proposal in this RFC first, then implement this alternative on top.
That way, we avoid a phase where the crash-during-compaction problem is acute.
## Related issues


@@ -596,4 +596,4 @@ pageservers are updated to be aware of it.
As well as simplifying implementation, putting heatmaps in S3 will be useful
for future analytics purposes -- gathering aggregated statistics on activity
patterns across many tenants may be done directly from data in S3.


@@ -147,7 +147,7 @@ Separating corrupt writes from non-corrupt ones is a hard problem in general,
and if the application was involved in making the corrupt write, a recovery
would also involve the application. Therefore, corruption that has made it into
the WAL is outside of the scope of this feature. However, the WAL replay can be
issued to right before the point in time where the corruption occurred. Then the
data loss is isolated to post-corruption writes only.
## Impacted components (e.g. pageserver, safekeeper, console, etc)
@@ -161,7 +161,7 @@ limits and billing we apply to existing timelines.
## Proposed implementation
The first problem to keep in mind is the reproducibility of `initdb`.
So an initial step would be to upload `initdb` snapshots to S3.
After that, we'd have the endpoint spawn a background process which


@@ -69,7 +69,7 @@ However, unlike above, an ideal solution will
* This means, read each `DiskBtree` page at most once.
* Facilitate merging of the reads we issue to the OS and eventually NVMe.
Each of these items above represents a significant amount of work.
## Performance