- There was an issue with zero commit_lsn `reason: LaggingWal { current_commit_lsn: 0/0, new_commit_lsn: 1/6FD90D38, threshold: 10485760 } }`. The problem was in `send_wal.rs`, where we initialized `end_pos = Lsn(0)` and in some cases sent it to the pageserver.
- IDENTIFY_SYSTEM previously returned `flush_lsn` as the physical end of WAL. Now it returns `flush_lsn` (as before) to walproposer and `commit_lsn` to everyone else, including the pageserver.
- There was an issue with backoff where the connection was cancelled right after initialization: `connected!` -> `safekeeper_handle_db: Connection cancelled` -> `Backoff: waiting 3 seconds`. The problem was sleeping before establishing the connection. This is fixed by reworking the retry logic.
- There was an issue with getting the `NoKeepAlives` reason in a loop. The issue is probably the same as the previous one.
- There was an issue with filtering safekeepers based on retry attempts, which could filter out some safekeepers indefinitely. This is fixed by using a retry cooldown duration instead of the number of retry attempts.
- Some `send_wal.rs` connections failed with errors that lacked context. This is fixed by adding the timeline to safekeeper errors.
New retry logic works like this (a minimal sketch follows the list):
- Every candidate has a `next_retry_at` timestamp and is not considered for connection until that moment.
- When a walreceiver connection is closed, we update `next_retry_at` using exponential backoff, increasing the cooldown on every disconnect.
- When `last_record_lsn` was advanced using WAL from the safekeeper, we reset the retry cooldown and exponential backoff, allowing walreceiver to reconnect to the same safekeeper instantly.
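Roughly, in code (the names `Candidate`, `on_disconnect`, `on_progress` are invented for illustration, not the actual walreceiver types):

```rust
use std::time::{Duration, Instant};

// Hypothetical per-safekeeper connection candidate.
struct Candidate {
    /// Do not consider this safekeeper for a connection before this moment.
    next_retry_at: Option<Instant>,
    /// Current cooldown; grows exponentially on every disconnect.
    retry_cooldown: Duration,
}

const INITIAL_COOLDOWN: Duration = Duration::from_millis(100);
const MAX_COOLDOWN: Duration = Duration::from_secs(30);

impl Candidate {
    /// Called when the walreceiver connection to this safekeeper is closed.
    fn on_disconnect(&mut self) {
        self.next_retry_at = Some(Instant::now() + self.retry_cooldown);
        // Exponential backoff, capped so a flaky safekeeper is still retried.
        self.retry_cooldown = (self.retry_cooldown * 2).min(MAX_COOLDOWN);
    }

    /// Called when last_record_lsn advanced using WAL from this safekeeper:
    /// reset the backoff so we may reconnect to it instantly.
    fn on_progress(&mut self) {
        self.next_retry_at = None;
        self.retry_cooldown = INITIAL_COOLDOWN;
    }

    /// Whether this safekeeper may be considered for a new connection now.
    fn is_eligible(&self, now: Instant) -> bool {
        self.next_retry_at.map_or(true, |t| now >= t)
    }
}
```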
Re-export only things that are used by other modules.
In the future, I'm imagining that we run bindgen twice, for Postgres
v14 and v15. The two sets of bindings would go into separate
'bindings_v14' and 'bindings_v15' modules.
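A hypothetical layout for that (the module names follow the text above; the generated file paths and lint attributes are assumptions):

```rust
// postgres_ffi/src/lib.rs (sketch): version-specific bindgen output goes
// into separate modules, so callers can pick the right Postgres version.
pub mod bindings_v14 {
    #![allow(non_upper_case_globals, non_camel_case_types, non_snake_case)]
    include!(concat!(env!("OUT_DIR"), "/bindings_v14.rs"));
}

pub mod bindings_v15 {
    #![allow(non_upper_case_globals, non_camel_case_types, non_snake_case)]
    include!(concat!(env!("OUT_DIR"), "/bindings_v15.rs"));
}
```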
Rearrange postgres_ffi modules.
Move a function to avoid a Postgres version dependency in timelines.rs.
Move the function that generates a logical-message WAL record to postgres_ffi.
Flush the in-memory layer eventually when no new data arrives; this helps
safekeepers suspend activity (stop pushing to the broker). The default of 10m
should be fine.
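A rough sketch of the idea, with assumed names (`OpenLayerState`, `last_record_received_at`) that may not match the actual pageserver code:

```rust
use std::time::{Duration, Instant};

/// Assumed default: flush the open in-memory layer after 10 minutes of
/// inactivity so safekeepers can stop pushing to the broker.
const DEFAULT_FLUSH_TIMEOUT: Duration = Duration::from_secs(10 * 60);

struct OpenLayerState {
    /// When the last WAL record was applied to the in-memory layer.
    last_record_received_at: Instant,
    has_unflushed_data: bool,
}

/// Flush even without new data once the timeout has passed.
fn should_flush(layer: &OpenLayerState, now: Instant, timeout: Duration) -> bool {
    layer.has_unflushed_data
        && now.duration_since(layer.last_record_received_at) >= timeout
}
```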
Reorganize existing READMEs and other documentation files into mdbook
format. The resulting Table of Contents is a mix of placeholders for
docs that we should write, and documentation files that we already had,
dropped into the most appropriate place.
Update the Pageserver overview diagram. Add sections on thread
management and WAL redo processes.
Add all the RFCs to the mdbook Table of Contents too.
Per GitHub issue #1979
On receipt of the ProposerElected message, WAL is truncated at the streaming point; this
code expected that, once a vote is given for the proposer / the term switch has happened,
flush_lsn can be advanced only by this proposer (or a higher one). However, that
didn't take into account the possibility of accumulating written WAL and flushing it
after the vote is given -- flushing goes without term checks -- which eventually led
to the violation in question.
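A heavily simplified sketch of the broken assumption (types and field names are invented for illustration, not taken from the safekeeper code):

```rust
type Term = u64;
type Lsn = u64;

struct AcceptorState {
    /// Term we have voted for / switched to on ProposerElected.
    term: Term,
    /// Term of the WAL that is currently written but not yet flushed.
    last_written_term: Term,
    write_lsn: Lsn,
    flush_lsn: Lsn,
}

impl AcceptorState {
    /// The assumption was that flush_lsn only advances with WAL from the
    /// current (or a higher) term after a vote. A flush of WAL accumulated
    /// under an older term, done without a term check, breaks that:
    fn flush_accumulated_wal(&mut self) {
        // Buggy version: unconditionally advance flush_lsn, even for WAL
        // written under an older term that ProposerElected may truncate.
        // self.flush_lsn = self.write_lsn;

        // Term-checked version matching the stated invariant:
        if self.last_written_term == self.term {
            self.flush_lsn = self.write_lsn;
        }
    }
}
```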
ref #2048
"cargo clippy" started to complain about these, after running "cargo
update". Not sure why it didn't complain before, but seems reasonable to
fix these. (The "cargo update" is not included in this commit)
Mitigates the latency penalty, making push throughput 1-1.5 orders of magnitude higher.
Also make leases per timeline, not per whole safekeeper, to avoid storing
garbage in etcd for deleted timelines while the safekeeper is alive.
* `control_plane` crate (used by `neon_local`) now parses an `auth_enabled` bool for each Safekeeper
* If auth is enabled, a Safekeeper is passed a path to a public key via a new command line argument (see the sketch after this list)
* Added TODO comments to other places needing auth
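A hypothetical sketch of the first two points (field and flag names are assumptions, not the actual `control_plane` code):

```rust
use serde::Deserialize;

// Per-safekeeper section of the neon_local config (sketch).
#[derive(Deserialize)]
struct SafekeeperConf {
    id: u64,
    pg_port: u16,
    http_port: u16,
    /// When true, the safekeeper is started with a public key for
    /// validating JWT tokens.
    #[serde(default)]
    auth_enabled: bool,
}

/// Build extra command line arguments for starting this safekeeper.
fn auth_args(conf: &SafekeeperConf, public_key_path: &str) -> Vec<String> {
    if conf.auth_enabled {
        // Hypothetical flag name; the real argument may be spelled differently.
        vec![format!("--auth-validation-public-key-path={public_key_path}")]
    } else {
        Vec::new()
    }
}
```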
I made the check at the launcher level, with the idea of eventually moving the
election (the decision of who offloads) there.
Also log timeline 'active' changes.
I still don't like the surrounding code and feel we'd be better off not using the
election API at all, but this is a quick fix to keep CI green.
ref #1815
- Uncomment the accidentally commented `self.keep_alive.abort()` line; because of it, the
task never finished, which blocked the launcher.
- Fiddle with initialization one more time, to fix the offloader trying to back up
segment 0. Now we initialize all required LSNs in handle_elected,
where we learn the start LSN for the first time.
- Fix the blind attempt to provide the safekeeper service file with remote storage
params.
A separate task is launched for each timeline and stopped when the timeline doesn't
need offloading. The decision of who offloads is made through etcd leader election;
currently there is no precondition for participating, which is a TODO. A sketch of the
per-timeline task management follows this entry.
neon_local and test infrastructure for remote storage in safekeepers are added,
along with the test itself.
ref #1009
Co-authored-by: Anton Shyrabokau <ahtoxa@Antons-MacBook-Pro.local>
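The sketch mentioned above, with invented names (`TimelineId`, `backup_task_main`) and without the etcd election part:

```rust
use std::collections::HashMap;
use tokio::task::JoinHandle;

// Invented identifier type for the sketch.
type TimelineId = u128;

#[derive(Default)]
struct WalBackupLauncher {
    tasks: HashMap<TimelineId, JoinHandle<()>>,
}

impl WalBackupLauncher {
    /// Called by the launcher when a timeline's state changes.
    fn reconcile(&mut self, timeline_id: TimelineId, needs_offloading: bool) {
        let running = self.tasks.contains_key(&timeline_id);
        if needs_offloading && !running {
            // Timeline became active: spawn a dedicated offloading task.
            self.tasks
                .insert(timeline_id, tokio::spawn(backup_task_main(timeline_id)));
        } else if !needs_offloading && running {
            // Timeline no longer needs offloading: stop its task.
            if let Some(handle) = self.tasks.remove(&timeline_id) {
                handle.abort();
            }
        }
    }
}

async fn backup_task_main(_timeline_id: TimelineId) {
    // In the real code this would take part in the etcd leader election and,
    // if elected, upload finished WAL segments to remote storage.
}
```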
* Potential fix to #1626. Fixed typo in Makefile.
* Completed fix to #1626.
Summary:
changed 'error' to 'bail' in start_pageserver and start_safekeeper.
- Enabled process exporter for storage services
- Changed zenith_proxy prefix to just proxy
- Removed old `monitoring` directory
- Removed the common prefix for metrics; our common metrics now have the `libmetrics_` prefix, for example `libmetrics_serve_metrics_count` (see the sketch after this list)
- Added `test_metrics_normal_work`
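For illustration, defining a library-level metric under the new naming scheme might look roughly like this (using the `prometheus` and `once_cell` crates; the exact registration code in the repo may differ, and the help string is invented):

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter, IntCounter};

// `libmetrics_serve_metrics_count` is the example name from the list above.
static SERVE_METRICS_COUNT: Lazy<IntCounter> = Lazy::new(|| {
    register_int_counter!(
        "libmetrics_serve_metrics_count",
        "Number of times the metrics endpoint was served"
    )
    .expect("failed to register metric")
});

fn serve_metrics() {
    // Increment on every request to the metrics endpoint.
    SERVE_METRICS_COUNT.inc();
}
```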
* There is no auth in Safekeeper HTTP at all currently,
so simply calling `check_permission` is not enough.
* There are no checks of whether the Safekeeper is still working with the data,
as "still working" is blurry now: a timeline may be "active"
while there are no compute nodes and all data is propagated.
* Still, callmemaybe is deactivated, and the timeline is removed from the
internal map. It can easily sneak back in under race conditions
and implicit creations, though.