- There was an issue with zero commit_lsn `reason: LaggingWal { current_commit_lsn: 0/0, new_commit_lsn: 1/6FD90D38, threshold: 10485760 } }`. The problem was in `send_wal.rs`, where we initialized `end_pos = Lsn(0)` and in some cases sent it to the pageserver.
- IDENTIFY_SYSTEM previously returned `flush_lsn` as the physical end of WAL. Now it returns `flush_lsn` (as before) to walproposer and `commit_lsn` to everyone else, including the pageserver.
- There was an issue with backoff where the connection was cancelled right after initialization: `connected!` -> `safekeeper_handle_db: Connection cancelled` -> `Backoff: waiting 3 seconds`. The problem was sleeping before establishing the connection. This is fixed by reworking the retry logic.
- There was an issue with getting the `NoKeepAlives` reason in a loop. The issue is probably the same as the previous one.
- There was an issue with filtering safekeepers based on retry attempts, which could filter out some safekeepers indefinitely. This is fixed by using a retry cooldown duration instead of the number of retry attempts.
- Some `send_wal.rs` connections failed with errors that lacked context. This is fixed by adding the timeline to safekeeper errors.
New retry logic works like this (a minimal sketch follows the list):
- Every candidate has a `next_retry_at` timestamp and is not considered for connection until that moment.
- When a walreceiver connection is closed, we update `next_retry_at` using exponential backoff, increasing the cooldown on every disconnect.
- When `last_record_lsn` was advanced using WAL from the safekeeper, we reset the retry cooldown and exponential backoff, allowing walreceiver to reconnect to the same safekeeper instantly.
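Roughly, in code (the names `Candidate`, `on_disconnect`, `on_progress` are invented for illustration, not the actual walreceiver types):

```rust
use std::time::{Duration, Instant};

// Hypothetical per-safekeeper connection candidate.
struct Candidate {
    /// Do not consider this safekeeper for a connection before this moment.
    next_retry_at: Option<Instant>,
    /// Current cooldown; grows exponentially on every disconnect.
    retry_cooldown: Duration,
}

const INITIAL_COOLDOWN: Duration = Duration::from_millis(100);
const MAX_COOLDOWN: Duration = Duration::from_secs(30);

impl Candidate {
    /// Called when the walreceiver connection to this safekeeper is closed.
    fn on_disconnect(&mut self) {
        self.next_retry_at = Some(Instant::now() + self.retry_cooldown);
        // Exponential backoff, capped so a flaky safekeeper is still retried.
        self.retry_cooldown = (self.retry_cooldown * 2).min(MAX_COOLDOWN);
    }

    /// Called when last_record_lsn advanced using WAL from this safekeeper:
    /// reset the backoff so we may reconnect to it instantly.
    fn on_progress(&mut self) {
        self.next_retry_at = None;
        self.retry_cooldown = INITIAL_COOLDOWN;
    }

    /// Whether this safekeeper may be considered for a new connection now.
    fn is_eligible(&self, now: Instant) -> bool {
        self.next_retry_at.map_or(true, |t| now >= t)
    }
}
```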
Re-export only things that are used by other modules.
In the future, I'm imagining that we run bindgen twice, for Postgres
v14 and v15. The two sets of bindings would go into separate
'bindings_v14' and 'bindings_v15' modules.
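A hypothetical layout for that (the module names follow the text above; the generated file paths and lint attributes are assumptions):

```rust
// postgres_ffi/src/lib.rs (sketch): version-specific bindgen output goes
// into separate modules, so callers can pick the right Postgres version.
pub mod bindings_v14 {
    #![allow(non_upper_case_globals, non_camel_case_types, non_snake_case)]
    include!(concat!(env!("OUT_DIR"), "/bindings_v14.rs"));
}

pub mod bindings_v15 {
    #![allow(non_upper_case_globals, non_camel_case_types, non_snake_case)]
    include!(concat!(env!("OUT_DIR"), "/bindings_v15.rs"));
}
```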
Rearrange postgres_ffi modules.
Move a function to avoid a Postgres version dependency in timelines.rs.
Move the function that generates a logical-message WAL record to postgres_ffi.
Flush the in-memory layer eventually when no new data arrives; this helps
safekeepers suspend activity (stop pushing to the broker). The default of 10m
should be fine.
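A rough sketch of the idea, with assumed names (`OpenLayerState`, `last_record_received_at`) that may not match the actual pageserver code:

```rust
use std::time::{Duration, Instant};

/// Assumed default: flush the open in-memory layer after 10 minutes of
/// inactivity so safekeepers can stop pushing to the broker.
const DEFAULT_FLUSH_TIMEOUT: Duration = Duration::from_secs(10 * 60);

struct OpenLayerState {
    /// When the last WAL record was applied to the in-memory layer.
    last_record_received_at: Instant,
    has_unflushed_data: bool,
}

/// Flush even without new data once the timeout has passed.
fn should_flush(layer: &OpenLayerState, now: Instant, timeout: Duration) -> bool {
    layer.has_unflushed_data
        && now.duration_since(layer.last_record_received_at) >= timeout
}
```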
Reorganize existing READMEs and other documentation files into mdbook
format. The resulting Table of Contents is a mix of placeholders for
docs that we should write, and documentation files that we already had,
dropped into the most appropriate place.
Update the Pageserver overview diagram. Add sections on thread
management and WAL redo processes.
Add all the RFCs to the mdbook Table of Contents too.
Per GitHub issue #1979
On receipt of the ProposerElected message, WAL is truncated at the streaming point; this
code expected that, once a vote is given for the proposer / the term switch has happened,
flush_lsn can be advanced only by this proposer (or a higher one). However, that
didn't take into account the possibility of accumulating written WAL and flushing it
after the vote is given -- flushing goes without term checks -- which eventually led
to the violation in question.
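A heavily simplified sketch of the broken assumption (types and field names are invented for illustration, not taken from the safekeeper code):

```rust
type Term = u64;
type Lsn = u64;

struct AcceptorState {
    /// Term we have voted for / switched to on ProposerElected.
    term: Term,
    /// Term of the WAL that is currently written but not yet flushed.
    last_written_term: Term,
    write_lsn: Lsn,
    flush_lsn: Lsn,
}

impl AcceptorState {
    /// The assumption was that flush_lsn only advances with WAL from the
    /// current (or a higher) term after a vote. A flush of WAL accumulated
    /// under an older term, done without a term check, breaks that:
    fn flush_accumulated_wal(&mut self) {
        // Buggy version: unconditionally advance flush_lsn, even for WAL
        // written under an older term that ProposerElected may truncate.
        // self.flush_lsn = self.write_lsn;

        // Term-checked version matching the stated invariant:
        if self.last_written_term == self.term {
            self.flush_lsn = self.write_lsn;
        }
    }
}
```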
ref #2048
"cargo clippy" started to complain about these, after running "cargo
update". Not sure why it didn't complain before, but seems reasonable to
fix these. (The "cargo update" is not included in this commit)
Mitigates the latency penalty, making push throughput 1-1.5 orders of magnitude higher.
Also make leases per timeline, not per whole safekeeper, to avoid storing
garbage in etcd for deleted timelines while the safekeeper is alive.
* `control_plane` crate (used by `neon_local`) now parses an `auth_enabled` bool for each Safekeeper
* If auth is enabled, a Safekeeper is passed a path to a public key via a new command line argument (see the sketch after this list)
* Added TODO comments to other places needing auth
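A hypothetical sketch of the first two points (field and flag names are assumptions, not the actual `control_plane` code):

```rust
use serde::Deserialize;

// Per-safekeeper section of the neon_local config (sketch).
#[derive(Deserialize)]
struct SafekeeperConf {
    id: u64,
    pg_port: u16,
    http_port: u16,
    /// When true, the safekeeper is started with a public key for
    /// validating JWT tokens.
    #[serde(default)]
    auth_enabled: bool,
}

/// Build extra command line arguments for starting this safekeeper.
fn auth_args(conf: &SafekeeperConf, public_key_path: &str) -> Vec<String> {
    if conf.auth_enabled {
        // Hypothetical flag name; the real argument may be spelled differently.
        vec![format!("--auth-validation-public-key-path={public_key_path}")]
    } else {
        Vec::new()
    }
}
```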
I made the check at the launcher level, with the idea of eventually moving the
election (the decision of who offloads) there.
Also log timeline 'active' changes.
I still don't like the surrounding code and feel we'd be better off not using the
election API at all, but this is a quick fix to keep CI green.
ref #1815
- Uncomment the accidentally commented `self.keep_alive.abort()` line; because of it, the
task never finished, which blocked the launcher.
- Fiddle with initialization one more time, to fix the offloader trying to back up
segment 0. Now we initialize all required LSNs in handle_elected,
where we learn the start LSN for the first time.
- Fix the blind attempt to provide the safekeeper service file with remote storage
params.
A separate task is launched for each timeline and stopped when the timeline doesn't
need offloading. The decision of who offloads is made through etcd leader election;
currently there is no precondition for participating, which is a TODO. A sketch of the
per-timeline task management follows this entry.
neon_local and test infrastructure for remote storage in safekeepers are added,
along with the test itself.
ref #1009
Co-authored-by: Anton Shyrabokau <ahtoxa@Antons-MacBook-Pro.local>
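The sketch mentioned above, with invented names (`TimelineId`, `backup_task_main`) and without the etcd election part:

```rust
use std::collections::HashMap;
use tokio::task::JoinHandle;

// Invented identifier type for the sketch.
type TimelineId = u128;

#[derive(Default)]
struct WalBackupLauncher {
    tasks: HashMap<TimelineId, JoinHandle<()>>,
}

impl WalBackupLauncher {
    /// Called by the launcher when a timeline's state changes.
    fn reconcile(&mut self, timeline_id: TimelineId, needs_offloading: bool) {
        let running = self.tasks.contains_key(&timeline_id);
        if needs_offloading && !running {
            // Timeline became active: spawn a dedicated offloading task.
            self.tasks
                .insert(timeline_id, tokio::spawn(backup_task_main(timeline_id)));
        } else if !needs_offloading && running {
            // Timeline no longer needs offloading: stop its task.
            if let Some(handle) = self.tasks.remove(&timeline_id) {
                handle.abort();
            }
        }
    }
}

async fn backup_task_main(_timeline_id: TimelineId) {
    // In the real code this would take part in the etcd leader election and,
    // if elected, upload finished WAL segments to remote storage.
}
```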
* Potential fix to #1626. Fixed typo in Makefile.
* Completed fix to #1626.
Summary:
changed 'error' to 'bail' in start_pageserver and start_safekeeper.
- Enabled process exporter for storage services
- Changed zenith_proxy prefix to just proxy
- Removed old `monitoring` directory
- Removed the common prefix for metrics; our common metrics now have the `libmetrics_` prefix, for example `libmetrics_serve_metrics_count` (see the sketch after this list)
- Added `test_metrics_normal_work`
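For illustration, defining a library-level metric under the new naming scheme might look roughly like this (using the `prometheus` and `once_cell` crates; the exact registration code in the repo may differ, and the help string is invented):

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter, IntCounter};

// `libmetrics_serve_metrics_count` is the example name from the list above.
static SERVE_METRICS_COUNT: Lazy<IntCounter> = Lazy::new(|| {
    register_int_counter!(
        "libmetrics_serve_metrics_count",
        "Number of times the metrics endpoint was served"
    )
    .expect("failed to register metric")
});

fn serve_metrics() {
    // Increment on every request to the metrics endpoint.
    SERVE_METRICS_COUNT.inc();
}
```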
* There is no auth in Safekeeper HTTP at all currently,
so simply calling `check_permission` is not enough.
* There are no checks of whether the Safekeeper is still working with the data,
as "still working" is blurry now: a timeline may be "active"
while there are no compute nodes and all data is propagated.
* Still, callmemaybe is deactivated, and the timeline is removed from the
internal map. It can easily sneak back in under race conditions
and implicit creations, though.