rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-24 00:20:37 +00:00

Author	SHA1	Message	Date
Bojan Serafimov	cdab7bc83b	Merge branch 'main' into basebackup-import	2022-06-18 15:21:50 -04:00
Bojan Serafimov	2ce653fd79	Add checksum todos	2022-06-18 11:19:02 -04:00
Bojan Serafimov	010132dbb0	Check file sizes	2022-06-18 11:14:21 -04:00
Bojan Serafimov	443e409950	Fix test	2022-06-18 10:48:21 -04:00
Bojan Serafimov	71afd06f83	Test failed import	2022-06-17 21:42:04 -04:00
Bojan Serafimov	45b470e206	Bump timeout	2022-06-17 16:30:29 -04:00
Anastasia Lubennikova	11d7743b39	basebackup import fixes (#1955)	2022-06-17 15:29:32 -04:00
Arthur Petukhovsky	f862373ac0	Fix WAL timeout in test_s3_wal_replay (#1953)	2022-06-17 20:43:54 +03:00
Arthur Petukhovsky	699f46cd84	Download WAL from S3 if it's not available in safekeeper dir (#1932) `send_wal.rs` and `WalReader` are now async. `test_s3_wal_replay` checks that WAL can be replayed after being offloaded.	2022-06-17 15:33:39 +03:00
Bojan Serafimov	e3ce99a711	Merge branch 'main' into basebackup-import	2022-06-16 22:00:21 -04:00
Anastasia Lubennikova	36ee182d26	Implement page service 'fullbackup' endpoint (#1923) * Implement page service 'fullbackup' endpoint that works like basebackup, but also sends relational files * Add test_runner/batch_others/test_fullbackup.py Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>	2022-06-16 14:07:11 +03:00
Bojan Serafimov	670c8ab7be	Merge branch 'basebackup-import' of github.com:neondatabase/neon into basebackup-import	2022-06-14 17:08:31 -04:00
Bojan Serafimov	05151e643f	Merge branch 'main' into basebackup-import	2022-06-14 17:05:52 -04:00
Anastasia Lubennikova	9ccb7b75a6	Fix import of multi-segment relation files	2022-06-14 21:42:07 +03:00
Bojan Serafimov	909a0df048	Run yapf	2022-06-12 11:17:44 -04:00
Bojan Serafimov	a568c49111	WIP	2022-06-10 10:53:52 -04:00
Bojan Serafimov	ea97135fa8	Create user in test	2022-06-09 23:38:18 -04:00
Bojan Serafimov	31cf43724c	WIP	2022-06-09 22:51:32 -04:00
Bojan Serafimov	1380a1cce1	Pass lsn	2022-06-09 12:39:29 -04:00
Egor Suvorov	0ac0fba77a	test_runner: test Safekeeper HTTP API Auth All endpoints except for POST /v1/timeline are tested, this one is not tested in any way yet. Three attempts for each endpoint: correctly authenticated, badly authenticated, unauthenticated.	2022-06-09 17:14:46 +02:00
Egor Suvorov	1f1d852204	ZenithEnvBuilder: rename pageserver_auth_enabled --> auth_enabled	2022-06-09 17:14:46 +02:00
Arseny Sher	a51b2dac9a	Don't s3 offload from newly joined safekeeper not having required WAL. I made the check at launcher level with the perspective of generally moving election (decision who offloads) there. Also log timeline 'active' changes.	2022-06-09 18:30:16 +04:00
Bojan Serafimov	0277c37759	Don't compress tar	2022-06-08 19:31:23 -04:00
Bojan Serafimov	7fb732a39f	Create timeline if not exists	2022-06-08 19:24:34 -04:00
Bojan Serafimov	ecee80d1bf	Send tar	2022-06-08 14:37:46 -04:00
Bojan Serafimov	a39501beee	Add todos	2022-06-08 10:36:35 -04:00
Bojan Serafimov	0677fb7ae7	Add neon_local command	2022-06-07 22:08:35 -04:00
Bojan Serafimov	99260b18ab	Add test	2022-06-07 19:16:29 -04:00
Dmitry Rodionov	e442f5357b	unify two identical failpoints in flush_frozen_layer; probably a merge artifact	2022-06-03 19:36:09 +03:00
Arseny Sher	5a723d44cd	Parametrize test_normal_work. I like to run small test locally, but let's avoid duplication.	2022-06-03 20:32:53 +04:00
Arseny Sher	70a53c4b03	Get backup test_safekeeper_normal_work, but skip by default. It is handy for development.	2022-06-03 16:12:14 +04:00
Kirill Bulatov	a91e0c299d	Reproduce etcd parsing bug in Python tests	2022-06-03 00:23:13 +03:00
bojanserafimov	90e2c9ee1f	Rename zenith to neon in python tests (#1871)	2022-06-02 16:21:28 -04:00
Kirill Bulatov	e5cb727572	Replace callmemaybe with etcd subscriptions on safekeeper timeline info	2022-06-01 16:07:04 +03:00
Dmitry Rodionov	b1b67cc5a0	improve test normal work to start several computes	2022-05-31 22:42:11 +03:00
Arseny Sher	36281e3b47	Extend test_wal_backup with compute restart.	2022-05-30 13:57:17 +04:00
Anastasia Lubennikova	e014cb6026	rename zenith.zenith_tenant to neon.tenant_id in test	2022-05-30 12:24:44 +03:00
Anastasia Lubennikova	67d6ff4100	Rename custom GUCs: - zenith.zenith_tenant -> neon.tenant_id - zenith.zenith_timeline -> neon.timeline_id	2022-05-30 11:11:01 +03:00
Anastasia Lubennikova	6a867bce6d	Rename 'zenith_admin' role to 'cloud_admin'	2022-05-30 11:11:01 +03:00
Anastasia Lubennikova	751f1191b4	Rename 'wal_acceptors' GUC to 'safekeepers'	2022-05-30 11:11:01 +03:00
Anastasia Lubennikova	3accde613d	Rename contrib/zenith to contrib/neon. Rename custom GUCs: - zenith.page_server_connstring -> neon.pageserver_connstring - zenith.zenith_tenant -> neon.tenantid - zenith.zenith_timeline -> neon.timelineid - zenith.max_cluster_size -> neon.max_cluster_size	2022-05-30 11:11:01 +03:00
Kian-Meng Ang	f1c51a1267	Fix typos	2022-05-28 14:02:05 +03:00
Thang Pham	757746b571	Fix `test_pageserver_http_get_wal_receiver_success` flaky test. (#1786) Fixes #1768. ## Context Previously, to test the `get_wal_receiver` API, we ran some DB transactions and then called the API to check the latest message's LSN from the WAL receiver. However, this test doesn't work reliably because it's not guaranteed that the WAL receiver will have received the latest WAL from the postgres/safekeeper at the time of the API call. This PR resolves the above issue by adding "poll and wait" code that waits to retrieve the latest data from the WAL receiver. This PR also fixes a bug that compared two hex LSNs directly; they should be converted to numbers before the comparison. See: https://github.com/neondatabase/neon/issues/1768#issuecomment-1133752122.	2022-05-27 13:33:53 -04:00
Thang Pham	75f71a6380	Handle broken timelines on startup (#1809) Resolve #1663. ## Changes - ignore a "broken" [1] timeline on page server startup - fix the race condition when creating multiple timelines in parallel for a tenant - added tests for the above changes [1]: a timeline is marked as "broken" if either - loading the timeline's metadata failed or - the timeline's disk consistent LSN is zero	2022-05-27 11:43:06 -04:00
Arseny Sher	0e1bd57c53	Add WAL offloading to s3 on safekeepers. A separate task is launched for each timeline and stopped when the timeline doesn't need offloading. The decision of who offloads is made through etcd leader election; currently there is no precondition for participating, that's a TODO. neon_local and tests infrastructure for remote storage in safekeepers added, along with the test itself. ref #1009 Co-authored-by: Anton Shyrabokau <ahtoxa@Antons-MacBook-Pro.local>	2022-05-27 06:19:23 +04:00
Heikki Linnakangas	7997fc2932	Fix error handling with 'basebackup' command. If the 'basebackup' command failed in the middle of building the tar archive, the client would not report the error, but would attempt to start up postgres with the partial contents of the data directory. That fails because the control file is missing (it's added to the archive last, precisely to make sure that you cannot start postgres from a partial archive). But the client doesn't see the proper error message that caused the basebackup to fail in the server, which is confusing. Two issues conspired to cause that: 1. The tar::Builder object that we use in the pageserver to construct the tar stream has a Drop handler that automatically writes a valid end-of-archive marker on drop. Because of that, the resulting tarball looks complete, even if an error happens while we're building it. The pageserver does send an ErrorResponse after the seemingly-valid tarball, but: 2. The client stops reading the Copy stream, as soon as it sees the tar end-of-archive marker. Therefore, it doesn't read the ErrorResponse that comes after it. We have two clients that call 'basebackup', one in `control_plane` used by the `neon_local` binary, and another one in `compute_tools`. Both had the same issue. This PR fixes both issues, even though fixing either one would be enough to fix the problem at hand. The pageserver now doesn't send the end-of-archive marker on error, and the client now reads the copy stream to the end, even if it sees an end-of-archive marker. Fixes github issue #1715 In passing, change Basebackup to use generic Write rather than 'dyn'.	2022-05-25 18:14:44 +03:00
Arthur Petukhovsky	134eeeb096	Add more common storage metrics (#1722) - Enabled process exporter for storage services - Changed zenith_proxy prefix to just proxy - Removed old `monitoring` directory - Removed common prefix for metrics, now our common metrics have `libmetrics_` prefix, for example `libmetrics_serve_metrics_count` - Added `test_metrics_normal_work`	2022-05-17 19:29:01 +03:00
Heikki Linnakangas	55ea3f262e	Fix race condition leading to panic in remote storage sync thread. The SyncQueue consisted of a tokio mpsc channel, and an atomic counter to keep track of how many items there are in the channel. Updating the atomic counter was racy, and sometimes the consumer would decrement the counter before the producer had incremented it, leading to integer wraparound to usize::MAX. Calling Vec::with_capacity(usize::MAX) leads to a panic. To fix, replace the channel with a VecDeque protected by a Mutex, and a condition variable for signaling. Now that the queue is protected by a standard blocking Mutex and Condvar, refactor the functions touching it to be sync, not async. A theoretical downside of this is that the calls to push items to the queue and the storage sync thread that drains the queue might now need to wait, if another thread is busy manipulating the queue. I believe that's OK; the lock isn't held for very long, and these operations are made in background threads, not in the hot GetPage@LSN path, so they're not very latency-sensitive. Fixes #1719. Also add a test case.	2022-05-17 18:14:57 +03:00
Heikki Linnakangas	f03779bf1a	Fix wait_for_last_record_lsn() and wait_for_upload() python functions. The contract for wait_for() was not very clear. It waits until the given function returns successfully, without an exception, but the wait_for_last_record_lsn() and wait_for_upload() functions used "a < b" as the condition, i.e. they thought that wait_for() would poll until the function returns true. Inline the logic from wait_for() into those two functions, it's not that complicated, and you get a more specific error message too, if it fails. Also add a comment to wait_for() to make it more clear how it works. Also change remote_consistent_lsn() to return 0 instead of raising an exception, if remote is None. That can happen if nothing has been uploaded to remote storage for the timeline yet. It happened once in the CI, and I was able to reproduce that locally too by adding a sleep to the storage sync thread, to delay the first upload.	2022-05-17 18:14:10 +03:00
Kirill Bulatov	f2881bbd8a	Start and stop single etcd and mock s3 servers globally in python tests	2022-05-17 01:17:44 +03:00
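
Note on commit 757746b571: the message says hex LSNs must be converted to numbers before they are compared. The sketch below illustrates why, assuming the usual PostgreSQL textual LSN form of two hexadecimal halves separated by a slash. The helper name `lsn_to_int` is hypothetical and is not taken from the repository.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '0/16B3748' to an integer.

    Assumes the PostgreSQL 'hi/lo' form, where both halves are
    hexadecimal and the high half counts units of 2**32 bytes.
    The function name is hypothetical, chosen for illustration.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)


# Two LSNs where the second is numerically larger than the first.
older = "0/9000000"
newer = "0/10000000"

# Comparing the raw strings is lexicographic, so it gets the order wrong:
# '9' sorts after '1', even though 0x9000000 < 0x10000000.
string_order_says_older_is_ahead = older > newer

# Comparing the converted integers gives the intended order.
numeric_order_says_newer_is_ahead = lsn_to_int(newer) > lsn_to_int(older)
```

A polling test that waits for a receiver to reach a target LSN would compare `lsn_to_int(...)` values rather than the strings themselves.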

224 Commits