rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-30 11:30:37 +00:00

Author	SHA1	Message	Date
Dmitry Rodionov	ec0faf3ac6	retry timeline delete	2022-07-07 21:20:04 +03:00
Dmitry Rodionov	1a5af6d7a5	extend detach/delete tests	2022-07-07 21:20:04 +03:00
Dmitry Rodionov	168214e0b6	use tenant status endpoint to check whether timelines were downloaded or not	2022-07-07 21:20:04 +03:00
Dmitry Rodionov	e1e24336b7	review adjustments, bring back timeline_detach and rename it to timeline_delete	2022-07-07 21:20:04 +03:00
Dmitry Rodionov	4c54e4b37d	switch to per-tenant attach/detach download operations of all timelines for one tenant are now grouped together so when attach is invoked pageserver downloads all of them and registers them in a single apply_sync_status_update call so branches can be used safely with attach/detach	2022-07-07 21:20:04 +03:00
bojanserafimov	4a96259bdd	Add export/import test (#2036 )	2022-07-06 13:45:26 -04:00
Arthur Petukhovsky	8fabdc6708	Add tests with concurrent computes. Removes test_restart_compute, as added test_compute_restarts is stronger.	2022-07-06 18:07:29 +04:00
bojanserafimov	32560e75d2	Enable relocation test (#1974 )	2022-07-05 08:27:57 -04:00
bojanserafimov	d29c545b5d	Gc/compaction thread pool, take 2 (#1933 ) Decrease the number of pageserver threads by running gc and compaction in a blocking tokio thread pool	2022-07-05 02:06:40 -04:00
Dmitry Rodionov	65704708fa	remove unused imports, make more use of pathlib.Path	2022-07-01 18:56:51 +03:00
Bojan Serafimov	f09c09438a	Fix gc after import	2022-07-01 11:10:49 +03:00
Kirill Bulatov	8a714f1ebf	Add coverage to GH actions and rework part of them (#1987 )	2022-06-27 19:15:56 +03:00
Anastasia Lubennikova	3c2b03cd87	Update timeline size on dropdb. Add the test (#1973 ) In addition, fix database size calculation: count not only main fork of the relation, but also vm and fsm.	2022-06-23 12:28:12 +03:00
KlimentSerafimov	d059e588a6	Added invariant check for project name. (#1921 ) Summary: Added invariant checking for project name. Refactored ClientCredentials and TlsConfig. * Added formatting invariant check for project name: \forall c \in project_name . c \in [alnum] U {'-'}. sni_data == <project_name>.<common_name> * Added exhaustive tests for get_project_name. * Refactored TlsConfig to contain common_name : Option<String>. * Refactored ClientCredentials construction to construct project_name directly. * Merged ProjectNameError into ClientCredsParseError. * Tweaked proxy tests to accommodate refactored ClientCredentials construction semantics. * [Pytests] Added project option argument to test_proxy_select_1. * Removed project param from Api since now it's contained in creds. * Refactored &Option<String> -> Option<&str>. Co-authored-by: Dmitrii Ivanov <dima@neon.tech>.	2022-06-22 09:34:24 -04:00
bojanserafimov	1ca28e6f3c	Import basebackup into pageserver (#1925 ) Allow importing basebackup taken from vanilla postgres or another pageserver via psql copy in protocol.	2022-06-21 11:04:10 -04:00
Arthur Petukhovsky	f862373ac0	Fix WAL timeout in test_s3_wal_replay (#1953 )	2022-06-17 20:43:54 +03:00
Arthur Petukhovsky	699f46cd84	Download WAL from S3 if it's not available in safekeeper dir (#1932 ) `send_wal.rs` and `WalReader` are now async. `test_s3_wal_replay` checks that WAL can be replayed after offloaded.	2022-06-17 15:33:39 +03:00
Anastasia Lubennikova	36ee182d26	Implement page servise 'fullbackup' endpoint (#1923 ) * Implement page servise 'fullbackup' endpoint that works like basebackup, but also sends relational files * Add test_runner/batch_others/test_fullbackup.py Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>	2022-06-16 14:07:11 +03:00
Egor Suvorov	0ac0fba77a	test_runner: test Safekeeper HTTP API Auth All endpoints except for POST /v1/timeline are tested, this one is not tested in any way yet. Three attempts for each endpoint: correctly authenticated, badly authenticated, unauthenticated.	2022-06-09 17:14:46 +02:00
Egor Suvorov	1f1d852204	ZenithEnvBuilder: rename pageserver_auth_enabled --> auth_enabled	2022-06-09 17:14:46 +02:00
Arseny Sher	a51b2dac9a	Don't s3 offload from newly joined safekeeper not having required WAL. I made the check at launcher level with the perspective of generally moving election (decision who offloads) there. Also log timeline 'active' changes.	2022-06-09 18:30:16 +04:00
Dmitry Rodionov	e442f5357b	unify two identical failpoints in flush_frozen_layer probably is a merge artfact	2022-06-03 19:36:09 +03:00
Arseny Sher	5a723d44cd	Parametrize test_normal_work. I like to run small test locally, but let's avoid duplication.	2022-06-03 20:32:53 +04:00
Arseny Sher	70a53c4b03	Get backup test_safekeeper_normal_work, but skip by default. It is handy for development.	2022-06-03 16:12:14 +04:00
Kirill Bulatov	a91e0c299d	Reproduce etcd parsing bug in Python tests	2022-06-03 00:23:13 +03:00
bojanserafimov	90e2c9ee1f	Rename zenith to neon in python tests (#1871 )	2022-06-02 16:21:28 -04:00
Kirill Bulatov	e5cb727572	Replace callmemaybe with etcd subscriptions on safekeeper timeline info	2022-06-01 16:07:04 +03:00
Dmitry Rodionov	b1b67cc5a0	improve test normal work to start several computes	2022-05-31 22:42:11 +03:00
Arseny Sher	36281e3b47	Extend test_wal_backup with compute restart.	2022-05-30 13:57:17 +04:00
Anastasia Lubennikova	e014cb6026	rename zenith.zenith_tenant to neon.tenant_id in test	2022-05-30 12:24:44 +03:00
Anastasia Lubennikova	67d6ff4100	Rename custom GUCs: - zenith.zenith_tenant -> neon.tenant_id - zenith.zenith_timeline -> neon.timeline_id	2022-05-30 11:11:01 +03:00
Anastasia Lubennikova	6a867bce6d	Rename 'zenith_admin' role to 'cloud_admin'	2022-05-30 11:11:01 +03:00
Anastasia Lubennikova	751f1191b4	Rename 'wal_acceptors' GUC to 'safekeepers'	2022-05-30 11:11:01 +03:00
Anastasia Lubennikova	3accde613d	Rename contrib/zenith to contrib/neon. Rename custom GUCs: - zenith.page_server_connstring -> neon.pageserver_connstring - zenith.zenith_tenant -> neon.tenantid - zenith.zenith_timeline -> neon.timelineid - zenith.max_cluster_size -> neon.max_cluster_size	2022-05-30 11:11:01 +03:00
Kian-Meng Ang	f1c51a1267	Fix typos	2022-05-28 14:02:05 +03:00
Thang Pham	757746b571	Fix `test_pageserver_http_get_wal_receiver_success` flaky test. (#1786 ) Fixes #1768. ## Context Previously, to test `get_wal_receiver` API, we make run some DB transactions then call the API to check the latest message's LSN from the WAL receiver. However, this test won't work because it's not guaranteed that the WAL receiver will get the latest WAL from the postgres/safekeeper at the time of making the API call. This PR resolves the above issue by adding a "poll and wait" code that waits to retrieve the latest data from the WAL receiver. This PR also fixes a bug that tries to compare two hex LSNs, should convert to number before the comparison. See: https://github.com/neondatabase/neon/issues/1768#issuecomment-1133752122.	2022-05-27 13:33:53 -04:00
Thang Pham	75f71a6380	Handle broken timelines on startup (#1809 ) Resolve #1663. ## Changes - ignore a "broken" [1] timeline on page server startup - fix the race condition when creating multiple timelines in parallel for a tenant - added tests for the above changes [1]: a timeline is marked as "broken" if either - failed to load the timeline's metadata or - the timeline's disk consistent LSN is zero	2022-05-27 11:43:06 -04:00
Arseny Sher	0e1bd57c53	Add WAL offloading to s3 on safekeepers. Separate task is launched for each timeline and stopped when timeline doesn't need offloading. Decision who offloads is done through etcd leader election; currently there is no pre condition for participating, that's a TODO. neon_local and tests infrastructure for remote storage in safekeepers added, along with the test itself. ref #1009 Co-authored-by: Anton Shyrabokau <ahtoxa@Antons-MacBook-Pro.local>	2022-05-27 06:19:23 +04:00
Heikki Linnakangas	7997fc2932	Fix error handling with 'basebackup' command. If the 'basebackup' command failed in the middle of building the tar archive, the client would not report the error, but would attempt to to start up postgres with the partial contents of the data directory. That fails because the control file is missing (it's added to the archive last, precisly to make sure that you cannot start postgres from a partial archive). But the client doesn't see the proper error message that caused the basebackup to fail in the server, which is confusing. Two issues conspired to cause that: 1. The tar::Builder object that we use in the pageserver to construct the tar stream has a Drop handler that automatically writes a valid end-of-archive marker on drop. Because of that, the resulting tarball looks complete, even if an error happens while we're building it. The pageserver does send an ErrorResponse after the seemingly-valid tarball, but: 2. The client stops reading the Copy stream, as soon as it sees the tar end-of-archive marker. Therefore, it doesn't read the ErrorResponse that comes after it. We have two clients that call 'basebackup', one in `control_plane` used by the `neon_local` binary, and another one in `compute_tools`. Both had the same issue. This PR fixes both issues, even though fixing either one would be enough to fix the problem at hand. The pageserver now doesn't send the end-of-archive marker on error, and the client now reads the copy stream to the end, even if it sees an end-of-archive marker. Fixes github issue #1715 In the passing, change Basebackup to use generic Write rather than 'dyn'.	2022-05-25 18:14:44 +03:00
Arthur Petukhovsky	134eeeb096	Add more common storage metrics (#1722 ) - Enabled process exporter for storage services - Changed zenith_proxy prefix to just proxy - Removed old `monitoring` directory - Removed common prefix for metrics, now our common metrics have `libmetrics_` prefix, for example `libmetrics_serve_metrics_count` - Added `test_metrics_normal_work`	2022-05-17 19:29:01 +03:00
Heikki Linnakangas	55ea3f262e	Fix race condition leading to panic in remote storage sync thread. The SyncQueue consisted of a tokio mpsc channel, and an atomic counter to keep track of how many items there are in the channel. Updating the atomic counter was racy, and sometimes the consumer would decrement the counter before the producer had incremented it, leading to integer wraparound to usize::MAX. Calling Vec::with_capacity(usize::MAX) leads to a panic. To fix, replace the channel with a VecDeque protected by a Mutex, and a condition variable for signaling. Now that the queue is now protected by standard blocking Mutex and Condvar, refactor the functions touching it to be sync, not async. A theoretical downside of this is that the calls to push items to the queue and the storage sync thread that drains the queue might now need to wait, if another thread is busy manipulating the queue. I believe that's OK; the lock isn't held for very long, and these operations are made in background threads, not in the hot GetPage@LSN path, so they're not very latency-sensitive. Fixes #1719. Also add a test case.	2022-05-17 18:14:57 +03:00
Heikki Linnakangas	f03779bf1a	Fix wait_for_last_record_lsn() and wait_for_upload() python functions. The contract for wait_for() was not very clear. It waits until the given function returns successfully, without an exception, but the wait_for_last_record_lsn() and wait_for_upload() functions used "a < b" as the condition, i.e. they thought that wait_for() would poll until the function returns true. Inline the logic from wait_for() into those two functions, it's not that complicated, and you get a more specific error message too, if it fails. Also add a comment to wait_for() to make it more clear how it works. Also change remote_consistent_lsn() to return 0 instead of raising an exception, if remote is None. That can happen if nothing has been uploaded to remote storage for the timeline yet. It happened once in the CI, and I was able to reproduce that locally too by adding a sleep to the storage sync thread, to delay the first upload.	2022-05-17 18:14:10 +03:00
Kirill Bulatov	f2881bbd8a	Start and stop single etcd and mock s3 servers globally in python tests	2022-05-17 01:17:44 +03:00
Kirill Bulatov	a884f4cf6b	Add etcd to neon_local	2022-05-17 01:17:44 +03:00
Kirill Bulatov	9a0fed0880	Enable at least 1 safekeeper in every test	2022-05-17 01:17:44 +03:00
Thang Pham	e4a70faa08	Add more information to timeline-related APIs (#1673 ) Resolves #1488. - implemented `GET tenant/:tenant_id/timeline/:timeline_id/wal_receiver` endpoint - returned `thread_id` in `thread_mgr::spawn` - added `latest_gc_cutoff_lsn` field to `LocalTimelineInfo` struct	2022-05-16 11:05:43 -04:00
Heikki Linnakangas	51ea9c3053	Don't swallow panics when the pageserver is build with failpoints. It's very confusing, and because you don't get a stack trace and error message in the logs, makes debugging very hard. However, the 'test_pageserver_recovery' test relied on that behavior. To support that, add a new "exit" action to the pageserver 'failpoints' command, so that you can explicitly request to exit the process when a failpoint is hit.	2022-05-16 09:58:58 +03:00
Heikki Linnakangas	a10cac980f	Continue with pageserver startup, if loading some tenants fail. Fixes https://github.com/neondatabase/neon/issues/1664	2022-05-15 00:25:38 +03:00
Egor Suvorov	768c846eeb	Fix test_delete_force from #1653 conflicting with #1692	2022-05-13 17:36:18 +02:00
Anastasia Lubennikova	a2561f0a78	Use tenant's pitr_interval instead of hardroded 0 in the command. Adjust python tests that use the	2022-05-13 18:32:14 +03:00

1 2 3 4 5

217 Commits