
rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-07 13:32:57 +00:00

Author	SHA1	Message	Date
Alexey Kondratov	772c2fb4ff	Report startup metrics and failure reason from compute_ctl (#1581) + neondatabase/cloud#1103 This adds a couple of control endpoints to simplify compute state discovery for control-plane. For example, now we may figure out that Postgres wasn't able to start or basebackup failed within seconds instead of just blindly polling the compute readiness for a minute or two. Also we now expose startup metrics (time of each step: basebackup, sync safekeepers, config, total). Console grabs them after each successful start and reports them as a histogram to prometheus and grafana. OpenAPI spec is added and up to date, but is not used in the console yet.	2022-05-18 13:03:29 +04:00
Andrey Taranik	b9f84f4a83	turn on storage deployment to neon-stress environment (#1729)	2022-05-17 23:04:04 +03:00
Arthur Petukhovsky	134eeeb096	Add more common storage metrics (#1722) - Enabled process exporter for storage services - Changed zenith_proxy prefix to just proxy - Removed old `monitoring` directory - Removed common prefix for metrics, now our common metrics have `libmetrics_` prefix, for example `libmetrics_serve_metrics_count` - Added `test_metrics_normal_work`	2022-05-17 19:29:01 +03:00
Heikki Linnakangas	55ea3f262e	Fix race condition leading to panic in remote storage sync thread. The SyncQueue consisted of a tokio mpsc channel, and an atomic counter to keep track of how many items there are in the channel. Updating the atomic counter was racy, and sometimes the consumer would decrement the counter before the producer had incremented it, leading to integer wraparound to usize::MAX. Calling Vec::with_capacity(usize::MAX) leads to a panic. To fix, replace the channel with a VecDeque protected by a Mutex, and a condition variable for signaling. Now that the queue is protected by standard blocking Mutex and Condvar, refactor the functions touching it to be sync, not async. A theoretical downside of this is that the calls to push items to the queue and the storage sync thread that drains the queue might now need to wait, if another thread is busy manipulating the queue. I believe that's OK; the lock isn't held for very long, and these operations are made in background threads, not in the hot GetPage@LSN path, so they're not very latency-sensitive. Fixes #1719. Also add a test case.	2022-05-17 18:14:57 +03:00
Heikki Linnakangas	f03779bf1a	Fix wait_for_last_record_lsn() and wait_for_upload() python functions. The contract for wait_for() was not very clear. It waits until the given function returns successfully, without an exception, but the wait_for_last_record_lsn() and wait_for_upload() functions used "a < b" as the condition, i.e. they thought that wait_for() would poll until the function returns true. Inline the logic from wait_for() into those two functions, it's not that complicated, and you get a more specific error message too, if it fails. Also add a comment to wait_for() to make it more clear how it works. Also change remote_consistent_lsn() to return 0 instead of raising an exception, if remote is None. That can happen if nothing has been uploaded to remote storage for the timeline yet. It happened once in the CI, and I was able to reproduce that locally too by adding a sleep to the storage sync thread, to delay the first upload.	2022-05-17 18:14:10 +03:00
Andrey Taranik	070c255522	Neon stress deploy (#1720) * storage and proxy deployment for neon stress environment * neon stress inventory fix	2022-05-17 18:03:01 +03:00
Heikki Linnakangas	9ccbb8d331	Make "neon_local stop" less verbose. I got annoyed by all the noise in CI test output. Before: $ ./target/release/neon_local stop Stop pageserver gracefully Pageserver still receives connections Pageserver stopped receiving connections Pageserver status is: Reqwest error: error sending request for url (http://127.0.0.1:9898/v1/status): error trying to connect: tcp connect error: Connection refused (os error 111) initializing for sk 1 for 7676 Stop safekeeper gracefully Safekeeper still receives connections Safekeeper stopped receiving connections Safekeeper status is: Reqwest error: error sending request for url (http://127.0.0.1:7676/v1/status): error trying to connect: tcp connect error: Connection refused (os error 111) After: $ ./target/release/neon_local stop Stopping pageserver gracefully...done! Stopping safekeeper 1 gracefully...done! Also removes the spurious "initializing for sk 1 for 7676" message from "neon_local start"	2022-05-17 10:31:13 +03:00
Kirill Bulatov	f2881bbd8a	Start and stop single etcd and mock s3 servers globally in python tests	2022-05-17 01:17:44 +03:00
Kirill Bulatov	a884f4cf6b	Add etcd to neon_local	2022-05-17 01:17:44 +03:00
Kirill Bulatov	9a0fed0880	Enable at least 1 safekeeper in every test	2022-05-17 01:17:44 +03:00
chaitanya sharma	bea84150b2	Fix the markdown rendering on 004-durability.md RFC	2022-05-17 00:16:42 +03:00
chaitanya sharma	85b5c0e989	List profiling as a feature with 'pageserver --enabled-features' Fixes https://github.com/neondatabase/neon/issues/1627	2022-05-16 21:10:57 +03:00
Thang Pham	e4a70faa08	Add more information to timeline-related APIs (#1673) Resolves #1488. - implemented `GET tenant/:tenant_id/timeline/:timeline_id/wal_receiver` endpoint - returned `thread_id` in `thread_mgr::spawn` - added `latest_gc_cutoff_lsn` field to `LocalTimelineInfo` struct	2022-05-16 11:05:43 -04:00
chaitanya sharma	c41549f630	Update readme build for osx (#1709)	2022-05-16 10:42:08 -04:00
Heikki Linnakangas	c700032dd2	Run the regression tests in CI also for PRs opened from forked repos.	2022-05-16 14:40:49 +03:00
Kirill Bulatov	33cac863d7	Test simple.conf and handle broker_endpoints better	2022-05-16 12:07:35 +03:00
Heikki Linnakangas	51ea9c3053	Don't swallow panics when the pageserver is built with failpoints. It's very confusing, and because you don't get a stack trace and error message in the logs, makes debugging very hard. However, the 'test_pageserver_recovery' test relied on that behavior. To support that, add a new "exit" action to the pageserver 'failpoints' command, so that you can explicitly request to exit the process when a failpoint is hit.	2022-05-16 09:58:58 +03:00
Heikki Linnakangas	a10cac980f	Continue with pageserver startup, if loading some tenants fails. Fixes https://github.com/neondatabase/neon/issues/1664	2022-05-15 00:25:38 +03:00
Heikki Linnakangas	081d5dac5e	Bump vendor/postgres. Includes change to reduce log noise from inmem_smgr.	2022-05-13 21:41:00 +03:00
Andrey Taranik	cded72a580	remove sk-2 from staging inventory list (#1699)	2022-05-13 20:41:54 +03:00
Egor Suvorov	768c846eeb	Fix test_delete_force from #1653 conflicting with #1692	2022-05-13 17:36:18 +02:00
Anastasia Lubennikova	a2561f0a78	Use tenant's pitr_interval instead of hardcoded 0 in the command. Adjust python tests that use the	2022-05-13 18:32:14 +03:00
Anastasia Lubennikova	aa7c601eca	Fix pitr_interval check in GC: Use timestamp->LSN mapping instead of file modification time. Fix 'latest_gc_cutoff_lsn' - set it to the minimum of pitr_cutoff and gc_cutoff. Add new test: test_pitr_gc	2022-05-13 18:32:14 +03:00
Egor Suvorov	bf899a57d9	Safekeeper: add timeline/tenant force delete HTTP endpoints (closes #895) * There is no auth in Safekeeper HTTP at all currently, so simply calling `check_permission` is not enough. * There are no checks of Safekeeper still working with the data, as "still working" is blurry now: a timeline may be "active" while there are no compute nodes and all data is propagated. * Still, callmemaybe is deactivated, and timeline is removed from the internal map. It can easily sneak back in case of race conditions and implicit creations, though.	2022-05-13 15:43:52 +02:00
Egor Suvorov	07b85e7cfc	Safekeeper refactor: move callmemaybe_tx from SafekeeperPostgresBackend to Timeline	2022-05-13 15:43:52 +02:00
Egor Suvorov	22d997049c	libs/utils/http/request: add ensure_no_body	2022-05-13 15:43:52 +02:00
Kirill Bulatov	b683308791	Return GIT_VERSION back to storage binaries	2022-05-13 16:34:32 +03:00
Kirill Bulatov	51c0f9ab2b	Force git version to be up to date via decl macro	2022-05-13 16:34:32 +03:00
Stas Kelvich	0030da57a8	compute-tools: grant rw privileges to all created users	2022-05-13 11:27:00 +03:00
Kirill Bulatov	85884a1599	Disable tenant relocation python test	2022-05-13 01:26:38 +03:00
Thang Pham	ae20751724	update `ZenithCli::create_tenant` return signature (#1692) to include the initial timeline's ID in addition to the new tenant's ID. Context: follow-up of https://github.com/neondatabase/neon/pull/1689	2022-05-12 17:27:08 -04:00
Thang Pham	5812e26b90	Create an initial timeline on CLI tenant creation (#1689) Resolves #1655	2022-05-12 16:33:09 -04:00
Arthur Petukhovsky	ec8861b8cc	Fix pageserver metrics names (#1682) Try to follow Prometheus style-guide https://prometheus.io/docs/practices/naming/ for metrics names. More specifically: - Use `pageserver_` prefix for all pageserver metrics - Specify `_seconds` unit in time metrics - Use unit as a suffix in other cases, such as `_hits`, `_bytes`, `_records` - Use `_total` suffix for accumulating counters (note that Histograms append that suffix internally)	2022-05-12 19:53:07 +03:00
Kirill Bulatov	4538f1e1b8	Correctly operate etcd safekeeper timeline data	2022-05-12 18:47:31 +03:00
Stas Kelvich	b10ae195b7	Set vendor/postgres back to the main branch I accidentally merged postgres PR that was referencing non-main branch.	2022-05-12 15:05:49 +03:00
Alexey Kondratov	b426775aa0	Use compute-tools from the new neondatabase Docker Hub repo	2022-05-12 12:26:24 +03:00
Heikki Linnakangas	5da4f3a4df	Refactor DeltaLayer::dump() function Put most of the code in a closure that returns Result, so that we can use the ?-operator for error handling. That's simpler.	2022-05-12 10:31:04 +03:00
Konstantin Knizhnik	2bde77fced	Do not apply records with LSN smaller than LSN of cached image in del… (#1672) * Do not apply records with LSN smaller than LSN of cached image in delta layer * Do not apply records with LSN smaller than LSN of cached image in delta layer	2022-05-12 07:56:02 +03:00
Dhammika Pathirana	c864091035	Fix err msg typo Signed-off-by: Dhammika Pathirana <dham@neon.tech>	2022-05-11 16:13:26 -07:00
Anton Shyrabokau	20361395bb	Add zenith-us-stage-sk-5 to circleci inventory (#1665) Co-authored-by: Debian <admin@ip-10-0-5-32.us-west-2.compute.internal>	2022-05-11 21:36:53 +03:00
Arseny Sher	b338b5dffe	Make callmemaybe less aggressive until we fix it/migrate to bigger machines.	2022-05-11 22:16:13 +04:00
Stas Kelvich	5bd879f641	Proxy: update protocol after cluster->project rename	2022-05-11 15:50:36 +03:00
Konstantin Knizhnik	e6e883eb12	Do not set LSN for new FPI page (#1657) * Do not set LSN for new FPI page refer #1656 * Add page_is_new, page_get_lsn, page_set_lsn functions * Fix page_is_new implementation * Add comment from XLogReadBufferForRedoExtended	2022-05-11 15:23:17 +03:00
Heikki Linnakangas	d710dff975	Remove unnecessary Serialize/Deserialize traits from VecMap. It's never stored on disk. Let's be tidy.	2022-05-10 23:47:40 +03:00
Arseny Sher	6cb14b4200	Optionally remove WAL on safekeepers without s3 offloading. And do that on staging, until offloading is merged.	2022-05-10 22:41:02 +04:00
Thang Pham	87dfa99734	Update layered_repository README (#1659)	2022-05-10 09:55:14 -04:00
Thang Pham	cf59b51519	Update README (Running local installation section) (#1649)	2022-05-09 11:11:46 -04:00
Kirill Bulatov	0a7735a656	Rework remote storage sync queue, general refactoring	2022-05-07 01:33:33 +03:00
Kirill Bulatov	64a602b8f3	Delete timeline layers	2022-05-07 01:33:33 +03:00
Kirill Bulatov	10e4da3997	Rework timeline batching	2022-05-07 01:33:33 +03:00
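
Note on commit 55ea3f262e: the message describes replacing a channel plus a racy atomic counter with a VecDeque protected by a Mutex and a condition variable. The following is only a minimal sketch of that general pattern using the Rust standard library; the type and method names (`SyncQueue`, `push`, `next_batch`, `len`) are hypothetical illustrations and are not taken from the pageserver source. Because the length is read under the same lock as the contents, there is no separate counter that could wrap around.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Hypothetical stand-in for a sync queue; not the pageserver's actual type.
struct SyncQueue<T> {
    items: Mutex<VecDeque<T>>,
    not_empty: Condvar,
}

impl<T> SyncQueue<T> {
    fn new() -> Self {
        SyncQueue {
            items: Mutex::new(VecDeque::new()),
            not_empty: Condvar::new(),
        }
    }

    /// Push one item while holding the lock, then wake one waiting consumer.
    fn push(&self, item: T) {
        let mut items = self.items.lock().unwrap();
        items.push_back(item);
        self.not_empty.notify_one();
    }

    /// Block until at least one item is queued, then take up to `max` items.
    fn next_batch(&self, max: usize) -> Vec<T> {
        let mut items = self.items.lock().unwrap();
        // Loop to guard against spurious wakeups.
        while items.is_empty() {
            items = self.not_empty.wait(items).unwrap();
        }
        // The length comes from the queue itself, under the same lock.
        let n = items.len().min(max);
        items.drain(..n).collect()
    }

    /// Current number of queued items.
    fn len(&self) -> usize {
        self.items.lock().unwrap().len()
    }
}

fn main() {
    let queue = Arc::new(SyncQueue::new());
    let producer = {
        let queue = Arc::clone(&queue);
        thread::spawn(move || {
            for i in 0..5 {
                queue.push(i);
            }
        })
    };
    producer.join().unwrap();

    let batch = queue.next_batch(3);
    println!("batch = {:?}, remaining = {}", batch, queue.len());
}
```

As the commit message itself points out, the trade-off of this design is that producers and the consumer may briefly block each other on the lock, which is acceptable when the callers are background threads.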


1600 Commits