rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-03 11:32:56 +00:00

Author	SHA1	Message	Date
bojanserafimov	ef40e404cf	Rename zenith crate to neon_local (#1625)	2022-05-05 19:06:53 -04:00
Dmitry Ivanov	d3f356e7a8	Update `rust-postgres` project-wide (#1525) * Update `rust-postgres` project-wide This commit points to https://github.com/neondatabase/rust-postgres/commits/neon in order to test our patches on top of the latest version of this crate. * [proxy] Update `hmac` and `sha2`	2022-04-22 17:31:58 +03:00
Kirill Bulatov	81cad6277a	Move and library crates into a dedicated directory and rename them	2022-04-21 13:30:33 +03:00
Kirill Bulatov	629688fd6c	Drop redundant resolver setting for 2021 edition	2022-04-21 13:30:33 +03:00
Kirill Bulatov	81417788c8	walkeeper -> safekeeper	2022-04-18 12:52:31 +03:00
Dmitry Ivanov	ab20f2c491	Use the same version of `rust-postgres` everywhere. (#1516) Turns out we still had a stale dep in `compute_tools`.	2022-04-15 18:36:11 +03:00
Dmitry Rodionov	eee0f51e0c	use cargo-hakari to manage workspace_hack crate workspace_hack is needed to avoid recompilation when different crates inside the workspace depend on the same packages but with different features being enabled. Problem occurs when you build crates separately one by one. So this is irrelevant to our CI setup because there we build all binaries at once, but it may be relevant for local development. this also changes cargo's resolver version to 2	2022-03-29 10:42:04 +03:00
Dmitry Ivanov	a47dade622	[proxy] Migrate to async This change makes most parts of the code asynchronous, except for the `mgmt` subsystem (we're going to drop it anyway). Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>	2022-02-17 11:54:27 +03:00
Alexey Kondratov	06c28174c2	Integrate compute_tools into zenith workspace and improve logging (zenithdb/console#487)	2022-01-18 18:47:31 +03:00
Dmitry Ivanov	cb1b4a12a6	Add some prometheus metrics to pageserver The metrics are served by an http endpoint, which is meant to be spawned in a new thread. In the future the endpoint will provide more APIs, but for the time being, we won't bother with proper routing.	2021-08-03 21:42:24 +03:00
Stas Kelvich	bf45bef284	md5 auth for postgres_backend.rs	2021-07-19 14:52:41 +03:00
Dmitry Ivanov	eb7388e3e8	Add debug info to release builds This is useful for profiling and, to some extent, debug. Besides, debug info should not affect the performance.	2021-06-28 14:21:30 +03:00
Arseny Sher	37b0236e9a	Move wal acceptor tests to python. Includes fixtures for wal acceptors and associated setup. Nothing really new here, but surprisingly this caught some issues in walproposer. ref #182	2021-06-15 15:14:27 +03:00
Eric Seppanen	df5a55c445	add workspace_hack crate Our builds can be a little inconsistent, because Cargo doesn't deal well with workspaces where there are multiple crates which have different dependencies that select different features. As a workaround, copy what other big rust projects do: add a workspace_hack crate. This crate just pins down a set of dependencies and features that satisfies all of the workspace crates. The benefits are: - running `cargo build` from one of the workspace subdirectories now works without rebuilding anything. - running `cargo install` works (without rebuilding anything). - making small dependency changes is much less likely to trigger large dependency rebuilds.	2021-05-07 13:08:31 -07:00
Eric Seppanen	f387769203	add zenith_utils crate This is a place for code that's shared between other crates in this repository.	2021-04-20 11:11:29 -07:00
Heikki Linnakangas	3600b33f1c	Implement "timelines" in page server This replaces the page server's "datadir" concept. The Page Server now always works with a "Zenith Repository". When you initialize a new repository with "zenith init", it runs initdb and loads an initial basebackup of the freshly-created cluster into the repository, on "main" branch. Repository can hold multiple "timelines", which can be given human-friendly names, making them "branches". One page server simultaneously serves all timelines stored in the repository, and you can have multiple Postgres compute nodes connected to the page server, as long as they all operate on a different timeline. There is a new command "zenith branch", which can be used to fork off new branches from existing branches. The repository uses the directory layout described as Repository format v1 in https://github.com/zenithdb/rfcs/pull/5. It is highly inefficient: - we never create new snapshots. So in practice, it's really just a base backup of the initial empty cluster, and everything else is reconstructed by redoing all WAL - when you create a new timeline, the base snapshot and all WAL is copied from the old timeline to the new one. There is no smarts about referencing the old snapshots/wal from the ancestor timeline. To support all this, this commit includes a bunch of other changes: - Implement "basebackup" functionality in page server. When you initialize a new compute node with "zenith pg create", it connects to the page server, and requests a base backup of the Postgres data directory on that timeline. (the base backup excludes user tables, so it's not as bad as it sounds). - Have page server's WAL receiver write the WAL into timeline dir. This allows running a Page Server and Compute Nodes without a WAL safekeeper, until we get around to integrating that properly into the system. (Even after we integrate WAL safekeeper, this is perhaps how this will operate when you want to run the system on your laptop.) 
- restore_datadir.rs was renamed to restore_local_repo.rs, and heavily modified to use the new format. It now also restores all WAL. - Page server no longer scans and restores everything into memory at startup. Instead, when the first request is made for a timeline, the timeline is slurped into memory at that point. - The responsibility for telling page server to "callmemaybe" was moved into Postgres libpqpagestore code. Also, WAL producer connstring cannot be specified in the pageserver's command line anymore. - Having multiple "system identifiers" in the same page server is no longer supported. I repurposed much of that code to support multiple timelines, instead. - Implemented very basic, incomplete, support for PostgreSQL's Extended Query Protocol in page_service.rs. Turns out that rust-postgres' copy_out() function always uses the extended query protocol to send out the command, and I'm using that to stream the base backup from the page server. TODO: I haven't fixed the WAL safekeeper for this scheme, so all the integration tests involving safekeepers are failing. My plan is to modify the safekeeper to know about Zenith timelines, too, and modify it to work with the same Zenith repository format. It only needs to care about the '.zenith/timelines/<timeline>/wal' directories.	2021-04-20 19:11:27 +03:00
Stas Kelvich	59163cf3b3	Rework control_plane code to reuse it in CLI. Move all paths from control_plane to local_env which can set them for testing environment or for local installation.	2021-04-10 12:09:20 +03:00
Heikki Linnakangas	1367332447	Separate walkeeper and pageserver sources into different directories. The integration tests, which depend on both walkeeper and pageserver, are moved into yet another directory, 'integration_tests'.	2021-04-06 13:15:26 +03:00
Konstantin Knizhnik	13f507f0b4	Calculate records CRC in wal decoder	2021-04-02 10:30:56 +03:00
Konstantin Knizhnik	02ca245081	Port wal_acceptor to rust	2021-04-02 10:30:56 +03:00
anastasia	853c130ff0	[issue #7] CLI parse subcommands	2021-03-29 15:59:28 +03:00
Stas Kelvich	8a80d055b9	daemon mode for pageserver	2021-03-29 15:59:28 +03:00
Heikki Linnakangas	4c0be32bf5	Implement Text User Interface to show log streams in multiple "windows" Switch to 'slog' crate for logging, it gives us the flexibility that we need for the widget to scroll logs on TUI	2021-03-29 15:59:28 +03:00
Stas Kelvich	9e89c1e2cd	add CLI options to pageserver	2021-03-29 15:59:28 +03:00
Heikki Linnakangas	303a546aba	Refactor locking in page cache, and use async I/O for WAL redo Story on why: The apply_wal_records() function spawned the special postgres process to perform WAL redo. That was done in a blocking fashion: it launches the process, then it writes the command to its stdin, then it reads the result from its stdout. I wanted to also read the child process's stderr, and forward it to the page server's log (which is just the page server's stderr ATM). That has classic potential for deadlock: the child process might block trying to write to stderr/stdout, if the parent isn't reading it. So the parent needs to perform the read/write with the child's stdin/stdout/stderr in an async fashion. So I refactored the code in walredo.c into async style. But it started to hang. It took me a while to figure it out; async makes for really ugly stacktraces, it's hard to figure out what's going on. The call path goes like this: Page service -> get_page_at_lsn() in page cache -> apply_wal_records(). The page service is written in async style. And I refactored apply_wal_records() to also be async. BUT, get_page_at_lsn() acquires a lock, in a blocking fashion. The lock-up happened like this: - a GetPage@LSN request arrives. The async handler thread calls get_page_at_lsn(), which acquires a lock. While holding the lock, it calls apply_wal_records(). - apply_wal_records() launches the child process, and waits on it using async functions - more GetPage@LSN requests arrive. They also call get_page_at_lsn(). But because the lock is already held, they all block. The subsequent GetPage@LSN calls that block waiting on the lock use up all the async handler threads. All the threads are locked up, so there is no one left to make progress on the apply_wal_records() call, so it never releases the lock. Deadlock. So my lesson here is that mixing async and blocking styles is painful. 
Googling around, this is a well known problem, there are long philosophical discussions on "what color is your function". My plan to fix that is to move the WAL redo into a separate thread or thread pool, and have the GetPage@LSN handlers communicate with it using channels. Having a separate thread pool for it makes sense anyway in the long run. We'll want to keep the postgres process around, rather than launch it separately every time we need to reconstruct a page. Also, when we're not busy reconstructing pages that are needed right now by GetPage@LSN calls, we want to proactively apply incoming WAL records from a "backlog". Solution: Launch a dedicated thread for WAL redo at startup. It has an event loop, where it listens on a channel for requests to apply WAL. When a page server thread needs some WAL to be applied, it sends the request on the channel, and waits for response. After it's done the WAL redo process puts the new page image in the page cache, and wakes up the requesting thread using a condition variable. This also needed locking changes in the page cache. Each cache entry now has a reference counter and a dedicated Mutex to protect just the entry.	2021-03-29 15:59:28 +03:00
Heikki Linnakangas	e7694a1d5a	page server: use logger module	2021-03-29 15:59:28 +03:00
Stas Kelvich	851e910324	WIP: local control plane	2021-03-29 15:59:28 +03:00
Heikki Linnakangas	8bb282dcad	page server: Restore base backup from S3 at page server startup This includes a "launch.sh" script that I've been using to initialize and launch the Postgres + Page Server combination.	2021-03-29 15:59:28 +03:00
Heikki Linnakangas	af7ebb6395	Implement WAL redo. When a page is requested from the page cache (GetPage@LSN), launch postgres in special "WAL redo" mode to reconstruct that page version. Plus a bunch of misc fixes to the WAL decoding code.	2021-03-29 15:59:27 +03:00
Stas Kelvich	626b4e9987	basic support of postgres backend protocol	2021-03-29 15:59:27 +03:00
Heikki Linnakangas	3058021ca7	WIP: beginnings of page server page cache	2021-03-29 15:57:15 +03:00
Heikki Linnakangas	9a9480e8c9	Add WIP support for decoding WAL records.	2021-03-29 15:57:15 +03:00
Stas Kelvich	c856a2f2d2	pageserver stub	2021-03-29 15:57:15 +03:00

33 Commits
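
The "Solution" paragraph of commit 303a546aba (a dedicated WAL redo thread with an event loop, fed by a channel) can be illustrated with a minimal sketch. This is not the code from that commit. The names `RedoRequest`, `spawn_redo_thread`, and `request_redo` are hypothetical, `apply_wal_records` is only a stand-in that appends bytes instead of running Postgres, and the result comes back over a per-request reply channel rather than through the page cache and condition variable that the commit message describes.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// Hypothetical request: a base page image, the WAL records to apply to it,
/// and a channel on which the redo thread sends the result back.
struct RedoRequest {
    base_page: Vec<u8>,
    records: Vec<Vec<u8>>,
    reply: Sender<Vec<u8>>,
}

/// Stand-in for real WAL redo: appends each record's bytes to the page.
fn apply_wal_records(mut page: Vec<u8>, records: &[Vec<u8>]) -> Vec<u8> {
    for rec in records {
        page.extend_from_slice(rec);
    }
    page
}

/// Spawn the dedicated redo thread and return the sending half of its queue.
fn spawn_redo_thread() -> Sender<RedoRequest> {
    let (tx, rx) = channel::<RedoRequest>();
    thread::spawn(move || {
        // Event loop: runs until every Sender has been dropped.
        for req in rx {
            let page = apply_wal_records(req.base_page, &req.records);
            // The requester may have gone away; ignore a failed send.
            let _ = req.reply.send(page);
        }
    });
    tx
}

/// What a request handler would do: enqueue the work and wait for the reply.
/// No shared lock is held while waiting, so other handlers are not blocked
/// behind this one.
fn request_redo(
    redo: &Sender<RedoRequest>,
    base_page: Vec<u8>,
    records: Vec<Vec<u8>>,
) -> Vec<u8> {
    let (reply_tx, reply_rx) = channel();
    redo.send(RedoRequest { base_page, records, reply: reply_tx })
        .expect("redo thread is gone");
    reply_rx.recv().expect("redo thread dropped the request")
}
```

A handler would call `spawn_redo_thread()` once at startup, clone the returned `Sender` per handler thread, and call `request_redo` whenever a page needs reconstructing. The point of the design is that the only thread that ever waits on redo work is the one that asked for it.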