rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-17 02:12:56 +00:00

Author	SHA1	Message	Date
huming	9c846a93e8	chore(doc)	2022-06-03 14:24:27 +03:00
Kirill Bulatov	81cad6277a	Move and library crates into a dedicated directory and rename them	2022-04-21 13:30:33 +03:00
Heikki Linnakangas	5d9851f5d1	Refactor the I/O functions. This introduces two new abstraction layers for I/O: - Block I/O, and - Blob I/O. The BlockReader trait abstracts a file or something else that can be read in 8kB pages. It is implemented by EphemeralFiles, and by a new FileBlockReader struct that allows reading arbitrary VirtualFiles in that manner, utilizing the page cache. There is also a new BlockCursor struct that works as a cursor over a BlockReader. When you create a BlockCursor and read the first page using it, it keeps the reference to the page. If you access the same page again, it avoids going to page cache and quickly returns the same page again. That can save a lot of lookups in the page cache if you perform multiple reads. The Blob-oriented API allows reading and writing "blobs" of arbitrary length. It is a layer on top of the block-oriented API. When you write a blob with the write_blob() function, it writes a length field followed by the actual data to the underlying block storage, and returns the offset where the blob was stored. The blob can be retrieved later using the offset. Finally, this replaces the I/O code in image-, delta-, and in-memory layers to use the new abstractions. These replace the 'bookfile' crate. This is a backwards-incompatible change to the storage format.	2022-04-07 20:58:54 +03:00
Heikki Linnakangas	2f784144fe	Avoid deadlock when locking two buffers. It happened in unit tests. If a thread tries to read a buffer while already holding a lock on one buffer, the code to find a victim buffer to evict could try to evict the buffer that's already locked. To fix, skip locked buffers.	2022-04-04 20:12:31 +03:00
Heikki Linnakangas	07342f7519	Major storage format rewrite. This is a backwards-incompatible change. The new pageserver cannot read repositories created with an old pageserver binary, or vice versa. Simplify Repository to a value-store ------------------------------------ Move the responsibility of tracking relation metadata, like which relations exist and what are their sizes, from Repository to a new module, pgdatadir_mapping.rs. The interface to Repository is now simple key-value PUT/GET operations. It's still not any old key-value store though. A Repository is still responsible for handling branching, and every GET operation comes with an LSN. Mapping from Postgres data directory to keys/values --------------------------------------------------- All the data is now stored in the key-value store. The 'pgdatadir_mapping.rs' module handles mapping from PostgreSQL objects like relation pages and SLRUs, to key-value pairs. The key to the Repository key-value store is a Key struct, which consists of a few integer fields. It's wide enough to store a full RelFileNode, fork and block number, and to distinguish those from metadata keys. 'pgdatadir_mapping.rs' is also responsible for maintaining a "partitioning" of the keyspace. Partitioning means splitting the keyspace so that each partition holds a roughly equal number of keys. The partitioning is used when new image layer files are created, so that each image layer file is roughly the same size. The partitioning is also responsible for reclaiming space used by deleted keys. The Repository implementation doesn't have any explicit support for deleting keys. Instead, the deleted keys are simply omitted from the partitioning, and when a new image layer is created, the omitted keys are not copied over to the new image layer. We might want to implement tombstone keys in the future, to reclaim space faster, but this will work for now. 
Changes to low-level layer file code ------------------------------------ The concept of a "segment" is gone. Each layer file can now store an arbitrary range of Keys. Checkpointing, compaction ------------------------- The background tasks are somewhat different now. Whenever checkpoint_distance is reached, the WAL receiver thread "freezes" the current in-memory layer, and creates a new one. This is a quick operation and doesn't perform any I/O yet. It then launches a background "layer flushing thread" to write the frozen layer to disk, as a new L0 delta layer. This mechanism takes care of durability. It replaces the checkpointing thread. Compaction is a new background operation that takes a bunch of L0 delta layers, and reshuffles the data in them. It runs in a separate compaction thread. Deployment ---------- This also contains changes to the ansible scripts that enable having multiple different pageservers running at the same time in the staging environment. We will use that to keep an old version of the pageserver running, for clusters created with the old version, at the same time with a new pageserver with the new binary. Author: Heikki Linnakangas Author: Konstantin Knizhnik <knizhnik@zenith.tech> Author: Andrey Taranik <andrey@zenith.tech> Reviewed-by: Matthias Van De Meent <matthias@zenith.tech> Reviewed-by: Bojan Serafimov <bojan@zenith.tech> Reviewed-by: Konstantin Knizhnik <knizhnik@zenith.tech> Reviewed-by: Anton Shyrabokau <antons@zenith.tech> Reviewed-by: Dhammika Pathirana <dham@zenith.tech> Reviewed-by: Kirill Bulatov <kirill@zenith.tech> Reviewed-by: Anastasia Lubennikova <anastasia@zenith.tech> Reviewed-by: Alexey Kondratov <alexey@zenith.tech>	2022-03-28 05:41:15 -05:00
Kirill Bulatov	edc7bebcb5	Remove obvious panic sources	2022-03-25 11:58:54 +02:00
Heikki Linnakangas	c718870517	Tiny refactoring of page_cache::init function. The init function only needs the 'page_cache_size' from the config, so seems slightly nicer to pass just that.	2022-03-24 09:46:07 +02:00
Kirill Bulatov	114a757d1c	Use generic config parameters in pageserver cli Co-authored-by: Heikki Linnakangas <heikki.linnakangas@iki.fi>	2021-12-23 18:58:28 +02:00
Heikki Linnakangas	5aa969a588	Replace in-memory layers and OOM-triggered eviction with temp files. The "in-memory layer" name is a misnomer now: each in-memory layer is actually backed by a file. The files are ephemeral, in that they don't survive page server crash or shutdown. To avoid reading the file for every operation, "ephemeral files" are cached in a page cache. This includes changes from the 'inmemory-layer-chunks' branch to serialize the page versions when they are added to the open layer. The difference is that they are not serialized to the expandable in-memory "chunk buffer", but written out to the file.	2021-11-26 17:25:17 +03:00
Heikki Linnakangas	431d32756b	Add a buffer cache, and use it to store materialized pages. The buffer cache is shared across all tenants, allowing memory to be dynamically allocated where it's needed the most. The cache works on 8 kB pages, and uses the clock algorithm for replacement policy; same as the PostgreSQL buffer cache. One peculiarity is that the materialized page versions can be looked up by an inexact LSN, to find the latest page version with an LSN >= the search key. The code is structured to support caching other kinds of pages in the same cache in the future, but with a different mapping key. Co-authored-by: Patrick Insinger <patrick@zenith.tech>	2021-11-12 11:02:12 -08:00
Heikki Linnakangas	b949127b06	Rename page_cache.rs to tenant_mgr.rs. Once upon a time, 'page_cache.rs' contained an actual page cache, but it hasn't for a very long time. Rename to reflect what it actually does these days.	2021-08-30 15:17:30 +03:00
Heikki Linnakangas	4046530160	Remove remnants of choosing between repository formats. Now that we only have one Repository implementation, no need for the command-line options to choose it either. I'm removing these as a separate commit to show what we will need to do if we add another Repository implementation in the future (even though I don't foresee us doing that any time soon)	2021-08-25 18:37:22 +03:00
Heikki Linnakangas	5998744bcc	Remove rocksdb implementation. The layered storage format is good enough that we don't need the rocksdb implementation anymore. There are a lot of known issues but we'll keep working on them.	2021-08-25 18:37:22 +03:00
Heikki Linnakangas	2450f82de5	Introduce a new "layered" repository implementation. This replaces the RocksDB based implementation with an approach using "snapshot files" on disk, and in-memory btreemaps to hold the recent changes. This makes the repository implementation a configuration option. You can choose 'layered' or 'rocksdb' with "zenith init --repository-format=<format>". The unit tests have been refactored to exercise both implementations. 'layered' is now the default. Push/pull is not implemented. The 'test_history_inmemory' test has been commented out accordingly. It's not clear how we will implement that functionality; probably by copying the snapshot files directly.	2021-08-16 10:06:48 +03:00
Dmitry Rodionov	ce5333656f	Introduce authentication v0.1. Current state of authentication: the page server validates the JWT token passed as a password during the connection phase. Later, when performing an action such as creating a branch, the tenant parameter of the operation is validated to match the one submitted in the token. To allow access from the console there is a dedicated scope, PageServerApi, which allows access to all tenants. See the access validation code in PageServerHandler::check_permission. Because we are in the middle of refactoring the communication layer involving the wal proposer protocol and safekeeper<->pageserver, the safekeeper currently doesn’t check the token passed from compute, and uses a “hardcoded” token passed via an environment variable to communicate with the pageserver. Compute postgres now takes the token from an environment variable and passes it as the password field in the pageserver connection. It is not passed through settings because then the user would be able to retrieve it using pg_settings or SHOW .. I’ve added a basic test in test_auth.py. Probably after we add authentication to the remaining network paths we should enable it by default and switch all existing tests to use it.	2021-08-11 20:05:54 +03:00
Heikki Linnakangas	acc0f41985	Don't try to launch duplicate WAL redo thread if tenant already exists. The codepath for tenant_create command first launched the WAL redo thread, and then called branches::create_repo() which checked if the tenant's directory already exists. That's problematic, because launching the WAL redo thread will run initdb if the directory doesn't already exist. Race condition: If the tenant already exists, it will have a WAL redo thread already running, and the old and new WAL redo thread might try to run initdb at the same time, causing all kinds of weird failures. The test_pageserver_api test was failing 100% repeatably on my laptop because of this. I'm not sure why this doesn't occur on the CI: Jul 31 18:05:48.877 INFO running initdb in "./tenants/5227e4eb90894775ac6b8a8c76f24b2e/wal-redo-datadir", location: pageserver::walredo, pageserver/src/walredo.rs:483 thread 'WAL redo thread' panicked at 'initdb failed: The files belonging to this database system will be owned by user "heikki". This user must also own the server process. The database cluster will be initialized with locale "C". The default database encoding has accordingly been set to "SQL_ASCII". The default text search configuration will be set to "english". Data page checksums are disabled. creating directory ./tenants/0305b1326f3ea33add0929d516da7cb6/wal-redo-datadir ... ok creating subdirectories ... ok selecting dynamic shared memory implementation ... posix selecting default max_connections ... 100 selecting default shared_buffers ... 128MB selecting default time zone ... Europe/Helsinki creating configuration files ... ok running bootstrap script ... 
stderr: 2021-07-31 15:05:48.875 GMT [282569] LOG: could not open configuration file "/home/heikki/git-sandbox/zenith/test_output/test_tenant_list/repo/./tenants/0305b1326f3ea33add0929d516da7cb6/wal-redo-datadir/postgresql.conf": No such file or directory 2021-07-31 15:05:48.875 GMT [282569] FATAL: configuration file "/home/heikki/git-sandbox/zenith/test_output/test_tenant_list/repo/./tenants/0305b1326f3ea33add0929d516da7cb6/wal-redo-datadir/postgresql.conf" contains errors child process exited with exit code 1 initdb: removing data directory "./tenants/0305b1326f3ea33add0929d516da7cb6/wal-redo-datadir"	2021-07-31 18:13:21 +03:00
Dmitry Rodionov	767590bbd5	Support tenants. This patch adds support for tenants. It mostly touches the pageserver. The directory layout on disk changes to contain a new layer of indirection: the path to a particular repository now has the structure <pageserver workdir>/tenants/<tenant id>. The tenant id has the same format as a timeline id. The tenant id is included in pageserver commands when needed. New pageserver commands are also available: tenant_list and tenant_create. This is also reflected in the CLI. During init, a default tenant is created and its id is saved in the CLI config, so subsequent commands can use it without extra options. The tenant id is also included in the compute postgres configuration, so it can be passed via ServerInfo to the safekeeper and in the connection string to the pageserver. For more info see docs/multitenancy.md.	2021-07-22 20:54:20 +03:00
Heikki Linnakangas	34f4207501	Refactoring of the Repository/Timeline stuff - All timelines are now stored in the same rocksdb repository. The GET functions have been taught to follow the ancestors. - Change the way relation size is stored. Instead of inserting "tombstone" entries for blocks that are truncated away, store relation size as separate key-value entry for each relation - Add an abstraction for the key-value store: ObjectStore. It allows swapping RocksDB with some other key-value store easily. Perhaps we will write our own storage implementation using that interface, or perhaps we'll need a different abstraction, but this is a small improvement over status quo in any case. - Garbage Collection is broken and commented out. It's not clear where and how it should be implemented.	2021-05-27 20:07:50 +03:00
Heikki Linnakangas	600e1a0080	Pass PageServerConf as static ref. It's created once early in server startup, after parsing the command-line options, and never modified afterwards. To simplify things, pass it around as static ref, instead of making copies in all the different structs. We still pass around a reference to it, rather than putting it in a global variable, to allow unit testing with different configs in the same process.	2021-05-20 09:11:36 +03:00
Stas Kelvich	746f667311	Refactor CLI and CLI<->pageserver interfaces to support remote pageserver This patch started as an effort to support CLI working against remote pageserver, but turned into a pretty big refactoring. * CLI now does not look into repository files directly. New commands 'branch_create' and 'identify_system' were introduced into page_service to support that. * Branch management that was scattered between local_env and zenith/main.rs is moved into pageserver/branches.rs. That code could better fit in Repository/Timeline impl, but I'll leave that for a different patch. * All tests-related code from local_env went into integration_tests/src/lib.rs as an extension to PostgresNode trait. * Paths-generating functions were concentrated around corresponding config types (LocalEnv and PageserverConf).	2021-05-17 19:17:51 +03:00
Heikki Linnakangas	270356ec38	Refactor WalRedoManager for easier testing. Turn WalRedoManager into an abstract trait, so that it can be easily mocked in unit tests. One change here is that the WAL redo manager is no longer tied to a specific zenith timeline. It didn't do anything with that information aside from using it in the dummy datadir's name. We could use any random string for that purpose, it's just to prevent two WAL redo managers from stepping over each other. But this commit actually changes things so that all timelines use the same WAL redo manager, so that's not necessary. We will probably want to maintain a pool of WAL redo processes in the future, but for now let's keep it simple. In passing, fix some comments.	2021-05-14 12:44:49 +03:00
Heikki Linnakangas	c2db828481	Create RocksDB databases under correct path. We used to create them under .zenith/.zenith/<timelineid>. The double .zenith was clearly not intentional. Change it to .zenith/timelines/<timelineid>. Fixes https://github.com/zenithdb/zenith/issues/127	2021-05-14 12:44:44 +03:00
Heikki Linnakangas	b484b896b6	Refactor the functionality of page_cache.rs. This moves things around: - The PageCache is split into two structs: Repository and Timeline. A Repository holds multiple Timelines. In order to get a page version, you must first get a reference to the Repository, then the Timeline in the repository, and finally call the get_page_at_lsn() function on the Timeline object. This sounds complicated, but because each connection from a compute node, and each WAL receiver, only deals with one timeline at a time, the callers can get the reference to the Timeline object once and hold onto it. The Timeline corresponds most closely to the old PageCache object. - Repository and Timeline are now abstract traits, so that we can support multiple implementations. I don't actually expect us to have multiple implementations for long. We have the RocksDB implementation now, but as soon as we have a different implementation that's usable, I expect that we will retire the RocksDB implementation. But I think this abstraction works as good documentation in any case: it's now easier to see what the interface for storing and loading pages from the repository is, by looking at the Repository/Timeline traits. The abstract traits are in repository.rs, and the RocksDB implementation of them is in repository/rocksdb.rs. - page_cache.rs is now a "switchboard" to get a handle to the repository. Currently, the page server can only handle one repository at a time, so there isn't much there, but in the future we might do multi-tenancy there.	2021-05-05 10:37:36 +03:00
Heikki Linnakangas	086c0ad829	Remove unused 'apply_pending' field.	2021-04-30 12:44:06 +03:00
Eric Seppanen	975b2d12dc	cargo fmt	2021-04-28 10:01:58 -07:00
Heikki Linnakangas	c7f54af1f1	Refactor page_cache <-> walredo interface. Make the caller of request_redo() responsible for gathering the WAL records to redo, and for storing the reconstructed page image back in the page cache. This leaves the WAL redo manager purely responsible for dealing with the postgres child process, removing its dependency on the PageCache.	2021-04-27 21:43:56 +03:00
Heikki Linnakangas	cff671c1bd	Remove duplicated LSN fields from the page cache. Having multiple copies of the same values is a source of confusion. Commit `da9bf5dc63` fixed one race condition caused by that, for example. See also discussion at https://github.com/zenithdb/zenith/issues/57#issuecomment-824393470 This changes SeqWait.advance() to return the old number, and not panic if you try to move the value backwards. The caller should check for that and act accordingly.	2021-04-27 10:32:39 +03:00
Eric Seppanen	4acdcbe90f	clippy cleanup #3 Fix issues raised by clippy. Mostly trivial ones, though some allow 4-5 lines of code to be reduced to 1.	2021-04-26 12:35:35 -07:00
Heikki Linnakangas	f617115467	Remove obsolete comment on async usage in the page cache	2021-04-26 14:12:57 +03:00
Heikki Linnakangas	3b9e7fc5e6	Use explicit threads. Remove 'async' usage as much as feasible. Async code is harder to debug, and mixing async and non-async code is a recipe for confusion and bugs. There are a couple of exceptions: - The code in walredo.rs, which needs to read and write to the child process simultaneously, still uses async. It's more convenient there. The 'async' usage is carefully limited to just the functions that communicate with the child process. - Code in walreceiver.rs that uses tokio-postgres to do streaming replication. We have to use async there, because tokio-postgres is async. Most rust-postgres functionality has non-async wrappers, but not the new replication client code. The async usage is very limited here, too: we use just block_on to call the tokio-postgres functions. The code in 'page_service.rs' now launches a dedicated thread for each connection. This replaces tokio::sync::watch::channel with std::sync::mpsc in 'seqwait.rs', to make that non-async. It's not a drop-in replacement, though: std::sync::mpsc doesn't support multiple consumers, so we cannot share a channel between multiple waiters. So this removes the code to check if an existing channel can be reused, and creates a new one for each waiter. That created another problem: BTreeMap cannot hold duplicates, so I replaced that with BinaryHeap. Similarly, the tokio::{mpsc, oneshot} channels used between WAL redo manager and PageCache are replaced with std::sync::mpsc. (There is no separate 'oneshot' channel in the standard library.) Fixes github issue #58, and coincidentally also issue #66.	2021-04-26 13:07:51 +03:00
Eric Seppanen	1c775bdcac	Drop LSNs from PageCacheStats There's no clear way to sum LSNs across timelines, so just remove them for now.	2021-04-25 19:37:02 -07:00
Eric Seppanen	07d0241076	add AtomicLsn AtomicLsn is a wrapper around AtomicU64 that has load() and store() members that are cheap (on x86, anyway) and can be safely used in any context. This commit uses AtomicLsn in the page cache, and fixes up some downstream code that manually implemented LSN formatting. There's also a bugfix to the logging in wait_lsn, which prints the wrong lsn value.	2021-04-25 19:37:02 -07:00
Eric Seppanen	d760446053	remove Lsn::sub in favor of sub_checked There is only one place doing subtraction, and it had a manually implemented check.	2021-04-25 19:37:02 -07:00
Eric Seppanen	01e239afa3	apply Lsn type everywhere Use the `Lsn` type everywhere that I can find u64 being used to represent an LSN.	2021-04-25 19:37:02 -07:00
Konstantin Knizhnik	da9bf5dc63	Store atomic last_valid_lsn after seqwait_lsn.advance	2021-04-25 14:11:31 +03:00
Eric Seppanen	1cb9b5523b	cargo fmt	2021-04-24 16:03:44 -07:00
Konstantin Knizhnik	968cd8f20c	Do not delete versions in GC	2021-04-24 23:52:50 +03:00
Konstantin Knizhnik	3e007b0eb9	Do not delete versions in GC	2021-04-24 22:32:22 +03:00
Heikki Linnakangas	5e0cc89de8	Re-group functions in page_cache.rs, and add comments.	2021-04-24 17:54:31 +03:00
Heikki Linnakangas	0fc05569e0	Improve comments in page_cache.rs. Explain the mix of async and other functions in the page cache.	2021-04-24 17:54:28 +03:00
Heikki Linnakangas	021462da3e	Refactor put_wal_record() so that it doesn't need to be marked 'async'. It was only marked as async because it calls relsize_get(), but relsize_get() will in fact never block when it's called with the max LSN value, like put_wal_record() does. Refactor to avoid marking put_wal_record() as 'async'.	2021-04-24 17:54:26 +03:00
Heikki Linnakangas	93d7d2ae2a	Refactor pagecache <-> Wal redo communication After the rocksdb patch (commit `6aa38d3f7d`), the CacheEntry struct was used only momentarily in the communication between the page_cache and the walredo modules. It was in fact not stored in any cache anymore. For clarity, refactor the communication. There is now a WalRedoManager struct, with `request_redo` function, that can be used to request WAL replay of a particular page. It sends a request to a queue like before, but the queue has been replaced with tokio::sync::mpsc. Previously, the resulting page image was stored directly in the CacheEntry, and the requestor was notified using a condition variable. Now, the requestor includes a 'oneshot' channel in the request, and the WAL redo manager sends the response there.	2021-04-24 12:24:04 +03:00
Konstantin Knizhnik	499b4f7eba	Log garbage collection statistics	2021-04-23 18:02:58 +03:00
Konstantin Knizhnik	52ee3a2bac	Support CREATE DATABASE command	2021-04-23 17:03:56 +03:00
anastasia	573f1ada83	[issue #56] Fix race at postgres instance + walreceiver start. Uses postgres/vendor issue_56_rebased branch.	2021-04-23 13:35:30 +03:00
Konstantin Knizhnik	59b23fef64	Wait for WAL receiver to start	2021-04-23 12:40:29 +03:00
Konstantin Knizhnik	ee87e6aad3	Sum log files in case of test failure	2021-04-22 22:14:41 +03:00
Konstantin Knizhnik	ff3488fadd	Fix bug in do_gc	2021-04-22 19:37:33 +03:00
Konstantin Knizhnik	4a0a9e748c	Enable garbage collector	2021-04-22 17:52:15 +03:00
Konstantin Knizhnik	ed30f2096c	Disable GC by default	2021-04-22 11:30:27 +03:00
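
Commit 5d9851f5d1 above describes write_blob() as writing a length field followed by the data and returning the offset at which the blob was stored. A minimal sketch of that idea follows. It is not the pageserver's actual code: the `BlobStore` type, the `Vec<u8>` backing store, the `read_blob` helper and the 4-byte little-endian length prefix are all assumptions made for illustration, since the commit message does not specify the on-disk encoding.

```rust
use std::convert::{TryFrom, TryInto};

/// Hypothetical in-memory stand-in for the block storage under the blob layer.
struct BlobStore {
    buf: Vec<u8>,
}

impl BlobStore {
    fn new() -> Self {
        BlobStore { buf: Vec::new() }
    }

    /// Append a length field followed by the data.
    /// Returns the offset of the length field, which identifies the blob.
    fn write_blob(&mut self, data: &[u8]) -> u64 {
        let offset = self.buf.len() as u64;
        // Assumption: a fixed 4-byte little-endian length prefix.
        let len = u32::try_from(data.len()).expect("blob too large for a u32 length field");
        self.buf.extend_from_slice(&len.to_le_bytes());
        self.buf.extend_from_slice(data);
        offset
    }

    /// Read back the blob stored at `offset`.
    /// Returns None if the offset or the recorded length falls outside the store.
    fn read_blob(&self, offset: u64) -> Option<&[u8]> {
        let start = usize::try_from(offset).ok()?;
        let len_end = start.checked_add(4)?;
        let len_bytes: [u8; 4] = self.buf.get(start..len_end)?.try_into().ok()?;
        let len = u32::from_le_bytes(len_bytes) as usize;
        self.buf.get(len_end..len_end.checked_add(len)?)
    }
}
```

In this sketch the caller keeps the offset returned by `write_blob` and later passes it to `read_blob`, mirroring the "retrieved later using the offset" behaviour the commit describes.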

83 Commits
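
Commit 07d0241076 in the history above introduces AtomicLsn as a wrapper around AtomicU64 with load() and store() members. A rough sketch of such a wrapper might look like the following. The field names, the constructor, and the Acquire/Release memory orderings are assumptions, not taken from the repository; the Display impl follows PostgreSQL's usual high/low hexadecimal notation for LSNs.

```rust
use std::fmt;
use std::sync::atomic::{AtomicU64, Ordering};

/// A log sequence number, stored as a plain 64-bit integer.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Lsn(u64);

impl fmt::Display for Lsn {
    /// Format as "high/low" in uppercase hex, as PostgreSQL prints LSNs.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:X}/{:X}", self.0 >> 32, self.0 & 0xffff_ffff)
    }
}

/// Hypothetical sketch of an atomically updatable LSN.
struct AtomicLsn {
    inner: AtomicU64,
}

impl AtomicLsn {
    fn new(lsn: Lsn) -> Self {
        AtomicLsn { inner: AtomicU64::new(lsn.0) }
    }

    /// Atomically read the current value.
    /// Assumption: Acquire ordering; the real code may choose differently.
    fn load(&self) -> Lsn {
        Lsn(self.inner.load(Ordering::Acquire))
    }

    /// Atomically overwrite the current value.
    /// Assumption: Release ordering, pairing with the Acquire load above.
    fn store(&self, lsn: Lsn) {
        self.inner.store(lsn.0, Ordering::Release);
    }
}
```

Because `load` and `store` take `&self`, a value of this type can be shared between threads (for example behind an `Arc`) without a mutex.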