rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-06-01 12:30:38 +00:00

Author	SHA1	Message	Date
Konstantin Knizhnik	b227c63edf	Set proper xl_prev in basebackup, when possible. In passing, fix two minor issues with basebackup: * check that we can't create branches with pre-initdb LSN's * normalize branch LSN's that are pointing to the segment boundary patch by @knizhnik closes #506	2021-09-03 14:58:59 +03:00
anastasia	66dcaa4e01	Rename put_unlink() to drop_relish() in Timeline trait. Rename put_unlink() to drop_segment() in Layer trait.	2021-09-03 11:00:38 +03:00
Heikki Linnakangas	1686715ad0	Partial fix for issue with extending relation with a gap. This should fix the sporadic regression test failures we've been seeing lately with "no base img found" errors. This fixes the common case, but one corner case is still not handled: If a relation is extended across a segment boundary, leaving a gap block in the segment preceding the segment containing the target block, the preceding segment will not be padded with zeros correctly. This adds a test case for that, but it's commented out. See github issue https://github.com/zenithdb/zenith/issues/500	2021-09-02 22:01:46 +03:00
Dmitry Rodionov	bc709561b6	fix clippy warnings	2021-09-02 18:54:44 +03:00
Stas Kelvich	59c19d6e18	Rework basebackup. * add lsn argument * do not expose wait_lsn, wait inside list_nonrels() * fix parameters parsing * expose get_last_record_rlsn() to atomically read (last,prev) pair More work is needed to correctly handle basebackup@old_lsn but current approach already allows to fix test_restart_compute	2021-09-02 12:06:12 +03:00
Stas Kelvich	8c07a36fda	Remove last_valid_lsn tracking in wal_receiver. There are two main reasons for that: a) The latest unfinished record may disappear after a compute node restart, so let's try not to leak the volatile part of the WAL into the repository. Always use last_valid_record instead. That change requires different getPage@LSN logic in postgres -- we need to ask for LSN's that point to some complete record instead of GetFlushRecPtr(), which can point into the middle of a record. That was already done by @knizhnik to deal with the same problem during the work on `postgres --sync-safekeepers`. Postgres will use LSN's aligned on 0x8 boundary in get_page requests, so we also need to be sure that last_valid_record is aligned. b) Switch to get_last_record_lsn() in basebackup@no_lsn. When the compute node is running without safekeepers and streams WAL directly to pageserver, it is important to match the basebackup LSN and the LSN of replication start. Before this commit basebackup@no_lsn was waiting for last_valid_lsn and walreceiver started replication with last_record_lsn, which can be less. So replication was failing since the compute node doesn't have the requested WAL.	2021-09-02 12:06:12 +03:00
Heikki Linnakangas	787806285d	Remove unused 'update_meta' argument. It was used by the object repository code, but now that that's gone, it's dead.	2021-08-27 15:45:45 +03:00
Heikki Linnakangas	4902d1daa8	Store base images in separate ImageLayers Previously, a SnapshotLayer and corresponding file on disk contained the base image of every page in the segment at the start LSN, and all the changes (= WAL records) in the range between start and end LSN. That was a bit awkward, because we had to keep the base image of every page in memory until we had accumulated enough WAL after the base image to write out the layer. When it's time to write out a layer, we would really want to replay the WAL to reconstruct the most recent version of each page, to save the effort later. That's on the assumption that the client will usually request the most recent version, not some older one. Split the SnapshotLayer into two structs: ImageLayer and DeltaLayer. An image layer contains a "snapshot" of the segment at one specific LSN, and no WAL records, whereas a delta layer contains WAL records in a range of LSNs. In order to reconstruct a page version in the delta layer, by performing WAL redo, you also need the previous image layer. So the delta layers are "incremental" against the previous layer. So where previously we would create snapshot files like this: rel_100_200 rel_200_300 rel_300_400 We now create image and delta files like this: rel_100 # image rel_100_200 # delta rel_200 rel_200_300 rel_300 rel_300_400 rel_400 That's more files, but as discussed above, this allows storing more up-to-date page versions on disk, which should reduce the latency of responding to a GetPage request. It also allows more fine-grained garbage collection. In the above example, after the old page versions are no longer needed and if the relation is not modified anymore, we only need to keep the latest image file, 'rel_400', and everything else can be removed. Implements https://github.com/zenithdb/zenith/issues/339	2021-08-27 02:35:16 +03:00
Heikki Linnakangas	4046530160	Remove remnants of choosing between repository formats. Now that we only have one Repository implementation, no need for the command-line options to choose it either. I'm removing these as a separate commit to show what we will need to do if we add another Repository implementation in the future (even though I don't foresee us doing that any time soon)	2021-08-25 18:37:22 +03:00
Heikki Linnakangas	5998744bcc	Remove rocksdb implementation. The layered storage format is good enough that we don't need the rocksdb implementation anymore. There are a lot of known issues but we'll keep working on them.	2021-08-25 18:37:22 +03:00
Heikki Linnakangas	250ae643a8	Remove 'zenith push' feature. Now that the new storage format is based on immutable files, we want to implement push/pull in terms of these immutable files as well. Similarly to how those files will be transferred between S3 and the page server. The implementation we had was fairly tightly coupled with the object repository implementation, but I'm about to remove the object / rocksdb storage format soon. That would leave the current "zenith push" command completely broken. It seemed like a good idea at the time, but in hindsight, it was premature to implement push/pull yet. It's a nice feature and I'd like to see it reimplemented in the future, but in the meanwhile, let's remove the code we had. We can dig the parts of it that might be useful in the future from the git history.	2021-08-25 18:37:22 +03:00
Heikki Linnakangas	91f72fabc9	Work with smaller segments. Split each relish into fixed-sized 10 MB segments. Separate layers are created for each segment. This reduces the write amplification if you have a large relation and update only parts of it; the downside is that you have a lot more files. The 10 MB is just a guess, we should do some modeling and testing in the future to figure out the optimal size. Each segment tracks the size of the segment separately. To figure out the total size of a relish, you need to loop through the segment to find the highest segment that's in use. That's a bit inefficient, but will do for now. We might want to add a cache or something later.	2021-08-17 18:54:41 +03:00
Heikki Linnakangas	2450f82de5	Introduce a new "layered" repository implementation. This replaces the RocksDB based implementation with an approach using "snapshot files" on disk, and in-memory btreemaps to hold the recent changes. This makes the repository implementation a configuration option. You can choose 'layered' or 'rocksdb' with "zenith init --repository-format=<format>" The unit tests have been refactored to exercise both implementations. 'layered' is now the default. Push/pull is not implemented. The 'test_history_inmemory' test has been commented out accordingly. It's not clear how we will implement that functionality; probably by copying the snapshot files directly.	2021-08-16 10:06:48 +03:00
Heikki Linnakangas	8517d9696d	Move gc_iteration() function to Repository trait. The upcoming layered storage implementation handles GC as a repository-wide operation because it needs to pay attention to the branch points of all timelines.	2021-08-12 23:46:01 +03:00
anastasia	1bfade8adc	Issue #330 . Use put_unlink for twophase relishes. Follow PostgreSQL logic: remove Twophase files when prepared transaction is committed/aborted. Always store Twophase segments as materialized page images (no wal records).	2021-08-12 14:42:21 +03:00
anastasia	4eebe22fbb	cargo fmt	2021-08-12 14:42:21 +03:00
Heikki Linnakangas	20d5e757ca	Remove now-unused get_next_tag function. The only caller was removed by commit `c99a211b01`.	2021-08-11 22:16:38 +03:00
Dmitry Rodionov	ce5333656f	Introduce authentication v0.1. Current state of authentication: the page server validates the JWT token passed as a password during the connection phase. Later, when performing an action such as create branch, the tenant parameter of the operation is validated to match the one submitted in the token. To allow access from the console there is a dedicated scope, PageServerApi, which allows access to all tenants. See the access validation code in PageServerHandler::check_permission. Because we are in the process of refactoring the communication layer involving the wal proposer protocol and safekeeper<->pageserver, the safekeeper currently doesn’t check the token passed from compute, and uses a “hardcoded” token passed via an environment variable to communicate with the pageserver. Compute postgres now takes the token from an environment variable and passes it as the password field in the pageserver connection. It is not passed through settings because then the user would be able to retrieve it using pg_settings or SHOW .. I’ve added a basic test in test_auth.py. Probably after we add authentication to the remaining network paths we should enable it by default and switch all existing tests to use it.	2021-08-11 20:05:54 +03:00
anastasia	e406811375	Fixes for handling SLRU relishes: replace get_tx_status() with self.get_tx_is_in_progress() to handle xacts in truncated SLRU segments correctly	2021-08-11 05:49:24 +03:00
anastasia	e475f82ff1	Rename get_rel_size() to get_relish_size(). Don't bail if relish is not found, just return None and let the caller decide how to handle this	2021-08-11 05:49:24 +03:00
Dmitry Ivanov	cb1b4a12a6	Add some prometheus metrics to pageserver The metrics are served by an http endpoint, which is meant to be spawned in a new thread. In the future the endpoint will provide more APIs, but for the time being, we won't bother with proper routing.	2021-08-03 21:42:24 +03:00
Heikki Linnakangas	9ff122835f	Refactor ObjectTags, introducing a new concept called "relish" This clarifies - I hope - the abstractions between Repository and ObjectRepository. The ObjectTag struct was a mix of objects that could be accessed directly through the public Timeline interface, and also objects that were created and used internally by the ObjectRepository implementation and not supposed to be accessed directly by the callers. With the RelishTag separate from ObjectTag, the distinction is more clear: RelishTag is used in the public interface, and ObjectTag is used internally between object_repository.rs and object_store.rs, and it contains the internal metadata object types. One awkward thing with the ObjectTag struct was that the Repository implementation had to distinguish between ObjectTags for relations, and track the size of the relation, while others were used to store "blobs". With the RelishTags, some relishes are considered "blocky", and the Repository implementation is expected to track their sizes, while others are stored as blobs. I'm not 100% happy with how RelishTag captures that either: it just knows that some relish kinds are blocky and some non-blocky, and there's an is_block() function to check that. But this does enable size-tracking for SLRUs, allowing us to treat them more like relations. This changes the way SLRUs are stored in the repository. Each SLRU segment, e.g. "pg_clog/0000", "pg_clog/0001", is now handled as a separate relish. This removes the need for the SLRU-specific put_slru_truncate() function in the Timeline trait. SLRU truncation is now handled by calling put_unlink() on the segment. This is more in line with how PostgreSQL stores SLRUs and handles their truncation. The SLRUs are "blocky", so they are accessed one 8k page at a time, and repository tracks their size. I considered an alternative design where we would treat each SLRU segment as non-blocky, and just store the whole file as one blob. Each SLRU segment is up to 256 kB in size, which isn't that large, so that might've worked fine, too. One reason I didn't do that is that it seems better to have the WAL redo routines be as close as possible to the PostgreSQL routines. It doesn't matter much in the repository, though; we have to track the size for relations anyway, so there's not much difference in whether we also do it for SLRUs. While working on this, I noticed that the CLOG and MultiXact redo code did not handle wraparound correctly. We need to fix that, but for now, I just commented them out with a FIXME comment.	2021-08-03 14:01:05 +03:00
Heikki Linnakangas	47824c5fca	Remove page server interactive mode. It was pretty cool, but no one used it, and it had gotten badly out of date. The main interesting thing with it was to see some basic metrics on the fly, while the page server is running, but the metrics collection had been broken for a long time, too. Best to just remove it.	2021-07-23 12:21:21 +03:00
Dmitry Rodionov	767590bbd5	support tenants this patch adds support for tenants. This touches mostly pageserver. Directory layout on disk is changed to contain a new layer of indirection. Now the path to a particular repository has the following structure: <pageserver workdir>/tenants/<tenant id>. Tenant id has the same format as timeline id. Tenant id is included in pageserver commands when needed. Also new commands are available in pageserver: tenant_list, tenant_create. This is also reflected in the CLI. During init a default tenant is created and its id is saved in the CLI config, so following commands can use it without extra options. Tenant id is also included in compute postgres configuration, so it can be passed via ServerInfo to safekeeper and in connection string to pageserver. For more info see docs/multitenancy.md.	2021-07-22 20:54:20 +03:00
sharnoff	c4b2bf7ebd	Use 'zenith_admin' as superuser name in `initdb`	2021-07-21 17:22:22 +03:00
Konstantin Knizhnik	0723d49e0b	Object push (#276 ) * Introducing common enum ObjectVal for all values * Rewrite push mechanism to use raw object copy * Fix history unit test * Add skip_nonrel_objects functions for history unit tests	2021-07-21 00:41:57 +03:00
Konstantin Knizhnik	9838c71a47	Explicit compact (#341 ) * Do not perform compaction of RocksDB storage on each GC iteration * Increase GC timeout to let GC tests pass * Add comment to gc_iteration	2021-07-19 16:49:12 +03:00
Konstantin Knizhnik	d55095ab21	[refer #331 ] Move initialization of checkpoint object into import_timeline_from_postgres_datadir	2021-07-16 18:43:07 +03:00
Konstantin Knizhnik	e74b06d999	Pass prev_record_ptr through zenith.signal file to compute node	2021-07-16 18:43:07 +03:00
Heikki Linnakangas	46e613f423	Fix typos	2021-07-16 18:43:07 +03:00
Konstantin Knizhnik	842419b91f	Do not update relation metadata in get_page_at_lsn	2021-07-16 18:43:07 +03:00
Konstantin Knizhnik	eb0a56eb22	Replay non-relational WAL records on page server	2021-07-16 18:43:07 +03:00
Heikki Linnakangas	befefe8d84	Run 'cargo fmt'. Fixes a few formatting discrepancies that had crept in recently.	2021-07-14 22:03:14 +03:00
Heikki Linnakangas	d119f2bcce	Add unit test for branch creation. This is pretty similar to the python 'test_branch_behind' test, but I find it useful to have a small unit test for it too.	2021-07-13 09:54:27 +03:00
Dmitry Rodionov	75e717fe86	allow both domains and ip addresses in connection options for pageserver and wal keeper. Also updated PageServerNode definition in control plane to account for that. resolves #303	2021-07-09 16:46:21 +03:00
Heikki Linnakangas	ced338fd20	Handle relation DROPs in page server. Add back code to parse transaction commit and abort records, and in particular the list of dropped relations in them. Add 'put_unlink' function to the Timeline trait and implementation. We had the code to handle dropped relations in the GC code and elsewhere in ObjectRepository already, but there was nothing to create the RelationSizeEntry::Unlink tombstone entries until now. Also add a test to check that GC correctly removes all page versions of a dropped relation. Implements https://github.com/zenithdb/zenith/issues/232, except for the "orphaned" rels. Reviewed-by: Konstantin Knizhnik	2021-06-29 00:27:10 +03:00
Heikki Linnakangas	ec44f4b299	Add test for Garbage Collection. This exposes a command in the page server to run GC immediately on a given timeline. It's just for testing purposes.	2021-06-28 17:07:28 +03:00
Heikki Linnakangas	4f1b22a2c8	Use ObjectTag enum instead of special fork number to store metadata objects. Extracted from Konstantin's larger PR: https://github.com/zenithdb/zenith/pull/268	2021-06-22 21:34:31 +03:00
Patrick Insinger	47694ea4f5	zenith push	2021-06-02 17:20:49 -04:00
Patrick Insinger	3364a8d442	pageserver - timeline history api	2021-06-02 16:20:26 -04:00
Heikki Linnakangas	34f4207501	Refactoring of the Repository/Timeline stuff - All timelines are now stored in the same rocksdb repository. The GET functions have been taught to follow the ancestors. - Change the way relation size is stored. Instead of inserting "tombstone" entries for blocks that are truncated away, store relation size as separate key-value entry for each relation - Add an abstraction for the key-value store: ObjectStore. It allows swapping RocksDB with some other key-value store easily. Perhaps we will write our own storage implementation using that interface, or perhaps we'll need a different abstraction, but this is a small improvement over status quo in any case. - Garbage Collection is broken and commented out. It's not clear where and how it should be implemented.	2021-05-27 20:07:50 +03:00
Heikki Linnakangas	cb6e2d9ddb	Minor refactoring and cleanup of the Timeline interface. Move `save_decoded_record` out of the Timeline trait. The storage implementation shouldn't need to know how to decode records. Also move put_create_database() out of the Timeline trait. Add a new `list_rels` function to Timeline to support it, instead. Rename `get_relsize` to `get_rel_size`, and `get_relsize_exists` to `get_rel_exists`. Seems nicer.	2021-05-27 09:44:46 +03:00
Alexey Kondratov	b1a424dfa9	Add more info about borrowed from Postgres structures (RelTag and BufferTag)	2021-05-26 12:05:13 +03:00
Eric Seppanen	7c73afc1af	switch repository types to serde Derive Serialize+Deserialize for RelTag, BufferTag, CacheKey. Replace handwritten pack/unpack functions with ser, des from zenith_utils::bin_ser (which uses the bincode crate). There are some ugly hybrids in walredo.rs, but those functions are already doing a lot of questionable manual byte-twiddling, so hopefully the weirdness will go away when we get better postgres protocol wrappers.	2021-05-25 14:56:19 -07:00
Eric Seppanen	6f9175ca2d	cargo fmt	2021-05-24 17:28:56 -07:00
Heikki Linnakangas	69fa10ff86	Fix rocksdb get_relsize() implementation to work with historic LSNs.	2021-05-24 17:12:18 +03:00
Heikki Linnakangas	d5fe515363	Implement "checkpointing" in the page server. - Previously, we checked on first use of a timeline, whether there is a snapshot and WAL for the timeline, and loaded it all into the (rocksdb) repository. That's a waste of effort if we had done that earlier already, and stopped and restarted the server. Track the last LSN that we have loaded into the repository, and only load the recent missing WAL after that. - When you create a new zenith repository with "zenith init", immediately load the initial empty postgres cluster into the rocksdb repository. Previously, we only did that on the first connection. This way, we don't need any "load from filesystem" codepath during normal operation, we can assume that the repository for a timeline is always up to date. (We might still want to use the functionality to import an existing PostgreSQL data directory into the repository in the future, as a separate Import feature, but not today.)	2021-05-24 17:02:05 +03:00
Heikki Linnakangas	6a9c036ac1	Revert all changes related to storing and restoring non-rel data in page server This includes the following commits: `35a1c3d521` Specify right LSN in test_createdb.py `d95e1da742` Fix issue with propagation of CREATE DATABASE to the branch `8465738aa5` [refer #167] Fix handling of pg_filenode.map files in page server `86056abd0e` Fix merge conflict: set initial WAL position to second segment because of pg_resetwal `2bf2dd1d88` Add nonrelfile_utils.rs file `20b6279beb` Fix restoring non-relational data during compute node startup `06f96f9600` Do not transfer WAL to computation nodes: use pg_resetwal for node startup As well as some older changes related to storing CLOG and MultiXact data as "pseudorelation" in the page server. With this revert, we go back to the situation that when you create a new compute node, we ship all the WAL from the beginning of time to the compute node. Obviously we need a better solution, like the code that this reverts. But per discussion with Konstantin and Stas, this stuff was still half-baked, and it's better for it to live in a branch for now, until it's more complete and has gone through some review.	2021-05-24 16:05:45 +03:00
Eric Seppanen	4aabc9a682	easy clippy cleanups Various things that clippy complains about, and are really easy to fix.	2021-05-23 13:17:15 -07:00
Konstantin Knizhnik	06f96f9600	Do not transfer WAL to computation nodes: use pg_resetwal for node startup	2021-05-20 14:13:47 +03:00
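
The image/delta file layout described in commit 4902d1daa8 above can be illustrated with a small sketch. This is not the repository's code: the `Layer` enum, `example_layers` and `pick_layers` are hypothetical names invented here, LSNs are plain `u64` values, and the delta range is assumed to be half-open (a request at exactly LSN 200 is served from `rel_200` alone). The sketch only shows which files would be needed to reconstruct a page at a given LSN: the newest image at or before the request, plus the delta that starts at that image if any WAL has to be replayed.

```rust
/// Hypothetical layer descriptors, modeled on the file names in the commit
/// message (rel_100 is an image, rel_100_200 is a delta).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Layer {
    Image { lsn: u64 },
    Delta { start: u64, end: u64 },
}

/// The file set from the commit message: rel_100, rel_100_200, ..., rel_400.
fn example_layers() -> Vec<Layer> {
    vec![
        Layer::Image { lsn: 100 },
        Layer::Delta { start: 100, end: 200 },
        Layer::Image { lsn: 200 },
        Layer::Delta { start: 200, end: 300 },
        Layer::Image { lsn: 300 },
        Layer::Delta { start: 300, end: 400 },
        Layer::Image { lsn: 400 },
    ]
}

/// Pick the newest image at or before `req_lsn`, and, if the request is past
/// that image, the delta layer that starts at the image's LSN.
/// Returns None when no image is old enough to serve the request.
fn pick_layers(layers: &[Layer], req_lsn: u64) -> Option<(Layer, Option<Layer>)> {
    let image_lsn = layers
        .iter()
        .filter_map(|l| match *l {
            Layer::Image { lsn } if lsn <= req_lsn => Some(lsn),
            _ => None,
        })
        .max()?;

    // WAL redo is only needed when the request is newer than the image.
    let delta = if req_lsn > image_lsn {
        layers
            .iter()
            .copied()
            .find(|l| matches!(*l, Layer::Delta { start, .. } if start == image_lsn))
    } else {
        None
    };

    Some((Layer::Image { lsn: image_lsn }, delta))
}

fn main() {
    let layers = example_layers();
    for lsn in [50, 250, 400] {
        println!("LSN {}: {:?}", lsn, pick_layers(&layers, lsn));
    }
}
```

Under these assumptions, a request at LSN 250 needs `rel_200` plus `rel_200_300`, while a request at LSN 400 needs only `rel_400`, which matches the garbage-collection remark in the commit message that only the latest image has to be kept once older versions are no longer needed.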

170 Commits