rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-22 23:50:39 +00:00

Author	SHA1	Message	Date
Joonas Koivunen	3c9b484c4d	feat: Timeline detach ancestor (#7456 ) ## Problem Timelines cannot be deleted if they have children. In many production cases, a branch or a timeline has been created off the main branch for various reasons to the effect of having now a "new main" branch. This feature will make it possible to detach a timeline from its ancestor by inheriting all of the data before the branchpoint to the detached timeline and by also reparenting all of the ancestor's earlier branches to the detached timeline. ## Summary of changes - Earlier added copy_lsn_prefix functionality is used - RemoteTimelineClient learns to adopt layers by copying them from another timeline - LayerManager adds support for adding adopted layers - `timeline::Timeline::{prepare_to_detach,complete_detaching}_from_ancestor` and `timeline::detach_ancestor` are added - HTTP PUT handler Cc: #6994 Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-05-07 13:47:57 +03:00
Joonas Koivunen	8d0f701767	feat: copy delta layer prefix or "truncate" (#7228 ) For "timeline ancestor merge" or "timeline detach," we need to "cut" delta layers at particular LSN. The name "truncate" is not used as it would imply that a layer file changes, instead of what happens: we copy keys with Lsn less than a "cut point". Cc: #6994 Add the "copy delta layer prefix" operation to DeltaLayerInner, re-using some of the vectored read internals. The code is `cfg(test)` until it will be used later with a more complete integration test.	2024-04-18 10:43:04 +03:00
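The "copy prefix" idea in the commit above can be illustrated with a minimal sketch. This is not the `DeltaLayerInner` implementation: `Key`, `Lsn`, and the function `copy_lsn_prefix_sketch` are hypothetical stand-ins, and a delta layer is modeled as a plain slice of `(key, lsn, value)` entries. It only shows why the operation is a copy rather than a truncation: the source is never modified.

```rust
// Hypothetical stand-ins for the pageserver's key and LSN types.
type Key = u64;
type Lsn = u64;

/// Copy every entry whose LSN is strictly below `cut` into a new vector.
/// The source slice is left untouched, so nothing is "truncated" in place.
fn copy_lsn_prefix_sketch(
    entries: &[(Key, Lsn, Vec<u8>)],
    cut: Lsn,
) -> Vec<(Key, Lsn, Vec<u8>)> {
    entries
        .iter()
        .filter(|entry| entry.1 < cut) // keep only LSNs below the cut point
        .cloned()
        .collect()
}
```

Calling `copy_lsn_prefix_sketch(&entries, cut)` yields a new collection while `entries` keeps all of its original items.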
Arpad Müller	82853cc1d1	Fix warnings and compile errors on nightly (#6886) Nightly has added a bunch of compiler and linter warnings. There are also two dependencies that fail compilation on latest nightly due to using the old `stdsimd` feature name. This PR fixes them.	2024-03-01 17:14:19 +01:00
Christian Schwarz	47873470db	pageserver: add method to dump keyspace in mgmt api client (#6145 ) Part of getpage@lsn benchmark epic: https://github.com/neondatabase/neon/issues/5771	2023-12-16 10:52:48 +00:00
Joonas Koivunen	105edc265c	fix: remove layer_removal_cs (#5108 ) Quest: https://github.com/neondatabase/neon/issues/4745. Follow-up to #4938. - add in locks for compaction and gc, so we don't have multiple executions at the same time in tests - remove layer_removal_cs - remove waiting for uploads in eviction/gc/compaction - #4938 will keep the file resident until upload completes Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-11-28 19:15:21 +02:00
John Spray	ab631e6792	pageserver: make TenantsMap shard-aware (#5819 ) ## Problem When using TenantId as the key, we are unable to handle multiple tenant shards attached to the same pageserver for the same tenant ID. This is an expected scenario if we have e.g. 8 shards and 5 pageservers. ## Summary of changes - TenantsMap is now a BTreeMap instead of a HashMap: this enables looking up by range. In future, we will need this for page_service, as incoming requests will just specify the Key, and we'll have to figure out which shard to route it to. - A new key type TenantShardId is introduced, to act as the key in TenantsMap, and as the id type in external APIs. Its human readable serialization is backward compatible with TenantId, and also forward-compatible as long as sharding is not actually used (when we construct a TenantShardId with ShardCount(0), it serializes to an old-fashioned TenantId). - Essential tenant APIs are updated to accept TenantShardIds: tenant/timeline create, tenant delete, and /location_conf. These are the APIs that will enable driving sharded tenants. Other apis like /attach /detach /load /ignore will not work with sharding: those will soon be deprecated and replaced with /location_conf as part of the live migration work. Closes: #5787	2023-11-15 23:20:21 +02:00
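Why a `BTreeMap` enables "looking up by range" can be sketched as follows. The struct below is a simplified, hypothetical version of a shard-aware key (the real `TenantShardId` is not reproduced here); the field names and the helper `shards_of` are assumptions made for illustration only.

```rust
use std::collections::BTreeMap;

// Simplified, hypothetical key: the derived `Ord` compares `tenant_id` first,
// then `shard_number`, so all shards of one tenant are adjacent in the map.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct TenantShardId {
    tenant_id: u128,
    shard_number: u8,
}

/// Return the ids of all shards of `tenant_id` present in the map,
/// using an ordered range scan instead of visiting every entry.
fn shards_of(map: &BTreeMap<TenantShardId, String>, tenant_id: u128) -> Vec<TenantShardId> {
    let lo = TenantShardId { tenant_id, shard_number: u8::MIN };
    let hi = TenantShardId { tenant_id, shard_number: u8::MAX };
    map.range(lo..=hi).map(|(id, _)| *id).collect()
}
```

A `HashMap` keyed the same way could only answer exact-key lookups; the ordered map is what makes the per-tenant scan cheap.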
Joonas Koivunen	462f04d377	Smaller test addition and change (#5858 ) - trivial serialization roundtrip test for `pageserver::repository::Value` - add missing `start_paused = true` to 15s test making it <0s test - completely unrelated future clippy lint avoidance (helps beta channel users)	2023-11-14 18:04:34 +01:00
John Spray	ba92668e37	pageserver: deletion queue & generation validation for deletions (#5207 ) ## Problem Pageservers must not delete objects or advertise updates to remote_consistent_lsn without checking that they hold the latest generation for the tenant in question (see [the RFC]( https://github.com/neondatabase/neon/blob/main/docs/rfcs/025-generation-numbers.md)) In this PR: - A new "deletion queue" subsystem is introduced, through which deletions flow - `RemoteTimelineClient` is modified to send deletions through the deletion queue: - For GC & compaction, deletions flow through the full generation verifying process - For timeline deletions, deletions take a fast path that bypasses generation verification - The `last_uploaded_consistent_lsn` value in `UploadQueue` is replaced with a mechanism that maintains a "projected" lsn (equivalent to the previous property), and a "visible" LSN (which is the one that we may share with safekeepers). - Until `control_plane_api` is set, all deletions skip generation validation - Tests are introduced for the new functionality in `test_pageserver_generations.py` Once this lands, if a pageserver is configured with the `control_plane_api` configuration added in https://github.com/neondatabase/neon/pull/5163, it becomes safe to attach a tenant to multiple pageservers concurrently. --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-09-26 16:11:55 +01:00
Joonas Koivunen	1fdf01e3bc	fix: readable Debug for Layers (#3575) #3536 added the custom Debug implementations, but using derived Debug on Key led to overly verbose output. Instead of making `Key`'s `Debug` unconditionally or conditionally do the `Display` variant (for table space'd keys), opted to build a newtype to provide `Debug` for `Range<Key>` via `Display` which seemed to work unconditionally. Also orders Key to have: 1. comment, 2. derive, 3. `struct Key`.	2023-02-09 13:55:37 +02:00
bojanserafimov	a3d7ad2d52	Implement layer map using immutable BST (#2998 )	2023-01-20 16:10:12 -05:00
Heikki Linnakangas	9a6c0be823	storage_sync2 The code in this change was extracted from PR #2595, i.e., Heikki’s draft PR for on-demand download. High-Level Changes - storage_sync module rewrite - Changes to Tenant Loading - Changes to Timeline States - Crash-safe & Resumable Tenant Attach There are several follow-up work items planned. Refer to the Epic issue on GitHub: https://github.com/neondatabase/neon/issues/2029 Metadata: closes https://github.com/neondatabase/neon/pull/2785 unsquashed history of this patch: archive/pr-2785-storage-sync2/pre-squash Co-authored-by: Dmitry Rodionov <dmitry@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech> =============================================================================== storage_sync module rewrite =========================== The storage_sync code is rewritten. New module name is storage_sync2, mostly to make a more reasonable git diff. The updated block comment in storage_sync2.rs describes the changes quite well, so, we will not reproduce that comment here. TL;DR: - Global sync queue and RemoteIndex are replaced with per-timeline `RemoteTimelineClient` structure that contains a queue for UploadOperations to ensure proper ordering and necessary metadata. - Before deleting local layer files, wait for ongoing UploadOps to finish (wait_completion()). - Download operations are not queued and executed immediately. Changes to Tenant Loading ========================= Initial sync part was rewritten as well and represents the other major change that serves as a foundation for on-demand downloads. Routines for attaching and loading shifted directly to Tenant struct and now are asynchronous and spawned into the background. Since this patch doesn’t introduce on-demand download of layers we fully synchronize with the remote during pageserver startup. See details in `Timeline::reconcile_with_remote` and `Timeline::download_missing`. 
Changes to Tenant States ======================== The “Active” state has lost its “background_jobs_running: bool” member. That variable indicated whether the GC & Compaction background loops are spawned or not. With this patch, they are now always spawned. Unit tests (#[test]) use the TenantConf::{gc_period,compaction_period} to disable their effect (`15db566`). This patch introduces a new tenant state, “Attaching”. A tenant that is being attached starts in this state and transitions to “Active” once it finishes download. The `GET /tenant` endpoint returns `TenantInfo::has_in_progress_downloads`. We derive the value for that field from the tenant state now, to remain backwards-compatible with cloud.git. We will remove that field when we switch to on-demand downloads. Changes to Timeline States ========================== The TimelineInfo::awaits_download field is now equivalent to the tenant being in Attaching state. Previously, download progress was tracked per timeline. With this change, it’s only tracked per tenant. When on-demand downloads arrive, the field will be completely obsolete. Deprecation is tracked in issue #2930. Crash-safe & Resumable Tenant Attach ==================================== Previously, the attach operation was not persistent. I.e., when tenant attach was interrupted by a crash, the pageserver would not continue attaching after pageserver restart. In fact, the half-finished tenant directory on disk would simply be skipped by tenant_mgr because it lacked the metadata file (it’s written last). This patch introduces an “attaching” marker file that is present inside the tenant directory while the tenant is attaching. During pageserver startup, tenant_mgr will resume attach if that file is present. If not, it assumes that the local tenant state is consistent and tries to load the tenant. If that fails, the tenant transitions into Broken state.	2022-11-29 18:55:20 +01:00
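The startup decision described under "Crash-safe & Resumable Tenant Attach" can be sketched roughly as below. This is an illustration under assumptions, not the `tenant_mgr` code: the marker file name `attaching`, the `StartupAction` enum, and the function `startup_action` are all hypothetical.

```rust
use std::path::Path;

// Hypothetical marker file name; the real on-disk name may differ.
const ATTACHING_MARKER: &str = "attaching";

#[derive(Debug, PartialEq)]
enum StartupAction {
    /// The marker is present: a previous attach was interrupted, so resume it.
    ResumeAttach,
    /// No marker: assume local state is consistent and try to load the tenant.
    LoadLocal,
}

/// Decide what to do with a tenant directory found at startup.
fn startup_action(tenant_dir: &Path) -> StartupAction {
    if tenant_dir.join(ATTACHING_MARKER).exists() {
        StartupAction::ResumeAttach
    } else {
        StartupAction::LoadLocal
    }
}
```

In this sketch a failed load after `LoadLocal` would be where a tenant moves to a broken state; that transition is left out to keep the example small.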
Konstantin Knizhnik	f3073a4db9	R-Tree layer map (#2317 ) Replace the layer array and linear search with R-tree So far, the in-memory layer map that holds information about layer files that exist, has used a simple Vec, in no particular order, to hold information about all the layers. That obviously doesn't scale very well; with thousands of layer files the linear search was consuming a lot of CPU. Replace it with a two-dimensional R-tree, with Key and LSN ranges as the dimensions. For the R-tree, use the 'rstar' crate. To be able to use that, we convert the Keys and LSNs into 256-bit integers. 64 bits would be enough to represent LSNs, and 128 bits would be enough to represent Keys. However, we use 256 bits, because rstar internally performs multiplication to calculate the area of rectangles, and the result of multiplying two 128 bit integers doesn't necessarily fit in 128 bits, causing integer overflow and, if overflow-checks are enabled, panic. To avoid that, we use 256 bit integers. Add a performance test that creates a lot of layer files, to demonstrate the benefit.	2022-09-22 08:35:06 +03:00
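The overflow argument in the commit above is easy to reproduce with the standard library alone. The sketch below does not use the `rstar` crate; `area_u128` is a hypothetical helper that multiplies two 128-bit side lengths the way an area computation would, and reports overflow instead of panicking.

```rust
/// Area of a rectangle whose sides are 128-bit integers.
/// Returns `None` when the product does not fit in 128 bits, which is the
/// situation that motivates using a wider integer type for the coordinates.
fn area_u128(width: u128, height: u128) -> Option<u128> {
    width.checked_mul(height)
}
```

For instance, a side of 2^64 multiplied by another side of 2^64 is 2^128, which is one past the largest `u128`, so `area_u128(1 << 64, 1 << 64)` returns `None`.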
sharnoff	4a3b3ff11d	Move testing pageserver libpq cmds to HTTP api (#2429 ) Closes #2422. The APIs have been feature gated with the `testing_api!` macro so that they return 400s when support hasn't been compiled in.	2022-09-20 11:28:12 -07:00
Kirill Bulatov	b8eb908a3d	Rename old project name references	2022-09-14 08:14:05 +03:00
Kirill Bulatov	2db20e5587	Remove [Un]Loaded timeline code (#2359 )	2022-09-02 14:31:28 +03:00
Heikki Linnakangas	5522fbab25	Move all unit tests related to Repository/Timeline to layered_repository.rs There was a nominal split between the tests in layered_repository.rs and repository.rs, such that tests specific to the layered implementation were supposed to be in layered_repository.rs, and tests that should work with any implementation of the traits were supposed to be in repository.rs. In practice, the line was quite muddled. With minor tweaks, many of the tests in layered_repository.rs should work with other implementations too, and vice versa. And in practice we only have one implementation, so it's more straightforward to gather all unit tests in one place.	2022-08-20 01:21:18 +03:00
Kirill Bulatov	c634cb1d36	Remove TimelineWriter trait, rename LayeredTimelineWriter struct into TimelineWriter	2022-08-19 16:40:37 +03:00
Kirill Bulatov	c19b4a65f9	Remove Repository trait, rename LayeredRepository struct into Repository	2022-08-19 16:40:37 +03:00
Kirill Bulatov	8043612334	Remove Timeline trait, rename LayeredTimeline struct into Timeline	2022-08-19 16:40:37 +03:00
Kirill Bulatov	3b819ee159	Remove extra type aliases (#2280 )	2022-08-17 17:51:53 +03:00
Arseny Sher	e593cbaaba	Add pageserver checkpoint_timeout option. To flush inmemory layer eventually when no new data arrives, which helps safekeepers to suspend activity (stop pushing to the broker). Default 10m should be ok.	2022-08-11 22:54:09 +03:00
Ankur Srivastava	84d1bc06a9	refactor: replace lazy-static with once-cell (#2195 ) - Replacing all the occurrences of lazy-static with `once-cell::sync::Lazy` - fixes #1147 Signed-off-by: Ankur Srivastava <best.ankur@gmail.com>	2022-08-05 19:34:04 +02:00
Heikki Linnakangas	02afa2762c	Move Tenant- and TimelineInfo structs to models.rs. They are part of the management API response structs. Let's try to concentrate everything that's part of the API in models.rs.	2022-07-29 15:02:15 +03:00
Thang Pham	6a664629fa	Add timeline physical size tracking (#2126 ) Ref #1902. - Track the layered timeline's `physical_size` using `pageserver_current_physical_size` metric when updating the layer map. - Report the local timeline's `physical_size` in timeline GET APIs. - Add `include-non-incremental-physical-size` URL flag to also report the local timeline's `physical_size_non_incremental` (similar to `logical_size_non_incremental`) - Add a `UIntGaugeVec` and `UIntGauge` to represent `u64` prometheus metrics Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>	2022-07-27 12:36:46 -04:00
Heikki Linnakangas	d6f12cff8e	Make DatadirTimeline a trait, implemented by LayeredTimeline. Previously DatadirTimeline was a separate struct, and there was a 1:1 relationship between each DatadirTimeline and LayeredTimeline. That was a bit awkward; whenever you created a timeline, you also needed to create the DatadirTimeline wrapper around it, and if you only had a reference to the LayeredTimeline, you would need to look up the corresponding DatadirTimeline struct through tenant_mgr::get_local_timeline_with_load(). There were a couple of calls like that from LayeredTimeline itself. Refactor DatadirTimeline, so that it's a trait, and mark LayeredTimeline as implementing that trait. That way, there's only one object, LayeredTimeline, and you can call both Timeline and DatadirTimeline functions on that. You can now also call DatadirTimeline functions from LayeredTimeline itself. I considered just moving all the functions from DatadirTimeline directly to Timeline/LayeredTimeline, but I still like to have some separation. Timeline provides a simple key-value API, and handles durably storing key/value pairs, and branching. Whereas DatadirTimeline is stateless, and provides an abstraction over the key-value store, to present an interface with relations, databases, etc. Postgres concepts. This simplified the logical size calculation fast-path for branch creation, introduced in commit `28243d68e6`. LayerTimeline can now access the ancestor's logical size directly, so it doesn't need the caller to pass it to it. I moved the fast-path to init_logical_size() function itself. It now checks if the ancestor's last LSN is the same as the branch point, i.e. if there haven't been any changes on the ancestor after the branch, and copies the size from there. An additional bonus is that the optimization will now work any time you have a branch of another branch, with no changes from the ancestor, not only at a create-branch command.	2022-07-27 10:26:21 +03:00
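The logical-size fast path described above can be sketched as follows. The types and names here (`AncestorSketch`, `init_logical_size_sketch`, the `compute_slow` callback) are hypothetical and only model the check "ancestor's last LSN equals the branch point"; the real `init_logical_size()` is not reproduced.

```rust
// Hypothetical view of the ancestor timeline: just the two fields the check needs.
struct AncestorSketch {
    last_record_lsn: u64,
    logical_size: u64,
}

/// If the ancestor has not advanced past the branch point, reuse its logical
/// size; otherwise fall back to the (expensive) full calculation.
fn init_logical_size_sketch(
    ancestor: Option<&AncestorSketch>,
    branch_point: u64,
    compute_slow: impl FnOnce() -> u64,
) -> u64 {
    match ancestor {
        Some(a) if a.last_record_lsn == branch_point => a.logical_size,
        _ => compute_slow(),
    }
}
```

Because the check lives inside the initialization function rather than in the caller, it applies to any branch of a branch with an unchanged ancestor, which matches the "additional bonus" mentioned in the commit message.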
Dmitry Rodionov	21da9199fa	take Value by reference to avoid calling .clone	2022-07-11 17:03:58 +03:00
Thang Pham	1f5918b36d	Delay calculating the starting LSN when doing timeline branching (#2053 ) Previously, upon branching, if no starting LSN is specified, we determine the start LSN based on the source timeline's last record LSN in `timelines::create_timeline` function, which then calls `Repository::branch_timeline` to create the timeline. Inside the `LayeredRepository::branch_timeline` function, to start branching, we try to acquire a GC lock to prevent GC from removing data needed for the new timeline. However, a GC iteration takes time, so the GC lock can be held for a long period of time. As a result, the previously determined starting LSN can become invalid because of GC. This PR fixes the above issue by delaying the LSN calculation part and moving it to be inside `LayeredRepository::branch_timeline` function.	2022-07-08 10:29:29 -04:00
Dmitry Rodionov	e1e24336b7	review adjustments, bring back timeline_detach and rename it to timeline_delete	2022-07-07 21:20:04 +03:00
Dmitry Rodionov	4c54e4b37d	switch to per-tenant attach/detach download operations of all timelines for one tenant are now grouped together so when attach is invoked pageserver downloads all of them and registers them in a single apply_sync_status_update call so branches can be used safely with attach/detach	2022-07-07 21:20:04 +03:00
Dmitry Rodionov	cfdf79aceb	harden create_empty_timeline Reorder checks so it checks whether the timeline exists before writing something to disk, possibly replacing valid content	2022-07-05 16:44:18 +03:00
chaitanya sharma	e1336f451d	renamed .zenith data-dir to .neon.	2022-06-09 18:19:18 +02:00
huming	9c846a93e8	chore(doc)	2022-06-03 14:24:27 +03:00
Kirill Bulatov	e5cb727572	Replace callmemaybe with etcd subscriptions on safekeeper timeline info	2022-06-01 16:07:04 +03:00
Kian-Meng Ang	f1c51a1267	Fix typos	2022-05-28 14:02:05 +03:00
Kirill Bulatov	de37f982db	Share the remote storage as a crate	2022-05-07 00:30:36 +03:00
Dhammika Pathirana	f3f12db2cb	Add gc churn threshold knob (#1594 ) Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>	2022-05-01 13:13:17 -07:00
Kirill Bulatov	6cca57f95a	Properly remove from the local timeline map	2022-04-29 09:19:18 +03:00
Konstantin Knizhnik	5f83c9290b	Make it possible to specify per-tenant configuration parameters Add tenant config API and 'zenith tenant config' CLI command. Add 'show' query to pageserver protocol for tenant-specific config parameters. Refactoring: move tenant_config code to a separate module. Save tenant conf file to tenant's directory when the tenant is created, to recover it on pageserver restart. Ignore errors during tenant config loading while it is not supported by console. Define PiTR interval for GC. Refer to #1320	2022-04-22 11:24:29 +03:00
Kirill Bulatov	81cad6277a	Move and library crates into a dedicated directory and rename them	2022-04-21 13:30:33 +03:00
Kirill Bulatov	3e6087a12f	Remove S3 archiving	2022-04-19 23:13:52 +03:00
Dhammika Pathirana	a0781f229c	Add ps compact command Signed-off-by: Dhammika Pathirana <dhammika@gmail.com> Add ps compact command to api (#707) (#1484)	2022-04-13 22:47:13 -07:00
Heikki Linnakangas	214567bf8f	Use B-tree for the index in image and delta layers. We now use a page cache for those, instead of slurping the whole index into memory. Fixes https://github.com/zenithdb/zenith/issues/1356 This is a backwards-incompatible change to the storage format, so bump STORAGE_FORMAT_VERSION.	2022-04-07 20:58:55 +03:00
Konstantin Knizhnik	232fe14297	Refactor partitioning	2022-04-04 10:43:27 +03:00
Heikki Linnakangas	07342f7519	Major storage format rewrite. This is a backwards-incompatible change. The new pageserver cannot read repositories created with an old pageserver binary, or vice versa. Simplify Repository to a value-store ------------------------------------ Move the responsibility of tracking relation metadata, like which relations exist and what are their sizes, from Repository to a new module, pgdatadir_mapping.rs. The interface to Repository is now simple key-value PUT/GET operations. It's still not any old key-value store though. A Repository is still responsible for handling branching, and every GET operation comes with an LSN. Mapping from Postgres data directory to keys/values --------------------------------------------------- All the data is now stored in the key-value store. The 'pgdatadir_mapping.rs' module handles mapping from PostgreSQL objects like relation pages and SLRUs, to key-value pairs. The key to the Repository key-value store is a Key struct, which consists of a few integer fields. It's wide enough to store a full RelFileNode, fork and block number, and to distinguish those from metadata keys. 'pgdatadir_mapping.rs' is also responsible for maintaining a "partitioning" of the keyspace. Partitioning means splitting the keyspace so that each partition holds a roughly equal number of keys. The partitioning is used when new image layer files are created, so that each image layer file is roughly the same size. The partitioning is also responsible for reclaiming space used by deleted keys. The Repository implementation doesn't have any explicit support for deleting keys. Instead, the deleted keys are simply omitted from the partitioning, and when a new image layer is created, the omitted keys are not copied over to the new image layer. We might want to implement tombstone keys in the future, to reclaim space faster, but this will work for now. 
Changes to low-level layer file code ------------------------------------ The concept of a "segment" is gone. Each layer file can now store an arbitrary range of Keys. Checkpointing, compaction ------------------------- The background tasks are somewhat different now. Whenever checkpoint_distance is reached, the WAL receiver thread "freezes" the current in-memory layer, and creates a new one. This is a quick operation and doesn't perform any I/O yet. It then launches a background "layer flushing thread" to write the frozen layer to disk, as a new L0 delta layer. This mechanism takes care of durability. It replaces the checkpointing thread. Compaction is a new background operation that takes a bunch of L0 delta layers, and reshuffles the data in them. It runs in a separate compaction thread. Deployment ---------- This also contains changes to the ansible scripts that enable having multiple different pageservers running at the same time in the staging environment. We will use that to keep an old version of the pageserver running, for clusters created with the old version, at the same time with a new pageserver with the new binary. Author: Heikki Linnakangas Author: Konstantin Knizhnik <knizhnik@zenith.tech> Author: Andrey Taranik <andrey@zenith.tech> Reviewed-by: Matthias Van De Meent <matthias@zenith.tech> Reviewed-by: Bojan Serafimov <bojan@zenith.tech> Reviewed-by: Konstantin Knizhnik <knizhnik@zenith.tech> Reviewed-by: Anton Shyrabokau <antons@zenith.tech> Reviewed-by: Dhammika Pathirana <dham@zenith.tech> Reviewed-by: Kirill Bulatov <kirill@zenith.tech> Reviewed-by: Anastasia Lubennikova <anastasia@zenith.tech> Reviewed-by: Alexey Kondratov <alexey@zenith.tech>	2022-03-28 05:41:15 -05:00
Kirill Bulatov	55de0b88f5	Hide remote timeline index access details	2022-03-28 12:36:01 +03:00
Heikki Linnakangas	3b069f5aef	Fix name of directory used in unit test. There's another test called 'timeline_load'. If the two tests run in parallel, they would conflict and fail.	2022-03-18 21:27:48 +02:00
Dmitry Rodionov	7738254f83	refactor timeline memory state management	2022-03-18 18:14:57 +03:00
Kirill Bulatov	f49990ed43	Allow creating timelines by branching off ancestors	2022-03-10 19:38:58 +02:00
Kirill Bulatov	10f811e886	Use `timeline` instead of `branch` in pageserver's API	2022-03-10 19:38:58 +02:00
Heikki Linnakangas	7fae894648	Move a few unit tests specific to layered file format. These tests have intimate knowledge of the directory layout and layer file names used by the LayeredRepository implementation of the Repository trait. Move them, so that all the tests that remain in repository.rs are expected to work without changes with any implementation of Repository. Not that we have any plans to create another Repository implementation any time soon, but as long as we have the Repository interface, let's try to maintain that abstraction in the tests too.	2022-02-23 20:32:06 +02:00

170 Commits