
rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-07 14:10:43 +00:00

Author	SHA1	Message	Date
Stas Kelvich	5642d0b2b8	Change shutdown_process_on_error thread spawn settings. The principle is now as follows: acceptor thread (libpq and http) errors will bring the pageserver down, but all per-tenant thread failures will be treated as an error.	2022-05-04 00:42:57 +03:00
Anastasia Lubennikova	2f9b17b9e5	Add simple test of pageserver recovery after crash. To cause a crash, use failpoints in checkpointer	2022-05-03 17:13:09 +03:00
Dmitry Rodionov	ad25736f3a	Exit pageserver process with correct error code When we shut down pageserver due to an error (e.g. one of the important threads panicked) use exit code 1 so systemd can properly restart it	2022-05-02 19:04:45 +03:00
Dmitry Rodionov	05f8e6a050	Use fsync+rename for atomic downloads from remote storage Use failpoint in test_remote_storage to check the behavior	2022-04-29 15:53:56 +03:00
Kirill Bulatov	4a46b01caf	Properly populate local timeline map	2022-04-29 09:19:18 +03:00
Konstantin Knizhnik	5f83c9290b	Make it possible to specify per-tenant configuration parameters Add tenant config API and 'zenith tenant config' CLI command. Add 'show' query to pageserver protocol for tenant-specific config parameters Refactoring: move tenant_config code to a separate module. Save tenant conf file to tenant's directory, when tenant is created to recover it on pageserver restart. Ignore error during tenant config loading, while it is not supported by console Define PiTR interval for GC. refer #1320	2022-04-22 11:24:29 +03:00
Heikki Linnakangas	a4700c9bbe	Use pprof to get flamegraph of get_page and get_relsize requests. This depends on a hacked version of the 'pprof-rs' crate. Because of that, it's under an optional 'profiling' feature. It is disabled by default, but enabled for release builds in CircleCI config. It doesn't currently work on macOS. The flamegraph is written to 'flamegraph.svg' in the pageserver workdir when the 'pageserver' process exits. Add a performance test that runs the perf_pgbench test, with profiling enabled.	2022-04-21 20:32:48 +03:00
Kirill Bulatov	81cad6277a	Move and library crates into a dedicated directory and rename them	2022-04-21 13:30:33 +03:00
Kirill Bulatov	3e6087a12f	Remove S3 archiving	2022-04-19 23:13:52 +03:00
Heikki Linnakangas	5d9851f5d1	Refactor the I/O functions. This introduces two new abstraction layers for I/O: - Block I/O, and - Blob I/O. The BlockReader trait abstracts a file or something else that can be read in 8kB pages. It is implemented by EphemeralFiles, and by a new FileBlockReader struct that allows reading arbitrary VirtualFiles in that manner, utilizing the page cache. There is also a new BlockCursor struct that works as a cursor over a BlockReader. When you create a BlockCursor and read the first page using it, it keeps the reference to the page. If you access the same page again, it avoids going to page cache and quickly returns the same page again. That can save a lot of lookups in the page cache if you perform multiple reads. The Blob-oriented API allows reading and writing "blobs" of arbitrary length. It is a layer on top of the block-oriented API. When you write a blob with the write_blob() function, it writes a length field followed by the actual data to the underlying block storage, and returns the offset where the blob was stored. The blob can be retrieved later using the offset. Finally, this replaces the I/O code in image-, delta-, and in-memory layers to use the new abstractions. These replace the 'bookfile' crate. This is a backwards-incompatible change to the storage format.	2022-04-07 20:58:54 +03:00
Anastasia Lubennikova	8745b022a9	Extend LayerMap dump() function to print also open_layers and frozen_layers. Add verbose option to choose if we need to print all layer's keys or not.	2022-03-31 17:26:24 +03:00
Heikki Linnakangas	07342f7519	Major storage format rewrite. This is a backwards-incompatible change. The new pageserver cannot read repositories created with an old pageserver binary, or vice versa. Simplify Repository to a value-store ------------------------------------ Move the responsibility of tracking relation metadata, like which relations exist and what are their sizes, from Repository to a new module, pgdatadir_mapping.rs. The interface to Repository is now a simple set of key-value PUT/GET operations. It's still not any old key-value store though. A Repository is still responsible for handling branching, and every GET operation comes with an LSN. Mapping from Postgres data directory to keys/values --------------------------------------------------- All the data is now stored in the key-value store. The 'pgdatadir_mapping.rs' module handles mapping from PostgreSQL objects like relation pages and SLRUs, to key-value pairs. The key to the Repository key-value store is a Key struct, which consists of a few integer fields. It's wide enough to store a full RelFileNode, fork and block number, and to distinguish those from metadata keys. 'pgdatadir_mapping.rs' is also responsible for maintaining a "partitioning" of the keyspace. Partitioning means splitting the keyspace so that each partition holds a roughly equal number of keys. The partitioning is used when new image layer files are created, so that each image layer file is roughly the same size. The partitioning is also responsible for reclaiming space used by deleted keys. The Repository implementation doesn't have any explicit support for deleting keys. Instead, the deleted keys are simply omitted from the partitioning, and when a new image layer is created, the omitted keys are not copied over to the new image layer. We might want to implement tombstone keys in the future, to reclaim space faster, but this will work for now. 
Changes to low-level layer file code ------------------------------------ The concept of a "segment" is gone. Each layer file can now store an arbitrary range of Keys. Checkpointing, compaction ------------------------- The background tasks are somewhat different now. Whenever checkpoint_distance is reached, the WAL receiver thread "freezes" the current in-memory layer, and creates a new one. This is a quick operation and doesn't perform any I/O yet. It then launches a background "layer flushing thread" to write the frozen layer to disk, as a new L0 delta layer. This mechanism takes care of durability. It replaces the checkpointing thread. Compaction is a new background operation that takes a bunch of L0 delta layers, and reshuffles the data in them. It runs in a separate compaction thread. Deployment ---------- This also contains changes to the ansible scripts that enable having multiple different pageservers running at the same time in the staging environment. We will use that to keep an old version of the pageserver running, for clusters created with the old version, at the same time with a new pageserver with the new binary. Author: Heikki Linnakangas Author: Konstantin Knizhnik <knizhnik@zenith.tech> Author: Andrey Taranik <andrey@zenith.tech> Reviewed-by: Matthias Van De Meent <matthias@zenith.tech> Reviewed-by: Bojan Serafimov <bojan@zenith.tech> Reviewed-by: Konstantin Knizhnik <knizhnik@zenith.tech> Reviewed-by: Anton Shyrabokau <antons@zenith.tech> Reviewed-by: Dhammika Pathirana <dham@zenith.tech> Reviewed-by: Kirill Bulatov <kirill@zenith.tech> Reviewed-by: Anastasia Lubennikova <anastasia@zenith.tech> Reviewed-by: Alexey Kondratov <alexey@zenith.tech>	2022-03-28 05:41:15 -05:00
Kirill Bulatov	b39d1b1717	Exit only on important thread failures	2022-03-25 11:58:54 +02:00
Kirill Bulatov	28bc8e3f5c	Log pageserver threads better and shut down on errors in them	2022-03-25 11:58:54 +02:00
Kirill Bulatov	edc7bebcb5	Remove obvious panic sources	2022-03-25 11:58:54 +02:00
Heikki Linnakangas	c718870517	Tiny refactoring of page_cache::init function. The init function only needs the 'page_cache_size' from the config, so seems slightly nicer to pass just that.	2022-03-24 09:46:07 +02:00
Dmitry Rodionov	7738254f83	refactor timeline memory state management	2022-03-18 18:14:57 +03:00
Kirill Bulatov	dd74c66ef0	Do not create timeline along with tenant	2022-03-10 19:38:58 +02:00
Kirill Bulatov	c7569dce47	Allow passing initial timeline id into zenith CLI commands	2022-03-10 19:38:58 +02:00
Kirill Bulatov	10f811e886	Use `timeline` instead of `branch` in pageserver's API	2022-03-10 19:38:58 +02:00
Dmitry Rodionov	1d90b1b205	add node id to pageserver (#1310) * Add --id argument to safekeeper setting its unique u64 id. In preparation for storage node messaging. IDs are supposed to be monotonically assigned by the console. In tests it is issued by ZenithEnv; at the zenith cli level and fixtures, string name is completely replaced by integer id. Example TOML configs are adjusted accordingly. Sequential ids are chosen over Zid mainly because they are compact and easy to type/remember. * add node id to pageserver This adds node id parameter to pageserver configuration. Also I use a simple builder to construct pageserver config struct to avoid setting node id to some temporary invalid value. Some of the changes in test fixtures are needed to split init and start operations for environment. Co-authored-by: Arseny Sher <sher-ars@yandex.ru>	2022-03-04 01:10:42 +03:00
Kirill Bulatov	76b74349cb	Bump pageserver dependencies	2022-02-10 08:33:22 -05:00
Kirill Bulatov	3ed156a5b6	Add a CLI tool to manipulate remote storage blob files	2022-02-05 15:48:08 -05:00
Heikki Linnakangas	dab30c27b6	Refactor thread management and shutdown This introduces a new module to handle thread creation and shutdown. All page server threads are now registered in a global hash map, and there's a function to request individual threads to shut down gracefully. Thread shutdown request is signalled to the thread with a flag, as well as a Future that can be used to wake up async operations if shutdown is requested. Use that facility to have the libpq listener thread respond to pageserver shutdown, based on Kirill's earlier prototype (https://github.com/zenithdb/zenith/pull/1088). That addresses https://github.com/zenithdb/zenith/issues/1036, previously the libpq listener thread would not exit until one more connection arrives. This also eliminates a resource leak in the accept() loop. Previously, we added the JoinHandle of each new thread to a vector but old handles for threads that had already exited were never removed.	2022-01-14 18:36:10 +02:00
Kirill Bulatov	23cf2fa984	Properly shutdown storage sync loop	2022-01-11 15:44:23 +02:00
Kirill Bulatov	384b2a91fa	Pass generic pageserver params through zenith cli	2022-01-11 15:44:23 +02:00
Kirill Bulatov	f0afd08667	Fix zenith init defaults	2021-12-28 00:21:48 +02:00
Kirill Bulatov	b494ac1ea0	Remove redundant pageserver cli params	2021-12-27 18:38:54 +02:00
Kirill Bulatov	114a757d1c	Use generic config parameters in pageserver cli Co-authored-by: Heikki Linnakangas <heikki.linnakangas@iki.fi>	2021-12-23 18:58:28 +02:00
Konstantin Knizhnik	76777f5812	Add utility for dumping/editing metadata file (#1031)	2021-12-21 15:43:15 +03:00
Kirill Bulatov	673c297949	Download timelines on demand	2021-12-10 17:23:35 +02:00
Dmitry Ivanov	7cec13d1df	Improve shutdown story for code coverage This patch introduces fixes for several problems affecting LLVM-based code coverage: * Daemonizing parent processes should call _exit() to prevent coverage data file corruption (.profraw) due to concurrent writes. Implement proper shutdown handlers in safekeeper.	2021-12-06 13:27:52 +03:00
Kirill Bulatov	670205e17a	Evict excessively failing sync tasks, improve processing for the rest of the tasks	2021-11-30 13:58:49 +02:00
Heikki Linnakangas	7cae265447	Fix dump_layerfile. The VirtualFile machinery panics if it's not initialized	2021-11-29 11:26:54 +02:00
Heikki Linnakangas	d47f610606	Fix pageserver CLI parameter names and document them	2021-11-25 13:31:52 +02:00
Heikki Linnakangas	431d32756b	Add a buffer cache, and use it to store materialized pages. The buffer cache is shared across all tenants, allowing memory to be dynamically allocated where it's needed the most. The cache works on 8 kB pages, and uses the clock algorithm for replacement policy; same as the PostgreSQL buffer cache. One peculiarity is that the materialized page versions can be looked up by an inexact LSN, to find the latest page version with an LSN >= the search key. The code is structured to support caching other kinds of pages in the same cache in the future, but with a different mapping key. Co-authored-by: Patrick Insinger <patrick@zenith.tech>	2021-11-12 11:02:12 -08:00
Heikki Linnakangas	9300107cdf	Cache Book objects, use virtual files to avoid running out of fds. Currently, whenever a page version is needed from an image or delta layer, we open the file and read and parse the bookfile headers. That's pretty expensive. To reduce the overhead, introduce a cache of open file descriptors, and use that to cache the Book objects so that we don't need to read the metadata on every access.	2021-11-10 17:19:37 +02:00
Dmitry Rodionov	987833e0b9	Propagate git SHA to zenith binaries Git commit sha is displayed when --version flag is used and is written to logs during service startup. Uses git_version crate when git is available, and GIT_VERSION environment variable otherwise which is the case for docker builds.	2021-11-04 14:22:29 +03:00
Kirill Bulatov	f36acf00de	Reduce "relish" word usages in remote storage	2021-11-04 12:53:42 +02:00
Heikki Linnakangas	fb524dd973	Put a global limit on memory used by in-memory layers. Adds simple global tracking of memory used by the in-memory layers. It's very approximate, it doesn't take into account allocator, memory fragmentation or many other things, but it's a good first step. After storing a WAL record in the repository, the WAL receiver checks the global memory usage. If it's above a configurable threshold (hard coded at 128 MB at the moment), it evicts a layer. The victim layer is chosen by GClock algorithm, similar to that used in the Postgres buffer cache. This stops the page server from using an unbounded amount of memory. It's pretty crude, the eviction and materializing and writing a layer to disk happens now in the WAL receiver thread. It would be nice to move that to a background thread, and it would be nice to have a smarter policy on when to materialize a new image layer and when to just write out a delta layer, and it would be nice to have more accurate accounting of memory. But this should fix the most pressing OOM issues, and is a step in the right direction. Co-authored-by: Patrick Insinger <patrickinsinger@gmail.com>	2021-11-02 15:49:39 +02:00
Patrick Insinger	b532470792	Set SO_REUSEADDR for all TCP listeners	2021-10-29 12:45:26 -07:00
Kirill Bulatov	e9b5224a8a	Fix toml serde gotchas	2021-10-18 14:14:27 +03:00
Kirill Bulatov	ba557d126b	React on sigint	2021-10-15 21:24:24 +03:00
anastasia	d7c9dd06f4	Implement graceful shutdown at 'pageserver stop': - perform checkpoint for each tenant repository. - wait for the completion of all threads. Add new option 'immediate' to 'pageserver stop' command to terminate the pageserver immediately.	2021-10-11 13:35:01 +03:00
Heikki Linnakangas	7216f22609	Use tracing crate to have more context in log messages. Whenever we start processing a request, we now enter a tracing "span" that includes context information like the tenant and timeline ID, and the operation we're performing. That context information gets attached to every log message we create within the span. That way, we don't need to include basic context information like that in every log message, and it also becomes easier to filter the logs programmatically. This removes the explicit timeline and tenant IDs from most log messages, as you get that information from the enclosing span now. Also improve log messages in general, dialing down the level of some messages that are not very useful, and adding information to others. We now obey the RUST_LOG env variable, if it's set. The 'tracing' crate allows for different log formatters, like JSON or bunyan output. The one we use now is human-readable multi-line format, which is nice when reading the log directly, but hard for post-processing. For production, we'll probably want JSON output and some tools for working with it, but that's left as a TODO. The log format is easy to change.	2021-10-11 08:59:06 +03:00
Egor Suvorov	7e190d72a5	Make `pageserver_` prefix for common metric names configurable (#681)	2021-10-05 19:06:44 +03:00
Kirill Bulatov	5719f13cb2	Rework the relish thread model (#689)	2021-10-05 10:15:56 +03:00
Heikki Linnakangas	e474790400	Print more details on errors to log Fixes https://github.com/zenithdb/zenith/issues/661	2021-10-01 17:57:41 +03:00
Kirill Bulatov	287ea2e5e3	Limit concurrent relish storage sync operations	2021-10-01 08:37:09 +03:00
Kirill Bulatov	fb05e4cb0b	Show better error messages on pageserver failures	2021-09-29 01:55:41 +03:00
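
Commit 5d9851f5d1 above describes a blob-oriented API in which write_blob() stores a length field followed by the data and returns the offset at which the blob can later be read back. The following is a minimal sketch of that idea only, not the pageserver's actual code: the `BlobStore` type, the in-memory `Vec<u8>` standing in for block storage, the `read_blob` name, and the 4-byte little-endian length field are all assumptions made for illustration.

```rust
/// Hypothetical in-memory stand-in for the underlying block storage.
struct BlobStore {
    buf: Vec<u8>,
}

impl BlobStore {
    fn new() -> Self {
        BlobStore { buf: Vec::new() }
    }

    /// Appends a length field followed by the data and returns the
    /// offset of the length field, which identifies the blob.
    fn write_blob(&mut self, data: &[u8]) -> u64 {
        // Assumption: a 4-byte length field, so blobs must fit in u32.
        assert!(data.len() <= u32::MAX as usize);
        let offset = self.buf.len() as u64;
        self.buf.extend_from_slice(&(data.len() as u32).to_le_bytes());
        self.buf.extend_from_slice(data);
        offset
    }

    /// Reads the blob stored at `offset`, or returns None if the offset
    /// or the recorded length does not fit inside the buffer.
    fn read_blob(&self, offset: u64) -> Option<&[u8]> {
        let start = offset as usize;
        let data_start = start.checked_add(4)?;
        let mut len_bytes = [0u8; 4];
        len_bytes.copy_from_slice(self.buf.get(start..data_start)?);
        let len = u32::from_le_bytes(len_bytes) as usize;
        self.buf.get(data_start..data_start.checked_add(len)?)
    }
}

fn main() {
    let mut store = BlobStore::new();
    let first = store.write_blob(b"hello");
    let second = store.write_blob(b"page image");

    // Each blob is retrieved using only the offset returned at write time.
    assert_eq!(store.read_blob(first), Some(&b"hello"[..]));
    assert_eq!(store.read_blob(second), Some(&b"page image"[..]));
    // An offset past the end of the storage yields None instead of panicking.
    assert_eq!(store.read_blob(1_000), None);
}
```

Because the length is stored next to the data, the caller only has to remember a single offset per blob; the commit message indicates the real implementation layers this on top of the block-oriented API rather than a flat buffer.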


125 Commits