rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-15 09:22:55 +00:00

Author	SHA1	Message	Date
Konstantin Knizhnik	f3073a4db9	R-Tree layer map (#2317 ) Replace the layer array and linear search with R-tree So far, the in-memory layer map that holds information about layer files that exist, has used a simple Vec, in no particular order, to hold information about all the layers. That obviously doesn't scale very well; with thousands of layer files the linear search was consuming a lot of CPU. Replace it with a two-dimensional R-tree, with Key and LSN ranges as the dimensions. For the R-tree, use the 'rstar' crate. To be able to use that, we convert the Keys and LSNs into 256-bit integers. 64 bits would be enough to represent LSNs, and 128 bits would be enough to represent Keys. However, we use 256 bits, because rstar internally performs multiplication to calculate the area of rectangles, and the result of multiplying two 128 bit integers doesn't necessarily fit in 128 bits, causing integer overflow and, if overflow-checks are enabled, panic. To avoid that, we use 256 bit integers. Add a performance test that creates a lot of layer files, to demonstrate the benefit.	2022-09-22 08:35:06 +03:00
sharnoff	4a3b3ff11d	Move testing pageserver libpq cmds to HTTP api (#2429 ) Closes #2422. The APIs have been feature gated with the `testing_api!` macro so that they return 400s when support hasn't been compiled in.	2022-09-20 11:28:12 -07:00
Kirill Bulatov	031e57a973	Disable failpoints by default	2022-09-16 09:26:29 +03:00
Kirill Bulatov	b8eb908a3d	Rename old project name references	2022-09-14 08:14:05 +03:00
Heikki Linnakangas	40c845e57d	Switch to async for all concurrency in the pageserver. Instead of spawning helper threads, we now use Tokio tasks. There are multiple Tokio runtimes, for different kinds of tasks. One for serving libpq client connections, another for background operations like GC and compaction, and so on. That's not strictly required, we could use just one runtime, but with this you can still get an overview of what's happening with "top -H". There's one subtle behavior in how TenantState is updated. Before this patch, if you deleted all timelines from a tenant, its GC and compaction loops were stopped, and the tenant went back to Idle state. We no longer do that. The empty tenant stays Active. The changes to test_tenant_tasks.py are related to that. There's still plenty of synchronous code and blocking. For example, we still use blocking std::io functions for all file I/O, and the communication with WAL redo processes is still uses low-level unix poll(). We might want to rewrite those later, but this will do for now. The model is that local file I/O is considered to be fast enough that blocking - and preventing other tasks running in the same thread - is acceptable.	2022-09-12 14:21:00 +03:00
Heikki Linnakangas	34b5d7aa9f	Remove unused dependency	2022-08-27 18:14:33 +03:00
Ankur Srivastava	84d1bc06a9	refactor: replace lazy-static with once-cell (#2195 ) - Replacing all the occurrences of lazy-static with `once-cell::sync::Lazy` - fixes #1147 Signed-off-by: Ankur Srivastava <best.ankur@gmail.com>	2022-08-05 19:34:04 +02:00
Heikki Linnakangas	b4c74c0ecd	Clean up unnecessary dependencies. Just to be tidy.	2022-07-20 16:31:25 +03:00
bojanserafimov	1ca28e6f3c	Import basebackup into pageserver (#1925 ) Allow importing basebackup taken from vanilla postgres or another pageserver via psql copy in protocol.	2022-06-21 11:04:10 -04:00
Kirill Bulatov	e5cb727572	Replace callmemaybe with etcd subscriptions on safekeeper timeline info	2022-06-01 16:07:04 +03:00
bojanserafimov	ca10cc12c1	Close file descriptors for redo process (#1834 )	2022-05-31 14:14:09 -04:00
Kirill Bulatov	a884f4cf6b	Add etcd to neon_local	2022-05-17 01:17:44 +03:00
Kirill Bulatov	51c0f9ab2b	Force git version to be up to date via decl macro	2022-05-13 16:34:32 +03:00
Kirill Bulatov	de37f982db	Share the remote storage as a crate	2022-05-07 00:30:36 +03:00
Dmitry Rodionov	05f8e6a050	Use fsync+rename for atomic downloads from remote storage Use failpoint in test_remote_storage to check the behavior	2022-04-29 15:53:56 +03:00
Anastasia Lubennikova	5c5c3c64f3	Fix tenant config parsing. Add a test	2022-04-28 11:49:19 +03:00
Dmitry Ivanov	d3f356e7a8	Update `rust-postgres` project-wide (#1525 ) * Update `rust-postgres` project-wide This commit points to https://github.com/neondatabase/rust-postgres/commits/neon in order to test our patches on top of the latest version of this crate. * [proxy] Update `hmac` and `sha2`	2022-04-22 17:31:58 +03:00
Heikki Linnakangas	a4700c9bbe	Use pprof to get flamegraph of get_page and get_relsize requests. This depends on a hacked version of the 'pprof-rs' crate. Because of that, it's under an optional 'profiling' feature. It is disabled by default, but enabled for release builds in CircleCI config. It doesn't currently work on macOS. The flamegraph is written to 'flamegraph.svg' in the pageserver workdir when the 'pageserver' process exits. Add a performance test that runs the perf_pgbench test, with profiling enabled.	2022-04-21 20:32:48 +03:00
Kirill Bulatov	81cad6277a	Move and library crates into a dedicated directory and rename them	2022-04-21 13:30:33 +03:00
Kirill Bulatov	3e6087a12f	Remove S3 archiving	2022-04-19 23:13:52 +03:00
Kirill Bulatov	0ca2bd929b	Remove log crate from pageserver	2022-04-18 00:00:36 +03:00
Heikki Linnakangas	93e0ac2b7a	Remove a couple of unused dependencies. Found by "cargo-udeps"	2022-04-14 17:38:26 +03:00
Kirill Bulatov	db63fa64ae	Use rusoto lib for S3 relish_storage impl	2022-04-11 21:34:04 +03:00
Heikki Linnakangas	214567bf8f	Use B-tree for the index in image and delta layers. We now use a page cache for those, instead of slurping the whole index into memory. Fixes https://github.com/zenithdb/zenith/issues/1356 This is a backwards-incompatible change to the storage format, so bump STORAGE_FORMAT_VERSION.	2022-04-07 20:58:55 +03:00
Heikki Linnakangas	5d9851f5d1	Refactor the I/O functions. This introduces two new abstraction layers for I/O: - Block I/O, and - Blob I/O. The BlockReader trait abstracts a file or something else that can be read in 8kB pages. It is implemented by EphemeralFiles, and by a new FileBlockReader struct that allows reading arbitrary VirtualFiles in that manner, utilizing the page cache. There is also a new BlockCursor struct that works as a cursor over a BlockReader. When you create a BlockCursor and read the first page using it, it keeps the reference to the page. If you access the same page again, it avoids going to page cache and quickly returns the same page again. That can save a lot of lookups in the page cache if you perform multiple reads. The Blob-oriented API allows reading and writing "blobs" of arbitrary length. It is a layer on top of the block-oriented API. When you write a blob with the write_blob() function, it writes a length field followed by the actual data to the underlying block storage, and returns the offset where the blob was stored. The blob can be retrieved later using the offset. Finally, this replaces the I/O code in image-, delta-, and in-memory layers to use the new abstractions. These replace the 'bookfile' crate. This is a backwards-incompatible change to the storage format.	2022-04-07 20:58:54 +03:00
Dmitry Ivanov	f5da652388	[proxy] Enable keepalives for all tcp connections (#1448 )	2022-03-31 20:44:57 +03:00
Dmitry Rodionov	eee0f51e0c	use cargo-hakari to manage workspace_hack crate workspace_hack is needed to avoid recompilation when different crates inside the workspace depend on the same packages but with different features being enabled. Problem occurs when you build crates separately one by one. So this is irrelevant to our CI setup because there we build all binaries at once, but it may be relevant for local development. this also changes cargo's resolver version to 2	2022-03-29 10:42:04 +03:00
Heikki Linnakangas	07342f7519	Major storage format rewrite. This is a backwards-incompatible change. The new pageserver cannot read repositories created with an old pageserver binary, or vice versa. Simplify Repository to a value-store ------------------------------------ Move the responsibility of tracking relation metadata, like which relations exist and what are their sizes, from Repository to a new module, pgdatadir_mapping.rs. The interface to Repository is now a simple key-value PUT/GET operations. It's still not any old key-value store though. A Repository is still responsible from handling branching, and every GET operation comes with an LSN. Mapping from Postgres data directory to keys/values --------------------------------------------------- All the data is now stored in the key-value store. The 'pgdatadir_mapping.rs' module handles mapping from PostgreSQL objects like relation pages and SLRUs, to key-value pairs. The key to the Repository key-value store is a Key struct, which consists of a few integer fields. It's wide enough to store a full RelFileNode, fork and block number, and to distinguish those from metadata keys. 'pgdatadir_mapping.rs' is also responsible for maintaining a "partitioning" of the keyspace. Partitioning means splitting the keyspace so that each partition holds a roughly equal number of keys. The partitioning is used when new image layer files are created, so that each image layer file is roughly the same size. The partitioning is also responsible for reclaiming space used by deleted keys. The Repository implementation doesn't have any explicit support for deleting keys. Instead, the deleted keys are simply omitted from the partitioning, and when a new image layer is created, the omitted keys are not copied over to the new image layer. We might want to implement tombstone keys in the future, to reclaim space faster, but this will work for now. Changes to low-level layer file code ------------------------------------ The concept of a "segment" is gone. Each layer file can now store an arbitrary range of Keys. Checkpointing, compaction ------------------------- The background tasks are somewhat different now. Whenever checkpoint_distance is reached, the WAL receiver thread "freezes" the current in-memory layer, and creates a new one. This is a quick operation and doesn't perform any I/O yet. It then launches a background "layer flushing thread" to write the frozen layer to disk, as a new L0 delta layer. This mechanism takes care of durability. It replaces the checkpointing thread. Compaction is a new background operation that takes a bunch of L0 delta layers, and reshuffles the data in them. It runs in a separate compaction thread. Deployment ---------- This also contains changes to the ansible scripts that enable having multiple different pageservers running at the same time in the staging environment. We will use that to keep an old version of the pageserver running, for clusters created with the old version, at the same time with a new pageserver with the new binary. Author: Heikki Linnakangas Author: Konstantin Knizhnik <knizhnik@zenith.tech> Author: Andrey Taranik <andrey@zenith.tech> Reviewed-by: Matthias Van De Meent <matthias@zenith.tech> Reviewed-by: Bojan Serafimov <bojan@zenith.tech> Reviewed-by: Konstantin Knizhnik <knizhnik@zenith.tech> Reviewed-by: Anton Shyrabokau <antons@zenith.tech> Reviewed-by: Dhammika Pathirana <dham@zenith.tech> Reviewed-by: Kirill Bulatov <kirill@zenith.tech> Reviewed-by: Anastasia Lubennikova <anastasia@zenith.tech> Reviewed-by: Alexey Kondratov <alexey@zenith.tech>	2022-03-28 05:41:15 -05:00
Dmitry Rodionov	8b8d78a3a0	use main branch of our bookfile crate	2022-03-23 22:05:43 +04:00
Kirill Bulatov	063f9ba81d	Use serde_with to (de)serialize ZId and Lsn to hex	2022-03-21 12:46:07 +02:00
anastasia	1a4682a04a	Add 'walreceiver-after-ingest' failpoint. Use sleep at this point to imitate slow walreceiver.	2022-02-22 13:56:21 +03:00
Kirill Bulatov	6eef401602	Move routerify behind zenith_utils	2022-02-10 08:33:22 -05:00
Kirill Bulatov	76b74349cb	Bump pageserver dependencies	2022-02-10 08:33:22 -05:00
anastasia	5abe2129c6	Extend replication protocol with ZentihFeedback message to pass current_timeline_size to compute node Put standby_status_update fields into ZenithFeedback and send them as one message. Pass values sizes together with keys in ZenithFeedback message.	2022-01-27 11:20:45 +03:00
Dmitry Rodionov	e6f2d70517	use 2021 rust edition	2022-01-25 18:48:49 +03:00
Heikki Linnakangas	adb0b3dada	Include backtrace in error messages in the log. 'anyhow' crate can include a backtrace in all errors, when the 'backtrace' feature is enabled. Enable it, and change the places that used '{:#}' or '{}' to '{:?}', so that the backtrace is printed.	2022-01-14 10:10:17 +02:00
Patrick Insinger	24c8dab86f	pageserver - parallelize checkpoint fsyncs	2022-01-04 20:40:57 -08:00
Dmitry Rodionov	c910132d4b	Fix wal receiver shutdown This patch allows to shutdown wal receiver when there are no messages and wal receiver is blocked inside tokio-postgres. In this case it cannot check the shutdown flag. This patch switches to use async interface of tokio-postgres directly without sync wrappers. It opens the possibility to use tokio::select! between the phsycal_stream.next() and a shutdown channel readiness to interrupt replication process. Also this allows to shutdown only particular wal receiver without using global shutdown_requested flag.	2021-12-29 14:42:29 +03:00
bojanserafimov	b807570f46	Use parking_lot::Mutex instead of std::Mutex in walreceiver (#1045 )	2021-12-23 14:25:44 -05:00
Kirill Bulatov	114a757d1c	Use generic config parameters in pageserver cli Co-authored-by: Heikki Linnakangas <heikki.linnakangas@iki.fi>	2021-12-23 18:58:28 +02:00
Kirill Bulatov	5cff7d1de9	Use proper download order	2021-12-14 15:32:22 +02:00
Kirill Bulatov	e61732ca7c	Compress checkpoint files before streaming into S3	2021-12-10 17:23:35 +02:00
Heikki Linnakangas	cb4a8396fb	Use rustls rather than native-tls in all dependencies. We depends on rustls in postgres_backend anyway, so might as well use it for all TLS stuff. Seems better to depend on only one library both from a security point of view, and because fewer dependencies means less code to compile. With this commit, we no longer depend on OpenSSL.	2021-12-10 15:14:27 +02:00
Heikki Linnakangas	9d369f158c	Update rust-s3 to version 0.28.0 0.28.0 includes two changes I submitted to upstream: - Add support for older ListObjects API, needed to use rust-s3 with Google Cloud Storage: https://github.com/durch/rust-s3/pull/229 - If file is smaller than one chunk, don't initiate multi-part upload. https://github.com/durch/rust-s3/pull/228 These are not critical for Zenith right now, but let's stay up-to-date.	2021-12-10 14:52:08 +02:00
Dmitry Ivanov	7cec13d1df	Improve shutdown story for code coverage This patch introduces fixes for several problems affecting LLVM-based code coverage: * Daemonizing parent processes should call _exit() to prevent coverage data file corruption (.profraw) due to concurrent writes. Implement proper shutdown handlers in safekeeper.	2021-12-06 13:27:52 +03:00
Kirill Bulatov	670205e17a	Evict excessively failing sync tasks, improve processing for the rest of the tasks	2021-11-30 13:58:49 +02:00
Heikki Linnakangas	9300107cdf	Cache Book objects, use virtual files to avoid running out of fds. Currently, whenever a page version is needed from an image or delta layer, we open the file and read and parse the bookfile headers. That's pretty expensive. To reduce the overhead, introduce a cache of open file descriptors, and use that to cache the Book objects so that we don't need to read the metadata on every access.	2021-11-10 17:19:37 +02:00
Heikki Linnakangas	b38e841f2d	Use poll() in communication with WAL redo process. The tokio futures added some overhead, so switch to plain non-blocking I/O with poll(). In a simple pgbench test on my laptop (select-only queries, scale-factor 1 `pgbench -P1 -T50 -S`), this gives about 10% improvement, from about 4300 TPS to 4800 TPS.	2021-11-04 10:39:04 +02:00
Dmitry Rodionov	5bc09074ea	add a flag to avoid non incremental size calculation in pageserver http api This calculation is not that heavy but it is needed only in tests, and in case the number of tenants/timelines is high the calculation can take noticeable time. Resolves https://github.com/zenithdb/zenith/issues/804	2021-10-27 13:30:34 +03:00
Kirill Bulatov	04fb0a0342	Add core relish backup and restore functionality	2021-10-22 22:22:38 +03:00

1 2

93 Commits