rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-04 04:30:38 +00:00

Author	SHA1	Message	Date
Bojan Serafimov	87d75f6070	Test on real data	2022-12-02 15:28:18 -05:00
Bojan Serafimov	f7201cd3cf	refactor test	2022-12-02 15:07:59 -05:00
Bojan Serafimov	2dcbdd9e47	Use i128	2022-12-02 14:52:55 -05:00
Bojan Serafimov	72db121a8a	Add todo	2022-12-02 09:17:20 -05:00
Bojan Serafimov	5444e6ff32	Add todo	2022-12-01 23:04:06 -05:00
Bojan Serafimov	809b04eccb	Add todo	2022-12-01 22:46:57 -05:00
Bojan Serafimov	e7b2b5ae12	Remove comment	2022-12-01 21:59:34 -05:00
Bojan Serafimov	f476e56315	Actually implement	2022-12-01 21:58:34 -05:00
Bojan Serafimov	72835371cc	Add mock persistent BST implementation	2022-12-01 20:49:43 -05:00
Bojan Serafimov	924d91c47d	Bench layer map using persistent range queries	2022-12-01 16:42:24 -05:00
Bojan Serafimov	4e61edef7c	wip	2022-12-01 15:05:21 -05:00
bojanserafimov	fe280f70aa	Add synthetic layer map bench (#2979)	2022-12-01 13:29:21 -05:00
bojanserafimov	b9544adcb4	Add layer map search benchmark (#2957)	2022-11-30 13:48:07 -05:00
Heikki Linnakangas	33834c01ec	Rename Paused states to Stopping. I'm not a fan of "Paused", for two reasons: - Paused implies that the tenant/timeline has no activity on it. That's not true; the tenant/timeline can still have active tasks working on it. - Paused implies that it can be resumed later. It cannot. A tenant or timeline in this state cannot be switched back to Active state anymore. A completely new Tenant or Timeline struct can be constructed for the same tenant or timeline later, e.g. if you detach and later re-attach the same tenant, but that's a different thing. Stopping describes the state better. I also considered "ShuttingDown", but Stopping is simpler as it's a single word.	2022-11-30 01:10:16 +02:00
Heikki Linnakangas	9a6c0be823	storage_sync2 The code in this change was extracted from PR #2595, i.e., Heikki’s draft PR for on-demand download. High-Level Changes - storage_sync module rewrite - Changes to Tenant Loading - Changes to Timeline States - Crash-safe & Resumable Tenant Attach There are several follow-up work items planned. Refer to the Epic issue on GitHub: https://github.com/neondatabase/neon/issues/2029 Metadata: closes https://github.com/neondatabase/neon/pull/2785 unsquashed history of this patch: archive/pr-2785-storage-sync2/pre-squash Co-authored-by: Dmitry Rodionov <dmitry@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech> =============================================================================== storage_sync module rewrite =========================== The storage_sync code is rewritten. New module name is storage_sync2, mostly to make a more reasonable git diff. The updated block comment in storage_sync2.rs describes the changes quite well, so, we will not reproduce that comment here. TL;DR: - Global sync queue and RemoteIndex are replaced with per-timeline `RemoteTimelineClient` structure that contains a queue for UploadOperations to ensure proper ordering and necessary metadata. - Before deleting local layer files, wait for ongoing UploadOps to finish (wait_completion()). - Download operations are not queued and executed immediately. Changes to Tenant Loading ========================= Initial sync part was rewritten as well and represents the other major change that serves as a foundation for on-demand downloads. Routines for attaching and loading shifted directly to Tenant struct and now are asynchronous and spawned into the background. Since this patch doesn’t introduce on-demand download of layers we fully synchronize with the remote during pageserver startup. See details in `Timeline::reconcile_with_remote` and `Timeline::download_missing`. 
Changes to Tenant States ======================== The “Active” state has lost its “background_jobs_running: bool” member. That variable indicated whether the GC & Compaction background loops are spawned or not. With this patch, they are now always spawned. Unit tests (#[test]) use the TenantConf::{gc_period,compaction_period} to disable their effect (`15db566`). This patch introduces a new tenant state, “Attaching”. A tenant that is being attached starts in this state and transitions to “Active” once it finishes download. The `GET /tenant` endpoints return `TenantInfo::has_in_progress_downloads`. We derive the value for that field from the tenant state now, to remain backwards-compatible with cloud.git. We will remove that field when we switch to on-demand downloads. Changes to Timeline States ========================== The TimelineInfo::awaits_download field is now equivalent to the tenant being in Attaching state. Previously, download progress was tracked per timeline. With this change, it’s only tracked per tenant. When on-demand downloads arrive, the field will be completely obsolete. Deprecation is tracked in issue #2930. Crash-safe & Resumable Tenant Attach ==================================== Previously, the attach operation was not persistent. I.e., when tenant attach was interrupted by a crash, the pageserver would not continue attaching after pageserver restart. In fact, the half-finished tenant directory on disk would simply be skipped by tenant_mgr because it lacked the metadata file (it’s written last). This patch introduces an “attaching” marker file that is present inside the tenant directory while the tenant is attaching. During pageserver startup, tenant_mgr will resume attach if that file is present. If not, it assumes that the local tenant state is consistent and tries to load the tenant. If that fails, the tenant transitions into Broken state.	2022-11-29 18:55:20 +01:00
Heikki Linnakangas	fbd5f65938	Misc cosmetic fixes in comments, messages. Most of these were extracted from PR #2785.	2022-11-29 14:10:45 +02:00
Heikki Linnakangas	1f1324ebed	Require tenant to be active when calculating tenant size. It's not clear if the calculation would work or make sense, if the tenant is only partially loaded. Let's play it safe, and require it to be Active.	2022-11-29 14:10:45 +02:00
Joonas Koivunen	f277140234	Small fixes (#2949) Nothing interesting in these changes. Passing through the RUST_BACKTRACE=full will hopefully save someone else panic reproduction time. Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2022-11-29 10:29:25 +02:00
MMeent	0c1195c30d	Fix #2937 (#2940) Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2022-11-28 15:34:07 +01:00
Heikki Linnakangas	67469339fa	When new timeline is created, don't wait for compaction. (#2931) When a new root timeline is created, we want to flush all the data to disk before we return success to the caller. We were using checkpoint(CheckpointConfig::Forced) for that, but that also performs compaction. With the default settings, compaction will have no work after we have imported an empty database, as the image of that is too small to require compaction. However, with very small checkpoint_distance and compaction_target_size, compaction will run, and it can take a while. PR #2785 adds new tests that use very small checkpoint_distance and compaction_target_size settings, and the test sometimes failed with "operation timed out" error in the client, when the create_timeline step took too long.	2022-11-28 11:05:20 +02:00
Heikki Linnakangas	15db566420	Allow setting gc/compaction_period to 0, to disable automatic GC/compaction Many python tests were setting the GC/compaction period to large values, to effectively disable GC / compaction. Reserve value 0 to mean "explicitly disabled". We also set them to 0 in unit tests now, although currently, unit tests don't launch the background jobs at all, so it won't have any effect. Fixes https://github.com/neondatabase/neon/issues/2917	2022-11-25 20:14:06 +02:00
Egor Suvorov	ae53dc3326	Add authentication between Safekeeper and Pageserver/Compute * Fix https://github.com/neondatabase/neon/issues/1854 * Never log Safekeeper::conninfo in walproposer as it now contains a secret token * control_panel, test_runner: generate and pass JWT tokens for Safekeeper to compute and pageserver * Compute: load JWT token for Safekeeper from the environment variable. Do not reuse the token from pageserver_connstring because it's embedded in there weirdly. * Pageserver: load JWT token for Safekeeper from the environment variable. * Rewrite docs/authentication.md	2022-11-25 04:17:42 +03:00
Egor Suvorov	1ca76776d0	pageserver: require management permissions on HTTP /status	2022-11-25 04:17:42 +03:00
Egor Suvorov	2ce5d8137d	Separate permission checks for Pageserver and Safekeeper There will be different scopes for those two, so authorization code should be different. The `check_permission` function is now not in the shared library. Its implementation is very similar to the one which will be added for Safekeeper. In fact, we may reuse the same existing root-like 'PageServerApi' scope, but I would prefer to have separate root-like scopes for services. Also, generate_management_token in tests is generate_pageserver_token now.	2022-11-25 04:17:42 +03:00
Egor Suvorov	b6989e8928	pageserver: make `wal_source_connstring: String` a `wal_source_connconf: PgConnectionConfig`	2022-11-24 14:02:23 +03:00
Heikki Linnakangas	99d9c23df5	Gather path-related consts and functions to one place. Feels more organized this way.	2022-11-24 12:26:15 +02:00
Konstantin Knizhnik	a6e4a3c3ef	Implement correct truncation of FSM/VM forks on arbitrary position (#2609) refer #2601 Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>	2022-11-23 18:46:07 +02:00
Alexey Kondratov	4bf3087aed	[pageserver] list `latest_gc_cutoff_lsn` in the OpenAPI spec (#2894) It seems to have been present in the API response for quite a while; it just wasn't listed in the spec. Fix it.	2022-11-22 21:10:49 +01:00
Heikki Linnakangas	86e483f87b	Fix tenant size modeling code to include WAL at end of branch Imagine that you have a tenant with a single branch like this: ---------------==========> ^ gc horizon where: ---- is the portion of the branch that is older than retention period ==== is the portion of the branch that is newer than retention period. Before this commit, the sizing model included the logical size at the GC horizon, but not the WAL after that. In particular, that meant that on a newly created tenant with just one timeline, where the retention period covered the whole history of the timeline, i.e. gc_cutoff was 0, the calculated tenant size was always zero. We now include the WAL after the GC horizon in the size. So in the above example, the calculated tenant size would be the logical size of the database at the GC horizon, plus all the WAL after it (marked with ===). This adds a new `insert_point` function to the sizing model, alongside `modify_branch`, and changes the code in size.rs to use the new function. The new function takes an absolute lsn and logical size as argument, so we no longer need to calculate the difference to the previous point. Also, the end-size is now optional, because we now need to add a point to represent the end of each branch to the model, but we don't want to or need to calculate the logical size at that point.	2022-11-22 17:11:27 +02:00
Alexander Stanovoy	a5b898a31c	Fix the order of checks in LSN (#2882) We should first check whether the LSN is in the lower range, because that bound is constant, and only then wait for the LSN to arrive if needed.	2022-11-22 02:28:41 +02:00
Heikki Linnakangas	6c97fc941a	Enable passing FAILPOINTS at startup. - Pass through FAILPOINTS environment variable to the pageserver in "neon_local pageserver start" command - On startup, list any failpoints that were set with FAILPOINTS to the log - Add optional "extra_env_vars" argument to the NeonPageserver.start() function in the python fixture, so that you can pass FAILPOINTS None of the tests use this functionality yet; that comes in a separate commit. closes https://github.com/neondatabase/neon/pull/2865	2022-11-21 16:24:19 +01:00
Christian Schwarz	f564dff0e3	make test_tenant_detach_smoke fail reproducibly Add failpoint that triggers the race condition. Skip test until we land the fix from https://github.com/neondatabase/neon/pull/2851 with https://github.com/neondatabase/neon/pull/2785	2022-11-18 17:15:34 +01:00
Christian Schwarz	d783889a1f	timeline: explicit tracking of flush loop state: NotStarted, Running, Exited This allows us to error out in the case where we request flush but the flush loop is not running. Before, we would only track whether it was started, but not when it exited. Better to use an enum with 3 states than a 2-state bool because then the error message can answer the question whether we ever started the flush loop or not.	2022-11-18 17:15:34 +01:00
Heikki Linnakangas	328ec1ce24	Print a more full error message, with stack trace, on GC failure. In a CI run, I got a test failure because of this error in the log, from the test_get_tenant_size_with_multiple_branches test: ERROR gc_loop{tenant_id=f1630516d4b526139836ced93be0c878}: Gc failed, retrying in 2s: No such file or directory (os error 2) There are known race conditions between GC and timeline deletion, which surely caused that error. But if we didn't know the cause, it would be pretty hard to debug without a stack trace.	2022-11-18 11:44:00 +02:00
Joonas Koivunen	1d105727cb	perf: simple walredo bench (#2816) adds a simple walredo bench to allow some comparison of the walredo throughput. Cc: #1339, #2778	2022-11-16 11:13:56 +02:00
Heikki Linnakangas	46d30bf054	Check for errors in pageserver log after each test. If there are any unexpected ERRORs or WARNs in pageserver.log after test finishes, fail the test. This requires whitelisting the errors that are expected in each test, and there are also a few common errors that are printed by most tests, which are whitelisted in the fixture itself. With this, we don't need the special abort() call in testing mode, when compaction or GC fails. Those failures will print ERRORs to the logs, which will be picked up by this new mechanism. A bunch of errors are currently whitelisted that we probably shouldn't be emitting in the first place, but fixing those is out of scope for this commit, so I just left FIXME comments on them.	2022-11-15 18:47:28 +02:00
Heikki Linnakangas	e44e4a699b	Downgrade log message, if client terminates COPY during basebackup import It's more or less expected from pageserver's point of view. Change the error kind to ConnectionReset, so that it gets logged at INFO level instead of ERROR.	2022-11-15 18:47:28 +02:00
Heikki Linnakangas	dbe5b52494	Avoid some vector-growing overhead. I saw this in 'perf' profile of a sequential scan: > - 31.93% 0.21% compute request pageserver [.] <pageserver::walredo::PostgresRedoManager as pageserver::walredo::WalRedoManager>::request_redo > - 31.72% <pageserver::walredo::PostgresRedoManager as pageserver::walredo::WalRedoManager>::request_redo > - 31.26% pageserver::walredo::PostgresRedoManager::apply_batch_postgres > + 7.64% <std::process::ChildStdin as std::io::Write>::write > + 6.17% nix::poll::poll > + 3.58% <std::process::ChildStderr as std::io::Read>::read > + 2.96% std::sync::condvar::Condvar::notify_one > + 2.48% std::sys::unix::locks::futex::Condvar::wait > + 2.19% alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle > + 1.14% std::sys::unix::locks::futex::Mutex::lock_contended > 0.67% __rust_alloc_zeroed > 0.62% __stpcpy_ssse3 > 0.56% std::sys::unix::locks::futex::Mutex::wake Note the 'do_reserve_and_handle' overhead. That's caused by having to grow the buffer used to construct the WAL redo request. This commit eliminates that overhead. It's only about 2% of the overall CPU usage, but every little helps. Also reuse the temp buffer when reading records from a DeltaLayer, and call Vec::reserve to avoid growing a buffer when reading a blob across pages. I saw a reduction from 2% to 1% of CPU spent in do_reserve_and_handle in that codepath, but that's such a small change that it could be just noise. Seems like it shouldn't hurt though.	2022-11-12 18:52:25 +02:00
bojanserafimov	7fd88fab59	Trace read requests (#2762)	2022-11-10 16:43:04 -05:00
Christian Schwarz	8654e95fae	walredo: fix zombie processes ([postgres] <defunct>) This change wraps the std::process::Child that we spawn for WAL redo into a type that ensures that we try to SIGKILL + waitpid() on it. If there is no explicit call to kill_and_wait(), the Drop implementation will spawn a task that does it in the BACKGROUND_RUNTIME. That's an ugly hack but I think it's better than doing kill+wait synchronously from Drop, since I think the general assumption in the Rust ecosystem is that Drop doesn't block. Especially since the drop sites can be _any_ place that drops the last Arc<PostgresRedoManager>, e.g., compaction or GC. The benefit of having the new type over just adding a Drop impl to PostgresRedoProcess is that we can construct it earlier than the full PostgresRedoProcess in PostgresRedoProcess::launch(). That allows us to correctly kill+wait the child if there is an error in PostgresRedoProcess::launch() after spawning it. I also took a stab at a regression test. I manually verified that it fails before the fix to walredo.rs. fixes https://github.com/neondatabase/neon/issues/2761 closes https://github.com/neondatabase/neon/pull/2776	2022-11-10 12:50:50 +01:00
Christian Schwarz	c3a470a29b	walredo process management: handle every error on the kill() and drop path If we're not calling kill() before dropping the PostgresRedoProcess, we currently leak it. That's most likely the root cause for #2761. This patch 1. adds an error log message for that case and 2. adds error handling for all errors on the kill() path. If we're a `testing` build, we panic. Otherwise, we log an error and leak the process. The error handling changes (2) are necessary to conclusively state that the root cause for #2761 is indeed (1). If we didn't have them, the root cause could be missing error handling instead. To make the log messages useful, I've added tracing::instrument attributes that log the tenant_id and PID. That helps mapping back the PID of `defunct` processes to pageserver log messages. Note that a defunct process's `/proc/$PID/` directory isn't very useful. We have left little more than its PID. Once we have validated the root cause, we'll find a fix, but that's still an ongoing discussion. refs https://github.com/neondatabase/neon/issues/2761 closes https://github.com/neondatabase/neon/pull/2769	2022-11-08 14:03:13 +01:00
Joonas Koivunen	548d472b12	fix: logical size query before initdb_lsn (#2755) With a more realistic selection of gc_horizon in tests, there is an immediate failure when trying to query logical size with lsn < initdb_lsn. Fixes that, and adds an illustration that came out of explaining this tenant size calculation to more people. Cc: #2748, #2599.	2022-11-07 12:03:57 +02:00
Dmitry Rodionov	99e745a760	review adjustments	2022-11-06 13:42:18 +02:00
Dmitry Rodionov	15d970f731	decrease diff by moving check_checkpoint_distance back Co-authored-by: Christian Schwarz <me@cschwarz.com>	2022-11-06 13:42:18 +02:00
Heikki Linnakangas	7b7f84f1b4	Refactor layer flushing task Extracted from https://github.com/neondatabase/neon/pull/2595	2022-11-06 13:42:18 +02:00
Dmitry Ivanov	c38f38dab7	Move pq_proto to its own crate	2022-11-03 22:56:04 +03:00
Joonas Koivunen	cf68963b18	Add initial tenant sizing model and a http route to query it (#2714) Tenant size information is gathered by using existing parts of `Tenant::gc_iteration` which are now separated as `Tenant::refresh_gc_info`. `Tenant::refresh_gc_info` collects branch points, and invokes `Timeline::update_gc_info`; nothing was supposed to be changed there. The gathered branch points (through Timeline's `GcInfo::retain_lsns`), `GcInfo::horizon_cutoff`, and `GcInfo::pitr_cutoff` are used to build up a Vec of updates fed into the `libs/tenant_size_model` to calculate the history size. The gathered information is now exposed using `GET /v1/tenant/{tenant_id}/size`, which will respond with the actual calculated size. Initially the idea was to have this delivered as tenant background task and exported via metric, but it might be too computationally expensive to run it periodically as we don't yet know if the returned values are any good. Adds one new metric: - pageserver_storage_operations_seconds with label `logical_size` - separating from original `init_logical_size` Adds a pageserver wide configuration variable: - `concurrent_tenant_size_logical_size_queries` with default 1 This leaves a lot of TODO's, tracked on issue #2748.	2022-11-03 12:39:19 +00:00
bojanserafimov	d7eeb73f6f	Impl serialize for pagestream FeMessage (#2741)	2022-11-02 23:44:07 -04:00
bojanserafimov	a0a74868a4	Fix clippy (#2742)	2022-11-02 12:30:09 -04:00
Christian Schwarz	b154992510	timeline_list_handler: avoid spawn_blocking As per https://github.com/neondatabase/neon/issues/2731#issuecomment-1299335813 refs https://github.com/neondatabase/neon/issues/2731	2022-11-02 16:22:58 +01:00

1021 Commits
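
Commit d783889a1f in the list above describes replacing a two-state bool with an enum of three states (NotStarted, Running, Exited), so that a failed flush request can report whether the flush loop was ever started. The following is a minimal sketch of that idea, not the pageserver's actual code: the variant names follow the commit subject, while `FlushLoopState` as a type name, `FlushError`, and `request_flush` are hypothetical and introduced only for illustration.

```rust
use std::fmt;

/// Hypothetical three-state tracker for a background flush loop.
/// The variant names mirror the commit subject; the type name is assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FlushLoopState {
    NotStarted,
    Running,
    Exited,
}

/// Hypothetical error type: one variant per non-running state, so the
/// message can say whether the loop was ever started.
#[derive(Debug, PartialEq, Eq)]
enum FlushError {
    NeverStarted,
    AlreadyExited,
}

impl fmt::Display for FlushError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            FlushError::NeverStarted => write!(f, "flush loop was never started"),
            FlushError::AlreadyExited => write!(f, "flush loop has already exited"),
        }
    }
}

/// Accept a flush request only while the loop is running (illustrative only).
fn request_flush(state: FlushLoopState) -> Result<(), FlushError> {
    match state {
        FlushLoopState::Running => Ok(()),
        FlushLoopState::NotStarted => Err(FlushError::NeverStarted),
        FlushLoopState::Exited => Err(FlushError::AlreadyExited),
    }
}

fn main() {
    let states = [
        FlushLoopState::NotStarted,
        FlushLoopState::Running,
        FlushLoopState::Exited,
    ];
    for state in states {
        match request_flush(state) {
            Ok(()) => println!("{:?}: flush accepted", state),
            Err(e) => println!("{:?}: {}", state, e),
        }
    }
}
```

With a plain `started: bool`, the NotStarted and Exited cases would be indistinguishable once the flag is cleared, or Exited would look like Running if it is never cleared. The enum makes the match exhaustive, so each case gets its own error.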