rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-14 00:42:54 +00:00

Author	SHA1	Message	Date
Kirill Bulatov	861dc8e64e	Remove redundant once_cell usages	2022-12-09 22:14:32 +02:00
Dmitry Rodionov	3122f3282f	Ignore backup files (ones with .n.old suffix) in download_missing This is rather a hack to resolve immediate issue: https://github.com/neondatabase/neon/issues/3024 Properly cleaning this file from index part requires changes to initialization of remote queue. Because we need to clean it up earlier than we start warking around files. With on-demand there will be no walk around layer files becase download_missing is no longer needed, so I believe it will be natural to unify this with load_layer_map	2022-12-09 12:07:50 +03:00
Konstantin Knizhnik	e1ef62f086	Print more information about context of failed walredo requests (#3003 )	2022-12-08 09:12:38 +02:00
Kirill Bulatov	b50e0793cf	Rework remote_storage interface (#2993 ) Changes: * Remove `RemoteObjectId` concept from remote_storage. Operate directly on /-separated names instead. These names are now represented by struct `RemotePath` which was renamed from struct `RelativePath` * Require remote storage to operate on relative paths for its contents, thus simplifying the way to derive them in pageserver and safekeeper * Make `IndexPart` to use `String` instead of `RelativePath` for its entries, since those are just the layer names	2022-12-07 23:11:02 +02:00
Christian Schwarz	ac0c167a85	improve pidfile handling This patch centralize the logic of creating & reading pid files into the new pid_file module and improves upon / makes explicit a few race conditions that existed with the previous code. Starting Processes / Creating Pidfiles ====================================== Before this patch, we had three places that had very similar-looking match lock_file::create_lock_file { ... } blocks. After this change, they can use a straight-forward call provided by the pid_file: pid_file::claim_pid_file_for_pid() Stopping Processes / Reading Pidfiles ===================================== The new pid_file module provides a function to read a pidfile, called read_pidfile(), that returns a pub enum PidFileRead { NotExist, NotHeldByAnyProcess(PidFileGuard), LockedByOtherProcess(Pid), } If we get back NotExist, there is nothing to kill. If we get back NotHeldByAnyProcess, the pid file is stale and we must ignore its contents. If it's LockedByOtherProcess, it's either another pidfile reader or, more likely, the daemon that is still running. In this case, we can read the pid in the pidfile and kill it. There's still a small window where this is racy, but it's not a regression compared to what we have before. The NotHeldByAnyProcess is an improvement over what we had before this patch. Before, we would blindly read the pidfile contents and kill, even if no other process held the flock. If the pidfile was stale (NotHeldByAnyProcess), then that kill would either result in ESRCH or hit some other unrelated process on the system. This patch avoids the latter cacse by grabbing an exclusive flock before reading the pidfile, and returning the flock to the caller in the form of a guard object, to avoid concurrent reads / kills. It's hopefully irrelevant in practice, but it's a little robustness that we get for free here. Maintain flock on Pidfile of ETCD / any InitialPidFile::Create() ================================================================ Pageserver and safekeeper create their pidfiles themselves. But for etcd, neon_local creates the pidfile (InitialPidFile::Create()). Before this change, we would unlock the etcd pidfile as soon as `neon_local start` exits, simply because no-one else kept the FD open. During `neon_local stop`, that results in a stale pid file, aka, NotHeldByAnyProcess, and it would henceforth not trust that the PID stored in the file is still valid. With this patch, we make the etcd process inherit the pidfile FD, thereby keeping the flock held until it exits.	2022-12-07 18:24:12 +01:00
Heikki Linnakangas	a46a81b5cb	Fix updating "trace_read_requests" with /v1/tenant/config mgmt API. The new "trace_read_requests" option was missing from the parse_toml_tenant_conf function that reads the config file. Because of that, the option was ignored, which caused the test_read_trace.py test to fail. It used to work before commit `9a6c0be823`, because the TenantConfigOpt struct was constructed directly in tenant_create_handler, but now it is saved and read back from disk even for a newly created tenant. The abovementioned bug was fixed in commit `09393279c6` already, which added the missing code to parse_toml_tenant_conf() to parse the new "trace_read_requests" option. This commit fixes one more function that was missed earlier, and adds more detail to the error message if parsing the config file fails.	2022-12-07 15:03:39 +02:00
Heikki Linnakangas	b513619503	Remove obsolete 'awaits_download' field. It used to be a separate piece of state, but after `9a6c0be823` it's just an alias for the Tenant being in Attaching state. It was only used in one assertion in a test, but that check doesn't make sense anymore, so just remove it. Fixes https://github.com/neondatabase/neon/issues/2930	2022-12-07 13:13:54 +02:00
Kirill Bulatov	09393279c6	Fix tenant config parsing	2022-12-06 23:52:16 +02:00
Kliment Serafimov	8f2b3cbded	Sentry integration for storage. (#2926 ) Added basic instrumentation to integrate sentry with the proxy, pageserver, and safekeeper processes. Currently in sentry there are three projects, one for each process. Sentry url is sent to all three processes separately via cli args.	2022-12-06 18:57:54 +00:00
Christian Schwarz	4530544bb8	draw_timeline_dirs: accept paths as input	2022-12-06 18:17:48 +01:00
Dmitry Rodionov	98ff0396f8	tone down error log for successful process termination	2022-12-06 18:44:07 +03:00
Kirill Bulatov	d6bfe955c6	Add commands to unload and load the tenant in memory (#2977 ) Closes https://github.com/neondatabase/neon/issues/2537 Follow-up of https://github.com/neondatabase/neon/pull/2950 With the new model that prevents attaching without the remote storage, it has started to be even more odd to add attach-with-files functionality (in addition to the issues raised previously). Adds two separate commands: * `POST {tenant_id}/ignore` that places a mark file to skip such tenant on every start and removes it from memory * `POST {tenant_id}/schedule_load` that tries to load a tenant from local FS similar to what pageserver does now on startup, but without directory removals	2022-12-06 15:30:02 +00:00
Alexander Bayandin	61825dfb57	Update chrono to 0.4.23; use only clock feature from it	2022-12-06 15:45:58 +01:00
Kirill Bulatov	c0480facc1	Rename RelativePath to RemotePath Improve rustdocs a bit	2022-12-05 22:52:42 +02:00
Kirill Bulatov	b38473d367	Remove RelativePath conversions Function was unused, but publicly exported from the module lib, so not reported by rustc as unused	2022-12-05 22:52:42 +02:00
Kirill Bulatov	38af453553	Use async RwLock around tenants (#3009 ) A step towards more async code in our repo, to help avoid most of the odd blocking calls, that might deadlock, as mentioned in https://github.com/neondatabase/neon/issues/2975	2022-12-05 22:48:45 +02:00
Shany Pozin	79fdd3d51b	Fix #2907 : Change missing_layers property to optional in the IndexPart struct (#3005 ) Move missing_layers property to Option<HashSet<RelativePath>> This will allow the safe removal of it once the upgrade of all page servers is done with this new code	2022-12-05 13:56:04 +02:00
Kirill Bulatov	4f443c339d	Tone down retry error logs (#2999 ) Closes https://github.com/neondatabase/neon/issues/2990	2022-12-03 15:30:55 +00:00
Alexander Bayandin	788823ebe3	Fix named_arguments_used_positionally warnings (#2987 ) ``` warning: named argument `file` is not used by name --> pageserver/src/tenant/timeline.rs:1078:54 \| 1078 \| trace!("downloading image file: {}", file = path.display()); \| -- ^^^^ this named argument is referred to by position in formatting string \| \| \| this formatting argument uses named argument `file` by position \| = note: `#[warn(named_arguments_used_positionally)]` on by default help: use the named argument by name to avoid ambiguity \| 1078 \| trace!("downloading image file: {file}", file = path.display()); \| ++++ ``` Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2022-12-02 17:59:26 +00:00
bojanserafimov	fe280f70aa	Add synthetic layer map bench (#2979 )	2022-12-01 13:29:21 -05:00
bojanserafimov	b9544adcb4	Add layer map search benchmark (#2957 )	2022-11-30 13:48:07 -05:00
Heikki Linnakangas	33834c01ec	Rename Paused states to Stopping. I'm not a fan of "Paused", for two reasons: - Paused implies that the tenant/timeline with no activity on it. That's not true; the tenant/timeline can still have active tasks working on it. - Paused implies that it can be resumed later. It can not. A tenant or timeline in this state cannot be switched back to Active state anymore. A completely new Tenant or Timeline struct can be constructed for the same tenant or timeline later, e.g. if you detach and later re-attach the same tenant, but that's a different thing. Stopping describes the state better. I also considered "ShuttingDown", but Stopping is simpler as it's a single word.	2022-11-30 01:10:16 +02:00
Heikki Linnakangas	9a6c0be823	storage_sync2 The code in this change was extracted from PR #2595, i.e., Heikki’s draft PR for on-demand download. High-Level Changes - storage_sync module rewrite - Changes to Tenant Loading - Changes to Timeline States - Crash-safe & Resumable Tenant Attach There are several follow-up work items planned. Refer to the Epic issue on GitHub: https://github.com/neondatabase/neon/issues/2029 Metadata: closes https://github.com/neondatabase/neon/pull/2785 unsquashed history of this patch: archive/pr-2785-storage-sync2/pre-squash Co-authored-by: Dmitry Rodionov <dmitry@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech> =============================================================================== storage_sync module rewrite =========================== The storage_sync code is rewritten. New module name is storage_sync2, mostly to make a more reasonable git diff. The updated block comment in storage_sync2.rs describes the changes quite well, so, we will not reproduce that comment here. TL;DR: - Global sync queue and RemoteIndex are replaced with per-timeline `RemoteTimelineClient` structure that contains a queue for UploadOperations to ensure proper ordering and necessary metadata. - Before deleting local layer files, wait for ongoing UploadOps to finish (wait_completion()). - Download operations are not queued and executed immediately. Changes to Tenant Loading ========================= Initial sync part was rewritten as well and represents the other major change that serves as a foundation for on-demand downloads. Routines for attaching and loading shifted directly to Tenant struct and now are asynchronous and spawned into the background. Since this patch doesn’t introduce on-demand download of layers we fully synchronize with the remote during pageserver startup. See details in `Timeline::reconcile_with_remote` and `Timeline::download_missing`. Changes to Tenant States ======================== The “Active” state has lost its “background_jobs_running: bool” member. That variable indicated whether the GC & Compaction background loops are spawned or not. With this patch, they are now always spawned. Unit tests (#[test]) use the TenantConf::{gc_period,compaction_period} to disable their effect (`15db566`). This patch introduces a new tenant state, “Attaching”. A tenant that is being attached starts in this state and transitions to “Active” once it finishes download. The `GET /tenant` endpoints returns `TenantInfo::has_in_progress_downloads`. We derive the value for that field from the tenant state now, to remain backwards-compatible with cloud.git. We will remove that field when we switch to on-demand downloads. Changes to Timeline States ========================== The TimelineInfo::awaits_download field is now equivalent to the tenant being in Attaching state. Previously, download progress was tracked per timeline. With this change, it’s only tracked per tenant. When on-demand downloads arrive, the field will be completely obsolete. Deprecation is tracked in isuse #2930. Crash-safe & Resumable Tenant Attach ==================================== Previously, the attach operation was not persistent. I.e., when tenant attach was interrupted by a crash, the pageserver would not continue attaching after pageserver restart. In fact, the half-finished tenant directory on disk would simply be skipped by tenant_mgr because it lacked the metadata file (it’s written last). This patch introduces an “attaching” marker file inside that is present inside the tenant directory while the tenant is attaching. During pageserver startup, tenant_mgr will resume attach if that file is present. If not, it assumes that the local tenant state is consistent and tries to load the tenant. If that fails, the tenant transitions into Broken state.	2022-11-29 18:55:20 +01:00
Heikki Linnakangas	fbd5f65938	Misc cosmetic fixes in comments, messages. Most of these were extracted from PR #2785.	2022-11-29 14:10:45 +02:00
Heikki Linnakangas	1f1324ebed	Require tenant to be active when calculating tenant size. It's not clear if the calculation would work or make sense, if the tenant is only partially loaded. Let's play it safe, and require it to be Active.	2022-11-29 14:10:45 +02:00
Joonas Koivunen	f277140234	Small fixes (#2949 ) Nothing interesting in these changes. Passing through the RUST_BACKTRACE=full will hopefully save someone else panick reproduction time. Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2022-11-29 10:29:25 +02:00
MMeent	0c1195c30d	Fix #2937 (#2940 ) Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2022-11-28 15:34:07 +01:00
Heikki Linnakangas	67469339fa	When new timeline is created, don't wait for compaction. (#2931 ) When a new root timeline is created, we want to flush all the data to disk before we return success to the caller. We were using checkpoint(CheckpointConfig::Forced) for that, but that also performs compaction. With the default settings, compaction will have no work after we have imported an empty database, as the image of that is too small to require compaction. However, with very small checkpoint_distance and compaction_target_size, compaction will run, and it can take a while. PR #2785 adds new tests that use very small checkpoint_distance and compaction_target_size settings, and the test sometimes failed with "operation timed out" error in the client, when the create_timeline step took too long.	2022-11-28 11:05:20 +02:00
Heikki Linnakangas	15db566420	Allow setting gc/compaction_period to 0, to disable automatic GC/compaction Many python tests were setting the GC/compaction period to large values, to effectively disable GC / compaction. Reserve value 0 to mean "explicitly disabled". We also set them to 0 in unit tests now, although currently, unit tests don't launch the background jobs at all, so it won't have any effect. Fixes https://github.com/neondatabase/neon/issues/2917	2022-11-25 20:14:06 +02:00
Egor Suvorov	ae53dc3326	Add authentication between Safekeeper and Pageserver/Compute * Fix https://github.com/neondatabase/neon/issues/1854 * Never log Safekeeper::conninfo in walproposer as it now contains a secret token * control_panel, test_runner: generate and pass JWT tokens for Safekeeper to compute and pageserver * Compute: load JWT token for Safekepeer from the environment variable. Do not reuse the token from pageserver_connstring because it's embedded in there weirdly. * Pageserver: load JWT token for Safekeeper from the environment variable. * Rewrite docs/authentication.md	2022-11-25 04:17:42 +03:00
Egor Suvorov	1ca76776d0	pageserver: require management permissions on HTTP /status	2022-11-25 04:17:42 +03:00
Egor Suvorov	2ce5d8137d	Separate permission checks for Pageserver and Safekeeper There will be different scopes for those two, so authorization code should be different. The `check_permission` function is now not in the shared library. Its implementation is very similar to the one which will be added for Safekeeper. In fact, we may reuse the same existing root-like 'PageServerApi' scope, but I would prefer to have separate root-like scopes for services. Also, generate_management_token in tests is generate_pageserver_token now.	2022-11-25 04:17:42 +03:00
Egor Suvorov	b6989e8928	pageserver: make `wal_source_connstring: String` a 'wal_source_connconf: PgConnectionConfig`	2022-11-24 14:02:23 +03:00
Heikki Linnakangas	99d9c23df5	Gather path-related consts and functions to one place. Feels more organized this way.	2022-11-24 12:26:15 +02:00
Konstantin Knizhnik	a6e4a3c3ef	Implement corrent truncation of FSM/VM forks on arbitrary position (#2609 ) refer #2601 Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>	2022-11-23 18:46:07 +02:00
Alexey Kondratov	4bf3087aed	[pageserver] list `latest_gc_cutoff_lsn` in the OpenAPI spec (#2894 ) It seems that it's present in the API response for quite a while. It's just not listed in the spec, fix it.	2022-11-22 21:10:49 +01:00
Heikki Linnakangas	86e483f87b	Fix tenant size modeling code to include WAL at end of branch Imagine that you have a tenant with a single branch like this: ---------------==========> ^ gc horizon where: ---- is the portion of the branch that is older than retention period ==== is the portion of the branch that is newer than retention period. Before this commit, the sizing model included the logical size at the GC horizon, but not the WAL after that. In particular, that meant that on a newly created tenant with just one timeline, where the retention period covered the whole history of the timeline, i.e. gc_cutoff was 0, the calculated tenant size was always zero. We now include the WAL after the GC horizon in the size. So in the above example, the calculated tenant size would be the logical size of the database the GC horizon, plus all the WAL after it (marked with ===). This adds a new `insert_point` function to the sizing model, alongside `modify_branch`, and changes the code in size.rs to use the new function. The new function takes an absolute lsn and logical size as argument, so we no longer need to calculate the difference to the previous point. Also, the end-size is now optional, because we now need to add a point to represent the end of each branch to the model, but we don't want to or need to calculate the logical size at that point.	2022-11-22 17:11:27 +02:00
Alexander Stanovoy	a5b898a31c	Fix the order of checks in LSN (#2882 ) We should check if LSN is in the lower range because it's constant and only after wait for LSN to arrive if needed.	2022-11-22 02:28:41 +02:00
Heikki Linnakangas	6c97fc941a	Enable passing FAILPOINTS at startup. - Pass through FAILPOINTS environment variable to the pageserver in "neon_local pageserver start" command - On startup, list any failpoints that were set with FAILPOINTS to the log - Add optional "extra_env_vars" argument to the NeonPageserver.start() function in the python fixture, so that you can pass FAILPOINTS None of the tests use this functionality yet; that comes in a separate commit. closes https://github.com/neondatabase/neon/pull/2865	2022-11-21 16:24:19 +01:00
Christian Schwarz	f564dff0e3	make test_tenant_detach_smoke fail reproducibly Add failpoint that triggers the race condition. Skip test until we'll land the fix from https://github.com/neondatabase/neon/pull/2851 with https://github.com/neondatabase/neon/pull/2785	2022-11-18 17:15:34 +01:00
Christian Schwarz	d783889a1f	timeline: explicit tracking of flush loop state: NotStarted, Running, Exited This allows us to error out in the case where we request flush but the flush loop is not running. Before, we would only track whether it was started, but not when it exited. Better to use an enum with 3 states than a 2-state bool because then the error message can answer the question whether we ever started the flush loop or not.	2022-11-18 17:15:34 +01:00
Heikki Linnakangas	328ec1ce24	Print a more full error message, with stack trace, on GC failure. In a CI run, I got a test failure because of this error in the log, from the test_get_tenant_size_with_multiple_branches test: ERROR gc_loop{tenant_id=f1630516d4b526139836ced93be0c878}: Gc failed, retrying in 2s: No such file or directory (os error 2) There are known race conditions between GC and timeline deletion, which surely caused that error. But if we didn't know the cause, it would be pretty hard to debug without a stack trace.	2022-11-18 11:44:00 +02:00
Joonas Koivunen	1d105727cb	perf: simple walredo bench (#2816 ) adds a simple walredo bench to allow some comparison of the walredo throughput. Cc: #1339, #2778	2022-11-16 11:13:56 +02:00
Heikki Linnakangas	46d30bf054	Check for errors in pageserver log after each test. If there are any unexpected ERRORs or WARNs in pageserver.log after test finishes, fail the test. This requires whitelisting the errors that are expected in each test, and there's also a few common errors that are printed by most tests, which are whitelisted in the fixture itself. With this, we don't need the special abort() call in testing mode, when compaction or GC fails. Those failures will print ERRORs to the logs, which will be picked up by this new mechanisms. A bunch of errors are currently whitelisted that we probably shouldn't be emitting in the first place, but fixing those is out of scope for this commit, so I just left FIXME comments on them.	2022-11-15 18:47:28 +02:00
Heikki Linnakangas	e44e4a699b	Downgrade log message, if client terminates COPY during basebackup import It's more or less expected from pageserver's point of view. Change the error kind to ConnectionReset, so that it gets logged at INFO level instead of ERROR.	2022-11-15 18:47:28 +02:00
Heikki Linnakangas	dbe5b52494	Avoid some vector-growing overhead. I saw this in 'perf' profile of a sequential scan: > - 31.93% 0.21% compute request pageserver [.] <pageserver::walredo::PostgresRedoManager as pageserver::walredo::WalRedoManager>::request_redo > - 31.72% <pageserver::walredo::PostgresRedoManager as pageserver::walredo::WalRedoManager>::request_redo > - 31.26% pageserver::walredo::PostgresRedoManager::apply_batch_postgres > + 7.64% <std::process::ChildStdin as std::io::Write>::write > + 6.17% nix::poll::poll > + 3.58% <std::process::ChildStderr as std::io::Read>::read > + 2.96% std::sync::condvar::Condvar::notify_one > + 2.48% std::sys::unix::locks::futex::Condvar::wait > + 2.19% alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle > + 1.14% std::sys::unix::locks::futex::Mutex::lock_contended > 0.67% __rust_alloc_zeroed > 0.62% __stpcpy_ssse3 > 0.56% std::sys::unix::locks::futex::Mutex::wake Note the 'do_reserve_handle' overhead. That's caused by having to grow the buffer used to construct the WAL redo request. This commit eliminates that overhead. It's only about 2% of the overall CPU usage, but every little helps. Also reuse the temp buffer when reading records from a DeltaLayer, and call Vec::reserve to avoid growing a buffer when reading a blob across pages. I saw a reduction from 2% to 1% of CPU spent in do_reserve_and_handle in that codepath, but that's such a small change that it could be just noise. Seems like it shouldn't hurt though.	2022-11-12 18:52:25 +02:00
bojanserafimov	7fd88fab59	Trace read requests (#2762 )	2022-11-10 16:43:04 -05:00
Christian Schwarz	8654e95fae	walredo: fix zombie processes ([postgres] <defunct>) This change wraps the std::process:Child that we spawn for WAL redo into a type that ensures that we try to SIGKILL + waitpid() on it. If there is no explicit call to kill_and_wait(), the Drop implementation will spawns a task that does it in the BACKGROUND_RUNTIME. That's an ugly hack but I think it's better than doing kill+wait synchronously from Drop, since I think the general assumption in the Rust ecosystem is that Drop doesn't block. Especially since the drop sites can be _any_ place that drops the last Arc<PostgresRedoManager>, e.g., compaction or GC. The benefit of having the new type over just adding a Drop impl to PostgresRedoProcess is that we can construct it earlier than the full PostgresRedoProcess in PostgresRedoProcess::launch(). That allows us to correctly kill+wait the child if there is an error in PostgresRedoProcess::launch() after spawning it. I also took a stab at a regression test. I manually verified that it fails before the fix to walredo.rs. fixes https://github.com/neondatabase/neon/issues/2761 closes https://github.com/neondatabase/neon/pull/2776	2022-11-10 12:50:50 +01:00
Christian Schwarz	c3a470a29b	walredo process management: handle every error on the kill() and drop path If we're not calling kill() before dropping the PostgresRedoProcess, we currently leak it. That's most likely the root cause for #2761. This patch 1. adds an error log message for that case and 2. adds error handling for all errors on the kill() path. If we're a `testing` build, we panic. Otherwise, we log an error and leak the process. The error handling changes (2) are necessary to conclusively state that the root cause for #2761 is indeed (1). If we didn't have them, the root cause could be missing error handling instead. To make the log messages useful, I've added tracing::instrument attributes that log the tenant_id and PID. That helps mapping back the PID of `defunct` processes to pageserver log messages. Note that a defunct process's `/proc/$PID/` directory isn't very useful. We have left little more than its PID. Once we have validated the root cause, we'll find a fix, but that's still an ongoing discussion. refs https://github.com/neondatabase/neon/issues/2761 closes https://github.com/neondatabase/neon/pull/2769	2022-11-08 14:03:13 +01:00
Joonas Koivunen	548d472b12	fix: logical size query at before initdb_lsn (#2755 ) With more realistic selection of gc_horizon in tests there is an immediate failure with trying to query logical size with lsn < initdb_lsn. Fixes that, adds illustration gathered from clarity of explaining this tenant size calculation to more people. Cc: #2748, #2599.	2022-11-07 12:03:57 +02:00

1 2 3 4 5 ...

1029 Commits