
rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-04 04:30:38 +00:00

Author	SHA1	Message	Date
Heikki Linnakangas	e0c43396bf	Always compile failpoints support, add runtime config option to enable it. It's annoying that many of the tests required a special build with the "testing" feature. I think it's better to have a runtime check. It adds a few CPU instructions to where failpoints are defined, even if they are disabled, but that's a small price to pay for the convenience. Fixes issue 2531, although differently from what was discussed on that issue.	2022-12-08 13:32:01 +02:00
Sergey Melnikov	f5a735ac3b	Add proxy and broker to us-west-2 (#3027 ) Co-authored-by: Lassi Pölönen <lassi.polonen@iki.fi>	2022-12-08 12:24:24 +01:00
MMeent	0d04cd0b99	Run compaction on the buffer holding received buffers when useful (#3028 ) This cleans up unused entries and reduces the chance of prefetch buffer thrashing.	2022-12-08 09:49:43 +01:00
Konstantin Knizhnik	e1ef62f086	Print more information about context of failed walredo requests (#3003 )	2022-12-08 09:12:38 +02:00
Kirill Bulatov	b50e0793cf	Rework remote_storage interface (#2993 ) Changes: * Remove `RemoteObjectId` concept from remote_storage. Operate directly on /-separated names instead. These names are now represented by struct `RemotePath` which was renamed from struct `RelativePath` * Require remote storage to operate on relative paths for its contents, thus simplifying the way to derive them in pageserver and safekeeper * Make `IndexPart` to use `String` instead of `RelativePath` for its entries, since those are just the layer names	2022-12-07 23:11:02 +02:00
Christian Schwarz	ac0c167a85	improve pidfile handling This patch centralizes the logic of creating & reading pid files into the new pid_file module and improves upon / makes explicit a few race conditions that existed with the previous code. Starting Processes / Creating Pidfiles ====================================== Before this patch, we had three places that had very similar-looking match lock_file::create_lock_file { ... } blocks. After this change, they can use a straightforward call provided by the pid_file module: pid_file::claim_pid_file_for_pid() Stopping Processes / Reading Pidfiles ===================================== The new pid_file module provides a function to read a pidfile, called read_pidfile(), that returns a pub enum PidFileRead { NotExist, NotHeldByAnyProcess(PidFileGuard), LockedByOtherProcess(Pid), } If we get back NotExist, there is nothing to kill. If we get back NotHeldByAnyProcess, the pid file is stale and we must ignore its contents. If it's LockedByOtherProcess, it's either another pidfile reader or, more likely, the daemon that is still running. In this case, we can read the pid in the pidfile and kill it. There's still a small window where this is racy, but it's not a regression compared to what we had before. The NotHeldByAnyProcess is an improvement over what we had before this patch. Before, we would blindly read the pidfile contents and kill, even if no other process held the flock. If the pidfile was stale (NotHeldByAnyProcess), then that kill would either result in ESRCH or hit some other unrelated process on the system. This patch avoids the latter case by grabbing an exclusive flock before reading the pidfile, and returning the flock to the caller in the form of a guard object, to avoid concurrent reads / kills. It's hopefully irrelevant in practice, but it's a little robustness that we get for free here.
Maintain flock on Pidfile of ETCD / any InitialPidFile::Create() ================================================================ Pageserver and safekeeper create their pidfiles themselves. But for etcd, neon_local creates the pidfile (InitialPidFile::Create()). Before this change, we would unlock the etcd pidfile as soon as `neon_local start` exits, simply because no-one else kept the FD open. During `neon_local stop`, that results in a stale pid file, aka, NotHeldByAnyProcess, and it would henceforth not trust that the PID stored in the file is still valid. With this patch, we make the etcd process inherit the pidfile FD, thereby keeping the flock held until it exits.	2022-12-07 18:24:12 +01:00
Lassi Pölönen	6dfd7cb1d0	Neon storage broker helm value fixes (#3025 ) * We were missing one cluster in production: `prod-ap-southeast-1-epsilon` configs. * We had `metrics` enabled. This means creating `ServiceScrape` objects, but since those clusters don't have `kube-prometheus-stack` like older ones, we are missing the CRDs, so the helm deploy fails.	2022-12-07 17:15:51 +02:00
Heikki Linnakangas	a46a81b5cb	Fix updating "trace_read_requests" with /v1/tenant/config mgmt API. The new "trace_read_requests" option was missing from the parse_toml_tenant_conf function that reads the config file. Because of that, the option was ignored, which caused the test_read_trace.py test to fail. It used to work before commit `9a6c0be823`, because the TenantConfigOpt struct was constructed directly in tenant_create_handler, but now it is saved and read back from disk even for a newly created tenant. The abovementioned bug was fixed in commit `09393279c6` already, which added the missing code to parse_toml_tenant_conf() to parse the new "trace_read_requests" option. This commit fixes one more function that was missed earlier, and adds more detail to the error message if parsing the config file fails.	2022-12-07 15:03:39 +02:00
Lassi Pölönen	c74dca95fc	Helm values for old staging and one region in new staging (#2922 ) helm values for the new `storage-broker`. gRPC, over secure connection with a proper certificate, but no authentication. Uses alb ingress in the old cluster and nginx ingress for the new one. The chart is deployed and the addresses are functional, while the pipeline doesn't exist yet.	2022-12-07 14:24:07 +02:00
Heikki Linnakangas	b513619503	Remove obsolete 'awaits_download' field. It used to be a separate piece of state, but after `9a6c0be823` it's just an alias for the Tenant being in Attaching state. It was only used in one assertion in a test, but that check doesn't make sense anymore, so just remove it. Fixes https://github.com/neondatabase/neon/issues/2930	2022-12-07 13:13:54 +02:00
Shany Pozin	b447eb4d1e	Add postgres-v15 to source tree documentation (#3023 )	2022-12-07 12:56:42 +02:00
Kirill Bulatov	6a57d5bbf9	Make the request tracing test more useful	2022-12-06 23:52:16 +02:00
Kirill Bulatov	09393279c6	Fix tenant config parsing	2022-12-06 23:52:16 +02:00
Nikita Kalyanov	634d0eab68	pass availability zone to console during pageserver registration (#2991 ) this is safe because unknown fields are ignored. After the corresponding PR in control plane is merged this field is going to be required Part of https://github.com/neondatabase/cloud/issues/3131	2022-12-06 21:09:54 +02:00
Kliment Serafimov	8f2b3cbded	Sentry integration for storage. (#2926 ) Added basic instrumentation to integrate sentry with the proxy, pageserver, and safekeeper processes. Currently in sentry there are three projects, one for each process. Sentry url is sent to all three processes separately via cli args.	2022-12-06 18:57:54 +00:00
Christian Schwarz	4530544bb8	draw_timeline_dirs: accept paths as input	2022-12-06 18:17:48 +01:00
Dmitry Rodionov	98ff0396f8	tone down error log for successful process termination	2022-12-06 18:44:07 +03:00
Kirill Bulatov	d6bfe955c6	Add commands to unload and load the tenant in memory (#2977 ) Closes https://github.com/neondatabase/neon/issues/2537 Follow-up of https://github.com/neondatabase/neon/pull/2950 With the new model that prevents attaching without the remote storage, it has started to be even more odd to add attach-with-files functionality (in addition to the issues raised previously). Adds two separate commands: * `POST {tenant_id}/ignore` that places a mark file to skip such tenant on every start and removes it from memory * `POST {tenant_id}/schedule_load` that tries to load a tenant from local FS similar to what pageserver does now on startup, but without directory removals	2022-12-06 15:30:02 +00:00
danieltprice	046ba67d68	Update README.md (#3015 ) Update readme to remove reference to the invite gate.	2022-12-06 11:27:46 -04:00
Alexander Bayandin	61825dfb57	Update chrono to 0.4.23; use only clock feature from it	2022-12-06 15:45:58 +01:00
Kirill Bulatov	c0480facc1	Rename RelativePath to RemotePath Improve rustdocs a bit	2022-12-05 22:52:42 +02:00
Kirill Bulatov	b38473d367	Remove RelativePath conversions Function was unused, but publicly exported from the module lib, so not reported by rustc as unused	2022-12-05 22:52:42 +02:00
Kirill Bulatov	7a9cb75e02	Replace dynamic dispatch with static dispatch	2022-12-05 22:52:42 +02:00
Kirill Bulatov	38af453553	Use async RwLock around tenants (#3009 ) A step towards more async code in our repo, to help avoid most of the odd blocking calls, that might deadlock, as mentioned in https://github.com/neondatabase/neon/issues/2975	2022-12-05 22:48:45 +02:00
Shany Pozin	79fdd3d51b	Fix #2907 : Change missing_layers property to optional in the IndexPart struct (#3005 ) Move missing_layers property to Option<HashSet<RelativePath>> This will allow the safe removal of it once the upgrade of all page servers is done with this new code	2022-12-05 13:56:04 +02:00
Alexander Bayandin	ab073696d0	test_bulk_update: use new prefetch settings (#3007 ) Replace `seqscan_prefetch_buffers` with `effective_io_concurrency` & `maintenance_io_concurrency` in one more place (the last one!)	2022-12-05 10:56:01 +00:00
Kirill Bulatov	4f443c339d	Tone down retry error logs (#2999 ) Closes https://github.com/neondatabase/neon/issues/2990	2022-12-03 15:30:55 +00:00
Alexander Bayandin	ed27c98022	Nightly Benchmarks: use new prefetch settings (#3000 ) - Replace `seqscan_prefetch_buffers` with `effective_io_concurrency` and `maintenance_io_concurrency` for `clickbench-compare` job (see https://github.com/neondatabase/neon/pull/2876) - Get the database name at runtime (it can be `main` or `neondb` or something else)	2022-12-03 13:11:02 +00:00
Alexander Bayandin	788823ebe3	Fix named_arguments_used_positionally warnings (#2987 ) ``` warning: named argument `file` is not used by name --> pageserver/src/tenant/timeline.rs:1078:54 \| 1078 \| trace!("downloading image file: {}", file = path.display()); \| -- ^^^^ this named argument is referred to by position in formatting string \| \| \| this formatting argument uses named argument `file` by position \| = note: `#[warn(named_arguments_used_positionally)]` on by default help: use the named argument by name to avoid ambiguity \| 1078 \| trace!("downloading image file: {file}", file = path.display()); \| ++++ ``` Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2022-12-02 17:59:26 +00:00
MMeent	145e7e4b96	Prefetch cleanup: (#2876 ) - Enable `enable_seqscan_prefetch` by default - Drop use of `seqscan_prefetch_buffers` in favor of `[maintenance,effective]_io_concurrency` This includes adding some fields to the HeapScan execution node, and vacuum state. - Cleanup some conditionals in vacuumlazy.c - Clarify enable_seqscan_prefetch GUC description - Fix issues in heap SeqScan prefetching where synchronize_seqscan machinery wasn't handled properly.	2022-12-02 13:35:01 +01:00
Heikki Linnakangas	d90b52b405	Update README - Change "WAL service" to "safekeepers" in the architecture section. The safekeepers together form the WAL service, but we don't use that term much in the code. - Replace the short list of pageserver components with a link to /docs. We have more details there. - Add "Other resources" to Documentation section, with links to some blog posts and a video presentation. - Remove notice at the top about the Zenith -> Neon rename. There are still a few references to Zenith in the codebase, but not so many that we would need to call it out at the top anymore.	2022-12-02 11:47:50 +01:00
Konstantin Knizhnik	c21104465e	Fix copying relation in walloged create database in PG15 (#2986 ) refer #2904	2022-12-01 22:27:18 +02:00
bojanserafimov	fe280f70aa	Add synthetic layer map bench (#2979 )	2022-12-01 13:29:21 -05:00
Heikki Linnakangas	faf1d20e6a	Don't remove PID file in neon_local, and wait after "pageserver init". (#2983 ) Our shutdown procedure for "pageserver init" was buggy. Firstly, it merely sent the process a SIGKILL, but did not wait for it to actually exit. Normally, it should exit quickly as SIGKILL cannot be caught or ignored by the target process, but it's still asynchronous and the process can still be alive when the kill(2) call returns. Secondly, "neon_local" removed the PID file after sending SIGKILL, even though the process was still running. That hid the first problem: if we didn't remove the PID file, and you start a new pageserver process while the old one is still running, you would get an error when the new process tries to lock the PID file. We've been seeing a lot of "Cannot assign requested address" failures in the CI lately. Our theory is that when we run "pageserver init" immediately followed by "pageserver start", the first process is still running and listening on the port when the second invocation starts up. This commit hopefully fixes the problem. It is generally a bad idea for "neon_local" to remove the PID file on the child process's behalf. The correct way would be for the server process to remove the PID file, after it has fully shut down everything else. We don't currently have a robust way to ensure that everything has truly shut down and closed, however. A simpler way is to never remove the PID file. It's not necessary to remove the PID file for correctness: we cannot rely on the cleanup to happen anyway, if the server process crashes for example. Because of that, we already have all the logic in place to deal with a stale PID file that belonged to a process that already exited. Let's rely on that on normal shutdown too.	2022-12-01 16:38:52 +02:00
Konstantin Knizhnik	d9ab42013f	Resend prefetch request in case of pageserver restart (#2974 ) refer #2819 Co-authored-by: MMeent <matthias@neon.tech>	2022-12-01 12:16:15 +02:00
Bojan Serafimov	edfebad3a1	Add test that importing an empty file fails. We used to have a bug where the pageserver just got stuck if the client sent a CopyDone message before reaching end of tar stream. That showed up with an empty tar file, as one example. That was inadvertently fixed by code refactorings, but let's add a regression test for it, so that we don't accidentally re-introduce the bug later. Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2022-12-01 12:08:56 +02:00
bojanserafimov	b9544adcb4	Add layer map search benchmark (#2957 )	2022-11-30 13:48:07 -05:00
Andrés	ebb51f16e0	Re-introduce aws-sdk-rust as rusoto S3 replacement (#2841 ) Part of https://github.com/neondatabase/neon/issues/2683 Initial PR: https://github.com/neondatabase/neon/pull/2802, revert: https://github.com/neondatabase/neon/pull/2837 Co-authored-by: andres <andres.rodriguez@outlook.es>	2022-11-30 17:47:32 +02:00
Alexander Bayandin	136b029d7a	neon-project-create: fix project creation (#2954 ) Update api/v2 call to support changes from https://github.com/neondatabase/cloud/pull/2929	2022-11-30 09:19:59 +00:00
Heikki Linnakangas	33834c01ec	Rename Paused states to Stopping. I'm not a fan of "Paused", for two reasons: - Paused implies that the tenant/timeline has no activity on it. That's not true; the tenant/timeline can still have active tasks working on it. - Paused implies that it can be resumed later. It can not. A tenant or timeline in this state cannot be switched back to Active state anymore. A completely new Tenant or Timeline struct can be constructed for the same tenant or timeline later, e.g. if you detach and later re-attach the same tenant, but that's a different thing. Stopping describes the state better. I also considered "ShuttingDown", but Stopping is simpler as it's a single word.	2022-11-30 01:10:16 +02:00
Heikki Linnakangas	9a6c0be823	storage_sync2 The code in this change was extracted from PR #2595, i.e., Heikki’s draft PR for on-demand download. High-Level Changes - storage_sync module rewrite - Changes to Tenant Loading - Changes to Timeline States - Crash-safe & Resumable Tenant Attach There are several follow-up work items planned. Refer to the Epic issue on GitHub: https://github.com/neondatabase/neon/issues/2029 Metadata: closes https://github.com/neondatabase/neon/pull/2785 unsquashed history of this patch: archive/pr-2785-storage-sync2/pre-squash Co-authored-by: Dmitry Rodionov <dmitry@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech> =============================================================================== storage_sync module rewrite =========================== The storage_sync code is rewritten. New module name is storage_sync2, mostly to make a more reasonable git diff. The updated block comment in storage_sync2.rs describes the changes quite well, so, we will not reproduce that comment here. TL;DR: - Global sync queue and RemoteIndex are replaced with per-timeline `RemoteTimelineClient` structure that contains a queue for UploadOperations to ensure proper ordering and necessary metadata. - Before deleting local layer files, wait for ongoing UploadOps to finish (wait_completion()). - Download operations are not queued and executed immediately. Changes to Tenant Loading ========================= Initial sync part was rewritten as well and represents the other major change that serves as a foundation for on-demand downloads. Routines for attaching and loading shifted directly to Tenant struct and now are asynchronous and spawned into the background. Since this patch doesn’t introduce on-demand download of layers we fully synchronize with the remote during pageserver startup. See details in `Timeline::reconcile_with_remote` and `Timeline::download_missing`. 
Changes to Tenant States ======================== The “Active” state has lost its “background_jobs_running: bool” member. That variable indicated whether the GC & Compaction background loops are spawned or not. With this patch, they are now always spawned. Unit tests (#[test]) use the TenantConf::{gc_period,compaction_period} to disable their effect (`15db566`). This patch introduces a new tenant state, “Attaching”. A tenant that is being attached starts in this state and transitions to “Active” once it finishes download. The `GET /tenant` endpoints return `TenantInfo::has_in_progress_downloads`. We derive the value for that field from the tenant state now, to remain backwards-compatible with cloud.git. We will remove that field when we switch to on-demand downloads. Changes to Timeline States ========================== The TimelineInfo::awaits_download field is now equivalent to the tenant being in Attaching state. Previously, download progress was tracked per timeline. With this change, it’s only tracked per tenant. When on-demand downloads arrive, the field will be completely obsolete. Deprecation is tracked in issue #2930. Crash-safe & Resumable Tenant Attach ==================================== Previously, the attach operation was not persistent. I.e., when tenant attach was interrupted by a crash, the pageserver would not continue attaching after pageserver restart. In fact, the half-finished tenant directory on disk would simply be skipped by tenant_mgr because it lacked the metadata file (it’s written last). This patch introduces an “attaching” marker file that is present inside the tenant directory while the tenant is attaching. During pageserver startup, tenant_mgr will resume attach if that file is present. If not, it assumes that the local tenant state is consistent and tries to load the tenant. If that fails, the tenant transitions into Broken state.	2022-11-29 18:55:20 +01:00
Heikki Linnakangas	baa8d5a16a	Test that physical size is the same before and after re-attaching tenant.	2022-11-29 14:32:01 +02:00
Heikki Linnakangas	fbd5f65938	Misc cosmetic fixes in comments, messages. Most of these were extracted from PR #2785.	2022-11-29 14:10:45 +02:00
Heikki Linnakangas	1f1324ebed	Require tenant to be active when calculating tenant size. It's not clear if the calculation would work or make sense, if the tenant is only partially loaded. Let's play it safe, and require it to be Active.	2022-11-29 14:10:45 +02:00
Alexander Bayandin	fb633b16ac	neon-project-create: change default region for staging (#2951 ) Change the default region for staging from `us-east-1` to `us-east-2` for project creation. Remove REGION_ID from `neon-branch-create` since we don't use it.	2022-11-29 11:38:24 +00:00
Joonas Koivunen	f277140234	Small fixes (#2949 ) Nothing interesting in these changes. Passing through the RUST_BACKTRACE=full will hopefully save someone else panic reproduction time. Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2022-11-29 10:29:25 +02:00
Arseny Sher	52166799bd	Put .proto compilation result to $OUT_DIR/ Sometimes CI build fails with error: couldn't read storage_broker/src/../proto/storage_broker.rs: No such file or directory (os error 2) --> storage_broker/src/lib.rs:14:5 \| 14 \| include!("../proto/storage_broker.rs"); \| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The root cause is not clear, but it looks like interference with cachepot. Per cargo docs, build scripts shouldn't output to anywhere but OUT_DIR; let's follow this and see if it helps.	2022-11-28 20:27:43 +04:00
Sergey Melnikov	0a4e5f8aa3	Setup legacy scram proxy in us-east-2 (#2943 )	2022-11-28 17:21:35 +01:00
MMeent	0c1195c30d	Fix #2937 (#2940 ) Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2022-11-28 15:34:07 +01:00
Alexander Bayandin	3ba92d238e	Nightly Benchmarks: Fix default db name and clickbench-compare trigger (#2938 ) - Fix database name: `main` -> `neondb` - Fix `clickbench-compare` trigger; the job should be triggered even if `pgbench-compare` fails	2022-11-28 12:08:04 +00:00
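
The three pidfile outcomes described in commit ac0c167a85 map onto three different stop-time decisions. The following is only a minimal sketch of that mapping, not the code from the patch: `PidFileGuard` is an empty stand-in for the real flock guard, `Pid` is simplified to an `i32`, and `StopAction` and `stop_action` are hypothetical names introduced purely for illustration.

```rust
/// Hypothetical stand-in for the guard object that keeps the exclusive flock held.
#[derive(Debug)]
pub struct PidFileGuard;

/// Mirrors the enum quoted in the commit message; `Pid` is simplified to i32 here.
#[derive(Debug)]
pub enum PidFileRead {
    NotExist,
    NotHeldByAnyProcess(PidFileGuard),
    LockedByOtherProcess(i32),
}

/// Hypothetical summary of what a "stop" command should do next.
#[derive(Debug, PartialEq, Eq)]
pub enum StopAction {
    /// No pidfile: there is nothing to kill.
    NothingToKill,
    /// Nobody holds the lock: the file is stale, so its contents are ignored.
    IgnoreStaleFile,
    /// Another process holds the lock, most likely the daemon: kill that pid.
    Kill(i32),
}

/// Translate the result of reading a pidfile into a stop-time decision.
pub fn stop_action(read: PidFileRead) -> StopAction {
    match read {
        PidFileRead::NotExist => StopAction::NothingToKill,
        // The guard is dropped at the end of this arm, releasing the (stand-in) lock.
        PidFileRead::NotHeldByAnyProcess(_guard) => StopAction::IgnoreStaleFile,
        PidFileRead::LockedByOtherProcess(pid) => StopAction::Kill(pid),
    }
}
```

Modelling the stale case as its own variant is what keeps a caller from killing an unrelated process whose pid happens to match the stale file contents.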


2438 Commits
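
The `named_arguments_used_positionally` warning quoted in commit 788823ebe3 is fixed by referring to the named argument by name. A small sketch of the corrected pattern follows, using `format!` in place of the `trace!` macro so that it needs only the standard library; the function names are hypothetical.

```rust
use std::path::Path;

/// Refers to the named argument `file` by name, as the compiler's help text suggests.
pub fn download_message(path: &Path) -> String {
    format!("downloading image file: {file}", file = path.display())
}

/// Equivalent form that captures a local variable directly in the format string.
pub fn download_message_inline(path: &Path) -> String {
    let file = path.display();
    format!("downloading image file: {file}")
}
```

Both functions produce the same string; the second relies on inline format-argument capture, which is available in Rust 1.58 and later.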