rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-08 05:52:55 +00:00

Author	SHA1	Message	Date
Kirill Bulatov	fca25edae8	Fix 1.66 Clippy warnings (#3178 ) 1.66 release speeds up compile times for over 10% according to tests. Also its Clippy finds plenty of old nits in our code: * useless conversion, `foo as u8` where `foo: u8` and similar, removed `as u8` and similar * useless references and dereferenced (that were automatically adjusted by the compiler), removed various `&` and `` bool -> u8 conversion via `if/else`, changed to `u8::from` * Map `.iter()` calls where only values were used, changed to `.values()` instead Standing out lints: * `Eq` is missing in our protoc generated structs. Silenced, does not seem crucial for us. * `fn default` looks like the one from `Default` trait, so I've implemented that instead and replaced the `dummy_` method in tests with `::default()` invocation Clippy detected that ``` if retry_attempt < u32::MAX { retry_attempt += 1; } ``` is a saturating add and proposed to replace it.	2022-12-22 14:27:48 +02:00
Kirill Bulatov	9ddd1d7522	Stop all storage nodes on startup failure	2022-12-19 21:43:36 +02:00
Kirill Bulatov	49a211c98a	Add neon_local test	2022-12-19 21:43:36 +02:00
Dmitry Ivanov	61194ab2f4	Update rust-postgres everywhere I've rebased[1] Neon's fork of rust-postgres to incorporate latest upstream changes (including dependabot's fixes), so we need to advance revs here as well. [1] https://github.com/neondatabase/rust-postgres/commits/neon	2022-12-17 00:26:10 +03:00
Dmitry Ivanov	83baf49487	[proxy] Forward compute connection params to client This fixes all kinds of problems related to missing params, like broken timestamps (due to `integer_datetimes`). This solution is not ideal, but it will help. Meanwhile, I'm going to dedicate some time to improving connection machinery. Note that this does not fix problems with passing certain parameters in a reverse direction, i.e. from client to compute. This is a separate matter and will be dealt with in an upcoming PR.	2022-12-16 21:37:50 +03:00
Kirill Bulatov	02c1c351dc	Create initial timeline without remote storage (#3077 ) Removes the race during pageserver initial timeline creation that lead to partial layer uploads. This race is only reproducible in test code, we do not create initial timelines in cloud (yet, at least), but still nice to remove the non-deterministic behavior.	2022-12-13 15:42:59 +02:00
Arseny Sher	32662ff1c4	Replace etcd with storage_broker. This is the replacement itself, the binary landed earlier. See docs/storage_broker.md. ref https://github.com/neondatabase/neon/pull/2466 https://github.com/neondatabase/neon/issues/2394	2022-12-12 13:30:16 +03:00
Christian Schwarz	ac0c167a85	improve pidfile handling This patch centralize the logic of creating & reading pid files into the new pid_file module and improves upon / makes explicit a few race conditions that existed with the previous code. Starting Processes / Creating Pidfiles ====================================== Before this patch, we had three places that had very similar-looking match lock_file::create_lock_file { ... } blocks. After this change, they can use a straight-forward call provided by the pid_file: pid_file::claim_pid_file_for_pid() Stopping Processes / Reading Pidfiles ===================================== The new pid_file module provides a function to read a pidfile, called read_pidfile(), that returns a pub enum PidFileRead { NotExist, NotHeldByAnyProcess(PidFileGuard), LockedByOtherProcess(Pid), } If we get back NotExist, there is nothing to kill. If we get back NotHeldByAnyProcess, the pid file is stale and we must ignore its contents. If it's LockedByOtherProcess, it's either another pidfile reader or, more likely, the daemon that is still running. In this case, we can read the pid in the pidfile and kill it. There's still a small window where this is racy, but it's not a regression compared to what we have before. The NotHeldByAnyProcess is an improvement over what we had before this patch. Before, we would blindly read the pidfile contents and kill, even if no other process held the flock. If the pidfile was stale (NotHeldByAnyProcess), then that kill would either result in ESRCH or hit some other unrelated process on the system. This patch avoids the latter cacse by grabbing an exclusive flock before reading the pidfile, and returning the flock to the caller in the form of a guard object, to avoid concurrent reads / kills. It's hopefully irrelevant in practice, but it's a little robustness that we get for free here. Maintain flock on Pidfile of ETCD / any InitialPidFile::Create() ================================================================ Pageserver and safekeeper create their pidfiles themselves. But for etcd, neon_local creates the pidfile (InitialPidFile::Create()). Before this change, we would unlock the etcd pidfile as soon as `neon_local start` exits, simply because no-one else kept the FD open. During `neon_local stop`, that results in a stale pid file, aka, NotHeldByAnyProcess, and it would henceforth not trust that the PID stored in the file is still valid. With this patch, we make the etcd process inherit the pidfile FD, thereby keeping the flock held until it exits.	2022-12-07 18:24:12 +01:00
Heikki Linnakangas	faf1d20e6a	Don't remove PID file in neon_local, and wait after "pageserver init". (#2983 ) Our shutdown procedure for "pageserver init" was buggy. Firstly, it merely sent the process a SIGKILL, but did not wait for it to actually exit. Normally, it should exit quickly as SIGKILL cannot be caught or ignored by the target process, but it's still asynchronous and the process can still be alive when the kill(2) call returns. Secondly, "neon_local" removed the PID file after sending SIGKILL, even though the process was still running. That hid the first problem: if we didn't remove the PID file, and you start a new pageserver process while the old one is still running, you would get an error when the new process tries to lock the PID file. We've been seeing a lot of "Cannot assign requested address" failures in the CI lately. Our theory is that when we run "pageserver init" immediately followed by "pageserver start", the first process is still running and listening on the port when the second invocation starts up. This commit hopefully fixes the problem. It is generally a bad idea for the "neon_local" to remove the PID file on the child process's behalf. The correct way would be for the server process to remove the PID file, after it has fully shutdown everything else. We don't currently have a robust way to ensure that everything has truly shut down and closed, however. A simpler way is to simply never remove the PID file. It's not necessary to remove the PID file for correctness: we cannot rely on the cleanup to happen anyway, if the server process crashes for example. Because of that, we already have all the logic in place to deal with a stale PID file that belonged to a process that already exited. Let's rely on that on normal shutdown too.	2022-12-01 16:38:52 +02:00
Joonas Koivunen	f277140234	Small fixes (#2949 ) Nothing interesting in these changes. Passing through the RUST_BACKTRACE=full will hopefully save someone else panick reproduction time. Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2022-11-29 10:29:25 +02:00
Egor Suvorov	ae53dc3326	Add authentication between Safekeeper and Pageserver/Compute * Fix https://github.com/neondatabase/neon/issues/1854 * Never log Safekeeper::conninfo in walproposer as it now contains a secret token * control_panel, test_runner: generate and pass JWT tokens for Safekeeper to compute and pageserver * Compute: load JWT token for Safekepeer from the environment variable. Do not reuse the token from pageserver_connstring because it's embedded in there weirdly. * Pageserver: load JWT token for Safekeeper from the environment variable. * Rewrite docs/authentication.md	2022-11-25 04:17:42 +03:00
Egor Suvorov	46ea2a8e96	Continue #2724 : replace `Url`-based `PgConnectionConfig` with a hand-crafted struct Downsides are: * We store all components of the config separately. `Url` stores them inside a single `String` and a bunch of ints which point to different parts of the URL, which is probably more efficient. * It is now impossible to pass arbitrary connection strings to the configuration file, one has to support all components explicitly. However, we never supported anything except for `host:port` anyway. Upsides are: * This significantly restricts the space of possible connection strings, some of which may be either invalid or unsupported. E.g. Postgres' connection strings may include a bunch of parameters as query (e.g. `connect_timeout=`, `options=`). These are nether validated by the current implementation, nor passed to the postgres client library, Hence, storing separate fields expresses the intention better. * The same connection configuration may be represented as a URL in multiple ways (e.g. either `password=` in the query part or a standard URL password). Now we have a single canonical way. * Escaping is provided for `options=`. Other possibilities considered: * `newtype` with a `String` inside and some validation on creation. This is more efficient, but harder to log for two reasons: * Passwords should never end up in logs, so we have to somehow * Escaped `options=` are harder to read, especially if URL-encoded, and we use `options=` a lot.	2022-11-24 14:02:23 +03:00
Konstantin Knizhnik	1af087449a	Reduce max_replication_write_lag to 10Mb (#1793 )	2022-11-23 08:41:22 +02:00
Heikki Linnakangas	6c97fc941a	Enable passing FAILPOINTS at startup. - Pass through FAILPOINTS environment variable to the pageserver in "neon_local pageserver start" command - On startup, list any failpoints that were set with FAILPOINTS to the log - Add optional "extra_env_vars" argument to the NeonPageserver.start() function in the python fixture, so that you can pass FAILPOINTS None of the tests use this functionality yet; that comes in a separate commit. closes https://github.com/neondatabase/neon/pull/2865	2022-11-21 16:24:19 +01:00
Heikki Linnakangas	150bddb929	Clean up process start/stop handling * Poll more frequently when waiting for process start/stop. This speeds up startup and shutdown in tests. We did this already in commit `52ce1c9d53`, which reduced the interval to 100 ms, but it was inadvertently increased back to 500 ms in commit `d42700280f`. Reduce it to 100 ms again, for both start and stop operations. * Harmonize the start and stop loops, printing the dots and notices the same way in both. I considered extracting the logic to a separate retry-function that takes a closure as argument that does the polling, but as long as we only have two copies, the code duplication isn't that bad. * Remove newline after "Starting pageserver" and "Starting etcd" messages, so that the progress-indicator dots that are printed once a second are printed on the same line. Before: Starting pageserver at '127.0.0.1:64000' in '.neon' ... pageserver started, pid: 2538937 After: Starting pageserver at '127.0.0.1:64000' in '.neon'... pageserver started, pid: 2538937 The "Starting safekeeper" message already got this right. * Update example output in README.md to match	2022-11-16 19:51:37 +02:00
bojanserafimov	7fd88fab59	Trace read requests (#2762 )	2022-11-10 16:43:04 -05:00
Kirill Bulatov	d42700280f	Remove daemonize from storage components (#2677 ) Move daemonization logic into `control_plane`. Storage binaries now only crate a lockfile to avoid concurrent services running in the same directory.	2022-11-02 02:26:37 +02:00
Dmitry Ivanov	0df3467146	Refactoring: replace `utils::connstring` with `Url`-based APIs	2022-11-01 18:17:36 +03:00
Arseny Sher	b42bf9265a	Enable etcd compaction in neon_local.	2022-10-27 10:47:08 +03:00
Konstantin Knizhnik	7b6431cbd7	Disable wal_log_hints by default (#2598 ) * Disable wal_log_hints by default * Remove obsolete comment anbout wal_log_hints	2022-10-22 14:59:18 +03:00
Arseny Sher	7480a0338a	Determine safekeeper for offloading WAL without etcd election API. This API is rather pointless, as sane choice anyway requires knowledge of peers status and leaders lifetime in any case can intersect, which is fine for us -- so manual elections are straightforward. Here, we deterministically choose among the reasonably caught up safekeepers, shifting by timeline id to spread the load. A step towards custom broker https://github.com/neondatabase/neon/issues/2394	2022-10-21 15:33:27 +03:00
Anastasia Lubennikova	52e75fead9	Use anyhow::Result explicitly	2022-10-21 12:47:06 +03:00
Anastasia Lubennikova	a347d2b6ac	#2616 handle 'Unsupported pg_version' error properly	2022-10-21 12:47:06 +03:00
Kirill Bulatov	c4ee62d427	Bump clap and other minor dependencies (#2623 )	2022-10-17 12:58:40 +03:00
Kirill Bulatov	f03b7c3458	Bump regular dependencies (#2618 ) * etcd-client is not updated, since we plan to replace it with another client and the new version errors with some missing prost library error * clap had released another major update that requires changing every CLI declaration again, deserves a separate PR	2022-10-15 01:55:31 +03:00
Heikki Linnakangas	538876650a	Merge 'local' and 'remote' parts of TimelineInfo into one struct. The 'local' part was always filled in, so that was easy to merge into into the TimelineInfo itself. 'remote' only contained two fields, 'remote_consistent_lsn' and 'awaits_download'. I made 'remote_consistent_lsn' an optional field, and 'awaits_download' is now false if the timeline is not present remotely. However, I kept stub versions of the 'local' and 'remote' structs for backwards-compatibility, with a few fields that are actively used by the control plane. They just duplicate the fields from TimelineInfo now. They can be removed later, once the control plane has been updated to use the new fields.	2022-10-14 18:37:14 +03:00
Heikki Linnakangas	500239176c	Make TimelineInfo.local field mandatory. It was only None when you queried the status of a timeline with 'timeline_detail' mgmt API call, and it was still being downloaded. You can check for that status with the 'tenant_status' API call instead, checking for has_in_progress_downloads field. Anothere case was if an error happened while trying to get the current logical size, in a 'timeline_detail' request. It might make sense to tolerate such errors, and leave the fields we cannot fill in as empty, None, 0 or similar, but it doesn't make sense to me to leave the whole 'local' struct empty in tht case.	2022-10-14 18:37:14 +03:00
Arseny Sher	9fe4548e13	Reimplement explicit timeline creation on safekeepers. With the ability to pass commit_lsn. This allows to perform project WAL recovery through different (from the original) set of safekeepers (or under different ttid) by 1) moving WAL files to s3 under proper ttid; 2) explicitly creating timeline on safekeepers, setting commit_lsn to the latest point; 3) putting the lastest .parital file to the timeline directory on safekeepers, if desired. Extend test_s3_wal_replay to exersise this behaviour. Also extends timeline_status endpoint to return postgres information.	2022-10-13 21:43:10 +04:00
sharnoff	580584c8fc	Remove control_plane deps on pageserver/safekeeper (#2513 ) Creates new `pageserver_api` and `safekeeper_api` crates to serve as the shared dependencies. Should reduce both recompile times and cold compile times. Decreases the size of the optimized `neon_local` binary: 380M -> 179M. No significant changes for anything else (mostly as expected).	2022-10-04 11:14:45 -07:00
Anastasia Lubennikova	5dddeb8d88	Use non-versioned pg_distrib dir	2022-09-22 14:15:13 +03:00
Anastasia Lubennikova	03c606f7c5	Pass pg_version parameter to timeline import command. Add pg_version field to LocalTimelineInfo. Use pg_version in the export_import_between_pageservers script	2022-09-22 14:15:13 +03:00
Anastasia Lubennikova	86bf491981	Support pg 15 - Split postgres_ffi into two version specific files. - Preserve pg_version in timeline metadata. - Use pg_version in safekeeper code. Check for postgres major version mismatch. - Clean up the code to use DEFAULT_PG_VERSION constant everywhere, instead of hardcoding. - Parameterize python tests: use DEFAULT_PG_VERSION env and pg_version fixture. To run tests using a specific PostgreSQL version, pass the DEFAULT_PG_VERSION environment variable: 'DEFAULT_PG_VERSION='15' ./scripts/pytest test_runner/regress' Currently don't all tests pass, because rust code relies on the default version of PostgreSQL in a few places.	2022-09-22 14:15:13 +03:00
bojanserafimov	96e867642f	Validate tenant create options (#2450 ) Co-authored-by: Kirill Bulatov <kirill@neon.tech>	2022-09-15 18:20:23 -04:00
Kirill Bulatov	b8eb908a3d	Rename old project name references	2022-09-14 08:14:05 +03:00
Kirill Bulatov	1a8c8b04d7	Merge Repository and Tenant entities, rework tenant background jobs	2022-09-13 15:39:39 +03:00
Anastasia Lubennikova	05e263d0d3	Prepare pg 15 support (build system and submodules) (#2337 ) * Add submodule postgres-15 * Support pg_15 in pgxn/neon * Renamed zenith -> neon in Makefile * fix name of codestyle check * Refactor build system to prepare for building multiple Postgres versions. Rename "vendor/postgres" to "vendor/postgres-v14" Change Postgres build and install directory paths to be version-specific: - tmp_install/build -> pg_install/build/14 - tmp_install/* -> pg_install/14/* And Makefile targets: - "make postgres" -> "make postgres-v14" - "make postgres-headers" -> "make postgres-v14-headers" - etc. Add Makefile aliases: - "make postgres" to build "postgres-v14" and in future, "postgres-v15" - similarly for "make postgres-headers" Fix POSTGRES_DISTRIB_DIR path in pytest scripts * Make postgres version a variable in codestyle workflow * Support vendor/postgres-v15 in codestyle check workflow * Support postgres-v15 building in Makefile * fix pg version in Dockerfile.compute-node * fix kaniko path * Build neon extensions in version-specific directories * fix obsolete mentions of vendor/postgres * use vendor/postgres-v14 in Dockerfile.compute-node.legacy * Use PG_VERSION_NUM to gate dependencies in inmem_smgr.c * Use versioned ECR repositories and image names for compute-node. The image name format is compute-node-vXX, where XX is postgres major version number. For now only v14 is supported. Old format unversioned name (compute-node) is left, because cloud repo depends on it. * update vendor/postgres submodule url (zenith->neondatabase rename) * Fix postgres path in python tests after rebase * fix path in regress test * Use separate dockerfiles to build compute-node: Dockerfile.compute-node-v15 should be identical to Dockerfile.compute-node-v14 except for the version number. This is a hack, because Kaniko doesn't support build ARGs properly * bump vendor/postgres-v14 and vendor/postgres-v15 * Don't use Kaniko cache for v14 and v15 compute-node images * Build compute-node images for different versions in different jobs Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2022-09-05 18:30:54 +03:00
Heikki Linnakangas	a4e79db348	Move `neon_local` to `control_plane`. Seems a bit silly to have a separate crate just for the executable. It relies on the control plane for everything it does, and it's the only user of the control plane.	2022-09-02 16:34:33 +03:00
Heikki Linnakangas	15c5f3e6cf	Fix misc typos in comments and variable names.	2022-09-01 20:04:08 +03:00
MMeent	f99ccb5041	Extract WalProposer into the neon extension (#2217 ) Including, but not limited to: * Fixes to neon management code to support walproposer-as-an-extension * Fix issue in expected output of pg settings serialization. * Show the logs of a failed --sync-safekeepers process in CI * Add compat layer for renamed GUCs in postgres.conf * Update vendor/postgres to the latest origin/main	2022-08-18 17:12:28 +02:00
Kirill Bulatov	67e091c906	Rework `init` in pageserver CLI (#2272 ) * Do not create initial tenant and timeline (adjust Python tests for that) * Rework config handling during init, add --update-config to manage local config updates	2022-08-17 23:24:47 +03:00
bojanserafimov	e9a3499e87	Fix flaky pageserver restarts in tests (#2261 )	2022-08-17 08:17:35 -04:00
Arseny Sher	e593cbaaba	Add pageserver checkpoint_timeout option. To flush inmemory layer eventually when no new data arrives, which helps safekeepers to suspend activity (stop pushing to the broker). Default 10m should be ok.	2022-08-11 22:54:09 +03:00
Kirill Bulatov	3a9bff81db	Fix etcd typos	2022-08-08 19:04:46 +03:00
Ankur Srivastava	84d1bc06a9	refactor: replace lazy-static with once-cell (#2195 ) - Replacing all the occurrences of lazy-static with `once-cell::sync::Lazy` - fixes #1147 Signed-off-by: Ankur Srivastava <best.ankur@gmail.com>	2022-08-05 19:34:04 +02:00
Heikki Linnakangas	52ce1c9d53	Speed up test shutdown, by polling more frequently. A fair amount of the time in our python tests is spent waiting for the pageserver and safekeeper processes to shut down. It doesn't matter so much when you're running a lot of tests in parallel, but it's quite noticeable when running them sequentially. A big part of the slowness is that is that after sending the SIGTERM signal, we poll to see if the process is still running, and the polling happened at 1 s interval. Reduce it to 0.1 s.	2022-08-04 12:57:15 +03:00
Dmitry Rodionov	5f71aa09d3	support running tests against real s3 implementation without mocking	2022-08-04 11:14:05 +03:00
Heikki Linnakangas	02afa2762c	Move Tenant- and TimelineInfo structs to models.rs. They are part of the management API response structs. Let's try to concentrate everything that's part of the API in models.rs.	2022-07-29 15:02:15 +03:00
Alexey Kondratov	01f1f1c1bf	Add OpenAPI spec for safekeeper HTTP API (neondatabase/cloud#1264, #2061 ) This spec is used in the `cloud` repo to generate HTTP client.	2022-07-27 21:29:22 +03:00
Heikki Linnakangas	b4c74c0ecd	Clean up unnecessary dependencies. Just to be tidy.	2022-07-20 16:31:25 +03:00
Alexander Bayandin	7898e72990	Remove duplicated checks from LocalEnv	2022-07-04 22:35:00 +03:00

1 2 3 4 5 ...

256 Commits