rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-15 09:22:55 +00:00

Author	SHA1	Message	Date
Heikki Linnakangas	635b67508b	Split utils::http to separate crate (#10753) Avoids compiling the crate and its dependencies into binaries that don't need them. Shrinks the compute_ctl binary from about 31MB to 28MB in the release-line-debug-size-lto profile.	2025-02-11 22:06:53 +00:00
Erik Grinaker	699a213c5d	Display reqwest error source (#10004 ) ## Problem Reqwest errors don't include details about the inner source error. This means that we get opaque errors like: ``` receive body: error sending request for url (http://localhost:9898/v1/location_config) ``` Instead of the more helpful: ``` receive body: error sending request for url (http://localhost:9898/v1/location_config): operation timed out ``` Touches #9801. ## Summary of changes Include the source error for `reqwest::Error` wherever it's displayed.	2024-12-04 13:05:53 +00:00
Heikki Linnakangas	e989a5e4a2	neon_local: Use clap derive macros to parse the CLI args (#9103) This is easier to work with.	2024-09-27 22:08:46 +03:00
Arpad Müller	8eaa8ad358	Remove async_trait usages from safekeeper and neon_local (#8864 ) Removes additional async_trait usages from safekeeper and neon_local. Also removes now redundant dependencies of the `async_trait` crate. cc earlier work: #6305, #6464, #7303, #7342, #7212, #8296	2024-08-29 18:24:25 +02:00
Peter Bendel	82266a252c	Allow longer timeout for starting pageserver, safe keeper and storage controller in test cases to make test cases less flaky (#8079 ) ## Problem see https://github.com/neondatabase/neon/issues/8070 ## Summary of changes the neon_local subcommands to - start neon - start pageserver - start safekeeper - start storage controller get a new option -t=xx or --start-timeout=xx which allows to specify a longer timeout in seconds we wait for the process start. This is useful in test cases where the pageserver has to read a lot of layer data, like in pagebench test cases. In addition we exploit the new timeout option in the python test infrastructure (python fixtures) and modify the flaky testcase to increase the timeout from 10 seconds to 1 minute. Example from the test execution ```bash RUST_BACKTRACE=1 NEON_ENV_BUILDER_USE_OVERLAYFS_FOR_SNAPSHOTS=1 DEFAULT_PG_VERSION=15 BUILD_TYPE=release ./scripts/pytest test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py ... 2024-06-19 09:29:34.590 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local storage_controller start --start-timeout=60s" 2024-06-19 09:29:36.365 INFO [broker.py:34] starting storage_broker to listen incoming connections at "127.0.0.1:15001" 2024-06-19 09:29:36.365 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local pageserver start --id=1 --start-timeout=60s" 2024-06-19 09:29:36.366 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local safekeeper start 1 --start-timeout=60s" ```	2024-06-21 10:36:12 +00:00
Arseny Sher	4feb6ba29c	Make pull_timeline work with auth enabled. - Make safekeeper read SAFEKEEPER_AUTH_TOKEN env variable with JWT token to connect to other safekeepers. - Set it in neon_local when auth is enabled. - Create simple rust http client supporting it, and use it in pull_timeline implementation. - Enable auth in all pull_timeline tests. - Make sk http_client() by default generate safekeeper wide token, it makes easier enabling auth in all tests by default.	2024-06-18 15:45:39 +03:00
Alex Chi Z	9e567d9814	feat(neon_local): support listen addr for safekeeper (#7328 ) Leftover from my LFC benchmarks. Safekeepers only listen on `127.0.0.1` for `neon_local`. This pull request adds support for listening on other address. To specify a custom address, modify `.neon/config`. ``` [[safekeepers]] listen_addr = "192.168.?.?" ``` Endpoints created by neon_local still use 127.0.0.1 and I will fix them later. I didn't fix it in the same pull request because my benchmark setting does not use neon_local to create compute nodes so I don't know how to fix it yet -- maybe replacing a few `127.0.0.1`s. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-04-17 09:10:01 +03:00
Christian Schwarz	689ad72e92	fix(neon_local): leaks child process if it fails to start & pass checks (#6474) refs https://github.com/neondatabase/neon/issues/6473 Before this PR, if process_started() didn't return Ok(true) until we ran out of retries, we'd return an error but leave the process running. Try it by adding a 20s sleep to the pageserver `main()`, e.g., right before we claim the pidfile. Without this PR, output looks like so: ``` (.venv) cs@devvm-mbp:[~/src/neon-work-2]: ./target/debug/neon_local start Starting neon broker at 127.0.0.1:50051. storage_broker started, pid: 2710939 . attachment_service started, pid: 2710949 Starting pageserver node 1 at '127.0.0.1:64000' in ".neon/pageserver_1"..... pageserver has not started yet, continuing to wait..... pageserver 1 start failed: pageserver did not start in 10 seconds No process is holding the pidfile. The process must have already exited. Leave in place to avoid race conditions: ".neon/pageserver_1/pageserver.pid" No process is holding the pidfile. The process must have already exited. Leave in place to avoid race conditions: ".neon/safekeepers/sk1/safekeeper.pid" Stopping storage_broker with pid 2710939 immediately....... storage_broker has not stopped yet, continuing to wait..... neon broker stop failed: storage_broker with pid 2710939 did not stop in 10 seconds Stopping attachment_service with pid 2710949 immediately....... attachment_service has not stopped yet, continuing to wait..... attachment service stop failed: attachment_service with pid 2710949 did not stop in 10 seconds ``` and we leak the pageserver process ``` (.venv) cs@devvm-mbp:[~/src/neon-work-2]: ps aux | grep pageserver cs 2710959 0.0 0.2 2377960 47616 pts/4 Sl 14:36 0:00 /home/cs/src/neon-work-2/target/debug/pageserver -D .neon/pageserver_1 -c id=1 -c pg_distrib_dir='/home/cs/src/neon-work-2/pg_install' -c http_auth_type='Trust' -c pg_auth_type='Trust' -c listen_http_addr='127.0.0.1:9898' -c listen_pg_addr='127.0.0.1:64000' -c broker_endpoint='http://127.0.0.1:50051/' -c control_plane_api='http://127.0.0.1:1234/' -c remote_storage={local_path='../local_fs_remote_storage/pageserver'} ``` After this PR, there is no leaked process.	2024-01-25 19:20:02 +01:00
Christian Schwarz	1a9854bfb7	add a Rust client for Pageserver management API (#6127 ) Part of getpage@lsn benchmark epic: https://github.com/neondatabase/neon/issues/5771 This PR moves the control plane's spread-all-over-the-place client for the pageserver management API into a separate module within the pageserver crate. I need that client to be async in my benchmarking work, so, this PR switches to the async version of `reqwest`. That is also the right direction generally IMO. The switch to async in turn mandated converting most of the `control_plane/` code to async. Note that some of the client methods should be taking `TenantShardId` instead of `TenantId`, but, none of the callers seem to be sharding-aware. Leaving that for another time: https://github.com/neondatabase/neon/issues/6154	2023-12-15 18:33:45 +01:00
duguorong009	25a37215f3	fix: replace all `std::PathBuf`s with `camino::Utf8PathBuf` (#5352 ) Fixes #4689 by replacing all of `std::Path` , `std::PathBuf` with `camino::Utf8Path`, `camino::Utf8PathBuf` in - pageserver - safekeeper - control_plane - libs/remote_storage Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-10-04 17:52:23 +03:00
Arseny Sher	4687b2e597	Test that auth on pg/http services can be enabled separately in sks. To this end add 1) -e option to 'neon_local safekeeper start' command appending extra options to safekeeper invocation; 2) Allow multiple occurrences of the same option in safekeepers, the last value is taken. 3) Allow to specify empty string for *-auth-public-key-path opts, it disables auth for the service.	2023-08-15 19:31:20 +03:00
Arseny Sher	13adc83fc3	Allow to enable http/pg/pg tenant only auth separately in safekeeper. The same option enables auth and specifies public key, so this allows to use different public keys as well. The motivation is to 1) Allow to e.g. change pageserver key/token without replacing all compute tokens. 2) Enable auth gradually.	2023-08-15 19:31:20 +03:00
Arseny Sher	921bb86909	Use safekeeper tenant only port in all tests and actually test it. Compute now uses special safekeeper WAL service port allowing auth tokens with only tenant scope. Adds understanding of this port to neon_local and fixtures, as well as test of both ports behaviour with different tokens. ref https://github.com/neondatabase/neon/issues/4730	2023-07-19 06:03:51 +04:00
arpad-m	982fce1e72	Fix rustdoc warnings and test cargo doc in CI (#4711 ) ## Problem `cargo +nightly doc` is giving a lot of warnings: broken links, naked URLs, etc. ## Summary of changes * update the `proc-macro2` dependency so that it can compile on latest Rust nightly, see https://github.com/dtolnay/proc-macro2/pull/391 and https://github.com/dtolnay/proc-macro2/issues/398 * allow the `private_intra_doc_links` lint, as linking to something that's private is always more useful than just mentioning it without a link: if the link breaks in the future, at least there is a warning due to that. Also, one might enable [`--document-private-items`](https://doc.rust-lang.org/cargo/commands/cargo-doc.html#documentation-options) in the future and make these links work in general. * fix all the remaining warnings given by `cargo +nightly doc` * make it possible to run `cargo doc` on stable Rust by updating `opentelemetry` and associated crates to version 0.19, pulling in a fix that previously broke `cargo doc` on stable: https://github.com/open-telemetry/opentelemetry-rust/pull/904 * Add `cargo doc` to CI to ensure that it won't get broken in the future. Fixes #2557 ## Future work * Potentially, it might make sense, for development purposes, to publish the generated rustdocs somewhere, like for example [how the rust compiler does it](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_driver/index.html). I will file an issue for discussion.	2023-07-15 05:11:25 +03:00
Heikki Linnakangas	df3bae2ce3	Use `compute_ctl` to manage Postgres in tests. (#3886 ) This adds test coverage for 'compute_ctl', as it is now used by all the python tests. There are a few differences in how 'compute_ctl' is called in the tests, compared to the real web console: - In the tests, the postgresql.conf file is included as one large string in the spec file, and it is written out as it is to the data directory. I added a new field for that to the spec file. The real web console, however, sets all the necessary settings in the 'settings' field, and 'compute_ctl' creates the postgresql.conf from those settings. - In the tests, the information needed to connect to the storage, i.e. tenant_id, timeline_id, connection strings to pageserver and safekeepers, are now passed as new fields in the spec file. The real web console includes them as the GUCs in the 'settings' field. (Both of these are different from what the test control plane used to do: It used to write the GUCs directly in the postgresql.conf file). The plan is to change the control plane to use the new method, and remove the old method, but for now, support both. Some tests that were sensitive to the amount of WAL generated needed small changes, to accommodate that compute_ctl runs the background health monitor which makes a few small updates. Also some tests shut down the pageserver, and now that the background health check can run some queries while the pageserver is down, that can produce a few extra errors in the logs, which needed to be allowlisted. Other changes: - remove obsolete comments about PostgresNode; - create standby.signal file for Static compute node; - log output of `compute_ctl` and `postgres` is merged into `endpoints/compute.log`. --------- Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>	2023-06-06 14:59:36 +01:00
Arseny Sher	278d0f117d	Rename neon_local sk logs s/safekeeper 1.log/safekeeper-1.log. I don't like spaces in file names.	2023-03-28 14:28:56 +04:00
Heikki Linnakangas	1da963b2f9	Remove some unused code in control plane.	2023-03-20 16:28:01 +02:00
Arthur Petukhovsky	b067378d0d	Measure cross-AZ traffic in safekeepers (#3806 ) Create `safekeeper_pg_io_bytes_total` metric to track total amount of bytes written/read in a postgres connections to safekeepers. This metric has the following labels: - `client_az` – availability zone of the connection initiator, or `"unknown"` - `sk_az` – availability zone of the safekeeper, or `"unknown"` - `app_name` – `application_name` of the postgres client - `dir` – data direction, either `"read"` or `"write"` - `same_az` – `"true"`, `"false"` or `"unknown"`. Can be derived from `client_az` and `sk_az`, exists purely for convenience. This is implemented by passing availability zone in the connection string, like this: `-c tenant_id=AAA timeline_id=BBB availability-zone=AZ-1`. Update ansible deployment scripts to add availability_zone argument to safekeeper and pageserver in systemd service files.	2023-03-16 17:24:01 +03:00
Arseny Sher	32662ff1c4	Replace etcd with storage_broker. This is the replacement itself, the binary landed earlier. See docs/storage_broker.md. ref https://github.com/neondatabase/neon/pull/2466 https://github.com/neondatabase/neon/issues/2394	2022-12-12 13:30:16 +03:00
Egor Suvorov	ae53dc3326	Add authentication between Safekeeper and Pageserver/Compute * Fix https://github.com/neondatabase/neon/issues/1854 * Never log Safekeeper::conninfo in walproposer as it now contains a secret token * control_panel, test_runner: generate and pass JWT tokens for Safekeeper to compute and pageserver * Compute: load JWT token for Safekeeper from the environment variable. Do not reuse the token from pageserver_connstring because it's embedded in there weirdly. * Pageserver: load JWT token for Safekeeper from the environment variable. * Rewrite docs/authentication.md	2022-11-25 04:17:42 +03:00
Egor Suvorov	46ea2a8e96	Continue #2724: replace `Url`-based `PgConnectionConfig` with a hand-crafted struct Downsides are: * We store all components of the config separately. `Url` stores them inside a single `String` and a bunch of ints which point to different parts of the URL, which is probably more efficient. * It is now impossible to pass arbitrary connection strings to the configuration file, one has to support all components explicitly. However, we never supported anything except for `host:port` anyway. Upsides are: * This significantly restricts the space of possible connection strings, some of which may be either invalid or unsupported. E.g. Postgres' connection strings may include a bunch of parameters as query (e.g. `connect_timeout=`, `options=`). These are neither validated by the current implementation, nor passed to the postgres client library. Hence, storing separate fields expresses the intention better. * The same connection configuration may be represented as a URL in multiple ways (e.g. either `password=` in the query part or a standard URL password). Now we have a single canonical way. * Escaping is provided for `options=`. Other possibilities considered: * `newtype` with a `String` inside and some validation on creation. This is more efficient, but harder to log for two reasons: * Passwords should never end up in logs, so we have to somehow * Escaped `options=` are harder to read, especially if URL-encoded, and we use `options=` a lot.	2022-11-24 14:02:23 +03:00
Kirill Bulatov	d42700280f	Remove daemonize from storage components (#2677) Move daemonization logic into `control_plane`. Storage binaries now only create a lockfile to avoid concurrent services running in the same directory.	2022-11-02 02:26:37 +02:00
Dmitry Ivanov	0df3467146	Refactoring: replace `utils::connstring` with `Url`-based APIs	2022-11-01 18:17:36 +03:00
Arseny Sher	7480a0338a	Determine safekeeper for offloading WAL without etcd election API. This API is rather pointless, as sane choice anyway requires knowledge of peers status and leaders lifetime in any case can intersect, which is fine for us -- so manual elections are straightforward. Here, we deterministically choose among the reasonably caught up safekeepers, shifting by timeline id to spread the load. A step towards custom broker https://github.com/neondatabase/neon/issues/2394	2022-10-21 15:33:27 +03:00
Arseny Sher	9fe4548e13	Reimplement explicit timeline creation on safekeepers. With the ability to pass commit_lsn. This allows to perform project WAL recovery through different (from the original) set of safekeepers (or under different ttid) by 1) moving WAL files to s3 under proper ttid; 2) explicitly creating timeline on safekeepers, setting commit_lsn to the latest point; 3) putting the latest .partial file to the timeline directory on safekeepers, if desired. Extend test_s3_wal_replay to exercise this behaviour. Also extends timeline_status endpoint to return postgres information.	2022-10-13 21:43:10 +04:00
sharnoff	580584c8fc	Remove control_plane deps on pageserver/safekeeper (#2513) Creates new `pageserver_api` and `safekeeper_api` crates to serve as the shared dependencies. Should reduce both recompile times and cold compile times. Decreases the size of the optimized `neon_local` binary: 380M -> 179M. No significant changes for anything else (mostly as expected).	2022-10-04 11:14:45 -07:00
Kirill Bulatov	b8eb908a3d	Rename old project name references	2022-09-14 08:14:05 +03:00
Heikki Linnakangas	15c5f3e6cf	Fix misc typos in comments and variable names.	2022-09-01 20:04:08 +03:00
Kirill Bulatov	67e091c906	Rework `init` in pageserver CLI (#2272) * Do not create initial tenant and timeline (adjust Python tests for that) * Rework config handling during init, add --update-config to manage local config updates	2022-08-17 23:24:47 +03:00
bojanserafimov	e9a3499e87	Fix flaky pageserver restarts in tests (#2261)	2022-08-17 08:17:35 -04:00
Heikki Linnakangas	52ce1c9d53	Speed up test shutdown, by polling more frequently. A fair amount of the time in our python tests is spent waiting for the pageserver and safekeeper processes to shut down. It doesn't matter so much when you're running a lot of tests in parallel, but it's quite noticeable when running them sequentially. A big part of the slowness is that after sending the SIGTERM signal, we poll to see if the process is still running, and the polling happened at 1 s interval. Reduce it to 0.1 s.	2022-08-04 12:57:15 +03:00
Alexey Kondratov	01f1f1c1bf	Add OpenAPI spec for safekeeper HTTP API (neondatabase/cloud#1264, #2061 ) This spec is used in the `cloud` repo to generate HTTP client.	2022-07-27 21:29:22 +03:00
Egor Suvorov	f7b878611a	Implement JWT authentication in Safekeeper HTTP API (#1753 ) * `control_plane` crate (used by `neon_local`) now parses an `auth_enabled` bool for each Safekeeper * If auth is enabled, a Safekeeper is passed a path to a public key via a new command line argument * Added TODO comments to other places needing auth	2022-06-09 17:14:46 +02:00
Arseny Sher	0e1bd57c53	Add WAL offloading to s3 on safekeepers. Separate task is launched for each timeline and stopped when timeline doesn't need offloading. Decision who offloads is done through etcd leader election; currently there is no pre condition for participating, that's a TODO. neon_local and tests infrastructure for remote storage in safekeepers added, along with the test itself. ref #1009 Co-authored-by: Anton Shyrabokau <ahtoxa@Antons-MacBook-Pro.local>	2022-05-27 06:19:23 +04:00
chaitanya sharma	c584d90bb9	initial commit, renamed znodeid to nodeid.	2022-05-25 20:11:26 +03:00
Heikki Linnakangas	9ccbb8d331	Make "neon_local stop" less verbose. I got annoyed by all the noise in CI test output. Before: $ ./target/release/neon_local stop Stop pageserver gracefully Pageserver still receives connections Pageserver stopped receiving connections Pageserver status is: Reqwest error: error sending request for url (http://127.0.0.1:9898/v1/status): error trying to connect: tcp connect error: Connection refused (os error 111) initializing for sk 1 for 7676 Stop safekeeper gracefully Safekeeper still receives connections Safekeeper stopped receiving connections Safekeeper status is: Reqwest error: error sending request for url (http://127.0.0.1:7676/v1/status): error trying to connect: tcp connect error: Connection refused (os error 111) After: $ ./target/release/neon_local stop Stopping pageserver gracefully...done! Stopping safekeeper 1 gracefully...done! Also removes the spurious "initializing for sk 1 for 7676" message from "neon_local start"	2022-05-17 10:31:13 +03:00
Kirill Bulatov	f2881bbd8a	Start and stop single etcd and mock s3 servers globally in python tests	2022-05-17 01:17:44 +03:00
Kirill Bulatov	a884f4cf6b	Add etcd to neon_local	2022-05-17 01:17:44 +03:00
Kirill Bulatov	9a0fed0880	Enable at least 1 safekeeper in every test	2022-05-17 01:17:44 +03:00
Kirill Bulatov	33cac863d7	Test simple.conf and handle broker_endpoints better	2022-05-16 12:07:35 +03:00
Kirill Bulatov	d4e155aaa3	Librarify common etcd timeline logic	2022-05-06 22:32:57 +03:00
Kirill Bulatov	81cad6277a	Move and library crates into a dedicated directory and rename them	2022-04-21 13:30:33 +03:00
Kirill Bulatov	52e0816fa5	wal_acceptor -> safekeeper	2022-04-18 12:52:31 +03:00
Kirill Bulatov	81417788c8	walkeeper -> safekeeper	2022-04-18 12:52:31 +03:00
Arseny Sher	ec3bc74165	Add safekeeper information exchange through etcd. Safekeepers now publish to and pull from etcd per-timeline data. Immediate goal is WAL truncation, for which every safekeeper must know remote_consistent_lsn; the next would be callmemaybe replacement. Adds corresponding '--broker' argument to safekeeper and ability to run etcd in tests. Adds test checking remote_consistent_lsn is indeed communicated.	2022-03-29 18:16:49 +04:00
Arseny Sher	f86cf93435	Refactor timeline creation on safekeepers, allowing storing peer ids. Have separate routine and http endpoint to create timeline on safekeepers. It is not used yet, i.e. timeline is still created implicitly, but we'll change that once infrastructure for learning which tlis are assigned to which safekeepers will be ready, preventing accidental creation by compute. Changes format of safekeeper control file, allowing to store set of peers. Knowing peers provides a part of foundation for peer recovery (calculating min horizons like truncate_lsn for WAL truncation and commit_lsn for sync-safekeepers replacement) and proper membership change; similarly, we don't yet use it for now. Employing cf file version bump, extracts tenant_id and timeline_id to top level where it is more suitable. Also adds a bunch of LSNs there and rename truncate_lsn to more specific peer_horizon_lsn.	2022-03-06 08:06:38 +03:00
Dmitry Rodionov	1d90b1b205	add node id to pageserver (#1310) * Add --id argument to safekeeper setting its unique u64 id. In preparation for storage node messaging. IDs are supposed to be monotonically assigned by the console. In tests it is issued by ZenithEnv; at the zenith cli level and fixtures, string name is completely replaced by integer id. Example TOML configs are adjusted accordingly. Sequential ids are chosen over Zid mainly because they are compact and easy to type/remember. * add node id to pageserver This adds node id parameter to pageserver configuration. Also I use a simple builder to construct pageserver config struct to avoid setting node id to some temporary invalid value. Some of the changes in test fixtures are needed to split init and start operations for environment. Co-authored-by: Arseny Sher <sher-ars@yandex.ru>	2022-03-04 01:10:42 +03:00
Dhammika Pathirana	1e8ca497e0	Fix safekeeper loopback addr (#1247) Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>	2022-02-10 09:23:53 +03:00
Kirill Bulatov	8ab4c8a050	Code review fixes	2022-01-11 15:44:23 +02:00
Kirill Bulatov	7c4a653230	Propagate Zenith CLI's RUST_LOG env var to subprocesses	2022-01-11 15:44:23 +02:00
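
Commit 699a213c5d (#10004) above describes appending the inner source error wherever a `reqwest::Error` is displayed. The general idea can be sketched with only the standard library by walking `std::error::Error::source()`. This is not the code from that commit: `Inner`, `Outer`, and `format_with_sources` are hypothetical names made up for illustration, and the `": "` separator is an assumption chosen to mirror the example message in the commit text.

```rust
use std::error::Error;
use std::fmt;

// Hypothetical low-level failure, standing in for something like a timeout.
#[derive(Debug)]
struct Inner;

impl fmt::Display for Inner {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "operation timed out")
    }
}

impl Error for Inner {}

// Hypothetical outer error whose Display output omits its source,
// which is the situation the commit message describes.
#[derive(Debug)]
struct Outer {
    source: Inner,
}

impl fmt::Display for Outer {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "error sending request")
    }
}

impl Error for Outer {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}

/// Render an error followed by every error in its `source()` chain,
/// separated by ": " (hypothetical helper, not a neon or reqwest API).
fn format_with_sources(err: &dyn Error) -> String {
    let mut out = err.to_string();
    let mut current = err.source();
    while let Some(cause) = current {
        out.push_str(": ");
        out.push_str(&cause.to_string());
        current = cause.source();
    }
    out
}
```

Calling `format_with_sources(&Outer { source: Inner })` yields the outer message with the inner cause appended, whereas `to_string()` on the same value gives only the outer message.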

54 Commits