rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-07 21:42:56 +00:00

Author	SHA1	Message	Date
Kirill Bulatov	dec58092e8	Replace Box<dyn> with impl in RemoteStorage upload (#3984 ) Replaces `Box<(dyn io::AsyncRead + Unpin + Send + Sync + 'static)>` with `impl io::AsyncRead + Unpin + Send + Sync + 'static` usages in the `RemoteStorage` interface, to make it closer to [`#![feature(async_fn_in_trait)]`](https://blog.rust-lang.org/inside-rust/2022/11/17/async-fn-in-trait-nightly.html) For `GenericRemoteStorage`, replaces `type Target = dyn RemoteStorage` with another impl with `RemoteStorage` methods inside it. We can reuse the trait, that would require importing the trait in every file where it's used and makes us farther from the unstable feature. After this PR, I've manged to create a patch with the changes: https://github.com/neondatabase/neon/compare/kb/less-dyn-storage...kb/nightly-async-trait?expand=1 Current rust implementation does not like recursive async trait calls, so `UnreliableWrapper` was removed: it contained a `GenericRemoteStorage` that implemented the `RemoteStorage` trait, and itself implemented the trait, which nightly rustc did not like and proposed to box the future. Similarly, `GenericRemoteStorage` cannot implement `RemoteStorage` for nightly rustc to work, since calls various remote storages' methods from inside. I've compiled current `main` and the nightly branch both with `time env RUSTC_WRAPPER="" cargo +nightly build --all --timings` command, and got ``` Finished dev [optimized + debuginfo] target(s) in 2m 04s env RUSTC_WRAPPER="" cargo +nightly build --all --timings 1283.19s user 50.40s system 1074% cpu 2:04.15 total for the new feature tried and Finished dev [optimized + debuginfo] target(s) in 2m 40s env RUSTC_WRAPPER="" cargo +nightly build --all --timings 1288.59s user 52.06s system 834% cpu 2:40.71 total for the old async_trait approach. ``` On my machine, the `remote_storage` lib compilation takes ~10 less time with the nightly feature (left) than the regular main (right). ![image](https://user-images.githubusercontent.com/2690773/230620797-163d8b89-dac8-4366-bcf6-cd1cdddcd22c.png) Full cargo reports are available at [timings.zip](https://github.com/neondatabase/neon/files/11179369/timings.zip)	2023-04-07 21:39:49 +03:00
Stas Kelvich	0bf70e113f	Add extra cnames to staging proxy	2023-04-07 19:18:19 +03:00
Vadim Kharitonov	31f2cdeb1e	Update Dockerfile.compute-node Co-authored-by: MMeent <matthias@neon.tech>	2023-04-07 15:26:22 +02:00
Vadim Kharitonov	979fa8b1ba	Compile timescaledb	2023-04-07 15:26:22 +02:00
Konstantin Knizhnik	bfee412701	Trigger tests for index scan implementation (#3968 ) ## Describe your changes ## Issue ticket number and link ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist	2023-04-07 14:26:21 +03:00
Dmitry Rodionov	bfeb428d1b	tests: make neon_fixtures a bit thinner by splitting out some pageserver related helpers (#3977 ) neon_fixture is quite big and messy, lets clean it up a bit.	2023-04-07 13:47:28 +03:00
Stas Kelvich	b1c2a6384a	Set non-wildcard common names in link auth proxy Old coding here ignored non-wildcard common names and passed None instead. With my recent changes I started throwing an error in that case. Old logic doesn't seem to be a great choice, so instead of passing None I actually set non-wildcard common names too. That way it is possible to avoid handling cases with None in downstream code.	2023-04-07 01:24:27 +03:00
Anastasia Lubennikova	6d01d835a8	[proxy] Report error if proxy_io_bytes_per_client metric has decreased	2023-04-06 23:14:07 +03:00
Alexey Kondratov	e42982fb1e	[compute_ctl] Empty computes and /configure API (#3963 ) This commit adds an option to start compute without spec and then pass it a valid spec via `POST /configure` API endpoint. This is a main prerequisite for maintaining the pool of compute nodes in the control-plane. For example: 1. Start compute with ```shell cargo run --bin compute_ctl -- -i no-compute \ -p http://localhost:9095 \ -D compute_pgdata \ -C "postgresql://cloud_admin@127.0.0.1:5434/postgres" \ -b ./pg_install/v15/bin/postgres ``` 2. Configure it with ```shell curl -d "{\"spec\": $(cat ./compute-spec.json)}" http://localhost:3080/configure ``` Internally, it's implemented using a `Condvar` + `Mutex`. Compute spec is moved under Mutex, as it's now could be updated in the http handler. Also `RwLock` was replaced with `Mutex` because the latter works well with `Condvar`. First part of the neondatabase/cloud#4433	2023-04-06 21:21:58 +02:00
Dmitry Rodionov	b45c92e533	tests: exclude compatibility tests by default (#3975 ) This allows to skip compatibility tests based on `CHECK_ONDISK_DATA_COMPATIBILITY` environment variable. When the variable is missing (default) compatibility tests wont be run.	2023-04-06 21:21:39 +03:00
Arthur Petukhovsky	ba4a96fdb1	Eagerly update wal_backup_lsn after each segment offload (#3976 ) Otherwise it can lag a lot, preventing WAL segments cleanup. Also max wal_backup_lsn on update, pulling it down is pointless. Should help with https://github.com/neondatabase/neon/issues/3957, but will not fix it completely.	2023-04-06 20:57:06 +03:00
Alexander Bayandin	4d64edf8a5	Nightly Benchmarks: Add free tier sized compute (#3969 ) - Add support for VMs and CU - Add free tier limited benchmark (0.25 CU) - Ensure we use 1 CU by default for pgbench workload	2023-04-06 19:18:24 +03:00
Kirill Bulatov	102746bc8f	Apply clippy rule exclusion locally instead of a global approach (#3974 )	2023-04-06 18:57:48 +03:00
Alexander Bayandin	887cee64e2	test_runner: add links to grafana for remote tests (#3961 ) Add Grafana links to allure reports to make it easier to debug perf test failures	2023-04-06 13:52:41 +01:00
Vadim Kharitonov	2ce973c72f	Allow installation of `pg_stat_statements`	2023-04-06 13:26:40 +02:00
Gleb Novikov	9db70f6232	Added disk_size and instance_type to payload (#3918 ) ## Describe your changes In https://github.com/neondatabase/cloud/issues/4354 we are making scheduling of projects based on available disk space and overcommit, so we need to know disk size and just in case instance type of the pageserver ## Issue ticket number and link https://github.com/neondatabase/cloud/issues/4354 ## Checklist before requesting a review - [x] I have performed a self-review of my code. - [ ] ~If it is a core feature, I have added thorough tests.~ - [ ] ~Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?~ - [ ] ~If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.~	2023-04-06 14:02:56 +04:00
Joonas Koivunen	b17c24fa38	fix: settle down to configured percent (#3947 ) in real env testing we noted that the disk-usage based eviction sails 1 percentage point above the configured value, which might be a source of confusion, so it might be better to get rid of that confusion now. confusion: "I configured 85% but pageserver sails at 86%". Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-04-06 12:47:21 +03:00
Alexander Bayandin	9310949b44	GitHub Autocomment: Retry on server errors (#3958 ) Retry posting/updating a comment in case of 5XX errors from GitHub API	2023-04-05 22:08:06 +03:00
Stas Kelvich	d8df5237fa	Aligne extra certificate name with default cert-manager names	2023-04-05 21:29:21 +03:00
Stas Kelvich	c3ca48c62b	Support extra domain names for proxy. Make it possible to specify directory where proxy will look up for extra certificates. Proxy will iterate through subdirs of that directory and load `key.pem` and `cert.pem` files from each subdir. Certs directory structure may look like that: certs \|--example.com \| \|--key.pem \| \|--cert.pem \|--foo.bar \|--key.pem \|--cert.pem Actual domain names are taken from certs and key, subdir names are ignored.	2023-04-05 20:06:48 +03:00
Alexander Bayandin	957acb51b5	GitHub Autocomment: Fix the link to the latest commit (#3952 )	2023-04-04 19:06:10 +03:00
Alexander Bayandin	1d23b5d1de	Comment PR with test results (#3907 ) This PR adds posting a comment with test results. Each workflow run updates the comment with new results. The layout and the information that we post can be changed to our needs, right now, it contains failed tests and test which changes status after rerun (i.e. flaky tests)	2023-04-04 12:22:47 +01:00
Alexander Bayandin	105b8bb9d3	test_runner: automatically rerun flaky tests (#3880 ) This PR adds a plugin that automatically reruns (up to 3 times) flaky tests. Internally, it uses data from `TEST_RESULT_CONNSTR` database and `pytest-rerunfailures` plugin. As the first approximation we consider the test flaky if it has failed on the main branch in the last 10 days. Flaky tests are fetched by `scripts/flaky_tests.py` script (it's possible to use it in a standalone mode to learn which tests are flaky), stored to a JSON file, and then the file is passed to the pytest plugin.	2023-04-04 12:21:54 +01:00
Kirill Bulatov	846532112c	Remove unused S3 list operation (#3936 ) In S3, pageserver only lists tenants (prefixes) on S3, no other keys. Remove the list operation from the API, since S3 impl does not seem to work normally and not used anyway,	2023-04-03 23:44:38 +03:00
Dmitry Ivanov	f85a61ceac	[proxy] Fix regression in logging For some reason, `tracing::instrument` proc_macro doesn't always print elements specified via `fields()` or even show that it's impossible (e.g. there's no Display impl). Work around this using the `?foo` notation. Before: 2023-04-03T14:48:06.017504Z INFO handle_client🤝 received SslRequest After: 2023-04-03T14:51:24.424176Z INFO handle_client{session_id=7bd07be8-3462-404e-8ccc-0a5332bf3ace}🤝 received SslRequest	2023-04-03 18:49:30 +03:00
Christian Schwarz	45bf76eb05	enable layer eviction by default in prod (#3933 ) Leave disk_usage_based_eviction above the current max usage in prod (82%ish), so that deploying this commit won't trigger disk_usage_based_eviction. As indicated in the TODO, we'll decrease the value to 80% later. Also update the staging YAMLs to use the anchor syntax for `evictions_low_residence_duration_metric_threshold` like we do in the prod YAMLs as of this patch.	2023-04-03 14:57:36 +02:00
Joonas Koivunen	a415670bc3	feat: log evictions (#3930 ) this will help log analysis with the counterpart of already logging all remote download needs and downloads. ended up with a easily regexable output in the final round.	2023-04-03 14:15:41 +03:00
Joonas Koivunen	cf5cfe6d71	fix: metric used for alerting threshold on staging (#3932 ) This should remove the too eager alerts from staging.	2023-04-03 13:26:45 +03:00
Arseny Sher	d733bc54b8	Rename ReplicationFeedback and its fields. This is the the feedback originating from pageserver, so change previous confusing names to s/ReplicationFeedback/PageserverFeedback s/ps_writelsn/last_receive_lsn s/ps_flushlsn/disk_consistent_lsn s/ps_apply_lsn/remote_consistent_lsn I haven't changed on the wire format to keep compatibility. However, understanding of new field names is added to compute, so once all computes receive this patch we can change the wire names as well. Safekeepers/pageservers are deployed roughly at the same time and it is ok to live without feedbacks during the short period, so this is not a problem there.	2023-04-03 01:52:41 +04:00
Arthur Petukhovsky	814abd9f84	Switch to safekeeper in the same AZ (#3883 ) Add a condition to switch walreceiver connection to safekeeper that is located in the same availability zone. Switch happens when commit_lsn of a candidate is not less than commit_lsn from the active connection. This condition is expected not to trigger instantly, because commit_lsn of a current connection is usually greater than commit_lsn of updates from the broker. That means that if WAL is written continuously, switch can take a lot of time, but it should happen eventually. Now protoc 3.15+ is required for building neon. Fixes https://github.com/neondatabase/neon/issues/3200	2023-04-02 11:32:27 +03:00
Alexander Bayandin	75ffe34b17	check-macos-build: fix cache key (#3926 ) We don't have `${{ matrix.build_type }}` there, so it gets resolved to an empty substring and looks like this [`v1-macOS--pg-f8a650e49b06d39ad131b860117504044b01f312-dcccd010ff851b9f72bb451f28243fa3a341f07028034bbb46ea802413b36d80`](https://github.com/neondatabase/neon/actions/runs/4575422427/jobs/8078231907#step:26:2)	2023-03-31 21:45:59 +03:00
Christian Schwarz	d2aa31f0ce	fix pageserver_evictions_with_low_residence_duration metric (#3925 ) It was doing the comparison in the wrong way.	2023-03-31 19:25:53 +03:00
Dmitry Rodionov	22f9ea5fe2	Remind people to clean up merge commit message in PR template (#3920 )	2023-03-31 16:11:34 +03:00
Joonas Koivunen	d0711d0896	build: fix git perms for deploy job (#3921 ) copy pasted from `build-neon` job. it is interesting that this is only needed by `build-neon` and `deploy`. Fixes: https://github.com/neondatabase/neon/actions/runs/4568077915/jobs/8070960178 which seems to have been going for a while.	2023-03-31 16:05:15 +03:00
Arseny Sher	271f6a6e99	Always sync-safekeepers in neon_local on compute start. Instead of checking neon.safekeepers GUC value in existing pg node data dir, just always run sync-safekeepers when safekeepers are configured. Without this change, creation of new compute didn't run it. That's ok for new timeline/branch (it doesn't return anything useful anyway, and LSN is known by pageserver), but restart of compute for existing timeline bore the risk of getting basebackup not on the latest LSN, i.e. basically broken -- it might not have prev_lsn, and even if it had, walproposer would complain anyway. fixes https://github.com/neondatabase/neon/issues/2963	2023-03-31 16:15:06 +04:00
Christian Schwarz	a64dd3ecb5	disk-usage-based layer eviction (#3809 ) This patch adds a pageserver-global background loop that evicts layers in response to a shortage of available bytes in the $repo/tenants directory's filesystem. The loop runs periodically at a configurable `period`. Each loop iteration uses `statvfs` to determine filesystem-level space usage. It compares the returned usage data against two different types of thresholds. The iteration tries to evict layers until app-internal accounting says we should be below the thresholds. We cross-check this internal accounting with the real world by making another `statvfs` at the end of the iteration. We're good if that second statvfs shows that we're _actually_ below the configured thresholds. If we're still above one or more thresholds, we emit a warning log message, leaving it to the operator to investigate further. There are two thresholds: - `max_usage_pct` is the relative available space, expressed in percent of the total filesystem space. If the actual usage is higher, the threshold is exceeded. - `min_avail_bytes` is the absolute available space in bytes. If the actual usage is lower, the threshold is exceeded. The iteration evicts layers in LRU fashion with a reservation of up to `tenant_min_resident_size` bytes of the most recent layers per tenant. The layers not part of the per-tenant reservation are evicted least-recently-used first until we're below all thresholds. The `tenant_min_resident_size` can be overridden per tenant as `min_resident_size_override` (bytes). In addition to the loop, there is also an HTTP endpoint to perform one loop iteration synchronous to the request. The endpoint takes an absolute number of bytes that the iteration needs to evict before pressure is relieved. The tests use this endpoint, which is a great simplification over setting up loopback-mounts in the tests, which would be required to test the statvfs part of the implementation. We will rely on manual testing in staging to test the statvfs parts. The HTTP endpoint is also handy in emergencies where an operator wants the pageserver to evict a given amount of space _now. Hence, it's arguments documented in openapi_spec.yml. The response type isn't documented though because we don't consider it stable. The endpoint should _not_ be used by Console but it could be used by on-call. Co-authored-by: Joonas Koivunen <joonas@neon.tech> Co-authored-by: Dmitry Rodionov <dmitry@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2023-03-31 14:47:57 +03:00
Konstantin Knizhnik	bf46237fc2	Fix prefetch for parallel bitmap scan (#3875 ) ## Describe your changes Fix prefetch for parallel bitmap scan ## Issue ticket number and link ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.	2023-03-30 22:07:19 +03:00
Lassi Pölönen	41d364a8f1	Add more detailed logging to compute_ctl's shutdown (#3915 ) Currently we don't see from the logs, if shutting down tracing takes long time or not. We do see that shutting down computes gets delayed for some reason and hits thhe grace period limit. Moving the shutdown message to slightly later, when we don't have anything else than just exit left. ## Issue ticket number and link ## Checklist before requesting a review - [x] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.	2023-03-30 22:02:39 +03:00
Christian Schwarz	fa54a57ca2	random_init_delay: remove the minimum of 10 seconds (#3914 ) Before this patch, the range from which the random delay is picked is at minimum 10 seconds. With this patch, they delay is bounded to whatever the given `period` is, and zero, if period id Duration::ZERO. Motivation for this: the disk usage eviction tests that we'll add in https://github.com/neondatabase/neon/pull/3905 need to wait for the disk usage eviction background loop to do its job. They set a period of 1s. It seems wasteful to wait 10 seconds in the tests. Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-03-30 18:38:45 +02:00
Lassi Pölönen	1c1bb904ed	Rename zenith_* labels to neon_* (#3911 ) ## Describe your changes Get rid of the legacy labeling. Aslo `neon_region_slug` with the same value as `neon_region` doesn't make much sense, so just drop it. This allows us to drop the relabeling from zenith to neon in the log collector.	2023-03-30 16:24:47 +03:00
Gleb Novikov	b26c837ed6	Fixed pageserver openapi spec properties reference (#3904 ) ## Describe your changes In [this linter run](https://github.com/neondatabase/cloud/actions/runs/4553032319/jobs/8029101300?pr=4391) accidentally found out that spec is invalid. Reference other schemas in properties should be done the way I changed. Could not find documentation specifically for schemas embedding in `components.schemas`, but it seems like the approach is inherited from json schema: https://json-schema.org/understanding-json-schema/structuring.html#ref ## Issue ticket number and link - ## Checklist before requesting a review - [x] I have performed a self-review of my code. - [ ] ~If it is a core feature, I have added thorough tests.~ - [ ] ~Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?~ - [ ] ~If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.~	2023-03-29 19:18:44 +04:00
Kirill Bulatov	ac9c7e8c4a	Replace pin! from tokio to the std one (#3903 ) With fresh rustc brought by https://github.com/neondatabase/neon/pull/3902, we can use `std::pin::pin!` macro instead of the tokio one. One place did not need the macro at all, other places were adjusted.	2023-03-29 14:14:56 +03:00
Vadim Kharitonov	f1b174dc6a	Update rust version to 1.68.2	2023-03-29 12:50:04 +04:00
Kirill Bulatov	9d714a8413	Split $CARGO_FLAGS and $CARGO_FEATURES to make e2e tests work	2023-03-29 00:08:30 +03:00
Kirill Bulatov	6c84cbbb58	Run new Rust IT test in CI	2023-03-29 00:08:30 +03:00
Kirill Bulatov	1300dc9239	Replace Python IT test with the Rust one	2023-03-29 00:08:30 +03:00
Kirill Bulatov	018c8b0e2b	Use proper tokens and delimeters when listing S3	2023-03-29 00:08:30 +03:00
Arseny Sher	b52389f228	Cleanly exit on any shutdown signal in storage_broker. neon_local sends SIGQUIT, which otherwise dumps core by default. Also, remove obsolete install_shutdown_handlers; in all binaries it was overridden by ShutdownSignals::handle later. ref https://github.com/neondatabase/neon/issues/3847	2023-03-28 22:29:42 +04:00
Heikki Linnakangas	5a123b56e5	Remove obsolete hack to rename neon-specific GUCs. I checked the console database, we don't have any of these left in production.	2023-03-28 17:57:22 +03:00
Arthur Petukhovsky	7456e5b71c	Add script to collect state from safekeepers (#3835 ) Add an ansible script to collect https://github.com/neondatabase/neon/pull/3710 state JSON from all safekeeper nodes and upload them to a postgres table.	2023-03-28 17:04:02 +03:00

... 2 3 4 5 6 ...

3156 Commits