
rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-10 06:52:55 +00:00

Author	SHA1	Message	Date
Christian Schwarz	f7ec33970a	add doc comment that outlines which tokio tasks walreceiver creates	2023-01-24 15:23:48 +01:00
Joonas Koivunen	98d0a0d242	fix(http): omit needless string allocs (#3421) Drive-by fix noticed while working on #3419.	2023-01-24 14:53:39 +02:00
Joonas Koivunen	f74080cbad	feat(http): support ?inputs_only=true for tenant_size (#3419) This makes debugging problematic cases easier in the future, as we can just request the model inputs and use them locally to reproduce the issue with the model.	2023-01-24 13:57:13 +02:00
Christian Schwarz	55c184fcd7	fix some anyhow::Context::context calls that should use with_context(format!(...)) Noticed this while combing through some production logs.	2023-01-24 12:22:33 +01:00
Kirill Bulatov	fd18692dfb	Output coloured pageserver logs for tty stdout	2023-01-24 09:58:08 +02:00
Alexey Kondratov	a4be54d21f	[compute_ctl] Stop updating roles on each compute start (#3391) I noticed that `compute_ctl` updates all roles on each start; search for rows like > - web_access:[FILTERED] -> update in the compute startup log. It happens because we had an ad hoc hack for comparing md5 hashes, which doesn't work with the scram hashes stored in `pg_authid`. It doesn't really hurt, as nothing changes, but we run >= 2 extra queries on each start, so fix it.	2023-01-23 17:46:22 +01:00
Christian Schwarz	6b6570b580	remove TimelineState::Suspended, introduce TimelineState::Loading TimelineState::Suspended was dubious to begin with. I suppose that the intention was that timelines could transition back and forth between Active and Suspended states. But practically, the code before this patch never did that. The transitions were: () ==Timeline::new==> Suspended ====> {Active,Broken,Stopping} One exception: Tenant::set_stopping() could transition timelines like so: !Broken ==Tenant::set_stopping()==> Suspended But Tenant itself cannot transition from stopping state to any other state. Thus, this patch removes TimelineState::Suspended and introduces a new state Loading. The aforementioned transitions change as follows: - () ==Timeline::new==> Suspended ====> {Active,Broken,Stopping} + () ==Timeline::new==> Loading ==*==> {Active,Broken,Stopping} - !Broken ==Tenant::set_stopping()==> Suspended + !Broken ==Tenant::set_stopping()==> Stopping Walreceiver's connection manager loop watches TimelineState to decide whether it should retry connecting or exit. This patch changes the loop to exit when it observes the transition into Stopping state. Walreceiver isn't supposed to be started until the timeline transitions into Active state. So, this patch also adds some warn!() messages in case this happens anyway.	2023-01-23 17:22:49 +01:00
Joonas Koivunen	7704caa3ac	More tenant size fixes (#3410) Small changes, but hopefully this will help with the panic detected in staging, for which we cannot get the debugging information right now (end-of-branch before branch-point).	2023-01-23 17:12:51 +02:00
Shany Pozin	a44e5eda14	Adding pageserver3 to staging (#3403)	2023-01-23 14:08:48 +01:00
Konstantin Knizhnik	5c865f46ba	Fix slru_segment_key_range function: segno was assigned to incorrect Key field (#3354)	2023-01-23 10:51:09 +02:00
bojanserafimov	a3d7ad2d52	Implement layer map using immutable BST (#2998)	2023-01-20 16:10:12 -05:00
Anastasia Lubennikova	36f048d6b0	Fix tenant size orphans (#3377) Previously, only the timelines that had passed the `gc_horizon` were processed, which failed with orphans at the tree_sort phase. An example input is in the added `test_branched_empty_timeline_size` test case. The PR changes iteration to happen through all timelines, and in addition to that, any learned branch points will be calculated as they would have been in the original implementation if the ancestor branch had been over the `gc_horizon`. This also changes how tenants where all timelines are below `gc_horizon` are handled. Previously tenant_size 0 was returned, but now they will have approximately `initdb_lsn` worth of tenant_size. The PR also adds several new tenant size tests that describe various corner cases of branching structure and `gc_horizon` setting. They are currently disabled to not consume time during CI. Co-authored-by: Joonas Koivunen <joonas@neon.tech> Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>	2023-01-20 20:21:36 +02:00
Joonas Koivunen	58fb6fe861	fix: don't stop pageserver if we fail to calculate synthetic size	2023-01-20 19:55:19 +02:00
Alexey Kondratov	20b1e26e74	[compute_ctl] Make role deletion spec processing idempotent (#3380) Previously, we were trying to re-assign owned objects of the already deleted role. This was causing a crash loop when compute was restarted with a spec that includes a delta operation for role deletion. To avoid such cases, check that the role is still present before calling `reassign_owned_objects`. Resolves neondatabase/cloud#3553	2023-01-20 15:37:24 +01:00
Christian Schwarz	8ba1699937	Revert "Use actual temporary dir for pageserver unit tests" This reverts commit `826e89b9ce`. The problem with that commit was that it deletes the TempDir while there are still EphemeralFile instances open. At first I thought this could be fixed by simply adding Handle::current().block_on(task_mgr::shutdown(None, Some(tenant_id), None)) to TenantHarness::drop, but it turned out to be insufficient. So, reverting the commit until we find a proper solution. refs https://github.com/neondatabase/neon/issues/3385	2023-01-19 20:16:56 +01:00
bojanserafimov	a9bd05760f	Improve layer map docstrings (#3382)	2023-01-19 10:29:15 -05:00
Heikki Linnakangas	e5cc2f92c4	Switch to 'tracing' for logging, restructure code to make use of spans. Refactors Compute::prepare_and_run. It's split into subroutines differently, to make it easier to attach tracing spans to the different stages. The high-level logic for waiting for Postgres to exit is moved to the caller. Replace 'env_logger' with 'tracing', and add `#[instrument]` directives to different stages of the startup process. This is a fairly mechanical change, except for the changes in 'spec.rs'. 'spec.rs' contained some complicated formatting, where parts of log messages were printed directly to stdout with `print`s. That was a bit messed up because the log normally goes to stderr, but those lines were printed to stdout. In our docker images, stderr and stdout both go to the same place so you wouldn't notice, but I don't think it was intentional. This changes the log format to the default 'tracing_subscriber::format' format. It's different from the Postgres log format, however, and because both compute_tools and Postgres print to the same log, it's now a mix of two different formats. I'm not sure how the Grafana log parsing pipeline can handle that. If it's a problem, we can build a custom formatter to change the compute_tools log format to be the same as Postgres's, like it was before this commit, or we can change the Postgres log format to match tracing_formatter's, or we can start printing compute_tool's log output to a different destination than Postgres.	2023-01-18 19:42:47 +02:00
Kirill Bulatov	90f66aa51b	Enable logs in unit tests	2023-01-18 17:43:27 +02:00
Kirill Bulatov	826e89b9ce	Use actual temporary dir for pageserver unit tests	2023-01-18 17:43:27 +02:00
Vadim Kharitonov	e59d32ac5d	Change SENTRY_ENVIRONMENT from "development" to "staging"	2023-01-18 16:34:49 +01:00
Anastasia Lubennikova	506086a3e2	Fix metric_collection_endpoint for prod. It was incorrectly set to staging url	2023-01-18 16:35:43 +02:00
Heikki Linnakangas	3b58c61b33	If an error happens while checking for core dumps, don't panic. If we panic, we skip the 30s wait in 'main', and don't give the console a chance to observe the error. Which is not nice. Spotted by @ololobus at https://github.com/neondatabase/neon/pull/3352#discussion_r1072806981	2023-01-18 11:25:47 +02:00
Kirill Bulatov	c6b56d2967	Add more io::Error context when failing to operate on a path (#3254) I have a test failure that shows ``` Caused by: 0: Failed to reconstruct a page image: 1: Directory not empty (os error 39) ``` but does not really show where exactly that happens. https://neon-github-public-dev.s3.amazonaws.com/reports/pr-3227/release/3823785365/index.html#categories/c0057473fc9ec8fb70876fd29a171ce8/7088dab272f2c7b7/?attachment=60fe6ed2add4d82d The PR aims to add more context for debugging that issue.	2023-01-17 22:07:38 +02:00
Anastasia Lubennikova	9d3992ef48	Increase metric_collection_interval for proxy on prod to not overwhelm the service	2023-01-17 15:50:19 +02:00
Anastasia Lubennikova	7624963e13	Enable metric_collection_endpoint for proxy on prod in all regions	2023-01-17 13:43:50 +02:00
Anastasia Lubennikova	63e3b815a2	Enable metric_collection_endpoint for pageserver on prod in all regions	2023-01-17 13:43:50 +02:00
Kirill Bulatov	1ebd145c29	Actualize the comment (#3362) Follow-up of https://github.com/neondatabase/neon/pull/3326#issuecomment-1384265759	2023-01-17 13:30:42 +02:00
sharnoff	f8e887830a	build: Use `curl -f` on vm-informant download (#3363) Without this, we can silently fail	2023-01-17 10:38:33 +01:00
Christian Schwarz	48dd9565ac	TaskHandle: tone down `sender is dropped while join handle is still alive` Rationale: see comments added as part of this commit. fixes https://github.com/neondatabase/neon/issues/3339	2023-01-17 09:42:22 +01:00
Anastasia Lubennikova	e067cd2947	Enable metric collection for proxy on staging	2023-01-16 21:15:42 +02:00
Christian Schwarz	58c8c1076c	download_all_remote_layers API: require client to specify max_concurrent_downloads Before this patch, we would start all layer downloads simultaneously. There is at most one download_all_remote_layers task per timeline. Hence, the specified limit is per timeline. There is still no global concurrency limit for layer downloads. We'll have to revisit that at some point and also prioritize on-demand initiated downloads over download_all_remote_layers downloads. But that's for another day.	2023-01-16 19:29:06 +01:00
Alexander Bayandin	4c6b507472	Update Postgres clients we test (#3359) Update client libraries and runtimes for Postgres libraries we test. - `pg8000` works with Neon now 🎉 - `PostgresClientKit` still doesn't support SNI	2023-01-16 17:22:17 +00:00
Stas Kelvich	431e464c1e	Consumption metering RFC	2023-01-16 19:15:59 +02:00
danieltprice	424fd0bd63	Update auth.rs (#3349) Update SNI error message. Users now specify the endpoint ID when making a connection to Neon. This should be reflected in the error message.	2023-01-16 12:32:00 -04:00
Joonas Koivunen	a8a9bee602	walredo: simple tests and bench updates (#3045) Separated from #2875. The microbenchmark has been validated to show a difference similar to that of a larger-scale OLTP benchmark.	2023-01-16 18:24:45 +02:00
Vadim Kharitonov	6ac5656be5	Enable earthdistance extension	2023-01-16 17:04:51 +01:00
Anastasia Lubennikova	3c571ecde8	Update docs/consumption_metrics.md	2023-01-16 17:24:13 +02:00
Anastasia Lubennikova	5f1bd0e8a3	Add documentation for consumption metrics	2023-01-16 17:24:13 +02:00
Anastasia Lubennikova	2cbe84b78f	Proxy metrics (#3290) Implement proxy metrics collection. Only collect metric for outbound traffic. Add proxy CLI parameters: - metric-collection-endpoint - metric-collection-interval. Add test_proxy_metric_collection test. Move shared consumption metrics code to libs/consumption_metrics. Refactor the code.	2023-01-16 15:17:28 +00:00
sharnoff	5c6a7a17cb	Add VM informant to vm-compute-node (#3324) The general idea is that the VM informant binary is added to the vm-compute-node images only. `compute_tools` then will run whatever's at `/bin/vm-informant`, if the path exists.	2023-01-16 07:05:29 -08:00
Arseny Sher	84ffdc8b4f	Don't keep FDs open on cancelled timelines in safekeepers. Since PR #3300 we don't remove timelines completely until next restart, so this prevents leakage. fixes https://github.com/neondatabase/neon/issues/3336	2023-01-16 19:03:38 +04:00
Kirill Bulatov	bce4233d3a	Rework Cargo.toml dependencies (#3322) * Use workspace variables from cargo, coming with rustc [1.64](https://github.com/rust-lang/rust/blob/master/RELEASES.md#version-1640-2022-09-22) See https://doc.rust-lang.org/nightly/cargo/reference/workspaces.html#the-package-table and https://doc.rust-lang.org/nightly/cargo/reference/workspaces.html#the-dependencies-table sections. Now, all dependencies in all non-root `Cargo.toml` files are defined as ``` clap.workspace = true ``` sometimes, when extra features are needed, as ``` bytes = {workspace = true, features = ['serde'] } ``` With the actual declarations (with shared features and version numbers/file paths/etc.) in the root Cargo.toml. Features are additive: https://doc.rust-lang.org/nightly/cargo/reference/specifying-dependencies.html#inheriting-a-dependency-from-a-workspace * Uses the mechanism above to set common, 2021, edition and license across the workspace * Mechanically bumps a few dependencies * Updates hakari format, as it suggested: ``` work/neon/neon kb/cargo-templated ❯ cargo hakari generate info: no changes detected info: new hakari format version available: 3 (current: 2) (add or update `dep-format-version = "3"` in hakari.toml, then run `cargo hakari generate && cargo hakari manage-deps`) ```	2023-01-13 18:13:34 +02:00
Vadim Kharitonov	16baa91b2b	Add more information about `cargo deny`	2023-01-13 13:24:34 +01:00
Kirill Bulatov	99808558de	Avoid duplicate timeline insert (#3326) `initialize_with_lock` inserts `Arc<Timeline>` before returning it: `c1731bc4f0/pageserver/src/tenant.rs (L222)` but `setup_timeline` function did another insert, which got removed in this PR: `c1731bc4f0/pageserver/src/tenant.rs (L486)` On top, a better comment and function renames are added.	2023-01-13 12:05:54 +00:00
Anastasia Lubennikova	c6d383e239	code cleanup	2023-01-13 11:51:28 +02:00
Anastasia Lubennikova	5e3e0fbf6f	remove unneeded Cargo.lock changes	2023-01-13 11:51:28 +02:00
Anastasia Lubennikova	26f39c03f2	review code cleanup: - handle errors in calculate_synthetic_size_worker. Don't exit the bgworker if one tenant failed. - add cached_synthetic_tenant_size to cache values calculated by the bgworker - code cleanup: remove unneeded info! messages, clean comments - handle collect_metrics_task() error. Don't exit collect_metrics worker if one task failed. - add unit test to cover case when we have multiple branches at the same lsn	2023-01-13 11:51:28 +02:00
Anastasia Lubennikova	148e020fb9	Fix logical size calculation: sort updates in topological order so that the parent timeline always precedes its children. fixes #3179	2023-01-13 11:51:28 +02:00
Anastasia Lubennikova	0675859bb0	Add background worker that periodically spawns synthetic size calculation. Add new pageserver config param calculate_synthetic_size_interval	2023-01-13 11:51:28 +02:00
Anastasia Lubennikova	ba0190e3e8	Handle errors in tenant_size_model code	2023-01-13 11:51:28 +02:00
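
The state transitions described in commit 6b6570b580 can be sketched roughly as follows. This is only an illustration built from the commit message, not the pageserver's actual code: the enum mirrors the four state names the message mentions, while the free functions `set_stopping` and `walreceiver_should_exit` are hypothetical helpers introduced here to show the `!Broken ==Tenant::set_stopping()==> Stopping` rule and the walreceiver loop's exit condition.

```rust
/// Illustrative copy of the states named in the commit message.
/// The real `TimelineState` in the pageserver may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TimelineState {
    /// Initial state after `Timeline::new` (replaces the old `Suspended`).
    Loading,
    Active,
    Broken,
    Stopping,
}

/// Hypothetical helper: models `!Broken ==Tenant::set_stopping()==> Stopping`.
/// A broken timeline stays broken; every other state moves to `Stopping`.
pub fn set_stopping(state: TimelineState) -> TimelineState {
    match state {
        TimelineState::Broken => TimelineState::Broken,
        _ => TimelineState::Stopping,
    }
}

/// Hypothetical helper: the connection manager loop exits once it
/// observes `Stopping`; this sketch treats every other state as "keep going".
pub fn walreceiver_should_exit(state: TimelineState) -> bool {
    state == TimelineState::Stopping
}
```

Under these assumptions, `set_stopping(TimelineState::Loading)` yields `Stopping`, while `set_stopping(TimelineState::Broken)` leaves the state unchanged.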


2678 Commits