rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-08 05:52:55 +00:00

Author	SHA1	Message	Date
Joonas Koivunen	138bc028ed	fix: quick and dirty panic avoidance on drop path (#4128 ) Sentry caught a panic on load testing server related to metric removals: https://neondatabase.sentry.io/issues/4142396994 Turn the `expect` into logging, but also add logging for each removal, so we could identify in which cases we do double-remove. The double-removal (or never adding) cause is not obvious or expected. Original added in #3837.	2023-05-01 11:54:09 +03:00
Stas Kelvich	d53f81b449	Add one more pageserver to staging	2023-04-30 22:39:34 +03:00
Joonas Koivunen	6f472df0d0	fix: restore not logging ignored io errors as errors (#4120 ) the fix is rather indirect due to the accidental applying of too much `anyhow`: if handle_pagerequests returns a `QueryError` it will now be bubbled up as-is `QueryError`. `QueryError` allows the inner `std::io::Error` to be inspected and thus we can filter certain error kinds which are perfectly normal without a huge log message. for a very long time (`b2f5102`) the errors were converted to `anyhow` by mistake which made this difficult or impossible, even though from the types it would appear that we propagate wrapped `std::io::Error`s and can filter them. Fixes #4113, most likely filters some other errors as well.	2023-04-30 14:34:55 +03:00
Rahul Patil	21eb944b5e	Staging: Add safekeeper nodes [3-8] to eu-west-1 (#4123 )	2023-04-29 15:25:57 +03:00
Arthur Petukhovsky	95244912c5	Override sharded-slab to increase MAX_THREADS (#4122 ) Add patch directive to Cargo.toml to use patched version of sharded-slab: `98d16753ab` Patch changes the MAX_THREADS limit from 4096 to 32768. This is a temporary workaround for using tracing from many threads in safekeepers code, until async safekeepers patch is merged to the main. Note that patch can affect other rust services, not only the safekeeper binary.	2023-04-29 13:31:04 +03:00
Shany Pozin	2617e70008	Add 4 new Pageservers for retool launch (#4115 ) ## Describe your changes Adding 4 new pageservers to us-west TF apply output: module.pageserver-us-west-2.aws_instance.this["7"]: Creation complete after 21s [id=i-02eec9b40617db5bc] module.pageserver-us-west-2.aws_instance.this["5"]: Creation complete after 21s [id=i-00ca6417c7bf96820] module.pageserver-us-west-2.aws_instance.this["4"]: Creation complete after 21s [id=i-013263dd1c239adcc] module.pageserver-us-west-2.aws_instance.this["6"]: Creation complete after 22s [id=i-01cdf7d2bc1433b6a]	2023-04-29 11:42:52 +02:00
Arthur Petukhovsky	8543485e92	Pull clone timeline from peer safekeepers (#4089 ) Add HTTP endpoint to initialize safekeeper timeline from peer safekeepers. This is useful for initializing new safekeeper to replace failed safekeeper. Not fully "correct" in all cases, but should work in most. This code is not suitable for production workloads but can be tested on staging to get started. New endpoint is separated from usual cases and should not affect anything if no one explicitly uses a new endpoint. We can rollback this commit in case of issues.	2023-04-28 14:20:46 +00:00
Joonas Koivunen	ec53c5ca2e	revert: "Add check for duplicates of generated image layers" (#4104 ) This reverts commit `732acc5`. Reverted PR: #3869 As noted in PR #4094, we do in fact try to insert duplicates to the layer map, if L0->L1 compaction is interrupted. We do not have a proper fix for that right now, and we are in a hurry to make a release to production, so revert the changes related to this to the state that we have in production currently. We know that we have a bug here, but better to live with the bug that we've had in production for a long time, than rush a fix to production without testing it in staging first. Cc: #4094, #4088	2023-04-28 17:20:18 +03:00
Stas Kelvich	94d612195a	bump rust-postgres version, after merging PR in rust-postgres	2023-04-28 17:15:43 +03:00
Stas Kelvich	b1329db495	fix sigterm handling	2023-04-28 17:15:43 +03:00
Stas Kelvich	5bb971d64e	fix more python tests	2023-04-28 17:15:43 +03:00
Stas Kelvich	0364f77b9a	fix python styling	2023-04-28 17:15:43 +03:00
Stas Kelvich	4ac6a9f089	add backward compatibility to proxy	2023-04-28 17:15:43 +03:00
Stas Kelvich	9486d76b2a	Add tests for link auth to compute connection	2023-04-28 17:15:43 +03:00
Stas Kelvich	040f736909	remove changes in main proxy that are now not needed	2023-04-28 17:15:43 +03:00
Stas Kelvich	645e4f6ab9	use TLS in link proxy	2023-04-28 17:15:43 +03:00
Heikki Linnakangas	e947cc119b	Add a small test case for pg_sni_router	2023-04-28 17:15:43 +03:00
Heikki Linnakangas	53e5d18da5	Start passthrough earlier As soon as we have received the SSLRequest packet, and have figured out the hostname to connect to from the SNI, we can start passing through data. We don't need to parse the StartupPacket that the client will send next.	2023-04-28 17:15:43 +03:00
Heikki Linnakangas	3813c703c9	Add an option for destination port. Makes it easier to test locally.	2023-04-28 17:15:43 +03:00
Heikki Linnakangas	b15204fa8c	Fix --help, and required args	2023-04-28 17:15:43 +03:00
Alexey Kondratov	81c75586ab	Take port from SNI, formatting, make clippy happy	2023-04-28 17:15:43 +03:00
Anton Chaporgin	556fb1642a	fixed the way hostname is parsed	2023-04-28 17:15:43 +03:00
Stas Kelvich	23aca81943	Add SNI-based proxy router In order not to create NodePorts for each compute, we can set up services that accept connections on wildcard domains and then use information from the domain name to route the connection to some internal service. Ready-made solutions exist for HTTPS and TLS connections, but the PostgreSQL protocol uses opportunistic TLS and we haven't found any ready-made solution for it. This patch introduces `pg_sni_router` which routes connections to `aaa--bbb--123.external.domain` to `aaa.bbb.123.internal.domain`. In the long run we can avoid console -> compute psql communications, but for now this router seems to be the easier way forward.	2023-04-28 17:15:43 +03:00
Arseny Sher	42798e6adc	Increase connection_timeout to PG in find end of WAL test. And log postgres to stdout. Probably fixes https://github.com/neondatabase/neon/issues/3778	2023-04-28 16:17:23 +04:00
Arthur Petukhovsky	b03143dfc8	Use serde_as DisplayFromStr everywhere (#4103 ) We used `display_serialize` previously, but it works only for Serialize. `DisplayFromStr` does the same, but also works for Deserialize.	2023-04-28 13:55:07 +03:00
Arseny Sher	fdacfaabfd	Move PageserverFeedback to utils. It allows to replace u64 with proper Lsn and pretty print PageserverFeedback with serde(_json). Now walsenders on safekeepers queried with debug_dump look like "walsenders": [ { "ttid": "fafe0cf39a99c608c872706149de9d2a/b4fb3be6f576935e7f0fcb84bdb909a1", "addr": "127.0.0.1:48774", "conn_id": 3, "appname": "pageserver", "feedback": { "Pageserver": { "current_timeline_size": 32096256, "last_received_lsn": "0/2415298", "disk_consistent_lsn": "0/1696628", "remote_consistent_lsn": "0/0", "replytime": "2023-04-12T13:54:53.958856+00:00" } } } ],	2023-04-28 06:22:13 +04:00
Arseny Sher	b2a3981ead	Move tracking of walsenders out of Timeline. Refactors walsenders out of timeline.rs, to make it less convoluted, into a separate WalSenders with its own lock, but otherwise having the same structure. Tracking of in-memory remote_consistent_lsn is also moved there as it is mainly received from pageserver. State of walsender (feedback) is also restructured to be cleaner; now it is either PageserverFeedback or StandbyFeedback(StandbyReply, HotStandbyFeedback), but not both.	2023-04-28 06:22:13 +04:00
Joonas Koivunen	fe0b616299	feat(page_service): read timeouts (#4093 ) Introduce read timeouts to our `page_service` connections. Without read timeouts, we essentially leak connections. This is a port of #3995. Split the refactorings to the other PR: #4097. Fixes #4028.	2023-04-27 17:55:35 +00:00
Alexander Bayandin	c4e1cafb63	scripts/flaky_tests.py: handle connection error (#4096 ) - Increase `connect_timeout` to 30s, which should be enough for most of the cases - If the script cannot connect to the DB (or any other `psycopg2.OperationalError` occurs), do not fail the script, log the error and proceed. Problems with fetching flaky tests shouldn't block the PR	2023-04-27 17:08:00 +01:00
Joonas Koivunen	fdf5e4db5e	refactor: Cleanup page service (#4097 ) Refactoring part of #4093. Numerous `Send + Sync` bounds were a distraction and were not needed at all. The proper `Bytes` usage and one `"error_message".to_string()` are just drive-by fixes. Not using the `PostgresBackendTCP` allows us to start setting read timeouts (and more). `PostgresBackendTCP` is still used from proxy, so it cannot be removed.	2023-04-27 18:51:57 +03:00
Heikki Linnakangas	d1e86d65dc	Run rustfmt to fix whitespace. Commit `e6ec2400fc` introduced some trivial whitespace issues.	2023-04-27 18:45:22 +03:00
Arseny Sher	f5b4697c90	Log session_id when proxy per client task errors out.	2023-04-27 19:08:22 +04:00
Christian Schwarz	3be81dd36b	fix `clippy --release` failure introduced in #4030 (#4095 ) PR `build: run clippy for powerset of features (#4077)` brought us a `clippy --release` pass. It was merged after #4030, which fails under `clippy --release` with ``` error: static `TENANT_ID_EXTRACTOR` is never used --> pageserver/src/tenant/timeline.rs:4270:16 \| 4270 \| pub static TENANT_ID_EXTRACTOR: once_cell::sync::Lazy< \| ^^^^^^^^^^^^^^^^^^^ \| = note: `-D dead-code` implied by `-D warnings` error: static `TIMELINE_ID_EXTRACTOR` is never used --> pageserver/src/tenant/timeline.rs:4276:16 \| 4276 \| pub static TIMELINE_ID_EXTRACTOR: once_cell::sync::Lazy< \| ^^^^^^^^^^^^^^^^^^^^^ ``` A merge queue would have prevented this.	2023-04-27 17:07:25 +03:00
MMeent	e6ec2400fc	Enable hot standby PostgreSQL replicas. Notes: - This still needs UI support from the Console - I've not tuned any GUCs for PostgreSQL to make this work better - Safekeeper has gotten a tweak in which WAL is sent and how: It now sends zero-ed WAL data from the start of the timeline's first segment up to the first byte of the timeline to be compatible with normal PostgreSQL WAL streaming. - This includes the commits of #3714 Fixes one part of https://github.com/neondatabase/neon/issues/769 Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>	2023-04-27 15:26:44 +02:00
Christian Schwarz	5b911e1f9f	build: run clippy for powerset of features (#4077 ) This will catch compiler & clippy warnings in all feature combinations. We should probably use cargo hack for build and test as well, but, that's quite expensive and would add to overall CI wait times. obsoletes https://github.com/neondatabase/neon/pull/4073 refs https://github.com/neondatabase/neon/pull/4070	2023-04-27 15:01:27 +03:00
Christian Schwarz	9ea7b5dd38	clean up logging around on-demand downloads (#4030 ) - Remove repeated tenant & timeline from span - Demote logging of the path to debug level - Log completion at info level, in the same function where we log errors - distinguish between layer file download success & on-demand download succeeding as a whole in the log message wording - Assert that the span contains a tenant id and a timeline id fixes https://github.com/neondatabase/neon/issues/3945 Before: ``` INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{tenant_id=$TENANT_ID timeline_id=$TIMELINE_ID layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: download complete: /storage/pageserver/data/tenants/$TENANT_ID/timelines/$TIMELINE_ID/000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91 INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{tenant_id=$TENANT_ID timeline_id=$TIMELINE_ID layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: Rebuilt layer map. Did 9 insertions to process a batch of 1 updates. ``` After: ``` INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: layer file download finished INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: Rebuilt layer map. Did 9 insertions to process a batch of 1 updates. 
INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: on-demand download successful ```	2023-04-27 11:54:48 +02:00
Arseny Sher	0112a602e1	Add timeout on proxy -> compute connection establishment. Otherwise we sit up to default tcp_syn_retries (about 2+ min) before getting os error 110 if compute has been migrated to another pod.	2023-04-27 09:50:52 +04:00
Anastasia Lubennikova	92214578af	Fix proxy_io_bytes_per_client metric: use branch_id identifier properly. (#4084 ) It fixes the miscalculation of the metric for projects that use multiple branches for the same endpoint. We were underbilling users with such projects. So we need to communicate the change in Release Notes.	2023-04-26 17:47:54 +03:00
Christian Schwarz	6861259be7	add global metric for unexpected on-demand downloads (#4069 ) Until we have toned down the prod logs to zero WARN and ERROR, we want a dedicated metric for which we can have a dedicated alert. fixes https://github.com/neondatabase/neon/issues/3924	2023-04-26 15:18:26 +02:00
Sergey Melnikov	11df2ee5d7	Add safekeeper-3.us-east-2.aws.neon.build (#4085 )	2023-04-26 14:40:36 +03:00
Arseny Sher	31a3910fd9	Remove wait_for_sk_commit_lsn_to_reach_remote_storage. It had a couple of inherent races: 1) Even if compute is killed before the call, some more data might still arrive at safekeepers after commit_lsn on them is polled, advancing it. Then checkpoint on pageserver might not include this tail, and so upload of expected LSN won't happen until one more checkpoint. 2) commit_lsn is updated asynchronously -- compute can commit transaction before communicating commit_lsn to even a single safekeeper (sync-safekeepers can be used to force the advancement). This makes semantics of wait_for_sk_commit_lsn_to_reach_remote_storage quite complicated. Replace it with last_flush_lsn_upload which 1) Learns last flush LSN on compute; 2) Waits for it to arrive at pageserver; 3) Checkpoints it; 4) Waits for the upload. In some tests this keeps compute alive longer than before, but this doesn't seem to be important. There is a chance this fixes https://github.com/neondatabase/neon/issues/3209	2023-04-26 13:46:33 +04:00
Joonas Koivunen	381c8fca4f	feat: log how long tenant activation takes (#4080 ) Adds just a counter counting up from the creation of the tenant, logged after activation. Might help guide us with the investigation of #4025.	2023-04-26 12:39:17 +03:00
Joonas Koivunen	4625da3164	build: remove busted sk-1.us-east-2 from staging hosts (#4082 ) this should give us complete deployments while a new one is being brought up.	2023-04-26 09:07:45 +00:00
Joonas Koivunen	850f6b1cb9	refactor: drop pageserver_ondisk_layers (#4071 ) I didn't get through #3775 fast enough so we wanted to remove this metric. Fixes #3705.	2023-04-26 11:49:29 +03:00
Sergey Melnikov	f19b70b379	Configure extra domain for us-east-1 (#4078 )	2023-04-26 09:36:26 +02:00
Sergey Melnikov	9d0cf08d5f	Fix new storage-broker deploy for eu-central-1 (#4079 )	2023-04-26 10:29:44 +03:00
Alexander Bayandin	2d6fd72177	GitHub Workflows: Fix crane for several registries (#4076 ) Follow-up fix after https://github.com/neondatabase/neon/pull/4067 ``` + crane tag neondatabase/vm-compute-node-v14:3064 latest Error: fetching "neondatabase/vm-compute-node-v14:3064": GET https://index.docker.io/v2/neondatabase/vm-compute-node-v14/manifests/3064: MANIFEST_UNKNOWN: manifest unknown; unknown tag=3064 ``` I reverted to the previous approach for promoting images (login to one registry, save images to local fs, logout and login to another registry, and push images from local fs). It turns out that what works for one Google project (kaniko) doesn't work for another (crane) [sigh]	2023-04-25 23:58:59 +01:00
Heikki Linnakangas	8945fbdb31	Enable OpenTelemetry tracing in proxy in staging. (#4065 ) Depends on https://github.com/neondatabase/helm-charts/pull/32 Co-authored-by: Lassi Pölönen <lassi.polonen@iki.fi>	2023-04-25 20:45:36 +03:00
Alexander Bayandin	05ac0e2493	Login to ECR and Docker Hub at once (#4067 ) - Update kaniko to 1.9.2 (from 1.7.0), problem with reproducible build is fixed - Login to ECR and Docker Hub at once, so we can push to several registries, it makes job `push-docker-hub` unneeded - `push-docker-hub` replaced with `promote-images` in `needs:` clause, Pushing images to production ECR moved to `promote-images` job	2023-04-25 17:54:10 +01:00
Joonas Koivunen	bfd45dd671	test_tenant_config: allow ERROR from eviction task (#4074 )	2023-04-25 18:41:09 +03:00
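
The hostname mapping described in commit 23aca81943 (`aaa--bbb--123.external.domain` to `aaa.bbb.123.internal.domain`) can be illustrated with a minimal sketch. This is not the code from `pg_sni_router`; the function name `internal_host`, its parameters, and the rule of rejecting prefixes that contain a dot are assumptions made for illustration only.

```rust
/// Hypothetical helper (not taken from pg_sni_router): map an SNI hostname
/// under `external_domain` to a host under `internal_domain` by turning each
/// "--" in the leading label into ".".
fn internal_host(sni: &str, external_domain: &str, internal_domain: &str) -> Option<String> {
    // Keep only what precedes ".<external_domain>"; bail out if the SNI
    // hostname does not belong to the external domain at all.
    let prefix = sni.strip_suffix(external_domain)?.strip_suffix('.')?;
    // Assumption: the routed part must be a single, non-empty DNS label.
    if prefix.is_empty() || prefix.contains('.') {
        return None;
    }
    Some(format!("{}.{}", prefix.replace("--", "."), internal_domain))
}

fn main() {
    let host = internal_host(
        "aaa--bbb--123.external.domain",
        "external.domain",
        "internal.domain",
    );
    println!("{:?}", host);
}
```

Returning `Option` lets a caller drop connections whose SNI does not match the wildcard domain instead of routing them somewhere unintended.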

3162 Commits