As soon as we have received the SSLRequest packet, and have figured
out the hostname to connect to from the SNI, we can start passing
through data. We don't need to parse the StartupPacket that the client
will send next.
To avoid creating NodePorts for each compute, we can set up services that
accept connections on wildcard domains and then use information from the
domain name to route the connection to an internal service. Ready-made
solutions exist for HTTPS and TLS connections, but the PostgreSQL protocol
uses opportunistic TLS and we haven't found any ready-made solution for it.
This patch introduces `pg_sni_router`, which routes connections for
`aaa--bbb--123.external.domain` to `aaa.bbb.123.internal.domain`.
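A minimal sketch of that name translation (the function and parameter names are illustrative, not the actual `pg_sni_router` internals):
```rust
/// Translate an external SNI hostname into the internal service name,
/// assuming the external/internal domain suffixes come from configuration.
fn rewrite_sni_host(sni: &str, external_suffix: &str, internal_suffix: &str) -> Option<String> {
    // "aaa--bbb--123.external.domain" -> "aaa--bbb--123"
    let label = sni.strip_suffix(external_suffix)?.trim_end_matches('.');
    // "aaa--bbb--123" -> "aaa.bbb.123"
    let internal_label = label.replace("--", ".");
    Some(format!("{internal_label}.{internal_suffix}"))
}

#[test]
fn rewrites_external_to_internal() {
    assert_eq!(
        rewrite_sni_host("aaa--bbb--123.external.domain", "external.domain", "internal.domain"),
        Some("aaa.bbb.123.internal.domain".to_string()),
    );
}
```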
In the long run we can avoid console -> compute psql communications, but for
now this router seems to be the easier way forward.
Refactor walsenders out of timeline.rs to make it less convoluted: they move
into a separate WalSenders struct with its own lock, but otherwise keep the
same structure. Tracking of the in-memory remote_consistent_lsn is also moved
there, as it is mainly received from the pageserver.
The walsender state (feedback) is also restructured to be cleaner; it is now
either PageserverFeedback or StandbyFeedback(StandbyReply, HotStandbyFeedback),
but never both.
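A rough sketch of the resulting shape (placeholder types; the real definitions live in the safekeeper code):
```rust
// Placeholder structs standing in for the real safekeeper feedback types.
struct PageserverFeedback { /* last_received_lsn, remote_consistent_lsn, ... */ }
struct StandbyReply { /* write/flush/apply LSNs reported by a standby */ }
struct HotStandbyFeedback { /* xmin / catalog_xmin horizons */ }

// A walsender now carries exactly one kind of feedback, never both.
enum WalSenderFeedback {
    Pageserver(PageserverFeedback),
    Standby(StandbyReply, HotStandbyFeedback),
}
```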
Introduce read timeouts to our `page_service` connections. Without read
timeouts, we essentially leak connections.
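A minimal sketch of the idea using tokio, not the actual `page_service` wiring (the real change threads the timeout through the postgres backend instead):
```rust
use std::time::Duration;
use tokio::io::{AsyncRead, AsyncReadExt};
use tokio::time::timeout;

/// Read from a client connection, but give up if the client stays idle
/// for too long instead of holding the connection forever.
async fn read_with_timeout<S: AsyncRead + Unpin>(
    stream: &mut S,
    buf: &mut [u8],
    idle: Duration,
) -> std::io::Result<usize> {
    match timeout(idle, stream.read(buf)).await {
        Ok(res) => res,
        // On timeout, surface an error so the connection handler drops the socket.
        Err(_elapsed) => Err(std::io::Error::new(
            std::io::ErrorKind::TimedOut,
            "client idle for too long",
        )),
    }
}
```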
This is a port of #3995. The refactorings were split out into a separate PR: #4097.
Fixes #4028.
- Increase `connect_timeout` to 30s, which should be enough for most cases
- If the script cannot connect to the DB (or any other
`psycopg2.OperationalError` occurs), do not fail the script; log the error
and proceed. Problems with fetching flaky tests shouldn't block the PR
Refactoring part of #4093.
Numerous `Send + Sync` bounds were a distraction and were not needed at all.
The proper `Bytes` usage and one `"error_message".to_string()` are just
drive-by fixes.
Not using `PostgresBackendTCP` allows us to start setting read timeouts (and
more). `PostgresBackendTCP` is still used by the proxy, so it cannot be
removed.
PR `build: run clippy for powerset of features (#4077)` brought us a
`clippy --release` pass.
It was merged after #4030, which fails under `clippy --release` with
```
error: static `TENANT_ID_EXTRACTOR` is never used
--> pageserver/src/tenant/timeline.rs:4270:16
|
4270 | pub static TENANT_ID_EXTRACTOR: once_cell::sync::Lazy<
| ^^^^^^^^^^^^^^^^^^^
|
= note: `-D dead-code` implied by `-D warnings`
error: static `TIMELINE_ID_EXTRACTOR` is never used
--> pageserver/src/tenant/timeline.rs:4276:16
|
4276 | pub static TIMELINE_ID_EXTRACTOR: once_cell::sync::Lazy<
| ^^^^^^^^^^^^^^^^^^^^^
```
A merge queue would have prevented this.
Notes:
- This still needs UI support from the Console
- I've not tuned any GUCs for PostgreSQL to make this work better
- The safekeeper has gotten a tweak to which WAL is sent and how: it now
sends zeroed WAL data from the start of the timeline's first segment up to
the first byte of the timeline, to be compatible with normal PostgreSQL
WAL streaming.
- This includes the commits of #3714
Fixes one part of https://github.com/neondatabase/neon/issues/769
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
- Remove repeated tenant & timeline from span
- Demote logging of the path to debug level
- Log completion at info level, in the same function where we log errors
- Distinguish, in the log message wording, between a layer file download
succeeding and the on-demand download succeeding as a whole
- Assert that the span contains a tenant id and a timeline id
fixes https://github.com/neondatabase/neon/issues/3945
Before:
```
INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{tenant_id=$TENANT_ID timeline_id=$TIMELINE_ID layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: download complete: /storage/pageserver/data/tenants/$TENANT_ID/timelines/$TIMELINE_ID/000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91
INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{tenant_id=$TENANT_ID timeline_id=$TIMELINE_ID layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: Rebuilt layer map. Did 9 insertions to process a batch of 1 updates.
```
After:
```
INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: layer file download finished
INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: Rebuilt layer map. Did 9 insertions to process a batch of 1 updates.
INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: on-demand download successful
```
It fixes the miscalculation of the metric for projects that use multiple
branches for the same endpoint.
We were under-billing users with such projects, so we need to communicate the
change in the Release Notes.
It had a couple of inherent races:
1) Even if the compute is killed before the call, some more data might still arrive
at the safekeepers after their commit_lsn is polled, advancing it. A checkpoint
on the pageserver might then not include this tail, so the upload of the expected
LSN won't happen until one more checkpoint.
2) commit_lsn is updated asynchronously -- the compute can commit a transaction
before communicating commit_lsn to even a single safekeeper (sync-safekeepers can
be used to force the advancement). This makes the semantics of
wait_for_sk_commit_lsn_to_reach_remote_storage quite complicated.
Replace it with last_flush_lsn_upload, which (sketched below)
1) Learns the last flush LSN on the compute;
2) Waits for it to arrive at the pageserver;
3) Checkpoints it;
4) Waits for the upload.
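A sketch of those four steps, with hypothetical client traits standing in for the real helpers (this is illustrative, not the actual implementation):
```rust
use anyhow::Result;

type Lsn = u64; // stand-in for the real Lsn type

#[allow(async_fn_in_trait)]
trait ComputeClient {
    async fn last_flush_lsn(&self) -> Result<Lsn>;
}

#[allow(async_fn_in_trait)]
trait PageserverClient {
    async fn wait_for_last_record_lsn(&self, lsn: Lsn) -> Result<()>;
    async fn checkpoint(&self) -> Result<()>;
    async fn wait_for_upload(&self, lsn: Lsn) -> Result<()>;
}

async fn last_flush_lsn_upload(
    compute: &impl ComputeClient,
    pageserver: &impl PageserverClient,
) -> Result<Lsn> {
    // 1) Learn the last flush LSN directly from the compute.
    let lsn = compute.last_flush_lsn().await?;
    // 2) Wait until the pageserver has ingested WAL up to that LSN.
    pageserver.wait_for_last_record_lsn(lsn).await?;
    // 3) Force a checkpoint so the data ends up in layer files.
    pageserver.checkpoint().await?;
    // 4) Wait for the upload queue to drain so remote storage covers that LSN.
    pageserver.wait_for_upload(lsn).await?;
    Ok(lsn)
}
```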
In some tests this keeps the compute alive longer than before, but that doesn't
seem to be important.
There is a chance this fixes https://github.com/neondatabase/neon/issues/3209
Follow-up fix after https://github.com/neondatabase/neon/pull/4067
```
+ crane tag neondatabase/vm-compute-node-v14:3064 latest
Error: fetching "neondatabase/vm-compute-node-v14:3064": GET https://index.docker.io/v2/neondatabase/vm-compute-node-v14/manifests/3064: MANIFEST_UNKNOWN: manifest unknown; unknown tag=3064
```
I reverted to the previous approach for promoting images
(log in to one registry, save images to the local fs, log out and log in to
another registry, and push images from the local fs). It turns out that what
works for one Google project (kaniko) doesn't work for another (crane)
[sigh].
- Update kaniko to 1.9.2 (from 1.7.0); the problem with reproducible builds is fixed
- Log in to ECR and Docker Hub at once, so we can push to several
registries; this makes the `push-docker-hub` job unneeded
- `push-docker-hub` is replaced with `promote-images` in the `needs:` clause;
pushing images to the production ECR is moved to the `promote-images` job
## Describe your changes
Deploy `main` proxies to the preview environments
We don't deploy storage there yet, as it's tricky.
## Issue ticket number and link
https://github.com/neondatabase/cloud/issues/4737
Also add a corresponding unit test.
The fix is to use `.remove()` instead of `.get()` when processing the
arguments hash map.
The code uses the emptiness of the hash map to determine whether all
arguments have been processed.
This was likely a copy-paste error.
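A minimal illustration of why `.remove()` matters here (names are illustrative):
```rust
use std::collections::HashMap;

fn parse_args(mut args: HashMap<String, String>) -> Result<(), String> {
    // `.remove()` consumes the entry; `.get()` would leave it behind and the
    // emptiness check below would then reject perfectly valid arguments.
    let _tenant_id = args.remove("tenant_id");
    let _timeline_id = args.remove("timeline_id");

    // Anything still left in the map is an argument we don't understand.
    if !args.is_empty() {
        let unknown: Vec<_> = args.keys().cloned().collect();
        return Err(format!("unrecognized arguments: {unknown:?}"));
    }
    Ok(())
}
```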
refs https://github.com/neondatabase/neon/issues/3942
Just record the time needed to wait for the LSN and then for the basebackup
in a log message, in millis. This is related to ongoing investigations into
cold start performance.
This could also be a counter. It cannot be added next to the smgr
histograms, because we don't want another histogram per timeline.
The aim is to allow drilling deeper into which timelines were slow, and
to understand why some need two basebackups.
For the "worst-case /storage usage panel", we need to compute
```
remote size + local-only size
```
We currently don't have a metric for local-only layers.
The number of in-flight layers in the upload queue is just that, so let
Prometheus scrape it.
The metric is two counters (started and finished).
The delta is the number of in-flight uploads in the queue.
The metrics are incremented in the respective `call_unfinished_metric_*`
functions.
These track ongoing operations by file_kind and op_kind.
We only need this metric for layer uploads, so there's the new
RemoteTimelineClientMetricsCallTrackSize type that forces all call sites
to decide whether they want the size tracked or not.
If we find that other file_kinds or op_kinds (metadata uploads, layer
downloads, layer deletes) are interesting, we can just enable them, and
they'll be just another label combination within the metrics that this PR
adds.
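A sketch of the started/finished counter pair (metric and label names are illustrative, not necessarily the ones this PR registers):
```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounterVec};

static REMOTE_OPS_STARTED: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "pageserver_remote_operations_started_total",
        "Remote timeline client operations started, by kind",
        &["file_kind", "op_kind"]
    )
    .unwrap()
});

static REMOTE_OPS_FINISHED: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "pageserver_remote_operations_finished_total",
        "Remote timeline client operations finished, by kind",
        &["file_kind", "op_kind"]
    )
    .unwrap()
});

// Called when a layer upload is queued / completes.
fn upload_started() {
    REMOTE_OPS_STARTED.with_label_values(&["layer", "upload"]).inc();
}

fn upload_finished() {
    REMOTE_OPS_FINISHED.with_label_values(&["layer", "upload"]).inc();
}
```
On the Prometheus side, the in-flight count is then simply `started_total - finished_total`.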
fixes https://github.com/neondatabase/neon/issues/3922
Add a simple disarmable drop guard to log if a request is cancelled before
it is completed. We currently don't have this, which makes it difficult to
know when a request was dropped.
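A generic sketch of such a guard (not the exact type added here):
```rust
/// Logs if it is dropped while still armed, i.e. the request future was
/// cancelled before reaching the point where we disarm it.
struct CancelLogGuard {
    armed: bool,
    what: &'static str,
}

impl CancelLogGuard {
    fn new(what: &'static str) -> Self {
        Self { armed: true, what }
    }

    /// Call once the request has completed normally.
    fn disarm(mut self) {
        self.armed = false;
    }
}

impl Drop for CancelLogGuard {
    fn drop(&mut self) {
        if self.armed {
            tracing::info!("{} was cancelled before completing", self.what);
        }
    }
}

async fn handle_get_page_request() {
    let guard = CancelLogGuard::new("get_page request");
    // ... actual request handling; if this future is dropped here, the guard logs ...
    guard.disarm();
}
```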
This patch extends the libmetrics logging setup functionality with a
`tracing` layer that increments a Prometheus counter each time we emit a
log message. There is one counter per tracing event level. This allows
monitoring WARN and ERR log volume without parsing the logs. It would also
allow cross-checking whether logs got dropped on the way into
Loki.
It would be nicer if we could hook deeper into the tracing logging
layer, to avoid evaluating the filter twice.
But I don't know how to do it.
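Roughly, such a counting layer could look like this (a sketch; the metric name and registration differ in libmetrics):
```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounterVec};
use tracing::{Event, Subscriber};
use tracing_subscriber::layer::{Context, Layer};

static LOG_EVENTS: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "log_messages_total",
        "Number of log messages emitted, by tracing level",
        &["level"]
    )
    .unwrap()
});

struct EventCounterLayer;

impl<S: Subscriber> Layer<S> for EventCounterLayer {
    fn on_event(&self, event: &Event<'_>, _ctx: Context<'_, S>) {
        // One label value per level: ERROR, WARN, INFO, DEBUG, TRACE.
        let level = event.metadata().level().to_string();
        LOG_EVENTS.with_label_values(&[level.as_str()]).inc();
    }
}
```
The layer is then stacked onto the existing subscriber (e.g. via `tracing_subscriber`'s `SubscriberExt::with`).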
Make several attempts to get the spec from the control plane, retrying network
errors and all reasonable HTTP response codes. Do not hang waiting for the
spec without confirmation from the control plane that the compute is known
and is in the `Empty` state.
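A minimal sketch of such a retry loop (attempt count, backoff, and the retryable-error predicate are illustrative, not the exact `compute_ctl` policy):
```rust
use std::time::Duration;

fn get_spec_with_retries<T, E: std::fmt::Display>(
    mut fetch: impl FnMut() -> Result<T, E>,
    is_retryable: impl Fn(&E) -> bool,
    max_attempts: u32,
) -> Result<T, E> {
    assert!(max_attempts >= 1);
    let mut delay = Duration::from_millis(100);
    for attempt in 1..=max_attempts {
        match fetch() {
            Ok(spec) => return Ok(spec),
            // Retry network errors and "reasonable" HTTP codes; give up otherwise.
            Err(e) if attempt < max_attempts && is_retryable(&e) => {
                eprintln!("getting spec failed (attempt {attempt}): {e}; retrying in {delay:?}");
                std::thread::sleep(delay);
                delay *= 2; // simple exponential backoff
            }
            Err(e) => return Err(e),
        }
    }
    unreachable!("the loop returns on the final attempt");
}
```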
Adjust the way we track the `total_startup_ms` metric: it should be
calculated from the moment we received the spec, not from the moment
`compute_ctl` started. Also introduce a new `wait_for_spec_ms` metric
to track the time spent sleeping and waiting for the spec to be delivered
from the control plane.
Part of neondatabase/cloud#3533
Changes the dbname in the vm-informant's postgres connection string from
"neondb" (which sometimes doesn't exist) to "postgres" (which
_hopefully_ should exist more often?).
Currently there are a handful of VMs in prod that aren't working with
autoscaling because they don't have the "neondb" database.
The vm-informant doesn't require any database in particular; it's just
connecting as `cloud_admin` to be able to adjust the file cache
settings.