Commit Graph

179 Commits

Author SHA1 Message Date
arpad-m
982fce1e72 Fix rustdoc warnings and test cargo doc in CI (#4711)
## Problem

`cargo +nightly doc` is giving a lot of warnings: broken links, naked
URLs, etc.

## Summary of changes

* update the `proc-macro2` dependency so that it can compile on latest
Rust nightly, see https://github.com/dtolnay/proc-macro2/pull/391 and
https://github.com/dtolnay/proc-macro2/issues/398
* allow the `private_intra_doc_links` lint, as linking to something
that's private is always more useful than just mentioning it without a
link: if the link breaks in the future, at least there is a warning due
to that. Also, one might enable
[`--document-private-items`](https://doc.rust-lang.org/cargo/commands/cargo-doc.html#documentation-options)
in the future and make these links work in general.
* fix all the remaining warnings given by `cargo +nightly doc`
* make it possible to run `cargo doc` on stable Rust by updating
`opentelemetry` and associated crates to version 0.19, pulling in a fix
that previously broke `cargo doc` on stable:
https://github.com/open-telemetry/opentelemetry-rust/pull/904
* Add `cargo doc` to CI to ensure that it won't get broken in the
future.

Fixes #2557

## Future work
* Potentially, it might make sense, for development purposes, to publish
the generated rustdocs somewhere, like for example [how the rust
compiler does
it](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_driver/index.html).
I will file an issue for discussion.
2023-07-15 05:11:25 +03:00
Alex Chi Z
1309571f5d proxy: switch to structopt for clap parsing (#4714)
Using `#[clap]` for parsing cli opts, which is easier to maintain.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2023-07-14 19:11:01 +03:00
Conrad Ludgate
db4d094afa proxy: add more error cases to retry connect (#4707)
## Problem

In the logs, I noticed we still weren't retrying in some cases. Seemed
to be timeouts but we explicitly wanted to handle those

## Summary of changes

Retry on io::ErrorKind::TimedOut errors.
Handle IO errors in tokio_postgres::Error.
2023-07-13 11:47:27 +01:00
Conrad Ludgate
0626e0bfd3 proxy: refactor some error handling and shutdowns (#4684)
## Problem

It took me a while to understand the purpose of all the tasks spawned in
the main functions.

## Summary of changes

Utilising the type system and less macros, plus much more comments,
document the shutdown procedure of each task in detail
2023-07-13 11:03:37 +01:00
Conrad Ludgate
a1d6b1a4af proxy wake_compute loop (#4675)
## Problem

If we fail to wake up the compute node, a subsequent connect attempt
will definitely fail. However, kubernetes won't fail the connection
immediately, instead it hangs until we timeout (10s).

## Summary of changes

Refactor the loop to allow fast retries of compute_wake and to skip a
connect attempt.
2023-07-12 11:38:36 +01:00
Conrad Ludgate
ac758e4f51 allow repeated IO errors from compute node (#4624)
## Problem

#4598 compute nodes are not accessible some time after wake up due to
kubernetes DNS not being fully propagated.

## Summary of changes

Update connect retry mechanism to support handling IO errors and
sleeping for 100ms

## Checklist before requesting a review

- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
2023-07-07 19:50:50 +03:00
Stas Kelvich
dbf88cf2d7 Minimalistic pool for http endpoint compute connections (under opt-in flag)
Cache up to 20 connections per endpoint. Once all pooled connections
are used current implementation can open an extra connection, so the
maximum number of simultaneous connections is not enforced.

There are more things to do here, especially with background clean-up
of closed connections, and checks for transaction state. But current
implementation allows to check for smaller coonection latencies that
this cache should bring.
2023-07-05 12:00:03 +03:00
Stas Kelvich
cbf9a40889 Set a shorter timeout for the initial connection attempti in proxy.
In case we try to connect to an outdated address that is no longer valid, the
default behavior of Kubernetes is to drop the packets, causing us to wait for
the entire timeout period. We want to fail fast in such cases.

A specific case to consider is when we have cached compute node information
with a 5-minute TTL (Time To Live), but the user has executed a `/suspend` API
call, resulting in the nonexistence of the compute node.
2023-07-04 20:34:22 +03:00
Conrad Ludgate
ab2ea8cfa5 use pbkdf2 crate (#4626)
## Problem

While pbkdf2 is a simple algorithm, we should probably use a well tested
implementation

## Summary of changes

* Use pbkdf2 crate
* Use arrays like the hmac comment says

## Checklist before requesting a review

- [X] I have performed a self-review of my code.
- [X] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
2023-07-04 14:54:59 +01:00
Joonas Koivunen
cff7ae0b0d fix: no more ansi colored logs (#4613)
Allure does not support ansi colored logs, yet `compute_ctl` has them.

Upgrade criterion to get rid of atty dependency, disable ansi colors,
remove atty dependency and disable ansi feature of tracing-subscriber.

This is a heavy-handed approach. I am not aware of a workflow where
you'd want to connect a terminal directly to for example `compute_ctl`,
usually you find the logs in a file. If someone had been using colors,
they will now need to:
- turn the `tracing-subscriber.default-features` to `true`
- edit their wanted project to have colors

I decided to explicitly disable ansi colors in case we would have in
future a dependency accidentally enabling the feature on
`tracing-subscriber`, which would be quite surprising but not
unimagineable.

By getting rid of `atty` from dependencies we get rid of
<https://github.com/advisories/GHSA-g98v-hv3f-hcfr>.
2023-07-03 16:37:02 +03:00
Shany Pozin
26828560a8 Add timeouts and retries to consumption metrics reporting client (#4563)
## Problem
#4528
## Summary of changes
Add a 60 seconds default timeout to the reqwest client 
Add retries for up to 3 times to call into the metric consumption
endpoint

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-07-03 15:20:49 +03:00
Joonas Koivunen
4957bb2d48 fix(proxy): stray span enter globbers up logs (#4612)
Prod logs have deep accidential span nesting. Introduced in #3759 and
has been untouched since, maybe no one watches proxy logs? :) I found it
by accident when looking to see if proxy logs have ansi colors with
`{neon_service="proxy"}`.

The solution is to mostly stop using `Span::enter` or `Span::entered` in
async code. Kept on `Span::entered` in cancel on shutdown related path.
2023-07-03 11:53:57 +01:00
Arthur Petukhovsky
a7a0c3cd27 Invalidate proxy cache in http-over-sql (#4500)
HTTP queries failed with errors `error connecting to server: failed to
lookup address information: Name or service not known\n\nCaused by:\n
failed to lookup address information: Name or service not known`

The fix reused cache invalidation logic in proxy from usual postgres
connections and added it to HTTP-over-SQL queries.

Also removed a timeout for HTTP request, because it almost never worked
on staging (50s+ time just to start the compute), and we can have the
similar case in production. Should be ok, since we have a limits for the
requests and responses.
2023-06-14 19:24:46 +03:00
Stas Kelvich
4385e0c291 Return more RowDescription fields via proxy json endpoint
As we aim to align client-side behavior with node-postgres, it's necessary for us to return these fields, because node-postgres does so as well.
2023-06-13 22:31:18 +03:00
Stas Kelvich
c82d19d8d6 Fix NULLs handling in proxy json endpoint
There were few problems with null handling:

* query_raw_txt() accepted vector of string so it always (erroneously)
treated "null" as a string instead of null. Change rust pg client
to accept the vector of Option<String> instead of just Strings. Adopt
coding here to pass nulls as None.

* pg_text_to_json() had a check that always interpreted "NULL" string
as null. That is wrong and nulls were already handled by match None.
This bug appeared as a bad attempt to parse arrays containing NULL
elements. Fix coding by checking presence of quotes while parsing an
array (no quotes -> null, quoted -> "null" string).

Array parser fix also slightly changes behavior by always cleaning
current entry when pushing to the resulting vector. This seems to be
an omission by previous coding, however looks like it was harmless
as entry was not cleared only at the end of the nested or to-level
array.
2023-06-08 16:00:18 +03:00
Stas Kelvich
d73639646e Add more output options to proxy json endpoint
With this commit client can pass following optional headers:

`Neon-Raw-Text-Output: true`. Return postgres values as text, without parsing them. So numbers, objects, booleans, nulls and arrays will be returned as text. That can be useful in cases when client code wants to implement it's own parsing or reuse parsing libraries from e.g. node-postgres.

`Neon-Array-Mode: true`. Return postgres rows as arrays instead of objects. That is more compact representation and also helps in some edge
cases where it is hard to use rows represented as objects (e.g. when several fields have the same name).
2023-06-08 16:00:18 +03:00
Joonas Koivunen
a55c663848 chore: comment marker fixes (#4406)
Upgrading to rust 1.70 will require these.
2023-06-02 21:03:12 +03:00
Joonas Koivunen
ef80a902c8 pg_sni_router: add session_id to more messages (#4403)
See superceded #4390.

- capture log in test
- expand the span to cover init and error reporting
- remove obvious logging by logging only unexpected
2023-06-02 14:59:10 +03:00
Arseny Sher
c200ebc096 proxy: log endpoint name everywhere.
Checking out proxy logs for the endpoint is a frequent (often first) operation
during user issues investigation; let's remove endpoint id -> session id mapping
annoying extra step here.
2023-05-24 09:11:23 +04:00
Stas Kelvich
dad3519351 Add SQL-over-HTTP endpoint to Proxy
This commit introduces an SQL-over-HTTP endpoint in the proxy, with a JSON
response structure resembling that of the node-postgres driver. This method,
using HTTP POST, achieves smaller amortized latencies in edge setups due to
fewer round trips and an enhanced open connection reuse by the v8 engine.

This update involves several intricacies:
1. SQL injection protection: We employed the extended query protocol, modifying
   the rust-postgres driver to send queries in one roundtrip using a text
   protocol rather than binary, bypassing potential issues like those identified
   in https://github.com/sfackler/rust-postgres/issues/1030.

2. Postgres type compatibility: As not all postgres types have binary
   representations (e.g., acl's in pg_class), we adjusted rust-postgres to
   respond with text protocol, simplifying serialization and fixing queries with
   text-only types in response.

3. Data type conversion: Considering JSON supports fewer data types than
   Postgres, we perform conversions where possible, passing all other types as
   strings. Key conversions include:
   - postgres int2, int4, float4, float8 -> json number (NaN and Inf remain
     text)
   - postgres bool, null, text -> json bool, null, string
   - postgres array -> json array
   - postgres json and jsonb -> json object

4. Alignment with node-postgres: To facilitate integration with js libraries,
   we've matched the response structure of node-postgres, returning command tags
   and column oids. Command tag capturing was added to the rust-postgres
   functionality as part of this change.
2023-05-23 20:01:40 +03:00
Vadim Kharitonov
4f64be4a98 Add endpoint to connection string 2023-05-15 23:45:04 +02:00
Stas Kelvich
b1329db495 fix sigterm handling 2023-04-28 17:15:43 +03:00
Stas Kelvich
4ac6a9f089 add backward compatibility to proxy 2023-04-28 17:15:43 +03:00
Stas Kelvich
9486d76b2a Add tests for link auth to compute connection 2023-04-28 17:15:43 +03:00
Stas Kelvich
040f736909 remove changes in main proxy that are now not needed 2023-04-28 17:15:43 +03:00
Stas Kelvich
645e4f6ab9 use TLS in link proxy 2023-04-28 17:15:43 +03:00
Heikki Linnakangas
53e5d18da5 Start passthrough earlier
As soon as we have received the SSLRequest packet, and have figured
out the hostname to connect to from the SNI, we can start passing
through data. We don't need to parse the StartupPacket that the client
will send next.
2023-04-28 17:15:43 +03:00
Heikki Linnakangas
3813c703c9 Add an option for destination port.
Makes it easier to test locally.
2023-04-28 17:15:43 +03:00
Heikki Linnakangas
b15204fa8c Fix --help, and required args 2023-04-28 17:15:43 +03:00
Alexey Kondratov
81c75586ab Take port from SNI, formatting, make clippy happy 2023-04-28 17:15:43 +03:00
Anton Chaporgin
556fb1642a fixed the way hostname is parsed 2023-04-28 17:15:43 +03:00
Stas Kelvich
23aca81943 Add SNI-based proxy router
In order to not to create NodePorts for each compute we can setup
services that accept connections on wildcard domains and then use
information from domain name to route connection to some internal
service. There are ready solutions for HTTPS and TLS connections
but postgresql protocol uses opportunistic TLS and we haven't found
any ready solutions.

This patch introduces `pg_sni_router` which routes connections to
`aaa--bbb--123.external.domain` to `aaa.bbb.123.internal.domain`.

In the long run we can avoid console -> compute psql communications,
but now this router seems to be the easier way forward.
2023-04-28 17:15:43 +03:00
Arseny Sher
f5b4697c90 Log session_id when proxy per client task errors out. 2023-04-27 19:08:22 +04:00
Arseny Sher
0112a602e1 Add timeout on proxy -> compute connection establishment.
Otherwise we sit up to default tcp_syn_retries (about 2+ min) before gettings os
error 110 if compute has been migrated to another pod.
2023-04-27 09:50:52 +04:00
Anastasia Lubennikova
92214578af Fix proxy_io_bytes_per_client metric: use branch_id identifier properly. (#4084)
It fixes the miscalculation of the metric for projects that use multiple
branches for the same endpoint.
We were under billing users with such projects. So we need to
communicate the change in Release Notes.
2023-04-26 17:47:54 +03:00
Sasha Krassovsky
fd31fafeee Make proxy shutdown when all connections are closed (#3764)
## Describe your changes
Makes Proxy start draining connections on SIGTERM.
## Issue ticket number and link
#3333
2023-04-13 19:31:30 +03:00
Stas Kelvich
5d0ecadf7c Add support for non-SNI case in multi-cert proxy
When no SNI is provided use the default certificate, otherwise we can't
get to the options parameter which can be used to set endpoint name too.
That means that non-SNI flow will not work for CNAME domains in verify-full
mode.
2023-04-12 18:16:49 +03:00
Stas Kelvich
b1c2a6384a Set non-wildcard common names in link auth proxy
Old coding here ignored non-wildcard common names and passed None instead. With my recent changes
I started throwing an error in that case. Old logic doesn't seem to be a great choice, so instead
of passing None I actually set non-wildcard common names too. That way it is possible to avoid handling
cases with None in downstream code.
2023-04-07 01:24:27 +03:00
Anastasia Lubennikova
6d01d835a8 [proxy] Report error if proxy_io_bytes_per_client metric has decreased 2023-04-06 23:14:07 +03:00
Stas Kelvich
d8df5237fa Aligne extra certificate name with default cert-manager names 2023-04-05 21:29:21 +03:00
Stas Kelvich
c3ca48c62b Support extra domain names for proxy.
Make it possible to specify directory where proxy will look up for
extra certificates. Proxy will iterate through subdirs of that directory
and load `key.pem` and `cert.pem` files from each subdir. Certs directory
structure may look like that:

  certs
  |--example.com
  |  |--key.pem
  |  |--cert.pem
  |--foo.bar
     |--key.pem
     |--cert.pem

Actual domain names are taken from certs and key, subdir names are
ignored.
2023-04-05 20:06:48 +03:00
Dmitry Ivanov
f85a61ceac [proxy] Fix regression in logging
For some reason, `tracing::instrument` proc_macro doesn't always print
elements specified via `fields()` or even show that it's impossible
(e.g. there's no Display impl).

Work around this using the `?foo` notation.

Before:
2023-04-03T14:48:06.017504Z  INFO handle_client🤝 received SslRequest

After:
2023-04-03T14:51:24.424176Z  INFO handle_client{session_id=7bd07be8-3462-404e-8ccc-0a5332bf3ace}🤝 received SslRequest
2023-04-03 18:49:30 +03:00
Arseny Sher
a7ab53c80c Forward framed read buf contents to compute before proxy pass.
Otherwise they get lost. Normally buffer is empty before proxy pass, but this is
not the case with pipeline mode of out npm driver; fixes connection hangup
introduced by b80fe41af3 for it.

fixes https://github.com/neondatabase/neon/issues/3822
2023-03-15 14:32:41 +03:00
Joonas Koivunen
c23c8946a3 chore: clippies introduced with rust 1.68 (#3781)
- handle automatically fixable future clippies
- tune run-clippy.sh to remove macos specifics which we no longer have

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2023-03-14 15:29:02 +02:00
Dmitry Ivanov
2e4bf7cee4 [proxy] Immediately log all compute node connection errors. 2023-03-14 01:45:57 +03:00
Arthur Petukhovsky
d9a1329834 Make postgres_backend use generic IO type (#3789)
- Support measuring inbound and outbound traffic in MeasuredStream
- Start using MeasuredStream in safekeepers code
2023-03-13 12:18:10 +03:00
Anastasia Lubennikova
b7fddfa70d Add branch_id field to proxy_io_bytes_per_client metric.
Since we allow switching endpoints between different branches, it is important to use composite key.
Otherwise, we may try to calculate delta between metric values for two different branches.
2023-03-10 15:00:48 +02:00
Arseny Sher
b80fe41af3 Refactor postgres protocol parsing.
1) Remove allocation and data copy during each message read. Instead, parsing
functions now accept BytesMut from which data they form messages, with
pointers (e.g. in CopyData) pointing directly into BytesMut buffer. Accordingly,
move ConnectionError containing IO error subtype into framed.rs providing this
and leave in pq_proto only ProtocolError.

2) Remove anyhow from pq_proto.

3) Move FeStartupPacket out of FeMessage. Now FeStartupPacket::parse returns it
directly, eliminating dead code where user wants startup packet but has to match
for others.

proxy stream.rs is adapted to framed.rs with minimal changes. It also benefits
from framed.rs improvements described above.
2023-03-09 20:45:56 +03:00
Arseny Sher
0d8ced8534 Remove sync postgres_backend, tidy up its split usage.
- Add support for splitting async postgres_backend into read and write halfes.
  Safekeeper needs this for bidirectional streams. To this end, encapsulate
  reading-writing postgres messages to framed.rs with split support without any
  additional changes (relying on BufRead for reading and BytesMut out buffer for
  writing).
- Use async postgres_backend throughout safekeeper (and in proxy auth link
  part).
- In both safekeeper COPY streams, do read-write from the same thread/task with
  select! for easier error handling.
- Tidy up finishing CopyBoth streams in safekeeper sending and receiving WAL
  -- join split parts back catching errors from them before returning.

Initially I hoped to do that read-write without split at all, through polling
IO:
https://github.com/neondatabase/neon/pull/3522
However that turned out to be more complicated than I initially expected
due to 1) borrow checking and 2) anon Future types. 1) required Rc<Refcell<...>>
which is Send construct just to satisfy the checker; 2) can be workaround with
transmute. But this is so messy that I decided to leave split.
2023-03-09 20:45:56 +03:00
Arseny Sher
7627d85345 Move async postgres_backend to its own crate.
To untie cyclic dependency between sync and async versions of postgres_backend,
copy QueryError and some logging/error routines to postgres_backend.rs. This is
temporal glue to make commits smaller, sync version will be dropped by the
upcoming commit completely.
2023-03-09 20:45:56 +03:00