## Problem
We were returning Pending when a connection had a notice/notification
(introduced recently in #5020). When returning pending, the runtime
assumes you will call `cx.waker().wake()` in order to continue
processing.
We weren't doing that, so the connection task would get stuck
## Summary of changes
Don't return pending. Loop instead
## Problem
Errors and notices that happen during a pooled connection lifecycle have
no session identifiers
## Summary of changes
Using a watch channel, we set the session ID whenever it changes. This
way we can see the status of a connection for that session
Also, adding a connection id to be able to search the entire connection
lifecycle
## Problem
As documented, the global connection pool will be high contention.
## Summary of changes
Use DashMap rather than Mutex<HashMap>.
Of note, DashMap currently uses a RwLock internally, but it's partially
sharded to reduce contention by a factor of N. We could potentially use
flurry which is a port of Java's concurrent hashmap, but I have no good
understanding of it's performance characteristics. Dashmap is at least
equivalent to hashmap but less contention.
See the read heavy benchmark to analyse our expected performance
<https://github.com/xacrimon/conc-map-bench#ready-heavy>
I also spoke with the developer of dashmap recently, and they are
working on porting the implementation to use concurrent HAMT FWIW
## Problem
Half of #4699.
TCP/WS have one implementation of `connect_to_compute`, HTTP has another
implementation of `connect_to_compute`.
Having both is annoying to deal with.
## Summary of changes
Creates a set of traits `ConnectMechanism` and `ShouldError` that allows
the `connect_to_compute` to be generic over raw TCP stream or
tokio_postgres based connections.
I'm not super happy with this. I think it would be nice to
remove tokio_postgres entirely but that will need a lot more thought to
be put into it.
I have also slightly refactored the caching to use fewer references.
Instead using ownership to ensure the state of retrying is encoded in
the type system.
## Problem
In the logs, I noticed we still weren't retrying in some cases. Seemed
to be timeouts but we explicitly wanted to handle those
## Summary of changes
Retry on io::ErrorKind::TimedOut errors.
Handle IO errors in tokio_postgres::Error.
## Problem
If we fail to wake up the compute node, a subsequent connect attempt
will definitely fail. However, kubernetes won't fail the connection
immediately, instead it hangs until we timeout (10s).
## Summary of changes
Refactor the loop to allow fast retries of compute_wake and to skip a
connect attempt.
## Problem
#4598 compute nodes are not accessible some time after wake up due to
kubernetes DNS not being fully propagated.
## Summary of changes
Update connect retry mechanism to support handling IO errors and
sleeping for 100ms
## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
Cache up to 20 connections per endpoint. Once all pooled connections
are used current implementation can open an extra connection, so the
maximum number of simultaneous connections is not enforced.
There are more things to do here, especially with background clean-up
of closed connections, and checks for transaction state. But current
implementation allows to check for smaller coonection latencies that
this cache should bring.