## Problem
If the user reset password, cache could receive this information only
after `ttl` minutes.
## Summary of changes
Invalidate password on auth failure.
- No need to include c.h, port.h or pg_config.h, they are included in
postgres.h
- No need to include postgres.h in header files. Instead, the
assumption in PostgreSQL is that all .c files include postgres.h.
- Reorder includes to alphabetical order, and system headers before
pgsql headers
- Remove bunch of other unnecessary includes that got copy-pasted from
one source file to another
## Problem
Existing dependencies didn't work on Fedora 39 (python 3.12)
## Summary of changes
- Update pyyaml 6.0 -> 6.0.1
- Update yarl 1.8.2->1.9.4
- Update the `dnf install` line in README to include dependencies of
python packages (unrelated to upgrades, just noticed absences while
doing fresh pysync run)
## Problem
Currently if we are getting many consecutive connections to the same
user/ep we will send a lot of traffic to the console.
## Summary of changes
Cache with ttl=4min proxy_get_role_secret response.
Note: this is the temporary hack, notifier listener is WIP.
## Problem
The `src/proxy.rs` file is far too large
## Summary of changes
Creates 3 new files:
```
src/metrics.rs
src/proxy/retry.rs
src/proxy/connect_compute.rs
```
## Problem
#6112 added some logs and metrics: clean these up a bit:
- Avoid counting startup completions for tenants launched after startup
- exclude no-op cases from timing histograms
- remove a rogue log messages
Walproposer sometimes intentionally PANICs when its term is defeated as the
basebackup is likely spoiled by that time. We don't want core dumped in this
case.
It turns out the issue with skipped jobs is not so trivial (because
Github checks jobs transitively), a possible workaround with `if:
always() && contains(fromJSON('["success", "skipped"]'),
needs.build-buildtools-image.result)` will tangle the workflow really
bad. We'll need to come up with a better solution.
To unblock the main I'm going to revert
https://github.com/neondatabase/neon/pull/6082.
## Currently our build docker file is located in the build repo it makes
sense to have it as a part of our neon repo
## Summary of changes
We had the docker file that we use to build our binary and other tools
resided in the build repo
It made sense to bring the docker file to its repo where it has been
used
So that the contributors can also view it and amend if required
It will reduce the maintenance. Docker file changes and code changes can
be accommodated in same PR
Also, building the image and pushing it to ECR is abstracted in a
reusable workflow. Ideal is to use that for any other jobs too
## Checklist before requesting a review
- [x] Moved the docker file used to build the binary from the build repo
to the neon repo
- [x] adding gh workflow to build and push the image
- [x] adding gh workflow to tag the pushed image
- [x] update readMe file
---------
Co-authored-by: Abhijeet Patil <abhijeet@neon.tech>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
## Problem
https://github.com/neondatabase/neon/security/dependabot/48
```
$ cargo tree -i zerocopy
zerocopy v0.7.3
└── ahash v0.8.5
└── hashbrown v0.13.2
```
ahash doesn't use the affected APIs we we are not vulnerable but best to
update to silence the alert anyway
## Summary of changes
```
$ cargo update -p zerocopy --precise 0.7.31
Updating crates.io index
Updating syn v2.0.28 -> v2.0.32
Updating zerocopy v0.7.3 -> v0.7.31
Updating zerocopy-derive v0.7.3 -> v0.7.31
```
## Problem
During startup, a client request might have to wait a long time while
the system is busy initializing all the attached tenants, even though
most of the attached tenants probably don't have any client requests to
service, and could wait a bit.
## Summary of changes
- Add a semaphore to limit how many Tenant::spawn()s may concurrently do
I/O to attach their tenant (i.e. read indices from remote storage, scan
local layer files, etc).
- Add Tenant::activate_now, a hook for kicking a tenant in its spawn()
method to skip waiting for the warmup semaphore
- For tenants that attached via warmup semaphore units, wait for logical
size calculation to complete before dropping the warmup units
- Set Tenant::activate_now in `get_active_tenant_with_timeout` (the page
service's path for getting a reference to a tenant).
- Wait for tenant activation in HTTP handlers for timeline creation and
deletion: like page service requests, these require an active tenant and
should prioritize activation if called.
## Problem
Various places in remote storage were not subject to a timeout (thereby
stuck TCP connections could hold things up), and did not respect a
cancellation token (so things like timeline deletion or tenant detach
would have to wait arbitrarily long).
## Summary of changes
- Add download_cancellable and upload_cancellable helpers, and use them
in all the places we wait for remote storage operations (with the
exception of initdb downloads, where it would not have been safe).
- Add a cancellation token arg to `download_retry`.
- Use cancellation token args in various places that were missing one
per #5066Closes: #5066
Why is this only "basic" handling?
- Doesn't express difference between shutdown and errors in return
types, to avoid refactoring all the places that use an anyhow::Error
(these should all eventually return a more structured error type)
- Implements timeouts on top of remote storage, rather than within it:
this means that operations hitting their timeout will lose their
semaphore permit and thereby go to the back of the queue for their
retry.
- Doing a nicer job is tracked in
https://github.com/neondatabase/neon/issues/6096
Part of getpage@lsn benchmark epic:
https://github.com/neondatabase/neon/issues/5771
This PR moves the control plane's spread-all-over-the-place client for
the pageserver management API into a separate module within the
pageserver crate.
I need that client to be async in my benchmarking work, so, this PR
switches to the async version of `reqwest`.
That is also the right direction generally IMO.
The switch to async in turn mandated converting most of the
`control_plane/` code to async.
Note that some of the client methods should be taking `TenantShardId`
instead of `TenantId`, but, none of the callers seem to be
sharding-aware.
Leaving that for another time:
https://github.com/neondatabase/neon/issues/6154
This doesn't make the scrubber smart enough to understand that many
shards are part of the same tenants, but it makes it understand paths
well enough to scrub the individual shards without thinking they're
malformed.
This is a prerequisite to being able to run tests with sharding enabled.
Related: #5929
## Summary of changes
saw some low-hanging codecov improvements. even if code coverage is
somewhat of a pointless game, might as well add tests where we can and
delete code if it's unused
* initdb uploads had no cancellation token, which means that when we
were stuck in upload retries, we wouldn't be able to delete the
timeline. in general, the combination of retrying forever and not having
cancellation tokens is quite dangerous.
* initdb uploads wouldn't rewind the file. this wasn't discovered in the
purposefully unreliable test-s3 in pytest because those fail on the
first byte always, not somewhere during the connection. we'd be getting
errors from the AWS sdk that the file was at an unexpected end.
slack thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1702632247784079