The data will help decide whether it's ok
to keep holding Timeline::layers in shared mode until
after we've calculated the holes.
The other timings are there to understand the general breakdown
of time spent in that function.
Context: https://github.com/neondatabase/neon/issues/4492
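As a rough illustration of the kind of instrumentation meant here (hypothetical names and structure, not the actual pageserver code), timing the hole calculation while the shared lock is held could look like:

```rust
use std::time::Instant;
use tokio::sync::RwLock;

// Hypothetical stand-ins for the real pageserver types, only to show the
// shape of the instrumentation: how long the shared lock is held vs. how
// long the hole calculation itself takes.
struct Layer;
struct Timeline {
    layers: RwLock<Vec<Layer>>,
}

async fn compute_holes_with_timings(timeline: &Timeline) {
    let lock_acquired = Instant::now();
    let layers = timeline.layers.read().await; // shared (read) mode

    let calc_started = Instant::now();
    let num_holes = layers.len() / 2; // placeholder for the real hole calculation
    let calc_elapsed = calc_started.elapsed();

    drop(layers); // release the shared lock before reporting
    let lock_held = lock_acquired.elapsed();

    println!("holes={num_holes} calc={calc_elapsed:?} lock_held={lock_held:?}");
}

#[tokio::main]
async fn main() {
    let tl = Timeline { layers: RwLock::new(vec![Layer, Layer]) };
    compute_holes_with_timings(&tl).await;
}
```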
I observe sporadic reconnections with ~10k idle computes. It looks like a
separate issue (probably the walreceiver runtime gets blocked somewhere), but in
any case 2-3 seconds is too small.
## Problem
`pytest-timeout` and `pytest-rerunfailures` are incompatible (or rather
not fully compatible). Timeouts aren't set for reruns.
Ref https://github.com/pytest-dev/pytest-rerunfailures/issues/99
## Summary of changes
- Dynamically make timeouts `func_only` for tests that we're going to
retry. This applies timeouts to reruns as well.
## Problem
1. During the rollout we got a panic: "timeline that we were deleting
was concurrently removed from 'timelines' map", caused by the lock
guard not being propagated to the background part of the deletion.
The existing test didn't catch it because the failpoint used for
verification was placed before the background task was spawned.
2. While looking at the surrounding code, one more bug was found: we
removed the timeline from the map before the deletion finished. This
breaks client retry logic, because the API returns 404 before the actual
deletion is complete, which can lead to the client stopping its retry
poll too early.
## Summary of changes
1. Carry the lock guard over to the background deletion task (see the
sketch after this list). Ensure the existing test case fails without the
patch applied (the second deletion becomes stuck, which eventually leads
to a test failure).
2. Move the `delete_all` call earlier so that removing the timeline from
the map is the last thing done during deletion.
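A minimal sketch of the lock-guard handover from point 1, with invented names (the real deletion code is more involved): an owned guard is acquired before spawning and moved into the background task, so the map entry cannot be concurrently removed while the background part runs.

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

// Hypothetical per-timeline deletion lock, standing in for whatever guards
// the entry in the `timelines` map in the real code.
struct DeletionLock(Arc<Mutex<()>>);

async fn delete_timeline(lock: DeletionLock) {
    // Take an *owned* guard so it can be moved into the spawned task.
    let guard = lock.0.clone().lock_owned().await;

    // ... the synchronous/early part of the deletion runs here ...

    tokio::spawn(async move {
        // The guard travels with the background part of the deletion, so the
        // timeline cannot be concurrently removed from the map while it runs.
        let _guard = guard;
        // ... background deletion work; removing the timeline from the map
        // happens only at the very end (point 2 above).
    });
}
```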
Additionally I've added timeline_id to the `update_gc_info` span,
because `debug_assert_current_span_has_tenant_and_timeline_id` in
`download_remote_layer` was firing when `update_gc_info` led to
on-demand downloads via `find_lsn_for_timestamp` (caught by @problame).
This is not directly related to the PR but fixes possible flakiness.
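For context, a hedged sketch of the general `tracing` pattern (simplified signature with string ids, not the exact pageserver code):

```rust
use tracing::{info_span, Instrument};

// Sketch: put the tenant/timeline ids into the span so that assertions like
// `debug_assert_current_span_has_tenant_and_timeline_id` can find them even
// when the body indirectly triggers an on-demand layer download.
async fn update_gc_info_sketch(tenant_id: &str, timeline_id: &str) {
    let span = info_span!("update_gc_info", tenant_id = %tenant_id, timeline_id = %timeline_id);
    async {
        // ... body that may call find_lsn_for_timestamp and thus trigger
        // on-demand downloads ...
    }
    .instrument(span)
    .await;
}
```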
Another, smaller set of changes involves the deletion wrapper used in Python
tests. There is now a simpler wrapper, `timeline_delete_wait_completed`,
that waits for deletions to complete. Most of the
test_delete_timeline.py tests are negative tests, i.e., "does
ps_http.timeline_delete() fail in this and that scenario?".
These can be left alone. In other places, where we actually perform
deletions, we need to use the helper that polls for completion.
Discussion:
https://neondb.slack.com/archives/C03F5SM1N02/p1686668007396639

Resolves #4496
---------
Co-authored-by: Christian Schwarz <christian@neon.tech>
The default value for `wal_acceptor_reconnect_timeout` was changed in
https://github.com/neondatabase/neon/pull/4428, and it affected
performance by up to 20% in some cases.
Revert the value back to the previous default.
## Problem
When I switched `psycopg2.connect` from a context manager to a regular
function call in https://github.com/neondatabase/neon/pull/4382, I
embarrassingly forgot about commit, so the script doesn't actually put data into the DB 😞
## Summary of changes
- Enable autocommit for data ingestion scripts
## Problem
Some metrics are better observed at the page-server level. Otherwise,
since we have a lot of tenants in production, we cannot compute a sum:
Prometheus has a limit on how many time series we can aggregate. This also
helps reduce the metrics scrape size.
## Summary of changes
Some integration tests are likely to fail because they check for the
existence of some metrics. Waiting for CI to complete and will fix them.
Metrics downgraded: page cache hit (we are likely to have a
page-server-level page cache in the future instead of a per-tenant one), and
reconstruct time (this would ideally be tenant-level, as we have one pg
replayer per tenant, but for now we make it page-server level as we do
not need that fine-grained data).
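For illustration, a sketch using the `prometheus` crate with made-up metric names: a per-tenant metric carries a `tenant_id` label and yields one time series per tenant, while a page-server-level metric is a single unlabeled series.

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter, register_int_counter_vec, IntCounter, IntCounterVec};

// Per-tenant: one time series per tenant_id label value; with many tenants,
// summing these in Prometheus can hit aggregation limits and bloats scrapes.
static PAGE_CACHE_HITS_PER_TENANT: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "pageserver_page_cache_hits_per_tenant_total",
        "Page cache hits, per tenant (illustrative name)",
        &["tenant_id"]
    )
    .unwrap()
});

// Page-server level: a single series for the whole process, cheap to scrape
// and to aggregate across pageservers.
static PAGE_CACHE_HITS: Lazy<IntCounter> = Lazy::new(|| {
    register_int_counter!(
        "pageserver_page_cache_hits_total",
        "Page cache hits for the whole pageserver (illustrative name)"
    )
    .unwrap()
});
```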
---------
Signed-off-by: Alex Chi <iskyzh@gmail.com>
## Problem
`neon-image-depot` is an experimental job we use to compare with the
main `neon-image` job.
But it's not stable and right now we don't have the capacity to properly
fix and evaluate it.
We can come back to this later.
## Summary of changes
Remove `neon-image-depot` job
HTTP queries failed with errors like `error connecting to server: failed to
lookup address information: Name or service not known\n\nCaused by:\n
failed to lookup address information: Name or service not known`.
The fix reused the cache invalidation logic that the proxy already applies
to regular Postgres connections and added it to HTTP-over-SQL queries.
Also removed the timeout for HTTP requests, because it almost never worked
on staging (50s+ just to start the compute), and we can hit a
similar case in production. This should be OK, since we have limits for
requests and responses.
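A rough sketch of the idea, with hypothetical types and names (not the actual proxy code): when connecting with a cached compute address fails with a name-resolution error, the cache entry is invalidated and the connection is retried with freshly resolved node info.

```rust
use std::collections::HashMap;

// Hypothetical cache of endpoint -> compute address, standing in for the
// proxy's real node-info cache.
struct ComputeCache {
    addrs: HashMap<String, String>,
}

impl ComputeCache {
    fn invalidate(&mut self, endpoint: &str) {
        self.addrs.remove(endpoint);
    }
}

fn is_dns_error(err: &str) -> bool {
    err.contains("failed to lookup address information")
}

// Sketch: the same invalidate-and-retry path used for regular Postgres
// connections, applied to HTTP-over-SQL queries.
fn connect_with_retry(cache: &mut ComputeCache, endpoint: &str) -> Result<(), String> {
    match try_connect(cache, endpoint) {
        Err(e) if is_dns_error(&e) => {
            cache.invalidate(endpoint); // stale address: force re-resolution
            try_connect(cache, endpoint) // single retry with fresh node info
        }
        other => other,
    }
}

fn try_connect(_cache: &mut ComputeCache, _endpoint: &str) -> Result<(), String> {
    // placeholder for actually resolving and connecting to the compute
    Ok(())
}
```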
## Problem
If the script fails to generate a test summary, the step also fails the
job/workflow (even though this could be a non-fatal problem).
## Summary of changes
- Split JSON parsing and summarisation into separate functions
- Wrap the function calls in a try..catch block, add an error message to the
GitHub comment, and do not fail the step
- Make `scripts/comment-test-report.js` a CLI script that can be run
locally (with mocked GitHub calls) to make it easier to debug issues
This is preliminary work for/from #4220 (async `Layer::get_value_reconstruct_data`).
# Full Stack Of Preliminary PRs
Thanks to the countless preliminary PRs, this conversion is relatively
straight-forward.
1. Clean-ups
* https://github.com/neondatabase/neon/pull/4316
* https://github.com/neondatabase/neon/pull/4317
* https://github.com/neondatabase/neon/pull/4318
* https://github.com/neondatabase/neon/pull/4319
* https://github.com/neondatabase/neon/pull/4321
* Note: these were mostly to find an alternative to #4291, which I
thought we'd need in my original plan, where we would have converted
`Tenant::timelines` into an async locking primitive (#4333). In reviews,
we walked away from that, but these cleanups were still quite useful.
2. https://github.com/neondatabase/neon/pull/4364
3. https://github.com/neondatabase/neon/pull/4472
4. https://github.com/neondatabase/neon/pull/4476
5. https://github.com/neondatabase/neon/pull/4477
6. https://github.com/neondatabase/neon/pull/4485
# Significant Changes In This PR
## `compact_level0_phase1` & `create_delta_layer`
This commit partially reverts
"pgserver: spawn_blocking in compaction (#4265)"
4e359db4c7.
Specifically, it reverts the `spawn_blocking`-ification of
`compact_level0_phase1`.
If we didn't revert it, we'd have to use `Timeline::layers.blocking_read()`
inside `compact_level0_phase1`. That would use up a thread in the
`spawn_blocking` thread pool, which is hard-capped.
I considered wrapping the code that follows the second
`layers.read().await` into `spawn_blocking`, but there are lifetime
issues with `deltas_to_compact`.
Also, this PR switches the `create_delta_layer` _function_ back to
async, and uses `spawn_blocking` inside to run the code that does sync
IO, while keeping the code that needs to lock `Timeline::layers` async.
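A minimal sketch of that shape, with hypothetical names and simplified types (not the actual function): the function itself is async, sync file IO runs inside `spawn_blocking`, and the code that needs the `layers` lock stays async.

```rust
use tokio::sync::RwLock;
use tokio::task::spawn_blocking;

struct Layer; // stand-in for the new delta layer

async fn create_delta_layer_sketch(layers: &RwLock<Vec<Layer>>) -> anyhow::Result<()> {
    // Blocking filesystem IO (writing the layer file) goes to the blocking pool...
    let new_layer = spawn_blocking(|| {
        // ... write the delta layer file to disk ...
        Layer
    })
    .await?;

    // ...while locking `layers` stays async, so we never call
    // blocking_read()/blocking_write() from a spawn_blocking thread.
    layers.write().await.push(new_layer);
    Ok(())
}
```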
## `LayerIter` and `LayerKeyIter` `Send` bounds
I had to add a `Send` bound on the `dyn` type that `LayerIter`
and `LayerKeyIter` wrap. Why? Because we now have the second
`layers.read().await` inside `compact_level0_phase1`, and these
iterator instances are held across that await point.
More background:
https://github.com/neondatabase/neon/pull/4462#issuecomment-1587376960
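Roughly why the bound is needed, sketched with simplified stand-in types (not the real `LayerIter`): because the iterator is held across an `.await`, the future containing it must be `Send` to be spawned, which in turn requires the boxed `dyn Iterator` inside to be `Send`.

```rust
// Simplified stand-in for LayerIter / LayerKeyIter: a wrapper around a boxed
// dyn iterator. Without `+ Send`, holding `iter` across the `.await` below
// would make the whole future !Send, and it could not be spawned onto a
// multi-threaded runtime.
struct LayerIterSketch(Box<dyn Iterator<Item = u64> + Send>);

async fn held_across_await(lock: &tokio::sync::RwLock<()>) {
    let iter = LayerIterSketch(Box::new(0..10u64));
    let _guard = lock.read().await; // second read(), as in compact_level0_phase1
    for v in iter.0 {
        let _ = v;
    }
}

fn spawn_it(lock: &'static tokio::sync::RwLock<()>) {
    // tokio::spawn requires the future to be Send; this compiles only because
    // the boxed iterator inside is `+ Send`.
    tokio::spawn(held_across_await(lock));
}
```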
## `DatadirModification::flush`
Needed to replace the `HashMap::retain` with a hand-rolled variant
because `TimelineWriter::put` is now async.
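The constraint, sketched with simplified stand-in types (not the actual `DatadirModification` code): `HashMap::retain` takes a synchronous closure and therefore cannot `.await` the now-async `put`, so the hand-rolled variant collects the keys to flush and awaits per entry instead.

```rust
use std::collections::HashMap;

// Stand-in for the now-async TimelineWriter::put.
async fn put(_key: u64, _value: Vec<u8>) -> anyhow::Result<()> {
    Ok(())
}

fn should_flush(_key: u64) -> bool {
    true // placeholder for the real "flush this entry now?" predicate
}

// `pending.retain(|k, v| { put(*k, v.clone()).await?; false })` does not
// compile: retain's closure is synchronous and cannot await. Instead, collect
// the keys to flush, await `put` for each, and remove them, keeping the rest.
async fn flush_some(pending: &mut HashMap<u64, Vec<u8>>) -> anyhow::Result<()> {
    let to_flush: Vec<u64> = pending
        .keys()
        .copied()
        .filter(|&k| should_flush(k))
        .collect();
    for key in to_flush {
        if let Some(value) = pending.remove(&key) {
            put(key, value).await?;
        }
    }
    Ok(())
}
```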