Commit Graph

3337 Commits

Author SHA1 Message Date
Alek Westover
2e81d280c8 delete comment 2023-06-20 21:50:52 -04:00
Alek Westover
f9700c8bb9 Merge branch 'main' into extension_server 2023-06-20 21:49:41 -04:00
Alek Westover
e6137d45d2 Merge branch 'main' into extension_server 2023-06-20 21:36:15 -04:00
Alek Westover
ab1d903600 remote extra files 2023-06-20 21:28:00 -04:00
Alek Westover
bfd670b9a7 fixed an issue with the wrong path 2023-06-20 21:24:51 -04:00
Alek Westover
5e96ab43ea more debugginh 2023-06-20 20:02:24 -04:00
Alek Westover
890061d371 arg passing is mostly working 2023-06-20 19:35:13 -04:00
Alek Westover
b4c5beff9f list_files function in remote_storage (#4522) 2023-06-20 15:36:28 -04:00
bojanserafimov
90e1f629e8 Add test for skip_pg_catalog_updates (#4530) 2023-06-20 11:38:59 -04:00
Alek Westover
6b74d1a76a partils 2023-06-19 15:25:53 -04:00
Alek Westover
2023e22ed3 Add RelationError error type to pageserver rather than string parsing error messages (#4508) 2023-06-19 13:14:20 -04:00
Christian Schwarz
036fda392f log timings for compact_level0_phase1 (#4527)
The data will help decide whether it's ok
to keep holding Timeline::layers in shared mode until
after we've calculated the holes.

Other timings are to understand the general breakdown
of timings in that function.

Context: https://github.com/neondatabase/neon/issues/4492
2023-06-19 17:25:57 +03:00
Arseny Sher
557abc18f3 Fix test_s3_wal_replay assertion flakiness.
Supposedly fixes https://github.com/neondatabase/neon/issues/4277
2023-06-19 16:08:20 +04:00
Arseny Sher
3b06a5bc54 Raise pageserver walreceiver timeouts.
I observe sporadic reconnections with ~10k idle computes. It looks like a
separate issue, probably walreceiver runtime gets blocked somewhere, but in any
case 2-3 seconds is too small.
2023-06-19 15:59:38 +04:00
Alek Westover
a936b8a92b add ext cli args 2023-06-16 17:05:39 -04:00
Alek Westover
c7bea52849 adding command line argument 2023-06-16 16:58:13 -04:00
Alek Westover
1b7ab6d468 successfully upload and download the test_load extension 2023-06-16 16:51:08 -04:00
Alek Westover
e07d5d00e9 actually write correct data 2023-06-16 16:20:05 -04:00
Alek Westover
15d3d007eb added several imports so that extension_server compiles 2023-06-16 16:09:02 -04:00
Alek Westover
77157c7741 merge tiny suggestion 2023-06-16 15:49:18 -04:00
Alek Westover
b9b1b3596c started working on the tests, ran into some issues 2023-06-16 15:47:49 -04:00
Alexander Bayandin
1b947fc8af test_runner: workaround rerunfailures and timeout incompatibility (#4469)
## Problem

`pytest-timeout` and `pytest-rerunfailures` are incompatible (or rather
not fully compatible). Timeouts aren't set for reruns.

Ref https://github.com/pytest-dev/pytest-rerunfailures/issues/99

## Summary of changes
- Dynamically make timeouts `func_only` for tests that we're going to
retry. It applies timeouts for reruns as well.
2023-06-16 18:08:11 +01:00
Christian Schwarz
78082d0b9f create_delta_layer: avoid needless stat (#4489)
We already do it inside `frozen_layer.write_to_disk()`.

Context:
https://github.com/neondatabase/neon/pull/4441#discussion_r1228083959
2023-06-16 16:54:41 +02:00
Alexander Bayandin
190c3ba610 Add tags for releases (#4524)
## Problem

It's not a trivial task to find corresponding changes for a particular
release (for example, for 3371 — 🤷)

Ref:
https://neondb.slack.com/archives/C04BLQ4LW7K/p1686761537607649?thread_ts=1686736854.174559&cid=C04BLQ4LW7K

## Summary of changes
- Tag releases
- Add a manual trigger for the release workflow
2023-06-16 14:17:37 +01:00
Christian Schwarz
14d495ae14 create_delta_layer: improve misleading TODO comment (#4488)
Context:
https://github.com/neondatabase/neon/pull/4441#discussion_r1228086608
2023-06-16 14:23:55 +03:00
Alek Westover
91a809332f Update libs/remote_storage/src/s3_bucket.rs
Co-authored-by: Alex Chi Z. <iskyzh@gmail.com>
2023-06-15 13:00:07 -04:00
Dmitry Rodionov
472cc17b7a propagate lock guard to background deletion task (#4495)
## Problem

1. During the rollout we got a panic: "timeline that we were deleting
was concurrently removed from 'timelines' map" that was caused by lock
guard not being propagated to the background part of the deletion.
Existing test didnt catch it because failpoint that was used for
verification was placed earlier prior to background task spawning.
2. When looking at surrounding code one more bug was detected. We
removed timeline from the map before deletion is finished, which breaks
client retry logic, because it will indicate 404 before actual deletion
is completed which can lead to client stopping its retry poll earlier.

## Summary of changes

1. Carry the lock guard over to background deletion. Ensure existing
test case fails without applied patch (second deletion becomes stuck
without it, which eventually leads to a test failure).
2. Move delete_all call earlier so timeline is removed from the map is
the last thing done during deletion.

Additionally I've added timeline_id to the `update_gc_info` span,
because `debug_assert_current_span_has_tenant_and_timeline_id` in
`download_remote_layer` was firing when `update_gc_info` lead to
on-demand downloads via `find_lsn_for_timestamp` (caught by @problame).
This is not directly related to the PR but fixes possible flakiness.

Another smaller set of changes involves deletion wrapper used in python
tests. Now there is a simpler wrapper that waits for deletions to
complete `timeline_delete_wait_completed`. Most of the
test_delete_timeline.py tests make negative tests, i.e., "does
ps_http.timeline_delete() fail in this and that scenario".
These can be left alone. Other places when we actually do the deletions,
we need to use the helper that polls for completion.

Discussion
https://neondb.slack.com/archives/C03F5SM1N02/p1686668007396639

resolves #4496

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-06-15 17:30:12 +03:00
Alek Westover
214ecacfc4 clippy 2023-06-15 10:20:01 -04:00
Arthur Petukhovsky
76413a0fb8 Revert reconnect_timeout to improve performance (#4512)
Default value for `wal_acceptor_reconnect_timeout` was changed in
https://github.com/neondatabase/neon/pull/4428 and it affected
performance up to 20% in some cases.

Revert the value back.
2023-06-15 15:26:59 +03:00
Alexander Bayandin
e60b70b475 Fix data ingestion scripts (#4515)
## Problem

When I switched `psycopg2.connect` from context manager to a regular
function call in https://github.com/neondatabase/neon/pull/4382 I 
embarrassingly forgot about commit, so it doesn't really put data into DB 😞

## Summary of changes
- Enable autocommit for data ingestion scripts
2023-06-15 15:01:06 +03:00
Alex Chi Z
2252c5c282 metrics: convert some metrics to pageserver-level (#4490)
## Problem

Some metrics are better to be observed at page-server level. Otherwise,
as we have a lot of tenants in production, we cannot do a sum b/c
Prometheus has limit on how many time series we can aggregate. This also
helps reduce metrics scraping size.

## Summary of changes

Some integration tests are likely not to pass as it will check the
existence of some metrics. Waiting for CI complete and fix them.

Metrics downgraded: page cache hit (where we are likely to have a
page-server level page cache in the future instead of per-tenant), and
reconstruct time (this would better be tenant-level, as we have one pg
replayer for each tenant, but now we make it page-server level as we do
not need that fine-grained data).

---------

Signed-off-by: Alex Chi <iskyzh@gmail.com>
2023-06-14 17:12:34 -04:00
Alek Westover
7465c644b9 started adding a test 2023-06-14 14:54:36 -04:00
Alexander Bayandin
94f315d490 Remove neon-image-depot job (#4506)
## Problem

`neon-image-depot` is an experimental job we use to compare with the
main `neon-image` job.
But it's not stable and right now we don't have the capacity to properly
fix and evaluate it.
We can come back to this later.

## Summary of changes

Remove `neon-image-depot` job
2023-06-14 19:03:09 +01:00
Christian Schwarz
cd3faa8c0c test_basic_eviction: avoid some sources of flakiness (#4504)
We've seen the download_layer() call return 304 in prod because of a
spurious on-demand download caused by a GetPage request from compute.

Avoid these and some other sources of on-demand downloads by shutting
down compute, SKs, and by disabling background loops.

CF
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4498/5258914461/index.html#suites/2599693fa27db8427603ba822bcf2a20/357808fd552fede3
2023-06-14 19:04:22 +02:00
Arthur Petukhovsky
a7a0c3cd27 Invalidate proxy cache in http-over-sql (#4500)
HTTP queries failed with errors `error connecting to server: failed to
lookup address information: Name or service not known\n\nCaused by:\n
failed to lookup address information: Name or service not known`

The fix reused cache invalidation logic in proxy from usual postgres
connections and added it to HTTP-over-SQL queries.

Also removed a timeout for HTTP request, because it almost never worked
on staging (50s+ time just to start the compute), and we can have the
similar case in production. Should be ok, since we have a limits for the
requests and responses.
2023-06-14 19:24:46 +03:00
Dmitry Rodionov
ee9a5bae43 Filter only active timelines for compaction (#4487)
Previously we may've included Stopping/Broken timelines here, which
leads to errors in logs -> causes tests to sporadically fail

resolves #4467
2023-06-14 19:07:42 +03:00
Alek Westover
bb931f2ce0 alek: add download files changes 2023-06-14 10:50:25 -04:00
Alexander Bayandin
9484b96d7c GitHub Autocomment: do not fail the job (#4478)
## Problem

If the script fails to generate a test summary, the step also fails the
job/workflow (despite this could be a non-fatal problem).

## Summary of changes
- Separate JSON parsing and summarisation into separate functions
- Wrap functions calling into try..catch block, add an error message to
GitHub comment and do not fail the step
- Make `scripts/comment-test-report.js` a CLI script that can be run
locally (mock GitHub calls) to make it easier to debug issues locally
2023-06-14 15:07:30 +01:00
Shany Pozin
ebee8247b5 Move s3 delete_objects to use chunks of 1000 OIDs (#4463)
See
https://github.com/neondatabase/neon/pull/4461#pullrequestreview-1474240712
2023-06-14 15:38:01 +03:00
bojanserafimov
3164ad7052 compute_ctl: Spec parser forward compatibility test (#4494) 2023-06-13 21:48:09 -04:00
Alexander Bayandin
a0b3990411 Retry data ingestion scripts on connection errors (#4382)
## Problem

From time to time, we're catching a race condition when trying to upload
perf or regression test results.

Ref:
- https://neondb.slack.com/archives/C03H1K0PGKH/p1685462717870759
- https://github.com/neondatabase/cloud/issues/3686

## Summary of changes

Wrap `psycopg2.connect` method with `@backoff.on_exception`
contextmanager
2023-06-13 22:33:42 +01:00
Stas Kelvich
4385e0c291 Return more RowDescription fields via proxy json endpoint
As we aim to align client-side behavior with node-postgres, it's necessary for us to return these fields, because node-postgres does so as well.
2023-06-13 22:31:18 +03:00
Christian Schwarz
3693d1f431 turn Timeline::layers into tokio::sync::RwLock (#4441)
This is preliminary work for/from #4220 (async `Layer::get_value_reconstruct_data`).

# Full Stack Of Preliminary PRs

Thanks to the countless preliminary PRs, this conversion is relatively
straight-forward.

1. Clean-ups
  * https://github.com/neondatabase/neon/pull/4316
  * https://github.com/neondatabase/neon/pull/4317
  * https://github.com/neondatabase/neon/pull/4318
  * https://github.com/neondatabase/neon/pull/4319
  * https://github.com/neondatabase/neon/pull/4321
* Note: these were mostly to find an alternative to #4291, which I
   thought we'd need in my original plan where we would need to convert
   `Tenant::timelines` into an async locking primitive (#4333). In reviews,
   we walked away from that, but these cleanups were still quite useful.
2. https://github.com/neondatabase/neon/pull/4364
3. https://github.com/neondatabase/neon/pull/4472
4. https://github.com/neondatabase/neon/pull/4476
5. https://github.com/neondatabase/neon/pull/4477
6. https://github.com/neondatabase/neon/pull/4485

# Significant Changes In This PR

## `compact_level0_phase1` & `create_delta_layer`

This commit partially reverts

   "pgserver: spawn_blocking in compaction (#4265)"
    4e359db4c7.

Specifically, it reverts the `spawn_blocking`-ificiation of
`compact_level0_phase1`.
If we didn't revert it, we'd have to use `Timeline::layers.blocking_read()`
inside `compact_level0_phase1`. That would use up a thread in the
`spawn_blocking` thread pool, which is hard-capped.

I considered wrapping the code that follows the second
`layers.read().await` into `spawn_blocking`, but there are lifetime
issues with `deltas_to_compact`.

Also, this PR switches the `create_delta_layer` _function_ back to
async, and uses `spawn_blocking` inside to run the code that does sync
IO, while keeping the code that needs to lock `Timeline::layers` async.

## `LayerIter` and `LayerKeyIter` `Send` bounds

I had to add a `Send` bound on the `dyn` type that `LayerIter`
and `LayerKeyIter` wrap. Why? Because we now have the second
`layers.read().await` inside `compact_level0_phase`, and these
iterator instances are held across that await-point.

More background:
https://github.com/neondatabase/neon/pull/4462#issuecomment-1587376960

## `DatadirModification::flush`

Needed to replace the `HashMap::retain` with a hand-rolled variant
because `TimelineWriter::put` is now async.
2023-06-13 18:38:41 +02:00
Anastasia Lubennikova
8013f9630d code cleanup 2023-06-13 18:23:15 +02:00
Anastasia Lubennikova
34f22e9b12 Request extension files from compute_ctl 2023-06-13 16:52:37 +02:00
Christian Schwarz
fdf7a67ed2 init_empty_layer_map: use try_write (#4485)
This is preliminary work for/from #4220 (async
`Layer::get_value_reconstruct_data`).
Or more specifically, #4441, where we turn Timeline::layers into a
tokio::sync::RwLock.

By using try_write() here, we can avoid turning init_empty_layer_map
async,
which is nice because much of its transitive call(er) graph isn't async.
2023-06-13 13:49:40 +02:00
Alexey Kondratov
1299df87d2 [compute_ctl] Fix logging if catalog updates are skipped (#4480)
Otherwise, it wasn't clear from the log when Postgres started up
completely if catalog updates were skipped.

Follow-up for 4936ab6
2023-06-13 13:34:56 +02:00
Christian Schwarz
754ceaefac make TimelineWriter Send by using tokio::sync Mutex internally (#4477)
This is preliminary work for/from #4220 (async
`Layer::get_value_reconstruct_data`).

There, we want to switch `Timeline::layers` to be a
`tokio::sync::RwLock`.

That will require the `TimelineWriter` to become async, because at times
its functions need to lock `Timeline::layers` in order to freeze the
open layer.

While doing that, rustc complains that we're now holding
`Timeline::write_lock` across await points (lock order is that
`write_lock` must be acquired before `Timelines::layers`).

So, we need to switch it over to an async primitive.
2023-06-13 10:15:25 +02:00
Arseny Sher
143fa0da42 Remove timeout on test_close_on_connections_exit
We have 300s timeout on all tests, and doubling logic in popen.wait sometimes
exceeds 5s, making the test flaky.

ref https://github.com/neondatabase/neon/issues/4211
2023-06-13 06:26:03 +04:00
bojanserafimov
4936ab6842 compute_ctl: add flag to avoid config step (#4457)
Add backwards-compatible flag that cplane can use to speed up startup time
2023-06-12 13:57:02 -04:00