Commit Graph

3350 Commits

Author SHA1 Message Date
Alek Westover
89b8ea132e refactor more 2023-06-21 11:39:24 -04:00
Alek Westover
bfbae98f24 refactor 2023-06-21 11:32:44 -04:00
Alek Westover
02a1d4d8c1 refactoring a bit 2023-06-21 11:32:04 -04:00
Alek Westover
4a35f29301 code style 2023-06-21 11:07:27 -04:00
Alek Westover
559e318328 remove dead imports 2023-06-21 11:01:45 -04:00
Alek Westover
a4d236b02f finishing debugging cleanup 2023-06-21 11:00:36 -04:00
Alek Westover
8b9f72e117 removing debugging 2023-06-21 10:51:35 -04:00
Alek Westover
bb414e5a0a removing debugging 2023-06-21 10:45:37 -04:00
Alek Westover
32c03bc784 cleaning up some comments 2023-06-21 10:32:28 -04:00
Alek Westover
c99e203094 I think it's working 2023-06-21 10:10:02 -04:00
Alek Westover
e7b9259675 more debugging, didn't find the problem 2023-06-20 22:48:43 -04:00
Alek Westover
356f7d3a7e more debugging 2023-06-20 22:39:02 -04:00
Alek Westover
0f6b05337e fixed minor issue with merge 2023-06-20 22:08:07 -04:00
Alek Westover
2e81d280c8 delete comment 2023-06-20 21:50:52 -04:00
Alek Westover
f9700c8bb9 Merge branch 'main' into extension_server 2023-06-20 21:49:41 -04:00
Alek Westover
e6137d45d2 Merge branch 'main' into extension_server 2023-06-20 21:36:15 -04:00
Alek Westover
ab1d903600 remove extra files 2023-06-20 21:28:00 -04:00
Alek Westover
bfd670b9a7 fixed an issue with the wrong path 2023-06-20 21:24:51 -04:00
Alek Westover
5e96ab43ea more debugging 2023-06-20 20:02:24 -04:00
Alek Westover
890061d371 arg passing is mostly working 2023-06-20 19:35:13 -04:00
Alek Westover
b4c5beff9f list_files function in remote_storage (#4522) 2023-06-20 15:36:28 -04:00
bojanserafimov
90e1f629e8 Add test for skip_pg_catalog_updates (#4530) 2023-06-20 11:38:59 -04:00
Alek Westover
6b74d1a76a partials 2023-06-19 15:25:53 -04:00
Alek Westover
2023e22ed3 Add RelationError error type to pageserver rather than string parsing error messages (#4508) 2023-06-19 13:14:20 -04:00
Christian Schwarz
036fda392f log timings for compact_level0_phase1 (#4527)
The data will help decide whether it's ok
to keep holding Timeline::layers in shared mode until
after we've calculated the holes.

The other timings are there to understand the general breakdown
of time spent in that function.

Context: https://github.com/neondatabase/neon/issues/4492
2023-06-19 17:25:57 +03:00
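As an illustration only, the shape of this kind of per-phase timing instrumentation, sketched in Python (the real code is Rust inside the pageserver, and the phase names here are assumptions):

```python
import logging
import time

log = logging.getLogger(__name__)

def compact_level0_phase1():
    # Hypothetical stand-in for the Rust function; only the timing
    # pattern is the point here.
    t0 = time.monotonic()
    # ... collect level-0 layers while holding Timeline::layers shared ...
    t_collect = time.monotonic()
    # ... calculate the holes ...
    t_holes = time.monotonic()
    log.info(
        "compact_level0_phase1 timings: collect=%.3fs holes=%.3fs total=%.3fs",
        t_collect - t0, t_holes - t_collect, t_holes - t0,
    )
```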
Arseny Sher
557abc18f3 Fix test_s3_wal_replay assertion flakiness.
Supposedly fixes https://github.com/neondatabase/neon/issues/4277
2023-06-19 16:08:20 +04:00
Arseny Sher
3b06a5bc54 Raise pageserver walreceiver timeouts.
I observe sporadic reconnections with ~10k idle computes. It looks like a
separate issue, probably the walreceiver runtime gets blocked somewhere, but in
any case 2-3 seconds is too short.
2023-06-19 15:59:38 +04:00
Alek Westover
a936b8a92b add ext cli args 2023-06-16 17:05:39 -04:00
Alek Westover
c7bea52849 adding command line argument 2023-06-16 16:58:13 -04:00
Alek Westover
1b7ab6d468 successfully upload and download the test_load extension 2023-06-16 16:51:08 -04:00
Alek Westover
e07d5d00e9 actually write correct data 2023-06-16 16:20:05 -04:00
Alek Westover
15d3d007eb added several imports so that extension_server compiles 2023-06-16 16:09:02 -04:00
Alek Westover
77157c7741 merge tiny suggestion 2023-06-16 15:49:18 -04:00
Alek Westover
b9b1b3596c started working on the tests, ran into some issues 2023-06-16 15:47:49 -04:00
Alexander Bayandin
1b947fc8af test_runner: workaround rerunfailures and timeout incompatibility (#4469)
## Problem

`pytest-timeout` and `pytest-rerunfailures` are incompatible (or rather
not fully compatible). Timeouts aren't set for reruns.

Ref https://github.com/pytest-dev/pytest-rerunfailures/issues/99

## Summary of changes
- Dynamically make timeouts `func_only` for tests that we're going to
retry, so that timeouts apply to reruns as well.
2023-06-16 18:08:11 +01:00
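A minimal sketch of the idea in a `conftest.py`, assuming reruns are driven by a hypothetical `flaky` marker and a pytest-timeout version whose `timeout` marker accepts `func_only` (this is not the actual test_runner code):

```python
# conftest.py -- a sketch, not the actual test_runner implementation
import pytest

def pytest_collection_modifyitems(config, items):
    for item in items:
        # Only touch tests that pytest-rerunfailures is going to retry.
        if item.get_closest_marker("flaky") is None:
            continue
        timeout_marker = item.get_closest_marker("timeout")
        if timeout_marker is not None:
            # func_only=True restricts the timeout to the test function
            # itself, so the timer is re-armed on every rerun instead of
            # covering the whole rerun-inclusive item run.
            item.add_marker(
                pytest.mark.timeout(*timeout_marker.args, func_only=True)
            )
```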
Christian Schwarz
78082d0b9f create_delta_layer: avoid needless stat (#4489)
We already do it inside `frozen_layer.write_to_disk()`.

Context:
https://github.com/neondatabase/neon/pull/4441#discussion_r1228083959
2023-06-16 16:54:41 +02:00
Alexander Bayandin
190c3ba610 Add tags for releases (#4524)
## Problem

It's not a trivial task to find corresponding changes for a particular
release (for example, for 3371 — 🤷)

Ref:
https://neondb.slack.com/archives/C04BLQ4LW7K/p1686761537607649?thread_ts=1686736854.174559&cid=C04BLQ4LW7K

## Summary of changes
- Tag releases
- Add a manual trigger for the release workflow
2023-06-16 14:17:37 +01:00
Christian Schwarz
14d495ae14 create_delta_layer: improve misleading TODO comment (#4488)
Context:
https://github.com/neondatabase/neon/pull/4441#discussion_r1228086608
2023-06-16 14:23:55 +03:00
Alek Westover
91a809332f Update libs/remote_storage/src/s3_bucket.rs
Co-authored-by: Alex Chi Z. <iskyzh@gmail.com>
2023-06-15 13:00:07 -04:00
Dmitry Rodionov
472cc17b7a propagate lock guard to background deletion task (#4495)
## Problem

1. During the rollout we got a panic, "timeline that we were deleting
was concurrently removed from 'timelines' map", caused by the lock
guard not being propagated to the background part of the deletion.
The existing test didn't catch it because the failpoint used for
verification was placed before the background task was spawned.
2. While looking at the surrounding code, one more bug was detected. We
removed the timeline from the map before deletion finished, which breaks
client retry logic: the API indicates 404 before the actual deletion is
completed, which can lead to the client stopping its retry poll too early.

## Summary of changes

1. Carry the lock guard over to the background deletion. Ensure the existing
test case fails without the patch applied (the second deletion becomes stuck,
which eventually leads to a test failure).
2. Move the delete_all call earlier, so that removing the timeline from the
map is the last thing done during deletion.

Additionally I've added timeline_id to the `update_gc_info` span,
because `debug_assert_current_span_has_tenant_and_timeline_id` in
`download_remote_layer` was firing when `update_gc_info` lead to
on-demand downloads via `find_lsn_for_timestamp` (caught by @problame).
This is not directly related to the PR but fixes possible flakiness.

Another, smaller set of changes involves the deletion wrapper used in the
Python tests. There is now a simpler wrapper, `timeline_delete_wait_completed`,
that waits for deletions to complete. Most of the test_delete_timeline.py
tests are negative tests, i.e., "does ps_http.timeline_delete() fail in this
and that scenario", and these can be left alone. Everywhere else, where we
actually do deletions, we need to use the helper that polls for completion
(a sketch of such a wrapper follows this commit entry).

Discussion
https://neondb.slack.com/archives/C03F5SM1N02/p1686668007396639

resolves #4496

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-06-15 17:30:12 +03:00
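The polling wrapper mentioned above might look roughly like this; a hypothetical Python sketch that talks to the pageserver management API directly (the URL path and parameters are assumptions, and the real helper lives in the Python test fixtures):

```python
import time
import requests

def timeline_delete_wait_completed(base_url, tenant_id, timeline_id,
                                   timeout_s=60.0, interval_s=0.5):
    """Issue the delete, then poll until the timeline actually disappears."""
    url = f"{base_url}/v1/tenant/{tenant_id}/timeline/{timeline_id}"
    requests.delete(url).raise_for_status()
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # With the fix, 404 means the deletion has completed,
        # not merely started.
        if requests.get(url).status_code == 404:
            return
        time.sleep(interval_s)
    raise TimeoutError(f"timeline {timeline_id} still present after {timeout_s}s")
```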
Alek Westover
214ecacfc4 clippy 2023-06-15 10:20:01 -04:00
Arthur Petukhovsky
76413a0fb8 Revert reconnect_timeout to improve performance (#4512)
The default value for `wal_acceptor_reconnect_timeout` was changed in
https://github.com/neondatabase/neon/pull/4428, and it degraded
performance by up to 20% in some cases.

Revert the value back.
2023-06-15 15:26:59 +03:00
Alexander Bayandin
e60b70b475 Fix data ingestion scripts (#4515)
## Problem

When I switched `psycopg2.connect` from a context manager to a regular
function call in https://github.com/neondatabase/neon/pull/4382, I
embarrassingly forgot about the commit, so the scripts don't actually put data into the DB 😞

## Summary of changes
- Enable autocommit for data ingestion scripts
2023-06-15 15:01:06 +03:00
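For context, the pitfall in miniature (the connection string and table name are placeholders):

```python
import psycopg2

# Used as a context manager, psycopg2 commits the transaction on a
# successful exit from the block:
with psycopg2.connect("dbname=example") as conn:
    with conn.cursor() as cur:
        cur.execute("INSERT INTO t VALUES (1)")
# committed here

# Used as a plain call, nothing is committed unless you do it yourself:
conn = psycopg2.connect("dbname=example")
conn.autocommit = True  # the fix: commit each statement implicitly
with conn.cursor() as cur:
    cur.execute("INSERT INTO t VALUES (1)")  # now actually persisted
conn.close()
```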
Alex Chi Z
2252c5c282 metrics: convert some metrics to pageserver-level (#4490)
## Problem

Some metrics are better observed at the pageserver level. Otherwise,
since we have a lot of tenants in production, we cannot compute a sum,
because Prometheus has a limit on how many time series we can aggregate.
This also helps reduce the metrics scraping size.

## Summary of changes

Some integration tests are likely to fail, as they check for the
existence of some metrics. Waiting for CI to complete, then fixing them.

Metrics downgraded: page cache hits (we are likely to have a
pageserver-level page cache in the future instead of per-tenant ones), and
reconstruct time (this would be better as tenant-level, since we have one pg
replayer for each tenant, but for now we make it pageserver-level as we do
not need that fine-grained data).

---------

Signed-off-by: Alex Chi <iskyzh@gmail.com>
2023-06-14 17:12:34 -04:00
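To illustrate the cardinality trade-off, a Python `prometheus_client` sketch with made-up metric names (the actual pageserver metrics are implemented in Rust):

```python
from prometheus_client import Counter

# Per-tenant: one time series per (tenant_id, timeline_id) combination.
# With thousands of tenants this blows up the series count, making
# scrapes large and Prometheus-side sums expensive or impossible.
page_cache_hits_per_tenant = Counter(
    "page_cache_hits", "Page cache hits",
    ["tenant_id", "timeline_id"],
)

# Pageserver-level: a single series per process, cheap to scrape and sum.
page_cache_hits_total = Counter(
    "pageserver_page_cache_hits", "Page cache hits across the pageserver",
)

page_cache_hits_per_tenant.labels("tenant-a", "timeline-1").inc()
page_cache_hits_total.inc()
```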
Alek Westover
7465c644b9 started adding a test 2023-06-14 14:54:36 -04:00
Alexander Bayandin
94f315d490 Remove neon-image-depot job (#4506)
## Problem

`neon-image-depot` is an experimental job we use to compare with the
main `neon-image` job.
But it's not stable and right now we don't have the capacity to properly
fix and evaluate it.
We can come back to this later.

## Summary of changes

Remove `neon-image-depot` job
2023-06-14 19:03:09 +01:00
Christian Schwarz
cd3faa8c0c test_basic_eviction: avoid some sources of flakiness (#4504)
We've seen the download_layer() call return 304 in prod because of a
spurious on-demand download caused by a GetPage request from compute.

Avoid these and some other sources of on-demand downloads by shutting
down the compute and the SKs, and by disabling background loops.

CF
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4498/5258914461/index.html#suites/2599693fa27db8427603ba822bcf2a20/357808fd552fede3
2023-06-14 19:04:22 +02:00
Arthur Petukhovsky
a7a0c3cd27 Invalidate proxy cache in http-over-sql (#4500)
HTTP queries failed with errors `error connecting to server: failed to
lookup address information: Name or service not known\n\nCaused by:\n
failed to lookup address information: Name or service not known`

The fix reuses the proxy's cache invalidation logic from regular Postgres
connections and applies it to HTTP-over-SQL queries.

Also removed the timeout for HTTP requests, because it almost never worked
on staging (50s+ just to start the compute), and we can have a
similar case in production. Should be OK, since we have limits for the
requests and responses.
2023-06-14 19:24:46 +03:00
Dmitry Rodionov
ee9a5bae43 Filter only active timelines for compaction (#4487)
Previously we may have included Stopping/Broken timelines here, which
leads to errors in the logs and causes tests to sporadically fail

resolves #4467
2023-06-14 19:07:42 +03:00
Alek Westover
bb931f2ce0 alek: add download files changes 2023-06-14 10:50:25 -04:00