Commit Graph

747 Commits

Author SHA1 Message Date
bojanserafimov
520046f5bd cold starts: Add sync-safekeepers fast path (#4804) 2023-07-25 19:44:18 -04:00
Alex Chi Z
bcc2aee704 proxy: add tests for batch http sql (#4793)
This PR adds an integration test case for batch HTTP SQL endpoint.
https://github.com/neondatabase/neon/pull/4654/ should be merged first
before we land this PR.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2023-07-25 15:08:24 +00:00
Dmitry Rodionov
6d023484ed Use mark file to allow for deletion operations to continue through restarts (#4552)
## Problem

Currently we delete local files first, so if pageserver restarts after
local files deletion then remote deletion is not continued. This can be
solved with inversion of these steps.

But even if these steps are inverted when index_part.json is deleted
there is no way to distinguish between "this timeline is good, we just
didnt upload it to remote" and "this timeline is deleted we should
continue with removal of local state". So to solve it we use another
mark file. After index part is deleted presence of this mark file
indentifies that it was a deletion intention.

Alternative approach that was discussed was to delete all except
metadata first, and then delete metadata and index part. In this case we
still do not support local only configs making them rather unsafe
(deletion in them is already unsafe, but this direction solidifies this
direction instead of fixing it). Another downside is that if we crash
after local metadata gets removed we may leave dangling index part on
the remote which in theory shouldnt be a big deal because the file is
small.

It is not a big change to choose another approach at this point.

## Summary of changes

Timeline deletion sequence:
1. Set deleted_at in remote index part.
2. Create local mark file.
3. Delete local files except metadata (it is simpler this way, to be
able to reuse timeline initialization code that expects metadata)
4. Delete remote layers
5. Delete index part
6. Delete meta, timeline directory.
7. Delete mark file.

This works for local only configuration without remote storage.
Sequence is resumable from any point.

resolves #4453
resolves https://github.com/neondatabase/neon/pull/4552 (the issue was
created with async cancellation in mind, but we can still have issues
with retries if metadata is deleted among the first by remove_dir_all
(which doesnt have any ordering guarantees))

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-07-25 16:25:27 +03:00
cui fliter
f2e2b8a7f4 fix some typos (#4662)
Typos fix

Signed-off-by: cui fliter <imcusg@gmail.com>
2023-07-25 14:39:29 +03:00
Joonas Koivunen
77a68326c5 Thin out TenantState metric, keep set of broken tenants (#4796)
We currently have a timeseries for each of the tenants in different
states. We only want this for Broken. Other states could be counters.

Fix this by making the `pageserver_tenant_states_count` a counter
without a `tenant_id` and
add a `pageserver_broken_tenants_count` which has a `tenant_id` label,
each broken tenant being 1.
2023-07-25 11:15:54 +03:00
Joonas Koivunen
294b8a8fde Convert per timeline metrics to global (#4769)
Cut down the per-(tenant, timeline) histograms by making them global:

- `pageserver_getpage_get_reconstruct_data_seconds`
- `pageserver_read_num_fs_layers`
- `pageserver_remote_operation_seconds`
- `pageserver_remote_timeline_client_calls_started`
- `pageserver_wait_lsn_seconds`
- `pageserver_io_operations_seconds`

---------

Co-authored-by: Shany Pozin <shany@neon.tech>
2023-07-25 00:43:27 +03:00
Konstantin Knizhnik
457e3a3ebc Mx offset bug (#4775)
Fix mx_offset_to_flags_offset() function

Fixes issue #4774

Postgres `MXOffsetToFlagsOffset` was not correctly converted to Rust
because cast to u16 is done before division by modulo. It is possible
only if divider is power of two.

Add a small rust unit test to check that the function produces same results
as the PostgreSQL macro, and extend the existing python test to cover
this bug.

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2023-07-21 21:20:53 +03:00
Alex Chi Z
3fc3666df7 make flush frozen layer an atomic operation (#4720)
## Problem

close https://github.com/neondatabase/neon/issues/4712

## Summary of changes

Previously, when flushing frozen layers, it was split into two
operations: add delta layer to disk + remove frozen layer from memory.
This would cause a short period of time where we will have the same data
both in frozen and delta layer. In this PR, we merge them into one
atomic operation in layer map manager, therefore simplifying the code.
Note that if we decide to create image layers for L0 flush, it will
still be split into two operations on layer map.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-07-20 13:39:19 -04:00
bojanserafimov
e1061879aa Improve startup python test (#4757) 2023-07-19 23:46:16 -04:00
Arseny Sher
921bb86909 Use safekeeper tenant only port in all tests and actually test it.
Compute now uses special safekeeper WAL service port allowing auth tokens with
only tenant scope. Adds understanding of this port to neon_local and fixtures,
as well as test of both ports behaviour with different tokens.

ref https://github.com/neondatabase/neon/issues/4730
2023-07-19 06:03:51 +04:00
Joonas Koivunen
762a8a7bb5 python: more linting (#4734)
Ruff has "B" class of lints, including B018 which will nag on useless
expressions, related to #4719. Enable such lints and fix the existing
issues.

Most notably:
- https://beta.ruff.rs/docs/rules/mutable-argument-default/
- https://beta.ruff.rs/docs/rules/assert-false/

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2023-07-18 12:56:40 +03:00
Alex Chi Z
1066bca5e3 compaction: allow duplicated layers and skip in replacement (#4696)
## Problem

Compactions might generate files of exactly the same name as before
compaction due to our naming of layer files. This could have already
caused some mess in the system, and is known to cause some issues like
https://github.com/neondatabase/neon/issues/4088. Therefore, we now
consider duplicated layers in the post-compaction process to avoid
violating the layer map duplicate checks.

related previous works: close
https://github.com/neondatabase/neon/pull/4094
error reported in: https://github.com/neondatabase/neon/issues/4690,
https://github.com/neondatabase/neon/issues/4088

## Summary of changes

If a file already exists in the layer map before the compaction, do not
modify the layer map and do not delete the file. The file on disk at
that time should be the new one overwritten by the compaction process.

This PR also adds a test case with a fail point that produces exactly
the same set of files.

This bypassing behavior is safe because the produced layer files have
the same content / are the same representation of the original file.

An alternative might be directly removing the duplicate check in the
layer map, but I feel it would be good if we can prevent that in the
first place.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@garret.ru>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-07-17 17:26:29 +03:00
arpad-m
982fce1e72 Fix rustdoc warnings and test cargo doc in CI (#4711)
## Problem

`cargo +nightly doc` is giving a lot of warnings: broken links, naked
URLs, etc.

## Summary of changes

* update the `proc-macro2` dependency so that it can compile on latest
Rust nightly, see https://github.com/dtolnay/proc-macro2/pull/391 and
https://github.com/dtolnay/proc-macro2/issues/398
* allow the `private_intra_doc_links` lint, as linking to something
that's private is always more useful than just mentioning it without a
link: if the link breaks in the future, at least there is a warning due
to that. Also, one might enable
[`--document-private-items`](https://doc.rust-lang.org/cargo/commands/cargo-doc.html#documentation-options)
in the future and make these links work in general.
* fix all the remaining warnings given by `cargo +nightly doc`
* make it possible to run `cargo doc` on stable Rust by updating
`opentelemetry` and associated crates to version 0.19, pulling in a fix
that previously broke `cargo doc` on stable:
https://github.com/open-telemetry/opentelemetry-rust/pull/904
* Add `cargo doc` to CI to ensure that it won't get broken in the
future.

Fixes #2557

## Future work
* Potentially, it might make sense, for development purposes, to publish
the generated rustdocs somewhere, like for example [how the rust
compiler does
it](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_driver/index.html).
I will file an issue for discussion.
2023-07-15 05:11:25 +03:00
Joonas Koivunen
9a69b6cb94 Demote deletion warning, list files (#4688)
Handle test failures like:

```
AssertionError: assert not ['$ts  WARN delete_timeline{tenant_id=X timeline_id=Y}: About to remove 1 files\n']
```

Instead of logging:

```
WARN delete_timeline{tenant_id=X timeline_id=Y}: Found 1 files not bound to index_file.json, proceeding with their deletion
WARN delete_timeline{tenant_id=X timeline_id=Y}: About to remove 1 files
```

For each one operation of timeline deletion, list all unref files with
`info!`, and then continue to delete them with the added spice of
logging the rare/never happening non-utf8 name with `warn!`.

Rationale for `info!` instead of `warn!`: this is a normal operation;
like we had mentioned in `test_import.py` -- basically whenever we
delete a timeline which is not idle.

Rationale for N * (`ìnfo!`|`warn!`): symmetry for the layer deletions;
if we could ever need those, we could also need these for layer files
which are not yet mentioned in `index_part.json`.

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-07-14 18:59:16 +03:00
arpad-m
664d32eb7f Don't propagate but log file not found error in layer uploading (#4694)
This addresses the issue in #4526 by adding a test that reproduces the
race condition that gave rise to the bug (or at least *a* race condition
that gave rise to the same error message), and then implementing a fix
that just prints a message to the log if a file could not been found for
uploading. Even though the underlying race condition is not fixed yet,
this will un-block the upload queue in that situation, greatly reducing
the impact of such a (rare) race.

Fixes #4526.
2023-07-12 18:10:49 +02:00
Alexander Bayandin
ed845b644b Prevent unintentional Postgres submodule update (#4692)
## Problem

Postgres submodule can be changed unintentionally, and these changes are
easy to miss during the review.

Adding a check that should prevent this from happening, the check fails
`build-neon` job with the following message:
```
Expected postgres-v14 rev to be at '1414141414141414141414141414141414141414', but it is at '1144aee1661c79eec65e784a8dad8bd450d9df79'
Expected postgres-v15 rev to be at '1515151515151515151515151515151515151515', but it is at '1984832c740a7fa0e468bb720f40c525b652835d'
Please update vendors/revisions.json if these changes are intentional.
```
This is an alternative approach to
https://github.com/neondatabase/neon/pull/4603

## Summary of changes
- Add `vendor/revisions.json` file with expected revisions
- Add built-time check (to `build-neon` job) that Postgres submodules
match revisions from `vendor/revisions.json`
- A couple of small improvements for logs from
https://github.com/neondatabase/neon/pull/4603
- Fixed GitHub autocomment for no tests was run case

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-07-12 15:12:37 +01:00
bojanserafimov
92aee7e07f cold starts: basebackup compression (#4482)
Co-authored-by: Alex Chi Z <iskyzh@gmail.com>
2023-07-11 13:11:23 -04:00
Alexander Bayandin
33c2d94ba6 Fix git-env version for PRs (#4641)
## Problem

Binaries created from PRs (both in docker images and for tests) have 
wrong git-env versions, they point to phantom merge commits.

## Summary of changes
- Prefer GIT_VERSION env variable even if git information was accessible
- Use `${{ github.event.pull_request.head.sha || github.sha }}` instead
of `${{ github.sha }}` for `GIT_VERSION` in workflows

So the builds will still happen from this phantom commit, but we will
report the PR commit.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-07-10 20:01:01 +01:00
Christian Schwarz
505aa242ac page cache: add size metrics (#4629)
Make them a member of `struct PageCache` to prepare for a future
where there's no global state.
2023-07-05 15:36:42 +03:00
Christian Schwarz
3f9defbfb4 page cache: add access & hit rate metrics (#4628)
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
2023-07-05 10:38:32 +02:00
bojanserafimov
c7143dbde6 compute_ctl: Fix misleading metric (#4608) 2023-07-04 19:07:36 -04:00
Alexander Bayandin
78a7f68902 Make pg_version and build_type regular parameters (#4311)
## Problem

All tests have already been parametrised by Postgres version and build
type (to have them distinguishable in the Allure report), but despite
it, it's anyway required to have DEFAULT_PG_VERSION and BUILD_TYPE env
vars set to corresponding values, for example to
run`test_timeline_deletion_with_files_stuck_in_upload_queue[release-pg14-local_fs]`
test it's required to set `DEFAULT_PG_VERSION=14` and
`BUILD_TYPE=release`.
This PR makes the test framework pick up parameters from the test name
itself.

## Summary of changes
- Postgres version and build type related fixtures now are
function-scoped (instead of being sessions scoped before)
- Deprecate `--pg-version` argument in favour of DEFAULT_PG_VERSION env
variable (it's easier to parse)
- GitHub autocomment now includes only one command with all the failed
tests + runs them in parallel
2023-07-03 13:51:40 +01:00
Christian Schwarz
24eaa3b7ca timeline creation: reflect failures due to ancestor LSN issues in status code (#4600)
Before, it was a `500` and control plane would retry, wereas it actually
should have stopped retrying.

(Stacked on top of https://github.com/neondatabase/neon/pull/4597 )

fixes https://github.com/neondatabase/neon/issues/4595
part of https://github.com/neondatabase/cloud/issues/5626

---------

Co-authored-by: Shany Pozin <shany@neon.tech>
2023-07-03 15:21:10 +03:00
Sasha Krassovsky
ff1a1aea86 Make control plane connector send encrypted password (#4607)
Control plane needs the encrypted hash that postgres itself generates
2023-06-30 14:17:44 -07:00
bojanserafimov
9de1a6fb14 cold starts: Run sync_safekeepers on compute_ctl shutdown (#4588) 2023-06-30 16:29:47 -04:00
Dmitry Rodionov
b990200496 tests: use shortcut everywhere to get timeline path (#4586) 2023-06-30 15:01:06 +03:00
Alex Chi Z
7e20b49da4 refactor: use LayerDesc in LayerMap (part 2) (#4437)
## Problem

part of https://github.com/neondatabase/neon/issues/4392, continuation
of https://github.com/neondatabase/neon/pull/4408

## Summary of changes

This PR removes all layer objects from LayerMap and moves it to the
timeline struct. In timeline struct, LayerFileManager maps a layer
descriptor to a layer object, and it is stored in the same RwLock as
LayerMap to avoid behavior difference.

Key changes:

* LayerMap now does not have generic, and only stores descriptors.
* In Timeline, we add a new struct called layer mapping.
* Currently, layer mapping is stored in the same lock with layer map.
Every time we retrieve data from the layer map, we will need to map the
descriptor to the actual object.
* Replace_historic is moved to layer mapping's replace, and the return
value behavior is different from before. I'm a little bit unsure about
this part and it would be good to have some comments on that.
* Some test cases are rewritten to adapt to the new interface, and we
can decide whether to remove it in the future because it does not make
much sense now.
* LayerDescriptor is moved to `tests` module and should only be intended
for unit testing / benchmarks.
* Because we now have a usage pattern like "take the guard of lock, then
get the reference of two fields", we want to avoid dropping the
incorrect object when we intend to unlock the lock guard. Therefore, a
new set of helper function `drop_r/wlock` is added. This can be removed
in the future when we finish the refactor.

TODOs after this PR: fully remove RemoteLayer, and move LayerMapping to
a separate LayerCache.

all refactor PRs:

```
#4437 --- #4479 ------------ #4510 (refactor done at this point)
      \-- #4455 -- #4502 --/
```

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2023-06-29 15:06:07 -04:00
Shany Pozin
ec9b585837 Add Activating as a possible state for attaching a tenant in test_tenant_relocation validation (#4581)
Fix flakyness of test_tenant_relocation
2023-06-28 12:35:52 +03:00
Dmitry Rodionov
681c6910c2 Straighten the spec for timeline delete (#4538)
## Problem

Lets keep 500 for unusual stuff that is not considered normal. Came up
during one of the discussions around console logs now seeing this 500's.

## Summary of changes

- Return 409 Conflict instead of 500
- Remove 200 ok status because it is not used anymore
2023-06-27 13:56:32 +03:00
Sasha Krassovsky
c215389f1c quote_ident identifiers when creating neon_superuser (#4562)
## Problem
2023-06-24 10:34:15 +03:00
Sasha Krassovsky
b1477b4448 Create neon_superuser role, grant it to roles created from control plane (#4425)
## Problem
Currently, if a user creates a role, it won't by default have any grants
applied to it. If the compute restarts, the grants get applied. This
gives a very strange UX of being able to drop roles/not have any access
to anything at first, and then once something triggers a config
application, suddenly grants are applied. This removes these grants.
2023-06-24 01:38:27 +03:00
Christian Schwarz
76718472be add pageserver-global histogram for basebackup latency (#4559)
The histogram distinguishes by ok/err.

I took the liberty to create a small abstraction for such use cases.
It helps keep the label values inside `metrics.rs`, right next
to the place where the metric and its labels are declared.
2023-06-23 16:42:12 +02:00
Alexander Bayandin
d3aa8a48ea Update client libs for test_runner/pg_clients to their latest versions (#4547)
Resolves https://github.com/neondatabase/neon/security/dependabot/27
2023-06-21 16:20:35 +01:00
bojanserafimov
90e1f629e8 Add test for skip_pg_catalog_updates (#4530) 2023-06-20 11:38:59 -04:00
Arseny Sher
557abc18f3 Fix test_s3_wal_replay assertion flakiness.
Supposedly fixes https://github.com/neondatabase/neon/issues/4277
2023-06-19 16:08:20 +04:00
Arseny Sher
3b06a5bc54 Raise pageserver walreceiver timeouts.
I observe sporadic reconnections with ~10k idle computes. It looks like a
separate issue, probably walreceiver runtime gets blocked somewhere, but in any
case 2-3 seconds is too small.
2023-06-19 15:59:38 +04:00
Alexander Bayandin
1b947fc8af test_runner: workaround rerunfailures and timeout incompatibility (#4469)
## Problem

`pytest-timeout` and `pytest-rerunfailures` are incompatible (or rather
not fully compatible). Timeouts aren't set for reruns.

Ref https://github.com/pytest-dev/pytest-rerunfailures/issues/99

## Summary of changes
- Dynamically make timeouts `func_only` for tests that we're going to
retry. It applies timeouts for reruns as well.
2023-06-16 18:08:11 +01:00
Dmitry Rodionov
472cc17b7a propagate lock guard to background deletion task (#4495)
## Problem

1. During the rollout we got a panic: "timeline that we were deleting
was concurrently removed from 'timelines' map" that was caused by lock
guard not being propagated to the background part of the deletion.
Existing test didnt catch it because failpoint that was used for
verification was placed earlier prior to background task spawning.
2. When looking at surrounding code one more bug was detected. We
removed timeline from the map before deletion is finished, which breaks
client retry logic, because it will indicate 404 before actual deletion
is completed which can lead to client stopping its retry poll earlier.

## Summary of changes

1. Carry the lock guard over to background deletion. Ensure existing
test case fails without applied patch (second deletion becomes stuck
without it, which eventually leads to a test failure).
2. Move delete_all call earlier so timeline is removed from the map is
the last thing done during deletion.

Additionally I've added timeline_id to the `update_gc_info` span,
because `debug_assert_current_span_has_tenant_and_timeline_id` in
`download_remote_layer` was firing when `update_gc_info` lead to
on-demand downloads via `find_lsn_for_timestamp` (caught by @problame).
This is not directly related to the PR but fixes possible flakiness.

Another smaller set of changes involves deletion wrapper used in python
tests. Now there is a simpler wrapper that waits for deletions to
complete `timeline_delete_wait_completed`. Most of the
test_delete_timeline.py tests make negative tests, i.e., "does
ps_http.timeline_delete() fail in this and that scenario".
These can be left alone. Other places when we actually do the deletions,
we need to use the helper that polls for completion.

Discussion
https://neondb.slack.com/archives/C03F5SM1N02/p1686668007396639

resolves #4496

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-06-15 17:30:12 +03:00
Alex Chi Z
2252c5c282 metrics: convert some metrics to pageserver-level (#4490)
## Problem

Some metrics are better to be observed at page-server level. Otherwise,
as we have a lot of tenants in production, we cannot do a sum b/c
Prometheus has limit on how many time series we can aggregate. This also
helps reduce metrics scraping size.

## Summary of changes

Some integration tests are likely not to pass as it will check the
existence of some metrics. Waiting for CI complete and fix them.

Metrics downgraded: page cache hit (where we are likely to have a
page-server level page cache in the future instead of per-tenant), and
reconstruct time (this would better be tenant-level, as we have one pg
replayer for each tenant, but now we make it page-server level as we do
not need that fine-grained data).

---------

Signed-off-by: Alex Chi <iskyzh@gmail.com>
2023-06-14 17:12:34 -04:00
Christian Schwarz
cd3faa8c0c test_basic_eviction: avoid some sources of flakiness (#4504)
We've seen the download_layer() call return 304 in prod because of a
spurious on-demand download caused by a GetPage request from compute.

Avoid these and some other sources of on-demand downloads by shutting
down compute, SKs, and by disabling background loops.

CF
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4498/5258914461/index.html#suites/2599693fa27db8427603ba822bcf2a20/357808fd552fede3
2023-06-14 19:04:22 +02:00
Arseny Sher
143fa0da42 Remove timeout on test_close_on_connections_exit
We have 300s timeout on all tests, and doubling logic in popen.wait sometimes
exceeds 5s, making the test flaky.

ref https://github.com/neondatabase/neon/issues/4211
2023-06-13 06:26:03 +04:00
bojanserafimov
6bac770811 Add cold start test (#4436) 2023-06-08 18:11:33 -04:00
Stas Kelvich
d73639646e Add more output options to proxy json endpoint
With this commit client can pass following optional headers:

`Neon-Raw-Text-Output: true`. Return postgres values as text, without parsing them. So numbers, objects, booleans, nulls and arrays will be returned as text. That can be useful in cases when client code wants to implement it's own parsing or reuse parsing libraries from e.g. node-postgres.

`Neon-Array-Mode: true`. Return postgres rows as arrays instead of objects. That is more compact representation and also helps in some edge
cases where it is hard to use rows represented as objects (e.g. when several fields have the same name).
2023-06-08 16:00:18 +03:00
Dmitry Rodionov
d53f9ab3eb delete timelines from s3 (#4384)
Delete data from s3 when timeline deletion is requested

## Summary of changes

UploadQueue is altered to support scheduling of delete operations in
stopped state. This looks weird, and I'm thinking whether there are
better options/refactorings for upload client to make it look better.

Probably can be part of https://github.com/neondatabase/neon/issues/4378

Deletion is implemented directly in existing endpoint because changes are not
that significant. If we want more safety we can separate those or create
feature flag for new behavior.

resolves [#4193](https://github.com/neondatabase/neon/issues/4193)

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2023-06-08 15:01:22 +03:00
Dmitry Rodionov
1a1019990a map TenantState::Broken to TenantAttachmentStatus::Failed (#4371)
## Problem

Attach failures are not reported in public part of the api (in
`attachment_status` field of TenantInfo).

## Summary of changes

Expose TenantState::Broken as TenantAttachmentStatus::Failed

In the way its written Failed status will be reported even if no
attachment happened. (I e if tenant become broken on startup). This is
in line with other members. I e Active will be resolved to Attached even
if no actual attach took place.

This can be tweaked if needed. At the current stage it would be overengineering without clear motivation

resolves #4344
2023-06-07 18:25:30 +03:00
Joonas Koivunen
5761190e0d feat: three phased startup order (#4399)
Initial logical size calculation could still hinder our fast startup
efforts in #4397. See #4183. In deployment of 2023-06-06
about a 200 initial logical sizes were calculated on hosts which
took the longest to complete initial load (12s).

Implements the three step/tier initialization ordering described in
#4397:
1. load local tenants
2. do initial logical sizes per walreceivers for 10s
3. background tasks

Ordering is controlled by:
- waiting on `utils::completion::Barrier`s on background tasks
- having one attempt for each Timeline to do initial logical size
calculation
- `pageserver/src/bin/pageserver.rs` releasing background jobs after
timeout or completion of initial logical size calculation

The timeout is there just to safeguard in case a legitimate non-broken
timeline initial logical size calculation goes long. The timeout is
configurable, by default 10s, which I think would be fine for production
systems. In the test cases I've been looking at, it seems that these
steps are completed as fast as possible.

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-06-07 14:29:23 +03:00
Heikki Linnakangas
df3bae2ce3 Use compute_ctl to manage Postgres in tests. (#3886)
This adds test coverage for 'compute_ctl', as it is now used by all
the python tests.
    
There are a few differences in how 'compute_ctl' is called in the
tests, compared to the real web console:
    
- In the tests, the postgresql.conf file is included as one large
  string in the spec file, and it is written out as it is to the data
  directory.  I added a new field for that to the spec file. The real
  web console, however, sets all the necessary settings in the
  'settings' field, and 'compute_ctl' creates the postgresql.conf from
  those settings.

- In the tests, the information needed to connect to the storage, i.e.
  tenant_id, timeline_id, connection strings to pageserver and
  safekeepers, are now passed as new fields in the spec file. The real
  web console includes them as the GUCs in the 'settings' field. (Both
  of these are different from what the test control plane used to do:
  It used to write the GUCs directly in the postgresql.conf file). The
  plan is to change the control plane to use the new method, and
  remove the old method, but for now, support both.

Some tests that were sensitive to the amount of WAL generated needed
small changes, to accommodate that compute_ctl runs the background
health monitor which makes a few small updates. Also some tests shut
down the pageserver, and now that the background health check can run
some queries while the pageserver is down, that can produce a few
extra errors in the logs, which needed to be allowlisted.

Other changes:
- remove obsolete comments about PostgresNode;
- create standby.signal file for Static compute node;
- log output of `compute_ctl` and `postgres` is merged into
`endpoints/compute.log`.

---------

Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
2023-06-06 14:59:36 +01:00
Joonas Koivunen
0cef7e977d refactor: just one way to shutdown a tenant (#4407)
We have 2 ways of tenant shutdown, we should have just one.

Changes are mostly mechanical simple refactorings.

Added `warn!` on the "shutdown all remaining tasks" should trigger test
failures in the between time of not having solved the "tenant/timeline
owns all spawned tasks" issue.

Cc: #4327.
2023-06-06 15:30:55 +03:00
Joonas Koivunen
18a9d47f8e test: restore NotConnected being allowed globally (#4426)
Flakyness introduced by #4402 evidence [^1].

I had assumed the NotConnected would had been an expected io error, but
it's not. Restore the global `allowed_error`.

[^1]:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4407/5185897757/index.html#suites/82004ab4e3720b47bf78f312dabe7c55/14f636d0ecd3939d/
2023-06-06 13:51:39 +03:00
Joonas Koivunen
e0bd81ce1f test: fix flaky warning on attach (#4415)
added the `allowed_error` to the `positive_env` so any tests completing
the attach are allowed have this print out. they are allowed to do so,
because the `random_init_delay` can produce close to zero and thus the
first run will be near attach. Though... Unsure if we ever really need
the eviction task to run **before** it can evict something, as in after
20min or 24h.

in the failed test case however period is 20s so interesting that we
didn't run into this sooner.

evidence of flaky:
https://github.com/neondatabase/neon/actions/runs/5175677035/jobs/9323705929?pr=4399#step:4:38536
2023-06-05 18:12:58 +03:00