Commit Graph

1441 Commits

Author SHA1 Message Date
John Spray
affe408433 storage scrubber: GC ancestor shard layers (#8196)
## Problem

After a shard split, the pageserver leaves the ancestor shard's content
in place. It may be referenced by child shards, but eventually child
shards will de-reference most ancestor layers as they write their own
data and do GC. We would like to eventually clean up those ancestor
layers to reclaim space.

## Summary of changes

- Extend the physical GC command with `--mode=full`, which includes
cleaning up unreferenced ancestor shard layers
- Add test `test_scrubber_physical_gc_ancestors`
- Remove colored log output: in testing this is irritating ANSI code
spam in logs, and in interactive use doesn't add much.
- Refactor storage controller API client code out of storcon_client into
a `storage_controller/client` crate
- During physical GC of ancestors, call into the storage controller to
check that the latest shards seen in S3 reflect the latest state of the
tenant, and there is no shard split in progress.
2024-07-22 14:36:56 +02:00
Christian Schwarz
9b883e4651 pageserver: remove obsolete cached_metric_collection_interval (#8370)
We're removing the usage of this long-meaningless config field in
https://github.com/neondatabase/aws/pull/1599

Once that PR has been deployed to staging and prod, we can merge this
PR.
2024-07-22 14:36:56 +02:00
Arpad Müller
ed7ee73cba Enable zstd in tests (#8368)
Successor of #8288 , just enable zstd in tests. Also adds a test that
creates easily compressable data.

Part of #5431

---------

Co-authored-by: John Spray <john@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2024-07-22 14:36:56 +02:00
John Spray
9868bb3346 tests: turn on safekeeper eviction by default (#8352)
## Problem

Ahead of enabling eviction in the field, where it will become the
normal/default mode, let's enable it by default throughout our tests in
case any issues become visible there.

## Summary of changes

- Make default `extra_opts` for safekeepers enable offload & deletion
- Set low timeouts in `extra_opts` so that tests running for tens of
seconds have a chance to hit some of these background operations.
2024-07-22 14:36:56 +02:00
John Spray
27da0e9cf5 tests: increase test_pg_regress and test_isolation timeouts (#8418)
## Problem

These tests time out ~1 in 50 runs when in debug mode.

There is no indication of a real issue: they're just wrappers that have
large numbers of individual tests contained within on pytest case.

## Summary of changes

- Bump pg_regress timeout from 600 to 900s
- Bump test_isolation timeout from 300s (default) to 600s

In future it would be nice to break out these tests to run individual
cases (or batches thereof) as separate tests, rather than this monolith.
2024-07-22 14:36:56 +02:00
John Spray
de9bf2af6c tests: fix metrics check in test_s3_eviction (#8419)
## Problem

This test would occasionally fail its metric check. This could happen in
the rare case that the nodes had all been restarted before their most
recent eviction.

The metric check was added in
https://github.com/neondatabase/neon/pull/8348

## Summary of changes

- Check metrics before each restart, accumulate into a bool that we
assert on at the end of the test
2024-07-22 14:36:56 +02:00
Christian Schwarz
3d2c2ce139 NeonEnv.from_repo_dir: use storage_controller_db instead of attachments.json (#8382)
When `NeonEnv.from_repo_dir` was introduced, storage controller stored
its
state exclusively `attachments.json`.
Since then, it has moved to using Postgres, which stores its state in
`storage_controller_db`.

But `NeonEnv.from_repo_dir` wasn't adjusted to do this.
This PR rectifies the situation.

Context for this is failures in
`test_pageserver_characterize_throughput_with_n_tenants`
CF:
https://neondb.slack.com/archives/C033RQ5SPDH/p1721035799502239?thread_ts=1720901332.293769&cid=C033RQ5SPDH

Notably, `from_repo_dir` is also used by the backwards- and
forwards-compatibility.
Thus, the changes in this PR affect those tests as well.
However, it turns out that the compatibility snapshot already contains
the `storage_controller_db`.
Thus, it should just work and in fact we can remove hacks like
`fixup_storage_controller`.

Follow-ups created as part of this work:
* https://github.com/neondatabase/neon/issues/8399
* https://github.com/neondatabase/neon/issues/8400
2024-07-22 14:36:56 +02:00
Joonas Koivunen
ff174a88c0 test: allow requests to any pageserver get cancelled (#8413)
Fix flakyness on `test_sharded_timeline_detach_ancestor` which does not
reproduce on a fast enough runner by allowing cancelled request before
completing on all pageservers. It was only allowed on half of the
pageservers.

Failure evidence:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8352/9972357040/index.html#suites/a1c2be32556270764423c495fad75d47/7cca3e3d94fe12f2
2024-07-22 14:36:56 +02:00
John Spray
ef3ebfaf67 pageserver: layer count & size metrics (#8410)
## Problem

We lack insight into:
- How much of a tenant's physical size is image vs. delta layers
- Average sizes of image vs. delta layers
- Total layer counts per timeline, indicating size of index_part object

As well as general observability love, this is motivated by
https://github.com/neondatabase/neon/issues/6738, where we need to
define some sensible thresholds for storage amplification, and using
total physical size may not work well (if someone does a lot of DROPs
then it's legitimate for the physical-synthetic ratio to be huge), but
the ratio between image layer size and delta layer size may be a better
indicator of whether we're generating unreasonable quantities of image
layers.

## Summary of changes

- Add pageserver_layer_bytes and pageserver_layer_count metrics,
labelled by timeline and `kind` (delta or image)
- Add & subtract these with LayerInner's lifetime.

I'm intentionally avoiding using a generic metric RAII guard object, to
avoid bloating LayerInner: it already has all the information it needs
to update metric on new+drop.
2024-07-22 14:36:56 +02:00
John Spray
c150ad4ee2 tests: add test_compaction_l0_memory (#8403)
This test reproduces the case of a writer creating a deep stack of L0
layers. It uses realistic layer sizes and writes several gigabytes of
data, therefore runs as a performance test although it is validating
memory footprint rather than performance per se.

It acts a regression test for two recent fixes:
- https://github.com/neondatabase/neon/pull/8401
- https://github.com/neondatabase/neon/pull/8391

In future it will demonstrate the larger improvement of using a k-merge
iterator for L0 compaction (#8184)

This test can be extended to enforce limits on the memory consumption of
other housekeeping steps, by restarting the pageserver and then running
other things to do the same "how much did RSS increase" measurement.
2024-07-22 14:36:56 +02:00
John Spray
b21e131d11 pageserver: exclude un-read layers from short residence statistic (#8396)
## Problem

The `evictions_with_low_residence_duration` is used as an indicator of
cache thrashing. However, there are situations where it is quite
legitimate to only have a short residence during compaction, where a
delta is downloaded, used to generate an image layer, and then
discarded. This can lead to false positive alerts.

## Summary of changes

- Only track low residence duration for layers that have been accessed
at least once (compaction doesn't count as an access). This will give us
a metric that indicates thrashing on layers that the _user_ is using,
rather than those we're downloading for housekeeping purposes.

Once we add "layer visibility" as an explicit property of layers, this
can also be used as a cleaner condition (residence of non-visible layers
should never be alertable)
2024-07-22 14:36:56 +02:00
Tristan Partin
18e7c2b7a1 Add some typing to Endpoint.respec() 2024-07-22 14:36:56 +02:00
Tristan Partin
ad5d784fb7 Hide import behind TYPE_CHECKING 2024-07-22 14:36:56 +02:00
Tristan Partin
85d47637ee Run each migration in its own transaction
Previously, every migration was run in the same transaction. This
is preparatory work for fixing CVE-2024-4317.
2024-07-22 14:36:56 +02:00
Arpad Müller
9dc71f5a88 Avoid the storage controller in test_tenant_creation_fails (#8392)
As described in #8385, the likely source for flakiness in
test_tenant_creation_fails is the following sequence of events:

1. test instructs the storage controller to create the tenant
2. storage controller adds the tenant and persists it to the database.
issues a creation request
3. the pageserver restarts with the failpoint disabled
4. storage controller's background reconciliation still wants to create
the tenant
5. pageserver gets new request to create the tenant from background
reconciliation

This commit just avoids the storage controller entirely. It has its own
set of issues, as the re-attach request will obviously not include the
tenant, but it's still useful to test for non-existence of the tenant.

The generation is also not optional any more during tenant attachment.
If you omit it, the pageserver yields an error. We change the signature
of `tenant_attach` to reflect that.

Alternative to #8385
Fixes #8266
2024-07-22 14:36:56 +02:00
Sasha Krassovsky
349373cb11 Allow reusing projects between runs of logical replication benchmarks (#8393) 2024-07-22 14:36:56 +02:00
Joonas Koivunen
957f99cad5 feat(timeline_detach_ancestor): success idempotency (#8354)
Right now timeline detach ancestor reports an error (409, "no ancestor")
on a new attempt after successful completion. This makes it troublesome
for storage controller retries. Fix it to respond with `200 OK` as if
the operation had just completed quickly.

Additionally, the returned timeline identifiers in the 200 OK response
are now ordered so that responses between different nodes for error
comparison are done by the storage controller added in #8353.

Design-wise, this PR introduces a new strategy for accessing the latest
uploaded IndexPart:
`RemoteTimelineClient::initialized_upload_queue(&self) ->
Result<UploadQueueAccessor<'_>, NotInitialized>`. It should be a more
scalable way to query the latest uploaded `IndexPart` than to add a
query method for each question directly on `RemoteTimelineClient`.

GC blocking will need to be introduced to make the operation fully
idempotent. However, it is idempotent for the cases demonstrated by
tests.

Cc: #6994
2024-07-22 14:36:56 +02:00
John Spray
2a3a136474 pageserver: use PITR GC cutoffs as authoritative (#8365)
## Problem

Pageserver GC uses a size-based condition (GC "horizon" in addition to
time-based "PITR").

Eventually we plan to retire the size-based condition:
https://github.com/neondatabase/neon/issues/6374

Currently, we always apply the more conservative of the two, meaning
that tenants always retain at least 64MB of history (default horizon),
even after a very long time has passed. This is particularly acute in
cases where someone has dropped tables/databases, and then leaves a
database idle: the horizon can prevent GCing very large quantities of
historical data (we already account for this in synthetic size by
ignoring gc horizon).

We're not entirely removing GC horizon right now because we don't want
to 100% rely on standby_horizon for robustness of physical replication,
but we can tweak our logic to avoid retaining that 64MB LSN length
indefinitely.

## Summary of changes

- Rework `Timeline::find_gc_cutoffs`, with new logic:
- If there is no PITR set, then use `DEFAULT_PITR_INTERVAL` (1 week) to
calculate a time threshold. Retain either the horizon or up to that
thresholds, whichever requires less data.
- When there is a PITR set, and we have unambiguously resolved the
timestamp to an LSN, then ignore the GC horizon entirely. For typical
PITRs (1 day, 1 week), this will still easily retain enough data to
avoid stressing read only replicas.

The key property we end up with, whether a PITR is set or not, is that
after enough time has passed, our GC cutoff on an idle timeline will
catch up with the last_record_lsn.

Using `DEFAULT_PITR_INTERVAL` is a bit of an arbitrary hack, but this
feels like it isn't really worth the noise of exposing in TenantConfig.
We could just make it a different named constant though. The end-end
state will be that there is no gc_horizon at all, and that tenants with
pitr_interval=0 would truly retain no history, so this constant would go
away.
2024-07-22 14:36:56 +02:00
Joonas Koivunen
cfaf30f5e8 feat(storcon): timeline detach ancestor passthrough (#8353)
Currently storage controller does not support forwarding timeline detach
ancestor requests to pageservers. Add support for forwarding `PUT
.../:tenant_id/timelines/:timeline_id/detach_ancestor`. Implement the
support mostly as is, because the timeline detach ancestor will be made
(mostly) idempotent in future PR.

Cc: #6994
2024-07-22 14:36:56 +02:00
Christian Schwarz
72c2d0812e remove page_service show <tenant_id> (#8372)
This operation isn't used in practice, so let's remove it.

Context: in https://github.com/neondatabase/neon/pull/8339
2024-07-22 14:36:56 +02:00
Arseny Sher
537ecf45f8 Fix test_timeline_copy flakiness.
fixes https://github.com/neondatabase/neon/issues/8355
2024-07-22 14:31:12 +02:00
Konstantin Knizhnik
7973c3e941 Add neon.running_xacts_overflow_policy to make it possible for RO replica to startup without primary even in case running xacts overflow (#8323)
## Problem

Right now if there are too many running xacts to be restored from CLOG
at replica startup,
then replica is not trying to restore them and wait for non-overflown
running-xacs WAL record from primary.
But if primary is not active, then replica will not start at all.

Too many running xacts can be caused by transactions with large number
of subtractions.
But right now it can be also cause by two reasons:
- Lack of shutdown checkpoint which updates `oldestRunningXid` (because
of immediate shutdown)
- nextXid alignment on 1024 boundary (which cause loosing ~1k XIDs on
each restart)

Both problems are somehow addressed now.
But we have existed customers with "sparse" CLOG and lack of
checkpoints.
To be able to start RO replicas for such customers I suggest to add GUC
which allows replica to start even in case of subxacts overflow.

## Summary of changes

Add `neon.running_xacts_overflow_policy` with the following values:
- ignore: restore from CLOG last N XIDs and accept connections
- skip: do not restore any XIDs from CXLOGbut still accept connections
- wait: wait non-overflown running xacts record from primary node

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-07-15 09:34:35 -04:00
Vlad Lazar
085bbaf5f8 tests: allow list breaching min resident size in statvfs test (#8358)
## Problem
This test would sometimes violate the min resident size during disk
eviction and fail due to the generate warning log.

Disk usage candidate collection only takes into account active tenants.
However, the statvfs call takes into account the entire tenants
directory, which includes tenants which haven't become active yet.

After re-starting the pageserver, disk usage eviction may kick in
*before* both tenants have become active. Hence, the logic will try to satisfy
thedisk usage requirements by evicting everything belonging to the active
tenant, and hence violating the tenant minimum resident size.

## Summary of changes

Allow the warning
2024-07-15 09:28:35 -04:00
John Spray
3f8819827c pageserver: circuit breaker on compaction (#8359)
## Problem

We already back off on compaction retries, but the impact of a failing
compaction can be so great that backing off up to 300s isn't enough. The
impact is consuming a lot of I/O+CPU in the case of image layer
generation for large tenants, and potentially also leaking disk space.

Compaction failures are extremely rare and almost always indicate a bug,
frequently a bug that will not let compaction to proceed until it is
fixed.

Related: https://github.com/neondatabase/neon/issues/6738

## Summary of changes

- Introduce a CircuitBreaker type
- Add a circuit breaker for compaction, with a policy that after 5
failures, compaction will not be attempted again for 24 hours.
- Add metrics that we can alert on: any >0 value for
`pageserver_circuit_breaker_broken_total` should generate an alert.
- Add a test that checks this works as intended.

Couple notes to reviewers:
- Circuit breakers are intrinsically a defense-in-depth measure: this is
not the solution to any underlying issues, it is just a general
mitigation for "unknown unknowns" that might be encountered in future.
- This PR isn't primarily about writing a perfect CircuitBreaker type:
the one in this PR is meant to be just enough to mitigate issues in
compaction, and make it easy to monitor/alert on these failures. We can
refine this type in future as/when we want to use it elsewhere.
2024-07-15 09:28:35 -04:00
Sasha Krassovsky
119ddf6ccf Grant execute on snapshot functions to neon_superuser (#8346)
## Problem
I need `neon_superuser` to be allowed to create snapshots for
replication tests

## Summary of changes
Adds a migration that grants these functions to neon_superuser
2024-07-15 09:28:35 -04:00
Joonas Koivunen
90f447b79d test: limit test_layer_download_timeouted to MOCK_S3 (#8331)
Requests against REAL_S3 on CI can consistently take longer than 1s;
testing the short timeouts against it made no sense in hindsight, as
MOCK_S3 works just as well.

evidence:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8229/9857994025/index.html#suites/b97efae3a617afb71cb8142f5afa5224/6828a50921660a32
2024-07-15 09:28:35 -04:00
Arseny Sher
8532d72276 Fix memory context of NeonWALReader allocation.
Allocating it in short living context is wrong because it is reused during
backend lifetime.
2024-07-15 09:28:35 -04:00
John Spray
d3ff47f572 storage controller: add node deletion API (#8226)
## Problem

In anticipation of later adding a really nice drain+delete API, I
initially only added an intentionally basic `/drop` API that is just
about usable for deleting nodes in a pinch, but requires some ugly
storage controller restarts to persuade it to restart secondaries.

## Summary of changes

I started making a few tiny fixes, and ended up writing the delete
API...

- Quality of life nit: ordering of node + tenant listings in storcon_cli
- Papercut: Fix the attach_hook using the wrong operation type for
reporting slow locks
- Make Service::spawn tolerate `generation_pageserver` columns that
point to nonexistent node IDs. I started out thinking of this as a
general resilience thing, but when implementing the delete API I
realized it was actually a legitimate end state after the delete API is
called (as that API doesn't wait for all reconciles to succeed).
- Add a `DELETE` API for nodes, which does not gracefully drain, but
does reschedule everything. This becomes safe to use when the system is
in any state, but will incur availability gaps for any tenants that
weren't already live-migrated away. If tenants have already been
drained, this becomes a totally clean + safe way to decom a node.
- Add a test and a storcon_cli wrapper for it

This is meant to be a robust initial API that lets us remove nodes
without doing ugly things like restarting the storage controller -- it's
not quite a totally graceful node-draining routine yet. There's more
work in https://github.com/neondatabase/neon/issues/8333 to get to our
end-end state.
2024-07-15 09:28:35 -04:00
John Spray
8cc768254f safekeeper: eviction metrics (#8348)
## Problem

Follow up to https://github.com/neondatabase/neon/pull/8335, to improve
observability of how many evict/restores we are doing.

## Summary of changes

- Add `safekeeper_eviction_events_started_total` and
`safekeeper_eviction_events_completed_total`, with a "kind" label of
evict or restore. This gives us rates, and also ability to calculate how
many are in progress.
- Generalize SafekeeperMetrics test type to use the same helpers as
pageserver, and enable querying any metric.
- Read the new metrics at the end of the eviction test.
2024-07-15 09:28:35 -04:00
Christian Schwarz
5bba3e3c75 pageserver: remove trace_read_requests (#8338)
`trace_read_requests` is a per `Tenant`-object option.
But the `handle_pagerequests` loop doesn't know which
`Tenant` object (i.e., which shard) the request is for.

The remaining use of the `Tenant` object is to check `tenant.cancel`.
That check is incorrect [if the pageserver hosts multiple
shards](https://github.com/neondatabase/neon/issues/7427#issuecomment-2220577518).
I'll fix that in a future PR where I completely eliminate the holding
of `Tenant/Timeline` objects across requests.
See [my code RFC](https://github.com/neondatabase/neon/pull/8286) for
the
high level idea.

Note that we can always bring the tracing functionality if we need it.
But since it's actually about logging the `page_service` wire bytes,
it should be a `page_service`-level config option, not per-Tenant.
And for enabling tracing on a single connection, we can implement
a `set pageserver_trace_connection;` option.
2024-07-15 09:28:35 -04:00
Peter Bendel
6caf702417 Run Performance bench on more platforms (#8312)
## Problem

https://github.com/neondatabase/cloud/issues/14721

## Summary of changes

add one more platform to benchmarking job 


57535c039c/.github/workflows/benchmarking.yml (L57C3-L126)

Run with pg 16, provisioner k8-neonvm by default on the new platform.

Adjust some test cases to

- not depend on database client <-> database server latency by pushing
loops into server side pl/pgSQL functions
- increase statement and test timeouts

First successful run of these job steps 

https://github.com/neondatabase/neon/actions/runs/9869817756/job/27254280428
2024-07-15 09:28:35 -04:00
John Spray
547acde6cd safekeeper: add eviction_min_resident to stop evictions thrashing (#8335)
## Problem

- The condition for eviction is not time-based: it is possible for a
timeline to be restored in response to a client, that client times out,
and then as soon as the timeline is restored it is immediately evicted
again.
- There is no delay on eviction at startup of the safekeeper, so when it
starts up and sees many idle timelines, it does many evictions which
will likely be immediately restored when someone uses the timeline.

## Summary of changes

- Add `eviction_min_resident` parameter, and use it in
`ready_for_eviction` to avoid evictions if the timeline has been
resident for less than this period.
- This also implicitly delays evictions at startup for
`eviction_min_resident`
- Set this to a very low number for the existing eviction test, which
expects immediate eviction.

The default period is 15 minutes. The general reasoning for that is that
in the worst case where we thrash ~10k timelines on one safekeeper,
downloading 16MB for each one, we should set a period that would not
overwhelm the node's bandwidth.
2024-07-15 09:28:35 -04:00
John Spray
f28abb953d tests: stabilize test_sharding_split_compaction (#8318)
## Problem

This test incorrectly assumed that a post-split compaction would only
drop content. This was easily destabilized by any changes to image
generation rules.

## Summary of changes

- Before split, do a full image layer generation pass, to guarantee that
post-split compaction should only drop data, never create it.
- Fix the force_image_layer_creation mode of compaction that we use from
tests like this: previously it would try and generate image layers even
if one already existed with the same layer key, which caused compaction
to fail.
2024-07-15 09:28:35 -04:00
Christian Schwarz
bfc7338246 pageserver: move page_service's import basebackup / import wal to mgmt API (#8292)
I want to fix bugs in `page_service`
([issue](https://github.com/neondatabase/neon/issues/7427)) and the
`import basebackup` / `import wal` stand in the way / make the
refactoring more complicated.

We don't use these methods anyway in practice, but, there have been some
objections to removing the functionality completely.

So, this PR preserves the existing functionality but moves it into the
HTTP management API.

Note that I don't try to fix existing bugs in the code, specifically not
fixing
* it only ever worked correctly for unsharded tenants
* it doesn't clean up on error

All errors are mapped to `ApiError::InternalServerError`.
2024-07-15 09:28:35 -04:00
Yuchen Liang
0b6492e7d3 tests: increase approx size equal threshold to avoid test_lsn_lease_size flakiness (#8282)
## Summary of changes

Increase the `assert_size_approx_equal` threshold to avoid flakiness of
`test_lsn_lease_size`. Still needs more investigation to fully resolve
#8293.

- Also set `autovacuum=off` for the endpoint we are running in the test.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-07-15 09:28:35 -04:00
John Spray
7cfaecbeb6 tests: stabilize test_timeline_size_quota_on_startup (#8255)
## Problem

`test_timeline_size_quota_on_startup` assumed that writing data beyond
the size limit would always be blocked. This is not so: the limit is
only enforced if feedback makes it back from the pageserver to the
safekeeper + compute.

Closes: https://github.com/neondatabase/neon/issues/6562

## Summary of changes

- Modify the test to wait for the pageserver to catch up. The size limit
was never actually being enforced robustly, the original version of this
test was just writing much more than 30MB and about 98% of the time
getting lucky such that the feedback happened to arrive before the tests
for loop was done.
- If the test fails, log the logical size as seen by the pageserver.
2024-07-15 09:28:35 -04:00
John Spray
108bf56e44 tests: use smaller layers in test_pg_regress (#8232)
## Problem

Debug-mode runs of test_pg_regress are rather slow since
https://github.com/neondatabase/neon/pull/8105, and occasionally exceed
their 600s timeout.

## Summary of changes

- Use 8MiB layer files, avoiding large ephemeral layers

On a hetzner AX102, this takes the runtime from 230s to 190s. Which
hopefully will be enough to get the runtime on github runners more
reliably below its 600s timeout.

This has the side benefit of exercising more of the pageserver stack
(including compaction) under a workload that exercises a more diverse
set of postgres functionality than most of our tests.
2024-07-15 09:28:35 -04:00
Tristan Partin
cec216c5c0 Add long running replication tests
These tests will help verify that replication, both physical and
logical, works as expected in Neon.

Co-authored-by: Sasha Krassovsky <sasha@neon.tech>
2024-07-15 09:28:35 -04:00
Tristan Partin
930201e033 Add PgBin.run_nonblocking()
Allows a process to run without blocking program execution, which can be
useful for certain test scenarios.

Co-authored-by: Sasha Krassovsky <sasha@neon.tech>
2024-07-15 09:28:35 -04:00
Tristan Partin
8328580dc2 Log PG environment variables when a PgBin runs
Useful for debugging situations like connecting to databases.

Co-authored-by: Sasha Krassovsky <sasha@neon.tech>
2024-07-15 09:28:35 -04:00
Tristan Partin
8d9b632f2a Add Neon HTTP API test fixture
This is a Python binding to the Neon HTTP API. It isn't complete, but
can be extended as necessary.

Co-authored-by: Sasha Krassovsky <sasha@neon.tech>
2024-07-15 09:28:35 -04:00
Tristan Partin
55d37c77b9 Hide import behind TYPE_CHECKING
No need to import it if we aren't type checking anything.
2024-07-15 09:28:35 -04:00
Konstantin Knizhnik
05b4169644 Increase timeout for wating subscriber caught-up (#8118)
## Problem

test_subscriber_restart has quit large failure rate'

https://neonprod.grafana.net/d/fddp4rvg7k2dcf/regression-test-failures?orgId=1&var-test_name=test_subscriber_restart&var-max_count=100&var-restrict=false

I can be caused by too small timeout (5 seconds) to wait until changes
are propagated.

Related to #8097

## Summary of changes

Increase timeout to 30 seconds.

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-07-08 17:22:36 +01:00
Alexander Bayandin
d1495755e7 SELECT 💣(); (#8270)
## Problem
We want to be able to test how our infrastructure reacts on segfaults in
Postgres (for example, we collect cores, and get some required
logs/metrics, etc)

## Summary of changes
- Add `trigger_segfauls` function to `neon_test_utils` to trigger a
segfault in Postgres
- Add `trigger_panic` function to `neon_test_utils` to trigger SIGABRT
(by using `elog(PANIC, ...))
- Fix cleanup logic in regression tests in endpoint crashed
2024-07-08 17:22:36 +01:00
John Spray
64334f497d tests: make location_conf_churn more robust (#8271)
## Problem

This test directly manages locations on pageservers and configuration of
an endpoint. However, it did not switch off the parts of the storage
controller that attempt to do the same: occasionally, the test would
fail in a strange way such as a compute failing to accept a
reconfiguration request.

## Summary of changes

- Wire up the storage controller's compute notification hook to a no-op
handler
- Configure the tenant's scheduling policy to Stop.
2024-07-08 17:22:35 +01:00
John Spray
32fc2dd683 tests: extend allow list in deletion test (#8268)
## Problem

1ea5d8b132 tolerated this as an error
message, but it can show up in logs as well.

Example failure:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8201/9780147712/index.html#testresult/263422f5f5f292ea/retries

## Summary of changes

- Tolerate "failed to delete 1 objects" in pageserver logs, this occurs
occasionally when injected failures exhaust deletion's retries.
2024-07-08 17:22:35 +01:00
Konstantin Knizhnik
3ee82a9895 implement rolling hyper-log-log algorithm (#8068)
## Problem

See #7466

## Summary of changes

Implement algorithm descried in
https://hal.science/hal-00465313/document

Now new GUC is added:
`neon.wss_max_duration` which specifies size of sliding window (in
seconds). Default value is 1 hour.

It is possible to request estimation of working set sizes (within this
window using new function
`approximate_working_set_size_seconds`. Old function
`approximate_working_set_size` is preserved for backward compatibility.
But its scope is also limited by `neon.wss_max_duration`.

Version of Neon extension is changed to 1.4

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Matthias van de Meent <matthias@neon.tech>
2024-07-08 17:22:35 +01:00
Yuchen Liang
32828cddd6 feat(pageserver): integrate lsn lease into synthetic size (#8220)
Part of #7497, closes #8071. (accidentally closed #8208, reopened here)

## Problem

After the changes in #8084, we need synthetic size to also account for
leased LSNs so that users do not get free retention by running a small
ephemeral endpoint for a long time.

## Summary of changes

This PR integrates LSN leases into the synthetic size calculation. We
model leases as read-only branches started at the leased LSN (except it
does not have a timeline id).

Other changes:
- Add new unit tests testing whether a lease behaves like a read-only
branch.
- Change `/size_debug` response to include lease point in the SVG
visualization.
- Fix `/lsn_lease` HTTP API to do proper parsing for POST.



Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
2024-07-08 17:22:35 +01:00
John Spray
7e2a3d2728 pageserver: downgrade stale generation messages to INFO (#8256)
## Problem

When generations were new, these messages were an important way of
noticing if something unexpected was going on. We found some real issues
when investigating tests that unexpectedly tripped them.

At time has gone on, this code is now pretty battle-tested, and as we do
more live migrations etc, it's fairly normal to see the occasional
message from a node with a stale generation.

At this point the cognitive load on developers to selectively allow-list
these logs outweighs the benefit of having them at warn severity.

Closes: https://github.com/neondatabase/neon/issues/8080

## Summary of changes

- Downgrade "Dropped remote consistent LSN updates" and "Dropping stale
deletions" messages to INFO
- Remove all the allow-list entries for these logs.
2024-07-08 17:22:35 +01:00
Alexander Bayandin
0e4832308d CI(pg-clients): unify workflow with build-and-test (#8160)
## Problem

`pg-clients` workflow looks different from the main `build-and-test`
workflow for historical reasons (it was my very first task at Neon, and 
back then I wasn't really familiar with the rest of the CI pipelines).
This PR unifies `pg-clients` workflow with `build-and-test`

## Summary of changes
- Rename `pg_clients.yml` to `pg-clients.yml`
- Run the workflow on changes in relevant files
- Create Allure report for tests
- Send slack notifications to `#on-call-qa-staging-stream` channel
(instead of `#on-call-staging-stream`)
- Update Client libraries once we're here
2024-07-08 17:22:35 +01:00