Part of https://github.com/neondatabase/cloud/issues/14024. k8s does not
always have a volume available for logging, and I'm running into weird
permission errors. While I could spend time figuring out how to create
temp directories for logging, I think it is better to just disable
file logging: k8s containers are ephemeral, and we cannot retrieve
anything from the filesystem after the container gets removed.
## Summary of changes
Setting `PAGESERVER_DISABLE_FILE_LOGGING=1` disables file logging.
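For illustration only, a minimal sketch of reading such a switch from the environment (the helper below is hypothetical, not the actual pageserver logging setup):
```
// Hypothetical helper, not the actual pageserver code: file logging is
// skipped when PAGESERVER_DISABLE_FILE_LOGGING is set to 1.
fn file_logging_enabled() -> bool {
    std::env::var("PAGESERVER_DISABLE_FILE_LOGGING")
        .map(|v| v != "1")
        .unwrap_or(true)
}

fn main() {
    if file_logging_enabled() {
        println!("would open pageserver.log and add a file layer here");
    } else {
        println!("stderr-only logging; nothing is written to the ephemeral fs");
    }
}
```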
Signed-off-by: Alex Chi Z <chi@neon.tech>
This tweaks the rows-to-JSON rendering logic in order to avoid
allocating 0-sized temporary vectors and later growing them
to insert elements.
As the exact size is known in advance, both vectors can be built
with an exact capacity upfront. This will avoid further vector
growing/reallocation in the rendering hotpath.
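As a minimal sketch of the pattern (hypothetical rendering function, not the actual code), `Vec::with_capacity(n)` replaces `Vec::new()` wherever the element count `n` is known up front:
```
// Hypothetical example of the allocation pattern: both vectors are sized
// to the known row width, so no reallocation happens while filling them.
fn render_row(values: &[i64]) -> (Vec<String>, Vec<String>) {
    let n = values.len();
    let mut names = Vec::with_capacity(n); // was Vec::new() + growth
    let mut rendered = Vec::with_capacity(n);
    for (i, v) in values.iter().enumerate() {
        names.push(format!("col{i}"));
        rendered.push(v.to_string());
    }
    (names, rendered)
}

fn main() {
    let (names, rendered) = render_row(&[1, 2, 3]);
    assert_eq!(names.len(), 3);
    assert_eq!(rendered, ["1", "2", "3"]);
}
```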
Signed-off-by: Luca BRUNO <lucab@lucabruno.net>
## Problem
In https://github.com/neondatabase/neon/pull/8161, we changed the path
to Neon artefacts by adding the commit SHA to it, but we missed applying
the same change to the `promote-compatibility-data` job that we use for
backward/forward compatibility testing.
## Summary of changes
- Add commit sha to `promote-compatibility-data`
## Summary of changes
- Increase the `assert_size_approx_equal` threshold to avoid flakiness of
`test_lsn_lease_size`; more investigation is still needed to fully resolve
#8293.
- Also set `autovacuum=off` for the endpoint we run in the test.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
## Problem
`test_timeline_size_quota_on_startup` assumed that writing data beyond
the size limit would always be blocked. This is not so: the limit is
only enforced if feedback makes it back from the pageserver to the
safekeeper + compute.
Closes: https://github.com/neondatabase/neon/issues/6562
## Summary of changes
- Modify the test to wait for the pageserver to catch up. The size limit
was never actually enforced robustly: the original version of the test
simply wrote much more than 30MB and, about 98% of the time, got lucky in
that the feedback happened to arrive before the test's for loop finished.
- If the test fails, log the logical size as seen by the pageserver.
## Problem
Debug-mode runs of test_pg_regress are rather slow since
https://github.com/neondatabase/neon/pull/8105, and occasionally exceed
their 600s timeout.
## Summary of changes
- Use 8MiB layer files, avoiding large ephemeral layers
On a Hetzner AX102, this takes the runtime from 230s to 190s, which
hopefully will be enough to keep the runtime on GitHub runners more
reliably below the 600s timeout.
This has the side benefit of exercising more of the pageserver stack
(including compaction) under a workload that exercises a more diverse
set of postgres functionality than most of our tests.
## Problem
We currently use 'immediate' mode in the most commonly used shutdown
path, when the control plane calls a `compute_ctl` API to terminate
Postgres inside the compute without waiting for the actual pod / VM
termination. Yet, an 'immediate' shutdown doesn't create a shutdown
checkpoint, and read-only replicas (ROs) have a hard time figuring out
the list of running xacts during the next start.
## Summary of changes
Use 'fast' mode, which creates a shutdown checkpoint. This is important
for ROs, which can then get the list of running xacts faster instead of
going through the CLOG. On the control plane side, we poll this
`compute_ctl` termination API for 10s; that should be enough, as we don't
really write any data at checkpoint time. If it times out, we fall back
to the slower k8s-based termination anyway.
See https://www.postgresql.org/docs/current/server-shutdown.html for the
list of modes and signals.
The default VM shutdown hook already uses `fast` mode, see [1]
[1]
c9fd8d7693/vm-image-spec.yaml (L30-L31)
Related to #6211
## Problem
LSN leases, introduced in #8084, are a new API that is shard-aware from
day 1. To support ephemeral endpoints in #7994 without linking the
Postgres C API into `compute_ctl`, part of the sharding code needs to
reside in `utils`.
## Summary of changes
- Create a new `shard` module in utils crate.
- Move the more interface-related parts of the tenant sharding API to
`utils` and re-export them in `pageserver_api`.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
## Problem
Rarely, a dbdir entry can exist with no `relmap_file_key` data. This
causes compaction to fail, because it assumes that if the database
exists, then so does the relmap file.
Basebackup already handled this using a boolean to record whether such a
key exists, but `collect_keyspace` didn't.
## Summary of changes
- Respect the flag for whether a relmap file exists in `collect_keyspace`
(a sketch follows after this list)
- The reproducer for this issue will merge separately in
https://github.com/neondatabase/neon/pull/8232
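A hedged sketch of the fix under illustrative names (the real types live in the pageserver and differ):
```
// All names below are illustrative; the real pageserver types differ.
struct DbDirEntry {
    spcnode: u32,
    dbnode: u32,
    has_relmap_file: bool, // the flag basebackup already consulted
}

type Key = String;

fn relmap_file_key(spcnode: u32, dbnode: u32) -> Key {
    format!("relmap/{spcnode}/{dbnode}")
}

fn collect_db_keys(db: &DbDirEntry, keyspace: &mut Vec<Key>) {
    // Previously the relmap key was added unconditionally, so compaction
    // failed for dbdir entries that carry no relmap data.
    if db.has_relmap_file {
        keyspace.push(relmap_file_key(db.spcnode, db.dbnode));
    }
}

fn main() {
    let db = DbDirEntry { spcnode: 1663, dbnode: 16384, has_relmap_file: false };
    let mut keyspace = Vec::new();
    collect_db_keys(&db, &mut keyspace);
    assert!(keyspace.is_empty()); // no relmap key, no compaction failure
}
```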
These tests will help verify that replication, both physical and
logical, works as expected in Neon.
Co-authored-by: Sasha Krassovsky <sasha@neon.tech>
Allows a process to run without blocking program execution, which can be
useful for certain test scenarios.
Co-authored-by: Sasha Krassovsky <sasha@neon.tech>
## Problem
- Resident memory on long running pageserver processes tends to climb:
memory fragmentation is suspected.
- Total resident memory may be a limiting factor for running on smaller
nodes.
## Summary of changes
- As a low-energy experiment, switch the pageserver to use jemalloc (not
a net-new dependency; proxy already uses it)
- Decide at the end of the week whether to revert before the next release.
## Problem
Sparse keyspaces were constructed with ranges out of order: this didn't obviously break anything, but it meant that users of `KeySpace` functions that assume ordering would hit assertions.
Closes https://github.com/neondatabase/neon/issues/8277
## Summary of changes
Make sure the sparse keyspace has ordered keyspace parts.
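A simplified illustration of the invariant being restored, using plain integer ranges rather than the real `KeySpace` types:
```
// Illustrative only: sort the parts of a sparse keyspace by start key so
// that code assuming ordering (merging, binary search, asserts) holds.
use std::ops::Range;

type Key = u64;

fn sort_parts(parts: &mut Vec<Range<Key>>) {
    parts.sort_by_key(|r| r.start);
}

fn main() {
    let mut parts = vec![30..40, 0..10, 15..20];
    sort_parts(&mut parts);
    assert_eq!(parts, vec![0..10, 15..20, 30..40]);
}
```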
The find-large-objects scrubber subcommand is quite fast if you run it
in an environment with low latency to the S3 bucket (say, an EC2 instance
in the same region). However, the higher the latency gets, the slower
the command becomes. Therefore, add a concurrency parameter and
parallelize the command. This doesn't change that general relationship,
but at least it lets us issue multiple requests in parallel and therefore
hopefully finish faster.
Running with concurrency of 64 (default):
```
2024-07-05T17:30:22.882959Z INFO lazy_load_identity [...]
[...]
2024-07-05T17:30:28.289853Z INFO Scanned 500 shards. [...]
```
With concurrency of 1, simulating state before this PR:
```
2024-07-05T17:31:43.375153Z INFO lazy_load_identity [...]
[...]
2024-07-05T17:33:51.987092Z INFO Scanned 500 shards. [...]
```
In other words, listing 500 shards goes from 2:08 minutes down to 6
seconds.
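The bounded-concurrency pattern looks roughly like the sketch below (illustrative names and a dummy per-shard task; the real subcommand issues S3 requests), assuming `futures` and `tokio` are available:
```
// Sketch of the concurrency parameter: keep up to `concurrency` per-shard
// tasks in flight at once instead of one at a time.
use futures::stream::{self, StreamExt};

async fn check_shard(shard: u32) -> usize {
    // stand-in for the per-shard S3 work
    shard as usize % 3
}

#[tokio::main]
async fn main() {
    let concurrency = 64; // default in the subcommand
    let results: Vec<usize> = stream::iter(0..500u32)
        .map(check_shard)
        .buffer_unordered(concurrency) // at most `concurrency` in flight
        .collect()
        .await;
    println!("scanned {} shards", results.len());
}
```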
Follow-up of #8257, part of #5431
Improve parsing of the `ImageCompressionAlgorithm` enum to allow level
customization like `zstd(1)`, as strum only takes `Default::default()`,
i.e. `None` as the level.
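A hedged sketch of what level-aware parsing can look like; the real `ImageCompressionAlgorithm` enum and its error type differ, this only shows the `zstd(N)` handling:
```
use std::str::FromStr;

// Simplified stand-in for the real enum.
#[derive(Debug, PartialEq)]
enum ImageCompressionAlgorithm {
    Disabled,
    Zstd { level: Option<i8> },
}

impl FromStr for ImageCompressionAlgorithm {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "disabled" => Ok(Self::Disabled),
            "zstd" => Ok(Self::Zstd { level: None }),
            other => {
                // Accept an explicit level, e.g. "zstd(1)".
                let inner = other
                    .strip_prefix("zstd(")
                    .and_then(|rest| rest.strip_suffix(')'))
                    .ok_or_else(|| format!("invalid compression algorithm: {other}"))?;
                let level = inner.parse::<i8>().map_err(|e| e.to_string())?;
                Ok(Self::Zstd { level: Some(level) })
            }
        }
    }
}

fn main() {
    let parsed: ImageCompressionAlgorithm = "zstd(1)".parse().unwrap();
    assert_eq!(parsed, ImageCompressionAlgorithm::Zstd { level: Some(1) });
}
```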
Part of #5431
## Problem
`test_subscriber_restart` has a quite large failure rate:
https://neonprod.grafana.net/d/fddp4rvg7k2dcf/regression-test-failures?orgId=1&var-test_name=test_subscriber_restart&var-max_count=100&var-restrict=false
It can be caused by a too-small timeout (5 seconds) for waiting until
changes are propagated.
Related to #8097
## Summary of changes
Increase timeout to 30 seconds.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
We want to be able to test how our infrastructure reacts to segfaults in
Postgres (for example, that we collect cores and get the required
logs/metrics, etc.).
## Summary of changes
- Add a `trigger_segfault` function to `neon_test_utils` to trigger a
segfault in Postgres
- Add a `trigger_panic` function to `neon_test_utils` to trigger SIGABRT
(by using `elog(PANIC, ...)`)
- Fix cleanup logic in regression tests when an endpoint has crashed
## Problem
Assume a timeline with the following workload: very slow ingest of
updates to a small number of keys that fit within the same partition (as
decided by `KeySpace::partition`). These tenants will create small L0
layers due to time-based rolling, and, consequently, the L1 layers will
also be small.
Currently, by default, we need to ingest 512 MiB of WAL before checking
whether an image layer is required. This scheme works fine under the
assumption that L1s are roughly of checkpoint-distance size, but as the
first paragraph explained, that's not the case for all workloads.
## Summary of changes
Check if new image layers are required at least once every checkpoint timeout interval.
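A minimal sketch of the new trigger condition (hypothetical function and parameter names): the image-layer check now fires on whichever comes first, ingested bytes or elapsed time:
```
use std::time::Duration;

// Hypothetical helper: image layer creation is considered when either
// enough WAL has been ingested or the checkpoint timeout has elapsed.
fn should_check_for_image_layers(
    bytes_since_last_check: u64,
    elapsed_since_last_check: Duration,
    checkpoint_distance: u64,
    checkpoint_timeout: Duration,
) -> bool {
    bytes_since_last_check >= checkpoint_distance
        || elapsed_since_last_check >= checkpoint_timeout
}

fn main() {
    // Slow-ingest tenant: only 4 MiB of WAL since the last check, but the
    // 10-minute checkpoint timeout has long passed, so we still check.
    assert!(should_check_for_image_layers(
        4 << 20,
        Duration::from_secs(3600),
        512 << 20,
        Duration::from_secs(600),
    ));
}
```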
## Problem
Safekeepers left running for a long time use a lot of memory (up to the
point of OOMing, on small nodes) for deleted timelines, because the
`Timeline` struct is kept alive as a guard against recreating deleted
timelines.
Closes: https://github.com/neondatabase/neon/issues/6810
## Summary of changes
- Create separate tombstones that just record a ttid and when the
timeline was deleted.
- Add a periodic housekeeping task that cleans up tombstones older than
a hardcoded TTL (24h)
I think this also makes https://github.com/neondatabase/neon/pull/6766
unneeded, as the tombstone is also checked during deletion.
I considered making the overall timeline map use an enum type containing
active or deleted, but having a separate map of tombstones avoids
bloating that map, so that calls like `get()` can still go straight to a
timeline without having to walk a hashmap that also contains tombstones.
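A minimal sketch of the tombstone bookkeeping under hypothetical names (the real safekeeper code keys the map by `TenantTimelineId` and runs housekeeping from its own task):
```
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Simplified stand-ins for the real types.
type TenantTimelineId = (u128, u128);

const TOMBSTONE_TTL: Duration = Duration::from_secs(24 * 3600);

struct GlobalTimelines {
    tombstones: HashMap<TenantTimelineId, Instant>, // deleted-at timestamps
}

impl GlobalTimelines {
    fn mark_deleted(&mut self, ttid: TenantTimelineId) {
        // Instead of keeping the whole Timeline struct alive as a guard,
        // remember only the ttid and the deletion time.
        self.tombstones.insert(ttid, Instant::now());
    }

    fn is_deleted(&self, ttid: &TenantTimelineId) -> bool {
        self.tombstones.contains_key(ttid)
    }

    fn housekeeping(&mut self) {
        // Periodically drop tombstones older than the TTL so the map does
        // not grow forever.
        self.tombstones
            .retain(|_, deleted_at| deleted_at.elapsed() < TOMBSTONE_TTL);
    }
}

fn main() {
    let mut gt = GlobalTimelines { tombstones: HashMap::new() };
    gt.mark_deleted((1, 2));
    assert!(gt.is_deleted(&(1, 2)));
    gt.housekeeping(); // a recent tombstone survives the TTL sweep
    assert!(gt.is_deleted(&(1, 2)));
}
```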
## Problem
This test directly manages locations on pageservers and configuration of
an endpoint. However, it did not switch off the parts of the storage
controller that attempt to do the same: occasionally, the test would
fail in a strange way such as a compute failing to accept a
reconfiguration request.
## Summary of changes
- Wire up the storage controller's compute notification hook to a no-op
handler
- Configure the tenant's scheduling policy to Stop.
## Problem
See #7466
## Summary of changes
Implement the algorithm described in
https://hal.science/hal-00465313/document.
A new GUC is added: `neon.wss_max_duration`, which specifies the size of
the sliding window (in seconds); the default value is 1 hour.
It is possible to request an estimate of the working set size (within
this window) using the new function
`approximate_working_set_size_seconds`. The old function
`approximate_working_set_size` is preserved for backward compatibility,
but its scope is also limited by `neon.wss_max_duration`.
The Neon extension version is bumped to 1.4.
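For intuition, a naive, exact stand-in for the question the sliding-window estimate answers (the extension itself uses the HyperLogLog-based algorithm from the paper; none of the names below are the extension's internals):
```
use std::collections::HashMap;

// Naive tracker: remember the last access time of each page and count the
// pages touched within a trailing window. The real extension approximates
// this with a sliding-window HyperLogLog sketch instead of an exact map.
struct WorkingSetTracker {
    last_access: HashMap<u64, u64>, // page id -> last access time (seconds)
}

impl WorkingSetTracker {
    fn record_access(&mut self, page: u64, now: u64) {
        self.last_access.insert(page, now);
    }

    // Roughly what approximate_working_set_size_seconds(window_secs) estimates.
    fn working_set_size(&self, now: u64, window_secs: u64) -> usize {
        self.last_access
            .values()
            .filter(|&&t| now.saturating_sub(t) <= window_secs)
            .count()
    }
}

fn main() {
    let mut t = WorkingSetTracker { last_access: HashMap::new() };
    t.record_access(1, 0);
    t.record_access(2, 3500);
    t.record_access(3, 3590);
    // Within the last 5 minutes only pages 2 and 3 were touched.
    assert_eq!(t.working_set_size(3600, 300), 2);
}
```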
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Matthias van de Meent <matthias@neon.tech>
Part of #7497, closes #8071 (accidentally closed #8208, reopened here).
## Problem
After the changes in #8084, we need synthetic size to also account for
leased LSNs so that users do not get free retention by running a small
ephemeral endpoint for a long time.
## Summary of changes
This PR integrates LSN leases into the synthetic size calculation. We
model leases as read-only branches started at the leased LSN (except that
they do not have a timeline id).
Other changes:
- Add new unit tests testing whether a lease behaves like a read-only
branch.
- Change `/size_debug` response to include lease point in the SVG
visualization.
- Fix `/lsn_lease` HTTP API to do proper parsing for POST.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
Adds a find-large-objects subcommand to the scrubber to allow listing
layer objects larger than a specific size.
To be used like:
```
AWS_PROFILE=dev REGION=us-east-2 BUCKET=neon-dev-storage-us-east-2 cargo run -p storage_scrubber -- find-large-objects --min-size 250000000 --ignore-deltas
```
Part of #5431
## Problem
When generations were new, these messages were an important way of
noticing if something unexpected was going on. We found some real issues
when investigating tests that unexpectedly tripped them.
As time has gone on, this code has become pretty battle-tested, and as we
do more live migrations etc., it's fairly normal to see the occasional
message from a node with a stale generation.
At this point the cognitive load on developers to selectively allow-list
these logs outweighs the benefit of having them at warn severity.
Closes: https://github.com/neondatabase/neon/issues/8080
## Summary of changes
- Downgrade "Dropped remote consistent LSN updates" and "Dropping stale
deletions" messages to INFO
- Remove all the allow-list entries for these logs.
## Problem
`pg-clients` workflow looks different from the main `build-and-test`
workflow for historical reasons (it was my very first task at Neon, and
back then I wasn't really familiar with the rest of the CI pipelines).
This PR unifies the `pg-clients` workflow with `build-and-test`.
## Summary of changes
- Rename `pg_clients.yml` to `pg-clients.yml`
- Run the workflow on changes in relevant files
- Create Allure report for tests
- Send Slack notifications to the `#on-call-qa-staging-stream` channel
(instead of `#on-call-staging-stream`)
- Update client libraries while we're here
## Problem
I'd like to keep this in the tree since it might be useful in prod as
well, but it's a bit too noisy as-is and is missing the LSN.
## Summary of changes
Add an LSN field and increase the rate limit duration.
## Problem
Currently, if you need to rename a job and the job is listed in [branch
protection
rules](https://github.com/neondatabase/neon/settings/branch_protection_rules),
the PR won't be allowed to merge.
## Summary of changes
- Add `conclusion` job that fails if any of its dependencies don't
finish successfully
## Problem
If there's a quota error, it makes sense to cache it for a short window
of time. Many clients do not handle database connection errors
gracefully and just spam retries 🤡
## Summary of changes
Update the node_info cache to support storing console errors. Store
console errors if they cannot be retried (using our own heuristic; this
should only trigger for quota-exceeded errors).
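A hedged sketch of the caching idea with hypothetical types (the real proxy types, heuristic, and TTLs differ):
```
use std::time::{Duration, Instant};

struct NodeInfo {
    host: String,
}

struct ConsoleError {
    message: String,
    retryable: bool, // our own heuristic; quota errors are not retryable
}

// The cache entry can hold either a successful lookup or a non-retryable
// console error, kept only for a short window.
struct CachedEntry {
    value: Result<NodeInfo, ConsoleError>,
    inserted_at: Instant,
    ttl: Duration,
}

impl CachedEntry {
    fn fresh(&self) -> bool {
        self.inserted_at.elapsed() < self.ttl
    }
}

fn cacheable(result: &Result<NodeInfo, ConsoleError>) -> bool {
    match result {
        Ok(_) => true,
        // Only cache errors that retrying would not fix, so clients that
        // spam reconnects stop hammering the console.
        Err(e) => !e.retryable,
    }
}

fn main() {
    let entry = CachedEntry {
        value: Err(ConsoleError {
            message: "quota exceeded".into(),
            retryable: false,
        }),
        inserted_at: Instant::now(),
        ttl: Duration::from_secs(30), // short negative-cache window
    };
    if cacheable(&entry.value) && entry.fresh() {
        if let Err(e) = &entry.value {
            println!("serving cached error: {}", e.message);
        }
    }
    let _unused = NodeInfo { host: "example".into() };
}
```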
## Problem
The metrics we have today aren't convenient for planning around the
impact of timeline archival on costs.
Closes: https://github.com/neondatabase/neon/issues/8108
## Summary of changes
- Add metric `pageserver_archive_size`, which indicates the logical
bytes of data which we would expect to write into an archived branch.
- Add metric `pageserver_pitr_history_size`, which indicates the
distance between last_record_lsn and the PITR cutoff.
These metrics are somewhat temporary: when we implement #8088 and
associated consumption metric changes, these will reach a final form.
For now, an "archived" branch is just any branch outside of its parent's
PITR window: later, archival will become an explicit state (which will
_usually_ correspond to falling outside the parent's PITR window).
The overall volume of timeline metrics is something to watch, but we are
removing many more in https://github.com/neondatabase/neon/pull/8245
than this PR is adding.
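For illustration, registering two such gauges with the plain `prometheus` crate; the pageserver uses its own metrics wrapper and label set, so treat the labels below as assumptions:
```
use once_cell::sync::Lazy;
use prometheus::{register_int_gauge_vec, IntGaugeVec};

// Illustrative registration of the two new series; label names are assumed.
static ARCHIVE_SIZE: Lazy<IntGaugeVec> = Lazy::new(|| {
    register_int_gauge_vec!(
        "pageserver_archive_size",
        "Logical bytes we would expect to write for an archived branch",
        &["tenant_id", "timeline_id"]
    )
    .unwrap()
});

static PITR_HISTORY_SIZE: Lazy<IntGaugeVec> = Lazy::new(|| {
    register_int_gauge_vec!(
        "pageserver_pitr_history_size",
        "Distance between last_record_lsn and the PITR cutoff",
        &["tenant_id", "timeline_id"]
    )
    .unwrap()
});

fn main() {
    ARCHIVE_SIZE
        .with_label_values(&["tenant-a", "timeline-1"])
        .set(42 << 20);
    PITR_HISTORY_SIZE
        .with_label_values(&["tenant-a", "timeline-1"])
        .set(7 << 20);
}
```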
I'd like to add some constraints to the layer map we generate in tests.
(1) is the layer map that the current compaction algorithm will produce.
It has the property that, for every delta layer, all delta layers that
overlap with it on the LSN axis have the same LSN range.
(2) is a layer map that cannot be produced by the legacy compaction
algorithm.
(3) is a layer map that will be produced by the future tiered-compaction
algorithm. The current validator does not allow it, but we can modify the
algorithm to allow it in the future.
## Summary of changes
Add a validator to check if the layer map is valid and refactor the test
cases to include delta layer start/end LSN.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
## Problem
We record detailed histograms for all page_service op types, which
mostly aren't very interesting, but make our prometheus scrapes huge.
Closes: #8223
## Summary of changes
- Only track GetPageAtLsn histograms at per-timeline granularity. For
all other operation types, rely on the existing node-wide histograms.
We want to run some specific pagebench test cases on dedicated hardware
to get reproducible results.
run1: 1 client per tenant => characterize throughput with n tenants.
- 500 tenants
- scale 13 (200 MB database)
- 1 hour duration
- ca 380 GB layer snapshot files
run2.singleclient: 1 client per tenant => characterize latencies
run2.manyclient: N clients per tenant => characterize throughput
scalability within one tenant.
- 1 tenant with 1 client for latencies
- 1 tenant with 64 clients, because for a high number of connections we
typically recommend the connection pooler, which by default uses 64
connections (for scalability)
- scale 136 (2048 MB database)
- 20 minutes each
PR #8106 was created with the assumption that no blob is larger than
`256 MiB`. Due to #7852 we have checks for *writes* of blobs larger
than that limit, but we didn't have checks for *reads* of such large
blobs: in theory, we could be reading such blobs every day, we just
don't happen to write them for some reason.
Therefore, we now add a warning for *reads* of such large blobs as well.
To make deploying compression less dangerous, we also only assume a
blob is compressed if the compression setting is present in the config.
This means that we can't back out of compression once we have enabled
it.
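A hedged sketch of the read-side behaviour (constants, names, and the decompression stub are illustrative, not the actual blob-reading code):
```
// Illustrative only: warn on reads above the 256 MiB assumption, and only
// attempt decompression when compression is enabled in the config.
const MAX_EXPECTED_BLOB_LEN: usize = 256 * 1024 * 1024;

#[derive(Clone, Copy)]
enum ImageCompression {
    Disabled,
    Zstd,
}

fn read_blob(raw: Vec<u8>, config_compression: ImageCompression) -> Vec<u8> {
    if raw.len() > MAX_EXPECTED_BLOB_LEN {
        eprintln!(
            "read blob of {} bytes, larger than the {} byte assumption",
            raw.len(),
            MAX_EXPECTED_BLOB_LEN
        );
    }
    match config_compression {
        // With compression absent from the config, bytes are returned
        // as-is; this is why compression cannot be backed out once enabled.
        ImageCompression::Disabled => raw,
        ImageCompression::Zstd => decompress(raw),
    }
}

fn decompress(raw: Vec<u8>) -> Vec<u8> {
    // stand-in for the actual zstd decompression path
    raw
}

fn main() {
    let blob = vec![0u8; 1024];
    assert_eq!(read_blob(blob, ImageCompression::Disabled).len(), 1024);
}
```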
Part of https://github.com/neondatabase/neon/issues/5431
## Problem
test_location_conf_churn fails on log errors when it tries to shut down a
pageserver immediately after starting a tenant attach, like this:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8224/9761000525/index.html#/testresult/15fb6beca5c7327c
```
shutdown:shutdown{tenant_id=35f5c55eb34e7e5e12288c5d8ab8b909 shard_id=0000}:timeline_shutdown{timeline_id=30936747043353a98661735ad09cbbfe shutdown_mode=FreezeAndFlush}: failed to freeze and flush: cannot flush frozen layers when flush_loop is not running, state is Exited\n')
```
This is happening because Tenant::shutdown fires its cancellation token
early if the tenant is not fully attached by the time shutdown is
called, so the flush loop has already shut down by the time we try to
flush.
## Summary of changes
- In the early-cancellation case, also set the shutdown mode to Hard to
skip trying to do a flush that will fail.
## Problem
GitHub Actions complain that we use actions that depend on deprecated
Node 16:
```
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: docker/setup-buildx-action@v2
```
But also, the latest `docker/setup-buildx-action` fails with the following
error:
```
/nvme/actions-runner/_work/_actions/docker/setup-buildx-action/v3/webpack:/docker-setup-buildx/node_modules/@actions/cache/lib/cache.js:175
throw new Error(`Path Validation Error: Path(s) specified in the action for caching do(es) not exist, hence no cache is being saved.`);
^
Error: Path Validation Error: Path(s) specified in the action for caching do(es) not exist, hence no cache is being saved.
at Object.rejected (/nvme/actions-runner/_work/_actions/docker/setup-buildx-action/v3/webpack:/docker-setup-buildx/node_modules/@actions/cache/lib/cache.js:175:1)
at Generator.next (<anonymous>)
at fulfilled (/nvme/actions-runner/_work/_actions/docker/setup-buildx-action/v3/webpack:/docker-setup-buildx/node_modules/@actions/cache/lib/cache.js:29:1)
```
We can work around this by setting `cache-binary: false` for `uses:
docker/setup-buildx-action@v3`.
## Summary of changes
- Update `docker/setup-buildx-action` from `v2` to `v3`, set
`cache-binary: false`
- Update `docker/login-action` from `v2` to `v3`
- Update `docker/build-push-action` from `v4`/`v5` to `v6`
All the code to ensure the WAL record lands at a page boundary was
unnecessary for reproducing the original problem. In fact, it's a pretty
basic test that checks that outbound replication (= neon as publisher)
still works after restarting the endpoint. It just used to be very
broken before commit 5ceccdc7de, which also added this test.
To verify that:
1. Check out commit f3af5f4660 (because the next commit, 7dd58e1449,
fixed the same bug in a different way, making it infeasible to revert
the bug fix in an easy way)
2. Revert the bug fix from commit 5ceccdc7de with this:
```
diff --git a/pgxn/neon/walproposer_pg.c b/pgxn/neon/walproposer_pg.c
index 7debb6325..9f03bbd99 100644
--- a/pgxn/neon/walproposer_pg.c
+++ b/pgxn/neon/walproposer_pg.c
@@ -1437,8 +1437,10 @@ XLogWalPropWrite(WalProposer *wp, char *buf, Size nbytes, XLogRecPtr recptr)
*
* https://github.com/neondatabase/neon/issues/5749
*/
+#if 0
if (!wp->config->syncSafekeepers)
XLogUpdateWalBuffers(buf, recptr, nbytes);
+#endif
while (nbytes > 0)
{
```
3. Run the test_wal_page_boundary_start regression test. It fails, as
expected.
4. Apply this commit to the test, and run it again. It still fails, with
the same error mentioned in issue #5749:
```
PG:2024-06-30 20:49:08.805 GMT [1248196] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"pub1"')
PG:2024-06-30 21:37:52.567 GMT [1467972] LOG: starting logical decoding for slot "sub1"
PG:2024-06-30 21:37:52.567 GMT [1467972] DETAIL: Streaming transactions committing after 0/1532330, reading WAL from 0/1531C78.
PG:2024-06-30 21:37:52.567 GMT [1467972] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"pub1"')
PG:2024-06-30 21:37:52.567 GMT [1467972] LOG: logical decoding found consistent point at 0/1531C78
PG:2024-06-30 21:37:52.567 GMT [1467972] DETAIL: There are no running transactions.
PG:2024-06-30 21:37:52.567 GMT [1467972] STATEMENT: START_REPLICATION SLOT "sub1" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '"pub1"')
PG:2024-06-30 21:37:52.568 GMT [1467972] ERROR: could not find record while sending logically-decoded data: invalid contrecord length 312 (expected 6) at 0/1533FD8
```