## Problem
See
https://neondb.slack.com/archives/C04DGM6SMTM/p1707149618314539?thread_ts=1707081520.140049&cid=C04DGM6SMTM
## Summary of changes
Perform a checkpoint check after processing every `ingest_batch_size`
(default 100) WAL records.
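A minimal sketch of the batching idea (hypothetical types and helper
functions, not the actual pageserver code):
```rust
struct WalRecord; // stand-in for a decoded WAL record

fn apply_record(_record: &WalRecord) { /* ingest one record */ }

fn check_checkpoint_distance() { /* checkpoint if enough WAL accumulated */ }

fn ingest(records: Vec<WalRecord>, ingest_batch_size: usize) {
    let mut in_batch = 0;
    for record in records {
        apply_record(&record);
        in_batch += 1;
        // Check whether a checkpoint is due once per batch, instead of
        // after every single record.
        if in_batch >= ingest_batch_size {
            check_checkpoint_distance();
            in_batch = 0;
        }
    }
    // Cover a trailing partial batch.
    check_checkpoint_distance();
}
```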
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
We use an outdated version of Python (3.9.2)
## Summary of changes
- Update Python to the latest patch version (3.9.18)
- Unify the usage of python caches where possible
It's awkward to point to a file when doing some kinds of ad-hoc
deployment (like right now, when I'm hacking on a helm chart and haven't
quite hooked up secrets properly yet). We take all the rest of the
secrets as CLI args directly, so let's do the same for the public key.
create_neon_superuser runs the first queries in the database after a
cold start. Traces suggest that those first queries can make up a
significant fraction of the cold start time. Make it more visible by
adding an explicit tracing span; currently you have to deduce it by
subtracting all the other child spans from the time spent in the parent
'apply_config' span.
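A minimal sketch with the `tracing` crate (the `Client` type and its
method are hypothetical stand-ins for the real database client):
```rust
use tracing::instrument;

struct Client; // stand-in for the real database client

impl Client {
    fn run_setup_queries(&mut self) { /* the first queries after cold start */ }
}

// #[instrument] gives this function its own span, so the time it takes
// shows up directly in traces rather than having to be inferred from
// the parent span.
#[instrument(skip_all)]
fn create_neon_superuser(client: &mut Client) {
    client.run_setup_queries();
}
```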
## Problem
We've got several issues with the current `benchmarks` job setup:
- The `benchmark_durations.json` file (which we generate at runtime to
split tests into several jobs[0]) is not consistent between these
jobs (and not consistent at all with the file from a rerun), i.e. the
test selection for each job can differ, which can result in tests
being missed from a test run.
- `scripts/benchmark_durations` doesn't fetch all tests from the
database (it doesn't expect any extra directories inside
`test_runner/performance`)
- For some reason, the current split into 4 groups ends up with the 4th
group having no tests to run, which fails the job[1]
- [0] https://github.com/neondatabase/neon/pull/4683
- [1] https://github.com/neondatabase/neon/issues/6629
## Summary of changes
- Generate the `benchmark_durations.json` file once before we start the
`benchmarks` jobs (this makes it consistent across the jobs) and pass
the file content through a GitHub Actions input (this makes it
consistent for reruns)
- Fix the SQL query in `scripts/benchmark_durations` so that it fetches
all required tests
- Split benchmarks into 5 jobs instead of 4.
When we later introduce a global pool of pre-spawned walredo
processes (https://github.com/neondatabase/neon/issues/6581), this
refactoring avoids having to plumb the reference to the pool through to
all the places where we create a broken tenant.
Builds atop the refactoring in #6583
Fix several test flakes:
- test_sharding_service_smoke had log failures on "Dropped LSN updates"
- test_emergency_mode had log failures on a deletion queue shutdown
check, where the check was incorrect because it expected the channel
receiver to stay alive after the cancellation token had fired.
- test_secondary_mode_eviction had racing heatmap uploads because the
test was using a live migration hook to set up locations, where that
migration was itself uploading heatmaps and generally making the
situation more complex than it needed to be.
These are the failure modes that I saw when spot checking the last few
failures of each test.
This will mostly/completely address #6511, but I'll leave that ticket
open for a couple of days and then check whether either of the tests
named in that ticket is still flaky.
Related: #6511
## Problem
The idea is to keep compute up and running if there are any active
logical replication subscriptions.
### Rationale
- The Write-Ahead Log (WAL) files, which contain the data changes,
need to be retained on the publisher side until the subscriber is
able to connect again and apply these changes. This could potentially
lead to increased disk usage on the publisher, and we do not want to
disrupt the source: I think it is more pain for our customers to resolve
storage issues on the source than to pay for the compute at the target.
- Upon resuming the compute resources, the subscriber will start
consuming and applying the changes from the retained WAL files. The time
taken to catch up will depend on the volume of changes and the
configured vCPUs. By not suspending, we avoid having to explain complex
situations where we lag behind (in extreme cases we could lag behind by
hours, days, or even months).
- I think an important use case for logical replication from a source is
a one-time migration or release upgrade. In this case the customer would
not mind if we do not suspend for the duration of the migration.
We need to document this in the release notes and the documentation, in
the context of logical replication where Neon is the target (subscriber).
### See internal discussion here
https://neondb.slack.com/archives/C04DGM6SMTM/p1706793400746539?thread_ts=1706792628.701279&cid=C04DGM6SMTM
Fix cloning the serialized heatmap on every attempt by turning it
into `bytes::Bytes` up front, so that each per-attempt clone is just a
refcount bump instead of a copy of the whole vec.
Also fixes one cancellation token cloning I had missed in #6618.
Cc: #6096
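A simplified sketch of the `bytes::Bytes` pattern (the retry loop and
upload step here are placeholders, not the actual heatmap code):
```rust
use bytes::Bytes;

fn upload_with_retries(serialized: Vec<u8>) {
    // Convert once: `Bytes::from(Vec<u8>)` takes ownership of the buffer,
    // and every subsequent clone is just a refcount bump.
    let body = Bytes::from(serialized);
    for _attempt in 0..3 {
        let per_attempt = body.clone(); // cheap: refcounted, no copy
        // ... hand `per_attempt` to the HTTP client ...
        drop(per_attempt);
    }
}
```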
## Problem
We don't have a neat way to carry around migration .sql files during
deploy, and in any case would prefer to avoid depending on diesel CLI to
deploy.
## Summary of changes
- Use `diesel_migrations` crate to embed migrations in our binary
- Run migrations on startup
- Drop the diesel dependency in the `neon_local` binary, as the
attachment_service binary just needs the database to exist. Do database
creation with a simple `createdb`.
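For reference, a sketch of how embedding with `diesel_migrations`
typically looks (the migrations path and error handling are simplified):
```rust
use diesel::pg::PgConnection;
use diesel::Connection;
use diesel_migrations::{embed_migrations, EmbeddedMigrations, MigrationHarness};

// Compile the .sql files from the `migrations/` directory into the binary.
const MIGRATIONS: EmbeddedMigrations = embed_migrations!("migrations");

fn run_migrations(database_url: &str) -> anyhow::Result<()> {
    let mut conn = PgConnection::establish(database_url)?;
    // Apply any migrations that this database has not seen yet.
    conn.run_pending_migrations(MIGRATIONS)
        .map_err(|e| anyhow::anyhow!(e))?;
    Ok(())
}
```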
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
The solution we ended up with for `backoff::retry` requires always
cloning the cancellation token even when there is just an `.await`. Fix
that, and also turn the return type into `Option<Result<T, E>>`,
avoiding the need for the `E::cancelled()` fn to be passed in.
Cc: #6096
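A pared-down sketch of the new shape (a hypothetical, simplified
signature; backoff sleeps between attempts are elided):
```rust
use std::future::Future;
use tokio_util::sync::CancellationToken;

// The token is borrowed rather than cloned, and cancellation is
// reported as `None`, so callers no longer need to supply an
// `E::cancelled()` constructor.
async fn retry<T, E, F, Fut>(
    mut op: F,
    max_attempts: u32,
    cancel: &CancellationToken,
) -> Option<Result<T, E>>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, E>>,
{
    let mut last_err = None;
    for _ in 0..max_attempts {
        tokio::select! {
            _ = cancel.cancelled() => return None,
            res = op() => match res {
                Ok(v) => return Some(Ok(v)),
                Err(e) => last_err = Some(e),
            },
        }
        // (backoff sleep between attempts elided)
    }
    last_err.map(Err)
}
```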
## Problem
This change was accidentally left out of #6585 -- I just forgot to push
the very last version of my branch.
Now that we can load the database URL from Secrets Manager, we don't
always need it on the CLI any more. We should let the user omit it
instead of passing `--database-url ""`
## Summary of changes
- Make `--database-url` optional
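A sketch of what the optional flag could look like with clap's derive
API (field and struct names assumed):
```rust
use clap::Parser;

#[derive(Parser)]
struct Args {
    /// Optional: when omitted, the database URL is loaded from
    /// Secrets Manager instead.
    #[arg(long)]
    database_url: Option<String>,
}
```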
Cleanups from https://github.com/neondatabase/neon/pull/6394
- There was a rogue `*` breaking the `GET /tenant/:tenant_id`, which
passes through to shard zero
- There was a duplicate migrate endpoint
- There are un-prefixed API endpoints that were only needed for compat
tests and can now be removed.
## Problem
Running some memory profiling with a high concurrent request rate shows
what appears to be memory fragmentation.
## Summary of changes
Eventually, we will want to separate global memory (caches) from local
memory (per connection handshake and per passthrough).
Using a string interner for the project info cache helps reduce some of
the fragmentation of the global cache by having a single heap dedicated
to project strings, instead of scattering them throughout all the
requests.
At the same time, the interned key is 4 bytes vs the 24 bytes of a
`SmolStr`.
Important: we should only store verified strings in the interner because
there's no way to remove them afterwards. This makes the interner a good
fit for caching responses from the console.
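To illustrate why the key can be 4 bytes, here is a toy append-only
interner (not proxy's actual implementation, which can avoid storing
each string twice):
```rust
use std::collections::HashMap;

#[derive(Default)]
struct Interner {
    map: HashMap<String, u32>,
    strings: Vec<String>,
}

impl Interner {
    /// Returns a stable 4-byte id for `s`, storing it on first sight.
    /// Entries are never removed, hence: verified strings only.
    fn intern(&mut self, s: &str) -> u32 {
        if let Some(&id) = self.map.get(s) {
            return id;
        }
        let id = self.strings.len() as u32;
        self.strings.push(s.to_owned());
        self.map.insert(s.to_owned(), id);
        id
    }

    fn resolve(&self, id: u32) -> &str {
        &self.strings[id as usize]
    }
}
```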
Before tenant migration it made sense to leak broken tenants in the
metrics until restart. Nowadays it makes less sense, because on
cancellation we set the tenant to broken. The set metric still allows
filterable alerting.
Fixes: #6507
## Problem
We were archiving the perf benchmark results to
- the neon DB
- the git repo `zenith-perf-data`
As the perf batch ran in parallel, uploading the results to the
`zenith-perf-data` git repo resulted in merge conflicts.
This made the run flaky and, as a side effect, the build started
failing.
The problem is described in
https://github.com/neondatabase/neon/issues/5160
## Summary of changes
As the results in the git repo were not used, uploading them was
redundant, so this PR removes the uploading of perf results to the git
repo.
The shell script `generate_and_push_perf_report.sh` was using the Python
scripts
[git-upload](https://github.com/neondatabase/neon/compare/remove-perf-benchmark-git-upload?expand=1#diff-c6d938e7f060e487367d9dc8055245c82b51a73c1f97956111a495a8a86e9a33)
and
[scripts/generate_perf_report_page.py](https://github.com/neondatabase/neon/pull/6590/files#diff-81af2147e72d07e4cf8ee4395632596d805d6168ba75c71cab58db2659956ef8),
which are not used anywhere else in the repo, so they are cleaned up as
well.
There is currently no cleanup done after a delta layer creation error,
so delta layers can accumulate on disk, and the problem gets worse as
the operation is retried. Therefore, delete them from disk (if something
has been written).
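The rough shape of the cleanup (hypothetical helper names):
```rust
use std::fs;
use std::path::Path;

fn create_delta_layer(path: &Path) -> std::io::Result<()> {
    match write_delta_layer(path) {
        Ok(()) => Ok(()),
        Err(e) => {
            // Best-effort cleanup: remove whatever was already written so
            // retries don't leave orphaned layer files behind. The file
            // may not exist if the error happened before the first write.
            let _ = fs::remove_file(path);
            Err(e)
        }
    }
}

fn write_delta_layer(_path: &Path) -> std::io::Result<()> {
    // ... build and flush the layer file ...
    Ok(())
}
```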
I was getting an error:
/home/heikki/git-sandbox/neon//pgxn/neon_walredo/walredoproc.c:161:5: error: conflicting types for ‘close_range’; have ‘int(unsigned int, unsigned int, unsigned int)’
161 | int close_range(unsigned int start_fd, unsigned int count, unsigned int flags) {
| ^~~~~~~~~~~
In file included from /usr/include/x86_64-linux-gnu/bits/sigstksz.h:24,
from /usr/include/signal.h:328,
from /home/heikki/git-sandbox/neon//pgxn/neon_walredo/walredoproc.c:50:
/usr/include/unistd.h:1208:12: note: previous declaration of ‘close_range’ with type ‘int(unsigned int, unsigned int, int)’
1208 | extern int close_range (unsigned int __fd, unsigned int __max_fd,
| ^~~~~~~~~~~
The discrepancy is in the 3rd argument. Apparently in the glibc
wrapper it's signed.
As a quick fix, rename our close_range() function, the one that calls
syscall() directly, to avoid the clash with the glibc wrapper. In the
long term, an autoconf test would be nice, plus some equivalent on
macOS; see issue #6580.