With this, 10us batching timeout works, but it has some other wrinkles:
- it uses the signal-based timer APIs instead of going through epoll (=> timerfd)
= it needs to make a syscall for each batch, which costs around 1-2us, so, probably significant CPU time wasted on this.
## Problem
It is called context/ctx everywhere and the Monitoring suffix needlessly
confuses with proper monitoring code.
## Summary of changes
* Rename RequestMonitoring to RequestContext
* Rename RequestMonitoringInner to RequestContextInner
## Problem
I've noticed that we have 2 flaky tests which failed with error:
```
re.error: missing ), unterminated subpattern at position 21
```
- `test_timeline_archival_chaos` — has been already fixed
- `test_sharded_tad_interleaved_after_partial_success` — I didn't manage
to find the incorrect regex
[Internal link](https://neonprod.grafana.net/goto/yfmVHV7NR?orgId=1)
## Summary of changes
- Wrap `re.match` in `try..except` block and print incorrect regex
## Problem
We want to keep `#on-call-staging-stream` channel close to the prod one
and redirect notifications from failing benchmarks to another channel
for investigation.
## Summary of changes
- Send notifications regarding failures in `benchmarking` job to
`#on-call-staging-stream`
- Send notifications regarding failures in `periodic_pagebench` job to
`#on-call-staging-stream`
## Problem
Two recently observed log errors indicate safekeeper tasks for a
timeline running after that timeline's deletion has started.
- https://github.com/neondatabase/neon/issues/8972
- https://github.com/neondatabase/neon/issues/8974
These code paths do not have a mechanism that coordinates task shutdown
with the overall shutdown of the timeline.
## Summary of changes
- Add a `Gate` to `Timeline`
- Take the gate as part of resident timeline guard: any code that holds
a guard over a timeline staying resident should also hold a guard over
the timeline's total lifetime.
- Take the gate from the wal removal task
- Respect Timeline::cancel in WAL send/recv code, so that we do not
block shutdown indefinitely.
- Add a test that deletes timelines with open pageserver+compute
connections, to check these get torn down as expected.
There is some risk to introducing gates: if there is code holding a gate
which does not properly respect a cancellation token, it can cause
shutdown hangs. The risk of this for safekeepers is lower in practice
than it is for other services, because in a healthy timeline deletion,
the compute is shutdown first, then the timeline is deleted on the
pageserver, and finally it is deleted on the safekeepers -- that makes
it much less likely that some protocol handler will still be running.
Closes: #8972Closes: #8974
See https://github.com/neondatabase/cloud/issues/14378
In collaboration with @cloneable and @awarus, we sifted through logs and
simply demoted some logs to debug. This is not at all finished and there
are more logs to review, but we ran out of time in the session we
organised. In any slightly more nuanced cases, we didn't touch the log,
instead leaving a TODO comment.
In timeline preloading, we also do a preload for offloaded timelines.
This includes the download of `index-part.json`. Ultimately, such a
download is wasteful, therefore avoid it. Same goes for the remote
client, we just discard it immediately thereafter.
Part of #8088
---------
Co-authored-by: Christian Schwarz <christian@neon.tech>
Before, compute_ctl didn't have a good registry for what command would
run when, depending exclusively on sync code to apply changes. When
users have many databases/roles to manage, this step can take a
substantial amount of time, breaking assumptions about low (re)start
times in other systems.
This commit reduces the time compute_ctl takes to restart when changes
must be applied, by making all commands more or less blind writes, and
applying these commands in an asynchronous context, only waiting for
completion once we know the commands have all been sent.
Additionally, this reduces time spent by batching per-database
operations where previously we would create a new SQL connection for
every user-database operation we planned to execute.
## Problem
We have a bunch of duplicated code for automated releases. There will be
even more, once we have `release-compute` branch
(https://github.com/neondatabase/neon/pull/9637).
Another issue with the current `release` workflow is that it creates a
PR from the main as is. If we create 2 different releases from the
same commit, GitHub could mix up results from different PRs.
## Summary of changes
- Create a reusable workflow for releases
- Create an empty commit to differentiate releases
part of https://github.com/neondatabase/neon/issues/9114, we want to be
able to run partial gc-compaction in tests. In the future, we can also
expand this functionality to legacy compaction, so that we can trigger
compaction for a specific key range.
## Summary of changes
* Support passing compaction key range through pageserver routes.
* Refactor input parameters of compact related function to take the new
`CompactOptions`.
* Add tests for partial compaction. Note that the test may or may not
trigger compaction based on GC horizon. We need to improve the test case
to ensure things always get below the gc_horizon and the gc-compaction
can be triggered.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
close https://github.com/neondatabase/neon/issues/9730
The test case tests if anything goes wrong during pageserver restart +
*during timeline creation not complete*. Therefore, queue is stopped
error is normal in this case, except that it should be categorized as a
shutdown error instead of a real error.
## Summary of changes
* More comments for the test case.
* Queue stopped error will now be forwarded as
CreateTimelineError::ShuttingDown.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
The community decided to make a new off-schedule release due to ABI
breakage in last week's release. We're not affected by the ABI
breakage because we rebuild all extensions in our docker images, but
let's stay up-to-date. There were a few other fixes in the release
too.
Currently, local_proxy will write an error log if it doesn't find the
config file. This is expected for startup, so it's just noise. It is an
error if we do receive an explicit SIGHUP though.
I've also demoted the build info logs to be debug level. We don't need
them in the compute image since we have other ways to determine what
code is running.
Lastly, I've demoted SIGHUP signal handling from warn to info, since
it's not really a warning event.
See https://github.com/neondatabase/cloud/issues/10880 for more details
## Problem
The first version of the ingest benchmark had some parsing and reporting
logic in shell script inside GitHub workflow.
it is better to move that logic into a python testcase so that we can
also run it locally.
## Summary of changes
- Create new python testcase
- invoke pgcopydb inside python test case
- move the following logic into python testcase
- determine backpressure
- invoke pgcopydb and report its progress
- parse pgcopydb log and extract metrics
- insert metrics into perf test database
- add additional column to perf test database that can receive endpoint
ID used for pgcopydb run to have it available in grafana dashboard when
retrieving other metrics for an endpoint
## Example run
https://github.com/neondatabase/neon/actions/runs/11860622170/job/33056264386
Earlier work (#7547) has made the scrubber internally generic, but one
could only configure it to use S3 storage.
This is the final piece to make (most of, snapshotting still requires
S3) the scrubber be able to be configured via GenericRemoteStorage.
I.e. you can now set an env var like:
```
REMOTE_STORAGE_CONFIG='remote_storage = { bucket_name = "neon-dev-safekeeper-us-east-2d", bucket_region = "us-east-2" }
```
and the scrubber will read it instead.
There is a potential data corruption issue, not one I've encountered,
but it's still not hard to hit with some correct looking code given our
current architecture. It has to do with the timeline's memory object storage
via reference counted `Arc`s, and the removal of `retain_lsn` entries at
the drop of the last `Arc` reference.
The corruption steps are as follows:
1. timeline gets offloaded. timeline object A doesn't get dropped
though, because some long-running task accesses it
2. the same timeline gets unoffloaded again. timeline object B gets
created for it, timeline object A still referenced. both point to the
same timeline.
3. the task keeping the reference to timeline object A exits. destructor
for object A runs, removing `retain_lsn` in the timeline's parent.
4. the timeline's parent runs gc without the `retain_lsn` of the still
exant timleine's child, leading to data corruption.
In general we are susceptible each time when we recreate a `Timeline`
object in the same process, which happens both during a timeline
offload/unoffload cycle, as well as during an ancestor detach operation.
The solution this PR implements is to make the destructor for a timeline
as well as an offloaded timeline remove at most one `retain_lsn`.
PR #9760 has added a log line to print the refcounts at timeline
offload, but this only detects one of the places where we do such a
recycle operation. Plus it doesn't prevent the actual issue.
I doubt that this occurs in practice. It is more a defense in depth measure.
Usually I'd assume that the timeline gets dropped immediately in step 1,
as there is no background tasks referencing it after its shutdown.
But one never knows, and reducing the stakes of step 1 actually occurring
is a really good idea, from potential data corruption to waste of CPU time.
Part of #8088