## Problem
pgvector 0.7.4 and 0.8.0 are binary compatible, but the way pgvector
structures its SQL files prevents installing 0.7.4 if you have the 0.8.0
distribution. Yet some users may need the previous version for backward
compatibility, e.g., when restoring a dump.
See this thread for discussion:
https://neondb.slack.com/archives/C04DGM6SMTM/p1735911490242919?thread_ts=1731343604.259169&cid=C04DGM6SMTM
## Summary of changes
Put the `vector--0.7.4.sql` file into the compute image to allow installing this
version as well.
Tested on staging and it seems to be working as expected:
```sql
select * from pg_available_extensions where name = 'vector';
name | default_version | installed_version | comment
--------+-----------------+-------------------+------------------------------------------------------
vector | 0.8.0 | (null) | vector data type and ivfflat and hnsw access methods
create extension vector version '0.7.4';
select * from pg_available_extensions where name = 'vector';
name | default_version | installed_version | comment
--------+-----------------+-------------------+------------------------------------------------------
vector | 0.8.0 | 0.7.4 | vector data type and ivfflat and hnsw access methods
alter extension vector update;
select * from pg_available_extensions where name = 'vector';
name | default_version | installed_version | comment
--------+-----------------+-------------------+------------------------------------------------------
vector | 0.8.0 | 0.8.0 | vector data type and ivfflat and hnsw access methods
drop extension vector;
create extension vector;
select * from pg_available_extensions where name = 'vector';
name | default_version | installed_version | comment
--------+-----------------+-------------------+------------------------------------------------------
vector | 0.8.0 | 0.8.0 | vector data type and ivfflat and hnsw access methods
```
If this turns out to be a good approach, we can adopt the same for other
extensions with a stable ABI -- support both the `current` and `current - 1`
releases.
## Refs
- extracted from https://github.com/neondatabase/neon/pull/9353
## Problem
Before this PR, when task_mgr shutdown is signalled, e.g. during
pageserver shutdown or Tenant shutdown, initial logical size calculation
stops polling and drops the future that represents the calculation.
This is against the current policy that we poll all futures to
completion.
This became apparent during development of concurrent IO which warns if
we drop a `Timeline::get_vectored` future that still has in-flight IOs.
We may revise the policy in the future, but, right now initial logical
size calculation is the only part of the codebase that doesn't adhere to
the policy, so let's fix it.
## Code Changes
- Make initial logical size calculation sensitive exclusively to `Timeline::cancel`.
  - This should be sufficient for all cases of shutdown; the sensitivity
    to task_mgr shutdown is unnecessary.
- This broke the various cancellation tests in `test_timeline_size.py`, e.g.,
  `test_timeline_initial_logical_size_calculation_cancellation`:
  - the tests would time out because the await point was not sensitive to
    cancellation.
- To fix this, refactor `pausable_failpoint` so that it accepts a
  cancellation token (see the sketch after this list).
  - Side note: we _really_ should write our own failpoint library; maybe
    after we get a heap-allocated RequestContext, we can plumb failpoints
    through there.
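
A minimal sketch of what a cancellation-aware pausable failpoint could look like, assuming the `fail`, `tokio`, and `tokio-util` crates; the function name and error handling here are illustrative, not the actual helper in the codebase:

```rust
use tokio_util::sync::CancellationToken;

/// Illustrative stand-in for a cancellation-aware `pausable_failpoint`.
async fn pausable_failpoint_with_cancel(
    name: &'static str,
    cancel: &CancellationToken,
) -> Result<(), ()> {
    // The `fail` crate's failpoints are synchronous, so evaluate the
    // (potentially pausing) failpoint on a blocking thread.
    let pause = tokio::task::spawn_blocking(move || {
        fail::fail_point!(name); // blocks here while the failpoint is set to "pause"
    });
    tokio::select! {
        // Failpoint finished (or was never set to "pause").
        res = pause => res.map_err(|_| ()),
        // Cancellation arrived while paused: return instead of hanging forever.
        _ = cancel.cancelled() => Err(()),
    }
}
```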
## Problem
PR #9993 was supposed to enable `page_service_pipelining` by default for
all `NeonEnv`s, but this was ineffective in our CI environment.
Thus, CI Python-based tests and benchmarks, unless explicitly
configuring pipelining, were still using serial protocol handling.
## Analysis
The root cause was that in our CI environment,
`config.compatibility_neon_binpath` is always truthy.
It is not in local environments, which is why this slipped through in
local testing.
Lesson: always add a log line to pageserver startup and spot-check tests
to ensure the intended default is picked up.
## Summary of changes
Fix it. Since enough time has passed, the compatibility snapshot contains
a recent enough software version, so we don't need to worry about
`compatibility_neon_binpath` anymore.
## Future Work
The question of how to add a new default for everything except compatibility
tests, which is what the broken code was supposed to do, is still unsolved.
Slack discussion:
https://neondb.slack.com/archives/C059ZC138NR/p1737490501941309
## Problem
Currently, the layer flush loop will continue flushing layers as long as
any are pending, and only notify waiters once there are no further
layers to flush. This can cause waiters to wait longer than necessary,
and potentially starve them if pending layers keep arriving faster than
they can be flushed. The impact of this will increase when we add
compaction backpressure and propagate it up into the WAL receiver.
Extracted from #10405.
## Summary of changes
Break out of the layer flush loop once we've flushed up to the requested
LSN. If further flush requests have arrived in the meantime, flushing
will resume immediately after.
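
A minimal, self-contained sketch of the intended control flow, using hypothetical types (`FlushLoop`, `Layer`) rather than the real pageserver ones:

```rust
use std::collections::VecDeque;

type Lsn = u64;

struct Layer {
    end_lsn: Lsn,
}

struct FlushLoop {
    pending: VecDeque<Layer>,
    flushed_up_to: Lsn,
}

impl FlushLoop {
    /// Flush layers until we've covered `requested_lsn`, then return so
    /// waiters for that LSN can be notified. Layers that arrived in the
    /// meantime stay in `pending` and are handled on the next call.
    fn flush_up_to(&mut self, requested_lsn: Lsn) {
        while self.flushed_up_to < requested_lsn {
            let Some(layer) = self.pending.pop_front() else { break };
            // (actual flushing work would happen here)
            self.flushed_up_to = self.flushed_up_to.max(layer.end_lsn);
        }
        // notify waiters for `requested_lsn` here
    }
}

fn main() {
    let mut fl = FlushLoop {
        pending: VecDeque::from([
            Layer { end_lsn: 10 },
            Layer { end_lsn: 20 },
            Layer { end_lsn: 30 },
        ]),
        flushed_up_to: 0,
    };
    fl.flush_up_to(15);
    // Stopped as soon as the requested LSN was covered;
    // the layer ending at 30 is left pending for the next iteration.
    assert_eq!(fl.flushed_up_to, 20);
    assert_eq!(fl.pending.len(), 1);
}
```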
## Problem
For compaction backpressure, we need a mechanism to signal when
compaction has reduced the L0 delta layer count below the backpressure
threshold.
Extracted from #10405.
## Summary of changes
Add `LayerMap::watch_level0_deltas()` which returns a
`tokio::sync::watch::Receiver` signalling the current L0 delta layer
count.
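
A minimal sketch of the watch-channel pattern this relies on, using a hypothetical `LayerMapStub` instead of the real `LayerMap`:

```rust
use tokio::sync::watch;

struct LayerMapStub {
    level0_deltas_tx: watch::Sender<usize>,
}

impl LayerMapStub {
    fn new() -> Self {
        let (level0_deltas_tx, _rx) = watch::channel(0);
        Self { level0_deltas_tx }
    }

    /// Analogous in shape to `LayerMap::watch_level0_deltas()`:
    /// hand out a receiver that reflects the current L0 delta layer count.
    fn watch_level0_deltas(&self) -> watch::Receiver<usize> {
        self.level0_deltas_tx.subscribe()
    }

    fn set_level0_delta_count(&self, count: usize) {
        // `send_replace` updates the value even if no receivers are listening.
        self.level0_deltas_tx.send_replace(count);
    }
}

#[tokio::main]
async fn main() {
    let map = LayerMapStub::new();
    let mut rx = map.watch_level0_deltas();
    map.set_level0_delta_count(12);
    // A backpressure waiter can block until the count drops below a threshold.
    let below_threshold = rx.wait_for(|&n| n < 20).await.unwrap();
    assert!(*below_threshold < 20);
}
```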
## Problem
It's sometimes useful to obtain the elapsed duration from a
`StorageTimeMetricsTimer` for purposes beyond just recording it in
metrics (e.g. to log it).
Extracted from #10405.
## Summary of changes
Add `StorageTimeMetricsTimer.elapsed()` and return the duration from
`stop_and_record()`.
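
A minimal sketch of the API shape this adds, with a stand-in `Histogram` instead of the real metrics types:

```rust
use std::time::{Duration, Instant};

// Stand-in for the real metrics sink.
struct Histogram;
impl Histogram {
    fn observe(&self, _secs: f64) {}
}

struct TimerSketch {
    start: Instant,
    histogram: Histogram,
}

impl TimerSketch {
    fn new(histogram: Histogram) -> Self {
        Self { start: Instant::now(), histogram }
    }

    /// Elapsed time so far, without recording anything.
    fn elapsed(&self) -> Duration {
        self.start.elapsed()
    }

    /// Record the elapsed time in the metric and also return it,
    /// so callers can e.g. log it.
    fn stop_and_record(self) -> Duration {
        let elapsed = self.elapsed();
        self.histogram.observe(elapsed.as_secs_f64());
        elapsed
    }
}

fn main() {
    let timer = TimerSketch::new(Histogram);
    // ... timed work ...
    let duration = timer.stop_and_record();
    println!("operation took {duration:?}");
}
```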
The issue is that `get_vectored_reconstruct_data` latency means something
very different with concurrent IO than it did before, because all the
time we spend on the data blocks is no longer part of the
`get_vectored_reconstruct_data().await` wall-clock time.
`GET_RECONSTRUCT_DATA_TIME`: all 3 dashboards that use it are in my
/personal/christian folder, so I guess I'm free to break them 😄
https://github.com/search?q=repo%3Aneondatabase%2Fgrafana-dashboard-export%20pageserver_getpage_get_reconstruct_data_seconds&type=code

`RECONSTRUCT_TIME`: used in a couple of dashboards I think nobody uses:
- Timeline Inspector
- Sharding WAL streaming
- Pageserver
- walredo time throwaway

Vlad agrees with removing them for now. Maybe in the future we'll add
some back.

Metric renames:
- `pageserver_getpage_get_reconstruct_data_seconds` -> `pageserver_getpage_io_plan_seconds`
- `pageserver_getpage_reconstruct_data_seconds` -> `pageserver_getpage_io_execute_seconds`
## Problem
part of https://github.com/neondatabase/neon/issues/9114
The automatic trigger is already implemented at
https://github.com/neondatabase/neon/pull/10221 but I need to write some
tests and finish my experiments in staging before I can merge it with
confidence. Given that I have some other patches that will modify the
config items, I'd like to get the config items merged first to reduce
conflicts.
## Summary of changes
* add `l2_lsn` to index_part.json -- below that LSN, data has been
processed by gc-compaction
* add a set of gc-compaction auto-trigger control items to the config
(see the sketch below)
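
A hedged sketch of the shape of these changes; the exact field names and types added in the PR may differ:

```rust
use serde::{Deserialize, Serialize};

// Stand-in for the real `Lsn` type.
type Lsn = u64;

#[derive(Serialize, Deserialize)]
struct IndexPartSketch {
    // ... existing index_part.json fields elided ...
    /// Data at or below this LSN has already been processed by gc-compaction.
    #[serde(skip_serializing_if = "Option::is_none", default)]
    l2_lsn: Option<Lsn>,
}

#[derive(Serialize, Deserialize)]
struct GcCompactionAutoTriggerSketch {
    /// Whether the automatic gc-compaction trigger is enabled.
    gc_compaction_enabled: bool,
    /// Illustrative threshold knobs the trigger might consult.
    gc_compaction_initial_threshold_kb: u64,
    gc_compaction_ratio_percent: u64,
}
```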
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We are removing the `pg_anon` v1 extension from Neon, so we don't need
to test it anymore and can remove the code for simplicity.
## Summary of changes
The code required for testing `pg_anon` is removed.
```
christian@neon-hetzner-dev-christian:[~/src/neon-work-1]: NEON_PAGESERVER_USE_ONE_RUNTIME=current_thread DEFAULT_PG_VERSION=14 BUILD_TYPE=release poetry run pytest -k 'test_ancestor_detach_branched_from[release-pg14-False-True-after]'
2025-01-21T18:42:38.794431Z WARN initial_size_calculation{tenant_id=cb106e50ddedc20995b0b1bb065ebcd9 shard_id=0000 timeline_id=e362ff10e7c4e116baee457de5c766d9}:logical_size_calculation_task: dropping ValuesReconstructState while some IOs have not been completed num_active_ios=1 sidecar_task_id=None backtrace= 0: <pageserver::tenant::storage_layer::ValuesReconstructState as core::ops::drop::Drop>::drop
      at /home/christian/src/neon-work-1/pageserver/src/tenant/storage_layer.rs:553:24
   1: core::ptr::drop_in_place<pageserver::tenant::storage_layer::ValuesReconstructState>
      at /home/christian/.rustup/toolchains/1.84.0-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:521:1
   2: core::ptr::drop_in_place<pageserver::tenant::timeline::Timeline::get::{{closure}}>
      at /home/christian/src/neon-work-1/pageserver/src/tenant/timeline.rs:1042:5
   3: core::ptr::drop_in_place<pageserver::pgdatadir_mapping::<impl pageserver::tenant::timeline::Timeline>::get_current_logical_size_non_incremental::{{closure}}>
      at /home/christian/src/neon-work-1/pageserver/src/pgdatadir_mapping.rs:1001:67
   4: core::ptr::drop_in_place<pageserver::tenant::timeline::Timeline::calculate_logical_size::{{closure}}>
      at /home/christian/src/neon-work-1/pageserver/src/tenant/timeline.rs:3100:18
   5: core::ptr::drop_in_place<pageserver::tenant::timeline::Timeline::logical_size_calculation_task::{{closure}}::{{closure}}::{{closure}}>
      at /home/christian/src/neon-work-1/pageserver/src/tenant/timeline.rs:3050:22
   6: core::ptr::drop_in_place<pageserver::tenant::timeline::Timeline::logical_size_calculation_task::{{closure}}::{{closure}}>
      at /home/christian/src/neon-work-1/pageserver/src/tenant/timeline.rs:3060:5
```
For all high-rooted, long-lived `IoConcurrency` instances, `IoConcurrency::drop` will
never run.
What we actually care about is that we leave no dangling IOs after
`get_vectored_impl`, which lives much shorter than a high-rooted
`IoConcurrency`.
However, the lifetime of `ValuesReconstructState` is generally equal to the
lifetime of `get_vectored_impl`.
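
For illustration, a minimal sketch of the kind of drop-time check that emits the warning above; the real `ValuesReconstructState` is of course more involved:

```rust
// Hypothetical stand-in for the real struct; only the drop-time check
// from the warning above is modelled here.
struct ValuesReconstructStateSketch {
    num_active_ios: usize,
}

impl Drop for ValuesReconstructStateSketch {
    fn drop(&mut self) {
        if self.num_active_ios > 0 {
            eprintln!(
                "dropping ValuesReconstructState while some IOs have not been completed num_active_ios={}",
                self.num_active_ios
            );
        }
    }
}

fn main() {
    // Dropping the state while an IO is still in flight triggers the warning,
    // which is what happens when the enclosing `get` future is dropped mid-poll.
    let _state = ValuesReconstructStateSketch { num_active_ios: 1 };
}
```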
We did not have any tests for the `fast_import` binary yet.
In this PR I have introduced:
- a `FastImport` class and tools for testing in Python
- a basic test that runs fast import against vanilla Postgres and checks
that the data is there
Should be merged after https://github.com/neondatabase/neon/pull/10251
We currently have some flakiness in
`test_scrubber_physical_gc_ancestors`, see #10391.
The first flakiness kind is about the reconciler not actually becoming
idle within the timeout of 30 seconds. We see continuous forward
progress, so this is likely not a hang. We also see this happen in
parallel with a test failure, so it is likely due to runners being
overloaded. Therefore, we increase the timeout.
The second flakiness kind is an assertion failure. This one is a little
bit trickier, but we saw in the successful run that the LSN advanced
between the compaction run (which created layer files) and the GC run.
Apparently GC rejects reductions to the single-image-layer setting if
the cutoff LSN is the same as the LSN of the image layer: it will claim
that the layer is newer than the space cutoff and therefore skip it,
while thinking the old layer (that we want to delete) is the latest one
(so it's not deleted).
We address the second flakiness kind by inserting a tiny amount of WAL
between the compaction and gc. This should hopefully fix things.
Related issue: #10391
(not closing it with the merge of this PR as we'll need to validate that
these changes had the intended effect).
Thanks to Chi for going over this together with me in a call.
## Problem
When releasing `release-7574`, the GitHub release creation failed with
"body is too long" (see
https://github.com/neondatabase/neon/actions/runs/12834025431/job/35792346745#step:5:77).
There's lots of room for improvement of the release notes, but for now
we'll disable them instead.
## Summary of changes
- Disable automatic generation of release notes for GitHub releases
- Enable creation of GitHub releases for proxy/compute
Reproduced by
`test_runner/regress/test_branching.py::test_branching_with_pgbench[debug-pg16-flat-1-10]`.
It kinda makes sense that this deadlocks in `sequential` mode.
However, it also deadlocks in `sidecar-task` mode.
I don't understand why.
This simplifies the code in `pageserver_physical_gc` a little bit after
the feedback in #10007 that the code is too complicated.
Most importantly, we don't pass around `GcSummary` any more in a
complicated fashion, and we save on async stream-combinator-inception in
one place in favour of `try_stream!{}`.
Follow-up of #10007
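
For illustration, a minimal sketch of the `try_stream!{}` style (from the `async-stream` crate) that replaces the combinator chains; the listing function here is hypothetical, not the actual scrubber code:

```rust
use async_stream::try_stream;
use futures::{pin_mut, Stream, StreamExt};

// Hypothetical async step standing in for an S3 listing/metadata fetch.
async fn fetch_key(i: u32) -> Result<String, std::io::Error> {
    Ok(format!("layer-{i}"))
}

/// Produce a fallible stream of keys. Inside `try_stream!`, `?`
/// propagates an error into the stream as an `Err` item and ends it,
/// which avoids nesting stream combinators to thread errors through.
fn layer_keys() -> impl Stream<Item = Result<String, std::io::Error>> {
    try_stream! {
        for i in 0..3 {
            let key = fetch_key(i).await?;
            yield key;
        }
    }
}

#[tokio::main]
async fn main() -> Result<(), std::io::Error> {
    let stream = layer_keys();
    pin_mut!(stream);
    while let Some(key) = stream.next().await {
        println!("{}", key?);
    }
    Ok(())
}
```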