## Problem
Currently, after `Dockerfile.build-tools` is updated in a PR, a manual action is required to make the resulting image `pinned`, i.e., the default for everyone. Pinning also makes all open PRs use that image, even PRs that were created before the change and don't touch `Dockerfile.build-tools`.
This PR overhauls the way we build and use the `build-tools` image (and uses the image from Docker Hub).
## Summary of changes
- The `neondatabase/build-tools` image gets tagged with the sha of the latest commit that touched the `Dockerfile.build-tools` file
- Each PR calculates the tag for `neondatabase/build-tools`, tries to pull it, and rebuilds the image with that tag if it doesn't exist (sketched below)
- Use `neondatabase/build-tools` as the default image
- When running on the `main` branch, create a `pinned` tag and push it to ECR
- Use `concurrency` to ensure we don't build the `build-tools` image for the same commit in parallel from different PRs
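For illustration, here is a minimal sketch (in Rust, via `std::process::Command`) of the tag-calculation and pull-or-build logic described above. The image and file names come from this PR; the exact commands are an assumption, since the real logic lives in the GitHub Actions workflow.
```rust
use std::process::Command;

fn main() -> std::io::Result<()> {
    // The tag is the sha of the latest commit that touched Dockerfile.build-tools.
    let out = Command::new("git")
        .args(["log", "-1", "--format=%H", "--", "Dockerfile.build-tools"])
        .output()?;
    let tag = String::from_utf8_lossy(&out.stdout).trim().to_string();
    let image = format!("neondatabase/build-tools:{tag}");

    // Try to pull the image; only rebuild it if that tag doesn't exist yet.
    let pulled = Command::new("docker").args(["pull", &image]).status()?;
    if !pulled.success() {
        Command::new("docker")
            .args(["build", "-f", "Dockerfile.build-tools", "-t", &image, "."])
            .status()?;
    }
    Ok(())
}
```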
## Problem
The vectored read path proposed in
https://github.com/neondatabase/neon/pull/6576 seems
to be functionally correct, but in my testing (see below) it is about 10-20% slower than the naive
sequential vectored implementation.
## Summary of changes
There are three parts to this PR:
1. Supporting vectored blob reads. This is actually trickier than it sounds, because on-disk blobs are prefixed with a variable-length size header. Since the blobs are not necessarily fixed size, we need to juggle the offsets so that callers can retrieve the blobs from the resulting buffer (see the sketch after this list).
2. Merge disk read requests issued by the vectored read path up to a maximum size. Again, the merging is complicated by the fact that blobs are not fixed size. We keep track of the begin and end offset of each blob and pass them into the vectored blob reader. In turn, the reader returns a buffer and the offsets at which the blobs begin and end.
3. A benchmark for basebackup requests against a tenant with a large SLRU block count. This required a small change to pagebench and a new pageserver config variable which toggles the vectored get validation.
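As a sketch of the offset bookkeeping from points 1 and 2 above (with hypothetical type names that do not match the actual pageserver code): the reader returns one shared buffer per merged read plus the start/end offsets of each blob, and callers slice the buffer with those offsets.
```rust
/// Hypothetical types illustrating the offset bookkeeping; the real
/// pageserver structs have different names and carry more metadata.
struct VectoredBlob {
    /// Offset of the blob payload within the shared buffer.
    start: usize,
    /// One past the last byte of the blob payload.
    end: usize,
}

struct VectoredBlobsBuf {
    /// One buffer filled by a single (merged) disk read.
    buf: Vec<u8>,
    /// Offsets for every blob contained in `buf`.
    blobs: Vec<VectoredBlob>,
}

impl VectoredBlobsBuf {
    /// Callers retrieve a blob by slicing the shared buffer.
    fn blob(&self, idx: usize) -> &[u8] {
        let b = &self.blobs[idx];
        &self.buf[b.start..b.end]
    }
}

fn main() {
    // Two variable-size blobs packed into one buffer by a merged read.
    let result = VectoredBlobsBuf {
        buf: vec![1, 2, 3, 4, 5],
        blobs: vec![VectoredBlob { start: 0, end: 2 }, VectoredBlob { start: 2, end: 5 }],
    };
    assert_eq!(result.blob(0), &[1, 2]);
    assert_eq!(result.blob(1), &[3, 4, 5]);
}
```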
We can probably optimise things further by adding a bit of concurrency to our IO. In principle, it's as simple as spawning a task that issues the IO, while the parent task, which receives input via a channel, does the serialisation and handling.
This reverts commits 587cb705b8 (PR #6661)
and fcbe9fb184 (PR #6842).
Conflicts:
pageserver/src/tenant.rs
pageserver/src/tenant/timeline.rs
The conflicts were with
* pageserver: adjust checkpoint distance for sharded tenants (#6852)
* pageserver: add vectored get implementation (#6576)
We also had to keep the `allowed_errors` to make `test_forward_compatibility` happy; see the PR thread on GitHub for details.
## Problem
Starting the pageserver before the storage controller is ready can trigger a round of reconciliation, which leads to the previously attached tenant being shut down.
This disturbs some tests.
## Summary of changes
Wait for the storage controller to become ready on neon env start-up.
Closes https://github.com/neondatabase/neon/issues/6724
Allowing eviction of layers that are wanted deleted is something I forgot to implement in #5645. This PR makes it possible to evict such layers, which should reduce the number of hanging evictions.
Fixes: #6928
Co-authored-by: Christian Schwarz <christian@neon.tech>
## Problem
After commit [840abe3954] (store AUX files as deltas) we avoid quadratic growth of storage size when storing LR snapshots, but we get a quadratic slowdown of reconstruct time.
As a result, storing 70k snapshots on my local Neon instance took more than 3 hours, and starting the node (creation of the basebackup) took ~10 minutes.
In prod, 70k AUX files increase startup time to 40 minutes:
https://neondb.slack.com/archives/C03F5SM1N02/p1708513010480179
## Summary of changes
Enforce storing the full AUX directory (an analog of FPI) every 1024 files.
The time to create 70k snapshots is reduced to 6 minutes, and startup time to ~1.5 minutes (100 seconds).
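A minimal sketch of the idea (the surrounding names and decision point are illustrative, not the actual pageserver code): every 1024 stored files we write the whole AUX directory, so reconstruction replays at most 1023 deltas on top of the latest full copy instead of the entire history.
```rust
/// Interval from the PR description; the surrounding types are illustrative.
const AUX_FULL_DIRECTORY_INTERVAL: u64 = 1024;

enum AuxRecord {
    /// Full copy of the AUX file directory (an analog of a full-page image).
    FullDirectory,
    /// Incremental change relative to the previous record.
    Delta,
}

fn next_record_kind(files_stored: u64) -> AuxRecord {
    if files_stored % AUX_FULL_DIRECTORY_INTERVAL == 0 {
        AuxRecord::FullDirectory
    } else {
        AuxRecord::Delta
    }
}

fn main() {
    assert!(matches!(next_record_kind(2048), AuxRecord::FullDirectory));
    assert!(matches!(next_record_kind(70_000), AuxRecord::Delta));
}
```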
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
This is a precursor to adding a convenience CLI for the storage
controller.
## Summary of changes
- move controller api structs into pageserver_api::controller_api to
make them visible to other crates
- rename pageserver_api::control_api to pageserver_api::upcall_api to
match the /upcall/v1/ naming in the storage controller.
Why here rather than a totally separate crate? It's convenient to have
all the pageserver-related stuff in one place, and if we ever wanted to
move it to a different crate it's super easy to do that later.
Rebased version of #5234, part of #6768
This consists of three parts:
1. A refactoring and new contract for implementing and testing
compaction.
The logic is now in a separate crate, with no dependency on the
'pageserver' crate. It defines an interface that the real pageserver
must implement, in order to call the compaction algorithm. The interface
models things like delta and image layers, but just the parts that the
compaction algorithm needs to make decisions. That makes it easier to unit test the algorithm and experiment with different implementations.
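As a rough illustration of the kind of interface meant here (a sketch only; the trait in the actual compaction crate is richer and named differently), the pageserver exposes just enough about its layers for the algorithm to make decisions:
```rust
use std::ops::Range;

/// Sketch of a compaction-facing layer description; not the real trait.
trait CompactionLayer {
    /// Key range covered by the layer.
    fn key_range(&self) -> Range<u64>;
    /// LSN range covered by the layer (a single LSN for image layers).
    fn lsn_range(&self) -> Range<u64>;
    /// Approximate on-disk size, used when deciding what to compact.
    fn size(&self) -> u64;
    /// Delta layers hold changes; image layers hold materialized pages.
    fn is_delta(&self) -> bool;
}

/// A toy implementation, as a simulator might provide.
struct SimulatedDeltaLayer {
    keys: Range<u64>,
    lsns: Range<u64>,
    size: u64,
}

impl CompactionLayer for SimulatedDeltaLayer {
    fn key_range(&self) -> Range<u64> { self.keys.clone() }
    fn lsn_range(&self) -> Range<u64> { self.lsns.clone() }
    fn size(&self) -> u64 { self.size }
    fn is_delta(&self) -> bool { true }
}

fn main() {
    let layer = SimulatedDeltaLayer { keys: 0..1000, lsns: 100..200, size: 8 * 1024 * 1024 };
    println!("delta layer covering keys {:?} at lsns {:?}, {} bytes",
        layer.key_range(), layer.lsn_range(), layer.size());
}
```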
I did not convert the current code to the new abstraction, however. When the compaction algorithm is set to "Legacy", we just use the old code. It might be worthwhile to convert the old code to the new abstraction, so that we can compare the behavior of the new algorithm against the old one, using the same simulated cases. If we do that, we have to be careful that the converted code really is equivalent to the old one.
This includes only trivial changes to the main pageserver code. All the
new code is behind a tenant config option. So this should be pretty safe
to merge, even if the new implementation is buggy, as long as we don't
enable it.
2. A new compaction algorithm, implemented using the new abstraction.
The new algorithm is tiered compaction. It is inspired by the PoC at PR
#4539, although I did not use that code directly, as I needed the new
implementation to fit the new abstraction. The algorithm here is less advanced; I did not implement partial image layers, for example. I
wanted to keep it simple on purpose, so that as we add bells and
whistles, we can see the effects using the included simulator.
One difference from #4539 and from typical LSM tree implementations is how
we keep track of the LSM tree levels. This PR doesn't have a permanent
concept of a level, tier or sorted run at all. There are just delta and
image layers. However, when compaction starts, we look at the layers
that exist, and arrange them into levels, depending on their shapes.
That is ephemeral: when the compaction finishes, we forget that
information. This allows the new algorithm to work without any extra
bookkeeping. That makes it easier to transition from the old algorithm to the new one, and back again.
There is just a new tenant config option to choose the compaction
algorithm. The default is "Legacy", meaning the current algorithm in
'main'. If you set it to "Tiered", the new algorithm is used.
3. A simulator, which implements the new abstraction.
The simulator can be used to analyze write and storage amplification,
without running a test with the full pageserver. It can also draw an SVG
animation of the simulation, to visualize how layers are created and
deleted.
To run the simulator:
```
cargo run --bin compaction-simulator run-suite
```
---------
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
## Problem
## Summary of changes
Updates the neon.tech link to point to a /github page in order to
correctly attribute visits originating from the repo.
## Problem
The data team cannot distinguish between cold starts and non-cold starts.
## Summary of changes
Report `is_cold_start` to analytics.
---------
Co-authored-by: Conrad Ludgate <conrad@neon.tech>
Noticed that we are failing to handle `Result::Err` when entering a gate for logical size calculation. Audited the rest of the gate enters, which seem fine, and unified two instances.
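A minimal stand-in for the gate pattern in question (the real `Gate` lives in the `utils` crate and has a different shape): `enter()` can fail when the gate is already closed, and that failure should make the caller bail out rather than proceed without a guard.
```rust
/// Toy gate; the real one tracks outstanding guards and waits for them on close.
struct Gate {
    closed: bool,
}

struct GateGuard<'a> {
    _gate: &'a Gate,
}

struct GateClosed;

impl Gate {
    fn enter(&self) -> Result<GateGuard<'_>, GateClosed> {
        if self.closed {
            Err(GateClosed)
        } else {
            Ok(GateGuard { _gate: self })
        }
    }
}

fn logical_size_calculation(gate: &Gate) {
    // Previously the Err case was dropped; instead, treat it as "shutting down".
    let _guard = match gate.enter() {
        Ok(guard) => guard,
        Err(GateClosed) => return, // timeline is shutting down, skip the work
    };
    // ... calculate the logical size while holding the guard ...
}

fn main() {
    logical_size_calculation(&Gate { closed: false });
    logical_size_calculation(&Gate { closed: true });
}
```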
Noticed that the gate guard allows removing a failpoint, then noticed that an adjacent failpoint was blocking the executor thread instead of using `pausable_failpoint!`; fixed both.
eviction_task.rs now maintains a gate guard as well.
Cc: #4733
## Problem
We want to show connection counts to console users.
## Summary of changes
Start exporting connection counts grouped by database name and
connection state.
## Problem
The LFC has a high impact on Neon application performance, but there is no way for users to check how efficiently it is being used.
## Summary of changes
Show LFC statistics in EXPLAIN ANALYZE.
## Description
**Local file cache (LFC)**
A layer of caching that stores frequently accessed data from the storage
layer in the local memory of the Neon compute instance. This cache helps
to reduce latency and improve query performance by minimizing the need
to fetch data from the storage layer repeatedly.
**Externalization of LFC in explain output**
The EXPLAIN ANALYZE output is extended to display counts of local file cache (LFC) hits and misses.
This works for both the EXPLAIN text and JSON output.
**File cache: hits**
This counter is incremented whenever the Postgres backend requests a page/block through the SMGR that is not found in shared buffers but is found in the LFC.
**File cache: misses**
This counter is incremented whenever the Postgres backend requests a page/block through the SMGR that is not found in shared buffers or in the LFC and has to be retrieved from Neon storage (the pageserver).
Example (for EXPLAIN text output):
```sql
explain (analyze,buffers,prefetch,filecache) select count(*) from pgbench_accounts;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=214486.94..214486.95 rows=1 width=8) (actual time=5195.378..5196.034 rows=1 loops=1)
Buffers: shared hit=178875 read=143691 dirtied=128597 written=127346
Prefetch: hits=0 misses=1865 expired=0 duplicates=0
File cache: hits=141826 misses=1865
-> Gather (cost=214486.73..214486.94 rows=2 width=8) (actual time=5195.366..5196.025 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=178875 read=143691 dirtied=128597 written=127346
Prefetch: hits=0 misses=1865 expired=0 duplicates=0
File cache: hits=141826 misses=1865
-> Partial Aggregate (cost=213486.73..213486.74 rows=1 width=8) (actual time=5187.670..5187.670 rows=1 loops=3)
Buffers: shared hit=178875 read=143691 dirtied=128597 written=127346
Prefetch: hits=0 misses=1865 expired=0 duplicates=0
File cache: hits=141826 misses=1865
-> Parallel Index Only Scan using pgbench_accounts_pkey on pgbench_accounts (cost=0.43..203003.02 rows=4193481 width=0) (actual time=0.574..4928.995 rows=3333333 loops=3)
Heap Fetches: 3675286
Buffers: shared hit=178875 read=143691 dirtied=128597 written=127346
Prefetch: hits=0 misses=1865 expired=0 duplicates=0
File cache: hits=141826 misses=1865
```
The json output uses the following keys and provides integer values for
those keys:
```
...
"File Cache Hits": 141826,
"File Cache Misses": 1865
...
```
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
fixes https://github.com/neondatabase/neon/issues/6889
# Problem
The failure in the last 3 flaky runs on `main` is
```
test_runner/regress/test_remote_storage.py:460: in test_remote_timeline_client_calls_started_metric
churn("a", "b")
test_runner/regress/test_remote_storage.py:457: in churn
assert gc_result["layers_removed"] > 0
E assert 0 > 0
```
That's this code
cd449d66ea/test_runner/regress/test_remote_storage.py (L448-L460)
So, the test expects GC to remove some layers but the GC doesn't.
# Fix
My impression is that the VACUUM isn't re-using pages aggressively enough, but I can't really prove that. I tried to analyze the layer map dump, but it's too complex.
So, this PR:
- Creates more churn by doing the overwrite twice.
- Forces image layer creation.
It also removes, as a drive-by, the redundant call to timeline_compact, because timeline_checkpoint already does that internally.
## Problem
Attachment service does not do auth based on JWT scopes.
## Summary of changes
Do JWT-based permission checking for requests coming into the attachment service.
Requests into the attachment service must use different tokens based on
the endpoint:
* `/control` and `/debug` require `admin` scope
* `/upcall` requires `generations_api` scope
* `/v1/...` requires `pageserverapi` scope
Requests into the pageserver from the attachment service must use
`pageserverapi` scope.
## Problem
README.md is missing cleanup instructions
## Summary of changes
Add cleanup instructions
Add instructions on how to handle errors during initialization
---------
Co-authored-by: Andreas Scherbaum <andreas@neon.tech>
Use the remote_timeline_client metrics instead; they work for layer file uploads and are reasonably close to what the `pageserver_created_persistent_*` metrics were.
Should we wait for an empty upload queue before calling `report_size()`?
part of https://github.com/neondatabase/neon/issues/6737
## Problem
Customers should be able to determine the size of their workload's working set to right-size their compute.
Since Neon uses the local file cache (LFC) instead of shared buffers to cache pages on bigger compute nodes, we need to externalize a means to determine the LFC hit ratio in addition to the shared buffer hit ratio.
Currently, the following end user documentation
fb7cd3af0e/content/docs/manage/endpoints.md (L137)
is wrong, because it describes how to right-size a compute node based on the shared buffer hit ratio.
Note that the existing functionality in the "neon" extension is NOT available to end users but only to superuser / cloud_admin.
## Summary of changes
- externalize functions and views in neon extension to end users
- introduce a new view `NEON_STAT_FILE_CACHE` with the following DDL
```sql
CREATE OR REPLACE VIEW NEON_STAT_FILE_CACHE AS
WITH lfc_stats AS (
SELECT
stat_name,
count
FROM neon_get_lfc_stats() AS t(stat_name text, count bigint)
),
lfc_values AS (
SELECT
MAX(CASE WHEN stat_name = 'file_cache_misses' THEN count ELSE NULL END) AS file_cache_misses,
MAX(CASE WHEN stat_name = 'file_cache_hits' THEN count ELSE NULL END) AS file_cache_hits,
MAX(CASE WHEN stat_name = 'file_cache_used' THEN count ELSE NULL END) AS file_cache_used,
MAX(CASE WHEN stat_name = 'file_cache_writes' THEN count ELSE NULL END) AS file_cache_writes,
-- Calculate the file_cache_hit_ratio within the same CTE for simplicity
CASE
WHEN MAX(CASE WHEN stat_name = 'file_cache_misses' THEN count ELSE 0 END) + MAX(CASE WHEN stat_name = 'file_cache_hits' THEN count ELSE 0 END) = 0 THEN NULL
ELSE ROUND((MAX(CASE WHEN stat_name = 'file_cache_hits' THEN count ELSE 0 END)::DECIMAL /
(MAX(CASE WHEN stat_name = 'file_cache_hits' THEN count ELSE 0 END) + MAX(CASE WHEN stat_name = 'file_cache_misses' THEN count ELSE 0 END))) * 100, 2)
END AS file_cache_hit_ratio
FROM lfc_stats
)
SELECT file_cache_misses, file_cache_hits, file_cache_used, file_cache_writes, file_cache_hit_ratio from lfc_values;
```
This view can be used by an end user as follows:
```sql
CREATE EXTENSION NEON;
SELECT * FROM neon.NEON_STAT_FILE_CACHE;
```
The output looks like the following:
```
select * from NEON_STAT_FILE_CACHE;
file_cache_misses | file_cache_hits | file_cache_used | file_cache_writes | file_cache_hit_ratio
-------------------+-----------------+-----------------+-------------------+----------------------
2133643 | 108999742 | 607 | 10767410 | 98.08
(1 row)
```
## Problem
We want to report how much cache was used and what the limit was.
## Summary of changes
Added one more query to sql_exporter to expose
`neon.file_cache_size_limit`.
* Decrease checkpointing and compaction targets to produce even more layer files.
* Write 10 thousand rows 2 times instead of writing 20 thousand rows 1 time, so that there is more to GC. Before, the test was noisily jumping between 1 and 0 layer files; now it jumps between 19 and 20 layer files. The 0 caused an assertion error that gave the test most of its flakiness.
* Use a larger timeout for the "churn while failpoints are active" thread: this is mostly so that the test is more robust on systems with more load.
Fixes #3051
## Problem
Previously we always wrote out both legacy and modern tenant config
files. The legacy write enabled rollbacks, but we are long past the
point where that is needed.
We still need the legacy format for situations where someone is running
tenants without generations (that will be yanked as well eventually),
but we can avoid writing it out at all if we do have a generation number
set. We implicitly also avoid writing the legacy config if our mode is
Secondary (secondary mode is newer than generations).
## Summary of changes
- Make writing legacy tenant config conditional on there being no
generation number set.
This PR enforces aspects of `Timeline::repartition` that were already true at runtime:
- it's not called concurrently, so bail out if it is anyway (see the comment for why it's not called concurrently)
- the `lsn` should never move backwards over the lifetime of a Timeline object, because `last_record_lsn()` can only move forwards over the lifetime of a Timeline object
The switch to tokio::sync::Mutex blows up the size of the `partitioning` field from 40 bytes to 72 bytes on Linux x86_64.
That would be concerning if it were a hot field, but `partitioning` is only accessed every 20s by one task, so there won't be excessive cache pressure on it.
(It still sucks that it's now >1 cache line, but I need the Send-able MutexGuard in the next PR)
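A sketch of the two runtime checks, assuming simplified stand-in types (the real `Timeline` and its `partitioning` field differ): `try_lock` detects a concurrent call and bails out, and an LSN lower than the previously used one is rejected.
```rust
use tokio::sync::Mutex;

/// Stand-in for the timeline's partitioning state.
struct Partitioning {
    last_repartition_lsn: u64,
}

struct Timeline {
    partitioning: Mutex<Partitioning>,
}

impl Timeline {
    async fn repartition(&self, lsn: u64) -> Result<(), &'static str> {
        // Bail out instead of queueing behind a concurrent repartition.
        let mut guard = self
            .partitioning
            .try_lock()
            .map_err(|_| "repartition called concurrently")?;
        // The LSN must never move backwards over the lifetime of the Timeline.
        if lsn < guard.last_repartition_lsn {
            return Err("repartition LSN moved backwards");
        }
        guard.last_repartition_lsn = lsn;
        Ok(())
    }
}

#[tokio::main]
async fn main() {
    let tl = Timeline {
        partitioning: Mutex::new(Partitioning { last_repartition_lsn: 0 }),
    };
    assert!(tl.repartition(10).await.is_ok());
    assert!(tl.repartition(5).await.is_err());
}
```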
part of https://github.com/neondatabase/neon/issues/6861
It's been dead code at runtime for 9 months; let's remove it.
We can always re-introduce it at a later point.
I came across this while working on #6861, which will touch `time_for_new_image_layer`. This is an opportunity to make that function simpler.
## Problem
Since the location config API was added, the attach and detach endpoints are deprecated. Hiding them from consumers of the swagger definition is a precursor to removing them.
Neon's cloud no longer uses this API since
https://github.com/neondatabase/cloud/pull/10538
Fully removing the APIs will implicitly make use of generation numbers
mandatory, and should happen alongside
https://github.com/neondatabase/neon/issues/5388, which will happen once
we're happy that the storage controller is ready for prime time.
## Summary of changes
- Remove /attach and /detach from pageserver's swagger file
Reverts neondatabase/neon#6765, bringing back #6731.
We concluded that #6731 was never the root cause of the instability in staging.
More details:
https://neondb.slack.com/archives/C033RQ5SPDH/p1708011674755319
However, the massive number of concurrent `spawn_blocking` calls from the `save_metadata` calls during startups might cause a performance regression.
So, we'll merge this PR after we've stopped writing the metadata (#6769).
## Problem
In the original PR [0], I missed a couple of `release` occurrences that should also be handled for the `release-proxy` branch.
- [0] https://github.com/neondatabase/neon/pull/6797
## Summary of changes
- Add handling for the `release-proxy` branch to the Allure report
- Add handling for the `release-proxy` branch to e2e tests
We set it for the Neon replica if the primary is running.
Postgres uses this GUC at startup to determine whether the replica should wait for RUNNING_XACTS from the primary or not.
Corresponding cloud PR is
https://github.com/neondatabase/cloud/pull/10183
* Add a test for hot-standby replica startup.
* Extract oldest_running_xid from XlRunningXacts WAL records.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@garret.ru>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
## Problem
To "build" a compute image that doesn't have anything new, kaniko takes
13m[0], docker buildx does it in 5m[1].
Also, kaniko doesn't fully support bash expressions in the Dockerfile
`RUN`, so we have to use different workarounds for this (like `bash -c
...`).
- [0]
https://github.com/neondatabase/neon/actions/runs/8011512414/job/21884933687
- [1]
https://github.com/neondatabase/neon/actions/runs/8008245697/job/21874278162
## Summary of changes
- Use docker buildx to build `compute-node` images
- Use docker buildx to build `neon-image` image
- Use docker buildx to build `compute-tools` image
- Use Docker Hub for the image cache (instead of ECR)
## Problem
Following up on https://github.com/neondatabase/neon/pull/6884, hopefully a real final fix for https://github.com/neondatabase/neon/issues/6236.
## Summary of changes
`handle_migrations` is done over the main `postgres` db connection.
Therefore, the privileges assigned here do not work with databases
created later (i.e., `neondb`). This pull request moves the grants to `handle_grants`, so that they run for each created DB. The SQL is added to the `BEGIN/END` block, so that applying all of them takes only one RTT.
Signed-off-by: Alex Chi Z <chi@neon.tech>
PR #6655 turned out to be not enough to prevent .snap file bloat; some subscribers just don't ack the flushed position, thus never advancing the slot. Other bloating scenarios are probably also possible, so add a more direct restriction: drop all slots if too many .snap files are discovered.
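A minimal sketch of the safety valve, assuming a hypothetical threshold and directory layout (the real limit and the slot-dropping code live elsewhere): count the `.snap` files under `pg_logical/snapshots` and trigger the drop once the count exceeds the limit.
```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical threshold; the actual limit is defined by the real code.
const MAX_SNAP_FILES: usize = 300;

fn too_many_snap_files(snapshots_dir: &Path) -> io::Result<bool> {
    let count = fs::read_dir(snapshots_dir)?
        .filter_map(|entry| entry.ok())
        .filter(|entry| entry.path().extension().is_some_and(|ext| ext == "snap"))
        .count();
    Ok(count > MAX_SNAP_FILES)
}

fn main() -> io::Result<()> {
    let dir = Path::new("pgdata/pg_logical/snapshots");
    if too_many_snap_files(dir)? {
        println!("too many .snap files: would drop all logical replication slots");
    }
    Ok(())
}
```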
## Problem
See https://neondb.slack.com/archives/C04DGM6SMTM/p1708363190710839
## Summary of changes
Flush logical message with snapshot and origin state
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
Following up on https://github.com/neondatabase/neon/pull/6845, we did
not make the default privileges grantable before, and therefore, even if
the users have full privileges, they are not able to grant them to
others.
Should be a final fix for
https://github.com/neondatabase/neon/issues/6236.
## Summary of changes
Add `WITH GRANT OPTION` to the migrations so that `neon_superuser` can grant the permissions.
---------
Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem
We want to release Proxy at a different cadence.
## Summary of changes
- build-and-test workflow:
- Handle the `release-proxy` branch
- Tag images built on this branch with `release-proxy-XXX` tag
- Trigger deploy workflow with `deployStorage=true` &
`deployStorageBroker=true` on `release` branch
- Trigger deploy workflow with `deployPgSniRouter=true` &
`deployProxy=true` on `release-proxy` branch
- release workflow (scheduled creation of release branch):
- Schedule Proxy releases for Thursdays (a random day to make it
different from Storage releases)
- Some checks weren't properly returning an error when they failed
- TenantState::to_persistent wasn't setting generation_pageserver
properly
- Changes to node scheduling policy weren't being persisted.
Stacks on https://github.com/neondatabase/neon/pull/6823
- Pending a heartbeating mechanism (#6844), use /re-attach calls as a
cue to mark an offline node as active, so that a node which is
unavailable during controller startup doesn't require manual
intervention if it later starts/restarts.
- Tweak scheduling logic so that when we schedule the attached location
for a tenant, we prefer to select from secondary locations rather than
picking a fresh one.
This is an interim state until we implement #6844 and full chaos testing
for handling failures.
The `refill_interval` switched from a `usize` in milliseconds to a `Duration` during a review follow-up, hence this slipped through manual testing.
Part of https://github.com/neondatabase/neon/issues/5899
This PR adds a simple informational API, refreshed at most at 1 Hz, for querying pageserver utilization. In this first phase, no actual background calculation is performed; instead, the worst possible score is always returned. The returned bytes information is, however, correct.
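A sketch of that caching behaviour, with made-up field names (the real response type is defined in the pageserver): the value is recomputed at most once per second, and for now the score is always the worst possible value while the byte counters are real.
```rust
use std::time::{Duration, Instant};

/// Made-up field names; the real utilization type lives in the pageserver.
struct PageserverUtilization {
    disk_usage_bytes: u64,
    free_space_bytes: u64,
    /// Always the worst possible score in this first phase.
    utilization_score: u64,
    captured_at: Instant,
}

/// Serve a cached value unless it is older than one second (at most 1 Hz refresh).
fn get_utilization(cache: &mut Option<PageserverUtilization>) -> &PageserverUtilization {
    let stale = cache
        .as_ref()
        .map_or(true, |u| u.captured_at.elapsed() >= Duration::from_secs(1));
    if stale {
        *cache = Some(PageserverUtilization {
            disk_usage_bytes: 42 * 1024 * 1024,   // would come from a statvfs-style query
            free_space_bytes: 100 * 1024 * 1024,  // likewise
            utilization_score: u64::MAX,          // worst possible score for now
            captured_at: Instant::now(),
        });
    }
    cache.as_ref().unwrap()
}

fn main() {
    let mut cache = None;
    let u = get_utilization(&mut cache);
    println!("score={} used={}B free={}B", u.utilization_score, u.disk_usage_bytes, u.free_space_bytes);
}
```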
Cc: #6835
Cc: #5331
- Add some context to logs
- Add tests for pageserver restarts when managed by the storage controller
- Make /location_config tolerate compute hook failures on shard
creations, not just modifications.
## Problem
When a secondary mode location starts up, it scans local layer files. Currently it warns about any layers whose names don't parse as a LayerFileName, generating warning spam from perfectly normal temp files.
## Summary of changes
- Refactor local vars to build a Utf8PathBuf for the layer file
candidate
- Use the `crate::is_temporary` check to identify and clean up temp files (a minimal sketch follows).
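For illustration only, a sketch of the startup scan with a made-up temp-file suffix (the real check is `crate::is_temporary`, whose exact marker differs):
```rust
use camino::Utf8PathBuf;

// Made-up suffix; the real pageserver has its own temp-file marker.
const TEMP_SUFFIX: &str = ".___temp";

fn is_temporary(path: &Utf8PathBuf) -> bool {
    path.as_str().ends_with(TEMP_SUFFIX)
}

fn handle_scanned_file(path: Utf8PathBuf, parses_as_layer_name: bool) {
    if parses_as_layer_name {
        // A normal layer file: record it in the secondary location's state.
    } else if is_temporary(&path) {
        // Leftover from an interrupted download: delete it without warning.
        let _ = std::fs::remove_file(path.as_std_path());
    } else {
        // Only genuinely unexpected files are worth warning about now.
        eprintln!("unexpected file in timeline dir: {path}");
    }
}

fn main() {
    handle_scanned_file(Utf8PathBuf::from("layer-download.___temp"), false);
}
```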
---------
Co-authored-by: Christian Schwarz <christian@neon.tech>