Commit Graph

7536 Commits

Author SHA1 Message Date
Arpad Müller
91dad2514f storcon: also support tenant deletion for safekeepers (#11289)
If a tenant gets deleted, delete also all of its timelines. We assume
that by the time a tenant is being deleted, no new timelines are being
created, so we don't need to worry about races with creation in this
situation.

Unlike #11233, which was very simple because it listed the timelines and
invoked timeline deletion, this PR obtains a list of safekeepers to
invoke the tenant deletion on, and then invokes tenant deletion on each
safekeeper that has one or multiple timelines.

Alternative to #11233
Builds on #11288
Part of #9011
2025-03-20 10:52:21 +00:00
Dmitrii Kovalkov
9bf59989db storcon: add https API (#11239)
## Problem

Pageservers use unencrypted HTTP requests for storage controller API.

- Closes: https://github.com/neondatabase/cloud/issues/25524

## Summary of changes

- Replace hyper0::server::Server with http_utils::server::Server in
storage controller.
- Add HTTPS handler for storage controller API.
- Support `ssl_ca_file` in pageserver.
2025-03-20 08:22:02 +00:00
Tristan Partin
c6f5a58d3b Remove potential for SQL injection (#11260)
Timeline IDs do not contain characters that may cause a SQL injection,
but best to always play it safe.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2025-03-19 19:19:38 +00:00
Suhas Thalanki
5589efb6de moving LastWrittenLSNCache to Neon Extension (#11031)
## Problem

We currently have this code duplicated across different PG versions.
Moving this to an extension would reduce duplication and simplify
maintenance.

## Summary of changes

Moving the LastWrittenLSN code from PG versions to the Neon extension
and linking it with hooks.

Related Postgres PR: https://github.com/neondatabase/postgres/pull/590

Closes: https://github.com/neondatabase/neon/issues/10973

---------

Co-authored-by: Tristan Partin <tristan@neon.tech>
2025-03-19 17:29:40 +00:00
Folke Behrens
019a29748d proxy: Move release PR creation to Tuesday (#11306)
Move the creation of the proxy release PR to Tuesday mornings.
2025-03-19 16:45:29 +00:00
StepSecurity Bot
88ea855cff fix(ci): Fixing StepSecurity Flagged Issues (#11311)
This pull request is created by
[StepSecurity](https://app.stepsecurity.io/securerepo) at the request of
@areyou1or0.
 ## Summary

This pull request is created by
[StepSecurity](https://app.stepsecurity.io/securerepo) at the request of
@areyou1or0. Please merge the Pull Request to incorporate the requested
changes. Please tag @areyou1or0 on your message if you have any
questions related to the PR.
## Summary

This pull request is created by
[StepSecurity](https://app.stepsecurity.io/securerepo) at the request of
@areyou1or0. Please merge the Pull Request to incorporate the requested
changes. Please tag @areyou1or0 on your message if you have any
questions related to the PR.

## Security Fixes

### Least Privileged GitHub Actions Token Permissions

The GITHUB_TOKEN is an automatically generated secret to make
authenticated calls to the GitHub API. GitHub recommends setting minimum
token permissions for the GITHUB_TOKEN.

- [GitHub Security
Guide](https://docs.github.com/en/actions/security-guides/automatic-token-authentication#using-the-github_token-in-a-workflow)
- [The Open Source Security Foundation (OpenSSF) Security
Guide](https://github.com/ossf/scorecard/blob/main/docs/checks.md#token-permissions)
### Pinned Dependencies

GitHub Action tags and Docker tags are mutable. This poses a security
risk. GitHub's Security Hardening guide recommends pinning actions to
full length commit.

- [GitHub Security
Guide](https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions#using-third-party-actions)
- [The Open Source Security Foundation (OpenSSF) Security
Guide](https://github.com/ossf/scorecard/blob/main/docs/checks.md#pinned-dependencies)
### Harden Runner

[Harden-Runner](https://github.com/step-security/harden-runner) is an
open-source security agent for the GitHub-hosted runner to prevent
software supply chain attacks. It prevents exfiltration of credentials,
detects tampering of source code during build, and enables running jobs
without `sudo` access. See how popular open-source projects use
Harden-Runner
[here](https://docs.stepsecurity.io/whos-using-harden-runner).

<details>
<summary>Harden runner usage</summary>

You can find link to view insights and policy recommendation in the
build log

<img
src="https://github.com/step-security/harden-runner/blob/main/images/buildlog1.png?raw=true"
width="60%" height="60%">

Please refer to
[documentation](https://docs.stepsecurity.io/harden-runner) to find more
details.
</details>



will fix https://github.com/neondatabase/cloud/issues/26141
2025-03-19 16:44:22 +00:00
Vlad Lazar
9ce3704ab5 pageseserver: rename cplane api to storage controller api (#11310)
## Problem

The pageserver upcall api was designed to work with control plane or the
storage controller.
We have completed the transition period and now the upcall api only
targets the storage controller.

## Summary of changes

Rename types accordingly and tweak some comments.
2025-03-19 16:29:52 +00:00
Alexey Kondratov
518269ea6a feat(compute): Add perf test for compute startup time breakdown (#11198)
## Problem

We had a recent Postgres startup latency (`start_postgres_ms`)
degradation, but it was only caught with SLO alerts. There was actually
an existing test for the same purpose -- `start_postgres_ms`, but it's
doing only two starts, so it's a bit noisy.

## Summary of changes

Add new compute startup latency test that does 100 iterations and
reports p50, p90 and p99 latencies.

Part of https://github.com/neondatabase/cloud/issues/24882
2025-03-19 16:11:33 +00:00
a-masterov
cf9d817a21 Add tests for some extensions currently not covered by the regression tests (#11191)
## Problem
Some extensions do not contain tests, which can be easily run on top of
docker-compose or staging.

## Summary of changes
Added the pg_regress based tests for `pg_tiktoken`, `pgx_ulid`, `pg_rag`
Now they will be run on top of docker-compose, but I intend to adopt
them to be run on top staging in the next PRs
2025-03-19 15:38:33 +00:00
John Spray
55cb07f680 pageserver: improve debuggability of timeline creation failures during chaos testing (#11300)
## Problem

We're seeing timeline creation failures that look suspiciously like some
race with the cleanup-deletion of initdb temporary directories. I
couldn't spot the bug, but we can make it a bit easier to debug.

Related: https://github.com/neondatabase/neon/issues/11296

## Summary of changes

- Avoid surfacing distracting ENOENT failure to delete as a log error --
this is fine, and can happen if timeline is cancelled while doing
initdb, or if initdb itself has an error where it doesn't write the dir
(this error is surfaced separately)
- Log after purging initdb temp directories
2025-03-19 13:38:03 +00:00
Christian Schwarz
0f20dae3c3 impr: merge pageserver_api::models::TenantConfig and pageserver::tenant::config::TenantConfOpt (#11298)
The only difference between
- `pageserver_api::models::TenantConfig` and
- `pageserver::tenant::config::TenantConfOpt`

at this point is that `TenantConfOpt` serializes with
`skip_serializing_if = Option::is_none`.
That is an efficiency improvement for all the places that currently
serde `models::TenantConfig` because new serializations will no longer
write `$fieldname: null` for each field that is `None` at runtime.

This should be particularly beneficial for Storcon, which stores
JSON-serialized `models::TenantConfig` in its DB.

# Behavior Changes


This PR changes the serialization behavior: we omit `None` fields
instead of serializing `$fieldname: null`).

So it's a data format change (see section on compatibility below).

And it changes API responses from Storcon and Pageserver.

## API Response Compatibility

Storcon returns the location description.
Afaik it is passed through into
- storcon_cli output
- storcon UI in console admin UI

These outputs will no longer contain `$fieldname: null` values,
which de-bloats the output (good).
But in storcon UI, it also serves as an editor "default", which
will be eliminated after a storcon with this PR is released.


## Data Format Compatibility


Backwards compat: new software reading old serialized data will
deserialize to the same runtime value because all the field types
are exactly the same and `skip_serializing_if` does not affect
deserialization.

Forward compat: old software reading data serialized by new software
will map absence fields in the serialized form to runtime value
`Option::None`. This is serde default behavior, see this playground
to convince yourself:

https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=f7f4e1a169959a3085b6158c022a05eb

The `serde(with="humantime_serde")` however behaves strangely:
if used on an `Option<Duration>`, it still requires the field to be
present,
unlike the serde default behavior shown in the previous paragraph.
The workaround is to set `serde(default)`.
Previously it was set on each individual field, but, we do have the
container attribute, so, set it there.
This requires deriving a `Default` impl, which, because all fields are
`Option`,
is non-magic.
See my notes here:
https://gist.github.com/problame/eddbc225a5d12617e9f2c6413e0cf799

# Future Work

We should have separate types (& crates) for
- runtime types configuration (e.g. PageServerConf::tenant_config,
AttachedLocationConf)
- `config-v1` file pageserver local disk file format
- `mgmt API`
- `pageserver.toml`

Right now they all use the same, which is convenient but makes it hard
to reason about compatibility breakage.

# Refs

- corresponding docs.neon.build PR
https://github.com/neondatabase/docs/pull/470
2025-03-19 12:47:17 +00:00
JC Grünhage
aedeb37220 fix(ci): put the BUILD_TAG of the upcoming release into RC PR artifacts (#11304)
## Problem
#11061 changed how artifacts for releases are built, by
reusing/retagging the artifacts from release PRs. This resulted in the
BUILD_TAG that's baked into the images to not be as expected.
Context: https://neondb.slack.com/archives/C08JBTT3R1Q/p1742333300129069

## Summary of changes
Set BUILD_TAG to the release tag of the upcoming release when running
inside release PRs.
2025-03-19 09:34:28 +00:00
Anastasia Lubennikova
6af974548e feat(compute_ctl): Add basic audit logging for computes. (#11170)
if `audit_log_level` is set to Log, 
preload pgaudit extension and log DDL with masked parameters into
standard postgresql log
2025-03-19 00:13:36 +00:00
Christian Schwarz
9fb77d6cdd buffered writer: add cancellation sensitivity (#11052)
In
-
https://github.com/neondatabase/neon/pull/10993#issuecomment-2690428336

I added infinite retries for buffered writer flush IOs, primarily to
gracefully handle ENOSPC but more generally so that the buffered writer
is not left in a state where reads from the surrounding InMemoryLayer
cause panics.

However, I didn't add cancellation sensitivity, which is concerning
because then there is no way to detach a timeline/tenant that is
encountering the write IO errors.
That’s a legitimate scenario in the case of some edge case bug. 
See the #10993 description for details.


This PR
- first makes flush loop infallible, enabled by infinite retries
- then adds sensitivity to `Timeline::cancel` to the flush loop, thereby
making it fallible in one specific way again
- finally fixes the InMemoryLayer/EphemeralFile/BufferedWriter
amalgamate to remain read-available after flush loop is cancelled.

The support for read-availability after cancellation is necessary so
that reads from the InMemoryLayer that are already queued up behind the
RwLock that wraps the BufferedWriter won't panic because of the
`mutable=None` that we leave behind in case the flush loop gets
cancelled.

# Alternatives

One might think that we can only ship the change for read-availability
if flush encounters an error, without the infinite retrying and/or
cancellation sensitivity complexity.

The problem with that is that read-availability sounds good but is
really quite useless, because we cannot ingest new WAL without a
writable InMemoryLayer. Thus, very soon after we transition to read-only
mode, reads from compute are going to wait anyway, but on `wait_lsn`
instead of the RwLock, because ingest isn't progressing.

Thus, having the infinite flush retries still makes more sense because
they're just "slowness" to the user, whereas wait_lsn is hard errors.
2025-03-18 18:48:43 +00:00
JC Grünhage
99639c26b4 fix(ci): update build-tools image references (#11293)
## Problem
https://github.com/neondatabase/neon/pull/11210 migrated pushing images
to ghcr. Unfortunately, it was incomplete in using images from ghcr,
which resulted in a few places referencing the ghcr build-tools image,
while trying to use docker hub credentials.

## Summary of changes
Use build-tools image from ghcr consistently.
2025-03-18 15:21:22 +00:00
Ivan Efremov
86fe26c676 fix(proxy): Fix testodrome HTTP header handling in proxy (#11292)
Relates to #22486
2025-03-18 15:14:08 +00:00
JC Grünhage
eb6efda98b impr(ci): move some kinds of tests to PR runs only (#11272)
## Problem
The pipelines after release merges are slower than they need to be at
the moment. This is because some kinds of tests/checks run on all kinds
of pipelines, even though they only matter in some of those.

## Summary of changes
Run `check-codestyle-{rust,python,jsonnet}`, `build-and-test-locally`
and `trigger-e2e-tests` only on regular PRs, not release PR or pushes to
main or release branches.
2025-03-18 13:49:34 +00:00
Conrad Ludgate
fd41ab9bb6 chore: remove x509-parser (#11247)
Both crates seem well maintained. x509-cert is part of the high quality
RustCrypto project that we already make heavy use of, and I think it
makes sense to reduce the dependencies where possible.
2025-03-18 13:05:08 +00:00
JC Grünhage
2dfff6a2a3 impr(ci): use ghcr.io as the default container registry (#11210)
## Problem
Docker Hub has new rate limits coming up, and to avoid problems coming
with those we're switching to GHCR.

## Summary of changes
- Push images to GHCR initially and distribute them from there
- Use images from GHCR in docker-compose
2025-03-18 11:30:49 +00:00
Arpad Müller
2cf6ae76fc storcon: move safekeeper related stuff out of service.rs (#11288)
There is no functional change here. We move safekeeper related code from
`service.rs` to `service/safekeeper_service.rs`, so that safekeeper
related stuff is contained in a single file. This also helps with
preventing `service.rs` from growing even further.

Part of #9011.
2025-03-18 09:00:53 +00:00
Dmitrii Kovalkov
57d51e949d tests: suppress excessive pageserver errors in test_timeline_ancestor_detach_errors (#11277)
## Problem

The test is flaky because of the same reasons as described in
https://github.com/neondatabase/neon/issues/11177.
The test has already suppressed these `WARN` and `ERROR` log messages,
but the regexp didn't match all possible errors.

## Summary of changes
- Change regexp to suppress all possible allowed error log messages.
2025-03-18 07:10:11 +00:00
Arpad Müller
0d3d639ef3 storcon: remove timeouts for safekeeper heartbeating (#11232)
PRs #10891 and #10902 have time-bounded the safekeeper heartbeating of
the storage controller. Those timeouts were not meant to be permanent,
but temporary until we figured out the reasons for the safekeeper
heartbeating causing problems.

Now they are better understood and resolved. A comment is
[here](https://github.com/neondatabase/cloud/issues/24396#issuecomment-2679342929),
but most importantly, we've had:

* #10954 to send heartbeats concurrently (before the issue was we sent
them sequentially, so the total time time was number of nodes times time
for timeout to be hit, now the total time is the maximum of all things
we are heartbeating)
* work to actually make heartbeats work and not error, i.e. JWT rollout
for storcon, not sending heartbeats to decomissioned safekeepers,
removal of decomissioned safekeepers from the databases

Part of https://github.com/neondatabase/cloud/issues/25473
2025-03-18 03:37:45 +00:00
Alex Chi Z.
05ca27c981 fix(pagectl/benches): scope context with debug tools (#11285)
## Problem


7c462b3417
requires all contexts have scopes. pagectl/benches don't have such
scopes.

close https://github.com/neondatabase/neon/issues/11280

## Summary of changes

Adding scopes for the tools.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-03-17 21:27:27 +00:00
Alex Chi Z.
bb64beffbb fix(pageserver): log compaction errors with timeline ids (#11231)
## Problem

Makes it easier to debug.

## Summary of changes

Log compaction errors with timeline ids.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-03-17 19:42:02 +00:00
Konstantin Knizhnik
24f41bee5c Update LFC in case of unlogged build (#11262)
## Problem

Unlogged build is used for GIST/SPGIST/GIN/HNSW indexes.
In this mode we first change relation class to `RELPERSISTENCE_UNLOGGED`
and save them on local disk.
But we do not save unlogged relations in LFC.
It may cause fetching incorrect value from LFC if relfilenode is reused.

## Summary of changes

Save modified pages in LFC on second stage of unlogged build (when
modified pages are walloged).
There is no need to save pages in LFC at first phase because the will be
in any case overwritten with assigned LSN at second phase.

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2025-03-17 19:06:42 +00:00
Suhas Thalanki
a05c99f487 fix: removed anon pg extension (#10936)
## Problem

Removing the `anon` v1 extension in postgres as described in
https://github.com/neondatabase/cloud/issues/22663. This extension is
not built for postgres v17 and is out of date when compared to the
upstream variant which is v2 (we have v1.4).

## Summary of changes

Removed the `anon` v1 extension from being built or preloaded

Related to https://github.com/neondatabase/cloud/issues/22663
2025-03-17 18:23:32 +00:00
JC Grünhage
486ffeef6d fix(ci): don't have neon-test-extensions release tag push depend on compute-node-image build (#11281)
## Problem
Failures like
https://github.com/neondatabase/neon/actions/runs/13901493608/job/38896940612?pr=11272
are caused by the dependency on `compute-node-image`, which was wrong on
release jobs anyway.

## Summary of changes
Remove dependency on `compute-node-image` from the job
`add-release-tag-to-neon-test-extension-image`.
2025-03-17 16:31:49 +00:00
Arpad Müller
56149a046a Add test_explicit_timeline_creation_storcon and make it work (#11261)
Adds a basic test that makes the storcon issue explicit creation of a
timeline on safeekepers (main storcon PR in #11058). It was adapted from
`test_explicit_timeline_creation` from #11002.

Also, do a bunch of fixes needed to get the test work (the API
definitions weren't correct), and log more stuff when we can't create a
new timeline due to no safekeepers being active.

Part of #9011

---------

Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
2025-03-17 16:28:21 +00:00
Roman Zaynetdinov
db30e1669c Add /configure_telemetry API endpoint (#11117)
Work on https://github.com/neondatabase/cloud/issues/23721 and
https://github.com/neondatabase/cloud/issues/23714

Depends on https://github.com/neondatabase/neon/pull/11111

- Add `/configure_telemetry` API endpoint
- Support second rsyslog configuration for Postgres logs export
- Enable logs export when compute feature is enabled and configure
Postgres to send logs to syslog

I have used `/configure_telemetry` name because in the future I see it
also being used for configuring a `pg_tracing` extension to export
traces. Let me know if you'd rather have these APIs separate. In this
case we can rename it to `/configure_rsyslog`.
2025-03-17 13:53:23 +00:00
JC Grünhage
fdf04d4d81 fix(ci): use correct branch ref for checking whether this is a release merge queue (#11270)
## Problem

https://github.com/neondatabase/neon/actions/runs/13894288475/job/38871819190
shows the "Add fast-fordward label to PR to trigger fast-forward merge"
job being skipped. This is due to not using the right variable for
checking which branch the merge queue is merging into.

## Summary of changes
Use the `branch` output of the `meta` task for checking the target
branch of a merge group.
2025-03-17 09:26:45 +00:00
Alexander Bayandin
136cae76c2 fix(ci): correct regex to detect release-compute RC PRs (#11269)
## Problem
The regex in `_meta.yml` workflow doesn't detect RC PRs for compute
releases:
https://neondb.slack.com/archives/C059ZC138NR/p1742164884669389

## Summary of changes
- Fix regex

---------

Co-authored-by: Peter Bendel <peterbendel@neon.tech>
2025-03-17 07:25:12 +00:00
Konstantin Knizhnik
15e63afe7d Support DEBUG_COMPARE_LOCAL mode for unloggedindex build (#11257)
## Problem

In unlogged index build (used fir GIST/SPGIST/GIN indexes) files is
created on disk and then removed at the end.
It contradicts to the logic of DEBUG_COMPARE_LOCAL mode.

## Summary of changes

Do not create and unlink files in unlogged build in DEBUG_COMPARE_LOCAL
mode.

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2025-03-17 06:07:24 +00:00
Alexey Kondratov
966abd3bd6 fix(compute_ctl): Dollar escaping helper fixes (#11263)
## Problem

In the previous PR #11045, one edge-case wasn't covered, when an ident
contains only one `$`, we were picking `$$` as a 'wrapper'. Yet, when
this `$` is at the beginning or at the end of the ident, then we end up
with `$$$` in a row which breaks the escaping.

## Summary of changes

Start from `x` tag instead of a blank string.

Slack:
https://neondb.slack.com/archives/C08HV951W2W/p1742076675079769?thread_ts=1742004205.461159&cid=C08HV951W2W
2025-03-16 18:39:54 +00:00
Alexey Kondratov
8566cad23b chore(docs): Refresh RFC guide to suggest using YYYY-MM-DD prefix (#11252)
## Problem

Serial/numeric IDs lead to collisions, which is not critical but looks
awkward.
Previous discussion:
https://neondb.slack.com/archives/C033A2WE6BZ/p1741891345869979

## Summary of changes

Suggest using the `YYYY-MM-DD` prefix, which i) has less chance of
collision; ii) provides out-of-the-box lexicographic sorting; iii) even
if it collides, it's not a big deal -- just two RFCs have been started
on the same day.

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2025-03-16 17:17:58 +00:00
Peter Bendel
228bb75354 Extend large tenant OLTP workload ... (#11166)
... to better match the workload characteristics of real Neon customers

## Problem

We analyzed workloads of large Neon users and want to extend the oltp
workload to include characteristics seen in those workloads.

## Summary of changes

- for re-use branch delete inserted rows from last run
- adjust expected run-time (time-outs) in GitHub workflow
- add queries that exposes the prefetch getpages path
- add I/U/D transactions for another table (so far the workload was
insert/append-only)
- add an explicit vacuum analyze step and measure its time
- add reindex concurrently step and measure its time (and take care that
this step succeeds even if prior reindex runs have failed or were
canceled)
- create a second connection string for the pooled connection that
removes the `-pooler` suffix from the hostname because we want to run
long-running statements (database maintenance) and bypass the pooler
which doesn't support unlimited statement timeout

## Test run


https://github.com/neondatabase/neon/actions/runs/13851772887/job/38760172415
2025-03-16 14:04:48 +00:00
Cihan Demirci
a5b00b87ba CI(pre-merge-checks): use step-security/changed-files (#11265)
Use Step Security maintained version of `tj-actions/changed-files`.

https://www.stepsecurity.io/blog/harden-runner-detection-tj-actions-changed-files-action-is-compromised#use-the-stepsecurity-maintained-changed-files-action
2025-03-16 13:53:27 +00:00
John Spray
a674ed8caf storcon: safety check when completing shard split (#11256)
## Problem

There is a rare race between controller graceful deployment and shard
splitting where we may incorrectly both abort _and_ complete the split
(on different pods), and thereby leave no shards at all in the database.

Related: #11254

## Summary of changes

- In complete_shard_split, refuse to delete anything if child shards are
not found
2025-03-14 20:08:24 +00:00
Erik Grinaker
53d50c7ea5 pageserver: deflake compaction tests (#11246)
These need to set `NoYield`, otherwise they may be preempted by pending
L0 compaction.
2025-03-14 17:45:18 +00:00
Dmitrii Kovalkov
3168bd0e3a tests: suppress "Cancelled request finished with an error" in test_timeline_archive (#11241)
## Problem

Previous PR https://github.com/neondatabase/neon/pull/11190 didn't
suppress `Cancelled request finished with an error` messages, which are
also expected, so the test
https://github.com/neondatabase/neon/issues/11177 is still flaky.

## Summary of changes
- Suppress `Cancelled request finished with an error` in
`test_timeline_archive`
2025-03-14 17:42:09 +00:00
Alexander Bayandin
4a97cd0b7e test_runner: fix tests with jsonnet for Python 3.13 (#11240)
## Problem
Python's `jsonnet` 0.20.0 doesn't support Python 3.13, so we have a
couple of tests xfailed because of that.

## Summary of changes
- Bump `jsonnet` to `0.21.0rc2` which supports Python 3.13
- Unxfail `test_sql_exporter_metrics_e2e` and
`test_sql_exporter_metrics_smoke` on Python 3.13
2025-03-14 17:02:55 +00:00
Anastasia Lubennikova
b7c6738524 feat(compute_ctl): add pgaudt log gc to compute_ctl (#11169)
- add pgaudt_gc thread to compute_ctl
to cleanup old pgaudit logs if they exist.
pgaudit can rotate files, but it doesn't delete the old files
  
- Add AUDIT_LOG_DIR_SIZE metric to compute_ctl
to track the size of the audit log directory in bytes.

- Fix permissions for rsyslog state files directory
2025-03-14 14:08:16 +00:00
Conrad Ludgate
7fe5a689b4 feat(proxy): export ingress metrics (#11244)
## Problem

We exposed the direction tag in #10925 but didn't actually include the
ingress tag in the export to allow for an adaption period.

## Summary of changes

We now export the ingress direction
2025-03-14 13:54:57 +00:00
Dmitrii Kovalkov
b0922967e0 Bump humantime version and remove advisories.ignore (#11242)
## Problem

- Closes:
https://github.com/neondatabase/neon/issues/11179#issuecomment-2724222041

## Summary of changes
- Bump humantime version to `2.2`
- Remove `RUSTSEC-2025-0014` from `advisories.ignore`
2025-03-14 11:51:11 +00:00
Dmitrii Kovalkov
f68be2b5e2 safekeeper: https for management API (#11171)
## Problem

Storage controller uses unencrypted HTTP requests for safekeeper
management API.

- Closes: https://github.com/neondatabase/cloud/issues/24836

## Summary of changes

- Replace `hyper0::server::Server` with `http_utils::server::Server` in
safekeeper.
- Add HTTPS handler for safekeeper management API.
2025-03-14 11:41:22 +00:00
Christian Schwarz
04370b48b3 fix(storcon): optimization validation makes decisions based on wrong SecondaryProgress (#11229)
# Refs

- fixes https://github.com/neondatabase/neon/issues/11228

# Problem High-Level

When storcon validates whether a `ScheduleOptimizationAction` should be
applied, it retrieves the `tenant_secondary_status` to determine whether
a secondary is ready for the optimization.

When collecting results, it associates secondary statuses with the wrong
optimization actions in the batch of optimizations that we're
validating.

The result is that we make the decision for shard/location X based on
the SecondaryStatus of a random secondary location Y in the current
batch of optimizations.

A possible symptom is an early cutover, as seen in this engineering
investigation here:
- https://github.com/neondatabase/cloud/issues/25734

# Problem Code-Level

This code here in `optimize_all_validate`


97e2e27f68/storage_controller/src/service.rs (L7012-L7029)

zips the `want_secondary_status` with the Vec returned from
`tenant_for_shards_api` .

However, the Vec returned from `want_secondary_status` is not ordered
(it uses FuturesUnordered internally).

# Solution

Sort the Vec in input order before returning it.

`optimize_all_validate` was the only caller affected by this problem

While at it, also future-proof similar-looking function
`tenant_for_shards`.
None of its callers care about the order, but this type of function
signature is easy to use incorrectly.

# Future Work

Avoid the additional iteration, map, and allocation.

Change API to leverage AsyncFn (async closure).
And/or invert `tenant_for_shards_api` into a Future ext trait / iterator
adaptor thing.
2025-03-14 11:21:16 +00:00
Arpad Müller
5359cf717c storcon: add API definitions for exclude_timeline and term_bump (#11197)
Adds API definitions for the safekeeper API endpoints `exclude_timeline`
and `term_bump`. Also does a bugfix to return the correct type from
`delete_timeline`.

Part of #8614
2025-03-14 00:00:37 +00:00
Erik Grinaker
d6d78a050f pageserver: disable l0_flush_wait_upload by default (#11215)
## Problem

This is already disabled in production, as it is replaced by L0 flush
delays. It will be removed in a later PR, once the config option is no
longer specified in production.

## Summary of changes

Disable `l0_flush_wait_upload` by default.
2025-03-13 21:08:28 +00:00
Erik Grinaker
4ff000c042 pageserver: deflake test_metadata_image_creation (#11230)
## Problem

`test_metadata_image_creation ` became flaky with #11212, since image
compaction may yield to L0 compaction.

## Summary of changes

Set `NoYield` when compacting in tenant tests.
2025-03-13 20:46:21 +00:00
Conrad Ludgate
9a3020d2ce chore(proxy): pre-initialise metricvecs (#11226)
## Problem

We noticed that error metrics didn't show for some services with light
load. This is not great and can cause problems for dashboards/alerts

## Summary of changes

Pre-initialise some metricvecs.
2025-03-13 20:23:53 +00:00
Alex Chi Z.
23b713900e feat(storcon): passthrough ancestor detach behavior (#11199)
## Problem

https://github.com/neondatabase/neon/issues/10310
https://github.com/neondatabase/neon/pull/11158

## Summary of changes

We need to passthrough the new detach behavior through the storcon API.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2025-03-13 20:21:23 +00:00