Commit Graph

6250 Commits

Author SHA1 Message Date
Conrad Ludgate
77e431d80a abstract out listen-accept loops 2024-10-07 16:52:37 +01:00
Conrad Ludgate
784571eac7 remove redundant generic 2024-10-07 16:52:37 +01:00
Conrad Ludgate
ffd0d875cf minor changes 2024-10-07 16:52:37 +01:00
Conrad Ludgate
dbf34a17ce rename to serverless backend 2024-10-07 16:52:37 +01:00
Conrad Ludgate
4001e24745 proxy: continue streamlining auth::Backend 2024-10-07 16:52:37 +01:00
Conrad Ludgate
d9a59c5a3f deduplicate some mode 2024-10-07 16:52:37 +01:00
Conrad Ludgate
c7afbe55c9 remove some duplication 2024-10-07 16:52:37 +01:00
Conrad Ludgate
d0930c9d1d remove console-redirect from mega-backend object 2024-10-07 16:52:37 +01:00
Conrad Ludgate
addfff61b5 create console_redirect_proxy solo-path 2024-10-07 16:52:37 +01:00
Conrad Ludgate
cb28721eee make well-defined console request backend 2024-10-07 16:52:37 +01:00
Conrad Ludgate
07076e88a9 shrink some uses of config 2024-10-07 16:52:37 +01:00
Conrad Ludgate
2feba8a3da remove auth backend from proxy config 2024-10-07 16:52:37 +01:00
Heikki Linnakangas
323bd018cd Make sure BufferTag padding bytes are cleared in hash keys (#9292)
The prefetch-queue hash table uses a BufferTag struct as the hash key,
and it's hashed using hash_bytes(). It's important that all the padding
bytes in the key are cleared, because hash_bytes() will include them.

I was getting compiler warnings like this on v14 and v15, when compiling
with -Warray-bounds:

    In function ‘prfh_lookup_hash_internal’,
inlined from ‘prfh_lookup’ at
pg_install/v14/include/postgresql/server/lib/simplehash.h:821:9,
inlined from ‘neon_read_at_lsnv’ at pgxn/neon/pagestore_smgr.c:2789:11,
inlined from ‘neon_read_at_lsn’ at pgxn/neon/pagestore_smgr.c:2904:2:
pg_install/v14/include/postgresql/server/storage/relfilenode.h:90:43:
warning: array subscript ‘PrefetchRequest[0]’ is partly outside array
bounds of ‘BufferTag[1]’ {aka ‘struct buftag[1]’} [-Warray-bounds]
       89 |         ((node1).relNode == (node2).relNode && \
          |         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       90 |          (node1).dbNode == (node2).dbNode && \
          |          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~
       91 |          (node1).spcNode == (node2).spcNode)
          |          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
pg_install/v14/include/postgresql/server/storage/buf_internals.h:116:9:
note: in expansion of macro ‘RelFileNodeEquals’
      116 |         RelFileNodeEquals((a).rnode, (b).rnode) && \
          |         ^~~~~~~~~~~~~~~~~
pgxn/neon/neon_pgversioncompat.h:25:31: note: in expansion of macro
‘BUFFERTAGS_EQUAL’
       25 | #define BufferTagsEqual(a, b) BUFFERTAGS_EQUAL(*(a), *(b))
          |                               ^~~~~~~~~~~~~~~~
pgxn/neon/pagestore_smgr.c:220:34: note: in expansion of macro
‘BufferTagsEqual’
220 | #define SH_EQUAL(tb, a, b) (BufferTagsEqual(&(a)->buftag,
&(b)->buftag))
          |                                  ^~~~~~~~~~~~~~~
pg_install/v14/include/postgresql/server/lib/simplehash.h:280:77: note:
in expansion of macro ‘SH_EQUAL’
280 | #define SH_COMPARE_KEYS(tb, ahash, akey, b) (ahash ==
SH_GET_HASH(tb, b) && SH_EQUAL(tb, b->SH_KEY, akey))
| ^~~~~~~~
pg_install/v14/include/postgresql/server/lib/simplehash.h:799:21: note:
in expansion of macro ‘SH_COMPARE_KEYS’
      799 |                 if (SH_COMPARE_KEYS(tb, hash, key, entry))
          |                     ^~~~~~~~~~~~~~~
    pgxn/neon/pagestore_smgr.c: In function ‘neon_read_at_lsn’:
    pgxn/neon/pagestore_smgr.c:2742:25: note: object ‘buftag’ of size 20
     2742 |         BufferTag       buftag = {0};
          |                         ^~~~~~

This commit silences those warnings, although it's not clear to me why
the compiler complained like that in the first place. I found the issue
with padding bytes while looking into those warnings, but that was
coincidental, I don't think the padding bytes explain the warnings as
such.

In v16, the BUFFERTAGS_EQUAL macro was replaced with a static inline
function, and that also silences the compiler warning. Not clear to me
why.
2024-10-07 18:04:04 +03:00
Folke Behrens
ad267d849f proxy: Move module base files into module directory (#9297) 2024-10-07 16:25:34 +02:00
Conrad Ludgate
8cd7b5bf54 proxy: rename console -> control_plane, rename web -> console_redirect (#9266)
rename console -> control_plane
rename web -> console_redirect

I think these names are a little more representative.
2024-10-07 14:09:54 +01:00
Konstantin Knizhnik
47c3c9a413 Fix update of statistic for LFC/prefetch (#9272)
## Problem

See #9199

## Summary of changes

Fix update of hits/misses for LFC and prefetch introduced in
78938d1b59

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-10-07 12:21:16 +03:00
Arseny Sher
eae4470bb6 safekeeper: remove local WAL files ignoring peer_horizon_lsn. (#8900)
If peer safekeeper needs garbage collected segment it will be fetched
now from s3 using on-demand WAL download. Reduces danger of running out of disk space when safekeeper fails.
2024-10-04 19:07:39 +03:00
Ivan Efremov
2d248aea6f proxy: exclude triple logging of connect compute errors (#9277)
Fixes (#9020)
 - Use the compute::COULD_NOT_CONNECT for connection error message;
 - Eliminate logging for one connection attempt;
 - Typo fix.
2024-10-04 18:21:39 +03:00
Conrad Ludgate
6c05f89f7d proxy: add local-proxy to compute image (#8823)
1. Adds local-proxy to compute image and vm spec
2. Updates local-proxy config processing, writing PID to a file eagerly
3. Updates compute-ctl to understand local proxy compute spec and to
send SIGHUP to local-proxy over that pid.

closes https://github.com/neondatabase/cloud/issues/16867
2024-10-04 14:52:01 +00:00
Arseny Sher
db53f98725 neon walsender_hooks: take basebackup LSN directly. (#9263)
NeonWALReader needs to know LSN before which WAL is not available
locally, that is, basebackup LSN. Previously it was taken from
WalpropShmemState, but that's racy, as walproposer sets its there only
after successfull election. Get it directly with GetRedoStartLsn.

Should fix flakiness of
test_ondemand_wal_download_in_replication_slot_funcs etc.

ref #9201
2024-10-04 14:56:15 +01:00
Erik Grinaker
04a6222418 remote_storage: add head_object integration test (#9274) 2024-10-04 12:40:41 +01:00
Vlad Lazar
dcf7af5a16 storcon: do timeline creation on all attached location (#9237)
## Problem

Creation of a timelines during a reconciliation can lead to
unavailability if the user attempts to
start a compute before the storage controller has notified cplane of the
cut-over.

## Summary of changes

Create timelines on all currently attached locations. For the latest
location, we still look
at the database (this is a previously). With this change we also look
into the observed state
to find *other* attached locations.

Related https://github.com/neondatabase/neon/issues/9144
2024-10-04 11:56:43 +01:00
Erik Grinaker
37158d0424 pageserver: use conditional GET for secondary tenant heatmaps (#9236)
## Problem

Secondary tenant heatmaps were always downloaded, even when they hadn't
changed. This can be avoided by using a conditional GET request passing
the `ETag` of the previous heatmap.

## Summary of changes

The `ETag` was already plumbed down into the heatmap downloader, and
just needed further plumbing into the remote storage backends.

* Add a `DownloadOpts` struct and pass it to
`RemoteStorage::download()`.
* Add an optional `DownloadOpts::etag` field, which uses a conditional
GET and returns `DownloadError::Unmodified` on match.
2024-10-04 12:29:48 +02:00
Erik Grinaker
60fb840e1f Cargo.toml: enable sso for aws-config (#9261)
## Problem

The S3 tests couldn't use SSO authentication for local tests against S3.

## Summary of changes

Enable the `sso` feature of `aws-config`. Also run `cargo hakari
generate` which made some updates to `workspace_hack`.
2024-10-04 11:27:06 +01:00
Heikki Linnakangas
52232dd85c tests: Add a comment explaining the rules of NeonLocalCli wrappers (#9195) 2024-10-03 22:03:29 +03:00
Heikki Linnakangas
8ef0c38b23 tests: Rename NeonLocalCli functions to match the 'neon_local' commands (#9195)
This makes it more clear that the functions in NeonLocalCli are just
typed wrappers around the corresponding 'neon_local' commands.
2024-10-03 22:03:27 +03:00
Heikki Linnakangas
56bb1ac458 tests: Move NeonCli and friends to separate file (#9195)
In the passing, rename it to NeonLocalCli, to reflect that the binary
is called 'neon_local'.

Add wrapper for the 'timeline_import' command, eliminating the last
raw call to the raw_cli() function from tests, except for a few in
test_neon_cli.py which are about testing the 'neon_local' iteself. All
the other calls are now made through the strongly-typed wrapper
functions
2024-10-03 22:03:25 +03:00
Heikki Linnakangas
19db9e9aad tests: Replace direct calls to neon_cli with wrappers in NeonEnv (#9195)
Add wrappers for a few commands that didn't have them before. Move the
logic to generate tenant and timeline IDs from NeonCli to the callers,
so that NeonCli is more purely just a type-safe wrapper around
'neon_local'.
2024-10-03 22:03:22 +03:00
David Gomes
4e9b32c442 chore: makes some onboarding document improvements (#9216)
* I had to install `m4` in order to be able to run locally
* The docs/docker.md was missing a pointer to where the compute node
code is

(Was originally on #8888 but I am pulling this out)
2024-10-03 20:58:30 +02:00
David Gomes
2fac0b7fac chore: remove unnecessary comments in compute/Dockerfile.compute-node (#9253)
See [this
comment](https://github.com/neondatabase/neon/pull/8888#discussion_r1783130082).
2024-10-03 18:26:41 +00:00
Arpad Müller
e3d6ecaeee Revert hyper and tonic updates (#9268) 2024-10-03 19:21:22 +01:00
Arseny Sher
d785fcb5ff safekeeper: fix panic in debug_dump. (#9097)
Panic was triggered only when dump selected no timelines.

sentry report:
https://neondatabase.sentry.io/issues/5832368589/
2024-10-03 19:22:22 +03:00
Vlad Lazar
552fa2b972 pageserver: tweak oversized key read path warning (#9221)
## Problem

`Oversized vectored read [...]` logs are spewing in prod because we have
a few keys that
are unexpectedly large:
* reldir/relblock - these are unbounded, so it's known technical debt
* slru block - they can be a bit bigger than 128KiB due to storage
format overhead

## Summary of changes

* Bump threshold to 130KiB
* Don't warn on oversized reldir and dbdir keys 

Closes https://github.com/neondatabase/neon/issues/8967
2024-10-03 16:40:35 +01:00
Arpad Müller
9d93dd4807 Rename hyper 1.0 to hyper and hyper 0.14 to hyper0 (#9254)
Follow-up of #9234 to give hyper 1.0 the version-free name, and the
legacy version of hyper the one with the version number inside. As we
move away from hyper 0.14, we can remove the `hyper0` name piece by
piece.

Part of #9255
2024-10-03 16:33:43 +02:00
Heikki Linnakangas
53b6e1a01c vm-monitor: Upgrade axum from 0.6 to 0.7 (#9257)
Because:
- it's nice to be up-to-date,
- we already had axum 0.7 in our dependency tree, so this avoids having
to compile two versions, and
- removes one of the remaining dpendencies to hyper version 0

Also bumps the 'tokio-tungstenite' dependency, to avoid having two
versions in the dependency tree.
2024-10-03 16:49:39 +03:00
Joonas Koivunen
dbef1b064c chore: smaller layer changes (#9247)
Address minor technical debt in Layer inspired by #9224:

- layer usage as arg same as in spans
- avoid one Weak::upgrade
2024-10-03 09:38:45 +01:00
Heikki Linnakangas
6a9e2d657c Remove unnecessary dependencies from postgis-build image (#9211)
The apt install stage before this commit:

    0 upgraded, 391 newly installed, 0 to remove and 9 not upgraded.
    Need to get 261 MB of archives.

after:

    0 upgraded, 367 newly installed, 0 to remove and 9 not upgraded.
    Need to get 220 MB of archives.
2024-10-03 10:05:23 +03:00
Arpad Müller
2d8f6d7906 Suppress wal lag timeout warnings right after tenant attachment (#9232)
As seen in https://github.com/neondatabase/cloud/issues/17335, during
releases we can have ingest lags that are above the limits for warnings.
However, such lags are part of normal pageserver startup.

Therefore, calculate a certain cooldown timestamp until which we accept
lags up to a certain size. The heuristic is chosen to grow the later we get
to fully load the tenant, and we also add 60 seconds as a grace period
after that term.
2024-10-03 02:33:09 +01:00
Arpad Müller
1b176fe74a Use hyper 1.0 and tonic 0.12 in storage broker (#9234)
Fixes #9231 .

Upgrade hyper to 1.4.0 and use hyper 1.4 instead of 0.14 in the storage
broker, together with tonic 0.12. The two upgrades go hand in hand.

Thanks to the broker being independent from other components, we can
upgrade its hyper version without touching the other components, which
makes things easier.
2024-10-03 00:48:12 +02:00
Heikki Linnakangas
1dec93f129 Add compute_tools/ to the list of paths that trigger an E2E run on a PR (#9251)
compute_ctl is an important part of the interfaces between the control
plane and the compute, so it seems important to E2E test any changes
there.
2024-10-03 00:31:19 +03:00
Alexander Bayandin
16002f5e45 test_runner: bump requests and psycopg2-binary (#9248)
## Problem

```
Warning: The file chosen for install of requests 2.32.0 (requests-2.32.0-py3-none-any.whl) is yanked. Reason for being yanked: Yanked due to conflicts with CVE-2024-35195 mitigation
```

## Summary of changes
- Update `requests` to fix the warning
- Update `psycopg2-binary`
2024-10-02 21:26:45 +01:00
dotdister
09d4bad1be Change parentheses to clarify conditions in walproposer (#9180)
Some parentheses in conditional expressions are redundant or necessary
for clarity conditional expressions in walproposer.

## Summary of changes

Change some parentheses to clarify conditions in walproposer.

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-10-02 14:49:52 -04:00
Heikki Linnakangas
d20448986c Fix metric name of the 'getpage_wait_seconds_bucket' metric (#9242)
Per convention, histogram buckets have the '_bucket' suffix. I got that
wrong in commit 0d500bbd5b.

Fixes https://github.com/neondatabase/neon/issues/9241
2024-10-02 20:05:14 +03:00
John Spray
d54624153d tests: sync_after_each_test -> sync_between_tests (#9239)
## Problem

We are seeing frequent pageserver startup timelines while it calls
syncfs(). There is an existing fixture that syncs _after_ tests, but not
before the first one. We hypothesize that some failures are happening on
the first test in a job.

## Summary of changes

- extend the existing sync_after_each_test to be a sync between all
tests, including sync'ing before running the first test. That should
remove any ambiguity about whether the sync is happening on the correct
node.

This is an alternative to https://github.com/neondatabase/neon/pull/8957
-- I didn't realize until I saw Alexander's comment on that PR that we
have an existing hook that syncs filesystems and can be extended.
2024-10-02 17:44:25 +01:00
Alex Chi Z.
700885471f fix(test): only test num of L1 layers in compaction smoke test (#9186)
close https://github.com/neondatabase/neon/issues/9160

For whatever reason, pg17's WAL pattern seems different from others,
which triggers some flaky behavior within the compaction smoke test.

## Summary of changes

* Run L0 compaction before proceeding with the read benchmark.
* So that we can ensure the num of L0 layers is 0 and test the
compaction behavior only with L1 layers.

We have a threshold for triggering L0 compaction. In some cases, the
test case did not produce enough L0 layers to do a L0 compaction,
therefore leaving the layer map with 3+ L0 layers above the L1 layers.
This increases the average read depth for the timeline.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-10-02 17:42:35 +01:00
Vlad Lazar
38a8dcab9f storcon: add metric for long running reconciles (#9207)
## Problem

We don't have an alert for long running reconciles. Stuck reconciles are
problematic
as we've seen in a recent incident.

## Summary of changes

Add a new metric `storage_controller_reconcile_long_running_total` with
labels: `{tenant_id, shard_number, seq}`.
The metric is removed after the long running reconcile finishes. These
events should be rare, so we won't break
the bank on cardinality.

Related https://github.com/neondatabase/neon/issues/9150
2024-10-02 17:25:11 +01:00
Vlad Lazar
8dbfda98d4 storcon: ignore deleted timelines on new location catch-up (#9244)
## Problem

If a timeline was deleted right before waiting for LSNs to catch up
before the cut-over,
then we would wait forever. 

## Summary of changes

Fix the issue and add a test for timeline deletions mid migration. 

Related https://github.com/neondatabase/neon/issues/9144
2024-10-02 17:23:26 +01:00
John Spray
f875e107aa pageserver: tweak logging of "became visible" for layers (#9224)
## Problem

Recent change to avoid the "became visible" log messages from certain
tasks missed a task: the logical size calculation that happens as a
child of synthetic size calculation.

Related: https://github.com/neondatabase/neon/issues/9058

## Summary of changes

- Add OnDemandLogicalSize to the list of permitted tasks for reads
making a covered layer visible
- Tweak the log message to use layer name instead of key: this is more
terse, and easier to use when debugging, as one can search for it
elsewhere to see when the layer was written/downloaded etc.
2024-10-02 13:21:04 +01:00
Folke Behrens
1e90e792d6 proxy: Add timeout to webauth confirmation wait (#9227)
```shell
$ cargo run -p proxy --bin proxy -- --auth-backend=web --webauth-confirmation-timeout=5s
```

```
$ psql -h localhost -p 4432
NOTICE:  Welcome to Neon!
Authenticate by visiting within 5s:
    http://localhost:3000/psql_session/e946900c8a9bc6e9


psql: error: connection to server at "localhost" (::1), port 4432 failed: Connection refused
	Is the server running on that host and accepting TCP/IP connections?
connection to server at "localhost" (127.0.0.1), port 4432 failed: ERROR:  Disconnected due to inactivity after 5s.
```
2024-10-02 12:10:56 +02:00
Matthias van de Meent
ea32f1d0a3 Expose more granular wait event data to the user (#9163)
In PG17, there is this newfangled custom wait events system. This commit
adds that feature to Neon, so that users can see what their backends may
be waiting for when a PostgreSQL backend is playing the waiting game in
Neon code.
2024-10-02 11:12:50 +02:00