Commit Graph

6421 Commits

Author SHA1 Message Date
Anastasia Lubennikova
793ad50b7d fix allow_unstable_extensions GUC - make it USERSET (#9563)
fix message wording
2024-10-29 14:25:23 +00:00
John Spray
7a1331eee5 pageserver: make concurrent offloaded timeline operations safe wrt manifest uploads (#9557)
## Problem

Uploads of the tenant manifest could race between different tasks,
resulting in unexpected results in remote storage.

Closes: https://github.com/neondatabase/neon/issues/9556

## Summary of changes

- Create a central function for uploads that takes a tokio::sync::Mutex
- Store the latest upload in that Mutex, so that when there is lots of
concurrency (e.g. archive 20 timelines at once) we can coalesce their
manifest writes somewhat.
2024-10-29 13:54:48 +00:00
John Spray
4ef74215e1 pageserver: refactor generation-aware loading code into generic (#9545)
## Problem

Indices used to be the only kind of object where we had to search across
generations to find the most recent one. As of
https://github.com/neondatabase/neon/issues/9543, manifests will need
the same treatment.

## Summary of changes

- Refactor download_index_part to a generic download_generation_object
function, which will be usable for downloading manifest objects as well.
2024-10-29 13:00:03 +00:00
Conrad Ludgate
d4cbc8cfeb [auth_broker]: regress test (#9541)
python based regression test setup for auth_broker. This uses a http
mock for cplane as well as the JWKs url.

complications:
1. We cannot just use local_proxy binary, as that requires the
pg_session_jwt extension which we don't have available in the current
test suite
2. We cannot use just any old http mock for local_proxy, as auth_broker
requires http2 to local_proxy

as such, I used the h2 library to implement an echo server - copied from
the examples in the h2 docs.
2024-10-29 11:39:09 +00:00
Conrad Ludgate
47c35f67c3 [proxy]: fix JWT handling for AWS cognito. (#9536)
In the base64 payload of an aws cognito jwt, I saw the following:

```
"iss":"https:\/\/cognito-idp.us-west-2.amazonaws.com\/us-west-2_redacted"
```

issuers are supposed to be URLs, and URLs are always valid un-escaped
JSON. However, `\/` is a valid escape character so what AWS is doing is
technically correct... sigh...

This PR refactors the test suite and adds a new regression test for
cognito.
2024-10-29 11:01:09 +00:00
Peter Bendel
45b558f480 temporarily increase timeout for clickbench benchmark until regression is resolved (#9554)
## Problem

click bench job in benchmarking workflow has a performance regression
causing it to run in timeout of max job run.

Suspected root cause:
Project has been migrated from single pageserver to storage controller
managed project on Oct 14th.
Since then the regression shows.

## Summary of changes

Increase timeout of pytest to 12 hours.
Increase job timeout to 12 hours
2024-10-29 10:53:28 +00:00
Arpad Müller
a73402e646 Offloaded timeline deletion (#9519)
As pointed out in
https://github.com/neondatabase/neon/pull/9489#discussion_r1814699683 ,
we currently didn't support deletion for offloaded timelines after the
timeline has been loaded from the manifest instead of having been
offloaded.

This was because the upload queue hasn't been initialized yet. This PR
thus initializes the timeline and shuts it down immediately.

Part of #8088
2024-10-29 10:41:53 +00:00
Vlad Lazar
07b974480c pageserver: move things around to prepare for decoding logic (#9504)
## Problem

We wish to have high level WAL decoding logic in `wal_decoder::decoder`
module.

## Summary of Changes

For this we need the `Value` and `NeonWalRecord` types accessible there, so:
1. Move `Value` and `NeonWalRecord` to `pageserver::value` and
`pageserver::record` respectively.
2. Get rid of `pageserver::repository` (follow up from (1))
3. Move PG specific WAL record types to `postgres_ffi::walrecord`. In
theory they could live in `wal_decoder`, but it would create a circular
dependency between `wal_decoder` and `postgres_ffi`. Long term it makes
sense for those types to be PG version specific, so that will work out nicely.
4. Move higher level WAL record types (to be ingested by pageserver)
into `wal_decoder::models`

Related: https://github.com/neondatabase/neon/issues/9335
Epic: https://github.com/neondatabase/neon/issues/9329
2024-10-29 10:00:34 +00:00
Arpad Müller
62f5d484d9 Assert the tenant to be active in unoffload_timeline (#9539)
Currently, all callers of `unoffload_timeline` ensure that the tenant
the unoffload operation is called on is active. We rely on it being
active as we activate the timeline below and don't want to race with the
activation code of the tenant (in the worst case, activating a timeline
twice).

Therefore, add this assertion.

Part of #8088
2024-10-29 00:36:05 +00:00
Tristan Partin
4df3987054 Get role name when not a C string
We will only have a C string if the specified role is a string.
Otherwise, we need to resolve references to public, current_role,
current_user, and session_user.

Fixes: https://github.com/neondatabase/cloud/issues/19323
Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-10-28 18:21:45 -05:00
Konstantin Knizhnik
0624565617 Create the notion of unstable extensions
As a DBaaS provider, Neon needs to provide a stable platform for
customers to build applications upon. At the same time however, we also
need to enable customers to use the latest and greatest technology, so
they can prototype their work, and we can solicit feedback. If all
extensions are treated the same in terms of stability, it is hard to
meet that goal.

There are now two new GUCs created by the Neon extension:

neon.allow_unstable_extensions: This is a session GUC which allows
a session to install and load unstable extensions.

neon.unstable_extensions: This is a comma-separated list of extension
names. We can check if a CREATE EXTENSION statement is attempting to
install an unstable extension, and if so, deny the request if
neon.allow_unstable_extensions is not set to true.

Signed-off-by: Tristan Partin <tristan@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-10-28 17:47:15 -05:00
George MacKerron
7d5f6b6a52 Build pgrag extensions x3 (#8486)
Build the pgrag extensions (rag, rag_bge_small_en_v15, and
rag_jina_reranker_v1_tiny_en) as part of the compute node Dockerfile.

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2024-10-28 20:06:36 +00:00
Alex Chi Z.
f7c61e856f fix(pageserver): bump tokio-epoll-uring (#9546)
Includes https://github.com/neondatabase/tokio-epoll-uring/pull/58 that
fixes the clippy error.

## Summary of changes

Update the version of tokio-epoll-uring

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-10-28 20:03:02 +00:00
Alex Chi Z.
57c21aff9f refactor(pageserver): remove aux v1 configs (#9494)
## Problem

Part of https://github.com/neondatabase/neon/issues/8623

## Summary of changes

Removed all aux-v1 config processing code. Note that we persisted it
into the index part file, so we cannot really remove the field from
index part. I also kept the config item within the tenant config, but we
will not read it any more.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-10-28 19:51:14 +00:00
Erik Grinaker
248558dee8 safekeeper: refactor WalAcceptor to be event-driven (#9462)
## Problem

The `WalAcceptor` main loop currently uses two nested loops to consume
inbound messages. This makes it hard to slot in periodic events like
metrics collection. It also duplicates the event processing code, and assumes
all messages in steady state are AppendRequests (other messages types may
be dropped if following an AppendRequest).

## Summary of changes

Refactor the `WalAcceptor` loop to be event driven.
2024-10-28 17:18:37 +00:00
Sergey Melnikov
3bad52543f We don't have legacy proxies anymore (#9544)
We don't have legacy scram proxies anymore:
cc: https://github.com/neondatabase/cloud/issues/9745
2024-10-28 16:42:35 +00:00
Tristan Partin
3d64a7ddcd Add pg_mooncake to compute-node.Dockerfile
Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-10-28 11:23:30 -05:00
Conrad Ludgate
25f1e5cfeb [proxy] demote warnings and remove dead-argument (#9512)
fixes https://github.com/neondatabase/cloud/issues/19000
2024-10-28 15:02:20 +00:00
Rahul Patil
8dd555d396 ci(proxy): Update GH action flag on proxy deployment (#9535)
## Problem

Based on a recent proxy deployment issue, we deployed another proxy
version (proxy-scram), which was not needed when deploying a specific
proxy type. we have
[PR](https://github.com/neondatabase/infra/pull/2142) to update on the
infra branch and need to update CI in this repo which triggers proxy
deployment.

## Summary of changes

- Update proxy deployment flag 

## Checklist before requesting a review

- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist
2024-10-28 13:17:09 +01:00
Arthur Petukhovsky
01b6843e12 Route pgbouncer logs to virtio-serial (#9488)
virtio-serial is much more performant than /dev/console emulation,
therefore, is much more suitable for the verbose logs inside vm. This
commit changes routing for pgbouncer logs, since we've recently noticed
it can emit large volumes of logs.

Manually tested on staging by pinning a compute image to my test
project.

Should help with https://github.com/neondatabase/cloud/issues/19072
2024-10-28 12:09:47 +00:00
John Spray
93987b5a4a tests: add test_storage_controller_onboard_detached (#9431)
## Problem

We haven't historically taken this API route where we would onboard a
tenant to the controller in detached state. It worked, but we didn't
have test coverage.

## Summary of changes

- Add a test that onboards a tenant to the storage controller in
Detached mode, and checks that deleting it without attaching it works as
expected.
2024-10-28 11:11:12 +00:00
John Spray
33baca07b6 storcon: add an API to cancel ongoing reconciler (#9520)
## Problem

If something goes wrong with a live migration, we currently only have
awkward ways to interrupt that:
- Restart the storage controller
- Ask it to do some other modification/migration on the shard, which we
don't really want.

## Summary of changes

- Add a new `/cancel` control API, and storcon_cli wrapper for it, which
fires the Reconciler's cancellation token. This is just for on-call use
and we do not expect it to be used by any other services.
2024-10-28 09:26:01 +00:00
John Spray
923974d4da safekeeper: don't un-evict timelines during snapshot API handler (#9428)
## Problem

When we use pull_timeline API on an evicted timeline, it gets downloaded
to serve the snapshot API request. That means that to evacuate all the
timelines from a node, the node needs enough disk space to download
partial segments from all timelines, which may not be physically the
case.

Closes: #8833 

## Summary of changes

- Add a "try" variant of acquiring a residence guard, that returns None
if the timeline is offloaded
- During snapshot API handler, take a different code path if the
timeline isn't resident, where we just read the checkpoint and don't try
to read any segments.
2024-10-28 08:47:12 +00:00
Arpad Müller
e7277885b3 Don't consider archived timelines for synthetic size calculation (#9497)
Archived timelines should not count towards synthetic size.

Closes #9384.

Part of #8088.
2024-10-26 13:27:57 +00:00
dependabot[bot]
80262e724f build(deps): bump werkzeug from 3.0.3 to 3.0.6 (#9527) 2024-10-26 08:24:15 +01:00
Yuchen Liang
85b954f449 pageserver: add tokio-epoll-uring slots waiters queue depth metrics (#9482)
In complement to
https://github.com/neondatabase/tokio-epoll-uring/pull/56.

## Problem

We want to make tokio-epoll-uring slots waiters queue depth observable
via Prometheus.

## Summary of changes

- Add `pageserver_tokio_epoll_uring_slots_submission_queue_depth`
metrics as a `Histogram`.
- Each thread-local tokio-epoll-uring system is given a `LocalHistogram`
to observe the metrics.
- Keep a list of `Arc<ThreadLocalMetrics>` used on-demand to flush data
to the shared histogram.
- Extend `Collector::collect` to report
`pageserver_tokio_epoll_uring_slots_submission_queue_depth`.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
2024-10-25 21:30:57 +01:00
Arpad Müller
76328ada05 Fix unoffload_timeline races with creation (#9525)
This PR does two things:

1. Obtain a `TimelineCreateGuard` object in `unoffload_timeline`. This
prevents two unoffload tasks from racing with each other. While they
already obtain locks for `timelines` and `offloaded_timelines`, they
aren't sufficient, as we have already constructed an entire timeline at
that point. We shouldn't ever have two `Timeline` objects in the same
process at the same time.
2. don't allow timeline creations for timelines that have been
offloaded. Obviously they already exist, so we should not allow
creation. the previous logic only looked at the timelines list.

Part of #8088
2024-10-25 20:06:27 +00:00
Erik Grinaker
b54b632c6a safekeeper: don't pass conf into storage constructors (#9523)
## Problem

The storage components take an entire `SafekeeperConf` during
construction, but only actually use the `no_sync` field. This makes it
hard to understand the storage inputs (which fields do they actually
care about?), and is also inconvenient for tests and benchmarks that
need to set up a lot of unnecessary boilerplate.

## Summary of changes

* Don't take the entire config, but pass in the `no_sync` field
explicitly.
* Take the timeline dir instead of `ttid` as an input, since it's the
only thing it cares about.
* Fix a couple of tests to not leak tempdirs.
* Various minor tweaks.
2024-10-25 18:19:52 +01:00
Erik Grinaker
9909551f47 safekeeper: fix version in TimelinePersistentState::empty() (#9521)
## Problem

The Postgres version in `TimelinePersistentState::empty()` is incorrect:
the major version should be multiplied by 10000.

## Summary of changes

Multiply the version by 10000.
2024-10-25 16:22:35 +01:00
Arseny Sher
700b102b0f safekeeper: retry eviction. (#9485)
Without this manager may sleep forever after eviction failure without
retries.
2024-10-25 17:48:29 +03:00
Conrad Ludgate
dbadb0f9bb proxy: propagate session IDs (#9509)
fixes #9367 by sending session IDs to local_proxy, and also returns
session IDs to the client for easier debugging.
2024-10-25 14:34:19 +00:00
John Spray
8297f7a181 pageserver: fix N^2 I/O when processing relation drops in transaction abort (#9507)
## Problem

We have some known N^2 behaviors when it comes to large relation counts,
due to the monolithic encoding and full rewrites of of RelDirectory each
time a relation is added. Ordinarily our backpressure mechanisms give
"slow but steady" performance when creating/dropping/truncating
relations. However, in the case of a transaction abort, it is possible
for a single WAL record to drop an unbounded number of relations. The
results in an unavailable compute, as when it sends one of these
records, it can stall the pageserver's ingest for many minutes, even
though the compute only sent a small amount of WAL.

Closes https://github.com/neondatabase/neon/issues/9505

## Summary of changes

- Rewrite relation-dropping code to do one read/modify/write cycle of
RelDirectory, instead of doing it separately for each relation in a
loop.
- Add a test for the bug scenario encountered:
`test_tx_abort_with_many_relations`

The test has ~40s runtime on my workstation. About 1 second of that is
the part where we wait for ingest to catch up after a rollback, the rest
is the slowness of creating and truncating a large number of relations.


---------

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-10-25 15:09:02 +01:00
Christian Schwarz
2090e928d1 refactor(timeline creation): idempotency checking (#9501)
# Context

In the PGDATA import code
(https://github.com/neondatabase/neon/pull/9218) I add a third way to
create timelines, namely, by importing from a copy of a vanilla PGDATA
directory in object storage.

For idempotency, I'm using the PGDATA object storage location
specification, which is stored in the IndexPart for the entire lifespan
of the timeline. When loading the timeline from remote storage, that
value gets stored inside `struct Timeline` and timeline creation
compares the creation argument with that value to determine idempotency
of the request.

# Changes

This PR refactors the existing idempotency handling of Timeline
bootstrap and branching such that we simply compare the
`CreateTimelineIdempotency` struct, using the derive-generated
`PartialEq` implementation.

Also, by spelling idempotency out in the type names, I find it adds a
lot of clarity.

The pathway to idempotency via requester-provided idempotency key also
becomes very straight-forward, if we ever want to do this in the future.

# Refs
* platform context: https://github.com/neondatabase/neon/pull/9218
* product context: https://github.com/neondatabase/cloud/issues/17507
* stacks on top of https://github.com/neondatabase/neon/pull/9366
2024-10-25 14:44:20 +01:00
Tristan Partin
05eff3a67e Move logical replication slot monitor
neon.c is getting crowded and the logical replication slot monitor is
a good candidate for reorganization. It is very self-contained, and
being in a separate file will make it that much easier to find.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-10-25 08:41:44 -05:00
Arseny Sher
c6cf5e7c0f Make test_pageserver_lsn_wait_error_safekeeper_stop less aggressive. (#9517)
Previously it inserted ~150MiB of WAL while expecting page fetching to
work in 1s (wait_lsn_timeout=1s). It failed in CI in debug builds.
Instead, just directly wait for the wanted condition, i.e. needed
safekeepers are reported in pageserver timed out waiting for WAL error
message. Also set NEON_COMPUTE_TESTING_BASEBACKUP_RETRIES to 1 in this
test and neighbour one, it reduces execution time from 2.5m to ~10s.
2024-10-25 14:13:46 +01:00
Christian Schwarz
e0c7f1ce15 remote_storage(local_fs): return correct file sizes (#9511)
## Problem

`local_fs` doesn't return file sizes, which I need in PGDATA import
(#9218)

## Solution

Include file sizes in the result.

I would have liked to add a unit test, and started doing that in 

* https://github.com/neondatabase/neon/pull/9510

by extending the common object storage tests
(`libs/remote_storage/tests/common/tests.rs`) to check for sizes as
well.

But it turns out that localfs is not even covered by the common object
storage tests and upon closer inspection, it seems that this area needs
more attention.
=> punt the effort into https://github.com/neondatabase/neon/pull/9510
2024-10-25 12:20:53 +00:00
Christian Schwarz
6f5c262684 pageserver: add testing API to scan layers for disposable keys (#9393)
This PR adds a pageserver mgmt API to scan a layer file for disposable
keys.

It hooks it up to the sharding compaction test, demonstrating that we're
not filtering out all disposable keys.

This is extracted from PGDATA import
(https://github.com/neondatabase/neon/pull/9218)
where I do the filtering of layer files based on `is_key_disposable`.
2024-10-25 14:16:45 +02:00
Jakub Kołodziejczak
9768f09f6b proxy: don't follow redirects for user provided JWKS urls + set custom user agent (#9514)
partially fixes https://github.com/neondatabase/cloud/issues/19249

ref https://docs.rs/reqwest/latest/reqwest/redirect/index.html
> By default, a Client will automatically handle HTTP redirects, having
a maximum redirect chain of 10 hops. To customize this behavior, a
redirect::Policy can be used with a ClientBuilder.
2024-10-25 14:04:41 +02:00
Yuchen Liang
db900ae9d0 fix(test): remove too strict layers_removed==0 check in test_readonly_node_gc (#9506)
Fixes #9098 

## Problem

`test_readonly_node_gc` is flaky. As shown in [Allure
Report](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9469/11444519440/index.html#suites/3ccffb1d100105b98aed3dc19b717917/2c02073738fa2b39),
we would get a `AssertionError: No layers should be removed, old layers
are guarded by leases.` after the test restarts pageservers or after
reconfigure pageservers.

During the investigation, we found that the layers has LSN (`0/1563088`)
greater than the LSN (`0x1562000`) protected by the lease. For instance,


**Layers removed**
<pre>

000000067F00000005000034540100000000-000000067F00000005000040050100000000__000000000<b><i>1563088</i></b>-00000001
(shard 0002)

000000068000000000000017E20000000001-010000000100000001000000000000000001__000000000<b><i>1563088</i></b>-00000001
(shard 0002)
</pre>

**Lsn Lease Granted**
<pre>
handle_make_lsn_lease{lsn=<b><i>0/1562000</i></b> shard_id=0002
shard_id=0002}: lease created, valid until 2024-10-21
</pre>

This means that these layers are not guarded by the leases: they are in
"future", not visible to the static endpoint.

## Summary of changes

- Remove the assertion layers_removed == 0 after trigger timeline GC
while holding the lease. Instead rely on the successful execution of
the`SELECT` query to test lease validity.
- Improve test logging


Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-10-25 12:50:47 +01:00
Arpad Müller
4d9036bf1f Support offloaded timelines during shard split (#9489)
Before, we didn't copy over the `index-part.json` of offloaded timelines
to the new shard's location, resulting in the new shard not knowing the
timeline even exists.

In #9444, we copy over the manifest, but we also need to do this for
`index-part.json`.

As the operations to do are mostly the same between offloaded and
non-offloaded timelines, we can iterate over all of them in the same
loop, after the introduction of a `TimelineOrOffloadedArcRef` type to
generalize over the two cases. This is analogous to the deletion code
added in #8907.

The added test also ensures that the sharded archival config endpoint
works, something that has not yet been ensured by tests.

Part of #8088
2024-10-25 12:32:46 +02:00
Vlad Lazar
b3bedda6fd pageserver/walingest: log on gappy rel extend (#9502)
## Problem

https://github.com/neondatabase/neon/pull/9492 added a metric to track
the total count of block gaps filled on rel extend. More context is
needed to understand when this happens. The current theory is that it
may only happen on pg 14 and pg 15 since they do not WAL log relation extends.

## Summary of Changes

A rate limited log is added.
2024-10-25 11:15:53 +01:00
Christian Schwarz
b782b11b33 refactor(timeline creation): represent bootstrap vs branch using enum (#9366)
# Problem

Timeline creation can either be bootstrap or branch.
The distinction is made based on whether the `ancestor_*` fields are
present or not.

In the PGDATA import code
(https://github.com/neondatabase/neon/pull/9218), I add a third variant
to timeline creation.

# Solution

The above pushed me to refactor the code in Pageserver to distinguish
the different creation requests through enum variants.

There is no externally observable effect from this change.

On the implementation level, a notable change is that the acquisition of
the `TimelineCreationGuard` happens later than before. This is necessary
so that we have everything in place to construct the
`CreateTimelineIdempotency`. Notably, this moves the acquisition of the
creation guard _after_ the acquisition of the `gc_cs` lock in the case
of branching. This might appear as if we're at risk of holding `gc_cs`
longer than before this PR, but, even before this PR, we were holding
`gc_cs` until after the `wait_completion()` that makes the timeline
creation durable in S3 returns. I don't see any deadlock risk with
reversing the lock acquisition order.

As a drive-by change, I found that the `create_timeline()` function in
`neon_local` is unused, so I removed it.

# Refs

* platform context: https://github.com/neondatabase/neon/pull/9218
* product context: https://github.com/neondatabase/cloud/issues/17507
* next PR stacked atop this one:
https://github.com/neondatabase/neon/pull/9501
2024-10-25 10:04:27 +00:00
Vlad Lazar
5069123b6d pageserver: refactor ingest inplace to decouple decoding and handling (#9472)
## Problem

WAL ingest couples decoding of special records with their handling
(updates to the storage engine mostly).
This is a roadblock for our plan to move WAL filtering (and implicitly
decoding) to safekeepers since they cannot
do writes to the storage engine. 

## Summary of changes

This PR decouples the decoding of the special WAL records from their
application. The changes are done in place
and I've done my best to refrain from refactorings and attempted to
preserve the original code as much as possible.

Related: https://github.com/neondatabase/neon/issues/9335
Epic: https://github.com/neondatabase/neon/issues/9329
2024-10-24 17:12:47 +01:00
Alex Chi Z.
fb0406e9d2 refactor(pageserver): refactor split writers using batch layer writer (#9493)
part of https://github.com/neondatabase/neon/issues/9114,
https://github.com/neondatabase/neon/issues/8836,
https://github.com/neondatabase/neon/issues/8362

The split layer writer code can be used in a more general way: the
caller puts unfinished writers into the batch layer writer and let batch
layer writer to ensure the atomicity of the layer produces.

## Summary of changes

* Add batch layer writer, which atomically finishes the layers.
`BatchLayerWriter::finish` is simply a copy-paste from previous split
layer writers.
* Refactor split writers to use the batch layer writer.
* The current split writer tests cover all code path of batch layer
writer.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-10-24 10:49:54 -04:00
Alexander Bayandin
b8a311131e CI: remove git config --add safe.directory hack (#9391)
## Problem

We have `git config --global --add safe.directory ...` leftovers from the
past, but `actions/checkout` does it by default (since v3.0.2, we use v4)

## Summary of changes
- Remove `git config --global --add safe.directory ...` hack
2024-10-24 15:49:26 +01:00
John Spray
d589498c6f storcon: respect Reconciler::cancel during await_lsn (#9486)
## Problem

When a pageserver is misbehaving (e.g. we hit an ingest bug or something
is pathologically slow), the storage controller could get stuck in the
part of live migration that waits for LSNs to catch up. This is a
problem, because it can prevent us migrating the troublesome tenant to
another pageserver.

Closes: https://github.com/neondatabase/cloud/issues/19169

## Summary of changes

- Respect Reconciler::cancel during await_lsn.
2024-10-24 15:23:09 +01:00
Christian Schwarz
6f34f97573 refactor(pageserver(load_remote_timeline)) remove dead code handling absence of IndexPart (#9408)
The code is dead at runtime since we're nowadays always running with
remote storage and treat it as the source of truth during attach.

Clean it up as a preliminary to
https://github.com/neondatabase/neon/pull/9218.

Related: https://github.com/neondatabase/neon/pull/9366
2024-10-24 09:00:22 +01:00
Tristan Partin
b86432c29e Fix buggy sizeof
A sizeof on a pointer on a 64 bit machine is 8 bytes whereas
Entry::old_name is a 64 byte array of characters. There was most likely
no fallout since the string would start with NUL bytes, but best to fix
nonetheless.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-10-23 21:52:22 -06:00
Vlad Lazar
ac1205c14c pageserver: add metric for number of zeroed pages on rel extend (#9492)
## Problem

Filling the gap in with zeroes is annoying for sharded ingest. We are
not sure it even happens in reality.

## Summary of Changes

Add one global counter which tracks how many such gap blocks we filled
on relation extends. We can add more metrics once we understand the
scope.
2024-10-23 19:58:28 +01:00
John Spray
e3ff87ce3b tests: avoid using background_process when invoking pg_ctl (#9469)
## Problem

Occasionally, we get failures to start the storage controller's db with
errors like:
```
aborting due to panic at /__w/neon/neon/control_plane/src/background_process.rs:349:67:
claim pid file: lock file

Caused by:
    file is already locked
```
e.g.
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9428/11380574562/index.html#/testresult/1c68d413ea9ecd4a

This is happening in a stop,start cycle during a test. Presumably the
pidfile from the startup background process is still held at the point
we stop, because we let pg_ctl keep running in the background.

## Summary of changes

- Refactor pg_ctl invocations into a helper
- In the controller's `start` function, use pg_ctl & a wait loop for
pg_isready, instead of using background_process

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2024-10-23 16:29:55 +00:00