Commit Graph

5510 Commits

Author SHA1 Message Date
Arseny Sher
6f20a18e8e Allow to change compute safekeeper list without restart.
- Add --safekeepers option to neon_local reconfigure
- Add it to python Endpoint reconfigure
- Implement config reload in walproposer by restarting the whole bgw when
  safekeeper list changes.

ref https://github.com/neondatabase/neon/issues/6341
2024-06-27 15:08:35 +03:00
Vlad Lazar
d557002675 strocon: don't overcommit when making node fill plan (#8171)
## Problem
The fill requirement was not taken into account when looking through the
shards of a given node to fill from.

## Summary of Changes
Ensure that we do not fill a node past the recommendation from
`Scheduler::compute_fill_requirement`.
2024-06-27 11:56:57 +01:00
Cihan Demirci
32b75e7c73 CI: additional trigger on merge to main (#8176)
Before we consolidate workflows we want to be triggered by merges to main.

https://github.com/neondatabase/cloud/issues/14862
2024-06-26 22:36:41 +00:00
Heikki Linnakangas
d2753719e3 test: Add helper function for importing a Postgres cluster (#8025)
Also, modify the "neon_local timeline import" command so that it
doesn't create the endpoint any more. I don't see any reason to bundle
that in the same command, the "timeline create" and "timeline branch"
commands don't do that either.

I plan to add more tests similar to 'test_import_at_2bil', this will
help to reduce the copy-pasting.
2024-06-26 21:54:29 +00:00
Alex Chi Z
04b2ac3fed test: use aux file v2 policy in benchmarks (#8174)
Use aux file v2 in benchmarks.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-06-26 20:33:15 +00:00
John Spray
c39d5b03e8 pageserver: remove legacy tenant config code, clean up redundant generation none/broken usages (#7947)
## Problem

In https://github.com/neondatabase/neon/pull/5299, the new config-v1
tenant config file was added to hold the LocationConf type. We left the
old config file in place for forward compat, and because running without
generations (therefore without LocationConf) as still useful before the
storage controller was ready for prime-time.

Closes: https://github.com/neondatabase/neon/issues/5388

## Summary of changes

- Remove code for reading and writing the legacy config file
- Remove Generation::Broken: it was unused.
- Treat missing config file on disk as an error loading a tenant, rather
than defaulting it. We can now remove LocationConf::default, and thereby
guarantee that we never construct a tenant with a None generation.
- Update some comments + add some assertions to clarify that
Generation::None is only used in layer metadata, not in the state of a
running tenant.
- Update docker compose test to create tenants with a generation
2024-06-26 19:53:59 +00:00
Arthur Petukhovsky
76fc3d4aa1 Evict WAL files from disk (#8022)
Fixes https://github.com/neondatabase/neon/issues/6337

Add safekeeper support to switch between `Present` and
`Offloaded(flush_lsn)` states. The offloading is disabled by default,
but can be controlled using new cmdline arguments:

```
      --enable-offload
          Enable automatic switching to offloaded state
      --delete-offloaded-wal
          Delete local WAL files after offloading. When disabled, they will be left on disk
      --control-file-save-interval <CONTROL_FILE_SAVE_INTERVAL>
          Pending updates to control file will be automatically saved after this interval [default: 300s]
```

Manager watches state updates and detects when there are no actvity on
the timeline and actual partial backup upload in remote storage. When
all conditions are met, the state can be switched to offloaded.

In `timeline.rs` there is `StateSK` enum to support switching between
states. When offloaded, code can access only control file structure and
cannot use `SafeKeeper` to accept new WAL.

`FullAccessTimeline` is now renamed to `WalResidentTimeline`. This
struct contains guard to notify manager about active tasks requiring
on-disk WAL access. All guards are issued by the manager, all requests
are sent via channel using `ManagerCtl`. When manager receives request
to issue a guard, it unevicts timeline if it's currently evicted.

Fixed a bug in partial WAL backup, it used `term` instead of
`last_log_term` previously.

After this commit is merged, next step is to roll this change out, as in
issue #6338.
2024-06-26 18:58:56 +01:00
Vlad Lazar
dd3adc3693 docker: downgrade openssl to 1.1.1w (#8168)
## Problem
We have seen numerous segfault and memory corruption issue for clients
using libpq and openssl 3.2.2. I don't know if this is a bug in openssl
or libpq. Downgrading to 1.1.1w fixes the issues for the storage
controller and pgbench.

## Summary of Changes:
Use openssl 1.1.1w instead of 3.2.2
2024-06-26 17:27:23 +00:00
Heikki Linnakangas
5b871802fd Add counters for commands processed through the libpq page service API (#8089)
I was looking for metrics on how many computes are still using protocol
version 1 and 2. This provides counters for that as "pagestream" and
"pagestream_v2" commands, but also all the other commands. The new
metrics are global for the whole pageserver instance rather than
per-tenant, so the additional metrics bloat should be fairly small.
2024-06-26 19:53:03 +03:00
Heikki Linnakangas
24ce73ffaf Silence compiler warning (#8153)
I saw this compiler warning on my laptop:

pgxn/neon_walredo/walredoproc.c:178:10: warning: using the result of an
assignment as a condition without parentheses [-Wparentheses]
            if (err = close_range_syscall(3, ~0U, 0)) {
                ~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
pgxn/neon_walredo/walredoproc.c:178:10: note: place parentheses around
the assignment to silence this warning
            if (err = close_range_syscall(3, ~0U, 0)) {
                    ^
                (                                   )
pgxn/neon_walredo/walredoproc.c:178:10: note: use '==' to turn this
assignment into an equality comparison
            if (err = close_range_syscall(3, ~0U, 0)) {
                    ^
                    ==
    1 warning generated.

I'm not sure what compiler version or options cause that, but it's a
good warning. Write the call a little differently, to avoid the warning
and to make it a little more clear anyway. (The 'err' variable wasn't
used for anything, so I'm surprised we were not seeing a compiler
warning on the unused value, too.)
2024-06-26 19:19:27 +03:00
Arthur Petukhovsky
3118c24521 Panic on unexpected error in simtests (#8169) 2024-06-26 16:46:14 +01:00
Alexander Bayandin
5af9660b9e CI(build-tools): don't install Postgres 14 (#6540)
## Problem

We install Postgres 14 in `build-tools` image, but we don't need
it. We use Postgres binaries, which we build ourselves.

## Summary of changes
- Remove Postgresql 14 installation from `build-tools` image
2024-06-26 16:37:04 +01:00
Conrad Ludgate
d7e349d33c proxy: report blame for passthrough disconnect io errors (#8170)
## Problem

Hard to debug the disconnection reason currently.

## Summary of changes

Keep track of error-direction, and therefore error source (client vs
compute) during passthrough.
2024-06-26 15:11:26 +00:00
Arthur Petukhovsky
47e5bf3bbb Improve term reject message in walproposer (#8164)
Co-authored-by: Tristan Partin <tristan@neon.tech>
2024-06-26 15:26:52 +01:00
Alex Chi Z
5d2f9ffa89 test(bottom-most-compaction): wal apply order (#8163)
A follow-up on https://github.com/neondatabase/neon/pull/8103/.
Previously, main branch fails with:

```
assertion `left == right` failed
  left: b"value 3@0x10@0x30@0x28@0x40"
 right: b"value 3@0x10@0x28@0x30@0x40"
```

This gets fixed after #8103 gets merged.


Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-06-26 09:34:41 -04:00
Heikki Linnakangas
fdadd6a152 Remove primary_is_running (#8162)
This was a half-finished mechanism to allow a replica to enter hot
standby mode sooner, without waiting for a running-xacts record. It had
issues, and we are working on a better mechanism to replace it.

The control plane might still set the flag in the spec file, but
compute_ctl will simply ignore it.
2024-06-26 15:13:03 +03:00
Peter Bendel
9b623d3a2c add commit hash to S3 object identifier for artifacts on S3 (#8161)
In future we may want to run periodic tests on dedicated cloud instances
that are not GitHub action runners.
To allow these to download artifact binaries for a specific commit hash
we want to make the search by commit hash possible and prefix the S3
objects with
`artifacts/${GITHUB_SHA}/${GITHUB_RUN_ID}/${GITHUB_RUN_ATTEMPT}`

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2024-06-26 05:46:52 +00:00
Alex Chi Z
9b98823d61 bottom-most-compaction: use in test_gc_feedback + fix bugs (#8103)
Adds manual compaction trigger; add gc compaction to test_gc_feedback

Part of https://github.com/neondatabase/neon/issues/8002

```
test_gc_feedback[debug-pg15].logical_size: 50 Mb
test_gc_feedback[debug-pg15].physical_size: 2269 Mb
test_gc_feedback[debug-pg15].physical/logical ratio: 44.5302 
test_gc_feedback[debug-pg15].max_total_num_of_deltas: 7 
test_gc_feedback[debug-pg15].max_num_of_deltas_above_image: 2 
test_gc_feedback[debug-pg15].logical_size_after_bottom_most_compaction: 50 Mb
test_gc_feedback[debug-pg15].physical_size_after_bottom_most_compaction: 287 Mb
test_gc_feedback[debug-pg15].physical/logical ratio after bottom_most_compaction: 5.6312 
test_gc_feedback[debug-pg15].max_total_num_of_deltas_after_bottom_most_compaction: 4 
test_gc_feedback[debug-pg15].max_num_of_deltas_above_image_after_bottom_most_compaction: 1
```

## Summary of changes

* Add the manual compaction trigger
* Use in test_gc_feedback
* Add a guard to avoid running it with retain_lsns
* Fix: Do `schedule_compaction_update` after compaction
* Fix: Supply deltas in the correct order to reconstruct value

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-06-25 23:00:14 +00:00
Alex Chi Z
76864e6a2a feat(pageserver): add image layer iterator (#8006)
part of https://github.com/neondatabase/neon/issues/8002

## Summary of changes

This pull request adds the image layer iterator. It buffers a fixed
amount of key-value pairs in memory, and give developer an iterator
abstraction over the image layer. Once the buffer is exhausted, it will
issue 1 I/O to fetch the next batch.

Due to the Rust lifetime mysteries, the `get_stream_from` API has been
refactored to `into_stream` and consumes `self`.

Delta layer iterator implementation will be similar, therefore I'll add
it after this pull request gets merged.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-06-25 20:49:29 +00:00
Conrad Ludgate
6c5d3b5263 proxy fix wake compute console retry (#8141)
## Problem

1. Proxy is retrying errors from cplane that shouldn't be retried
2. ~~Proxy is not using the retry_after_ms value~~

## Summary of changes

1. Correct the could_retry impl for ConsoleError.
2. ~~Update could_retry interface to support returning a fixed wait
duration.~~
2024-06-25 18:07:54 +00:00
Christian Schwarz
cd9a550d97 clippy-deny the todo!() macro (#4340)
`todo!()` shouldn't slip into prod code
2024-06-25 18:03:27 +00:00
John Spray
07f21dd6b6 pageserver: remove attach/detach apis (#8134)
## Problem

These APIs have been deprecated for some time, but were still used from
test code.

Closes: https://github.com/neondatabase/neon/issues/4282

## Summary of changes

- It is still convenient to do a "tenant_attach" from a test without
having to write out a location_conf body, so those test methods have
been retained with implementations that call through to their
location_conf equivalent.
2024-06-25 17:38:06 +01:00
Heikki Linnakangas
64a4461191 Fix submodule references to match the REL_*_STABLE_neon branches (#8159)
No code changes, just point to the correct commit SHAs.
2024-06-25 19:05:13 +03:00
Yuchen Liang
961fc0ba8f feat(pageserver): add metrics for number of valid leases after each refresh (#8147)
Part of #7497, closes #8120.

## Summary of changes

This PR adds a metric to track the number of valid leases after `GCInfo`
gets refreshed each time.

Besides this metric, we should also track disk space and synthetic size
(after #8071 is closed) to make sure leases are used properly.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-06-25 15:43:12 +00:00
Alexander Bayandin
9b2f9419d9 CI: upload docker cache only from main (#8157)
## Problem
The Docker build cache gets invalidated by PRs

## Summary of changes
- Upload cache only from the main branch
2024-06-25 16:18:22 +01:00
Christian Schwarz
947f6da75e L0 flush: avoid short-lived allocation when checking key_range empty (#8154)
We only use `keys` to check if it's empty so we can bail out early. No
need to collect the keys for that.

Found this while doing research for
https://github.com/neondatabase/neon/issues/7418
2024-06-25 17:04:44 +02:00
Vlad Lazar
7026dde9eb storcon: update db related dependencides (#8155)
## Problem
Storage controller runs into memory corruption issue on the drain/fill
code paths.

## Summary of changes
Update db related depdencies in the unlikely case that the issue was
fixed in diesel.
2024-06-25 15:06:18 +01:00
Heikki Linnakangas
d502313841 Fix MVCC bug with prepared xact with subxacts on standby (#8152)
We did not recover the subtransaction IDs of prepared transactions when
starting a hot standby from a shutdown checkpoint. As a result, such
subtransactions were considered as aborted, rather than in-progress.
That would lead to hint bits being set incorrectly, and the
subtransactions suddenly becoming visible to old snapshots when the
prepared transaction was committed.

To fix, update pg_subtrans with prepared transactions's subxids when
starting hot standby from a shutdown checkpoint. The snapshots taken
from that state need to be marked as "suboverflowed", so that we also
check the pg_subtrans.

Discussion:
https://www.postgresql.org/message-id/6b852e98-2d49-4ca1-9e95-db419a2696e0%40iki.fi

NEON: cherry-picked from the upstream thread ahead of time, to unblock
https://github.com/neondatabase/neon/pull/7288. I expect this to be
committed to upstream in the next few days, superseding this. NOTE: I
did not include the new regression test on v15 and v14 branches, because
the test would need some adapting, and we don't run the perl tests on
Neon anyway.
2024-06-25 16:29:32 +03:00
Yuchen Liang
219e78f885 feat(pageserver): add an optional lease to the get_lsn_by_timestamp API (#8104)
Part of #7497, closes #8072.

## Problem

Currently the `get_lsn_by_timestamp` and branch creation pageserver APIs do not provide a pleasant client experience where the looked-up LSN might be GC-ed between the two API calls.

This PR attempts to prevent common races between GC and branch creation by making use of LSN leases provided in #8084. A lease can be optionally granted to a looked-up LSN. With the lease, GC will not touch layers needed to reconstruct all pages at this LSN for the duration of the lease.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-06-24 20:12:24 +00:00
John Spray
1ea5d8b132 tests: accomodate some messages that can fail tests (#8144)
## Problem

- `test_storage_controller_many_tenants` can fail with warnings in the
storage controller about tenant creation holding a lock for too long,
because this test stresses the machine running the test with many
concurrent timeline creations
- `test_tenant_delete_smoke` can fail when synthetic remote storage
errors show up

## Summary of changes

- tolerate warnings about slow timeline creation in
test_storage_controller_many_tenants
- tolerate both possible errors during error_tolerant_delete
2024-06-24 17:03:53 +00:00
John Spray
3d760938e1 storcon_cli: remove old tenant-scatter command (#8127)
## Problem

This command was used in the very early days of sharding, before the
storage controller had anti-affinity + scheduling optimization to spread
out shards.

## Summary of changes

- Remove `storcon_cli tenant-scatter`
2024-06-24 12:57:16 -04:00
Alex Chi Z
9211de0df7 test(pageserver): add delta records tests for gc-compaction (#8078)
Part of https://github.com/neondatabase/neon/issues/8002

This pull request adds tests for bottom-most gc-compaction with delta
records. Also fixed a bug in the compaction process that creates
overlapping delta layers by force splitting at the original delta layer
boundary.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-06-24 15:50:31 +00:00
Alex Chi Z
d8ffe662a9 fix(pageserver): handle version number in draw timeline (#8102)
We now have a `vX` number in the file name, i.e.,
`000000067F0000000400000B150100000000-000000067F0000000400000D350100000000__00000000014B7AC8-v1-00000001`

The related pull request for new-style path was merged a month ago
https://github.com/neondatabase/neon/pull/7660

## Summary of changes

Fixed the draw timeline dir command to handle it.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-06-24 15:31:06 +00:00
Arthur Petukhovsky
a4db2af1f0 Truncate waltmp file on creation (#8133)
Previously in safekeeper code, new segment file was opened without
truncate option. I don't think there is a reason to do it, this commit
replaces it with `File::create` to make it simpler and remove
`clippy::suspicious_open_options` linter warning.
2024-06-24 14:07:59 +00:00
John Spray
47fdf93cf0 tests: fix a flake in test_sharding_split_compaction (#8136)
## Problem

This test could occasionally trigger a "removing local file ... because
it has unexpected length log" when using the
`compact-shard-ancestors-persistent` failpoint is in use, which is
unexpected because that failpoint stops the process when the remote
metadata is in sync with local files.

It was because there are two shards on the same pageserver, and while
the one being compacted explicitly stops at the failpoint, another shard
was compacting in the background and failing at an unclean point. The
test intends to disable background compaction, but was mistakenly
revoking the value of `compaction_period` when it updated
`pitr_interval`.

Example failure:

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8123/9602976462/index.html#/testresult/7dd6165da7daef40

## Summary of changes

- Update `TENANT_CONF` in the test to use properly typed values, so that
it is usable in pageserver APIs as well as via neon_local.
- When updating tenant config with `pitr_interval`, retain the overrides
from the start of the test, so that there won't be any background
compaction going on during the test.
2024-06-24 14:54:54 +01:00
John Spray
de05f90735 pageserver: add more info-level logging in shard splits (#8137)
## Problem

`test_sharding_autosplit` is occasionally failing on warnings about
shard splits taking longer than expected (`Exclusive lock by ShardSplit
was held for`...)

It's not obvious which part is taking the time (I suspect remote storage
uploads).

Example:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/9618788427/index.html#testresult/b395294d5bdeb783/

## Summary of changes

- Since shard splits are infrequent events, we can afford to be very
chatty: add a bunch of info-level logging throughout the process.
2024-06-24 11:53:43 +01:00
John Spray
188797f048 pageserver: remove code that resumes tenant deletions after restarts (#8091)
#8082 removed the legacy deletion path, but retained code for completing
deletions that were started before a pageserver restart. This PR cleans
up that remaining code, and removes all the pageserver code that dealt
with tenant deletion markers and resuming tenant deletions.

The release at https://github.com/neondatabase/neon/pull/8138 contains
https://github.com/neondatabase/neon/pull/8082, so we can now merge this
to `main`
2024-06-24 11:41:11 +01:00
Arpad Müller
5446e08891 Move remote_storage config related code into dedicated module (#8132)
Moves `RemoteStorageConfig` and related structs and functions into a
dedicated module. Also implements `Serialize` for the config structs
(requested in #8126).

Follow-up of #8126
2024-06-24 12:29:54 +02:00
Conrad Ludgate
78d9059fc7 proxy: update tokio-postgres to allow arbitrary config params (#8076)
## Problem

Fixes https://github.com/neondatabase/neon/issues/1287

## Summary of changes

tokio-postgres now supports arbitrary server params through the
`param(key, value)` method. Some keys are special so we explicitly
filter them out.
2024-06-24 10:20:27 +00:00
Arpad Müller
75747cdbff Use serde for RemoteStorageConfig parsing (#8126)
Adds a `Deserialize` impl to `RemoteStorageConfig`. We thus achieve the
same as #7743 but with less repetitive code, by deriving `Deserialize`
impls on `S3Config`, `AzureConfig`, and `RemoteStorageConfig`. The
disadvantage is less useful error messages.

The git history of this PR contains a state where we go via an
intermediate representation, leveraging the `serde_json` crate,
without it ever being actual json though.

Also, the PR adds deserialization tests.

Alternative to #7743 .
2024-06-22 17:57:09 +00:00
Vlad Lazar
8fe3f17c47 storcon: improve drain and fill shard placement (#8119)
## Problem
While adapting the storage controller scale test to do graceful rolling
restarts via
drain and fill, I noticed that secondaries are also being rescheduled,
which, in turn,
caused the storage controller to optimise attachments.

## Summary of changes
* Introduce a transactional looking rescheduling primitive (i.e. "try to
schedule to this
secondary, but leave everything as is if you can't")
* Use it for the drain and fill stages to avoid calling into
`Scheduler::schedule` and having
secondaries move around.
2024-06-22 14:20:58 +00:00
Anastasia Lubennikova
8776089c70 Remove kq_imcx extension support per customer request
neondatabase/cloud#13648
2024-06-21 20:22:54 +01:00
John Spray
b74232eb4d tests: allow-list neon_local endpoint errors from storage controller (#8123)
## Problem

For testing, the storage controller has a built-in hack that loads
neon_local endpoint config from disk, and uses it to reconfigure
endpoints when the attached pageserver changes.

Some tests that stop an endpoint while the storage controller is running
could occasionally fail on log errors from the controller trying to use
its special test-mode calls into neon local Endpoint.

Example:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8117/9592392425/index.html#/testresult/9d2bb8623d0d53f8

## Summary of changes

- Give NotifyError an explicit NeonLocal variant, to avoid munging these
into generic 500s (I don't want to ignore 500s in general)
- Allow-list errors related to the local notification hook.

The expectation is that tests using endpoints/workloads should be
independently checking that those endpoints work: if neon_local
generates an error inside the storage controller, that's ignorable.
2024-06-21 17:23:31 +00:00
Vlad Lazar
ee3081863e storcon: implement endpoints for cancellation of drain and fill operations (#8029)
## Problem
There's no way to cancel drain and fill operations.

## Summary of changes
Implement HTTP endpoints to allow cancelling of background operations.
When the operationis cancelled successfully, the node scheduling policy will revert to
`Active`.
2024-06-21 17:13:51 +01:00
John Spray
15728be0e1 pageserver: always detach before deleting (#8082)
In #7957 we enabled deletion without attachment, but retained the
old-style deletion (return 202, delete in background) for attached
tenants. In this PR, we remove the old-style deletion path, such that if
the tenant delete API is invoked while a tenant is detached, it is
simply detached before completing the deletion.

This intentionally doesn't rip out all the old deletion code: in case a
deletion was in progress at time of upgrade, we keep around the code for
finishing it for one release cycle. The rest of the code removal happens
in https://github.com/neondatabase/neon/pull/8091

Now that deletion will always be via the new path, the new path is also
updated to use some retries around remote storage operations, to
tripping up the control plane with 500s if S3 has an intermittent issue.
2024-06-21 15:39:19 +01:00
Arthur Petukhovsky
f45cf28247 Add eviction_state to control file (#8125)
This is a preparation for #8022, to make the PR both backwards and
foward compatible.

This commit adds `eviction_state` field to control file. Adds support
for reading it, but writes control file in old format where possible, to
keep the disk format forward compatible.

Note: in `patch_control_file`, new field gets serialized to json like
this:
- `"eviction_state": "Present"`
- `"eviction_state": {"Offloaded": "0/8F"}`
2024-06-21 13:15:02 +01:00
Peter Bendel
82266a252c Allow longer timeout for starting pageserver, safe keeper and storage controller in test cases to make test cases less flaky (#8079)
## Problem

see https://github.com/neondatabase/neon/issues/8070

## Summary of changes

the neon_local subcommands to 
- start neon
- start pageserver
- start safekeeper
- start storage controller

get a new option -t=xx or --start-timeout=xx which allows to specify a
longer timeout in seconds we wait for the process start.
This is useful in test cases where the pageserver has to read a lot of
layer data, like in pagebench test cases.

In addition we exploit the new timeout option in the python test
infrastructure (python fixtures) and modify the flaky testcase to
increase the timeout from 10 seconds to 1 minute.

Example from the test execution

```bash
RUST_BACKTRACE=1 NEON_ENV_BUILDER_USE_OVERLAYFS_FOR_SNAPSHOTS=1 DEFAULT_PG_VERSION=15 BUILD_TYPE=release     ./scripts/pytest test_runner/performance/pageserver/pagebench/test_pageserver_max_throughput_getpage_at_latest_lsn.py
...
2024-06-19 09:29:34.590 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local storage_controller start --start-timeout=60s"
2024-06-19 09:29:36.365 INFO [broker.py:34] starting storage_broker to listen incoming connections at "127.0.0.1:15001"
2024-06-19 09:29:36.365 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local pageserver start --id=1 --start-timeout=60s"
2024-06-19 09:29:36.366 INFO [neon_fixtures.py:1513] Running command "/instance_store/neon/target/release/neon_local safekeeper start 1 --start-timeout=60s"
```
2024-06-21 10:36:12 +00:00
John Spray
59f949b4a8 pageserver: remove unused load/ignore APIs (#8122)
## Problem

These APIs have be unused for some time. They were superseded by
/location_conf: the equivalent of ignoring a tenant is now to put it in
secondary mode.

## Summary of changes

- Remove APIs
- Remove tests & helpers that used them
- Remove error variants that are no longer needed.
2024-06-21 10:02:15 +00:00
Vlad Lazar
01399621d5 storcon: avoid promoting too many shards of the same tenant (#8099)
## Problem

The fill planner introduced in
https://github.com/neondatabase/neon/pull/8014 selects
tenant shards to promote strictly based on attached shard count load
(tenant shards on
nodes with the most attached shard counts are considered first). This
approach runs the
risk of migrating too many shards belonging to the same tenant on the
same primary node.

This is bad for availability and causes extra reconciles via the storage
controller's background
optimisations.

Also see
https://github.com/neondatabase/neon/pull/8014#discussion_r1642456241.

## Summary of changes
Refine the fill plan to avoid promoting too many shards belonging to the
same tenant on the same node.
We allow for `max(1, shard_count / node_count)` shards belonging to the
same tenant to be promoted.
2024-06-21 10:19:01 +01:00
Jure Bajic
0792bb6785 Add tracing for shared locks in id_lock_map (#7618)
## Problem
Storage controller shared locks do not print a warning when held for long time spans.

## Summary of changes
Extension of issue https://github.com/neondatabase/neon/issues/7108 in
tracing to exclusive lock in `id_lock_map` was added, to add the same
for shared locks. It was mentioned in the comment
https://github.com/neondatabase/neon/pull/7397#discussion_r1587961160
2024-06-21 09:47:04 +01:00