## Problem
We have been dealing with a number of issues with the SC compute
notification mechanism. Various race conditions exist in the
PG/HCC/cplane/PS distributed system, and relying on the SC to notify
the compute node of PS changes is not robust. We decided to pursue a
more robust option where the compute node itself discovers whether it
may be pointing to the incorrect PSs and proactively reconfigures
itself if issues are suspected.
## Summary of changes
To support this self-healing reconfiguration mechanism, several pieces
are needed. This PR adds a mechanism to `compute_ctl` called "refresh
configuration", where the compute node reaches out to the control plane
to pull a new config and reconfigures PG using it, instead of waiting
for a notification message containing a config to arrive from the
control plane. Main changes to `compute_ctl`:
1. The `compute_ctl` state machine now has a new state,
`RefreshConfigurationPending`. The compute node may enter this state
upon receiving a signal that it may be using the incorrect page servers.
2. Upon entering the `RefreshConfigurationPending` state, the background
configurator thread in `compute_ctl` wakes up, pulls a new config from
the control plane, and reconfigures PG (with `pg_ctl reload`) according
to the new config.
3. The compute node may enter the new `RefreshConfigurationPending`
state from the `Running` or `Failed` states. If the configurator manages
to reconfigure the compute node successfully, the node enters the
`Running` state; otherwise it stays in `RefreshConfigurationPending`,
and the configurator thread waits for the next notification if an
incorrect config is still suspected (see the sketch after this list).
4. Added various plumbing in `compute_ctl` data structures to allow the
configurator thread to perform the config fetch.
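
Below is a minimal Python sketch of the refresh flow described above. It is illustrative only: the real logic is the Rust configurator thread in `compute_ctl`, and `ComputeStatus`, `ConfiguratorSketch`, `fetch_config_from_control_plane`, and `apply_config` are placeholder names rather than actual APIs.

```python
import threading
from enum import Enum, auto


class ComputeStatus(Enum):
    RUNNING = auto()
    FAILED = auto()
    REFRESH_CONFIGURATION_PENDING = auto()


def fetch_config_from_control_plane() -> dict:
    # Placeholder: in compute_ctl this is a request to the control plane
    # (or, in the tests below, a read of the local spec file).
    return {}


def apply_config(config: dict) -> None:
    # Placeholder: in compute_ctl this rewrites the PG config and runs `pg_ctl reload`.
    pass


class ConfiguratorSketch:
    """Illustration only; not the actual compute_ctl code."""

    def __init__(self) -> None:
        self.status = ComputeStatus.RUNNING
        self.refresh_requested = False
        self.cond = threading.Condition()

    def request_refresh(self) -> None:
        # Invoked when /refresh_configuration is hit: enter the pending state
        # (reachable from RUNNING or FAILED) and wake the configurator thread.
        with self.cond:
            if self.status in (ComputeStatus.RUNNING, ComputeStatus.FAILED):
                self.status = ComputeStatus.REFRESH_CONFIGURATION_PENDING
            self.refresh_requested = True
            self.cond.notify()

    def configurator_loop(self) -> None:
        while True:
            with self.cond:
                self.cond.wait_for(lambda: self.refresh_requested)
                self.refresh_requested = False
            try:
                apply_config(fetch_config_from_control_plane())
                with self.cond:
                    self.status = ComputeStatus.RUNNING
            except Exception:
                # Refresh failed: stay in REFRESH_CONFIGURATION_PENDING and wait
                # for the next notification before trying again.
                pass
```

The key property is the one from point 3: a failed refresh leaves the node in the pending state until another notification arrives, rather than looping or giving up.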
The "incorrect config suspected" notification is delivered using a HTTP
endpoint, `/refresh_configuration`, on `compute_ctl`. This endpoint is
currently not called by anyone other than the tests. In a follow up PR I
will set up some code in the PG extension/libpagestore to call this HTTP
endpoint whenever PG suspects that it is pointing to the wrong page
servers.
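
For reference, triggering a refresh by hand looks roughly like this; the address is an assumption about where `compute_ctl`'s external HTTP API listens in a local setup, and the request is assumed to need no body or auth:

```python
import requests

# Assumed address of compute_ctl's external HTTP API in a local setup; adjust as needed.
COMPUTE_CTL_API = "http://localhost:3080"

# Signal the compute node that its pageserver configuration may be stale. The
# configurator thread then pulls a fresh config and reloads PG if anything changed.
resp = requests.post(f"{COMPUTE_CTL_API}/refresh_configuration")
resp.raise_for_status()
```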
## How is this tested?
Modified `test_runner/regress/test_change_pageserver.py` to add a
scenario that uses the new `/refresh_configuration` mechanism instead
of the existing `/configure` mechanism (which requires sending a full
config to `compute_ctl`) to have the compute node reload and reconfigure
its pageservers.
I took one shortcut to reduce the scope of this change when it comes to
testing: the compute node uses a local config file instead of pulling a
config over the network from the HCC. This simplifies the test setup in
the following ways:
* The existing test framework is set up to use local config files for
compute nodes only, so it's convenient to just stick with that.
* The HCC today generates a compute config with production settings
(e.g., assuming 4 CPUs, 16GB RAM, with local file caches), which is
probably not suitable in tests. We may need to add another test-only
endpoint config to the control plane to make this work.
The config-fetch part of the code is relatively straightforward (and
well-covered in both production and the KIND test), so it is probably
fine to replace it with loading from the local config file in these
integration tests.
In addition to making sure that the tests pass, I also manually
inspected the logs to confirm that the compute node is indeed reloading
the config using the new mechanism instead of going down the old
`/configure` path (it turns out the test had bugs that caused compute
`/configure` messages to be sent despite the test intending to
disable/blackhole them).
```text
2024-09-24T18:53:29.573650Z INFO http request{otel.name=/refresh_configuration http.method=POST}: serving /refresh_configuration POST request
2024-09-24T18:53:29.573689Z INFO configurator_main_loop: compute node suspects its configuration is out of date, now refreshing configuration
2024-09-24T18:53:29.573706Z INFO configurator_main_loop: reloading config.json from path: /workspaces/hadron/test_output/test_change_pageserver_using_refresh[release-pg16]/repo/endpoints/ep-1/spec.json
PG:2024-09-24 18:53:29.574 GMT [52799] LOG: received SIGHUP, reloading configuration files
PG:2024-09-24 18:53:29.575 GMT [52799] LOG: parameter "neon.extension_server_port" cannot be changed without restarting the server
PG:2024-09-24 18:53:29.575 GMT [52799] LOG: parameter "neon.pageserver_connstring" changed to "postgresql://no_user@localhost:15008"
...
```
Co-authored-by: William Huang <william.huang@databricks.com>
from __future__ import annotations

import asyncio
from typing import TYPE_CHECKING

import pytest
from fixtures.log_helper import log
from fixtures.neon_fixtures import NeonEnvBuilder
from fixtures.remote_storage import RemoteStorageKind

if TYPE_CHECKING:
    from fixtures.neon_fixtures import Endpoint, NeonEnvBuilder


def reconfigure_endpoint(endpoint: Endpoint, pageserver_id: int, use_explicit_reconfigure: bool):
    # It's important that we always update config.json before issuing any reconfigure requests
    # to make sure that PG-initiated config refresh doesn't mess things up by reverting to the old config.
    endpoint.update_pageservers_in_config(pageserver_id=pageserver_id)

    # PG will eventually automatically refresh its configuration if it detects connectivity issues with pageservers.
    # We also allow the test to explicitly request a reconfigure so that the test can be sure that the
    # endpoint is running with the latest configuration.
    #
    # Note that explicit reconfiguration is not required for the system to function or for this test to pass.
    # It is kept for reference as this is how this test used to work before the capability of initiating
    # configuration refreshes was added to compute nodes.
    if use_explicit_reconfigure:
        endpoint.reconfigure(pageserver_id=pageserver_id)


@pytest.mark.parametrize("use_explicit_reconfigure_for_failover", [False, True])
def test_change_pageserver(
    neon_env_builder: NeonEnvBuilder, use_explicit_reconfigure_for_failover: bool
):
    """
    A relatively low level test of reconfiguring a compute's pageserver at runtime. Usually this
    is all done via the storage controller, but this test will disable the storage controller's compute
    notifications, and instead update endpoints directly.
    """
    num_connections = 3

    neon_env_builder.num_pageservers = 2
    neon_env_builder.enable_pageserver_remote_storage(
        remote_storage_kind=RemoteStorageKind.MOCK_S3,
    )
    env = neon_env_builder.init_start()

    env.create_branch("test_change_pageserver")
    endpoint = env.endpoints.create_start("test_change_pageserver")

    # Put this tenant into a dual-attached state
    assert env.get_tenant_pageserver(env.initial_tenant) == env.pageservers[0]
    alt_pageserver_id = env.pageservers[1].id
    env.pageservers[1].tenant_attach(env.initial_tenant)

    pg_conns = [endpoint.connect() for _ in range(num_connections)]
    curs = [pg_conn.cursor() for pg_conn in pg_conns]

    def execute(statement: str):
        for cur in curs:
            cur.execute(statement)

    def fetchone():
        results = [cur.fetchone() for cur in curs]
        assert all(result == results[0] for result in results)
        return results[0]

    # Create table, and insert some rows. Make it big enough that it doesn't fit in
    # shared_buffers, otherwise the SELECT after restart will just return answer
    # from shared_buffers without hitting the page server, which defeats the point
    # of this test.
    curs[0].execute("CREATE TABLE foo (t text)")
    curs[0].execute(
        """
        INSERT INTO foo
        SELECT 'long string to consume some space' || g
        FROM generate_series(1, 100000) g
        """
    )

    # Verify that the table is larger than shared_buffers
    curs[0].execute(
        """
        select setting::int * pg_size_bytes(unit) as shared_buffers, pg_relation_size('foo') as tbl_size
        from pg_settings where name = 'shared_buffers'
        """
    )
    row = curs[0].fetchone()
    assert row is not None
    log.info(f"shared_buffers is {row[0]}, table size {row[1]}")
    assert int(row[0]) < int(row[1])

    execute("SELECT count(*) FROM foo")
    assert fetchone() == (100000,)

    # Reconfigure the endpoint to use the alt pageserver. We issue an explicit reconfigure request here
    # regardless of test mode as this is testing the externally driven reconfiguration scenario, not the
    # compute-initiated reconfiguration scenario upon detecting failures.
    reconfigure_endpoint(endpoint, pageserver_id=alt_pageserver_id, use_explicit_reconfigure=True)

    # Verify that the neon.pageserver_connstring GUC is set to the correct thing
    execute("SELECT setting FROM pg_settings WHERE name='neon.pageserver_connstring'")
    connstring = fetchone()
    assert connstring is not None
    expected_connstring = f"postgresql://no_user:@localhost:{env.pageservers[1].service_port.pg}"
    assert connstring[0] == expected_connstring

    # Stop the old pageserver just to make sure we're reading from the new one
    env.pageservers[0].stop()
    env.storage_controller.node_configure(env.pageservers[0].id, {"availability": "Offline"})

    execute("SELECT count(*) FROM foo")
    assert fetchone() == (100000,)

    # Try failing back, and this time we will stop the current pageserver before reconfiguring
    # the endpoint. Whereas the previous reconfiguration was like a healthy migration, this
    # is more like what happens in an unexpected pageserver failure.
    #
    # Since we're dual-attached, need to tip-off storage controller to treat the one we're
    # about to start as the attached pageserver
    env.pageservers[0].start()
    env.pageservers[1].stop()
    env.storage_controller.node_configure(env.pageservers[1].id, {"availability": "Offline"})
    env.storage_controller.reconcile_until_idle()

    reconfigure_endpoint(
        endpoint,
        pageserver_id=env.pageservers[0].id,
        use_explicit_reconfigure=use_explicit_reconfigure_for_failover,
    )
execute("SELECT count(*) FROM foo")
|
|
assert fetchone() == (100000,)
|
|
|
|
env.pageservers[0].stop()
|
|
env.pageservers[1].start()
|
|
env.storage_controller.node_configure(env.pageservers[0].id, {"availability": "Offline"})
|
|
env.storage_controller.reconcile_until_idle()
|
|
|
|
# Test a (former) bug where a child process spins without updating its connection string
|
|
# by executing a query separately. This query will hang until we issue the reconfigure.
|
|
async def reconfigure_async():
|
|
await asyncio.sleep(
|
|
1
|
|
) # Sleep for 1 second just to make sure we actually started our count(*) query
|
|
reconfigure_endpoint(
|
|
endpoint,
|
|
pageserver_id=env.pageservers[1].id,
|
|
use_explicit_reconfigure=use_explicit_reconfigure_for_failover,
|
|
)
|
|
|
|
def execute_count():
|
|
execute("SELECT count(*) FROM FOO")
|
|
|
|
async def execute_and_reconfigure():
|
|
task_exec = asyncio.to_thread(execute_count)
|
|
task_reconfig = asyncio.create_task(reconfigure_async())
|
|
await asyncio.gather(
|
|
task_exec,
|
|
task_reconfig,
|
|
)
|
|
|
|
asyncio.run(execute_and_reconfigure())
|
|
assert fetchone() == (100000,)
|
|
|
|
# One final check that nothing hangs
|
|
execute("SELECT count(*) FROM foo")
|
|
assert fetchone() == (100000,)
|