Compare commits

..

45 Commits

Author SHA1 Message Date
Ivan Efremov
0467d88f06 Merge pull request #9756 from neondatabase/rc/proxy/2024-11-14
Proxy release 2024-11-14
2024-11-14 09:46:52 +02:00
Tristan Partin
1280b708f1 Improve error handling for NeonAPI fixture
Move error handling to the common request function and add a debug log.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-11-13 20:35:48 -06:00
John Spray
b4e00b8b22 pageserver: refuse to load tenants with suspiciously old indices in old generations (#9719)
## Problem

Historically, if a control component passed a pageserver "generation: 1"
this could be a quick way to corrupt a tenant by loading a historic
index.

Follows https://github.com/neondatabase/neon/pull/9383
Closes #6951 

## Summary of changes

- Introduce a Fatal variant to DownloadError, to enable index downloads
to signal when they have encountered a scary enough situation that we
shouldn't proceed to load the tenant.
- Handle this variant by putting the tenant into a broken state (no
matter which timeline within the tenant reported it)
- Add a test for this case

In the event that this behavior fires when we don't want it to, we have
ways to intervene:
- "Touch" an affected index to update its mtime (download+upload S3
object)
- If this behavior is triggered, it indicates we're attaching in some
old generation, so we should be able to fix that by manually bumping
generation numbers in the storage controller database (this should never
happen, but it's an option if it does)
2024-11-13 18:07:39 +00:00
Ivan Efremov
f5eec194e7 Merge pull request #9674 from neondatabase/rc/proxy/2024-11-07
Proxy release 2024-11-07
2024-11-07 12:07:12 +02:00
Conrad Ludgate
7e00be391d Merge pull request #9558 from neondatabase/rc/proxy/2024-10-29
Auth broker release 2024-10-29
2024-10-29 12:10:50 +00:00
Folke Behrens
d56599df2a Merge pull request #9499 from neondatabase/rc/proxy/2024-10-24
Proxy release 2024-10-24
2024-10-24 10:34:56 +02:00
Folke Behrens
9d9aab3680 Merge pull request #9426 from neondatabase/rc/proxy/2024-10-17
Proxy release 2024-10-17
2024-10-17 12:18:51 +02:00
Folke Behrens
a202b1b5cc Merge pull request #9341 from neondatabase/rc/proxy/2024-10-10
Proxy release 2024-10-10
2024-10-10 09:17:11 +02:00
Folke Behrens
90f731f3b1 Merge pull request #9256 from neondatabase/rc/proxy/2024-10-03
Proxy release 2024-10-03
2024-10-03 11:01:41 +02:00
Conrad Ludgate
7736b748d3 Merge pull request #9159 from neondatabase/rc/proxy/2024-09-26
Proxy release 2024-09-26
2024-09-26 09:22:33 +01:00
Conrad Ludgate
9c23333cb3 Merge pull request #9056 from neondatabase/rc/proxy/2024-09-19
Proxy release 2024-09-19
2024-09-19 10:41:17 +01:00
Conrad Ludgate
66a99009ba Merge pull request #8799 from neondatabase/rc/proxy/2024-08-22
Proxy release 2024-08-22
2024-08-22 10:04:56 +01:00
Conrad Ludgate
5d4c57491f Merge pull request #8723 from neondatabase/rc/proxy/2024-08-14
Proxy release 2024-08-14
2024-08-14 13:05:51 +01:00
Conrad Ludgate
73935ea3a2 Merge pull request #8647 from neondatabase/rc/proxy/2024-08-08
Proxy release 2024-08-08
2024-08-08 15:37:09 +01:00
Conrad Ludgate
32e595d4dd Merge branch 'release-proxy' into rc/proxy/2024-08-08 2024-08-08 13:53:33 +01:00
Conrad Ludgate
b0d69acb07 Merge pull request #8505 from neondatabase/rc/proxy/2024-07-25
Proxy release 2024-07-25
2024-07-25 11:07:19 +01:00
Conrad Ludgate
98355a419a Merge pull request #8351 from neondatabase/rc/proxy/2024-07-11
Proxy release 2024-07-11
2024-07-11 10:40:17 +01:00
Conrad Ludgate
cfb03d6cf0 Merge pull request #8178 from neondatabase/rc/proxy/2024-06-27
Proxy release 2024-06-27
2024-06-27 11:35:30 +01:00
Conrad Ludgate
d81ef3f962 Revert "proxy: update tokio-postgres to allow arbitrary config params (#8076)"
This reverts commit 78d9059fc7.
2024-06-27 09:46:58 +01:00
Conrad Ludgate
5d62c67e75 Merge pull request #8117 from neondatabase/rc/proxy/2024-06-20
Proxy release 2024-06-20
2024-06-20 11:42:35 +01:00
Anna Khanova
53d53d5b1e Merge pull request #7980 from neondatabase/rc/proxy/2024-06-06
Proxy release 2024-06-06
2024-06-06 13:14:40 +02:00
Anna Khanova
29fe6ea47a Merge pull request #7909 from neondatabase/rc/proxy/2024-05-30
Proxy release 2024-05-30
2024-05-30 14:59:41 +02:00
Alexander Bayandin
640327ccb3 Merge pull request #7880 from neondatabase/rc/proxy/2024-05-24
Proxy release 2024-05-24
2024-05-24 18:00:18 +01:00
Anna Khanova
7cf0f6b37e Merge pull request #7853 from neondatabase/rc/proxy/2024-05-23
Proxy release 2024-05-23
2024-05-23 12:09:13 +02:00
Anna Khanova
03c2c569be [proxy] Do not fail after parquet upload error (#7858)
## Problem

If the parquet upload was unsuccessful, it will panic.

## Summary of changes

Write error in logs instead.
2024-05-23 11:44:47 +02:00
Conrad Ludgate
eff6d4538a Merge pull request #7654 from neondatabase/rc/proxy/2024-05-08
Proxy release 2024-05-08
2024-05-08 11:56:20 +01:00
Conrad Ludgate
5ef7782e9c Merge pull request #7649 from neondatabase/rc/proxy/2024-05-08
Proxy release 2024-05-08
2024-05-08 06:54:03 +01:00
Conrad Ludgate
73101db8c4 Merge branch 'release-proxy' into rc/proxy/2024-05-08 2024-05-08 06:43:57 +01:00
Anna Khanova
bccdfc6d39 Merge pull request #7580 from neondatabase/rc/proxy/2024-05-02
Proxy release 2024-05-02
2024-05-02 12:00:01 +02:00
Anna Khanova
99595813bb proxy: keep track on the number of events from redis by type. (#7582)
## Problem

It's unclear what is the distribution of messages, proxy is consuming
from redis.

## Summary of changes

Add counter.
2024-05-02 11:56:19 +02:00
Anna Khanova
fe07b54758 Merge pull request #7507 from neondatabase/rc/proxy/2024-04-25
Proxy release 2024-04-25
2024-04-25 13:50:05 +02:00
Anna Khanova
a42d173e7b proxy: Fix cancellations (#7510)
## Problem

Cancellations were published to the channel, that was never read.

## Summary of changes

Fallback to global redis publishing.
2024-04-25 13:42:25 +02:00
Anna Khanova
e07f689238 Update connect to compute and wake compute retry configs (#7509)
## Problem

## Summary of changes

Decrease waiting time
2024-04-25 13:20:21 +02:00
Conrad Ludgate
7831eddc88 Merge pull request #7417 from neondatabase/rc/proxy/2024-04-18
Proxy release 2024-04-18
2024-04-18 12:03:07 +01:00
Conrad Ludgate
943b1bc80c Merge pull request #7366 from neondatabase/proxy-hotfix
Release proxy (2024-04-11 hotfix)
2024-04-12 10:15:14 +01:00
Conrad Ludgate
95a184e9b7 proxy: fix overloaded db connection closure (#7364)
## Problem

possible for the database connections to not close in time.

## Summary of changes

force the closing of connections if the client has hung up
2024-04-11 23:38:47 +01:00
Conrad Ludgate
3fa17e9d17 Merge pull request #7357 from neondatabase/rc/proxy/2024-04-11
Proxy release 2024-04-11
2024-04-11 11:49:45 +01:00
Anna Khanova
55e0fd9789 Merge pull request #7304 from neondatabase/rc/proxy/2024-04-04
Proxy release 2024-04-04
2024-04-04 12:40:11 +02:00
Anna Khanova
2a88889f44 Merge pull request #7254 from neondatabase/rc/proxy/2024-03-27
Proxy release 2024-03-27
2024-03-27 11:44:09 +01:00
Conrad Ludgate
5bad8126dc Merge pull request #7173 from neondatabase/rc/proxy/2024-03-19
Proxy release 2024-03-19
2024-03-19 12:11:42 +00:00
Anna Khanova
27bc242085 Merge pull request #7119 from neondatabase/rc/proxy/2024-03-14
Proxy release 2024-03-14
2024-03-14 14:57:05 +05:00
Anna Khanova
192b49cc6d Merge branch 'release-proxy' into rc/proxy/2024-03-14 2024-03-14 14:16:36 +05:00
Conrad Ludgate
e1b60f3693 Merge pull request #7041 from neondatabase/rc/proxy/2024-03-07
Proxy release 2024-03-07
2024-03-08 08:19:16 +00:00
Anna Khanova
2804f5323b Merge pull request #6997 from neondatabase/rc/proxy/2024-03-04
Proxy release 2024-03-04
2024-03-04 17:36:11 +04:00
Anna Khanova
676adc6b32 Merge branch 'release-proxy' into rc/proxy/2024-03-04 2024-03-04 16:41:46 +04:00
5 changed files with 93 additions and 25 deletions

View File

@@ -15,6 +15,9 @@ pub enum DownloadError {
///
/// Concurrency control is not timed within timeout.
Timeout,
/// Some integrity/consistency check failed during download. This is used during
/// timeline loads to cancel the load of a tenant if some timeline detects fatal corruption.
Fatal(String),
/// The file was found in the remote storage, but the download failed.
Other(anyhow::Error),
}
@@ -29,6 +32,7 @@ impl std::fmt::Display for DownloadError {
DownloadError::Unmodified => write!(f, "File was not modified"),
DownloadError::Cancelled => write!(f, "Cancelled, shutting down"),
DownloadError::Timeout => write!(f, "timeout"),
DownloadError::Fatal(why) => write!(f, "Fatal read error: {why}"),
DownloadError::Other(e) => write!(f, "Failed to download a remote file: {e:?}"),
}
}
@@ -41,7 +45,7 @@ impl DownloadError {
pub fn is_permanent(&self) -> bool {
use DownloadError::*;
match self {
BadInput(_) | NotFound | Unmodified | Cancelled => true,
BadInput(_) | NotFound | Unmodified | Fatal(_) | Cancelled => true,
Timeout | Other(_) => false,
}
}

View File

@@ -1433,6 +1433,12 @@ impl Tenant {
info!(%timeline_id, "index_part not found on remote");
continue;
}
Err(DownloadError::Fatal(why)) => {
// If, while loading one remote timeline, we saw an indication that our generation
// number is likely invalid, then we should not load the whole tenant.
error!(%timeline_id, "Fatal error loading timeline: {why}");
anyhow::bail!(why.to_string());
}
Err(e) => {
// Some (possibly ephemeral) error happened during index_part download.
// Pretend the timeline exists to not delete the timeline directory,

View File

@@ -574,12 +574,18 @@ impl RemoteTimelineClient {
if latest_index_generation > index_generation {
// Unexpected! Why are we loading such an old index if a more recent one exists?
tracing::warn!(
// We will refuse to proceed, as there is no reasonable scenario where this should happen, but
// there _is_ a clear bug/corruption scenario where it would happen (controller sets the generation
// backwards).
tracing::error!(
?index_generation,
?latest_index_generation,
?latest_index_mtime,
"Found a newer index while loading an old one"
);
return Err(DownloadError::Fatal(
"Index age exceeds threshold and a newer index exists".into(),
));
}
}

View File

@@ -5,6 +5,8 @@ from typing import TYPE_CHECKING, cast, final
import requests
from fixtures.log_helper import log
if TYPE_CHECKING:
from typing import Any, Literal, Optional
@@ -30,7 +32,11 @@ class NeonAPI:
kwargs["headers"] = {}
kwargs["headers"]["Authorization"] = f"Bearer {self.__neon_api_key}"
return requests.request(method, f"{self.__neon_api_base_url}{endpoint}", **kwargs)
resp = requests.request(method, f"{self.__neon_api_base_url}{endpoint}", **kwargs)
log.debug("%s %s returned a %d: %s", method, endpoint, resp.status_code, resp.text)
resp.raise_for_status()
return resp
def create_project(
self,
@@ -66,8 +72,6 @@ class NeonAPI:
json=data,
)
assert resp.status_code == 201
return cast("dict[str, Any]", resp.json())
def get_project_details(self, project_id: str) -> dict[str, Any]:
@@ -79,7 +83,7 @@ class NeonAPI:
"Content-Type": "application/json",
},
)
assert resp.status_code == 200
return cast("dict[str, Any]", resp.json())
def delete_project(
@@ -95,8 +99,6 @@ class NeonAPI:
},
)
assert resp.status_code == 200
return cast("dict[str, Any]", resp.json())
def start_endpoint(
@@ -112,8 +114,6 @@ class NeonAPI:
},
)
assert resp.status_code == 200
return cast("dict[str, Any]", resp.json())
def suspend_endpoint(
@@ -129,8 +129,6 @@ class NeonAPI:
},
)
assert resp.status_code == 200
return cast("dict[str, Any]", resp.json())
def restart_endpoint(
@@ -146,8 +144,6 @@ class NeonAPI:
},
)
assert resp.status_code == 200
return cast("dict[str, Any]", resp.json())
def create_endpoint(
@@ -178,8 +174,6 @@ class NeonAPI:
json=data,
)
assert resp.status_code == 201
return cast("dict[str, Any]", resp.json())
def get_connection_uri(
@@ -206,8 +200,6 @@ class NeonAPI:
},
)
assert resp.status_code == 200
return cast("dict[str, Any]", resp.json())
def get_branches(self, project_id: str) -> dict[str, Any]:
@@ -219,8 +211,6 @@ class NeonAPI:
},
)
assert resp.status_code == 200
return cast("dict[str, Any]", resp.json())
def get_endpoints(self, project_id: str) -> dict[str, Any]:
@@ -232,8 +222,6 @@ class NeonAPI:
},
)
assert resp.status_code == 200
return cast("dict[str, Any]", resp.json())
def get_operations(self, project_id: str) -> dict[str, Any]:
@@ -246,8 +234,6 @@ class NeonAPI:
},
)
assert resp.status_code == 200
return cast("dict[str, Any]", resp.json())
def wait_for_operation_to_finish(self, project_id: str):

View File

@@ -35,9 +35,10 @@ from fixtures.pageserver.utils import (
wait_for_upload,
)
from fixtures.remote_storage import (
LocalFsStorage,
RemoteStorageKind,
)
from fixtures.utils import wait_until
from fixtures.utils import run_only_on_default_postgres, wait_until
from fixtures.workload import Workload
if TYPE_CHECKING:
@@ -728,3 +729,68 @@ def test_upgrade_generationless_local_file_paths(
)
# We should download into the same local path we started with
assert os.path.exists(victim_path)
@run_only_on_default_postgres("Only tests index logic")
def test_old_index_time_threshold(
neon_env_builder: NeonEnvBuilder,
):
"""
Exercise pageserver's detection of trying to load an ancient non-latest index.
(see https://github.com/neondatabase/neon/issues/6951)
"""
# Run with local_fs because we will interfere with mtimes by local filesystem access
neon_env_builder.enable_pageserver_remote_storage(RemoteStorageKind.LOCAL_FS)
env = neon_env_builder.init_start()
tenant_id = env.initial_tenant
timeline_id = env.initial_timeline
workload = Workload(env, tenant_id, timeline_id)
workload.init()
workload.write_rows(32)
# Remember generation 1's index path
assert isinstance(env.pageserver_remote_storage, LocalFsStorage)
index_path = env.pageserver_remote_storage.index_path(tenant_id, timeline_id)
# Increment generation by detaching+attaching, and write+flush some data to get a new remote index
env.storage_controller.tenant_policy_update(tenant_id, {"placement": "Detached"})
env.storage_controller.tenant_policy_update(tenant_id, {"placement": {"Attached": 0}})
env.storage_controller.reconcile_until_idle()
workload.churn_rows(32)
# A new index should have been written
assert env.pageserver_remote_storage.index_path(tenant_id, timeline_id) != index_path
# Hack the mtime on the generation 1 index
log.info(f"Setting old mtime on {index_path}")
os.utime(index_path, times=(time.time(), time.time() - 30 * 24 * 3600))
env.pageserver.allowed_errors.extend(
[
".*Found a newer index while loading an old one.*",
".*Index age exceeds threshold and a newer index exists.*",
]
)
# Detach from storage controller + attach in an old generation directly on the pageserver.
workload.stop()
env.storage_controller.tenant_policy_update(tenant_id, {"placement": "Detached"})
env.storage_controller.reconcile_until_idle()
env.storage_controller.tenant_policy_update(tenant_id, {"scheduling": "Stop"})
env.storage_controller.allowed_errors.append(".*Scheduling is disabled by policy")
# The controller would not do this (attach in an old generation): we are doing it to simulate
# a hypothetical profound bug in the controller.
env.pageserver.http_client().tenant_location_conf(
tenant_id, {"generation": 1, "mode": "AttachedSingle", "tenant_conf": {}}
)
# The pageserver should react to this situation by refusing to attach the tenant and putting
# it into Broken state
env.pageserver.allowed_errors.append(".*tenant is broken.*")
with pytest.raises(
PageserverApiException,
match="tenant is broken: Index age exceeds threshold and a newer index exists",
):
env.pageserver.http_client().timeline_detail(tenant_id, timeline_id)