Mirror of https://github.com/neondatabase/neon.git, synced 2026-01-08 22:12:56 +00:00
# TLDR

All changes are no-ops except (1) publishing additional metrics and (2) problem VI.

## Problem I

It has come to my attention that the Neon Storage Controller doesn't correctly update its "observed" state of tenants previously associated with pageservers (PSs) that have come back up after a local data loss. It still thinks the old tenants are attached to those pageservers and asks no further questions. The pageserver has enough information from the reattach request/response to tell that something is wrong, but it doesn't do anything about it either. We need to detect this situation in production while I work on a fix. (I think there is just some misunderstanding about how Neon manages its pageserver deployments, which got me confused about the invariants.)

## Summary of changes I

Added a `pageserver_local_data_loss_suspected` gauge metric that is set to 1 if we detect a problematic situation from the reattach response. The problematic situation is when the PS doesn't have any local tenants but receives a reattach response containing tenants. We can set up an alert on this metric; it should fire whenever the metric reports a non-zero value. Also added an HTTP PUT `http://pageserver/hadron-internal/reset_alert_gauges` API on the pageserver that can be used to reset the gauge and the alert once we manually rectify the situation (by restarting the HCC).

## Problem II

Azure upload is 3x slower than AWS, which translates into 3x slower ingestion. The root cause is that Azure uploads in the pageserver are much slower => higher flush latency => the disk-consistent LSN lags further behind => more backpressure.

## Summary of changes II

Use the Azure put_block API to upload a 1 GB layer file as 8 blocks in parallel. The put_block block size defaults to 128 MB in the Azure config. To minimize Neon changes, the upload function passes the layer file path to the Azure upload code through the storage metadata. This allows the Azure put_block path to use FileChunkStreamRead to stream-read one partition of the file instead of loading all of the file data into memory and splitting it into 8 chunks of 128 MB. (A minimal sketch of this block-upload pattern appears after Summary of changes IV below.)

## How is this tested? II

1. The Rust `test_real_azure` test covers the put_block change.
2. I deployed the change in the Azure dev environment and saw flush latency drop from ~30 seconds to ~10 seconds.
3. I also ran a number of stress tests using sqlsmith and 100 GB TPC-DS runs.

## Problem III

Currently Neon limits the number of concurrent compaction tasks to 3/4 * CPU cores. This limits overall compaction throughput and can easily cause head-of-line blocking when a few large tenants are compacting.

## Summary of changes III

This PR raises the limit on compaction tasks to `BG_TASKS_PER_THREAD` (default 4) * CPU cores. Note that `CONCURRENT_BACKGROUND_TASKS` also limits some other tasks (`logical_size_calculation` and layer eviction), but compaction should be the most frequent and time-consuming one.

## Summary of changes IV

This PR adds the following pageserver metrics:

1. `pageserver_disk_usage_based_eviction_evicted_bytes_total`: captures the total number of bytes evicted. It's more straightforward to see bytes directly instead of layer counts.
2. `pageserver_active_storage_operations_count`: captures the active storage operations, e.g., flush, L0 compaction, image creation, etc. Visualizing these active operations gives a better idea of what pageservers are spending background cycles on.
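The put_block change itself lives in the pageserver's Rust upload path; purely as an illustration of the block-upload pattern described in Summary of changes II, here is a minimal Python sketch using the azure-storage-blob SDK. The function name `upload_in_blocks`, the 8-worker thread pool, and reading the layer file from a local path are illustrative assumptions, not the PR's code.

```python
from concurrent.futures import ThreadPoolExecutor
from base64 import b64encode
import os

from azure.storage.blob import BlobBlock, BlobClient

BLOCK_SIZE = 128 * 1024 * 1024  # the 128 MB default block size described above


def upload_in_blocks(blob: BlobClient, path: str, block_size: int = BLOCK_SIZE) -> None:
    """Upload `path` as a block blob by staging fixed-size blocks in parallel."""
    offsets = list(range(0, os.path.getsize(path), block_size))
    block_ids = [b64encode(f"{i:08d}".encode()).decode() for i in range(len(offsets))]

    def stage(block_id: str, offset: int) -> None:
        # Stream only this block's byte range from the file instead of loading
        # the whole layer file into memory (the role FileChunkStreamRead plays).
        with open(path, "rb") as f:
            f.seek(offset)
            blob.stage_block(block_id=block_id, data=f.read(block_size))

    # For a 1 GB layer file this stages 8 blocks of 128 MB concurrently.
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(stage, block_ids, offsets))

    # Commit the staged blocks, in order, to materialize the blob.
    blob.commit_block_list([BlobBlock(block_id=b) for b in block_ids])
```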
## Summary of changes V

When investigating data corruption, it's useful to search for the base image and all WAL records of a page up to an LSN, i.e., a breakdown of a GetPage@LSN request. This PR implements this functionality with two tools:

1. Extended `pagectl` with a new command to search the layer files for a given key up to a given LSN from the `index_part.json` file. The output can be used to download the files from S3 and then search the file contents using the second tool. Example usage:

   ```
   cargo run --bin pagectl index-part search --tenant-id 09b99ea3239bbb3b2d883a59f087659d --timeline-id 7bedf4a6995baff7c0421ff9aebbcdab --path ~/Downloads/corruption/index_part.json-0000000c-formatted --key 000000067F000080140000802100000D61BD --lsn 70C/BF3D61D8
   ```

   Example output:

   ```
   tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F0000801400008028000002FEFF__000007089F0B5381-0000070C7679EEB9-0000000c
   tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000000000000000000000000000000000-000000067F0000801400008028000002F3F1__000006DD95B6F609-000006E2BA14C369-0000000c
   tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F000080140000802100001B0973__000006D33429F539-000006DD95B6F609-0000000c
   tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000164D81__000006C6343B2D31-000006D33429F539-0000000b
   tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F0000801400008021000017687B__000006BA344FA7F1-000006C6343B2D31-0000000b
   tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000165BAB__000006AD34613D19-000006BA344FA7F1-0000000b
   tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F0000801400000B180000000002-000000067F00008014000080210000137A39__0000069F34773461-000006AD34613D19-0000000b
   tenants/09b99ea3239bbb3b2d883a59f087659d-0304/timelines/7bedf4a6995baff7c0421ff9aebbcdab/000000067F000080140000802100000D4000-000000067F000080140000802100000F0000__0000069F34773460-0000000b
   ```

2. Added a unit test to search the layer file contents. It's not implemented as part of `pagectl` because it depends on some test harness code, which can only be used by unit tests. Example usage:

   ```
   cargo test --package pageserver --lib -- tenant::debug::test_search_key --exact --nocapture -- --tenant-id 09b99ea3239bbb3b2d883a59f087659d --timeline-id 7bedf4a6995baff7c0421ff9aebbcdab --data-dir /Users/chen.luo/Downloads/corruption --key 000000067F000080140000802100000D61BD --lsn 70C/BF3D61D8
   ```

   Example output:

   ```
   # image omitted for brevity
   delta: 69F/769D8180: will_init: false, "OgAAALGkuwXwYp12nwYAAECGAAASIqLHAAAAAH8GAAAUgAAAIYAAAL1hDQD/DLGkuwUDAAAAEAAWAA=="
   delta: 69F/769CB6D8: will_init: false, "PQAAALGkuwXotZx2nwYAABAJAAAFk7tpACAGAH8GAAAUgAAAIYAAAL1hDQD/CQUAEAASALExuwUBAAAAAA=="
   ```
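For reference, here is a hedged Python sketch of the coarse filter the `index-part search` command appears to apply, reconstructed only from the layer file names in the example output above: delta layers encode `{key_start}-{key_end}__{lsn_start}-{lsn_end}-{generation}` and image layers `{key_start}-{key_end}__{lsn}-{generation}`. The helper names, the end-exclusive key range, and the `layer_names.txt` input are illustrative assumptions, not the tool's implementation.

```python
# Hypothetical coarse filter: keep layers whose key range covers the key and
# whose starting LSN is at or below the requested LSN. (Judging by the output,
# the real command also stops at the newest image layer at or below the LSN.)
def parse_lsn(lsn: str) -> int:
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def layer_matches(layer_name: str, key: str, lsn: str) -> bool:
    keys, lsns = layer_name.rsplit("/", 1)[-1].split("__")
    key_start, key_end = keys.split("-")
    lsn_start = int(lsns.split("-")[0], 16)
    return key_start <= key.upper() < key_end and lsn_start <= parse_lsn(lsn)


# The key/LSN from the pagectl invocation above, applied to a local listing
# of layer names (layer_names.txt is a hypothetical input file).
hits = [
    name
    for name in open("layer_names.txt").read().split()
    if layer_matches(name, "000000067F000080140000802100000D61BD", "70C/BF3D61D8")
]
```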
## Problem VI

Currently, when the page service resolves shards from page numbers, it doesn't fully handle the case where a shard split happens mid-request. This leads to query failures during a tenant split, for both the commit and abort cases (mostly abort).

## Summary of changes VI

This PR adds retry logic in `Cache::get()` to deal with shard resolution errors more gracefully: it clears the cache and retries instead of failing the query immediately. It also reduces the internal timeout to make retries faster. The PR additionally fixes an obvious bug in `TenantManager::resolve_attached_shard` where the code caches the computed shard number but forgets to recompute it when the shard count changes. (A minimal sketch of the retry pattern appears after the co-author list.)

---------

Co-authored-by: William Huang <william.huang@databricks.com>
Co-authored-by: Haoyu Huang <haoyu.huang@databricks.com>
Co-authored-by: Chen Luo <chen.luo@databricks.com>
Co-authored-by: Vlad Lazar <vlad.lazar@databricks.com>
Co-authored-by: Vlad Lazar <vlad@neon.tech>
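The actual change is in the Rust page service's `Cache::get()`; as a language-neutral illustration of the pattern (drop the cached shard mapping and retry rather than failing the query), here is a minimal Python sketch. `MAX_ATTEMPTS`, `RETRY_DELAY_S`, `ShardResolutionError`, and the `resolve` callback are illustrative assumptions.

```python
import time

MAX_ATTEMPTS = 5     # illustrative; the PR does not specify a retry count here
RETRY_DELAY_S = 0.1  # illustrative backoff between attempts


class ShardResolutionError(Exception):
    """Raised when a page cannot be mapped to a tenant shard (e.g. mid-split)."""


def get_shard_with_retry(cache: dict, key, resolve):
    """Resolve the shard for `key`, clearing the cached mapping and retrying on
    resolution errors instead of failing the query immediately."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            if key not in cache:
                # Recompute the mapping, e.g. after the shard count changed.
                cache[key] = resolve(key)
            return cache[key]
        except ShardResolutionError:
            cache.pop(key, None)  # drop the stale entry before retrying
            if attempt == MAX_ATTEMPTS - 1:
                raise
            time.sleep(RETRY_DELAY_S)
```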
192 lines · 9.8 KiB · Python · Executable File
#! /usr/bin/env python3

from __future__ import annotations

import argparse
import re
import sys
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from collections.abc import Iterable


def scan_pageserver_log_for_errors(
    input: Iterable[str], allowed_errors: list[str]
) -> list[tuple[int, str]]:
    error_or_warn = re.compile(r"\s(ERROR|WARN)")
    errors = []
    for lineno, line in enumerate(input, start=1):
        if len(line) == 0:
            continue

        if error_or_warn.search(line):
            # Is this a torn log line? This happens when force-killing a process and restarting
            # Example: "2023-10-25T09:38:31.752314Z WARN deletion executo2023-10-25T09:38:31.875947Z INFO version: git-env:0f9452f76e8ccdfc88291bccb3f53e3016f40192"
            if re.match("\\d{4}-\\d{2}-\\d{2}T.+\\d{4}-\\d{2}-\\d{2}T.+INFO version.+", line):
                continue

            # It's an ERROR or WARN. Is it in the allow-list?
            for a in allowed_errors:
                try:
                    if re.match(a, line):
                        break
                # We can switch `re.error` with `re.PatternError` after 3.13
                # https://docs.python.org/3/library/re.html#re.PatternError
                except re.error:
                    print(f"Invalid regex: '{a}'", file=sys.stderr)
                    raise
            else:
                errors.append((lineno, line))
    return errors


DEFAULT_PAGESERVER_ALLOWED_ERRORS = (
    # All tests print these, when starting up or shutting down
    ".*wal receiver task finished with an error: walreceiver connection handling failure.*",
    ".*Shutdown task error: walreceiver connection handling failure.*",
    ".*wal_connection_manager.*tcp connect error: Connection refused.*",
    ".*query handler for .* failed: Socket IO error: Connection reset by peer.*",
    ".*serving compute connection task.*exited with error: Postgres connection error.*",
    ".*serving compute connection task.*exited with error: Connection reset by peer.*",
    ".*serving compute connection task.*exited with error: Postgres query error.*",
    ".*Connection aborted: error communicating with the server: Transport endpoint is not connected.*",
    # FIXME: replication patch for tokio_postgres regards any but CopyDone/CopyData message in CopyBoth stream as unexpected
    ".*Connection aborted: unexpected message from server*",
    ".*kill_and_wait_impl.*: wait successful.*",
    ".*query handler for 'pagestream.*failed: Broken pipe.*",  # pageserver notices compute shut down
    ".*query handler for 'pagestream.*failed: Connection reset by peer.*",  # pageserver notices compute shut down
    # safekeeper connection can fail with this, in the window between timeline creation
    # and streaming start
    ".*Failed to process query for timeline .*: state uninitialized, no data to read.*",
    # Tests related to authentication and authorization print these
    ".*Error processing HTTP request: Forbidden",
    # intentional failpoints
    ".*failpoint ",
    # Tenant::delete_timeline() can cause any of the four following errors.
    # FIXME: we shouldn't be considering it an error: https://github.com/neondatabase/neon/issues/2946
    ".*could not flush frozen layer.*queue is in state Stopped",  # when schedule layer upload fails because queued got closed before compaction got killed
    ".*wait for layer upload ops to complete.*",  # .*Caused by:.*wait_completion aborted because upload queue was stopped
    ".*gc_loop.*Gc failed, retrying in.*timeline is Stopping",  # When gc checks timeline state after acquiring layer_removal_cs
    ".*gc_loop.*Gc failed, retrying in.*: Cannot run GC iteration on inactive tenant",  # Tenant::gc precondition
    ".*compaction_loop.*Compaction failed.*, retrying in.*timeline or pageserver is shutting down",  # When compaction checks timeline state after acquiring layer_removal_cs
    ".*query handler for 'pagestream.*failed: Timeline .* was not found",  # postgres reconnects while timeline_delete doesn't hold the tenant's timelines.lock()
    ".*query handler for 'pagestream.*failed: Timeline .* is not active",  # timeline delete in progress
    ".*task iteration took longer than the configured period.*",
    # these can happen anytime we do compactions from background task and shutdown pageserver
    ".*could not compact.*cancelled.*",
    # this is expected given our collaborative shutdown approach for the UploadQueue
    ".*Compaction failed.*, retrying in .*: Other\\(queue is in state Stopped.*",
    ".*Compaction failed.*, retrying in .*: ShuttingDown",
    ".*Compaction failed.*, retrying in .*: Other\\(timeline shutting down.*",
    # Pageserver timeline deletion should be polled until it gets 404, so ignore it globally
    ".*Error processing HTTP request: NotFound: Timeline .* was not found",
    ".*took more than expected to complete.*",
    # these can happen during shutdown, but it should not be a reason to fail a test
    ".*completed, took longer than expected.*",
    # AWS S3 may emit 500 errors for keys in a DeleteObjects response: we retry these
    # and it is not a failure of our code when it happens.
    ".*DeleteObjects.*We encountered an internal error. Please try again.*",
    # During shutdown, DownloadError::Cancelled may be logged as an error. Cleaning this
    # up is tracked in https://github.com/neondatabase/neon/issues/6096
    ".*Cancelled, shutting down.*",
    # Open layers are only rolled at Lsn boundaries to avoid name clashes.
    # Hence, we can overshoot the soft limit set by checkpoint distance.
    # This is especially pronounced in tests that set small checkpoint
    # distances.
    ".*Flushed oversized open layer with size.*",
    # During teardown, we stop the storage controller before the pageservers, so pageservers
    # can experience connection errors doing background deletion queue work.
    ".*WARN deletion backend:.* storage controller upcall failed, will retry.*error sending request.*",
    # Can happen when the pageserver starts faster than the storage controller
    ".*WARN init_tenant_mgr:.* storage controller upcall failed, will retry.*error sending request.*",
    # Can happen when the test shuts down the storage controller while it is calling the utilization API
    ".*WARN.*path=/v1/utilization .*request was dropped before completing",
    # Can happen during shutdown
    ".*scheduling deletion on drop failed: queue is in state Stopped.*",
    ".*scheduling deletion on drop failed: queue is shutting down.*",
    # L0 flush backpressure delays are expected under heavy ingest load. We want to exercise
    # this backpressure in tests.
    ".*delaying layer flush by \\S+ for compaction backpressure.*",
    ".*stalling layer flushes for compaction backpressure.*",
    ".*layer roll waiting for flush due to compaction backpressure.*",
    ".*BatchSpanProcessor.*",
    # Can happen in tests that purposely wipe pageserver "local disk" data.
    ".*Local data loss suspected.*",
    # Too many frozen layers error is normal during intensive benchmarks
    ".*too many frozen layers.*",
    # Transient errors when resolving tenant shards by page service
    ".*Fail to resolve tenant shard in attempt.*",
    # Expected warnings when pageserver has not refreshed GC info yet
    ".*pitr LSN/interval not found, skipping force image creation LSN calculation.*",
    ".*No broker updates received for a while.*",
    *(
        [
            r".*your platform is not a supported production platform, ignoing request for O_DIRECT; this could hide alignment bugs.*"
        ]
        if sys.platform != "linux"
        else []
    ),
)


DEFAULT_STORAGE_CONTROLLER_ALLOWED_ERRORS = [
    # Many tests will take pageservers offline, resulting in log warnings on the controller
    # failing to connect to them.
    ".*Call to node.*management API.*failed.*receive body.*",
    ".*Call to node.*management API.*failed.*ReceiveBody.*",
    ".*Call to node.*management API.*failed.*Timeout.*",
    ".*Failed to update node .+ after heartbeat round.*error sending request for url.*",
    ".*background_reconcile: failed to fetch top tenants:.*client error \\(Connect\\).*",
    # Many tests will take safekeepers offline
    ".*Call to safekeeper.*management API.*failed.*receive body.*",
    ".*Call to safekeeper.*management API.*failed.*ReceiveBody.*",
    ".*Call to safekeeper.*management API.*failed.*Timeout.*",
    # Many tests will start up with a node offline
    ".*startup_reconcile: Could not scan node.*",
    # Tests run in dev mode
    ".*Starting in dev mode.*",
    # Tests that stop endpoints & use the storage controller's neon_local notification
    # mechanism might fail (neon_local's stopping of an endpoint isn't atomic wrt the storage
    # controller's attempts to notify the endpoint).
    ".*reconciler.*neon_local notification hook failed.*",
    ".*reconciler.*neon_local error.*",
    # Tenant rate limits may fire in tests that submit lots of API requests.
    ".*tenant \\S+ is rate limited.*",
]


def _check_allowed_errors(input):
    allowed_errors: list[str] = list(DEFAULT_PAGESERVER_ALLOWED_ERRORS)

    # add any test specifics here; cli parsing is not provided for the
    # difficulty of copypasting regexes as arguments without any quoting
    # errors.

    errors = scan_pageserver_log_for_errors(input, allowed_errors)

    for lineno, error in errors:
        print(f"-:{lineno}: {error.strip()}", file=sys.stderr)

    print(f"\n{len(errors)} not allowed errors", file=sys.stderr)

    return errors


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="check input against pageserver global allowed_errors"
    )
    parser.add_argument(
        "-i",
        "--input",
        type=argparse.FileType("r"),
        help="Pageserver logs file. Use '-' for stdin.",
        required=True,
    )

    args = parser.parse_args()
    errors = _check_allowed_errors(args.input)

    sys.exit(len(errors) > 0)