rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-17 10:22:56 +00:00

Author	SHA1	Message	Date
Konstantin Knizhnik	4dbb469539	Remove AuxFileStore from test_pgstat.py	2024-12-04 10:19:30 +02:00
Konstantin Knizhnik	3114a9f992	Use new version of test_wal_recevier test	2024-12-04 09:51:48 +02:00
Konstantin Knizhnik	668845fcb7	Yet another attempt to make test_wal_receiver test pass	2024-12-04 09:50:09 +02:00
Konstantin Knizhnik	0f63037345	Increase timeout to 10 seconds	2024-12-04 09:50:09 +02:00
Konstantin Knizhnik	dd826c4c81	Increase wait LSN timeout to 2 seconds	2024-12-04 09:50:09 +02:00
Konstantin Knizhnik	8782018438	Add SIZE_OF_XLOG_RECORD_DATA_HEADER_SHORT	2024-12-04 09:50:09 +02:00
Konstantin Knizhnik	f4196983d2	Undo changes test_import.py	2024-12-04 09:50:09 +02:00
Konstantin Knizhnik	293f22056f	Undo debug trace	2024-12-04 09:50:09 +02:00
Konstantin Knizhnik	cfcb197d85	Add calculate_walrecord_end_lsn function	2024-12-04 09:50:08 +02:00
Konstantin Knizhnik	fa3dc91fb5	Fix is_aux_file_key	2024-12-04 09:50:07 +02:00
Konstantin Knizhnik	0841fb9b7b	Do not import AUX files	2024-12-04 09:48:55 +02:00
Konstantin Knizhnik	b5802abab9	Disable importing AX files	2024-12-04 09:44:49 +02:00
Konstantin Knizhnik	5d12e5a72d	Update pageserver/src/basebackup.rs Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2024-12-04 09:44:49 +02:00
Konstantin Knizhnik	3179b9bed2	Add test that pgstat information is dropped in case of abnormal termination	2024-12-04 09:44:49 +02:00
Konstantin Knizhnik	e48a8415ff	Add comments	2024-12-04 09:44:49 +02:00
Konstantin Knizhnik	fc041a213b	Update test_runner/regress/test_pgstat.py Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2024-12-04 09:44:48 +02:00
Konstantin Knizhnik	3890f5846c	Update pageserver/src/basebackup.rs Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2024-12-04 09:44:48 +02:00
Konstantin Knizhnik	585672f5fc	Update test_runner/regress/test_pgstat.py Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2024-12-04 09:44:48 +02:00
Konstantin Knizhnik	92ae912ebe	Handle lack of checkpoint in basebackup	2024-12-04 09:44:48 +02:00
Konstantin Knizhnik	953ad21104	Make clippy happy	2024-12-04 09:44:27 +02:00
Konstantin Knizhnik	0fcf20075a	Drop pg_stat data in case of abnormal Postgres termination	2024-12-04 09:43:57 +02:00
Konstantin Knizhnik	c4affc7859	Add neon_pgstat_file_size_limit GUC to limit size of AUX file for pg_stat	2024-12-04 09:43:55 +02:00
Konstantin Knizhnik	368fc2c6c3	Do not import AUX files while initializing fresh timeline from pgdatadir	2024-12-04 09:40:47 +02:00
Konstantin Knizhnik	dfed7029e1	Make clippy happy	2024-12-04 09:40:47 +02:00
Konstantin Knizhnik	6c0e44be4c	Fix bug in AUX v1 delta optimization	2024-12-04 09:40:47 +02:00
Konstantin Knizhnik	dc780c9e1e	Fix bug in including AUX files in basebacklup	2024-12-04 09:40:47 +02:00
Konstantin Knizhnik	45dcc66bf4	Fix python fiormating	2024-12-04 09:40:46 +02:00
Konstantin Knizhnik	d91c5c3523	Add AUX_DIR_PG_STAT	2024-12-04 09:40:46 +02:00
Konstantin Knizhnik	b79e2062bd	Add test for restoring pgstat file	2024-12-04 09:40:46 +02:00
Konstantin Knizhnik	ffc24d6d8d	Persist pg_stat informartion in PS	2024-12-04 09:40:44 +02:00
Christian Schwarz	8d93d02c2f	page_service: enable batching in Rust & Python Tests + Python benchmarks (#9993 ) This is the first step towards batching rollout. Refs - rollout plan: https://github.com/neondatabase/cloud/issues/20620 - task https://github.com/neondatabase/neon/issues/9377 - uber-epic: https://github.com/neondatabase/neon/issues/9376	2024-12-04 00:07:49 +00:00
Alexander Bayandin	023821a80c	test_page_service_batching: fix non-numeric metrics (#9998 ) ## Problem ``` 2024-12-03T15:42:46.5978335Z + poetry run python /__w/neon/neon/scripts/ingest_perf_test_result.py --ingest /__w/neon/neon/test_runner/perf-report-local 2024-12-03T15:42:49.5325077Z Traceback (most recent call last): 2024-12-03T15:42:49.5325603Z File "/__w/neon/neon/scripts/ingest_perf_test_result.py", line 165, in <module> 2024-12-03T15:42:49.5326029Z main() 2024-12-03T15:42:49.5326316Z File "/__w/neon/neon/scripts/ingest_perf_test_result.py", line 155, in main 2024-12-03T15:42:49.5326739Z ingested = ingest_perf_test_result(cur, item, recorded_at_timestamp) 2024-12-03T15:42:49.5327488Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2024-12-03T15:42:49.5327914Z File "/__w/neon/neon/scripts/ingest_perf_test_result.py", line 99, in ingest_perf_test_result 2024-12-03T15:42:49.5328321Z psycopg2.extras.execute_values( 2024-12-03T15:42:49.5328940Z File "/github/home/.cache/pypoetry/virtualenvs/non-package-mode-_pxWMzVK-py3.11/lib/python3.11/site-packages/psycopg2/extras.py", line 1299, in execute_values 2024-12-03T15:42:49.5335618Z cur.execute(b''.join(parts)) 2024-12-03T15:42:49.5335967Z psycopg2.errors.InvalidTextRepresentation: invalid input syntax for type numeric: "concurrent-futures" 2024-12-03T15:42:49.5336287Z LINE 57: 'concurrent-futures', 2024-12-03T15:42:49.5336462Z ^ ``` ## Summary of changes - `test_page_service_batching`: save non-numeric params as `labels` - Add a runtime check that `metric_value` is NUMERIC	2024-12-03 22:46:18 +00:00
Christian Schwarz	944c1adc4c	tests & benchmarks: unify the way we customize the default tenant config (#9992 ) Before this PR, some override callbacks used `.default()`, others used `.setdefault()`. As of this PR, all callbacks use `.setdefault()` which I think is least prone to failure. Aligning on a single way will set the right example for future tests that need such customization. The `test_pageserver_getpage_throttle.py` technically is a change in behavior: before, it replaced the `tenant_config` field, now it just configures the throttle. This is what I believe is intended anyway.	2024-12-03 22:07:03 +00:00
Arpad Müller	ca85f364ba	Support tenant manifests in the scrubber (#9942 ) Support tenant manifests in the storage scrubber: * list the manifests, order them by generation * delete all manifests except for the two most recent generations * for the latest manifest: try parsing it. I've tested this patch by running the against a staging bucket and it successfully deleted stuff (and avoided deleting the latest two generations). In follow-up work, we might want to also check some invariants of the manifest, as mentioned in #8088. Part of #9386 Part of #8088 --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-12-03 20:39:10 +00:00
Conrad Ludgate	9ef0662a42	chore(proxy): enforce single host+port (#9995 ) proxy doesn't ever provide multiple hosts/ports, so this code adds a lot of complexity of error handling for no good reason. (stacked on #9990)	2024-12-03 20:00:14 +00:00
Alexey Immoreev	3baef0bca3	Improvement: add console redirect timeout warning (#9985 ) ## Problem There is no information on session being cancelled in 2 minutes at the moment ## Summary of changes The timeout being logged for the user	2024-12-03 18:59:44 +00:00
Erik Grinaker	f312c6571f	pageserver: respond to multiple shutdown signals (#9982 ) ## Problem The Pageserver signal handler would only respond to a single signal and initiate shutdown. Subsequent signals were ignored. This meant that a `SIGQUIT` sent after a `SIGTERM` had no effect (e.g. in the case of a slow or stalled shutdown). The `test_runner` uses this to force shutdown if graceful shutdown is slow. Touches #9740. ## Summary of changes Keep responding to signals after the initial shutdown signal has been received. Arguably, the `test_runner` should also use `SIGKILL` rather than `SIGQUIT` in this case, but it seems reasonable to respond to `SIGQUIT` regardless.	2024-12-03 18:47:17 +00:00
Conrad Ludgate	27a42d0f96	chore(proxy): remove postgres config parser and md5 support (#9990 ) Keeping the `mock` postgres cplane adaptor using "stock" tokio-postgres allows us to remove a lot of dead weight from our actual postgres connection logic.	2024-12-03 18:39:23 +00:00
John Spray	b04ab468ee	pageserver: more detailed logs when calling re-attach (#9996 ) ## Problem We saw a peculiar case where a pageserver apparently got a 0-tenant response to `/re-attach` but we couldn't see the request landing on a storage controller. It was hard to confirm retrospectively that the pageserver was configured properly at the moment it sent the request. ## Summary of changes - Log the URL to which we are sending the request - Log the NodeId and metadata that we sent	2024-12-03 18:36:37 +00:00
John Spray	dcb629532b	pageserver: only store SLRUs & aux files on shard zero (#9786 ) ## Problem Since https://github.com/neondatabase/neon/pull/9423 the non-zero shards no longer need SLRU content in order to do GC. This data is now redundant on shards >0. One release cycle after merging that PR, we may merge this one, which also stops writing those pages to shards > 0, reaping the efficiency benefit. Closes: https://github.com/neondatabase/neon/issues/7512 Closes: https://github.com/neondatabase/neon/issues/9641 ## Summary of changes - Avoid storing SLRUs on non-zero shards - Bonus: avoid storing aux files on non-zero shards	2024-12-03 17:22:49 +00:00
John Spray	71d004289c	storcon: in shard splits, inherit parent's AZ (#9946 ) ## Problem Sharded tenants should be run in a single AZ for best performance, so that computes have AZ-local latency to all the shards. Part of https://github.com/neondatabase/neon/issues/8264 ## Summary of changes - When we split a tenant, instead of updating each shard's preferred AZ to wherever it is scheduled, propagate the preferred AZ from the parent. - Drop the check in `test_shard_preferred_azs` that asserts shards end up in their preferred AZ: this will not be true again until the optimize_attachment logic is updated to make this so. The existing check wasn't testing anything about scheduling, it was just asserting that we set preferred AZ in a way that matches the way things happen to be scheduled at time of split.	2024-12-03 16:55:00 +00:00
Christian Schwarz	4d422b937c	pageserver: only throttle pagestream requests & bring back throttling deduction for smgr latency metrics (#9962 ) ## Problem In the batching PR - https://github.com/neondatabase/neon/pull/9870 I stopped deducting the time-spent-in-throttle fro latency metrics, i.e., - smgr latency metrics (`SmgrOpTimer`) - basebackup latency (+scan latency, which I think is part of basebackup). The reason for stopping the deduction was that with the introduction of batching, the trick with tracking time-spent-in-throttle inside RequestContext and swap-replacing it from the `impl Drop for SmgrOpTimer` no longer worked with >1 requests in a batch. However, deducting time-spent-in-throttle is desirable because our internal latency SLO definition does not account for throttling. ## Summary of changes - Redefine throttling to be a page_service pagestream request throttle instead of a throttle for repository `Key` reads through `Timeline::get` / `Timeline::get_vectored`. - This means reads done by `basebackup` are no longer subject to any throttle. - The throttle applies after batching, before handling of the request. - Drive-by fix: make throttle sensitive to cancellation. - Rename metric label `kind` from `timeline_get` to `pagestream` to reflect the new scope of throttling. To avoid config format breakage, we leave the config field named `timeline_get_throttle` and ignore the `task_kinds` field. This will be cleaned up in a future PR. ## Trade-Offs Ideally, we would apply the throttle before reading a request off the connection, so that we queue the minimal amount of work inside the process. However, that's not possible because we need to do shard routing. The redefinition of the throttle to limit pagestream request rate instead of repository `Key` rate comes with several downsides: - We're no longer able to use the throttle mechanism for other other tasks, e.g. image layer creation. However, in practice, we never used that capability anyways. - We no longer throttle basebackup.	2024-12-03 15:25:58 +00:00
Erik Grinaker	bbe4dfa991	test_runner: use immediate shutdown in `test_sharded_ingest` (#9984 ) ## Problem `test_sharded_ingest` ingests a lot of data, which can cause shutdown to be slow e.g. due to local "S3 uploads" or compactions. This can cause test flakes during teardown. Resolves #9740. ## Summary of changes Perform an immediate shutdown of the cluster.	2024-12-03 14:33:31 +00:00
Erik Grinaker	dcb24ce170	safekeeper,pageserver: add heap profiling (#9778 ) ## Problem We don't have good observability for memory usage. This would be useful e.g. to debug OOM incidents or optimize performance or resource usage. We would also like to use continuous profiling with e.g. [Grafana Cloud Profiles](https://grafana.com/products/cloud/profiles-for-continuous-profiling/) (see https://github.com/neondatabase/cloud/issues/14888). This PR is intended as a proof of concept, to try it out in staging and drive further discussions about profiling more broadly. Touches https://github.com/neondatabase/neon/issues/9534. Touches https://github.com/neondatabase/cloud/issues/14888. Depends on #9779. Depends on #9780. ## Summary of changes Adds a HTTP route `/profile/heap` that takes a heap profile and returns it. Query parameters: * `format`: output format (`jemalloc` or `pprof`; default `pprof`). Unlike CPU profiles (see #9764), heap profiles are not symbolized and require the original binary to translate addresses to function names. To make this work with Grafana, we'll probably have to symbolize the process server-side -- this is left as future work, as is other output formats like SVG. Heap profiles don't work on macOS due to limitations in jemalloc.	2024-12-03 11:35:59 +00:00
a-masterov	a2a942f93c	Add support for the extensions test for Postgres v17 (#9748 ) ## Problem The extensions for Postgres v17 are ready but we do not test the extensions shipped with v17 ## Summary of changes Build the test image based on Postgres v17. Run the tests for v17. --------- Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>	2024-12-03 11:25:29 +00:00
Christian Schwarz	cb10be710d	page_service: batching observability & include throttled time in smgr metrics (#9870 ) This PR - fixes smgr metrics https://github.com/neondatabase/neon/issues/9925 - adds an additional startup log line logging the current batching config - adds a histogram of batch sizes global and per-tenant - adds a metric exposing the current batching config The issue described #9925 is that before this PR, request latency was only observed after batching. This means that smgr latency metrics (most importantly getpage latency) don't account for - `wait_lsn` time - time spent waiting for batch to fill up / the executor stage to pick up the batch. The fix is to use a per-request batching timer, like we did before the initial batching PR. We funnel those timers through the entire request lifecycle. I noticed that even before the initial batching changes, we weren't accounting for the time spent writing & flushing the response to the wire. This PR drive-by fixes that deficiency by dropping the timers at the very end of processing the batch, i.e., after the `pgb.flush()` call. I was *unable to maintain the behavior that we deduct time-spent-in-throttle from various latency metrics. The reason is that we're using a single* counter in `RequestContext` to track micros spent in throttle. But there are N metrics timers in the batch, one per request. As a consequence, the practice of consuming the counter in the drop handler of each timer no longer works because all but the first timer will encounter error `close() called on closed state`. A failed attempt to maintain the current behavior can be found in https://github.com/neondatabase/neon/pull/9951. So, this PR remvoes the deduction behavior from all metrics. I started a discussion on Slack about it the implications this has for our internal SLO calculation: https://neondb.slack.com/archives/C033RQ5SPDH/p1732910861704029 # Refs - fixes https://github.com/neondatabase/neon/issues/9925 - sub-issue https://github.com/neondatabase/neon/issues/9377 - epic: https://github.com/neondatabase/neon/issues/9376	2024-12-03 11:03:23 +00:00
Christian Schwarz	15d01b257a	storcon_cli tenant-describe: include tenant-wide information in output (#9899 ) Before this PR, the storcon_cli didn't have a way to show the tenant-wide information of the TenantDescribeResponse. Sadly, the `Serialize` impl for the tenant config doesn't skip on `None`, so, the output becomes a bit bloated. Maybe we can use `skip_serializing_if(Option::is_none)` in the future. => https://github.com/neondatabase/neon/issues/9983	2024-12-03 10:55:13 +00:00
John Spray	aaee713e53	storcon: use proper schedule context during node delete (#9958 ) ## Problem I was touching `test_storage_controller_node_deletion` because for AZ scheduling work I was adding a change to the storage controller (kick secondaries during optimisation) that made a FIXME in this test defunct. While looking at it I also realized that we can easily fix the way node deletion currently doesn't use a proper ScheduleContext, using the iterator type recently added for that purpose. ## Summary of changes - A testing-only behavior in storage controller where if a secondary location isn't yet ready during optimisation, it will be actively polled. - Remove workaround in `test_storage_controller_node_deletion` that previously was needed because optimisation would get stuck on cold secondaries. - Update node deletion code to use a `TenantShardContextIterator` and thereby a proper ScheduleContext	2024-12-03 08:59:38 +00:00
Alexey Kondratov	2e9207fdf3	fix(testing): Use 1 MB shared_buffers even with LFC (#9969 ) ## Problem After enabling LFC in tests and lowering `shared_buffers` we started having more problems with `test_pg_regress`. ## Summary of changes Set `shared_buffers` to 1MB to both exercise getPage requests/LFC, and still have enough room for Postgres to operate. Everything smaller might be not enough for Postgres under load, and can cause errors like 'no unpinned buffers available'. See Konstantin's comment [1] as well. Fixes #9956 [1]: https://github.com/neondatabase/neon/issues/9956#issuecomment-2511608097	2024-12-02 18:46:06 +00:00
Tristan Partin	d8ebd33fe6	Stop changing the value of neon.extension_server_port at runtime (#9972 ) On reconfigure, we no longer passed a port for the extension server which caused us to not write out the neon.extension_server_port line. Thus, Postgres thought we were setting the port to the default value of 0. PGC_POSTMASTER GUCs cannot be set at runtime, which causes the following log messages: > LOG: parameter "neon.extension_server_port" cannot be changed without restarting the server > LOG: configuration file "/var/db/postgres/compute/pgdata/postgresql.conf" contains errors; unaffected changes were applied Fixes: https://github.com/neondatabase/neon/issues/9945 Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-12-02 18:06:19 +00:00

1 2 3 4 5 ...

6718 Commits