rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-05 05:00:37 +00:00

Author	SHA1	Message	Date
John Spray	0baf91fcac	pageserver: include secondary tenants in disk usage eviction	2023-10-26 20:33:47 +01:00
John Spray	cd27f42839	pageserver: pass TenantManager into disk usage eviction task	2023-10-26 20:33:47 +01:00
John Spray	edf303ff44	pageserver: add Layer::for_secondary	2023-10-26 20:33:47 +01:00
John Spray	ec189338eb	tests: avoid deprecated log.warn()	2023-10-26 20:33:47 +01:00
John Spray	e22f9b9fbb	tests: more logging from attachment_service	2023-10-26 20:33:47 +01:00
John Spray	83fc486636	tests: log in UTC	2023-10-26 20:33:47 +01:00
John Spray	3afec731dc	pageserver: add a CancellationToken to Tenant	2023-10-26 20:33:47 +01:00
John Spray	559611f9a6	pageserver: add Tenant::heatmap_hook	2023-10-26 20:33:47 +01:00
John Spray	d56dad3824	tests: add test_pageserver_secondary	2023-10-26 20:33:47 +01:00
John Spray	547acb6653	pageserver: create timelines/ dir when configuring secondary location	2023-10-26 20:33:47 +01:00
John Spray	5264ffc2a9	pageserver/http: add testing routes for secondary mode	2023-10-26 20:33:47 +01:00
John Spray	ee841708ff	pageserver: implement flush on location conf update	2023-10-26 20:33:47 +01:00
John Spray	f356df1860	pageserver: more logging during tenant shutdown	2023-10-26 20:33:47 +01:00
John Spray	fbe1da2981	tests: add helpers for location_conf	2023-10-26 20:33:47 +01:00
John Spray	22c94d99f5	tests: refactor a path helper	2023-10-26 20:33:47 +01:00
John Spray	6de414b5a5	tests: add multi attach test	2023-10-26 20:33:47 +01:00
John Spray	8c0ce9723a	Refactor Workload into shared location	2023-10-26 20:33:47 +01:00
John Spray	5a4c371e94	pageserver: launch tasks for secondary mode	2023-10-26 20:33:47 +01:00
John Spray	65096ac992	pageserver: add secondary downloader & heatmaps	2023-10-26 20:33:47 +01:00
John Spray	049cb1fb4b	Add Timeline::generate_heatmap, remote client heatmap upload	2023-10-26 20:33:47 +01:00
John Spray	640350a2c0	TenantManager: implement hooks for secondary downloads	2023-10-26 20:33:47 +01:00
John Spray	d422105d88	pageserver: start refactoring into TenantManager	2023-10-26 11:14:24 +01:00
John Spray	ead1931167	pageserver: add InProgress top level state & make TenantsMap lock synchronous	2023-10-26 11:13:16 +01:00
John Spray	e0ebdfc7ce	pageserver: suppress compaction/gc errors while stopping (#5670 ) ## Problem Tenant deletions would sometimes be accompanied by compaction stack traces, because `shutdown()` puts the tenant into stopping state before it joins background tasks. ## Summary of changes Treat GC+Compaction as no-ops on a Stopping tenant.	2023-10-26 10:59:24 +01:00
Joonas Koivunen	c508d3b5fa	reimpl Layer, remove remote layer, trait Layer, trait PersistentLayer (#4938) Implement a new `struct Layer` abstraction which manages downloadness internally, requiring no LayerMap locking or rewriting to download or evict, and providing the property "you have a layer, you can read it". The new `struct Layer` provides the ability to keep the file resident via an RAII structure for new layers which still need to be uploaded. The previous solution handled this with `RemoteTimelineClient::wait_completion`, which led to bugs like #5639. Eviction, or the final local deletion after garbage collection, is done in the `Drop` of the Arc'd value. With a single `struct Layer`, the closed open ended `trait Layer`, `trait PersistentLayer` and `struct RemoteLayer` are removed, after noting that compaction could be simplified by simply not using any of the traits in between: #4839. The new `struct Layer` is a prerequisite for removing `Timeline::layer_removal_cs`, documented in #4745. Preliminaries: #4936, #4937, #5013, #5014, #5022, #5033, #5044, #5058, #5059, #5061, #5074, #5103, epic #5172, #5645, #5649. Related split off: #5057, #5134.	2023-10-26 12:36:38 +03:00
John Spray	acda65d7d4	pageserver: quieten "Failed to get info about AUX files" (#5669 ) ## Problem This line caused lots of errors to be emitted for healthy tenants. ## Summary of changes Downgrade to debug, since it is an expected code path we'll take for tenants at startup.	2023-10-26 09:53:18 +01:00
dependabot[bot]	378daa358b	build(deps): bump werkzeug from 2.2.3 to 3.0.1 (#5665 )	2023-10-25 22:50:35 +00:00
Alexander Bayandin	85f4514e7d	Get env var for real Azure tests from GitHub (#5662 ) ## Problem We'll need to switch `REMOTE_STORAGE_AZURE_REGION` from the current `eastus2` region to something `eu-central-1`-like. This may require changing `AZURE_STORAGE_ACCESS_KEY`. To make it possible to switch from one place (not to break a lot of builds on CI), move `REMOTE_STORAGE_AZURE_CONTAINER` and `REMOTE_STORAGE_AZURE_REGION` to GitHub Variables. See https://github.com/neondatabase/neon/settings/variables/actions ## Summary of changes - Get values for `REMOTE_STORAGE_AZURE_CONTAINER` & `REMOTE_STORAGE_AZURE_REGION` from GitHub Variables	2023-10-25 22:54:23 +01:00
Joonas Koivunen	f70019797c	refactor(rtc): schedule compaction update (#5649) a single operation instead of N uploads and 1 deletion scheduling with write(layer_map) lock releasing in between. A compaction update makes for a much better place to change how the operation works in future, compared to more general file-based operations. Builds upon #5645. Solves the problem that correctness w.r.t. other `index_part.json` changing operations was difficult to see. Co-authored-by: Shany Pozin <shany@neon.tech>	2023-10-25 22:25:43 +01:00
Joonas Koivunen	325258413a	fix: trampling on global physical size metric (#5663 ) All loading (attached, or from disk) timelines overwrite the global gauge for physical size. The `_set` method cannot be used safely, so remove it and just "add" the physical size.	2023-10-25 19:29:12 +01:00
Konstantin Knizhnik	4ddbc0e46d	Ignore missed AUX_FILES_KEY when generating image layer (#5660) ## Problem Logical replication requires the new AUX_FILES_KEY, which is definitely absent in existing databases. We do not have a function to check if a key exists in our KV storage, so I have to handle the error in the `list_aux_files` method. But this key is also included in the key space range and accessed by the `create_image_layer` method. ## Summary of changes Check if AUX_FILES_KEY exists before including it in keyspace. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Shany Pozin <shany@neon.tech> Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2023-10-25 18:35:23 +01:00
Arpad Müller	a673e4e7a9	Optionally return json from get_lsn_by_timestamp (#5608) This does two things: first, a minor refactor to not use HTTP/1.x style header names and also to not panic if certain requests had no "Accept" header. Second, it addresses the third bullet point from #3689: > Change `get_lsn_by_timestamp` API method to return LSN even if we only found commit before the specified timestamp. This is done by adding a version parameter to the `get_lsn_by_timestamp` API call and making its behaviour depend on the version number. Part of #3414 (but doesn't address it in its entirety). --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-10-25 18:46:34 +02:00
bojanserafimov	c155cc0c3f	Fix test instructions readme (#5644 )	2023-10-25 11:53:04 -04:00
Conrad Ludgate	32126d705b	proxy refactor serverless (#4685 ) ## Problem Our serverless backend was a bit jumbled. As a comment indicated, we were handling SQL-over-HTTP in our `websocket.rs` file. I've extracted out the `sql_over_http` and `websocket` files from the `http` module and put them into a new module called `serverless`. ## Summary of changes ```sh mkdir proxy/src/serverless mv proxy/src/http/{conn_pool,sql_over_http,websocket}.rs proxy/src/serverless/ mv proxy/src/http/server.rs proxy/src/http/health_server.rs mv proxy/src/metrics proxy/src/usage_metrics.rs ``` I have also extracted the hyper server and handler from websocket.rs into `serverless.rs`	2023-10-25 15:43:03 +01:00
John Spray	5683ae9eab	pageserver: suppress some of the most common spurious warnings (#5658 ) Two of the most common spurious log messages: - broker connections terminate & we log at error severity. Unfortunately tonic gives us an "Unknown" error so to suppress these we're doing string matching. It's hacky but worthwhile for operations. - the first iteration of tenant background tasks tends to over-run its schedule and emit a warning. Ultimately we should fix these to run on time, but for now we are not benefiting from polluting our logs with the warnings.	2023-10-25 14:55:37 +01:00
Alexander Bayandin	4778b6a12e	Switch to querying new tests results DB (#5616) ## Problem We started to store test results in a new format in https://github.com/neondatabase/neon/pull/4549. This PR switches scripts to query this DB. (We can completely remove the old DB/ingestion scripts a couple of weeks after the PR is merged.) ## Summary of changes - `scripts/benchmark_durations.py` query new database - `scripts/flaky_tests.py` query new database	2023-10-25 14:25:13 +01:00
John Spray	8b8be7bed4	tests: don't fail tests on torn log lines (#5655 ) ## Problem Tests that force-kill and restart a service can generate torn log lines that might match WARN\|ERROR, but not match the allow expression that a test has loaded, e.g. https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5651/6638398772/index.html#suites/7538959189f4501983ddd9e167836c8b/d272ba8a73e6945c ## Summary of changes Ignore log lines which match a regex for torn log lines on restart: they have two timestamps and the second line is an "INFO version"... message.	2023-10-25 13:29:30 +01:00
Conrad Ludgate	a461c459d8	fix http pool test (#5653) ## Problem We defer the returning of connections to the connection pool. It's possible for our test to be faster than the returning of connections - which then gets a differing process ID because it opens a new connection. ## Summary of changes 1. Delay the tests just a little (20ms) to give more chance for connections to return. 2. Correlate connection IDs with the connection logs a bit more	2023-10-25 13:20:45 +01:00
Joonas Koivunen	4ae2d1390d	refactor(remote_timeline_client): Split deletion into unlinking + deletion (#5645) Quest: #4745. Prerequisite for #4938. Original https://github.com/neondatabase/neon/pull/4938#issuecomment-1777150665. The new Layer implementation has so far been using `RemoteTimelineClient::schedule_layer_file_deletion` from `Layer::drop`, but it was noticed that this could mean that the L0s compaction wanted to remove could linger in the index part for a longer time. The solution is to split `RemoteTimelineClient::schedule_layer_file_deletion` into two parts: - unlinking from index_part.json, to be called from the end of compaction and gc - scheduling of actual deletions, to be called from `Layer::drop` The new methods are added but not yet used.	2023-10-25 15:01:19 +03:00
Joonas Koivunen	c5949e1fd6	misc smaller improvements (#5527 ) - finally add an `#[instrument]` to Timeline::create_image_layers, making it easier to see that something is happening because we create image layers - format some macro context code - add a warning not to create new validation functions a la parse do not validate Split off from #5198.	2023-10-25 14:59:43 +03:00
John Spray	127837abb0	tests: de-flake test_eviction_across_generations (#5650) ## Problem There was an edge case where initial logical size calculation can be downloading a layer that wasn't hit by the test's `SELECT`, and it's on-disk but still marked as remote in the pageserver's internal state, so evicting it fails. https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5648/6630099807/index.html#categories/dee044ec96f666edb90a77c01099a941/e38e97a2735ffa8c/ ## Summary of changes Use pageserver API to learn about layers, instead of inspecting local disk, so that we will always agree with the pageserver about which layers are local.	2023-10-25 10:55:45 +01:00
Conrad Ludgate	b2c96047d0	move wake compute after the auth quirks logic (#5642 ) ## Problem https://github.com/neondatabase/neon/issues/5568#issuecomment-1777015606 ## Summary of changes Make the auth_quirks_creds return the authentication information, and push the wake_compute loop to after, inside `auth_quirks`	2023-10-25 08:30:47 +01:00
Em Sharnoff	44202eeb3b	Bump vm-builder v0.18.1 -> v0.18.2 (#5646 ) Only applicable change was neondatabase/autoscaling#571, removing the postgres_exporter flags `--auto-discover-databases` and `--exclude-databases=...`	2023-10-24 16:04:28 -07:00
Arpad Müller	4bef977c56	Use tuples instead of manual comparison chain (#5637 ) Makes code a little bit simpler	2023-10-24 17:16:23 +00:00
John Spray	a0b862a8bd	pageserver: schedule frozen layer uploads inside the layers lock (#5639 ) ## Problem Compaction's source of truth for what layers exist is the LayerManager. `flush_frozen_layer` updates LayerManager before it has scheduled upload of the frozen layer. Compaction can then "see" the new layer, decide to delete it, schedule uploads of replacement layers, all before `flush_frozen_layer` wakes up again and schedules the upload. When the upload is scheduled, the local layer file may be gone, in which case we end up with no such layer in remote storage, but an entry still added to IndexPart pointing to the missing layer. ## Summary of changes Schedule layer uploads inside the `self.layers` lock, so that whenever a frozen layer is present in LayerManager, it is also present in RemoteTimelineClient's metadata. Closes: #5635	2023-10-24 13:57:01 +01:00
Conrad Ludgate	767ef29390	proxy: filter out more quota exceeded errors (#5640 ) ## Problem Looking at logs, I saw more retries being performed for other quota exceeded errors ## Summary of changes Filter out all quota exceeded family of errors	2023-10-24 13:13:23 +01:00
Alexander Bayandin	a8a800af51	Run real Azure tests on CI (#5627 ) ## Problem We do not run real Azure-related tests on CI ## Summary of changes - Set required env variables to run real Azure blob storage tests on CI	2023-10-24 12:12:11 +01:00
Arpad Müller	1e250cd90a	Cleanup in azure_upload_download_works test (#5636) The `azure_upload_download_works` test is not cleaning up after itself, leaving behind the files it is uploading. I found these files when looking at the contents of the bucket in #5627. We now clean up the file we uploaded, as the other tests do. Follow-up of #5546	2023-10-23 19:08:56 +01:00
John Spray	eaaa18f6ed	attachment_service: graceful SIGQUIT (#5626 ) `attachment_service` doesn't explicitly handle signals, which causes a backtrace when `neon_local` kills it with SIGQUIT. Closes: https://github.com/neondatabase/neon/issues/5613	2023-10-23 17:30:25 +01:00
John Spray	188f67e1df	pageserver: forward compat: be tolerant of deletion marker in `timelines/` (#5632 ) ## Problem https://github.com/neondatabase/neon/pull/5580 will move the remote deletion marker into the `timelines/` path. This would cause old pageserver code to fail loading the tenant due to an apparently invalid timeline ID. That would be a problem if we had to roll back after deploying #5580 ## Summary of changes If a `deleted` file is in `timelines/` just ignore it.	2023-10-23 17:51:38 +02:00
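
Commit 4bef977c56 (#5637) above replaces a manual comparison chain with tuple comparison. The changed code itself is not shown on this page, so the following is only a minimal sketch of the general technique, not the commit's actual diff. The `Entry` struct, its `key` and `lsn` fields, and both function names are hypothetical.

```rust
use std::cmp::Ordering;

// Hypothetical record type, used only for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Entry {
    pub key: u64,
    pub lsn: u64,
}

// Manual chain: compare the first field, then fall through to the second.
pub fn cmp_manual(a: &Entry, b: &Entry) -> Ordering {
    if a.key != b.key {
        a.key.cmp(&b.key)
    } else {
        a.lsn.cmp(&b.lsn)
    }
}

// Tuple form: tuples implement `Ord` lexicographically, so this orders
// by `key` first and by `lsn` only when the keys are equal.
pub fn cmp_tuple(a: &Entry, b: &Entry) -> Ordering {
    (a.key, a.lsn).cmp(&(b.key, b.lsn))
}
```

Both functions return the same ordering for any pair of entries. The tuple form is shorter and scales to more fields without extra branches.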

3956 Commits