
rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-21 15:10:44 +00:00

Author	SHA1	Message	Date
Joonas Koivunen	4ef9f38bdd	doc: comment pass	2023-05-05 02:46:15 +03:00
Joonas Koivunen	505febfc97	doc: fix typo and assert message	2023-05-05 02:42:20 +03:00
Joonas Koivunen	fa947f0f4f	fix: resubscribe while holding the lock	2023-05-05 02:40:15 +03:00
Joonas Koivunen	022ff36481	chore: yet another cfg feature testing	2023-05-05 02:39:56 +03:00
Joonas Koivunen	a77bde5852	restore anyhow::anyhow!(s) error transformations	2023-05-05 02:17:51 +03:00
Joonas Koivunen	88d2551bb4	request coalescing with MaybeDone	2023-05-05 02:05:31 +03:00
Joonas Koivunen	fffb23ec06	chore: forgotten cfg feature testing	2023-05-05 01:17:30 +03:00
Joonas Koivunen	1c20cea537	doc: note unhandled case	2023-05-05 00:54:36 +03:00
Joonas Koivunen	a543e7081f	refactor: InnerDeleteTimelineError does not need to be Error	2023-05-05 00:54:19 +03:00
Joonas Koivunen	e00da98996	wip: try: request coalescing (now building)	2023-05-05 00:34:26 +03:00
Joonas Koivunen	5a924f1867	wip: try with request coalescing	2023-05-05 00:17:15 +03:00
Joonas Koivunen	ad70535899	complete the revert with the datastructure change	2023-05-04 22:00:12 +03:00
Joonas Koivunen	fd6d81b044	wip: revert: `11c18b05aa` This partially reverts commit `11c18b05aa` for the take_mut parts.	2023-05-04 21:50:17 +03:00
Christian Schwarz	8425b6ab21	clarifications around the QueueUninitialized error	2023-05-04 17:08:02 +02:00
Christian Schwarz	b8eba89a10	improve docstring on persist_index_part_with_deleted_flag	2023-05-03 19:46:55 +02:00
Christian Schwarz	08558b83ed	improve XXX above remove_dir_all regarding atomicity With this PR, we're now atomic, if remote storage is configured.	2023-05-03 19:39:32 +02:00
Christian Schwarz	4903953e9f	clippy: tracing::Span is only used in cfg(feature = "testing")	2023-05-03 18:35:01 +02:00
Christian Schwarz	de3f23344a	Merge remote-tracking branch 'origin/main' into dkr/deleted-flag-in-remote-index	2023-05-03 18:07:36 +02:00
Christian Schwarz	b58bf56670	fixup comment from 'add TODO comment regarding de-configuration of remote storage'	2023-05-03 17:47:11 +02:00
Christian Schwarz	16486edd8e	clarify / improve comments in Tenant::delete_timeline	2023-05-03 17:33:53 +02:00
Arthur Petukhovsky	3ceef7b17a	Add more safekeeper and walreceiver metrics (#4142 ) Add essential safekeeper and pageserver::walreceiver metrics. Mostly counters, such as the number of received queries, broker messages, removed WAL segments, or connection switches events in walreceiver. Also logs broker push loop duration.	2023-05-03 17:07:41 +03:00
Kirill Bulatov	586e6e55f8	Print WalReceiver context on WAL waiting timeout (#4090 ) Closes https://github.com/neondatabase/neon/issues/2106 Before: ``` Extracting base backup to create postgres instance: path=/Users/someonetoignore/work/neon/neon_main/test_output/test_pageserver_lsn_wait_error_safekeeper_stop/repo/endpoints/ep-2/pgdata port=15017 stderr: command failed: page server 'basebackup' command failed Caused by: 0: db error: ERROR: Timed out while waiting for WAL record at LSN 0/FFFFFFFF to arrive, last_record_lsn 0/A2C3F58 disk consistent LSN=0/16B5A50 1: ERROR: Timed out while waiting for WAL record at LSN 0/FFFFFFFF to arrive, last_record_lsn 0/A2C3F58 disk consistent LSN=0/16B5A50 Stack backtrace: ``` After: ``` Extracting base backup to create postgres instance: path=/Users/someonetoignore/work/neon/neon/test_output/test_pageserver_lsn_wait_error_safekeeper_stop/repo/endpoints/ep-2/pgdata port=15011 stderr: command failed: page server 'basebackup' command failed Caused by: 0: db error: ERROR: Timed out while waiting for WAL record at LSN 0/FFFFFFFF to arrive, last_record_lsn 0/A2C3F58 disk consistent LSN=0/16B5A50, WalReceiver status (update 2023-04-26 14:20:39): streaming WAL from node 12346, commit\|streaming Lsn: 0/A2C3F58\|0/A2C3F58, safekeeper candidates (id\|update_time\|commit_lsn): [(12348\|14:20:40\|0/A2C3F58), (12346\|14:20:40\|0/A2C3F58), (12347\|14:20:40\|0/A2C3F58)] 1: ERROR: Timed out while waiting for WAL record at LSN 0/FFFFFFFF to arrive, last_record_lsn 0/A2C3F58 disk consistent LSN=0/16B5A50, WalReceiver status (update 2023-04-26 14:20:39): streaming WAL from node 12346, commit\|streaming Lsn: 0/A2C3F58\|0/A2C3F58, safekeeper candidates (id\|update_time\|commit_lsn): [(12348\|14:20:40\|0/A2C3F58), (12346\|14:20:40\|0/A2C3F58), (12347\|14:20:40\|0/A2C3F58)] Stack backtrace: ``` As the issue requests, the PR adds the context in logs only, but I think we should expose the context via HTTP management API similar way — it should be simple 
with the new API, but better be done in a separate PR. Co-authored-by: Kirill Bulatov <kirill@neon.tech>	2023-05-03 16:25:19 +03:00
Joonas Koivunen	138bc028ed	fix: quick and dirty panic avoidance on drop path (#4128 ) Sentry caught a panic on load testing server related to metric removals: https://neondatabase.sentry.io/issues/4142396994 Turn the `expect` into logging, but also add logging for each removal, so we could identify in which cases we do double-remove. The double-removal (or never adding) cause is not obvious or expected. Original added in #3837.	2023-05-01 11:54:09 +03:00
Joonas Koivunen	6f472df0d0	fix: restore not logging ignored io errors as errors (#4120 ) the fix is rather indirect due to the accidental applying of too much `anyhow`: if handle_pagerequests returns a `QueryError` it will now be bubbled up as-is `QueryError`. `QueryError` allows the inner `std::io::Error` to be inspected and thus we can filter certain error kinds which are perfectly normal without a huge log message. for a very long time (`b2f5102`) the errors were converted to `anyhow` by mistake which made this difficult or impossible, even though from the types it would appear that we propagate wrapped `std::io::Error`s and can filter them. Fixes #4113, most likely filters some other errors as well.	2023-04-30 14:34:55 +03:00
Joonas Koivunen	ec53c5ca2e	revert: "Add check for duplicates of generated image layers" (#4104 ) This reverts commit `732acc5`. Reverted PR: #3869 As noted in PR #4094, we do in fact try to insert duplicates to the layer map, if L0->L1 compaction is interrupted. We do not have a proper fix for that right now, and we are in a hurry to make a release to production, so revert the changes related to this to the state that we have in production currently. We know that we have a bug here, but better to live with the bug that we've had in production for a long time, than rush a fix to production without testing it in staging first. Cc: #4094, #4088	2023-04-28 17:20:18 +03:00
Arseny Sher	fdacfaabfd	Move PageserverFeedback to utils. It allows to replace u64 with proper Lsn and pretty print PageserverFeedback with serde(_json). Now walsenders on safekeepers queried with debug_dump look like "walsenders": [ { "ttid": "fafe0cf39a99c608c872706149de9d2a/b4fb3be6f576935e7f0fcb84bdb909a1", "addr": "127.0.0.1:48774", "conn_id": 3, "appname": "pageserver", "feedback": { "Pageserver": { "current_timeline_size": 32096256, "last_received_lsn": "0/2415298", "disk_consistent_lsn": "0/1696628", "remote_consistent_lsn": "0/0", "replytime": "2023-04-12T13:54:53.958856+00:00" } } } ],	2023-04-28 06:22:13 +04:00
Joonas Koivunen	fe0b616299	feat(page_service): read timeouts (#4093 ) Introduce read timeouts to our `page_service` connections. Without read timeouts, we essentially leak connections. This is a port of #3995. Split the refactorings to the other PR: #4097. Fixes #4028.	2023-04-27 17:55:35 +00:00
Christian Schwarz	db9d78151a	add TODO comment regarding de-configuration of remote storage	2023-04-27 18:52:21 +02:00
Christian Schwarz	3720e9073f	Merge remote-tracking branch 'origin/main' into dkr/deleted-flag-in-remote-index	2023-04-27 18:43:10 +02:00
Christian Schwarz	11c18b05aa	push metadata.to_bytes() out of stop() into persist_index_part_with_deleted_flag() This pushes the (unlikely) possibility of failure to serialize metadata out of stop(). That in turn leaves us with only one case of how stop() can fail. There are two callsites of stop(): 1. perform_upload_task: here, we can safely say "unreachable", and I think any future refactorings that might violate that invariant would notice, because the unreachable!() is close to the code that would likely be refactored. The unreachable!() is desirable there because otherwise we'd need to think about how to handle the error. Maybe the previous code would have done the right thing, maybe not. 2. delete_timeline: this is the new one, and, it's far away from the code that initializes the upload queue. Putting an unreachable!() there seems risky. So, bail out with an error. It will become a 500 status code, which console shall retry according to the openapi spec. We have test coverage that the retry can succeed.	2023-04-27 18:33:40 +02:00
Christian Schwarz	48112bdf53	fixup typo in 'fix the problem exposed by the previously added test case'	2023-04-27 18:05:14 +02:00
Christian Schwarz	e41b2ed66a	distinguished error types	2023-04-27 17:59:10 +02:00
Joonas Koivunen	fdf5e4db5e	refactor: Cleanup page service (#4097 ) Refactoring part of #4093. Numerous `Send + Sync` bounds were a distraction, that were not needed at all. The proper `Bytes` usage and one `"error_message".to_string()` are just drive-by fixes. Not using the `PostgresBackendTCP` allows us to start setting read timeouts (and more). `PostgresBackendTCP` is still used from proxy, so it cannot be removed.	2023-04-27 18:51:57 +03:00
Christian Schwarz	d5280bf2dd	timeline_delete => persist_index_part_with_deleted_flag: make cancel safe this fixes the test added in the previous commit	2023-04-27 16:18:16 +02:00
Christian Schwarz	3be81dd36b	fix `clippy --release` failure introduced in #4030 (#4095 ) PR `build: run clippy for powerset of features (#4077)` brought us a `clippy --release` pass. It was merged after #4030, which fails under `clippy --release` with ``` error: static `TENANT_ID_EXTRACTOR` is never used --> pageserver/src/tenant/timeline.rs:4270:16 \| 4270 \| pub static TENANT_ID_EXTRACTOR: once_cell::sync::Lazy< \| ^^^^^^^^^^^^^^^^^^^ \| = note: `-D dead-code` implied by `-D warnings` error: static `TIMELINE_ID_EXTRACTOR` is never used --> pageserver/src/tenant/timeline.rs:4276:16 \| 4276 \| pub static TIMELINE_ID_EXTRACTOR: once_cell::sync::Lazy< \| ^^^^^^^^^^^^^^^^^^^^^ ``` A merge queue would have prevented this.	2023-04-27 17:07:25 +03:00
Christian Schwarz	7f3ee0d45d	fix the problem exposed by the previously added test case	2023-04-27 16:00:42 +02:00
Christian Schwarz	958fd5720e	add (failing) test case for index upload causing delete timeline request timeout The test fails because the assert trips in the second call.	2023-04-27 16:00:42 +02:00
MMeent	e6ec2400fc	Enable hot standby PostgreSQL replicas. Notes: - This still needs UI support from the Console - I've not tuned any GUCs for PostgreSQL to make this work better - Safekeeper has gotten a tweak in which WAL is sent and how: It now sends zero-ed WAL data from the start of the timeline's first segment up to the first byte of the timeline to be compatible with normal PostgreSQL WAL streaming. - This includes the commits of #3714 Fixes one part of https://github.com/neondatabase/neon/issues/769 Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>	2023-04-27 15:26:44 +02:00
Christian Schwarz	9ea7b5dd38	clean up logging around on-demand downloads (#4030 ) - Remove repeated tenant & timeline from span - Demote logging of the path to debug level - Log completion at info level, in the same function where we log errors - distinguish between layer file download success & on-demand download succeeding as a whole in the log message wording - Assert that the span contains a tenant id and a timeline id fixes https://github.com/neondatabase/neon/issues/3945 Before: ``` INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{tenant_id=$TENANT_ID timeline_id=$TIMELINE_ID layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: download complete: /storage/pageserver/data/tenants/$TENANT_ID/timelines/$TIMELINE_ID/000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91 INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{tenant_id=$TENANT_ID timeline_id=$TIMELINE_ID layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: Rebuilt layer map. Did 9 insertions to process a batch of 1 updates. ``` After: ``` INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: layer file download finished INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: Rebuilt layer map. Did 9 insertions to process a batch of 1 updates. 
INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: on-demand download successful ```	2023-04-27 11:54:48 +02:00
Christian Schwarz	95b560db4e	cleanup: persist_index_part_with_deleted_flag doesn't need self: &Arc<Self>	2023-04-27 11:52:13 +02:00
Christian Schwarz	6861259be7	add global metric for unexpected on-demand downloads (#4069 ) Until we have toned down the prod logs to zero WARN and ERROR, we want a dedicated metric for which we can have a dedicated alert. fixes https://github.com/neondatabase/neon/issues/3924	2023-04-26 15:18:26 +02:00
Joonas Koivunen	381c8fca4f	feat: log how long tenant activation takes (#4080 ) Adds just a counter counting up from the creation of the tenant, logged after activation. Might help guide us with the investigation of #4025.	2023-04-26 12:39:17 +03:00
Joonas Koivunen	850f6b1cb9	refactor: drop pageserver_ondisk_layers (#4071 ) I didn't get through #3775 fast enough so we wanted to remove this metric. Fixes #3705.	2023-04-26 11:49:29 +03:00
Joonas Koivunen	7f80230fd2	fix: stop dead_code rustc lint (#4070 ) only happens without `--all-features` which is what `./run_clippy.sh` uses.	2023-04-25 17:07:04 +02:00
Joonas Koivunen	cb9473928d	feat: add rough timings for basebackup (#4062 ) just record the time needed for waiting the lsn and then the basebackup in a log message in millis. this is related to ongoing investigations to cold start performance. this could also be a counter. it cannot be added next to smgr histograms, because we don't want another histogram per timeline. the aim is to allow drilling deeper into which timelines were slow, and to understand why some need two basebackups.	2023-04-25 13:22:16 +00:00
Christian Schwarz	fa20e37574	add gauge for in-flight layer uploads (#3951 ) For the "worst-case /storage usage panel", we need to compute ``` remote size + local-only size ``` We currently don't have a metric for local-only layers. The number of in-flight layers in the upload queue is just that, so, let Prometheus scrape it. The metric is two counters (started and finished). The delta is the amount of in-flight uploads in the queue. The metrics are incremented in the respective `call_unfinished_metric_*` functions. These track ongoing operations by file_kind and op_kind. We only need this metric for layer uploads, so, there's the new RemoteTimelineClientMetricsCallTrackSize type that forces all call sites to decide whether they want the size tracked or not. If we find that other file_kinds or op_kinds (metadata uploads, layer downloads, layer deletes) are interesting, we can just enable them, and they'll be just another label combination within the metrics that this PR adds. fixes https://github.com/neondatabase/neon/issues/3922	2023-04-25 14:22:48 +02:00
Christian Schwarz	e83684b868	add libmetric metric for each logged log message (#4055 ) This patch extends the libmetrics logging setup functionality with a `tracing` layer that increments a Prometheus counter each time we log a log message. We have the counter per tracing event level. This allows for monitoring WARN and ERR log volume without parsing the log. Also, it would allow cross-checking whether logs got dropped on the way into Loki. It would be nicer if we could hook deeper into the tracing logging layer, to avoid evaluating the filter twice. But I don't know how to do it.	2023-04-25 14:10:18 +02:00
Eduard Dyckman	afbbc61036	Adding synthetic size to pageserver swagger (#4049 ) ## Describe your changes I added synthetic size response to the console swagger. Now I am syncing it back to neon	2023-04-24 16:19:25 +03:00
Dmitry Rodionov	aba2d60815	add test	2023-04-21 12:07:14 +03:00
Dmitry Rodionov	efe914f056	delete local data when facing timeline that is marked as deleted in s3	2023-04-21 12:04:21 +03:00
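
Commit fa20e37574 above describes the in-flight upload metric as two counters (started and finished) whose delta is the number of uploads still in the queue. A minimal sketch of that idea in plain Rust follows. The type `InFlightCounters` and its methods are hypothetical names chosen for illustration; they are not the pageserver's metric types, and the real code exposes Prometheus counters rather than bare atomics.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for a pair of monotonically increasing counters.
/// Neither counter ever decreases; the "gauge" is derived from their delta.
#[derive(Default)]
pub struct InFlightCounters {
    started: AtomicU64,
    finished: AtomicU64,
}

impl InFlightCounters {
    /// Record that an operation (for example a layer upload) was queued.
    pub fn start(&self) {
        self.started.fetch_add(1, Ordering::SeqCst);
    }

    /// Record that a previously started operation completed.
    pub fn finish(&self) {
        self.finished.fetch_add(1, Ordering::SeqCst);
    }

    /// Number of operations started but not yet finished.
    /// `saturating_sub` guards against a momentarily inconsistent pair of
    /// reads when other threads update the counters concurrently.
    pub fn in_flight(&self) -> u64 {
        let finished = self.finished.load(Ordering::SeqCst);
        let started = self.started.load(Ordering::SeqCst);
        started.saturating_sub(finished)
    }
}
```

Keeping two counters instead of one up/down gauge means a scraper can still compute rates of started and finished operations separately, while the difference gives the queue depth at scrape time.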


1313 Commits
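
Commit 6f472df0d0 in the log above notes that propagating a typed `QueryError` lets the inner `std::io::Error` be inspected, so that error kinds which are perfectly normal can be filtered out of error-level logging. The sketch below illustrates that pattern only. `SketchQueryError`, `is_expected_io_error`, `should_log_as_error`, and the particular set of `ErrorKind`s are assumptions made for illustration, not the pageserver's actual definitions.

```rust
use std::io;

/// Hypothetical, simplified error type. The point is that the I/O error
/// stays a typed `io::Error` instead of being flattened into an opaque one.
#[derive(Debug)]
pub enum SketchQueryError {
    Io(io::Error),
    Other(String),
}

/// Assumed set of error kinds treated as routine client disconnects.
pub fn is_expected_io_error(e: &io::Error) -> bool {
    matches!(
        e.kind(),
        io::ErrorKind::ConnectionReset
            | io::ErrorKind::ConnectionAborted
            | io::ErrorKind::BrokenPipe
    )
}

/// Decide whether an error deserves an error-level log message.
pub fn should_log_as_error(e: &SketchQueryError) -> bool {
    match e {
        // Typed I/O errors can be inspected and routine ones skipped.
        SketchQueryError::Io(io_err) => !is_expected_io_error(io_err),
        // Anything else is logged as an error.
        SketchQueryError::Other(_) => true,
    }
}
```

If the I/O error were converted into an opaque error type before reaching this point, the match on `kind()` would not be directly available, which is the difficulty the commit message describes.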