
rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-05 05:00:37 +00:00

Author	SHA1	Message	Date
Arpad Müller	e6470ee92e	Add API description for safekeeper copy endpoint (#5770 ) Adds a yaml API description for a new endpoint that allows creation of a new timeline as the copy of an existing one. Part of #5282	2023-11-06 15:00:07 +01:00
bojanserafimov	dc72567288	Layer flush minor speedup (#5765 ) Convert keys to `i128` before sorting	2023-11-06 08:58:20 -05:00
John Spray	6defa2b5d5	pageserver: add `Gate` as a partner to CancellationToken for safe shutdown of `Tenant` & `Timeline` (#5711 ) ## Problem When shutting down a Tenant, it isn't just important to cause any background tasks to stop. It's also important to wait until they have stopped before declaring shutdown complete, in cases where we may re-use the tenant's local storage for something else, such as running in secondary mode, or creating a new tenant with the same ID. ## Summary of changes A `Gate` class is added, inspired by [seastar::gate](https://docs.seastar.io/master/classseastar_1_1gate.html). For types that have an important lifetime that corresponds to some physical resource, use of a Gate as well as a CancellationToken provides a robust pattern for async requests & shutdown: - Requests must always acquire the gate as long as they are using the object - Shutdown must set the cancellation token, and then `close()` the gate to wait for requests in progress before returning. This is not for memory safety: it's for expressing the difference between "Arc<Tenant> exists", and "This tenant's files on disk are eligible to be read/written". - Both Tenant and Timeline get a Gate & CancellationToken. - The Timeline gate is held during eviction of layers, and during page_service requests. - Existing cancellation support in page_service is refined to use the timeline-scope cancellation token instead of a process-scope cancellation token. This replaces the use of `task_mgr::associate_with`: tasks no longer change their tenant/timeline identity after being spawned. The Tenant's Gate is not yet used, but will be important for Tenant-scoped operations in secondary mode, where we must ensure that our secondary-mode downloads for a tenant are gated wrt the activity of an attached Tenant. 
This is part of a broader move away from using the global-state driven `task_mgr` shutdown tokens: - less global state where we rely on implicit knowledge of what task a given function is running in, and more explicit references to the cancellation token that a particular function/type will respect, making shutdown easier to reason about. - eventually avoid the big global TASKS mutex. --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-11-06 12:39:20 +00:00
duguorong009	b3d3a2587d	feat: improve the serde impl for several types(`Lsn`, `TenantId`, `TimelineId` ...) (#5335 ) Improve the serde impl for several types (`Lsn`, `TenantId`, `TimelineId`) by making them sensitive to `Serializer::is_human_readable` (true for json, false for bincode). Fixes #3511 by: - Implement the custom serde for `Lsn` - Implement the custom serde for `Id` - Add the helper module `serde_as_u64` in `libs/utils/src/lsn.rs` - Remove the unnecessary attr `#[serde_as(as = "DisplayFromStr")]` in all possible structs Additionally some safekeeper types gained serde tests. --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-11-06 11:40:03 +02:00
Heikki Linnakangas	b85fc39bdb	Update control plane API path for getting compute spec. (#5357 ) We changed the path in the control plane. The old path is still accepted for compatibility with existing computes, but we'd like to phase it out.	2023-11-06 09:26:09 +02:00
duguorong009	09b5954526	refactor: use streaming in safekeeper `/v1/debug_dump` http response (#5731 ) - Update the handler for `/v1/debug_dump` http response in safekeeper - Update the `debug_dump::build()` to use the streaming in JSON build process	2023-11-05 10:16:54 +00:00
John Spray	306c4f9967	s3_scrubber: prepare for scrubbing buckets with generation-aware content (#5700 ) ## Problem The scrubber didn't know how to find the latest index_part when generations were in use. ## Summary of changes - Teach the scrubber to do the same dance that pageserver does when finding the latest index_part.json - Teach the scrubber how to understand layer files with generation suffixes. - General improvement to testability: scan_metadata has a machine readable output that the testing `S3Scrubber` wrapper can read. - Existing test coverage of scrubber was false-passing because it just didn't see any data due to prefixing of data in the bucket. Fix that. This is incremental improvement: the more confidence we can have in the scrubber, the more we can use it in integration tests to validate the state of remote storage. --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2023-11-03 17:36:02 +00:00
Konstantin Knizhnik	5ceccdc7de	Logical replication startup fixes (#5750 ) ## Problem See https://neondb.slack.com/archives/C04DGM6SMTM/p1698226491736459 ## Summary of changes Update WAL affected buffers when restoring WAL from safekeeper ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Arseny Sher <sher-ars@yandex.ru>	2023-11-03 18:40:27 +02:00
Conrad Ludgate	cdcaa329bf	proxy: no more statements (#5747 ) ## Problem my prepared statements change in tokio-postgres landed in the latest release. it didn't work as we intended ## Summary of changes https://github.com/neondatabase/rust-postgres/pull/24	2023-11-03 08:30:58 +00:00
Joonas Koivunen	27bdbf5e36	chore(layer): restore logging, doc changes (#5766 ) Some of the log messages were lost with the #4938. This PR adds some of them back, most notably: - starting to on-demand download - successful completion of on-demand download - ability to see when there were many waiters for the layer download - "unexpectedly on-demand downloading ..." is now `info!` Additionally some rare events are logged as error, which should never happen.	2023-11-02 19:05:33 +00:00
khanova	4c7fa12a2a	Proxy introduce allowed ips (#5729 ) ## Problem Proxy doesn't accept wake_compute responses with the allowed IPs. ## Summary of changes Extend wake_compute api to be able to return allowed_ips.	2023-11-02 16:26:15 +00:00
Em Sharnoff	367971a0e9	vm-monitor: Remove support for file cache in tmpfs (#5617 ) ref neondatabase/cloud#7516. We switched everything over to file cache on disk, now time to remove support for having it in tmpfs.	2023-11-02 16:06:16 +00:00
bojanserafimov	51570114ea	Remove outdated and flaky perf test (#5762 )	2023-11-02 10:43:59 -04:00
Joonas Koivunen	098d3111a5	fix(layer): get_and_upgrade and metrics (#5767 ) when introducing `get_and_upgrade` I forgot that an `evict_and_wait` would have already incremented the counter for started evictions, but an upgrade would just "silently" cancel the eviction as no drop would ever run. these metrics are likely sources for alerts with the next release, so it's important to keep them correct.	2023-11-02 13:06:14 +00:00
Joonas Koivunen	3737fe3a4b	fix(layer): error out early if layer path is non-file (#5756 ) In an earlier PR https://github.com/neondatabase/neon/pull/5743#discussion_r1378625244 I added a FIXME and there's a simple solution suggested by @jcsp, so implement it. Wondering why I did not implement this originally, there is no concept of a permanent failure, so this failure will happen quite often. I don't think the frequency is a problem however. Sadly for std::fs::FileType there is only decimal and hex formatting, no octal.	2023-11-02 11:03:38 +00:00
John Spray	5650138532	pageserver: helpers for explicitly dying on fatal I/O errors (#5651 ) Following from discussion on https://github.com/neondatabase/neon/pull/5436 where hacking an implicit die-on-fatal-io behavior into an Error type was a source of disagreement -- in this PR, dying on fatal I/O errors is explicit, with `fatal_err` and `maybe_fatal_err` helpers in the `MaybeFatalIo` trait, which is implemented for std::io::Result. To enable this approach with `crashsafe_overwrite`, the return type of that function is changed to std::io::Result -- the previous error enum for this function was not used for any logic, and the utility of saying exactly which step in the function failed is outweighed by the hygiene of having an I/O function return an io::Result. The initial use case for these helpers is the deletion queue.	2023-11-02 09:14:26 +00:00
Joonas Koivunen	2dca4c03fc	feat(layer): cancellable get_or_maybe_download (#5744 ) With the layer implementation as was done in #4938, it is possible via cancellation to cause two concurrent downloads on the same path, due to how `RemoteTimelineClient::download_remote_layer` does tempfiles. Thread the init semaphore through the spawned task of downloading to make this impossible to happen.	2023-11-02 08:06:32 +00:00
bojanserafimov	0b790b6d00	Record wal size in import benchmark (#5755 )	2023-11-01 17:02:58 -04:00
Joonas Koivunen	e82d1ad6b8	fix(layer): reinit on access before eviction happens (#5743 ) Right before merging, I added a loop to `fn LayerInner::get_or_maybe_download`, which was always supposed to be there. However I had forgotten to restart initialization instead of waiting for the eviction to happen to support original design goal of "eviction should always lose to redownload (or init)". This was wrong. After this fix, if `spawn_blocking` queue is blocked on something, nothing bad will happen. Part of #5737.	2023-11-01 17:38:32 +02:00
Muhammet Yazici	4f0a8e92ad	fix: Add bearer prefix to Authorization header (#5740 ) ## Problem Some requests with `Authorization` header did not properly set the `Bearer ` prefix. Problem explained here https://github.com/neondatabase/cloud/issues/6390. ## Summary of changes Added `Bearer ` prefix to missing requests.	2023-11-01 09:41:48 +03:00
Konstantin Knizhnik	5952f350cb	Always handle POLLHUP in walredo error poll loop (#5716 ) ## Problem test_stderr hangs on MacOS. See https://neondb.slack.com/archives/C036U0GRMRB/p1698438997903919 ## Summary of changes Always handle POLLHUP to prevent infinite loop. ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2023-10-31 20:57:03 +02:00
Tristan Partin	726c8e6730	Add docs for updating Postgres for new minor versions	2023-10-31 12:31:14 -05:00
Em Sharnoff	f7067a38b7	compute_ctl: Assume --vm-monitor-addr arg is always present (#5611 ) It has a default value, so this should be sound. Treating its presence as semantically significant was leading to spurious warnings.	2023-10-31 10:00:23 -07:00
Joonas Koivunen	896347f307	refactor(layer): remove version checking with atomics (#5742 ) The `LayerInner::version` never needed to be read in more than one place. Clarified while fixing #5737 of which this is the first step. This decrements possible wrong atomics usage in Layer, but does not really fix anything.	2023-10-31 18:40:08 +02:00
John Spray	e5c81fef86	tests: minor improvements (#5674 ) Minor changes from while I have been working on HA tests: - Manual pytest executions came with some warnings from `log.warn()` usage - When something fails in a generations-enabled test, it is useful to have a log from the attachment service of what attached when, and with which generation. --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-10-31 11:44:35 +00:00
Christian Schwarz	7ebe9ca1ac	pageserver: `/attach`: clarify semantics of 409 (#5698 ) context: https://app.incident.io/neondb/incidents/75 specifically: https://neondb.slack.com/archives/C0634NXQ6E7/p1698422852902959?thread_ts=1698419362.155059&cid=C0634NXQ6E7	2023-10-31 09:32:58 +01:00
Shany Pozin	1588601503	Move release PR creation to Friday (#5721 ) Prepare for a new release workflow * Release PR is created on Fridays * The discussion/approval happens during Friday * Sunday morning the deployment will be done in central-il and perf tests will be run * On Monday early IST morning gradually start rolling (starting from US regions as they are still in weekend time) See slack for discussion: https://neondb.slack.com/archives/C04P81J55LK/p1698565305607839?thread_ts=1698428241.031979&cid=C04P81J55LK	2023-10-30 22:10:24 +01:00
John Spray	9c35e1e6e5	pageserver: downgrade slow task warnings from warn to info (#5724 ) ## Problem In #5658 we suppressed the first-iteration output from these logs, but the volume of warnings is still problematic. ## Summary of changes - Downgrade all slow task warnings to INFO. The information is still there if we actively want to know about which tasks are running slowly, without polluting the overall stream of warnings with situations that are unsurprising to us. - Revert the previous change so that we output on the first iteration as we used to do. There is no reason to suppress these, now that the severity is just info.	2023-10-30 18:32:30 +00:00
Conrad Ludgate	d8c21ec70d	fix nightly 1.75 (#5719 ) ## Problem Neon doesn't compile on nightly and had numerous clippy complaints. ## Summary of changes 1. Fixed troublesome dependency 2. Fixed or ignored the lints where appropriate	2023-10-30 16:43:06 +00:00
Konstantin Knizhnik	ad99fa5f03	Grant BYPASSRLS and REPLICATION to existing roles (#5657 ) ## Problem Roles need to have the REPLICATION privilege to be usable for logical replication. New roles are created with this option. This PR tries to update existing roles. ## Summary of changes Update roles in `handle_roles` method ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2023-10-30 15:29:25 +00:00
John Spray	e675f4cec8	tests: fix missing comma in test_timeline_deletion_with_files_stuck_… (#5713 ) …in_upload_queue This was a syntax mistake in https://github.com/neondatabase/neon/pull/5149 We didn't notice because the situation the log allow list covers is a relatively rare race.	2023-10-30 15:18:32 +00:00
Joonas Koivunen	4db8efb2cf	Layer: logging fixes (#5676 ) - include Layer generation in the default display, with Generation::Broken as `-broken` - omit layer from `layer_gc` span because the api it works with needs to support N layers, so the api needs to log each layer	2023-10-30 16:21:30 +02:00
John Spray	07c2b29895	pageserver: fix error logging on stray timeline files (#5712 ) ## Problem If there were stray files in the timelines/ dir after tenant deletion, pageserver could panic on out of range. ## Summary of changes Use iterator `take()`, which doesn't care if the number of elements available is less than requested.	2023-10-30 13:24:52 +00:00
Konstantin Knizhnik	9cdffd164a	Prevent SIGSEGV in apply_error_callback when record was not decoded (#5703 ) ## Problem See https://neondb.slack.com/archives/C036U0GRMRB/p1698652221399419?thread_ts=1698438997.903919&cid=C036U0GRMRB ## Summary of changes Check if record pointer is not NULL before trying to print record descriptor ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2023-10-30 12:06:08 +02:00
John Spray	87db4b441c	pageserver: cleaner shutdown in timeline delete (#5701 ) The flush task logs a backtrace if it tries to upload and remote timeline client is already in stopped state. Therefore we cannot shut them down concurrently: flush task must be shut down first. This wasn't more obvious because: - Timeline deletions IRL usually happen when not much is being written - In tests, there is a global allow-list for this log It's not obvious whether removing the global log allow list is safe, this PR was prompted by how the log spam got in my way when testing deletion changes.	2023-10-30 09:18:40 +00:00
Conrad Ludgate	964c5c56b7	proxy: dont retry server errors (#5694 ) ## Problem accidental spam ## Summary of changes don't spam control plane if control plane is down :) ## Checklist before requesting a review - [x] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist	2023-10-30 08:38:56 +00:00
Arpad Müller	bd59349af3	Fix Rust 1.74 warnings (#5702 ) Fixes new warnings and clippy changes introduced by version 1.74 of the rust compiler toolchain.	2023-10-28 03:47:26 +02:00
Joonas Koivunen	2bd79906d9	fix: possible page_service hang on cancel (#5696 ) Fixes #5341, one more suspected case, see: https://github.com/neondatabase/neon/issues/5341#issuecomment-1783052379 - races `MaybeWriteOnly::shutdown` with cancellation - switches to using `AsyncWriteExt::write_buf` - notes cancellation safety for shutdown	2023-10-27 19:09:34 +01:00
Conrad Ludgate	493b47e1da	proxy: exclude client latencies in metrics (#5688 ) ## Problem In #5539, I moved the connect_to_compute latency to start counting before authentication - this is because authentication will perform some calls to the control plane in order to get credentials and to eagerly wake a compute server. It felt important to include these times in the latency metric as these are times we should definitely care about reducing. What is not interesting to record in this metric is the roundtrip time during authentication when we wait for the client to respond. ## Summary of changes Implement a mechanism to pause the latency timer, resuming on drop of the pause struct. We pause the timer right before we send the authentication message to the client, and we resume the timer right after we complete the authentication flow.	2023-10-27 17:17:39 +00:00
John Spray	c13e932c3b	pageserver: add generation fields in openapi spec (#5690 ) These optional fields have existed for a while, but weren't mentioned in `openapi_spec.yaml` yet.	2023-10-27 14:20:04 +01:00
Gleb Novikov	a5292f7e67	Some minor renames in attachment service API (#5687 ) ## Problem ## Summary of changes ## Checklist before requesting a review - [x] I have performed a self-review of my code. - [ ] ~~If it is a core feature, I have added thorough tests.~~ - [ ] ~~Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?~~ - [ ] ~~If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.~~ ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist	2023-10-27 12:36:34 +01:00
Arthur Petukhovsky	262348e41b	Fix safekeeper log spans (#5643 ) We were missing spans with ttid in "WAL backup" and several other places, this commit should fix it. Here are the examples of logs before and after: https://gist.github.com/petuhovskiy/711a4a4e7ddde3cab3fa6419b2f70fb9	2023-10-27 12:09:02 +01:00
Joonas Koivunen	68f15cf967	fix: schedule_compaction_update must only unlink (#5675 ) #5649 added the concept of dangling layers which #4938 uses but only partially. I forgot to change `schedule_compaction_update` to not schedule deletions to uphold the "have a layer, you can read it". With the now remembered fix, I don't think these checks should ever fail except for a mistake I already did. These changes might be useful for protecting future changes, even though the Layer carrying the generation AND the `schedule_(gc\|compaction)_update` require strong arcs. Rationale for keeping the `#[cfg(feature = "testing")]` is worsening any leak situation which might come up.	2023-10-27 11:16:01 +01:00
duguorong009	39f8fd6945	feat: add `build_tag` env support for `set_build_info_metric` (#5576 ) - Add a new util `project_build_tag` macro, similar to `project_git_version` - Update the `set_build_info_metric` to accept and make use of `build_tag` info - Update all codes which use the `set_build_info_metric`	2023-10-27 10:47:11 +01:00
John Spray	83567f9e4e	tests: revise perf test that interfered with local disk state (#5682 ) This benchmark started failing after #5580 merged. It was manually deleting some local content on a pageserver, and expecting the behavior that the pageserver would "forget" about the timeline on startup as a result. That is no longer our behavior: pageservers use the remote storage as the source of truth. Rather than having the test go manually delete things at all, we can just delete the whole tenant via the pageserver API, and thereby start from a clean situation.	2023-10-27 09:23:49 +01:00
Conrad Ludgate	71611f4ab3	proxy: prepare to remove high cardinality metrics (#5461 ) ## Problem High cardinality metrics are bad ## Summary of changes Preparing to remove high cardinality metrics. Will actually remove in #5466	2023-10-26 22:54:37 +01:00
John Spray	7c16b5215e	scrubber: add separate find/purge garbage commands (#5409 ) ## Problem The previous garbage cleanup functionality relied on doing a dry run, inspecting logs, and then doing a deletion. This isn't ideal, because what one actually deletes might not be the same as what one saw in the dry run. It's also risky UX to rely on presence/absence of one CLI flag to control deletion: ideally the deletion command should be totally separate from the one that scans the bucket. Related: https://github.com/neondatabase/neon/issues/5037 ## Summary of changes This is a major re-work of the code, which results in a net decrease in line count of about 600. The old code for removing garbage was built around the idea of doing discovery and purging together: a "delete_batch_producer" sent batches into a deleter. The new code writes out both procedures separately, in functions that use the async streams introduced in https://github.com/neondatabase/neon/pull/5176 to achieve fast concurrent access to S3 while retaining the readability of a single function. - Add `find-garbage`, which writes out a JSON file of tenants/timelines to purge - Add `purge-garbage` which consumes the garbage JSON file, applies some extra validations, and does deletions. - The purge command will refuse to execute if the garbage file indicates that only garbage was found: this guards against classes of bugs where the scrubber might incorrectly deem everything garbage. - The purge command defaults to only deleting tenants that were found in "deleted" state in the control plane. This guards against the risk that using the wrong console API endpoint could cause all tenants to appear to be missing. Outstanding work for a future PR: - Make whatever changes are needed to adapt to the Console/Control Plane separation. 
- Make purge even safer by checking S3 `Modified` times for index_part.json files (not doing this here, because it will depend on the generation-aware changes for finding index_part.json files) ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com> Co-authored-by: Shany Pozin <shany@neon.tech>	2023-10-26 20:36:28 +01:00
Em Sharnoff	39b148b74e	Bump vm-builder v0.18.2 -> v0.18.4 (#5666 ) Only applicable change was neondatabase/autoscaling#584, setting pgbouncer auth_dbname=postgres in order to fix superuser connections from preventing dropping databases.	2023-10-26 20:04:57 +01:00
Sasha Krassovsky	116c342cad	Support changing pageserver dynamically (#5542 ) ## Problem We currently require full restart of compute if we change the pageserver url ## Summary of changes Makes it so that we don't have to do a full restart, but can just send SIGHUP	2023-10-26 10:56:07 -07:00
John Spray	ba4fe9e10f	pageserver: fix the second "AUX files" warning (#5673 ) In https://github.com/neondatabase/neon/pull/5669 I didn't notice that the same warning is logged in two places: fix the other one.	2023-10-26 13:54:52 +00:00
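
Note on commit 6defa2b5d5: the shutdown pattern it describes (requests hold the gate while they use the object; shutdown sets the cancellation token and then closes the gate to wait for requests in progress) can be sketched roughly as below. This is a minimal synchronous sketch using only the standard library, not the pageserver's implementation. The names `Gate::enter`, `GateGuard`, `GateClosed` and `shutdown_demo` are hypothetical, and an `AtomicBool` stands in for the `CancellationToken`.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Shared state: whether the gate is closed and how many guards are alive.
struct GateState {
    closed: bool,
    in_flight: usize,
}

/// Hypothetical blocking gate (assumption: the real one is async).
pub struct Gate {
    state: Mutex<GateState>,
    drained: Condvar,
}

/// Held by a request for as long as it uses the gated object.
pub struct GateGuard {
    gate: Arc<Gate>,
}

/// Returned when a request arrives after shutdown has started.
#[derive(Debug, PartialEq)]
pub struct GateClosed;

impl Gate {
    pub fn new() -> Arc<Gate> {
        Arc::new(Gate {
            state: Mutex::new(GateState { closed: false, in_flight: 0 }),
            drained: Condvar::new(),
        })
    }

    /// Try to enter the gate; fails once `close()` has been called.
    pub fn enter(gate: &Arc<Gate>) -> Result<GateGuard, GateClosed> {
        let mut st = gate.state.lock().unwrap();
        if st.closed {
            return Err(GateClosed);
        }
        st.in_flight += 1;
        Ok(GateGuard { gate: Arc::clone(gate) })
    }

    /// Refuse new entries, then block until every guard has been dropped.
    pub fn close(&self) {
        let mut st = self.state.lock().unwrap();
        st.closed = true;
        while st.in_flight > 0 {
            st = self.drained.wait(st).unwrap();
        }
    }
}

impl Drop for GateGuard {
    fn drop(&mut self) {
        let mut st = self.gate.state.lock().unwrap();
        st.in_flight -= 1;
        if st.in_flight == 0 {
            // Wake a `close()` call that is waiting for the gate to drain.
            self.gate.drained.notify_all();
        }
    }
}

/// Shutdown order from the commit message: cancel first, then close the gate.
/// Returns true if the gate rejects new entries after shutdown.
pub fn shutdown_demo() -> bool {
    let gate = Gate::new();
    let cancel = Arc::new(AtomicBool::new(false));

    let guard = Gate::enter(&gate).expect("gate is open");
    let worker_cancel = Arc::clone(&cancel);
    let worker = thread::spawn(move || {
        let _guard = guard; // released when the worker returns
        while !worker_cancel.load(Ordering::SeqCst) {
            thread::sleep(Duration::from_millis(1));
        }
    });

    cancel.store(true, Ordering::SeqCst); // ask in-flight work to stop
    gate.close(); // wait until it has actually stopped
    worker.join().unwrap();
    Gate::enter(&gate).is_err()
}
```

Setting the flag before calling `close()` matters in this sketch: closing first would block forever on a worker that was never told to stop.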


3986 Commits