rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-04 12:40:37 +00:00

Author	SHA1	Message	Date
Joonas Koivunen	27bdbf5e36	chore(layer): restore logging, doc changes (#5766 ) Some of the log messages were lost with the #4938. This PR adds some of them back, most notably: - starting to on-demand download - successful completion of on-demand download - ability to see when there were many waiters for the layer download - "unexpectedly on-demand downloading ..." is now `info!` Additionally some rare events are logged as error, which should never happen.	2023-11-02 19:05:33 +00:00
Joonas Koivunen	098d3111a5	fix(layer): get_and_upgrade and metrics (#5767 ) when introducing `get_and_upgrade` I forgot that an `evict_and_wait` would have already incremented the counter for started evictions, but an upgrade would just "silently" cancel the eviction as no drop would ever run. these metrics are likely sources for alerts with the next release, so it's important to keep them correct.	2023-11-02 13:06:14 +00:00
Joonas Koivunen	3737fe3a4b	fix(layer): error out early if layer path is non-file (#5756 ) In an earlier PR https://github.com/neondatabase/neon/pull/5743#discussion_r1378625244 I added a FIXME and there's a simple solution suggested by @jcsp, so implement it. Wondering why I did not implement this originally, there is no concept of a permanent failure, so this failure will happen quite often. I don't think the frequency is a problem however. Sadly for std::fs::FileType there is only decimal and hex formatting, no octal.	2023-11-02 11:03:38 +00:00
John Spray	5650138532	pageserver: helpers for explicitly dying on fatal I/O errors (#5651 ) Following from discussion on https://github.com/neondatabase/neon/pull/5436 where hacking an implicit die-on-fatal-io behavior into an Error type was a source of disagreement -- in this PR, dying on fatal I/O errors is explicit, with `fatal_err` and `maybe_fatal_err` helpers in the `MaybeFatalIo` trait, which is implemented for std::io::Result. To enable this approach with `crashsafe_overwrite`, the return type of that function is changed to std::io::Result -- the previous error enum for this function was not used for any logic, and the utility of saying exactly which step in the function failed is outweighed by the hygiene of having an I/O function return an io::Result. The initial use case for these helpers is the deletion queue.	2023-11-02 09:14:26 +00:00
Joonas Koivunen	2dca4c03fc	feat(layer): cancellable get_or_maybe_download (#5744 ) With the layer implementation as was done in #4938, it is possible via cancellation to cause two concurrent downloads on the same path, due to how `RemoteTimelineClient::download_remote_layer` does tempfiles. Thread the init semaphore through the spawned task of downloading to make this impossible to happen.	2023-11-02 08:06:32 +00:00
Joonas Koivunen	e82d1ad6b8	fix(layer): reinit on access before eviction happens (#5743 ) Right before merging, I added a loop to `fn LayerInner::get_or_maybe_download`, which was always supposed to be there. However I had forgotten to restart initialization instead of waiting for the eviction to happen to support original design goal of "eviction should always lose to redownload (or init)". This was wrong. After this fix, if `spawn_blocking` queue is blocked on something, nothing bad will happen. Part of #5737.	2023-11-01 17:38:32 +02:00
Muhammet Yazici	4f0a8e92ad	fix: Add bearer prefix to Authorization header (#5740 ) ## Problem Some requests with `Authorization` header did not properly set the `Bearer ` prefix. Problem explained here https://github.com/neondatabase/cloud/issues/6390. ## Summary of changes Added `Bearer ` prefix to missing requests.	2023-11-01 09:41:48 +03:00
Konstantin Knizhnik	5952f350cb	Always handle POLLHUP in walredo error poll loop (#5716 ) ## Problem test_stderr hangs on MacOS. See https://neondb.slack.com/archives/C036U0GRMRB/p1698438997903919 ## Summary of changes Always handle POLLHUP to prevent infinite loop. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2023-10-31 20:57:03 +02:00
Joonas Koivunen	896347f307	refactor(layer): remove version checking with atomics (#5742 ) The `LayerInner::version` never needed to be read in more than one place. Clarified while fixing #5737 of which this is the first step. This decrements possible wrong atomics usage in Layer, but does not really fix anything.	2023-10-31 18:40:08 +02:00
Christian Schwarz	7ebe9ca1ac	pageserver: `/attach`: clarify semantics of 409 (#5698 ) context: https://app.incident.io/neondb/incidents/75 specifically: https://neondb.slack.com/archives/C0634NXQ6E7/p1698422852902959?thread_ts=1698419362.155059&cid=C0634NXQ6E7	2023-10-31 09:32:58 +01:00
John Spray	9c35e1e6e5	pageserver: downgrade slow task warnings from warn to info (#5724 ) ## Problem In #5658 we suppressed the first-iteration output from these logs, but the volume of warnings is still problematic. ## Summary of changes - Downgrade all slow task warnings to INFO. The information is still there if we actively want to know about which tasks are running slowly, without polluting the overall stream of warnings with situations that are unsurprising to us. - Revert the previous change so that we output on the first iteration as we used to do. There is no reason to suppress these, now that the severity is just info.	2023-10-30 18:32:30 +00:00
Conrad Ludgate	d8c21ec70d	fix nightly 1.75 (#5719 ) ## Problem Neon doesn't compile on nightly and had numerous clippy complaints. ## Summary of changes 1. Fixed troublesome dependency 2. Fixed or ignored the lints where appropriate	2023-10-30 16:43:06 +00:00
Joonas Koivunen	4db8efb2cf	Layer: logging fixes (#5676 ) - include Layer generation in the default display, with Generation::Broken as `-broken` - omit layer from `layer_gc` span because the api it works with needs to support N layers, so the api needs to log each layer	2023-10-30 16:21:30 +02:00
John Spray	07c2b29895	pageserver: fix error logging on stray timeline files (#5712 ) ## Problem If there were stray files in the timelines/ dir after tenant deletion, pageserver could panic on out of range. ## Summary of changes Use iterator `take()`, which doesn't care if the number of elements available is less than requested.	2023-10-30 13:24:52 +00:00
John Spray	87db4b441c	pageserver: cleaner shutdown in timeline delete (#5701 ) The flush task logs a backtrace if it tries to upload and remote timeline client is already in stopped state. Therefore we cannot shut them down concurrently: flush task must be shut down first. This wasn't more obvious because: - Timeline deletions IRL usually happen when not much is being written - In tests, there is a global allow-list for this log It's not obvious whether removing the global log allow list is safe, this PR was prompted by how the log spam got in my way when testing deletion changes.	2023-10-30 09:18:40 +00:00
Arpad Müller	bd59349af3	Fix Rust 1.74 warnings (#5702 ) Fixes new warnings and clippy changes introduced by version 1.74 of the rust compiler toolchain.	2023-10-28 03:47:26 +02:00
John Spray	c13e932c3b	pageserver: add generation fields in openapi spec (#5690 ) These optional fields have existed for a while, but weren't mentioned in `openapi_spec.yaml` yet.	2023-10-27 14:20:04 +01:00
Gleb Novikov	a5292f7e67	Some minor renames in attachment service API (#5687 )	2023-10-27 12:36:34 +01:00
Joonas Koivunen	68f15cf967	fix: schedule_compaction_update must only unlink (#5675 ) #5649 added the concept of dangling layers which #4938 uses but only partially. I forgot to change `schedule_compaction_update` to not schedule deletions to uphold the "have a layer, you can read it". With the now remembered fix, I don't think these checks should ever fail except for a mistake I already did. These changes might be useful for protecting future changes, even though the Layer carrying the generation AND the `schedule_(gc\|compaction)_update` require strong arcs. Rationale for keeping the `#[cfg(feature = "testing")]` is worsening any leak situation which might come up.	2023-10-27 11:16:01 +01:00
duguorong009	39f8fd6945	feat: add `build_tag` env support for `set_build_info_metric` (#5576 ) - Add a new util `project_build_tag` macro, similar to `project_git_version` - Update the `set_build_info_metric` to accept and make use of `build_tag` info - Update all codes which use the `set_build_info_metric`	2023-10-27 10:47:11 +01:00
John Spray	ba4fe9e10f	pageserver: fix the second "AUX files" warning (#5673 ) In https://github.com/neondatabase/neon/pull/5669 I didn't notice that the same warning is logged in two places: fix the other one.	2023-10-26 13:54:52 +00:00
John Spray	de90bf4663	pageserver: always load remote metadata (no more `spawn_load`) (#5580 ) ## Problem The pageserver had two ways of loading a tenant: - `spawn_load` would trust on-disk content to reflect all existing timelines - `spawn_attach` would list timelines in remote storage. It was incorrect for `spawn_load` to trust local disk content, because it doesn't know if the tenant might have been attached and written somewhere else. To make this correct would require some generation number checks, but the payoff is to avoid one S3 op per tenant at startup, so it's not worth the complexity -- it is much simpler to have one way to load a tenant. ## Summary of changes - `Tenant` objects are always created with `Tenant::spawn`: there is no more distinction between "load" and "attach". - The ability to run without remote storage (for `neon_local`) is preserved by adding a branch inside `attach` that uses a fallback `load_local` if no remote_storage is present. - Fix attaching a tenant when it has a timeline with no IndexPart: this can occur if a newly created timeline manages to upload a layer before it has uploaded an index. - The attach marker file that used to indicate whether a tenant should be "loaded" or "attached" is no longer needed, and is removed. - The GenericRemoteStorage interface gets a `list()` method that maps more directly to what ListObjects does, returning both keys and common prefixes. The existing `list_files` and `list_prefixes` methods are just calls into `list()` now -- these can be removed later if we would like to shrink the interface a bit. - The remote deletion marker is moved into `timelines/` and detected as part of listing timelines rather than as a separate GET request. If any existing tenants have a marker in the old location (unlikely, only happens if something crashes mid-delete), then they will rely on the control plane retrying to complete their deletion. 
- Revise S3 calls for timeline listing and tenant load to take a cancellation token, and retry forever: it never makes sense to make a Tenant broken because of a transient S3 issue. ## Breaking changes - The remote deletion marker is moved from `deleted` to `timelines/deleted` within the tenant prefix. Markers in the old location will be ignored: it is the control plane's responsibility to retry deletions until they succeed. Markers in the new location will be tolerated by the previous release of pageserver via https://github.com/neondatabase/neon/pull/5632 - The local `attaching` marker file is no longer written. Therefore, if the pageserver is downgraded after running this code, the old pageserver will not be able to distinguish between partially attached tenants and fully attached tenants. This would only impact tenants that were partway through attaching at the moment of downgrade. In the unlikely event that we do experience an incident that prompts us to roll back, then we may check for attach operations in flight, and manually insert `attaching` marker files as needed. --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-10-26 14:48:44 +01:00
John Spray	8360307ea0	pageserver: exponential backoff on compaction/GC failures (#5672 ) Previously, if walredo process crashed we would try to spawn a fresh one every 2 seconds, which is expensive in itself, but also results in a high I/O load from the part of the compaction prior to the failure, which we re-run every 2 seconds. Closes: https://github.com/neondatabase/neon/issues/5671	2023-10-26 14:00:26 +01:00
John Spray	e0ebdfc7ce	pageserver: suppress compaction/gc errors while stopping (#5670 ) ## Problem Tenant deletions would sometimes be accompanied by compaction stack traces, because `shutdown()` puts the tenant into stopping state before it joins background tasks. ## Summary of changes Treat GC+Compaction as no-ops on a Stopping tenant.	2023-10-26 10:59:24 +01:00
Joonas Koivunen	c508d3b5fa	reimpl Layer, remove remote layer, trait Layer, trait PersistentLayer (#4938 ) Implement a new `struct Layer` abstraction which manages downloadness internally, requiring no LayerMap locking or rewriting to download or evict providing a property "you have a layer, you can read it". The new `struct Layer` provides ability to keep the file resident via a RAII structure for new layers which still need to be uploaded. Previous solution solved this with `RemoteTimelineClient::wait_completion`, which led to bugs like #5639. Evicting or the final local deletion after garbage collection is done using Arc'd value `Drop`. With a single `struct Layer` the closed open ended `trait Layer`, `trait PersistentLayer` and `struct RemoteLayer` are removed following noting that compaction could be simplified by simply not using any of the traits in between: #4839. The new `struct Layer` is a preliminary to remove `Timeline::layer_removal_cs` documented in #4745. Preliminaries: #4936, #4937, #5013, #5014, #5022, #5033, #5044, #5058, #5059, #5061, #5074, #5103, epic #5172, #5645, #5649. Related split off: #5057, #5134.	2023-10-26 12:36:38 +03:00
John Spray	acda65d7d4	pageserver: quieten "Failed to get info about AUX files" (#5669 ) ## Problem This line caused lots of errors to be emitted for healthy tenants. ## Summary of changes Downgrade to debug, since it is an expected code path we'll take for tenants at startup.	2023-10-26 09:53:18 +01:00
Joonas Koivunen	f70019797c	refactor(rtc): schedule compaction update (#5649 ) a single operation instead of N uploads and 1 deletion scheduling with write(layer_map) lock releasing in the between. Compaction update will make for a much better place to change how the operation will change in future compared to more general file based operations. builds upon #5645. solves the problem of difficult to see hopeful correctness w.r.t. other `index_part.json` changing operations. Co-authored-by: Shany Pozin <shany@neon.tech>	2023-10-25 22:25:43 +01:00
Joonas Koivunen	325258413a	fix: trampling on global physical size metric (#5663 ) All loading (attached, or from disk) timelines overwrite the global gauge for physical size. The `_set` method cannot be used safely, so remove it and just "add" the physical size.	2023-10-25 19:29:12 +01:00
Konstantin Knizhnik	4ddbc0e46d	Ignore missed AUX_FILES_KEY when generating image layer (#5660 ) ## Problem Logical replication requires new AUX_FILES_KEY which is definitely absent in existing databases. We do not have a function to check if a key exists in our KV storage. So I have to handle the error in `list_aux_files` method. But this key is also included in key space range and accessed by `create_image_layer` method. ## Summary of changes Check if AUX_FILES_KEY exists before including it in keyspace. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Shany Pozin <shany@neon.tech> Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2023-10-25 18:35:23 +01:00
Arpad Müller	a673e4e7a9	Optionally return json from get_lsn_by_timestamp (#5608 ) This does two things: first a minor refactor to not use HTTP/1.x style header names and also to not panic if some certain requests had no "Accept" header. As a second thing, it addresses the third bullet point from #3689: > Change `get_lsn_by_timestamp` API method to return LSN even if we only found commit before the specified timestamp. This is done by adding a version parameter to the `get_lsn_by_timestamp` API call and making its behaviour depend on the version number. Part of #3414 (but doesn't address it in its entirety). --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-10-25 18:46:34 +02:00
John Spray	5683ae9eab	pageserver: suppress some of the most common spurious warnings (#5658 ) Two of the most common spurious log messages: - broker connections terminate & we log at error severity. Unfortunately tonic gives us an "Unknown" error so to suppress these we're doing string matching. It's hacky but worthwhile for operations. - the first iteration of tenant background tasks tends to over-run its schedule and emit a warning. Ultimately we should fix these to run on time, but for now we are not benefiting from polluting our logs with the warnings.	2023-10-25 14:55:37 +01:00
Joonas Koivunen	4ae2d1390d	refactor(remote_timeline_client): Split deletion into unlinking + deletion (#5645 ) Quest: #4745. Prerequisite for #4938. Original https://github.com/neondatabase/neon/pull/4938#issuecomment-1777150665. The new Layer implementation has so far been using `RemoteTimelineClient::schedule_layer_file_deletion` from `Layer::drop` but it was noticed that this could mean that the L0s compaction wanted to remove could linger in the index part for longer time or be left there for longer time. Solution is to split the `RemoteTimelineClient::schedule_layer_file_deletion` into two parts: - unlinking from index_part.json, to be called from end of compaction and gc - scheduling of actual deletions, to be called from `Layer::drop` The added methods are added unused.	2023-10-25 15:01:19 +03:00
Joonas Koivunen	c5949e1fd6	misc smaller improvements (#5527 ) - finally add an `#[instrument]` to Timeline::create_image_layers, making it easier to see that something is happening because we create image layers - format some macro context code - add a warning not to create new validation functions a la parse do not validate Split off from #5198.	2023-10-25 14:59:43 +03:00
John Spray	a0b862a8bd	pageserver: schedule frozen layer uploads inside the layers lock (#5639 ) ## Problem Compaction's source of truth for what layers exist is the LayerManager. `flush_frozen_layer` updates LayerManager before it has scheduled upload of the frozen layer. Compaction can then "see" the new layer, decide to delete it, schedule uploads of replacement layers, all before `flush_frozen_layer` wakes up again and schedules the upload. When the upload is scheduled, the local layer file may be gone, in which case we end up with no such layer in remote storage, but an entry still added to IndexPart pointing to the missing layer. ## Summary of changes Schedule layer uploads inside the `self.layers` lock, so that whenever a frozen layer is present in LayerManager, it is also present in RemoteTimelineClient's metadata. Closes: #5635	2023-10-24 13:57:01 +01:00
John Spray	188f67e1df	pageserver: forward compat: be tolerant of deletion marker in `timelines/` (#5632 ) ## Problem https://github.com/neondatabase/neon/pull/5580 will move the remote deletion marker into the `timelines/` path. This would cause old pageserver code to fail loading the tenant due to an apparently invalid timeline ID. That would be a problem if we had to roll back after deploying #5580 ## Summary of changes If a `deleted` file is in `timelines/` just ignore it.	2023-10-23 17:51:38 +02:00
John Spray	7e805200bb	pageserver: parallel load of configs (#5607 ) ## Problem When the number of tenants is large, sequentially issuing the open/read calls for their config files is a ~1000ms delay during startup. It's not a lot, but it's simple to fix. ## Summary of changes Put all the config loads into spawn_blocking() tasks and run them in a JoinSet. We can simplify this a bit later when we have full async disk I/O. --------- Co-authored-by: Shany Pozin <shany@neon.tech>	2023-10-23 15:32:34 +01:00
Christian Schwarz	c6ca1d76d2	consumption_metrics: fix periodicness behavior & reporting (#5625 ) Before this PR, the ticker was running at default miss behavior `Delay`. For example, here is the startup output with 25k tenants: ``` 2023-10-19T09:57:21.682466Z INFO synthetic_size_worker: starting calculate_synthetic_size_worker 2023-10-19T10:50:44.678202Z WARN synthetic_size_worker: task iteration took longer than the configured period elapsed=3202.995707156s period=10m task=ConsumptionMetricsSyntheticSizeWorker 2023-10-19T10:52:17.408056Z WARN synthetic_size_worker: task iteration took longer than the configured period elapsed=2695.72556035s period=10m task=ConsumptionMetricsSyntheticSizeWorker ``` The first message's `elapsed` value is correct. It matches the delta between the log line timestamps. The second one is logged ca 1.5min after, though, but reports a much larger `elapsed` than 1.5min. This PR fixes the behavior by copying what `eviction_task.rs` does.	2023-10-23 16:31:38 +02:00
John Spray	41ee75bc71	pageserver: do config writes in a spawn_blocking (#5603 ) ## Problem We now persist tenant configuration every time we spawn a tenant. The persist_tenant_config function is doing a series of non-async filesystem I/O, because `crashsafe::` isn't async yet. This isn't a demonstrated problem, but is a source of uncertainty when reasoning about what's happening with our startup times. ## Summary of changes - Wrap `crashsafe_overwrite` in `spawn_blocking`. - Although I think this change makes sense, it does not have a measurable impact on load time when testing with 10k tenants. - This can be reverted when we have full async I/O	2023-10-23 09:19:01 +01:00
Christian Schwarz	11e523f503	walredo: fix EAGAIN/"os error 11" false page reconstruction failures (#5560 ) Stacked atop https://github.com/neondatabase/neon/pull/5559 Before this PR, there was the following race condition: ``` T1: polls for writeable stdin T1: writes to stdin T1: enters poll for stdout/stderr T2: enters poll for stdin write WALREDO: writes to stderr KERNEL: wakes up T1 and T2 Tx: reads stderr and prints it Ty: reads stderr and gets EAGAIN (valid values for (x, y) are (1, 2) or (2, 1)) ``` The concrete symptom that we observed repeatedly was with PG16, which started logging `registered custom resource manager` to stderr always, during startup, thereby giving us repeated opportunity to hit above race condition. PG14 and PG15 didn't log anything to stderr, hence we could have only hit this race condition if there was an actual error happening. This PR fixes the race by moving the reading of stderr into a tokio task. It exits when the stderr is closed by the child process, which in turn happens when the child exits, either by itself or because we killed it. The downside is that the async scheduling can reorder the log messages, which can be seen in the new `test_stderr`, which runs in a single-threaded runtime. I included the output below. Overall I think we should move the entire walredo to async, as Joonas proposed many months ago. This PR's asyncification is just the first step to resolve these false page reconstruction errors. After this is fixed, we should stop printing that annoying stderr message on walredo startup; it causes noise in the pageserver logs. That work is tracked in #5399 . 
``` 2023-10-13T19:05:21.878858Z ERROR apply_wal_records{tenant_id=d546fb76ba529195392fb4d19e243991 pid=753986}: failed to write out the walredo errored input: No such file or directory (os error 2) target=walredo-1697223921878-1132-0.walredo length=1132 2023-10-13T19:05:21.878932Z DEBUG postgres applied 2 WAL records (1062 bytes) in 114666 us to reconstruct page image at LSN 0/0 2023-10-13T19:05:21.878942Z ERROR error applying 2 WAL records 0/16A9388..0/16D4080 (1062 bytes) to base image with LSN 0/0 to reconstruct page image at LSN 0/0 n_attempts=0: apply_wal_records Caused by: WAL redo process closed its stdout unexpectedly 2023-10-13T19:05:21.879027Z INFO kill_and_wait_impl{pid=753986}: wait successful exit_status=signal: 11 (SIGSEGV) (core dumped) 2023-10-13T19:05:21.879079Z DEBUG wal-redo-postgres-stderr{pid=753986 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: wal-redo-postgres stderr_logger_task started 2023-10-13T19:05:21.879104Z ERROR wal-redo-postgres-stderr{pid=753986 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: received output output="2023-10-13 19:05:21.769 GMT [753986] LOG: registered custom resource manager \"neon\" with ID 134\n" 2023-10-13T19:05:21.879116Z DEBUG wal-redo-postgres-stderr{pid=753986 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: wal-redo-postgres stderr_logger_task finished 2023-10-13T19:05:22.004439Z ERROR apply_wal_records{tenant_id=d546fb76ba529195392fb4d19e243991 pid=754000}: failed to write out the walredo errored input: No such file or directory (os error 2) target=walredo-1697223922004-1132-0.walredo length=1132 2023-10-13T19:05:22.004493Z DEBUG postgres applied 2 WAL records (1062 bytes) in 125344 us to reconstruct page image at LSN 0/0 2023-10-13T19:05:22.004501Z ERROR error applying 2 WAL records 0/16A9388..0/16D4080 (1062 bytes) to base image with LSN 0/0 to reconstruct page image at LSN 0/0 n_attempts=1: apply_wal_records Caused by: WAL redo process closed its stdout unexpectedly 
2023-10-13T19:05:22.004588Z INFO kill_and_wait_impl{pid=754000}: wait successful exit_status=signal: 11 (SIGSEGV) (core dumped) 2023-10-13T19:05:22.004624Z DEBUG wal-redo-postgres-stderr{pid=754000 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: wal-redo-postgres stderr_logger_task started 2023-10-13T19:05:22.004653Z ERROR wal-redo-postgres-stderr{pid=754000 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: received output output="2023-10-13 19:05:21.884 GMT [754000] LOG: registered custom resource manager \"neon\" with ID 134\n" 2023-10-13T19:05:22.004666Z DEBUG wal-redo-postgres-stderr{pid=754000 tenant_id=d546fb76ba529195392fb4d19e243991 pg_version=16}: wal-redo-postgres stderr_logger_task finished ```	2023-10-23 09:00:13 +01:00
John Spray	a8899e1e0f	pageserver: apply timeout when waiting for tenant loads (#5601 ) ## Problem Loading tenants shouldn't hang. However, if it does, we shouldn't let one hung tenant prevent the entire process from starting background jobs. ## Summary of changes Generalize the timeout mechanism that we already applied to loading initial logical sizes: each phase in startup where we wait for a barrier is subject to a timeout, and startup will proceed if it doesn't complete within timeout. Startup metrics will still reflect the time when a phase actually completed, rather than when we skipped it. The code isn't the most beautiful, but that kind of reflects the awkwardness of await'ing on a future and then stashing it to await again later if we time out. I could imagine making this cleaner in future by waiting on a structure that doesn't self-destruct on wait() the way Barrier does, then make InitializationOrder into a structure that manages the series of waits etc.	2023-10-20 09:15:34 +01:00
Arpad Müller	f842b22b90	Add endpoint for querying time info for lsn (#5497 ) ## Problem See #5468. ## Summary of changes Add a new `get_timestamp_of_lsn` endpoint, returning the timestamp associated with the given lsn. Fixes #5468. --------- Co-authored-by: Shany Pozin <shany@neon.tech>	2023-10-19 04:50:49 +02:00
Konstantin Knizhnik	5c88213eaf	Logical replication (#5271 ) ## Problem See https://github.com/neondatabase/company_projects/issues/111 ## Summary of changes Save logical replication files in WAL at compute and include them in basebackup at pageserver. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Arseny Sher <sher-ars@yandex.ru>	2023-10-18 16:42:22 +03:00
John Spray	607d19f0e0	pageserver: clean up page service Result handling for shutdown/disconnect (#5504 ) ## Problem - QueryError always logged at error severity, even though disconnections are not true errors. - QueryError type is not expressive enough to distinguish actual errors from shutdowns. - In some functions we're returning Ok(()) on shutdown, in others we're returning an error ## Summary of changes - Add QueryError::Shutdown and use it in places we check for cancellation - Adopt consistent Result behavior: disconnects and shutdowns are always QueryError, not ok - Transform shutdown+disconnect errors to Ok(()) at the very top of the task that runs query handler - Use the postgres protocol error code for "admin shutdown" in responses to clients when we are shutting down. Closes: #5517	2023-10-18 13:28:38 +01:00
Christian Schwarz	9da67c4f19	walredo: make request_redo() an async fn (#5559 ) Stacked atop https://github.com/neondatabase/neon/pull/5557 Prep work for https://github.com/neondatabase/neon/pull/5560 These changes have a 2% impact on `bench_walredo`. That's likely because of the `block_on()` in the innermost piece of benchmark-only code. So, it doesn't affect production code. The use of closures in the benchmarking code prevents a straightforward conversion of the whole benchmarking code to async. before: ``` $ cargo bench --features testing --bench bench_walredo Compiling pageserver v0.1.0 (/home/cs/src/neon/pageserver) Finished bench [optimized + debuginfo] target(s) in 2m 11s Running benches/bench_walredo.rs (target/release/deps/bench_walredo-d99a324337dead70) Gnuplot not found, using plotters backend short/short/1 time: [26.363 µs 27.451 µs 28.573 µs] Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild short/short/2 time: [64.340 µs 64.927 µs 65.485 µs] Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) low mild short/short/4 time: [101.98 µs 104.06 µs 106.13 µs] short/short/8 time: [151.42 µs 152.74 µs 154.03 µs] short/short/16 time: [296.30 µs 297.53 µs 298.88 µs] Found 14 outliers among 100 measurements (14.00%) 10 (10.00%) high mild 4 (4.00%) high severe medium/medium/1 time: [225.12 µs 225.90 µs 226.66 µs] Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) low mild medium/medium/2 time: [490.80 µs 491.64 µs 492.49 µs] Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) low mild medium/medium/4 time: [934.47 µs 936.49 µs 938.52 µs] Found 5 outliers among 100 measurements (5.00%) 3 (3.00%) low mild 1 (1.00%) high mild 1 (1.00%) high severe medium/medium/8 time: [1.8364 ms 1.8412 ms 1.8463 ms] Found 4 outliers among 100 measurements (4.00%) 4 (4.00%) high mild medium/medium/16 time: [3.6694 ms 3.6896 ms 3.7104 ms] ``` after: ``` $ cargo bench --features testing --bench bench_walredo Compiling pageserver v0.1.0 
(/home/cs/src/neon/pageserver) Finished bench [optimized + debuginfo] target(s) in 2m 11s Running benches/bench_walredo.rs (target/release/deps/bench_walredo-d99a324337dead70) Gnuplot not found, using plotters backend short/short/1 time: [28.345 µs 28.529 µs 28.699 µs] change: [-0.2201% +3.9276% +8.2451%] (p = 0.07 > 0.05) No change in performance detected. Found 17 outliers among 100 measurements (17.00%) 4 (4.00%) low severe 5 (5.00%) high mild 8 (8.00%) high severe short/short/2 time: [66.145 µs 66.719 µs 67.274 µs] change: [+1.5467% +2.7605% +3.9927%] (p = 0.00 < 0.05) Performance has regressed. Found 5 outliers among 100 measurements (5.00%) 5 (5.00%) low mild short/short/4 time: [105.51 µs 107.52 µs 109.49 µs] change: [+0.5023% +3.3196% +6.1986%] (p = 0.02 < 0.05) Change within noise threshold. short/short/8 time: [151.90 µs 153.16 µs 154.41 µs] change: [-1.0001% +0.2779% +1.4221%] (p = 0.65 > 0.05) No change in performance detected. short/short/16 time: [297.38 µs 298.26 µs 299.20 µs] change: [-0.2953% +0.2462% +0.7763%] (p = 0.37 > 0.05) No change in performance detected. Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) high mild medium/medium/1 time: [229.76 µs 230.72 µs 231.69 µs] change: [+1.5804% +2.1354% +2.6635%] (p = 0.00 < 0.05) Performance has regressed. medium/medium/2 time: [501.14 µs 502.31 µs 503.64 µs] change: [+1.8730% +2.1709% +2.5199%] (p = 0.00 < 0.05) Performance has regressed. Found 7 outliers among 100 measurements (7.00%) 1 (1.00%) low mild 1 (1.00%) high mild 5 (5.00%) high severe medium/medium/4 time: [954.15 µs 956.74 µs 959.33 µs] change: [+1.7962% +2.1627% +2.4905%] (p = 0.00 < 0.05) Performance has regressed. medium/medium/8 time: [1.8726 ms 1.8785 ms 1.8848 ms] change: [+1.5858% +2.0240% +2.4626%] (p = 0.00 < 0.05) Performance has regressed. 
Found 6 outliers among 100 measurements (6.00%) 1 (1.00%) low mild 3 (3.00%) high mild 2 (2.00%) high severe medium/medium/16 time: [3.7565 ms 3.7746 ms 3.7934 ms] change: [+1.5503% +2.3044% +3.0818%] (p = 0.00 < 0.05) Performance has regressed. Found 3 outliers among 100 measurements (3.00%) 3 (3.00%) high mild ```	2023-10-18 11:23:06 +01:00
Christian Schwarz	3a8959a4c4	page_cache: remove dead code (#5493 )	2023-10-17 15:56:16 +01:00
Arpad Müller	093f8c5f45	Update rust to 1.73.0 (#5574 ) [Release notes](https://blog.rust-lang.org/2023/10/05/Rust-1.73.0.html)	2023-10-17 13:13:12 +01:00
Christian Schwarz	9256788273	limit imitate accesses concurrency, using same semaphore as compactions (#5578 ) Before this PR, when we restarted pageserver, we'd see a rush of `$number_of_tenants` concurrent eviction tasks starting to do imitate accesses building up in the period of `[init_order allows activations, $random_access_delay + EvictionPolicyLayerAccessThreshold::period]`. We simply cannot handle that degree of concurrent IO. We already solved the problem for compactions by adding a semaphore. So, this PR shares that semaphore for use by evictions. Part of https://github.com/neondatabase/neon/issues/5479 Which is again part of https://github.com/neondatabase/neon/issues/4743 Risks / Changes In System Behavior ================================== * we don't do evictions as timely as we currently do * we log a bunch of warnings about eviction taking too long * imitate accesses and compactions compete for the same concurrency limit, so, they'll slow each other down through this shared semaphore Changes ======= - Move the `CONCURRENT_COMPACTIONS` semaphore into `tasks.rs` - Rename it to `CONCURRENT_BACKGROUND_TASKS` - Use it also for the eviction imitate accesses: - Imitate accesses are both per-TIMELINE and per-TENANT - The per-TENANT is done through coalescing all the per-TIMELINE tasks via a tokio mutex `eviction_task_tenant_state`. - We acquire the CONCURRENT_BACKGROUND_TASKS permit early, at the beginning of the eviction iteration, much before the imitate accesses start (and they may not even start at all in the given iteration, as they happen only every $threshold). - Acquiring early is sub-optimal because when the per-timeline tasks coalesce on the `eviction_task_tenant_state` mutex, they are already holding a CONCURRENT_BACKGROUND_TASKS permit. - It's also unfair because tenants with many timelines win the CONCURRENT_BACKGROUND_TASKS more often. 
- I don't think there's another way though, without refactoring more of the imitate accesses logic, e.g., making it all per-tenant. - Add metrics for queue depth behind the semaphore. I found these very useful to understand what work is queued in the system. - The metrics are tagged by the new `BackgroundLoopKind`. - On a green slate, I would have used `TaskKind`, but we already had pre-existing labels whose names didn't map exactly to task kind. Also the task kind is kind of a lower-level detail, so, I think it's fine to have a separate enum to identify background work kinds. Future Work =========== I guess I could move the eviction tasks from a ticker to "sleep for $period". The benefit would be that the semaphore automatically "smears" the eviction task scheduling over time, so, we only have the rush on restart but a smeared-out rush afterward. The downside is that this perverts the meaning of "$period", as we'd actually not run the eviction at a fixed period. It also means that the "took too long" warning & metric becomes meaningless. Then again, that is already the case for the compaction and gc tasks, which do sleep for `$period` instead of using a ticker.	2023-10-17 11:29:48 +02:00
Joonas Koivunen	9e1449353d	crash-consistent layer map through index_part.json (#5198 ) Fixes #5172 as it: - removes reconciliation with remote index_part.json and accepts remote index_part.json as the truth, deleting any local progress which is yet to be reflected in remote - moves to prefer remote metadata Additionally: - tests with single LOCAL_FS parametrization are cleaned up - adds a test case for branched (non-bootstrap) local only timeline availability after restart --------- Co-authored-by: Christian Schwarz <christian@neon.tech> Co-authored-by: John Spray <john@neon.tech>	2023-10-17 10:04:56 +01:00
John Spray	b06dffe3dc	pageserver: fixes to `/location_config` API (#5548 ) ## Problem I found some issues with the `/location_config` API when writing new tests. ## Summary of changes - Calling the API with the "Detached" state is now idempotent. - `Tenant::spawn_attach` now takes a boolean to indicate whether to expect a marker file. Marker files are used in the old attach path, but not in the new location conf API. They aren't needed because in the New World, the choice of whether to attach via remote state ("attach") or to trust local state ("load") will be revised to cope with the transitions between secondary & attached (see https://github.com/neondatabase/neon/issues/5550). It is okay to merge this change ahead of that ticket, because the API is not used in the wild yet. - Instead of using `schedule_local_tenant_processing`, the location conf API handler does its own directory creation and calls `spawn_attach` directly. - A new `unsafe_create_dir_all` is added. This differs from crashsafe::create_dir_all in two ways: - It is intentionally not crashsafe, because in the location conf API we are no longer using directory or config existence as the signal for any important business logic. - It is async and uses `tokio::fs`.	2023-10-17 10:21:31 +02:00
Christian Schwarz	b08a0ee186	walredo: fix race condition where shutdown kills the wrong process (#5557 ) Before this PR, the following race condition existed: ``` T1: does the apply_wal_records() call and gets back an error T2: does the apply_wal_records() call and gets back an error T2: does the kill_and_shutdown T2: new loop iteration T2: launches new walredo process T1: does the kill_and_shutdown of the new process ``` That last step is wrong, T2 already did the kill_and_shutdown. The symptom of this race condition was that T2 would observe an error when it tried to do something with the process after T1 killed it. For example, but not limited to: `POLLHUP` / `"WAL redo process closed its stderr unexpectedly"`. The fix in this PR is the following: * Use Arc to represent walredo processes. The Arc lives at least as long as the walredo process. * Use Arc::ptr_eq to determine whether to kill the process or not. The price is an additional RwLock to protect the new `redo_process` field that holds the Arc. I guess that could perhaps be an atomic pointer swap some day. But, let's get one race fixed without risking introducing a new one. The use of Arc/drop is also not super great here because it now allows for an unlimited number of to-be-killed processes to exist concurrently. See the various `NB` comments above `drop(proc)` for why it's "ok" right now due to the blocking `wait` inside `drop`. Note: an earlier fix attempt was https://github.com/neondatabase/neon/pull/5545 where apply_batch_postgres would compare stdout_fd for equality. That's incorrect because the kernel can reuse the file descriptor when T2 launches the new process. Details: https://github.com/neondatabase/neon/pull/5545#pullrequestreview-1676589373	2023-10-17 09:55:39 +02:00
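
The fix described in b08a0ee186 (compare process handles with `Arc::ptr_eq` before killing) can be illustrated with a minimal sketch. This is not the pageserver code: `Process`, `Manager`, `get_or_launch` and `retire_if_current` are hypothetical names made up for illustration, and the "process" is a plain struct rather than a real child process. Only `Arc`, `RwLock` and `Arc::ptr_eq` come from the commit message.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for a walredo child process handle.
struct Process {
    id: u32,
}

/// Hypothetical manager holding the currently active process behind an RwLock.
struct Manager {
    current: RwLock<Option<Arc<Process>>>,
}

impl Manager {
    fn new() -> Self {
        Manager { current: RwLock::new(None) }
    }

    /// Returns the current process, "launching" one with `next_id` if none exists.
    fn get_or_launch(&self, next_id: u32) -> Arc<Process> {
        let mut guard = self.current.write().unwrap();
        if let Some(p) = &*guard {
            return Arc::clone(p);
        }
        let p = Arc::new(Process { id: next_id });
        *guard = Some(Arc::clone(&p));
        p
    }

    /// Retires `failed` only if it is still the current process.
    /// Returns true if this caller did the retirement.
    fn retire_if_current(&self, failed: &Arc<Process>) -> bool {
        let mut guard = self.current.write().unwrap();
        let is_current = match &*guard {
            // Pointer identity, not field equality: a stale handle never matches.
            Some(cur) => Arc::ptr_eq(cur, failed),
            None => false,
        };
        if is_current {
            *guard = None;
        }
        is_current
    }
}

fn main() {
    let mgr = Manager::new();
    let t1 = mgr.get_or_launch(1); // T1 observes process 1
    let t2 = mgr.get_or_launch(1); // T2 observes the same process
    assert!(mgr.retire_if_current(&t2)); // T2 retires it first
    let fresh = mgr.get_or_launch(2); // T2 launches a replacement
    assert!(!mgr.retire_if_current(&t1)); // T1's stale handle is ignored
    assert_eq!(fresh.id, 2);
}
```

Unlike a file descriptor number, the address behind an `Arc` cannot be reused while any clone of that `Arc` is still alive, so the stale handle held by T1 can never compare equal to the replacement process.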


1625 Commits
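
The Result convention described in 607d19f0e0 (shutdowns and disconnects travel as `QueryError` through the handler and are turned into `Ok(())` only at the top of the task) could look roughly like the sketch below. Only `QueryError` and its `Shutdown` variant are named in the commit message; the `Disconnected` and `Other` variants and the functions `handle_query` and `run_query_task` are assumptions made for illustration, not the pageserver's actual API.

```rust
/// Sketch of an error type that separates real failures from expected terminations.
#[derive(Debug, PartialEq)]
enum QueryError {
    /// Named in the commit message.
    Shutdown,
    /// Assumed variant name for a client disconnect.
    Disconnected,
    /// Assumed catch-all for genuine errors.
    Other(String),
}

/// Hypothetical inner handler: always reports shutdown as an error, never as Ok(()).
fn handle_query(query: &str, shutting_down: bool) -> Result<(), QueryError> {
    if shutting_down {
        return Err(QueryError::Shutdown);
    }
    match query {
        "" => Err(QueryError::Disconnected),
        "bad" => Err(QueryError::Other("malformed query".to_string())),
        _ => Ok(()),
    }
}

/// Hypothetical top of the task: the one place where expected terminations become Ok(()).
fn run_query_task(query: &str, shutting_down: bool) -> Result<(), QueryError> {
    match handle_query(query, shutting_down) {
        Err(QueryError::Shutdown) | Err(QueryError::Disconnected) => Ok(()),
        other => other,
    }
}

fn main() {
    assert_eq!(run_query_task("select", true), Ok(()));
    assert_eq!(run_query_task("", false), Ok(()));
    assert!(run_query_task("bad", false).is_err());
}
```

Keeping the conversion in one place means inner functions never have to decide whether a shutdown counts as success, and only errors that survive the top-level match need to be logged at error severity.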