rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-08 05:52:55 +00:00

Author	SHA1	Message	Date
Joonas Koivunen	ff87fc569d	test: Remote storage refactorings (#5243 ) Remote storage cleanup split from #5198: - pageserver, extensions, and safekeepers now have their separate remote storage - RemoteStorageKind has the configuration code - S3Storage has the cleanup code - with MOCK_S3, pageserver, extensions, safekeepers use different buckets - with LOCAL_FS, `repo_dir / "local_fs_remote_storage" / $user` is used as path, where $user is `pageserver`, `safekeeper` - no more `NeonEnvBuilder.enable_xxx_remote_storage` but one `enable_{pageserver,extensions,safekeeper}_remote_storage` Should not have any real changes. These will allow us to default to `LOCAL_FS` for pageserver on the next PR, remove `RemoteStorageKind.NOOP`, work towards #5172. Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2023-09-08 13:54:23 +03:00
Heikki Linnakangas	cdc65c1857	Update pg_cron to version 1.6.0 (#5252 ) This includes PostgreSQL 16 support. There are no catalog changes, so this is a drop-in replacement, no need to run "ALTER EXTENSION UPDATE".	2023-09-08 12:42:46 +03:00
Heikki Linnakangas	dac995e7e9	Update plpgsql_check extension to version v2.4.0 (#5249 ) This brings v16 support.	2023-09-08 10:46:02 +03:00
Alexander Bayandin	b80740bf9f	test_startup: increase timeout (#5238 ) ## Problem `test_runner/performance/test_startup.py::test_startup` started to fail more frequently because of the timeout. Let's increase the timeout to see the failures on the perf dashboard. ## Summary of changes - Increase timeout for`test_startup` from 600 to 900 seconds	2023-09-08 01:57:38 +01:00
Heikki Linnakangas	57c1ea49b3	Update hypopg extension to version 1.4.0 (#5245 ) The v1.4.0 includes changes to make it compile with PostgreSQL 16. The commit log doesn't call it out explicitly, but I tested it manually. v1.4.0 includes some new functions, but I tested manually that the the v1.3.1 functionality works with the v1.4.0 version of the library. That means that this doesn't break existing installations. Users can do "ALTER EXTENSION hypopg UPDATE" if they want to use the new v1.4.0 functionality, but they don't have to.	2023-09-08 03:30:11 +03:00
Heikki Linnakangas	6c31a2d342	Upgrade prefix extension to version 1.2.10 (#5244 ) This version includes trivial changes to make it compile with PostgreSQL 16. No functional changes.	2023-09-08 02:10:01 +03:00
Heikki Linnakangas	252b953f18	Upgrade postgresql-hll to version 2.18. (#5241 ) This includes PostgreSQL 16 support. No other changes, really. The extension version in the upstream was changed from 2.17 to 2.18, however, there is no difference between the catalog objects. So if you had installed 2.17 previously, it will continue to work. You can run "ALTER EXTENSION hll UPDATE", but all it will do is update the version number in the pg_extension table.	2023-09-08 02:07:17 +03:00
Heikki Linnakangas	b414360afb	Upgrade ip4r to version 2.4.2 (#5242 ) Includes PostgreSQL v16 support. No functional changes.	2023-09-08 02:06:53 +03:00
Arpad Müller	d206655a63	Make VirtualFile::{open, open_with_options, create,sync_all,with_file} async fn (#5224 ) ## Problem Once we use async file system APIs for `VirtualFile`, these functions will also need to be async fn. ## Summary of changes Makes the functions `open, open_with_options, create,sync_all,with_file` of `VirtualFile` async fn, including all functions that call it. Like in the prior PRs, the actual I/O operations are not using async APIs yet, as per request in the #4743 epic. We switch towards not using `VirtualFile` in the par_fsync module, hopefully this is only temporary until we can actually do fully async I/O in `VirtualFile`. This might cause us to exhaust fd limits in the tests, but it should only be an issue for the local developer as we have high ulimits in prod. This PR is a follow-up of #5189, #5190, #5195, and #5203. Part of #4743.	2023-09-08 00:50:50 +02:00
Heikki Linnakangas	e5adc4efb9	Upgrade h3-pg to version 4.1.3. (#5237 ) This includes v16 support.	2023-09-07 21:39:12 +03:00
Heikki Linnakangas	c202f0ba10	Update PostGIS to version 3.3.3 (#5236 ) It's a good idea to keep up-to-date in general. One noteworthy change is that PostGIS 3.3.3 adds support for PostgreSQL v16. We'll need that. PostGIS 3.4.0 has already been released, and we should consider upgrading to that. However, it's a major upgrade and requires running "SELECT postgis_extensions_upgrade();" in each database, to upgrade the catalogs. I don't want to deal with that right now.	2023-09-07 21:38:55 +03:00
Alexander Bayandin	d15563f93b	Misc workflows: fix quotes in bash (#5235 )	2023-09-07 19:39:42 +03:00
Rahul Modpur	485a2cfdd3	Fix pg_config version parsing (#5200 ) ## Problem Fix pg_config version parsing ## Summary of changes Use regex to capture major version of postgres #5146	2023-09-07 15:34:22 +02:00
Alexander Bayandin	1fee69371b	Update `plv8` to 3.1.8 (#5230 ) ## Problem We likely need this to support Postgres 16 It's also been asked by a user https://github.com/neondatabase/neon/discussions/5042 The latest version is 3.2.0, but it requires some changes in the build script (which I haven't checked, but it didn't work right away) ## Summary of changes ``` 3.1.8 2023-08-01 - force v8 to compile in release mode 3.1.7 2023-06-26 - fix byteoffset issue with arraybuffers - support postgres 16 beta 3.1.6 2023-04-08 - fix crash issue on fetch apply - fix interrupt issue ``` From https://github.com/plv8/plv8/blob/v3.1.8/Changes	2023-09-07 14:21:38 +01:00
Alexander Bayandin	f8a91e792c	Even better handling of `approved-for-ci-run` label (#5227 ) ## Problem We've got `approved-for-ci-run` to work 🎉 But it's still a bit rough, this PR should improve the UX for external contributors. ## Summary of changes - `build_and_test.yml`: add `check-permissions` job, which fails if PR is created from a fork. Make all jobs in the workflow to be dependant on `check-permission` to fail fast - `approved-for-ci-run.yml`: add `cleanup` job to close `ci-run/pr-` PRs and delete linked branches when the parent PR is closed - `approved-for-ci-run.yml`: fix the layout for the `ci-run/pr-` PR description - GitHub Autocomment: add a comment with tests result to the original PR (instead of a PR from `ci-run/pr-*` )	2023-09-07 14:21:01 +01:00
duguorong009	706977fb77	fix(pageserver): add the walreceiver state to tenant timeline GET api endpoint (#5196 ) Add a `walreceiver_state` field to `TimelineInfo` (response of `GET /v1/tenant/:tenant_id/timeline/:timeline_id`) and while doing that, refactor out a common `Timeline::walreceiver_state(..)`. No OpenAPI changes, because this is an internal debugging addition. Fixes #3115. Co-authored-by: Joonas Koivunen <joonas.koivunen@gmail.com>	2023-09-07 14:17:18 +03:00
Arpad Müller	7ba0f5c08d	Improve comment in page cache (#5220 ) It was easy to interpret comment in the page cache initialization code to be about justifying why we leak here at all, not just why this specific type of leaking is done (which the comment was actually meant to describe). See https://github.com/neondatabase/neon/pull/5125#discussion_r1308445993 --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-09-06 21:44:54 +02:00
Arpad Müller	6243b44dea	Remove Virtual from FileBlockReaderVirtual variant name (#5225 ) With #5181, the generics for `FileBlockReader` have been removed, so having a `Virtual` postfix makes less sense now.	2023-09-06 20:54:57 +02:00
Joonas Koivunen	3a966852aa	doc: tests expect lsof (#5226 ) On a clean system `lsof` needs to be installed. Add it to the list just to keep things nice and copy-pasteable (except for poetry).	2023-09-06 21:40:00 +03:00
duguorong009	31e1568dee	refactor(pageserver): refactor pageserver router state creation (#5165 ) Fixes #3894 by: - Refactor the pageserver router creation flow - Create the router state in `pageserver/src/bin/pageserver.rs`	2023-09-06 21:31:49 +03:00
Chengpeng Yan	9a9187b81a	Complete the missing metrics for files_created/bytes_written (#5120 )	2023-09-06 14:00:15 -04:00
Chengpeng Yan	dfe2e5159a	remove the duplicate entries in postgresql.conf (#5090 )	2023-09-06 13:57:03 -04:00
Alexander Bayandin	e4b1d6b30a	Misc post-merge fixes (#5219 ) ## Problem - `SCALE: unbound variable` from https://github.com/neondatabase/neon/pull/5079 - The layout of the GitHub auto-comment is broken if the code coverage section follows flaky test section from https://github.com/neondatabase/neon/pull/4999 ## Summary of changes - `benchmarking.yml`: Rename `SCALE` to `TEST_OLAP_SCALE` - `comment-test-report.js`: Add an extra new-line before Code coverage section	2023-09-06 20:11:44 +03:00
Alexander Bayandin	76a96b0745	Notify Slack channel about upcoming releases (#5197 ) ## Problem When the next release is coming, we want to let everyone know about it by posting a message to the Slack channel with a list of commits. ## Summary of changes - `.github/workflows/release-notify.yml` is added - the workflow sends a message to `vars.SLACK_UPCOMING_RELEASE_CHANNEL_ID` (or [#test-release-notifications](https://neondb.slack.com/archives/C05QQ9J1BRC) if not configured) - On each PR update, the workflow updates the list of commits in the message (it doesn't send additional messages)	2023-09-06 17:52:21 +01:00
Arpad Müller	5e00c44169	Add WriteBlobWriter buffering and make VirtualFile::{write,write_all} async (#5203 ) ## Problem We want to convert the `VirtualFile` APIs to async fn so that we can adopt one of the async I/O solutions. ## Summary of changes This PR is a follow-up of #5189, #5190, and #5195, and does the following: * Move the used `Write` trait functions of `VirtualFile` into inherent functions * Add optional buffering to `WriteBlobWriter`. The buffer is discarded on drop, similarly to how tokio's `BufWriter` does it: drop is neither async nor does it support errors. * Remove the generics by `Write` impl of `WriteBlobWriter`, alwaays using `VirtualFile` * Rename `WriteBlobWriter` to `BlobWriter` * Make various functions in the write path async, like `VirtualFile::{write,write_all}`. Part of #4743.	2023-09-06 18:17:12 +02:00
Alexander Bayandin	d5f1858f78	approved-for-ci-run.yml: use different tokens (#5218 ) ## Problem `CI_ACCESS_TOKEN` has quite limited access (which is good), but this doesn't allow it to remove labels from PRs (which is bad) ## Summary of changes - Use `GITHUB_TOKEN` to remove labels - Use `CI_ACCESS_TOKEN` to create PRs	2023-09-06 18:50:59 +03:00
John Spray	61d661a6c3	pageserver: generation number fetch on startup and use in /attach (#5163 ) ## Problem - #5050 Closes: https://github.com/neondatabase/neon/issues/5136 ## Summary of changes - A new configuration property `control_plane_api` controls other functionality in this PR: if it is unset (default) then everything still works as it does today. - If `control_plane_api` is set, then on startup we call out to control plane `/re-attach` endpoint to discover our attachments and their generations. If an attachment is missing from the response we implicitly detach the tenant. - Calls to pageserver `/attach` API may include a `generation` parameter. If `control_plane_api` is set, then this parameter is mandatory. - RemoteTimelineClient's loading of index_part.json is generation-aware, and will try to load the index_part with the most recent generation <= its own generation. - The `neon_local` testing environment now includes a new binary `attachment_service` which implements the endpoints that the pageserver requires to operate. This is on by default if running `cargo neon` by hand. In `test_runner/` tests, it is off by default: existing tests continue to run with in the legacy generation-less mode. Caveats: - The re-attachment during startup assumes that we are only re-attaching tenants that have previously been attached, and not totally new tenants -- this relies on the control plane's attachment logic to keep retrying so that we should eventually see the attach API call. That's important because the `/re-attach` API doesn't tell us which timelines we should attach -- we still use local disk state for that. Ref: https://github.com/neondatabase/neon/issues/5173 - Testing: generations are only enabled for one integration test right now (test_pageserver_restart), as a smoke test that all the machinery basically works. Writing fuller tests that stress tenant migration will come later, and involve extending our test fixtures to deal with multiple pageservers. - I'm not in love with "attachment_service" as a name for the neon_local component, but it's not very important because we can easily rename these test bits whenever we want. - Limited observability when in re-attach on startup: when I add generation validation for deletions in a later PR, I want to wrap up the control plane API calls in some small client class that will expose metrics for things like errors calling the control plane API, which will act as a strong red signal that something is not right. Co-authored-by: Christian Schwarz <christian@neon.tech> Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-09-06 14:44:48 +01:00
Alexander Bayandin	da60f69909	approved-for-ci-run.yml: use our bot (#5216 ) ## Problem Pull Requests created by GitHub Actions bot doesn't have access to secrets, so we need to use our bot for it to auto-trigger a tests run See previous PRs #4663, #5210, #5212 ## Summary of changes - Use our bot to create PRs	2023-09-06 14:55:11 +03:00
John Spray	743933176e	scrubber: add `scan-metadata` and hook into integration tests (#5176 ) ## Problem - Scrubber's `tidy` command requires presence of a control plane - Scrubber has no tests at all ## Summary of changes - Add re-usable async streams for reading metadata from a bucket - Add a `scan-metadata` command that reads from those streams and calls existing `checks.rs` code to validate metadata, then returns a summary struct for the bucket. Command returns nonzero status if errors are found. - Add an `enable_scrub_on_exit()` function to NeonEnvBuilder so that tests using remote storage can request to have the scrubber run after they finish - Enable remote storarge and scrub_on_exit in test_pageserver_restart and test_pageserver_chaos This is a "toe in the water" of the overall space of validating the scrubber. Later, we should: - Enable scrubbing at end of tests using remote storage by default - Make the success condition stricter than "no errors": tests should declare what tenants+timelines they expect to see in the bucket (or sniff these from the functions tests use to create them) and we should require that the scrubber reports on these particular tenants/timelines. The `tidy` command is untouched in this PR, but it should be refactored later to use similar async streaming interface instead of the current batch-reading approach (the streams are faster with large buckets), and to also be covered by some tests. --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech> Co-authored-by: Alexander Bayandin <alexander@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech> Co-authored-by: Conrad Ludgate <conrad@neon.tech>	2023-09-06 11:55:24 +01:00
Alexander Bayandin	8e25d3e79e	test_runner: add scale parameter to tpc-h tests (#5079 ) ## Problem It's hard to find out which DB size we use for OLAP benchmarks (TPC-H in particular). This PR adds handling of `TEST_OLAP_SCALE` env var, which is get added to a test name as a parameter. This is required for performing larger periodic benchmarks. ## Summary of changes - Handle `TEST_OLAP_SCALE` in `test_runner/performance/test_perf_olap.py` - Set `TEST_OLAP_SCALE` in `.github/workflows/benchmarking.yml` to a TPC-H scale	2023-09-06 13:22:57 +03:00
duguorong009	4fec48f2b5	chore(pageserver): remove unnecessary logging in tenant task loops (#5188 ) Fixes #3830 by adding the `#[cfg(not(feature = "testing"))]` attribute to unnecessary loggings in `pageserver/src/tenant/tasks.rs`. Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-09-06 13:19:19 +03:00
Vadim Kharitonov	88b1ac48bd	Create Release PR at 7:00 UTC every Tuesday (#5213 )	2023-09-06 13:17:52 +03:00
Alexander Bayandin	15ff4e5fd1	approved-for-ci-run.yml: trigger on pull_request_target (#5212 ) ## Problem Continuation of #4663, #5210 We're still getting an error: ``` GraphQL: Resource not accessible by integration (removeLabelsFromLabelable) ``` ## Summary of changes - trigger `approved-for-ci-run.yml` workflow on `pull_request_target` instead of `pull_request`	2023-09-06 13:14:07 +03:00
Alexander Bayandin	dbfb4ea7b8	Make CI more friendly for external contributors. Second try (#5210 ) ## Problem `approved-for-ci-run` label logic doesn't work as expected: - https://github.com/neondatabase/neon/pull/4722#issuecomment-1636742145 - https://github.com/neondatabase/neon/pull/4722#issuecomment-1636755394 Continuation of https://github.com/neondatabase/neon/pull/4663 Closes #2222 (hopefully) ## Summary of changes - Create a twin PR automatically - Allow `GITHUB_TOKEN` to manipulate with labels	2023-09-06 10:06:55 +01:00
Alexander Bayandin	c222320a2a	Generate lcov coverage report (#4999 ) ## Problem We want to display coverage information for each PR. - an example of a full coverage report: https://neon-github-public-dev.s3.amazonaws.com/code-coverage/abea64800fb390c32a3efe6795d53d8621115c83/lcov/index.html - an example of GitHub auto-comment with coverage information: https://github.com/neondatabase/neon/pull/4999#issuecomment-1679344658 ## Summary of changes - Use patched[*](`426e7e7a22`) lcov to generate coverage report - Upload HTML coverage report to S3 - `scripts/comment-test-report.js`: add coverage information	2023-09-06 00:48:15 +01:00
MMeent	89c64e179e	Fix corruption issue in Local File Cache (#5208 ) Fix issue where updating the size of the Local File Cache could lead to invalid reads: ## Problem LFC cache can get re-enabled when lfc_max_size is set, e.g. through an autoscaler configuration, or PostgreSQL not liking us setting the variable. 1. initialize: LFC enabled (lfc_size_limit > 0; lfc_desc = 0) 2. Open LFC file fails, lfc_desc = -1. lfc_size_limit is set to 0; 3. lfc_size_limit is updated by autoscaling to >0 4. read() now thinks LFC is enabled (size_limit > 0) and lfc_desc is valid, but doesn't try to read from the invalid file handle and thus doesn't update the buffer content with the page's data, but does think the data was read... Any buffer we try to read from local file cache is essentially uninitialized memory. Those are likely 0-bytes, but might potentially be any old buffer that was previously read from or flushed to disk. Fix this by adding a more definitive disable flag, plus better invalid state handling.	2023-09-05 20:00:47 +02:00
Alexander Bayandin	7ceddadb37	Merge custom extension CI jobs (#5194 ) ## Problem When a remote custom extension build fails, it looks a bit confusing on neon CI: - `trigger-custom-extensions-build` is green - `wait-for-extensions-build` is red - `build-and-upload-extensions` is red But to restart the build (to get everything green), you need to restart the only passed `trigger-custom-extensions-build`. ## Summary of changes - Merge `trigger-custom-extensions-build` and `wait-for-extensions-build` jobs into `trigger-custom-extensions-build-and-wait`	2023-09-05 14:02:37 +01:00
Arpad Müller	4904613aaa	Convert `VirtualFile::{seek,metadata}` to async (#5195 ) ## Problem We want to convert the `VirtualFile` APIs to async fn so that we can adopt one of the async I/O solutions. ## Summary of changes Convert the following APIs of `VirtualFile` to async fn (as well as all of the APIs calling it): * `VirtualFile::seek` * `VirtualFile::metadata` * Also, prepare for deletion of the write impl by writing the summary to a buffer before writing it to disk, as suggested in https://github.com/neondatabase/neon/issues/4743#issuecomment-1700663864 . This change adds an additional warning for the case when the summary exceeds a block. Previously, we'd have silently corrupted data in this (unlikely) case. * `WriteBlobWriter::write_blob`, in preparation for making `VirtualFile::write_all` async.	2023-09-05 12:55:45 +02:00
Nikita Kalyanov	77658a155b	support deploying in IPv6-only environments (#4135 ) A set of changes to enable neon to work in IPv6 environments. The changes are backward-compatible but allow to deploy neon even to IPv6-only environments: - bind to both IPv4 and IPv6 interfaces - allow connections to Postgres from IPv6 interface - parse the address from control plane that could also be IPv6	2023-09-05 12:45:46 +03:00
Arpad Müller	128a85ba5e	Convert many VirtualFile APIs to async (#5190 ) ## Problem `VirtualFile` does both reading and writing, and it would be nice if both could be converted to async, so that it doesn't have to support an async read path and a blocking write path (especially for the locks this is annoying as none of the lock implementations in std, tokio or parking_lot have support for both async and blocking access). ## Summary of changes This PR is some initial work on making the `VirtualFile` APIs async. It can be reviewed commit-by-commit. * Introduce the `MaybeVirtualFile` enum to be generic in a test that compares real files with virtual files. * Make various APIs of `VirtualFile` async, including `write_all_at`, `read_at`, `read_exact_at`. Part of #4743 , successor of #5180. Co-authored-by: Christian Schwarz <me@cschwarz.com>	2023-09-04 17:05:20 +02:00
Arpad Müller	6cd497bb44	Make VirtualFile::crashsafe_overwrite async fn (#5189 ) ## Problem The `VirtualFile::crashsafe_overwrite` function was introduced by #5186 but it was not turned `async fn` yet. We want to make these functions async fn as part of #4743. ## Summary of changes Make `VirtualFile::crashsafe_overwrite` async fn, as well as all the functions calling it. Don't make anything inside `crashsafe_overwrite` use async functionalities, as per #4743 instructions. Also, add rustdoc to `crashsafe_overwrite`. Part of #4743.	2023-09-04 12:52:35 +02:00
John Spray	80f10d5ced	pageserver: safe deletion for tenant directories (#5182 ) ## Problem If a pageserver crashes partway through deleting a tenant's directory, it might leave a partial state that confuses a subsequent startup/attach. ## Summary of changes Rename tenant directory to a temporary path before deleting. Timeline deletions already have deletion markers to provide safety. In future, it would be nice to exploit this to send responses to detach requests earlier: https://github.com/neondatabase/neon/issues/5183	2023-09-04 08:31:55 +01:00
Christian Schwarz	7e817789d5	VirtualFile: add crash-safe overwrite abstraction & use it (#5186 ) (part of #4743) (preliminary to #5180) This PR adds a special-purpose API to `VirtualFile` for write-once files. It adopts it for `save_metadata` and `persist_tenant_conf`. This is helpful for the asyncification efforts (#4743) and specifically asyncification of `VirtualFile` because above two functions were the only ones that needed the VirtualFile to be an `std::io::Write`. (There was also `manifest.rs` that needed the `std::io::Write`, but, it isn't used right now, and likely won't be used because we're taking a different route for crash consistency, see #5172. So, let's remove it. It'll be in Git history if we need to re-introduce it when picking up the compaction work again; that's why it was introduced in the first place). We can't remove the `impl std::io::Write for VirtualFile` just yet because of the `BufWriter` in ```rust struct DeltaLayerWriterInner { ... blob_writer: WriteBlobWriter<BufWriter<VirtualFile>>, } ``` But, @arpad-m and I have a plan to get rid of that by extracting the append-only-ness-on-top-of-VirtualFile that #4994 added to `EphemeralFile` into an abstraction that can be re-used in the `DeltaLayerWriterInner` and `ImageLayerWriterInner`. That'll be another PR. ### Performance Impact This PR adds more fsyncs compared to before because we fsync the parent directory every time. 1. For `save_metadata`, the additional fsyncs are unnecessary because we know that `metadata` fits into a kernel page, and hence the write won't be torn on the way into the kernel. However, the `metadata` file in general is going to lose signficance very soon (=> see #5172), and the NVMes in prod can definitely handle the additional fsync. So, let's not worry about it. 2. For `persist_tenant_conf`, which we don't check to fit into a single kernel page, this PR makes it actually crash-consistent. Before, we could crash while writing out the tenant conf, leaving a prefix of the tenant conf on disk.	2023-09-02 10:06:14 +02:00
John Spray	41aa627ec0	tests: get test name automatically for remote storage (#5184 ) ## Problem Tests using remote storage have manually entered `test_name` parameters, which: - Are easy to accidentally duplicate when copying code to make a new test - Omit parameters, so don't actually create unique S3 buckets when running many tests concurrently. ## Summary of changes - Use the `request` fixture in neon_env_builder fixture to get the test name, then munge that into an S3 compatible bucket name. - Remove the explicit `test_name` parameters to enable_remote_storage	2023-09-01 17:29:38 +01:00
Conrad Ludgate	44da9c38e0	proxy: error typo (#5187 ) ## Problem https://github.com/neondatabase/neon/pull/5162#discussion_r1311853491	2023-09-01 19:21:33 +03:00
Christian Schwarz	cfc0fb573d	pageserver: run all Rust tests with remote storage enabled (#5164 ) For [#5086](https://github.com/neondatabase/neon/pull/5086#issuecomment-1701331777) we will require remote storage to be configured in pageserver. This PR enables `localfs`-based storage for all Rust unit tests. Changes: - In `TenantHarness`, set up localfs remote storage for the tenant. - `create_test_timeline` should mimic what real timeline creation does, and real timeline creation waits for the timeline to reach remote storage. With this PR, `create_test_timeline` now does that as well. - All the places that create the harness tenant twice need to shut down the tenant before the re-create through a second call to `try_load` or `load`. - Without shutting down, upload tasks initiated by/through the first incarnation of the harness tenant might still be ongoing when the second incarnation of the harness tenant is `try_load`/`load`ed. That doesn't make sense in the tests that do that, they generally try to set up a scenario similar to pageserver stop & start. - There was one test that recreates a timeline, not the tenant. For that case, I needed to create a `Timeline::shutdown` method. It's a refactoring of the existing `Tenant::shutdown` method. - The remote_timeline_client tests previously set up their own `GenericRemoteStorage` and `RemoteTimelineClient`. Now they re-use the one that's pre-created by the TenantHarness. Some adjustments to the assertions were needed because the assertions now need to account for the initial image layer that's created by `create_test_timeline` to be present.	2023-09-01 18:10:40 +02:00
Christian Schwarz	aa22000e67	FileBlockReader<File> is never used (#5181 ) part of #4743 preliminary to #5180	2023-09-01 17:30:22 +02:00
Christian Schwarz	5edae96a83	rfc: Crash-Consistent Layer Map Updates By Leveraging index_part.json (#5086 ) This RFC describes a simple scheme to make layer map updates crash consistent by leveraging the index_part.json in remote storage. Without such a mechanism, crashes can induce certain edge cases in which broadly held assumptions about system invariants don't hold.	2023-09-01 15:24:58 +02:00
Christian Schwarz	40ce520c07	remote_timeline_client: tests: run upload ops on the tokio::test runtime (#5177 ) The `remote_timeline_client` tests use `#[tokio::test]` and rely on the fact that the test runtime that is set up by this macro is single-threaded. In PR https://github.com/neondatabase/neon/pull/5164, we observed interesting flakiness with the `upload_scheduling` test case: it would observe the upload of the third layer (`layer_file_name_3`) before we did `wait_completion`. Under the single-threaded-runtime assumption, that wouldn't be possible, because the test code doesn't await inbetween scheduling the upload and calling `wait_completion`. However, RemoteTimelineClient was actually using `BACKGROUND_RUNTIME`. That means there was parallelism where the tests didn't expect it, leading to flakiness such as execution of an UploadOp task before the test calls `wait_completion`. The most confusing scenario is code like this: ``` schedule upload(A); wait_completion.await; // B schedule_upload(C); wait_completion.await; // D ``` On a single-threaded executor, it is guaranteed that the upload up C doesn't run before D, because we (the test) don't relinquish control to the executor before D's `await` point. However, RemoteTimelineClient actually scheduled onto the BACKGROUND_RUNTIME, so, `A` could start running before `B` and `C` could start running before `D`. This would cause flaky tests when making assertions about the state manipulated by the operations. The concrete issue that led to discover of this bug was an assertion about `remote_fs_dir` state in #5164.	2023-09-01 16:24:04 +03:00
Alexander Bayandin	e9f2c64322	Wait for custom extensions build before deploy (#5170 ) ## Problem Currently, the `deploy` job doesn't wait for the custom extension job (in another repo) and can be started even with failed extensions build. This PR adds another job that polls the status of the extension build job and fails if the extension build fails. ## Summary of changes - Add `wait-for-extensions-build` job, which waits for a custom extension build in another repo.	2023-09-01 12:59:19 +01:00

1 2 3 4 5 ...

3707 Commits