rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-07 05:22:56 +00:00

Author	SHA1	Message	Date
Heikki Linnakangas	da626fb1fa	tests: Remove "postgres is running on ... branch" messages It seems like useless chatter. The endpoint.start() itself prints a "Running command ... neon_local endpoint start" message too.	2024-02-11 01:34:31 +02:00
John Spray	12b39c9db9	control_plane: add debug APIs for force-dropping tenant/node (#6702 ) ## Problem When debugging/supporting this service, we sometimes need it to just forget about a tenant or node, e.g. because of an issue cleanly tearing them down. For example, if I create a tenant with a PlacementPolicy that can't be scheduled on the nodes we have, we would never be able to schedule it for a DELETE to work. ## Summary of changes - Add APIs for dropping nodes and tenants that do no teardown other than removing the entity from the DB and removing any references to it.	2024-02-10 11:56:52 +00:00
Heikki Linnakangas	df5e2729a9	Remove now unused allowlisted errors. I'm not sure when we stopped emitting these, but they don't seem to be needed anymore.	2024-02-10 12:05:02 +02:00
Heikki Linnakangas	0fd3cd27cb	Tighten up the check for garbage after end-of-tar. Turn the warning into an error, if there is garbage after the end of imported tar file. However, it's normal for 'tar' to append extra empty blocks to the end, so tolerate those without warnings or errors.	2024-02-10 12:05:02 +02:00
Christian Schwarz	5779c7908a	revert two recent `heavier_once_cell` changes (#6704 ) This PR reverts - https://github.com/neondatabase/neon/pull/6589 - https://github.com/neondatabase/neon/pull/6652 because there's a performance regression that's particularly visible at high layer counts. Most likely it's because the switch to RwLock inflates the ``` inner: heavier_once_cell::OnceCell<ResidentOrWantedEvicted>, ``` size from 48 to 88 bytes, which, by itself is almost a doubling of the cache footprint, and probably the fact that it's now larger than a cache line also doesn't help. See this chat on the Neon discord for more context: https://discord.com/channels/1176467419317940276/1204714372295958548/1205541184634617906 I'm reverting 6652 as well because it might also have perf implications, and we're getting close to the next release. We should re-do its changes after the next release, though. cc @koivunej cc @ivaxer	2024-02-09 22:22:40 +00:00
Sasha Krassovsky	1a4dd58b70	Grant pg_monitor to neon_superuser (#6691 ) ## Problem The people want pg_monitor https://github.com/neondatabase/neon/issues/6682 ## Summary of changes Gives the people pg_monitor	2024-02-09 20:22:53 +00:00
Conrad Ludgate	cbd3a32d4d	proxy: decode username and password (#6700 ) ## Problem usernames and passwords can be URL 'percent' encoded in the connection string URL provided by serverless driver. ## Summary of changes Decode the parameters when getting conn info	2024-02-09 19:22:23 +00:00
Christian Schwarz	ca818c8bd7	fix(test_ondemand_download_timetravel): occasionally fails with slightly higher physical size (#6687 )	2024-02-09 20:09:37 +01:00
Arseny Sher	1bb9abebf2	Remove WAL segments from s3 in batches. Do list-delete operations in batches instead of doing full list first, to ensure deletion makes progress even if there are a lot of files to remove. To this end, add max_keys limit to remote storage list_files.	2024-02-09 22:11:53 +04:00
Conrad Ludgate	96d89cde51	Proxy error reworking (#6453 ) ## Problem Taking my ideas from https://github.com/neondatabase/neon/pull/6283 and doing a bit less radical changes. smaller commits. We currently don't report error classifications in proxy as the current error handling made it hard to do so. ## Summary of changes 1. Add a `ReportableError` trait that all errors will implement. This provides the error classification functionality. 2. Handle Client requests a strongly typed error * this error is a `ReportableError` and is logged appropriately 3. The handle client error only has a few possible error types, to account for the fact that at this point errors should be returned to the user.	2024-02-09 15:50:51 +00:00
John Spray	89a5c654bf	control_plane: follow up for embedded migrations (#6647 ) ## Problem In https://github.com/neondatabase/neon/pull/6637, we remove the need to run migrations externally, but for compat tests to work we can't remove those invocations from the neon_local binary. Once that previous PR merges, we can make the followup changes without upsetting compat tests.	2024-02-09 14:26:50 +00:00
Heikki Linnakangas	5239cdc29f	Fix test_vm_bit_clear_on_heap_lock test The test was supposed to reproduce the bug fixed in commit `66fa176cc8`, i.e. that the clearing of the VM bit was not replayed in the pageserver on HEAP_LOCK records. But it was broken in many ways and failed to reproduce the original problem if you reverted the fix: - The comparison of XIDs was broken. The test read the XID in to a variable in python, but it was treated as a string rather than an integer. As a result, e.g. "999" > "1000". - The test accessed the locked tuple too early, in the loop. Accessing it early, before the pg_xact page had been removed, set the hint bits. That masked the problem on subsequent accesses. - The on-demand SLRU download that was introduced in commit `9a9d9beaee` hid the issue. Even though an SLRU segment was removed by Postgres, when it later tried to access it, it could still download it from the pageserver. To ensure that doesn't happen, shorten the GC period and compact and GC aggressively in the test. I also added a more direct check that the VM page is updated, using the get_page_at_lsn() debugging function. Right after locking the row, we now fetch the VM page from pageserver and directly compare it with the VM page in the page cache. They should match. That assertion is more robust to things like on-demand SLRU download that could mask the bug.	2024-02-09 15:56:41 +02:00
Heikki Linnakangas	84a0e7b022	tests: Allow setting shutdown mode separately from 'destroy' flag In neon_local, the default mode is now always 'fast', regardless of 'destroy'. You can override it with the "neon_local endpoint stop --mode=immediate" flag. In python tests, we still default to 'immediate' mode when using the stop_and_destroy() function, and 'fast' with plain stop(). I kept that to avoid changing behavior in existing tests. I don't think existing tests depend on it, but I wasn't 100% certain.	2024-02-09 15:56:41 +02:00
John Spray	8d98981fe5	tests: deflake test_sharding_split_unsharded (#6699 ) ## Problem This test was a subset of the larger sharding test, and it missed the validate() call on workload that was implicitly waiting for a tenant to become active before trying to split it. It could therefore fail to split due to tenant not yet being active. ## Summary of changes - Insert .validate() call, and move the Workload setup to after the check of shard ID (as the shard ID check should pass immediately)	2024-02-09 13:20:04 +00:00
Joonas Koivunen	eb919cab88	prepare to move timeouts and cancellation handling to remote_storage (#6696 ) This PR is preliminary cleanups and refactoring around `remote_storage` for next PR which will move the timeouts and cancellation into `remote_storage`. Summary: - smaller drive-by fixes - code simplification - refactor common parts like `DownloadError::is_permanent` - align error types with `RemoteStorage::list_*` to use more `download_retry` helper Cc: #6096	2024-02-09 12:52:58 +00:00
Anastasia Lubennikova	eec1e1a192	Pre-install anon extension from compute_ctl if anon is in shared_preload_libraries. Users cannot install it themselves, because superuser is required. Grant all privileges needed to use it to db_owner. We use the neon fork of the extension, because a small change to the sql file is needed to allow db_owner to use it. This feature is behind a feature flag AnonExtension, so it is not enabled by default.	2024-02-09 12:32:07 +00:00
Conrad Ludgate	ea089dc977	proxy: add per query array mode flag (#6678 ) ## Problem Drizzle needs to be able to configure the array_mode flag per query. ## Summary of changes Adds an array_mode flag to the query data json that will otherwise default to the header flag.	2024-02-09 10:29:20 +00:00
John Spray	951c9bf4ca	control_plane: fix shard splitting on unsharded tenant (#6689 ) ## Problem Previous test started with a new-style TenantShardId with a non-zero ShardCount. We also need to handle the case of a ShardCount(0) (aka `unsharded`) parent shard. A followup PR will refactor ShardCount to make its inner value private and thereby make this kind of mistake harder. ## Summary of changes - Fix a place we were incorrectly treating a ShardCount as a number of shards rather than as a thing that can be zero or the number of shards. - Add a test for this case.	2024-02-09 10:12:40 +00:00
Heikki Linnakangas	568f91420a	tests: try to make restored-datadir comparison tests not flaky (#6666 ) This test occasionally fails with a difference in "pg_xact/0000" file between the local and restored datadirs. My hypothesis is that something changed in the database between the last explicit checkpoint and the shutdown. I suspect autovacuum, it could certainly create transactions. To fix, be more precise about the point in time that we compare. Shut down the endpoint first, then read the last LSN (i.e. the shutdown checkpoint's LSN), from the local disk with pg_controldata. And use exactly that LSN in the basebackup. Closes #559. I'm proposing this as an alternative to https://github.com/neondatabase/neon/pull/6662.	2024-02-09 11:34:15 +02:00
Joonas Koivunen	a18aa14754	test: shutdown endpoints before deletion (#6619 ) this avoids a page_service error in the log sometimes. keeping the endpoint running while deleting has no function for this test.	2024-02-09 09:01:07 +00:00
Konstantin Knizhnik	529a79d263	Increment generation when LFC is disabled by assigning 0 to neon.file_cache_size_limit (#6692 ) ## Problem test_lfc_resize sometimes failed with an assertion failure when acquiring the lock in a write operation: ``` if (lfc_ctl->generation == generation) { Assert(LFC_ENABLED()); ``` ## Summary of changes Increment generation when 0 is assigned to neon.file_cache_size_limit Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-02-09 08:14:41 +02:00
Joonas Koivunen	c09993396e	fix: secondary tenant relative order eviction (#6491 ) Calculate the `relative_last_activity` using the total evicted and resident layers similar to what we originally planned. Cc: #5331	2024-02-09 00:37:57 +02:00
Joonas Koivunen	9a31311990	fix(heavier_once_cell): assertion failure can be hit (#6652 ) @problame noticed that the `tokio::sync::AcquireError` branch assertion can be hit like in the first commit. We haven't seen this yet in production, but I'd prefer not to see it there. There `take_and_deinit` is being used, but this race must be quite timing sensitive.	2024-02-08 22:40:14 +02:00
Arpad Müller	c0e0fc8151	Update Rust to 1.76.0 (#6683 ) [Release notes](https://github.com/rust-lang/rust/releases/tag/1.76.0).	2024-02-08 19:57:02 +01:00
John Spray	e8d2843df6	storage controller: improved handling of node availability on restart (#6658 ) - Automatically set a node's availability to Active if it is responsive in startup_reconcile - Impose a 5s timeout of HTTP request to list location conf, so that an unresponsive node can't hang it for minutes - Do several retries if the request fails with a retryable error, to be tolerant of concurrent pageserver & storage controller restarts - Add a readiness hook for use with k8s so that we can tell when the startup reconciliation is done and the service is fully ready to do work. - Add /metrics to the list of un-authenticated endpoints (this is unrelated but we're touching the line in this PR already, and it fixes auth error spam in deployed container.) - A test for the above. Closes: #6670	2024-02-08 18:00:53 +00:00
John Spray	af91a28936	pageserver: shard splitting (#6379 ) ## Problem One doesn't know at tenant creation time how large the tenant will grow. We need to be able to dynamically adjust the shard count at runtime. This is implemented as "splitting" of shards into smaller child shards, which cover a subset of the keyspace that the parent covered. Refer to RFC: https://github.com/neondatabase/neon/pull/6358 Part of epic: #6278 ## Summary of changes This PR implements the happy path (does not cleanly recover from a crash mid-split, although won't lose any data), without any optimizations (e.g. child shards re-download their own copies of layers that the parent shard already had on local disk) - Add `/v1/tenant/:tenant_shard_id/shard_split` API to pageserver: this copies the shard's index to the child shards' paths, instantiates child `Tenant` object, and tears down parent `Tenant` object. - Add `splitting` column to `tenant_shards` table. This is written into an existing migration because we haven't deployed yet, so don't need to cleanly upgrade. - Add `/control/v1/tenant/:tenant_id/shard_split` API to attachment_service, - Add `test_sharding_split_smoke` test. This covers the happy path: future PRs will add tests that exercise failure cases.	2024-02-08 15:35:13 +00:00
Konstantin Knizhnik	43eae17f0d	Drop unused replication slots (#6655 ) ## Problem See #6626 If there is an inactive replication slot, then Postgres will not be able to shrink WAL and delete unused snapshots. If some other active subscription is present, then snapshots created every 15 seconds will overflow AUX_DIR. Setting `max_slot_wal_keep_size` doesn't solve the problem, because even a small WAL segment will be enough to overflow AUX_DIR if there is no other activity on the system. ## Summary of changes If there are active subscriptions and some logical replication slots are not used during the `neon.logical_replication_max_time_lag` interval, then the unused slots are dropped. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-02-08 17:31:15 +02:00
Anna Khanova	6c34d4cd14	Proxy: set timeout on establishing connection (#6679 ) ## Problem There is no timeout on the handshake. ## Summary of changes Set the timeout on the establishing connection.	2024-02-08 13:52:04 +00:00
Anna Khanova	c63e3e7e84	Proxy: improve http-pool (#6577 ) ## Problem The password check logic for the sql-over-http is a bit non-intuitive. ## Summary of changes 1. Perform scram auth using the same logic as for websocket cleartext password. 2. Split establish connection logic and connection pool. 3. Parallelize param parsing logic with authentication + wake compute. 4. Limit the total number of clients	2024-02-08 12:57:05 +01:00
Christian Schwarz	c52495774d	tokio-epoll-uring: expose its metrics in pageserver's `/metrics` (#6672 ) context: https://github.com/neondatabase/neon/issues/6667	2024-02-07 23:58:54 +00:00
Andreas Scherbaum	9a017778a9	Update copyright notice, set it to current year (#6671 ) ## Problem Copyright notice is outdated ## Summary of changes Replace the initial year `2022` with `2022 - 2024`, after brief discussion with Stas about the format Co-authored-by: Andreas Scherbaum <andreas@neon.tech>	2024-02-08 00:48:31 +01:00
Christian Schwarz	c561ad4e2e	feat: expose locked memory in pageserver `/metrics` (#6669 ) context: https://github.com/neondatabase/neon/issues/6667	2024-02-07 19:39:52 +00:00
John Spray	3bd2a4fd56	control_plane: avoid feedback loop with /location_config if compute hook fails. (#6668 ) ## Problem The existing behavior isn't exactly incorrect, but is operationally risky: if the control plane compute hook breaks, then all the control plane operations trying to call /location_config will end up retrying forever, which could put more load on the system. ## Summary of changes - Treat 404s as fatal errors to do fewer retries: a 404 either indicates we have the wrong URL, or some control plane bug is failing to recognize our tenant ID as existing. - Do not return an error on reconciliation errors in a non-creating /location_config response: this allows the control plane to finish its Operation (and we will eventually retry the compute notification later)	2024-02-07 19:14:18 +00:00
Tristan Partin	128fae7054	Update Postgres 16 to 16.2	2024-02-07 11:10:48 -08:00
Tristan Partin	5541244dc4	Update Postgres 15 to 15.6	2024-02-07 11:10:48 -08:00
Tristan Partin	2e9b1f7aaf	Update Postgres 14 to 14.11	2024-02-07 11:10:48 -08:00
Christian Schwarz	51f9385b1b	live-reconfigurable virtual_file::IoEngine (#6552 ) This PR adds an API to live-reconfigure the VirtualFile io engine. It also adds a flag to `pagebench get-page-latest-lsn`, which is where I found this functionality to be useful: it helps compare the io engines in a benchmark without re-compiling a release build, which took ~50s on the i3en.3xlarge where I was doing the benchmark. Switching the IO engine is completely safe at runtime.	2024-02-07 17:47:55 +00:00
Sasha Krassovsky	7b49e5e5c3	Remove compute migrations feature flag (#6653 )	2024-02-07 07:55:55 -09:00
Abhijeet Patil	75f1a01d4a	Optimise e2e run (#6513 ) ## Problem We have a finite number of runners, and intermediate results are often wanted before a PR is ready for merging. Currently all PRs get e2e tests run, and this creates a lot of throwaway e2e results which may or may not get to start or complete before a new push. ## Summary of changes 1. Skip e2e test when PR is in draft mode 2. Run e2e when PR status changes from draft to ready for review (change this to having its trigger in below PR and update results of build and test) 3. Abstract e2e test in a separate workflow and call it from the main workflow for the e2e test 4. Add a label; if that label is present, run e2e test in draft (run-e2e-test-in-draft) 5. Auto add a label (approve to ci) so that e2e tests run in draft for all external contributors' PRs 6. Document the new label changes and the above behaviour Draft PR : https://github.com/neondatabase/neon/actions/runs/7729128470 Ready To Review : https://github.com/neondatabase/neon/actions/runs/7733779916 Draft PR with label : https://github.com/neondatabase/neon/actions/runs/7725691012/job/21062432342 and https://github.com/neondatabase/neon/actions/runs/7733854028 Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2024-02-07 16:14:10 +00:00
John Spray	090a789408	storage controller: use PUT instead of POST (#6659 ) This was a typo, the server expects PUT.	2024-02-07 13:24:10 +00:00
John Spray	3d4fe205ba	control_plane/attachment_service: database connection pool (#6622 ) ## Problem This is mainly to limit our concurrency, rather than to speed up requests (I was doing some sanity checks on performance of the service with thousands of shards) ## Summary of changes - Enable the `diesel:r2d2` feature, which provides an async connection pool - Acquire a connection before entering spawn_blocking for a database transaction (recall that diesel's interface is sync) - Set a connection pool size of 99 to fit within default postgres limit (100) - Also set the tokio blocking thread count to accommodate the same number of blocking tasks (the only thing we use spawn_blocking for is database calls).	2024-02-07 13:08:09 +00:00
Arpad Müller	f7516df6c1	Pass timestamp as a datetime (#6656 ) This saves some repetition. I did this in #6533 for `tenant_time_travel_remote_storage` already.	2024-02-07 12:56:53 +01:00
Konstantin Knizhnik	f3d7d23805	Some small WAL records can write a lot of data to KV storage, so perform checkpoint check more frequently (#6639 ) ## Problem See https://neondb.slack.com/archives/C04DGM6SMTM/p1707149618314539?thread_ts=1707081520.140049&cid=C04DGM6SMTM ## Summary of changes Perform checkpoint check after processing `ingest_batch_size` (default 100) WAL records. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-02-07 08:47:19 +02:00
Alexander Bayandin	9f75da7c0a	test_lazy_startup: fix statement_timeout setting (#6654 ) ## Problem Test `test_lazy_startup` is flaky[0], sometimes (pretty frequently) it fails with `canceling statement due to statement timeout`. - [0] https://neon-github-public-dev.s3.amazonaws.com/reports/main/7803316870/index.html#suites/355b1a7a5b1e740b23ea53728913b4fa/7263782d30986c50/history ## Summary of changes - Fix setting `statement_timeout` setting by reusing a connection for all queries. - Also fix label (`lazy`, `eager`) assignment - Split `test_lazy_startup` into two, by `slru` laziness and make tests smaller	2024-02-07 00:31:26 +00:00
Alexander Bayandin	f4cc7cae14	CI(build-tools): Update Python from 3.9.2 to 3.9.18 (#6615 ) ## Problem We use an outdated version of Python (3.9.2) ## Summary of changes - Update Python to the latest patch version (3.9.18) - Unify the usage of python caches where possible	2024-02-06 20:30:43 +00:00
John Spray	4f57dc6cc6	control_plane/attachment_service: take public key as value (#6651 ) It's awkward to point to a file when doing some kinds of ad-hoc deployment (like right now, when I'm hacking a helm chart having not quite hooked up secrets properly yet). We take all the rest of the secrets as CLI args directly, so let's do the same for public key.	2024-02-06 19:08:39 +00:00
Heikki Linnakangas	dc811d1923	Add a span to 'create_neon_superuser' for better OpenTelemetry traces (#6644 ) create_neon_superuser runs the first queries in the database after cold start. Traces suggest that those first queries can make up a significant fraction of the cold start time. Make it more visible by adding an explicit tracing span to it; currently you just have to deduce it by looking at the time spent in the parent 'apply_config' span subtracted by all the other child spans.	2024-02-06 20:37:35 +02:00
Alexander Bayandin	e65f0fe874	CI(benchmarks): make job split consistent across reruns (#6614 ) ## Problem We've got several issues with the current `benchmarks` job setup: - `benchmark_durations.json` file (that we generate in runtime to split tests into several jobs[0]) is not consistent between these jobs (and very not consistent with the file if we rerun the job). I.e. test selection for each job can be different, which could end up in missed tests in a test run. - `scripts/benchmark_durations` doesn't fetch all tests from the database (it doesn't expect any extra directories inside `test_runner/performance`) - For some reason, currently split into 4 groups ends up with the 4th group has no tests to run, which fails the job[1] - [0] https://github.com/neondatabase/neon/pull/4683 - [1] https://github.com/neondatabase/neon/issues/6629 ## Summary of changes - Generate `benchmark_durations.json` file once before we start `benchmarks` jobs (this makes it consistent across the jobs) and pass the file content through the GitHub Actions input (this makes it consistent for reruns) - `scripts/benchmark_durations` fix SQL query for getting all required tests - Split benchmarks into 5 jobs instead of 4 jobs.	2024-02-06 17:00:55 +00:00
Joonas Koivunen	bb92721168	build: migrate check-style-rust to small runners (#6588 ) We have more small runners than large runners, and often a shortage of large runners. Migrate `check-style-rust` to run on small runners.	2024-02-06 15:53:04 +00:00
Christian Schwarz	d7b29aace7	refactor(walredo): don't create WalRedoManager for broken tenants (#6597 ) When we'll later introduce a global pool of pre-spawned walredo processes (https://github.com/neondatabase/neon/issues/6581), this refactoring avoids plumbing through the reference to the pool to all the places where we create a broken tenant. Builds atop the refactoring in #6583	2024-02-06 16:20:02 +01:00

4593 Commits