rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-06 21:50:37 +00:00

Author	SHA1	Message	Date
Sasha Krassovsky	7b49e5e5c3	Remove compute migrations feature flag (#6653 )	2024-02-07 07:55:55 -09:00
Abhijeet Patil	75f1a01d4a	Optimise e2e run (#6513 ) ## Problem We have finite amount of runners and intermediate results are often wanted before a PR is ready for merging. Currently all PRs get e2e tests run and this creates a lot of throwaway e2e results which may or may not get to start or complete before a new push. ## Summary of changes 1. Skip e2e test when PR is in draft mode 2. Run e2e when PR status changes from draft to ready for review (change this to having its trigger in below PR and update results of build and test) 3. Abstract e2e test in a Separate workflow and call it from the main workflow for the e2e test 5. Add a label, if that label is present run e2e test in draft (run-e2e-test-in-draft) 6. Auto add a label(approve to ci) so that all the external contributors PR , e2e run in draft 7. Document the new label changes and the above behaviour Draft PR : https://github.com/neondatabase/neon/actions/runs/7729128470 Ready To Review : https://github.com/neondatabase/neon/actions/runs/7733779916 Draft PR with label : https://github.com/neondatabase/neon/actions/runs/7725691012/job/21062432342 and https://github.com/neondatabase/neon/actions/runs/7733854028 ## Checklist before requesting a review - [x] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2024-02-07 16:14:10 +00:00
John Spray	090a789408	storage controller: use PUT instead of POST (#6659 ) This was a typo, the server expects PUT.	2024-02-07 13:24:10 +00:00
John Spray	3d4fe205ba	control_plane/attachment_service: database connection pool (#6622 ) ## Problem This is mainly to limit our concurrency, rather than to speed up requests (I was doing some sanity checks on performance of the service with thousands of shards) ## Summary of changes - Enable the `diesel:r2d2` feature, which provides an async connection pool - Acquire a connection before entering spawn_blocking for a database transaction (recall that diesel's interface is sync) - Set a connection pool size of 99 to fit within default postgres limit (100) - Also set the tokio blocking thread count to accomodate the same number of blocking tasks (the only thing we use spawn_blocking for is database calls).	2024-02-07 13:08:09 +00:00
Arpad Müller	f7516df6c1	Pass timestamp as a datetime (#6656 ) This saves some repetition. I did this in #6533 for `tenant_time_travel_remote_storage` already.	2024-02-07 12:56:53 +01:00
Konstantin Knizhnik	f3d7d23805	Some small WAL records can write a lot of data to KV storage, so perform checkpoint check more frequently (#6639 ) ## Problem See https://neondb.slack.com/archives/C04DGM6SMTM/p1707149618314539?thread_ts=1707081520.140049&cid=C04DGM6SMTM ## Summary of changes Perform checkpoint check after processing `ingest_batch_size` (default 100) WAL records. ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-02-07 08:47:19 +02:00
Alexander Bayandin	9f75da7c0a	test_lazy_startup: fix statement_timeout setting (#6654 ) ## Problem Test `test_lazy_startup` is flaky[0], sometimes (pretty frequently) it fails with `canceling statement due to statement timeout`. - [0] https://neon-github-public-dev.s3.amazonaws.com/reports/main/7803316870/index.html#suites/355b1a7a5b1e740b23ea53728913b4fa/7263782d30986c50/history ## Summary of changes - Fix setting `statement_timeout` setting by reusing a connection for all queries. - Also fix label (`lazy`, `eager`) assignment - Split `test_lazy_startup` into two, by `slru` laziness and make tests smaller	2024-02-07 00:31:26 +00:00
Alexander Bayandin	f4cc7cae14	CI(build-tools): Update Python from 3.9.2 to 3.9.18 (#6615 ) ## Problem We use an outdated version of Python (3.9.2) ## Summary of changes - Update Python to the latest patch version (3.9.18) - Unify the usage of python caches where possible	2024-02-06 20:30:43 +00:00
John Spray	4f57dc6cc6	control_plane/attachment_service: take public key as value (#6651 ) It's awkward to point to a file when doing some kinds of ad-hoc deployment (like right now, when I'm hacking a helm chart having not quite hooked up secrets properly yet). We take all the rest of the secrets as CLI args directly, so let's do the same for public key.	2024-02-06 19:08:39 +00:00
Heikki Linnakangas	dc811d1923	Add a span to 'create_neon_superuser' for better OpenTelemetry traces (#6644 ) create_neon_superuser runs the first queries in the database after cold start. Traces suggest that those first queries can make up a significant fraction of the cold start time. Make it more visible by adding an explict tracing span to it; currently you just have to deduce it by looking at the time spent in the parent 'apply_config' span subtracted by all the other child spans.	2024-02-06 20:37:35 +02:00
Alexander Bayandin	e65f0fe874	CI(benchmarks): make job split consistent across reruns (#6614 ) ## Problem We've got several issues with the current `benchmarks` job setup: - `benchmark_durations.json` file (that we generate in runtime to split tests into several jobs[0]) is not consistent between these jobs (and very not consistent with the file if we rerun the job). I.e. test selection for each job can be different, which could end up in missed tests in a test run. - `scripts/benchmark_durations` doesn't fetch all tests from the database (it doesn't expect any extra directories inside `test_runner/performance`) - For some reason, currently split into 4 groups ends up with the 4th group has no tests to run, which fails the job[1] - [0] https://github.com/neondatabase/neon/pull/4683 - [1] https://github.com/neondatabase/neon/issues/6629 ## Summary of changes - Generate `benchmark_durations.json` file once before we start `benchmarks` jobs (this makes it consistent across the jobs) and pass the file content through the GitHub Actions input (this makes it consistent for reruns) - `scripts/benchmark_durations` fix SQL query for getting all required tests - Split benchmarks into 5 jobs instead of 4 jobs.	2024-02-06 17:00:55 +00:00
Joonas Koivunen	bb92721168	build: migrate check-style-rust to small runners (#6588 ) We have more small runners than large runners, and often a shortage of large runners. Migrate `check-style-rust` to run on small runners.	2024-02-06 15:53:04 +00:00
Christian Schwarz	d7b29aace7	refactor(walredo): don't create WalRedoManager for broken tenants (#6597 ) When we'll later introduce a global pool of pre-spawned walredo processes (https://github.com/neondatabase/neon/issues/6581), this refactoring avoids plumbing through the reference to the pool to all the places where we create a broken tenant. Builds atop the refactoring in #6583	2024-02-06 16:20:02 +01:00
Christian Schwarz	53a3ed0a7e	debug_assert presence of `shard_id` tracing field (#6572 ) also: fixes https://github.com/neondatabase/neon/issues/6638	2024-02-06 14:43:33 +00:00
dependabot[bot]	27a3c9ecbe	build(deps): bump cryptography from 41.0.6 to 42.0.0 (#6643 )	2024-02-06 13:15:07 +00:00
John Spray	6297843317	tests: flakiness fixes in pageserver tests (#6632 ) Fix several test flakes: - test_sharding_service_smoke had log failures on "Dropped LSN updates" - test_emergency_mode had log failures on a deletion queue shutdown check, where the check was incorrect because it was expecting channel receiver to stay alive after cancellation token was fired. - test_secondary_mode_eviction had racing heatmap uploads because the test was using a live migration hook to set up locations, where that migration was itself uploading heatmaps and generally making the situation more complex than it needed to be. These are the failure modes that I saw when spot checking the last few failures of each test. This will mostly/completely address #6511, but I'll leave that ticket open for a couple days and then check if either of the tests named in that ticket are flaky. Related #6511	2024-02-06 12:49:41 +00:00
Vadim Kharitonov	dae56ef60c	Do not suspend compute if there is an active logical replication subscription. (#6570 ) ## Problem the idea is to keep compute up and running if there are any active logical replication subscriptions. ### Rationale Rationale: - The Write-Ahead Logging (WAL) files, which contain the data changes, will need to be retained on the publisher side until the subscriber is able to connect again and apply these changes. This could potentially lead to increased disk usage on the publisher - and we do not want to disrupt the source - I think it is more pain for our customer to resolve storage issues on the source than to pay for the compute at the target. - Upon resuming the compute resources, the subscriber will start consuming and applying the changes from the retained WAL files. The time taken to catch up will depend on the volume of changes and the configured vCPUs. we can avoid explaining complex situations where we lag behind (in extreme cases we could lag behind hours, days or even months) - I think an important use case for logical replication from a source is a one-time migration or release upgrade. In this case the customer would not mind if we are not suspended for the duration of the migration. We need to document this in the release notes and the documentation in the context of logical replication where Neon is the target (subscriber) ### See internal discussion here https://neondb.slack.com/archives/C04DGM6SMTM/p1706793400746539?thread_ts=1706792628.701279&cid=C04DGM6SMTM	2024-02-06 12:15:42 +00:00
Christian Schwarz	0de46fd6f2	heavier_once_cell: switch to tokio::sync::RwLock (#6589 ) Using the RwLock reduces contention on the hot path. Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2024-02-06 14:04:15 +02:00
Joonas Koivunen	53743991de	uploader: avoid cloning vecs just to get Bytes (#6645 ) Fix cloning the serialized heatmap on every attempt by just turning it into `bytes::Bytes` before clone so it will be a refcounted instead of refcounting a vec clone later on. Also fixes one cancellation token cloning I had missed in #6618. Cc: #6096	2024-02-06 11:34:13 +00:00
John Spray	431f4234d4	storage controller: embed database migrations in binary (#6637 ) ## Problem We don't have a neat way to carry around migration .sql files during deploy, and in any case would prefer to avoid depending on diesel CLI to deploy. ## Summary of changes - Use `diesel_migrations` crate to embed migrations in our binary - Run migrations on startup - Drop the diesel dependency in the `neon_local` binary, as the attachment_service binary just needs the database to exist. Do database creation with a simple `createdb`. Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2024-02-06 10:07:10 +00:00
Christian Schwarz	edcde05c1c	refactor(walredo): split up the massive `walredo.rs` (#6583 ) Part of https://github.com/neondatabase/neon/issues/6581	2024-02-06 09:44:49 +00:00
Christian Schwarz	e196d974cc	pagebench: actually implement `--num_clients` (#6640 ) Will need this to validate per-tenant throttling in https://github.com/neondatabase/neon/issues/5899	2024-02-06 10:34:16 +01:00
Joonas Koivunen	947165788d	refactor: needless cancellation token cloning (#6618 ) The solution we ended up for `backoff::retry` requires always cloning of cancellation tokens even though there is just `.await`. Fix that, and also turn the return type into `Option<Result<T, E>>` avoiding the need for the `E::cancelled()` fn passed in. Cc: #6096	2024-02-06 09:39:06 +02:00
John Spray	8e114bd610	control_plane/attachment_service: make --database-url optional (#6636 ) ## Problem This change was left out of #6585 accidentally -- just forgot to push the very last version of my branch. Now that we can load database url from Secrets Manager, we don't always need it on the CLI any more. We should let the user omit it instead of passing `--database-url ""` ## Summary of changes - Make `--database-url` optional	2024-02-05 20:31:55 +01:00
John Spray	cb7c89332f	control_plane: fix tenant GET, clean up endpoints (#6553 ) Cleanups from https://github.com/neondatabase/neon/pull/6394 - There was a rogue `*` breaking the `GET /tenant/:tenant_id`, which passes through to shard zero - There was a duplicate migrate endpoint - There are un-prefixed API endpoints that were only needed for compat tests and can now be removed.	2024-02-05 14:29:05 +00:00
Conrad Ludgate	74c5e3d9b8	use string interner for project cache (#6578 ) ## Problem Running some memory profiling with high concurrent request rate shows seemingly some memory fragmentation. ## Summary of changes Eventually, we will want to separate global memory (caches) from local memory (per connection handshake and per passthrough). Using a string interner for project info cache helps reduce some of the fragmentation of the global cache by having a single heap dedicated to project strings, and not scattering them throughout all a requests. At the same time, the interned key is 4 bytes vs the 24 bytes that `SmolStr` offers. Important: we should only store verified strings in the interner because there's no way to remove them afterwards. Good for caching responses from console.	2024-02-05 14:27:25 +00:00
Joonas Koivunen	5e8deca268	metrics: remove broken tenants (#6586 ) Before tenant migration it made sense to leak broken tenants in the metrics until restart. Nowdays it makes less sense because on cancellations we set the tenant broken. The set metric still allows filterable alerting. Fixes: #6507	2024-02-05 14:49:35 +02:00
Joonas Koivunen	db89b13aaa	fix: use the shared constant download buffer size (#6620 ) Noticed that we had forgotten to use `remote_timeline_client.rs::BUFFER_SIZE` in one instance.	2024-02-05 13:10:08 +01:00
Abhijeet Patil	01c57ec547	Removed Uploading of perf result to git repo 'zenith-perf-data' (#6590 ) ## Problem We were archiving the pref benchmarks to - neon DB - git repo `zenith-perf-data` As the pref batch ran in parallel when the uploading of results to zenith-perf-data` git repo resulted in merge conflicts. Which made the run flaky and as a side effect the build started failing . The problem is been expressed in https://github.com/neondatabase/neon/issues/5160 ## Summary of changes As the results were not used from the git repo it was redundant hence in this PR cleaning up the results uploading of of perf results to git repo The shell script `generate_and_push_perf_report.sh` was using a py script [git-upload](https://github.com/neondatabase/neon/compare/remove-perf-benchmark-git-upload?expand=1#diff-c6d938e7f060e487367d9dc8055245c82b51a73c1f97956111a495a8a86e9a33) and [scripts/generate_perf_report_page.py](https://github.com/neondatabase/neon/pull/6590/files#diff-81af2147e72d07e4cf8ee4395632596d805d6168ba75c71cab58db2659956ef8) which are not used anywhere else in repo hence also cleaning that up ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat the commit message to not include the above checklist	2024-02-05 10:08:20 +00:00
Arpad Müller	56cf360439	Don't preserve temp files on creation errors of delta layers (#6612 ) There is currently no cleanup done after a delta layer creation error, so delta layers can accumulate. The problem gets worse as the operation gets retried and delta layers accumulate on the disk. Therefore, delete them from disk (if something has been written to disk).	2024-02-05 09:53:37 +00:00
Heikki Linnakangas	df7bee7cfa	Fix compilation with recent glibc headers with close_range(2). I was getting an error: /home/heikki/git-sandbox/neon//pgxn/neon_walredo/walredoproc.c:161:5: error: conflicting types for ‘close_range’; have ‘int(unsigned int, unsigned int, unsigned int)’ 161 \| int close_range(unsigned int start_fd, unsigned int count, unsigned int flags) { \| ^~~~~~~~~~~ In file included from /usr/include/x86_64-linux-gnu/bits/sigstksz.h:24, from /usr/include/signal.h:328, from /home/heikki/git-sandbox/neon//pgxn/neon_walredo/walredoproc.c:50: /usr/include/unistd.h:1208:12: note: previous declaration of ‘close_range’ with type ‘int(unsigned int, unsigned int, int)’ 1208 \| extern int close_range (unsigned int __fd, unsigned int __max_fd, \| ^~~~~~~~~~~ The discrepancy is in the 3rd argument. Apparently in the glibc wrapper it's signed. As a quick fix, rename our close_range() function, the one that calls syscall() directly, to avoid the clash with the glibc wrapper. In the long term, an autoconf test would be nice, and some equivalent on macOS, see issue #6580.	2024-02-05 11:50:45 +02:00
Joonas Koivunen	70f646ffe2	More logging fixes (#6584 ) I was on-call this week, these would had made me understand more/faster of the system: - move stray attaching start logging inside the span it starts, add generation - log ancestor timeline_id or bootstrapping in the beginning of timeline creation	2024-02-05 09:34:03 +02:00
Vadim Kharitonov	7e8529bec1	Revert "Update pgvector to v0.6.0, third attempt" (#6610 ) The issue is still unsolved because of shmem size in VMs. Need to figure it out before applying this patch. For more details: ``` ERROR: could not resize shared memory segment "/PostgreSQL.2892504480" to 16774205952 bytes: No space left on device ``` As an example, the same issue in community pgvector/pgvector#453.	2024-02-04 22:27:07 +00:00
Clarence	09519c1773	chore: update wording in docs to improve readability (#6607 ) ## Problem Found typos while reading the docs ## Summary of changes Fixed the typos found	2024-02-04 19:33:38 +00:00
Joonas Koivunen	9dd69194d4	refactor(proxy): std::io::Write for BytesMut exists (#6606 ) Replace TODO with an existing implementation via `BufMut::writer``.	2024-02-03 22:15:59 +00:00
Heikki Linnakangas	647b85fc15	Update pgvector to v0.6.0, third attempt This includes a compatibility patch that is needed because pgvector now skips WAL-logging during the index build, and WAL-logs the index only in one go at the end. That's how GIN, GiST and SP-GIST index builds work in core PostgreSQL too, but we need some Neon-specific calls to mark the beginning and end of those build phases. pgvector is the first index AM that does that with parallel workers, so I had to modify those functions in the Neon extension to be aware of parallel workers. Only the leader needs to create the underlying file and perform the WAL-logging. (In principle, the parallel workers could participate in the WAL-logging too, but pgvector doesn't do that. This will need some further work if that changes). The previous attempt at this (#6592) missed that parallel workers needed those changes, and segfaulted in parallel build that spilled to disk. Testing ------- We don't have a place for regression tests of extensions at the moment. I tested this manually with the following script: ``` CREATE EXTENSION IF NOT EXISTS vector; DROP TABLE IF EXISTS tst; CREATE TABLE tst (i serial, v vector(3)); INSERT INTO tst (v) SELECT ARRAY[random(), random(), random()] FROM generate_series(1, 15000) g; -- Serial build, in memory ALTER TABLE tst SET (parallel_workers=0); SET maintenance_work_mem='50 MB'; CREATE INDEX idx ON tst USING hnsw (v vector_l2_ops); -- Test that the index works. (The table contents are random, and the -- search is approximate anyway, so we cannot check the exact values. -- For now, just eyeball that they look reasonable) set enable_seqscan=off; explain SELECT * FROM tst ORDER BY v <-> ARRAY[0, 0, 0]::vector LIMIT 5; SELECT * FROM tst ORDER BY v <-> ARRAY[0, 0, 0]::vector LIMIT 5; DROP INDEX idx; -- Serial build, spills to on disk ALTER TABLE tst SET (parallel_workers=0); SET maintenance_work_mem='5 MB'; CREATE INDEX idx ON tst USING hnsw (v vector_l2_ops); SELECT * FROM tst ORDER BY v <-> ARRAY[0, 0, 0]::vector LIMIT 5; DROP INDEX idx; -- Parallel build, in memory ALTER TABLE tst SET (parallel_workers=4); SET maintenance_work_mem='50 MB'; CREATE INDEX idx ON tst USING hnsw (v vector_l2_ops); SELECT * FROM tst ORDER BY v <-> ARRAY[0, 0, 0]::vector LIMIT 5; DROP INDEX idx; -- Parallel build, spills to disk ALTER TABLE tst SET (parallel_workers=4); SET maintenance_work_mem='5 MB'; CREATE INDEX idx ON tst USING hnsw (v vector_l2_ops); SELECT * FROM tst ORDER BY v <-> ARRAY[0, 0, 0]::vector LIMIT 5; DROP INDEX idx; ```	2024-02-03 09:19:37 +02:00
Heikki Linnakangas	c96aead502	Reorganize .dockerignore Author: Alexander Bayandin <alexander@neon.tech>	2024-02-03 09:19:37 +02:00
Arpad Müller	aac8eb2c36	Minor logging improvements (#6593 ) * log when `lsn_by_timestamp` finished together with its result * add back logging of the layer name as suggested in https://github.com/neondatabase/neon/pull/6549#discussion_r1475756808	2024-02-03 02:16:20 +01:00
Clarence	3d1b08496a	Update words in docs for better readability (#6600 ) ## Problem Found typos while reading the docs ## Summary of changes Fixed the typos found	2024-02-03 00:59:39 +00:00
Arpad Müller	0ac2606c8a	S3 restore test: Use a workaround to enable moto's self-copy support (#6594 ) While working on https://github.com/getmoto/moto/pull/7303 I discovered that if you enable bucket encryption, moto allows self-copies. So we can un-ignore the test. I tried it out locally, it works great. Followup of #6533, part of https://github.com/neondatabase/cloud/issues/8233	2024-02-02 23:45:57 +01:00
Em Sharnoff	d820d64e38	Bump vm-builder v0.21.0 -> v0.23.2 (#6480 ) Relevant changes were all from v0.23.0: - neondatabase/autoscaling#724 - neondatabase/autoscaling#726 - neondatabase/autoscaling#732 Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2024-02-02 22:39:20 +00:00
Arthur Petukhovsky	f2aa96f003	Console split RFC (#1997 ) [Rendered](https://github.com/neondatabase/neon/blob/rfc-console-split/docs/rfcs/017-console-split.md) Co-authored-by: Stas Kelvich <stas.kelvich@gmail.com>	2024-02-02 23:41:55 +02:00
Sasha Krassovsky	2fd8e24c8f	Switch sleeps to wait_until (#6575 ) ## Problem I didn't know about `wait_until` and was relying on `sleep` to wait for stuff. This caused some tests to be flaky. https://github.com/neondatabase/neon/issues/6561 ## Summary of changes Switch to `wait_until`, this should make it tests less flaky	2024-02-02 21:32:40 +00:00
Heikki Linnakangas	c9876b0993	Fix double-free bug in walredo process. (#6534 ) At the end of ApplyRecord(), we called pfree on the decoded record, if it was "oversized". However, we had alread linked it to the "decode queue" list in XLogReaderState. If we later called XLogBeginRead(), it called ResetDecoder and tried to free the same record again. The conditions to hit this are: - a large WAL record (larger than aboue 64 kB I think, per DEFAULT_DECODE_BUFFER_SIZE), and - another WAL record processed by the same WAL redo process after the large one. I think the reason we haven't seen this earlier is that you don't get WAL records that large that are sent to the WAL redo process, except when logical replication is enabled. Logical replication adds data to the WAL records, making them larger. To fix, allocate the buffer ourselves, and don't link it to the decode queue. Alternatively, we could perhaps have just removed the pfree(), but frankly I'm a bit scared about the whole queue thing.	2024-02-02 21:49:11 +02:00
John Spray	786e9cf75b	control_plane: implement HTTP compute hook for attachment service (#6471 ) ## Problem When we change which physical pageservers a tenant is attached to, we must update the control plane so that it can update computes. This will be done via an HTTP hook, as described in https://www.notion.so/neondatabase/Sharding-Service-Control-Plane-interface-6de56dd310a043bfa5c2f5564fa98365#1fe185a35d6d41f0a54279ac1a41bc94 ## Summary of changes - Optional CLI args `--control-plane-jwt-token` and `-compute-hook-url` are added. If these are set, then we will use this HTTP endpoint, instead of trying to use neon_local LocalEnv to update compute configuration. - Implement an HTTP-driven version of ComputeHook that calls into the configured URL - Notify for all tenants on startup, to ensure that we don't miss notifications if we crash partway through a change, and carry a `pending_compute_notification` flag at runtime to allow notifications to fail without risking never sending the update. - Add a test for all this One might wonder: why not do a "forever" retry for compute hook notifications, rather than carrying a flag on the shard to call reconcile() again later. The reason is that we will later limit concurreny of reconciles, when dealing with larger numbers of shards, and if reconcile is stuck waiting for the control plane to accept a notification request, it could jam up the whole system and prevent us making other changes. Anyway: from the perspective of the outside world, we _do_ retry forever, but we don't retry forever within a given Reconciler lifetime. The `pending_compute_notification` logic is predicated on later adding a background task that just calls `Service::reconcile_all` on a schedule to make sure that anything+everything that can fail a Reconciler::reconcile call will eventually be retried.	2024-02-02 19:22:03 +00:00
Vadim Kharitonov	0b91edb943	Revert pgvector 0.6.0 (#6592 ) It doesn't work in our VMs. Need more time to investigate	2024-02-02 18:36:31 +00:00
John Spray	2e5eab69c6	tests: remove test_gc_cutoff (#6587 ) This test became flaky when postgres retry handling was fixed to use backoff delays -- each iteration in this test's loop was taking much longer because pgbench doesn't fail until postgres has given up on retrying to the pageserver. We are just removing it, because the condition it tests is no longer risky: we reload all metadata from remote storage on restart, so crashing directly between making local changes and doing remote uploads isn't interesting any more. Closes: https://github.com/neondatabase/neon/issues/2856 Closes: https://github.com/neondatabase/neon/issues/5329	2024-02-02 18:20:18 +00:00
Joonas Koivunen	caf868e274	test: assert we eventually free space (#6536 ) in `test_statvfs_pressure_{usage,min_avail_bytes}` we now race against initial logical size calculation on-demand downloading the layers. first wait out the initial logical sizes, then change the final asserts to be "eventual", which is not great but it is faster than failing and retrying. this issue seems to happen only in debug mode tests. Fixes: #6510	2024-02-02 19:46:47 +02:00
John Spray	7e2436695d	storage controller: use AWS Secrets Manager for database URL, etc (#6585 ) ## Problem Passing secrets in via CLI/environment is awkward when using helm for deployment, and not ideal for security (secrets may show up in ps, /proc). We can bypass these issues by simply connecting directly to the AWS Secrets Manager service at runtime. ## Summary of changes - Add dependency on aws-sdk-secretsmanager - Update other aws dependencies to latest, to match transitive dependency versions - Add `Secrets` type in attachment service, using AWS SDK to load if secrets are not provided on the command line.	2024-02-02 16:57:11 +00:00
Conrad Ludgate	6506fd14c4	proxy: more refactors (#6526 ) ## Problem not really any problem, just some drive-by changes ## Summary of changes 1. move wake compute 2. move json processing 3. move handle_try_wake 4. move test backend to api provider 5. reduce wake-compute concerns 6. remove duplicate wake-compute loop	2024-02-02 16:07:35 +00:00

1 2 3 4 5 ...

4556 Commits