rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-16 01:42:55 +00:00

Author	SHA1	Message	Date
Joonas Koivunen	a691786ce2	fix: logical size calculation gating (#6915 ) Noticed that we are failing to handle `Result::Err` when entering a gate for logical size calculation. Audited rest of the gate enters, which seem fine, unified two instances. Noticed that the gate guard allows to remove a failpoint, then noticed that adjacent failpoint was blocking the executor thread instead of using `pausable_failpoint!`, fix both. eviction_task.rs now maintains a gate guard as well. Cc: #4733	2024-02-27 14:27:13 +00:00
Roman Zaynetdinov	2991d01b61	Export connection counts from sql_exporter (#6926 ) ## Problem We want to show connection counts to console users. ## Summary of changes Start exporting connection counts grouped by database name and connection state.	2024-02-27 13:47:05 +00:00
Konstantin Knizhnik	e895644555	Show LFC statistic in EXPLAIN (#6851 ) ## Problem LFC has high impact on Neon application performance but there is no way for user to check efficiency of its usage ## Summary of changes Show LFC statistic in EXPLAIN ANALYZE ## Description Local file cache (LFC) A layer of caching that stores frequently accessed data from the storage layer in the local memory of the Neon compute instance. This cache helps to reduce latency and improve query performance by minimizing the need to fetch data from the storage layer repeatedly. Externalization of LFC in explain output Then EXPLAIN ANALYZE output is extended to display important counts for local file cache (LFC) hits and misses. This works both, for EXPLAIN text and json output. File cache: hits Whenever the Postgres backend retrieves a page/block from SGMR, it is not found in shared buffer but the page is already found in the LFC this counter is incremented. File cache: misses Whenever the Postgres backend retrieves a page/block from SGMR, it is not found in shared buffer and also not in then LFC but the page is retrieved from Neon storage (page server) this counter is incremented. Example (for explain text output) ```sql explain (analyze,buffers,prefetch,filecache) select count(*) from pgbench_accounts; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Finalize Aggregate (cost=214486.94..214486.95 rows=1 width=8) (actual time=5195.378..5196.034 rows=1 loops=1) Buffers: shared hit=178875 read=143691 dirtied=128597 written=127346 Prefetch: hits=0 misses=1865 expired=0 duplicates=0 File cache: hits=141826 misses=1865 -> Gather (cost=214486.73..214486.94 rows=2 width=8) (actual time=5195.366..5196.025 rows=3 loops=1) Workers Planned: 2 Workers Launched: 2 Buffers: shared hit=178875 read=143691 dirtied=128597 written=127346 Prefetch: hits=0 misses=1865 expired=0 duplicates=0 File cache: hits=141826 misses=1865 -> Partial Aggregate (cost=213486.73..213486.74 rows=1 width=8) (actual time=5187.670..5187.670 rows=1 loops=3) Buffers: shared hit=178875 read=143691 dirtied=128597 written=127346 Prefetch: hits=0 misses=1865 expired=0 duplicates=0 File cache: hits=141826 misses=1865 -> Parallel Index Only Scan using pgbench_accounts_pkey on pgbench_accounts (cost=0.43..203003.02 rows=4193481 width=0) (actual time=0.574..4928.995 rows=3333333 loops=3) Heap Fetches: 3675286 Buffers: shared hit=178875 read=143691 dirtied=128597 written=127346 Prefetch: hits=0 misses=1865 expired=0 duplicates=0 File cache: hits=141826 misses=1865 ``` The json output uses the following keys and provides integer values for those keys: ``` ... "File Cache Hits": 141826, "File Cache Misses": 1865 ... ``` ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-02-27 14:45:54 +02:00
Christian Schwarz	62d77e263f	test_remote_timeline_client_calls_started_metric: fix flakiness (#6911 ) fixes https://github.com/neondatabase/neon/issues/6889 # Problem The failure in the last 3 flaky runs on `main` is ``` test_runner/regress/test_remote_storage.py:460: in test_remote_timeline_client_calls_started_metric churn("a", "b") test_runner/regress/test_remote_storage.py:457: in churn assert gc_result["layers_removed"] > 0 E assert 0 > 0 ``` That's this code `cd449d66ea/test_runner/regress/test_remote_storage.py (L448-L460)` So, the test expects GC to remove some layers but the GC doesn't. # Fix My impression is that the VACUUM isn't re-using pages aggressively enough, but I can't really prove that. Tried to analyze the layer map dump but it's too complex. So, this PR: - Creates more churn by doing the overwrite twice. - Forces image layer creation. It also drive-by removes the redundant call to timeline_compact, because, timeline_checkpoint already does that internally.	2024-02-27 10:55:10 +01:00
Alex Chi Z	b2bbc20311	fix: only alter default privileges when public schema exists (#6914 ) ## Problem Following up https://github.com/neondatabase/neon/pull/6885, only alter default privileges when the public schema exists. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-02-26 11:48:56 -09:00
Vlad Lazar	5accf6e24a	attachment_service: JWT auth enforcement (#6897 ) ## Problem Attachment service does not do auth based on JWT scopes. ## Summary of changes Do JWT based permission checking for requests coming into the attachment service. Requests into the attachment service must use different tokens based on the endpoint: * `/control` and `/debug` require `admin` scope * `/upcall` requires `generations_api` scope * `/v1/...` requires `pageserverapi` scope Requests into the pageserver from the attachment service must use `pageserverapi` scope.	2024-02-26 18:17:06 +00:00
Andreas Scherbaum	0881d4f9e3	Update README, include cleanup details (#6816 ) ## Problem README.md is missing cleanup instructions ## Summary of changes Add cleanup instructions Add instructions how to handle errors during initialization --------- Co-authored-by: Andreas Scherbaum <andreas@neon.tech>	2024-02-26 18:53:48 +01:00
Alexander Bayandin	975786265c	CI: Delete GitHub Actions caches once PR is closed (#6900 ) ## Problem > Approaching total cache storage limit (9.25 GB of 10 GB Used) > Least recently used caches will be automatically evicted to limit the total cache storage to 10 GB. [Learn more about cache usage.](https://docs.github.com/actions/using-workflows/caching-dependencies-to-speed-up-workflows#usage-limits-and-eviction-policy) From https://github.com/neondatabase/neon/actions/caches Some of these caches are from closed/merged PRs. ## Summary of changes - Add a workflow that deletes caches for closed branches	2024-02-26 18:17:22 +01:00
Christian Schwarz	c4059939e6	fixup(#6893 ): report_size() still used pageserver_created_persistent_* metrics (#6909 ) Use the remote_timeline_client metrics instead, they work for layer file uploads and are reasonable close to what the `pageserver_created_persistent_*` metrics were. Should we wait for empty upload queue before calling `report_size()`? part of https://github.com/neondatabase/neon/issues/6737	2024-02-26 17:28:00 +01:00
Bodobolero	75baf83fce	externalize statistics on LFC cache usage (#6906 ) ## Problem Customers should be able to determine the size of their workload's working set to right size their compute. Since Neon uses Local file cache (LFC) instead of shared buffers on bigger compute nodes to cache pages we need to externalize a means to determine LFC hit ratio in addition to shared buffer hit ratio. Currently the following end user documentation `fb7cd3af0e/content/docs/manage/endpoints.md (L137)` is wrong because it describes how to right size a compute node based on shared buffer hit ratio. Note that the existing functionality in extension "neon" is NOT available to end users but only to superuser / cloud_admin. ## Summary of changes - externalize functions and views in neon extension to end users - introduce a new view `NEON_STAT_FILE_CACHE` with the following DDL ```sql CREATE OR REPLACE VIEW NEON_STAT_FILE_CACHE AS WITH lfc_stats AS ( SELECT stat_name, count FROM neon_get_lfc_stats() AS t(stat_name text, count bigint) ), lfc_values AS ( SELECT MAX(CASE WHEN stat_name = 'file_cache_misses' THEN count ELSE NULL END) AS file_cache_misses, MAX(CASE WHEN stat_name = 'file_cache_hits' THEN count ELSE NULL END) AS file_cache_hits, MAX(CASE WHEN stat_name = 'file_cache_used' THEN count ELSE NULL END) AS file_cache_used, MAX(CASE WHEN stat_name = 'file_cache_writes' THEN count ELSE NULL END) AS file_cache_writes, -- Calculate the file_cache_hit_ratio within the same CTE for simplicity CASE WHEN MAX(CASE WHEN stat_name = 'file_cache_misses' THEN count ELSE 0 END) + MAX(CASE WHEN stat_name = 'file_cache_hits' THEN count ELSE 0 END) = 0 THEN NULL ELSE ROUND((MAX(CASE WHEN stat_name = 'file_cache_hits' THEN count ELSE 0 END)::DECIMAL / (MAX(CASE WHEN stat_name = 'file_cache_hits' THEN count ELSE 0 END) + MAX(CASE WHEN stat_name = 'file_cache_misses' THEN count ELSE 0 END))) * 100, 2) END AS file_cache_hit_ratio FROM lfc_stats ) SELECT file_cache_misses, file_cache_hits, file_cache_used, file_cache_writes, file_cache_hit_ratio from lfc_values; ``` This view can be used by an end user as follows: ```sql CREATE EXTENSION NEON; SELECT * from neon. NEON_STAT_FILE_CACHE" ``` The output looks like the following: ``` select * from NEON_STAT_FILE_CACHE; file_cache_misses \| file_cache_hits \| file_cache_used \| file_cache_writes \| file_cache_hit_ratio -------------------+-----------------+-----------------+-------------------+---------------------- 2133643 \| 108999742 \| 607 \| 10767410 \| 98.08 (1 row) ``` ## Checklist before requesting a review - [x ] I have performed a self-review of my code. - [x ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [x ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist	2024-02-26 16:06:00 +00:00
Roman Zaynetdinov	459c2af8c1	Expose LFC cache size limit from sql_exporter (#6912 ) ## Problem We want to report how much cache was used and what the limit was. ## Summary of changes Added one more query to sql_exporter to expose `neon.file_cache_size_limit`.	2024-02-26 10:36:11 -05:00
Arpad Müller	51a43b121c	Fix test_remote_storage_upload_queue_retries flakiness (#6898 ) * decreases checkpointing and compaction targets for even more layer files * write 10 thousand rows 2 times instead of writing 20 thousand rows 1 time so that there is more to GC. Before it was noisily jumping between 1 and 0 layer files, now it's jumping between 19 and 20 layer files. The 0 caused an assertion error that gave the test most of its flakiness. * larger timeout for the churn while failpoints are active thread: this is mostly so that the test is more robust on systems with more load Fixes #3051	2024-02-26 13:21:40 +01:00
John Spray	256058f2ab	pageserver: only write out legacy tenant config if no generation (#6891 ) ## Problem Previously we always wrote out both legacy and modern tenant config files. The legacy write enabled rollbacks, but we are long past the point where that is needed. We still need the legacy format for situations where someone is running tenants without generations (that will be yanked as well eventually), but we can avoid writing it out at all if we do have a generation number set. We implicitly also avoid writing the legacy config if our mode is Secondary (secondary mode is newer than generations). ## Summary of changes - Make writing legacy tenant config conditional on there being no generation number set.	2024-02-26 10:24:58 +00:00
Christian Schwarz	ceedc3ef73	Timeline::repartition: enforce no concurrent callers & lsn to not move backwards (#6862 ) This PR enforces aspects of `Timeline::repartition` that were already true at runtime: - it's not called concurrently, so, bail out if it is anyway (see comment why it's not called concurrently) - the `lsn` should never be moving backwards over the lifetime of a Timeline object, because last_record_lsn() can only move forwards over the lifetime of a Timeline object The switch to tokio::sync::Mutex blows up the size of the `partitioning` field from 40 bytes to 72 bytes on Linux x86_64. That would be concerning if it was a hot field, but, `partitioning` is only accessed every 20s by one task, so, there won't be excessive cache pain on it. (It still sucks that it's now >1 cache line, but I need the Send-able MutexGuard in the next PR) part of https://github.com/neondatabase/neon/issues/6861	2024-02-26 11:22:15 +01:00
Christian Schwarz	5273c94c59	pageserver: remove two obsolete/unused per-timeline metrics (#6893 ) over-compensating the addition of a new per-timeline metric in https://github.com/neondatabase/neon/pull/6834 part of https://github.com/neondatabase/neon/issues/6737	2024-02-26 09:19:24 +00:00
Christian Schwarz	dedf66ba5b	remove `gc_feedback` mechanism (#6863 ) It's been dead-code-at-runtime for 9 months, let's remove it. We can always re-introduce it at a later point. Came across this while working on #6861, which will touch `time_for_new_image_layer`. This is an opporunity to make that function simpler.	2024-02-26 10:05:24 +01:00
John Spray	8283779ee8	pageserver: remove legacy attach/detach APIs from swagger (#6883 ) ## Problem Since the location config API was added, the attach and detach endpoints are deprecated. Hiding them from consumers of the swagger definition is a precursor to removing them Neon's cloud no longer uses this api since https://github.com/neondatabase/cloud/pull/10538 Fully removing the APIs will implicitly make use of generation numbers mandatory, and should happen alongside https://github.com/neondatabase/neon/issues/5388, which will happen once we're happy that the storage controller is ready for prime time. ## Summary of changes - Remove /attach and /detach from pageserver's swagger file	2024-02-25 14:53:17 +00:00
Joonas Koivunen	b8f9e3a9eb	fix(flaky): typo Stopping/Stopped (#6894 ) introduced in `8dee9908f8`, should help with the #6681 common problem which is just a mismatched allowed error.	2024-02-24 21:32:41 +00:00
Christian Schwarz	ec3efc56a8	Revert "Revert "refactor(VirtualFile::crashsafe_overwrite): avoid Handle::block_on in callers"" (#6775 ) Reverts neondatabase/neon#6765 , bringing back #6731 We concluded that #6731 never was the root cause for the instability in staging. More details: https://neondb.slack.com/archives/C033RQ5SPDH/p1708011674755319 However, the massive amount of concurrent `spawn_blocking` calls from the `save_metadata` calls during startups might cause a performance regression. So, we'll merge this PR here after we've stopped writing the metadata #6769).	2024-02-23 17:16:43 +01:00
Alexander Bayandin	94f6b488ed	CI(release-proxy): fix a couple missed release-proxy branch handling (#6892 ) ## Problem In the original PR[0], I've missed a couple of `release` occurrences that should also be handled for `release-proxy` branch - [0] https://github.com/neondatabase/neon/pull/6797 ## Summary of changes - Add handling for `release-proxy` branch to allure report - Add handling for `release-proxy` branch to e2e tests malts.com	2024-02-23 14:12:09 +00:00
Anastasia Lubennikova	a12e4261a3	Add neon.primary_is_running GUC. (#6705 ) We set it for neon replica, if primary is running. Postgres uses this GUC at the start, to determine if replica should wait for RUNNING_XACTS from primary or not. Corresponding cloud PR is https://github.com/neondatabase/cloud/pull/10183 * Add test hot-standby replica startup. * Extract oldest_running_xid from XlRunningXits WAL records. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Konstantin Knizhnik <knizhnik@garret.ru> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2024-02-23 13:56:41 +00:00
Christian Schwarz	cd449d66ea	stop writing `metadata` file (#6769 ) Building atop #6777, this PR removes the code that writes the `metadata` file and adds a piece of migration code that removes any remaining `metadata` files. We'll remove the migration code after this PR has been deployed. part of https://github.com/neondatabase/neon/issues/6663 More cleanups punted into follow-up issue, as they touch a lot of code: https://github.com/neondatabase/neon/issues/6890	2024-02-23 14:33:47 +01:00
Alexander Bayandin	6f8f7c7de9	CI: Build images using docker buildx instead of kaniko (#6871 ) ## Problem To "build" a compute image that doesn't have anything new, kaniko takes 13m[0], docker buildx does it in 5m[1]. Also, kaniko doesn't fully support bash expressions in the Dockerfile `RUN`, so we have to use different workarounds for this (like `bash -c ...`). - [0] https://github.com/neondatabase/neon/actions/runs/8011512414/job/21884933687 - [1] https://github.com/neondatabase/neon/actions/runs/8008245697/job/21874278162 ## Summary of changes - Use docker buildx to build `compute-node` images - Use docker buildx to build `neon-image` image - Use docker buildx to build `compute-tools` image - Use docker hub for image cache (instead of ECR)	2024-02-23 12:36:18 +01:00
Alex Chi Z	12487e662d	compute_ctl: move default privileges grants to handle_grants (#6885 ) ## Problem Following up https://github.com/neondatabase/neon/pull/6884, hopefully, a real final fix for https://github.com/neondatabase/neon/issues/6236. ## Summary of changes `handle_migrations` is done over the main `postgres` db connection. Therefore, the privileges assigned here do not work with databases created later (i.e., `neondb`). This pull request moves the grants to `handle_grants`, so that it runs for each DB created. The SQL is added into the `BEGIN/END` block, so that it takes only one RTT to apply all of them. Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-02-22 17:00:03 -05:00
Arseny Sher	5bcae3a86e	Drop LR slots if too many .snap files are found. PR #6655 turned out to be not enough to prevent .snap files bloat; some subscribers just don't ack flushed position, thus never advancing the slot. Probably other bloating scenarios are also possible, so add a more direct restriction -- drop all slots if too many .snap files has been discovered.	2024-02-23 01:12:49 +04:00
Konstantin Knizhnik	47657f2df4	Flush logical messages with snapshots and replication origin (#6826 ) ## Problem See https://neondb.slack.com/archives/C04DGM6SMTM/p1708363190710839 ## Summary of changes Flush logical message with snapshot and origin state ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-02-22 21:33:38 +02:00
Sasha Krassovsky	d669dacd71	Add pgpartman (#6849 ) ## Problem ## Summary of changes ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist	2024-02-22 10:05:37 -08:00
Alex Chi Z	837988b6c9	compute_ctl: run migrations to grant default grantable privileges (#6884 ) ## Problem Following up on https://github.com/neondatabase/neon/pull/6845, we did not make the default privileges grantable before, and therefore, even if the users have full privileges, they are not able to grant them to others. Should be a final fix for https://github.com/neondatabase/neon/issues/6236. ## Summary of changes Add `WITH GRANT` to migrations so that neon_superuser can grant the permissions. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-02-22 17:49:02 +00:00
John Spray	9c6145f0a9	control_plane: fix a compilation error from racing PRs (#6882 ) Merge of two green PRs raced, and ended up with a non-compiling result.	2024-02-22 16:51:46 +00:00
Alexander Bayandin	2424d90883	CI: Split Proxy and Storage releases (#6797 ) ## Problem We want to release Proxy at a different cadence. ## Summary of changes - build-and-test workflow: - Handle the `release-proxy` branch - Tag images built on this branch with `release-proxy-XXX` tag - Trigger deploy workflow with `deployStorage=true` & `deployStorageBroker=true` on `release` branch - Trigger deploy workflow with `deployPgSniRouter=true` & `deployProxy=true` on `release-proxy` branch - release workflow (scheduled creation of release branch): - Schedule Proxy releases for Thursdays (a random day to make it different from Storage releases)	2024-02-22 17:15:18 +01:00
John Spray	cf3baf6039	storage controller: fix consistency check (#6855 ) - Some checks weren't properly returning an error when they failed - TenantState::to_persistent wasn't setting generation_pageserver properly - Changes to node scheduling policy weren't being persisted.	2024-02-22 14:10:49 +00:00
John Spray	9c48b5c4ab	controller: improved handling of offline nodes (#6846 ) Stacks on https://github.com/neondatabase/neon/pull/6823 - Pending a heartbeating mechanism (#6844 ), use /re-attach calls as a cue to mark an offline node as active, so that a node which is unavailable during controller startup doesn't require manual intervention if it later starts/restarts. - Tweak scheduling logic so that when we schedule the attached location for a tenant, we prefer to select from secondary locations rather than picking a fresh one. This is an interim state until we implement #6844 and full chaos testing for handling failures.	2024-02-22 14:01:06 +00:00
Christian Schwarz	c671aeacd4	fix(per-tenant throttling): incorrect `allowed_rps` field in log message (#6869 ) The `refill_interval` switched from a milliseconds usize to a Duration during a review follow-up, hence this slipped through manual testing. Part of https://github.com/neondatabase/neon/issues/5899	2024-02-22 14:19:11 +01:00
Joonas Koivunen	bc7a82caf2	feat: bare-bones /v1/utilization (#6831 ) PR adds a simple at most 1Hz refreshed informational API for querying pageserver utilization. In this first phase, no actual background calculation is performed. Instead, the worst possible score is always returned. The returned bytes information is however correct. Cc: #6835 Cc: #5331	2024-02-22 13:58:59 +02:00
John Spray	b5246753bf	storage controller: miscellaneous improvements (#6800 ) - Add some context to logs - Add tests for pageserver restarts when managed by storage controller - Make /location_config tolerate compute hook failures on shard creations, not just modifications.	2024-02-22 09:33:40 +00:00
John Spray	c1095f4c52	pageserver: don't warn on tempfiles in secondary location (#6837 ) ## Problem When a secondary mode location starts up, it scans local layer files. Currently it warns on any layers whose names don't parse as a LayerFileName, generating warning spam from perfectly normal tempfiles. ## Summary of changes - Refactor local vars to build a Utf8PathBuf for the layer file candidate - Use the crate::is_temporary check to identify + clean up temp files. --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-02-22 09:32:27 +00:00
Anna Khanova	1718c0b59b	Proxy: cancel query on connection drop (#6832 ) ## Problem https://github.com/neondatabase/cloud/issues/10259 ## Summary of changes Make sure that the request is dropped once the connection was dropped.	2024-02-21 22:43:55 +00:00
Joe Drumgoole	8107ae8377	README: Fix the link to the free tier request (#6858 )	2024-02-21 23:42:24 +01:00
dependabot[bot]	555ee9fdd0	build(deps): bump cryptography from 42.0.2 to 42.0.4 (#6870 )	2024-02-21 21:41:51 +00:00
Alex Chi Z	6921577cec	compute_ctl: grant default privileges on table to `neon_superuser` (#6845 ) ## Problem fix https://github.com/neondatabase/neon/issues/6236 again ## Summary of changes This pull request adds a setup command in compute spec to modify default privileges of public schema to have full permission on table/sequence for neon_superuser. If an extension upgrades to superuser during creation, the tables/sequences they create in the public schema will be automatically granted to neon_superuser. Questions: * does it impose any security flaws? public schema should be fine... * for all extensions that create tables in schemas other than public, we will need to manually handle them (e.g., pg_anon). * we can modify some extensions to remove their superuser requirement in the future. * we may contribute to Postgres to allow for the creation of extensions with a specific user in the future. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-02-21 16:09:34 -05:00
Arpad Müller	20fff05699	Remove stray del and TODO (#6867 ) The TODO has made it into #6821. I originally just put it there for bookmarking purposes. The `del` has been added by #6818 but is also redundant.	2024-02-21 19:39:14 +00:00
Alexander Bayandin	f2767d2056	CI: run check-permissions before all jobs (#6794 ) ## Problem For PRs from external contributors, we're still running `actionlint` and `neon_extra_builds` workflows (which could fail due to lack of permissions to secrets). ## Summary of changes - Extract `check-permissions` job to a separate reusable workflow - Depend all jobs from `actionlint` and `neon_extra_builds` workflows on `check-permissions`	2024-02-21 20:32:12 +01:00
Tristan Partin	76b92e3389	Fix multithreaded postmaster on macOS curl_global_init() with an IPv6 enabled curl build on macOS will cause the calling program to become multithreaded. Unfortunately for shared_preload_libraries, that means the postmaster becomes multithreaded, which CANNOT happen. There are checks in Postgres to make sure that this is not the case.	2024-02-21 13:22:30 -06:00
Arthur Petukhovsky	03f8a42ed9	Add walsenders_keep_horizon option (#6860 ) Add `--walsenders-keep-horizon` argument to safekeeper cmdline. It will prevent deleting WAL segments from disk if they are needed by the active START_REPLICATION connection. This is useful for sharding. Without this option, if one of the shard falls behind, it starts to read WAL from S3, which is much slower than disk. This can result in huge shard lagging.	2024-02-21 19:09:40 +00:00
Conrad Ludgate	60e5a56a5a	proxy: include client IP in ip deny message (#6854 ) ## Problem Debugging IP deny errors is difficult for our users ## Summary of changes Include the client IP in the deny message	2024-02-21 18:24:59 +01:00
John Spray	afda4420bd	test_sharding_ingress: bigger data, skip in debug mode (#6859 ) ## Problem Accidentally merged #6852 without this test stability change. The test as-written could sometimes fail on debug-pg14. ## Summary of changes - Write more data so that the test can more reliably assert on the ratio of total layers to small layers - Skip the test in debug mode, since writing any more than a tiny bit of data tends to result in a flaky test in the much slower debug environment.	2024-02-21 17:03:55 +00:00
John Spray	ce1673a8c4	tests: improve stability of tests using `wait_for_upload_queue_empty` (#6856 ) ## Problem PR #6834 introduced an assertion that the sets of metric labels on finished operations should equal those on started operations, which is not true if no operations have finished yet for a particular set of labels. ## Summary of changes - Instead of asserting out, wait and re-check in the case that finished metrics don't match started	2024-02-21 16:00:17 +00:00
John Spray	532b0fa52b	Revise CODEOWNERS (#6840 ) ## Problem - Current file has ambiguous ownership for some paths - The /control_plane/attachment_service is storage specific & updates there don't need to request reviews from other teams. ## Summary of changes - Define a single owning team per path, so that we can make reviews by that team mandatory in future. - Remove the top-level /control_plane as no one specific team owns neon_local, and we would rarely see a PR that exclusively touches that path. - Add an entry for /control_plane/attachment_service, which is newer storage-specific code.	2024-02-21 15:45:22 +00:00
Arpad Müller	4de2f0f3e0	Implement a sharded time travel recovery endpoint (#6821 ) The sharding service didn't have support for S3 disaster recovery. This PR adds a new endpoint to the attachment service, which is slightly different from the endpoint on the pageserver, in that it takes the shard count history of the tenant as json parameters: we need to do time travel recovery for both the shard count at the target time and the shard count at the current moment in time, as well as the past shard counts that either still reference. Fixes #6604, part of https://github.com/neondatabase/cloud/issues/8233 --------- Co-authored-by: John Spray <john@neon.tech>	2024-02-21 16:35:37 +01:00
Joonas Koivunen	41464325c7	fix: remaining missed cancellations and timeouts (#6843 ) As noticed in #6836 some occurances of error conversions were missed in #6697: - `std::io::Error` popped up by `tokio::io::copy_buf` containing `DownloadError` was turned into `DownloadError::Other` - similarly for secondary downloader errors These changes come at the loss of pathname context. Cc: #6096	2024-02-21 15:20:59 +00:00

... 3 4 5 6 7 ...

4928 Commits