rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-05 21:20:37 +00:00

Author	SHA1	Message	Date
Alexey Masterov	ecf20bb6fa	Add the workflow file	2024-09-03 17:21:33 +02:00
Alexey Masterov	5a4a2ae4cd	Fix the trailing space	2024-09-02 10:52:22 +02:00
Alexey Masterov	d4f656daa2	Change the python file	2024-09-02 09:07:11 +02:00
Alexey Masterov	e2921e352c	Change the patch file	2024-09-02 09:06:19 +02:00
Alexey Masterov	8fb8ec57ea	Add python script, rename patch file	2024-08-30 16:39:07 +02:00
Alexey Masterov	0c6b34b5a0	New patch	2024-08-30 13:22:50 +02:00
Alexey Masterov	b3d90a7d7d	Merge branch 'main' into amasterov/regress-arm	2024-08-29 09:19:34 +02:00
Alexey Masterov	9b0e277514	New patch	2024-08-28 18:14:52 +02:00
Heikki Linnakangas	c5ef779801	tests: Remove unnecessary entries from list of allowed errors (#8199 ) The "manual_gc" context was removed in commit `be0c73f8e7`. The code that generated the other error was removed in commit `9a6c0be823`.	2024-08-27 17:47:05 +01:00
Heikki Linnakangas	2d10306f7a	Remove support for pageserver <-> compute protocol version 1 (#8774 ) Protocol version 2 has been the default for a while now, and we no longer have any computes running in production that used protocol version 1. This completes the migration by removing support for v1 in both the pageserver and the compute. See issue #6211.	2024-08-27 18:36:33 +03:00
Alexey Kondratov	9b9f90c562	fix(walproposer): Do not restart on safekeepers reordering (#8840 ) ## Problem Currently, we compare `neon.safekeepers` values as is, so we unnecessarily restart walproposer even if safekeepers set didn't change. This leads to errors like: ```log FATAL: [WP] restarting walproposer to change safekeeper list from safekeeper-8.us-east-2.aws.neon.tech:6401,safekeeper-11.us-east-2.aws.neon.tech:6401,safekeeper-10.us-east-2.aws.neon.tech:6401 to safekeeper-11.us-east-2.aws.neon.tech:6401,safekeeper-8.us-east-2.aws.neon.tech:6401,safekeeper-10.us-east-2.aws.neon.tech:6401 ``` ## Summary of changes Split the GUC into the list of individual safekeepers and properly compare. We could've done that somewhere on the upper level, e.g., control plane, but I think it's still better when the actual config consumer is smarter and doesn't rely on upper levels.	2024-08-27 15:49:47 +02:00
Folke Behrens	52cb33770b	proxy: Rename backend types and variants as prep for refactor (#8845 ) * AuthBackend enum to AuthBackendType * BackendType enum to Backend * Link variants to Web * Adjust messages, comments, etc.	2024-08-27 14:12:42 +02:00
Conrad Ludgate	12850dd5e9	proxy: remove dead code (#8847 ) By marking everything possible as pub(crate), we find a few dead code candidates.	2024-08-27 12:00:35 +01:00
a-masterov	5d527133a3	Fix the pg_hintplan flakiness (#8834 ) ## Problem pg_hintplan test seems to be flaky, sometimes it fails, while usually it passes ## Summary of changes The regression test is changed to filter out the Neon service queries. The expected file is changed as well. ## Checklist before requesting a review - [x] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist	2024-08-27 12:39:42 +02:00
Arseny Sher	09362b6363	safekeeper: reorder routes and their handlers. Routes and their handlers were in a bit different order in 1) routes list 2) their implementation 3) python client 4) openapi spec, making addition of new ones intimidating. Make it the same everywhere, roughly lexicographically but preserving some of existing logic. No functional changes.	2024-08-27 07:37:55 +03:00
Alexey Kondratov	7820c572e7	fix(sql-exporter): Remove tenant_id from compute_logical_snapshot_files It appeared to be that it's already auto-added to all metrics [1] [1]: `3a907c317c/apps/base/ext-vmagent/vmagent.yaml (L43)`	2024-08-27 00:51:23 +02:00
Alexey Kondratov	bf03713fa1	fix(sql-exporter): Fix typo in gauge In `f4b3c317f` there was a typo and I missed that on review	2024-08-27 00:51:23 +02:00
Alex Chi Z.	0f65684263	feat(pageserver): use split layer writer in gc-compaction (#8608 ) Part of #8002, the final big PR in the batch. ## Summary of changes This pull request uses the new split layer writer in the gc-compaction. * It changes how layers are split. Previously, we split layers based on the original split point, but this creates too many layers (test_gc_feedback has one key per layer). * Therefore, we first verify if the layer map can be processed by the current algorithm (See https://github.com/neondatabase/neon/pull/8191, it's basically the same check) * On that, we proceed with the compaction. This way, it creates a large enough layer close to the target layer size. * Added a new set of functions `with_discard` in the split layer writer. This helps us skip layers if we are going to produce the same persistent key. * The delta writer will keep the updates of the same key in a single file. This might create a super large layer, but we can optimize it later. * The split layer writer is used in the gc-compaction algorithm, and it will split layers based on size. * Fix the image layer summary block encoded the wrong key range. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com> Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-08-26 14:19:47 -04:00
Christian Schwarz	97241776aa	pageserver: startup: ensure local disk state is durable (#8835 ) refs https://github.com/neondatabase/neon/issues/6989 Problem ------- After unclean shutdown, we get restarted, start reading the local filesystem, and make decisions based on those reads. However, some of the data might not yet have been fsynced when the unclean shutdown completed. Durability matters even though Pageservers are conceptually just a cache of state in S3. For example: - the cloud control plane is no control loop => pageserver responses to tenant attachment, etc, need to be durable. - the storage controller does not rely on this (as much?) - we don't have layer file checksumming, so, downloaded+renamed but not fsynced layer files are technically not to be trusted - https://github.com/neondatabase/neon/issues/2683 Solution -------- `syncfs` the tenants directory during startup, before we start reading from it. This is a bit overkill because we do remove some temp files (InMemoryLayer!) later during startup. Further, these temp files are particularly likely to be dirty in the kernel page cache. However, we don't want to refactor that cleanup code right now, and the dirty data on pageservers is generally not that high. Last, with [direct IO](https://github.com/neondatabase/neon/issues/8130) we're going to have near-zero kernel page cache anyway quite soon.	2024-08-26 18:07:55 +02:00
Arpad Müller	2dd53e7ae0	Timeline archival test (#8824 ) This PR: * Implements the rule that archived timelines require all of their children to be archived as well, as specified in the RFC. There is no fancy locking mechanism though, so the precondition can still be broken. As a TODO for later, we still allow unarchiving timelines with archived parents. * Adds an `is_archived` flag to `TimelineInfo` * Adds timeline_archival_config to `PageserverHttpClient` * Adds a new `test_timeline_archive` test, loosely based on `test_timeline_delete` Part of #8088	2024-08-26 17:30:19 +02:00
Folke Behrens	d6eede515a	proxy: clippy lints: handle some low hanging fruit (#8829 ) Should be mostly uncontroversial ones.	2024-08-26 15:16:54 +02:00
Alexey Kondratov	d48229f50f	feat(compute): Introduce new compute_subscriptions_count metric (#8796 ) ## Problem We need some metric to sneak peek into how many people use inbound logical replication (Neon is a subscriber). ## Summary of changes This commit adds a new metric `compute_subscriptions_count`, which is number of subscriptions grouped by enabled/disabled state. Resolves: neondatabase/cloud#16146	2024-08-26 14:34:18 +02:00
Jakub Kołodziejczak	cdfdcd3e5d	chore: improve markdown formatting (#8825 ) fixes: ![Screenshot_2024-08-25_16-25-30](https://github.com/user-attachments/assets/c993309b-6c2d-4938-9fd0-ce0953fc63ff) fixes: ![Screenshot_2024-08-25_16-26-29](https://github.com/user-attachments/assets/cf497f4a-d9e3-45a6-a1a5-7e215d96d022)	2024-08-25 16:33:45 +01:00
Conrad Ludgate	06795c6b9a	proxy: new local-proxy application (#8736 ) Add binary for local-proxy that uses the local auth backend. Runs only the http serverless driver support and offers config reload based on a config file and SIGHUP	2024-08-23 22:32:10 +01:00
Conrad Ludgate	701cb61b57	proxy: local auth backend (#8806 ) Adds a Local authentication backend. Updates http to extract JWT bearer tokens and passes them to the local backend to validate.	2024-08-23 18:48:06 +00:00
John Spray	0aa1450936	storage controller: enable timeline CRUD operations to run concurrently with reconciliation & make them safer (#8783 ) ## Problem - If a reconciler was waiting to be able to notify computes about a change, but the control plane was waiting for the controller to finish a timeline creation/deletion, the overall system can deadlock. - If a tenant shard was migrated concurrently with a timeline creation/deletion, there was a risk that the timeline operation could be applied to a non-latest-generation location, and thereby not really be persistent. This has never happened in practice, but would eventually happen at scale. Closes: #8743 ## Summary of changes - Introduce `Service::tenant_remote_mutation` helper, which looks up shards & generations and passes them into an inner function that may do remote I/O to pageservers. Before returning success, this helper checks that generations haven't incremented, to guarantee that changes are persistent. - Convert tenant_timeline_create, tenant_timeline_delete, and tenant_timeline_detach_ancestor to use this helper. - These functions no longer block on ensure_attached unless the tenant was never attached at all, so they should make progress even if we can't complete compute notifications. This increases the database load from timeline/create operations, but only with cheap read transactions.	2024-08-23 18:56:05 +01:00
John Spray	b65a95f12e	controller: use PageserverUtilization for scheduling (#8711 ) ## Problem Previously, the controller only used the shard counts for scheduling. This works well when hosting only many-sharded tenants, but works much less well when hosting single-sharded tenants that have a greater deviation in size-per-shard. Closes: https://github.com/neondatabase/neon/issues/7798 ## Summary of changes - Instead of UtilizationScore, carry the full PageserverUtilization through into the Scheduler. - Use the PageserverUtilization::score() instead of shard count when ordering nodes in scheduling. Q: Why did test_sharding_split_smoke need updating in this PR? A: There's an interesting side effect during shard splits: because we do not decrement the shard count in the utilization when we de-schedule the shards from before the split, the controller will now prefer to pick _different_ nodes for shards compared with which ones held secondaries before the split. We could use our knowledge of splitting to fix up the utilizations more actively in this situation, but I'm leaning toward leaving the code simpler, as in practical systems the impact of one shard on the utilization of a node should be fairly low (single digit %).	2024-08-23 18:32:56 +01:00
Conrad Ludgate	c1cb7a0fa0	proxy: flesh out JWT verification code (#8805 ) This change adds in the necessary verification steps for the JWT payload, and adds per-role querying of JWKs as needed for #8736	2024-08-23 18:01:02 +01:00
Alex Chi Z.	f4cac1f30f	impr(pageserver): error if keys are unordered in merge iter (#8818 ) In case of corrupted delta layers, we can detect the corruption and bail out the compaction. ## Summary of changes * Detect wrong delta desc of key range * Detect unordered deltas Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-08-23 16:38:42 +00:00
Conrad Ludgate	612b643315	update diesel (#8816 ) https://rustsec.org/advisories/RUSTSEC-2024-0365	2024-08-23 15:28:22 +00:00
Vlad Lazar	bcc68a7866	storcon_cli: add support for drain and fill operations (#8791 ) ## Problem We have been naughty and curl-ed storcon to fix-up drains and fills. ## Summary of changes Add support for starting/cancelling drain/fill operations via `storcon_cli`.	2024-08-23 14:48:06 +01:00
Joonas Koivunen	73286e6b9f	test: copy dict to avoid error on retry (#8811 ) there is no "const" in python, so when we modify the global dict, it will remain that way on the retry. fix to not have it influence other tests which might be run on the same runner. evidence: <https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8625/10513146742/index.html#/testresult/453c4ce05ada7496>	2024-08-23 14:43:08 +01:00
Alex Chi Z.	bc8cfe1b55	fix(pageserver): l0 check criteria (#8797 ) close https://github.com/neondatabase/neon/issues/8579 ## Summary of changes The `is_l0` check now takes both layer key range and the layer type. This allows us to have image layers covering the full key range in btm-most compaction (upcoming PR). However, we still don't allow delta layers to cover the full key range, and I will make btm-most compaction to generate delta layers with the key range of the keys existing in the layer instead of `Key::MIN..Key::HACK_MAX` (upcoming PR). Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-08-23 09:42:45 -04:00
Alex Chi Z.	6a74bcadec	feat(pageserver): remove features=testing restriction for compact (#8815 ) A small PR to make it possible to run force compaction in staging for btm-gc compaction testing. Part of https://github.com/neondatabase/neon/issues/8002 Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-08-23 14:32:00 +01:00
Alexander Bayandin	e62cd9e121	CI(autocomment): add arch to build type (#8809 ) ## Problem Failed / flaky tests for different arches don't have any difference in GitHub Autocomment ## Summary of changes - Add arch to build type for GitHub autocomment	2024-08-23 14:29:11 +01:00
Arpad Müller	e80ab8fd6a	Update serde_json to 1.0.125 (#8813 ) Updates `serde_json` to `1.0.125`, rolling out speedups added by a serde_json contributor. Release [link](https://github.com/serde-rs/json/releases/tag/1.0.125). Blog post [link](https://purplesyringa.moe/blog/i-sped-up-serde-json-strings-by-20-percent/).	2024-08-23 12:14:14 +01:00
MMeent	d8ca495eae	Require poetry >=1.8 (#8812 ) This was already a requirement for installing the python packages after https://github.com/neondatabase/neon/pull/8609 got merged, so this updates the documentation to reflect that.	2024-08-23 11:48:26 +01:00
Heikki Linnakangas	dbdb8a1187	Document how to use "git merge" for PostgreSQL minor version upgrades. (#8692 ) Our new policy is to use the "rebase" method and slice all the Neon commits into a nice patch set when doing a new major version, and use "merge" method on minor version upgrades on the release branches. "git merge" preserves the git history of Neon commits on the Postgres branches. While it's nice to rebase all the Neon changes to a logical patch set against upstream, having to do it between every minor release is a fair amount of work, and it loses the history, and is more error-prone.	2024-08-23 09:15:55 +03:00
Tristan Partin	f7ab3ffcb7	Check that TERM != dumb before using colors in pre-commit.py Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-08-22 18:03:45 -05:00
Tristan Partin	2f8d548a12	Update Postgres 16 to 16.4 Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-08-22 18:03:45 -05:00
Tristan Partin	66db381dc9	Update Postgres 15 to 15.8 Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-08-22 18:03:45 -05:00
Tristan Partin	6744ed19d8	Update Postgres 14 to 14.13 Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-08-22 18:03:45 -05:00
Tristan Partin	ae63ac7488	Write messages field by field instead of bytes sheet in test_simple_sync_safekeepers Co-authored-by: Arseny Sher <ars@neon.tech>	2024-08-22 18:03:45 -05:00
Alex Chi Z.	6eb638f4b3	feat(pageserver): warn on aux v1 tenants + default to v2 (#8625 ) part of https://github.com/neondatabase/neon/issues/8623 We want to discover potential aux v1 customers that we might have missed from the migrations. ## Summary of changes Log warnings on basebackup, load timeline, and the first put_file. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-08-22 22:31:38 +01:00
Konstantin Knizhnik	7a485b599b	Fix race condition in LRU list update in get_cached_relsize (#8807 ) ## Problem See https://neondb.slack.com/archives/C07J14D8GTX/p1724347552023709 Manipulations with LRU list in relation size cache are performed under shared lock ## Summary of changes Take exclusive lock ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-08-22 23:53:37 +03:00
Joonas Koivunen	b1c457898b	test_compatibility: flush in the end (#8804 ) `test_forward_compatibility` is still often failing at graceful shutdown. Fix this by explicit flush before shutdown. Example: https://neon-github-public-dev.s3.amazonaws.com/reports/main/10506613738/index.html#testresult/5e7111907f7ecfb2/ Cc: #8655 and #8708 Previous attempt: #8787	2024-08-22 16:38:03 +01:00
Alexey Masterov	362f411735	renew patches	2024-08-22 17:03:14 +02:00
Folke Behrens	1a9d559be8	proxy: Enable stricter/pedantic clippy checks (#8775 ) Create a list of currently allowed exceptions that should be reduced over time.	2024-08-22 13:29:05 +02:00
Alexey Kondratov	0e6c0d47a5	Revert "Use sycnhronous commit for logical replicaiton worker (#8645 )" (#8792 ) This reverts commit `cbe8c77997`. This change was originally made to test a hypothesis, but after that, the proper fix #8669 was merged, so now it's not needed. Moreover, the test is still flaky, so probably this bug was not a reason of the flakiness. Related to #8097	2024-08-22 12:52:36 +02:00
Arpad Müller	d645645fab	Sleep in test_scrubber_physical_gc (#8798 ) This copies a piece of code from `test_scrubber_physical_gc_ancestors` to fix a source of flakiness: later on we rely on stuff being older than a second, but the test can run faster under optimal conditions (as happened to me locally, but also observable in [this](https://neon-github-public-dev.s3.amazonaws.com/reports/main/10470762360/index.html#testresult/f713b02657db4b4c/retries) allure report): ``` test_runner/regress/test_storage_scrubber.py:169: in test_scrubber_physical_gc assert gc_summary["remote_storage_errors"] == 0 E assert 1 == 0 ```	2024-08-22 12:45:29 +02:00

5975 Commits
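
Commit 9b9f90c562 in the list above describes walproposer restarting because the `neon.safekeepers` value was compared as a raw string, so a mere reordering of the same hosts looked like a change. The idea of splitting the value into individual safekeepers and comparing them independently of order can be sketched roughly as follows. This is only an illustration in Python, not the actual walproposer code; the helper names `parse_safekeepers` and `safekeepers_changed` are hypothetical, and using a set (which also ignores duplicate entries) is an assumption of this sketch.

```python
def parse_safekeepers(value: str) -> frozenset:
    """Split a comma-separated safekeeper list into an order-insensitive set.

    Hypothetical helper: whitespace around entries and empty entries are ignored.
    """
    return frozenset(part.strip() for part in value.split(",") if part.strip())


def safekeepers_changed(old: str, new: str) -> bool:
    """Return True only if the set of safekeepers differs, ignoring order."""
    return parse_safekeepers(old) != parse_safekeepers(new)


# The two lists from the log message quoted in the commit: same hosts, different order.
old = (
    "safekeeper-8.us-east-2.aws.neon.tech:6401,"
    "safekeeper-11.us-east-2.aws.neon.tech:6401,"
    "safekeeper-10.us-east-2.aws.neon.tech:6401"
)
new = (
    "safekeeper-11.us-east-2.aws.neon.tech:6401,"
    "safekeeper-8.us-east-2.aws.neon.tech:6401,"
    "safekeeper-10.us-east-2.aws.neon.tech:6401"
)

# A plain string comparison reports a change; the set comparison does not.
print(old != new)
print(safekeepers_changed(old, new))
```

With a comparison of this kind, a restart would be triggered only when a safekeeper is actually added or removed, which matches the behaviour the commit message asks for.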