rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-05 13:10:37 +00:00

Author	SHA1	Message	Date
Bojan Serafimov	8657f8e0c0	wip	2023-12-04 15:56:07 -05:00
Bojan Serafimov	c358014389	configurable n_iterations	2023-12-04 15:15:25 -05:00
Bojan Serafimov	c0af1a5044	wip	2023-12-04 15:08:27 -05:00
Bojan Serafimov	094e7606a4	wip	2023-12-04 14:39:55 -05:00
Bojan Serafimov	dd4ff8ef9e	wip	2023-12-04 14:19:45 -05:00
Bojan Serafimov	13b076401a	no remote storage	2023-12-04 13:20:15 -05:00
Bojan Serafimov	5615aca244	Merge branch 'main' into add-profiler	2023-12-01 09:36:59 -05:00
bojanserafimov	fd81945a60	Use TEST_OUTPUT envvar in pageserver (#5984 )	2023-12-01 09:16:24 -05:00
bojanserafimov	e49c21a3cd	Speed up rel extend (#5983 )	2023-12-01 09:11:41 -05:00
Anastasia Lubennikova	92e7cd40e8	add sql_exporter to vm-image (#5949 ) expose LFC metrics	2023-12-01 13:40:49 +00:00
Alexander Bayandin	7eabfc40ee	test_runner: use separate directory for each rerun (#6004 ) ## Problem While investigating https://github.com/neondatabase/neon/issues/5854, we hypothesised that logs/repo-dir from the initial failure might leak into reruns. Use different directories for each run to avoid such a possibility. ## Summary of changes - make each test rerun use different directories - update `pytest-rerunfailure` plugin from 11.1.2 to 13.0	2023-12-01 13:26:19 +00:00
Christian Schwarz	ce1652990d	logical size: better represent level of accuracy in the type system (#5999 ) I would love to not expose the inaccurate value in the mgmt API at all, and in fact control plane doesn't use it [^1]. But our tests do, and I have no desire to change them at this time. [^1]: https://github.com/neondatabase/cloud/pull/8317	2023-12-01 14:16:29 +01:00
Christian Schwarz	8cd28e1718	logical size calculation: make .current_size() infallible (#5999 ) ... by panicking on overflow; It was made fallible initially due to a lack of confidence in logical size calculation. However, the error has never happened since I have been at Neon. Let's stop worrying about this by converting the overflow check into a panic.	2023-12-01 14:16:29 +01:00
Christian Schwarz	1c88824ed0	initial logical size calculation: add a bunch of metrics (#5995 ) These will help us answer questions such as: - when & at what do calculations get started after PS restart? - how often is the api to get current incrementally-computed logical size called, and does it return Exact vs Approximate? I'd also be interested in a histogram of how much wall clock time size calculations take, but, I don't know good bucket sizes, and, logging it would introduce yet another per-timeline log message during startup; don't think that's worth it just yet. Context - https://neondb.slack.com/archives/C033RQ5SPDH/p1701197668789769 - https://github.com/neondatabase/neon/issues/5962 - https://github.com/neondatabase/neon/issues/5963 - https://github.com/neondatabase/neon/pull/5955 - https://github.com/neondatabase/cloud/issues/7408	2023-12-01 12:52:59 +01:00
Arpad Müller	1ce1c82d78	Clean up local state if index_part.json request gives 404 (#6009 ) If `index_part.json` is (verifiably) not present on remote storage, we should regard the timeline as nonexistent. This lets `clean_up_timelines` purge the partial local disk state, which is important in the case of incomplete creations leaving behind state that hinders retries. For incomplete deletions, we also want the timeline's local disk content to be gone completely. The PR removes the allowed warnings added by #5390 and #5912, as we now are only supposed to issue info level messages. It also adds a reproducer for #6007, by parametrizing the `test_timeline_init_break_before_checkpoint_recreate` test added by #5390. If one reverts the .rs changes, the "cannot create its uninit mark file" log line occurs once one comments out the failing checks for the local disk state being actually empty. Closes #6007 --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-12-01 10:58:06 +00:00
Vadim Kharitonov	f784e59b12	Update timescaledb to 2.13.0 (#5975 ) TimescaleDB has released 2.13.0. This version is compatible with Postgres16	2023-11-30 17:12:52 -06:00
Arpad Müller	b71b8ecfc2	Add existing_initdb_timeline_id param to timeline creation (#5912 ) This PR adds an `existing_initdb_timeline_id` option to timeline creation APIs, taking an optional timeline ID. Follow-up of #5390. If the `existing_initdb_timeline_id` option is specified via the HTTP API, the pageserver downloads the existing initdb archive from the given timeline ID and extracts it, instead of running initdb itself. --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-11-30 22:32:04 +01:00
Arpad Müller	3842773546	Correct RFC number for Pageserver WAL DR RFC (#5997 ) When I opened #5248, 27 was an unused RFC number. Since then, two RFCs have been merged, so now 27 is taken. 29 is free though, so move it there.	2023-11-30 21:01:25 +00:00
Conrad Ludgate	f39fca0049	proxy: chore: replace strings with SmolStr (#5786 ) ## Problem no problem ## Summary of changes replaces boxstr with arcstr as it's cheaper to clone. mild perf improvement. probably should look into other smallstring optimisations tbh, they will likely be even better. The longest endpoint name I was able to construct is something like `ep-weathered-wildflower-12345678` which is 32 bytes. Most string optimisations top out at 23 bytes	2023-11-30 20:52:30 +00:00
Joonas Koivunen	b451e75dc6	test: include cmdline in captured output (#5977 ) aiming for faster to understand a bunch of `.stdout` and `.stderr` files, see example echo_1.stdout differences: ``` +# echo foobar abbacd + foobar abbacd ``` it can be disabled and is disabled in this PR for some tests; use `pg_bin.run_capture(..., with_command_header=False)` for that. as a bonus this cleans up the echoed newlines from s3_scrubber output which are also saved to file but echoed to test log. Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2023-11-30 17:31:03 +00:00
Anna Khanova	3657a3c76e	Proxy fix metrics record (#5996 ) ## Problem Some latency metrics are recorded in an inconsistent way. ## Summary of changes Make sure that everything is recorded in seconds.	2023-11-30 16:33:54 +00:00
Joonas Koivunen	eba3bfc57e	test: python needs thread safety as well (#5992 ) we have test cases which launch processes from threads, and they capture output assuming this counter is thread-safe. at least according to my understanding this operation in python requires a lock to be thread-safe.	2023-11-30 15:48:40 +00:00
John Spray	57ae9cd07f	pageserver: add `flush_ms` and document `/location_config` API (#5860 ) - During migration of tenants, it is useful for callers to `/location_config` to flush a tenant's layers while transitioning to AttachedStale: this optimization reduces the redundant WAL replay work that the tenant's new attached pageserver will have to do. Test coverage for this will come as part of the larger tests for live migration in #5745 #5842 - Flushing is controlled with `flush_ms` query parameter: it is the caller's job to decide how long they want to wait for a flush to complete. If flush is not complete within the time limit, the pageserver proceeds to succeed anyway: flushing is only an optimization. - Add swagger definitions for all this: the location_config API is the primary interface for driving tenant migration as described in docs/rfcs/028-pageserver-migration.md, and will eventually replace the various /attach /detach /load /ignore APIs. --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-11-30 14:22:07 +00:00
Christian Schwarz	3bb1030f5d	walingest: refactor if-cascade on `decoded.xl_rmid` into match statement (#5974 ) refs https://github.com/neondatabase/neon/issues/5962 --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-11-30 14:07:41 +00:00
John Spray	5d3c3636fc	tests: add a log allow list entry in `test_timeline_deletion_with_files_stuck_in_upload_queue` (#5981 ) Test failure seen here: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5860/7032903218/index.html#suites/837740b64a53e769572c4ed7b7a7eeeb/c0f1c79a70a3b9ab ``` E AssertionError: assert not [(302, '2023-11-29T13:23:51.046801Z ERROR request{method=PUT path=/v1/tenant/f6b845de60cb0e92f4426e0d6af1d2ea/timeline/69a8c98004abe71a281cff8642a45274/checkpoint request_id=eca33d8a-7af2-46e7-92ab-c28629feb42c}: Error processing HTTP request: InternalServerError(queue is in state Stopped\n')] ``` This appears to be a legitimate log: the test is issuing checkpoint requests in the background, and deleting (therefore shutting down) a timeline.	2023-11-30 13:44:14 +00:00
Conrad Ludgate	0c87d1866b	proxy: fix wake_compute error prop (#5989 ) ## Problem fixes #5654 - WakeComputeErrors occurring during a connect_to_compute got propagated as IO errors, which get forwarded to the user as "Couldn't connect to compute node" with no helpful message. ## Summary of changes Handle WakeComputeError during ConnectionError properly	2023-11-30 13:43:21 +00:00
Arpad Müller	8ec6033ed8	Pageserver disaster recovery RFC (#5248 ) Enable the pageserver to recover from data corruption events by implementing a feature to re-apply historic WAL records in parallel to the already occurring WAL replay. The feature is outside of the user-visible backup and history story, and only serves as a second-level backup for the case that there is a bug in the pageservers that corrupted the served pages. The RFC proposes the addition of two new features: * recover a broken branch from WAL (downtime is allowed) * a test recovery system to recover random branches to make sure recovery works	2023-11-30 14:30:17 +01:00
Anna Khanova	e12e2681e9	IP allowlist on the proxy side (#5906 ) ## Problem Per-project IP allowlist: https://github.com/neondatabase/cloud/issues/8116 ## Summary of changes Implemented IP filtering on the proxy side. To retrieve ip allowlist for all scenarios, added `get_auth_info` call to the control plane for: * sql-over-http * password_hack * cleartext_hack Added cache with ttl for sql-over-http path This might slow down a bit, consider using redis in the future. --------- Co-authored-by: Conrad Ludgate <conrad@neon.tech>	2023-11-30 13:14:33 +00:00
Joonas Koivunen	1e57ddaabc	fix: flush loop should also keep the gate open (#5987 ) I was expecting this to already be in place, because this should not conflict how we shutdown (0. cancel, 1. shutdown_tasks, 2. close gate).	2023-11-30 14:26:11 +02:00
John Khvatov	3e094e90d7	update aws sdk to 1.0.x (#5976 ) This change will be useful for experimenting with S3 performance. Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-11-30 14:17:58 +02:00
Christian Schwarz	292281c9df	pagectl: add subcommand to rewrite layer file summary (#5933 ) Part of getpage@lsn benchmark epic: https://github.com/neondatabase/neon/issues/5771	2023-11-30 11:34:30 +00:00
Rahul Modpur	50d959fddc	refactor: use serde for TenantConf deserialization Fixes: #5300 (#5310 ) Remove handcrafted TenantConf deserialization code. Use `serde_path_to_error` to include the field which failed parsing. Leaves the duplicated TenantConf in pageserver and models, does not touch PageserverConf handcrafted deserialization. Error change: - before change: "configure option `checkpoint_distance` cannot be negative" - after change: "`checkpoint_distance`: invalid value: integer `-1`, expected u64" Fixes: #5300 Cc: #3682 --------- Signed-off-by: Rahul Modpur <rmodpur2@gmail.com> Co-authored-by: Shany Pozin <shany@neon.tech> Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-11-30 12:47:13 +02:00
Conrad Ludgate	fc77c42c57	proxy: add flag to enable http pool for all users (#5959 ) ## Problem #5123 ## Summary of changes Add `--sql-over-http-pool-opt-in true` default cli arg. Allows us to set `--sql-over-http-pool-opt-in false` region-by-region	2023-11-30 10:19:30 +00:00
Conrad Ludgate	f05d1b598a	proxy: add more db error info (#5951 ) ## Problem https://github.com/neondatabase/serverless/issues/51 ## Summary of changes include more error fields in the json response	2023-11-30 10:18:59 +00:00
Christian Schwarz	ca597206b8	walredo: latency histogram for spawn duration (#5925 ) fixes https://github.com/neondatabase/neon/issues/5891	2023-11-29 18:44:37 +00:00
Rahul Modpur	46f20faa0d	neon_local: fix endpoint api to prevent two primary endpoints (#5520 ) `neon_local endpoint` subcommand currently allows creating two primary endpoints for the same branch which leads to shutdown of both endpoints `neon_local endpoint start` new behavior: 1. Fail if endpoint doesn't exist 2. Fail if two primary conflict detected Fixes #4959 Closes #5426 Signed-off-by: Rahul Modpur <rmodpur2@gmail.com> Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-11-29 19:38:03 +02:00
John Spray	9e55ad4796	pageserver: refactor TenantId to TenantShardId in Tenant & Timeline (#5957 ) (includes two preparatory commits from https://github.com/neondatabase/neon/pull/5960) ## Problem To accommodate multiple shards in the same tenant on the same pageserver, we must include the full TenantShardId in local paths. That means that all code touching local storage needs to see the TenantShardId. ## Summary of changes - Replace `tenant_id: TenantId` with `tenant_shard_id: TenantShardId` on Tenant, Timeline and RemoteTimelineClient. - Use TenantShardId in helpers for building local paths. - Update all the relevant call sites. This doesn't update absolutely everything: things like PageCache, TaskMgr, WalRedo are still shard-naive. The purpose of this PR is to update the core types so that other code can be added/updated incrementally without churning the most central shared types.	2023-11-29 14:52:35 +00:00
John Spray	70b5646fba	pageserver: remove redundant serialization helpers on DeletionList (#5960 ) Precursor for https://github.com/neondatabase/neon/pull/5957 ## Problem When DeletionList was written, TenantId/TimelineId didn't have human-friendly modes in their serde. #5335 added those, such that the helpers used in serialization of HashMaps are no longer necessary. ## Summary of changes - Add a unit test to ensure that this change isn't changing anything about the serialized form - Remove the serialization helpers for maps of Id	2023-11-29 10:39:12 +00:00
Konstantin Knizhnik	64890594a5	Optimize storing of null page in WAL (#5910 ) ## Problem PG16 (https://github.com/neondatabase/postgres/pull/327) adds a new function to SMGR: zeroextend. Its implementation in Neon actually WAL-logs zero pages of the extended relation. This zero page is WAL-logged using XLOG_FPI. Because the page is zero, the hole optimization (excluding from the image everything between pd_upper and pd_lower) doesn't work. ## Summary of changes In case of zero page (`PageIsNull()` returns true) assume `hole_size=BLCKSZ` --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2023-11-29 12:08:20 +02:00
Arseny Sher	78e73b20e1	Notify safekeeper readiness with systemd. To avoid downtime during deploy, as in busy regions initial load can currently take ~30s.	2023-11-29 14:07:06 +04:00
John Spray	c48cc020bd	pageserver: fix race between deletion completion and incoming requests (#5941 ) ## Problem This is a narrow race that can leave a stuck Stopping tenant behind, while emitting a log error "Missing InProgress marker during tenant upsert, this is a bug" - Deletion request 1 puts tenant into Stopping state, and fires off background part of DeleteTenantFlow - Deletion request 2 acquires a SlotGuard for the same tenant ID, leaves a TenantSlot::InProgress in place while it checks if the tenant's state is acceptable. - DeleteTenantFlow finishes, calls TenantsMap::remove, which removes the InProgress marker. - Deletion request 2 calls SlotGuard::revert, which upserts the old value (the Tenant in Stopping state), and emits the telltale log message. Closes: #5936 ## Summary of changes - Add a regression test which uses pausable failpoints to reproduce this scenario. - TenantsMap::remove is only called by DeleteTenantFlow. Its behavior is tweaked to express the different possible states, especially `InProgress` which carries a barrier. - In DeleteTenantFlow, if we see such a barrier result from remove(), wait for the barrier and then try removing again. --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-11-29 09:32:26 +00:00
dependabot[bot]	a15969714c	build(deps): bump openssl from 0.10.57 to 0.10.60 in /test_runner/pg_clients/rust/tokio-postgres (#5966 ) Bumps [openssl](https://github.com/sfackler/rust-openssl) from 0.10.57 to 0.10.60. Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-11-29 02:17:15 +01:00
dependabot[bot]	8c195d8214	build(deps): bump cryptography from 41.0.4 to 41.0.6 (#5970 ) Bumps [cryptography](https://github.com/pyca/cryptography) from 41.0.4 to 41.0.6. Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-11-29 02:16:35 +01:00
dependabot[bot]	0d16874960	build(deps): bump openssl from 0.10.55 to 0.10.60 (#5965 ) Bumps [openssl](https://github.com/sfackler/rust-openssl) from 0.10.55 to 0.10.60. Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-11-29 01:24:02 +01:00
Alexander Bayandin	fd440e7d79	neonvm: add pgbouncer patch to support DEALLOCATE/DISCARD ALL (#5958 ) pgbouncer 1.21.0 doesn't play nicely with DEALLOCATE/DISCARD ALL if prepared statement support is enabled (max_prepared_statements > 0). There's a patch[0] that improves this (it will be included in the next release of pgbouncer). This PR applies this patch on top of 1.21.0 release tarball. For some reason, the tarball doesn't include `test/test_prepared.py` (which is modified by the patch as well), so the patch can't be applied cleanly. I use `filterdiff` (from `patchutils` package) to apply the required changes. [0] `a7b3c0a5f4`	2023-11-28 23:43:24 +00:00
Bojan Serafimov	922c4f07d5	Repeat 1000 times	2023-11-28 13:13:21 -05:00
Bojan Serafimov	0671fdd265	cargo lock	2023-11-28 12:56:13 -05:00
Bojan Serafimov	3f4fd576c6	Merge branch 'main' into add-profiler	2023-11-28 12:53:39 -05:00
bojanserafimov	65160650da	Add walingest test (#5892 )	2023-11-28 12:50:53 -05:00
dependabot[bot]	12dd6b61df	build(deps): bump aiohttp from 3.8.6 to 3.9.0 (#5946 )	2023-11-28 17:47:15 +00:00

4137 Commits