rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-08 05:52:55 +00:00

Author	SHA1	Message	Date
Joonas Koivunen	eba3bfc57e	test: python needs thread safety as well (#5992 ) we have test cases which launch processes from threads, and they capture output assuming this counter is thread-safe. at least according to my understanding this operation in python requires a lock to be thread-safe.	2023-11-30 15:48:40 +00:00
John Spray	57ae9cd07f	pageserver: add `flush_ms` and document `/location_config` API (#5860 ) - During migration of tenants, it is useful for callers to `/location_conf` to flush a tenant's layers while transitioning to AttachedStale: this optimization reduces the redundant WAL replay work that the tenant's new attached pageserver will have to do. Test coverage for this will come as part of the larger tests for live migration in #5745 #5842 - Flushing is controlled with `flush_ms` query parameter: it is the caller's job to decide how long they want to wait for a flush to complete. If flush is not complete within the time limit, the pageserver proceeds to succeed anyway: flushing is only an optimization. - Add swagger definitions for all this: the location_config API is the primary interface for driving tenant migration as described in docs/rfcs/028-pageserver-migration.md, and will eventually replace the various /attach /detach /load /ignore APIs. --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-11-30 14:22:07 +00:00
Christian Schwarz	3bb1030f5d	walingest: refactor if-cascade on `decoded.xl_rmid` into match statement (#5974 ) refs https://github.com/neondatabase/neon/issues/5962 --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-11-30 14:07:41 +00:00
John Spray	5d3c3636fc	tests: add a log allow list entry in `test_timeline_deletion_with_files_stuck_in_upload_queue` (#5981 ) Test failure seen here: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5860/7032903218/index.html#suites/837740b64a53e769572c4ed7b7a7eeeb/c0f1c79a70a3b9ab ``` E AssertionError: assert not [(302, '2023-11-29T13:23:51.046801Z ERROR request{method=PUT path=/v1/tenant/f6b845de60cb0e92f4426e0d6af1d2ea/timeline/69a8c98004abe71a281cff8642a45274/checkpoint request_id=eca33d8a-7af2-46e7-92ab-c28629feb42c}: Error processing HTTP request: InternalServerError(queue is in state Stopped\n')] ``` This appears to be a legitimate log: the test is issuing checkpoint requests in the background, and deleting (therefore shutting down) a timeline.	2023-11-30 13:44:14 +00:00
Conrad Ludgate	0c87d1866b	proxy: fix wake_compute error prop (#5989 ) ## Problem fixes #5654 - WakeComputeErrors occurring during a connect_to_compute got propagated as IO errors, which get forwarded to the user as "Couldn't connect to compute node" with no helpful message. ## Summary of changes Handle WakeComputeError during ConnectionError properly	2023-11-30 13:43:21 +00:00
Arpad Müller	8ec6033ed8	Pageserver disaster recovery RFC (#5248 ) Enable the pageserver to recover from data corruption events by implementing a feature to re-apply historic WAL records in parallel to the already occurring WAL replay. The feature is outside of the user-visible backup and history story, and only serves as a second-level backup for the case that there is a bug in the pageservers that corrupted the served pages. The RFC proposes the addition of two new features: * recover a broken branch from WAL (downtime is allowed) * a test recovery system to recover random branches to make sure recovery works	2023-11-30 14:30:17 +01:00
Anna Khanova	e12e2681e9	IP allowlist on the proxy side (#5906 ) ## Problem Per-project IP allowlist: https://github.com/neondatabase/cloud/issues/8116 ## Summary of changes Implemented IP filtering on the proxy side. To retrieve ip allowlist for all scenarios, added `get_auth_info` call to the control plane for: * sql-over-http * password_hack * cleartext_hack Added cache with ttl for sql-over-http path This might slow down a bit, consider using redis in the future. --------- Co-authored-by: Conrad Ludgate <conrad@neon.tech>	2023-11-30 13:14:33 +00:00
Joonas Koivunen	1e57ddaabc	fix: flush loop should also keep the gate open (#5987 ) I was expecting this to already be in place, because this should not conflict how we shutdown (0. cancel, 1. shutdown_tasks, 2. close gate).	2023-11-30 14:26:11 +02:00
John Khvatov	3e094e90d7	update aws sdk to 1.0.x (#5976 ) This change will be useful for experimenting with S3 performance. Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-11-30 14:17:58 +02:00
Christian Schwarz	292281c9df	pagectl: add subcommand to rewrite layer file summary (#5933 ) Part of getpage@lsn benchmark epic: https://github.com/neondatabase/neon/issues/5771	2023-11-30 11:34:30 +00:00
Rahul Modpur	50d959fddc	refactor: use serde for TenantConf deserialization Fixes: #5300 (#5310 ) Remove handcrafted TenantConf deserialization code. Use `serde_path_to_error` to include the field which failed parsing. Leaves the duplicated TenantConf in pageserver and models, does not touch PageserverConf handcrafted deserialization. Error change: - before change: "configure option `checkpoint_distance` cannot be negative" - after change: "`checkpoint_distance`: invalid value: integer `-1`, expected u64" Fixes: #5300 Cc: #3682 --------- Signed-off-by: Rahul Modpur <rmodpur2@gmail.com> Co-authored-by: Shany Pozin <shany@neon.tech> Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-11-30 12:47:13 +02:00
Conrad Ludgate	fc77c42c57	proxy: add flag to enable http pool for all users (#5959 ) ## Problem #5123 ## Summary of changes Add `--sql-over-http-pool-opt-in true` default cli arg. Allows us to set `--sql-over-http-pool-opt-in false` region-by-region	2023-11-30 10:19:30 +00:00
Conrad Ludgate	f05d1b598a	proxy: add more db error info (#5951 ) ## Problem https://github.com/neondatabase/serverless/issues/51 ## Summary of changes include more error fields in the json response	2023-11-30 10:18:59 +00:00
Christian Schwarz	ca597206b8	walredo: latency histogram for spawn duration (#5925 ) fixes https://github.com/neondatabase/neon/issues/5891	2023-11-29 18:44:37 +00:00
Rahul Modpur	46f20faa0d	neon_local: fix endpoint api to prevent two primary endpoints (#5520 ) `neon_local endpoint` subcommand currently allows creating two primary endpoints for the same branch which leads to shutdown of both endpoints `neon_local endpoint start` new behavior: 1. Fail if endpoint doesn't exist 2. Fail if two primary conflict detected Fixes #4959 Closes #5426 Signed-off-by: Rahul Modpur <rmodpur2@gmail.com> Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-11-29 19:38:03 +02:00
John Spray	9e55ad4796	pageserver: refactor TenantId to TenantShardId in Tenant & Timeline (#5957 ) (includes two preparatory commits from https://github.com/neondatabase/neon/pull/5960) ## Problem To accommodate multiple shards in the same tenant on the same pageserver, we must include the full TenantShardId in local paths. That means that all code touching local storage needs to see the TenantShardId. ## Summary of changes - Replace `tenant_id: TenantId` with `tenant_shard_id: TenantShardId` on Tenant, Timeline and RemoteTimelineClient. - Use TenantShardId in helpers for building local paths. - Update all the relevant call sites. This doesn't update absolutely everything: things like PageCache, TaskMgr, WalRedo are still shard-naive. The purpose of this PR is to update the core types so that other code can be added/updated incrementally without churning the most central shared types.	2023-11-29 14:52:35 +00:00
John Spray	70b5646fba	pageserver: remove redundant serialization helpers on DeletionList (#5960 ) Precursor for https://github.com/neondatabase/neon/pull/5957 ## Problem When DeletionList was written, TenantId/TimelineId didn't have human-friendly modes in their serde. #5335 added those, such that the helpers used in serialization of HashMaps are no longer necessary. ## Summary of changes - Add a unit test to ensure that this change isn't changing anything about the serialized form - Remove the serialization helpers for maps of Id	2023-11-29 10:39:12 +00:00
Konstantin Knizhnik	64890594a5	Optimize storing of null page in WAL (#5910 ) ## Problem PG16 (https://github.com/neondatabase/postgres/pull/327) adds a new function to SMGR: zeroextend. Its implementation in Neon actually WAL-logs zero pages of the extended relation. This zero page is WAL-logged using XLOG_FPI. Because the page is zero, the hole optimization (excluding from the image everything between pd_lower and pd_upper) doesn't work. ## Summary of changes In case of zero page (`PageIsNull()` returns true) assume `hole_size=BLCKSZ` ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2023-11-29 12:08:20 +02:00
Arseny Sher	78e73b20e1	Notify safekeeper readiness with systemd. To avoid downtime during deploy, as in busy regions initial load can currently take ~30s.	2023-11-29 14:07:06 +04:00
John Spray	c48cc020bd	pageserver: fix race between deletion completion and incoming requests (#5941 ) ## Problem This is a narrow race that can leave a stuck Stopping tenant behind, while emitting a log error "Missing InProgress marker during tenant upsert, this is a bug" - Deletion request 1 puts tenant into Stopping state, and fires off background part of DeleteTenantFlow - Deletion request 2 acquires a SlotGuard for the same tenant ID, leaves a TenantSlot::InProgress in place while it checks if the tenant's state is acceptable. - DeleteTenantFlow finishes, calls TenantsMap::remove, which removes the InProgress marker. - Deletion request 2 calls SlotGuard::revert, which upserts the old value (the Tenant in Stopping state), and emits the telltale log message. Closes: #5936 ## Summary of changes - Add a regression test which uses pausable failpoints to reproduce this scenario. - TenantsMap::remove is only called by DeleteTenantFlow. Its behavior is tweaked to express the different possible states, especially `InProgress` which carries a barrier. - In DeleteTenantFlow, if we see such a barrier result from remove(), wait for the barrier and then try removing again. --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-11-29 09:32:26 +00:00
dependabot[bot]	a15969714c	build(deps): bump openssl from 0.10.57 to 0.10.60 in /test_runner/pg_clients/rust/tokio-postgres (#5966 ) Bumps [openssl](https://github.com/sfackler/rust-openssl) from 0.10.57 to 0.10.60. Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-11-29 02:17:15 +01:00
dependabot[bot]	8c195d8214	build(deps): bump cryptography from 41.0.4 to 41.0.6 (#5970 ) Bumps [cryptography](https://github.com/pyca/cryptography) from 41.0.4 to 41.0.6. Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-11-29 02:16:35 +01:00
dependabot[bot]	0d16874960	build(deps): bump openssl from 0.10.55 to 0.10.60 (#5965 ) Bumps [openssl](https://github.com/sfackler/rust-openssl) from 0.10.55 to 0.10.60. Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-11-29 01:24:02 +01:00
Alexander Bayandin	fd440e7d79	neonvm: add pgbouncer patch to support DEALLOCATE/DISCARD ALL (#5958 ) pgbouncer 1.21.0 doesn't play nicely with DEALLOCATE/DISCARD ALL if prepared statement support is enabled (max_prepared_statements > 0). There's a patch[0] that improves this (it will be included in the next release of pgbouncer). This PR applies this patch on top of 1.21.0 release tarball. For some reason, the tarball doesn't include `test/test_prepared.py` (which is modified by the patch as well), so the patch can't be applied cleanly. I use `filterdiff` (from `patchutils` package) to apply the required changes. [0] `a7b3c0a5f4`	2023-11-28 23:43:24 +00:00
bojanserafimov	65160650da	Add walingest test (#5892 )	2023-11-28 12:50:53 -05:00
dependabot[bot]	12dd6b61df	build(deps): bump aiohttp from 3.8.6 to 3.9.0 (#5946 )	2023-11-28 17:47:15 +00:00
bojanserafimov	5345c1c21b	perf readme fix (#5956 )	2023-11-28 17:31:42 +00:00
Joonas Koivunen	105edc265c	fix: remove layer_removal_cs (#5108 ) Quest: https://github.com/neondatabase/neon/issues/4745. Follow-up to #4938. - add in locks for compaction and gc, so we don't have multiple executions at the same time in tests - remove layer_removal_cs - remove waiting for uploads in eviction/gc/compaction - #4938 will keep the file resident until upload completes Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-11-28 19:15:21 +02:00
Shany Pozin	8625466144	Move run_initdb to be async and guarded by max of 8 running tasks. Fixes #5895 . Use tenant.cancel for cancellation (#5921 ) ## Problem https://github.com/neondatabase/neon/issues/5895	2023-11-28 14:49:31 +00:00
John Spray	1ab0cfc8cb	pageserver: add sharding metadata to `LocationConf` (#5932 ) ## Problem The TenantShardId in API URLs is sufficient to uniquely identify a tenant shard, but not for it to function: it also needs to know its full sharding configuration (stripe size, layout version) in order to map keys to shards. ## Summary of changes - Introduce ShardIdentity: this is the superset of ShardIndex (#5924 ) that is required for translating keys to shard numbers. - Include ShardIdentity as an optional attribute of LocationConf - Extend the public `LocationConfig` API structure with a flat representation of shard attributes. The net result is that at the point we construct a `Tenant`, we have a `ShardIdentity` (inside LocationConf). This enables the next steps to actually use the ShardIdentity to split WAL and validate that page service requests are reaching the correct shard.	2023-11-28 13:14:51 +00:00
John Spray	ca469be1cf	pageserver: add shard indices to layer metadata (#5928 ) ## Problem For sharded tenants, the layer keys must include the shard number and shard count, to disambiguate keys written by different shards in the same tenant (shard number), and disambiguate layers written before and after splits (shard count). Closes: https://github.com/neondatabase/neon/issues/5924 ## Summary of changes There are no functional changes in this PR: everything behaves the same for the default ShardIndex::unsharded() value. Actual construct of sharded tenants will come next. - Add a ShardIndex type: this is just a wrapper for a ShardCount and ShardNumber. This is a subset of ShardIdentity: whereas ShardIdentity contains enough information to filter page keys, ShardIndex contains just enough information to construct a remote key. ShardIndex has a compact encoding, the same as the shard part of TenantShardId. - Store the ShardIndex as part of IndexLayerMetadata, if it is set to a different value than ShardIndex::unsharded. - Update RemoteTimelineClient and DeletionQueue to construct paths using the layer metadata. Deletion code paths that previously just passed a `Generation` now pass a full `LayerFileMetadata` to capture the shard as well. Notes to reviewers: - In deletion code paths, I could have used a (Generation, ShardIndex) instead of the full LayerFileMetadata. I opted for the full object partly for brevity, and partly because in future when we add checksums the deletion code really will care about the full metadata in order to validate that it is deleting what was intended. - While ShardIdentity and TenantShardId could both use a ShardIndex, I find that they read more cleanly as "flat" structs that spell out the shard count and number field separately. Serialization code would need writing out by hand anyway, because TenantShardId's serialized form is not a serde struct-style serialization. 
- ShardIndex doesn't _have_ to exist (we could use ShardIdentity everywhere), but it is a worthwhile optimization, as we will have many copies of this as part of layer metadata. In future the size difference between ShardIndex and ShardIdentity may become larger if we implement more sophisticated key distribution mechanisms (i.e. new values of ShardIdentity::layout). --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-11-28 11:47:25 +00:00
Christian Schwarz	286f34dfce	test suite: add method for generation-aware detachment of a tenant (#5939 ) Part of getpage@lsn benchmark epic: https://github.com/neondatabase/neon/issues/5771	2023-11-28 09:51:37 +00:00
Sasha Krassovsky	f290b27378	Fix check for if shmem is valid to take into account detached shmem (#5937 ) ## Problem We can segfault if we update connstr inside of a process that has detached from shmem (e.g. inside stats collector) ## Summary of changes Add a check to make sure we're not detached	2023-11-28 03:14:42 +00:00
Sasha Krassovsky	4cd18fcebd	Compile wal2json (#5893 ) Add wal2json extension	2023-11-27 18:17:26 -08:00
Anastasia Lubennikova	4c29e0594e	Update neon extension relocatable for existing installations (#5943 )	2023-11-27 23:29:24 +00:00
Anastasia Lubennikova	3c56a4dd18	Make neon extension relocatable to allow SET SCHEMA (#5942 )	2023-11-27 21:45:41 +00:00
Conrad Ludgate	316309c85b	channel binding (#5683 ) ## Problem channel binding protects scram from sophisticated MITM attacks where the attacker is able to produce 'valid' TLS certificates. ## Summary of changes get the tls-server-end-point channel binding, and verify it is correct for the SCRAM-SHA-256-PLUS authentication flow	2023-11-27 21:45:15 +00:00
Arpad Müller	e09bb9974c	bootstrap_timeline: rename initdb_path to pgdata_path (#5931 ) This is a rename without functional changes, in preparation for #5912. Split off from #5912 as per review request.	2023-11-27 20:14:39 +00:00
Anastasia Lubennikova	5289f341ce	Use test specific directory in test_remote_extensions (#5938 )	2023-11-27 18:57:58 +00:00
Joonas Koivunen	683ec2417c	deflake: test_live_reconfig_get_evictions_low_residence_... (#5926 ) - disable extra tenant - disable compaction which could try to repartition while we assert Split from #5108.	2023-11-27 15:20:54 +02:00
Christian Schwarz	a76a503b8b	remove confusing no-op .take() of init_tenant_load_remote (#5923 ) The `Tenant::spawn()` method already `.take()`s it. I think this was an oversight in https://github.com/neondatabase/neon/pull/5580 .	2023-11-27 12:50:19 +00:00
Anastasia Lubennikova	92bc2bb132	Refactor remote extensions feature to request extensions from proxy (#5836 ) instead of direct S3 request. Pros: - simplify code a lot (no need to provide AWS credentials and paths); - reduce latency of downloading extension data as proxy resides near computes; - reduce AWS costs as proxy has cache and 1000 computes asking the same extension will not generate 1000 downloads from S3. - we can use only one S3 bucket to store extensions (and get rid of regional buckets which were introduced to reduce latency); Changes: - deprecate remote-ext-config compute_ctl parameter, use http://pg-ext-s3-gateway if any old format remote-ext-config is provided; - refactor tests to use mock http server;	2023-11-27 12:10:23 +00:00
John Spray	b80b9e1c4c	pageserver: remove defunct local timeline delete markers (#5699 ) ## Problem Historically, we treated the presence of a timeline on local disk as evidence that it logically exists. Since #5580 that is no longer the case, so we can always rely on remote storage. If we restart and the timeline is gone in remote storage, we will also purge it from local disk: no need for a marker. Reference on why this PR is for timeline markers and not tenant markers: https://github.com/neondatabase/neon/issues/5080#issuecomment-1783187807 ## Summary of changes Remove code paths that read + write deletion marker for timelines. Leave code path that deletes these markers, just in case we deploy while there are some in existence. This can be cleaned up later. (https://github.com/neondatabase/neon/issues/5718)	2023-11-27 09:31:20 +00:00
Anastasia Lubennikova	87b8ac3ec3	Only create neon extension in postgres database; (#5918 ) Create neon extension in neon schema.	2023-11-26 08:37:01 +00:00
Joonas Koivunen	6b1c4cc983	fix: long timeline create cancelled by tenant delete (#5917 ) Fix the fallible vs. infallible check order with `UninitTimeline::finish_creation` so that the incomplete timeline can be removed. Currently the order of drop guard unwrapping causes uninit files to be left on pageserver, blocking the tenant deletion. Cc: #5914 Cc: #investigation-2023-11-23-stuck-tenant-deletion	2023-11-24 16:17:56 +00:00
Joonas Koivunen	831fad46d5	tests: fix allowed_error for compaction detecting a shutdown (#5919 ) This has been causing flaky tests, [example evidence]. Follow-up to #5883 where I forgot to fix this. [example evidence]: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-5917/6981540065/index.html#suites/9d2450a537238135fd4007859e09aca7/6fd3556a879fa3d1	2023-11-24 16:14:32 +00:00
Joonas Koivunen	53851ea8ec	fix: log cancelled request handler errors (#5915 ) noticed during [investigation] with @problame a major point of lost error logging which would have sped up the investigation. Cc: #5815 [investigation]: https://neondb.slack.com/archives/C066ZFAJU85/p1700751858049319	2023-11-24 15:54:06 +02:00
Joonas Koivunen	044375732a	test: support validating allowed_errors against a logfile (#5905 ) this will make it easier to test if an added allowed_error does in fact match for example against a log file from an allure report. ``` $ python3 test_runner/fixtures/pageserver/allowed_errors.py --help usage: allowed_errors.py [-h] [-i INPUT] check input against pageserver global allowed_errors optional arguments: -h, --help show this help message and exit -i INPUT, --input INPUT Pageserver logs file. Reads from stdin if no file is provided. ``` Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2023-11-24 12:43:25 +00:00
Konstantin Knizhnik	ea63b43009	Check if LFC was initialized in local_cache_pages function (#5911 ) ## Problem There is no check that LFC is initialised (`lfc_max_size != 0`) in `local_cache_pages` function ## Summary of changes Add proper check. ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2023-11-24 08:23:00 +02:00
Conrad Ludgate	a56fd45f56	proxy: fix memory leak again (#5909 ) ## Problem The connections.join_next helped but it wasn't enough... The way I implemented the improvement before was still faulty but it mostly worked so it looked like it was working correctly. From [`tokio::select` docs](https://docs.rs/tokio/latest/tokio/macro.select.html): > 4. Once an <async expression> returns a value, attempt to apply the value to the provided <pattern>, if the pattern matches, evaluate <handler> and return. If the pattern does not match, disable the current branch and for the remainder of the current call to select!. Continue from step 3. The `connections.join_next()` future would complete and `Some(Err(e))` branch would be evaluated but not match (as the future would complete without panicking, we would hope). Since the branch doesn't match, it's disabled. The select continues but never attempts to call `join_next` again. Getting unlucky, more TCP connections are created than we attempt to join_next. ## Summary of changes Replace the `Some(Err(e))` pattern with `Some(e)`. Because of the auto-disabling feature, we don't need the `if !connections.is_empty()` step as the `None` pattern will disable it for us.	2023-11-23 19:11:24 +00:00
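
Note on commit eba3bfc57e ("test: python needs thread safety as well"): the message says a counter used by the test fixtures must be thread-safe because processes are launched from threads. The following is only a minimal sketch of that idea, not the code from the commit. The class name `ThreadSafeCounter` and the helper `run_workers` are hypothetical and exist purely for illustration. It guards the read-modify-write with a `threading.Lock` so that no two callers receive the same value.

```python
import threading


class ThreadSafeCounter:
    """Hypothetical counter for illustration; not the actual test fixture."""

    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def next(self) -> int:
        # Increment and read under one lock so concurrent callers
        # can never be handed the same value.
        with self._lock:
            self._value += 1
            return self._value


def run_workers(counter: ThreadSafeCounter, n_threads: int = 8, per_thread: int = 1000) -> list:
    """Draw values from the counter on several threads and collect them all."""
    results = []
    results_lock = threading.Lock()

    def worker() -> None:
        local = [counter.next() for _ in range(per_thread)]
        with results_lock:
            results.extend(local)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `run_workers(ThreadSafeCounter())` should return as many distinct values as there were calls to `next()`; without the lock, that uniqueness would not be guaranteed.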

4108 Commits