rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-08 14:02:55 +00:00

Author	SHA1	Message	Date
Konstantin Knizhnik	17c002b660	Do not copy logical replication slots to replica (#9458 ) ## Problem Replication slots are now persisted using the AUX files mechanism and included in the basebackup when a replica is launched. These slots are not used on the replica but hold WAL, which may cause local disk space exhaustion. ## Summary of changes Add `--replica` parameter to basebackup request and do not include replication slot state files in basebackup for replica. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-11-08 14:54:58 +02:00
John Spray	aa9112efce	pageserver: add `no_sync` for use in regression tests (1/2) (#9677 ) ## Problem In test environments, the `syncfs` that the pageserver does on startup can take a long time, as other tests running concurrently might have many gigabytes of dirty pages. ## Summary of changes - Add a `no_sync` option to the pageserver's config. - Skip syncfs on startup if this is set - A subsequent PR (https://github.com/neondatabase/neon/pull/9678) will enable this by default in tests. We need to wait until after the next release to avoid breaking compat tests, which would fail if we set no_sync & use an old pageserver binary. Q: Why is this a different mechanism than safekeeper, which has a --no-sync CLI flag? A: Because the way we manage pageservers in neon_local depends on the pageserver.toml containing the full configuration, whereas safekeepers have a config file which is neon-local-specific and can drive a CLI flag. Q: Why is the option no_sync rather than sync? A: For boolean configs with a dangerous value, it's preferable to make "false" the safe option, so that any downstream future config tooling that might have a "booleans are false by default" behavior (e.g. golang structs) is safe by default. Q: Why only skip the syncfs, and not all fsyncs? A: Skipping all fsyncs would require more code changes, and the most acute problem isn't fsyncs themselves (these just slow down a running test), it's the syncfs (which makes a pageserver startup slow as a result of _other_ tests)	2024-11-08 10:16:04 +00:00
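The "false is the safe default" argument in the `no_sync` commit above can be sketched in a few lines. This is a minimal illustration, not the pageserver's actual config code: `StartupConfig` and `should_syncfs_on_startup` are hypothetical names, and only the `no_sync` field name comes from the commit message.

```rust
/// Hypothetical stand-in for the relevant part of the pageserver config.
/// `bool::default()` is `false`, so a config that never mentions `no_sync`
/// keeps the safe behaviour (syncfs runs on startup).
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct StartupConfig {
    pub no_sync: bool,
}

/// Returns true when the startup syncfs should be performed.
pub fn should_syncfs_on_startup(cfg: &StartupConfig) -> bool {
    !cfg.no_sync
}
```

Had the option been named `sync`, a tool that defaults booleans to `false` would silently disable syncing, which is the case the commit message wants to avoid.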
JC Grünhage	027889b06c	ci: use set-docker-config-dir from dev-actions (#9638 ) set-docker-config-dir was replicated over multiple repositories. The replica of this action was removed from this repository and it's using the version from github.com/neondatabase/dev-actions instead	2024-11-08 10:44:59 +01:00
Heikki Linnakangas	79929bb1b6	Disable `rust_2024_compatibility` lint option (#9615 ) Compiling with nightly rust compiler, I'm getting a lot of errors like this: error: `if let` assigns a shorter lifetime since Edition 2024 --> proxy/src/auth/backend/jwt.rs:226:16 \| 226 \| if let Some(permit) = self.try_acquire_permit() { \| ^^^^^^^^^^^^^^^^^^^------------------------- \| \| \| this value has a significant drop implementation which may observe a major change in drop order and requires your discretion \| = warning: this changes meaning in Rust 2024 = note: for more information, see issue #124085 <https://github.com/rust-lang/rust/issues/124085> help: the value is now dropped here in Edition 2024 --> proxy/src/auth/backend/jwt.rs:241:13 \| 241 \| } else { \| ^ note: the lint level is defined here --> proxy/src/lib.rs:8:5 \| 8 \| rust_2024_compatibility \| ^^^^^^^^^^^^^^^^^^^^^^^ = note: `#[deny(if_let_rescope)]` implied by `#[deny(rust_2024_compatibility)]` and this: error: these values and local bindings have significant drop implementation that will have a different drop order from that of Edition 2021 --> proxy/src/auth/backend/jwt.rs:376:18 \| 369 \| let client = Client::builder() \| ------ these values have significant drop implementation and will observe changes in drop order under Edition 2024 ... 376 \| map: DashMap::default(), \| ^^^^^^^^^^^^^^^^^^ \| = warning: this changes meaning in Rust 2024 = note: for more information, see issue #123739 <https://github.com/rust-lang/rust/issues/123739> = note: `#[deny(tail_expr_drop_order)]` implied by `#[deny(rust_2024_compatibility)]` They are caused by the `rust_2024_compatibility` lint option. When we actually switch to the 2024 edition, it makes sense to go through all these and check that the drop order changes don't break anything, but in the meanwhile, there's no easy way to avoid these errors. Disable it, to allow compiling with nightly again. 
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2024-11-08 08:35:03 +00:00
Peter Bendel	9132d80aa3	add pgcopydb tool to build tools image (#9658 ) ## Problem build-tools image does not provide superuser, so additional packages can not be installed during GitHub benchmarking workflows but need to be added to the image ## Summary of changes install pgcopydb version 0.17-1 or higher into build-tools bookworm image ```bash docker run -it neondatabase/build-tools:<tag>-bookworm-arm64 /bin/bash ... nonroot@c23c6f4901ce:~$ LD_LIBRARY_PATH=/pgcopydb/lib /pgcopydb/bin/pgcopydb --version; 13:58:19.768 8 INFO Running pgcopydb version 0.17 from "/pgcopydb/bin/pgcopydb" pgcopydb version 0.17 compiled with PostgreSQL 16.4 (Debian 16.4-1.pgdg120+2) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit compatible with Postgres 11, 12, 13, 14, 15, and 16 ``` Example usage of that image in a workflow https://github.com/neondatabase/neon/actions/runs/11725718371/job/32662681172#step:7:14	2024-11-07 19:00:25 +01:00
Conrad Ludgate	82e3f0ecba	[proxy/authorize]: improve JWKS reliability (#9676 ) While setting up some tests, I noticed that we didn't support keycloak. They make use of encryption JWKs as well as signature ones. Our current jwks crate does not support parsing encryption keys which caused the entire jwk set to fail to parse. Switching to lazy parsing fixes this. Also while setting up tests, I couldn't use localhost jwks server as we require HTTPS and we were using webpki so it was impossible to add a custom CA. Enabling native roots addresses this possibility. I saw some of our current e2e tests against our custom JWKS in s3 were taking a while to fetch. I've added a timeout + retries to address this.	2024-11-07 16:24:38 +00:00
Arpad Müller	75aa19aa2d	Don't attach is_archived to debug output (#9679 ) We are in branches where we know its value already.	2024-11-07 16:13:50 +00:00
Alex Chi Z.	a8d9939ea9	fix(pageserver): reduce aux compaction threshold (#9647 ) ref https://github.com/neondatabase/neon/issues/9441 The metrics from LR publisher testing project: ~300KB aux key deltas per 256MB files. Therefore, I think we can do compaction more aggressively as these deltas are small and compaction can reduce layer download latency. We also have a read path perf fix https://github.com/neondatabase/neon/pull/9631 but I'd still combine the read path fix with a reduction of the compaction threshold. ## Summary of changes * reduce metadata compaction threshold * use num of L1 delta layers as an indicator for metadata compaction * dump more logs Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-11-07 10:38:15 -05:00
Erik Grinaker	f18aa04b90	safekeeper: use `set_len()` to zero out segments (#9665 ) ## Problem When we create a new segment, we zero it out in order to avoid changing the length and fsyncing metadata on every write. However, we zeroed it out by writing 8 KB zero-pages, and Tokio file writes have non-trivial overhead. ## Summary of changes Zero out the segment using [`File::set_len()`](https://docs.rs/tokio/latest/i686-unknown-linux-gnu/tokio/fs/struct.File.html#method.set_len) instead. This will typically (depending on the filesystem) just write a sparse file and omit the 16 MB of data entirely. This improves WAL append throughput for large messages by over 400% with fsync disabled, and 100% with fsync enabled.	2024-11-07 15:09:57 +00:00
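The `set_len()` technique from the commit above can be sketched with the standard library. This is an illustration only: it uses the blocking `std::fs::File::set_len` rather than the Tokio variant the commit links to, and the function names and the `SEGMENT_SIZE` constant are assumptions (the 16 MB figure is taken from the commit message).

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Read};
use std::path::Path;

/// Segment size mentioned in the commit message (16 MB).
pub const SEGMENT_SIZE: u64 = 16 * 1024 * 1024;

/// Creates (or truncates) a file and extends it to `size` bytes with
/// `set_len()`. The extended range reads back as zeros; on many
/// filesystems it is stored as a sparse hole rather than written out.
pub fn create_zeroed_segment(path: &Path, size: u64) -> io::Result<u64> {
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .truncate(true)
        .open(path)?;
    file.set_len(size)?;
    Ok(file.metadata()?.len())
}

/// Reads the file back in chunks and reports whether every byte is zero.
pub fn is_all_zero(path: &Path) -> io::Result<bool> {
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; 64 * 1024];
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            return Ok(true);
        }
        if buf[..n].iter().any(|&b| b != 0) {
            return Ok(false);
        }
    }
}
```

Whether the file ends up sparse depends on the filesystem, as the commit message notes; the sketch only relies on the documented guarantee that the extended range is zero-filled.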
Erik Grinaker	01265b7bc6	safekeeper: add basic WAL ingestion benchmarks (#9531 ) ## Problem We don't have any benchmarks for Safekeeper WAL ingestion. ## Summary of changes Add some basic benchmarks for WAL ingestion, specifically for `SafeKeeper::process_msg()` (single append) and `WalAcceptor` (pipelined batch ingestion). Also add some baseline file write benchmarks.	2024-11-07 13:24:03 +00:00
Arseny Sher	f54f0e8e2d	Fix direct reading from WAL buffers. (#9639 ) Fix direct reading from WAL buffers. Pointer wasn't advanced which resulted in sending corrupted WAL if part of read used WAL buffers and part read from the file. Also move it to neon_walreader so that e.g. replication could also make use of it. ref https://github.com/neondatabase/cloud/issues/19567	2024-11-07 11:29:52 +00:00
Erik Grinaker	d6aa26a533	postgres_ffi: make `WalGenerator` generic over record generator (#9614 ) ## Problem Benchmarks need more control over the WAL generated by `WalGenerator`. In particular, they need to vary the size of logical messages. ## Summary of changes * Make `WalGenerator` generic over `RecordGenerator`, which constructs WAL records. * Add `LogicalMessageGenerator` which emits logical messages, with a configurable payload. * Minor tweaks and code reorganization. There are no changes to the core logic or emitted WAL.	2024-11-07 10:38:39 +00:00
Cheng Chen	e1d0b73824	chore(compute): Bump pg_mooncake to the latest version	2024-11-06 22:41:18 -06:00
Arpad Müller	011c0a175f	Support copying layers in detach_ancestor from before shard splits (#9669 ) We need to use the shard associated with the layer file, not the shard associated with our current tenant shard ID. Due to shard splits, the shard IDs can refer to older files. close https://github.com/neondatabase/neon/issues/9667	2024-11-07 01:53:58 +01:00
Alex Chi Z.	2a95a51a0d	refactor(pageserver): better pageservice command parsing (#9597 ) close https://github.com/neondatabase/neon/issues/9460 ## Summary of changes A full rewrite of pagestream cmdline parsing to make it more robust and readable. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2024-11-06 20:41:01 +00:00
Yuchen Liang	11fc1a4c12	fix(test): use layer map dump in `test_readonly_node_gc` to validate layers protected by leases (#9551 ) Fixes #9518. ## Problem After removing the assertion `layers_removed == 0` in #9506, we could miss breakage if we solely rely on the successful execution of the `SELECT` query to check if the lease is properly protecting layers. Details listed in #9518. Also, in integration tests, we sometimes run into the race condition where a getpage request comes before the lease gets renewed (item 2 of #8817), even if compute_ctl sends a lease renewal as soon as it sees a `/configure` API call that updates the `pageserver_connstr`. In this case, we would observe a getpage request error stating that we `tried to request a page version that was garbage collected` (as seen in the [Allure Report](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8613/11550393107/index.html#suites/3ccffb1d100105b98aed3dc19b717917/d1a1ba47bc180493)). ## Summary of changes - Use layer map dump to verify if the lease protects what it claimed: Record all historical layers that have `start_lsn <= lease_lsn` before and after running timeline gc. This is the same check as `ad79f42460/pageserver/src/tenant/timeline.rs (L5025-L5027)` The set recorded after GC should contain every layer in the set recorded before GC. - Wait until log contains another successful lease request before running the `SELECT` query after GC. We argued in #8817 that the bad request can only exist within a short period after migration/restart, and our test shows that as long as a lease renewal is done before the first getpage request sent after reconfiguration, we will not have a bad request. Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-11-06 20:18:21 +00:00
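The check described in the commit above (every layer with `start_lsn <= lease_lsn` recorded before GC must still be present after GC) amounts to a subset test. The sketch below is an assumption-laden illustration in Rust, not the Python test itself: the `Layer` struct and both function names are hypothetical, and LSNs are simplified to plain `u64` values.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified view of one entry in a layer map dump.
#[derive(Debug, Clone)]
pub struct Layer {
    pub name: String,
    pub start_lsn: u64,
}

/// Names of the layers a lease at `lease_lsn` is expected to protect.
pub fn protected_by_lease(layers: &[Layer], lease_lsn: u64) -> HashSet<&str> {
    layers
        .iter()
        .filter(|l| l.start_lsn <= lease_lsn)
        .map(|l| l.name.as_str())
        .collect()
}

/// True if every protected layer seen before GC is still present after GC.
pub fn lease_held(before: &[Layer], after: &[Layer], lease_lsn: u64) -> bool {
    protected_by_lease(before, lease_lsn).is_subset(&protected_by_lease(after, lease_lsn))
}
```

Layers that start above the lease LSN are ignored on both sides, so GC removing them does not fail the check.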
Tristan Partin	93123f2623	Rename compute_backpressure_throttling_ms to compute_backpressure_throttling_seconds This is in line with the Prometheus guidance[0]. We also haven't started using this metric, so renaming is essentially free. Link: https://prometheus.io/docs/practices/naming/ [0] Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-11-06 13:28:23 -06:00
Alex Chi Z.	1d3559d4bc	feat(pageserver): add fast path for sparse keyspace read (#9631 ) In https://github.com/neondatabase/neon/issues/9441, the tenant has a lot of aux keys spread in multiple aux files. The perf tool shows that a significant amount of time is spent on remove_overlapping_keys. For sparse keyspaces, we don't need to report missing key errors anyways, and it's very likely that we will need to read all layers intersecting with the key range. Therefore, this patch adds a new fast path for sparse keyspace reads that we do not track `unmapped_keyspace` in a fine-grained way. We only modify it when we find an image layer. In debug mode, it was ~5min to read the aux files for a dump of the tenant, and now it's only 8s, that's a 60x speedup. ## Summary of changes * Do not add sparse keys into `keys_done` so that remove_overlapping does nothing. * Allow `ValueReconstructSituation::Complete` to be updated again in `ValuesReconstructState::update_key` for sparse keyspaces. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-11-06 18:17:02 +00:00
Conrad Ludgate	73bdc9a2d0	[proxy]: minor changes to endpoint-cache handling (#9666 ) I think I meant to make these changes over 6 months ago. alas, better late than never. 1. should_reject doesn't eagerly intern the endpoint string 2. Rate limiter uses a std Mutex instead of a tokio Mutex. 3. Recently I introduced a `-local-proxy` endpoint suffix. I forgot to add this to normalize. 4. Random but a small cleanup making the ControlPlaneEvent deser directly to the interned strings.	2024-11-06 17:40:40 +00:00
John Spray	d182ff294c	storcon: respect tenant scheduling policy in drain/fill (#9657 ) ## Problem Pinning a tenant by setting Pause scheduling policy doesn't work because drain/fill code moves the tenant around during deploys. Closes: https://github.com/neondatabase/neon/issues/9612 ## Summary of changes - In drain, only move a tenant if it is in Active or Essential mode - In fill, only move a tenant if it is in Active mode. The asymmetry is a bit annoying, but it faithfully respects the purposes of the modes: Essential is meant to endeavor to keep the tenant available, which means it needs to be drained but doesn't need to be migrated during fills.	2024-11-06 15:14:43 +00:00
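The asymmetric rule from the commit above fits in two predicates. This is a minimal sketch under stated assumptions: the enum and function names are hypothetical, and only the three modes named in the commit message (Active, Essential, Pause) are modelled.

```rust
/// Hypothetical stand-in for the tenant scheduling policy; only the
/// modes named in the commit message are modelled here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SchedulingPolicy {
    Active,
    Essential,
    Pause,
}

/// Drain: move the tenant off the node only in Active or Essential mode.
pub fn may_move_in_drain(policy: SchedulingPolicy) -> bool {
    matches!(policy, SchedulingPolicy::Active | SchedulingPolicy::Essential)
}

/// Fill: move the tenant onto the node only in Active mode.
pub fn may_move_in_fill(policy: SchedulingPolicy) -> bool {
    matches!(policy, SchedulingPolicy::Active)
}
```

A tenant in `Pause` is never moved by either operation, which is the pinning behaviour the commit restores.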
Vlad Lazar	4dfa0c221b	pageserver: ingest pre-serialized batches of values (#9579 ) ## Problem https://github.com/neondatabase/neon/pull/9524 split the decoding and interpretation step from ingestion. The output of the first phase is a `wal_decoder::models::InterpretedWalRecord`. Before this patch set that struct contained a list of `Value` instances. We wish to lift the decoding and interpretation step to the safekeeper, but it would be nice if the safekeeper gave us a batch containing the raw data instead of actual values. ## Summary of changes Main goal here is to make `InterpretedWalRecord` hold a raw buffer which contains pre-serialized Values. For this we do: 1. Add a `SerializedValueBatch` type. This is `inmemory_layer::SerializedBatch` with some extra functionality for extension, observing values for shard 0 and tests. 2. Replace `inmemory_layer::SerializedBatch` with `SerializedValueBatch` 3. Make `DatadirModification` maintain a `SerializedValueBatch`. ### `DatadirModification` changes `DatadirModification` now maintains a `SerializedValueBatch` and extends it as new WAL records come in (to avoid flushing to disk on every record). In turn, this cascaded into a number of modifications to `DatadirModification`: 1. Replace `pending_data_pages` and `pending_zero_data_pages` with `pending_data_batch`. 2. Removal of `pending_zero_data_pages` and its cousin `on_wal_record_end` 3. Rename `pending_bytes` to `pending_metadata_bytes` since this is what it tracks now. 4. Adapting of various utility methods like `len`, `approx_pending_bytes` and `has_dirty_data_pages`. Removal of `pending_zero_data_pages` and the optimisation associated with it ((1) and (2)) deserves more detail. Previously all zero data pages went through `pending_zero_data_pages`. We wrote zero data pages when filling gaps caused by relation extension (case A) and when handling special wal records (case B). 
If it happened that the same WAL record contained a non-zero write for an entry in `pending_zero_data_pages` we skipped the zero write. Case A: We handle this differently now. When ingesting the `SerializedValueBatch` associated with one PG WAL record, we identify the gaps and fill them in one go. Essentially, we move from a per key process (gaps were filled after each new key), and replace it with a per record process. Hence, the optimisation is not required anymore. Case B: When the handling of a special record needs to zero out a key, it just adds that to the current batch. I inspected the code, and I don't think the optimisation kicked in here.	2024-11-06 14:10:32 +00:00
Folke Behrens	bdd492b1d8	proxy: Replace "web(auth)" with "console redirect" everywhere (#9655 )	2024-11-06 11:03:38 +00:00
Folke Behrens	5d8284c7fe	proxy: Read cplane JWT with clap arg (#9654 )	2024-11-06 10:27:55 +00:00
Folke Behrens	ebc43efebc	proxy: Refactor cplane types (#9643 ) The overall idea of the PR is to rename a few types to make their purpose clearer, reduce abstraction where not needed, and move types to better-suited modules.	2024-11-05 23:03:53 +01:00
Folke Behrens	754d2950a3	proxy: Revert ControlPlaneEvent back to struct (#9649 ) Due to neondatabase/cloud#19815 we need to be more tolerant when reading events.	2024-11-05 21:32:33 +00:00
Conrad Ludgate	fcde40d600	[proxy] use the proxy protocol v2 command to silence some logs (#9620 ) The PROXY Protocol V2 offers a "command" concept. It can be of two different values. "Local" and "Proxy". The spec suggests that "Local" be used for health-checks. We can thus use this to silence logging for such health checks such as those from NLB. This additionally refactors the flow to be a bit more type-safe, self documenting and using zerocopy deser.	2024-11-05 17:23:00 +00:00
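The "command" field mentioned in the commit above lives in the 13th byte of a PROXY protocol v2 header: the high nibble is the version (2) and the low nibble is the command (0 for LOCAL, 1 for PROXY). The sketch below parses just that byte and is not the proxy's actual zerocopy-based implementation; all type and function names are hypothetical.

```rust
/// The 12-byte signature that starts every PROXY protocol v2 header.
pub const V2_SIGNATURE: [u8; 12] = [
    0x0D, 0x0A, 0x0D, 0x0A, 0x00, 0x0D, 0x0A, 0x51, 0x55, 0x49, 0x54, 0x0A,
];

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Command {
    /// Connection opened by the proxy itself, e.g. for a health check.
    Local,
    /// Connection relayed on behalf of a real client.
    Proxy,
}

#[derive(Debug, PartialEq, Eq)]
pub enum HeaderError {
    TooShort,
    BadSignature,
    BadVersion(u8),
    BadCommand(u8),
}

/// Extracts the command from the start of a v2 header.
pub fn parse_command(header: &[u8]) -> Result<Command, HeaderError> {
    if header.len() < 13 {
        return Err(HeaderError::TooShort);
    }
    if &header[..12] != V2_SIGNATURE.as_slice() {
        return Err(HeaderError::BadSignature);
    }
    let ver_cmd = header[12];
    let version = ver_cmd >> 4;
    if version != 2 {
        return Err(HeaderError::BadVersion(version));
    }
    match ver_cmd & 0x0F {
        0x0 => Ok(Command::Local),
        0x1 => Ok(Command::Proxy),
        other => Err(HeaderError::BadCommand(other)),
    }
}

/// Assumed policy for illustration: only log relayed client connections.
pub fn should_log_connection(cmd: Command) -> bool {
    cmd == Command::Proxy
}
```

A load balancer health check that sends `LOCAL` would then be accepted but kept out of the connection logs.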
Erik Grinaker	babfeb70ba	safekeeper: don't allocate send buffers on stack (#9644 ) ## Problem While experimenting with `MAX_SEND_SIZE` for benchmarking, I saw stack overflows when increasing it to 1 MB. Turns out a few buffers of this size are stack-allocated rather than heap-allocated. Even at the default 128 KB size, that's a bit large to allocate on the stack. ## Summary of changes Heap-allocate buffers of size `MAX_SEND_SIZE`.	2024-11-05 17:05:30 +00:00
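The stack-versus-heap distinction in the commit above can be illustrated as follows. This is a sketch, not the safekeeper's code: the helper names are hypothetical, and `MAX_SEND_SIZE` is set to the 128 KB default quoted in the commit message.

```rust
/// Default send size quoted in the commit message (128 KB).
pub const MAX_SEND_SIZE: usize = 128 * 1024;

/// Allocates the send buffer on the heap. `vec![0u8; N]` requests zeroed
/// heap memory directly, whereas a local `[0u8; N]` array would occupy
/// N bytes of the current stack frame.
pub fn new_send_buffer() -> Vec<u8> {
    vec![0u8; MAX_SEND_SIZE]
}

/// Copies as much of `src` as fits into `buf` and returns the byte count.
pub fn fill_send_buffer(src: &[u8], buf: &mut [u8]) -> usize {
    let n = src.len().min(buf.len());
    buf[..n].copy_from_slice(&src[..n]);
    n
}
```

With this shape, raising `MAX_SEND_SIZE` to 1 MB for a benchmark only changes the size of the heap allocation and leaves stack usage unchanged.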
Ivan Efremov	2f1a56c8f9	proxy: Unify local and remote conn pool client structures (#9604 ) Unify client, EndpointConnPool and DbUserConnPool for remote and local conn. - Use new ClientDataEnum for additional client data. - Add ClientInnerCommon client structure. - Remove Client and EndpointConnPool code from local_conn_pool.rs	2024-11-05 17:33:41 +02:00
John Spray	e30f5fb922	scrubber: remove AWS region assumption, tolerate negative max_project_size (#9636 ) ## Problem First issues noticed when trying to run scrubber find-garbage on Azure: - Azure staging contains projects with -1 set for max_project_size: apparently the control plane treats this as a signed field. - Scrubber code assumed that listing projects should filter to aws-$REGION. This is no longer needed (per comment in the code) because we now hit region-local APIs. This PR doesn't make it work all the way (`init_remote` still assumes S3), but these are necessary precursors. ## Summary of changes - Change max_project_size from unsigned to signed - Remove region filtering in favor of simply using the right region's API (which we already do)	2024-11-05 13:32:50 +00:00
Arpad Müller	70ae8c16da	Construct models::TenantConfig only once (#9630 ) Since 5f83c9290b482dc90006c400dfc68e85a17af785/#1504 we've had duplication in construction of models::TenantConfig, where both constructs contained the same code. This PR removes one of the two locations to avoid the duplication.	2024-11-05 13:02:49 +00:00
Erik Grinaker	8840f3858c	pageserver: return 503 during tenant shutdown (#9635 ) ## Problem Tenant operations may return `409 Conflict` if the tenant is shutting down. This status code is not retried by the control plane, causing user-facing errors during pageserver restarts. Operations should instead return `503 Service Unavailable`, which may be retried for idempotent operations. ## Summary of changes Convert `GetActiveTenantError::WillNotBecomeActive(TenantState::Stopping)` to `ApiError::ShuttingDown` rather than `ApiError::Conflict`. This error is returned by `Tenant::wait_to_become_active` in most (all?) tenant/timeline-related HTTP routes.	2024-11-05 13:16:55 +01:00
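The error conversion described in the commit above can be sketched with a `From` impl. The types below are heavily simplified stand-ins that reuse the names from the commit message; the variant lists, the `status_code` helper, and the fallback to `Conflict` for other states are assumptions made for illustration.

```rust
/// Simplified stand-in for the tenant state enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TenantState {
    Active,
    Stopping,
    Broken,
}

/// Simplified stand-in; the real error type has more variants.
#[derive(Debug, PartialEq, Eq)]
pub enum GetActiveTenantError {
    WillNotBecomeActive(TenantState),
}

/// Simplified stand-in for the HTTP-facing error type.
#[derive(Debug, PartialEq, Eq)]
pub enum ApiError {
    ShuttingDown,
    Conflict(String),
}

impl ApiError {
    /// HTTP status for each variant: 503 may be retried, 409 is not.
    pub fn status_code(&self) -> u16 {
        match self {
            ApiError::ShuttingDown => 503,
            ApiError::Conflict(_) => 409,
        }
    }
}

impl From<GetActiveTenantError> for ApiError {
    fn from(err: GetActiveTenantError) -> Self {
        match err {
            // A stopping tenant is a transient condition: report 503.
            GetActiveTenantError::WillNotBecomeActive(TenantState::Stopping) => {
                ApiError::ShuttingDown
            }
            // Assumed for this sketch: other states keep mapping to 409.
            GetActiveTenantError::WillNotBecomeActive(state) => {
                ApiError::Conflict(format!("tenant will not become active: {state:?}"))
            }
        }
    }
}
```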
Tristan Partin	1e16221f82	Update psycopg2 to latest version for complete PG 17 support Update the types to match. Changes the cursor import to match the C bindings[0]. Link: https://github.com/python/typeshed/issues/12578 [0] Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-11-04 18:21:59 -06:00
Tristan Partin	34812a6aab	Improve some typing related to performance testing for LR Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-11-04 15:52:01 -06:00
Arpad Müller	ee68bbf6f5	Add tenant config option to allow timeline_offloading (#9598 ) Allow us to enable timeline offloading for single tenants without having to enable it for the entire pageserver. Part of #8088.	2024-11-04 21:01:18 +01:00
Folke Behrens	1085fe57d3	proxy: Rewrite ControlPlaneEvent as enum (#9627 )	2024-11-04 20:19:26 +01:00
Folke Behrens	59879985b4	proxy: Wrap JWT errors in separate AuthError variant (#9625 ) * Also rename `AuthFailed` variant to `PasswordFailed`. * Before this, all JWT errors ended up in `AuthError::AuthFailed()`, which expects a username and also causes cache invalidation.	2024-11-04 19:56:40 +01:00
Conrad Ludgate	81d1bb1941	quieten aws_config logs (#9626 ) logs during aws authentication are soooo noisy in staging 🙃	2024-11-04 17:28:10 +00:00
Christian Schwarz	06113e94e6	fix(test_regress): always use storcon virtual pageserver API to set tenant config (#9622 ) Problem ------- Tests that directly call the Pageserver Management API to set tenant config are flaky if the Pageserver is managed by Storcon because Storcon is the source of truth and may (theoretically) reconcile a tenant at any time. Solution -------- Switch all users of `set_tenant_config`/`patch_tenant_config_client_side` to use the `env.storage_controller.pageserver_api()` Future Work ----------- Prevent regressions from creeping in. And generally clean up tenant configuration. Maybe we can avoid the Pageserver having a default tenant config at all and put the default into Storcon instead? * => https://github.com/neondatabase/neon/issues/9621 Refs ---- fixes https://github.com/neondatabase/neon/issues/9522	2024-11-04 17:42:08 +01:00
Erik Grinaker	0d5a512825	safekeeper: add walreceiver metrics (#9450 ) ## Problem We don't have any observability for Safekeeper WAL receiver queues. ## Summary of changes Adds a few WAL receiver metrics: * `safekeeper_wal_receivers`: gauge of currently connected WAL receivers. * `safekeeper_wal_receiver_queue_depth`: histogram of queue depths per receiver, sampled every 5 seconds. * `safekeeper_wal_receiver_queue_depth_total`: gauge of total queued messages across all receivers. * `safekeeper_wal_receiver_queue_size_total`: gauge of total queued message sizes across all receivers. There are already metrics for ingested WAL volume: `written_wal_bytes` counter per timeline, and `safekeeper_write_wal_bytes` per-request histogram.	2024-11-04 15:22:46 +00:00
Conrad Ludgate	8ad1dbce72	[proxy]: parse proxy protocol TLVs with aws/azure support (#9610 ) AWS/azure private link shares extra information in the "TLV" values of the proxy protocol v2 header. This code doesn't action on it, but it parses it as appropriate.	2024-11-04 14:04:56 +00:00
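In PROXY protocol v2, the TLVs mentioned in the commit above follow the address block and are each encoded as a 1-byte type, a 2-byte big-endian length, and then the value. The sketch below walks that encoding generically; it is not the proxy's parser, the names are hypothetical, and it does not interpret the vendor-specific AWS or Azure types.

```rust
/// One type-length-value entry, borrowing its value from the input.
#[derive(Debug, PartialEq, Eq)]
pub struct Tlv<'a> {
    pub kind: u8,
    pub value: &'a [u8],
}

/// Returned when the buffer ends in the middle of a TLV.
#[derive(Debug, PartialEq, Eq)]
pub struct Truncated;

/// Splits a TLV region into its entries.
/// Each entry is: type (1 byte), length (2 bytes, big-endian), value.
pub fn parse_tlvs(mut buf: &[u8]) -> Result<Vec<Tlv<'_>>, Truncated> {
    let mut out = Vec::new();
    while !buf.is_empty() {
        if buf.len() < 3 {
            return Err(Truncated);
        }
        let kind = buf[0];
        let len = u16::from_be_bytes([buf[1], buf[2]]) as usize;
        let rest = &buf[3..];
        if rest.len() < len {
            return Err(Truncated);
        }
        let (value, tail) = rest.split_at(len);
        out.push(Tlv { kind, value });
        buf = tail;
    }
    Ok(out)
}
```

A caller that only parses without acting on the data, as the commit describes, could keep the returned list around for logging or later use.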
Conrad Ludgate	3dcdbcc34d	remove aws-lc-rs dep and fix storage_broker tls (#9613 ) It seems the ecosystem is not so keen on moving to aws-lc-rs as its build setup is more complicated than ring (requiring cmake). Eventually I expect the ecosystem should pivot to https://github.com/ctz/graviola/tree/main/rustls-graviola as it stabilises (it has a very simple build step and license), but for now let's try not to have the headache of juggling two crypto libs. I also noticed that tonic will just fail with tls without a default provider, so I added some defensive code for that.	2024-11-04 13:29:13 +00:00
Matthias van de Meent	d5de63c6b8	Fix a time zone issue in a PG17 test case (#9618 ) The commit was cherry-picked and thus shouldn't cause issues once we merge the release tag for PostgreSQL 17.1	2024-11-04 12:10:32 +00:00
John Spray	4534f5cdc6	pageserver: make local timeline deletion infallible (#9594 ) ## Problem In https://github.com/neondatabase/neon/pull/9589, timeline offload code is modified to return an explicit error type rather than propagating anyhow::Error. One of the 'Other' cases there is I/O errors from local timeline deletion, which shouldn't need to exist, because our policy is not to try and continue running if the local disk gives us errors. ## Summary of changes - Make `delete_local_timeline_directory` infallible and use `.fatal_err(` on I/O errors --------- Co-authored-by: Erik Grinaker <erik@neon.tech>	2024-11-04 09:11:52 +00:00
Erik Grinaker	0058eb09df	test_runner/performance: add sharded ingest benchmark (#9591 ) Adds a Python benchmark for sharded ingestion. This ingests 7 GB of WAL (100M rows) into a Safekeeper and fans out to 10 shards running on 10 different pageservers. The ingest volume and duration is recorded.	2024-11-02 16:42:10 +00:00
Konstantin Knizhnik	8ac523d2ee	Do not assign page LSN to new (uninitialized) page in ClearVisibilityMapFlags redo handler (#9287 ) ## Problem https://neondb.slack.com/archives/C04DGM6SMTM/p1727872045252899 See https://github.com/neondatabase/neon/issues/9240 ## Summary of changes Add `!page_is_new` check before assigning page lsn. --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-11-01 20:31:29 +02:00
John Spray	3c16bd6e0b	storcon: skip non-active projects in chaos injection (#9606 ) ## Problem We may sometimes use scheduling modes like `Pause` to pin a tenant in its current location for operational reasons. It is undesirable for the chaos task to make any changes to such projects. ## Summary of changes - Add a check for scheduling mode - Add a log line when we do choose to do a chaos action for a tenant: this will help us understand which operations originate from the chaos task.	2024-11-01 16:47:20 +00:00
Erik Grinaker	123816e99a	safekeeper: log slow WalAcceptor sends (#9564 ) ## Problem We don't have any observability into full WalAcceptor queues per timeline. ## Summary of changes Logs a message when a WalAcceptor send has blocked for 5 seconds, and another message when the send completes. This implies that the log frequency is at most once every 5 seconds per timeline, so we don't need further throttling.	2024-11-01 13:47:03 +01:00
Peter Bendel	8b3bcf71ee	revert higher token expiration (#9605 ) ## Problem The IAM role associated with our github action runner supports a max token expiration which is lower than the value we tried. ## Summary of changes Since we believe we have understood the performance regression (and addressed it by ensuring availability zone affinity of compute and pageserver), the job should again run in under 5 hours, so we revert this change instead of increasing the max session token expiration in the IAM role, which would reduce our security.	2024-11-01 12:46:02 +01:00
Erik Grinaker	4c2c8d6708	test_runner: fix `tenant_get_shards` with one pageserver (#9603 ) ## Problem `tenant_get_shards()` does not work with a sharded tenant on 1 pageserver, as it assumes an unsharded tenant in this case. This special case appears to have been added to handle e.g. `test_emergency_mode`, where the storage controller is stopped. This breaks e.g. the sharded ingest benchmark in #9591 when run with a single shard. ## Summary of changes Correctly look up shards even with a single pageserver, but add a special case that assumes an unsharded tenant if the storage controller is stopped and the caller provides an explicit pageserver, in order to accommodate `test_emergency_mode`.	2024-11-01 11:25:04 +00:00
Conrad Ludgate	2d1366c8ee	fix pre-commit hook with python stubs (#9602 ) fix #9601	2024-11-01 11:22:38 +00:00

6494 Commits