Commit Graph

5660 Commits

John Spray
9ded2556df tests: increase test_pg_regress and test_isolation timeouts (#8418)
## Problem

These tests time out ~1 in 50 runs when in debug mode.

There is no indication of a real issue: they're just wrappers that have
large numbers of individual tests contained within one pytest case.

## Summary of changes

- Bump the pg_regress timeout from 600s to 900s
- Bump the test_isolation timeout from 300s (default) to 600s

In future it would be nice to break out these tests to run individual
cases (or batches thereof) as separate tests, rather than this monolith.
2024-07-18 10:23:17 +01:00
John Spray
7672e49ab5 tests: fix metrics check in test_s3_eviction (#8419)
## Problem

This test would occasionally fail its metric check. This could happen in
the rare case that the nodes had all been restarted before their most
recent eviction.

The metric check was added in
https://github.com/neondatabase/neon/pull/8348

## Summary of changes

- Check metrics before each restart, accumulate into a bool that we
assert on at the end of the test
2024-07-18 10:14:56 +01:00
Christian Schwarz
a2d170b6d0 NeonEnv.from_repo_dir: use storage_controller_db instead of attachments.json (#8382)
When `NeonEnv.from_repo_dir` was introduced, the storage controller
stored its state exclusively in `attachments.json`.
Since then, it has moved to using Postgres, which stores its state in
`storage_controller_db`.

But `NeonEnv.from_repo_dir` was never adjusted to match.
This PR rectifies the situation.

Context for this is failures in
`test_pageserver_characterize_throughput_with_n_tenants`
CF:
https://neondb.slack.com/archives/C033RQ5SPDH/p1721035799502239?thread_ts=1720901332.293769&cid=C033RQ5SPDH

Notably, `from_repo_dir` is also used by the backwards- and
forwards-compatibility tests, so the changes in this PR affect those
tests as well.
However, it turns out that the compatibility snapshot already contains
the `storage_controller_db`.
Thus, it should just work, and in fact we can remove hacks like
`fixup_storage_controller`.

Follow-ups created as part of this work:
* https://github.com/neondatabase/neon/issues/8399
* https://github.com/neondatabase/neon/issues/8400
2024-07-18 10:56:07 +02:00
dotdister
1303d47778 Fix comment in Control Plane (#8406)
## Problem
There is something wrong in the comments of
`control_plane/src/broker.rs` and `control_plane/src/pageserver.rs`.

## Summary of changes
Fixed the comments about component names and their data paths in
`control_plane/src/broker.rs` and `control_plane/src/pageserver.rs`.
2024-07-18 09:33:46 +01:00
Joonas Koivunen
e250b9e063 test: allow requests to any pageserver get cancelled (#8413)
Fix flakiness in `test_sharded_timeline_detach_ancestor`, which does not
reproduce on a fast enough runner, by allowing a request to be cancelled
before completing on any of the pageservers. Previously this was only
allowed on half of the pageservers.

Failure evidence:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8352/9972357040/index.html#suites/a1c2be32556270764423c495fad75d47/7cca3e3d94fe12f2
2024-07-17 22:03:02 +01:00
John Spray
0c236fa465 pageserver: layer count & size metrics (#8410)
## Problem

We lack insight into:
- How much of a tenant's physical size is image vs. delta layers
- Average sizes of image vs. delta layers
- Total layer counts per timeline, indicating the size of the index_part object

As well as general observability love, this is motivated by
https://github.com/neondatabase/neon/issues/6738, where we need to
define some sensible thresholds for storage amplification, and using
total physical size may not work well (if someone does a lot of DROPs
then it's legitimate for the physical-synthetic ratio to be huge), but
the ratio between image layer size and delta layer size may be a better
indicator of whether we're generating unreasonable quantities of image
layers.

## Summary of changes

- Add pageserver_layer_bytes and pageserver_layer_count metrics,
labelled by timeline and `kind` (delta or image)
- Add & subtract these with LayerInner's lifetime.

I'm intentionally avoiding a generic metric RAII guard object, to
avoid bloating LayerInner: it already has all the information it needs
to update the metrics on new+drop.
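
As a rough illustration of the shape (not the actual pageserver code;
metric and label names follow the description above), a minimal sketch
with the `prometheus` crate:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_gauge_vec, IntGaugeVec};

static LAYER_COUNT: Lazy<IntGaugeVec> = Lazy::new(|| {
    register_int_gauge_vec!(
        "pageserver_layer_count",
        "Number of layers on disk, by timeline and kind",
        &["timeline_id", "kind"] // kind: "delta" or "image"
    )
    .unwrap()
});

struct LayerInner {
    timeline_id: String,
    kind: &'static str,
}

impl LayerInner {
    fn new(timeline_id: String, kind: &'static str) -> Self {
        LAYER_COUNT
            .with_label_values(&[timeline_id.as_str(), kind])
            .inc();
        LayerInner { timeline_id, kind }
    }
}

impl Drop for LayerInner {
    fn drop(&mut self) {
        // LayerInner already knows its labels, so no extra guard is needed.
        LAYER_COUNT
            .with_label_values(&[self.timeline_id.as_str(), self.kind])
            .dec();
    }
}
```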
2024-07-17 21:55:20 +01:00
Yuchen Liang
da84a250c6 docs: update storage controller db name in doc (#8411)
The DB name was renamed from attachment_service to storage_controller;
the doc was stale.
2024-07-17 15:19:40 -04:00
John Spray
975f8ac658 tests: add test_compaction_l0_memory (#8403)
This test reproduces the case of a writer creating a deep stack of L0
layers. It uses realistic layer sizes and writes several gigabytes of
data, therefore runs as a performance test although it is validating
memory footprint rather than performance per se.

It acts as a regression test for two recent fixes:
- https://github.com/neondatabase/neon/pull/8401
- https://github.com/neondatabase/neon/pull/8391

In future it will demonstrate the larger improvement of using a k-merge
iterator for L0 compaction (#8184)

This test can be extended to enforce limits on the memory consumption of
other housekeeping steps, by restarting the pageserver and then running
other things to do the same "how much did RSS increase" measurement.
2024-07-17 17:35:27 +00:00
Alex Chi Z.
839a5724a4 test(pageserver): more k-merge tests on duplicated keys (#8404)
Existing tenants and certain selections of layers might produce duplicated
keys. Add tests to ensure the k-merge iterator handles them correctly. We
also enforced ordering of the k-merge iterator to put images before
deltas.

part of https://github.com/neondatabase/neon/issues/8002

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
2024-07-17 16:22:38 +01:00
Peter Bendel
f2b8e390e7 Bodobolero/pgbench compare azure (#8409)
## Problem

We want to run performance tests on all supported cloud providers.
We want to run most tests on the Postgres version that is the default for
new projects in production; currently (July 2024) this is Postgres
version 16.

## Summary of changes

- change the default Postgres version for some (performance) tests to 16
(our default for new projects in prod anyhow)
- add an Azure region to the pgbench_compare jobs
- add an Azure region to the pgvector benchmarking jobs
  - the re-used project `weathered-snowflake-88107345` was prepared with 1
million embeddings, running with minCU 7 / maxCU 7 in the Azure region, to
compare with the AWS region (pgvector indexing and hnsw queries)
  - see the pgbench-pgvector job
- Note: we now have 11 environment combinations where we run
pgbench-compare; 5 are for k8s-pod (deprecated), which we can remove
in the future once the autoscaling team approves.

## Logs

A current run with the changes from this pull request is running here
https://github.com/neondatabase/neon/actions/runs/9972096222

Note that we currently expect some failures due to
- https://github.com/neondatabase/neon/issues/8275
- instability of projects in the Azure region
2024-07-17 16:56:32 +02:00
John Spray
f7131834eb docs/rfcs: timeline ancestor detach API (#6888)
## Problem

When a tenant creates a new timeline that they will treat as their
'main' history, it is awkward to permanently retain an 'old main'
timeline as its ancestor. Currently this is necessary because it is
forbidden to delete a timeline which has descendants.

## Summary of changes

A new pageserver API is proposed to 'adopt' data from a parent timeline
into one of its children, such that the link between ancestor and child
can be severed, leaving the parent in a state where it may then be
deleted.

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2024-07-17 14:25:35 +00:00
John Spray
4a90423292 pageserver: reduce size of delta layer ValueRef (#8401)
## Problem

ValueRef is an unnecessarily large structure, because it carries a
cursor. L0 compaction currently instantiates gigabytes of these under
some circumstances.

## Summary of changes

- Carry a ref to the parent layer instead of a cursor, and construct a
cursor on demand.

This reduces RSS high watermark during L0 compaction by about 20%.
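
A minimal sketch of the change's shape, with simplified stand-in types
(the real pageserver definitions differ):

```rust
// Stand-in for the real block cursor: the heavy part is its read buffer.
struct BlockCursor {
    buf: Vec<u8>,
}

struct DeltaLayerInner;

impl DeltaLayerInner {
    fn block_cursor(&self) -> BlockCursor {
        BlockCursor { buf: Vec::with_capacity(8192) }
    }
}

// Before: every ValueRef carried its own cursor, so millions of ValueRefs
// meant gigabytes of cursor state held alive at once.
struct ValueRefOld<'a> {
    offset: u64,
    cursor: BlockCursor,
    layer: &'a DeltaLayerInner,
}

// After: only a reference to the parent layer; a cursor is constructed
// on demand when the value is actually read.
struct ValueRef<'a> {
    offset: u64,
    layer: &'a DeltaLayerInner,
}

impl<'a> ValueRef<'a> {
    fn load(&self) -> Vec<u8> {
        let _cursor = self.layer.block_cursor(); // short-lived, per read
        Vec::new() // elided: read the blob at self.offset via the cursor
    }
}
```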
2024-07-16 21:36:17 +01:00
John Spray
f4f0869dc8 pageserver: exclude un-read layers from short residence statistic (#8396)
## Problem

The `evictions_with_low_residence_duration` is used as an indicator of
cache thrashing. However, there are situations where it is quite
legitimate to only have a short residence during compaction, where a
delta is downloaded, used to generate an image layer, and then
discarded. This can lead to false positive alerts.

## Summary of changes

- Only track low residence duration for layers that have been accessed
at least once (compaction doesn't count as an access). This will give us
a metric that indicates thrashing on layers that the _user_ is using,
rather than those we're downloading for housekeeping purposes.

Once we add "layer visibility" as an explicit property of layers, this
can also be used as a cleaner condition (residence of non-visible layers
should never be alertable)
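
The tracking condition from the summary above, sketched with
illustrative names (not the actual fields or metrics):

```rust
use std::time::Duration;

struct LayerStats {
    /// Reads by user traffic; compaction reads deliberately do not count.
    user_access_count: u64,
}

fn record_eviction(stats: &LayerStats, resident_for: Duration, low_threshold: Duration) {
    // Only a layer the user actually read counts towards the thrashing signal.
    if resident_for < low_threshold && stats.user_access_count > 0 {
        // evictions_with_low_residence_duration.inc() would go here
    }
}
```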
2024-07-16 20:55:29 +01:00
Alex Chi Z.
0950866fa8 fix(pageserver): limit num of delta layers for l0 compaction (#8391)
## Problem

close https://github.com/neondatabase/neon/issues/8389

## Summary of changes

A quick mitigation for tenants with fast writes. We compact at most 60
delta layers at a time, expecting a memory footprint of 15GB. We will
pick the oldest 60 L0 layers.

This should be a relatively safe change, so no test is added. The open
question is whether to make this parameter configurable via tenant config.
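
A sketch of the mitigation's shape (constant and field names are
assumptions):

```rust
const MAX_COMPACTION_L0_LAYERS: usize = 60; // ~15GB expected footprint

struct L0Layer {
    lsn_start: u64,
}

/// Take the oldest (lowest-LSN) L0 layers, capped to bound compaction memory.
fn pick_l0_compaction_batch(mut l0: Vec<L0Layer>) -> Vec<L0Layer> {
    l0.sort_by_key(|l| l.lsn_start);
    l0.truncate(MAX_COMPACTION_L0_LAYERS);
    l0
}
```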

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: John Spray <john@neon.tech>
2024-07-16 20:43:24 +01:00
Tristan Partin
7cf59ae5b4 Add some typing to Endpoint.respec() 2024-07-16 12:12:29 -05:00
Tristan Partin
b197cc20fc Hide import behind TYPE_CHECKING 2024-07-16 12:12:29 -05:00
Tristan Partin
ba17025a57 Run each migration in its own transaction
Previously, every migration was run in the same transaction. This
is preparatory work for fixing CVE-2024-4317.
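
A sketch of the per-migration transaction loop with the `postgres` crate
(the bookkeeping table/column names are assumptions for illustration):

```rust
use postgres::{Client, Error};

/// Run each migration in its own transaction, recording progress as we go.
fn run_migrations(client: &mut Client, migrations: &[&str]) -> Result<(), Error> {
    for (idx, sql) in migrations.iter().enumerate() {
        let mut txn = client.transaction()?;
        txn.batch_execute(sql)?;
        // Bump the applied-migration id inside the same transaction, so a
        // failure rolls back both the migration and the bookkeeping.
        txn.execute(
            "UPDATE neon_migration.migration_id SET id = $1",
            &[&((idx + 1) as i64)],
        )?;
        txn.commit()?;
    }
    Ok(())
}
```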
2024-07-16 12:12:29 -05:00
Tristan Partin
b5ab055526 Rename compute migrations to start at 1
This matches what we put into the neon_migration.migration_id table.
2024-07-16 12:12:29 -05:00
John Spray
a40b402957 pageserver: clean up GcCutoffs names (#8379)
- `horizon` is a confusing term: it's not at all obvious that this means
the space-based retention limit, rather than the total GC history limit.
Rename to `GcCutoffs::space`.
- `pitr` is less confusing, but still an unnecessary level of indirection
from what we really mean: a time-based condition. The fact that we use
that time-history for Point In Time Recovery doesn't mean we have
to refer to time as "pitr" everywhere. Rename to `GcCutoffs::time`.
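
After the rename, the struct reads roughly like this sketch (not the
real definition):

```rust
struct Lsn(u64);

/// GC will not collect history above these cutoffs.
struct GcCutoffs {
    /// Was `horizon`: the space-based retention limit.
    space: Lsn,
    /// Was `pitr`: the time-based retention limit.
    time: Lsn,
}
```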
2024-07-16 13:54:54 +00:00
dependabot[bot]
d2ee760eb2 build(deps): bump setuptools from 65.5.1 to 70.0.0 (#8387)
Bumps [setuptools](https://github.com/pypa/setuptools) from 65.5.1 to
70.0.0.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: a-masterov <72613290+a-masterov@users.noreply.github.com>
2024-07-16 12:20:23 +02:00
Arpad Müller
66337097de Avoid the storage controller in test_tenant_creation_fails (#8392)
As described in #8385, the likely source of flakiness in
test_tenant_creation_fails is the following sequence of events:

1. the test instructs the storage controller to create the tenant
2. the storage controller adds the tenant, persists it to the database,
and issues a creation request
3. the pageserver restarts with the failpoint disabled
4. the storage controller's background reconciliation still wants to
create the tenant
5. the pageserver gets a new request to create the tenant from background
reconciliation

This commit just avoids the storage controller entirely. It has its own
set of issues, as the re-attach request will obviously not include the
tenant, but it's still useful to test for non-existence of the tenant.

The generation is also not optional any more during tenant attachment.
If you omit it, the pageserver yields an error. We change the signature
of `tenant_attach` to reflect that.

Alternative to #8385
Fixes #8266
2024-07-16 12:19:28 +02:00
Anastasia Lubennikova
e6dadcd2f3 Compute: add compatibility patch for rum
Fixes #8251
2024-07-16 13:10:34 +03:00
John Spray
83e07c1a5b pageserver: un-Arc Timeline::layers (#8386)
## Problem

This structure was in an Arc<> unnecessarily, making it harder to reason
about its lifetime (i.e. it was superficially possible for LayerManager
to outlive the Timeline, even though no code used it that way).

## Summary of changes

- Remove the Arc<>
2024-07-16 08:52:49 +01:00
Arpad Müller
ee263e6a62 Allow the new clippy::doc_lazy_continuation lint (#8388)
Clippy's `doc_lazy_continuation` lint is still unknown on the latest
stable Rust.

Fixes fall-out from #8151.
2024-07-16 00:16:18 +00:00
Sasha Krassovsky
7eb37fea26 Allow reusing projects between runs of logical replication benchmarks (#8393) 2024-07-15 14:55:57 -07:00
Joonas Koivunen
730db859c7 feat(timeline_detach_ancestor): success idempotency (#8354)
Right now timeline detach ancestor reports an error (409, "no ancestor")
on a new attempt after successful completion. This makes it troublesome
for storage controller retries. Fix it to respond with `200 OK` as if
the operation had just completed quickly.

Additionally, the timeline identifiers in the `200 OK` response are now
ordered, so that the storage controller passthrough added in #8353 can
compare responses from different nodes.

Design-wise, this PR introduces a new strategy for accessing the latest
uploaded IndexPart:
`RemoteTimelineClient::initialized_upload_queue(&self) ->
Result<UploadQueueAccessor<'_>, NotInitialized>`. It should be a more
scalable way to query the latest uploaded `IndexPart` than to add a
query method for each question directly on `RemoteTimelineClient`.

GC blocking will need to be introduced to make the operation fully
idempotent. However, it is idempotent for the cases demonstrated by
tests.

Cc: #6994
2024-07-15 17:47:53 +00:00
John Spray
04448ac323 pageserver: use PITR GC cutoffs as authoritative (#8365)
## Problem

Pageserver GC uses a size-based condition (GC "horizon") in addition to
the time-based "PITR".

Eventually we plan to retire the size-based condition:
https://github.com/neondatabase/neon/issues/6374

Currently, we always apply the more conservative of the two, meaning
that tenants always retain at least 64MB of history (default horizon),
even after a very long time has passed. This is particularly acute in
cases where someone has dropped tables/databases, and then leaves a
database idle: the horizon can prevent GCing very large quantities of
historical data (we already account for this in synthetic size by
ignoring gc horizon).

We're not entirely removing GC horizon right now because we don't want
to 100% rely on standby_horizon for robustness of physical replication,
but we can tweak our logic to avoid retaining that 64MB LSN length
indefinitely.

## Summary of changes

- Rework `Timeline::find_gc_cutoffs`, with new logic:
- If there is no PITR set, then use `DEFAULT_PITR_INTERVAL` (1 week) to
calculate a time threshold. Retain either the horizon or up to that
threshold, whichever requires less data.
- When there is a PITR set, and we have unambiguously resolved the
timestamp to an LSN, ignore the GC horizon entirely. For typical
PITRs (1 day, 1 week), this will still easily retain enough data to
avoid stressing read-only replicas.

The key property we end up with, whether a PITR is set or not, is that
after enough time has passed, our GC cutoff on an idle timeline will
catch up with the last_record_lsn.
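
The cutoff selection, sketched (signature and names are assumptions;
note that a higher cutoff LSN retains less data):

```rust
#[derive(Clone, Copy, PartialEq, PartialOrd)]
struct Lsn(u64);

fn find_gc_cutoff(
    last_record: Lsn,
    space_horizon: u64,       // size-based limit, e.g. the 64MB default
    pitr_cutoff: Option<Lsn>, // LSN resolved from pitr_interval, if set
    default_time_cutoff: Lsn, // LSN at now - DEFAULT_PITR_INTERVAL (1 week)
) -> Lsn {
    let space_cutoff = Lsn(last_record.0.saturating_sub(space_horizon));
    match pitr_cutoff {
        // PITR set and unambiguously resolved: ignore the horizon entirely.
        Some(time_cutoff) => time_cutoff,
        // No PITR: retain the horizon or the 1-week window, whichever
        // requires less data, i.e. take the higher cutoff LSN.
        None => {
            if default_time_cutoff > space_cutoff {
                default_time_cutoff
            } else {
                space_cutoff
            }
        }
    }
}
```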

Using `DEFAULT_PITR_INTERVAL` is a bit of an arbitrary hack, but this
doesn't feel worth the noise of exposing in TenantConfig. We could just
make it a differently named constant, though. The end state will be that
there is no gc_horizon at all, and that tenants with pitr_interval=0
would truly retain no history, so this constant would go away.
2024-07-15 17:43:05 +01:00
Joonas Koivunen
324e4e008f feat(storcon): timeline detach ancestor passthrough (#8353)
Currently the storage controller does not support forwarding timeline
detach ancestor requests to pageservers. Add support for forwarding `PUT
.../:tenant_id/timelines/:timeline_id/detach_ancestor`. Implement the
support mostly as-is, because timeline detach ancestor will be made
(mostly) idempotent in a future PR.

Cc: #6994
2024-07-15 18:08:24 +03:00
Christian Schwarz
b49b450dc4 remove page_service show <tenant_id> (#8372)
This operation isn't used in practice, so let's remove it.

Context: in https://github.com/neondatabase/neon/pull/8339
2024-07-15 15:33:56 +01:00
Konstantin Knizhnik
8a8b83df27 Add neon.running_xacts_overflow_policy to make it possible for RO replica to startup without primary even in case running xacts overflow (#8323)
## Problem

Right now, if there are too many running xacts to be restored from the
CLOG at replica startup, the replica does not try to restore them and
instead waits for a non-overflowed running-xacts WAL record from the
primary. But if the primary is not active, the replica will not start
at all.

Too many running xacts can be caused by transactions with a large number
of subtransactions. But right now it can also be caused by two things:
- lack of a shutdown checkpoint, which updates `oldestRunningXid` (because
of immediate shutdown)
- nextXid alignment on a 1024 boundary (which causes the loss of ~1k XIDs
on each restart)

Both problems are now addressed to some degree.
But we have existing customers with a "sparse" CLOG and a lack of
checkpoints.
To be able to start RO replicas for such customers, I suggest adding a
GUC which allows the replica to start even in case of subxacts overflow.

## Summary of changes

Add `neon.running_xacts_overflow_policy` with the following values:
- ignore: restore the last N XIDs from the CLOG and accept connections
- skip: do not restore any XIDs from the CLOG, but still accept connections
- wait: wait for a non-overflowed running-xacts record from the primary node

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-07-15 15:52:00 +03:00
Arseny Sher
4bdfb96078 Fix test_timeline_copy flakiness.
fixes https://github.com/neondatabase/neon/issues/8355
2024-07-15 15:21:16 +03:00
Luca Bruno
8da3b547f8 proxy/http: switch to typed_json (#8377)
## Summary of changes

This switches JSON rendering logic to `typed_json` in order to
reduce the number of allocations in the HTTP responder path.

Followup from
https://github.com/neondatabase/neon/pull/8319#issuecomment-2216991760.

---------

Co-authored-by: Conrad Ludgate <conradludgate@gmail.com>
2024-07-15 12:38:52 +01:00
Vlad Lazar
b329b1c610 tests: allow list breaching min resident size in statvfs test (#8358)
## Problem
This test would sometimes violate the min resident size during disk
usage eviction and fail due to the resulting warning log.

Disk usage candidate collection only takes into account active tenants.
However, the statvfs call takes into account the entire tenants
directory, which includes tenants which haven't become active yet.

After restarting the pageserver, disk usage eviction may kick in
*before* both tenants have become active. Hence, the logic will try to
satisfy the disk usage requirements by evicting everything belonging to
the active tenant, thereby violating that tenant's minimum resident size.

## Summary of changes

Allow the warning.
2024-07-12 17:31:17 +01:00
Alex Chi Z
4184685721 fix(pageserver): unique test harness name for merge_in_between (#8366)
As per the title. There should be a way to detect duplicated harness
names in the future :(

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-07-12 14:28:13 +01:00
Conrad Ludgate
411a130675 Fix nightly warnings 2024 june (#8151)
## Problem

New clippy warnings on nightly.

## Summary of changes

Broken up into one commit per warning type:
1. Remove some unnecessary refs.
2. In edition 2024, inference will default to `!` and not `()`.
3. Clippy complains about doc comment indentation.
4. Fix `Trait + ?Sized` where `Trait: Sized`.
5. diesel_derives triggering `non_local_definitions`.
2024-07-12 13:58:04 +01:00
John Spray
0645ae318e pageserver: circuit breaker on compaction (#8359)
## Problem

We already back off on compaction retries, but the impact of a failing
compaction can be so great that backing off up to 300s isn't enough. The
impact is consuming a lot of I/O+CPU in the case of image layer
generation for large tenants, and potentially also leaking disk space.

Compaction failures are extremely rare and almost always indicate a bug,
frequently a bug that will not let compaction proceed until it is
fixed.

Related: https://github.com/neondatabase/neon/issues/6738

## Summary of changes

- Introduce a CircuitBreaker type
- Add a circuit breaker for compaction, with a policy that after 5
failures, compaction will not be attempted again for 24 hours.
- Add metrics that we can alert on: any >0 value for
`pageserver_circuit_breaker_broken_total` should generate an alert.
- Add a test that checks this works as intended.

A couple of notes for reviewers:
- Circuit breakers are intrinsically a defense-in-depth measure: this is
not the solution to any underlying issues, it is just a general
mitigation for "unknown unknowns" that might be encountered in future.
- This PR isn't primarily about writing a perfect CircuitBreaker type:
the one in this PR is meant to be just enough to mitigate issues in
compaction, and make it easy to monitor/alert on these failures. We can
refine this type in future as/when we want to use it elsewhere.
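
A minimal sketch of such a breaker, under the policy above (the actual
type in this PR may differ):

```rust
use std::time::{Duration, Instant};

struct CircuitBreaker {
    consecutive_failures: u32,
    break_threshold: u32, // e.g. 5 failures
    cooldown: Duration,   // e.g. 24 hours
    broken_until: Option<Instant>,
}

impl CircuitBreaker {
    /// Callers skip compaction entirely while the breaker is open.
    fn is_broken(&self) -> bool {
        self.broken_until.map_or(false, |t| Instant::now() < t)
    }

    fn success(&mut self) {
        self.consecutive_failures = 0;
        self.broken_until = None;
    }

    fn fail(&mut self) {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= self.break_threshold {
            // Trip the breaker; this is where a counter like
            // pageserver_circuit_breaker_broken_total would be incremented
            // for alerting.
            self.broken_until = Some(Instant::now() + self.cooldown);
        }
    }
}
```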
2024-07-12 12:04:02 +01:00
Japin Li
86d6ef305a Remove fs2 dependency (#8350)
The fs2 dependency is not needed anymore after commit d42700280.
2024-07-12 12:56:06 +03:00
Arpad Müller
2e37aa3fe8 Implement decompression for vectored reads (#8302)
Implement decompression of images for vectored reads.

This doesn't implement support for still treating blobs as uncompressed
via the bits we reserved for compression, as we removed that
functionality in #8300 anyway.

Part of #5431
2024-07-12 04:32:34 +02:00
Arpad Müller
30bbfde50d Pass configured compression param to image generation (#8363)
We need to pass on the configured compression param during image layer
generation.

This was an oversight in #8106, and the likely reason why #8288 didn't
bring any interesting regressions.

Part of https://github.com/neondatabase/neon/issues/5431
2024-07-12 01:43:44 +02:00
Sasha Krassovsky
82b9a44ab4 Grant execute on snapshot functions to neon_superuser (#8346)
## Problem
I need `neon_superuser` to be allowed to create snapshots for
replication tests

## Summary of changes
Adds a migration that grants execute on these functions to neon_superuser.
2024-07-11 20:29:35 +00:00
Joonas Koivunen
4a87bac036 test: limit test_layer_download_timeouted to MOCK_S3 (#8331)
Requests against REAL_S3 on CI can consistently take longer than 1s;
testing the short timeouts against it made no sense in hindsight, as
MOCK_S3 works just as well.

evidence:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8229/9857994025/index.html#suites/b97efae3a617afb71cb8142f5afa5224/6828a50921660a32
2024-07-11 15:03:35 -04:00
Alex Chi Z
38b4ed297e feat(pageserver): rewrite streaming vectored read planner (#8242)
Rewrite the streaming vectored read planner as a separate struct. The API
is designed to produce batches around `max_read_size`, instead of strictly
less than it, so that `handle_XX` returns one batch at a time.
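
The batching contract, sketched with stand-in names (the real struct and
methods differ):

```rust
struct StreamingReadPlanner {
    max_read_size: u64,
    batch: Vec<(u64, u64)>, // (offset, len) of planned reads
    batch_size: u64,
}

impl StreamingReadPlanner {
    /// Accumulate one read; once the batch reaches max_read_size, hand it
    /// back. Batches thus land *around* the limit rather than strictly
    /// under it, and each call yields at most one batch.
    fn handle(&mut self, offset: u64, len: u64) -> Option<Vec<(u64, u64)>> {
        self.batch.push((offset, len));
        self.batch_size += len;
        if self.batch_size >= self.max_read_size {
            self.batch_size = 0;
            Some(std::mem::take(&mut self.batch))
        } else {
            None
        }
    }
}
```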

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-07-11 18:28:16 +00:00
Arseny Sher
cd29156927 Fix memory context of NeonWALReader allocation.
Allocating it in a short-lived context is wrong because it is reused for
the backend's lifetime.
2024-07-11 20:31:15 +03:00
John Spray
814c8e8f68 storage controller: add node deletion API (#8226)
## Problem

In anticipation of later adding a really nice drain+delete API, I
initially only added an intentionally basic `/drop` API that is just
about usable for deleting nodes in a pinch, but requires some ugly
storage controller restarts to persuade it to restart secondaries.

## Summary of changes

I started making a few tiny fixes, and ended up writing the delete
API...

- Quality of life nit: ordering of node + tenant listings in storcon_cli
- Papercut: Fix the attach_hook using the wrong operation type for
reporting slow locks
- Make Service::spawn tolerate `generation_pageserver` columns that
point to nonexistent node IDs. I started out thinking of this as a
general resilience thing, but when implementing the delete API I
realized it was actually a legitimate end state after the delete API is
called (as that API doesn't wait for all reconciles to succeed).
- Add a `DELETE` API for nodes, which does not gracefully drain, but
does reschedule everything. This becomes safe to use when the system is
in any state, but will incur availability gaps for any tenants that
weren't already live-migrated away. If tenants have already been
drained, this becomes a totally clean + safe way to decom a node.
- Add a test and a storcon_cli wrapper for it

This is meant to be a robust initial API that lets us remove nodes
without doing ugly things like restarting the storage controller -- it's
not quite a totally graceful node-draining routine yet. There's more
work in https://github.com/neondatabase/neon/issues/8333 to get to our
end state.
2024-07-11 17:05:47 +01:00
John Spray
0159ae9536 safekeeper: eviction metrics (#8348)
## Problem

Follow-up to https://github.com/neondatabase/neon/pull/8335, to improve
observability of how many evictions/restores we are doing.

## Summary of changes

- Add `safekeeper_eviction_events_started_total` and
`safekeeper_eviction_events_completed_total`, with a "kind" label of
evict or restore. This gives us rates, and also the ability to calculate
how many are in progress.
- Generalize SafekeeperMetrics test type to use the same helpers as
pageserver, and enable querying any metric.
- Read the new metrics at the end of the eviction test.
2024-07-11 17:05:35 +01:00
Vlad Lazar
d9a82468e2 storage_controller: fix ReconcilerWaiter::get_status (#8341)
## Problem
SeqWait::would_wait_for returns Ok when we would not wait for the
sequence number, and Err otherwise. ReconcilerWaiter::get_status uses it
the wrong way around. This can cause the storage controller to go into a
busy loop and appear unavailable to the k8s controller.

## Summary of changes
Use `SeqWait::would_wait_for` correctly.
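
The corrected mapping, sketched with a stand-in result type, per the
semantics described above:

```rust
enum ReconcilerStatus {
    Done,
    InProgress,
}

// Ok(())  => the sequence number has been reached; a waiter would not block.
// Err(_)  => a waiter would block; reconciliation is still in flight.
fn get_status(would_wait_for: Result<(), ()>) -> ReconcilerStatus {
    match would_wait_for {
        Ok(()) => ReconcilerStatus::Done,        // sequence reached: finished
        Err(()) => ReconcilerStatus::InProgress, // would still wait: running
    }
}
```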
2024-07-11 15:43:28 +01:00
Christian Schwarz
e26ef640c1 pageserver: remove trace_read_requests (#8338)
`trace_read_requests` is a per-`Tenant`-object option, but the
`handle_pagerequests` loop doesn't know which `Tenant` object
(i.e., which shard) the request is for.

The remaining use of the `Tenant` object is to check `tenant.cancel`.
That check is incorrect [if the pageserver hosts multiple
shards](https://github.com/neondatabase/neon/issues/7427#issuecomment-2220577518).
I'll fix that in a future PR where I completely eliminate the holding
of `Tenant/Timeline` objects across requests.
See [my code RFC](https://github.com/neondatabase/neon/pull/8286) for
the high-level idea.

Note that we can always bring the tracing functionality back if we need
it. But since it's actually about logging the `page_service` wire bytes,
it should be a `page_service`-level config option, not per-tenant.
And for enabling tracing on a single connection, we can implement
a `set pageserver_trace_connection;` option.
2024-07-11 15:17:07 +02:00
Peter Bendel
c11b9cb43d Run Performance bench on more platforms (#8312)
## Problem

https://github.com/neondatabase/cloud/issues/14721

## Summary of changes

Add one more platform to the benchmarking job:


57535c039c/.github/workflows/benchmarking.yml (L57C3-L126)

Run with pg 16, provisioner k8s-neonvm by default on the new platform.

Adjust some test cases to

- not depend on database client <-> database server latency by pushing
loops into server-side PL/pgSQL functions
- increase statement and test timeouts

First successful run of these job steps:

https://github.com/neondatabase/neon/actions/runs/9869817756/job/27254280428
2024-07-11 10:07:12 +01:00
John Spray
69b6675da0 rfcs: add RFC for timeline archival (#8221)
A design for a cheap low-resource state for idle timelines:
- #8088
2024-07-11 08:23:51 +01:00
Stas Kelvich
6bbd34a216 Enable core dumps for postgres (#8272)
Set the core rlimit to unlimited in compute_ctl, so that all child
processes inherit it. We could also set the rlimit in the relevant
startup script, but that way we would depend on external setup and might
inadvertently disable it again (core dumping worked in pods, but not in
VMs with inittab-based startup).
2024-07-11 10:20:14 +03:00