rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-06 13:40:37 +00:00

Author	SHA1	Message	Date
Konstantin Knizhnik	0624565617	Create the notion of unstable extensions As a DBaaS provider, Neon needs to provide a stable platform for customers to build applications upon. At the same time however, we also need to enable customers to use the latest and greatest technology, so they can prototype their work, and we can solicit feedback. If all extensions are treated the same in terms of stability, it is hard to meet that goal. There are now two new GUCs created by the Neon extension: neon.allow_unstable_extensions: This is a session GUC which allows a session to install and load unstable extensions. neon.unstable_extensions: This is a comma-separated list of extension names. We can check if a CREATE EXTENSION statement is attempting to install an unstable extension, and if so, deny the request if neon.allow_unstable_extensions is not set to true. Signed-off-by: Tristan Partin <tristan@neon.tech> Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2024-10-28 17:47:15 -05:00
George MacKerron	7d5f6b6a52	Build `pgrag` extensions x3 (#8486 ) Build the pgrag extensions (rag, rag_bge_small_en_v15, and rag_jina_reranker_v1_tiny_en) as part of the compute node Dockerfile. --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2024-10-28 20:06:36 +00:00
Alex Chi Z.	f7c61e856f	fix(pageserver): bump tokio-epoll-uring (#9546 ) Includes https://github.com/neondatabase/tokio-epoll-uring/pull/58 that fixes the clippy error. ## Summary of changes Update the version of tokio-epoll-uring Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-10-28 20:03:02 +00:00
Alex Chi Z.	57c21aff9f	refactor(pageserver): remove aux v1 configs (#9494 ) ## Problem Part of https://github.com/neondatabase/neon/issues/8623 ## Summary of changes Removed all aux-v1 config processing code. Note that we persisted it into the index part file, so we cannot really remove the field from index part. I also kept the config item within the tenant config, but we will not read it any more. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-10-28 19:51:14 +00:00
Erik Grinaker	248558dee8	safekeeper: refactor `WalAcceptor` to be event-driven (#9462 ) ## Problem The `WalAcceptor` main loop currently uses two nested loops to consume inbound messages. This makes it hard to slot in periodic events like metrics collection. It also duplicates the event processing code, and assumes all messages in steady state are AppendRequests (other messages types may be dropped if following an AppendRequest). ## Summary of changes Refactor the `WalAcceptor` loop to be event driven.	2024-10-28 17:18:37 +00:00
Sergey Melnikov	3bad52543f	We don't have legacy proxies anymore (#9544 ) We don't have legacy scram proxies anymore: cc: https://github.com/neondatabase/cloud/issues/9745	2024-10-28 16:42:35 +00:00
Tristan Partin	3d64a7ddcd	Add pg_mooncake to compute-node.Dockerfile Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-10-28 11:23:30 -05:00
Conrad Ludgate	25f1e5cfeb	[proxy] demote warnings and remove dead-argument (#9512 ) fixes https://github.com/neondatabase/cloud/issues/19000	2024-10-28 15:02:20 +00:00
Rahul Patil	8dd555d396	ci(proxy): Update GH action flag on proxy deployment (#9535 ) ## Problem Based on a recent proxy deployment issue, we deployed another proxy version (proxy-scram), which was not needed when deploying a specific proxy type. we have [PR](https://github.com/neondatabase/infra/pull/2142) to update on the infra branch and need to update CI in this repo which triggers proxy deployment. ## Summary of changes - Update proxy deployment flag ## Checklist before requesting a review - [x] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist	2024-10-28 13:17:09 +01:00
Arthur Petukhovsky	01b6843e12	Route pgbouncer logs to virtio-serial (#9488 ) virtio-serial is much more performant than /dev/console emulation, therefore, is much more suitable for the verbose logs inside vm. This commit changes routing for pgbouncer logs, since we've recently noticed it can emit large volumes of logs. Manually tested on staging by pinning a compute image to my test project. Should help with https://github.com/neondatabase/cloud/issues/19072	2024-10-28 12:09:47 +00:00
John Spray	93987b5a4a	tests: add test_storage_controller_onboard_detached (#9431 ) ## Problem We haven't historically taken this API route where we would onboard a tenant to the controller in detached state. It worked, but we didn't have test coverage. ## Summary of changes - Add a test that onboards a tenant to the storage controller in Detached mode, and checks that deleting it without attaching it works as expected.	2024-10-28 11:11:12 +00:00
John Spray	33baca07b6	storcon: add an API to cancel ongoing reconciler (#9520 ) ## Problem If something goes wrong with a live migration, we currently only have awkward ways to interrupt that: - Restart the storage controller - Ask it to do some other modification/migration on the shard, which we don't really want. ## Summary of changes - Add a new `/cancel` control API, and storcon_cli wrapper for it, which fires the Reconciler's cancellation token. This is just for on-call use and we do not expect it to be used by any other services.	2024-10-28 09:26:01 +00:00
John Spray	923974d4da	safekeeper: don't un-evict timelines during snapshot API handler (#9428 ) ## Problem When we use pull_timeline API on an evicted timeline, it gets downloaded to serve the snapshot API request. That means that to evacuate all the timelines from a node, the node needs enough disk space to download partial segments from all timelines, which may not be physically the case. Closes: #8833 ## Summary of changes - Add a "try" variant of acquiring a residence guard, that returns None if the timeline is offloaded - During snapshot API handler, take a different code path if the timeline isn't resident, where we just read the checkpoint and don't try to read any segments.	2024-10-28 08:47:12 +00:00
Arpad Müller	e7277885b3	Don't consider archived timelines for synthetic size calculation (#9497 ) Archived timelines should not count towards synthetic size. Closes #9384. Part of #8088.	2024-10-26 13:27:57 +00:00
dependabot[bot]	80262e724f	build(deps): bump werkzeug from 3.0.3 to 3.0.6 (#9527 )	2024-10-26 08:24:15 +01:00
Yuchen Liang	85b954f449	pageserver: add tokio-epoll-uring slots waiters queue depth metrics (#9482 ) In complement to https://github.com/neondatabase/tokio-epoll-uring/pull/56. ## Problem We want to make tokio-epoll-uring slots waiters queue depth observable via Prometheus. ## Summary of changes - Add `pageserver_tokio_epoll_uring_slots_submission_queue_depth` metrics as a `Histogram`. - Each thread-local tokio-epoll-uring system is given a `LocalHistogram` to observe the metrics. - Keep a list of `Arc<ThreadLocalMetrics>` used on-demand to flush data to the shared histogram. - Extend `Collector::collect` to report `pageserver_tokio_epoll_uring_slots_submission_queue_depth`. Signed-off-by: Yuchen Liang <yuchen@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2024-10-25 21:30:57 +01:00
Arpad Müller	76328ada05	Fix unoffload_timeline races with creation (#9525 ) This PR does two things: 1. Obtain a `TimelineCreateGuard` object in `unoffload_timeline`. This prevents two unoffload tasks from racing with each other. While they already obtain locks for `timelines` and `offloaded_timelines`, they aren't sufficient, as we have already constructed an entire timeline at that point. We shouldn't ever have two `Timeline` objects in the same process at the same time. 2. don't allow timeline creations for timelines that have been offloaded. Obviously they already exist, so we should not allow creation. the previous logic only looked at the timelines list. Part of #8088	2024-10-25 20:06:27 +00:00
Erik Grinaker	b54b632c6a	safekeeper: don't pass conf into storage constructors (#9523 ) ## Problem The storage components take an entire `SafekeeperConf` during construction, but only actually use the `no_sync` field. This makes it hard to understand the storage inputs (which fields do they actually care about?), and is also inconvenient for tests and benchmarks that need to set up a lot of unnecessary boilerplate. ## Summary of changes * Don't take the entire config, but pass in the `no_sync` field explicitly. * Take the timeline dir instead of `ttid` as an input, since it's the only thing it cares about. * Fix a couple of tests to not leak tempdirs. * Various minor tweaks.	2024-10-25 18:19:52 +01:00
Erik Grinaker	9909551f47	safekeeper: fix version in `TimelinePersistentState::empty()` (#9521 ) ## Problem The Postgres version in `TimelinePersistentState::empty()` is incorrect: the major version should be multiplied by 10000. ## Summary of changes Multiply the version by 10000.	2024-10-25 16:22:35 +01:00
Arseny Sher	700b102b0f	safekeeper: retry eviction. (#9485 ) Without this manager may sleep forever after eviction failure without retries.	2024-10-25 17:48:29 +03:00
Conrad Ludgate	dbadb0f9bb	proxy: propagate session IDs (#9509 ) fixes #9367 by sending session IDs to local_proxy, and also returns session IDs to the client for easier debugging.	2024-10-25 14:34:19 +00:00
John Spray	8297f7a181	pageserver: fix N^2 I/O when processing relation drops in transaction abort (#9507 ) ## Problem We have some known N^2 behaviors when it comes to large relation counts, due to the monolithic encoding and full rewrites of of RelDirectory each time a relation is added. Ordinarily our backpressure mechanisms give "slow but steady" performance when creating/dropping/truncating relations. However, in the case of a transaction abort, it is possible for a single WAL record to drop an unbounded number of relations. The results in an unavailable compute, as when it sends one of these records, it can stall the pageserver's ingest for many minutes, even though the compute only sent a small amount of WAL. Closes https://github.com/neondatabase/neon/issues/9505 ## Summary of changes - Rewrite relation-dropping code to do one read/modify/write cycle of RelDirectory, instead of doing it separately for each relation in a loop. - Add a test for the bug scenario encountered: `test_tx_abort_with_many_relations` The test has ~40s runtime on my workstation. About 1 second of that is the part where we wait for ingest to catch up after a rollback, the rest is the slowness of creating and truncating a large number of relations. --------- Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2024-10-25 15:09:02 +01:00
Christian Schwarz	2090e928d1	refactor(timeline creation): idempotency checking (#9501 ) # Context In the PGDATA import code (https://github.com/neondatabase/neon/pull/9218) I add a third way to create timelines, namely, by importing from a copy of a vanilla PGDATA directory in object storage. For idempotency, I'm using the PGDATA object storage location specification, which is stored in the IndexPart for the entire lifespan of the timeline. When loading the timeline from remote storage, that value gets stored inside `struct Timeline` and timeline creation compares the creation argument with that value to determine idempotency of the request. # Changes This PR refactors the existing idempotency handling of Timeline bootstrap and branching such that we simply compare the `CreateTimelineIdempotency` struct, using the derive-generated `PartialEq` implementation. Also, by spelling idempotency out in the type names, I find it adds a lot of clarity. The pathway to idempotency via requester-provided idempotency key also becomes very straight-forward, if we ever want to do this in the future. # Refs * platform context: https://github.com/neondatabase/neon/pull/9218 * product context: https://github.com/neondatabase/cloud/issues/17507 * stacks on top of https://github.com/neondatabase/neon/pull/9366	2024-10-25 14:44:20 +01:00
Tristan Partin	05eff3a67e	Move logical replication slot monitor neon.c is getting crowded and the logical replication slot monitor is a good candidate for reorganization. It is very self-contained, and being in a separate file will make it that much easier to find. Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-10-25 08:41:44 -05:00
Arseny Sher	c6cf5e7c0f	Make test_pageserver_lsn_wait_error_safekeeper_stop less aggressive. (#9517 ) Previously it inserted ~150MiB of WAL while expecting page fetching to work in 1s (wait_lsn_timeout=1s). It failed in CI in debug builds. Instead, just directly wait for the wanted condition, i.e. needed safekeepers are reported in pageserver timed out waiting for WAL error message. Also set NEON_COMPUTE_TESTING_BASEBACKUP_RETRIES to 1 in this test and neighbour one, it reduces execution time from 2.5m to ~10s.	2024-10-25 14:13:46 +01:00
Christian Schwarz	e0c7f1ce15	remote_storage(local_fs): return correct file sizes (#9511 ) ## Problem `local_fs` doesn't return file sizes, which I need in PGDATA import (#9218) ## Solution Include file sizes in the result. I would have liked to add a unit test, and started doing that in * https://github.com/neondatabase/neon/pull/9510 by extending the common object storage tests (`libs/remote_storage/tests/common/tests.rs`) to check for sizes as well. But it turns out that localfs is not even covered by the common object storage tests and upon closer inspection, it seems that this area needs more attention. => punt the effort into https://github.com/neondatabase/neon/pull/9510	2024-10-25 12:20:53 +00:00
Christian Schwarz	6f5c262684	pageserver: add testing API to scan layers for disposable keys (#9393 ) This PR adds a pageserver mgmt API to scan a layer file for disposable keys. It hooks it up to the sharding compaction test, demonstrating that we're not filtering out all disposable keys. This is extracted from PGDATA import (https://github.com/neondatabase/neon/pull/9218) where I do the filtering of layer files based on `is_key_disposable`.	2024-10-25 14:16:45 +02:00
Jakub Kołodziejczak	9768f09f6b	proxy: don't follow redirects for user provided JWKS urls + set custom user agent (#9514 ) partially fixes https://github.com/neondatabase/cloud/issues/19249 ref https://docs.rs/reqwest/latest/reqwest/redirect/index.html > By default, a Client will automatically handle HTTP redirects, having a maximum redirect chain of 10 hops. To customize this behavior, a redirect::Policy can be used with a ClientBuilder.	2024-10-25 14:04:41 +02:00
Yuchen Liang	db900ae9d0	fix(test): remove too strict layers_removed==0 check in test_readonly_node_gc (#9506 ) Fixes #9098 ## Problem `test_readonly_node_gc` is flaky. As shown in [Allure Report](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9469/11444519440/index.html#suites/3ccffb1d100105b98aed3dc19b717917/2c02073738fa2b39), we would get a `AssertionError: No layers should be removed, old layers are guarded by leases.` after the test restarts pageservers or after reconfigure pageservers. During the investigation, we found that the layers has LSN (`0/1563088`) greater than the LSN (`0x1562000`) protected by the lease. For instance, Layers removed <pre> 000000067F00000005000034540100000000-000000067F00000005000040050100000000__000000000<b><i>1563088</i></b>-00000001 (shard 0002) 000000068000000000000017E20000000001-010000000100000001000000000000000001__000000000<b><i>1563088</i></b>-00000001 (shard 0002) </pre> Lsn Lease Granted <pre> handle_make_lsn_lease{lsn=<b><i>0/1562000</i></b> shard_id=0002 shard_id=0002}: lease created, valid until 2024-10-21 </pre> This means that these layers are not guarded by the leases: they are in "future", not visible to the static endpoint. ## Summary of changes - Remove the assertion layers_removed == 0 after trigger timeline GC while holding the lease. Instead rely on the successful execution of the`SELECT` query to test lease validity. - Improve test logging Signed-off-by: Yuchen Liang <yuchen@neon.tech>	2024-10-25 12:50:47 +01:00
Arpad Müller	4d9036bf1f	Support offloaded timelines during shard split (#9489 ) Before, we didn't copy over the `index-part.json` of offloaded timelines to the new shard's location, resulting in the new shard not knowing the timeline even exists. In #9444, we copy over the manifest, but we also need to do this for `index-part.json`. As the operations to do are mostly the same between offloaded and non-offloaded timelines, we can iterate over all of them in the same loop, after the introduction of a `TimelineOrOffloadedArcRef` type to generalize over the two cases. This is analogous to the deletion code added in #8907. The added test also ensures that the sharded archival config endpoint works, something that has not yet been ensured by tests. Part of #8088	2024-10-25 12:32:46 +02:00
Vlad Lazar	b3bedda6fd	pageserver/walingest: log on gappy rel extend (#9502 ) ## Problem https://github.com/neondatabase/neon/pull/9492 added a metric to track the total count of block gaps filled on rel extend. More context is needed to understand when this happens. The current theory is that it may only happen on pg 14 and pg 15 since they do not WAL log relation extends. ## Summary of Changes A rate limited log is added.	2024-10-25 11:15:53 +01:00
Christian Schwarz	b782b11b33	refactor(timeline creation): represent bootstrap vs branch using enum (#9366 ) # Problem Timeline creation can either be bootstrap or branch. The distinction is made based on whether the `ancestor_` fields are present or not. In the PGDATA import code (https://github.com/neondatabase/neon/pull/9218), I add a third variant to timeline creation. # Solution The above pushed me to refactor the code in Pageserver to distinguish the different creation requests through enum variants. There is no externally observable effect from this change. On the implementation level, a notable change is that the acquisition of the `TimelineCreationGuard` happens later than before. This is necessary so that we have everything in place to construct the `CreateTimelineIdempotency`. Notably, this moves the acquisition of the creation guard _after_ the acquisition of the `gc_cs` lock in the case of branching. This might appear as if we're at risk of holding `gc_cs` longer than before this PR, but, even before this PR, we were holding `gc_cs` until after the `wait_completion()` that makes the timeline creation durable in S3 returns. I don't see any deadlock risk with reversing the lock acquisition order. As a drive-by change, I found that the `create_timeline()` function in `neon_local` is unused, so I removed it. # Refs platform context: https://github.com/neondatabase/neon/pull/9218 * product context: https://github.com/neondatabase/cloud/issues/17507 * next PR stacked atop this one: https://github.com/neondatabase/neon/pull/9501	2024-10-25 10:04:27 +00:00
Vlad Lazar	5069123b6d	pageserver: refactor ingest inplace to decouple decoding and handling (#9472 ) ## Problem WAL ingest couples decoding of special records with their handling (updates to the storage engine mostly). This is a roadblock for our plan to move WAL filtering (and implicitly decoding) to safekeepers since they cannot do writes to the storage engine. ## Summary of changes This PR decouples the decoding of the special WAL records from their application. The changes are done in place and I've done my best to refrain from refactorings and attempted to preserve the original code as much as possible. Related: https://github.com/neondatabase/neon/issues/9335 Epic: https://github.com/neondatabase/neon/issues/9329	2024-10-24 17:12:47 +01:00
Alex Chi Z.	fb0406e9d2	refactor(pageserver): refactor split writers using batch layer writer (#9493 ) part of https://github.com/neondatabase/neon/issues/9114, https://github.com/neondatabase/neon/issues/8836, https://github.com/neondatabase/neon/issues/8362 The split layer writer code can be used in a more general way: the caller puts unfinished writers into the batch layer writer and let batch layer writer to ensure the atomicity of the layer produces. ## Summary of changes * Add batch layer writer, which atomically finishes the layers. `BatchLayerWriter::finish` is simply a copy-paste from previous split layer writers. * Refactor split writers to use the batch layer writer. * The current split writer tests cover all code path of batch layer writer. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-10-24 10:49:54 -04:00
Alexander Bayandin	b8a311131e	CI: remove `git config --add safe.directory` hack (#9391 ) ## Problem We have `git config --global --add safe.directory ...` leftovers from the past, but `actions/checkout` does it by default (since v3.0.2, we use v4) ## Summary of changes - Remove `git config --global --add safe.directory ...` hack	2024-10-24 15:49:26 +01:00
John Spray	d589498c6f	storcon: respect Reconciler::cancel during await_lsn (#9486 ) ## Problem When a pageserver is misbehaving (e.g. we hit an ingest bug or something is pathologically slow), the storage controller could get stuck in the part of live migration that waits for LSNs to catch up. This is a problem, because it can prevent us migrating the troublesome tenant to another pageserver. Closes: https://github.com/neondatabase/cloud/issues/19169 ## Summary of changes - Respect Reconciler::cancel during await_lsn.	2024-10-24 15:23:09 +01:00
Christian Schwarz	6f34f97573	refactor(pageserver(load_remote_timeline)) remove dead code handling absence of IndexPart (#9408 ) The code is dead at runtime since we're nowadays always running with remote storage and treat it as the source of truth during attach. Clean it up as a preliminary to https://github.com/neondatabase/neon/pull/9218. Related: https://github.com/neondatabase/neon/pull/9366	2024-10-24 09:00:22 +01:00
Tristan Partin	b86432c29e	Fix buggy sizeof A sizeof on a pointer on a 64 bit machine is 8 bytes whereas Entry::old_name is a 64 byte array of characters. There was most likely no fallout since the string would start with NUL bytes, but best to fix nonetheless. Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-10-23 21:52:22 -06:00
Vlad Lazar	ac1205c14c	pageserver: add metric for number of zeroed pages on rel extend (#9492 ) ## Problem Filling the gap in with zeroes is annoying for sharded ingest. We are not sure it even happens in reality. ## Summary of Changes Add one global counter which tracks how many such gap blocks we filled on relation extends. We can add more metrics once we understand the scope.	2024-10-23 19:58:28 +01:00
John Spray	e3ff87ce3b	tests: avoid using background_process when invoking pg_ctl (#9469 ) ## Problem Occasionally, we get failures to start the storage controller's db with errors like: ``` aborting due to panic at /__w/neon/neon/control_plane/src/background_process.rs:349:67: claim pid file: lock file Caused by: file is already locked ``` e.g. https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9428/11380574562/index.html#/testresult/1c68d413ea9ecd4a This is happening in a stop,start cycle during a test. Presumably the pidfile from the startup background process is still held at the point we stop, because we let pg_ctl keep running in the background. ## Summary of changes - Refactor pg_ctl invocations into a helper - In the controller's `start` function, use pg_ctl & a wait loop for pg_isready, instead of using background_process --------- Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>	2024-10-23 16:29:55 +00:00
Tristan Partin	0595320c87	Protect call to pg_current_wal_lsn() in retained_wal query We can't call pg_current_wal_lsn() if we are a standby instance (read replica). Any attempt to call this function while in recovery results in: ERROR: recovery is in progress Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-10-23 09:55:00 -06:00
Folke Behrens	92d5e0e87a	proxy: clear lib.rs of code items (#9479 ) We keep lib.rs for crate configs, lint configs and re-exports for the binaries.	2024-10-23 08:21:28 +02:00
Arpad Müller	3a3bd34a28	Rename IndexPart::{from_s3_bytes,to_s3_bytes} (#9481 ) We support multiple storage backends now, so remove the `_s3_` from the name. Analogous to the names adopted for tenant manifests added in #9444.	2024-10-23 00:34:24 +02:00
Alex Chi Z.	64949a37a9	fix(pageserver): make delta split layer writer finish atomic (#9048 ) similar to https://github.com/neondatabase/neon/pull/8841, we make the delta layer writer atomic when finishing the layers. ## Summary of changes * `put_value` not taking discard fn anymore * `finish` decides what layers to keep --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2024-10-22 22:06:21 +00:00
Arpad Müller	6f8fcdf9ea	Timeline offloading persistence (#9444 ) Persist timeline offloaded state to S3. Right now, as of #8907, at each restart of the pageserver, all offloaded state is lost, so we load the full timeline again. As it starts with an empty local directory, we might potentially download some files again, leading to downloads that are ultimately wasteful. This patch adds support for persisting the offloaded state, allowing us to never load offloaded timelines in the first place. The persistence feature is facilitated via a new file in S3 that is tenant-global, which contains a list of all offloaded timelines. It is updated each time we offload or unoffload a timeline, and otherwise never touched. This choice means that tenants where no offloading is happening will not immediately get a manifest, keeping the change very minimal at the start. We leave generation support for future work. It is important to support generations, as in the worst case, the manifest might be overwritten by an older generation after a timeline has been unoffloaded (and unarchived), so the next pageserver process instantiation might wrongly believe that some timeline is still offloaded even though it should be active. Part of #9386, #8088	2024-10-22 20:52:30 +00:00
Tristan Partin	fcb55a2aa2	Fix copy-paste error in checkpoints_timed metric Importing the wrong metric. Sigh... Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-10-22 14:34:26 -06:00
a-masterov	f36cf3f885	Fix local errors for the tests with the versions mix (#9477 ) ## Problem If the environment variables `COMPATIBILITY_NEON_BIN` or `COMPATIBILITY_POSTGRES_DISTRIB_DIR` are not set (this is usual during a local run), the tests with the versions mix cannot run. ## Summary of changes If these variables are not set turn off the version mix. --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2024-10-22 21:58:55 +02:00
John Spray	8dca188974	storage controller: add metrics for tenant shard, node count (#9475 ) ## Problem Previously, figuring out how many tenant shards were managed by a storage controller was typically done by peeking at the database or calling into the API. A metric makes it easier to monitor, as unexpectedly increasing shard counts can be indicative of problems elsewhere in the system. ## Summary of changes - Add metrics `storage_controller_pageserver_nodes` (updated on node CRUD operations from Service) and `storage_controller_tenant_shards` (updated RAII-style from TenantShard)	2024-10-22 19:43:02 +01:00
Tristan Partin	b7fa93f6b7	Use make's builtin RM variable At least as far as removing individual files goes, this is the best pattern for removing. I can't say the same for removing directories, but I went ahead and changed those to `$(RM) -r` anyway. Signed-off-by: Tristan Partin <tristan@neon.tech>	2024-10-22 09:14:29 -06:00
Arseny Sher	1e8e04bb2c	safekeeper: refactor timeline initialization (#9362 ) Always do timeline init through atomic rename of temp directory. Add GlobalTimelines::load_temp_timeline which does this, and use it from both pull_timeline and basic timeline creation. Fixes a collection of issues: - previously timeline creation didn't really flushed cfile to disk due to 'nothing to do if state didn't change' check; - even if it did, without tmp dir it is possible to lose the cfile but leave timeline dir in place, making it look corrupted; - tenant directory creation fsync was missing in timeline creation; - pull_timeline is now protected from concurrent both itself and timeline creation; - now global timelines map entry got special CreationInProgress entry type which prevents from anyone getting access to timeline while it is being created (previously one could get access to it, but it was locked during creation, which is valid but confusing if creation failed). fixes #8927	2024-10-22 07:11:36 +01:00

1 2 3 4 5 ...

6411 Commits