Commit Graph

1581 Commits

Author SHA1 Message Date
Nikita Kalyanov
f446e08fb8 change HTTP method to comply with spec (#9100)
There is a discrepancy with the spec: it has PUT
2024-09-23 15:53:06 +02:00
Heikki Linnakangas
9a32aa828d Fix init of WAL page header at startup (#8914)
If the primary is started at an LSN within the first page of a 16 MB WAL
segment, the "long XLOG page header" at the beginning of the segment was
not initialized correctly. That has gone unnoticed, because under
normal circumstances nothing looks at the page header. The WAL that is
streamed to the safekeepers starts at the new record's LSN, not at the
beginning of the page, so the bogus page header didn't propagate
elsewhere, and a primary server doesn't normally read the WAL it has
written. That's just as well, because the contents of the page would be
bogus anyway: it wouldn't contain any of the records before the LSN
where the new record is written.

Except that in the following cases a primary does read its own WAL:

1. When there are two-phase transactions in prepared state at
checkpoint. The checkpointer reads the two-phase state from the
XLOG_XACT_PREPARE record, and writes it to a file in pg_twophase/.

2. Logical decoding reads the WAL starting from the replication slot's
restart LSN.

This PR fixes the problem with two-phase transactions. For that, it's
sufficient to initialize the page header correctly. The checkpointer
only needs to read XLOG_XACT_PREPARE records that were generated after
the server startup, so it's still OK that older WAL is missing / bogus.

I have not investigated if we have a problem with logical decoding,
however. Let's deal with that separately.

Special thanks to @Lzjing-1997, who independently found the same bug
and opened a PR to fix it, although I did not use that PR.
2024-09-21 04:00:38 +03:00
Christian Schwarz
ec5dce04eb pageserver: throttling: per-tenant metrics + more metrics to help understand throttle queue depth (#9077) 2024-09-20 16:48:26 +00:00
Christian Schwarz
c45b56e0bb pageserver: add counters for started smgr/getpage requests (#9069)
After this PR

```
curl localhost:9898/metrics | grep smgr_ | grep start
```

```
pageserver_smgr_query_started_count{shard_id="0000",smgr_query_type="get_page_at_lsn",tenant_id="...",timeline_id="..."} 0
pageserver_smgr_query_started_global_count{smgr_query_type="get_db_size"} 0
pageserver_smgr_query_started_global_count{smgr_query_type="get_page_at_lsn"} 0
pageserver_smgr_query_started_global_count{smgr_query_type="get_rel_exists"} 0
pageserver_smgr_query_started_global_count{smgr_query_type="get_rel_size"} 0
pageserver_smgr_query_started_global_count{smgr_query_type="get_slru_segment"} 0
```

We instantiate the per-tenant counter only for `get_page_at_lsn`.
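
For illustration, such a started-counter could be registered with the prometheus crate roughly as below. This is a hedged sketch, not the actual pageserver metrics code; the point of a started counter is that `started - completed` approximates the in-flight queue depth:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounterVec};

// Illustrative registration; the real code goes through the repo's own
// metrics wrapper and also has a per-tenant variant for get_page_at_lsn.
static SMGR_QUERY_STARTED_GLOBAL: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "pageserver_smgr_query_started_global_count",
        "Number of smgr queries started, aggregated over all tenants",
        &["smgr_query_type"]
    )
    .expect("failed to register metric")
});

fn on_getpage_started() {
    // Incremented before handling, unlike a completion counter.
    SMGR_QUERY_STARTED_GLOBAL
        .with_label_values(&["get_page_at_lsn"])
        .inc();
}
```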
2024-09-20 14:55:50 +01:00
Yuchen Liang
1708743e78 pageserver: wait for lsn lease duration after transition into AttachedSingle (#9024)
Part of #7497, closes https://github.com/neondatabase/neon/issues/8890.

## Problem

Since leases are in-memory objects, we need to take special care of them
after pageserver restarts and while doing a live migration. The approach
we took for pageserver restart is to wait for at least lease duration
before doing first GC. We want to do the same for live migration. Since
we do not do any GC when a tenant is in `AttachedStale` or
`AttachedMulti` mode, only the transition from `AttachedMulti` to
`AttachedSingle` requires this treatment.

## Summary of changes

- Added `lsn_lease_deadline` field in `GcBlock::reasons`: the tenant is
temporarily blocked from GC until we reach the deadline. This
information does not persist to S3.
- In `GCBlock::start`, skip the GC iteration if we are blocked by the
lsn lease deadline.
- In `TenantManager::upsert_location`, set the lsn_lease_deadline to
`Instant::now() + lsn_lease_length` so the granted leases have a chance
to be renewed before we run GC for the first time after transitioning
from AttachedMulti to AttachedSingle (sketched below).
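
A minimal sketch of that deadline mechanism, with hypothetical field and method names rather than the actual `GcBlock` API:

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch: the deadline lives only in memory and is never
// persisted to S3, matching the behavior described above.
struct GcBlock {
    lsn_lease_deadline: Option<Instant>,
}

impl GcBlock {
    /// Set on the AttachedMulti -> AttachedSingle transition so granted
    /// leases can be renewed before the first GC run.
    fn block_for_lease_duration(&mut self, lsn_lease_length: Duration) {
        self.lsn_lease_deadline = Some(Instant::now() + lsn_lease_length);
    }

    /// GC iterations are skipped until the deadline passes.
    fn gc_blocked(&self) -> bool {
        self.lsn_lease_deadline
            .map(|deadline| Instant::now() < deadline)
            .unwrap_or(false)
    }
}
```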

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2024-09-19 17:27:10 +01:00
Conrad Ludgate
0a1ca7670c proxy: remove auth info from http conn info & fixup jwt api trait (#9047)
misc changes split out from #8855 

- **allow cloning the request context in a read-only fashion for
background tasks**
- **propagate endpoint and request context through the jwk cache**
- **only allow password based auth for md5 during testing**
- **remove auth info from conn info**
2024-09-19 15:09:30 +00:00
Arseny Sher
32a0e759bd safekeeper: add wal_last_modified to debug_dump.
Adds a debug_dump option to include the highest modified time among all
WAL segments. In passing, replace some str with OsStr to have fewer unwraps.
2024-09-19 16:17:25 +03:00
Alexander Bayandin
ac6a1151ae test_postgres_version: reenable version check for prereleased versions 2024-09-18 14:51:59 +01:00
Christian Schwarz
135e7e4306 add neon_local subcommand for the broker & use that from regression tests (#8948)
There's currently no way to just start/stop broker from `neon_local`.

This PR
* adds a sub-command
* uses that sub-command from the test suite instead of the pre-existing
Python `subprocess` based approach.

Found this useful during the investigation of
https://github.com/neondatabase/cloud/issues/16886.
2024-09-18 09:10:27 +02:00
Arpad Müller
a1b71b73fe Rename some S3 usages to "remote storage" in exposed messages (#8999)
In exposed messages like log messages we mentioned "S3", which is not
entirely accurate as we support Azure blob storage now as well.
2024-09-17 19:15:01 +02:00
Tristan Partin
6138eb50e9 Fix test code related to migrations
We added another migration in 5876c441ab,
but didn't bump this value. This had no effect, but best to fix it
anyway.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-09-17 15:56:05 +01:00
Vlad Lazar
b719d58863 storcon: forward requests from stepped down instance to the current leader (#8954)
## Problem
It turns out that we can't rely on external orchestration to promptly
route traffic to the new leader. This is downtime-inducing.
Forwarding provides a safe way out.

## Safety
We forward when all of the following hold (see the sketch below):
1. Request is not one of ["/control/v1/step_down", "/status", "/ready",
"/metrics"]
2. Current instance is in [`LeadershipStatus::SteppedDown`] state
3. There is a leader in the database to forward to
4. Leader from step (3) is not the current instance
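
A minimal sketch of that decision, assuming hypothetical parameter names for the leader lookup and the instance's own address:

```rust
const LOCAL_PATHS: &[&str] = &["/control/v1/step_down", "/status", "/ready", "/metrics"];

/// Conditions (1)-(4) above, in order. `leader_from_db` is the leader
/// address persisted in the database, if any.
fn should_forward(
    path: &str,
    stepped_down: bool,
    leader_from_db: Option<&str>,
    own_address: &str,
) -> bool {
    !LOCAL_PATHS.contains(&path)                 // (1)
        && stepped_down                          // (2)
        && leader_from_db
            .map(|leader| leader != own_address) // (3) and (4)
            .unwrap_or(false)
}
```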

If a storcon instance is persisted in the database, then we know that it
is the current leader. There's one exception: the window between
handling the step-down request and the new leader updating the database.

Let's treat the happy case first. The stepped-down node does not produce
any side effects, since all request handling happens on the leader.

As for the edge case, we are guaranteed to always have a maximum of two
running instances. Hence, if we are in the edge case scenario, the
leader persisted in the database is the stepped-down instance that
received the request. Condition (4) above covers this scenario.

## Summary of changes
* Conversion utilities for reqwest <-> hyper. I'm not happy with these,
but I don't see a better way. Open to suggestions.
* Add request forwarding logic
* Update each request handler. Again, not happy with this. If anyone
knows a nice way to wrap the handlers, lmk. Joonas and I tried :/
* Update each handler to maybe forward
* Tweak tests to showcase new behaviour
2024-09-17 09:25:42 +01:00
Heikki Linnakangas
2db840d8b8 Move a few test functions related to auth tokens to separate file (#9018)
For readability. neon_fixtures.py is huge.
2024-09-17 06:53:18 +03:00
Heikki Linnakangas
4295ff0f07 Mark a couple of test fixtures as session-scoped (#9018)
pg_distrib_dir doesn't include the Postgres version and only depends
on env variables which cannot change during a test run, so it can be
marked as session-scoped. Similarly, the platform cannot change during
a test run.
2024-09-17 06:53:18 +03:00
Heikki Linnakangas
c6f56b8462 Remove redundant get_dir_size() function (#9018)
There was another copy of it in utils.py. The only difference is that
the version in utils.py tolerates files that are concurrently
removed. That seems fine for the few callers in neon_fixtures.py too.
2024-09-17 06:53:18 +03:00
Heikki Linnakangas
fec9321fc0 Use Path type in a few more places in neon_fixtures.py (#9018)
This is in preparation of replacing neon_fixtures.get_dir_size with
neon_fixtures.utils.get_dir_size() in next commit.
2024-09-17 06:53:18 +03:00
Heikki Linnakangas
3a52e356c1 Remove unused function (#9018) 2024-09-17 06:53:18 +03:00
Tristan Partin
5e16c7bb0b Generate pgbench data on the server for most tests
This should generally be faster when running tests, especially those
that run with higher scales.

Ignoring test_lfc_resize since it seems like we are hitting a query
timeout for some reason that I have yet to investigate. A little bit of
improvement is better than none.

Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-09-16 23:37:36 +01:00
Tristan Partin
5876c441ab Grant access to pg_show_replication_origin_status for neon_superuser
Signed-off-by: Tristan Partin <tristan@neon.tech>
2024-09-16 16:38:55 +01:00
Matthias van de Meent
78938d1b59 [compute/postgres] feature: PostgreSQL 17 (#8573)
This adds preliminary PG17 support to Neon, based on RC1 / 2024-09-04
07b828e9d4

NOTICE: The data produced by the included version of the PostgreSQL fork
may not be compatible with the future full release of PostgreSQL 17 due to
expected or unexpected future changes in magic numbers and internals.
DO NOT EXPECT DATA IN V17-TENANTS TO BE COMPATIBLE WITH THE 17.0
RELEASE!

Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>
Co-authored-by: Alexander Bayandin <alexander@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-09-12 23:18:41 +01:00
Heikki Linnakangas
8dc069037b Remove NeonEnvBuilder.start() function
It feels wrong to me to start() from the builder object. Surely the
thing you start is the environment itself, not its configuration.
2024-09-12 01:28:56 +03:00
Heikki Linnakangas
0a363c3dce Add --timeline-id option to "neon_local timeline branch" command
Makes it consistent with the "timeline create" and "timeline import"
commands, which allowed you to pass the timeline ID as an argument. This
also makes it unnecessary to parse the timeline ID from the output in
the Python function that calls it.
2024-09-12 01:28:56 +03:00
Heikki Linnakangas
aeca15008c Remove obsolete and misleading comment
The tenant ID was not actually generated here but in NeonEnvBuilder.
And the "neon_local init" command hasn't been able to generate the
initial tenant since 8712e1899e anyway.
2024-09-12 01:28:56 +03:00
Heikki Linnakangas
723c0971e8 Don't create 'empty' branch in neon_simple_env (#8965)
Now that we've given up hope on sharing the neon_simple_env between
tests, there's no reason not to use the 'main' branch directly.
2024-09-09 12:38:34 +03:00
Heikki Linnakangas
c8f67eed8f Remove TEST_SHARED_FIXTURES (#8965)
I wish it worked, but it's been broken for a long time, so let's admit
defeat and remove it.

The idea of sharing the same pageserver and safekeeper environment
between tests is still sound, and it could save a lot of time in our
CI. We should perhaps put some time into doing that, but we're better
off starting from scratch than trying to make TEST_SHARED_FIXTURES
work in its current form.
2024-09-09 12:38:34 +03:00
Joonas Koivunen
3dbd34aa78 feat(storcon): forward gc blocking and unblocking (#8956)
Currently, using gc blocking and unblocking with storage-controller-managed
pageservers is painful. Implement the API on the storage controller.

Fixes: #8893
2024-09-06 22:42:55 +01:00
Alex Chi Z.
ac5815b594 feat(storage-controller): add node shards api (#8896)
For control-plane managed tenants, we have the page in the admin console
that lists all tenants on a specific pageserver. But for
storage-controller managed ones, we don't have that functionality yet.

## Summary of changes

Adds an API that lists all shards on a given node (intention + observed)

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-09-06 14:14:21 -04:00
Arseny Sher
11cf16e3f3 safekeeper: add term_bump endpoint.
When walproposer observes a now-higher term, it restarts instead of
crashing the whole compute with a PANIC; this avoids a compute crash
after a term_bump call. After a successful election we still check the
last_log_term of the highest given vote to ensure the basebackup is
good, and PANIC otherwise.

It will be used for migration per
035-safekeeper-dynamic-membership-change.md
and
https://github.com/neondatabase/docs/pull/21

ref https://github.com/neondatabase/neon/issues/8700
2024-09-06 19:13:50 +03:00
Arseny Sher
e287f36a05 safekeeper: fix endpoint restart immediately after xlog switch.
Check that the truncation point is not from the future by comparing it
with write_record_lsn, not write_lsn, and explain that an xlog switch
changes their normal order.

ref https://github.com/neondatabase/neon/issues/8911
2024-09-06 18:09:21 +03:00
Vlad Lazar
e86fef05dd storcon: track preferred AZ for each tenant shard (#8937)
## Problem
We want to do AZ aware scheduling, but don't have enough metadata.

## Summary of changes
Introduce a `preferred_az_id` concept for each managed tenant shard.

In a future PR, the scheduler will use this as a soft preference.
The idea is to try and keep the shard attachments within the same AZ.
Under the assumption that the compute was placed in the correct AZ,
this reduces the chances of cross-AZ traffic between the compute and the pageserver.

In terms of code changes we:
1. Add a new nullable `preferred_az_id` column to the `tenant_shards`
table. Also include an in-memory counterpart.
2. Populate the preferred az on tenant creation and shard splits.
3. Add an endpoint that allows bulk-setting preferred AZs.

(3) gives us the migration path. I'll write a script which queries the
cplane db in the region and sets the preferred az of all shards with an 
active compute to the AZ of said compute. For shards without an active compute, 
I'll use the AZ of the currently attached pageserver
since this is what cplane uses now to schedule computes.
2024-09-06 13:11:17 +01:00
Vlad Lazar
04f99a87bf storcon: make pageserver AZ id mandatory (#8856)
## Problem
https://github.com/neondatabase/neon/pull/8852 introduced a new nullable
column for the `nodes` table: `availability_zone_id`

## Summary of changes
* Make neon local and the test suite always provide an az id
* Make the az id field in the ps registration request mandatory
* Migrate the column to non-nullable and adjust in memory state
accordingly
* Remove the code that was used to populate the az id for pre-existing
nodes
2024-09-05 19:14:21 +01:00
Joonas Koivunen
efe03d5a1c build: sync between benchies (#8919)
Sometimes, the benchmarks fail to start up the pageserver within 10s for
no obvious reason. Benchmarks run sequentially on otherwise idle runners.
Try running `sync(2)` after each bench to force a cleaner slate.

Implement this via:
- an autouse fixture gated on the SYNC_AFTER_EACH_TEST environment variable
- the autouse fixture seems to be the outermost fixture, so it works as expected
- set SYNC_AFTER_EACH_TEST=true for benchmarks in build_and_test
workflow

Evidence:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/10678984691/index.html#suites/5008d72a1ba3c0d618a030a938fc035c/1210266507534c0f/

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>
2024-09-05 14:29:48 +01:00
Christian Schwarz
850421ec06 refactor(pageserver): rely on serde derive for toml deserialization (#7656)
This PR simplifies the pageserver configuration parsing as follows:

* introduce the `pageserver_api::config::ConfigToml` type
* implement `Default` for `ConfigToml`
* use serde derive to do the brain-dead leg-work of processing the toml
document
  * use `serde(default)` to fill in default values
* in `pageserver` crate:
* use `toml_edit` to deserialize the pageserver.toml string into a
`ConfigToml`
  * `PageServerConfig::parse_and_validate` then
    * consumes the `ConfigToml`
    * destructures it exhaustively into its constituent fields
    * constructs the `PageServerConfig`

The rules are (sketched below):

* in `ConfigToml`, use `deny_unknown_fields` everywhere
* static default values go in `pageserver_api`
* if there cannot be a static default value (e.g. which default IO
engine to use, because it depends on the runtime), make the field in
`ConfigToml` an `Option`
* if runtime-augmentation of a value is needed, do that in
`parse_and_validate`
* a good example is `virtual_file_io_engine` or `l0_flush`, both of
which need to execute code to determine the effective value in
`PageServerConf`
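
A minimal sketch of these rules, with illustrative field names only (not the real `ConfigToml` definition):

```rust
use serde::Deserialize;

#[derive(Deserialize)]
#[serde(default, deny_unknown_fields)]
struct ConfigToml {
    // Static default, supplied by the Default impl below.
    listen_pg_addr: String,
    // No static default (depends on the runtime), so it stays an Option
    // and is resolved later in parse_and_validate.
    virtual_file_io_engine: Option<String>,
}

impl Default for ConfigToml {
    fn default() -> Self {
        ConfigToml {
            listen_pg_addr: "127.0.0.1:64000".to_string(),
            virtual_file_io_engine: None,
        }
    }
}

fn parse(pageserver_toml: &str) -> ConfigToml {
    // serde(default) fills in missing fields; deny_unknown_fields
    // rejects typos in the document.
    toml_edit::de::from_str(pageserver_toml).expect("invalid pageserver.toml")
}
```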

The benefits:

* a massive amount of brain-dead repetitive code can be deleted
* "unused variable" compile-time errors when removing a config value,
due to the exhaustive destructuring in `parse_and_validate`
* compile-time errors guide you when adding a new config field

Drawbacks:

* serde derive is sometimes a bit too magical
* `deny_unknown_fields` is easy to miss

Future Work / Benefits:
* make `neon_local` use `pageserver_api` to construct `ConfigToml` and
write it to `pageserver.toml`
* This provides more type safety / compile-time errors than the current
approach.

### Refs

Fixes #3682 

### Future Work

* `remote_storage` deser doesn't reject unknown fields
https://github.com/neondatabase/neon/issues/8915
* clean up `libs/pageserver_api/src/config.rs` further
  * break up into multiple files, at least for tenant config
* move `models` as appropriate / refine distinction between config and
API models / be explicit about when it's the same
  * use `pub(crate)` visibility on `mod defaults` to detect stale values
2024-09-05 14:59:49 +02:00
John Spray
1a9b54f1d9 storage controller: read from database in validate API (#8784)
## Problem

The initial implementation of the validate API treats the in-memory
generations as authoritative.
- This is true when only one storage controller is running, but if a
rogue controller was running that hadn't been shut down properly, and
some pageserver requests were routed to that bad controller, it could
incorrectly return valid=true for stale generations.
- The generation in the main in-memory map gets out of date while a live
migration is in flight, and if the origin location for the migration
tries to do some deletions even though it is in AttachedStale (for
example because it had already started compaction), these might be
wrongly validated + executed.

## Summary of changes

- Continue to do the in-memory check: if this returns valid=false it is
sufficient to reject requests.
- When valid=true, do an additional read from the database to confirm
the generation is fresh.
- Revise behavior for validation on missing shards: this used to always
return valid=true as a convenience for deletions and shard splits, so
that pageservers weren't prevented from completing any enqueued
deletions for these shards after they're gone. However, this becomes
unsafe when we consider split brain scenarios. We could reinstate this
in future if we wanted to store some tombstones for deleted shards.
- Update test_scrubber_physical_gc to cope with the behavioral change:
they must now explicitly flush the deletion queue before splits, to
avoid tripping up on deletions that are enqueued at the time of the
split (these tests assert "scrubber deletes nothing", a check that fails
if the split leaves behind some remote objects that are legitimately
GC'able)
- Add `test_storage_controller_validate_during_migration`, which uses
failpoints to create a situation where incorrect generation validation
during a live migration could result in a corruption

The rate of validate calls for tenants is pretty low: they happen as a
consequence of deletions from GC and compaction, which are both
concurrency-limited on the pageserver side.
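
A sketch of the resulting two-step check, with hypothetical stand-in types for the in-memory map and the database:

```rust
// Hypothetical stand-ins for the storage controller's real types.
struct InMemoryMap;
struct Database;
#[derive(Clone, Copy)]
struct Generation(u32);

impl InMemoryMap {
    fn is_latest(&self, _generation: Generation) -> bool {
        true
    }
}
impl Database {
    async fn is_latest(&self, _generation: Generation) -> bool {
        true
    }
}

/// The in-memory check can only *reject*; acceptance additionally
/// requires the database, the single source of truth, to agree.
async fn validate(mem: &InMemoryMap, db: &Database, generation: Generation) -> bool {
    if !mem.is_latest(generation) {
        return false; // fast path: definitely stale, no DB read needed
    }
    db.is_latest(generation).await
}
```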
2024-09-04 15:00:40 +01:00
Joonas Koivunen
7a1397cf37 storcon: boilerplate to upsert safekeeper records on deploy (#8879)
We currently do not record safekeepers in the storage controller
database. We want to migrate timelines across safekeepers eventually, so
start recording the safekeepers on deploy.

Cc: #8698
2024-09-04 10:10:05 +00:00
Vlad Lazar
75310fe441 storcon: make hb interval an argument and speed up tests (#8880)
## Problem
Each test might wait for up to 5s in order to heartbeat the pageserver.

## Summary of changes
Make the heartbeat interval configurable and use a really tight one for
neon_local => quicker startup
2024-09-04 10:09:41 +01:00
Arseny Sher
80512e2779 safekeeper: add endpoint resetting uploaded partial segment state.
The endpoint implementation sends a message to the manager requesting
the reset. The manager stops the current partial backup upload task, if
it exists, and performs the reset.

Also slightly tweak the eviction condition: all full segments before
flush_lsn must be uploaded (and committed) and there must be only one
segment left on disk (the partial one). This allows evicting timelines
which didn't start on the first segment and didn't fill a whole
segment (the previous condition wasn't good because last_removed_segno
was 0).

ref https://github.com/neondatabase/neon/issues/8759
2024-09-03 17:21:36 +03:00
Vlad Lazar
c43e664ff5 storcon: provide an az id in metadata.json from neon local (#8897)
## Problem
The neon local setup does not inject an AZ id in `metadata.json`. See
the real change in https://github.com/neondatabase/neon/pull/8852.

## Summary of changes
We piggyback on the existing `availability_zone` pageserver
configuration in order to avoid making neon local even more complex.
2024-09-03 15:11:30 +01:00
Christian Schwarz
bf0531d107 fixup(#8839): test_forward_compatibility needs to allow lag warning as well (#8891)
Found in
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8885/10665614629/index.html#suites/0fbaeb107ef328d03993d44a1fb15690/ea10ba1c140fba1d
2024-09-02 15:10:10 +01:00
Arpad Müller
9746b6ea31 Implement archival_config timeline endpoint in the storage controller (#8680)
Implement the timeline-specific `archival_config` endpoint in the
storage controller as well.

It's mostly a copy-paste of the detach handler; the task is the same: do
the same operation on all shards.

Part of #8088.
2024-09-02 13:51:45 +02:00
Arpad Müller
3ec785f30d Add safekeeper scrubber test (#8785)
The test is very rudimentary: it only checks that, before and after
tenant deletion, we can run `scan_metadata` for the safekeeper node
kind. Also, we don't actually expect any uploaded data, since we don't
generate enough WAL for that (it would need to create at least one
S3-uploaded file; the scrubber doesn't recognize partial files yet).

The `scan_metadata` scrubber subcommand is extended to support either
specifying a database connection string (previously the only way, and
one that required a database to be present) or specifying the timeline
information manually via JSON. The latter is ideal for testing
scenarios, where the number of timelines is usually limited but spinning
up a database just to write the timeline information is involved.
2024-08-31 01:12:25 +02:00
Arpad Müller
96b5c4d33d Don't unarchive a timeline if its ancestor is archived (#8853)
If a timeline unarchival request comes in, give an error if the parent
timeline is archived. This prevents the situation of having an
archived timeline with children that are not archived.

Follow up of #8824

Part of #8088

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
2024-08-29 12:54:02 +00:00
Conrad Ludgate
a644f01b6a proxy+pageserver: shared leaky bucket impl (#8539)
In proxy I switched to a leaky-bucket impl using the GCRA algorithm. I
figured I could share the code with pageserver and remove the
leaky_bucket crate dependency with some very basic tokio timers and
queues for fairness.

The underlying algorithm should be fairly clear how it works from the
comments I have left in the code.
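
For reference, the core of GCRA fits in a few lines. This is a toy sketch of the algorithm, not the shared implementation from this PR (which layers tokio timers and fair queuing on top):

```rust
use std::time::{Duration, Instant};

struct Gcra {
    /// Theoretical arrival time of the next conforming request.
    tat: Instant,
    /// Time one unit of work takes to "leak" out (1 / rate).
    emission_interval: Duration,
    /// How far ahead of real time `tat` may run (burst capacity).
    tolerance: Duration,
}

impl Gcra {
    fn new(emission_interval: Duration, burst: u32) -> Self {
        Gcra {
            tat: Instant::now(),
            tolerance: emission_interval * burst,
            emission_interval,
        }
    }

    /// Ok(()) admits the request; Err(d) tells the caller to wait `d`.
    fn check(&mut self, now: Instant) -> Result<(), Duration> {
        let tat = self.tat.max(now);
        if tat - now <= self.tolerance {
            self.tat = tat + self.emission_interval; // consume one slot
            Ok(())
        } else {
            Err((tat - self.tolerance) - now) // earliest conforming time
        }
    }
}
```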

---

In benchmarking pageserver, @problame found that the new implementation
fixes a getpage throughput discontinuity in pageserver under the
`pagebench get-page-latest-lsn` benchmark with the clickbench dataset
(`test_perf_olap.py`).
The discontinuity is that for any of `--num-clients={2,3,4}`, getpage
throughput remains at 10k.
With `--num-clients=5` and greater, getpage throughput then jumps to the
configured 20k rate limit.
With the changes in this PR, the discontinuity is gone, and we scale
throughput linearly with `--num-clients` up to the configured rate limit.

More context in
https://github.com/neondatabase/cloud/issues/16886#issuecomment-2315257641.

closes https://github.com/neondatabase/cloud/issues/16886

---------

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
2024-08-29 11:26:52 +00:00
Christian Schwarz
c2f8fdccd7 ingest: rate-limited warning if WAL commit timestamps lags for > wait_lsn_timeout (#8839)
refs https://github.com/neondatabase/cloud/issues/13750

The logging in this commit will make it easier to detect lagging ingest.

We're trusting compute timestamps --- ideally we'd use SK timestamps
instead. But trusting the compute timestamp is OK for now.
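
A generic sketch of such a rate-limited warning (illustrative names, not the actual ingest code):

```rust
use std::time::{Duration, Instant};

struct LagWarner {
    last_emitted: Option<Instant>,
    min_interval: Duration,
}

impl LagWarner {
    /// Warn at most once per `min_interval` when the commit-timestamp
    /// lag exceeds `wait_lsn_timeout`.
    fn maybe_warn(&mut self, lag: Duration, wait_lsn_timeout: Duration) {
        if lag <= wait_lsn_timeout {
            return;
        }
        let now = Instant::now();
        let due = self
            .last_emitted
            .map(|t| now - t >= self.min_interval)
            .unwrap_or(true);
        if due {
            eprintln!("WARN: ingest lags WAL commit timestamps by {lag:?}");
            self.last_emitted = Some(now);
        }
    }
}
```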
2024-08-29 12:06:00 +01:00
Andrew Rudenko
acc075071d feat(compute_ctl): add periodic lease lsn requests for static computes (#7994)
Part of #7497

## Problem

Static computes pinned at some fixed LSN may initially be created within
the PITR interval but eventually fall outside of it. To make sure that
static computes are not affected by GC, we need to start using the LSN
lease API (introduced in #8084) in compute_ctl.

## Summary of changes

**compute_ctl**
- Spawn a thread when a static compute starts that periodically pings
the pageserver(s) with LSN lease requests (sketched below, after this list).
- Add `test_readonly_node_gc` to test if static compute can read all
pages without error.
  - (test will fail on main without the code change here)

**page_service**
- `wait_or_get_last_lsn` will now allow `request_lsn` less than
`latest_gc_cutoff_lsn` to proceed if there is a lease on `request_lsn`.
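
A minimal sketch of the renewal loop, with hypothetical stand-ins for the pageserver client (the real code lives in compute_ctl):

```rust
use std::time::Duration;

// Hypothetical stand-ins for the compute_ctl pageserver client.
#[derive(Clone, Copy)]
struct Lsn(u64);
struct PageserverClient;
impl PageserverClient {
    fn request_lsn_lease(&self, _lsn: Lsn) -> Result<(), String> {
        Ok(())
    }
}

fn spawn_lease_renewal(client: PageserverClient, lsn: Lsn, lease_length: Duration) {
    std::thread::spawn(move || loop {
        // Renew well before the lease expires so GC never overtakes the
        // static compute's pinned LSN.
        if let Err(e) = client.request_lsn_lease(lsn) {
            eprintln!("lsn lease request failed, will retry: {e}");
        }
        std::thread::sleep(lease_length / 2);
    });
}
```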

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>
2024-08-28 19:09:26 +00:00
Christian Schwarz
9627747d35 bypass PageCache for InMemoryLayer + avoid Value::deser on L0 flush (#8537)
Part of [Epic: Bypass PageCache for user data
blocks](https://github.com/neondatabase/neon/issues/7386).

# Problem

`InMemoryLayer` still uses the `PageCache` for all data stored in the
`VirtualFile` that underlies the `EphemeralFile`.

# Background

Before this PR, `EphemeralFile` is a fancy (and code-bloated) buffered
writer around a `VirtualFile` that supports `blob_io`.

The `InMemoryLayerInner::index` stores offsets into the `EphemeralFile`.
At those offsets, we find a varint length followed by the serialized
`Value`.

Vectored reads (`get_values_reconstruct_data`) are not in fact vectored
- each `Value` that needs to be read is read sequentially.

The `will_init` bit of information which we use to early-exit the
`get_values_reconstruct_data` for a given key is stored in the
serialized `Value`, meaning we have to read & deserialize the `Value`
from the `EphemeralFile`.

The L0 flushing **also** needs to re-determine the `will_init` bit of
information, by deserializing each value during L0 flush.

# Changes

1. Store the value length and `will_init` information in the
`InMemoryLayer::index`. The `EphemeralFile` thus only needs to store the
values.
2. For `get_values_reconstruct_data`:
- Use the in-memory `index` to figure out which values need to be read.
Having the `will_init` stored in the index enables us to do that.
- View the EphemeralFile as a byte array of "DIO chunks", each 512 bytes
in size (adjustable constant). A "DIO chunk" is the minimal unit that we
can read under direct IO.
- Figure out which chunks need to be read to retrieve the serialized
bytes for the values we need to read.
- Coalesce chunk reads such that each DIO chunk is only read once to
serve all value reads that need data from that chunk (sketched below,
after this list).
- Merge adjacent chunk reads into larger
`EphemeralFile::read_exact_at_eof_ok` of up to 128k (adjustable
constant).
3. The new `EphemeralFile::read_exact_at_eof_ok` fills the IO buffer
from the underlying VirtualFile and/or its in-memory buffer.
4. The L0 flush code is changed to use the `index` directly instead of
going through `blob_io`.
5. We can remove the `ephemeral_file::page_caching` construct now.
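
A sketch of the chunk arithmetic from step 2 above, assuming the 512-byte chunk size; the names are illustrative, not the actual pageserver code:

```rust
const DIO_CHUNK_SIZE: u64 = 512;

/// For a value at byte range [offset, offset + len), return the range of
/// DIO chunks that must be read to reconstruct it.
fn chunks_for_value(offset: u64, len: u64) -> std::ops::Range<u64> {
    let first = offset / DIO_CHUNK_SIZE;
    let last = (offset + len - 1) / DIO_CHUNK_SIZE;
    first..last + 1
}

/// Coalesce chunk ranges so each chunk is read at most once and adjacent
/// chunks merge into one larger read (capped in the real code).
fn coalesce(mut ranges: Vec<std::ops::Range<u64>>) -> Vec<std::ops::Range<u64>> {
    ranges.sort_by_key(|r| r.start);
    let mut out: Vec<std::ops::Range<u64>> = Vec::new();
    for r in ranges {
        match out.last_mut() {
            Some(prev) if r.start <= prev.end => prev.end = prev.end.max(r.end),
            _ => out.push(r),
        }
    }
    out
}
```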

The `get_values_reconstruct_data` changes seem like a bit of overkill,
but they are necessary so that we issue an equivalent number of read
system calls compared to before this PR, where it was highly likely that
even if the first PageCache access was a miss, the remaining reads
within the same `get_values_reconstruct_data` call from the same
`EphemeralFile` page were hits.

The "DIO chunk" stuff is truly unnecessary for page cache bypass, but,
since we're working on [direct
IO](https://github.com/neondatabase/neon/issues/8130) and
https://github.com/neondatabase/neon/issues/8719 specifically, we need
to do _something_ like this anyways in the near future.

# Alternative Design

The original plan was to use the `vectored_blob_io` code, but it relies
on the invariant of Delta & Image layers that `index order == values order`.

Further, `vectored_blob_io` code's strategy for merging IOs is limited
to adjacent reads. However, with direct IO, there is another level of
merging that should be done, specifically, if multiple reads map to the
same "DIO chunk" (=alignment-requirement-sized and -aligned region of
the file), then it's "free" to read the chunk into an IO buffer and
serve the two reads from that buffer.
=> https://github.com/neondatabase/neon/issues/8719

# Testing / Performance

Correctness of the IO merging code is ensured by unit tests.

Additionally, minimal tests are added for the `EphemeralFile`
implementation and the bit-packed `InMemoryLayerIndexValue`.

Performance testing results are presented below.
All perf testing was done on my M2 MacBook Pro, running a Linux VM.
It's a release build without `--features testing`.

We see a definitive improvement in the ingest performance microbenchmark
and in an ad-hoc getpage microbenchmark against InMemoryLayer.

```
baseline: commit 7c74112b2a origin/main
HEAD: ef1c55c52e
```

<details>

```
cargo bench --bench bench_ingest -- 'ingest 128MB/100b seq, no delta'

baseline

ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [483.50 ms 498.73 ms 522.53 ms]
                        thrpt:  [244.96 MiB/s 256.65 MiB/s 264.73 MiB/s]

HEAD

ingest-small-values/ingest 128MB/100b seq, no delta
                        time:   [479.22 ms 482.92 ms 487.35 ms]
                        thrpt:  [262.64 MiB/s 265.06 MiB/s 267.10 MiB/s]
```

</details>

We don't have a micro-benchmark for InMemoryLayer and it's quite
cumbersome to add one. So, I did manual testing in `neon_local`.

<details>

```

  ./target/release/neon_local stop
  rm -rf .neon
  ./target/release/neon_local init
  ./target/release/neon_local start
  ./target/release/neon_local tenant create --set-default
  ./target/release/neon_local endpoint create foo
  ./target/release/neon_local endpoint start foo
  psql 'postgresql://cloud_admin@127.0.0.1:55432/postgres'
psql (13.16 (Debian 13.16-0+deb11u1), server 15.7)

CREATE TABLE wal_test (
    id SERIAL PRIMARY KEY,
    data TEXT
);

DO $$
DECLARE
    i INTEGER := 1;
BEGIN
    WHILE i <= 500000 LOOP
        INSERT INTO wal_test (data) VALUES ('data');
        i := i + 1;
    END LOOP;
END $$;

-- => result is one L0 from initdb and one 137M-sized ephemeral-2

DO $$
DECLARE
    i INTEGER := 1;
    random_id INTEGER;
    random_record wal_test%ROWTYPE;
    start_time TIMESTAMP := clock_timestamp();
    selects_completed INTEGER := 0;
    min_id INTEGER := 1;  -- Minimum ID value
    max_id INTEGER := 100000;  -- Maximum ID value, based on your insert range
    iters INTEGER := 100000000;  -- Number of iterations to run
BEGIN
    WHILE i <= iters LOOP
        -- Generate a random ID within the known range
        random_id := min_id + floor(random() * (max_id - min_id + 1))::int;

        -- Select the row with the generated random ID
        SELECT * INTO random_record
        FROM wal_test
        WHERE id = random_id;

        -- Increment the select counter
        selects_completed := selects_completed + 1;

        -- Check if a second has passed
        IF EXTRACT(EPOCH FROM clock_timestamp() - start_time) >= 1 THEN
            -- Print the number of selects completed in the last second
            RAISE NOTICE 'Selects completed in last second: %', selects_completed;

            -- Reset counters for the next second
            selects_completed := 0;
            start_time := clock_timestamp();
        END IF;

        -- Increment the loop counter
        i := i + 1;
    END LOOP;
END $$;

./target/release/neon_local stop

baseline: commit 7c74112b2a origin/main

NOTICE:  Selects completed in last second: 1864
NOTICE:  Selects completed in last second: 1850
NOTICE:  Selects completed in last second: 1851
NOTICE:  Selects completed in last second: 1918
NOTICE:  Selects completed in last second: 1911
NOTICE:  Selects completed in last second: 1879
NOTICE:  Selects completed in last second: 1858
NOTICE:  Selects completed in last second: 1827
NOTICE:  Selects completed in last second: 1933

ours

NOTICE:  Selects completed in last second: 1915
NOTICE:  Selects completed in last second: 1928
NOTICE:  Selects completed in last second: 1913
NOTICE:  Selects completed in last second: 1932
NOTICE:  Selects completed in last second: 1846
NOTICE:  Selects completed in last second: 1955
NOTICE:  Selects completed in last second: 1991
NOTICE:  Selects completed in last second: 1973
```

NB: the ephemeral file sizes differ by ca. 1 MiB, ours being 1 MiB smaller.

</details>

# Rollout

This PR changes the code in place and is not gated by a feature flag.
2024-08-28 18:31:41 +00:00
Vlad Lazar
793b5061ec storcon: track pageserver availability zone (#8852)
## Problem
In order to build AZ aware scheduling, the storage controller needs to
know what AZ pageservers are in.

Related https://github.com/neondatabase/neon/issues/8848

## Summary of changes
This patch set adds a new nullable column to the `nodes` table:
`availability_zone_id`. The node registration
request is extended to include the AZ id (pageservers already have this
in their `metadata.json` file).

If the node is already registered, then we update the persistent and
in-memory state with the provided AZ.
Otherwise, we add the node with the AZ to begin with.

A couple assumptions are made here:
1. Pageserver AZ ids are stable
2. AZ ids do not change over time

Once all pageservers have a configured AZ, we can remove the optionals
in the code and make the database column not nullable.
2024-08-28 18:23:55 +01:00
Yuchen Liang
a889a49e06 pageserver: do vectored read on each dio-aligned section once (#8763)
Part of #8130, closes #8719.

## Problem

Currently, vectored blob io only coalesces blocks if they are immediately
adjacent to each other. When we switch to Direct IO, we need a way to
coalesce blobs that are within the dio-aligned boundary but have gaps
between them.

## Summary of changes

- Introduces a `VectoredReadCoalesceMode` for `VectoredReadPlanner` and
`StreamingVectoredReadPlanner` which has two modes:
  - `AdjacentOnly` (current implementation)
  - `Chunked(<alignment requirement>)`
- New `ChunkedVectorBuilder` that considers batching `dio-align`-sized
reads; the start and end of the vectored read will respect
`stx_dio_offset_align` / `stx_dio_mem_align` (`vectored_read.start` and
`vectored_read.blobs_at.first().start_offset` will be two different
values; see the sketch below).
- Since we break the assumption that blobs within single `VectoredRead`
are next to each other (implicit end offset), we start to store blob end
offsets in the `VectoredRead`.
- Adapted existing tests to run in both `VectoredReadCoalesceMode`s.
- The io alignment can also be live-configured at runtime.
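
For illustration, respecting the offset alignment amounts to rounding the read start down to the alignment boundary; a tiny sketch with assumed names, not the builder's actual code:

```rust
/// Round a file offset down to the direct-IO alignment boundary. E.g.
/// align_down(1300, 512) == 1024, which is why `vectored_read.start`
/// can differ from the first blob's start_offset.
fn align_down(offset: u64, align: u64) -> u64 {
    debug_assert!(align.is_power_of_two());
    offset & !(align - 1)
}
```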

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
2024-08-28 15:54:42 +01:00
Joonas Koivunen
c0ba18a112 bench: flush before shutting down (#8844)
while driving by:
- remove the extra tenant
- remove the extra timelines

Implement this by turning pg_compare into a yielding fixture.

evidence:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/10571779162/index.html#suites/9681106e61a1222669b9d22ab136d07b/3bbe9f007b3ffae1/
2024-08-28 10:20:43 +01:00
Heikki Linnakangas
c5ef779801 tests: Remove unnecessary entries from list of allowed errors (#8199)
The "manual_gc" context was removed in commit be0c73f8e7. The code that
generated the other error was removed in commit 9a6c0be823.
2024-08-27 17:47:05 +01:00