rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-07-07 14:10:43 +00:00

Author	SHA1	Message	Date
Dmitry Rodionov	d53f9ab3eb	delete timelines from s3 (#4384 ) Delete data from s3 when timeline deletion is requested ## Summary of changes UploadQueue is altered to support scheduling of delete operations in stopped state. This looks weird, and I'm thinking whether there are better options/refactorings for upload client to make it look better. Probably can be part of https://github.com/neondatabase/neon/issues/4378 Deletion is implemented directly in existing endpoint because changes are not that significant. If we want more safety we can separate those or create feature flag for new behavior. resolves [#4193](https://github.com/neondatabase/neon/issues/4193) --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-06-08 15:01:22 +03:00
Dmitry Rodionov	8560a98d68	fix openapi spec to pass swagger editor validation (#4445 ) There shouldnt be a dash before `type: object`. Also added description.	2023-06-08 13:25:30 +03:00
Alex Chi Z	2e687bca5b	refactor: use LayerDesc in layer map (part 1) (#4408 ) ## Problem part of https://github.com/neondatabase/neon/issues/4392 ## Summary of changes This PR adds a new HashMap that maps persistent layer desc to the layer object inside LayerMap. Originally I directly went towards adding such layer cache in Timeline, but the changes are too many and cannot be reviewed as a reasonably-sized PR. Therefore, we take this intermediate step to change part of the codebase to use persistent layer desc, and come up with other PRs to move this hash map of layer desc to the timeline struct. Also, file_size is now part of the layer desc. --------- Signed-off-by: Alex Chi <iskyzh@gmail.com> Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>	2023-06-07 18:28:18 +03:00
Dmitry Rodionov	1a1019990a	map TenantState::Broken to TenantAttachmentStatus::Failed (#4371 ) ## Problem Attach failures are not reported in public part of the api (in `attachment_status` field of TenantInfo). ## Summary of changes Expose TenantState::Broken as TenantAttachmentStatus::Failed In the way its written Failed status will be reported even if no attachment happened. (I e if tenant become broken on startup). This is in line with other members. I e Active will be resolved to Attached even if no actual attach took place. This can be tweaked if needed. At the current stage it would be overengineering without clear motivation resolves #4344	2023-06-07 18:25:30 +03:00
Alex Chi Z	1c200bd15f	fix: break dev dependencies between wal_craft and pg_ffi (#4424 ) ## Problem close https://github.com/neondatabase/neon/issues/4266 ## Summary of changes With this PR, rust-analyzer should be able to give lints and auto complete in `mod tests`, and this makes writing tests easier. Previously, rust-analyzer cannot do auto completion. --------- Signed-off-by: Alex Chi <iskyzh@gmail.com>	2023-06-07 17:51:13 +03:00
Arseny Sher	37bf2cac4f	Persist safekeeper control file once in a while. It should make remote_consistent_lsn commonly up-to-date on non actively writing projects, which removes spike or pageserver -> safekeeper reconnections on storage nodes restart.	2023-06-07 17:23:37 +04:00
Joonas Koivunen	5761190e0d	feat: three phased startup order (#4399 ) Initial logical size calculation could still hinder our fast startup efforts in #4397. See #4183. In deployment of 2023-06-06 about a 200 initial logical sizes were calculated on hosts which took the longest to complete initial load (12s). Implements the three step/tier initialization ordering described in #4397: 1. load local tenants 2. do initial logical sizes per walreceivers for 10s 3. background tasks Ordering is controlled by: - waiting on `utils::completion::Barrier`s on background tasks - having one attempt for each Timeline to do initial logical size calculation - `pageserver/src/bin/pageserver.rs` releasing background jobs after timeout or completion of initial logical size calculation The timeout is there just to safeguard in case a legitimate non-broken timeline initial logical size calculation goes long. The timeout is configurable, by default 10s, which I think would be fine for production systems. In the test cases I've been looking at, it seems that these steps are completed as fast as possible. Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-06-07 14:29:23 +03:00
Vadim Kharitonov	88f0cfc575	Fix `pgx_ulid` extension (#4431 ) The issue was in the wrong `control` file name	2023-06-07 11:41:53 +02:00
Arseny Sher	6b3c020cd9	Don't warn on system id = 0 in walproposer greeting. sync-safekeepers doesn't know it and sends 0.	2023-06-07 12:39:20 +04:00
Arseny Sher	c058e1cec2	Quick exit in truncate_wal if nothing to do. ref https://github.com/neondatabase/neon/issues/4414	2023-06-07 12:39:20 +04:00
Arseny Sher	dc6a382873	Increase timeouts on compute -> sk connections. context: https://github.com/neondatabase/neon/issues/4414 And improve messages/comments here and there.	2023-06-07 12:39:20 +04:00
Heikki Linnakangas	df3bae2ce3	Use `compute_ctl` to manage Postgres in tests. (#3886 ) This adds test coverage for 'compute_ctl', as it is now used by all the python tests. There are a few differences in how 'compute_ctl' is called in the tests, compared to the real web console: - In the tests, the postgresql.conf file is included as one large string in the spec file, and it is written out as it is to the data directory. I added a new field for that to the spec file. The real web console, however, sets all the necessary settings in the 'settings' field, and 'compute_ctl' creates the postgresql.conf from those settings. - In the tests, the information needed to connect to the storage, i.e. tenant_id, timeline_id, connection strings to pageserver and safekeepers, are now passed as new fields in the spec file. The real web console includes them as the GUCs in the 'settings' field. (Both of these are different from what the test control plane used to do: It used to write the GUCs directly in the postgresql.conf file). The plan is to change the control plane to use the new method, and remove the old method, but for now, support both. Some tests that were sensitive to the amount of WAL generated needed small changes, to accommodate that compute_ctl runs the background health monitor which makes a few small updates. Also some tests shut down the pageserver, and now that the background health check can run some queries while the pageserver is down, that can produce a few extra errors in the logs, which needed to be allowlisted. Other changes: - remove obsolete comments about PostgresNode; - create standby.signal file for Static compute node; - log output of `compute_ctl` and `postgres` is merged into `endpoints/compute.log`. --------- Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>	2023-06-06 14:59:36 +01:00
Joonas Koivunen	0cef7e977d	refactor: just one way to shutdown a tenant (#4407 ) We have 2 ways of tenant shutdown, we should have just one. Changes are mostly mechanical simple refactorings. Added `warn!` on the "shutdown all remaining tasks" should trigger test failures in the between time of not having solved the "tenant/timeline owns all spawned tasks" issue. Cc: #4327.	2023-06-06 15:30:55 +03:00
Joonas Koivunen	18a9d47f8e	test: restore NotConnected being allowed globally (#4426 ) Flakyness introduced by #4402 evidence [^1]. I had assumed the NotConnected would had been an expected io error, but it's not. Restore the global `allowed_error`. [^1]: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4407/5185897757/index.html#suites/82004ab4e3720b47bf78f312dabe7c55/14f636d0ecd3939d/	2023-06-06 13:51:39 +03:00
Sasha Krassovsky	ac11e7c32d	Remove arch-specific stuff from HNSW extension (#4423 )	2023-06-05 22:04:15 -08:00
Konstantin Knizhnik	8e1b5e1224	Remove -ftree-vectorizer-verbose=0 option notrecognized by MaxOS/X c… (#4412 ) …ompiler ## Problem ## Summary of changes ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist	2023-06-05 20:10:19 +03:00
Joonas Koivunen	e0bd81ce1f	test: fix flaky warning on attach (#4415 ) added the `allowed_error` to the `positive_env` so any tests completing the attach are allowed have this print out. they are allowed to do so, because the `random_init_delay` can produce close to zero and thus the first run will be near attach. Though... Unsure if we ever really need the eviction task to run before it can evict something, as in after 20min or 24h. in the failed test case however period is 20s so interesting that we didn't run into this sooner. evidence of flaky: https://github.com/neondatabase/neon/actions/runs/5175677035/jobs/9323705929?pr=4399#step:4:38536	2023-06-05 18:12:58 +03:00
Joonas Koivunen	77598f5d0a	Better walreceiver logging (#4402 ) walreceiver logs are a bit hard to understand because of partial span usage, extra messages, ignored errors popping up as huge stacktraces. Fixes #3330 (by spans, also demote info -> debug). - arrange walreceivers spans into a hiearchy: - `wal_connection_manager{tenant_id, timeline_id}` -> `connection{node_id}` -> `poller` - unifies the error reporting inside `wal_receiver`: - All ok errors are now `walreceiver connection handling ended: {e:#}` - All unknown errors are still stacktraceful task_mgr reported errors with context `walreceiver connection handling failure` - Remove `connect` special casing, was: `DB connection stream finished` for ok errors - Remove `done replicating` special casing, was `Replication stream finished` for ok errors - lowered log levels for (non-exhaustive list): - `WAL receiver manager started, connecting to broker` (at startup) - `WAL receiver shutdown requested, shutting down` (at shutdown) - `Connection manager loop ended, shutting down` (at shutdown) - `sender is dropped while join handle is still alive` (at lucky shutdown, see #2885) - `timeline entered terminal state {:?}, stopping wal connection manager loop` (at shutdown) - `connected!` (at startup) - `Walreceiver db connection closed` (at disconnects?, was without span) - `Connection cancelled` (at shutdown, was without span) - `observed timeline state change, new state is {new_state:?}` (never after Timeline::activate was made infallible) - changed: - `Timeline dropped state updates sender, stopping wal connection manager loop` - was out of date; sender is not dropped but `Broken \| Stopping` state transition - also made `debug!` - `Timeline dropped state updates sender before becoming active, stopping wal connection manager loop` - was out of date: sender is again not dropped but `Broken \| Stopping` state transition - also made `debug!` - log fixes: - stop double reporting panics via JoinError	2023-06-05 17:35:23 +03:00
Joonas Koivunen	8142edda01	test: Less flaky gc (#4416 ) Solves a flaky test error in the wild[^1] by: - Make the gc shutdown signal reading an `allowed_error` - Note the gc shutdown signal readings as being in `allowed_error`s - Allow passing tenant conf to init_start to avoid unncessary tenants [^1]: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4399/5176432780/index.html#suites/b97efae3a617afb71cb8142f5afa5224/2cd76021ea011f93	2023-06-05 15:43:52 +03:00
Vadim Kharitonov	b9871158ba	Compile PGX ULID extension (#4413 ) Create pgx_ulid extension ``` postgres=# create extension ulid; CREATE EXTENSION postgres=# CREATE TABLE users ( id ulid NOT NULL DEFAULT gen_ulid() PRIMARY KEY, name text NOT NULL ); CREATE TABLE postgres=# insert into users (name) values ('vadim'); INSERT 0 1 postgres=# select * from users; id \| name ----------------------------+------- 01H25DDG3KYMYZTNR41X38E256 \| vadim ```	2023-06-05 12:52:13 +03:00
Joonas Koivunen	8caef2c0c5	fix: delay `eviction_task` as well (#4397 ) As seen on deployment of 2023-06-01 release, times were improving but there were some outliers caused by: - timelines `eviction_task` starting while activating and running imitation - timelines `initial logical size` calculation This PR fixes it so that `eviction_task` is delayed like other background tasks fixing an oversight from earlier #4372. After this PR activation will be two phases: 1. load and activate tenants AND calculate some initial logical sizes 2. rest of initial logical sizes AND background tasks - compaction, gc, disk usage based eviction, timelines `eviction_task`, consumption metrics	2023-06-05 09:37:53 +03:00
Konstantin Knizhnik	04542826be	Add HNSW extension (#4227 ) ## Describe your changes Port HNSW implementation for ANN search top Postgres ## Issue ticket number and link https://www.pinecone.io/learn/hnsw ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist	2023-06-04 11:41:38 +03:00
bojanserafimov	4ba950a35a	Add libcurl as dependency to readme (#4405 )	2023-06-02 18:07:45 -04:00
Joonas Koivunen	a55c663848	chore: comment marker fixes (#4406 ) Upgrading to rust 1.70 will require these.	2023-06-02 21:03:12 +03:00
Heikki Linnakangas	9787227c35	Shield HTTP request handlers from async cancellations. (#4314 ) We now spawn a new task for every HTTP request, and wait on the JoinHandle. If Hyper drops the Future, the spawned task will keep running. This protects the rest of the pageserver code from unexpected async cancellations. This creates a CancellationToken for each request and passes it to the handler function. If the HTTP request is dropped by the client, the CancellationToken is signaled. None of the handler functions make use for the CancellationToken currently, but they now they could. The CancellationToken arguments also work like documentation. When you're looking at a function signature and you see that it takes a CancellationToken as argument, it's a nice hint that the function might run for a long time, and won't be async cancelled. The default assumption in the pageserver is now that async functions are not cancellation-safe anyway, unless explictly marked as such, but this is a nice extra reminder. Spawning a task for each request is OK from a performance point of view because spawning is very cheap in Tokio, and none of our HTTP requests are very performance critical anyway. Fixes issue #3478	2023-06-02 08:28:13 -04:00
Joonas Koivunen	ef80a902c8	pg_sni_router: add session_id to more messages (#4403 ) See superceded #4390. - capture log in test - expand the span to cover init and error reporting - remove obvious logging by logging only unexpected	2023-06-02 14:59:10 +03:00
Alex Chi Z	66cdba990a	refactor: use PersistentLayerDesc for persistent layers (#4398 ) ## Problem Part of https://github.com/neondatabase/neon/issues/4373 ## Summary of changes This PR adds `PersistentLayerDesc`, which will be used in LayerMap mapping and probably layer cache. After this PR and after we change LayerMap to map to layer desc, we can safely drop RemoteLayerDesc. --------- Signed-off-by: Alex Chi <iskyzh@gmail.com> Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>	2023-06-01 22:06:28 +03:00
Alex Chi Z	82484e8241	pgserver: add more metrics for better observability (#4323 ) ## Problem This PR includes doc changes to the current metrics as well as adding new metrics. With the new set of metrics, we can quantitatively analyze the read amp., write amp. and space amp. in the system, when used together with https://github.com/neondatabase/neonbench close https://github.com/neondatabase/neon/issues/4312 ref https://github.com/neondatabase/neon/issues/4347 compaction metrics TBD, a novel idea is to print L0 file number and number of layers in the system, and we can do this in the future when we start working on compaction. ## Summary of changes * Add `READ_NUM_FS_LAYERS` for computing read amp. * Add `MATERIALIZED_PAGE_CACHE_HIT_UPON_REQUEST`. * Add `GET_RECONSTRUCT_DATA_TIME`. GET_RECONSTRUCT_DATA_TIME + RECONSTRUCT_TIME + WAIT_LSN_TIME should be approximately total time of reads. * Add `5.0` and `10.0` to `STORAGE_IO_TIME_BUCKETS` given some fsync runs slow (i.e., > 1s) in some cases. * Some `WAL_REDO` metrics are only used when Postgres is involved in the redo process. --------- Signed-off-by: Alex Chi <iskyzh@gmail.com>	2023-06-01 21:46:04 +03:00
Joonas Koivunen	36fee50f4d	compute_ctl: enable tracing panic hook (#4375 ) compute_ctl can panic, but `tracing` is used for logging. panic stderr output can interleave with messages from normal logging. The fix is to use the established way (pageserver, safekeeper, storage_broker) of using `tracing` to report panics.	2023-06-01 20:12:07 +03:00
bojanserafimov	330083638f	Fix stale and misleading comment in LayerMap (#4297 )	2023-06-01 05:04:46 +03:00
Konstantin Knizhnik	952d6e43a2	Add pageserver parameter forced_image_creation_limit which can be used… (#4353 ) This parameter can be use to restrict number of image layers generated because of GC request (wanted image layers). Been set to zero it completely eliminates creation of such image layers. So it allows to avoid extra storage consumption after merging #3673 ## Problem PR #3673 forces generation of missed image layers. So i short term is cause cause increase (in worst case up to two times) size of storage. It was intended (by me) that GC period is comparable with PiTR interval. But looks like it is not the case now - GC is performed much more frequently. It may cause the problem with space exhaustion: GC forces new image creation while large PiTR still prevent GC from collecting old layers. ## Summary of changes Add new pageserver parameter` forced_image_creation_limit` which restrict number of created image layers which are requested by GC. ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist	2023-05-31 21:37:20 +03:00
bojanserafimov	b6447462dc	Fix layer map correctness bug (#4342 )	2023-05-31 12:23:00 -04:00
Dmitry Rodionov	b190c3e6c3	reduce flakiness by allowing Compaction failed, retrying in X queue is in state Stopped. (#4379 ) resolves https://github.com/neondatabase/neon/issues/4374 by adding the error to allowed_errors	2023-05-30 20:11:44 +03:00
Joonas Koivunen	f4db85de40	Continued startup speedup (#4372 ) Startup continues to be slow, work towards to alleviate it. Summary of changes: - pretty the functional improvements from #4366 into `utils::completion::{Completion, Barrier}` - extend "initial load completion" usage up to tenant background tasks - previously only global background tasks - spawn_blocking the tenant load directory traversal - demote some logging - remove some unwraps - propagate some spans to `spawn_blocking` Runtime effects should be major speedup to loading, but after that, the `BACKGROUND_RUNTIME` will be blocked for a long time (minutes). Possible follow-ups: - complete initial tenant sizes before allowing background tasks to block the `BACKGROUND_RUNTIME`	2023-05-30 16:25:07 +03:00
Arthur Petukhovsky	210be6b6ab	Replace broker duration logs with metrics (#4370 ) I've added logs for broker push duration after every iteration in https://github.com/neondatabase/neon/pull/4142. This log has not found any real issues, so we can replace it with metrics, to slightly reduce log volume. LogQL query found that pushes longer that 500ms happened only 90 times for the last month. https://neonprod.grafana.net/goto/KTNj9UwVg?orgId=1 `{unit="safekeeper.service"} \|= "timeline updates to broker in" \| regexp "to broker in (?P<duration>.*)" \| duration > 500ms`	2023-05-30 16:08:02 +03:00
Alexander Bayandin	daa79b150f	Code Coverage: store lcov report (#4358 ) ## Problem In the future, we want to compare code coverage on a PR with coverage on the main branch. Currently, we store only code coverage HTML reports, I suggest we start storing reports in "lcov info" format that we can use/parse in the future. Currently, the file size is ~7Mb (it's a text-based format and could be compressed into a ~400Kb archive) - More about "lcov info" format: https://manpages.ubuntu.com/manpages/jammy/man1/geninfo.1.html#files - Part of https://github.com/neondatabase/neon/issues/3543 ## Summary of changes - Change `scripts/coverage` to output lcov coverage to `report/lcov.info` file instead of stdout (we already upload the whole `report/` directory to S3)	2023-05-30 14:05:41 +01:00
Joonas Koivunen	db14355367	revert: static global init logical size limiter (#4368 ) added in #4366. revert for testing without it; it may have unintenteded side-effects, and it's very difficult to understand the results from the 10k load testing environments. earlier results: https://github.com/neondatabase/neon/pull/4366#issuecomment-1567491064	2023-05-30 10:40:37 +03:00
Joonas Koivunen	cb83495744	try: startup speedup (#4366 ) Startup can take a long time. We suspect it's the initial logical size calculations. Long term solution is to not block the tokio executors but do most of I/O in spawn_blocking. See: #4025, #4183 Short-term solution to above: - Delay global background tasks until initial tenant loads complete - Just limit how many init logical size calculations can we have at the same time to `cores / 2` This PR is for trying in staging.	2023-05-29 21:48:38 +03:00
Christian Schwarz	f4f300732a	refactor TenantState transitions (#4321 ) This is preliminary work for/from #4220 (async `Layer::get_value_reconstruct_data`). The motivation is to avoid locking `Tenant::timelines` in places that can't be `async`, because in #4333 we want to convert Tenant::timelines from `std::sync::Mutex` to `tokio::sync::Mutex`. But, the changes here are useful in general because they clean up & document tenant state transitions. That also paves the way for #4350, which is an alternative to #4333 that refactors the pageserver code so that we can keep the `Tenant::timelines` mutex sync. This patch consists of the following core insights and changes: * spawn_load and spawn_attach own the tenant state until they're done * once load()/attach() calls are done ... * if they failed, transition them to Broken directly (we know that there's no background activity because we didn't call activate yet) * if they succeed, call activate. We can make it infallible. How? Later. * set_broken() and set_stopping() are changed to wait for spawn_load() / spawn_attach() to finish. * This sounds scary because it might hinder detach or shutdown, but actually, concurrent attach+detach, or attach+shutdown, or load+shutdown, or attach+shutdown were just racy before this PR. So, with this change, they're not anymore. In the future, we can add a `CancellationToken` stored in Tenant to cancel `load` and `attach` faster, i.e., make `spawn_load` / `spawn_attach` transition them to Broken state sooner. See the doc comments on TenantState for the state transitions that are now possible. It might seem scary, but actually, this patch reduces the possible state transitions. We introduce a new state `TenantState::Activating` to avoid grabbing the `Tenant::timelines` lock inside the `send_modify` closure. These were the humble beginnings of this PR (see Motivation section), and I think it's still the right thing to have this `Activating` state, even if we decide against async `Tenant::timelines` mutex. The reason is that `send_modify` locks internally, and by moving locking of Tenant::timelines out of the closure, the internal locking of `send_modify` becomes a leaf of the lock graph, and so, we eliminate deadlock risk. Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-05-29 17:52:41 +03:00
Em Sharnoff	ccf653c1f4	re-enable file cache integration for VM compute node (#4338 ) #4155 inadvertently switched to a version of the VM builder that leaves the file cache integration disabled by default. This re-enables the vm-informant's file cache integration. (as a refresher: The vm-informant is the autoscaling component that sits inside the VM and manages postgres / compute_ctl) See also: https://github.com/neondatabase/autoscaling/pull/265	2023-05-28 10:22:45 -07:00
Heikki Linnakangas	2d6a022bb8	Don't allow two timeline_delete operations to run concurrently. (#4313 ) If the timeline is already being deleted, return an error. We used to notice the duplicate request and error out in persist_index_part_with_deleted_flag(), but it's better to detect it earlier. Add an explicit lock for the deletion. Note: This doesn't do anything about the async cancellation problem (github issue #3478): if the original HTTP request dropped, because the client disconnected, the timeline deletion stops half-way through the operation. That needs to be fixed, too, but that's a separate story. (This is a simpler replacement for PR #4194. I'm also working on the cancellation shielding, see PR #4314.)	2023-05-27 15:55:43 +03:00
Heikki Linnakangas	2cdf07f12c	Refactor RequestSpan into a function. Previously, you used it like this: \|r\| RequestSpan(my_handler).handle(r) But I don't see the point of the RequestSpan struct. It's just a wrapper around the handler function. With this commit, the call becomes: \|r\| request_span(r, my_handler) Which seems a little simpler. At first I thought that the RequestSpan struct would allow "chaining" other kinds of decorators like RequestSpan, so that you could do something like this: \|r\| CheckPermissions(RequestSpan(my_handler)).handle(r) But it doesn't work like that. If each of those structs wrap a handler function, it would actually look like this: \|r\| CheckPermissions(\|r\| RequestSpan(my_handler).handle(r))).handle(r) This commit doesn't make that kind of chaining any easier, but seems a little more straightforward anyway.	2023-05-27 11:47:22 +03:00
Heikki Linnakangas	200a520e6c	Minor refactoring in RequestSpan Require the error type to be ApiError. It implicitly required that anyway, because the function used error::handler, which downcasted the error to an ApiError. If the error was in fact anything else than ApiError, it would just panic. Better to check it at compilation time. Also make the last-resort error handler more forgiving, so that it returns an 500 Internal Server error response, instead of panicking, if a request handler returns some other error than an ApiError.	2023-05-27 11:47:22 +03:00
Alex Chi Z	4e359db4c7	pgserver: spawn_blocking in compaction (#4265 ) Compaction is usually a compute-heavy process and might affect other futures running on the thread of the compaction. Therefore, we add `block_in_place` as a temporary solution to avoid blocking other futures on the same thread as compaction in the runtime. As we are migrating towards a fully-async-style pageserver, we can revert this change when everything is async and when we move compaction to a separate runtime. --------- Signed-off-by: Alex Chi <iskyzh@gmail.com>	2023-05-26 17:15:47 -04:00
Joonas Koivunen	be177f82dc	Revert "Allow for higher s3 concurrency (#4292 )" (#4356 ) This reverts commit `024109fbeb` for it failing to be speed up anything, but run into more errors. See: #3698.	2023-05-26 18:37:17 +03:00
Alexander Bayandin	339a3e3146	GitHub Autocomment: comment commits for branches (#4335 ) ## Problem GitHub Autocomment script posts a comment only for PRs. It's harder to debug failed tests on main or release branches. ## Summary of changes - Change the GitHub Autocomment script to be able to post a comment to either a PR or a commit of a branch	2023-05-26 14:49:42 +01:00
Heikki Linnakangas	a560b28829	Make new tenant/timeline IDs mandatory in create APIs. (#4304 ) We used to generate the ID, if the caller didn't specify it. That's bad practice, however, because network is never fully reliable, so it's possible we create a new tenant but the caller doesn't know about it, and because it doesn't know the tenant ID, it has no way of retrying or checking if it succeeded. To discourage that, make it mandatory. The web control plane has not relied on the auto-generation for a long time.	2023-05-26 16:19:36 +03:00
Joonas Koivunen	024109fbeb	Allow for higher s3 concurrency (#4292 ) We currently have a semaphore based rate limiter which we hope will keep us under S3 limits. However, the semaphore does not consider time, so I've been hesitant to raise the concurrency limit of 100. See #3698. The PR Introduces a leaky-bucket based rate limiter instead of the `tokio::sync::Semaphore` which will allow us to raise the limit later on. The configuration changes are not contained here.	2023-05-26 13:35:50 +03:00
Alexander Bayandin	2b25f0dfa0	Fix flakiness of test_metric_collection (#4346 ) ## Problem Test `test_metric_collection` become flaky: ``` AssertionError: assert not ['2023-05-25T14:03:41.644042Z ERROR metrics_collection: failed to send metrics: reqwest::Error { kind: Request, url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("localhost")), port: Some(18022), path: "/billing/api/v1/usage_events", query: None, fragment: None }, source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 99, kind: AddrNotAvailable, message: "Cannot assign requested address" })) }', ...] ``` I suspect it is caused by having 2 places when we define `httpserver_listen_address` fixture (which is internally used by `pytest-httpserver` plugin) ## Summary of changes - Remove the definition of `httpserver_listen_address` from `test_runner/regress/test_ddl_forwarding.py` and keep one in `test_runner/fixtures/neon_fixtures.py` - Also remote unused `httpserver_listen_address` parameter from `test_proxy_metric_collection`	2023-05-26 00:05:11 +03:00
Christian Schwarz	057cceb559	refactor: make timeline activation infallible (#4319 ) Timeline::activate() was only fallible because `launch_wal_receiver` was. `launch_wal_receiver` was fallible only because of some preliminary checks in `WalReceiver::start`. Turns out these checks can be shifted to the type system by delaying creatinon of the `WalReceiver` struct to the point where we activate the timeline. The changes in this PR were enabled by my previous refactoring that funneled the broker_client from pageserver startup to the activate() call sites. Patch series: - #4316 - #4317 - #4318 - #4319	2023-05-25 20:26:43 +02:00

1 2 3 4 5 ...

3266 Commits