
rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-13 16:32:56 +00:00

Author	SHA1	Message	Date
George MacKerron	1ca08cc523	Changed batch query body from [...] to { queries: [...] } (#4975) ## Problem It's nice if `single query : single response :: batch query : batch response`. But at present, in the single case we send `{ query: '', params: [] }` and get back a single `{ rows: [], ... }` object, while in the batch case we send an array of `{ query: '', params: [] }` objects and get back not an array of `{ rows: [], ... }` objects but a `{ results: [ { rows: [] , ... }, { rows: [] , ... }, ... ] }` object instead. ## Summary of changes With this change, the batch query body becomes `{ queries: [{ query: '', params: [] }, ... ] }`, which restores a consistent relationship between the request and response bodies.	2023-08-14 16:07:33 +01:00
Dmitry Rodionov	4626d89eda	Harden retries on tenant/timeline deletion path. (#4973 ) Originated from test failure where we got SlowDown error from s3. The patch generalizes `download_retry` to not be download specific. Resulting `retry` function is moved to utils crate. `download_retries` is now a thin wrapper around this `retry` function. To ensure that all needed retries are in place test code now uses `test_remote_failures=1` setting. Ref https://neondb.slack.com/archives/C059ZC138NR/p1691743624353009	2023-08-14 17:16:49 +03:00
Arseny Sher	8173813584	Add term=n option to safekeeper START_REPLICATION command. It allows term leader to ensure he pulls data from the correct term. Absence of it wasn't very problematic due to CRC checks, but let's be strict. walproposer still doesn't use it as we're going to remove recovery completely from it.	2023-08-12 12:20:13 +03:00
Dmitry Rodionov	d39fd66773	tests: remove redundant wait_while (#4952 ) Remove redundant `wait_while` in tests. It had only one usage. Use `wait_tenant_status404`. Related: https://github.com/neondatabase/neon/pull/4855#discussion_r1289610641	2023-08-11 10:18:13 +03:00
Dmitry Rodionov	c58b22bacb	Delete tenant's data from s3 (#4855 ) ## Summary of changes For context see https://github.com/neondatabase/neon/blob/main/docs/rfcs/022-pageserver-delete-from-s3.md Create Flow to delete tenant's data from pageserver. The approach heavily mimics previously implemented timeline deletion implemented mostly in https://github.com/neondatabase/neon/pull/4384 and followed up in https://github.com/neondatabase/neon/pull/4552 For remaining deletion related issues consult with deletion project here: https://github.com/orgs/neondatabase/projects/33 resolves #4250 resolves https://github.com/neondatabase/neon/issues/3889 --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-08-10 18:53:16 +03:00
Joonas Koivunen	71f9d9e5a3	test: allow slow shutdown warning (#4953 ) Introduced in #4886, did not consider that tests with real_s3 could sometimes go over the limit. Do not fail tests because of that.	2023-08-10 15:55:41 +03:00
Alek Westover	119b86480f	test: make pg_regress less flaky, hopefully (#4903) `pg_regress` is flaky: https://github.com/neondatabase/neon/issues/559 Consolidated `CHECKPOINT` to `check_restored_datadir_content`, add a wait for `wait_for_last_flush_lsn`. Some recently introduced flakiness was fixed with #4948. --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-08-10 15:24:43 +03:00
Joonas Koivunen	db48f7e40d	test: mark test_download_extensions.py skipped for now (#4948 ) The test mutates a shared directory which does not work with multiple concurrent tests. It is being fixed, so this should be a very temporary band-aid. Cc: #4949.	2023-08-10 11:05:27 +00:00
Alexander Bayandin	5993b2bedc	test_runner: remove excessive timeouts (#4659 ) ## Problem For some tests, we override the default timeout (300s / 5m) with a larger values like 600s / 10m or even 1800s / 30m, even if it's not required. I've collected some statistics (for the last 60 days) for tests duration: \| test \| max (s) \| p99 (s) \| p50 (s) \| count \| \|-----------------------------------\|---------\|---------\|---------\|-------\| \| test_hot_standby \| 9 \| 2 \| 2 \| 5319 \| \| test_import_from_vanilla \| 16 \| 9 \| 6 \| 5692 \| \| test_import_from_pageserver_small \| 37 \| 7 \| 5 \| 5719 \| \| test_pg_regress \| 101 \| 73 \| 44 \| 5642 \| \| test_isolation \| 65 \| 56 \| 39 \| 5692 \| A couple of tests that I left with custom 600s / 10m timeout. \| test \| max (s) \| p99 (s) \| p50 (s) \| count \| \|-----------------------------------\|---------\|---------\|---------\|-------\| \| test_gc_cutoff \| 456 \| 224 \| 109 \| 5694 \| \| test_pageserver_chaos \| 528 \| 267 \| 121 \| 5712 \| ## Summary of changes - Remove `@pytest.mark.timeout` annotation from several tests	2023-08-09 16:27:53 +01:00
John Spray	4dc644612b	pageserver: expose prometheus metrics for startup time (#4893 ) ## Problem Currently to know how long pageserver startup took requires inspecting logs. ## Summary of changes `pageserver_startup_duration_ms` metric is added, with label `phase` for different phases of startup. These are broken down by phase, where the phases correspond to the existing wait points in the code: - Start of doing I/O - When tenant load is done - When initial size calculation is done - When background jobs start - Then "complete" when everything is done. `pageserver_startup_is_loading` is a 0/1 gauge that indicates whether we are in the initial load of tenants. `pageserver_tenant_activation_seconds` is a histogram of time in seconds taken to activate a tenant. Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-08-08 12:41:37 +03:00
John Spray	4892a5c5b7	pageserver: avoid logging the "ERROR" part of DbErrors that are successes (#4902 ) ## Problem The pageserver<->safekeeper protocol uses error messages to indicate end of stream. pageserver already logs these at INFO level, but the inner error message includes the word "ERROR", which interferes with log searching. Example: ``` walreceiver connection handling ended: db error: ERROR: ending streaming to Some("pageserver") at 0/4031CA8 ``` The inner DbError has a severity of ERROR so DbError's Display implementation includes that ERROR, even though we are actually logging the error at INFO level. ## Summary of changes Introduce an explicit WalReceiverError type, and in its From<> for postgres errors, apply the logic from ExpectedError, for expected errors, and a new condition for successes. The new output looks like: ``` walreceiver connection handling ended: Successful completion: ending streaming to Some("pageserver") at 0/154E9C0, receiver is caughtup and there is no computes ```	2023-08-08 12:35:24 +03:00
John Spray	33cb1e9c0c	tests: enable higher concurrency and adjust tests with outlier runtime (#4904 ) ## Problem I spent a few minutes seeing how fast I could get our regression test suite to run on my workstation, for when I want to run a "did I break anything?" smoke test before pushing to CI. - Test runtime was dominated by a couple of tests that run for longer than all the others take together - Test concurrency was limited to <16 by the ports-per-worker setting There's no "right answer" for how long a test should be, but as a rule of thumb, no one test should run for much longer than the time it takes to run all the other tests together. ## Summary of changes - Make the ports per worker setting dynamic depending on worker count - Modify the longest running tests to run for a shorter time (`test_duplicate_layers` which uses a pgbench runtime) or fewer iterations (`test_restarts_frequent_checkpoints`).	2023-08-08 09:16:21 +01:00
Joonas Koivunen	ba9df27e78	fix: silence not found error when removing ephemeral (#4900) We currently cannot drop tenant before removing its directory, or use Tenant::drop for this. This creates unnecessary or inactionable warnings during detach at least. Silence the most typical, file not found. Log remaining at `error!`. Cc: #2442	2023-08-04 21:03:17 +03:00
Joonas Koivunen	ea3e1b51ec	Remote storage metrics (#4892 ) We don't know how our s3 remote_storage is performing, or if it's blocking the shutdown. Well, for sampling reasons, we will not really know even after this PR. Add metrics: - align remote_storage metrics towards #4813 goals - histogram `remote_storage_s3_request_seconds{request_type=(get_object\|put_object\|delete_object\|list_objects), result=(ok\|err\|cancelled)}` - histogram `remote_storage_s3_wait_seconds{request_type=(same kinds)}` - counter `remote_storage_s3_cancelled_waits_total{request_type=(same kinds)}` Follow-up work: - After release, remove the old metrics, migrate dashboards Histogram buckets are rough guesses, need to be tuned. In pageserver we have a download timeout of 120s, so I think the 100s bucket is quite nice.	2023-08-04 21:01:29 +03:00
Joonas Koivunen	5263b39e2c	fix: shutdown logging again (#4886 ) During deploys of 2023-08-03 we logged too much on shutdown. Fix the logging by timing each top level shutdown step, and possibly warn on it taking more than a rough threshold, based on how long I think it possibly should be taking. Also remove all shutdown logging from background tasks since there is already "shutdown is taking a long time" logging. Co-authored-by: John Spray <john@neon.tech>	2023-08-03 20:34:05 +03:00
Dmitry Rodionov	1497a42296	tests: split neon_fixtures.py (#4871) ## Problem neon_fixtures.py has grown to unmanageable size. It attracts conflicts. When adding specific utils under for example `fixtures/pageserver` things sometimes need to import stuff from `neon_fixtures.py` which creates circular import. This is usually only needed for type annotations, so `typing.TYPE_CHECKING` flag can mask the issue. Nevertheless I believe that splitting neon_fixtures.py into smaller parts is a better approach. Currently the PR contains small things, but I plan to continue and move NeonEnv to its own `fixtures.env` module. To keep the diff small I think this PR can already be merged to cause fewer conflicts. UPD: it looks like currently it's not really possible to fully avoid usage of `typing.TYPE_CHECKING`, because some components directly depend on each other, i.e. the Env -> Cli -> Env cycle. But it's still worth it to avoid it in as many places as possible. And decreasing neon_fixture's size still makes sense.	2023-08-03 17:20:24 +03:00
Alexander Bayandin	cd33089a66	test_runner: set AWS credentials for endpoints (#4887 ) ## Problem If AWS credentials are not set locally (via AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY env vars) `test_remote_library[release-pg15-mock_s3]` test fails with the following error: ``` ERROR could not start the compute node: Failed to download a remote file: Failed to download S3 object: failed to construct request ``` ## Summary of changes - set AWS credentials for endpoints programmatically	2023-08-03 16:44:48 +03:00
Alek Westover	d005c77ea3	Tar Remote Extensions (#4715 ) Add infrastructure to dynamically load postgres extensions and shared libraries from remote extension storage. Before postgres start downloads list of available remote extensions and libraries, and also downloads 'shared_preload_libraries'. After postgres is running, 'compute_ctl' listens for HTTP requests to load files. Postgres has new GUC 'extension_server_port' to specify port on which 'compute_ctl' listens for requests. When PostgreSQL requests a file, 'compute_ctl' downloads it. See more details about feature design and remote extension storage layout in docs/rfcs/024-extension-loading.md --------- Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech> Co-authored-by: Alek Westover <alek.westover@gmail.com>	2023-08-02 12:38:12 +03:00
Dmitry Rodionov	c3fe335eaf	wait for tenant to be active before polling for timeline absence (#4856 ) ## Problem https://neon-github-public-dev.s3.amazonaws.com/reports/main/5692829577/index.html#suites/f588e0a787c49e67b29490359c589fae/4c50937643d68a66 ## Summary of changes wait for tenant to be active after restart before polling for timeline absence	2023-08-01 18:28:18 +03:00
Alex Chi Z	7b6c849456	support isolation level + read only for http batch sql (#4830 ) We will retrieve `neon-batch-isolation-level` and `neon-batch-read-only` from the http header, which sets the txn properties. https://github.com/neondatabase/serverless/pull/38#issuecomment-1653130981 --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2023-08-01 02:59:11 +03:00
Alexander Bayandin	39e458f049	test_compatibility: fix pg_tenant_only_port port collision (#4850) ## Problem Compatibility tests fail from time to time due to `pg_tenant_only_port` port collision (added in https://github.com/neondatabase/neon/pull/4731) ## Summary of changes - replace `pg_tenant_only_port` value in config with new port - remove old logic that we don't need anymore - unify config overrides	2023-07-31 20:49:46 +03:00
Joonas Koivunen	89ee8f2028	fix: demote warnings, fix flakiness (#4837) `WARN ... found future (image\|delta) layer` are not actionable log lines. They don't need to be warnings. `info!` is enough. This also fixes some known but not tracked flakiness in [`test_remote_timeline_client_calls_started_metric`][evidence]. [evidence]: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4829/5683495367/index.html#/testresult/34fe79e24729618b Closes #3369. Closes #4473.	2023-07-31 07:43:12 +00:00
Joonas Koivunen	2fbdf26094	test: raise timeout to avoid flakiness (#4832) 2s timeout was too tight for our CI, [evidence](https://neon-github-public-dev.s3.amazonaws.com/reports/main/5669956577/index.html#/testresult/6388e31182cc2d6e). 15s might be better. Also clean up code no longer needed after #4204.	2023-07-28 14:32:01 -04:00
Alexander Bayandin	7374634845	test_runner: clean up test_compatibility (#4770) ## Problem We have some amount of outdated logic in test_compatibility that we don't need anymore. ## Summary of changes - Remove `PR4425_ALLOWED_DIFF` and tune `dump_differs` method to accept allowed diffs in the future (a cleanup after https://github.com/neondatabase/neon/pull/4425) - Remove etcd-related code (a cleanup after https://github.com/neondatabase/neon/pull/2733) - Don't set `preserve_database_files`	2023-07-28 16:15:31 +01:00
Alexander Bayandin	9fdd3a4a1e	test_runner: add amcheck to test_compatibility (#4772 ) Run `pg_amcheck` in forward and backward compatibility tests to catch some data corruption. ## Summary of changes - Add amcheck compiling to Makefile - Add `pg_amcheck` to test_compatibility	2023-07-28 16:00:55 +01:00
Alek Westover	3681fc39fd	modify `relative_path_to_s3_object` logic for `prefix=None` (#4795 ) see added unit tests for more description	2023-07-28 10:03:18 -04:00
Joonas Koivunen	67d2fa6dec	test: fix `test_neon_cli_basics` flakiness without making it better for future (#4827) The test was starting two endpoints on the same branch as discovered by @petuhovskiy. The fix is to allow passing branch-name from the python side over to neon_local, which already accepted it. Split from #4824, which will handle making this more misuse resistant.	2023-07-27 19:13:58 +03:00
Joonas Koivunen	395bd9174e	test: allow future image layer warning (#4818 ) https://neon-github-public-dev.s3.amazonaws.com/reports/main/5670795960/index.html#suites/837740b64a53e769572c4ed7b7a7eeeb/5a73fa4a69399123/retries Allow it because we are doing immediate stop.	2023-07-27 10:22:44 +03:00
Joonas Koivunen	48ce95533c	test: allow normal warnings in test_threshold_based_eviction (#4801 ) See: https://neon-github-public-dev.s3.amazonaws.com/reports/main/5654328815/index.html#suites/3fc871d9ee8127d8501d607e03205abb/3482458eba88c021	2023-07-26 20:20:12 +03:00
bojanserafimov	520046f5bd	cold starts: Add sync-safekeepers fast path (#4804 )	2023-07-25 19:44:18 -04:00
Alex Chi Z	bcc2aee704	proxy: add tests for batch http sql (#4793 ) This PR adds an integration test case for batch HTTP SQL endpoint. https://github.com/neondatabase/neon/pull/4654/ should be merged first before we land this PR. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2023-07-25 15:08:24 +00:00
Dmitry Rodionov	6d023484ed	Use mark file to allow for deletion operations to continue through restarts (#4552) ## Problem Currently we delete local files first, so if pageserver restarts after local files deletion then remote deletion is not continued. This can be solved with inversion of these steps. But even if these steps are inverted when index_part.json is deleted there is no way to distinguish between "this timeline is good, we just didn't upload it to remote" and "this timeline is deleted we should continue with removal of local state". So to solve it we use another mark file. After index part is deleted presence of this mark file identifies that it was a deletion intention. Alternative approach that was discussed was to delete all except metadata first, and then delete metadata and index part. In this case we still do not support local only configs making them rather unsafe (deletion in them is already unsafe, but this direction solidifies this direction instead of fixing it). Another downside is that if we crash after local metadata gets removed we may leave dangling index part on the remote which in theory shouldn't be a big deal because the file is small. It is not a big change to choose another approach at this point. ## Summary of changes Timeline deletion sequence: 1. Set deleted_at in remote index part. 2. Create local mark file. 3. Delete local files except metadata (it is simpler this way, to be able to reuse timeline initialization code that expects metadata) 4. Delete remote layers 5. Delete index part 6. Delete meta, timeline directory. 7. Delete mark file. This works for local only configuration without remote storage. Sequence is resumable from any point. resolves #4453 resolves https://github.com/neondatabase/neon/pull/4552 (the issue was created with async cancellation in mind, but we can still have issues with retries if metadata is deleted among the first by remove_dir_all (which doesn't have any ordering guarantees)) --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech> Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-07-25 16:25:27 +03:00
cui fliter	f2e2b8a7f4	fix some typos (#4662 ) Typos fix Signed-off-by: cui fliter <imcusg@gmail.com>	2023-07-25 14:39:29 +03:00
Joonas Koivunen	77a68326c5	Thin out TenantState metric, keep set of broken tenants (#4796 ) We currently have a timeseries for each of the tenants in different states. We only want this for Broken. Other states could be counters. Fix this by making the `pageserver_tenant_states_count` a counter without a `tenant_id` and add a `pageserver_broken_tenants_count` which has a `tenant_id` label, each broken tenant being 1.	2023-07-25 11:15:54 +03:00
Joonas Koivunen	294b8a8fde	Convert per timeline metrics to global (#4769 ) Cut down the per-(tenant, timeline) histograms by making them global: - `pageserver_getpage_get_reconstruct_data_seconds` - `pageserver_read_num_fs_layers` - `pageserver_remote_operation_seconds` - `pageserver_remote_timeline_client_calls_started` - `pageserver_wait_lsn_seconds` - `pageserver_io_operations_seconds` --------- Co-authored-by: Shany Pozin <shany@neon.tech>	2023-07-25 00:43:27 +03:00
Konstantin Knizhnik	457e3a3ebc	Mx offset bug (#4775 ) Fix mx_offset_to_flags_offset() function Fixes issue #4774 Postgres `MXOffsetToFlagsOffset` was not correctly converted to Rust because cast to u16 is done before division by modulo. It is possible only if divider is power of two. Add a small rust unit test to check that the function produces same results as the PostgreSQL macro, and extend the existing python test to cover this bug. Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2023-07-21 21:20:53 +03:00
Alex Chi Z	3fc3666df7	make flush frozen layer an atomic operation (#4720 ) ## Problem close https://github.com/neondatabase/neon/issues/4712 ## Summary of changes Previously, when flushing frozen layers, it was split into two operations: add delta layer to disk + remove frozen layer from memory. This would cause a short period of time where we will have the same data both in frozen and delta layer. In this PR, we merge them into one atomic operation in layer map manager, therefore simplifying the code. Note that if we decide to create image layers for L0 flush, it will still be split into two operations on layer map. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-07-20 13:39:19 -04:00
bojanserafimov	e1061879aa	Improve startup python test (#4757 )	2023-07-19 23:46:16 -04:00
Arseny Sher	921bb86909	Use safekeeper tenant only port in all tests and actually test it. Compute now uses special safekeeper WAL service port allowing auth tokens with only tenant scope. Adds understanding of this port to neon_local and fixtures, as well as test of both ports behaviour with different tokens. ref https://github.com/neondatabase/neon/issues/4730	2023-07-19 06:03:51 +04:00
Joonas Koivunen	762a8a7bb5	python: more linting (#4734 ) Ruff has "B" class of lints, including B018 which will nag on useless expressions, related to #4719. Enable such lints and fix the existing issues. Most notably: - https://beta.ruff.rs/docs/rules/mutable-argument-default/ - https://beta.ruff.rs/docs/rules/assert-false/ --------- Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2023-07-18 12:56:40 +03:00
Alex Chi Z	1066bca5e3	compaction: allow duplicated layers and skip in replacement (#4696 ) ## Problem Compactions might generate files of exactly the same name as before compaction due to our naming of layer files. This could have already caused some mess in the system, and is known to cause some issues like https://github.com/neondatabase/neon/issues/4088. Therefore, we now consider duplicated layers in the post-compaction process to avoid violating the layer map duplicate checks. related previous works: close https://github.com/neondatabase/neon/pull/4094 error reported in: https://github.com/neondatabase/neon/issues/4690, https://github.com/neondatabase/neon/issues/4088 ## Summary of changes If a file already exists in the layer map before the compaction, do not modify the layer map and do not delete the file. The file on disk at that time should be the new one overwritten by the compaction process. This PR also adds a test case with a fail point that produces exactly the same set of files. This bypassing behavior is safe because the produced layer files have the same content / are the same representation of the original file. An alternative might be directly removing the duplicate check in the layer map, but I feel it would be good if we can prevent that in the first place. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Konstantin Knizhnik <knizhnik@garret.ru> Co-authored-by: Heikki Linnakangas <heikki@neon.tech> Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-07-17 17:26:29 +03:00
arpad-m	982fce1e72	Fix rustdoc warnings and test cargo doc in CI (#4711 ) ## Problem `cargo +nightly doc` is giving a lot of warnings: broken links, naked URLs, etc. ## Summary of changes * update the `proc-macro2` dependency so that it can compile on latest Rust nightly, see https://github.com/dtolnay/proc-macro2/pull/391 and https://github.com/dtolnay/proc-macro2/issues/398 * allow the `private_intra_doc_links` lint, as linking to something that's private is always more useful than just mentioning it without a link: if the link breaks in the future, at least there is a warning due to that. Also, one might enable [`--document-private-items`](https://doc.rust-lang.org/cargo/commands/cargo-doc.html#documentation-options) in the future and make these links work in general. * fix all the remaining warnings given by `cargo +nightly doc` * make it possible to run `cargo doc` on stable Rust by updating `opentelemetry` and associated crates to version 0.19, pulling in a fix that previously broke `cargo doc` on stable: https://github.com/open-telemetry/opentelemetry-rust/pull/904 * Add `cargo doc` to CI to ensure that it won't get broken in the future. Fixes #2557 ## Future work * Potentially, it might make sense, for development purposes, to publish the generated rustdocs somewhere, like for example [how the rust compiler does it](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_driver/index.html). I will file an issue for discussion.	2023-07-15 05:11:25 +03:00
Joonas Koivunen	9a69b6cb94	Demote deletion warning, list files (#4688) Handle test failures like: ``` AssertionError: assert not ['$ts WARN delete_timeline{tenant_id=X timeline_id=Y}: About to remove 1 files\n'] ``` Instead of logging: ``` WARN delete_timeline{tenant_id=X timeline_id=Y}: Found 1 files not bound to index_file.json, proceeding with their deletion WARN delete_timeline{tenant_id=X timeline_id=Y}: About to remove 1 files ``` For each one operation of timeline deletion, list all unref files with `info!`, and then continue to delete them with the added spice of logging the rare/never happening non-utf8 name with `warn!`. Rationale for `info!` instead of `warn!`: this is a normal operation; like we had mentioned in `test_import.py` -- basically whenever we delete a timeline which is not idle. Rationale for N * (`info!`\|`warn!`): symmetry for the layer deletions; if we could ever need those, we could also need these for layer files which are not yet mentioned in `index_part.json`. --------- Co-authored-by: Christian Schwarz <christian@neon.tech>	2023-07-14 18:59:16 +03:00
arpad-m	664d32eb7f	Don't propagate but log file not found error in layer uploading (#4694) This addresses the issue in #4526 by adding a test that reproduces the race condition that gave rise to the bug (or at least a race condition that gave rise to the same error message), and then implementing a fix that just prints a message to the log if a file could not be found for uploading. Even though the underlying race condition is not fixed yet, this will un-block the upload queue in that situation, greatly reducing the impact of such a (rare) race. Fixes #4526.	2023-07-12 18:10:49 +02:00
Alexander Bayandin	ed845b644b	Prevent unintentional Postgres submodule update (#4692) ## Problem Postgres submodule can be changed unintentionally, and these changes are easy to miss during the review. Adding a check that should prevent this from happening, the check fails `build-neon` job with the following message: ``` Expected postgres-v14 rev to be at '1414141414141414141414141414141414141414', but it is at '1144aee1661c79eec65e784a8dad8bd450d9df79' Expected postgres-v15 rev to be at '1515151515151515151515151515151515151515', but it is at '1984832c740a7fa0e468bb720f40c525b652835d' Please update vendors/revisions.json if these changes are intentional. ``` This is an alternative approach to https://github.com/neondatabase/neon/pull/4603 ## Summary of changes - Add `vendor/revisions.json` file with expected revisions - Add build-time check (to `build-neon` job) that Postgres submodules match revisions from `vendor/revisions.json` - A couple of small improvements for logs from https://github.com/neondatabase/neon/pull/4603 - Fixed GitHub autocomment for the case where no tests were run --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-07-12 15:12:37 +01:00
bojanserafimov	92aee7e07f	cold starts: basebackup compression (#4482 ) Co-authored-by: Alex Chi Z <iskyzh@gmail.com>	2023-07-11 13:11:23 -04:00
Alexander Bayandin	33c2d94ba6	Fix git-env version for PRs (#4641 ) ## Problem Binaries created from PRs (both in docker images and for tests) have wrong git-env versions, they point to phantom merge commits. ## Summary of changes - Prefer GIT_VERSION env variable even if git information was accessible - Use `${{ github.event.pull_request.head.sha \|\| github.sha }}` instead of `${{ github.sha }}` for `GIT_VERSION` in workflows So the builds will still happen from this phantom commit, but we will report the PR commit. --------- Co-authored-by: Joonas Koivunen <joonas@neon.tech>	2023-07-10 20:01:01 +01:00
Christian Schwarz	505aa242ac	page cache: add size metrics (#4629 ) Make them a member of `struct PageCache` to prepare for a future where there's no global state.	2023-07-05 15:36:42 +03:00
Christian Schwarz	3f9defbfb4	page cache: add access & hit rate metrics (#4628 ) Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>	2023-07-05 10:38:32 +02:00
bojanserafimov	c7143dbde6	compute_ctl: Fix misleading metric (#4608 )	2023-07-04 19:07:36 -04:00
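
Commit 1ca08cc523 describes the new batch request shape: `{ queries: [{ query: '', params: [] }, ... ] }`, mirroring the `{ results: [...] }` response. A minimal sketch of building both request bodies follows. The helper names `build_single_body` and `build_batch_body` are hypothetical and are not part of the proxy or of any client library; only the JSON key names come from the commit message.

```python
import json

def build_single_body(query, params=None):
    """Body for a single query: { query: '', params: [] }."""
    return {"query": query, "params": list(params or [])}

def build_batch_body(statements):
    """Body for a batch: { queries: [ { query, params }, ... ] }.

    `statements` is an iterable of (query, params) pairs; this input
    shape is an assumption made for the sketch.
    """
    return {"queries": [build_single_body(q, p) for q, p in statements]}

if __name__ == "__main__":
    body = build_batch_body([("SELECT $1::int", [1]), ("SELECT now()", [])])
    print(json.dumps(body, indent=2))
```

Wrapping the list in an object keeps the request symmetric with the response, which already wraps its list under `results`.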
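
Commit 4626d89eda generalizes `download_retry` into a generic `retry` function, with the download variant kept as a thin wrapper. The actual Rust signature is not shown in the log, so the following is only an illustration of that layering. The parameters `is_permanent`, `max_attempts`, `base_delay` and `sleep`, and the choice of exponential backoff, are assumptions.

```python
import time

def retry(op, is_permanent, max_attempts=3, base_delay=0.0, sleep=time.sleep):
    """Call `op` until it succeeds, a permanent error occurs,
    or `max_attempts` is exhausted (hypothetical signature)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except Exception as err:
            # Give up immediately on permanent errors or on the last attempt.
            if is_permanent(err) or attempt == max_attempts:
                raise
            # Exponential backoff between attempts (an assumption).
            sleep(base_delay * 2 ** (attempt - 1))

def download_retry(download, max_attempts=3):
    """Thin download-specific wrapper around the generic `retry`.
    Treating FileNotFoundError as permanent is an assumption."""
    return retry(
        download,
        is_permanent=lambda e: isinstance(e, FileNotFoundError),
        max_attempts=max_attempts,
    )
```

With this split, deletion paths can reuse `retry` with their own notion of a permanent error, which is the kind of reuse the commit message describes.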
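
Commit 6d023484ed explains why a mark file is needed: once `index_part.json` is gone, a missing remote index alone cannot distinguish "never uploaded" from "deletion in progress". The decision made on restart can be sketched as below. The enum, the function name and the dict representation of the index part are hypothetical; only the ordering of the steps and the role of `deleted_at` and the mark file come from the commit message.

```python
from enum import Enum

# Deletion steps in the order listed in the commit message.
DELETION_STEPS = [
    "set deleted_at in remote index part",
    "create local mark file",
    "delete local files except metadata",
    "delete remote layers",
    "delete index part",
    "delete metadata and timeline directory",
    "delete mark file",
]

class OnRestart(Enum):
    LOAD_NORMALLY = "load"
    LOCAL_ONLY = "not uploaded yet"
    RESUME_DELETION = "resume deletion"

def classify(remote_index, mark_file_exists):
    """Decide what to do with a timeline found on disk after a restart.

    `remote_index` is the parsed index part (a dict) or None if absent.
    """
    # A mark file always means deletion was intended, even if the
    # remote index part has already been removed (steps 5 to 7).
    if mark_file_exists:
        return OnRestart.RESUME_DELETION
    # Crash between steps 1 and 2: deleted_at is set, no mark file yet.
    if remote_index is not None and remote_index.get("deleted_at"):
        return OnRestart.RESUME_DELETION
    if remote_index is None:
        return OnRestart.LOCAL_ONLY
    return OnRestart.LOAD_NORMALLY
```

Because every step is either recorded remotely (`deleted_at`) or locally (the mark file) before destructive work begins, the sequence can be restarted from any point.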
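
Commit 457e3a3ebc notes that casting to `u16` before taking the modulo is only valid when the divisor is a power of two. The sketch below illustrates the arithmetic by emulating the `u16` cast with `& 0xFFFF`. The constants (4 members per group, 409 groups per page, 20-byte groups) are what PostgreSQL's multixact layout is believed to use with the default 8 KiB block size; treat them as assumptions. The function names are hypothetical, and the buggy variant only models the order of operations described in the commit, not the original Rust code.

```python
MEMBERS_PER_GROUP = 4    # assumed: MULTIXACT_MEMBERS_PER_MEMBERGROUP
GROUPS_PER_PAGE = 409    # assumed: 8192 // 20, not a power of two
GROUP_SIZE = 20          # assumed: 4 * sizeof(TransactionId) + 4 flag bytes

def flags_offset_correct(xid):
    """Divide, take the modulo, then scale: the truncation-free order."""
    return ((xid // MEMBERS_PER_GROUP) % GROUPS_PER_PAGE) * GROUP_SIZE

def flags_offset_truncated(xid):
    """Model of the bug: truncate to 16 bits *before* the modulo."""
    group = (xid // MEMBERS_PER_GROUP) & 0xFFFF  # emulates `as u16`
    return (group % GROUPS_PER_PAGE) * GROUP_SIZE

if __name__ == "__main__":
    xid = 65536 * MEMBERS_PER_GROUP  # smallest group index that overflows u16
    print(flags_offset_correct(xid), flags_offset_truncated(xid))
```

Truncating to 16 bits is itself a reduction modulo 65536, and reducing modulo 65536 first preserves the result modulo `m` only when `m` divides 65536, that is, when `m` is a power of two. With 409 the two functions agree for small inputs and diverge once the group index exceeds 65535.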


776 Commits