rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-10 06:52:55 +00:00

Author	SHA1	Message	Date
Heikki Linnakangas	bc5ec43056	Fix flaky physical-size tests in test_timeline_size.py. These two tests, test_timeline_physical_size_post_compaction and test_timeline_physical_size_post_gc, assumed that after you have waited for the WAL from a bulk insertion to arrive, and you run a cycle of checkpoint and compaction, no new layer files are created. Because if a new layer file is created while we are calculating the incremental and non-incremental physical sizes, they might differ. However, the tests used a very small checkpoint_distance, so even a small amount of WAL generated in PostgreSQL could cause a new layer file to be created. Autovacuum can kick in at any time, and do that. That caused occasional failues in the test. I was able to reproduce it reliably by adding a long delay between the incremental and non-incremental size calculations: ``` --- a/pageserver/src/http/routes.rs +++ b/pageserver/src/http/routes.rs @@ -129,6 +129,9 @@ async fn build_timeline_info( } }; let current_physical_size = Some(timeline.get_physical_size()); + if include_non_incremental_physical_size { + std:🧵:sleep(std::time::Duration::from_millis(60000)); + } let info = TimelineInfo { tenant_id: timeline.tenant_id, ``` To fix, disable autovacuum for the table. Autovacuum could still kick in for other tables, e.g. catalog tables, but that seems less likely to generate enough WAL to causea new layer file to be flushed. If this continues to be a problem in the future, we could simply retry the physical size call a few times, if there's a mismatch. A mismatch could happen every once in a while, but it's very unlikely to happen more than once or twice in a row. Fixes https://github.com/neondatabase/neon/issues/2212	2022-10-19 23:50:21 +03:00
Konstantin Knizhnik	91411c415a	Persists latest_gc_cutoff_lsn before performing GC (#2558 ) * Persists latest_gc_cutoff_lsn before performing GC * Peform some refactoring and code deduplication refer #2539 * Add test for persisting GC cutoff * Fix python test style warnings * Bump postgres version * Reduce number of iterations in test_gc_cutoff test * Bump postgres version * Undo bumping postgres version	2022-10-19 12:32:03 +03:00
bojanserafimov	8fbe437768	Improve pageserver IO metrics (#2629 )	2022-10-18 11:53:28 -04:00
Dmitry Rodionov	129f7c82b7	remove redundant expect_tenant_to_download_timeline	2022-10-18 11:21:48 +03:00
Joonas Koivunen	c709354579	Add layer sizes to index_part.json (#2582 ) This is the first step in verifying layer files. Next up on the road is hashing the files and verifying the hashes. The metadata additions do not require any migration. The idea is that the change is backward and forward-compatible with regard to `index_part.json` due to the softness of JSON schema and the deserialization options in use. New types added: - LayerFileMetadata for tracking the file metadata - starting with only the file size - in future hopefully a sha256 as well - IndexLayerMetadata, the serialized counterpart of LayerFileMetadata LayerFileMetadata needing to have all fields Option is a problem but that is not possible to handle without conflicting a lot more with other ongoing work. Co-authored-by: Kirill Bulatov <kirill@neon.tech>	2022-10-17 12:21:04 +03:00
Heikki Linnakangas	538876650a	Merge 'local' and 'remote' parts of TimelineInfo into one struct. The 'local' part was always filled in, so that was easy to merge into into the TimelineInfo itself. 'remote' only contained two fields, 'remote_consistent_lsn' and 'awaits_download'. I made 'remote_consistent_lsn' an optional field, and 'awaits_download' is now false if the timeline is not present remotely. However, I kept stub versions of the 'local' and 'remote' structs for backwards-compatibility, with a few fields that are actively used by the control plane. They just duplicate the fields from TimelineInfo now. They can be removed later, once the control plane has been updated to use the new fields.	2022-10-14 18:37:14 +03:00
Heikki Linnakangas	500239176c	Make TimelineInfo.local field mandatory. It was only None when you queried the status of a timeline with 'timeline_detail' mgmt API call, and it was still being downloaded. You can check for that status with the 'tenant_status' API call instead, checking for has_in_progress_downloads field. Anothere case was if an error happened while trying to get the current logical size, in a 'timeline_detail' request. It might make sense to tolerate such errors, and leave the fields we cannot fill in as empty, None, 0 or similar, but it doesn't make sense to me to leave the whole 'local' struct empty in tht case.	2022-10-14 18:37:14 +03:00
Arseny Sher	9fe4548e13	Reimplement explicit timeline creation on safekeepers. With the ability to pass commit_lsn. This allows to perform project WAL recovery through different (from the original) set of safekeepers (or under different ttid) by 1) moving WAL files to s3 under proper ttid; 2) explicitly creating timeline on safekeepers, setting commit_lsn to the latest point; 3) putting the lastest .parital file to the timeline directory on safekeepers, if desired. Extend test_s3_wal_replay to exersise this behaviour. Also extends timeline_status endpoint to return postgres information.	2022-10-13 21:43:10 +04:00
Andrés	09dda35dac	Return broken tenants due to non existing timelines dir (#2552 ) (#2575 ) Co-authored-by: andres <andres.rodriguez@outlook.es>	2022-10-12 22:28:39 +03:00
Arseny Sher	6d0dacc4ce	Recreate timeline on pageserver in s3_wal_replay test. That's closer to real usage than switching to brand new pageserver.	2022-10-12 11:46:21 +04:00
Heikki Linnakangas	e5e40a31f4	Clean up terms "delete timeline" and "detach tenant". You cannot attach/detach an individual timeline, attach/detach always applies to the whole tenant. However, you can delete a single timeline from a tenant. Fix some comments and error messages that confused these two operations.	2022-10-11 17:47:41 +03:00
Lassi Pölönen	e520293090	Add build info metric to pageserver, safekeeper and proxy (#2596 ) * Test that we emit build info metric for pageserver, safekeeper and proxy with some non-zero length revision label * Emit libmetrics_build_info on startup of pageserver, safekeeper and proxy with label "revision" which tells the git revision.	2022-10-11 09:54:32 +03:00
Heikki Linnakangas	a22165d41e	Add tests for comparing root and child branch performance. Author: Thang Pham <thang@neon.tech>	2022-10-08 10:07:33 +03:00
Heikki Linnakangas	9e1eb69d55	Increase default compaction_period setting to 20 s. The previous default of 1 s caused excessive CPU usage when there were a lot of projects. Polling every timeline once a second was too aggressive so let's reduce it. Fixes https://github.com/neondatabase/neon/issues/2542, but we probably also want do to something so that we don't poll timelines that have received no new WAL or layers since last check.	2022-10-07 13:55:19 +03:00
Arthur Petukhovsky	687ba81366	Display sync safekeepers output in compute_ctl (#2571 ) Pipe postgres output to compute_ctl stdout and create a test to check that compute_ctl works and prints postgres logs.	2022-10-06 13:53:52 +00:00
Andrés	47bae68a2e	Make get_lsn_by_timestamp available in mgmt API (#2536 ) (#2560 ) Co-authored-by: andres <andres.rodriguez@outlook.es>	2022-10-06 12:42:50 +03:00
Konstantin Knizhnik	ff8c481777	Normalize last_record LSN in wal receiver (#2529 ) * Add test for branching on page boundary * Normalize start recovery point Co-authored-by: Heikki Linnakangas <heikki@neon.tech> Co-authored-by: Thang Pham <thang@neon.tech>	2022-10-06 09:01:56 +03:00
Kirill Bulatov	d823e84ed5	Allow attaching tenants with zero timelines	2022-10-04 18:13:51 +03:00
Kirill Bulatov	231dfbaed6	Do not remove empty timelines/ directory for tenants	2022-10-04 18:13:51 +03:00
Dmitry Rodionov	5cf53786f9	Improve pytest ergonomics 1. Disable perf tests by default 2. Add instruction to run tests in parallel	2022-10-04 14:53:01 +03:00
Kirill Bulatov	7b2f9dc908	Reuse existing tenants during attach (#2540 )	2022-10-03 13:33:55 +03:00
Alexander Bayandin	ebab89ebd2	test_runner: pass password to pgbench via PGPASSWORD (#2468 )	2022-09-23 12:51:33 +00:00
Anastasia Lubennikova	d098542dde	Make test_timeline_size_metrics more stable: Compare size with Vanilla postgres size instead of hardcoded value	2022-09-22 14:15:13 +03:00
Anastasia Lubennikova	862902f9e5	Update readme and openapi spec	2022-09-22 14:15:13 +03:00
Anastasia Lubennikova	0fde59aa46	use pg_version in python tests	2022-09-22 14:15:13 +03:00
Anastasia Lubennikova	86bf491981	Support pg 15 - Split postgres_ffi into two version specific files. - Preserve pg_version in timeline metadata. - Use pg_version in safekeeper code. Check for postgres major version mismatch. - Clean up the code to use DEFAULT_PG_VERSION constant everywhere, instead of hardcoding. - Parameterize python tests: use DEFAULT_PG_VERSION env and pg_version fixture. To run tests using a specific PostgreSQL version, pass the DEFAULT_PG_VERSION environment variable: 'DEFAULT_PG_VERSION='15' ./scripts/pytest test_runner/regress' Currently don't all tests pass, because rust code relies on the default version of PostgreSQL in a few places.	2022-09-22 14:15:13 +03:00
Konstantin Knizhnik	f3073a4db9	R-Tree layer map (#2317 ) Replace the layer array and linear search with R-tree So far, the in-memory layer map that holds information about layer files that exist, has used a simple Vec, in no particular order, to hold information about all the layers. That obviously doesn't scale very well; with thousands of layer files the linear search was consuming a lot of CPU. Replace it with a two-dimensional R-tree, with Key and LSN ranges as the dimensions. For the R-tree, use the 'rstar' crate. To be able to use that, we convert the Keys and LSNs into 256-bit integers. 64 bits would be enough to represent LSNs, and 128 bits would be enough to represent Keys. However, we use 256 bits, because rstar internally performs multiplication to calculate the area of rectangles, and the result of multiplying two 128 bit integers doesn't necessarily fit in 128 bits, causing integer overflow and, if overflow-checks are enabled, panic. To avoid that, we use 256 bit integers. Add a performance test that creates a lot of layer files, to demonstrate the benefit.	2022-09-22 08:35:06 +03:00
Alexander Bayandin	19fa410ff8	NeonCompare: switch to new pageserver HTTP API	2022-09-21 15:17:55 +01:00
sharnoff	4a3b3ff11d	Move testing pageserver libpq cmds to HTTP api (#2429 ) Closes #2422. The APIs have been feature gated with the `testing_api!` macro so that they return 400s when support hasn't been compiled in.	2022-09-20 11:28:12 -07:00
Heikki Linnakangas	a5019bf771	Use a simpler way to set extra options for benchmark test. Commit `43a4f7173e` fixed the case that there are extra options in the connection string, but broke it in the case when there are not. Fix that. But on second thoughts, it's more straightforward set the options with ALTER DATABASE, so change the workflow yaml file to do that instead.	2022-09-20 13:48:50 +03:00
Heikki Linnakangas	e4f775436f	Don't override other options than statement_timeout in test conn string. In commit `6985f6cd6c`, I tried passing extra GUCs in the 'options' part of the connection string, but it didn't work because the pgbench test overrode it with the statement_timeout. Change it so that it adds the statement_timeout to any other options, instead of replacing them.	2022-09-20 09:46:15 +03:00
Kirill Bulatov	b46c8b4ae0	Add an alias to build test images simply	2022-09-16 18:58:41 +03:00
Kirill Bulatov	031e57a973	Disable failpoints by default	2022-09-16 09:26:29 +03:00
bojanserafimov	96e867642f	Validate tenant create options (#2450 ) Co-authored-by: Kirill Bulatov <kirill@neon.tech>	2022-09-15 18:20:23 -04:00
Egor Suvorov	e968b5e502	tests: do not set num_safekeepers = 1, it's the default (#2457 ) Also get rid if `with_safekeepers` parameter in tests. Its meaning has changed: `False` meant "no safekeepers" which is not supported anymore, so we assume it's always `True`. See #1648	2022-09-15 21:43:51 +03:00
Kirill Bulatov	b8eb908a3d	Rename old project name references	2022-09-14 08:14:05 +03:00
Alexander Bayandin	59d04ab66a	test_runner: redact passwords from log messages (#2434 )	2022-09-13 17:24:11 +00:00
Kirill Bulatov	1a8c8b04d7	Merge Repository and Tenant entities, rework tenant background jobs	2022-09-13 15:39:39 +03:00
Kirill Bulatov	2a837d7de7	Create tenants in temporary directory first (#2426 )	2022-09-12 21:04:33 +00:00
Heikki Linnakangas	40c845e57d	Switch to async for all concurrency in the pageserver. Instead of spawning helper threads, we now use Tokio tasks. There are multiple Tokio runtimes, for different kinds of tasks. One for serving libpq client connections, another for background operations like GC and compaction, and so on. That's not strictly required, we could use just one runtime, but with this you can still get an overview of what's happening with "top -H". There's one subtle behavior in how TenantState is updated. Before this patch, if you deleted all timelines from a tenant, its GC and compaction loops were stopped, and the tenant went back to Idle state. We no longer do that. The empty tenant stays Active. The changes to test_tenant_tasks.py are related to that. There's still plenty of synchronous code and blocking. For example, we still use blocking std::io functions for all file I/O, and the communication with WAL redo processes is still uses low-level unix poll(). We might want to rewrite those later, but this will do for now. The model is that local file I/O is considered to be fast enough that blocking - and preventing other tasks running in the same thread - is acceptable.	2022-09-12 14:21:00 +03:00
Kirill Bulatov	923f642549	Collect cargo build timings	2022-09-09 22:32:00 +03:00
Kirill Bulatov	c9e7c2f014	Ensure all temporary and empty directories and files are cleansed on pageserver startup	2022-09-09 16:36:45 +03:00
Heikki Linnakangas	f441fe57d4	Register prometheus counters correctly. Commit `f081419e68` moved all the prometheus counters to `metrics.rs`, but accidentally replaced a couple of `register_int_counter!(...)` calls with just `IntCounter::new(...)`. Because of that, the counters were not registered in the metrics registry, and were not exposed through the metrics HTTP endpoint. Fixes failures we're seeing in a bunch of 'performance' tests because of the missing metrics.	2022-09-06 17:38:17 +03:00
Heikki Linnakangas	cf157ad8e4	Add test that repeatedly kills and restarts the pageserver. This caught or reproduced several bugs when I originally wrote this test back in May, including #1731, #1740, #1751, and #707. I believe all the issues have been fixed now, but since this was a very fruitful test, let's add it to the test suite. We didn't commit this earlier, because the test was very slow especially with a debug build. We've since changed the build options so that even the debug builds are not quite so slow anymore.	2022-09-06 13:00:40 +03:00
Lassi Pölönen	f081419e68	Cleanup tenant specific metrics once a tenant is detached. (#2328 ) * Add test for pageserver metric cleanup once a tenant is detached. * Remove tenant specific timeline metrics on detach. * Use definitions from timeline_metrics in page service. * Move metrics to own file from layered_repository/timeline.rs * TIMELINE_METRICS: define smgr metrics * REMOVE SMGR cleanup from timeline_metrics. Doesn't seem to work as expected. * Vritual file centralized metrics, except for evicted file as there's no tenat id or timeline id. * Use STORAGE_TIME from timeline_metrics in layered_repository. * Remove timelineless gc metrics for tenant on detach. * Rename timeline metrics -> metrics as it's more generic. * Don't create a TimelineMetrics instance for VirtualFile * Move the rest of the metric definitions to metrics.rs too. * UUID -> ZTenantId * Use consistent style for dict. * Use Repository's Drop trait for dropping STORAGE_TIME metrics. * No need for Arc, TimelineMetrics is used in just one place. Due to that, we can fall back using ZTenantId and ZTimelineId too to avoid additional string allocation.	2022-09-06 11:30:20 +03:00
Anastasia Lubennikova	05e263d0d3	Prepare pg 15 support (build system and submodules) (#2337 ) * Add submodule postgres-15 * Support pg_15 in pgxn/neon * Renamed zenith -> neon in Makefile * fix name of codestyle check * Refactor build system to prepare for building multiple Postgres versions. Rename "vendor/postgres" to "vendor/postgres-v14" Change Postgres build and install directory paths to be version-specific: - tmp_install/build -> pg_install/build/14 - tmp_install/* -> pg_install/14/* And Makefile targets: - "make postgres" -> "make postgres-v14" - "make postgres-headers" -> "make postgres-v14-headers" - etc. Add Makefile aliases: - "make postgres" to build "postgres-v14" and in future, "postgres-v15" - similarly for "make postgres-headers" Fix POSTGRES_DISTRIB_DIR path in pytest scripts * Make postgres version a variable in codestyle workflow * Support vendor/postgres-v15 in codestyle check workflow * Support postgres-v15 building in Makefile * fix pg version in Dockerfile.compute-node * fix kaniko path * Build neon extensions in version-specific directories * fix obsolete mentions of vendor/postgres * use vendor/postgres-v14 in Dockerfile.compute-node.legacy * Use PG_VERSION_NUM to gate dependencies in inmem_smgr.c * Use versioned ECR repositories and image names for compute-node. The image name format is compute-node-vXX, where XX is postgres major version number. For now only v14 is supported. Old format unversioned name (compute-node) is left, because cloud repo depends on it. * update vendor/postgres submodule url (zenith->neondatabase rename) * Fix postgres path in python tests after rebase * fix path in regress test * Use separate dockerfiles to build compute-node: Dockerfile.compute-node-v15 should be identical to Dockerfile.compute-node-v14 except for the version number. This is a hack, because Kaniko doesn't support build ARGs properly * bump vendor/postgres-v14 and vendor/postgres-v15 * Don't use Kaniko cache for v14 and v15 compute-node images * Build compute-node images for different versions in different jobs Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2022-09-05 18:30:54 +03:00
Konstantin Knizhnik	ad057124be	Update relation size cache only when latest LSN is requested (#2310 ) * Update relation size cache only when latest LSN is requested * Fix tests * Add a test case for timetravel query after pageserver restart. This test is currently failing, the queries return incorrect results. I don't know why, needs to be investigated. FAILED test_runner/batch_others/test_readonly_node.py::test_timetravel - assert 85 == 100000 If you remove the pageserver restart from the test, it passes. * yapf3 test_readonly_node.py * Add comment about cache correction in case of setting incorrect latest flag * Fix formatting for test_readonly_node.py * Remove unused imports * Fix mypy warning for test_readonly_node.py * Fix formatting of test_readonly_node.py * Bump postgres version Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2022-09-05 13:12:02 +03:00
Heikki Linnakangas	aeb1cf9c36	Fix misc typos and grammar in comments.	2022-09-05 11:09:32 +03:00
Konstantin Knizhnik	eef7475408	Add tests for measuring effect of lsn caching (#2384 ) * Add tests for measurif effet of lsn caching * Fix formatting of test_latency.py * Fix test_lsn_mapping test	2022-09-03 17:06:19 +03:00
Kirill Bulatov	827c3013bd	Adjust benchmark code to Ids	2022-09-02 14:57:09 +03:00

1 2 3 4 5 ...

449 Commits