rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-21 15:10:44 +00:00

Author	SHA1	Message	Date
Alexander Bayandin	19fa410ff8	NeonCompare: switch to new pageserver HTTP API	2022-09-21 15:17:55 +01:00
sharnoff	4a3b3ff11d	Move testing pageserver libpq cmds to HTTP api (#2429 ) Closes #2422. The APIs have been feature gated with the `testing_api!` macro so that they return 400s when support hasn't been compiled in.	2022-09-20 11:28:12 -07:00
Heikki Linnakangas	a5019bf771	Use a simpler way to set extra options for benchmark test. Commit `43a4f7173e` fixed the case that there are extra options in the connection string, but broke it in the case when there are not. Fix that. But on second thoughts, it's more straightforward set the options with ALTER DATABASE, so change the workflow yaml file to do that instead.	2022-09-20 13:48:50 +03:00
Heikki Linnakangas	e4f775436f	Don't override other options than statement_timeout in test conn string. In commit `6985f6cd6c`, I tried passing extra GUCs in the 'options' part of the connection string, but it didn't work because the pgbench test overrode it with the statement_timeout. Change it so that it adds the statement_timeout to any other options, instead of replacing them.	2022-09-20 09:46:15 +03:00
Kirill Bulatov	b46c8b4ae0	Add an alias to build test images simply	2022-09-16 18:58:41 +03:00
Kirill Bulatov	031e57a973	Disable failpoints by default	2022-09-16 09:26:29 +03:00
bojanserafimov	96e867642f	Validate tenant create options (#2450 ) Co-authored-by: Kirill Bulatov <kirill@neon.tech>	2022-09-15 18:20:23 -04:00
Egor Suvorov	e968b5e502	tests: do not set num_safekeepers = 1, it's the default (#2457 ) Also get rid if `with_safekeepers` parameter in tests. Its meaning has changed: `False` meant "no safekeepers" which is not supported anymore, so we assume it's always `True`. See #1648	2022-09-15 21:43:51 +03:00
Kirill Bulatov	b8eb908a3d	Rename old project name references	2022-09-14 08:14:05 +03:00
Alexander Bayandin	59d04ab66a	test_runner: redact passwords from log messages (#2434 )	2022-09-13 17:24:11 +00:00
Kirill Bulatov	1a8c8b04d7	Merge Repository and Tenant entities, rework tenant background jobs	2022-09-13 15:39:39 +03:00
Kirill Bulatov	2a837d7de7	Create tenants in temporary directory first (#2426 )	2022-09-12 21:04:33 +00:00
Heikki Linnakangas	40c845e57d	Switch to async for all concurrency in the pageserver. Instead of spawning helper threads, we now use Tokio tasks. There are multiple Tokio runtimes, for different kinds of tasks. One for serving libpq client connections, another for background operations like GC and compaction, and so on. That's not strictly required, we could use just one runtime, but with this you can still get an overview of what's happening with "top -H". There's one subtle behavior in how TenantState is updated. Before this patch, if you deleted all timelines from a tenant, its GC and compaction loops were stopped, and the tenant went back to Idle state. We no longer do that. The empty tenant stays Active. The changes to test_tenant_tasks.py are related to that. There's still plenty of synchronous code and blocking. For example, we still use blocking std::io functions for all file I/O, and the communication with WAL redo processes is still uses low-level unix poll(). We might want to rewrite those later, but this will do for now. The model is that local file I/O is considered to be fast enough that blocking - and preventing other tasks running in the same thread - is acceptable.	2022-09-12 14:21:00 +03:00
Kirill Bulatov	923f642549	Collect cargo build timings	2022-09-09 22:32:00 +03:00
Kirill Bulatov	c9e7c2f014	Ensure all temporary and empty directories and files are cleansed on pageserver startup	2022-09-09 16:36:45 +03:00
Heikki Linnakangas	f441fe57d4	Register prometheus counters correctly. Commit `f081419e68` moved all the prometheus counters to `metrics.rs`, but accidentally replaced a couple of `register_int_counter!(...)` calls with just `IntCounter::new(...)`. Because of that, the counters were not registered in the metrics registry, and were not exposed through the metrics HTTP endpoint. Fixes failures we're seeing in a bunch of 'performance' tests because of the missing metrics.	2022-09-06 17:38:17 +03:00
Heikki Linnakangas	cf157ad8e4	Add test that repeatedly kills and restarts the pageserver. This caught or reproduced several bugs when I originally wrote this test back in May, including #1731, #1740, #1751, and #707. I believe all the issues have been fixed now, but since this was a very fruitful test, let's add it to the test suite. We didn't commit this earlier, because the test was very slow especially with a debug build. We've since changed the build options so that even the debug builds are not quite so slow anymore.	2022-09-06 13:00:40 +03:00
Lassi Pölönen	f081419e68	Cleanup tenant specific metrics once a tenant is detached. (#2328 ) * Add test for pageserver metric cleanup once a tenant is detached. * Remove tenant specific timeline metrics on detach. * Use definitions from timeline_metrics in page service. * Move metrics to own file from layered_repository/timeline.rs * TIMELINE_METRICS: define smgr metrics * REMOVE SMGR cleanup from timeline_metrics. Doesn't seem to work as expected. * Vritual file centralized metrics, except for evicted file as there's no tenat id or timeline id. * Use STORAGE_TIME from timeline_metrics in layered_repository. * Remove timelineless gc metrics for tenant on detach. * Rename timeline metrics -> metrics as it's more generic. * Don't create a TimelineMetrics instance for VirtualFile * Move the rest of the metric definitions to metrics.rs too. * UUID -> ZTenantId * Use consistent style for dict. * Use Repository's Drop trait for dropping STORAGE_TIME metrics. * No need for Arc, TimelineMetrics is used in just one place. Due to that, we can fall back using ZTenantId and ZTimelineId too to avoid additional string allocation.	2022-09-06 11:30:20 +03:00
Anastasia Lubennikova	05e263d0d3	Prepare pg 15 support (build system and submodules) (#2337 ) * Add submodule postgres-15 * Support pg_15 in pgxn/neon * Renamed zenith -> neon in Makefile * fix name of codestyle check * Refactor build system to prepare for building multiple Postgres versions. Rename "vendor/postgres" to "vendor/postgres-v14" Change Postgres build and install directory paths to be version-specific: - tmp_install/build -> pg_install/build/14 - tmp_install/* -> pg_install/14/* And Makefile targets: - "make postgres" -> "make postgres-v14" - "make postgres-headers" -> "make postgres-v14-headers" - etc. Add Makefile aliases: - "make postgres" to build "postgres-v14" and in future, "postgres-v15" - similarly for "make postgres-headers" Fix POSTGRES_DISTRIB_DIR path in pytest scripts * Make postgres version a variable in codestyle workflow * Support vendor/postgres-v15 in codestyle check workflow * Support postgres-v15 building in Makefile * fix pg version in Dockerfile.compute-node * fix kaniko path * Build neon extensions in version-specific directories * fix obsolete mentions of vendor/postgres * use vendor/postgres-v14 in Dockerfile.compute-node.legacy * Use PG_VERSION_NUM to gate dependencies in inmem_smgr.c * Use versioned ECR repositories and image names for compute-node. The image name format is compute-node-vXX, where XX is postgres major version number. For now only v14 is supported. Old format unversioned name (compute-node) is left, because cloud repo depends on it. * update vendor/postgres submodule url (zenith->neondatabase rename) * Fix postgres path in python tests after rebase * fix path in regress test * Use separate dockerfiles to build compute-node: Dockerfile.compute-node-v15 should be identical to Dockerfile.compute-node-v14 except for the version number. This is a hack, because Kaniko doesn't support build ARGs properly * bump vendor/postgres-v14 and vendor/postgres-v15 * Don't use Kaniko cache for v14 and v15 compute-node images * Build compute-node images for different versions in different jobs Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2022-09-05 18:30:54 +03:00
Konstantin Knizhnik	ad057124be	Update relation size cache only when latest LSN is requested (#2310 ) * Update relation size cache only when latest LSN is requested * Fix tests * Add a test case for timetravel query after pageserver restart. This test is currently failing, the queries return incorrect results. I don't know why, needs to be investigated. FAILED test_runner/batch_others/test_readonly_node.py::test_timetravel - assert 85 == 100000 If you remove the pageserver restart from the test, it passes. * yapf3 test_readonly_node.py * Add comment about cache correction in case of setting incorrect latest flag * Fix formatting for test_readonly_node.py * Remove unused imports * Fix mypy warning for test_readonly_node.py * Fix formatting of test_readonly_node.py * Bump postgres version Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2022-09-05 13:12:02 +03:00
Heikki Linnakangas	aeb1cf9c36	Fix misc typos and grammar in comments.	2022-09-05 11:09:32 +03:00
Konstantin Knizhnik	eef7475408	Add tests for measuring effect of lsn caching (#2384 ) * Add tests for measurif effet of lsn caching * Fix formatting of test_latency.py * Fix test_lsn_mapping test	2022-09-03 17:06:19 +03:00
Kirill Bulatov	827c3013bd	Adjust benchmark code to Ids	2022-09-02 14:57:09 +03:00
Kirill Bulatov	2db20e5587	Remove [Un]Loaded timeline code (#2359 )	2022-09-02 14:31:28 +03:00
Kirill Bulatov	f78a542cba	Calculate timeline initial logical size in the background Start the calculation on the first size request, return partially calculated size during calculation, retry if failed. Remove "fast" size init through the ancestor: the current approach is fast enough for now and there are better ways to optimize the calculation via incremental ancestor size computation	2022-09-02 14:31:28 +03:00
Heikki Linnakangas	47bd307cb8	Add python types to represent LSNs, tenant IDs and timeline IDs. (#2351 ) For better ergonomics. I always found it weird that we used UUID to actually mean a tenant or timeline ID. It worked because it happened to have the same length, 16 bytes, but it was hacky.	2022-09-02 10:16:47 +03:00
Heikki Linnakangas	15c5f3e6cf	Fix misc typos in comments and variable names.	2022-09-01 20:04:08 +03:00
Alexander Bayandin	d7c9cfe7bb	Create Allure report for perf tests (#2326 )	2022-08-31 16:15:26 +01:00
Heikki Linnakangas	3aca717f3d	Reorganize python tests. Merge batch_others and batch_pg_regress. The original idea was to split all the python tests into multiple "batches" and run each batch in parallel as a separate CI job. However, the batch_pg_regress batch was pretty short compared to all the tests in batch_others. We could split batch_others into multiple batches, but it actually seems better to just treat them as one big pool of tests and use pytest's handle the parallelism on its own. If we need to split them across multiple nodes in the future, we could use pytest-shard or something else, instead of managing the batches ourselves. Merge test_neon_regress.py, test_pg_regress.py and test_isolation.py into one file, test_pg_regress.py. Seems more clear to group all pg_regress-based tests into one file, now that they would all be in the same directory.	2022-08-30 18:25:38 +03:00
Dmitry Ivanov	96a50e99cf	Forward various connection params to compute nodes. (#2336 ) Previously, proxy didn't forward auxiliary `options` parameter and other ones to the client's compute node, e.g. ``` $ psql "user=john host=localhost dbname=postgres options='-cgeqo=off'" postgres=# show geqo; ┌──────┐ │ geqo │ ├──────┤ │ on │ └──────┘ (1 row) ``` With this patch we now forward `options`, `application_name` and `replication`. Further reading: https://www.postgresql.org/docs/current/libpq-connect.html Fixes #1287.	2022-08-30 17:36:21 +03:00
Konstantin Knizhnik	ee8b5f967d	Add fork_at_current_lsn function which creates branch at current LSN (#2344 ) * Add fork_at_current_lsn function which creates branch at current LSN * Undo use of fork_at_current_lsn in test_branching because of short GC period * Add missed return in fork_at_current_lsn * Add missed return in fork_at_current_lsn * Update test_runner/fixtures/neon_fixtures.py Co-authored-by: Heikki Linnakangas <heikki@zenith.tech> * Update test_runner/fixtures/neon_fixtures.py Co-authored-by: Heikki Linnakangas <heikki@zenith.tech> * Update test_runner/fixtures/neon_fixtures.py Co-authored-by: Heikki Linnakangas <heikki@zenith.tech> Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>	2022-08-29 17:59:04 +03:00
Heikki Linnakangas	ec20534173	Fix minor typos and leftover comments.	2022-08-27 17:54:56 +03:00
Dmitry Ivanov	6d30e21a32	Fix proxy tests (#2343 ) There might be different psql & locale configurations, therefore we should explicitly reset them to defaults.	2022-08-26 20:42:32 +03:00
KlimentSerafimov	b98fa5d6b0	Added a new test for making sure the proxy displays a session_id when using link auth. (#2039 ) Added pytest to check correctness of the link authentication pipeline. Context: this PR is the first step towards refactoring the link authentication pipeline to use https (instead of psql) to send the db info to the proxy. There was a test missing for this pipeline in this repo, so this PR adds that test as preparation for the actual change of psql -> https. Co-authored-by: Bojan Serafimov <bojan.serafimov7@gmail.com> Co-authored-by: Dmitry Rodionov <dmitry@neon.tech> Co-authored-by: Stas Kelvic <stas@neon.tech> Co-authored-by: Dimitrii Ivanov <dima@neon.tech>	2022-08-22 20:02:45 -04:00
Dmitry Rodionov	9dd19ec397	Remove interferring proc check We do not need it anymore because ports_distributor checks whether the port can be used before giving it to service	2022-08-22 20:59:32 +03:00
Alexander Bayandin	39a3bcac36	test_runner: fix flake8 warnings	2022-08-22 14:57:09 +01:00
Alexander Bayandin	4c2bb43775	Reformat all python files by black & isort	2022-08-22 14:57:09 +01:00
Alexander Bayandin	277f2d6d3d	Report test results to Allure (#2229 )	2022-08-22 11:21:50 +01:00
Heikki Linnakangas	d48177d0d8	Expose timeline logical size as a prometheus metric. Physical size was already exposed, and it'd be nice to show both logical and physical size side by side in our graphana dashboards.	2022-08-19 22:21:33 +03:00
MMeent	f99ccb5041	Extract WalProposer into the neon extension (#2217 ) Including, but not limited to: * Fixes to neon management code to support walproposer-as-an-extension * Fix issue in expected output of pg settings serialization. * Show the logs of a failed --sync-safekeepers process in CI * Add compat layer for renamed GUCs in postgres.conf * Update vendor/postgres to the latest origin/main	2022-08-18 17:12:28 +02:00
Kirill Bulatov	67e091c906	Rework `init` in pageserver CLI (#2272 ) * Do not create initial tenant and timeline (adjust Python tests for that) * Rework config handling during init, add --update-config to manage local config updates	2022-08-17 23:24:47 +03:00
bojanserafimov	e9a3499e87	Fix flaky pageserver restarts in tests (#2261 )	2022-08-17 08:17:35 -04:00
Alexander Bayandin	4cddb0f1a4	Set up a workflow to run pgbench against captest (#2077 )	2022-08-15 18:54:31 +01:00
Dmitry Rodionov	63a72d99bb	increase timeout in wait_for_upload to avoid spurious failures when testing with real s3	2022-08-15 18:02:27 +03:00
Arthur Petukhovsky	116ecdf87a	Improve walreceiver logic (#2253 ) This patch makes walreceiver logic more complicated, but it should work better in most cases. Added `test_wal_lagging` to test scenarios where alive safekeepers can lag behind other alive safekeepers. - There was a bug which looks like `etcd_info.timeline.commit_lsn > Some(self.local_timeline.get_last_record_lsn())` filtered all safekeepers in some strange cases. I removed this filter, it should probably help with #2237 - Now walreceiver_connection reports status, including commit_lsn. This allows keeping safekeeper connection even when etcd is down. - Safekeeper connection now fails if pageserver doesn't receive safekeeper messages for some time. Usually safekeeper sends messages at least once per second. - `LaggingWal` check now uses `commit_lsn` directly from safekeeper. This fixes the issue with often reconnects, when compute generates WAL really fast. - `NoWalTimeout` is rewritten to trigger only when we know about the new WAL and the connected safekeeper doesn't stream any WAL. This allows setting a small `lagging_wal_timeout` because it will trigger only when we observe that the connected safekeeper has stuck.	2022-08-15 13:31:26 +03:00
Alexander Bayandin	da5f8486ce	test_runner/pg_clients: collect docker logs (#2259 )	2022-08-12 17:03:09 +01:00
Dmitry Ivanov	ad08c273d3	[proxy] Rework wire format of the password hack and some errors (#2236 ) The new format has a few benefits: it's shorter, simpler and human-readable as well. We don't use base64 anymore, since url encoding got us covered. We also show a better error in case we couldn't parse the payload; the users should know it's all about passing the correct project name.	2022-08-12 17:38:43 +03:00
Thang Pham	6d99b4f1d8	disable `test_import_from_pageserver_multisegment` (#2258 ) This test failed consistently on `main` now. It's better to temporarily disable it to avoid blocking others' PRs while investigating the root cause for the test failure. See: #2255, #2256	2022-08-12 19:13:42 +07:00
Thang Pham	7da47d8a0a	Fix timeline physical size flaky tests (#2244 ) Resolves #2212. - use `wait_for_last_flush_lsn` in `test_timeline_physical_size_` tests ## Context Need to wait for the pageserver to catch up with the compute's last flush LSN because during the timeline physical size API call, it's possible that there are running `LayerFlushThread` threads. These threads flush new layers into disk and hence update the physical size. This results in a mismatch between the physical size reported by the API and the actual physical size on disk. ### Note The `LayerFlushThread` threads are processed concurrently*, so it's possible that the above error still persists even with this patch. However, making the tests wait to finish processing all the WALs (not flushing) before calculating the physical size should help reduce the "flakiness" significantly	2022-08-12 14:28:50 +07:00
Thang Pham	dc52436a8f	Fix bug when import large (>1GB) relations (#2172 ) Resolves #2097 - use timeline modification's `lsn` and timeline's `last_record_lsn` to determine the corresponding LSN to query data in `DatadirModification::get` - update `test_import_from_pageserver`. Split the test into 2 variants: `small` and `multisegment`. + `small` is the old test + `multisegment` is to simulate #2097 by using a larger number of inserted rows to create multiple segment files of a relation. `multisegment` is configured to only run with a `release` build	2022-08-12 09:24:20 +07:00

1 2 3 4 5 ...

422 Commits