Commit 43a4f7173e fixed the case that there are extra options in the
connection string, but broke it in the case when there are not. Fix
that. But on second thoughts, it's more straightforward set the
options with ALTER DATABASE, so change the workflow yaml file to do
that instead.
In commit 6985f6cd6c, I tried passing extra GUCs in the 'options' part
of the connection string, but it didn't work because the pgbench test
overrode it with the statement_timeout. Change it so that it adds the
statement_timeout to any other options, instead of replacing them.
Also get rid if `with_safekeepers` parameter in tests.
Its meaning has changed: `False` meant "no safekeepers" which is not
supported anymore, so we assume it's always `True`.
See #1648
Instead of spawning helper threads, we now use Tokio tasks. There
are multiple Tokio runtimes, for different kinds of tasks. One for
serving libpq client connections, another for background operations
like GC and compaction, and so on. That's not strictly required, we
could use just one runtime, but with this you can still get an
overview of what's happening with "top -H".
There's one subtle behavior in how TenantState is updated. Before this
patch, if you deleted all timelines from a tenant, its GC and
compaction loops were stopped, and the tenant went back to Idle
state. We no longer do that. The empty tenant stays Active. The
changes to test_tenant_tasks.py are related to that.
There's still plenty of synchronous code and blocking. For example, we
still use blocking std::io functions for all file I/O, and the
communication with WAL redo processes is still uses low-level unix
poll(). We might want to rewrite those later, but this will do for
now. The model is that local file I/O is considered to be fast enough
that blocking - and preventing other tasks running in the same thread -
is acceptable.
Commit f081419e68 moved all the prometheus counters to `metrics.rs`,
but accidentally replaced a couple of `register_int_counter!(...)`
calls with just `IntCounter::new(...)`. Because of that, the counters
were not registered in the metrics registry, and were not exposed
through the metrics HTTP endpoint.
Fixes failures we're seeing in a bunch of 'performance' tests because
of the missing metrics.
This caught or reproduced several bugs when I originally wrote this test
back in May, including #1731, #1740, #1751, and #707. I believe all the
issues have been fixed now, but since this was a very fruitful test,
let's add it to the test suite.
We didn't commit this earlier, because the test was very slow especially
with a debug build. We've since changed the build options so that even
the debug builds are not quite so slow anymore.
* Add test for pageserver metric cleanup once a tenant is detached.
* Remove tenant specific timeline metrics on detach.
* Use definitions from timeline_metrics in page service.
* Move metrics to own file from layered_repository/timeline.rs
* TIMELINE_METRICS: define smgr metrics
* REMOVE SMGR cleanup from timeline_metrics. Doesn't seem to work as
expected.
* Vritual file centralized metrics, except for evicted file as there's no
tenat id or timeline id.
* Use STORAGE_TIME from timeline_metrics in layered_repository.
* Remove timelineless gc metrics for tenant on detach.
* Rename timeline metrics -> metrics as it's more generic.
* Don't create a TimelineMetrics instance for VirtualFile
* Move the rest of the metric definitions to metrics.rs too.
* UUID -> ZTenantId
* Use consistent style for dict.
* Use Repository's Drop trait for dropping STORAGE_TIME metrics.
* No need for Arc, TimelineMetrics is used in just one place. Due to that,
we can fall back using ZTenantId and ZTimelineId too to avoid additional
string allocation.
* Add submodule postgres-15
* Support pg_15 in pgxn/neon
* Renamed zenith -> neon in Makefile
* fix name of codestyle check
* Refactor build system to prepare for building multiple Postgres versions.
Rename "vendor/postgres" to "vendor/postgres-v14"
Change Postgres build and install directory paths to be version-specific:
- tmp_install/build -> pg_install/build/14
- tmp_install/* -> pg_install/14/*
And Makefile targets:
- "make postgres" -> "make postgres-v14"
- "make postgres-headers" -> "make postgres-v14-headers"
- etc.
Add Makefile aliases:
- "make postgres" to build "postgres-v14" and in future, "postgres-v15"
- similarly for "make postgres-headers"
Fix POSTGRES_DISTRIB_DIR path in pytest scripts
* Make postgres version a variable in codestyle workflow
* Support vendor/postgres-v15 in codestyle check workflow
* Support postgres-v15 building in Makefile
* fix pg version in Dockerfile.compute-node
* fix kaniko path
* Build neon extensions in version-specific directories
* fix obsolete mentions of vendor/postgres
* use vendor/postgres-v14 in Dockerfile.compute-node.legacy
* Use PG_VERSION_NUM to gate dependencies in inmem_smgr.c
* Use versioned ECR repositories and image names for compute-node.
The image name format is compute-node-vXX, where XX is postgres major version number.
For now only v14 is supported.
Old format unversioned name (compute-node) is left, because cloud repo depends on it.
* update vendor/postgres submodule url (zenith->neondatabase rename)
* Fix postgres path in python tests after rebase
* fix path in regress test
* Use separate dockerfiles to build compute-node:
Dockerfile.compute-node-v15 should be identical to Dockerfile.compute-node-v14 except for the version number.
This is a hack, because Kaniko doesn't support build ARGs properly
* bump vendor/postgres-v14 and vendor/postgres-v15
* Don't use Kaniko cache for v14 and v15 compute-node images
* Build compute-node images for different versions in different jobs
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
* Update relation size cache only when latest LSN is requested
* Fix tests
* Add a test case for timetravel query after pageserver restart.
This test is currently failing, the queries return incorrect results.
I don't know why, needs to be investigated.
FAILED test_runner/batch_others/test_readonly_node.py::test_timetravel - assert 85 == 100000
If you remove the pageserver restart from the test, it passes.
* yapf3 test_readonly_node.py
* Add comment about cache correction in case of setting incorrect latest flag
* Fix formatting for test_readonly_node.py
* Remove unused imports
* Fix mypy warning for test_readonly_node.py
* Fix formatting of test_readonly_node.py
* Bump postgres version
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
Start the calculation on the first size request, return
partially calculated size during calculation, retry if failed.
Remove "fast" size init through the ancestor: the current approach is
fast enough for now and there are better ways to optimize the
calculation via incremental ancestor size computation
For better ergonomics. I always found it weird that we used UUID to
actually mean a tenant or timeline ID. It worked because it happened
to have the same length, 16 bytes, but it was hacky.
Merge batch_others and batch_pg_regress. The original idea was to
split all the python tests into multiple "batches" and run each batch
in parallel as a separate CI job. However, the batch_pg_regress batch
was pretty short compared to all the tests in batch_others. We could
split batch_others into multiple batches, but it actually seems better
to just treat them as one big pool of tests and use pytest's handle
the parallelism on its own. If we need to split them across multiple
nodes in the future, we could use pytest-shard or something else,
instead of managing the batches ourselves.
Merge test_neon_regress.py, test_pg_regress.py and test_isolation.py
into one file, test_pg_regress.py. Seems more clear to group all
pg_regress-based tests into one file, now that they would all be in
the same directory.
Previously, proxy didn't forward auxiliary `options` parameter
and other ones to the client's compute node, e.g.
```
$ psql "user=john host=localhost dbname=postgres options='-cgeqo=off'"
postgres=# show geqo;
┌──────┐
│ geqo │
├──────┤
│ on │
└──────┘
(1 row)
```
With this patch we now forward `options`, `application_name` and `replication`.
Further reading: https://www.postgresql.org/docs/current/libpq-connect.htmlFixes#1287.
* Add fork_at_current_lsn function which creates branch at current LSN
* Undo use of fork_at_current_lsn in test_branching because of short GC period
* Add missed return in fork_at_current_lsn
* Add missed return in fork_at_current_lsn
* Update test_runner/fixtures/neon_fixtures.py
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Update test_runner/fixtures/neon_fixtures.py
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
* Update test_runner/fixtures/neon_fixtures.py
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
Added pytest to check correctness of the link authentication pipeline.
Context: this PR is the first step towards refactoring the link authentication pipeline to use https (instead of psql) to send the db info to the proxy. There was a test missing for this pipeline in this repo, so this PR adds that test as preparation for the actual change of psql -> https.
Co-authored-by: Bojan Serafimov <bojan.serafimov7@gmail.com>
Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
Co-authored-by: Stas Kelvic <stas@neon.tech>
Co-authored-by: Dimitrii Ivanov <dima@neon.tech>
Including, but not limited to:
* Fixes to neon management code to support walproposer-as-an-extension
* Fix issue in expected output of pg settings serialization.
* Show the logs of a failed --sync-safekeepers process in CI
* Add compat layer for renamed GUCs in postgres.conf
* Update vendor/postgres to the latest origin/main
* Do not create initial tenant and timeline (adjust Python tests for that)
* Rework config handling during init, add --update-config to manage local config updates
This patch makes walreceiver logic more complicated, but it should work better in most cases. Added `test_wal_lagging` to test scenarios where alive safekeepers can lag behind other alive safekeepers.
- There was a bug which looks like `etcd_info.timeline.commit_lsn > Some(self.local_timeline.get_last_record_lsn())` filtered all safekeepers in some strange cases. I removed this filter, it should probably help with #2237
- Now walreceiver_connection reports status, including commit_lsn. This allows keeping safekeeper connection even when etcd is down.
- Safekeeper connection now fails if pageserver doesn't receive safekeeper messages for some time. Usually safekeeper sends messages at least once per second.
- `LaggingWal` check now uses `commit_lsn` directly from safekeeper. This fixes the issue with often reconnects, when compute generates WAL really fast.
- `NoWalTimeout` is rewritten to trigger only when we know about the new WAL and the connected safekeeper doesn't stream any WAL. This allows setting a small `lagging_wal_timeout` because it will trigger only when we observe that the connected safekeeper has stuck.
The new format has a few benefits: it's shorter, simpler and
human-readable as well. We don't use base64 anymore, since
url encoding got us covered.
We also show a better error in case we couldn't parse the
payload; the users should know it's all about passing the
correct project name.
This test failed consistently on `main` now. It's better to temporarily disable it to avoid blocking others' PRs while investigating the root cause for the test failure.
See: #2255, #2256
Resolves#2212.
- use `wait_for_last_flush_lsn` in `test_timeline_physical_size_*` tests
## Context
Need to wait for the pageserver to catch up with the compute's last flush LSN because during the timeline physical size API call, it's possible that there are running `LayerFlushThread` threads. These threads flush new layers into disk and hence update the physical size. This results in a mismatch between the physical size reported by the API and the actual physical size on disk.
### Note
The `LayerFlushThread` threads are processed **concurrently**, so it's possible that the above error still persists even with this patch. However, making the tests wait to finish processing all the WALs (not flushing) before calculating the physical size should help reduce the "flakiness" significantly
Resolves#2097
- use timeline modification's `lsn` and timeline's `last_record_lsn` to determine the corresponding LSN to query data in `DatadirModification::get`
- update `test_import_from_pageserver`. Split the test into 2 variants: `small` and `multisegment`.
+ `small` is the old test
+ `multisegment` is to simulate #2097 by using a larger number of inserted rows to create multiple segment files of a relation. `multisegment` is configured to only run with a `release` build