There's a bug somewhere because this happens in python regression
tests. We need to hunt that down, but in any case, let's not get stuck
in an infinite loop if it happens.
Some tests are hitting this case, where pageserver returns 0 page
images in the response to a GetPage request. I suspect it's because
the code doesn't handle sharding correclty? In any case, let's not
panic on it, but return an IO error to the originating backend.
Add a new 'pageserver_connection_info' field in the compute spec. It
replaces the old 'pageserver_connstring' field with a more complicated
struct that includes both libpq and grpc URLs, for each shard (or only
one of the the URLs, depending on the configuration). It also includes
a flag suggesting which one to use; compute_ctl now uses it to decide
which protocol to use for the basebackup.
This is compatible with everything that's in production, because the
control plane never used the 'pageserver_connstring' field. That was
added a long time ago with the idea that it would replace the code
that digs the 'neon.pageserver_connstring' GUC from the list of
Postgres settings, but we never got around to do that in the control
plane. Hence, it was only used with neon_local. But the plan now is to
pass the 'pageserver_connection_info' from the control plane, and once
that's fully deployed everywhere, the code to parse
'neon.pageserver_connstring' in compute_ctl can be removed.
The 'grpc' flag on an endpoint in endpoint config is now more of a
suggestion. Compute_ctl gets both URLs, so it can choose to use libpq
or grpc as it wishes. It currently always obeys the 'prefer_grpc' flag
that's part of the connection info though. Postgres however uses grpc
iff the new rust-based communicator is enabled.
TODO/plan for the control plane:
- Start to pass `pageserver_connection_info` in the spec file.
- Also keep the current `neon.pageserver_connstring` setting for now,
for backwards compatibility with old computes
After that, the `pageserver_connection_info.prefer_grpc` flag in the
spec file can be used to control whether compute_ctl uses grpc or
libpq. The actual compute's grpc usage will be controlled by the
`neon.enable_new_communicator` GUC. It can be set separately from
'prefer_grpc'.
Later:
- Once all old computes are gone, remove the code to pass
`neon.pageserver_connstring`
This fixes errors like `index "pg_class_relname_nsp_index" contains
unexpected zero page at block 2` when running the python tests
smgrzeroextend() still called into the old LFC's lfc_write() function,
even when using the new communicator, which zeroed some arbitrary
pages in the LFC file, overwriting pages managed by the new LFC
implementation managed by `integrated_cache.rs`
## Problem
If LFC generation is changed then `lfc_readv_select` will return -1 but
pages are still marked as available in bitmap.
## Summary of changes
Update bitmap after generation check.
Co-authored-by: Kosntantin Knizhnik <konstantin.knizhnik@databricks.com>
- Add optional `?mode=fast|immediate` to `/terminate`, `fast` is
default. Immediate avoids waiting 30
seconds before returning from `terminate`.
- Add `TerminateMode` to `ComputeStatus::TerminationPending`
- Use `/terminate?mode=immediate` in `neon_local` instead of `pg_ctl
stop` for `test_replica_promotes`.
- Change `test_replica_promotes` to check returned LSN
- Annotate `finish_sync_safekeepers` as `noreturn`.
https://github.com/neondatabase/cloud/issues/29807
## Problem
https://github.com/neondatabase/neon/issues/7570
Even triggers are supported only for superusers.
## Summary of changes
Temporary switch to superuser when even trigger is created and disable
execution of user's even triggers under superuser.
---------
Co-authored-by: Dimitri Fontaine <dim@tapoueh.org>
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
DROP FUNCTION doesn't allow to specify default for parameters.
## Summary of changes
Remove DEFAULT clause from pgxn/neon/neon--1.6--1.5.sql
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
See https://github.com/neondatabase/neon/issues/12018
Now `prefetch_register_bufferv` calculates min_ring_index of all vector
requests.
But because of pump prefetch state or connection failure, previous slots
can be already proceeded and reused.
## Summary of changes
Instead of returning minimal index, this function should return last
slot index.
Actually result of this function is used only in two places. A first
place just fort checking (and this check is redundant because the same
check is done in `prefetch_register_bufferv` itself.
And in the second place where index of filled slot is actually used,
there is just one request.
Sp fortunately this bug can cause only assert failure in debug build.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
`VotesCollectedMset` uses the wrong safekeeper to update truncateLsn.
This led to some failed assert later in the code during running
safekeeper migration tests.
- Relates to https://github.com/neondatabase/neon/issues/11823
## Summary of changes
Use proper safekeeper to update truncateLsn in VotesCollectedMset
That avoids clashes if you run two computes at the same time. More
secure too. We might want to have a TCP port in the long run, but this
is less trouble for now.
To see the metrics with curl you can use:
curl --unix-socket .neon/endpoints/ep-main/pgdata/.metrics.socket http://localhost/metrics
## Problem
It will be useful to understand what kind of queries our clients are
executed.
And one of the most important characteristic of query is query execution
time - at least it allows to distinguish OLAP and OLTP queries. Also
monitoring query execution time can help to detect problem with
performance (assuming that workload is more or less stable).
## Summary of changes
Add query execution time histogram.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
`safekeepers_cmp` was added by #8840 to make changes of the safekeeper
set order independent: a `sk1,sk2,sk3` specifier changed to
`sk3,sk1,sk2` should not cause a walproposer restart. However, this
check didn't support generations, in the sense that it would see the
`g#123:` as part of the first safekeeper in the list, and if the first
safekeeper changes, it would also restart the walproposer.
Therefore, parse the generation properly and make it not be a part of
the generation.
This PR doesn't add a specific test, but I have confirmed locally that
`test_safekeepers_reconfigure_reorder` is fixed with the changes of PR
#11712 applied thanks to this PR.
Part of https://github.com/neondatabase/neon/issues/11670
Use that instead of the half-baked Adaptive Radix Tree
implementation. ART would probably be better in the long run, but more
complicated to implement.
In order to enable TLS connections between computes and safekeepers, we
need to provide the control plane with a way to configure the various
libpq keyword parameters, sslmode and sslrootcert. neon.safekeepers is a
comma separated list of safekeepers formatted as host:port, so isn't
available for extension in the same way that neon.pageserver_connstring
is. This could be remedied in a future PR.
Part-of: https://github.com/neondatabase/cloud/issues/25823
Link:
https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-PARAMKEYWORDS
Signed-off-by: Tristan Partin <tristan@neon.tech>
There were conflicts because of the differences in the page_api
protocol that was merged to main vs what was on the branch. I adapted
the code for the protocol in main.
## Problem
See https://github.com/neondatabase/neon/issues/11997
This guard prevents race condition with pump prefetch state (initiated
by timeout).
Assert checks that prefetching is also done under guard.
But prewarm knows nothing about it.
## Summary of changes
Pump prefetch state only in regular backends.
Prewarming is done by background workers now.
Also it seems to have not sense to pump prefetch state in any other
background workers: parallel executors, vacuum,... because they are
short living and can not leave unconsumed responses in socket.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>