## Problem
Init fork is used in DEBUG_COMPARE_LOCAL to determine unlogged relation
or unlogged build.
But it is created only after the relation is initialized and so can be
swapped out, producing `Page is evicted with zero LSN` error.
## Summary of changes
Create init fork together with main fork for unlogged relations in
DEBUG_COMPARE_LOCAL mode.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
clang produce warning about unused variable `n_synced` in
HandleSafekeeperResponse
## Summary of changes
Remove local variable.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Pass more neon ids to compute_ctl.
Expose them to postgres as neon extension GUCs:
neon.project_id, neon.branch_id, neon.endpoint_id.
This is the compute side PR, not yet supported by cplane.
## Problem
Walproposer should get elected and commit WAL on safekeepers specified
by the membership configuration.
## Summary of changes
- Add to wp `members_safekeepers` and `new_members_safekeepers` arrays
mapping configuration members to connection slots. Establish this
mapping (by node id) when safekeeper sends greeting, giving its id and
when mconf becomes known / changes.
- Add to TermsCollected, VotesCollected,
GetAcknowledgedByQuorumWALPosition membership aware logic. Currently it
partially duplicates existing one, but we'll drop the latter eventually.
- In python, rename Configuration to MembershipConfiguration for
clarity.
- Add test_quorum_sanity testing new logic.
ref https://github.com/neondatabase/neon/issues/10851
pagestore_smgr.c had grown pretty large. Split into two parts, such
that the smgr routines that PostgreSQL code calls stays in
pagestore_smgr.c, and all the prefetching logic and other lower-level
routines related to communicating with the pageserver are moved to a
new source file, "communicator.c".
There are plans to replace communicator parts with a new
implementation. See https://github.com/neondatabase/neon/pull/10799.
This commit doesn't implement any of the new things yet, but it is
good preparation for it. I'm imagining that the new implementation
will approximately replace the current "communicator.c" code, exposing
roughly the same functions to pagestore_smgr.c.
This commit doesn't change any functionality or behavior, or make any
other changes to the existing code: It just moves existing code
around.
## Problem
Support of unlogged build in DEBUG_COMPARE_LOCAL.
Neon SMGR treats present of local file as indicator of unlogged
relations.
But it doesn't work in DEBUG_COMPARE_LOCAL mode.
## Summary of changes
Use INIT_FORKNUM as indicator of unlogged file and create this file
while unlogged index build.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Also, move the call to the lfc_init() function. It was weird to have it
in libpagestore.c, when libpagestore.c otherwise had nothing to do with
the LFC. Move it directly into _PG_init()
## Problem
If the local file cache is shrunk, so that we punch some holes in the
underlying file, the local_cache view displays the holes incorrectly.
See https://github.com/neondatabase/neon/issues/10770
## Summary of changes
Skip hole tags in the local_cache view.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
The 'neon_read' function needs to have a different prototype on PG < 16,
because it's part of the smgr interface. But neon_read_at_lsn doesn't
have that restriction.
Remove useless and often wrong IDENTIFICATION comments. PostgreSQL
sources have them, mostly for historical reasons, but there's no need
for us to copy that style.
Remove unnecessary #includes in header files, putting the #includes
directly in the .c files that need them. The principle is that a header
file should #include other header files if they need definitions from
them, such that each header file can be compiled on its own, but not
other #includes. (There are tools to enforce that, but this was just a
manual clean up of violations that I happened to spot.)
## Problem
See https://neondb.slack.com/archives/C04DGM6SMTM/p1741594233757489
Consider the following scenario:
1. Backend A wants to prefetch some block B
2. Backend A checks that block B is not present in shared buffer
3. Backend A registers new prefetch request and calls
prefetch_do_request
4. prefetch_do_request calls neon_get_request_lsns
5. neon_get_request_lsns obtains LwLSN for block B
6. Backend B downloads B, updates and wallogs it (let say to Lsn1)
7. Block B is once again thrown from shared buffers, its LwLSN is set to
Lsn1
8. Backend A obtains current flush LSN, let's say that it is Lsn1
9. Backend A stores Lsn1 as effective_lsn in prefetch slot.
10. Backend A reads page B with LwLSN=Lsn1
11. Backend A finds in prefetch ring response for prefetch request for
block B with effective_lsn=Lsn1, so that it satisfies
neon_prefetch_response_usable condition
12. Backend A uses deteriorated version of the page!
## Summary of changes
Use `not_modified_since` as `effective_lsn`.
It should not cause some degrade of performance because we store LwLSN
when it was not found in LwLSN hash, so if page is not changed till
prefetch response is arrived, then LwLSN should not be changed.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
Commit
3da70abfa5
cause noticeable performance regression (40% in update-with-prefetch in
test_bulk_update):
https://neondb.slack.com/archives/C04BLQ4LW7K/p1742633167580879
## Summary of changes
Remove loop from pageserver_try_receive to make it fetch not more than
one response. There is still loop in `pump_prefetch_state` which can
fetch as many responses as available.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Previously we had different meanings for the bitmask of vector IOps.
That has now been unified to "bit set = final result, no more
scribbling".
Furthermore, the LFC read path scribbled on pages that were already
read; that's probably not a good thing so that's been fixed too. In
passing, the read path of LFC has been updated to read only the
requested pages into the provided buffers, thus reducing the IO size of
vectorized IOs.
## Problem
## Summary of changes
## Problem
Macro IS_LOCAL_REL used for DEBUG_COMPARE_LOCAL mode use greater-than
rather than greater-or-equal comparison while first table really is
assigned FirstNormalObjectId.
## Summary of changes
Replace strict greater with greater-or-equal comparison.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
Now that stuck connections are quickly terminated it's not easy to
quickly find the right port from the pid to correlate the connection
with the one seen on pageserver side.
## Summary of changes
Call getsockname() and include the local port number in the
no-response-from-pageserver log messages.
## Problem
See https://neondb.slack.com/archives/C04DGM6SMTM/p1741176713523469
The problem is that this function is using `PQgetCopyData(shard->conn,
&resp_buff.data, 1 /* async = true */)`
to try to fetch next message. But this function returns 0 if the whole
message is not present in the buffer.
And input buffer may contain only part of message so result is not
fetched.
## Summary of changes
Use `PQisBusy` + `WaitEventSetWait` to check if data is available and
`PQgetCopyData(shard->conn, &resp_buff.data, 0)` to read whole message
in this case.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
We currently have this code duplicated across different PG versions.
Moving this to an extension would reduce duplication and simplify
maintenance.
## Summary of changes
Moving the LastWrittenLSN code from PG versions to the Neon extension
and linking it with hooks.
Related Postgres PR: https://github.com/neondatabase/postgres/pull/590
Closes: https://github.com/neondatabase/neon/issues/10973
---------
Co-authored-by: Tristan Partin <tristan@neon.tech>
## Problem
Unlogged build is used for GIST/SPGIST/GIN/HNSW indexes.
In this mode we first change relation class to `RELPERSISTENCE_UNLOGGED`
and save them on local disk.
But we do not save unlogged relations in LFC.
It may cause fetching incorrect value from LFC if relfilenode is reused.
## Summary of changes
Save modified pages in LFC on second stage of unlogged build (when
modified pages are walloged).
There is no need to save pages in LFC at first phase because the will be
in any case overwritten with assigned LSN at second phase.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
In unlogged index build (used fir GIST/SPGIST/GIN indexes) files is
created on disk and then removed at the end.
It contradicts to the logic of DEBUG_COMPARE_LOCAL mode.
## Summary of changes
Do not create and unlink files in unlogged build in DEBUG_COMPARE_LOCAL
mode.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
DEBUG_COMPARE_LOCAL is not supported in neon_zeroextend added in PG16
## Summary of changes
Add support of DEBUG_COMPARE_LOCAL in neon_zeroextend
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
https://github.com/neondatabase/neon/issues/10851
## Summary of changes
Do some refactoring before making walproposer generations aware.
- Rename SS_VOTING to SS_WAIT_VOTING, SS_IDLE to SS_WAIT_ELECTED
- Continue to get rid of epochs: rename GetEpoch to GetLastLogTerm,
donorEpoch to donorLastLogTerm
- Instead of counting n_votes, n_connected, introduce explicit
WalProposerState (collecting terms / voting / elected). Refactor out
TermsCollected and VotesCollected; they will determine state transition
differently depending whether generations are enabled or not.
There is no new logic in this PR and thus no new tests.
## Problem
Async prefetch in LFC PR cause incorrect calculation of LFC
`used_pages`when page is overwritten
## Summary of changes
Decrement `used_pages` is page is overwritten.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Matthias van de Meent <matthias@neon.tech>
This allows for improved decoding of otherwise broken WAL.
## Problem
Currently, if (or when) a WAL record has a wrong prevptr, that breaks
decoding. With this, we don't have to break on that if we decide it's OK
to proceed after that.
## Summary of changes
Use a Neon GUC to allow the system to enable the NEON-specific
skip_lsn_checks option in XLogReader.
## Problem
See https://neondb.slack.com/archives/C08DE6Q9C3B
Sometimes compute is not able to receive responses from PS for a long
time (minutes).
I do not think that the problem is at compute side, but in order to
exclude this possibility I wan to see more information about connection
state at compute side, particularly amount of data cached in connection
buffer.
## Summary of changes
Right now we are dumping state of socket buffer.
This PR adds printing state of connection buffer.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
This PR adds a new key to neon.neon_lfc_stats —
'file_cache_chunk_size_pages'. It just returns the value of
BLOCKS_PER_CHUNK from the LFC implementation.
The new value should (eventually) allow changing the chunk size without
breaking any places that rely on LFC stats values measured in number of
chunks.
See neondatabase/cloud#25170 for more.
## Problem
New async prefetch introduces `prefetch+lookup[` function which is
called before LFC lookup to check if prefetch request is already
completed.
This function is not containing now check that response is actually
`T_NeonGetPageResponse` (and not error).
## Summary of changes
Add checks for response tag.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
Preparation for https://github.com/neondatabase/neon/issues/10851
## Summary of changes
Add walproposer `safekeepers_generations` field which can be set by
prefixing `neon.safekeepers` GUC with `g#n:`. Non zero value (n) forces
walproposer to use generations. In particular, this also disables
implicit timeline creation as timeline will be created by storcon. Add
test checking this. Also add missing infra: `--safekeepers-generation`
flag to neon_local endpoint start + fix `--start-timeout` flag: it
existed but value wasn't used.
## Problem
See https://neondb.slack.com/archives/C033RQ5SPDH/p1740157873114339
smgrextend for FSM fork is called during page reconstruction by walredo
process causing overflow of inmem SMGR (64 pages).
## Summary of changes
Do not store zero pages in inmem SMGR because `inmem_read` returns zero
page if it is not able to locate specified block.
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
We see `Inmem storage overflow` in page server logs:
https://neondb.slack.com/archives/C033RQ5SPDH/p1740157873114339
walked process is using inseam SMGR with storage size limited by 64
pages with warning watermark 32 (based ion the assumption that
XLR_MAX_BLOCK_ID is 32, so WAL record can not access more than 32
pages).
Actually it is not true. We can update up to 3 forks for each block
(including update of FSM and VM forks).
## Summary of changes
This PR increases inseam SMGR size for walled process to 100 pages and
print stack trace in case of overflow.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
PR https://github.com/neondatabase/neon/pull/10442 added prefetch_lookup
function.
It changed handling of getpage requests in compute.
Before:
1. Lookup in LFC (return if found)
2. Register prefetch buffer
3. Wait prefetch result (increment getpage_hist)
Now:
1. Lookup prefetch ring (return if prefetch request is already
completed)
2. Lookup in LFC (return if found)
3. Register prefetch buffer
4. Wait prefetch result (increment getpage_hist)
So if prefetch result is already available, then get page histogram is
not incremented.
It case failure of some test_throughtput benchmarks:
https://neondb.slack.com/archives/C033RQ5SPDH/p1740425527249499
## Summary of changes
Increment getpage histogram in `prefetch_lookup`
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
As part of https://github.com/neondatabase/neon/issues/8614 we need to
pass membership configurations between compute and safekeepers.
## Summary of changes
Add version 3 of the protocol carrying membership configurations.
Greeting message in both sides gets full conf, and other messages
generation number only. Use protocol bump to include other accumulated
changes:
- stop packing whole structs on the wire as is;
- make the tag u8 instead of u64;
- send all ints in network order;
- drop proposer_uuid, we can pass it in START_WAL_PUSH and it wasn't
much useful anyway.
Per message changes, apart from mconf:
- ProposerGreeting: tenant / timeline id is sent now as hex cstring.
Remove proto version, it is passed outside in START_WAL_PUSH. Remove
postgres timeline, it is unused. Reorder fields a bit.
- AcceptorGreeting: reorder fields
- VoteResponse: timeline_start_lsn is removed. It can be taken from
first member of term history, and later we won't need it at all when all
timelines will be explicitly created. Vote itself is u8 instead of u64.
- ProposerElected: timeline_start_lsn is removed for the same reasons.
- AppendRequest: epoch_start_lsn removed, it is known from term history
in ProposerElected.
Both compute and sk are able to talk v2 and v3 to make rollbacks (in
case we need them) easier; neon.safekeeper_proto_version GUC sets the
client version. v2 code can be dropped later.
So far empty conf is passed everywhere, future PRs will handle them.
To test, add param to some tests choosing proto version; we want to test
both 2 and 3 until we fully migrate.
ref https://github.com/neondatabase/neon/issues/10326
---------
Co-authored-by: Arthur Petukhovsky <petuhovskiy@yandex.ru>
We've seen some cases in production where a compute doesn't get a
response to a pageserver request for several minutes, or even more. We
haven't found the root cause for that yet, but whatever the reason is,
it seems overly optimistic to think that if the pageserver hasn't
responded for 2 minutes, we'd get a response if we just wait patiently a
little longer. More likely, the pageserver is dead or there's some kind
of a network glitch so that the TCP connection is dead, or at least
stuck for a long time. Either way, it's better to disconnect and
reconnect. I set the default timeout to 2 minutes, which should be
enough for any GetPage request under normal circumstances, even if the
pageserver has to download several layer files from remote storage.
Make the disconnect timeout configurable. Also make the "log interval",
after which we print a message to the log configurable, so that if you
change the disconnect timeout, you can set the log timeout
correspondingly. The default log interval is still 10 s. The new GUCs
are called "neon.pageserver_response_log_timeout" and
"neon.pageserver_response_disconnect_timeout".
Includes a basic test for the log and disconnect timeouts.
Implements issue #10857
WaitEventSetWait() returns the number of "events" that happened, and
only that many events in the WaitEvent array are updated. When the
timeout is reached, it returns 0 and does not modify the WaitEvent array
at all. We were reading 'event.events' without checking the return
value, which would be uninitialized when the timeout was hit.
No test included, as this is harmless at the moment. But this caused the
test I'm including in PR #10882 to fail. That PR changes the logic to
loop back to retry the PQgetCopyData() call if WL_SOCKET_READABLE was
set. Currently, making an extra call to PQconsumeInput() is harmless,
but with that change in logic, it turns into a busy-wait.
## Problem
Prefetch is performed locally, so different backers can request the same
pages form PS.
Such duplicated request increase load of page server and network
traffic.
Making prefetch global seems to be very difficult and undesirable,
because different queries can access chunks on different speed. Storing
prefetch chunks in LFC will not completely eliminate duplicates, but can
minimise such requests.
The problem with storing prefetch result in LFC is that in this case
page is not protected by share buffer lock.
So we will have to perform extra synchronisation at LFC side.
See:
https://neondb.slack.com/archives/C0875PUD0LC/p1736772890602029?thread_ts=1736762541.116949&cid=C0875PUD0LC
@MMeent implementation of prewarm:
See https://github.com/neondatabase/neon/pull/10312/
## Summary of changes
Use conditional variables to sycnhronize access to LFC entry.
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem
Refer https://github.com/neondatabase/neon/issues/10885
Wait position in ring buffer to restrict number of in-flight requests is
not correctly calculated.
## Summary of changes
Update condition and remove redundant assertion
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Instead of hardcoding the request timeout, let's make it configurable as
a PGC_SUSET GUC.
Additionally, add a connect timeout GUC. Although the extension server
runs on the compute, it is always best to keep operations from hanging.
Better to present a timeout error to the user than a stuck backend.
Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
See https://github.com/neondatabase/neon/issues/10839
rho(x,b) functions returns values in range [1,b+1] and addSHLL tries to
store it in array of size b+1.
## Summary of changes
Subtract 1 fro value returned by rho
---------
Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>