Commit Graph

1712 Commits

Author SHA1 Message Date
Kliment Serafimov
2bfef5514e Merged with main. 2022-06-02 00:31:41 +02:00
Kliment Serafimov
9ec312ce98 Merge branch 'main' of https://github.com/neondatabase/neon into main 2022-06-02 00:16:55 +02:00
Kliment Serafimov
6ec80c0015 Merge branch 'added-project-option' of https://github.com/neondatabase/neon into added-project-option 2022-06-02 00:16:15 +02:00
Kliment Serafimov
bcf5cd908e Merged changes. 2022-06-02 00:15:47 +02:00
Dmitry Ivanov
5f9924b7f6 [proxy] Propagate SASL/SCRAM auth errors to the user
This will replace the vague (and incorrect) "Internal error" with a nice
and helpful authentication error, e.g. "password doesn't match".
2022-06-02 00:15:47 +02:00
Dmitry Ivanov
ec483d705d [proxy] Refactoring
This patch attempts to fix some of the technical debt
we had to introduce in previous patches.
2022-06-02 00:15:47 +02:00
Thang Pham
a76fe9bf8a Fix test_pageserver_http_get_wal_receiver_success flaky test. (#1786)
Fixes #1768.

## Context

Previously, to test the `get_wal_receiver` API, we ran some DB transactions and then called the API to check the latest message's LSN from the WAL receiver. However, this test was unreliable because the WAL receiver is not guaranteed to have received the latest WAL from the postgres/safekeeper at the time of the API call.

This PR resolves the above issue by adding "poll and wait" code that waits until the latest data can be retrieved from the WAL receiver.

This PR also fixes a bug in which two hex LSNs were compared directly; they should be converted to numbers before the comparison. See: https://github.com/neondatabase/neon/issues/1768#issuecomment-1133752122.
2022-06-02 00:15:47 +02:00
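
The two fixes described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the test's actual code: `parse_lsn` and `wait_until` are hypothetical helper names, and the sketch assumes LSNs are printed in the usual PostgreSQL form of two hex numbers separated by a slash (for example `0/16960E8`).

```python
import time


def parse_lsn(text):
    """Convert an LSN such as '0/16960E8' to an integer (hypothetical helper)."""
    hi, lo = text.split("/")
    # The part before the slash is the high 32 bits, the part after is the low 32 bits.
    return (int(hi, 16) << 32) | int(lo, 16)


def wait_until(predicate, timeout=10.0, interval=0.1):
    """Poll predicate() until it returns True or `timeout` seconds elapse.

    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Comparing the raw strings orders `"0/9"` after `"0/10"`, while `parse_lsn("0/9") < parse_lsn("0/10")` gives the intended numeric order. A test could then call something like `wait_until(lambda: parse_lsn(reported) >= parse_lsn(expected))`, where `reported` and `expected` stand in for whatever values the test reads.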
Arseny Sher
0bc9ff571b Prevent commit_lsn <= flush_lsn violation after a42eba3cd7.
Nothing complained about that yet, but we definitely don't hold at least one
assert, so let's keep it this way until there is a better version.
2022-06-02 00:15:47 +02:00
Thang Pham
77366148ee Handle broken timelines on startup (#1809)
Resolve #1663.

## Changes

- ignore a "broken" [1] timeline on page server startup
- fix the race condition when creating multiple timelines in parallel for a tenant
- added tests for the above changes

[1]: a timeline is marked as "broken" if either
- loading the timeline's metadata failed, or
- the timeline's disk consistent LSN is zero
2022-06-02 00:15:47 +02:00
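
The "broken" rule from the commit above can be written as a small predicate. This is a sketch under assumptions: `TimelineMetadata` and `is_broken` are hypothetical names, and a failed metadata load is modelled here simply as `None`; the pageserver's real types are not shown in this log.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimelineMetadata:
    """Hypothetical stand-in for a timeline's on-disk metadata."""
    disk_consistent_lsn: int


def is_broken(metadata: Optional[TimelineMetadata]) -> bool:
    """A timeline counts as broken if its metadata failed to load
    (modelled as None) or its disk consistent LSN is zero."""
    return metadata is None or metadata.disk_consistent_lsn == 0
```

On startup, a loader following this rule would skip any timeline for which `is_broken(...)` returns `True` instead of aborting.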
Arseny Sher
9aadbc316d s3 WAL offloading staging review.
- Uncomment the accidentally commented-out `self.keep_alive.abort()` line; because
  of it the task never finished, which blocked the launcher.
- Mess with initialization one more time, to fix the offloader trying to back up
  segment 0. Now we initialize all required LSNs in handle_elected,
  where we learn the start LSN for the first time.
- Fix blind attempt to provide safekeeper service file with remote storage
  params.
2022-06-02 00:15:47 +02:00
Arseny Sher
619515d935 Add WAL offloading to s3 on safekeepers.
A separate task is launched for each timeline and stopped when the timeline doesn't
need offloading. The decision on who offloads is made through etcd leader election;
currently there is no precondition for participating, that's a TODO.

neon_local and tests infrastructure for remote storage in safekeepers added,
along with the test itself.

ref #1009

Co-authored-by: Anton Shyrabokau <ahtoxa@Antons-MacBook-Pro.local>
2022-06-02 00:15:47 +02:00
bojanserafimov
b763adaf8a Change proxy welcome message (#1808)
Remove zenith sun and outdated instructions around .pgpass
2022-06-02 00:15:47 +02:00
Thang Pham
1314bb483f Reduce the logging level when PG client disconnected to INFO (#1713)
Fixes #1683.
2022-06-02 00:15:47 +02:00
Dmitry Rodionov
e1eb53ac59 Tidy up some log messages
* turn println into an info with proper message
* rename new_local_timeline to load_local_timeline because it does not
  create new timeline, it registers timeline that exists on disk in
  pageserver in-memory structures
2022-06-02 00:15:47 +02:00
Konstantin Knizhnik
837aeb77ac Initialize last_freeze_at with disk consistent LSN to avoid creation of small L0 delta layer on startup
refer #1736
2022-06-02 00:15:47 +02:00
Dmitry Rodionov
c46bf93808 allow TLS 1.2 in proxy to be compatible with older client libraries 2022-06-02 00:15:47 +02:00
Dmitry Rodionov
f5e6b1c525 add simple metrics for remote storage operations
track number of operations and number of their failures
2022-06-02 00:15:47 +02:00
Kirill Bulatov
a15470e3d6 Move rustfmt check to GH Action 2022-06-02 00:15:47 +02:00
Kirill Bulatov
96bda79092 Run basic checks on PRs and pushes to main only 2022-06-02 00:15:47 +02:00
chaitanya sharma
98a1a2b3cd initial commit, renamed znodeid to nodeid. 2022-06-02 00:15:47 +02:00
Heikki Linnakangas
bf6428971e Fix error handling with 'basebackup' command.
If the 'basebackup' command failed in the middle of building the tar
archive, the client would not report the error, but would attempt
to start up postgres with the partial contents of the data directory.
That fails because the control file is missing (it's added to the
archive last, precisely to make sure that you cannot start postgres
from a partial archive). But the client doesn't see the proper error
message that caused the basebackup to fail in the server, which is
confusing.

Two issues conspired to cause that:

1. The tar::Builder object that we use in the pageserver to construct
the tar stream has a Drop handler that automatically writes a valid
end-of-archive marker on drop. Because of that, the resulting tarball
looks complete, even if an error happens while we're building it. The
pageserver does send an ErrorResponse after the seemingly-valid
tarball, but:

2. The client stops reading the Copy stream, as soon as it sees the
tar end-of-archive marker. Therefore, it doesn't read the
ErrorResponse that comes after it.

We have two clients that call 'basebackup', one in `control_plane`
used by the `neon_local` binary, and another one in
`compute_tools`. Both had the same issue.

This PR fixes both issues, even though fixing either one would be
enough to fix the problem at hand. The pageserver now doesn't send the
end-of-archive marker on error, and the client now reads the copy
stream to the end, even if it sees an end-of-archive marker.

Fixes github issue #1715

In passing, change Basebackup to use generic Write rather than
'dyn'.
2022-06-02 00:15:47 +02:00
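
The interaction between the two issues in the commit above can be illustrated with a simplified model. This is a sketch under assumptions, not the pageserver or client code: the stream is modelled as a list of `("data", bytes)` and `("error", str)` messages, all function names are hypothetical, and the end-of-archive check is a crude "buffer ends with two zero blocks" heuristic. Python's `tarfile` plays the role of the archive builder here because closing it also writes a valid end marker.

```python
import io
import tarfile

# A tar archive ends with two 512-byte blocks of zeros.
END_OF_ARCHIVE = bytes(1024)


class BasebackupError(Exception):
    """Raised when the simulated server reports a failure."""


def make_failed_stream():
    """Simulate a server that fails midway but still emits a closed tarball."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w") as tar:
        info = tarfile.TarInfo("partial_file")
        info.size = 5
        tar.addfile(info, io.BytesIO(b"hello"))
    # Leaving the 'with' block closes the archive and writes the end marker,
    # loosely analogous to the Drop handler described above.
    return [("data", raw.getvalue()), ("error", "simulated basebackup failure")]


def read_naive(messages):
    """Buggy client: stops at the end-of-archive marker and misses the error."""
    buf = bytearray()
    for kind, payload in messages:
        if kind == "data":
            buf += payload
            if buf.endswith(END_OF_ARCHIVE):  # heuristic, for illustration only
                break
    return bytes(buf)


def read_to_end(messages):
    """Fixed client: drains every message, so a trailing error is surfaced."""
    buf = bytearray()
    for kind, payload in messages:
        if kind == "error":
            raise BasebackupError(payload)
        buf += payload
    return bytes(buf)
```

With the stream from `make_failed_stream()`, `read_naive` returns a tarball that looks complete, while `read_to_end` raises `BasebackupError`. The server-side half of the fix (not writing the end marker on error) is not modelled here.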
Heikki Linnakangas
b85d284f08 Set --quota-backend-bytes when launching etcd in tests.
By default, etcd makes a huge 10 GB mmap() allocation when it starts up.
It doesn't actually use that much memory, it's just address space, but
it caused me grief when I tried to use 'rr' to debug a python test run.
Apparently, when you replay the 'rr' trace, it does allocate memory for
all that address space.

The size of the initial mmap depends on the --quota-backend-bytes setting.
Our etcd clusters are very small, so let's set --quota-backend-bytes to
keep the virtual memory size small, to make debugging with 'rr' easier.

See https://github.com/etcd-io/etcd/issues/7910 and
5e4b008106
2022-06-02 00:15:47 +02:00
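
As a rough sketch of what the commit above describes, a test harness might build the etcd command line like this. The helper name `etcd_command` and the 100 MiB quota are assumptions for illustration; the commit does not state the value that was chosen. `--data-dir` and `--quota-backend-bytes` are existing etcd flags.

```python
from typing import List


def etcd_command(data_dir: str, quota_bytes: int = 100 * 1024 * 1024) -> List[str]:
    """Build an etcd argv with a small backend quota (illustrative value).

    A smaller quota keeps etcd's initial mmap, and thus its virtual
    memory size, small.
    """
    return [
        "etcd",
        f"--data-dir={data_dir}",
        f"--quota-backend-bytes={quota_bytes}",
    ]
```

The resulting list can be passed to `subprocess.Popen` by whatever fixture starts etcd for the tests.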
Andrey Taranik
164f8f8124 helm repository name fix for production proxy deploy (#1790) 2022-06-02 00:15:47 +02:00
Heikki Linnakangas
45792c25be Improve error messages on seccomp loading errors.
Bump vendor/postgres for https://github.com/neondatabase/postgres/pull/166
2022-06-02 00:15:47 +02:00
Andrey Taranik
cffea24d20 production inventory update (#1779) 2022-06-02 00:15:47 +02:00
Arseny Sher
fc0b51819c Disable restart_after_crash in neon_local.
It is pointless when basebackup is invalid.
2022-06-02 00:15:47 +02:00
Sergey Melnikov
1d18b813b2 Add zenith-us-stage-sk-6 to deploy (#1728) 2022-06-02 00:15:47 +02:00
Kirill Bulatov
fedcc71c01 Properly shutdown test mock S3 server 2022-06-02 00:15:47 +02:00
KlimentSerafimov
a3238cd69d Potential fix to #1626. Fixed typo in Makefile. (#1781)
* Potential fix to #1626. Fixed typo in Makefile.
* Completed fix to #1626.

Summary:
changed 'error' to 'bail' in start_pageserver and start_safekeeper.
2022-06-02 00:15:47 +02:00
Heikki Linnakangas
f12fa69c9f Fix garbage collection to not remove image layers that are still needed.
The logic would incorrectly remove an image layer, if a new image layer
existed, even though the older image layer was still needed by some
delta layers after it. See example given in the comment this adds.

Without this fix, I was getting a lot of "could not find data for key
010000000000000000000000000000000000" errors from GC, with the new test
case being added in PR #1735.

Fixes #707
2022-06-02 00:15:47 +02:00
Kliment Serafimov
bbe7bc4dc1 Merge branch 'main' into HEAD 2022-06-02 00:05:40 +02:00
Kliment Serafimov
c5f3c9bbc7 Merged changes. 2022-06-02 00:04:26 +02:00
Kirill Bulatov
de7eda2dc6 Fix url path printing 2022-06-02 00:48:10 +03:00
Dmitry Rodionov
1188c9a95c remove extra span as this code is already covered by create timeline span
E.g. this log line contains duplicated data:
INFO /timeline_create{tenant=8d367870988250a755101b5189bbbc17
  new_timeline=Some(27e2580f51f5660642d8ce124e9ee4ac) lsn=None}:
  bootstrapping{timeline=27e2580f51f5660642d8ce124e9ee4ac
  tenant=8d367870988250a755101b5189bbbc17}:
  created root timeline 27e2580f51f5660642d8ce124e9ee4ac
  timeline.lsn 0/16960E8

this avoids variable duplication in `bootstrapping` subspan
2022-06-01 19:29:17 +03:00
Kirill Bulatov
e5cb727572 Replace callmemaybe with etcd subscriptions on safekeeper timeline info 2022-06-01 16:07:04 +03:00
Dmitry Rodionov
6623c5b9d5 add installation instructions for Fedora Linux 2022-06-01 15:59:53 +03:00
Anton Chaporgin
e5a2b0372d remove sk1 from inventory (#1845)
https://github.com/neondatabase/cloud/issues/1454
2022-06-01 15:40:45 +03:00
Alexey Kondratov
af6143ea1f Install missing openssl packages in the Github Actions workflow 2022-05-31 23:12:30 +03:00
Alexey Kondratov
ff233cf4c2 Use :local compute-tools tag to build compute-node image 2022-05-31 23:12:30 +03:00
Dmitry Rodionov
b1b67cc5a0 improve test normal work to start several computes 2022-05-31 22:42:11 +03:00
bojanserafimov
ca10cc12c1 Close file descriptors for redo process (#1834) 2022-05-31 14:14:09 -04:00
Thang Pham
c97cd684e0 Use HOMEBREW_PREFIX instead of hard-coded path (#1833) 2022-05-31 11:20:51 -04:00
Ryan Russell
54e163ac03 Improve Readability in Docs
Signed-off-by: Ryan Russell <ryanrussell@users.noreply.github.com>
2022-05-31 17:22:47 +03:00
Konstantin Knizhnik
595a6bc1e1 Bump vendor/postgres to fix basebackup LSN comparison. (#1835)
Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
2022-05-31 14:47:06 +03:00
Arthur Petukhovsky
c3e0b6c839 Implement timeline-based metrics in safekeeper (#1823)
Now there's a timeline metrics collector, which goes through all timelines and reports metrics only for active ones.
2022-05-31 11:10:50 +03:00
Arseny Sher
36281e3b47 Extend test_wal_backup with compute restart. 2022-05-30 13:57:17 +04:00
Anastasia Lubennikova
e014cb6026 rename zenith.zenith_tenant to neon.tenant_id in test 2022-05-30 12:24:44 +03:00
Anastasia Lubennikova
915e5c9114 Rename 'zenith_admin' to 'cloud_admin' on compute node start 2022-05-30 11:11:01 +03:00
Anastasia Lubennikova
67d6ff4100 Rename custom GUCs:
- zenith.zenith_tenant -> neon.tenant_id
- zenith.zenith_timeline -> neon.timeline_id
2022-05-30 11:11:01 +03:00
Anastasia Lubennikova
6a867bce6d Rename 'zenith_admin' role to 'cloud_admin' 2022-05-30 11:11:01 +03:00