Commit Graph

273 Commits

Author SHA1 Message Date
Thang Pham
e4a70faa08 Add more information to timeline-related APIs (#1673)
Resolves #1488.

- implemented `GET tenant/:tenant_id/timeline/:timeline_id/wal_receiver` endpoint
- returned `thread_id` in `thread_mgr::spawn` 
- added `latest_gc_cutoff_lsn` field to `LocalTimelineInfo` struct
2022-05-16 11:05:43 -04:00
Kirill Bulatov
33cac863d7 Test simple.conf and handle broker_endpoints better 2022-05-16 12:07:35 +03:00
Heikki Linnakangas
51ea9c3053 Don't swallow panics when the pageserver is build with failpoints.
It's very confusing, and because you don't get a stack trace and error
message in the logs, makes debugging very hard. However, the
'test_pageserver_recovery' test relied on that behavior. To support that,
add a new "exit" action to the pageserver 'failpoints' command, so that
you can explicitly request to exit the process when a failpoint is hit.
2022-05-16 09:58:58 +03:00
Heikki Linnakangas
a10cac980f Continue with pageserver startup, if loading some tenants fail.
Fixes https://github.com/neondatabase/neon/issues/1664
2022-05-15 00:25:38 +03:00
Egor Suvorov
768c846eeb Fix test_delete_force from #1653 conflicting with #1692 2022-05-13 17:36:18 +02:00
Anastasia Lubennikova
a2561f0a78 Use tenant's pitr_interval instead of hardroded 0 in the command.
Adjust python tests that use the
2022-05-13 18:32:14 +03:00
Anastasia Lubennikova
aa7c601eca Fix pitr_interval check in GC:
Use timestamp->LSN mapping instead of file modification time.
Fix 'latest_gc_cutoff_lsn' - set it to the minimum of pitr_cutoff and gc_cutoff.
Add new test: test_pitr_gc
2022-05-13 18:32:14 +03:00
Egor Suvorov
bf899a57d9 Safekeeper: add timeline/tenant force delete HTTP endpoings (closes #895)
* There is no auth in Safekeeper HTTP at all currently,
  so simply calling `check_permission` is not enough.
* There are no checks of Safekeeper still working with the data,
  as "still working" is burry now: a timeline may be "active"
  while there are no compute nodes and all data is propagated.
* Still, callmemaybe is deactivated, and timeline is removed from the
  internal map. It can easily sneak back in case of race conditions
  and implicit creations, though.
2022-05-13 15:43:52 +02:00
Kirill Bulatov
85884a1599 Disable tenant relocation python test 2022-05-13 01:26:38 +03:00
Thang Pham
ae20751724 update ZenithCli::create_tenant return signature (#1692)
to include the initial timeline's ID in addition to the new tenant's ID.

Context: follow-up of https://github.com/neondatabase/neon/pull/1689
2022-05-12 17:27:08 -04:00
Thang Pham
5812e26b90 Create an initial timeline on CLI tenant creation (#1689)
Resolves #1655
2022-05-12 16:33:09 -04:00
Arthur Petukhovsky
ec8861b8cc Fix pageserver metrics names (#1682)
Try to follow Prometheus style-guide https://prometheus.io/docs/practices/naming/ for metrics names. More specifically:
- Use `pageserver_` prefix for all pagserver metrics
- Specify `_seconds` unit in time metrics
- Use unit as a suffix in other cases, such as `_hits`, `_bytes`, `_records`
- Use `_total` suffix for accumulating counters (note that Histograms append that suffix internally)
2022-05-12 19:53:07 +03:00
Kirill Bulatov
de37f982db Share the remote storage as a crate 2022-05-07 00:30:36 +03:00
bojanserafimov
ef40e404cf Rename zenith crate to neon_local (#1625) 2022-05-05 19:06:53 -04:00
Heikki Linnakangas
30a7598172 Some copy-editing. 2022-05-05 22:35:15 +03:00
Heikki Linnakangas
1ad5658d9c Fix typos 2022-05-05 22:35:15 +03:00
Dmitry Rodionov
954859f6c5 add readme for performance tests with the current state of things 2022-05-05 22:35:15 +03:00
Dmitry Rodionov
0f3ec83172 avoid detach with alive branches 2022-05-05 12:54:42 +03:00
bojanserafimov
02e5083695 Add hot page test (#1479) 2022-05-04 12:45:01 -04:00
Anastasia Lubennikova
e2cf77441d Implement pg_database_size().
In this implementation dbsize equals sum of all relation sizes, excluding shared ones.
2022-05-04 18:14:45 +03:00
Arseny Sher
b9fd8a36ad Remember timeline_start_lsn and local_start_lsn on safekeeper.
Make it remember when timeline starts in general and on this safekeeper in
particular (the point might be later on new safekeeper replacing failed one).

Bumps control file and walproposer protocol versions.

While protocol is bumped, also add safekeeper node id to
AcceptorProposerGreeting.

ref #1561
2022-05-04 14:32:03 +04:00
Anastasia Lubennikova
2f9b17b9e5 Add simple test of pageserver recovery after crash. To cause a crash, use failpoints in checkpointer 2022-05-03 17:13:09 +03:00
Heikki Linnakangas
9ede38b6c4 Support finding LSN from a commit timestamp.
A new `get_lsn_by_timestamp` command is added to the libpq page service
API.

An extra timestamp field is now stored in an extra field after each
Clog page. It is the timestamp of the latest commit, among all the
transactions on the Clog page. To find the overall latest commit, we
need to scan all Clog pages, but this isn't a very frequent operation
so that's not too bad.

To find the LSN that corresponds to a timestamp, we perform a binary
search. The binary search starts with min = last LSN when GC ran, and
max = latest LSN on the timeline. On each iteration of the search we
check if there are any commits with a higher-than-requested timestamp
at that LSN.

Implements github issue 1361.
2022-05-03 09:28:57 +03:00
Konstantin Knizhnik
baa59512b8 Traverse frozen layer in get_reconstruct_data in reverse order (#1601)
* Traverse frozen layer in get_reconstruct_data in reverse order

* Fix comments on frozen layers.

Note explicitly the order that the layers are in the queue.

* Add fail point to reproduce failpoint iteration error

Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2022-05-03 08:07:14 +03:00
Kirill Bulatov
5cb501c2b3 Make remote storage test less flacky 2022-05-02 20:04:48 +03:00
Stas Kelvich
0323bb5870 [proxy] Refactor cplane API and add new console SCRAM auth API
Now proxy binary accepts `--auth-backend` CLI option, which determines
auth scheme and cluster routing method. Following backends are currently
implemented:

* legacy
    old method, when username ends with `@zenith` it uses md5 auth dbname as
    the cluster name; otherwise, it sends a login link and waits for the console
    to call back
* console
    new SCRAM-based console API; uses SNI info to select the destination
    cluster
* postgres
    uses postgres to select auth secrets of existing roles. Useful for local
    testing
* link
    sends login link for all usernames
2022-05-02 18:32:18 +03:00
Dhammika Pathirana
3128e8c75c Fix tenant conf test
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-05-01 13:13:25 -07:00
Dmitry Rodionov
67b4e38092 remporarily disable test_backpressure_received_lsn_lag 2022-04-29 15:53:56 +03:00
Dmitry Rodionov
05f8e6a050 Use fsync+rename for atomic downloads from remote storage
Use failpoint in test_remote_storage to check the behavior
2022-04-29 15:53:56 +03:00
Kirill Bulatov
2911eb084a Remove timeline files on detach 2022-04-29 09:19:18 +03:00
Anastasia Lubennikova
5c5c3c64f3 Fix tenant config parsing. Add a test 2022-04-28 11:49:19 +03:00
Arthur Petukhovsky
29539b0561 Set wal_keep_size to zero (#1507)
wal_keep_size is already set to 0 in our cloud setup, but we don't use this value in tests. This commit fixes wal_keep_size in control_plane and adds tests for WAL recycling and lagging safekeepers.
2022-04-27 19:09:28 +03:00
Dhammika Pathirana
66694e736a Fix add ps tenant config
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-04-27 00:05:13 -07:00
Dhammika Pathirana
091cefaa92 Fix add compaction for key partitioning
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-04-27 00:05:13 -07:00
Dhammika Pathirana
6391862d8a Add branch traversal test
Signed-off-by: Dhammika Pathirana <dhammika@gmail.com>
2022-04-27 00:05:13 -07:00
Arseny Sher
8b9d523f3c Remove old WAL on safekeepers.
Remove when it is consumed by all of 1) pageserver (remote_consistent_lsn) 2)
safekeeper peers 3) s3 WAL offloading.

In test s3 offloading for now is mocked by directly bumping s3_wal_lsn.

ref #1403
2022-04-26 23:02:23 +04:00
bojanserafimov
867aede715 Add idle compute restart time test (#1514) 2022-04-22 10:45:47 -04:00
Konstantin Knizhnik
5f83c9290b Make it possible to specify per-tenant configuration parameters
Add tenant config API and 'zenith tenant config' CLI command.
Add 'show' query to pageserver protocol for tenantspecific config parameters

Refactoring: move tenant_config code to a separate module.
Save tenant conf file to tenant's directory, when tenant is created to recover it on pageserver restart.
Ignore error during tenant config loading, while it is not supported by console

Define PiTR interval for GC.

refer #1320
2022-04-22 11:24:29 +03:00
Heikki Linnakangas
a4700c9bbe Use pprof to get flamegraph of get_page and get_relsize requests.
This depends on a hacked version of the 'pprof-rs' crate. Because of
that, it's under an optional 'profiling' feature. It is disabled by
default, but enabled for release builds in CircleCI config. It doesn't
currently work on macOS.

The flamegraph is written to 'flamegraph.svg' in the pageserver
workdir when the 'pageserver' process exits.

Add a performance test that runs the perf_pgbench test, with profiling
enabled.
2022-04-21 20:32:48 +03:00
Arseny Sher
abcd7a4b1f Insert less data in test_wal_restore.
Otherwise it sometimes hits 2m statement timeout in CI.
2022-04-21 16:00:15 +04:00
Kirill Bulatov
81cad6277a Move and library crates into a dedicated directory and rename them 2022-04-21 13:30:33 +03:00
Konstantin Knizhnik
ac52f4f2d6 Set superuser when initializing database for wal recovery (#1544) 2022-04-20 13:24:38 +03:00
Heikki Linnakangas
5e95338ee9 Improve logging in test_wal_restore.py
- Capture the output of the restore_from_wal.sh in a log file
- Kill "restored" Postgres server on test failure
2022-04-20 11:18:40 +03:00
Heikki Linnakangas
170badd626 Capture the postgres log in all tests that start a vanilla Postgres. 2022-04-20 11:18:40 +03:00
Kirill Bulatov
3e6087a12f Remove S3 archiving 2022-04-19 23:13:52 +03:00
bojanserafimov
ef72eb84cf Remove zenfixture (#1534) 2022-04-19 09:46:47 -04:00
Kirill Bulatov
52e0816fa5 wal_acceptor -> safekeeper 2022-04-18 12:52:31 +03:00
bojanserafimov
d5ae9db997 Add s3 cost estimate to tests (#1478) 2022-04-14 10:09:03 -04:00
Heikki Linnakangas
4a8c663452 Refactor pgbench tests.
- Remove batch_others/test_pgbench.py. It was a quick check that pgbench
  works, without actually recording any performance numbers, but that
  doesn't seem very interesting anymore. Remove it to avoid confusing it
  with the actual pgbench benchmarks

- Run pgbench with "-n" and "-S" options, for two different workloads:
  simple-updates, and SELECT-only. Previously, we would only run it with
  the "default" TPCB-like workload. That's more or less the same as the
  simple-update (-n) workload, but I think the simple-upload workload
  is more relevant for testing storage performance. The SELECT-only
  workload is a new thing to measure.

- Merge test_perf_pgbench.py and test_perf_pgbench_remote.py. I added
  a new "remote" implementation of the PgCompare class, which allows
  running the same tests against an already-running Postgres instance.

- Make the PgBenchRunResult.parse_from_output function more
  flexible. pgbench can print different lines depending on the
  command-line options, but the parsing function expected a particular
  set of lines.
2022-04-14 13:31:42 +03:00
Heikki Linnakangas
a009fe912a Refactor connection option handling in python tests
The PgProtocol.connect() function took extra options for username,
database, etc. Remove those options, and have a generic way for each
subclass of PgProtocol to provide some default options, with the
capability override them in the connect() call.
2022-04-14 13:31:40 +03:00