Commit Graph

1600 Commits

Author SHA1 Message Date
Alexey Kondratov
772c2fb4ff Report startup metrics and failure reason from compute_ctl (#1581)
+ neondatabase/cloud#1103

This adds a couple of control endpoints to simplify compute state
discovery for control-plane. For example, now we may figure out
that Postgres wasn't able to start or basebackup failed within
seconds instead of just blindly polling the compute readiness
for a minute or two.

Also we now expose startup metrics (time of each step: basebackup,
sync safekeepers, config, total). Console grabs them after each
successful start and reports them as histograms to Prometheus and Grafana.

OpenAPI spec is added and up to date, but is not yet used in the
console.
2022-05-18 13:03:29 +04:00
Andrey Taranik
b9f84f4a83 turn on storage deployment to neon-stress environment (#1729) 2022-05-17 23:04:04 +03:00
Arthur Petukhovsky
134eeeb096 Add more common storage metrics (#1722)
- Enabled process exporter for storage services
- Changed zenith_proxy prefix to just proxy
- Removed old `monitoring` directory
- Removed common prefix for metrics, now our common metrics have `libmetrics_` prefix, for example `libmetrics_serve_metrics_count`
- Added `test_metrics_normal_work`
2022-05-17 19:29:01 +03:00
Heikki Linnakangas
55ea3f262e Fix race condition leading to panic in remote storage sync thread.
The SyncQueue consisted of a tokio mpsc channel, and an atomic counter
to keep track of how many items there are in the channel. Updating the
atomic counter was racy, and sometimes the consumer would decrement
the counter before the producer had incremented it, leading to integer
wraparound to usize::MAX. Calling Vec::with_capacity(usize::MAX) leads
to a panic.

To fix, replace the channel with a VecDeque protected by a Mutex, and
a condition variable for signaling. Now that the queue is
protected by a standard blocking Mutex and Condvar, refactor the
functions touching it to be sync, not async.

A theoretical downside of this is that the calls to push items to the
queue and the storage sync thread that drains the queue might now need
to wait, if another thread is busy manipulating the queue. I believe
that's OK; the lock isn't held for very long, and these operations are
made in background threads, not in the hot GetPage@LSN path, so
they're not very latency-sensitive.

Fixes #1719. Also add a test case.
2022-05-17 18:14:57 +03:00
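The fix described above can be sketched as follows. This is a minimal Python analogue, not the pageserver code: the original is Rust (VecDeque, Mutex, Condvar), and the class and method names here (`SyncQueue`, `push`, `next_batch`) are assumptions made for illustration. The point it shows is that the queue length is read from the container under the same lock that guards it, so there is no separate counter that can drift or wrap around.

```python
import threading
from collections import deque


class SyncQueue:
    """Hypothetical queue guarded by one lock; no separate length counter."""

    def __init__(self):
        self._items = deque()
        # Condition bundles a lock with a condition variable for signaling.
        self._cond = threading.Condition()

    def push(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify()  # wake one waiting consumer

    def next_batch(self, max_len):
        """Block until at least one item is queued, then pop up to max_len items."""
        with self._cond:
            while not self._items:
                self._cond.wait()
            n = min(max_len, len(self._items))
            return [self._items.popleft() for _ in range(n)]

    def __len__(self):
        # The length always comes from the deque itself, under the lock.
        with self._cond:
            return len(self._items)
```

A producer calls `push()` from any thread, and the sync thread calls `next_batch()` in a loop. Both may briefly wait on the lock, which is the theoretical downside the commit message mentions.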
Heikki Linnakangas
f03779bf1a Fix wait_for_last_record_lsn() and wait_for_upload() python functions.
The contract for wait_for() was not very clear. It waits until the
given function returns successfully, without an exception, but the
wait_for_last_record_lsn() and wait_for_upload() functions used "a <
b" as the condition, i.e. they thought that wait_for() would poll
until the function returns true.

Inline the logic from wait_for() into those two functions; it's not
that complicated, and you get a more specific error message too if it
fails. Also add a comment to wait_for() to make it clearer how it
works.

Also change remote_consistent_lsn() to return 0 instead of raising an
exception, if remote is None. That can happen if nothing has been
uploaded to remote storage for the timeline yet. It happened once in
the CI, and I was able to reproduce that locally too by adding a sleep
to the storage sync thread, to delay the first upload.
2022-05-17 18:14:10 +03:00
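The contract mismatch described above can be sketched as follows. This is an illustrative reconstruction, not the test suite's code: the signature of `wait_for()` and the helper `wait_until_at_least()` are assumptions. It shows why passing a condition such as "a < b" does not poll: returning False is still a successful return, so the wait ends immediately.

```python
import time


def wait_for(number_of_iterations, interval, func):
    """Call func until it returns without raising; a falsy result still counts."""
    last_exception = None
    for _ in range(number_of_iterations):
        try:
            return func()
        except Exception as e:
            last_exception = e
            time.sleep(interval)
    raise Exception(f"timed out while waiting for {func}") from last_exception


def wait_until_at_least(get_current, target, iterations=20, interval=0.01):
    """Inlined polling loop (hypothetical helper): retry until the value reaches target."""
    current = None
    for _ in range(iterations):
        current = get_current()
        if current >= target:
            return current
        time.sleep(interval)
    # A specific message is possible because the loop knows both values.
    raise TimeoutError(f"value {current} did not reach {target}")
```

With this sketch, `wait_for(3, 0, lambda: 1 < 0)` returns False on the first call instead of retrying, whereas `wait_until_at_least()` keeps polling and reports the last observed value if it gives up.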
Andrey Taranik
070c255522 Neon stress deploy (#1720)
* storage and proxy deployment for neon stress environment

* neon stress inventory fix
2022-05-17 18:03:01 +03:00
Heikki Linnakangas
9ccbb8d331 Make "neon_local stop" less verbose.
I got annoyed by all the noise in CI test output.

Before:

    $ ./target/release/neon_local stop
    Stop pageserver gracefully
    Pageserver still receives connections
    Pageserver stopped receiving connections
    Pageserver status is: Reqwest error: error sending request for url (http://127.0.0.1:9898/v1/status): error trying to connect: tcp connect error: Connection refused (os error 111)
    initializing for sk 1 for 7676
    Stop safekeeper gracefully
    Safekeeper still receives connections
    Safekeeper stopped receiving connections
    Safekeeper status is: Reqwest error: error sending request for url (http://127.0.0.1:7676/v1/status): error trying to connect: tcp connect error: Connection refused (os error 111)

After:

    $ ./target/release/neon_local stop
    Stopping pageserver gracefully...done!
    Stopping safekeeper 1 gracefully...done!

Also removes the spurious "initializing for sk 1 for 7676" message from
"neon_local start"
2022-05-17 10:31:13 +03:00
Kirill Bulatov
f2881bbd8a Start and stop single etcd and mock s3 servers globally in python tests 2022-05-17 01:17:44 +03:00
Kirill Bulatov
a884f4cf6b Add etcd to neon_local 2022-05-17 01:17:44 +03:00
Kirill Bulatov
9a0fed0880 Enable at least 1 safekeeper in every test 2022-05-17 01:17:44 +03:00
chaitanya sharma
bea84150b2 Fix the markdown rendering on 004-durability.md RFC 2022-05-17 00:16:42 +03:00
chaitanya sharma
85b5c0e989 List profiling as a feature with 'pageserver --enabled-features'
Fixes https://github.com/neondatabase/neon/issues/1627
2022-05-16 21:10:57 +03:00
Thang Pham
e4a70faa08 Add more information to timeline-related APIs (#1673)
Resolves #1488.

- implemented `GET tenant/:tenant_id/timeline/:timeline_id/wal_receiver` endpoint
- returned `thread_id` in `thread_mgr::spawn` 
- added `latest_gc_cutoff_lsn` field to `LocalTimelineInfo` struct
2022-05-16 11:05:43 -04:00
chaitanya sharma
c41549f630 Update readme build for osx (#1709) 2022-05-16 10:42:08 -04:00
Heikki Linnakangas
c700032dd2 Run the regression tests in CI also for PRs opened from forked repos. 2022-05-16 14:40:49 +03:00
Kirill Bulatov
33cac863d7 Test simple.conf and handle broker_endpoints better 2022-05-16 12:07:35 +03:00
Heikki Linnakangas
51ea9c3053 Don't swallow panics when the pageserver is built with failpoints.
It's very confusing, and because you don't get a stack trace and error
message in the logs, makes debugging very hard. However, the
'test_pageserver_recovery' test relied on that behavior. To support that,
add a new "exit" action to the pageserver 'failpoints' command, so that
you can explicitly request to exit the process when a failpoint is hit.
2022-05-16 09:58:58 +03:00
Heikki Linnakangas
a10cac980f Continue with pageserver startup, if loading some tenants fails.
Fixes https://github.com/neondatabase/neon/issues/1664
2022-05-15 00:25:38 +03:00
Heikki Linnakangas
081d5dac5e Bump vendor/postgres.
Includes change to reduce log noise from inmem_smgr.
2022-05-13 21:41:00 +03:00
Andrey Taranik
cded72a580 remove sk-2 from staging inventory list (#1699) 2022-05-13 20:41:54 +03:00
Egor Suvorov
768c846eeb Fix test_delete_force from #1653 conflicting with #1692 2022-05-13 17:36:18 +02:00
Anastasia Lubennikova
a2561f0a78 Use tenant's pitr_interval instead of hardcoded 0 in the command.
Adjust python tests that use the
2022-05-13 18:32:14 +03:00
Anastasia Lubennikova
aa7c601eca Fix pitr_interval check in GC:
Use timestamp->LSN mapping instead of file modification time.
Fix 'latest_gc_cutoff_lsn' - set it to the minimum of pitr_cutoff and gc_cutoff.
Add new test: test_pitr_gc
2022-05-13 18:32:14 +03:00
Egor Suvorov
bf899a57d9 Safekeeper: add timeline/tenant force delete HTTP endpoints (closes #895)
* There is no auth in Safekeeper HTTP at all currently,
  so simply calling `check_permission` is not enough.
* There are no checks of Safekeeper still working with the data,
  as "still working" is blurry now: a timeline may be "active"
  while there are no compute nodes and all data is propagated.
* Still, callmemaybe is deactivated, and timeline is removed from the
  internal map. It can easily sneak back in case of race conditions
  and implicit creations, though.
2022-05-13 15:43:52 +02:00
Egor Suvorov
07b85e7cfc Safekeeper refactor: move callmemaybe_tx from SafekeeperPostgresBackend to Timeline 2022-05-13 15:43:52 +02:00
Egor Suvorov
22d997049c libs/utils/http/request: add ensure_no_body 2022-05-13 15:43:52 +02:00
Kirill Bulatov
b683308791 Return GIT_VERSION back to storage binaries 2022-05-13 16:34:32 +03:00
Kirill Bulatov
51c0f9ab2b Force git version to be up to date via decl macro 2022-05-13 16:34:32 +03:00
Stas Kelvich
0030da57a8 compute-tools: grant rw privileges to all created users 2022-05-13 11:27:00 +03:00
Kirill Bulatov
85884a1599 Disable tenant relocation python test 2022-05-13 01:26:38 +03:00
Thang Pham
ae20751724 update ZenithCli::create_tenant return signature (#1692)
to include the initial timeline's ID in addition to the new tenant's ID.

Context: follow-up of https://github.com/neondatabase/neon/pull/1689
2022-05-12 17:27:08 -04:00
Thang Pham
5812e26b90 Create an initial timeline on CLI tenant creation (#1689)
Resolves #1655
2022-05-12 16:33:09 -04:00
Arthur Petukhovsky
ec8861b8cc Fix pageserver metrics names (#1682)
Try to follow Prometheus style-guide https://prometheus.io/docs/practices/naming/ for metrics names. More specifically:
- Use `pageserver_` prefix for all pageserver metrics
- Specify `_seconds` unit in time metrics
- Use unit as a suffix in other cases, such as `_hits`, `_bytes`, `_records`
- Use `_total` suffix for accumulating counters (note that Histograms append that suffix internally)
2022-05-12 19:53:07 +03:00
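The naming rules listed above can be illustrated with a small helper. This is only a sketch: the function `metric_name()` and the example metric names are hypothetical and are not taken from the pageserver code. It composes a name as prefix, base name, unit suffix, and an optional `_total` suffix for accumulating counters.

```python
def metric_name(base, unit=None, counter=False, prefix="pageserver"):
    """Compose a metric name as <prefix>_<base>[_<unit>][_total]."""
    parts = [prefix, base]
    if unit:
        parts.append(unit)      # unit goes after the base name, e.g. seconds, bytes
    if counter:
        parts.append("total")   # accumulating counters end in _total
    return "_".join(parts)


# Hypothetical metric names, for illustration only.
print(metric_name("wait", unit="seconds"))
print(metric_name("page_cache_read", unit="hits", counter=True))
```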
Kirill Bulatov
4538f1e1b8 Correctly operate etcd safekeeper timeline data 2022-05-12 18:47:31 +03:00
Stas Kelvich
b10ae195b7 Set vendor/postgres back to the main branch
I accidentally merged a postgres PR that was referencing a non-main branch.
2022-05-12 15:05:49 +03:00
Alexey Kondratov
b426775aa0 Use compute-tools from the new neondatabase Docker Hub repo 2022-05-12 12:26:24 +03:00
Heikki Linnakangas
5da4f3a4df Refactor DeltaLayer::dump() function
Put most of the code in a closure that returns Result, so that we can
use the ?-operator for error handling. That's simpler.
2022-05-12 10:31:04 +03:00
Konstantin Knizhnik
2bde77fced Do not apply records with LSN smaller than LSN of cached image in del… (#1672)
* Do not apply records with LSN smaller than LSN of cached image in delta layer
2022-05-12 07:56:02 +03:00
Dhammika Pathirana
c864091035 Fix err msg typo
Signed-off-by: Dhammika Pathirana <dham@neon.tech>
2022-05-11 16:13:26 -07:00
Anton Shyrabokau
20361395bb Add zenith-us-stage-sk-5 to circleci inventory (#1665)
Co-authored-by: Debian <admin@ip-10-0-5-32.us-west-2.compute.internal>
2022-05-11 21:36:53 +03:00
Arseny Sher
b338b5dffe Make callmemaybe less aggressive until we fix it/migrate to bigger machines. 2022-05-11 22:16:13 +04:00
Stas Kelvich
5bd879f641 Proxy: update protocol after cluster->project rename 2022-05-11 15:50:36 +03:00
Konstantin Knizhnik
e6e883eb12 Do not set LSN for new FPI page (#1657)
* Do not set LSN for new FPI page

refer #1656

* Add page_is_new, page_get_lsn, page_set_lsn functions

* Fix page_is_new implementation

* Add comment from XLogReadBufferForRedoExtended
2022-05-11 15:23:17 +03:00
Heikki Linnakangas
d710dff975 Remove unnecessary Serialize/Deserialize traits from VecMap.
It's never stored on disk. Let's be tidy.
2022-05-10 23:47:40 +03:00
Arseny Sher
6cb14b4200 Optionally remove WAL on safekeepers without s3 offloading.
And do that on staging, until offloading is merged.
2022-05-10 22:41:02 +04:00
Thang Pham
87dfa99734 Update layered_repository README (#1659) 2022-05-10 09:55:14 -04:00
Thang Pham
cf59b51519 Update README (Running local installation section) (#1649) 2022-05-09 11:11:46 -04:00
Kirill Bulatov
0a7735a656 Rework remote storage sync queue, general refactoring 2022-05-07 01:33:33 +03:00
Kirill Bulatov
64a602b8f3 Delete timeline layers 2022-05-07 01:33:33 +03:00
Kirill Bulatov
10e4da3997 Rework timeline batching 2022-05-07 01:33:33 +03:00