823 Commits

Author SHA1 Message Date
Heikki Linnakangas
d0494c391a Remove wal_receiver mgmt API endpoint
Move all the fields that were returned by the wal_receiver endpoint into
timeline_detail. Internally, move those fields from the separate global
WAL_RECEIVERS hash into the LayeredTimeline struct. That way, all the
information about a timeline is kept in one place.

In passing, I noticed that the 'thread_id' field was removed from
WalReceiverEntry in commit e5cb727572, but that commit forgot to update
openapi_spec.yml. This commit removes that too.
2022-07-29 20:51:37 +03:00
Kirill Bulatov
2af5a96f0d Back off when reenqueueing delete tasks 2022-07-29 19:04:40 +03:00
Heikki Linnakangas
02afa2762c Move Tenant- and TimelineInfo structs to models.rs.
They are part of the management API response structs. Let's try to
concentrate everything that's part of the API in models.rs.
2022-07-29 15:02:15 +03:00
Heikki Linnakangas
d903dd61bd Rename 'wal_producer_connstr' to 'wal_source_connstr'.
What the WAL receiver really connects to is the safekeeper. The
"producer" term is a bit misleading, as the safekeeper doesn't produce
the WAL, the compute node does.

This change also applies to the name of the field used in the mgmt API
in the response of the
'/v1/tenant/:tenant_id/timeline/:timeline_id/wal_receiver' endpoint.
AFAICS that's not used anywhere else than one python test, so it
should be OK to change it.
2022-07-29 09:09:22 +03:00
Thang Pham
417d9e9db2 Add current physical size to tenant status endpoint (#2173)
Ref #1902
2022-07-28 13:59:20 -04:00
Kirill Bulatov
58b04438f0 Tweak backoff numbers to avoid no wal connection threshold trigger 2022-07-27 22:16:40 +03:00
Thang Pham
6a664629fa Add timeline physical size tracking (#2126)
Ref #1902.

- Track the layered timeline's `physical_size` using `pageserver_current_physical_size` metric when updating the layer map.
- Report the local timeline's `physical_size` in timeline GET APIs.
- Add `include-non-incremental-physical-size` URL flag to also report the local timeline's `physical_size_non_incremental` (similar to `logical_size_non_incremental`)
- Add a `UIntGaugeVec` and `UIntGauge` to represent `u64` prometheus metrics

Co-authored-by: Dmitry Rodionov <dmitry@neon.tech>
2022-07-27 12:36:46 -04:00
Heikki Linnakangas
d6f12cff8e Make DatadirTimeline a trait, implemented by LayeredTimeline.
Previously DatadirTimeline was a separate struct, and there was a 1:1
relationship between each DatadirTimeline and LayeredTimeline. That was
a bit awkward; whenever you created a timeline, you also needed to create
the DatadirTimeline wrapper around it, and if you only had a reference
to the LayeredTimeline, you would need to look up the corresponding
DatadirTimeline struct through tenant_mgr::get_local_timeline_with_load().
There were a couple of calls like that from LayeredTimeline itself.

Refactor DatadirTimeline, so that it's a trait, and mark LayeredTimeline
as implementing that trait. That way, there's only one object,
LayeredTimeline, and you can call both Timeline and DatadirTimeline
functions on that. You can now also call DatadirTimeline functions from
LayeredTimeline itself.

I considered just moving all the functions from DatadirTimeline directly
to Timeline/LayeredTimeline, but I still like to have some separation.
Timeline provides a simple key-value API, and handles durably storing
key/value pairs, and branching. Whereas DatadirTimeline is stateless, and
provides an abstraction over the key-value store, to present an interface
with relations, databases, etc. Postgres concepts.

This simplified the logical size calculation fast-path for branch
creation, introduced in commit 28243d68e6. LayeredTimeline can now
access the ancestor's logical size directly, so it doesn't need the
caller to pass it to it. I moved the fast-path to init_logical_size()
function itself. It now checks if the ancestor's last LSN is the same
as the branch point, i.e. if there haven't been any changes on the
ancestor after the branch, and copies the size from there. An
additional bonus is that the optimization will now work any time you
have a branch of another branch, with no changes from the ancestor,
not only at a create-branch command.
2022-07-27 10:26:21 +03:00
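A minimal sketch of the shape described in d6f12cff8e: a stateless trait layered over a key-value trait, both implemented by a single struct. The trait and struct names come from the commit message; the method names, the `rel_block_key` encoding, and the `BTreeMap` store are hypothetical stand-ins, not the pageserver's actual code.

```rust
use std::collections::BTreeMap;

/// Illustrative key encoding (hypothetical): relation id in the high bits,
/// block number in the low bits.
pub fn rel_block_key(rel: u32, blkno: u32) -> u64 {
    ((rel as u64) << 32) | blkno as u64
}

/// Simple key-value API, standing in for `Timeline`.
pub trait Timeline {
    fn get(&self, key: u64) -> Option<&[u8]>;
    fn put(&mut self, key: u64, value: Vec<u8>);
}

/// Stateless abstraction over the key-value store, standing in for
/// `DatadirTimeline`. It holds no data; the default methods only translate
/// relation-level calls into key-value calls.
pub trait DatadirTimeline: Timeline {
    fn put_rel_page(&mut self, rel: u32, blkno: u32, page: Vec<u8>) {
        self.put(rel_block_key(rel, blkno), page);
    }

    fn get_rel_page(&self, rel: u32, blkno: u32) -> Option<&[u8]> {
        self.get(rel_block_key(rel, blkno))
    }
}

/// The single object that owns the data.
#[derive(Default)]
pub struct LayeredTimeline {
    store: BTreeMap<u64, Vec<u8>>,
}

impl Timeline for LayeredTimeline {
    fn get(&self, key: u64) -> Option<&[u8]> {
        self.store.get(&key).map(|v| v.as_slice())
    }

    fn put(&mut self, key: u64, value: Vec<u8>) {
        self.store.insert(key, value);
    }
}

// Marking the struct as implementing the trait is enough: the default
// methods become callable on `LayeredTimeline`, with no wrapper object.
impl DatadirTimeline for LayeredTimeline {}
```

With this layout, a caller holding a `LayeredTimeline` can call `put_rel_page` and `get` on the same value, which is the "only one object" property the commit message describes.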
Konstantin Knizhnik
5a4394a8df Do not hold timelines lock while calling update_gc_info to avoid recursive mutex lock and so deadlock (#2163) 2022-07-26 22:21:05 +03:00
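The pattern behind 5a4394a8df can be sketched as follows, assuming a non-reentrant `std::sync::Mutex`. Only the name `update_gc_info` comes from the commit title; the `Repository` and `Timeline` layouts, `refresh_gc_info`, and `gc_cutoff_of` are hypothetical. The idea is to snapshot the timelines under the lock, release the guard, and only then call into code that takes the same lock again.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

pub struct Timeline {
    pub id: u32,
    gc_cutoff: Mutex<u64>,
}

pub struct Repository {
    timelines: Mutex<HashMap<u32, Arc<Timeline>>>,
}

impl Repository {
    pub fn new(ids: &[u32]) -> Self {
        let map = ids
            .iter()
            .map(|&id| (id, Arc::new(Timeline { id, gc_cutoff: Mutex::new(0) })))
            .collect();
        Repository { timelines: Mutex::new(map) }
    }

    /// Hypothetical stand-in: it takes the `timelines` lock itself, so calling
    /// it while the caller still holds that lock would lock recursively.
    fn update_gc_info(&self, timeline: &Timeline, cutoff: u64) {
        let still_present = self.timelines.lock().unwrap().contains_key(&timeline.id);
        if still_present {
            *timeline.gc_cutoff.lock().unwrap() = cutoff;
        }
    }

    pub fn refresh_gc_info(&self, cutoff: u64) {
        // Copy the Arc handles out while holding the lock...
        let guard = self.timelines.lock().unwrap();
        let snapshot: Vec<Arc<Timeline>> = guard.values().cloned().collect();
        // ...and release it before calling anything that locks again.
        drop(guard);
        for timeline in &snapshot {
            self.update_gc_info(timeline, cutoff);
        }
    }

    pub fn gc_cutoff_of(&self, id: u32) -> Option<u64> {
        let guard = self.timelines.lock().unwrap();
        guard.get(&id).map(|t| *t.gc_cutoff.lock().unwrap())
    }
}
```

Cloning `Arc` handles keeps the snapshot cheap; the trade-off is that a timeline removed after the snapshot is still visited, which this sketch handles with the `still_present` check.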
Heikki Linnakangas
d301b8364c Move LayeredTimeline and related code to separate source file.
The layered_repository.rs file had grown to be very large. Split off
the LayeredTimeline struct and related code to a separate source file to
make it more manageable.

There are plans to move much of the code to track timelines from
tenant_mgr.rs to LayeredRepository. That will make layered_repository.rs
grow again, so now is a good time to split it.

There's a lot more cleanup to do, but this commit intentionally only
moves existing code and avoids doing anything else, for easier review.
2022-07-26 11:47:04 +03:00
Kirill Bulatov
172314155e Compact only once on psql checkpoint call 2022-07-26 11:37:16 +03:00
Konstantin Knizhnik
28243d68e6 Yet another approach to copying logical timeline size during branch creation (#2139)
* Yet another approach to copying logical timeline size during branch creation

* Fix unit tests

* Update pageserver/src/layered_repository.rs

Co-authored-by: Thang Pham <thang@neon.tech>

* Update pageserver/src/layered_repository.rs

Co-authored-by: Thang Pham <thang@neon.tech>

* Update pageserver/src/layered_repository.rs

Co-authored-by: Thang Pham <thang@neon.tech>

Co-authored-by: Thang Pham <thang@neon.tech>
2022-07-26 09:11:10 +03:00
Dmitry Rodionov
e308265e42 register tenants task thread pool threads in thread_mgr
needed to avoid this warning: is_shutdown_requested() called in an unexpected thread
2022-07-22 11:43:38 +03:00
Thang Pham
ed102f44d9 Reduce memory allocations for page server (#2010)
## Overview

This patch reduces the number of memory allocations when running the page server under a heavy write workload. This mostly helps improve the speed of WAL record ingestion. 

## Changes
- modified `DatadirModification` to allow reusing the struct's allocated memory after each modification
- modified `decode_wal_record` to allow passing a `DecodedWALRecord` reference. This helps reuse the struct in each `decode_wal_record` call
- added a reusable buffer for serializing objects inside the `InMemoryLayer::put_value` function
- added a performance test simulating a heavy write workload for testing the changes in this patch

### Semi-related changes
- remove redundant serializations when calling `DeltaLayer::put_value` during `InMemoryLayer::write_to_disk` function call [1]
- removed the info span `info_span!("processing record", lsn = %lsn)` during each WAL ingestion [2]

## Notes
- [1]: in `InMemoryLayer::write_to_disk`, a deserialization is called
  ```
  let val = Value::des(&buf)?;
  delta_layer_writer.put_value(key, *lsn, val)?;
  ``` 
  `DeltaLayer::put_value` then creates a serialization based on the previous deserialization
  ```
  let off = self.blob_writer.write_blob(&Value::ser(&val)?)?;
  ```
- [2]: related: https://github.com/neondatabase/neon/issues/733
2022-07-21 12:08:26 -04:00
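The buffer-reuse idea from ed102f44d9 can be illustrated with a small sketch. `LayerWriter`, its fields, and the byte layout are hypothetical and much simpler than `InMemoryLayer::put_value`; the point is only that `Vec::clear` keeps the allocation, so repeated serialization into the same scratch buffer avoids a fresh allocation per value.

```rust
/// Hypothetical writer that keeps one scratch buffer for serialization.
pub struct LayerWriter {
    scratch: Vec<u8>,
    out: Vec<u8>,
}

impl LayerWriter {
    pub fn new() -> Self {
        LayerWriter { scratch: Vec::new(), out: Vec::new() }
    }

    /// Serializes `key` and `value` into the scratch buffer, then appends
    /// the result to the output. The layout is illustrative only.
    pub fn put_value(&mut self, key: u64, value: &[u8]) {
        // `clear` resets the length but keeps the capacity for reuse.
        self.scratch.clear();
        self.scratch.extend_from_slice(&key.to_be_bytes());
        self.scratch.extend_from_slice(&(value.len() as u32).to_be_bytes());
        self.scratch.extend_from_slice(value);
        self.out.extend_from_slice(&self.scratch);
    }

    pub fn scratch_capacity(&self) -> usize {
        self.scratch.capacity()
    }

    pub fn bytes_written(&self) -> usize {
        self.out.len()
    }
}
```

Once the scratch buffer has grown to fit the largest value seen so far, later calls with values of the same size or smaller do not reallocate it.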
Konstantin Knizhnik
572ae74388 More precisely control size of inmem layer (#1927)
* More precisely control size of inmem layer

* Force recompaction of L0 layers if they contain large non-WAL-logged BLOBs to avoid too large layers

* Add modified version of test_hot_update test (test_dup_key.py) which should generate large layers without large number of tables

* Change test name in test_dup_key

* Add Layer::get_max_key_range function

* Add layer::key_iter method and implement new approach of splitting layers during compaction based on total size of all key values

* Add test_large_schema test for checking layer file size after compaction

* Make clippy happy

* Restore checking LSN distance threshold for checkpoint in-memory layer

* Optimize storage keys iterator

* Update pageserver/src/layered_repository.rs

Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>

* Update pageserver/src/layered_repository.rs

Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>

* Update pageserver/src/layered_repository.rs

Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>

* Update pageserver/src/layered_repository.rs

Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>

* Update pageserver/src/layered_repository.rs

Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>

* Fix code style

* Reduce number of tables in test_large_schema to make it fit in timeout with debug build

* Fix style of test_large_schema.py

* Fix handling of duplicate layers

Co-authored-by: Heikki Linnakangas <heikki@zenith.tech>
2022-07-21 07:45:11 +03:00
Heikki Linnakangas
b4c74c0ecd Clean up unnecessary dependencies.
Just to be tidy.
2022-07-20 16:31:25 +03:00
Thang Pham
160e52ec7e Optimize branch creation (#2101)
Resolves #2054

**Context**: branch creation needs to wait for GC to acquire `gc_cs` lock, which prevents creating new timelines during GC. However, because individual timeline GC iteration also requires `compaction_cs` lock, branch creation may also need to wait for compactions of multiple timelines. This results in high latency when creating a new branch, which we advertise as *"instant"*.

This PR optimizes the latency of branch creation by separating GC into two phases:
1. Collect GC data (branching points, cutoff LSNs, etc)
2. Perform GC for each timeline

The GC bottleneck comes from step 2, which must wait for compaction of multiple timelines. This PR modifies the branch creation and GC functions to allow GC to hold the GC lock only in step 1. As a result, branch creation doesn't need to wait for compaction to finish but only needs to wait for GC data collection step, which is fast.
2022-07-19 14:56:25 -04:00
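The two-phase split described in 160e52ec7e could look roughly like this. Only the lock name `gc_cs` comes from the commit message; `GcPlan`, the `(id, last_record_lsn)` tuples, and the callback standing in for per-timeline GC are assumptions made for illustration. The GC lock is held only while the plan is collected, so a branch creation that needs the same lock can run while the slow per-timeline work is in progress.

```rust
use std::sync::Mutex;

#[derive(Debug, Clone, PartialEq)]
pub struct GcPlan {
    pub timeline_id: u32,
    pub cutoff_lsn: u64,
}

pub struct Repository {
    gc_cs: Mutex<()>,
    /// (timeline id, last record LSN); a simplified stand-in for real state.
    timelines: Mutex<Vec<(u32, u64)>>,
}

impl Repository {
    pub fn new(timelines: Vec<(u32, u64)>) -> Self {
        Repository { gc_cs: Mutex::new(()), timelines: Mutex::new(timelines) }
    }

    /// Phase 1: collect GC data while holding `gc_cs`. This part is fast.
    fn collect_gc_data(&self, horizon: u64) -> Vec<GcPlan> {
        let _gc_guard = self.gc_cs.lock().unwrap();
        let timelines = self.timelines.lock().unwrap();
        timelines
            .iter()
            .map(|&(timeline_id, lsn)| GcPlan {
                timeline_id,
                cutoff_lsn: lsn.saturating_sub(horizon),
            })
            .collect()
    }

    /// Phase 2 runs after `gc_cs` has been released. `gc_timeline` stands in
    /// for the slow per-timeline work that may wait on compaction.
    pub fn gc_iteration(&self, horizon: u64, mut gc_timeline: impl FnMut(&GcPlan)) {
        let plans = self.collect_gc_data(horizon);
        for plan in &plans {
            gc_timeline(plan);
        }
    }

    /// Branch creation takes `gc_cs`, so it only contends with phase 1.
    pub fn branch_timeline(&self, new_id: u32, ancestor_id: u32) -> Option<u64> {
        let _gc_guard = self.gc_cs.lock().unwrap();
        let mut timelines = self.timelines.lock().unwrap();
        let start_lsn = timelines
            .iter()
            .find(|(id, _)| *id == ancestor_id)
            .map(|&(_, lsn)| lsn)?;
        timelines.push((new_id, start_lsn));
        Some(start_lsn)
    }

    pub fn timeline_count(&self) -> usize {
        self.timelines.lock().unwrap().len()
    }
}
```

In this sketch a branch created during phase 2 is not part of the already collected plan; how the real code protects such a branch's data is outside what the commit message states.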
Heikki Linnakangas
f384e20d78 Minor cleanup in layered_repository.rs. 2022-07-19 07:50:55 +03:00
Heikki Linnakangas
0b14fdb078 Reorganize, expand, improve internal documentation
Reorganize existing READMEs and other documentation files into mdbook
format. The resulting Table of Contents is a mix of placeholders for
docs that we should write, and documentation files that we already had,
dropped into the most appropriate place.

Update the Pageserver overview diagram. Add sections on thread
management and WAL redo processes.

Add all the RFCs to the mdbook Table of Contents too.

Per github issue #1979
2022-07-18 17:39:12 +03:00
Dmitry Rodionov
7987889cb3 keep successfully downloaded index parts 2022-07-18 12:27:04 +03:00
Dmitry Rodionov
912a08317b do not ignore errors during downloading of tenant index parts 2022-07-18 12:27:04 +03:00
Kirill Bulatov
c4b2347e21 Use less restrictive lock guard during storage sync 2022-07-17 12:49:18 +03:00
Heikki Linnakangas
a342957aee Use ok_or_else() instead of ok_or(), to silence clippy warnings.
"cargo clippy" started to complain about these, after running "cargo
update". Not sure why it didn't complain before, but seems reasonable to
fix these. (The "cargo update" is not included in this commit)
2022-07-14 22:13:51 +03:00
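For context on a342957aee: `Option::ok_or` evaluates its argument eagerly, so an error value built with `format!` is allocated even when the option is `Some`, while `ok_or_else` takes a closure that runs only on `None`. The warning is presumably clippy's `or_fun_call` lint, though the commit does not name it. The function and map below are hypothetical examples, not code from the repository.

```rust
use std::collections::HashMap;

/// Looks up a timeline name; the error string is built only on a miss
/// because `ok_or_else` defers the `format!` call to the `None` case.
pub fn lookup_timeline(map: &HashMap<u32, String>, id: u32) -> Result<&String, String> {
    map.get(&id)
        .ok_or_else(|| format!("timeline {id} not found"))
}
```

The eager form, `map.get(&id).ok_or(format!(...))`, returns the same result but pays for the formatting on every call.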
Thang Pham
7f048abf3b Add close_fds for initdb command and add close fd test (#2060)
This PR adds a test for https://github.com/neondatabase/neon/pull/1834 and fixes the error in https://app.circleci.com/pipelines/github/neondatabase/neon/7753/workflows/94d1b796-10a3-4989-b23c-4c1eb4a49cf5/jobs/79586, which happens because `pageserver.pid` is held by `initdb` command on restart.

Because the test requires `lsof` to be installed in the docker image, this PR also updates the caches and docker image specified in CircleCI config file.
2022-07-12 15:04:40 -04:00
bojanserafimov
5cf597044d Allow prev_lsn hint for fullbackup (#2052) 2022-07-11 10:31:14 -04:00
Heikki Linnakangas
95452e605a Optimize importing a physical backup
Before this patch, importing a physical backup followed the same path
as ingesting any WAL records:

1. All the data pages from the backup are first collected in the
   DatadirModification object.
2. Then, they are "committed" to the Repository. They are written to
   the in-memory layer
3. Finally, the in-memory layer is frozen, and flushed to disk as a
   L0 delta layer file.

This was pretty inefficient. In step 1, the whole physical backup was
held in memory. If the backup is large, you simply run out of
memory. And in step 3, the resulting L0 delta layer file is large,
holding all the data again. That's a problem if the backup is larger
than 5 GB: Amazon S3 doesn't allow uploading files larger than 5 GB
(without using multi-part upload, see github issue #1910). So we want
to avoid that.

To alleviate those problems, optimize the codepath for importing a
physical backup. The basic flow is the same as before, but step 1
is optimized so that it doesn't accumulate all the data in memory,
and step 3 writes the data in image layers instead of one large delta
layer.
2022-07-11 17:03:58 +03:00
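One way to picture the optimized import path in 95452e605a is a writer that flushes in bounded batches instead of holding the whole backup in memory. The commit message does not say how the real code bounds memory, so the byte threshold, the `Importer` type, and the use of a `Vec` per flushed batch as a stand-in for an image layer are all assumptions.

```rust
/// Hypothetical importer: pages are buffered up to `flush_threshold` bytes,
/// then written out as one batch (standing in for an image layer).
pub struct Importer {
    pending: Vec<(u64, Vec<u8>)>,
    pending_bytes: usize,
    flush_threshold: usize,
    /// Each entry represents one flushed layer.
    pub layers: Vec<Vec<(u64, Vec<u8>)>>,
}

impl Importer {
    pub fn new(flush_threshold: usize) -> Self {
        Importer {
            pending: Vec::new(),
            pending_bytes: 0,
            flush_threshold,
            layers: Vec::new(),
        }
    }

    /// Buffers one page and flushes as soon as the threshold is reached,
    /// so memory use stays bounded regardless of the backup size.
    pub fn put_page(&mut self, key: u64, page: Vec<u8>) {
        self.pending_bytes += page.len();
        self.pending.push((key, page));
        if self.pending_bytes >= self.flush_threshold {
            self.flush();
        }
    }

    /// Writes out whatever is buffered; call once more at the end of import.
    pub fn flush(&mut self) {
        if self.pending.is_empty() {
            return;
        }
        self.layers.push(std::mem::take(&mut self.pending));
        self.pending_bytes = 0;
    }
}
```

Because every flushed batch is capped by the threshold, no single output file grows with the size of the backup, which is the property needed to stay under a per-file upload limit.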
Dmitry Rodionov
21da9199fa take Value by reference to avoid calling .clone 2022-07-11 17:03:58 +03:00
Thang Pham
1f5918b36d Delay calculating the starting LSN when doing timeline branching (#2053)
Previously, upon branching, if no starting LSN is specified, we
determine the start LSN based on the source timeline's last record LSN
in `timelines::create_timeline` function, which then calls `Repository::branch_timeline`
to create the timeline.

Inside the `LayeredRepository::branch_timeline` function, to start branching,
we try to acquire a GC lock to prevent GC from removing data needed
for the new timeline. However, a GC iteration takes time, so the GC lock 
can be held for a long period of time. As a result, the previously determined 
starting LSN can become invalid because of GC.

This PR fixes the above issue by delaying the LSN calculation part and moving it to be 
inside `LayeredRepository::branch_timeline` function.
2022-07-08 10:29:29 -04:00
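The fix in 1f5918b36d amounts to resolving the default start LSN only after the GC lock is held. A minimal sketch under that reading follows; `TimelineState`, `advance`, and the error text are hypothetical, and only `branch_timeline` and the GC lock are named in the commit message.

```rust
use std::sync::Mutex;

pub struct TimelineState {
    pub last_record_lsn: u64,
    pub gc_cutoff_lsn: u64,
}

pub struct Repository {
    gc_cs: Mutex<()>,
    source: Mutex<TimelineState>,
}

impl Repository {
    pub fn new(last_record_lsn: u64, gc_cutoff_lsn: u64) -> Self {
        Repository {
            gc_cs: Mutex::new(()),
            source: Mutex::new(TimelineState { last_record_lsn, gc_cutoff_lsn }),
        }
    }

    /// Simulates WAL ingestion followed by a GC iteration moving the cutoff.
    pub fn advance(&self, last_record_lsn: u64, gc_cutoff_lsn: u64) {
        let _gc_guard = self.gc_cs.lock().unwrap();
        let mut source = self.source.lock().unwrap();
        source.last_record_lsn = last_record_lsn;
        source.gc_cutoff_lsn = gc_cutoff_lsn;
    }

    /// `None` means "branch at the latest LSN". The default is resolved
    /// while the GC lock is held, so GC cannot invalidate it in between.
    pub fn branch_timeline(&self, start_lsn: Option<u64>) -> Result<u64, String> {
        let _gc_guard = self.gc_cs.lock().unwrap();
        let source = self.source.lock().unwrap();
        let lsn = start_lsn.unwrap_or(source.last_record_lsn);
        if lsn < source.gc_cutoff_lsn {
            return Err(format!(
                "start LSN {lsn} is below GC cutoff {}",
                source.gc_cutoff_lsn
            ));
        }
        Ok(lsn)
    }
}
```

Passing `Option<u64>` down instead of a precomputed value is what lets the callee pick the LSN at the right moment; a caller that resolved the default before waiting for the lock could end up with an LSN that GC has since passed.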
Dmitry Rodionov
1a5af6d7a5 extend detach/delete tests 2022-07-07 21:20:04 +03:00
Dmitry Rodionov
520ffb341b fix pageserver openapi spec 2022-07-07 21:20:04 +03:00
Dmitry Rodionov
9f2b40645d review cleanup, point timeline/detach to timeline/delete 2022-07-07 21:20:04 +03:00
Dmitry Rodionov
168214e0b6 use tenant status endpoint to check whether timelines were downloaded or not 2022-07-07 21:20:04 +03:00
Dmitry Rodionov
d9d4ef12c3 review cleanup 2022-07-07 21:20:04 +03:00
Dmitry Rodionov
e1e24336b7 review adjustments, bring back timeline_detach and rename it to timeline_delete 2022-07-07 21:20:04 +03:00
Dmitry Rodionov
4c54e4b37d switch to per-tenant attach/detach
download operations of all timelines for one tenant are now grouped
together so when attach is invoked pageserver downloads all of them
and registers them in a single apply_sync_status_update call so
branches can be used safely with attach/detach
2022-07-07 21:20:04 +03:00
Heikki Linnakangas
e6ea049165 If an error happens during import of base backup or WAL, log it.
We only sent the error to the client, with no trace in the pageserver log.
Log it, similar to how we log errors in GetPage@LSN requests.
2022-07-07 16:05:13 +03:00
Heikki Linnakangas
0e3456351f Shrink thread pools used for WAL receivers and background tasks.
I noticed that the pageserver has a very large virtual memory size,
several GB, even though it doesn't actually use that much
memory. That's not much of a problem normally, but I hit it because I
wanted to run tests with a limited virtual memory size, by calling
setrlimit(RLIMIT_AS), but the highest limit you can set is 2 GB. I was
not able to start pageserver with a limit of 2 GB.

On Linux, each thread allocates 32 MB of virtual memory. I read this
on some random forum on the Internet, but unfortunately could not find
the source again now. Empirically, reducing the number of threads clearly
helps to bring down the virtual memory size.

Aside from the virtual memory usage, it seems excessive to launch 40
threads in both of those thread pools. The tokio default is to have as
many worker threads as there are CPU cores in the system. That seems
like a fine heuristic for us, too, so remove the explicit setting of
the pool size and rely on the default. Note that the GC and compaction
tasks are actually run with tokio spawn_blocking, so the threads that
are actually doing the work, and possibly waiting on I/O, are not
consuming threads from the thread pool. The WAL receiver work is done
in the tokio worker threads, but the WAL receivers are more CPU bound
so that seems OK.

Also remove the explicit maximum on blocking tasks. I'm not sure what
the right value for that would be, or whether the value we set (100)
would be better than the tokio default (512). Since the value was
arbitrary, let's just rely on the tokio default for that, too.
2022-07-06 22:36:38 +03:00
bojanserafimov
242af75653 Fix signal file parsing (#2042) 2022-07-06 13:45:02 -04:00
Kirill Bulatov
50821c0a3c Return download stream directly from the remote storage API 2022-07-05 21:45:15 +03:00
Dmitry Rodionov
cfdf79aceb harden create_empty_timeline
Reorder checks so it checks whether the timeline exists
before writing something to disk, possibly replacing valid content
2022-07-05 16:44:18 +03:00
Heikki Linnakangas
bb69e0920c Do not overwrite an existing image layer.
See github issues #1594 and #1690

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2022-07-05 14:45:31 +03:00
bojanserafimov
d29c545b5d Gc/compaction thread pool, take 2 (#1933)
Decrease the number of pageserver threads by running gc and compaction in a blocking tokio thread pool
2022-07-05 02:06:40 -04:00
Kirill Bulatov
6abdb12724 Fix 1.62 Clippy errors 2022-07-04 23:46:37 +03:00
Bojan Serafimov
f09c09438a Fix gc after import 2022-07-01 11:10:49 +03:00
Kirill Bulatov
1d0706cf25 Fix walreceiver connection selection mechanism
* Avoid reconnecting to a safekeeper immediately after its failure by limiting candidates to those with the fewest connection attempts. Thus we don't have to wait lagging_wal_timeout (10s by default) before the switch happens even if no new changes are generated, and the current test_restarts_under_load expects some commits to happen within 4s.
* Make the default max_lsn_wal_lag larger; otherwise constant reconnections happen during normal work.
* Fix wal_connection_attempts maintenance, preventing a busy loop of reconnections.
2022-06-30 00:40:12 +03:00
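The candidate filter from 1d0706cf25 can be sketched as below. The commit only states that candidates are limited to those with the fewest connection attempts; the `Candidate` struct and the tie-break on the highest `commit_lsn` are assumptions added to make the example complete.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub safekeeper_id: u32,
    pub commit_lsn: u64,
    pub attempts: u32,
}

/// Keeps only the candidates with the fewest connection attempts, then
/// (as an assumed tie-break) picks the one with the highest commit LSN.
pub fn select_candidate(candidates: &[Candidate]) -> Option<&Candidate> {
    let min_attempts = candidates.iter().map(|c| c.attempts).min()?;
    candidates
        .iter()
        .filter(|c| c.attempts == min_attempts)
        .max_by_key(|c| c.commit_lsn)
}
```

A safekeeper that just failed has more attempts than its peers, so it drops out of the candidate set even if it advertises the highest LSN, which avoids reconnecting to it immediately.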
Anastasia Lubennikova
3c2b03cd87 Update timeline size on dropdb. Add the test (#1973)
In addition, fix database size calculation:
count not only main fork of the relation, but also vm and fsm.
2022-06-23 12:28:12 +03:00
Kirill Bulatov
7c49abe7d1 Rework etcd timeline updates and their handling 2022-06-23 09:11:27 +03:00
bojanserafimov
1ca28e6f3c Import basebackup into pageserver (#1925)
Allow importing basebackup taken from vanilla postgres or another pageserver via psql copy in protocol.
2022-06-21 11:04:10 -04:00
Anastasia Lubennikova
36ee182d26 Implement page service 'fullbackup' endpoint (#1923)
* Implement page service 'fullbackup' endpoint that works like basebackup, but also sends relational files

* Add test_runner/batch_others/test_fullbackup.py

Co-authored-by: bojanserafimov <bojan.serafimov7@gmail.com>
2022-06-16 14:07:11 +03:00
Kirill Bulatov
d8a37452c8 Rename ZenithFeedback (#1912) 2022-06-11 00:44:05 +03:00