Commit Graph

2331 Commits

Author SHA1 Message Date
Heikki Linnakangas
150bddb929 Clean up process start/stop handling
* Poll more frequently when waiting for process start/stop. This
  speeds up startup and shutdown in tests. We did this already in
  commit 52ce1c9d53, which reduced the interval to 100 ms, but it was
  inadvertently increased back to 500 ms in commit d42700280f. Reduce
  it to 100 ms again, for both start and stop operations.

* Harmonize the start and stop loops, printing the dots and notices
  the same way in both. I considered extracting the logic to a
  separate retry-function that takes a closure as argument that does
  the polling, but as long as we only have two copies, the code
  duplication isn't that bad.

* Remove newline after "Starting pageserver" and "Starting etcd"
  messages, so that the progress-indicator dots that are printed once
  a second are printed on the same line. Before:

    Starting pageserver at '127.0.0.1:64000' in '.neon'
    ...
    pageserver started, pid: 2538937

  After:

    Starting pageserver at '127.0.0.1:64000' in '.neon'...
    pageserver started, pid: 2538937

  The "Starting safekeeper" message already got this right.

* Update example output in README.md to match
2022-11-16 19:51:37 +02:00
Alexander Bayandin
2b728bc69e test_forward_compatibility: fix path to pg_distrib_dir (#2826)
Set correct `pg_distrib_dir` in `pageserver.toml` and in neon_local
`config`.

`test_forward_compatibility` shows flakiness during `neon_local pg
start`, so hopefully, the patch will help.

```
2022-11-15 16:07:34.091 GMT [13338] LOG:  starting with zenith basebackup at LSN 0/A6A9310, prev 0/0
2022-11-15 16:07:34.091 GMT [13338] FATAL:  cannot start in read-write mode from this base backup
2022-11-15 16:07:34.091 GMT [13337] LOG:  startup process (PID 13338) exited with exit code 1
```
2022-11-16 15:14:36 +00:00
Kirill Bulatov
5184685ced Revert "Introduce aws-sdk-rust as rusoto S3 replacement (#2802)" (#2837)
Despite tests working, on staging the library started to fail with the
following error:

```
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: 2022-11-16T11:53:37.191211Z  INFO init_tenant_mgr:local_tenant_timeline_files: Collected files for 16 tenants
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: thread 'main' panicked at 'A connector was not available. Either set a custom connector or enable the `rustls` and `native-tls` crate featu>
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: stack backtrace:
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:    0: rust_begin_unwind
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:584:5
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:    1: core::panicking::panic_fmt
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:142:14
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:    2: core::panicking::panic_display
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:72:5
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:    3: core::panicking::panic_str
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:56:5
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:    4: core::option::expect_failed
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/option.rs:1854:5
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:    5: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:    6: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:    7: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:    8: <aws_types::credentials::provider::future::ProvideCredentials as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:    9: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   10: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   11: <aws_types::credentials::provider::future::ProvideCredentials as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   12: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   13: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   14: <aws_smithy_http_tower::map_request::MapRequestFuture<F,E> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   15: <core::pin::Pin<P> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/future.rs:124:9
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   16: <aws_smithy_http_tower::parse_response::ParseResponseService<InnerService,ResponseHandler,RetryPolicy> as tower_service::Service<aws_>
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/aws-smithy-http-tower-0.51.0/src/parse_response.rs:109:34
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   17: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   18: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tracing-0.1.37/src/instrument.rs:272:9
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   19: <core::pin::Pin<P> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/future.rs:124:9
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   20: <aws_smithy_client::timeout::TimeoutServiceFuture<InnerFuture> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/aws-smithy-client-0.51.0/src/timeout.rs:189:70
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   21: <tower::retry::future::ResponseFuture<P,S,Request> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tower-0.4.13/src/retry/future.rs:77:41
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   22: <aws_smithy_client::timeout::TimeoutServiceFuture<InnerFuture> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/aws-smithy-client-0.51.0/src/timeout.rs:189:70
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   23: aws_smithy_client::Client<C,M,R>::call_raw::{{closure}}
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/aws-smithy-client-0.51.0/src/lib.rs:227:56
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   24: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   25: aws_smithy_client::Client<C,M,R>::call::{{closure}}
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/aws-smithy-client-0.51.0/src/lib.rs:184:29
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   26: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   27: aws_sdk_s3::client::fluent_builders::GetObject::send::{{closure}}
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/aws-sdk-s3-0.21.0/src/client.rs:7735:40
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   28: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   29: remote_storage::s3_bucket::S3Bucket::download_object::{{closure}}
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at libs/remote_storage/src/s3_bucket.rs:205:20
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   30: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   31: <remote_storage::s3_bucket::S3Bucket as remote_storage::RemoteStorage>::download::{{closure}}
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at libs/remote_storage/src/s3_bucket.rs:399:11
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   32: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   33: <core::pin::Pin<P> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/future.rs:124:9
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   34: remote_storage::GenericRemoteStorage::download_storage_object::{{closure}}
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at libs/remote_storage/src/lib.rs:264:55
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   35: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   36: pageserver::storage_sync::download::download_index_part::{{closure}}
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at pageserver/src/storage_sync/download.rs:148:57
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   37: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   38: pageserver::storage_sync::download::download_index_parts::{{closure}}::{{closure}}::{{closure}}
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at pageserver/src/storage_sync/download.rs:77:75
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   39: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   40: <futures_util::stream::futures_unordered::FuturesUnordered<Fut> as futures_core::stream::Stream>::poll_next
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.24/src/stream/futures_unordered/mod.rs:514:17
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   41: futures_util::stream::stream::StreamExt::poll_next_unpin
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.24/src/stream/stream/mod.rs:1626:9
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   42: <futures_util::stream::stream::next::Next<St> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.24/src/stream/stream/next.rs:32:9
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   43: pageserver::storage_sync::download::download_index_parts::{{closure}}
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at pageserver/src/storage_sync/download.rs:80:69
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   44: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/future/mod.rs:91:19
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   45: tokio::park:🧵:CachedParkThread::block_on::{{closure}}
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.21.1/src/park/thread.rs:267:54
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   46: tokio::coop::with_budget::{{closure}}
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.21.1/src/coop.rs:102:9
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   47: std:🧵:local::LocalKey<T>::try_with
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/thread/local.rs:445:16
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   48: std:🧵:local::LocalKey<T>::with
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/thread/local.rs:421:9
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   49: tokio::coop::with_budget
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.21.1/src/coop.rs:95:5
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   50: tokio::coop::budget
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.21.1/src/coop.rs:72:5
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   51: tokio::park:🧵:CachedParkThread::block_on
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.21.1/src/park/thread.rs:267:31
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   52: tokio::runtime::enter::Enter::block_on
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.21.1/src/runtime/enter.rs:152:13
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   53: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.21.1/src/runtime/scheduler/multi_thread/mod.rs:79:9
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   54: tokio::runtime::Runtime::block_on
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /home/nonroot/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.21.1/src/runtime/mod.rs:492:44
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   55: pageserver::storage_sync::spawn_storage_sync_task
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at pageserver/src/storage_sync.rs:656:34
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   56: pageserver::tenant_mgr::init_tenant_mgr
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at pageserver/src/tenant_mgr.rs:88:13
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   57: pageserver::start_pageserver
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at pageserver/src/bin/pageserver.rs:269:9
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   58: pageserver::main
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at pageserver/src/bin/pageserver.rs:103:5
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:   59: core::ops::function::FnOnce::call_once
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]:              at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/ops/function.rs:248:5
Nov 16 11:53:37 pageserver-0.us-east-2.aws.neon.build pageserver[481974]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```

Feels like better testing on the env is needed later, maybe more e2e
tests have to be written (albeit we have download tests, so something
else happens here, tls issues?)
2022-11-16 15:10:36 +00:00
Heikki Linnakangas
9ae4da4f31 Silence test failure caused by race condition between GC and detach.
Thanks to the race condition, GC sometimes fails with "no such file or
directory" error, if the tenant is detached concurrently. That's a
known issue, but it didn't cause test failures until we started to
check for unexpected ERRORs in the log in commit 46d30bf054. We should
fix the race condition, of course, but until we do, let's silence the
failures.
2022-11-16 15:50:49 +02:00
Sergey Melnikov
aca221ac8b Switch old staging to new etcd (#2834) 2022-11-16 16:54:55 +04:00
Heikki Linnakangas
d013a2b227 Make test_gc_cutoff test more robust.
Previously, if the failpoint was not reached for some reason, the test
would only fail because it would reach the 5 minute timeout we have on
all python tests. That's very subtle. Make it fail explicitly, if the
failpoint is not hit on each iteration of the loop.

Extracted from a larger PR, see
https://github.com/neondatabase/neon/pull/2785/files#r1022765794
2022-11-16 13:24:02 +02:00
Heikki Linnakangas
3f93c6c6f0 Improve checks for broken tenants in test_broken_timeline.py
- Refactor the code a little bit, removing the silly for-loop over a
  single element.

- Make it more clear in log messages that the errors are expectd

- Check for a more precise error message "Failed to load delta layer"
  instead of just "extracting base backup failed".
2022-11-16 13:16:00 +02:00
Rory de Zoete
53267969d7 Preparation for ARM runners (#2751)
Need to make the runner tag more specific else we inadvertently might
run workloads on the wrong arch

Co-authored-by: Rory de Zoete <rdezoete@Rorys-Mac-Studio.fritz.box>
Co-authored-by: Rory de Zoete <rdezoete@RorysMacStudio.fritz.box>
2022-11-16 11:28:57 +01:00
Andrés
c4b417ecdb Introduce aws-sdk-rust as rusoto S3 replacement (#2802)
- `aws-smithy-http`: Needed because of `SdkBody` see
https://github.com/awslabs/smithy-rs/issues/1759
- `aws-types`: Needed because of `SharedCredentialsProvider`, the
recommended way from aws is something like
`aws_config::from_env().region("us-east-1").load().await` but that is
problematic because of:

- `sync -> async ` in the creation of S3Client and i don't want to
change the signature of any method in this class.
- We do not need the four default steps in
https://github.com/awslabs/aws-sdk-rust/blob/main/sdk/aws-config/src/default_provider/credentials.rs#L235

- `Hyper`: Similar to what's currently doing Rusoto in
https://github.com/rusoto/rusoto/blob/master/rusoto/signature/src/signature.rs#L59
to stream the body, see also
https://github.com/awslabs/aws-sdk-rust/discussions/361

Co-authored-by: andres <andres.rodriguez@outlook.es>
2022-11-16 11:28:37 +02:00
Joonas Koivunen
1d105727cb perf: simple walredo bench (#2816)
adds a simple walredo bench to allow some comparison of the walredo
throughput.

Cc: #1339, #2778
2022-11-16 11:13:56 +02:00
Heikki Linnakangas
4787a744c2 Add documentation page about error handling and logging. (#2681)
Add a page to the internal documentation, on how we do error
handling and logging.
2022-11-16 10:38:03 +02:00
Sergey Melnikov
ac3ccac56c Add zenith-1-ps-4 and zenith-1-ps-5 (#2815) 2022-11-16 11:25:24 +04:00
Alexander Bayandin
638af96c51 postgres-v15: fix expected results for regress tests (#2822)
Fix expected output for regress tests for Postgres 15.

Required for https://github.com/neondatabase/neon/pull/2809
2022-11-15 22:32:12 +00:00
Kirill Bulatov
1e21ca1afe Trim whitespaces off Lsn strings when parsing (#2827) 2022-11-15 22:39:44 +02:00
Heikki Linnakangas
46d30bf054 Check for errors in pageserver log after each test.
If there are any unexpected ERRORs or WARNs in pageserver.log after test
finishes, fail the test. This requires whitelisting the errors that *are*
expected in each test, and there's also a few common errors that are
printed by most tests, which are whitelisted in the fixture itself.

With this, we don't need the special abort() call in testing mode, when
compaction or GC fails. Those failures will print ERRORs to the logs,
which will be picked up by this new mechanisms.

A bunch of errors are currently whitelisted that we probably shouldn't
be emitting in the first place, but fixing those is out of scope for this
commit, so I just left FIXME comments on them.
2022-11-15 18:47:28 +02:00
Heikki Linnakangas
d0105cea1f Avoid errors when removing a timeline that's still active 2022-11-15 18:47:28 +02:00
Heikki Linnakangas
e44e4a699b Downgrade log message, if client terminates COPY during basebackup import
It's more or less expected from pageserver's point of view. Change the
error kind to ConnectionReset, so that it gets logged at INFO level
instead of ERROR.
2022-11-15 18:47:28 +02:00
Heikki Linnakangas
223834a420 Fix confusion between Postgres and pageserver connection string in test.
We passed the pageserver's libpq endpoint URL as the 'compute_ctl
--connstr' argument, but that was bogus: the --connstr URL is supposed
to be the URL to the *Postgres* instance that compute_ctl launches and
monitors, not to the pageserver. compute_ctl does need the pageserver
URL too, but it is read from the cluster spec JSON, not --connstr.

That was pretty confusing, as you got a lot of "unknown command"
errors in the pageserver log, when compute_tools tries to run regular
SQL commands on the pageserver. The test still passed, however, as it
doesn't require the SQL commands to succeed. But to make this less
confusing, use an invalid hostname instead, so that the queries will
fail to even connect.
2022-11-15 18:47:28 +02:00
MMeent
01778e37cc Address issues in the pagestore prefetch mechanism: (#2790)
- Update vendored PostgreSQL to address prefetch issues
 - Make flushed state explicit in PrefetchState
 - Move flush logic into prefetch_wait_for, where possible
 - Clean up some prefetch state handling code in the various code
elements handling state transitions.
 - Fix a race condition in neon_read_at_lsn where a hash entry pointer
was used after the hash table was updated. This could result in
incorrect state transitions and assertion failures after disconnects
during prefetch_wait_for in that neon_read_at_lsn.
 
Fixes #2780
2022-11-15 15:12:38 +01:00
Alexander Bayandin
03190a2161 GitHub Actions: Do not create Allure report for cancelled jobs (#2813)
If a workflow is cancelled, do not delay its finishing by creating an allure
report.
2022-11-15 10:27:59 +00:00
Kirill Bulatov
f87017c04d Omit dependencies' debug info (#2803)
Based on https://neondb.slack.com/archives/C0277TKAJCA/p1668079753506749

Co-authored-by: Arseny Sher <sher-ars@yandex.ru>
2022-11-14 12:44:41 +00:00
andres
c11cbf0f5c fix test_compare_child_and_root_pgbench_perf to do a fair comparison 2022-11-13 21:03:54 +02:00
Heikki Linnakangas
f30ef00439 Stop building the legacy "compute-node" docker image.
Before we had separate images for v14 and v15, the compute node image
was called just "neondatabase/compute-node". It has been superseded by
the "neondatabase/compute-node-v14" and "neondatabase/compute-node-v15"
images. The old image is not used by the cloud console build or tests
anymore.
2022-11-12 20:48:10 +02:00
Heikki Linnakangas
dbe5b52494 Avoid some vector-growing overhead.
I saw this in 'perf' profile of a sequential scan:

> -   31.93%     0.21%  compute request  pageserver         [.] <pageserver::walredo::PostgresRedoManager as pageserver::walredo::WalRedoManager>::request_redo
>    - 31.72% <pageserver::walredo::PostgresRedoManager as pageserver::walredo::WalRedoManager>::request_redo
>       - 31.26% pageserver::walredo::PostgresRedoManager::apply_batch_postgres
>          + 7.64% <std::process::ChildStdin as std::io::Write>::write
>          + 6.17% nix::poll::poll
>          + 3.58% <std::process::ChildStderr as std::io::Read>::read
>          + 2.96% std::sync::condvar::Condvar::notify_one
>          + 2.48% std::sys::unix::locks::futex::Condvar::wait
>          + 2.19% alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle
>          + 1.14% std::sys::unix::locks::futex::Mutex::lock_contended
>            0.67% __rust_alloc_zeroed
>            0.62% __stpcpy_ssse3
>            0.56% std::sys::unix::locks::futex::Mutex::wake

Note the 'do_reserve_handle' overhead. That's caused by having to grow
the buffer used to construct the WAL redo request. This commit
eliminates that overhead. It's only about 2% of the overall CPU usage,
but every little helps.

Also reuse the temp buffer when reading records from a DeltaLayer, and
call Vec::reserve to avoid growing a buffer when reading a blob across
pages. I saw a reduction from 2% to 1% of CPU spent in
do_reserve_and_handle in that codepath, but that's such a small change
that it could be just noise. Seems like it shouldn't hurt though.
2022-11-12 18:52:25 +02:00
Heikki Linnakangas
4131a6efae Remove unused Dockerfile.compute-node.legacy.
The cloud end-to-end tests use the docker images built by the neon PR
now, and don't need this legacy Dockerfile anymore.
2022-11-12 18:51:51 +02:00
Kirill Bulatov
03695261fc Test storage Docker images (#2767)
Closes https://github.com/neondatabase/neon/issues/2697
Example:
https://github.com/neondatabase/neon/actions/runs/3416774593/jobs/5688394855

Adds a set of tests on the storage Docker images before they are pushed
to the public registries:
* tests that pageserver binary has the correct version string (other
binaries are built with the same library, so it should be enough to test
one)
* tests that the compose file set-up works and all components are able
to start and perform a single SQL query (CREATE TABLE)
2022-11-11 19:42:26 +02:00
bojanserafimov
7fd88fab59 Trace read requests (#2762) 2022-11-10 16:43:04 -05:00
bojanserafimov
7edc098c40 Add perf test instructions (#2777) 2022-11-10 16:05:57 -05:00
Vadim Kharitonov
8421218152 Change the branch name for V14 as it does for V15 2022-11-10 17:41:36 +01:00
Alexander Bayandin
d5b7832c21 Fix test_wal_backpressure tests (#2792)
Fix expected return type for `fetchone `:

```
AssertionError: assert False
 +  where False = isinstance((Decimal('56048'), '55 kB', '0/1CF52D8', '0/1CE77E8'), list)
```
2022-11-10 16:15:04 +00:00
Arthur Petukhovsky
c6072d38c2 Remove debug logs in should_walsender_stop (#2791) 2022-11-10 15:49:00 +00:00
Alexander Bayandin
175779c0ef GitHub Actions: fix non-parallel benchmarks on CI (#2787)
Fix non-parallel pytest run by setting `--dist=loadgroup` only for
pytest command with xdist enabled (`-n` is set)
2022-11-10 12:51:47 +00:00
Christian Schwarz
8654e95fae walredo: fix zombie processes ([postgres] <defunct>)
This change wraps the std::process:Child that we spawn for WAL redo
into a type that ensures that we try to SIGKILL + waitpid() on it.

If there is no explicit call to kill_and_wait(), the Drop implementation
will spawns a task that does it in the BACKGROUND_RUNTIME.
That's an ugly hack but I think it's better than doing kill+wait
synchronously from Drop, since I think the general assumption in the
Rust ecosystem is that Drop doesn't block.
Especially since the drop sites can be _any_ place that drops the last
Arc<PostgresRedoManager>, e.g., compaction or GC.

The benefit of having the new type over just adding a Drop impl to
PostgresRedoProcess is that we can construct it earlier than the full
PostgresRedoProcess in PostgresRedoProcess::launch().
That allows us to correctly kill+wait the child if there is an error in
PostgresRedoProcess::launch() after spawning it.

I also took a stab at a regression test. I manually verified
that it fails before the fix to walredo.rs.

fixes https://github.com/neondatabase/neon/issues/2761
closes https://github.com/neondatabase/neon/pull/2776
2022-11-10 12:50:50 +01:00
Vadim Kharitonov
f720dd735e Stricter mypy linters for test_runner/fixtures/* 2022-11-10 12:47:27 +01:00
Alexander Bayandin
c4f9f1dc6d Add data format forward compatibility tests (#2766)
Add `test_forward_compatibility`, which checks if it's going to
be possible to roll back a release to the previous version.
The test uses artifacts (Neon & Postgres binaries) from the previous
release to start Neon on the repo created by the current version. It
performs exactly the same checks as `test_backward_compatibility` does.

Single `ALLOW_BREAKING_CHANGES` env var got replaced by
`ALLOW_BACKWARD_COMPATIBILITY_BREAKAGE` &
`ALLOW_FORWARD_COMPATIBILITY_BREAKAGE` and can be set by `backward
compatibility breakage` and `forward compatibility breakage` labels
respectively.
2022-11-10 09:06:34 +00:00
Kirill Bulatov
4a10e1b066 Pass pushed storage Docker tag to e2e jobs 2022-11-10 08:50:42 +02:00
Vadim Kharitonov
b55466045e Introduce codeowners 2022-11-09 11:43:10 +01:00
Heikki Linnakangas
e999f66b01 Use a cached WaitEventSet instead of WaitLatchOrSocket.
When we repeatedly wait for the same events, it's faster to create the
event set once and reuse it. While testing with a sequential scan test
case, I saw WaitLatchOrSocket consuming a lot of CPU:

> -   40.52%     0.14%  postgres  postgres           [.] WaitLatchOrSocket
>    - 40.38% WaitLatchOrSocket
>       + 17.83% AddWaitEventToSet
>       + 9.47% close@plt
>       + 8.29% CreateWaitEventSet
>       + 4.57% WaitEventSetWait

This eliminates most of that overhead.
2022-11-08 19:45:14 +02:00
andres
1cf257bc4a feedback 2022-11-08 20:15:54 +04:00
andres
40164bd589 Use latestMsgReceivedAt in walproposer 2022-11-08 20:15:54 +04:00
Christian Schwarz
c3a470a29b walredo process management: handle every error on the kill() and drop path
If we're not calling kill() before dropping the PostgresRedoProcess, we
currently leak it.
That's most likely the root cause for #2761.
This patch
1. adds an error log message for that case and
2. adds error handling for all errors on the kill() path. If we're a
`testing` build, we panic. Otherwise, we log an error and leak the
process.

The error handling changes (2) are necessary to conclusively state that
the root cause for #2761 is indeed (1). If we didn't have them, the root
cause could be missing error handling instead.

To make the log messages useful, I've added tracing::instrument
attributes that log the tenant_id and PID. That helps mapping back the
PID of `defunct` processes to pageserver log messages. Note that a
defunct process's `/proc/$PID/` directory isn't very useful. We have
left little more than its PID.

Once we have validated the root cause, we'll find a fix, but that's
still an ongoing discussion.

refs https://github.com/neondatabase/neon/issues/2761
closes https://github.com/neondatabase/neon/pull/2769
2022-11-08 14:03:13 +01:00
Alexander Bayandin
c1a76eb0e5 test_runner: replace global variables with fixtures (#2754)
This PR replaces the following global variables in the test framework
with fixtures to make tests more configurable. I mainly need this for
the forward compatibility tests (draft in
https://github.com/neondatabase/neon/pull/2766).

```
base_dir
neon_binpath 
pg_distrib_dir
top_output_dir
default_pg_version (this one got replaced with a fixture named pg_version)
```

Also, this PR adds more `Path` type where the code implies it.
2022-11-07 18:39:51 +00:00
MMeent
d5b6471fa9 Update prefetch mechanism: (#2687)
Prefetch requests and responses are stored in a ringbuffer instead of a
queue, which means we can utilize prefetches of many relations
concurrently -- page reads of un-prefetched relations now don't imply
dropping prefetches.

In a future iteration, this may detect sequential scans based on the
read behavior of sequential scans, and will dynamically prefetch buffers
for such relations as needed. Right now, it still depends on explicit
prefetch requests from PostgreSQL.

The main improvement here is that we now have a buffer for prefetched
pages of 128 entries with random access. Before, we had a similarly sized
cache, but this cache did not allow for random access, which resulted in
dropped entries when multiple systems used the prefetching subsystem
concurrently.

See also: #2544
2022-11-07 18:13:24 +02:00
Joonas Koivunen
548d472b12 fix: logical size query at before initdb_lsn (#2755)
With more realistic selection of gc_horizon in tests there is an
immediate failure with trying to query logical size with lsn <
initdb_lsn. Fixes that, adds illustration gathered from clarity of
explaining this tenant size calculation to more people.

Cc: #2748, #2599.
2022-11-07 12:03:57 +02:00
Dmitry Rodionov
99e745a760 review adjustments 2022-11-06 13:42:18 +02:00
Dmitry Rodionov
15d970f731 decrease diff by moving check_checkpoint_distance back
Co-authored-by: Christian Schwarz <me@cschwarz.com>
2022-11-06 13:42:18 +02:00
Heikki Linnakangas
7b7f84f1b4 Refactor layer flushing task
Extracted from https://github.com/neondatabase/neon/pull/2595
2022-11-06 13:42:18 +02:00
Alexander Bayandin
bc40a5595f test_runner: update cryptography (#2753)
Bump `cryptography` package from 37.0.4 to 38.0.3 to fix
https://github.com/neondatabase/neon/security/dependabot/6 (rather just
in case).

Ref https://www.openssl.org/news/secadv/20221101.txt
2022-11-04 21:29:51 +00:00
Heikki Linnakangas
07b3ba5ce3 Bump postgres submodules, for some cosmetic cleanups. 2022-11-04 20:30:13 +02:00
Dmitry Ivanov
c38f38dab7 Move pq_proto to its own crate 2022-11-03 22:56:04 +03:00