Commit Graph

3174 Commits

Author SHA1 Message Date
Joonas Koivunen
4ef9f38bdd doc: comment pass 2023-05-05 02:46:15 +03:00
Joonas Koivunen
505febfc97 doc: fix typo and assert message 2023-05-05 02:42:20 +03:00
Joonas Koivunen
fa947f0f4f fix: resubscribe while holding the lock 2023-05-05 02:40:15 +03:00
Joonas Koivunen
022ff36481 chore: yet another cfg feature testing 2023-05-05 02:39:56 +03:00
Joonas Koivunen
a77bde5852 restore anyhow::anyhow!(s) error transformations 2023-05-05 02:17:51 +03:00
Joonas Koivunen
88d2551bb4 request coalescing with MaybeDone 2023-05-05 02:05:31 +03:00
Joonas Koivunen
fffb23ec06 chore: forgotten cfg feature testing 2023-05-05 01:17:30 +03:00
Joonas Koivunen
1c20cea537 doc: note unhandled case 2023-05-05 00:54:36 +03:00
Joonas Koivunen
a543e7081f refactor: InnerDeleteTimelineError does not need to be Error 2023-05-05 00:54:19 +03:00
Joonas Koivunen
e00da98996 wip: try: request coalescing (now building) 2023-05-05 00:34:26 +03:00
Joonas Koivunen
5a924f1867 wip: try with request coalescing 2023-05-05 00:17:15 +03:00
Joonas Koivunen
ad70535899 complete the revert with the datastructure change 2023-05-04 22:00:12 +03:00
Joonas Koivunen
fd6d81b044 wip: revert: 11c18b05aa
This partially reverts commit 11c18b05aa
for the take_mut parts.
2023-05-04 21:50:17 +03:00
Christian Schwarz
8425b6ab21 clarifications around the QueueUninitialized error 2023-05-04 17:08:02 +02:00
Christian Schwarz
c025617639 actually give it 5 seconds, not 2.5 (and drive-by fix the error message)
CI release test failed due to this timeout, likely due to system load.
The debug test was not affected.
2023-05-04 12:03:46 +02:00
Christian Schwarz
b8eba89a10 improve docstring on persist_index_part_with_deleted_flag 2023-05-03 19:46:55 +02:00
Christian Schwarz
dd73c1d793 revert some intermediate changes to failpoint_sleep_millis_async 2023-05-03 19:42:11 +02:00
Christian Schwarz
08558b83ed improve XXX above remove_dir_all regarding atomicity
With this PR, we're now atomic, if remote storage is configured.
2023-05-03 19:39:32 +02:00
Christian Schwarz
30ea7b141b test_delete_timeline_client_hangup: account for logging of dropped requests
This must have come with the latest merge from main.
2023-05-03 18:39:54 +02:00
Christian Schwarz
4903953e9f clippy: tracing::Span is only used in cfg(feature = "testing") 2023-05-03 18:35:01 +02:00
Christian Schwarz
2785b6a4ed at some point we committed changes submodule refs in vendor/ , revert that 2023-05-03 18:34:03 +02:00
Christian Schwarz
de3f23344a Merge remote-tracking branch 'origin/main' into dkr/deleted-flag-in-remote-index 2023-05-03 18:07:36 +02:00
Christian Schwarz
b58bf56670 fixup comment from 'add TODO comment regarding de-configuration of remote storage' 2023-05-03 17:47:11 +02:00
Christian Schwarz
16486edd8e clarify / improve comments in Tenant::delete_timeline 2023-05-03 17:33:53 +02:00
Alexander Bayandin
b114ef26c2 GitHub Autocomment: add a note if no tests were run (#4109)
- Always (if not cancelled) add a comment to a PR
- Mention in the comment if no tests were run / reports were not
generated.
2023-05-03 15:38:49 +01:00
Arthur Petukhovsky
3ceef7b17a Add more safekeeper and walreceiver metrics (#4142)
Add essential safekeeper and pageserver::walreceiver metrics. Mostly
counters, such as the number of received queries, broker messages,
removed WAL segments, or connection switches events in walreceiver.
Also logs broker push loop duration.
2023-05-03 17:07:41 +03:00
Kirill Bulatov
586e6e55f8 Print WalReceiver context on WAL waiting timeout (#4090)
Closes https://github.com/neondatabase/neon/issues/2106

Before:
```
Extracting base backup to create postgres instance: path=/Users/someonetoignore/work/neon/neon_main/test_output/test_pageserver_lsn_wait_error_safekeeper_stop/repo/endpoints/ep-2/pgdata port=15017

              stderr: command failed: page server 'basebackup' command failed

Caused by:
    0: db error: ERROR: Timed out while waiting for WAL record at LSN 0/FFFFFFFF to arrive, last_record_lsn 0/A2C3F58 disk consistent LSN=0/16B5A50
    1: ERROR: Timed out while waiting for WAL record at LSN 0/FFFFFFFF to arrive, last_record_lsn 0/A2C3F58 disk consistent LSN=0/16B5A50

Stack backtrace:
```

After:
```
Extracting base backup to create postgres instance: path=/Users/someonetoignore/work/neon/neon/test_output/test_pageserver_lsn_wait_error_safekeeper_stop/repo/endpoints/ep-2/pgdata port=15011

              stderr: command failed: page server 'basebackup' command failed

Caused by:
    0: db error: ERROR: Timed out while waiting for WAL record at LSN 0/FFFFFFFF to arrive, last_record_lsn 0/A2C3F58 disk consistent LSN=0/16B5A50, WalReceiver status (update 2023-04-26 14:20:39): streaming WAL from node 12346, commit|streaming Lsn: 0/A2C3F58|0/A2C3F58, safekeeper candidates (id|update_time|commit_lsn): [(12348|14:20:40|0/A2C3F58), (12346|14:20:40|0/A2C3F58), (12347|14:20:40|0/A2C3F58)]
    1: ERROR: Timed out while waiting for WAL record at LSN 0/FFFFFFFF to arrive, last_record_lsn 0/A2C3F58 disk consistent LSN=0/16B5A50, WalReceiver status (update 2023-04-26 14:20:39): streaming WAL from node 12346, commit|streaming Lsn: 0/A2C3F58|0/A2C3F58, safekeeper candidates (id|update_time|commit_lsn): [(12348|14:20:40|0/A2C3F58), (12346|14:20:40|0/A2C3F58), (12347|14:20:40|0/A2C3F58)]

Stack backtrace:
```

As the issue requests, the PR adds the context in logs only, but I think
we should expose the context via HTTP management API similar way — it
should be simple with the new API, but better be done in a separate PR.

Co-authored-by: Kirill Bulatov <kirill@neon.tech>
2023-05-03 16:25:19 +03:00
Anton Chaporgin
db81242f4a add debug to pg-sni-router install (#4143) 2023-05-03 16:14:16 +03:00
dependabot[bot]
39ca7c7c09 Bump flask from 2.1.3 to 2.2.5 (#4138) 2023-05-03 10:40:35 +01:00
Heikki Linnakangas
ecc0cf8cd6 Treat EPIPE as an expected error. (#4141)
If the other end of a TCP connection closes its read end of the socket,
you get an EPIPE when you try to send. I saw that happen in the CI once:


https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4136/release/4869464644/index.html#suites/c19bc2126511ef8cb145cca25c438215/7ec87b016c0b4b50/

```
2023-05-03T07:53:22.394152Z ERROR Task 'serving compute connection task' tenant_id: Some(c204447079e02e7ba8f593cb8bc57e76), timeline_id: Some(b666f26600e6deaa9f43e1aeee5bacb7) exited with error: Postgres connection error

Caused by:
    Broken pipe (os error 32)

Stack backtrace:
   0: pageserver::page_service::page_service_conn_main::{{closure}}
             at /__w/neon/neon/pageserver/src/page_service.rs:282:17
      <core::panic::unwind_safe::AssertUnwindSafe<F> as core::future::future::Future>::poll
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panic/unwind_safe.rs:296:9
      <futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll::{{closure}}
             at /__w/neon/neon/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/future/future/catch_unwind.rs:36:42
      <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panic/unwind_safe.rs:271:9
   ...
```

In the passing, add a comment to explain what the "expected" in the
`is_expected_io_error` function means.
2023-05-03 12:03:13 +03:00
Joonas Koivunen
faebe3177b wal_craft: cleanup to keep editable (#4121)
wal_craft had accumulated some trouble by using `use anyhow::*;`. Fixes
that, removes redundant conversions (never need to convert a Path to
OsStr), especially at the `Process` args.

Originally in #4100 but we merged a later PR instead for the fixes. I
dropped the `postmaster.pid` polling in favor of just having a longer
connect timeout.
2023-05-03 11:11:55 +03:00
Joonas Koivunen
474f69c1c0 fix: omit cancellation logging when panicking (#4125)
noticed while describing `RequestSpan`, this fix will omit the otherwise
logged message about request being cancelled when panicking in the
request handler. this was missed on #4064.
2023-05-03 10:56:49 +03:00
Konstantin Knizhnik
47521693ed Fix file_cache build warnings (#4118)
## Describe your changes

## Issue ticket number and link

## Checklist before requesting a review

- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Do not forget to reformat commit message to not include the above
checklist
2023-05-02 22:28:21 +03:00
Heikki Linnakangas
4d55d61807 Store basic endpoint info in endpoint.json file. (#4058)
It's more convenient than parsing the postgresql.conf file.

Extracted from PR #3886. I started working on another patch (to make it
safe to run two "neon_local endpoint create" commands concurrently), and
realized that this change will make that simpler too.
2023-05-02 20:36:11 +03:00
Sergey Melnikov
093fafd6bd Deploy pg-sni-router (#4132) 2023-05-01 17:18:45 +02:00
Shany Pozin
e3ae2661ee Add 2 new sets of safekeepers to us-west2 (#4130)
## Describe your changes
TF output:       
module.safekeeper-us-west-2.aws_instance.this["3"]: Creation complete
after 13s [id=i-089f6b9ef426dff76]
module.safekeeper-us-west-2.aws_instance.this["4"]: Creation complete
after 13s [id=i-0fe6bf912c4710c82]
module.safekeeper-us-west-2.aws_instance.this["5"]: Creation complete
after 13s [id=i-0a83c1c46d2b4e409]
module.safekeeper-us-west-2.aws_instance.this["6"]: Creation complete
after 13s [id=i-0fef5317b8fdc9f8d]
module.safekeeper-us-west-2.aws_instance.this["7"]: Creation complete
after 13s [id=i-0be739190d4289bf9]
module.safekeeper-us-west-2.aws_instance.this["8"]: Creation complete
after 13s [id=i-00e851803669e5cfe]
2023-05-01 14:22:59 +03:00
Anton Chaporgin
7e368f3edf build pg-sni-router binary (#4129)
## Describe your changes
This adds pg-sni-router binary to the build pipeline and neon image.

## Issue ticket number and link

https://github.com/neondatabase/cloud/issues/1461
2023-05-01 12:14:31 +02:00
Joonas Koivunen
138bc028ed fix: quick and dirty panic avoidance on drop path (#4128)
Sentry caught a panic on load testing server related to metric removals:
https://neondatabase.sentry.io/issues/4142396994

Turn the `expect` into logging, but also add logging for each removal,
so we could identify in which cases we do double-remove. The
double-removal (or never adding) cause is not obvious or expected.

Original added in #3837.
2023-05-01 11:54:09 +03:00
Stas Kelvich
d53f81b449 Add one more pageserver to staging 2023-04-30 22:39:34 +03:00
Joonas Koivunen
6f472df0d0 fix: restore not logging ignored io errors as errors (#4120)
the fix is rather indirect due to the accidental applying of too much
`anyhow`: if handle_pagerequests returns a `QueryError` it will now be
bubbled up as-is `QueryError`. `QueryError` allows the inner
`std::io::Error` to be inspected and thus we can filter certain error
kinds which are perfectly normal without a huge log message.

for a very long time (b2f5102) the errors were converted to `anyhow` by
mistake which made this difficult or impossible, even though from the
types it would *appear* that we propagate wrapped `std::io::Error`s and
can filter them.

Fixes #4113, most likely filters some other errors as well.
2023-04-30 14:34:55 +03:00
Rahul Patil
21eb944b5e Staging: Add safekeeper nodes [3-8] to eu-west-1 (#4123) 2023-04-29 15:25:57 +03:00
Arthur Petukhovsky
95244912c5 Override sharded-slab to increase MAX_THREADS (#4122)
Add patch directive to Cargo.toml to use patched version of
sharded-slab:
98d16753ab

Patch changes the MAX_THREADS limit from 4096 to 32768. This is a
temporary workaround for using tracing from many threads in safekeepers
code, until async safekeepers patch is merged to the main.

Note that patch can affect other rust services, not only the safekeeper
binary.
2023-04-29 13:31:04 +03:00
Shany Pozin
2617e70008 Add 4 new Pageservers for retool launch (#4115)
## Describe your changes
Adding 4 new pageserves to us-west

TF apply output:
module.pageserver-us-west-2.aws_instance.this["7"]: Creation complete
after 21s [id=i-02eec9b40617db5bc]
module.pageserver-us-west-2.aws_instance.this["5"]: Creation complete
after 21s [id=i-00ca6417c7bf96820]
module.pageserver-us-west-2.aws_instance.this["4"]: Creation complete
after 21s [id=i-013263dd1c239adcc]
module.pageserver-us-west-2.aws_instance.this["6"]: Creation complete
after 22s [id=i-01cdf7d2bc1433b6a]
2023-04-29 11:42:52 +02:00
Arthur Petukhovsky
8543485e92 Pull clone timeline from peer safekeepers (#4089)
Add HTTP endpoint to initialize safekeeper timeline from peer
safekeepers. This is useful for initializing new safekeeper to replace
failed safekeeper. Not fully "correct" in all cases, but should work in
most.

This code is not suitable for production workloads but can be tested on
staging to get started. New endpoint is separated from usual cases and
should not affect anything if no one explicitly uses a new endpoint. We
can rollback this commit in case of issues.
2023-04-28 14:20:46 +00:00
Joonas Koivunen
ec53c5ca2e revert: "Add check for duplicates of generated image layers" (#4104)
This reverts commit 732acc5.

Reverted PR: #3869

As noted in PR #4094, we do in fact try to insert duplicates to the
layer map, if L0->L1 compaction is interrupted. We do not have a proper
fix for that right now, and we are in a hurry to make a release to
production, so revert the changes related to this to the state that we
have in production currently. We know that we have a bug here, but
better to live with the bug that we've had in production for a long
time, than rush a fix to production without testing it in staging first.

Cc: #4094, #4088
2023-04-28 17:20:18 +03:00
Stas Kelvich
94d612195a bump rust-postgres version, after merging PR in rust-postgres 2023-04-28 17:15:43 +03:00
Stas Kelvich
b1329db495 fix sigterm handling 2023-04-28 17:15:43 +03:00
Stas Kelvich
5bb971d64e fix more python tests 2023-04-28 17:15:43 +03:00
Stas Kelvich
0364f77b9a fix python styling 2023-04-28 17:15:43 +03:00
Stas Kelvich
4ac6a9f089 add backward compatibility to proxy 2023-04-28 17:15:43 +03:00