Commit Graph

2847 Commits

Author SHA1 Message Date
Shany Pozin
d19c5248c9 Add UUID header to mgmt API (#3708)
## Describe your changes

## Issue ticket number and link
#3479
## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
2023-03-01 18:09:08 +02:00
sharnoff
1360361f60 Fix missing VM cgconfig.conf (#3718)
It was being added to the wrong stage in the dockerfile. This should fix
it, and resolves an ongoing issue on staging.
2023-02-28 21:11:00 -08:00
Alexander Bayandin
000eb1b069 Bump tempfile from 3.3.0 to 3.4.0 (#3709)
Update `tempfile` crate to get rid of `remove_dir_all` dependency
Ref https://github.com/neondatabase/neon/security/dependabot/15
2023-02-27 12:44:08 +00:00
Heikki Linnakangas
f51b48fa49 Fix UNLOGGED tables.
Instead of trying to create missing files on the way, send init fork contents as
main fork from pageserver during basebackup. Add test for that. Call
put_rel_drop for init forks; previously they weren't removed. Bump
vendor/postgres to revert previous approach on Postgres side.

Co-authored-by: Arseny Sher <sher-ars@yandex.ru>

ref https://github.com/neondatabase/postgres/pull/264
ref https://github.com/neondatabase/postgres/pull/259
ref https://github.com/neondatabase/neon/issues/1222
2023-02-24 23:30:02 +04:00
Sergey Melnikov
9f906ff236 Add pageserver-2.us-east-2.aws.neon.tech (#3701) 2023-02-23 19:56:21 +01:00
Sam Kleinman
c79dd8d458 compute_ctl: support for fetching spec from control plane (#3610) 2023-02-23 13:19:39 -05:00
Vadim Kharitonov
ec4ecdd543 Enable postgres SPI extensions 2023-02-23 16:49:37 +01:00
MMeent
20a4d817ce Update vendored PostgreSQL versions to 14.7 and 15.2 (#3581)
## Describe your changes
Rebase vendored PostgreSQL onto 14.7 and 15.2

## Issue ticket number and link

#3579

## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [x] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [x] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
    ```
The version of PostgreSQL that we use is updated to 14.7 for PostgreSQL
14 and 15.2 for PostgreSQL 15.
    ```
2023-02-23 16:10:22 +02:00
Vadim Kharitonov
5ebf7e5619 Fix pg_jsonschema and pg_graphql 2023-02-23 10:43:46 +01:00
Arseny Sher
0692fffbf3 Bump vendor/postgres to include hotfix for unlogged tables with indexes.
https://github.com/neondatabase/postgres/pull/259
https://github.com/neondatabase/postgres/pull/262
2023-02-23 01:34:59 +04:00
Vadim Kharitonov
093570af20 Compile pg_hashids extension 2023-02-22 21:00:25 +01:00
Dmitry Rodionov
eb403da814 Use debug level for successful GET http requests (#3681)
We started rather frequently scrap some apis for metadata. This includes
layer eviction tester, I believe console does that too.

It should eliminate these logs:
https://neonprod.grafana.net/goto/rr_ace1Vz?orgId=1 (Note the rate
around 2k messages per minute)
2023-02-22 22:19:05 +03:00
Vadim Kharitonov
f3ad635911 Compile pgrouting extension 2023-02-22 20:16:11 +01:00
Vadim Kharitonov
a8d7360881 Compile hypopg extension 2023-02-22 20:14:30 +01:00
Lassi Pölönen
b0311cfdeb Change the production neon-proxy-scram update strategy to RollingUpdate (#3683)
## Describe your changes
The same change in production as was done in staging by
https://github.com/neondatabase/neon/pull/3678

## Issue ticket number and link
https://github.com/neondatabase/neon/issues/3333
2023-02-22 20:15:37 +02:00
Konstantin Knizhnik
412e0aa985 Skip largest N holes during compaction (#3597)
## Describe your changes

This is yet another attempt to address problem with storage size
ballooning #2948
Previous PR #3348 tries to address this problem by maintaining list of
holes for each layer.
The problem with this approach is that we have to load all layer on
pageserver start.
Lazy loading of layers is not possible any more.

This PR tries to collect information of N largest holes on compaction
time and exclude this holes from produced layers.
It can cause generation of larger number of layers (up to 2 times) and
producing small layers.
But it requires minimal changes in code and doesn't affect storage
format.

For graphical explanation please see thread:
https://github.com/neondatabase/neon/pull/3597#discussion_r1112704451

## Issue ticket number and link

#2948
#3348

## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
2023-02-22 18:28:01 +02:00
Lassi Pölönen
965b4f4ae2 Change the staging neon-proxy-scram update strategy to RollingUpdate (#3678)
## Describe your changes
When we deploy the proxy with the default Recreate strategy, there's
always some downtime and existing connections will be shut down. Change
the strategy to RollingUpdate and delay the kill signal by one week. AWS
Network Loadbalancer keeps the existing connections alive for as long as
the pods are alive, but will direct new connections to new pods.


## Issue ticket number and link
https://github.com/neondatabase/neon/issues/3333
2023-02-22 16:50:07 +02:00
Arthur Petukhovsky
95018672fa Remove safekeeper-1.ap-southeast-1.aws.neon.tech (#3671)
We migrated all timelines to
`safekeeper-3.ap-southeast-1.aws.neon.tech`, now old instance can be
removed.
2023-02-22 11:55:41 +02:00
Sergey Melnikov
2caece2077 Add -v to ansible invocations (#3670)
To get more debug output on failures
2023-02-21 23:11:52 +03:00
Joonas Koivunen
b8b8c19fb4 fix: hold permit until GetObject eof (#3663)
previously we applied the ratelimiting only up to receiving the headers
from s3, or somewhere near it. the commit adds an adapter which carries
the permit until the AsyncRead has been disposed.

fixes #3662.
2023-02-21 21:14:08 +02:00
Joonas Koivunen
225add041f calculate_logical_size: no longer use spawn_blocking (#3664)
Calculation of logical size is now async because of layer downloads, so
we shouldn't use spawn_blocking for it. Use of `spawn_blocking`
exhausted resources which are needed by `tokio::io::copy` when copying
from a stream to a file which lead to deadlock.

Fixes: #3657
2023-02-21 21:09:31 +02:00
Joonas Koivunen
5d001b1e5a chore: ignore all compaction inactive tenant errors (#3665)
these are happening in tests because of #3655 but they sure took some
time to appear.

makes the `Compaction failed, retrying in 2s: Cannot run compaction
iteration on inactive tenant` into a globally allowed error, because it
has been seen failing on different test cases.
2023-02-21 20:20:13 +02:00
Joonas Koivunen
fe462de85b fix: log download failed error (#3661)
Fixes #3659
2023-02-21 19:31:53 +02:00
Vadim Kharitonov
c0de7f5cd8 Build pg_jsonschema and pg_graphql extensions (#3535)
## Describe your changes
Layer for building pg extensions written on Rust

It required forking:
* `cargo-pgx` (in order not to catch an ABI mismatch error (`cargo-pgx`
hardcoded ABI tcdi/pgx#1032)
* `pg_jsonschema` (to use forked `cargo-pgx` version)
* `pgx-contrib-spiext` (to use forked `cargo-pgx`)
* `pg_graphql` (to use forked `cargo-pgx` and `pgx-contrib-spiext`
version)

Before the patch:

```
postgres=# create extension pg_jsonschema;
2023-02-02 17:45:23.120 UTC [35] ERROR:  incompatible library "/usr/local/lib/pg_jsonschema.so": ABI mismatch
2023-02-02 17:45:23.120 UTC [35] DETAIL:  Server has ABI "Neon Postgres", library has "PostgreSQL".
2023-02-02 17:45:23.120 UTC [35] STATEMENT:  create extension pg_jsonschema;
ERROR:  incompatible library "/usr/local/lib/pg_jsonschema.so": ABI mismatch
DETAIL:  Server has ABI "Neon Postgres", library has "PostgreSQL".
```

After 

```
postgres=# create extension pg_jsonschema;
CREATE EXTENSION
postgres=# select json_matches_schema('{"type": "object"}', '{}');
 json_matches_schema
---------------------
 t
postgres=# create extension pg_graphql;
CREATE EXTENSION
postgres=# create table book(id int primary key, title text);
CREATE TABLE
postgres=# insert into book(id, title) values (1, 'book 1');
INSERT 0 1
postgres=# select graphql.resolve($$
query {
  bookCollection {
    edges {
      node {
        id
      }
    }
  }
}
$$);
                            resolve
----------------------------------------------------------------
 {"data": {"bookCollection": {"edges": [{"node": {"id": 1}}]}}}
(1 row)
```

## Issue ticket number and link
Closes #3429, #3096

## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [x] If it is a core feature, I have added thorough tests.
- [x] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [x] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.

`pg_jsonschema` extension will be available for our customers
2023-02-21 17:31:23 +01:00
Joonas Koivunen
b220ba6cd1 add random init delay for background tasks (#3655)
Fixes #3649.
2023-02-21 12:42:11 +01:00
Joonas Koivunen
7de373210d Warn when background tasks exceed their configured period (#3654)
Fixes #3648.
2023-02-21 13:02:19 +02:00
Vadim Kharitonov
5c5b03ce08 Compile xml2 extension 2023-02-21 10:34:45 +01:00
Joonas Koivunen
d7d3f451f0 Use tracing panic hook in all binaries (#3634)
Enables tracing panic hook in addition to pageserver introduced in
#3475:

- proxy
- safekeeper
- storage_broker

For proxy, a drop guard which resets the original std panic hook was
added on the first commit. Other binaries don't need it so they never
reset anything by `disarm`ing the drop guard.

The aim of the change is to make sure all panics a) have span
information b) are logged similar to other messages, not interleaved
with other messages as happens right now. Interleaving happens right now
because std prints panics to stderr, and other logging happens in
stdout. If this was handled gracefully by some utility, the log message
splitter would treat panics as belonging to the previous message because
it expects a message to start with a timestamp.

Cc: #3468
2023-02-21 10:03:55 +02:00
Keanu Ashwell
bc7d3c6476 docs: add dependency requirements for arch based systems (#3588)
This pull request adds information on building neon on Arch based system
such as Artix, Manjaro, Antergos, etc.
2023-02-20 22:51:54 +03:00
Sergey Melnikov
e3d75879c0 Use fqdn to access console management API on production (#3651)
console-release.local is legacy manual CNAME to
neon-internal-api.aws.neon.tech in r53
We could use neon-internal-api.aws.neon.tech name directly

This already was deployed to staging in
https://github.com/neondatabase/neon/pull/3642
2023-02-20 18:11:06 +01:00
Christian Schwarz
485b269674 eviction: tone down logs to debug!() level if there were no evictions
fixes #3647
2023-02-20 18:01:59 +01:00
Christian Schwarz
ee1eda9921 eviction: remove EvictionStats::not_considered_due_to_clock_skew
Rationale: see the block comment added in this patch.

fixes #3641
2023-02-20 18:01:59 +01:00
Christian Schwarz
e363911c85 timeline: propagate span to download_remote_layer (#3644)
fixes #3643
refs #3604
2023-02-20 17:18:13 +02:00
Sergey Melnikov
d5d690c044 Use fqdn for staging console management API (#3642)
`console-staging.local` is legacy manual CNAME to
`neon-internal-api.aws.neon.build` in r53
We could use `neon-internal-api.aws.neon.build` name directly
2023-02-20 16:05:21 +01:00
Arthur Petukhovsky
8f557477c6 Add new safekeeper to ap-southeast-1 prod (#3645) 2023-02-20 17:51:27 +03:00
Shany Pozin
af210c8b42 Allow running do_gc in non testing env (#3639)
## Describe your changes
Since the current default gc period is set to 1 hour, whenever there is
an immediate need to reduce PITR and run gc, the user has to wait 1 hour
for PITR change to take effect
By enabling this API the user can configure PITR and immediately call
the do_gc API to trigger gc
## Issue ticket number and link
#3590
## Checklist before requesting a review
- [X] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
2023-02-20 13:23:13 +02:00
sharnoff
2153d2e00a Run compute_ctl in a cgroup in VMs (#3577) 2023-02-17 14:14:41 -08:00
Alexander Bayandin
564fa11244 Update Postgres extensions (#3615)
- Update postgis from 3.3.1 from 3.3.2
- Update plv8 from 3.1.4 to 3.1.5
- Update h3-pg from 4.0.1 to 4.1.2 (and underlying h3 from 4.0.1 to 4.1.0)
2023-02-17 18:18:23 +00:00
Christian Schwarz
8d28a24b26 staging: enable automatic layer eviction at 20m threshold + period (#3636)
What it says on the tin.

Part of #2476
2023-02-17 18:32:01 +02:00
Anastasia Lubennikova
53128d56d9 Fix make clean:
Use correct paths in neon-pg-ext-clean
2023-02-17 17:57:45 +02:00
Anastasia Lubennikova
40799d8ae7 Add debug messages to catch abnormal consumption metric values 2023-02-17 17:57:45 +02:00
Konstantin Knizhnik
b242b0ad67 Fix flaky tests (#3616)
## Describe your changes

test_on_demand_download is flaky because not waiting until created image
layer is transferred to S3.
test_tenants_with_remote_storage just leaves garbage at the end of
overwritten file.

Right solution for test_on_demand_download is to add some API call to
wait completion of synchronization with S3 (not just based on last
record LSN). But right now it is solved using sleep.

## Issue ticket number and link

#3209 

## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
2023-02-17 15:56:56 +02:00
Dmitry Ivanov
d90cd36bcc [proxy] Improve tracing spans here and there. 2023-02-17 15:32:14 +03:00
Dmitry Ivanov
956b6f17ca [proxy] Handle some unix signals.
On the surface, this doesn't add much, but there are some benefits:

* We can do graceful shutdowns and thus record more code coverage data.

* We now have a foundation for the more interesting behaviors, e.g. "stop
  accepting new connections after SIGTERM but keep serving the existing ones".

* We give the otel machinery a chance to flush trace events before
  finally shutting down.
2023-02-17 15:32:14 +03:00
Heikki Linnakangas
6f9af0aa8c [proxy] Enable OpenTelemetry tracing.
This commit sets up OpenTelemetry tracing and exporter, so that they
can be exported as OpenTelemetry traces as well.

All outgoing HTTP requests will be traced. A separate (child)
span is created for each outgoing HTTP request, and the tracing
context is also propagated to the server in the HTTP headers.

If tracing is enabled in the control plane and compute node too, you
can now get an end-to-end distributed trace of what happens when a new
connection is established, starting from the handshake with the
client, creating the 'start_compute' operation in the control plane,
starting the compute node, all the way to down to fetching the base
backup and the availability checks in compute_ctl.

Co-authored-by: Dmitry Ivanov <dima@neon.tech>
2023-02-17 15:32:14 +03:00
Joonas Koivunen
8e6b27bf7c fix: avoid busy loop on replacement failure (#3613)
Add an AtomicBool per RemoteLayer, use it to mark together with closed
semaphore that remotelayer is unusable until restart or ignore+load.

https://github.com/neondatabase/neon/issues/3533#issuecomment-1431481554
2023-02-17 14:15:29 +02:00
Joonas Koivunen
ae3eff1ad2 Tracing panic hook (#3475)
Fixes #3468.

This does change how the panics look, and most importantly, make sure
they are not interleaved with other messages. Adds a `GET /v1/panic`
endpoint for panic testing (useful for sentry dedup and this hook
testing).

The panics are now logged within a new error level span called `panic`
which separates it from other error level events. The panic info is
unpacked into span fields:
- thread=mgmt request worker
- location="pageserver/src/http/routes.rs:898:9"

Co-authored-by: Christian Schwarz <christian@neon.tech>
2023-02-17 13:56:00 +02:00
Joonas Koivunen
501702b27c fix: flaky test_compaction_downloads_on_demand_with_image_creation (#3629)
fix is to stop postgres before the final checkpoint to ensure no
inmemory layer gets created.

Fixes #3627.
2023-02-17 13:34:26 +02:00
Alexander Bayandin
526f8b76aa Bump werkzeug from 2.1.2 to 2.2.3 (#3631)
## Describe your changes

```
$ poetry add werkzeug@latest "moto[server]@latest"
Using version ^2.2.3 for werkzeug
Using version ^4.1.2 for moto

Updating dependencies
Resolving dependencies... (1.6s)

Writing lock file

Package operations: 0 installs, 2 updates, 1 removal

  • Removing pytz (2022.1)
  • Updating werkzeug (2.1.2 -> 2.2.3)
  • Updating moto (3.1.18 -> 4.1.2)
```

Resolves:
- https://github.com/neondatabase/neon/security/dependabot/14
- https://github.com/neondatabase/neon/security/dependabot/13

`@dependabot` failed to create a PR for some reason (I guess because it
also needed to handle `moto` dependency)

## Issue ticket number and link
N/A

## Checklist before requesting a review
- [x] I have performed a self-review of my code.
- [x] If it is a core feature, I have added thorough tests.
- [x] Do we need to implement analytics? if so did you add the relevant
metrics to the dashboard?
- [x] If this PR requires public announcement, mark it with
/release-notes label and add several sentences in this section.
2023-02-17 08:29:52 +00:00
Sergey Melnikov
a1b062123b Do not deploy storage to old account (#3630)
It's gone
2023-02-16 20:28:53 +00:00