462 Commits

Author SHA1 Message Date
Tristan Partin
041b653a1a Add state diagram for compute
Models a compute's lifetime.
2024-03-20 17:10:46 -05:00
Alex Chi Z
76c44dc140 spec: disable neon extension auto upgrade (#7128)
This pull request disables neon extension auto upgrade to help the next
compute image upgrade go smoothly.

## Summary of changes

We have two places that auto-upgrade the neon extension: during compute spec
update, and when the compute node starts. The compute spec update logic
has always been there, and the compute node start logic was added in
https://github.com/neondatabase/neon/pull/7029. In this pull request, we
disable both of them, so that we can still roll back to an older version
of compute before figuring out the best way to handle extension
upgrade-downgrade. https://github.com/neondatabase/neon/issues/6936

We will enable auto-upgrade in the next release following this release.

There are no other extension upgrades from release 4917 and therefore
after this pull request, it would be safe to revert to release 4917.

Impact:

* Projects created after unpinning the compute image -> if we need to
roll back, **they will get stuck**, because the default neon extension
version is 1.3. We need to manually pin the compute image version if this
happens.
* Projects already stuck on staging because they cannot be downgraded -> I don't
know their current status, maybe they are already running the latest
compute image?
* Other projects -> can be rolled back to release 4917.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-03-14 19:45:38 +00:00
Arseny Sher
0cf0731d8b SIGQUIT instead of SIGKILL prewarmed postgres.
To avoid orphaned processes using wiped datadir with confusing logging.
2024-03-11 22:36:52 +04:00
Sasha Krassovsky
4834d22d2d Revoke REPLICATION (#7052)
## Problem
Currently users can cause problems with replication
## Summary of changes
Don't let them replicate
2024-03-08 22:24:30 +00:00
Sasha Krassovsky
2fc89428c3 Hopefully stabilize test_bad_connection.py (#6976)
## Problem
It seems that even though we have a retry on basebackup, it still
sometimes fails to fetch it with the failpoint enabled, resulting in a
test error.

## Summary of changes
If we fail to get the basebackup, disable the failpoint and try again.
2024-03-07 10:12:06 -08:00
Alex Chi Z
0b330e1310 upgrade neon extension on startup (#7029)
## Problem

Fix https://github.com/neondatabase/neon/issues/7003. Fix
https://github.com/neondatabase/neon/issues/6982. Currently, the neon
extension is only upgraded when a new compute spec gets applied, for
example, when creating a new role or creating a new database. This also
resolves `neon.lfc_stat` not found warnings in prod.

## Summary of changes

This pull request adds the logic to spawn a background thread to upgrade
the neon extension version if the compute is a primary. If for whatever
reason the upgrade fails, it reports an error to the console and does
not impact compute node state.

This change can be further applied to 3rd-party extension upgrades. We
can silently upgrade the version of 3rd party extensions in the
background in the future.

Questions:

* Does `ALTER EXTENSION` take some kind of lock that will block user
requests?
* Does `ALTER EXTENSION` write to the database if nothing needs to be
upgraded? (may impact storage size).

Otherwise it's safe to land this pull request.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-03-06 12:20:44 -05:00
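The background upgrade described in 0b330e1310 above could be sketched roughly as follows. This is a minimal illustration, not the `compute_ctl` code: `spawn_extension_upgrade`, `is_primary` and `upgrade` are hypothetical names, and the real implementation reports the error to the console rather than to a Python logger.

```python
import logging
import threading
from typing import Callable, Optional


def spawn_extension_upgrade(
    is_primary: bool,
    upgrade: Callable[[], None],  # hypothetical callback that runs the upgrade
) -> Optional[threading.Thread]:
    """Run `upgrade` on a background thread, but only on a primary compute."""
    if not is_primary:
        return None  # replicas do not attempt the upgrade

    def worker() -> None:
        try:
            upgrade()
        except Exception:
            # A failed upgrade is only logged; it must not affect compute state.
            logging.exception("neon extension upgrade failed")

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

Because the exception is caught inside the worker, a failing upgrade never propagates to the caller, which matches the "does not impact compute node state" behavior in the description.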
Alex Chi Z
b7db912be6 compute_ctl: only try zenith_admin if could not authenticate (#6955)
## Problem

Fix https://github.com/neondatabase/neon/issues/6498

## Summary of changes

Only re-authenticate with zenith_admin if authentication fails.
Otherwise, directly return the error message.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-03-04 14:28:45 -05:00
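The fallback rule from b7db912be6 above, sketched under assumptions: `AuthenticationError`, `connect_with_fallback` and the `connect` callback are hypothetical stand-ins for whatever client and error types `compute_ctl` actually uses.

```python
from typing import Callable, TypeVar

Conn = TypeVar("Conn")


class AuthenticationError(Exception):
    """Hypothetical error type raised when the server rejects the credentials."""


def connect_with_fallback(
    connect: Callable[[str], Conn],  # hypothetical: opens a connection as a user
    user: str,
    fallback_user: str = "zenith_admin",
) -> Conn:
    """Retry as `fallback_user` only when the first attempt fails to authenticate."""
    try:
        return connect(user)
    except AuthenticationError:
        # Only an authentication failure triggers the zenith_admin retry.
        return connect(fallback_user)
    # Any other exception (network error, timeout, ...) propagates unchanged.
```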
Arpad Müller
82853cc1d1 Fix warnings and compile errors on nightly (#6886)
Nightly has added a bunch of compiler and linter warnings. There are also
two dependencies that fail compilation on the latest nightly due to using
the old `stdsimd` feature name. This PR fixes them.
2024-03-01 17:14:19 +01:00
Alex Chi Z
b2bbc20311 fix: only alter default privileges when public schema exists (#6914)
## Problem

Following up https://github.com/neondatabase/neon/pull/6885, only alter
default privileges when the public schema exists.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-02-26 11:48:56 -09:00
Anastasia Lubennikova
a12e4261a3 Add neon.primary_is_running GUC. (#6705)
We set it for a neon replica if the primary is running.

Postgres uses this GUC at startup
to determine whether the replica should wait for
RUNNING_XACTS from the primary or not.

Corresponding cloud PR is
https://github.com/neondatabase/cloud/pull/10183

* Add a test for hot-standby replica startup.
* Extract oldest_running_xid from XlRunningXits WAL records.
---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Konstantin Knizhnik <knizhnik@garret.ru>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-02-23 13:56:41 +00:00
Alex Chi Z
12487e662d compute_ctl: move default privileges grants to handle_grants (#6885)
## Problem

Following up https://github.com/neondatabase/neon/pull/6884, hopefully,
a real final fix for https://github.com/neondatabase/neon/issues/6236.

## Summary of changes

`handle_migrations` is done over the main `postgres` db connection.
Therefore, the privileges assigned here do not work with databases
created later (i.e., `neondb`). This pull request moves the grants to
`handle_grants`, so that it runs for each DB created. The SQL is added
into the `BEGIN/END` block, so that it takes only one RTT to apply all
of them.

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-02-22 17:00:03 -05:00
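To illustrate the per-database grants from 12487e662d above, here is a rough sketch that builds one statement per database. It assumes the `BEGIN/END` block is a PL/pgSQL `DO` block and that the grants cover tables and sequences in the `public` schema; the exact SQL in `handle_grants` may differ, and both function names are hypothetical.

```python
from typing import Dict, Iterable


def default_privileges_sql(role: str = "neon_superuser") -> str:
    """Build a single DO block so all grants are sent in one round trip."""
    # Assumed statement set: tables and sequences in the public schema.
    return f"""DO $$
BEGIN
    ALTER DEFAULT PRIVILEGES IN SCHEMA public
        GRANT ALL ON TABLES TO {role} WITH GRANT OPTION;
    ALTER DEFAULT PRIVILEGES IN SCHEMA public
        GRANT ALL ON SEQUENCES TO {role} WITH GRANT OPTION;
END
$$;"""


def grants_per_database(databases: Iterable[str]) -> Dict[str, str]:
    """Map every database to the SQL that should be executed inside it."""
    return {db: default_privileges_sql() for db in databases}
```

Running the block once per database, rather than once over the `postgres` connection, is what makes the default privileges apply to databases such as `neondb`.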
Alex Chi Z
837988b6c9 compute_ctl: run migrations to grant default grantable privileges (#6884)
## Problem

Following up on https://github.com/neondatabase/neon/pull/6845, we did
not make the default privileges grantable before, and therefore, even if
the users have full privileges, they are not able to grant them to
others.

Should be a final fix for
https://github.com/neondatabase/neon/issues/6236.

## Summary of changes

Add `WITH GRANT` to migrations so that neon_superuser can grant the
permissions.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-02-22 17:49:02 +00:00
Alex Chi Z
6921577cec compute_ctl: grant default privileges on table to neon_superuser (#6845)
## Problem

fix https://github.com/neondatabase/neon/issues/6236 again

## Summary of changes

This pull request adds a setup command in compute spec to modify default
privileges of public schema to have full permission on table/sequence
for neon_superuser. If an extension upgrades to superuser during
creation, the tables/sequences they create in the public schema will be
automatically granted to neon_superuser.

Questions:
* does it impose any security flaws? public schema should be fine...
* for all extensions that create tables in schemas other than public, we
will need to manually handle them (e.g., pg_anon).
* we can modify some extensions to remove their superuser requirement in
the future.
* we may contribute to Postgres to allow for the creation of extensions
with a specific user in the future.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
2024-02-21 16:09:34 -05:00
Nikita Kalyanov
cbb599f353 Add /terminate API (#6745)
this is to speed up suspends, see
https://github.com/neondatabase/cloud/issues/10284

2024-02-20 19:42:36 +02:00
John Spray
24014d8383 pageserver: fix sharding emitting empty image layers during compaction (#6776)
## Problem

Sharded tenants would sometimes try to write empty image layers during
compaction: this was more noticeable on larger databases.
- https://github.com/neondatabase/neon/issues/6755

**Note to reviewers: the last commit is a refactor that de-indents a
whole block, I recommend reviewing the earlier commits one by one to see
the real changes**

## Summary of changes

- Fix a case where when we drop a key during compaction, we might fail
to write out keys (this was broken when vectored get was added)
- If an image layer is empty, then do not try and write it out, but
leave `start` where it is so that if the subsequent key range meets
criteria for writing an image layer, we will extend its key range to
cover the empty area.
- Add a compaction test that configures small layers and compaction
thresholds, and asserts that we really successfully did image layer
generation. This fails before the fix.
2024-02-18 08:51:12 +00:00
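The `start` handling described in 24014d8383 above can be illustrated with a simplified model. This is not the pageserver code: keys are plain integers, partitions are assumed to be sorted, contiguous half-open ranges, and the "criteria for writing an image layer" are reduced to "the range contains at least one key". The name `plan_image_layers` is hypothetical.

```python
from typing import Iterable, List, Tuple

KeyRange = Tuple[int, int]  # half-open range [start, end)


def plan_image_layers(
    partitions: List[KeyRange], keys: Iterable[int]
) -> List[KeyRange]:
    """Return the key ranges of the image layers that would be written."""
    present = sorted(keys)
    layers: List[KeyRange] = []
    if not partitions:
        return layers
    start = partitions[0][0]
    for _, end in partitions:
        if not any(start <= key < end for key in present):
            # Empty layer: write nothing and leave `start` where it is.
            continue
        # The layer written here also covers any empty area skipped before it.
        layers.append((start, end))
        start = end
    return layers
```

For example, with partitions `[(0, 10), (10, 20), (20, 30)]` and a single key `25`, the first two ranges are skipped and one layer covering `(0, 30)` is planned.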
Konstantin Knizhnik
c19625a29c Support sharding for compute_ctl (#6787)
## Problem

See https://github.com/neondatabase/neon/issues/6786

## Summary of changes

Split connection string in compute.rs when requesting basebackup
2024-02-16 14:50:09 +00:00
Heikki Linnakangas
a5114a99b2 Create a symlink from pg_dynshmem to /dev/shm
See included comment and issue
https://github.com/neondatabase/autoscaling/issues/800 for details.

This has no effect, unless you set "dynamic_shared_memory_type = mmap"
in postgresql.conf.
2024-02-14 11:37:52 +02:00
Sasha Krassovsky
1a4dd58b70 Grant pg_monitor to neon_superuser (#6691)
## Problem
The people want pg_monitor
https://github.com/neondatabase/neon/issues/6682
## Summary of changes
Gives the people pg_monitor
2024-02-09 20:22:53 +00:00
Anastasia Lubennikova
eec1e1a192 Pre-install anon extension from compute_ctl
if anon is in shared_preload_libraries.
Users cannot install it themselves, because superuser is required.

GRANT all privileges needed to use it to db_owner.

We use the neon fork of the extension, because a small change to the SQL file
is needed to allow db_owner to use it.

This feature is behind a feature flag AnonExtension,
so it is not enabled by default.
2024-02-09 12:32:07 +00:00
Arpad Müller
c0e0fc8151 Update Rust to 1.76.0 (#6683)
[Release notes](https://github.com/rust-lang/rust/releases/tag/1.76.0).
2024-02-08 19:57:02 +01:00
Sasha Krassovsky
7b49e5e5c3 Remove compute migrations feature flag (#6653) 2024-02-07 07:55:55 -09:00
Heikki Linnakangas
dc811d1923 Add a span to 'create_neon_superuser' for better OpenTelemetry traces (#6644)
create_neon_superuser runs the first queries in the database after cold
start. Traces suggest that those first queries can make up a significant
fraction of the cold start time. Make it more visible by adding an
explicit tracing span to it; currently you just have to deduce it by
looking at the time spent in the parent 'apply_config' span subtracted
by all the other child spans.
2024-02-06 20:37:35 +02:00
Vadim Kharitonov
dae56ef60c Do not suspend compute if there is an active logical replication subscription. (#6570)
## Problem

The idea is to keep compute up and running if there are any active
logical replication subscriptions.

### Rationale

- The write-ahead log (WAL) files, which contain the data changes,
will need to be retained on the publisher side until the subscriber is
able to connect again and apply these changes. This could potentially
lead to increased disk usage on the publisher - and we do not want to
disrupt the source - I think it is more painful for our customer to resolve
storage issues on the source than to pay for the compute at the target.
- Upon resuming the compute resources, the subscriber will start
consuming and applying the changes from the retained WAL files. The time
taken to catch up will depend on the volume of changes and the
configured vCPUs.
By not suspending, we can avoid explaining complex situations where we lag
behind (in extreme cases we could lag behind by hours, days or even months).
- I think an important use case for logical replication from a source is
a one-time migration or release upgrade. In this case the customer would
not mind if we are not suspended for the duration of the migration.

We need to document this in the release notes and the documentation in
the context of logical replication where Neon is the target (subscriber)

### See internal discussion here

https://neondb.slack.com/archives/C04DGM6SMTM/p1706793400746539?thread_ts=1706792628.701279&cid=C04DGM6SMTM
2024-02-06 12:15:42 +00:00
Sasha Krassovsky
be30388901 Add retry to fetching basebackup (#6537)
## Problem
Currently we have no retry mechanism for fetching basebackup. If there's
an unstable connection, starting compute will just fail.

## Summary of changes
Adds an exponential backoff with 7 retries to get the basebackup.
2024-02-01 20:50:04 +00:00
Sasha Krassovsky
e8c9a51273 Allow creating subscriptions as neon_superuser (#6484)
## Problem
We currently can't create subscriptions in PG14 and PG15 because only
superusers can, and PG16 requires adding roles to
pg_create_subscription.

## Summary of changes
I added changes to PG14 and PG15 that allow neon_superuser to bypass the
superuser requirement. For PG16, I didn't do that but added a migration
that adds neon_superuser to pg_create_subscription. Also added a test to
make sure it works.
2024-01-30 22:32:33 -08:00
Sasha Krassovsky
71f495c7f7 Gate it behind feature flags 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
27587e155d Fix test 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
55aede2762 Prevent duplicate insertions 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
9f186b4d3e Fix query 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
585687d563 Fix syntax error 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
65a98e425d Switch to bigint 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
844303255a Cargo fmt 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
6d8df2579b Fix dumb thing 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
30064eb197 Add scary comment 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
869acfe29b Make migrations transactional 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
11a91eaf7b Uncomment the thread 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
a718287902 Make migrations happen on a separate thread 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
2eac1adcb9 Make clippy happy 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
a40ed86d87 Add test for migrations, add initial migration 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
1bf8bb88c5 Add support for migrations within compute_ctl 2024-01-22 14:53:29 -08:00
Anastasia Lubennikova
e6e013b3b7 Fix pgbouncer settings update:
- Start pgbouncer in VM from postgres user, to allow connection to
pgbouncer admin console.
- Remove unused compute_ctl options --pgbouncer-connstr
and --pgbouncer-ini-path.
- Fix and cleanup code of connection to pgbouncer, add retries
because pgbouncer may not be instantly ready when compute_ctl starts.
2024-01-18 11:27:12 +00:00
Konstantin Knizhnik
31a4eb40b2 Do not suspend compute if autovacuum is active (#6322)
## Problem

See
https://github.com/orgs/neondatabase/projects/49/views/13?pane=issue&itemId=48282912

## Summary of changes


Do not suspend compute if there are active autovacuum workers.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-01-14 09:33:57 +02:00
Arthur Petukhovsky
97b48c23f8 Compact some compute_ctl logs (#6346)
Print postgres roles in a single line and add some info.
2024-01-12 18:24:22 +00:00
Alexey Kondratov
1c432d5492 [compute_ctl] Do not miss short-living connections (#6008)
## Problem

Currently, the activity monitor in `compute_ctl` has a 500 ms polling
interval. It also looks at the list of current client backends, looking
for an active one or one with the most recent state change. This means
we can miss short-lived connections.

Yet, while testing this PR I realized that it's usually not a problem
with pooled connections, as pgbouncer maintains connections to Postgres
even though client connections are short-lived. We can still miss direct
connections.

## Summary of changes

This commit introduces another way to detect user activity on the
compute. It polls the sum of `active_time` and the sum of `sessions` across all
non-system databases in `pg_stat_database` [1]. If a user runs some
queries or just opens a direct connection, the sums will rise; if a user
drops a database, they can go down, but that is still a change and will be
detected as activity.

The new statistics-based logic seems to be working fine. Yet, after having it
running for a couple of hours I've seen several odd cases with
connections via pgbouncer:

1. Sometimes, if you run just `psql pooler_connstr -c 'select 1;'`,
   `active_time` may not be updated immediately, and it may take a couple
   of dozen seconds. This doesn't seem critical, though.
2. With the same query via the pooler, `active_time` can be bumped a bit, then
   pgbouncer keeps the connection to Postgres open for ~10 minutes, then it
   disconnects, and `active_time` *could be* bumped a bit again. 'Could be'
   because I've seen it once, but it didn't reproduce on a second try.

I think this can create false positives (hopefully rare), where we will
not suspend some computes because of lagging statistics updates OR because
some non-user processes try to connect to user databases.
Currently, we don't touch them outside of startup and
`postgres_exporter` is configured not to discover other databases,
but this can change in the future.

The new behavior is gated behind the feature flag `activity_monitor_experimental`,
which should be provided by the control plane via neondatabase/cloud#9171

[1] https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-DATABASE-VIEW

Related to neondatabase/cloud#7966, neondatabase/cloud#7198
2024-01-12 18:15:41 +01:00
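The statistics-based check from 1c432d5492 above boils down to comparing two consecutive samples. The sketch below assumes a particular filter for "non-system databases" and uses hypothetical names (`STATS_QUERY`, `DbStats`, `detect_activity`); the query actually used by `compute_ctl` may differ.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed query: which databases count as "system" is a guess.
STATS_QUERY = """
SELECT coalesce(sum(active_time), 0.0) AS active_time,
       coalesce(sum(sessions), 0) AS sessions
FROM pg_stat_database
WHERE datname NOT IN ('postgres', 'template0', 'template1')
"""


@dataclass(frozen=True)
class DbStats:
    active_time: float  # summed pg_stat_database.active_time
    sessions: int       # summed pg_stat_database.sessions


def detect_activity(previous: Optional[DbStats], current: DbStats) -> bool:
    """Report activity whenever either sum changed since the last poll."""
    if previous is None:
        return False  # first sample: nothing to compare against yet
    # Any change counts, including a decrease after a database is dropped.
    return current != previous
```

Treating a decrease as activity mirrors the note in the description that dropping a database still registers as a change.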
Arthur Petukhovsky
544284cce0 Collapse multiline queries in compute_ctl (#6316) 2024-01-10 22:25:28 +04:00
Arthur Petukhovsky
71beabf82d Join multiline postgres logs in compute_ctl (#5903)
Postgres can write multiline logs, and they are difficult to handle
after they are mixed with other logs. This PR combines multiline logs
from postgres into a single line, where previous line breaks are
replaced with unicode zero-width spaces. Then postgres logs are written
to stderr with `PG:` prefix.

It makes it easy to distinguish postgres logs from all other compute
logs with a simple grep, e.g. `|= "PG:"`
2024-01-10 15:11:43 +00:00
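The joining step from 71beabf82d above can be sketched as a small string transformation. This assumes the lines of one postgres log record have already been grouped into a single string; how `compute_ctl` detects record boundaries is not covered, and `join_multiline` is a hypothetical name.

```python
ZERO_WIDTH_SPACE = "\u200b"  # replaces the original line breaks


def join_multiline(record: str) -> str:
    """Collapse one multiline postgres log record into a single `PG:` line."""
    body = record.rstrip("\n")  # drop the trailing newline of the record
    return "PG:" + body.replace("\n", ZERO_WIDTH_SPACE)
```

The result contains no newline characters, so a plain substring filter on `PG:` selects complete postgres records.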
Arseny Sher
a41c4122e3 Don't suspend compute if there is active LR subscriber.
https://github.com/neondatabase/neon/issues/6258
2024-01-06 01:24:44 +04:00
Arseny Sher
9a43c04a19 compute_ctl: kill postgres and sync-safekeepers on exit.
Otherwise they are left orphaned when compute_ctl is terminated with a
signal. It was invisible most of the time because normally neon_local or k8s
kills postgres directly and then compute_ctl finishes gracefully. However, in
some tests compute_ctl gets stuck waiting for sync-safekeepers which
intentionally never ends because safekeepers are offline, and we want to stop
compute_ctl without leaving orphans behind.

This is quite a rough approach which doesn't wait for the children to
terminate. A better way would be to convert compute_ctl to async which would
make waiting easy.
2024-01-01 20:44:05 +04:00
Anastasia Lubennikova
6e40900569 Manage pgbouncer configuration from compute_ctl:
- add pgbouncer_settings section to compute spec;
- add pgbouncer-connstr option to compute_ctl.
- add pgbouncer-ini-path option to compute_ctl. Default: /etc/pgbouncer/pgbouncer.ini

Apply pgbouncer config on compute start and respec to override default spec.

Save pgbouncer config updates to pgbouncer.ini to preserve them across pgbouncer restarts.
2023-12-26 15:17:09 +00:00
Anastasia Lubennikova
0bd79eb063 Handle role deletion when project has no databases. (#6170)
There is still the default 'postgres' database, which may contain objects
owned by the role or some ACLs. We need to reassign objects in this
database too.

## Problem
If a customer deletes all databases and then tries to delete a role that
has some non-standard ACLs, the
`apply_config` operation gets stuck because role deletion fails.
2023-12-19 16:27:47 +00:00