4468 Commits

Author SHA1 Message Date
Sasha Krassovsky
71f495c7f7 Gate it behind feature flags 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
0a7e050144 Fix test one last time 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
55bfa91bd7 Fix test again again 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
d90b2b99df Fix test again 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
27587e155d Fix test 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
55aede2762 Prevent duplicate insertions 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
9f186b4d3e Fix query 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
585687d563 Fix syntax error 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
65a98e425d Switch to bigint 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
b2e7249979 Sleep 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
844303255a Cargo fmt 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
6d8df2579b Fix dumb thing 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
3c3b53f8ad Update test 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
30064eb197 Add scary comment 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
869acfe29b Make migrations transactional 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
11a91eaf7b Uncomment the thread 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
394ef013d0 Push the migrations test 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
a718287902 Make migrations happen on a separate thread 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
2eac1adcb9 Make clippy happy 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
3f90b2d337 Fix test_ddl_forwarding 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
a40ed86d87 Add test for migrations, add initial migration 2024-01-22 14:53:29 -08:00
Sasha Krassovsky
1bf8bb88c5 Add support for migrations within compute_ctl 2024-01-22 14:53:29 -08:00
Vlad Lazar
f1901833a6 pageserver_api: migrate keyspace related functions from pgdatadir_mapping (#6406)
The idea is to achieve separation between keyspace layout definition
and operating on said keyspace. I've inlined all these functions since
they're small and we don't use LTO in the storage release builds
at the moment.

Closes https://github.com/neondatabase/neon/issues/6347
2024-01-22 19:16:38 +00:00
Arthur Petukhovsky
b41ee81308 Log warning on slow WAL removal (#6432)
Also add `safekeeper_active_timelines` metric.
Should help investigating #6403
2024-01-22 18:38:05 +00:00
Christian Schwarz
205b6111e6 attachment_service: /attach-hook: correctly handle detach (#6433)
Before this patch, we would update the `tenant_state.intent` in memory
but not persist the detachment to disk.

I noticed this in https://github.com/neondatabase/neon/pull/6214 where
we stop, then restart, the attachment service.
2024-01-22 18:27:05 +00:00
John Spray
93572a3e99 pageserver: mark tenant broken when cancelling attach (#6430)
## Problem

When a tenant is in Attaching state, and waiting for the
`concurrent_tenant_warmup` semaphore, it also listens for the tenant
cancellation token. When that token fires, Tenant::attach drops out.
Meanwhile, Tenant::set_stopping waits forever for the tenant to exit
Attaching state.

Fixes: https://github.com/neondatabase/neon/issues/6423

## Summary of changes

- In the absence of a valid state for the tenant, it is set to Broken in
this path. A more elegant solution will require more refactoring, beyond
this minimal fix.
2024-01-22 15:50:32 +00:00
Christian Schwarz
15c0df4de7 fixup(#6037): actually fix the issue, #6388 failed to do so (#6429)
Before this patch, the select! still returned immediately if `futs` was
empty. Must have tested a stale build in my manual testing of #6388.
2024-01-22 14:27:29 +00:00
Anna Khanova
3290fb09bf Proxy: fix gc (#6426)
## Problem

Gc currently doesn't work properly.

## Summary of changes

Change statement on running gc.
2024-01-22 13:24:10 +00:00
hamishc
efdb2bf948 Added missing PG_VERSION arg into compute node dockerfile (#6382)
## Problem

If you build the compute-node dockerfile with the PG_VERSION argument
passed in (e.g. `docker build -f Dockerfile.compute-node --build-arg
PG_VERSION=v15 .`), it fails, as some of the stages don't have the
PG_VERSION arg defined.

## Summary of changes

Added the PG_VERSION arg to the plv8-build, neon-pg-ext-build, and 
pg-embedding-pg-build stages of Dockerfile.compute-node
2024-01-22 11:05:27 +00:00
Conrad Ludgate
5559b16953 bump shlex (#6421)
## Problem

https://rustsec.org/advisories/RUSTSEC-2024-0006

## Summary of changes

`cargo update -p shlex`
2024-01-22 09:14:30 +00:00
Konstantin Knizhnik
1aea65eb9d Fix potential overflow in update_next_xid (#6412)
## Problem

See https://neondb.slack.com/archives/C06F5UJH601/p1705731304237889

Adding 1 to xid in `update_next_xid` can cause overflow in debug mode.
0xffffffff is a valid transaction ID.

## Summary of changes

Use `wrapping_add` 
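
As a minimal sketch of why `wrapping_add` matters here: in a debug build, `xid + 1` on a `u32` panics when `xid` is 0xffffffff, while `wrapping_add(1)` wraps around to 0. The helper name `next_xid_after` is hypothetical and is not taken from the patch; only `u32::wrapping_add` and `u32::checked_add` are real standard-library calls.

```rust
/// Hypothetical helper (not from the patch): returns the XID that follows
/// `xid`, wrapping around at u32::MAX instead of overflowing.
fn next_xid_after(xid: u32) -> u32 {
    // `xid + 1` would panic in a debug build when xid == 0xffff_ffff;
    // wrapping_add performs the addition modulo 2^32 in every build mode.
    xid.wrapping_add(1)
}

/// Reports whether a plain `xid + 1` would overflow, without panicking.
fn plain_increment_overflows(xid: u32) -> bool {
    xid.checked_add(1).is_none()
}
```

With these definitions, `next_xid_after(0xffff_ffff)` yields 0 and `plain_increment_overflows(0xffff_ffff)` is true, while ordinary values such as 41 simply increment to 42.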


Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
2024-01-21 22:11:00 +02:00
Conrad Ludgate
34ddec67d9 proxy small tweaks (#6398)
## Problem

In https://github.com/neondatabase/neon/pull/6283 I made a couple of changes
that weren't directly related to the goal of extracting the state
machine, so I'm putting them here.

## Summary of changes

- move postgres vs console provider into another enum
- reduce error cases for link auth
- slightly refactor link flow
2024-01-21 09:58:42 +01:00
Anna Khanova
9ace36d93c Proxy: do not store empty key (#6415)
## Problem

Currently we store in cache even if the project is undefined. That makes
invalidation impossible.

## Summary of changes

Do not store if project id is empty.
2024-01-20 16:14:53 +00:00
Heikki Linnakangas
e4898a6e60 Don't pass InvalidTransactionId to update_next_xid. (#6410)
update_next_xid() doesn't have any special treatment for the invalid or
other special XIDs, so it will treat InvalidTransactionId (0) as a
regular XID. If old nextXid is smaller than 2^31, 0 will look like a
very old XID, and nothing happens. But if nextXid is greater than 2^31, 0
will look like a very new XID, and update_next_xid() will incorrectly
bump up nextXid.
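
The 2^31 threshold follows from modular XID comparison. The sketch below assumes a PostgreSQL-style circular comparison on 32-bit XIDs with no special-casing of 0; `xid_precedes` and `bump_next_xid` are hypothetical names for illustration and are much simpler than the real `update_next_xid()`.

```rust
/// InvalidTransactionId is 0, as stated in the commit message.
const INVALID_TRANSACTION_ID: u32 = 0;

/// Circular "a is older than b" test on 32-bit XIDs: the wrapped difference
/// is reinterpreted as a signed number. Deliberately has no special
/// handling for XID 0, mirroring the behaviour the commit describes.
fn xid_precedes(a: u32, b: u32) -> bool {
    (a.wrapping_sub(b) as i32) < 0
}

/// Hypothetical, simplified update: returns the new nextXid after seeing `xid`.
/// The guard on INVALID_TRANSACTION_ID is the point of the fix: the invalid
/// XID is never compared against nextXid at all.
fn bump_next_xid(next_xid: u32, xid: u32) -> u32 {
    if xid == INVALID_TRANSACTION_ID {
        return next_xid;
    }
    if xid_precedes(xid, next_xid) {
        next_xid // xid is older than nextXid: nothing to do
    } else {
        xid.wrapping_add(1) // xid is at or past nextXid: advance past it
    }
}
```

Under this comparison, 0 precedes a nextXid of 1000 (it looks very old), but it does not precede a nextXid of 3,000,000,000 (it looks very new), which is the case where an unguarded update would bump nextXid incorrectly.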
2024-01-20 18:04:16 +02:00
Joonas Koivunen
c77981289c build: terminate long running tests (#6389)
Configures nextest to kill tests after 1 minute. The slow period is set to
20s, which is how long our tests currently take in total; there will be 2
warnings and then the test will be killed and its output logged.

Cc: #6361
Cc: #6368 -- likely this will be enough for a longer time, but it will be
counterproductive when we want to attach and debug; the added line
would have to be commented out.
2024-01-20 17:41:55 +02:00
Anna Khanova
f003dd6ad5 Remove rename in parameters (#6411)
## Problem

Name in notifications is not compatible with console name.

## Summary of changes

Rename fields to make them compatible.
2024-01-20 10:20:53 +00:00
Conrad Ludgate
7e7e9f5191 proxy: add more columns to parquet upload (#6405)
## Problem

Some fields were missed in the initial spec.

## Summary of changes

Adds a success boolean (defaults to false unless specifically marked as
successful).
Adds a duration_us integer that tracks how many microseconds were taken
from session start through to request completion.
2024-01-20 09:38:11 +00:00
Christian Schwarz
760a48207d fixup(#6037): page_service hangs up within 10ms if there's no message (#6388)
From #6037 on, until this patch, if the client opens the connection but
doesn't send a `PagestreamFeMessage` within the first 10ms, we'd close
the connection because `self.timeline_cancelled()` returns.
It returns because `self.shard_timelines` is still empty at that point:
it gets filled lazily within the handlers for the incoming messages.

Changes
-------

The question is: if we can't check for timeline cancellation, what else
do we need to be cancellable for? `tenant.cancel` is also a bad choice
because the `tenant` (shard) we pick at the top of handle_pagerequests
might indeed go away over the course of the connection lifetime, but
other shards may still be there.

The correct solution, I think, is to be responsive to task_mgr
cancellation, because the connection handler runs in a task_mgr task and
it is already the current canonical way we shut down a tenant's /
timeline's page_service connections (see `Tenant::shutdown` /
`Timeline::shutdown`).

So, rename the function and make it sensitive to task_mgr cancellation.
2024-01-19 19:16:01 +00:00
Arseny Sher
88df057531 Delete WAL segments from s3 when timeline is deleted.
In the most straightforward way; safekeeper performs it in DELETE endpoint
implementation, with no coordination between sks.

delete_force endpoint in the code is renamed to delete as there is only one way
to delete.
2024-01-19 20:11:24 +04:00
Alexander Bayandin
c65ac37a6d zenbenchmark: attach perf results to allure report (#6395)
## Problem

For PRs with `run-benchmarks` label, we don't upload results to the db,
making it harder to debug such tests. The only way to see some
numbers is by examining GitHub Action output which is really
inconvenient.
This PR adds zenbenchmark metrics to Allure reports.

## Summary of changes
- Create a json file with zenbenchmark results and attach it to allure
report
2024-01-18 20:59:43 +00:00
Arthur Petukhovsky
a092127b17 Fix truncateLsn initialization (#6396)
In 7f828890cf we changed the logic for persisting control_files.
Previously it was updated if `peer_horizon_lsn` jumped more than one
segment, which made `peer_horizon_lsn` initialized on disk as soon as
safekeeper has received its first `AppendRequest`.

This caused an issue with `truncateLsn`, which now can be zero
sometimes. This PR fixes it, and now `truncateLsn/peer_horizon_lsn` can
never be zero once we know `timeline_start_lsn`.

Closes https://github.com/neondatabase/neon/issues/6248
2024-01-18 18:55:24 +00:00
Christian Schwarz
e8f773387d pagebench: avoid noise about CopyFail in PS logs (#6392)
Before this patch, pagebench get-page-latest-lsn would sometimes cause
noisy errors in pageserver log about `CopyFail` protocol message.

refs https://github.com/neondatabase/neon/issues/6390
2024-01-18 18:50:42 +00:00
Christian Schwarz
00936d19e1 pagebench: use tracing panic hook (#6393) 2024-01-18 18:39:38 +00:00
Joonas Koivunen
57155ada77 temp: human readable summaries for relative access time compared to absolute (#6384)
In testing the new eviction order, the problem is that disk usage based
evictions are (currently) rare and each one is unique; this PR adds a
human readable summary of what the absolute order would have done and
what the relative order does. The assumption is that this logging will
make the few eviction runs in staging more useful.

Cc: #5304 for allowing testing in the staging
2024-01-18 17:21:08 +02:00
Konstantin Knizhnik
02b916d3c9 Use [NEON_SMGR] tag for all messages in neon extension (#6313)
## Problem

Use [NEON_SMGR] for all log messages produced by neon extension.

## Summary of changes


Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-01-18 17:08:34 +02:00
Anastasia Lubennikova
e6e013b3b7 Fix pgbouncer settings update:
- Start pgbouncer in VM from postgres user, to allow connection to
pgbouncer admin console.
- Remove unused compute_ctl options --pgbouncer-connstr
and --pgbouncer-ini-path.
- Fix and cleanup code of connection to pgbouncer, add retries
because pgbouncer may not be instantly ready when compute_ctl starts.
2024-01-18 11:27:12 +00:00
John Spray
bd19290d9f pageserver: add shard_id to metric labels (#6308)
## Problem

tenant_id/timeline_id is no longer a full identifier for metrics from a
`Tenant` or `Timeline` object.

Closes: https://github.com/neondatabase/neon/issues/5953

## Summary of changes

Include `shard_id` label everywhere we have `tenant_id`/`timeline_id`
label.
2024-01-18 10:52:18 +00:00
Joonas Koivunen
a584e300d1 test: figure out the relative eviction order assertions (#6375)
I just failed to see this earlier on #6136. Layer counts are used as an
abstraction, and each of the two tenants loses proportionally about the
same number of layers. Sadly there is no difference between
`relative_spare` and `relative_equal`, as both of these end up evicting
the exact same number of layers, but I'll try to add another test
for those later.

Cc: #5304
2024-01-18 12:39:45 +02:00
Joonas Koivunen
e247ddbddc build: update h2 (#6383)
Notes: https://github.com/hyperium/h2/releases/tag/v0.3.24

Related: https://rustsec.org/advisories/RUSTSEC-2024-0003
2024-01-18 09:54:15 +00:00
Konstantin Knizhnik
0dc4c9b0b8 Relsize hash lru eviction (#6353)
## Problem


Currently the relation hash size is limited by the "neon.relsize_hash_size" GUC
with a default value of 64k.
64k relations is not a small number, but creating 376 databases is
enough to exhaust it.

## Summary of changes

Use LRU replacement algorithm to prevent hash overflow
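
As a rough illustration of the idea, and not the extension's actual code, the sketch below shows a fixed-capacity map that evicts its least recently used entry instead of failing when full. `RelSizeCache`, its fields, and the linear-scan eviction are assumptions made for brevity; a production cache would normally track recency with a linked list.

```rust
use std::collections::HashMap;

/// Hypothetical fixed-capacity cache from a relation id to its size in blocks.
struct RelSizeCache {
    capacity: usize,
    tick: u64,                         // logical clock, bumped on every access
    entries: HashMap<u32, (u32, u64)>, // rel -> (size, tick of last use)
}

impl RelSizeCache {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be positive");
        Self { capacity, tick: 0, entries: HashMap::new() }
    }

    /// Looks up a relation size and marks the entry as most recently used.
    fn get(&mut self, rel: u32) -> Option<u32> {
        self.tick += 1;
        let tick = self.tick;
        self.entries.get_mut(&rel).map(|entry| {
            entry.1 = tick;
            entry.0
        })
    }

    /// Inserts or updates an entry; when the cache is full, the least
    /// recently used entry is evicted first (linear scan, for simplicity).
    fn put(&mut self, rel: u32, size: u32) {
        self.tick += 1;
        if !self.entries.contains_key(&rel) && self.entries.len() >= self.capacity {
            let victim = self
                .entries
                .iter()
                .min_by_key(|(_, (_, used))| *used)
                .map(|(rel, _)| *rel);
            if let Some(victim) = victim {
                self.entries.remove(&victim);
            }
        }
        self.entries.insert(rel, (size, self.tick));
    }
}
```

With a capacity of 2, inserting relations 1 and 2, reading 1, and then inserting 3 evicts relation 2, because it is the entry that was used least recently.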


Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
2024-01-17 20:34:30 +02:00