rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-05-27 18:10:37 +00:00

Author	SHA1	Message	Date
Kosntantin Knizhnik	ae7b92abeb	Undo check for INIT_FORKNUM	2025-07-29 08:04:11 +03:00
Kosntantin Knizhnik	3c54a235dd	Add test_unlogged_build.py	2025-07-29 08:04:11 +03:00
Kosntantin Knizhnik	de33affb1f	Fix merge conflicts	2025-07-29 08:04:11 +03:00
Kosntantin Knizhnik	eabac14080	Fix merge conflicts	2025-07-29 08:04:11 +03:00
Kosntantin Knizhnik	8e150568ec	Handle init fork in special way	2025-07-29 08:04:11 +03:00
Kosntantin Knizhnik	1ca23b47fd	Add comment to the test	2025-07-29 08:04:11 +03:00
Kosntantin Knizhnik	1c0f4d6f97	Replace spinlock with LWLock	2025-07-29 08:04:11 +03:00
Konstantin Knizhnik	67c31b61e8	Fix warning	2025-07-29 08:04:11 +03:00
Konstantin Knizhnik	9d12eea25a	Fix merge problems	2025-07-29 08:04:11 +03:00
Konstantin Knizhnik	c1362cbf71	Fix empty list check	2025-07-29 08:04:11 +03:00
Konstantin Knizhnik	902ea0ccd9	Address review comments	2025-07-29 08:04:11 +03:00
Konstantin Knizhnik	fb6d7c4676	Fix merge conflict	2025-07-29 08:04:11 +03:00
Konstantin Knizhnik	5d93a8cc71	Update pgxn/neon/relkind_cache.c Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2025-07-29 08:04:11 +03:00
Konstantin Knizhnik	c3fdab3886	Update pgxn/neon/pagestore_client.h Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2025-07-29 08:04:11 +03:00
Konstantin Knizhnik	1e4783f3f9	Update pgxn/neon/pagestore_client.h Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2025-07-29 08:04:11 +03:00
Konstantin Knizhnik	20dea3aafb	Move lwlock to pagestore_smgr	2025-07-29 08:04:11 +03:00
Konstantin Knizhnik	ca13e7ad7a	Do not return from TRY/CATCH in determine_entry_relkind	2025-07-29 08:04:11 +03:00
Konstantin Knizhnik	87c9b067c2	Remove obsolete comment	2025-07-29 08:04:11 +03:00
Konstantin Knizhnik	e9df43abda	Change return type of determine_entry_relkind to RelKind	2025-07-29 08:04:11 +03:00
Konstantin Knizhnik	840c73e3c4	Rename safe_mdexists to determine_entry_relkind and do unpin instead of unlock in it	2025-07-29 08:04:11 +03:00
Konstantin Knizhnik	a9e940e236	Add assertion to store_cached_relkind	2025-07-29 08:04:11 +03:00
Konstantin Knizhnik	02ecb1ebbf	Update pgxn/neon/pagestore_client.h Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2025-07-29 08:04:11 +03:00
Konstantin Knizhnik	2c0a87af68	Update pgxn/neon/relkind_cache.c Co-authored-by: Heikki Linnakangas <heikki@neon.tech>	2025-07-29 08:04:11 +03:00
Konstantin Knizhnik	a9d4cbe242	Unpin entry in case of mdexists error	2025-07-29 08:04:11 +03:00
Kosntantin Knizhnik	d5d41241fa	Fix incorrect unpin condition in get_cached_relkind	2025-07-29 08:04:11 +03:00
Kosntantin Knizhnik	2e34fe03c7	Replace flags with enum	2025-07-29 08:04:11 +03:00
Konstantin Knizhnik	510c891ae5	Add comments	2025-07-29 08:04:11 +03:00
Konstantin Knizhnik	ac233dc9aa	Fix access to uninitialized flag	2025-07-29 08:04:11 +03:00
Konstantin Knizhnik	c083765840	Address review comments	2025-07-29 08:04:11 +03:00
Konstantin Knizhnik	8884f55eee	Increase number of updates in test_unlogged.py	2025-07-29 08:04:11 +03:00
Konstantin Knizhnik	1f93b664ad	Add test_unlogged for measuring effect of relkind cache	2025-07-29 08:04:11 +03:00
Konstantin Knizhnik	883379f936	Add cache for relation kind	2025-07-29 08:04:11 +03:00
Ivan Efremov	6be572177c	chore: Fix nightly lints (#12746 ) - Remove some unused code - Use `is_multiple_of()` instead of '%' - Collapse consecutive "if let" statements - Elided lifetime fixes It is enough just to review the code of your team	2025-07-28 21:36:30 +00:00
Alex Chi Z.	fe7a4e1ab6	fix(test): wait compaction in timeline offload test (#12673 ) ## Problem close LKB-753. `test_pageserver_metrics_removed_after_offload` is unstable and it sometimes leaves the metrics behind after tenant offloading. It turns out that we triggered an image compaction before the offload and the job was stopped after the offload request was completed. ## Summary of changes Wait for all background tasks to finish before checking the metrics. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-07-28 16:27:55 +00:00
Heikki Linnakangas	40cae8cc36	Fix misc typos and some cosmetic code cleanup (#12695 )	2025-07-28 16:21:35 +00:00
Heikki Linnakangas	02fc8b7c70	Add compatibility macros for MyProcNumber and PGIOAlignedBlock (#12715 ) There were a few uses of these already, so collect them to the compatibility header to avoid the repetition and scattered #ifdefs. The definition of MyProcNumber is a little different from what was used before, but the end result is the same. (PGPROC->pgprocno values were just assigned sequentially to all PGPROC array members, see InitProcGlobal(). That's a bit silly, which is why it was removed in v17.)	2025-07-28 15:05:36 +00:00
John Spray	60feb168e2	pageserver: decrease MAX_SHARDS in utilization (#12668 ) ## Problem When tenants have a lot of timelines, the number of tenants that a pageserver can comfortably handle goes down. Branching is much more widely used in practice now than it was when this code was written, and we generally run pageservers with a few thousand tenants (where each tenant has many timelines), rather than the 10k-20k we might have done historically. This should really be something configurable, or a more direct proxy for resource utilization (such as non-archived timeline count), but this change should be a low effort improvement. ## Summary of changes * Change the target shard count (MAX_SHARDS) to 2500 from 5000 when calculating pageserver utilization (i.e. a 200% overcommit now corresponds to 5000 shards, not 10000 shards) Co-authored-by: John Spray <john.spray@databricks.com>	2025-07-28 13:50:18 +00:00
a-masterov	da596a5162	Update the versions for ClickHouse and Debezium (#12741 ) ## Problem The test for logical replication used the year-old versions of ClickHouse and Debezium so that we may miss problems related to up-to-date versions. ## Summary of changes The ClickHouse version has been updated to 24.8. The Debezium version has been updated to the latest stable one, 3.1.3Final. Some problems with locally running the Debezium test have been fixed. --------- Co-authored-by: Alexey Masterov <alexey.masterov@databricks.com> Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2025-07-28 13:26:33 +00:00
Conrad Ludgate	effd6bf829	[proxy] add metrics for caches (#12752 ) Exposes metrics for caches. LKB-2594 This exposes a high level namespace, `cache`, that all cache metrics can be added to - this makes it easier to make library panels for the caches as I understand it. To calculate the current cache fill ratio, you could use the following query: ``` ( cache_inserted_total{cache="node_info"} - sum (cache_evicted_total{cache="node_info"}) without (cause) ) / cache_capacity{cache="node_info"} ``` To calculate the cache hit ratio, you could use the following query: ``` cache_request_total{cache="node_info", outcome="hit"} / sum (cache_request_total{cache="node_info"}) without (outcome) ```	2025-07-28 10:41:49 +00:00
Tristan Partin	a6e0baf31a	[BRC-1405] Mount databricks pg_hba and pg_ident from configmap (#12733 ) ## Problem For certificate auth, we need to configure pg_hba and pg_ident for it to work. HCC needs to mount this config map to all pg compute pods. ## Summary of changes Create `databricks_pg_hba` and `databricks_pg_ident` to configure where the files are located on the pod. These configs are passed down to `compute_ctl`. `compute_ctl` uses these configs to update the `pg_hba.conf` and `pg_ident.conf` files. We append `include_if_exists {databricks_pg_hba}` to `pg_hba.conf` and similarly to `pg_ident.conf`, so that they refer to the databricks config files without much change to the existing pg default config files. --------- Co-authored-by: Jarupat Jisarojito <jarupat.jisarojito@databricks.com> Co-authored-by: William Huang <william.huang@databricks.com> Co-authored-by: HaoyuHuang <haoyu.huang.68@gmail.com>	2025-07-25 20:50:03 +00:00
Christian Schwarz	19b74b8837	fix(page_service): getpage requests don't hold `applied_gc_cutoff_lsn` guard (#12743 ) Before this PR, getpage requests wouldn't hold the `applied_gc_cutoff_lsn` guard until they were done. Theoretical impact: if we’re not holding the `RcuReadGuard`, gc can theoretically concurrently delete reconstruct data that we need to reconstruct the page. I don't think this practically occurs in production because the odds of it happening are quite low, especially for primary read_write computes. But RO replicas / standby_horizon relies on correct `applied_gc_cutoff_lsn`, so I'm fixing this as part of the work on replacing the standby_horizon propagation mechanism with leases (LKB-88). The change is feature-gated with a feature flag, and evaluated once when entering `handle_pagestream` to avoid performance impact. For observability, we add a field to the `handle_pagestream` span, and a slow-log to the place in `gc_loop` where it waits for the in-flight `RcuReadGuard`s to drain. refs - fixes https://databricks.atlassian.net/browse/LKB-2572 - standby_horizon leases epic: https://databricks.atlassian.net/browse/LKB-2572 --------- Co-authored-by: Christian Schwarz <Christian Schwarz>	2025-07-25 20:25:04 +00:00
Folke Behrens	25718e324a	proxy: Define service_info metric showing the run state (#12749 ) ## Problem Monitoring dashboards show aggregates of all proxy instances, including terminating ones. This can skew the results or make graphs less readable. Also, alerts must be tuned to ignore certain signals from terminating proxies. ## Summary of changes Add a `service_info` metric currently with one label, `state`, showing if an instance is in state `init`, `running`, or `terminating`. The metric can be joined with other metrics to filter the presented time series.	2025-07-25 18:27:21 +00:00
Dmitrii Kovalkov	ac8f44c70e	tests: stop ps immediately in test_ps_unavailable_after_delete (#12728 ) ## Problem test_ps_unavailable_after_delete is flaky. All test failures I've looked at are because of ERROR log messages in pageserver, which happen because the storage controller tries to run reconciliations during the graceful shutdown of the pageserver. I wasn't able to reproduce it locally, but I think stopping PS immediately instead of gracefully should help. If not, we might just silence those errors. - Closes: https://databricks.atlassian.net/browse/LKB-745	2025-07-25 18:09:34 +00:00
Conrad Ludgate	d09664f039	[proxy] replace TimedLru with moka (#12726 ) LKB-2536 TimedLru is hard to maintain. Let's use moka instead. Stacked on top of #12710.	2025-07-25 17:39:48 +00:00
Mikhail	6689d6fd89	LFC prewarm perftest fixes: use existing staging project (#12651 ) https://github.com/neondatabase/cloud/issues/19011 - Prewarm config changes are not publicly available. Correct the test by using a pre-filled 50 GB project on staging - Create extension neon with schema neon to fix read performance tests on staging, error example in https://neon-github-public-dev.s3.amazonaws.com/reports/main/16483462789/index.html#suites/3d632da6dda4a70f5b4bd24904ab444c/919841e331089fc4/ - Don't create extra endpoint in LFC prewarm performance tests	2025-07-25 16:56:41 +00:00
Tristan Partin	33b400beae	[BRC-1425] Plumb through and set the requisite GUCs when starting the compute instance (#12732 ) ## Problem We need to set the following Postgres GUCs to the correct value before starting Postgres in the compute instance: ``` databricks.workspace_url databricks.enable_databricks_identity_login databricks.enable_sql_restrictions ``` ## Summary of changes Plumbed through `workspace_url` and other GUC settings via `DatabricksSettings` in `ComputeSpec`. The spec is sent to the compute instance when it starts up and the GUCs are written to `postgresql.conf` before the postgres process is launched. --------- Co-authored-by: Jarupat Jisarojito <jarupat.jisarojito@databricks.com> Co-authored-by: William Huang <william.huang@databricks.com>	2025-07-25 15:20:05 +00:00
Tristan Partin	ca07f7dba5	Copy pg server cert and key to pgdata with correct permission (#12731 ) ## Problem Copy certificate and key from secret mount directory to `pgdata` directory where `postgres` is the owner and we can set the key permission to 0600. ## Summary of changes - Added new pgparam `pg_compute_tls_settings` to specify where k8s secret for certificate and key are mounted. - Added a new field to `ComputeSpec` called `databricks_settings`. This is a struct that will be used to store any other settings that need to be propagated to Compute but should not be persisted to `ComputeSpec` in the database. - Then when the compute container starts up, as part of the `prepare_pgdata` function, it copies `server.key` and `server.crt` from the k8s mounted directory to the `pgdata` directory. ## How is this tested? Add unit tests. Manual test via KIND Co-authored-by: Jarupat Jisarojito <jarupat.jisarojito@databricks.com>	2025-07-25 15:05:05 +00:00
Vlad Lazar	b0dfe0ffa6	storcon: attempt all non-essential location config calls during reconciliations (#12745 ) ## Problem We saw the following in the field: Context and observations: * The storage controller keeps track of the latest generations and the pageserver that issued the latest generation in the database * When the storage controller needs to proxy a request (e.g. timeline creation) to the pageservers, it will find and use the pageserver that issued the latest generation from the db (generation_pageserver). * pageserver-2.cell-2 got into a bad state and wasn't able to apply location_config (e.g. detach a shard) What happened: 1. pageserver-2.cell-2 was a secondary for our shard since we were not able to detach it 2. control plane asked to detach a tenant (presumably because it was idle) a. In response storcon clears the generation_pageserver from the db and attempts to detach all locations b. it tries to detach pageserver-2.cell-2 first, but fails, which fails the entire reconciliation leaving the good attached location still there c. return success to cplane 3. control plane asks to re-attach the tenant a. In response storcon performs a reconciliation b. it finds that the observed state matches the intent (remember we did not detach the primary at step(2)) c. skips incrementing the generation and setting the generation_pageserver column Now any requests that need to be proxied to pageservers and rely on the generation_pageserver db column fail because that's not set ## Summary of changes 1. We do all non-essential location config calls (setting up secondaries, detaches) at the end of the reconciliation. Previously, we bailed out of the reconciliation on the first failure. With this patch we attempt all of the RPCs. This allows the observed state to update even if another RPC failed for unrelated reasons. 2. If the overall reconciliation failed, we don't want to remove nodes from the observed state as a safeguard. With the previous patch, we'll get a deletion delta to process, which would be ignored. Ignoring it is not the right thing to do since it's out of sync with the db state. Hence, on reconciliation failures map deletion from the observed state to the uncertain state. Future reconciliations will query the node to refresh its observed state. Closes LKB-204	2025-07-25 14:03:17 +00:00
Erik Grinaker	185ead8395	pageserver: verify gRPC GetPages on correct shard (#12722 ) Verify that gRPC `GetPageRequest` has been sent to the shard that owns the pages. This avoids spurious `NotFound` errors if a compute misroutes a request, which can appear scarier (e.g. data loss). Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).	2025-07-25 13:43:04 +00:00
Erik Grinaker	37e322438b	pageserver: document gRPC compute accessibility (#12724 ) Document that the Pageserver gRPC port is accessible by computes, and should not provide internal services. Touches [LKB-191](https://databricks.atlassian.net/browse/LKB-191).	2025-07-25 13:35:44 +00:00
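
The arithmetic in commit 60feb168e2 (MAX_SHARDS lowered from 5000 to 2500) can be illustrated with a minimal sketch. It assumes the shard-count part of pageserver utilization is simply the shard count expressed as a percentage of the target. The function name and the linear formula are hypothetical and are not taken from the pageserver source; only the numbers 2500, 5000, 10000 and 200% come from the commit message.

```python
# Hypothetical sketch: names and the linear formula are assumptions for
# illustration, not the pageserver's actual utilization code.

MAX_SHARDS = 2500  # new target shard count per the commit message (was 5000)


def shard_utilization_percent(shard_count: int, max_shards: int = MAX_SHARDS) -> float:
    """Return the shard count as a percentage of the target shard count."""
    if max_shards <= 0:
        raise ValueError("max_shards must be positive")
    return 100.0 * shard_count / max_shards


if __name__ == "__main__":
    # With the new target, 5000 shards is a 200% overcommit.
    print(shard_utilization_percent(5000))
    # With the old target of 5000, the same 200% needed 10000 shards.
    print(shard_utilization_percent(10000, max_shards=5000))
```

Under this assumption, halving the target means a pageserver reports the same utilization at half the shard count, which matches the commit's statement that a 200% overcommit now corresponds to 5000 shards rather than 10000.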

8454 Commits