rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-07 13:32:57 +00:00

Author	SHA1	Message	Date
Heikki Linnakangas	9537829ccd	fast_import: Make CPU & memory size configurable (#10709 ) The old values assumed that you have at least about 18 GB of RAM available (shared_buffers=10GB and maintenance_work_mem=8GB). That's a lot when testing locally. Make it configurable, and make the default assumption much smaller: 256 MB. This is nice for local testing, but it's also in preparation for starting to use VMs to run these jobs. When launched in a VM, the control plane can set these env variables according to the max size of the VM. Also change the formula for how RAM is distributed: use 10% of RAM for shared_buffers, and 70% for maintenance_work_mem. That leaves a good amount for misc. other stuff and the OS. A very large shared_buffers setting won't typically help with bulk loading. It won't help with the network and I/O of processing all the tables, unless maybe if the whole database fits in shared buffers, but even then it's not much faster than using local disk. Bulk loading is all sequential I/O. It also won't help much with index creation, which is also sequential I/O. A large maintenance_work_mem can be quite useful, however, so that's where we put most of the RAM.	2025-02-12 11:43:23 +00:00
Mikhail Kot	2c4c6e6330	fix(neon): Add tests clarifying postgres sigabrt on pageserver unavailability (#10666 ) Resolves: https://github.com/neondatabase/neon/issues/5734 When we query pageserver and it's unavailable after some retries, postgres sigabrt's. This is intended behavior so I've added tests checking it	2025-02-12 10:52:26 +00:00
Erik Grinaker	71c30e52fa	pageserver: properly yield for L0 compaction (#10769 ) ## Problem When image compaction yields for L0 compaction, it may not immediately schedule L0 compaction, because it just goes on to compact the next pending timeline. Touches #10694. Requires #10744. ## Summary of changes Extend `CompactionOutcome` with `YieldForL0` and `Skipped` variants, and immediately schedule an L0 compaction pass in the `YieldForL0` case.	2025-02-11 23:43:58 +00:00
Erik Grinaker	6c83ac3fd2	pageserver: do all L0 compaction before image compaction (#10744 ) ## Problem Image compaction can starve out L0 compaction if a tenant has several timelines with L0 debt. Touches #10694. Requires #10740. ## Summary of changes * Add an initial L0 compaction pass, in order of L0 count. * Add a tenant option `compaction_l0_first` to control the L0 pass (disabled by default). * Add `CompactFlags::OnlyL0Compaction` to run an L0-only compaction pass. * Clean up the compaction iteration logic. A later PR will use separate semaphores for the L0 and image compaction passes to avoid cross-tenant L0 starvation. That PR will also make image compaction yield if _any_ of the tenant's timelines have pending L0 compaction to further avoid starvation.	2025-02-11 22:08:46 +00:00
Heikki Linnakangas	635b67508b	Split utils::http to separate crate (#10753 ) Avoids compiling the crate and its dependencies into binaries that don't need them. Shrinks the compute_ctl binary from about 31MB to 28MB in the release-line-debug-size-lto profile.	2025-02-11 22:06:53 +00:00
dependabot[bot]	9491154eae	build(deps): bump cryptography from 43.0.1 to 44.0.1 in the pip group (#10773 ) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Alexander Bayandin <alexander@neon.tech>	2025-02-11 21:23:17 +00:00
Heikki Linnakangas	b5e09fdaf3	Re-order Dockerfile steps for putting together final compute image (#10736 ) Run "apt install" first, and only then COPY the files from the intermediary build layers to the final image. This way, if you modify any of the sources that trigger e.g. rebuilding compute_ctl, the "apt install" step can still be cached.	2025-02-11 20:10:06 +00:00
John Spray	cd51ed2f86	tests: parametrize test_graceful_cluster_restart on AZ count (#10427 ) ## Problem In https://github.com/neondatabase/neon/pull/10411 fill logic changes such that it benefits us to test it with & without AZs set up. I didn't extend the test inline in that PR because there were overlapping test changes in flight to add `num_az` parameter. ## Summary of changes - Parameterise test on AZ count (1 or 2) - When AZ count is 2, use a different balance check that just asserts the _tenants_ are balanced (since AZ affinity is chosen on a per-tenant basis)	2025-02-11 20:09:41 +00:00
Folke Behrens	f62bc28086	proxy: Move binaries into the lib (#10758 ) * This way all clippy lints defined in the lib also cover the binary code. * It's much easier to detect unused code. * Fix all discovered lints.	2025-02-11 19:46:23 +00:00
Tristan Partin	da9c101939	Implement a second HTTP server within compute_ctl (#10574 ) The compute_ctl HTTP server has the following purposes: - Allow management via the control plane - Provide an endpoint for scaping metrics - Provide APIs for compute internal clients - Neon Postgres extension for installing remote extensions - local_proxy for installing extensions and adding grants The first two purposes require the HTTP server to be available outside the compute. The Neon threat model is a bad actor within our internal network. We need to reduce the surface area of attack. By exposing unnecessary unauthenticated HTTP endpoints to the internal network, we increase the surface area of attack. For endpoints described in the third bullet point, we can just run an extra HTTP server, which is only bound to the loopback interface since all consumers of those endpoints are within the compute.	2025-02-11 18:02:22 +00:00
Arpad Müller	f7b2293317	Hardlink resident layers during detach ancestor (#10729 ) After a detach ancestor operation, we don't want to on-demand download layers that are already resident. This has shown to impede performance, sometimes quite a lot (50 seconds: https://github.com/neondatabase/neon/issues/8828#issuecomment-2643735644) Fixes #8828.	2025-02-11 16:58:34 +00:00
Arpad Müller	be447ba4f8	Change timeline_offloading setting default to true (#10760 ) This changes the default value of the `timeline_offloading` pageserver and tenant configs to true, now that offloading has been rolled out without problems. There is also a small fix in the tenant config merge function, where we applied the `lazy_slru_download` value instead of `timeline_offloading`. Related issue: https://github.com/neondatabase/cloud/issues/21353	2025-02-11 16:36:54 +00:00
Christian Schwarz	9247331c67	fix(page_service / batching): smgr op latency metric of dropped responses include flush time (#10756 ) # Problem Say we have a batch of 10 responses to send out. Then, even with - #10728 we've still only called observe_execution_end_flush_start for the first 3 responses. The remaining 7 response timers are still ticking. When compute now closes the connection, the waiting flush fails with an error and we `drop()` the remaining 7 responses' smgr op timers. The `impl Drop for SmgrOpTimer` will observe an execution time that includes the flush time. In practice, this is supsected to produce the `+Inf` observations in the smgr op latency histogram we've seen since the introduction of pipelining, even after shipping #10728. refs: - fixup of https://github.com/neondatabase/neon/pull/10042 - fixup of https://github.com/neondatabase/neon/pull/10728 - fixes https://github.com/neondatabase/neon/issues/10754	2025-02-11 14:05:59 +00:00
John Spray	fcedd10226	tests: temporarily permit a log error (#10752 ) ## Problem These tests can encounter a bug in the pageserver read path (#9185) which occurs under the very specific circumstances that the tests create, but is very unlikely to happen in the field. We will fix the bug, but in the meantime let's un-flake the tests. Related: https://github.com/neondatabase/neon/issues/10720 ## Summary of changes - Permit "could not find data for key" errors in tests affected by #9185	2025-02-11 12:37:09 +00:00
Arpad Müller	a4ea1e53ae	Apply Azure SDK patch to periodically load workload identity file (#10415 ) The SDK bug https://github.com/Azure/azure-sdk-for-rust/issues/1739 was originally worked around via #10378, but now upstream has provided a fix in [this](https://github.com/Azure/azure-sdk-for-rust/pull/1997) PR, which we've been asked to test. So this is what this PR is doing: revert #10378 (to make sure we fail if the bug isn't fixed by the SDK PR), and apply the SDK PR to our fork. Currently pointing to my local branch to check CI. I'd like to merge the [SDK fork PR](https://github.com/neondatabase/azure-sdk-for-rust/pull/2) before merging this to main.	2025-02-11 09:40:22 +00:00
Heikki Linnakangas	c26131c2b3	Link pgbouncer dynamically (#10749 ) I don't see the point of static linking, postgres itself and many of the extensions are already built dynamically. One reason for the change is that I'm working on bigger changes to start using systemd in the compute, and as part of that I wanted to add the --with-systemd configure option to pgbouncer, and there doesn't seem to be a static version of libsystemd (at least not on Debian).	2025-02-11 07:48:54 +00:00
Andrew Rudenko	4ab18444ec	compute_ctl: database_schema should keep process::Child as part of returned value (#10273 ) ## Problem /database_schema endpoint returns incomplete output from `pg_dump` ## Summary of changes The Tokio process was not used properly. The returned stream does not include `process::Child`, and the process is scheduled to be killed immediately after the `get_database_schema` call when `cmd` goes out of scope. The solution in this PR is to return a special Stream implementation that retains `process::Child`.	2025-02-11 07:02:13 +00:00
Heikki Linnakangas	98883e4b30	compute_ctl: Use a single tokio runtime (#10743 ) compute_ctl is mostly written in synchronous fashion, intended to run in a single thread. However various parts had become async, and they launched their own tokio runtimes to run the async code. For example, VM monitor ran in its own multi-threaded runtime, and apply_spec_sql() launched another multi-threaded runtime to run the per-database SQL commands in parallel. In addition to that, a few places used a current-thread runtime to run async code in the main thread, or launched a current-thread runtime in a different thread to run background tasks. Unify the runtimes so that there is only one tokio runtime. It's created very early at process startup, and the main thread "enters" the runtime, so that it's always available for tokio::spawn() and runtime.block_on() calls. All code that needs to run async code uses the same runtime. The main thread still mostly runs in a synchronous fashion. When it needs to run async code, it uses rt.block_on(). Spawn fewer additional threads, prefer to spawn tokio tasks instead. Convert some code that ran synchronously in background threads into async. I didn't go all the way, though, some background threads are still spawned.	2025-02-11 00:39:44 +00:00
Tristan Partin	3d143ad799	Unbrick the forward compatibility test failures (#10747 ) Since the merge of https://github.com/neondatabase/neon/pull/10523, forward compatibility tests have been broken everywhere. Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-02-10 22:22:10 +00:00
Alex Chi Z.	b0c7ee0175	feat(pageserver): better gc_compaction_split heuristics (#10727 ) ## Problem close https://github.com/neondatabase/neon/issues/10213 `range_search` only returns the top-most layers that may satisfy the search, so it doesn't include all layers that might be accessed (the user needs to recursively call this function). We need to retrieve the full layer map and find overlaps in order to have a correct heuristics of the job split. ## Summary of changes Retrieve all layers and find overlaps instead of doing `range_search`. The patch also reduces the time holding the layer map read guard. Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-02-10 19:33:34 +00:00
Erik Grinaker	8c4e94107d	pageserver: notify compaction loop at threshold (#10740 ) ## Problem The compaction loop currently runs periodically, which can cause it to wait for up to 20 seconds before starting L0 compaction by default. Also, when we later separate the semaphores for L0 compaction and image compaction, we want to give up waiting for the image compaction semaphore if L0 compaction is needed on any timeline. Touches #10694. ## Summary of changes Notify the compaction loop when an L0 flush (on any timeline) exceeds `compaction_threshold`. Also do some opportunistic cleanups in the area.	2025-02-10 17:48:09 +00:00
Heikki Linnakangas	c368b0fe14	Use a cache mount to speed up rebuilding compute node image (#10737 ) Building the compute rust binaries from scratch is pretty slow, it takes between 4-15 minutes on my laptop, depending on which compiler flags and other tricks I use. A cache mount allows caching the dependencies and incremental builds, which speeds up rebuilding significantly when you only makes a small change in a source file.	2025-02-10 16:58:29 +00:00
Heikki Linnakangas	aba61a3712	Download awscli in separate layer in Dockerfile, to allow caching (#10733 ) The awscli was downloaded at the last stages of the overall compute image build, which meant that if you modified any part of the build, it would trigger a re-download of the awscli. That's a bit annoying when developing locally and rebuilding the compute image repeatedly. Move it to a separate layer, to cache separately and to avoid the spurious rebuilds.	2025-02-10 16:48:28 +00:00
Tristan Partin	946da3f7e2	Require --compute-id when running compute_ctl (#10523 ) The compute_id will be used when verifying claims sent by the control plane. Signed-off-by: Tristan Partin <tristan@neon.tech> Signed-off-by: Tristan Partin <tristan@neon.tech>	2025-02-10 16:46:20 +00:00
Ivan Efremov	73633e27ed	fix(proxy): Log errors from the local proxy in auth-broker (#10659 ) Handle errors from local proxy by parsing HTTP response in auth broker code Closes [#19476](https://github.com/neondatabase/cloud/issues/19476)	2025-02-10 16:06:13 +00:00
Konstantin Knizhnik	0cf0119751	Add --save_records option to pg_waldump (#10626 ) ## Problem Make it possible to dump WAL records in format recognised by walredo process. Intended usage: ``` pg_waldump -R 1663/5/16396 -B 771727 000000010000000100000034 --save-records=/tmp/walredo.records postgres --wal-redo < /tmp/walredo.records > /tmp/page.img ``` ## Summary of changes Related Postgres PRs: https://github.com/neondatabase/postgres/pull/575 https://github.com/neondatabase/postgres/pull/572 --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>	2025-02-10 15:48:03 +00:00
Alex Chi Z.	b37f52fdf1	feat(pageserver): dump read path on missing key error (#10528 ) ## Problem helps investigate https://github.com/neondatabase/neon/issues/10482 ## Summary of changes In debug mode and testing mode, we will record all files visited by a read operation, and print it out when it errors. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-02-10 14:25:56 +00:00
Alex Chi Z.	443c8d0b4b	feat(pageserver): repartition on L0-L1 boundary (#10548 ) ## Problem Reduce the read amplification when doing `repartition`. ## Summary of changes Compute the L0-L1 boundary LSN and do repartition here. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>	2025-02-10 14:25:48 +00:00
Alexander Bayandin	2f36bdb218	CI(build-neon): fix duplicated builds (#10731 ) ## Problem Parameterising `build-neon` job with `test-cfg` makes it to build exactly the same thing several times. See - `874accd6ed/.github/workflows/_build-and-test-locally.yml (L51-L52)` - https://github.com/neondatabase/neon/actions/runs/13215068271/job/36893373038 ## Summary of changes - Extract `sanitizers` to a separate input from `test-cfg` and set it separately - Don't parametrise `build-neon` with `test-cfg`	2025-02-10 12:29:39 +00:00
Ivan Efremov	e7118213ab	impr(proxy): Set TTL for Redis cancellation map keys (#10671 ) Use expire() op to set TTL for Redis cancellation key	2025-02-10 10:51:53 +00:00
a-masterov	d204d51faf	Fix the upgrade test for pg_jwt by adding the database name (#10738 ) ## Problem The upgrade test for pg_jwt does not work correctly. ## Summary of changes The script for the upgrade test is modified to use the database `contrib_regression`.	2025-02-10 09:56:46 +00:00
Erik Grinaker	ac55e2dbe5	pageserver: improve tenant housekeeping task (#10725 ) # Problem walredo shutdown is done in the compaction task. Let's move it to tenant housekeeping. # Summary of changes * Rename "ingest housekeeping" to "tenant housekeeping". * Move walredo shutdown into tenant housekeeping. * Add a constant `WALREDO_IDLE_TIMEOUT` set to 3 minutes (previously 10x compaction threshold).	2025-02-08 12:42:55 +00:00
Erik Grinaker	874accd6ed	pageserver: misc task cleanups (#10723 ) This patch does a bunch of superficial cleanups of `tenant::tasks` to avoid noise in subsequent PRs. There are no functional changes. PS: enable "hide whitespace" when reviewing, due to the unindentation of large async blocks.	2025-02-08 11:02:13 +00:00
Christian Schwarz	6cd3b501ec	fix(page_service / batching): smgr op latency metrics includes the flush time of preceding requests (#10728 ) Before this PR, if a batch contains N responses, the smgr op latency reported for response (N-i) would include the time we spent flushing the preceding requests. refs: - fixup of https://github.com/neondatabase/neon/pull/10042 - fixes https://github.com/neondatabase/neon/issues/10674	2025-02-08 09:28:09 +00:00
Christian Schwarz	bf20d78292	fix(page_service): page reconstruct error log does not include `shard_id` label (#10680 ) # Problem Before this PR, the `shard_id` field was missing when page_service logs a reconstruct error. This was caused by batching-related refactorings. Example from staging: ``` 2025-01-30T07:10:04.346022Z ERROR page_service_conn_main{peer_addr=...}:process_query{tenant_id=... timeline_id=...}:handle_pagerequests:request:handle_get_page_at_lsn_request_batched{req_lsn=FFFFFFFF/FFFFFFFF}: error reading relation or page version: Read error: whole vectored get request failed because one or more of the requested keys were missing: could not find data for key ... ``` # Changes Delay creation of the handler-specific span until after shard routing This also avoids the need for the record() call in the pagestream hot path. # Testing Manual testing with a failpoint that is part of this PR's history but will be squashed away. # Refs - fixes https://github.com/neondatabase/neon/issues/10599	2025-02-07 19:45:39 +00:00
Arpad Müller	2656c713a4	Revert recent AWS SDK update (#10724 ) We've been seeing some regressions in staging since the AWS SDK updates: https://github.com/neondatabase/neon/issues/10695 . We aren't sure the regression was caused by the SDK update, but the issues do involve S3, so it's not unlikely. By reverting the SDK update we find out whether it was really the SDK update, or something else. Reverts the two PRs: * https://github.com/neondatabase/neon/pull/10588 * https://github.com/neondatabase/neon/pull/10699 https://neondb.slack.com/archives/C08C2G15M6U/p1738576986047179	2025-02-07 17:37:53 +00:00
John Spray	5e95860e70	tests: wait for manifest persistence in test_timeline_archival_chaos (#10719 ) ## Problem This test would sometimes fail its assertion that a timeline does not revert to active once archived. That's because it was using the in-memory offload state, not the persistent state, so this was sometimes lost across a pageserver restart. Closes: https://github.com/neondatabase/neon/issues/10389 ## Summary of changes - When reading offload status, read from pageserver API _and_ remote storage before considering the timeline offloaded	2025-02-07 16:27:39 +00:00
Heikki Linnakangas	0abff59e97	compute: Allow postgres user to power off the VM (#10710 ) I plan to use this when launching a fast_import job in a VM. There's currently no good way for an executable running in a NeonVM to exit gracefully and have the VM shut down. The inittab we use always respawns the payload command. The idea is that the control plane can use "fast_import ... && poweroff" as the command, so that when fast_import completes successfully, the VM is terminated, and the k8s Pod and VirtualMachine object are marked as completed successfully. I'm working on bigger changes to how we launch VMs, and will try to come up with a nicer system for that, but in the meanwhile, this quick hack allows us to proceed with using VMs for one-off jobs like fast_import.	2025-02-07 16:03:01 +00:00
John Spray	9609f7547e	tests: address warnings in timeline shutdown (#10702 ) ## Problem There are a couple of log warnings tripping up `test_timeline_archival_chaos` - `[stopping left-over name="timeline_delete" tenant_shard_id=2d526292b67dac0e6425266d7079c253 timeline_id=Some(44ba36bfdee5023672c93778985facd9) kind=TimelineDeletionWorker\n')](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10672/13161357302/index.html#/testresult/716b997bb1d8a021)` - `ignoring attempt to restart exited flush_loop 503d8f401d8887cfaae873040a6cc193/d5eed0673ba37d8992f7ec411363a7e3\n')` Related: https://github.com/neondatabase/neon/issues/10389 ## Summary of changes - Downgrade the 'ignoring attempt to restart' to info -- there's nothing in the design that forbids this happening, i.e. someone calling maybe_spawn_flush_loop concurrently with shutdown() - Prevent timeline deletion tasks outliving tenants by carrying a gateguard. This logically makes sense because the deletion process does call into Tenant to update manifests.	2025-02-07 15:29:34 +00:00
Erik Grinaker	d6e87a3a9c	pageserver: add separate, disabled compaction semaphore (#10716 ) ## Problem L0 compaction can get starved by other background tasks. It needs to be responsive to avoid read amp blowing up during heavy write workloads. Touches #10694. ## Summary of changes Add a separate semaphore for compaction, configurable via `use_compaction_semaphore` (disabled by default). This is primarily for testing in staging; it needs further work (in particular to split image/L0 compaction jobs) before it can be enabled.	2025-02-07 15:11:31 +00:00
Arpad Müller	f5243992fa	safekeeper: make timeline deletions a bit more verbose (#10721 ) Make timeline deletion print the sub-steps, so that we can narrow down some stuck timeline deletion issues we are observing. https://neondb.slack.com/archives/C08C2G15M6U/p1738930694716009	2025-02-07 15:06:26 +00:00
John Spray	95220ba43e	tests: fix flaky endpoint in test_ingest_logical_message (#10700 ) ## Problem Endpoint kept running while timeline was deleted, causing forbidden warnings on the pageserver when the tenant is not found. ## Summary of changes - Explicitly stop the endpoint before the end of the test, so that it isn't trying to talk to the pageserver in the background while things are torn down	2025-02-07 14:51:36 +00:00
John Spray	08f92bb916	pageserver: clean up DeletionQueue push_layers_sync (#10701 ) ## Problem This is tech debt. While we introduced generations for tenants, some legacy situations without generations needed to delete things inline (async operation) instead of enqueing them (sync operation). ## Summary of changes - Remove the async code, replace calls with the sync variant, and assert that the generation is always set	2025-02-07 13:03:01 +00:00
Fedor Dikarev	8f651f9582	switch from localtest.me to local.neon.build (#10714 ) ## Problem Ref: https://github.com/neondatabase/neon/issues/10632 We use dns named `.localtest.me` in our test, and that domain is well-known and widely used for that, with all the records there resolve to the localhost, both IPv4 and IPv6: `127.0.0.1` and `::1` In some cases on our runners these addresses resolves only to `IPv6`, and so components fail to connect when runner doesn't have `IPv6` address. We suspect issue in systemd-resolved here (https://github.com/systemd/systemd/issues/17745) To workaround that and improve test stability, we introduced our own domain `.local.neon.build` with IPv4 address `127.0.0.1` only See full details and troubleshoot log in referred issue. p.s. If you're FritzBox user, don't forget to add that domain `local.neon.build` to the `DNS Rebind Protection` section under `Home Network -> Network -> Network Settings`, otherwise FritzBox will block addresses, resolving to the local addresses. For other devices/vendors, please check corresponding documentation, if resolving `local.neon.build` will produce empty answer for you. ## Summary of changes Replace all the occurrences of `localtest.me` with `local.neon.build`	2025-02-07 12:25:16 +00:00
Arseny Sher	b5a239c4ae	Add reconciliation details to sk membership change rfc (#10514 ) ## Problem RFC pointed out the need of reconciliation, but wasn't detailed how it can be done. ## Summary of changes Add these details.	2025-02-07 11:20:49 +00:00
Alexander Lakhin	de05258419	Adjust diesel schema check for build with sanitizers (#10711 ) We need to disable the detection of memory leaks when running ``neon_local init` for build with sanitizers to avoid an error thrown by AddressSanitizer.	2025-02-07 08:56:39 +00:00
Peter Bendel	e73d681a0e	Patch pgcopydb and fix another segfault (#10706 ) ## Problem Found another pgcopydb segfault in error handling ```bash 2025-02-06 15:30:40.112 51299 ERROR pgsql.c:2330 [TARGET -738302813] FATAL: terminating connection due to administrator command 2025-02-06 15:30:40.112 51298 ERROR pgsql.c:2330 [TARGET -1407749748] FATAL: terminating connection due to administrator command 2025-02-06 15:30:40.112 51297 ERROR pgsql.c:2330 [TARGET -2073308066] FATAL: terminating connection due to administrator command 2025-02-06 15:30:40.112 51300 ERROR pgsql.c:2330 [TARGET 1220908650] FATAL: terminating connection due to administrator command 2025-02-06 15:30:40.432 51300 ERROR pgsql.c:2536 [Postgres] FATAL: terminating connection due to administrator command 2025-02-06 15:30:40.513 51290 ERROR copydb.c:773 Sub-process 51300 exited with code 0 and signal Segmentation fault 2025-02-06 15:30:40.578 51299 ERROR pgsql.c:2536 [Postgres] FATAL: terminating connection due to administrator command 2025-02-06 15:30:40.613 51290 ERROR copydb.c:773 Sub-process 51299 exited with code 0 and signal Segmentation fault 2025-02-06 15:30:41.253 51298 ERROR pgsql.c:2536 [Postgres] FATAL: terminating connection due to administrator command 2025-02-06 15:30:41.314 51290 ERROR copydb.c:773 Sub-process 51298 exited with code 0 and signal Segmentation fault 2025-02-06 15:30:43.133 51297 ERROR pgsql.c:2536 [Postgres] FATAL: terminating connection due to administrator command 2025-02-06 15:30:43.215 51290 ERROR copydb.c:773 Sub-process 51297 exited with code 0 and signal Segmentation fault 2025-02-06 15:30:43.215 51290 ERROR indexes.c:123 Some INDEX worker process(es) have exited with error, see above for details 2025-02-06 15:30:43.215 51290 ERROR indexes.c:59 Failed to create indexes, see above for details 2025-02-06 15:30:43.232 51271 ERROR copydb.c:768 Sub-process 51290 exited with code 12 ``` ```bashadmin@ip-172-31-38-164:~/pgcopydb$ gdb /usr/local/pgsql/bin/pgcopydb core GNU gdb (Debian 13.1-3) 13.1 Copyright (C) 2023 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "aarch64-linux-gnu". Type "show configuration" for configuration details. For bug reporting instructions, please see: <https://www.gnu.org/software/gdb/bugs/>. Find the GDB manual and other documentation resources online at: <http://www.gnu.org/software/gdb/documentation/>. For help, type "help". Type "apropos word" to search for commands related to "word"... Reading symbols from /usr/local/pgsql/bin/pgcopydb... [New LWP 51297] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1". Core was generated by `pgcopydb: create index ocr.ocr_pipeline_step_results_version_pkey '. Program terminated with signal SIGSEGV, Segmentation fault. #0 0x0000aaaac3a4b030 in splitLines (lbuf=lbuf@entry=0xffffd8b86930, buffer=<optimized out>) at string_utils.c:630 630 newLinePtr = '\0'; (gdb) bt #0 0x0000aaaac3a4b030 in splitLines (lbuf=lbuf@entry=0xffffd8b86930, buffer=<optimized out>) at string_utils.c:630 #1 0x0000aaaac3a3a678 in pgsql_execute_log_error (pgsql=pgsql@entry=0xffffd8b87040, result=result@entry=0x0, sql=sql@entry=0xffff81fe9be0 "CREATE UNIQUE INDEX IF NOT EXISTS ocr_pipeline_step_results_version_pkey ON ocr.ocr_pipeline_step_results_version USING btree (id, transaction_id);", debugParameters=debugParameters@entry=0xaaaaec5f92f0, context=context@entry=0x0) at pgsql.c:2322 #2 0x0000aaaac3a3bbec in pgsql_execute_with_params (pgsql=pgsql@entry=0xffffd8b87040, sql=0xffff81fe9be0 "CREATE UNIQUE INDEX IF NOT EXISTS ocr_pipeline_step_results_version_pkey ON ocr.ocr_pipeline_step_results_version USING btree (id, transaction_id);", paramCount=paramCount@entry=0, paramTypes=paramTypes@entry=0x0, paramValues=paramValues@entry=0x0, context=context@entry=0x0, parseFun=parseFun@entry=0x0) at pgsql.c:1649 #3 0x0000aaaac3a3c468 in pgsql_execute (pgsql=pgsql@entry=0xffffd8b87040, sql=<optimized out>) at pgsql.c:1522 #4 0x0000aaaac3a245f4 in copydb_create_index (specs=specs@entry=0xffffd8b8ec98, dst=dst@entry=0xffffd8b87040, index=index@entry=0xffff81f71800, ifNotExists=<optimized out>) at indexes.c:846 #5 0x0000aaaac3a24ca8 in copydb_create_index_by_oid (specs=specs@entry=0xffffd8b8ec98, dst=dst@entry=0xffffd8b87040, indexOid=<optimized out>) at indexes.c:410 #6 0x0000aaaac3a25040 in copydb_index_worker (specs=specs@entry=0xffffd8b8ec98) at indexes.c:297 #7 0x0000aaaac3a25238 in copydb_start_index_workers (specs=specs@entry=0xffffd8b8ec98) at indexes.c:209 #8 0x0000aaaac3a252f4 in copydb_index_supervisor (specs=specs@entry=0xffffd8b8ec98) at indexes.c:112 #9 0x0000aaaac3a253f4 in copydb_start_index_supervisor (specs=0xffffd8b8ec98) at indexes.c:57 #10 copydb_start_index_supervisor (specs=specs@entry=0xffffd8b8ec98) at indexes.c:34 #11 0x0000aaaac3a51ff4 in copydb_process_table_data (specs=specs@entry=0xffffd8b8ec98) at table-data.c:146 #12 0x0000aaaac3a520dc in copydb_copy_all_table_data (specs=specs@entry=0xffffd8b8ec98) at table-data.c:69 #13 0x0000aaaac3a0ccd8 in cloneDB (copySpecs=copySpecs@entry=0xffffd8b8ec98) at cli_clone_follow.c:602 #14 0x0000aaaac3a0d2cc in start_clone_process (pid=0xffffd8b743d8, copySpecs=0xffffd8b8ec98) at cli_clone_follow.c:502 #15 start_clone_process (copySpecs=copySpecs@entry=0xffffd8b8ec98, pid=pid@entry=0xffffd8b89788) at cli_clone_follow.c:482 #16 0x0000aaaac3a0d52c in cli_clone (argc=<optimized out>, argv=<optimized out>) at cli_clone_follow.c:164 #17 0x0000aaaac3a53850 in commandline_run (command=command@entry=0xffffd8b9eb88, argc=0, argc@entry=22, argv=0xffffd8b9edf8, argv@entry=0xffffd8b9ed48) at /home/admin/pgcopydb/src/bin/pgcopydb/../lib/subcommands.c/commandline.c:71 #18 0x0000aaaac3a01464 in main (argc=22, argv=0xffffd8b9ed48) at main.c:140 (gdb) ``` The problem is most likely that the following call returned a message in a read-only memory segment where we cannot replace \n with \0 in string_utils.c splitLines() function ```C char message = PQerrorMessage(pgsql->connection); ``` ## Summary of changes modified the patch to also address this problem	2025-02-06 20:21:18 +00:00
Anastasia Lubennikova	44b905d14b	Fix remote extension lookup (#10708 ) when library name doesn't match extension name. The bug was introduced by recent commit `ebc55e6a`	2025-02-06 19:21:38 +00:00
Arseny Sher	186199f406	Update aws sdk (#10699 ) ## Problem We have unclear issue with stuck s3 client, probably after partial aws sdk update without updating sdk-s3. https://github.com/neondatabase/neon/pull/10588 Let's try to update s3 as well. ## Summary of changes Result of running cargo update -p aws-types -p aws-sigv4 -p aws-credential-types -p aws-smithy-types -p aws-smithy-async -p aws-sdk-kms -p aws-sdk-iam -p aws-sdk-s3 -p aws-config ref https://github.com/neondatabase/neon/issues/10695	2025-02-06 17:28:27 +00:00
OBBO67	82cbab7512	Switch reqlsns[0].request_lsn to arrow operator in neon_read_at_lsnv() (#10620 ) (#10687 ) ## Problem Currently the following line below uses array subscript notation which is confusing since `reqlsns` is not an array but just a pointer to a struct. ``` XLogWaitForReplayOf(reqlsns[0].request_lsn); ``` ## Summary of changes Switch from array subscript notation to arrow operator to improve readability of code. Close #10620.	2025-02-06 17:26:26 +00:00

1 2 3 4 5 ...

7180 Commits