rust/neon - neon - Gitea: Git with a cup of tea

rust/neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-07 21:42:56 +00:00

Author	SHA1	Message	Date
Egor Suvorov	89e5659f3f	Replace COPYRIGHT file from the root with NOTICE file The primary reason: make GitHub detect that we use Apache License 2.0 They do it via https://github.com/licensee/licensee Ruby library (gem). Our COPYRIGHT file contains a part of the Apache License, which should be added to a source file, not the license or copyright information itself, which confuses the library. Instead, the recommended way is to create a NOTICE file which references license of the code and its bundled dependencies.	2022-05-23 01:03:03 +02:00
Egor Suvorov	ef7cdb13e2	Remove unused dependencies from poetry.lock via `poetry lock --no-update` There were a bunch of dependencies for Python <3.9. They are not needed after #1254. This commit makes it easier to add/remove dependencies because lock file will be updated like this on any such operation. Do not update dependencies yet to not break anything.	2022-05-21 12:21:45 +02:00
Egor Suvorov	73187bfef1	postgres_ffi: find_end_of_wal_segment: clarify code around xl_crc retrieval It would be better to not update xl_crc/rec_hdr at all when skipping contrecord, but I would prefer to keep PR #1574 small. Better audit of `find_end_of_wal_segment` is coming anyway in #544.	2022-05-21 05:25:17 +02:00
Egor Suvorov	967eb38e81	postgres_ffi: find_end_of_wal_segment: fix contrecord skipping Also enable corresponding test.	2022-05-21 05:25:17 +02:00
Egor Suvorov	a124e44866	postgres_ffi: find_end_of_wal_segment: add lots of trace	2022-05-21 05:25:17 +02:00
Egor Suvorov	c4b77084af	utils: add const_assert! macro	2022-05-21 05:25:17 +02:00
Egor Suvorov	c9efdec8db	postgres_ffi: find_end_of_wal_segment: improve name of wal_crc variable Now it reflects the field it's mirroring.	2022-05-21 05:25:17 +02:00
Egor Suvorov	12b7c793b3	postgres_ffi: find_end_of_wal_segment: remove redundant CRC operations Previous invariant: `crc` contains an "unfinalized" CRC32 value, its one complement, like in postgres before FIN_CRC32C. New invariant: `crc` always contains a "finalized" CRC32 value, this is the semantics of crc32c_append, so we don't need to invert CRC manually.	2022-05-21 05:25:17 +02:00
Egor Suvorov	3c6890bf1d	postgres_ffi: add complex WAL tests for find_end_of_wal * Actual generation logic is in a separate crate `postgres_ffi/wal_generate` * The create also provides a binary for debug purposes akin to `initdb` * Two tests currently fail and are ignored * There is no easy way to test this directly in Safekeeper as it starts restoring from commit_lsn. So testing would require disconnecting Safekeeper just after it has received the WAL, but before it is committed.	2022-05-21 05:25:17 +02:00
Andrey Taranik	d97617ed3a	updated proxy and proxy scram deployment for prod and stress environments (#1758 )	2022-05-20 23:12:30 +03:00
KlimentSerafimov	65cf1a3221	Added paths to openssl includes and libraries for OSX because make complained that it couldn't find them. (#1761 )	2022-05-20 12:02:51 -04:00
bojanserafimov	a4aef5d8dc	Compile psql with openssl (#1725 )	2022-05-19 12:25:31 -04:00
Heikki Linnakangas	ffbb9dd155	Add a 5 minute timeout to python tests. The CI times out after 10 minutes of no output. It's annoying if a test hangs and is killed by the CI timeout, because you don't get information about which test was running. Try to avoid that, by adding a slightly smaller timeout in pytest itself. You can override it on a per-test basis if needed, but let's try to keep our tests shorter than that. For the Postgres regression tests, use a longer 30 minute timeout. They're not really a single test, but many tests wrapped in a single pytest test. It's OK for them to run longer in aggregate, each Postgres test is still fairly short.	2022-05-19 14:04:14 +03:00
Egor Suvorov	baf7a81dce	git-upload: pass committer to 'git rebase' (fix #1749 ) (#1750 ) No committer was specified, which resulted in failing `git rebase` if the branch is not up-to-date.	2022-05-19 14:01:03 +03:00
Heikki Linnakangas	ee3bcf108d	Fix compact_level0 for delta layers with overlap or gaps We saw a case in staging, where there was a gap in the LSN ranges of level 0 files, like this: 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001696070-00000000016960E9 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016960E9-00000000016E4DB9 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000016E4DB9-000000000BFCE3E1 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000000BFCE3E1-000000000BFD0FE9 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000060045901-000000007005EAC1 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000007005EAC1-0000000080062E99 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000080062E99-000000009007F481 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__000000009007F481-00000000A009F7C9 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000A009F7C9-00000000AA284EB9 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000AA286471-00000000AA2886B9 Note that gap between 000000000BFD0FE9 and 0000000060045901. I don't know how that happened, but in general the pageserver should be robust if there are gaps like that, or overlapping files etc. In theory they could happen as result of crashes, partial downloads from S3 etc., although it is mystery what caused it in this case. Looking at the compaction code, it was not safe in the face of gaps like that. The compaction routine collected all the level 0 files, and took their min(start)..max(end) as the range of the new files it builds. That's wrong, if the level 0 files don't cover the whole LSN range; the newly created files will miss any records in the gap. Fix that, by only collecting contiguous sequences of level 0 files, so that the end LSN of previous delta file is equal to the start of the next one. Fixes issue #1730	2022-05-19 10:19:38 +03:00
Heikki Linnakangas	0da4046704	Include traversal path in error message. Previously, the path was printed to the log with separate error!() calls. It's better to include the whole path in the error object and have it printed to the log as one message. Also print the path in the ValueReconstructResult::Missing case. This is what it looks like now: 2022-05-17T21:53:53.611801Z ERROR pagestream{timeline=5adcb4af3e95f00a31550d266aab7a37 tenant=74d9f9ad3293c030c6a6e196dd91c60f}: error reading relation or page version: could not find data for key 000000067F000032BE000000000000000001 at LSN 0/1698C48, for request at LSN 0/1698CF8 Caused by: 0: layer traversal: result Complete, cont_lsn 0/1698C48, layer: 000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__0000000001698C48-0000000001698CC1 1: layer traversal: result Continue, cont_lsn 0/1698CC1, layer: inmem-0000000001698CC1-FFFFFFFFFFFFFFFF Stack backtrace:	2022-05-19 10:19:38 +03:00
Anastasia Lubennikova	cbd00d7ed9	Remove temp layer files during timeline initialization on pageserver start	2022-05-19 10:11:12 +03:00
Anastasia Lubennikova	4c30ae8ba3	Add random string as a part of tempfile name	2022-05-19 10:11:12 +03:00
Anastasia Lubennikova	3da4b3165e	Fsync layer files before rename	2022-05-19 10:11:12 +03:00
Anastasia Lubennikova	c1b365fdf7	Use temp filename while writing ImageLayer file	2022-05-19 10:11:12 +03:00
Egor Suvorov	fab104d5f3	docs/sourcetree: add note about exact Python version used and how to choose it	2022-05-19 00:09:13 +02:00
Egor Suvorov	7dd27ecd20	Bump minimal supported Python version to 3.9 Most of the CI already run with Python 3.9 since https://github.com/neondatabase/docker-images/pull/1	2022-05-19 00:09:13 +02:00
Egor Suvorov	bd2979d02c	CirleCI/check-codestyle-python: print versions	2022-05-19 00:09:13 +02:00
Dmitry Rodionov	5914aab78a	add comments, use expect instead of unwrap	2022-05-19 00:54:14 +03:00
Heikki Linnakangas	4a36d89247	Avoid spawning a layer-flush thread when there's no work to do. The check_checkpoint_distance() always spawned a new thread, even if there is no frozen layer to flush. That was a thinko, as @knizhnik pointed out.	2022-05-19 00:51:48 +03:00
Egor Suvorov	432907ff5f	Safekeeper: avoid holding mutex when deleting a tenant (#1746 ) Following discussion with @arssher after #1653	2022-05-18 23:02:17 +03:00
Arthur Petukhovsky	98da0aa159	Add _total suffix to metrics name (#1741 )	2022-05-18 15:17:04 +03:00
Alexey Kondratov	772c2fb4ff	Report startup metrics and failure reason from compute_ctl (#1581 ) + neondatabase/cloud#1103 This adds a couple of control endpoints to simplify compute state discovery for control-plane. For example, now we may figure out that Postgres wasn't able to start or basebackup failed within seconds instead of just blindly polling the compute readiness for a minute or two. Also we now expose startup metrics (time of the each step: basebackup, sync safekeepers, config, total). Console grabs them after each successful start and report as histogram to prometheus and grafana. OpenAPI spec is added and up-tp date, but is not currently used in the console yet.	2022-05-18 13:03:29 +04:00
Andrey Taranik	b9f84f4a83	trun on storage deployment to neon-stress enviroment (#1729 )	2022-05-17 23:04:04 +03:00
Arthur Petukhovsky	134eeeb096	Add more common storage metrics (#1722 ) - Enabled process exporter for storage services - Changed zenith_proxy prefix to just proxy - Removed old `monitoring` directory - Removed common prefix for metrics, now our common metrics have `libmetrics_` prefix, for example `libmetrics_serve_metrics_count` - Added `test_metrics_normal_work`	2022-05-17 19:29:01 +03:00
Heikki Linnakangas	55ea3f262e	Fix race condition leading to panic in remote storage sync thread. The SyncQueue consisted of a tokio mpsc channel, and an atomic counter to keep track of how many items there are in the channel. Updating the atomic counter was racy, and sometimes the consumer would decrement the counter before the producer had incremented it, leading to integer wraparound to usize::MAX. Calling Vec::with_capacity(usize::MAX) leads to a panic. To fix, replace the channel with a VecDeque protected by a Mutex, and a condition variable for signaling. Now that the queue is now protected by standard blocking Mutex and Condvar, refactor the functions touching it to be sync, not async. A theoretical downside of this is that the calls to push items to the queue and the storage sync thread that drains the queue might now need to wait, if another thread is busy manipulating the queue. I believe that's OK; the lock isn't held for very long, and these operations are made in background threads, not in the hot GetPage@LSN path, so they're not very latency-sensitive. Fixes #1719. Also add a test case.	2022-05-17 18:14:57 +03:00
Heikki Linnakangas	f03779bf1a	Fix wait_for_last_record_lsn() and wait_for_upload() python functions. The contract for wait_for() was not very clear. It waits until the given function returns successfully, without an exception, but the wait_for_last_record_lsn() and wait_for_upload() functions used "a < b" as the condition, i.e. they thought that wait_for() would poll until the function returns true. Inline the logic from wait_for() into those two functions, it's not that complicated, and you get a more specific error message too, if it fails. Also add a comment to wait_for() to make it more clear how it works. Also change remote_consistent_lsn() to return 0 instead of raising an exception, if remote is None. That can happen if nothing has been uploaded to remote storage for the timeline yet. It happened once in the CI, and I was able to reproduce that locally too by adding a sleep to the storage sync thread, to delay the first upload.	2022-05-17 18:14:10 +03:00
Andrey Taranik	070c255522	Neon stress deploy (#1720 ) * storage and proxy deployment for neon stress environment * neon stress inventory fix	2022-05-17 18:03:01 +03:00
Heikki Linnakangas	9ccbb8d331	Make "neon_local stop" less verbose. I got annoyed by all the noise in CI test output. Before: $ ./target/release/neon_local stop Stop pageserver gracefully Pageserver still receives connections Pageserver stopped receiving connections Pageserver status is: Reqwest error: error sending request for url (http://127.0.0.1:9898/v1/status): error trying to connect: tcp connect error: Connection refused (os error 111) initializing for sk 1 for 7676 Stop safekeeper gracefully Safekeeper still receives connections Safekeeper stopped receiving connections Safekeeper status is: Reqwest error: error sending request for url (http://127.0.0.1:7676/v1/status): error trying to connect: tcp connect error: Connection refused (os error 111) After: $ ./target/release/neon_local stop Stopping pageserver gracefully...done! Stopping safekeeper 1 gracefully...done! Also removes the spurious "initializing for sk 1 for 7676" message from "neon_local start"	2022-05-17 10:31:13 +03:00
Kirill Bulatov	f2881bbd8a	Start and stop single etcd and mock s3 servers globally in python tests	2022-05-17 01:17:44 +03:00
Kirill Bulatov	a884f4cf6b	Add etcd to neon_local	2022-05-17 01:17:44 +03:00
Kirill Bulatov	9a0fed0880	Enable at least 1 safekeeper in every test	2022-05-17 01:17:44 +03:00
chaitanya sharma	bea84150b2	Fix the markdown rendering on 004-durability.md RFC	2022-05-17 00:16:42 +03:00
chaitanya sharma	85b5c0e989	List profiling as a feature with 'pageserver --enabled-features' Fixes https://github.com/neondatabase/neon/issues/1627	2022-05-16 21:10:57 +03:00
Thang Pham	e4a70faa08	Add more information to timeline-related APIs (#1673 ) Resolves #1488. - implemented `GET tenant/:tenant_id/timeline/:timeline_id/wal_receiver` endpoint - returned `thread_id` in `thread_mgr::spawn` - added `latest_gc_cutoff_lsn` field to `LocalTimelineInfo` struct	2022-05-16 11:05:43 -04:00
chaitanya sharma	c41549f630	Update readme build for osx (#1709 )	2022-05-16 10:42:08 -04:00
Heikki Linnakangas	c700032dd2	Run the regression tests in CI also for PRs opened from forked repos.	2022-05-16 14:40:49 +03:00
Kirill Bulatov	33cac863d7	Test simple.conf and handle broker_endpoints better	2022-05-16 12:07:35 +03:00
Heikki Linnakangas	51ea9c3053	Don't swallow panics when the pageserver is build with failpoints. It's very confusing, and because you don't get a stack trace and error message in the logs, makes debugging very hard. However, the 'test_pageserver_recovery' test relied on that behavior. To support that, add a new "exit" action to the pageserver 'failpoints' command, so that you can explicitly request to exit the process when a failpoint is hit.	2022-05-16 09:58:58 +03:00
Heikki Linnakangas	a10cac980f	Continue with pageserver startup, if loading some tenants fail. Fixes https://github.com/neondatabase/neon/issues/1664	2022-05-15 00:25:38 +03:00
Heikki Linnakangas	081d5dac5e	Bump vendor/postgres. Includes change to reduce log noise from inmem_smgr.	2022-05-13 21:41:00 +03:00
Andrey Taranik	cded72a580	remove sk-2 from staging inventory list (#1699 )	2022-05-13 20:41:54 +03:00
Egor Suvorov	768c846eeb	Fix test_delete_force from #1653 conflicting with #1692	2022-05-13 17:36:18 +02:00
Anastasia Lubennikova	a2561f0a78	Use tenant's pitr_interval instead of hardroded 0 in the command. Adjust python tests that use the	2022-05-13 18:32:14 +03:00
Anastasia Lubennikova	aa7c601eca	Fix pitr_interval check in GC: Use timestamp->LSN mapping instead of file modification time. Fix 'latest_gc_cutoff_lsn' - set it to the minimum of pitr_cutoff and gc_cutoff. Add new test: test_pitr_gc	2022-05-13 18:32:14 +03:00

1 2 3 4 5 ...

1627 Commits