# Log benchmark configuration
This repo holds the configuration we used to benchmark GreptimeDB, ClickHouse, and Elasticsearch.

Here are the versions of the databases used in the benchmark:
| Name | Version |
|---|---|
| GreptimeDB | v0.9.2 |
| ClickHouse | 24.9.1.219 |
| Elasticsearch | 8.15.0 |
## Structured model vs. unstructured model
We divide the test into two parts, using the structured model and the unstructured model respectively. You can also see the difference in the `CREATE TABLE` clauses.
### Structured model
The log data is pre-processed into columns by Vector. For example, an insert request looks like the following (the values shown are illustrative):

```sql
INSERT INTO test_table (bytes, http_version, ip, method, path, status, user, timestamp)
VALUES (1024, '1.1', '192.168.1.1', 'GET', '/api/v1/users', 200, 'alice', '2024-08-01 00:00:00');
```
The goal is to test the string/text support of each database. In real scenarios, this means the data source (or log producer) has separate fields defined, or has already processed the raw input.
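To make the model concrete, here is a minimal sketch of what such a table could look like in GreptimeDB's SQL dialect. The column types and the table name are assumptions for illustration; the actual `CREATE TABLE` clauses used in the benchmark are the ones linked under "Creating tables" below.

```sql
-- Hypothetical sketch only; see the repo's create-table clauses for the real schema.
CREATE TABLE IF NOT EXISTS test_table (
  bytes        BIGINT,
  http_version STRING,
  ip           STRING,
  method       STRING,
  path         STRING,
  status       SMALLINT,
  user         STRING,
  timestamp    TIMESTAMP TIME INDEX  -- GreptimeDB requires a TIME INDEX column
);
```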
### Unstructured model
The log data is inserted as one long string, and we then build a full-text index on these strings. For example, an insert request looks like the following (the value shown is illustrative):

```sql
INSERT INTO test_table (message, timestamp)
VALUES ('192.168.1.1 - alice [01/Aug/2024:00:00:00 +0000] "GET /api/v1/users HTTP/1.1" 200 1024', '2024-08-01 00:00:00');
```
The goal is to test the fuzzy search performance of each database. In real scenarios, this means the log is produced by some middleware and inserted directly into the database.
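A minimal sketch of an unstructured-model table in GreptimeDB's dialect, assuming the `FULLTEXT` column option available as of v0.9; again, the real clauses are the ones linked below.

```sql
-- Hypothetical sketch; the FULLTEXT option builds a full-text index on the column.
CREATE TABLE IF NOT EXISTS test_table (
  message   STRING FULLTEXT,
  timestamp TIMESTAMP TIME INDEX
);
```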
## Creating tables
See here for the `CREATE TABLE` clauses of GreptimeDB and ClickHouse. The mapping for Elasticsearch is created automatically.
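For comparison, here is a sketch of what the unstructured table could look like in ClickHouse's dialect; the engine choice and ordering key are assumptions, not necessarily what the benchmark uses.

```sql
-- Hypothetical ClickHouse counterpart; MergeTree and the ORDER BY key are assumptions.
CREATE TABLE IF NOT EXISTS test_table (
  message   String,
  timestamp DateTime
)
ENGINE = MergeTree
ORDER BY timestamp;
```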
## Vector configuration
We use Vector to generate random log data and send insert requests to the databases. Please refer to the structured config and the unstructured config for the detailed configuration.
## SQLs and payloads
Please refer to the SQL queries for GreptimeDB and ClickHouse, and the query payloads for Elasticsearch.
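To give a flavor of the fuzzy-search side, here is a hypothetical full-text query in GreptimeDB's dialect, assuming the `MATCHES` function that accompanies the full-text index; the actual benchmark queries are the ones linked above.

```sql
-- Hypothetical query; the time boundaries are placeholders (see "Steps to reproduce").
SELECT count(*)
FROM test_table
WHERE MATCHES(message, 'GET')
  AND timestamp BETWEEN '2024-08-01 00:00:00' AND '2024-08-02 00:00:00';
```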
## Steps to reproduce
- Decide whether to run the structured model test or the unstructured model test.
- Build the Vector binary (see Vector's config file for the specific branch) and the database binaries accordingly.
- Create the tables in GreptimeDB and ClickHouse in advance.
- Run Vector to insert data.
- When data insertion is finished, run the queries against each database. Note: you'll need to update the time-range values after data insertion, as sketched below.
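A minimal sketch of that last step, using the table and column names from the examples above: first find the actual time range of the ingested data, then substitute those boundaries into the benchmark queries.

```sql
-- Find the real time range of the ingested data ...
SELECT min(timestamp), max(timestamp) FROM test_table;
-- ... then plug those boundaries into each query's time filter, e.g.:
-- WHERE timestamp BETWEEN '<min>' AND '<max>'
```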
## Additional notes
- You can tune GreptimeDB's configuration for better performance.
- You can set up GreptimeDB to use S3 as its storage backend; see here.