ClickHouse Database Explained: High-Performance Columnar Storage and Real-Time Analytics

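"This document is the PDF version of the ClickHouse documentation, published by Yandex LLC and dated May 19, 2017. It describes ClickHouse's features, use cases, and performance in detail, and is intended for readers who want to learn about ClickHouse."

ClickHouse is a high-performance, column-oriented database management system (DBMS) designed mainly for online analytical processing (OLAP) and big-data scenarios. Its main features are described below:

1. **Column-oriented storage**: Unlike traditional row-oriented storage, ClickHouse stores data by column. This is efficient for analytical queries, because most of them touch only a few columns.
2. **Data compression**: ClickHouse compresses data on disk, which reduces storage requirements and speeds up reads.
3. **Disk storage**: ClickHouse stores data on disk, using a data layout chosen to optimize disk I/O and speed up reads.
4. **Multi-core parallel processing**: ClickHouse can run a query in parallel on multiple CPU cores, which shortens query execution time.
5. **Distributed processing**: ClickHouse can distribute processing across multiple servers to handle large data volumes and highly concurrent queries.
6. **SQL support**: ClickHouse supports the SQL query language, so users can work with familiar syntax.
7. **Vector engine**: ClickHouse processes data in vectors (chunks of column values) rather than one value at a time, which speeds up computation.
8. **Real-time data updates**: Although ClickHouse was originally designed as a static analysis system, it can also support real-time data updates, which suits real-time analysis.
9. **Indexes**: Although the core design of ClickHouse does not rely on traditional indexes, it provides some special index types that speed up particular kinds of queries.
10. **Suitable for online queries**: ClickHouse is suited not only to batch analysis but also to online queries, so it fits mixed workloads.
11. **Approximate calculation support**: For big-data analysis, ClickHouse allows approximate calculations, which improve query speed while keeping accuracy relatively high.
12. **Data replication and integrity**: ClickHouse supports data replication and keeps data consistent across replicas, which improves availability and fault tolerance.

Potential drawbacks of ClickHouse include limited support for transactions, unsuitability for online transaction processing (OLTP), and the possible need for complex configuration to tune performance. It was originally designed to solve the massive data-analysis problem of Yandex.Metrica, a service that collects and analyzes website statistics. Yandex.Metrica uses ClickHouse to process aggregated and non-aggregated data, and various other Yandex services use it as well. ClickHouse performs very well, especially in throughput and latency when handling a single large query or a large number of short queries. These properties make ClickHouse a powerful tool for big-data analysis.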

ClickHouse Documentation, Release

JavaEnable: 1 0 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 1

Title: Yandex Announcements - Investor Relations - Yandex Yandex -- Contact us -- Moscow Yandex -- Mission Ru Yandex -- History -- History of Yandex Yandex Financial Releases - Investor Relations - Yandex Yandex -- Locations Yandex Board of Directors - Corporate Governance - Yandex Yandex -- Technologies

GoodEvent: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

EventTime: 2016-05-18 05:19:20 2016-05-18 08:10:20 2016-05-18 07:38:00 2016-05-18 01:13:08 2016-05-18 00:04:06 2016-05-18 04:21:30 2016-05-18 00:34:16 2016-05-18 07:35:49 2016-05-18 11:41:59 2016-05-18 01:13:32

These examples only show the order that data is arranged in. The values from different columns are stored separately, and data from the same column is stored together. Examples of a column-oriented DBMS: Vertica, Paraccel (Actian Matrix) (Amazon Redshift), Sybase IQ, Exasol, Infobright, InfiniDB, MonetDB (VectorWise) (Actian Vector), LucidDB, SAP HANA, Google Dremel, Google PowerDrill, Druid, kdb+, and so on.
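The difference between the two storage orders can be sketched in a few lines of Python. This is only an illustration: the three-row table and the RowID column are made up, and plain lists stand in for storage on disk.

```python
# Hypothetical three-row table. JavaEnable and GoodEvent echo the column
# names in the example above; RowID and all values are invented.
rows = [
    {"RowID": 1, "JavaEnable": 1, "GoodEvent": 1},
    {"RowID": 2, "JavaEnable": 0, "GoodEvent": 1},
    {"RowID": 3, "JavaEnable": 1, "GoodEvent": 1},
]

# Row-oriented order: all values of one row sit next to each other.
row_layout = [value for row in rows for value in row.values()]

# Column-oriented order: all values of one column sit next to each other.
columns = {name: [row[name] for row in rows] for name in rows[0]}
column_layout = [value for column in columns.values() for value in column]

print(row_layout)     # [1, 1, 1, 2, 0, 1, 3, 1, 1]
print(column_layout)  # [1, 2, 3, 1, 0, 1, 1, 1, 1]

# An aggregate over one column only touches that column's values.
java_enabled = sum(columns["JavaEnable"])
print(java_enabled)   # 2
```

In the column-oriented order, a query that needs only JavaEnable reads one contiguous run of values and skips the rest of the table.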

Different orders for storing data are better suited to different scenarios. The data access scenario refers to what queries are made, how often, and in what proportion; how much data is read for each type of query - rows, columns, and bytes; the relationship between reading and updating data; the working size of the data and how locally it is used; whether transactions are used, and how isolated they are; requirements for data replication and logical integrity; requirements for latency and throughput for each type of query, and so on.

The higher the load on the system, the more important it is to customize the system to the scenario, and the more specific this customization becomes. There is no system that is equally well-suited to significantly different scenarios. If a system is adaptable to a wide set of scenarios, under a high load, the system will handle all the scenarios equally poorly, or will work well for just one of the scenarios.

We’ll say that the following is true for the OLAP (online analytical processing) scenario:

• The vast majority of requests are for read access.
• Data is updated in fairly large batches (> 1000 rows), not by single rows; or it is not updated at all.
• Data is added to the DB but is not modified.
• For reads, quite a large number of rows are extracted from the DB, but only a small subset of columns.
• Tables are “wide,” meaning they contain a large number of columns.
• Queries are relatively rare (usually hundreds of queries per second per server, or fewer).
• For simple queries, latencies around 50 ms are allowed.
• Column values are fairly small - numbers and short strings (for example, 60 bytes per URL).
• High throughput is required when processing a single query (up to billions of rows per second per server).
• There are no transactions.
• Low requirements for data consistency.
• There is one large table per query. All tables are small, except for one.
• A query result is significantly smaller than the source data. That is, data is filtered or aggregated. The result fits in a single server’s RAM.

It is easy to see that the OLAP scenario is very different from other popular scenarios (such as OLTP or Key-Value access). So it doesn’t make sense to try to use OLTP or a Key-Value DB for processing analytical queries if you want to get decent performance. For example, if you try to use MongoDB or Elliptics for analytics, you will get very poor performance compared to OLAP databases.

Column-oriented databases are better suited to OLAP scenarios (at least 100 times better in processing speed for most queries), for the following reasons:

1. For I/O.
1.1. For an analytical query, only a small number of table columns need to be read. In a column-oriented database, you can read just the data you need. For example, if you need 5 columns out of 100, you can expect a 20-fold reduction in I/O.
1.2. Since data is read in packets, it is easier to compress. Data in columns is also easier to compress. This further reduces the I/O volume.
1.3. Due to the reduced I/O, more data fits in the system cache.

For example, the query “count the number of records for each advertising platform” requires reading one “advertising platform ID” column, which takes up 1 byte uncompressed. If most of the traffic was not from advertising platforms, you can expect at least 10-fold compression of this column. When using a quick compression algorithm, data decompression is possible at a speed of at least several gigabytes of uncompressed data per second. In other words, this query can be processed at a speed of approximately several billion rows per second on a single server. This speed is actually achieved in practice.
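The reasoning above can be tried out with a rough Python sketch. Everything in it is an assumption made for illustration: the column contents are synthetic and artificially regular, zlib stands in for the unnamed "quick compression algorithm", and the 3 GB/s decompression figure is a placeholder for "several gigabytes".

```python
import zlib

# Hypothetical column: one byte per row, 1,000,000 rows. Every 20th row
# carries a non-zero advertising platform ID (1..9); all other rows hold 0,
# meaning "traffic did not come from an advertising platform".
ROWS = 1_000_000
column = bytearray(ROWS)
for i in range(0, ROWS, 20):
    column[i] = 1 + (i // 20) % 9

raw = bytes(column)
compressed = zlib.compress(raw)  # zlib is a stand-in, not the real codec
ratio = len(raw) / len(compressed)
print(f"{len(raw)} bytes -> {len(compressed)} bytes, ratio {ratio:.1f}x")

# Back-of-the-envelope throughput: at 1 byte per row, decompressing N GB of
# uncompressed data per second corresponds to N billion rows per second.
bytes_per_row = 1
decompress_gb_per_s = 3  # assumed value for "several gigabytes"
rows_per_second = decompress_gb_per_s * 10**9 // bytes_per_row
print(f"{rows_per_second:,} rows/s")
```

Real traffic is less regular than this synthetic column, so the measured ratio here says little about production data; the sketch only shows why a mostly constant 1-byte column compresses well and how the rows-per-second estimate follows from the decompression speed.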

Example:

milovidov@.yandex.ru:~$ clickhouse-client
ClickHouse client version 0.0.52053.
Connecting to localhost:9000.
Connected to ClickHouse server version 0.0.52053.

:) SELECT CounterID, count() FROM hits GROUP BY CounterID ORDER BY count() DESC LIMIT 20

SELECT
    CounterID,
    count()
FROM hits
GROUP BY CounterID
ORDER BY count() DESC
LIMIT 20

┌─CounterID─┬──count()─┐
│    114208 │ 56057344 │
│    115080 │ 51619590 │
│      3228 │ 44658301 │
│     38230 │ 42045932 │
│    145263 │ 42042158 │
│     91244 │ 38297270 │
│    154139 │ 26647572 │
│    150748 │ 24112755 │
│    242232 │ 21302571 │
│    338158 │ 13507087 │
│     62180 │ 12229491 │
│     82264 │ 12187441 │
│    232261 │ 12148031 │
│    146272 │ 11438516 │
│    168777 │ 11403636 │
│   4120072 │ 11227824 │
│  10938808 │ 10519739 │
│     74088 │  9047015 │
│    115079 │  8837972 │
│    337234 │  8205961 │
└───────────┴──────────┘

20 rows in set. Elapsed: 0.153 sec. Processed 1.00 billion rows, 4.00 GB (6.53 billion rows/s., 26.10 GB/s.)
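As a sanity check, the throughput figures in the last line of the session can be recomputed from the other numbers on that line. The short Python calculation below uses only the rounded values printed by the client, so small differences in the last digit are expected.

```python
# Figures copied from the client output above (all are rounded there).
rows_processed = 1.00e9   # "Processed 1.00 billion rows"
bytes_processed = 4.00e9  # "4.00 GB"
elapsed_s = 0.153         # "Elapsed: 0.153 sec."

rows_per_s = rows_processed / elapsed_s
gb_per_s = bytes_processed / elapsed_s / 1e9
bytes_per_row = bytes_processed / rows_processed

print(f"{rows_per_s / 1e9:.2f} billion rows/s")
print(f"{gb_per_s:.2f} GB/s")
print(f"{bytes_per_row:.0f} bytes per row")
```

The results agree with the reported 6.53 billion rows/s and 26.10 GB/s to within rounding. The ratio of 4 bytes per row would be consistent with the query reading a single 4-byte integer column, although the session itself does not state the column type.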

2. For CPU. Since executing a query requires processing a large number of rows, it helps to dispatch all operations for entire vectors instead of for separate rows, or to implement the query engine so that there is almost no dispatching cost. If you don’t do this, with any half-decent disk subsystem, the query interpreter inevitably stalls the CPU. It makes sense to both store data in columns and process it, when possible, by columns.

There are two ways to do this:

1. A vector engine. All operations are written for vectors, instead of for separate values. This means you don’t need to call operations very often, and dispatching costs are negligible. Operation code contains an optimized inner loop.
2. Code generation. The code generated for the query has all the indirect calls in it.

This is not done in “normal” databases, because it doesn’t make sense when running simple queries. However, there are exceptions. For example, MemSQL uses code generation to reduce latency when processing SQL queries. (For comparison, analytical DBMSs require optimization of throughput, not latency.)

Note that for CPU efficiency, the query language must be declarative (SQL or MDX), or at least a vector language (J, K). The query should only contain implicit loops, allowing for optimization.
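The dispatching cost that a vector engine avoids can be illustrated with a small Python sketch. It is not how ClickHouse is implemented: the operation tables, function names, and call counter are all hypothetical, and in pure Python the inner loop is still interpreted, so only the number of dispatches is meaningful here.

```python
from array import array
import operator

calls = {"scalar": 0, "vector": 0}  # counts how often an operation is dispatched

SCALAR_OPS = {"add": operator.add}

def scalar_dispatch(op, x, y):
    """Row-at-a-time style: look up and call the operation for every value."""
    calls["scalar"] += 1
    return SCALAR_OPS[op](x, y)

def vector_add(a, b):
    """Vector operation: one call, with the loop over values inside it."""
    return array("q", (x + y for x, y in zip(a, b)))

VECTOR_OPS = {"add": vector_add}

def vector_dispatch(op, a, b):
    """Vector-engine style: look up and call the operation once per vector."""
    calls["vector"] += 1
    return VECTOR_OPS[op](a, b)

a = array("q", range(10_000))
b = array("q", range(10_000))

per_value = array("q", (scalar_dispatch("add", x, y) for x, y in zip(a, b)))
per_vector = vector_dispatch("add", a, b)

print(per_value == per_vector)  # True
print(calls)                    # {'scalar': 10000, 'vector': 1}
```

Both paths produce the same result, but the per-value path pays the lookup and call overhead ten thousand times, while the vector path pays it once and spends the rest of its time in the inner loop.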

Distinctive features of ClickHouse

1. True column-oriented DBMS.

In a true column-oriented DBMS, there isn’t any “garbage” stored with the values. For example, constant-length values must be supported, to avoid storing their length “number” next to the values. As an example, a billion UInt8-type values should actually consume around 1 GB uncompressed, or this will strongly affect the CPU use. It is very important to store data compactly (without any “garbage”) even when uncompressed, since the speed of decompression (CPU usage) depends mainly on the volume of uncompressed data.
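A scaled-down Python sketch makes the size argument concrete. It uses one million hypothetical UInt8 values instead of a billion, and the length-prefixed layout is an invented example of "garbage", not a format used by any particular system.

```python
import struct
from array import array

values = [7, 0, 255, 42] * 250_000  # one million hypothetical UInt8 values

# Fixed-width layout: exactly one byte per value and nothing else.
compact = array("B", values).tobytes()

# Layout with "garbage": every value is preceded by a 1-byte length field.
with_length = b"".join(struct.pack("BB", 1, v) for v in values)

print(len(compact))      # 1000000
print(len(with_length))  # 2000000
```

At one byte per value, a billion values scale to roughly 1 GB, as the text states; the length prefix alone would double the volume that has to be decompressed and scanned.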

This is worth noting because there are systems that can store values of separate columns separately, but that can’t effectively process analytical queries due to their optimization for other scenarios. Examples are HBase, BigTable, Cassandra, and HyperTable. In these systems, you will get throughput around a hundred thousand rows per second, but not hundreds of millions of rows per second.

Also note that ClickHouse is a DBMS, not a single database. ClickHouse allows creating tables and databases at runtime, loading data, and running queries without reconfiguring and restarting the server.

2. Data compression.

Some column-oriented DBMSs (InfiniDB CE and MonetDB) do not use data compression. However, data compression really improves performance.

3. Disk storage of data.

Many column-oriented DBMSs (such as SAP HANA and Google PowerDrill) can only work in RAM. But even on thousands of servers, the RAM is too small for storing all the pageviews and sessions in Yandex.Metrica.

4. Parallel processing on multiple cores.

Large queries are parallelized in a natural way.
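One common way to parallelize an aggregation is to compute partial results on separate chunks of the data and then merge them. The Python sketch below shows that pattern for a count() grouped by key; it is an assumption about the general approach, not a description of ClickHouse internals, and all function names and sample values are hypothetical. Python threads are used only to show the structure, since the interpreter lock prevents a real multi-core speedup for this kind of work.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_chunk(chunk):
    """Partial aggregation: count() grouped by key for one chunk of rows."""
    return Counter(chunk)

def parallel_group_count(keys, workers=4):
    """Split the keys into chunks, aggregate each chunk, merge the partials."""
    size = max(1, (len(keys) + workers - 1) // workers)
    chunks = [keys[i:i + size] for i in range(0, len(keys), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_chunk, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds the partial counts key by key
    return total

# Hypothetical CounterID column with 6,000 rows.
counter_ids = [3228, 38230, 3228, 114208, 3228, 38230] * 1000
result = parallel_group_count(counter_ids)
print(result.most_common(2))  # [(3228, 3000), (38230, 2000)]
```

Because counting is associative, the merged result is the same no matter how the rows are split across workers.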
