2018/9/26 ClickHouse Documentation - ClickHouse Documentation
https://clickhouse.yandex/docs/en/single/ 1/579
What is ClickHouse?
ClickHouse is a column-oriented database management system (DBMS) for
online analytical processing of queries (OLAP).
In a "normal" row-oriented DBMS, data is stored in this order:

Row  WatchID      JavaEnable  Title               GoodEvent
#0   89354350662  1           Investor Relations  1
#1   90329509958  0           Contact us          1
#2   89953706054  1           Mission             1
#N   ...          ...         ...                 ...

In other words, all the values related to a row are physically stored next to
each other.
Examples of a row-oriented DBMS are MySQL, Postgres, and MS SQL Server.
In a column-oriented DBMS, data is stored like this:

Row:         #0                   #1                   #2                   #N
WatchID:     89354350662          90329509958          89953706054          ...
JavaEnable:  1                    0                    1                    ...
Title:       Investor Relations   Contact us           Mission              ...
GoodEvent:   1                    1                    1                    ...
EventTime:   2016-05-18 05:19:20  2016-05-18 08:10:20  2016-05-18 07:38:00  ...

These examples only show the order that data is arranged in. The values from
different columns are stored separately, and data from the same column is
stored together.
Examples of a column-oriented DBMS: Vertica, ParAccel (Actian Matrix and
Amazon Redshift), Sybase IQ, Exasol, Infobright, InfiniDB, MonetDB (VectorWise
and Actian Vector), LucidDB, SAP HANA, Google Dremel, Google PowerDrill,
Druid, and kdb+.
Different orders for storing data are better suited to different scenarios. The
data access scenario refers to what queries are made, how often, and in what
proportion; how much data is read for each type of query (rows, columns, and
bytes); the relationship between reading and updating data; the working size of
the data and how locally it is used; whether transactions are used, and how
isolated they are; requirements for data replication and logical integrity;
requirements for latency and throughput for each type of query, and so on.
The higher the load on the system, the more important it is to customize the
system setup to match the requirements of the usage scenario, and the more
fine-grained this customization becomes. There is no system that is equally
well-suited to significantly different scenarios. If a system is adaptable to a
wide set of scenarios, then under a high load it will handle all the scenarios
equally poorly, or will work well for just one or a few of the possible
scenarios.
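As a rough illustration of the two storage orders shown in the tables above, the following Python sketch lays out the same three example rows both ways. The variable names (`rows`, `row_store`, `column_store`) are hypothetical and chosen only for this illustration; real storage engines work with files and blocks, not Python lists.

```python
# Example rows taken from the tables above (EventTime omitted for brevity).
column_names = ("WatchID", "JavaEnable", "Title", "GoodEvent")
rows = [
    (89354350662, 1, "Investor Relations", 1),
    (90329509958, 0, "Contact us", 1),
    (89953706054, 1, "Mission", 1),
]

# Row-oriented order: all values of one row are adjacent.
row_store = [value for row in rows for value in row]

# Column-oriented order: all values of one column are adjacent.
column_store = {
    name: [row[index] for row in rows]
    for index, name in enumerate(column_names)
}

# A query over a single column only needs to touch that column's values.
java_enabled_count = sum(column_store["JavaEnable"])
print(java_enabled_count)
```

In the row-oriented list, answering the same question would mean stepping over every WatchID, Title, and GoodEvent value along the way.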
Key Properties of the OLAP scenario
- The vast majority of requests are for read access.
- Data is updated in fairly large batches (> 1000 rows), not by single rows; or
  it is not updated at all.
- Data is added to the DB but is not modified.
- For reads, quite a large number of rows are extracted from the DB, but only
  a small subset of columns.
- Tables are "wide," meaning they contain a large number of columns.
- Queries are relatively rare (usually hundreds of queries per second per
  server, or fewer).
- For simple queries, latencies around 50 ms are allowed.
- Column values are fairly small: numbers and short strings (for example, 60
  bytes per URL).
- High throughput is required when processing a single query (up to billions
  of rows per second per server).
- Transactions are not necessary.
- Requirements for data consistency are low.
- There is one large table per query. All tables are small, except for one.
- A query result is significantly smaller than the source data. In other words,
  data is filtered or aggregated, so the result fits in a single server's RAM.
It is easy to see that the OLAP scenario is very different from other popular
scenarios (such as OLTP or Key-Value access). So it doesn't make sense to try to
use OLTP or a Key-Value DB for processing analytical queries if you want to get
decent performance. For example, if you try to use MongoDB or Redis for
analytics, you will get very poor performance compared to OLAP databases.
Why Column-Oriented Databases Work Better in the OLAP Scenario
Column-oriented databases are better suited to OLAP scenarios: they are at
least 100 times faster in processing most queries. The reasons are explained in
detail below, but the fact is easier to demonstrate visually:
[Figure: Row-oriented DBMS]
[Figure: Column-oriented DBMS]
See the difference?
Input/output
1. For an analytical query, only a small number of table columns need to be
read. In a column-oriented database, you can read just the data you need.
For example, if you need 5 columns out of 100, you can expect a 20-fold
reduction in I/O.
2. Since data is read in packets, it is easier to compress. Data in columns is also
easier to compress. This further reduces the I/O volume.
3. Due to the reduced I/O, more data fits in the system cache.
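The 20-fold figure in point 1 is simple arithmetic, sketched below in Python. The function name `io_reduction_factor` is hypothetical, and the sketch assumes that every column occupies the same amount of space on disk, which real tables rarely do.

```python
def io_reduction_factor(total_columns: int, columns_needed: int) -> float:
    """Estimate how much less data a column store reads than a row store.

    Assumes all columns are the same size, so a row store reads
    total_columns units of data and a column store reads columns_needed.
    """
    if not 0 < columns_needed <= total_columns:
        raise ValueError("columns_needed must be between 1 and total_columns")
    return total_columns / columns_needed


print(io_reduction_factor(100, 5))  # 20.0
```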
For example, the query "count the number of records for each advertising
platform" requires reading one "advertising platform ID" column, which takes up
1 byte uncompressed. If most of the traffic was not from advertising platforms,
you can expect at least 10-fold compression of this column. When using a quick
compression algorithm, data decompression is possible at a speed of at least
several gigabytes of uncompressed data per second. In other words, this query
can be processed at a speed of approximately several billion rows per second on
a single server. This speed is actually achieved in practice.
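The estimate above can be reproduced with a short back-of-the-envelope calculation. In this Python sketch, the 3 GB/s decompression speed is an assumed stand-in for "several gigabytes per second", and the function names are hypothetical.

```python
def rows_per_second(decompressed_bytes_per_second: float, bytes_per_value: int) -> float:
    """Rows processed per second when decompression speed is the limit."""
    return decompressed_bytes_per_second / bytes_per_value


def compressed_bytes_to_read(row_count: int, bytes_per_value: int, compression_ratio: float) -> float:
    """Bytes that must be read from disk for one column."""
    return row_count * bytes_per_value / compression_ratio


# Assumption: "several gigabytes per second" is taken as 3 GB/s of
# uncompressed output; the column uses 1 byte per value, as in the text.
assumed_decompression_speed = 3_000_000_000
print(rows_per_second(assumed_decompression_speed, 1))

# With 10-fold compression, a billion 1-byte values need about 100 MB of reads.
print(compressed_bytes_to_read(1_000_000_000, 1, 10))
```

Under these assumptions the first figure comes out at three billion rows per second, which matches the "several billion rows per second" order of magnitude stated above.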
CPU
Since executing a query requires processing a large number of rows, it helps to
dispatch all operations for entire vectors instead of for separate rows, or to
implement the query engine so that there is almost no dispatching cost. If you
don't do this, with any half-decent disk subsystem, the query interpreter
inevitably stalls the CPU. It makes sense to both store data in columns and
process it, when possible, by columns.
There are two ways to do this:
1. A vector engine. All operations are written for vectors, instead of for
separate values. This means you don't need to call operations very often,
and dispatching costs are negligible. The operation code contains an
optimized inner loop.
2. Code generation. The code generated for the query has all the indirect calls
in it.
This is not done in "normal" databases, because it doesn't make sense when
running simple queries. However, there are exceptions. For example, MemSQL
uses code generation to reduce latency when processing SQL queries. (For
comparison, analytical DBMSs require optimization of throughput, not latency.)
Note that for CPU efficiency, the query language must be declarative (SQL or
MDX), or at least a vector language (J, K). The query should only contain
implicit loops, allowing for optimization.
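The vector-engine idea described above can be sketched by counting how often an operation has to be looked up and called. This Python example is only an illustration of dispatch cost, not ClickHouse's implementation; the operation tables, function names, and the block size of 1024 are all assumptions made for the sketch.

```python
import operator

# Hypothetical operation tables that an interpreter would look up at run time.
SCALAR_OPERATIONS = {"add": operator.add}
VECTOR_OPERATIONS = {"sum": sum}


def sum_per_value(values, counter):
    """Dispatch one operation per value, as a row-at-a-time interpreter would."""
    total = 0
    for value in values:
        operation = SCALAR_OPERATIONS["add"]
        counter["dispatches"] += 1
        total = operation(total, value)
    return total


def sum_per_block(values, block_size, counter):
    """Dispatch one operation per block; the loop over values runs inside it."""
    total = 0
    for start in range(0, len(values), block_size):
        operation = VECTOR_OPERATIONS["sum"]
        counter["dispatches"] += 1
        total += operation(values[start:start + block_size])
    return total


values = list(range(10_000))
per_value_counter = {"dispatches": 0}
per_block_counter = {"dispatches": 0}
per_value_total = sum_per_value(values, per_value_counter)
per_block_total = sum_per_block(values, 1024, per_block_counter)
print(per_value_counter["dispatches"], per_block_counter["dispatches"])
```

Both functions return the same total, but the block version performs 10 dispatches instead of 10,000, which is the sense in which dispatching costs become negligible.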
Distinctive Features of ClickHouse
True Column-Oriented DBMS
In a true column-oriented DBMS, no extra data is stored with the values. Among
other things, this means that constant-length values must be supported, to
avoid storing their length as a number next to the values. As an example, a
billion UInt8-type values should actually consume around 1 GB uncompressed;
otherwise, CPU usage is strongly affected. It is very important to store data
compactly (without any "garbage") even when uncompressed, since the speed of
decompression (CPU usage) depends mainly on the volume of uncompressed
data.
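To see why constant-length storage matters, the following Python sketch compares packing one-byte values back to back with packing each value behind a one-byte length field. The function names and the length-prefixed layout are hypothetical, chosen only to show the overhead; they do not describe any particular on-disk format.

```python
import struct


def pack_fixed_uint8(values):
    """Store each value in exactly one byte, with nothing in between."""
    return bytes(values)


def pack_length_prefixed(values):
    """Store a one-byte length field in front of every one-byte value."""
    packed = bytearray()
    for value in values:
        payload = struct.pack("B", value)
        packed += struct.pack("B", len(payload)) + payload
    return bytes(packed)


values = [index % 256 for index in range(1000)]
fixed_size = len(pack_fixed_uint8(values))
prefixed_size = len(pack_length_prefixed(values))
print(fixed_size, prefixed_size)
```

The length-prefixed layout doubles the size here, so at the scale of a billion UInt8 values it would take about 2 GB instead of about 1 GB.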