Problem description
Today I was running partitioning tests on Kudu, and the results really confused me.
First, I created two Kudu tables with Presto.
Table one
CREATE TABLE events_one (
    id integer WITH (primary_key = true),
    event_time timestamp,
    score decimal(8,2),
    message varchar
) WITH (
    partition_by_hash_columns = ARRAY['id'],
    partition_by_hash_buckets = 36,
    number_of_replicas = 1
);
Table two (the same columns, partitioned by hash and by range)
CREATE TABLE events_two (
    id integer WITH (primary_key = true),
    event_time timestamp WITH (primary_key = true),
    score decimal(8,2),
    message varchar
) WITH (
    partition_by_hash_columns = ARRAY['id'],
    partition_by_hash_buckets = 36,
    partition_by_range_columns = ARRAY['event_time'],
    range_partitions = '[
        {"lower": "2015-01-01T00:00:00", "upper": "2015-03-01T00:00:00"},
        {"lower": "2015-03-01T00:00:00", "upper": "2015-06-01T00:00:00"},
        {"lower": "2015-06-01T00:00:00", "upper": "2015-09-01T00:00:00"},
        {"lower": "2015-09-01T00:00:00", "upper": "2016-01-01T00:00:00"},
        {"lower": "2016-01-01T00:00:00", "upper": "2016-03-01T00:00:00"},
        {"lower": "2016-03-01T00:00:00", "upper": "2016-06-01T00:00:00"},
        {"lower": "2016-06-01T00:00:00", "upper": "2016-09-01T00:00:00"},
        {"lower": "2016-09-01T00:00:00", "upper": "2017-01-01T00:00:00"},
        {"lower": "2017-01-01T00:00:00", "upper": "2017-03-01T00:00:00"},
        {"lower": "2017-03-01T00:00:00", "upper": "2017-06-01T00:00:00"},
        {"lower": "2017-06-01T00:00:00", "upper": "2017-09-01T00:00:00"},
        {"lower": "2017-09-01T00:00:00", "upper": "2018-01-01T00:00:00"}
    ]',
    number_of_replicas = 1
);
Then I inserted 10,000,000 rows into each table.
// Candidate event times, spread across the whole tested range
List<Date> eventsTime = getDateTimes("2015-01-01", "2018-01-01");
Random random = new Random();
List<TestTableTwo> entitys = new ArrayList<>();
for (int i = 1; i <= 10000000; i++) {
    TestTableTwo entity = new TestTableTwo();
    entity.setId(i);
    // Pick a random event time for every row
    entity.setEvent_time(eventsTime.get(random.nextInt(eventsTime.size())));
    entity.setScore(score);
    entity.setMessage(msg);
    entitys.add(entity);
    // Flush in batches of 3000 rows
    if (i % 3000 == 0) {
        KuduUtil.save(entitys);
        entitys = new ArrayList<>();
        System.out.println(i);
    }
}
// Flush the final partial batch
if (!entitys.isEmpty()) {
    KuduUtil.save(entitys);
}
Finally, I measured the query speed.
public static void main(String[] args) throws Exception {
    Class.forName("com.facebook.presto.jdbc.PrestoDriver");
    Connection conn = DriverManager.getConnection(
            "jdbc:presto://10.100.218.110:8090/kudu/tcdb", "root", null);
    Statement sm = conn.createStatement();
    String sql1 = "select count(1) from events_two where event_time < timestamp '2017-01-03 00:00:00'";
    long allTime = 0L;
    int per = 100;  // number of repetitions to average over
    for (int i = 0; i < per; i++) {
        long startTime = System.currentTimeMillis();
        ResultSet sets = sm.executeQuery(sql1);
        while (sets.next()) {
            System.out.println(sets.getString(1));
        }
        sets.close();
        long endTime = System.currentTimeMillis();
        allTime += (endTime - startTime);
    }
    System.out.println("testsql:" + sql1);
    System.out.println("per cost :" + allTime / per + "ms");
    conn.close();
}
The results were:
---
testsql:select * from events_one where id = 200
per cost :61ms
testsql:select * from events_two where id = 200
per cost :54ms
---
testsql:select sum(score) from events_one where event_time < timestamp '2017-01-03 00:00:00'
per cost :448ms
testsql:select sum(score) from events_two where event_time < timestamp '2017-01-03 00:00:00'
per cost :736ms
---
testsql:select sum(score) from events_one where id<5000000 and event_time < timestamp '2017-01-03 00:00:00'
per cost :281ms
testsql:select sum(score) from events_two where id<5000000 and event_time < timestamp '2017-01-03 00:00:00'
per cost :745ms
All event_time values are random within "2015-01-01" to "2018-01-01".
Why is the Kudu table partitioned by both hash and range slower than the table partitioned by hash only?
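One way to investigate where the time goes (a suggestion, not something the test above did) is to ask Presto for the query plan. EXPLAIN is standard Presto syntax; the plan should show whether the event_time predicate is pushed down to the Kudu connector or evaluated in Presto:

```
EXPLAIN
select sum(score)
from events_two
where event_time < timestamp '2017-01-03 00:00:00';
```

If both tables show the same pushed-down predicate, the difference is more likely on the Kudu side (number of tablets scanned) than in planning.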
Solution
In the Kudu docs, I found this passage:
Partitioning Rules of Thumb
• For large tables, such as fact tables, aim for as many tablets as you have cores in the cluster.
• For small tables, such as dimension tables, ensure that each tablet is at least 1 GB in size.
Table two:
Total hash & range partitions = (number of hash partitions) × (number of range partitions) = 36 × 12 = 432 tablets. My Kudu cluster has 3 machines with 8 cores each, for 24 cores in total.
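The arithmetic can be sanity-checked with a trivial snippet (the 36 and 12 come straight from the table definition, the core count from the cluster description above):

```java
public class TabletCount {
    public static void main(String[] args) {
        int hashBuckets = 36;      // partition_by_hash_buckets in events_two
        int rangePartitions = 12;  // entries in range_partitions
        int tablets = hashBuckets * rangePartitions;  // tablets per replica
        int cores = 3 * 8;         // 3 machines x 8 cores each
        System.out.println(tablets + " tablets for " + cores
                + " cores = " + (tablets / cores) + " tablets per core");
        // -> 432 tablets for 24 cores = 18 tablets per core
    }
}
```

That is 18 tablets competing for each core, versus 36 tablets total (1.5 per core) for events_one.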
There are probably too many tablets, each waiting for a CPU time slice before it can be scanned.
Is that right?
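If the rule of thumb is the issue, one hedged sketch of a resized events_two, assuming the same schema, would cut the hash bucket count from 36 to 2 so that 2 × 12 = 24 tablets roughly matches the 24 cores (whether this actually closes the gap would need re-testing):

```
CREATE TABLE events_two (
    id integer WITH (primary_key = true),
    event_time timestamp WITH (primary_key = true),
    score decimal(8,2),
    message varchar
) WITH (
    partition_by_hash_columns = ARRAY['id'],
    partition_by_hash_buckets = 2,  -- 2 x 12 ranges = 24 tablets ~ 24 cores
    partition_by_range_columns = ARRAY['event_time'],
    range_partitions = '[ ... ]',   -- same 12 ranges as the original events_two
    number_of_replicas = 1
);
```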