Problem description
I have some raw JSON data in a ClickHouse table (it is actually NetFlow v9 from a netflow collector). It looks like this:
{"AgentID":"10.1.8.1","Header":{"Version":9,"Count":2},"DataSets":[
[{"I":2,"V":"231"},{"I":3,"V":"151"},{"I":8,"V":"109.195.122.130"}],[{"I":2,"V":"341"},{"I":3,"V":"221"},{"I":8,"V":"109.195.122.233"}]
]}
My task is to transform the DataSets array into another ClickHouse table, laid out like this:
I2   I3   I8
-----------------------------
231  151  109.195.122.130
341  221  109.195.122.233
...
Solution
To parse JSON, consider using the dedicated JSON functions:
SELECT
    toInt32(column_values[1]) AS I2,
    toInt32(column_values[2]) AS I3,
    column_values[3] AS I8
FROM
(
    SELECT
        -- one result row per inner array of DataSets, as (I, V) tuples
        arrayJoin(JSONExtract(json, 'DataSets', 'Array(Array(Tuple(Int32, String)))')) AS row,
        -- sort the tuples by field ID so that column positions are stable
        arraySort(x -> (x.1), row) AS row_with_sorted_columns,
        -- keep only the values, now in field-ID order
        arrayMap(x -> (x.2), row_with_sorted_columns) AS column_values
    FROM
    (
        SELECT '{"AgentID":"10.1.8.1","Header":{"Version":9,"Count":2},"DataSets":[\n [{"I":3,"V":"151"},{"I":8,"V":"109.195.122.130"},{"I":2,"V":"231"}],\n [{"I":2,"V":"341"},{"I":3,"V":"221"},{"I":8,"V":"109.195.122.233"}]]}' AS json
    )
)
/*
┌─I2──┬─I3──┬─I8──────────────┐
│ 231 │ 151 │ 109.195.122.130 │
│ 341 │ 221 │ 109.195.122.233 │
└─────┴─────┴─────────────────┘
*/
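(To learn more about JSON parsing, see How to extract json from json in clickhouse?)
The implementation above relies on a fixed structure of the DataSets array. As far as I understand, in the real world this structure has an arbitrary schema (https://www.iana.org/assignments/ipfix/ipfix.xhtml), for example: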
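The query above picks values by their position after sorting, so a record with a missing or extra field would put values in the wrong column. A minimal sketch of an alternative that looks each field up by its I value is shown here. It assumes that every V is a JSON string and that a missing field should come back as NULL for the integer columns or as an empty string for I8. The second record in the sample literal deliberately omits field 3 to illustrate that case; it is not data from the question.

```sql
-- Sketch: select each field by its ID (I) instead of by array position.
-- arrayFirst returns the type's default value, here (0, ''), when nothing matches,
-- so toInt32OrNull is used to turn the resulting empty string into NULL.
SELECT
    toInt32OrNull(tupleElement(arrayFirst(x -> x.1 = 2, row), 2)) AS I2,
    toInt32OrNull(tupleElement(arrayFirst(x -> x.1 = 3, row), 2)) AS I3,
    tupleElement(arrayFirst(x -> x.1 = 8, row), 2) AS I8
FROM
(
    SELECT arrayJoin(JSONExtract(json, 'DataSets', 'Array(Array(Tuple(Int32, String)))')) AS row
    FROM
    (
        -- illustrative sample: fields are out of order, and the second record has no field 3
        SELECT '{"DataSets":[[{"I":3,"V":"151"},{"I":8,"V":"109.195.122.130"},{"I":2,"V":"231"}],[{"I":2,"V":"341"},{"I":8,"V":"109.195.122.233"}]]}' AS json
    )
)
```

This keeps the column list fixed, so it only helps when the set of fields of interest is known in advance.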
{
"AgentID":"192.168.21.15","Header":{},"DataSets":[
[
{"I":8,"V":"192.16.28.217"},{"I":12,"V":"180.10.210.240"},{"I":5,"V":2},{"I":4,"V":6},{"I":7,"V":443},{"I":6,"V":"0x10"}
]
]
}
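This raises the question of a table with an arbitrary number of columns. ClickHouse does not support that; to see how tables are presented in such cases, take a look at https://stackoverflow.com/search?q=%5Bclickhouse%5D+pivot.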
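One way to cope with a varying set of fields without a pivot is to keep each record as a single key/value column and read individual fields on demand. The sketch below is an assumption rather than part of the original answer: it needs a ClickHouse version with the Map data type, and it assumes every V is a JSON string (the numeric values from the sample above are quoted here for that reason). The aliases src_addr and dst_addr are illustrative names for IPFIX elements 8 and 12.

```sql
-- Sketch: collapse each record into a Map(field ID -> value) instead of fixed columns.
SELECT
    -- build the map from two parallel arrays: the field IDs and the values
    CAST((arrayMap(x -> x.1, row), arrayMap(x -> x.2, row)), 'Map(Int32, String)') AS fields,
    fields[8]  AS src_addr,  -- IPFIX element 8: sourceIPv4Address
    fields[12] AS dst_addr   -- IPFIX element 12: destinationIPv4Address
FROM
(
    SELECT arrayJoin(JSONExtract(json, 'DataSets', 'Array(Array(Tuple(Int32, String)))')) AS row
    FROM
    (
        -- illustrative sample: values quoted as strings to match Tuple(Int32, String)
        SELECT '{"DataSets":[[{"I":8,"V":"192.16.28.217"},{"I":12,"V":"180.10.210.240"},{"I":4,"V":"6"},{"I":7,"V":"443"}]]}' AS json
    )
)
```

Looking up a key that is absent from the map returns the default value of the value type (an empty string here), so queries do not fail when a record lacks a field.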