How to match Chinese characters in ClickHouse

Problem description

I am trying to extract Chinese characters from a string using ClickHouse SQL.

Extracting Latin letters works. I use:

select extractAll('dkfdfjsd1234中文字符串','[a-zA-Z]')

It successfully returns:

['d','k','f','d','f','j','s','d']

Now I want to extract the Chinese characters in the same way. I tried:

select extractAll('dkfdfjsd1234中文字符串','[\u4e00-\u9fa5]')

It returns an error:

Code: 427, e.displayText() = DB::Exception: OptimizedRegularExpression: cannot compile re2: [\u4e00-\u9fa5], error: invalid escape sequence: \u. Look at https://github.com/google/re2/wiki/Syntax for reference. Please note that if you specify regex as an sql string literal, the slashes have to be additionally escaped. For example, to match an opening brace, write '\(' -- the first slash is for sql and the second one is for regex (version 20.8.14.4 (official build))

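Solution

ClickHouse compiles regular expressions with RE2, which does not support the \u escape. To match a Unicode code point, use \x{FFFF} instead, and double the backslash because the pattern is written as an SQL string literal: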

SELECT extractAll('dkfdfjsd1234中文字符串','[\\x{4e00}-\\x{9fa5}]') AS result

/*
┌─result─────────────────────┐
│ ['中','文','字','符','串'] │
└────────────────────────────┘
*/
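
As an alternative, the RE2 syntax page linked in the error message also lists Unicode script classes such as \p{Han}. The sketch below assumes the same ClickHouse build accepts them the way it accepts \x{...}; it is not part of the original answer. The script class covers Han characters outside the 4e00-9fa5 range too, and adding + returns each run of consecutive characters as one element instead of one element per character.

```sql
-- One array element per Han character, without hard-coding a code point range.
SELECT extractAll('dkfdfjsd1234中文字符串', '\\p{Han}') AS chars;

-- One array element per run of consecutive Han characters.
SELECT extractAll('dkfdfjsd1234中文字符串', '\\p{Han}+') AS runs;
```

As in the answer above, the backslash is doubled because the pattern is an SQL string literal.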