Problem description
I am trying to extract the Chinese characters from a string with ClickHouse SQL.
This query, which matches ASCII letters:
select extractAll('dkfdfjsd1234中文字符串','[a-zA-Z]')
runs successfully and returns:
['d','k','f','d','f','j','s','d']
select extractAll('dkfdfjsd1234中文字符串','[\u4e00-\u9fa5]')
fails with an error:
Code: 427, e.displayText() = DB::Exception: OptimizedRegularExpression: cannot compile re2: [\u4e00-\u9fa5], error: invalid escape sequence: \u. Look at https://github.com/google/re2/wiki/Syntax for reference. Please note that if you specify regex as an sql string literal, the slashes have to be additionally escaped. For example, to match an opening brace, write '\(' -- the first slash is for sql and the second one is for regex (version 20.8.14.4 (official build))
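Solution
ClickHouse compiles regular expressions with RE2, which does not accept the \uXXXX escape. To match a Unicode code point, use \x{FFFF} instead, and double the backslash because the pattern is written as an SQL string literal: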
SELECT extractAll('dkfdfjsd1234中文字符串','[\\x{4e00}-\\x{9fa5}]') AS result
/*
┌─result─────────────────────┐
│ ['中','文','字','符','串'] │
└────────────────────────────┘
*/
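As an alternative to spelling out the code point range, RE2 also documents Unicode script classes such as \p{Han}. The query below is a minimal sketch, assuming the ClickHouse build in use passes this class through to RE2 unchanged. The + quantifier is an addition for illustration, so that each contiguous run of Chinese characters should come back as one array element rather than one element per character.

```sql
-- Sketch: match runs of Han-script characters with a Unicode script class.
-- The backslash is doubled because the pattern is an SQL string literal.
SELECT extractAll('dkfdfjsd1234中文字符串', '\\p{Han}+') AS result
```

The explicit range \x{4e00}-\x{9fa5} covers only part of the CJK Unified Ideographs block, so the script class may be the safer choice if the input can contain rarer characters from the extension blocks.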