Distinct terms are not recognized correctly in MySQL

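Problem description

I created a database to store a simple inverted index built from Bengali text documents.

Table name: simple_index, primary key {Term, Document_id}

Table definition: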

CREATE TABLE IF NOT EXISTS basicindex.simple_index (
    term VARCHAR(255) NOT NULL,
    doc_id INT NOT NULL,
    frequency INT NOT NULL,
    PRIMARY KEY (term, doc_id)
);

Strangely, I found that for the following two different words:

খুঁজে - appears in documents 3, 16 and 34
খুজে - appears in document 1

when I run the following queries:

Query 1:

select doc_id from basicindex.simple_index where term='খুঁজে';

Query 2:

select doc_id from basicindex.simple_index where term = 'খুজে';

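Both return 4 rows, claiming that খুঁজে and খুজে are present in all four documents.

From the logs I found that [distinct Term, document id, frequency] খুজে was inserted only for document id 1:

Inserting index for খুজে -> { DocID: 1, Frequency: 1}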

('খুজে',1,1)

and খুঁজে was inserted for document ids 3, 16 and 34:

Inserting index for খুঁজে -> { DocID: 3, Frequency: 1}

('খুঁজে',3,1)

Inserting index for খুঁজে -> { DocID: 16, Frequency: 2}

('খুঁজে',16,2)

Inserting index for খুঁজে -> { DocID: 34, Frequency: 1}

('খুঁজে',34,1)

Here are the Unicode code points of the two terms:

খুঁজে [('খ',2454),('ু',2497),('ঁ',2433),('জ',2460),('ে',2503)]

খুজে [('খ',2454),('ু',2497),('জ',2460),('ে',2503)]
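
The code point listings above can be reproduced in Python, which also makes it easy to see exactly where the two terms differ. The following is a minimal sketch, not the asker's actual script; the helper names `code_points` and `describe_extra` are made up for illustration, and the terms are written with escapes so that the code points are unambiguous.

```python
import unicodedata

# The two terms, written with escapes so the code points are unambiguous.
WITH_MARK = "\u0996\u09c1\u0981\u099c\u09c7"   # খুঁজে
WITHOUT_MARK = "\u0996\u09c1\u099c\u09c7"      # খুজে


def code_points(term):
    """Return (character, decimal code point) pairs for a term."""
    return [(ch, ord(ch)) for ch in term]


def describe_extra(longer, shorter):
    """Describe the characters of `longer` that do not occur in `shorter`."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in longer
        if ch not in shorter
    ]


print(code_points(WITH_MARK))
print(code_points(WITHOUT_MARK))
# The only extra character is a nonspacing combining mark:
print(describe_extra(WITH_MARK, WITHOUT_MARK))
# [('U+0981', 'BENGALI SIGN CANDRABINDU', 'Mn')]
```

The only difference is a single nonspacing mark (category `Mn`), which is the kind of character that a collation may weight as an accent rather than as a letter in its own right.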

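I am using MySQL version 8.0.13. I would appreciate help understanding why the database behaves this way. Why can't it distinguish between "খুঁজে" and "খুজে"? What can I do to fix this?

I have attached documents 1, 3, 16 and 34, together with the input and output log files, for your reference here.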

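Solution

> Both return 4 rows, claiming that খুঁজে and খুজে are present in all four documents.

Check the collation in use, and explicitly specify the COLLATE you need. The two terms differ only by the combining mark ঁ (U+0981), and an accent-insensitive collation such as utf8mb4_0900_ai_ci, the default for utf8mb4 in MySQL 8.0, can compare them as equal.

For example: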

CREATE TABLE IF NOT EXISTS simple_index (
    term VARCHAR(255) NOT NULL,
    doc_id INT NOT NULL,
    frequency INT NOT NULL,
    PRIMARY KEY (term, doc_id)
);

INSERT INTO simple_index VALUES
('খুঁজে', 1, 0), ('খুজে', 2, 0);
SELECT * FROM simple_index;

term	doc_id	frequency
খুঁজে	1	0
খুজে	2	0

select doc_id from simple_index where term = 'খুঁজে';
select doc_id from simple_index where term = 'খুজে';

| doc_id |
| -----: |
|      1 |
|      2 |

| doc_id |
| -----: |
|      1 |
|      2 |

select doc_id from simple_index where term = 'খুঁজে' COLLATE utf8mb4_bin;
select doc_id from simple_index where term = 'খুজে' COLLATE utf8mb4_bin;

| doc_id |
| -----: |
|      1 |

| doc_id |
| -----: |
|      2 |
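
Since the index is populated from Python, the same idea can be applied from client code. The sketch below only builds SQL text and parameters; it assumes a DB-API driver that uses the `%s` placeholder style and a connection whose character set is utf8mb4 (otherwise MySQL may reject the `COLLATE utf8mb4_bin` clause). The helper names and the `ALTER_TERM_COLUMN` constant are hypothetical, not part of any library.

```python
# Sketch only: builds SQL text and parameters. Running them requires a live
# MySQL connection from a DB-API driver (assumed to use %s placeholders).

BINARY_COLLATION = "utf8mb4_bin"  # the collation used in the queries above


def column_collation_query(schema, table, column):
    """Build a query that reports which collation a column currently uses."""
    sql = (
        "SELECT COLLATION_NAME FROM information_schema.COLUMNS "
        "WHERE TABLE_SCHEMA = %s AND TABLE_NAME = %s AND COLUMN_NAME = %s"
    )
    return sql, (schema, table, column)


def exact_term_query(term):
    """Build a lookup that compares `term` under the binary collation."""
    sql = (
        "SELECT doc_id FROM basicindex.simple_index "
        f"WHERE term = %s COLLATE {BINARY_COLLATION}"
    )
    return sql, (term,)


# Alternative: a one-off schema change that makes every comparison on `term`
# binary, so individual queries no longer need an explicit COLLATE clause.
ALTER_TERM_COLUMN = (
    "ALTER TABLE basicindex.simple_index "
    "MODIFY term VARCHAR(255) CHARACTER SET utf8mb4 "
    f"COLLATE {BINARY_COLLATION} NOT NULL"
)

# Hypothetical usage with an open cursor:
#   cursor.execute(*column_collation_query("basicindex", "simple_index", "term"))
#   cursor.execute(*exact_term_query("\u0996\u09c1\u099c\u09c7"))
```

Adding COLLATE per query leaves the table untouched, while changing the column collation fixes the comparison for every query and for the primary key as well. Which one fits depends on whether any other code relies on the current accent-insensitive matching.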

dbfiddle here

database mysql mysql-python python unicode-string