Is there a better way to solve this problem using a dictionary?

Problem description

I am trying to solve the following problem:

A sample of the csv dataset looks like this (the full dataset has 1000 rows):

[image: sample rows of the dataset]

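What I want to implement is:

  • An AND condition, e.g. steel keyboard should only match item names that contain both steel and keyboard somewhere (not necessarily in that order)
  • An OR condition, e.g. steel keyboard should match the item names steel table and wooden keyboard, because each of them contains one of our search words
  • A numeric range query, e.g. steel keyboard with a price between $40 and $70

I have solved part of the problem with the approach below, but I feel it would be simpler with a dictionary: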

import pandas as pd


class SimpleSearch:

    def __init__(self, path):
        # Load the whole csv file into a DataFrame
        self.df = pd.read_csv(path)

    def match_keyword(self, pattern):
        # Unique regex matches of the pattern found in each item name
        self.df['matches'] = self.df['name'].str.findall(pattern).apply(lambda x: list(set(x)))

        # Collect the ids of the rows with at least one match
        ids = []
        for i in self.df.itertuples():
            if i.matches != []:
                ids.append(i.id)

        return ids


if __name__ == '__main__':
    path = "random_path/file.csv"
    pattern = "steel keyboard"
    search_obj = SimpleSearch(path)
    print(search_obj.match_keyword(pattern))
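  • Is there a simple way to separate the logic of the AND and OR operations using a dictionary? At the moment my solution only handles AND.
  • What is the best way to handle the numeric range query? I cannot think of an approach and would appreciate some help.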

Solution

In the dataframe below, 3 rows satisfy both the name criterion (2 x AND, 1 x OR) and the price criterion ([40, 70]):

>>> df
                       name   price
0   Lightweight Linen Watch   54.56
1               Steel Table   63.88  # OK
2  Keyboard With Steel Keys   48.24  # OK
3           Wooden Keyboard  104.29
4         Small Rubber Lamp   82.69
5       Durable Leather Car    9.88
6            Steel Keyboard   59.45  # OK
7   Fantastic Granite Bench   22.21
8            Apple Keyboard  999.99

Solving with pandas

TL;DR

import re

search = "steel keyboard"
search = fr"({'|'.join(search.split())})"  # '(steel|keyboard)'
min_price = 40
max_price = 70

name_result = df["name"].str.findall(search,re.IGNORECASE).apply(len)
price_result = df["price"].between(min_price,max_price)

out = df.loc[(name_result > 0) & price_result]
>>> out
                       name  price
1               Steel Table  63.88
2  Keyboard With Steel Keys  48.24
6            Steel Keyboard  59.45

Name criterion

AND and OR can both be handled in the same pass:

import re
search = "steel keyboard"
search = fr"({'|'.join(search.split())})"

name_result = df["name"].str.findall(search,re.IGNORECASE).apply(len)
>>> pd.concat([df["name"],name_result],axis="columns")
                       name  name
0   Lightweight Linen Watch     0  # no match
1               Steel Table     1  # partial match (ANY of words <- OR)
2  Keyboard With Steel Keys     2  # full match (ALL words <- AND)
3           Wooden Keyboard     1
4         Small Rubber Lamp     0
5       Durable Leather Car     0
6            Steel Keyboard     2
7   Fantastic Granite Bench     0
8            Apple Keyboard     1
  • 0: no match
  • 1 to N-1: partial match. At least one word was found.
  • N: full match. All words were found => N = len(search.split())
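The counts above can be turned into explicit AND and OR masks. The following is a minimal sketch, not part of the original answer: it rebuilds the small `df` shown earlier so that it runs on its own, and it counts distinct matched words rather than raw matches, on the assumption that a name repeating one search word should not be treated as a full match. The names `words`, `n_distinct`, `or_mask` and `and_mask` are illustrative.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "name": ["Lightweight Linen Watch", "Steel Table", "Keyboard With Steel Keys",
             "Wooden Keyboard", "Small Rubber Lamp", "Durable Leather Car",
             "Steel Keyboard", "Fantastic Granite Bench", "Apple Keyboard"],
    "price": [54.56, 63.88, 48.24, 104.29, 82.69, 9.88, 59.45, 22.21, 999.99],
})

# Distinct, lower-cased search words; re.escape guards against regex metacharacters.
words = set("steel keyboard".lower().split())
pattern = "|".join(re.escape(w) for w in words)

# Number of distinct search words found in each name (case-insensitive).
n_distinct = (
    df["name"]
    .str.findall(pattern, flags=re.IGNORECASE)
    .apply(lambda found: len({m.lower() for m in found}))
)

or_mask = n_distinct > 0             # OR: at least one word is present
and_mask = n_distinct == len(words)  # AND: every word is present

or_names = df.loc[or_mask, "name"].tolist()
and_names = df.loc[and_mask, "name"].tolist()
```

Here `and_names` contains only the two names that hold both words, while `or_names` also includes Steel Table, Wooden Keyboard and Apple Keyboard.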

Price criterion

Much simpler!

min_price = 40
max_price = 70

price_result = df["price"].between(min_price,max_price)
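Note that `Series.between` includes both endpoints by default, so it behaves like a pair of chained comparisons. A small sketch with made-up prices (the `prices` series is only for illustration):

```python
import pandas as pd

prices = pd.Series([39.99, 40.0, 55.5, 70.0, 70.01])

# Both endpoints are included by default.
in_range = prices.between(40, 70)

# Explicit form of the same test.
same_thing = (prices >= 40) & (prices <= 70)
```

If an endpoint should be excluded, the explicit comparison form makes that choice easy to see.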

Result

Apply all the criteria together:

out = df.loc[(name_result > 0) & price_result]
>>> out
                       name  price
1               Steel Table  63.88
2  Keyboard With Steel Keys  48.24
6            Steel Keyboard  59.45
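For reuse, the name and price criteria could be wrapped in one function with a switch between AND and OR. This is only a sketch under stated assumptions: `search_items` and its `mode` parameter are hypothetical names, not part of the original answer, and the frame is assumed to have `name` and `price` columns as in the example.

```python
import re

import pandas as pd


def search_items(df, query, min_price=None, max_price=None, mode="or"):
    """Return the rows whose name matches the query words and whose price is in range."""
    words = set(query.lower().split())
    if not words:
        raise ValueError("query must contain at least one word")
    pattern = "|".join(re.escape(w) for w in words)

    # Distinct search words found in each name, ignoring case.
    n_distinct = (
        df["name"]
        .str.findall(pattern, flags=re.IGNORECASE)
        .apply(lambda found: len({m.lower() for m in found}))
    )
    if mode == "and":
        name_mask = n_distinct == len(words)  # every word present
    else:
        name_mask = n_distinct > 0            # at least one word present

    # Start from "all rows" and narrow by whichever bounds were given (inclusive).
    price_mask = pd.Series(True, index=df.index)
    if min_price is not None:
        price_mask &= df["price"] >= min_price
    if max_price is not None:
        price_mask &= df["price"] <= max_price

    return df.loc[name_mask & price_mask]


df = pd.DataFrame({
    "name": ["Lightweight Linen Watch", "Steel Table", "Keyboard With Steel Keys",
             "Wooden Keyboard", "Small Rubber Lamp", "Durable Leather Car",
             "Steel Keyboard", "Fantastic Granite Bench", "Apple Keyboard"],
    "price": [54.56, 63.88, 48.24, 104.29, 82.69, 9.88, 59.45, 22.21, 999.99],
})

either = search_items(df, "steel keyboard", 40, 70, mode="or")
both = search_items(df, "steel keyboard", 40, 70, mode="and")
```

With the example data, `either` holds rows 1, 2 and 6, and `both` holds rows 2 and 6.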

Solving with a dict

import re

search = "steel keyboard"
search = fr"({'|'.join(search.split())})"  # '(steel|keyboard)'
search = re.compile(search,re.IGNORECASE)
min_price = 40
max_price = 70

data = df.set_index("name").squeeze().to_dict()

out = {name: price for name,price in data.items()
           if search.search(name) and min_price <= price <= max_price}
>>> out
{'Steel Table': 63.88,'Keyboard With Steel Keys': 48.24,'Steel Keyboard': 59.45}
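The dict version above is an OR search, because the compiled pattern matches as soon as any word is found. One possible way to separate AND from OR without pandas is to pass `any` or `all` as the matching rule. This is a sketch, not the original author's code: `search_dict` and its `match` parameter are hypothetical names, words are matched as plain case-insensitive substrings, and a dict keyed by name assumes that item names are unique.

```python
items = {
    "Lightweight Linen Watch": 54.56,
    "Steel Table": 63.88,
    "Keyboard With Steel Keys": 48.24,
    "Wooden Keyboard": 104.29,
    "Small Rubber Lamp": 82.69,
    "Durable Leather Car": 9.88,
    "Steel Keyboard": 59.45,
    "Fantastic Granite Bench": 22.21,
    "Apple Keyboard": 999.99,
}


def search_dict(items, query, min_price, max_price, match=any):
    """Filter a {name: price} dict; match=any gives OR, match=all gives AND."""
    words = query.lower().split()
    return {
        name: price
        for name, price in items.items()
        if match(word in name.lower() for word in words)
        and min_price <= price <= max_price
    }


either = search_dict(items, "steel keyboard", 40, 70, match=any)
both = search_dict(items, "steel keyboard", 40, 70, match=all)
```

Here `either` keeps Steel Table, Keyboard With Steel Keys and Steel Keyboard, while `both` drops Steel Table because it lacks the word keyboard.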