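Matching the closest string to another string (football teams)?

Problem description

I'm trying to standardise some data that I receive through a Football API.

I have a function with three inputs: home and away (two football teams), and a list of strings that contain the home and away teams, although the names in the list can differ from the home and away inputs.

My goal is to replace every instance of home in the list with 1 and every instance of away in the list with 2.

Here are some example inputs: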

home: "Manchester United",away: "Liverpool",list = ["Man Utd and Yes","Liverpool and No","Man Utd and No","Liverpool and Yes"]

home: "Manchester United",away: "Manchester City",list = ["Man Utd and Yes","Man City and No","Man City and Yes"]

home: "Paris Saint Germain",away: "Monaco",list = ["Monaco and Yes","Monaco and No","PSG and Yes","PSG and No"]

home: "Brighton & Hove Albion",away: "Chelsea",list = ["Chelsea and No","Brighton and Yes","Chelsea and Yes","Brighton and No"]

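Note that the team names within a list are consistent (you will never see, say, "Man Utd and Yes" and "Man United and No" in the same list).

Now, how do I go about matching the teams? This is what I have so far: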

import enchant.utils  # pyenchant, provides levenshtein()

def standardise(home,away,lst):
   for i,v in enumerate(lst):
      team = v.split("and")[0]

      # Direct substring match against the home team
      if team in home or home in team:
         lst[i] = v.replace(team,"1")
         for j,k in enumerate(lst):
            new_team = k.split("and")[0]
            if new_team != i and team != new_team:
               lst[j] = k.replace(new_team,"2")
            else:
               lst[j] = k.replace(new_team,"1")

      elif team in away or away in team:
         pass  # same code as above but for away

      # Fall back to edit distance: pick whichever team is closer
      elif enchant.utils.levenshtein(team,home) >= \
           enchant.utils.levenshtein(team,away):

         lst[i] = v.replace(team,"2")

      else:
         lst[i] = v.replace(team,"1")

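Levenshtein distance measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to turn one string into another.

Now, this method doesn't work 100% of the time; for example, it seems to fail when acronyms are used.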
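The acronym failure can be reproduced with a small pure-Python Levenshtein function. This is only a sketch written for illustration (the `levenshtein` helper below is not the pyenchant implementation), but it shows that "PSG" is many more edits away from "Paris Saint Germain" than from "Monaco", simply because the full name is so much longer.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("PSG", "Paris Saint Germain"))  # 16
print(levenshtein("PSG", "Monaco"))               # 6
```

With raw distances, the comparison in the last branches of the function above would therefore label "PSG" as the away team.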

Is there a better way to do this? Perhaps someone can think of a more specific approach.

Solution

Fuzzywuzzy is a good fit for this; see also its docs.

from fuzzywuzzy import process

def standardise(home,away,lst):
    # Map each canonical team name to its replacement label
    home_away = {home:'1',away:'2'}
    choices = [home,away]

    # extractOne returns (best_choice, score); index 0 is the matched name
    print ([ home_away[process.extractOne(each,choices)[0]] for each in lst ])


home = "Manchester United"
away = "Liverpool"
lst = ["Man Utd and Yes","Liverpool and No","Man Utd and No","Liverpool and Yes"]
standardise(home,away,lst)


home = "Manchester United"
away = "Manchester City"
lst = ["Man Utd and Yes","Man City and No","Man City and Yes"]
standardise(home,away,lst)


home = "Paris Saint Germain"
away = "Monaco"
lst = ["Monaco and Yes","Monaco and No","PSG and Yes","PSG and No"]
standardise(home,away,lst)


home = "Brighton & Hove Albion"
away = "Chelsea"
lst = ["Chelsea and No","Brighton and Yes","Chelsea and Yes","Brighton and No"]
standardise(home,away,lst)

Output:

['1','2','1','2']
['1','2','2']
['2','2','1','1']
['2','1','2','1']
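The function above only prints the labels. A minimal sketch of the replacement step the question asks for, assuming every entry has the form "<team> and <answer>", might look like the following. To stay runnable without extra packages it swaps in the standard library's difflib.SequenceMatcher as a stand-in scorer; this is not fuzzywuzzy's scoring, and `similarity`, `standardise_list`, and the `sep` parameter are illustrative names.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] (difflib, not fuzzywuzzy)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def standardise_list(home, away, lst, sep=" and "):
    """Return a new list with the team part of each entry replaced by '1' or '2'."""
    labels = {}  # team spelling seen in the list -> '1' or '2'
    result = []
    for item in lst:
        team, found, rest = item.partition(sep)
        if not found:
            # Entry does not follow the assumed "<team> and <answer>" shape
            result.append(item)
            continue
        if team not in labels:
            # Names are consistent within a list, so decide once per spelling
            closer_to_home = similarity(team, home) >= similarity(team, away)
            labels[team] = "1" if closer_to_home else "2"
        result.append(labels[team] + sep + rest)
    return result


print(standardise_list(
    "Manchester United", "Liverpool",
    ["Man Utd and Yes", "Liverpool and No"],
))  # ['1 and Yes', '2 and No']
```

Splitting on " and " with the surrounding spaces, rather than on "and", avoids cutting a team name that happens to contain those letters.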

You could first try removing the vowels, so that each name ends up closer to its abbreviation, and then apply Levenshtein.

Also check out fuzzywuzzy.
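A rough sketch of the vowel-removal idea is shown below. Two details are assumptions added here rather than part of the suggestion: names are lower-cased before comparison, and the distance is divided by the longer string's length, because raw distances tend to favour whichever candidate is shorter. All function names are illustrative.

```python
from functools import lru_cache

VOWELS = set("aeiou")


def strip_vowels(name):
    """Lower-case the name and drop its vowels."""
    return "".join(ch for ch in name.lower() if ch not in VOWELS)


def levenshtein(a, b):
    """Edit distance via memoised recursion over prefix lengths."""
    @lru_cache(maxsize=None)
    def dist(i, j):
        if i == 0 or j == 0:
            return i + j  # only insertions or deletions remain
        return min(
            dist(i - 1, j) + 1,                            # delete
            dist(i, j - 1) + 1,                            # insert
            dist(i - 1, j - 1) + (a[i - 1] != b[j - 1]),   # substitute or match
        )
    return dist(len(a), len(b))


def normalised_distance(team, candidate):
    """Edit distance between the vowel-stripped names, scaled to [0, 1]."""
    a, b = strip_vowels(team), strip_vowels(candidate)
    return levenshtein(a, b) / max(len(a), len(b), 1)


def label(team, home, away):
    """Return '1' if team is closer to home, otherwise '2'."""
    if normalised_distance(team, home) <= normalised_distance(team, away):
        return "1"
    return "2"


print(strip_vowels("Manchester United"))                 # mnchstr ntd
print(label("Man Utd", "Manchester United", "Liverpool"))  # 1
print(label("PSG", "Paris Saint Germain", "Monaco"))       # 1
```

This handles the examples shown, but it is still a heuristic and may misfire on other abbreviations.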