How to fuzzy match Pandas DataFrame columns using Python?
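We will match the words in the first DataFrame against the words in the second DataFrame. To decide whether the closest candidate counts as a match, we use a threshold. We set the threshold to 70, i.e. a match occurs only when the similarity score of the two strings (on a scale of 0 to 100) is 70 or higher.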
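To make the threshold concrete, here is a minimal sketch of a 0 to 100 similarity score built on the standard library's difflib. The helper name similarity is hypothetical, and this only approximates what fuzz.ratio reports; the exact scores from fuzzywuzzy may differ slightly depending on how it is installed.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score, ignoring case (illustrative helper)."""
    # ratio() is 2 * (matching characters) / (total characters in both strings)
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return round(100 * ratio)


threshold = 70
for left, right in [("BMW", "BM"), ("Lexus", "Le"), ("Mercedes", "MERCEDES")]:
    score = similarity(left, right)
    # A pair counts as a match only when the score reaches the threshold
    print(left, right, score, score >= threshold)
```

For "BMW" and "BM", 2 characters match out of 5 in total, so the score is 2 * 2 / 5 = 0.8, i.e. 80, which passes the threshold. For "Lexus" and "Le" the score is 2 * 2 / 7, roughly 57, which does not.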
Let us first create the dictionaries and convert them to Pandas DataFrames:
import pandas as pd

# Dictionaries
d1 = {'Car': ["BMW", "Audi", "Lexus", "Mercedes", "Rolls"]}
d2 = {'Car': ["BM", "Audi", "Le", "MERCEDES", "Rolls Royce"]}

# Convert the dictionaries to Pandas DataFrames
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
Now, convert the DataFrame columns to lists of elements for fuzzy matching:
myList1 = df1['Car'].tolist()
myList2 = df2['Car'].tolist()
Example
Following is the complete code:
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# Dictionaries
d1 = {'Car': ["BMW", "Audi", "Lexus", "Mercedes", "Rolls"]}
d2 = {'Car': ["BM", "Audi", "Le", "MERCEDES", "Rolls Royce"]}

# Convert the dictionaries to Pandas DataFrames
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

# Print the Pandas DataFrames
print("Dataframe 1 = \n", df1)
print("Dataframe 2 = \n", df2)

# Empty lists for storing the matches later
match1 = []
match2 = []
k = []

# Convert the DataFrame column to a list of elements for fuzzy matching
myList1 = df1['Car'].tolist()
myList2 = df2['Car'].tolist()

threshold = 70

# Iterate over myList1 to extract the closest match from myList2
for i in myList1:
    match1.append(process.extractOne(i, myList2, scorer=fuzz.ratio))
df1['matches'] = match1

# Keep a match only if its score reaches the threshold
for j in df1['matches']:
    if j[1] >= threshold:
        k.append(j[0])
    match2.append(",".join(k))
    k = []

# Save the matches to df1
df1['matches'] = match2
print("\nMatches...")
print(df1)

Output
This will produce the following output:
Dataframe 1 =
         Car
0       BMW
1      Audi
2     Lexus
3  Mercedes
4     Rolls
Dataframe 2 =
            Car
0           BM
1         Audi
2           Le
3     MERCEDES
4  Rolls Royce

Matches...
        Car   matches
0       BMW        BM
1      Audi      Audi
2     Lexus
3  Mercedes  MERCEDES
4     Rolls
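As a cross-check that needs no third-party package, the same closest-match-with-threshold idea can be sketched with difflib.get_close_matches from the standard library. This swaps in a different matcher than fuzzywuzzy, so it is an approximation and not the method used above; the helper name closest_match is hypothetical, and lowercasing both sides is an assumption made so that "Mercedes" and "MERCEDES" compare as equal.

```python
from difflib import get_close_matches


def closest_match(word, choices, threshold=70):
    """Return the closest choice scoring at least `threshold`, else ''."""
    # Map lowercase forms back to the original spelling (assumption: ignore case)
    lowered = {choice.lower(): choice for choice in choices}
    hits = get_close_matches(word.lower(), list(lowered), n=1,
                             cutoff=threshold / 100)
    return lowered[hits[0]] if hits else ""


cars1 = ["BMW", "Audi", "Lexus", "Mercedes", "Rolls"]
cars2 = ["BM", "Audi", "Le", "MERCEDES", "Rolls Royce"]

# One result per entry in cars1; an empty string means no match passed the threshold
matches = [closest_match(car, cars2) for car in cars1]
print(matches)
```

The resulting list has one entry per row of the first DataFrame, so it could be assigned to a column in the same way as match2 in the example above.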