How to perform fuzzy matching on Pandas DataFrame columns with Python?
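We will match the words in the first DataFrame against the words in the second DataFrame. To decide which closest matches count, we use a threshold of 70: a pair is treated as a match only when its similarity score, on a scale of 0 to 100, is at least 70.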
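To get a feel for what a score of 70 means, here is a minimal sketch using only the standard library. It uses `difflib.SequenceMatcher` as a stand-in for the fuzzywuzzy scorer, so the numbers are indicative rather than guaranteed to equal what fuzzywuzzy reports. The names `similarity` and `THRESHOLD` are hypothetical and exist only for this illustration.

```python
from difflib import SequenceMatcher

THRESHOLD = 70  # same cut-off as in this article


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score, ignoring case (illustrative stand-in)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


for left, right in [("BMW", "BM"), ("Lexus", "Le"), ("Mercedes", "MERCEDES")]:
    score = similarity(left, right)
    # A pair counts as a match only when the score reaches the threshold
    print(left, right, score, score >= THRESHOLD)
```

Under these assumptions, "BMW" and "BM" score 80 and count as a match, while "Lexus" and "Le" score 57 and do not.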
Let us first create the dictionaries and convert them to pandas DataFrames:
import pandas as pd

# Dictionaries
d1 = {'Car': ["BMW", "Audi", "Lexus", "Mercedes", "Rolls"]}
d2 = {'Car': ["BM", "Audi", "Le", "MERCEDES", "Rolls Royce"]}

# Convert the dictionaries to pandas DataFrames
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

Now, convert the DataFrame columns to lists of elements for fuzzy matching:

myList1 = df1['Car'].tolist()
myList2 = df2['Car'].tolist()
Example
Following is the complete code:
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# Dictionaries
d1 = {'Car': ["BMW", "Audi", "Lexus", "Mercedes", "Rolls"]}
d2 = {'Car': ["BM", "Audi", "Le", "MERCEDES", "Rolls Royce"]}

# Convert the dictionaries to pandas DataFrames
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

# Print the pandas DataFrames
print("Dataframe 1 = \n", df1)
print("Dataframe 2 = \n", df2)

# Empty lists for storing the matches later
match1 = []
match2 = []
k = []

# Convert the DataFrame columns to lists of elements for fuzzy matching
myList1 = df1['Car'].tolist()
myList2 = df2['Car'].tolist()

threshold = 70

# Iterate over myList1 and extract the closest match from myList2;
# extractOne returns a (match, score) tuple for each word
for i in myList1:
    match1.append(process.extractOne(i, myList2, scorer=fuzz.ratio))
df1['matches'] = match1

# Keep a match only if its score reaches the threshold, otherwise store an empty string
for j in df1['matches']:
    if j[1] >= threshold:
        k.append(j[0])
    match2.append(",".join(k))
    k = []

# Save the filtered matches to df1
df1['matches'] = match2
print("\nMatches...")
print(df1)

Output
This will produce the following output:
Dataframe 1 = 
         Car
0       BMW
1      Audi
2     Lexus
3  Mercedes
4     Rolls
Dataframe 2 = 
            Car
0           BM
1         Audi
2           Le
3     MERCEDES
4  Rolls Royce

Matches...
        Car   matches
0       BMW        BM
1      Audi      Audi
2     Lexus
3  Mercedes  MERCEDES
4     Rolls
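The match-then-filter logic of the complete code can also be condensed into one helper function. The following is a rough sketch that relies only on the standard library, with `difflib.SequenceMatcher` standing in for the fuzzywuzzy scorer. It is not how fuzzywuzzy is implemented, and scores may differ slightly from those of `fuzz.ratio`. The helper name `best_match` and the variables `left` and `right` are hypothetical.

```python
from difflib import SequenceMatcher


def best_match(query, choices, threshold=70):
    """Return the choice most similar to query, or "" if none reaches threshold."""
    if not choices:
        return ""

    def score(choice):
        # Case-insensitive 0-100 similarity (illustrative stand-in scorer)
        return round(100 * SequenceMatcher(None, query.lower(), choice.lower()).ratio())

    best = max(choices, key=score)
    return best if score(best) >= threshold else ""


left = ["BMW", "Audi", "Lexus", "Mercedes", "Rolls"]
right = ["BM", "Audi", "Le", "MERCEDES", "Rolls Royce"]

# One result per word in `left`; an empty string means nothing reached the threshold
matches = [best_match(car, right) for car in left]
print(matches)
```

With the DataFrames from this article, such a helper could be applied to each row with `df1['Car'].apply(lambda car: best_match(car, myList2))`, which avoids the intermediate lists.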