目录
- 模拟数据
- 差集
- 方法1:concat + drop_duplicates
- 方法2:append + drop_duplicates
- 交集
- 方法1:merge
- 方法2:concat + duplicated + loc
- 方法3:concat + groupby + query
- 并集
- 方法1:concat + drop_duplicates
- 方法2:append + drop_duplicates
- 方法3:merge
大家好,我是Peter~
本文讲解的是如何利用Pandas函数求解两个DataFrame的差集、交集、并集。
模拟数据
模拟一份简单的数据:
In [1]:
In [2]:
df1 = pd.DataFrame({"col1":[1,2,3,4,5],
"col2":[6,7,8,9,10]
})
df2 = pd.DataFrame({"col1":[1,3,7],
"col2":[6,8,10]
})
In [3]:
Out[3]:
|
col1 |
col2 |
0 |
1 |
6 |
1 |
2 |
7 |
2 |
3 |
8 |
3 |
4 |
9 |
4 |
5 |
10 |
In [4]:
Out[4]:
|
col1 |
col2 |
0 |
1 |
6 |
1 |
3 |
8 |
2 |
7 |
10 |
两个DataFrame的相同部分:

差集
方法1:concat + drop_duplicates
In [5]:
df3 = pd.concat([df1,df2])
df3
Out[5]:
|
col1 |
col2 |
0 |
1 |
6 |
1 |
2 |
7 |
2 |
3 |
8 |
3 |
4 |
9 |
4 |
5 |
10 |
0 |
1 |
6 |
1 |
3 |
8 |
2 |
7 |
10 |
In [6]:
# 结果1
df3.drop_duplicates(["col1","col2"],keep=False)
Out[6]:
|
col1 |
col2 |
1 |
2 |
7 |
3 |
4 |
9 |
4 |
5 |
10 |
2 |
7 |
10 |
方法2:append + drop_duplicates
In [7]:
df4 = df1.append(df2)
df4
Out[7]:
|
col1 |
col2 |
0 |
1 |
6 |
1 |
2 |
7 |
2 |
3 |
8 |
3 |
4 |
9 |
4 |
5 |
10 |
0 |
1 |
6 |
1 |
3 |
8 |
2 |
7 |
10 |
In [8]:
# 结果2
df4.drop_duplicates(["col1","col2"],keep=False)
Out[8]:
|
col1 |
col2 |
1 |
2 |
7 |
3 |
4 |
9 |
4 |
5 |
10 |
2 |
7 |
10 |
交集
方法1:merge
In [9]:
# 结果
# 等效:df5 = pd.merge(df1, df2, how="inner")
df5 = pd.merge(df1,df2)
df5
Out[9]:
方法2:concat + duplicated + loc
In [10]:
df6 = pd.concat([df1,df2])
df6
Out[10]:
|
col1 |
col2 |
0 |
1 |
6 |
1 |
2 |
7 |
2 |
3 |
8 |
3 |
4 |
9 |
4 |
5 |
10 |
0 |
1 |
6 |
1 |
3 |
8 |
2 |
7 |
10 |
In [11]:
s = df6.duplicated(subset=['col1','col2'], keep='first')
s
Out[11]:
0 False
1 False
2 False
3 False
4 False
0 True
1 True
2 False
dtype: bool
In [12]:
# 结果
df8 = df6.loc[s == True]
df8
Out[12]:
方法3:concat + groupby + query
In [13]:
# df6 = pd.concat([df1,df2])
df6
Out[13]:
|
col1 |
col2 |
0 |
1 |
6 |
1 |
2 |
7 |
2 |
3 |
8 |
3 |
4 |
9 |
4 |
5 |
10 |
0 |
1 |
6 |
1 |
3 |
8 |
2 |
7 |
10 |
In [14]:
df9 = df6.groupby(["col1", "col2"]).size().reset_index()
df9.columns = ["col1", "col2", "count"]
df9
Out[14]:
|
col1 |
col2 |
count |
0 |
1 |
6 |
2 |
1 |
2 |
7 |
1 |
2 |
3 |
8 |
2 |
3 |
4 |
9 |
1 |
4 |
5 |
10 |
1 |
5 |
7 |
10 |
1 |
In [15]:
df10 = df9.query("count > 1")[["col1", "col2"]]
df10
Out[15]:
并集
方法1:concat + drop_duplicates
In [16]:
df11 = pd.concat([df1,df2])
df11
Out[16]:
|
col1 |
col2 |
0 |
1 |
6 |
1 |
2 |
7 |
2 |
3 |
8 |
3 |
4 |
9 |
4 |
5 |
10 |
0 |
1 |
6 |
1 |
3 |
8 |
2 |
7 |
10 |
In [17]:
# 结果
# df12 = df11.drop_duplicates(subset=["col1","col2"],keep="last")
df12 = df11.drop_duplicates(subset=["col1","col2"],keep="first")
df12
Out[17]:
|
col1 |
col2 |
0 |
1 |
6 |
1 |
2 |
7 |
2 |
3 |
8 |
3 |
4 |
9 |
4 |
5 |
10 |
2 |
7 |
10 |
方法2:append + drop_duplicates
In [18]:
df13 = df1.append(df2)
# df13.drop_duplicates(subset=["col1","col2"],keep="last")
df13.drop_duplicates(subset=["col1","col2"],keep="first")
Out[18]:
|
col1 |
col2 |
0 |
1 |
6 |
1 |
2 |
7 |
2 |
3 |
8 |
3 |
4 |
9 |
4 |
5 |
10 |
2 |
7 |
10 |
方法3:merge
In [19]:
pd.merge(df1,df2,how="outer")
Out[19]:
|
col1 |
col2 |
0 |
1 |
6 |
1 |
2 |
7 |
2 |
3 |
8 |
3 |
4 |
9 |
4 |
5 |
10 |
5 |
7 |
10 |