How to remove rows in an R data frame whose combination of categorical column values occurs three times or fewer?
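In data analysis, we sometimes decide to reduce the data or sample size based on our own criteria, which means dropping part of the data. One such case is removing rows whose combination of categorical column values occurs three times or fewer. This can be done with the dplyr package by grouping on the categorical columns with group_by and then keeping only the sufficiently large groups with filter.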
Example 1
Consider the following data frame:
set.seed(121)
x1 <- sample(LETTERS[1:6], 20, replace = TRUE)
x2 <- sample(c("Male", "Female"), 20, replace = TRUE)
x3 <- rpois(20, 5)
df1 <- data.frame(x1, x2, x3)
df1
Output
   x1     x2 x3
1   D Female  5
2   D Female  2
3   D   Male  7
4   D Female  8
5   A   Male  6
6   C Female  7
7   A Female  3
8   C Female  1
9   C Female  7
10  E   Male  2
11  D Female  3
12  E Female  6
13  F Female  3
14  D Female  4
15  A   Male  4
16  E   Male  4
17  B Female  8
18  B Female  7
19  C Female  5
20  A Female  9
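Load the dplyr package and remove the rows whose combination of the categorical columns x1 and x2 occurs three times or fewer, that is, keep only the groups with at least four rows:
Example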
library(dplyr)
df1 %>% group_by(x1, x2) %>% filter(n() >= 4)
Output
# A tibble: 9 x 3
# Groups: x1, x2 [2]
  x1    x2        x3
1 D     Female     5
2 D     Female     2
3 D     Female     8
4 C     Female     7
5 C     Female     1
6 C     Female     7
7 D     Female     3
8 D     Female     4
9 C     Female     5
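To check which combinations pass the threshold, it can help to tabulate the group sizes before filtering. The following is a minimal sketch on a small hand-built data frame, so the result does not depend on the random seed. The names toy, combo_sizes, n_rows and kept are hypothetical and not part of the original example; count() and ungroup() are standard dplyr functions.

```r
library(dplyr)

# Hypothetical toy data (not from the original example):
# (D, Female) occurs 4 times, (A, Male) twice, (C, Female) once.
toy <- data.frame(
  x1 = c("D", "D", "D", "D", "A", "A", "C"),
  x2 = c("Female", "Female", "Female", "Female", "Male", "Male", "Female"),
  x3 = c(5, 2, 8, 3, 6, 4, 7)
)

# Number of rows for each combination of the categorical columns
combo_sizes <- toy %>% count(x1, x2, name = "n_rows")

# Keep only combinations with at least 4 rows, then drop the grouping
# so that later operations work on the whole data frame again
kept <- toy %>%
  group_by(x1, x2) %>%
  filter(n() >= 4) %>%
  ungroup()
```

Here only the four (D, Female) rows remain in kept. Calling ungroup() at the end is optional; it is included on the assumption that the filtered data will be processed further without the grouping.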
Example 2
y1 <- sample(c("S1", "S2", "S3", "S4", "S5", "S6"), 20, replace = TRUE)
y2 <- sample(c("Winter", "Summer"), 20, replace = TRUE)
y3 <- rnorm(20, 3)
df2 <- data.frame(y1, y2, y3)
df2
Output
   y1     y2       y3
1  S1 Winter 2.683082
2  S4 Summer 1.141916
3  S6 Winter 3.371681
4  S2 Winter 3.191187
5  S3 Summer 2.195504
6  S5 Summer 2.631736
7  S3 Winter 3.303605
8  S6 Summer 3.074344
9  S5 Summer 2.663724
10 S5 Winter 2.281991
11 S6 Summer 4.174418
12 S4 Winter 6.081246
13 S4 Summer 3.202913
14 S2 Winter 5.557243
15 S2 Winter 3.747462
16 S2 Winter 2.621571
17 S2 Summer 3.909743
18 S5 Winter 2.325663
19 S5 Summer 3.749852
20 S5 Winter 2.331191
Example
df2 %>% group_by(y1, y2) %>% filter(n() >= 4)
Output
# A tibble: 4 x 3
# Groups: y1, y2 [1]
  y1    y2        y3
1 S2    Winter  3.19
2 S2    Winter  5.56
3 S2    Winter  3.75
4 S2    Winter  2.62
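If dplyr is not available, the same filtering can be approximated in base R. The sketch below is an alternative technique rather than the method used in the examples above: it uses ave() to compute, for each row, the size of the group it belongs to, and then subsets on that size. The names toy2, group_size and kept2 are hypothetical, and the data is hand-built for illustration.

```r
# Hypothetical toy data (not from the original example):
# (S2, Winter) occurs 4 times, (S5, Summer) 3 times, (S1, Winter) once.
toy2 <- data.frame(
  y1 = c("S2", "S2", "S2", "S2", "S5", "S5", "S5", "S1"),
  y2 = c("Winter", "Winter", "Winter", "Winter",
         "Summer", "Summer", "Summer", "Winter"),
  y3 = c(3.19, 5.56, 3.75, 2.62, 2.63, 2.66, 3.75, 2.68)
)

# For every row, the number of rows sharing its (y1, y2) combination
group_size <- ave(seq_len(nrow(toy2)), toy2$y1, toy2$y2, FUN = length)

# Keep rows whose combination occurs at least 4 times
kept2 <- toy2[group_size >= 4, ]
```

Unlike the dplyr version, this returns a plain data frame with the original row names and no grouping attached.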