R-data.table
上一节介绍了最基本的如何操作data.frame的行列来实现各种功能需求的数据计算。这一节将介绍另外一个功能更加强大的R包:data.table
library(data.table)
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9);
行列选择
DT;
DT[2]; //第2行
DT[,v]; //第v列(向量形式)
DT[,list(v)]; //第v列(列表形式)
DT[2:3,sum(v)]; //第v列第2行和第3行之和
DT[2:5,cat(v,”\n”)]; //列出第2至第5行v列的值
DT[c(FALSE,TRUE)] //列出偶数行的值
DT;
DT[2];
DT[,v];
DT[,list(v)];
DT[2:3,sum(v)];
DT[2:5,cat(v,"\n")];
DT[c(FALSE,TRUE)]
| x | y | v |
|---|
| 1 | a | 1 | 1 |
|---|
| 2 | a | 3 | 2 |
|---|
| 3 | a | 6 | 3 |
|---|
| 4 | b | 1 | 4 |
|---|
| 5 | b | 3 | 5 |
|---|
| 6 | b | 6 | 6 |
|---|
| 7 | c | 1 | 7 |
|---|
| 8 | c | 3 | 8 |
|---|
| 9 | c | 6 | 9 |
|---|
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
| v |
|---|
| 1 | 1 |
|---|
| 2 | 2 |
|---|
| 3 | 3 |
|---|
| 4 | 4 |
|---|
| 5 | 5 |
|---|
| 6 | 6 |
|---|
| 7 | 7 |
|---|
| 8 | 8 |
|---|
| 9 | 9 |
|---|
5
2 3 4 5
NULL
DT[,2,with=FALSE];
DT["a"];
DT[y== 6]
| y |
|---|
| 1 | 1 |
|---|
| 2 | 3 |
|---|
| 3 | 6 |
|---|
| 4 | 1 |
|---|
| 5 | 3 |
|---|
| 6 | 6 |
|---|
| 7 | 1 |
|---|
| 8 | 3 |
|---|
| 9 | 6 |
|---|
| x | y | v | v2 | m |
|---|
| 1 | a | 1 | 42 | NA | 42 |
|---|
| 2 | a | 3 | 42 | NA | 42 |
|---|
| 3 | a | 6 | 42 | NA | 42 |
|---|
| x | y | v | v2 | m |
|---|
| 1 | a | 6 | 42 | NA | 42 |
|---|
| 2 | b | 6 | 6 | 84 | 5 |
|---|
| 3 | c | 6 | 9 | NA | 8 |
|---|
分组计算
DT;
DT[,sum(v),by=x];
DT[,sum(v),by=key(DT)]
DT["a",sum(v)]
| x | y | v |
|---|
| 1 | a | 1 | 1 |
|---|
| 2 | a | 3 | 2 |
|---|
| 3 | a | 6 | 3 |
|---|
| 4 | b | 1 | 4 |
|---|
| 5 | b | 3 | 5 |
|---|
| 6 | b | 6 | 6 |
|---|
| 7 | c | 1 | 7 |
|---|
| 8 | c | 3 | 8 |
|---|
| 9 | c | 6 | 9 |
|---|
DT[,sum(v),by=y]
DT["a",sum(v)]
DT[c("a","b"),sum(v),by=.EACHI]
6
X = data.table(c("b","c"),foo=c(4,2))
X
DT
| x | y | v |
|---|
| 1 | a | 1 | 1 |
|---|
| 2 | a | 3 | 2 |
|---|
| 3 | a | 6 | 3 |
|---|
| 4 | b | 1 | 4 |
|---|
| 5 | b | 3 | 5 |
|---|
| 6 | b | 6 | 6 |
|---|
| 7 | c | 1 | 7 |
|---|
| 8 | c | 3 | 8 |
|---|
| 9 | c | 6 | 9 |
|---|
合并
DT[X] #join(i为data.table,i join to x 通过x的key值)
DT[X]
| x | y | v | foo |
|---|
| 1 | b | 1 | 4 | 4 |
|---|
| 2 | b | 3 | 5 | 4 |
|---|
| 3 | b | 6 | 6 | 4 |
|---|
| 4 | c | 1 | 7 | 2 |
|---|
| 5 | c | 3 | 8 | 2 |
|---|
| 6 | c | 6 | 9 | 2 |
|---|
DT[X,sum(v),by=.EACHI] #join and eval j for each row in i
DT[X,sum(v),by=.EACHI]
DT[X,mult="first"] #first row of each group
DT[X,mult="first"]
DT[X,sum(v)*foo,by=.EACHI] #join inherited scope
DT[X,sum(v)*foo,by=.EACHI]
setkey(DT,x,y)
setkeyv(DT,c("x","y")
DT
DT["a"]
| x | y | v | v2 | m |
|---|
| 1 | a | 1 | 42 | NA | 42 |
|---|
| 2 | a | 3 | 42 | NA | 42 |
|---|
| 3 | a | 6 | 42 | NA | 42 |
|---|
| 4 | b | 1 | 4 | 84 | 5 |
|---|
| 5 | b | 3 | 5 | 84 | 5 |
|---|
| 6 | b | 6 | 6 | 84 | 5 |
|---|
| 7 | c | 1 | 7 | NA | 8 |
|---|
| 8 | c | 3 | 8 | NA | 8 |
|---|
| 9 | c | 6 | 9 | NA | 8 |
|---|
| x | y | v | v2 | m |
|---|
| 1 | a | 1 | 42 | NA | 42 |
|---|
| 2 | a | 3 | 42 | NA | 42 |
|---|
| 3 | a | 6 | 42 | NA | 42 |
|---|
当 i 以列名的形式求值于data.table或list表达式中时,执行selfjoin操作
DT
DT["a"]
DT[.("a",1)]
| x | y | v | v2 | m |
|---|
| 1 | a | 1 | 42 | NA | 42 |
|---|
| 2 | a | 3 | 42 | NA | 42 |
|---|
| 3 | a | 6 | 42 | NA | 42 |
|---|
| 4 | b | 1 | 4 | 84 | 5 |
|---|
| 5 | b | 3 | 5 | 84 | 5 |
|---|
| 6 | b | 6 | 6 | 84 | 5 |
|---|
| 7 | c | 1 | 7 | NA | 8 |
|---|
| 8 | c | 3 | 8 | NA | 8 |
|---|
| 9 | c | 6 | 9 | NA | 8 |
|---|
| x | y | v | v2 | m |
|---|
| 1 | a | 1 | 42 | NA | 42 |
|---|
| 2 | a | 3 | 42 | NA | 42 |
|---|
| 3 | a | 6 | 42 | NA | 42 |
|---|
| x | y | v | v2 | m | V2 |
|---|
| 1 | a | 1 | 42 | NA | 42 | 1 |
|---|
| 2 | a | 3 | 42 | NA | 42 | 1 |
|---|
| 3 | a | 6 | 42 | NA | 42 | 1 |
|---|
DT[list("a",3:5)]
| x | y | v | V2 |
|---|
| 1 | a | 1 | 1 | 3 |
|---|
| 2 | a | 3 | 2 | 3 |
|---|
| 3 | a | 6 | 3 | 3 |
|---|
| 4 | a | 1 | 1 | 4 |
|---|
| 5 | a | 3 | 2 | 4 |
|---|
| 6 | a | 6 | 3 | 4 |
|---|
| 7 | a | 1 | 1 | 5 |
|---|
| 8 | a | 3 | 2 | 5 |
|---|
| 9 | a | 6 | 3 | 5 |
|---|
by: a list() of expressions of column names,在data.table框架中被计算,因此列名是被当作一个变量,然后data.table 以by计算的值分组,为每一个组计算j的值。
当i是data.table时,DT[i,j,by=.EACHI],以DT里面每一行i值分组,你能同时合并(in i)并计算(in j)
DT
DT[,sum(v),by=.(y%%2)]
DT[X]
DT[X,sum(v),by=.EACHI]
| x | y | v |
|---|
| 1 | a | 1 | 1 |
|---|
| 2 | a | 3 | 2 |
|---|
| 3 | a | 6 | 3 |
|---|
| 4 | b | 1 | 4 |
|---|
| 5 | b | 3 | 5 |
|---|
| 6 | b | 6 | 6 |
|---|
| 7 | c | 1 | 7 |
|---|
| 8 | c | 3 | 8 |
|---|
| 9 | c | 6 | 9 |
|---|
| x | y | v | foo |
|---|
| 1 | b | 1 | 4 | 4 |
|---|
| 2 | b | 3 | 5 | 4 |
|---|
| 3 | b | 6 | 6 | 4 |
|---|
| 4 | c | 1 | 7 | 2 |
|---|
| 5 | c | 3 | 8 | 2 |
|---|
| 6 | c | 6 | 9 | 2 |
|---|
x = data.table(c("b","c"),foo=c(4,2))
x
DT
DT[,.SD[2],by=x]
| x | y | v |
|---|
| 1 | a | 1 | 1 |
|---|
| 2 | a | 3 | 2 |
|---|
| 3 | a | 6 | 3 |
|---|
| 4 | b | 1 | 4 |
|---|
| 5 | b | 3 | 5 |
|---|
| 6 | b | 6 | 6 |
|---|
| 7 | c | 1 | 7 |
|---|
| 8 | c | 3 | 8 |
|---|
| 9 | c | 6 | 9 |
|---|
DT[,tail(.SD,2),by=x]
| x | y | v |
|---|
| 1 | a | 3 | 2 |
|---|
| 2 | a | 6 | 3 |
|---|
| 3 | b | 3 | 5 |
|---|
| 4 | b | 6 | 6 |
|---|
| 5 | c | 3 | 8 |
|---|
| 6 | c | 6 | 9 |
|---|
DT
DT[,lapply(.SD,sum),by=x]
| x | y | v |
|---|
| 1 | a | 1 | 1 |
|---|
| 2 | a | 3 | 2 |
|---|
| 3 | a | 6 | 3 |
|---|
| 4 | b | 1 | 4 |
|---|
| 5 | b | 3 | 5 |
|---|
| 6 | b | 6 | 6 |
|---|
| 7 | c | 1 | 7 |
|---|
| 8 | c | 3 | 8 |
|---|
| 9 | c | 6 | 9 |
|---|
DT
DT[,list(MySum=sum(v),
MyMin=min(v),
MyMax=max(v)),
by=.(x,y%%2)]
| x | y | v |
|---|
| 1 | a | 1 | 1 |
|---|
| 2 | a | 3 | 2 |
|---|
| 3 | a | 6 | 3 |
|---|
| 4 | b | 1 | 4 |
|---|
| 5 | b | 3 | 5 |
|---|
| 6 | b | 6 | 6 |
|---|
| 7 | c | 1 | 7 |
|---|
| 8 | c | 3 | 8 |
|---|
| 9 | c | 6 | 9 |
|---|
| x | y | MySum | MyMin | MyMax |
|---|
| 1 | a | 1 | 3 | 1 | 2 |
|---|
| 2 | a | 0 | 3 | 3 | 3 |
|---|
| 3 | b | 1 | 9 | 4 | 5 |
|---|
| 4 | b | 0 | 6 | 6 | 6 |
|---|
| 5 | c | 1 | 15 | 7 | 8 |
|---|
| 6 | c | 0 | 9 | 9 | 9 |
|---|
DT[,sum(v),x][V1<20]
DT[,sum(v),x][order(-V1)]
print(DT[,z:=42L])
print(DT[,z:=NULL])
print(DT["a",v:=42L])
x y v z
1: a 1 1 42
2: a 3 2 42
3: a 6 3 42
4: b 1 4 42
5: b 3 5 42
6: b 6 6 42
7: c 1 7 42
8: c 3 8 42
9: c 6 9 42
x y v
1: a 1 1
2: a 3 2
3: a 6 3
4: b 1 4
5: b 3 5
6: b 6 6
7: c 1 7
8: c 3 8
9: c 6 9
x y v
1: a 1 42
2: a 3 42
3: a 6 42
4: b 1 4
5: b 3 5
6: b 6 6
7: c 1 7
8: c 3 8
9: c 6 9
print(DT["b",v2:=84L])
x y v v2
1: a 1 42 NA
2: a 3 42 NA
3: a 6 42 NA
4: b 1 4 84
5: b 3 5 84
6: b 6 6 84
7: c 1 7 NA
8: c 3 8 NA
9: c 6 9 NA
add new column by reference by group
#NB: postfix [] is shortcut to print()
DT[,m:=mean(v),by=x][]
| x | y | v | v2 | m |
|---|
| 1 | a | 1 | 42 | NA | 42 |
|---|
| 2 | a | 3 | 42 | NA | 42 |
|---|
| 3 | a | 6 | 42 | NA | 42 |
|---|
| 4 | b | 1 | 4 | 84 | 5 |
|---|
| 5 | b | 3 | 5 | 84 | 5 |
|---|
| 6 | b | 6 | 6 | 84 | 5 |
|---|
| 7 | c | 1 | 7 | NA | 8 |
|---|
| 8 | c | 3 | 8 | NA | 8 |
|---|
| 9 | c | 6 | 9 | NA | 8 |
|---|
DT[,.SD[which.min(v)],by=x][]
| x | y | v | v2 | m |
|---|
| 1 | a | 1 | 42 | NA | 42 |
|---|
| 2 | b | 1 | 4 | 84 | 5 |
|---|
| 3 | c | 1 | 7 | NA | 8 |
|---|
DT[!.("a")]
DT[!"a"]
DT[!2:4]
DT[x!="b" | y!=3]
DT[!.("b",3)]
| x | y | v | v2 | m |
|---|
| 1 | b | 1 | 4 | 84 | 5 |
|---|
| 2 | b | 3 | 5 | 84 | 5 |
|---|
| 3 | b | 6 | 6 | 84 | 5 |
|---|
| 4 | c | 1 | 7 | NA | 8 |
|---|
| 5 | c | 3 | 8 | NA | 8 |
|---|
| 6 | c | 6 | 9 | NA | 8 |
|---|
| x | y | v | v2 | m |
|---|
| 1 | b | 1 | 4 | 84 | 5 |
|---|
| 2 | b | 3 | 5 | 84 | 5 |
|---|
| 3 | b | 6 | 6 | 84 | 5 |
|---|
| 4 | c | 1 | 7 | NA | 8 |
|---|
| 5 | c | 3 | 8 | NA | 8 |
|---|
| 6 | c | 6 | 9 | NA | 8 |
|---|
| x | y | v | v2 | m |
|---|
| 1 | a | 1 | 42 | NA | 42 |
|---|
| 2 | b | 3 | 5 | 84 | 5 |
|---|
| 3 | b | 6 | 6 | 84 | 5 |
|---|
| 4 | c | 1 | 7 | NA | 8 |
|---|
| 5 | c | 3 | 8 | NA | 8 |
|---|
| 6 | c | 6 | 9 | NA | 8 |
|---|
| x | y | v | v2 | m |
|---|
| 1 | a | 1 | 42 | NA | 42 |
|---|
| 2 | a | 3 | 42 | NA | 42 |
|---|
| 3 | a | 6 | 42 | NA | 42 |
|---|
| 4 | b | 1 | 4 | 84 | 5 |
|---|
| 5 | b | 6 | 6 | 84 | 5 |
|---|
| 6 | c | 1 | 7 | NA | 8 |
|---|
| 7 | c | 3 | 8 | NA | 8 |
|---|
| 8 | c | 6 | 9 | NA | 8 |
|---|
| x | y | v | v2 | m |
|---|
| 1 | a | 1 | 42 | NA | 42 |
|---|
| 2 | a | 3 | 42 | NA | 42 |
|---|
| 3 | a | 6 | 42 | NA | 42 |
|---|
| 4 | c | 1 | 7 | NA | 8 |
|---|
| 5 | c | 3 | 8 | NA | 8 |
|---|
| 6 | c | 6 | 9 | NA | 8 |
|---|
DT1 = data.table(x=c("c", "a", "b", "a", "b"), a=1:5)
DT2 = data.table(x=c("d", "c", "b"), mul=6:8)
DT1
DT2
library(data.table)
DT1[DT2,on=c(x="x")]
Error in `[.data.table`(DT1, DT2, on = c(x = "x")): unused argument (on = c(x = "x"))
update.packages()
Error in contrib.url(repos, type): trying to use CRAN without setting a mirror
rownames(DT)
- ‘1’
- ‘2’
- ‘3’
- ‘4’
- ‘5’
- ‘6’
- ‘7’
- ‘8’
- ‘9’
typeof(colnames(DT))
‘character’
t