Tuesday, February 13, 2007

Using apply and tapply with Panel and Cross-Sectional Time-Series Data

When observations on several units (e.g., countries) are stacked over time, you have panel, or cross-sectional time-series, data. Other software, such as Stata, has strong support for managing this kind of dataset (e.g. here).

In R, you can do basic data management with loops, but loops can be slow on large datasets. A more compact, and often faster, approach is to use the apply and tapply functions.

Vector manipulation (use tapply)

y.star <- tapply(y, cid, FUN=mean)

This computes the mean of y for each unit (e.g., the country average) and returns one value per unit. cid is the panel identifier (e.g., country code or country name).
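As a minimal sketch of what the tapply call above returns, here is a toy panel. The values, and the names y.bar and y.within, are made up for illustration. It also shows one way to spread the unit means back over the observations with ave, which is handy for within-unit demeaning.

```r
# Toy panel: three countries, three observations each (made-up values).
cid <- rep(c("A", "B", "C"), each = 3)
y   <- c(1, 2, 3, 10, 20, 30, 5, 5, 5)

# One mean per unit; the result is named by the levels of cid.
y.star <- tapply(y, cid, FUN = mean)

# ave() returns the unit mean repeated for every observation,
# in the same order as y (its default FUN is mean).
y.bar <- ave(y, cid)

# Deviation of each observation from its own unit mean.
y.within <- y - y.bar
```

Here y.star has length 3 (one entry per country), while y.bar and y.within have the same length as y.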

Matrix manipulation (use apply + tapply)

FOO <- function(x) tapply(x, cid, FUN=mean)
x.tilde <- apply(x, 2, FUN=FOO)

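This computes the unit means for several predictors at once: x is a matrix with one predictor per column, and x.tilde has one row per unit and one column per predictor. The second argument of apply sets the margin over which the function is applied: 1 = rows, 2 = columns.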
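A small worked example may make the shape of the result clearer. The matrix below and its column names gdp and pop are hypothetical; the point is only that each column is passed to tapply in turn.

```r
# Toy panel: two units, two observations each (made-up values).
cid <- rep(c("A", "B"), each = 2)
x   <- cbind(gdp = c(1, 3, 10, 20),
             pop = c(2, 4, 6, 8))

# Unit means of a single column.
FOO <- function(col) tapply(col, cid, FUN = mean)

# Apply FOO to every column (margin 2) of x.
x.tilde <- apply(x, 2, FUN = FOO)

# x.tilde is a 2 x 2 matrix: rows are units, columns are predictors.
dim(x.tilde)
```

With these values the gdp means are 2 and 15 and the pop means are 3 and 7.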

Lagging a variable

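lagfun <- function(x){
  n <- length(x)
  # shift the series down by one period;
  # the first observation of each unit has no lag, so it is NA
  c(NA, x[-n])
}
# Assumes the data are sorted by unit (in the order of the cid levels)
# and by time within unit; otherwise the result will not line up with y.
y.lag <- unlist(tapply(y, cid, lagfun))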
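Because tapply returns its results grouped by unit, the unlisted lag only lines up with the original rows when the data are already sorted by unit. A sketch of an alternative uses ave, which puts each value back in its original position; the helper name lag1 and the toy data are assumptions for illustration, and the observations must still be in time order within each unit.

```r
# Toy panel with the units interleaved rather than stacked (made-up values).
cid <- c("A", "B", "A", "B", "A", "B")
y   <- c(1, 10, 2, 20, 3, 30)

# Shift a vector down by one position, padding the first slot with NA.
lag1 <- function(x) c(NA, x[-length(x)])

# ave() applies lag1 within each unit and keeps the original row order.
y.lag <- ave(y, cid, FUN = lag1)

cbind(y, y.lag)
```

Each unit's first observation gets NA, and every later observation gets that unit's previous value, even though the rows of the two units alternate.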
