One of the major drawbacks of R is its inability to deal with large data sets. R keeps data, and objects in general, in system memory. Since physical memory is limited, R can only handle objects that fit into it. The problem has been discussed many times on the R mailing list since the first release of R. One suggestion is to use SQL to deal with large objects and so avoid reading the data into memory. But using SQL means you have to learn another language, which is not that pleasant when the learning curve of R is already steep.
This year, Roger Peng wrote a package called filehash that solved the problem. The rationale of filehash is to dump a large data set or object onto the hard drive and assign an environment name to the dumped object. You can then access the database through the assigned environment. The whole procedure avoids keeping the large object in memory. The limit of filehash is, of course, the physical size of the hard drive, which is less of a problem given that most machines now come with fairly large hard drives.
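To make the idea concrete, here is a minimal sketch using filehash's basic key-value functions (dbCreate(), dbInit(), dbInsert()) together with db2env(). It assumes the filehash package is installed. The database name sketch_db and the key big are made up for illustration, and the database is written to a temporary directory.

```r
library(filehash)

# Hypothetical database location, used for illustration only
db_path <- file.path(tempdir(), "sketch_db")

dbCreate(db_path)        # create the database file on disk
db <- dbInit(db_path)    # open a handle to the database

# Store a numeric vector on disk under the key "big"
dbInsert(db, "big", rnorm(1e5))

# Map the database to an environment; its names are the database keys
env <- db2env(db)

# Referring to `big` inside with() fetches the object from the database
m <- with(env, mean(big))
```

Only the small result m is bound to a name in the workspace; the vector itself lives in the database file.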
filehash can handle many types of objects. The example below deals with a data set, because most of the time the large object is a data set. For more detailed usage, check the manual here.
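# Example code
library(filehash)
# Read the data inside dumpDF() so that no named copy stays in the workspace
dumpDF(read.table("large.dat", header = TRUE), dbName = "db01")
# Map the database to an environment whose variables are the data columns
env01 <- db2env(db = "db01")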
The first argument of dumpDF() is a data object. We call read.table() inside dumpDF() instead of assigning its result to an object name first. If you assign it to a name, a copy of the data stays in memory, which is not what we want. The data still passes through memory while it is read and dumped, but no named copy is left behind, so that memory can be reclaimed. Space saved! The large data set "large.dat" can now be accessed through the environment env01. To access it, we use with() and refer to the variables by name. Suppose we want to run a linear regression of y on x.
fit <- with(env01, lm(y~x))
Or we may just want to do some simple data management.
with(env01, y <- 2)
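For readers without a large.dat at hand, the same steps can be tried on simulated data. This is only a sketch, assuming the filehash package is installed: the data frame toy, the database name toy_db, and the coefficients used to simulate y are all made up for illustration, and the database is written to a temporary directory.

```r
library(filehash)

# Simulate a small data set (a stand-in for the large file)
set.seed(1)
toy <- data.frame(x = rnorm(100))
toy$y <- 1 + 2 * toy$x + rnorm(100)

# Hypothetical database location, used for illustration only
db_path <- file.path(tempdir(), "toy_db")

# Dump the data frame to disk, one database entry per column
dumpDF(toy, dbName = db_path)

# Drop the in-memory copy; from here on the data is reached via the database
rm(toy)

# Map the database to an environment and fit the regression through it
env_toy <- db2env(db = db_path)
fit_toy <- with(env_toy, lm(y ~ x))
print(coef(fit_toy))
```

With a toy data set there is no memory to save, of course. The point is only to check the workflow end to end before pointing it at a really large file.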