Tuesday, September 18, 2007

Dealing with large data sets in R

One of the major drawbacks of R is its inability to deal with large data sets. R relies on system memory to handle data, or objects in general. Since physical memory is limited, R is limited in the size of the objects it can work with. The problem has been discussed many times on the R mailing list since the first release of R. One suggestion is to use SQL to handle large objects and thus avoid reading the data into memory. But using SQL means learning another language, which is not pleasant when the learning curve of R is already steep.

This year, Roger Peng wrote a package called filehash that solves the problem. The idea behind filehash is to dump a large data set or object to the hard drive and assign an environment name to the dumped object. You can then access the database through that environment. The whole procedure avoids using memory to hold the large object. The physical limitation of filehash is, of course, the size of the hard drive, which is less of a problem given that most machines now come with fairly large hard drives.

filehash can handle many types of objects. I will only show example code for using filehash with a data object, because most of the time the large object is a data set. For more detailed usage, check the manual here.

# example code: dump the data frame into a filehash database on disk
library(filehash)

dumpDF(read.table("large.dat", header = TRUE), dbName = "db01")
env01 <- db2env(db = "db01")  # access the database through this environment

The first argument of dumpDF() is a data object. Read the data in within the dumpDF() call, so R never keeps a copy of it in memory. Space saved! The large data set "large.dat" can now be accessed through the environment env01. To access it, we use with() and refer to the data by its variable names. Suppose we want to run a linear regression of "y" on "x". Note that if you assign the result of read.table() to an object name, memory will hold a copy of the data, which is not what we intend.

fit <- with(env01, lm(y~x))

Or we can just do simple data management.

with(env01, mean(y))
with(env01, y[1] <- 2)
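
As a quick check, here is a minimal sketch of inspecting what ended up in the environment, assuming the "db01" database built above:

ls(env01)  # lists the variables (columns of "large.dat") available through env01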

8 comments:

Tal Galili said...

Hi.

Another very useful command (in the filehash package) is dbLoad(), which works much like attach().

(With dbLoad(), the objects are attached, but they are kept stored on the local hard disk.)
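
For reference, a minimal sketch of that approach, assuming the "db01" database created with dumpDF() in the post already exists on disk:

library(filehash)
db <- dbInit("db01")  # open the existing database on disk
dbLoad(db)            # attach its objects to the current environment, much like attach()
mean(y)               # "y" is usable without reading the whole data set into memory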

Yu-Sung Su said...

Thanks Tal,

Nice to know a new handy function in filehash.

But for programming purposes, I would use with() to avoid mixing up multiple data sets.
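
For example, a sketch with two hypothetical databases ("db01" and "db02"), where the same variable names do not clash:

library(filehash)
env01 <- db2env(db = "db01")
env02 <- db2env(db = "db02")

fit1 <- with(env01, lm(y ~ x))  # uses y and x from db01 only
fit2 <- with(env02, lm(y ~ x))  # same names, different data, no conflict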

jfolson said...

Doesn't this still require being able to fit the whole dataset into memory? At best, it seems like it's still loading entire columns into memory.

Hyo said...

thanks, Yu-sung :)

bhains said...

Hey,

How can I merge other datasets into this bigger dataset?

Thanks

Xianjun said...

I agree with jfolson; it still takes time to load the data into memory, since you use read.table("big.data.set"). That is the most time- and memory-consuming step if the data file is large (e.g., >1 GB).

Any idea for this problem? Thanks

Mearsault said...

Thanks all, really helpful.

David said...

hello everybody,

Could I merge two large databases? How could I do that using this package?

Thanks in advance.