Thursday, September 20, 2007

filehash application (I): store multiple data sets

If you have a large data set, reading it in can use up most of your memory. Say we have R read the data set in, impute its missing data, and store the imputed data set. R then holds two copies of the data set, and the imputed one is even bigger than the first. If R's memory usage is already near the limit after reading in the data, it is almost impossible to store a second data set of equal or greater size. The standard trick is to remove the first data set to free up memory for the second one. That does not work here, because you need the first data set to impute the second. I have a blog entry here which shows that we can use filehash to dump the data set to the hard drive and still access it without keeping it in memory.

Now consider an MCMC procedure, or just a simple loop, that creates a new data set in every iteration. Our goal is to store every data set so that we can compare each of them with the original one.

Here is some example code that uses filehash to do this:

library(filehash)
setwd("C:/")
# create a filehash database on disk
dbCreate("swapdata")
# open a handle to the database
db <- dbInit("swapdata")
gc() # begin memory count step 1
# read in the source data
test.data <- matrix(rnorm(1000 * 100), 1000, 100) # one draw per cell, no recycling
gc() # memory usage increase
# insert data into dummy space db
dbInsert(db, "testdb", test.data)
rm(test.data) # delete the data set
gc() # memory usage back to step 1
dbExists(db, "testdb") # check if testdb exists
n.chain <- 3
for (i in 1:n.chain){
  # create a copy of the old data set and add noise
  imp <- db$testdb + matrix(rnorm(1000 * 100), 1000, 100)
  gc() # memory usage increases
  dbname <- paste("imp", i, sep="")
  dbInsert(db, dbname, imp) # insert imp into db
  rm(imp)
  gc() # memory usage back to step 1
  print(dbExists(db, dbname)) # check that imp1-imp3 exist
}
# now we do not have any data stored in the memory

# but can we still examine the data sets? YES!!
gc() # memory monitor again (same as step 1)
dbList(db) # how many data sets do we have?
apply(db$imp1, 2, mean)
apply(db$imp2, 2, mean)
apply(db$imp3, 2, mean)
gc() # memory usage back to step 1
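
The comparison with the original data set can be sketched along the same lines. The following is only a minimal sketch, not part of the example above: the database name "comparedata", the use of a temporary directory, and the choice of column means as the summary are all assumptions made for illustration. The idea is to fetch one imputed data set at a time and keep only a small summary of it in memory.

```r
library(filehash)

# Hypothetical database in a temporary directory, so the sketch is self-contained
db.path <- file.path(tempdir(), "comparedata")
dbCreate(db.path)
db <- dbInit(db.path)

# Store an original data set and three noisy copies, as in the example above
set.seed(1)
dbInsert(db, "testdb", matrix(rnorm(1000 * 100), 1000, 100))
for (i in 1:3) {
  dbInsert(db, paste("imp", i, sep = ""),
           db$testdb + matrix(rnorm(1000 * 100), 1000, 100))
}

# Column means of the original data set, computed once
orig.means <- colMeans(dbFetch(db, "testdb"))

# Fetch one imputed data set at a time and keep only a small summary:
# the largest absolute difference in column means from the original
imp.names <- grep("^imp", dbList(db), value = TRUE)
max.diff <- sapply(imp.names, function(key) {
  max(abs(colMeans(dbFetch(db, key)) - orig.means))
})
print(max.diff)

dbUnlink(db) # remove the temporary database from disk
```

Only one full data set is in memory at any time inside the sapply() call; the result is a named vector with one number per imputed data set.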

4 comments:

Ivan said...

Hello, your example is very useful for understanding the filehash package!!!

This example uses data created in R, but if I can create it in R, I can manipulate it without using filehash functions. So I have a question: how can I read data from Access (for example) using filehash functions?

Thanks a lot for your help!!!

Yu-Sung Su said...

Hi,

"filehash" is not a package for reading data into R. It just creates a link between data on the hard drive and R. So I guess your real question is how to read Access data in R. Currently, there is no package I know of that can do this. I would recommend using the Stat/Transfer application (http://www.stattransfer.com/) to convert the Access data into another R-readable format, e.g., Stata or CSV.

After this, you can follow the instructions in the entry below to read in the data without consuming too much memory.

http://yusung.blogspot.com/2007/09/dealing-with-large-data-set-in-r.html
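
A minimal sketch of such a workflow, assuming the data has already been exported to CSV: the file name "export.csv", the database name "exportdb", and the columns id and score are hypothetical, and a tiny file is written first only so that the sketch runs on its own. Note that read.csv() still reads the whole file into memory once; the saving comes afterwards, when the columns are loaded from disk only as they are used.

```r
library(filehash)

# Hypothetical CSV export; a small file is written here only so the sketch runs
csv.file <- file.path(tempdir(), "export.csv")
write.csv(data.frame(id = 1:5, score = c(2.5, 3.1, 4.8, 1.9, 3.7)),
          csv.file, row.names = FALSE)

# Dump the data frame into a filehash database; each column is stored under its own key
db.path <- file.path(tempdir(), "exportdb")
dumpDF(read.csv(csv.file), dbName = db.path)

# Attach the database as an environment; a column is read from disk when accessed
env <- db2env(db = db.path)
mean.score <- with(env, mean(score))
print(mean.score)
```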

Ivàn said...

Thanks, what I really needed was the function dumpDF(), which I found in another post.

Marco Antonio Mendoza said...

Hi,

Indeed, I would like to use filehash to work with large txt files in R. I thought of the function dumpDF as a way to access my txt files (using read.table) without having to load them into memory. My problem is knowing whether I need to create a database in any case. What I did precisely is:
dumpDF(read.table("large.txt", header=F), dbName="myfile")
env01 <- db2env(db="myfile")

Then I thought that I could directly use env01 to access this data file, but this does not work. I got the following error:
Error in rep("XXX", nrow(data)) : invalid 'times' argument

Could you give me a hint to make it work?