Skip to main content

Migration

Excelling in R

1 September 2014

Recently we have helped researchers wanting to perform a large analysis of data that was in an Excel spreadsheet. We had over 70,000 rows and around 20 columns which required lots of conditional summing and checks to be performed. After trying to use Excel to perform the analysis on my Mac it seemed the conditional summing was particularly slow – it seemed Excel experienced large pauses while it was (I assume) trying to cache the data.

Migrating to R

After a while I thought R (http://www.r-project.org) would be a better choice for this type of work and I would try converting it to a format that R could understand.  The researchers also wanted to perform similar analysis on over 3 million rows of data so it seemed like the way to go.  R is a very powerful package and is designed to efficiently process data and has many libraries of functions available so you will not need to reinvent the wheel.  We needed to convert the spreadsheet to something R could read and Excel can save as a CSV file (comma separated) which worked well to begin with but some characters were not coming out as expected – especially when I needed to run this through another program before passing it to R. The data contained information about Welsh companies and some names and addresses contained “diacritics” – see Wikipedia article on the Welsh orthography.  These characters would then cause errors in the subsequent file reading (such as â) which was just appearing in my editor vi as <89>.

Text encoding

After finding out how Excel encoded its text files, I found that it used the native Mac OS encoding – see Wikipedia article on Mac OS Roman.  Therefore I needed a way of encoding to something which R would understand.  After some googling it seemed the iconv utility would perform this task.  Therefore I ran:

$ iconv -f MAC -t UTF8 < orig.csv > converted.csv

The end of the line…

After running this there was still one more thing to do.  It seems Excel uses just a carriage return to signify new lines (where ^M will appear where a new line should).  Therefore I ran this through the tr utility to translate the carriage returns to new lines.  Therefore the command to convert Excel to something R could understand was:

$ iconv -f MAC -t UTF8 < orig.csv | tr '\r' '\n' > converted.csv

This was then read into R with:

wales.comp <- read.csv("converted.csv",
                       sep=",",
                       header=TRUE,
                       na.strings=c("n.a.", ""))

Once this was read into R the calculation was easily performed and the process can be repeated for the 3 million records as well.