2012-10-26

.Rhistory

Over the last couple of years I've been using R every now and then. When I stumbled upon an interesting topic, and I managed to get a hold of a data set, I tried to make sense of it using R.

It's a bit like the Stat Labs approach: I might get started by a newspaper article that makes a claim about some reality - birth rates rising, a political horse race being "tied" or not close at all, a species approaching extinction, ...

I'm used to doubting those claims. Not because of my superior knowledge in the domain of interest (I'm not foolish enough to believe it's there), but because I know of the many fallacies out there.

Take a look at Luis debunking a myth propagated by New Zealand's Chief Coroner about "Suicide statistics and the Christchurch earthquake" for an example of what I'm talking about.

So I try to get a data set that can be used. Sometimes this is easy, sometimes it involves some web scraping. In most cases the raw data need some fiddling. There are situations where I find more than one source, so which source is more useful for me? To cut a long, long, long story short: there's a lot to be done before I can be sure the data will allow me to evaluate the claim that got me started.

One more point: While exploring the data availability and the usefulness of the data I can get, I can't be sure that all this will lead to anything. So, because I'm lazy, and because time is the scarcest resource, I just open my R.

I try to read the data from the web (read.csv, read.table). Maybe it works, maybe it doesn't. Maybe RCurl can help me, XML, httr or whatever...
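The "maybe it works, maybe it doesn't" step can be sketched roughly like this. The helper name read_or_null is made up for illustration, and the temporary file with toy numbers only stands in for a real URL so the snippet runs offline; read.csv accepts a URL in the same place.

```r
# Hypothetical helper (not from the original workflow): try to read a
# CSV from a path or URL and return NULL instead of stopping on failure.
read_or_null <- function(src, ...) {
  tryCatch(
    read.csv(src, stringsAsFactors = FALSE, ...),
    error = function(e) {
      message("Could not read ", src, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Toy data in a temporary file, standing in for a remote source.
demo_file <- tempfile(fileext = ".csv")
writeLines(c("year,births", "2010,100", "2011,110"), demo_file)

foo <- read_or_null(demo_file)
if (is.null(foo)) {
  message("No luck, time to try another source or another package.")
} else {
  str(foo)
}
```

Returning NULL rather than stopping keeps an exploratory session going, so a second source can be tried right away.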

Finally I have a data.frame, usually called foo or foobar, that seems to be useful. If not - well, let's call it a day, or if there's another couple of minutes, give it one more try. But let's assume I ended up with a useful data.frame that goes by the name of foo. I type something like

data_of_interest <- foo
save(data_of_interest, file = "data_of_interest.Rdata")

That's where I quit R. That's where the .Rhistory comes in. I type some commands:

rm .RData # I won't need the zillions of temporary results
cp .Rhistory Rhistory-YYYY-MM-DD # I want a backup. Something might go wrong.
mv .Rhistory data_of_interest.R # I want something source-able

Then I delete all the wrong turns from data_of_interest.R, refactor what's left, and source the file in a separate R session. Finally I compare the data.frame arrived at by the script to the one in data_of_interest.Rdata
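The final comparison might look something like the sketch below. It is only an illustration: the toy data.frame, the temporary file location, and the names rebuilt and saved_env are assumptions, and in a real session rebuilt would be whatever sourcing data_of_interest.R produces.

```r
# Toy stand-in for the data.frame saved at the end of the interactive session.
data_of_interest <- data.frame(year = c(2010, 2011), births = c(100, 110))
saved_file <- file.path(tempdir(), "data_of_interest.Rdata")
save(data_of_interest, file = saved_file)

# Stand-in for the object the cleaned-up script rebuilds when sourced.
rebuilt <- data.frame(year = c(2010, 2011), births = c(100, 110))

# Load the saved copy into its own environment so it cannot overwrite
# an object of the same name in the current session.
saved_env <- new.env()
load(saved_file, envir = saved_env)

# all.equal() returns TRUE or a description of the differences,
# so wrap it in isTRUE() to get a plain logical.
same <- isTRUE(all.equal(rebuilt, saved_env$data_of_interest))
if (same) {
  message("Script reproduces the saved data.frame.")
} else {
  print(all.equal(rebuilt, saved_env$data_of_interest))
}
```

all.equal is used here instead of identical because it tolerates tiny numeric differences and reports what differs when the two objects don't match.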

(If the topic is really interesting, and the data are really useful, I might go through several iterations of the above.)

Lately I've tried to be more systematic and collected all those "data_of_interest.R" scripts in a central workspace directory. Some of them will make it into this blog. So, in a way this blog will reflect my personal .Rhistory, or rather the huge number of .Rhistories living on my hard disks. Mostly it'll keep me from reinventing the wheel. But maybe one trick or another might be helpful for someone out there.
