Preambles for Reproducible Research

Dec 26, 2018 4 min read R

Replication can mean different things in different disciplines. In studies that use public data sources, replication involves sharing code so that the results are consistent. Here, I share a few chunks of code, in five steps using RStudio, that make a project easy to share. These five steps are the preamble I run before moving on to data management.

For this, you will need to install R here and RStudio here. A copy of this code is available here.

Clean the current workspace and session

The very first step is to remove all objects and plots from the current R workspace and graphics device. For this, I use the following code.

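rm(list = ls()) # remove every object from the workspace
# Close the RStudio plot device only if it is open; dev.off() fails when given NULL or NA.
if ("RStudioGD" %in% names(dev.list())) dev.off(dev.list()["RStudioGD"])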
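
As a rough sketch, the same cleanup can be wrapped in a small function that also works outside RStudio. The name reset_session() and its env argument are hypothetical, used only for illustration; graphics.off() is the base R call that closes every open graphics device and does nothing when none is open.

```r
# Hypothetical helper: clears an environment and closes all graphics devices.
reset_session <- function(env = globalenv()) {
  # Remove every visible object in the chosen environment (global by default).
  rm(list = ls(envir = env), envir = env)
  # Close all open graphics devices; harmless when none is open.
  graphics.off()
  invisible(NULL)
}

# Example on a scratch environment, so the real workspace is left alone.
scratch <- new.env()
assign("x", 1:10, envir = scratch)
reset_session(scratch)
length(ls(envir = scratch))
```

Calling reset_session() with no argument clears the global workspace, like rm(list = ls()) at the top of a script.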

Install and load R packages

The second step is to install the required R packages (if they are not already in R’s installed-package directory) and load them. In the following code, I first create a function called load_packages(). This function takes a vector of package names as its argument, checks whether each package is already installed, and installs any missing ones from CRAN. I always call load_packages("rstudioapi"), because rstudioapi is a mandatory package for my workflow. Then I call load_packages() for the other required packages, which depend on the project. In the following example I use the devtools, haven, and Hmisc packages.

load_packages <- function(pkg) {
  # Find the requested packages that are not installed yet.
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg))
    install.packages(new.pkg, dependencies = TRUE,
                     repos = "http://cran.us.r-project.org")
  # Load every package; returns a named logical vector (TRUE = loaded).
  sapply(pkg, require, character.only = TRUE)
}
load_packages("rstudioapi") # This is a mandatory package.
load_packages(c("devtools", "haven", "Hmisc")) # These packages depend on the project.
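
The check for missing packages can also be tried on its own, without installing anything. The sketch below is only an illustration: missing_packages() is a hypothetical name, and it uses requireNamespace() instead of installed.packages(), which is a different way to ask the same question.

```r
# Hypothetical helper: report which packages are not installed, without installing them.
missing_packages <- function(pkg) {
  # requireNamespace() tries to load each namespace and returns TRUE on success.
  installed <- vapply(pkg, requireNamespace, logical(1), quietly = TRUE)
  pkg[!installed]
}

# "stats" ships with R; the second name is made up and should not exist.
missing_packages(c("stats", "notARealPackage12345"))
```

An empty result means every requested package is already available.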

Set up the working directory

Setting up the working directory is my third step. It is very important because my working directory will usually be different from my coauthor’s or referee’s working directory. So, I first save the R script in the desired location manually, and I advise my coauthors to do the same. Then I use the following code, which detects where the R script is saved and sets that location as the working directory. Note that this chunk of code uses the rstudioapi package, which is why it is a mandatory package for my workflow.

path <- dirname(rstudioapi::getActiveDocumentContext()$path)
setwd(path)
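
Because getActiveDocumentContext() only works inside RStudio, a fallback can help when the script is run some other way. The following is a minimal sketch under that assumption; script_dir() is a hypothetical helper, and falling back to getwd() is just one possible choice.

```r
# Hypothetical helper: find the script's folder in RStudio, else fall back to getwd().
script_dir <- function() {
  in_rstudio <- requireNamespace("rstudioapi", quietly = TRUE) &&
    rstudioapi::isAvailable()
  if (in_rstudio) {
    doc_path <- rstudioapi::getActiveDocumentContext()$path
    # An unsaved document has an empty path, so only use it when it is non-empty.
    if (nzchar(doc_path)) return(dirname(doc_path))
  }
  getwd()
}

path <- script_dir()
setwd(path)
```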

The Folders

I never right-click to create new folders. I always create folders with code, which helps me track what I did in a logical manner. The following code creates the folders directly in the working directory. I usually like to have a folder called rawdata to hold the data downloaded from the internet, and another folder called outcomes to save my final dataset for analysis, plots, and tables. Sometimes I create folders within a folder, such as rawdata/beadataset.

dir.create(file.path(path, "rawdata"))
dir.create(file.path(path, "outcomes"))
dir.create(file.path(path, "rawdata/beadataset"))
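
When the script is run a second time, dir.create() warns that the folders already exist. One way to keep reruns quiet is sketched below; ensure_dir() is a hypothetical helper, and the temporary base folder stands in for path so the example does not touch a real project.

```r
# Hypothetical helper: create a folder (and any missing parents) without
# warning when it already exists. Returns TRUE if the folder exists afterwards.
ensure_dir <- function(...) {
  target <- file.path(...)
  dir.create(target, recursive = TRUE, showWarnings = FALSE)
  dir.exists(target)
}

# Demonstrated in a temporary folder; in the workflow above, base would be `path`.
base <- tempfile("project_")
ensure_dir(base, "rawdata", "beadataset")
ensure_dir(base, "outcomes")
ensure_dir(base, "outcomes") # the second call changes nothing
```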

Getting data

Never download the data manually. If possible, always provide a download link and use a script to download the data. And never touch the data. It may seem counterintuitive, but I never open data in Excel. If I do open data in Excel, I make sure I don’t save it, and if I have to play around with the data, I do that in a separate folder and delete the copies ASAP.

Consider the following example. GDP by state by year is available from the Bureau of Economic Analysis website. These data have a stable link (the link doesn’t change over time) and their content is consistent. The data can be downloaded in .zip format, and again I write a script to unzip them. The script checks whether the rawdata folder contains the file gdpstate_naics_all_C.zip. If the file does not exist, the script downloads it.

gdpbystatebyyear <- "https://www.bea.gov/regional/zip/gdpstate/gdpstate_naics_all_C.zip"
if (!file.exists("rawdata/gdpstate_naics_all_C.zip")) { # get the zip file only if it is missing
  download.file(gdpbystatebyyear,
                destfile = "rawdata/gdpstate_naics_all_C.zip", mode = "wb")
}
# Extract the archive into rawdata/beadataset.
unzip("rawdata/gdpstate_naics_all_C.zip", exdir = file.path(path, "rawdata", "beadataset"))
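
The check-then-download pattern repeats for every data source, so it can be pulled into one function. This is only a sketch: fetch_once() is a hypothetical name, and creating the destination folder on the fly is an added convenience rather than something the code above does.

```r
# Hypothetical helper: download `url` to `destfile` only when the file is missing.
fetch_once <- function(url, destfile) {
  if (file.exists(destfile)) {
    return(invisible(FALSE)) # file already present, nothing downloaded
  }
  # Make sure the destination folder exists before downloading.
  dir.create(dirname(destfile), recursive = TRUE, showWarnings = FALSE)
  download.file(url, destfile = destfile, mode = "wb")
  invisible(TRUE) # download.file() was called
}

# Usage with the BEA link from the text (needs internet access, so left commented out):
# fetch_once("https://www.bea.gov/regional/zip/gdpstate/gdpstate_naics_all_C.zip",
#            "rawdata/gdpstate_naics_all_C.zip")
```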

Consider another example. If the data are not available on the web, I can share them via my Google Drive. I upload the zip file to Google Drive and get the public shareable link. The object gdrivepublic holds the public shareable link. As in the previous chunk of code, I check whether the data already exist and download them only if they do not.

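gdrivepublic <- "https://drive.google.com/uc?authuser=0&id=1AiZda_1-2nwrxI8fLD0Y6e5rTg7aocv0&export=download"

if (!file.exists("datafromGoogleDrive.zip")) { # get the zip file only if it is missing
  # gsub() turns an "open?" share link into a direct-download "uc?export=download&" link;
  # the link above is already in "uc?" form, so it passes through unchanged.
  download.file(gsub("open\\?", "uc\\?export=download\\&", gdrivepublic),
                destfile = "datafromGoogleDrive.zip", mode = "wb")
}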
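
To see what the gsub() call does to a share link, the rewrite can be run on its own. In this sketch, to_direct_link() is a hypothetical helper and FILE_ID is a placeholder rather than a real file; it assumes Google Drive still serves downloads from the uc?export=download form used in the code above.

```r
# Hypothetical helper: turn an "open?id=..." share link into a direct-download link.
to_direct_link <- function(link) {
  # fixed = TRUE treats "?" and "&" literally, so no regex escaping is needed.
  sub("open?", "uc?export=download&", link, fixed = TRUE)
}

# FILE_ID is a placeholder, not a real Google Drive file.
to_direct_link("https://drive.google.com/open?id=FILE_ID")
# "https://drive.google.com/uc?export=download&id=FILE_ID"
```

A link that is already in the uc? form contains no "open?" and is returned unchanged.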
