This lesson is still being designed and assembled (Pre-Alpha version)

Putting it all together

Overview

Teaching: 45 min
Exercises: 50 min
Questions
  • How do RMarkdown, Git, and the Open Science Framework work together to produce a reproducible data analysis?

Objectives
  • To produce a simple example of a reproducible workflow

[Figure: workflow]

Putting it together

We’ll now put together everything that we’ve learned and practise the workflow in its entirety. As an example project we’ll build on the analysis of the Old Faithful data that was started in episode 3.

Prepare and plan

Step 1: Create and organise RStudio project

Create a brand new project (in RStudio cloud or RStudio local) using the instructions in episode 2. Call the project old_faithful. Place the project somewhere convenient for you, e.g., Desktop/old_faithful.

Steps 2 and 3: Set up Git and GitHub

Initialize a local Git repository (in RStudio cloud or RStudio local) using the instructions in episode 4. You should call the repository old_faithful.

Copy the old_faithful_updated.Rmd analysis from episode 3.

Rename old_faithful_updated.Rmd -> old_faithful_eda.Rmd. This is to better reflect what the document is about (EDA = exploratory data analysis) and to differentiate it from other files that we will create.

If you’re using RStudio cloud you can’t copy files between projects. Instead, upload the solutions version available here.

Make sure you’re ‘ignoring’ PDF files in your .gitignore file.

Commit and push this.
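The rename and the .gitignore rule can also be done from the R console. The following is a minimal sketch, not part of the original lesson materials: it works inside a temporary directory with a stand-in file so that it does not touch a real project, and the `*.pdf` pattern is one common way to ignore PDF files.

```r
# Sketch only: a temporary directory stands in for the old_faithful project
project_dir <- file.path(tempdir(), "old_faithful")
dir.create(project_dir, showWarnings = FALSE)

old_path <- file.path(project_dir, "old_faithful_updated.Rmd")
new_path <- file.path(project_dir, "old_faithful_eda.Rmd")

# Stand-in for the file copied from episode 3 (placeholder content)
writeLines(c("---", "title: \"Old Faithful\"", "---"), old_path)

# Rename old_faithful_updated.Rmd -> old_faithful_eda.Rmd
file.rename(old_path, new_path)

# Add a rule ignoring PDF files, unless .gitignore already has one
gitignore <- file.path(project_dir, ".gitignore")
existing <- if (file.exists(gitignore)) readLines(gitignore) else character(0)
if (!"*.pdf" %in% existing) {
  writeLines(c(existing, "*.pdf"), gitignore)
}
```

In a real project you would replace `project_dir` with the path to your own old_faithful project and skip the stand-in file.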

Step 4: Preregister the analysis on the Open Science Framework.

We’re now going to write down our analysis plan and preregister a simple hypothesis. This will be done in three steps:

  1. Create an Open Science Framework (OSF) project.
  2. Write an analysis plan (a plan will be provided).
  3. Upload documents and preregister the project.

1. Create an OSF project

  1. Sign up, if you haven’t already, for an OSF account.
  2. On the dashboard click Create new project. Call it ‘Old Faithful’ and then Go to project.

2. Create hypothesis and analysis plan

Our hypothesis is:

There are 2 distinct types of eruption from Old Faithful: 1) short frequent eruptions, 2) long infrequent eruptions.

We shall test it by clustering the data using a Gaussian mixture model (more information here) with 1 and 2 components. We will take the number of components to be that of the model with the smallest Integrated Complete Data Likelihood (ICL) criterion. The models will be fitted using the R package Rmixmod.
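To give a feel for what this plan looks like in code, here is a rough sketch rather than the analysis provided with the lesson. It assumes the data are R’s built-in `faithful` data set (eruption duration and waiting time) and that Rmixmod is installed; the slot names used to read the result are worth checking against `?mixmodCluster`.

```r
# Assumption: the lesson's data match R's built-in `faithful` data set
geyser_data <- faithful
candidate_components <- 1:2  # the plan compares 1 and 2 components

if (requireNamespace("Rmixmod", quietly = TRUE)) {
  library(Rmixmod)

  # Fit Gaussian mixtures for each candidate size and rank them by ICL;
  # Rmixmod treats the lowest criterion value as the best model
  fit <- mixmodCluster(
    data = geyser_data,
    nbCluster = candidate_components,
    criterion = "ICL"
  )

  best <- fit@bestResult
  cat("Components selected by ICL:", best@nbCluster, "\n")
  cat("ICL value of selected model:", best@criterionValue, "\n")
} else {
  message("Rmixmod is not installed; skipping the model fit.")
}
```

The fitting procedure involves random initialisation, so the criterion values can vary slightly between runs.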

(Don’t worry if you don’t understand this - all the materials will be provided!)

  1. Download the analysis plan here
  2. Commit and push changes.
  3. Knit both the analysis_plan.Rmd and old_faithful_eda.Rmd files to PDF.
  4. Upload both analysis_plan.pdf and old_faithful_eda.pdf documents to the OSF project. To do this navigate to Files, click on OSF Storage (United States). The Upload button should appear and you can use this to upload files.
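Knitting is usually done with the Knit button in RStudio, but it can also be scripted. The sketch below assumes the two .Rmd files sit in the current working directory, that the rmarkdown package is installed, and that a LaTeX distribution is available for PDF output; files that are missing are skipped rather than treated as errors.

```r
# The two documents named in the steps above
rmd_files <- c("analysis_plan.Rmd", "old_faithful_eda.Rmd")

# Names of the PDF files that knitting is expected to produce
pdf_files <- sub("\\.Rmd$", ".pdf", rmd_files)

for (rmd in rmd_files) {
  if (!file.exists(rmd)) {
    message("Skipping ", rmd, ": not found in ", getwd())
    next
  }
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    message("Skipping ", rmd, ": the rmarkdown package is not installed")
    next
  }
  # Render to PDF regardless of the output format set in the YAML header
  rmarkdown::render(rmd, output_format = "pdf_document")
}
```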

3. Preregister plan and project.

The OSF instructions for registering projects can be found here. Specific instructions for this project are given here:

  1. Click Registrations > New registration
  2. Select Open-Ended Registration and then select Create draft.
  3. Fill in the required Metadata:
    1. Description: An investigation into the Old Faithful geyser
    2. License: CC-By Attribution 4.0 International.
    3. Subjects: Physical Sciences and Mathematics.
  4. Click Next then fill in the Summary:
    1. Summary: The attached file analysis_plan.pdf contains our project description, hypothesis and analysis plan.
    2. Select the analysis_plan.pdf document.
  5. Now click Review. Review your submission and, when you’re satisfied, click Register.
  6. You can now click Make registration public immediately then Submit.

The registration is now pending approval by the administrators of the project (i.e., you). If you navigate back to the Registrations tab of the project and click on the link next to Pending registration, it will bring you to your registration page. This contains the information you entered when registering the project, as well as a snapshot of the project contents, i.e., analysis_plan.pdf and old_faithful_eda.pdf.

You will get an email asking you to approve or cancel this preregistration. You should cancel this preregistration after the course.

Analysis and write

Steps 5 and 6: Perform analysis and commit changes

  1. Download the report template from here.
  2. You will need to install Rmixmod. You will be asked Do you want to install from sources the package which needs compilation? (Yes/no/cancel) - choose no.
  3. Follow the instructions in the report template. Commit and push after every couple of completed tasks, so that you commit and push several times over the course of the analysis. This is good practice.
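If you prefer the R console to the RStudio Git pane, the repeated commit-and-push step can be wrapped in a small function. This is a sketch only: `commit_and_push` is a hypothetical helper, not something provided by the lesson, and it assumes that `git` is on your PATH, that the working directory is the project repository, and that a remote has already been set up as in episode 4.

```r
# Hypothetical helper: stage the named files, commit them, and optionally push
commit_and_push <- function(files, message, push = TRUE) {
  stopifnot(is.character(files), length(files) > 0)
  stopifnot(is.character(message), length(message) == 1, nzchar(message))

  # shQuote() protects file names and the message from the shell
  system2("git", c("add", shQuote(files)))
  system2("git", c("commit", "-m", shQuote(message)))
  if (push) {
    system2("git", "push")
  }
  invisible(TRUE)
}

# Example call (commented out so that nothing is committed by accident):
# commit_and_push("report_template.Rmd", "Complete first two report tasks")
```

Short, descriptive commit messages make it easier to find a particular stage of the analysis later.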

A completed report is available here should you need help.

Publish PDF

Steps 7 and 8:

  1. Knit the completed report_template.Rmd to a PDF.
  2. Upload the PDF to the OSF project.

You have now successfully completed a fully reproducible piece of preregistered data analysis!

Both the preregistration and the final analysis report are now publicly available on the OSF.

Your code is available for anyone to use from your GitHub profile.

Conclusions

That concludes the course. In this last episode we have brought together all the elements learned in the previous episodes and applied them to an example analysis of the Old Faithful dataset. We used RStudio to create a project (episode 2) to contain our Old Faithful analysis. Then we performed some exploratory data analysis and wrote an analysis plan using RMarkdown (episode 3). In episode 1 we learnt about preregistration, so we preregistered our hypothesis with the OSF. We performed our intended analysis, backing up along the way using version control (episode 4). We then published our report, along with our preregistered analysis plan, on the OSF. In addition, because we used version control, our code was already published and available for use by others on GitHub.

Key Points

  • Open Science Framework (OSF) is a tool for organising and preregistering projects and sharing them on the web

  • RStudio, RMarkdown, Git and GitHub can be combined to create a reproducible data analysis workflow