You will learn how to use R in this course. The course objective is not to turn you into a computer programmer. However, you will learn how to write relatively simple R scripts to document your manipulation and analysis of biological data. To achieve this goal, you will need to become proficient with R, RStudio and the Rmarkdown file format. There is more information below.
R is a free programming language that is popular in statistics and data science. It is less widely used by computer scientists because it was designed for data analysis rather than as a general-purpose programming language. R is a tool for the statistical analysis and visualization of data, not a tool for general application development. Its strengths are its capabilities in statistics, data visualization, and workflow documentation and reporting.
Another advantage of R is that there are many robust useR communities in fields such as statistics, bioinformatics, economics, social sciences and digital humanities. R is open source, so there are a huge number of free add-on packages that can help you with your data analyses.
To learn more about R in general, go to the R project home page, https://www.r-project.org/
R is widely available at Purdue. It should be on every lab computer, and it is also available on the Research Computing clusters. In class, we will use the RStudio server and Scholar. However, you may want to install R on your personal computer.
The Comprehensive R Archive Network (CRAN), https://cloud.r-project.org/, is a global network of servers that provide access to R software. Follow the instructions at your chosen CRAN mirror to download R for your platform. It is not necessary to upgrade R every time that there is a new release. In fact, I discourage this. It is OK to be one or two versions behind unless you find that you have a specific need to upgrade, e.g. the latest upgrade fixed a bug that has caused problems for you.
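Before deciding whether to upgrade, it helps to know which version of R you are currently running. The following is a minimal sketch using base R only; the version number in the comparison is an arbitrary threshold chosen for illustration, not a recommendation.

```r
# Print the full version string of the running R installation
print(R.version.string)

# getRversion() returns a version object that can be compared with a string
current <- getRversion()
is_recent_enough <- current >= "3.4.3"  # "3.4.3" is only an illustrative threshold
print(is_recent_enough)
```

If a package or bug fix states a minimum R version, a comparison like this tells you whether an upgrade is actually needed.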
You can use R from the command line, i.e. a terminal window on a Mac or Linux machine, but this is not a user-friendly environment. R has a Graphical User Interface (GUI) named the R.app, but it has a limited number of features. RStudio is an Integrated Development Environment (IDE) that has become the standard for R users. An IDE has many advanced features, but many users only use a subset of these features.
You can install RStudio from their website, https://www.rstudio.com/products/rstudio/download/. Make sure that you choose the FREE desktop version for your operating system. For RStudio, I recommend that you update more frequently. Updating RStudio will not affect your R installation.
Once you have RStudio installed, you should watch this video that provides a tour of RStudio's features, https://www.youtube.com/watch?v=pXd54-vucu0
There are thousands of packages available for R. A package is software that allows R to do new things, or to do old things in new ways. Many packages are available from CRAN. You can browse the CRAN packages by task view, https://cran.r-project.org/web/views/, to find packages that might be useful for your project. However, the most common way to acquire new packages is when you find a new R script that requires them. For example, a script may contain the line library(name_of_package), which loads a package named 'name_of_package'.
If you don't have 'name_of_package' on your computer, R will return an error message:
Error in library(name_of_package) : there is no package called 'name_of_package'
The solution may be to install 'name_of_package' by running install.packages("name_of_package") in the Console.
If the package is available from CRAN, it will be downloaded along with the other packages that it requires. If the package is not available (or you misspell its name), you will get a warning that the package is not available.
Important: the parenthetical note in that warning, (for R version 3.4.3), should not be interpreted to mean that the package is available for another version of R! The package is likely not available at all.
In addition, R may ask if you want to create a personal or user library. Respond yes to this prompt.
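The load, fail, install cycle described above can be avoided by checking first which packages are missing. This is a minimal sketch, not the only way to do it: 'name_of_package' is the placeholder from the text rather than a real package, and the helper name is_missing is hypothetical. The install line is left commented out because it needs an internet connection.

```r
# requireNamespace() returns TRUE or FALSE instead of stopping with an error
is_missing <- function(pkg) !requireNamespace(pkg, quietly = TRUE)

# 'stats' ships with R; 'name_of_package' is a placeholder that should not exist
wanted <- c("stats", "name_of_package")
to_install <- wanted[vapply(wanted, is_missing, logical(1))]
print(to_install)

# Uncomment to install whatever is missing from CRAN:
# install.packages(to_install)
```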
Many packages are not available on CRAN because their developers have chosen not to release them there. Many of these packages are available from GitHub repositories. For example, there are a large number of packages for Open Science that are distributed this way, https://ropensci.org/packages/. There is a way to install packages from GitHub repositories, but that is beyond the scope of this guide.
In addition, most packages for bioinformatics are distributed by Bioconductor, https://www.bioconductor.org/, and Bioconductor has a specific way to install their packages.
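As a rough sketch of what Bioconductor installation can look like: recent Bioconductor releases document the BiocManager package as their installer, while older R versions used a different mechanism, so check the Bioconductor site for the instructions that match your R version. The wrapper name install_bioc and the example package are chosen here for illustration only. Nothing is downloaded until the function is called.

```r
# Sketch: install Bioconductor packages through the BiocManager package.
# Requires an internet connection when the function is actually called.
install_bioc <- function(pkgs) {
  # BiocManager itself is distributed on CRAN
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(pkgs)
}

# Example call, using a package name chosen for illustration:
# install_bioc("Biostrings")
```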
Generally, you can manage your packages from the 'Packages' pane in RStudio, but it is a good idea to know how to do it in the Console as well.
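For the Console route, a few base R functions cover most of what the 'Packages' pane shows. This is a minimal sketch that only reads information and changes nothing on your computer.

```r
# Matrix of installed packages; keep just the name and version columns
installed <- installed.packages()[, c("Package", "Version"), drop = FALSE]
print(head(installed))

# Version of a single package ('stats' ships with every R installation)
print(packageVersion("stats"))

# Folders where R looks for packages, including any personal library
print(.libPaths())
```

The related functions update.packages() and remove.packages() modify your library, so run them deliberately rather than as part of a script.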
Finally, another problem that you may encounter is that your security software may interfere with the package installation process. The package will download normally, but then you will get a warning message:
I don't think that there is an easy fix for this, but try running this in the Console:
A new window will open that allows you to edit the function. Go to line 142 and change this
Click 'Save' in the bottom left of the window. Package installation should now complete normally. Unfortunately, you will need to repeat this for every R session.
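The steps above match a widely circulated workaround for Windows, sketched below. Treat it as an assumption rather than the exact command intended here: it relies on an internal function of the utils package (unpackPkgZip) that exists only on Windows, and the line number you need to edit may differ between R versions.

```r
# Assumption: a commonly cited workaround, not confirmed as this guide's exact command.
# unpackPkgZip is an internal utils function that exists on Windows only,
# so the call is guarded and does nothing elsewhere or in non-interactive sessions.
if (interactive() && .Platform$OS.type == "windows") {
  # Opens an editor on the function; the edit lasts only for the current session
  trace(utils:::unpackPkgZip, edit = TRUE)
  # In the editor, look for the line containing Sys.sleep(0.5) and lengthen
  # the pause, for example to Sys.sleep(2), then save.
}
```

The longer pause gives security software time to finish scanning the downloaded files before R tries to move them.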
A critical goal of this course is to teach you how and why you should carefully and clearly document your data manipulations and analyses. This is a strength of R and RStudio and a shortcoming of a tool like Excel.
Frequently, people write an R script as a plain text file and save it with a .R extension. The file contains lines of code and comments like this:
my_line <- "this is a line of code"
#this is a comment
Not surprisingly, a long R script is tedious to understand, and comments are often sparse. However, the advantage of an R script is that you can run or source the script, and R will execute all lines of code, ignoring the comments. Below is an R script that you can open in the browser.
This R script is for a project to retrieve data for SNPs (single nucleotide polymorphisms) from a database at NCBI. Do not worry about the details of the code. Simply note that there is not a lot of information about the process.
Note, I had to add '.txt' as a file extension so that this file would open in the browser. Normally, the file extension would be '.R'
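To illustrate what it means to run or source a script, here is a minimal sketch that writes the two example lines from earlier to a temporary .R file and then sources it. The temporary file stands in for a script you would normally have saved yourself.

```r
# Write a two-line script to a temporary file with a .R extension
script_path <- tempfile(fileext = ".R")
writeLines(c(
  'my_line <- "this is a line of code"',
  "#this is a comment"
), script_path)

# source() executes every line of code in the file and ignores the comments
source(script_path)
print(my_line)
```

After sourcing, the object my_line exists in your workspace just as if you had typed the code in the Console.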
This is an Rmarkdown version of the same R script. Note that there is narrative text that explains the analysis and the R code is grouped in chunks. This plain text file can be opened and edited with any text editor, but it is fully functional only when opened in RStudio. This file is only slightly more informative than the plain R script because it contains R code and annotations in Rmarkdown, a lightweight markup language similar to LaTeX, HTML or XML.
The Rmarkdown file is an intermediate version of this workflow. For reporting purposes, the Rmarkdown file can be rendered to an HTML, PDF or Word file. I find HTML to be the most useful and easiest to generate. The Rmarkdown file can also be easily converted to a plain R script.
Note, I had to add '.txt' as the file extension. Normally, the file extension would be '.Rmd'. In fact, this file will not behave properly in RStudio until the '.txt' is removed.
This HTML file was created from the Rmarkdown file above by using Knit to HTML in RStudio. All text narrative has been converted from markdown to HTML, including a table of contents. The R code has actually been executed, and the results are shown in the report.
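Knit to HTML can also be approximated from the Console. The sketch below builds a tiny Rmarkdown file with illustrative content and renders it with rmarkdown::render(), assuming the rmarkdown package and pandoc are installed (RStudio normally provides pandoc). If either is missing, the rendering step is skipped.

```r
# Three backticks mark the start and end of a code chunk in Rmarkdown
fence <- strrep("`", 3)

# Build a tiny Rmarkdown file in a temporary location (content is illustrative)
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Example report\"",
  "output: html_document",
  "---",
  "",
  "This narrative text explains the analysis.",
  "",
  paste0(fence, "{r}"),
  "summary(c(1, 2, 3))",
  fence
), rmd_path)

# Render only if the rmarkdown package and pandoc are both available
can_render <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()
if (can_render) {
  html_path <- rmarkdown::render(rmd_path, output_format = "html_document", quiet = TRUE)
  print(html_path)
}
```

The file has three parts: a header between the '---' lines, narrative text, and a code chunk. In the rendered report, the chunk appears together with its output.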
You can also generate a .R file from the Rmarkdown file with the purl function from the knitr package.
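A minimal sketch of that conversion, assuming the knitr package is installed: it writes a small Rmarkdown file with illustrative content to a temporary location, extracts the code with knitr::purl(), and prints the resulting script.

```r
# Three backticks delimit a code chunk in Rmarkdown
fence <- strrep("`", 3)

# A small Rmarkdown file: one line of narrative and one code chunk
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(c(
  "Some narrative text.",
  "",
  paste0(fence, "{r}"),
  "my_line <- \"this is a line of code\"",
  fence
), rmd_path)

# purl() keeps the code chunks and drops the narrative text
if (requireNamespace("knitr", quietly = TRUE)) {
  r_path <- knitr::purl(rmd_path, output = tempfile(fileext = ".R"), quiet = TRUE)
  cat(readLines(r_path), sep = "\n")
}
```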
There is more information about Rmarkdown on the Rmarkdown page of this guide, https://guides.lib.purdue.edu/R4MolecularBiosciences/Rmarkdown