edX Notes: Data Science: R Basics

Updated: Jul 29

While my MS thesis research in a wet lab has taught me a lot, I haven't gotten much exposure to the computational side of things. A simple t-test may be sufficient for my current purposes, but I'd really like to learn the more advanced methods scientists use to understand larger or more complex data sets. That's why I've been auditing related courses on edX.org. Below you'll find some notes I've taken from the HarvardX course 'Data Science: R Basics'.

Getting Started with R

ABOUT R

  • R is a coding language built by statisticians to provide an interactive environment for data analysis

  • One of the fastest growing languages

FEATURES

  • Executable scripts serve as a record of your analysis, which is great for reproducibility!

  • Free, open source

OTHER NOTES

  • RStudio is an interactive, integrated development environment (IDE) for R

  • Base R is what you get when you first download and install R

  • Additional functionality is available as packages

#Package installation for a placeholder package (the name goes in quotes)
install.packages("package_name")

#Loading a package you've already installed (no quotes needed)
library(package_name)

#To view what packages you've installed
installed.packages()

RStudio Key Bindings (Mac)

  • New script: command+shift+n

  • Save current script: command+s

  • Run entire script: command+shift+return

  • Run a single line of script: command+return (while cursor is on that line)

  • Assignment operator ("<-"): option+- (option is the Mac alt key)


Objects in R

WHAT IS AN OBJECT?

  • Includes variables, functions, datasets, etc.

  • Defining an object in R adds it to the workspace

  • Objects have a given 'class', which determines how they are handled

UNDERSTANDING AN OBJECT

  • To see the content of an object

#Option 1 for example_object
example_object

#Option 2 for example_object
print(example_object)

  • To see all objects saved in your workspace

ls()
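To make this concrete, here's a minimal sketch. The name my_number is just a placeholder I made up, not something from the course:

```r
#Define an object with the assignment operator
my_number <- 5

#See its content
my_number
#> 5

#See its class
class(my_number)
#> "numeric"

#Confirm that it now shows up in the workspace
"my_number" %in% ls()
#> TRUE
```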

Vectors

VECTOR BASICS

  • The most basic unit for storing data

  • A vector holds several entries of the same type, and is made using the concatenate function, c()

example_vector <- c(10, 20, 30)

  • To determine the length of a vector (i.e. the number of entries within it)

length(example_vector)

  • To name the components of the vector

example_names <- c("A", "B", "C")
names(example_vector) <- example_names

SUBSETTING

  • Subsetting refers to accessing part of a vector (as opposed to the whole thing)

  • The brackets ("[", "]") are used to accomplish this

#The bracket contents could be the index of one element
example_vector[2]

#Or, a sequence of numbers using the colon (":")
example_vector[1:2]

#Or, another vector
example_vector[c(1, 3)]

#If the elements (objects) inside the vector have names, you can also use the name
example_vector["A"]

VECTOR COERCION

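  • Coercion is how R stays flexible about data class: when entries don't match, it converts them to a common class rather than throwing an error

  • You can also coerce a vector yourself with functions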

as.character(example_vector)
as.numeric(example_vector)

  • NA is used to represent missing data, and will be inserted (with a warning) if you force the data into a class that disagrees with one of the elements

example_vector <- c("1", "b", "3")
as.numeric(example_vector)
#> 1 NA 3

Data Types & Data Frames

SOME DATA TYPES

  • Numeric

  • Character

  • Integer

  • Logical

  • Factors
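Here's a quick sketch of what class() reports for each of these types. The values are arbitrary examples of my own:

```r
class(3.14)
#> "numeric"

class("hello")
#> "character"

#The L suffix tells R to store a whole number as an integer
class(3L)
#> "integer"

class(TRUE)
#> "logical"

class(factor(c("low", "high", "low")))
#> "factor"
```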

FUNCTIONS

  • To see the class of an object

class(example_object)

  • To see the structure of an object

str(example_object)

  • To see the first few rows of a data frame/object

head(example_object)

CATEGORICAL DATA

  • To see the levels (i.e. the categories) of a factor

levels(example_data)
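As a small illustration, with region names I picked arbitrarily, note that the levels are stored once each and sorted alphabetically by default:

```r
#A factor built from a character vector with a repeated entry
example_data <- factor(c("west", "south", "west", "northeast"))

levels(example_data)
#> "northeast" "south" "west"
```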

DATA FRAMES

  • Basically tables where the rows are observations and the columns are variables

  • Useful, as they allow you to combine different data types within one object

  • To create a data frame

data.frame(example_thing)

  • In versions of R before 4.0.0, the data.frame function converts characters to factors by default; to prevent this, set the argument 'stringsAsFactors' to FALSE

data.frame(example_thing, stringsAsFactors = FALSE)
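Here's a minimal sketch of building a data frame from two vectors. The column names and values are made up for illustration, and the $ accessor (which pulls out a single column) isn't covered in my notes above:

```r
#Each argument becomes a column; each position becomes a row
example_frame <- data.frame(name = c("Ann", "Bo", "Cy"),
                            score = c(90, 85, 77),
                            stringsAsFactors = FALSE)

#Different data types live side by side in one object
class(example_frame$name)
#> "character"
class(example_frame$score)
#> "numeric"

nrow(example_frame)
#> 3
```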

Sorting

SORT FUNCTION

  • Sorts a vector in increasing order

example_vector <- c(1, 4, 2, 5, 3)
sort(example_vector)
#> 1 2 3 4 5

ORDER FUNCTION

  • Takes a vector and returns the indices that sort that vector

  • In other words, it gives you a vector of index numbers telling you, in order, which entry goes where to sort the vector from min to max

  • The implication is that you can make an object out of the order, and then use it to sort your vector

example_vector <- c(1, 4, 2, 5, 3)
order(example_vector)
#> 1 3 5 2 4

ordered_example <- order(example_vector)
example_vector[ordered_example]
#> 1 2 3 4 5

WHICH.MAX & WHICH.MIN

  • The index number of the max or min entry of a vector can be found using the appropriate which function

which.max(example_vector)
#> 4

example_vector[4]
#> 5

RANK FUNCTION

  • The rank function returns a vector giving the rank of each element of the original vector (1 = smallest)

example_vector <- c(1, 7, 8, 10)
rank(example_vector)
#> 1 2 3 4
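Since the rank example above is already sorted, it can be hard to tell these functions apart. Here's a sketch running all three on one unsorted vector (the numbers are arbitrary):

```r
x <- c(31, 4, 15, 92, 65)

#sort returns the values themselves, smallest to largest
sort(x)
#> 4 15 31 65 92

#order returns the positions to pull from, in order, to get the sorted vector
order(x)
#> 2 3 1 5 4

#rank returns, for each original entry, where it would land once sorted
rank(x)
#> 3 1 2 5 4
```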

Indexing

  • Logicals can be used to index vectors

  • The sum() function can be used to count the TRUE entries of a logical vector (TRUE = 1, FALSE = 0)
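A minimal sketch of both points, using made-up scores and an arbitrary cutoff of 60:

```r
scores <- c(55, 90, 72, 38)

#A comparison returns a logical vector
passed <- scores > 60
passed
#> FALSE TRUE TRUE FALSE

#Using the logical vector as an index keeps only the TRUE positions
scores[passed]
#> 90 72

#Summing the logical vector counts the TRUE entries
sum(passed)
#> 2
```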

WHICH FUNCTION

  • Gives the indices of the entries of a logical vector that are TRUE

  • Similar to which.max and which.min

  • Useful for condensing info

MATCH FUNCTION

  • Looks for the entries of one vector in another vector and returns the indices needed to access them
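Here's a short sketch of which and match side by side. The letters and the vector name are placeholders of my own:

```r
#which turns a logical vector into the index numbers of its TRUE entries
which(c(FALSE, TRUE, FALSE, TRUE))
#> 2 4

#match reports where each entry of the first vector sits in the second
lookup <- c("a", "b", "c", "d")
match(c("d", "a"), lookup)
#> 4 1

#Those indices can then be used to pull the entries back out
lookup[match(c("d", "a"), lookup)]
#> "d" "a"
```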

%in%

  • If you want to know whether or not an element in one vector is ALSO in another vector

x <- c("a", "b", "c", "d", "e")
y <- c("a", "d", "f")
y %in% x
#> TRUE TRUE FALSE

Basic Data Wrangling

  • Note: dplyr package needed

MUTATE FUNCTION

  • To change the data table by adding a new column or changing an existing one

FILTER FUNCTION

  • Filter data by subsetting rows

SELECT FUNCTION

  • Subsets the data by choosing columns

PIPE OPERATOR (%>%)

  • Allows results from one function to automatically be 'piped' into another

#murders comes from the dslabs package; the rate column must first be added with mutate
murders %>% select(state, region, rate) %>% filter(rate <= 0.71)
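To see all three functions working together without the course data, here's a sketch on a tiny table. The table, its values, and the 1.5 cutoff are all made up for illustration, and it assumes the dplyr package is installed:

```r
library(dplyr)

#Hypothetical data, invented for this example
example_table <- data.frame(state = c("A", "B", "C"),
                            total = c(10, 40, 5),
                            population = c(1000000, 2000000, 1000000),
                            stringsAsFactors = FALSE)

result <- example_table %>%
    mutate(rate = total / population * 100000) %>%  #add a new column
    select(state, rate) %>%                         #keep two columns
    filter(rate <= 1.5)                             #keep rows meeting the condition

result$state
#> "A" "C"
```

Each step receives the table produced by the step before it, so the new rate column exists by the time select and filter run.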

Plots

BASIC PLOTS

plot(x, y)

HISTOGRAMS

  • Graphical summaries of a list of numbers

hist(example_data)

BOXPLOT

  • More terse summary than a histogram, but easier to stack against one another

  • Particularly useful when you want to summarize several variables or strata of a variable

  • The variable and its strata are separated with "~" (variable ~ strata)

boxplot(population~region, data = murders)

Programming in R

BASIC CONDITIONALS

  • if-else is the most common

if(a != 0){
    print(1/a)
} else{
    print("No reciprocal for 0.")
}

  • The ifelse() function takes three arguments: a logical test and two possible answers

  • any() function: returns TRUE if at least one entry of a logical vector is TRUE

  • all() function: returns TRUE if all entries of a logical vector are TRUE
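Here's a minimal sketch of these three functions on one vector. The values are arbitrary, and unlike the if-else statement above, ifelse works on every entry of a vector at once:

```r
a <- c(0, 1, 2, 4)

#ifelse checks each entry: reciprocal where a > 0, NA otherwise
ifelse(a > 0, 1/a, NA)
#> NA 1.00 0.50 0.25

#Is at least one entry equal to 0?
any(a == 0)
#> TRUE

#Are all entries greater than 0?
all(a > 0)
#> FALSE
```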

MAKING YOUR OWN FUNCTIONS

example_function <- function(x){
    #operations that operate on x, which is defined by the user of the function
    #the value of the final line is returned
}

  • Functions can have multiple arguments, not just one (e.g. "function(x, y, z)")
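As a sketch of both cases, here are two small functions. The names my_mean and weighted_sum are my own inventions for illustration:

```r
#One argument: the value of the final line is what gets returned
my_mean <- function(x){
    s <- sum(x)
    n <- length(x)
    s / n
}

my_mean(c(2, 4, 6))
#> 4

#Several arguments: x and y are blended according to the weight w
weighted_sum <- function(x, y, w){
    w * x + (1 - w) * y
}

weighted_sum(10, 20, 0.25)
#> 17.5
```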

FOR LOOPS

  • Allows a task to be performed over and over again, with a variable taking a new value on each pass

for(i in 1:5){
    #operations that use i, which changes across the range of values (1:5 here is just a placeholder range)
}
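Here's a minimal working sketch of the pattern, adding up the numbers 1 through 5 and storing their squares. The object names are placeholders of my own:

```r
total <- 0
squares <- numeric(5)  #an empty vector to fill in

for(i in 1:5){
    total <- total + i   #running sum
    squares[i] <- i^2    #store a result at position i
}

total
#> 15

squares
#> 1 4 9 16 25
```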

OTHER FUNCTIONS

  • Apply family: apply, sapply, tapply, mapply

  • split

  • cut

  • quantile

  • reduce

  • identical

  • unique
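I didn't take detailed notes on these, but here's a quick sketch of three of them in base R, with arbitrary inputs:

```r
#sapply applies a function to each element and returns a vector
sapply(1:5, function(i) i^2)
#> 1 4 9 16 25

#unique drops repeated values
unique(c(1, 2, 2, 3, 3, 3))
#> 1 2 3

#identical checks whether two objects are exactly the same
identical(c(1, 2, 3), unique(c(1, 2, 2, 3, 3, 3)))
#> TRUE
```

The sapply call gives the same result as a for loop that stores i^2 at each position, just in one line.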