While my MS thesis research in a wet lab has taught me a lot, it hasn't given me much exposure to the computational side of science. A simple t-test may be sufficient for my current purposes, but I'd really like to learn the more advanced methods scientists use to understand larger or more complex data sets. That's why I've been auditing related courses on edX.org. Below you'll find some notes I've taken from the HarvardX course 'Data Science: R Basics'.
Getting Started with R
ABOUT R
R is a coding language built by statisticians to provide an interactive environment for data analysis
One of the fastest growing languages
FEATURES
Executable scripts serve as a record of your analysis, which is great for reproducibility!
Free, open source
OTHER NOTES
RStudio is an interactive, integrated development environment (IDE) for R
Base R is what you get when you first download and install R
Additional functionality is available as packages
#Package installation for a placeholder package (the name must be quoted)
install.packages("package_name")
#Loading a package you've already installed
library(package_name)
#To view what packages you've installed
installed.packages()
RStudio Key Bindings (Mac)
New script: command+shift+n
Save current script: command+s
Run entire script: command+shift+return
Run a single line of script: command+return (while cursor is on that line)
Assignment sign ("<-"): alt+-
Objects in R
WHAT IS AN OBJECT?
Includes variables, functions, datasets, etc.
Defining an object in R adds it to the workspace
Objects have a given 'class', which determines how they are handled
UNDERSTANDING AN OBJECT
To see the content of an object
#Option 1 for example_object
example_object
#Option 2 for example_object
print(example_object)
To see all objects saved in your workspace
ls()
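As a small illustration of the points above, here is a sketch that defines two objects and inspects them. The object names and values are made up for this example.

```r
# Define two objects; both are added to the workspace
a <- 2
greeting <- "hello"

# Typing the name or calling print() shows the content
a
print(greeting)

# class() reports how R will treat each object
class(a)          # "numeric"
class(greeting)   # "character"

# ls() lists the names of the objects in the workspace
ls()
```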
Vectors
VECTOR BASICS
The most basic unit to store data
A vector is a series of entries of the same class, made using the concatenate function "c()"
example_vector <- c(10, 20, 30)
To determine the length of a vector (i.e. the number of objects/entries within it)
length(example_vector)
To name the components of the vector
#Option 1: name the entries when creating the vector
example_vector <- c(A = 10, B = 20, C = 30)
#Option 2: assign the names afterwards
names(example_vector) <- c("A", "B", "C")
SUBSETTING
Subsetting refers to accessing part of a vector (as opposed to the whole thing)
The brackets ("[", "]") are used to accomplish this
#The bracket contents could be the index number of one entry
example_vector[2]
#Or, a sequence of numbers using the colon (":")
example_vector[1:2]
#Or, another vector
example_vector[c(1, 3)]
#If the elements (objects) inside the vector have names, you can also use the name
example_vector["A"]
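To make the subsetting options concrete, here is a minimal sketch on a named vector. The vector name "scores" and its values are hypothetical.

```r
# A named numeric vector (values and names are made up for illustration)
scores <- c(A = 10, B = 20, C = 30)

scores[2]          # second entry: B = 20
scores[1:2]        # first two entries: A = 10, B = 20
scores[c(1, 3)]    # first and third entries: A = 10, C = 30
scores["A"]        # the entry named "A": 10
length(scores)     # 3
```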
VECTOR COERCION
A means of being flexible with data class
Functions can accomplish this
as.character(example_vector)
as.numeric(example_vector)
NA is used to represent missing data; it will be inserted (with a warning) if you force the data into a class that one of the elements cannot be converted to
example_vector <- c("1", "b", "3")
as.numeric(example_vector)
#> 1 NA 3
Data Types & Data Frames
SOME DATA TYPES
Numeric
Character
Integer
Logical
Factors
FUNCTIONS
To see the class of an object
class(example_object)
To see the structure of an object
str(example_object)
To see the first few rows of a data frame/object
head(example_object)
CATEGORICAL DATA
To see the levels of that data
levels(example_data)
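The functions above can be tried on a small factor. This is only a sketch; the variable "regions" and its values are invented for the example.

```r
# A hypothetical categorical variable stored as a factor
regions <- factor(c("West", "South", "West", "Northeast"))

class(regions)    # "factor"
levels(regions)   # "Northeast" "South" "West" (alphabetical by default)
str(regions)      # compact summary of the structure
head(regions)     # first entries (all four here)
```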
DATA FRAMES
Basically tables where the rows are observations and the columns are variables
Useful, as they allow you to combine different data types within one object
To create a data frame
data.frame(example_thing)
In R versions before 4.0.0, the data.frame function converts characters to factors by default; to prevent this, set the argument 'stringsAsFactors' to FALSE (the argument name is case-sensitive)
data.frame(example_thing, stringsAsFactors = FALSE)
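Here is a minimal sketch of building a data frame from two vectors of different types. The column names and values are made up for illustration.

```r
# Build a small data frame from two vectors (made-up values)
example_df <- data.frame(
  name  = c("a", "b", "c"),
  score = c(90, 85, 70),
  stringsAsFactors = FALSE   # keep 'name' as character in every R version
)

class(example_df)        # "data.frame"
class(example_df$name)   # "character"
class(example_df$score)  # "numeric"
head(example_df)         # shows the first rows (all three here)
str(example_df)          # one line per column with its type
```

The "$" symbol used here pulls a single column out of the data frame by name.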
Sorting
SORT FUNCTION
Sorts a vector in increasing order
example_vector <- c(1, 4, 2, 5, 3)
sort(example_vector)
#> 1 2 3 4 5
ORDER FUNCTION
Takes a vector and returns the indices that sort that vector
This means that it gives you a vector of index numbers telling you, in order, which entry to take to sort the vector from min to max
The implication of this is that you can make an object out of the order, and then use it to sort your vector
example_vector <- c(1, 4, 2, 5, 3)
order(example_vector)
#> 1 3 5 2 4
ordered_example <- order(example_vector)
example_vector[ordered_example]
#> 1 2 3 4 5
WHICH.MAX & WHICH.MIN
The index number for the max or min object of a vector can be called using the appropriate which function
which.max(example_vector)
#> 4
example_vector[4]
#> 5
RANK FUNCTION
The rank function returns a vector that identifies each element's rank from the original vector
example_vector <- c(1, 7, 8, 10)
rank(example_vector)
#> 1 2 3 4
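Because the rank example above is already sorted, the difference between sort, order, and rank is easier to see on an unsorted vector. The sketch below uses made-up values and a hypothetical companion vector "labels".

```r
# Made-up values for illustration
x <- c(31, 4, 15, 92, 65)

sort(x)    # 4 15 31 65 92 (the values, smallest to largest)
order(x)   # 2 3 1 5 4     (where each sorted value sits in x)
rank(x)    # 3 1 2 5 4     (the position each entry would take after sorting)

# order() can also sort a second vector by the values of the first
labels <- c("a", "b", "c", "d", "e")
labels[order(x)]   # "b" "c" "a" "e" "d"

which.max(x)   # 4, because x[4] is 92
which.min(x)   # 2, because x[2] is 4
```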
Indexing
Logicals can be used to index vectors
Function "sum()" can be used to count the TRUE entries of a logical vector (T = 1, F = 0)
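A short sketch of indexing with logicals, using made-up measurements:

```r
# Made-up measurements
values <- c(0.5, 2.1, 1.3, 3.8)

# A comparison produces a logical vector
is_large <- values > 1      # FALSE TRUE TRUE TRUE

# Use the logical vector inside brackets to keep the TRUE entries
values[is_large]            # 2.1 1.3 3.8

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE entries
sum(is_large)               # 3
```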
WHICH FUNCTION
Gives the indices of the entries of a logical vector that are TRUE
Similar to which.max and which.min
Useful for condensing info
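For example, with the same kind of made-up measurements, which() turns a logical test into a short vector of index numbers:

```r
# Made-up measurements
values <- c(0.5, 2.1, 1.3, 3.8)

# Index numbers of the entries that pass the test
large_index <- which(values > 1)   # 2 3 4

# The index numbers can then be used for subsetting
values[large_index]                # 2.1 1.3 3.8
```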
MATCH FUNCTION
Looks for entries in a vector and returns the index needed to access them
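A minimal sketch of match(), assuming a made-up character vector "letters_vec":

```r
# Made-up vector to search in
letters_vec <- c("a", "b", "c", "d", "e")

# For each entry of the first argument, give its position in the second
found <- match(c("d", "a"), letters_vec)   # 4 1

# The positions can be used to pull the entries back out
letters_vec[found]                         # "d" "a"

# Entries that are not present return NA
match("z", letters_vec)                    # NA
```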
%in%
If you want to know whether or not an element in one vector is ALSO in another vector
x <- c("a", "b", "c", "d", "e")
y <- c("a", "d", "f")
y %in% x
#> TRUE TRUE FALSE
Basic Data Wrangling
Note: dplyr package needed
MUTATE FUNCTION
To change the data table by adding a new column or changing an existing one
FILTER FUNCTION
Filter data by subsetting rows
SELECT FUNCTION
Filter data by subsetting columns
PIPE OPERATOR (%>%)
Allows results from one function to automatically be 'piped' into another
murders %>% select(state, region, rate) %>% filter(rate <= 0.71)
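The three functions and the pipe can be combined as in the sketch below. It assumes the dplyr package is installed, and it uses a small invented table ("example_data") as a stand-in for the course's murders data, so the numbers mean nothing.

```r
library(dplyr)

# Hypothetical stand-in table with made-up numbers
example_data <- data.frame(
  state      = c("A", "B", "C"),
  population = c(1000000, 2000000, 500000),
  total      = c(10, 8, 20),
  stringsAsFactors = FALSE
)

result <- example_data %>%
  mutate(rate = total / population * 100000) %>%  # add a new column
  select(state, rate) %>%                         # keep two columns
  filter(rate <= 2)                               # keep rows with a low rate

result   # two rows remain: states A and B
```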
Plots
BASIC PLOTS
plot(x, y)
HISTOGRAMS
Graphical summaries of a list of numbers
hist(example_data)
BOXPLOT
More terse summary than a histogram, but easier to stack against one another
Particularly useful when you want to summarize several variables or strata of a variable
Strata are separated with "~"
boxplot(population~region, data = murders)
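To try the three plot types without the course data, here is a sketch on randomly generated numbers. The table "example_data" and its columns are invented for the example.

```r
# Made-up data: a numeric measurement in two groups
set.seed(1)   # make the random numbers repeatable
example_data <- data.frame(
  group = rep(c("control", "treated"), each = 20),
  dose  = rep(1:20, times = 2),
  value = c(rnorm(20, mean = 5), rnorm(20, mean = 7))
)

plot(example_data$dose, example_data$value)    # scatter plot of value against dose
hist(example_data$value)                       # distribution of all values
boxplot(value ~ group, data = example_data)    # one box per group
```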
Programming in R
BASIC CONDITIONALS
if-else is the most common
if(a != 0){
  print(1/a)
} else{
  print("No reciprocal for 0.")
}
The "ifelse()" function takes three arguments: a logical and two possible answers
The "any()" function returns TRUE if any of the entries are TRUE
The "all()" function returns TRUE if all of the entries are TRUE
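A short sketch of these three functions on made-up vectors:

```r
a <- c(0, 1, 2, -4, 5)

# ifelse() works entry by entry: test, answer if TRUE, answer if FALSE
result <- ifelse(a > 0, 1 / a, NA)   # NA 1.0 0.5 NA 0.2

z <- c(TRUE, TRUE, FALSE)
any(z)   # TRUE, because at least one entry is TRUE
all(z)   # FALSE, because one entry is FALSE
```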
MAKING YOUR OWN FUNCTIONS
example_function <- function(x){
  #operations on x, which is supplied by the user of the function
  #the value of the final line is returned
}
Functions can take multiple arguments, not just one (e.g. "function(x, y, z)")
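As a concrete sketch, here are two small functions, one with a single argument and one with two. The function names are hypothetical.

```r
# Hypothetical function that averages the entries of a numeric vector
example_avg <- function(x){
  s <- sum(x)
  n <- length(x)
  s / n            # the value of the last line is returned
}
example_avg(c(1, 2, 3, 6))   # 3

# Hypothetical function with two arguments
weighted_sum <- function(x, w){
  sum(x * w)
}
weighted_sum(c(1, 2, 3), c(1, 0, 2))   # 7
```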
FOR LOOPS
Allows a task to be performed over and over again, but with changing variables/values
for(i in range_of_values){
  #operations that use i, which changes across the range of values
}
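For instance, a loop that adds up the integers from 1 to n could be sketched like this (n and total are names chosen for the example):

```r
# Sum of the integers 1 through n, computed one step at a time
n <- 5
total <- 0
for(i in 1:n){
  total <- total + i   # i takes the values 1, 2, 3, 4, 5 in turn
}
total   # 15
```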
OTHER FUNCTIONS
Apply family: apply, sapply, tapply, mapply
split
cut
quantile
reduce
identical
unique
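A quick sketch of a few of these functions on a made-up vector, using only base R:

```r
x <- c(1, 2, 2, 3, 3, 3)

sapply(1:3, function(i) i^2)       # 1 4 9: applies the function to each entry
unique(x)                          # 1 2 3: drops repeated values
identical(unique(x), c(1, 2, 3))   # TRUE: the two objects are exactly equal
quantile(x, 0.5)                   # 2.5: the median, labelled "50%"
split(x, x > 2)                    # a list with entries named "FALSE" and "TRUE"
cut(x, breaks = c(0, 2, 3))        # a factor with levels "(0,2]" and "(2,3]"
```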