Topic 1 Intro to R, RStudio, and R Markdown

Learning Goals

Download and install the necessary tools (R, RStudio)
Develop comfort in navigating the tools in RStudio
Develop comfort in writing and knitting an R Markdown file
Identify the characteristics of tidy data
Use R code: as a calculator and to explore tidy data

Getting Started in RStudio

As you might guess from the name, “Data Science” requires data. Working with modern (large, messy) data sets requires statistical software. We’ll exclusively use RStudio. Why?

it’s free
it’s open source (the code is free & anybody can contribute to it)
it has a huge online community (which is helpful for when you get stuck)
it’s one of the industry standards
it can be used to create reproducible and lovely documents (In fact, the course materials that you’re currently reading were constructed entirely within RStudio! [thank you Prof. Brianna Heggeseth!])

Download R & RStudio

To get started, take the following two steps in the given order. Even if you already have R/RStudio, make sure to update to the most recent versions.

Download and install the R statistical software at https://mirror.las.iastate.edu/CRAN/
Download and install the FREE version of RStudio at https://www.rstudio.com/products/rstudio/download/#download

If you are having issues with downloading, log on to https://rstudio.macalester.edu/ (use Mac credentials) to use the RStudio server.

What’s the difference between R and RStudio? RStudio requires R to run, but has more options and is easier to use than R. I like to think of RStudio as the car, and R as the engine.

A quick tour of RStudio

Open RStudio! You should see four panes, each serving a different purpose:

Figure 1.1: RStudio Interface

This short video tour of RStudio summarizes some basic features of the console.

Exercise 1.1 (Warm Up) Use RStudio as a simple calculator to do the following:

Perform a simple calculation: calculate 90/3.
RStudio has built-in functions to which we supply the necessary arguments: function(arguments). Use the built-in function sqrt to calculate the square root of 25.
Use the built-in function rep to repeat the number “5” eight times.
Use the seq function to create the vector (0, 3, 6, 9, 12). Hint: seq(0, 15, by = 5) would create the vector (0, 5, 10, 15).
Create a new vector by concatenating three repetitions of the vector from the previous part.

Solution

90/3 
## [1] 30

sqrt(25)
## [1] 5

rep(5, times = 8)
## [1] 5 5 5 5 5 5 5 5

seq(0, 12, by = 3)
## [1]  0  3  6  9 12

rep(seq(0, 12, by = 3), times =  3)
##  [1]  0  3  6  9 12  0  3  6  9 12  0  3  6  9 12

rep(seq(0, 12, by = 3), each = 3) #notice the difference between times and each
##  [1]  0  0  0  3  3  3  6  6  6  9  9  9 12 12 12
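
The last part asks for a concatenation, and the solution builds it with rep(). As a small sketch of the connection, the c() function (which concatenates vectors end to end) gives the same result when the vector is written out three times. The names threes, by_hand, and by_rep are illustrative choices, not part of the exercise.

```r
# Store the vector from the previous part (the name `threes` is illustrative)
threes <- seq(0, 12, by = 3)

# c() concatenates vectors end to end
by_hand <- c(threes, threes, threes)

# rep() with times = 3 builds the same vector with less typing
by_rep <- rep(threes, times = 3)

# Check that the two approaches agree
identical(by_hand, by_rep)
## [1] TRUE
```

Writing the vector out by hand works for three copies, but rep() scales better when the number of repetitions grows.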

Exercise 1.2 (Assignment) We often want to store our output for later use (why?). The basic idea in RStudio:

`name <- output`

Copy and paste the following code into the console, line by line. NOTE: RStudio ignores any content after the #. Thus we use this to ‘comment’ and organize our code.

#type square_3
square_3
    
#calculate 3 squared
3^2    
    
#store this as "square_3"
square_3 <- 3^2    
    
#type square_3 again!
square_3
    
#do some math with square_3
square_3 + 2

Data

Not only does “Data Science” require statistical software, it requires DATA! Consider the Google definition:

Figure 1.2: A datum.

With this definition in mind, which of the following are examples of data?

tables

##   family father mother sex height nkids
## 1      1   78.5   67.0   M   73.2     4
## 2      1   78.5   67.0   F   69.2     4
## 3      1   78.5   67.0   F   69.0     4
## 4      1   78.5   67.0   F   69.0     4
## 5      2   75.5   66.5   M   73.5     4
## 6      2   75.5   66.5   M   72.5     4

We’ll mostly work with data that look like this:

##   family father mother sex height nkids
## 1      1   78.5   67.0   M   73.2     4
## 2      1   78.5   67.0   F   69.2     4
## 3      1   78.5   67.0   F   69.0     4
## 4      1   78.5   67.0   F   69.0     4
## 5      2   75.5   66.5   M   73.5     4
## 6      2   75.5   66.5   M   72.5     4

This isn’t as restrictive as it seems. We can convert the above signals (photos, videos, and text) to a data table format!

Tidy Data

Example: After a scandal among FIFA officials, fivethirtyeight.com posted an analysis of FIFA viewership, “How to Break FIFA”. Here’s a snapshot of the data used in this article:

country	confederation	population_share	tv_audience_share	gdp_weighted_share
United States	CONCACAF	4.5	4.3	11.3
Japan	AFC	1.9	4.9	9.1
China	AFC	19.5	14.8	7.3
Germany	UEFA	1.2	2.9	6.3
Brazil	CONMEBOL	2.8	7.1	5.4
United Kingdom	UEFA	0.9	2.1	4.2
Italy	UEFA	0.9	2.1	4.0
France	UEFA	0.9	2.0	4.0
Russia	UEFA	2.1	3.1	3.5
Spain	UEFA	0.7	1.8	3.1

The data table above is in tidy format. Tidy data tables have three key features:

Each row represents a unit of observation (also referred to as a case).
Each column represents a variable (i.e., an attribute of the cases that can vary from case to case). Each variable is one of two types:

quantitative = numerical
categorical = discrete possibilities/categories

Each entry contains a single data value; no analysis, summaries, footnotes, comments, etc., and only one value per cell

Tidy Data: Art by Allison Horst
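
To make the three features concrete, here is a minimal sketch that types the first three rows of the FIFA snapshot into a data frame by hand and checks its shape. The name fifa_snapshot is an illustrative choice, and stringsAsFactors = FALSE is set only so that text columns stay as text in older versions of R.

```r
# First three rows of the FIFA snapshot, typed in by hand
# (the name `fifa_snapshot` is illustrative)
fifa_snapshot <- data.frame(
  country = c("United States", "Japan", "China"),
  confederation = c("CONCACAF", "AFC", "AFC"),
  population_share = c(4.5, 1.9, 19.5),
  tv_audience_share = c(4.3, 4.9, 14.8),
  gdp_weighted_share = c(11.3, 9.1, 7.3),
  stringsAsFactors = FALSE
)

# Each row is one case (a country); each column is one variable
nrow(fifa_snapshot)   # number of cases
ncol(fifa_snapshot)   # number of variables

# Quantitative variables are stored as numbers, categorical ones as text
is.numeric(fifa_snapshot$population_share)
is.character(fifa_snapshot$confederation)
```

Every cell holds exactly one value, with no notes or totals mixed in, which is what lets R treat each column as a single variable.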

Exercise 1.3 (Units of Observation and Variables) Consider the following in a group:

What are the units of observation in the FIFA data?
What are the variables? Which are quantitative? Which are categorical?
Are these tidy data?

Solution

A FIFA member country
country name, soccer or football confederation, country’s share of global population (percentage), country’s share of global World Cup TV audience (percentage), country’s GDP-weighted audience share (percentage)
Yes

Exercise 1.4 (Tidy vs. Untidy) Check out the following data. Explain to each other why they are untidy and how we can tidy them.

Data 1: FIFA

country	confederation	population share	tv_share
United States	CONCACAF	i don’t know*	4.3%	*look up later
Japan	AFC	1.9	4.9%
China	AFC	19.5	14.8%
			total=24%

Data 2: Gapminder life expectancies by country

	country	1952	1957	1962
Asia	Afghanistan	28.8	30.3	32.0
	Bahrain	50.9	53.8	56.9
Africa	Algeria	43.0	45.7	48.3

Solution

There are notes such as “I don’t know” and “look up later” in columns with numeric values, and the last row with the total is a summary. We could remove the text notes, replace them with the actual values if known, and remove the last row with the total.
The first column does not have a column name; it should be continent. Additionally, Bahrain needs a value for the continent.
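
As a rough sketch of what the tidied tables could look like in R, the code below types in both data sets by hand with the fixes described in this solution. The names fifa_tidy and life_exp are illustrative. Two things are assumptions rather than part of the solution: the unknown population share is recorded as NA (R's marker for a missing value), and Bahrain's continent is filled in as Asia based on its position in the original table.

```r
# Data 1, tidied: the unknown value becomes NA (an assumption about how to
# record it), the notes and percent signs are dropped, and the total row is
# left out
fifa_tidy <- data.frame(
  country = c("United States", "Japan", "China"),
  confederation = c("CONCACAF", "AFC", "AFC"),
  population_share = c(NA, 1.9, 19.5),
  tv_share = c(4.3, 4.9, 14.8),
  stringsAsFactors = FALSE
)

# The total is a summary, so compute it from the data when it is needed
sum(fifa_tidy$tv_share)

# Data 2, tidied: the first column gets a name and every row gets a continent
# (check.names = FALSE keeps the year column names exactly as typed)
life_exp <- data.frame(
  continent = c("Asia", "Asia", "Africa"),
  country = c("Afghanistan", "Bahrain", "Algeria"),
  `1952` = c(28.8, 50.9, 43.0),
  `1957` = c(30.3, 53.8, 45.7),
  `1962` = c(32.0, 56.9, 48.3),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

names(life_exp)
```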

Data Basics in RStudio

For now, we’ll focus on tidy data. In a couple of weeks, you’ll learn how to turn untidy data into tidy data.

Exercise 1.5 (Importing Package Data) The first step to working with data in RStudio is getting it in there! How we do this depends on its format (e.g., Excel spreadsheet, csv file, txt file) and storage location (e.g., online, within Wiki, desktop). Luckily for us, the fifa_audience data are stored in the fivethirtyeight R package. Copy and paste the following code into the Console and press Enter.

#download the data and information in the fivethirtyeight package (we only need to do this once)
install.packages('fivethirtyeight')

#load the fivethirtyeight package (we must do this every session of RStudio in which we want to use this package)
library(fivethirtyeight)
    
#load the fifa data
data("fifa_audience")
    
#store this under a shorter, easier name
fifa <- fifa_audience
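
The comments above distinguish installing (once) from loading (every session). One way to see the difference, sketched here under the assumption that you want to check before installing, is to ask R whether a package is already on your computer. The helper is_installed is hypothetical and not part of the course materials; requireNamespace() is a base R function.

```r
# Hypothetical helper: TRUE if a package is already installed on this computer
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# The stats package ships with R, so this is TRUE
is_installed("stats")

# Install only when needed (left commented out so nothing is downloaded here):
# if (!is_installed("fivethirtyeight")) install.packages("fivethirtyeight")

# Loading with library() is still required in every new session:
# library(fivethirtyeight)
```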

Exercise 1.6 (Examining Data Structures) Before we can analyze our data, we must understand its structure. Try out the following functions (copy and paste into the Console). For each, make a note that describes its action.

#(what does View do?)
View(fifa)  

#(what does head do?)
head(fifa)  

#(what does dim do?)
dim(fifa)           

#(what does names do?)
names(fifa)

Solution

#View() opens a new tab with a spreadsheet preview of the data so we can explore it visually. It is commented out in the R Markdown file because it is an interactive feature.
#View(fifa)  

#head() gives the first 6 (default number) rows of a data set
head(fifa)  
## # A tibble: 6 × 5
##   country        confederation population_share tv_audience_share gdp_weighted…¹
##   <chr>          <chr>                    <dbl>             <dbl>          <dbl>
## 1 United States  CONCACAF                   4.5               4.3           11.3
## 2 Japan          AFC                        1.9               4.9            9.1
## 3 China          AFC                       19.5              14.8            7.3
## 4 Germany        UEFA                       1.2               2.9            6.3
## 5 Brazil         CONMEBOL                   2.8               7.1            5.4
## 6 United Kingdom UEFA                       0.9               2.1            4.2
## # … with abbreviated variable name ¹gdp_weighted_share

#dim() gives the number of rows and number of columns
dim(fifa)           
## [1] 191   5

#names() gives the names of the columns/variables
names(fifa)   
## [1] "country"            "confederation"      "population_share"  
## [4] "tv_audience_share"  "gdp_weighted_share"
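
A few related base R functions can help when examining structure. The sketch below uses a small hand-typed copy of the first rows of the FIFA data (named fifa_snapshot here for illustration) so that it runs without any packages. The same calls should work on the full fifa data.

```r
# A small hand-typed copy of the first four rows (illustrative name)
fifa_snapshot <- data.frame(
  country = c("United States", "Japan", "China", "Germany"),
  confederation = c("CONCACAF", "AFC", "AFC", "UEFA"),
  population_share = c(4.5, 1.9, 19.5, 1.2),
  tv_audience_share = c(4.3, 4.9, 14.8, 2.9),
  stringsAsFactors = FALSE
)

# head() takes an optional n argument to change how many rows are shown
head(fifa_snapshot, n = 2)

# nrow() and ncol() return the two pieces of dim() separately
nrow(fifa_snapshot)
ncol(fifa_snapshot)

# str() gives a compact overview: dimensions, variable names, and types
str(fifa_snapshot)
```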

Exercise 1.7 (Codebooks) Data are also only useful if we know what they measure! The fifa data table is tidy; it doesn’t have any helpful notes in the data itself.

Rather, information about the data is stored in a separate codebook. Codebooks can be stored in many ways (e.g., Google Docs, Word documents, etc.). Here the authors have made their codebook available in RStudio (under the original fifa_audience name). Check it out (run the following code in the console):

?fifa_audience

What does population_share measure?
What are the units of population_share?

Solution

Country’s share of global population
Percentage between 0 and 100

Exercise 1.8 (Examining a Single Variable) Consider the following:

We might want to access and focus on a single variable. To this end, we can use the $ notation (see below). What are the values of tv_audience_share? Of confederation? Is it easy to figure out?

fifa$tv_audience_share
fifa$confederation

It’s important to understand the format/class of each variable (quantitative, categorical, date, etc) in both its meaning and its structure within RStudio:

class(fifa$tv_audience_share)
class(fifa$confederation)

If a variable is categorical (in factor format), we can determine its levels / category labels. What are the values of confederation?

levels(fifa$confederation) #returns NULL because confederation is in character format, not factor format
levels(factor(fifa$confederation)) #we can convert to factor format
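
To see why the first call above comes back empty, here is a small sketch using the confederations from the first six rows of the FIFA data, typed in by hand. The names confed and confed_factor are illustrative. It also shows unique() and table(), two base R functions that work directly on character data.

```r
# Confederations from the first six rows of the FIFA data (illustrative name)
confed <- c("CONCACAF", "AFC", "AFC", "UEFA", "CONMEBOL", "UEFA")

# A character vector has no levels, so levels() returns NULL
class(confed)
levels(confed)

# Converting to a factor records the distinct categories, sorted alphabetically
confed_factor <- factor(confed)
class(confed_factor)
levels(confed_factor)

# unique() lists the distinct values in order of first appearance
unique(confed)

# table() counts how many cases fall in each category
table(confed)
```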

R Markdown and Reproducible Research

Reproducible research is the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them. - Reproducible Research, Coursera


Research often makes claims that are difficult to verify. A recent study of published psychology articles found that fewer than half of the published claims could be reproduced. One of the most common reasons claims cannot be reproduced is confusion about the data analysis. It may be unclear exactly how the data were prepared and analyzed, or there may be a mistake in the analysis.

In this course we will use an innovative format called R Markdown that dramatically increases the transparency of data analysis. R Markdown interleaves data, R code, graphs, tables, and text, packaging them into an easily publishable format.

To use R Markdown, you will write an R Markdown formatted file in RStudio and then ask RStudio to knit it into an HTML document (or occasionally a PDF or MS Word document).

Exercise 1.9 (Deduce the R Markdown Format) Look at this Sample RMarkdown and the HTML webpage it creates. Consider the following and discuss:

How are bullets, italics, and section headers represented in the R Markdown file?
How does R code appear in the R Markdown file?
In the HTML webpage, do you see the R code, the output of the R code, or both?

Solution

Bullets are represented with * and +
Italics are represented with * before and after a word or phrase
Section headers are represented with #

R code chunks are enclosed by three backticks at the beginning and end; a chunk is treated as R code if there is an r in curly braces right after the opening backticks
  
If echo=FALSE in curly braces, the code is not shown. Otherwise, both code and output are shown by default.
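
The pieces described in this solution can be assembled into a tiny R Markdown file. The sketch below writes one from R into a temporary folder and prints its raw contents; the file name format_sketch.Rmd and the text inside it are made up for illustration. The three backticks are built with strrep() only so that they can be stored as ordinary text.

```r
# Three backticks mark the start and end of a code chunk
fence <- strrep("`", 3)

# The lines of a very small R Markdown file (contents are illustrative)
rmd_lines <- c(
  "# A section header",
  "",
  "* a bullet",
  "* another bullet with *italic text*",
  "",
  paste0(fence, "{r, echo=FALSE}"),
  "sqrt(25)",
  fence
)

# Write the lines to a temporary file
rmd_file <- file.path(tempdir(), "format_sketch.Rmd")
writeLines(rmd_lines, rmd_file)

# Print the raw R Markdown to the console
cat(readLines(rmd_file), sep = "\n")
```

Because the chunk sets echo=FALSE, knitting this file should show the result of sqrt(25) in the HTML but hide the code itself.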

Now take a look at the R Markdown cheatsheet. Look up the R Markdown features from the previous question on the cheatsheet. There’s a great deal more information there.

Assignment 1 (due Wednesday, January 25 @ 11:59pm)

Complete the following. If you get stuck along the way, refer to the R Markdown cheatsheet linked above, search the web for answers, and/or ask for help!

Exercise 1.10 (Your First R Markdown File) Create a new R Markdown about your favorite food.

Create a new folder on your Desktop called COMP_STAT_112.
Within that folder, create a subfolder called Day_01.
Create a new file in RStudio (File -> New File -> R Markdown) with a Title of First_Markdown. Save it in the Day_01 subfolder.
Make sure you can compile/render (Knit) the Markdown into a webpage (html file).
Add a new line between title and output that reads: author: Your Name.
Create a very brief essay about your favorite food. Make sure to include:

A picture from the web: Download an image and save it in the same folder as your Markdown (.Rmd) file. Then, add the following code to your Markdown file: ![This is the caption](file_name.png). For example, to add a photo of an elephant saved as elephant.png, you would put: ![This is an elephant](elephant.png)
A bullet list
A numbered list

Compile (Knit) the document into an html file.

Exercise 1.11 (New Data!) There’s a data set named comic_characters in the fivethirtyeightdata package.

Install the package by running the following in the Console:

install.packages('fivethirtyeightdata', repos = 'https://fivethirtyeightdata.github.io/drat/', type = 'source')

Check out the codebook (hint: use ?) to understand what these data measure. Then add a second section to your R Markdown file, and then use code chunks and R commands to perform/answer the following tasks/questions:

Load the data.
What are the units of observation? How many observations are there?
In a new code chunk, print out the first 12 rows of the data set.
Get a list of all variable names.
What’s the class of the date variable?
List all of the unique entries in the gsm variable (no need to include NA).
Compile the document into an html file.

Appendix: R Functions

R as a calculator

Function/Operator	Action	Example
`/`	Division	`90/30`
`*`	Multiplication	`2*5`
`+`	Addition	`1+1`
`-`	Subtraction	`1-1`
`^`	Exponent/Power to	`3^2`
`sqrt(x)`	Square root	`sqrt(25)`

R Basics

Function/Operator	Action	Example
`install.packages('packagename')`	Download an R package (functions, data, etc.) from a repository	`install.packages('fivethirtyeight')`
`library(packagename)`	Access a downloaded R package	`library(fivethirtyeight)`
`?function_object_name`	Opens the help/documentation for the function or object	`?seq`
`rep(x, times, each)`	Repeat x a number of times	`rep(5, times = 8)`
`seq(from,to,by)`	Sequence generation	`seq(0, 12, by = 3)`
`name <- value_output`	Assign value or output to a name	`square_3 <- 3^2`
`View(x)`	Open spreadsheet viewer of dataset	`View(fifa_audience)`
`head(x)`	Print the first 6 rows of a dataset	`head(fifa_audience)`
`dim(x)`	Print the dimensions (number of rows and columns) of a dataset	`dim(fifa_audience)`
`names(x)`	Print the names of the variables in a dataset	`names(fifa_audience)`
`$`	Used to access one variable in a data set based on its name	`fifa_audience$confederation`
`class(x)`	Print the class type of the argument or input	`class(fifa_audience$confederation)`
`factor(x)`	Converts the argument or input to a factor class type (categorical variable)	`factor(fifa_audience$confederation)`
`levels(x)`	Prints the unique categories of a factor	`levels(factor(fifa_audience$confederation))`