Exploratory Data Analysis (EDA), a name given to the process of
Another way to describe EDA:
Useful R functions:
str() to learn about the numbers of variables and observations as well as the classes of variables
head() to view the top of the data table (can specify the number of rows with n=)
tail() to view the bottom of the data table

spc_tbl_ [52 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ state : chr [1:52] "United States" "Alabama" "Alaska" "Arizona" ...
$ murder : num [1:52] 5.6 8.2 4.8 7.5 6.7 6.9 3.7 2.9 4.4 35.4 ...
$ forcible_rape : num [1:52] 31.7 34.3 81.1 33.8 42.9 26 43.4 20 44.7 30.2 ...
$ robbery : num [1:52] 140.7 141.4 80.9 144.4 91.1 ...
$ aggravated_assault : num [1:52] 291 248 465 327 387 ...
$ burglary : num [1:52] 727 954 622 948 1085 ...
$ larceny_theft : num [1:52] 2286 2650 2599 2965 2711 ...
$ motor_vehicle_theft: num [1:52] 417 288 391 924 262 ...
$ population : num [1:52] 2.96e+08 4.55e+06 6.69e+05 5.97e+06 2.78e+06 ...
- attr(*, "spec")=
.. cols(
.. state = col_character(),
.. murder = col_double(),
.. forcible_rape = col_double(),
.. robbery = col_double(),
.. aggravated_assault = col_double(),
.. burglary = col_double(),
.. larceny_theft = col_double(),
.. motor_vehicle_theft = col_double(),
.. population = col_double()
.. )
- attr(*, "problems")=<externalptr>
# A tibble: 6 × 9
state murder forcibl…¹ robbery aggra…² burgl…³ larce…⁴ motor…⁵ popul…⁶
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 United States 5.6 31.7 141. 291. 727. 2286. 417. 2.96e8
2 Alabama 8.2 34.3 141. 248. 954. 2650 288. 4.55e6
3 Alaska 4.8 81.1 80.9 465. 622. 2599. 391 6.69e5
4 Arizona 7.5 33.8 144. 327. 948. 2965. 924. 5.97e6
5 Arkansas 6.7 42.9 91.1 387. 1085. 2711. 262. 2.78e6
6 California 6.9 26 176. 317. 693. 1916. 713. 3.58e7
# … with abbreviated variable names ¹forcible_rape, ²aggravated_assault,
# ³burglary, ⁴larceny_theft, ⁵motor_vehicle_theft, ⁶population
# A tibble: 6 × 9
state murder forcibl…¹ robbery aggra…² burgl…³ larce…⁴ motor…⁵ popul…⁶
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Vermont 1.3 23.3 11.7 83.5 492. 1686. 103. 618814
2 Virginia 6.1 22.7 99.2 155. 392. 2035 211. 7563887
3 Washington 3.3 44.7 92.1 206. 960. 3150. 784. 6261282
4 West Virginia 4.4 17.7 44.6 206. 621. 1794 210 1803920
5 Wisconsin 3.5 20.6 82.2 135. 441. 1993. 227. 5541443
6 Wyoming 2.7 24 15.3 188. 476. 2534. 145. 506242
# … with abbreviated variable names ¹forcible_rape, ²aggravated_assault,
# ³burglary, ⁴larceny_theft, ⁵motor_vehicle_theft, ⁶population
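The printouts above are the kind of output that str(), head(), and tail() produce. As a minimal sketch, the calls can be tried on a small stand-in table. The object name crime_sample is hypothetical, and it holds only the first six rows and three of the columns printed above, typed in by hand rather than read from the course data file.

```r
# Hypothetical stand-in for the course data: the first six rows of the
# printed table, keeping only three columns whose values are shown in full.
crime_sample <- data.frame(
  state         = c("United States", "Alabama", "Alaska",
                    "Arizona", "Arkansas", "California"),
  murder        = c(5.6, 8.2, 4.8, 7.5, 6.7, 6.9),
  forcible_rape = c(31.7, 34.3, 81.1, 33.8, 42.9, 26)
)

# Number of observations and variables, plus the class of each variable
str(crime_sample)

# Top of the table; n = sets how many rows are shown
first_rows <- head(crime_sample, n = 3)
print(first_rows)

# Bottom of the table
last_rows <- tail(crime_sample, n = 2)
print(last_rows)
```

With the full data set, the same three calls would be applied to whatever object the data was read into.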
join commands)?
One convenient way to do this is with a pairs plot.
The main point of such plots is not necessarily to draw any conclusions, but to help generate more specific research questions and hypotheses.
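A pairs plot can be sketched with base R's pairs() function, which draws a scatterplot for every pair of numeric variables. The example below is an illustration under assumptions: crime_sample is a hypothetical object holding six rows from the head() output, and the burglary and larceny_theft values are the rounded figures as printed there, not the exact values in the course file.

```r
# Hypothetical stand-in: six rows copied from the head() printout.
# burglary and larceny_theft are rounded as printed, so treat them as approximate.
crime_sample <- data.frame(
  murder        = c(5.6, 8.2, 4.8, 7.5, 6.7, 6.9),
  forcible_rape = c(31.7, 34.3, 81.1, 33.8, 42.9, 26),
  burglary      = c(727, 954, 622, 948, 1085, 693),
  larceny_theft = c(2286, 2650, 2599, 2965, 2711, 1916)
)

# Scatterplot matrix: one panel per pair of numeric variables
pairs(crime_sample)
```

Six rows are far too few to show real patterns, and the first row is the national aggregate rather than a state, so this only demonstrates the call. On the full table, one would likely drop the "United States" row and the non-numeric state column before plotting. The ggpairs() function from the GGally package is a common ggplot2-based alternative.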
You will often end up with a lot of data, and it can be easy to be overwhelmed.
How should you get started?
To do so, you can again revisit questions like “What patterns do you see?” or “Why might they be occurring?”
Let’s practice these steps using data about flight delays from Kaggle. Download the template Rmd file from the course website.
Midterm Revisions Part 1 due Thursday 3/23 @ 8am
Midterm Revisions Part 2 due Friday 3/24 @ 11:59pm
IV1 due Friday 3/24 @ 11:59pm
Assignment 8 (Parts 1 and 2) due Wed., 3/29 @ 11:59pm
Start thinking about your Final Project!