Topic 11 Midterm Review

Template Rmd

Access a template .Rmd for this midterm review here.

Data: Kiva

Kiva is a non-profit that allows people from around the world to lend small amounts to others to start or grow a business, go to school, access clean energy, etc. Since its founding in 2005, more than $1.2 billion in loans to over 3 million borrowers have been funded. In this activity, we’ll examine some lending data from 2005-2012.

Investigate the partners Data Table

Kiva has field partners who generally act as intermediaries between borrowers and Kiva (lenders). They evaluate borrower risk, post loan requests on Kiva, and process payments. The following command loads a table with data on Kiva’s field partners.

partners <- read_csv("https://jamesnormington.github.io/112_spring_2023/data/kiva_partners2.csv")

Exercise 11.1 (Regional distribution) Make a summary table with only five columns: region (countries.region), total number of partners, total number of loans posted, total amount raised, and average loan size per loan posted.

The four columns after region should all be on a per region basis; for example, the row for Central America should include average loan size per loan posted in Central America. Sort your table by total amount raised.

region_by_amt = partners %>%
  group_by(countries.region) %>%
  summarize(total_partners = n(),
            total_loans = sum(loans_posted),
            total_amt_raised = sum(total_amount_raised),
            avg_loan_size = total_amt_raised / total_loans) %>%
  arrange(desc(total_amt_raised))
## Error in group_by(., countries.region): object 'partners' not found

Exercise 11.2 (Total amount by region) Create a bar plot of total amount raised by region.

region_by_amt %>%
  ggplot(aes(x = countries.region, y = total_amt_raised)) +
  geom_col()
## Error in ggplot(., aes(x = countries.region, y = total_amt_raised)): object 'region_by_amt' not found

Investigate the loans Data Table

The loans table below contains information about individual loans to borrowers.

# a random sample of 10,000
loans <- read_csv("https://jamesnormington.github.io/112_spring_2023/data/kiva_loans_small.csv")

Exercise 11.3 (Preprocessing) Consider the loans data and perform the following preprocessing steps and save the new data as loans2.

  1. Remove all rows where the funded amount is 0.
  2. Note that the date information about when a loan request was posted is separated out into different fields for year, month, day, hour, etc. It would be more helpful to have a single field for the exact time at which each loan was posted. We’ll do this in three steps. First, create (mutate) a new column by pasting together the year, month, and date, separated by hyphens: post_dt=paste(posted_yr, posted_mo, posted_day, sep='-'). Second, create a new column with the time: post_time=paste(posted_hr,posted_min,posted_sec, sep=':'). Third, using the ymd_hms command from lubridate, add a new column with the full date (including time of day):
    post_date=ymd_hms(paste(post_dt,post_time,sep=' ')).
  3. Repeat the previous part to add a column funded_date with the exact time at which each loan was funded.
  4. Create a new table called loans2 that only contains the following columns from loans: loan_id,status,funded_amount,paid_amount,sector,location.country,lat,lon,partner_id,post_date,funded_date
loans2 = loans %>%
  filter(funded_amount != 0) %>%
  mutate(post_dt=paste(posted_yr, posted_mo, posted_day, sep='-'),
         post_time=paste(posted_hr,posted_min,posted_sec, sep=':'),
         post_date=ymd_hms(paste(post_dt,post_time,sep=' ')),
         funded_dt=paste(funded_yr, funded_mo, funded_day, sep='-'),
         funded_time=paste(funded_hr,funded_min,funded_sec, sep=':'),
         funded_date=ymd_hms(paste(funded_dt,funded_time,sep=' '))) %>%
  select(loan_id,status,funded_amount,paid_amount,sector,location.country,lat,lon,partner_id,post_date,funded_date )
## Error in filter(., funded_amount != 0): object 'loans' not found

Exercise 11.4 (Loans by country) Consider the loans2 table.

  1. Find the top 5 countries by number of loans.
  2. Find the top 5 countries by total funded loan amount.
loans2 %>%
  group_by(location.country) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) %>%
  head(5)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'head': object 'loans2' not found
loans2 %>%
  group_by(location.country) %>%
  summarize(total_amt = sum(funded_amount)) %>%
  arrange(desc(total_amt)) %>%
  head(5)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'head': object 'loans2' not found

Exercise 11.5 (Sector analysis) Make a scatterplot with the number of loans in each sector on the x-axis and the average loan size in each sector on the y-axis. Instead of using points, use text with each sector’s name as the glyph.

Hint: start by wrangling the data into glyph-ready form, with one row corresponding to one glyph.

loans2 %>%
  group_by(sector) %>%
  summarize(n_loans = n(),
            avg_loan_size = mean(funded_amount)) %>%
  ggplot(aes(x = n_loans, y = avg_loan_size, label = sector)) +
  geom_text(angle = 45, size = 2.5)
## Error in group_by(., sector): object 'loans2' not found

Putting them together

Exercise 11.6 (Join practice) Join the countries.region variable from the partners table onto the loans2 table, in order to have a region associated with each loan. Save the output table as loans3.

loans3 = loans2 %>%
  left_join(partners, by = 'partner_id')
## Error in left_join(., partners, by = "partner_id"): object 'loans2' not found

Exercise 11.7 (Defaults) Consider the following:

  1. Starting with loans3, create a new table DefaultData that only includes those loans whose status is either “defaulted” or “paid”.
DefaultData = loans3 %>%
  filter(status.x %in% c("defaulted", "paid"))
## Error in filter(., status.x %in% c("defaulted", "paid")): object 'loans3' not found
  1. Make a density plot of the funded amount, split by loan status (“defaulted” or “paid”). That is, your plot should have two separate density curves on top of each other. Interpret the plot.
DefaultData %>%
  ggplot(aes(x = funded_amount, fill = status.x)) +
  geom_density(alpha = 0.4)
## Error in ggplot(., aes(x = funded_amount, fill = status.x)): object 'DefaultData' not found
  1. Make a bar chart with 10 bars, such that each bar is a $1000 range of funding amount; that is, the first bar should be for $0-$1000, the second for $1001-$2000, and so forth. The bars should all have height equal to 1, but be filled with two colors: one for the percentage of loans in that interval bin that defaulted and one for the percentage that were paid back in full. Interpret your graphic.
DefaultData %>%
  mutate(funded_amount_cat = cut(funded_amount, breaks = c(0, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000))) %>%
  ggplot(aes(x = funded_amount_cat, fill = status.x)) +
  geom_bar(position = "fill")
## Error in mutate(., funded_amount_cat = cut(funded_amount, breaks = c(0, : object 'DefaultData' not found
  1. Starting with the data in DefaultData from part (a), make a table with four columns: partner_id, number of defaulted loans through that partner, number of loans completely paid back through that partner, and percentage of loans defaulted (the second column divided by the sum of the second and third columns). Sort your table from highest default percentage to lowest, and print out only those with at least a 10% default percentage.
DefaultData %>%
  group_by(partner_id, status.x) %>%
  summarize(n = n()) %>%
  arrange(partner_id) %>%
  pivot_wider(names_from = "status.x",
              values_from = "n",
              values_fill = 0) %>%
  mutate(pct_defaulted = 100*defaulted / (defaulted + paid)) %>%
  arrange(desc(pct_defaulted)) %>%
  filter(pct_defaulted >= 10) %>%
  print()
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'print': object 'DefaultData' not found

Exercise 11.8 (Funding time) Consider the following:

  1. Use the command days_to_fund = difftime(funded_date, post_date, units="days") within mutate() to add a column to the loans3 data that has the number of days between the time a loan was posted and the time it was funded. Save your new table as loans4.
loans4 = loans3 %>%
  mutate(days_to_fund = difftime(funded_date, post_date, units="days"))
## Error in mutate(., days_to_fund = difftime(funded_date, post_date, units = "days")): object 'loans3' not found
  1. The days_to_fund should always be positive, but there are some negative values, most likely due to data entry errors. Filter these out of the data table loans4.
loans4 = loans4 %>%
  filter(days_to_fund >= 0)
## Error in filter(., days_to_fund >= 0): object 'loans4' not found
  1. Make a bar chart with average days to fund by region. Reorder the x-axis so that the regions are in order of ascending days to fund.
loans4 %>%
  group_by(countries.region) %>%
  summarize(days_to_fund = mean(days_to_fund)) %>%
  mutate(countries.region = factor(countries.region),
         countries.region = fct_reorder(countries.region, days_to_fund)) %>%
  ggplot(aes(x = countries.region, y = days_to_fund)) +
  geom_col()
## Error in group_by(., countries.region): object 'loans4' not found
  1. Make a bar chart with average days to fund by sector. Reorder the x-axis so that the sectors are in order of ascending days to fund.
loans4 %>%
  group_by(sector) %>%
  summarize(days_to_fund = mean(days_to_fund)) %>%
  mutate(sector = factor(sector),
         sector = fct_reorder(sector, days_to_fund)) %>%
  ggplot(aes(x = sector, y = days_to_fund)) +
  geom_col()
## Error in group_by(., sector): object 'loans4' not found
  1. Make a scatterplot with funded_amount on the x-axis and days_to_fund on the y-axis. Color the points by region and set the transparency to 0.2. Add linear trendlines for each region. What are the trends?
loans4 %>%
  ggplot(aes(x = funded_amount, y = days_to_fund, color = countries.region)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE)
## Error in ggplot(., aes(x = funded_amount, y = days_to_fund, color = countries.region)): object 'loans4' not found