Topic 11 Midterm Review
Template Rmd
Access a template .Rmd for this midterm review here.
Data: Kiva
Kiva is a non-profit that allows people from around the world to lend small amounts to others to start or grow a business, go to school, access clean energy, etc. Since its founding in 2005, more than $1.2 billion in loans to over 3 million borrowers have been funded. In this activity, we’ll examine some lending data from 2005-2012.
Investigate the partners
Data Table
Kiva has field partners who generally act as intermediaries between borrowers and Kiva (lenders). They evaluate borrower risk, post loan requests on Kiva, and process payments. The following command loads a table with data on Kiva’s field partners.
<- read_csv("https://jamesnormington.github.io/112_spring_2023/data/kiva_partners2.csv") partners
Exercise 11.1 (Regional distribution) Make a summary table with only five columns: region (countries.region
), total number of partners, total number of loans posted, total amount raised, and average loan size per loan posted.
The four columns after region should all be on a per region basis; for example, the row for Central America should include average loan size per loan posted in Central America. Sort your table by total amount raised.
= partners %>%
region_by_amt group_by(countries.region) %>%
summarize(total_partners = n(),
total_loans = sum(loans_posted),
total_amt_raised = sum(total_amount_raised),
avg_loan_size = total_amt_raised / total_loans) %>%
arrange(desc(total_amt_raised))
## Error in group_by(., countries.region): object 'partners' not found
Exercise 11.2 (Total amount by region) Create a bar plot of total amount raised by region.
%>%
region_by_amt ggplot(aes(x = countries.region, y = total_amt_raised)) +
geom_col()
## Error in ggplot(., aes(x = countries.region, y = total_amt_raised)): object 'region_by_amt' not found
Investigate the loans
Data Table
The loans
table below contains information about individual loans to borrowers.
# a random sample of 10,000
<- read_csv("https://jamesnormington.github.io/112_spring_2023/data/kiva_loans_small.csv") loans
Exercise 11.3 (Preprocessing) Consider the loans
data and perform the following preprocessing steps and save the new data as loans2
.
- Remove all rows where the funded amount is 0.
- Note that the date information about when a loan request was posted is separated out into different fields for year, month, day, hour, etc. It would be more helpful to have a single field for the exact time at which each loan was posted. We’ll do this in three steps. First, create (mutate) a new column by pasting together the year, month, and date, separated by hyphens:
post_dt=paste(posted_yr, posted_mo, posted_day, sep='-')
. Second, create a new column with the time:post_time=paste(posted_hr,posted_min,posted_sec, sep=':')
. Third, using theymd_hms
command fromlubridate
, add a new column with the full date (including time of day):
post_date=ymd_hms(paste(post_dt,post_time,sep=' '))
.
- Repeat the previous part to add a column
funded_date
with the exact time at which each loan was funded.
- Create a new table called
loans2
that only contains the following columns fromloans
: loan_id,status,funded_amount,paid_amount,sector,location.country,lat,lon,partner_id,post_date,funded_date
= loans %>%
loans2 filter(funded_amount != 0) %>%
mutate(post_dt=paste(posted_yr, posted_mo, posted_day, sep='-'),
post_time=paste(posted_hr,posted_min,posted_sec, sep=':'),
post_date=ymd_hms(paste(post_dt,post_time,sep=' ')),
funded_dt=paste(funded_yr, funded_mo, funded_day, sep='-'),
funded_time=paste(funded_hr,funded_min,funded_sec, sep=':'),
funded_date=ymd_hms(paste(funded_dt,funded_time,sep=' '))) %>%
select(loan_id,status,funded_amount,paid_amount,sector,location.country,lat,lon,partner_id,post_date,funded_date )
## Error in filter(., funded_amount != 0): object 'loans' not found
Exercise 11.4 (Loans by country) Consider the loans2
table.
- Find the top 5 countries by number of loans.
- Find the top 5 countries by total funded loan amount.
%>%
loans2 group_by(location.country) %>%
summarize(n = n()) %>%
arrange(desc(n)) %>%
head(5)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'head': object 'loans2' not found
%>%
loans2 group_by(location.country) %>%
summarize(total_amt = sum(funded_amount)) %>%
arrange(desc(total_amt)) %>%
head(5)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'head': object 'loans2' not found
Exercise 11.5 (Sector analysis) Make a scatterplot with the number of loans in each sector on the x-axis and the average loan size in each sector on the y-axis. Instead of using points, use text with each sector’s name as the glyph.
Hint: start by wrangling the data into glyph-ready form, with one row corresponding to one glyph.
%>%
loans2 group_by(sector) %>%
summarize(n_loans = n(),
avg_loan_size = mean(funded_amount)) %>%
ggplot(aes(x = n_loans, y = avg_loan_size, label = sector)) +
geom_text(angle = 45, size = 2.5)
## Error in group_by(., sector): object 'loans2' not found
Putting them together
Exercise 11.6 (Join practice) Join the countries.region
variable from the partners
table onto the loans2
table, in order to have a region associated with each loan. Save the output table as loans3
.
= loans2 %>%
loans3 left_join(partners, by = 'partner_id')
## Error in left_join(., partners, by = "partner_id"): object 'loans2' not found
Exercise 11.7 (Defaults) Consider the following:
- Starting with
loans3
, create a new tableDefaultData
that only includes those loans whose status is either “defaulted” or “paid”.
= loans3 %>%
DefaultData filter(status.x %in% c("defaulted", "paid"))
## Error in filter(., status.x %in% c("defaulted", "paid")): object 'loans3' not found
- Make a density plot of the funded amount, split by loan status (“defaulted” or “paid”). That is, your plot should have two separate density curves on top of each other. Interpret the plot.
%>%
DefaultData ggplot(aes(x = funded_amount, fill = status.x)) +
geom_density(alpha = 0.4)
## Error in ggplot(., aes(x = funded_amount, fill = status.x)): object 'DefaultData' not found
- Make a bar chart with 10 bars, such that each bar is a $1000 range of funding amount; that is, the first bar should be for $0-$1000, the second for $1001-$2000, and so forth. The bars should all have height equal to 1, but be filled with two colors: one for the percentage of loans in that interval bin that defaulted and one for the percentage that were paid back in full. Interpret your graphic.
%>%
DefaultData mutate(funded_amount_cat = cut(funded_amount, breaks = c(0, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000))) %>%
ggplot(aes(x = funded_amount_cat, fill = status.x)) +
geom_bar(position = "fill")
## Error in mutate(., funded_amount_cat = cut(funded_amount, breaks = c(0, : object 'DefaultData' not found
- Starting with the data in
DefaultData
from part (a), make a table with four columns: partner_id, number of defaulted loans through that partner, number of loans completely paid back through that partner, and percentage of loans defaulted (the second column divided by the sum of the second and third columns). Sort your table from highest default percentage to lowest, and print out only those with at least a 10% default percentage.
%>%
DefaultData group_by(partner_id, status.x) %>%
summarize(n = n()) %>%
arrange(partner_id) %>%
pivot_wider(names_from = "status.x",
values_from = "n",
values_fill = 0) %>%
mutate(pct_defaulted = 100*defaulted / (defaulted + paid)) %>%
arrange(desc(pct_defaulted)) %>%
filter(pct_defaulted >= 10) %>%
print()
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'print': object 'DefaultData' not found
Exercise 11.8 (Funding time) Consider the following:
- Use the command
days_to_fund = difftime(funded_date, post_date, units="days")
withinmutate()
to add a column to theloans3
data that has the number of days between the time a loan was posted and the time it was funded. Save your new table asloans4
.
= loans3 %>%
loans4 mutate(days_to_fund = difftime(funded_date, post_date, units="days"))
## Error in mutate(., days_to_fund = difftime(funded_date, post_date, units = "days")): object 'loans3' not found
- The
days_to_fund
should always be positive, but there are some negative values, most likely due to data entry errors. Filter these out of the data tableloans4
.
= loans4 %>%
loans4 filter(days_to_fund >= 0)
## Error in filter(., days_to_fund >= 0): object 'loans4' not found
- Make a bar chart with average days to fund by region. Reorder the x-axis so that the regions are in order of ascending days to fund.
%>%
loans4 group_by(countries.region) %>%
summarize(days_to_fund = mean(days_to_fund)) %>%
mutate(countries.region = factor(countries.region),
countries.region = fct_reorder(countries.region, days_to_fund)) %>%
ggplot(aes(x = countries.region, y = days_to_fund)) +
geom_col()
## Error in group_by(., countries.region): object 'loans4' not found
- Make a bar chart with average days to fund by sector. Reorder the x-axis so that the sectors are in order of ascending days to fund.
%>%
loans4 group_by(sector) %>%
summarize(days_to_fund = mean(days_to_fund)) %>%
mutate(sector = factor(sector),
sector = fct_reorder(sector, days_to_fund)) %>%
ggplot(aes(x = sector, y = days_to_fund)) +
geom_col()
## Error in group_by(., sector): object 'loans4' not found
- Make a scatterplot with
funded_amount
on the x-axis anddays_to_fund
on the y-axis. Color the points by region and set the transparency to 0.2. Add linear trendlines for each region. What are the trends?
%>%
loans4 ggplot(aes(x = funded_amount, y = days_to_fund, color = countries.region)) +
geom_point(alpha = 0.2) +
geom_smooth(method = "lm", se = FALSE)
## Error in ggplot(., aes(x = funded_amount, y = days_to_fund, color = countries.region)): object 'loans4' not found