# BIOS 611

BIOS 611: Assignment 5
General Instructions
For this assignment, create an R Markdown document and use R for the following exercises.
For all exercises:
- Whether making a plot, a table, or otherwise, the goal should be to produce results that look presentable (e.g., pretend they will be published in a magazine with your name attached).
- Use section headers (with the pound sign) to separate each exercise. Do not rely on the numbered list alone; those markers are small in the context of the full document, so they are harder for me to spot while scrolling through.
When you are finished, knit to HTML and submit via SOLE, in the “Assignment” area (not through TurnItIn).
Exercises
Exercise 1
Using the California wildfire data, create a table that shows, for each decade since 1950, the following information:
Number of fires
Total area burnt
Average area burnt
Standard deviation of area burnt
Any other numeric measures you think might be of interest.
So your table should look something like the following (these are made-up numbers):
| Decade | N Fires | Total Area | Average Area | SD Area |
|--------|---------|------------|--------------|---------|
| 1950   | 3       | 30         | 10           | 3       |
| 1960   | 6       | 36         | 6            | 6       |
You should load both the first and the second pages from the California Wildfires Excel sheet.
You may get some errors with dates from the second sheet. This is to be expected: if you look at, say, the alarm date variable, you will see that some dates are in the MM/DD/YYYY format and others are in the (much more logical) YYYY/MM/DD format. I think the MM/DD/YYYY dates are all older, and since I’m asking about 1950 and later, they should get filtered out anyway, so we don’t need to worry about them.
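As a rough sketch of the decade summary (one possible approach, not the required one), the toy example below tells the two date formats apart by pattern, derives a decade from the year, and builds the table in base R. The data frame fires and its columns alarm_date and acres are made-up stand-ins; the real column names in the Excel sheets may differ.

```r
# Toy stand-in for the wildfire data: names and values are invented.
fires <- data.frame(
  alarm_date = c("07/04/1948", "1953/06/01", "1957/09/12",
                 "1964/07/20", "1968/08/03"),
  acres      = c(100, 10, 30, 5, 7),
  stringsAsFactors = FALSE
)

# Decide the format per row from the pattern, then parse each group.
iso   <- grepl("^[0-9]{4}/", fires$alarm_date)
alarm <- rep(as.Date(NA), nrow(fires))
alarm[iso]  <- as.Date(fires$alarm_date[iso],  format = "%Y/%m/%d")
alarm[!iso] <- as.Date(fires$alarm_date[!iso], format = "%m/%d/%Y")

# Decade = year rounded down to the nearest multiple of 10.
fires$decade <- (as.integer(format(alarm, "%Y")) %/% 10) * 10
recent <- fires[fires$decade >= 1950, ]

# One summary row per decade.
by_decade <- do.call(rbind, lapply(split(recent, recent$decade), function(g) {
  data.frame(
    decade     = g$decade[1],
    n_fires    = nrow(g),
    total_area = sum(g$acres),
    mean_area  = mean(g$acres),
    sd_area    = sd(g$acres)
  )
}))
by_decade
```

The same summary could also be written with group_by() and summarise() from dplyr if you prefer the tidyverse style.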
Exercise 2
Again using the California wildfire data (importing both of the first two pages), create a figure that will show whether forest fires have become more severe over time. The figure should illustrate the severity both between years and within years. That is: suppose that in the mid 1990s severity tended to peak in July/August, while around 2010 severity tended to peak in May/June. The plot should be designed to show this at a glance. As in the previous exercise, we can limit our analysis to 1950 and later. If that plot isn’t interesting, we can move the cut-off to a later date if that leads to a better presentation of the data.
‘Severity’ can be defined in several ways, and I don’t want to limit you to a single definition. In fact, you should consider several ways to define ‘severity’ and make several figures (they could be the same basic ‘type’ of figure with one variable changed).
Include a short description / justification for why you created your plot in the way that you did. How did you measure ‘severity’ and how is/are your plot(s) addressing what I described in the first paragraph?
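One way to get at both the between-year and within-year pattern is to arrange severity as a year-by-month grid. The sketch below does this in base R with invented numbers, using total area burnt as just one possible definition of severity; the data frame fires and its columns year, month, and acres are hypothetical names.

```r
# Toy data: years, months, and areas are invented.
fires <- data.frame(
  year  = c(1995, 1995, 1995, 2010, 2010, 2010),
  month = c(7, 8, 8, 5, 6, 6),
  acres = c(100, 300, 50, 400, 200, 25)
)

# Rows are years, columns are months, cells are total area burnt.
severity <- tapply(fires$acres,
                   list(year = fires$year, month = fires$month),
                   sum)
severity[is.na(severity)] <- 0  # no recorded fires in that cell

# Month with the largest total in each year.
peak_month <- colnames(severity)[apply(severity, 1, which.max)]
severity
peak_month
```

A grid like this could be drawn as a heatmap (for example with geom_tile() after reshaping to long format), and swapping sum for length or max gives alternative severity measures with the same basic figure.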
Exercise 3
This exercise might take a bit more time and work. If needed, we can extend the due date another week. The goal of this exercise is to be able to calculate accessibility to healthcare based on the 2-step floating catchment area method (e.g., https://en.wikipedia.org/wiki/Two-step_floating_catchment_area_method).
You will create several functions to accomplish this, some of which call others. I will try to ‘scaffold’ this into several steps. You should get the function in each step working (verify against the test sets I’ve provided, or more if you want to validate further) before moving to the next step. For example, if your distance function in Part (a) doesn’t work, then you will not be able to attempt Part (b).
There are some functions either in Base R or in some packages built for spatial data which can do this, and probably do it better and faster. You cannot use these in your function. The point here is to make your own function, doing the computations yourself, so that you can do so when you encounter a situation where the premade function doesn’t exist yet.
Context
I have created some simulated data for a country called Squaretopia. The data consist of two datasets:
houses.csv: This dataset has three columns: xdim, ydim, and county. The first two variables are the coordinates of (fictional) houses in Squaretopia, while county denotes subdivisions within Squaretopia (like states, provinces, or counties).
hospitals.csv: This dataset has three columns: xdim, ydim, and beds. The first two variables are again the coordinates, while beds is the number of beds in each hospital, which is indicative of how much service that hospital is able to provide.
You can ‘see’ Squaretopia with the following commands:
library("tidyverse")
# Assumes houses.csv and hospitals.csv are in the working directory
houses <- read_csv("houses.csv")
## Parsed with column specification:
## cols(
##   xdim = col_double(),
##   ydim = col_double(),
##   county = col_character()
## )
hospitals <- read_csv("hospitals.csv")
## Parsed with column specification:
## cols(
##   xdim = col_integer(),
##   ydim = col_integer(),
##   beds = col_integer()
## )
ggplot( houses , aes(x=xdim, y=ydim) ) +
  geom_point( aes(color=county) ) +
  geom_point( data=hospitals , color="indianred3", aes(size=beds) ) +
  theme_bw()
Side note: We are able to plot data from two different datasets!
Part (a)
Create a function get_dist(). This function should take two inputs:
dat01: Expected to be a data.frame (or something similar, such as a tibble) which has at least columns named xdim and ydim.
dat02: Expected to be a data.frame (or something similar, such as a tibble) which has columns named xdim and ydim.
You can have this function either assume that dat02 contains a single set of coordinates (e.g., the location of a single hospital, which would be a data.frame with 1 row) or you can extend it to accept a set of n2 coordinates.
Either way, your function should allow dat01 to contain multiple (n1) coordinates.
If you require n2 to be 1, then the function should return a vector of length n1 (the distance from each point in dat01 to the point in dat02).
If you allow n2 to be greater than 1, then your function should return a data.frame or matrix (etc.) with n1 rows and n2 columns.
I’m not sure whether there is a way to vectorize the operations; I suspect that a loop will be necessary.
So the following output should be expected:
Output when dat02 is a single set of coordinates
Houses at (10,0), (20,0), (10,10), and (20,10)
Hospitals at (0,0) and (for test_coords03) (10,10)
With these smaller examples, we can double-check these by hand to make sure they’re right.
test_coords01 <- data.frame(
xdim = c( 10, 20, 10, 20 ),
ydim = c(  0,  0, 10, 10 ),
zval = c( 1, 1.5, 2, 4.25 )
)
test_coords02 <- data.frame(
xdim = c( 0 ),
ydim = c( 0 )
)
get_dist( test_coords01 , test_coords02 )
## [1] 10.00000 20.00000 14.14214 22.36068
Output when dat02 contains multiple sets of coordinates
test_coords03 <- data.frame(
xdim = c( 0 , 10 ),
ydim = c( 0 , 10 )
)
get_dist( test_coords01 , test_coords03 )
##          [,1]     [,2]
## [1,] 10.00000 10.00000
## [2,] 20.00000 14.14214
## [3,] 14.14214  0.00000
## [4,] 22.36068 10.00000
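For what it is worth, the pairwise distances can be computed without an explicit loop by using outer() from base R, which forms every pairwise difference at once. The sketch below is one possible approach rather than a reference solution. The name get_dist_sketch is made up to keep it apart from your own get_dist(), and it assumes straight-line (Euclidean) distance, which is what the expected outputs above imply.

```r
# Hypothetical helper: Euclidean distance from every row of dat01
# to every row of dat02, using only arithmetic and outer().
get_dist_sketch <- function(dat01, dat02) {
  dx <- outer(dat01$xdim, dat02$xdim, "-")  # n1 x n2 differences in x
  dy <- outer(dat01$ydim, dat02$ydim, "-")  # n1 x n2 differences in y
  d  <- sqrt(dx^2 + dy^2)
  # Return a plain vector when dat02 has a single row, a matrix otherwise.
  if (nrow(dat02) == 1) as.vector(d) else d
}

test_coords01 <- data.frame(
  xdim = c(10, 20, 10, 20),
  ydim = c( 0,  0, 10, 10),
  zval = c(1, 1.5, 2, 4.25)
)
test_coords02 <- data.frame(xdim = 0, ydim = 0)
test_coords03 <- data.frame(xdim = c(0, 10), ydim = c(0, 10))

get_dist_sketch(test_coords01, test_coords02)
get_dist_sketch(test_coords01, test_coords03)
```

Extra columns such as zval are simply ignored, since only xdim and ydim are read.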
Part (b)
11/8/2018 BIOS 611: Assignment 5
Write a function that will compute the sum of a specified variable when a particular condition is TRUE. The purpose here is to have a single function (I called mine service_in_range()) that can be used to: (1) count how many people/houses are within a specified distance of each hospital, and (2) add up a variable for the houses within a certain range.
Your function should take 4 inputs:
dat01: The same as before.
dat02: The same as before.
RR: The range of interest (e.g., 20), a single numeric value.
var: The name of the variable to be summed.
If var is not provided, then the function should count the number of rows from dat01 that are within a distance of RR from the location specified by dat02.
If var is provided, then the function should sum the specified variable over the rows of dat01 which are within a distance of RR from the location specified by dat02.
As before, you can have your function assume that dat02 contains a single set of coordinates, or you can extend it to accept a set of n2 coordinates.
So the following output should be expected:
Output when dat02 is a single set of coordinates
get_dist( test_coords01 , test_coords02 )
## [1] 10.00000 20.00000 14.14214 22.36068
test_coords01
##   xdim ydim zval
## 1   10    0 1.00
## 2   20    0 1.50
## 3   10   10 2.00
## 4   20   10 4.25
service_in_range( test_coords01 , test_coords02, 15 )
## [1] 2
service_in_range( test_coords01 , test_coords02, 15, "zval" )
## [1] 3
service_in_range( test_coords01 , test_coords02, 20 )
## [1] 3
service_in_range( test_coords01 , test_coords02, 20, "zval" )
## [1] 4.5
Output when dat02 contains multiple sets of coordinates
service_in_range( test_coords01 , test_coords03, 15 )
## [1] 2 4
service_in_range( test_coords01 , test_coords03, 15, "zval" )
## [1] 3.00 8.75
service_in_range( test_coords01 , test_coords03, 20 )
## [1] 3 4
service_in_range( test_coords01 , test_coords03, 20, "zval" )
## [1] 4.50 8.75
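A minimal sketch of how such a function might look is given below, again as one possibility rather than the expected answer. The name service_in_range_sketch is hypothetical, the distance is computed inline so the block stands alone, and "within RR" is taken to mean a distance less than or equal to RR, which is what the expected output for a range of 20 implies (the house at distance exactly 20 is counted).

```r
# Hypothetical helper: count rows of dat01 (or sum dat01[[var]]) within
# distance RR of each row of dat02.
service_in_range_sketch <- function(dat01, dat02, RR, var = NULL) {
  d <- sqrt(outer(dat01$xdim, dat02$xdim, "-")^2 +
            outer(dat01$ydim, dat02$ydim, "-")^2)
  in_range <- d <= RR  # n1 x n2 logical matrix
  # With no variable named, every row gets weight 1, so the sum is a count.
  w <- if (is.null(var)) rep(1, nrow(dat01)) else dat01[[var]]
  # Row i of in_range is multiplied by w[i]; column sums give one value per dat02 row.
  colSums(in_range * w)
}

test_coords01 <- data.frame(
  xdim = c(10, 20, 10, 20),
  ydim = c( 0,  0, 10, 10),
  zval = c(1, 1.5, 2, 4.25)
)
test_coords02 <- data.frame(xdim = 0, ydim = 0)
test_coords03 <- data.frame(xdim = c(0, 10), ydim = c(0, 10))

service_in_range_sketch(test_coords01, test_coords02, 15)
service_in_range_sketch(test_coords01, test_coords03, 20, "zval")
```

Treating the count as a sum of ones keeps both uses in a single code path, which is the point of having one function for both jobs.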
Part (c)
Now that we have those functions, we should be able to get to computing access to healthcare using the 2-step floating catchment method! Depending on how you wrote your functions, you may need to use loops to run the calculations here. Or you may want to write a function that does all of these calculations (so that, say, if I throw a different dataset at you, instead of editing a number of lines of code you can just swap out one function call).
For each hospital, calculate the number of beds per person (a proxy measure of available service) using a distance of 20 as the cut-off. Call the variable service_avail .
For each house, calculate the service access by summing service_avail for all hospitals within a distance of 20 from that house.
Make a map (like the plot of Squaretopia in the context to this exercise) where the plotting shape represents the county, and the plotting color represents service access.
Make a table containing the number of houses, as well as the average service access, for each county.
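The two steps above can be sketched on a tiny invented dataset so that the arithmetic is easy to check by hand. Everything here is illustrative: the three houses and two hospitals are made up (they are not the Squaretopia files), and setting service_avail to 0 for a hospital with no houses in range is an assumption of this sketch, not something the assignment specifies.

```r
# Toy data: coordinates, counties, and bed counts are invented.
houses <- data.frame(
  xdim = c(0, 10, 30), ydim = c(0, 0, 0),
  county = c("A", "A", "B")
)
hospitals <- data.frame(xdim = c(5, 40), ydim = c(0, 0), beds = c(10, 6))
RR <- 20

# n_houses x n_hospitals matrix of straight-line distances.
d <- sqrt(outer(houses$xdim, hospitals$xdim, "-")^2 +
          outer(houses$ydim, hospitals$ydim, "-")^2)
in_range <- d <= RR

# Step 1: beds per house within RR of each hospital.
# Assumption: a hospital with no houses in range contributes 0.
n_served <- colSums(in_range)
service_avail <- ifelse(n_served > 0, hospitals$beds / n_served, 0)

# Step 2: for each house, add up service_avail over hospitals within RR.
houses$access <- as.vector((in_range * 1) %*% service_avail)

# County table: number of houses and mean access.
county_tab <- data.frame(
  n_houses    = as.vector(table(houses$county)),
  mean_access = sapply(split(houses$access, houses$county), mean)
)
county_tab
```

By hand: the first hospital has two houses in range (10 / 2 = 5 beds per house) and the second has one (6 / 1 = 6), so the houses get access 5, 5, and 6.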
Part (d)
Compute the average service access for each county (similar to the table from Part (c)) for a sequence of ranges.
You pick the values of RR; just make them sensible. Then make a plot of average service access vs. range, with a separate line for each county.
I haven’t done this yet, so I don’t know what it looks like. It might be interesting, it might be boring.
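A sweep over ranges can be sketched by wrapping the two-step calculation in a function and applying it to each value of RR. As before, the toy houses and hospitals are invented, the helper name mean_access_by_county is hypothetical, and a hospital with no houses in range is assumed to contribute 0.

```r
# Toy data: coordinates, counties, and bed counts are invented.
houses <- data.frame(
  xdim = c(0, 10, 30), ydim = c(0, 0, 0),
  county = c("A", "A", "B")
)
hospitals <- data.frame(xdim = c(5, 40), ydim = c(0, 0), beds = c(10, 6))

# Hypothetical helper: mean two-step access per county for one range RR.
mean_access_by_county <- function(houses, hospitals, RR) {
  d <- sqrt(outer(houses$xdim, hospitals$xdim, "-")^2 +
            outer(houses$ydim, hospitals$ydim, "-")^2)
  in_range <- d <= RR
  n_served <- colSums(in_range)
  # Assumption: hospitals serving no houses contribute 0.
  service_avail <- ifelse(n_served > 0, hospitals$beds / n_served, 0)
  access <- as.vector((in_range * 1) %*% service_avail)
  sapply(split(access, houses$county), mean)
}

# One column per range, one row per county.
ranges <- c(10, 20, 30)
res <- sapply(ranges, function(r) mean_access_by_county(houses, hospitals, r))
colnames(res) <- ranges
res
```

Each row of res is then one line in the plot; for a quick look, matplot(ranges, t(res), type = "b") draws one line per county, or the matrix can be reshaped to long format for ggplot.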