4 Part I - Dolphins
4.1 Introduction
4.1.1 Topic
Wittenberg’s Marine Science major Heddie Samuelson has graciously provided data from her research on dolphins. Dolphins often swim together in “pods”, and Heddie wants to know how the pod size varies, and how pod size might be related to other variables.
4.1.2 Data
In the Dolphins subfolder are two files….
- In the Sightings file, the cases are particular sightings of dolphin pods in a particular area on the north Atlantic coast of the US. Variables include the following:
- Date and Time of the sighting
- Location of the sighting, specified by Latitude and Longitude
- Estimated pod size (in the variable labeled “#”), i.e., the number of dolphins in the pod
- Species and Common Name of the variety of dolphins sighted
Note that there are 11 sheets - one for each year from 2006 through 2016.
- There are three sheets in the SeaTemps file. Ignore the “AVERAGE SST” sheet (which includes monthly averages and a graph that you will probably do anyway) and “Sheet2” (which duplicates one year’s data in Sheet1). In Sheet1 there are four similar variables for each of the 11 years:
- date
- sea surface temperature in Fahrenheit, for that date
- sea surface temperature in Celsius, for that date
- average sea surface temperature in Celsius, for that month (typically listed on the first day of the month)
Note that there are lots of missing data and blank columns, and that all 11 years of sea temps are on this one sheet. Note also that there are different numbers of rows for the different years.
4.2 Data Wrangling
Before starting the analysis, I loaded the following packages:
library(tidyverse)
library(readxl)
library(mosaic)
library(knitr)
library(maps)
Since the 2006 sheet is slightly different from the other sheets (it has 2 extra rows below the data), I read in this sheet separately.
Sight06 <- read_excel("~/Data229/Project/Dolphins/Sightings.xlsx", sheet = "2006", n_max = 62)
Sight06 <- Sight06 %>%
select(Date, Time, Latitude, Longitude, "#") %>%
mutate(Date = as.Date(Date))
Then I wrote a function to read in the sheets from 2007 to 2016. Besides importing the raw data, the function does some light wrangling: it keeps only the columns I need and converts Date to a Date object.
Sightings <- function(sheetNum) {
  # Read one yearly sheet, keep the needed columns, and return the result
  read_excel("~/Data229/Project/Dolphins/Sightings.xlsx", sheet = sheetNum) %>%
    select(Date, Time, Latitude, Longitude, "#") %>%
    mutate(Date = as.Date(Date))
}
With the function defined, I read in the remaining sheets by position (sheet 2 is 2007, sheet 11 is 2016).
Sight07 <- Sightings(2)
Sight08 <- Sightings(3)
Sight09 <- Sightings(4)
Sight10 <- Sightings(5)
Sight11 <- Sightings(6)
Sight12 <- Sightings(7)
Sight13 <- Sightings(8)
Sight14 <- Sightings(9)
Sight15 <- Sightings(10)
Sight16 <- Sightings(11)
My goal is an output table with 5 variables: Date, Time, Latitude, Longitude and podSize. I used 10 full_join()’s to combine the 11 tables, then renamed column 5 to “podSize”.
Sights <- Sight06 %>%
full_join(Sight07, by = c("Date", "Time", "Latitude", "Longitude", "#")) %>%
full_join(Sight08, by = c("Date", "Time", "Latitude", "Longitude", "#")) %>%
full_join(Sight09, by = c("Date", "Time", "Latitude", "Longitude", "#")) %>%
full_join(Sight10, by = c("Date", "Time", "Latitude", "Longitude", "#")) %>%
full_join(Sight11, by = c("Date", "Time", "Latitude", "Longitude", "#")) %>%
full_join(Sight12, by = c("Date", "Time", "Latitude", "Longitude", "#")) %>%
full_join(Sight13, by = c("Date", "Time", "Latitude", "Longitude", "#")) %>%
full_join(Sight14, by = c("Date", "Time", "Latitude", "Longitude", "#")) %>%
full_join(Sight15, by = c("Date", "Time", "Latitude", "Longitude", "#")) %>%
full_join(Sight16, by = c("Date", "Time", "Latitude", "Longitude", "#"))
colnames(Sights)[5] <- "podSize"
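Since every table shares the same five columns, stacking the rows is another way to combine them. The sketch below is a minimal illustration with made-up toy tables standing in for two yearly sheets (the names `sheet_a`, `sheet_b` and `stacked`, and all values, are hypothetical). It uses dplyr's bind_rows() and rename() rather than the full_join() chain used above. One difference to keep in mind: bind_rows() keeps every row, while full_join() on all columns merges rows that are identical in both tables.

```r
library(dplyr)

# Toy stand-ins for two yearly sheets; all values are made up for illustration.
sheet_a <- tibble(Date = as.Date(c("2006-07-25", "2006-07-26")), `#` = c(50, 70))
sheet_b <- tibble(Date = as.Date(c("2007-06-01", "2007-06-02")), `#` = c(20, 10))

# Stack the rows of both tables, then give the "#" column a descriptive name.
stacked <- bind_rows(sheet_a, sheet_b) %>%
  rename(podSize = `#`)

print(stacked)
```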
It might actually be a better idea to split the Date column into 3 columns (Year, Month and Day), since I’m going to look at how the pod size distribution varies by time of year and across the years.
SightsYMD <- Sights %>%
separate(Date, into = c("Year", "Month", "Day"))
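For reference, separate() by default splits the text form of each date at the non-alphanumeric characters (here, the dashes), so Year, Month and Day come out as character columns. The base R sketch below performs the same split on two made-up dates using format(); `dates` and `parts` are hypothetical names.

```r
# Two example dates (made up for illustration).
dates <- as.Date(c("2006-07-25", "2006-08-02"))

# format() extracts each component as zero-padded text.
parts <- data.frame(
  Year  = format(dates, "%Y"),
  Month = format(dates, "%m"),
  Day   = format(dates, "%d"),
  stringsAsFactors = FALSE
)

print(parts)
```

Because Month is text rather than a number, ggplot treats it as a category, which is what the boxplots by month rely on.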
Now I finally have the table that I wanted. Let’s take a quick look at its first few rows.
kable(SightsYMD %>% head())
Year | Month | Day | Time | Latitude | Longitude | podSize |
---|---|---|---|---|---|---|
2006 | 07 | 25 | 1519 | 42.70759 | -70.50612 | 50 |
2006 | 07 | 26 | 1645 | 42.85215 | -70.32297 | 70 |
2006 | 07 | 26 | 1632 | 42.84765 | -70.32412 | 40 |
2006 | 07 | 29 | 1431 | 42.73842 | -70.52207 | 20 |
2006 | 07 | 29 | 1445 | 42.74283 | -70.51724 | 10 |
2006 | 08 | 02 | 1615 | 42.90334 | -70.34932 | 30 |
4.3 Data Exploring
4.3.1 Pod Size
Now it’s time to analyze. First, I wanted to check out the pod size distribution.
SightsYMD %>%
ggplot(mapping = aes(podSize)) +
geom_histogram(bins = 40, color = "black", fill = "grey")
The shape of the distribution of pod size is skewed to the right, so I decided to use a log transformation and therefore model the log(podSize).
SightsYMD %>%
ggplot(mapping = aes(log(podSize))) +
geom_histogram(bins = 25, color = "black", fill = "grey")
Sure enough! The log(podSize) distribution looks normalish.
At this point, I’d like to check out factors that might affect pod size.
4.3.1.1 Time of the Year
Below is a log(podSize) vs Month boxplot:
SightsYMD %>%
ggplot(mapping = aes(x= Month, y = log(podSize))) +
geom_boxplot()
The first thing I noticed was that all the data were recorded between May and November. The median log(podSize) is broadly similar across those months. It is highest in August, September and October, meaning that pods tend to be largest during those months. On the other hand, May has the lowest median log(podSize), which indicates that the estimated pod size is smallest in May.
4.3.1.2 Year (2006-2016)
Next, let’s find out how pod size varies across the years.
SightsYMD %>%
ggplot(mapping = aes(x = Year, y = log(podSize))) +
geom_boxplot()
Overall, the median log(podSize) doesn’t seem to differ much across the years, so year seems to have little or no impact on pod size. In some years (2009, 2016,…) pods are slightly larger than in the others, with the highest median log(podSize) at about 4. On the other hand, the lowest median log(podSize) is about 3 (in 2013), which is not much smaller than the rest.
4.3.1.3 Time of the Day
In order to see how pod size varies by time of day, I compared the log(podSize) of 4 different periods:
- Morning (before 11:00)
- Noonish (11:00 - 14:00)
- Afternoon (14:00 - 17:00)
- Evening (after 17:00)
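The four periods above can also be assigned in a single step. The sketch below uses base R's cut() on a few made-up times in the same HHMM form as the Time column; `times` and `period` are hypothetical names. With the default right = TRUE, each interval includes its upper boundary, so a sighting at exactly 11:00 counts as Morning.

```r
# Made-up sighting times in HHMM form.
times <- c(930, 1100, 1230, 1400, 1519, 1700, 1815)

# Intervals are (-Inf, 1100], (1100, 1400], (1400, 1700], (1700, Inf).
period <- cut(
  times,
  breaks = c(-Inf, 1100, 1400, 1700, Inf),
  labels = c("Morning", "Noonish", "Afternoon", "Evening")
)

print(table(period))
```

With a period column like this, a single grouped summary could stand in for four separate filters.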
Here are the summary statistics for each of the periods I just mentioned:
Morning <- SightsYMD %>%
filter(Time <= 1100)
kable(favstats(~ log(podSize), data = Morning))
min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|
0 | 2.484907 | 3.401197 | 4.317488 | 6.214608 | 3.369225 | 1.174626 | 152 | 0 |
Noonish <- SightsYMD %>%
filter(Time > 1100 & Time <= 1400)
kable(favstats(~ log(podSize), data = Noonish))
min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|
0 | 2.70805 | 3.68888 | 4.60517 | 5.703782 | 3.488528 | 1.214849 | 255 | 0 |
Afternoon <- SightsYMD %>%
filter(Time > 1400 & Time <= 1700)
kable(favstats(~ log(podSize), data = Afternoon))
min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|
0 | 2.70805 | 3.401197 | 4.248495 | 6.907755 | 3.425035 | 1.069065 | 472 | 1 |
Evening <- SightsYMD %>%
filter(Time > 1700)
kable(favstats(~ log(podSize), data = Evening))
Overall, the mean and median log(podSize) are about the same across the periods of the day (roughly 3.4 to 3.7), so there is little evidence that time of day affects pod size. The highest log(podSize) is 6.9, in the afternoon, which corresponds to a pod of about 1,000 dolphins.
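Because these statistics are on the natural-log scale, exp() converts them back to dolphin counts. A quick check using two values from the tables above (`log_stats` and `pod_sizes` are hypothetical names):

```r
# Values copied from the summary tables: the Morning median and the Afternoon maximum.
log_stats <- c(median = 3.401197, max = 6.907755)

# Undo the natural log to get pod sizes, rounded to whole dolphins.
pod_sizes <- round(exp(log_stats))

print(pod_sizes)
```

So a median log(podSize) near 3.4 corresponds to about 30 dolphins, and the maximum of 6.9 to about 1,000.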
4.3.1.4 Location
USA <- map_data("usa")
States <- map_data("state")
colnames(SightsYMD)[5] <- "lat"
colnames(SightsYMD)[6] <- "long"
This map shows the locations of the sightings, with point size scaled by pod size.
USA %>%
ggplot(mapping = aes(x = long, y = lat)) +
geom_polygon(mapping = aes(group = group), color = "brown", fill = "tan") +
geom_point(data = SightsYMD, mapping = aes(x = long, y = lat, size = podSize)) +
coord_fixed(xlim = c(-71, -70), ylim = c(42.1, 43.2))
All the sightings are off the Atlantic coast of the US. It seems that the largest pods were sighted at longitudes from -70.6 to -70.25 and latitudes from 42.75 to 43.
4.3.2 Sea Surface Temperature
The next thing I did was to look at the sea surface temperature.
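First, I loaded the SeaTemps data file. The code reads “Sheet2”, which (per the data description) holds one year of the temperatures in Sheet1, and keeps only the date and the Fahrenheit temperature.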
SeaTemps <- read_excel("~/Data229/Project/Dolphins/SeaTemps.xlsx", sheet = "Sheet2")
SeaTemps <- SeaTemps[,1:2]
SeaTemps <- SeaTemps %>%
filter(`SST (deg F)` != "N/A") %>%
mutate(Temperature = as.double(`SST (deg F)`), Date = as.Date(Date)) %>%
select(-`SST (deg F)`)
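The filter step appears to be needed because missing temperatures are stored as the text "N/A", which would make the whole column text rather than numbers. A minimal sketch of that cleanup on a made-up vector (`raw` and `temps` are hypothetical names):

```r
# Made-up raw column: numbers read as text because of the "N/A" entries.
raw <- c("50.5", "N/A", "58.8")

# Drop the "N/A" entries first, then convert the rest to numbers.
temps <- as.double(raw[raw != "N/A"])

print(mean(temps))
```

read_excel() also has an na argument (for example na = "N/A") that treats such strings as missing on import, which may be an alternative here.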
Now let’s look at the average temperature in each month. Here are the numerical and visual summaries of average sea surface temperature.
AvgTemp <- SeaTemps %>%
separate(Date, into = c("Year", "Month", "Day")) %>%
group_by(Month) %>%
summarise(AvgTemp = mean(Temperature))
kable(AvgTemp)
Month | AvgTemp |
---|---|
05 | 50.51983 |
06 | 58.76095 |
07 | 65.20261 |
08 | 66.83306 |
09 | 63.39961 |
10 | 57.59091 |
AvgTemp %>%
ggplot(mapping = aes(x = Month, y = AvgTemp, group = "")) +
geom_line(color = "red", size = 2) +
geom_point(color = "blue", size = 5)
The average sea surface temperature is highest in August, at about 67 degrees F. July, August and September are the 3 warmest months, each with a mean temperature above 63 degrees F. In contrast, the surface is cooler in May, June and October (all under 60 degrees F), and May is the coldest month, with an average temperature slightly above 50 degrees F.
4.3.3 Pod Size vs Temperature
Sights <- Sights %>%
select(Date, podSize) %>%
mutate(Date = as.Date(Date))
PodTemp <- full_join(Sights, SeaTemps, by = "Date")
PodTemp %>%
ggplot(mapping = aes(y = Temperature, x = log(podSize))) +
geom_point() +
stat_smooth(method = "lm", se = FALSE)
lm(Temperature ~ log(podSize), data = PodTemp)
Call:
lm(formula = Temperature ~ log(podSize), data = PodTemp)
Coefficients:
(Intercept) log(podSize)
65.0016 0.2285
The plot above reveals a positive but weak, roughly linear relationship between log(podSize) and Temperature: larger log(podSize) is associated with slightly warmer sea surface temperature. The regression equation is Temp^ = 65 + 0.23*log(podSize). The slope of 0.23 indicates that multiplying pod size by a factor of e is associated with an increase of about 0.23 degrees F in predicted temperature.
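The slope interpretation can be checked by arithmetic with the fitted coefficients. In the sketch below, `predict_temp` is a hypothetical helper that evaluates the fitted line, and the pod size of 10 is an arbitrary starting point.

```r
# Coefficients copied from the lm() output above.
b0 <- 65.0016
b1 <- 0.2285

# Hypothetical helper: predicted temperature (deg F) for a given pod size.
predict_temp <- function(podSize) b0 + b1 * log(podSize)

# Multiplying pod size by e raises the prediction by the slope itself.
diff_e <- predict_temp(10 * exp(1)) - predict_temp(10)

# Doubling pod size raises it by slope * log(2), about 0.16 deg F.
diff_2 <- predict_temp(20) - predict_temp(10)

print(c(diff_e, diff_2))
```

The same differences hold for any starting pod size, since only the ratio of the two pod sizes enters the calculation.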