4 Part I - Dolphins

4.1 Introduction

4.1.1 Topic

Wittenberg’s Marine Science major Heddie Samuelson has graciously provided data from her research on dolphins. Dolphins often swim together in “pods”, and Heddie wants to know how the pod size varies, and how pod size might be related to other variables.

4.1.2 Data

The Dolphins subfolder contains two files:

  1. In the Sightings file, the cases are particular sightings of dolphin pods in a particular area on the north Atlantic coast of the US. Variables include the following:
  • Date and Time of the sighting
  • Location of the sighting, specified by Latitude and Longitude
  • Estimated pod size (in the variable labeled “#”), i.e., the number of dolphins in the pod
  • Species and Common Name of the variety of dolphins sighted

Note that there are 11 sheets - one for each year from 2006 through 2016.

  2. There are three sheets in the SeaTemps file. Ignore the “AVERAGE SST” sheet (which includes monthly averages and a graph that you will probably do anyway) and “Sheet2” (which duplicates one year’s data in Sheet1). In Sheet1 there are four similar variables for each of the 11 years:
  • date
  • sea surface temperature in Fahrenheit, for that date
  • sea surface temperature in Celsius, for that date
  • average sea surface temperature in Celsius, for that month (typically listed on the first day of the month)

Note that there are lots of missing data and blank columns, and that all 11 years of sea temps are on this one sheet. Note also that there are different numbers of rows for the different years.

4.2 Data Wrangling

Before starting my analysis, I loaded the following packages:

library(tidyverse)
library(readxl)
library(mosaic)
library(knitr)
library(maps)

Since the 2006 sheet is slightly different from the other sheets (it has 2 extra rows below the data), I read in this sheet separately.

Sight06 <- read_excel("~/Data229/Project/Dolphins/Sightings.xlsx", sheet = "2006", n_max = 62)

Sight06 <- Sight06 %>% 
  select(Date, Time, Latitude, Longitude, "#") %>% 
  mutate(Date = as.Date(Date))

Then I wrote a function to read in the sheets from 2007 to 2016. Besides importing the raw data, the function also does some wrangling: it keeps only the columns I need and converts Date to a plain date.

# Read one yearly sheet (identified by position or name) and tidy it
Sightings <- function(sheetNum){
  read_excel("~/Data229/Project/Dolphins/Sightings.xlsx", sheet = sheetNum) %>% 
    select(Date, Time, Latitude, Longitude, "#") %>%  # keep only the columns of interest
    mutate(Date = as.Date(Date))                      # store Date as a plain Date
}

With the function defined, it’s time to read in the remaining sheets.

Sight07 <- Sightings(2)
Sight08 <- Sightings(3)
Sight09 <- Sightings(4)
Sight10 <- Sightings(5)
Sight11 <- Sightings(6)
Sight12 <- Sightings(7)
Sight13 <- Sightings(8)
Sight14 <- Sightings(9)
Sight15 <- Sightings(10)
Sight16 <- Sightings(11)

My goal is an output table with 5 variables: Date, Time, Latitude, Longitude and podSize. Because all 11 tables share the same columns, I chained 10 full_join() calls on all five columns, which effectively stacks their rows into one table. I also renamed column 5 to “podSize”.

Sights <- Sight06 %>% 
  full_join(Sight07, by = c("Date", "Time", "Latitude", "Longitude", "#")) %>% 
  full_join(Sight08, by = c("Date", "Time", "Latitude", "Longitude", "#")) %>% 
  full_join(Sight09, by = c("Date", "Time", "Latitude", "Longitude", "#")) %>% 
  full_join(Sight10, by = c("Date", "Time", "Latitude", "Longitude", "#")) %>% 
  full_join(Sight11, by = c("Date", "Time", "Latitude", "Longitude", "#")) %>% 
  full_join(Sight12, by = c("Date", "Time", "Latitude", "Longitude", "#")) %>% 
  full_join(Sight13, by = c("Date", "Time", "Latitude", "Longitude", "#")) %>% 
  full_join(Sight14, by = c("Date", "Time", "Latitude", "Longitude", "#")) %>% 
  full_join(Sight15, by = c("Date", "Time", "Latitude", "Longitude", "#")) %>% 
  full_join(Sight16, by = c("Date", "Time", "Latitude", "Longitude", "#")) 
colnames(Sights)[5] <- "podSize"
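Since every table has the same columns, the same stacking can be sketched with bind_rows(). The example below is only an illustration on made-up toy tables (the `sheets` list and its values are assumptions, not the dolphin data); it compares one bind_rows() call with the chained full_join() approach.

```r
library(dplyr)

# Toy stand-ins for three yearly sheets; the values are made up for illustration.
sheets <- list(
  tibble(Date = as.Date(c("2006-07-25", "2006-07-26")), `#` = c(50, 70)),
  tibble(Date = as.Date("2007-08-01"), `#` = 30),
  tibble(Date = as.Date(c("2008-06-15", "2008-09-02")), `#` = c(10, 20))
)

# Stack all tables in one step and rename the pod-size column by name.
stacked <- bind_rows(sheets) %>%
  rename(podSize = `#`)

# The chained full_join() approach, written as a fold over the list.
joined <- Reduce(function(x, y) full_join(x, y, by = c("Date", "#")), sheets) %>%
  rename(podSize = `#`)

nrow(stacked)  # number of rows after stacking
nrow(joined)   # number of rows after the chained joins
```

The two results agree here because no row appears in more than one table. If two tables shared an identical row, full_join() would merge the pair into one row while bind_rows() would keep both.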

It is actually a better idea to split the Date column into 3 columns (Year, Month and Day), since I’m going to look at how the pod size distribution varies by time of year and across the years.

SightsYMD <- Sights %>% 
  separate(Date, into = c("Year", "Month", "Day"))
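To make the effect of separate() concrete, here is a small sketch on a made-up two-row table (the `toy` data are an assumption for illustration). It shows that the new columns are character strings by default, which is convenient for boxplots by month, and that `convert = TRUE` yields integers instead.

```r
library(dplyr)
library(tidyr)

# Made-up example rows, not the real sightings
toy <- tibble(Date = as.Date(c("2006-07-25", "2006-08-02")),
              podSize = c(50, 30))

# Default behaviour: the pieces are kept as character strings ("07", "08", ...)
parts <- toy %>%
  separate(Date, into = c("Year", "Month", "Day"))

# With convert = TRUE the pieces become integers (7, 8, ...)
parts_int <- toy %>%
  separate(Date, into = c("Year", "Month", "Day"), convert = TRUE)

class(parts$Month)
class(parts_int$Month)
```

Keeping Month and Year as character columns means ggplot treats them as discrete groups, which is what the boxplots by month and by year need.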

Now I finally have the table that I wanted. Let’s take a quick look at its first few rows.

kable(SightsYMD %>% head())

Year  Month  Day  Time  Latitude  Longitude  podSize
2006  07     25   1519  42.70759  -70.50612  50
2006  07     26   1645  42.85215  -70.32297  70
2006  07     26   1632  42.84765  -70.32412  40
2006  07     29   1431  42.73842  -70.52207  20
2006  07     29   1445  42.74283  -70.51724  10
2006  08     02   1615  42.90334  -70.34932  30

4.3 Data Exploring

4.3.1 Pod Size

Now it’s time to analyze. First, I wanted to check out the pod size distribution.

SightsYMD %>% 
  ggplot(mapping = aes(podSize)) +
  geom_histogram(bins = 40, color = "black", fill = "grey")

The shape of the distribution of pod size is skewed to the right, so I decided to use a log transformation and therefore model the log(podSize).

SightsYMD %>% 
  ggplot(mapping = aes(log(podSize))) +
  geom_histogram(bins = 25, color = "black", fill = "grey")

Sure enough! The log(podSize) distribution looks roughly normal.
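The effect of the log transformation on a right-skewed variable can be sketched with simulated data. Everything below is an assumption for illustration: the pod sizes are drawn from a log-normal distribution with arbitrary parameters, and `skewness()` is a small helper defined here, not a function from the packages loaded above.

```r
set.seed(229)  # arbitrary seed so the sketch is reproducible

# Simulated right-skewed "pod sizes" (all at least 1), not the real data
sizes <- round(rlnorm(1000, meanlog = 3.4, sdlog = 1.1)) + 1

# Simple moment-based sample skewness: near 0 for a symmetric distribution
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(sizes)       # clearly positive: long right tail
skew_log <- skewness(log(sizes))  # much closer to 0 after the transformation

c(raw = skew_raw, log = skew_log)
```

A skewness much closer to zero after taking logs is the numerical counterpart of the histogram looking more symmetric.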

Next, I’d like to look at factors that might affect pod size.

4.3.1.1 Time of the Year

Below is a log(podSize) vs Month boxplot:

SightsYMD %>% 
  ggplot(mapping = aes(x= Month, y = log(podSize))) +
  geom_boxplot()

The first thing I noticed was that all the data were recorded in the 7 months from May to November. The center line of each box is the median, and the median log(podSize) is broadly similar across those months. It is highest in August, September and October, meaning that pods tend to be largest during those months. On the other hand, May has the lowest median log(podSize), which indicates that the estimated pod size tends to be smallest in May.

4.3.1.2 Year (2006-2016)

Next, let’s find out how pod size varies across the years.

SightsYMD %>% 
  ggplot(mapping = aes(x = Year, y = log(podSize))) +
  geom_boxplot()

Overall, the median log(podSize) doesn’t seem to differ much across the years, so year seems to have little or no effect on pod size. In some years (2009, 2016,…) pods are slightly larger than in the others, with the highest median log(podSize) at about 4. On the other hand, the lowest median log(podSize) is about 3 (in 2013). This is not far below the others on the log scale, although a difference of 1 in log(podSize) corresponds to a factor of about 2.7 in pod size.

4.3.1.3 Time of the Day

To see how pod size varies by time of day, I compared the log(podSize) of 4 different periods:

  • Morning (before 11:00)
  • Noonish (11:00 - 14:00)
  • Afternoon (14:00 - 17:00)
  • Evening (after 17:00)

Here are the summary statistics for each of the periods I just mentioned:

Morning <- SightsYMD %>% 
  filter(Time <= 1100)
kable(favstats(~ log(podSize), data = Morning))

min  Q1        median    Q3        max       mean      sd        n    missing
0    2.484907  3.401197  4.317488  6.214608  3.369225  1.174626  152  0

Noonish <- SightsYMD %>% 
  filter(Time > 1100 & Time <= 1400)
kable(favstats(~ log(podSize), data = Noonish))

min  Q1       median   Q3       max       mean      sd        n    missing
0    2.70805  3.68888  4.60517  5.703782  3.488528  1.214849  255  0

Afternoon <- SightsYMD %>% 
  filter(Time > 1400 & Time <= 1700)
kable(favstats(~ log(podSize), data = Afternoon))

min  Q1       median    Q3        max       mean      sd        n    missing
0    2.70805  3.401197  4.248495  6.907755  3.425035  1.069065  472  1

Evening <- SightsYMD %>% 
  filter(Time > 1700)
# Summarize the Evening subset; passing data = Noonish here would only repeat the Noonish table
kable(favstats(~ log(podSize), data = Evening))

Among the periods summarized above, the mean and median log(podSize) are similar (means of about 3.4 to 3.5, medians of about 3.4 to 3.7), so time of day does not appear to have much effect on pod size. The highest log(podSize) shown is 6.9, in the afternoon.
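The four periods can also be built in one step, which avoids repeating the filter-and-summarize code for each period. The sketch below uses cut() with the same cutoffs as the filters above; the `toy` table and its values are made up for illustration, and the summary uses plain dplyr rather than favstats().

```r
library(dplyr)

# Made-up sightings: Time is a 24-hour clock value such as 1519
toy <- tibble(
  Time    = c(900, 1100, 1101, 1400, 1519, 1700, 1701, 1830),
  podSize = c(50, 70, 40, 20, 10, 30, 5, 60)
)

by_period <- toy %>%
  # Intervals are closed on the right: (-Inf,1100], (1100,1400], (1400,1700], (1700,Inf)
  mutate(Period = cut(Time,
                      breaks = c(-Inf, 1100, 1400, 1700, Inf),
                      labels = c("Morning", "Noonish", "Afternoon", "Evening"))) %>%
  group_by(Period) %>%
  summarise(n          = n(),
            mean_log   = mean(log(podSize)),
            median_log = median(log(podSize)))

by_period
```

Because every period comes from the same grouped summary, a table cannot accidentally be computed from the wrong subset.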

4.3.1.4 Location

USA <- map_data("usa")
States <- map_data("state")
colnames(SightsYMD)[5] <- "lat"
colnames(SightsYMD)[6] <- "long"

This map shows the locations of the sightings, with point size indicating pod size.

USA %>% 
  ggplot(mapping = aes(x = long, y = lat)) +
  geom_polygon(mapping = aes(group = group), color = "brown", fill = "tan") +
  geom_point(data = SightsYMD, mapping = aes(x = long, y = lat, size = podSize)) +
  coord_fixed(xlim = c(-71, -70), ylim = c(42.1, 43.2))

All the sightings are off the Atlantic coast of the US. Pod size seems to be largest at locations with longitude between -70.6 and -70.25 and latitude between 42.75 and 43.

4.3.2 Sea Surface Temperature

The next thing I did was to look at the sea surface temperature.

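First, I loaded the SeaTemps data file. The code below reads Sheet2, which holds one year’s data, keeps the date and Fahrenheit columns, drops the “N/A” entries, and converts the temperatures to numbers, so the temperature summaries that follow are based on that one year.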

SeaTemps <- read_excel("~/Data229/Project/Dolphins/SeaTemps.xlsx", sheet = "Sheet2")
SeaTemps <- SeaTemps[,1:2]
SeaTemps <- SeaTemps %>% 
  filter(`SST (deg F)` != "N/A") %>% 
  mutate(Temperature = as.double(`SST (deg F)`), Date = as.Date(Date)) %>% 
  select(-`SST (deg F)`)
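The cleaning steps above can be tried on a small in-memory table. The `raw` data below are made up (they only mimic the layout of a date column plus a text temperature column containing "N/A"); the pipeline itself is the same as the one applied to SeaTemps.

```r
library(dplyr)

# Made-up rows mimicking the sheet: temperatures arrive as text because of the "N/A" entries
raw <- tibble(
  Date          = as.POSIXct(c("2016-05-01", "2016-05-02", "2016-05-03"), tz = "UTC"),
  `SST (deg F)` = c("50.2", "N/A", "51.0")
)

clean <- raw %>%
  filter(`SST (deg F)` != "N/A") %>%                # drop the "N/A" text entries
  mutate(Temperature = as.double(`SST (deg F)`),    # text to number
         Date = as.Date(Date)) %>%                  # date-time to plain Date
  select(-`SST (deg F)`)

clean
```

Filtering before converting matters: calling as.double() on the text "N/A" would produce NA with a coercion warning.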

Now let’s look at the average temperature in each month. Here are the numerical and visual summaries of average sea surface temperature.

AvgTemp <- SeaTemps %>%
  separate(Date, into = c("Year", "Month", "Day")) %>% 
  group_by(Month) %>% 
  summarise(AvgTemp = mean(Temperature))
kable(AvgTemp)

Month  AvgTemp
05     50.51983
06     58.76095
07     65.20261
08     66.83306
09     63.39961
10     57.59091

AvgTemp %>% 
  ggplot(mapping = aes(x = Month, y = AvgTemp, group = "")) +
  geom_line(color = "red", size = 2) +
  geom_point(color = "blue", size = 5)

The average sea surface temperature is highest in August, at about 67 degrees F. July, August and September are the 3 warmest months, each with a mean temperature over 63 degrees. In contrast, the surface is cooler in May, June and October (all under 60 degrees), and May is the coldest month, with an average temperature slightly above 50 degrees F.

4.3.3 Pod Size vs Temperature

Sights <- Sights %>% 
  select(Date, podSize) %>% 
  mutate(Date = as.Date(Date))
PodTemp <- full_join(Sights, SeaTemps, by = "Date")
PodTemp %>% 
  ggplot(mapping = aes(y = Temperature, x = log(podSize))) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
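It helps to know which rows a full_join() by Date keeps. The sketch below uses two made-up tables (`sights` and `temps` are assumptions for illustration) to show that a full join keeps dates found in only one table, filling the other variable with NA, and that only the complete rows can contribute to a scatterplot or a regression.

```r
library(dplyr)

# Made-up sightings and temperatures; only 2016-07-01 appears in both tables
sights <- tibble(Date = as.Date(c("2016-07-01", "2016-07-01", "2015-08-10")),
                 podSize = c(20, 50, 30))
temps  <- tibble(Date = as.Date(c("2016-07-01", "2016-07-02")),
                 Temperature = c(64.1, 64.8))

full  <- full_join(sights, temps, by = "Date")   # keeps unmatched rows from both tables
inner <- inner_join(sights, temps, by = "Date")  # keeps only dates present in both

nrow(full)                  # all rows, including those with NA
sum(complete.cases(full))   # rows with both podSize and Temperature
nrow(inner)                 # same count as the complete rows above
```

Counting the complete rows in PodTemp in the same way would show how many sightings actually have a matching temperature.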

lm(Temperature ~ log(podSize), data = PodTemp)

Call:
lm(formula = Temperature ~ log(podSize), data = PodTemp)

Coefficients:
 (Intercept)  log(podSize)  
     65.0016        0.2285  

The plot above reveals a positive but weak, roughly linear relationship between log(podSize) and Temperature. This means larger log(podSize) is associated with warmer sea surface temperature. The regression equation is Temp^ = 0.23*log(podSize) + 65. The slope of 0.23 indicates that each increase of 1 in log(podSize), that is, each multiplication of pod size by a factor of e (about 2.72), is associated with an increase of about 0.23 degrees F in temperature.
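The slope can also be read for other multiples of pod size, since multiplying pod size by k adds log(k) to log(podSize). The short sketch below does this arithmetic with the fitted slope; `temp_change()` is a helper defined here for illustration only.

```r
slope <- 0.2285  # coefficient on log(podSize) from the lm() output above

# Change in predicted temperature (degrees F) when pod size is multiplied by k
temp_change <- function(k) slope * log(k)

temp_change(exp(1))  # a factor of e: equals the slope itself
temp_change(2)       # doubling the pod size: roughly 0.16
temp_change(10)      # a tenfold larger pod: roughly 0.53
```

Even a tenfold difference in pod size corresponds to only about half a degree Fahrenheit, which matches the weak relationship seen in the plot.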