Data

Section 1: Original Dataset

Our team mainly worked with the Motor Vehicle Collisions data on NYC Open Data. This dataset ranges from July 2012 to the present. The data was collected for record-keeping purposes: every crash is a police-reported motor vehicle collision in New York City, and a police report must be filed for any collision in which someone is injured or killed, or in which there is at least $1,000 worth of damage.

For our project, we look at the data from 2016-2020. We chose these five years because we wanted mostly pre-pandemic data, along with a few months of pandemic-era data in case they supported secondary observations.

The following are the relevant variables for the original dataset.

  • Crash Date: Date the collision occurred
  • Crash Time: Time the collision occurred
  • Borough: Borough where the collision occurred
  • Zip Code: Postal code where the collision occurred
  • Latitude: Latitude coordinate, Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)
  • Longitude: Longitude coordinate, Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)
  • Number of Persons Injured: Number of people injured in the collision
  • Number of Persons Killed: Number of people killed in the collision
  • Number of Pedestrians Injured: Number of pedestrians injured in the collision
  • Number of Pedestrians Killed: Number of pedestrians killed in the collision
  • Number of Cyclist Injured: Number of cyclists injured in the collision
  • Number of Cyclist Killed: Number of cyclists killed in the collision
  • Contributing Factor Vehicle 1: Factor contributing to the collision for the first vehicle
  • Contributing Factor Vehicle 2: Factor contributing to the collision for the second vehicle
  • Vehicle Type Code 1: Type of the first vehicle, based on the selected vehicle category (ATV, bicycle, car/suv, ebike, escooter, truck/bus, motorcycle, other)
  • Vehicle Type Code 2: Type of the second vehicle, based on the selected vehicle category (ATV, bicycle, car/suv, ebike, escooter, truck/bus, motorcycle, other)

Section 1.1: Cleaning the original dataset

While selecting the most relevant variables, we found some with missing values that we chose to omit; those variables are not included in the list above. Next, we cleaned the data. For our EDA, we began by reducing the size of the dataset to a random sample of 50,000 observations from 2016-2020 so that the sample would be easier to manage. We then removed information we deemed irrelevant, such as "on-street name", "cross street name", and "off-street name", because accidents rarely recur on the same street often enough for adequate analysis. The original cleaned dataset can be found here.

In order to create data visualizations and statistical models using time variables such as date and time, we needed to convert these inputs, which were stored as strings, into numeric values. We split the date from one collective string into separate "Month", "Day", and "Year" variables, enabling more specific findings about the frequency of accidents across the year and month. To also look into the prevalence of accidents during the day, we separated the time into "Hour" and "Minute" variables and converted them to numeric values.
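In miniature, this step can be sketched as follows; the two rows and their values are made up for illustration, but the column names and string formats mirror the original data:

```r
library(tidyverse)

# Toy data mirroring the original string-typed date/time columns (hypothetical values)
crashes <- tibble(
  `CRASH DATE` = c("07/14/2017", "01/03/2019"),
  `CRASH TIME` = c("08:45", "17:05")
)

crashes <- crashes %>%
  # Split the single date string into its three components
  separate(`CRASH DATE`, c("MONTH", "DAY", "YEAR"), sep = "/") %>%
  # Split the time string into hour and minute
  separate(`CRASH TIME`, c("HOUR", "MINUTE"), sep = ":") %>%
  # separate() yields character columns, so convert them to numeric
  mutate(across(c(MONTH, DAY, YEAR, HOUR, MINUTE), as.numeric))
```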

The dataset also contained five different variables related to the cause of the accident, labelled "Contributing Factor Vehicle 1" through "Contributing Factor Vehicle 5." To consolidate the causes into one variable, we used pivot_longer to create a new variable called "CAUSE". The same was done with "Vehicle Type Code 1" through "Vehicle Type Code 5" by creating a new variable called "VEHICLE."
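In miniature, the consolidation works like this; for brevity only two of the five factor columns are shown, with made-up values:

```r
library(tidyverse)

# Toy rows with two of the five contributing-factor columns (hypothetical values)
wide <- tibble(
  ID = 1:2,
  `CONTRIBUTING FACTOR VEHICLE 1` = c("driver inattention", "unspecified"),
  `CONTRIBUTING FACTOR VEHICLE 2` = c("unspecified", NA)
)

# Stack the factor columns into a single CAUSE column, dropping empty entries
long <- wide %>%
  pivot_longer(starts_with("CONTRIBUTING FACTOR"),
               names_to = "CONTRIBUTING VEHICLE", values_to = "CAUSE") %>%
  filter(!is.na(CAUSE))
```

Each collision now contributes one row per non-missing cause, which is what allows a single "CAUSE" variable to be tabulated directly.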

Section 1.2: Potential Limitations of our Dataset

There were a few issues with our dataset that could hinder our analysis. One involves how certain data points are entered. This mainly affects the "CAUSE" and "VEHICLE" variables, which contain misspellings, multiple ways of classifying the same thing, and many values labelled "unspecified". As a result, our analysis risks misinterpreting the data, with some values excluded or overlooked due to human error in data entry. Furthermore, certain types of accidents could be more likely to be recorded as "unspecified," thereby skewing the data.

Section 2: Merging Datasets

In addition to exploratory data analysis of the original dataset, we wanted to see whether weather has any impact on the number of collisions in New York City. In order to do so, we also looked at Weather Data.

The following illustrates the relevant variables for the weather dataset.

  • Temp High: Highest temperature on a given day
  • Temp Low: Lowest temperature on a given day
  • Year: Year
  • Month: Month of the given year
  • Day: Day of the given month
  • Snow, Ice, Pellets, Hail, Ice on Ground (in): Amount of snow and ice on the ground, in inches

Section 2.1: Cleaning the merged dataset

As our first step, we downloaded monthly weather files for 2016-2020 (12 files per year, 60 in total) and merged them into a single dataset, which we refer to here as "Weather Data". The next step was to merge this dataset with our original dataset. We merged in two ways: 1) merged_data, which can be found here, containing totals condensed by day for regression analysis; and 2) merged_data_expanded, which can be found here, mapping the weather onto each individual collision for use in mapping.
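The difference between the two merges can be sketched with toy tables (all values hypothetical): a right join onto the weather keeps one row per weather day, even days with no sampled collision, while a left join from the collision side attaches that day's weather to every collision row.

```r
library(tidyverse)

# Toy daily collision totals and daily weather (hypothetical values)
collisions_by_day <- tibble(DATE = c("01/01/2016", "01/02/2016"),
                            num_collisions = c(5, 3))
weather <- tibble(DATE = c("01/01/2016", "01/02/2016", "01/03/2016"),
                  `TEMP HIGH` = c(40, 38, 35))

# 1) One row per weather day, for regression on daily totals
merged_data <- right_join(collisions_by_day, weather, by = "DATE")

# 2) Weather attached to each collision-side row, for mapping; here the daily
#    table stands in for the full collision-level data
merged_data_expanded <- left_join(collisions_by_day, weather, by = "DATE")
```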

Please see Section 4 to view the code for cleaning the weather dataset in order to merge it with the main dataset.

Section 3: R Packages

Overall, the following new R packages were used for this project.

  • jtools: a collection of tools that aided the analysis and presentation of our regressions
  • readxl: to read Excel files rather than CSV files

Section 4: Code for cleaning the datasets

Our load_and_clean_data.R script, covering both the original data and the weather data, is discussed in detail below. Alternatively, the file can be found here.

# Code for cleaning the original dataset:

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(sf)
## Linking to GEOS 3.9.1, GDAL 3.2.1, PROJ 7.2.1
library(readxl)
library(jtools)

collision <- read_csv(here::here('dataset/NYC Collision Data.csv'))
## Rows: 1831755 Columns: 29
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (17): CRASH DATE, CRASH TIME, BOROUGH, LOCATION, ON STREET NAME, CROSS S...
## dbl (12): ZIP CODE, LATITUDE, LONGITUDE, NUMBER OF PERSONS INJURED, NUMBER O...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
collision_50000 <- collision %>%
  filter(!is.na(BOROUGH), !is.na(`ZIP CODE`), !is.na(LONGITUDE), !is.na(LATITUDE),
         !is.na(LOCATION), !is.na(`NUMBER OF PEDESTRIANS KILLED`),
         !is.na(`NUMBER OF CYCLIST INJURED`)) %>%
  # Drop the street-name columns we deemed irrelevant
  select(-(`ON STREET NAME`:`OFF STREET NAME`)) %>%
  rename(DATE = `CRASH DATE`, TIME = `CRASH TIME`, ZIPCODE = `ZIP CODE`) %>%
  # Split the date and time strings into their components, keeping 2016-2020 only
  separate(DATE, c("MONTH", "DAY", "YEAR"), sep = "/") %>%
  filter(!(YEAR %in% c("2012", "2013", "2014", "2015"))) %>%
  separate(TIME, c("HOUR", "MINUTE"), sep = ":") %>%
  # Standardize the five contributing-factor columns and require all five to be present
  mutate(across(`CONTRIBUTING FACTOR VEHICLE 1`:`CONTRIBUTING FACTOR VEHICLE 5`,
                str_to_lower)) %>%
  filter(if_all(`CONTRIBUTING FACTOR VEHICLE 1`:`CONTRIBUTING FACTOR VEHICLE 5`,
                ~ !is.na(.x))) %>%
  # Consolidate the cause and vehicle-type columns into single CAUSE and VEHICLE variables
  pivot_longer(`CONTRIBUTING FACTOR VEHICLE 1`:`CONTRIBUTING FACTOR VEHICLE 5`,
               names_to = "CONTRIBUTING VEHICLE", values_to = "CAUSE") %>%
  filter(!is.na(CAUSE)) %>%
  pivot_longer(`VEHICLE TYPE CODE 1`:`VEHICLE TYPE CODE 5`,
               names_to = "VEHICLE CODE", values_to = "VEHICLE") %>%
  select(-`CONTRIBUTING VEHICLE`, -`VEHICLE CODE`) %>%
  mutate(VEHICLE = str_to_lower(VEHICLE)) %>%
  filter(!is.na(VEHICLE)) %>%  # was !is.na("VEHICLE"), which tested a string literal
  sample_n(50000) %>%
  mutate(HOUR = as.numeric(HOUR), MINUTE = as.numeric(MINUTE))
# write_csv(collision_50000, here::here('dataset/NYC_Collision_50000.csv'))

# Code for merging the weather data with the original dataset:

weather <- read_excel(here::here('dataset/WeatherData2016-2020.xlsx'))
weather <- weather %>% rename_with(toupper)
weather <- weather %>% mutate(DAY = ifelse(nchar(DAY) == 1, paste0("0", DAY), DAY))
weather <- weather %>% mutate(MONTH = ifelse(nchar(MONTH) == 1, paste0("0", MONTH), MONTH))
weather <- weather %>% unite("DATE", MONTH, DAY, YEAR, sep = "/")

# We use the 50,000 random sample below for merging the dataset

collision_50000 <- collision_50000 %>% unite("DATE", MONTH, DAY, YEAR, sep = "/")
collisions_by_day <- collision_50000 %>%
  group_by(DATE) %>%
  # One row per day: daily totals plus a count of collisions
  summarise(`NUMBER OF PERSONS KILLED` = sum(`NUMBER OF PERSONS KILLED`),
            `NUMBER OF PEDESTRIANS KILLED` = sum(`NUMBER OF PEDESTRIANS KILLED`),
            `NUMBER OF CYCLIST KILLED` = sum(`NUMBER OF CYCLIST KILLED`),
            `NUMBER OF MOTORIST KILLED` = sum(`NUMBER OF MOTORIST KILLED`),
            `NUMBER OF PERSONS INJURED` = sum(`NUMBER OF PERSONS INJURED`),
            `NUMBER OF PEDESTRIANS INJURED` = sum(`NUMBER OF PEDESTRIANS INJURED`),
            `NUMBER OF CYCLIST INJURED` = sum(`NUMBER OF CYCLIST INJURED`),
            `NUMBER OF MOTORIST INJURED` = sum(`NUMBER OF MOTORIST INJURED`),
            num_collisions = n())


  
merged_data <- right_join(collisions_by_day, weather, by = "DATE")
merged_data_expanded <- left_join(collision_50000, weather, by = "DATE")
#write_csv(merged_data_expanded, here::here("dataset/Merged_Data_Expanded.csv"))

# Below is code to add seasons to the merged dataset
merged_data <- merged_data %>% separate(DATE, c("MONTH", "DAY", "YEAR"), sep = "/")

merged_data <- merged_data %>% mutate(SEASON = case_when(MONTH %in% c("09", "10", "11") ~ "Fall", MONTH %in% c("12", "01", "02")  ~ "Winter", MONTH %in% c("03", "04", "05")  ~ "Spring", TRUE ~ "Summer"))
#write_csv(merged_data, here::here("dataset/Merged_Data.csv"))