Loading and Cleaning:
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
collision <- read_csv(here::here('dataset/NYC Collision Data.csv'))
## Rows: 1831755 Columns: 29
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (17): CRASH DATE, CRASH TIME, BOROUGH, LOCATION, ON STREET NAME, CROSS S...
## dbl (12): ZIP CODE, LATITUDE, LONGITUDE, NUMBER OF PERSONS INJURED, NUMBER O...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
collision['collision_date'] = as.Date(collision$'CRASH DATE', format="%m/%d/%Y")
collision<- collision%>%mutate(., YEAR = year(collision_date))
collision<- collision%>%filter(YEAR>2015 & YEAR<2021)
summary(collision)
## CRASH DATE CRASH TIME BOROUGH ZIP CODE
## Length:1016774 Length:1016774 Length:1016774 Min. :10000
## Class :character Class :character Class :character 1st Qu.:10451
## Mode :character Mode :character Mode :character Median :11208
## Mean :10859
## 3rd Qu.:11249
## Max. :11697
## NA's :360780
## LATITUDE LONGITUDE LOCATION ON STREET NAME
## Min. : 0.00 Min. :-201.36 Length:1016774 Length:1016774
## 1st Qu.:40.67 1st Qu.: -73.97 Class :character Class :character
## Median :40.72 Median : -73.92 Mode :character Mode :character
## Mean :40.67 Mean : -73.83
## 3rd Qu.:40.77 3rd Qu.: -73.86
## Max. :43.34 Max. : 0.00
## NA's :92548 NA's :92548
## CROSS STREET NAME OFF STREET NAME NUMBER OF PERSONS INJURED
## Length:1016774 Length:1016774 Min. : 0.0000
## Class :character Class :character 1st Qu.: 0.0000
## Mode :character Mode :character Median : 0.0000
## Mean : 0.2841
## 3rd Qu.: 0.0000
## Max. :31.0000
## NA's :17
## NUMBER OF PERSONS KILLED NUMBER OF PEDESTRIANS INJURED
## Min. :0.000000 Min. : 0.00000
## 1st Qu.:0.000000 1st Qu.: 0.00000
## Median :0.000000 Median : 0.00000
## Mean :0.001224 Mean : 0.04979
## 3rd Qu.:0.000000 3rd Qu.: 0.00000
## Max. :8.000000 Max. :27.00000
## NA's :31
## NUMBER OF PEDESTRIANS KILLED NUMBER OF CYCLIST INJURED
## Min. :0.000000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.000000 Median :0.00000
## Mean :0.000621 Mean :0.02474
## 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :6.000000 Max. :3.00000
##
## NUMBER OF CYCLIST KILLED NUMBER OF MOTORIST INJURED NUMBER OF MOTORIST KILLED
## Min. :0.0000000 Min. : 0.0000 Min. :0.000000
## 1st Qu.:0.0000000 1st Qu.: 0.0000 1st Qu.:0.000000
## Median :0.0000000 Median : 0.0000 Median :0.000000
## Mean :0.0001131 Mean : 0.2094 Mean :0.000488
## 3rd Qu.:0.0000000 3rd Qu.: 0.0000 3rd Qu.:0.000000
## Max. :2.0000000 Max. :31.0000 Max. :4.000000
##
## CONTRIBUTING FACTOR VEHICLE 1 CONTRIBUTING FACTOR VEHICLE 2
## Length:1016774 Length:1016774
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
## CONTRIBUTING FACTOR VEHICLE 3 CONTRIBUTING FACTOR VEHICLE 4
## Length:1016774 Length:1016774
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
## CONTRIBUTING FACTOR VEHICLE 5 COLLISION_ID VEHICLE TYPE CODE 1
## Length:1016774 Min. :3363355 Length:1016774
## Class :character 1st Qu.:3618219 Class :character
## Mode :character Median :3872688 Mode :character
## Mean :3872628
## 3rd Qu.:4127036
## Max. :4463721
##
## VEHICLE TYPE CODE 2 VEHICLE TYPE CODE 3 VEHICLE TYPE CODE 4
## Length:1016774 Length:1016774 Length:1016774
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## VEHICLE TYPE CODE 5 collision_date YEAR
## Length:1016774 Min. :2016-01-01 Min. :2016
## Class :character 1st Qu.:2017-02-13 1st Qu.:2017
## Mode :character Median :2018-03-22 Median :2018
## Mean :2018-04-02 Mean :2018
## 3rd Qu.:2019-05-05 3rd Qu.:2019
## Max. :2020-12-31 Max. :2020
##
collision_use <- collision %>% filter(!is.na(BOROUGH), !is.na('ZIP CODE'), !is.na(LONGITUDE), !is.na(LATITUDE), !is.na(LOCATION))%>% select(-(`ON STREET NAME`:`OFF STREET NAME`)) %>%
rename(DATE = `CRASH DATE`, TIME = `CRASH TIME`, ZIPCODE = `ZIP CODE`, LOCATION = `LOCATION`)
collision_use['collision_date'] = as.Date(collision_use$DATE, format="%m/%d/%Y")
collision_use<- collision_use%>% mutate(., year = year(collision_date))
collision_use$TIME<- as.character(collision_use$TIME)
collision_use$TIME <- as.numeric(unlist(strsplit(collision_use$TIME, ":"))[seq(1, 2 * nrow(collision_use), 2)])
#EDA
collisions_per_day <-collision_use %>%
group_by(collision_date, BOROUGH) %>%
summarise(NUMBER_OF_ACCIDENTS=n(),
NUMBER_OF_DEATHS = sum(`NUMBER OF PERSONS KILLED`),
NUMBER_OF_INJURIES = sum(`NUMBER OF PERSONS INJURED`))%>%
ungroup()
## `summarise()` has grouped output by 'collision_date'. You can override using the `.groups` argument.
ggplot(data=collisions_per_day, aes(x=collision_date, y=NUMBER_OF_ACCIDENTS))+
geom_line(aes(group=BOROUGH, color=BOROUGH))+ggtitle("Number of Accidents over Time by Borough")
#Missing values exploration
collisions_missing <- collision %>%
mutate(MISSING_BOROUGH=1*is.na(BOROUGH),
MISSING_CONTRIBUTION_INFO=1*is.na(`CONTRIBUTING FACTOR VEHICLE 1`)) %>%
select(MISSING_BOROUGH, MISSING_CONTRIBUTION_INFO, YEAR, BOROUGH)
missing_borough_by_year <- collisions_missing %>%
group_by(YEAR) %>% summarise(percent_missing_borough=mean(MISSING_BOROUGH))
missing_contrib_by_year <- collisions_missing %>%
group_by(YEAR) %>% summarise(percent_missing_contrib=mean(MISSING_CONTRIBUTION_INFO))
missing_contrib_by_borough <- collisions_missing %>%
group_by(BOROUGH)%>%summarise(percent_missing_contrib=mean(MISSING_CONTRIBUTION_INFO))
#Years with the highest missing information
ggplot(data=missing_borough_by_year, aes(x=YEAR, y=percent_missing_borough,)) + geom_col(aes(fill=percent_missing_borough))+
xlab("Year") + ylab("% Missing Borough Information") +
ggtitle("% Missing Borough Information vs Year")
#Variables with highest missing percentage of values
naValues = as.data.frame(100* apply(collision, 2, function(col)sum(is.na(col))/length(col)))
naValues$Var = rownames(naValues)
colnames(naValues) = c("NA_Percentage","Variables")
naValues <- naValues[order(naValues$NA_Percentage),]
ggplot(naValues,aes(y= NA_Percentage, x=reorder( Variables,NA_Percentage)))+
geom_bar( stat="identity", position = "dodge", fill ="red")+
coord_flip()+
theme_classic()+
scale_colour_brewer() +
scale_y_continuous()+ ggtitle("Percentage of Missing Values in NYC Vehicle Collisions Dataset \nby Variable Type") +ylab("Missing Value Percentage") + xlab("Variables")
#Accidents in each borough over time.
collision_use['collision_date'] = as.Date(collision_use$DATE, format="%m/%d/%Y")
collision_use<- collision_use%>% mutate(., year = year(collision_date))
collision_use%>% mutate(., year = year(collision_date)) %>%
group_by(., BOROUGH, year) %>%
summarise(.,count = n()) %>%
ggplot(aes(x=year, y=count))+
geom_line(stat="identity", aes(color=BOROUGH))+geom_point()+ggtitle("Number of Accidents Over Time by Borough") +
scale_colour_viridis_d("plasma")
## `summarise()` has grouped output by 'BOROUGH'. You can override using the `.groups` argument.
#CONTRIBUTING FACTOR EDA
print("Number of missing values in `CONTRIBUTING FACTOR VEHICLE 1`")
## [1] "Number of missing values in `CONTRIBUTING FACTOR VEHICLE 1`"
print(sum(is.na(collision_use$`CONTRIBUTING FACTOR VEHICLE 1`)))
## [1] 2429
print("Number of missing values in `CONTRIBUTING FACTOR VEHICLE 2`")
## [1] "Number of missing values in `CONTRIBUTING FACTOR VEHICLE 2`"
print(sum(is.na(collision_use$`CONTRIBUTING FACTOR VEHICLE 2`)))
## [1] 106742
#ma
#Top 10 contributing factors for Vehicle 1
collisions_by_factor1 <- collision_use%>% filter(!is.na(`CONTRIBUTING FACTOR VEHICLE 1`)) %>%
filter(`CONTRIBUTING FACTOR VEHICLE 1` != "Unspecified") %>%
group_by(`CONTRIBUTING FACTOR VEHICLE 1`) %>%
summarise(NUMBER_OF_ACCIDENTS=n(),
NUMBER_OF_DEATHS = sum(`NUMBER OF PERSONS KILLED`)) %>%
arrange(desc(NUMBER_OF_ACCIDENTS))
collisions_by_factor1 <- head(collisions_by_factor1, n = 10)
ggplot(collisions_by_factor1, aes(x=reorder(`CONTRIBUTING FACTOR VEHICLE 1`, `NUMBER_OF_ACCIDENTS`, fill=`CONTRIBUTING FACTOR VEHICLE 1`), y=`NUMBER_OF_ACCIDENTS`))+
geom_col(aes(fill=`CONTRIBUTING FACTOR VEHICLE 1`))+
ggtitle("Top 10 First Contributing Factors by Accident Count")+
xlab("Contributing Factor Vehicle 1")+
ylab("Number of Accidents")+
coord_flip()
collisions_by_factor2 <- collision_use%>% filter(!is.na(`CONTRIBUTING FACTOR VEHICLE 2`)) %>%
filter(`CONTRIBUTING FACTOR VEHICLE 2` != "Unspecified") %>%
group_by(`CONTRIBUTING FACTOR VEHICLE 2`) %>%
summarise(NUMBER_OF_ACCIDENTS=n(),
NUMBER_OF_DEATHS = sum(`NUMBER OF PERSONS KILLED`)) %>%
arrange(desc(NUMBER_OF_ACCIDENTS))
collisions_by_factor2 <- head(collisions_by_factor2, n = 10)
ggplot(collisions_by_factor2, aes(x=reorder(`CONTRIBUTING FACTOR VEHICLE 2`, `NUMBER_OF_ACCIDENTS`, fill=`CONTRIBUTING FACTOR VEHICLE 2`), y=`NUMBER_OF_ACCIDENTS`))+
geom_col(aes(fill=`CONTRIBUTING FACTOR VEHICLE 2`))+
ggtitle("Top 10 Second Contributing Factors by Accident Count")+
xlab("Contributing Factor Vehicle 2")+
ylab("Number of Accidents")+
coord_flip()
#Vehicle Type EDA
#Most of the observations do not have variables `CONTRIBUTING FACTOR VEHICLE 5`, `VEHICLE TYPE CODE 5` etc, I dropped them, and Created a new binary variable which represents if 3 or more vehicles were involved in an accident.
collision_use <- collision_use %>% mutate(VEHICLES_INVOLVED_GTE3 = 1 * !is.na(`CONTRIBUTING FACTOR VEHICLE 3`))
collision_use <- collision_use %>% select(-c(`VEHICLE TYPE CODE 5`,`CONTRIBUTING FACTOR VEHICLE 5`,
`VEHICLE TYPE CODE 4`,`CONTRIBUTING FACTOR VEHICLE 4`,
`VEHICLE TYPE CODE 3`,`CONTRIBUTING FACTOR VEHICLE 3`,
))
#Number of Accidents and Number of Deaths for Combinations of Vehicle Types 1 and 2.
collision_by_vehicleTypes <- collision_use %>%
filter(!is.na(`VEHICLE TYPE CODE 1`) & !is.na(`VEHICLE TYPE CODE 2`)) %>%
filter(`VEHICLE TYPE CODE 1` != "UNKNOWN" & `VEHICLE TYPE CODE 2` != "UNKNOWN") %>%
group_by(`VEHICLE TYPE CODE 1`,`VEHICLE TYPE CODE 2`) %>%
summarise(NUMBER_OF_ACCIDENTS=n(),
NUMBER_OF_DEATHS = sum(`NUMBER OF PERSONS KILLED`)) %>%
arrange(desc(NUMBER_OF_ACCIDENTS))
## `summarise()` has grouped output by 'VEHICLE TYPE CODE 1'. You can override using the `.groups` argument.
#Injuries/Deaths of person type by borough
collision_persontype <- collision_use %>% group_by(BOROUGH) %>% summarise(Ped_Injuries = sum(`NUMBER OF PEDESTRIANS INJURED`), Ped_Deaths = sum(`NUMBER OF PEDESTRIANS KILLED`), Cyclist_Injuries = sum(`NUMBER OF CYCLIST INJURED`), Cyclist_Deaths = sum(`NUMBER OF CYCLIST KILLED`), Motorist_Injuries = sum(`NUMBER OF MOTORIST INJURED`), Motorist_Deaths = sum(`NUMBER OF MOTORIST KILLED`)) %>% mutate(Ped_Fatality_Rate=Ped_Deaths/Ped_Injuries, Cyclist_Fatality_Rate=Cyclist_Deaths/Cyclist_Injuries, Motorist_Fatality_Rate=Motorist_Deaths/Motorist_Injuries)
df1 <- data.frame(collision_persontype$Ped_Fatality_Rate, collision_persontype$Cyclist_Fatality_Rate, collision_persontype$Motorist_Fatality_Rate, collision_persontype$BOROUGH)
df2 <- tidyr::pivot_longer(df1, cols=c('collision_persontype.Ped_Fatality_Rate', 'collision_persontype.Cyclist_Fatality_Rate', 'collision_persontype.Motorist_Fatality_Rate'), names_to='Type',
values_to="Fatality_Rate")
ggplot(df2, aes(x=collision_persontype.BOROUGH, y=Fatality_Rate, fill=Type)) +
geom_bar(stat='identity', position='dodge')+ggtitle('Borough Fatality Rates by Person Type')+ xlab('Borough')
collision_10000 <- collision %>% separate(`CRASH DATE`, c("MONTH", "DAY", "YEAR"), sep = "/") %>% sample_n(10000) %>% filter(YEAR != 2012, YEAR != 2015, YEAR != 2013, YEAR != 2014) %>% write_csv("NYC Collision 10000.csv")
collision_use <- collision %>%
filter(!is.na(BOROUGH), !is.na('ZIP CODE'), !is.na(LONGITUDE), !is.na(LATITUDE), !is.na(LOCATION), !is.na(`NUMBER OF PEDESTRIANS KILLED`), !is.na(`NUMBER OF CYCLIST INJURED`)) %>% select(-(`ON STREET NAME`:`OFF STREET NAME`)) %>%
rename(DATE = `CRASH DATE`, TIME = `CRASH TIME`, ZIPCODE = `ZIP CODE`, LOCATION = `LOCATION`) %>% separate(DATE, c("MONTH", "DAY", "YEAR"), sep = "/") %>%
separate(TIME, c("HOUR", "MINUTE"), sep = ":") %>%
mutate(`CONTRIBUTING FACTOR VEHICLE 1` = str_to_lower(`CONTRIBUTING FACTOR VEHICLE 1`)) %>%
mutate(`CONTRIBUTING FACTOR VEHICLE 2` = str_to_lower(`CONTRIBUTING FACTOR VEHICLE 2`)) %>%
mutate(`CONTRIBUTING FACTOR VEHICLE 3` = str_to_lower(`CONTRIBUTING FACTOR VEHICLE 3`)) %>%
mutate(`CONTRIBUTING FACTOR VEHICLE 4` = str_to_lower(`CONTRIBUTING FACTOR VEHICLE 4`)) %>%
mutate(`CONTRIBUTING FACTOR VEHICLE 5` = str_to_lower(`CONTRIBUTING FACTOR VEHICLE 5`)) %>% filter(!is.na(`CONTRIBUTING FACTOR VEHICLE 1`),!is.na(`CONTRIBUTING FACTOR VEHICLE 2`), !is.na(`CONTRIBUTING FACTOR VEHICLE 3`), !is.na(`CONTRIBUTING FACTOR VEHICLE 4`), !is.na(`CONTRIBUTING FACTOR VEHICLE 5`)) %>% pivot_longer(`CONTRIBUTING FACTOR VEHICLE 1`:`CONTRIBUTING FACTOR VEHICLE 5`, names_to = "CONTRIBUTING VEHICLE", values_to = "CAUSE") %>% filter(!is.na(CAUSE)) %>%
pivot_longer(`VEHICLE TYPE CODE 1`:`VEHICLE TYPE CODE 5`, names_to = "VEHICLE CODE", values_to = "VEHICLE") %>% select(-(`CONTRIBUTING VEHICLE`)) %>% select(-(`VEHICLE CODE`)) %>% mutate(VEHICLE = str_to_lower(VEHICLE)) %>%
sample_n(1000) %>%
filter(!is.na("VEHICLE"))
collision_use
## # A tibble: 1,000 x 22
## MONTH DAY YEAR HOUR MINUTE BOROUGH ZIPCODE LATITUDE LONGITUDE LOCATION
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 02 24 2016 13 00 BROOKLYN 11208 40.7 -73.9 (40.65886~
## 2 07 13 2017 21 45 BROOKLYN 11231 40.7 -74.0 (40.67769~
## 3 01 15 2020 22 10 QUEENS 11420 40.7 -73.8 (40.68302~
## 4 05 20 2019 8 30 QUEENS 11377 40.7 -73.9 (40.7497,~
## 5 06 11 2016 10 30 STATEN ~ 10307 40.5 -74.2 (40.51393~
## 6 03 10 2018 6 00 QUEENS 11434 40.7 -73.8 (40.68139~
## 7 06 26 2016 12 20 QUEENS 11105 40.8 -73.9 (40.78045~
## 8 06 24 2017 3 00 QUEENS 11429 40.7 -73.7 (40.70316~
## 9 08 27 2020 16 10 QUEENS 11419 40.7 -73.8 (40.68868~
## 10 03 05 2016 11 00 QUEENS 11368 40.8 -73.9 (40.75718~
## # ... with 990 more rows, and 12 more variables:
## # NUMBER OF PERSONS INJURED <dbl>, NUMBER OF PERSONS KILLED <dbl>,
## # NUMBER OF PEDESTRIANS INJURED <dbl>, NUMBER OF PEDESTRIANS KILLED <dbl>,
## # NUMBER OF CYCLIST INJURED <dbl>, NUMBER OF CYCLIST KILLED <dbl>,
## # NUMBER OF MOTORIST INJURED <dbl>, NUMBER OF MOTORIST KILLED <dbl>,
## # COLLISION_ID <dbl>, collision_date <date>, CAUSE <chr>, VEHICLE <chr>
Number of Accidents by Borough and Vehicle Type:
collision_use %>% group_by(VEHICLE) %>% ggplot(aes(x = VEHICLE, y = stat(prop), group = 1)) + geom_bar() + facet_wrap(~BOROUGH)
Number of Accidents by Month(Time of the Year):
collision_use %>% group_by(MONTH) %>% mutate(count = n()) %>% ggplot(aes(x = MONTH, y = count, group = 1)) + geom_line(aes(color = BOROUGH))
Number of Accidents by Hour (Time of the Day):
collision_use %>% group_by(HOUR) %>% filter(HOUR <= 24) %>% summarize(count = n()) %>% ggplot(aes(x = HOUR, y = count, group = 1)) + geom_line()
Number of Person’s Injured by Borough:
collision_use %>% ggplot(aes(x = BOROUGH, y = `NUMBER OF PERSONS INJURED`)) + geom_col()
Number of Pedestrians Injured by Borough:
collision_use %>% ggplot(aes(x = BOROUGH, y = `NUMBER OF PEDESTRIANS KILLED`)) + geom_col()
Number of Cyclists Injured by Borough:
collision_use %>% ggplot(aes(x = BOROUGH, y = `NUMBER OF CYCLIST INJURED`)) + geom_col()
collision_use %>% group_by(CAUSE) %>% mutate(count = n()) %>% ggplot(aes(x = CAUSE, y = count, color = BOROUGH)) + geom_col()
Further Findings (NYC Collisions Data):
Brooklyn, Queens, and Manhattan have the highest proportions of traffic accidents in New York Ciy. The time of the year when traffic accidents seem to peak is during the summer months of June to October. Interestingly, even though the warmer weather might cause an increase in traffic movement which could potentially lead to an increase traffic collisions, there is a sudden drop in April of traffic accidents. During the day, the most traffic accidents were reported at 4pm and is the lowest at 1am to 2am. Furthermore, in terms of vehicle types, my assumption that sedans were the most prevalent type to be involved in accidents proved true.
In terms of causalities and injuries, my initial assumption that was proven correct where Brooklyn, Manhattan, and Queens have the highest number of people injured and pedestrians killed out of the 5 boroughs.
In terms of the cause of accidents, “unspecified” is the most popular contributing reason for accidents in New York City. Rather than ignoring this, it could either mean that there are certain flaws in the data collecting process that could be improved upon, or there is a wide variety of reasons that the standard choices to do not encompass. There is a possibility that it is a combination of both but I believe this is something that should not be ignored.
Other trends: -Looking at the volume of accidents by year affirms our initial prediction that 2020 would experience a severe drop in accidents as COVID and quarantine would limit overall driving/traffic in NYC. -Brooklyn leads all boroughs with the most accidents per year across 5 years, while Staten Island consistently has the least. -Staten Island also has the lowest population density by far. -Driver Inattention and Distraction is the largest driver of accidents in NYC. -Approximately 183,000 accidents have cited this as a contributing factor -Next highest is Failure to Yield Right-of-Way, with approximately 53,000 accidents. -Staten Island experiences the highest Death to Injury ratio among the boroughs across all types of people. Bronx sees a considerably high death rate for cyclists compared to other boroughs. -Sedans seem to be involved in the most accidents, with 7 out of the top 10 vehicle code combinations involving sedans. -Combinations with the most collisions are sedans with other sedans, approximately 95,000 accidents.