Data Cleaning/Loading and Data Equity

2021-10-21

BLOG POST 2 - NYC COLLISIONS

Data Loading and Cleaning

Cleaning the original dataset:

We plan to exclude “street name”, “longitude”, “latitude”, “cross street name”, “off-street name”, and “on-street name”. R might treat these fields as categorical variables, and it is rare for enough collisions to occur on the same street to support analysis, so there would not be enough observations per street to produce meaningful visualizations or interpretable findings. The sheer number of streets in New York City also makes patterns hard to find, so for geographic location we will use the “Borough” variable, which has only a small, fixed set of values. When we merge datasets, it will also be important to convert dates and times into the same format as the other datasets. Finally, some cleaning may be necessary where values are missing for variables such as “Borough”, “Contributing Factor Vehicle 1-5”, “Vehicle Type Code 1-5”, and the variables related to casualties and injuries.
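As a rough illustration of this cleaning step, here is a minimal sketch in Python using only the standard library (the same steps map directly onto an R workflow). The column names, the MM/DD/YYYY input date format, and the two sample rows are assumptions made for the example, not values taken from our actual file.

```python
import csv
import io
from datetime import datetime

# Assumed column names for the street-level fields we plan to exclude.
DROP_COLUMNS = {
    "LATITUDE", "LONGITUDE",
    "ON STREET NAME", "CROSS STREET NAME", "OFF STREET NAME",
}

def clean_row(row):
    """Drop street-level columns, mark blank fields as None, normalize the date."""
    cleaned = {}
    for key, value in row.items():
        if key in DROP_COLUMNS:
            continue
        value = (value or "").strip()
        cleaned[key] = value if value else None
    # Assumed input format MM/DD/YYYY; output is ISO 8601 (YYYY-MM-DD).
    if cleaned.get("CRASH DATE"):
        parsed = datetime.strptime(cleaned["CRASH DATE"], "%m/%d/%Y")
        cleaned["CRASH DATE"] = parsed.date().isoformat()
    return cleaned

# Made-up sample rows, for illustration only.
SAMPLE = """CRASH DATE,BOROUGH,LATITUDE,LONGITUDE,ON STREET NAME,NUMBER OF PERSONS INJURED
03/07/2018,BROOKLYN,40.68,-73.97,ATLANTIC AVENUE,1
11/23/2019,,,,,0
"""

rows = [clean_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
for r in rows:
    print(r)
```

Blank fields become None rather than empty strings, so later steps can distinguish a missing borough from a real value.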

Adding more variables:

Weather: To strengthen the dataset, we aim to include additional variables such as weather data for New York City. Historical data is available at https://www.ncdc.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:USW00094728/detail. To begin, we will download weather data for each month from 2016 through 2020 (inclusive). Once we have combined the files into one dataset, we will clean it and decide which variables are relevant. For example, temperature and precipitation may be more interesting to look at than wind speed. Lastly, we will merge the weather data with our main NYC Collisions dataset.
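The merge described here amounts to a left join of collisions onto daily weather by date. A minimal sketch in Python (standard library only) follows. The weather column names (DATE, PRCP, TMAX, TMIN, AWND) follow the usual GHCN-Daily element codes but are assumptions about our download, and all sample values are made up.

```python
import csv
import io

# Made-up daily weather rows; AWND (wind speed) is present but not kept.
WEATHER_CSV = """DATE,PRCP,TMAX,TMIN,AWND
2018-03-07,0.85,38,31,12.3
2018-03-08,0.00,41,30,9.8
"""

# Made-up collision rows with dates already normalized to YYYY-MM-DD.
COLLISIONS = [
    {"COLLISION_ID": "1", "CRASH DATE": "2018-03-07", "BOROUGH": "BROOKLYN"},
    {"COLLISION_ID": "2", "CRASH DATE": "2018-03-07", "BOROUGH": "QUEENS"},
    {"COLLISION_ID": "3", "CRASH DATE": "2018-03-09", "BOROUGH": "BRONX"},
]

# Weather variables we assume are relevant (temperature and precipitation).
KEEP = ("PRCP", "TMAX", "TMIN")

def load_weather(text):
    """Index daily weather by date, keeping only the relevant variables."""
    weather = {}
    for row in csv.DictReader(io.StringIO(text)):
        weather[row["DATE"]] = {name: float(row[name]) for name in KEEP}
    return weather

def merge_weather(collisions, weather):
    """Left join: every collision is kept; unmatched dates get None."""
    merged = []
    for crash in collisions:
        daily = weather.get(crash["CRASH DATE"], {})
        merged.append({**crash, **{name: daily.get(name) for name in KEEP}})
    return merged

merged = merge_weather(COLLISIONS, load_weather(WEATHER_CSV))
for row in merged:
    print(row)
```

Using a left join means a collision on a day with no weather record stays in the data with missing weather values instead of being silently dropped.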

Traffic and Parking Violations: We are searching for and evaluating credible datasets on traffic volume and the number of parking violations per vehicle type, so that we can look for any correlation between those variables and the number of accidents. We aim to have this more fleshed out in the next blog post. The parking violation data can be found at: https://data.cityofnewyork.us/City-Government/Parking-Violations-Issued-Fiscal-Year-2022/pvqr-7yc4/data. Traffic count data can be found at: https://data.cityofnewyork.us/Transportation/Traffic-Volume-Counts-2014-2019-/ertz-hr4r

Demographics: We would also like to explore any relationship between collisions and demographic variables. Initially we wanted to incorporate census data to study these relationships, but we had difficulty finding a way to combine the datasets. For now, we think that including population and population density by NYC borough and zip code would be helpful. Population data by zip code and borough can be found at: https://data.beta.nyc/en/dataset/pediacities-nyc-neighborhoods/resource/7caac650-d082-4aea-9f9b-3681d568e8a5
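One way population data could be used is to express collisions as a rate per 100,000 residents, so that boroughs of different sizes can be compared. The sketch below, in Python with only the standard library, assumes that approach. The function name is hypothetical and the population figures are round placeholders, not real counts.

```python
from collections import Counter

def collisions_per_100k(crashes, population):
    """Collisions per 100,000 residents for each borough in `population`."""
    counts = Counter(c["BOROUGH"] for c in crashes if c.get("BOROUGH"))
    return {
        borough: counts[borough] * 100_000 / residents
        for borough, residents in population.items()
        if residents > 0
    }

# Placeholder populations and made-up collisions, for illustration only.
population = {"BRONX": 200_000, "QUEENS": 400_000}
crashes = (
    [{"BOROUGH": "BRONX"}] * 4
    + [{"BOROUGH": "QUEENS"}] * 2
    + [{"BOROUGH": None}]  # a collision with no recorded borough is skipped
)

rates = collisions_per_100k(crashes, population)
print(rates)
```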

Choosing subset of data, if large:

Our original dataset is relatively large, with daily data from July 2012 to October 2021. We decided to work with a five-year period because we wanted to observe potential patterns across several years. We chose 2016-2020 because we wanted mostly pre-pandemic data, plus some months during the pandemic that might support secondary observations.
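Selecting the 2016-2020 window is a simple date filter. Here is a minimal sketch in Python (standard library only), assuming crash dates arrive as MM/DD/YYYY strings. The function name and the sample dates are hypothetical.

```python
from datetime import date, datetime

# Study window: 2016 through 2020, inclusive on both ends.
START = date(2016, 1, 1)
END = date(2020, 12, 31)

def in_study_window(crash_date, fmt="%m/%d/%Y"):
    """Return True if the crash date falls inside the study window."""
    parsed = datetime.strptime(crash_date, fmt).date()
    return START <= parsed <= END

# Made-up dates that straddle both edges of the window.
dates = ["12/31/2015", "01/01/2016", "03/15/2020", "12/31/2020", "01/01/2021"]
kept = [d for d in dates if in_study_window(d)]
print(kept)
```

Comparing parsed dates rather than raw strings avoids mistakes that come from sorting MM/DD/YYYY text alphabetically.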

Removing missing values / focusing on columns with less missing data:

Many of the data points are missing a location. These must be excluded when we answer questions about where crashes occur, but we will not remove them from the dataset, because they can still be used to answer other questions, including those related to injuries and deaths, the cause of the crash, and the types of vehicles involved.
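In practice, this means filtering per question instead of deleting rows up front. The following Python sketch (standard library only) shows the idea. The column names and sample rows are assumptions, and the helper functions are hypothetical.

```python
def has_location(row):
    """A row counts as located if its borough field is filled in."""
    return bool(row.get("BOROUGH"))

def injuries_by_borough(rows):
    """Location question: only rows with a borough can contribute."""
    totals = {}
    for row in filter(has_location, rows):
        borough = row["BOROUGH"]
        totals[borough] = totals.get(borough, 0) + int(row["NUMBER OF PERSONS INJURED"])
    return totals

def total_injuries(rows):
    """Non-location question: every row contributes, located or not."""
    return sum(int(row["NUMBER OF PERSONS INJURED"]) for row in rows)

# Made-up rows; the second one has no recorded borough.
rows = [
    {"BOROUGH": "BROOKLYN", "NUMBER OF PERSONS INJURED": "1"},
    {"BOROUGH": None, "NUMBER OF PERSONS INJURED": "2"},
    {"BOROUGH": "BROOKLYN", "NUMBER OF PERSONS INJURED": "0"},
    {"BOROUGH": "QUEENS", "NUMBER OF PERSONS INJURED": "3"},
]

by_borough = injuries_by_borough(rows)
overall = total_injuries(rows)
print(by_borough, overall)
```

The borough totals add up to less than the overall total because the unlocated row is counted only in the second question.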

Data Equity

NYC has extreme inequality, in terms of both wealth and opportunity. Our data analysis could reveal an impact of this, such as worse traffic outcomes in less wealthy communities. The data also has limitations: some variables have missing values, some accidents have unspecified contributing factors, and the data tracks only motor vehicle accidents in New York City. This makes it hard to extrapolate our findings to other cities, given differences between suburban, rural, and urban locations, the demographics of the people involved, and population density.

Seeking and including communities’ interests in design considerations is a relevant principle for this project because our findings could provide insights into how the city can create initiatives to reduce traffic accidents. This data has many variables that could help us understand common underlying patterns in motor vehicle accidents in New York City, such as the vehicle types of both the contributing vehicle and the other vehicle involved, the number of injured individuals by category (pedestrian, motorist, cyclist), and the contributing factor. However, the data and findings could end up being of little use if we do not understand the community’s sentiment. One of the main goals of this research is to improve people’s lives, so communicating with the right agencies and leaders can help us design our study to produce meaningful findings. Beyond that communication, this research can empower organizations like the New York City Council’s Committee on Transportation and the Department of City Planning to coordinate efforts to create a safer driving environment for New Yorkers, and a safer city in general, with minimal harm to pedestrians and cyclists. Overall, we think that giving the right organizations and authorities the ability to create a safer New York City is the goal of this research.

This project has the opportunity to promote equitable outcomes in New York if it yields statistically significant findings about patterns or factors that affect collision rates. The socio-political and demographic data that we plan to merge into this dataset is relevant here. This is why the data equity principle of promoting equitable outcomes is important to consider for our analysis.
