Dataset Merging

2021-11-19

This week, we explored NYC Open Data, the portal through which New York City publishes the data it collects. We collected the datasets we thought would be most relevant to predicting collisions in NYC; they are described below. In the coming week, we will merge each one onto our collision data and explore its relevance. After that, we will keep working with the dataset(s) that prove most pertinent and drop those that do not advance our final analysis.

Weather: In our previous blog post, we ran a few correlations on the merged dataset combining the original collision dataset and the weather dataset. Specifically, we observed a low correlation of only 0.4 between temperature and injuries and, surprisingly, no correlation between the amount of snow, ice pellets, and hail and the number of people killed in traffic accidents. One idea is to explore whether weather relates to the number of cars on the road, since the lack of correlation might result from people avoiding the roads during bad weather. Even though temperature and number of injuries do not seem very interesting to model, we decided to group the data into seasons and fit an initial model to see if anything interesting shows up. We observed a clear relationship between the month and the number of people injured in traffic accidents: the numbers are highest in the summer, followed by the fall, and then winter and spring.
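As a rough illustration of the seasonal grouping described above, here is a minimal sketch using only the Python standard library. The month-to-season mapping (meteorological seasons) and the sample rows are assumptions made for the example; they are not taken from our actual data or code.

```python
from collections import defaultdict

# Assumed mapping: meteorological seasons keyed by month number.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}


def injuries_by_season(records):
    """Sum injuries per season from (iso_date, injuries) pairs."""
    totals = defaultdict(int)
    for iso_date, injured in records:
        month = int(iso_date[5:7])  # "YYYY-MM-DD" -> month number
        totals[SEASON_BY_MONTH[month]] += injured
    return dict(totals)


# Made-up rows purely for illustration, not real collision counts.
sample = [("2021-01-15", 2), ("2021-07-04", 5), ("2021-07-20", 3), ("2021-10-31", 4)]
season_totals = injuries_by_season(sample)
```

The resulting per-season totals could then serve as the input to an initial model or a simple bar chart.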

Real Time Traffic Speeds: https://data.cityofnewyork.us/Transportation/Real-Time-Traffic-Speed-Data/qkm5-nuaq

Extracted from NYC Open Data, this dataset gives the traffic speeds recorded in each borough at various points throughout a given day. Whether to merge this dataset is still an open question: there are many data points for each date, and we will need to figure out how to aggregate them into a single value per date so that it matches our original dataset. Once we do so, we will be able to group the data by borough and see whether there is any significant correlation between speed levels and the number of injuries and deaths. We predict that higher speed levels will be associated with a higher number of collisions.
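One possible way to collapse the many readings per day into a single value is to take the mean speed per date and borough. The sketch below assumes each reading is a (timestamp, borough, speed) tuple with an ISO-style timestamp; that layout and the sample values are hypothetical, and the mean is only one of several reasonable summaries (a median or a peak-hour value might work better).

```python
from collections import defaultdict
from statistics import mean


def daily_mean_speed(readings):
    """Collapse (timestamp, borough, speed) readings into one mean per (date, borough)."""
    grouped = defaultdict(list)
    for timestamp, borough, speed in readings:
        date = timestamp[:10]  # keep "YYYY-MM-DD" from an ISO timestamp
        grouped[(date, borough)].append(speed)
    return {key: mean(speeds) for key, speeds in grouped.items()}


# Made-up readings for illustration only.
readings = [
    ("2021-11-15T08:00:00", "Manhattan", 10.0),
    ("2021-11-15T17:30:00", "Manhattan", 20.0),
    ("2021-11-15T08:00:00", "Queens", 30.0),
]
daily = daily_mean_speed(readings)
```

Keying the result by (date, borough) gives one row per day and borough, which is the shape needed to join against collision counts aggregated the same way.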

Neighborhood slow zones: https://data.cityofnewyork.us/Transportation/VZV_Neighborhood-Slow-Zones/y4nf-25nw

The Neighborhood Slow Zone Program is an application-based program that reduces the speed limit in selected areas of New York City to 20 mph. These areas were chosen based on crashes, the presence of schools, and other neighborhood amenities. The designated zones use a mix of markings, signs, and speed humps to slow traffic. The dataset is already a shapefile, so it can be joined to the collision data spatially, using the zone geometry and each collision's longitude and latitude. The merged data can be used not only for modeling but also for visualizations with spatial maps. With spatial maps, we can see whether there are areas of New York City outside the program with higher concentrations of collisions, injuries, or fatalities among certain groups such as pedestrians and cyclists. Furthermore, a logistic regression could be worth looking into, because it would show whether the likelihood of an accident or fatality increases or decreases depending on whether the accident occurred in an area with a reduced speed limit.
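To make the spatial merge concrete, the sketch below flags whether a collision falls inside a zone polygon using a plain ray-casting test. In practice a spatial join in a library such as GeoPandas would likely be the better tool; this standard-library version is only meant to show the idea. The rectangular zone and the collision coordinates are invented for the example.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon.

    `polygon` is an ordered list of (lon, lat) vertices; the last vertex
    is implicitly connected back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        # by a horizontal ray cast from the point.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical rectangular slow zone and two made-up collision locations.
zone = [(-73.99, 40.70), (-73.97, 40.70), (-73.97, 40.72), (-73.99, 40.72)]
collisions = [(-73.98, 40.71), (-73.95, 40.71)]
in_slow_zone = [point_in_polygon(lon, lat, zone) for lon, lat in collisions]
```

A True/False flag like this is the kind of predictor that could then go into a logistic regression.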

Pavement quality: https://data.cityofnewyork.us/Transportation/Street-Pavement-Rating/2cav-chmn

The pavement quality dataset also comes from NYC Open Data. The street pavement ratings can be merged onto our current dataset using the Longitude and Latitude variables. Borough could also be used for merging, but doing so would limit some of the analysis this dataset makes possible. The new data could be used to find whether there is any correlation between collisions and areas of NYC with poor street pavement ratings. We have not yet merged this dataset because we are still evaluating how much value it would add to our analysis.

Service requests for street lights/traffic signals: https://data.cityofnewyork.us/Transportation/DOT-Street-Lights-and-Traffic-Signals-311-Service-/jwvp-gyiq

This dataset includes the vast majority of service requests for street lights and traffic signals reported to 311 from 2010 to the present (some are left out due to “operational and system complexities”). It comes from NYC Open Data in the form of a shapefile, so it can be merged on the basis of latitude and longitude. That will allow us to explore whether there is a correlation between breakdowns in infrastructure critical to directing traffic and the prevalence of collisions, by observing where service requests and collisions tend to be concentrated relative to each other.

This is an interesting dataset to merge for practical purposes because it could give the city a metric to focus on to reduce collisions. If service requests are correlated with collisions, that would suggest the city should keep up with maintenance on traffic infrastructure to protect its residents.
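Once service requests and collisions have been counted over the same areas, the correlation check could be as simple as a Pearson coefficient. The sketch below is a minimal standard-library version; the per-area counts are made up for illustration and say nothing about what the real data will show.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up counts per area: service requests vs. collisions.
service_requests = [10, 20, 30, 40]
collision_counts = [12, 25, 31, 48]
r = pearson(service_requests, collision_counts)
```

A high coefficient would only indicate that the two tend to occur in the same places, not that broken signals cause collisions; busier areas may simply generate more of both.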

Speed Reducer: https://data.cityofnewyork.us/Transportation/Speed-Reducer-Tracking-System-SRTS-/9n6h-pt9g/data

Another dataset we are looking into relates to speed reducers, or speed bumps, around NYC. It has many variables, but only about half are useful for our project, including the coordinates of each speed bump location, the status of the project, and the number of speed bumps installed per location. Because installation date and borough are included as well, merging this dataset with our NYC collisions dataset was fairly simple, though some cleaning was still needed for the date format and some column names. One potential challenge with the merged data relates to requests for speed bumps that were never fulfilled. We included these because they might give insight into which areas have issues with driving speed, as these projects are usually started by request. However, this might not be an entirely reliable measure, since projects are sometimes closed because speed isn’t deemed a significant issue. As such, it might be better to filter the dataset so that it only includes projects that have been installed.
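The filtering and date cleanup mentioned above might look something like the following sketch. The field names (`install_date`, `borough`, `humps`) and the MM/DD/YYYY input format are assumptions for illustration and may not match the actual columns in the Speed Reducer Tracking System export.

```python
from datetime import datetime


def installed_only(projects):
    """Keep projects with an installation date and normalize it to ISO format."""
    cleaned = []
    for row in projects:
        raw = (row.get("install_date") or "").strip()
        if not raw:
            continue  # the request never resulted in an installed speed bump
        iso = datetime.strptime(raw, "%m/%d/%Y").date().isoformat()
        cleaned.append({**row, "install_date": iso})
    return cleaned


# Made-up rows for illustration only.
projects = [
    {"borough": "Brooklyn", "install_date": "06/15/2019", "humps": 2},
    {"borough": "Queens", "install_date": "", "humps": 0},
]
installed = installed_only(projects)
```

Keeping the unfiltered rows in a separate table would still let us look at unfulfilled requests as a rough signal of where residents perceive speeding.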
