Three Initial Datasets

2021-10-15

NYC Collision Data

Link to the original Dataset:

https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95

The dataset has not yet been cleaned or loaded.

Description of Dataset:

This dataset contains details on crash events from April 28, 2014 through October 14, 2021, the date of its most recent update. The data was provided by the New York Police Department (NYPD) and is owned by NYC OpenData. Each record is a police-reported motor vehicle collision in New York City. A police report is required for any collision in which someone is injured or killed, or in which there is at least $1,000 worth of damage.

The data collection process grew out of the NYPD's CompStat program. Following the success of CompStat, the NYPD began applying its principles to other problems. Other than homicides, the fatal incidents with which police have the most contact are fatal traffic collisions. In 1998, TrafficStat was implemented to improve traffic safety and to standardize the collection of traffic safety data. In July 1999, the Traffic Accident Management System (TAMS) was implemented to collect traffic data in a uniform way. Over the years, the need for more detailed analysis of traffic data grew. In 2014, Vision Zero was launched as a citywide traffic safety initiative to eliminate traffic fatalities. TAMS was subsequently replaced by the Finest Online Records Management System (FORMS), which lets police officers enter all of the MV-104AN data fields electronically, using a Department cellphone or computer, instead of filling out a form by hand.

Details on Number of Rows and Columns:

There are 1.83M rows and 29 columns, with each row describing one motor vehicle collision. The columns describe each collision in detail, including the crash date and time, location, street name, the number of people, pedestrians, and motorists injured and killed, the contributing factor, and the vehicle type.
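The row and column counts above can be checked without loading the whole file into memory. The following is a minimal sketch, assuming the dataset has been exported as a CSV with a header row; the column names and the two sample rows are hypothetical stand-ins used only to make the example runnable.

```python
import csv
import io


def count_rows_and_columns(handle):
    """Stream a CSV and return (data_row_count, column_count)."""
    reader = csv.reader(handle)
    header = next(reader)             # first line holds the column names
    n_rows = sum(1 for _ in reader)   # count remaining lines one at a time
    return n_rows, len(header)


# Hypothetical sample, not taken from the real dataset.
sample = io.StringIO(
    "CRASH DATE,CRASH TIME,BOROUGH\n"
    "10/14/2021,08:30,BROOKLYN\n"
    "10/14/2021,09:05,QUEENS\n"
)

n_rows, n_cols = count_rows_and_columns(sample)
print(n_rows, n_cols)
```

For the real export, the same function could be called on a file opened with `open(path, newline="")`, where `path` is wherever the CSV was saved.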

Main Questions we’re hoping to address:

What factors drive collision mortality rates? Which groups are most at risk of injury or death? Are there any geographic areas that are especially dangerous to drive in? What relationships can we identify among the characteristics of these areas?

What challenges do you foresee?:

Several columns may prove difficult to both parse and analyze. Most notably, the Contributing Factor Vehicle columns have many values entered as “Unspecified”, which provides no meaningful information. In addition, with regard to the damage from each collision, only injuries and deaths are recorded. This may not pose a challenge, depending on the questions we attempt to answer, but we did notice that many collisions had no injuries or deaths. In those cases, data on vehicle damage could give us more insight into collision patterns in NYC.
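One way to size both problems before committing to a question is to measure how often each occurs. The sketch below is illustrative only: the column names are assumed to match the CSV export, and the four rows are made up for the example rather than drawn from the real data.

```python
import csv
import io

# Hypothetical sample rows; column names are assumptions about the export.
SAMPLE_CSV = """COLLISION_ID,CONTRIBUTING FACTOR VEHICLE 1,CONTRIBUTING FACTOR VEHICLE 2,NUMBER OF PERSONS INJURED,NUMBER OF PERSONS KILLED
1,Driver Inattention/Distraction,Unspecified,0,0
2,Unspecified,Unspecified,1,0
3,Following Too Closely,,0,0
4,Unspecified,,0,1
"""


def unspecified_share(rows, column):
    """Fraction of non-empty values in `column` equal to 'Unspecified'."""
    values = [row[column].strip() for row in rows if row[column].strip()]
    if not values:
        return 0.0
    return sum(v == "Unspecified" for v in values) / len(values)


def no_casualty_share(rows):
    """Fraction of rows with zero persons injured and zero persons killed."""
    if not rows:
        return 0.0
    clean = sum(
        int(row["NUMBER OF PERSONS INJURED"] or 0) == 0
        and int(row["NUMBER OF PERSONS KILLED"] or 0) == 0
        for row in rows
    )
    return clean / len(rows)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
share_v1 = unspecified_share(rows, "CONTRIBUTING FACTOR VEHICLE 1")
share_v2 = unspecified_share(rows, "CONTRIBUTING FACTOR VEHICLE 2")
share_no_casualty = no_casualty_share(rows)
print(share_v1, share_v2, share_no_casualty)
```

Empty cells are excluded from the “Unspecified” share here, on the assumption that a blank second-vehicle field means no second vehicle was involved; that choice would need to be checked against the data.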

Census Data

Link to the original data source and Cleaning of Data: American Community Survey (ACS) PUMS files can be accessed at https://www.census.gov/programs-surveys/acs/data/pums.html. The compiled dataset can be accessed at https://drive.google.com/file/d/1LUzDW1FPd74qQ3VC0T2Pgz3DGcPujWcC/view?usp=sharing. This data has been merged and can be loaded and cleaned further.

Details and Description of Data

The data is a pooled cross-section of individuals surveyed for the one-year estimates of the American Community Survey (ACS), a Census Bureau survey, and covers the years 2013 to 2018. The ACS is conducted every month of every year and surveys about 3.5 million individuals in all 50 states, the District of Columbia, and Puerto Rico. The survey asks about education, employment, race, nativity, internet access, and transportation, among other topics. Despite its richness, this dataset has not been widely used in research papers. The ACS was first implemented throughout the U.S. in January 2005, and one-year and five-year estimates have been available since then. For this study, we use the Public Use Microdata Sample (PUMS) files of the one-year estimates for 2013 to 2018. Files for different years and states were downloaded and merged into a single file, giving a pooled cross-section of 18,973,476 observations and 43 variables. The data is at the individual level: each observation is one surveyed individual in a given state and year.
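The merge described above can be sketched as follows. This is not the procedure actually used to build the compiled file; it is a minimal illustration, assuming one CSV per state and year, in which each row is tagged with its year and state before the files are stacked. The variable names and all values are hypothetical.

```python
import csv
import io

# Hypothetical stand-ins for per-year, per-state extracts; values are made up.
FILES = {
    (2013, "NY"): "SERIALNO,AGEP,WAGP\n1,34,52000\n2,51,0\n",
    (2014, "NY"): "SERIALNO,AGEP,WAGP\n1,29,41000\n",
    (2014, "CA"): "SERIALNO,AGEP,WAGP\n7,45,88000\n9,62,15000\n",
}


def pool(files):
    """Stack per-file rows into one list, tagging each row with year and state."""
    pooled = []
    for (year, state), text in sorted(files.items()):
        for row in csv.DictReader(io.StringIO(text)):
            row["YEAR"] = year      # keep the survey year on every record
            row["STATE"] = state    # keep the source state on every record
            pooled.append(row)
    return pooled


pooled = pool(FILES)

# Count observations per year to confirm every file contributed rows.
rows_per_year = {}
for row in pooled:
    rows_per_year[row["YEAR"]] = rows_per_year.get(row["YEAR"], 0) + 1

print(len(pooled), rows_per_year)
```

Tagging rows before stacking matters because identifiers in the sample repeat across files, so a record is only unique in combination with its year and state.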

Interesting Questions and Foreseeable Challenge

The following questions could be interesting to look into:

1. The relation between an individual’s race and wages by employment sector
2. Gender-related patterns which may or may not have been affected by policy changes
3. Is there any relation between linguistic skills across states per age group?

It is important to note that the ACS PUMS data does not include the survey records of all individuals. Access to the restricted ACS data from the Census Bureau would be better suited to this analysis. For example, if we want to look at racial demographics, the PUMS sample is heavily weighted toward white respondents. The restricted data would include all surveyed individuals and remove this bias. Additionally, the full dataset with 80+ variables could not be downloaded, which may introduce some biases.

NBA Play-by-Play data

Original Data Source

We used the nbaStatR R package to create this data frame. This package scrapes the data from stats.nba.com. The dataset can be accessed from this link: https://drive.google.com/file/d/1n34V115Sbx2atgrk77CmbqBV1oO_t_1u/view?usp=sharing

This is not a clean dataset, and some wrangling is needed before the data can be analyzed.

Details and Description of Data

Play-by-play data is the most granular form of data available to the public, and it provides insight into every detail of the game. It is essentially a transcript of all tracked events in a game, which gives us the flexibility to analyze many questions.
This dataset is the raw play-by-play data from every game in the 2020-21 NBA season. There are 512,581 rows and 23 columns in the dataset.
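Because each row is one tracked event, a natural first check is to count events per game and per period. The sketch below assumes the data frame has been exported to CSV; the column names (`idGame`, `numberPeriod`, `descriptionPlay`) and the sample events are hypothetical and may not match the actual 23 columns.

```python
import csv
import io
from collections import Counter

# Hypothetical sample of play-by-play rows; not taken from the real dataset.
SAMPLE = """idGame,numberPeriod,descriptionPlay
1,1,Jump ball
1,1,Made 2-pt shot
1,2,Defensive rebound
2,1,Jump ball
2,1,Missed 3-pt shot
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# One row per event, so counting rows per key counts events.
events_per_game = Counter(row["idGame"] for row in rows)
events_per_period = Counter((row["idGame"], row["numberPeriod"]) for row in rows)

print(dict(events_per_game))
print(dict(events_per_period))
```

Games with unusually few or many events would be a reasonable place to start looking for scraping gaps or duplicated rows.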

Previous Data Cleaning/Loading and Data Equity