03 Nov An IPython Notebook Tutorial for Data Science Beginners – Analyzing Toronto Neighbourhood Crime Data
Toronto is considered to be a safe city in comparison to other metropolises in the world. In an article in the Economist (2015), Toronto was ranked as the safest major city in North America and the eighth safest major city in the world. Despite this relatively high safety ranking, however, Toronto has its fair share of crime. The City consists of 140 officially recognized neighbourhoods along with many other unofficial, smaller neighbourhoods. As is the case with any big city, some neighbourhoods are considered to be less safe than others. Higher crime is often attributed to factors such as lower income, higher unemployment, and poorer literacy and access to education.
The City of Toronto’s Open Data portal consists of over 200 datasets organized into 15 different categories. I was motivated to download some of these open datasets to explore neighbourhood crime within Toronto. I found three datasets on the portal that were relevant to crime – safety, demographics, and economics data. The advantage of using these datasets was that the data was available in a relatively clean format. Also, each of the three datasets had exactly 140 rows – one for each official neighbourhood in Toronto. The disadvantage was that the data covered only two years – 2008 and 2011 – which limited my freedom to make predictions from it. Despite this limitation, I decided to subject these datasets to a typical data science pipeline (i.e., wrangling, data analysis, data visualization, and prediction) and extract any hidden value with respect to neighbourhood crime.
For detailed steps, to replicate my results, or to run your own analyses, please go to my GitHub page to download/clone the IPython notebook and datasets.
Reading in data
The source file for each dataset was provided as an Excel file with two sheets – one for 2008 and one for 2011. I converted these sheets into separate CSV files, and imported them as Pandas data frames. So each raw dataset resulted in two Pandas data frames – six in total. Initially, I did not foresee any use for the economics data as I felt that the safety and demographics datasets would be sufficient for analyses. I will explain what motivated me to use the economics data later in this post.
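As a minimal sketch of this import step (the file names and column titles below are placeholders, not the actual titles from the Open Data portal):

```python
import io
import pandas as pd

# Minimal sketch of the import step. In practice each of the six CSVs exported from
# the Excel sheets would be read with pd.read_csv("<file>.csv"); the column names
# here are placeholders, not the actual Open Data titles.
sample_csv = """Neighbourhood,Assaults,Robberies
Danforth,120,30
Waterfront Communities,95,22
"""
crime_2011 = pd.read_csv(io.StringIO(sample_csv))
print(crime_2011.shape)  # (2, 3)
```

Repeating this for the safety, demographics, and economics files, for each of the two years, yields the six data frames.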
Initial data exploration
A quick look at the first five rows showed differences in the crime and demographics data for 2008 and 2011. I wanted consistent columns in both years to enable comparisons. The 2011 crime data had a column for total major crimes whereas the 2008 data did not have one. I noticed that the demographics data frame for 2011 had 39 columns whereas it had 85 columns in 2008. This was because in 2008, the City of Toronto collected language and ethnicity data for each neighbourhood whereas in 2011, it only collected language data. I realized this would pose another challenge for making comparisons between 2008 and 2011.
Check for missing values
As a test for missing values, I dropped all the ‘NA’ columns in the data frames and checked to see if their size was equal to the original data frames. The sizes were equal, indicating that there were no missing values (the advantage of having pre-cleaned data!).
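The missing-value check described above can be sketched as follows, on a toy data frame:

```python
import pandas as pd

# Toy frame with no missing values; the same checks apply to each of the six real frames.
df = pd.DataFrame({"Neighbourhood": ["A", "B"], "Assaults": [10, 12]})

# dropna() removes rows containing NA; an unchanged shape means the frame is complete.
assert df.dropna().shape == df.shape

# An equivalent, more direct check:
print(df.isnull().values.any())  # False -> no missing values
```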
Renaming column titles
The datasets had long column titles with spaces in between. For easier data access, sub-selection and sorting, I shortened the column titles into smaller, single-word names by using a dictionary. The table below shows the original and shortened column titles.
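The renaming step can be sketched with pandas’ `rename` and a dictionary; the mapping below is illustrative, not the full one from the table:

```python
import pandas as pd

# Illustrative rename mapping -- the real dictionary maps each long Open Data column
# title to a short single-word name, as shown in the table above.
crime = pd.DataFrame({"Total Major Crime Incidents": [100], "Break & Enters": [20]})

rename_map = {
    "Total Major Crime Incidents": "TMCI",
    "Break & Enters": "BreakEnters",
}
crime = crime.rename(columns=rename_map)
print(list(crime.columns))  # ['TMCI', 'BreakEnters']
```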
The column titled TMCI in 2011 represents all the major crimes committed. It is the sum of eight different crime categories – Assaults, Break & Enters, Drug Arrests, Murders, Robberies, Sexual Assaults, Thefts, and Vehicle Thefts. My focus for this project was on this summed crime category, which I refer to as major crime. As mentioned earlier, the 2008 crime data did not have this summed category. So, I computed this category for 2008 and added it as an additional column called TMCI2.
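Computing the summed category for 2008 can be sketched as below; only three of the eight crime categories are shown for brevity:

```python
import pandas as pd

# Sketch of adding the summed major-crime column to the 2008 frame. Only three of the
# eight categories are shown; the real sum covers all eight listed above.
crime_2008 = pd.DataFrame({
    "Assaults": [50, 80],
    "Robberies": [10, 15],
    "Murders": [0, 1],
})
categories = ["Assaults", "Robberies", "Murders"]
crime_2008["TMCI2"] = crime_2008[categories].sum(axis=1)
print(crime_2008["TMCI2"].tolist())  # [60, 96]
```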
Normalizing crime data
While generating a few initial plots, I realized that population would be a confounding variable. In other words, a neighbourhood might simply record more major crimes because it has a larger population, and this could overpower other salient contributors to crime. To avoid this effect, I normalized the data by calculating major crime per capita: I divided each neighbourhood’s crime counts by the neighbourhood’s population (obtained from the demographics data), and then multiplied these values by 1000. This gave me the number of major crime incidents in each neighbourhood per 1000 people, henceforth referred to as per capita.
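A minimal sketch of this normalization, assuming hypothetical column names (`TMCI`, `Population`):

```python
import pandas as pd

# Sketch of the per-capita normalization: crime counts divided by neighbourhood
# population (taken from the demographics frame), scaled to incidents per 1,000 people.
crime = pd.DataFrame({"Neighbourhood": ["A", "B"], "TMCI": [300, 300]})
demo = pd.DataFrame({"Neighbourhood": ["A", "B"], "Population": [10000, 30000]})

merged = crime.merge(demo, on="Neighbourhood")
merged["TMCI_per_capita"] = merged["TMCI"] / merged["Population"] * 1000
print(merged["TMCI_per_capita"].tolist())  # incidents per 1,000 people
```

Note that the two neighbourhoods have identical raw counts but very different per-capita rates, which is exactly the confound the normalization removes.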
I compared the means of all the crime categories that fall under major crime incidents for 2011 and 2008. I found that Assaults, Drug Arrests, and Break & Enters were the most frequent major crime categories in both years. Murders and Thefts were the two lowest categories of major crime. Perhaps this is one reason why Toronto is generally considered a safe city.
Mean normalized major crime data for 2011 and 2008:
Comparison of most and least crime prone neighbourhoods
Next, I sorted the data to find the five neighbourhoods for both these years with the least and most major crime incidents.
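The ranking step can be sketched with `sort_values`; the neighbourhood names and values here are made up for illustration:

```python
import pandas as pd

# Sketch of ranking neighbourhoods by per-capita major crime (toy data).
df = pd.DataFrame({
    "Neighbourhood": ["A", "B", "C", "D"],
    "TMCI_per_capita": [42.0, 8.5, 30.1, 12.3],
})
most = df.sort_values("TMCI_per_capita", ascending=False).head(2)
least = df.sort_values("TMCI_per_capita").head(2)
print(most["Neighbourhood"].tolist())   # ['A', 'C']
print(least["Neighbourhood"].tolist())  # ['B', 'D']
```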
What were the top five major crime prone neighbourhoods in 2011?
What were the top five major crime prone neighbourhoods in 2008?
Four of the top five neighbourhoods matched in 2008 and 2011. The one change was Yorkdale-Glen Park, which was ranked eighth in 2008 and had a slight increase in crime in 2011, hence showing up in the 2011 list of the five neighbourhoods with the most major crime incidents. Also, Danforth, which is generally considered to be a high crime neighbourhood, had a drop in crime in 2011. A closer look showed that while Danforth had an overall reduction in major crime, the Drug Arrests category in particular fell by approximately 75%, which may account for Danforth not showing up in the top five neighbourhoods for 2011.
What were the bottom five major crime prone neighbourhoods in 2011?
What were the bottom five major crime prone neighbourhoods in 2008?
Again, I noticed that four of the five neighbourhoods matched in 2008 and 2011, suggesting that major crime had been more or less stable in Toronto across these two years. Next, I plotted a few visualizations of the major crime data. I compared major crimes in 2011 and 2008 as univariate distributions using Seaborn, plotting a histogram and a kernel density estimate. Kernel density estimation allows us to estimate the probability density function of a random variable from a finite set of data, so we can look at the major crime data as a continuous probability distribution rather than a histogram.
I realized that an easier way to compare both years would be to show the major crimes in a scatterplot. This also showed how correlated both were. In the plot below, 2011 major crime data is on the x-axis and 2008 major crime data is on the y-axis.
As expected, there was a strong correlation of 0.91 in major crime between the two years. In addition to outliers in the plot, I also noticed two points that did not fall along the general trend. These two points refer to major crime per capita in two neighbourhoods where the major crime in 2008 is between 45 and 55 and the major crime in 2011 is between 25 and 35. I found that these two neighbourhoods were Danforth and Waterfront Communities. We discussed how Danforth, despite being a crime prone neighbourhood, had a reduction in major crimes in 2011. Likewise, Waterfront Communities seemed to have a drop in crime as well.
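The correlation underlying the scatterplot can be computed directly with pandas; the toy values below are illustrative and do not reproduce the reported 0.91:

```python
import pandas as pd

# Sketch of the year-over-year comparison: the Pearson correlation between the two
# per-capita series. A scatterplot of the same two columns (e.g. with seaborn's
# jointplot) is what reveals the off-trend points discussed above.
df = pd.DataFrame({
    "TMCI_2011": [10.0, 20.0, 30.0, 41.0],
    "TMCI_2008": [12.0, 19.0, 33.0, 40.0],
})
r = df["TMCI_2011"].corr(df["TMCI_2008"])
print(round(r, 2))  # strong positive correlation on this toy data
```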
Neighbourhoods with maximum change in crime
Another exploration I felt would be interesting was to find out how major crime had changed in Toronto’s neighbourhoods between 2008 and 2011. So, I calculated the percentage increase or decrease in major crime from 2008 to 2011 by (a) computing the difference in major crime between the two years for each neighbourhood, (b) dividing this difference by the major crime in 2008 for that neighbourhood, and (c) converting this value into a percentage.
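Steps (a)–(c) can be sketched as follows (neighbourhood names and values are made up):

```python
import pandas as pd

# Sketch of steps (a)-(c): percentage change in major crime from 2008 to 2011.
df = pd.DataFrame({
    "Neighbourhood": ["Falling", "Rising"],
    "TMCI_2008": [50.0, 20.0],
    "TMCI_2011": [25.0, 30.0],
})
df["PctChange"] = (df["TMCI_2011"] - df["TMCI_2008"]) / df["TMCI_2008"] * 100
print(df["PctChange"].tolist())  # [-50.0, 50.0]
```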
The five neighbourhoods with the maximum increase in major crime from 2008 to 2011 were the following.
The five neighbourhoods with the maximum decrease in major crime from 2008 to 2011 were the following.
Positive values of Percentage Change in Crime from 2008 to 2011 indicate an increase in major crime and negative values indicate a decrease in major crime, from 2008 to 2011. I wanted to know how these neighbourhoods with the most increase and decrease in major crime compared against some of the top and bottom major crime neighbourhoods for 2011.
We already discussed how Danforth, generally a high major crime neighbourhood, had a decrease in crime in 2011. This becomes obvious when we look at the Percentage Change in Crime value which shows a 48.5% decrease from 2008 to 2011. We also noticed the decrease in major crime for Waterfront Communities in the bivariate scatter plot. In support of this finding, the Percentage Change in Crime data reveals a 33.7% decrease in major crime for Waterfront Communities from 2008 to 2011.
Identifying prominent age group and economics factors
I examined the demographics data for 2011 and 2008 and decided to focus on four different age groups – (1) Children (0-14 years), (2) Youth (15-24 years), (3) Adults (25-54 years), (4) Seniors (55 and over). Most of these categories were already available as columns in the 2011 demographics data frame, except Adults. So, I computed the number of Adults in each neighbourhood by subtracting the sum of the remaining three age groups from the total population of each neighbourhood.
In the 2008 demographics data frame, none of these categories were available as pre-existing age groups. Instead, population was divided into columns that were grouped by 5-year age categories (e.g., 0-4 years, 5-9 years etc.). Therefore, to make the 2008 data consistent with the 2011 data, I calculated the number of people in each of the four age groups – Children, Youth, Adults, Seniors. Summary statistics for the 2008 and 2011 demographics data showed that the Adults group is the most prominent population group across all neighbourhoods.
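Collapsing the 2008 five-year bins into the four age groups can be sketched as below; the bin column names are assumptions, and only the Children and Youth groups are shown:

```python
import pandas as pd

# Sketch of collapsing the 2008 five-year bins into the coarser age groups. Only a few
# illustrative bin columns are shown; the real frame has bins up to the oldest group.
demo_2008 = pd.DataFrame({
    "Pop 0-4":   [500], "Pop 5-9":   [450], "Pop 10-14": [400],
    "Pop 15-19": [300], "Pop 20-24": [350],
})
demo_2008["Children"] = demo_2008[["Pop 0-4", "Pop 5-9", "Pop 10-14"]].sum(axis=1)
demo_2008["Youth"] = demo_2008[["Pop 15-19", "Pop 20-24"]].sum(axis=1)
print(demo_2008[["Children", "Youth"]].iloc[0].tolist())  # [1350, 650]
```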
I also wanted to look at the median/mean household income in each neighbourhood for 2008 and 2011. Unfortunately, income data was available only with the 2008 safety data from the Open Data portal, and was unavailable for 2011. This is what motivated me to take a closer look at the economics data, something I had hinted at earlier. Surprisingly, the economics datasets did not contain any income data. However, they had other potentially important variables such as number of people employed, and number of people on social assistance.
Following a procedure similar to the one I used with the safety and demographics data, I read the economics data for 2008 and 2011 into separate Pandas data frames and selected only the most relevant columns: the number of businesses in each neighbourhood, the number of people employed in each neighbourhood, and the number of social assistance recipients. Then I shortened the column titles to simpler ones. All three variables seemed important because they are connected to income and employment, which are traditionally considered important motivators for crime. These variables were normalized to values per 100 people.
A sample of five rows showing the selected economics variables for 2011:
Prediction using machine learning
Ideally, I would have liked to have all the major crime data over a series of years (approximately 20 years or so) along with the corresponding features for each year. This would have allowed me to perform crime prediction for a projected year. Since I did not have this data available, I decided to perform two types of prediction exercises, both of which are regression problems.
First, I decided to look at the percentage change in major crime from 2008 to 2011 as my dependent variable, and the corresponding percentage changes in my features as my independent variables. Can we successfully predict the percentage of increase or decrease in crime from 2008 to 2011 using a machine learning model?
Second, I decided to predict the major crime in a neighbourhood using the features as independent variables. So, in the second problem, I am not looking at changes from 2008 to 2011. Instead, I am building two separate machine learning models – one for 2008 and one for 2011.
Prior to performing model fitting, I wanted to take a look at the limited set of features I had and filter out any unnecessary features, keeping the two regression problems in mind.
I merged the age group data (the four age groups) and the percentage of male and female data with the three selected features from the economics data and the major crime data, for each year separately. This resulted in two data frames with 10 columns – 9 features/independent variables and 1 dependent variable. Then I computed cross-correlations of all 10 variables. This allowed me to look for correlations among variables and remove variables that were confounding each other. Two strongly correlated variables can decrease the robustness of a machine learning model, as one variable can potentially inhibit the effect of another.
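The cross-correlation check can be sketched with `DataFrame.corr()` on synthetic features; the deliberately collinear pair below mimics the kind of redundancy this check is meant to catch:

```python
import numpy as np
import pandas as pd

# Sketch of the cross-correlation check: DataFrame.corr() returns the full pairwise
# correlation matrix used to spot redundant features. Synthetic data; "Businesses"
# is constructed to be nearly collinear with "Employed".
rng = np.random.default_rng(0)
employed = rng.normal(50, 5, 100)
df = pd.DataFrame({
    "Employed": employed,
    "Businesses": employed * 0.8 + rng.normal(0, 1, 100),  # deliberately collinear
    "Males": rng.normal(49, 2, 100),
})
corr = df.corr()
print(round(corr.loc["Employed", "Businesses"], 2))  # the strongly correlated pair
```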
Correlation table for 2011 is provided below.
Cross-correlation table for 2008 is provided below.
Major crime is primarily associated with males rather than females. This led me to remove the percentage of females as a feature.
I plotted these correlations to make them more understandable.
Correlation plot for 2011:
For the 2011 data, there was a moderate to high positive correlation between the number of males per capita (i.e., per 100 people) and major crime (TMCI), number of adults per capita and major crime, number of people employed per capita and major crime, and number of social assistance recipients per capita and major crime. There was a negative correlation between the number of seniors per capita and major crime. There was also a strong positive correlation between the number of people employed per capita and the number of businesses per capita in that neighbourhood, suggesting that these features might be containing redundant information.
Correlation plot for 2008:
I noticed a strong correlation between the number of people employed per capita and the number of businesses per capita for 2008 as well. Ideally, I would have removed one of these features. However, I decided against it based on the following reasoning. First, it is important to acknowledge that there is likely to be some correlation between these two variables. The number of businesses in a neighbourhood is likely to be a good indicator of the overall economic health of that neighbourhood. However, it is also possible that the businesses in a specific neighbourhood are not necessarily employing the people from that neighbourhood. Many businesses exist in the downtown areas of cities, yet neighbourhoods close to downtown can still be high crime areas. Additionally, the nature of businesses can vary. A neighbourhood might consist entirely of small businesses such as sole proprietorships and small partnerships, which might not have a sizeable number of employees. Given all these reasons, I included both variables within the feature set.
Model selection and evaluation
My first task was to predict the percentage increase or decrease in crime from 2008 to 2011 using a machine learning model. I computed the percentage changes in all eight features from 2008 to 2011, and the percentage change in major crime, for each neighbourhood. I used a linear regression model imported from scikit-learn and partitioned the data into train and test sets using a 70:30 split. I decided to use linear regression because it provides a basic starting point for regression, under the assumption that the relationship between the independent and dependent variables is linear. It always leaves us with the option of choosing a more flexible model later, if needed.
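A sketch of this modelling step with scikit-learn, on synthetic data standing in for the real features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Sketch of the modelling step: 70:30 split, linear regression fit, and the training
# R-squared. The synthetic data stands in for the real percentage-change features.
rng = np.random.default_rng(42)
X = rng.normal(size=(140, 8))                    # 140 neighbourhoods, 8 features
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=140)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(round(model.score(X_train, y_train), 3))   # training R-squared
```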
I evaluated the model by first checking its R-squared value on the training set, which was 27.3% – a low value indicating that the model was not flexible enough to fit the data. We also have to consider that we are neither using the actual features nor predicting major crime rates. Instead, we are using percentage changes of these attributes between 2008 and 2011 as features for predicting the percentage change in major crime. So, some of the initial relationships between the features and major crime, as indicated by their correlations, may have been compromised. As an additional evaluation test, I computed the mean squared error on the training data (MSE = 404.9) and on the test data (MSE = 439.0). Both the training and test mean squared errors are extremely high.
I plotted the residuals for the linear regression model.
Next, I used a random forest model. The random forest model had an R-squared value of 85.1% on the training data, a big improvement over the linear regression model. However, it performed poorly on the test data (R-squared value of 9.2%), suggesting that it does not generalize. This could also be seen in the difference between the mean squared error on the training data (MSE = 83) and on the test data (MSE = 404.9). As an additional step, I did a k-fold cross-validation with k = 5. The average mean squared error across all five folds was still high (MSE = 571.6).
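The random forest fit and the 5-fold cross-validation can be sketched as follows; the data is synthetic, and note that `cross_val_score` reports negated MSE, hence the sign flip:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Sketch of the random forest fit with 5-fold cross-validation (synthetic data).
rng = np.random.default_rng(1)
X = rng.normal(size=(140, 8))
y = X[:, 0] * 2 + rng.normal(scale=1.0, size=140)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
scores = cross_val_score(forest, X, y, cv=5, scoring="neg_mean_squared_error")
mean_mse = -scores.mean()   # cross_val_score returns negated MSE
print(round(mean_mse, 2))
```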
These results indicated that the features did not have enough information that allowed the possibility of deriving a model that could make reasonable predictions. It is also likely that the model is wrong for the data. So, my conclusion was that this model would not be able to successfully predict the percentage of change in major crime in a neighbourhood from 2008 to 2011.
Next, I moved on to my second regression problem where I had decided to predict major crime in each neighbourhood using the eight original features as independent variables. For the second problem, I wanted to have two separate machine learning models – one to back-predict total crimes for 2008 and one to back-predict total crimes for 2011. Following a similar approach to what I did previously, I started with linear regression as my choice of machine learning model. I used the same randomized partitioning as before – 70% of the data was used for training and 30% for testing. Both the models were able to explain the variance in the training data reasonably well, as reflected by their R-squared values: 71% for 2011 and 72% for 2008. For the 2011 model, mean squared errors were 22.3 on the training data and 40.7 on the test data, whereas for the 2008 model, mean squared errors were 40.9 on the training data and 46.6 on the test data. So, although the performance for the 2011 model seemed to be superior, the 2008 model seemed more robust with respect to generalizability, given that the difference between the MSE for training and test data was quite low. The residual plot for the 2011 data is provided below.
The residual plot for the 2008 data is provided below.
For the most part, the residuals look homoscedastic, further supporting linear regression as a reasonable choice of model, although they show a small amount of heteroscedasticity and some ceiling effects.
Since the models looked reasonable after evaluation, I examined the standardized coefficients of all the predictors for both models using statsmodels.
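I used statsmodels for this step; an equivalent sketch with scikit-learn is to z-score both the features and the target before fitting, which makes the coefficients directly comparable across features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Sketch of standardized coefficients: z-score the features and the target, then
# refit. Synthetic data; the features are put on deliberately different scales.
rng = np.random.default_rng(7)
X = rng.normal(size=(140, 3)) * [1, 10, 100]   # features on very different scales
y = X[:, 0] + 0.1 * X[:, 1] + rng.normal(size=140)

X_std = StandardScaler().fit_transform(X)
y_std = (y - y.mean()) / y.std()
coefs = LinearRegression().fit(X_std, y_std).coef_
print(np.round(coefs, 2))  # standardized coefficients, comparable across features
```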
For the 2008 linear regression model, the standardized coefficient values suggest that almost all the features, except the number of people employed, are important predictors of major crime in a neighbourhood. These features are the number of people in each of the four age groups, the percentage of males, the number of businesses, and the number of social assistance recipients in each neighbourhood. For the 2011 linear regression model, however, only the percentage of males, the number of businesses, and the number of social assistance recipients show up as important predictors of major crime. So, the models for the two years are not consistent with each other.
As a next step, I used a random forest regression model to see if there would be an improvement in prediction performance, with the same 70:30 partitioning of the data into training and test data.
Both models were able to explain the variance in the training data very well, as reflected by their R-squared values (90.8% for 2011 and 92.4% for 2008). For the 2011 model, the mean squared error on the test data was 153.3, whereas for the 2008 model it was 156.4. These results indicate that the random forest models, despite their better R-squared values on the training data, performed worse than linear regression on the test data. In other words, they were too flexible and, as a result, overfitted the training data. From a generalizability and robustness perspective, linear regression might be the better option for predicting major crime in each neighbourhood.
Since random forest models allow us to find out the most important features for that model, I plotted these.
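Extracting importances from a fitted `RandomForestRegressor` can be sketched as below; the feature names are illustrative short titles, and the synthetic target is built so that the first feature dominates:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Sketch of feature-importance extraction from a fitted random forest.
rng = np.random.default_rng(3)
X = rng.normal(size=(140, 3))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=140)  # first feature dominates

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
names = ["Males", "Businesses", "SocialAssistance"]
ranked = sorted(zip(names, forest.feature_importances_), key=lambda t: -t[1])
print(ranked[0][0])  # the dominant synthetic feature
```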
For both 2008 and 2011, again the percentage of males, number of businesses, and number of social assistance recipients show up as the most important predictors of major crime, similar to the 2011 linear regression model.
Key findings, limitations, and future extensions
My most important finding from the 2011 linear regression model and the random forest models was that among the limited set of independent variables that were available to me, the percentage of males, the number of businesses, and the number of social assistance recipients within a neighbourhood are the most important predictors of major crime in that neighbourhood.
This finding cannot be generalized because the data was limited to only two years – 2008 and 2011. Besides, there are several other factors that should ideally have been included as independent variables, which I could not include as they were unavailable for both years. Perhaps the most important missing feature was income data. I would also have liked to include a feature that either quantified urbanization or captured gentrification in these neighbourhoods. Toronto has been undergoing a lot of change in the form of major construction projects in low- and mid-income neighbourhoods. Some of these are considered to be part of revitalization projects. For example, the City of Toronto began an initiative in 2005 known as the Regent Park Revitalization Plan. The plan involved transforming the area from a social housing neighbourhood into a thriving mixed-income neighbourhood by implementing construction in three phases that included a mix of rental and condominium buildings, townhouses, commercial space with community facilities, and active parks and open spaces. Currently, phases 2 and 3 are underway. Variables that capture urbanization could show how the demographics of the city are being reshaped and how these changes are affecting crime in each neighbourhood. To obtain some of these features I would have to explore beyond the Open Data portal made available by the City of Toronto. This would be a possible extension to the current work.
Toronto is also going through a condominium construction boom, which is increasingly escalating rental and housing prices, as well as affecting affordability of living for low-income residents. How much of this change could be affecting crime? Housing and rental prices could serve as important independent variables that affect crime.
One final aspect I would investigate is the effectiveness of social assistance programs. My findings show that the greater the number of people on social assistance, the more the crime in that area. But does this mean that social assistance is causing more crime? Clearly no; however, it is a reflection of the income needs of people in that neighbourhood. The expectation is that with greater social assistance, the income of each person and therefore the overall economic health of that neighbourhood will change. But is this really happening? One way to investigate this is to look at neighbourhoods that received more social assistance, and see if the crime rates in that neighbourhood reduced in a few years from the point of receiving higher social assistance.
From a prediction standpoint, obtaining more data for at least 15-20 years would have allowed the models to capture predictable trends. Despite these limitations, the data provided us with interesting findings and a set of action items for future extension of this work. A few important points to reflect on, as possible recommendations to the City of Toronto are the following:
The findings show that having more businesses in each neighbourhood is correlated with a reduction in major crime. Having more businesses is a sign of urbanization and impacts the overall economic health of a neighbourhood. Given that the number of businesses in a neighbourhood is important, we need to facilitate the growth of businesses. Therefore, more people within each neighbourhood would need to be provided with the necessary training and self-financing opportunities for starting their own entrepreneurial ventures. The city should provide various economic incentives for entrepreneurs and ensure that people in every neighbourhood are made aware of these opportunities.
The findings show a correlation between the number of people receiving social assistance and the amount of major crime in a neighbourhood. This probably reflects the immediate housing and income needs of people in that neighbourhood. The city should consider ways in which social assistance programs can assist people not just in the short-term by way of income and housing help, but also in the long-term by offering accessible education and employability programs to these people. The city can also do an assessment of these long-term programs for their effectiveness in reducing crime over a 2-5 year period.
I hope that my experience exploring neighbourhood crime through this project will help you along in your own data science journey using IPython notebooks. I welcome you to build upon this work, improve it, and extend it.
By: Naresh Vempala