At RokkinCat, we've been attending and working with the Milwaukee Data Initiative to see what data the city has, or could have, and how we can use it to improve Milwaukee. One of the first new sources is a scrape of the Milwaukee Police Call Dispatch Log, which has been collecting data for the past month or so. It has a neat website here for viewing calls in real time in list and map form, and it also offers a CSV download.
There is an important note regarding this data:
The City of Milwaukee makes no warranty, representation or guaranty as to the content, accuracy, timeliness or completeness of any of the information presented, due to data collection methods. The city of Milwaukee assumes no liability.
That said, we can still play around and explore to see if anything interesting comes up. We'll be using R to transform and display the gathered data. The overarching R package we'll be making use of is tidyverse, which installs a set of popular packages that work well together and make it easier to work with the data. It includes ggplot2, dplyr, tidyr, readr, tibble, and lubridate. The syntax and functions in the following scripts may come from any of those packages. For further documentation, call ?FUNCTION_NAME within an R console.
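For example, to pull up the documentation for the bar chart layer used later in this post:
?geom_bar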
To get started, install and load tidyverse:
install.packages("tidyverse")
library(tidyverse)
Initially, we'll read in our data and transform it into a more usable format, as well as extract some other features. The data starts the week of November 13, 2016 and ends the week of December 11, 2016. We give the read_csv function some hints about column types when it converts the data, then manually convert the nature and final status of the calls to factors (a categorical data type). We'll also pull parts of each call's time into their own columns, making sure to convert to the local timezone first, since the data stores times in UTC.
library(lubridate) # ships with tidyverse but needs to be loaded explicitly

calls <- read_csv("file.csv", col_types = cols(time = col_datetime(), location = col_character()))
calls$nature <- as.factor(calls$nature)
calls$status <- as.factor(calls$status)

# The raw times are in UTC; convert to local time before extracting features
local_time <- with_tz(calls$time, "America/Chicago")
calls$hour <- hour(local_time)
calls$day <- day(local_time)
calls$month <- month(local_time, label = TRUE)
calls$wday <- wday(local_time, label = TRUE)
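As a quick sanity check on the conversions, tidyverse's glimpse() prints each column with its type and a preview of its values:
glimpse(calls)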
To start, we'll use ggplot2's bar chart to see what the distribution of call volume looks like across days of the week and hours of the day.
ggplot(data = calls) + geom_bar(mapping = aes(x = wday))
ggplot(data = calls) + geom_bar(mapping = aes(x = hour))
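Alongside the charts, a quick numeric summary of the hour-to-hour change in call counts can make sudden jumps or drops easier to pin down (a minimal sketch, assuming the calls data frame built above):
calls %>%
  count(hour) %>%                       # calls per hour of day, as column n
  mutate(change = n - lag(n)) %>%       # difference from the previous hour
  arrange(desc(abs(change)))            # largest swings first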
Nothing is immediately interesting when looking at the calls by day of the week. By hour, calls appear least frequent between 9pm and midnight and most frequent during the 10am hour. There are noticeable jumps around 2am and 10am, as well as cliffs at 3pm and 5pm; these may align with shift changes in some way, as they're roughly 8 hours apart. If we want to see the frequency of calls by hour and day of the week together, we can look at a heatmap across those two columns.
calls %>%
  group_by(hour, wday) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = wday, y = hour, fill = count)) +
  geom_tile() +
  labs(y = "Hour of Call",
       x = "Day of Call",
       title = "Number of Calls During Hour of Day")
There doesn't appear to be much to glean from this, so we'll move on from purely time-based views. We also have the nature of each call, and seeing which types are most common may be a good place to start.
> top_call_natures <- group_by(calls, nature) %>%
+   summarize(count = n()) %>%
+   arrange(desc(count)) %>%
+   select(nature) %>%
+   slice(1:10) %>%
+   .$nature
> top_call_natures
[1] BUSINESS CHECK TRAFFIC STOP RETURN STATION TRBL W/SUBJ
[5] BUS INV FOLLOW UP SUBJ STOP CITIZEN CONTACT
[9] TS TARGETED REPORTS STATION
Some of these are pretty clear, but I haven’t found a source for what things like “BUS INV” and “TS TARGETED” refer to.
It may be interesting to see when the most common calls happen throughout the day. To do that, we'll graph another heatmap, filtering with the variable we just created so that only the most common call types are plotted.
calls %>%
  filter(nature %in% top_call_natures) %>%
  group_by(nature, hour) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = hour, y = reorder(nature, count), fill = count)) +
  geom_tile() +
  labs(y = "Nature of Call",
       x = "Hour of Call",
       title = "Number of Calls during the Day by Type")
Perhaps the most curious part of the graph is how business checks in the morning and early afternoon, along with traffic stops in the middle of the day, crowd out the remaining calls, which lose a lot of detail in the rest of the figure. Those two call types are responsible for almost a third of all the calls.
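As a minimal check on that figure (the two level names are copied from the factor levels printed above):
calls %>%
  summarise(top_two_share = mean(nature %in% c("BUSINESS CHECK", "TRAFFIC STOP")))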
That about wraps up a very cursory exploration of the data gathered thus far. There are avenues to pursue in the future, such as looking at the types of calls made around the large hourly swings in call volume, and making use of other features like district and location (maybe distance from the nearest station?). Currently, status is difficult to use: it changes over the life of a call, but we aren't tracking those changes, so the value we have is only the final status. This should be fixed "soon", which would allow for further analysis.
The code and data I used are shared here.