1. Poisson Mixture Models with flexmix
In the final learning lesson of the course, we will fit a Poisson mixture model with `flexmix`.
2. The problem to solve
As we saw, the crimes dataset is of count data, so a suitable distribution should be the Poisson.
Because we don't have a clear idea of how many clusters should we consider, we will try with a range of possibilities covering from 1 to 15 clusters, and then choose the mixture model that minimizes the BIC.
Finally, the parameters to be estimated with `flexmix` package are the lambdas for the multivariate Poisson and the corresponding proportions.
3. Fit with flexmix
Before fitting the model, we need to transform the dataset into a matrix, saving it as crimes underscore matrix. Observe that I am not considering the names of the communities into this matrix.
Next, to consider a range of clusters, we can use the implemented function called `stepFlexmix()`, which accepts a vector of integers for the k value.
Also, we can specify the number of repetitions the EM algorithm should run for each `k` value.
And finally, the Poisson distribution is represented by the function `FLXMCmvpois` and the control argument is the same as the previous model fits.
4. Pick the best model
Once we have fitted the models for a range of clusters, we need to pick one by some criterion.
There are many statistical criteria. In flexmix, for example, you can use the BIC, AIC and the ICL.
Here, I use the BIC because is the one by default in flexmix.
To get the model that minimizes this criterion, we use the function `getmodel()` as shown here and store it into the object best underscore fit.
With this object, we proceed with the analysis as usual.
5. The proportions
The proportions are extracted as before with `prior()`.
From these results, we observe that the best model has six clusters, not all with the same importance though.
6. The parameters
To extract the lambdas, we'll use `parameters()`.
For the purpose of simplifying the visualization, I'll transform the parameters into a data frame and add the type of crime as a new column.
Then, the first six columns of `param_pmm` have the values of the lambdas for the six clusters, where each row represents a type of crime.
7. Visualize the clusters
To visualize the lambdas for each cluster in a nicer way, we can use the function `pivot_longer()` from the `tidyr` package as well as the `ggplot2` function, `facet_wrap()`.
`pivot_longer` transforms the columns into variables and `facet_wrap` plots each component into a new plot.
8. Visualize each cluster
The resulting plot is shown here.
Observe that cluster one and two, for example, present the larger lambdas, meaning that it is more likely to get more crimes in the regions identified as these clusters.
However, cluster 4 exhibits the lowest lambdas so the communities identified by this cluster are the safest.
9. Assign cluster to each community
In chapter 3, we saw that the function `clusters()` from `flexmix` assigns to each observation the most probable cluster.
Keeping the same logic, we can assign the cluster to the communities it is most related to and save the results into a new variable called cluster.
10. Visualize the clusters with their communities
There are many ways to visualize each community within each cluster.
The one presented here is to group by the cluster's number and create a new variable that counts the observations in each cluster.
This number is just used to place the community onto the y axis.
Also, I added the function `geom_text()` to plot the community name and the resulting plot is as follows:
11. Visualize the clusters with their communities
Observe, for example, cluster 1, which was one of the most dangerous is formed by only six communities. And cluster 4, which was the safest, is also the one with more communities.
12. Let's practice!
Now it is your turn to find the communities' clusters.