Charting word length with NLTK

1. Charting word length with NLTK

Hi everyone! In this video, we are going to learn about using charts with our NLP tools.

2. Getting started with matplotlib

Matplotlib is a charting library used by many different open-source Python projects to create data visualizations, charts and graphs. It has fairly straightforward functionality with lots of options for graphs like histograms, bar charts, line charts and scatter plots. It even has advanced functionality like generating 3D graphs and animations.

3. Plotting a histogram with matplotlib

Matplotlib is usually imported by bringing in its pyplot module under the alias plt. If we want to plot a basic histogram, which is a type of plot used to show the distribution of data, we can pass a small array to the `hist` function. The array has 5 appearing twice and 7 appearing three times, so it's a good candidate for showing a distribution. Finally, we call the plt.show function and matplotlib will show us the generated chart in our system's standard graphics viewing tool.
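The steps just described can be sketched roughly as follows. The transcript only states that 5 appears twice and 7 three times, so the other values in the array (1 and 9) are assumptions chosen for illustration.

```python
from matplotlib import pyplot as plt

# Assumed sample data: only "5 twice, 7 three times" comes from the
# description; the values 1 and 9 are illustrative.
data = [1, 5, 5, 7, 7, 7, 9]

# hist() draws the histogram and also returns the per-bin counts,
# the bin edges, and the drawn bar patches.
counts, bin_edges, patches = plt.hist(data)

# Display the chart (in a headless session nothing is shown).
plt.show()
```

Capturing the return values is optional; calling plt.hist(data) on its own is enough to draw the chart.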

4. Generated histogram

This is the chart that we generated using the previous code. We can see that matplotlib has determined suitable bins for the entries, and that the bars for 5 and 7 reflect the distribution we expected to see. It's not the prettiest chart by default, but making it look nicer is fairly easy with more arguments and several available helper libraries.
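As one possible illustration of what "more arguments" can look like, here is a small sketch that sets explicit bins, colors, a title and axis labels. The sample data, label text and color choices are all assumptions made for this example.

```python
from matplotlib import pyplot as plt

# Assumed sample data: 5 twice and 7 three times, as described;
# the values 1 and 9 are illustrative.
data = [1, 5, 5, 7, 7, 7, 9]

# One bin per integer value: edges at 0.5, 1.5, ..., 9.5.
bin_edges = [x + 0.5 for x in range(0, 10)]

# edgecolor outlines each bar so neighbouring bars are easy to tell apart.
counts, edges, patches = plt.hist(
    data, bins=bin_edges, color="steelblue", edgecolor="black"
)

plt.title("Distribution of sample values")
plt.xlabel("Value")
plt.ylabel("Count")
plt.show()
```

Centering each bin on an integer is a design choice that suits small whole-number data; for continuous data, an integer bin count such as bins=20 is usually simpler.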

5. Combining NLP data extraction with plotting

We can now use the skills we have learned throughout this first chapter to tokenize text and chart word length for a simple sentence. First, we perform the imports needed for NLTK word tokenization and matplotlib charting. Then, we tokenize the words and punctuation in a short sentence. Finally, we use a Python list comprehension to transform our list of tokenized words into a list of lengths. As a brief refresher, a list comprehension is a succinct way to write a for loop that builds a list. The syntax starts and ends with square brackets; inside them, we write an expression followed by a for clause that iterates over any list, and the result is a new list. Here, we create a list that holds the length of each word in the words list simply by writing [len(w) for w in words]. This iterates over each word, calculates its length and collects the results in a new list. We then pass this list of token lengths to the hist function and display our chart by calling the plt.show function.
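A minimal sketch of this workflow is shown below. Two things here are assumptions rather than the video's exact code: the example sentence is made up for illustration, and the sketch uses NLTK's regex-based wordpunct_tokenize so that it runs without downloading any tokenizer models (word_tokenize should give the same tokens for this sentence, but it needs the punkt data installed first).

```python
from matplotlib import pyplot as plt
from nltk.tokenize import wordpunct_tokenize

# Hypothetical example sentence (not taken from the transcript).
sentence = "This is a pretty cool tool!"

# Split the sentence into word and punctuation tokens.
words = wordpunct_tokenize(sentence)

# List comprehension: one length per token.
word_lengths = [len(w) for w in words]

# The equivalent explicit for loop, shown only for comparison.
loop_lengths = []
for w in words:
    loop_lengths.append(len(w))

# Plot the distribution of token lengths.
plt.hist(word_lengths)
plt.show()
```

Note that the punctuation mark counts as a token of length one, so it appears in the histogram alongside the one-letter word.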

6. Word length histogram

Here is the generated histogram from our previous code. We can see from the chart that we have a majority of four-letter words in our example sentence. Of course, with a simple sentence, this is easy enough to simply count by hand -- but for an entire play or book, this would be tedious and prone to error -- so writing it in code makes it a lot easier.
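To check a reading of the chart without counting by hand, the same tally can be done in code. This is a small sketch using the standard library's Counter; the token list and the helper name length_counts are hypothetical, chosen only for this example.

```python
from collections import Counter


def length_counts(tokens):
    """Return a Counter mapping token length to number of tokens (hypothetical helper)."""
    return Counter(len(t) for t in tokens)


# Hypothetical token list standing in for a tokenized sentence.
tokens = ["This", "is", "a", "pretty", "cool", "tool", "!"]

counts = length_counts(tokens)

# most_common(1) returns a list with the single most frequent (length, count) pair.
most_common_length, frequency = counts.most_common(1)[0]
print(f"Most common token length: {most_common_length} ({frequency} tokens)")
```

The same function works unchanged on the token list of a whole play or book, which is where doing this in code pays off.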

7. Let's practice!

Now it's your turn to start plotting NLP charts with matplotlib!