Charting practice
Try using your new skills to find and chart the number of words per line in the script using matplotlib. The Holy Grail script is loaded for you, and you need to use regex to find the words per line.
Using list comprehensions here will speed up your computations. For example: my_lines = [tokenize(l) for l in lines] will call a function tokenize on each line in the list lines. The new transformed list will be saved in the my_lines variable.
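As a minimal sketch of that list comprehension, assume a hypothetical tokenize function built on re.findall and a two-line sample list (neither is part of the exercise environment):

```python
import re

def tokenize(line):
    # Hypothetical stand-in tokenizer: returns the runs of word characters in a line.
    return re.findall(r"\w+", line)

# Made-up sample input, used only for illustration.
lines = ["Whoa there!", "It is I, Arthur."]

# Call tokenize on each element of lines and collect the results in a new list.
my_lines = [tokenize(l) for l in lines]

print(my_lines)
```

Each element of my_lines is the list of tokens for the corresponding element of lines, so the two lists always have the same length.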
You have access to the entire script in the variable holy_grail. Go for it!
This exercise is part of the course
Introduction to Natural Language Processing in Python
Exercise instructions
- Split the script holy_grail into lines using the newline ('\n') character.
- Use re.sub() inside a list comprehension to replace the prompts such as ARTHUR: and SOLDIER #1. The pattern has been written for you.
- Use a list comprehension to tokenize lines with regexp_tokenize(), keeping only words. Recall that the pattern for words is "\w+".
- Use a list comprehension to create a list of line lengths called line_num_words.
  - Use t_line as your iterator variable to iterate over tokenized_lines, and then the len() function to compute line lengths.
- Plot a histogram of line_num_words using plt.hist(). Don't forget to use plt.show() as well to display the plot.
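The steps above can be sketched on a short sample string. This is an illustration under stated assumptions, not the exercise solution: sample is a made-up stand-in for holy_grail, re.findall stands in for regexp_tokenize(), and a Counter tally stands in for the plt.hist() plot so that the sketch needs only the standard library.

```python
import re
from collections import Counter

# Made-up sample standing in for the holy_grail variable.
sample = (
    "ARTHUR: Whoa there!\n"
    "SOLDIER #1: Halt! Who goes there?\n"
    "ARTHUR: It is I, Arthur."
)

# Split the text into lines on the newline character.
lines = sample.split('\n')

# Remove speaker prompts such as "ARTHUR:" and "SOLDIER #1:".
pattern = r"[A-Z]{2,}(\s)?(#\d)?([A-Z]{2,})?:"
lines = [re.sub(pattern, '', l) for l in lines]

# Tokenize each line, keeping only words (re.findall used as a stand-in tokenizer).
tokenized_lines = [re.findall(r"\w+", s) for s in lines]

# Count the words on each line.
line_num_words = [len(t_line) for t_line in tokenized_lines]

# Tally how many lines have each length (a text substitute for a histogram).
length_counts = Counter(line_num_words)

print(line_num_words)
print(length_counts)
```

In the exercise itself, the list of line lengths is passed to plt.hist() and displayed with plt.show() rather than tallied by hand.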
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Split the script into lines: lines
lines = ____.____('\n')
# Remove the speaker prompts from each line
pattern = r"[A-Z]{2,}(\s)?(#\d)?([A-Z]{2,})?:"
lines = [re.____(____, '', l) for l in lines]
# Tokenize each line: tokenized_lines
tokenized_lines = [____ for s in lines]
# Make a frequency list of lengths: line_num_words
line_num_words = [____ for t_line in tokenized_lines]
# Plot a histogram of the line lengths
____
# Show the plot
____