1. Basic feature extraction
In this video, we will learn to extract certain basic features from text. While not very powerful, they can give us a good idea of the text we are dealing with.
2. Number of characters
The most basic feature we can extract from text is the number of characters, including whitespace. For instance, the string "I don't know."
has 13 characters. The number of characters is the length of the string. Python gives us a built-in len() function
which returns the length of the string passed into it. The output will be
13 here too. If our dataframe df has a textual feature (say 'review'), we can compute the number of characters for each review and store it as a new feature 'num_chars' by calling the apply method on that column. This is done by creating df['num_chars']
and assigning df['review'].apply(len) to it.
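As a minimal sketch of the two steps just described, we can try len() on the example string and then on a column. The two-row dataframe here is made up purely for illustration; only the names df, 'review' and 'num_chars' come from the narration.

```python
import pandas as pd

# len() counts every character, including spaces and punctuation
text = "I don't know."
num_chars = len(text)

# Hypothetical two-row dataframe standing in for real review data
df = pd.DataFrame({"review": ["I don't know.", "Great film"]})

# Apply len to each review and store the result as a new feature
df["num_chars"] = df["review"].apply(len)
print(df)
```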
3. Number of words
Another feature we can compute is the number of words. Assuming that every word is separated by a space, we can use a string's split() method to convert it into a list where every element is a word.
In this example, the string "Mary had a little lamb" is split to create a list
containing the words Mary, had, a, little and lamb.
We can now compute the number of words by computing the number of elements in this list
using len().
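The split-then-count idea can be sketched in a couple of lines. Note that split() called with no arguments splits on any run of whitespace, which is slightly more forgiving than the single-space assumption stated above.

```python
sentence = "Mary had a little lamb"

# Split the string into a list of words
words = sentence.split()

# The number of words is the number of elements in the list
num_words = len(words)
print(words, num_words)
```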
4. Number of words
To do this for a textual feature in a dataframe, we first define a function
that takes in a string as an argument and returns the number of words in it. The steps followed inside the function are the same as before. We then pass this function word_count into apply. We create df['num_words']
and assign df['review'].apply(word_count) to it.
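A sketch of what this might look like is shown here. The function name word_count and the column names follow the narration; the sample reviews are invented for illustration.

```python
import pandas as pd

def word_count(string):
    """Return the number of whitespace-separated words in a string."""
    words = string.split()
    return len(words)

# Hypothetical sample data standing in for a real 'review' column
df = pd.DataFrame({"review": ["Mary had a little lamb", "I don't know."]})

# Compute the word count for each review
df["num_words"] = df["review"].apply(word_count)
print(df)
```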
5. Average word length
Let's now compute the average length of words in a string. Let's define a function avg_word_length()
which takes in a string and returns the average word length. We first split the string
into words and compute
the length of each word. Next, we compute the average word length
by dividing the sum of the lengths of all words by the number of words.
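One way to write such a function is sketched here. The guard for strings with no words is our own addition, not something from the narration; without it, an empty string would cause a division by zero.

```python
def avg_word_length(string):
    """Return the average length of the words in a string."""
    # Split the string into words
    words = string.split()

    # Assumption: treat a string with no words as having average length 0
    if not words:
        return 0.0

    # Compute the length of each word
    word_lengths = [len(word) for word in words]

    # Divide the total length by the number of words
    return sum(word_lengths) / len(words)

avg = avg_word_length("Mary had a little lamb")
print(avg)
```

For "Mary had a little lamb", the word lengths are 4, 3, 1, 6 and 4, so the average is 18 divided by 5.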
6. Average word length
We can now pass this into apply()
to generate an average word length feature like before.
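The apply step could look roughly like this. The column name 'avg_word_length' and the sample reviews are assumptions made for the sketch, and a compact version of the function is repeated so the snippet runs on its own.

```python
import pandas as pd

def avg_word_length(string):
    """Return the average word length, or 0.0 if there are no words."""
    words = string.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)

# Hypothetical sample data standing in for a real 'review' column
df = pd.DataFrame({"review": ["Mary had a little lamb", "I am"]})

# Hypothetical feature name; the narration does not name this column
df["avg_word_length"] = df["review"].apply(avg_word_length)
print(df)
```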
7. Special features
When working with data such as tweets, it may be useful to compute the number of hashtags or mentions used. This tweet by DataCamp,
for instance, has one mention, upendra_35, which begins with an @, and two hashtags, PySpark and Spark, which begin with a #.
8. Hashtags and mentions
Let's write a function
that computes the number of hashtags in a string. We split the
string into words. We then use a list comprehension
to create a list containing only those words that are hashtags. We do this using the startswith method of strings to find out if a word begins with #. The final step
is to return the number of elements in this list using len. The procedure to compute the number of mentions is identical except that we check if a word starts with @. Let's see this function in action. When we pass the string
"@janedoe This is my first tweet! #FirstTweet #Happy", the function returns 2
which is indeed the number of hashtags in the string.
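The steps described above can be sketched as follows. The function names count_hashtags and count_mentions are our own choice for illustration; the example tweet is the one used in the narration.

```python
def count_hashtags(string):
    """Return the number of words in a string that start with '#'."""
    # Split the string into words
    words = string.split()

    # Keep only the words that are hashtags
    hashtags = [word for word in words if word.startswith("#")]

    # Return the number of hashtags
    return len(hashtags)

def count_mentions(string):
    """Return the number of words in a string that start with '@'."""
    words = string.split()
    mentions = [word for word in words if word.startswith("@")]
    return len(mentions)

tweet = "@janedoe This is my first tweet! #FirstTweet #Happy"
print(count_hashtags(tweet), count_mentions(tweet))
```

Because the check only looks at the first character, a lone "#" or "@" would also be counted, which may or may not be what we want for real tweets.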
9. Other features
There are other basic features we can compute such as number of sentences,
number of paragraphs,
number of words starting with an uppercase letter,
all-capital words,
numeric quantities
and so on. The procedures for computing these are very similar to the ones we've already covered.
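To illustrate how similar the procedure is, here is a rough sketch for three of these features. The function names and the sample sentence are hypothetical, and the checks are deliberately simple: they work on whitespace-separated tokens and do not strip punctuation.

```python
def count_capitalized(string):
    """Count words whose first character is an uppercase letter."""
    return len([w for w in string.split() if w[0].isupper()])

def count_all_caps(string):
    """Count words written entirely in uppercase (this includes 'I')."""
    return len([w for w in string.split() if w.isupper()])

def count_numeric(string):
    """Count tokens made up only of digits."""
    return len([w for w in string.split() if w.isdigit()])

# Hypothetical sample sentence
text = "I bought 3 apples and 12 NASA mugs in Paris"
print(count_capitalized(text), count_all_caps(text), count_numeric(text))
```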
10. Let's practice!
That's enough theory for now. Let's practice!