1. Introduction to NLP feature engineering
Welcome to Feature Engineering for NLP in Python! I am Rounak
and I will be your instructor for this course. In this course, you will learn to extract useful features out of text and convert them into formats that are suitable for machine learning algorithms.
2. Numerical data
For any ML algorithm, the data fed into it must be in tabular form, and all the training features must be numerical.
Consider the Iris dataset. Every training instance has exactly four numerical features. The ML algorithm uses these four features to train and predict if an instance belongs to class iris-virginica, iris-setosa or iris-versicolor.
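As an illustration, here is one way to view the Iris dataset as a numerical table (using scikit-learn's bundled copy; the course itself may load the data differently):

```python
from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset bundled with scikit-learn and view it as a table
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

print(df.shape)   # (150, 4): 150 training instances, 4 features
print(df.dtypes)  # every feature is numerical (float64)
```

Each row is one training instance, and every column is already numerical, so the data can be passed to an ML algorithm as-is.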
3. One-hot encoding
ML algorithms can also work with categorical data provided they are converted into numerical form through one-hot encoding.
Let's say you have a categorical feature 'sex' with two categories 'male' and 'female'.
4. One-hot encoding
One-hot encoding will convert this feature into two features,
5. One-hot encoding
'sex_male' and 'sex_female', such that each male instance has a 'sex_male' value of 1 and a 'sex_female' value of 0. For female instances, the values are reversed.
6. One-hot encoding with pandas
To do this in code, we use pandas' get_dummies() function.
Let's import pandas using the alias pd. We can then pass our dataframe df into the pd.get_dummies() function and pass a list of features to be encoded as the columns argument. If the columns argument is not specified, pandas automatically encodes every non-numerical feature.
Finally, we overwrite the original dataframe with the encoded version by assigning the dataframe returned by get_dummies() back to df.
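Putting these steps together on a small made-up dataframe (the column values here are purely illustrative):

```python
import pandas as pd

# A hypothetical dataframe with one numerical and one categorical feature
df = pd.DataFrame({'age': [25, 32, 47],
                   'sex': ['male', 'female', 'male']})

# One-hot encode 'sex'; passing columns explicitly leaves 'age' untouched,
# then overwrite the original dataframe with the encoded version
df = pd.get_dummies(df, columns=['sex'])

print(df.columns.tolist())  # ['age', 'sex_female', 'sex_male']
```

Note that the original 'sex' column is dropped and replaced by the two new binary columns.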
7. Textual data
Consider a movie reviews dataset.
This data cannot be used directly by any machine learning (ML) algorithm. The training feature 'review' is not numerical, nor is it categorical, so one-hot encoding cannot be applied to it.
8. Text pre-processing
We need to perform two steps to make this dataset suitable for ML.
The first is to standardize the text. This involves steps such as converting words to lowercase and to their base form. For instance, 'Reduction' is lowercased and then converted to its base form, 'reduce'. We will cover these concepts in more detail in subsequent lessons.
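As a minimal sketch of this idea, the snippet below lowercases words and maps them to base forms using a toy lookup table; a real pipeline would use a library such as spaCy or NLTK for the base-form step, which we cover later:

```python
# Toy lookup table standing in for real lemmatization (illustrative only)
base_forms = {'reduction': 'reduce'}

def standardize(text):
    # Lowercase the text, split it into words, and map each word
    # to its base form when one is known
    words = text.lower().split()
    return [base_forms.get(word, word) for word in words]

print(standardize('Reduction in price'))  # ['reduce', 'in', 'price']
```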
9. Vectorization
After preprocessing, the reviews are converted into a set of numerical training features through a process known as vectorization. After vectorization, our original review dataset gets converted
10. Vectorization
into something like this. We will learn techniques to achieve this in later lessons.
11. Basic features
We can also extract certain basic features from text. It may be useful to know the word count, character count, and average word length of a particular text.
While working with niche data such as tweets, it may also be useful to know how many hashtags a tweet uses. This tweet by Silverado Records, for instance, uses two.
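These basic features can be computed with plain Python string operations; the tweet text below is made up for illustration:

```python
# A hypothetical tweet (not the actual Silverado Records tweet)
tweet = "Loving the new album! #NowPlaying #SilveradoRecords"

words = tweet.split()
word_count = len(words)
char_count = len(tweet)
avg_word_length = sum(len(w) for w in words) / word_count
hashtag_count = sum(1 for w in words if w.startswith('#'))

print(word_count)     # 6
print(hashtag_count)  # 2
```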
12. POS tagging
So far, we have seen how to extract features out of an entire body of text. Some NLP applications may require you to extract features for individual words. For instance, you may want to perform part-of-speech (POS) tagging to identify the different parts of speech present in your text, as shown.
As an example, consider the sentence 'I have a dog'. POS tagging will label each word with its corresponding part-of-speech.
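The sketch below shows the input/output shape of POS tagging using a hand-built lookup table; this is a toy illustration only, since a real pipeline would use a trained tagger such as nltk.pos_tag or spaCy:

```python
# Toy tag lookup (illustrative only; real taggers are trained models)
pos_lookup = {'i': 'PRON', 'have': 'VERB', 'a': 'DET', 'dog': 'NOUN'}

def toy_pos_tag(sentence):
    # Label each word with its part of speech, defaulting to 'X' (unknown)
    return [(w, pos_lookup.get(w.lower(), 'X')) for w in sentence.split()]

print(toy_pos_tag('I have a dog'))
# [('I', 'PRON'), ('have', 'VERB'), ('a', 'DET'), ('dog', 'NOUN')]
```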
13. Named Entity Recognition
You may also want to perform named entity recognition (NER) to find out whether a particular noun refers to a person, organization, or country.
For instance, consider the sentence "Brian works at DataCamp". Here, there are two nouns "Brian" and "DataCamp". Brian refers to a person whereas DataCamp refers to an organization.
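To make the expected output concrete, here is a toy NER sketch driven by a hand-built entity list; real NER systems (such as spaCy's, covered later) use trained models rather than lookups:

```python
# Toy entity list (illustrative only)
known_entities = {'Brian': 'PERSON', 'DataCamp': 'ORG'}

def toy_ner(sentence):
    # Return each recognized entity with its label
    return [(w, known_entities[w]) for w in sentence.split()
            if w in known_entities]

print(toy_ner('Brian works at DataCamp'))
# [('Brian', 'PERSON'), ('DataCamp', 'ORG')]
```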
14. Concepts covered
Therefore, broadly speaking, this course will teach you how to perform text preprocessing, extract basic features and word-level features, and convert documents into a set of numerical features through a process known as vectorization.
15. Let's practice!
Great! Now, let's practice!