1
Regular expressions & word tokenization
This chapter will introduce some basic NLP concepts, such as word tokenization and regular expressions to help parse text. You'll also learn how to handle non-English text and more difficult tokenization cases.
2
Simple topic identification
This chapter will introduce you to topic identification, which you can apply to any text you encounter in the wild. Using basic NLP models, you will identify topics from texts based on term frequencies. You'll experiment with and compare two simple methods, bag-of-words and tf-idf, using NLTK and a new library, Gensim.
3
Named-entity recognition
This chapter will introduce a slightly more advanced topic: named-entity recognition. You'll learn how to identify the who, what, and where of your texts using pre-trained models on English and non-English text. You'll also learn how to use some new libraries, polyglot and spaCy, to add to your NLP toolbox.
4
Building a "fake news" classifier
You'll apply the basics of what you've learned along with some supervised machine learning to build a "fake news" detector. You'll begin by learning the basics of supervised machine learning, and then move forward by choosing a few important features and testing ideas to identify and classify fake news articles.

TfidfVectorizer for text classification

As with the CountVectorizer from the previous exercise, which produced sparse count vectors, you'll now create tf-idf vectors for your documents. You'll set up a TfidfVectorizer and investigate some of its features.

In this exercise, you'll use pandas and sklearn along with the same X_train, y_train and X_test, y_test DataFrames and Series you created in the last exercise.

Import TfidfVectorizer from sklearn.feature_extraction.text.
Create a TfidfVectorizer object called tfidf_vectorizer. When doing so, specify the keyword arguments stop_words="english" and max_df=0.7.
Fit and transform the training data.
Transform the test data.
Print the first 10 features of tfidf_vectorizer.
Print the first 5 vectors of the tfidf training data by slicing the .A (array) attribute of tfidf_train, which is equivalent to calling .toarray().
