1. Generalized overview of NLP
Welcome back. Now we will cover the key NLP techniques used to transform text data into a machine-readable form for use in LLMs.
2. Where are we?
This preparation is part of the text pre-processing building block for LLMs.
3. Text pre-processing
Text pre-processing transforms raw text data into a standardized format and involves several steps, including tokenization, stop word removal, and lemmatization.
Note that these pre-processing steps are independent and can be done in a different order depending on the task.
Let's learn about each one.
4. Tokenization
Tokenization splits the text into words, also called tokens. Consider the sentence: “Working with natural language processing techniques is tricky”.
It is broken into tokens: ["Working", "with", "natural", "language", "processing", "techniques", "is", "tricky", "."].
Note that punctuation is also a token.
Square brackets represent a list: the words are now stored as a list of tokens rather than a sentence.
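The transcript doesn't name a library, but as a minimal sketch, tokenization could look like this in Python using NLTK (an assumed tool, not one specified in the course):

```python
# Minimal tokenization sketch using NLTK (assumed library).
import nltk

nltk.download("punkt")  # tokenizer data; newer NLTK versions may also need "punkt_tab"
from nltk.tokenize import word_tokenize

sentence = "Working with natural language processing techniques is tricky."
tokens = word_tokenize(sentence)
print(tokens)
# ['Working', 'with', 'natural', 'language', 'processing', 'techniques',
#  'is', 'tricky', '.']
```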
5. Stop word removal
When working with text data, we often come across frequently used words, such as "with" or "is", that don't add much meaning to the text. These are known as stop words.
Eliminating them, through a step called stop word removal, helps identify the most important parts of the sentence.
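Continuing the sketch, here is how stop word removal might look using NLTK's English stop word list (again an assumption about tooling):

```python
# Stop word removal sketch using NLTK's English stop word list (assumed tool).
import nltk

nltk.download("stopwords")
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ["Working", "with", "natural", "language", "processing",
          "techniques", "is", "tricky", "."]
# Keep only tokens that are not stop words ("with" and "is" are dropped)
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)
# ['Working', 'natural', 'language', 'processing', 'techniques', 'tricky', '.']
```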
6. Lemmatization
Often, we may have slightly different words that mean the same thing in the context of the sentence. This means we can group these words together.
This process of reducing words to their base form is known as lemmatization. For example, "talking", "talked", and "talk" would be mapped to the root word "talk".
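As a hedged sketch, lemmatization could be done with NLTK's WordNetLemmatizer (one possible tool among several):

```python
# Lemmatization sketch using NLTK's WordNetLemmatizer (assumed tool).
import nltk

nltk.download("wordnet")  # lexical database; some versions also need "omw-1.4"
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for word in ["talking", "talked", "talk"]:
    # pos="v" tells the lemmatizer to treat each word as a verb
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))
# talking -> talk
# talked -> talk
# talk -> talk
```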
7. Text representation
The next step is to change the text into a form the computer can understand. We do this through text representation.
8. Text representation
Text representation techniques help convert preprocessed text into a numerical form.
There are different ways of doing this, but we will focus on bag-of-words and word embeddings.
9. Bag-of-words
The bag-of-words approach involves converting the text into a matrix of word counts.
Consider the two sentences: “The cat chased the mouse swiftly” and “The mouse chased the cat”.
After removing the stop words, the bag-of-words technique builds a vocabulary of all the unique words, “cat”, “chased”, “mouse”, and “swiftly”, and counts how often each appears in each sentence.
Note that in the first sentence, the last count is one, which corresponds to the word “swiftly”.
However, since the second sentence does not contain this word, its count there is zero.
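A minimal sketch of this with scikit-learn's CountVectorizer (an assumed tool; the course does not name a library) reproduces these counts:

```python
# Bag-of-words sketch using scikit-learn (assumed library).
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The cat chased the mouse swiftly",
             "The mouse chased the cat"]
vectorizer = CountVectorizer(stop_words="english")  # drops words like "the"
counts = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # ['cat' 'chased' 'mouse' 'swiftly']
print(counts.toarray())
# [[1 1 1 1]
#  [1 1 1 0]]  <- the second sentence has no "swiftly", so its count is zero
```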
10. Limitations of bag-of-words
The bag-of-words method has its limitations: it fails to capture the meaning and context of words, which can lead to incorrect interpretations of a text.
For example, “The cat chased the mouse” and “The mouse chased the cat” contain the same words, but their meanings are opposite.
Further, it treats related words, such as "cat" and "mouse," as separate and independent, failing to capture their semantic relationship.
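Continuing the scikit-learn sketch, we can see the limitation concretely: two sentences with opposite meanings produce identical count vectors.

```python
# Opposite meanings, identical bag-of-words vectors.
from sklearn.feature_extraction.text import CountVectorizer

opposites = ["The cat chased the mouse",
             "The mouse chased the cat"]
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(opposites)
print(counts.toarray())
# [[1 1 1]
#  [1 1 1]]  <- identical rows, so the model cannot tell the sentences apart
```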
11. Word embeddings
Word embeddings address these limitations by capturing semantic meanings of words and representing them as numbers, allowing for similar words to have similar representations.
For example, the cat is a predator and the mouse is prey. Word embeddings convert each word into a list of numbers, with higher values indicating a stronger association with a given feature.
So the word "cat" becomes [-0.9, 0.9, 0.9].
Plant, furry, and carnivore are features learned by the model from the training data. In reality, we won't see these labels; instead, each word is simply defined as a list of numbers.
This technique allows us to represent similar relationships between other words like tiger and deer, and eagle and rabbit.
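As a toy sketch with NumPy, we can mirror the slide's hand-made vectors and measure similarity between them. These three-dimensional vectors and their feature labels are purely illustrative; real embeddings are learned from data, have hundreds of dimensions, and carry no such labels.

```python
# Toy embedding sketch (hand-made vectors, for illustration only).
import numpy as np

embeddings = {
    "cat":   np.array([-0.9, 0.9, 0.9]),   # not a plant, furry, carnivore
    "tiger": np.array([-0.8, 0.8, 0.9]),
    "mouse": np.array([-0.9, 0.7, -0.6]),
}

def cosine_similarity(a, b):
    """Similarity of two vectors: values near 1.0 mean very similar."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["cat"], embeddings["tiger"]))  # ~0.99, very similar
print(cosine_similarity(embeddings["cat"], embeddings["mouse"]))  # ~0.45, less similar
```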
12. Machine-readable form
To recap, several techniques such as tokenization, stop word removal, and lemmatization are used to pre-process text data.
13. Machine-readable form
This pre-processed text is then transformed into a numerical format using techniques like bag-of-words and word embeddings, enabling it to be used by LLMs.
14. Let's practice!
Let's practice.