1. Simple NLP, complex problems
You've learned so much about Natural Language Processing fundamentals in this course, congratulations! In this video we'll talk about how complex these problems can be and how to use the skills you have learned to start a longer exploration of working with language in Python. In the exercises, you will investigate your fake news classification model further to see whether it has really learned what you wanted.
2. Translation
Translation, although it might work well for some languages, still has a long way to go. This tweet by Lupin, attempting to translate some legal or bureaucratic text about economics and industry, is a pretty funny and also sadly accurate example of how using word vectors between two languages can end poorly. The German text uses words like Nationalökonomie and Wirtschaftswissenschaft, but the English translation simply says that the economics of economics (including economics, economics and so forth) is part of economics.
The German text has many different words related to economics, and they are all simply closest to the English vector for economics, leading to a hilarious but woefully inaccurate translation.
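To make that failure mode concrete, here is a minimal sketch with toy, hand-made vectors standing in for real pretrained bilingual embeddings; the numbers and the extra German term are invented for illustration, but the nearest-neighbor lookup is the same idea.

    import numpy as np

    # Toy, made-up vectors standing in for pretrained bilingual word embeddings.
    english = {
        "economics": np.array([0.9, 0.1, 0.0]),
        "science":   np.array([0.1, 0.9, 0.0]),
        "industry":  np.array([0.2, 0.1, 0.9]),
    }
    german = {
        "Nationalökonomie":        np.array([0.88, 0.15, 0.05]),
        "Wirtschaftswissenschaft": np.array([0.91, 0.12, 0.02]),
        "Volkswirtschaft":         np.array([0.87, 0.10, 0.08]),  # hypothetical extra term
    }

    def cosine(a, b):
        return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Every distinct German term maps to the same nearest English word,
    # which is how the repetitive "economics of economics" text arises.
    for word, vec in german.items():
        nearest = max(english, key=lambda w: cosine(vec, english[w]))
        print(word, "->", nearest)

With real embeddings the geometry is subtler, but the nearest-neighbor step is the same, which is why several fine shades of meaning can collapse onto one common word.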
3. Sentiment analysis
Sentiment analysis is far from a solved problem. Complex issues like snark or sarcasm, and difficult problems with negation and contrast (for example: "I liked it, BUT it could have been better"), make it an open field of research. There is also active research into how different communities use the same words differently.
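Here is a minimal sketch of why a purely word-counting approach struggles with that example; the tiny sentiment lexicon below is made up for illustration.

    # Toy lexicon-based sentiment scorer; the word lists are invented for illustration.
    positive = {"liked", "better", "great", "good"}
    negative = {"bad", "boring", "terrible"}

    def naive_sentiment(text):
        tokens = text.lower().split()
        return sum(t in positive for t in tokens) - sum(t in negative for t in tokens)

    # Both "liked" and "better" count as positive words, so the hedged criticism
    # after "but" is scored like an enthusiastic review.
    print(naive_sentiment("I liked it but it could have been better"))  # prints 2

A model that only counts words has no way to notice that "but it could have been better" tempers the praise.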
The graphic shown here comes from a project called SocialSent, created by a group of researchers at Stanford. The project compares how the sentiment of words changes over time and across different communities. Here, the authors compare the sentiment of word usage in two different Reddit communities: 2X, a women-centered subreddit, and sports. The graphic illustrates that the SAME word can be used with very different sentiments depending on the communal understanding of the word.
4. Language biases
Finally, we must remember that language can contain its own prejudices and unfair treatment of groups. When we then train word vectors on these prejudiced texts, our word vectors will likely reflect those problems.
Here, we take a gendered language like English and translate a sentence into Turkish, a language with no gendered pronouns.
When we click to translate it back, we see the genders have switched. This phenomenon was studied in a recent article by Princeton researcher Aylin Caliskan alongside several researchers in ethical machine learning. She also gave a talk at the 33rd annual Chaos Computer Club conference in Hamburg, which I can definitely recommend viewing.
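To give a flavor of how such associations can be probed, here is a toy sketch in the spirit of the association tests from that line of research; the vectors below are invented, but embeddings trained on real web text show similar (if subtler) patterns.

    import numpy as np

    # Invented toy vectors; real pretrained embeddings would be loaded from disk.
    vectors = {
        "he":     np.array([0.9, 0.1]),
        "she":    np.array([0.1, 0.9]),
        "doctor": np.array([0.8, 0.3]),
        "nurse":  np.array([0.2, 0.85]),
    }

    def cosine(a, b):
        return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Compare how strongly each occupation word associates with gendered pronouns.
    for word in ("doctor", "nurse"):
        print(word,
              "he:",  round(cosine(vectors[word], vectors["he"]), 2),
              "she:", round(cosine(vectors[word], vectors["she"]), 2))

If the training text pairs occupations with particular genders, the vectors will encode that pairing, and anything built on top of them, including translation, can reproduce it.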
5. Let's practice!
As we have seen in just a few examples, the field of natural language processing still has plenty of unsolved problems. In fact, our fake news detector is likely one of them! Let's investigate what it has learned to determine whether the model will be widely applicable or whether the problem is more complex than simple word counts.
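One simple way to start that investigation, sketched below with a hypothetical miniature dataset (your real article texts and labels would go in its place), is to look at which tokens a bag-of-words classifier weights most heavily toward the FAKE class.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Hypothetical stand-in data; substitute your actual articles and labels.
    training_texts = ["aliens control the senate", "senate passes budget bill",
                      "moon landing was staged", "new budget bill announced"]
    training_labels = ["FAKE", "REAL", "FAKE", "REAL"]

    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(training_texts)
    clf = MultinomialNB().fit(X, training_labels)

    # Tokens whose log probability is highest under FAKE relative to REAL.
    vocab = vectorizer.get_feature_names_out()
    fake_idx = list(clf.classes_).index("FAKE")
    diff = clf.feature_log_prob_[fake_idx] - clf.feature_log_prob_[1 - fake_idx]
    print(vocab[np.argsort(diff)[-5:]])

If the most FAKE-leaning tokens turn out to be topic words from a particular news cycle rather than anything intrinsic to deceptive writing, that is a hint the model may not generalize beyond its training data.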