Multilingual NER with polyglot

1. Multilingual NER with polyglot

In this video, we'll review multilingual named entity recognition using a new library, Polyglot.

2. What is polyglot?

Polyglot is yet another natural language processing library which uses word vectors to perform simple tasks such as entity recognition. You might be wondering: why do I need to learn another library which uses word vectors? Don't I already have Gensim and spaCy? And you would be correct. The main benefit and difference of using Polyglot, however, is the wide variety of languages it supports. Polyglot has word embeddings for more than 130 languages! For this reason, you can even use it for tasks like transliteration, as shown here transliterating some English text into Arabic. Transliteration is the ability to convert text by swapping characters from one language's script to another's. Of course, any user of Google Translate or its competitors has seen issues in translation created by word vectors, but Polyglot is a pretty neat open-source tool to have for so many languages.
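To make the transliteration idea concrete, here is a minimal sketch of how it might look in code. It assumes Polyglot is installed and that the transliteration model for the target language has already been downloaded; the helper name transliterate_words is our own invention for illustration and is not part of Polyglot.

```python
def transliterate_words(document, target_language_code):
    """Return the transliterated words of `document` as a list of strings.

    Hypothetical helper for illustration; only Text and its transliterate
    method come from Polyglot.
    """
    # Imported inside the function so this file still loads when Polyglot
    # (or its transliteration models) is not available.
    from polyglot.text import Text

    text = Text(document)  # Polyglot detects the source language itself
    # transliterate() works word by word; tokens it cannot convert (such as
    # punctuation) may come back empty, so those are dropped here.
    return [word for word in text.transliterate(target_language_code) if word]
```

You would call it with a document string and a language code, for example "ar" for Arabic. The result is a word-by-word conversion between scripts, so don't expect it to read like a translation.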

3. Spanish NER with polyglot

Instead of transliteration, we are going to use Polyglot to perform named entity recognition for some new languages. Similar to spaCy, you need to have the proper vectors downloaded and installed before you begin. Once you do, Polyglot does not need to be told which language you are using. It uses its language detection model to work that out when the Text object is initialized by passing in the document string. Here is a recent headline from the newspapers in Madrid about the promotion of Madrid by another Spanish politician. If you know Spanish (or even if you don't and you take a look at the capitalized words), you can see quite a few titles, locations and people. When we access the entities attribute of the Text object, we can see a list of entity chunks found by Polyglot while parsing the text. Each chunk has a label, represented by the symbols starting with I-, such as I-ORG representing an organization, I-LOC representing a location and I-PER representing a person. You may notice some possible duplication in the first two entities found, separating Generalitat de and Catalunya. This makes some sense because the phrase represents both a location, Catalunya, and an organization, the Generalitat. That said, you may need to clean up returned entities when they don't match your expected labels or have substrings you would rather not track.
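The steps just described can be sketched roughly as follows. This is an illustration under a few assumptions, not the exact code from the slide: it assumes Polyglot is installed and that the Spanish embeddings and NER models have been downloaded. The function names extract_entities and clean_entities, the stop_tokens parameter, and the sample list at the bottom are all hypothetical and written by hand for this example.

```python
def extract_entities(document):
    """Return (label, phrase) pairs for each entity chunk Polyglot finds."""
    # Imported inside the function so this file still loads when Polyglot
    # or its language models are not available.
    from polyglot.text import Text

    text = Text(document)  # the language is detected from the string
    # Each chunk is a list of words and carries its label in `.tag`.
    return [(chunk.tag, " ".join(chunk)) for chunk in text.entities]


def clean_entities(entities, wanted_labels, stop_tokens=("de",)):
    """Keep only the wanted labels and trim dangling tokens such as 'de'.

    Hypothetical clean-up step; adjust the rules to your own data.
    """
    cleaned = []
    for label, phrase in entities:
        if label not in wanted_labels:
            continue
        tokens = phrase.split()
        # Strip unwanted tokens from the end, then from the start.
        while tokens and tokens[-1].lower() in stop_tokens:
            tokens.pop()
        while tokens and tokens[0].lower() in stop_tokens:
            tokens.pop(0)
        if tokens:
            cleaned.append((label, " ".join(tokens)))
    return cleaned


# Hand-written sample, loosely modeled on the first two chunks described
# above; it is not actual Polyglot output.
sample = [("I-ORG", "Generalitat de"), ("I-LOC", "Catalunya")]
print(clean_entities(sample, wanted_labels={"I-ORG", "I-LOC"}))
```

Keeping the clean-up in its own function means you can try it on a small hand-made list like the sample before running it on real entities from extract_entities.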

4. Let's practice!

Now it's your turn to use NER with Polyglot!