word2vec

You have scraped a lot of job titles from the web and are unsure whether you need to scrape more for your analysis. So far you have collected over 13,000 job titles in a dataset called job_titles. You have read that word2vec generally performs best when the model has enough data to train properly, and that if words do not occur often enough in your data, the model may not be useful.

In this exercise you will test how helpful additional data is by running your model 3 times; each run will use more data.

This exercise is part of the course

Introduction to Natural Language Processing in R

Hands-on interactive exercise

Try this exercise by completing this sample code.

library(h2o)
library(tidytext)  # assumed source of the stop_words data frame used below
h2o.init()

set.seed(1111)
# Use 33% of the available data
sample_size <- floor(___ * nrow(job_titles))
sample_data <- sample(nrow(job_titles), size = sample_size)

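# Convert the sampled rows to an H2OFrame
h2o_object <- as.h2o(job_titles[sample_data, ])
# Split each title into word tokens, then lowercase them
words <- h2o.tokenize(h2o_object$jobtitle, "\\\\W+")
words <- h2o.tolower(words)
# Keep NA rows (they separate one title from the next) and drop stop words
words <- words[is.na(words) || (!words %in% stop_words$word),]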

word2vec_model <- h2o.word2vec(words, min_word_freq=5, epochs = 10)
# Find synonyms for the word "teacher"
___.___(word2vec_model, "teacher", count=10)
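
The exercise text describes three runs, each on a larger share of the data. The sketch below illustrates only the sampling step for such a sequence, in base R and without h2o. The helper name `sample_rows`, the stand-in `job_titles` data frame, and the fractions 0.66 and 1.00 are assumptions made for illustration; only the 33% figure comes from the scaffold above.

```r
# Stand-in for the scraped data (assumption): the real job_titles
# dataset has more than 13,000 rows.
job_titles <- data.frame(
  jobtitle = paste("title", seq_len(1000)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: draw a reproducible random subset of row indices.
sample_rows <- function(n_rows, fraction, seed = 1111) {
  set.seed(seed)
  sample_size <- floor(fraction * n_rows)
  sample(n_rows, size = sample_size)
}

# 0.33 matches the scaffold's comment; 0.66 and 1.00 are assumed
# values for the second and third runs.
fractions <- c(0.33, 0.66, 1.00)
subsets <- lapply(fractions, function(f) sample_rows(nrow(job_titles), f))

# Number of rows each run would train on.
sizes <- vapply(subsets, length, integer(1))
print(sizes)
```

Each element of `subsets` can take the place of `sample_data` in the scaffold, so the same tokenizing and training steps can be repeated per run and the resulting synonym lists compared.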


Chapter 1 of Introduction to Natural Language Processing prepares you for running your first analysis on text. You will explore regular expressions and tokenization, two of the most common components of most analysis tasks. With regular expressions, you can search for any pattern you can think of, and with tokenization, you can prepare and clean text for more sophisticated analysis. This chapter is necessary for tackling the techniques we will learn in the remaining chapters of this course.

Exercise 1: Regular expression basics
Exercise 2: Practicing syntax with grep
Exercise 3: Exploring regular expression functions
Exercise 4: Tokenization
Exercise 5: tidytext functions
Exercise 6: Tokenization: sentences
Exercise 7: Text cleaning basics
Exercise 8: Text preprocessing: remove stop words
Exercise 9: Text preprocessing: Stemming

In chapter 2, you will learn the most common and most studied ways to analyze text. You will look at creating a text corpus, expanding a bag-of-words representation into a TFIDF matrix, and using cosine-similarity metrics to determine how similar two pieces of text are to each other. You build on your foundations for practicing NLP before you dive into applications of NLP in chapters 3 and 4.

Exercise 1: Understanding an R corpus
Exercise 2: Explore an R corpus
Exercise 3: Creating a tibble from a corpus
Exercise 4: Creating a corpus
Exercise 5: The bag-of-words representation
Exercise 6: Practice BoW
Exercise 7: BoW Example
Exercise 8: Sparse matrices
Exercise 9: The TFIDF
Exercise 10: Manual calculations
Exercise 11: TFIDF Practice
Exercise 12: Cosine Similarity
Exercise 13: An example of failing at text analysis
Exercise 14: Cosine similarity example

Chapter 3 focuses on two common text analysis approaches: classification modeling and topic modeling. If you are working on text analysis projects, you will inevitably use one or both of these methods. This chapter teaches you how to perform both techniques and provides insight into how to approach them from a practical point of view.

Exercise 1: Preparing text for modeling
Exercise 2: Data preparation
Exercise 3: Removing sparse terms
Exercise 4: Classification modeling
Exercise 5: Classification modeling example
Exercise 6: Confusion matrices
Exercise 7: TFIDF tibble vs dtm
Exercise 8: Introduction to topic modeling
Exercise 9: LDA practice
Exercise 10: Assigning topics to documents
Exercise 11: LDA in practice
Exercise 12: Testing perplexity
Exercise 13: Reviewing LDA results

In chapter 4 we cover two staples of natural language processing: sentiment analysis and word embeddings. These two analysis techniques are a must for anyone learning the fundamentals of text analysis. Furthermore, you will briefly learn about BERT, part-of-speech tagging, and named entity recognition. This course covers almost 15 different analysis techniques, so chapter 4 ends by recapping them.

Exercise 1: Sentiment analysis
Exercise 2: tidytext lexicons
Exercise 3: Sentiment scores
Exercise 4: Sentiment and emotion
Exercise 5: Word embeddings
Exercise 6: h2o practice
Exercise 7: word2vec (current exercise)
Exercise 8: Additional NLP analyses
Exercise 9: Reviewing methods #1
Exercise 10: Reviewing methods #2
Exercise 11: Conclusion