
word2vec

You have been web-scraping a lot of job titles from the internet and are unsure whether you need to scrape additional job titles for your analysis. So far, you have collected over 13,000 job titles in a dataset called job_titles. You have read that word2vec generally performs best when the model has enough data to train properly; if words are not mentioned often enough in your data, the model might not be useful.

In this exercise, you will test how helpful additional data is by running your model three times, with each run using more data.
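The three runs described above can be sketched as a loop over increasing sample fractions. This is a minimal sketch that assumes the job_titles data frame, a jobtitle column, the tidytext stop_words table, and a running h2o cluster are all available; the specific fractions (33%, 66%, 100%) are illustrative:

```r
library(h2o)
h2o.init()

set.seed(1111)
# Illustrative fractions for the three runs
for (fraction in c(0.33, 0.66, 1.0)) {
  sample_size <- floor(fraction * nrow(job_titles))
  sample_data <- sample(nrow(job_titles), size = sample_size)

  # Convert the sampled rows to an H2OFrame and tokenize the job titles
  h2o_object <- as.h2o(job_titles[sample_data, ])
  words <- h2o.tokenize(h2o_object$jobtitle, "\\\\W+")
  words <- h2o.tolower(words)
  words <- words[is.na(words) || (!words %in% stop_words$word), ]

  # Train word2vec and inspect how the neighbours of "teacher"
  # change as the amount of training data grows
  word2vec_model <- h2o.word2vec(words, min_word_freq = 5, epochs = 10)
  print(h2o.findSynonyms(word2vec_model, "teacher", count = 10))
}
```

Because word2vec training is stochastic and the samples differ between runs, expect the synonym lists to stabilize, rather than match exactly, as the sample fraction grows.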

This exercise is part of the course

Introduction to Natural Language Processing in R


Interactive exercise

Try this exercise by completing the sample code below.

library(h2o)
h2o.init()

set.seed(1111)
# Use 33% of the available data
sample_size <- floor(0.33 * nrow(job_titles))
sample_data <- sample(nrow(job_titles), size = sample_size)

# Convert the sampled rows to an H2OFrame and tokenize the job titles
h2o_object <- as.h2o(job_titles[sample_data, ])
words <- h2o.tokenize(h2o_object$jobtitle, "\\\\W+")
words <- h2o.tolower(words)
# Drop stop words (stop_words comes from the tidytext package); NAs are
# kept because they mark the row boundaries that h2o.word2vec relies on
words <- words[is.na(words) || (!words %in% stop_words$word), ]

# Train word2vec, ignoring words that appear fewer than 5 times
word2vec_model <- h2o.word2vec(words, min_word_freq = 5, epochs = 10)
# Find synonyms for the word "teacher"
h2o.findSynonyms(word2vec_model, "teacher", count = 10)