word2vec
You have been web-scraping job titles from the internet and are unsure whether you need to scrape additional job titles for your analysis. So far, you have collected over 13,000 job titles in a dataset called job_titles. You have read that word2vec generally performs best when the model has enough data to train properly; if words are not mentioned often enough in your data, the model may not be useful.
In this exercise, you will test how helpful additional data is by running your model three times; each run will use more data.
This exercise is part of the course
Introduction to Natural Language Processing in R
Interactive exercise
Try this exercise by completing the sample code.
library(h2o)
h2o.init()
set.seed(1111)
# Use 33% of the available data
sample_size <- floor(___ * nrow(job_titles))
# Draw a random sample of row indices of that size
sample_data <- sample(nrow(job_titles), size = sample_size)
h2o_object <- as.h2o(job_titles[sample_data, ])
# Tokenize on non-word characters, lowercase, and remove stop words
words <- h2o.tokenize(h2o_object$jobtitle, "\\\\W+")
words <- h2o.tolower(words)
words <- words[is.na(words) | (!words %in% stop_words$word), ]
# Train word2vec, keeping only words that appear at least 5 times
word2vec_model <- h2o.word2vec(words, min_word_freq = 5, epochs = 10)
# Find synonyms for the word "teacher"
___.___(word2vec_model, "teacher", count = 10)
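Since the exercise asks for three runs with increasing amounts of data, the whole pipeline can be wrapped in a function and looped over sample fractions. This is only a sketch of one way to structure those runs, assuming the same `job_titles` data frame (with a `jobtitle` column) and `stop_words` table used above; the fractions 0.33, 0.66, and 1.0 are illustrative choices, not values given by the exercise.

```r
library(h2o)
h2o.init()

# Hypothetical helper: train a word2vec model on a fraction of job_titles
# (assumes job_titles$jobtitle and stop_words$word exist, as in the exercise)
run_word2vec <- function(fraction) {
  set.seed(1111)
  sample_size <- floor(fraction * nrow(job_titles))
  sample_data <- sample(nrow(job_titles), size = sample_size)
  h2o_object  <- as.h2o(job_titles[sample_data, ])
  words <- h2o.tokenize(h2o_object$jobtitle, "\\\\W+")
  words <- h2o.tolower(words)
  words <- words[is.na(words) | (!words %in% stop_words$word), ]
  h2o.word2vec(words, min_word_freq = 5, epochs = 10)
}

# Compare synonym quality for "teacher" as the training data grows
for (fraction in c(0.33, 0.66, 1.0)) {
  model <- run_word2vec(fraction)
  print(h2o.findSynonyms(model, "teacher", count = 10))
}
```

If the synonyms returned for "teacher" stabilize between the 66% and 100% runs, that suggests the dataset is already large enough; if they keep shifting, more scraping may help.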