word2vec
You have been web-scraping a lot of job titles from the internet and are unsure whether you need to scrape additional job titles for your analysis. So far, you have collected over 13,000 job titles in a dataset called job_titles. You have read that word2vec generally performs best when the model has enough data to train properly; if words are not mentioned often enough in your data, the model might not be useful.
In this exercise you will test how helpful additional data is by running your model three times; each run will use additional data.
This exercise is part of the course
Introduction to Natural Language Processing in R
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
library(h2o)
h2o.init()
set.seed(1111)
# Use 33% of the available data
sample_size <- floor(___ * nrow(job_titles))
sample_data <- sample(nrow(job_titles), size = sample_size)
h2o_object <- as.h2o(job_titles[sample_data, ])
# Tokenize on non-word characters (NA rows mark the boundary between titles)
words <- h2o.tokenize(h2o_object$jobtitle, "\\\\W+")
words <- h2o.tolower(words)
# Keep the NA boundary rows but drop stop words
words <- words[is.na(words) || (!words %in% stop_words$word), ]
# Train word2vec; words seen fewer than 5 times are ignored
word2vec_model <- h2o.word2vec(words, min_word_freq = 5, epochs = 10)
# Find synonyms for the word "teacher"
___.___(word2vec_model, "teacher", count=10)
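As a sketch of how the three runs the exercise describes might be organized, the pipeline above can be wrapped in a helper that trains on a growing fraction of the data and prints the synonyms each time. The helper name `run_word2vec` and the fractions beyond 33% are illustrative assumptions, not part of the exercise:

```r
library(h2o)
h2o.init()
set.seed(1111)

# Illustrative sketch: train one model per sample fraction
run_word2vec <- function(fraction) {
  sample_size <- floor(fraction * nrow(job_titles))
  sample_data <- sample(nrow(job_titles), size = sample_size)
  h2o_object  <- as.h2o(job_titles[sample_data, ])
  words <- h2o.tokenize(h2o_object$jobtitle, "\\\\W+")
  words <- h2o.tolower(words)
  words <- words[is.na(words) || (!words %in% stop_words$word), ]
  h2o.word2vec(words, min_word_freq = 5, epochs = 10)
}

# Compare synonym quality as the training data grows
for (fraction in c(0.33, 0.66, 1.00)) {
  model <- run_word2vec(fraction)
  print(h2o.findSynonyms(model, "teacher", count = 10))
}
```

If the synonyms returned for "teacher" become more plausible as the fraction grows, the additional data is helping the model; if they stabilize early, you likely have enough job titles already.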