Closing thoughts

1. Closing thoughts

Congratulations on completing this course on Machine Learning with Apache Spark. You have covered a lot of ground, reviewing some Machine Learning fundamentals and seeing how they can be applied to large datasets, using Spark for distributed computing.

2. Things you've learned

You learned how to load data into Spark and then perform a variety of operations on those data. Specifically, you learned basic column manipulation on DataFrames, how to deal with text data, bucketing continuous data and one-hot encoding categorical data. You then delved into two types of classifiers, Decision Trees and Logistic Regression, in the process building a robust spam classifier. You also learned about partitioning your data and how to use testing data and a selection of metrics to evaluate a model. Next you learned about regression, starting with a simple linear regression model and progressing to penalized regression, which allowed you to build a model using only the most relevant predictors. You learned about pipelines and how they can make your Spark code cleaner and easier to maintain. This led naturally into using cross-validation and grid search to derive more robust model metrics and use them to select good model parameters. Finally you encountered two forms of ensemble models.

3. Learning more

Of course, there are many topics that were not covered in this course. If you want to dig deeper then consult the excellent and extensive online documentation. Importantly you can find instructions for setting up and securing a Spark cluster.

4. Congratulations!

Now go and use what you've learned to solve challenging and interesting big data problems in the real world!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.