Web Scraping Overview
1. Web Scraping With Python
Hi everyone, my name is Thomas Laetsch. I'm currently a data scientist working in the Center for Data Science at New York University. In this course, Web Scraping with Python, you will learn some of the fundamental techniques in computational web scraping. That is, you will learn to create software to automate data extraction from online sources. Before moving to specifics and technicalities, let me convince you that these techniques can be a valuable addition to your data-science know-how, and that this course will be the perfect place to start or strengthen the foundational pieces of this skill set.
2. Business Savvy
You might ask yourself why businesses would employ people with web-scraping experience. What can businesses gain from web scraping? Well, they can scrape competitor sites to gather prices for similar products or services, then compare and adjust their own price points. They can scrape online reviews of their products or services, and gauge public opinion of the company in general. They can scrape social media sites or other public forums for contact details or other information about clients or potential clients, to meaningfully direct resources towards this group of possible customers. And this is just a short list!
3. It's Personal
We list here a few fun things you can do by scraping the web. You can search for your favorite memes on your favorite sites. You can scour classified ads, looking for your favorite things. You can look for trending topics on social media sites. You can look for recipes you might be interested in on cooking blogs. In fact, there's a whole lot you can do!
4. About My Work
Now, let me give you an example that I've worked on. Here at the Center for Data Science, while working with an amazing sociologist, I have been heavily involved in collecting the data for the website AmericanViolence.org. As was famously implied by the former FBI director James Comey, crime data has not been easy to collect and analyze across city agencies in the United States. It turns out, though, that many such agencies publish this data online. The work I've done with my collaborators is to collect, process, and format these data into a single repository, starting with murder data for some of the largest cities in the US. So, realize this: many of the techniques you will learn in this course are the same ones that I used to collect data for AmericanViolence.org, which is now helping track trends in murders across the United States.
5. Pipe Dream
To better visualize the focus of what you will learn in these lectures and exercises, let's roughly break down the web-scraping pipeline into three pieces.
6. Pipe Dream: Setup
The first piece is the setup, that is, defining the goal or task and identifying the online sources which you believe will help you achieve the desired end result.
7. Pipe Dream: Acquisition
The second is the acquisition of these online data. This includes accessing the data, parsing this information, and extracting it into meaningful and useful data structures.
8. Pipe Dream: Processing
The third is the processing phase, where you run these downloaded data through whatever analyses or processes are needed to achieve the desired goal.
9. How do you do?
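The three pieces described above (setup, acquisition, processing) can be roughly sketched in a few lines of Python. This is only an illustration under stated assumptions, not the course's method: it uses the standard library's html.parser rather than a scraping framework, the HTML is a made-up inline sample so nothing is downloaded, and the names GOAL, SAMPLE_HTML, ProductParser, acquire, and process are all hypothetical.

```python
from html.parser import HTMLParser
from statistics import mean

# --- Setup: define the goal and identify the source (both hypothetical). ---
# The "source" is an inline HTML sample so the sketch runs without a network.
GOAL = "average price of the listed products"
SAMPLE_HTML = """
<ul>
  <li class="product" data-price="19.99">Widget</li>
  <li class="product" data-price="5.01">Gadget</li>
  <li class="other">Not a product</li>
</ul>
"""


# --- Acquisition: parse the markup and extract records into a data structure. ---
class ProductParser(HTMLParser):
    """Collects each <li class="product"> as a dict with a name and a price."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._current = None  # the product being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "product":
            self._current = {"name": "", "price": float(attrs["data-price"])}

    def handle_data(self, data):
        # Text only counts while we are inside a product element.
        if self._current is not None:
            self._current["name"] += data.strip()

    def handle_endtag(self, tag):
        if tag == "li" and self._current is not None:
            self.products.append(self._current)
            self._current = None


def acquire(html):
    """Return a list of product dicts extracted from the HTML string."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.products


# --- Processing: run the extracted data through the analysis the goal needs. ---
def process(products):
    """Return the mean price across the extracted products."""
    return mean(p["price"] for p in products)


products = acquire(SAMPLE_HTML)
average_price = process(products)
print(GOAL, "->", round(average_price, 2))
```

In a real project the acquisition step would also include fetching the page, and a framework would typically replace the hand-written parser class.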
This course focuses on the acquisition phase. To accomplish this, we will be using Python and the web-crawling framework Scrapy. We chose Scrapy since we can jump in quickly and easily scale to large scraping projects. However, even if you aren't sold on using Scrapy or Python, you will still build techniques and intuition that will be valuable in any computational web-scraping environment you enjoy!
10. Are you in?
So, I hope you're as thrilled as I am to take part in this course, and gain the skills to start scraping the web for whatever reasons excite you!