
Parse and Crawl

1. Move Your Bloomin' Parse

At this point, you should be starting to feel comfortable with the class we define for a scrapy spider. You should have a feeling for how to set up and use the start_requests method in your spider, and understand that each scrapy.Request yielded from start_requests sends a scrapy response object to a parsing function for processing. In this lesson, we will focus on that parsing method.
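As a rough sketch of that structure (the class name, spider name, and URL here are placeholders I've chosen, not necessarily the course's exact code), a minimal spider might look like this:

import scrapy

class DCSpider(scrapy.Spider):
    name = "dc_spider"

    def start_requests(self):
        # Each yielded Request fetches the url and hands the resulting
        # response object to the method named in callback.
        url = "https://www.datacamp.com/courses/all"
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The response object lands here for processing.
        pass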

2. Once Again

Let's again quickly recall the simple spider class we've seen before. Our focus will be the parse method. As a reminder, we do not need to name the parsing method parse; we can call it whatever we want, as long as we correctly use the callback argument in the scrapy.Request call yielded from the start_requests method.
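To make the naming point concrete, here is the same skeleton with the parsing method renamed; parse_front is an arbitrary name chosen purely for illustration, and the spider works identically because callback points to it:

import scrapy

class DCSpider(scrapy.Spider):
    name = "dc_spider"

    def start_requests(self):
        # callback can reference any method name, not just "parse"
        yield scrapy.Request(url="https://www.datacamp.com/courses/all",
                             callback=self.parse_front)

    def parse_front(self, response):
        # Behaves exactly as a method named parse would
        pass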

3. You Already Know!

Most of the scraping magic actually happens within the parsing function. This is where we use the material we've been building up throughout this entire course! Over the last three chapters, you have built up the techniques to handle a scrapy response variable for data extraction. So, the only thing we need to focus on here is what to do with the extracted site data once we have our hands on it. In the previous example, we had the parser simply save the HTML code to a local file. In the next couple of examples, I'll show you that we can do much, much more.
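For reference, the save-to-file parser from that previous example looks roughly like this (the output filename is an assumption on my part):

import scrapy

class DCSpider(scrapy.Spider):
    name = "dc_spider"

    def start_requests(self):
        yield scrapy.Request(url="https://www.datacamp.com/courses/all",
                             callback=self.parse)

    def parse(self, response):
        # response.body holds the raw HTML as bytes, so write in binary mode
        with open("DC_courses.html", "wb") as fout:
            fout.write(response.body)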

4. DataCamp Course Links: Save to File

The first example we'll look at extracts the DataCamp course links from the course directory (as we did in Chapter 3), then saves those links to a file, one link per line.
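A sketch of that parser is below; the CSS selector is an assumption about the course directory's markup, and the output filename is a placeholder:

import scrapy

class DCLinkSpider(scrapy.Spider):
    name = "dc_links"

    def start_requests(self):
        yield scrapy.Request(url="https://www.datacamp.com/courses/all",
                             callback=self.parse)

    def parse(self, response):
        # Extract the href of each course link (selector is assumed markup)
        links = response.css('div.course-block > a::attr(href)').extract()
        # Save the extracted links to a file, one link per line
        with open("DC_links.csv", "w") as fout:
            fout.write("\n".join(links))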

5. DataCamp Course Links: Parse Again

Another, much more interesting and powerful example is a spider that crawls between different sites. We give you an example of that here. We start by extracting the course links from the DataCamp course directory (as we did in the last chapter). Then, instead of printing anything to a file, we have the spider follow those links and parse those sites in a second parser. At the end of the first parse method, we loop over the links we extracted from the course directory and send the spider to follow each of them, scraping those sites with the method parse2. Notice that when we send our spider from the first parser to the second, we again use a yield command. But instead of creating a scrapy.Request call (like we did in start_requests), we use the follow method of the response object itself. The follow method works similarly to the scrapy.Request call: we pass in the url we want to load into a response variable, and we use the callback argument to point the spider to the parsing method we want to use next.
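Here is a sketch of that two-parser spider; the method name parse2 follows the description above, while the CSS selector, class name, and URL are assumptions chosen for illustration:

import scrapy

class DCCrawlSpider(scrapy.Spider):
    name = "dc_crawl"

    def start_requests(self):
        yield scrapy.Request(url="https://www.datacamp.com/courses/all",
                             callback=self.parse)

    def parse(self, response):
        # Extract the course links (selector is an assumption about the markup)
        links = response.css('div.course-block > a::attr(href)').extract()
        for link in links:
            # response.follow builds the next request from this response
            # and sends its result to the second parser, parse2
            yield response.follow(url=link, callback=self.parse2)

    def parse2(self, response):
        # Second parser: scrape each followed course page here
        pass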

6. Follow the Web

To help visualize what's happening: our spider starts at the base site (or sites) indicated in the start_requests method. From there, we can follow links on that site to other sites and scrape them, or even follow links on those sites to new sites, creating a web for our spider to crawl. Hence the name web crawler, and why the programs that crawl the web are often called spiders.

7. Johnny Parsin'

Guess what? You're well on your way to being a full-fledged web scraper. You should now understand the ideas behind scraping a single site, creating a spider to scrape for you, and understand the concept of crawling to scrape multiple sites. Our next lesson will give you a full example of this awesomeness.
