Your First Spider
1. A Classy Spider
We've arrived at the final chapter of this course. It's now time to put everything we've learned to use and create spiders. Unlike the single-page scraping we've done so far, a spider will crawl the web through multiple pages, following links if we choose, and scrape each of those pages automatically according to the procedures we've programmed.
2. Your Spider
Before narrowing in on our task for this lesson, let's jump into the deep end and look at the general form of the code we will write to create our spiders. While it doesn't look terribly complicated, we'll still break it into three parts to digest.
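Here is a minimal sketch of that general form. The class name, spider name, and method bodies are illustrative placeholders, not the exact code from the course slides:

# Part 1: the necessary imports
import scrapy
from scrapy.crawler import CrawlerProcess

# Part 2: the spider itself -- a class that takes scrapy.Spider as input
class YourSpider(scrapy.Spider):
    name = "your_spider"

    def start_requests(self):
        # define which site(s) to scrape and where to send them for parsing
        pass

    def parse(self, response):
        # parsing logic for the scraped pages goes here
        pass

# Part 3: run the spider
process = CrawlerProcess()
process.crawl(YourSpider)
process.start()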
3. Your Spider
The first part is simply the necessary importing of scrapy and the CrawlerProcess object. The second part, which happens to be the most important for us, is the code for the actual spider. This code tells scrapy which websites to scrape and how to scrape them, using all the techniques we've built up in this course. The code for the spider comes in the form of a class: a Python object that houses together methods and variables that relate to each other. We can name this class whatever we like, but the class must take scrapy.Spider as an input, which is why we needed to import scrapy above. The third part runs the spider using the CrawlerProcess object. For our purposes in this course, we only need to match the name we pass to the crawl method (in this case, process.crawl) to the actual name of our spider class. From here on, we will focus on the code for the spider.
4. Weaving the Web
We now narrow in on creating the actual spider. This is the class that tells scrapy which sites we want to scrape and how we want to scrape them. Again, the code may look a little complicated, but we will walk through it and see that the techniques we've already built cover most of the work. Here we have code to scrape the DataCamp course directory, and it includes all the basic pieces we need. We can name the class anything we want; here we called it DCspider and defined the name variable within that class with a similar name (although we can assign any string we want to the name variable). This name variable is important for some of the action that happens under the hood of scrapy. We must have a start_requests method to define which site or sites we want to scrape, and to tell scrapy where to send the information from those sites to be parsed. Finally, we need at least one method to parse the websites we scrape; we can call the parsing method (or methods) anything we want, as long as we correctly point to it within the start_requests method. All we are doing in this parser (which we named parse) is taking the HTML code and writing it to a file.
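A sketch consistent with that description follows. The course directory URL and the output file name are assumptions made for illustration, since the exact values come from the course slides:

import scrapy
from scrapy.crawler import CrawlerProcess

class DCspider(scrapy.Spider):
    name = "dc_spider"

    def start_requests(self):
        # the DataCamp course directory (URL assumed for illustration)
        url = "https://www.datacamp.com/courses/all"
        # send the downloaded page to the parse method below
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # write the raw HTML of the page to a local file
        with open("DC_courses.html", "wb") as fout:
            fout.write(response.body)

process = CrawlerProcess()
process.crawl(DCspider)
process.start()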
5. We'll Weave the Web Together
At this point, maybe you're still a little nervous about how much has suddenly been thrown at you in this lesson. Don't worry: in the exercises you'll have a chance to examine the code more closely, and in the next lessons we will dig deeper into what's happening, so that by the end of this chapter you'll feel comfortable making and running your own scrapy spiders!