1. A Request for Service
In the previous lesson, we quickly went over creating a simple scrapy spider, but didn't get into too much detail about what was going on. In this lesson, we will look deeper into the start_requests method and its relation to the parsing method or methods within our spider.
2. Spider Recall
Remember that the code to set up and run a spider looks roughly like the code we see here, where most of the work goes into coding the class we define for our spider.
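As a rough, minimal sketch of that set-up-and-run pattern (the class name and spider name here are just placeholders, and we assume the spider is run with scrapy's CrawlerProcess):

```python
import scrapy
from scrapy.crawler import CrawlerProcess

class ExampleSpider(scrapy.Spider):
    name = "example_spider"
    # the start_requests method and the parsing method(s) go here

# create a crawler process and run the spider class through it
process = CrawlerProcess()
process.crawl(ExampleSpider)
process.start()
```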
3. Spider Recall
We've already seen an example of a spider which effectively saves the HTML code from the DataCamp course directory into a file. So, let's now zero in on the start_requests method.
4. The Skinny on start_requests
First, within the spider class, we must define a method called start_requests, which takes self as its input. We don't have any flexibility in this method name because scrapy looks for a method named start_requests within the class we define for our spider.
We then have a list of the url or urls we would like to start scraping (in this case, only one). We didn't have to name this list urls, but it seems like a convenient choice for a list of urls.
Finally, we will take each url within the urls list and send it off to be dealt with.
Now, this "send-off" is the most complicated part!
But even before worrying about the "send-off", notice that we really did not need to loop over the urls list when it holds only one url. We could easily have defined the method without the for loop, but we wrote it with the loop to give you an idea of how you might construct it if you had several urls to initiate the spider with.
Back to the "send-off": the yield keyword, which you might already be familiar with, acts somewhat like a return statement in that it hands back values when start_requests is run; yield is part of Python itself and is not specific to scrapy. We won't go into more detail about yield versus return here, but just note that we are using yield in this method.
The object we are yielding is a scrapy.Request object. We haven't seen these before, but again, that's OK. Yielding the scrapy.Request object sends a response variable (the same response variable we are familiar with), pre-loaded with the HTML code from the url argument of the scrapy.Request call, to the parsing method named in the callback argument.
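Putting those pieces together, the start_requests method we have been describing might look roughly like this sketch (a fragment of the spider class; the url shown is just a placeholder for the DataCamp course directory):

```python
def start_requests(self):
    urls = ['https://www.datacamp.com/courses/all']
    # with a single url we could skip the loop and yield the request directly
    for url in urls:
        # the "send-off": scrapy downloads the page at url and passes a
        # pre-loaded response variable to the method named in callback
        yield scrapy.Request(url=url, callback=self.parse)
```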
5. Zoom Out
So, if we look at the entire spider class again, what we see is that the start_requests method will pre-load a response variable with the HTML code from the DataCamp course directory and send it to the method we have named parse.
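As a reminder, the whole class might look something like this sketch, where parse simply writes the HTML out to a file as in the previous lesson (the spider name, url, and file name are illustrative placeholders):

```python
import scrapy

class DCspider(scrapy.Spider):
    name = "dc_spider"

    def start_requests(self):
        urls = ['https://www.datacamp.com/courses/all']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # response arrives pre-loaded with the HTML from the requested url
        with open('DC_courses.html', 'wb') as f:
            f.write(response.body)
```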
As a glimpse into the future, notice that the parse method has response as its second input variable; this is the variable passed along by the scrapy.Request call.
Also, notice that while there are many wheels turning in the start_requests method, most of them are turning under the hood; the only real adjustments we need to make are defining which url or urls we are going to scrape, and which callback method we want to use to parse those scraped sites.
6. End Request
So, you've taken a big bite out of scrapy spiders already. Again, there are many things that scrapy has set up under the hood, and this can seem intimidating, but we're not learning how to build the engine in this course, we're learning how to drive the car.