
Respond Please!

1. Getting Ready to Crawl

We've spent some time together learning XPath and CSS Locator syntax, and how to use them within scrapy Selector objects. In this lesson, we will introduce Response objects in scrapy, which behave like Selectors but give us extra tools to mobilize our scraping efforts across multiple websites. You see, we are moving towards creating spiders: programs that crawl the web and scrape data in a way we specify. Although we will need to wait until the next chapter to build a spider, moving from Selector objects to Response objects gives us the extra pieces we need to crawl between sites instead of simply parsing one site.

2. Let's Respond

In the next chapter, we will learn exactly how to scrape a site and load its HTML code into a scrapy variable without having to do all the work we have previously done: loading the HTML code into a string and then passing that string to a Selector object. For now, we will focus on Response objects, which already have the HTML pre-loaded. You ask: "Tom, why are you making us learn a new Response object when you just taught us about Selectors?!" My "Response" to you is that you can use everything we've learned about Selectors with Responses; the Selector object was our introduction to a Response object! What makes us want to use a Response object rather than a Selector is that, on top of all the Selector functionality, the Response object keeps track of the URL where the HTML code came from. This gives us a tool not only to scrape one single site, but to crawl between links on sites and scrape multiple sites automatically!
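To make the contrast concrete, here is a minimal sketch of the two setups. The HTML string and the variable names are made up for illustration; a Response variable (here called response, as in the exercises) would already have the HTML loaded for us.

```python
from scrapy import Selector

# The workflow we've used so far: load the HTML into a string ourselves
# (here a made-up snippet), then hand that string to a Selector object.
html = '<html><body><div><span class="bio">Crawl on!</span></div></body></html>'
sel = Selector(text=html)

# With a Response object, that loading work is already done: a pre-loaded
# variable like `response` can be queried directly, no string required.
```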

3. What We Know!

As an illustration of what we already know how to do with Response variables, thanks to our Selector expertise, suppose we have pre-loaded a Response variable with the HTML from some website, and we are interested in the span elements which are children of some div element and whose class attribute is "bio". We can still use the xpath method as we have before. We can still use the css method as we've learned about in this chapter. We can chain these methods together. And we can extract the selected data using the extract or extract_first methods we already know about.
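Putting those four points into code, here is a minimal sketch, assuming a Response variable named response has been pre-loaded with the site's HTML:

```python
# Assume `response` is a scrapy Response pre-loaded with the site's HTML.

# Select the span children of a div whose class attribute is "bio",
# first with the xpath method, then with the css method:
spans_xpath = response.xpath('//div/span[@class="bio"]')
spans_css = response.css('div > span.bio')

# The two methods can be chained together, just as with Selectors:
spans_chained = response.xpath('//div').css('span.bio')

# And the selected data can be extracted as before:
all_bios = spans_css.extract()          # list of all matches
first_bio = spans_css.extract_first()   # only the first match
```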

4. What We Don't Know

What we gain by using a Response object is the functionality to keep track of the URL the HTML was scraped from, which it stores as a string in its url attribute, and the ability to follow a new link using the follow method, which allows us to crawl through multiple pages for scraping. We will learn more about the follow method in the next chapter; for now, just realize that this ability to "follow" links automatically makes a Response the correct choice when we want to crawl between websites for scraping.
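As a sneak preview, here is a minimal sketch of these two new pieces in action. The spider class itself is next chapter's material, so treat everything except response.url and response.follow as hypothetical scaffolding; the spider name and start URL are made up for illustration.

```python
import scrapy

class PreviewSpider(scrapy.Spider):
    # A hypothetical spider, just to show where url and follow fit in.
    name = "preview"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # The url attribute stores the URL this HTML came from, as a string:
        print(response.url)
        # The follow method crawls to a new link and sends its Response to
        # the given callback; here we pick the first link found on the page.
        next_url = response.css('a::attr(href)').extract_first()
        if next_url is not None:
            yield response.follow(url=next_url, callback=self.parse)
```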

5. In Response

In this lesson we introduced the scrapy Response object, showing how it can be used like a Selector while adding crawling capability. Although we still have some gaps to fill in with regard to creating a Response and utilizing the follow method for crawling, those gaps will be the focus of our next chapter, and the culmination of our work so far: you will be able to create a web spider to crawl and scrape multiple sites automatically.