1. Scraping For Reals
Let's consider everything you've learned so far in these lessons. At this point you have a grasp of the structure of HTML; you can translate that knowledge into XPath and CSS Locator strings that navigate programmatically to specific elements of interest; from there, you can extract attributes such as hyperlinks; you can extract text; and you can do all this within the scrapy Selector and Response framework. Congratulations! You can already start scraping. To prove this to you, in this lesson we will scrape an actual site using only the techniques that we've built up throughout these lessons. What website shall we scrape? DataCamp, of course!
2. DataCamp Site
We're going to scrape the course directory on DataCamp and create a list of links to the course pages. That is, we will end up with a list of strings, where each string is the link to one of the course pages.
3. What's the Div, Yo?
We have taken the HTML for the DataCamp course directory and loaded it into a scrapy Response variable.
After manually inspecting the HTML code, I noticed that each of the courses displayed on the DataCamp site belongs to a div element with the class "course-block". So, let's go ahead and select those elements using a CSS Locator. We'll store this output in the variable course_divs.
By the way, at the time we scraped this site, DataCamp had 185 courses listed in this directory, and it seems we've got them all in these selected div elements.
4. Inspecting course-block
Examining the first of the div elements in the course-block class, we notice that there are three children.
5. The first child
The first child is a hyperlink element to the course website. It also contains several other elements as children which comprise the upper portion of the course block, highlighted here.
Note that we took the first element of the SelectorList of children and stored it in the variable first_child, so first_child is a Selector object. To get at the data in that element, we can simply call its extract method.
6. The second child
The second child is another div element. It, too, contains several other elements, which together make up the footer of the course block: the section highlighted here.
7. The forgotten child
The third child is a span element that acts as a container for some specific information but isn't actually visible in the course block itself.
8. Listful
After this inspection, we are now in a position to easily create the list of course links, our original goal. Here I present two options, though others are certainly possible.
The first and possibly simplest is to use just a single CSS Locator to direct to the course-block div elements, then direct to the href attributes of the hyperlink child (the hyperlink child we found when exploring the children of the course block). From there, a quick call to extract.
The second is to do this stepwise, mixing CSS Locator and XPath methods. First we collect the course divs with a CSS Locator; then direct to the href attributes of the hyperlink child using XPath; and finally extract these.
9. Get Schooled
And guess what? At this point, we're done!
Let's look at the list we created. We have collected all links to the courses!
10. Links Achieved
We've made it through this example and actually scraped the DataCamp course directory. You saw some of the exploratory methods and implementations that I would code myself for the task. In the next chapter, we will move on to building spiders!