1. Capstone
We're finally at the point where we can set up an entire Scrapy spider from start to finish, and that's exactly what we will do in this lesson.
In fact, what we will do is create a spider that first collects all the course links from the DataCamp course directory; then, it will follow those links to extract the course title and the titles of all the course chapters; finally, it will store this information in a dictionary for us to use as we want later.
2. Inspecting Elements
The structure of our spider will be as you see here. You will notice all the usual suspects set up for us, including the name variable within the spider class, and a start_requests method which directs us to the DataCamp course directory site. You will also notice at the bottom, we have an empty dictionary, called dc_dict, which is what we want to fill in with the course titles and course chapter titles during the scrape.
Our first objective of scraping the course directory (to extract the course page links) will be coded in the parsing method we call parse_front within our spider class. From there, the spider will crawl to each of those course pages and fill in the dc_dict using the course titles as the keys, and a list of the course chapter titles as the items; this second order scraping method we will call parse_pages.
3. Parsing the Front Page
It remains for us to fill in the parsing code.
Starting with parse_front, we will first direct to all the course block div elements (as we have done before). These course blocks divvy up the course information for each course in the directory.
From each course block, we then direct to the course page link, again as we have done before.
Using the extract method, we create a list of the links (as strings) that we want to follow.
And finally, we will iterate through the links and yield a call to response.follow, directing the spider to crawl to each of these course pages. Note that the follow callback is set to parse_pages, which is the name of the parsing method we want the spider to use at the next step.
4. Parsing the Course Pages
Now, to fill in the parse_pages method, we remember that we want to extract the course title and the titles of the course chapters.
After inspecting the HTML on one of the course pages, we discover that the course title is defined by the text within an h1 element whose class contains the word title. So, we start by directing to this text.
To extract the course title, we will call extract_first rather than extract, since we want to be left with the title as a string, rather than a list containing the title, which is what extract would give us.
Although it isn't strictly necessary, we can also clean the text a little at this point, removing the stray newlines and spaces that often crop up in scraped HTML. Fortunately, strings in Python already include a strip method to do this cleaning for us!
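As a quick illustration of that cleaning step (the sample title here is made up):

```python
# Scraped text often arrives with stray newlines and spaces around it;
# str.strip() trims whitespace from both ends of the string.
raw_title = "\n  Introduction to Python \n "
clean_title = raw_title.strip()
```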
Next, we want to get to the chapter titles. On inspection, we discover that these are defined as the text within h4 elements whose class is chapter__title.
So, we direct the spider to these pieces of text, extract the text and clean it as before. This time, we use extract to get a list of the many chapter titles per course.
We finally end by filling in our dictionary, whose keys are the course titles and whose corresponding values are the lists of chapter titles.
And now that we have finished this parsing method, we have finished our spider.
5. It's time to Weave
Now it's your turn to start crawling!