Time to Run

In the last lesson, we went through creating an entire web-crawler to access course information from each course in the DataCamp course directory. However, the lesson seemed to stop without a climax, because we didn't play with the code after finishing the parsing methods.

The point of this exercise is to remedy that!

The code we give you to look at in this and the next exercise is long, because its the entire spider that took us the lesson to create! However, don't be intimidated! The point of these two exercises is to give you a very easy task to complete, with the hope that you will look at and run the code for this spider. That way, even though it is long, you will have a grasp of it!

This exercise is part of the course

Web Scraping in Python

View Course

Exercise instructions

Fill in the one blank at the end of the parse_pages methods to assign the chapter titles to the dictionary whose key is the corresponding course title.

NOTE: If you hit Run Code, you must Reset to Sample Code to successfully use Run Code again!!

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Import scrapy
import scrapy

# Import the CrawlerProcess: for running the spider
from scrapy.crawler import CrawlerProcess

# Create the Spider class
class DC_Chapter_Spider(scrapy.Spider):
  name = "dc_chapter_spider"
  # start_requests method
  def start_requests(self):
    yield scrapy.Request(url = url_short,
                         callback = self.parse_front)
  # First parsing method
  def parse_front(self, response):
    course_blocks = response.css('div.course-block')
    course_links = course_blocks.xpath('./a/@href')
    links_to_follow = course_links.extract()
    for url in links_to_follow:
      yield response.follow(url = url,
                            callback = self.parse_pages)
  # Second parsing method
  def parse_pages(self, response):
    crs_title = response.xpath('//h1[contains(@class,"title")]/text()')
    crs_title_ext = crs_title.extract_first().strip()
    ch_titles = response.css('h4.chapter__title::text')
    ch_titles_ext = [t.strip() for t in ch_titles.extract()]
    dc_dict[ crs_title_ext ] = ____

# Initialize the dictionary **outside** of the Spider class
dc_dict = dict()

# Run the Spider
process = CrawlerProcess()
process.crawl(DC_Chapter_Spider)
process.start()

# Print a preview of courses
previewCourses(dc_dict)

Edit and Run Code