The Source of the Source

1. "Inspecting the HTML"

So far we've only played with toy examples of HTML: very simple code that illustrates the points we want to make and fits onto a slide. This doesn't reflect most (if any) of the HTML you will encounter out in the wild. We'll spend this lesson looking at how I go about exploring HTML and loading the HTML as text into a Selector object. For this lesson, I will describe the exploration part using my Firefox browser, but even if that isn't your browser of choice, most major web browsers have similar functionality. So you should be able to follow this lesson and, by analogy, do the same in Chrome, Safari, or another browser with similar capabilities.

2. "Source" = HTML Code

If I navigate to a website which has content I want to scrape, I can view its HTML source code in Firefox by right-clicking within the page (or Control+Click on a Mac) and selecting "View Page Source". Another way to get there on my version of Firefox is to go to the Tools menu, select Web Developer, and then select "Page Source". My version of Chrome has similar selections, and I bet your browser does too! This is a first step in "inspecting the HTML": looking at the actual HTML source code.

3. Inspecting Elements

Another useful tool which Firefox provides (as does Chrome, and I'm sure many other browsers) is the ability to "inspect an element". What this means is that you can select an element on the website and ask to be taken to the actual HTML code for that specific element. In Firefox, you hover your mouse over the element you're interested in, then right-click (or Control+Click on a Mac) and select "Inspect Element". This opens a second pane in the browser and highlights the HTML for that specific element. This is extremely useful when trying to figure out characteristics of an element, such as its tag name, class, or id, which you might want to use in your scraper.

4. HTML text to Selector

The final piece of this lesson is to give you a quick understanding of how we get the HTML text into a Selector object. The way we do this is with the Python requests library. We won't delve into requests beyond this short introduction. In fact, in the last chapter, you will learn how to do everything within scrapy itself, without using requests. But this is still a nice piece of information to carry with you. In this example we will get the HTML from the DataCamp course directory. The requests library makes it easy to get the HTML: first pass the url of the site you want (as a string) to requests.get, then look at the content of the response, as we have done here. You'll notice that we assigned the content to the variable html, which we then pass to the Selector.

5. You Know Our Secrets

In several lessons and exercises you will hear or read things like "...by inspecting the HTML..." or "...we have pre-loaded a variable with the HTML..." or "...we have loaded the HTML into a Selector object...". Now you know what we did to "inspect the HTML" and how we "pre-loaded a variable or Selector object" with an HTML string. And the best part is, you can do it too!
