Introduction to HTML

1. Introduction to HTML

Hello and welcome to this course on web scraping with R. My name is Timo Grossenbacher and I'll be your instructor. In this chapter, you'll be introduced to HTML, the declarative language that is used on virtually every page on the web. But first, let's quickly look into the concept of scraping.

2. If you see something, it can be scraped

While there are ever more nicely structured datasets available, such as open government data, a wealth of interesting information is still locked into websites. I use the term "locked" in the sense that there's no explicit download button where you can easily get a nice CSV or JSON. Instead, the data is often spread across multiple pages and hard-to-parse tables, like in this Pokémon database. However, what you and your browser can see can be downloaded nonetheless, through a process called scraping. Basically, that means writing a small program that goes to a website and downloads exactly the stuff you want, be it only once or repeatedly. Usually, your scraper also does the parsing for you, that is, translating the HTML found on the page into a nicely structured dataset you can analyze later on.

3. Hypertext Markup Language (HTML)

HTML stands for Hypertext Markup Language, and it's one of the oldest technologies of the web. As the name says, it's a markup language: a way to declare the structure of a web page semantically. For instance, if you wanted to have a title followed by some text paragraphs, one of them including a bulleted list, you could specify that with HTML. Specifically, you would write so-called tags, the building blocks of websites. For a title, you'd choose one of the heading tags from h1 to h6, depending on the importance of the respective heading, with h1 being the most important. You'd insert paragraphs with the p tag.
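
To make the heading and paragraph tags a bit more concrete, here is a minimal sketch in R. The page content is made up for illustration, and the sketch assumes rvest 1.0.0 or later is installed, since it uses html_element() and html_elements().

```r
library(rvest)

# A hypothetical page: one heading followed by two paragraphs.
page_source <- "
<html>
  <body>
    <h1>A short title</h1>
    <p>This is the first paragraph.</p>
    <p>This is the second paragraph.</p>
  </body>
</html>"

# read_html() also accepts literal HTML passed as a string.
page <- read_html(page_source)

# Extract the text of the heading and of all paragraphs.
heading <- html_text(html_element(page, "h1"))
paragraphs <- html_text(html_elements(page, "p"))

print(heading)
print(paragraphs)
```

The h1 tag marks the title, and each p tag marks one paragraph, so paragraphs ends up as a character vector with two entries.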

4. HTML is organized hierarchically

Now to the bulleted list. For that, you'd write a sequence of li tags within a ul tag. Notice how tags can wrap around text or around other tags? That's the hierarchical nature of HTML. A starting tag is typically followed by either plain text or another set of tags, and then closed by an ending tag, which is marked with a forward slash. This lets you organize a website into different parts in a nice and easily readable way.
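
The nesting of li tags inside a ul tag could be sketched like this. The list items are invented for illustration, and the sketch again assumes rvest 1.0.0 or later.

```r
library(rvest)

# A hypothetical bulleted list: three li tags wrapped by one ul tag.
page_source <- "
<html>
  <body>
    <p>Some starter Pokemon:</p>
    <ul>
      <li>Bulbasaur</li>
      <li>Charmander</li>
      <li>Squirtle</li>
    </ul>
  </body>
</html>"

page <- read_html(page_source)

# Select the ul element, then look at the elements nested directly inside it.
list_node <- html_element(page, "ul")
items <- html_children(list_node)

item_tags <- html_name(items)   # the tag name of each child
item_texts <- html_text(items)  # the text wrapped by each child

print(item_tags)
print(item_texts)
```

Each child of the ul element is an li element, which in turn wraps plain text. That parent-child relationship is the hierarchy you'll navigate when scraping.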

5. HTML tags can have attributes

HTML elements can also have so-called attributes. An "a" element, for example, is used to construct a hyperlink to another page, such as an external website. The target of the link, that is, the actual URL the link points to, is specified with the href attribute. Throughout this course, you'll get to know many more of these HTML attributes.
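
Here is a small sketch of how an attribute could be read in R. The link and its URL are placeholders, and html_element() assumes rvest 1.0.0 or later.

```r
library(rvest)

# A hypothetical paragraph containing one hyperlink.
# Single quotes delimit the R string so the HTML can use double quotes.
page_source <- '
<html>
  <body>
    <p>Visit <a href="https://www.example.com">this example site</a>.</p>
  </body>
</html>'

page <- read_html(page_source)
link <- html_element(page, "a")

# The attribute holds the target URL, the element's text holds the visible label.
link_target <- html_attr(link, "href")
link_label <- html_text(link)

print(link_target)
print(link_label)
```

Note the difference between the two: html_attr() returns what's inside the starting tag, while html_text() returns what's between the starting and ending tag.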

6. Reading HTML with R

With R, it's easy to read in HTML. In this course, you'll mainly work with the rvest package from the Tidyverse. For example, with the read_html() function, you can read in an entire HTML document. The resulting html variable has two classes: xml_document and xml_node. Why is that, you might ask? Internally, rvest works with the xml2 package, as HTML syntax closely resembles that of the Extensible Markup Language (XML). Therefore, some functions like read_html() actually come from the xml2 package.
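
As a minimal sketch, reading a document and checking its class might look like this. The HTML string is a stand-in for a real page; read_html() also accepts a URL or a file path.

```r
library(rvest)

# A hypothetical, very small HTML document passed as a string.
html <- read_html("<html><body><h1>Hello</h1><p>Some text.</p></body></html>")

# The object carries two classes, both defined by the xml2 package.
html_classes <- class(html)
print(html_classes)
```

Because the object is an xml_node, functions from both rvest and xml2 can work with it.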

7. Inspecting the structure with xml_structure()

Thus, you can also use the xml_structure() function of the xml2 package to get a look at the structure of the HTML document you just read in. This function prints the basic outline of an HTML document in a pretty way. To use it, first load the xml2 package with the library() command.
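
Put together, that could look like the following sketch. The document is again a made-up example passed as a string.

```r
library(rvest)
library(xml2)

# A hypothetical document with a heading, a paragraph and a short list.
html <- read_html("
<html>
  <body>
    <h1>A short title</h1>
    <p>Some text.</p>
    <ul>
      <li>First item</li>
      <li>Second item</li>
    </ul>
  </body>
</html>")

# Print an indented outline of the tags in the document.
xml_structure(html)
```

The outline shows only the tags and their nesting, with text content replaced by a placeholder, which makes it a quick way to get oriented before writing any selectors.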

8. Let's parse HTML!

Okay, let's get your hands dirty with some HTML parsing in R.