1. Attribute and Text Selection
Our last lesson introduced CSS Locator notation and showed how to translate much of what we've learned in XPath into it. In this lesson, we will learn how to use CSS Locators to select attributes, and how to use both CSS Locator and XPath notation to select the text displayed on websites.
2. You Must have Guts to use your Colon
Recall that to select an attribute in XPath, we use the @ symbol. The general form is to first select the element whose attribute we want, followed by a forward slash, followed by the @ symbol attached to the name of the attribute we want to select.
For example, if we first select the div element with id equal to uid, and from there select the href attributes of all its hyperlink children, the XPath string //div[@id="uid"]/a/@href would do the trick.
For a CSS Locator, we again direct to the element whose attribute we want, and follow it with a double colon attached to the attr function. The argument of the attr function is the attribute name.
So, to select the same href attributes as we did in the XPath string above, we would write div#uid > a::attr(href). Remember that the pound sign selects the div element by its id, the greater-than symbol moves down one generation, and the newly introduced double-colon attr piece selects the desired attribute (in this case, href).
3. Text Extraction
We are going to switch gears a bit to hit on an extraction point we've neglected so far.
Suppose that we have navigated to a paragraph element with id "p-example" and we want to direct to the text within that paragraph element.
To do this, we can use text() at the end of the XPath. Here we've gone ahead and put the XPath into a Scrapy Selector to look at the output.
By using the single forward-slash before the text method, we will direct to all chunks of text that are within that element, but not within future generations.
On the other hand, if we use a double forward-slash, then we will point to all chunks of text that are within that element and within its descendants; in this case we pick up the "DataCamp" text, since it belongs to the next generation hyperlink element.
4. Text Extraction
Similar to attribute selection, to navigate to this text in CSS Locator notation, we again follow the element selection with a double colon. But this time, the double colon is followed only by the word text.
As we did with XPath, we can indicate whether we want only the text in the current element (but not from future generations), or if we want to also include the text within future generations.
To grab only the text within the element, but not future generations, we use the double colon with no space before it, as in p#p-example::text.
On the other hand, if we also want to include the text within future generations, we simply add a space before the double colon, as in p#p-example ::text.
As a note:
In both XPath and CSS Locator notation, the extracted text is broken up by elements. So in this example, since there is a hyperlink child, the text is broken into the chunk before the hyperlink child, the text of the hyperlink child, and the text following the hyperlink child.
5. Scoping the Colon
You are now able to use CSS Locator and XPath notation, and have also learned how to extract text data from elements within HTML using both! We're all set for you to work through some exercise examples, and, at the end of this chapter, to see a real example.