Get startedGet started for free

Adding functionality to classes

1. Leveraging Classes

In our last lesson, we created a Document class that currently only serves as a container for our text.

2. Extending Document class

Let's look at our current definition of Document and talk about how we can improve its functionality. Right now the class is just a container for the user provided text; this doesn't add much value for our user. In order for our Document class to be more useful, we can add more attributes & methods besides __init__. For example, let's say that in our workflow we always want to tokenize our documents as a first step. Tokenization is a common step in text analysis, it is the process of breaking up a document into individual words, also known as tokens.

3. Current document class

As we've learned, __init__ is what's called when a user wants to create an instance of Document. This would be a convenient location to put a tokenization step. Placing the tokenization process inside of __init__ will ensure our Document is tokenized as soon as it is created, and this will save your user's the trouble of thinking about this step.

4. Revising __init__

Our new __init__ method might look something like this. You can see that we added a line that calls self dot underscore tokenize and dumps the output into an attribute named tokens. So where does underscore tokenize come from? and why does it have an underscore in front of it?

5. Adding _tokenize() method

Let's first answer the question of where the method came from. Since this course isn't focused on teaching text analytics, we aren't going to cover the tokenization function implementation; we'll simply import a function to do it for us. In a way, this demonstrates the beauty of modularity and python's community. Often times there are functions already written by someone in the community, and all you need to do is find out where they are and import them for your own use case. Moving on to the definition of underscore tokenize. We only pass one parameter to the function, the prescribed self convention that will represent an instance of the Document object. Since the tokenize function is already written, all that's left to do is call it on the text attribute. And voila! The tokenization process will now be completed automatically as soon as a user creates a Document instance. So why did we use a leading underscore when naming _tokenize?

6. Non-public methods

The reason we added the tokenization process to the __init__ method is that we wanted tokenization to happen immediately without the user having to think about it. Because of this, the user doesn't need to call underscore tokenize themselves; in other words, this method doesn't need to be 'public' to the user. According to PEP 8, non-public methods should be named with a single leading underscore. This signifies to the user that the method is intended for internal use only. Users can still use non-public methods in their own workflow, but they must do so at their own risk since the developer did not intend for them to do so.

7. The risks of non-public methods

The risks of using a non-public method in your own workflow include: little or no documentation and the function's input or output might change without warning when the developer updates their package.

8. Let's Practice

You've now seen how we can add more functionality to our Document class. In the upcoming exercises, you'll add additional methods to streamline our text analytics workflow; you'll even write your first non-public method!