Data sources and risks

1. Data sources and risks

Hi! My name is Michael Chow, and I work on the Data Science team at DataCamp. Previously, you learned about the data science workflow. In this lesson, we'll focus on the first step: data collection.

2. Common sources of data

Data is everywhere, and almost any business process can generate mountains of data. Some of the most common sources of data are web events, customer data, logistics data, and financial transactions. It's possible that your company is already collecting all of this information. It's best to ask your data engineers what is collected and what isn't, and to emphasize the importance of starting a centralized data collection process sooner rather than later.

3. Web data

Let's dive a bit deeper into web data. When a user visits a web page or clicks on a link, it can be helpful to track this information in order to calculate conversion rates or monitor the popularity of different pieces of content. At a minimum, you'll want to collect the name of the event, which could mean the URL of the page visited or an identifier for the element that was clicked, the timestamp of the event, and an identifier for the user that performed the action.
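To make the three minimum fields concrete, here is a small Python sketch of what a tracked event might look like, plus one way to compute a conversion rate from a list of such events. The `WebEvent` class, the `conversion_rate` function, the page path, and the "buy-button" identifier are all hypothetical names chosen for illustration; your own tracking system will likely look different.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class WebEvent:
    """One tracked web event with the three minimum fields (hypothetical schema)."""
    name: str            # URL visited or identifier of the clicked element
    timestamp: datetime  # when the event happened
    user_id: int         # identifier for the user who performed the action


def conversion_rate(events, visit_name, convert_name):
    """Share of users who triggered visit_name and also triggered convert_name."""
    visitors = {e.user_id for e in events if e.name == visit_name}
    converted = {e.user_id for e in events if e.name == convert_name} & visitors
    return len(converted) / len(visitors) if visitors else 0.0


# Hypothetical sample data: two visitors, one of whom clicks "buy-button".
now = datetime.now(timezone.utc)
sample_events = [
    WebEvent("/products/42", now, 185477),
    WebEvent("buy-button", now, 185477),
    WebEvent("/products/42", now, 185478),
]
print(conversion_rate(sample_events, "/products/42", "buy-button"))  # prints 0.5
```

Notice that the event carries only a user id, not a name. That choice matters for the privacy discussion that follows.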

4. Personally Identifiable Information (PII)

Suppose Jane Doe is a customer who visits your company website and likes one of your products. You might choose to track her name, the timestamp, and the object that she clicked on. It's important to remember that Jane Doe's name is Personally Identifiable Information, or PII. PII includes a person's name, location, email address, and any other piece of information that could be used to tie a web event back to a real human. PII should be treated with extreme sensitivity and caution.

5. Data pseudonymization

One of the easiest ways to protect Jane's identity is to split this information into two separate entries. We can assign Jane a user id, in this case 185477, and store that information in a users table. We can then identify her event using this id. We call the data in the events table pseudonymized because Jane can't be identified by that table alone, but she can be identified if we combine information from the users table with the events table. To protect Jane, we'll want to make sure that access to the users table is restricted to only folks who need to know Jane's identity, such as senior customer service representatives or members of the legal team. We'll also want to periodically audit who has accessed this data and how they have used it to ensure that Jane's data is respected.
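The split described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the `pseudonymize` function and the dictionary field names are hypothetical, ids are handed out sequentially starting from an assumed value, and users are matched by name only, which would not be safe if two customers shared a name.

```python
def pseudonymize(raw_events, start_id=1):
    """Split raw events into a restricted users table and a pseudonymized events table.

    raw_events: list of dicts with "name", "timestamp", and "clicked" keys
    (hypothetical field names). Returns (users, events).
    """
    users = {}        # user_id -> PII; access to this table should be restricted
    ids_by_name = {}  # helper lookup so a returning user keeps the same id
    events = []       # contains user ids only, no names
    next_id = start_id

    for raw in raw_events:
        name = raw["name"]
        if name not in ids_by_name:
            ids_by_name[name] = next_id
            users[next_id] = {"name": name}
            next_id += 1
        events.append({
            "user_id": ids_by_name[name],
            "timestamp": raw["timestamp"],
            "clicked": raw["clicked"],
        })
    return users, events


# Hypothetical raw event for Jane, using the id from the example above.
raw = [{"name": "Jane Doe", "timestamp": "2024-01-01T10:00:00Z", "clicked": "product-42"}]
users, events = pseudonymize(raw, start_id=185477)
print(events)  # the event row refers to Jane only by user_id
print(users)   # the name lives only in the users table
```

The events table can now be shared with analysts, while the users table stays behind stricter access controls.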

6. Data anonymization

The best way to protect Jane's privacy is to destroy the information in the users table after assigning Jane's user id. Without the users table, the events table is fully anonymized data. For many analysis purposes, anonymized data is sufficient. We need to know that Jane is a unique individual, but we don't need to know her name or any other PII.
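Here is a small Python sketch of that idea, assuming the users table and events table are plain in-memory structures with hypothetical field names. Once the users table is emptied, we can still count unique individuals and their activity from the ids alone. In a real database this step would be a deletion of the stored rows, including any copies.

```python
from collections import Counter

# Hypothetical tables: the users table holds PII, the events table holds ids only.
users = {185477: {"name": "Jane Doe"}, 185478: {"name": "John Roe"}}
events = [
    {"user_id": 185477, "timestamp": "2024-01-01T10:00:00Z", "clicked": "product-42"},
    {"user_id": 185477, "timestamp": "2024-01-01T10:05:00Z", "clicked": "product-7"},
    {"user_id": 185478, "timestamp": "2024-01-01T11:00:00Z", "clicked": "product-42"},
]


def anonymize(users_table):
    """Destroy the id-to-name lookup so ids can no longer be tied to a person."""
    users_table.clear()


anonymize(users)

# The analysis still works without any names.
unique_users = len({e["user_id"] for e in events})
clicks_per_user = Counter(e["user_id"] for e in events)
print(unique_users)            # prints 2
print(clicks_per_user[185477]) # prints 2
```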

7. General Data Protection Regulation (GDPR)

You might have heard the term "GDPR" from your data team recently. GDPR stands for General Data Protection Regulation and applies to the personal data of individuals in the European Union, even when the company handling that data is based elsewhere. The purpose of GDPR is to give individuals control over their personal data. Among other things, GDPR regulates how long data can be stored, calls for appropriate safeguards such as pseudonymization, and requires data collection to be disclosed and, in many cases, consent to be obtained. It's always best to consult a lawyer when dealing with the personal data of people in the EU to ensure that you comply with GDPR.

8. Let's practice!

You now know some common sources of data, the difference between anonymization and pseudonymization, and the definition of GDPR. Let's practice!
