Collecting additional data

1. Collecting additional data

While collecting internal data is useful for some data science projects, it's only one piece of the puzzle. Often, you need to gather data from external sources as well.

2. Even more data

There are many ways that you can collect additional data for your organization. A few common ways include APIs, public records, and Mechanical Turk, all of which we'll discuss in this lesson.

3. Data APIs

Let's begin with APIs. API stands for Application Programming Interface. It's an easy way of requesting data from a third party over the internet. Many companies have APIs to let your team access their data. Some notable APIs include those from Twitter, Wikipedia, Yahoo! Finance, and Google Maps, but there are many, many more. If you work with a partner and think that they might have useful data, do a quick web search and see if an API exists!
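To make the idea of "requesting data over the internet" concrete, here is a minimal sketch of what an API call often looks like in Python. The endpoint `api.example.com`, the parameter names `q` and `limit`, and the helper names are all assumptions made up for illustration; a real API documents its own URL, parameters, and authentication.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint, used only for illustration (not a real service).
BASE_URL = "https://api.example.com/v1/records"


def build_url(base_url, params):
    """Attach query parameters to the base URL."""
    return f"{base_url}?{urlencode(params)}"


def parse_response(body):
    """Decode a JSON response body into Python objects."""
    return json.loads(body)


def fetch(base_url, params, timeout=10):
    """Send a GET request and return the decoded JSON (needs network access)."""
    with urlopen(build_url(base_url, params), timeout=timeout) as response:
        return parse_response(response.read().decode("utf-8"))


# Building the URL does not touch the network; calling fetch() would.
url = build_url(BASE_URL, {"q": "DataFramed", "limit": 10})
```

Most real APIs also require an access key and limit how many requests you can make, so check the provider's documentation before relying on a call like `fetch`.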

4. Tracking a hashtag

Let's look at an example of the Twitter API. Suppose we want to track Tweets with the hashtag DataFramed, DataCamp's wonderful podcast on Data Science. We can use the Twitter API to request all Tweets with this hashtag. At this point, we have many options for analysis. We could perform a sentiment analysis on the text of each Tweet and get an idea of how much people like our podcast. We could simply track how often the hashtag DataFramed appears each week. We could also combine this data with our downloads data and see if positive Tweets are correlated with more downloads.
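The weekly tracking idea can be sketched in a few lines. The records below are made-up stand-ins for what an API might return, and the field names `date` and `text` are assumptions, not the actual Twitter API response format.

```python
from collections import Counter
from datetime import date

# Made-up records standing in for API results (illustrative only).
tweets = [
    {"date": date(2024, 1, 1), "text": "Loved the latest #DataFramed episode"},
    {"date": date(2024, 1, 3), "text": "New #dataframed is out"},
    {"date": date(2024, 1, 9), "text": "Catching up on #DataFramed"},
    {"date": date(2024, 1, 10), "text": "Unrelated post about lunch"},
]


def weekly_hashtag_counts(records, hashtag):
    """Count records mentioning the hashtag, grouped by ISO (year, week)."""
    needle = "#" + hashtag.lower()
    counts = Counter()
    for record in records:
        # Simple case-insensitive substring match; longer tags that start
        # with the same letters would also match.
        if needle in record["text"].lower():
            iso = record["date"].isocalendar()
            counts[(iso[0], iso[1])] += 1
    return counts


counts = weekly_hashtag_counts(tweets, "DataFramed")
# counts maps (2024, 1) to 2 and (2024, 2) to 1
```

The same grouped counts could then be joined with weekly download numbers to look for a relationship between mentions and downloads.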

5. Public records

Public records are another great way of gathering additional data. In the US, data.gov has health, education, and commerce data available for free download. In the EU, data.europa.eu has similar data. These can be great sources for understanding population-level trends or gathering location and economic data.
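Public datasets are often distributed as CSV files. As a rough sketch of working with one, the snippet below reads CSV text and computes a simple population-level trend. The column names and every value are placeholders invented for illustration; they do not come from data.gov or data.europa.eu.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows with made-up values; a real file would be downloaded
# from a public data portal and opened from disk instead.
SAMPLE_CSV = """region,year,population
North,2020,1000
North,2021,1100
South,2020,2000
South,2021,1900
"""


def population_change(csv_text):
    """Return last-year minus first-year population for each region."""
    by_region = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_region[row["region"]][int(row["year"])] = int(row["population"])
    return {
        region: years[max(years)] - years[min(years)]
        for region, years in by_region.items()
    }


changes = population_change(SAMPLE_CSV)
# changes == {"North": 100, "South": -100}
```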

6. Building a training set

Previously, we discussed image recognition as a type of data science problem. In order to build a good image recognition algorithm, we need a set of pictures where the images have already been labeled, which is called our training set. But we don't need just one or two pictures. We need hundreds or thousands of pictures. Getting these labeled images can be really difficult and time-consuming, and the lack of a training set is often what keeps good data science projects from being completed.

7. Mechanical Turk

Depending on what kind of training set is needed, Mechanical Turk, also called MTurk, can be a great option. Mechanical Turk means asking humans to complete a task that we eventually plan on computerizing. In our previous example, this would mean labeling a handful of pictures to create a training set for image recognition. Rather than asking one person to label thousands of images, we recruit thousands of people and pay each of them to label a few images. To ensure quality, we might ask two or three people to review the same image and then take the most common answer.
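Taking the most common answer from several reviewers is usually called majority voting. Here is a minimal sketch, assuming the answers have already been collected into a dictionary; the image ids, the labels, and the choice to return `None` on a tie are illustrative assumptions rather than how any particular platform works.

```python
from collections import Counter

# Made-up worker answers: image id -> labels given by different workers.
answers = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "dog", "dog"],
    "img_003": ["cat", "dog"],
}


def majority_label(labels):
    """Return the most common label, or None if there is no clear winner."""
    ranked = Counter(labels).most_common()
    if not ranked:
        return None  # no answers were collected for this image
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # top two labels are tied
    return ranked[0][0]


training_labels = {
    image: majority_label(labels) for image, labels in answers.items()
}
# img_001 -> "cat", img_002 -> "dog", img_003 -> None (tie)
```

Images that come back as `None` could be sent to an additional reviewer, which is one reason an odd number of reviewers per image is convenient.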

8. Mechanical Turk

Many platforms exist to help build your Mechanical Turk problem and recruit helpers, such as AWS MTurk. Mechanical Turk isn't just for image recognition. You can also use it to label customer reviews as positive or negative, extract text from a form, or highlight key words in a sentence. In the example on the right, users are asked to identify which sections of the image contain a street sign.

9. Let's practice!

You now know about three ways of getting external data: APIs, public records, and Mechanical Turk. Let's practice!
