Processing large amounts of Twitter data
Sometimes, the data we have to process reaches a size that is too much for a computer's memory to handle. This is a common problem faced by data scientists. A solution to this is to process an entire data source chunk by chunk, instead of all at once in a single go.
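In pandas, passing a chunksize argument to pd.read_csv() returns an iterator of DataFrames rather than a single DataFrame, so each chunk can be processed and then discarded before the next one is read. Below is a minimal sketch of that pattern; the file name 'tweets.csv' and the chunk size of 10 follow this exercise, and counting the total number of rows is just an illustration.

import pandas as pd

total_rows = 0

# pd.read_csv() with chunksize returns an iterator of DataFrames,
# so only one chunk is held in memory at a time
for chunk in pd.read_csv('tweets.csv', chunksize=10):
    total_rows += len(chunk)

print(total_rows)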
In this exercise, you will do just that. You will process a large csv file of Twitter data in the same way that you processed 'tweets.csv' in the Bringing it all together exercises of the prequel course, but this time working on it in chunks of 10 entries at a time.
If you are interested in learning how to access Twitter data so you can work with it on your own system, refer to Part 2 of the DataCamp course on Importing Data in Python.
The pandas package has been imported as pd, and the file 'tweets.csv' is in your current directory for your use.
Be aware that this is real data from Twitter and as such there is always a risk that it may contain profanity or other offensive content (in this exercise, and any following exercises that also use real Twitter data).
This exercise is part of the course Python Toolbox.
Exercise instructions
- Initialize an empty dictionary counts_dict for storing the results of processing the Twitter data.
- Iterate over the 'tweets.csv' file by using a for loop. Use the loop variable chunk and iterate over the call to pd.read_csv() with a chunksize of 10.
- In the inner loop, iterate over the column 'lang' in chunk by using a for loop. Use the loop variable entry.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Initialize an empty dictionary: counts_dict
____

# Iterate over the file chunk by chunk
for ____ in ____:

    # Iterate over the column in DataFrame
    for ____ in ____:
        if entry in counts_dict.keys():
            counts_dict[entry] += 1
        else:
            counts_dict[entry] = 1

# Print the populated dictionary
print(counts_dict)
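For reference, one possible completed version of this exercise is sketched below. It assumes, as stated above, that pandas has been imported as pd, that 'tweets.csv' is in the current directory, and that the file has a 'lang' column.

# Initialize an empty dictionary: counts_dict
counts_dict = {}

# Iterate over the file chunk by chunk
for chunk in pd.read_csv('tweets.csv', chunksize=10):

    # Iterate over the column in DataFrame
    for entry in chunk['lang']:
        if entry in counts_dict.keys():
            counts_dict[entry] += 1
        else:
            counts_dict[entry] = 1

# Print the populated dictionary
print(counts_dict)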