Extracting information from large amounts of Twitter data
Great job chunking out that file in the previous exercise. You now know how to deal with situations where you need to process a very large file, and that's a very useful skill to have!
It's good to know how to process a file in smaller, more manageable chunks, but it can become tedious to write and rewrite the same code for the same task each time. In this exercise, you will make your code more reusable by wrapping your work from the last exercise in a function definition.
The `pandas` package has been imported as `pd`, and the file `'tweets.csv'` is in your current directory for your use.
This exercise is part of the course Python Toolbox.
Exercise instructions
- Define the function `count_entries()`, which has 3 parameters. The first parameter is `csv_file` for the filename, the second is `c_size` for the chunk size, and the last is `colname` for the column name.
- Iterate over the file in `csv_file` by using a `for` loop. Use the loop variable `chunk` and iterate over the call to `pd.read_csv()`, passing `c_size` to `chunksize`.
- In the inner loop, iterate over the column given by `colname` in `chunk` by using a `for` loop. Use the loop variable `entry`.
- Call the `count_entries()` function, passing it the filename `'tweets.csv'`, the chunk size `10`, and the name of the column to count, `'lang'`. Assign the result of the call to the variable `result_counts`.
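Before filling in the blanks, it may help to see what passing `chunksize` to `pd.read_csv()` does: instead of one DataFrame, it returns an iterator that yields the file a few rows at a time. A minimal sketch, using a small in-memory CSV (via `StringIO`) as a stand-in for `'tweets.csv'`:

```python
from io import StringIO

import pandas as pd

# Illustrative stand-in for 'tweets.csv' with three rows
data = StringIO('lang\nen\nen\net\n')

# With chunksize, read_csv yields DataFrames of at most 2 rows each
sizes = [len(chunk) for chunk in pd.read_csv(data, chunksize=2)]
print(sizes)  # [2, 1]
```

Three rows split into chunks of two give one full chunk and one leftover chunk, which is why the loop in this exercise processes the whole file no matter how large it is.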
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
```python
# Define count_entries()
def ____():
    """Return a dictionary with counts of
    occurrences as value for each key."""

    # Initialize an empty dictionary: counts_dict
    counts_dict = {}

    # Iterate over the file chunk by chunk
    for ____ in ____:

        # Iterate over the column in DataFrame
        for ____ in ____:
            if entry in counts_dict.keys():
                counts_dict[entry] += 1
            else:
                counts_dict[entry] = 1

    # Return counts_dict
    return counts_dict

# Call count_entries(): result_counts
result_counts = ____

# Print result_counts
print(result_counts)
```
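For reference, a completed version of the function might look like the sketch below. Since the real `'tweets.csv'` isn't available here, the example writes a tiny illustrative file with a `lang` column first; the function body itself is exactly the pattern the instructions describe.

```python
import pandas as pd

def count_entries(csv_file, c_size, colname):
    """Return a dictionary with counts of
    occurrences as value for each key."""
    counts_dict = {}

    # Iterate over the file chunk by chunk
    for chunk in pd.read_csv(csv_file, chunksize=c_size):

        # Tally each value in the requested column
        for entry in chunk[colname]:
            if entry in counts_dict:
                counts_dict[entry] += 1
            else:
                counts_dict[entry] = 1

    return counts_dict

# Illustrative stand-in data for 'tweets.csv'
with open('tweets.csv', 'w') as f:
    f.write('lang,text\nen,hello\nen,hi\net,tere\nen,hey\n')

result_counts = count_entries('tweets.csv', 10, 'lang')
print(result_counts)  # {'en': 3, 'et': 1}
```

Because the file is read in chunks, only `c_size` rows are ever in memory at once, so the same function works unchanged on files far too large to load whole.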