1. Case Study: Generating a Report Repository
In this chapter, we've learned about the ways S3 can host and share files publicly and privately. In this lesson, we will walk through everything we learned to see how it fits together.
2. Final product
Every month, Sam is asked to compile a report on the number of requests received in Get It Done, broken down by case type. When the Council asks, she emails them the report.
They want to see how many requests came in, which types of requests they were, and the total count for each type.
The council members should be able to open a page, see a listing of all the generated reports and charts, click on one, and view it.
3. The steps
In order to accomplish this, we will:
Download files for the month from the raw data bucket
Concatenate them into one CSV
Create an aggregated DataFrame
4. The steps
Write the DataFrame to CSV and HTML
Generate a Bokeh plot, save as HTML
5. The steps
Create `gid-reports` bucket, configure for website
Upload all three files for the month to S3
Generate an index.html file that lists all the files
Get the website URL!
6. Raw data bucket
We will be picking up raw data from the gid-requests bucket. This bucket contains daily CSVs; each CSV has one row per request made in the Get It Done app that day.
7. Read raw data files
First, let's create a list to hold our DataFrames.
Then, let's get the listing of all files in S3 that start with 2019_jan.
We can use the Prefix argument of the list_objects method to filter.
Don't forget that the actual list of object dictionaries is under the response's 'Contents' key.
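A minimal sketch of that listing step, assuming a boto3 client is already configured with credentials (the variable names are illustrative):

```python
import boto3

# Assumes AWS credentials are configured in the environment
s3 = boto3.client('s3')

# List objects in gid-requests whose keys start with '2019_jan'
response = s3.list_objects(Bucket='gid-requests', Prefix='2019_jan')

# The object summaries live in the response's 'Contents' list
request_files = response['Contents']
```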
8. Read raw data files
Next, we will iterate over each object and load its StreamingBody using get_object. We can't use the object URL because the files are private.
We will then pass that body to pandas read_csv, creating a DataFrame.
We append each DataFrame to the list, so the df_list variable ends up holding 31 DataFrames, one for each day.
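Here is one way the loading loop could look; request_files comes from the listing sketch above:

```python
import pandas as pd

df_list = []

for file in request_files:
    # Download each object's contents; a plain URL fetch would fail
    # because the objects are private
    obj = s3.get_object(Bucket='gid-requests', Key=file['Key'])
    # The 'Body' is a StreamingBody that read_csv can consume directly
    obj_df = pd.read_csv(obj['Body'])
    df_list.append(obj_df)
```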
9. Read raw data files
We concatenate all the DataFrames in the list into one using pd.concat.
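That step is a one-liner:

```python
# Stack the 31 daily DataFrames into one January DataFrame
df = pd.concat(df_list)
```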
10. Create aggregated reports
Then, we will do a basic aggregation, counting requests per case type. We will write the resulting DataFrame to a CSV and an HTML file. We will also make a Bokeh plot and write it to HTML.
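A sketch of this slide, assuming the raw CSVs have a 'service_name' column holding the case type (that column name and the output filenames are assumptions):

```python
from bokeh.plotting import figure, output_file, save

# Count January requests per case type ('service_name' is an assumed column)
counts = df['service_name'].value_counts()
agg_df = counts.reset_index()
agg_df.columns = ['service_name', 'count']

# Write the aggregated report as CSV and as an HTML table
agg_df.to_csv('final_report.csv', index=False)
agg_df.to_html('final_report.html', index=False)

# Build a simple bar chart and save it to HTML
output_file('final_chart.html')
plot = figure(x_range=agg_df['service_name'].tolist(),
              title='January requests per case type')
plot.vbar(x=agg_df['service_name'], top=agg_df['count'], width=0.8)
save(plot)
```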
11. Report bucket
We will place aggregated reports in the gid-reports bucket.
12. Upload Aggregated CSV
We upload the aggregated CSV to S3 with the key 2019/jan/final_report.csv. We also set the public-read ACL.
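A sketch of the upload, reusing the s3 client from earlier (the local filename is an assumption):

```python
# Upload the aggregated CSV under 2019/jan/ and make it publicly readable
s3.upload_file(Filename='final_report.csv',
               Bucket='gid-reports',
               Key='2019/jan/final_report.csv',
               ExtraArgs={'ACL': 'public-read'})
```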
13. Upload HTML Table
We upload the table HTML next. We use ContentType text/html to tell S3 that this is an HTML file.
14. Upload HTML Chart
Finally, we upload the chart.html as well.
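Both HTML uploads follow the same pattern; setting ContentType to text/html makes browsers render the files instead of downloading them (the local filenames are assumptions):

```python
# Upload the table and the chart with an HTML content type and public-read ACL
for filename, key in [('final_report.html', '2019/jan/final_report.html'),
                      ('final_chart.html', '2019/jan/final_chart.html')]:
    s3.upload_file(Filename=filename,
                   Bucket='gid-reports',
                   Key=key,
                   ExtraArgs={'ContentType': 'text/html',
                              'ACL': 'public-read'})
```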
15. Uploaded reports
This is now what we have in our S3 console for January. Next, we want to generate an index.html file that lets Sam's bosses browse the website and click on what they want to see.
16. Create index.html
We list the gid-reports bucket for objects starting with 2019/.
Then, we convert the response's Contents into a DataFrame; this works because each object listing is a dictionary.
Finally, we add a "Link" column that combines each object's key with the base S3 website URL.
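Putting those three steps together might look like this; the region in the base URL is an assumption:

```python
# List everything uploaded under 2019/ in the reports bucket
response = s3.list_objects(Bucket='gid-reports', Prefix='2019/')

# 'Contents' is a list of dictionaries, so it converts straight to a DataFrame
objects_df = pd.DataFrame(response['Contents'])

# Build a clickable link for each object from the static website base URL
base_url = 'http://gid-reports.s3-website-us-east-1.amazonaws.com/'
objects_df['Link'] = base_url + objects_df['Key']
```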
17. Create index.html
Then, we write our DataFrame to HTML using pandas, selecting only the Link, LastModified, and Size columns. Note that we use the argument render_links=True so that our links are clickable in the HTML.
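For example:

```python
# Keep only the useful columns; render_links=True turns the Link
# column into real anchor tags instead of plain text
objects_df.to_html('index.html',
                   columns=['Link', 'LastModified', 'Size'],
                   render_links=True)
```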
18. Upload index.html
We upload the index.html file to the root of the gid-reports bucket.
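The upload mirrors the earlier ones:

```python
# Upload index.html to the bucket root so it serves as the landing page
s3.upload_file(Filename='index.html',
               Bucket='gid-reports',
               Key='index.html',
               ExtraArgs={'ContentType': 'text/html',
                          'ACL': 'public-read'})
```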
19. Get the URL of the index!
Finally, let's get the website URL, which contains the bucket name, and share it with the council!
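S3 static website URLs follow a standard pattern; the region here is an assumption:

```python
# Pattern: http://<bucket>.s3-website-<region>.amazonaws.com
url = 'http://gid-reports.s3-website-us-east-1.amazonaws.com'
print(url)
```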
20. Let's tweak!
Sam has built a pretty cool reporting system. Now that we've had a general overview, let's see how we can make it better!