Accessing private objects in S3
1. Accessing private objects in S3
In the last lesson, we learned how to set permissions on objects in buckets, making them private and public. We also learned how to access public files. But what happens when we want to share private data? Data engineers are always balancing security, sharing, and access. Let's dig in!

2. Downloading a private file
If we did not explicitly set a public-read ACL on a file, AWS defaults to private. If we try to read it with pandas' read_csv through the public object URL, we will get a 403 error, which means access is forbidden. If we don't want to make it public to the world, what do we do?

3. Downloading private files
We can use boto3's download_file method to download the file, then have pandas read the CSV from disk. This is a great option if we expect the file not to change much.

4. Accessing private files
Yet there's a much simpler way: using boto3's get_object method to access the file's contents directly from S3. We call the get_object method with the bucket name and object key as parameters.

5. Accessing private files
In response, we receive information very similar to what the head_object method gave us in Chapter 1 - metadata about the file. We also get a Body key, with a "StreamingBody" object as the value. A StreamingBody is a special type of response that doesn't download the whole object immediately.

6. Accessing private files
Pandas knows how to handle this type of response. We pass the contents of the Body key to pandas, and it will read it like a CSV file.

7. Pre-signed URLs
We can also grant temporary access to private S3 objects by using pre-signed URLs. These are special URLs that expire after a given time period. In this lesson, we will use them for accessing private files, but this mechanism can grant access to a myriad of S3 operations.

8. Pre-signed URLs
Let's upload a file. By default, it's private.

9. Pre-signed URLs
We generate a pre-signed URL that grants access to the file for 1 hour, or 3600 seconds. This way, our colleague can open it in pandas or a browser. After 1 hour passes, the access expires.

10. Load multiple files into one DataFrame
Often, data engineers have to parse multiple files on S3 that follow a pattern, load them into one DataFrame, then do something with it. This is quite simple to do with the tools we have already learned! First, let's create a list to hold the DataFrames for every file we will read from S3. Then, let's get the list of CSVs from S3 whose keys start with 2019/. Let's assign the contents of the Contents key from the response dictionary to the request_files variable.

11. Load multiple files into one DataFrame
We iterate over request_files, loading each object. We read each object's body into a DataFrame. Then, we append that DataFrame to the list of DataFrames we are collecting.

12. Load multiple files into one DataFrame
After the loop, we use pandas' concat method to combine all the DataFrames in the list. We can now inspect our combined DataFrame with the head method!

13. Review - Accessing private objects in S3
In this lesson, we discussed how to open and share private files in S3. We can simply download the file and read the local copy into pandas. We can use get_object and read the response's Body key into pandas. Lastly, we can generate a short-lived pre-signed URL and pass it to pandas.

14. Review - Sharing URLs
We now also have two ways to share a file. Public files are shared through the public object URL, which is generated using the string's format method. We can also share private files temporarily using pre-signed URLs. We get these by calling boto3's generate_presigned_url method.

15. Let's practice!
Now that we know how to open and share private files, let's practice!