The database schema
By now, you know that SQL databases always have a database schema. In the video on databases, you saw the following diagram:
A PostgreSQL database is set up in your local environment and contains this database schema. It has been filled with some example data. You can use pandas to query the database with the `read_sql()` function. You'll have to pass it a database engine, which has been defined for you and is called `db_engine`.

The pandas package, imported as `pd`, stores the query result in a DataFrame object, so you can use any DataFrame functionality on it after fetching the results from the database.
This exercise is part of the course "Introduction to Data Engineering".
Exercise instructions
- Complete the `SELECT` statement so it selects the `first_name` and the `last_name` from the `"Customer"` table. Make sure to order by the last name first and the first name second.
- Use the `.head()` method to show the first 3 rows of `data`.
- Use `.info()` to show some general information about `data`.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
```python
# Complete the SELECT statement
data = pd.read_sql("""
SELECT first_name, ____ FROM "____"
ORDER BY ____, ____
""", db_engine)

# Show the first 3 rows of the DataFrame
print(data.head(____))

# Show the info of the DataFrame
print(data.____())
```
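For reference, one way to complete the exercise is sketched below. To make it runnable standalone, it replaces the provided PostgreSQL `db_engine` with an in-memory SQLite database filled with hypothetical rows; against the real engine, only the setup lines would differ:

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for db_engine, with made-up sample customers
db_engine = sqlite3.connect(":memory:")
db_engine.execute('CREATE TABLE "Customer" (first_name TEXT, last_name TEXT)')
db_engine.executemany(
    'INSERT INTO "Customer" VALUES (?, ?)',
    [("Ada", "Lovelace"), ("Grace", "Hopper"),
     ("Alan", "Turing"), ("Edsger", "Dijkstra")],
)

# Complete the SELECT statement: both name columns,
# ordered by last name first, then first name
data = pd.read_sql("""
SELECT first_name, last_name FROM "Customer"
ORDER BY last_name, first_name
""", db_engine)

# Show the first 3 rows of the DataFrame
print(data.head(3))

# Show the info of the DataFrame (column dtypes, non-null counts)
print(data.info())
```

Note that `.info()` prints its report as a side effect and returns `None`, so the final line also prints `None`; calling `data.info()` without the surrounding `print()` avoids that.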