A PySpark groupby

You've seen how to use the Dask framework and its DataFrame abstraction for some calculations. However, as you saw in the video, Spark is probably the more popular choice for data processing in the big data world.

In this exercise, you'll use the PySpark package to work with a Spark DataFrame. The data is the same as in the previous exercises: athletes who took part in Olympic events between 1896 and 2016.

The Spark DataFrame, athlete_events_spark, is available in your workspace.

Methods you'll use in this exercise:

.printSchema(): Prints the schema of a Spark DataFrame.
.groupBy(): Groups rows by one or more columns ahead of an aggregation.
.mean(): Computes the mean of the given columns for each group.
.show(): Runs the computation and prints the resulting rows.

This exercise is part of the course Introduction to Data Engineering.


Exercise instructions

Find out the type of athlete_events_spark.
Find out the schema of athlete_events_spark.
Print the mean age of the Olympic athletes, grouped by year. Notice that Spark has not actually calculated anything yet. This is called lazy evaluation.
Take the previous result and call .show() on it to calculate the mean age.
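Lazy evaluation can be illustrated with a plain Python generator. This is only an analogy under stated assumptions, not how Spark works internally; the record helper and the sample ages are hypothetical.

```python
evaluated = []  # records each age at the moment it is actually processed


def record(age):
    """Hypothetical helper: note that an age was processed, then pass it on."""
    evaluated.append(age)
    return age


ages = [20, 30, 25]

# Building the generator only describes the work; no element is processed yet.
pending = (record(age) for age in ages)
seen_before = list(evaluated)

# Consuming the generator is what triggers the computation.
mean_age = sum(pending) / len(ages)
seen_after = list(evaluated)

print(seen_before)  # []
print(seen_after)   # [20, 30, 25]
print(mean_age)     # 25.0
```

In the same spirit, .groupBy('Year').mean('Age') only describes a computation, and a call such as .show() is what asks Spark to carry it out.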

Hands-on interactive exercise

Complete the sample code below to finish this exercise.

# Print the type of athlete_events_spark
print(____(athlete_events_spark))

# Print the schema of athlete_events_spark
print(athlete_events_spark.____())

# Group by the Year, and find the mean Age
print(athlete_events_spark.____('Year').mean(____))

# Group by the Year, find the mean Age, and show the result
print(athlete_events_spark.____('Year').mean(____).____())


