Unreliable data source identification

Your team is developing a model to help generate accurate reports for the automotive safety industry. You have gathered preference data from three sources: "GlobalDrive Safety Institute", "AutoTech Safety Alliance", and "QuickScan Auto Review". Concerns have recently arisen about the integrity of this data, and you have been asked to check it for unreliable sources.

automotive_df is a combined DataFrame, loaded using the pre-imported pandas library, that contains the data from all three sources. Applying the pre-imported majority_vote function to each 'id' group yields a dictionary-like object that maps each 'id' to its majority (chosen, rejected) pair.

This exercise is part of the course

Reinforcement Learning from Human Feedback (RLHF)

Exercise instructions

  • Define the condition for counting one disagreement with the majority vote for a given data source.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

def detect_unreliable_source(df):
    # Majority (chosen, rejected) pair for each 'id'
    df_majority = df.groupby('id').apply(majority_vote)
    # Start every source with zero disagreements
    disagreements = {source: 0 for source in df['source'].unique()}
    for _, row in df.iterrows():
        # Condition to find a disagreement with majority vote
        ____
    # The source with the most disagreements is flagged as unreliable
    unreliable_source = max(disagreements, key=disagreements.get)
    return unreliable_source

disagreement = detect_unreliable_source(automotive_df)
print("Unreliable Source:", disagreement)
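
The counting logic above can be tried outside the exercise environment with a minimal sketch like the one below. It is not the course's reference solution. The majority_vote helper here is a hypothetical stand-in for the pre-imported one, the 'chosen' and 'rejected' column names are assumed from the (chosen, rejected) pair mentioned in the description, the source names and rows are invented toy data, and a plain dict replaces the groupby(...).apply(...) result to keep the sketch simple.

```python
from collections import Counter

import pandas as pd


def majority_vote(group):
    # Hypothetical stand-in for the pre-imported helper: returns the most
    # common (chosen, rejected) pair within one 'id' group.
    pairs = list(zip(group["chosen"], group["rejected"]))
    return Counter(pairs).most_common(1)[0][0]


def detect_unreliable_source(df):
    # Map each 'id' to its majority (chosen, rejected) pair.
    df_majority = {id_: majority_vote(group) for id_, group in df.groupby("id")}
    # Start every source with zero disagreements.
    disagreements = {source: 0 for source in df["source"].unique()}
    for _, row in df.iterrows():
        # One disagreement: this row's pair differs from the majority pair
        # recorded for the same 'id' (assumed column names).
        if (row["chosen"], row["rejected"]) != df_majority[row["id"]]:
            disagreements[row["source"]] += 1
    # The source with the most disagreements is flagged as unreliable.
    return max(disagreements, key=disagreements.get)


# Invented toy data: three sources label three items each.
toy_df = pd.DataFrame(
    {
        "id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
        "source": ["source_a", "source_b", "source_c"] * 3,
        "chosen": ["x", "x", "y", "p", "p", "q", "m", "n", "m"],
        "rejected": ["y", "y", "x", "q", "q", "p", "n", "m", "n"],
    }
)

print("Unreliable Source:", detect_unreliable_source(toy_df))
```

In this toy data, source_c departs from the majority on two items and source_b on one, so source_c is the one returned. Note that max returns the first source encountered when counts are tied, so a tie does not by itself indicate that a single source is unreliable.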