Unreliable data source identification
Your team is developing a model to help generate accurate reports in the automotive safety industry. You have gathered preference data from three sources: "GlobalDrive Safety Institute", "AutoTech Safety Alliance", and "QuickScan Auto Review". Recently, concerns have arisen about the integrity of the data, and you have been asked to check it for unreliable sources.
automotive_df is a combined DataFrame loaded using the pre-imported pandas library. It contains data from the three sources. The pre-imported majority_vote function creates a dictionary-like object with the majority (chosen, rejected) pair per 'id'.
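The exercise does not show the body of majority_vote, so the following is only a minimal sketch of what such a helper might look like. The name majority_vote_sketch, the toy data, and the column names 'chosen' and 'rejected' are assumptions inferred from the description above; only 'id' and 'source' appear in the exercise code.

```python
from collections import Counter

import pandas as pd


def majority_vote_sketch(group):
    """Hypothetical stand-in: most common (chosen, rejected) pair in one 'id' group."""
    pairs = Counter(zip(group["chosen"], group["rejected"]))
    # most_common(1) returns [(pair, count)]; keep only the pair
    return pairs.most_common(1)[0][0]


# Toy data (invented for illustration): three sources label two comparisons
toy_df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "source": ["A", "B", "C", "A", "B", "C"],
    "chosen": ["x", "x", "y", "p", "p", "q"],
    "rejected": ["y", "y", "x", "q", "q", "p"],
})

# Build an id -> majority pair mapping by iterating over the groups
majority = {
    key: majority_vote_sketch(group) for key, group in toy_df.groupby("id")
}
print(majority)
```

In this toy data, sources A and B agree on both ids, so their pair becomes the majority and source C is the one out of step.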
This exercise is part of the course Reinforcement Learning from Human Feedback (RLHF).
Exercise instructions
- Define the condition for counting one disagreement with the majority vote for a given data source.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
def detect_unreliable_source(merged_df):
    # Majority (chosen, rejected) pair for each 'id'
    df_majority = merged_df.groupby('id').apply(majority_vote)
    # Start every source with zero disagreements
    disagreements = {source: 0 for source in merged_df['source'].unique()}
    for _, row in merged_df.iterrows():
        # Condition to find a disagreement with majority vote
        ____
    # The source with the most disagreements is flagged as unreliable
    unreliable_source = max(disagreements, key=disagreements.get)
    return unreliable_source

disagreement = detect_unreliable_source(automotive_df)
print("Unreliable Source:", disagreement)
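One possible way to express the disagreement condition is sketched below on invented toy data. This is not necessarily the course's reference solution: majority_pair is a hypothetical stand-in for the pre-imported majority_vote, the 'chosen' and 'rejected' column names are assumed, the source labels are placeholders, and the increment step inside the loop is part of the sketch and does not come from the sample code.

```python
from collections import Counter

import pandas as pd


def majority_pair(group):
    # Hypothetical stand-in for the pre-imported majority_vote helper
    pairs = Counter(zip(group["chosen"], group["rejected"]))
    return pairs.most_common(1)[0][0]


def count_disagreements(df):
    # Majority (chosen, rejected) pair per 'id'
    majority = {key: majority_pair(g) for key, g in df.groupby("id")}
    disagreements = {source: 0 for source in df["source"].unique()}
    for _, row in df.iterrows():
        # A row disagrees when its pair differs from the majority pair for its id
        if (row["chosen"], row["rejected"]) != majority[row["id"]]:
            disagreements[row["source"]] += 1
    return disagreements


# Toy data (invented): source_c disagrees twice, source_b once, source_a never
toy_df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "source": ["source_a", "source_b", "source_c"] * 3,
    "chosen": ["x", "x", "y", "p", "p", "q", "m", "n", "m"],
    "rejected": ["y", "y", "x", "q", "q", "p", "n", "m", "n"],
})

counts = count_disagreements(toy_df)
worst = max(counts, key=counts.get)
print("Disagreements:", counts)
print("Unreliable Source:", worst)
```

Counting raw disagreements assumes every source labels roughly the same number of ids; if coverage differs a lot between sources, a disagreement rate per source may be a fairer comparison.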