Create Function to Calculate D
Calculating the Index of Dissimilarity requires multiple steps and has high reuse potential. In this exercise you will create the function dissimilarity that we used in the previous exercise. The function's input parameters will be a DataFrame of small area geographies (such as tracts) and three column names: the two columns with population counts of Group A and Group B, and the column with the names or geographic identifiers of the container geography (such as states or metro areas).
As a reminder, the formula for the Index of Dissimilarity is:
$$D = \frac{1}{2}\sum{\left\lvert \frac{a}{A} - \frac{b}{B} \right\rvert}$$
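As a quick check on the formula, here is a minimal sketch for a single container geography made of two small areas. It assumes that \(a\) and \(b\) are the small-area counts of each group and that \(A\) and \(B\) are the container totals; the DataFrame toy and its numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical counts for two tracts inside one container geography
toy = pd.DataFrame({"a": [10, 30], "b": [20, 20]})

# Container totals for each group (both are 40 here)
A = toy["a"].sum()
B = toy["b"].sum()

# Half the sum of absolute differences in each tract's share of its group
D = 0.5 * (toy["a"] / A - toy["b"] / B).abs().sum()
print(D)  # 0.25
```

Each tract contributes |0.25 - 0.5| or |0.75 - 0.5|, so the sum of absolute differences is 0.5 and D is 0.25.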
pandas has been imported using the usual alias, pd. The groupby and merge steps are already completed for you in the code below.
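To see what the groupby and merge steps produce, the following sketch runs them on a small invented table. The names toy, region, white and black are hypothetical and are not part of the exercise data.

```python
import pandas as pd

# Hypothetical small-area table: three tracts in two regions
toy = pd.DataFrame({
    "region": ["X", "X", "Y"],
    "white": [10, 30, 5],
    "black": [20, 20, 5],
})

# Totals per region; the result is indexed by region
grouped_sums = toy.groupby("region")[["white", "black"]].sum()

# Join the totals back onto every tract. Overlapping column names from the
# left table keep their names ("" suffix); those from the right get "_sum".
merged = pd.merge(toy, grouped_sums, left_on="region",
                  right_index=True, suffixes=("", "_sum"))
print(merged.columns.tolist())
```

Every tract row now carries both its own counts and the totals for its region, which is what the inner expression of the formula needs.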
This exercise is part of the course
Analyzing US Census Data in Python
Exercise instructions
- Calculate the expression inside the absolute value bars based on the formula. The column names for \(A\) and \(B\) are formed by adding the suffix "_sum" to the parameters col_A and col_B.
- The sum method on a single column returns a series; use the to_frame() method to convert the series to a DataFrame.
- Test the new function on tracts: calculate White-Black dissimilarity by MSA name.
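The second instruction can be illustrated with a short sketch: summing one column after a groupby gives a Series, and to_frame() wraps it in a one-column DataFrame. The table toy and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical per-tract values grouped by a container column
toy = pd.DataFrame({
    "region": ["X", "X", "Y"],
    "value": [0.25, 0.25, 0.0],
})

# Selecting a single column before sum() yields a Series indexed by region
per_region = toy.groupby("region")["value"].sum()
print(type(per_region).__name__)  # Series

# Scaling keeps it a Series; to_frame() converts it to a DataFrame
# whose single column keeps the Series name ("value")
as_frame = (0.5 * per_region).to_frame()
print(type(as_frame).__name__)  # DataFrame
print(as_frame)
```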
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
def dissimilarity(df, col_A, col_B, group_by):
    # Sum Group A and Group B by grouping column
    grouped_sums = df.groupby(group_by)[[col_A, col_B]].sum()
    tmp = pd.merge(df, grouped_sums, left_on=group_by,
                   right_index=True, suffixes=("", "_sum"))
    # Calculate inner expression
    tmp["D"] = abs(____)
    # Calculate Index of Dissimilarity and convert to DataFrame
    return 0.5 * tmp.groupby(group_by)["D"].sum().____

msa_D = dissimilarity(msa_tracts, ____, ____, "msa_name")
print(msa_D.head())