1. Working with numeric data
Time to switch our focus to working with numeric data.
2. The original salaries dataset
So far, we've been looking at a modified version of the data professionals dataset.
Let's print summary information about our original DataFrame.
3. The original salaries dataset
The first thing that jumps out is that the Salary_USD column we've been working with is not present, but there's a column called Salary_In_Rupees, referring to India's currency.
4. Salary in rupees
Previewing this column, we see that the values contain commas, and the data type is object.
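As a rough sketch of that summary and preview, assuming the DataFrame is called salaries (a hypothetical name) and using a few made-up rows in place of the real dataset:

```python
import pandas as pd

# Toy stand-in for the original dataset; the name `salaries` and all values
# are assumptions made for illustration only.
salaries = pd.DataFrame({
    "Experience": ["EN", "SE", "SE", "EX"],
    "Salary_In_Rupees": ["1,200,000", "20,688,070", "8,674,985", "15,000,000"],
})

# Summary information: column names, non-null counts and data types.
salaries.info()

# Preview the rupee column: the values are strings containing commas.
print(salaries["Salary_In_Rupees"].head())
```

Because the values are stored as text, numeric operations such as multiplication or taking a mean will not behave as intended until the column is converted.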
5. Converting strings to numbers
To obtain Salary in USD we'll need to perform a few tasks.
First, we need to remove the commas from the values in the Salary_In_Rupees column.
Next, we change the data type to float.
Lastly, we'll make a new column by converting the currency.
6. Converting strings to numbers
To remove commas, we can use the pandas Series-dot-str-dot-replace method. We first pass the characters we want to remove, followed by the characters to replace them with.
As we don't want to add characters back in, we provide an empty string as the second argument when we update the column.
Printing the first five rows of this column, we see the commas have been removed. However, the column is still object data type.
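A minimal sketch of the comma removal, again on made-up values. The regex=False argument is our addition here, so that the comma is treated as a literal character:

```python
import pandas as pd

# Made-up values standing in for the real column.
salaries = pd.DataFrame(
    {"Salary_In_Rupees": ["20,688,070", "8,674,985", "1,591,390"]}
)

# Replace every comma with an empty string and overwrite the column.
salaries["Salary_In_Rupees"] = salaries["Salary_In_Rupees"].str.replace(
    ",", "", regex=False
)

# The commas are gone, but the values are still strings, not numbers.
print(salaries["Salary_In_Rupees"].head())
```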
7. Converting strings to numbers
We update the data type to float.
We've looked up the conversion rate from Indian rupees to US dollars, and currently one rupee is worth one-point-two cents.
To create the Salary_USD column we multiply the values in the rupees column by zero-point-zero-one-two.
8. Previewing the new column
Printing the first five rows of the original and new column, we can see that values in Salary_USD are equal to one-point-two percent of the Salary_In_Rupees column.
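The type change and the currency conversion could be sketched as follows, using the rate of 0.012 quoted above and made-up rupee values that have already had their commas removed:

```python
import pandas as pd

# Made-up values; commas are assumed to have been removed already.
salaries = pd.DataFrame(
    {"Salary_In_Rupees": ["20688070", "8674985", "1591390"]}
)

# Convert the strings to floats so that arithmetic is possible.
salaries["Salary_In_Rupees"] = salaries["Salary_In_Rupees"].astype(float)

# One rupee is taken to be worth 0.012 US dollars, as stated in the narration.
salaries["Salary_USD"] = salaries["Salary_In_Rupees"] * 0.012

# Preview the original and the new column side by side.
print(salaries[["Salary_In_Rupees", "Salary_USD"]].head())
```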
9. Adding summary statistics into a DataFrame
Recall that we've previously used pandas' groupby function to calculate summary statistics. Here, we find the mean salary in US dollars by company size.
While this is useful, sometimes we might prefer to add summary statistics directly into our DataFrame, rather than creating a summary table.
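For reference, a groupby summary of this kind might look like the sketch below. The column name Company_Size is an assumption, as the narration only says "company size", and the salary values are made up:

```python
import pandas as pd

# Made-up data; `Company_Size` is an assumed column name.
salaries = pd.DataFrame({
    "Company_Size": ["S", "S", "M", "M", "L"],
    "Salary_USD": [40000.0, 60000.0, 100000.0, 120000.0, 90000.0],
})

# One mean per company size: a separate summary table, not a new column.
mean_by_size = salaries.groupby("Company_Size")["Salary_USD"].mean()
print(mean_by_size)
```

The result has one row per group rather than one row per employee, which is why it cannot be assigned straight back as a column.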
10. Adding summary statistics into a DataFrame
Let's say we would like to create a new column containing the standard deviation of Salary_USD, where the value in each row depends on that row's Experience level.
The first step still involves a groupby, done here with the Experience column.
11. Adding summary statistics into a DataFrame
We then select the Salary_USD column,
12. Adding summary statistics into a DataFrame
and call pandas dot-transform.
13. Adding summary statistics into a DataFrame
Inside the transform call, we apply a lambda function using the syntax lambda x colon, followed by a call of x-dot-std. This calculates the standard deviation of salaries for each experience level and returns that value for every row in the group.
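Putting the groupby, column selection and transform steps together, a sketch on made-up data could read as follows. Assigning the result to a column called std_dev is assumed from the column name mentioned later in the narration:

```python
import pandas as pd

# Made-up data for illustration.
salaries = pd.DataFrame({
    "Experience": ["EN", "EN", "SE", "SE", "SE"],
    "Salary_USD": [40000.0, 60000.0, 100000.0, 120000.0, 140000.0],
})

# Group by experience, select the salary column, then broadcast each
# group's standard deviation back to every row of that group.
salaries["std_dev"] = (
    salaries.groupby("Experience")["Salary_USD"].transform(lambda x: x.std())
)

print(salaries)
```

Unlike an aggregation, transform returns a result with the same length and index as the original DataFrame, so it can be stored directly as a new column.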
14. Adding summary statistics into a DataFrame
We can select more than one column and use the value_counts method. This counts how often each combination of values occurs in the columns we have chosen, in this case Experience and the newly created std_dev column.
For example, there are 257 rows with SE, or Senior-level, experience, and the standard deviation in salary for this group is nearly 53000 dollars.
Unsurprisingly, there appears to be a larger variation in salary associated with the most senior role, Executive.
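A sketch of that two-column value_counts call, on made-up data that will not reproduce the figures quoted above:

```python
import pandas as pd

# Made-up data for illustration.
salaries = pd.DataFrame({
    "Experience": ["EN", "EN", "SE", "SE", "SE"],
    "Salary_USD": [40000.0, 60000.0, 100000.0, 120000.0, 140000.0],
})
salaries["std_dev"] = (
    salaries.groupby("Experience")["Salary_USD"].transform(lambda x: x.std())
)

# Select two columns with a list, then count each unique combination.
counts = salaries[["Experience", "std_dev"]].value_counts()
print(counts)
```

Since every row in a group carries the same std_dev value, each experience level appears exactly once in the output, alongside its row count.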
15. Adding summary statistics into a DataFrame
We can repeat this process for other summary statistics!
Here, we add a column for the median salary based on company size. We use a backslash to split our code over two lines, as it would otherwise be quite long and difficult to read.
Previewing the two columns of interest, we see the values have been mapped correctly and that medium-sized companies have the largest median salary!
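The median version might be sketched like this. The column names Company_Size and median_by_comp_size are hypothetical, and the values are made up rather than taken from the course dataset:

```python
import pandas as pd

# Made-up data; both `Company_Size` and `median_by_comp_size` are assumed names.
salaries = pd.DataFrame({
    "Company_Size": ["S", "S", "M", "M", "M", "L"],
    "Salary_USD": [40000.0, 60000.0, 100000.0, 120000.0, 140000.0, 90000.0],
})

# The trailing backslash continues the statement on the next line.
salaries["median_by_comp_size"] = salaries.groupby("Company_Size") \
    ["Salary_USD"].transform(lambda x: x.median())

# Preview the two columns of interest.
print(salaries[["Company_Size", "median_by_comp_size"]].head())
```

Wrapping the right-hand side in parentheses is an alternative to the backslash and achieves the same line split.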
16. Let's practice!
Now it's your turn to explore some numeric data!