Exploring your data using visualizations
1. Exploring your data using visualizations
One of the most important parts of the EDA workflow is data visualization. It helps you better understand your data and allows you to effectively communicate insights to technical and non-technical stakeholders alike.
2. Visualizing data in Python
In Python, the seaborn library allows you to easily create informative and attractive plots. It builds on top of matplotlib, which you may have seen in other courses. Here, we'll use seaborn.
3. Visualizing the distribution of account lengths
Let's say you wanted to visualize the distribution of the account lengths of your customers. Many machine learning algorithms make assumptions about how the data is distributed, so it's important to understand how the variables in your own dataset are distributed before you apply those algorithms. A histogram is an effective way to visualize the distribution of a variable, and you can create one using seaborn's distplot function, which is short for distribution plot. (Note that distplot is deprecated in recent seaborn releases in favor of histplot and displot.) First, import seaborn. Then, pass in the Account Length feature of the telco DataFrame to the distplot function. Remember to call plt.show() to display the plot.
4. Visualizing the distribution of account lengths
You can see here that it resembles a bell curve, also known as the normal distribution. It turns out that many things we measure in the real world are well approximated by the normal distribution, and many models actually make the assumption that your data is normally distributed.
5. Differences in account length
Let's now visualize the differences in account length between churners and non-churners. An effective way to do this is using a box plot, which you can create using seaborn's boxplot function by specifying the x, y, and data parameters as shown here. As you can see, there doesn't appear to be any noticeable difference in account length between the two groups.
6. Differences in account lengths
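The box plot comparing churners and non-churners can be sketched like this. It is a minimal sketch under stated assumptions: the DataFrame is synthetic, and the column names "Churn" and "Account_Length" are illustrative stand-ins for whatever the real telco data uses.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the telco data (column names are assumptions).
rng = np.random.default_rng(1)
n = 500
telco = pd.DataFrame({
    "Churn": rng.choice(["no", "yes"], size=n, p=[0.85, 0.15]),
    "Account_Length": rng.normal(100, 40, n).round().clip(1),
})

# x is the grouping variable, y is the numeric variable to summarize,
# and data is the DataFrame that holds both columns.
ax = sns.boxplot(x="Churn", y="Account_Length", data=telco)

# Display the plot.
plt.show()
```

Because the synthetic account lengths are drawn independently of the churn label, the two boxes here should look similar, which mirrors the observation in the narration.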
The line in the middle of each box represents the median. The colored boxes represent the middle 50% of the account lengths for each group. The values here range from the 25th to the 75th percentile and give a sense of the spread of the distribution. The floating points represent outliers, which you can remove using the "sym" parameter, as shown here.
13. Adding a third variable
Seaborn allows you to easily add a third variable to your plot. For example, we might be interested in visualizing whether the "International Plan" feature has an impact on account length or churn. You can add this information to the plot by specifying the "hue" parameter. From the plot, it looks like as far as predicting churn goes, it does not matter whether or not a customer had an international plan.
14. Let's make some plots!
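The two box plot refinements described earlier, hiding the outlier markers and adding a third variable with hue, can be sketched together as follows. This is a sketch under assumptions: the data is synthetic, and the column names "Churn", "Intl_Plan", and "Account_Length" are illustrative. It also swaps in matplotlib's showfliers=False in place of the "sym" parameter mentioned in the narration, since "sym" may not be accepted by every seaborn version, while showfliers should have the same effect of hiding the outlier points.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the telco data (column names are assumptions).
rng = np.random.default_rng(2)
n = 500
telco = pd.DataFrame({
    "Churn": rng.choice(["no", "yes"], size=n, p=[0.85, 0.15]),
    "Intl_Plan": rng.choice(["no", "yes"], size=n, p=[0.9, 0.1]),
    "Account_Length": rng.normal(100, 40, n).round().clip(1),
})

# hue splits each Churn group by international plan status;
# showfliers=False hides the outlier markers.
ax = sns.boxplot(
    x="Churn",
    y="Account_Length",
    hue="Intl_Plan",
    data=telco,
    showfliers=False,
)

# Display the plot.
plt.show()
```

Each churn group now shows two boxes, one per international plan value, with a legend identifying them.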
In the exercises, you will visualize the distributions of other features and investigate their influence on churn. Happy plotting!