
Extract a dataset

1. Extract a dataset

Now you have computed a whole bunch of features and added them to your network object. It is time to start using those features in predictive modeling. But first, you have to turn your network object into a tabular dataset and make it ready for predictive modeling.

2. Getting the dataset

In the last chapter, you generated features and added them to the network object. Here you can see some of the features that were added to the network of data scientists, `g`. There are a few basic network features: degree, triangles, betweenness, and transitivity. There are also a couple of link-based features: the number of R neighbors and the average age. Finally, there are two PageRank features. When you inspect the network object `g`, you can see that all these features (and some others) are indeed part of the network. Under the attr component, you can see degree, triangles, betweenness, and so on.

3. Getting the dataset

But how do we get these features out of the network object and into a data frame? That's simple. We use the `igraph` function `as_data_frame`, as you can see here. It takes as arguments the network object, in this case `g`, and a second argument called `what`, which specifies whether we want the node attributes or the edge attributes. We want the node attributes, so we specify `what = "vertices"`. The result is the data frame you see here. Each column in this data frame is one of the features we added to the network object, and each row corresponds to a person in the network of data scientists. This is a dataset that we can explore, preprocess, and mine.
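As a minimal sketch of this step, the example below builds a small toy network in place of the data scientists network and extracts its node attributes. The toy graph and the three features added to it are assumptions made for illustration; only `as_data_frame` with `what = "vertices"` comes from the lesson itself.

```r
library(igraph)

# Hypothetical toy network standing in for the data scientists network `g`
g <- graph_from_literal(A - B, B - C, C - A, C - D)

# Add a few node-level features as vertex attributes
V(g)$degree <- degree(g)
V(g)$triangles <- count_triangles(g)
V(g)$betweenness <- betweenness(g)

# Extract the node attributes into a data frame: one row per node,
# one column per vertex attribute (the `name` column holds the node labels)
dataset <- igraph::as_data_frame(g, what = "vertices")
print(dataset)
```

The call is written as `igraph::as_data_frame` because other packages also export a function with that name, so the explicit prefix avoids picking up the wrong one.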

4. Preprocessing - missing values

Now, real-life data is not always pretty. It is usually noisy and dirty, with missing values and outliers that can have a great effect during model development and distort the results. First, missing values occur for a few different reasons. The data may be not applicable, for example the churn date of a customer who hasn't churned. It may be undisclosed, such as income. Or an error may have occurred when the data was being handled. There are various approaches to dealing with missing values, for example replacing them through imputation procedures or, when there are many missing values, removing the respective feature or observations. Since our dataset is the result of the featurization process, missing values in a certain variable could have arisen while the feature was being computed. So it makes sense to check for missing values by variable. If `dataset` is the data frame we extracted from the network object, then we can count the number of missing values in the column degree with `sum(is.na(dataset$degree))`. `is.na` returns a logical vector with TRUE where there is a missing value and FALSE otherwise. The `sum` function adds the logical values together, which gives the number of missing values in the degree column.
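The check described above can be sketched as follows. The small data frame and its `pagerank` column are hypothetical stand-ins for the extracted `dataset`; the `colSums` line is an additional convenience, not part of the lesson, for running the same check on every column at once.

```r
# Hypothetical data frame standing in for the extracted `dataset`
dataset <- data.frame(
  name = c("A", "B", "C", "D"),
  degree = c(2, NA, 3, 1),
  pagerank = c(0.2, 0.3, NA, NA)
)

# is.na() gives TRUE/FALSE per entry; sum() counts the TRUE values
n_missing_degree <- sum(is.na(dataset$degree))
print(n_missing_degree)

# The same count for every column in one call
missing_by_column <- colSums(is.na(dataset))
print(missing_by_column)
```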

5. Preprocessing - correlated variables

For some classification techniques, it is important to remove correlated variables, since they can cause unstable models. We use the `corrplot` package to visualize the correlation among the variables. First we compute `M`, the correlation matrix of our dataset. Notice that we skip the first column, since it holds the names of the observations and is therefore not meaningful for the correlations. Then we plot the correlation matrix, and here you see the result. In the correlation plot, blue represents a positive correlation and red a negative correlation. A darker color indicates a stronger correlation. You can see, for example, that degree and PageRank have a high positive correlation. Based on this plot, we can decide which variables to remove from the dataset. There is no strict rule on how high the correlation must be, but 0.9 is often used as a reference.
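A minimal sketch of the correlation check, assuming a simulated dataset in place of the real one: the column names, the simulated values, and the `method = "circle"` choice are illustrative assumptions. The last step, listing pairs above the 0.9 reference value, is one possible way to act on the plot rather than something the lesson prescribes.

```r
library(corrplot)

# Hypothetical dataset: the first column holds the observation names
set.seed(1)
degree <- rpois(50, 5)
dataset <- data.frame(
  name = paste0("p", 1:50),
  degree = degree,
  pagerank = degree / sum(degree) + rnorm(50, sd = 0.002),
  averageAge = rnorm(50, mean = 35, sd = 5)
)

# Correlation matrix of all columns except the first (the names)
M <- cor(dataset[, -1])

# Plot the matrix; with the default palette, blue is positive and red is negative
corrplot(M, method = "circle")

# List variable pairs whose absolute correlation exceeds the 0.9 reference value
# (upper.tri keeps each pair once and skips the diagonal)
high_pairs <- which(abs(M) > 0.9 & upper.tri(M), arr.ind = TRUE)
print(high_pairs)
```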

6. Let's practice!

Now it's time for you to turn the churn network into a tabular dataset and preprocess it.