Challenges of network-based inference
1. Challenges of network-based inference
Now that you are familiar with predicting labels in networked data, we will zoom in on three main challenges of network learning.
2. First challenge
In supervised learning, it is standard practice, and highly recommended, to evaluate model performance on a test set or using cross-validation. Typically, the dataset is split into training and test sets by randomly selecting 60 to 80 percent of the observations for the training set and keeping the remaining observations for the test set. With networked data, where the observations are connected, splitting the data becomes problematic. Take, for example, the network of data scientists. In the R code, we randomly select 60% of the nodes to construct a training network; the remaining 40% form the test network. Here you can see the training and test networks resulting from this random split. The two subnetworks are clearly very different from the original network and from each other. One way around this problem is to train the model on one network and evaluate it on another, independent network. Alternatively, you can build a flat dataset by featurizing the network, as you will see later in this course.
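Below is a minimal sketch of such a random node split, assuming an igraph object called network; the object name and the edge-count comparison are illustrative, not the course's actual code.

# A minimal sketch of the random node split, assuming an igraph object
# called `network`; the object name is illustrative, not from the course.
library(igraph)

set.seed(42)

# Randomly select 60% of the nodes for the training network
n_nodes   <- vcount(network)
train_ids <- sample(seq_len(n_nodes), size = round(0.6 * n_nodes))
test_ids  <- setdiff(seq_len(n_nodes), train_ids)

# Induce the two subnetworks; edges crossing the split are dropped,
# which is why both can look very different from the original network
train_network <- induced_subgraph(network, train_ids)
test_network  <- induced_subgraph(network, test_ids)

# Compare edge counts to see how much structure is lost in the split
c(original = ecount(network),
  training = ecount(train_network),
  test     = ecount(test_network))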
3. Second challenge
The second challenge we come across when working with networked data is that the observations may not be independent and identically distributed, or iid. On the contrary, since they are connected, there are dependencies and correlated behavior between them, which means that the label of one node depends on the labels of other nodes through the edges. Let's take another look at the network of data scientists, and assume that both F and G have unknown labels. As you can see, G has two blue and two green neighbors. G is also connected to F, who has an unknown label. If we use the relational neighbor classifier from before to infer a label for G, whether G is blue or green depends on F: if F is blue, then G is also blue, but if F is green, then G is also green. So G and F are not independent.
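To make this dependency concrete, here is a minimal sketch in the spirit of the relational neighbor classifier, assuming a small vector of G's neighbor labels in which F's label is unknown; the labels are illustrative.

# A minimal sketch of the dependency between F and G, assuming a vector of
# G's neighbor labels in which F's entry is unknown (NA); values are illustrative.
neighbors_of_G <- c("blue", "blue", "green", "green", NA)  # the NA is node F

# Relational neighbor idea: predict the majority class among labeled neighbors
predict_label <- function(neighbor_labels) {
  counts <- table(neighbor_labels)  # table() drops NA by default
  names(counts)[which.max(counts)]
}

# G's prediction flips with F's label, so the two nodes are not independent
predict_label(c(neighbors_of_G[1:4], "blue"))   # F blue  -> G predicted "blue"
predict_label(c(neighbors_of_G[1:4], "green"))  # F green -> G predicted "green"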
4. Third challenge
This brings us to the final challenge: collective inferencing. Given a semi-labeled network, can we predict the labels of all the unlabeled nodes? This is what collective inference procedures do. They infer a set of class labels or probabilities for the unknown nodes, taking into account the fact that inferences about nodes can mutually affect one another. They are designed so that the whole network is updated simultaneously. As a result, long-distance propagation is possible: the influence of a specific label, such as fraud, may affect nodes far away, not only those in the nearest neighborhood. Examples of collective inference procedures are Gibbs sampling, iterative classification, and relaxation labeling.
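As an illustration of the idea, here is a minimal sketch of a simple collective inference loop in the spirit of iterative classification; it assumes an igraph object network with a prob vertex attribute that is known for labeled nodes and NA for unknown ones, and it is not a verbatim implementation of any of the named procedures.

# A minimal sketch of a collective inference loop; assumes an igraph object
# `network` with a vertex attribute `prob` holding known class probabilities
# for labeled nodes and NA for unlabeled nodes (assumed names).
library(igraph)

collective_inference <- function(network, n_iter = 10) {
  prob    <- V(network)$prob
  unknown <- which(is.na(prob))
  prob[unknown] <- 0.5                     # neutral starting point

  for (iter in seq_len(n_iter)) {
    new_prob <- prob
    # Every unknown node is re-estimated from its neighbors' current values,
    # so the influence of a label can propagate far beyond its neighborhood
    for (v in unknown) {
      nb <- as.numeric(neighbors(network, v))
      if (length(nb) > 0) new_prob[v] <- mean(prob[nb])
    }
    prob <- new_prob                       # update all unknown nodes at once
  }
  prob
}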
5. Probabilistic relational neighbor classifier
We end this lesson with another network classifier, the probabilistic relational neighbor classifier. It is similar to the relational neighbor classifier from before, except that the nodes in the neighborhood no longer belong to specific classes. Instead, each has a probability of belonging to each of the two classes, as you see in the figure here. The node at the top has a 90% chance of being a churner and a 10% chance of being a non-churner. The churn probability of the node in the center is computed by adding together the churn probabilities of the neighboring nodes and dividing by the number of neighbors, as demonstrated in the R code.
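A minimal sketch of that computation is shown below, assuming a vector of churn probabilities for the center node's neighbors; the values are illustrative and not taken from the course figure.

# A minimal sketch of the probabilistic relational neighbor computation;
# the neighbor churn probabilities below are illustrative values.
neighbor_churn_probs <- c(0.9, 0.4, 0.25, 0.6)

# Churn probability of the center node: sum the neighbors' churn
# probabilities and divide by the number of neighbors
churn_prob_center <- sum(neighbor_churn_probs) / length(neighbor_churn_probs)
churn_prob_center   # 0.5375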
6. Let's practice!
Now it's your turn. In the exercises, you will perform collective inferencing yourself and see how it affects model performance.