
Data fusion

1. Data fusion

Sometimes a labeled dataset is not ready for you, so instead you have to create it from raw data. This poses several challenges, one of which is having to combine information from multiple data sources, a process known as data fusion.

2. Computers, ports, and protocols

But first, here is a brief introduction to computer security jargon, to prepare you for a brand-new dataset! Computers communicate with each other through different ports. Moreover, data sent from one port on the source computer to another on the destination computer is not all sent at once, but rather in small chunks, known as packets.

3. The LANL cyber dataset

The result is known as a network flow: data transfer in packets between a port on a source computer and a port on a destination computer, following a certain protocol. Here is an example dataset. Each event contains information on the source, the destination, the protocol used, the number of packets and the bytes transferred.
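A minimal sketch of what such a flow dataset might look like as a pandas DataFrame. The column names and values here are illustrative, not the exact LANL schema:

```python
import pandas as pd

# Hypothetical network-flow events; one row per flow between a
# source port and a destination port, using a given protocol
flows = pd.DataFrame({
    "source_computer":      ["C1", "C1", "C2"],
    "source_port":          [5012, 5013, 6100],
    "destination_computer": ["C2", "C3", "C3"],
    "destination_port":     [80, 443, 22],
    "protocol":             [6, 6, 6],
    "packet_count":         [10, 4, 7],
    "byte_count":           [1200, 310, 860],
})
```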

4. The LANL cyber dataset

Now consider a second dataset, called attacks. This contains information about attacks performed by the security team itself during a test. The two datasets concern the same set of computers, and can therefore be fused using the source or destination computer IDs.
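One way to sketch the fusion key, assuming a hypothetical attacks DataFrame with source and destination columns: collect every computer ID that appears on either side of an attack.

```python
import pandas as pd

# Hypothetical red-team attack log; columns are illustrative
attacks = pd.DataFrame({
    "source_computer":      ["C1", "C4"],
    "destination_computer": ["C3", "C1"],
})

# The fusion key: any computer ID appearing as source OR destination
attacked = set(attacks["source_computer"]) | set(attacks["destination_computer"])
```

This set can then be matched against the computer IDs in the flows dataset.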

5. Labeling events versus labeling computers

When analyzing data at this level of detail, security analysts attach labels to whole computers rather than individual events. For example, if some malware is looking for an open port in a target computer, an attack known as portscan, it will contact all possible ports until it finds an open one. This behavior is quite obvious when looking at all the data from one computer, but not when looking at individual events.

6. Group and featurize

Your unit of analysis is therefore an individual computer. So let's first group the DataFrame by destination_computer. This yields one DataFrame for each computer. Note that the .groupby() method from pandas returns a lazy grouped object, so if you want to access one of its elements you have to call list() on it first.
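The grouping step can be sketched like this, on a small hypothetical flows DataFrame:

```python
import pandas as pd

flows = pd.DataFrame({
    "destination_computer": ["C2", "C3", "C2"],
    "destination_port":     [80, 22, 443],
})

# .groupby() returns a lazy object; call list() to inspect the groups,
# which come back as (key, sub-DataFrame) pairs, sorted by key
groups = list(flows.groupby("destination_computer"))
name, group = groups[0]
```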

7. Group and featurize

The next step is to convert each of these DataFrames to feature vectors. This is a form of feature engineering, except that the feature extractors now take as input an entire DataFrame. You can start by computing the number of unique destination ports - use sets to do this efficiently. Then calculate the average number of packets transferred per session, and the average duration of each session.
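A sketch of such a feature extractor, assuming hypothetical column names (destination_port, packet_count, duration):

```python
import pandas as pd

# Hypothetical flows observed for a single destination computer
group = pd.DataFrame({
    "destination_port": [80, 443, 80],
    "packet_count":     [10, 4, 6],
    "duration":         [2.0, 1.0, 3.0],
})

def featurize(df):
    """Summarize one computer's flows as a fixed-length feature vector."""
    return {
        "unique_ports":     len(set(df["destination_port"])),  # sets deduplicate cheaply
        "average_packet":   df["packet_count"].mean(),
        "average_duration": df["duration"].mean(),
    }

features = featurize(group)
```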

8. Group and featurize

Applying this function on each group yields an iterator containing one vector per group, all of the same length. This can be easily transformed into a pandas DataFrame, by calling list() on the iterator and indexing by the destination computer IDs. The result is what you wanted: one row per destination_computer!
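Putting the grouping and featurizing steps together, a minimal sketch (again with hypothetical column names) might look like:

```python
import pandas as pd

flows = pd.DataFrame({
    "destination_computer": ["C2", "C2", "C3"],
    "destination_port":     [80, 443, 22],
    "packet_count":         [10, 4, 7],
})

def featurize(df):
    return {
        "unique_ports":   len(set(df["destination_port"])),
        "average_packet": df["packet_count"].mean(),
    }

# One (key, vector) pair per group; the keys become the row index
pairs = [(name, featurize(group))
         for name, group in flows.groupby("destination_computer")]
keys, vectors = zip(*pairs)
X = pd.DataFrame(list(vectors), index=list(keys))
```

The result has exactly one row per destination computer, indexed by its ID.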

9. Labeled dataset

We now have a standard training dataset where each example is described by a feature vector. To produce labels for these examples, we check whether the destination computer ID, stored in the "index" attribute of our DataFrame, appears as either a source or destination of an attack in our second dataset. We can now apply our standard workflow, with an AdaBoostClassifier, on the labeled data. This yields an accuracy of 0.92.
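The labeling and fitting steps can be sketched as follows. The feature values and computer IDs are made up for illustration, so the resulting accuracy will not match the 0.92 from the real dataset:

```python
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier

# Hypothetical feature matrix, one row per destination computer
X = pd.DataFrame(
    {"unique_ports":   [2, 150, 3, 140],
     "average_packet": [8.0, 1.1, 7.5, 1.3]},
    index=["C1", "C2", "C3", "C4"],
)

# Computers appearing as source or destination in the attacks data
attacked = {"C2", "C4"}
y = X.index.isin(list(attacked)).astype(int)  # 1 = attacked, 0 = clean

clf = AdaBoostClassifier().fit(X, y)
train_acc = clf.score(X, y)
```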

10. Ready to catch a hacker?

Real-life machine learning often involves data fusion across many datasets. Working closely with the analyst, you can still turn such problems into a standard supervised learning task using the techniques from this lesson! Time to practice this skill in the next exercise.