Module 5: Anomaly and Intrusion Detection with Machine Learning#

Why do we need anomaly detection in Cyber Security?#

Detecting and mitigating zero-day vulnerabilities is crucial for maintaining robust cybersecurity. A zero-day vulnerability represents a hidden weakness in a computer system, exploitable by attackers and undetected by affected parties. To comprehend the potential impact, consider the scenario where an organization falls victim to a zero-day exploit.

Identifying abnormal network behavior is instrumental in fortifying organizations against zero-day attacks. This document provides insights into various approaches to achieve effective anomaly detection.

Technical Explanation#

Anomaly detection, also known as outlier detection, is the task of recognizing unexpected events, observations, or items that deviate significantly from the norm. Some anomalous data is easy to identify because it breaks established rules, for example by exceeding a predefined threshold.
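As a minimal illustration of such threshold-based flagging (purely synthetic data, unrelated to the case study below), points whose z-score exceeds a fixed cut-off can be marked as anomalous:

import numpy as np

rng = np.random.default_rng(0)
# Mostly "normal" traffic values plus two injected outliers
values = np.concatenate([rng.normal(100, 5, 500), [160.0, 40.0]])

z_scores = (values - values.mean()) / values.std()
threshold = 3.0  # common rule of thumb; tune per application
print("Anomalous indices:", np.where(np.abs(z_scores) > threshold)[0])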

Properties of Anomalies:

  • Anomalies in data occur infrequently.

  • Features of data anomalies are markedly different from those of normal instances.

Examples of Anomalies:

  • Network anomalies (e.g., intrusion detection)

  • Application performance anomalies

  • Web application security anomalies (e.g., XSS attacks, DDoS attacks, unexpected login attempts)

Refer to the time series graph below, which shows an unexpected drop in network usage (anomalous behaviour).

Challenges in Anomaly Detection:

  • Defining normal behavior

  • Handling imbalanced distribution of normal and abnormal data

  • Sparse occurrence of abnormal events

  • Appropriate feature extraction

  • Handling noise (distinct from anomalies)

  • Future anomalies may differ significantly from training set examples.

Testing for Anomalous Points:

  • Supervised training requires labeled anomalous data points.

  • Unsupervised learning relies on distances or cluster densities to estimate normalcy and outliers (a minimal sketch follows after this list).

Feature Selection Criteria:

  • Choose features that take unusually high or low values during anomaly events.
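A minimal unsupervised sketch on synthetic data (the case study below uses supervised models instead): IsolationForest flags outliers without needing any ground-truth labels.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(500, 2))    # dense cluster of "normal" points
outliers = rng.uniform(6, 8, size=(5, 2))   # a few far-away points
X = np.vstack([normal, outliers])

iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
pred = iso.predict(X)                        # +1 = inlier, -1 = outlier
print("Flagged as outliers:", np.where(pred == -1)[0])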

Real-Time Anomaly Detection#

In communication networks, real-time processing is vital for detecting correlated traffic indicative of anomalous behavior, such as DDoS or zero-day attacks.

Performance Evaluation Criteria: Given the class imbalance, accuracy alone is not a reliable metric; confusion matrix analysis (precision and recall) is recommended for a more nuanced evaluation.
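As an illustration with hypothetical counts, precision and recall derived from the confusion matrix expose what accuracy hides when positives are rare:

from sklearn.metrics import classification_report

# Hypothetical: 990 benign flows and 10 attacks, of which the model catches 7
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 985 + [1] * 5 + [0] * 3 + [1] * 7
# Accuracy is about 0.99, yet recall on the attack class is only 0.7
print(classification_report(y_true, y_pred, target_names=["benign", "attack"]))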

Applications: Anomaly-based network intrusion detection systems, credit card fraud detection, and malware detection are among the diverse applications. Practical examples and demonstrations, including a zero-day attack demo, are available in the provided resources.

In the picture below, red points indicate anomalies relative to previously seen data points.

Noise Filtering: Prior to applying anomaly detection techniques, noise filtering is essential to reduce false positives. A recommended approach is discussed in the referenced paper.
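A simple sketch of such pre-filtering (not the specific method from the referenced paper): a rolling-median filter suppresses isolated spikes so that only sustained deviations remain.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
traffic = pd.Series(100 + rng.normal(0, 2, 200))
traffic.iloc[50] = 180        # a single noisy spike (should be filtered out)
traffic.iloc[120:140] = 160   # a sustained shift (a genuine anomaly)

smoothed = traffic.rolling(window=5, center=True).median()
deviating = (smoothed - 100).abs() > 20   # 100 is the assumed baseline of this toy series
print("Indices with a sustained deviation:", np.where(deviating)[0])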

Case Study: Detecting Intrusion with Machine Learning#

Training data: “Intrusion Detection Evaluation Dataset” (CICIDS2017). Description page: https://www.unb.ca/cic/datasets/ids-2017.html

The data set is public. Download link: http://205.174.165.80/CICDataset/CIC-IDS-2017/Dataset/

CICIDS2017 combines 8 files recorded on different days of observation (PCAP + CSV). Used archive: http://205.174.165.80/CICDataset/CIC-IDS-2017/Dataset/GeneratedLabelledFlows.zip

From the downloaded archive GeneratedLabelledFlows.zip, the Thursday file Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv is selected.

Sources:

  • [Sharafaldin2018] Iman Sharafaldin, Arash Habibi Lashkari and Ali A. Ghorbani. Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization. 2018

  • [Kostas2018] Kahraman Kostas. Anomaly Detection in Networks Using Machine Learning. 2018 (an error was found in its feature-importance assessment; see below)

  • bozbil/Anomaly-Detection-in-Networks-Using-Machine-Learning (the same feature-importance error)

Data preprocessing#

import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
from IPython.display import display

%matplotlib inline

The prepared CSV is read directly from GitHub below (in Google Colab no manual download is needed). Install graphviz for the decision-tree visualization used later.

!pip install graphviz
Requirement already satisfied: graphviz in /home/soraxas/micromamba/envs/wsu/lib/python3.9/site-packages (0.20.1)

We pass encoding="cp1252" when reading the CSV to avoid the “UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0x96 in position 11: invalid start byte” error (byte 0x96 is a Windows-1252 en dash used in the attack labels).

df = pd.read_csv(
    "https://raw.githubusercontent.com/WSU-AI-CyberSecurity/data/master/CICIDS2017-thursday.pcap_ISCX.csv",
    encoding="cp1252",
    low_memory=False,
)
df
Flow ID Source IP Source Port Destination IP Destination Port Protocol Timestamp Flow Duration Total Fwd Packets Total Backward Packets ... min_seg_size_forward Active Mean Active Std Active Max Active Min Idle Mean Idle Std Idle Max Idle Min Label
0 192.168.10.3-192.168.10.50-389-33898-6 192.168.10.50 33898 192.168.10.3 389 6 6/7/2017 8:59 113095465 48 24 ... 32 203985.500 5.758373e+05 1629110.0 379.0 13800000.0 4.277541e+06 16500000.0 6737603.0 BENIGN
1 192.168.10.3-192.168.10.50-389-33904-6 192.168.10.50 33904 192.168.10.3 389 6 6/7/2017 8:59 113473706 68 40 ... 32 178326.875 5.034269e+05 1424245.0 325.0 13800000.0 4.229413e+06 16500000.0 6945512.0 BENIGN
2 8.0.6.4-8.6.0.1-0-0-0 8.6.0.1 0 8.0.6.4 0 0 6/7/2017 8:59 119945515 150 0 ... 0 6909777.333 1.170000e+07 20400000.0 6.0 24400000.0 2.430000e+07 60100000.0 5702188.0 BENIGN
3 192.168.10.14-65.55.44.109-59135-443-6 192.168.10.14 59135 65.55.44.109 443 6 6/7/2017 8:59 60261928 9 7 ... 20 0.000 0.000000e+00 0.0 0.0 0.0 0.000000e+00 0.0 0.0 BENIGN
4 192.168.10.3-192.168.10.14-53-59555-17 192.168.10.14 59555 192.168.10.3 53 17 6/7/2017 8:59 269 2 2 ... 32 0.000 0.000000e+00 0.0 0.0 0.0 0.000000e+00 0.0 0.0 BENIGN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
170361 157.240.18.35-192.168.10.51-443-55641-6 157.240.18.35 443 192.168.10.51 55641 6 6/7/2017 12:59 49 1 3 ... 20 0.000 0.000000e+00 0.0 0.0 0.0 0.000000e+00 0.0 0.0 BENIGN
170362 192.168.10.51-199.16.156.120-45337-443-6 199.16.156.120 443 192.168.10.51 45337 6 6/7/2017 12:59 217 2 1 ... 32 0.000 0.000000e+00 0.0 0.0 0.0 0.000000e+00 0.0 0.0 BENIGN
170363 192.168.10.12-192.168.10.50-60148-22-6 192.168.10.12 60148 192.168.10.50 22 6 6/7/2017 12:59 1387547 41 46 ... 32 0.000 0.000000e+00 0.0 0.0 0.0 0.000000e+00 0.0 0.0 BENIGN
170364 192.168.10.12-192.168.10.50-60146-22-6 192.168.10.12 60146 192.168.10.50 22 6 6/7/2017 12:59 207 1 1 ... 32 0.000 0.000000e+00 0.0 0.0 0.0 0.000000e+00 0.0 0.0 BENIGN
170365 192.168.10.12-192.168.10.50-60146-22-6 192.168.10.50 22 192.168.10.12 60146 6 6/7/2017 12:59 50 1 2 ... 32 0.000 0.000000e+00 0.0 0.0 0.0 0.000000e+00 0.0 0.0 BENIGN

170366 rows × 85 columns

df.shape
(170366, 85)

The columns “Fwd Header Length” and “Fwd Header Length.1” are identical, so the second one is removed; 84 columns remain.

df.columns = df.columns.str.strip()
df = df.drop(columns=["Fwd Header Length.1"])
df.shape
(170366, 84)

Assessing the distribution of labels shows that the vast majority of the 170366 records are “BENIGN” (benign background traffic).

df["Label"].unique()
array(['BENIGN', 'Web Attack – Brute Force', 'Web Attack – XSS',
       'Web Attack – Sql Injection'], dtype=object)
df["Label"].value_counts()
Label
BENIGN                        168186
Web Attack – Brute Force        1507
Web Attack – XSS                 652
Web Attack – Sql Injection        21
Name: count, dtype: int64

Delete records with a missing “Flow ID” (this file has none, so the shape is unchanged).

df = df.drop(df[pd.isnull(df["Flow ID"])].index)
df.shape
(170366, 84)

The “Flow Bytes/s” and “Flow Packets/s” columns contain non-numerical values (“Infinity”); replace them and convert the columns to numeric.

df.replace("Infinity", -1, inplace=True)
df[["Flow Bytes/s", "Flow Packets/s"]] = df[["Flow Bytes/s", "Flow Packets/s"]].apply(
    pd.to_numeric
)

Replace the NaN values and infinity values with -1.

df.replace([np.inf, -np.inf, np.nan], -1, inplace=True)

Convert the string features to numbers. We use LabelEncoder rather than OneHotEncoder, since high-cardinality columns such as Flow ID would otherwise expand into a huge number of columns.

string_features = list(df.select_dtypes(include=["object"]).columns)
string_features.remove("Label")
string_features
['Flow ID', 'Source IP', 'Destination IP', 'Timestamp']
le = preprocessing.LabelEncoder()
df[string_features] = df[string_features].apply(lambda col: le.fit_transform(col))

Undersampling to correct class imbalance#

The dataset is imbalanced: 170366 total records, 168186 “BENIGN” records, and far fewer attack records: 1507 + 652 + 21 = 2180.

benign_total = len(df[df["Label"] == "BENIGN"])
benign_total
168186
attack_total = len(df[df["Label"] != "BENIGN"])
attack_total
2180
df.to_csv("web_attacks_unbalanced.csv", index=False)
df["Label"].value_counts()
Label
BENIGN                        168186
Web Attack – Brute Force        1507
Web Attack – XSS                 652
Web Attack – Sql Injection        21
Name: count, dtype: int64

We use undersampling to correct class imbalances: we remove most of the “BENIGN” records.

Form a balanced dataset web_attacks_balanced.csv with proportions of 30% attacks (2180 records) and 70% benign data (2180 / 30 * 70 ≈ 5087 records).

Algorithm to form a balanced df_balanced dataset:

  • All the records with the attacks are copied to the new dataset.

  • There are two conditions for copying “BENIGN” records to the new dataset:

    1. The next record is copied with probability benign_inc_probability.

    2. The total number of “BENIGN” records must not exceed the limit of 5087 records.

Calculate the probability of copying a “BENIGN” record. The enlargement multiplier slightly raises this probability so that the cap of 5087 records (70% benign data) is actually reached.

enlargement = 1.1
benign_included_max = attack_total / 30 * 70
benign_inc_probability = (benign_included_max / benign_total) * enlargement
print(benign_included_max, benign_inc_probability)
5086.666666666667 0.03326872232726466

Copy records from df to df_balanced, save dataset web_attacks_balanced.csv.

import random

indexes = []
benign_included_count = 0
for index, row in df.iterrows():
    if row["Label"] != "BENIGN":
        indexes.append(index)
    else:
        # Copying with benign_inc_probability
        if random.random() > benign_inc_probability:
            continue
        # Have we achieved 70% (5087 records)?
        if benign_included_count > benign_included_max:
            continue
        benign_included_count += 1
        indexes.append(index)
df_balanced = df.loc[indexes]
df_balanced["Label"].value_counts()
Label
BENIGN                        5087
Web Attack – Brute Force      1507
Web Attack – XSS               652
Web Attack – Sql Injection      21
Name: count, dtype: int64
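For reference, an equivalent and more concise undersampling can be done with pandas sampling (a sketch; it differs from the loop above only in which benign rows happen to be kept):

# Assumption: df, the attack records and benign_included_max are as defined above
benign_sample = df[df["Label"] == "BENIGN"].sample(
    n=int(benign_included_max), random_state=42
)
attacks = df[df["Label"] != "BENIGN"]
df_balanced_alt = pd.concat([attacks, benign_sample]).sort_index()
print(df_balanced_alt["Label"].value_counts())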

Preparing data for training#

df = df_balanced.copy()

The Label column is encoded as follows: “BENIGN” = 0, attack = 1.

df["Label"] = df["Label"].apply(lambda x: 0 if x == "BENIGN" else 1)

7 features (Flow ID, Source IP, Source Port, Destination IP, Destination Port, Protocol, Timestamp) are excluded from the dataset. The hypothesis is that the “shape” of the data being transmitted is more important than these attributes. In addition, ports and addresses can be substituted by an attacker, so it is better that the ML algorithm does not take these features into account in training [Kostas2018].

excluded = [
    "Flow ID",
    "Source IP",
    "Source Port",
    "Destination IP",
    "Destination Port",
    "Protocol",
    "Timestamp",
]
df = df.drop(columns=excluded, errors="ignore")

At the feature-importance stage below, the “Init_Win_bytes_backward” feature receives the highest importance. After inspecting the source dataset, it appears that an inaccuracy was made when the dataset was generated.

It turns out that a fairly accurate classification is possible from this one feature alone (see the check below), so both initial-window features are excluded from training.

Description of features: http://www.netflowmeter.ca/netflowmeter.html

 Init_Win_bytes_backward - The total number of bytes sent in initial window in the backward direction
 Init_Win_bytes_forward - The total number of bytes sent in initial window in the forward direction
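As a quick sanity check of this claim, a minimal sketch (assuming df at this point still contains "Init_Win_bytes_backward" and the 0/1 "Label" column) fits a one-split decision stump on that single feature:

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

single_feature = df[["Init_Win_bytes_backward"]].values
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
scores = cross_val_score(stump, single_feature, df["Label"].values, cv=5)
print("Single-feature CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))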
if "Init_Win_bytes_backward" in df.columns:
    df["Init_Win_bytes_backward"].hist(figsize=(6, 4), bins=10)
    plt.title("Init_Win_bytes_backward")
    plt.xlabel("Value bins")
    plt.ylabel("Density")
    plt.savefig("Init_Win_bytes_backward.png", dpi=300)
if "Init_Win_bytes_forward" in df.columns:
    df["Init_Win_bytes_forward"].hist(figsize=(6, 4), bins=10)
    plt.title("Init_Win_bytes_forward")
    plt.xlabel("Value bins")
    plt.ylabel("Density")
    plt.savefig("Init_Win_bytes_forward.png", dpi=300)
excluded2 = ["Init_Win_bytes_backward", "Init_Win_bytes_forward"]
df = df.drop(columns=excluded2, errors="ignore")
y = df["Label"].values
X = df.drop(columns=["Label"])
print(X.shape, y.shape)
(7267, 74) (7267,)

Feature importance#

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

unique, counts = np.unique(y_train, return_counts=True)
dict(zip(unique, counts))
{0: 3571, 1: 1515}

Visualization of the decision tree, importance evaluation using a single tree (DecisionTreeClassifier)#

To begin with we use a single tree, for convenience of visualizing the classifier. High cross-validation scores even with only 5 leaves look suspiciously good, so we should look at the data carefully. Parameters to vary: test_size in the cell above (train_test_split) and max_leaf_nodes in the cell below.

By changing the random_state parameter we will get different trees, with different features receiving the highest importance; the random forest below averages over individual trees.
By changing the random_state parameter, we will get different trees and different features with the highest importance. But the forest will already average individual trees below.

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

decision_tree = DecisionTreeClassifier(max_leaf_nodes=5, random_state=0)
decision_tree = decision_tree.fit(X_train, y_train)
cross_val_score(decision_tree, X_train, y_train, cv=10)
array([0.96463654, 0.95284872, 0.95874263, 0.97249509, 0.95284872,
       0.94695481, 0.94685039, 0.97244094, 0.96259843, 0.96259843])
from sklearn.tree import export_text

r = export_text(decision_tree, feature_names=X_train.columns.to_list())
print(r)
|--- Max Packet Length <= 3.00
|   |--- Fwd IAT Std <= 2454249.88
|   |   |--- Bwd Packets/s <= 10256.68
|   |   |   |--- class: 0
|   |   |--- Bwd Packets/s >  10256.68
|   |   |   |--- class: 0
|   |--- Fwd IAT Std >  2454249.88
|   |   |--- class: 1
|--- Max Packet Length >  3.00
|   |--- Total Length of Fwd Packets <= 32821.50
|   |   |--- class: 0
|   |--- Total Length of Fwd Packets >  32821.50
|   |   |--- class: 1
from graphviz import Source
from sklearn import tree

Source(tree.export_graphviz(decision_tree, out_file=None, feature_names=X.columns))
(This cell raises graphviz.backend.ExecutableNotFound — “failed to execute PosixPath('dot'), make sure the Graphviz executables are on your systems' PATH” — when only the Python graphviz package is installed; rendering also requires the system Graphviz package that provides the dot binary.)

Analyze the confusion matrix. Which classes are confidently classified by the model?

unique, counts = np.unique(y_test, return_counts=True)
dict(zip(unique, counts))
{0: 1516, 1: 665}
from sklearn.metrics import confusion_matrix

y_pred = decision_tree.predict(X_test)
confusion_matrix(y_test, y_pred)
array([[1505,   11],
       [  97,  568]])

Importance evaluation using SelectFromModel#

from sklearn.feature_selection import SelectFromModel

sfm = SelectFromModel(estimator=decision_tree).fit(X_train, y_train)
sfm.estimator_.feature_importances_
array([0.        , 0.        , 0.        , 0.06230676, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.1981437 , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.01768306, 0.        , 0.72186648, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        ])
sfm.threshold_
0.013513513513513514
X_train_new = sfm.transform(X_train)
print(
    "Original num features: {}, selected num features: {}".format(
        X_train.shape[1], X_train_new.shape[1]
    )
)
Original num features: 74, selected num features: 4
indices = np.argsort(decision_tree.feature_importances_)[::-1]
for idx, i in enumerate(indices[:10]):
    print(
        "{}.\t{} - {}".format(
            idx, X_train.columns[i], decision_tree.feature_importances_[i]
        )
    )
0.	Max Packet Length - 0.7218664752320741
1.	Fwd IAT Std - 0.19814370190487834
2.	Total Length of Fwd Packets - 0.06230675880412216
3.	Bwd Packets/s - 0.017683064058925304
4.	Bwd IAT Std - 0.0
5.	Fwd IAT Mean - 0.0
6.	Fwd IAT Max - 0.0
7.	Fwd IAT Min - 0.0
8.	Bwd IAT Total - 0.0
9.	Bwd IAT Mean - 0.0

Evaluation of importance using RandomForestClassifier.feature_importances_#

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=250, random_state=42, oob_score=True)
rf.fit(X_train, y_train)
# score() for a classifier returns the mean accuracy on the given data and labels
print(
    "Training accuracy: {:.2f} \nValidation accuracy: {:.2f} \nOut-of-bag score: {:.2f}".format(
        rf.score(X_train, y_train), rf.score(X_test, y_test), rf.oob_score_
    )
)
Training accuracy: 0.99 
Validation accuracy: 0.97 
Out-of-bag score: 0.98
features = X.columns
importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]
webattack_features = []

for index, i in enumerate(indices[:20]):
    webattack_features.append(features[i])
    print("{}.\t#{}\t{:.3f}\t{}".format(index + 1, i, importances[i], features[i]))
1.	#51	0.085	Average Packet Size
2.	#13	0.069	Flow Bytes/s
3.	#38	0.062	Max Packet Length
4.	#39	0.062	Packet Length Mean
5.	#61	0.059	Subflow Fwd Bytes
6.	#23	0.057	Fwd IAT Min
7.	#7	0.057	Fwd Packet Length Mean
8.	#3	0.045	Total Length of Fwd Packets
9.	#52	0.045	Avg Fwd Segment Size
10.	#21	0.035	Fwd IAT Std
11.	#15	0.033	Flow IAT Mean
12.	#5	0.031	Fwd Packet Length Max
13.	#33	0.024	Fwd Header Length
14.	#0	0.021	Flow Duration
15.	#14	0.021	Flow Packets/s
16.	#16	0.020	Flow IAT Std
17.	#35	0.018	Fwd Packets/s
18.	#19	0.018	Fwd IAT Total
19.	#20	0.017	Fwd IAT Mean
20.	#22	0.016	Fwd IAT Max

For comparison, the results of the study [Sharafaldin2018] (compare the relative ranking only, without regard to the scale of the values):

  • Init Win F.Bytes 0.0200

  • Subflow F.Bytes 0.0145

  • Init Win B.Bytes 0.0129

  • Total Len F.Packets 0.0096

And the incorrect results of [Kostas2018] (the error is in the line impor_bars = pd.DataFrame({'Features':refclasscol[0:20],'importance':importances[0:20]}): the slice importances[0:20] does not account for the fact that the importances are not sorted in descending order):

  • Flow Bytes/s 0.313402

  • Total Length of Fwd Packets 0.304917

  • Flow Duration 0.000485

  • Fwd Packet Length Max 0.00013

indices = np.argsort(importances)[-20:]
plt.rcParams["figure.figsize"] = (10, 6)
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="#cccccc", align="center")
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel("Relative Importance")
plt.grid()
plt.savefig("feature_importances.png", dpi=300, bbox_inches="tight")
plt.show()
y_pred = rf.predict(X_test)
confusion_matrix(y_test, y_pred)
array([[1495,   21],
       [  34,  631]])

Next, for the experiments, we keep the max_features features with the highest importance.

max_features = 20
webattack_features = webattack_features[:max_features]
webattack_features
['Average Packet Size',
 'Flow Bytes/s',
 'Max Packet Length',
 'Packet Length Mean',
 'Subflow Fwd Bytes',
 'Fwd IAT Min',
 'Fwd Packet Length Mean',
 'Total Length of Fwd Packets',
 'Avg Fwd Segment Size',
 'Fwd IAT Std',
 'Flow IAT Mean',
 'Fwd Packet Length Max',
 'Fwd Header Length',
 'Flow Duration',
 'Flow Packets/s',
 'Flow IAT Std',
 'Fwd Packets/s',
 'Fwd IAT Total',
 'Fwd IAT Mean',
 'Fwd IAT Max']

Analysis of selected features#

df[webattack_features].hist(figsize=(20, 12), bins=10)
plt.savefig("features_hist.png", dpi=300)

Install Facets Overview

https://pair-code.github.io/facets/

!pip install facets-overview
Requirement already satisfied: facets-overview in /home/soraxas/micromamba/envs/wsu/lib/python3.9/site-packages (1.1.1)
Requirement already satisfied: numpy>=1.16.0 in /home/soraxas/micromamba/envs/wsu/lib/python3.9/site-packages (from facets-overview) (1.26.3)
Requirement already satisfied: pandas>=0.22.0 in /home/soraxas/micromamba/envs/wsu/lib/python3.9/site-packages (from facets-overview) (2.2.0)
Requirement already satisfied: protobuf>=3.20.0 in /home/soraxas/micromamba/envs/wsu/lib/python3.9/site-packages (from facets-overview) (4.23.4)
Requirement already satisfied: python-dateutil>=2.8.2 in /home/soraxas/micromamba/envs/wsu/lib/python3.9/site-packages (from pandas>=0.22.0->facets-overview) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /home/soraxas/micromamba/envs/wsu/lib/python3.9/site-packages (from pandas>=0.22.0->facets-overview) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in /home/soraxas/micromamba/envs/wsu/lib/python3.9/site-packages (from pandas>=0.22.0->facets-overview) (2023.4)
Requirement already satisfied: six>=1.5 in /home/soraxas/micromamba/envs/wsu/lib/python3.9/site-packages (from python-dateutil>=2.8.2->pandas>=0.22.0->facets-overview) (1.16.0)

Create the feature stats for the datasets and stringify it.

import base64
from facets_overview.generic_feature_statistics_generator import (
    GenericFeatureStatisticsGenerator,
)

gfsg = GenericFeatureStatisticsGenerator()
proto = gfsg.ProtoFromDataFrames(
    [{"name": "train + test", "table": df[webattack_features]}]
)
protostr = base64.b64encode(proto.SerializeToString()).decode("utf-8")
/home/soraxas/micromamba/envs/wsu/lib/python3.9/site-packages/facets_overview/base_generic_feature_statistics_generator.py:121: FutureWarning: Series.ravel is deprecated. The underlying array is already 1D, so ravel is not necessary.  Use `to_numpy()` for conversion to a numpy array instead.
  flattened = x.ravel()

Display the facets overview visualization for this data.

from IPython.display import display, HTML

HTML_TEMPLATE = """
        <script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
        <link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html" >
        <facets-overview id="elem"></facets-overview>
        <script>
          document.querySelector("#elem").protoInput = "{protostr}";
        </script>"""
html = HTML_TEMPLATE.format(protostr=protostr)
display(HTML(html))
import seaborn as sns

corr_matrix = df[webattack_features].corr()
plt.rcParams["figure.figsize"] = (16, 5)
g = sns.heatmap(corr_matrix, annot=True, fmt=".1g", cmap="Greys")
g.set_xticklabels(
    g.get_xticklabels(),
    verticalalignment="top",
    horizontalalignment="right",
    rotation=30,
)
plt.savefig("corr_heatmap.png", dpi=300, bbox_inches="tight")
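The correlated columns removed by hand below can also be identified programmatically; a sketch over the current 20 features, with an arbitrarily chosen 0.95 threshold:

corr = df[webattack_features].corr().abs()
# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
auto_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print("Columns correlated above 0.95 with an earlier column:", auto_drop)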

Remove correlated features.

to_be_removed = {
    "Packet Length Mean",
    "Avg Fwd Segment Size",
    "Subflow Fwd Bytes",
    "Fwd Packets/s",
    "Fwd IAT Total",
    "Fwd IAT Max",
}
webattack_features = [item for item in webattack_features if item not in to_be_removed]
webattack_features = webattack_features[:10]
webattack_features
['Average Packet Size',
 'Flow Bytes/s',
 'Max Packet Length',
 'Fwd IAT Min',
 'Fwd Packet Length Mean',
 'Total Length of Fwd Packets',
 'Fwd IAT Std',
 'Flow IAT Mean',
 'Fwd Packet Length Max',
 'Fwd Header Length']
corr_matrix = df[webattack_features].corr()
plt.rcParams["figure.figsize"] = (6, 5)
sns.heatmap(corr_matrix, annot=True, fmt=".1g", cmap="Greys");

Hyperparameter selection#

Rebuild the dataset from the balanced copy, this time keeping only the selected features.

from sklearn.model_selection import GridSearchCV

df = df_balanced.copy()
df["Label"] = df["Label"].apply(lambda x: 0 if x == "BENIGN" else 1)
y = df["Label"].values
X = df[webattack_features]
print(X.shape, y.shape)
(7267, 10) (7267,)

We get the list of RandomForestClassifier parameters.

rfc = RandomForestClassifier(random_state=1)
rfc.get_params().keys()
dict_keys(['bootstrap', 'ccp_alpha', 'class_weight', 'criterion', 'max_depth', 'max_features', 'max_leaf_nodes', 'max_samples', 'min_impurity_decrease', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'monotonic_cst', 'n_estimators', 'n_jobs', 'oob_score', 'random_state', 'verbose', 'warm_start'])

To search for a quasi-optimal value of one parameter, we fix the others.

parameters = {
    "n_estimators": [10],
    "min_samples_leaf": [3],
    "max_features": [3],
    "max_depth": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 20, 30, 50],
}
scoring = ("f1", "accuracy")
gcv = GridSearchCV(
    rfc, parameters, scoring=scoring, refit="f1", cv=10, return_train_score=True
)
gcv.fit(X, y)
results = gcv.cv_results_
cv_results = pd.DataFrame(gcv.cv_results_)
cv_results.head()
mean_fit_time std_fit_time mean_score_time std_score_time param_max_depth param_max_features param_min_samples_leaf param_n_estimators params split0_test_f1 ... split2_train_accuracy split3_train_accuracy split4_train_accuracy split5_train_accuracy split6_train_accuracy split7_train_accuracy split8_train_accuracy split9_train_accuracy mean_train_accuracy std_train_accuracy
0 0.011794 0.000374 0.001573 0.000136 1 3 3 10 {'max_depth': 1, 'max_features': 3, 'min_sampl... 0.891509 ... 0.906575 0.906881 0.905505 0.909786 0.905199 0.906131 0.904908 0.908730 0.906564 0.001923
1 0.015826 0.000166 0.001633 0.000103 2 3 3 10 {'max_depth': 2, 'max_features': 3, 'min_sampl... 0.894472 ... 0.959021 0.958257 0.958410 0.958257 0.959633 0.955359 0.954900 0.959792 0.958182 0.001711
2 0.019654 0.000284 0.001686 0.000130 3 3 3 10 {'max_depth': 3, 'max_features': 3, 'min_sampl... 0.891688 ... 0.959174 0.958410 0.958716 0.959174 0.959480 0.955053 0.955206 0.959945 0.958396 0.001723
3 0.022836 0.000558 0.001778 0.000117 4 3 3 10 {'max_depth': 4, 'max_features': 3, 'min_sampl... 0.916256 ... 0.972018 0.967125 0.966667 0.962691 0.969572 0.968812 0.969118 0.972940 0.969023 0.002834
4 0.023506 0.000631 0.001776 0.000067 5 3 3 10 {'max_depth': 5, 'max_features': 3, 'min_sampl... 0.937799 ... 0.970031 0.970031 0.974312 0.972477 0.974312 0.970188 0.973857 0.973399 0.972616 0.001729

5 rows × 59 columns

# https://scikit-learn.org/dev/auto_examples/model_selection/plot_multi_metric_evaluation.html
plt.figure(figsize=(8, 5))
plt.title("GridSearchCV results", fontsize=14)

plt.xlabel("max_depth")
plt.ylabel("f1")

ax = plt.gca()
ax.set_xlim(1, 30)
ax.set_ylim(0.9, 1)

X_axis = np.array(results["param_max_depth"].data, dtype=float)

for scorer, color in zip(sorted(scoring), ["g", "k"]):
    for sample, style in (("train", "--"), ("test", "-")):
        sample_score_mean = results["mean_%s_%s" % (sample, scorer)]
        sample_score_std = results["std_%s_%s" % (sample, scorer)]
        ax.fill_between(
            X_axis,
            sample_score_mean - sample_score_std,
            sample_score_mean + sample_score_std,
            alpha=0.1 if sample == "test" else 0,
            color=color,
        )
        ax.plot(
            X_axis,
            sample_score_mean,
            style,
            color=color,
            alpha=1 if sample == "test" else 0.7,
            label="%s (%s)" % (scorer, sample),
        )

    best_index = np.nonzero(results["rank_test_%s" % scorer] == 1)[0][0]
    best_score = results["mean_test_%s" % scorer][best_index]

    # Plot a dotted vertical line at the best score for that scorer marked by x
    ax.plot(
        [
            X_axis[best_index],
        ]
        * 2,
        [0, best_score],
        linestyle="-.",
        color=color,
        marker="x",
        markeredgewidth=3,
        ms=8,
    )

    # Annotate the best score for that scorer
    ax.annotate("%0.2f" % best_score, (X_axis[best_index] + 0.3, best_score + 0.005))

plt.legend(loc="best")
plt.grid(False)
plt.savefig("GridSearchCV_results.png", dpi=300)
plt.show()

Final model#

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
(5086, 10) (5086,)
(2181, 10) (2181,)
rfc = RandomForestClassifier(
    max_depth=17,
    max_features=10,
    min_samples_leaf=3,
    n_estimators=50,
    random_state=42,
    oob_score=True,
)
# rfc = RandomForestClassifier(n_estimators=250, random_state=1)
rfc.fit(X_train, y_train)
RandomForestClassifier(max_depth=17, max_features=10, min_samples_leaf=3,
                       n_estimators=50, oob_score=True, random_state=42)
features = X.columns
importances = rfc.feature_importances_
indices = np.argsort(importances)[::-1]

for index, i in enumerate(indices[:10]):
    print("{}.\t#{}\t{:.3f}\t{}".format(index + 1, i, importances[i], features[i]))
1.	#2	0.362	Max Packet Length
2.	#0	0.211	Average Packet Size
3.	#6	0.150	Fwd IAT Std
4.	#3	0.126	Fwd IAT Min
5.	#5	0.054	Total Length of Fwd Packets
6.	#7	0.053	Flow IAT Mean
7.	#9	0.026	Fwd Header Length
8.	#4	0.014	Fwd Packet Length Mean
9.	#8	0.003	Fwd Packet Length Max
10.	#1	0.001	Flow Bytes/s
y_pred = rfc.predict(X_test)
confusion_matrix(y_test, y_pred)
array([[1491,   25],
       [  43,  622]])
import sklearn.metrics as metrics

accuracy = metrics.accuracy_score(y_test, y_pred)
precision = metrics.precision_score(y_test, y_pred)
recall = metrics.recall_score(y_test, y_pred)
f1 = metrics.f1_score(y_test, y_pred)
print("Accuracy =", accuracy)
print("Precision =", precision)
print("Recall =", recall)
print("F1 =", f1)
Accuracy = 0.9688216414488766
Precision = 0.9613601236476044
Recall = 0.9353383458646617
F1 = 0.948170731707317
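Since accuracy alone can be misleading under class imbalance, a threshold-independent score such as ROC AUC is a useful complement; a short sketch using the trained rfc:

from sklearn.metrics import roc_auc_score

y_proba = rfc.predict_proba(X_test)[:, 1]   # probability of the "attack" class
print("ROC AUC =", roc_auc_score(y_test, y_proba))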

Applying the model to the full balanced dataset#

df = df_balanced.copy()
df["Label"] = df["Label"].apply(lambda x: 0 if x == "BENIGN" else 1)
y_test = df["Label"].values
X_test = df[webattack_features]
print(X_test.shape, y_test.shape)
(7267, 10) (7267,)
X_test.head()
Average Packet Size Flow Bytes/s Max Packet Length Fwd IAT Min Fwd Packet Length Mean Total Length of Fwd Packets Fwd IAT Std Flow IAT Mean Fwd Packet Length Max Fwd Header Length
15 83.500000 4.857268e+03 77 0.0 45.000000 45 0.000000 25117.000000 45 32
73 80.000000 1.633136e+06 94 4.0 44.000000 88 0.000000 56.333333 44 64
90 80.000000 1.380000e+06 94 47.0 44.000000 88 0.000000 66.666667 44 64
140 414.533333 5.871577e+06 1555 1.0 345.555556 3110 284.408126 75.642857 1555 304
212 94.250000 3.018519e+06 112 1.0 51.000000 102 0.000000 36.000000 51 64
import time

seconds = time.time()
y_pred = rfc.predict(X_test)
print("Total operation time:", time.time() - seconds, "seconds")

print("Benign records detected (0), attacks detected (1):")
unique, counts = np.unique(y_pred, return_counts=True)
dict(zip(unique, counts))
Total operation time: 0.010741710662841797 seconds
Benign records detected (0), attacks detected (1):
{0: 5120, 1: 2147}

Confusion matrix layout (rows = actual class, columns = predicted class; Wikipedia uses a different convention for the axes):

             predicted 0   predicted 1
actual 0         TN            FP
actual 1         FN            TP
confusion_matrix(y_test, y_pred)
array([[5035,   52],
       [  85, 2095]])
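The four cells can be unpacked explicitly (sklearn's row/column order, matching the layout above):

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("TN =", tn, " FP =", fp, " FN =", fn, " TP =", tp)
print("False positive rate =", fp / (fp + tn))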
accuracy = metrics.accuracy_score(y_test, y_pred)
precision = metrics.precision_score(y_test, y_pred)
recall = metrics.recall_score(y_test, y_pred)
f1 = metrics.f1_score(y_test, y_pred)
print("Accuracy =", accuracy)
print("Precision =", precision)
print("Recall =", recall)
print("F1 =", f1)
Accuracy = 0.9811476537773497
Precision = 0.975780158360503
Recall = 0.9610091743119266
F1 = 0.9683383406517218
predict = pd.DataFrame({"Predict": rfc.predict(X_test)}, index=X_test.index)
label = pd.DataFrame({"Label": y_test}, index=X_test.index)
result = X_test.join(label).join(predict)  # align by the original row index
display("The following is point that are predicted to be intrusion (anomaly)")
result[result["Predict"] == 1]
'The following is point that are predicted to be intrusion (anomaly)'
Average Packet Size Flow Bytes/s Max Packet Length Fwd IAT Min Fwd Packet Length Mean Total Length of Fwd Packets Fwd IAT Std Flow IAT Mean Fwd Packet Length Max Fwd Header Length Label Predict
704 7.500000 6.000000e+05 6 0.0 6.000000 6 0.000000e+00 1.333333e+01 6 20 1.0 1.0
756 48.000000 1.134752e+06 48 3.0 32.000000 64 0.000000e+00 4.700000e+01 32 64 1.0 1.0
1047 87.500000 5.073501e+03 118 4.0 38.000000 76 0.000000e+00 2.049867e+04 38 40 1.0 1.0
1075 141.000000 1.098433e+02 200 0.0 41.000000 41 0.000000e+00 2.194035e+06 41 32 1.0 1.0
1083 271.333333 7.509294e+05 796 2.0 269.333333 808 7.580185e+02 5.380000e+02 796 60 1.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ...
4591 146.500000 3.885334e+03 183 0.0 55.000000 55 0.000000e+00 6.125600e+04 55 20 1.0 1.0
4600 80.500000 1.192036e+04 101 3.0 40.000000 80 0.000000e+00 7.885667e+03 40 64 1.0 1.0
4709 87.250000 6.516699e+03 119 3.0 37.000000 74 0.000000e+00 1.595900e+04 37 40 1.0 1.0
4855 104.000000 1.900135e+03 108 0.0 50.000000 50 0.000000e+00 8.315200e+04 50 20 1.0 1.0
4938 257.538461 5.846547e+02 1460 4.0 77.125000 617 1.965194e+06 4.772048e+05 342 172 1.0 1.0

61 rows × 12 columns

display("The following is point that are predicted to be genuine")
result[result["Predict"] == 0]
'The following is point that are predicted to be genuine'
Average Packet Size Flow Bytes/s Max Packet Length Fwd IAT Min Fwd Packet Length Mean Total Length of Fwd Packets Fwd IAT Std Flow IAT Mean Fwd Packet Length Max Fwd Header Length Label Predict
15 83.500000 4.857268e+03 77 0.0 45.000000 45 0.000000e+00 25117.000000 45 32 0.0 0.0
73 80.000000 1.633136e+06 94 4.0 44.000000 88 0.000000e+00 56.333333 44 64 0.0 0.0
90 80.000000 1.380000e+06 94 47.0 44.000000 88 0.000000e+00 66.666667 44 64 0.0 0.0
140 414.533333 5.871577e+06 1555 1.0 345.555556 3110 2.844081e+02 75.642857 1555 304 0.0 0.0
212 94.250000 3.018519e+06 112 1.0 51.000000 102 0.000000e+00 36.000000 51 64 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ...
7141 80.000000 1.308057e+06 94 48.0 44.000000 88 0.000000e+00 70.333333 44 40 0.0 0.0
7143 2.400000 1.538462e+05 6 0.0 0.000000 0 0.000000e+00 19.500000 0 32 0.0 0.0
7166 94.250000 2.000000e+06 112 4.0 51.000000 102 0.000000e+00 54.333333 51 40 0.0 0.0
7230 43.636364 9.050426e+01 223 48.0 46.142857 323 2.126298e+06 530361.800000 223 160 0.0 0.0
7254 139.000000 1.533367e+03 186 0.0 46.000000 46 0.000000e+00 151301.000000 46 32 0.0 0.0

171 rows × 12 columns