A neural network trained on RT-IoT2022 — 123,117 real-world network flow records from a smart home testbed — to classify traffic into 12 attack and benign categories. Built with TensorFlow / Keras, class-weight balancing, and a rigorous preprocessing pipeline.
01 — Dataset
Captured from a real smart-home testbed. Severely imbalanced — DOS_SYN_Hping alone accounts for ~77% of records, making naive accuracy a misleading metric.
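To see why accuracy misleads at this level of skew, here is a small illustrative sketch on synthetic labels (not the real dataset): a degenerate classifier that always predicts the majority class already scores roughly 77% accuracy while detecting nothing.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels mimicking ~77% majority skew (illustration only)
rng = np.random.default_rng(0)
y_true = rng.choice([0, 1, 2], size=10_000, p=[0.77, 0.20, 0.03])
y_majority = np.zeros_like(y_true)  # always predict the majority class

print(accuracy_score(y_true, y_majority))                           # ~0.77
print(f1_score(y_true, y_majority, average="macro", zero_division=0))  # far lower
```

Macro F1 collapses here because every minority class scores an F1 of zero, which is exactly the failure mode that per-class metrics expose.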
02 — Pipeline
Every step runs in order to produce clean, leak-free training data for the classifier.
UCI ML Repository via ucimlrepo. 123,117 rows × 85 cols. No missing values.
Remove the unnamed row-number column saved into the CSV — carries no signal.
pd.get_dummies on proto & service (2 categorical cols → 10 binary columns).
80/20 train-test split, stratified on Attack_type to preserve rare class proportions.
Remove columns with std = 0 before scaling to avoid division-by-zero NaN values.
Fit StandardScaler on train only. Apply to both train and test to prevent leakage.
np.nan_to_num replaces any remaining NaN / ±Inf with 0.0 after scaling.
LabelEncoder maps 12 string class names → integers 0–11 for sparse CE loss.
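The steps above can be sketched end-to-end on synthetic data. Column names here are placeholders, not the real RT-IoT2022 schema; only the order of operations matters.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Synthetic stand-in for the cleaned flow table (placeholder columns)
df = pd.DataFrame({
    "fwd_pkts": np.random.rand(200),
    "bwd_pkts": np.random.rand(200),
    "constant": np.zeros(200),                    # std = 0 -> dropped later
    "proto": ["tcp"] * 150 + ["udp"] * 50,        # categorical -> one-hot
    "Attack_type": ["DOS_SYN_Hping"] * 150 + ["MQTT_Publish"] * 50,
})

# One-hot encode categorical columns, split features from the label
X = pd.get_dummies(df.drop(columns="Attack_type"))
y = df["Attack_type"]

# 80/20 split, stratified so rare classes keep their proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Drop zero-variance columns, decided on the training split only
keep = X_tr.columns[X_tr.std() > 0]
X_tr, X_te = X_tr[keep], X_te[keep]

# Fit the scaler on train only, then transform both splits; scrub residual NaN/Inf
scaler = StandardScaler().fit(X_tr)
X_tr = np.nan_to_num(scaler.transform(X_tr))
X_te = np.nan_to_num(scaler.transform(X_te))

# String labels -> integer ids for sparse categorical cross-entropy
le = LabelEncoder().fit(y_tr)
y_tr_ids, y_te_ids = le.transform(y_tr), le.transform(y_te)
```

Fitting both the zero-variance filter and the scaler on the training split alone is what keeps test-set statistics from leaking into training.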
03 — Model Architecture
Intentionally lean: one hidden layer with 16 neurons. Simplicity is a feature for a network-flow classifier where the signal is strong.
04 — Results
Evaluated on the held-out 20% test set. Per-class F1 is the key metric given the severe class imbalance.
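Per-class F1 is straightforward to read off with scikit-learn. This sketch uses synthetic predictions for three hypothetical classes, not the project's real results, purely to show why the metric matters.

```python
import numpy as np
from sklearn.metrics import classification_report, f1_score

# Synthetic example: the classifier misses every flow of rare class 1
y_true = np.array([0] * 90 + [1] * 8 + [2] * 2)
y_pred = np.array([0] * 90 + [0] * 8 + [2] * 2)

print(classification_report(y_true, y_pred, zero_division=0))
per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
```

Accuracy comes out at 92% here, yet class 1 has an F1 of exactly zero; per-class and macro-averaged F1 surface that failure where a single accuracy number hides it.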
05 — Verifiability
Every artefact needed to reproduce or audit this project is openly available.
Full training notebook. Anyone can open and run it in a free Colab environment without any setup.
RT-IoT2022 is publicly hosted at UCI. The ucimlrepo library loads it automatically — no manual download required.
06 — Observations
Key findings from the loss curve behaviour and recommendations for future iterations.
Training loss sitting above validation loss is expected here: the balanced class weights inflate the training loss whenever a rare-class sample is misclassified, while the validation loss is computed unweighted. The model is being pushed harder on minority categories during training, exactly as intended.
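For reference, "balanced" class weights follow the formula n_samples / (n_classes × n_class_samples), so rare classes carry proportionally larger per-sample loss. A sketch with synthetic counts:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic class counts mimicking heavy imbalance (illustration only)
y = np.array([0] * 770 + [1] * 200 + [2] * 30)

# weight_c = n_samples / (n_classes * n_samples_in_c)
weights = compute_class_weight("balanced", classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), weights)))
# class 2 (30 samples) gets weight ~11.1, class 0 (770 samples) only ~0.43
```

Passed to Keras as `class_weight` in `model.fit`, these multipliers are applied to the training loss only, which is why the weighted training curve can sit above the unweighted validation curve.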
To improve precision for the weakest categories, future iterations should explore synthetic oversampling — particularly SMOTE — applied to ultra-rare classes like NMAP_FIN_SCAN (28 samples) and Metasploit_Brute_SSH (37 samples).
Adding deeper hidden layers, or trying tree-based ensembles such as XGBoost, could also yield gains on the extreme-imbalance cases. Tree-based ensembles often handle skewed distributions more effectively than small neural networks at this data scale.