From classical methods to neural networks: Exploring the potential of Deep Learning in identifying obfuscated traffic

October 30, 2024

Network traffic analysis and classification have become essential for maintaining the resilience and security of contemporary computer networks. With the rapid increase in data volumes and the growing complexity of encryption methods, the need for effective network flow classification continues to rise. By identifying, categorizing, and analyzing network traffic accurately, organizations can detect potential threats, optimize network performance, and ensure compliance with security protocols.

Traditional Methods for Network Traffic Analysis

Classification of network traffic using traditional methods involves various approaches, each with its own strengths and weaknesses. Let’s examine the main methods and their limitations when dealing with obfuscated and encrypted traffic.

1. Server Name Indication (SNI) Method

The SNI method analyzes the domain name that a client sends in plaintext when establishing a TLS session. Because the name is carried in the server_name extension of the ClientHello message during the TLS handshake, this method can identify servers and services even though the subsequent traffic is encrypted.

Limitations of the SNI Method:

Insufficient accuracy with port obfuscation and address translation: When IP addresses and ports are modified or obfuscated, accuracy decreases because the link between the SNI and a specific application can be disrupted.
Inability to identify traffic inside VPNs: The SNI field becomes unavailable for analysis if traffic passes through a VPN, as it is hidden by tunnel encryption.
Lack of data for all protocols: Not all protocols and applications transmit data over TLS, making SNI-based analysis inapplicable to them.
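
To make the mechanism concrete, here is a minimal sketch of how the server name could be read from a single, unfragmented TLS ClientHello record. The function name `extract_sni` is hypothetical and the parser is simplified: it assumes the whole ClientHello fits in one record and only looks at the first entry of the server name list.

```python
import struct
from typing import Optional


def extract_sni(record: bytes) -> Optional[str]:
    """Return the host name from a TLS ClientHello record, or None if absent."""
    try:
        # TLS record header: content type (0x16 = handshake), version, length.
        if record[0] != 0x16:
            return None
        pos = 5
        # Handshake header: message type (0x01 = ClientHello) + 3-byte length.
        if record[pos] != 0x01:
            return None
        pos += 4
        pos += 2 + 32                       # client_version + random
        pos += 1 + record[pos]              # session_id
        (cipher_len,) = struct.unpack_from("!H", record, pos)
        pos += 2 + cipher_len               # cipher_suites
        pos += 1 + record[pos]              # compression_methods
        (ext_total,) = struct.unpack_from("!H", record, pos)
        pos += 2
        end = pos + ext_total
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", record, pos)
            pos += 4
            if ext_type == 0:               # server_name extension
                # Layout: list length (2), name type (1), name length (2), name.
                name_type = record[pos + 2]
                (name_len,) = struct.unpack_from("!H", record, pos + 3)
                if name_type != 0:          # 0 = host_name
                    return None
                return record[pos + 5:pos + 5 + name_len].decode("ascii")
            pos += ext_len
    except (IndexError, struct.error, UnicodeDecodeError):
        return None
    return None
```

The limitations listed above show up directly in this sketch: if the record is wrapped in a VPN tunnel or the protocol does not use TLS at all, the first checks fail and nothing can be extracted.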

2. Payload Inspection

Payload inspection involves a detailed analysis of packet contents to identify patterns and characteristics specific to a protocol or application. This method provides high accuracy in determining data types and classifying them based on content.

Limitations of Payload Inspection:

Computational resource costs: Payload inspection requires significant resources due to the need to examine each packet’s content.
Privacy issues: Full access to packet data raises privacy concerns, especially when working with personal or corporate data.
Inability to analyze encrypted traffic: Encryption of traffic (TLS or VPN) makes payload inspection impossible, reducing the effectiveness of this method in modern environments where a significant portion of traffic is encrypted.
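
As a simplified illustration of the idea, the sketch below matches the first bytes of a payload against a few well-known protocol signatures. The function name `classify_payload` and the choice of signatures are assumptions made for this example; production DPI engines use far larger rule sets and track state across packets.

```python
from typing import Optional

# Well-known plaintext prefixes (a tiny, illustrative subset).
HTTP_METHODS = (b"GET ", b"POST ", b"HEAD ", b"PUT ", b"DELETE ", b"OPTIONS ")
BITTORRENT_HANDSHAKE = b"\x13BitTorrent protocol"


def classify_payload(payload: bytes) -> Optional[str]:
    """Guess the protocol from the first bytes of a payload, or return None."""
    if payload.startswith(BITTORRENT_HANDSHAKE):
        return "bittorrent"
    if payload.startswith(HTTP_METHODS):
        return "http"
    # TLS record: content type 0x16 (handshake) followed by major version 0x03.
    if len(payload) >= 3 and payload[0] == 0x16 and payload[1] == 0x03:
        return "tls"
    # Encrypted or unknown content matches no signature.
    return None
```

Once the payload is encrypted, the bytes look random and every check falls through to `None`, which is exactly the limitation described above.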

3. Statistical Machine Learning Methods

Statistical machine learning methods classify traffic based on various metrics and characteristics (such as packet sizes, frequency, and time intervals). Models can be trained on statistical data, allowing for effective identification of certain types of traffic in some cases.

Limitations of Statistical Machine Learning Methods:

Need for clean and labeled data: For successful operation, statistical learning models require high-quality labeled data, which is challenging to collect, especially for less common protocols.
Resource-intensive: This method requires significant computational resources, slowing down the analysis in cases of large data volumes.
Low effectiveness in the presence of traffic obfuscation: Protocols that mask their metadata or continuously change traffic patterns can complicate analysis, leading to low accuracy from statistical models.
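
The kind of features such models consume can be sketched as follows. This is an illustrative example only: the function name `flow_features` and the specific feature set are assumptions, and the input is taken to be a time-ordered list of `(timestamp_seconds, size_bytes)` pairs for one flow.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def flow_features(packets: List[Tuple[float, int]]) -> Dict[str, float]:
    """Summarize one flow by packet-size and inter-arrival statistics."""
    if not packets:
        raise ValueError("a flow needs at least one packet")
    sizes = [size for _, size in packets]
    times = [ts for ts, _ in packets]
    # Time gaps between consecutive packets (empty for a single-packet flow).
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "packet_count": float(len(sizes)),
        "byte_count": float(sum(sizes)),
        "mean_size": float(mean(sizes)),
        "std_size": float(pstdev(sizes)),
        "mean_gap": float(mean(gaps)) if gaps else 0.0,
        "std_gap": float(pstdev(gaps)) if gaps else 0.0,
    }
```

A protocol that pads packets or randomizes timing shifts exactly these statistics, which is why obfuscation degrades classifiers built on them.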

As a result, although traditional methods may exhibit high accuracy in some cases, they face numerous limitations, making it challenging to classify modern traffic types.

Neural Network Approach to Identifying Obfuscated Network Traffic

Our research explores deep learning as a more accurate and flexible alternative to traditional methods. We implemented models based on convolutional neural networks (CNN) and the ResNet architecture, adapting them for high-precision classification of encrypted VPN and proxy traffic.
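
The core idea that distinguishes ResNet from a plain CNN is the skip connection, which adds a block's input back to its output. The sketch below shows this in dependency-free Python on a one-dimensional signal. It is only a conceptual illustration: the article does not specify layer sizes or kernels, so the function names and the single-channel setup here are assumptions, and a real model would be built with a deep learning framework.

```python
from typing import List


def conv1d_same(x: List[float], kernel: List[float]) -> List[float]:
    """1-D sliding dot product with zero padding; output length equals input length.

    Assumes an odd kernel length so the window is centered on each position.
    """
    half = len(kernel) // 2
    padded = [0.0] * half + list(x) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]


def relu(x: List[float]) -> List[float]:
    return [max(0.0, v) for v in x]


def residual_block(x: List[float], k1: List[float], k2: List[float]) -> List[float]:
    """Compute relu(x + F(x)), where F is conv -> relu -> conv."""
    h = relu(conv1d_same(x, k1))
    h = conv1d_same(h, k2)
    # Skip connection: the input is added back before the final activation.
    return relu([a + b for a, b in zip(x, h)])
```

If both kernels are zero, the block simply passes `relu(x)` through, which illustrates why stacking residual blocks makes deep networks easier to train than stacking plain convolutional layers.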

Data

The dataset used for traffic classification consists of NetFlow v10 (IPFIX) records. IPFIX is a protocol designed to standardize the export of IP flow information from an exporter to a collector, and it is supported by vendors such as Cisco, Solera, VMware, and Citrix. The IPFIX specifications are provided in RFCs 7011–7015 and RFC 5103.

Data Collection

Data was collected using a device with a deep packet inspection (DPI) system connected to other devices that generated traffic over various VPNs. This setup captured the IP addresses and ports that the VPNs assign dynamically when operating under restrictions, yielding a rich set of unique IP and port combinations for training the neural network model.

The collected data included the following parameters:

Field	Description
octet_delta_count	Incoming counter, N x 8 bits long, for the number of bytes associated with the IP flow.
packet_delta_count	Incoming counter, N x 8 bits long, for the number of packets associated with the IP flow.
protocol_identifier	IP protocol byte.
ip_class_of_service	IP class of service.
source_port	Sender’s port.
source_ipv4	Sender’s IPv4.
destination_port	Recipient’s port.
destination_ipv4	Recipient’s IPv4.
bgp_source_as_number	Source BGP autonomous system number (field length N can be 2 or 4 bytes).
bgp_destination_as_number	Destination BGP autonomous system number (field length N can be 2 or 4 bytes).
input_snmp	Virtual LAN identifier associated with the incoming interface.
output_snmp	Virtual LAN identifier associated with the outgoing interface.
ip_version	IPv4 or IPv6 protocol version.
post_nat_source_ipv4	Source NAT IPv4.
post_nat_source_port	Source NAT port.
frgmt_delta_packs	Delta of fragmented packets.
repeat_delta_pack	Delta of retransmissions.
packet_deliver_time	Delay (RTT/2), ms.
protocol_code	Protocol code using autonomous system class for the neural network.
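
The article does not describe how these fields were turned into model inputs. Purely as an assumption for illustration, the sketch below encodes a subset of the fields from the table into a fixed-length numeric vector: IPv4 addresses are split into octets, ports and the protocol byte are scaled to the range 0 to 1, and the counters are log-compressed. The function name `encode_record` and the choice of fields are hypothetical.

```python
import ipaddress
import math
from typing import Dict, List, Union


def encode_record(rec: Dict[str, Union[int, str]]) -> List[float]:
    """Encode selected IPFIX fields as a 13-element feature vector (assumed scheme)."""
    src = ipaddress.IPv4Address(str(rec["source_ipv4"])).packed
    dst = ipaddress.IPv4Address(str(rec["destination_ipv4"])).packed
    features = [octet / 255 for octet in src] + [octet / 255 for octet in dst]
    features += [
        int(rec["source_port"]) / 65535,
        int(rec["destination_port"]) / 65535,
        int(rec["protocol_identifier"]) / 255,
        # Counters span many orders of magnitude, so compress them.
        math.log1p(int(rec["octet_delta_count"])),
        math.log1p(int(rec["packet_delta_count"])),
    ]
    return features
```

Other encodings, such as learned embeddings for addresses or autonomous system numbers, are equally plausible; this one was chosen only because it is short and self-contained.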

Data Processing Before Training

The IPFIX records were labeled by protocol class, and class balancing adjustments were applied so that specific classes stand out. The data was then split into training (80%) and testing (20%) sets.
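
An 80/20 split that keeps class proportions equal in both parts can be sketched as below. Whether the original split was stratified is not stated, so stratification is an assumption here, and `stratified_split` is a hypothetical helper that works on a list of class labels and returns index lists.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[str], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    by_class: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train: List[int] = []
    test: List[int] = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test
```

Splitting per class matters for rare protocols: with a purely random split, a class that makes up only a few percent of the data could end up almost entirely in one of the two sets.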

Training

Two neural network architectures were trained, each with hyperparameter tuning. The protocol class ratios in the training set were:

Protocol	Ratio
DNS	18.67%
HTTP	1.38%
HTTPS	16.27%
DoH	2.66%
ICMP	4.83%
Bittorrent	24.73%
AdGuard VPN	2.34%
VPN Unlimited	12.18%
Psiphon 3	12.41%
Lantern	4.53%
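
With shares ranging from 1.38% to 24.73%, the classes are clearly imbalanced. One common way to compensate, shown here only as an assumed example since the exact balancing method is not specified, is to weight each class inversely to its frequency. The helper name `inverse_frequency_weights` is hypothetical; the ratios are taken from the table above.

```python
from typing import Dict

# Class shares in the training set, in percent (from the table above).
RATIOS: Dict[str, float] = {
    "DNS": 18.67, "HTTP": 1.38, "HTTPS": 16.27, "DoH": 2.66, "ICMP": 4.83,
    "Bittorrent": 24.73, "AdGuard VPN": 2.34, "VPN Unlimited": 12.18,
    "Psiphon 3": 12.41, "Lantern": 4.53,
}


def inverse_frequency_weights(ratios: Dict[str, float]) -> Dict[str, float]:
    """Weight each class by total / (n_classes * share), the 'balanced' heuristic."""
    n_classes = len(ratios)
    total = sum(ratios.values())
    return {name: total / (n_classes * share) for name, share in ratios.items()}


weights = inverse_frequency_weights(RATIOS)
for name, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{name:14} {weight:.2f}")
```

Under this scheme a class with an exactly average share (10% here) gets weight 1, rare classes such as HTTP get weights well above 1, and dominant classes such as Bittorrent get weights below 1.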

Testing

Models were evaluated on the test set using precision, recall, and F1 score metrics:

$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$

$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$

$\text{F1 Score} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$

where TP denotes true positives, FN false negatives, and FP false positives.
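
Substituting the definitions of recall and precision into the F1 formula gives an equivalent form that depends only on the raw counts, which is convenient for checking a score by hand:

$\text{F1 Score} = \frac{2 \times \frac{\text{TP}}{\text{TP} + \text{FN}} \times \frac{\text{TP}}{\text{TP} + \text{FP}}}{\frac{\text{TP}}{\text{TP} + \text{FN}} + \frac{\text{TP}}{\text{TP} + \text{FP}}} = \frac{2 \times \text{TP}}{2 \times \text{TP} + \text{FP} + \text{FN}}$

For example, with TP = 60, FP = 5, and FN = 18:

$\text{F1 Score} = \frac{2 \times 60}{2 \times 60 + 5 + 18} = \frac{120}{143} \approx 0.84$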

To make the results more objective, the experiment was conducted on VPNs that use a wide range of IP addresses. Overall, the ResNet-based model classified VPN protocols more accurately than the classical CNN.

Results

Classical Convolutional Neural Network

Protocol	TP	FP	FN	F1 Score
AdGuard VPN	28	9	50	0.49
VPN Unlimited	3	3	22	0.21
Psiphon 3	8455	160	399	0.97

ResNet Architecture

Protocol	TP	FP	FN	F1 Score
AdGuard VPN	60	5	18	0.84
VPN Unlimited	5	9	20	0.26
Psiphon 3	8847	1030	7	0.95

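The ResNet architecture improved the F1 score markedly for AdGuard VPN (0.84 versus 0.49) and slightly for VPN Unlimited (0.26 versus 0.21), while scoring marginally lower on Psiphon 3 (0.95 versus 0.97), where it reduced false negatives from 399 to 7 at the cost of more false positives. VPN Unlimited remained hard to detect for both models. Overall, ResNet identified VPN traffic more effectively and can serve as a promising foundation for encrypted traffic classification tasks.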
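
The per-protocol metrics can be recomputed from the TP, FP, and FN counts in the two tables with a few lines of Python. This is a small checking script written for this article, not the evaluation code used in the experiments; the function name `prf1` is hypothetical.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (TP, FP, FN)


def prf1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1); each is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Counts copied from the result tables above.
CNN: Dict[str, Counts] = {
    "AdGuard VPN": (28, 9, 50),
    "VPN Unlimited": (3, 3, 22),
    "Psiphon 3": (8455, 160, 399),
}
RESNET: Dict[str, Counts] = {
    "AdGuard VPN": (60, 5, 18),
    "VPN Unlimited": (5, 9, 20),
    "Psiphon 3": (8847, 1030, 7),
}

for model, table in (("CNN", CNN), ("ResNet", RESNET)):
    for protocol, counts in table.items():
        p, r, f1 = prf1(*counts)
        print(f"{model:6} {protocol:14} P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

Looking at precision and recall separately adds detail that the F1 column hides: for Psiphon 3, ResNet reaches near-perfect recall but lower precision than the CNN. Values recomputed from the counts may also differ from the tabulated F1 scores in the second decimal place for some rows.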

Conclusion

In this article, we examined obfuscated traffic identification methods, covering both classical and neural network approaches. While traditional methods provide basic capabilities, they struggle with dynamic traffic patterns and encryption. Neural networks offer greater flexibility and, in our experiments, identified obfuscated traffic in cases where traditional methods are ineffective, although accuracy varied by protocol. The shift to neural network approaches therefore marks a significant step forward in network security.
