Traditional Methods for Network Traffic Analysis
Classification of network traffic using traditional methods involves various approaches, each with its own strengths and weaknesses. Let’s examine the main methods and their limitations when dealing with obfuscated and encrypted traffic.
1. Server Name Indication (SNI) Method
The SNI method is based on analyzing the domain name that a client transmits in plaintext when establishing a TLS session. Because the domain name is carried in the server_name extension of the ClientHello message during the TLS handshake, this method can identify servers and services even though the subsequent traffic is encrypted.
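The extraction step described above can be sketched as follows. This is a minimal illustration, assuming the whole ClientHello fits in a single TLS record and is passed in as raw bytes. The names `extract_sni` and `build_client_hello` are hypothetical helpers introduced here, not part of any library.

```python
import struct
from typing import Optional


def extract_sni(record: bytes) -> Optional[str]:
    """Return the host name from a TLS ClientHello record, or None if absent."""
    try:
        if record[0] != 0x16:            # record content type 0x16 = handshake
            return None
        pos = 5                          # skip type(1) + version(2) + length(2)
        if record[pos] != 0x01:          # handshake type 1 = ClientHello
            return None
        pos += 4                         # handshake type(1) + length(3)
        pos += 2 + 32                    # client_version + random
        pos += 1 + record[pos]           # session_id
        (cs_len,) = struct.unpack_from("!H", record, pos)
        pos += 2 + cs_len                # cipher_suites
        pos += 1 + record[pos]           # compression_methods
        (ext_total,) = struct.unpack_from("!H", record, pos)
        pos += 2
        end = min(pos + ext_total, len(record))
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", record, pos)
            pos += 4
            if ext_type == 0:            # extension 0 = server_name
                # layout: list length(2), name_type(1), name length(2), name
                name_type = record[pos + 2]
                (name_len,) = struct.unpack_from("!H", record, pos + 3)
                start = pos + 5
                if name_type != 0 or start + name_len > len(record):
                    return None
                return record[start:start + name_len].decode("ascii")
            pos += ext_len
        return None
    except (IndexError, struct.error, UnicodeDecodeError):
        return None                      # truncated or malformed record


def build_client_hello(host: str) -> bytes:
    """Build a minimal synthetic ClientHello for demonstration purposes only."""
    name = host.encode("ascii")
    entry = b"\x00" + struct.pack("!H", len(name)) + name
    sni_data = struct.pack("!H", len(entry)) + entry
    ext = struct.pack("!HH", 0, len(sni_data)) + sni_data
    body = (
        b"\x03\x03" + bytes(32)          # client_version + random
        + b"\x00"                        # empty session_id
        + struct.pack("!H", 2) + b"\x13\x01"   # one cipher suite
        + b"\x01\x00"                    # one compression method (null)
        + struct.pack("!H", len(ext)) + ext
    )
    handshake = b"\x01" + len(body).to_bytes(3, "big") + body
    return b"\x16\x03\x01" + struct.pack("!H", len(handshake)) + handshake


print(extract_sni(build_client_hello("example.com")))
```

A real classifier would also have to reassemble ClientHello messages that are fragmented across records or TCP segments, which this sketch does not attempt.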
Limitations of the SNI Method:
- Insufficient accuracy with port obfuscation and address translation: When IP addresses and ports are modified or obfuscated, accuracy decreases because the link between the SNI and a specific application can be disrupted.
- Inability to identify traffic inside VPNs: The SNI field becomes unavailable for analysis if traffic passes through a VPN, because it is hidden by the tunnel's encryption.
- Lack of data for all protocols: Not all protocols and applications transmit data over TLS, making SNI-based analysis inapplicable to them.
2. Payload Inspection
Payload inspection involves a detailed analysis of packet contents to identify patterns and characteristics specific to a protocol or application. This method provides high accuracy in determining data types and classifying them based on content.
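As a rough illustration of pattern-based payload inspection, the sketch below matches the first bytes of a payload against a few well-known protocol prefixes. The signature list is deliberately tiny and is an assumption made for this example, as is the function name `classify_payload`. Production DPI engines rely on much larger rule sets and on stateful parsing.

```python
from typing import Optional

# Illustrative signatures only: (protocol name, byte prefix of the payload).
SIGNATURES = [
    ("BitTorrent", b"\x13BitTorrent protocol"),  # peer handshake prefix
    ("SSH", b"SSH-"),                            # identification string
    ("HTTP", b"GET "),
    ("HTTP", b"POST "),
    ("HTTP", b"HTTP/1."),                        # response status line
]


def classify_payload(payload: bytes) -> Optional[str]:
    """Return the first protocol whose signature prefixes the payload."""
    for name, prefix in SIGNATURES:
        if payload.startswith(prefix):
            return name
    # TLS record: handshake content type 0x16 followed by major version 0x03
    if len(payload) >= 3 and payload[0] == 0x16 and payload[1] == 0x03:
        return "TLS"
    return None


print(classify_payload(b"GET /index.html HTTP/1.1\r\n"))
```

Once the payload is encrypted or wrapped in a tunnel, none of these prefixes is visible, which is the limitation discussed in this section.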
Limitations of Payload Inspection:
- Computational resource costs: Payload inspection requires significant resources due to the need to examine each packet’s content.
- Privacy issues: Full access to packet data raises privacy concerns, especially when working with personal or corporate data.
- Inability to analyze encrypted traffic: Encryption of traffic (TLS or VPN) makes payload inspection impossible, reducing the effectiveness of this method in modern environments where a significant portion of traffic is encrypted.
3. Statistical Machine Learning Methods
Statistical machine learning methods classify traffic based on various metrics and characteristics (such as packet sizes, frequency, and time intervals). Models can be trained on statistical data, allowing for effective identification of certain types of traffic in some cases.
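The kind of per-flow statistics such models consume can be sketched as follows. This is a minimal example, assuming each flow is available as a list of (timestamp, size) pairs with at least one packet. The function name `flow_features` and the chosen feature set are illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def flow_features(packets: List[Tuple[float, int]]) -> Dict[str, float]:
    """Summarize a flow given (timestamp_seconds, size_bytes) pairs in arrival order."""
    sizes = [size for _, size in packets]
    times = [ts for ts, _ in packets]
    # Inter-arrival times between consecutive packets
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "packet_count": float(len(sizes)),
        "mean_size": mean(sizes),
        "std_size": pstdev(sizes),
        "mean_gap": mean(gaps) if gaps else 0.0,
        "std_gap": pstdev(gaps) if gaps else 0.0,
    }


features = flow_features([(0.0, 100), (1.0, 200), (3.0, 300)])
print(features)
```

A feature dictionary like this would then be fed to a classifier of the reader's choice, such as a decision tree or a nearest-neighbor model.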
Limitations of Statistical Machine Learning Methods:
- Need for clean and labeled data: For successful operation, statistical learning models require high-quality labeled data, which is challenging to collect, especially for less common protocols.
- Resource-intensive: This method requires significant computational resources, slowing down the analysis in cases of large data volumes.
- Low effectiveness in the presence of traffic obfuscation: Protocols that mask their metadata or continuously change traffic patterns can complicate analysis, leading to low accuracy from statistical models.
As a result, although traditional methods may exhibit high accuracy in some cases, they face numerous limitations, making it challenging to classify modern traffic types.
Neural Network Approach to Identifying Obfuscated Network Traffic
Our research explores deep learning as a more accurate and flexible alternative to traditional methods. We implemented models based on convolutional neural networks (CNN) and the ResNet architecture, adapting them for high-precision classification of encrypted VPN and proxy traffic.
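The structural difference between a plain convolutional block and a ResNet-style block is the skip connection, which adds the block's input back to its output. The pure-Python sketch below shows only that idea on a 1-D signal. It is a generic illustration and not the architecture used in this study. The function names and the single odd-length kernel are assumptions.

```python
from typing import List


def conv1d_same(x: List[float], kernel: List[float]) -> List[float]:
    """1-D cross-correlation with zero padding; output has the same length as x.

    Assumes an odd-length kernel.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]


def relu(x: List[float]) -> List[float]:
    return [max(0.0, v) for v in x]


def plain_block(x: List[float], kernel: List[float]) -> List[float]:
    """Classical convolutional block: activation(conv(x))."""
    return relu(conv1d_same(x, kernel))


def residual_block(x: List[float], kernel: List[float]) -> List[float]:
    """Residual block: activation(conv(x) + x), where '+ x' is the skip connection."""
    conv_out = conv1d_same(x, kernel)
    return relu([c + s for c, s in zip(conv_out, x)])


signal = [1.0, 2.0, 3.0]
zero_kernel = [0.0, 0.0, 0.0]
print(plain_block(signal, zero_kernel))     # the signal is lost
print(residual_block(signal, zero_kernel))  # the signal passes through the skip path
```

With an all-zero kernel the plain block outputs zeros, while the residual block still passes the input through. This property is commonly cited as the reason deep residual networks are easier to train.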
Data
The models were trained on a dataset of NetFlow v10 (IPFIX) flow records. IPFIX is a protocol that standardizes the export of IP flow information from an exporter to a collector and is supported by vendors such as Cisco, Solera, VMware, and Citrix. The IPFIX specifications are provided in RFCs 7011–7015 and RFC 5103.
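For orientation, every IPFIX message begins with a fixed 16-byte header defined in RFC 7011: version (always 10), total length, export time, sequence number, and observation domain ID. The sketch below decodes only that header. The class and function names are illustrative assumptions, and decoding of template and data sets is left out.

```python
import struct
from typing import NamedTuple


class IpfixHeader(NamedTuple):
    version: int                 # 10 for IPFIX
    length: int                  # total message length in bytes
    export_time: int             # seconds since the Unix epoch
    sequence_number: int
    observation_domain_id: int


def parse_ipfix_header(message: bytes) -> IpfixHeader:
    """Decode the 16-byte IPFIX message header (network byte order)."""
    if len(message) < 16:
        raise ValueError("an IPFIX message header is 16 bytes long")
    header = IpfixHeader(*struct.unpack_from("!HHIII", message, 0))
    if header.version != 10:
        raise ValueError(f"not an IPFIX message (version {header.version})")
    return header


sample = struct.pack("!HHIII", 10, 16, 1700000000, 7, 1)
print(parse_ipfix_header(sample))
```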
Data Collection
Data was collected using a device running a deep packet inspection (DPI) system, connected to other devices that generated traffic over various VPNs. This setup captured the IP addresses and ports that the VPNs assign dynamically when operating under restrictions, producing a large set of unique IP and port combinations for training the neural network model.
The collected data included the following parameters:
Data Type | Description |
---|---|
octet_delta_count | Incoming counter of length N x 8 bits for the number of bytes associated with the IP flow. |
packet_delta_count | Incoming packet counter of length N x 8 bits for the number of packets associated with the IP flow. |
protocol_identifier | IP protocol byte. |
ip_class_of_service | IP class of service. |
source_port | Sender’s port. |
source_ipv4 | Sender’s IPv4. |
destination_port | Recipient’s port. |
destination_ipv4 | Recipient’s IPv4. |
bgp_source_as_number | Source BGP autonomous system number (length N bytes, where N is 2 or 4). |
bgp_destination_as_number | Destination BGP autonomous system number (length N bytes, where N is 2 or 4). |
input_snmp | Virtual LAN identifier associated with the incoming interface. |
output_snmp | Virtual LAN identifier associated with the outgoing interface. |
ip_version | IPv4 or IPv6 protocol version. |
post_nat_source_ipv4 | Source NAT IPv4. |
post_nat_source_port | Source NAT port. |
frgmt_delta_packs | Delta of fragmented packets. |
repeat_delta_pack | Delta of retransmissions. |
packet_deliver_time | Delay (RTT/2), ms. |
protocol_code | Protocol code using autonomous system class for the neural network. |
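Before such records can be fed to a neural network they have to be turned into numeric vectors. The sketch below shows one plausible encoding for a subset of the fields in the table. The scaling choices, the `FlowRecord` class, and the `to_feature_vector` function are assumptions made for illustration, not the preprocessing used in this study.

```python
import ipaddress
from dataclasses import dataclass
from typing import List


@dataclass
class FlowRecord:
    """A subset of the IPFIX fields listed in the table above."""
    octet_delta_count: int
    packet_delta_count: int
    protocol_identifier: int
    source_port: int
    source_ipv4: str
    destination_port: int
    destination_ipv4: str


def ipv4_octets(address: str) -> List[float]:
    """Split an IPv4 address into four octets scaled to [0, 1]."""
    return [octet / 255.0 for octet in ipaddress.IPv4Address(address).packed]


def to_feature_vector(flow: FlowRecord) -> List[float]:
    """Encode a flow as 12 numbers (assumed layout, for illustration only)."""
    mean_packet_size = flow.octet_delta_count / max(flow.packet_delta_count, 1)
    return [
        mean_packet_size,                      # bytes per packet, unscaled
        flow.protocol_identifier / 255.0,      # one-byte protocol number
        flow.source_port / 65535.0,
        flow.destination_port / 65535.0,
        *ipv4_octets(flow.source_ipv4),
        *ipv4_octets(flow.destination_ipv4),
    ]


flow = FlowRecord(1500, 3, 6, 443, "192.0.2.1", 51000, "198.51.100.7")
print(to_feature_vector(flow))
```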
Data Processing Before Training
The data was split into training (80%) and testing (20%) sets. Class balancing adjustments and IPFIX data labeling were applied to highlight specific classes.
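An 80/20 split of this kind can be sketched as below. The example assumes a stratified split, meaning each class is divided separately so that rare protocols appear in both sets. The source does not state whether stratification was used, and the function name `stratified_split` is hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[str], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices), splitting each class separately."""
    by_class: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)          # fixed seed for a reproducible split
    train: List[int] = []
    test: List[int] = []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


labels = ["DNS"] * 50 + ["Lantern"] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```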
Training
Two neural network architectures were trained, each with hyperparameter tuning. The share of each protocol class in the training sample was:
Protocol | Ratio |
---|---|
DNS | 18.67% |
HTTP | 1.38% |
HTTPS | 16.27% |
DoH | 2.66% |
ICMP | 4.83% |
Bittorrent | 24.73% |
AdGuard VPN | 2.34% |
VPN Unlimited | 12.18% |
Psiphon 3 | 12.41% |
Lantern | 4.53% |
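The shares above are uneven, with Bittorrent at 24.73% and HTTP at 1.38%. One common way to compensate during training is to weight each class inversely to its frequency. The sketch below computes such weights from the table. It is an assumption about how balancing could be done, since the exact adjustment used in this study is not specified.

```python
from typing import Dict

# Class shares (percent) copied from the table above.
RATIOS: Dict[str, float] = {
    "DNS": 18.67, "HTTP": 1.38, "HTTPS": 16.27, "DoH": 2.66, "ICMP": 4.83,
    "Bittorrent": 24.73, "AdGuard VPN": 2.34, "VPN Unlimited": 12.18,
    "Psiphon 3": 12.41, "Lantern": 4.53,
}


def inverse_frequency_weights(ratios: Dict[str, float]) -> Dict[str, float]:
    """Weight each class by total / (n_classes * share).

    A perfectly balanced dataset would give every class a weight of 1.0.
    """
    n_classes = len(ratios)
    total = sum(ratios.values())
    return {name: total / (n_classes * share) for name, share in ratios.items()}


weights = inverse_frequency_weights(RATIOS)
for name, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{name:14s} {weight:.2f}")
```

Weights like these are typically passed to the loss function so that errors on rare classes cost more than errors on common ones.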
Testing
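Models were evaluated on the test set using precision, recall, and F1 score metrics:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
where TP denotes true positives, FN false negatives, and FP false positives.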
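These three metrics follow directly from the TP, FP, and FN counts. The helper below is a minimal sketch using the standard definitions. The function name is an assumption, and the sample counts are the AdGuard VPN rows reported in the Results section.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from raw counts.

    Returns 0.0 for a metric whose denominator would be zero.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# AdGuard VPN counts for the two models
for model, counts in [("CNN", (28, 9, 50)), ("ResNet", (60, 5, 18))]:
    p, r, f1 = precision_recall_f1(*counts)
    print(f"{model}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```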
The experiment was conducted on VPN services that use a wide range of IP addresses, to make the results more objective. Of the two models, the one based on the ResNet architecture classified VPN protocols more accurately overall.
Results
Classical Convolutional Neural Network
Protocol | TP | FP | FN | F1 Score |
---|---|---|---|---|
AdGuard VPN | 28 | 9 | 50 | 0.49 |
VPN Unlimited | 3 | 3 | 22 | 0.21 |
Psiphon 3 | 8455 | 160 | 399 | 0.97 |
ResNet Architecture
Protocol | TP | FP | FN | F1 Score |
---|---|---|---|---|
AdGuard VPN | 60 | 5 | 18 | 0.84 |
VPN Unlimited | 5 | 9 | 20 | 0.26 |
Psiphon 3 | 8847 | 1030 | 7 | 0.95 |
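On this test set, the ResNet architecture raised the F1 score for AdGuard VPN from 0.49 to 0.84 and for VPN Unlimited from 0.21 to 0.26, while Psiphon 3 stayed high for both models (0.97 for the CNN, 0.95 for ResNet). VPN Unlimited, with only 25 positive samples in the test set, was still identified poorly by both models. Overall, ResNet appears to be the stronger foundation of the two for encrypted traffic classification tasks.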
Conclusion
In this article, we examined methods for identifying obfuscated traffic, covering both classical and neural network approaches. While traditional methods provide basic capabilities, they are limited when traffic is encrypted or when its patterns change dynamically. In our experiments, neural networks offered greater accuracy and flexibility, identifying obfuscated VPN traffic in cases where traditional methods prove ineffective. Thus, the shift to neural network approaches marks a significant step forward in network security.