Building Datasets to Detect Anomalies in 5G Core

As networks expand, traditional monitoring tools cannot keep up with their speed and complexity, and the resulting downtime and service disruptions can cost providers millions.

When machine learning models are given structured input, they detect disruptions faster than rule-based systems. However, most network operators still rely on methods that miss patterns in large data streams.

The shift to AI-based monitoring requires new approaches to data preparation. Machine learning algorithms in network monitoring face three requirements: accurate labeling, real-world scenario modeling, and ongoing quality checks. Without these elements, systems generate false alarms that undermine trust.

Quick Take

  • 5G networks require datasets that reflect the complexity of real-world traffic.
  • Data preprocessing affects the accuracy of the detection model.
  • Hybrid approaches that combine AI and human verification yield the best results.
  • Continuously updating datasets prevents AI model decay in networks.

Understanding the Complexities of the 5G Core and Telecommunications Networks

The 5G core is the central element of a fifth-generation telecommunications network, providing communication between users, services, and infrastructure. 5G must simultaneously deliver high data transfer speeds, millisecond-level latency, support for up to a million devices per square kilometer, and scalability. The 5G core is therefore built on a Service-Based Architecture (SBA), in which all functions are divided into modules that interact through standardized interfaces. This provides flexibility but creates challenges: the complexity of managing signaling, security, and quality of service increases.

The situation is further complicated by the introduction of network slicing, a mechanism for creating virtual subnetworks for specific use cases. This requires algorithms for resource orchestration, load balancing, and protection against cyber threats. The telecommunications infrastructure is becoming more complex, combining physical base stations, cloud data centers, edge computing, and IoT devices with different network requirements. 

Engineers face the challenges of ensuring compatibility between old and new technologies, integrating multi-vendor equipment, and keeping software constantly updated. All of this makes the 5G core not just a technical solution but a complex ecosystem in which each component affects the stability, security, and performance of the network, which is why its development, implementation, and operation are so demanding.

Importance of Quality Data in Anomaly Detection

Data quality can be assessed using the following metrics (a small automated check is sketched after the list):

  • Accuracy. Error-free metrics that reflect the actual state of the network.
  • Completeness. Full coverage at all service levels.
  • Consistency. Uniform formatting for cross-system analysis.
  • Timeliness. Updates in less than a second during traffic peaks.
  • Relevance. Context-sensitive noise filtering.
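
As a rough illustration, the completeness, timeliness, and consistency checks can be automated on raw telemetry before it enters a training set. The sketch below assumes a pandas DataFrame with hypothetical column names (`node_id`, `timestamp`, `latency_ms`); it is not tied to any particular 5G core vendor's export format.

```python
import pandas as pd

def quality_report(df: pd.DataFrame, max_staleness="1s") -> dict:
    """Compute simple completeness, timeliness, and consistency indicators."""
    # Completeness: share of non-null values per column.
    completeness = df.notna().mean().to_dict()

    # Timeliness: gaps between consecutive samples per node should stay
    # below the freshness budget (here, one second).
    df = df.sort_values(["node_id", "timestamp"])
    gaps = df.groupby("node_id")["timestamp"].diff()
    stale_share = (gaps > pd.Timedelta(max_staleness)).mean()

    # Consistency: duplicated (node, timestamp) pairs indicate collection issues.
    duplicate_share = df.duplicated(subset=["node_id", "timestamp"]).mean()

    return {
        "completeness": completeness,
        "stale_sample_share": float(stale_share),
        "duplicate_share": float(duplicate_share),
    }

# Toy example with synthetic samples:
df = pd.DataFrame({
    "node_id": ["amf-1"] * 3 + ["upf-2"] * 3,
    "timestamp": pd.to_datetime(
        ["2024-01-01 00:00:00", "2024-01-01 00:00:00.5", "2024-01-01 00:00:03",
         "2024-01-01 00:00:00", "2024-01-01 00:00:00.8", "2024-01-01 00:00:01.2"]),
    "latency_ms": [12.1, None, 14.3, 9.8, 10.1, 9.9],
})
print(quality_report(df))
```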

Poor-quality input data creates cascading failures. Missing timestamps alone can trigger false alarms, draining resources and undermining trust in automated tools.

Therefore, data origin is tracked at collection points (a minimal provenance record is sketched after the list below). This helps to:

  • Check suspicious patterns.
  • Track root causes during incidents.
  • Maintain audit logs for compliance.
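
One lightweight way to carry provenance is to attach a small metadata envelope to every record at the collection point. The structure below is a sketch with assumed field names, not a standard schema.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    """Telemetry payload plus the provenance fields used for auditing."""
    source_node: str      # which network function emitted the sample
    collector_id: str     # which probe or agent collected it
    collected_at: str     # UTC timestamp taken at the collection point
    payload: dict         # the raw measurement itself
    payload_sha256: str   # content hash for tamper checks and deduplication

def wrap_with_provenance(source_node: str, collector_id: str, payload: dict) -> ProvenanceRecord:
    digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    return ProvenanceRecord(
        source_node=source_node,
        collector_id=collector_id,
        collected_at=datetime.now(timezone.utc).isoformat(),
        payload=payload,
        payload_sha256=digest,
    )

record = wrap_with_provenance("smf-3", "probe-eu-west-1",
                              {"latency_ms": 14.2, "drop_rate": 0.001})
print(asdict(record))
```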

Data-intensive networks reduce troubleshooting costs and customer churn by implementing a hybrid approach: automated checks at the processing stage combined with human oversight for extreme cases.


Basics of Creating Datasets

Data collection and labeling are key processes in developing artificial intelligence and machine learning systems; the accuracy and reliability of AI models depend on their quality. 

The first step is to define the goals: what data is needed, what it will be used for, and what task the model should solve. Next, information is collected from various sources, including open datasets, purpose-built collections, sensors, online resources, call-detail records, and synthetic data. It is vital to ensure that the sample is representative, covering different scenarios and avoiding bias.
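
Where real traffic is scarce or privacy-restricted, synthetic records can fill gaps in coverage. The snippet below is a deliberately simple sketch that fabricates call-detail-record-like rows with made-up fields; real synthetic data generation for telecom typically relies on far richer statistical or generative models.

```python
import random
import pandas as pd

def synthetic_cdrs(n: int, seed: int = 42) -> pd.DataFrame:
    """Generate toy call-detail records for dataset augmentation experiments."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        rows.append({
            "record_id": i,
            "duration_s": round(rng.expovariate(1 / 120), 1),  # ~2-minute average calls
            "bytes_up": rng.randint(10_000, 5_000_000),
            "bytes_down": rng.randint(10_000, 50_000_000),
            "cell_id": f"cell-{rng.randint(1, 200):03d}",
            "roaming": rng.random() < 0.05,
        })
    return pd.DataFrame(rows)

print(synthetic_cdrs(5))
```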

After collection, the data must be cleaned: duplicates, missing values, and incorrect or noisy records are removed.
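
A minimal cleaning pass over such a dataset might look like the following pandas sketch; the column names and thresholds are assumptions for illustration only.

```python
import pandas as pd

def clean_telemetry(df: pd.DataFrame) -> pd.DataFrame:
    """Remove duplicates, handle missing values, and drop obviously noisy records."""
    # Duplicate (node, timestamp) pairs usually come from double collection.
    df = df.drop_duplicates(subset=["node_id", "timestamp"])

    # Drop rows missing essential fields; interpolate short gaps in numeric KPIs.
    df = df.dropna(subset=["node_id", "timestamp"])
    df["latency_ms"] = df["latency_ms"].interpolate(limit=3)

    # Discard physically impossible values (negative latency, absurd throughput).
    df = df[(df["latency_ms"] >= 0) & (df["throughput_mbps"].between(0, 10_000))]
    return df.reset_index(drop=True)

# Usage: cleaned = clean_telemetry(raw_df)
```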

The next stage is labeling: assigning meaningful labels to the data. Experts perform labeling manually or through crowdsourcing, or use semi-automatic tools with subsequent verification, including packet-trace labeling for network traffic analysis. The quality of annotation instructions and consistency control between annotators play an essential role. It is also necessary to ensure class balance in the data to avoid a situation where the AI model learns well only on dominant examples and ignores rare ones.
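
Class balance can be checked, and partly corrected, before training. The sketch below inspects label frequencies on a toy label column and derives inverse-frequency class weights; the label names are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

labels = pd.Series(["normal"] * 950 + ["anomaly"] * 50)  # toy, heavily imbalanced labels

# Inspect the imbalance.
print(labels.value_counts(normalize=True))

# Derive class weights to pass to a classifier
# (most scikit-learn estimators accept class_weight or sample_weight).
weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(labels),
                               y=labels)
print(dict(zip(np.unique(labels), weights)))
```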

Ethics and confidentiality also matter: data must be collected in compliance with legislation protecting personal information. Dataset creation is thus a complex process that combines strategic planning, organization, quality control, and compliance with standards, and it forms the basis for the operation of any AI system.

Leveraging Machine Learning Techniques for Anomaly Detection

Supervised learning is used when historical data with clear labels is available. The model receives examples with correct answers and learns to predict the system's state or detect failures. This allows you to build fault diagnosis systems, predict congestion, classify traffic types, and perform fraud analytics in telecom environments. The disadvantage of this approach is the need for a large amount of labeled data, and labeling often requires expert knowledge. 
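
As an illustrative supervised baseline, a random forest can be trained on labeled KPI vectors. The features and labels below are synthetic placeholders, not a recommended feature set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)

# Synthetic KPI features: [latency_ms, packet_loss, throughput_mbps]
normal = rng.normal([15, 0.001, 800], [3, 0.0005, 100], size=(950, 3))
faulty = rng.normal([60, 0.05, 200], [15, 0.02, 80], size=(50, 3))
X = np.vstack([normal, faulty])
y = np.array([0] * 950 + [1] * 50)  # 1 = failure / anomaly

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```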

Unsupervised learning does not require labels and is used to detect hidden patterns in the data. In communication networks, it is used to cluster users by behavior, detect anomalies in traffic, analyze load patterns, or segment base stations by performance characteristics. This approach is beneficial for cybersecurity, where unknown threats can be recognized as anomalous deviations from typical data. In addition, unsupervised methods help reduce labeling requirements and prepare the basis for further application of supervised learning. 
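
A common unsupervised starting point is an isolation forest, which needs no labels and scores how easily each sample can be isolated from the rest. The traffic data below is synthetic.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
traffic = rng.normal(loc=[500, 0.2], scale=[50, 0.05], size=(2000, 2))  # [sessions, error_rate]
spikes = rng.normal(loc=[1500, 0.9], scale=[100, 0.05], size=(20, 2))   # injected anomalies
X = np.vstack([traffic, spikes])

detector = IsolationForest(contamination=0.01, random_state=1).fit(X)
scores = detector.decision_function(X)  # lower = more anomalous
flags = detector.predict(X)             # -1 = anomaly, 1 = normal
print(f"flagged {(flags == -1).sum()} of {len(X)} samples")
```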

Both approaches are often combined in hybrid solutions, such as clustering to group data and then using the resulting clusters as labels to train a classifier. Thus, supervised learning in communication networks focuses on accuracy and predicting known scenarios. In contrast, unsupervised learning helps to deal with uncertainty, discover new patterns, and increase the adaptability of network systems.
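
A hedged sketch of the hybrid idea: cluster unlabeled traffic, treat the cluster assignments as provisional labels, and train a classifier on them. In practice, the clusters would be reviewed by experts before being trusted as labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = np.vstack([
    rng.normal([10, 0.5], [2, 0.1], size=(500, 2)),   # e.g. light browsing traffic
    rng.normal([80, 5.0], [10, 1.0], size=(500, 2)),  # e.g. heavy streaming traffic
])

# Step 1: group unlabeled samples.
pseudo_labels = KMeans(n_clusters=2, n_init=10, random_state=2).fit_predict(X)

# Step 2: train a fast supervised model on the pseudo-labels so new samples
# can be classified without re-running the clustering.
clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, pseudo_labels)
print(clf.predict([[12, 0.6], [75, 4.5]]))
```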

Using Neural Networks and Ensemble Methods

Deep learning architectures handle complex network behavior that other models miss. Recurrent neural networks track sequential patterns in timestamped data to detect multi-step attacks and identify latency spikes in real time. Autoencoders compress normal operating behavior into compact representations and flag deviations whose reconstruction error is high.
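
A minimal reconstruction-based detector, sketched in PyTorch on synthetic KPI vectors: the autoencoder is trained only on normal samples, and records whose reconstruction error exceeds a threshold are flagged. The architecture and threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic "normal" KPI vectors (8 features each) and a shifted anomalous batch.
normal = torch.randn(2000, 8) * 0.5
anomalous = torch.randn(50, 8) * 0.5 + 3.0

model = nn.Sequential(
    nn.Linear(8, 4), nn.ReLU(),
    nn.Linear(4, 2), nn.ReLU(),  # bottleneck forces a compact representation
    nn.Linear(2, 4), nn.ReLU(),
    nn.Linear(4, 8),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Train on normal traffic only.
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(normal), normal)
    loss.backward()
    optimizer.step()

# Score by reconstruction error; pick a threshold from the normal data.
with torch.no_grad():
    err_normal = ((model(normal) - normal) ** 2).mean(dim=1)
    err_anom = ((model(anomalous) - anomalous) ** 2).mean(dim=1)
threshold = err_normal.quantile(0.99)
print(f"flagged {(err_anom > threshold).sum().item()} of {len(anomalous)} anomalies")
```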

These detection methods can be combined using ensemble approaches: for example, a random forest can cross-check the predictions of multiple neural detectors and reduce the number of false positives.
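
One simple, hedged way to combine detectors is majority voting over their binary flags (or averaging their anomaly scores). The sketch below uses three scikit-learn detectors whose `predict` returns -1 for anomalies; the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)
X_train = rng.normal(size=(1000, 4))                   # normal-only training data
X_new = np.vstack([rng.normal(size=(95, 4)),
                   rng.normal(loc=4.0, size=(5, 4))])  # mostly normal, a few outliers

detectors = [
    IsolationForest(random_state=3).fit(X_train),
    OneClassSVM(nu=0.05).fit(X_train),
    LocalOutlierFactor(novelty=True).fit(X_train),
]

# Each detector votes -1 (anomaly) or +1 (normal); require a majority to flag,
# which tends to suppress single-model false positives.
votes = np.stack([d.predict(X_new) for d in detectors])
flagged = (votes == -1).sum(axis=0) >= 2
print(f"{flagged.sum()} of {len(X_new)} samples flagged by majority vote")
```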

Telecom Anomaly Datasets: Practices for Anomaly Detection at the Core of 5G

Structured frameworks are what separate effective detection systems from merely reactive tools. Four key areas guide the curation of training data for next-generation networks: adaptive design, collaborative validation, dynamic maintenance, and measurable outcomes.

Three standardization practices ensure cross-platform interoperability (a timestamp-normalization sketch follows the list):

  • Unified timestamp formats on distributed nodes.
  • Standard metadata tagging rules.
  • Consistent severity matrices.
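
For the first practice, a small sketch of normalizing heterogeneous timestamps from distributed nodes into a single UTC ISO 8601 representation; the sample formats and node names are assumptions.

```python
from datetime import timezone
from dateutil import parser

raw = [
    ("gnb-12", "2024-03-01 10:15:00+02:00"),
    ("upf-3", "01/03/2024 08:15:00Z"),
    ("amf-1", "2024-03-01T08:15:00.250000+00:00"),
]

def to_utc_iso(ts: str) -> str:
    """Parse a vendor-specific timestamp string and re-serialize it as UTC ISO 8601."""
    dt = parser.parse(ts, dayfirst=True)  # tolerant parsing of mixed formats
    return dt.astimezone(timezone.utc).isoformat()

for node, ts in raw:
    print(node, to_utc_iso(ts))
```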

Version control systems automatically track changes across data iterations. Organizations that leverage detailed edit histories reduce the cost of retraining AI models.

With shared anomaly libraries and data anonymization protocols, networks achieve faster threat identification. Performance metrics focus on actionable outcomes: mean time to detection, containment rates, and operational impact metrics.
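
Mean time to detection is straightforward to compute once incident start times and detection times are recorded; a toy calculation with made-up timestamps:

```python
from datetime import datetime

incidents = [
    # (anomaly actually began, detector raised the alert)
    (datetime(2024, 5, 1, 10, 0, 0), datetime(2024, 5, 1, 10, 2, 30)),
    (datetime(2024, 5, 3, 14, 20, 0), datetime(2024, 5, 3, 14, 21, 10)),
    (datetime(2024, 5, 7, 2, 5, 0), datetime(2024, 5, 7, 2, 9, 45)),
]

delays = [(detected - started).total_seconds() for started, detected in incidents]
mttd = sum(delays) / len(delays)
print(f"mean time to detection: {mttd:.0f} seconds")
```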

FAQ

Why is quality data necessary for anomaly detection in 5G networks?

Quality data is what allows machine learning algorithms to distinguish normal behavior from potential threats; incomplete or noisy inputs lead to missed anomalies and false alarms.

How do machine learning techniques reduce false positives when detecting anomalies?

Machine learning techniques reduce false positives by analyzing large amounts of data and learning to distinguish genuine anomalies from harmless deviations in normal behavior.

What are the challenges of integrating telecom data from multiple sources?

Different formats, protocols, and standards complicate the integration of telecom data from multiple sources and require additional normalization and unification.

What metrics are most important when evaluating anomaly detection models?

Precision, Recall, and F1-score are the most important metrics, as they reflect the balance between correctly detected anomalies and the number of false positives. Also considered are ROC-AUC and PR-AUC, which assess the ability of an AI model to distinguish normal from abnormal data at different thresholds.
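
These metrics are easy to compute with scikit-learn on a toy set of ground-truth and predicted labels (1 = anomaly); the numbers below are illustrative only.

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 0, 0, 1, 1, 0, 0]                        # hard predictions
y_score = [0.1, 0.2, 0.6, 0.1, 0.3, 0.2, 0.9, 0.8, 0.4, 0.1]   # model scores for AUC

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("ROC-AUC:  ", roc_auc_score(y_true, y_score))
```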