In my previous post we covered autoencoders, and that knowledge is well worth putting into practice. Let's now imagine a system whose servers communicate via Kafka. Over the system's lifetime it has turned out that some events are quite harmful. We must detect them and route them to a separate process where they will be examined thoroughly.
Let’s start with a few assumptions:
Kafka Streams is a library for processing data as it flows between topics. The first step is to plug into [all_events] and build a processing topology.
Filtered by key, events are routed to dedicated topics where each will be handled appropriately.
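A minimal sketch of such a topology, assuming string-serialized events and hypothetical topic names ([anomalous_events], [normal_events]) for the downstream consumers:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

public class TopologySketch {

    public static StreamsBuilder buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();

        // Plug into the source topic that carries every event.
        KStream<String, String> allEvents = builder.stream("all_events");

        // Route events to dedicated topics based on the key set upstream.
        allEvents.filter((key, value) -> "ANOMALOUS".equals(key))
                 .to("anomalous_events");
        allEvents.filter((key, value) -> !"ANOMALOUS".equals(key))
                 .to("normal_events");

        return builder;
    }
}
```

In a real deployment the builder would be handed to a `KafkaStreams` instance together with the application's `Properties`.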
The library also allows importing pre-trained Keras models. This is a good option when a dedicated AI team works in TensorFlow/Keras and owns building and tuning the models. In this case, however, we will take a different route: we will create and train the autoencoder in Java.
The events have the same structure and values as in the previously discussed example. They are split into two CSV files: [normal_raw_events.csv] and [anomalous_raw_events.csv].
Incoming data is not standardized, so we build a dedicated NormalizerMinMaxScaler which scales values into the range [0.0, 1.0].
The trained normalizer will serve as a pre-processor for a dedicated iterator that walks through the [normal_raw_events.csv] file.
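These two steps can be sketched as follows with DataVec and ND4J; the batch size and the assumption of a single header line in the CSV are mine:

```java
import java.io.File;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerMinMaxScaler;

public class TrainingData {

    public static DataSetIterator normalEvents(int batchSize) throws Exception {
        // Read the raw CSV, skipping the header line.
        CSVRecordReader reader = new CSVRecordReader(1);
        reader.initialize(new FileSplit(new File("normal_raw_events.csv")));

        // No label column: the autoencoder reconstructs its own input.
        DataSetIterator iterator = new RecordReaderDataSetIterator(reader, batchSize);

        // Fit the scaler on the data, then attach it as a pre-processor
        // so every batch is scaled into [0.0, 1.0] on the fly.
        NormalizerMinMaxScaler normalizer = new NormalizerMinMaxScaler(0.0, 1.0);
        normalizer.fit(iterator);
        iterator.reset();
        iterator.setPreProcessor(normalizer);
        return iterator;
    }
}
```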
The autoencoder will have the same structure as in the Keras example mentioned earlier.
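A DL4J configuration for such a network might look like the sketch below. The layer sizes, bottleneck width, and hyperparameters here are illustrative assumptions; they should mirror whatever the original Keras architecture used:

```java
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.DenseLayer;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.learning.config.Adam;
import org.nd4j.linalg.lossfunctions.LossFunctions;

public class AutoencoderFactory {

    // nFeatures = number of columns in one event row.
    public static MultiLayerNetwork build(int nFeatures) {
        MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
                .seed(42)
                .updater(new Adam(1e-3))
                .list()
                .layer(new DenseLayer.Builder().nIn(nFeatures).nOut(16)
                        .activation(Activation.RELU).build())
                .layer(new DenseLayer.Builder().nIn(16).nOut(8)      // bottleneck
                        .activation(Activation.RELU).build())
                .layer(new DenseLayer.Builder().nIn(8).nOut(16)
                        .activation(Activation.RELU).build())
                // MSE loss: the output layer reconstructs the input row.
                .layer(new OutputLayer.Builder(LossFunctions.LossFunction.MSE)
                        .nIn(16).nOut(nFeatures)
                        .activation(Activation.SIGMOID).build())
                .build();

        MultiLayerNetwork net = new MultiLayerNetwork(conf);
        net.init();
        return net;
    }
}
```

During training each batch's features are fed as both input and target, since the network learns to reproduce normal events.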
The model and the normalizer are saved; ultimately they should live on a dedicated resource from which the running application downloads them and builds its configuration.
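Persisting both artifacts might look like this; the file names are assumptions:

```java
import java.io.File;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.util.ModelSerializer;
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerMinMaxScaler;
import org.nd4j.linalg.dataset.api.preprocessor.serializer.NormalizerSerializer;

public class ModelStore {

    public static void save(MultiLayerNetwork model,
                            NormalizerMinMaxScaler normalizer) throws Exception {
        // "true" also saves the updater state, so training could resume later.
        ModelSerializer.writeModel(model, new File("autoencoder.zip"), true);
        NormalizerSerializer.getDefault()
                .write(normalizer, new File("normalizer.bin"));
    }

    public static MultiLayerNetwork loadModel() throws Exception {
        return ModelSerializer.restoreMultiLayerNetwork(new File("autoencoder.zip"));
    }
}
```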
In our run the trained model showed an average reconstruction error of about 0.0188 for typical events and 0.0834 for anomalies. Plotting the MSE for 100 events from each group makes it easy to pick a cut-off point, which we set at [threshold = 0.045].
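Classifying against the threshold then boils down to comparing each event's per-row MSE with the cut-off; a plain-Java sketch of that calculation:

```java
public class Threshold {

    static final double THRESHOLD = 0.045;

    // Mean squared error between an event and its reconstruction.
    static double mse(double[] original, double[] reconstructed) {
        double sum = 0.0;
        for (int i = 0; i < original.length; i++) {
            double diff = original[i] - reconstructed[i];
            sum += diff * diff;
        }
        return sum / original.length;
    }

    // Events the autoencoder reconstructs poorly are flagged as anomalies.
    static boolean isAnomalous(double[] original, double[] reconstructed) {
        return mse(original, reconstructed) > THRESHOLD;
    }
}
```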
To plug the model into the processing topology I will use the ValueTransformer interface, implemented in the AnomalyDetection class. In the constructor we provide the model together with its normalizer and the helpers used to calculate the reconstruction error.
The transform method receives the events collected in the time window. They must be mapped to an [INDArray], the format the model understands. For each event a reconstruction error is calculated; those that exceed the threshold receive the ANOMALOUS key.
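A sketch of such a transformer, assuming for simplicity that each event arrives as a `double[]` of already-parsed feature values (in the real system there would be a mapping step from the event payload):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.ValueTransformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerMinMaxScaler;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.linalg.ops.transforms.Transforms;

public class AnomalyDetection
        implements ValueTransformer<List<double[]>, List<KeyValue<String, double[]>>> {

    private final MultiLayerNetwork model;
    private final NormalizerMinMaxScaler normalizer;
    private final double threshold = 0.045;

    public AnomalyDetection(MultiLayerNetwork model, NormalizerMinMaxScaler normalizer) {
        this.model = model;
        this.normalizer = normalizer;
    }

    @Override
    public void init(ProcessorContext context) { /* no state to set up */ }

    @Override
    public List<KeyValue<String, double[]>> transform(List<double[]> events) {
        // Map the window's events into one INDArray the model understands.
        INDArray input = Nd4j.create(events.toArray(new double[0][]));
        normalizer.transform(input);

        // Reconstruct the batch and compute the per-row MSE.
        INDArray reconstructed = model.output(input);
        INDArray mse = Transforms.pow(reconstructed.sub(input), 2).mean(1);

        // Re-key each event according to its reconstruction error.
        List<KeyValue<String, double[]>> result = new ArrayList<>();
        for (int i = 0; i < events.size(); i++) {
            String key = mse.getDouble(i) > threshold ? "ANOMALOUS" : "NORMAL";
            result.add(KeyValue.pair(key, events.get(i)));
        }
        return result;
    }

    @Override
    public void close() { }
}
```

The downstream filter step from the topology then routes each record by this key.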
Mateusz Frączek, R&D Division Leader