Machine Learning for
Computer Security
Open-source software and datasets
developed by the research group of Konrad Rieck

This webpage revolves around machine learning and computer security. It provides a collection of open-source software and datasets that have been developed by the research group of Konrad Rieck. The group is currently working at TU Braunschweig, where it forms the Institute of System Security.

Machine Learning

Dodo — Dos and Don'ts of Machine Learning in Computer Security

In the project, we identify common pitfalls in the design, implementation, and evaluation of learning-based security systems. We demonstrate how individual pitfalls can lead to unrealistic performance and interpretations, obstructing the understanding of the security problem at hand. As a remedy, we propose actionable recommendations to support researchers in avoiding or mitigating the pitfalls where possible.  Website    Data   Paper

Elsa — Evaluating Explanation Methods for Deep Learning in Computer Security

In the project, we develop criteria for comparing and evaluating explanation methods in the context of computer security. These cover general properties, such as the accuracy of explanations, as well as security-focused aspects, such as completeness, efficiency, and robustness. We observe significant differences between explanation methods and build on these to derive general recommendations for selecting and applying them in computer security.  Website    Code   Paper

Adversarial Learning

Scaler — Image-Scaling Attacks in Machine Learning

This project studies image-scaling attacks, a new form of attack that allows an adversary to manipulate images so that their content changes during downscaling. Image-scaling attacks are a considerable threat, as scaling is omnipresent in computer vision. Moreover, these attacks are agnostic to the learning model and training data, affecting any learning-based system operating on images.  Code    Data   Paper
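The core idea can be illustrated with a toy example. The sketch below is not the project's attack code: it assumes a deliberately simple downscaler that keeps every `factor`-th pixel, and the helper names `downscale_nearest` and `embed_target` are hypothetical. Real scaling libraries sample and weight pixels differently, so this only shows why changing a few pixels can be enough to alter the downscaled image.

```python
def downscale_nearest(image, factor):
    """Toy scaler (assumption): keep every `factor`-th pixel in both directions."""
    return [row[::factor] for row in image[::factor]]


def embed_target(source, target, factor):
    """Overwrite only those source pixels that the toy scaler will sample."""
    attacked = [row[:] for row in source]  # copy, leave the source untouched
    for y, target_row in enumerate(target):
        for x, value in enumerate(target_row):
            attacked[y * factor][x * factor] = value
    return attacked


# 8x8 white source image and a 2x2 black target image (grayscale values).
source = [[255] * 8 for _ in range(8)]
target = [[0, 0], [0, 0]]

attacked = embed_target(source, target, factor=4)

# Count how many pixels differ between the source and the manipulated image.
changed = sum(
    a != s
    for attacked_row, source_row in zip(attacked, source)
    for a, s in zip(attacked_row, source_row)
)

print(changed, "of 64 pixels changed")
print(downscale_nearest(attacked, 4) == target)
```

In this toy setting, the manipulated image still looks almost entirely white at full resolution, yet the downscaled version equals the target. The learning model never enters the picture, which matches the observation that the attack is independent of model and training data.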

Twins — Machine Learning meets Digital Watermarking

In this research project, we explore similarities between machine learning and digital watermarking under attack. As part of the project, we have developed a unified view of attacks in both domains and created a framework for modeling evasion and poisoning attacks. The code and datasets of our case studies are publicly available.  Code    Data   Paper

Imitator — Adversarial Examples of Source Code

In this project, we attack methods for authorship attribution of source code using adversarial learning. We exploit that these methods rest on machine learning and thus can be deceived by adversarial examples of source code. Our attack performs a series of semantics-preserving code transformations that mislead the attribution but appear plausible to a developer. Our attack and the datasets are publicly available.  Code    Data   Paper

Vulnerability Discovery

Joern — A Robust Tool for Static Code Analysis

Joern is a platform for robust analysis of C/C++ code. It generates code property graphs, a novel graph representation of code that exposes the code’s syntax, control-flow, data-flow and type information. Code property graphs are stored in a graph database. This allows code to be mined using search queries formulated in the graph traversal language Gremlin. Joern forms the basis for assisted vulnerability discovery using machine learning techniques.  Code   Paper

Pulsar — Protocol Learning, Simulation and Stateful Fuzzing

Pulsar is a network fuzzer with automatic protocol learning and simulation capabilities. The tool models a protocol using machine learning techniques, such as clustering and hidden Markov models. These models allow Pulsar to simulate communication with a real client or server using semantically correct messages. Combined with a series of fuzzing primitives, this makes it possible to test the implementation of an unknown protocol for errors in deeper states of its protocol state machine.  Code   Paper

Malware Analysis

Drebin — Dataset of Malicious Android Applications

The Drebin dataset consists of roughly 5,000 malicious Android applications that have been collected as part of the Mobile Sandbox project between 2010 and 2012. The dataset can be used to experiment with Android malware and compare different detection approaches.  Data    Paper

Adagio — Structural Analysis and Detection of Android Malware

Adagio is a collection of Python modules for analyzing and detecting Android malware. These modules extract labeled call graphs from Android APKs or DEX files and apply an explicit feature map that captures their structural relationships. Additional modules provide classes for designing binary or multiclass classification experiments and applying machine learning to detect malicious structure.  Code   Paper

Malheur — Automatic Analysis of Malware Behavior

Malheur is a tool for the automatic analysis of program behavior recorded from malware. It has been designed to support the regular analysis of malware and the development of detection and defense measures. Malheur allows for identifying novel classes of malware with similar behavior and assigning unknown malware to discovered classes using machine learning.  Code    Data   Paper

Data Analysis

Harry — A Tool for Measuring String Similarity

Harry is a tool for comparing strings and measuring their similarity. The tool supports several common distance and kernel functions for strings as well as some exotic similarity measures. The focus lies on implicit similarity measures, that is, comparison functions that do not give rise to an explicit vector space. Examples of such similarity measures are the Levenshtein and Jaro-Winkler distances.  Code    Paper
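As an illustration of an implicit similarity measure, the following sketch computes the Levenshtein distance with the standard dynamic-programming recurrence. It is a minimal textbook version written for this page, not Harry's implementation, and the function name `levenshtein` is our own choice.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string in the outer loop

    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

The distance is computed directly on the two strings. No vector representation is built at any point, which is what makes the measure implicit.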

Sally — A Tool for Embedding Strings in Vector Spaces

Sally is a small tool for mapping a set of strings to a set of vectors. This mapping is referred to as embedding and makes it possible to apply machine learning and data mining techniques to string data. Sally can be applied to several types of string data, such as text documents, DNA sequences or log files, where it can handle common formats such as directories, archives and text files.  Code    Paper
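One common way to embed strings is to count their n-grams and use the counts as vector dimensions. The sketch below shows this idea in a few lines. It is an assumption chosen for illustration and says nothing about Sally's actual feature extraction, options or output formats. The names `ngrams` and `embed` are hypothetical.

```python
from collections import Counter


def ngrams(string, n):
    """Return all substrings of length n, in order of appearance."""
    return [string[i:i + n] for i in range(len(string) - n + 1)]


def embed(strings, n=2):
    """Map each string to a vector of n-gram counts over a shared vocabulary."""
    counts = [Counter(ngrams(s, n)) for s in strings]
    vocabulary = sorted(set().union(*counts))  # one dimension per n-gram
    vectors = [[c[gram] for gram in vocabulary] for c in counts]
    return vocabulary, vectors


vocabulary, vectors = embed(["abab", "abc"], n=2)
print(vocabulary)
print(vectors)
```

Once the strings are vectors, standard learning methods that expect numeric input can be applied to them.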

Salad — A Content Anomaly Detector based on n-Grams

Letter Salad, or Salad for short, is an efficient and flexible implementation of the anomaly detection method Anagram. The method uses n-grams (substrings of length n) maintained in a Bloom filter for efficiently detecting anomalies in large sets of string data. Salad extends the original method by supporting n-grams of bytes as well as n-grams of words and tokens.  Code   Paper
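The principle can be sketched as follows: n-grams of normal data are stored in a Bloom filter, and new data is scored by the fraction of its n-grams that the filter has not seen. This is a simplified illustration under our own assumptions (filter size, SHA-256 based hashing, byte n-grams with n = 3), not Salad's code, and all class and function names are hypothetical.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; size and hash count are illustrative assumptions."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: bytes):
        # Derive several bit positions by salting SHA-256 with a seed byte.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def byte_ngrams(data: bytes, n: int):
    """Return all byte substrings of length n."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def train(samples, n=3):
    """Insert the n-grams of all normal samples into a Bloom filter."""
    bloom = BloomFilter()
    for sample in samples:
        for gram in byte_ngrams(sample, n):
            bloom.add(gram)
    return bloom


def anomaly_score(bloom, data: bytes, n=3):
    """Fraction of n-grams in data that were not seen during training."""
    grams = byte_ngrams(data, n)
    if not grams:
        return 0.0
    unseen = sum(1 for gram in grams if gram not in bloom)
    return unseen / len(grams)


bloom = train([b"GET /index.html HTTP/1.1"])
print(anomaly_score(bloom, b"GET /index.html HTTP/1.1"))
print(anomaly_score(bloom, b"\x90\x90\x90\x90\xcc\xcc"))
```

A Bloom filter never forgets an inserted n-gram, so data seen during training always scores zero. It may occasionally report an unseen n-gram as known, which trades a small loss in accuracy for a compact, fixed-size model.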


Institute of System Security
TU Braunschweig
Rebenring 56
38106 Braunschweig

Responsibility under the German Press Law §55 Sect. 2 RStV:
Prof. Dr. Konrad Rieck
Phone: +49 531 391-55120