Machine Learning for
Computer Security
Open-source software and datasets
developed by the research group of Konrad Rieck

About

This webpage revolves around machine learning and computer security. It provides a collection of open-source software and datasets that have been developed by the research group of Konrad Rieck. The group is currently working at TU Braunschweig, where it forms the Institute of System Security.

Software

Vulnerability Discovery

Joern — A Robust Tool for Static Code Analysis

Joern is a platform for robust analysis of C/C++ code. It generates code property graphs, a novel graph representation of code that exposes the code’s syntax, control-flow, data-flow and type information. Code property graphs are stored in a graph database. This allows code to be mined using search queries formulated in the graph traversal language Gremlin. Joern forms the basis for assisted vulnerability discovery using machine learning techniques.  Website   Yamaguchi et al., S&P 2014

Pulsar — Protocol Learning, Simulation and Stateful Fuzzing

Pulsar is a network fuzzer with automatic protocol learning and simulation capabilities. The tool models a protocol using machine learning techniques such as clustering and hidden Markov models. These models let Pulsar simulate communication with a real client or server by generating semantically correct messages. Combined with a series of fuzzing primitives, this makes it possible to test the implementation of an unknown protocol for errors in deeper states of its protocol state machine.  Github   Gascon et al., SC 2015

Malware Analysis and Detection

Adagio — Structural Analysis and Detection of Android Malware

Adagio is a collection of Python modules for analyzing and detecting Android malware. These modules extract labeled call graphs from Android APKs or DEX files and apply an explicit feature map that captures their structural relationships. Additional modules provide classes for designing binary or multiclass classification experiments and applying machine learning for detection of malicious structure.  Github   Gascon et al., AISEC 2013

Salad — A Content Anomaly Detector based on n-Grams

Letter Salad, or Salad for short, is an efficient and flexible implementation of the anomaly detection method Anagram. The method uses n-grams (substrings of length n) maintained in a Bloom filter for efficiently detecting anomalies in large sets of string data. Salad extends the original method by supporting n-grams of bytes as well as n-grams of words and tokens.  Website   Wressnegger et al., AISEC 2013
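The idea of keeping n-grams in a Bloom filter can be illustrated with a short Python sketch. This is not Salad's code or interface; the names BloomFilter, byte_ngrams, train and anomaly_score, the filter size, and the scoring rule (fraction of n-grams not seen during training) are assumptions chosen for illustration only.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bit array with k hash positions per item."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive k positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def byte_ngrams(data, n):
    """Return all byte substrings of length n, in order."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def train(samples, n=3):
    """Insert every n-gram of the (assumed benign) samples into the filter."""
    bloom = BloomFilter()
    for sample in samples:
        for gram in byte_ngrams(sample, n):
            bloom.add(gram)
    return bloom


def anomaly_score(bloom, data, n=3):
    """Fraction of n-grams in data that the filter has not seen."""
    grams = byte_ngrams(data, n)
    if not grams:
        return 0.0
    unseen = sum(1 for gram in grams if gram not in bloom)
    return unseen / len(grams)


model = train([b"GET /index.html HTTP/1.1"])
score_known = anomaly_score(model, b"GET /index.html HTTP/1.1")
score_odd = anomaly_score(model, b"\x90\x90\x90\x90\xcc\xcc")
```

A Bloom filter never forgets an inserted item, so data seen during training always scores 0.0. It can report false positives, which means an anomaly score may be slightly too low but never too high.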

Malheur — Automatic Analysis of Malware Behavior

Malheur is a tool for the automatic analysis of program behavior recorded from malware. It has been designed to support the regular analysis of malware and the development of detection and defense measures. Malheur allows for identifying novel classes of malware with similar behavior and assigning unknown malware to discovered classes using machine learning.  Website   Rieck et al., JCS 2011

Generic Data Analysis

Harry — A Tool for Measuring String Similarity

Harry is a tool for comparing strings and measuring their similarity. The tool supports several common distance and kernel functions for strings as well as some exotic similarity measures. The focus lies on implicit similarity measures, that is, comparison functions that do not give rise to an explicit vector space. Examples of such similarity measures are the Levenshtein and Jaro-Winkler distances.  Website    Rieck & Wressnegger, JMLR 2016
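As an illustration of one of the measures named above, the Levenshtein distance can be computed with the standard dynamic-programming recurrence. The sketch below is a plain Python version and does not reflect how Harry implements or exposes the measure; the function name levenshtein is an assumption.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to save memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


distance = levenshtein("kitten", "sitting")  # 3
```

The distance is defined directly on pairs of strings, without first mapping them to vectors, which is what makes it an implicit measure in the sense described above.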

Sally — A Tool for Embedding Strings in Vector Spaces

Sally is a small tool for mapping a set of strings to a set of vectors. This mapping is referred to as embedding and allows for applying techniques of machine learning and data mining to the analysis of string data. Sally can be applied to several types of string data, such as text documents, DNA sequences or log files, and it can read common input formats such as directories, archives and text files.  Website    Rieck et al., JMLR 2012
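One simple way to embed strings in a vector space is to count character n-grams and use each distinct n-gram as one dimension. The Python sketch below shows this idea under that assumption; it is not Sally's implementation, and the names char_ngrams and embed are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Return all character substrings of length n, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def embed(strings, n=2):
    """Map each string to a vector of n-gram counts.

    Returns the sorted vocabulary (one n-gram per dimension) and
    one count vector per input string.
    """
    counts = [Counter(char_ngrams(s, n)) for s in strings]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[gram] for gram in vocabulary] for c in counts]
    return vocabulary, vectors


vocabulary, vectors = embed(["abab", "abc"], n=2)
# vocabulary == ["ab", "ba", "bc"]
# vectors == [[2, 1, 0], [1, 0, 1]]
```

Once strings are represented as vectors like these, standard learning methods that expect numeric input can be applied to them.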

Prisma — Protocol Inspection and State Machine Analysis

Prisma is an R package for processing and analyzing huge text corpora. In combination with the tool Sally the package provides testing-based token selection and replicate-aware, highly tuned non-negative matrix factorization and principal component analysis. Prisma allows for analyzing very big data sets even on desktop machines.  CRAN   Krueger et al., PSDML 2010

Datasets

Drebin Dataset

The Drebin dataset consists of roughly 5,000 malicious Android applications that were collected as part of the Mobile Sandbox project between 2010 and 2012. The dataset has been downloaded by over 150 research institutes and universities.  Website    Arp et al., NDSS 2014

Malheur Dataset

The Malheur dataset contains the recorded behavior of roughly 30,000 malicious programs (malware). It was created in 2009 for developing clustering and classification methods for malware behavior. Due to the rapid evolution of malware, the dataset can now be considered obsolete.  Website    Rieck et al., JCS 2011

Contact

Institute of System Security
TU Braunschweig
Rebenring 56
38106 Braunschweig

Responsibility under the German Press Law §55 Sect. 2 RStV:
Prof. Dr. Konrad Rieck
Email: rieck@mlsec.org
Phone: +49 531 391-55120