Salad

A Content Anomaly Detector based on n-Grams

Letter Salad, or Salad for short, enables detecting anomalies in large string-based datasets. The tool is based on n‑gram models, a generalization of the well-known bag-of-words models, in which each input is represented as the set of its substrings of length n. Here, the length may be specified in bits, bytes, or words. For training the model, n‑grams are extracted from the input data and stored in probabilistic data structures, such as Bloom filters or Count-Min Sketches. This enables Salad to represent a large corpus of data in little memory. For anomaly detection, the n‑grams of unknown strings are then matched against the learned model. Features (n‑grams) not seen during training are indicators of anomalies.
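The training and matching steps described above can be illustrated with a minimal Python sketch. It is not Salad's implementation: the names `byte_ngrams`, `BloomFilter`, `train`, and `anomaly_score`, the filter size, the number of hash functions, and the choice of BLAKE2b as hash are all assumptions made for illustration, and the score (the fraction of n‑grams not found in the model) is one plausible way to turn unseen n‑grams into a number.

```python
import hashlib


def byte_ngrams(data: bytes, n: int) -> set[bytes]:
    """Return the set of all byte substrings of length n."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


class BloomFilter:
    """Toy Bloom filter (hypothetical sizes, not Salad's parameters)."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item, digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # May report false positives, never false negatives.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def train(samples: list[bytes], n: int) -> BloomFilter:
    """Store every n-gram of the training samples in a Bloom filter."""
    model = BloomFilter()
    for sample in samples:
        for gram in byte_ngrams(sample, n):
            model.add(gram)
    return model


def anomaly_score(model: BloomFilter, data: bytes, n: int) -> float:
    """Fraction of the input's n-grams that the model has not seen."""
    grams = byte_ngrams(data, n)
    if not grams:
        return 0.0
    unseen = sum(1 for gram in grams if gram not in model)
    return unseen / len(grams)


model = train([b"GET /index.html HTTP/1.1", b"GET /about.html HTTP/1.1"], n=3)
print(anomaly_score(model, b"GET /index.html HTTP/1.1", n=3))
print(anomaly_score(model, b"\x90\x90\x90\x90\xcc\xcc\xcc\xcc", n=3))
```

Because a Bloom filter never forgets an inserted item, a string that was part of the training data always scores 0.0 in this sketch. Hash collisions can only make unknown n‑grams look known, so the score errs toward under-reporting anomalies as the filter fills up.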

Salad is based on concepts from Anagram by Wang et al. (RAID 2006), but extends the original work in several ways: First, the tool operates not only on n‑grams of bytes, but is also capable of comparing n‑grams over words and tokens. Second, it is not limited to Bloom filters, but also uses more advanced probabilistic data structures such as Count-Min Sketches to allow for more flexible detection schemes. Third, Salad implements a 2-class version of the detector that enables discriminating data from two opposing classes, for instance, benign and malicious. Finally, the tool features a built-in inspection and statistics mode that can help to analyze the learned Bloom filter and its predictions.
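To make the first two extensions more concrete, the following Python sketch counts word n‑grams in a Count-Min Sketch. Again, this is only an illustration under stated assumptions: `word_ngrams`, `CountMinSketch`, the table dimensions, whitespace tokenization, and the BLAKE2b-based hashing are hypothetical choices and do not reflect Salad's actual code or parameters.

```python
import hashlib


def word_ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    """Return all runs of n consecutive whitespace-separated words."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


class CountMinSketch:
    """Toy Count-Min Sketch (hypothetical width and depth)."""

    def __init__(self, width: int = 2048, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item: bytes):
        # One column per row, chosen by a row-specific salted hash.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item, digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item: bytes, count: int = 1) -> None:
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item: bytes) -> int:
        # Collisions only inflate counters, so the minimum over all rows
        # is an upper bound on the true count that is never too low.
        return min(self.table[row][col] for row, col in self._cells(item))


sketch = CountMinSketch()
for line in ["the quick brown fox", "the quick red fox", "the quick brown dog"]:
    for gram in word_ngrams(line, 2):
        sketch.add(" ".join(gram).encode("utf-8"))

print(sketch.estimate(b"the quick"))
print(sketch.estimate(b"lazy cat"))
```

Unlike a Bloom filter, which only answers whether an n‑gram was seen, the sketch returns an approximate count. A detector could, for example, treat n‑grams seen fewer than some threshold number of times as unknown; that threshold rule is an illustrative assumption, not a description of Salad's detection schemes.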

Most prominently, the underlying concepts have been used in computer security for intrusion and attack detection (Wang et al., 2006; Wressnegger et al., 2013, 2018). However, Salad is not limited to this domain and can be used in a variety of applications. To illustrate the versatility of the tool, we provide some concrete examples of its usage. All examples come with data sets and instructions.

Author of Salad

Salad is developed by Christian Wressnegger at TU Braunschweig and has previously been supported by the University of Göttingen and idalab GmbH.

You can contact the main author at christian at mlsec.org.
For news and updates follow me on Twitter.