Malheur

Automatic Analysis of Malware Behavior

Examples

Distances of program behavior

The first example demonstrates how a distance matrix is computed for the archive "dataset.zip" containing reports of program behavior. The matrix is written to the file "out.txt".

    malheur -o out.txt -v distance dataset.zip

The distance matrix reflects the dissimilarity of behavior for each pair of reports in the archive. The entries of the matrix range from 0 to sqrt(2), where small values indicate similar behavior and larger values indicate deviating behavior. The matrix can be used as the basis for several analysis and data mining techniques, such as hierarchical clustering, nearest-neighbor classification or multi-dimensional scaling. It is a generic starting point for research on analysis of malware behavior.
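The range from 0 to sqrt(2) is what one would expect from the Euclidean distance between non-negative feature vectors scaled to unit length. The following Python sketch illustrates this under that assumption; the feature names and counts are made up, and the code is not taken from Malheur itself.

```python
import math

def normalize(vec):
    """Scale a sparse vector (dict: feature -> weight) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {k: w / norm for k, w in vec.items()}

def distance(a, b):
    """Euclidean distance between two sparse vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

# Hypothetical feature counts for three reports.
x = normalize({"open_file": 2, "write_file": 1})
y = normalize({"open_file": 4, "write_file": 2})  # same behavior, twice as long
z = normalize({"connect": 3})                      # shares no feature with x

d_same = distance(x, y)      # approximately 0
d_disjoint = distance(x, z)  # approximately sqrt(2)
```

Reports with proportional feature counts end up at distance 0, while reports without any shared feature end up at the maximum distance.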

Extraction of prototypes

Manual inspection of several behavior reports is tedious and annoying. The second example illustrates how prototypical reports are extracted from the dataset "dataset.zip". The prototypes are written to the file "out.txt".

    malheur -o out.txt -v prototype dataset.zip

From all the reports of program behavior, a small subset is selected that is representative of the full data set. The elements of this subset are referred to as prototypes. Prior to further analysis of a large data set, a quick inspection of the prototypes provides an overview of the contained behavior and shows patterns typical of the data set.
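One common way to pick such a representative subset is a greedy farthest-first selection on the distance matrix. The sketch below shows that idea; whether Malheur uses exactly this procedure is an assumption, and `select_prototypes`, `max_dist` and the example matrix are hypothetical.

```python
def select_prototypes(dist, max_dist):
    """Greedy farthest-first selection on a symmetric distance matrix.

    Prototypes are added until every report lies within max_dist of
    at least one prototype. Returns the indices of the prototypes.
    """
    n = len(dist)
    prototypes = [0]         # start from an arbitrary report
    nearest = list(dist[0])  # distance of each report to its closest prototype
    while True:
        far = max(range(n), key=lambda i: nearest[i])
        if nearest[far] <= max_dist:
            return prototypes
        prototypes.append(far)
        nearest = [min(nearest[i], dist[far][i]) for i in range(n)]

# Hypothetical 4x4 distance matrix: reports 0/1 and 2/3 behave alike.
dist = [
    [0.0, 0.1, 1.2, 1.3],
    [0.1, 0.0, 1.1, 1.2],
    [1.2, 1.1, 0.0, 0.2],
    [1.3, 1.2, 0.2, 0.0],
]
protos = select_prototypes(dist, max_dist=0.5)
```

With this matrix the selection stops after two prototypes, one for each group of similar reports.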

Clustering and classification

This example demonstrates how clustering and classification are applied for analysis of two data sets, "dataset1.zip" and "dataset2.zip". The clustering and classification results are written to "out1.txt" and "out2.txt" respectively.

    malheur -o out1.txt -v cluster dataset1.zip
    malheur -o out2.txt -v classify dataset2.zip

First, reports in the archive "dataset1.zip" are clustered into groups of similar behavior. The groups can be used to discover novel malware classes or identify behavioral patterns shared by several malware instances. Each cluster is represented by a small set of prototypical reports, such that manual inspection can usually be restricted to prototypes. Second, the reports in "dataset2.zip" are assigned to the discovered groups. This classification can be used to filter out variants of classes contained in "dataset1.zip", such that novel malware in "dataset2.zip" can be identified.
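The classification step can be pictured as a nearest-prototype assignment with a rejection threshold: reports close to a known prototype inherit its group, all others are left unassigned as candidates for novel malware. The following Python sketch illustrates this under assumed names (`classify`, `max_dist`, the cluster labels and feature weights are hypothetical) and is not Malheur's actual implementation.

```python
import math

def distance(a, b):
    """Euclidean distance between two sparse vectors (dict: feature -> weight)."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

def classify(report, prototypes, max_dist):
    """Return the label of the nearest prototype, or None if it is too far away."""
    best_label, best_dist = None, float("inf")
    for label, proto in prototypes:
        d = distance(report, proto)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= max_dist else None

# Hypothetical prototypes obtained from clustering the first data set.
prototypes = [
    ("cluster-a", {"open_file": 0.8, "write_file": 0.6}),
    ("cluster-b", {"connect": 1.0}),
]

# Two hypothetical reports from the second data set.
known = classify({"open_file": 0.6, "write_file": 0.8}, prototypes, max_dist=0.5)
novel = classify({"create_mutex": 1.0}, prototypes, max_dist=0.5)
```

The first report is assigned to "cluster-a", while the second shares no feature with any prototype and is rejected.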

Incremental analysis

In the last example, Malheur is applied for incremental analysis of a larger data set split into three archives, namely "dataset1.zip", "dataset2.zip" and "dataset3.zip". Results of this analysis are written to the files "out1.txt", "out2.txt" and "out3.txt".

    malheur -o out1.txt -v -r increment dataset1.zip
    malheur -o out2.txt -v increment dataset2.zip
    malheur -o out3.txt -v increment dataset3.zip

First, the archive "dataset1.zip" is processed using incremental analysis. The extra option "-r" is used to reset the internal state of Malheur, such that results from previous incremental runs are discarded. Then, the files "dataset2.zip" and "dataset3.zip" are analyzed. For each archive, known behavior is first identified using classification, and novel groups of malware are then discovered using clustering. The intermediate results for each archive are stored in the Malheur home directory, by default "~/.malheur". Incremental analysis allows large data sets to be processed efficiently, as run-time and memory requirements are significantly reduced in comparison to batch analysis.
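The interplay of classification, clustering and carried-over state can be sketched in a few lines of Python. This is a simplified illustration only: reports are reduced to single numbers, the clustering is replaced by a basic leader-style grouping, and `increment`, `state` and `max_dist` are hypothetical names rather than parts of Malheur.

```python
def increment(batch, prototypes, max_dist, dist):
    """Process one batch against the prototypes carried over from earlier runs.

    Phase 1 (classification): reports near a known prototype count as known.
    Phase 2 (clustering): rejected reports are grouped around new prototypes.
    Returns (number of known reports, number of new groups).
    """
    rejected = [r for r in batch
                if not any(dist(r, p) <= max_dist for p in prototypes)]
    new_protos = []
    for r in rejected:
        if not any(dist(r, p) <= max_dist for p in new_protos):
            new_protos.append(r)
    prototypes.extend(new_protos)  # state kept for the next batch
    return len(batch) - len(rejected), len(new_protos)

def dist(a, b):
    """Toy distance between reports that are plain numbers."""
    return abs(a - b)

state = []  # starting empty is comparable to resetting the state
r1 = increment([1.0, 1.1, 5.0], state, 0.5, dist)
r2 = increment([1.2, 5.1, 9.0], state, 0.5, dist)
r3 = increment([9.1, 1.0], state, 0.5, dist)
```

Each batch is only compared against the small set of prototypes instead of all earlier reports, which is where the savings over a batch analysis come from in this sketch.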

Debugging

The reports of malware behavior are embedded in a vector space where each report is represented by a sparse feature vector. To understand this representation and track down problems, a lookup table can be enabled in the features setting of "malheur.cfg".

    malheur -o /dev/null -vvv prototype dataset.zip

The above command extracts prototypes from the provided data set. In addition, it presents a lot of verbose information on the reports and extracted prototypes. In particular, the corresponding feature vector is displayed for each prototype. If the lookup table is enabled, the dimensions of this vector are printed with the respective instruction n-grams (substrings composed of n instructions).
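To make the relation between instruction n-grams, vector dimensions and the lookup table more concrete, the following Python sketch embeds a short instruction sequence. It assumes that dimensions are derived by hashing each n-gram; the use of CRC32, the function names and the instruction names are illustrative choices, not Malheur's actual scheme.

```python
import zlib
from collections import Counter

def ngrams(instructions, n):
    """Return all contiguous n-grams of an instruction sequence as tuples."""
    return [tuple(instructions[i:i + n])
            for i in range(len(instructions) - n + 1)]

def embed(instructions, n, lookup):
    """Map a report to a sparse vector (dimension -> count).

    Each n-gram is hashed to a dimension. The lookup table records which
    n-gram a dimension stands for, so the vector can be printed readably.
    """
    vector = Counter()
    for gram in ngrams(instructions, n):
        dim = zlib.crc32(" ".join(gram).encode())  # hypothetical hash choice
        lookup[dim] = gram
        vector[dim] += 1
    return vector

# Hypothetical instruction sequence of one report.
report = ["open", "read", "close", "open", "read"]
lookup = {}
vector = embed(report, 2, lookup)

# Without the lookup table only hashed dimensions are visible;
# with it, each dimension can be shown with its n-gram.
readable = {lookup[dim]: count for dim, count in vector.items()}
```

Without the lookup table only the hashed dimensions and their values can be shown, which is why enabling it helps when inspecting feature vectors.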