A Tool for Embedding Strings in Vector Spaces

A tool for embedding strings

Sally is a small tool for mapping a set of strings to a set of vectors. This mapping is referred to as embedding and makes it possible to apply machine learning and data mining techniques to string data. Sally can be applied to several types of string data, such as text documents, DNA sequences or log files, and it can read this data from directories, archives and text files.

Sally implements a standard technique for mapping strings to a vector space, which can be described as a generalized bag-of-words model. Each string is characterized by a set of features, and each feature is associated with one dimension of the vector space. Sally supports the following types of features: bytes, tokens, n-grams of bytes and n-grams of tokens.
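To make the four feature types concrete, here is a minimal Python sketch of how such features could be extracted from a string. It is an illustration only and not Sally's implementation: the function names `byte_ngrams` and `token_ngrams` are hypothetical, and splitting tokens on whitespace is an assumption made for the example.

```python
def byte_ngrams(data: bytes, n: int = 1) -> list:
    """Return all overlapping byte n-grams of `data`.

    With n=1 this yields the individual bytes (as length-1 byte strings).
    """
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def token_ngrams(text: str, n: int = 1) -> list:
    """Return all overlapping n-grams of tokens in `text`.

    Assumption for this sketch: tokens are separated by whitespace.
    With n=1 this yields the individual tokens (as 1-tuples).
    """
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    print(byte_ngrams(b"GATTACA", 3))
    print(token_ngrams("the quick brown fox", 2))
```

Each distinct n-gram produced this way would correspond to one dimension of the vector space. A string shorter than n yields no features at all.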

Sally proceeds by counting the occurrences of the specified features in each string and generating a sparse vector of count values. Alternatively, binary or TF-IDF values can be computed and stored in the vectors. Sally then normalizes the vector, for example using the L1 or L2 norm, and outputs it in a specified format, such as plain text, LibSVM or Matlab format.
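The embedding step described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Sally's actual code: the names `embed` and `to_libsvm_line` are hypothetical, dimensions are numbered in order of first appearance (how Sally assigns dimensions is not specified here), and TF-IDF weighting is left out because it requires statistics over the whole collection of strings.

```python
import math
from collections import Counter


def embed(features, mode="count", norm="l2"):
    """Map a list of features to a sparse vector (dict: feature -> value).

    mode: "count" stores occurrence counts, "binary" stores 1.0 per feature.
    norm: "l1", "l2", or None for no normalization.
    """
    counts = Counter(features)
    if mode == "binary":
        vec = {f: 1.0 for f in counts}
    else:
        vec = {f: float(c) for f, c in counts.items()}

    if norm == "l1":
        scale = sum(abs(v) for v in vec.values())
    elif norm == "l2":
        scale = math.sqrt(sum(v * v for v in vec.values()))
    else:
        scale = 1.0

    # Guard against division by zero for strings without any features.
    if scale > 0:
        vec = {f: v / scale for f, v in vec.items()}
    return vec


def to_libsvm_line(vec, index_of, label=0):
    """Format a sparse vector as one line in LibSVM format.

    LibSVM lines look like "<label> <index>:<value> ..." with ascending,
    1-based indices. `index_of` is a shared dict that assigns each new
    feature the next free index (an assumption of this sketch).
    """
    pairs = sorted(
        (index_of.setdefault(f, len(index_of) + 1), v) for f, v in vec.items()
    )
    return " ".join([str(label)] + [f"{i}:{v:.6g}" for i, v in pairs])


if __name__ == "__main__":
    index_of = {}
    for text in ["a rose is a rose", "a daisy is not a rose"]:
        vec = embed(text.split(), mode="count", norm="l2")
        print(to_libsvm_line(vec, index_of))
```

Sharing one `index_of` dictionary across all strings ensures that the same feature always maps to the same dimension, which is what makes the resulting vectors comparable.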

Sally has many applications, for example in natural language processing, bioinformatics, information retrieval and computer security. To illustrate its use, we provide some examples, including text categorization, finding genes in DNA and analysing similarities between languages. All examples come with data sets and instructions.

Authors of Sally

Sally is currently developed by Konrad Rieck and Christian Wressnegger at the University of Göttingen. Previous versions of the tool were also developed at TU Berlin and Idalab GmbH.

You can contact the main author at konrad at
For news and updates follow us on Twitter.