Tuesday, December 31, 2013

Malware Comparison with N-Grams

Jason recently presented his paper on using n-gram analysis of malicious executables to identify code reuse within the malware creation industry. Identifying code reuse has a variety of applications that could be useful to analysts, antivirus vendors, and investigators alike.
N-Gram Clustering of Malware Samples,
Upchurch & Zhou [2013]

Reduction of code requiring analysis

The n-gram analysis shows strong reuse of code, so a common repository of analysis for frequently recurring instruction groups could reduce the amount of analysis required. Much like FLIRT in IDA Pro, a shared repository of these heavily reused portions of code could have a profound effect on the cost of malware analysis and reverse engineering.
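To make the idea concrete, here is a minimal Python sketch of comparing two binaries by their byte n-grams and of measuring how much of a new sample is not yet covered by a repository of already-analyzed n-grams. This is not the method from the paper, which clusters filtered block n-grams; plain overlapping byte windows and Jaccard similarity are swapped in for simplicity. The function names, the `known` repository set, and the default `n=4` are all assumptions made for illustration.

```python
from __future__ import annotations


def byte_ngrams(data: bytes, n: int = 4) -> set[bytes]:
    """Return the set of all overlapping n-byte windows in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: shared items divided by total distinct items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def unanalyzed_fraction(sample: bytes, known: set[bytes], n: int = 4) -> float:
    """Fraction of a sample's n-grams missing from a (hypothetical)
    repository of n-grams that have already been analyzed."""
    grams = byte_ngrams(sample, n)
    if not grams:
        return 0.0
    return len(grams - known) / len(grams)
```

A low `unanalyzed_fraction` would suggest that most of a sample has been seen before, so an analyst could concentrate on the remainder.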

Design and Implementation Evolution

Like other software, malware that is designed and implemented as part of a commercial process is developed under a formal methodology and is subject to the usual laws of projects and software development (think constraints on time, resources, etc.). Identifying similar strains of malware through similarity analysis of the compiled code could reveal changes in the offensive objectives of attackers. Antivirus companies could also use it to check for family resemblance more quickly.
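As a rough illustration of checking for family resemblance, the sketch below links any two samples whose n-gram sets are similar enough and reports the resulting connected groups as families. It is a simple threshold-based stand-in, not the force-based clustering used in the paper; the `group_families` name, the `threshold` of 0.5, and the sample labels are hypothetical.

```python
from itertools import combinations


def byte_ngrams(data, n=4):
    """Set of all overlapping n-byte windows in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def similarity(a, b):
    """Jaccard similarity of two n-gram sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_families(samples, threshold=0.5, n=4):
    """Group samples (name -> bytes) into families by linking every
    pair whose similarity meets the threshold (union-find)."""
    grams = {name: byte_ngrams(blob, n) for name, blob in samples.items()}
    parent = {name: name for name in samples}

    def find(x):
        # Follow parent links to the group's root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(samples, 2):
        if similarity(grams[a], grams[b]) >= threshold:
            parent[find(a)] = find(b)

    families = {}
    for name in samples:
        families.setdefault(find(name), set()).add(name)
    return sorted(families.values(), key=sorted)


samples = {
    "variant_a": b"ABCDEFGH",
    "variant_b": b"ABCDEFGX",   # differs from variant_a in the last byte
    "unrelated": b"zzzzyyyy",
}
print(group_families(samples))
```

Because linking is transitive, a chain of small changes across versions would still end up in one family, which loosely matches the idea of tracking how a strain evolves.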

Identification of Migration

Extending the idea of family resemblance, n-gram-based analysis can show real-world relationships between malware authors as these individuals work collaboratively, share code, or even gift code that was previously "cutting edge" to other groups.

Malware Backtracking

As malware authors gain experience and resources, their tactics for ensuring anonymity and protecting themselves from public disclosure are likely to improve. In their early years, these authors are likely to be more willing to take risks and may have left individually identifiable information in a variety of locations, including in the compiled code itself. It may be possible to link newer malware to older malware, and from there to the real-world author.

You can find the full paper "First Byte: Force-Based Clustering of Filtered Block N-Grams to Detect Code Reuse in Malicious Software" at http://www.cs.uccs.edu/~xzhou/publications/Malware2013.pdf.