Using File Relationships in Malware Classification
- Nikos Karampatziakis
- Jack W. Stokes
- Anil Thomas
- Mady Marinescu
Proceedings of Conference on Detection of Intrusions and Malware & Vulnerability Assessment
Published by Springer
Typical malware classification methods analyze unknown files in isolation. However, malware does not exist in a vacuum. It is often packaged in container files such as zip archives. Other malware drops, or even downloads, new files onto the victim’s hard drive. We present a new malware classification system based on a graph induced by file relationships. In this paper we analyze archives exhibiting containment, a relationship for which we have much available data. However, our method is quite general and can be applied to other types of file relationships. We face two key challenges in this work: evaluating unknown files quickly, and training the system on a very large bipartite graph that includes over 719 thousand containers and 3.4 million files. To improve file classification, we propagate information along the edges of the graph and assign malware probabilities to all files, whether or not they are infectious. As a first step, we propose several methods to classify the containers; these algorithms are based on our initial, baseline estimates of the probability that each individual file in the archive is malicious. Next, we present a new, highly scalable file relationship malware classifier which improves the baseline probability estimate for each individual file, based on the malware probabilities of each archive containing the file and on the initial baseline estimate. We show that, since malicious files are often included in multiple malware containers, the system’s detection accuracy can be significantly improved, particularly at low false positive rates, which are the main operating points for automated malware classifiers. For example, at a false positive rate of 0.2%, the false negative rate decreases from 42.1% to 15.2%. Finally, the new system is highly scalable; our basic implementation can train both our classifiers in a total of 16 minutes.
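The two-step idea described above (score each container from its files' baseline probabilities, then refine each file's estimate from the containers that hold it) can be illustrated with a small sketch. This is not the classifier from the paper: the noisy-OR rule for containers, the linear blend for files, the `weight` parameter, and the function names `score_containers` and `refine_files` are all assumptions chosen only to show how information might flow along the edges of a bipartite container-file graph.

```python
from collections import defaultdict


def score_containers(edges, baseline):
    """Score each container from its files' baseline probabilities.

    edges: iterable of (container_id, file_id) pairs.
    baseline: dict mapping file_id to a baseline malware probability.
    Assumption: a noisy-OR rule, i.e. a container is treated as malicious
    if at least one of its files is, with files considered independent.
    """
    all_benign = defaultdict(lambda: 1.0)
    for container, file_id in edges:
        all_benign[container] *= 1.0 - baseline[file_id]
    return {c: 1.0 - p for c, p in all_benign.items()}


def refine_files(edges, baseline, container_scores, weight=0.5):
    """Blend each file's baseline with the mean score of its containers.

    Assumption: a fixed linear blend controlled by `weight`; a trained
    model would normally replace this hand-picked combination.
    """
    parent_scores = defaultdict(list)
    for container, file_id in edges:
        parent_scores[file_id].append(container_scores[container])

    refined = {}
    for file_id, p in baseline.items():
        scores = parent_scores.get(file_id)
        if scores:
            mean_score = sum(scores) / len(scores)
            refined[file_id] = (1.0 - weight) * p + weight * mean_score
        else:
            # Files that appear in no container keep their baseline.
            refined[file_id] = p
    return refined


# Toy graph with hypothetical identifiers and probabilities.
edges = [("a.zip", "f1"), ("a.zip", "f2"), ("b.zip", "f2"), ("c.zip", "f3")]
baseline = {"f1": 0.9, "f2": 0.4, "f3": 0.05, "f4": 0.2}

containers = score_containers(edges, baseline)
refined = refine_files(edges, baseline, containers)
for file_id in sorted(refined):
    print(file_id, round(refined[file_id], 3))
```

In this toy graph, `f2` has an ambiguous baseline but shares `a.zip` with a file that looks malicious, so its refined estimate rises. Both passes visit each edge once, so the cost of the sketch grows linearly with the number of edges.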