Large scale machine learning for the detection and classification of malware

Date
2018
Publisher
University of Delaware
Abstract
Bad actors have embraced automation, and current malware analysis systems cannot keep up with the ever-increasing volume of malware created daily. As a result, traditional malware detection and classification techniques that rely on expert systems and brittle heuristics are outdated and ineffective. We introduce deep learning models based on inexpensive static features gathered from large-scale malware datasets to generate robust and efficient malware detection and malware family classification predictions.

Static analysis is performed by dissecting or disassembling the malware's binary file and studying its components without executing it. Static analysis is generally much faster than most other malware analysis techniques. However, some static analyses are computationally expensive, and not every static analysis should be run on every sample in a large malware dataset. We introduce a meta-model, trained using deep learning, that finds the simplest classifiers able to characterize malware and assign samples to their corresponding families. Using static analysis of malware, we generate descriptive features to be used in conjunction with deep learning to predict malware families. Our meta-model can determine when simple and less expensive malware characterization will suffice to accurately classify malicious executables, or when more computationally expensive descriptions are required.

One of the most important components of training deep learning models, particularly deep neural networks, is finding the optimal combination of model configuration and feature set. Most applications of deep learning, specifically neural networks, use heuristics or trial and error to find the optimal model configuration. We implemented a large-scale model configuration search using supercomputing resources to produce the most accurate deep learning model for a given feature set. In addition, we construct a genetic algorithm to find the optimal subset of static analysis features. This allows us to construct extremely accurate deep learning models for malware detection and malware family classification.
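The abstract does not say how the genetic algorithm encodes feature subsets, which operators it uses, or how fitness is measured. As a rough illustration only, the sketch below assumes a 0/1 mask over features, truncation selection, one-point crossover, bit-flip mutation, and elitism. The names `evolve_feature_mask`, `toy_fitness`, and `INFORMATIVE` are hypothetical, and the toy fitness function merely stands in for what would presumably be the validation accuracy of a model trained on the selected features.

```python
import random

def evolve_feature_mask(fitness, n_features, pop_size=30, generations=40,
                        mutation_rate=0.05, seed=0):
    """Search for a high-fitness 0/1 feature mask with a simple genetic algorithm.

    All operators here are assumptions chosen for illustration; the abstract
    does not describe the ones used in the thesis.
    """
    rng = random.Random(seed)
    # Each individual is a list of bits: 1 keeps a feature, 0 drops it.
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        elite = ranked[:2]                # elitism: carry the two best over unchanged
        parents = ranked[:pop_size // 2]  # truncation selection: top half may breed
        children = []
        while len(children) < pop_size - len(elite):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation, applied independently to each position.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = elite + children
    return max(population, key=fitness), history

# Hypothetical stand-in for "train a model on this subset and score it":
# reward three informative features and charge a small cost per selected feature.
INFORMATIVE = {0, 3, 7}

def toy_fitness(mask):
    hits = sum(mask[i] for i in INFORMATIVE)
    cost = 0.1 * sum(mask)
    return hits - cost

best, history = evolve_feature_mask(toy_fitness, n_features=10)
print("best mask:", best)
print("best fitness:", toy_fitness(best))
```

Because the two best individuals are carried over unchanged, the best fitness never decreases from one generation to the next. In a realistic setting each fitness evaluation would involve training and validating a model, so evaluations would dominate the cost and would typically be cached or run in parallel.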
Keywords
Applied sciences, Data science, Deep learning, Machine learning, Malware detection, Neural networks, Static analysis