2021 Microsoft Security Research AI RFP Winners
Amin Kharraz
Florida International University
Microsoft lead collaborator: M365 Security + Compliance Research
WEBHASH: A Spatio-Temporal Deep Learning Approach for Detecting Social Engineering Attacks
Social engineering attacks remain a top security threat, and their impact is often deep and consequential. Modern social engineering attacks have evolved to deliver different classes of malicious code while collecting extensive financial and personal information. Unfortunately, current mechanisms are woefully inadequate for identifying and reasoning about such adversarial operations, leaving organizations and end-users open to a variety of consequential attacks. The goal of this project is to design principles that will guide the development of an unsupervised approach to automatically identify temporal drifts and detect emerging trends in the social engineering attack landscape. The core insight of our research is that most social engineering campaigns rarely change the underlying software development techniques used to build their attack pages and tend to reuse specific web development patterns to generate a diverse set of attack pages. In this proposal, we develop a novel similarity hashing mechanism, called WEBHASH, which takes into account the spatio-temporal characteristics of a target website and converts them into a vector that facilitates low-overhead attribution and similarity testing at scale. We will take advantage of advances in machine learning and incorporate Siamese Neural Networks (SNNs) to conduct unsupervised similarity testing across the vectorized data. We posit that a number of useful activities can be performed with WEBHASH. By developing low-latency detection and mitigation platforms for social engineering attacks, we can better protect organizations and institutions from data breaches and reduce users’ exposure to modern social engineering attacks. WEBHASH also makes it possible to approximate the prevalence of an emerging social engineering threat, or the adoption of new attack techniques across different campaigns, with minimal human intervention.
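To make the idea concrete, here is a minimal sketch, not the WEBHASH implementation, of Siamese-style similarity testing over vectorized page features. It assumes each page has already been converted into a fixed-length feature vector (the WEBHASH vectorization itself is not reproduced here); the feature dimension, network sizes, and pair labels are illustrative, and the labeled contrastive training shown is only one common way to train an SNN, whereas the proposal targets unsupervised similarity testing.

```python
# Hypothetical sketch: Siamese encoder + contrastive loss over page feature vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEATURE_DIM = 256   # assumed size of a WEBHASH-style page vector (illustrative)
EMBED_DIM = 64      # size of the learned similarity embedding

class SiameseEncoder(nn.Module):
    """Shared encoder applied to both pages of a pair."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEATURE_DIM, 128), nn.ReLU(),
            nn.Linear(128, EMBED_DIM),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def contrastive_loss(z1, z2, same_campaign, margin=0.5):
    """Pull pages from the same campaign together, push unrelated pages apart."""
    dist = (z1 - z2).pow(2).sum(dim=-1)
    return (same_campaign * dist +
            (1 - same_campaign) * F.relu(margin - dist.sqrt()).pow(2)).mean()

# Toy usage with random vectors standing in for real page features.
encoder = SiameseEncoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
x1, x2 = torch.randn(32, FEATURE_DIM), torch.randn(32, FEATURE_DIM)
labels = torch.randint(0, 2, (32,)).float()   # 1 = same campaign, 0 = unrelated
loss = contrastive_loss(encoder(x1), encoder(x2), labels)
opt.zero_grad(); loss.backward(); opt.step()
# At inference time, the distance between two embeddings serves as the similarity test.
```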
Zhou Li and Yanning Shen
University of California Irvine
Microsoft lead collaborator: M365 Security + Compliance Research
Scalable Graph Learning for Automatic Detection of Spearphishing
In this project, we will tackle the problem of automated spearphishing detection. Spearphishing has become a primary attack vector for penetrating entities in the public and private sectors, causing billions of dollars in losses annually. Because of the advanced social-engineering tricks performed by attackers, spearphishing emails are often evasive and difficult to capture with existing approaches based on malware detection, sender/domain blacklisting, and the like. To address this urgent threat, we will explore how to adapt state-of-the-art graph learning algorithms. In particular, we will first investigate how to model email data as a graph such that spearphishing impersonators can be distinguished. Then, we will build a detection system with multi-kernel learning to capture the complex relationship between email users and their sending behaviors. For timely detection, we will examine how the trained classifier can be updated online with random-feature-based function estimation. Finally, we will derive the relation between different function estimators and privacy levels. We expect this project to have a profound impact on email security and on research in graph learning.
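The sketch below illustrates the two ingredients named above, not the project's actual system: modelling email traffic as a graph, and updating a classifier online using random Fourier features (one standard random-feature approximation of a kernel). The node features, kernel bandwidth, labels, and addresses are all made-up placeholders.

```python
# Hypothetical sketch: email graph + online learning on random Fourier features.
import numpy as np
import networkx as nx

# 1) Model emails as a directed sender -> recipient graph.
G = nx.DiGraph()
emails = [("alice@corp.example", "bob@corp.example"), ("eve@evil.example", "bob@corp.example")]
for sender, recipient in emails:
    G.add_edge(sender, recipient)

# 2) Derive simple per-sender behavioural features (illustrative: in/out degree).
def sender_features(g, node):
    return np.array([g.out_degree(node), g.in_degree(node)], dtype=float)

# 3) Random Fourier features approximating an RBF kernel, so a linear model
#    can be updated online as new emails arrive.
rng = np.random.default_rng(0)
D, d, gamma = 100, 2, 1.0
W = rng.normal(scale=np.sqrt(2 * gamma), size=(D, d))
b = rng.uniform(0, 2 * np.pi, size=D)

def rff(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

w = np.zeros(D)                       # linear weights in the random-feature space
lr = 0.1
for node, label in [("alice@corp.example", 0), ("eve@evil.example", 1)]:
    z = rff(sender_features(G, node))
    pred = 1 / (1 + np.exp(-w @ z))   # logistic prediction: probability of spearphishing
    w += lr * (label - pred) * z      # one online (SGD) update per observed email sender
```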
2020 Microsoft Security Research AI RFP Winners
Dawn Song and Peng Gao
University of California, Berkeley
Microsoft lead collaborator: M365 Security + Compliance Research
A Security Knowledge Graph for Automated Threat Intelligence Gathering and Management
Sophisticated cyber-attacks have plagued many high-profile businesses. To gain visibility into the fast-evolving threat landscape, open-source Cyber Threat Intelligence (OSCTI) has received growing attention from the community. Commonly, knowledge about a threat is spread across a vast number of OSCTI reports detailing how the threat unfolds into multiple steps. Despite the pressing need for high-quality OSCTI, existing approaches have primarily operated on fragmented threat indicators (e.g., Indicators of Compromise). The descriptive relationships between threat indicators, which contain essential information about threat behaviors and are critical to uncovering the complete threat scenario, have largely been overlooked. Recognizing this limitation, this proposal seeks to design and develop an intelligent and scalable system for automated threat intelligence gathering and management. The proposed system will use a combination of AI-based methods to collect heterogeneous OSCTI data from various sources, extract comprehensive knowledge about threat behaviors in the form of security-related entities and their relations, construct a security knowledge graph from the extracted information, and update the knowledge graph by continuously learning from its deployment. We will also pursue possible security defense applications that can be further empowered by OSCTI. The proposed work will have a broad impact, advancing the state of the art in threat intelligence gathering, management, and applications.
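As a flavour of the pipeline, here is a minimal sketch, not the proposed system: it pulls simple indicators out of an OSCTI-style sentence with regular expressions, links them with a relation, and stores the result as a small knowledge graph. The proposal's extraction is AI-based and far richer; the report text, entity types, and the "communicates_with" relation are invented for illustration.

```python
# Hypothetical sketch: indicator extraction -> small security knowledge graph.
import re
import networkx as nx

report = ("The malware dropper.exe downloads its payload from "
          "http://203.0.113.7/p.bin and beacons to 203.0.113.9.")

ips = re.findall(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", report)    # IP-address indicators
files = re.findall(r"\b\w+\.exe\b", report)                  # file-name indicators

kg = nx.MultiDiGraph()
for f in files:
    kg.add_node(f, type="file")
for ip in ips:
    kg.add_node(ip, type="ip")
    for f in files:
        # Illustrative relation between the file entity and each IP entity.
        kg.add_edge(f, ip, relation="communicates_with")

print(list(kg.edges(data=True)))
```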
Nick Heard
Department of Mathematics, Imperial College London
Microsoft lead collaborator: M365 Security + Compliance Research
Understanding the enterprise: Host-based event prediction for automatic defence in cyber-security
The next generation of cyber-security challenges will demonstrate an increase in complexity and sophistication, aided by artificial intelligence. To counter this AI-driven threat, we propose to develop Bayesian statistical methodologies for adaptively designing robust, interpretable mathematical models of normal behaviour in new environments. These methodologies will provide new insights into enterprise systems, offering a detailed understanding of network assets and their relationships. These insights will inform enterprise risk-based assessments and enhance the detection of, and response to, cyber threats. Challenges will include fusing diverse data sources, collected both within the network environment and externally, and securely sharing intelligence obtained from other platforms. To address these challenges, the proposed workflows will construct modelling frameworks for adaptively building probability distributions that predict the future activity of a network host. Perspectives in both discrete time and continuous time, along with hybrids of the two, will be considered. Central to the model-building challenge will be developing principled methods for automatically identifying how much historical data (either in terms of counts, or in time horizons) should be conditioned upon when forming short-term and longer-term predictions. The principal modelling paradigm will be a host-based approach, which both scales well and is best placed to protect sensitive data. Additionally, there will be important scope for making inferences about large-scale network structure, to inform these host-based AI technologies about the position, importance, and likely connectivity of each node within the network.
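A minimal sketch of the discrete-time flavour of this idea, assuming hourly event counts per host: a conjugate Gamma-Poisson model whose posterior conditions only on a chosen window of history, yielding separate short-term and longer-term predictive rates. The window lengths and prior hyperparameters are illustrative choices, not the proposal's; the proposal is precisely about learning how much history to condition on.

```python
# Hypothetical sketch: Bayesian predictive event rate for one host, conditioned
# on different amounts of history.
import numpy as np

rng = np.random.default_rng(1)
counts = rng.poisson(lam=20, size=24 * 14)   # two weeks of hourly event counts

def predictive_rate(history, alpha0=1.0, beta0=0.1):
    """Posterior mean event rate under a Gamma(alpha0, beta0) prior on a Poisson rate."""
    alpha = alpha0 + history.sum()
    beta = beta0 + len(history)
    return alpha / beta

short_term = predictive_rate(counts[-24:])     # condition on the last day
long_term = predictive_rate(counts[-24 * 7:])  # condition on the last week
print(f"next-hour rate: short-term={short_term:.1f}, long-term={long_term:.1f}")
# A large divergence between the two predictive rates can flag a change in the
# host's behaviour that merits investigation.
```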
Nicolas Papernot
University of Toronto, Department of Electrical and Computer Engineering
Microsoft lead collaborator: Azure Trustworthy Machine Learning + Microsoft Security Response Center (MSRC)
Towards Machine Learning Governance
The predictions of machine learning (ML) systems often appear fragile, with no hint as to the reasoning behind them, and may be dangerously wrong. This is unacceptable: society must be able to trust ML and hold it to account. This proposal seeks to empower ML developers and engineers to design and build secure ML systems, and to provide the tools that enable their users to manage security, legal, and regulatory standards. We achieve this through the development of machine learning governance. We focus our efforts on two attack vectors: (1) input manipulations at training and test time that target the ML system’s integrity and (2) model inversion and extraction that target the privacy of training data and the confidentiality of model architectural details. We propose to tackle the first attack vector through the development of robust model uncertainty estimates, the identification of coresets in ML, and the creation of computationally efficient influence metrics. We approach the second attack vector by focusing on the life of ML systems after they have been trained: we will pursue model watermarking, machine unlearning, and the identifiability of ML outputs.
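One ingredient named above, robust model uncertainty estimates, can be sketched minimally as follows: an ensemble of small classifiers whose predictive disagreement (entropy of the averaged class probabilities) is used to decide whether the system should act on an input or flag it. The dataset, model sizes, and threshold are illustrative only and do not represent the proposal's methods.

```python
# Hypothetical sketch: ensemble-based predictive uncertainty used as a gate.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

ensemble = [
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=s).fit(X, y)
    for s in range(5)
]

def predictive_entropy(x):
    """Average the members' class probabilities, then measure their entropy."""
    p = np.mean([m.predict_proba(x.reshape(1, -1))[0] for m in ensemble], axis=0)
    return -np.sum(p * np.log(p + 1e-12))

x_test = X[0]
if predictive_entropy(x_test) > 0.5:     # illustrative threshold
    print("high uncertainty: defer or flag this input")
else:
    print("prediction accepted")
```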