IOM and Microsoft release first-ever differentially private synthetic dataset to counter human trafficking

Migrants rescued last March in the Channel of Sicily by Italian Coast Guard (File photo). © Francesco Malavolta/IOM 2015

Microsoft is home to a diverse team of researchers focused on supporting a healthy global society, including finding ways technology can address human rights problems affecting the most vulnerable populations around the world. With a multi-disciplinary background in human-computer interaction, data science, and the social sciences, the research team partners with community, governmental, and nongovernmental organizations to create open technologies that enable scalable responses to such challenges.  

The United Nations’ International Organization for Migration (IOM) provides direct assistance and support to migrants around the world, as well as victims and survivors of human trafficking. IOM is dedicated to promoting humane and orderly migration by providing services to governments and migrants in its 175 member countries. It recently reported 50 million victims of forced labor globally, including 3.3 million children, 6.3 million in commercial sexual exploitation, and 22 million trapped in forced marriages. Understanding and addressing problems at this scale requires technology to help anti-trafficking actors and domain experts gather and translate real-world data into evidence that can inform policies and build support systems.

According to IOM, migrants and displaced people represent some of the most vulnerable populations in society. The organization explains that “while human mobility can be a source of prosperity, innovation, and sustainable development, every migration journey can include risks to safety, which are exacerbated during times of crisis, or when people face extreme vulnerability as they are forced to migrate amid a lack of safe and regular migration pathways.”

Today, using software developed by Microsoft researchers, IOM released its second synthetic dataset from trafficking victim case records, the first ever public dataset to describe victim-perpetrator relations. The synthetic dataset is also the first of its kind to be generated with differential privacy, providing an additional security guarantee for multiple data releases, which enables the sharing of more data and allows more rigorous research to be conducted while protecting privacy and civil liberties. 

The new data release builds on several years of collaboration between Microsoft and IOM to support safe data sharing of victim case records in ways that can inform collective action across the anti-trafficking community. This collaboration began in July 2019 when IOM joined the accelerator program of the Tech Against Trafficking (TAT) coalition, with the goal of advancing the privacy and utility of data made available through the Counter Trafficking Data Collaborative (CTDC) data hub, the first global portal on human trafficking case data. Since then, IOM and Microsoft have collaborated to improve the ways data on identified victims and survivors, as well as their accounts of perpetrators, can be used to combat the proliferation of human trafficking.

“We are grateful to Microsoft Research for our partnership over almost four years to share data while protecting the safety and privacy of victims and survivors of trafficking.”

– Monica Goracci, IOM’s Director of Programme Support and Migration Management

The critical importance of data privacy when working with vulnerable populations 

When publishing data on victims of trafficking, every effort must be made to ensure that traffickers cannot identify known victims in published datasets. It is also important to protect individuals’ privacy to avoid stigma or other potential forms of harm or (re)traumatization. The accuracy of the published statistics is another concern: a release must guarantee victims’ privacy while still allowing researchers and analysts to extract useful insights from a dataset that contains personal information. This is critically important: if a privacy method were to over- or under-report a given pattern in victim cases, it could lead decision makers to misdirect scarce resources and therefore fail to tackle the originating problem.

The collaboration between IOM and Microsoft was founded on the idea that rather than redacting sensitive data to create privacy, synthetic datasets can be generated in ways that accurately capture the structure and statistics of underlying sensitive datasets, while remaining private by design. But not all synthetic data comes with formal guarantees of data privacy or accuracy. Therefore, building trust in synthetic data requires communicating how well the synthetic data represents the actual sensitive data, while ensuring that these comparisons do not create privacy risks themselves.

From this founding principle, along with the need to accurately report case counts broken down by different combinations of attributes (e.g., age range, gender, nationality), a solution emerged: to release synthetic data alongside privacy-preserving counts of cases, matching all short combinations of case attributes. The aggregate data thereby supports both evaluation of synthetic data quality and retrieval of accurate counts for official reporting. Through this collaboration and the complementary nature of synthetic data and aggregate data, together with interactive interfaces with which to view and explore both datasets, the open-source Synthetic Data Showcase software was developed.
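To make "counts of cases matching all short combinations of case attributes" concrete, here is a minimal Python sketch. It is an illustration only, not the Synthetic Data Showcase implementation: the function name, the attribute names, and the three toy records are all hypothetical, and the counts it produces are raw, without the privacy protection that a real release would apply.

```python
from collections import Counter
from itertools import combinations


def count_short_combinations(records, max_length):
    """Count the records matching every combination of up to max_length attribute values.

    Hypothetical helper for illustration; the counts are exact and NOT privacy-preserving.
    """
    counts = Counter()
    for record in records:
        # Sort by attribute name so each combination has one canonical key.
        items = sorted(record.items())
        for length in range(1, max_length + 1):
            for combo in combinations(items, length):
                counts[combo] += 1
    return counts


# Toy records with made-up attribute names and values.
records = [
    {"age_range": "18-20", "gender": "female", "citizenship": "A"},
    {"age_range": "18-20", "gender": "female", "citizenship": "B"},
    {"age_range": "21-23", "gender": "male", "citizenship": "A"},
]

counts = count_short_combinations(records, max_length=2)
for combo, count in sorted(counts.items()):
    print(combo, count)
```

Setting `max_length` higher reports longer combinations, at the cost of many more counts, each describing fewer people. That trade-off is one reason the release is limited to short combinations.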

In September 2021, IOM used Synthetic Data Showcase to release its first downloadable Global Synthetic Dataset, representing data from over 156,000 victims and survivors of trafficking across 189 countries and territories (where victims were first identified and supported by CTDC partners). The new Global Victim-Perpetrator Synthetic Dataset, released today, is CTDC’s second synthetic dataset produced using an updated version of Synthetic Data Showcase with added support for differential privacy. This new dataset includes IOM data from over 17,000 trafficking victim case records and their accounts of over 37,000 perpetrators who facilitated the trafficking process from 2005 to 2022. Together, these datasets provide vital first-hand information on the socio-demographic profiles of victims, their accounts of perpetrators, types of exploitation, and the overall trafficking process, all of which are critical to better assist survivors and prosecute perpetrators.

“Data privacy is crucial to the pursuit of efficient, targeted counter-trafficking policies and good migration governance.”

– Irina Todorova, Head of the Assistance to Vulnerable Migrants Unit at IOM’s Protection Division

A differentially private dataset 

In 2006, Microsoft researchers led the initial development of differential privacy, and today it represents the gold standard in privacy protection. It helps ensure that answers to data queries are similar, whether or not any individual data subject is in the dataset, and therefore cannot be used to infer the presence of specific individuals, either directly or indirectly.  
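As a rough illustration of how a query answer can be made "similar, whether or not any individual data subject is in the dataset," the sketch below applies the textbook Laplace mechanism to a count. This is a generic example and an assumption on our part, not a description of the mechanism used for the IOM release; the function name and the parameter values are hypothetical, and naive floating-point noise like this is not suitable for production use.

```python
import random


def noisy_count(true_count, epsilon, rng):
    """Return a count perturbed with Laplace noise of scale 1 / epsilon.

    Adding or removing one person changes a count by at most 1 (sensitivity 1),
    so noise of scale 1 / epsilon gives epsilon-differential privacy for this
    single query. Illustrative only.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = 1.0 / epsilon
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution with that scale, centered on zero.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


rng = random.Random(0)
# Two neighboring datasets: one includes a particular person, one does not.
print(noisy_count(101, epsilon=1.0, rng=rng))
print(noisy_count(100, epsilon=1.0, rng=rng))
```

Because the noise is typically at least as large as the difference one person makes, an observer cannot reliably tell which of the two datasets produced a given answer. A smaller `epsilon` means more noise and stronger protection.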

Existing algorithms for differentially private data synthesis typically create privacy by “hiding” actual combinations of attributes in a sea of fabricated or spurious attribute combinations that don’t specifically reflect what was in the original sensitive dataset.

This can be problematic if the presence of these fabricated attribute combinations misrepresents the real-world situation and misleads downstream decision making, policy making, or resource allocation to the detriment of the underlying population (e.g., encouraging policing of trafficking routes that have not actually been observed). 

When the research team encountered these challenges with existing differentially private synthesizers, they engaged fellow researchers at Microsoft to explore possible solutions. They explained the critical importance of reporting accurate counts of actual attribute combinations in support of statistical reporting and evidence-based intervention, and how the “feature” of fabricating unobserved combinations as a way of preserving privacy could be harmful when attempting to understand real-world patterns of exploitation.

Those colleagues had recently solved a similar problem in a different context: how to extract accurate counts of n-gram word combinations from a corpus of private text data. Their solution, recently published at the 2021 Conference on Neural Information Processing Systems, significantly outperformed the state of the art. In collaboration with the research team working with IOM, they adapted this solution into a new approach to generating differentially private marginals: counts of all short combinations of attributes that together represent a differentially private aggregate dataset.

Because differentially private data has the property that subsequent processing cannot increase privacy loss, any datasets generated from such aggregates retain the same level of privacy. This enabled the team to modify their existing approach to data synthesis (creating synthetic records by sampling attribute combinations until all attributes are accounted for) to extrapolate these noisily reported attribute combinations into full, differentially private synthetic records. The result is precisely what IOM and similar organizations need to create a thriving data ecosystem in the fight against human trafficking and other human rights violations: accurate aggregate data for official reporting, synthetic data for interactive exploration and machine learning, and differential privacy guarantees that provide protection even over multiple overlapping data releases.
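The post-processing idea can be sketched in a few lines of Python. This is a heavily simplified illustration, not the team's synthesizer: the function names, attribute names, and noisy counts below are all hypothetical, and the sampling rule (weight each candidate value by the smallest reported count it shares with the values already chosen) is an assumption made for brevity. The point it shows is that synthetic records are built only from the already-noisy aggregates, never from the sensitive records.

```python
import random


def candidate_values(noisy_counts, attribute):
    """List the values reported for one attribute in the single-attribute counts."""
    values = []
    for key in noisy_counts:
        if len(key) == 1:
            (name, value), = key
            if name == attribute:
                values.append(value)
    return values


def combination_weight(record, attribute, value, noisy_counts):
    """Weight a candidate by the smallest reported count it shares with chosen values."""
    weight = noisy_counts.get(frozenset([(attribute, value)]), 0)
    for chosen in record.items():
        pair = frozenset([chosen, (attribute, value)])
        # Pairs missing from the aggregates get weight 0, so they are never fabricated.
        weight = min(weight, noisy_counts.get(pair, 0))
    return max(weight, 0)


def synthesize(noisy_counts, attributes, n_records, rng):
    """Build synthetic records attribute by attribute, using only the noisy aggregates."""
    records = []
    for _ in range(n_records):
        record = {}
        for attribute in attributes:
            values = candidate_values(noisy_counts, attribute)
            weights = [combination_weight(record, attribute, v, noisy_counts) for v in values]
            if sum(weights) > 0:
                record[attribute] = rng.choices(values, weights=weights)[0]
        records.append(record)
    return records


# Made-up noisy counts standing in for a differentially private aggregate release.
noisy_counts = {
    frozenset([("victim_gender", "female")]): 41.3,
    frozenset([("victim_gender", "male")]): 18.9,
    frozenset([("perpetrator_role", "recruiter")]): 35.2,
    frozenset([("perpetrator_role", "transporter")]): 24.6,
    frozenset([("victim_gender", "female"), ("perpetrator_role", "recruiter")]): 30.1,
    frozenset([("victim_gender", "female"), ("perpetrator_role", "transporter")]): 10.4,
    frozenset([("victim_gender", "male"), ("perpetrator_role", "transporter")]): 14.8,
}

synthetic = synthesize(
    noisy_counts, ["victim_gender", "perpetrator_role"], n_records=200, rng=random.Random(1)
)
print(synthetic[:3])
```

In this toy input the pair ("male", "recruiter") is absent from the aggregates, so no synthetic record contains it. That mirrors, in a very reduced form, the goal of limiting fabricated attribute combinations.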

This new synthesizer is now available to the community via Microsoft’s SmartNoise library within the OpenDP initiative. Unlike existing synthesizers, it provides strong control over the extent to which fabrication of spurious attribute combinations is allowed and augments synthetic datasets with “actual” aggregate data protected by differential privacy.

Access to private-yet-accurate patterns of attributes characterizing victim-perpetrator relationships allows stakeholders to advance the understanding of risk factors for vulnerability and carry out effective counter-trafficking interventions, all while keeping the victims’ identities private.

“The new dataset represents the first global collection of case data linking the profiles of trafficking victims and perpetrators ever made available to the public, while enabling strong privacy guarantees. It provides critical information to better assist survivors and prosecute offenders.” – Claire Galez-Davis, Data Scientist at IOM’s Protection Division. 

An intuitive new interface and public utility web application 

Solving problems at a global scale requires tools that make safe data sharing accessible wherever there is a need and in a way that is understandable by all stakeholders. The team wanted to construct an intuitive interface to help develop a shared evidence base and motivate collective action by the anti-trafficking community. They also wanted to ensure that the solution was available to anyone with a need to share sensitive data safely and responsibly. The new user interface developed through this work is now available as a public utility web application in which private data aggregation and synthesis are performed locally in the web browser, with no data ever leaving the user’s machine.

“I find the locally run web application incredibly interactive and intuitive. It is a lot easier for me to explain the data generation process and teach others to use the new web interface. As the data is processed locally in our computers, I don’t need to worry about data leaks.” – Lorraine Wong, Research Officer at IOM’s Protection Division.  

What’s next for the IOM and Microsoft collaboration 

Microsoft and IOM have made the solution publicly accessible for other organizations, including central government agencies. It can be used by any stakeholder who wants to collect and publish sensitive data while protecting individual privacy.

Through workshops and guidance on how to produce high-quality administrative data, the organizations plan to share evidence on exploitation and abuse to support Member States, other UN agencies, and counter-trafficking organizations around the world. This kind of administrative data is a key source of information providing baseline statistics that can be used to understand patterns, risk factors, trends, and modus operandi that are critical for policy response formulation.

For example, IOM has been collaborating with the UN Office on Drugs and Crime (UNODC) to establish international standards and guidance to support governments in producing high-quality administrative data. It has also been collaborating with the UN International Labour Organization (ILO) to index policy-oriented research on trafficking in a bibliography. Finally, IOM is producing an online course, with a module offering guidance on synthetic data, to encourage safe data sharing from governments and frontline counter-trafficking agencies.

“Being able to publish more data than we have done in the past, and in an even safer way, is a great achievement,” explained Phineas Jasi, Data Management and Research Specialist at IOM’s Protection Division. He added, “The aim is for these data to inform the evidence base on human trafficking, which in turn helps devise efficient and targeted counter-trafficking policies and achieve good migration governance.”

Translating data into evidence is the goal of the related ShowWhy application from the same Microsoft research team, which guides domain experts through the end-to-end process of developing causal evidence from observational data. Just like Synthetic Data Showcase, it makes advanced data science capabilities accessible to domain experts through a suite of interactive, no-code user interfaces.

“Driving a coordinated global response against human trafficking requires removing traditional barriers to both data access and data analysis,” said Darren Edge, Director at Microsoft Research. “With our Synthetic Data Showcase and ShowWhy applications, we are aiming to empower domain experts to develop causal evidence for themselves, from sensitive data that couldn’t otherwise be shared, and use this to inform collective action with a precision and scale that couldn’t otherwise be imagined.” 
