Training Private and Efficient Language Models with Synthetic Data from LLMs
- Da Yu
- Arturs Backurs
- Sivakanth Gopi
- Huseyin Inan
- Janardhan (Jana) Kulkarni
- Zinan Lin
- Chulin Xie
- Huishuai Zhang
- Wanrong Zhang
NeurIPS 2023 SoLaR Workshop
Language models are pivotal in modern text-based applications, offering productivity features such as next-word prediction, smart composition, and summarization. In many applications, these models must be lightweight to meet inference-time and computational-cost requirements. Furthermore, due to the inherent sensitivity of their training data, it is essential to train these models in a privacy-preserving manner. While it is well established that training large models with differential privacy (DP) leads to favorable utility-vs-privacy trade-offs, training lightweight models with DP remains an open challenge.
This paper explores the use of synthetic data generated by a DP fine-tuned large language model (LLM) to train lightweight models. The key insight behind our framework is that LLMs are better suited for private fine-tuning, so training on their synthetic data is one way to transfer that capability to smaller models. Our framework can also be interpreted as *sampling-based* knowledge distillation in the DP setting. Notably, the smaller models can be trained on the synthetic data with non-private optimizers, thanks to the post-processing property of DP. We empirically demonstrate that our approach significantly improves downstream performance compared to directly training lightweight models on real data with DP. For instance, using a model with just 4.4 million parameters, we achieve 97% of the performance of the non-private counterpart on both a medical and a conversational corpus.
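To make the three-stage pipeline concrete, here is a minimal sketch of how it could be wired together with off-the-shelf tools. This is an illustrative sketch, not the paper's released code: the toy corpus, the model choices (gpt2 as the teacher and distilgpt2 as a stand-in for a small student), and all hyperparameters are assumptions, and the DP step assumes every layer of the chosen teacher is supported by Opacus' per-sample gradient hooks (in practice, GPT-2-style models may need custom gradient samplers).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from opacus import PrivacyEngine

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token

# Stand-in for the sensitive corpus (e.g., medical or chat data).
private_texts = ["patient reports mild headache", "schedule a follow-up visit"]
ids = tokenizer(private_texts, padding="max_length", max_length=32,
                truncation=True, return_tensors="pt")["input_ids"]
loader = DataLoader(TensorDataset(ids), batch_size=2)

# Stage 1: DP fine-tune the large "teacher" LM with DP-SGD via Opacus.
teacher = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
optimizer = torch.optim.AdamW(teacher.parameters(), lr=1e-4)
engine = PrivacyEngine()
dp_teacher, optimizer, loader = engine.make_private(
    module=teacher, optimizer=optimizer, data_loader=loader,
    noise_multiplier=1.0,  # Gaussian noise scale (illustrative)
    max_grad_norm=1.0,     # per-example gradient clipping bound (illustrative)
)
dp_teacher.train()
for (batch,) in loader:
    batch = batch.to(device)
    loss = dp_teacher(input_ids=batch, labels=batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Stage 2: sample a synthetic corpus from the DP fine-tuned teacher.
# `teacher` shares its weights with `dp_teacher`, so it is now DP-trained.
teacher.eval()
with torch.no_grad():
    prompt = torch.tensor([[tokenizer.bos_token_id]], device=device)
    samples = teacher.generate(prompt, do_sample=True, top_p=0.95,
                               max_new_tokens=32, num_return_sequences=8,
                               pad_token_id=tokenizer.eos_token_id)
synthetic_texts = tokenizer.batch_decode(samples, skip_special_tokens=True)

# Stage 3: train the lightweight "student" on synthetic data with an
# ordinary, non-private optimizer. By DP post-processing, this step
# consumes no additional privacy budget.
student = AutoModelForCausalLM.from_pretrained("distilgpt2").to(device)
s_ids = tokenizer(synthetic_texts, padding="max_length", max_length=32,
                  truncation=True, return_tensors="pt")["input_ids"]
s_opt = torch.optim.AdamW(student.parameters(), lr=5e-4)
student.train()
for (batch,) in DataLoader(TensorDataset(s_ids), batch_size=2):
    batch = batch.to(device)
    student(input_ids=batch, labels=batch).loss.backward()
    s_opt.step()
    s_opt.zero_grad()
```

In this sketch, only stage 1 touches the sensitive data; stages 2 and 3 operate purely on the DP-trained weights and their samples, which is why the student can be trained with any standard optimizer without further privacy accounting.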