Seeding Public AI with Private Data: A New Approach to Synthetic Generation

Author: Denis Avetisyan


Researchers have developed a novel method for creating realistic datasets that protect sensitive information while still enabling powerful AI models.

The research demonstrates that the RPSG method surpasses existing differential privacy techniques – including DP-SGD, AUG-PE, and RUPTA – across various large language models, achieving demonstrably improved accuracy, diversity, and lexical quality in generated text, and thus establishing a new benchmark for privacy-preserving language generation.

This paper introduces RPSG, a multi-stage framework for privacy-preserving synthetic data generation that leverages private seeds and demonstrates improved utility and stronger privacy guarantees against membership inference attacks compared to existing techniques.

Balancing data utility with stringent privacy demands remains a core challenge in modern machine learning. This is addressed in ‘Private Seeds, Public LLMs: Realistic and Privacy-Preserving Synthetic Data Generation’, which introduces a novel approach to generating synthetic data that safeguards sensitive information. The proposed Realistic and Privacy-Preserving Synthetic Data Generation (RPSG) method leverages private seeds and formal differential privacy to achieve high fidelity while offering robust protection against membership inference attacks. Can this framework unlock broader data sharing and collaboration without compromising individual privacy?


The Inherent Paradox of Data-Driven Discovery

The relentless advance of machine learning is fundamentally reshaping data science, yet its power is intrinsically linked to access – often requiring the pooling of datasets that inherently contain personally identifiable information. This creates a paradox: the very algorithms designed to improve lives and drive innovation rely on data that, if exposed, can lead to significant privacy breaches. Sensitive details – from medical records and financial histories to location data and personal preferences – become vulnerable when raw data is shared, even with the best intentions. The increasing demand for data to train robust machine learning models therefore necessitates a careful balancing act, acknowledging that unfettered access to information presents substantial risks to individual privacy and demands innovative solutions to mitigate those risks.

Despite longstanding practices of data anonymization – such as removing direct identifiers like names and addresses – increasingly sophisticated attacks demonstrate the vulnerability of these methods. Specifically, Membership Inference Attacks (MIAs) pose a significant threat by attempting to determine whether a specific individual’s data was included in a training dataset. These attacks don’t seek to uncover what data was contributed, but rather whether an individual participated at all. By analyzing the model’s outputs – the probabilities it assigns to different predictions – an attacker can infer membership with accuracy exceeding chance, even when personally identifying information has been removed. This is because models often “memorize” characteristics of individual data points, especially those that are rare or influential, creating subtle patterns that MIAs exploit. Consequently, relying solely on traditional anonymization is often insufficient to guarantee privacy in the age of powerful machine learning models and adversarial attacks.
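The exploit described above can be made concrete with a toy version of the simplest membership inference attack: thresholding the model's confidence on a record. This is an illustrative sketch, not the attack used in the paper; the confidence values and the 0.9 threshold are invented for the example.

```python
import numpy as np

def confidence_mia(member_conf, nonmember_conf, threshold=0.9):
    """Toy confidence-threshold membership inference attack: predict
    'member' whenever the model's confidence on a record exceeds the
    threshold, and report accuracy on a balanced member/non-member set."""
    member_hits = (member_conf > threshold).sum()         # true positives
    nonmember_hits = (nonmember_conf <= threshold).sum()  # true negatives
    return (member_hits + nonmember_hits) / (len(member_conf) + len(nonmember_conf))

# Overfit models tend to assign higher confidence to records they trained on,
# which is exactly the signal this attack exploits.
members = np.array([0.99, 0.97, 0.95, 0.92])     # hypothetical training records
nonmembers = np.array([0.70, 0.85, 0.60, 0.93])  # hypothetical held-out records
print(confidence_mia(members, nonmembers))  # 0.875, well above the 0.5 chance level
```

Even this crude attack reaches 87.5% accuracy on the toy data, which is why an MIA accuracy near chance (50%) is the benchmark that privacy-preserving methods aim for.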

The demand for methods capable of generating realistic synthetic data stems directly from the limitations of traditional privacy-preserving techniques. As machine learning models require extensive datasets to function effectively, sharing raw data poses significant risks, even after standard anonymization processes. Sophisticated attacks, such as membership inference, can often re-identify individuals within supposedly de-identified datasets. Consequently, research is increasingly focused on creating artificial datasets that statistically mirror real data without containing any actual individual records. This synthetic data allows for model training and analysis while effectively eliminating the risk of privacy breaches, offering a crucial balance between data utility and individual protection. The success of these methods hinges on their ability to capture complex correlations and distributions present in the original data, ensuring the resulting models maintain accuracy and generalizability.

Existing Approaches: A Landscape of Compromises

Differential Privacy (DP) is a mathematical framework for quantifying privacy loss when analyzing datasets, and it operates by adding carefully calibrated noise to query results. This noise ensures that the output of any analysis is insensitive to changes in a single individual’s data, thereby protecting privacy. A common application of DP is in gradient-based machine learning methods, notably Differentially Private Stochastic Gradient Descent (DP-SGD). DP-SGD modifies the standard stochastic gradient descent algorithm by clipping individual gradients and adding noise to the aggregated gradient before each parameter update. The magnitude of the added noise is controlled by the privacy parameters ε and δ, which define the privacy loss bounds. Lower values of ε and δ provide stronger privacy guarantees but typically result in decreased model utility due to the increased noise.
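The clip-then-noise aggregation at the heart of DP-SGD is compact enough to sketch directly. The following is a minimal illustration of one aggregation step, not a production implementation (real DP-SGD also requires Poisson subsampling and a privacy accountant to translate the noise multiplier into ε and δ); the default parameter values are placeholders.

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """One DP-SGD aggregation step: clip each per-example gradient to an
    L2 norm of at most clip_norm, sum the clipped gradients, add Gaussian
    noise scaled to the clipping bound, and average over the batch."""
    rng = rng if rng is not None else np.random.default_rng(0)
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_example_grads)

# With the noise disabled, the effect of clipping alone is easy to see:
grads = [np.array([3.0, 4.0]),   # norm 5   -> rescaled to [0.6, 0.8]
         np.array([0.3, 0.4])]   # norm 0.5 -> left unchanged
print(dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=0.0))  # [0.45, 0.6]
```

Clipping bounds any single example's influence on the update; the Gaussian noise, scaled to that same bound, is what makes the released gradient differentially private.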

Prompt-based synthetic data generation utilizes Large Language Models (LLMs) by formulating data creation as a text generation task. Methods like AUG-PE (Augmented Private Evolution) employ carefully crafted prompts to guide the LLM in producing data samples that preserve statistical characteristics of the original dataset. One-to-one rewriting approaches, exemplified by RUPTA, involve replacing sensitive data points with synthetic alternatives generated by LLMs based on individual data records and associated contextual information. These techniques aim to create data that is statistically similar to the original while minimizing the risk of re-identification by conditioning the LLM on specific attributes or relationships within the data.

Current privacy-preserving data generation techniques, including those utilizing Differential Privacy and Large Language Models, frequently encounter a trade-off between data fidelity and the strength of privacy guarantees. Methods aiming for robust privacy, such as those employing strict Differential Privacy parameters, can significantly reduce the utility and realism of the generated synthetic data. Conversely, techniques prioritizing high data fidelity often rely on less rigorous privacy mechanisms, potentially increasing the risk of re-identification or information leakage. This challenge stems from the inherent difficulty in perturbing data sufficiently to ensure privacy while preserving the complex relationships and statistical properties necessary for downstream tasks; balancing these competing objectives remains a primary research focus.

RPSG: A Framework for Principled Synthesis

RPSG generates synthetic data through a two-stage process initiated by utilizing private data as seed values. These seeds are then processed by an Abstraction Model, a component designed to create a diverse set of candidate synthetic samples. This model operates by identifying and representing the underlying patterns within the seed data, allowing it to generate new data points that statistically resemble the original private data while maintaining variability. The abstraction process prioritizes the creation of candidates that are both realistic – meaning they adhere to the observed distributions of the private data – and diverse, preventing the model from simply replicating existing data points and ensuring broader utility of the synthetic dataset.

Negative Log-Likelihood (NLL)-based filtering is a core component of RPSG’s privacy mechanism, operating by identifying and removing synthetic samples that are assigned an overly high likelihood under the private training data; this mitigates the risk of membership inference attacks. Specifically, samples whose NLL falls below a predetermined threshold – that is, samples the private data makes suspiciously likely – are treated as potential memorizations and discarded. Following NLL filtering, Cosine Similarity is employed to refine the generated dataset by assessing the semantic similarity between synthetic and original data points; this ensures that the generated data remains representative of the original distribution and maintains data utility after the privacy-enhancing filtering stage. The Cosine Similarity metric calculates the cosine of the angle between vector representations of the data, with higher values indicating greater similarity.
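As a rough illustration of this two-stage filter (not the paper's actual implementation), the logic can be sketched as follows. The function name, the threshold values, and the assumption that precomputed NLL scores and embedding vectors are supplied as inputs are all placeholders for this example.

```python
import numpy as np

def filter_candidates(candidates, nll_scores, embeddings, ref_embedding,
                      nll_floor=2.0, sim_threshold=0.7):
    """Two-stage candidate filter sketch: (1) drop samples whose NLL under
    the private data is below nll_floor -- their likelihood is suspiciously
    high, suggesting memorization; (2) keep only samples whose embedding is
    cosine-similar enough to a reference embedding, preserving utility."""
    kept = []
    for text, nll, emb in zip(candidates, nll_scores, embeddings):
        if nll < nll_floor:          # too likely under private data: privacy risk
            continue
        sim = float(emb @ ref_embedding /
                    (np.linalg.norm(emb) * np.linalg.norm(ref_embedding)))
        if sim >= sim_threshold:     # semantically close enough to keep
            kept.append(text)
    return kept
```

The two thresholds pull in opposite directions: raising the NLL floor discards more near-memorized samples (privacy), while raising the similarity threshold discards more off-distribution samples (utility).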

RPSG utilizes Large Language Models (LLMs) as the core engine for synthetic data generation, enabling the creation of diverse and contextually relevant samples. A key component is Sentiment Alignment, a process ensuring the generated data accurately mirrors the sentiment distribution present in the original, private dataset. This is achieved through techniques that condition the LLM to produce outputs with comparable sentiment scores. Benchmarking demonstrates that RPSG achieves a processing speedup of 1.22x to 1.38x when compared to the AUG-PE method, indicating improved efficiency in synthetic data production while maintaining fidelity to the original data’s emotional tone.
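One simple way to realize sentiment alignment, offered here purely as an illustrative sketch (the paper conditions the LLM itself; this post-hoc resampling variant, its function names, and the labeling callback are all assumptions), is to resample generated candidates so that their label distribution matches the original data's:

```python
import random

def align_sentiment(candidates, sentiment_of, target_dist, k, rng=None):
    """Resample up to k candidates so the sentiment-label distribution of
    the kept set approximately matches target_dist (label -> fraction).
    sentiment_of is a caller-supplied classifier callback."""
    rng = rng if rng is not None else random.Random(0)
    by_label = {}
    for c in candidates:
        by_label.setdefault(sentiment_of(c), []).append(c)
    out = []
    for label, frac in target_dist.items():
        pool = by_label.get(label, [])
        take = min(len(pool), round(frac * k))   # quota for this label
        out.extend(rng.sample(pool, take))
    return out

# Toy classifier and candidate pool for demonstration only.
sentiment_of = lambda s: "pos" if "good" in s else "neg"
cands = ["good film", "good plot", "bad acting", "bad score"]
aligned = align_sentiment(cands, sentiment_of, {"pos": 0.5, "neg": 0.5}, k=2)
```

Conditioning during generation, as RPSG does, avoids the waste of generate-then-discard, but the resampling view makes the target explicit: the synthetic corpus should reproduce the private data's sentiment proportions, not the LLM's default ones.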

Demonstrating Robustness and Utility: Empirical Validation

Rigorous evaluation confirms that the proposed Realistic and Privacy-Preserving Synthetic Data Generation (RPSG) method significantly diminishes the risk of sensitive information disclosure. Specifically, the system demonstrates robust resistance against Membership Inference Attacks, a key indicator of privacy leakage. Performance in these attacks is quantified using the Area Under the Receiver Operating Characteristic Curve (AUC), with RPSG consistently achieving scores of approximately 50%. This result is crucial, as an AUC of 0.5 suggests the system effectively obscures individual data points, performing at chance level and indicating a substantial barrier against identifying whether a specific record was used in the training process. Consequently, RPSG offers a strong defense against unauthorized data reconstruction and protects the privacy of individuals represented within the dataset.
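The AUC figure reported here has a direct probabilistic reading: it is the probability that the attack scores a random member record higher than a random non-member. A minimal empirical computation (a sketch using the Mann-Whitney formulation, with invented scores) makes the chance-level interpretation concrete:

```python
def auc(member_scores, nonmember_scores):
    """Empirical AUC: probability that a randomly chosen member record
    receives a higher attack score than a randomly chosen non-member
    (ties count half). 0.5 means the attacker does no better than chance."""
    wins = 0.0
    for m in member_scores:
        for nm in nonmember_scores:
            wins += 1.0 if m > nm else (0.5 if m == nm else 0.0)
    return wins / (len(member_scores) * len(nonmember_scores))

print(auc([0.9, 0.8], [0.1, 0.2]))  # 1.0: attack separates members perfectly
print(auc([0.5, 0.5], [0.5, 0.5]))  # 0.5: attack is no better than guessing
```

An AUC near 0.5, as RPSG achieves, therefore means the attack's scores carry essentially no information about membership.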

The synthetic data produced by this approach demonstrates substantial utility for practical applications, achieving a Next-Word Prediction Accuracy reaching 40%. This performance is further validated through metrics like Fréchet Inception Distance, which assesses the similarity of the generated data to the original, and Self-BLEU scores, ranging from 0.1 to 0.3. These Self-BLEU values are particularly noteworthy, as they indicate a tunable level of diversity within the synthetic dataset; lower scores suggest greater variation, while higher scores indicate a closer resemblance to the original data distribution, allowing users to tailor the output to specific downstream task requirements and balance fidelity with the need for novel insights.
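Self-BLEU measures how much each generated sample resembles the rest of the corpus; lower scores mean a more diverse dataset. A simplified single-n-gram variant (real Self-BLEU combines several n-gram orders with a brevity penalty; this sketch omits both) shows the mechanics:

```python
from collections import Counter

def ngram_precision(candidate, references, n=2):
    """Clipped n-gram precision of one candidate against reference texts."""
    toks = candidate.split()
    cand_ngrams = Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    if not cand_ngrams:
        return 0.0
    ref_ngrams = Counter()
    for ref in references:
        rtoks = ref.split()
        for ng, c in Counter(tuple(rtoks[i:i + n])
                             for i in range(len(rtoks) - n + 1)).items():
            ref_ngrams[ng] = max(ref_ngrams[ng], c)
    overlap = sum(min(c, ref_ngrams[ng]) for ng, c in cand_ngrams.items())
    return overlap / sum(cand_ngrams.values())

def self_bleu(samples, n=2):
    """Average n-gram precision of each sample against all the others."""
    scores = [ngram_precision(s, samples[:i] + samples[i + 1:], n)
              for i, s in enumerate(samples)]
    return sum(scores) / len(scores)

print(self_bleu(["the cat sat", "the cat sat"]))  # 1.0: zero diversity
print(self_bleu(["the cat sat", "dogs bark loud"]))  # 0.0: maximal diversity
```

Scores in the 0.1 to 0.3 range, as reported for RPSG, sit toward the diverse end of this scale.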

Research demonstrates that the proposed RPSG method successfully navigates the inherent tension between safeguarding data privacy and maintaining data usefulness. Evaluations reveal a compelling balance; while effectively minimizing the risk of sensitive information disclosure – as shown by robust resistance to privacy attacks – RPSG simultaneously generates synthetic data suitable for a variety of practical applications. Notably, the quality of this generated data, assessed through metrics like Fréchet Inception Distance, improves as the size of the synthetic dataset increases, suggesting that RPSG offers a scalable solution for responsible data sharing and collaboration. This ability to consistently deliver both privacy and utility positions RPSG as a promising advancement in the field of data synthesis.

The pursuit of synthetic data generation, as detailed in this work, echoes a fundamental principle of information theory. As Claude Shannon stated, “The most important thing in communication is to convey information, and the most important thing in data is to preserve it.” RPSG, with its focus on both utility and differential privacy, embodies this concept. The method’s multi-stage approach isn’t merely about creating data that looks real; it’s about constructing a representation that retains informational value while minimizing the risk of revealing sensitive attributes. This careful balance, achieved through private seeds and rigorous privacy guarantees, demonstrates a commitment to provable data integrity – a hallmark of elegant algorithmic design.

What Lies Ahead?

The pursuit of synthetic data, as demonstrated by this work, perpetually circles a fundamental tension: utility versus provable privacy. The introduction of RPSG represents a step toward reconciling these forces, yet it does not, and cannot, eliminate the underlying mathematical realities. The utility metrics presented, while encouraging, remain empirical observations – tests passed do not constitute a proof of generalizability. Future research must prioritize the development of formal guarantees regarding synthetic data fidelity, moving beyond ad-hoc evaluations. A demonstrable link between the privacy parameters and the actual information leakage, quantified with rigorous bounds, is paramount.

Furthermore, the reliance on ‘private seeds’ raises the question of seed generation and secure storage. A compromised seed invalidates the entire privacy proposition – a rather obvious, yet often overlooked, vulnerability. Investigations into fully homomorphic encryption schemes applied to the seed itself, or perhaps differential privacy mechanisms applied during seed creation, should be explored. The current approach, while pragmatic, ultimately shifts the security burden to a single point of failure.

Ultimately, the field requires a shift in mindset. Synthetic data generation should not be viewed as an art of approximation, but as a problem of constructive proof. Demonstrating that a synthetic dataset cannot reveal information about the original data, within mathematically defined limits, is the only path toward true trustworthiness. Only then can these methods move beyond being ‘useful’ and become genuinely reliable.


Original article: https://arxiv.org/pdf/2604.07486.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
