On July 15, 2024, the Personal Data Protection Commission of Singapore (PDPC) released its Proposed Guide on Synthetic Data Generation (Guide). The Guide is a key resource within the Privacy Enhancing Technology (PET) Sandbox which aims to assist organisations in understanding the techniques and potential applications of Synthetic Data (SD) generation, particularly in the context of artificial intelligence (AI). As highlighted by Minister for Digital Development and Information Josephine Teo during her opening address at Singapore’s Personal Data Protection Week 2024, generating SD is a rapidly evolving PET that enables realistic AI model training without compromising sensitive data.
Understanding SD and Its Benefits
SD commonly refers to artificially generated data created using purpose-built mathematical models (including AI and machine learning (ML) models) or algorithms. SD is generated by training a model or algorithm on a source dataset to mimic the characteristics and structure of the source data. Its advantages include:
- Enhancing AI/ML development – SD can drive AI and ML growth by enabling model training without exposing actual personal data.
- Addressing data challenges – SD can overcome dataset-related challenges in AI model training (e.g. insufficient or biased datasets) by augmenting and diversifying training datasets.
- Facilitating collaboration and software development – SD can be used for data analytics, collaboration and software development, reducing the risk of data breaches during the development process.
The Role of PETs
The Guide defines PETs as tools and techniques that allow the processing, analysis and extraction of insights from data without revealing the underlying personal or commercially sensitive information.
PETs are generally categorised into three main types:
- Data obfuscation
- Encrypted data processing
- Federated analytics
SD generation is a form of data obfuscation and has applications in privacy-preserving AI/ML, data sharing and software testing.
Use Cases and Good Practices for SD Generation
The Guide outlines several use cases for SD along with recommended best practices:
- Generating training datasets for AI/ML models, including data augmentation and increasing data diversity
  - Good practices – Adding noise to, or reducing the granularity of, the SD points in appropriate scenarios.
- Data analysis and collaboration, including data sharing and analysis and previewing data for collaborative purposes
  - Good practices – Incorporating data protection measures throughout the SD generation process, such as (i) removing outliers from source data, pseudonymising source data, employing data minimisation and generalising granular data during the data preparation phase, (ii) adding noise before or after SD generation during the SD generation phase and (iii) incorporating technical, contractual and governance measures to mitigate any residual re-identification risks during the post-SD generation phase.
- Software testing, including system development to avoid data breaches
  - Good practices – Generating SD that follows the semantics (e.g. format, min/max values and categories) of the source data instead of its statistical characteristics and properties.
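As a rough illustration of the software testing practice above, the sketch below generates test records purely from a schema describing format, min/max values and categories, without looking at the statistics of any source data. The schema, field names and ranges are hypothetical and are not taken from the Guide.

```python
import random

# Hypothetical schema: field names, ranges and categories are illustrative only.
SCHEMA = {
    "age": {"type": "int", "min": 18, "max": 90},
    "plan": {"type": "category", "values": ["basic", "plus", "premium"]},
    "postal_code": {"type": "digits", "length": 6},
}


def generate_record(schema, rng):
    """Build one record that respects each field's declared format and range."""
    record = {}
    for name, spec in schema.items():
        if spec["type"] == "int":
            record[name] = rng.randint(spec["min"], spec["max"])
        elif spec["type"] == "category":
            record[name] = rng.choice(spec["values"])
        elif spec["type"] == "digits":
            record[name] = "".join(
                rng.choice("0123456789") for _ in range(spec["length"])
            )
        else:
            raise ValueError(f"unsupported field type: {spec['type']}")
    return record


def generate_test_data(schema, n, seed=0):
    """Generate n records; a fixed seed keeps test fixtures reproducible."""
    rng = random.Random(seed)
    return [generate_record(schema, rng) for _ in range(n)]


rows = generate_test_data(SCHEMA, 5)
```

Because the values are drawn only from the declared ranges and categories, records of this kind carry the shape of the source data but none of its trends, which is generally sufficient for exercising a system under test.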
Key Steps in SD Generation
While SD is generally fictitious data that may not be considered personal data on its own, it still carries re-identification risks. As such, the Guide provides a five-step approach for organisations to reduce the re-identification risks of SD.
Step 1: Know Your Data
Organisations should have a clear understanding of the purpose and use cases for the SD and the source data it mimics. Organisations should note that:
- General trends/insights of source data will be replicated in the SD. Thus, the SD will not offer any protection to sensitive trends/insights.
- Organisations may have to prioritise data protection over data utility where the SD is to be released publicly.
- Where relevant, proper contractual obligations should be put in place on recipients of the SD to prevent re-identification attacks on the data.
Organisations should also establish objectives prior to SD generation to determine the acceptable risk threshold of the generated SD and the expected utility of the data. This may allow organisations to ascertain the appropriate benchmarks for assessing any trade-offs between data protection risks and data utility to meet their business objectives.
Step 2: Prepare Your Data
Organisations should consider the key insights and necessary data attributes for the SD to meet their business objectives.
- In terms of key insights, organisations need to understand and identify the trends, key statistical properties and attribute-relationships in the source data that need to be preserved for analysis. Outliers may be removed if such trends/insights are not necessary.
- Based on the key insights needed, organisations should apply data minimisation to extract only relevant attributes from the source data, remove or pseudonymise direct identifiers and add noise or generalise the data as needed to reduce re-identification risks. They should also standardise documentation in a data dictionary to validate the integrity of generated SD and address inconsistencies.
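The data preparation measures described above can be sketched in a few lines of Python. This is a minimal illustration only: the field names, the sample records, the secret key handling, the age band width and the noise scale are all assumptions made for the example, and plain Gaussian noise of this kind does not by itself provide a formal privacy guarantee.

```python
import hashlib
import hmac
import random

# Assumption: in practice the key would be stored separately from the dataset.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymise(value, key=SECRET_KEY):
    """Replace a direct identifier with a truncated keyed hash (HMAC-SHA256)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def generalise_age(age, width=10):
    """Reduce granularity by mapping an exact age to a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def add_noise(value, scale, rng):
    """Perturb a numeric value with zero-mean Gaussian noise."""
    return value + rng.gauss(0.0, scale)


def prepare(records, rng):
    """Keep only the needed attributes, then pseudonymise, generalise and add noise."""
    prepared = []
    for rec in records:
        prepared.append({
            # Data minimisation: 'name' and 'hobby' are simply not carried over.
            "person_id": pseudonymise(rec["national_id"]),
            "age_band": generalise_age(rec["age"]),
            "income": round(add_noise(rec["income"], scale=500.0, rng=rng), 2),
        })
    return prepared


# Made-up records for illustration only.
source = [
    {"national_id": "ID-0001", "name": "Alice", "age": 34, "income": 5200.0, "hobby": "chess"},
    {"national_id": "ID-0002", "name": "Bob", "age": 58, "income": 7100.0, "hobby": "golf"},
]
prepared = prepare(source, rng=random.Random(42))
```

How much noise or generalisation is appropriate depends on the key insights identified earlier, since each measure trades some data utility for lower re-identification risk.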
Step 3: Generate SD
Organisations need to consider which SD generation method is most appropriate based on their use cases, data objectives and types of data. Examples include sequential tree-based synthesisers, copulas and deep generative models.
After generation of the SD, it is good practice for organisations to perform the following checks on the quality of the generated SD:
- Data integrity – This ensures the accuracy, completeness, consistency and validity of the SD as compared to the source data.
- Data fidelity – This examines if the SD closely follows the characteristics and statistical attributes of the source data.
- Data utility – This refers to how well the SD can replace or add to source data for the specific data objective of the organisation.
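To make the data fidelity check more concrete, the sketch below compares the mean and standard deviation of a numeric column in the source data and the SD. It is one simple comparison among many possible ones, not a metric prescribed by the Guide; the function names and sample values are hypothetical.

```python
import statistics


def column_summary(rows, column):
    """Mean and population standard deviation of one numeric column."""
    values = [row[column] for row in rows]
    return {"mean": statistics.fmean(values), "stdev": statistics.pstdev(values)}


def relative_gap(source_value, synthetic_value):
    """Relative difference, falling back to the absolute gap if the source value is 0."""
    gap = abs(synthetic_value - source_value)
    return gap / abs(source_value) if source_value else gap


def fidelity_report(source_rows, synthetic_rows, columns):
    """Per-column relative gaps in mean and standard deviation."""
    report = {}
    for col in columns:
        src = column_summary(source_rows, col)
        syn = column_summary(synthetic_rows, col)
        report[col] = {
            stat: relative_gap(src[stat], syn[stat]) for stat in ("mean", "stdev")
        }
    return report


# Made-up values for illustration only.
source_rows = [{"income": 100.0}, {"income": 200.0}, {"income": 300.0}]
synthetic_rows = [{"income": 110.0}, {"income": 190.0}, {"income": 300.0}]
report = fidelity_report(source_rows, synthetic_rows, ["income"])
```

Small gaps suggest the SD tracks the source on these two statistics. A fuller check would also look at the attribute relationships identified during data preparation.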
Step 4: Assess Re-identification Risks
After the SD is generated and its utility is assessed to be acceptable, organisations should evaluate re-identification risks based on their internal criteria. This is an attack-based evaluation of how successfully an adversary carrying out re-identification attacks can determine whether an individual belongs to the source dataset and/or derive details of an individual from the source dataset.
The Guide offers various approaches to determine and quantify re-identification risks and provides examples of existing industry guidelines and recommendations for de-identified/anonymised data. It is noted that there is no universally accepted numerical threshold value for risk levels.
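As a very rough proxy for this kind of assessment, the sketch below measures how many quasi-identifier combinations that are unique in the source data reappear in the SD. This is not one of the Guide's approaches and is far simpler than a real attack-based evaluation; the function names, attributes and records are hypothetical, and any acceptable threshold would need to come from the organisation's own criteria.

```python
from collections import Counter


def qi_key(row, quasi_identifiers):
    """Tuple of the quasi-identifier values for one row."""
    return tuple(row[q] for q in quasi_identifiers)


def unique_match_rate(source_rows, synthetic_rows, quasi_identifiers):
    """Share of source-unique quasi-identifier combinations that also occur in the SD."""
    source_counts = Counter(qi_key(r, quasi_identifiers) for r in source_rows)
    unique_keys = {key for key, count in source_counts.items() if count == 1}
    if not unique_keys:
        return 0.0
    synthetic_keys = {qi_key(r, quasi_identifiers) for r in synthetic_rows}
    return len(unique_keys & synthetic_keys) / len(unique_keys)


# Made-up records for illustration only.
source_rows = [
    {"age_band": "30-39", "region": "North"},
    {"age_band": "30-39", "region": "North"},
    {"age_band": "50-59", "region": "East"},
    {"age_band": "20-29", "region": "West"},
]
synthetic_rows = [
    {"age_band": "30-39", "region": "North"},
    {"age_band": "50-59", "region": "East"},
    {"age_band": "40-49", "region": "South"},
]
rate = unique_match_rate(source_rows, synthetic_rows, ["age_band", "region"])
```

A match on a unique combination does not prove that an individual can be re-identified, but a high rate may signal that the SD deserves closer scrutiny before release.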
Step 5: Manage Residual Risks
Organisations should identify all potential residual risks and implement appropriate mitigation controls (technical, governance and contractual) to minimise these risks. These risks and controls should be documented and approved by the management and key stakeholders as part of the organisation’s enterprise risk framework. Such risks may include:
- New insights derived from SD
- Potential impact on groups of individuals due to membership disclosure
- Parties receiving SD
- Changing environment
- Model leakage
The Guide offers examples of best practices and security controls for managing these residual risks associated with using SD. These measures include technical, governance and contractual mitigation controls, with specific examples such as access controls, asset management, risk management, legal controls and database security.
Additionally, organisations should plan for incidents involving SD breaches. For example:
- Where there is a loss of fully synthetic data that is not intended for public release, organisations should investigate the incident to understand the root cause and improve internal safeguards against such occurrences in future. Organisations should also monitor for evidence of actual re-identification and assess whether the incident is a data breach notifiable to the PDPC.
- Where there is a loss of an SD generator model, parameters and/or SD, organisations should investigate to understand the root cause of the incident and improve their internal safeguards. They should also monitor for a possible successful model inversion attack, which may result in the reconstruction and disclosure of the source data, and, if applicable, assess whether such a breach is notifiable.
Concluding Remarks
The Guide offers valuable insights into leveraging SD while balancing data utility and data protection risks in SD generation. It is intended as a living document and will be updated to ensure its recommendations remain relevant. Organisations are encouraged to stay informed of these updates and actively engage with the principles outlined in the Guide to ensure their practices align with both current and future standards of data protection.
Privacy World will continue to monitor Asian data privacy legislative developments and keep you in the loop on new developments related to data privacy, cybersecurity and AI.