Validation
Methodology & Results
Complete technical documentation of our 4-level validation framework, including methodology, results, and interpretation guidelines.
1. Validation Overview
Our synthetic cfDNA data undergoes rigorous validation through a 4-level framework designed to ensure both statistical accuracy and biological functionality. Each level tests a different aspect of data quality.
| Level | What It Tests | Result | Status |
|---|---|---|---|
| 1 | Distributional similarity to real cfDNA | 92.9% match | PASS |
| 2 | Z-score trisomy detection | 100% T21 sensitivity, 75% T18/T13 sensitivity, 100% specificity | PASS |
| 3 | Classifier augmentation value | +0.102 AUC | PASS |
| 4 | Generation controllability | 100% accuracy | PASS |
2. Distributional Similarity (Level 1)
We compare synthetic cfDNA against real clinical cfDNA across multiple biological dimensions to ensure statistical accuracy of the generated data.
Dimensions Tested
Fragment Length Distribution
Comparison of fragment size histograms, including the characteristic ~166 bp maternal peak and ~143 bp fetal peak.
GC Content
Distribution of GC percentage across fragments, which affects sequencing bias and must match real data.
Nucleotide Composition
Per-position nucleotide frequencies to ensure realistic sequence patterns.
End Motifs
4-mer terminal sequence frequencies, which are biologically distinct in cfDNA.
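As a minimal sketch of how the per-fragment features listed above could be computed, the snippet below derives GC fraction, a fragment length histogram, and 4-mer end motif frequencies from raw fragment sequences. The function names are hypothetical, and taking the motif from the 5' end of each fragment is an assumption; the source does not say which end or strand is used.

```python
from collections import Counter


def gc_fraction(fragment: str) -> float:
    """Fraction of G/C bases in one fragment (0.0 for an empty string)."""
    seq = fragment.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def length_histogram(fragments):
    """Count of fragments per fragment length."""
    return Counter(len(frag) for frag in fragments)


def end_motif_frequencies(fragments, k=4):
    """Relative frequency of the first k bases of each fragment.

    Assumption: the motif is read from the 5' end; fragments shorter
    than k bases are skipped.
    """
    counts = Counter(frag[:k].upper() for frag in fragments if len(frag) >= k)
    total = sum(counts.values())
    return {motif: n / total for motif, n in counts.items()} if total else {}


# Toy fragments for illustration only (real cfDNA fragments are ~166 bp).
fragments = ["CCCAGGTT", "CCCATTAA", "GGTACGCG", "ATG"]
motifs = end_motif_frequencies(fragments)
```

The same functions would be applied to the real and the synthetic fragment sets, and the resulting distributions compared.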
Results
| Metric | Real Data | Synthetic Data | Similarity |
|---|---|---|---|
| Mean Fragment Length | 166.2 bp | 165.8 bp | 99.8% |
| Fragment Length Std | 24.1 bp | 23.9 bp | 99.2% |
| Mean GC Content | 41.2% | 41.0% | 99.5% |
| Chromosome Distribution | - | - | 98.4% |
| Overall Similarity Score | - | - | 92.9% |
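The per-metric similarity values in the table are consistent with a simple relative-difference score, 1 − |real − synthetic| / real. The source does not state the formula, so the sketch below is an assumption that happens to reproduce the three rows with published values; it does not attempt to reproduce the overall 92.9% score, whose aggregation is not described.

```python
def relative_similarity(real: float, synthetic: float) -> float:
    """Similarity as a percentage: 100 * (1 - |real - synthetic| / |real|).

    Assumed formula, not confirmed by the source.
    """
    return 100.0 * (1.0 - abs(real - synthetic) / abs(real))


# (real, synthetic) pairs taken from the results table.
rows = {
    "Mean Fragment Length": (166.2, 165.8),
    "Fragment Length Std": (24.1, 23.9),
    "Mean GC Content": (41.2, 41.0),
}

for name, (real, synthetic) in rows.items():
    print(f"{name}: {relative_similarity(real, synthetic):.1f}%")
```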
Key Finding
Synthetic cfDNA achieves an overall similarity score of 92.9%, and every individual metric in the table is at or above 98.4%, indicating that the generative model captures the measured characteristics of real cfDNA.
3. Z-Score Detection (Level 2)
The z-score method is the clinical standard for NIPT trisomy detection. We validate that synthetic samples with generated trisomies are correctly detected using this method.
Methodology
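Z-scores are calculated for each target chromosome by comparing observed vs expected chromosome ratios, in the standard form:
$$z_{chr} = \frac{r_{chr} - \mu_{chr}}{\sigma_{chr}}$$
where $r_{chr}$ is the observed fraction of fragments mapping to the target chromosome, and $\mu_{chr}$ and $\sigma_{chr}$ are the mean and standard deviation of that fraction in the reference set.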
A sample is classified as affected if z > 3.0 for the target chromosome. Reference statistics are derived from euploid samples.
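The classification rule above can be sketched as follows. This is an illustration under stated assumptions, not the production pipeline: the function names are hypothetical, the reference ratios are made-up euploid values, and the sample standard deviation is used because the source does not say which estimator is applied.

```python
from statistics import mean, stdev

Z_THRESHOLD = 3.0  # decision threshold stated in the text


def chromosome_ratio(counts: dict, target: str) -> float:
    """Fraction of all fragments that map to the target chromosome."""
    return counts[target] / sum(counts.values())


def z_score(observed_ratio: float, reference_ratios) -> float:
    """Standardize an observed ratio against euploid reference ratios."""
    mu = mean(reference_ratios)
    sigma = stdev(reference_ratios)  # assumption: sample standard deviation
    return (observed_ratio - mu) / sigma


def is_affected(z: float) -> bool:
    """A sample is called affected when z exceeds the threshold."""
    return z > Z_THRESHOLD


# Hypothetical chr21 ratios from five euploid reference samples.
reference = [0.01455, 0.01460, 0.01465, 0.01450, 0.01470]

z_trisomy = z_score(0.01533, reference)   # chr21 ratio raised by about 5%
z_euploid = z_score(0.01462, reference)   # ratio close to the reference mean
print(is_affected(z_trisomy), is_affected(z_euploid))
```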
Results: 1M Fragment Samples
| Condition | Samples | Mean Z-Score | Detection Rate |
|---|---|---|---|
| Trisomy 21 | 16 | 8.2 (range: 4.1 - 12.3) | 100% (16/16) |
| Trisomy 18 | 8 | 5.1 (range: 3.2 - 7.8) | 75% (6/8) |
| Trisomy 13 | 8 | 4.8 (range: 3.0 - 6.9) | 75% (6/8) |
| Normal (Euploid) | 15 | 0.2 (range: -1.2 - 1.1) | 100% specificity |
Clinical Interpretation
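100% T21 sensitivity with 100% specificity indicates that synthetic trisomy samples carry chromosome dosage effects that standard clinical NIPT methods can detect. The lower T18/T13 detection rate (75% each) is in line with clinical experience, where sensitivity for T18 and T13 tends to be lower than for T21. This is generally attributed to greater coverage variability on chromosomes 18 and 13 (for example, GC-related bias) rather than to chromosome size, since both are larger than chromosome 21.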
Fetal Fraction Sensitivity
Z-score detection depends on fetal fraction. We tested across multiple FF levels:
| Fetal Fraction | T21 Z-Score (mean) | Detection Rate |
|---|---|---|
| 8% | 5.8 | 100% |
| 10% | 7.3 | 100% |
| 15% | 10.9 | 100% |
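The near-linear growth of the z-score with fetal fraction in the table above is what a simple dosage model would predict. As a rough sketch, assuming a trisomic fetus contributes half an extra copy of chromosome 21, the expected z-score is 0.5 × FF × r / σ, where r is the euploid chr21 ratio and σ its reference standard deviation. The function name and the value of `sigma` are hypothetical; `sigma` is chosen only so that the sketch reproduces the 10% row, and 1.46% is the normal chr21 ratio quoted in the controllability table.

```python
# Mean T21 z-scores from the fetal fraction table, keyed by fetal fraction.
table = {0.08: 5.8, 0.10: 7.3, 0.15: 10.9}


def expected_z(fetal_fraction: float, baseline_ratio: float, sigma: float) -> float:
    """Expected z-score when the target ratio rises by 0.5 * FF (relative)."""
    return 0.5 * fetal_fraction * baseline_ratio / sigma


# If z is linear in FF, z / FF should be roughly constant across rows.
slopes = {ff: z / ff for ff, z in table.items()}

# Hypothetical sigma picked to match the 10% row; not a measured value.
sigma = 1e-4
z_at_10_percent = expected_z(0.10, 0.0146, sigma)
```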
4. Augmentation Experiments (Level 3)
We tested whether synthetic data improves classifier performance when added to real training data, simulating the key commercial use case.
Experiment Design
Data Setup
9 real cfDNA samples + synthetic samples at varying ratios. Conditions injected via chromosome coverage adjustment.
Cross-Validation
Leave-one-sample-out cross-validation (LOOCV), so that no sample appears in both training and test data. Each fold tests on a single held-out sample.
Classifiers
Random Forest with chromosome ratio features (v1) and full bin-level features (v2).
Metrics
AUC-ROC for discrimination between normal and trisomy conditions.
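The evaluation loop described above can be sketched with two small helpers: one that builds leave-one-out folds in which synthetic samples only ever enter the training side, and one that computes AUC-ROC from scores. Both function names are hypothetical and the classifier itself is left out; this is a sketch of the protocol, not the code used to produce the results.

```python
def loocv_folds(real_samples, synthetic_samples):
    """Yield (train, held_out) pairs.

    Each real sample is held out exactly once. Synthetic samples are
    appended to every training set and are never used for testing.
    """
    real = list(real_samples)
    synthetic = list(synthetic_samples)
    for i, held_out in enumerate(real):
        train = real[:i] + real[i + 1:] + synthetic
        yield train, held_out


def auc_roc(labels, scores):
    """AUC as the probability that a positive outscores a negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


folds = list(loocv_folds(["r1", "r2", "r3"], ["s1", "s2"]))
example_auc = auc_roc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

One design point worth checking in any such setup: if a synthetic or injected sample is derived from a specific real sample, it should be excluded from the folds where that real sample is held out.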
Results: Augmentation v1 (8 Features)
| Synthetic Ratio | Training Mix | Mean AUC | Improvement |
|---|---|---|---|
| 0% (baseline) | Real only | 0.546 | - |
| 25% | Real + synthetic | 0.620 | +0.074 |
| 50% | Real + synthetic | 0.630 | +0.084 |
| 75% | Real + synthetic | 0.648 | +0.102 |
| 100% | Synthetic only | 0.500 | -0.046 |
Key Finding
Adding synthetic data consistently improves classifier performance up to 75% synthetic ratio, with peak improvement of +0.102 AUC (+18.7% relative). Pure synthetic performs at chance level, confirming the need for real data anchoring.
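The arithmetic behind these figures can be checked with a short sketch. It assumes that "synthetic ratio" means the share of synthetic samples in the training mix, which the source does not state explicitly; both function names are hypothetical.

```python
def synthetic_count(n_real: int, synthetic_ratio: float) -> int:
    """Synthetic samples needed so they make up synthetic_ratio of the mix.

    Assumption: ratio = n_synthetic / (n_real + n_synthetic).
    """
    if not 0.0 <= synthetic_ratio < 1.0:
        raise ValueError("ratio must be in [0, 1) when real samples are present")
    return round(n_real * synthetic_ratio / (1.0 - synthetic_ratio))


def relative_gain(baseline_auc: float, augmented_auc: float) -> float:
    """AUC improvement as a percentage of the baseline AUC."""
    return 100.0 * (augmented_auc - baseline_auc) / baseline_auc


# With 9 real samples, the three mixed ratios from the table.
counts = {ratio: synthetic_count(9, ratio) for ratio in (0.25, 0.50, 0.75)}
gain = relative_gain(0.546, 0.648)  # baseline vs 75% synthetic
```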
Results: Augmentation v2 (3,102 Bin Features)
We also tested with full bin-level features (3,102 genomic bins) for a more rigorous evaluation:
| Classifier | Baseline AUC | With Synthetic | Notes |
|---|---|---|---|
| Logistic Regression (L2) | 1.000 | 1.000 | Ceiling effect |
| Ridge Classifier | 1.000 | 1.000 | Ceiling effect |
| v1-style (39 features) | 0.676 | 0.833 | +0.157 improvement |
Important Context
The AUC of 1.000 with bin features occurs because the task becomes trivially solvable at that feature resolution, so these rows cannot show an augmentation effect. The v1 experiment with limited features gives a more realistic assessment of augmentation value. Full methodology notes are available in our technical documentation.
5. Controllability Validation (Level 4)
We verify that generation parameters produce expected outputs - critical for creating datasets with specific characteristics.
Parameters Tested
| Parameter | Target Values | Actual Output | Accuracy |
|---|---|---|---|
| Fetal Fraction | 8%, 10%, 15% | 8.0%, 10.0%, 15.0% | 100% |
| Condition (Normal) | chr21 ratio ~1.46% | 1.46% | PASS |
| Condition (T21) | chr21 elevated | +5% relative increase | PASS |
| Fragment Count | 1,000,000 | 1,000,000 | 100% |
Controllability Guarantee
All requested parameters are precisely reflected in generated samples. Fetal fraction is stored in sample metadata and verified through fragment-level labelling.
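A check of this kind could look like the sketch below, which compares the requested values in the sample metadata with what the fragment-level labels actually contain. The metadata keys, the label strings `"fetal"` and `"maternal"`, and the tolerance are all hypothetical; the source does not describe the storage format.

```python
def observed_fetal_fraction(labels) -> float:
    """Share of fragments whose label is 'fetal'."""
    labels = list(labels)
    if not labels:
        raise ValueError("no fragments to inspect")
    return sum(1 for label in labels if label == "fetal") / len(labels)


def check_sample(metadata: dict, labels: list, tolerance: float = 0.001) -> dict:
    """Compare requested parameters with the generated fragments."""
    observed = observed_fetal_fraction(labels)
    return {
        "fragment_count_ok": len(labels) == metadata["fragment_count"],
        "fetal_fraction_ok": abs(observed - metadata["fetal_fraction"]) <= tolerance,
    }


# Toy sample: 1,000 fragments with a requested fetal fraction of 10%.
labels = ["fetal"] * 100 + ["maternal"] * 900
metadata = {"fetal_fraction": 0.10, "fragment_count": 1000}
report = check_sample(metadata, labels)
```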
6. Methodology Notes
Reference Data
Validation uses real cfDNA from clinical samples (n=9) processed through standard NIPT pipelines. Real data was aligned to GRCh38 reference genome and processed with identical quality control to synthetic data.
Synthetic Generation
Synthetic cfDNA is generated using our autoregressive (AR v15) model trained on real cfDNA features. The model generates sequences conditioned on:
- Target fetal fraction (1-25%)
- Karyotype condition (normal, T21, T18, T13, etc.)
- Fragment count per sample
- GC content distribution targets
Condition Injection
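For augmentation experiments, trisomy signal is injected into normal samples by scaling the coverage of the target chromosome:
$$c'_{chr} = c_{chr} \times (1 + 0.5 \times FF)$$
where $c_{chr}$ is the original coverage of the target chromosome and $FF$ is the fetal fraction.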
This simulates the ~50% extra chromosome contribution from trisomic fetal cells, scaled by fetal fraction.
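The injection described above can be sketched as a scaling of the target chromosome's coverage by 1 + 0.5 × FF. The function names and the toy coverage values are hypothetical, and whether the real pipeline renormalizes total coverage afterwards is not stated, so the sketch leaves the other chromosomes unchanged.

```python
def inject_trisomy(coverage: dict, target: str, fetal_fraction: float) -> dict:
    """Return a copy of coverage with the target chromosome scaled up.

    The factor 1 + 0.5 * FF models one extra copy (50% more) of the
    chromosome in the fetal share of the fragments.
    """
    factor = 1.0 + 0.5 * fetal_fraction
    return {
        chrom: value * factor if chrom == target else value
        for chrom, value in coverage.items()
    }


def chromosome_ratio(coverage: dict, target: str) -> float:
    """Fraction of total coverage on the target chromosome."""
    return coverage[target] / sum(coverage.values())


# Toy sample: chr21 holds 1.46% of 1,000,000 fragments.
coverage = {"chr21": 14600, "other": 985400}
injected = inject_trisomy(coverage, "chr21", fetal_fraction=0.10)

ratio_increase = chromosome_ratio(injected, "chr21") / chromosome_ratio(coverage, "chr21")
```

At a fetal fraction of 10% the chr21 coverage rises by 5%, while the chr21 ratio rises slightly less because the total coverage also grows.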
Limitations
Current Limitations
1. No real trisomy samples: Validation uses injected conditions rather than real trisomy samples. Z-score validation on generated (not injected) trisomies provides the strongest evidence.
2. Limited real data: Only 9 real samples available for validation. Results may vary with larger cohorts.
3. Single ancestry: Current validation uses primarily European-ancestry samples. Multi-ancestry validation is ongoing.
Reproducibility
All validation scripts, data, and results are available upon request. Contact kyle@eabhaseq.com for access to validation code and datasets.