Validation
Methodology & Results

Complete technical documentation of our 4-level validation framework, including methodology, results, and interpretation guidelines.

1. Validation Overview

Our synthetic cfDNA data undergoes rigorous validation through a 4-level framework designed to ensure both statistical accuracy and biological functionality. Each level tests a different aspect of data quality.

100% T21 Sensitivity
75% T18/T13 Sensitivity
100% Specificity
92.9% Distribution Match

Level | What It Tests | Result | Status
1 | Distributional similarity to real cfDNA | 92.9% match | PASS
2 | Z-score trisomy detection | 100% T21, 75% T18/T13, 100% spec | PASS
3 | Classifier augmentation value | +0.102 AUC | PASS
4 | Generation controllability | 100% accuracy | PASS

2. Distributional Similarity (Level 1)

We compare synthetic cfDNA against real clinical cfDNA across multiple biological dimensions to ensure statistical accuracy of the generated data.

Dimensions Tested

Fragment Length Distribution

Comparison of fragment size histograms, including the characteristic ~166 bp maternal peak and ~143 bp fetal peak.

GC Content

Distribution of GC percentage across fragments, which affects sequencing bias and must match real data.

Nucleotide Composition

Per-position nucleotide frequencies to ensure realistic sequence patterns.

End Motifs

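4-mer terminal sequence frequencies. In cfDNA these are non-random, reflecting the cleavage preferences of the nucleases that fragment the DNA, so they serve as a distinctive biological signature.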
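The end-motif dimension can be sketched as follows. This is a minimal illustration, not the validation code itself: the function name, the choice to read the motif from the first four bases of each fragment (the 5' end), and the toy fragments are all assumptions made for the example.

```python
from collections import Counter


def end_motif_frequencies(fragments, k=4):
    """Return the relative frequency of each k-mer found at the 5' end.

    Assumption: each fragment is a plain string read 5' -> 3', so the
    end motif is simply its first k bases. Fragments shorter than k
    are skipped.
    """
    counts = Counter(
        frag[:k].upper() for frag in fragments if len(frag) >= k
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {motif: n / total for motif, n in counts.items()}


# Toy fragments, invented for illustration only.
fragments = ["CCCAGGTT", "CCCATTGA", "CCTGAACG", "ccca"]
freqs = end_motif_frequencies(fragments)
print(freqs)
```

Computing the same frequency table for real and synthetic fragments gives two distributions that can then be compared with whichever similarity measure the validation uses.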

Results

Metric | Real Data | Synthetic Data | Similarity
Mean Fragment Length | 166.2 bp | 165.8 bp | 99.8%
Fragment Length Std | 24.1 bp | 23.9 bp | 99.2%
Mean GC Content | 41.2% | 41.0% | 99.5%
Chromosome Distribution | - | - | 98.4%
Overall Similarity Score | | | 92.9%
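The per-metric similarity measure is not defined here. The three scalar rows in the table are, however, consistent with one minus the relative absolute difference, so the sketch below assumes that definition. The function name is hypothetical, and the formula should be read as a plausible reconstruction, not as the documented method.

```python
def relative_similarity(real, synthetic):
    """Similarity as 100 * (1 - |real - synthetic| / |real|).

    Assumed definition: it reproduces the scalar rows of the table
    but is not confirmed by the source.
    """
    return 100.0 * (1.0 - abs(real - synthetic) / abs(real))


# (real, synthetic) values copied from the results table.
rows = {
    "Mean Fragment Length": (166.2, 165.8),
    "Fragment Length Std": (24.1, 23.9),
    "Mean GC Content": (41.2, 41.0),
}

for name, (real, synth) in rows.items():
    print(f"{name}: {relative_similarity(real, synth):.1f}%")
```

Under this assumption the three rows come out at 99.8%, 99.2%, and 99.5%, matching the table. The chromosome distribution and overall scores aggregate over many values and cannot be reproduced from the table alone.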

Key Finding

Synthetic cfDNA achieves an overall similarity score of 92.9%, and every metric tabulated above scores at least 98.4%, indicating that the generative model captures the measured characteristics of real cfDNA.

3. Z-Score Detection (Level 2)

The z-score method is the clinical standard for NIPT trisomy detection. We validate that synthetic samples with generated trisomies are correctly detected using this method.

Methodology

Z-scores are calculated for each target chromosome by comparing observed vs expected chromosome ratios:

z = (observed_ratio - expected_ratio) / standard_deviation

A sample is classified as affected if z > 3.0 for the target chromosome. Reference statistics are derived from euploid samples.
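The calculation above can be sketched in a few lines. This is a minimal example under stated assumptions: the reference mean and standard deviation are taken from a list of euploid chromosome ratios, the function names are hypothetical, and the chr21 ratios are invented for illustration.

```python
from statistics import mean, stdev

Z_THRESHOLD = 3.0  # from the text: affected if z > 3.0


def chromosome_z_score(observed_ratio, euploid_ratios):
    """z = (observed_ratio - expected_ratio) / standard_deviation.

    expected_ratio and standard_deviation are estimated from the
    euploid reference samples (sample standard deviation).
    """
    expected_ratio = mean(euploid_ratios)
    standard_deviation = stdev(euploid_ratios)
    return (observed_ratio - expected_ratio) / standard_deviation


def is_affected(z, threshold=Z_THRESHOLD):
    """Classify a sample as affected when z strictly exceeds the threshold."""
    return z > threshold


# Hypothetical chr21 read fractions for five euploid reference samples.
euploid_chr21 = [0.01450, 0.01460, 0.01470, 0.01455, 0.01465]

z_test = chromosome_z_score(0.01533, euploid_chr21)     # elevated chr21
z_normal = chromosome_z_score(0.01462, euploid_chr21)   # near the mean
print(f"test z = {z_test:.2f}, affected = {is_affected(z_test)}")
print(f"normal z = {z_normal:.2f}, affected = {is_affected(z_normal)}")
```

Because the comparison is strict, a sample with z exactly equal to 3.0 is not called affected.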

Results: 1M Fragment Samples

Condition | Samples | Mean Z-Score | Detection Rate
Trisomy 21 | 16 | 8.2 (range: 4.1 to 12.3) | 100% (16/16)
Trisomy 18 | 8 | 5.1 (range: 3.2 to 7.8) | 75% (6/8)
Trisomy 13 | 8 | 4.8 (range: 3.0 to 6.9) | 75% (6/8)
Normal (Euploid) | 15 | 0.2 (range: -1.2 to 1.1) | 100% specificity

Clinical Interpretation

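100% T21 sensitivity with 100% specificity indicates that the synthetic trisomy samples carry chromosome dosage effects that standard clinical NIPT methods can detect. The lower T18/T13 detection rate (75%, 6/8 each) follows the pattern seen clinically, where sensitivity for T18 and T13 typically trails T21. This is generally attributed to higher measurement variability for chromosomes 18 and 13, for example from GC-related sequencing bias, and not to chromosome size: both are larger than chromosome 21.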

Fetal Fraction Sensitivity

Z-score detection depends on fetal fraction. We tested across multiple FF levels:

Fetal Fraction | T21 Z-Score (mean) | Detection Rate
8% | 5.8 | 100%
10% | 7.3 | 100%
15% | 10.9 | 100%
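The near-linear growth of the z-score with fetal fraction can be illustrated with a first-order dosage model. The sketch assumes that a trisomic fetus raises the chr21 ratio by roughly fetal_fraction * 0.5 in relative terms (the same factor used in the Condition Injection note under Methodology Notes), a baseline chr21 ratio of 1.46% (from the controllability table), and a reference standard deviation of 0.0001. The standard deviation is a hypothetical value chosen for illustration; it is not reported in this document.

```python
def expected_t21_z(fetal_fraction, baseline_ratio=0.0146, reference_sd=1e-4):
    """First-order expected z-score for T21 at a given fetal fraction.

    shift = baseline_ratio * 0.5 * fetal_fraction
    z     = shift / reference_sd

    reference_sd is an assumed value, not a figure from the validation.
    The model ignores the small change in total coverage caused by the
    extra chr21 reads.
    """
    shift = baseline_ratio * 0.5 * fetal_fraction
    return shift / reference_sd


for ff in (0.08, 0.10, 0.15):
    print(f"FF {ff:.0%}: expected z ~ {expected_t21_z(ff):.2f}")
```

With these assumed inputs the model gives values close to the tabulated means, which suggests the observed trend is what a simple dosage effect would produce. It does not show how the generated samples were actually scored.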

4. Augmentation Experiments (Level 3)

We tested whether synthetic data improves classifier performance when added to real training data, simulating the key commercial use case.

Experiment Design

Data Setup

9 real cfDNA samples + synthetic samples at varying ratios. Conditions injected via chromosome coverage adjustment.

Cross-Validation

Leave-one-sample-out (LOOCV) ensuring no data leakage. Each fold tests on a held-out sample.

Classifiers

Random Forest with chromosome ratio features (v1) and full bin-level features (v2).

Metrics

AUC-ROC for discrimination between normal and trisomy conditions.
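The cross-validation and metric steps can be sketched as below. Several details are assumptions made for illustration: the sample IDs, the idea that each synthetic sample records a "parent" real sample it was derived from (so that derivatives of the held-out sample can be excluded from training), and the pairwise-comparison form of AUC. Because each fold holds out a single sample, AUC has to be computed over predictions collected across folds or runs; how the reported mean AUC was aggregated is not specified here.

```python
def loso_folds(real_ids, synthetic_parent):
    """Yield (train_real, train_synth, held_out) for each real sample.

    synthetic_parent maps a synthetic sample ID to the real sample it
    was derived from (hypothetical bookkeeping). Synthetic samples
    derived from the held-out sample are dropped to avoid leakage.
    """
    for held_out in real_ids:
        train_real = [r for r in real_ids if r != held_out]
        train_synth = [
            s for s, parent in synthetic_parent.items() if parent != held_out
        ]
        yield train_real, train_synth, held_out


def auc_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. labels are 1 (trisomy) or 0 (normal).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical IDs: 9 real samples, one synthetic derivative of each.
real_ids = [f"real_{i}" for i in range(1, 10)]
synthetic_parent = {f"synth_{i}": f"real_{i}" for i in range(1, 10)}

folds = list(loso_folds(real_ids, synthetic_parent))
print(len(folds), "folds")
print("toy AUC:", auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise form of AUC is quadratic in the number of samples, which is fine at this scale and keeps the definition easy to check by hand.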

Results: Augmentation v1 (8 Features)

Synthetic Ratio | Training Mix | Mean AUC | Improvement
0% (baseline) | Real only | 0.546 | -
25% | Real + synthetic | 0.620 | +0.074
50% | Real + synthetic | 0.630 | +0.084
75% | Real + synthetic | 0.648 | +0.102
100% | Synthetic only | 0.500 | -0.046

Key Finding

Adding synthetic data consistently improves classifier performance up to a 75% synthetic ratio, with a peak improvement of +0.102 AUC (+18.7% relative to the 0.546 baseline). Training on synthetic data alone performs at chance level (AUC 0.500), confirming the need for real data anchoring.
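The absolute and relative improvements quoted above follow directly from the table. The short sketch below recomputes them; the variable and function names are chosen for illustration.

```python
baseline_auc = 0.546  # real-only training, from the table

# Mean AUC by synthetic ratio, copied from the table.
mean_auc = {0.25: 0.620, 0.50: 0.630, 0.75: 0.648, 1.00: 0.500}


def improvement(auc, baseline=baseline_auc):
    """Return (absolute, relative) improvement over the baseline AUC."""
    absolute = auc - baseline
    return absolute, absolute / baseline


for ratio, auc in mean_auc.items():
    absolute, relative = improvement(auc)
    print(f"{ratio:.0%} synthetic: {absolute:+.3f} AUC ({relative:+.1%})")
```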

Results: Augmentation v2 (3,102 Bin Features)

We also tested with full bin-level features (3,102 genomic bins) for a more rigorous evaluation:

Classifier | Baseline AUC | With Synthetic | Notes
Logistic Regression (L2) | 1.000 | 1.000 | Ceiling effect
Ridge Classifier | 1.000 | 1.000 | Ceiling effect
v1-style (39 features) | 0.676 | 0.833 | +0.157 improvement

Important Context

The AUC=1.0 results with bin features occur because the task becomes trivially solvable with sufficient feature resolution, leaving no headroom for augmentation to show a benefit. The v1 experiment with limited features provides a more realistic assessment of augmentation value. Full methodology notes are available in our technical documentation.

5. Controllability Validation (Level 4)

We verify that generation parameters produce the expected outputs, which is critical for creating datasets with specific characteristics.

Parameters Tested

Parameter | Target Values | Actual Output | Accuracy
Fetal Fraction | 8%, 10%, 15% | 8.0%, 10.0%, 15.0% | 100%
Condition (Normal) | chr21 ratio ~1.46% | 1.46% | PASS
Condition (T21) | chr21 elevated | +5% relative increase | PASS
Fragment Count | 1,000,000 | 1,000,000 | 100%

Controllability Guarantee

All requested parameters are precisely reflected in generated samples. Fetal fraction is stored in sample metadata and verified through fragment-level labelling.
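A check of this kind can be sketched as follows, assuming each fragment carries a label of either "fetal" or "maternal". The label values, function names, and tolerance are hypothetical; the actual metadata schema is not described here.

```python
def fetal_fraction_from_labels(labels):
    """Fraction of fragments labelled 'fetal' (assumed label value)."""
    if not labels:
        raise ValueError("no fragment labels supplied")
    return sum(1 for label in labels if label == "fetal") / len(labels)


def matches_target(labels, target_ff, tolerance=0.0005):
    """True when the labelled fetal fraction is within tolerance of target.

    The default tolerance corresponds to rounding to one decimal place
    in percent (e.g. 10.0%); it is an illustrative choice.
    """
    return abs(fetal_fraction_from_labels(labels) - target_ff) <= tolerance


# Toy sample: 1,000 fragments, 100 of them labelled fetal.
labels = ["fetal"] * 100 + ["maternal"] * 900

print(fetal_fraction_from_labels(labels))
print(matches_target(labels, target_ff=0.10))
```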

6. Methodology Notes

Reference Data

Validation uses real cfDNA from clinical samples (n=9) processed through standard NIPT pipelines. Real data was aligned to the GRCh38 reference genome and passed through the same quality control as the synthetic data.

Synthetic Generation

Synthetic cfDNA is generated using our autoregressive (AR v15) model trained on real cfDNA features. The model generates sequences conditioned on:

  • Target fetal fraction (1-25%)
  • Karyotype condition (normal, T21, T18, T13, etc.)
  • Fragment count per sample
  • GC content distribution targets

Condition Injection

For augmentation experiments, trisomy signal is injected into normal samples using:

coverage[target_chr] *= 1 + (fetal_fraction * 0.5)

This simulates the ~50% extra chromosome contribution from trisomic fetal cells, scaled by fetal fraction.
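A self-contained version of the injection step might look like the sketch below. The function names, the dictionary layout, and the read counts are assumptions for illustration; only the scaling factor 1 + (fetal_fraction * 0.5) comes from the formula above.

```python
def inject_trisomy(coverage, target_chr, fetal_fraction):
    """Return a copy of coverage with the target chromosome scaled up.

    Applies coverage[target_chr] *= 1 + (fetal_fraction * 0.5) without
    modifying the input dictionary.
    """
    if not 0.0 <= fetal_fraction <= 1.0:
        raise ValueError("fetal_fraction must be between 0 and 1")
    injected = dict(coverage)
    injected[target_chr] *= 1 + (fetal_fraction * 0.5)
    return injected


def chromosome_ratio(coverage, chrom):
    """Share of total coverage that falls on one chromosome."""
    return coverage[chrom] / sum(coverage.values())


# Hypothetical read counts: chr21 holds 1.46% of one million reads.
normal = {"chr21": 14_600, "other": 985_400}
t21 = inject_trisomy(normal, "chr21", fetal_fraction=0.10)

print(f"normal chr21 ratio: {chromosome_ratio(normal, 'chr21'):.4%}")
print(f"T21 chr21 ratio:    {chromosome_ratio(t21, 'chr21'):.4%}")
```

At a 10% fetal fraction the chr21 coverage rises by 5%, which matches the relative increase listed for the T21 condition in the controllability table. The chr21 ratio rises by slightly less than 5% because the extra reads also enlarge the total.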

Limitations

Current Limitations

1. No real trisomy samples: Validation uses injected conditions rather than real trisomy samples. Z-score validation on generated (not injected) trisomies provides the strongest evidence.

2. Limited real data: Only 9 real samples available for validation. Results may vary with larger cohorts.

3. Single ancestry: Current validation uses primarily European-ancestry samples. Multi-ancestry validation is ongoing.

Reproducibility

All validation scripts, data, and results are available upon request. Contact kyle@eabhaseq.com for access to validation code and datasets.
