# Creating a Panel of Normals (PoN) for Exomes

"Somatic" variants that occur in a panel of normal samples can be considered sequencing artifacts. We can generate a VCF file to filter against by calling variants in normal samples that look "clean", i.e. absent of tumor contamination. We use a similar variant calling strategy as for the somatic variant calling in tumor samples

For each normal sample call variants with Strelka2 and MuTect2.

# Strelka2

Run Manta to seed indel calling, then run as if the normal sample is an unmatched tumor sample. Parse output with bcftools, subsetting on variants supported by more than one alternate read.

$MANTA_PATH/configManta.py \
    --referenceFasta $REF \
    --runDir pon/manta/$NORMAL_NAME \
    --exome \
    --callRegions $TARGETS
    --bam $NORMAL_BAM

pon/manta/$NORMAL_NAME/runWorkflow.py --mode local

$STRELKA_PATH/configureStrelkaGermlineWorkflow.py \
    --ref $REF \
    --runDir mutations/pon/strelka2/$NORMAL_NAME \
    --exome \
    --callRegions $TARGETS \
    --indelCandidates pon/manta/$NORMAL_NAME/results/variants/candidateSmallIndels.vcf.gz \
    --bam $NORMAL_BAM

bcftools filter \
    --include 'FORMAT/AD[0:1]>1' \
    pon/strelka2/$NORMAL_NAME/results/variants/variants.vcf.gz | \
    bcftools norm \
    --fasta-ref $REF \
    -check-ref s \
    --multiallelics -both \
    --output-type z \
    --output pon/$NORMAL_NAME.strelka2.vcf.gz

tabix --preset vcf pon/$NORMAL_NAME.strelka2.vcf.gz

# MuTect2

MuTect2 provides a variant calling mode for normal samples. Process the output similarly to above. Fix some VCF header tags so that the files can be combined downstream. As opposed to the somatic variant calling in tumor samples, here retain any calls at multiallelic loci.

gatk Mutect2 \
    --reference $REF \
    --intervals $TARGETS \
    --input $NORMAL_BAM \
     --tumor $NORMAL_NAME \
     --output pon/mutect2/$NORMAL_NAME.vcf.gz

bcftools filter \
    --include 'FORMAT/AD[0:1]>1' \
    pon/mutect2/$NORMAL_NAME.vcf.gz | \
    sed -e 's/ID=RU,Number=1/ID=RU,Number=A/' -e 's/ID=AD,Number=R/ID=AD,Number=./' |
    bcftools norm \
    --fasta-ref $REF \
    --check-ref s \
    --multiallelics -both \
    --output-type z \
    --output pon/$NORMAL_NAME.mutect2.vcf.gz

tabix --preset vcf pon/$NORMAL_NAME.mutect2.vcf.gz

Now, combine all individual VCFs from all normal samples. This requires a bcftools plugin (opens new window).

bcftools merge \
    --merge none \
    --output-type z \
    --output pon.vcf.gz \
    pon/*vcf.gz

bcftools +fill-tags pon.vcf.gz \
    --output-type z \
    --output pon.annot.vcf.gz \
    -- --tags AC

tabix --preset vcf pon.annot.vcf.gz

Now, pon.annot.vcf.gz is ready to use to annotate somatic variant calls from tumor samples.