# Variant Annotation and Filtering

Note

Hard-coded filter thresholds can be viewed and edited in the configuration files for exomes (exome.config) and genomes (genome.config).

Be aware

  • These components of the pipeline are subject to constant change.
  • Users should be aware of the pitfalls and challenges of filtering somatic variant calls, which are not further discussed here.

# Somatic SNVs and Indels

Variant-level annotation, filtering, and flagging of variants with further filter flags occur in the SomaticCombineChannel and SomaticAnnotateMaf processes. Variants that pass the somatic scoring models intrinsic to the callers (FILTER="PASS" in the VCF files) are combined into their union, giving precedence to MuTect2 for any site where both callers detected a variant.
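
The combination step can be illustrated with a minimal sketch. It is not the pipeline's code: the dictionary layout, the definition of a "site" as an exact (chrom, pos, ref, alt) match, and the function name `combine_pass_calls` are assumptions made for illustration.

```python
from typing import Dict, Tuple

# Hypothetical representation: a site is (chrom, pos, ref, alt),
# and each caller maps a site to its VCF FILTER value.
Site = Tuple[str, int, str, str]


def combine_pass_calls(mutect2: Dict[Site, str],
                       strelka2: Dict[Site, str]) -> Dict[Site, str]:
    """Return the union of PASS calls, labelled by the caller whose record is kept."""
    combined: Dict[Site, str] = {}
    # Add Strelka2 first so that MuTect2 overwrites any shared site.
    for site, flt in strelka2.items():
        if flt == "PASS":
            combined[site] = "Strelka2"
    for site, flt in mutect2.items():
        if flt == "PASS":
            combined[site] = "MuTect2"
    return combined


mutect2_calls = {("1", 100, "A", "T"): "PASS", ("1", 200, "C", "G"): "germline"}
strelka2_calls = {("1", 100, "A", "T"): "PASS", ("1", 300, "G", "A"): "PASS"}
combined = combine_pass_calls(mutect2_calls, strelka2_calls)
```

In this example the shared site at position 100 keeps the MuTect2 record, and the Strelka2-only site at position 300 is retained as part of the union.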

The functional effect of variants is predicted with VEP via vcf2maf, which also converts the VCF into a tab-delimited MAF file. See the notes on the use of preferred transcript isoforms and on VEP annotation outputs.

The following columns are added to the final MAF file, in addition to those added during the VEP annotation:

  • Strelka2FILTER: Indicates that Strelka2 detected the variant but did not classify it as a somatic variant.
  • gnomAD_FILTER: Indicates that the variant was detected in the gnomAD workflow, but ultimately not classified as a germline variant. Note that this is not used in the current somatic filtering scheme.
  • RepeatMasker and EncodeDacMapability: The variant locus is in a repeat, low-mappability, or hard-to-sequence region. More details in the reference file description.
  • PoN: Panel of normals, the number of normal samples in which the variant was detected. See the documentation of the exome implementation for details. For genomes this is under development, and the exome PoN is currently applied.
  • Ref_Tri: Trinucleotide context of SNVs, normalized to pyrimidine-to-purine transversions.
  • gnomAD allele frequencies:
    • Exomes, columns named non_cancer_*: Allele counts (AC) and frequencies (AF) for the variant in the non-TCGA population and sub-populations of gnomAD.
    • Genomes, columns named (AC_*|AF_*): Allele counts and frequencies for the variant in populations and sub-populations of gnomAD.
  • Raw allele counts: Total and strand-specific (*_fwd and *_rev) allele count and depth for tumor (t_*) and normal (n_*). These are unfiltered values generated by GetBaseCountsMultiSample, in contrast to those from the variant callers.
  • alt_bias: For variants with a raw depth of at least 5 reads, this is true if all raw variant-supporting reads are on either the forward or reverse strand.
  • ref_bias: For variants with a raw depth of at least 5 reads, this is true if all raw reads are on either the forward or reverse strand.
  • Mutation hotspots:
    • snv_hotspot: SNV hotspots.
    • threeD_hotspot: 3D hotspots, in contrast to the above-mentioned "linear" hotspots.
    • indel_hotspot and indel_hotspot_type: In-frame indel hotspots, and an indication of whether the variant overlaps a prior indel hotspot locus (prior) or overlaps an SNV hotspot (novel).
    • Hotspot: TRUE if either a linear SNV or indel hotspot.
  • OncoKB annotation:
    • mutation_effect and oncogenic: Indicate the functional effect of the mutation and whether it is deemed to be oncogenic.
    • LEVEL_* and Highest_level: Indicate whether there is any drug at the given level of actionability and which is the highest level of actionability, if any. Note that this is cancer-type agnostic in the current implementation.
    • citations: References for OncoKB annotation.
    • Be aware: Because TEMPO does not know the cancer type (ONCOTREE_CODE) at run time, Level 1 and Level 2A cannot be annotated, and the highest level reported is Level 2B. To obtain detailed OncoKB levels, re-run oncokb_annotator/MafAnnotator.py with this information.
  • Variant caller metadata (development feature, subject to changes):
    • MuTect2¹:
      • MBQ: Median base quality, comma-separated for reference and alternate allele.
      • MFRL: Median fragment length, comma-separated for reference and alternate allele.
      • MMQ: Median mapping quality, comma-separated for reference and alternate allele.
      • MPOS: Median distance of variant from end of read.
      • OCM: Number of reads whose original alignment does not match the reference.
      • RPA: If tandem repeat, number of times repeated (can be comma-separated for reference and alternate allele).
      • STR: Boolean, indicating that variant is a short tandem repeat.
      • ECNT: Number of events in haplotype.
    • Strelka2²:
      • MQ: Root mean square mapping quality.
      • SNVSB: Strand bias for somatic SNVs.
      • FDP: Number of basecalls filtered from original read depth for tier 1 read counts, for tumor (t_FDP) and normal (n_FDP).
      • SUBDP: Number of reads below tier 1 mapping-quality threshold aligned across site, for tumor (t_SUBDP) and normal (n_SUBDP).
      • RU: If indel, smallest repeating sequence unit in inserted or deleted sequence.
      • IC: If indel, number of times RU is repeated in variant.
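
The alt_bias and ref_bias columns described in the list above can be sketched as follows. This is an illustration, not the pipeline's code: the function and argument names are hypothetical, and treating a variant with zero supporting reads as unbiased is an assumption.

```python
from typing import Tuple

MIN_RAW_DEPTH = 5  # minimum raw depth stated in the column descriptions


def one_strand_only(fwd: int, rev: int) -> bool:
    """True if there is at least one read and all reads are on a single strand."""
    return (fwd + rev) > 0 and (fwd == 0 or rev == 0)


def strand_bias_columns(alt_fwd: int, alt_rev: int,
                        depth_fwd: int, depth_rev: int) -> Tuple[bool, bool]:
    """Return (alt_bias, ref_bias) from raw strand-specific counts."""
    if depth_fwd + depth_rev < MIN_RAW_DEPTH:
        # Below the minimum raw depth, neither flag is set.
        return False, False
    alt_bias = one_strand_only(alt_fwd, alt_rev)      # variant-supporting reads
    ref_bias = one_strand_only(depth_fwd, depth_rev)  # all raw reads
    return alt_bias, ref_bias
```

For example, `strand_bias_columns(4, 0, 10, 12)` sets only alt_bias, because the four variant-supporting reads are all on the forward strand while the total depth is spread over both strands.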

The FILTER column in the unfiltered MAF file contains either PASS or any semicolon-separated combination of the following filter flags:

  • part_of_mnv: The variant is likely part of another called multi-nucleotide variant (MNV).
  • multiallelic2: Multiallelic locus, likely an artifact. Applies to variants called by Strelka2. The 2 is added because a multiallelic flag is already present in the MuTect2 VCFs.
  • strand_bias, variants likely artifactual due to strand bias:
    • For variants called by MuTect2, if all supporting reads come from one strand and there are at least 10 reads on both strands in either the normal or the tumor sample.
    • For variants called by Strelka2, if the total alternate read count is above 10 and all of these reads fall on one strand, or if a low-mapping-quality variant shows bias in both supporting reads and total reads.
  • caller_conflict: Variant was detected by both callers, but did not pass Strelka2's thresholds for somatic variant calling.
  • The following read depth-based flags are parameterized according to the sequencing platform; see the exome.config and genome.config files.
    • low_vaf: Variant falls below lower threshold for tumor variant allele fraction (VAF).
    • low_t_depth: Variant falls below lower threshold for total depth in the tumor.
    • low_t_alt_count: Variant falls below lower threshold for reads supporting variant allele in tumor.
    • low_n_depth: Variant falls below lower threshold for total depth in normal.
    • high_n_alt_count: Variant exceeds upper threshold for reads supporting variant allele in normal.
    • mappability/repeatmasker: Variant falls in blacklisted genomic region.
    • high_gnomad_pop_af: Variant exceeds upper threshold for allele fraction in gnomAD.
    • PoN: Variant exceeds upper threshold for count in panel of normals.
    • low_mapping_quality: For indels called by Strelka2, variant falls below lower mapping quality threshold.
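
How the read depth-based flags in the list above combine into a FILTER value can be sketched as follows. The threshold values, the `Thresholds` class, and the choice of strict inequalities are all hypothetical; the actual values are set in exome.config and genome.config.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Thresholds:
    # Hypothetical values for illustration only, not the pipeline defaults.
    min_vaf: float = 0.05
    min_t_depth: int = 20
    min_t_alt_count: int = 3
    min_n_depth: int = 10
    max_n_alt_count: int = 3


def depth_flags(t_depth: int, t_alt: int, n_depth: int, n_alt: int,
                th: Thresholds) -> List[str]:
    """Collect the read depth-based flags that apply to one variant."""
    flags: List[str] = []
    vaf = t_alt / t_depth if t_depth > 0 else 0.0
    if vaf < th.min_vaf:
        flags.append("low_vaf")
    if t_depth < th.min_t_depth:
        flags.append("low_t_depth")
    if t_alt < th.min_t_alt_count:
        flags.append("low_t_alt_count")
    if n_depth < th.min_n_depth:
        flags.append("low_n_depth")
    if n_alt > th.max_n_alt_count:
        flags.append("high_n_alt_count")
    return flags


def filter_value(flags: List[str]) -> str:
    """Join flags with semicolons, or report PASS when none apply."""
    return ";".join(flags) if flags else "PASS"
```

A well-covered variant, such as `depth_flags(100, 30, 50, 0, Thresholds())`, collects no flags and is reported as PASS.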

¹ See the MuTect2 documentation for more information: https://software.broadinstitute.org/gatk/documentation/article?id=11005
² See the Strelka2 documentation for more information: https://github.com/Illumina/strelka/blob/v2.9.x/docs/userGuide/README.md

# Whitelisting

Mutational hotspots, where the value in Hotspot is TRUE, are retained in the filtered MAF file if they:

  • Are flagged with low_vaf but the tumor VAF is at least 0.02.
  • Are flagged with low_mapping_quality, low_t_depth, or strand_bias.

Note: A variant carrying a combination of the above filter flags is still filtered.
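
The whitelisting rules can be sketched as a single decision function. This reflects one reading of the rules above, namely that a hotspot is rescued only when its FILTER value consists of exactly one of the listed flags; the function and constant names are hypothetical.

```python
RESCUED_FLAGS = {"low_mapping_quality", "low_t_depth", "strand_bias"}
MIN_HOTSPOT_VAF = 0.02  # tumor VAF floor for hotspots flagged low_vaf


def keep_in_filtered_maf(filter_value: str, hotspot: bool, t_vaf: float) -> bool:
    """Decide whether a variant is retained in the filtered MAF."""
    if filter_value == "PASS":
        return True
    if not hotspot:
        return False
    flags = filter_value.split(";")
    if len(flags) != 1:
        # Assumed reading of the note: any combination of flags is filtered.
        return False
    flag = flags[0]
    if flag == "low_vaf":
        return t_vaf >= MIN_HOTSPOT_VAF
    return flag in RESCUED_FLAGS
```

Under this reading, a hotspot flagged only with low_vaf at a tumor VAF of 0.03 is kept, while the same hotspot flagged `low_vaf;strand_bias` is filtered.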

# Clonality and Zygosity Analyses

# Clonality

Clonality of SNVs and indels is estimated based on prior literature using facets-suite. The cancer-cell fraction (CCF) annotation (columns ccf_*) contains these estimates for three presumed copy-number configurations of the mutation:

  1. Inferred CCF if mutation exists in number of copies expected from observed VAF and local ploidy.
  2. Inferred CCF if mutation is on the major allele.
  3. Inferred CCF if mutation exists in one copy.

For each of these, error intervals and probabilities are provided.
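
The three configurations can be illustrated with a simplified point estimate. This is a sketch using a commonly used relationship between VAF, purity, and copy number; it is not facets-suite's implementation, it omits the error intervals and probabilities, and all function names are hypothetical.

```python
from typing import Dict


def avg_copies(purity: float, tcn: int) -> float:
    """Average copies of the locus per cell (tumor cells: tcn, normal cells: 2)."""
    return purity * tcn + 2 * (1 - purity)


def ccf_point_estimate(vaf: float, purity: float, tcn: int, mutant_copies: int) -> float:
    """CCF implied by the observed VAF for a given mutant copy number, capped at 1."""
    if purity <= 0 or mutant_copies < 1:
        raise ValueError("purity must be > 0 and mutant_copies >= 1")
    return min(1.0, vaf * avg_copies(purity, tcn) / (purity * mutant_copies))


def expected_mutant_copies(vaf: float, purity: float, tcn: int) -> int:
    """Mutant copy number suggested by the observed VAF (assumed: rounded, at least 1)."""
    return max(1, round(vaf * avg_copies(purity, tcn) / purity))


def ccf_configurations(vaf: float, purity: float, tcn: int, major_cn: int) -> Dict[str, float]:
    """Point estimates for the three presumed copy-number configurations."""
    m_expected = expected_mutant_copies(vaf, purity, tcn)
    return {
        "expected_copies": ccf_point_estimate(vaf, purity, tcn, m_expected),
        "major_allele": ccf_point_estimate(vaf, purity, tcn, major_cn),
        "one_copy": ccf_point_estimate(vaf, purity, tcn, 1),
    }
```

For a pure tumor with four total copies, three of them on the major allele, and an observed VAF of 0.25, the one-copy configuration gives a CCF of 1, whereas assuming the mutation sits on all three major-allele copies gives roughly one third.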

# Zygosity

Tumor zygosity of SNVs and indels is estimated using the observed VAF and the expected VAF at the observed tumor purity and local copy number.

# Germline SNVs and Indels

Variant-level annotation, filtering, and flagging of variants with further filter flags occur in the GermlineCombineChannel and GermlineAnnotateMaf processes. Variants that pass the filters intrinsic to the callers (FILTER="PASS" in the VCF files) are combined into their union, giving precedence to HaplotypeCaller for any site where both callers detected a variant. See the discussion elsewhere regarding single-sample filtering of HaplotypeCaller variant calls.

Functional effect prediction and MAF file conversion are carried out as described above for somatic calls.

In the final MAF, the columns Strelka2FILTER, gnomAD_FILTER, RepeatMasker, and EncodeDacMapability, as well as the allele frequencies and counts from gnomAD, are identical to those described for somatic variants. Note that gnomAD_FILTER is used for filtering germline variants, unlike for somatic variants. In addition, the following columns are added to the germline MAF:

  • BRCA exchange annotation:
    • brca_exchange_id: Variant ID.
    • brca_exchange_enigma: Annotation from the ENIGMA consortium.
    • brca_exchange_clinvar: Annotation from ClinVar.
  • ch_gene: Boolean indicating whether the gene is associated with the presence of clonal hematopoiesis (CH) (ASXL1, ATM, BCOR, CALR, CBL, CEBPA, CREBBP, DNMT3A, ETV6, EZH2, FLT3, GNAS, IDH1, IDH2, JAK2, KIT, KRAS, MPL, MYD88, NF1, NPM1, NRAS, PPM1D, RAD21, RUNX1, SETD2, SF3B1, SH2B3, SRSF2, STAG2, STAT3, TET2, TP53, U2AF1, WT1, and ZRSR2)
  • The following read depth-based flags are parameterized according to the sequencing platform; see the exome.config and genome.config files.
    • low_n_depth: Variant falls below lower threshold for total depth in normal.
    • low_n_vaf: Variant falls below lower threshold for normal VAF.
    • ch_mutation: Variant occurs in a CH gene, with a normal VAF below the lower threshold and a tumor VAF below 0.25.
    • t_in_n_contamination: Tumor VAF is more than three-fold the normal VAF.
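
The VAF-based germline flags above can be sketched as follows. This is an illustration only: the function name, the `min_n_vaf` parameter, and the strict inequalities are assumptions, and the gene set shown is a small subset of the CH gene list given above.

```python
from typing import List, Set

# Subset of the CH gene list above, for illustration only.
CH_GENES_SUBSET: Set[str] = {"DNMT3A", "TET2", "ASXL1", "TP53"}
MAX_TUMOR_VAF_CH = 0.25    # tumor VAF ceiling stated for ch_mutation
CONTAMINATION_FOLD = 3.0   # fold difference stated for t_in_n_contamination


def germline_vaf_flags(gene: str, t_vaf: float, n_vaf: float, min_n_vaf: float,
                       ch_genes: Set[str] = CH_GENES_SUBSET) -> List[str]:
    """Collect the VAF-based germline flags that apply to one variant.

    min_n_vaf is the platform-specific lower threshold for normal VAF;
    any value passed here is hypothetical.
    """
    flags: List[str] = []
    if n_vaf < min_n_vaf:
        flags.append("low_n_vaf")
        if gene in ch_genes and t_vaf < MAX_TUMOR_VAF_CH:
            flags.append("ch_mutation")
    if t_vaf > CONTAMINATION_FOLD * n_vaf:
        flags.append("t_in_n_contamination")
    return flags
```

With a hypothetical normal VAF threshold of 0.35, a DNMT3A variant at 12% in the normal and 10% in the tumor would be flagged as both low_n_vaf and ch_mutation.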

# Zygosity Analysis

Similar to somatic mutations, tumor zygosity of germline SNVs and indels is estimated using the observed VAF and the expected VAF at the observed tumor purity and local copy number. The difference between the two cases is the calculation of the expected tumor VAF of the variant.
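
The difference in the expected tumor VAF can be illustrated with a common formulation, in which a heterozygous germline variant is also carried on one copy by the admixed normal cells, whereas a somatic mutation is absent from them. Whether the pipeline uses exactly this expression is an assumption, and the function name is hypothetical.

```python
def expected_vaf(purity: float, tcn: int, variant_copies: int,
                 germline: bool = False) -> float:
    """Expected tumor-sample VAF for a variant on `variant_copies` copies in tumor cells.

    purity: tumor cell fraction of the sample
    tcn: total copy number at the locus in tumor cells
    germline: if True, normal cells are assumed to carry one variant copy
    """
    total = purity * tcn + 2 * (1 - purity)  # all copies of the locus per cell
    if total <= 0:
        raise ValueError("no copies of the locus in the sample")
    variant = purity * variant_copies
    if germline:
        variant += (1 - purity) * 1  # heterozygous in normal cells
    return variant / total
```

At 50% purity in a diploid region, a one-copy somatic mutation is expected at a VAF of 0.25 and a heterozygous germline variant at 0.5. Comparing the observed VAF against the values expected for different `variant_copies` then indicates which configuration, for example loss of the wild-type allele, fits best.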

# Somatic and Germline SVs

Under development.