# Reference Files

This page and the associated pages in this section provide details on the provenance and generation of all reference files used in pipeline.nf. Usage of these files is defined in the references configuration file.

Note

All reference files described herein are in assembly GRCh37/hg19 of the human genome.

# Genome Assembly

The reference genome is part of the GATK bundle and is also available here. Tempo uses the human_g1k_v37_decoy assembly of the genome.

# Genomic Intervals

BED files that define the regions of the genome to consider for variant calling are specified in the input files.

# Exome Capture Platforms

For exomes, use the BED file corresponding to the platform used for target capture. Currently, Juno reference files are configured to support:

  • AgilentExon_51MB: SureSelectXT Human All Exon V4 from Agilent.
  • IDT_Exome: xGen Exome Research Panel v1.0 from IDT.
  • IDT_Exome_v2: xGen Exome Research Panel v2.0 from IDT.

Note

Contact us if you are interested in support for other sequencing assays or capture kits.

The bait and target files are provided by the kit manufacturer. These are used to estimate bait- and target-level coverage metrics as well as for variant calling.

We add 5 bp to each end of exons in the target file to make sure splice site mutations can be called:

bedtools slop \
    -g b37.chrom.sizes \
    -i targets.bed \
    -r 5 \
    -l 5 \
    > targets.plus5bp.bed
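
To make the effect of the padding concrete, the sketch below reproduces the same 5 bp extension with awk on a toy input. It only illustrates the interval arithmetic and is not a replacement for bedtools slop. The file names toy.chrom.sizes and toy.targets.bed and all coordinates are made up for this example.

```shell
# Toy inputs (made-up coordinates, for illustration only).
printf '1\t1000\n' > toy.chrom.sizes
printf '1\t3\t50\n1\t990\t998\n' > toy.targets.bed

# Extend each interval by 5 bp on both sides, clamping to [0, chromosome size].
awk -v pad=5 'BEGIN { FS = OFS = "\t" }
    NR == FNR { size[$1] = $2; next }   # first file: chromosome sizes
    {
        start = $2 - pad; if (start < 0) start = 0
        end   = $3 + pad; if (end > size[$1]) end = size[$1]
        $2 = start; $3 = end
        print
    }' toy.chrom.sizes toy.targets.bed > toy.targets.plus5bp.bed

cat toy.targets.plus5bp.bed
```

With these inputs the first interval is clamped at position 0 and the second at the end of the chromosome, which mirrors how bedtools slop treats intervals close to chromosome boundaries.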

# Callable Regions for Genomes

For genomes, a list of "callable" regions from GATK's bundle is used. This is converted from an interval list to a BED file:

gatk IntervalListToBed \
    --INPUT b37_wgs_calling_regions.v1.interval_list \
    --OUTPUT b37_wgs_calling_regions.v1.bed

# Custom target files

Required target files have already been built for Agilent and IDT exome baits and can be used readily with Tempo. Read this section if you have a different target design in mind.

# Required files

  • targets.bed: The targets file can be obtained from the provider of the baitset. If you only have the file in gzipped format, please run zcat <targets>.bed.gz > <targets>.bed. In practice, we alter this file to have 5 bp padding on each region in both directions. Padding can be added using the slop subcommand of bedtools.
  • targets.bed.gz: You also need a gzipped copy of the targets.bed file. Create it with bgzip.
  • targets.bed.gz.tbi: The above file will need to be indexed with tabix.
  • targets.interval_list: The bed file can be used as input to create the interval list. Create it using the BedToIntervalList tool from gatk.
  • baits.interval_list: The baits file can also be obtained from the provider of the baitset, typically as a bed file. Create the interval list using the BedToIntervalList tool from gatk. If a bait bed file is not provided, you can copy or link to targets.interval_list instead.
  • coding.bed: This file is used to calculate tumor mutational burden (TMB). The known coding regions of the reference should first be downloaded (e.g., EnsGene for hg19) and then intersected with the targets file using bedtools. The result should subsequently be sorted and merged with bedtools.
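
The steps in the list above can be sketched as one shell function. This is a sketch under stated assumptions, not the exact commands used to build the bundled files: the function name build_target_files and all input file names are hypothetical, and the interval list is built here from the padded targets, which the description above does not specify.

```shell
# Hypothetical helper: build the six target files in the current directory.
# Arguments (placeholder inputs, not shipped with Tempo):
#   $1 vendor targets BED    $2 vendor baits BED    $3 chromosome sizes file
#   $4 sequence dictionary (.dict) of the reference  $5 BED of coding regions
build_target_files() {
    vendor_targets=$1; vendor_baits=$2; chrom_sizes=$3; seq_dict=$4; coding_regions=$5

    # targets.bed: pad every target by 5 bp on both sides, then sort for tabix.
    bedtools slop -g "$chrom_sizes" -i "$vendor_targets" -l 5 -r 5 \
        | sort -k1,1 -k2,2n > targets.bed || return 1

    # targets.bed.gz and targets.bed.gz.tbi: bgzip copy plus tabix index.
    bgzip -c targets.bed > targets.bed.gz || return 1
    tabix -p bed targets.bed.gz || return 1

    # targets.interval_list and baits.interval_list (assumption: padded targets).
    gatk BedToIntervalList --INPUT targets.bed \
        --SEQUENCE_DICTIONARY "$seq_dict" --OUTPUT targets.interval_list || return 1
    gatk BedToIntervalList --INPUT "$vendor_baits" \
        --SEQUENCE_DICTIONARY "$seq_dict" --OUTPUT baits.interval_list || return 1

    # coding.bed: coding regions restricted to the targets, sorted and merged.
    bedtools intersect -a "$coding_regions" -b targets.bed \
        | bedtools sort -i - \
        | bedtools merge -i - > coding.bed || return 1
}
```

A call might look like `build_target_files vendor_targets.bed vendor_baits.bed b37.chrom.sizes reference.dict coding_regions.bed`, where every file name is a placeholder for your own inputs.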

# Input to Tempo

You should designate a folder just for the required files. This folder is passed as the parameter targets_base, and all of your targets should be placed there. The folder structure should look as follows:

<targets_base folder>
├── <target name 1>
├── <target name 2>
└── <target name 3>

You can have as many targets as you like. Under each folder are the 6 target files previously described. For example:

<targets_base folder>/agilent/
├── baits.interval_list -> ../../baits/AgilentExon_51MB_b37_v3_baits.interval_list
├── coding.bed -> ../../coding_regions/AgilentExon_51MB_b37_v3_baits.coding.sorted.merged.bed
├── targets.bed -> ../../targets/AgilentExon_51MB_b37_v3_targets.plus5bp.bed
├── targets.bed.gz -> ../../targets/AgilentExon_51MB_b37_v3_targets.plus5bp.bed.gz
├── targets.bed.gz.tbi -> ../../targets/AgilentExon_51MB_b37_v3_targets.plus5bp.bed.gz.tbi
└── targets.interval_list -> ../../targets/AgilentExon_51MB_b37_v3_targets.interval_list

In this case, the files are symbolically linked to the originals. Whether soft links or hard links are used, the file names in this folder must exactly match coding.bed, baits.interval_list, targets.interval_list, targets.bed, targets.bed.gz, and targets.bed.gz.tbi.

When running Tempo, use the parameter --targets_base <targets_base folder> so that Nextflow will know where to find your target files. When running with --assayType genome, only the <targets_base>/wgs target folder will be available. Conversely, the <targets_base>/wgs target folder will not be available when --assayType genome is not set.
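
Because the file names must match exactly, it can help to check a target folder before launching a run. The function below is a minimal sketch of such a check. The name check_target_dir and the example paths are hypothetical; only the six file names come from the description above.

```shell
# Hypothetical check: does a target folder contain the six expected file names?
check_target_dir() {
    dir=$1
    missing=0
    for f in baits.interval_list coding.bed targets.bed \
             targets.bed.gz targets.bed.gz.tbi targets.interval_list; do
        # -e follows symbolic links, so a dangling link also counts as missing.
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage with placeholder paths:
#   check_target_dir my_targets_base/my_panel && echo "target folder looks complete"
```

The function returns a non-zero status if any file is absent, so it can be chained in front of the pipeline invocation.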

# RepeatMasker and Mappability Blacklist

BED files with genomic repeat and mappability information are used to annotate the VCFs of somatic and germline SNVs/indels. These data are from RepeatMasker and the ENCODE consortium; the files are retrieved from the UCSC Genome Browser and parsed as follows:

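# Download the RepeatMasker table for hg19
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/rmsk.txt.gz
gunzip rmsk.txt.gz
# Keep chromosome, start, end and repeat class (columns 6-8 and 12),
# retain low-complexity and simple repeats, and drop the "chr" prefix
cut -f6-8,12 rmsk.txt | \
    grep -e "Low_complexity" -e "Simple_repeat" | \
    sed 's/^chr//g' > rmsk_mod.bed
# Compress and index the result
bgzip rmsk_mod.bed
tabix --preset bed rmsk_mod.bed.gz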

# Download the ENCODE mappability consensus excludable regions,
# drop the "chr" prefix, then compress and index
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDacMapabilityConsensusExcludable.bed.gz
gunzip wgEncodeDacMapabilityConsensusExcludable.bed.gz
sed -i 's/^chr//g' wgEncodeDacMapabilityConsensusExcludable.bed
bgzip wgEncodeDacMapabilityConsensusExcludable.bed
tabix --preset bed wgEncodeDacMapabilityConsensusExcludable.bed.gz

# Preferred Transcript Isoforms

The --custom-enst argument to vcf2maf takes a list of preferred gene transcript isoforms onto which mutations are mapped. We supply a consensus list of isoform_overrides_at_mskcc and isoform_overrides_uniprot, generated as follows:

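library(magrittr)  # provides the %>% pipe

t1 = readr::read_tsv('isoform_overrides_at_mskcc')
t2 = readr::read_tsv('isoform_overrides_uniprot')
# Keep UniProt overrides only for genes without an MSKCC override, then append the MSKCC list
t2 %>%
    dplyr::filter(!(gene_name %in% t1$gene_name)) %>%
    dplyr::bind_rows(., t1) %>%
    readr::write_tsv('isoforms')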

# Hotspot Annotation

Three types of mutation hotspots are annotated in the somatic MAF: SNV and indel hotspots in linear space, and SNV hotspots in 3D space. These are annotated with the annotateMaf package.

# OncoKB Annotation

Functional mutation effects, predicted oncogenicity of variants, and levels of clinical actionability are taken from OncoKB and annotated using the OncoKB annotator.

# BRCA Exchange Annotation

Annotation of germline variants in BRCA1 and BRCA2 is carried out with the annotateMaf package. This includes variant-level annotation from the ENIGMA consortium and ClinVar.

# Structural Variant Calling

Delly provides, and takes as an argument, a file of regions to exclude from variant calling. This file excludes telomeres and centromeres of autosomes and allosomes, as well as all other contigs.

For Manta, subtract these regions from a bed file of the whole genome to generate a list of regions to include. First clean up the file provided by Delly, since it is not in bed format:

grep -Ev "chr|MT|GL00|NC|hs37d5" human.hg19.excl.tsv > human.hg19.excl.clean.bed
bedtools subtract -a b37.bed -b human.hg19.excl.clean.bed > b37.minusDellyExclude.bed
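
How the whole-genome file b37.bed is produced is not described above. One plausible way to derive such a file, assuming a two-column chromosome sizes file like the b37.chrom.sizes used earlier, is to emit one interval per contig spanning its full length. The sketch below uses made-up contig lengths and hypothetical file names, and is not necessarily the procedure used for the bundled file.

```shell
# Toy chromosome sizes (made-up lengths); in practice this would be b37.chrom.sizes.
printf '1\t1000\n2\t800\nMT\t16\n' > toy.chrom.sizes

# One BED interval per contig, covering 0..length (BED is 0-based, half-open).
awk 'BEGIN { FS = OFS = "\t" } { print $1, 0, $2 }' toy.chrom.sizes > toy.genome.bed

cat toy.genome.bed
```

Subtracting the cleaned exclusion regions from a file of this shape leaves only the intervals that remain eligible for calling.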