Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines

Authors

Kyle Ellrott

Matthew H. Bailey

Gordon Saksena

Kyle R. Covington

Cyriac Kandoth

Chip Stewart

Julian Hess

Singer Ma

Michael McLellan

Heidi J. Sofia

Carolyn Hutter

Gad Getz

David Wheeler

Li Ding

The MC3 Working Group

The Cancer Genome Atlas Research Network

DOI

PMID: 29596782 · DOI: 10.1016/j.cels.2018.03.002 · Journal: Cell Systems (2018)

TL;DR

Ellrott et al. describe the Multi-Center Mutation Calling in Multiple Cancers (MC3) project, a coordinated re-call of every TCGA tumor exome using an ensemble of seven somatic variant callers (MuTect, VarScan2, Pindel, Indelocator, RADIA, MuSE, SomaticSniper) plus a battery of eight filters. Operating on 10,510 tumor/normal pairs from 33 cancer types (~400 TB of raw BAMs, ~1.8 million core-hours on DNAnexus), the project produced a controlled-access MAF of 22,485,627 variants and an open-access “PASS”-filtered MAF of 3,600,963 variants (~3.5 million high-confidence somatic mutations) that became the canonical mutation substrate for all PanCanAtlas analyses PMID:29596782.

Cohort & data

  • Samples: 11,069 tumor/normal pairs covering 10,486 TCGA participants across 33 cancer types after sample whitelisting; the published MAF analysis covers 10,510 tumor/normal pairs PMID:29596782.
  • Sample selection rules: exclude redacted samples, non-HG19 alignments, non-Illumina sequencing, FFPE samples (97 removed), and samples with ContEst contamination > 4% (12 rules total); prefer Broad-aligned, GATK co-cleaned, native (non-WGA) DNA, with matching genome-build strings for tumor and normal PMID:29596782.
  • Assay: TCGA whole-exome sequencing (whole-exome-seq) on Illumina platforms with multiple capture kits; intersection-of-capture-kit BED used as the bitgt exonic mask. Orthogonal validation via TCGA targeted deep sequencing on 3,128 samples and whole-genome sequencing on 1,059 samples (median 126 validation sites/sample) PMID:29596782.
  • Compute: Variant calling for Pindel/MuSE/RADIA/VarScan/SomaticSniper run on the DNAnexus cloud (~1.8M core-hours over 4 weeks, 400 TB processed, ~500 GB of VCFs produced); MuTect/Indelocator + ContEst + OxoG run on the Broad Firehose; OxoG and validation runs on the ISB Cancer Genomics Cloud and Broad FireCloud; GATK co-cleaning of ~1,600 BAMs at the UCSC NCI cluster. Pipeline distributed as Common Workflow Language (CWL) + Docker at https://github.com/OpenGenomics/mc3 PMID:29596782.
  • Outputs: Three MAFs released — controlled-access (22,485,627 variants: 13,044,511 SNVs + 9,441,116 indels), open-access PASS-only (3,600,963 variants from 10,295 tumors: 3,427,680 SNVs + 173,283 indels), and a validation MAF; data hosted at the NCI GDC (https://gdc.cancer.gov/about-data/publications/mc3-2017) PMID:29596782.
  • Anchor study in cBioPortal: the MC3 mutation calls form the somatic-variant backbone of the PanCanAtlas studies in cBioPortal, e.g. gbm_tcga_pan_can_atlas_2018, coadread_tcga_pan_can_atlas_2018, and cesc_tcga_pan_can_atlas_2018.
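
The SNV/indel breakdown reported for the released MAFs can be reproduced in outline with a few lines of Python. This is a minimal sketch, not the MC3 code: it assumes a tab-separated MAF with the standard Variant_Type, FILTER, and Tumor_Sample_Barcode columns, and the toy rows and sample barcodes are invented for illustration.

```python
import csv
import io
from collections import Counter

# Toy tab-separated MAF excerpt. Rows and sample barcodes are invented.
TOY_MAF = "\n".join([
    "Hugo_Symbol\tTumor_Sample_Barcode\tVariant_Type\tFILTER",
    "VHL\tSAMPLE-A\tSNP\tPASS",
    "PBRM1\tSAMPLE-A\tDEL\tPASS",
    "TTN\tSAMPLE-B\tSNP\tbroad_PoN_v2",
    "TP53\tSAMPLE-B\tSNP\tPASS",
    "MUC16\tSAMPLE-C\tINS\tNonExonic",
])


def classify(variant_type):
    """Collapse MAF Variant_Type values into the classes used in the text."""
    if variant_type == "SNP":
        return "SNV"
    if variant_type in ("INS", "DEL"):
        return "indel"
    return "other"


def summarize_pass(maf_text):
    """Count PASS-only variants by class and the number of distinct tumors."""
    counts = Counter()
    tumors = set()
    for row in csv.DictReader(io.StringIO(maf_text), delimiter="\t"):
        if row["FILTER"] != "PASS":
            continue  # keep only rows with no filter flag
        counts[classify(row["Variant_Type"])] += 1
        tumors.add(row["Tumor_Sample_Barcode"])
    return dict(counts), len(tumors)


summary, n_tumors = summarize_pass(TOY_MAF)
print(summary, n_tumors)

# Consistency check on the totals reported in the text.
assert 13_044_511 + 9_441_116 == 22_485_627
assert 3_427_680 + 173_283 == 3_600_963
```

The two closing assertions confirm that the SNV and indel counts quoted above sum to the stated totals for the controlled-access and open-access MAFs.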

Key findings

  • Ensemble + filters > single caller. Seven callers were applied uniformly across TCGA. Of 22,485,627 putative variants, 2,907,335 high-confidence calls were retained after filtering; the open-access PASS MAF holds 3,600,963 variants from 10,295 tumors PMID:29596782.
  • MC3 vs. PanCan12 legacy MAF. On the PanCan12 sample set, MC3 reported 1,079,216 variants vs. 804,571 in the legacy PanCan12 MAF, sharing 717,326 calls — MC3 recovered 89.5% of the original calls while expanding the call set by 25%. Cancer types with > 90% concordance: HNSC, SKCM, BRCA, BLCA, COAD/READ, UCEC. Outliers: PAAD recovered only 33% of original calls (low tumor purity), and AML (LAML) recovered 44% because the legacy MAF relied on Sanger-recovered calls not in the TCGA exome data PMID:29596782.
  • Filter contributions. ~60% of called variants were dropped by the NonExonic and bitgt (capture-kit) regional filters; ~30% were dropped by the Broad Panel of Normals v2. Eight named filters were used: broad_PoN_v2, Common-in-ExAC (vcf2maf v1.6.11; AC>16 unless ClinVar-pathogenic, AC=16 anchor from SF3B1 K700E in clonal hematopoiesis), OxoG, ContEst, StrandBias (low-VAF G>T from WashU samples), Normal Depth (≥8 reads non-dbSNP / ≥19 dbSNP), Capture Kit (bitgt), and NonExonic PMID:29596782.
  • Filtering matters for SMG discovery. Running MutSig2CV (P < 3.5e-5) and MuSiC2 (P < 1e-7) on KIRC PASS variants yielded 10 SMGs each; the controlled (unfiltered) MAF inflated the lists to 1,203 (MutSig2CV) and 321 (MuSiC2) — the noise made real signal extremely difficult to find PMID:29596782.
  • Caller-specific behavior. MuTect and MuSE detected the largest number of true positives among validated SNVs and showed the highest pair-wise agreement; SomaticSniper made the fewest calls overall but had the lowest false-positive rate. Pindel made the most indel calls, but >130K calls clustered in two samples, indicating sample-specific artifacts. Two-caller intersection effectively eliminates SNV false positives but increases indel false negatives PMID:29596782.
  • Cancer-type variability in calling consistency. THYM, PAAD, KICH, and UVM showed the largest sample-to-sample variability in mutations per sample, attributable to low tumor purity (THYM and PAAD median ABSOLUTE purity 39.0% and 39.7% respectively). SKCM, LUSC, and LUAD had the largest median number of SNVs per sample, consistent with tobacco/UV mutagen exposure PMID:29596782.
  • Liquid-tumor caveat (LAML). Because LAML “normal” skin biopsies often contain blood enriched with tumor cells, conservative MC3 filtering misclassifies somatic calls as germline and recovers only 44% of the LAML AWG calls (which had used Sanger-based manual recovery) PMID:29596782.
  • Capture-kit blind spots. The bitgt filter excludes 170 CDS-altering MC3 calls in MSK-IMPACT’s 410-gene panel that fall outside the Broad capture BED, including TERT promoter hits, truncations in CIC, splice alterations in the frequently rearranged CRLF2, and a cluster of 5’-end events in FOXP1 PMID:29596782.
  • Complex indels. The Pindel-priority merging strategy created 14,241 complex indel sites in the controlled MAF and 3,611 in the open-access MAF — events that combine an indel and a substitution in cis and that other callers cannot represent PMID:29596782.
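
Two of the rules summarized above are simple enough to sketch directly: the Common-in-ExAC cutoff (AC > 16 unless ClinVar-pathogenic) and the two-caller intersection. The sketch below is illustrative only. The record fields (exac_ac, clinvar_pathogenic, callers) and the toy variants are hypothetical, and the real vcf2maf filter works on more detailed ExAC annotations than a single allele count.

```python
# Cutoff taken from the text: AC=16 is anchored on SF3B1 K700E.
EXAC_AC_CUTOFF = 16


def common_in_exac(exac_ac, clinvar_pathogenic):
    """Flag a variant whose ExAC allele count exceeds the cutoff,
    unless it is annotated as ClinVar-pathogenic."""
    return exac_ac > EXAC_AC_CUTOFF and not clinvar_pathogenic


def has_two_caller_support(callers):
    """True when at least two distinct callers reported the variant."""
    return len(set(callers)) >= 2


# Hypothetical toy records; field names are assumptions for this sketch.
variants = [
    {"id": "v1", "exac_ac": 3, "clinvar_pathogenic": False,
     "callers": ["MuTect", "MuSE"]},
    {"id": "v2", "exac_ac": 120, "clinvar_pathogenic": False,
     "callers": ["MuTect", "VarScan2"]},
    {"id": "v3", "exac_ac": 40, "clinvar_pathogenic": True,
     "callers": ["MuTect", "MuSE", "RADIA"]},
    {"id": "v4", "exac_ac": 0, "clinvar_pathogenic": False,
     "callers": ["SomaticSniper"]},
    {"id": "v5", "exac_ac": 16, "clinvar_pathogenic": False,
     "callers": ["MuSE", "RADIA"]},
]

kept = [
    v["id"]
    for v in variants
    if not common_in_exac(v["exac_ac"], v["clinvar_pathogenic"])
    and has_two_caller_support(v["callers"])
]
print(kept)
```

In this toy set, v2 is dropped as common in ExAC, v4 is dropped for single-caller support, v3 survives through the ClinVar exemption, and v5 survives because the cutoff is strict (AC must exceed 16).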

Genes & alterations

  • KIRC SMG overlap between MutSig2CV and MuSiC2 on the open-access MAF: TP53, PTEN, VHL, SETD2, PBRM1, BAP1, MTOR PMID:29596782.
  • KIRC SMGs uniquely called by MutSig2CV: ELOC (paper-era symbol “TCEB1”), PIK3CA, ATM PMID:29596782.
  • KIRC SMGs uniquely called by MuSiC2 after long-gene filtering: ERBB4, SLITRK6, KDM5C PMID:29596782.
  • SF3B1 K700E used as the empirical AC=16 cutoff for the “Common in ExAC” filter — it is the highest ExAC count observed for a known clonal-hematopoiesis somatic event in the non-TCGA subset of ExAC v0.3.1 PMID:29596782.
  • Capture-kit-excluded events of biological interest: TERT promoter mutations (non-coding, outside Broad BED), CIC truncations, CRLF2 splice alterations, FOXP1 5’-end clusters PMID:29596782.
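
The KIRC SMG comparison above is a set comparison, which can be checked with a short sketch. The gene lists are taken from the bullets in this section (with ELOC standing in for the paper-era symbol TCEB1); the Jaccard index is an added illustration and is not a statistic reported by the paper.

```python
# KIRC SMG lists on the open-access MAF, as given in this section.
mutsig2cv = {
    "TP53", "PTEN", "VHL", "SETD2", "PBRM1", "BAP1", "MTOR",
    "ELOC", "PIK3CA", "ATM",
}
music2 = {
    "TP53", "PTEN", "VHL", "SETD2", "PBRM1", "BAP1", "MTOR",
    "ERBB4", "SLITRK6", "KDM5C",
}

shared = mutsig2cv & music2          # genes called by both tools
only_mutsig2cv = mutsig2cv - music2  # unique to MutSig2CV
only_music2 = music2 - mutsig2cv     # unique to MuSiC2
union = mutsig2cv | music2

# Jaccard index: shared genes as a fraction of all genes called by either tool.
jaccard = len(shared) / len(union)

print(sorted(shared))
print(sorted(only_mutsig2cv), sorted(only_music2))
print(round(jaccard, 3))
```

Each tool reports 10 SMGs, 7 are shared, and the union holds 13 genes.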

Clinical implications

  • This is a methods/data-resource paper and makes no direct clinical claims about prognosis or treatment.
  • Resource for downstream PanCanAtlas analyses. The MC3 MAF is the somatic-variant input for cBioPortal PanCanAtlas studies (e.g. gbm_tcga_pan_can_atlas_2018, coadread_tcga_pan_can_atlas_2018, cesc_tcga_pan_can_atlas_2018) and for downstream PanCanAtlas papers including the driver-discovery, immune-landscape, and DNA-damage-repair manuscripts PMID:29596782.
  • Operational guidance for cohort calling. The authors argue that for driver-gene discovery a high-specificity (low FP) caller is preferable; for heterogeneity / sub-clonal analysis, sensitive callers tolerating low-VAF events are preferable. Annotating rather than removing filtered variants is recommended so analysts can pick a stringency appropriate to their question PMID:29596782.
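
The recommendation to annotate rather than remove filtered variants can be illustrated with a small sketch. It assumes a hypothetical in-memory representation in which each variant carries a set of filter flags (an empty set meaning PASS); the Variant class, the select helper, and the toy records are inventions for this example, while the flag names come from the filter list in this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    gene: str
    vaf: float                       # variant allele fraction
    flags: frozenset = frozenset()   # empty set means PASS


def select(variants, tolerated=frozenset()):
    """Keep variants whose flags are all in the tolerated set.
    With the default empty set, only PASS variants are kept."""
    tolerated = frozenset(tolerated)
    return [v for v in variants if v.flags <= tolerated]


# Hypothetical toy call set; every variant is retained with its annotations.
calls = [
    Variant("VHL", 0.41),
    Variant("TP53", 0.06, frozenset({"StrandBias"})),
    Variant("TTN", 0.22, frozenset({"broad_PoN_v2"})),
    Variant("MUC16", 0.03, frozenset({"StrandBias", "NonExonic"})),
]

# High-specificity view, e.g. for driver-gene discovery.
strict = select(calls)

# More sensitive view, e.g. for sub-clonal analysis of low-VAF events.
relaxed = select(calls, tolerated={"StrandBias"})

print([v.gene for v in strict])
print([v.gene for v in relaxed])
```

Because nothing is deleted from the call set, the same data supports both views: the stringency becomes a parameter of the analysis rather than a property of the file.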

Limitations & open questions

  • Validation set is non-random. Targeted-validation sites were selected by individual TCGA AWGs from the most likely SMG candidates, so every validated site was already called by ≥1 method; false-negative rate cannot account for sites missed by all callers PMID:29596782.
  • Validation technology overlap. Most validation used the same sequencing chemistry as discovery, so systematic errors that filters target (especially Panel of Normals) appear as “erroneous filtering events” in the validation comparison PMID:29596782.
  • Liquid tumors need different strategies. AML (LAML) recovered only 44% of legacy calls because tumor-in-normal contamination breaks the assumption that the matched normal is variant-free; the authors flag this as a general open problem for all liquid tumors PMID:29596782.
  • Capture-kit-mask side effects. The bitgt filter discards TERT promoter hits and other clinically meaningful events that fall outside the Broad capture BED — a pancan one-size-fits-all mask trades sensitivity for cohort comparability PMID:29596782.
  • Default parameters are insufficient. The authors call out that achieving the best per-tool performance often required non-default parameters and direct consultation with tool authors (Table S1) — a reproducibility risk for any group running these tools off-the-shelf PMID:29596782.
  • Future direction. Lessons learned should inform a future end-to-end FASTQ-to-filtered-MAF workflow with full containerization in a single cloud and a robust benchmarking effort that scans caller / filter / parameter combinations per tumor type PMID:29596782.

Citations from this paper used in the wiki

  • “The MAF file represents over 20 million variants produced across approximately 10,000 tumor-normal pairs from 33 cancer types using 7 variant callers.” (Introduction)
  • “The controlled-access MAF file contains 22,485,627 variants from 10,510 tumor samples and is comprised of 13,044,511 SNV events and 9,441,116 indels. The open-access MAF file contains 3,600,963 variants from 10,295 tumors with 3,427,680 SNV events and 173,283 indels.” (Effects of Somatic Filtering)
  • “We found that the new MC3 MAF had 1,079,216 variants in the PanCan12 MAF set of samples, while the PanCan12 MAF has 804,571. Among these calls, 717,326 variants are shared between the two sets … the MC3 project captured 89.5% of the original calls while increasing the size of the call set by 25%.” (MC3 Variant Calling Strategy)
  • “Using the stringent p-value cutoff for both tools, MutSig2CV (P-Value < 3.5e-5) and MuSiC2 (P-value < 1e-7) each identified 10 SMGs using ‘PASS’ variants from the open-access MAF. Seven of these genes overlapped between MutSig2CV and MuSiC2 TP53, PTEN, VHL, SETD2, PBRM1, BAP1, MTOR. MutSig2CV uniquely identified TCEB1, PIK3CA, and ATM, and MuSiC2 uniquely identified ERBB4, SLITRK6, and KDM5C after long gene filtering.” (Effects of Somatic Filtering)
  • “The disadvantages of the capture kit based filtering strategy was that 170 CDS altering MC3 calls in MSK IMPACT’s 410 cancer genes, that fall outside the Broad BED. The key misses are TERT promoter hits, truncations in putative tumor-suppressor CIC, splice alterations in the frequently rearranged CRLF2, and a cluster of events in the 5’ end of FOXP1.” (Restricting to Target/Coding Exons)
  • “AC=16 (for SF3B1:K700) was the highest value observed among known somatic events detected in the normal blood of older individuals due to clonal hematopoiesis.” (Common In ExAC filter)
  • “THYM and PAAD samples had the lowest purity estimates (ABSOLUTE (syn7870168) median 39.0% and 39.7% respectively).” (Effects of Cancer Type on Mutation Callers)
  • “In sum, 22,485,627 putative variants were identified and 2,907,335 high confidence mutations were retained after filtering.” (Variant Calling and Filtering Strategies)
