Automated real-world data integration improves cancer outcome prediction

Authors

Justin Jee

Christopher Fong

Karl Pichotta

Thinh Ngoc Tran

Anisha Luthra

Michele Waters

Bob T. Li

Gregory J. Riely

Nikolaus Schultz

DOI

10.1038/s41586-024-08167-5

PMID: 39506116 · DOI: 10.1038/s41586-024-08167-5 · Journal: Nature (2024)

TL;DR

The authors built MSK-CHORD, a clinicogenomic harmonized oncologic real-world dataset for 24,950 Memorial Sloan Kettering patients across non-small-cell lung (n=7,809), breast (n=5,368), colorectal (n=5,543), prostate (n=3,211) and pancreatic (n=3,109) cancers, by combining transformer-based natural-language-processing (NLP) annotations of radiology, pathology and clinician notes with structured medication, demographic, tumour-registry and MSK-IMPACT tumour-sequencing data. The cohort is at least six times larger than the AACR Project GENIE Biopharma Collaborative (BPC) training data and is released through cBioPortal as msk_chord_2024. Random survival forest (RSF) models combining NLP-derived features (sites of disease, prior treatment) with genomic and structured data outperformed stage-only or genomics-only models for predicting overall survival (OS), validated by fivefold cross-validation and on an external non-MSK BPC cohort. Annotating 705,241 radiology reports also surfaced organ-specific genomic predictors of metastasis, including SETD2 driver mutations as a predictor of longer OS and lower CNS-metastasis rate in immunotherapy-treated LUAD, corroborated in independent DFCI and Caris cohorts.

Cohort & data

MSK-CHORD (msk_chord_2024): 24,950 MSK patients with MSK-IMPACT tumour sequencing across NSCLC (n=7,809), BRCA (n=5,368), COAD/READ (n=5,543), PRAD (n=3,211) and PAAD/PAAC (n=3,109). Data snapshot: 9 September 2023.
Training/validation labels (MSK-BPC): 3,202 MSK patients with 38,719 radiology reports curated under the AACR Project GENIE BPC PRISSMM schema. Models tested on external non-MSK BPC cohorts (DFCI, VICC, UHN).
Sequencing assay: MSK-IMPACT, an FDA-authorized targeted DNA panel with matched-blood sequencing to filter germline and clonal-haematopoiesis variants. Variants annotated for oncogenicity using OncoKB. Chromosome-arm gains/losses called from MSK-IMPACT segment log-ratios on GRCh37.
NLP corpus: 705,241 radiology reports plus initial-consult and follow-up clinician notes and histopathology reports.
NLP model library: transformer fine-tuning on MSK-BPC labels — Clinical-Longformer (prior outside treatment, HER2/hormone receptor; radLongformer variant for 6-month-mortality from CT chest/abdomen/pelvis), ClinicalBERT (tumour sites, ten-binary multi-label), RoBERTa (radiographic progression), BERT base uncased (cancer presence), plus rule-based regex extractors for smoking status, Gleason score, PDL1 status and MMR deficiency.
Outcome models: random survival forest (n_trees=1,000, min_splits=10, min_samples_per_leaf=15) on grouped variable classes; OncoCast elastic-net Cox plus random-forest residual correction (500 trees, 5 terminal nodes, 50 runs) for left-truncated survival.
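
The rule-based extractors listed above (smoking status, Gleason score, PDL1 status, MMR deficiency) are described only as regex-based, so the following is a minimal sketch of what a Gleason extractor of that kind might look like. The pattern, the function name `extract_gleason` and the "keep the highest sum" rule are assumptions for illustration, not the authors' implementation.

```python
import re

# Hypothetical pattern (an assumption, not the paper's regex). It matches
# phrases such as "Gleason score 4+3=7", "Gleason 3 + 4" or "Gleason grade 4+5 = 9".
GLEASON_RE = re.compile(
    r"gleason(?:\s+(?:score|grade|sum))?\s*:?\s*(\d)\s*\+\s*(\d)(?:\s*=\s*(\d{1,2}))?",
    re.IGNORECASE,
)


def extract_gleason(text):
    """Return the highest Gleason sum mentioned in `text`, or None if absent."""
    best = None
    for match in GLEASON_RE.finditer(text):
        primary, secondary = int(match.group(1)), int(match.group(2))
        # Gleason patterns run from 1 to 5; skip implausible matches.
        if not (1 <= primary <= 5 and 1 <= secondary <= 5):
            continue
        total = primary + secondary
        # If the report states a sum, require it to agree with the two patterns.
        if match.group(3) is not None and int(match.group(3)) != total:
            continue
        if best is None or total > best:
            best = total
    return best


report = "Prostate adenocarcinoma, Gleason score 4+3=7 in 3 cores; Gleason 3 + 3 = 6 elsewhere."
highest = extract_gleason(report)
```

Keeping the highest sum per document mirrors the "highest Gleason grade" wording quoted later on this page; a production extractor would also need to handle negation, grade groups and report addenda.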

Key findings

NLP accuracy: every NLP model achieved AUC > 0.9 with precision and recall > 0.78 against manually curated MSK-BPC labels; several exceeded 0.95. Discrepancies between NLP predictions and curation labels were frequently due to curation errors rather than model failure (Fig. 1b).
Hold-one-cancer-out: transformer models trained on four cancer types maintained precision/recall on the held-out fifth, suggesting cross-tumour generalizability.
NLP vs billing codes: NLP outperformed ICD billing codes for patient-level metastatic-site annotation, with precision/recall improvements of 0.03–0.32 (Supplementary Table 3).
PDL1 in NSCLC immunotherapy: in MSK-BPC (n=29) the OS hazard ratio for PDL1+ vs PDL1− was 0.58 (95% CI 0.11–1.1, P=0.07); in MSK-CHORD with 754 NSCLC patients on immunotherapy with PDL1 testing, HR=0.64 (95% CI 0.54–0.77, P<0.001) — same magnitude, much higher power (Fig. 2a).
Post-treatment alteration enrichment confirmed in MSK-CHORD vs MSK-BPC: ESR1, CCND1 and NF1 in breast cancer; EGFR T790M and MET amplifications in EGFR-mutant NSCLC; AR and TP53 in prostate cancer; CHEK2, PPM1D and TP53 (clonal-haematopoiesis context) — all enriched in patients with prior systemic therapy as annotated by NLP, and similarly enriched among patients with institutionally administered prior therapy (Fig. 2b).
Gleason–genomics: dose-dependent associations between NLP-derived Gleason score and TP53, PTEN, BRCA2 alterations in prostate cancer were significant in MSK-CHORD (n=3,211) but not in MSK-BPC (n=561) after multiple-hypothesis correction (Fig. 2c).
MSI/dMMR discordance: 10 stage-IV CRC patients with discordant MSI and dMMR status were treated with immunotherapy. Their time on immunotherapy was longer than that of pMMR/MSS patients but shorter than that of concordant dMMR/MSI patients, indicating that both biomarkers carry independent prognostic information (Extended Data Fig. 1).
Multimodal RSF for OS: combined-modality models outperformed any single-modality model in every cancer type. Stage IV NSCLC P=2×10⁻⁷, prostate P=0.0003, CRC P=0.005, pancreas P=0.003, breast P=3×10⁻⁵ vs next-best model (one-sided t-test). c-indices ranged from 0.58 (stage IV pancreas) to 0.83 (stage I–III breast). Tumour sites (NLP-derived) were the most prognostic single modality among stage-IV patients across all five cancers (Fig. 3a,b).
Risk stratification beats stage: in NSCLC and pancreas, the highest-risk-quartile stage I–III patients had worse predicted OS than the lowest-risk-quartile stage IV patients (Fig. 3c, Extended Data Fig. 5).
radLongformer end-to-end on raw CT-CAP free text was prognostic in all five cancer types, and superior to tumour-site features alone only in stage-IV CRC; adding radLongformer scores to the RSF did not improve overall prognostic power (Extended Data Fig. 4b).
Genomic predictors of organ-specific metastasis (Cox, q<0.01, BH-adjusted, controlling for stage, prior treatment, histologic subtype) included: TP53 and CDKN2A — metastasis to all four sites (CNS, bone, liver, lung) in LUAD; RB1 — CNS and liver but not bone or lung in LUAD, hormone-receptor-positive breast and prostate (Fig. 4b). Pathway-level analyses showed TP53-pathway alterations elevated liver but lowered CNS metastasis in pancreas; RTK–RAS pathway in prostate raised bone, lowered liver. Arm-level 1p/1q amplifications and 3p, 11p, 11q, 17p deletions were prognostic for brain metastasis in MSS CRC; 16p/16q deletions in breast were associated with lower CNS and lung metastasis (Extended Data Fig. 7).
SETD2 in LUAD: 204 of 5,957 LUAD patients (3%) harboured SETD2 driver mutations. SETD2 was associated with longer OS, a lower CNS-metastasis rate and longer time-to-next-treatment-or-death after immune checkpoint blockade (HR 0.5, 95% CI 0.36–0.72), but not after chemotherapy (HR 1.2, 95% CI 0.96–1.55) or targeted therapy (HR 0.8, 95% CI 0.58–1.15). SETD2 was positively associated with BRAF and ARID1A alterations and negatively associated with EGFR and MDM2 alterations and mucinous histology, with a small but statistically significant TMB difference (Fig. 5a–d). The immunotherapy benefit held in TMB<10 patients in MSK-CHORD (HR 0.56, 95% CI 0.51–0.62), DFCI (HR 0.87, 95% CI 0.72–1.04) and Caris (HR 0.63, 95% CI 0.61–0.65); random-effects HR 0.67 (95% CI 0.52–0.85), I²=88% (Fig. 5e).
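
To make the pooled SETD2 estimate above concrete, here is a minimal sketch of random-effects pooling of the three cohort hazard ratios. It assumes the DerSimonian-Laird estimator and recovers standard errors from the reported 95% CIs; this page does not say which estimator or interval method the paper used, so the function `pool_hazard_ratios` is illustrative only.

```python
import math


def pool_hazard_ratios(estimates, z=1.96):
    """DerSimonian-Laird random-effects pooling of (hr, ci_low, ci_high) tuples."""
    log_hr = [math.log(hr) for hr, _, _ in estimates]
    # Recover the standard error of log(HR) from the width of the 95% CI.
    se = [(math.log(hi) - math.log(lo)) / (2 * z) for _, lo, hi in estimates]
    w = [1 / s ** 2 for s in se]  # inverse-variance (fixed-effect) weights

    fixed = sum(wi * yi for wi, yi in zip(w, log_hr)) / sum(w)
    # Cochran's Q and the I^2 heterogeneity statistic.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, log_hr))
    df = len(log_hr) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0

    # Between-study variance (method-of-moments estimate).
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)

    w_re = [1 / (s ** 2 + tau2) for s in se]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, log_hr)) / sum(w_re)
    se_pooled = math.sqrt(1 / sum(w_re))
    return {
        "hr": math.exp(pooled),
        "ci": (math.exp(pooled - z * se_pooled), math.exp(pooled + z * se_pooled)),
        "i2": i2,
    }


# (HR, CI low, CI high) for MSK-CHORD, DFCI and Caris as reported above.
cohorts = [(0.56, 0.51, 0.62), (0.87, 0.72, 1.04), (0.63, 0.61, 0.65)]
result = pool_hazard_ratios(cohorts)
```

With these rounded inputs the heterogeneity statistic lands close to the reported 88% and the pooled HR close to the reported 0.67. The confidence interval need not match the published one, because a different between-study variance estimator or interval adjustment would change it.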

Genes & alterations

SETD2 — driver mutation in 3% of LUAD; predictor of longer OS, lower CNS metastasis and longer immunotherapy response, validated in DFCI and Caris cohorts.
TP53 — enriched after prior therapy in PRAD; pathway-level alterations linked to higher liver but lower CNS metastasis in PAAD; gene-level alteration prognostic for metastasis to all four organ sites in LUAD; dose-dependent association with NLP-derived Gleason score.
PTEN, BRCA2 — dose-dependent association with NLP-derived Gleason score in PRAD.
RB1 — alterations associated with CNS and liver (not bone, lung) metastasis in LUAD, hormone-receptor-positive BRCA and PRAD; enriched in tumours sequenced from brain and liver biopsies (Extended Data Fig. 6).
EGFR — T790M and concomitant MET amplifications enriched after prior therapy in EGFR-mutant NSCLC; EGFR mutation associated with better OS in LUAD, but multivariate analysis attributes the survival advantage to receipt of targeted therapy rather than EGFR status itself.
KRAS — classically smoking-associated driver; found at rates similar to EGFR in lung-cancer patients with a smoking history but no smoking mutational signature (Extended Data Fig. 2).
STK11, SMARCA4, KEAP1, CDKN2A — among genes flagged with significant metastasis associations in LUAD pathway/gene-level analyses (Fig. 4b).
ESR1, CCND1, NF1 — enriched after prior therapy in BRCA, confirming the endocrine-resistance signature.
AR, CDK12, FOXA1, NKX3-1 — recurrent post-treatment alterations and Gleason/metastasis associations in PRAD.
CHEK2, PPM1D — clonal-haematopoiesis variants enriched after prior systemic therapy.
BRAF, ARID1A, MDM2 — co-mutation profile of SETD2-altered LUAD: positive with BRAF/ARID1A, negative with EGFR/MDM2.
ERBB2, CD274 (PDL1) — used as treatment-decision biomarkers in NLP-extraction pipelines for breast and NSCLC respectively.
CCNE1, FGFR1, FGF19, MYC, CIC, NKX2-1 — additional genes appearing in volcano plots of metastasis and Gleason-association analyses.

Clinical implications

MSK-CHORD as a public resource: released via cBioPortal at https://www.cbioportal.org/study/summary?id=msk_chord_2024 under CC BY-NC-ND 4.0; intended as a substrate for downstream RWD oncology research and for adding orthogonal modalities (liquid biopsy, lab values, comorbidities) in future iterations.
Multimodal > stage: AJCC stage alone is inferior to multimodal RSF for OS prediction; for stage I–III NSCLC, CRC and pancreatic cancer, NLP-derived tumour-site and treatment features are essential to risk stratification beyond stage and adjuvant-decision frameworks.
PDL1 testing in NSCLC: the magnitude of the OS benefit (HR ≈ 0.6) for PDL1+ vs PDL1− patients on immunotherapy is preserved at scale, supporting current biomarker practice.
MSI/dMMR discordant CRC: patients with one but not both biomarkers still derive immunotherapy benefit relative to pMMR/MSS, although less than concordant dMMR/MSI; the authors recommend offering immunotherapy in these cases.
SETD2 as candidate immunotherapy biomarker in LUAD — uncommon (3%) but predicts longer OS and time-on-immunotherapy independent of TMB (effect held at TMB<10), histologic subtype, PDL1 and smoking status; corroborated in two independent cohorts (DFCI, Caris). The authors flag SETD2 for prospective validation as an immunotherapy biomarker.
Brain-MRI surveillance in stage IV LUAD or hormone-receptor-positive BRCA: RB1 alteration could refine risk-based imaging frequency given its CNS-metastasis association.
NLP for routine RWD: the authors argue that NLP is a feasible low-cost replacement for manual abstraction of treatment, sites of disease and progression, and that it outperforms billing codes for site-of-metastasis identification.
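
The comparison of multimodal and stage-only models above rests on the c-index. As a reminder of what that number measures, here is a minimal sketch of Harrell's concordance index for right-censored data. The function name and the toy inputs are illustrative assumptions, and the paper's own evaluation code may handle ties and censoring differently.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's c-index: share of comparable patient pairs ranked correctly.

    times       -- follow-up time per patient
    events      -- 1 if death was observed, 0 if censored
    risk_scores -- higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is anchored on a patient with an observed death
        for j in range(n):
            # Patient j is comparable if known to have outlived patient i.
            # Pairs with tied times are ignored in this simplified version.
            if times[j] > times[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy inputs for illustration only (not data from the paper).
toy_times = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]
well_ranked = concordance_index(toy_times, toy_events, [0.9, 0.7, 0.4, 0.1])
uninformative = concordance_index(toy_times, toy_events, [0.5, 0.5, 0.5, 0.5])
```

A value of 0.5 corresponds to an uninformative ranking and 1.0 to a perfect one, which is the scale on which the reported range of 0.58 to 0.83 should be read.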

Limitations & open questions

Cohort entry not random. Tumour sequencing biases MSK-CHORD toward advanced or recently progressed patients, threatening generalizability of OS models. The authors mitigate with left truncation and adjustment for progression at cohort entry.
Geographic and demographic skew. Patients are predominantly from the New York/New Jersey catchment; socioeconomic, demographic and geographic confounders are not disentangled.
Missing modalities. Comorbidities, symptoms, liquid-biopsy, lab values, RNA/whole-genome sequencing and single-cell data are not included.
Tumour-marker validation gap. Tumour-marker single-modality models did well in fivefold CV but not in non-MSK BPC external validation in CRC, pancreas and breast, attributed to sparser tumour-marker data in BPC than in MSK-CHORD.
Residual immortal-time bias. Despite left truncation, sequencing date is an imperfect proxy for cohort entry.
NLP model floors. Reproductive-organ tumour-site classification suffered from few positive examples; prior-treatment NLP in NSCLC had precision 0.78 (vs >0.9 elsewhere). HER2 and hormone-receptor classification disagreements often reflected genuinely complex clinical scenarios that resist binary labelling.
EGFR–OS confounding. EGFR-mutant LUAD looks like an OS-favourable subgroup, but the survival benefit is attributable to receipt of targeted therapy, not the mutation itself. This illustrates a general warning about correlated treatment and biology in observational genomics.
SETD2 mechanism is open. The biology underlying SETD2-driven longer immunotherapy response in LUAD (TMB-independent at TMB<10) is not established here.
‘Black-box’ radLongformer not additive. Direct fine-tuning on raw CT-CAP free text was prognostic but did not exceed interpretable feature-engineered models or improve them when added — suggesting the engineered features capture most of the prognostic signal in radiology text.
Prospective validation of the metastatic-tropism associations (RB1, TP53-pathway, RTK–RAS, arm-level events) is required before clinical deployment.
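
The left-truncation adjustment mentioned in this section can be illustrated with a Kaplan-Meier estimator that admits patients to the risk set only after their cohort-entry time (here, time from diagnosis to sequencing). This is a minimal sketch on toy data, assuming a plain product-limit estimator; it is not the OncoCast or RSF procedure used in the paper.

```python
def km_left_truncated(entry, exit_, event):
    """Kaplan-Meier with delayed entry.

    entry -- time from diagnosis to cohort entry (e.g. sequencing)
    exit_ -- time from diagnosis to death or last follow-up
    event -- 1 if death was observed, 0 if censored
    Returns a list of (time, survival) at each observed death time.
    """
    death_times = sorted({t for t, e in zip(exit_, event) if e})
    survival, curve = 1.0, []
    for t in death_times:
        # At risk at t: already entered the cohort and not yet exited.
        at_risk = sum(1 for a, b in zip(entry, exit_) if a < t <= b)
        deaths = sum(1 for b, e in zip(exit_, event) if e and b == t)
        if at_risk > 0:
            survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Toy data for illustration only: two patients are sequenced 6 months after diagnosis.
entry = [0, 0, 6, 6]
exit_ = [4, 10, 8, 12]
event = [1, 1, 1, 0]

adjusted = km_left_truncated(entry, exit_, event)
# Ignoring delayed entry treats late-sequenced patients as at risk from time 0.
naive = km_left_truncated([0, 0, 0, 0], exit_, event)
```

In this toy example the naive curve ends higher than the adjusted one, because patients who had to survive until sequencing are counted as at risk during a period in which they could not have died within the cohort.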

Citations from this paper used in the wiki

“MSK-CHORD includes data for non-small-cell lung (n = 7,809), breast (n = 5,368), colorectal (n = 5,543), prostate (n = 3,211) and pancreatic (n = 3,109) cancers.” (Abstract.)
“All NLP models had an area under the curve (AUC) of >0.9 and precision and recall of >0.78 when treating manually curated labels as ground truth, with several models achieving precision and recall of >0.95 (Fig. 1b and Supplementary Table 1).”
“MSK-CHORD showed a similar magnitude of benefit, but with 754 patients with NSCLC receiving immunotherapy at time of cohort entry with PDL1 testing, statistical power was greater (hazard ratio 0.64, 95% CI 0.54–0.77, P < 0.001; Fig. 2a).”
“ESR1, CCND1 and NF1 mutations in breast cancer, EGFR T790M and MET amplifications in EGFR-mutant NSCLC, AR and TP53 mutations in prostate cancer, and clonal haematopoiesis CHEK2, PPM1D and TP53 mutations were enriched in patients exposed to prior systemic therapy as annotated by NLP (Fig. 2b).”
“In MSK-CHORD, we observed a dose-dependent relationship between NLP-annotated highest Gleason grade and several gene-level alterations including TP53, PTEN and BRCA2 (Fig. 2c).”
“Of 5,957 patients with LUAD, 204 (3%) had SETD2 driver mutations, and these emerged as predictors of longer OS (Extended Data Fig. 3) and lower rates of CNS metastasis (Fig. 4b).”
“SETD2 alterations were positively associated with BRAF and ARID1A alterations and negatively associated with EGFR and MDM2 alterations and mucinous subtype but not otherwise associated with histologic subtype, PDL1 or smoking status (Fig. 5a,b and Extended Data Fig. 9).”
“The association between SETD2 mutation and longer immunotherapy response held among only patients with low TMB (<10 mutations per megabase) and in both validation cohorts (Fig. 5e).”
“MSK-CHORD is available under the Creative Commons BY-NC-ND 4.0 licence and can be accessed on cBioPortal (https://www.cbioportal.org/study/summary?id=msk_chord_2024).” (Data availability.)
“All code to perform the analyses presented here is available at GitHub (https://github.com/clinical-data-mining).” (Code availability.)
