MSK-CHORD: Clinicogenomic, Harmonized Oncologic Real-World Dataset (Nature 2024)

Overview

MSK-CHORD integrates NLP annotations of free-text clinician, radiology and pathology notes with structured medication, demographic, tumor registry and tumor genomic data from 24,950 MSK patients sequenced on MSK-IMPACT, spanning five cancer types. It was built to enable real-world survival prediction and metastasis-site discovery from unstructured EHR data PMID:39506116.

Composition

  • 24,950 MSK patients with MSK-IMPACT tumor sequencing PMID:39506116.
  • Five cancer types: non-small cell lung (NSCLC, n=7,809), breast (BRCA, n=5,368), colorectal (COADREAD, n=5,543), prostate (PRAD, n=3,211), pancreatic (PAAD, n=3,109) PMID:39506116.
  • 705,241 radiology reports annotated by transformer-based NLP models for organ-specific metastasis and cancer progression PMID:39506116.
  • NLP models were trained and validated against MSK-BPC ground truth from Project GENIE BPC, curated under the PRISSMM schema (n=3,202 patients, 38,719 radiology reports) PMID:39506116.
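
As a quick consistency check on the composition figures above, the per-type counts can be tallied against the reported cohort size. This is a minimal sketch that uses only the numbers listed on this page; the variable names are illustrative. The per-type counts sum to 25,040, which is 90 more than the 24,950 patients reported. One possible explanation, which is an assumption and is not stated on this page, is that some patients are counted under more than one cancer type.

```python
# Per-cancer-type counts exactly as listed on this page.
counts = {
    "NSCLC": 7809,
    "BRCA": 5368,
    "COADREAD": 5543,
    "PRAD": 3211,
    "PAAD": 3109,
}
reported_total = 24950  # patients reported for the full cohort

listed_sum = sum(counts.values())
surplus = listed_sum - reported_total

print(f"sum of per-type counts: {listed_sum}")      # 25040
print(f"reported cohort total:  {reported_total}")  # 24950
print(f"surplus:                {surplus}")         # 90

# Share of the reported cohort per cancer type. Because of the surplus,
# these shares add up to slightly more than 100%.
for name, n in counts.items():
    print(f"{name:9s} {n:6d}  {n / reported_total:6.1%}")
```

When working with the released data, counting unique patient identifiers rather than summing per-type rows avoids double counting if that assumption holds.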

Assays / panels (linked)

  • MSK-IMPACT — targeted hybrid-capture panel.
  • nlp-prissmm — NLP annotation pipeline on PRISSMM schema.

Papers using this cohort

  • PMID:39506116 — Jee et al., Automated real-world data integration improves cancer outcome prediction, Nature 2024.

Notable findings derived from this cohort

  • Integrated multimodal models that include NLP-derived sites-of-disease features outperformed genomics-only and stage-only models at predicting overall survival (OS) PMID:39506116.
  • NLP per-site metastasis classifiers achieved AUCs of 0.85–0.99 across bone, liver, lung, lymph node, adrenal, pleura and CNS PMID:39506116.
  • SETD2 driver mutations, found in about 3% of LUAD (204/5,957), predict longer OS, fewer CNS metastases, and longer response to immune checkpoint blockade independent of tumor mutational burden (TMB) PMID:39506116.
  • RB1 oncogenic alterations enriched in brain and liver metastases PMID:39506116.
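
The SETD2 prevalence quoted above can be reproduced from the two counts given on this page. This is a small arithmetic sketch with illustrative variable names; it shows that 204 of 5,957 is closer to 3.4%, which the page rounds to 3%.

```python
# Counts taken from the SETD2 finding listed on this page.
setd2_mutant = 204   # LUAD cases with SETD2 driver mutations
luad_total = 5957    # LUAD cases assessed

fraction = setd2_mutant / luad_total
print(f"SETD2 driver mutation prevalence in LUAD: {fraction:.1%}")  # 3.4%
```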

Sources

  • cBioPortal study msk_chord_2024; released as a public resource PMID:39506116.
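
For programmatic access, study metadata can be requested through the cBioPortal REST API. The sketch below assumes the study is served by the public instance at www.cbioportal.org, which this page does not state; the helper names `study_url` and `fetch_study` are hypothetical and exist only for illustration. The network call is left out of the executed part so the example runs offline.

```python
import json
import urllib.request

# Assumption: the public cBioPortal instance. Other deployments use a
# different base URL.
BASE_URL = "https://www.cbioportal.org/api"
STUDY_ID = "msk_chord_2024"  # study identifier given on this page


def study_url(study_id: str, base_url: str = BASE_URL) -> str:
    """Build the study-metadata URL for a cBioPortal study."""
    return f"{base_url}/studies/{study_id}"


def fetch_study(study_id: str, timeout: float = 30.0) -> dict:
    """Download study metadata as a dict (requires network access)."""
    request = urllib.request.Request(
        study_url(study_id), headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


print(study_url(STUDY_ID))
# To query the live service:
#   meta = fetch_study(STUDY_ID)
#   print(meta.get("name"))
```

Keeping URL construction separate from the download makes it easy to point the same code at a private cBioPortal deployment by passing a different `base_url`.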
