A platform-independent AI tumor lineage and site (ATLAS) classifier

Authors

Rydzewski NR

Shi Y

Li C

Chrostek MR

Bakhtiar H

Helzer KT

Bootsma ML

Berg TJ

Harari PM

Floberg JM

Blitzer GC

Kosoff D

Taylor AK

Sharifi MN

Yu M

Lang JM

Patel KR

Citrin DE

Sundling KE

Zhao SG

DOI

10.1038/s42003-024-05981-5

PMID: 27634761 · DOI: 10.1038/s42003-024-05981-5 · Journal: Communications Biology (2024)

TL;DR

The authors developed ATLAS, a pair of gradient-boosted machine learning (XGBoost) classifiers that independently predict cancer site of origin (22 classes) and cancer lineage (8 classes) from RNA expression data. Trained on 8,249 samples from TCGA and CCLE, ATLAS was validated on 10,376 independent samples (including 1,490 metastatic samples), achieving 91.4% accuracy for site of origin and 97.1% for lineage. High-confidence predictions (score >= 0.99, encompassing the majority of cases) were 98-99% accurate even in metastatic disease. The study also demonstrated emergent zero-shot learning properties, including differentiation of mesothelioma subtypes and identification of neuroendocrine tumors, and showed that a lineage de-differentiation score is prognostic for survival across multiple cancer types.

Cohort & data

Training set: 8,249 samples from TCGA (N=7,196 primary site samples) and CCLE (N=1,053 cell lines).
Validation set: 10,376 samples from 58 TCGA datasets (N=3,556) and 41 non-TCGA datasets (N=6,820), including 8,886 primary tumor samples (97 datasets) and 1,490 metastatic tumor samples (17 datasets).
Secondary analysis (neuroendocrine): 198 additional neuroendocrine samples (NEPC and SCLC) from 8 cohorts, not included in training or primary validation.
Mesothelioma analysis: 88 lung mesothelioma samples from TCGA, excluded from training.
Cancer site of origin classes: 22 (including lung, breast, kidney, CNS, HPB, gastroesophageal, colorectal, head and neck, thyroid, prostate, bladder, ovarian, skin, cervical, soft tissue/bone, myeloid neoplasm, lymphoid neoplasm, PNS, testicular, thymus, adrenal, uterine).
Cancer lineage classes: 8 (adenocarcinoma, squamous cell carcinoma, glioma, lymphoid/myeloid neoplasm, sarcoma, melanoma, neuroepithelial, germ cell tumor).
Molecular features evaluated: RNA expression, DNA mutations, copy number alterations. Final model uses RNA expression only (~500 features for site of origin, ~200 for lineage; 632 total unique features including a binary sex variable).
Data sources: TCGA, CCLE, ICGC, cBioPortal, POG570, WCDT, and others.
Assay: Primarily RNA-seq, plus some microarray data. The model is platform-independent and was applied across multiple normalization schemes, including RSEM, FPKM, RPKM, TPM, CPM, and TMM.

Key findings

Overall accuracy: 91.4% for cancer site of origin (22 classes) and 97.1% for cancer lineage (8 classes) on the full 10,376-sample validation set.
Primary vs. metastatic performance: Site-of-origin accuracy was 92.1% in primary tumors vs. 86.8% in metastatic tumors. Lineage accuracy was 97.4% primary vs. 95.7% metastatic.
High-confidence predictions: When the model prediction score was >= 0.99 (58.5% of site predictions, 75.1% of lineage predictions), accuracy reached 98-99% in both primary and metastatic samples.
RNA expression sufficiency: Gene expression alone matched or outperformed models combining expression with mutation and copy number data, enabling a simpler, platform-independent approach.
Adenocarcinoma vs. SCC distinction: Across three cancer sites with both subtypes (gastroesophageal, lung, cervix), lineage classification accuracy ranged from 89% to 100%.
Mesothelioma subtyping (zero-shot learning): Sarcoma lineage scores differentiated epithelioid from biphasic/sarcomatoid mesothelioma (AUC=0.81) and were prognostic for survival (median 15.0 vs. 23.9 months; log-rank P=0.049).
Neuroendocrine identification (zero-shot learning): Differentiation score distinguished NSCLC from SCLC (AUC=0.963) and metastatic prostate adenocarcinoma from NEPC (AUC=0.834).
De-differentiation prognostic value: Lower differentiation scores were significantly associated with worse overall survival across 8 subgroups (primary melanoma HR 0.0001, P=0.001; adrenal HR 0.002, P=0.006; uterine HR 0.025, P=0.001; HPB HR 0.056, P=0.023; glioma HR 0.26, P=0.0006; lung HR 0.24, P=0.001; breast HR 0.41, P=0.002; metastatic melanoma HR 0.31, P=0.033). All hazard ratios are below 1, consistent with higher differentiation scores tracking with better survival.
Tumor purity impact: Low tumor purity (<50%) reduced accuracy for both models. Once purity-related confounding was removed, the classifier remained robust.
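
Two of the metrics above are easy to misread, so here is a minimal sketch of how they are typically computed. This is not the authors' code. The function names, the toy inputs, and the choice of pure Python are assumptions made for illustration. The first function reports coverage and accuracy for predictions at or above a score threshold (0.99 in the findings above). The second computes ROC AUC as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one.

```python
from typing import Sequence, Tuple


def high_confidence_summary(
    scores: Sequence[float],
    correct: Sequence[bool],
    threshold: float = 0.99,
) -> Tuple[float, float]:
    """Return (coverage, accuracy) for predictions with score >= threshold.

    coverage = share of all predictions that clear the threshold
    accuracy = share of those high-confidence predictions that are correct
    """
    if len(scores) != len(correct):
        raise ValueError("scores and correct must have the same length")
    kept = [ok for s, ok in zip(scores, correct) if s >= threshold]
    if not kept:
        raise ValueError("no predictions at or above the threshold")
    coverage = len(kept) / len(scores)
    accuracy = sum(kept) / len(kept)
    return coverage, accuracy


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC AUC via pairwise comparison (ties count as half a win).

    labels: 1 for the positive class, 0 for the negative class.
    Quadratic in sample count, which is fine for a sketch.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
coverage, accuracy = high_confidence_summary(
    [0.999, 0.995, 0.99, 0.80, 0.60],
    [True, True, False, True, False],
)
print(f"coverage={coverage:.3f} accuracy={accuracy:.3f}")
```

If lower scores mark the class treated as positive (for example, a lower differentiation score in a de-differentiated tumor), negate the scores before calling roc_auc so the AUC keeps its usual orientation.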

Genes & alterations

This study does not focus on individual gene alterations. The ATLAS classifier uses 632 unique features (RNA expression of genes selected by variable importance ranking, plus a binary sex variable), but the paper does not highlight specific genes as oncogenic drivers or therapeutic targets.
Supplementary Data 1 reports the top 10 features per class and confirms that sex was a top feature only for breast, ovarian, and cervical cancer classification.

Clinical implications

ATLAS can complement traditional histopathologic assessment, particularly for cancers of unknown primary (CUP), by providing objective and quantitative site-of-origin and lineage predictions from existing RNA-seq data.
The platform-independent, single-sample approach allows integration into any existing clinical RNA-seq workflow without batch correction.
High-confidence predictions (score >= 0.99) were 98-99% accurate even in metastatic samples, which could support clinical decision-making.
The lineage de-differentiation score identifies neuroendocrine transformation and more anaplastic phenotypes, which are associated with worse prognosis and may guide treatment escalation.
Divergence between molecular classifier predictions and initial pathologic review could trigger re-review or additional staining.
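
The re-review idea above can be sketched as a simple triage rule. This is a hypothetical illustration, not a workflow described in the paper. The Prediction type, the triage function, and the three category names are all assumptions, and the 0.99 threshold is borrowed from the high-confidence cutoff reported in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    label: str    # predicted site of origin or lineage
    score: float  # model prediction score in [0, 1]


def triage(pred: Prediction, pathology_label: str, threshold: float = 0.99) -> str:
    """Compare a classifier call with the pathology call (hypothetical rule).

    Returns one of three illustrative categories:
      "concordant"                  classifier and pathology agree
      "discordant_high_confidence"  disagreement with score >= threshold,
                                    a candidate for re-review or extra staining
      "discordant_low_confidence"   disagreement with score < threshold
    """
    if pred.label == pathology_label:
        return "concordant"
    if pred.score >= threshold:
        return "discordant_high_confidence"
    return "discordant_low_confidence"


# Toy example with invented values.
print(triage(Prediction(label="lung", score=0.995), pathology_label="breast"))
```

Routing only the high-confidence discordant cases to re-review is one possible design. A real deployment would need a threshold set through prospective validation, which the study does not address.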

Limitations & open questions

The model was trained predominantly on primary tumors from TCGA; accuracy is lower for metastatic samples (86.8% vs. 92.1% for site of origin), reflecting the challenge of tumor evolution and de-differentiation.
GI tumors (especially HPB and colorectal) and gynecologic tumors (ovarian) had worse classification accuracy, with frequent misclassification among related sites.
Rare tumor types are underrepresented in training data and may benefit from future dataset expansion.
The neuroendocrine and mesothelioma analyses are secondary (zero-shot) and based on relatively small sample sizes (N=198 and N=88, respectively).
The study does not address prospective clinical validation or integration into clinical trial workflows.
The paper does not evaluate performance on liquid biopsy or cell-free RNA samples.
How the differentiation score interacts with specific treatment responses (e.g., immunotherapy, targeted therapy) remains unexplored.

Citations from this paper used in the wiki

“Utilizing gradient-boosted machine learning, we developed ATLAS, a pair of separate AI Tumor Lineage and Site-of-origin models from RNA expression data on 8249 tumor samples. We assessed performance independently in 10,376 total tumor samples, including 1490 metastatic samples, achieving an accuracy of 91.4% for cancer site-of-origin and 97.1% for cancer lineage.” (Abstract)
“High confidence predictions (encompassing the majority of cases) were accurate 98-99% of the time in both localized and remarkably even in metastatic samples.” (Abstract)
“We also identified emergent properties of our lineage scores for tumor types on which the model was never trained (zero-shot learning). Adenocarcinoma/sarcoma lineage scores differentiated epithelioid from biphasic/sarcomatoid mesothelioma.” (Abstract)
“The differentiation score produced a high ROC AUC for differentiating non-small cell lung cancer (N=606; NSCLC) from small cell lung cancer (N=137; SCLC; AUC 0.963) and for differentiating metastatic prostate adenocarcinoma (N=721) from neuroendocrine prostate cancer (N=61; NEPC; AUC 0.834).” (Results)
