Multi-institutional Prognostic Modeling in Head and Neck Cancer: Evaluating Impact and Generalizability of Deep Learning Approaches

Authors

Michal Kazmierski

Mattea Welch

Sejin Kim

Chris McIntosh

Katrina Rey-McIntyre

Shao Hui Huang

Tirth Patel

Tony Tadic

Michael Milosevic

Fei-Fei Liu

Adam Ryczkowski

Joanna Kazmierska

Zezhong Ye

Deborah Plana

Hugo JWL Aerts

Benjamin H Kann

Scott V Bratman

Andrew J Hope

Benjamin Haibe-Kains

DOI

10.1158/2767-9764.CRC-22-0152

PMID: 37397861 · DOI: 10.1158/2767-9764.CRC-22-0152 · Journal: Cancer Research Communications (2023)

TL;DR

Kazmierski et al. ran an intra-institutional crowdsourced challenge at the University Health Network using the RADCURE dataset (n=2,552 patients with HNSC treated with definitive radiotherapy at Princess Margaret Cancer Centre, 2005–2017) to benchmark 12 prognostic machine-learning models for 2-year and lifetime overall survival from pretreatment CT imaging and electronic medical record (EMR) data. The top model used multitask logistic regression (MTLR) over widely used EMR features (age, sex, T/N/overall stage, disease site, performance status, HPV status, dose, systemic therapy) plus primary tumor volume, reaching AUROC = 0.823 / AP = 0.505 / C-index = 0.801 on an internal test set of 750 patients and outperforming all deep-learning CT radiomics submissions. External validation on three independent cohorts (HN1 n=137 MAASTRO, MDACC n=627 oropharynx, GPCCHN n=298 Greater Poland; 873 patients total) showed that the EMR+volume models retained their rank in most datasets but suffered performance drops attributable to distribution shift in HPV status, disease site and outcome prevalence. On the MDACC subset, only the top model beat tumor volume alone, and only modestly PMID:37397861.

Cohort & data

Training / internal test cohort: 2,552 patients with HNSC treated with definitive radiotherapy or chemoradiotherapy at Princess Margaret (PM) Cancer Centre (2005–2017); split by date of diagnosis into training (n=1,802, diagnosed 2005–2013) and held-out independent test (n=750, diagnosed 2016–2018). Inclusion required a planning CT and target contours, ≥2 years follow-up (or death before that), no distant metastases at diagnosis, and no prior surgery. Released as the RADCURE dataset on The Cancer Imaging Archive (https://doi.org/10.7937/J47W-NM11) PMID:37397861.
External validation cohorts (873 patients total):
- HN1 (tcia-hn1-maastro) — n=137 oropharynx + larynx tumors treated with chemoradiotherapy at MAASTRO Clinic, Maastricht; TCIA public collection, validated in-house PMID:37397861.
- MDACC (tcia-hnscc) — n=627 oropharynx cancer patients treated at MD Anderson Cancer Center (curated subset of the public TCIA HNSCC collection); validated externally by Dana-Farber Cancer Institute collaborators using the pretrained models PMID:37397861.
- GPCCHN (gpcchn-poznan) — private dataset of 298 patients treated at Greater Poland Cancer Centre, Poznań; validated in-house at Greater Poland Cancer Centre PMID:37397861.
Inputs: pretreatment contrast-enhanced CT imaging + primary gross tumor volume (GTV) binary masks in NRRD format (from radiation-oncologist DICOM RT-STRUCT contours); EMR variables (age at diagnosis, sex, T/N/overall stage, disease site, performance status, HPV status, radiation dose in Gy, systemic-therapy use); outcome data (time to death or censoring, event indicator) PMID:37397861.
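The primary tumor volume used by the top models can be derived directly from the GTV binary mask. The following is a minimal sketch, assuming the mask has already been loaded into a nested list indexed as [slice][row][column] and that the voxel spacing is known in millimetres; the function name and the toy values are hypothetical, and reading the NRRD file itself is not shown.

```python
def tumor_volume_cm3(mask, spacing_mm):
    """Volume of a binary 3D mask: foreground voxel count times single-voxel volume."""
    voxel_mm3 = spacing_mm[0] * spacing_mm[1] * spacing_mm[2]
    n_voxels = sum(1 for plane in mask for row in plane for v in row if v)
    return n_voxels * voxel_mm3 / 1000.0  # 1 cm^3 = 1000 mm^3


# Toy 2 x 3 x 3 mask with five foreground voxels (hypothetical values).
toy_mask = [
    [[0, 1, 0], [1, 1, 0], [0, 0, 0]],
    [[0, 1, 0], [0, 1, 0], [0, 0, 0]],
]
toy_volume = tumor_volume_cm3(toy_mask, spacing_mm=(3.0, 1.0, 1.0))
```

With real data the spacing would come from the image header rather than being passed by hand.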
Task: primary endpoint binarized 2-year overall survival (OS); secondary endpoints lifetime risk of death and full survival curve in 1-month intervals 0–23 months PMID:37397861.
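The binarized 2-year endpoint can be derived from the time-to-event fields as sketched below. This is an illustrative helper, not the challenge code: the function name, the 24-month horizon expressed in months, and the choice to count a death at exactly 24 months as an event are assumptions.

```python
from typing import Optional


def two_year_os_label(time_months: float, event: bool,
                      horizon: float = 24.0) -> Optional[int]:
    """Return 1 for death within the horizon, 0 if known alive at the horizon."""
    if event and time_months <= horizon:
        return 1      # death observed within the horizon
    if time_months >= horizon:
        return 0      # followed past the horizon (later death or censoring)
    return None       # censored before the horizon: status at 2 years unknown


labels = [
    two_year_os_label(10.0, True),    # died at 10 months
    two_year_os_label(30.0, True),    # died at 30 months, alive at 24
    two_year_os_label(40.0, False),   # censored at 40 months
    two_year_os_label(12.0, False),   # censored at 12 months
]
```

The inclusion criterion of at least 2 years of follow-up (or death before that) means the `None` case should not arise in the training cohort; it is kept here so the helper fails visibly on data that does not meet that criterion.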
Evaluation: AUROC as primary ranking metric, average precision (AP) as tie-breaker, concordance (C-) index for lifetime risk; confidence intervals from 10,000 stratified bootstrap replicates; one-sided t-test vs. best model with FDR ≤ 5% for multiple comparisons PMID:37397861.
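The ranking metrics and the bootstrap interval can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: AUROC is computed from pairwise comparisons, the C-index follows Harrell's definition with tied risks counted as one half, and the interval is a percentile interval from resampling positives and negatives separately. The function names and toy data are hypothetical, and average precision is omitted for brevity.

```python
import random


def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties = 1/2)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def c_index(times, events, risks):
    """Harrell's C: share of comparable pairs where the earlier death has higher risk."""
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is a death
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                concordant += (risks[i] > risks[j]) + 0.5 * (risks[i] == risks[j])
    return concordant / comparable


def stratified_bootstrap_ci(y_true, scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile CI for AUROC, resampling within each class to keep prevalence fixed."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(y_true) if y == 1]
    neg = [i for i, y in enumerate(y_true) if y == 0]
    stats = []
    for _ in range(n_boot):
        idx = rng.choices(pos, k=len(pos)) + rng.choices(neg, k=len(neg))
        stats.append(auroc([y_true[i] for i in idx], [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Hypothetical toy data.
y = [0, 0, 1, 1, 0, 1]
s = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
toy_auroc = auroc(y, s)
toy_ci = stratified_bootstrap_ci(y, s)
toy_c = c_index([5, 10, 15, 20], [1, 1, 0, 0], [0.9, 0.6, 0.7, 0.1])
```

Stratifying the resampling keeps the number of events constant in every replicate, so the metric is always defined even when events are rare.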

Key findings

Top model did not rely on image-based deep learning. The challenge winner (Model 1) was a deep multitask logistic regression (MTLR) fit to EMR features and primary tumor volume; a single-layer neural network with ELU activations added non-linear interactions before the MTLR head. It achieved AUROC = 0.823, AP = 0.505, C-index = 0.801 on the internal test set, significantly better (FDR < 5%) than every other model except the second-best PMID:37397861.
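How an MTLR head turns per-interval scores into a survival curve can be sketched as below. This follows the standard MTLR formulation (a softmax over monotone death sequences) rather than the authors' code; the function name is hypothetical, and the scores would in practice come from the network applied to a patient's EMR features and tumor volume.

```python
import math


def mtlr_survival_curve(interval_scores):
    """Map m per-interval scores to event-interval probabilities and a survival curve."""
    m = len(interval_scores)
    # The score of "death in interval k" is the sum of scores from k onward;
    # the extra final entry (score 0) stands for surviving past the last interval.
    seq_scores = [sum(interval_scores[k:]) for k in range(m)] + [0.0]
    peak = max(seq_scores)  # subtract the max for numerical stability
    weights = [math.exp(v - peak) for v in seq_scores]
    total = sum(weights)
    density = [w / total for w in weights]
    # S(t_k): probability that death falls after interval k.
    survival = [sum(density[k + 1:]) for k in range(m)]
    return density, survival


# With all scores equal to zero every outcome is equally likely.
toy_density, toy_survival = mtlr_survival_curve([0.0, 0.0, 0.0])
```

With 24 one-month intervals, the predicted 2-year event probability is one minus the last entry of the survival curve, which is how a single model can serve both the binary and the survival-curve endpoints.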
Deep learning beat engineered radiomics but not simple EMR models. Among radiomics-only models, convolutional-neural-network approaches outperformed hand-crafted CT-radiomics features extracted with PyRadiomics v2.2.0 (1,316 features reduced via max-relevance/min-redundancy selection). However, no radiomics-only model outperformed any EMR-only model; only one convnet beat the baseline-clinical model PMID:37397861.
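The max-relevance/min-redundancy idea can be illustrated with a greedy selector. Note that this sketch substitutes absolute Pearson correlation for the mutual-information criterion of the original mRMR method, so it is a simplified variant and not necessarily what the submission used; feature names and values are hypothetical, and the PyRadiomics extraction step is not shown.

```python
import math


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def mrmr_select(features, target, k):
    """Greedily pick features maximizing relevance minus mean redundancy."""
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    selected = []

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(abs(pearson(features[name], features[s]))
                         for s in selected) / len(selected)
        return relevance[name] - redundancy

    while len(selected) < min(k, len(features)):
        remaining = [name for name in features if name not in selected]
        selected.append(max(remaining, key=score))
    return selected


# Hypothetical toy features: two near-duplicate volume measures and one texture feature.
toy_features = {
    "volume": [1, 2, 5, 6, 7, 3],
    "volume_noisy": [2, 2, 5, 5, 6, 4],
    "texture": [3, 1, 4, 2, 5, 3],
}
toy_target = [0, 0, 1, 1, 1, 0]
toy_selected = mrmr_select(toy_features, toy_target, k=2)
```

In the toy data the second pick is the texture feature, even though the noisy volume copy is more relevant on its own, because the copy is almost entirely redundant with the feature already selected.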
Best radiomics model learned volume-independent features. Spearman rank correlation of predictions with tumor volume was ρ = 0.79 for baseline-radiomics and ρ = 0.85 for the hand-crafted-feature submission, but only ρ = 0.22 for the best radiomics-only model (Model 9, a 3D convnet with dense connectivity and a two-stream context window; AUROC = 0.77), suggesting it encoded genuinely volume-independent image features PMID:37397861.
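The volume-dependence check amounts to a Spearman rank correlation between each model's predictions and tumor volume. A minimal sketch, assuming average ranks for ties; the function names and the toy values are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


# Hypothetical tumor volumes (cm^3) and predicted 2-year event probabilities.
toy_volumes = [10, 20, 30, 40, 50]
toy_preds = [0.1, 0.3, 0.2, 0.6, 0.9]
toy_rho = spearman_rho(toy_volumes, toy_preds)
```

A value near 1 indicates that a model's ranking of patients largely mirrors tumor size, while a low value such as the one reported for Model 9 suggests the predictions draw on other image information.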
Adding deep radiomics to EMR did not help. Replacing tumor volume in the winning MTLR model with deep image representations from a 3D convnet dropped AUROC from 0.823 → 0.766; using EMR features only (no image data) dropped AUROC to 0.798 and AP to 0.429 PMID:37397861.
Risk stratification was strong. Stratifying test-set patients at the 0.5 predicted 2-year event probability produced significant OS separation: HR = 8.64 (top combined model), HR = 5.96 (best EMR-only), HR = 4.50 (best radiomics-only), all p < 10⁻¹⁸ PMID:37397861.
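The stratification step can be illustrated by splitting patients at the 0.5 threshold and comparing Kaplan-Meier survival estimates between groups. This sketch does not compute the hazard ratios reported above, which would require fitting a Cox model; treating a probability of exactly 0.5 as high risk is an assumption, and all names and values are hypothetical.

```python
def kaplan_meier(times, events, horizon):
    """Kaplan-Meier estimate of the probability of surviving past `horizon`."""
    surv = 1.0
    death_times = sorted({t for t, e in zip(times, events) if e and t <= horizon})
    for t in death_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if e and u == t)
        surv *= 1 - deaths / at_risk
    return surv


def km_by_risk_group(times, events, pred_probs, horizon=24.0, threshold=0.5):
    """Split patients at the threshold and estimate survival in each group."""
    result = {}
    for label in ("low", "high"):
        idx = [i for i, p in enumerate(pred_probs)
               if (p >= threshold) == (label == "high")]
        result[label] = kaplan_meier([times[i] for i in idx],
                                     [events[i] for i in idx], horizon)
    return result


# Hypothetical follow-up in months, death indicators, and predicted probabilities.
toy_times = [6, 12, 30, 40, 18, 36]
toy_events = [1, 1, 0, 0, 1, 0]
toy_probs = [0.8, 0.7, 0.6, 0.1, 0.2, 0.3]
toy_groups = km_by_risk_group(toy_times, toy_events, toy_probs)
```

A large gap between the two estimates at 24 months corresponds qualitatively to the separation that the hazard ratios quantify.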
External generalizability was limited. Performance dropped in 2 of 3 external cohorts for the top models (still significantly better than random, p < 0.0001). Distribution shifts in disease site, HPV status and target outcome prevalence were statistically significant (pairwise χ² with FDR ≤ 5%); the GPCCHN cohort in particular had disproportionately more HPV-negative patients than RADCURE. On the MDACC subset, only the top model beat tumor volume alone, and by a small margin PMID:37397861.
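A pairwise distribution-shift check of this kind can be sketched as a chi-square test on a 2x2 table followed by a false-discovery-rate correction. The counts below are hypothetical, the test is shown without continuity correction, and the Benjamini-Hochberg procedure is an assumption, since the summary states only the FDR threshold and not the correction method.

```python
import math


def chi2_2x2(table):
    """Pearson chi-square for a 2x2 table; returns (statistic, p-value)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    expected = [
        [(a + b) * (a + c) / n, (a + b) * (b + d) / n],
        [(c + d) * (a + c) / n, (c + d) * (b + d) / n],
    ]
    stat = sum((obs - exp_cnt) ** 2 / exp_cnt
               for row_obs, row_exp in zip(table, expected)
               for obs, exp_cnt in zip(row_obs, row_exp))
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-square tail for 1 degree of freedom
    return stat, p_value


def benjamini_hochberg(p_values, fdr=0.05):
    """Return one reject flag per p-value under the Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= fdr * rank / m:
            cutoff = rank  # largest rank whose p-value passes its threshold
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= cutoff
    return reject


# Hypothetical HPV-positive / HPV-negative counts in two cohorts.
toy_stat, toy_p = chi2_2x2([[60, 40], [30, 70]])
toy_reject = benjamini_hochberg([0.001, 0.04, 0.03, 0.20])
```

Variables with more than two categories, such as disease site, would need the general chi-square distribution with more degrees of freedom, which this sketch does not cover.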
Ranking mostly preserved across sites. The overall model ranking was fairly consistent across external cohorts. The top model (Model 1, MTLR + EMR + volume) retained its winning rank in 2 of 3 external datasets; on GPCCHN it was outperformed by the simpler linear EMR+volume model (Model 2). On HN1, engineered radiomics + EMR (Model 3) and deep radiomics + EMR (Model 5) did better than on the internal test set but still did not beat Model 1 PMID:37397861.
Agreement across metrics. Pearson r = 0.88 between AUROC and AP; r = 0.82 between AUROC and C-index PMID:37397861.

Genes & alterations

None. This study is a clinical–imaging prognostic-modeling benchmark and does not analyze gene-level alterations. HPV status is used as a clinical EMR feature (categorical: positive / negative / not tested), not as a molecular endpoint PMID:37397861.

Clinical implications

For HNC prognostic modeling, carefully tuned EMR-only models (stage, HPV, disease site, performance status, dose) plus primary tumor volume appear to be a hard-to-beat baseline. Adding engineered or deep-learning CT radiomics did not improve 2-year OS prediction on the PM training distribution, echoing earlier negative findings by Ger and colleagues and Vallières and colleagues (both cited) PMID:37397861.
The top MTLR model yields a full predicted survival curve per patient (not just a 2-year probability), making it attractive as a clinical risk-stratification and monitoring tool — but its out-of-distribution performance dropped enough that the authors caution against deploying any such model without extensive, population-specific validation PMID:37397861.
Clinical trials that enroll using AI/ML predictors or attempt to translate models trained on trial populations into routine practice should explicitly characterize distribution shift (HPV status, disease-site mix, outcome prevalence) before assuming generalization PMID:37397861.

Limitations & open questions

Single training institution. All 2,552 training patients came from PM Cancer Centre; the authors flag limited geographic and institutional diversity despite external validation PMID:37397861.
Single-institution crowdsourcing. The 12 submitted models were developed by 4 teams, all from within the University Health Network, restricting the space of modeling strategies explored PMID:37397861.
One radiomics toolkit. Hand-engineered radiomics used only PyRadiomics; other toolkits could in principle produce different features, though the image-biomarker standardization effort cited suggests major implementations are now largely consistent PMID:37397861.
Suboptimal fusion of modalities. All deep-learning submissions concatenated EMR features with image embeddings at the final classification layer, which may explain why radiomics and EMR did not appear complementary; the authors flag joint latent-space approaches as unexplored PMID:37397861.
No ensembling explored. Bayesian model averaging or stacking ensembles might have improved on the best single model PMID:37397861.
Smoking status omitted. Smoking status was not consistently available across cohorts and was shown in supplementary analyses not to significantly improve predictive performance; the authors note it is a plausible HNC prognostic variable that could be reinvestigated in future work PMID:37397861.
Limited to CT imaging and OS. No PET or MRI, no recurrence / distant metastasis / treatment-toxicity endpoints, no molecular or pathology data. The authors flag these as future directions PMID:37397861.

Citations from this paper used in the wiki

“Using a retrospective dataset of 2,552 patients from a single institution and a strict evaluation framework that included external validation on three external patient cohorts (873 patients), we crowdsourced the development of ML models to predict overall survival in head and neck cancer (HNC)…” — Abstract.
“The model with the highest accuracy used multitask learning on clinical data and tumor volume, achieving high prognostic accuracy for 2-year and lifetime survival prediction, outperforming models relying on clinical data only, engineered radiomics, or complex deep neural network architecture.” — Abstract.
“MTLR is able to exploit time-to-event information by fitting a sequence of dependent logistic regression models to each interval on a discretized time axis, effectively learning to predict a complete survival curve for each patient in multitask fashion.” — Results, Top Performing Model.
“…when compared with the second-best model (which relies on a PH model) it achieves superior performance for lifetime risk prediction (C = 0.801 vs. 0.746). The added flexibility and information-sharing capacity of multitasking also enables MTLR to outperform other models on the binary task (AUROC = 0.823, AP = 0.505)…” — Results, Top Performing Model.
“The GPCCHN dataset is a private dataset of 298 patients treated at Greater Poland Cancer Centre in Poznan, Poland.” — Materials and Methods, Independent Validation.
“Interestingly, the best radiomics submission (number 9) achieves the lowest volume correlation, suggesting that it might be using volume-independent imaging characteristics.” — Results, Impact of Volume Dependence.
“In particular, the GPCCHN dataset has a disproportionately high number of HPV− patients compared to the training dataset.” — Discussion.
