A radiogenomic dataset of non-small cell lung cancer

Authors

Shaimaa Bakr

Olivier Gevaert

Sebastian Echegaray

Kelsey Ayers

Mu Zhou

Majid Shafiq

Hong Zheng

Jalen Anthony Benson

Weiruo Zhang

Ann N C Leung

Michael Kadoch

Chuong D Hoang

Joseph Shrager

Andrew Quon

Daniel L Rubin

Sylvia K Plevritis

Sandy Napel

DOI

PMID: 30325352 · DOI: 10.1038/sdata.2018.202 · Journal: Scientific Data (2018)

TL;DR

Bakr et al. release a public radiogenomic resource that pairs preoperative CT and 18F-FDG PET/CT imaging with matched molecular and clinical data for 211 surgically treated non-small cell lung cancer (NSCLC) patients accrued at Stanford University Medical Center and the Palo Alto Veterans Affairs Healthcare System between 2008 and 2012. The cohort combines a primary “R01” arm (n=162) and a retrospective “AMC” arm (n=49) selected for available EGFR/KRAS/ALK mutational status. Imaging is accompanied by tumor segmentations, controlled-vocabulary semantic annotations, RNA-seq (n=130), gene-expression microarrays (n=26), single-gene mutation calls (EGFR/KRAS/ALK), and survival/recurrence outcomes. The dataset is hosted on The Cancer Imaging Archive (TCIA) and NCBI GEO and is intended to enable discovery of links between radiomic phenotypes and tumor molecular biology in NSCLC.

Cohort & data

  • 211 NSCLC subjects total: 162 in the R01 cohort (38F/124M, mean age 68, range 42–86) and 49 in the AMC cohort (33F/16M, mean age 67, range 24–80) (PMID:30325352).
  • Histology: 172 adenocarcinomas (LUAD), 35 squamous cell carcinomas (LUSC), 4 not otherwise specified; cohort-level cancer type NSCLC.
  • Imaging: CT for 211, FDG PET/CT for 201, CT tumor segmentations for 144, semantic annotations for 190.
  • Molecular: clinical mutational testing for EGFR (n=206), KRAS (n=205), and ALK translocation status (n=196); RNA-seq for 130 subjects; Illumina HT-12 gene-expression microarrays for 26 subjects (17 also have RNA-seq).
  • Clinical: survival, recurrence, smoking history, pathological TNM stage (n=161), histopathological grade (n=162), adjuvant therapy, chemotherapy, radiation.
  • Dataset slug: nsclc-radiogenomics-stanford. TCIA DOI 10.7937/K9/TCIA.2017.7hs46erv; gene-expression microarray data at GEO accession GSE28827; RNA-seq data at GEO accession GSE103584.
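
Because each modality covers a different subset of the 211 subjects, most analyses start by intersecting subject identifiers across the data types they need. The sketch below shows one way to do that, assuming the per-modality ID sets have already been read from the TCIA and GEO metadata. The function name `complete_cases` and the toy IDs are hypothetical and are not part of the release.

```python
from typing import Dict, List, Set


def complete_cases(modalities: Dict[str, Set[str]], required: List[str]) -> Set[str]:
    """Return the subject IDs that appear in every required modality."""
    if not required:
        raise ValueError("at least one modality is required")
    ids = set(modalities[required[0]])
    for name in required[1:]:
        ids &= modalities[name]  # keep only subjects also present here
    return ids


# Hypothetical toy IDs for illustration only; real IDs come from the
# dataset's own metadata files.
toy = {
    "ct": {"S001", "S002", "S003", "S004"},
    "pet_ct": {"S001", "S002", "S004"},
    "rnaseq": {"S002", "S004", "S005"},
}

print(sorted(complete_cases(toy, ["ct", "pet_ct", "rnaseq"])))  # ['S002', 'S004']
```

Intersecting explicit ID sets, rather than assuming row order, guards against subjects that have imaging but no tissue-derived data.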

Key findings

This is a dataset/resource paper rather than a hypothesis-testing study; the primary contribution is the release itself. Specific design facts the dataset rests on:

  • 130 of 211 subjects have the full quartet of clinical, CT, PET/CT, and RNA-seq data; 116 have all data types except microarray (PMID:30325352).
  • All tumor samples were collected from treatment-naïve subjects during surgery, frozen within 30 minutes of excision, and later processed for RNA extraction.
  • CT acquisitions were heterogeneous (slice thickness 0.625–3 mm, median 1.5 mm; X-ray tube current 28–749 mA at 80–140 kVp), reflecting routine multi-institutional clinical care; the authors did not attempt harmonization because the scans were acquired over several years and not as part of a prospective trial.
  • PET/CT used CT-based attenuation correction with iterative OSEM reconstruction; FDG dose 138.90–572.25 MBq (mean 309.26 MBq) and uptake time 23.08–128.90 min (mean 66.58 min).
  • Semantic annotation template covers 28 nodule analysis features plus parenchymal features, encoded in AIM format using ePAD; one radiologist with >20 years of experience annotated all 190 subjects’ axial CTs.
  • RNA-seq was performed on Illumina HiSeq 2500 with TruSeq Total Stranded RNA + Ribo-Zero Reduction; reads aligned to hg19 with STAR v2.3 and quantified with Cufflinks v2.0.2 in FPKM. Samples with RNA integrity number (RIN) below 2.5 were excluded.
  • Mutational testing used SNaPshot multiplex PCR + dideoxy single-base extension covering EGFR exons 18–21 and KRAS exon 2 (codons 12 and 13); EML4-ALK rearrangement was detected by FISH.
  • The authors note prior published downstream uses of subsets of this data for predicting EGFR/KRAS mutation status from semantic CT features and for building radiogenomic maps linking semantic features to RNA-seq expression signatures.
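
For readers unfamiliar with the FPKM unit used for the RNA-seq expression calls, the sketch below computes the textbook definition: fragments mapped to a transcript, normalized per kilobase of transcript length and per million mapped fragments. This is only the unit definition. Cufflinks estimates transcript abundances with its own statistical model, so this is not a reimplementation of the authors' pipeline, and the function name and example numbers are illustrative assumptions.

```python
def fpkm(fragments: float, transcript_length_bp: float, total_mapped_fragments: float) -> float:
    """Textbook FPKM: fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 = 1e3 (bp -> kb) * 1e6 (fragments -> millions of fragments)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Illustrative numbers, not taken from the dataset:
# 500 fragments on a 2,000 bp transcript in a library of 20 million mapped fragments.
print(fpkm(500, 2000, 20_000_000))  # 12.5
```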

Genes & alterations

  • EGFR — clinical mutational status available for 206/211 subjects via SNaPshot PCR covering exons 18, 19, 20, and 21 (PMID:30325352). Used downstream by other studies to model EGFR mutation prediction from CT semantic features.
  • KRAS — mutational status available for 205/211 subjects; SNaPshot tested missense mutations at exon 2 codons 12 and 13.
  • ALK / EML4 — EML4-ALK translocation status available for 196/211 subjects; detected by fluorescence in situ hybridization (FISH).

The paper does not report cohort-level alteration frequencies; it provides per-subject mutation status as a data record.
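
Since only per-subject status is provided, cohort-level frequencies have to be tallied by the user. The sketch below counts mutant calls among tested subjects from a clinical table. The column name and the status labels ("Mutant", "Wildtype", "Unknown") are assumptions made for illustration and should be checked against the actual clinical file; the rows shown are invented.

```python
import csv
import io
from collections import Counter
from typing import Iterable, Mapping, Tuple


def mutation_frequency(
    rows: Iterable[Mapping[str, str]],
    column: str,
    mutant: str = "Mutant",      # assumed label, verify against the real file
    wildtype: str = "Wildtype",  # assumed label, verify against the real file
) -> Tuple[int, int, float]:
    """Return (mutant count, tested count, mutant fraction among tested)."""
    counts = Counter(row[column].strip() for row in rows)
    tested = counts[mutant] + counts[wildtype]  # untested/unknown are excluded
    fraction = counts[mutant] / tested if tested else float("nan")
    return counts[mutant], tested, fraction


# Invented toy table; column name is hypothetical.
toy_csv = """case_id,egfr_status
S001,Mutant
S002,Wildtype
S003,Wildtype
S004,Unknown
S005,Mutant
"""
rows = list(csv.DictReader(io.StringIO(toy_csv)))
print(mutation_frequency(rows, "egfr_status"))  # (2, 4, 0.5)
```

Excluding untested subjects from the denominator matters here because the number of tested subjects differs by gene (206, 205, and 196 for EGFR, KRAS, and ALK).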

Clinical implications

  • The dataset enables radiogenomic biomarker discovery in resected early-stage NSCLC by combining preoperative imaging with paired molecular profiling and survival outcomes, a combination the authors note is rare in publicly available NSCLC cohorts.
  • Survival, recurrence, recurrence date, recurrence location, and date-of-death fields are populated for ≥210/211 subjects, supporting time-to-event modeling of imaging and molecular biomarkers.
  • The release is intended to seed validation cohorts for prognostic image biomarkers and for evaluating reproducibility of radiomic feature pipelines across heterogeneous scanners.
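
As an illustration of the time-to-event modeling these fields support, the sketch below implements a basic Kaplan-Meier estimator in plain Python. It assumes follow-up times and event indicators have already been derived from the survival and date fields; the variable names and toy values are hypothetical, and a dedicated survival-analysis library would normally be preferable for real analyses.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct event time.

    events[i] is True if the event (e.g. death) was observed at times[i],
    False if the subject was censored at times[i].
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects sharing this time; censored subjects at t
        # still count as at risk at t.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Invented follow-up times (months) and event flags, for illustration only.
toy_times = [5, 8, 8, 12, 20]
toy_events = [True, True, False, True, False]
print([(t, round(s, 3)) for t, s in kaplan_meier(toy_times, toy_events)])
# [(5, 0.8), (8, 0.6), (12, 0.3)]
```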

Limitations & open questions

  • Acquisition heterogeneity: CT and PET/CT were acquired across multiple scanners and protocols over four years with no prospective harmonization; the authors explicitly flag that radiomic feature values are known to vary with acquisition and reconstruction parameters.
  • Cohort selection bias: only patients referred for surgical resection are included, which biases the cohort toward early-stage, operable disease; the dataset does not support generalization to advanced or unresectable NSCLC.
  • Sparse molecular coverage: RNA-seq, the deepest molecular layer, covers 130 of 211 subjects, and only 26 subjects have microarray expression data. Whole-exome and panel sequencing are not included; mutation data are limited to single-gene EGFR/KRAS/ALK clinical assays.
  • Single-reader semantic annotations: although two thoracic radiologists developed the template, a single radiologist produced all 190 annotation sets; the authors acknowledge that intra- and inter-reader variability is uncharacterized.
  • No cBioPortal study: the molecular data is hosted at GEO and the imaging at TCIA; there is no cBioPortal studyId, so cross-cohort molecular comparisons require manual ingestion.
  • Segmentation provenance: initial tumor segmentations came from an unpublished automatic algorithm before radiologist editing; reproducibility of the underlying algorithm is not documented.

Citations from this paper used in the wiki

  • “We developed a unique radiogenomic dataset from a Non-Small Cell Lung Cancer (NSCLC) cohort of 211 subjects.” (Abstract)
  • “Between 2008 and 2012, we collected clinical and imaging data for 211 subjects referred for surgical treatment and obtained tissue samples from the excised tumors, where available.” (Background and Summary)
  • “There were 172 adenocarcinomas and 35 squamous cell carcinomas and 4 not otherwise specified with grades ranging from poorly to well-differentiated.” (Methods — Subject Demographics and Clinical Data)
  • “EGFR, KRAS and ALK mutation status are available from clinical records in 206, 205, and 196 subjects, respectively. Single nucleotide mutation detection was performed using SNaPshot technology based on dideoxy single-base extension of oligonucleotide primers after multiplex polymerase chain reaction (PCR). Exons 18, 19, 20 and 21 were tested for EGFR mutations. Exon 2 Positions 12 and 13 were tested for missense KRAS mutations with amino acid substitution. … For ALK, EML4-ALK translocation detection test was performed using fluorescence in situ hybridization (FISH).” (Methods — Mutational testing)
  • “Based on availability and quality of available tissue, RNA sequencing was performed on samples from 130 subjects (17 of which intersect with the gene expression microarray dataset described in the previous section). We excluded RNASeq for tissue samples with RNA integrity number (RIN) below 2.5.” (Methods — RNA Sequencing Data)
  • “reads were aligned to the human genome (hg19) using the alignment algorithm STAR version 2.3 with 91 bases of splice junction overhangs. Next, Cufflinks version 2.0.2 was used to determine the expression calls in each sample using Fragments Per Kilobase of transcript per Million mapped reads (FPKM).” (Methods — RNA Sequencing Data)
  • “the imaging datasets reported here were acquired over several years and from several institutions, and not as part of a prospective trial. For these reasons there was no attempt to harmonize the acquisition and reconstruction protocols.” (Methods — CT and PET/CT acquisition protocols)

This page was processed by crosslinker on 2026-05-04.