ClinicalBERT

Overview

ClinicalBERT is a BERT-based transformer model pre-trained on clinical notes (originally MIMIC-III discharge summaries) and then fine-tuned for clinical information extraction tasks. It applies standard bidirectional self-attention to input sequences of at most 512 tokens and is used to classify clinical text or extract structured entities from it, including tumor-site annotations and binary labels relevant to cancer staging and treatment status.
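Because the model reads at most 512 tokens at a time, longer notes have to be truncated or split before inference. The following is a minimal sketch of one way to split a tokenized note into windows that fit that limit. The function name `split_into_windows`, the `overlap` parameter, and the special-token ids are assumptions made for illustration; the source does not describe how any particular pipeline handles long notes.

```python
from typing import List

MAX_LEN = 512   # BERT input limit, counting the two special tokens
CLS_ID = 101    # assumed [CLS] id; read the real value from your tokenizer
SEP_ID = 102    # assumed [SEP] id; read the real value from your tokenizer


def split_into_windows(token_ids: List[int],
                       max_len: int = MAX_LEN,
                       overlap: int = 0) -> List[List[int]]:
    """Split token ids into windows of at most `max_len` tokens.

    Each window is wrapped in [CLS] ... [SEP], so it carries at most
    `max_len - 2` tokens of the note. `overlap` is the number of note
    tokens shared between consecutive windows.
    """
    body = max_len - 2
    if not 0 <= overlap < body:
        raise ValueError("overlap must be in [0, max_len - 2)")
    step = body - overlap
    windows: List[List[int]] = []
    start = 0
    while True:
        chunk = token_ids[start:start + body]
        windows.append([CLS_ID] + chunk + [SEP_ID])
        if start + body >= len(token_ids):
            break  # this window reached the end of the note
        start += step
    return windows
```

Each window would then be scored separately, and the per-window outputs combined (for example by taking the maximum score per label); that aggregation step is likewise an assumption rather than something the source specifies.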

Used by

Applied in the MSK-CHORD clinicogenomic harmonization pipeline at Memorial Sloan Kettering Cancer Center to classify tumor sites (a multi-label task with ten binary outputs) from the clinical and radiology notes of 24,950 patients spanning five cancer types PMID:39506116.

Notes

BERT-base architecture pre-trained on clinical text and fine-tuned for downstream tasks; accepts at most 512 tokens per input sequence, so longer documents must be truncated or split.
Used for multi-label tumor-site extraction (ten binary outputs) in the MSK-CHORD pipeline PMID:39506116.
NLP-derived tumor-site annotations (from ClinicalBERT) outperformed ICD billing codes for patient-level metastatic-site annotation, with precision/recall improvements of 0.03–0.32 PMID:39506116.
Used alongside clinical-longformer, RoBERTa, and BERT-base-uncased in the same pipeline PMID:39506116.
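To make the "ten binary outputs" and the precision/recall comparison concrete, here is a minimal sketch of multi-label decoding followed by a set-based precision/recall calculation. The label names `site_0` to `site_9`, the 0.5 threshold, and the helper names are placeholders chosen for illustration; the source lists neither the ten tumor sites nor the decision threshold, and this is not the pipeline's own code.

```python
import math
from typing import Sequence, Set, Tuple

# Placeholder label names; the actual ten tumor sites are not given here.
SITE_LABELS = [f"site_{i}" for i in range(10)]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_sites(logits: Sequence[float], threshold: float = 0.5) -> Set[str]:
    """Turn ten independent logits into a set of predicted sites.

    Each output is thresholded on its own (multi-label), so a note can be
    assigned zero, one, or several sites.
    """
    if len(logits) != len(SITE_LABELS):
        raise ValueError("expected one logit per site label")
    return {label for label, z in zip(SITE_LABELS, logits)
            if sigmoid(z) >= threshold}


def precision_recall(predicted: Set[str],
                     reference: Set[str]) -> Tuple[float, float]:
    """Precision and recall of predicted sites against a reference set."""
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    return precision, recall
```

For example, `decode_sites([3.0, -2.0, 0.0] + [-5.0] * 7)` selects `site_0` and `site_2`, because only those logits map to a probability of at least 0.5. Running `precision_recall` once on model-derived sites and once on billing-code-derived sites against the same curated reference is one way to frame the kind of comparison described above.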

Sources

PMID:39506116 — Jee et al. used ClinicalBERT for multi-label tumor-site extraction (ten binary outputs) from clinical and radiology notes in the MSK-CHORD dataset (24,950 patients); in the random survival forest models, NLP-derived tumor sites were the most prognostic single modality for overall survival (OS) in stage-IV patients across all five cancer types.
