Abstract
Deep learning (DL) supports automated brain tumor analysis on magnetic resonance imaging (MRI), but reported performance varies across scanners, protocols, cohorts, and annotation practice. This systematic review synthesizes peer-reviewed evidence on segmentation, classification, and detection using single- and multi-sequence MRI, with emphasis on model families, data preparation, evaluation design, and translation barriers. Literature identification followed PRISMA guidelines using structured queries across major bibliographic databases and digital libraries. The process applied predefined eligibility criteria, removed duplicates, screened titles and abstracts, and reviewed full texts. Study quality and risk of bias were assessed with QUADAS-2. The appraisal also checked data leakage risk, use of external validation, and reporting completeness. Data extraction captured imaging sequences, label provenance, preprocessing, augmentation, architecture, training objectives, validation strategy, and interpretability practice. The synthesis groups methods into convolutional encoder-decoder networks, attention-augmented U-Net variants, transformer-based vision backbones, graph neural networks, and hybrids combining global features with boundary refinement. The evidence base shows recurring limits in benchmark comparability, linked to heterogeneous labels and limited multicentre testing. Uncertainty and calibration are also incompletely reported, which complicates clinical interpretation. Cross-site evaluation frequently reveals domain shift, which motivates normalization procedures, adaptation objectives, and privacy-preserving training using federated optimization with secure aggregation. Interpretability reporting concentrates on saliency, attribution, and counterfactual analysis, with clinical utility depending on stability across perturbations and alignment with radiological priors.
These findings identify methodological gaps and reporting requirements for reproducible benchmarking and clinical evaluation. The synthesis also motivates research on self-supervised pretraining, longitudinal modelling, radiogenomic integration, and human-in-the-loop validation within routine workflows.