Detecting Intestinal Obstruction with MedViT

Executive Summary: Identifying gastrointestinal conditions such as obstructions from abdominal Computed Radiography (CR) images requires an understanding of both subtle local cues and broad global patterns. This post explores the implementation of MedViT (Medical Vision Transformer), a hybrid architecture designed to mitigate both the locality bias of pure Convolutional Neural Networks (CNNs) and the computational cost of Vision Transformers (ViTs). In our comparison, MedViT achieved the highest validation accuracy, precision, and F1-score of the models tested, and its Grad-CAM maps highlight clinically plausible regions, which makes it a promising framework for high-stakes medical decision-making.

Introduction

This week, we began working with abdominal CR (Computed Radiography) images to classify gastrointestinal conditions such as small bowel obstruction, large bowel obstruction, and ileus. While this may initially resemble a conventional image classification task, the problem quickly reveals a deeper level of complexity.

Unlike natural images, abdominal radiographs are inherently difficult to interpret. Even experienced clinicians rely on subtle visual cues and global patterns rather than obvious localized features. Structures overlap significantly, soft tissue contrast is limited, and diagnostically relevant signals—such as gas distribution—are often faint and diffuse. More importantly, identifying an obstruction is not simply a matter of detecting a single abnormal region. Instead, it requires understanding how patterns are distributed across the entire abdominal cavity. This makes the task fundamentally different from typical object recognition problems.

This observation leads to a key question: which type of deep learning architecture can effectively capture both fine-grained local features and long-range global dependencies?

The CNN vs. Transformer Dilemma

Deep learning has significantly transformed medical image analysis, enabling automated detection of diseases across various imaging modalities. However, different model families come with inherent trade-offs.

Convolutional Neural Networks (CNNs) like ResNet, DenseNet, and EfficientNet have long dominated computer vision. Their success stems from their ability to learn hierarchical feature representations efficiently. They excel at detecting edges, textures, and localized structures, making them suitable for many medical imaging tasks. However, a key limitation of CNNs is their locality bias. Since convolutions operate on local neighborhoods, capturing global relationships requires stacking many layers, which is not always sufficient. In the context of abdominal imaging, this becomes problematic; for example, obstruction is often inferred from the overall distribution of gas rather than a single localized feature, and ileus may present as a pattern spanning multiple abdominal regions.
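To make the locality bias concrete, the sketch below computes the theoretical receptive field of a stack of convolutions using the standard recurrence (each layer adds (kernel - 1) times the current stride product). The 224-pixel input size and the 3x3 kernels are illustrative assumptions, not values from our experiments.

```python
def receptive_field(layers):
    """Theoretical receptive field (in input pixels) of stacked conv layers.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    """
    rf, jump = 1, 1  # jump = spacing, in input pixels, between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


def layers_to_cover(image_size, kernel=3, stride=1):
    """Number of identical conv layers needed for the receptive field to span the image."""
    count, rf, jump = 0, 1, 1
    while rf < image_size:
        rf += (kernel - 1) * jump
        jump *= stride
        count += 1
    return count


print(receptive_field([(3, 1)] * 10))    # 21
print(layers_to_cover(224))              # 112 stride-1 layers
print(layers_to_cover(224, stride=2))    # 7 stride-2 layers
```

Downsampling shortens the path considerably, but the effective receptive field is generally reported to be smaller than this theoretical bound, which is one reason depth alone may not give a CNN a reliable view of the whole abdomen.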

Vision Transformers (ViTs) address the limitations of CNNs by introducing self-attention mechanisms that allow every part of the image to interact with every other part. This provides strong global context modeling. However, the cost of self-attention grows quadratically with the number of image patches (tokens), so it rises steeply with image resolution. Transformers also typically require large amounts of training data and can be sensitive to noise and perturbations. Therefore, while ViTs are powerful, they are not always ideal when used in isolation, especially in medical imaging settings where data is limited.
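The quadratic growth is easy to see with a little arithmetic. The sketch below assumes a plain ViT with 16x16 patches (a common default, not a setting taken from this post) and counts the entries of the attention matrix for one head.

```python
def attention_cost(image_size, patch_size=16):
    """Return (num_tokens, attention_matrix_entries) for a square image."""
    tokens_per_side = image_size // patch_size
    num_tokens = tokens_per_side ** 2
    return num_tokens, num_tokens ** 2


for size in (224, 448, 896):
    tokens, entries = attention_cost(size)
    print(f"{size}px -> {tokens} tokens, {entries:,} attention entries per head")
# 224px -> 196 tokens, 38,416 attention entries per head
# 448px -> 784 tokens, 614,656 attention entries per head
# 896px -> 3136 tokens, 9,834,496 attention entries per head
```

Doubling the side length quadruples the token count and multiplies the attention matrix by sixteen, which matters for radiographs that are usually acquired at high resolution.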

MedViT: A Hybrid Architecture

To address these challenges, MedViT (Medical Vision Transformer) introduces a hybrid architecture that combines the strengths of CNNs and Transformers. The goal is to preserve local feature extraction capabilities while enabling efficient global context modeling. This design is particularly well-suited for abdominal obstruction detection, where diagnostic patterns emerge from both localized structures and their global arrangement.

Overall architecture of MedViT, integrating convolutional and transformer-based components.

As illustrated above, the architecture alternates between convolutional and transformer-inspired blocks, enabling a balanced representation of spatial and contextual information. MedViT bridges the gap by combining:

CNN-based local feature extraction
Transformer-based global attention
Efficient attention mechanisms to reduce computation
Robust training strategies tailored for medical data

Key Components

Efficient Convolution Block (ECB): Focuses on extracting local spatial features while maintaining computational efficiency. It incorporates Multi-Head Convolutional Attention (MHCA) and a Local Feed-Forward Network (LFFN). ECB preserves the inductive bias of CNNs while enhancing feature representation, resulting in improved robustness and reduced computational cost.
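One way to see where the reduced cost can come from is to count weights. The sketch below assumes that MHCA is realized as a grouped 3x3 convolution (one group per head) followed by a 1x1 projection that mixes the heads; that layout, the channel count, and the head size are assumptions for illustration and may differ from the configuration we actually trained.

```python
def conv_params(in_channels, out_channels, kernel=3, groups=1):
    """Weight count of a 2D convolution without bias."""
    assert in_channels % groups == 0 and out_channels % groups == 0
    return (in_channels // groups) * out_channels * kernel * kernel


channels, head_dim = 96, 32            # hypothetical sizes
heads = channels // head_dim           # 3 convolutional "heads"

dense = conv_params(channels, channels)                    # ordinary 3x3 conv
grouped = conv_params(channels, channels, groups=heads)    # per-head 3x3 conv
pointwise = conv_params(channels, channels, kernel=1)      # 1x1 mixing across heads

print(dense)                 # 82944
print(grouped + pointwise)   # 36864
```

Under these assumptions the head-wise convolution plus projection uses well under half the weights of a dense 3x3 convolution while keeping the same spatial inductive bias.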

Structure of the Efficient Convolution Block (ECB).

Local Transformer Block (LTB): Responsible for modeling global dependencies. It employs Efficient Self-Attention (ESA) along with feature fusion mechanisms and feed-forward layers. This allows the model to reason about global patterns across the entire image while maintaining efficiency.
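A common way to make self-attention cheaper is to shrink the set of keys and values while keeping one query per position. The NumPy sketch below assumes average pooling as the reduction and omits the learned projections and multiple heads, so it illustrates the idea behind efficient attention rather than reproducing the exact ESA module.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def avg_pool_tokens(tokens, height, width, stride):
    """Average-pool a row-major (H*W, C) token grid to (H/stride * W/stride, C)."""
    channels = tokens.shape[1]
    grid = tokens.reshape(height // stride, stride, width // stride, stride, channels)
    return grid.mean(axis=(1, 3)).reshape(-1, channels)


def reduced_attention(tokens, height, width, stride=2):
    """Single-head attention with spatially reduced keys and values."""
    queries = tokens                                         # one query per position
    keys_values = avg_pool_tokens(tokens, height, width, stride)
    scores = queries @ keys_values.T / np.sqrt(tokens.shape[1])
    weights = softmax(scores, axis=-1)                       # (H*W, H*W / stride^2)
    return weights @ keys_values, weights


rng = np.random.default_rng(0)
h = w = 8
x = rng.standard_normal((h * w, 32))
out, attn = reduced_attention(x, h, w, stride=2)
print(out.shape, attn.shape)  # (64, 32) (64, 16)
```

Every position still attends over the whole image, but the attention matrix has 64 x 16 entries instead of 64 x 64, a reduction by the square of the stride.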

Structure of the Local Transformer Block (LTB).

Patch Momentum Changer (PMC): A feature-level augmentation technique designed to improve generalization by combining feature representations from different samples. This approach encourages the model to rely less on specific local cues and more on global patterns.
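The sketch below shows one plausible form of such feature-level mixing: each sample is normalized and then given the mean and standard deviation of another sample in the batch, with the labels interpolated accordingly. The function name `exchange_moments`, the choice of per-position moments over channels, and the mixing weight `lam` are all assumptions for illustration; the actual PMC implementation may differ in each of these.

```python
import numpy as np


def exchange_moments(features, labels, lam=0.9, eps=1e-5, rng=None):
    """Swap feature moments between samples in a batch (illustrative sketch).

    features: (batch, channels, positions); labels: (batch, classes), one-hot.
    Returns the mixed features, the mixed labels, and the permutation used.
    """
    rng = np.random.default_rng() if rng is None else rng
    perm = rng.permutation(features.shape[0])        # partner sample for each item

    mean = features.mean(axis=1, keepdims=True)      # moments over channels
    std = features.std(axis=1, keepdims=True) + eps
    normalized = (features - mean) / std             # strip each sample's own moments

    mixed = normalized * std[perm] + mean[perm]      # inject the partner's moments
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed, mixed_labels, perm


rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8, 5))
onehot = np.eye(3)[[0, 1, 2, 0]]
mixed, mixed_labels, perm = exchange_moments(feats, onehot, rng=rng)
print(mixed.shape, mixed_labels.shape)  # (4, 8, 5) (4, 3)
```

Because the spatial layout comes from one sample and the feature statistics from another, the network is nudged away from relying on any single sample's local intensity cues.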

Performance and Interpretability

The quantitative results show that MedViT achieves the highest validation accuracy (0.9183), precision (0.9462), and F1-score (0.9029) of the models compared. ResNeXt50 obtains a lower validation loss (0.3910) and a slightly higher recall (0.9058 versus 0.8936), and EfficientNetB0 also reaches a lower validation loss (0.4171), but neither matches MedViT on accuracy, precision, or F1-score. This balance across metrics is particularly important in medical decision-making scenarios, where both missed findings and false alarms carry a cost.

Model	Val Loss	Val Acc	Precision	Recall	F1
MedViT	0.4358	0.9183	0.9462	0.8936	0.9029
UNet	0.4471	0.9135	0.8397	0.7929	0.8016
ResNeXt50	0.3910	0.9038	0.9026	0.9058	0.8883
EfficientNetB0	0.4171	0.8894	0.8833	0.8705	0.8605
ResNet18	0.4531	0.8798	0.8654	0.8788	0.8508
UMamba	0.4856	0.8798	0.8705	0.8282	0.8326
SwinUMamba	0.4958	0.8750	0.8429	0.8595	0.8421
ResUNet++	0.5147	0.8622	0.8782	0.8372	0.8363

Model performance comparison.
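Several rows (for example EfficientNetB0 and ResNet18) report an F1 below both precision and recall. That cannot happen for a single class, but it is expected when each metric is averaged over classes. The averaging scheme is not stated above, so the sketch below assumes macro-averaging and uses a made-up confusion matrix purely to illustrate the effect.

```python
def macro_metrics(confusion):
    """Macro-averaged precision, recall, and F1.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        predicted = sum(confusion[i][k] for i in range(n))  # column sum
        actual = sum(confusion[k])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy two-class example (not our data): class 0 is under-called, class 1 over-called.
precision, recall, f1 = macro_metrics([[5, 5], [0, 10]])
print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.8333 0.75 0.7333
```

Per-class F1 penalizes the imbalance between precision and recall within each class, so the macro F1 can fall below both macro precision and macro recall.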

Beyond numerical evaluation, it is also critical to understand how the model makes its predictions. For this purpose, we employ Grad-CAM (Gradient-weighted Class Activation Mapping) to visualize the regions that contribute most to the model's decision.

Grad-CAM visualization for MedViT highlighting diagnostically relevant regions.

As shown in the Grad-CAM visualization, the model focuses on meaningful anatomical regions rather than irrelevant background areas. The highlighted regions correspond to gas distribution patterns and abdominal structures that are clinically associated with obstruction. This behavior suggests that MedViT does not rely on spurious correlations but instead learns medically relevant features. Such interpretability is essential for building trust in AI-assisted diagnosis, especially in high-stakes clinical environments.
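For readers unfamiliar with how these maps are produced, the core Grad-CAM computation is short. The sketch below assumes the activations of one convolutional layer and the gradients of the target class score with respect to them have already been extracted (for instance with framework hooks); the toy arrays stand in for real model outputs.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Grad-CAM heatmap from one layer.

    activations, gradients: arrays of shape (channels, H, W), where gradients
    holds d(class score) / d(activations).
    """
    weights = gradients.mean(axis=(1, 2))              # average gradient per channel
    cam = np.tensordot(weights, activations, axes=1)   # weighted channel sum -> (H, W)
    cam = np.maximum(cam, 0.0)                         # keep only positive evidence
    peak = cam.max()
    return cam / peak if peak > 0 else cam             # scale to [0, 1]


# Toy input: channel 0 fires in the center and supports the class,
# channel 1 fires in a corner and counts against it.
acts = np.zeros((2, 4, 4))
acts[0, 1:3, 1:3] = 1.0
acts[1, 0, 0] = 1.0
grads = np.stack([np.full((4, 4), 0.5), np.full((4, 4), -0.5)])

heatmap = grad_cam(acts, grads)
print(heatmap[1, 1], heatmap[0, 0])  # 1.0 0.0
```

In practice the low-resolution map is upsampled to the radiograph size and overlaid on the image, which is how a visualization like the one above is typically rendered.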

Conclusion

In our experiments, MedViT outperformed the CNN- and Mamba-based baselines on accuracy, precision, and F1-score, which supports the case for hybrid architectures in medical imaging tasks. By combining local feature extraction with global reasoning, it provides a more comprehensive understanding of complex visual patterns. In the context of abdominal obstruction detection, where the diagnosis depends on how gas patterns are distributed across the whole abdomen, such a design is well matched to the problem and a strong candidate for achieving reliable and clinically meaningful results.