By Rodney LaLonde

Capsules for Biomedical Image Segmentation

The original paper can be found on arXiv.
The code is publicly available at my GitHub.

This work is a journal extension of the popular, non-archival SegCaps work and is currently under revision at the Medical Image Analysis journal.


Motivation and Challenges

The equivariant properties of capsule networks lend themselves nicely to segmentation, where precise spatial localization is critical. However, the initial CapsNet published by Sabour, Frosst, and Hinton in 2017 was extremely computationally expensive. Even for a small 6 × 6 grid of 32 eight-dimensional capsule types being routed to 10 sixteen-dimensional capsules for classification, there are nearly 1.5 million parameters. Since segmentation requires far larger input and output sizes (e.g. 512 × 512 pixels), the number of parameters quickly grows out of control, making it impossible to fit such models into memory.

512 × 512 × 32 × 8 × 512 × 512 × 10 × 16 = 2,814,749,767,106,560 parameters.
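
The arithmetic is easy to verify; a quick sanity check in Python, just reproducing the multiplications above:

```python
# Sanity check of the parameter counts quoted above (pure arithmetic).

# Original CapsNet: a 6 x 6 grid of 32 eight-dimensional capsule types,
# each routed to 10 sixteen-dimensional class capsules.
capsnet_params = 6 * 6 * 32 * 8 * 10 * 16
print(f"{capsnet_params:,}")  # 1,474,560 -- "nearly 1.5 million"

# The same fully-connected routing at segmentation scale, with a
# 512 x 512 input grid routed to a 512 x 512 output grid:
seg_scale_params = 512 * 512 * 32 * 8 * 512 * 512 * 10 * 16
print(f"{seg_scale_params:,}")  # 2,814,749,767,106,560 -- ~2.8 quadrillion
```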

To solve the memory burden, we introduce two important contributions. First, we propose a locally-constrained dynamic routing algorithm that routes information to each parent capsule only from a small local neighborhood of child capsules centered on that parent’s spatial location. Second, we propose to share transformation matrices across both spatial locations and child capsule types. Combined, these two contributions reduce the number of parameters in every capsule layer by a significant factor, cutting the 2.8 quadrillion parameters needed down to only 324 thousand.

Reduces parameters by a factor of C × H × W × (H/k) × (W/k): 2.8 quadrillion parameters cut down to 324,000 (with k = 5).

Locally-Constrained Dynamic Routing and Transformation Matrix Sharing

In Sabour et al. (2017), prediction vectors, û, are created from every child capsule, i, to every parent capsule, j.

In our proposed locally-constrained dynamic routing with transformation matrix sharing, prediction vectors, 𝒖̂, are only created from a 𝑘×𝑘 kernel of child capsules, 𝑖, within each child capsule type, centered at the (𝑥,𝑦) position of each parent capsule, 𝑗. Transformation matrices are shared across the spatial grid and child capsule types to create predictions for each parent capsule.
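
To make this concrete, here is a minimal PyTorch sketch of the prediction step (illustrative names and shapes, not the exact implementation from my GitHub): a single k × k kernel per parent capsule type, applied identically at every spatial position, and reused across all child capsule types by folding those types into the batch dimension.

```python
import torch
import torch.nn as nn

class LocallyConstrainedPredictions(nn.Module):
    """Forms prediction vectors u_hat from a k x k neighborhood of child
    capsules. The single convolution kernel is shared across all spatial
    positions, and folding child types into the batch dimension reuses it
    across child capsule types as well."""

    def __init__(self, in_dim, out_types, out_dim, k=5):
        super().__init__()
        self.out_types, self.out_dim = out_types, out_dim
        self.conv = nn.Conv2d(in_dim, out_types * out_dim,
                              kernel_size=k, padding=k // 2, bias=False)

    def forward(self, u):
        # u: (batch, child_types, in_dim, H, W) -- grid of child capsules
        b, c, a, h, w = u.shape
        x = u.reshape(b * c, a, h, w)   # share weights across child types
        p = self.conv(x)                # (b*c, out_types*out_dim, H, W)
        # u_hat: one out_dim-vector predicted for every (child type,
        # parent type, spatial position) triple; routing happens afterwards.
        return p.view(b, c, self.out_types, self.out_dim, h, w)

# Example: 32 child types of 8D capsules -> 10 parent types of 16D capsules
layer = LocallyConstrainedPredictions(in_dim=8, out_types=10, out_dim=16)
u_hat = layer(torch.randn(2, 32, 8, 64, 64))
print(u_hat.shape)  # torch.Size([2, 32, 10, 16, 64, 64])
```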


Reconstruction Regularization

Another component of the original CapsNet was a learned inverse mapping from capsule vectors back to the input, used as a form of regularization. To construct a similar regularization technique for segmentation, we modify the algorithm to reconstruct all input pixels belonging to the positive input class, using the ground truth during training and all capsules whose vector lengths are above a given threshold during testing. This learned inverse mapping forces the capsule vectors to learn the instantiation parameters of objects, while the main loss enforces learning discriminative classification features. This technique shares parallels with works such as VEEGAN, which learned an inverse mapping to overcome the mode-collapse issue in GANs.
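
A minimal sketch of how such a masked reconstruction objective could be computed during training (the function name and tensor layout are illustrative, not the paper’s exact implementation):

```python
import torch
import torch.nn.functional as F

def masked_reconstruction_loss(reconstruction, image, positive_mask):
    """Reconstruction regularizer for segmentation: only pixels belonging
    to the positive class contribute to the reconstruction error. During
    training, `positive_mask` is the ground-truth segmentation; at test
    time it would instead come from capsules whose vector lengths exceed
    a threshold."""
    target = image * positive_mask  # zero out non-object pixels
    return F.mse_loss(reconstruction * positive_mask, target)
```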



Baseline Segmentation Capsule Network

Putting these three novelties together, we can create our first capsule network for object segmentation, which serves as our baseline. You’ll notice the structure is kept virtually identical to CapsNet, with the dynamic routing swapped out for our locally-constrained routing with transformation matrix sharing, and the new reconstruction method added. The images shown as input and output are from the task of retinal vessel segmentation in fluorescein angiogram videos.


Loss of Global Information & Its Recovery

While we saw good performance on the retinal vessel segmentation task, locally constraining the dynamic routing introduces a major limitation. Segmentation as a task is really the joining of two separate tasks solved in unison: recognition and delineation. This is why most successful segmentation frameworks utilize encoder-decoder networks, obtaining global information for recognition and local information for delineation. When we constrained the dynamic routing, we lost the ability to capture global information at each layer. To solve this, we introduce a novel encoder-decoder style capsule architecture by creating “deconvolutional” capsules. These deconvolutional capsules operate in the same manner as our convolutional capsules, except their prediction vectors are formed using a transposed convolution operation. With these five novelties combined, namely locally-constrained dynamic routing, transformation matrix sharing, reconstruction regularization for segmentation, deconvolutional capsules, and our deep encoder-decoder capsule architecture, we produce the first capsule-based segmentation network in the literature, called SegCaps.
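
Mirroring the convolutional sketch above, a deconvolutional capsule layer could form its prediction vectors with a transposed convolution, upsampling the spatial grid on the decoder path (again an illustrative sketch under the same assumed tensor layout, not the exact implementation):

```python
import torch
import torch.nn as nn

class DeconvCapsulePredictions(nn.Module):
    """Deconvolutional capsule sketch: identical in spirit to the
    convolutional version, but prediction vectors are formed with a
    transposed convolution, doubling the spatial resolution."""

    def __init__(self, in_dim, out_types, out_dim):
        super().__init__()
        self.out_types, self.out_dim = out_types, out_dim
        # kernel_size=4, stride=2, padding=1 gives an exact 2x upsampling
        self.deconv = nn.ConvTranspose2d(in_dim, out_types * out_dim,
                                         kernel_size=4, stride=2,
                                         padding=1, bias=False)

    def forward(self, u):
        # u: (batch, child_types, in_dim, H, W)
        b, c, a, h, w = u.shape
        p = self.deconv(u.reshape(b * c, a, h, w))  # (b*c, T*B, 2H, 2W)
        return p.view(b, c, self.out_types, self.out_dim, 2 * h, 2 * w)
```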


Datasets and Experiments: Pathological Lung Segmentation

For this work, we performed the largest pathological lung segmentation study in the literature, combining five large-scale datasets from both clinical and pre-clinical subjects. With the goal of creating a general framework which performs consistently across many different lung diseases and even different anatomies, we conducted experiments on the following datasets: LIDC, containing lung cancer screening patients; LTRC, containing interstitial lung disease (specifically fibrotic) as well as COPD patients; UHG, containing 13 different forms of interstitial lung disease; JHU’s TBS, which looks at the effect of smoke inhalation on the development and progression of tuberculosis in mice; and JHU’s TB, which looks at the effect of two different experimental treatments on the development and progression of TB in mice.

It’s worth noting that preclinical image analysis is a particularly challenging area, with extremely limited training data due to a lack of expert interest in providing annotations (I myself spent over two months annotating data to complement the annotations provided by radiologists). Combine that with drastically different anatomy and extremely high levels of noise, and it quickly becomes a significantly more difficult task. No works on preclinical lung segmentation exist beyond Dr. Ulas Bagci’s 2015 TMI paper, a semi-automated, non-deep machine learning method.


For all experiments, we compared five methods: U-Net, the gold standard in biomedical image segmentation for the last several years; Tiramisu, a dense encoder-decoder extension of U-Net; P-HNN, the state-of-the-art method in pathological lung segmentation; our baseline SegCaps model; and the proposed SegCaps. Shown in the right-hand column is a comparison of the number of parameters used in each of these networks; our proposed SegCaps contains only 1.4 million parameters, less than 5% of those in the typical U-Net.


Pathological Lung Segmentation Results

These are the results of our experiments. We computed the Dice coefficient, which captures the global-level overlap of segmentations with the ground truth, and the Hausdorff distance, which captures the local-level accuracy of the segmentation boundaries. On both clinical and preclinical subjects, SegCaps consistently outperforms all other methods despite being only a small fraction of the size of these bigger networks. Since the lungs occupy such a large portion of the input space, the true comparison of accuracy between methods is best captured by the Hausdorff distance: where the Dice scores are fairly close for all methods, the Hausdorff distances are typically only closely clustered for the CNN-based methods, with SegCaps significantly outperforming them on most datasets.
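
For reference, both metrics can be sketched in a few lines with NumPy/SciPy (a simplified 2D version on binary masks; the actual evaluation may operate on 3D surface voxels with physical spacing taken into account):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_coefficient(pred, gt):
    """Global overlap: 2|P intersect G| / (|P| + |G|) for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    return 2.0 * np.logical_and(pred, gt).sum() / denom if denom else 1.0

def hausdorff_distance(pred, gt):
    """Boundary-level error: symmetric Hausdorff distance between the
    foreground point sets of two binary masks (assumes both non-empty)."""
    p, g = np.argwhere(pred), np.argwhere(gt)
    return max(directed_hausdorff(p, g)[0], directed_hausdorff(g, p)[0])
```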


Retinal Vessel Segmentation Results

Next we will look at some of our work on retinal vessel segmentation. These qualitative results demonstrate the superiority of our baseline capsule network over U-Net, particularly on thin or crowded vessels. Since this task doesn’t require global information, we chose not to use our encoder-decoder structure. In the results, cyan areas are under-segmented by the method, magenta areas are over-segmented, and white areas are correct segmentations. U-Net struggles with under-segmentation of thin vessel structures and over-segmentation in areas of crowded vessels. Overall, our capsule-based segmentation network achieved consistently better results than U-Net.


Can Capsules Really Generalize Better to Unseen Poses than CNNs?

Our last set of experiments examined the ability of SegCaps to generalize to unseen poses, an understudied but purported benefit inherent to capsule networks. We overfit U-Net and SegCaps to 100% training accuracy on a single image, then rotated or reflected that image and fed it to the networks for prediction. U-Net struggled to provide good segmentations, while SegCaps saw only a minor drop in performance. Shown in the chart at right, U-Net’s performance dropped nearly 15% while SegCaps only dropped around 4% on average. To guarantee a fair comparison and ensure U-Net’s significantly larger number of parameters had a chance to fit properly, we trained both networks for ten times the number of epochs past convergence. U-Net still struggled to handle these changes in viewpoint, showing nearly the same degree of performance drop, while SegCaps again handled them with relative ease, now showing almost no drop in performance.
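
The evaluation protocol itself is simple to sketch (a hypothetical helper: `predict` and `score` stand in for a trained model’s forward pass and the Dice metric, and the model is assumed to already be overfit on `image`):

```python
import numpy as np

def unseen_pose_drop(predict, score, image, gt):
    """Measure the average performance drop on rotated and reflected
    copies of the single training image, relative to the original pose."""
    baseline = score(predict(image), gt)
    transforms = [lambda x, k=k: np.rot90(x, k) for k in (1, 2, 3)]
    transforms += [np.fliplr, np.flipud]
    # Apply each transform to both the image and the ground truth so the
    # target stays aligned with the transformed input.
    scores = [score(predict(t(image)), t(gt)) for t in transforms]
    return baseline - float(np.mean(scores))
```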


Conclusions and Discussion

The proposed framework is the first use of the recently introduced capsule network architecture for object segmentation, and it expands that architecture in several significant ways. First, we modify the original dynamic routing algorithm to act locally when routing child capsules to parent capsules and to share transformation matrices across capsules within the same capsule type. These changes dramatically reduce the memory and parameter burden of the original capsule implementation and allow operation on large image sizes, whereas previous capsule networks were restricted to very small inputs. To compensate for the loss of global information, we introduce the concept of “deconvolutional capsules” and a deep convolutional-deconvolutional capsule architecture for pixel-level predictions of object labels. Finally, we extend the masked reconstruction of the target class as a regularization strategy to the segmentation problem.


Experimentally, SegCaps produces improved accuracy for lung segmentation on five datasets from clinical and pre-clinical subjects, in terms of Dice coefficient and Hausdorff distance, when compared with state-of-the-art networks U-Net (Ronneberger et al., 2015), Tiramisu (Jegou et al., 2017), and P-HNN (Harrison et al., 2017). More importantly, the proposed SegCaps architecture provides strong evidence that the capsule-based framework can more efficiently utilize network parameters, achieving higher predictive performance while using 95.4% fewer parameters than U-Net, 90.5% fewer than P-HNN, and 85.1% fewer than Tiramisu. To the best of our knowledge, this work represents the largest study in pathological lung segmentation, and the only showing results on pre-clinical subjects utilizing state-of-the-art deep learning methods.
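
Working backwards from those percentages and SegCaps’ 1.4 million parameters gives a rough sense of the baselines’ sizes (arithmetic on the numbers quoted above; the actual counts are likely rounded):

```python
# Implied baseline sizes from "X% fewer parameters than ..." (approximate).
segcaps_params = 1.4e6
for name, fewer in [("U-Net", 0.954), ("P-HNN", 0.905), ("Tiramisu", 0.851)]:
    implied = segcaps_params / (1 - fewer)
    print(f"{name}: ~{implied:,.0f} parameters")
# U-Net: ~30,434,783; P-HNN: ~14,736,842; Tiramisu: ~9,395,973
```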


To demonstrate the extended scope and potential impact of our study, we performed two additional sets of experiments in object segmentation: (1) segmenting retinal vessels, extremely thin tree-like structures, from retinal angiography video, and (2) testing the affine equivariance properties of SegCaps on natural images from PASCAL VOC. The results of these experiments, as well as the main body of our study, demonstrate the effectiveness of the proposed capsule-based segmentation framework. This study provides helpful insights for future capsule-based works and provides lung-field segmentation analysis on pre-clinical subjects for the first time in the literature.
