"Transformer在医学图像中的应用综述：研究报告"


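This report surveys recent progress in applying Transformers to medical imaging. Transformers have achieved unprecedented success in natural language processing and have since been applied to many computer vision problems with state-of-the-art results, prompting researchers to reassess the standing of convolutional neural networks (CNNs) as the de facto operator. In computer vision, CNNs have traditionally been used for tasks such as image classification, object detection, and image segmentation. With the arrival of Transformers, researchers began to explore their use in medical imaging. Medical image analysis faces many challenges because of the limitations of traditional methods, such as the need for hand-designed feature extractors and the reliance on limited datasets. Researchers have therefore looked for new approaches to these problems and found that Transformers hold great potential in this field. The self-attention mechanism of the Transformer can learn global dependencies across an image, which improves the accuracy of medical image analysis.

The survey first introduces the basic principles and structure of the Transformer. The Transformer is a deep learning model built on self-attention, whose core idea is to let every element of the input sequence interact with, and relate to, every other element. This allows the model to learn global dependencies in the input sequence and to achieve strong results across different tasks.

The survey then reviews important work that applies Transformers to medical imaging. First, it covers Transformer-based medical image classification, where Transformers help extract more discriminative features and thereby improve classification performance. Second, it covers Transformer-based object detection in medical images, where these methods detect abnormal regions more accurately and so support more accurate diagnosis. Finally, it covers Transformer-based medical image segmentation, where Transformers better separate the different tissues and structures in an image and so reveal the location and extent of disease more clearly.

Beyond the methods themselves, the survey summarizes open challenges and future research directions. First, because annotating medical images is expensive, the lack of large-scale labeled datasets limits the development of Transformers in medical imaging, so better data augmentation and transfer learning strategies are needed to make full use of limited data. Second, because Transformers have high computational complexity, improving their efficiency on medical image analysis tasks is a challenge, and faster, more lightweight Transformer models are needed to reach practical performance in real applications.

In summary, the Transformer is an emerging deep learning model that shows great potential in medical imaging. By learning global dependencies across an image, it can markedly improve performance on medical image analysis tasks. Many challenges remain, however, including insufficient labeled data and computational efficiency. Future research should concentrate on these areas to further advance the use of Transformers in medical imaging.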
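The self-attention idea summarized above, in which every element of the input sequence interacts with every other element, can be illustrated with a minimal sketch. It is not taken from the survey: it assumes a single attention head and, for brevity, uses the tokens themselves as queries, keys, and values, whereas a real Transformer layer learns separate projections. The names `softmax`, `self_attention`, and `tokens` are illustrative.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections); a trained layer would use learned matrices.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Score the query against every token in the sequence, itself included.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

# Three hypothetical 2-dimensional tokens (e.g. embedded image patches).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens)
```

Because each output row mixes information from all tokens, a single layer already has a global receptive field, which is the property the summary credits for the gains over purely local operators.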

Method              #Params    FLOPs      Dice score (avg.)
TransBTS [138]      33 M       333 G      84.99
BiTr-UNet [139]     -          -          86.20
UNETR [35]          102.5 M    193.5 G    84.51
nnFormer [144]      39.7 M     110.7 G    86.56
Swin UNETR [145]    61.98 M    394.84 G   88.97
VT-UNET-T [143]     5.4 M      52 G       86.82
VT-UNET-S [143]     11.8 M     100.8 G    87.00
VT-UNET-B [143]     20.8 M     165 G      88.07

Table 2: Segmentation results and parameters of various Transformer-based models on the 3D multimodal brain tumor BraTS 2021 dataset [140].
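For readers unfamiliar with the metric in the last column of Table 2, the sketch below computes the Dice score for one pair of binary masks. It is a generic illustration, not the evaluation code of any listed method; the flat-list mask format, the empty-mask convention, and the example masks are assumptions. The table values appear to be this ratio expressed as a percentage and averaged, presumably over tumor sub-regions.

```python
def dice_score(pred, target):
    """Dice = 2 * |P & T| / (|P| + |T|) for flat binary masks (0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Hypothetical six-voxel prediction and ground truth.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
score = dice_score(pred, target)  # 2 * 2 / (3 + 3)
```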

segmentation, which employs transformers to learn the contextual features across the spectral dimension. To discard the irrelevant spectral bands, they introduce a sparsity-based scheme [146]. Furthermore, they employ separate group normalization for each band to eliminate the interference caused by distribution mismatch among spectral images. Extensive experimentation on the hyperspectral pathology dataset, Cholangiocarcinoma [147], shows the effectiveness of SpecTr, as also shown in Fig. 7.
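The motivation for normalizing each spectral band separately can be seen in a simplified sketch. It is not SpecTr's implementation: it replaces group normalization (which also has learnable scale and shift parameters) with plain per-band standardization, and the function names and the two-band `cube` are hypothetical.

```python
import math

def normalize_band(band, eps=1e-5):
    """Standardize one spectral band (a flat list of pixel values)."""
    mean = sum(band) / len(band)
    var = sum((x - mean) ** 2 for x in band) / len(band)
    return [(x - mean) / math.sqrt(var + eps) for x in band]

def normalize_per_band(cube):
    """Normalize every band with its own statistics, never pooled ones."""
    return [normalize_band(band) for band in cube]

# Two hypothetical bands with very different intensity ranges.
cube = [[0.1, 0.2, 0.3, 0.4], [100.0, 200.0, 300.0, 400.0]]
normalized = normalize_per_band(cube)
```

With pooled statistics the second band would dominate the mean and variance; with per-band statistics both bands end up on a comparable scale, which is the distribution mismatch the text refers to.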

Breast Tumor Segmentation. Detection of breast cancer in the early stages can reduce the fatality rate by more than [148]. Therefore, automatic breast tumor detection is of immense importance to doctors. Recently, Zhu et al. [149] propose a region-aware transformer network (RAT-Net) to effectively fuse the breast tumor region information into multiple scales to obtain precise segmentation. Extensive experiments on a large ultrasound breast tumor segmentation dataset show that RAT-Net outperforms CNN and transformer-based baselines. Similarly, Liu et al. [150] propose a hybrid architecture consisting of transformer layers in the decoder part of 3D UNet [151] to effectively segment tumors from volumetric breast data.

3.2 Multi-organ Segmentation

Multi-organ segmentation aims to segment several organs simultaneously and is challenging due to inter-class imbalance and the varying sizes, shapes, and contrast of different organs. ViT models are particularly suitable for multi-organ segmentation due to their ability to effectively model global relations and differentiate multiple organs. We have categorized multi-organ segmentation approaches based on their architectural design, as these approaches do not consider any organ-specific aspect and generally focus on boosting performance by designing effective and efficient architectural modules [152]. We categorize them into Pure Transformer (only ViT layers) and Hybrid Architectures (both CNN and ViT layers).

3.2.1 Pure Transformers

Pure Transformer-based architectures consist of only ViT layers and have seen fewer applications in medical image segmentation than hybrid architectures, as both global and local information is crucial for dense prediction tasks like segmentation [96]. Recently, Karimi et al. [153] propose a pure Transformer-based model for 3D medical image segmentation by leveraging self-attention [17] between neighboring linear embeddings of 3D medical image patches. They also propose a method to effectively pre-train their model when only a few labeled images are available. Extensive experiments show the effectiveness of their convolution-free network on three benchmark 3D medical imaging datasets related to brain cortical plate [154], pancreas, and hippocampus.

Figure 8: Overview of the TransUNet architecture [96] proposed for multi-organ segmentation. It is one of the first transformer-based architectures proposed for medical image segmentation and combines the merits of both the transformer and UNet. It employs a hybrid CNN-Transformer architecture for the encoder, followed by multiple upsampling layers in the decoder to output the final segmentation mask. Image adapted from [96].

One of the drawbacks of using Pure Transformer-based models in segmentation is the quadratic complexity of self-attention with respect to the input image dimensions. This can hinder the applicability of ViTs to the segmentation of high-resolution medical images. To mitigate this issue, Cao et al. [125] propose Swin-UNet that, like Swin Transformer [126], computes self-attention within a local window and has linear computational complexity with respect to the input image. Swin-UNet also contains a patch expanding layer for upsampling the decoder’s feature maps and shows superior performance in recovering fine details compared to bilinear upsampling. Experiments on the Synapse and ACDC [155] datasets demonstrate the effectiveness of the Swin-UNet architectural design.
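The difference between quadratic and linear complexity can be made concrete by counting how many token pairs each scheme has to score. The sketch below is a back-of-the-envelope illustration, not code from Swin-UNet; the function names, the token grid sizes, and the window size of 7 are assumptions chosen for the example, and shifted windows are ignored.

```python
def global_attention_pairs(height, width):
    """Token pairs scored when every token attends to every token."""
    n = height * width
    return n * n

def window_attention_pairs(height, width, window):
    """Token pairs scored when attention stays inside non-overlapping
    window x window blocks (height and width assumed divisible by window)."""
    if height % window or width % window:
        raise ValueError("feature map must be divisible by the window size")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window ** 2

for side in (56, 112, 224):
    g = global_attention_pairs(side, side)
    w = window_attention_pairs(side, side, window=7)
    print(f"{side}x{side} tokens: global={g:,} windowed={w:,}")
```

Doubling the side length quadruples the number of tokens: the global count then grows sixteenfold, while the windowed count grows only fourfold, in step with the number of tokens.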

3.2.2 Hybrid Architectures

Hybrid architecture-based approaches combine the complementary strengths of Transformers and CNNs to effectively model global context and capture local features for accurate segmentation. We have further categorized these hybrid models into single and multi-scale approaches.

3.2.2.1 Single-Scale Architectures: These methods process the input image information at one scale only and have seen widespread application in medical image segmentation due to their low computational complexity compared to multi-scale architectures. We can sub-categorize single-scale architectures based on the position of the Transformer layers in the model. These sub-categories include Transformer in Encoder, Transformer between Encoder and Decoder, Transformer in Encoder and Decoder, and Transformer in Decoder.

Transformer in Encoder. Most initially developed Transformer-based medical image segmentation approaches have Transformer layers in the model’s encoder. The first work in this category is TransUNet [96], which consists of 12 Transformer layers in the encoder, as shown in Figure 8. These Transformer layers encode the tokenized image patches from the CNN layers. The resulting encoded features are upsampled via up-sampling layers in the decoder to output the final segmentation map. With skip connections incorporated, TransUNet sets new records (at the time of publication) on the Synapse multi-organ segmentation dataset [156] and the automated cardiac diagnosis challenge (ACDC) [155]. In other work, Zhang et al. propose TransFuse [157] to effectively fuse features from the Transformer and CNN layers via a BiFusion module. The BiFusion module leverages self-attention and a multi-modal fusion mechanism to selectively fuse the features. Extensive evaluation of TransFuse on multiple modalities (2D and 3D), including polyp segmentation, skin lesion segmentation, hip segmentation, and prostate segmentation, demonstrates its efficacy. Both TransUNet [96] and TransFuse [157] require pre-training on the ImageNet dataset [158] to effectively learn the positional encoding of the images. To learn this positional bias without any pre-training, Valanarasu et al. [128] propose a modified gated axial attention layer [159] that works well on small medical image segmentation datasets. Furthermore, to boost segmentation performance, they propose a Local-Global training scheme to focus on the fine details of input images. Extensive experimentation on brain anatomy segmentation [160], gland segmentation [161], and MoNuSeg (microscopy) [162] demonstrates the effectiveness of their proposed gated axial attention module.

In another work, Tang et al. [163] introduce Swin UNETR, a novel self-supervised learning framework with proxy tasks to pre-train the Transformer encoder on 5,050 images of a CT dataset. They validate the effectiveness of pre-training by fine-tuning the Transformer encoder with a CNN-based decoder on the downstream task of the MSD and BTCV segmentation datasets. Similarly, Sobirov et al. [164] show that transformer-based models can achieve results comparable to state-of-the-art CNN-based approaches on the task of head and neck tumor segmentation. A few works have also investigated the effectiveness of Transformer layers by integrating them into the encoder of UNet-based architectures in a plug-and-play manner. For instance, Cheng et al. [165] propose TransClaw UNet by integrating Transformer layers in the encoding part of the Claw UNet [166] to exploit multi-scale information. TransClaw UNet achieves an absolute gain of 0.6 in dice score compared to Claw UNet on the Synapse multi-organ segmentation dataset and shows excellent generalization. Similarly, inspired by LeViT [167], Xu et al. [168] propose LeViT-UNet, which aims to optimize the trade-off between accuracy and efficiency. LeViT-UNet is a multi-stage architecture that demonstrates good performance and generalization ability on the Synapse and ACDC benchmarks.

Transformer between Encoder and Decoder. In this category, Transformer layers are placed between the encoder and decoder of a U-shaped architecture. These architectures are more suitable for avoiding the loss of details during down-sampling in the encoder layers. The first work in this category is TransAttUNet [169], which leverages guided attention and multi-scale skip connections to enhance the flexibility of the traditional UNet. Specifically, a robust self-aware attention module has been embedded between the encoder and decoder of UNet to concurrently exploit the expressive abilities of global spatial attention and transformer self-attention. Extensive experiments on five benchmark medical imaging segmentation datasets demonstrate the effectiveness of the TransAttUNet architecture. Similarly, Yan et al. [170] propose Axial Fusion Transformer UNet (AFTer-UNet), which contains a computationally efficient axial fusion layer between encoder and decoder to effectively fuse inter- and intra-slice information for 3D medical image segmentation. Experimentation on the BCV [171], Thorax-85 [172], and SegTHOR [173] datasets demonstrates the effectiveness of their proposed fusion layer.

Figure 9: Overview of the interleaved encoder of not-another transFormer (nnFormer) [144] for volumetric medical image segmentation. Note that convolution and transformer layers are interleaved to give full play to their strengths. Image taken from [144]. Stages labeled in the figure: convolutional embedding, transformer blocks, convolutional down-sampling, transformer blocks. Annotations in the figure: precise spatial encoding; high-resolution low-level features; incorporating long-term dependencies into high-level features; modeling object concepts from high-level features at multiple scales.

Transformer in Encoder and Decoder. A few works integrate Transformer layers in both the encoder and decoder of a U-shaped architecture to better exploit the global context for medical image segmentation. The first work in this category is UTNet, which efficiently reduces the complexity of the self-attention mechanism from quadratic to linear [174]. Furthermore, to model the image content effectively, UTNet exploits two-dimensional relative position encoding [20]. Experiments show the strong generalization ability of UTNet on the multi-label and multi-vendor cardiac MRI challenge dataset cohort [175]. Similarly, to optimally combine convolution and transformer layers for medical image segmentation, Zhou et al. [144] propose nnFormer, an interleaved encoder-decoder based architecture, where the convolution layers encode precise spatial information and the Transformer layers encode global context, as shown in Fig. 9. As in Swin Transformers [126], the self-attention in nnFormer is computed within a local window to reduce the computational complexity. Moreover, deep supervision in the decoder layers is employed to enhance performance. Experiments on the ACDC and Synapse datasets show that nnFormer surpasses Swin-UNet [125] (a transformer-based medical segmentation approach) by over (dice score) on the Synapse dataset. In other work, Lin et al. propose Dual Swin Transformer UNet (DS-TransUNet) [176] to incorporate the advantages of the Swin Transformer in a U-shaped architecture for medical image segmentation. They split the input image into non-overlapping patches at two scales and feed them into the two Swin Transformer-based branches of the encoder. A novel Transformer Interactive Fusion module is proposed to build long-range dependencies between features of different scales in the encoder. DS-TransUNet outperforms CNN-based methods on four standard datasets related to polyp segmentation, ISIC 2018, GLAS, and Data Science Bowl 2018.
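Splitting an image into non-overlapping patches at two scales, as described for the dual-branch encoder above, can be sketched as follows. This is a toy illustration rather than DS-TransUNet's code: the function name, the 8x8 single-channel image, and the patch sizes 2 and 4 are hypothetical, and the linear embedding that normally follows is omitted.

```python
def split_into_patches(image, patch):
    """Split a 2D image (list of rows) into non-overlapping patch x patch
    blocks, returned in row-major order as flattened token vectors."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens

# Hypothetical 8x8 single-channel image with pixel value = row * 8 + col.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
fine_tokens = split_into_patches(image, patch=2)    # 16 tokens of length 4
coarse_tokens = split_into_patches(image, patch=4)  # 4 tokens of length 16
```

The fine branch sees many small tokens that preserve detail, while the coarse branch sees fewer, larger tokens that each cover more context; a fusion module then has to relate the two token sets.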

Transformer in Decoder. Li et al. [177] investigate the use of a Transformer as an upsampling block in the decoder of the UNet for medical image segmentation. Specifically, they adopt a window-based self-attention mechanism to better complement the upsampled feature maps while maintaining
